Why is there no VACUUM FULL CONCURRENTLY?
Hi
I have one question: what is blocking the implementation of some variant of
VACUUM FULL that works like REINDEX CONCURRENTLY? Why can't a mechanism
similar to REINDEX CONCURRENTLY be used for VACUUM FULL?
Regards
Pavel
On Tue, Jan 30, 2024 at 09:01:57AM +0100, Pavel Stehule wrote:
> I have one question: what is blocking the implementation of some variant of
> VACUUM FULL that works like REINDEX CONCURRENTLY? Why can't a mechanism
> similar to REINDEX CONCURRENTLY be used for VACUUM FULL?
You may be interested in these threads:
/messages/by-id/CAB7nPqTGmNUFi+W6F1iwmf7J-o6sY+xxo6Yb=mkUVYT-CG-B5A@mail.gmail.com
/messages/by-id/CAB7nPqTys6JUQDxUczbJb0BNW0kPrW8WdZuk11KaxQq6o98PJg@mail.gmail.com
VACUUM FULL is CLUSTER under the hood. One may question whether it
is still a relevant discussion these days if we assume that autovacuum
is able to keep up, because it always keeps up with the house cleanup,
right? ;)
More seriously, we have a lot more options these days with VACUUM like
PARALLEL, so CONCURRENTLY may still have some uses, but the new toys
available may have changed things. So, would it be worth the
complexities around heap manipulations that lower locks would require?
--
Michael
On Tue, Jan 30, 2024 at 9:14 AM Michael Paquier <michael@paquier.xyz>
wrote:
> On Tue, Jan 30, 2024 at 09:01:57AM +0100, Pavel Stehule wrote:
> > I have one question: what is blocking the implementation of some variant
> > of VACUUM FULL that works like REINDEX CONCURRENTLY? Why can't a
> > mechanism similar to REINDEX CONCURRENTLY be used for VACUUM FULL?
>
> You may be interested in these threads:
> /messages/by-id/CAB7nPqTGmNUFi+W6F1iwmf7J-o6sY+xxo6Yb=mkUVYT-CG-B5A@mail.gmail.com
> /messages/by-id/CAB7nPqTys6JUQDxUczbJb0BNW0kPrW8WdZuk11KaxQq6o98PJg@mail.gmail.com
>
> VACUUM FULL is CLUSTER under the hood. One may question whether it
> is still a relevant discussion these days if we assume that autovacuum
> is able to keep up, because it always keeps up with the house cleanup,
> right? ;)
>
> More seriously, we have a lot more options these days with VACUUM like
> PARALLEL, so CONCURRENTLY may still have some uses, but the new toys
> available may have changed things. So, would it be worth the
> complexities around heap manipulations that lower locks would require?
One of my customers is reducing a table from 140GB to 20GB today; now he
is able to run archiving. He can use pg_repack, and it works well today,
but I ask myself whether what pg_repack does would really be hard to do
internally, since similar work already had to be done for REINDEX
CONCURRENTLY. This is not a common task, and it will not become one, but on
the other hand it could be a nice feature to have, and maybe not too hard
to implement today. But I didn't try it.

I'll read the threads.

Pavel
On 2024-Jan-30, Pavel Stehule wrote:
> One of my customers is reducing a table from 140GB to 20GB today; now he
> is able to run archiving. He can use pg_repack, and it works well today,
> but I ask myself whether what pg_repack does would really be hard to do
> internally, since similar work already had to be done for REINDEX
> CONCURRENTLY. This is not a common task, and it will not become one, but
> on the other hand it could be a nice feature to have, and maybe not too
> hard to implement today. But I didn't try it.
FWIW a newer, more modern and more trustworthy alternative to pg_repack
is pg_squeeze, which I discovered almost by chance, and soon found I
liked much more.
So thinking about your question, I think it might be possible to
integrate a tool that works like pg_squeeze, such that it runs when
VACUUM is invoked -- either under some new option, or just replace the
code under FULL, not sure. If the Cybertec people allow it, we could
just grab the pg_squeeze code and add it to the things that VACUUM can
run.
Now, pg_squeeze has some additional features, such as periodic
"squeezing" of tables. In a first attempt, for simplicity, I would
leave that stuff out and just allow it to run from the user invoking it,
and then have the command to do a single run. (The scheduling features
could be added later, or somehow integrated into autovacuum, or maybe
something else.)
--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"We're here to devour each other alive" (Hobbes)
On Tue, Jan 30, 2024 at 11:31 AM Alvaro Herrera <alvherre@alvh.no-ip.org>
wrote:
> On 2024-Jan-30, Pavel Stehule wrote:
> > One of my customers is reducing a table from 140GB to 20GB today; now he
> > is able to run archiving. He can use pg_repack, and it works well today,
> > but I ask myself whether what pg_repack does would really be hard to do
> > internally, since similar work already had to be done for REINDEX
> > CONCURRENTLY. This is not a common task, and it will not become one, but
> > on the other hand it could be a nice feature to have, and maybe not too
> > hard to implement today. But I didn't try it.
>
> FWIW a newer, more modern and more trustworthy alternative to pg_repack
> is pg_squeeze, which I discovered almost by chance, and soon found I
> liked much more.
>
> So thinking about your question, I think it might be possible to
> integrate a tool that works like pg_squeeze, such that it runs when
> VACUUM is invoked -- either under some new option, or just replace the
> code under FULL, not sure. If the Cybertec people allow it, we could
> just grab the pg_squeeze code and add it to the things that VACUUM can
> run.
>
> Now, pg_squeeze has some additional features, such as periodic
> "squeezing" of tables. In a first attempt, for simplicity, I would
> leave that stuff out and just allow it to run from the user invoking it,
> and then have the command to do a single run. (The scheduling features
> could be added later, or somehow integrated into autovacuum, or maybe
> something else.)
Some basic variant (without autovacuum support) can be good enough. We have
no autovacuum support for REINDEX CONCURRENTLY and I don't see a necessity
for it (sure, that may be limited by my perspective). The need to reduce
table size is not too common (a lot of use cases are better covered by
partitioning), but sometimes it exists, and then a simple built-in solution
can be helpful.
On 2024-Jan-30, Pavel Stehule wrote:
> Some basic variant (without autovacuum support) can be good enough. We
> have no autovacuum support for REINDEX CONCURRENTLY and I don't see a
> necessity for it (sure, that may be limited by my perspective). The need
> to reduce table size is not too common (a lot of use cases are better
> covered by partitioning), but sometimes it exists, and then a simple
> built-in solution can be helpful.
That's my thinking as well.
--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
On Tue, Jan 30, 2024 at 12:37:12PM +0100, Alvaro Herrera wrote:
> On 2024-Jan-30, Pavel Stehule wrote:
> > Some basic variant (without autovacuum support) can be good enough. We
> > have no autovacuum support for REINDEX CONCURRENTLY and I don't see a
> > necessity for it (sure, that may be limited by my perspective). The need
> > to reduce table size is not too common (a lot of use cases are better
> > covered by partitioning), but sometimes it exists, and then a simple
> > built-in solution can be helpful.
>
> That's my thinking as well.
Or, yes, I'd agree about that. This can make for a much better user
experience. I'm just not sure how that stuff would be shaped and how
much ground it would need to cover.
--
Michael
Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2024-Jan-30, Pavel Stehule wrote:
> > One of my customers is reducing a table from 140GB to 20GB today; now he
> > is able to run archiving. He can use pg_repack, and it works well today,
> > but I ask myself whether what pg_repack does would really be hard to do
> > internally, since similar work already had to be done for REINDEX
> > CONCURRENTLY. This is not a common task, and it will not become one, but
> > on the other hand it could be a nice feature to have, and maybe not too
> > hard to implement today. But I didn't try it.
>
> FWIW a newer, more modern and more trustworthy alternative to pg_repack
> is pg_squeeze, which I discovered almost by chance, and soon found I
> liked much more.
>
> So thinking about your question, I think it might be possible to
> integrate a tool that works like pg_squeeze, such that it runs when
> VACUUM is invoked -- either under some new option, or just replace the
> code under FULL, not sure. If the Cybertec people allow it, we could
> just grab the pg_squeeze code and add it to the things that VACUUM can
> run.
There are no objections from Cybertec. Nevertheless, I don't expect much code
to be just copy & pasted. If I started to implement the extension today, I'd
do some things in a different way. (Some things might actually be simpler in
the core, i.e. a few small changes in PG core are easier than the related
workarounds in the extension.)
The core idea is: 1) a "historic snapshot" is used to get the current
contents of the table, 2) logical decoding is used to capture the changes done
while the data is being copied to new storage, 3) the exclusive lock on the
table is only taken for very short time, to swap the storage (relfilenode) of
the table.
I think it should be coded in a way that allows use by VACUUM FULL, CLUSTER,
and possibly some subcommands of ALTER TABLE. For example, some users of
pg_squeeze requested an enhancement that allows the user to change a
column's data type without service disruption (typically when it appears
that an integer column is going to overflow and a change to bigint is
needed).
Online (re)partitioning could be another use case, although I admit that
commands that change the system catalog are a bit harder to implement than
VACUUM FULL / CLUSTER.
One thing that pg_squeeze does not handle is visibility: it uses heap_insert()
to insert the tuples into the new storage, so the problems described in [1]
can appear. The in-core implementation should rather do something like tuple
rewriting (rewriteheap.c).
Is your plan to work on it soon or should I try to write a draft patch? (I
assume this is for PG >= 18.)
[1]: https://www.postgresql.org/docs/current/mvcc-caveats.html
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
This is great to hear.
On 2024-Jan-31, Antonin Houska wrote:
> Is your plan to work on it soon or should I try to write a draft patch?
> (I assume this is for PG >= 18.)
I don't have plans for it, so if you have resources, please go for it.
--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> This is great to hear.
>
> On 2024-Jan-31, Antonin Houska wrote:
> > Is your plan to work on it soon or should I try to write a draft patch?
> > (I assume this is for PG >= 18.)
>
> I don't have plans for it, so if you have resources, please go for it.
OK, I'm thinking about how the feature can be integrated into the core.
BTW, I'm failing to understand why cluster_rel() has no argument of the
BufferAccessStrategy type. According to buffer/README, the criterion for
using a specific strategy is that the page "is unlikely to be needed again
soon". Specifically for cluster_rel(), the page will *definitely* not be
used again (unless the VACUUM FULL/CLUSTER command fails): BufferTag
contains the relation file number, and the old relation file is eventually
dropped.
Am I missing anything?
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
On 2024-Feb-16, Antonin Houska wrote:
> BTW, I'm failing to understand why cluster_rel() has no argument of the
> BufferAccessStrategy type. According to buffer/README, the criterion for
> using a specific strategy is that the page "is unlikely to be needed again
> soon". Specifically for cluster_rel(), the page will *definitely* not be
> used again (unless the VACUUM FULL/CLUSTER command fails): BufferTag
> contains the relation file number, and the old relation file is eventually
> dropped.
>
> Am I missing anything?
No, that's just an oversight. Access strategies are newer than that
cluster code.
--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Most hackers will be perfectly comfortable conceptualizing users as entropy
sources, so let's move on." (Nathaniel Smith)
https://mail.gnu.org/archive/html/monotone-devel/2007-01/msg00080.html
Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > Is your plan to work on it soon or should I try to write a draft patch?
> > (I assume this is for PG >= 18.)
>
> I don't have plans for it, so if you have resources, please go for it.
The first version is attached. The actual feature is in 0003. 0004 is
probably not necessary now, but I hadn't realized that until I coded it.
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Attachments:
v01-0001-Adjust-signature-of-cluster_rel-and-its-subroutines.patch (text/x-diff)
From f47a98b9b4580a581aacf73c553b87ca6bf16533 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Tue, 9 Jul 2024 17:45:59 +0200
Subject: [PATCH 1/4] Adjust signature of cluster_rel() and its subroutines.
So far cluster_rel() received OID of the relation it should process and it
performed opening and locking of the relation itself. Yet copy_table_data()
received the OID as well and also had to open the relation itself. This patch
tries to eliminate the repeated opening and closing.
One particular reason for this change is that the VACUUM FULL / CLUSTER
command with the CONCURRENTLY option will need to release all locks on the
relation (and possibly on the clustering index) at some point. Since it makes
little sense to keep relation reference w/o lock, the cluster_rel() function
also closes its reference to the relation (and its index). Neither the
function nor its subroutines may open extra references because then it'd be a
bit harder to close them all.
---
src/backend/commands/cluster.c | 146 ++++++++++++++++++-------------
src/backend/commands/matview.c | 2 +-
src/backend/commands/tablecmds.c | 2 +-
src/backend/commands/vacuum.c | 12 +--
src/include/commands/cluster.h | 5 +-
5 files changed, 99 insertions(+), 68 deletions(-)
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 78f96789b0..194d143cf4 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -70,8 +70,8 @@ typedef struct
static void cluster_multiple_rels(List *rtcs, ClusterParams *params);
-static void rebuild_relation(Relation OldHeap, Oid indexOid, bool verbose);
-static void copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex,
+static void rebuild_relation(Relation OldHeap, Relation index, bool verbose);
+static void copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
bool verbose, bool *pSwapToastByContent,
TransactionId *pFreezeXid, MultiXactId *pCutoffMulti);
static List *get_tables_to_cluster(MemoryContext cluster_context);
@@ -194,11 +194,11 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
{
- /* close relation, keep lock till commit */
- table_close(rel, NoLock);
-
- /* Do the job. */
- cluster_rel(tableOid, indexOid, &params);
+ /*
+ * Do the job. (The function will close the relation, lock is kept
+ * till commit.)
+ */
+ cluster_rel(rel, indexOid, &params);
return;
}
@@ -275,6 +275,7 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
foreach(lc, rtcs)
{
RelToCluster *rtc = (RelToCluster *) lfirst(lc);
+ Relation rel;
/* Start a new transaction for each relation. */
StartTransactionCommand();
@@ -282,8 +283,13 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
/* functions in indexes may want a snapshot set */
PushActiveSnapshot(GetTransactionSnapshot());
- /* Do the job. */
- cluster_rel(rtc->tableOid, rtc->indexOid, params);
+ rel = table_open(rtc->tableOid, AccessExclusiveLock);
+
+ /*
+ * Do the job. (The function will close the relation, lock is kept
+ * till commit.)
+ */
+ cluster_rel(rel, rtc->indexOid, params);
PopActiveSnapshot();
CommitTransactionCommand();
@@ -306,16 +312,19 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
* If indexOid is InvalidOid, the table will be rewritten in physical order
* instead of index order. This is the new implementation of VACUUM FULL,
* and error messages should refer to the operation as VACUUM not CLUSTER.
+ *
+ * We expect that OldHeap is already locked in AccessExclusiveLock mode.
*/
void
-cluster_rel(Oid tableOid, Oid indexOid, ClusterParams *params)
+cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
{
- Relation OldHeap;
+ Oid tableOid = RelationGetRelid(OldHeap);
Oid save_userid;
int save_sec_context;
int save_nestlevel;
bool verbose = ((params->options & CLUOPT_VERBOSE) != 0);
bool recheck = ((params->options & CLUOPT_RECHECK) != 0);
+ Relation index = NULL;
/* Check for user-requested abort. */
CHECK_FOR_INTERRUPTS();
@@ -328,21 +337,6 @@ cluster_rel(Oid tableOid, Oid indexOid, ClusterParams *params)
pgstat_progress_update_param(PROGRESS_CLUSTER_COMMAND,
PROGRESS_CLUSTER_COMMAND_VACUUM_FULL);
- /*
- * We grab exclusive access to the target rel and index for the duration
- * of the transaction. (This is redundant for the single-transaction
- * case, since cluster() already did it.) The index lock is taken inside
- * check_index_is_clusterable.
- */
- OldHeap = try_relation_open(tableOid, AccessExclusiveLock);
-
- /* If the table has gone away, we can skip processing it */
- if (!OldHeap)
- {
- pgstat_progress_end_command();
- return;
- }
-
/*
* Switch to the table owner's userid, so that any index functions are run
* as that user. Also lock down security-restricted operations and
@@ -445,7 +439,11 @@ cluster_rel(Oid tableOid, Oid indexOid, ClusterParams *params)
/* Check heap and index are valid to cluster on */
if (OidIsValid(indexOid))
+ {
check_index_is_clusterable(OldHeap, indexOid, AccessExclusiveLock);
+ /* Open the index (It should already be locked.) */
+ index = index_open(indexOid, NoLock);
+ }
/*
* Quietly ignore the request if this is a materialized view which has not
@@ -474,9 +472,12 @@ cluster_rel(Oid tableOid, Oid indexOid, ClusterParams *params)
TransferPredicateLocksToHeapRelation(OldHeap);
/* rebuild_relation does all the dirty work */
- rebuild_relation(OldHeap, indexOid, verbose);
+ rebuild_relation(OldHeap, index, verbose);
- /* NB: rebuild_relation does table_close() on OldHeap */
+ /*
+ * NB: rebuild_relation does table_close() on OldHeap, and also on index,
+ * if the pointer is valid.
+ */
out:
/* Roll back any GUC changes executed by index functions */
@@ -625,22 +626,27 @@ mark_index_clustered(Relation rel, Oid indexOid, bool is_internal)
* rebuild_relation: rebuild an existing relation in index or physical order
*
* OldHeap: table to rebuild --- must be opened and exclusive-locked!
- * indexOid: index to cluster by, or InvalidOid to rewrite in physical order.
+ * index: index to cluster by, or NULL to rewrite in physical order. Must be
+ * opened and locked.
*
- * NB: this routine closes OldHeap at the right time; caller should not.
+ * On exit, the heap (and also the index, if one was passed) are closed, but
+ * still locked with AccessExclusiveLock.
*/
static void
-rebuild_relation(Relation OldHeap, Oid indexOid, bool verbose)
+rebuild_relation(Relation OldHeap, Relation index, bool verbose)
{
Oid tableOid = RelationGetRelid(OldHeap);
+ Oid indexOid = index ? RelationGetRelid(index) : InvalidOid;
Oid accessMethod = OldHeap->rd_rel->relam;
Oid tableSpace = OldHeap->rd_rel->reltablespace;
Oid OIDNewHeap;
+ Relation NewHeap;
char relpersistence;
bool is_system_catalog;
bool swap_toast_by_content;
TransactionId frozenXid;
MultiXactId cutoffMulti;
+ LOCKMODE lmode_new;
if (OidIsValid(indexOid))
/* Mark the correct index as clustered */
@@ -650,19 +656,40 @@ rebuild_relation(Relation OldHeap, Oid indexOid, bool verbose)
relpersistence = OldHeap->rd_rel->relpersistence;
is_system_catalog = IsSystemRelation(OldHeap);
- /* Close relcache entry, but keep lock until transaction commit */
- table_close(OldHeap, NoLock);
-
- /* Create the transient table that will receive the re-ordered data */
+ /*
+ * Create the transient table that will receive the re-ordered data.
+ *
+ * NoLock for the old heap because we already have it locked and want to
+ * keep unlocking straightforward.
+ */
+ lmode_new = AccessExclusiveLock;
OIDNewHeap = make_new_heap(tableOid, tableSpace,
accessMethod,
relpersistence,
- AccessExclusiveLock);
+ NoLock, &lmode_new);
+ Assert(lmode_new == AccessExclusiveLock || lmode_new == NoLock);
+ /* Lock iff not done above. */
+ NewHeap = table_open(OIDNewHeap, lmode_new == NoLock ?
+ AccessExclusiveLock : NoLock);
/* Copy the heap data into the new table in the desired order */
- copy_table_data(OIDNewHeap, tableOid, indexOid, verbose,
+ copy_table_data(NewHeap, OldHeap, index, verbose,
&swap_toast_by_content, &frozenXid, &cutoffMulti);
+
+ /* Close relcache entries, but keep lock until transaction commit */
+ table_close(OldHeap, NoLock);
+ if (index)
+ index_close(index, NoLock);
+
+ /*
+ * Close the new relation so it can be dropped as soon as the storage is
+ * swapped. The relation is not visible to others, so we could unlock it
+ * completely, but it's simpler to pass NoLock than to track all the locks
+ * acquired so far.
+ */
+ table_close(NewHeap, NoLock);
+
/*
* Swap the physical files of the target and transient tables, then
* rebuild the target's indexes and throw away the transient table.
@@ -683,10 +710,15 @@ rebuild_relation(Relation OldHeap, Oid indexOid, bool verbose)
*
* After this, the caller should load the new heap with transferred/modified
* data, then call finish_heap_swap to complete the operation.
+ *
+ * If a specific lock mode is needed for the new relation, pass it via the
+ * in/out parameter lockmode_new_p. On exit, the output value tells whether
+ * the lock was actually acquired.
*/
Oid
make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
- char relpersistence, LOCKMODE lockmode)
+ char relpersistence, LOCKMODE lockmode_old,
+ LOCKMODE *lockmode_new_p)
{
TupleDesc OldHeapDesc;
char NewHeapName[NAMEDATALEN];
@@ -697,8 +729,17 @@ make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
Datum reloptions;
bool isNull;
Oid namespaceid;
+ LOCKMODE lockmode_new;
- OldHeap = table_open(OIDOldHeap, lockmode);
+ if (lockmode_new_p)
+ {
+ lockmode_new = *lockmode_new_p;
+ *lockmode_new_p = NoLock;
+ }
+ else
+ lockmode_new = lockmode_old;
+
+ OldHeap = table_open(OIDOldHeap, lockmode_old);
OldHeapDesc = RelationGetDescr(OldHeap);
/*
@@ -792,7 +833,9 @@ make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
if (isNull)
reloptions = (Datum) 0;
- NewHeapCreateToastTable(OIDNewHeap, reloptions, lockmode, toastid);
+ NewHeapCreateToastTable(OIDNewHeap, reloptions, lockmode_new, toastid);
+ if (lockmode_new_p)
+ *lockmode_new_p = lockmode_new;
ReleaseSysCache(tuple);
}
@@ -811,13 +854,13 @@ make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
* *pCutoffMulti receives the MultiXactId used as a cutoff point.
*/
static void
-copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
+copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verbose,
bool *pSwapToastByContent, TransactionId *pFreezeXid,
MultiXactId *pCutoffMulti)
{
- Relation NewHeap,
- OldHeap,
- OldIndex;
+ Oid OIDOldHeap = RelationGetRelid(OldHeap);
+ Oid OIDOldIndex = OldIndex ? RelationGetRelid(OldIndex) : InvalidOid;
+ Oid OIDNewHeap = RelationGetRelid(NewHeap);
Relation relRelation;
HeapTuple reltup;
Form_pg_class relform;
@@ -836,16 +879,6 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
pg_rusage_init(&ru0);
- /*
- * Open the relations we need.
- */
- NewHeap = table_open(OIDNewHeap, AccessExclusiveLock);
- OldHeap = table_open(OIDOldHeap, AccessExclusiveLock);
- if (OidIsValid(OIDOldIndex))
- OldIndex = index_open(OIDOldIndex, AccessExclusiveLock);
- else
- OldIndex = NULL;
-
/* Store a copy of the namespace name for logging purposes */
nspname = get_namespace_name(RelationGetNamespace(OldHeap));
@@ -1001,11 +1034,6 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
tups_recently_dead,
pg_rusage_show(&ru0))));
- if (OldIndex != NULL)
- index_close(OldIndex, NoLock);
- table_close(OldHeap, NoLock);
- table_close(NewHeap, NoLock);
-
/* Update pg_class to reflect the correct values of pages and tuples. */
relRelation = table_open(RelationRelationId, RowExclusiveLock);
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index ea05d4b224..488ca950d9 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -296,7 +296,7 @@ ExecRefreshMatView(RefreshMatViewStmt *stmt, const char *queryString,
*/
OIDNewHeap = make_new_heap(matviewOid, tableSpace,
matviewRel->rd_rel->relam,
- relpersistence, ExclusiveLock);
+ relpersistence, ExclusiveLock, NULL);
LockRelationOid(OIDNewHeap, AccessExclusiveLock);
dest = CreateTransientRelDestReceiver(OIDNewHeap);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index dbfe0d6b1c..5d6151dad1 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -5841,7 +5841,7 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
* unlogged anyway.
*/
OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, NewAccessMethod,
- persistence, lockmode);
+ persistence, lockmode, NULL);
/*
* Copy the heap data into the new table with the desired
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 48f8eab202..0bd000acc5 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -2196,15 +2196,17 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
{
ClusterParams cluster_params = {0};
- /* close relation before vacuuming, but hold lock until commit */
- relation_close(rel, NoLock);
- rel = NULL;
-
if ((params->options & VACOPT_VERBOSE) != 0)
cluster_params.options |= CLUOPT_VERBOSE;
/* VACUUM FULL is now a variant of CLUSTER; see cluster.c */
- cluster_rel(relid, InvalidOid, &cluster_params);
+ cluster_rel(rel, InvalidOid, &cluster_params);
+
+ /*
+ * cluster_rel() should have closed the relation, lock is kept
+ * till commit.
+ */
+ rel = NULL;
}
else
table_relation_vacuum(rel, params, bstrategy);
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 4e32380417..7492796ea2 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -32,13 +32,14 @@ typedef struct ClusterParams
} ClusterParams;
extern void cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel);
-extern void cluster_rel(Oid tableOid, Oid indexOid, ClusterParams *params);
+extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params);
extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid,
LOCKMODE lockmode);
extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
extern Oid make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
- char relpersistence, LOCKMODE lockmode);
+ char relpersistence, LOCKMODE lockmode_old,
+ LOCKMODE *lockmode_new_p);
extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
bool is_system_catalog,
bool swap_toast_by_content,
--
2.45.2
v01-0002-Move-progress-related-fields-from-PgBackendStatus-to.patch (text/x-diff)
From cdf67d933a56323c0e5ca77495f60017d398bbd5 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Tue, 9 Jul 2024 17:45:59 +0200
Subject: [PATCH 2/4] Move progress related fields from PgBackendStatus to
PgBackendProgress.
VACUUM FULL / CLUSTER CONCURRENTLY will need to save and restore these fields
at some point.
---
src/backend/utils/activity/backend_progress.c | 18 +++++++++---------
src/backend/utils/activity/backend_status.c | 4 ++--
src/backend/utils/adt/pgstatfuncs.c | 6 +++---
src/include/utils/backend_progress.h | 14 ++++++++++++++
src/include/utils/backend_status.h | 14 ++------------
5 files changed, 30 insertions(+), 26 deletions(-)
diff --git a/src/backend/utils/activity/backend_progress.c b/src/backend/utils/activity/backend_progress.c
index bfb9b7704b..e7c8bfba94 100644
--- a/src/backend/utils/activity/backend_progress.c
+++ b/src/backend/utils/activity/backend_progress.c
@@ -33,9 +33,9 @@ pgstat_progress_start_command(ProgressCommandType cmdtype, Oid relid)
return;
PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
- beentry->st_progress_command = cmdtype;
- beentry->st_progress_command_target = relid;
- MemSet(&beentry->st_progress_param, 0, sizeof(beentry->st_progress_param));
+ beentry->st_progress.command = cmdtype;
+ beentry->st_progress.command_target = relid;
+ MemSet(&beentry->st_progress.param, 0, sizeof(beentry->st_progress.param));
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
@@ -56,7 +56,7 @@ pgstat_progress_update_param(int index, int64 val)
return;
PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
- beentry->st_progress_param[index] = val;
+ beentry->st_progress.param[index] = val;
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
@@ -77,7 +77,7 @@ pgstat_progress_incr_param(int index, int64 incr)
return;
PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
- beentry->st_progress_param[index] += incr;
+ beentry->st_progress.param[index] += incr;
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
@@ -134,7 +134,7 @@ pgstat_progress_update_multi_param(int nparam, const int *index,
{
Assert(index[i] >= 0 && index[i] < PGSTAT_NUM_PROGRESS_PARAM);
- beentry->st_progress_param[index[i]] = val[i];
+ beentry->st_progress.param[index[i]] = val[i];
}
PGSTAT_END_WRITE_ACTIVITY(beentry);
@@ -155,11 +155,11 @@ pgstat_progress_end_command(void)
if (!beentry || !pgstat_track_activities)
return;
- if (beentry->st_progress_command == PROGRESS_COMMAND_INVALID)
+ if (beentry->st_progress.command == PROGRESS_COMMAND_INVALID)
return;
PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
- beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
- beentry->st_progress_command_target = InvalidOid;
+ beentry->st_progress.command = PROGRESS_COMMAND_INVALID;
+ beentry->st_progress.command_target = InvalidOid;
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 1ccf4c6d83..b54a35d91c 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -378,8 +378,8 @@ pgstat_bestart(void)
#endif
lbeentry.st_state = STATE_UNDEFINED;
- lbeentry.st_progress_command = PROGRESS_COMMAND_INVALID;
- lbeentry.st_progress_command_target = InvalidOid;
+ lbeentry.st_progress.command = PROGRESS_COMMAND_INVALID;
+ lbeentry.st_progress.command_target = InvalidOid;
lbeentry.st_query_id = UINT64CONST(0);
/*
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 3876339ee1..fe09ae8f63 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -269,7 +269,7 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
* Report values for only those backends which are running the given
* command.
*/
- if (beentry->st_progress_command != cmdtype)
+ if (beentry->st_progress.command != cmdtype)
continue;
/* Value available to all callers */
@@ -279,9 +279,9 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
/* show rest of the values including relid only to role members */
if (HAS_PGSTAT_PERMISSIONS(beentry->st_userid))
{
- values[2] = ObjectIdGetDatum(beentry->st_progress_command_target);
+ values[2] = ObjectIdGetDatum(beentry->st_progress.command_target);
for (i = 0; i < PGSTAT_NUM_PROGRESS_PARAM; i++)
- values[i + 3] = Int64GetDatum(beentry->st_progress_param[i]);
+ values[i + 3] = Int64GetDatum(beentry->st_progress.param[i]);
}
else
{
diff --git a/src/include/utils/backend_progress.h b/src/include/utils/backend_progress.h
index 7b63d38f97..e09598eafc 100644
--- a/src/include/utils/backend_progress.h
+++ b/src/include/utils/backend_progress.h
@@ -30,8 +30,22 @@ typedef enum ProgressCommandType
PROGRESS_COMMAND_COPY,
} ProgressCommandType;
+
#define PGSTAT_NUM_PROGRESS_PARAM 20
+/*
+ * Any command which wishes can advertise that it is running by setting
+ * command, command_target, and param[]. command_target should be the OID of
+ * the relation which the command targets (we assume there's just one, as this
+ * is meant for utility commands), but the meaning of each element in the
+ * param array is command-specific.
+ */
+typedef struct PgBackendProgress
+{
+ ProgressCommandType command;
+ Oid command_target;
+ int64 param[PGSTAT_NUM_PROGRESS_PARAM];
+} PgBackendProgress;
extern void pgstat_progress_start_command(ProgressCommandType cmdtype,
Oid relid);
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 7b7f6f59d0..11cdf7f95a 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -155,18 +155,8 @@ typedef struct PgBackendStatus
*/
char *st_activity_raw;
- /*
- * Command progress reporting. Any command which wishes can advertise
- * that it is running by setting st_progress_command,
- * st_progress_command_target, and st_progress_param[].
- * st_progress_command_target should be the OID of the relation which the
- * command targets (we assume there's just one, as this is meant for
- * utility commands), but the meaning of each element in the
- * st_progress_param array is command-specific.
- */
- ProgressCommandType st_progress_command;
- Oid st_progress_command_target;
- int64 st_progress_param[PGSTAT_NUM_PROGRESS_PARAM];
+ /* Command progress reporting. */
+ PgBackendProgress st_progress;
/* query identifier, optionally computed using post_parse_analyze_hook */
uint64 st_query_id;
--
2.45.2
v01-0003-Add-CONCURRENTLY-option-to-both-VACUUM-FULL-and-CLUS.patch (text/plain)
From 1cb536663c018d98faf349a680b773364b464026 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Tue, 9 Jul 2024 17:45:59 +0200
Subject: [PATCH 3/4] Add CONCURRENTLY option to both VACUUM FULL and CLUSTER
commands.
Both VACUUM FULL and CLUSTER commands copy the relation data into a new file,
create new indexes and eventually swap the files. To make sure that the old
file does not change during the copying, the relation is locked in an
exclusive mode, which prevents applications from both reading and writing. (To
keep the data consistent, we'd only need to prevent the applications from
writing, but even reading needs to be blocked before we can swap the files -
otherwise some applications could continue using the old file. Since we cannot
get a stronger lock without releasing the weaker one first, we acquire the
exclusive lock at the beginning and keep it until the end of the processing.)
This patch introduces an alternative workflow, which only requires the
exclusive lock when the relation (and index) files are being swapped.
(Supposedly, the swapping should be pretty fast.) On the other hand, when we
copy the data to the new file, we allow applications to read from the relation
and even write into it.
First, we scan the relation using a "historic snapshot", and insert all the
tuples satisfying this snapshot into the new file. Note that, before creating
that snapshot, we need to make sure that all the other backends treat the
relation as a system catalog: in particular, they must log information on new
command IDs (CIDs). We achieve that by adding the relation ID into a shared
hash table and waiting until all the transactions currently writing into the
table (i.e. transactions possibly not aware of the new entry) have finished.
Second, logical decoding is used to capture the data changes done by
applications during the copying (i.e. changes that do not satisfy the historic
snapshot mentioned above), and those are applied to the new file before we
acquire the exclusive lock we need to swap the files. (Of course, more data
changes can take place while we are waiting for the lock - these will be
applied to the new file after we have acquired the lock, before we swap the
files.)
While copying the data into the new file, we hold a lock that prevents
applications from changing the relation tuple descriptor (tuples inserted into
the old file must fit into the new file). However, as we have to release that
lock before getting the exclusive one, it's possible that someone adds or
drops a column, or changes the data type of an existing one. Therefore we have
to check the tuple descriptor before we swap the files. If we find out that
the tuple descriptor changed, ERROR is raised and all the changes are rolled
back. Since a lot of effort can be wasted in such a case, the ALTER TABLE
command also tries to check if VACUUM FULL / CLUSTER with the CONCURRENTLY
option is running on the same relation, and raises an ERROR if it is.
Like the existing implementation of both VACUUM FULL and CLUSTER commands, the
variant with the CONCURRENTLY option also requires an extra space for the new
relation and index files (which coexist with the old files for some time). In
addition, the CONCURRENTLY option might introduce a lag in releasing WAL
segments for archiving / recycling. This is due to the decoding of the data
changes done by applications concurrently. However, this lag should not be more
than a single WAL segment.
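
For illustration, the workflow described above would be invoked with the new
option roughly as follows (a sketch of the syntax proposed by this patch, not
of any released PostgreSQL version; the table and index names are made up):

```sql
-- Prerequisites per the patch: wal_level = logical, a free replication
-- slot, not inside a transaction block, and not a system catalog or
-- TOAST table.

-- Rewrite the table while keeping it readable and writable; the
-- ACCESS EXCLUSIVE lock is taken only for the final file swap.
VACUUM (FULL, CONCURRENTLY) my_table;

-- Same workflow for CLUSTER, ordering the pre-existing rows by an index.
CLUSTER (VERBOSE, CONCURRENTLY) my_table USING my_table_pkey;
```

Note that, as the patch documents, rows inserted after the clustering starts
are not ordered, and a concurrent DDL change to the table causes the command
to fail with an ERROR.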
---
doc/src/sgml/monitoring.sgml | 36 +-
doc/src/sgml/ref/cluster.sgml | 114 +-
doc/src/sgml/ref/vacuum.sgml | 27 +-
src/Makefile | 1 +
src/backend/access/common/toast_internals.c | 3 +-
src/backend/access/heap/heapam.c | 80 +-
src/backend/access/heap/heapam_handler.c | 155 +-
src/backend/access/heap/heapam_visibility.c | 30 +-
src/backend/access/transam/xact.c | 52 +
src/backend/catalog/index.c | 43 +-
src/backend/catalog/system_views.sql | 17 +-
src/backend/commands/cluster.c | 2618 ++++++++++++++++-
src/backend/commands/matview.c | 2 +-
src/backend/commands/tablecmds.c | 11 +
src/backend/commands/vacuum.c | 137 +-
src/backend/replication/logical/decode.c | 58 +-
src/backend/replication/logical/snapbuild.c | 87 +-
.../replication/pgoutput_cluster/Makefile | 32 +
.../replication/pgoutput_cluster/meson.build | 18 +
.../pgoutput_cluster/pgoutput_cluster.c | 321 ++
src/backend/storage/ipc/ipci.c | 3 +
src/backend/tcop/utility.c | 11 +
src/backend/utils/activity/backend_progress.c | 16 +
.../utils/activity/wait_event_names.txt | 1 +
src/backend/utils/cache/inval.c | 22 +
src/backend/utils/cache/relcache.c | 5 +
src/backend/utils/time/snapmgr.c | 6 +-
src/bin/psql/tab-complete.c | 5 +-
src/include/access/heapam.h | 19 +-
src/include/access/heapam_xlog.h | 2 +
src/include/access/tableam.h | 10 +
src/include/access/xact.h | 2 +
src/include/catalog/index.h | 3 +
src/include/commands/cluster.h | 117 +-
src/include/commands/progress.h | 17 +-
src/include/commands/vacuum.h | 17 +-
src/include/replication/snapbuild.h | 2 +
src/include/storage/lockdefs.h | 2 +-
src/include/storage/lwlocklist.h | 1 +
src/include/utils/backend_progress.h | 3 +-
src/include/utils/inval.h | 2 +
src/include/utils/rel.h | 7 +-
src/include/utils/snapmgr.h | 3 +
src/test/regress/expected/rules.out | 17 +-
44 files changed, 3876 insertions(+), 259 deletions(-)
create mode 100644 src/backend/replication/pgoutput_cluster/Makefile
create mode 100644 src/backend/replication/pgoutput_cluster/meson.build
create mode 100644 src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 991f629907..fe1ba36f40 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5567,14 +5567,35 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>heap_tuples_written</structfield> <type>bigint</type>
+ <structfield>heap_tuples_inserted</structfield> <type>bigint</type>
</para>
<para>
- Number of heap tuples written.
+ Number of heap tuples inserted.
This counter only advances when the phase is
<literal>seq scanning heap</literal>,
- <literal>index scanning heap</literal>
- or <literal>writing new heap</literal>.
+ <literal>index scanning heap</literal>,
+ <literal>writing new heap</literal>
+ or <literal>catch-up</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_updated</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples updated.
+ This counter only advances when the phase is <literal>catch-up</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_deleted</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples deleted.
+ This counter only advances when the phase is <literal>catch-up</literal>.
</para></entry>
</row>
@@ -5655,6 +5676,13 @@ FROM pg_stat_get_backend_idset() AS backendid;
<command>CLUSTER</command> is currently writing the new heap.
</entry>
</row>
+ <row>
+ <entry><literal>catch-up</literal></entry>
+ <entry>
+ <command>CLUSTER</command> is currently processing the DML commands
+ that other transactions executed during any of the preceding phases.
+ </entry>
+ </row>
<row>
<entry><literal>swapping relation files</literal></entry>
<entry>
diff --git a/doc/src/sgml/ref/cluster.sgml b/doc/src/sgml/ref/cluster.sgml
index c5760244e6..0fe4e9603b 100644
--- a/doc/src/sgml/ref/cluster.sgml
+++ b/doc/src/sgml/ref/cluster.sgml
@@ -26,6 +26,7 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
<phrase>where <replaceable class="parameter">option</replaceable> can be one of:</phrase>
VERBOSE [ <replaceable class="parameter">boolean</replaceable> ]
+ CONCURRENTLY [ <replaceable class="parameter">boolean</replaceable> ]
</synopsis>
</refsynopsisdiv>
@@ -69,14 +70,18 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
<replaceable class="parameter">table_name</replaceable> reclusters all the
previously-clustered tables in the current database that the calling user
has privileges for. This form of <command>CLUSTER</command> cannot be
- executed inside a transaction block.
+ executed inside a transaction block. Also, if
+ the <literal>CONCURRENTLY</literal> option is used with this form, system
+ catalogs and <acronym>TOAST</acronym> tables are not processed.
</para>
<para>
- When a table is being clustered, an <literal>ACCESS
- EXCLUSIVE</literal> lock is acquired on it. This prevents any other
- database operations (both reads and writes) from operating on the
- table until the <command>CLUSTER</command> is finished.
+ When a table is being clustered, an <literal>ACCESS EXCLUSIVE</literal>
+ lock is acquired on it. This prevents any other database operations (both
+ reads and writes) from operating on the table until
+ the <command>CLUSTER</command> is finished. If you want to keep the table
+ accessible during the clustering, consider using
+ the <literal>CONCURRENTLY</literal> option.
</para>
</refsect1>
@@ -111,6 +116,105 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><literal>CONCURRENTLY</literal></term>
+ <listitem>
+ <para>
+ Allow other transactions to use the table while it is being clustered.
+ </para>
+
+ <para>
+ Internally, <command>CLUSTER</command> copies the contents of the table
+ (ignoring dead tuples) into a new file, sorted by the specified index,
+ and also creates a new file for each index. Then it swaps the old and
+ new files for the table and all the indexes, and deletes the old
+ files. The <literal>ACCESS EXCLUSIVE</literal> lock is needed to make
+ sure that the old files do not change during the processing because the
+ changes would get lost due to the swap.
+ </para>
+
+ <para>
+ With the <literal>CONCURRENTLY</literal> option, the <literal>ACCESS
+ EXCLUSIVE</literal> lock is only acquired to swap the table and index
+ files. The data changes that took place during the creation of the new
+ table and index files are captured using logical decoding
+ (<xref linkend="logicaldecoding"/>) and applied before
+ the <literal>ACCESS EXCLUSIVE</literal> lock is requested. Thus the lock
+ is typically held only for the time needed to swap the files, which
+ should be pretty short. However, the time might still be noticeable
+ if too many data changes have been done to the table
+ while <command>CLUSTER</command> was waiting for the lock: those changes
+ must be processed before the files are swapped.
+ </para>
+
+ <para>
+ Note that <command>CLUSTER</command> with the
+ <literal>CONCURRENTLY</literal> option does not try to order the
+ rows inserted into the table after the clustering started. Also
+ note <command>CLUSTER</command> might fail to complete due to DDL
+ commands executed on the table by other transactions during the
+ clustering.
+ </para>
+
+ <note>
+ <para>
+ In addition to the temporary space requirements explained below,
+ the <literal>CONCURRENTLY</literal> option can increase the use of
+ temporary space somewhat. The reason is that other transactions can
+ perform DML operations which cannot be applied to the new file until
+ <command>CLUSTER</command> has copied all the tuples from the old
+ file. Thus the tuples inserted into the old file during the copying are
+ also stored separately in a temporary file, so they can eventually
+ be applied to the new file.
+ </para>
+
+ <para>
+ Furthermore, the data changes performed during the copying are
+ extracted from <link linkend="wal">write-ahead log</link> (WAL), and
+ this extraction (decoding) only takes place when a certain amount of WAL
+ has been written. Therefore, WAL removal can be delayed by this
+ threshold. Currently the threshold is equal to the value of
+ the <link linkend="guc-wal-segment-size"><varname>wal_segment_size</varname></link>
+ configuration parameter.
+ </para>
+ </note>
+
+ <para>
+ The <literal>CONCURRENTLY</literal> option cannot be used in the
+ following cases:
+
+ <itemizedlist>
+ <listitem>
+ <para>
+ The table is a system catalog or a <acronym>TOAST</acronym> table.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <command>CLUSTER</command> is executed inside a transaction block.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The <link linkend="guc-wal-level"><varname>wal_level</varname></link>
+ configuration parameter is less than <literal>logical</literal>.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The <link linkend="guc-max-replication-slots"><varname>max_replication_slots</varname></link>
+ configuration parameter does not allow for creation of an additional
+ replication slot.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><replaceable class="parameter">boolean</replaceable></term>
<listitem>
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index 9857b35627..298cf7298d 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -39,6 +39,7 @@ VACUUM [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
SKIP_DATABASE_STATS [ <replaceable class="parameter">boolean</replaceable> ]
ONLY_DATABASE_STATS [ <replaceable class="parameter">boolean</replaceable> ]
BUFFER_USAGE_LIMIT <replaceable class="parameter">size</replaceable>
+ CONCURRENTLY [ <replaceable class="parameter">boolean</replaceable> ]
<phrase>and <replaceable class="parameter">table_and_columns</replaceable> is:</phrase>
@@ -61,8 +62,12 @@ VACUUM [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
<para>
Without a <replaceable class="parameter">table_and_columns</replaceable>
list, <command>VACUUM</command> processes every table and materialized view
- in the current database that the current user has permission to vacuum.
- With a list, <command>VACUUM</command> processes only those table(s).
+ in the current database that the current user has permission to vacuum. If
+ the <literal>CONCURRENTLY</literal> option is specified (see below), tables which
+ have not been clustered yet are silently skipped. With a
+ list, <command>VACUUM</command> processes only those table(s). If
+ the <literal>CONCURRENTLY</literal> option is specified, the list may only contain
+ tables which have already been clustered.
</para>
<para>
@@ -360,6 +365,24 @@ VACUUM [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><literal>CONCURRENTLY</literal></term>
+ <listitem>
+ <para>
+ Allow other transactions to use the table while it is being vacuumed. If
+ this option is specified, <command>VACUUM</command> can only process
+ tables which have already been clustered. For more information, see the
+ description of the <literal>CONCURRENTLY</literal> option of the
+ <xref linkend="sql-cluster"/> command.
+ </para>
+
+ <para>
+ The <literal>CONCURRENTLY</literal> option can only be used
+ if <literal>FULL</literal> is used at the same time.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><replaceable class="parameter">boolean</replaceable></term>
<listitem>
diff --git a/src/Makefile b/src/Makefile
index 2f31a2f20a..8b9d30ff72 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -23,6 +23,7 @@ SUBDIRS = \
interfaces \
backend/replication/libpqwalreceiver \
backend/replication/pgoutput \
+ backend/replication/pgoutput_cluster \
fe_utils \
bin \
pl \
diff --git a/src/backend/access/common/toast_internals.c b/src/backend/access/common/toast_internals.c
index 90d0654e62..183055647b 100644
--- a/src/backend/access/common/toast_internals.c
+++ b/src/backend/access/common/toast_internals.c
@@ -320,7 +320,8 @@ toast_save_datum(Relation rel, Datum value,
memcpy(VARDATA(&chunk_data), data_p, chunk_size);
toasttup = heap_form_tuple(toasttupDesc, t_values, t_isnull);
- heap_insert(toastrel, toasttup, mycid, options, NULL);
+ heap_insert(toastrel, toasttup, GetCurrentTransactionId(), mycid,
+ options, NULL);
/*
* Create the index entry. We cheat a little here by not using
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 91b20147a0..493c351d7f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -75,7 +75,8 @@ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
HeapTuple newtup, HeapTuple old_key_tuple,
- bool all_visible_cleared, bool new_all_visible_cleared);
+ bool all_visible_cleared, bool new_all_visible_cleared,
+ bool wal_logical);
static Bitmapset *HeapDetermineColumnsInfo(Relation relation,
Bitmapset *interesting_cols,
Bitmapset *external_cols,
@@ -1975,7 +1976,7 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
/*
* heap_insert - insert tuple into a heap
*
- * The new tuple is stamped with current transaction ID and the specified
+ * The new tuple is stamped with the specified transaction ID and the specified
* command ID.
*
* See table_tuple_insert for comments about most of the input flags, except
@@ -1991,15 +1992,16 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
* reflected into *tup.
*/
void
-heap_insert(Relation relation, HeapTuple tup, CommandId cid,
- int options, BulkInsertState bistate)
+heap_insert(Relation relation, HeapTuple tup, TransactionId xid,
+ CommandId cid, int options, BulkInsertState bistate)
{
- TransactionId xid = GetCurrentTransactionId();
HeapTuple heaptup;
Buffer buffer;
Buffer vmbuffer = InvalidBuffer;
bool all_visible_cleared = false;
+ Assert(TransactionIdIsValid(xid));
+
/* Cheap, simplistic check that the tuple matches the rel's rowtype. */
Assert(HeapTupleHeaderGetNatts(tup->t_data) <=
RelationGetNumberOfAttributes(relation));
@@ -2079,8 +2081,13 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
/*
* If this is a catalog, we need to transmit combo CIDs to properly
* decode, so log that as well.
+ *
+ * Currently we only pass HEAP_INSERT_NO_LOGICAL when doing VACUUM
+ * FULL / CLUSTER, in which case the visibility information does not
+ * change. Therefore, there's no need to update the decoding snapshot.
*/
- if (RelationIsAccessibleInLogicalDecoding(relation))
+ if ((options & HEAP_INSERT_NO_LOGICAL) == 0 &&
+ RelationIsAccessibleInLogicalDecoding(relation))
log_heap_new_cid(relation, heaptup);
/*
@@ -2624,7 +2631,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
void
simple_heap_insert(Relation relation, HeapTuple tup)
{
- heap_insert(relation, tup, GetCurrentCommandId(true), 0, NULL);
+ heap_insert(relation, tup, GetCurrentTransactionId(),
+ GetCurrentCommandId(true), 0, NULL);
}
/*
@@ -2681,11 +2689,11 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
*/
TM_Result
heap_delete(Relation relation, ItemPointer tid,
- CommandId cid, Snapshot crosscheck, bool wait,
- TM_FailureData *tmfd, bool changingPart)
+ TransactionId xid, CommandId cid, Snapshot crosscheck, bool wait,
+ TM_FailureData *tmfd, bool changingPart,
+ bool wal_logical)
{
TM_Result result;
- TransactionId xid = GetCurrentTransactionId();
ItemId lp;
HeapTupleData tp;
Page page;
@@ -2702,6 +2710,7 @@ heap_delete(Relation relation, ItemPointer tid,
bool old_key_copied = false;
Assert(ItemPointerIsValid(tid));
+ Assert(TransactionIdIsValid(xid));
/*
* Forbid this during a parallel operation, lest it allocate a combo CID.
@@ -2927,7 +2936,8 @@ l1:
* Compute replica identity tuple before entering the critical section so
* we don't PANIC upon a memory allocation failure.
*/
- old_key_tuple = ExtractReplicaIdentity(relation, &tp, true, &old_key_copied);
+ old_key_tuple = wal_logical ?
+ ExtractReplicaIdentity(relation, &tp, true, &old_key_copied) : NULL;
/*
* If this is the first possibly-multixact-able operation in the current
@@ -2995,8 +3005,12 @@ l1:
/*
* For logical decode we need combo CIDs to properly decode the
* catalog
+ *
+ * Like in heap_insert(), visibility is unchanged when called from
+ * VACUUM FULL / CLUSTER.
*/
- if (RelationIsAccessibleInLogicalDecoding(relation))
+ if (wal_logical &&
+ RelationIsAccessibleInLogicalDecoding(relation))
log_heap_new_cid(relation, &tp);
xlrec.flags = 0;
@@ -3017,6 +3031,15 @@ l1:
xlrec.flags |= XLH_DELETE_CONTAINS_OLD_KEY;
}
+ /*
+ * Unlike UPDATE, DELETE is decoded even if there is no old key, so it
+ * does not help to clear both XLH_DELETE_CONTAINS_OLD_TUPLE and
+ * XLH_DELETE_CONTAINS_OLD_KEY. Thus we need an extra flag. TODO
+ * Consider not decoding tuples w/o the old tuple/key instead.
+ */
+ if (!wal_logical)
+ xlrec.flags |= XLH_DELETE_NO_LOGICAL;
+
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapDelete);
@@ -3106,10 +3129,11 @@ simple_heap_delete(Relation relation, ItemPointer tid)
TM_Result result;
TM_FailureData tmfd;
- result = heap_delete(relation, tid,
+ result = heap_delete(relation, tid, GetCurrentTransactionId(),
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &tmfd, false /* changingPart */ );
+ &tmfd, false, /* changingPart */
+ true /* wal_logical */);
switch (result)
{
case TM_SelfModified:
@@ -3148,12 +3172,11 @@ simple_heap_delete(Relation relation, ItemPointer tid)
*/
TM_Result
heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
- CommandId cid, Snapshot crosscheck, bool wait,
- TM_FailureData *tmfd, LockTupleMode *lockmode,
- TU_UpdateIndexes *update_indexes)
+ TransactionId xid, CommandId cid, Snapshot crosscheck,
+ bool wait, TM_FailureData *tmfd, LockTupleMode *lockmode,
+ TU_UpdateIndexes *update_indexes, bool wal_logical)
{
TM_Result result;
- TransactionId xid = GetCurrentTransactionId();
Bitmapset *hot_attrs;
Bitmapset *sum_attrs;
Bitmapset *key_attrs;
@@ -3193,6 +3216,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
infomask2_new_tuple;
Assert(ItemPointerIsValid(otid));
+ Assert(TransactionIdIsValid(xid));
/* Cheap, simplistic check that the tuple matches the rel's rowtype. */
Assert(HeapTupleHeaderGetNatts(newtup->t_data) <=
@@ -3981,8 +4005,12 @@ l2:
/*
* For logical decoding we need combo CIDs to properly decode the
* catalog.
+ *
+ * Like in heap_insert(), visibility is unchanged when called from
+ * VACUUM FULL / CLUSTER.
*/
- if (RelationIsAccessibleInLogicalDecoding(relation))
+ if (wal_logical &&
+ RelationIsAccessibleInLogicalDecoding(relation))
{
log_heap_new_cid(relation, &oldtup);
log_heap_new_cid(relation, heaptup);
@@ -3992,7 +4020,8 @@ l2:
newbuf, &oldtup, heaptup,
old_key_tuple,
all_visible_cleared,
- all_visible_cleared_new);
+ all_visible_cleared_new,
+ wal_logical);
if (newbuf != buffer)
{
PageSetLSN(BufferGetPage(newbuf), recptr);
@@ -4225,10 +4254,10 @@ simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup,
TM_FailureData tmfd;
LockTupleMode lockmode;
- result = heap_update(relation, otid, tup,
+ result = heap_update(relation, otid, tup, GetCurrentTransactionId(),
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &tmfd, &lockmode, update_indexes);
+ &tmfd, &lockmode, update_indexes, true);
switch (result)
{
case TM_SelfModified:
@@ -8357,7 +8386,8 @@ static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup, HeapTuple newtup,
HeapTuple old_key_tuple,
- bool all_visible_cleared, bool new_all_visible_cleared)
+ bool all_visible_cleared, bool new_all_visible_cleared,
+ bool wal_logical)
{
xl_heap_update xlrec;
xl_heap_header xlhdr;
@@ -8368,10 +8398,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
suffixlen = 0;
XLogRecPtr recptr;
Page page = BufferGetPage(newbuf);
- bool need_tuple_data = RelationIsLogicallyLogged(reln);
+ bool need_tuple_data;
bool init;
int bufflags;
+ need_tuple_data = RelationIsLogicallyLogged(reln) && wal_logical;
+
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 6f8b1b7929..02fd6d2983 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -33,6 +33,7 @@
#include "catalog/index.h"
#include "catalog/storage.h"
#include "catalog/storage_xlog.h"
+#include "commands/cluster.h"
#include "commands/progress.h"
#include "executor/executor.h"
#include "miscadmin.h"
@@ -53,6 +54,9 @@ static void reform_and_rewrite_tuple(HeapTuple tuple,
static bool SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
HeapTuple tuple,
OffsetNumber tupoffset);
+static bool accept_tuple_for_concurrent_copy(HeapTuple tuple,
+ Snapshot snapshot,
+ Buffer buffer);
static BlockNumber heapam_scan_get_blocks_done(HeapScanDesc hscan);
@@ -250,7 +254,8 @@ heapam_tuple_insert(Relation relation, TupleTableSlot *slot, CommandId cid,
tuple->t_tableOid = slot->tts_tableOid;
/* Perform the insertion, and copy the resulting ItemPointer */
- heap_insert(relation, tuple, cid, options, bistate);
+ heap_insert(relation, tuple, GetCurrentTransactionId(), cid, options,
+ bistate);
ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
if (shouldFree)
@@ -273,7 +278,8 @@ heapam_tuple_insert_speculative(Relation relation, TupleTableSlot *slot,
options |= HEAP_INSERT_SPECULATIVE;
/* Perform the insertion, and copy the resulting ItemPointer */
- heap_insert(relation, tuple, cid, options, bistate);
+ heap_insert(relation, tuple, GetCurrentTransactionId(), cid, options,
+ bistate);
ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
if (shouldFree)
@@ -307,7 +313,8 @@ heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
* the storage itself is cleaning the dead tuples by itself, it is the
* time to call the index tuple deletion also.
*/
- return heap_delete(relation, tid, cid, crosscheck, wait, tmfd, changingPart);
+ return heap_delete(relation, tid, GetCurrentTransactionId(), cid,
+ crosscheck, wait, tmfd, changingPart, true);
}
@@ -325,8 +332,9 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
slot->tts_tableOid = RelationGetRelid(relation);
tuple->t_tableOid = slot->tts_tableOid;
- result = heap_update(relation, otid, tuple, cid, crosscheck, wait,
- tmfd, lockmode, update_indexes);
+ result = heap_update(relation, otid, tuple, GetCurrentTransactionId(),
+ cid, crosscheck, wait,
+ tmfd, lockmode, update_indexes, true);
ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
/*
@@ -686,6 +694,8 @@ static void
heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
Relation OldIndex, bool use_sort,
TransactionId OldestXmin,
+ Snapshot snapshot,
+ LogicalDecodingContext *decoding_ctx,
TransactionId *xid_cutoff,
MultiXactId *multi_cutoff,
double *num_tuples,
@@ -706,6 +716,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
bool *isnull;
BufferHeapTupleTableSlot *hslot;
BlockNumber prev_cblock = InvalidBlockNumber;
+ bool concurrent = snapshot != NULL;
+ XLogRecPtr end_of_wal_prev = GetFlushRecPtr(NULL);
/* Remember if it's a system catalog */
is_system_catalog = IsSystemRelation(OldHeap);
@@ -786,6 +798,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
HeapTuple tuple;
Buffer buf;
bool isdead;
+ HTSV_Result vis;
CHECK_FOR_INTERRUPTS();
@@ -840,7 +853,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
LockBuffer(buf, BUFFER_LOCK_SHARE);
- switch (HeapTupleSatisfiesVacuum(tuple, OldestXmin, buf))
+ switch ((vis = HeapTupleSatisfiesVacuum(tuple, OldestXmin, buf)))
{
case HEAPTUPLE_DEAD:
/* Definitely dead */
@@ -856,14 +869,15 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
case HEAPTUPLE_INSERT_IN_PROGRESS:
/*
- * Since we hold exclusive lock on the relation, normally the
- * only way to see this is if it was inserted earlier in our
- * own transaction. However, it can happen in system
+ * As long as we hold exclusive lock on the relation, normally
+ * the only way to see this is if it was inserted earlier in
+ * our own transaction. However, it can happen in system
* catalogs, since we tend to release write lock before commit
- * there. Give a warning if neither case applies; but in any
- * case we had better copy it.
+ * there. Also, there's no exclusive lock during concurrent
+ * processing. Give a warning if neither case applies; but in
+ * any case we had better copy it.
*/
- if (!is_system_catalog &&
+ if (!is_system_catalog && !concurrent &&
!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(tuple->t_data)))
elog(WARNING, "concurrent insert in progress within table \"%s\"",
RelationGetRelationName(OldHeap));
@@ -875,7 +889,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
/*
* Similar situation to INSERT_IN_PROGRESS case.
*/
- if (!is_system_catalog &&
+ if (!is_system_catalog && !concurrent &&
!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetUpdateXid(tuple->t_data)))
elog(WARNING, "concurrent delete in progress within table \"%s\"",
RelationGetRelationName(OldHeap));
@@ -889,8 +903,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
break;
}
- LockBuffer(buf, BUFFER_LOCK_UNLOCK);
-
if (isdead)
{
*tups_vacuumed += 1;
@@ -901,9 +913,39 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
*tups_vacuumed += 1;
*tups_recently_dead -= 1;
}
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
continue;
}
+ /*
+ * Ignore concurrent changes now; they'll be processed later via
+ * logical decoding. INSERT_IN_PROGRESS is rejected right away because
+ * our snapshot represents a point in time which should precede
+ * (or be equal to) the state of transactions as it was when the
+ * "SatisfiesVacuum" test was performed. Thus
+ * accept_tuple_for_concurrent_copy() should not consider the tuple
+ * inserted.
+ */
+ if (concurrent &&
+ (vis == HEAPTUPLE_INSERT_IN_PROGRESS ||
+ !accept_tuple_for_concurrent_copy(tuple, snapshot, buf)))
+ {
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ continue;
+ }
+
+ /*
+ * In the concurrent case, we should not unlock the buffer until the
+ * tuple has been copied to the new file: if a concurrent transaction
+ * marked it updated or deleted in between, we'd fail to replay that
+ * transaction's changes because then we'd try to perform the same
+ * UPDATE / DELETE twice. XXX Should we instead create a copy of the
+ * tuple so that the buffer can be unlocked right away?
+ */
+ if (!concurrent)
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
*num_tuples += 1;
if (tuplesort != NULL)
{
@@ -920,7 +962,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
{
const int ct_index[] = {
PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED,
- PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN
+ PROGRESS_CLUSTER_HEAP_TUPLES_INSERTED
};
int64 ct_val[2];
@@ -935,6 +977,35 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
ct_val[1] = *num_tuples;
pgstat_progress_update_multi_param(2, ct_index, ct_val);
}
+
+ /* See the comment on unlocking above. */
+ if (concurrent)
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ /*
+ * Process the WAL produced by the load, as well as by other
+ * transactions, so that the replication slot can advance and WAL does
+ * not pile up. Use wal_segment_size as a threshold so that we do not
+ * introduce the decoding overhead too often.
+ *
+ * Of course, we must not apply the changes until the initial load has
+ * completed.
+ *
+ * Note that our insertions into the new table should not be decoded
+ * as we (intentionally) do not write the logical decoding specific
+ * information to WAL.
+ */
+ if (concurrent)
+ {
+ XLogRecPtr end_of_wal;
+
+ end_of_wal = GetFlushRecPtr(NULL);
+ if ((end_of_wal - end_of_wal_prev) > wal_segment_size)
+ {
+ cluster_decode_concurrent_changes(decoding_ctx, end_of_wal);
+ end_of_wal_prev = end_of_wal;
+ }
+ }
}
if (indexScan != NULL)
@@ -978,7 +1049,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
values, isnull,
rwstate);
/* Report n_tuples */
- pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN,
+ pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_TUPLES_INSERTED,
n_tuples);
}
@@ -2583,6 +2654,56 @@ SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
}
}
+/*
+ * Check if the tuple was inserted, updated or deleted while
+ * heapam_relation_copy_for_cluster() was copying the data.
+ *
+ * 'snapshot' is used to determine whether xmin/xmax was set by a transaction
+ * that is still in-progress, or one that started in the future from the
+ * snapshot perspective.
+ *
+ * Returns true if the insertion is visible to 'snapshot', but clears xmax if
+ * it was set by a transaction which is in-progress or in the future from the
+ * snapshot perspective. (The xmax will be set later, when we decode the
+ * corresponding UPDATE / DELETE from WAL.)
+ *
+ * Returns false if the insertion is not visible to 'snapshot'.
+ */
+static bool
+accept_tuple_for_concurrent_copy(HeapTuple tuple, Snapshot snapshot,
+ Buffer buffer)
+{
+ Assert(snapshot->snapshot_type == SNAPSHOT_MVCC);
+
+ /*
+ * First, check if the tuple should be rejected because it was inserted
+ * concurrently.
+ */
+ if (!HeapTupleMVCCInserted(tuple, snapshot, buffer))
+ return false;
+
+ /*
+ * If the tuple was deleted / updated but our snapshot still sees it, we
+ * need to keep it. In that case, clear the information that indicates the
+ * deletion / update. Otherwise the tuple chain would stay incomplete (the
+ * new tuple version having been rejected above), and the delete / update
+ * would fail when replayed later during logical decoding.
+ */
+ if (TransactionIdIsNormal(HeapTupleHeaderGetRawXmax(tuple->t_data)) &&
+ HeapTupleMVCCNotDeleted(tuple, snapshot, buffer))
+ {
+ /* TODO More work needed here? */
+ tuple->t_data->t_infomask |= HEAP_XMAX_INVALID;
+ HeapTupleHeaderSetXmax(tuple->t_data, 0);
+ }
+
+ /*
+ * Accept the tuple even if our snapshot considers it deleted - older
+ * snapshots can still see the tuple.
+ */
+ return true;
+}
+
/* ------------------------------------------------------------------------
* Definition of the heap table access method.
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index 9243feed01..d702592469 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -955,16 +955,31 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
* did TransactionIdIsInProgress in each call --- to no avail, as long as the
* inserting/deleting transaction was still running --- which was more cycles
* and more contention on ProcArrayLock.
+ *
+ * The checks are split into two functions, HeapTupleMVCCInserted() and
+ * HeapTupleMVCCNotDeleted(), because they are also useful separately.
*/
static bool
HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
Buffer buffer)
{
- HeapTupleHeader tuple = htup->t_data;
-
Assert(ItemPointerIsValid(&htup->t_self));
Assert(htup->t_tableOid != InvalidOid);
+ return HeapTupleMVCCInserted(htup, snapshot, buffer) &&
+ HeapTupleMVCCNotDeleted(htup, snapshot, buffer);
+}
+
+/*
+ * HeapTupleMVCCInserted
+ * True iff heap tuple was successfully inserted for the given MVCC
+ * snapshot.
+ */
+bool
+HeapTupleMVCCInserted(HeapTuple htup, Snapshot snapshot, Buffer buffer)
+{
+ HeapTupleHeader tuple = htup->t_data;
+
if (!HeapTupleHeaderXminCommitted(tuple))
{
if (HeapTupleHeaderXminInvalid(tuple))
@@ -1073,6 +1088,17 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
}
/* by here, the inserting transaction has committed */
+ return true;
+}
+
+/*
+ * HeapTupleMVCCNotDeleted
+ * True iff heap tuple was not deleted for the given MVCC snapshot.
+ */
+bool
+HeapTupleMVCCNotDeleted(HeapTuple htup, Snapshot snapshot, Buffer buffer)
+{
+ HeapTupleHeader tuple = htup->t_data;
if (tuple->t_infomask & HEAP_XMAX_INVALID) /* xid invalid or aborted */
return true;
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d119ab909d..f9b8cb4da7 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -124,6 +124,18 @@ static FullTransactionId XactTopFullTransactionId = {InvalidTransactionId};
static int nParallelCurrentXids = 0;
static TransactionId *ParallelCurrentXids;
+/*
+ * Another case that requires TransactionIdIsCurrentTransactionId() to behave
+ * specially is when CLUSTER CONCURRENTLY is processing data changes made in
+ * the old storage of a table by other transactions. When applying the changes
+ * to the new storage, the backend executing the CLUSTER command needs to act
+ * on behalf on those other transactions. The transactions responsible for the
+ * changes in the old storage are stored in this array, sorted by
+ * xidComparator.
+ */
+static int nClusterCurrentXids = 0;
+static TransactionId *ClusterCurrentXids = NULL;
+
/*
* Miscellaneous flag bits to record events which occur on the top level
* transaction. These flags are only persisted in MyXactFlags and are intended
@@ -970,6 +982,8 @@ TransactionIdIsCurrentTransactionId(TransactionId xid)
int low,
high;
+ Assert(nClusterCurrentXids == 0);
+
low = 0;
high = nParallelCurrentXids - 1;
while (low <= high)
@@ -989,6 +1003,21 @@ TransactionIdIsCurrentTransactionId(TransactionId xid)
return false;
}
+ /*
+ * When executing CLUSTER CONCURRENTLY, the array of current transactions
+ * is given.
+ */
+ if (nClusterCurrentXids > 0)
+ {
+ Assert(nParallelCurrentXids == 0);
+
+ return bsearch(&xid,
+ ClusterCurrentXids,
+ nClusterCurrentXids,
+ sizeof(TransactionId),
+ xidComparator) != NULL;
+ }
+
/*
* We will return true for the Xid of the current subtransaction, any of
* its subcommitted children, any of its parents, or any of their
@@ -5621,6 +5650,29 @@ EndParallelWorkerTransaction(void)
CurrentTransactionState->blockState = TBLOCK_DEFAULT;
}
+/*
+ * SetClusterCurrentXids
+ * Set the XID array that TransactionIdIsCurrentTransactionId() should
+ * use.
+ */
+void
+SetClusterCurrentXids(TransactionId *xip, int xcnt)
+{
+ ClusterCurrentXids = xip;
+ nClusterCurrentXids = xcnt;
+}
+
+/*
+ * ResetClusterCurrentXids
+ * Undo the effect of SetClusterCurrentXids().
+ */
+void
+ResetClusterCurrentXids(void)
+{
+ ClusterCurrentXids = NULL;
+ nClusterCurrentXids = 0;
+}
+
/*
* ShowTransactionState
* Debug support
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index a819b4197c..a25c84d7ae 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1415,22 +1415,7 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
opclassOptions[i] = get_attoptions(oldIndexId, i + 1);
- /* Extract statistic targets for each attribute */
- stattargets = palloc0_array(NullableDatum, newInfo->ii_NumIndexAttrs);
- for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
- {
- HeapTuple tp;
- Datum dat;
-
- tp = SearchSysCache2(ATTNUM, ObjectIdGetDatum(oldIndexId), Int16GetDatum(i + 1));
- if (!HeapTupleIsValid(tp))
- elog(ERROR, "cache lookup failed for attribute %d of relation %u",
- i + 1, oldIndexId);
- dat = SysCacheGetAttr(ATTNUM, tp, Anum_pg_attribute_attstattarget, &isnull);
- ReleaseSysCache(tp);
- stattargets[i].value = dat;
- stattargets[i].isnull = isnull;
- }
+ stattargets = get_index_stattargets(oldIndexId, newInfo);
/*
* Now create the new index.
@@ -1469,6 +1454,32 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
return newIndexId;
}
+NullableDatum *
+get_index_stattargets(Oid indexid, IndexInfo *indInfo)
+{
+ NullableDatum *stattargets;
+
+ /* Extract statistic targets for each attribute */
+ stattargets = palloc0_array(NullableDatum, indInfo->ii_NumIndexAttrs);
+ for (int i = 0; i < indInfo->ii_NumIndexAttrs; i++)
+ {
+ HeapTuple tp;
+ Datum dat;
+ bool isnull;
+
+ tp = SearchSysCache2(ATTNUM, ObjectIdGetDatum(indexid), Int16GetDatum(i + 1));
+ if (!HeapTupleIsValid(tp))
+ elog(ERROR, "cache lookup failed for attribute %d of relation %u",
+ i + 1, indexid);
+ dat = SysCacheGetAttr(ATTNUM, tp, Anum_pg_attribute_attstattarget, &isnull);
+ ReleaseSysCache(tp);
+ stattargets[i].value = dat;
+ stattargets[i].isnull = isnull;
+ }
+
+ return stattargets;
+}
+
/*
* index_concurrently_build
*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 19cabc9a47..fddab1cfa9 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1236,16 +1236,19 @@ CREATE VIEW pg_stat_progress_cluster AS
WHEN 2 THEN 'index scanning heap'
WHEN 3 THEN 'sorting tuples'
WHEN 4 THEN 'writing new heap'
- WHEN 5 THEN 'swapping relation files'
- WHEN 6 THEN 'rebuilding index'
- WHEN 7 THEN 'performing final cleanup'
+ WHEN 5 THEN 'catch-up'
+ WHEN 6 THEN 'swapping relation files'
+ WHEN 7 THEN 'rebuilding index'
+ WHEN 8 THEN 'performing final cleanup'
END AS phase,
CAST(S.param3 AS oid) AS cluster_index_relid,
S.param4 AS heap_tuples_scanned,
- S.param5 AS heap_tuples_written,
- S.param6 AS heap_blks_total,
- S.param7 AS heap_blks_scanned,
- S.param8 AS index_rebuild_count
+ S.param5 AS heap_tuples_inserted,
+ S.param6 AS heap_tuples_updated,
+ S.param7 AS heap_tuples_deleted,
+ S.param8 AS heap_blks_total,
+ S.param9 AS heap_blks_scanned,
+ S.param10 AS index_rebuild_count
FROM pg_stat_get_progress_info('CLUSTER') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 194d143cf4..6397f7f8c4 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -25,6 +25,10 @@
#include "access/toast_internals.h"
#include "access/transam.h"
#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xloginsert.h"
+#include "access/xlogutils.h"
#include "catalog/catalog.h"
#include "catalog/dependency.h"
#include "catalog/heap.h"
@@ -32,6 +36,7 @@
#include "catalog/namespace.h"
#include "catalog/objectaccess.h"
#include "catalog/pg_am.h"
+#include "catalog/pg_control.h"
#include "catalog/pg_database.h"
#include "catalog/pg_inherits.h"
#include "catalog/toasting.h"
@@ -40,10 +45,15 @@
#include "commands/progress.h"
#include "commands/tablecmds.h"
#include "commands/vacuum.h"
+#include "executor/executor.h"
#include "miscadmin.h"
#include "optimizer/optimizer.h"
#include "pgstat.h"
+#include "replication/decode.h"
+#include "replication/logical.h"
+#include "replication/snapbuild.h"
#include "storage/bufmgr.h"
+#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
#include "utils/acl.h"
@@ -57,6 +67,8 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
+typedef struct RewriteStateData *RewriteState;
+
/*
* This struct is used to pass around the information on tables to be
* clustered. We need this so we can make a list of them when invoked without
@@ -68,17 +80,175 @@ typedef struct
Oid indexOid;
} RelToCluster;
+/*
+ * The following definitions are used for concurrent processing.
+ */
+
+/*
+ * OID of the table this backend is currently processing with CLUSTER CONCURRENTLY.
+ */
+static Oid clustered_rel = InvalidOid;
+/* The same for its TOAST relation. */
+static Oid clustered_rel_toast = InvalidOid;
+
+/* XXX Do we also need to mention VACUUM FULL CONCURRENTLY? */
+#define CLUSTER_IN_PROGRESS_MESSAGE \
+ "relation \"%s\" is already being processed by CLUSTER CONCURRENTLY"
+
+/*
+ * Everything we need to call ExecInsertIndexTuples().
+ */
+typedef struct IndexInsertState
+{
+ ResultRelInfo *rri;
+ EState *estate;
+ ExprContext *econtext;
+
+ Relation ident_index;
+} IndexInsertState;
-static void cluster_multiple_rels(List *rtcs, ClusterParams *params);
-static void rebuild_relation(Relation OldHeap, Relation index, bool verbose);
+/*
+ * Catalog information to check if another backend changed the relation in
+ * such a way that makes CLUSTER CONCURRENTLY unable to continue. Such changes
+ * are possible because cluster_rel() has to release its lock on the relation
+ * in order to acquire AccessExclusiveLock that it needs to swap the relation
+ * files.
+ *
+ * The most obvious problem is a change of the tuple descriptor, because then
+ * the tuples we try to insert are no longer guaranteed to fit into the new
+ * storage.
+ *
+ * Another problem is that multiple backends might call cluster_rel(). This is
+ * not necessarily a correctness issue, but it definitely means wasted CPU
+ * time.
+ *
+ * Where possible, commands which might change the relation in an incompatible
+ * way should check if CLUSTER CONCURRENTLY is running, before they start to
+ * do the actual changes (see is_concurrent_cluster_in_progress()). Anything
+ * else must be caught by check_catalog_changes(), which uses this structure.
+ */
+typedef struct CatalogState
+{
+ /* Tuple descriptor of the relation. */
+ TupleDesc tupdesc;
+
+ /* The number of indexes tracked. */
+ int ninds;
+ /* The index OIDs. */
+ Oid *ind_oids;
+ /* The index tuple descriptors. */
+ TupleDesc *ind_tupdescs;
+
+ /* The following are copies of the corresponding fields of pg_class. */
+ Oid reltoastrelid;
+ char relpersistence;
+ char replident;
+
+ /* rd_replidindex */
+ Oid replidindex;
+} CatalogState;
+
+/* The WAL segment being decoded. */
+static XLogSegNo cluster_current_segment = 0;
+
+static void cluster_multiple_rels(List *rtcs, ClusterParams *params,
+ LOCKMODE lock_mode, bool isTopLevel);
+static void rebuild_relation(Relation OldHeap, Relation index, bool verbose,
+ bool concurrent);
static void copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
+ Snapshot snapshot, LogicalDecodingContext *decoding_ctx,
bool verbose, bool *pSwapToastByContent,
TransactionId *pFreezeXid, MultiXactId *pCutoffMulti);
static List *get_tables_to_cluster(MemoryContext cluster_context);
static List *get_tables_to_cluster_partitioned(MemoryContext cluster_context,
Oid indexOid);
static bool cluster_is_permitted_for_relation(Oid relid, Oid userid);
+static void check_concurrent_cluster_requirements(Relation rel,
+ bool isTopLevel,
+ bool isCluster);
+static void begin_concurrent_cluster(Relation *rel_p, Relation *index_p,
+ bool *entered_p);
+static void end_concurrent_cluster(Oid relid, bool error);
+static void cluster_before_shmem_exit_callback(int code, Datum arg);
+static CatalogState *get_catalog_state(Relation rel);
+static void free_catalog_state(CatalogState *state);
+static void check_catalog_changes(Relation rel, CatalogState *cat_state);
+static LogicalDecodingContext *setup_logical_decoding(Oid relid,
+ const char *slotname,
+ TupleDesc tupdesc);
+static HeapTuple get_changed_tuple(ConcurrentChange *change);
+static void apply_concurrent_changes(ClusterDecodingState *dstate,
+ Relation rel, ScanKey key, int nkeys,
+ IndexInsertState *iistate);
+static void apply_concurrent_insert(Relation rel, ConcurrentChange *change,
+ HeapTuple tup, IndexInsertState *iistate,
+ TupleTableSlot *index_slot);
+static void apply_concurrent_update(Relation rel, HeapTuple tup,
+ HeapTuple tup_target,
+ ConcurrentChange *change,
+ IndexInsertState *iistate,
+ TupleTableSlot *index_slot);
+static void apply_concurrent_delete(Relation rel, HeapTuple tup_target,
+ ConcurrentChange *change);
+static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
+ HeapTuple tup_key,
+ Snapshot snapshot,
+ IndexInsertState *iistate,
+ TupleTableSlot *ident_slot,
+ IndexScanDesc *scan_p);
+static void process_concurrent_changes(LogicalDecodingContext *ctx,
+ XLogRecPtr end_of_wal,
+ Relation rel_dst,
+ Relation rel_src,
+ ScanKey ident_key,
+ int ident_key_nentries,
+ IndexInsertState *iistate);
+static IndexInsertState *get_index_insert_state(Relation relation,
+ Oid ident_index_id);
+static ScanKey build_identity_key(Oid ident_idx_oid, Relation rel_src,
+ int *nentries);
+static void free_index_insert_state(IndexInsertState *iistate);
+static void cleanup_logical_decoding(LogicalDecodingContext *ctx);
+static void rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
+ Relation cl_index,
+ CatalogState *cat_state,
+ LogicalDecodingContext *ctx,
+ bool swap_toast_by_content,
+ TransactionId frozenXid,
+ MultiXactId cutoffMulti);
+static List *build_new_indexes(Relation NewHeap, Relation OldHeap, List *OldIndexes);
+
+/*
+ * Use this API when relation needs to be unlocked, closed and re-opened. If
+ * the relation got dropped while being unlocked, raise ERROR that mentions
+ * the relation name rather than OID.
+ */
+typedef struct RelReopenInfo
+{
+ /*
+ * The relation to be closed. A pointer to the value is stored here so
+ * that the caller's reference is updated automatically on re-opening.
+ *
+ * When calling unlock_and_close_relations(), 'relid' can be passed
+ * instead of 'rel_p' when the caller only needs to gather information for
+ * subsequent opening.
+ */
+ Relation *rel_p;
+ Oid relid;
+
+ char relkind;
+ LOCKMODE lockmode_orig; /* The existing lock mode */
+ LOCKMODE lockmode_new; /* The lock mode after the relation is
+ * re-opened */
+ char *relname; /* Relation name, initialized automatically. */
+} RelReopenInfo;
+
+static void init_rel_reopen_info(RelReopenInfo *rri, Relation *rel_p,
+ Oid relid, LOCKMODE lockmode_orig,
+ LOCKMODE lockmode_new);
+static void unlock_and_close_relations(RelReopenInfo *rels, int nrel);
+static void reopen_relations(RelReopenInfo *rels, int nrel);
/*---------------------------------------------------------------------------
* This cluster code allows for clustering multiple tables at once. Because
@@ -110,10 +280,12 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
ListCell *lc;
ClusterParams params = {0};
bool verbose = false;
+ bool concurrent = false;
Relation rel = NULL;
Oid indexOid = InvalidOid;
MemoryContext cluster_context;
List *rtcs;
+ LOCKMODE lock_mode;
/* Parse option list */
foreach(lc, stmt->params)
@@ -122,6 +294,8 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
if (strcmp(opt->defname, "verbose") == 0)
verbose = defGetBoolean(opt);
+ else if (strcmp(opt->defname, "concurrently") == 0)
+ concurrent = defGetBoolean(opt);
else
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
@@ -130,20 +304,30 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
parser_errposition(pstate, opt->location)));
}
- params.options = (verbose ? CLUOPT_VERBOSE : 0);
+ params.options =
+ (verbose ? CLUOPT_VERBOSE : 0) |
+ (concurrent ? CLUOPT_CONCURRENT : 0);
+
+ /*
+ * Determine the lock mode expected by cluster_rel().
+ *
+ * In the exclusive case, we obtain AccessExclusiveLock right away to
+ * avoid lock-upgrade hazard in the single-transaction case. In the
+ * CONCURRENT case, the AccessExclusiveLock will only be used at the end
+ * of processing, presumably for a very short time. Until then, we'll have
+ * to unlock the relation temporarily, so there's no lock-upgrade hazard.
+ */
+ lock_mode = (params.options & CLUOPT_CONCURRENT) == 0 ?
+ AccessExclusiveLock : LOCK_CLUSTER_CONCURRENT;
if (stmt->relation != NULL)
{
/* This is the single-relation case. */
Oid tableOid;
- /*
- * Find, lock, and check permissions on the table. We obtain
- * AccessExclusiveLock right away to avoid lock-upgrade hazard in the
- * single-transaction case.
- */
+ /* Find, lock, and check permissions on the table. */
tableOid = RangeVarGetRelidExtended(stmt->relation,
- AccessExclusiveLock,
+ lock_mode,
0,
RangeVarCallbackMaintainsTable,
NULL);
@@ -198,7 +382,7 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
* Do the job. (The function will close the relation, lock is kept
* till commit.)
*/
- cluster_rel(rel, indexOid, ¶ms);
+ cluster_rel(rel, indexOid, ¶ms, isTopLevel);
return;
}
@@ -237,7 +421,7 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
rtcs = get_tables_to_cluster_partitioned(cluster_context, indexOid);
/* close relation, releasing lock on parent table */
- table_close(rel, AccessExclusiveLock);
+ table_close(rel, lock_mode);
}
else
{
@@ -246,7 +430,7 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
}
/* Do the job. */
- cluster_multiple_rels(rtcs, ¶ms);
+ cluster_multiple_rels(rtcs, ¶ms, lock_mode, isTopLevel);
/* Start a new transaction for the cleanup work. */
StartTransactionCommand();
@@ -263,7 +447,8 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
* return.
*/
static void
-cluster_multiple_rels(List *rtcs, ClusterParams *params)
+cluster_multiple_rels(List *rtcs, ClusterParams *params, LOCKMODE lock_mode,
+ bool isTopLevel)
{
ListCell *lc;
@@ -283,13 +468,19 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
/* functions in indexes may want a snapshot set */
PushActiveSnapshot(GetTransactionSnapshot());
- rel = table_open(rtc->tableOid, AccessExclusiveLock);
+ rel = table_open(rtc->tableOid, lock_mode);
- /*
- * Do the job. (The function will close the relation, lock is kept
- * till commit.)
- */
- cluster_rel(rel, rtc->indexOid, params);
+ /* Not all relations can be processed in the concurrent mode. */
+ if ((params->options & CLUOPT_CONCURRENT) == 0 ||
+ check_relation_is_clusterable_concurrently(rel, DEBUG1,
+ "CLUSTER (CONCURRENTLY)"))
+ {
+ /*
+ * Do the job. (The function will close the relation, lock is kept
+ * till commit.)
+ */
+ cluster_rel(rel, rtc->indexOid, params, isTopLevel);
+ }
+ else
+ table_close(rel, lock_mode);
PopActiveSnapshot();
CommitTransactionCommand();
@@ -313,10 +504,21 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
* instead of index order. This is the new implementation of VACUUM FULL,
* and error messages should refer to the operation as VACUUM not CLUSTER.
*
- * We expect that OldHeap is already locked in AccessExclusiveLock mode.
+ * We expect that OldHeap is already locked. The lock mode is
+ * AccessExclusiveLock for normal processing and LOCK_CLUSTER_CONCURRENT for
+ * concurrent processing (so that SELECT, INSERT, UPDATE and DELETE commands
+ * work, but cluster_rel() cannot be called concurrently for the same
+ * relation).
+ *
+ * Note that, in the concurrent case, the function releases the lock at some
+ * point, in order to get AccessExclusiveLock for the final steps (i.e. to
+ * swap the relation files). To make things simpler, the caller should expect
+ * OldHeap to be closed on return, regardless of CLUOPT_CONCURRENT. (The
+ * AccessExclusiveLock is kept till the end of the transaction.)
*/
void
-cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
+cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
+ bool isTopLevel)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid save_userid;
@@ -325,6 +527,41 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
bool verbose = ((params->options & CLUOPT_VERBOSE) != 0);
bool recheck = ((params->options & CLUOPT_RECHECK) != 0);
Relation index = NULL;
+ bool concurrent = ((params->options & CLUOPT_CONCURRENT) != 0);
+ LOCKMODE lmode;
+ bool entered;
+ bool success;
+
+ /* Check that the correct lock is held. */
+ lmode = !concurrent ? AccessExclusiveLock : LOCK_CLUSTER_CONCURRENT;
+
+ /*
+ * Skip the relation if it's being processed concurrently. In such a case,
+ * we cannot rely on a lock because the other backend needs to release it
+ * temporarily at some point.
+ *
+ * This check should not take place until we have a lock that prevents
+ * another backend from starting VACUUM FULL / CLUSTER CONCURRENTLY after
+ * our check.
+ */
+ Assert(CheckRelationLockedByMe(OldHeap, lmode, false));
+ if (is_concurrent_cluster_in_progress(tableOid))
+ {
+ ereport(NOTICE,
+ (errmsg(CLUSTER_IN_PROGRESS_MESSAGE,
+ RelationGetRelationName(OldHeap))));
+ table_close(OldHeap, lmode);
+ return;
+ }
+
+ /* There are specific requirements on concurrent processing. */
+ if (concurrent)
+ {
+ check_concurrent_cluster_requirements(OldHeap, isTopLevel,
+ OidIsValid(indexOid));
+
+ check_relation_is_clusterable_concurrently(OldHeap, ERROR,
+ "CLUSTER (CONCURRENTLY)");
+ }
/* Check for user-requested abort. */
CHECK_FOR_INTERRUPTS();
@@ -361,7 +598,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
/* Check that the user still has privileges for the relation */
if (!cluster_is_permitted_for_relation(tableOid, save_userid))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
goto out;
}
@@ -376,7 +613,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
*/
if (RELATION_IS_OTHER_TEMP(OldHeap))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
goto out;
}
@@ -387,7 +624,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
*/
if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(indexOid)))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
goto out;
}
@@ -398,7 +635,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
if ((params->options & CLUOPT_RECHECK_ISCLUSTERED) != 0 &&
!get_index_isclustered(indexOid))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
goto out;
}
}
@@ -414,6 +651,11 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("cannot cluster a shared catalog")));
+ /*
+ * The CONCURRENT case should have been rejected earlier because it does
+ * not support system catalogs.
+ */
+ Assert(!(OldHeap->rd_rel->relisshared && concurrent));
/*
* Don't process temp tables of other backends ... their local buffer
@@ -440,7 +682,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
/* Check heap and index are valid to cluster on */
if (OidIsValid(indexOid))
{
- check_index_is_clusterable(OldHeap, indexOid, AccessExclusiveLock);
+ check_index_is_clusterable(OldHeap, indexOid, lmode);
/* Open the index (It should already be locked.) */
index = index_open(indexOid, NoLock);
}
@@ -455,7 +697,8 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
if (OldHeap->rd_rel->relkind == RELKIND_MATVIEW &&
!RelationIsPopulated(OldHeap))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ index_close(index, lmode);
+ relation_close(OldHeap, lmode);
goto out;
}
@@ -468,11 +711,42 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
* invalid, because we move tuples around. Promote them to relation
* locks. Predicate locks on indexes will be promoted when they are
* reindexed.
+ *
+ * During concurrent processing, the heap as well as its indexes stay in
+ * operation, so we postpone this step until they are locked using
+ * AccessExclusiveLock near the end of the processing.
*/
- TransferPredicateLocksToHeapRelation(OldHeap);
+ if (!concurrent)
+ TransferPredicateLocksToHeapRelation(OldHeap);
/* rebuild_relation does all the dirty work */
- rebuild_relation(OldHeap, index, verbose);
+ entered = false;
+ success = false;
+ PG_TRY();
+ {
+ /*
+ * For concurrent processing, make sure other transactions treat this
+ * table as if it was a system / user catalog, and WAL the relevant
+ * additional information. ERROR is raised if another backend is
+ * processing the same table.
+ */
+ if (concurrent)
+ {
+ Relation *index_p = index ? &index : NULL;
+
+ begin_concurrent_cluster(&OldHeap, index_p, &entered);
+ }
+
+ rebuild_relation(OldHeap, index, verbose,
+ (params->options & CLUOPT_CONCURRENT) != 0);
+ success = true;
+ }
+ PG_FINALLY();
+ {
+ if (concurrent && entered)
+ end_concurrent_cluster(tableOid, !success);
+ }
+ PG_END_TRY();
/*
* NB: rebuild_relation does table_close() on OldHeap, and also on index,
@@ -622,18 +896,99 @@ mark_index_clustered(Relation rel, Oid indexOid, bool is_internal)
table_close(pg_index, RowExclusiveLock);
}
+/*
+ * Check if the CONCURRENTLY option is legal for the relation.
+ */
+bool
+check_relation_is_clusterable_concurrently(Relation rel, int elevel,
+ const char *stmt)
+{
+ char relpersistence, replident;
+ Oid ident_idx;
+
+ /* Data changes in system relations are not logically decoded. */
+ if (IsCatalogRelation(rel))
+ {
+ ereport(elevel,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot process relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("%s is not supported for catalog relations", stmt)));
+ return false;
+ }
+
+ if (IsToastRelation(rel))
+ {
+ ereport(elevel,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot process relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("%s is not supported for TOAST relations, unless the main relation is processed too",
+ stmt)));
+ return false;
+ }
+
+ relpersistence = rel->rd_rel->relpersistence;
+ if (relpersistence != RELPERSISTENCE_PERMANENT)
+ {
+ ereport(elevel,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot process relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("%s is only allowed for permanent relations", stmt)));
+ return false;
+ }
+
+ /* With NOTHING, WAL does not contain the old tuple. */
+ replident = rel->rd_rel->relreplident;
+ if (replident == REPLICA_IDENTITY_NOTHING)
+ {
+ ereport(elevel,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot process relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("relation \"%s\" has insufficient replication identity",
+ RelationGetRelationName(rel))));
+ return false;
+ }
+
+ /*
+ * Identity index is not set if the replica identity is FULL, but PK might
+ * exist in such a case.
+ */
+ ident_idx = RelationGetReplicaIndex(rel);
+ if (!OidIsValid(ident_idx) && OidIsValid(rel->rd_pkindex))
+ ident_idx = rel->rd_pkindex;
+ if (!OidIsValid(ident_idx))
+ {
+ ereport(elevel,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot process relation \"%s\"",
+ RelationGetRelationName(rel)),
+ (errhint("relation \"%s\" has no identity index",
+ RelationGetRelationName(rel)))));
+ return false;
+ }
+
+ return true;
+}
+
/*
* rebuild_relation: rebuild an existing relation in index or physical order
*
- * OldHeap: table to rebuild --- must be opened and exclusive-locked!
+ * OldHeap: table to rebuild --- must be opened and locked. See cluster_rel()
+ * for comments on the required lock strength.
+ *
* index: index to cluster by, or NULL to rewrite in physical order. Must be
* opened and locked.
*
* On exit, the heap (and also the index, if one was passed) are closed, but
- * still locked with AccessExclusiveLock.
+ * still locked with AccessExclusiveLock. (The function handles the lock
+ * upgrade if 'concurrent' is true.)
*/
static void
-rebuild_relation(Relation OldHeap, Relation index, bool verbose)
+rebuild_relation(Relation OldHeap, Relation index, bool verbose,
+ bool concurrent)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid indexOid = index ? RelationGetRelid(index) : InvalidOid;
@@ -642,19 +997,83 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose)
Oid OIDNewHeap;
Relation NewHeap;
char relpersistence;
- bool is_system_catalog;
bool swap_toast_by_content;
TransactionId frozenXid;
MultiXactId cutoffMulti;
+ NameData slotname;
+ LogicalDecodingContext *ctx = NULL;
+ Snapshot snapshot = NULL;
+ CatalogState *cat_state = NULL;
LOCKMODE lmode_new;
+ if (concurrent)
+ {
+ TupleDesc tupdesc;
+ RelReopenInfo rri[2];
+ int nrel;
+
+ /*
+ * CLUSTER CONCURRENTLY is not allowed in a transaction block, so this
+ * should never fire.
+ */
+ Assert(GetTopTransactionIdIfAny() == InvalidTransactionId);
+
+ /*
+ * A single backend should not execute multiple CLUSTER commands at a
+ * time, so use PID to make the slot unique.
+ */
+ snprintf(NameStr(slotname), NAMEDATALEN, "cluster_%d", MyProcPid);
+
+ /*
+ * Gather catalog information so that we can check later if the old
+ * relation has not changed while unlocked.
+ *
+ * Since this function also checks if the relation can be processed,
+ * it's important to call it before we set up the logical decoding,
+ * because that can take some time. It is not clear whether it is
+ * necessary to do it even earlier.
+ */
+ cat_state = get_catalog_state(OldHeap);
+
+ tupdesc = CreateTupleDescCopy(RelationGetDescr(OldHeap));
+
+ /*
+ * Unlock the relation (and possibly the clustering index) to avoid
+ * deadlock because setup_logical_decoding() will wait for all the
+ * running transactions (with XID assigned) to finish. Some of those
+ * transactions might be waiting for a lock on our relation.
+ */
+ nrel = 0;
+ init_rel_reopen_info(&rri[nrel++], &OldHeap, InvalidOid,
+ LOCK_CLUSTER_CONCURRENT,
+ LOCK_CLUSTER_CONCURRENT);
+ if (index)
+ init_rel_reopen_info(&rri[nrel++], &index, InvalidOid,
+ LOCK_CLUSTER_CONCURRENT,
+ LOCK_CLUSTER_CONCURRENT);
+ unlock_and_close_relations(rri, nrel);
+
+ /* Prepare to capture the concurrent data changes. */
+ ctx = setup_logical_decoding(tableOid, NameStr(slotname), tupdesc);
+
+ /* Lock the table (and index) again. */
+ reopen_relations(rri, nrel);
+
+ /*
+ * Check whether 'tupdesc' (or other catalog state) could have changed
+ * while the relation was unlocked.
+ */
+ check_catalog_changes(OldHeap, cat_state);
+
+ snapshot = SnapBuildInitialSnapshotForCluster(ctx->snapshot_builder);
+ }
+
if (OidIsValid(indexOid))
/* Mark the correct index as clustered */
mark_index_clustered(OldHeap, indexOid, true);
/* Remember info about rel before closing OldHeap */
relpersistence = OldHeap->rd_rel->relpersistence;
- is_system_catalog = IsSystemRelation(OldHeap);
/*
* Create the transient table that will receive the re-ordered data.
@@ -673,31 +1092,52 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose)
AccessExclusiveLock : NoLock);
/* Copy the heap data into the new table in the desired order */
- copy_table_data(NewHeap, OldHeap, index, verbose,
+ copy_table_data(NewHeap, OldHeap, index, snapshot, ctx, verbose,
&swap_toast_by_content, &frozenXid, &cutoffMulti);
+ if (concurrent)
+ {
+ rebuild_relation_finish_concurrent(NewHeap, OldHeap, index,
+ cat_state, ctx,
+ swap_toast_by_content,
+ frozenXid, cutoffMulti);
+
+ pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
+ PROGRESS_CLUSTER_PHASE_FINAL_CLEANUP);
+
+ /* Done with decoding. */
+ FreeSnapshot(snapshot);
+ free_catalog_state(cat_state);
+ cleanup_logical_decoding(ctx);
+ ReplicationSlotRelease();
+ ReplicationSlotDrop(NameStr(slotname), false);
+ }
+ else
+ {
+ bool is_system_catalog = IsSystemRelation(OldHeap);
- /* Close relcache entries, but keep lock until transaction commit */
- table_close(OldHeap, NoLock);
- if (index)
- index_close(index, NoLock);
+ /* Close relcache entries, but keep lock until transaction commit */
+ table_close(OldHeap, NoLock);
+ if (index)
+ index_close(index, NoLock);
- /*
- * Close the new relation so it can be dropped as soon as the storage is
- * swapped. The relation is not visible to others, so we could unlock it
- * completely, but it's simpler to pass NoLock than to track all the locks
- * acquired so far.
- */
- table_close(NewHeap, NoLock);
+ /*
+ * Close the new relation so it can be dropped as soon as the storage
+ * is swapped. The relation is not visible to others, so we could
+ * unlock it completely, but it's simpler to pass NoLock than to track
+ * all the locks acquired so far.
+ */
+ table_close(NewHeap, NoLock);
- /*
- * Swap the physical files of the target and transient tables, then
- * rebuild the target's indexes and throw away the transient table.
- */
- finish_heap_swap(tableOid, OIDNewHeap, is_system_catalog,
- swap_toast_by_content, false, true,
- frozenXid, cutoffMulti,
- relpersistence);
+ /*
+ * Swap the physical files of the target and transient tables, then
+ * rebuild the target's indexes and throw away the transient table.
+ */
+ finish_heap_swap(tableOid, OIDNewHeap, is_system_catalog,
+ swap_toast_by_content, false, true, true,
+ frozenXid, cutoffMulti,
+ relpersistence);
+ }
}
@@ -848,15 +1288,19 @@ make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
/*
* Do the physical copying of table data.
*
+ * 'snapshot' and 'decoding_ctx': see table_relation_copy_for_cluster(). Pass
+ * them iff concurrent processing is required.
+ *
* There are three output parameters:
* *pSwapToastByContent is set true if toast tables must be swapped by content.
* *pFreezeXid receives the TransactionId used as freeze cutoff point.
* *pCutoffMulti receives the MultiXactId used as a cutoff point.
*/
static void
-copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verbose,
- bool *pSwapToastByContent, TransactionId *pFreezeXid,
- MultiXactId *pCutoffMulti)
+copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
+ Snapshot snapshot, LogicalDecodingContext *decoding_ctx,
+ bool verbose, bool *pSwapToastByContent,
+ TransactionId *pFreezeXid, MultiXactId *pCutoffMulti)
{
Oid OIDOldHeap = RelationGetRelid(OldHeap);
Oid OIDOldIndex = OldIndex ? RelationGetRelid(OldIndex) : InvalidOid;
@@ -876,6 +1320,7 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
int elevel = verbose ? INFO : DEBUG2;
PGRUsage ru0;
char *nspname;
+ bool concurrent = snapshot != NULL;
pg_rusage_init(&ru0);
@@ -902,8 +1347,12 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
*
* We don't need to open the toast relation here, just lock it. The lock
* will be held till end of transaction.
+ *
+ * In the CONCURRENT case, the lock does not help because we need to
+ * release it temporarily at some point. Instead, we expect VACUUM /
+ * CLUSTER to skip tables which are present in ClusteredRelsHash.
*/
- if (OldHeap->rd_rel->reltoastrelid)
+ if (OldHeap->rd_rel->reltoastrelid && !concurrent)
LockRelationOid(OldHeap->rd_rel->reltoastrelid, AccessExclusiveLock);
/*
@@ -979,7 +1428,45 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
* provided, else plain seqscan.
*/
if (OldIndex != NULL && OldIndex->rd_rel->relam == BTREE_AM_OID)
+ {
+ ResourceOwner oldowner = CurrentResourceOwner;
+
+ /*
+ * In the CONCURRENT case, do the planning in a subtransaction so that
+ * we don't leave any additional locks behind us that we cannot
+ * release easily.
+ */
+ if (concurrent)
+ {
+ Assert(CheckRelationLockedByMe(OldHeap, LOCK_CLUSTER_CONCURRENT,
+ false));
+ Assert(CheckRelationLockedByMe(OldIndex, LOCK_CLUSTER_CONCURRENT,
+ false));
+ BeginInternalSubTransaction("plan_cluster_use_sort");
+ }
+
use_sort = plan_cluster_use_sort(OIDOldHeap, OIDOldIndex);
+
+ if (concurrent)
+ {
+ PgBackendProgress progress;
+
+ /*
+ * Command progress reporting gets terminated at subtransaction
+ * end. Save the status so it can be eventually restored.
+ */
+ memcpy(&progress, &MyBEEntry->st_progress,
+ sizeof(PgBackendProgress));
+
+ /* Release the locks by aborting the subtransaction. */
+ RollbackAndReleaseCurrentSubTransaction();
+
+ /* Restore the progress reporting status. */
+ pgstat_progress_restore_state(&progress);
+
+ CurrentResourceOwner = oldowner;
+ }
+ }
else
use_sort = false;
@@ -1008,7 +1495,9 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
* values (e.g. because the AM doesn't use freezing).
*/
table_relation_copy_for_cluster(OldHeap, NewHeap, OldIndex, use_sort,
- cutoffs.OldestXmin, &cutoffs.FreezeLimit,
+ cutoffs.OldestXmin, snapshot,
+ decoding_ctx,
+ &cutoffs.FreezeLimit,
&cutoffs.MultiXactCutoff,
&num_tuples, &tups_vacuumed,
&tups_recently_dead);
@@ -1017,7 +1506,11 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
*pFreezeXid = cutoffs.FreezeLimit;
*pCutoffMulti = cutoffs.MultiXactCutoff;
- /* Reset rd_toastoid just to be tidy --- it shouldn't be looked at again */
+ /*
+ * Reset rd_toastoid just to be tidy --- it shouldn't be looked at
+ * again. In the CONCURRENT case, we need to set it again before applying
+ * the concurrent changes.
+ */
NewHeap->rd_toastoid = InvalidOid;
num_pages = RelationGetNumberOfBlocks(NewHeap);
@@ -1468,14 +1961,13 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
bool swap_toast_by_content,
bool check_constraints,
bool is_internal,
+ bool reindex,
TransactionId frozenXid,
MultiXactId cutoffMulti,
char newrelpersistence)
{
ObjectAddress object;
Oid mapped_tables[4];
- int reindex_flags;
- ReindexParams reindex_params = {0};
int i;
/* Report that we are now swapping relation files */
@@ -1501,39 +1993,46 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
if (is_system_catalog)
CacheInvalidateCatalog(OIDOldHeap);
- /*
- * Rebuild each index on the relation (but not the toast table, which is
- * all-new at this point). It is important to do this before the DROP
- * step because if we are processing a system catalog that will be used
- * during DROP, we want to have its indexes available. There is no
- * advantage to the other order anyway because this is all transactional,
- * so no chance to reclaim disk space before commit. We do not need a
- * final CommandCounterIncrement() because reindex_relation does it.
- *
- * Note: because index_build is called via reindex_relation, it will never
- * set indcheckxmin true for the indexes. This is OK even though in some
- * sense we are building new indexes rather than rebuilding existing ones,
- * because the new heap won't contain any HOT chains at all, let alone
- * broken ones, so it can't be necessary to set indcheckxmin.
- */
- reindex_flags = REINDEX_REL_SUPPRESS_INDEX_USE;
- if (check_constraints)
- reindex_flags |= REINDEX_REL_CHECK_CONSTRAINTS;
+ if (reindex)
+ {
+ int reindex_flags;
+ ReindexParams reindex_params = {0};
- /*
- * Ensure that the indexes have the same persistence as the parent
- * relation.
- */
- if (newrelpersistence == RELPERSISTENCE_UNLOGGED)
- reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
- else if (newrelpersistence == RELPERSISTENCE_PERMANENT)
- reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+ /*
+ * Rebuild each index on the relation (but not the toast table, which
+ * is all-new at this point). It is important to do this before the
+ * DROP step because if we are processing a system catalog that will
+ * be used during DROP, we want to have its indexes available. There
+ * is no advantage to the other order anyway because this is all
+ * transactional, so no chance to reclaim disk space before commit.
+ * We do not need a final CommandCounterIncrement() because
+ * reindex_relation does it.
+ *
+ * Note: because index_build is called via reindex_relation, it will
+ * never set indcheckxmin true for the indexes. This is OK even though
+ * in some sense we are building new indexes rather than rebuilding
+ * existing ones, because the new heap won't contain any HOT chains at
+ * all, let alone broken ones, so it can't be necessary to set
+ * indcheckxmin.
+ */
+ reindex_flags = REINDEX_REL_SUPPRESS_INDEX_USE;
+ if (check_constraints)
+ reindex_flags |= REINDEX_REL_CHECK_CONSTRAINTS;
- /* Report that we are now reindexing relations */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_REBUILD_INDEX);
+ /*
+ * Ensure that the indexes have the same persistence as the parent
+ * relation.
+ */
+ if (newrelpersistence == RELPERSISTENCE_UNLOGGED)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
+ else if (newrelpersistence == RELPERSISTENCE_PERMANENT)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
- reindex_relation(NULL, OIDOldHeap, reindex_flags, &reindex_params);
+ /* Report that we are now reindexing relations */
+ pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
+ PROGRESS_CLUSTER_PHASE_REBUILD_INDEX);
+
+ reindex_relation(NULL, OIDOldHeap, reindex_flags, &reindex_params);
+ }
/* Report that we are now doing clean up */
pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
@@ -1773,3 +2272,1938 @@ cluster_is_permitted_for_relation(Oid relid, Oid userid)
get_rel_name(relid))));
return false;
}
+
+#define REPL_PLUGIN_NAME "pgoutput_cluster"
+
+/*
+ * Each relation being processed by CLUSTER CONCURRENTLY must be in the
+ * clusteredRels hashtable.
+ */
+typedef struct ClusteredRel
+{
+ Oid relid;
+ Oid dbid;
+} ClusteredRel;
+
+static HTAB *ClusteredRelsHash = NULL;
+
+/* Maximum number of entries in the hashtable. */
+static int maxClusteredRels = 0;
+
+Size
+ClusterShmemSize(void)
+{
+ /*
+ * A replication slot is needed for the processing, so use this GUC to
+ * allocate memory for the hashtable. Also reserve space for TOAST
+ * relations.
+ */
+ maxClusteredRels = max_replication_slots * 2;
+
+ return hash_estimate_size(maxClusteredRels, sizeof(ClusteredRel));
+}
+
+void
+ClusterShmemInit(void)
+{
+ HASHCTL info;
+
+ info.keysize = sizeof(ClusteredRel);
+ info.entrysize = info.keysize;
+
+ ClusteredRelsHash = ShmemInitHash("Clustered Relations",
+ maxClusteredRels,
+ maxClusteredRels,
+ &info,
+ HASH_ELEM | HASH_BLOBS);
+}
+
+/*
+ * Perform a preliminary check whether CLUSTER / VACUUM FULL CONCURRENTLY is
+ * possible. Note that here we only check things that should not change if we
+ * release the relation lock temporarily. The information that can change due
+ * to unlocking is checked in get_catalog_state().
+ */
+static void
+check_concurrent_cluster_requirements(Relation rel, bool isTopLevel,
+ bool isCluster)
+{
+ const char *stmt;
+
+ if (isCluster)
+ stmt = "CLUSTER (CONCURRENTLY)";
+ else
+ stmt = "VACUUM (FULL, CONCURRENTLY)";
+
+ /*
+ * Make sure we have no XID assigned, otherwise call of
+ * setup_logical_decoding() can cause a deadlock.
+ */
+ PreventInTransactionBlock(isTopLevel, stmt);
+
+ CheckSlotPermissions();
+
+ /*
+ * Use an existing function to check if we can use logical
+ * decoding. However note that RecoveryInProgress() should already have
+ * caused an error, as it does for the non-concurrent VACUUM FULL / CLUSTER.
+ */
+ CheckLogicalDecodingRequirements();
+
+ /* See ClusterShmemSize() */
+ if (max_replication_slots < 2)
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ (errmsg("%s requires \"max_replication_slots\" to be at least 2",
+ stmt)));
+}
+
+/*
+ * Call this function before CLUSTER CONCURRENTLY starts to setup logical
+ * decoding. It makes sure that other users of the table put enough
+ * information into WAL.
+ *
+ * The point is that on various places we expect that the table we're
+ * processing is treated like a system catalog. For example, we need to be
+ * able to scan it using a "historic snapshot" anytime during the processing
+ * (as opposed to scanning only at the start point of the decoding, as logical
+ * replication does during initial table synchronization), in order to apply
+ * concurrent UPDATE / DELETE commands.
+ *
+ * Since we need to close and reopen the relation here, the 'rel_p' and
+ * 'index_p' arguments are in/out.
+ *
+ * 'entered_p' receives a bool value telling whether the relation OID was
+ * entered into the hashtable or not.
+ */
+static void
+begin_concurrent_cluster(Relation *rel_p, Relation *index_p,
+ bool *entered_p)
+{
+ Relation rel = *rel_p;
+ Oid relid, toastrelid;
+ ClusteredRel key, *entry;
+ bool found;
+ RelReopenInfo rri[2];
+ int nrel;
+ static bool before_shmem_exit_callback_setup = false;
+
+ relid = RelationGetRelid(rel);
+
+ /*
+ * Make sure that we do not leave an entry in ClusteredRelsHash if exiting
+ * due to FATAL.
+ */
+ if (!before_shmem_exit_callback_setup)
+ {
+ before_shmem_exit(cluster_before_shmem_exit_callback, 0);
+ before_shmem_exit_callback_setup = true;
+ }
+
+ memset(&key, 0, sizeof(key));
+ key.relid = relid;
+ key.dbid = MyDatabaseId;
+
+ *entered_p = false;
+ LWLockAcquire(ClusteredRelsLock, LW_EXCLUSIVE);
+ entry = (ClusteredRel *)
+ hash_search(ClusteredRelsHash, &key, HASH_ENTER_NULL, &found);
+ if (found)
+ {
+ /*
+ * Since CLUSTER CONCURRENTLY takes ShareRowExclusiveLock, a conflict
+ * should occur much earlier. However that lock may be released
+ * temporarily, see below. Anyway, we should complain whatever the
+ * reason for the conflict might be.
+ */
+ ereport(ERROR,
+ (errmsg(CLUSTER_IN_PROGRESS_MESSAGE,
+ RelationGetRelationName(rel))));
+ }
+ if (entry == NULL)
+ ereport(ERROR,
+ (errmsg("too many requests for CLUSTER CONCURRENTLY at a time")),
+ (errhint("consider increasing the \"max_replication_slots\" configuration parameter")));
+
+ /*
+ * Even if the insertion of TOAST relid should fail below, the caller has
+ * to do cleanup.
+ */
+ *entered_p = true;
+
+ /*
+ * Enable the callback to remove the entry in case of exit. We should not
+ * do this earlier, otherwise an attempt to insert an already existing
+ * entry could make us remove that entry (inserted by another backend)
+ * during ERROR handling.
+ */
+ Assert(!OidIsValid(clustered_rel));
+ clustered_rel = relid;
+
+ /*
+ * TOAST relation is not accessed using historic snapshot, but we enter it
+ * here to protect it from being VACUUMed by another backend. (Lock does
+ * not help in the CONCURRENT case because we cannot hold it continuously
+ * till the end of the transaction.) See the comments on locking TOAST
+ * relation in copy_table_data().
+ */
+ toastrelid = rel->rd_rel->reltoastrelid;
+ if (OidIsValid(toastrelid))
+ {
+ key.relid = toastrelid;
+ entry = (ClusteredRel *)
+ hash_search(ClusteredRelsHash, &key, HASH_ENTER_NULL, &found);
+ if (found)
+ /*
+ * If we could enter the main relation, the TOAST relation should
+ * succeed too. Nevertheless, check.
+ */
+ ereport(ERROR,
+ (errmsg("TOAST relation of \"%s\" is already being processed by CLUSTER CONCURRENTLY",
+ RelationGetRelationName(rel))));
+ if (entry == NULL)
+ ereport(ERROR,
+ (errmsg("too many requests for CLUSTER CONCURRENTLY at a time")),
+ (errhint("consider increasing the \"max_replication_slots\" configuration parameter")));
+
+ Assert(!OidIsValid(clustered_rel_toast));
+ clustered_rel_toast = toastrelid;
+ }
+ LWLockRelease(ClusteredRelsLock);
+
+ /*
+ * Make sure that other backends are aware of the new hash entry.
+ *
+ * Besides sending the invalidation message, we need to force re-opening
+ * of the relation, which includes the actual invalidation (and thus
+ * checking of our hashtable on the next access).
+ */
+ CacheInvalidateRelcacheImmediate(rel);
+ /*
+ * Since the hashtable only needs to be checked by write transactions,
+ * lock the relation in a mode that conflicts with any DML command. (The
+ * reading transactions are supposed to close the relation before opening
+ * it with higher lock.) Once we have the relation (and its index) locked,
+ * we unlock it immediately and then re-lock using the original mode.
+ */
+ nrel = 0;
+ init_rel_reopen_info(&rri[nrel++], rel_p, InvalidOid,
+ LOCK_CLUSTER_CONCURRENT, ShareLock);
+ if (index_p)
+ {
+ /*
+ * Another transaction might want to open both the relation and the
+ * index. If it already has the relation lock and is waiting for the
+ * index lock, we should release the index lock, otherwise our request
+ * for ShareLock on the relation can end up in a deadlock.
+ */
+ init_rel_reopen_info(&rri[nrel++], index_p, InvalidOid,
+ LOCK_CLUSTER_CONCURRENT, ShareLock);
+ }
+ unlock_and_close_relations(rri, nrel);
+ /*
+ * XXX It's not strictly necessary to lock the index here, but it's
+ * probably not worth teaching the "reopen API" about this special case.
+ */
+ reopen_relations(rri, nrel);
+
+ /* Switch back to the original lock. */
+ nrel = 0;
+ init_rel_reopen_info(&rri[nrel++], rel_p, InvalidOid,
+ ShareLock, LOCK_CLUSTER_CONCURRENT);
+ if (index_p)
+ init_rel_reopen_info(&rri[nrel++], index_p, InvalidOid,
+ ShareLock, LOCK_CLUSTER_CONCURRENT);
+ unlock_and_close_relations(rri, nrel);
+ reopen_relations(rri, nrel);
+}
+
+/*
+ * Call this when done with CLUSTER CONCURRENTLY.
+ *
+ * 'error' tells whether the function is being called in order to handle
+ * error.
+ */
+static void
+end_concurrent_cluster(Oid relid, bool error)
+{
+ ClusteredRel key, *entry, *entry_toast = NULL;
+
+ /* Remove the relation from the hash. */
+ memset(&key, 0, sizeof(key));
+ key.relid = relid;
+ key.dbid = MyDatabaseId;
+ LWLockAcquire(ClusteredRelsLock, LW_EXCLUSIVE);
+ entry = hash_search(ClusteredRelsHash, &key, HASH_REMOVE, NULL);
+
+ /* Disable end_concurrent_cluster_on_exit_callback(). */
+ if (OidIsValid(clustered_rel))
+ clustered_rel = InvalidOid;
+
+ /* Remove the TOAST relation if there is one. */
+ if (OidIsValid(clustered_rel_toast))
+ {
+ key.relid = clustered_rel_toast;
+ entry_toast = hash_search(ClusteredRelsHash, &key, HASH_REMOVE,
+ NULL);
+
+ clustered_rel_toast = InvalidOid;
+ }
+ else
+ key.relid = InvalidOid;
+ LWLockRelease(ClusteredRelsLock);
+
+ /*
+ * On normal completion (!error), we should not really fail to remove the
+ * entry. But if that happens for any reason, make sure the transaction is
+ * aborted: if other transactions, while changing the contents of the
+ * relation, didn't know that CLUSTER CONCURRENTLY was in progress, they
+ * might have failed to write enough information to WAL, and thus we could
+ * have produced inconsistent table contents.
+ *
+ * On the other hand, if we are already handling an error, there's no
+ * reason to worry about inconsistent contents of the new storage because
+ * the transaction is going to be rolled back anyway. Furthermore, by
+ * raising ERROR here we'd shadow the original error.
+ */
+ if (!error)
+ {
+ char *relname;
+
+ if (entry == NULL)
+ {
+ relname = get_rel_name(relid);
+ if (!relname)
+ ereport(ERROR,
+ (errmsg("cache lookup failed for relation %u",
+ relid)));
+
+ ereport(ERROR,
+ (errmsg("relation \"%s\" not found among clustered relations",
+ relname)));
+ }
+
+ /*
+ * Missing TOAST relation indicates that it could have been VACUUMed
+ * or CLUSTERed by another backend while we did not hold a lock on it.
+ */
+ if (entry_toast == NULL && OidIsValid(key.relid))
+ {
+ relname = get_rel_name(key.relid);
+ if (!relname)
+ ereport(ERROR,
+ (errmsg("cache lookup failed for relation %u",
+ key.relid)));
+
+ ereport(ERROR,
+ (errmsg("relation \"%s\" not found among clustered relations",
+ relname)));
+ }
+ }
+
+ /*
+ * Note: unlike begin_concurrent_cluster(), here we do not lock/unlock the
+ * relation: 1) On normal completion, the caller is already holding
+ * AccessExclusiveLock (till the end of the transaction), 2) on ERROR /
+ * FATAL, we try to do the cleanup asap, but the worst case is that other
+ * backends will write unnecessary information to WAL until they close the
+ * relation.
+ */
+}
+
+/*
+ * A wrapper to call end_concurrent_cluster() as a before_shmem_exit callback.
+ */
+static void
+cluster_before_shmem_exit_callback(int code, Datum arg)
+{
+ if (OidIsValid(clustered_rel) || OidIsValid(clustered_rel_toast))
+ end_concurrent_cluster(clustered_rel, true);
+}
+
+/*
+ * Check if relation is currently being processed by CLUSTER CONCURRENTLY.
+ */
+bool
+is_concurrent_cluster_in_progress(Oid relid)
+{
+ ClusteredRel key, *entry;
+
+ memset(&key, 0, sizeof(key));
+ key.relid = relid;
+ key.dbid = MyDatabaseId;
+
+ LWLockAcquire(ClusteredRelsLock, LW_SHARED);
+ entry = (ClusteredRel *)
+ hash_search(ClusteredRelsHash, &key, HASH_FIND, NULL);
+ LWLockRelease(ClusteredRelsLock);
+
+ return entry != NULL;
+}
+
+/*
+ * Check if VACUUM FULL / CLUSTER CONCURRENTLY is already running for given
+ * relation, and if so, raise ERROR. The problem is that cluster_rel() needs
+ * to release its lock on the relation temporarily at some point, so our lock
+ * alone does not help. Commands that might break what cluster_rel() is doing
+ * should call this function first.
+ *
+ * Return without checking if 'lockmode' allows for race conditions that
+ * would make the result meaningless. In that case, cluster_rel() itself
+ * should throw an ERROR if the relation was changed in an incompatible
+ * way. However, if it managed to do most of its work by then, a lot of CPU
+ * time might be wasted.
+ */
+void
+check_for_concurrent_cluster(Oid relid, LOCKMODE lockmode)
+{
+ /*
+ * If the caller does not have a lock that conflicts with
+ * LOCK_CLUSTER_CONCURRENT, the check makes little sense because the
+ * VACUUM FULL / CLUSTER CONCURRENTLY can start anytime after the check.
+ */
+ if (lockmode < LOCK_CLUSTER_CONCURRENT)
+ return;
+
+ if (is_concurrent_cluster_in_progress(relid))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg(CLUSTER_IN_PROGRESS_MESSAGE,
+ get_rel_name(relid))));
+}
+
+/*
+ * Check if relation is eligible for CLUSTER CONCURRENTLY and retrieve the
+ * catalog state to be passed later to check_catalog_changes.
+ *
+ * Caller is supposed to hold (at least) LOCK_CLUSTER_CONCURRENT on the
+ * relation.
+ */
+static CatalogState *
+get_catalog_state(Relation rel)
+{
+ CatalogState *result = palloc_object(CatalogState);
+ List *ind_oids;
+ ListCell *lc;
+ int ninds, i;
+ Oid reltoastrelid = rel->rd_rel->reltoastrelid;
+ char relpersistence = rel->rd_rel->relpersistence;
+ char replident = rel->rd_rel->relreplident;
+ Oid ident_idx = RelationGetReplicaIndex(rel);
+ TupleDesc td_src = RelationGetDescr(rel);
+
+ /*
+ * While gathering the catalog information, check if there is a reason not
+ * to proceed.
+ */
+ check_relation_is_clusterable_concurrently(rel, ERROR,
+ "CLUSTER (CONCURRENTLY)");
+
+ /*
+ * TOAST should not really change, but be careful. If it did, we would be
+ * unable to remove the new one from ClusteredRelsHash.
+ */
+ if (OidIsValid(clustered_rel_toast) &&
+ clustered_rel_toast != reltoastrelid)
+ ereport(ERROR,
+ (errmsg("TOAST relation changed by another transaction")));
+
+ /* No index should be dropped while we are checking it. */
+ Assert(CheckRelationLockedByMe(rel, ShareUpdateExclusiveLock, true));
+
+ ind_oids = RelationGetIndexList(rel);
+ result->ninds = ninds = list_length(ind_oids);
+ result->ind_oids = palloc_array(Oid, ninds);
+ result->ind_tupdescs = palloc_array(TupleDesc, ninds);
+ i = 0;
+ foreach(lc, ind_oids)
+ {
+ Oid ind_oid = lfirst_oid(lc);
+ Relation index;
+ TupleDesc td_src, td_dst;
+
+ /*
+ * A weaker lock should be OK for the index, but this one should not
+ * break anything either.
+ */
+ index = index_open(ind_oid, ShareUpdateExclusiveLock);
+
+ result->ind_oids[i] = RelationGetRelid(index);
+ td_src = RelationGetDescr(index);
+ td_dst = palloc(TupleDescSize(td_src));
+ TupleDescCopy(td_dst, td_src);
+ result->ind_tupdescs[i] = td_dst;
+ i++;
+
+ index_close(index, ShareUpdateExclusiveLock);
+ }
+
+ /* Fill-in the relation info. */
+ result->tupdesc = palloc(TupleDescSize(td_src));
+ TupleDescCopy(result->tupdesc, td_src);
+ result->reltoastrelid = reltoastrelid;
+ result->relpersistence = relpersistence;
+ result->replident = replident;
+ result->replidindex = ident_idx;
+
+ return result;
+}
+
+static void
+free_catalog_state(CatalogState *state)
+{
+ /* We are only interested in indexes. */
+ if (state->ninds == 0)
+ return;
+
+ for (int i = 0; i < state->ninds; i++)
+ FreeTupleDesc(state->ind_tupdescs[i]);
+
+ FreeTupleDesc(state->tupdesc);
+ pfree(state->ind_oids);
+ pfree(state->ind_tupdescs);
+ pfree(state);
+}
+
+/*
+ * Raise ERROR if 'rel' changed in a way that does not allow further
+ * processing of CLUSTER CONCURRENTLY.
+ *
+ * Besides the relation's tuple descriptor, it's important to check indexes:
+ * concurrent change of index definition (can it happen in any other way than
+ * dropping and re-creating the index, accidentally with the same OID?) can be
+ * a problem because we may already have the new index built. If an index was
+ * created or dropped concurrently, we'd fail to swap the index storage. In
+ * any case, we prefer to check the indexes early to get an explicit error
+ * message about the mismatch. Furthermore, the earlier we detect the change,
+ * the fewer CPU cycles we waste.
+ *
+ * Note that we do not check constraints because the transaction which changed
+ * them must have ensured that the existing tuples satisfy the new
+ * constraints. If any DML commands were necessary for that, we will simply
+ * decode them from WAL and apply them to the new storage.
+ *
+ * Caller is supposed to hold (at least) ShareUpdateExclusiveLock on the
+ * relation.
+ */
+static void
+check_catalog_changes(Relation rel, CatalogState *cat_state)
+{
+ List *ind_oids;
+ ListCell *lc;
+ LOCKMODE lmode;
+ Oid ident_idx;
+ TupleDesc td, td_cp;
+
+ /* First, check the relation info. */
+
+ /* TOAST is not easy to change, but check. */
+ if (rel->rd_rel->reltoastrelid != cat_state->reltoastrelid)
+ ereport(ERROR,
+ errmsg("TOAST relation of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+
+ if (rel->rd_rel->relpersistence != cat_state->relpersistence)
+ ereport(ERROR,
+ errmsg("persistence of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+
+ if (cat_state->replident != rel->rd_rel->relreplident)
+ ereport(ERROR,
+ errmsg("replica identity of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+
+ ident_idx = RelationGetReplicaIndex(rel);
+ if (ident_idx == InvalidOid && rel->rd_pkindex != InvalidOid)
+ ident_idx = rel->rd_pkindex;
+ if (cat_state->replidindex != ident_idx)
+ ereport(ERROR,
+ errmsg("identity index of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+
+ /*
+ * As cat_state contains a copy (which has the constraint info cleared),
+ * create a temporary copy for the comparison.
+ */
+ td = RelationGetDescr(rel);
+ td_cp = palloc(TupleDescSize(td));
+ TupleDescCopy(td_cp, td);
+ if (!equalTupleDescs(cat_state->tupdesc, td_cp))
+ ereport(ERROR,
+ errmsg("definition of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+ FreeTupleDesc(td_cp);
+
+ /* Now we are only interested in indexes. */
+ if (cat_state->ninds == 0)
+ return;
+
+ /* No index should be dropped while we are checking the relation. */
+ lmode = ShareUpdateExclusiveLock;
+ Assert(CheckRelationLockedByMe(rel, lmode, true));
+
+ ind_oids = RelationGetIndexList(rel);
+ if (list_length(ind_oids) != cat_state->ninds)
+ goto failed_index;
+
+ foreach(lc, ind_oids)
+ {
+ Oid ind_oid = lfirst_oid(lc);
+ int i;
+ TupleDesc tupdesc;
+ Relation index;
+
+ /* Find the index in cat_state. */
+ for (i = 0; i < cat_state->ninds; i++)
+ {
+ if (cat_state->ind_oids[i] == ind_oid)
+ break;
+ }
+ /*
+ * OID not found, i.e. the index was replaced by another one. XXX
+ * Should we still try to find an index having the desired tuple
+ * descriptor? Or should we always look up indexes by tuple
+ * descriptor and not use OIDs at all?
+ */
+ if (i == cat_state->ninds)
+ goto failed_index;
+
+ /* Check the tuple descriptor. */
+ index = try_index_open(ind_oid, lmode);
+ if (index == NULL)
+ goto failed_index;
+ tupdesc = RelationGetDescr(index);
+ if (!equalTupleDescs(cat_state->ind_tupdescs[i], tupdesc))
+ goto failed_index;
+ index_close(index, lmode);
+ }
+
+ return;
+
+failed_index:
+ ereport(ERROR,
+ (errmsg("index(es) of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel))));
+}
+
+/*
+ * This function is much like pg_create_logical_replication_slot() except that
+ * the new slot is neither released (if anyone else could read changes from
+ * our slot, we could miss changes other backends make while we copy the
+ * existing data into the temporary table), nor persisted (it's easier to
+ * handle a crash by restarting all the work from scratch).
+ *
+ * XXX Even though CreateInitDecodingContext() does not set state to
+ * RS_PERSISTENT, it does write the slot to disk. We rely on
+ * RestoreSlotFromDisk() to delete ephemeral slots during startup. (Both ERROR
+ * and FATAL should lead to cleanup even before the cluster goes down.)
+ */
+static LogicalDecodingContext *
+setup_logical_decoding(Oid relid, const char *slotname, TupleDesc tupdesc)
+{
+ LogicalDecodingContext *ctx;
+ ClusterDecodingState *dstate;
+
+ /* RS_TEMPORARY so that the slot gets cleaned up on ERROR. */
+ ReplicationSlotCreate(slotname, true, RS_TEMPORARY, false, false, false);
+
+ /*
+ * Neither prepare_write nor do_write callback nor update_progress is
+ * useful for us.
+ *
+ * Regarding the value of need_full_snapshot, we pass false because the
+ * table we are processing is present in ClusteredRelsHash and therefore,
+ * regarding logical decoding, treated like a catalog.
+ */
+ ctx = CreateInitDecodingContext(REPL_PLUGIN_NAME,
+ NIL,
+ false,
+ InvalidXLogRecPtr,
+ XL_ROUTINE(.page_read = read_local_xlog_page,
+ .segment_open = wal_segment_open,
+ .segment_close = wal_segment_close),
+ NULL, NULL, NULL);
+
+ /*
+ * We don't have control over setting fast_forward, so at least check it.
+ */
+ Assert(!ctx->fast_forward);
+
+ DecodingContextFindStartpoint(ctx);
+
+ /* Some WAL records should have been read. */
+ Assert(ctx->reader->EndRecPtr != InvalidXLogRecPtr);
+
+ XLByteToSeg(ctx->reader->EndRecPtr, cluster_current_segment,
+ wal_segment_size);
+
+ /*
+ * Setup structures to store decoded changes.
+ */
+ dstate = palloc0(sizeof(ClusterDecodingState));
+ dstate->relid = relid;
+ dstate->tstore = tuplestore_begin_heap(false, false,
+ maintenance_work_mem);
+ dstate->tupdesc = tupdesc;
+
+ /* Initialize the descriptor to store the changes ... */
+ dstate->tupdesc_change = CreateTemplateTupleDesc(1);
+
+ TupleDescInitEntry(dstate->tupdesc_change, 1, NULL, BYTEAOID, -1, 0);
+ /* ... as well as the corresponding slot. */
+ dstate->tsslot = MakeSingleTupleTableSlot(dstate->tupdesc_change,
+ &TTSOpsMinimalTuple);
+
+ dstate->resowner = ResourceOwnerCreate(CurrentResourceOwner,
+ "logical decoding");
+
+ ctx->output_writer_private = dstate;
+ return ctx;
+}
+
+/*
+ * Retrieve the tuple from a change structure. No alignment of the change
+ * structure is assumed.
+ */
+static HeapTuple
+get_changed_tuple(ConcurrentChange *change)
+{
+ HeapTupleData tup_data;
+ HeapTuple result;
+ char *src;
+
+ /*
+ * Ensure alignment before accessing the fields. (This is why we can't use
+ * heap_copytuple() instead of this function.)
+ */
+ memcpy(&tup_data, &change->tup_data, sizeof(HeapTupleData));
+
+ result = (HeapTuple) palloc(HEAPTUPLESIZE + tup_data.t_len);
+ memcpy(result, &tup_data, sizeof(HeapTupleData));
+ result->t_data = (HeapTupleHeader) ((char *) result + HEAPTUPLESIZE);
+ src = (char *) change + sizeof(ConcurrentChange);
+ memcpy(result->t_data, src, result->t_len);
+
+ return result;
+}
+
+/*
+ * Decode logical changes from the WAL sequence up to end_of_wal.
+ */
+void
+cluster_decode_concurrent_changes(LogicalDecodingContext *ctx,
+ XLogRecPtr end_of_wal)
+{
+ ClusterDecodingState *dstate;
+ ResourceOwner resowner_old;
+ PgBackendProgress progress;
+
+ /*
+ * Invalidate the "present" cache before moving to "(recent) history".
+ */
+ InvalidateSystemCaches();
+
+ dstate = (ClusterDecodingState *) ctx->output_writer_private;
+ resowner_old = CurrentResourceOwner;
+ CurrentResourceOwner = dstate->resowner;
+
+ /*
+ * reorderbuffer.c uses an internal subtransaction, whose abort ends the
+ * command progress reporting. Save the status here so we can restore it
+ * when done with the decoding.
+ */
+ memcpy(&progress, &MyBEEntry->st_progress, sizeof(PgBackendProgress));
+
+ PG_TRY();
+ {
+ while (ctx->reader->EndRecPtr < end_of_wal)
+ {
+ XLogRecord *record;
+ XLogSegNo segno_new;
+ char *errm = NULL;
+ XLogRecPtr end_lsn;
+
+ record = XLogReadRecord(ctx->reader, &errm);
+ if (errm)
+ elog(ERROR, "%s", errm);
+
+ if (record != NULL)
+ LogicalDecodingProcessRecord(ctx, ctx->reader);
+
+ /*
+ * If a WAL segment boundary has been crossed, inform the decoding
+ * system that the catalog_xmin can advance. (We could confirm more
+ * often, but filling a single WAL segment should not take much
+ * time.)
+ */
+ end_lsn = ctx->reader->EndRecPtr;
+ XLByteToSeg(end_lsn, segno_new, wal_segment_size);
+ if (segno_new != cluster_current_segment)
+ {
+ LogicalConfirmReceivedLocation(end_lsn);
+ elog(DEBUG1, "cluster: confirmed receive location %X/%X",
+ (uint32) (end_lsn >> 32), (uint32) end_lsn);
+ cluster_current_segment = segno_new;
+ }
+
+ CHECK_FOR_INTERRUPTS();
+ }
+ InvalidateSystemCaches();
+ CurrentResourceOwner = resowner_old;
+ }
+ PG_CATCH();
+ {
+ InvalidateSystemCaches();
+ CurrentResourceOwner = resowner_old;
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+
+ /* Restore the progress reporting status. */
+ pgstat_progress_restore_state(&progress);
+}
+
+/*
+ * Apply changes that happened during the initial load.
+ *
+ * Scan key is passed by caller, so it does not have to be constructed
+ * multiple times. Key entries have all fields initialized, except for
+ * sk_argument.
+ */
+static void
+apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
+ ScanKey key, int nkeys, IndexInsertState *iistate)
+{
+ TupleTableSlot *index_slot, *ident_slot;
+ HeapTuple tup_old = NULL;
+
+ if (dstate->nchanges == 0)
+ return;
+
+ /* TupleTableSlot is needed to pass the tuple to ExecInsertIndexTuples(). */
+ index_slot = MakeSingleTupleTableSlot(dstate->tupdesc, &TTSOpsHeapTuple);
+ iistate->econtext->ecxt_scantuple = index_slot;
+
+ /* A slot to fetch tuples from identity index. */
+ ident_slot = table_slot_create(rel, NULL);
+
+ while (tuplestore_gettupleslot(dstate->tstore, true, false,
+ dstate->tsslot))
+ {
+ bool shouldFree;
+ HeapTuple tup_change,
+ tup,
+ tup_exist;
+ char *change_raw;
+ ConcurrentChange *change;
+ Snapshot snapshot;
+ bool isnull[1];
+ Datum values[1];
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Get the change from the single-column tuple. */
+ tup_change = ExecFetchSlotHeapTuple(dstate->tsslot, false, &shouldFree);
+ heap_deform_tuple(tup_change, dstate->tupdesc_change, values, isnull);
+ Assert(!isnull[0]);
+
+ /* This is bytea, but char* is easier to work with. */
+ change_raw = (char *) DatumGetByteaP(values[0]);
+
+ change = (ConcurrentChange *) VARDATA(change_raw);
+
+ /* TRUNCATE change contains no tuple, so process it separately. */
+ if (change->kind == CHANGE_TRUNCATE)
+ {
+ /*
+ * All the things that ExecuteTruncateGuts() does (such as firing
+ * triggers or handling the DROP_CASCADE behavior) should have
+ * taken place on the source relation. Thus we only do the actual
+ * truncation of the new relation (and its indexes).
+ */
+ heap_truncate_one_rel(rel);
+
+ pfree(tup_change);
+ continue;
+ }
+
+ /*
+ * Extract the tuple from the change. The tuple is copied here because
+ * it might be assigned to 'tup_old', in which case it needs to
+ * survive into the next iteration.
+ */
+ tup = get_changed_tuple(change);
+
+ if (change->kind == CHANGE_UPDATE_OLD)
+ {
+ Assert(tup_old == NULL);
+ tup_old = tup;
+ }
+ else if (change->kind == CHANGE_INSERT)
+ {
+ Assert(tup_old == NULL);
+
+ apply_concurrent_insert(rel, change, tup, iistate, index_slot);
+
+ pfree(tup);
+ }
+ else if (change->kind == CHANGE_UPDATE_NEW ||
+ change->kind == CHANGE_DELETE)
+ {
+ IndexScanDesc ind_scan = NULL;
+ HeapTuple tup_key;
+
+ if (change->kind == CHANGE_UPDATE_NEW)
+ {
+ tup_key = tup_old != NULL ? tup_old : tup;
+ }
+ else
+ {
+ Assert(tup_old == NULL);
+ tup_key = tup;
+ }
+
+ /*
+ * Find the tuple to be updated or deleted.
+ *
+ * As the table being CLUSTERed concurrently is considered a "user
+ * catalog", the new CID is WAL-logged and decoded. And since we
+ * use the same XID that the original DMLs did, the snapshot used
+ * for the logical decoding (by now converted to a non-historic
+ * MVCC snapshot) should see the tuples inserted previously into
+ * the new heap and/or updated there.
+ */
+ snapshot = change->snapshot;
+
+ /*
+ * Set what should be considered current transaction (and
+ * subtransactions) during visibility check.
+ *
+ * Note that this snapshot was created from a historic snapshot
+ * using SnapBuildMVCCFromHistoric(), which does not touch
+ * 'subxip'. Thus, unlike in a regular MVCC snapshot, the array
+ * does not contain (sub)transactions other than the one whose
+ * data changes we are applying.
+ */
+ SetClusterCurrentXids(snapshot->subxip, snapshot->subxcnt);
+
+ tup_exist = find_target_tuple(rel, key, nkeys, tup_key, snapshot,
+ iistate, ident_slot, &ind_scan);
+ if (tup_exist == NULL)
+ elog(ERROR, "failed to find target tuple");
+
+ if (change->kind == CHANGE_UPDATE_NEW)
+ apply_concurrent_update(rel, tup, tup_exist, change, iistate,
+ index_slot);
+ else
+ apply_concurrent_delete(rel, tup_exist, change);
+
+ ResetClusterCurrentXids();
+
+ if (tup_old != NULL)
+ {
+ pfree(tup_old);
+ tup_old = NULL;
+ }
+
+ pfree(tup);
+ index_endscan(ind_scan);
+ }
+ else
+ elog(ERROR, "unrecognized change kind: %d", change->kind);
+
+ /* Free the snapshot if this is the last change that needed it. */
+ Assert(change->snapshot->active_count > 0);
+ change->snapshot->active_count--;
+ if (change->snapshot->active_count == 0)
+ {
+ if (change->snapshot == dstate->snapshot)
+ dstate->snapshot = NULL;
+ FreeSnapshot(change->snapshot);
+ }
+
+ /* TTSOpsMinimalTuple has .get_heap_tuple==NULL. */
+ Assert(shouldFree);
+ pfree(tup_change);
+ }
+
+ tuplestore_clear(dstate->tstore);
+ dstate->nchanges = 0;
+
+ /* Cleanup. */
+ ExecDropSingleTupleTableSlot(index_slot);
+ ExecDropSingleTupleTableSlot(ident_slot);
+}
+
+static void
+apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
+ IndexInsertState *iistate, TupleTableSlot *index_slot)
+{
+ Snapshot snapshot = change->snapshot;
+ List *recheck;
+
+ /*
+ * For INSERT, the visibility information is not important, but we use the
+ * snapshot to get the CID. Index functions might need the whole snapshot
+ * anyway.
+ */
+ SetClusterCurrentXids(snapshot->subxip, snapshot->subxcnt);
+
+ /*
+ * Write the tuple into the new heap.
+ *
+ * The snapshot is the one we used to decode the insert (though converted
+ * to "non-historic" MVCC snapshot), i.e. the snapshot's curcid is the
+ * tuple CID incremented by one (due to the "new CID" WAL record that got
+ * written along with the INSERT record). Thus if we want to use the
+ * original CID, we need to subtract 1 from curcid.
+ */
+ Assert(snapshot->curcid != InvalidCommandId &&
+ snapshot->curcid > FirstCommandId);
+
+ heap_insert(rel, tup, change->xid, snapshot->curcid - 1,
+ HEAP_INSERT_NO_LOGICAL, NULL);
+
+ /*
+ * Update indexes.
+ *
+ * Push the snapshot in case functions in the index need an active
+ * snapshot and the caller hasn't set one.
+ */
+ PushActiveSnapshot(snapshot);
+ ExecStoreHeapTuple(tup, index_slot, false);
+ recheck = ExecInsertIndexTuples(iistate->rri,
+ index_slot,
+ iistate->estate,
+ false, /* update */
+ false, /* noDupErr */
+ NULL, /* specConflict */
+ NIL, /* arbiterIndexes */
+ false /* onlySummarizing */
+ );
+ PopActiveSnapshot();
+ ResetClusterCurrentXids();
+
+ /*
+ * If recheck is required, it must have been performed on the source
+ * relation by now. (All the logical changes we process here are already
+ * committed.)
+ */
+ list_free(recheck);
+
+ pgstat_progress_incr_param(PROGRESS_CLUSTER_HEAP_TUPLES_INSERTED, 1);
+}
+
+static void
+apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
+ ConcurrentChange *change, IndexInsertState *iistate,
+ TupleTableSlot *index_slot)
+{
+ List *recheck;
+ LockTupleMode lockmode;
+ TU_UpdateIndexes update_indexes;
+ ItemPointerData tid_old_new_heap;
+ TM_Result res;
+ Snapshot snapshot = change->snapshot;
+ TM_FailureData tmfd;
+
+ /* Location of the existing tuple in the new heap. */
+ ItemPointerCopy(&tup_target->t_self, &tid_old_new_heap);
+
+ /*
+ * Write the new tuple into the new heap. ('tup' gets the TID assigned
+ * here.)
+ *
+ * Regarding CID, see the comment in apply_concurrent_insert().
+ */
+ Assert(snapshot->curcid != InvalidCommandId &&
+ snapshot->curcid > FirstCommandId);
+
+ res = heap_update(rel, &tid_old_new_heap, tup,
+ change->xid, snapshot->curcid - 1,
+ InvalidSnapshot,
+ false, /* no wait - only we are doing changes */
+ &tmfd, &lockmode, &update_indexes,
+ /* wal_logical */
+ false);
+ if (res != TM_Ok)
+ ereport(ERROR, (errmsg("failed to apply concurrent UPDATE")));
+
+ ExecStoreHeapTuple(tup, index_slot, false);
+
+ if (update_indexes != TU_None)
+ {
+ PushActiveSnapshot(snapshot);
+ recheck = ExecInsertIndexTuples(iistate->rri,
+ index_slot,
+ iistate->estate,
+ true, /* update */
+ false, /* noDupErr */
+ NULL, /* specConflict */
+ NIL, /* arbiterIndexes */
+ /* onlySummarizing */
+ update_indexes == TU_Summarizing);
+ PopActiveSnapshot();
+ list_free(recheck);
+ }
+
+ pgstat_progress_incr_param(PROGRESS_CLUSTER_HEAP_TUPLES_UPDATED, 1);
+}
+
+static void
+apply_concurrent_delete(Relation rel, HeapTuple tup_target,
+ ConcurrentChange *change)
+{
+ ItemPointerData tid_old_new_heap;
+ TM_Result res;
+ TM_FailureData tmfd;
+ Snapshot snapshot = change->snapshot;
+
+ /* Regarding CID, see the comment in apply_concurrent_insert(). */
+ Assert(snapshot->curcid != InvalidCommandId &&
+ snapshot->curcid > FirstCommandId);
+
+ /* Location of the existing tuple in the new heap. */
+ ItemPointerCopy(&tup_target->t_self, &tid_old_new_heap);
+
+ res = heap_delete(rel, &tid_old_new_heap, change->xid,
+ snapshot->curcid - 1, InvalidSnapshot, false,
+ &tmfd, false,
+ /* wal_logical */
+ false);
+
+ if (res != TM_Ok)
+ ereport(ERROR, (errmsg("failed to apply concurrent DELETE")));
+
+ pgstat_progress_incr_param(PROGRESS_CLUSTER_HEAP_TUPLES_DELETED, 1);
+}
+
+/*
+ * Find the tuple to be updated or deleted.
+ *
+ * 'key' is a pre-initialized scan key, into which the function will put the
+ * key values.
+ *
+ * 'tup_key' is a tuple containing the key values for the scan.
+ *
+ * On exit, '*scan_p' contains the scan descriptor used. The caller must close
+ * it when it no longer needs the returned tuple.
+ */
+static HeapTuple
+find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
+ Snapshot snapshot, IndexInsertState *iistate,
+ TupleTableSlot *ident_slot, IndexScanDesc *scan_p)
+{
+ IndexScanDesc scan;
+ Form_pg_index ident_form;
+ int2vector *ident_indkey;
+ HeapTuple result = NULL;
+
+ scan = index_beginscan(rel, iistate->ident_index, snapshot,
+ nkeys, 0);
+ *scan_p = scan;
+ index_rescan(scan, key, nkeys, NULL, 0);
+
+ /* Info needed to retrieve key values from heap tuple. */
+ ident_form = iistate->ident_index->rd_index;
+ ident_indkey = &ident_form->indkey;
+
+ /* Use the incoming tuple to finalize the scan key. */
+ for (int i = 0; i < scan->numberOfKeys; i++)
+ {
+ ScanKey entry;
+ bool isnull;
+ int16 attno_heap;
+
+ entry = &scan->keyData[i];
+ attno_heap = ident_indkey->values[i];
+ entry->sk_argument = heap_getattr(tup_key,
+ attno_heap,
+ rel->rd_att,
+ &isnull);
+ Assert(!isnull);
+ }
+ if (index_getnext_slot(scan, ForwardScanDirection, ident_slot))
+ {
+ bool shouldFree;
+
+ result = ExecFetchSlotHeapTuple(ident_slot, false, &shouldFree);
+ /* TTSOpsBufferHeapTuple has .get_heap_tuple != NULL. */
+ Assert(!shouldFree);
+ }
+
+ return result;
+}
+
+/*
+ * Decode and apply concurrent changes.
+ *
+ * Pass rel_src iff its reltoastrelid is needed.
+ */
+static void
+process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
+ Relation rel_dst, Relation rel_src, ScanKey ident_key,
+ int ident_key_nentries, IndexInsertState *iistate)
+{
+ ClusterDecodingState *dstate;
+
+ pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
+ PROGRESS_CLUSTER_PHASE_CATCH_UP);
+
+ dstate = (ClusterDecodingState *) ctx->output_writer_private;
+
+ cluster_decode_concurrent_changes(ctx, end_of_wal);
+
+ if (dstate->nchanges == 0)
+ return;
+
+ PG_TRY();
+ {
+ /*
+ * Make sure that TOAST values can eventually be accessed via the old
+ * relation - see comment in copy_table_data().
+ */
+ if (rel_src)
+ rel_dst->rd_toastoid = rel_src->rd_rel->reltoastrelid;
+
+ apply_concurrent_changes(dstate, rel_dst, ident_key,
+ ident_key_nentries, iistate);
+ }
+ PG_FINALLY();
+ {
+ ResetClusterCurrentXids();
+
+ if (rel_src)
+ rel_dst->rd_toastoid = InvalidOid;
+ }
+ PG_END_TRY();
+}
+
+static IndexInsertState *
+get_index_insert_state(Relation relation, Oid ident_index_id)
+{
+ EState *estate;
+ int i;
+ IndexInsertState *result;
+
+ result = (IndexInsertState *) palloc0(sizeof(IndexInsertState));
+ estate = CreateExecutorState();
+ result->econtext = GetPerTupleExprContext(estate);
+
+ result->rri = (ResultRelInfo *) palloc(sizeof(ResultRelInfo));
+ InitResultRelInfo(result->rri, relation, 0, 0, 0);
+ ExecOpenIndices(result->rri, false);
+
+ /*
+ * Find the relcache entry of the identity index so that we don't spend
+ * extra effort opening / closing it.
+ */
+ for (i = 0; i < result->rri->ri_NumIndices; i++)
+ {
+ Relation ind_rel;
+
+ ind_rel = result->rri->ri_IndexRelationDescs[i];
+ if (ind_rel->rd_id == ident_index_id)
+ result->ident_index = ind_rel;
+ }
+ if (result->ident_index == NULL)
+ elog(ERROR, "failed to open identity index");
+
+ /* Only initialize fields needed by ExecInsertIndexTuples(). */
+ result->estate = estate;
+
+ return result;
+}
+
+/*
+ * Build scan key to process logical changes.
+ */
+static ScanKey
+build_identity_key(Oid ident_idx_oid, Relation rel_src, int *nentries)
+{
+ Relation ident_idx_rel;
+ Form_pg_index ident_idx;
+ int n,
+ i;
+ ScanKey result;
+
+ Assert(OidIsValid(ident_idx_oid));
+ ident_idx_rel = index_open(ident_idx_oid, AccessShareLock);
+ ident_idx = ident_idx_rel->rd_index;
+ n = ident_idx->indnatts;
+ result = (ScanKey) palloc(sizeof(ScanKeyData) * n);
+ for (i = 0; i < n; i++)
+ {
+ ScanKey entry;
+ int16 relattno;
+ Form_pg_attribute att;
+ Oid opfamily,
+ opcintype,
+ opno,
+ opcode;
+
+ entry = &result[i];
+ relattno = ident_idx->indkey.values[i];
+ if (relattno >= 1)
+ {
+ TupleDesc desc;
+
+ desc = rel_src->rd_att;
+ att = TupleDescAttr(desc, relattno - 1);
+ }
+ else
+ elog(ERROR, "unexpected attribute number %d in index", relattno);
+
+ opfamily = ident_idx_rel->rd_opfamily[i];
+ opcintype = ident_idx_rel->rd_opcintype[i];
+ opno = get_opfamily_member(opfamily, opcintype, opcintype,
+ BTEqualStrategyNumber);
+
+ if (!OidIsValid(opno))
+ elog(ERROR, "failed to find = operator for type %u", opcintype);
+
+ opcode = get_opcode(opno);
+ if (!OidIsValid(opcode))
+ elog(ERROR, "failed to find function for operator %u", opno);
+
+ /* Initialize everything but argument. */
+ ScanKeyInit(entry,
+ i + 1,
+ BTEqualStrategyNumber, opcode,
+ (Datum) NULL);
+ entry->sk_collation = att->attcollation;
+ }
+ index_close(ident_idx_rel, AccessShareLock);
+
+ *nentries = n;
+ return result;
+}
+
+static void
+free_index_insert_state(IndexInsertState *iistate)
+{
+ ExecCloseIndices(iistate->rri);
+ FreeExecutorState(iistate->estate);
+ pfree(iistate->rri);
+ pfree(iistate);
+}
+
+static void
+cleanup_logical_decoding(LogicalDecodingContext *ctx)
+{
+ ClusterDecodingState *dstate;
+
+ dstate = (ClusterDecodingState *) ctx->output_writer_private;
+
+ ExecDropSingleTupleTableSlot(dstate->tsslot);
+ FreeTupleDesc(dstate->tupdesc_change);
+ FreeTupleDesc(dstate->tupdesc);
+ tuplestore_end(dstate->tstore);
+
+ FreeDecodingContext(ctx);
+}
+
+/*
+ * The final steps of rebuild_relation() for concurrent processing.
+ *
+ * On entry, NewHeap is locked in AccessExclusiveLock mode. OldHeap and its
+ * clustering index (if one is passed) are still locked in a mode that allows
+ * concurrent data changes. On exit, both tables and their indexes are closed
+ * but remain locked in AccessExclusiveLock mode.
+ */
+static void
+rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
+ Relation cl_index,
+ CatalogState *cat_state,
+ LogicalDecodingContext *ctx,
+ bool swap_toast_by_content,
+ TransactionId frozenXid,
+ MultiXactId cutoffMulti)
+{
+ LOCKMODE lmode_old;
+ List *ind_oids_new;
+ Oid old_table_oid = RelationGetRelid(OldHeap);
+ Oid new_table_oid = RelationGetRelid(NewHeap);
+ List *ind_oids_old = RelationGetIndexList(OldHeap);
+ ListCell *lc, *lc2;
+ char relpersistence;
+ bool is_system_catalog;
+ Oid ident_idx_old, ident_idx_new;
+ IndexInsertState *iistate;
+ ScanKey ident_key;
+ int ident_key_nentries;
+ XLogRecPtr wal_insert_ptr, end_of_wal;
+ char dummy_rec_data = '\0';
+ RelReopenInfo *rri = NULL;
+ int nrel;
+ Relation *ind_refs_all, *ind_refs_p;
+
+ /* Like in cluster_rel(). */
+ lmode_old = LOCK_CLUSTER_CONCURRENT;
+ Assert(CheckRelationLockedByMe(OldHeap, lmode_old, false));
+ Assert(cl_index == NULL ||
+ CheckRelationLockedByMe(cl_index, lmode_old, false));
+ /* This is expected from the caller. */
+ Assert(CheckRelationLockedByMe(NewHeap, AccessExclusiveLock, false));
+
+ ident_idx_old = RelationGetReplicaIndex(OldHeap);
+
+ /*
+ * Unlike the exclusive case, we build new indexes for the new relation
+ * rather than swapping the storage and reindexing the old relation. The
+ * point is that the index build can take some time, so we do it before we
+ * get AccessExclusiveLock on the old heap and therefore we cannot swap
+ * the heap storage yet.
+ *
+ * index_create() will lock the new indexes with AccessExclusiveLock
+ * during creation - no need to change that.
+ */
+ ind_oids_new = build_new_indexes(NewHeap, OldHeap, ind_oids_old);
+
+ /*
+ * Processing shouldn't start without a valid identity index.
+ */
+ Assert(OidIsValid(ident_idx_old));
+
+ /* Find "identity index" on the new relation. */
+ ident_idx_new = InvalidOid;
+ forboth(lc, ind_oids_old, lc2, ind_oids_new)
+ {
+ Oid ind_old = lfirst_oid(lc);
+ Oid ind_new = lfirst_oid(lc2);
+
+ if (ident_idx_old == ind_old)
+ {
+ ident_idx_new = ind_new;
+ break;
+ }
+ }
+ if (!OidIsValid(ident_idx_new))
+ /*
+ * Should not happen, given our lock on the old relation.
+ */
+ ereport(ERROR,
+ (errmsg("identity index missing on the new relation")));
+
+ /* Executor state to update indexes. */
+ iistate = get_index_insert_state(NewHeap, ident_idx_new);
+
+ /*
+ * Build scan key that we'll use to look for rows to be updated / deleted
+ * during logical decoding.
+ */
+ ident_key = build_identity_key(ident_idx_new, OldHeap, &ident_key_nentries);
+
+ /*
+ * Flush all WAL records inserted so far (possibly except for the last
+ * incomplete page, see GetInsertRecPtr), to minimize the amount of data
+ * we need to flush while holding exclusive lock on the source table.
+ */
+ wal_insert_ptr = GetInsertRecPtr();
+ XLogFlush(wal_insert_ptr);
+ end_of_wal = GetFlushRecPtr(NULL);
+
+ /*
+ * Apply concurrent changes first time, to minimize the time we need to
+ * hold AccessExclusiveLock. (Quite some amount of WAL could have been
+ * written during the data copying and index creation.)
+ */
+ process_concurrent_changes(ctx, end_of_wal, NewHeap,
+ swap_toast_by_content ? OldHeap : NULL,
+ ident_key, ident_key_nentries, iistate);
+
+ /*
+ * Release the locks that allowed concurrent data changes, in order to
+ * acquire the AccessExclusiveLock.
+ */
+ nrel = 0;
+ /*
+ * We unlock the old relation (and its clustering index), but then we will
+ * lock the relation and *all* its indexes because we want to swap their
+ * storage.
+ *
+ * (NewHeap is already locked, as well as its indexes.)
+ */
+ rri = palloc_array(RelReopenInfo, 1 + list_length(ind_oids_old));
+ init_rel_reopen_info(&rri[nrel++], &OldHeap, InvalidOid,
+ LOCK_CLUSTER_CONCURRENT, AccessExclusiveLock);
+ /* References to the re-opened indexes will be stored in this array. */
+ ind_refs_all = palloc_array(Relation, list_length(ind_oids_old));
+ ind_refs_p = ind_refs_all;
+ /* The clustering index is a special case. */
+ if (cl_index)
+ {
+ *ind_refs_p = cl_index;
+ init_rel_reopen_info(&rri[nrel], ind_refs_p, InvalidOid,
+ LOCK_CLUSTER_CONCURRENT, AccessExclusiveLock);
+ nrel++;
+ ind_refs_p++;
+ }
+ /*
+ * Initialize also the entries for the other indexes (currently unlocked)
+ * because we will have to lock them.
+ */
+ foreach(lc, ind_oids_old)
+ {
+ Oid ind_oid;
+
+ ind_oid = lfirst_oid(lc);
+ /* Clustering index is already in the array, or there is none. */
+ if (cl_index && RelationGetRelid(cl_index) == ind_oid)
+ continue;
+
+ Assert(nrel < (1 + list_length(ind_oids_old)));
+
+ *ind_refs_p = NULL;
+ init_rel_reopen_info(&rri[nrel],
+ /*
+ * In this special case we do not have the
+ * relcache reference, use OID instead.
+ */
+ ind_refs_p,
+ ind_oid,
+ NoLock, /* Nothing to unlock. */
+ AccessExclusiveLock);
+
+ nrel++;
+ ind_refs_p++;
+ }
+ /* Perform the actual unlocking and re-locking. */
+ unlock_and_close_relations(rri, nrel);
+ reopen_relations(rri, nrel);
+
+ /*
+ * In addition, lock the OldHeap's TOAST relation that we skipped for the
+ * CONCURRENTLY option in copy_table_data(). This lock will be needed to
+ * swap the relation files.
+ */
+ if (OidIsValid(OldHeap->rd_rel->reltoastrelid))
+ LockRelationOid(OldHeap->rd_rel->reltoastrelid, AccessExclusiveLock);
+
+ /*
+ * Check if the new indexes match the old ones, i.e. no changes occurred
+ * while OldHeap was unlocked.
+ *
+ * XXX It's probably not necessary to check the relation tuple descriptor
+ * here because the logical decoding was already active when we released
+ * the lock, and thus the corresponding data changes won't be lost.
+ * However, processing those changes might take a lot of time.
+ */
+ check_catalog_changes(OldHeap, cat_state);
+
+ /*
+ * Tuples and pages of the old heap will be gone, but the heap will stay.
+ */
+ TransferPredicateLocksToHeapRelation(OldHeap);
+ /* The same for indexes. */
+ for (int i = 0; i < (nrel - 1); i++)
+ {
+ Relation index = ind_refs_all[i];
+
+ TransferPredicateLocksToHeapRelation(index);
+
+ /*
+ * References to indexes on the old relation are not needed anymore,
+ * however locks stay till the end of the transaction.
+ */
+ index_close(index, NoLock);
+ }
+ pfree(ind_refs_all);
+
+ /*
+ * Flush anything we see in WAL, to make sure that all changes committed
+ * while we were waiting for the exclusive lock are available for
+ * decoding. This should not be necessary if all backends had
+ * synchronous_commit set, but we can't rely on this setting.
+ *
+ * Unfortunately, GetInsertRecPtr() may lag behind the actual insert
+ * position, and GetLastImportantRecPtr() points at the start of the last
+ * record rather than at the end. Thus the simplest way to determine the
+ * insert position is to insert a dummy record and use its LSN.
+ *
+ * XXX Consider using GetLastImportantRecPtr() and adding the size of the
+ * last record (plus the total size of all the page headers the record
+ * spans)?
+ */
+ XLogBeginInsert();
+ XLogRegisterData(&dummy_rec_data, 1);
+ wal_insert_ptr = XLogInsert(RM_XLOG_ID, XLOG_NOOP);
+ XLogFlush(wal_insert_ptr);
+ end_of_wal = GetFlushRecPtr(NULL);
+
+ /* Apply the concurrent changes again. */
+ process_concurrent_changes(ctx, end_of_wal, NewHeap,
+ swap_toast_by_content ? OldHeap : NULL,
+ ident_key, ident_key_nentries, iistate);
+
+ /* Remember info about rel before closing OldHeap */
+ relpersistence = OldHeap->rd_rel->relpersistence;
+ is_system_catalog = IsSystemRelation(OldHeap);
+
+ pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
+ PROGRESS_CLUSTER_PHASE_SWAP_REL_FILES);
+
+ forboth(lc, ind_oids_old, lc2, ind_oids_new)
+ {
+ Oid ind_old = lfirst_oid(lc);
+ Oid ind_new = lfirst_oid(lc2);
+ Oid mapped_tables[4];
+
+ /* Zero out possible results from swap_relation_files(). */
+ memset(mapped_tables, 0, sizeof(mapped_tables));
+
+ swap_relation_files(ind_old, ind_new,
+ (old_table_oid == RelationRelationId),
+ swap_toast_by_content,
+ true,
+ InvalidTransactionId,
+ InvalidMultiXactId,
+ mapped_tables);
+
+#ifdef USE_ASSERT_CHECKING
+ /*
+ * Concurrent processing is not supported for system relations, so
+ * there should be no mapped tables.
+ */
+ for (int i = 0; i < 4; i++)
+ Assert(mapped_tables[i] == 0);
+#endif
+ }
+
+ /* The new indexes must be visible for deletion. */
+ CommandCounterIncrement();
+
+ /* Close the old heap but keep lock until transaction commit. */
+ table_close(OldHeap, NoLock);
+ /* Close the new heap. (We didn't have to open its indexes.) */
+ table_close(NewHeap, NoLock);
+
+ /* Cleanup what we don't need anymore. (And close the identity index.) */
+ pfree(ident_key);
+ free_index_insert_state(iistate);
+
+ /*
+ * Swap the relations and their TOAST relations and TOAST indexes. This
+ * also drops the new relation and its indexes.
+ *
+ * (System catalogs are currently not supported.)
+ */
+ Assert(!is_system_catalog);
+ finish_heap_swap(old_table_oid, new_table_oid,
+ is_system_catalog,
+ swap_toast_by_content,
+ false, true, false,
+ frozenXid, cutoffMulti,
+ relpersistence);
+
+ pfree(rri);
+}
+
+/*
+ * Build indexes on NewHeap according to those on OldHeap.
+ *
+ * OldIndexes is the list of index OIDs on OldHeap.
+ *
+ * A list of OIDs of the corresponding indexes created on NewHeap is
+ * returned. The order of items matches, so we can use these lists to swap
+ * index storage.
+ */
+static List *
+build_new_indexes(Relation NewHeap, Relation OldHeap, List *OldIndexes)
+{
+ StringInfo ind_name;
+ ListCell *lc;
+ List *result = NIL;
+
+ pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
+ PROGRESS_CLUSTER_PHASE_REBUILD_INDEX);
+
+ ind_name = makeStringInfo();
+
+ foreach(lc, OldIndexes)
+ {
+ Oid ind_oid,
+ ind_oid_new,
+ tbsp_oid;
+ Relation ind;
+ IndexInfo *ind_info;
+ int i,
+ heap_col_id;
+ List *colnames;
+ int16 indnatts;
+ Oid *collations,
+ *opclasses;
+ HeapTuple tup;
+ bool isnull;
+ Datum d;
+ oidvector *oidvec;
+ int2vector *int2vec;
+ size_t oid_arr_size;
+ size_t int2_arr_size;
+ int16 *indoptions;
+ text *reloptions = NULL;
+ bits16 flags;
+ Datum *opclassOptions;
+ NullableDatum *stattargets;
+
+ ind_oid = lfirst_oid(lc);
+ ind = index_open(ind_oid, AccessShareLock);
+ ind_info = BuildIndexInfo(ind);
+
+ tbsp_oid = ind->rd_rel->reltablespace;
+ /*
+ * Index names really don't matter since we'll eventually use only their
+ * storage. Just make them unique within the table.
+ */
+ resetStringInfo(ind_name);
+ appendStringInfo(ind_name, "ind_%d",
+ list_cell_number(OldIndexes, lc));
+
+ flags = 0;
+ if (ind->rd_index->indisprimary)
+ flags |= INDEX_CREATE_IS_PRIMARY;
+
+ colnames = NIL;
+ indnatts = ind->rd_index->indnatts;
+ oid_arr_size = sizeof(Oid) * indnatts;
+ int2_arr_size = sizeof(int16) * indnatts;
+
+ collations = (Oid *) palloc(oid_arr_size);
+ for (i = 0; i < indnatts; i++)
+ {
+ char *colname;
+
+ heap_col_id = ind->rd_index->indkey.values[i];
+ if (heap_col_id > 0)
+ {
+ Form_pg_attribute att;
+
+ /* Normal attribute. */
+ att = TupleDescAttr(OldHeap->rd_att, heap_col_id - 1);
+ colname = pstrdup(NameStr(att->attname));
+ collations[i] = att->attcollation;
+ }
+ else if (heap_col_id == 0)
+ {
+ HeapTuple tuple;
+ Form_pg_attribute att;
+
+ /*
+ * An expression column is not present in the relcache. What we need
+ * here is an attribute of the *index* relation.
+ */
+ tuple = SearchSysCache2(ATTNUM,
+ ObjectIdGetDatum(ind_oid),
+ Int16GetDatum(i + 1));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR,
+ "cache lookup failed for attribute %d of relation %u",
+ i + 1, ind_oid);
+ att = (Form_pg_attribute) GETSTRUCT(tuple);
+ colname = pstrdup(NameStr(att->attname));
+ collations[i] = att->attcollation;
+ ReleaseSysCache(tuple);
+ }
+ else
+ elog(ERROR, "unexpected column number: %d",
+ heap_col_id);
+
+ colnames = lappend(colnames, colname);
+ }
+
+ /*
+ * Special effort needed for variable length attributes of
+ * Form_pg_index.
+ */
+ tup = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(ind_oid));
+ if (!HeapTupleIsValid(tup))
+ elog(ERROR, "cache lookup failed for index %u", ind_oid);
+ d = SysCacheGetAttr(INDEXRELID, tup, Anum_pg_index_indclass, &isnull);
+ Assert(!isnull);
+ oidvec = (oidvector *) DatumGetPointer(d);
+ opclasses = (Oid *) palloc(oid_arr_size);
+ memcpy(opclasses, oidvec->values, oid_arr_size);
+
+ d = SysCacheGetAttr(INDEXRELID, tup, Anum_pg_index_indoption,
+ &isnull);
+ Assert(!isnull);
+ int2vec = (int2vector *) DatumGetPointer(d);
+ indoptions = (int16 *) palloc(int2_arr_size);
+ memcpy(indoptions, int2vec->values, int2_arr_size);
+ ReleaseSysCache(tup);
+
+ tup = SearchSysCache1(RELOID, ObjectIdGetDatum(ind_oid));
+ if (!HeapTupleIsValid(tup))
+ elog(ERROR, "cache lookup failed for index relation %u", ind_oid);
+ d = SysCacheGetAttr(RELOID, tup, Anum_pg_class_reloptions, &isnull);
+ reloptions = !isnull ? DatumGetTextPCopy(d) : NULL;
+ ReleaseSysCache(tup);
+
+ opclassOptions = palloc0(sizeof(Datum) * ind_info->ii_NumIndexAttrs);
+ for (i = 0; i < ind_info->ii_NumIndexAttrs; i++)
+ opclassOptions[i] = get_attoptions(ind_oid, i + 1);
+
+ stattargets = get_index_stattargets(ind_oid, ind_info);
+
+ /*
+ * Neither parentIndexRelid nor parentConstraintId needs to be passed
+ * since the new catalog entries (pg_constraint, pg_inherits) would
+ * eventually be dropped. Therefore there's no need to record valid
+ * dependency on parents.
+ */
+ ind_oid_new = index_create(NewHeap,
+ ind_name->data,
+ InvalidOid,
+ InvalidOid, /* parentIndexRelid */
+ InvalidOid, /* parentConstraintId */
+ InvalidOid,
+ ind_info,
+ colnames,
+ ind->rd_rel->relam,
+ tbsp_oid,
+ collations,
+ opclasses,
+ opclassOptions,
+ indoptions,
+ stattargets,
+ PointerGetDatum(reloptions),
+ flags, /* flags */
+ 0, /* constr_flags */
+ false, /* allow_system_table_mods */
+ false, /* is_internal */
+ NULL /* constraintId */
+ );
+ result = lappend_oid(result, ind_oid_new);
+
+ index_close(ind, AccessShareLock);
+ list_free_deep(colnames);
+ pfree(collations);
+ pfree(opclasses);
+ pfree(indoptions);
+ if (reloptions)
+ pfree(reloptions);
+ }
+
+ return result;
+}
+
+static void
+init_rel_reopen_info(RelReopenInfo *rri, Relation *rel_p, Oid relid,
+ LOCKMODE lockmode_orig, LOCKMODE lockmode_new)
+{
+ rri->rel_p = rel_p;
+ rri->relid = relid;
+ rri->lockmode_orig = lockmode_orig;
+ rri->lockmode_new = lockmode_new;
+}
+
+/*
+ * Unlock and close the relations specified by the items of the 'rels' array.
+ * 'nrel' is the number of items.
+ *
+ * Information needed to (re)open the relations (or to issue a meaningful
+ * ERROR) is added to the array items.
+ */
+static void
+unlock_and_close_relations(RelReopenInfo *rels, int nrel)
+{
+ int i;
+ RelReopenInfo *rri;
+
+ /*
+ * First, retrieve the information that we will need for re-opening.
+ *
+ * We could close (and unlock) each relation as soon as we have gathered
+ * the related information, but then we would have to be careful not to
+ * unlock the table until we have the info on all its indexes. (Once we
+ * unlock the table, any index can be dropped, and thus we can fail to get
+ * the name we want to report if re-opening fails.) It seems simpler to
+ * separate the work into two iterations.
+ */
+ for (i = 0; i < nrel; i++)
+ {
+ Relation rel;
+
+ rri = &rels[i];
+ rel = *rri->rel_p;
+
+ if (rel)
+ {
+ Assert(CheckRelationLockedByMe(rel, rri->lockmode_orig, false));
+ Assert(!OidIsValid(rri->relid));
+
+ rri->relid = RelationGetRelid(rel);
+ rri->relkind = rel->rd_rel->relkind;
+ rri->relname = pstrdup(RelationGetRelationName(rel));
+ }
+ else
+ {
+ Assert(OidIsValid(rri->relid));
+
+ rri->relname = get_rel_name(rri->relid);
+ rri->relkind = get_rel_relkind(rri->relid);
+ }
+ }
+
+ /* Second, close the relations. */
+ for (i = 0; i < nrel; i++)
+ {
+ Relation rel;
+
+ rri = &rels[i];
+ rel = *rri->rel_p;
+
+ /* Close the relation if the caller passed one. */
+ if (rel)
+ {
+ if (rri->relkind == RELKIND_RELATION)
+ table_close(rel, rri->lockmode_orig);
+ else
+ {
+ Assert(rri->relkind == RELKIND_INDEX);
+
+ index_close(rel, rri->lockmode_orig);
+ }
+ }
+ }
+}
+
+/*
+ * Re-open the relations closed previously by unlock_and_close_relations().
+ */
+static void
+reopen_relations(RelReopenInfo *rels, int nrel)
+{
+ for (int i = 0; i < nrel; i++)
+ {
+ RelReopenInfo *rri = &rels[i];
+ Relation rel;
+
+ if (rri->relkind == RELKIND_RELATION)
+ {
+ rel = try_table_open(rri->relid, rri->lockmode_new);
+ }
+ else
+ {
+ Assert(rri->relkind == RELKIND_INDEX);
+
+ rel = try_index_open(rri->relid, rri->lockmode_new);
+ }
+
+ if (rel == NULL)
+ {
+ const char *kind_str;
+
+ kind_str = (rri->relkind == RELKIND_RELATION) ? "table" : "index";
+ ereport(ERROR,
+ (errmsg("could not open %s \"%s\"", kind_str,
+ rri->relname),
+ errhint("The %s could have been dropped by another transaction.",
+ kind_str)));
+ }
+ *rri->rel_p = rel;
+
+ pfree(rri->relname);
+ }
+}
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 488ca950d9..af1945e1ed 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -873,7 +873,7 @@ refresh_by_match_merge(Oid matviewOid, Oid tempOid, Oid relowner,
static void
refresh_by_heap_swap(Oid matviewOid, Oid OIDNewHeap, char relpersistence)
{
- finish_heap_swap(matviewOid, OIDNewHeap, false, false, true, true,
+ finish_heap_swap(matviewOid, OIDNewHeap, false, false, true, true, true,
RecentXmin, ReadNextMultiXactId(), relpersistence);
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 5d6151dad1..13f32ede92 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4395,6 +4395,16 @@ AlterTableInternal(Oid relid, List *cmds, bool recurse)
rel = relation_open(relid, lockmode);
+ /*
+ * If lockmode allows, check if VACUUM FULL / CLUSTER CONCURRENTLY is in
+ * progress. If lockmode is too weak, cluster_rel() should detect
+ * incompatible DDLs executed by us.
+ *
+ * XXX We might skip the changes for DDLs which do not change the tuple
+ * descriptor.
+ */
+ check_for_concurrent_cluster(relid, lockmode);
+
EventTriggerAlterTableRelid(relid);
ATController(NULL, rel, cmds, recurse, lockmode, NULL);
@@ -5861,6 +5871,7 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
finish_heap_swap(tab->relid, OIDNewHeap,
false, false, true,
!OidIsValid(tab->newTableSpace),
+ true,
RecentXmin,
ReadNextMultiXactId(),
persistence);
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 0bd000acc5..529c46c186 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -112,7 +112,8 @@ static void vac_truncate_clog(TransactionId frozenXID,
TransactionId lastSaneFrozenXid,
MultiXactId lastSaneMinMulti);
static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
- BufferAccessStrategy bstrategy);
+ BufferAccessStrategy bstrategy, bool isTopLevel,
+ bool whole_database);
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
@@ -153,6 +154,7 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
bool analyze = false;
bool freeze = false;
bool full = false;
+ bool concurrent = false;
bool disable_page_skipping = false;
bool process_main = true;
bool process_toast = true;
@@ -226,6 +228,8 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
freeze = defGetBoolean(opt);
else if (strcmp(opt->defname, "full") == 0)
full = defGetBoolean(opt);
+ else if (strcmp(opt->defname, "concurrently") == 0)
+ concurrent = defGetBoolean(opt);
else if (strcmp(opt->defname, "disable_page_skipping") == 0)
disable_page_skipping = defGetBoolean(opt);
else if (strcmp(opt->defname, "index_cleanup") == 0)
@@ -300,7 +304,7 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
(skip_locked ? VACOPT_SKIP_LOCKED : 0) |
(analyze ? VACOPT_ANALYZE : 0) |
(freeze ? VACOPT_FREEZE : 0) |
- (full ? VACOPT_FULL : 0) |
+ (full ? (concurrent ? VACOPT_FULL_CONCURRENT : VACOPT_FULL_EXCLUSIVE) : 0) |
(disable_page_skipping ? VACOPT_DISABLE_PAGE_SKIPPING : 0) |
(process_main ? VACOPT_PROCESS_MAIN : 0) |
(process_toast ? VACOPT_PROCESS_TOAST : 0) |
@@ -380,6 +384,12 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
errmsg("ONLY_DATABASE_STATS cannot be specified with other VACUUM options")));
}
+ /* This problem cannot be identified from the options. */
+ if (concurrent && !full)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("CONCURRENTLY can only be specified with VACUUM FULL")));
+
/*
* All freeze ages are zero if the FREEZE option is given; otherwise pass
* them as -1 which means to use the default values.
@@ -483,6 +493,7 @@ vacuum(List *relations, VacuumParams *params, BufferAccessStrategy bstrategy,
const char *stmttype;
volatile bool in_outer_xact,
use_own_xacts;
+ bool whole_database = false;
Assert(params != NULL);
@@ -543,7 +554,15 @@ vacuum(List *relations, VacuumParams *params, BufferAccessStrategy bstrategy,
relations = newrels;
}
else
+ {
relations = get_all_vacuum_rels(vac_context, params->options);
+ /*
+ * If all tables should be processed, the CONCURRENTLY option implies
+ * that we should skip system relations rather than raising ERRORs.
+ */
+ if (params->options & VACOPT_FULL_CONCURRENT)
+ whole_database = true;
+ }
/*
* Decide whether we need to start/commit our own transactions.
@@ -619,7 +638,8 @@ vacuum(List *relations, VacuumParams *params, BufferAccessStrategy bstrategy,
if (params->options & VACOPT_VACUUM)
{
- if (!vacuum_rel(vrel->oid, vrel->relation, params, bstrategy))
+ if (!vacuum_rel(vrel->oid, vrel->relation, params, bstrategy,
+ isTopLevel, whole_database))
continue;
}
@@ -1932,10 +1952,14 @@ vac_truncate_clog(TransactionId frozenXID,
/*
* vacuum_rel() -- vacuum one heap relation
*
- * relid identifies the relation to vacuum. If relation is supplied,
- * use the name therein for reporting any failure to open/lock the rel;
- * do not use it once we've successfully opened the rel, since it might
- * be stale.
+ * relid identifies the relation to vacuum. If relation is supplied, use
+ * the name therein for reporting any failure to open/lock the rel; do
+ * not use it once we've successfully opened the rel, since it might be
+ * stale.
+ *
+ * If whole_database is true, we are processing all the relations of the
+ * current database. In that case we might need to silently skip
+ * relations which could otherwise cause ERROR.
*
* Returns true if it's okay to proceed with a requested ANALYZE
* operation on this table.
@@ -1950,7 +1974,8 @@ vac_truncate_clog(TransactionId frozenXID,
*/
static bool
vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
- BufferAccessStrategy bstrategy)
+ BufferAccessStrategy bstrategy, bool isTopLevel,
+ bool whole_database)
{
LOCKMODE lmode;
Relation rel;
@@ -2013,10 +2038,11 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
/*
* Determine the type of lock we want --- hard exclusive lock for a FULL
- * vacuum, but just ShareUpdateExclusiveLock for concurrent vacuum. Either
- * way, we can be sure that no other backend is vacuuming the same table.
+ * exclusive vacuum, but a weaker lock (ShareUpdateExclusiveLock) for
+ * concurrent vacuum. Either way, we can be sure that no other backend is
+ * vacuuming the same table.
*/
- lmode = (params->options & VACOPT_FULL) ?
+ lmode = (params->options & VACOPT_FULL_EXCLUSIVE) ?
AccessExclusiveLock : ShareUpdateExclusiveLock;
/* open the relation and get the appropriate lock on it */
@@ -2031,6 +2057,39 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
return false;
}
+ /*
+ * Bail out if the CONCURRENTLY option was passed but the relation is not
+ * suitable for concurrent processing. Note that we only skip such
+ * relations when the user wants to vacuum the whole database. In
+ * contrast, if the user specified inappropriate relation(s) explicitly,
+ * the command ends up in ERROR.
+ */
+ if (whole_database && (params->options & VACOPT_FULL_CONCURRENT) &&
+ !check_relation_is_clusterable_concurrently(rel, DEBUG1,
+ "VACUUM (FULL, CONCURRENTLY)"))
+ {
+ relation_close(rel, lmode);
+ PopActiveSnapshot();
+ CommitTransactionCommand();
+ return false;
+ }
+
+ /*
+ * Skip the relation if VACUUM FULL / CLUSTER CONCURRENTLY is in progress
+ * as it will drop the current storage of the relation.
+ *
+ * This check should not take place until we have a lock that prevents
+ * another backend from starting VACUUM FULL / CLUSTER CONCURRENTLY later.
+ */
+ Assert(lmode >= LOCK_CLUSTER_CONCURRENT);
+ if (is_concurrent_cluster_in_progress(relid))
+ {
+ relation_close(rel, lmode);
+ PopActiveSnapshot();
+ CommitTransactionCommand();
+ return false;
+ }
+
/*
* When recursing to a TOAST table, check privileges on the parent. NB:
* This is only safe to do because we hold a session lock on the main
@@ -2104,19 +2163,6 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
return true;
}
- /*
- * Get a session-level lock too. This will protect our access to the
- * relation across multiple transactions, so that we can vacuum the
- * relation's TOAST table (if any) secure in the knowledge that no one is
- * deleting the parent relation.
- *
- * NOTE: this cannot block, even if someone else is waiting for access,
- * because the lock manager knows that both lock requests are from the
- * same process.
- */
- lockrelid = rel->rd_lockInfo.lockRelId;
- LockRelationIdForSession(&lockrelid, lmode);
-
/*
* Set index_cleanup option based on index_cleanup reloption if it wasn't
* specified in VACUUM command, or when running in an autovacuum worker
@@ -2169,6 +2215,30 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
else
toast_relid = InvalidOid;
+ /*
+ * Get a session-level lock too. This will protect our access to the
+ * relation across multiple transactions, so that we can vacuum the
+ * relation's TOAST table (if any) secure in the knowledge that no one is
+ * deleting the parent relation.
+ *
+ * NOTE: this cannot block, even if someone else is waiting for access,
+ * because the lock manager knows that both lock requests are from the
+ * same process.
+ */
+ if (OidIsValid(toast_relid))
+ {
+ /*
+ * You might worry that, in the VACUUM (FULL, CONCURRENTLY) case,
+ * cluster_rel() needs to release all the locks on the relation at
+ * some point, but this session lock makes it impossible. In fact,
+ * cluster_rel() will eventually be called for the TOAST relation
+ * and raise ERROR because, in the concurrent mode, it cannot process
+ * TOAST relation alone anyway.
+ */
+ lockrelid = rel->rd_lockInfo.lockRelId;
+ LockRelationIdForSession(&lockrelid, lmode);
+ }
+
/*
* Switch to the table owner's userid, so that any index functions are run
* as that user. Also lock down security-restricted operations and
@@ -2196,11 +2266,22 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
{
ClusterParams cluster_params = {0};
+ /*
+ * Invalid toast_relid means that there is no session lock on the
+ * relation. Such a lock would be a problem because it would
+ * prevent cluster_rel() from releasing all locks when it tries to
+ * get AccessExclusiveLock.
+ */
+ Assert(!OidIsValid(toast_relid));
+
if ((params->options & VACOPT_VERBOSE) != 0)
cluster_params.options |= CLUOPT_VERBOSE;
+ if ((params->options & VACOPT_FULL_CONCURRENT) != 0)
+ cluster_params.options |= CLUOPT_CONCURRENT;
+
/* VACUUM FULL is now a variant of CLUSTER; see cluster.c */
- cluster_rel(rel, InvalidOid, &cluster_params);
+ cluster_rel(rel, InvalidOid, &cluster_params, isTopLevel);
/*
* cluster_rel() should have closed the relation, lock is kept
@@ -2249,13 +2330,15 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
toast_vacuum_params.options |= VACOPT_PROCESS_MAIN;
toast_vacuum_params.toast_parent = relid;
- vacuum_rel(toast_relid, NULL, &toast_vacuum_params, bstrategy);
+ vacuum_rel(toast_relid, NULL, &toast_vacuum_params, bstrategy,
+ isTopLevel, whole_database);
}
/*
* Now release the session-level lock on the main table.
*/
- UnlockRelationIdForSession(&lockrelid, lmode);
+ if (OidIsValid(toast_relid))
+ UnlockRelationIdForSession(&lockrelid, lmode);
/* Report that we really did it. */
return true;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index d687ceee33..066d96dea2 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -467,6 +467,57 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
TransactionId xid = XLogRecGetXid(buf->record);
SnapBuild *builder = ctx->snapshot_builder;
+ /*
+ * If the change is not intended for logical decoding, do not even
+ * establish transaction for it. This is particularly important if the
+ * record was generated by CLUSTER CONCURRENTLY because this command uses
+ * the original XID when doing changes in the new storage. The decoding
+ * subsystem probably does not expect to see the same transaction multiple
+ * times.
+ */
+ switch (info)
+ {
+ case XLOG_HEAP_INSERT:
+ {
+ xl_heap_insert *rec;
+
+ rec = (xl_heap_insert *) XLogRecGetData(buf->record);
+ /*
+ * Skip insert records without a new tuple (this does happen when
+ * raw_heap_insert marks the TOAST record as
+ * HEAP_INSERT_NO_LOGICAL).
+ */
+ if ((rec->flags & XLH_INSERT_CONTAINS_NEW_TUPLE) == 0)
+ return;
+
+ break;
+ }
+
+ case XLOG_HEAP_HOT_UPDATE:
+ case XLOG_HEAP_UPDATE:
+ {
+ xl_heap_update *rec;
+
+ rec = (xl_heap_update *) XLogRecGetData(buf->record);
+ if ((rec->flags &
+ (XLH_UPDATE_CONTAINS_NEW_TUPLE |
+ XLH_UPDATE_CONTAINS_OLD_TUPLE |
+ XLH_UPDATE_CONTAINS_OLD_KEY)) == 0)
+ return;
+
+ break;
+ }
+
+ case XLOG_HEAP_DELETE:
+ {
+ xl_heap_delete *rec;
+
+ rec = (xl_heap_delete *) XLogRecGetData(buf->record);
+ if (rec->flags & XLH_DELETE_NO_LOGICAL)
+ return;
+ break;
+ }
+ }
+
ReorderBufferProcessXid(ctx->reorder, xid, buf->origptr);
/*
@@ -903,13 +954,6 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
xlrec = (xl_heap_insert *) XLogRecGetData(r);
- /*
- * Ignore insert records without new tuples (this does happen when
- * raw_heap_insert marks the TOAST record as HEAP_INSERT_NO_LOGICAL).
- */
- if (!(xlrec->flags & XLH_INSERT_CONTAINS_NEW_TUPLE))
- return;
-
/* only interested in our database */
XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
if (target_locator.dbOid != ctx->slot->data.database)
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index e37e22f441..ed15a0b175 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -286,7 +286,7 @@ static bool ExportInProgress = false;
static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
/* snapshot building/manipulation/distribution functions */
-static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
+static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder, XLogRecPtr lsn);
static void SnapBuildFreeSnapshot(Snapshot snap);
@@ -481,12 +481,17 @@ SnapBuildSnapDecRefcount(Snapshot snap)
* Build a new snapshot, based on currently committed catalog-modifying
* transactions.
*
+ * 'lsn' is the location of the commit record (of a catalog-changing
+ * transaction) that triggered creation of the snapshot. Pass
+ * InvalidXLogRecPtr for the transaction base snapshot, or if the user of
+ * the snapshot does not need the LSN.
+ *
* In-progress transactions with catalog access are *not* allowed to modify
* these snapshots; they have to copy them and fill in appropriate ->curcid
* and ->subxip/subxcnt values.
*/
static Snapshot
-SnapBuildBuildSnapshot(SnapBuild *builder)
+SnapBuildBuildSnapshot(SnapBuild *builder, XLogRecPtr lsn)
{
Snapshot snapshot;
Size ssize;
@@ -554,6 +559,7 @@ SnapBuildBuildSnapshot(SnapBuild *builder)
snapshot->active_count = 0;
snapshot->regd_count = 0;
snapshot->snapXactCompletionCount = 0;
+ snapshot->lsn = lsn;
return snapshot;
}
@@ -569,10 +575,7 @@ Snapshot
SnapBuildInitialSnapshot(SnapBuild *builder)
{
Snapshot snap;
- TransactionId xid;
TransactionId safeXid;
- TransactionId *newxip;
- int newxcnt = 0;
Assert(XactIsoLevel == XACT_REPEATABLE_READ);
Assert(builder->building_full_snapshot);
@@ -593,7 +596,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
if (TransactionIdIsValid(MyProc->xmin))
elog(ERROR, "cannot build an initial slot snapshot when MyProc->xmin already is valid");
- snap = SnapBuildBuildSnapshot(builder);
+ snap = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
/*
* We know that snap->xmin is alive, enforced by the logical xmin
@@ -614,6 +617,47 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
MyProc->xmin = snap->xmin;
+ /* Convert the historic snapshot to MVCC snapshot. */
+ return SnapBuildMVCCFromHistoric(snap, true);
+}
+
+/*
+ * Build an MVCC snapshot for the initial data load performed by CLUSTER
+ * CONCURRENTLY command.
+ *
+ * The snapshot will only be used to scan one particular relation, which is
+ * treated like a catalog (therefore ->building_full_snapshot is not
+ * important), and the caller should already have a replication slot setup (so
+ * we do not set MyProc->xmin). XXX Do we yet need to add some restrictions?
+ */
+Snapshot
+SnapBuildInitialSnapshotForCluster(SnapBuild *builder)
+{
+ Snapshot snap;
+
+ Assert(builder->state == SNAPBUILD_CONSISTENT);
+
+ snap = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
+ return SnapBuildMVCCFromHistoric(snap, false);
+}
+
+/*
+ * Turn a historic MVCC snapshot into an ordinary MVCC snapshot.
+ *
+ * Pass true for 'in_place' if you don't care about modifying the source
+ * snapshot. If you need a new instance, and one that was allocated as a
+ * single chunk of memory, pass false.
+ */
+Snapshot
+SnapBuildMVCCFromHistoric(Snapshot snapshot, bool in_place)
+{
+ TransactionId xid;
+ TransactionId *oldxip = snapshot->xip;
+ uint32 oldxcnt = snapshot->xcnt;
+ TransactionId *newxip;
+ int newxcnt = 0;
+ Snapshot result;
+
/* allocate in transaction context */
newxip = (TransactionId *)
palloc(sizeof(TransactionId) * GetMaxSnapshotXidCount());
@@ -624,7 +668,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
* classical snapshot by marking all non-committed transactions as
* in-progress. This can be expensive.
*/
- for (xid = snap->xmin; NormalTransactionIdPrecedes(xid, snap->xmax);)
+ for (xid = snapshot->xmin; NormalTransactionIdPrecedes(xid, snapshot->xmax);)
{
void *test;
@@ -632,7 +676,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
* Check whether transaction committed using the decoding snapshot
* meaning of ->xip.
*/
- test = bsearch(&xid, snap->xip, snap->xcnt,
+ test = bsearch(&xid, snapshot->xip, snapshot->xcnt,
sizeof(TransactionId), xidComparator);
if (test == NULL)
@@ -649,11 +693,22 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
}
/* adjust remaining snapshot fields as needed */
- snap->snapshot_type = SNAPSHOT_MVCC;
- snap->xcnt = newxcnt;
- snap->xip = newxip;
+ snapshot->xcnt = newxcnt;
+ snapshot->xip = newxip;
+
+ if (in_place)
+ result = snapshot;
+ else
+ {
+ result = CopySnapshot(snapshot);
+
+ /* Restore the original values so the source is intact. */
+ snapshot->xip = oldxip;
+ snapshot->xcnt = oldxcnt;
+ }
+ result->snapshot_type = SNAPSHOT_MVCC;
- return snap;
+ return result;
}
/*
@@ -712,7 +767,7 @@ SnapBuildGetOrBuildSnapshot(SnapBuild *builder)
/* only build a new snapshot if we don't have a prebuilt one */
if (builder->snapshot == NULL)
{
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
/* increase refcount for the snapshot builder */
SnapBuildSnapIncRefcount(builder->snapshot);
}
@@ -792,7 +847,7 @@ SnapBuildProcessChange(SnapBuild *builder, TransactionId xid, XLogRecPtr lsn)
/* only build a new snapshot if we don't have a prebuilt one */
if (builder->snapshot == NULL)
{
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, lsn);
/* increase refcount for the snapshot builder */
SnapBuildSnapIncRefcount(builder->snapshot);
}
@@ -1161,7 +1216,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
if (builder->snapshot)
SnapBuildSnapDecRefcount(builder->snapshot);
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, lsn);
/* we might need to execute invalidations, add snapshot */
if (!ReorderBufferXidHasBaseSnapshot(builder->reorder, xid))
@@ -1989,7 +2044,7 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
{
SnapBuildSnapDecRefcount(builder->snapshot);
}
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
SnapBuildSnapIncRefcount(builder->snapshot);
ReorderBufferSetRestartPoint(builder->reorder, lsn);
diff --git a/src/backend/replication/pgoutput_cluster/Makefile b/src/backend/replication/pgoutput_cluster/Makefile
new file mode 100644
index 0000000000..31471bb546
--- /dev/null
+++ b/src/backend/replication/pgoutput_cluster/Makefile
@@ -0,0 +1,32 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+# Makefile for src/backend/replication/pgoutput_cluster
+#
+# IDENTIFICATION
+# src/backend/replication/pgoutput_cluster/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/replication/pgoutput_cluster
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+ $(WIN32RES) \
+ pgoutput_cluster.o
+PGFILEDESC = "pgoutput_cluster - logical replication output plugin for CLUSTER command"
+NAME = pgoutput_cluster
+
+all: all-shared-lib
+
+include $(top_srcdir)/src/Makefile.shlib
+
+install: all installdirs install-lib
+
+installdirs: installdirs-lib
+
+uninstall: uninstall-lib
+
+clean distclean: clean-lib
+ rm -f $(OBJS)
diff --git a/src/backend/replication/pgoutput_cluster/meson.build b/src/backend/replication/pgoutput_cluster/meson.build
new file mode 100644
index 0000000000..0f033064f2
--- /dev/null
+++ b/src/backend/replication/pgoutput_cluster/meson.build
@@ -0,0 +1,18 @@
+# Copyright (c) 2022-2024, PostgreSQL Global Development Group
+
+pgoutput_cluster_sources = files(
+ 'pgoutput_cluster.c',
+)
+
+if host_system == 'windows'
+ pgoutput_cluster_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pgoutput_cluster',
+ '--FILEDESC', 'pgoutput_cluster - logical replication output plugin for CLUSTER command',])
+endif
+
+pgoutput_cluster = shared_module('pgoutput_cluster',
+ pgoutput_cluster_sources,
+ kwargs: pg_mod_args,
+)
+
+backend_targets += pgoutput_cluster
diff --git a/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c b/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
new file mode 100644
index 0000000000..9fe44017a8
--- /dev/null
+++ b/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
@@ -0,0 +1,321 @@
+/* TODO Move into src/backend/cluster/ (and rename?) */
+/*-------------------------------------------------------------------------
+ *
+ * pgoutput_cluster.c
+ * Logical Replication output plugin for CLUSTER command
+ *
+ * Copyright (c) 2012-2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heaptoast.h"
+#include "commands/cluster.h"
+#include "replication/snapbuild.h"
+
+PG_MODULE_MAGIC;
+
+static void plugin_startup(LogicalDecodingContext *ctx,
+ OutputPluginOptions *opt, bool is_init);
+static void plugin_shutdown(LogicalDecodingContext *ctx);
+static void plugin_begin_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn);
+static void plugin_commit_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ Relation rel, ReorderBufferChange *change);
+static void plugin_truncate(struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, int nrelations,
+ Relation relations[],
+ ReorderBufferChange *change);
+static void store_change(LogicalDecodingContext *ctx,
+ ConcurrentChangeKind kind, HeapTuple tuple,
+ TransactionId xid);
+
+void
+_PG_output_plugin_init(OutputPluginCallbacks *cb)
+{
+ AssertVariableIsOfType(&_PG_output_plugin_init, LogicalOutputPluginInit);
+
+ cb->startup_cb = plugin_startup;
+ cb->begin_cb = plugin_begin_txn;
+ cb->change_cb = plugin_change;
+ cb->truncate_cb = plugin_truncate;
+ cb->commit_cb = plugin_commit_txn;
+ cb->shutdown_cb = plugin_shutdown;
+}
+
+
+/* initialize this plugin */
+static void
+plugin_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
+ bool is_init)
+{
+ ctx->output_plugin_private = NULL;
+
+ /* Probably unnecessary, as we don't use the SQL interface ... */
+ opt->output_type = OUTPUT_PLUGIN_BINARY_OUTPUT;
+
+ if (ctx->output_plugin_options != NIL)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("this plugin does not expect any options")));
+ }
+}
+
+static void
+plugin_shutdown(LogicalDecodingContext *ctx)
+{
+}
+
+/*
+ * As we don't release the slot during processing of particular table, there's
+ * no room for SQL interface, even for debugging purposes. Therefore we need
+ * neither OutputPluginPrepareWrite() nor OutputPluginWrite() in the plugin
+ * callbacks. (Although we might want to write custom callbacks, this API
+ * seems to be unnecessarily generic for our purposes.)
+ */
+
+/* BEGIN callback */
+static void
+plugin_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+}
+
+/* COMMIT callback */
+static void
+plugin_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+}
+
+/*
+ * Callback for individual changed tuples
+ */
+static void
+plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ Relation relation, ReorderBufferChange *change)
+{
+ ClusterDecodingState *dstate;
+ Snapshot snapshot;
+
+ dstate = (ClusterDecodingState *) ctx->output_writer_private;
+
+ /* Only interested in one particular relation. */
+ if (relation->rd_id != dstate->relid)
+ return;
+
+ /*
+ * Catalog snapshot is fine because the table we are processing is
+ * temporarily considered a user catalog table.
+ */
+ snapshot = GetCatalogSnapshot(InvalidOid);
+ Assert(snapshot->snapshot_type == SNAPSHOT_HISTORIC_MVCC);
+ Assert(!snapshot->suboverflowed);
+
+ /*
+ * This should not happen, but if we don't have enough information to
+ * apply a new snapshot, the consequences would be bad. Thus prefer ERROR
+ * to Assert().
+ */
+ if (XLogRecPtrIsInvalid(snapshot->lsn))
+ ereport(ERROR, (errmsg("snapshot has invalid LSN")));
+
+ /*
+ * reorderbuffer.c changes the catalog snapshot as soon as it sees a new
+ * CID or a commit record of a catalog-changing transaction.
+ */
+ if (dstate->snapshot == NULL || snapshot->lsn != dstate->snapshot_lsn ||
+ snapshot->curcid != dstate->snapshot->curcid)
+ {
+ /* CID should not go backwards. */
+ Assert(dstate->snapshot == NULL ||
+ snapshot->curcid >= dstate->snapshot->curcid);
+
+ /*
+ * XXX Is it a problem that the copy is created in
+ * TopTransactionContext?
+ */
+ dstate->snapshot = SnapBuildMVCCFromHistoric(snapshot, false);
+ dstate->snapshot_lsn = snapshot->lsn;
+ }
+
+ /* Decode entry depending on its type */
+ switch (change->action)
+ {
+ case REORDER_BUFFER_CHANGE_INSERT:
+ {
+ HeapTuple newtuple;
+
+ newtuple = change->data.tp.newtuple;
+
+ /*
+ * Identity checks in the main function should have made this
+ * impossible.
+ */
+ if (newtuple == NULL)
+ elog(ERROR, "incomplete insert info");
+
+ store_change(ctx, CHANGE_INSERT, newtuple, change->txn->xid);
+ }
+ break;
+ case REORDER_BUFFER_CHANGE_UPDATE:
+ {
+ HeapTuple oldtuple,
+ newtuple;
+
+ oldtuple = change->data.tp.oldtuple;
+ newtuple = change->data.tp.newtuple;
+
+ if (newtuple == NULL)
+ elog(ERROR, "incomplete update info");
+
+ if (oldtuple != NULL)
+ store_change(ctx, CHANGE_UPDATE_OLD, oldtuple,
+ change->txn->xid);
+
+ store_change(ctx, CHANGE_UPDATE_NEW, newtuple,
+ change->txn->xid);
+ }
+ break;
+ case REORDER_BUFFER_CHANGE_DELETE:
+ {
+ HeapTuple oldtuple;
+
+ oldtuple = change->data.tp.oldtuple;
+
+ if (oldtuple == NULL)
+ elog(ERROR, "incomplete delete info");
+
+ store_change(ctx, CHANGE_DELETE, oldtuple, change->txn->xid);
+ }
+ break;
+ default:
+ /* Should not come here */
+ Assert(false);
+ break;
+ }
+}
+
+static void
+plugin_truncate(struct LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ int nrelations, Relation relations[],
+ ReorderBufferChange *change)
+{
+ ClusterDecodingState *dstate;
+ int i;
+ Relation relation = NULL;
+
+ dstate = (ClusterDecodingState *) ctx->output_writer_private;
+
+ /* Find the relation we are processing. */
+ for (i = 0; i < nrelations; i++)
+ {
+ relation = relations[i];
+
+ if (RelationGetRelid(relation) == dstate->relid)
+ break;
+ }
+
+ /* If only other relations are truncated, there is nothing to do. */
+ if (i == nrelations)
+ return;
+
+ store_change(ctx, CHANGE_TRUNCATE, NULL, InvalidTransactionId);
+}
+
+/* Store concurrent data change. */
+static void
+store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
+ HeapTuple tuple, TransactionId xid)
+{
+ ClusterDecodingState *dstate;
+ char *change_raw;
+ ConcurrentChange *change;
+ bool flattened = false;
+ Size size;
+ Datum values[1];
+ bool isnull[1];
+ char *dst;
+
+ dstate = (ClusterDecodingState *) ctx->output_writer_private;
+
+ size = MAXALIGN(VARHDRSZ) + sizeof(ConcurrentChange);
+
+ if (tuple)
+ {
+ /*
+ * ReorderBufferCommit() stores the TOAST chunks in its private memory
+ * context and frees them after having called
+ * apply_change(). Therefore we need a flat copy (including TOAST) that
+ * we eventually copy into the memory context which is available to
+ * decode_concurrent_changes().
+ */
+ if (HeapTupleHasExternal(tuple))
+ {
+ /*
+ * toast_flatten_tuple_to_datum() might be more convenient but we
+ * don't want the decompression it does.
+ */
+ tuple = toast_flatten_tuple(tuple, dstate->tupdesc);
+ flattened = true;
+ }
+
+ size += tuple->t_len;
+ }
+
+ /* Enforce the maximum allocation size (MaxAllocSize, see memutils.h). */
+ if (size >= 0x3FFFFFFF)
+ elog(ERROR, "change is too big");
+
+ /* Construct the change. */
+ change_raw = (char *) palloc0(size);
+ SET_VARSIZE(change_raw, size);
+ change = (ConcurrentChange *) VARDATA(change_raw);
+ change->kind = kind;
+
+ /* No other information is needed for TRUNCATE. */
+ if (change->kind == CHANGE_TRUNCATE)
+ goto store;
+
+ /*
+ * Copy the tuple.
+ *
+ * CAUTION: change->tup_data.t_data must be fixed on retrieval!
+ */
+ memcpy(&change->tup_data, tuple, sizeof(HeapTupleData));
+ dst = (char *) change + sizeof(ConcurrentChange);
+ memcpy(dst, tuple->t_data, tuple->t_len);
+
+ /* Initialize the other fields. */
+ change->xid = xid;
+ change->snapshot = dstate->snapshot;
+ dstate->snapshot->active_count++;
+
+ /* The data has been copied. */
+ if (flattened)
+ pfree(tuple);
+
+store:
+ /* Store as tuple of 1 bytea column. */
+ values[0] = PointerGetDatum(change_raw);
+ isnull[0] = false;
+ tuplestore_putvalues(dstate->tstore, dstate->tupdesc_change,
+ values, isnull);
+
+ /* Accounting. */
+ dstate->nchanges++;
+
+ /* Cleanup. */
+ pfree(change_raw);
+}
+
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2100150f01..a84de0611a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
#include "access/xlogprefetcher.h"
#include "access/xlogrecovery.h"
#include "commands/async.h"
+#include "commands/cluster.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/autovacuum.h"
@@ -152,6 +153,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, WaitEventCustomShmemSize());
size = add_size(size, InjectionPointShmemSize());
size = add_size(size, SlotSyncShmemSize());
+ size = add_size(size, ClusterShmemSize());
#ifdef EXEC_BACKEND
size = add_size(size, ShmemBackendArraySize());
#endif
@@ -357,6 +359,7 @@ CreateOrAttachShmemStructs(void)
StatsShmemInit();
WaitEventCustomShmemInit();
InjectionPointShmemInit();
+ ClusterShmemInit();
}
/*
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index fa66b8017e..a6dda9b520 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -1299,6 +1299,17 @@ ProcessUtilitySlow(ParseState *pstate,
lockmode = AlterTableGetLockLevel(atstmt->cmds);
relid = AlterTableLookupRelation(atstmt, lockmode);
+ /*
+ * If lockmode allows, check if VACUUM FULL / CLUSTER
+ * CONCURRENT is in progress. If lockmode is too weak,
+ * cluster_rel() should detect incompatible DDLs executed
+ * by us.
+ *
+ * XXX We might skip the changes for DDLs which do not
+ * change the tuple descriptor.
+ */
+ check_for_concurrent_cluster(relid, lockmode);
+
if (OidIsValid(relid))
{
AlterTableUtilityContext atcontext;
diff --git a/src/backend/utils/activity/backend_progress.c b/src/backend/utils/activity/backend_progress.c
index e7c8bfba94..c52ec92a97 100644
--- a/src/backend/utils/activity/backend_progress.c
+++ b/src/backend/utils/activity/backend_progress.c
@@ -163,3 +163,19 @@ pgstat_progress_end_command(void)
beentry->st_progress.command_target = InvalidOid;
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
+
+void
+pgstat_progress_restore_state(PgBackendProgress *backup)
+{
+ volatile PgBackendStatus *beentry = MyBEEntry;
+
+ if (!beentry || !pgstat_track_activities)
+ return;
+
+ PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
+ beentry->st_progress.command = backup->command;
+ beentry->st_progress.command_target = backup->command_target;
+ memcpy(unvolatize(PgBackendStatus *, beentry)->st_progress.param,
+ backup->param, sizeof(beentry->st_progress.param));
+ PGSTAT_END_WRITE_ACTIVITY(beentry);
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index db37beeaae..8245be7846 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -345,6 +345,7 @@ WALSummarizer "Waiting to read or update WAL summarization state."
DSMRegistry "Waiting to read or update the dynamic shared memory registry."
InjectionPoint "Waiting to read or update information related to injection points."
SerialControl "Waiting to read or update shared <filename>pg_serial</filename> state."
+ClusteredRels "Waiting to read or update information on tables being clustered concurrently."
#
# END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 603aa4157b..5a2d5d6138 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -1373,6 +1373,28 @@ CacheInvalidateRelcache(Relation relation)
RegisterRelcacheInvalidation(databaseId, relationId);
}
+/*
+ * CacheInvalidateRelcacheImmediate
+ * Send invalidation message for the specified relation's relcache entry.
+ *
+ * Currently this is used in VACUUM FULL/CLUSTER CONCURRENTLY, to make sure
+ * that other backends are aware that the command is being executed for the
+ * relation.
+ */
+void
+CacheInvalidateRelcacheImmediate(Relation relation)
+{
+ SharedInvalidationMessage msg;
+
+ msg.rc.id = SHAREDINVALRELCACHE_ID;
+ msg.rc.dbId = MyDatabaseId;
+ msg.rc.relId = RelationGetRelid(relation);
+ /* check AddCatcacheInvalidationMessage() for an explanation */
+ VALGRIND_MAKE_MEM_DEFINED(&msg, sizeof(msg));
+
+ SendSharedInvalidMessages(&msg, 1);
+}
+
/*
* CacheInvalidateRelcacheAll
* Register invalidation of the whole relcache at the end of command.
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 66ed24e401..708d1ee27a 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -64,6 +64,7 @@
#include "catalog/pg_type.h"
#include "catalog/schemapg.h"
#include "catalog/storage.h"
+#include "commands/cluster.h"
#include "commands/policy.h"
#include "commands/publicationcmds.h"
#include "commands/trigger.h"
@@ -1257,6 +1258,10 @@ retry:
/* make sure relation is marked as having no open file yet */
relation->rd_smgr = NULL;
+ /* Is CLUSTER CONCURRENTLY in progress? */
+ relation->rd_cluster_concurrent =
+ is_concurrent_cluster_in_progress(targetRelId);
+
/*
* now we can free the memory allocated for pg_class_tuple
*/
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 7d2b34d4f2..6be0fef84c 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -155,9 +155,7 @@ typedef struct ExportedSnapshot
static List *exportedSnapshots = NIL;
/* Prototypes for local functions */
-static Snapshot CopySnapshot(Snapshot snapshot);
static void UnregisterSnapshotNoOwner(Snapshot snapshot);
-static void FreeSnapshot(Snapshot snapshot);
static void SnapshotResetXmin(void);
/* ResourceOwner callbacks to track snapshot references */
@@ -570,7 +568,7 @@ SetTransactionSnapshot(Snapshot sourcesnap, VirtualTransactionId *sourcevxid,
* The copy is palloc'd in TopTransactionContext and has initial refcounts set
* to 0. The returned snapshot has the copied flag set.
*/
-static Snapshot
+Snapshot
CopySnapshot(Snapshot snapshot)
{
Snapshot newsnap;
@@ -626,7 +624,7 @@ CopySnapshot(Snapshot snapshot)
* FreeSnapshot
* Free the memory associated with a snapshot.
*/
-static void
+void
FreeSnapshot(Snapshot snapshot)
{
Assert(snapshot->regd_count == 0);
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index d453e224d9..6cab6ed5ee 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2787,7 +2787,7 @@ psql_completion(const char *text, int start, int end)
* one word, so the above test is correct.
*/
if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
- COMPLETE_WITH("VERBOSE");
+ COMPLETE_WITH("VERBOSE", "CONCURRENTLY");
}
/* COMMENT */
@@ -4764,7 +4764,8 @@ psql_completion(const char *text, int start, int end)
"DISABLE_PAGE_SKIPPING", "SKIP_LOCKED",
"INDEX_CLEANUP", "PROCESS_MAIN", "PROCESS_TOAST",
"TRUNCATE", "PARALLEL", "SKIP_DATABASE_STATS",
- "ONLY_DATABASE_STATS", "BUFFER_USAGE_LIMIT");
+ "ONLY_DATABASE_STATS", "BUFFER_USAGE_LIMIT",
+ "CONCURRENTLY");
else if (TailMatches("FULL|FREEZE|ANALYZE|VERBOSE|DISABLE_PAGE_SKIPPING|SKIP_LOCKED|PROCESS_MAIN|PROCESS_TOAST|TRUNCATE|SKIP_DATABASE_STATS|ONLY_DATABASE_STATS"))
COMPLETE_WITH("ON", "OFF");
else if (TailMatches("INDEX_CLEANUP"))
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 9e9aec88a6..e87eb2f861 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -316,21 +316,24 @@ extern BulkInsertState GetBulkInsertState(void);
extern void FreeBulkInsertState(BulkInsertState);
extern void ReleaseBulkInsertStatePin(BulkInsertState bistate);
-extern void heap_insert(Relation relation, HeapTuple tup, CommandId cid,
- int options, BulkInsertState bistate);
+extern void heap_insert(Relation relation, HeapTuple tup, TransactionId xid,
+ CommandId cid, int options, BulkInsertState bistate);
extern void heap_multi_insert(Relation relation, struct TupleTableSlot **slots,
int ntuples, CommandId cid, int options,
BulkInsertState bistate);
extern TM_Result heap_delete(Relation relation, ItemPointer tid,
- CommandId cid, Snapshot crosscheck, bool wait,
- struct TM_FailureData *tmfd, bool changingPart);
+ TransactionId xid, CommandId cid,
+ Snapshot crosscheck, bool wait,
+ struct TM_FailureData *tmfd, bool changingPart,
+ bool wal_logical);
extern void heap_finish_speculative(Relation relation, ItemPointer tid);
extern void heap_abort_speculative(Relation relation, ItemPointer tid);
extern TM_Result heap_update(Relation relation, ItemPointer otid,
- HeapTuple newtup,
+ HeapTuple newtup, TransactionId xid,
CommandId cid, Snapshot crosscheck, bool wait,
struct TM_FailureData *tmfd, LockTupleMode *lockmode,
- TU_UpdateIndexes *update_indexes);
+ TU_UpdateIndexes *update_indexes,
+ bool wal_logical);
extern TM_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
CommandId cid, LockTupleMode mode, LockWaitPolicy wait_policy,
bool follow_updates,
@@ -405,6 +408,10 @@ extern HTSV_Result HeapTupleSatisfiesVacuumHorizon(HeapTuple htup, Buffer buffer
TransactionId *dead_after);
extern void HeapTupleSetHintBits(HeapTupleHeader tuple, Buffer buffer,
uint16 infomask, TransactionId xid);
+extern bool HeapTupleMVCCInserted(HeapTuple htup, Snapshot snapshot,
+ Buffer buffer);
+extern bool HeapTupleMVCCNotDeleted(HeapTuple htup, Snapshot snapshot,
+ Buffer buffer);
extern bool HeapTupleHeaderIsOnlyLocked(HeapTupleHeader tuple);
extern bool HeapTupleIsSurelyDead(HeapTuple htup,
struct GlobalVisState *vistest);
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 42736f37e7..1c5cb7c728 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -103,6 +103,8 @@
#define XLH_DELETE_CONTAINS_OLD_KEY (1<<2)
#define XLH_DELETE_IS_SUPER (1<<3)
#define XLH_DELETE_IS_PARTITION_MOVE (1<<4)
+/* See heap_delete() */
+#define XLH_DELETE_NO_LOGICAL (1<<5)
/* convenience macro for checking whether any form of old tuple was logged */
#define XLH_DELETE_CONTAINS_OLD \
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index da661289c1..1380ba81fc 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,6 +21,7 @@
#include "access/sdir.h"
#include "access/xact.h"
#include "executor/tuptable.h"
+#include "replication/logical.h"
#include "storage/read_stream.h"
#include "utils/rel.h"
#include "utils/snapshot.h"
@@ -630,6 +631,8 @@ typedef struct TableAmRoutine
Relation OldIndex,
bool use_sort,
TransactionId OldestXmin,
+ Snapshot snapshot,
+ LogicalDecodingContext *decoding_ctx,
TransactionId *xid_cutoff,
MultiXactId *multi_cutoff,
double *num_tuples,
@@ -1667,6 +1670,10 @@ table_relation_copy_data(Relation rel, const RelFileLocator *newrlocator)
* not needed for the relation's AM
* - *xid_cutoff - ditto
* - *multi_cutoff - ditto
+ * - snapshot - if != NULL, ignore data changes done by transactions that this
+ * (MVCC) snapshot considers still in-progress or in the future.
+ * - decoding_ctx - logical decoding context, to capture concurrent data
+ * changes.
*
* Output parameters:
* - *xid_cutoff - rel's new relfrozenxid value, may be invalid
@@ -1679,6 +1686,8 @@ table_relation_copy_for_cluster(Relation OldTable, Relation NewTable,
Relation OldIndex,
bool use_sort,
TransactionId OldestXmin,
+ Snapshot snapshot,
+ LogicalDecodingContext *decoding_ctx,
TransactionId *xid_cutoff,
MultiXactId *multi_cutoff,
double *num_tuples,
@@ -1687,6 +1696,7 @@ table_relation_copy_for_cluster(Relation OldTable, Relation NewTable,
{
OldTable->rd_tableam->relation_copy_for_cluster(OldTable, NewTable, OldIndex,
use_sort, OldestXmin,
+ snapshot, decoding_ctx,
xid_cutoff, multi_cutoff,
num_tuples, tups_vacuumed,
tups_recently_dead);
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 6d4439f052..e0016631f6 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -476,6 +476,8 @@ extern Size EstimateTransactionStateSpace(void);
extern void SerializeTransactionState(Size maxsize, char *start_address);
extern void StartParallelWorkerTransaction(char *tstatespace);
extern void EndParallelWorkerTransaction(void);
+extern void SetClusterCurrentXids(TransactionId *xip, int xcnt);
+extern void ResetClusterCurrentXids(void);
extern bool IsTransactionBlock(void);
extern bool IsTransactionOrTransactionBlock(void);
extern char TransactionBlockStatusCode(void);
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 7d434f8e65..77d522561b 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -99,6 +99,9 @@ extern Oid index_concurrently_create_copy(Relation heapRelation,
Oid tablespaceOid,
const char *newName);
+extern NullableDatum *get_index_stattargets(Oid indexid,
+ IndexInfo *indInfo);
+
extern void index_concurrently_build(Oid heapRelationId,
Oid indexRelationId);
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 7492796ea2..f98b855f21 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -13,10 +13,15 @@
#ifndef CLUSTER_H
#define CLUSTER_H
+#include "nodes/execnodes.h"
#include "nodes/parsenodes.h"
#include "parser/parse_node.h"
+#include "replication/logical.h"
#include "storage/lock.h"
+#include "storage/relfilelocator.h"
#include "utils/relcache.h"
+#include "utils/resowner.h"
+#include "utils/tuplestore.h"
/* flag bits for ClusterParams->options */
@@ -24,6 +29,7 @@
#define CLUOPT_RECHECK 0x02 /* recheck relation state */
#define CLUOPT_RECHECK_ISCLUSTERED 0x04 /* recheck relation state for
* indisclustered */
+#define CLUOPT_CONCURRENT 0x08 /* allow concurrent data changes */
/* options for CLUSTER */
typedef struct ClusterParams
@@ -31,12 +37,114 @@ typedef struct ClusterParams
bits32 options; /* bitmask of CLUOPT_* */
} ClusterParams;
+/*
+ * The following definitions are used for concurrent processing.
+ */
+
+/*
+ * Lock level for the concurrent variant of CLUSTER / VACUUM FULL.
+ *
+ * As with lazy VACUUM, we choose the strongest lock that still allows
+ * INSERT, UPDATE and DELETE.
+ *
+ * Note that the lock needs to be released temporarily a few times during the
+ * processing. In such cases it should be checked after re-locking that the
+ * relation / index hasn't changed in the system catalog while the lock was
+ * not held.
+ */
+#define LOCK_CLUSTER_CONCURRENT ShareUpdateExclusiveLock
+
+typedef enum
+{
+ CHANGE_INSERT,
+ CHANGE_UPDATE_OLD,
+ CHANGE_UPDATE_NEW,
+ CHANGE_DELETE,
+ CHANGE_TRUNCATE
+} ConcurrentChangeKind;
+
+typedef struct ConcurrentChange
+{
+ /* See the enum above. */
+ ConcurrentChangeKind kind;
+
+ /* Transaction that changes the data. */
+ TransactionId xid;
+
+ /*
+ * Historic catalog snapshot that was used to decode this change.
+ */
+ Snapshot snapshot;
+
+ /*
+ * The actual tuple.
+ *
+ * The tuple data follows the ConcurrentChange structure. Before use, make
+ * sure the tuple is correctly aligned (ConcurrentChange can be stored as
+ * bytea) and that tup_data.t_data is re-pointed at that trailing data.
+ */
+ HeapTupleData tup_data;
+} ConcurrentChange;
+
+/*
+ * Logical decoding state.
+ *
+ * Here we store the data changes that we decode from WAL while the table
+ * contents are being copied to the new storage, along with the metadata
+ * needed to apply these changes to the table afterwards.
+ */
+typedef struct ClusterDecodingState
+{
+ /* The relation whose changes we're decoding. */
+ Oid relid;
+
+ /*
+ * Decoded changes are stored here. Although we try to avoid excessive
+ * batches, it can happen that the changes need to be stored to disk. The
+ * tuplestore does this transparently.
+ */
+ Tuplestorestate *tstore;
+
+ /* The current number of changes in tstore. */
+ double nchanges;
+
+ /*
+ * Descriptor to store the ConcurrentChange structure serialized (bytea).
+ * We can't store the tuple directly because tuplestore only supports
+ * minimal tuples, and we may need to transfer the OID system column from
+ * the output plugin. Also we need to transfer the change kind, so it's
+ * better to put everything in one structure than to use two tuplestores
+ * "in parallel".
+ */
+ TupleDesc tupdesc_change;
+
+ /* Tuple descriptor needed to update indexes. */
+ TupleDesc tupdesc;
+
+ /* Slot to retrieve data from tstore. */
+ TupleTableSlot *tsslot;
+
+ /*
+ * Historic catalog snapshot that was used to decode the most recent
+ * change.
+ */
+ Snapshot snapshot;
+ /* LSN of the record */
+ XLogRecPtr snapshot_lsn;
+
+ ResourceOwner resowner;
+} ClusterDecodingState;
+
extern void cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel);
-extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params);
+extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
+ bool isTopLevel);
extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid,
LOCKMODE lockmode);
extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
-
+extern bool check_relation_is_clusterable_concurrently(Relation rel, int elevel,
+ const char *stmt);
+extern void cluster_decode_concurrent_changes(LogicalDecodingContext *ctx,
+ XLogRecPtr end_of_wal);
extern Oid make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
char relpersistence, LOCKMODE lockmode_old,
LOCKMODE *lockmode_new_p);
@@ -45,8 +153,13 @@ extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
bool swap_toast_by_content,
bool check_constraints,
bool is_internal,
+ bool reindex,
TransactionId frozenXid,
MultiXactId cutoffMulti,
char newrelpersistence);
+extern Size ClusterShmemSize(void);
+extern void ClusterShmemInit(void);
+extern bool is_concurrent_cluster_in_progress(Oid relid);
+extern void check_for_concurrent_cluster(Oid relid, LOCKMODE lockmode);
#endif /* CLUSTER_H */
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 5616d64523..03e3712ede 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -59,19 +59,22 @@
#define PROGRESS_CLUSTER_PHASE 1
#define PROGRESS_CLUSTER_INDEX_RELID 2
#define PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED 3
-#define PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN 4
-#define PROGRESS_CLUSTER_TOTAL_HEAP_BLKS 5
-#define PROGRESS_CLUSTER_HEAP_BLKS_SCANNED 6
-#define PROGRESS_CLUSTER_INDEX_REBUILD_COUNT 7
+#define PROGRESS_CLUSTER_HEAP_TUPLES_INSERTED 4
+#define PROGRESS_CLUSTER_HEAP_TUPLES_UPDATED 5
+#define PROGRESS_CLUSTER_HEAP_TUPLES_DELETED 6
+#define PROGRESS_CLUSTER_TOTAL_HEAP_BLKS 7
+#define PROGRESS_CLUSTER_HEAP_BLKS_SCANNED 8
+#define PROGRESS_CLUSTER_INDEX_REBUILD_COUNT 9
/* Phases of cluster (as advertised via PROGRESS_CLUSTER_PHASE) */
#define PROGRESS_CLUSTER_PHASE_SEQ_SCAN_HEAP 1
#define PROGRESS_CLUSTER_PHASE_INDEX_SCAN_HEAP 2
#define PROGRESS_CLUSTER_PHASE_SORT_TUPLES 3
#define PROGRESS_CLUSTER_PHASE_WRITE_NEW_HEAP 4
-#define PROGRESS_CLUSTER_PHASE_SWAP_REL_FILES 5
-#define PROGRESS_CLUSTER_PHASE_REBUILD_INDEX 6
-#define PROGRESS_CLUSTER_PHASE_FINAL_CLEANUP 7
+#define PROGRESS_CLUSTER_PHASE_CATCH_UP 5
+#define PROGRESS_CLUSTER_PHASE_SWAP_REL_FILES 6
+#define PROGRESS_CLUSTER_PHASE_REBUILD_INDEX 7
+#define PROGRESS_CLUSTER_PHASE_FINAL_CLEANUP 8
/* Commands of PROGRESS_CLUSTER */
#define PROGRESS_CLUSTER_COMMAND_CLUSTER 1
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 759f9a87d3..2f693e0fc0 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -181,13 +181,16 @@ typedef struct VacAttrStats
#define VACOPT_ANALYZE 0x02 /* do ANALYZE */
#define VACOPT_VERBOSE 0x04 /* output INFO instrumentation messages */
#define VACOPT_FREEZE 0x08 /* FREEZE option */
-#define VACOPT_FULL 0x10 /* FULL (non-concurrent) vacuum */
-#define VACOPT_SKIP_LOCKED 0x20 /* skip if cannot get lock */
-#define VACOPT_PROCESS_MAIN 0x40 /* process main relation */
-#define VACOPT_PROCESS_TOAST 0x80 /* process the TOAST table, if any */
-#define VACOPT_DISABLE_PAGE_SKIPPING 0x100 /* don't skip any pages */
-#define VACOPT_SKIP_DATABASE_STATS 0x200 /* skip vac_update_datfrozenxid() */
-#define VACOPT_ONLY_DATABASE_STATS 0x400 /* only vac_update_datfrozenxid() */
+#define VACOPT_FULL_EXCLUSIVE 0x10 /* FULL (non-concurrent) vacuum */
+#define VACOPT_FULL_CONCURRENT 0x20 /* FULL (concurrent) vacuum */
+#define VACOPT_SKIP_LOCKED 0x40 /* skip if cannot get lock */
+#define VACOPT_PROCESS_MAIN 0x80 /* process main relation */
+#define VACOPT_PROCESS_TOAST 0x100 /* process the TOAST table, if any */
+#define VACOPT_DISABLE_PAGE_SKIPPING 0x200 /* don't skip any pages */
+#define VACOPT_SKIP_DATABASE_STATS 0x400 /* skip vac_update_datfrozenxid() */
+#define VACOPT_ONLY_DATABASE_STATS 0x800 /* only vac_update_datfrozenxid() */
+
+#define VACOPT_FULL (VACOPT_FULL_EXCLUSIVE | VACOPT_FULL_CONCURRENT)
/*
* Values used by index_cleanup and truncate params.
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index a3360a1c5e..abbfb616ce 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -68,6 +68,8 @@ extern void FreeSnapshotBuilder(SnapBuild *builder);
extern void SnapBuildSnapDecRefcount(Snapshot snap);
extern Snapshot SnapBuildInitialSnapshot(SnapBuild *builder);
+extern Snapshot SnapBuildInitialSnapshotForCluster(SnapBuild *builder);
+extern Snapshot SnapBuildMVCCFromHistoric(Snapshot snapshot, bool in_place);
extern const char *SnapBuildExportSnapshot(SnapBuild *builder);
extern void SnapBuildClearExportedSnapshot(void);
extern void SnapBuildResetExportedSnapshotState(void);
diff --git a/src/include/storage/lockdefs.h b/src/include/storage/lockdefs.h
index 934ba84f6a..cac3d7f8c7 100644
--- a/src/include/storage/lockdefs.h
+++ b/src/include/storage/lockdefs.h
@@ -36,7 +36,7 @@ typedef int LOCKMODE;
#define AccessShareLock 1 /* SELECT */
#define RowShareLock 2 /* SELECT FOR UPDATE/FOR SHARE */
#define RowExclusiveLock 3 /* INSERT, UPDATE, DELETE */
-#define ShareUpdateExclusiveLock 4 /* VACUUM (non-FULL), ANALYZE, CREATE
+#define ShareUpdateExclusiveLock 4 /* VACUUM (non-exclusive), ANALYZE, CREATE
* INDEX CONCURRENTLY */
#define ShareLock 5 /* CREATE INDEX (WITHOUT CONCURRENTLY) */
#define ShareRowExclusiveLock 6 /* like EXCLUSIVE MODE, but allows ROW
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 6a2f64c54f..a5f59b6c12 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -83,3 +83,4 @@ PG_LWLOCK(49, WALSummarizer)
PG_LWLOCK(50, DSMRegistry)
PG_LWLOCK(51, InjectionPoint)
PG_LWLOCK(52, SerialControl)
+PG_LWLOCK(53, ClusteredRels)
diff --git a/src/include/utils/backend_progress.h b/src/include/utils/backend_progress.h
index e09598eafc..5ab5df9d41 100644
--- a/src/include/utils/backend_progress.h
+++ b/src/include/utils/backend_progress.h
@@ -35,7 +35,7 @@ typedef enum ProgressCommandType
/*
* Any command which wishes can advertise that it is running by setting
- * command, command_target, and param[]. command_target should be the OID of
+ * command, command_target, and param[]. command_target should be the OID of
* the relation which the command targets (we assume there's just one, as this
* is meant for utility commands), but the meaning of each element in the
* param array is command-specific.
@@ -55,6 +55,7 @@ extern void pgstat_progress_parallel_incr_param(int index, int64 incr);
extern void pgstat_progress_update_multi_param(int nparam, const int *index,
const int64 *val);
extern void pgstat_progress_end_command(void);
+extern void pgstat_progress_restore_state(PgBackendProgress *backup);
#endif /* BACKEND_PROGRESS_H */
diff --git a/src/include/utils/inval.h b/src/include/utils/inval.h
index 24695facf2..4acf9d0ed9 100644
--- a/src/include/utils/inval.h
+++ b/src/include/utils/inval.h
@@ -42,6 +42,8 @@ extern void CacheInvalidateCatalog(Oid catalogId);
extern void CacheInvalidateRelcache(Relation relation);
+extern void CacheInvalidateRelcacheImmediate(Relation relation);
+
extern void CacheInvalidateRelcacheAll(void);
extern void CacheInvalidateRelcacheByTuple(HeapTuple classTuple);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 8700204953..adda46c985 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -253,6 +253,9 @@ typedef struct RelationData
bool pgstat_enabled; /* should relation stats be counted */
/* use "struct" here to avoid needing to include pgstat.h: */
struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+ /* Is CLUSTER CONCURRENTLY being performed on this relation? */
+ bool rd_cluster_concurrent;
} RelationData;
@@ -684,7 +687,9 @@ RelationCloseSmgr(Relation relation)
#define RelationIsAccessibleInLogicalDecoding(relation) \
(XLogLogicalInfoActive() && \
RelationNeedsWAL(relation) && \
- (IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation)))
+ (IsCatalogRelation(relation) || \
+ RelationIsUsedAsCatalogTable(relation) || \
+ (relation)->rd_cluster_concurrent))
/*
* RelationIsLogicallyLogged
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index 9398a84051..f58c9108fc 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -68,6 +68,9 @@ extern Snapshot GetLatestSnapshot(void);
extern void SnapshotSetCommandId(CommandId curcid);
extern Snapshot GetOldestSnapshot(void);
+extern Snapshot CopySnapshot(Snapshot snapshot);
+extern void FreeSnapshot(Snapshot snapshot);
+
extern Snapshot GetCatalogSnapshot(Oid relid);
extern Snapshot GetNonHistoricCatalogSnapshot(Oid relid);
extern void InvalidateCatalogSnapshot(void);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 4c789279e5..22cb0702dc 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1958,17 +1958,20 @@ pg_stat_progress_cluster| SELECT s.pid,
WHEN 2 THEN 'index scanning heap'::text
WHEN 3 THEN 'sorting tuples'::text
WHEN 4 THEN 'writing new heap'::text
- WHEN 5 THEN 'swapping relation files'::text
- WHEN 6 THEN 'rebuilding index'::text
- WHEN 7 THEN 'performing final cleanup'::text
+ WHEN 5 THEN 'catch-up'::text
+ WHEN 6 THEN 'swapping relation files'::text
+ WHEN 7 THEN 'rebuilding index'::text
+ WHEN 8 THEN 'performing final cleanup'::text
ELSE NULL::text
END AS phase,
(s.param3)::oid AS cluster_index_relid,
s.param4 AS heap_tuples_scanned,
- s.param5 AS heap_tuples_written,
- s.param6 AS heap_blks_total,
- s.param7 AS heap_blks_scanned,
- s.param8 AS index_rebuild_count
+ s.param5 AS heap_tuples_inserted,
+ s.param6 AS heap_tuples_updated,
+ s.param7 AS heap_tuples_deleted,
+ s.param8 AS heap_blks_total,
+ s.param9 AS heap_blks_scanned,
+ s.param10 AS index_rebuild_count
FROM (pg_stat_get_progress_info('CLUSTER'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_progress_copy| SELECT s.pid,
--
2.45.2
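The store_change() function in the output plugin above copies the tuple body right behind the ConcurrentChange struct and notes, in a CAUTION comment, that change->tup_data.t_data must be fixed on retrieval: the pointer that was memcpy'd into the bytea is stale in the reader's address space. The following minimal, self-contained sketch illustrates that pack/fix-up pattern. All names here (FakeHeapTuple, FakeChange, pack_change, unpack_change) are invented for the illustration and do not exist in PostgreSQL:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for HeapTupleData: a length plus a body pointer. */
typedef struct
{
	size_t		t_len;			/* length of the tuple body */
	char	   *t_data;			/* stale once the struct is copied around */
} FakeHeapTuple;

/* Simplified stand-in for ConcurrentChange; the body follows it in memory. */
typedef struct
{
	int			kind;
	FakeHeapTuple tup_data;
} FakeChange;

/* Pack a change and its tuple body into one contiguous buffer.
 * malloc() returns suitably aligned memory, mirroring the alignment
 * requirement mentioned for the bytea-stored ConcurrentChange. */
static char *
pack_change(int kind, const char *body, size_t len)
{
	char	   *buf = malloc(sizeof(FakeChange) + len);
	FakeChange *ch = (FakeChange *) buf;

	ch->kind = kind;
	ch->tup_data.t_len = len;
	ch->tup_data.t_data = NULL;	/* pointer is NOT transferred */
	memcpy(buf + sizeof(FakeChange), body, len);
	return buf;
}

/* On retrieval, t_data must be re-pointed at the trailing bytes. */
static FakeHeapTuple *
unpack_change(char *buf)
{
	FakeChange *ch = (FakeChange *) buf;

	ch->tup_data.t_data = buf + sizeof(FakeChange);
	return &ch->tup_data;
}
```

The real code does the same fix-up when it pulls the bytea back out of the tuplestore before applying the change.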
Attachment: v01-0004-Call-logical_rewrite_heap_tuple-when-applying-concur.patch (text/x-diff)
From 8acfb903cb62baabea2b32174ce98b78d840e068 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Tue, 9 Jul 2024 17:46:00 +0200
Subject: [PATCH 4/4] Call logical_rewrite_heap_tuple() when applying
concurrent data changes.
This was implemented for the sake of completeness, but I think it's currently
not needed. Possible use cases could be:
1. VACUUM FULL / CLUSTER CONCURRENTLY can process system catalogs.
System catalogs are scanned using a historic snapshot during logical decoding,
and the "combo CIDs" information is needed for that. Since "combo CID" is
associated with the "file locator" and that locator is changed by VACUUM FULL
/ CLUSTER, these commands must record the information on individual tuples
being moved from the old file to the new one. This is what
logical_rewrite_heap_tuple() does.
However, the logical decoding subsystem currently does not support decoding of
data changes in the system catalog. Therefore, the CONCURRENTLY option cannot
be used for system catalogs.
2. VACUUM FULL / CLUSTER CONCURRENTLY is processing a relation, but once it
has released all the locks (in order to get the exclusive lock), another
backend runs VACUUM FULL / CLUSTER CONCURRENTLY on the same table. Since the
relation is treated as a system catalog while these commands are processing it
(so it can be scanned using a historic snapshot during the "initial load"), it
is important that the 2nd backend does not break decoding of the "combo CIDs"
performed by the 1st backend.
However, it's not practical to let multiple backends run VACUUM FULL / CLUSTER
CONCURRENTLY on the same relation, so we forbid that.
---
src/backend/access/heap/heapam_handler.c | 2 +-
src/backend/access/heap/rewriteheap.c | 65 ++++++-----
src/backend/commands/cluster.c | 102 ++++++++++++++----
src/backend/replication/logical/decode.c | 41 ++++++-
.../pgoutput_cluster/pgoutput_cluster.c | 21 ++--
src/include/access/rewriteheap.h | 5 +-
src/include/commands/cluster.h | 3 +
src/include/replication/reorderbuffer.h | 7 ++
8 files changed, 187 insertions(+), 59 deletions(-)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 02fd6d2983..cccfff62bd 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -735,7 +735,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
/* Initialize the rewrite operation */
rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, *xid_cutoff,
- *multi_cutoff);
+ *multi_cutoff, true);
/* Set up sorting if wanted */
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 473f3aa9be..050c8306da 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -214,10 +214,8 @@ static void raw_heap_insert(RewriteState state, HeapTuple tup);
/* internal logical remapping prototypes */
static void logical_begin_heap_rewrite(RewriteState state);
-static void logical_rewrite_heap_tuple(RewriteState state, ItemPointerData old_tid, HeapTuple new_tuple);
static void logical_end_heap_rewrite(RewriteState state);
-
/*
* Begin a rewrite of a table
*
@@ -226,18 +224,19 @@ static void logical_end_heap_rewrite(RewriteState state);
* oldest_xmin xid used by the caller to determine which tuples are dead
* freeze_xid xid before which tuples will be frozen
* cutoff_multi multixact before which multis will be removed
+ * tid_chains need to maintain TID chains?
*
* Returns an opaque RewriteState, allocated in current memory context,
* to be used in subsequent calls to the other functions.
*/
RewriteState
begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xmin,
- TransactionId freeze_xid, MultiXactId cutoff_multi)
+ TransactionId freeze_xid, MultiXactId cutoff_multi,
+ bool tid_chains)
{
RewriteState state;
MemoryContext rw_cxt;
MemoryContext old_cxt;
- HASHCTL hash_ctl;
/*
* To ease cleanup, make a separate context that will contain the
@@ -262,29 +261,34 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
state->rs_cxt = rw_cxt;
state->rs_bulkstate = smgr_bulk_start_rel(new_heap, MAIN_FORKNUM);
- /* Initialize hash tables used to track update chains */
- hash_ctl.keysize = sizeof(TidHashKey);
- hash_ctl.entrysize = sizeof(UnresolvedTupData);
- hash_ctl.hcxt = state->rs_cxt;
-
- state->rs_unresolved_tups =
- hash_create("Rewrite / Unresolved ctids",
- 128, /* arbitrary initial size */
- &hash_ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
- hash_ctl.entrysize = sizeof(OldToNewMappingData);
+ if (tid_chains)
+ {
+ HASHCTL hash_ctl;
+
+ /* Initialize hash tables used to track update chains */
+ hash_ctl.keysize = sizeof(TidHashKey);
+ hash_ctl.entrysize = sizeof(UnresolvedTupData);
+ hash_ctl.hcxt = state->rs_cxt;
+
+ state->rs_unresolved_tups =
+ hash_create("Rewrite / Unresolved ctids",
+ 128, /* arbitrary initial size */
+ &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+ hash_ctl.entrysize = sizeof(OldToNewMappingData);
+
+ state->rs_old_new_tid_map =
+ hash_create("Rewrite / Old to new tid map",
+ 128, /* arbitrary initial size */
+ &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ }
- state->rs_old_new_tid_map =
- hash_create("Rewrite / Old to new tid map",
- 128, /* arbitrary initial size */
- &hash_ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ logical_begin_heap_rewrite(state);
MemoryContextSwitchTo(old_cxt);
- logical_begin_heap_rewrite(state);
-
return state;
}
@@ -303,12 +307,15 @@ end_heap_rewrite(RewriteState state)
* Write any remaining tuples in the UnresolvedTups table. If we have any
* left, they should in fact be dead, but let's err on the safe side.
*/
- hash_seq_init(&seq_status, state->rs_unresolved_tups);
-
- while ((unresolved = hash_seq_search(&seq_status)) != NULL)
+ if (state->rs_unresolved_tups)
{
- ItemPointerSetInvalid(&unresolved->tuple->t_data->t_ctid);
- raw_heap_insert(state, unresolved->tuple);
+ hash_seq_init(&seq_status, state->rs_unresolved_tups);
+
+ while ((unresolved = hash_seq_search(&seq_status)) != NULL)
+ {
+ ItemPointerSetInvalid(&unresolved->tuple->t_data->t_ctid);
+ raw_heap_insert(state, unresolved->tuple);
+ }
}
/* Write the last page, if any */
@@ -995,7 +1002,7 @@ logical_rewrite_log_mapping(RewriteState state, TransactionId xid,
* Perform logical remapping for a tuple that's mapped from old_tid to
* new_tuple->t_self by rewrite_heap_tuple() if necessary for the tuple.
*/
-static void
+void
logical_rewrite_heap_tuple(RewriteState state, ItemPointerData old_tid,
HeapTuple new_tuple)
{
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 6397f7f8c4..42e8118b7d 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -21,6 +21,7 @@
#include "access/heapam.h"
#include "access/multixact.h"
#include "access/relscan.h"
+#include "access/rewriteheap.h"
#include "access/tableam.h"
#include "access/toast_internals.h"
#include "access/transam.h"
@@ -179,17 +180,21 @@ static LogicalDecodingContext *setup_logical_decoding(Oid relid,
static HeapTuple get_changed_tuple(ConcurrentChange *change);
static void apply_concurrent_changes(ClusterDecodingState *dstate,
Relation rel, ScanKey key, int nkeys,
- IndexInsertState *iistate);
+ IndexInsertState *iistate,
+ RewriteState rwstate);
static void apply_concurrent_insert(Relation rel, ConcurrentChange *change,
HeapTuple tup, IndexInsertState *iistate,
- TupleTableSlot *index_slot);
+ TupleTableSlot *index_slot,
+ RewriteState rwstate);
static void apply_concurrent_update(Relation rel, HeapTuple tup,
HeapTuple tup_target,
ConcurrentChange *change,
IndexInsertState *iistate,
- TupleTableSlot *index_slot);
+ TupleTableSlot *index_slot,
+ RewriteState rwstate);
static void apply_concurrent_delete(Relation rel, HeapTuple tup_target,
- ConcurrentChange *change);
+ ConcurrentChange *change,
+ RewriteState rwstate);
static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
HeapTuple tup_key,
Snapshot snapshot,
@@ -202,7 +207,8 @@ static void process_concurrent_changes(LogicalDecodingContext *ctx,
Relation rel_src,
ScanKey ident_key,
int ident_key_nentries,
- IndexInsertState *iistate);
+ IndexInsertState *iistate,
+ RewriteState rwstate);
static IndexInsertState *get_index_insert_state(Relation relation,
Oid ident_index_id);
static ScanKey build_identity_key(Oid ident_idx_oid, Relation rel_src,
@@ -3073,7 +3079,8 @@ cluster_decode_concurrent_changes(LogicalDecodingContext *ctx,
*/
static void
apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
- ScanKey key, int nkeys, IndexInsertState *iistate)
+ ScanKey key, int nkeys, IndexInsertState *iistate,
+ RewriteState rwstate)
{
TupleTableSlot *index_slot, *ident_slot;
HeapTuple tup_old = NULL;
@@ -3144,7 +3151,8 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
{
Assert(tup_old == NULL);
- apply_concurrent_insert(rel, change, tup, iistate, index_slot);
+ apply_concurrent_insert(rel, change, tup, iistate, index_slot,
+ rwstate);
pfree(tup);
}
@@ -3152,7 +3160,7 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
change->kind == CHANGE_DELETE)
{
IndexScanDesc ind_scan = NULL;
- HeapTuple tup_key;
+ HeapTuple tup_key, tup_exist_cp;
if (change->kind == CHANGE_UPDATE_NEW)
{
@@ -3193,11 +3201,23 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
if (tup_exist == NULL)
elog(ERROR, "Failed to find target tuple");
+ /*
+ * Update the mapping for xmax of the old version.
+ *
+ * Use a copy ('tup_exist' can point to a shared buffer) with xmin
+ * set to invalid, because the mapping for xmin should already have
+ * been written at insertion time.
+ */
+ tup_exist_cp = heap_copytuple(tup_exist);
+ HeapTupleHeaderSetXmin(tup_exist_cp->t_data, InvalidTransactionId);
+ logical_rewrite_heap_tuple(rwstate, change->old_tid, tup_exist_cp);
+ pfree(tup_exist_cp);
+
if (change->kind == CHANGE_UPDATE_NEW)
apply_concurrent_update(rel, tup, tup_exist, change, iistate,
- index_slot);
+ index_slot, rwstate);
else
- apply_concurrent_delete(rel, tup_exist, change);
+ apply_concurrent_delete(rel, tup_exist, change, rwstate);
ResetClusterCurrentXids();
@@ -3238,9 +3258,12 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
static void
apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
- IndexInsertState *iistate, TupleTableSlot *index_slot)
+ IndexInsertState *iistate, TupleTableSlot *index_slot,
+ RewriteState rwstate)
{
+ HeapTupleHeader tup_hdr = tup->t_data;
Snapshot snapshot = change->snapshot;
+ ItemPointerData old_tid;
List *recheck;
/*
@@ -3250,6 +3273,9 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
*/
SetClusterCurrentXids(snapshot->subxip, snapshot->subxcnt);
+ /* Remember location in the old heap. */
+ ItemPointerCopy(&tup_hdr->t_ctid, &old_tid);
+
/*
* Write the tuple into the new heap.
*
@@ -3265,6 +3291,14 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
heap_insert(rel, tup, change->xid, snapshot->curcid - 1,
HEAP_INSERT_NO_LOGICAL, NULL);
+ /*
+ * Update the mapping for xmin (xmax should be invalid). This is
+ * needed because, during the processing, the table is considered a
+ * "user catalog".
+ */
+ Assert(!TransactionIdIsValid(HeapTupleHeaderGetRawXmax(tup->t_data)));
+ logical_rewrite_heap_tuple(rwstate, old_tid, tup);
+
/*
* Update indexes.
*
@@ -3298,16 +3332,19 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
static void
apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
ConcurrentChange *change, IndexInsertState *iistate,
- TupleTableSlot *index_slot)
+ TupleTableSlot *index_slot, RewriteState rwstate)
{
List *recheck;
LockTupleMode lockmode;
TU_UpdateIndexes update_indexes;
- ItemPointerData tid_old_new_heap;
+ ItemPointerData tid_new_old_heap, tid_old_new_heap;
TM_Result res;
Snapshot snapshot = change->snapshot;
TM_FailureData tmfd;
+ /* Location of the new tuple in the old heap. */
+ ItemPointerCopy(&tup->t_data->t_ctid, &tid_new_old_heap);
+
/* Location of the existing tuple in the new heap. */
ItemPointerCopy(&tup_target->t_self, &tid_old_new_heap);
@@ -3330,6 +3367,10 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
if (res != TM_Ok)
ereport(ERROR, (errmsg("failed to apply concurrent UPDATE")));
+ /* Update the mapping for xmin of the new version. */
+ Assert(!TransactionIdIsValid(HeapTupleHeaderGetRawXmax(tup->t_data)));
+ logical_rewrite_heap_tuple(rwstate, tid_new_old_heap, tup);
+
ExecStoreHeapTuple(tup, index_slot, false);
if (update_indexes != TU_None)
@@ -3353,7 +3394,7 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
static void
apply_concurrent_delete(Relation rel, HeapTuple tup_target,
- ConcurrentChange *change)
+ ConcurrentChange *change, RewriteState rwstate)
{
ItemPointerData tid_old_new_heap;
TM_Result res;
@@ -3444,7 +3485,8 @@ find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
static void
process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
Relation rel_dst, Relation rel_src, ScanKey ident_key,
- int ident_key_nentries, IndexInsertState *iistate)
+ int ident_key_nentries, IndexInsertState *iistate,
+ RewriteState rwstate)
{
ClusterDecodingState *dstate;
@@ -3468,7 +3510,7 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
rel_dst->rd_toastoid = rel_src->rd_rel->reltoastrelid;
apply_concurrent_changes(dstate, rel_dst, ident_key,
- ident_key_nentries, iistate);
+ ident_key_nentries, iistate, rwstate);
}
PG_FINALLY();
{
@@ -3631,6 +3673,7 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
bool is_system_catalog;
Oid ident_idx_old, ident_idx_new;
IndexInsertState *iistate;
+ RewriteState rwstate;
ScanKey ident_key;
int ident_key_nentries;
XLogRecPtr wal_insert_ptr, end_of_wal;
@@ -3708,10 +3751,26 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
* Apply concurrent changes first time, to minimize the time we need to
* hold AccessExclusiveLock. (Quite some amount of WAL could have been
* written during the data copying and index creation.)
+ *
+ * Now we are processing individual tuples, so pass false for
+ * 'tid_chains'. Since rwstate is now only needed for
+ * logical_begin_heap_rewrite(), none of the transaction IDs needs to be
+ * valid.
*/
+ rwstate = begin_heap_rewrite(OldHeap, NewHeap,
+ InvalidTransactionId,
+ InvalidTransactionId,
+ InvalidTransactionId,
+ false);
process_concurrent_changes(ctx, end_of_wal, NewHeap,
swap_toast_by_content ? OldHeap : NULL,
- ident_key, ident_key_nentries, iistate);
+ ident_key, ident_key_nentries, iistate,
+ rwstate);
+ /*
+ * OldHeap will be closed, so we need to initialize rwstate again for the
+ * next call of process_concurrent_changes().
+ */
+ end_heap_rewrite(rwstate);
/*
* Release the locks that allowed concurrent data changes, in order to
@@ -3833,9 +3892,16 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
end_of_wal = GetFlushRecPtr(NULL);
/* Apply the concurrent changes again. */
+ rwstate = begin_heap_rewrite(OldHeap, NewHeap,
+ InvalidTransactionId,
+ InvalidTransactionId,
+ InvalidTransactionId,
+ false);
process_concurrent_changes(ctx, end_of_wal, NewHeap,
swap_toast_by_content ? OldHeap : NULL,
- ident_key, ident_key_nentries, iistate);
+ ident_key, ident_key_nentries, iistate,
+ rwstate);
+ end_heap_rewrite(rwstate);
/* Remember info about rel before closing OldHeap */
relpersistence = OldHeap->rd_rel->relpersistence;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 066d96dea2..69a43e3510 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -951,11 +951,13 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
xl_heap_insert *xlrec;
ReorderBufferChange *change;
RelFileLocator target_locator;
+ BlockNumber blknum;
+ HeapTupleHeader tuphdr;
xlrec = (xl_heap_insert *) XLogRecGetData(r);
/* only interested in our database */
- XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
+ XLogRecGetBlockTag(r, 0, &target_locator, NULL, &blknum);
if (target_locator.dbOid != ctx->slot->data.database)
return;
@@ -980,6 +982,13 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
DecodeXLogTuple(tupledata, datalen, change->data.tp.newtuple);
+ /*
+ * CTID is needed for logical_rewrite_heap_tuple(), when doing CLUSTER
+ * CONCURRENTLY.
+ */
+ tuphdr = change->data.tp.newtuple->t_data;
+ ItemPointerSet(&tuphdr->t_ctid, blknum, xlrec->offnum);
+
change->data.tp.clear_toast_afterwards = true;
ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
@@ -1001,11 +1010,14 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
ReorderBufferChange *change;
char *data;
RelFileLocator target_locator;
+ BlockNumber old_blknum, new_blknum;
xlrec = (xl_heap_update *) XLogRecGetData(r);
+ /* Retrieve blknum, so that we can compose CTID below. */
+ XLogRecGetBlockTag(r, 0, &target_locator, NULL, &new_blknum);
+
/* only interested in our database */
- XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
if (target_locator.dbOid != ctx->slot->data.database)
return;
@@ -1022,6 +1034,7 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
{
Size datalen;
Size tuplelen;
+ HeapTupleHeader tuphdr;
data = XLogRecGetBlockData(r, 0, &datalen);
@@ -1031,6 +1044,13 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
ReorderBufferGetTupleBuf(ctx->reorder, tuplelen);
DecodeXLogTuple(data, datalen, change->data.tp.newtuple);
+
+ /*
+ * CTID is needed for logical_rewrite_heap_tuple(), when doing CLUSTER
+ * CONCURRENTLY.
+ */
+ tuphdr = change->data.tp.newtuple->t_data;
+ ItemPointerSet(&tuphdr->t_ctid, new_blknum, xlrec->new_offnum);
}
if (xlrec->flags & XLH_UPDATE_CONTAINS_OLD)
@@ -1049,6 +1069,14 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
DecodeXLogTuple(data, datalen, change->data.tp.oldtuple);
}
+ /*
+ * Remember the old tuple CTID, for the sake of
+ * logical_rewrite_heap_tuple().
+ */
+ if (!XLogRecGetBlockTagExtended(r, 1, NULL, NULL, &old_blknum, NULL))
+ old_blknum = new_blknum;
+ ItemPointerSet(&change->data.tp.old_tid, old_blknum, xlrec->old_offnum);
+
change->data.tp.clear_toast_afterwards = true;
ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
@@ -1067,11 +1095,12 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
xl_heap_delete *xlrec;
ReorderBufferChange *change;
RelFileLocator target_locator;
+ BlockNumber blknum;
xlrec = (xl_heap_delete *) XLogRecGetData(r);
/* only interested in our database */
- XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
+ XLogRecGetBlockTag(r, 0, &target_locator, NULL, &blknum);
if (target_locator.dbOid != ctx->slot->data.database)
return;
@@ -1103,6 +1132,12 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
DecodeXLogTuple((char *) xlrec + SizeOfHeapDelete,
datalen, change->data.tp.oldtuple);
+
+ /*
+ * CTID is needed for logical_rewrite_heap_tuple(), when doing CLUSTER
+ * CONCURRENTLY.
+ */
+ ItemPointerSet(&change->data.tp.old_tid, blknum, xlrec->offnum);
}
change->data.tp.clear_toast_afterwards = true;
diff --git a/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c b/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
index 9fe44017a8..2c33fbad82 100644
--- a/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
+++ b/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
@@ -34,7 +34,7 @@ static void plugin_truncate(struct LogicalDecodingContext *ctx,
ReorderBufferChange *change);
static void store_change(LogicalDecodingContext *ctx,
ConcurrentChangeKind kind, HeapTuple tuple,
- TransactionId xid);
+ TransactionId xid, ItemPointer old_tid);
void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
@@ -162,7 +162,8 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (newtuple == NULL)
elog(ERROR, "Incomplete insert info.");
- store_change(ctx, CHANGE_INSERT, newtuple, change->txn->xid);
+ store_change(ctx, CHANGE_INSERT, newtuple, change->txn->xid,
+ NULL);
}
break;
case REORDER_BUFFER_CHANGE_UPDATE:
@@ -180,10 +181,10 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (oldtuple != NULL)
store_change(ctx, CHANGE_UPDATE_OLD, oldtuple,
- change->txn->xid);
+ change->txn->xid, NULL);
store_change(ctx, CHANGE_UPDATE_NEW, newtuple,
- change->txn->xid);
+ change->txn->xid, &change->data.tp.old_tid);
}
break;
case REORDER_BUFFER_CHANGE_DELETE:
@@ -196,7 +197,8 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (oldtuple == NULL)
elog(ERROR, "Incomplete delete info.");
- store_change(ctx, CHANGE_DELETE, oldtuple, change->txn->xid);
+ store_change(ctx, CHANGE_DELETE, oldtuple, change->txn->xid,
+ &change->data.tp.old_tid);
}
break;
default:
@@ -230,13 +232,13 @@ plugin_truncate(struct LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (i == nrelations)
return;
- store_change(ctx, CHANGE_TRUNCATE, NULL, InvalidTransactionId);
+ store_change(ctx, CHANGE_TRUNCATE, NULL, InvalidTransactionId, NULL);
}
/* Store concurrent data change. */
static void
store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
- HeapTuple tuple, TransactionId xid)
+ HeapTuple tuple, TransactionId xid, ItemPointer old_tid)
{
ClusterDecodingState *dstate;
char *change_raw;
@@ -301,6 +303,11 @@ store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
change->snapshot = dstate->snapshot;
dstate->snapshot->active_count++;
+ if (old_tid)
+ ItemPointerCopy(old_tid, &change->old_tid);
+ else
+ ItemPointerSetInvalid(&change->old_tid);
+
/* The data has been copied. */
if (flattened)
pfree(tuple);
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 5866a26bdd..de62b6abf8 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -23,11 +23,14 @@ typedef struct RewriteStateData *RewriteState;
extern RewriteState begin_heap_rewrite(Relation old_heap, Relation new_heap,
TransactionId oldest_xmin, TransactionId freeze_xid,
- MultiXactId cutoff_multi);
+ MultiXactId cutoff_multi, bool tid_chains);
extern void end_heap_rewrite(RewriteState state);
extern void rewrite_heap_tuple(RewriteState state, HeapTuple old_tuple,
HeapTuple new_tuple);
extern bool rewrite_heap_dead_tuple(RewriteState state, HeapTuple old_tuple);
+extern void logical_rewrite_heap_tuple(RewriteState state,
+ ItemPointerData old_tid,
+ HeapTuple new_tuple);
/*
* On-Disk data format for an individual logical rewrite mapping.
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index f98b855f21..c394ef3871 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -71,6 +71,9 @@ typedef struct ConcurrentChange
/* Transaction that changes the data. */
TransactionId xid;
+ /* For UPDATE / DELETE, the location of the old tuple version. */
+ ItemPointerData old_tid;
+
/*
* Historic catalog snapshot that was used to decode this change.
*/
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 851a001c8b..1fa8f8bd6a 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -99,6 +99,13 @@ typedef struct ReorderBufferChange
HeapTuple oldtuple;
/* valid for INSERT || UPDATE */
HeapTuple newtuple;
+
+ /*
+ * CLUSTER CONCURRENTLY needs the old TID, even if the old tuple
+ * itself is not WAL-logged (i.e. when the identity key does not
+ * change).
+ */
+ ItemPointerData old_tid;
} tp;
/*
--
2.45.2
Antonin Houska <ah@cybertec.at> wrote:
Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
Is your plan to work on it soon or should I try to write a draft patch? (I
assume this is for PG >= 18.)

I don't have plans for it, so if you have resources, please go for it.
The first version is attached. The actual feature is in 0003. 0004 is probably
not necessary now, but I hadn't realized that until I had coded it.
The mailing list archive indicates something is wrong with the 0003
attachment. Sending it all again, as *.tar.
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Attachments:
vacuum_full_concurrently_v01.tar (application/x-tar)
v01-0001-Adjust-signature-of-cluster_rel-and-its-subroutines.patch
From f47a98b9b4580a581aacf73c553b87ca6bf16533 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Tue, 9 Jul 2024 17:45:59 +0200
Subject: [PATCH 1/4] Adjust signature of cluster_rel() and its subroutines.
So far, cluster_rel() received the OID of the relation it should process, and
it performed the opening and locking of the relation itself. Yet
copy_table_data() received the OID as well and also had to open the relation
itself. This patch eliminates the repeated opening and closing.
One particular reason for this change is that the VACUUM FULL / CLUSTER
command with the CONCURRENTLY option will need to release all locks on the
relation (and possibly on the clustering index) at some point. Since it makes
little sense to keep a relation reference without a lock, the cluster_rel()
function also closes its reference to the relation (and its index). Neither
the function nor its subroutines may open extra references, because then it
would be harder to close them all.
---
src/backend/commands/cluster.c | 146 ++++++++++++++++++-------------
src/backend/commands/matview.c | 2 +-
src/backend/commands/tablecmds.c | 2 +-
src/backend/commands/vacuum.c | 12 +--
src/include/commands/cluster.h | 5 +-
5 files changed, 99 insertions(+), 68 deletions(-)
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 78f96789b0..194d143cf4 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -70,8 +70,8 @@ typedef struct
static void cluster_multiple_rels(List *rtcs, ClusterParams *params);
-static void rebuild_relation(Relation OldHeap, Oid indexOid, bool verbose);
-static void copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex,
+static void rebuild_relation(Relation OldHeap, Relation index, bool verbose);
+static void copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
bool verbose, bool *pSwapToastByContent,
TransactionId *pFreezeXid, MultiXactId *pCutoffMulti);
static List *get_tables_to_cluster(MemoryContext cluster_context);
@@ -194,11 +194,11 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
{
- /* close relation, keep lock till commit */
- table_close(rel, NoLock);
-
- /* Do the job. */
- cluster_rel(tableOid, indexOid, ¶ms);
+ /*
+ * Do the job. (The function will close the relation, lock is kept
+ * till commit.)
+ */
+ cluster_rel(rel, indexOid, ¶ms);
return;
}
@@ -275,6 +275,7 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
foreach(lc, rtcs)
{
RelToCluster *rtc = (RelToCluster *) lfirst(lc);
+ Relation rel;
/* Start a new transaction for each relation. */
StartTransactionCommand();
@@ -282,8 +283,13 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
/* functions in indexes may want a snapshot set */
PushActiveSnapshot(GetTransactionSnapshot());
- /* Do the job. */
- cluster_rel(rtc->tableOid, rtc->indexOid, params);
+ rel = table_open(rtc->tableOid, AccessExclusiveLock);
+
+ /*
+ * Do the job. (The function will close the relation, lock is kept
+ * till commit.)
+ */
+ cluster_rel(rel, rtc->indexOid, params);
PopActiveSnapshot();
CommitTransactionCommand();
@@ -306,16 +312,19 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
* If indexOid is InvalidOid, the table will be rewritten in physical order
* instead of index order. This is the new implementation of VACUUM FULL,
* and error messages should refer to the operation as VACUUM not CLUSTER.
+ *
+ * We expect that OldHeap is already locked in AccessExclusiveLock mode.
*/
void
-cluster_rel(Oid tableOid, Oid indexOid, ClusterParams *params)
+cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
{
- Relation OldHeap;
+ Oid tableOid = RelationGetRelid(OldHeap);
Oid save_userid;
int save_sec_context;
int save_nestlevel;
bool verbose = ((params->options & CLUOPT_VERBOSE) != 0);
bool recheck = ((params->options & CLUOPT_RECHECK) != 0);
+ Relation index = NULL;
/* Check for user-requested abort. */
CHECK_FOR_INTERRUPTS();
@@ -328,21 +337,6 @@ cluster_rel(Oid tableOid, Oid indexOid, ClusterParams *params)
pgstat_progress_update_param(PROGRESS_CLUSTER_COMMAND,
PROGRESS_CLUSTER_COMMAND_VACUUM_FULL);
- /*
- * We grab exclusive access to the target rel and index for the duration
- * of the transaction. (This is redundant for the single-transaction
- * case, since cluster() already did it.) The index lock is taken inside
- * check_index_is_clusterable.
- */
- OldHeap = try_relation_open(tableOid, AccessExclusiveLock);
-
- /* If the table has gone away, we can skip processing it */
- if (!OldHeap)
- {
- pgstat_progress_end_command();
- return;
- }
-
/*
* Switch to the table owner's userid, so that any index functions are run
* as that user. Also lock down security-restricted operations and
@@ -445,7 +439,11 @@ cluster_rel(Oid tableOid, Oid indexOid, ClusterParams *params)
/* Check heap and index are valid to cluster on */
if (OidIsValid(indexOid))
+ {
check_index_is_clusterable(OldHeap, indexOid, AccessExclusiveLock);
+ /* Open the index (It should already be locked.) */
+ index = index_open(indexOid, NoLock);
+ }
/*
* Quietly ignore the request if this is a materialized view which has not
@@ -474,9 +472,12 @@ cluster_rel(Oid tableOid, Oid indexOid, ClusterParams *params)
TransferPredicateLocksToHeapRelation(OldHeap);
/* rebuild_relation does all the dirty work */
- rebuild_relation(OldHeap, indexOid, verbose);
+ rebuild_relation(OldHeap, index, verbose);
- /* NB: rebuild_relation does table_close() on OldHeap */
+ /*
+ * NB: rebuild_relation does table_close() on OldHeap, and also on index,
+ * if the pointer is valid.
+ */
out:
/* Roll back any GUC changes executed by index functions */
@@ -625,22 +626,27 @@ mark_index_clustered(Relation rel, Oid indexOid, bool is_internal)
* rebuild_relation: rebuild an existing relation in index or physical order
*
* OldHeap: table to rebuild --- must be opened and exclusive-locked!
- * indexOid: index to cluster by, or InvalidOid to rewrite in physical order.
+ * index: index to cluster by, or NULL to rewrite in physical order. Must be
+ * opened and locked.
*
- * NB: this routine closes OldHeap at the right time; caller should not.
+ * On exit, the heap (and also the index, if one was passed) are closed, but
+ * still locked with AccessExclusiveLock.
*/
static void
-rebuild_relation(Relation OldHeap, Oid indexOid, bool verbose)
+rebuild_relation(Relation OldHeap, Relation index, bool verbose)
{
Oid tableOid = RelationGetRelid(OldHeap);
+ Oid indexOid = index ? RelationGetRelid(index) : InvalidOid;
Oid accessMethod = OldHeap->rd_rel->relam;
Oid tableSpace = OldHeap->rd_rel->reltablespace;
Oid OIDNewHeap;
+ Relation NewHeap;
char relpersistence;
bool is_system_catalog;
bool swap_toast_by_content;
TransactionId frozenXid;
MultiXactId cutoffMulti;
+ LOCKMODE lmode_new;
if (OidIsValid(indexOid))
/* Mark the correct index as clustered */
@@ -650,19 +656,40 @@ rebuild_relation(Relation OldHeap, Oid indexOid, bool verbose)
relpersistence = OldHeap->rd_rel->relpersistence;
is_system_catalog = IsSystemRelation(OldHeap);
- /* Close relcache entry, but keep lock until transaction commit */
- table_close(OldHeap, NoLock);
-
- /* Create the transient table that will receive the re-ordered data */
+ /*
+ * Create the transient table that will receive the re-ordered data.
+ *
+ * NoLock for the old heap because we already have it locked and want to
+ * keep unlocking straightforward.
+ */
+ lmode_new = AccessExclusiveLock;
OIDNewHeap = make_new_heap(tableOid, tableSpace,
accessMethod,
relpersistence,
- AccessExclusiveLock);
+ NoLock, &lmode_new);
+ Assert(lmode_new == AccessExclusiveLock || lmode_new == NoLock);
+ /* Lock iff not done above. */
+ NewHeap = table_open(OIDNewHeap, lmode_new == NoLock ?
+ AccessExclusiveLock : NoLock);
/* Copy the heap data into the new table in the desired order */
- copy_table_data(OIDNewHeap, tableOid, indexOid, verbose,
+ copy_table_data(NewHeap, OldHeap, index, verbose,
&swap_toast_by_content, &frozenXid, &cutoffMulti);
+
+ /* Close relcache entries, but keep lock until transaction commit */
+ table_close(OldHeap, NoLock);
+ if (index)
+ index_close(index, NoLock);
+
+ /*
+ * Close the new relation so it can be dropped as soon as the storage is
+ * swapped. The relation is not visible to others, so we could unlock it
+ * completely, but it's simpler to pass NoLock than to track all the locks
+ * acquired so far.
+ */
+ table_close(NewHeap, NoLock);
+
/*
* Swap the physical files of the target and transient tables, then
* rebuild the target's indexes and throw away the transient table.
@@ -683,10 +710,15 @@ rebuild_relation(Relation OldHeap, Oid indexOid, bool verbose)
*
* After this, the caller should load the new heap with transferred/modified
* data, then call finish_heap_swap to complete the operation.
+ *
+ * If a specific lock mode is needed for the new relation, pass it via the
+ * in/out parameter lockmode_new_p. On exit, the output value tells whether
+ * the lock was actually acquired.
*/
Oid
make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
- char relpersistence, LOCKMODE lockmode)
+ char relpersistence, LOCKMODE lockmode_old,
+ LOCKMODE *lockmode_new_p)
{
TupleDesc OldHeapDesc;
char NewHeapName[NAMEDATALEN];
@@ -697,8 +729,17 @@ make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
Datum reloptions;
bool isNull;
Oid namespaceid;
+ LOCKMODE lockmode_new;
- OldHeap = table_open(OIDOldHeap, lockmode);
+ if (lockmode_new_p)
+ {
+ lockmode_new = *lockmode_new_p;
+ *lockmode_new_p = NoLock;
+ }
+ else
+ lockmode_new = lockmode_old;
+
+ OldHeap = table_open(OIDOldHeap, lockmode_old);
OldHeapDesc = RelationGetDescr(OldHeap);
/*
@@ -792,7 +833,9 @@ make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
if (isNull)
reloptions = (Datum) 0;
- NewHeapCreateToastTable(OIDNewHeap, reloptions, lockmode, toastid);
+ NewHeapCreateToastTable(OIDNewHeap, reloptions, lockmode_new, toastid);
+ if (lockmode_new_p)
+ *lockmode_new_p = lockmode_new;
ReleaseSysCache(tuple);
}
@@ -811,13 +854,13 @@ make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
* *pCutoffMulti receives the MultiXactId used as a cutoff point.
*/
static void
-copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
+copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verbose,
bool *pSwapToastByContent, TransactionId *pFreezeXid,
MultiXactId *pCutoffMulti)
{
- Relation NewHeap,
- OldHeap,
- OldIndex;
+ Oid OIDOldHeap = RelationGetRelid(OldHeap);
+ Oid OIDOldIndex = OldIndex ? RelationGetRelid(OldIndex) : InvalidOid;
+ Oid OIDNewHeap = RelationGetRelid(NewHeap);
Relation relRelation;
HeapTuple reltup;
Form_pg_class relform;
@@ -836,16 +879,6 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
pg_rusage_init(&ru0);
- /*
- * Open the relations we need.
- */
- NewHeap = table_open(OIDNewHeap, AccessExclusiveLock);
- OldHeap = table_open(OIDOldHeap, AccessExclusiveLock);
- if (OidIsValid(OIDOldIndex))
- OldIndex = index_open(OIDOldIndex, AccessExclusiveLock);
- else
- OldIndex = NULL;
-
/* Store a copy of the namespace name for logging purposes */
nspname = get_namespace_name(RelationGetNamespace(OldHeap));
@@ -1001,11 +1034,6 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
tups_recently_dead,
pg_rusage_show(&ru0))));
- if (OldIndex != NULL)
- index_close(OldIndex, NoLock);
- table_close(OldHeap, NoLock);
- table_close(NewHeap, NoLock);
-
/* Update pg_class to reflect the correct values of pages and tuples. */
relRelation = table_open(RelationRelationId, RowExclusiveLock);
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index ea05d4b224..488ca950d9 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -296,7 +296,7 @@ ExecRefreshMatView(RefreshMatViewStmt *stmt, const char *queryString,
*/
OIDNewHeap = make_new_heap(matviewOid, tableSpace,
matviewRel->rd_rel->relam,
- relpersistence, ExclusiveLock);
+ relpersistence, ExclusiveLock, NULL);
LockRelationOid(OIDNewHeap, AccessExclusiveLock);
dest = CreateTransientRelDestReceiver(OIDNewHeap);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index dbfe0d6b1c..5d6151dad1 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -5841,7 +5841,7 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
* unlogged anyway.
*/
OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, NewAccessMethod,
- persistence, lockmode);
+ persistence, lockmode, NULL);
/*
* Copy the heap data into the new table with the desired
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 48f8eab202..0bd000acc5 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -2196,15 +2196,17 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
{
ClusterParams cluster_params = {0};
- /* close relation before vacuuming, but hold lock until commit */
- relation_close(rel, NoLock);
- rel = NULL;
-
if ((params->options & VACOPT_VERBOSE) != 0)
cluster_params.options |= CLUOPT_VERBOSE;
/* VACUUM FULL is now a variant of CLUSTER; see cluster.c */
- cluster_rel(relid, InvalidOid, &cluster_params);
+ cluster_rel(rel, InvalidOid, &cluster_params);
+
+ /*
+ * cluster_rel() should have closed the relation; the lock is kept
+ * until commit.
+ */
+ rel = NULL;
}
else
table_relation_vacuum(rel, params, bstrategy);
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 4e32380417..7492796ea2 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -32,13 +32,14 @@ typedef struct ClusterParams
} ClusterParams;
extern void cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel);
-extern void cluster_rel(Oid tableOid, Oid indexOid, ClusterParams *params);
+extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params);
extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid,
LOCKMODE lockmode);
extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
extern Oid make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
- char relpersistence, LOCKMODE lockmode);
+ char relpersistence, LOCKMODE lockmode_old,
+ LOCKMODE *lockmode_new_p);
extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
bool is_system_catalog,
bool swap_toast_by_content,
--
2.45.2
v01-0002-Move-progress-related-fields-from-PgBackendStatus-to.patch

From cdf67d933a56323c0e5ca77495f60017d398bbd5 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Tue, 9 Jul 2024 17:45:59 +0200
Subject: [PATCH 2/4] Move progress related fields from PgBackendStatus to
PgBackendProgress.
VACUUM FULL / CLUSTER CONCURRENTLY will need to save and restore these fields
at some point.
---
src/backend/utils/activity/backend_progress.c | 18 +++++++++---------
src/backend/utils/activity/backend_status.c | 4 ++--
src/backend/utils/adt/pgstatfuncs.c | 6 +++---
src/include/utils/backend_progress.h | 14 ++++++++++++++
src/include/utils/backend_status.h | 14 ++------------
5 files changed, 30 insertions(+), 26 deletions(-)
diff --git a/src/backend/utils/activity/backend_progress.c b/src/backend/utils/activity/backend_progress.c
index bfb9b7704b..e7c8bfba94 100644
--- a/src/backend/utils/activity/backend_progress.c
+++ b/src/backend/utils/activity/backend_progress.c
@@ -33,9 +33,9 @@ pgstat_progress_start_command(ProgressCommandType cmdtype, Oid relid)
return;
PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
- beentry->st_progress_command = cmdtype;
- beentry->st_progress_command_target = relid;
- MemSet(&beentry->st_progress_param, 0, sizeof(beentry->st_progress_param));
+ beentry->st_progress.command = cmdtype;
+ beentry->st_progress.command_target = relid;
+ MemSet(&beentry->st_progress.param, 0, sizeof(beentry->st_progress.param));
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
@@ -56,7 +56,7 @@ pgstat_progress_update_param(int index, int64 val)
return;
PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
- beentry->st_progress_param[index] = val;
+ beentry->st_progress.param[index] = val;
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
@@ -77,7 +77,7 @@ pgstat_progress_incr_param(int index, int64 incr)
return;
PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
- beentry->st_progress_param[index] += incr;
+ beentry->st_progress.param[index] += incr;
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
@@ -134,7 +134,7 @@ pgstat_progress_update_multi_param(int nparam, const int *index,
{
Assert(index[i] >= 0 && index[i] < PGSTAT_NUM_PROGRESS_PARAM);
- beentry->st_progress_param[index[i]] = val[i];
+ beentry->st_progress.param[index[i]] = val[i];
}
PGSTAT_END_WRITE_ACTIVITY(beentry);
@@ -155,11 +155,11 @@ pgstat_progress_end_command(void)
if (!beentry || !pgstat_track_activities)
return;
- if (beentry->st_progress_command == PROGRESS_COMMAND_INVALID)
+ if (beentry->st_progress.command == PROGRESS_COMMAND_INVALID)
return;
PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
- beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
- beentry->st_progress_command_target = InvalidOid;
+ beentry->st_progress.command = PROGRESS_COMMAND_INVALID;
+ beentry->st_progress.command_target = InvalidOid;
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 1ccf4c6d83..b54a35d91c 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -378,8 +378,8 @@ pgstat_bestart(void)
#endif
lbeentry.st_state = STATE_UNDEFINED;
- lbeentry.st_progress_command = PROGRESS_COMMAND_INVALID;
- lbeentry.st_progress_command_target = InvalidOid;
+ lbeentry.st_progress.command = PROGRESS_COMMAND_INVALID;
+ lbeentry.st_progress.command_target = InvalidOid;
lbeentry.st_query_id = UINT64CONST(0);
/*
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 3876339ee1..fe09ae8f63 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -269,7 +269,7 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
* Report values for only those backends which are running the given
* command.
*/
- if (beentry->st_progress_command != cmdtype)
+ if (beentry->st_progress.command != cmdtype)
continue;
/* Value available to all callers */
@@ -279,9 +279,9 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
/* show rest of the values including relid only to role members */
if (HAS_PGSTAT_PERMISSIONS(beentry->st_userid))
{
- values[2] = ObjectIdGetDatum(beentry->st_progress_command_target);
+ values[2] = ObjectIdGetDatum(beentry->st_progress.command_target);
for (i = 0; i < PGSTAT_NUM_PROGRESS_PARAM; i++)
- values[i + 3] = Int64GetDatum(beentry->st_progress_param[i]);
+ values[i + 3] = Int64GetDatum(beentry->st_progress.param[i]);
}
else
{
diff --git a/src/include/utils/backend_progress.h b/src/include/utils/backend_progress.h
index 7b63d38f97..e09598eafc 100644
--- a/src/include/utils/backend_progress.h
+++ b/src/include/utils/backend_progress.h
@@ -30,8 +30,22 @@ typedef enum ProgressCommandType
PROGRESS_COMMAND_COPY,
} ProgressCommandType;
+
#define PGSTAT_NUM_PROGRESS_PARAM 20
+/*
+ * Any command which wishes can advertise that it is running by setting
+ * command, command_target, and param[]. command_target should be the OID of
+ * the relation which the command targets (we assume there's just one, as this
+ * is meant for utility commands), but the meaning of each element in the
+ * param array is command-specific.
+ */
+typedef struct PgBackendProgress
+{
+ ProgressCommandType command;
+ Oid command_target;
+ int64 param[PGSTAT_NUM_PROGRESS_PARAM];
+} PgBackendProgress;
extern void pgstat_progress_start_command(ProgressCommandType cmdtype,
Oid relid);
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 7b7f6f59d0..11cdf7f95a 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -155,18 +155,8 @@ typedef struct PgBackendStatus
*/
char *st_activity_raw;
- /*
- * Command progress reporting. Any command which wishes can advertise
- * that it is running by setting st_progress_command,
- * st_progress_command_target, and st_progress_param[].
- * st_progress_command_target should be the OID of the relation which the
- * command targets (we assume there's just one, as this is meant for
- * utility commands), but the meaning of each element in the
- * st_progress_param array is command-specific.
- */
- ProgressCommandType st_progress_command;
- Oid st_progress_command_target;
- int64 st_progress_param[PGSTAT_NUM_PROGRESS_PARAM];
+ /* Command progress reporting. */
+ PgBackendProgress st_progress;
/* query identifier, optionally computed using post_parse_analyze_hook */
uint64 st_query_id;
--
2.45.2
v01-0003-Add-CONCURRENTLY-option-to-both-VACUUM-FULL-and-CLUS.patch

From 1cb536663c018d98faf349a680b773364b464026 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Tue, 9 Jul 2024 17:45:59 +0200
Subject: [PATCH 3/4] Add CONCURRENTLY option to both VACUUM FULL and CLUSTER
commands.
Both VACUUM FULL and CLUSTER commands copy the relation data into a new file,
create new indexes and eventually swap the files. To make sure that the old
file does not change during the copying, the relation is locked in an
exclusive mode, which prevents applications from both reading and writing. (To
keep the data consistent, we'd only need to prevent the applications from
writing, but even reading needs to be blocked before we can swap the files -
otherwise some applications could continue using the old file. Since we cannot
get stronger lock without releasing the weaker one first, we acquire the
exclusive lock in the beginning and keep it till the end of the processing.)
This patch introduces an alternative workflow, which only requires the
exclusive lock when the relation (and index) files are being swapped.
(Supposedly, the swapping should be pretty fast.) On the other hand, when we
copy the data to the new file, we allow applications to read from the relation
and even write into it.
First, we scan the relation using a "historic snapshot", and insert all the
tuples satisfying this snapshot into the new file. Note that, before creating
that snapshot, we need to make sure that all the other backends treat the
relation as a system catalog: in particular, they must log information on new
command IDs (CIDs). We achieve that by adding the relation ID into a shared
hash table and waiting until all the transactions currently writing into the
table (i.e. transactions possibly not aware of the new entry) have finished.
Second, logical decoding is used to capture the data changes done by
applications during the copying (i.e. changes that do not satisfy the historic
snapshot mentioned above), and those are applied to the new file before we
acquire the exclusive lock we need to swap the files. (Of course, more data
changes can take place while we are waiting for the lock - these will be
applied to the new file after we have acquired the lock, before we swap the
files.)
While copying the data into the new file, we hold a lock that prevents
applications from changing the relation tuple descriptor (tuples inserted into
the old file must fit into the new file). However, as we have to release that
lock before getting the exclusive one, it's possible that someone adds or
drops a column, or changes the data type of an existing one. Therefore we have
to check the tuple descriptor before we swap the files. If we find out that
the tuple descriptor changed, ERROR is raised and all the changes are rolled
back. Since a lot of effort can be wasted in such a case, the ALTER TABLE
command also tries to check if VACUUM FULL / CLUSTER with the CONCURRENTLY
option is running on the same relation, and raises an ERROR if it is.
Like the existing implementation of both VACUUM FULL and CLUSTER commands, the
variant with the CONCURRENTLY option also requires an extra space for the new
relation and index files (which coexist with the old files for some time). In
addition, the CONCURRENTLY option might introduce a lag in releasing WAL
segments for archiving / recycling. This is due to the decoding of the data
changes done concurrently by applications. However, this lag should not be more
than a single WAL segment.
---
doc/src/sgml/monitoring.sgml | 36 +-
doc/src/sgml/ref/cluster.sgml | 114 +-
doc/src/sgml/ref/vacuum.sgml | 27 +-
src/Makefile | 1 +
src/backend/access/common/toast_internals.c | 3 +-
src/backend/access/heap/heapam.c | 80 +-
src/backend/access/heap/heapam_handler.c | 155 +-
src/backend/access/heap/heapam_visibility.c | 30 +-
src/backend/access/transam/xact.c | 52 +
src/backend/catalog/index.c | 43 +-
src/backend/catalog/system_views.sql | 17 +-
src/backend/commands/cluster.c | 2618 ++++++++++++++++-
src/backend/commands/matview.c | 2 +-
src/backend/commands/tablecmds.c | 11 +
src/backend/commands/vacuum.c | 137 +-
src/backend/replication/logical/decode.c | 58 +-
src/backend/replication/logical/snapbuild.c | 87 +-
.../replication/pgoutput_cluster/Makefile | 32 +
.../replication/pgoutput_cluster/meson.build | 18 +
.../pgoutput_cluster/pgoutput_cluster.c | 321 ++
src/backend/storage/ipc/ipci.c | 3 +
src/backend/tcop/utility.c | 11 +
src/backend/utils/activity/backend_progress.c | 16 +
.../utils/activity/wait_event_names.txt | 1 +
src/backend/utils/cache/inval.c | 22 +
src/backend/utils/cache/relcache.c | 5 +
src/backend/utils/time/snapmgr.c | 6 +-
src/bin/psql/tab-complete.c | 5 +-
src/include/access/heapam.h | 19 +-
src/include/access/heapam_xlog.h | 2 +
src/include/access/tableam.h | 10 +
src/include/access/xact.h | 2 +
src/include/catalog/index.h | 3 +
src/include/commands/cluster.h | 117 +-
src/include/commands/progress.h | 17 +-
src/include/commands/vacuum.h | 17 +-
src/include/replication/snapbuild.h | 2 +
src/include/storage/lockdefs.h | 2 +-
src/include/storage/lwlocklist.h | 1 +
src/include/utils/backend_progress.h | 3 +-
src/include/utils/inval.h | 2 +
src/include/utils/rel.h | 7 +-
src/include/utils/snapmgr.h | 3 +
src/test/regress/expected/rules.out | 17 +-
44 files changed, 3876 insertions(+), 259 deletions(-)
create mode 100644 src/backend/replication/pgoutput_cluster/Makefile
create mode 100644 src/backend/replication/pgoutput_cluster/meson.build
create mode 100644 src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 991f629907..fe1ba36f40 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5567,14 +5567,35 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>heap_tuples_written</structfield> <type>bigint</type>
+ <structfield>heap_tuples_inserted</structfield> <type>bigint</type>
</para>
<para>
- Number of heap tuples written.
+ Number of heap tuples inserted.
This counter only advances when the phase is
<literal>seq scanning heap</literal>,
- <literal>index scanning heap</literal>
- or <literal>writing new heap</literal>.
+ <literal>index scanning heap</literal>,
+ <literal>writing new heap</literal>
+ or <literal>catch-up</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_updated</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples updated.
+ This counter only advances when the phase is <literal>catch-up</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_deleted</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples deleted.
+ This counter only advances when the phase is <literal>catch-up</literal>.
</para></entry>
</row>
@@ -5655,6 +5676,13 @@ FROM pg_stat_get_backend_idset() AS backendid;
<command>CLUSTER</command> is currently writing the new heap.
</entry>
</row>
+ <row>
+ <entry><literal>catch-up</literal></entry>
+ <entry>
+ <command>CLUSTER</command> is currently processing the DML commands
+ that other transactions executed during any of the preceding phases.
+ </entry>
+ </row>
<row>
<entry><literal>swapping relation files</literal></entry>
<entry>
diff --git a/doc/src/sgml/ref/cluster.sgml b/doc/src/sgml/ref/cluster.sgml
index c5760244e6..0fe4e9603b 100644
--- a/doc/src/sgml/ref/cluster.sgml
+++ b/doc/src/sgml/ref/cluster.sgml
@@ -26,6 +26,7 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
<phrase>where <replaceable class="parameter">option</replaceable> can be one of:</phrase>
VERBOSE [ <replaceable class="parameter">boolean</replaceable> ]
+ CONCURRENTLY [ <replaceable class="parameter">boolean</replaceable> ]
</synopsis>
</refsynopsisdiv>
@@ -69,14 +70,18 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
<replaceable class="parameter">table_name</replaceable> reclusters all the
previously-clustered tables in the current database that the calling user
has privileges for. This form of <command>CLUSTER</command> cannot be
- executed inside a transaction block.
+ executed inside a transaction block. Also, if
+ the <literal>CONCURRENTLY</literal> option is used with this form, system
+ catalogs and <acronym>TOAST</acronym> tables are not processed.
</para>
<para>
- When a table is being clustered, an <literal>ACCESS
- EXCLUSIVE</literal> lock is acquired on it. This prevents any other
- database operations (both reads and writes) from operating on the
- table until the <command>CLUSTER</command> is finished.
+ When a table is being clustered, an <literal>ACCESS EXCLUSIVE</literal>
+ lock is acquired on it. This prevents any other database operations (both
+ reads and writes) from operating on the table until
+ the <command>CLUSTER</command> is finished. If you want to keep the table
+ accessible during the clustering, consider using
+ the <literal>CONCURRENTLY</literal> option.
</para>
</refsect1>
@@ -111,6 +116,105 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><literal>CONCURRENTLY</literal></term>
+ <listitem>
+ <para>
+ Allow other transactions to use the table while it is being clustered.
+ </para>
+
+ <para>
+ Internally, <command>CLUSTER</command> copies the contents of the table
+ (ignoring dead tuples) into a new file, sorted by the specified index,
+ and also creates a new file for each index. Then it swaps the old and
+ new files for the table and all the indexes, and deletes the old
+ files. The <literal>ACCESS EXCLUSIVE</literal> lock is needed to make
+ sure that the old files do not change during the processing because the
+ changes would get lost due to the swap.
+ </para>
+
+ <para>
+ With the <literal>CONCURRENTLY</literal> option, the <literal>ACCESS
+ EXCLUSIVE</literal> lock is only acquired to swap the table and index
+ files. The data changes that took place during the creation of the new
+ table and index files are captured using logical decoding
+ (<xref linkend="logicaldecoding"/>) and applied before
+ the <literal>ACCESS EXCLUSIVE</literal> lock is requested. Thus the lock
+ is typically held only for the time needed to swap the files, which
+ should be pretty short. However, the time might still be noticeable
+ if too many data changes have been done to the table
+ while <command>CLUSTER</command> was waiting for the lock: those changes
+ must be processed before the files are swapped.
+ </para>
+
+ <para>
+ Note that <command>CLUSTER</command> with the
+ <literal>CONCURRENTLY</literal> option does not try to order the
+ rows inserted into the table after the clustering started. Also
+ note that <command>CLUSTER</command> might fail to complete due to DDL
+ commands executed on the table by other transactions during the
+ clustering.
+ </para>
+
+ <note>
+ <para>
+ In addition to the temporary space requirements explained below,
+ the <literal>CONCURRENTLY</literal> option can add a bit more to the
+ usage of temporary space. The reason is that other transactions can
+ perform DML operations which cannot be applied to the new file until
+ <command>CLUSTER</command> has copied all the tuples from the old
+ file. Thus the tuples inserted into the old file during the copying are
+ also stored in separately in a temporary file, so they can eventually
+ be applied to the new file.
+ </para>
+
+ <para>
+ Furthermore, the data changes performed during the copying are
+ extracted from <link linkend="wal">write-ahead log</link> (WAL), and
+ this extraction (decoding) only takes place when a certain amount of WAL
+ has been written. Therefore, WAL removal can be delayed by this
+ threshold. Currently the threshold is equal to the value of
+ the <link linkend="guc-wal-segment-size"><varname>wal_segment_size</varname></link>
+ configuration parameter.
+ </para>
+ </note>
+
+ <para>
+ The <literal>CONCURRENTLY</literal> option cannot be used in the
+ following cases:
+
+ <itemizedlist>
+ <listitem>
+ <para>
+ The table is a system catalog or a <acronym>TOAST</acronym> table.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <command>CLUSTER</command> is executed inside a transaction block.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The <link linkend="guc-wal-level"><varname>wal_level</varname></link>
+ configuration parameter is less than <literal>logical</literal>.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The <link linkend="guc-max-replication-slots"><varname>max_replication_slots</varname></link>
+ configuration parameter does not allow for creation of an additional
+ replication slot.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><replaceable class="parameter">boolean</replaceable></term>
<listitem>
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index 9857b35627..298cf7298d 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -39,6 +39,7 @@ VACUUM [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
SKIP_DATABASE_STATS [ <replaceable class="parameter">boolean</replaceable> ]
ONLY_DATABASE_STATS [ <replaceable class="parameter">boolean</replaceable> ]
BUFFER_USAGE_LIMIT <replaceable class="parameter">size</replaceable>
+ CONCURRENTLY [ <replaceable class="parameter">boolean</replaceable> ]
<phrase>and <replaceable class="parameter">table_and_columns</replaceable> is:</phrase>
@@ -61,8 +62,12 @@ VACUUM [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
<para>
Without a <replaceable class="parameter">table_and_columns</replaceable>
list, <command>VACUUM</command> processes every table and materialized view
- in the current database that the current user has permission to vacuum.
- With a list, <command>VACUUM</command> processes only those table(s).
+ in the current database that the current user has permission to vacuum. If
+ the <literal>CONCURRENTLY</literal> option is specified (see below), tables which
+ have not been clustered yet are silently skipped. With a
+ list, <command>VACUUM</command> processes only those table(s). If
+ the <literal>CONCURRENTLY</literal> option is specified, the list may only contain
+ tables which have already been clustered.
</para>
<para>
@@ -360,6 +365,24 @@ VACUUM [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><literal>CONCURRENTLY</literal></term>
+ <listitem>
+ <para>
+ Allow other transactions to use the table while it is being vacuumed. If
+ this option is specified, <command>VACUUM</command> can only process
+ tables which have already been clustered. For more information, see the
+ description of the <literal>CONCURRENTLY</literal> option of the
+ <xref linkend="sql-cluster"/> command.
+ </para>
+
+ <para>
+ The <literal>CONCURRENTLY</literal> option can only be used
+ if <literal>FULL</literal> is used at the same time.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><replaceable class="parameter">boolean</replaceable></term>
<listitem>
diff --git a/src/Makefile b/src/Makefile
index 2f31a2f20a..8b9d30ff72 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -23,6 +23,7 @@ SUBDIRS = \
interfaces \
backend/replication/libpqwalreceiver \
backend/replication/pgoutput \
+ backend/replication/pgoutput_cluster \
fe_utils \
bin \
pl \
diff --git a/src/backend/access/common/toast_internals.c b/src/backend/access/common/toast_internals.c
index 90d0654e62..183055647b 100644
--- a/src/backend/access/common/toast_internals.c
+++ b/src/backend/access/common/toast_internals.c
@@ -320,7 +320,8 @@ toast_save_datum(Relation rel, Datum value,
memcpy(VARDATA(&chunk_data), data_p, chunk_size);
toasttup = heap_form_tuple(toasttupDesc, t_values, t_isnull);
- heap_insert(toastrel, toasttup, mycid, options, NULL);
+ heap_insert(toastrel, toasttup, GetCurrentTransactionId(), mycid,
+ options, NULL);
/*
* Create the index entry. We cheat a little here by not using
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 91b20147a0..493c351d7f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -75,7 +75,8 @@ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
HeapTuple newtup, HeapTuple old_key_tuple,
- bool all_visible_cleared, bool new_all_visible_cleared);
+ bool all_visible_cleared, bool new_all_visible_cleared,
+ bool wal_logical);
static Bitmapset *HeapDetermineColumnsInfo(Relation relation,
Bitmapset *interesting_cols,
Bitmapset *external_cols,
@@ -1975,7 +1976,7 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
/*
* heap_insert - insert tuple into a heap
*
- * The new tuple is stamped with current transaction ID and the specified
+ * The new tuple is stamped with specified transaction ID and the specified
* command ID.
*
* See table_tuple_insert for comments about most of the input flags, except
@@ -1991,15 +1992,16 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
* reflected into *tup.
*/
void
-heap_insert(Relation relation, HeapTuple tup, CommandId cid,
- int options, BulkInsertState bistate)
+heap_insert(Relation relation, HeapTuple tup, TransactionId xid,
+ CommandId cid, int options, BulkInsertState bistate)
{
- TransactionId xid = GetCurrentTransactionId();
HeapTuple heaptup;
Buffer buffer;
Buffer vmbuffer = InvalidBuffer;
bool all_visible_cleared = false;
+ Assert(TransactionIdIsValid(xid));
+
/* Cheap, simplistic check that the tuple matches the rel's rowtype. */
Assert(HeapTupleHeaderGetNatts(tup->t_data) <=
RelationGetNumberOfAttributes(relation));
@@ -2079,8 +2081,13 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
/*
* If this is a catalog, we need to transmit combo CIDs to properly
* decode, so log that as well.
+ *
+ * Currently we only pass HEAP_INSERT_NO_LOGICAL when doing VACUUM
+ * FULL / CLUSTER, in which case the visibility information does not
+ * change. Therefore, there's no need to update the decoding snapshot.
*/
- if (RelationIsAccessibleInLogicalDecoding(relation))
+ if ((options & HEAP_INSERT_NO_LOGICAL) == 0 &&
+ RelationIsAccessibleInLogicalDecoding(relation))
log_heap_new_cid(relation, heaptup);
/*
@@ -2624,7 +2631,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
void
simple_heap_insert(Relation relation, HeapTuple tup)
{
- heap_insert(relation, tup, GetCurrentCommandId(true), 0, NULL);
+ heap_insert(relation, tup, GetCurrentTransactionId(),
+ GetCurrentCommandId(true), 0, NULL);
}
/*
@@ -2681,11 +2689,11 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
*/
TM_Result
heap_delete(Relation relation, ItemPointer tid,
- CommandId cid, Snapshot crosscheck, bool wait,
- TM_FailureData *tmfd, bool changingPart)
+ TransactionId xid, CommandId cid, Snapshot crosscheck, bool wait,
+ TM_FailureData *tmfd, bool changingPart,
+ bool wal_logical)
{
TM_Result result;
- TransactionId xid = GetCurrentTransactionId();
ItemId lp;
HeapTupleData tp;
Page page;
@@ -2702,6 +2710,7 @@ heap_delete(Relation relation, ItemPointer tid,
bool old_key_copied = false;
Assert(ItemPointerIsValid(tid));
+ Assert(TransactionIdIsValid(xid));
/*
* Forbid this during a parallel operation, lest it allocate a combo CID.
@@ -2927,7 +2936,8 @@ l1:
* Compute replica identity tuple before entering the critical section so
* we don't PANIC upon a memory allocation failure.
*/
- old_key_tuple = ExtractReplicaIdentity(relation, &tp, true, &old_key_copied);
+ old_key_tuple = wal_logical ?
+ ExtractReplicaIdentity(relation, &tp, true, &old_key_copied) : NULL;
/*
* If this is the first possibly-multixact-able operation in the current
@@ -2995,8 +3005,12 @@ l1:
/*
* For logical decode we need combo CIDs to properly decode the
* catalog
+ *
+ * Like in heap_insert(), visibility is unchanged when called from
+ * VACUUM FULL / CLUSTER.
*/
- if (RelationIsAccessibleInLogicalDecoding(relation))
+ if (wal_logical &&
+ RelationIsAccessibleInLogicalDecoding(relation))
log_heap_new_cid(relation, &tp);
xlrec.flags = 0;
@@ -3017,6 +3031,15 @@ l1:
xlrec.flags |= XLH_DELETE_CONTAINS_OLD_KEY;
}
+ /*
+ * Unlike UPDATE, DELETE is decoded even if there is no old key, so it
+ * does not help to clear both XLH_DELETE_CONTAINS_OLD_TUPLE and
+ * XLH_DELETE_CONTAINS_OLD_KEY. Thus we need an extra flag. TODO
+ * Consider not decoding tuples w/o the old tuple/key instead.
+ */
+ if (!wal_logical)
+ xlrec.flags |= XLH_DELETE_NO_LOGICAL;
+
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapDelete);
@@ -3106,10 +3129,11 @@ simple_heap_delete(Relation relation, ItemPointer tid)
TM_Result result;
TM_FailureData tmfd;
- result = heap_delete(relation, tid,
+ result = heap_delete(relation, tid, GetCurrentTransactionId(),
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &tmfd, false /* changingPart */ );
+ &tmfd, false /* changingPart */ ,
+ true /* wal_logical */);
switch (result)
{
case TM_SelfModified:
@@ -3148,12 +3172,11 @@ simple_heap_delete(Relation relation, ItemPointer tid)
*/
TM_Result
heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
- CommandId cid, Snapshot crosscheck, bool wait,
- TM_FailureData *tmfd, LockTupleMode *lockmode,
- TU_UpdateIndexes *update_indexes)
+ TransactionId xid, CommandId cid, Snapshot crosscheck,
+ bool wait, TM_FailureData *tmfd, LockTupleMode *lockmode,
+ TU_UpdateIndexes *update_indexes, bool wal_logical)
{
TM_Result result;
- TransactionId xid = GetCurrentTransactionId();
Bitmapset *hot_attrs;
Bitmapset *sum_attrs;
Bitmapset *key_attrs;
@@ -3193,6 +3216,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
infomask2_new_tuple;
Assert(ItemPointerIsValid(otid));
+ Assert(TransactionIdIsValid(xid));
/* Cheap, simplistic check that the tuple matches the rel's rowtype. */
Assert(HeapTupleHeaderGetNatts(newtup->t_data) <=
@@ -3981,8 +4005,12 @@ l2:
/*
* For logical decoding we need combo CIDs to properly decode the
* catalog.
+ *
+ * Like in heap_insert(), visibility is unchanged when called from
+ * VACUUM FULL / CLUSTER.
*/
- if (RelationIsAccessibleInLogicalDecoding(relation))
+ if (wal_logical &&
+ RelationIsAccessibleInLogicalDecoding(relation))
{
log_heap_new_cid(relation, &oldtup);
log_heap_new_cid(relation, heaptup);
@@ -3992,7 +4020,8 @@ l2:
newbuf, &oldtup, heaptup,
old_key_tuple,
all_visible_cleared,
- all_visible_cleared_new);
+ all_visible_cleared_new,
+ wal_logical);
if (newbuf != buffer)
{
PageSetLSN(BufferGetPage(newbuf), recptr);
@@ -4225,10 +4254,10 @@ simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup,
TM_FailureData tmfd;
LockTupleMode lockmode;
- result = heap_update(relation, otid, tup,
+ result = heap_update(relation, otid, tup, GetCurrentTransactionId(),
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &tmfd, &lockmode, update_indexes);
+ &tmfd, &lockmode, update_indexes, true);
switch (result)
{
case TM_SelfModified:
@@ -8357,7 +8386,8 @@ static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup, HeapTuple newtup,
HeapTuple old_key_tuple,
- bool all_visible_cleared, bool new_all_visible_cleared)
+ bool all_visible_cleared, bool new_all_visible_cleared,
+ bool wal_logical)
{
xl_heap_update xlrec;
xl_heap_header xlhdr;
@@ -8368,10 +8398,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
suffixlen = 0;
XLogRecPtr recptr;
Page page = BufferGetPage(newbuf);
- bool need_tuple_data = RelationIsLogicallyLogged(reln);
+ bool need_tuple_data;
bool init;
int bufflags;
+ need_tuple_data = RelationIsLogicallyLogged(reln) && wal_logical;
+
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 6f8b1b7929..02fd6d2983 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -33,6 +33,7 @@
#include "catalog/index.h"
#include "catalog/storage.h"
#include "catalog/storage_xlog.h"
+#include "commands/cluster.h"
#include "commands/progress.h"
#include "executor/executor.h"
#include "miscadmin.h"
@@ -53,6 +54,9 @@ static void reform_and_rewrite_tuple(HeapTuple tuple,
static bool SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
HeapTuple tuple,
OffsetNumber tupoffset);
+static bool accept_tuple_for_concurrent_copy(HeapTuple tuple,
+ Snapshot snapshot,
+ Buffer buffer);
static BlockNumber heapam_scan_get_blocks_done(HeapScanDesc hscan);
@@ -250,7 +254,8 @@ heapam_tuple_insert(Relation relation, TupleTableSlot *slot, CommandId cid,
tuple->t_tableOid = slot->tts_tableOid;
/* Perform the insertion, and copy the resulting ItemPointer */
- heap_insert(relation, tuple, cid, options, bistate);
+ heap_insert(relation, tuple, GetCurrentTransactionId(), cid, options,
+ bistate);
ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
if (shouldFree)
@@ -273,7 +278,8 @@ heapam_tuple_insert_speculative(Relation relation, TupleTableSlot *slot,
options |= HEAP_INSERT_SPECULATIVE;
/* Perform the insertion, and copy the resulting ItemPointer */
- heap_insert(relation, tuple, cid, options, bistate);
+ heap_insert(relation, tuple, GetCurrentTransactionId(), cid, options,
+ bistate);
ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
if (shouldFree)
@@ -307,7 +313,8 @@ heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
* the storage itself is cleaning the dead tuples by itself, it is the
* time to call the index tuple deletion also.
*/
- return heap_delete(relation, tid, cid, crosscheck, wait, tmfd, changingPart);
+ return heap_delete(relation, tid, GetCurrentTransactionId(), cid,
+ crosscheck, wait, tmfd, changingPart, true);
}
@@ -325,8 +332,9 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
slot->tts_tableOid = RelationGetRelid(relation);
tuple->t_tableOid = slot->tts_tableOid;
- result = heap_update(relation, otid, tuple, cid, crosscheck, wait,
- tmfd, lockmode, update_indexes);
+ result = heap_update(relation, otid, tuple, GetCurrentTransactionId(),
+ cid, crosscheck, wait,
+ tmfd, lockmode, update_indexes, true);
ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
/*
@@ -686,6 +694,8 @@ static void
heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
Relation OldIndex, bool use_sort,
TransactionId OldestXmin,
+ Snapshot snapshot,
+ LogicalDecodingContext *decoding_ctx,
TransactionId *xid_cutoff,
MultiXactId *multi_cutoff,
double *num_tuples,
@@ -706,6 +716,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
bool *isnull;
BufferHeapTupleTableSlot *hslot;
BlockNumber prev_cblock = InvalidBlockNumber;
+ bool concurrent = snapshot != NULL;
+ XLogRecPtr end_of_wal_prev = GetFlushRecPtr(NULL);
/* Remember if it's a system catalog */
is_system_catalog = IsSystemRelation(OldHeap);
@@ -786,6 +798,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
HeapTuple tuple;
Buffer buf;
bool isdead;
+ HTSV_Result vis;
CHECK_FOR_INTERRUPTS();
@@ -840,7 +853,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
LockBuffer(buf, BUFFER_LOCK_SHARE);
- switch (HeapTupleSatisfiesVacuum(tuple, OldestXmin, buf))
+ switch ((vis = HeapTupleSatisfiesVacuum(tuple, OldestXmin, buf)))
{
case HEAPTUPLE_DEAD:
/* Definitely dead */
@@ -856,14 +869,15 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
case HEAPTUPLE_INSERT_IN_PROGRESS:
/*
- * Since we hold exclusive lock on the relation, normally the
- * only way to see this is if it was inserted earlier in our
- * own transaction. However, it can happen in system
+ * As long as we hold exclusive lock on the relation, normally
+ * the only way to see this is if it was inserted earlier in
+ * our own transaction. However, it can happen in system
* catalogs, since we tend to release write lock before commit
- * there. Give a warning if neither case applies; but in any
- * case we had better copy it.
+ * there. Also, there's no exclusive lock during concurrent
+ * processing. Give a warning if neither case applies; but in
+ * any case we had better copy it.
*/
- if (!is_system_catalog &&
+ if (!is_system_catalog && !concurrent &&
!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(tuple->t_data)))
elog(WARNING, "concurrent insert in progress within table \"%s\"",
RelationGetRelationName(OldHeap));
@@ -875,7 +889,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
/*
* Similar situation to INSERT_IN_PROGRESS case.
*/
- if (!is_system_catalog &&
+ if (!is_system_catalog && !concurrent &&
!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetUpdateXid(tuple->t_data)))
elog(WARNING, "concurrent delete in progress within table \"%s\"",
RelationGetRelationName(OldHeap));
@@ -889,8 +903,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
break;
}
- LockBuffer(buf, BUFFER_LOCK_UNLOCK);
-
if (isdead)
{
*tups_vacuumed += 1;
@@ -901,9 +913,39 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
*tups_vacuumed += 1;
*tups_recently_dead -= 1;
}
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
continue;
}
+ /*
+ * Ignore concurrent changes now, they'll be processed later via
+ * logical decoding. INSERT_IN_PROGRESS is rejected right away because
+ * our snapshot represents a point in time that precedes (or equals)
+ * the state of transactions as it was when the
+ * "SatisfiesVacuum" test was performed. Thus
+ * accept_tuple_for_concurrent_copy() should not consider the tuple
+ * inserted.
+ */
+ if (concurrent &&
+ (vis == HEAPTUPLE_INSERT_IN_PROGRESS ||
+ !accept_tuple_for_concurrent_copy(tuple, snapshot, buf)))
+ {
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ continue;
+ }
+
+ /*
+ * In the concurrent case, we should not unlock the buffer until the
+ * tuple has been copied to the new file: if a concurrent transaction
+ * marked it updated or deleted in between, we'd fail to replay that
+ * transaction's changes because then we'd try to perform the same
+ * UPDATE / DELETE twice. XXX Should we instead create a copy of the
+ * tuple so that the buffer can be unlocked right away?
+ */
+ if (!concurrent)
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
*num_tuples += 1;
if (tuplesort != NULL)
{
@@ -920,7 +962,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
{
const int ct_index[] = {
PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED,
- PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN
+ PROGRESS_CLUSTER_HEAP_TUPLES_INSERTED
};
int64 ct_val[2];
@@ -935,6 +977,35 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
ct_val[1] = *num_tuples;
pgstat_progress_update_multi_param(2, ct_index, ct_val);
}
+
+ /* See the comment on unlocking above. */
+ if (concurrent)
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ /*
+ * Process the WAL produced by the load, as well as by other
+ * transactions, so that the replication slot can advance and WAL does
+ * not pile up. Use wal_segment_size as a threshold so that we do not
+ * introduce the decoding overhead too often.
+ *
+ * Of course, we must not apply the changes until the initial load has
+ * completed.
+ *
+ * Note that our insertions into the new table should not be decoded
+ * as we (intentionally) do not write the logical decoding specific
+ * information to WAL.
+ */
+ if (concurrent)
+ {
+ XLogRecPtr end_of_wal;
+
+ end_of_wal = GetFlushRecPtr(NULL);
+ if ((end_of_wal - end_of_wal_prev) > wal_segment_size)
+ {
+ cluster_decode_concurrent_changes(decoding_ctx, end_of_wal);
+ end_of_wal_prev = end_of_wal;
+ }
+ }
}
if (indexScan != NULL)
@@ -978,7 +1049,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
values, isnull,
rwstate);
/* Report n_tuples */
- pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN,
+ pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_TUPLES_INSERTED,
n_tuples);
}
@@ -2583,6 +2654,56 @@ SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
}
}
+/*
+ * Check if the tuple was inserted, updated or deleted while
+ * heapam_relation_copy_for_cluster() was copying the data.
+ *
+ * 'snapshot' is used to determine whether xmin/xmax was set by a transaction
+ * that is still in-progress, or one that started in the future from the
+ * snapshot perspective.
+ *
+ * Returns true if the insertion is visible to 'snapshot', but clears xmax if
+ * it was set by a transaction which is in-progress or in the future from the
+ * snapshot perspective. (The xmax will be set later, when we decode the
+ * corresponding UPDATE / DELETE from WAL.)
+ *
+ * Returns false if the insertion is not visible to 'snapshot'.
+ */
+static bool
+accept_tuple_for_concurrent_copy(HeapTuple tuple, Snapshot snapshot,
+ Buffer buffer)
+{
+ Assert(snapshot->snapshot_type == SNAPSHOT_MVCC);
+
+ /*
+ * First, check if the tuple should be rejected because it was inserted
+ * concurrently.
+ */
+ if (!HeapTupleMVCCInserted(tuple, snapshot, buffer))
+ return false;
+
+ /*
+ * If the tuple was deleted / updated but our snapshot still sees it, we
+ * need to keep it. In that case, clear the information that indicates the
+ * deletion / update. Otherwise the tuple chain would stay incomplete (as
+ * we will reject the new tuple above), and the delete / update would fail
+ * if executed later during logical decoding.
+ */
+ if (TransactionIdIsNormal(HeapTupleHeaderGetRawXmax(tuple->t_data)) &&
+ HeapTupleMVCCNotDeleted(tuple, snapshot, buffer))
+ {
+ /* TODO More work needed here? */
+ tuple->t_data->t_infomask |= HEAP_XMAX_INVALID;
+ HeapTupleHeaderSetXmax(tuple->t_data, 0);
+ }
+
+ /*
+ * Accept the tuple even if our snapshot considers it deleted - older
+ * snapshots can still see the tuple.
+ */
+ return true;
+}
+
/* ------------------------------------------------------------------------
* Definition of the heap table access method.
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index 9243feed01..d702592469 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -955,16 +955,31 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
* did TransactionIdIsInProgress in each call --- to no avail, as long as the
* inserting/deleting transaction was still running --- which was more cycles
* and more contention on ProcArrayLock.
+ *
+ * The checks are split into two functions, HeapTupleMVCCInserted() and
+ * HeapTupleMVCCNotDeleted(), because they are also useful separately.
*/
static bool
HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
Buffer buffer)
{
- HeapTupleHeader tuple = htup->t_data;
-
Assert(ItemPointerIsValid(&htup->t_self));
Assert(htup->t_tableOid != InvalidOid);
+ return HeapTupleMVCCInserted(htup, snapshot, buffer) &&
+ HeapTupleMVCCNotDeleted(htup, snapshot, buffer);
+}
+
+/*
+ * HeapTupleMVCCInserted
+ * True iff heap tuple was successfully inserted for the given MVCC
+ * snapshot.
+ */
+bool
+HeapTupleMVCCInserted(HeapTuple htup, Snapshot snapshot, Buffer buffer)
+{
+ HeapTupleHeader tuple = htup->t_data;
+
if (!HeapTupleHeaderXminCommitted(tuple))
{
if (HeapTupleHeaderXminInvalid(tuple))
@@ -1073,6 +1088,17 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
}
/* by here, the inserting transaction has committed */
+ return true;
+}
+
+/*
+ * HeapTupleMVCCNotDeleted
+ * True iff heap tuple was not deleted for the given MVCC snapshot.
+ */
+bool
+HeapTupleMVCCNotDeleted(HeapTuple htup, Snapshot snapshot, Buffer buffer)
+{
+ HeapTupleHeader tuple = htup->t_data;
if (tuple->t_infomask & HEAP_XMAX_INVALID) /* xid invalid or aborted */
return true;
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d119ab909d..f9b8cb4da7 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -124,6 +124,18 @@ static FullTransactionId XactTopFullTransactionId = {InvalidTransactionId};
static int nParallelCurrentXids = 0;
static TransactionId *ParallelCurrentXids;
+/*
+ * Another case that requires TransactionIdIsCurrentTransactionId() to behave
+ * specially is when CLUSTER CONCURRENTLY is processing data changes made in
+ * the old storage of a table by other transactions. When applying the changes
+ * to the new storage, the backend executing the CLUSTER command needs to act
+ * on behalf on those other transactions. The transactions responsible for the
+ * changes in the old storage are stored in this array, sorted by
+ * xidComparator.
+ */
+static int nClusterCurrentXids = 0;
+static TransactionId *ClusterCurrentXids = NULL;
+
/*
* Miscellaneous flag bits to record events which occur on the top level
* transaction. These flags are only persisted in MyXactFlags and are intended
@@ -970,6 +982,8 @@ TransactionIdIsCurrentTransactionId(TransactionId xid)
int low,
high;
+ Assert(nClusterCurrentXids == 0);
+
low = 0;
high = nParallelCurrentXids - 1;
while (low <= high)
@@ -989,6 +1003,21 @@ TransactionIdIsCurrentTransactionId(TransactionId xid)
return false;
}
+ /*
+ * When executing CLUSTER CONCURRENTLY, the array of current transactions
+ * is given.
+ */
+ if (nClusterCurrentXids > 0)
+ {
+ Assert(nParallelCurrentXids == 0);
+
+ return bsearch(&xid,
+ ClusterCurrentXids,
+ nClusterCurrentXids,
+ sizeof(TransactionId),
+ xidComparator) != NULL;
+ }
+
/*
* We will return true for the Xid of the current subtransaction, any of
* its subcommitted children, any of its parents, or any of their
@@ -5621,6 +5650,29 @@ EndParallelWorkerTransaction(void)
CurrentTransactionState->blockState = TBLOCK_DEFAULT;
}
+/*
+ * SetClusterCurrentXids
+ * Set the XID array that TransactionIdIsCurrentTransactionId() should
+ * use.
+ */
+void
+SetClusterCurrentXids(TransactionId *xip, int xcnt)
+{
+ ClusterCurrentXids = xip;
+ nClusterCurrentXids = xcnt;
+}
+
+/*
+ * ResetClusterCurrentXids
+ * Undo the effect of SetClusterCurrentXids().
+ */
+void
+ResetClusterCurrentXids(void)
+{
+ ClusterCurrentXids = NULL;
+ nClusterCurrentXids = 0;
+}
+
/*
* ShowTransactionState
* Debug support
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index a819b4197c..a25c84d7ae 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1415,22 +1415,7 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
opclassOptions[i] = get_attoptions(oldIndexId, i + 1);
- /* Extract statistic targets for each attribute */
- stattargets = palloc0_array(NullableDatum, newInfo->ii_NumIndexAttrs);
- for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
- {
- HeapTuple tp;
- Datum dat;
-
- tp = SearchSysCache2(ATTNUM, ObjectIdGetDatum(oldIndexId), Int16GetDatum(i + 1));
- if (!HeapTupleIsValid(tp))
- elog(ERROR, "cache lookup failed for attribute %d of relation %u",
- i + 1, oldIndexId);
- dat = SysCacheGetAttr(ATTNUM, tp, Anum_pg_attribute_attstattarget, &isnull);
- ReleaseSysCache(tp);
- stattargets[i].value = dat;
- stattargets[i].isnull = isnull;
- }
+ stattargets = get_index_stattargets(oldIndexId, newInfo);
/*
* Now create the new index.
@@ -1469,6 +1454,32 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
return newIndexId;
}
+NullableDatum *
+get_index_stattargets(Oid indexid, IndexInfo *indInfo)
+{
+ NullableDatum *stattargets;
+
+ /* Extract statistic targets for each attribute */
+ stattargets = palloc0_array(NullableDatum, indInfo->ii_NumIndexAttrs);
+ for (int i = 0; i < indInfo->ii_NumIndexAttrs; i++)
+ {
+ HeapTuple tp;
+ Datum dat;
+ bool isnull;
+
+ tp = SearchSysCache2(ATTNUM, ObjectIdGetDatum(indexid), Int16GetDatum(i + 1));
+ if (!HeapTupleIsValid(tp))
+ elog(ERROR, "cache lookup failed for attribute %d of relation %u",
+ i + 1, indexid);
+ dat = SysCacheGetAttr(ATTNUM, tp, Anum_pg_attribute_attstattarget, &isnull);
+ ReleaseSysCache(tp);
+ stattargets[i].value = dat;
+ stattargets[i].isnull = isnull;
+ }
+
+ return stattargets;
+}
+
/*
* index_concurrently_build
*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 19cabc9a47..fddab1cfa9 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1236,16 +1236,19 @@ CREATE VIEW pg_stat_progress_cluster AS
WHEN 2 THEN 'index scanning heap'
WHEN 3 THEN 'sorting tuples'
WHEN 4 THEN 'writing new heap'
- WHEN 5 THEN 'swapping relation files'
- WHEN 6 THEN 'rebuilding index'
- WHEN 7 THEN 'performing final cleanup'
+ WHEN 5 THEN 'catch-up'
+ WHEN 6 THEN 'swapping relation files'
+ WHEN 7 THEN 'rebuilding index'
+ WHEN 8 THEN 'performing final cleanup'
END AS phase,
CAST(S.param3 AS oid) AS cluster_index_relid,
S.param4 AS heap_tuples_scanned,
- S.param5 AS heap_tuples_written,
- S.param6 AS heap_blks_total,
- S.param7 AS heap_blks_scanned,
- S.param8 AS index_rebuild_count
+ S.param5 AS heap_tuples_inserted,
+ S.param6 AS heap_tuples_updated,
+ S.param7 AS heap_tuples_deleted,
+ S.param8 AS heap_blks_total,
+ S.param9 AS heap_blks_scanned,
+ S.param10 AS index_rebuild_count
FROM pg_stat_get_progress_info('CLUSTER') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 194d143cf4..6397f7f8c4 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -25,6 +25,10 @@
#include "access/toast_internals.h"
#include "access/transam.h"
#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xloginsert.h"
+#include "access/xlogutils.h"
#include "catalog/catalog.h"
#include "catalog/dependency.h"
#include "catalog/heap.h"
@@ -32,6 +36,7 @@
#include "catalog/namespace.h"
#include "catalog/objectaccess.h"
#include "catalog/pg_am.h"
+#include "catalog/pg_control.h"
#include "catalog/pg_database.h"
#include "catalog/pg_inherits.h"
#include "catalog/toasting.h"
@@ -40,10 +45,15 @@
#include "commands/progress.h"
#include "commands/tablecmds.h"
#include "commands/vacuum.h"
+#include "executor/executor.h"
#include "miscadmin.h"
#include "optimizer/optimizer.h"
#include "pgstat.h"
+#include "replication/decode.h"
+#include "replication/logical.h"
+#include "replication/snapbuild.h"
#include "storage/bufmgr.h"
+#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
#include "utils/acl.h"
@@ -57,6 +67,8 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
+typedef struct RewriteStateData *RewriteState;
+
/*
* This struct is used to pass around the information on tables to be
* clustered. We need this so we can make a list of them when invoked without
@@ -68,17 +80,175 @@ typedef struct
Oid indexOid;
} RelToCluster;
+/*
+ * The following definitions are used for concurrent processing.
+ */
+
+/*
+ * OID of the table being processed by CLUSTER CONCURRENTLY by this backend.
+ */
+static Oid clustered_rel = InvalidOid;
+/* The same for its TOAST relation. */
+static Oid clustered_rel_toast = InvalidOid;
+
+/* XXX Do we also need to mention VACUUM FULL CONCURRENTLY? */
+#define CLUSTER_IN_PROGRESS_MESSAGE \
+ "relation \"%s\" is already being processed by CLUSTER CONCURRENTLY"
+
+/*
+ * Everything we need to call ExecInsertIndexTuples().
+ */
+typedef struct IndexInsertState
+{
+ ResultRelInfo *rri;
+ EState *estate;
+ ExprContext *econtext;
+
+ Relation ident_index;
+} IndexInsertState;
-static void cluster_multiple_rels(List *rtcs, ClusterParams *params);
-static void rebuild_relation(Relation OldHeap, Relation index, bool verbose);
+/*
+ * Catalog information to check if another backend changed the relation in
+ * such a way that makes CLUSTER CONCURRENTLY unable to continue. Such changes
+ * are possible because cluster_rel() has to release its lock on the relation
+ * in order to acquire AccessExclusiveLock that it needs to swap the relation
+ * files.
+ *
+ * The most obvious problem is that the tuple descriptor has changed, since
+ * then the tuples we try to insert into the new storage are not guaranteed to
+ * fit into the storage.
+ *
+ * Another problem is that multiple backends might call cluster_rel(). This is
+ * not necessarily a correctness issue, but it definitely means wasted CPU
+ * time.
+ *
+ * Where possible, commands which might change the relation in an incompatible
+ * way should check if CLUSTER CONCURRENTLY is running, before they start to
+ * do the actual changes (see is_concurrent_cluster_in_progress()). Anything
+ * else must be caught by check_catalog_changes(), which uses this structure.
+ */
+typedef struct CatalogState
+{
+ /* Tuple descriptor of the relation. */
+ TupleDesc tupdesc;
+
+ /* The number of indexes tracked. */
+ int ninds;
+ /* The index OIDs. */
+ Oid *ind_oids;
+ /* The index tuple descriptors. */
+ TupleDesc *ind_tupdescs;
+
+ /* The following are copies of the corresponding fields of pg_class. */
+ Oid reltoastrelid;
+ char relpersistence;
+ char replident;
+
+ /* rd_replidindex */
+ Oid replidindex;
+} CatalogState;
+
+/* The WAL segment being decoded. */
+static XLogSegNo cluster_current_segment = 0;
+
+static void cluster_multiple_rels(List *rtcs, ClusterParams *params,
+ LOCKMODE lock_mode, bool isTopLevel);
+static void rebuild_relation(Relation OldHeap, Relation index, bool verbose,
+ bool concurrent);
static void copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
+ Snapshot snapshot, LogicalDecodingContext *decoding_ctx,
bool verbose, bool *pSwapToastByContent,
TransactionId *pFreezeXid, MultiXactId *pCutoffMulti);
static List *get_tables_to_cluster(MemoryContext cluster_context);
static List *get_tables_to_cluster_partitioned(MemoryContext cluster_context,
Oid indexOid);
static bool cluster_is_permitted_for_relation(Oid relid, Oid userid);
+static void check_concurrent_cluster_requirements(Relation rel,
+ bool isTopLevel,
+ bool isCluster);
+static void begin_concurrent_cluster(Relation *rel_p, Relation *index_p,
+ bool *entered_p);
+static void end_concurrent_cluster(Oid relid, bool error);
+static void cluster_before_shmem_exit_callback(int code, Datum arg);
+static CatalogState *get_catalog_state(Relation rel);
+static void free_catalog_state(CatalogState *state);
+static void check_catalog_changes(Relation rel, CatalogState *cat_state);
+static LogicalDecodingContext *setup_logical_decoding(Oid relid,
+ const char *slotname,
+ TupleDesc tupdesc);
+static HeapTuple get_changed_tuple(ConcurrentChange *change);
+static void apply_concurrent_changes(ClusterDecodingState *dstate,
+ Relation rel, ScanKey key, int nkeys,
+ IndexInsertState *iistate);
+static void apply_concurrent_insert(Relation rel, ConcurrentChange *change,
+ HeapTuple tup, IndexInsertState *iistate,
+ TupleTableSlot *index_slot);
+static void apply_concurrent_update(Relation rel, HeapTuple tup,
+ HeapTuple tup_target,
+ ConcurrentChange *change,
+ IndexInsertState *iistate,
+ TupleTableSlot *index_slot);
+static void apply_concurrent_delete(Relation rel, HeapTuple tup_target,
+ ConcurrentChange *change);
+static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
+ HeapTuple tup_key,
+ Snapshot snapshot,
+ IndexInsertState *iistate,
+ TupleTableSlot *ident_slot,
+ IndexScanDesc *scan_p);
+static void process_concurrent_changes(LogicalDecodingContext *ctx,
+ XLogRecPtr end_of_wal,
+ Relation rel_dst,
+ Relation rel_src,
+ ScanKey ident_key,
+ int ident_key_nentries,
+ IndexInsertState *iistate);
+static IndexInsertState *get_index_insert_state(Relation relation,
+ Oid ident_index_id);
+static ScanKey build_identity_key(Oid ident_idx_oid, Relation rel_src,
+ int *nentries);
+static void free_index_insert_state(IndexInsertState *iistate);
+static void cleanup_logical_decoding(LogicalDecodingContext *ctx);
+static void rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
+ Relation cl_index,
+ CatalogState *cat_state,
+ LogicalDecodingContext *ctx,
+ bool swap_toast_by_content,
+ TransactionId frozenXid,
+ MultiXactId cutoffMulti);
+static List *build_new_indexes(Relation NewHeap, Relation OldHeap, List *OldIndexes);
+
+/*
+ * Use this API when a relation needs to be unlocked, closed and re-opened. If
+ * the relation got dropped while being unlocked, raise ERROR that mentions
+ * the relation name rather than OID.
+ */
+typedef struct RelReopenInfo
+{
+ /*
+ * The relation to be closed. A pointer to the value is stored here so
+ * that the caller's reference is updated automatically on re-opening.
+ *
+ * When calling unlock_and_close_relations(), 'relid' can be passed
+ * instead of 'rel_p' when the caller only needs to gather information for
+ * subsequent opening.
+ */
+ Relation *rel_p;
+ Oid relid;
+
+ char relkind;
+ LOCKMODE lockmode_orig; /* The existing lock mode */
+ LOCKMODE lockmode_new; /* The lock mode after the relation is
+ * re-opened */
+ char *relname; /* Relation name, initialized automatically. */
+} RelReopenInfo;
+
+static void init_rel_reopen_info(RelReopenInfo *rri, Relation *rel_p,
+ Oid relid, LOCKMODE lockmode_orig,
+ LOCKMODE lockmode_new);
+static void unlock_and_close_relations(RelReopenInfo *rels, int nrel);
+static void reopen_relations(RelReopenInfo *rels, int nrel);
/*---------------------------------------------------------------------------
* This cluster code allows for clustering multiple tables at once. Because
@@ -110,10 +280,12 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
ListCell *lc;
ClusterParams params = {0};
bool verbose = false;
+ bool concurrent = false;
Relation rel = NULL;
Oid indexOid = InvalidOid;
MemoryContext cluster_context;
List *rtcs;
+ LOCKMODE lock_mode;
/* Parse option list */
foreach(lc, stmt->params)
@@ -122,6 +294,8 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
if (strcmp(opt->defname, "verbose") == 0)
verbose = defGetBoolean(opt);
+ else if (strcmp(opt->defname, "concurrently") == 0)
+ concurrent = defGetBoolean(opt);
else
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
@@ -130,20 +304,30 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
parser_errposition(pstate, opt->location)));
}
- params.options = (verbose ? CLUOPT_VERBOSE : 0);
+ params.options =
+ (verbose ? CLUOPT_VERBOSE : 0) |
+ (concurrent ? CLUOPT_CONCURRENT : 0);
+
+ /*
+ * Determine the lock mode expected by cluster_rel().
+ *
+ * In the exclusive case, we obtain AccessExclusiveLock right away to
+ * avoid lock-upgrade hazard in the single-transaction case. In the
+ * CONCURRENT case, the AccessExclusiveLock will only be used at the end
+ * of processing, supposedly for very short time. Until then, we'll have
+ * to unlock the relation temporarily, so there's no lock-upgrade hazard.
+ */
+ lock_mode = (params.options & CLUOPT_CONCURRENT) == 0 ?
+ AccessExclusiveLock : LOCK_CLUSTER_CONCURRENT;
if (stmt->relation != NULL)
{
/* This is the single-relation case. */
Oid tableOid;
- /*
- * Find, lock, and check permissions on the table. We obtain
- * AccessExclusiveLock right away to avoid lock-upgrade hazard in the
- * single-transaction case.
- */
+ /* Find, lock, and check permissions on the table. */
tableOid = RangeVarGetRelidExtended(stmt->relation,
- AccessExclusiveLock,
+ lock_mode,
0,
RangeVarCallbackMaintainsTable,
NULL);
@@ -198,7 +382,7 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
* Do the job. (The function will close the relation, lock is kept
* till commit.)
*/
- cluster_rel(rel, indexOid, &params);
+ cluster_rel(rel, indexOid, &params, isTopLevel);
return;
}
@@ -237,7 +421,7 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
rtcs = get_tables_to_cluster_partitioned(cluster_context, indexOid);
/* close relation, releasing lock on parent table */
- table_close(rel, AccessExclusiveLock);
+ table_close(rel, lock_mode);
}
else
{
@@ -246,7 +430,7 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
}
/* Do the job. */
- cluster_multiple_rels(rtcs, &params);
+ cluster_multiple_rels(rtcs, &params, lock_mode, isTopLevel);
/* Start a new transaction for the cleanup work. */
StartTransactionCommand();
@@ -263,7 +447,8 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
* return.
*/
static void
-cluster_multiple_rels(List *rtcs, ClusterParams *params)
+cluster_multiple_rels(List *rtcs, ClusterParams *params, LOCKMODE lock_mode,
+ bool isTopLevel)
{
ListCell *lc;
@@ -283,13 +468,19 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
/* functions in indexes may want a snapshot set */
PushActiveSnapshot(GetTransactionSnapshot());
- rel = table_open(rtc->tableOid, AccessExclusiveLock);
+ rel = table_open(rtc->tableOid, lock_mode);
- /*
- * Do the job. (The function will close the relation, lock is kept
- * till commit.)
- */
- cluster_rel(rel, rtc->indexOid, params);
+ /* Not all relations can be processed in concurrent mode. */
+ if ((params->options & CLUOPT_CONCURRENT) == 0 ||
+ check_relation_is_clusterable_concurrently(rel, DEBUG1,
+ "CLUSTER (CONCURRENTLY)"))
+ {
+ /*
+ * Do the job. (The function will close the relation, lock is kept
+ * till commit.)
+ */
+ cluster_rel(rel, rtc->indexOid, params, isTopLevel);
+ }
+ else
+ table_close(rel, lock_mode);
PopActiveSnapshot();
CommitTransactionCommand();
@@ -313,10 +504,21 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
* instead of index order. This is the new implementation of VACUUM FULL,
* and error messages should refer to the operation as VACUUM not CLUSTER.
*
- * We expect that OldHeap is already locked in AccessExclusiveLock mode.
+ * We expect that OldHeap is already locked. The lock mode is
+ * AccessExclusiveLock for normal processing and LOCK_CLUSTER_CONCURRENT for
+ * concurrent processing (so that SELECT, INSERT, UPDATE and DELETE commands
+ * work, but cluster_rel() cannot be called concurrently for the same
+ * relation).
+ *
+ * Note that, in the concurrent case, the function releases the lock at some
+ * point, in order to get AccessExclusiveLock for the final steps (i.e. to
+ * swap the relation files). To make things simpler, the caller should expect
+ * OldHeap to be closed on return, regardless of CLUOPT_CONCURRENT. (The
+ * AccessExclusiveLock is kept till the end of the transaction.)
*/
void
-cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
+cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
+ bool isTopLevel)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid save_userid;
@@ -325,6 +527,41 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
bool verbose = ((params->options & CLUOPT_VERBOSE) != 0);
bool recheck = ((params->options & CLUOPT_RECHECK) != 0);
Relation index = NULL;
+ bool concurrent = ((params->options & CLUOPT_CONCURRENT) != 0);
+ LOCKMODE lmode;
+ bool entered, success;
+
+ /* Determine the lock mode that the caller should have taken. */
+ lmode = !concurrent ? AccessExclusiveLock : LOCK_CLUSTER_CONCURRENT;
+
+ /*
+ * Skip the relation if it's being processed concurrently. In such a case,
+ * we cannot rely on a lock because the other backend needs to release it
+ * temporarily at some point.
+ *
+ * This check should not take place until we have a lock that prevents
+ * another backend from starting VACUUM FULL / CLUSTER CONCURRENTLY after
+ * our check.
+ */
+ Assert(CheckRelationLockedByMe(OldHeap, lmode, false));
+ if (is_concurrent_cluster_in_progress(tableOid))
+ {
+ ereport(NOTICE,
+ (errmsg(CLUSTER_IN_PROGRESS_MESSAGE,
+ RelationGetRelationName(OldHeap))));
+ table_close(OldHeap, lmode);
+ return;
+ }
+
+ /* There are specific requirements on concurrent processing. */
+ if (concurrent)
+ {
+ check_concurrent_cluster_requirements(OldHeap, isTopLevel,
+ OidIsValid(indexOid));
+
+ check_relation_is_clusterable_concurrently(OldHeap, ERROR,
+ "CLUSTER (CONCURRENTLY)");
+ }
/* Check for user-requested abort. */
CHECK_FOR_INTERRUPTS();
@@ -361,7 +598,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
/* Check that the user still has privileges for the relation */
if (!cluster_is_permitted_for_relation(tableOid, save_userid))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
goto out;
}
@@ -376,7 +613,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
*/
if (RELATION_IS_OTHER_TEMP(OldHeap))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
goto out;
}
@@ -387,7 +624,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
*/
if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(indexOid)))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
goto out;
}
@@ -398,7 +635,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
if ((params->options & CLUOPT_RECHECK_ISCLUSTERED) != 0 &&
!get_index_isclustered(indexOid))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
goto out;
}
}
@@ -414,6 +651,11 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("cannot cluster a shared catalog")));
+ /*
+ * The CONCURRENT case should have been rejected earlier because it does
+ * not support system catalogs.
+ */
+ Assert(!(OldHeap->rd_rel->relisshared && concurrent));
/*
* Don't process temp tables of other backends ... their local buffer
@@ -440,7 +682,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
/* Check heap and index are valid to cluster on */
if (OidIsValid(indexOid))
{
- check_index_is_clusterable(OldHeap, indexOid, AccessExclusiveLock);
+ check_index_is_clusterable(OldHeap, indexOid, lmode);
/* Open the index (It should already be locked.) */
index = index_open(indexOid, NoLock);
}
@@ -455,7 +697,8 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
if (OldHeap->rd_rel->relkind == RELKIND_MATVIEW &&
!RelationIsPopulated(OldHeap))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ index_close(index, lmode);
+ relation_close(OldHeap, lmode);
goto out;
}
@@ -468,11 +711,42 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
* invalid, because we move tuples around. Promote them to relation
* locks. Predicate locks on indexes will be promoted when they are
* reindexed.
+ *
+ * During concurrent processing, the heap as well as its indexes stay in
+ * operation, so we postpone this step until they are locked using
+ * AccessExclusiveLock near the end of the processing.
*/
- TransferPredicateLocksToHeapRelation(OldHeap);
+ if (!concurrent)
+ TransferPredicateLocksToHeapRelation(OldHeap);
/* rebuild_relation does all the dirty work */
- rebuild_relation(OldHeap, index, verbose);
+ entered = false;
+ success = false;
+ PG_TRY();
+ {
+ /*
+ * For concurrent processing, make sure other transactions treat this
+ * table as if it was a system / user catalog, and WAL the relevant
+ * additional information. ERROR is raised if another backend is
+ * processing the same table.
+ */
+ if (concurrent)
+ {
+ Relation *index_p = index ? &index : NULL;
+
+ begin_concurrent_cluster(&OldHeap, index_p, &entered);
+ }
+
+ rebuild_relation(OldHeap, index, verbose,
+ (params->options & CLUOPT_CONCURRENT) != 0);
+ success = true;
+ }
+ PG_FINALLY();
+ {
+ if (concurrent && entered)
+ end_concurrent_cluster(tableOid, !success);
+ }
+ PG_END_TRY();
/*
* NB: rebuild_relation does table_close() on OldHeap, and also on index,
@@ -622,18 +896,99 @@ mark_index_clustered(Relation rel, Oid indexOid, bool is_internal)
table_close(pg_index, RowExclusiveLock);
}
+/*
+ * Check if the CONCURRENTLY option is legal for the relation.
+ */
+bool
+check_relation_is_clusterable_concurrently(Relation rel, int elevel,
+ const char *stmt)
+{
+ char relpersistence, replident;
+ Oid ident_idx;
+
+ /* Data changes in system relations are not logically decoded. */
+ if (IsCatalogRelation(rel))
+ {
+ ereport(elevel,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot process relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("%s is not supported for catalog relations", stmt)));
+ return false;
+ }
+
+ if (IsToastRelation(rel))
+ {
+ ereport(elevel,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot process relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("%s is not supported for TOAST relations, unless the main relation is processed too",
+ stmt)));
+ return false;
+ }
+
+ relpersistence = rel->rd_rel->relpersistence;
+ if (relpersistence != RELPERSISTENCE_PERMANENT)
+ {
+ ereport(elevel,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot process relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("CLUSTER CONCURRENTLY is only allowed for permanent relations")));
+ return false;
+ }
+
+ /* With NOTHING, WAL does not contain the old tuple. */
+ replident = rel->rd_rel->relreplident;
+ if (replident == REPLICA_IDENTITY_NOTHING)
+ {
+ ereport(elevel,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot process relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("relation \"%s\" has insufficient replication identity",
+ RelationGetRelationName(rel))));
+ return false;
+ }
+
+ /*
+ * Identity index is not set if the replica identity is FULL, but PK might
+ * exist in such a case.
+ */
+ ident_idx = RelationGetReplicaIndex(rel);
+ if (!OidIsValid(ident_idx) && OidIsValid(rel->rd_pkindex))
+ ident_idx = rel->rd_pkindex;
+ if (!OidIsValid(ident_idx))
+ {
+ ereport(elevel,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot process relation \"%s\"",
+ RelationGetRelationName(rel)),
+ (errhint("relation \"%s\" has no identity index",
+ RelationGetRelationName(rel)))));
+ return false;
+ }
+
+ return true;
+}
+
/*
* rebuild_relation: rebuild an existing relation in index or physical order
*
- * OldHeap: table to rebuild --- must be opened and exclusive-locked!
+ * OldHeap: table to rebuild --- must be opened and locked. See cluster_rel()
+ * for comments on the required lock strength.
+ *
* index: index to cluster by, or NULL to rewrite in physical order. Must be
* opened and locked.
*
* On exit, the heap (and also the index, if one was passed) are closed, but
- * still locked with AccessExclusiveLock.
+ * still locked with AccessExclusiveLock. (The function handles the lock
+ * upgrade if 'concurrent' is true.)
*/
static void
-rebuild_relation(Relation OldHeap, Relation index, bool verbose)
+rebuild_relation(Relation OldHeap, Relation index, bool verbose,
+ bool concurrent)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid indexOid = index ? RelationGetRelid(index) : InvalidOid;
@@ -642,19 +997,83 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose)
Oid OIDNewHeap;
Relation NewHeap;
char relpersistence;
- bool is_system_catalog;
bool swap_toast_by_content;
TransactionId frozenXid;
MultiXactId cutoffMulti;
+ NameData slotname;
+ LogicalDecodingContext *ctx = NULL;
+ Snapshot snapshot = NULL;
+ CatalogState *cat_state = NULL;
LOCKMODE lmode_new;
+ if (concurrent)
+ {
+ TupleDesc tupdesc;
+ RelReopenInfo rri[2];
+ int nrel;
+
+ /*
+ * CLUSTER CONCURRENTLY is not allowed in a transaction block, so this
+ * should never fire.
+ */
+ Assert(GetTopTransactionIdIfAny() == InvalidTransactionId);
+
+ /*
+ * A single backend should not execute multiple CLUSTER commands at a
+ * time, so use PID to make the slot unique.
+ */
+ snprintf(NameStr(slotname), NAMEDATALEN, "cluster_%d", MyProcPid);
+
+ /*
+ * Gather catalog information so that we can check later if the old
+ * relation has not changed while unlocked.
+ *
+ * Since this function also checks if the relation can be processed,
+ * it's important to call it before we setup the logical decoding,
+ * because that can take some time. Not sure if it's necessary to do
+ * it even earlier.
+ */
+ cat_state = get_catalog_state(OldHeap);
+
+ tupdesc = CreateTupleDescCopy(RelationGetDescr(OldHeap));
+
+ /*
+ * Unlock the relation (and possibly the clustering index) to avoid
+ * deadlock because setup_logical_decoding() will wait for all the
+ * running transactions (with XID assigned) to finish. Some of those
+ * transactions might be waiting for a lock on our relation.
+ */
+ nrel = 0;
+ init_rel_reopen_info(&rri[nrel++], &OldHeap, InvalidOid,
+ LOCK_CLUSTER_CONCURRENT,
+ LOCK_CLUSTER_CONCURRENT);
+ if (index)
+ init_rel_reopen_info(&rri[nrel++], &index, InvalidOid,
+ LOCK_CLUSTER_CONCURRENT,
+ LOCK_CLUSTER_CONCURRENT);
+ unlock_and_close_relations(rri, nrel);
+
+ /* Prepare to capture the concurrent data changes. */
+ ctx = setup_logical_decoding(tableOid, NameStr(slotname), tupdesc);
+
+ /* Lock the table (and index) again. */
+ reopen_relations(rri, nrel);
+
+ /*
+ * Check whether the catalog state (e.g. the tuple descriptor) changed
+ * while the relation was unlocked.
+ */
+ check_catalog_changes(OldHeap, cat_state);
+
+ snapshot = SnapBuildInitialSnapshotForCluster(ctx->snapshot_builder);
+ }
+
if (OidIsValid(indexOid))
/* Mark the correct index as clustered */
mark_index_clustered(OldHeap, indexOid, true);
/* Remember info about rel before closing OldHeap */
relpersistence = OldHeap->rd_rel->relpersistence;
- is_system_catalog = IsSystemRelation(OldHeap);
/*
* Create the transient table that will receive the re-ordered data.
@@ -673,31 +1092,52 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose)
AccessExclusiveLock : NoLock);
/* Copy the heap data into the new table in the desired order */
- copy_table_data(NewHeap, OldHeap, index, verbose,
+ copy_table_data(NewHeap, OldHeap, index, snapshot, ctx, verbose,
&swap_toast_by_content, &frozenXid, &cutoffMulti);
+ if (concurrent)
+ {
+ rebuild_relation_finish_concurrent(NewHeap, OldHeap, index,
+ cat_state, ctx,
+ swap_toast_by_content,
+ frozenXid, cutoffMulti);
+
+ pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
+ PROGRESS_CLUSTER_PHASE_FINAL_CLEANUP);
+
+ /* Done with decoding. */
+ FreeSnapshot(snapshot);
+ free_catalog_state(cat_state);
+ cleanup_logical_decoding(ctx);
+ ReplicationSlotRelease();
+ ReplicationSlotDrop(NameStr(slotname), false);
+ }
+ else
+ {
+ bool is_system_catalog = IsSystemRelation(OldHeap);
- /* Close relcache entries, but keep lock until transaction commit */
- table_close(OldHeap, NoLock);
- if (index)
- index_close(index, NoLock);
+ /* Close relcache entries, but keep lock until transaction commit */
+ table_close(OldHeap, NoLock);
+ if (index)
+ index_close(index, NoLock);
- /*
- * Close the new relation so it can be dropped as soon as the storage is
- * swapped. The relation is not visible to others, so we could unlock it
- * completely, but it's simpler to pass NoLock than to track all the locks
- * acquired so far.
- */
- table_close(NewHeap, NoLock);
+ /*
+ * Close the new relation so it can be dropped as soon as the storage
+ * is swapped. The relation is not visible to others, so we could
+ * unlock it completely, but it's simpler to pass NoLock than to track
+ * all the locks acquired so far.
+ */
+ table_close(NewHeap, NoLock);
- /*
- * Swap the physical files of the target and transient tables, then
- * rebuild the target's indexes and throw away the transient table.
- */
- finish_heap_swap(tableOid, OIDNewHeap, is_system_catalog,
- swap_toast_by_content, false, true,
- frozenXid, cutoffMulti,
- relpersistence);
+ /*
+ * Swap the physical files of the target and transient tables, then
+ * rebuild the target's indexes and throw away the transient table.
+ */
+ finish_heap_swap(tableOid, OIDNewHeap, is_system_catalog,
+ swap_toast_by_content, false, true, true,
+ frozenXid, cutoffMulti,
+ relpersistence);
+ }
}
@@ -848,15 +1288,19 @@ make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
/*
* Do the physical copying of table data.
*
+ * 'snapshot' and 'decoding_ctx': see table_relation_copy_for_cluster(). Pass
+ * them iff concurrent processing is required.
+ *
* There are three output parameters:
* *pSwapToastByContent is set true if toast tables must be swapped by content.
* *pFreezeXid receives the TransactionId used as freeze cutoff point.
* *pCutoffMulti receives the MultiXactId used as a cutoff point.
*/
static void
-copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verbose,
- bool *pSwapToastByContent, TransactionId *pFreezeXid,
- MultiXactId *pCutoffMulti)
+copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
+ Snapshot snapshot, LogicalDecodingContext *decoding_ctx,
+ bool verbose, bool *pSwapToastByContent,
+ TransactionId *pFreezeXid, MultiXactId *pCutoffMulti)
{
Oid OIDOldHeap = RelationGetRelid(OldHeap);
Oid OIDOldIndex = OldIndex ? RelationGetRelid(OldIndex) : InvalidOid;
@@ -876,6 +1320,7 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
int elevel = verbose ? INFO : DEBUG2;
PGRUsage ru0;
char *nspname;
+ bool concurrent = snapshot != NULL;
pg_rusage_init(&ru0);
@@ -902,8 +1347,12 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
*
* We don't need to open the toast relation here, just lock it. The lock
* will be held till end of transaction.
+ *
+ * In the CONCURRENT case, the lock does not help because we need to
+ * release it temporarily at some point. Instead, we expect VACUUM /
+ * CLUSTER to skip tables which are present in ClusteredRelsHash.
*/
- if (OldHeap->rd_rel->reltoastrelid)
+ if (OldHeap->rd_rel->reltoastrelid && !concurrent)
LockRelationOid(OldHeap->rd_rel->reltoastrelid, AccessExclusiveLock);
/*
@@ -979,7 +1428,45 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
* provided, else plain seqscan.
*/
if (OldIndex != NULL && OldIndex->rd_rel->relam == BTREE_AM_OID)
+ {
+ ResourceOwner oldowner = CurrentResourceOwner;
+
+ /*
+ * In the CONCURRENT case, do the planning in a subtransaction so that
+ * we don't leave any additional locks behind us that we cannot
+ * release easily.
+ */
+ if (concurrent)
+ {
+ Assert(CheckRelationLockedByMe(OldHeap, LOCK_CLUSTER_CONCURRENT,
+ false));
+ Assert(CheckRelationLockedByMe(OldIndex, LOCK_CLUSTER_CONCURRENT,
+ false));
+ BeginInternalSubTransaction("plan_cluster_use_sort");
+ }
+
use_sort = plan_cluster_use_sort(OIDOldHeap, OIDOldIndex);
+
+ if (concurrent)
+ {
+ PgBackendProgress progress;
+
+ /*
+ * Command progress reporting gets terminated at subtransaction
+ * end. Save the status so it can be eventually restored.
+ */
+ memcpy(&progress, &MyBEEntry->st_progress,
+ sizeof(PgBackendProgress));
+
+ /* Release the locks by aborting the subtransaction. */
+ RollbackAndReleaseCurrentSubTransaction();
+
+ /* Restore the progress reporting status. */
+ pgstat_progress_restore_state(&progress);
+
+ CurrentResourceOwner = oldowner;
+ }
+ }
else
use_sort = false;
@@ -1008,7 +1495,9 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
* values (e.g. because the AM doesn't use freezing).
*/
table_relation_copy_for_cluster(OldHeap, NewHeap, OldIndex, use_sort,
- cutoffs.OldestXmin, &cutoffs.FreezeLimit,
+ cutoffs.OldestXmin, snapshot,
+ decoding_ctx,
+ &cutoffs.FreezeLimit,
&cutoffs.MultiXactCutoff,
&num_tuples, &tups_vacuumed,
&tups_recently_dead);
@@ -1017,7 +1506,11 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
*pFreezeXid = cutoffs.FreezeLimit;
*pCutoffMulti = cutoffs.MultiXactCutoff;
- /* Reset rd_toastoid just to be tidy --- it shouldn't be looked at again */
+ /*
+ * Reset rd_toastoid just to be tidy --- it shouldn't be looked at
+ * again. In the CONCURRENT case, we need to set it again before applying
+ * the concurrent changes.
+ */
NewHeap->rd_toastoid = InvalidOid;
num_pages = RelationGetNumberOfBlocks(NewHeap);
@@ -1468,14 +1961,13 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
bool swap_toast_by_content,
bool check_constraints,
bool is_internal,
+ bool reindex,
TransactionId frozenXid,
MultiXactId cutoffMulti,
char newrelpersistence)
{
ObjectAddress object;
Oid mapped_tables[4];
- int reindex_flags;
- ReindexParams reindex_params = {0};
int i;
/* Report that we are now swapping relation files */
@@ -1501,39 +1993,46 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
if (is_system_catalog)
CacheInvalidateCatalog(OIDOldHeap);
- /*
- * Rebuild each index on the relation (but not the toast table, which is
- * all-new at this point). It is important to do this before the DROP
- * step because if we are processing a system catalog that will be used
- * during DROP, we want to have its indexes available. There is no
- * advantage to the other order anyway because this is all transactional,
- * so no chance to reclaim disk space before commit. We do not need a
- * final CommandCounterIncrement() because reindex_relation does it.
- *
- * Note: because index_build is called via reindex_relation, it will never
- * set indcheckxmin true for the indexes. This is OK even though in some
- * sense we are building new indexes rather than rebuilding existing ones,
- * because the new heap won't contain any HOT chains at all, let alone
- * broken ones, so it can't be necessary to set indcheckxmin.
- */
- reindex_flags = REINDEX_REL_SUPPRESS_INDEX_USE;
- if (check_constraints)
- reindex_flags |= REINDEX_REL_CHECK_CONSTRAINTS;
+ if (reindex)
+ {
+ int reindex_flags;
+ ReindexParams reindex_params = {0};
- /*
- * Ensure that the indexes have the same persistence as the parent
- * relation.
- */
- if (newrelpersistence == RELPERSISTENCE_UNLOGGED)
- reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
- else if (newrelpersistence == RELPERSISTENCE_PERMANENT)
- reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+ /*
+ * Rebuild each index on the relation (but not the toast table, which
+ * is all-new at this point). It is important to do this before the
+ * DROP step because if we are processing a system catalog that will
+ * be used during DROP, we want to have its indexes available. There
+ * is no advantage to the other order anyway because this is all
+ * transactional, so no chance to reclaim disk space before commit.
+ * We do not need a final CommandCounterIncrement() because
+ * reindex_relation does it.
+ *
+ * Note: because index_build is called via reindex_relation, it will never
+ * set indcheckxmin true for the indexes. This is OK even though in some
+ * sense we are building new indexes rather than rebuilding existing ones,
+ * because the new heap won't contain any HOT chains at all, let alone
+ * broken ones, so it can't be necessary to set indcheckxmin.
+ */
+ reindex_flags = REINDEX_REL_SUPPRESS_INDEX_USE;
+ if (check_constraints)
+ reindex_flags |= REINDEX_REL_CHECK_CONSTRAINTS;
- /* Report that we are now reindexing relations */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_REBUILD_INDEX);
+ /*
+ * Ensure that the indexes have the same persistence as the parent
+ * relation.
+ */
+ if (newrelpersistence == RELPERSISTENCE_UNLOGGED)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
+ else if (newrelpersistence == RELPERSISTENCE_PERMANENT)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
- reindex_relation(NULL, OIDOldHeap, reindex_flags, &reindex_params);
+ /* Report that we are now reindexing relations */
+ pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
+ PROGRESS_CLUSTER_PHASE_REBUILD_INDEX);
+
+ reindex_relation(NULL, OIDOldHeap, reindex_flags, &reindex_params);
+ }
/* Report that we are now doing clean up */
pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
@@ -1773,3 +2272,1938 @@ cluster_is_permitted_for_relation(Oid relid, Oid userid)
get_rel_name(relid))));
return false;
}
+
+#define REPL_PLUGIN_NAME "pgoutput_cluster"
+
+/*
+ * Each relation being processed by CLUSTER CONCURRENTLY must be in the
+ * clusteredRels hashtable.
+ */
+typedef struct ClusteredRel
+{
+ Oid relid;
+ Oid dbid;
+} ClusteredRel;
+
+static HTAB *ClusteredRelsHash = NULL;
+
+/* Maximum number of entries in the hashtable. */
+static int maxClusteredRels = 0;
+
+Size
+ClusterShmemSize(void)
+{
+ /*
+ * A replication slot is needed for the processing, so use this GUC to
+ * allocate memory for the hashtable. Also reserve space for TOAST
+ * relations.
+ */
+ maxClusteredRels = max_replication_slots * 2;
+
+ return hash_estimate_size(maxClusteredRels, sizeof(ClusteredRel));
+}
+
+void
+ClusterShmemInit(void)
+{
+ HASHCTL info;
+
+ info.keysize = sizeof(ClusteredRel);
+ info.entrysize = info.keysize;
+
+ ClusteredRelsHash = ShmemInitHash("Clustered Relations",
+ maxClusteredRels,
+ maxClusteredRels,
+ &info,
+ HASH_ELEM | HASH_BLOBS);
+}
+
+/*
+ * Perform a preliminary check whether CLUSTER / VACUUM FULL CONCURRENTLY is
+ * possible. Note that here we only check things that should not change if we
+ * release the relation lock temporarily. The information that can change due
+ * to unlocking is checked in get_catalog_state().
+ */
+static void
+check_concurrent_cluster_requirements(Relation rel, bool isTopLevel,
+ bool isCluster)
+{
+ const char *stmt;
+
+ if (isCluster)
+ stmt = "CLUSTER (CONCURRENTLY)";
+ else
+ stmt = "VACUUM (FULL, CONCURRENTLY)";
+
+ /*
+ * Make sure we have no XID assigned, otherwise call of
+ * setup_logical_decoding() can cause a deadlock.
+ */
+ PreventInTransactionBlock(isTopLevel, stmt);
+
+ CheckSlotPermissions();
+
+ /*
+ * Use an existing function to check if we can use logical
+ * decoding. However note that RecoveryInProgress() should already have
+ * caused error, as it does for the non-concurrent VACUUM FULL / CLUSTER.
+ */
+ CheckLogicalDecodingRequirements();
+
+ /* See ClusterShmemSize() */
+ if (max_replication_slots < 2)
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ (errmsg("%s requires \"max_replication_slots\" to be at least 2",
+ stmt)));
+}
+
+/*
+ * Call this function before CLUSTER CONCURRENTLY starts to setup logical
+ * decoding. It makes sure that other users of the table put enough
+ * information into WAL.
+ *
+ * The point is that on various places we expect that the table we're
+ * processing is treated like a system catalog. For example, we need to be
+ * able to scan it using a "historic snapshot" anytime during the processing
+ * (as opposed to scanning only at the start point of the decoding, logical
+ * replication does during initial table synchronization), in order to apply
+ * concurrent UPDATE / DELETE commands.
+ *
+ * Since we need to close and reopen the relation here, the 'rel_p' and
+ * 'index_p' arguments are in/out.
+ *
+ * 'enter_p' receives a bool value telling whether relation OID was entered
+ * into the hashtable or not.
+ */
+static void
+begin_concurrent_cluster(Relation *rel_p, Relation *index_p,
+ bool *entered_p)
+{
+ Relation rel = *rel_p;
+ Oid relid, toastrelid;
+ ClusteredRel key, *entry;
+ bool found;
+ RelReopenInfo rri[2];
+ int nrel;
+ static bool before_shmem_exit_callback_setup = false;
+
+ relid = RelationGetRelid(rel);
+
+ /*
+ * Make sure that we do not leave an entry in ClusteredRelsHash if exiting
+ * due to FATAL.
+ */
+ if (!before_shmem_exit_callback_setup)
+ {
+ before_shmem_exit(cluster_before_shmem_exit_callback, 0);
+ before_shmem_exit_callback_setup = true;
+ }
+
+ memset(&key, 0, sizeof(key));
+ key.relid = relid;
+ key.dbid = MyDatabaseId;
+
+ *entered_p = false;
+ LWLockAcquire(ClusteredRelsLock, LW_EXCLUSIVE);
+ entry = (ClusteredRel *)
+ hash_search(ClusteredRelsHash, &key, HASH_ENTER_NULL, &found);
+ if (found)
+ {
+ /*
+ * Since CLUSTER CONCURRENTLY takes ShareRowExclusiveLock, a conflict
+ * should occur much earlier. However that lock may be released
+ * temporarily, see below. Anyway, we should complain whatever the
+ * reason of the conflict might be.
+ */
+ ereport(ERROR,
+ (errmsg(CLUSTER_IN_PROGRESS_MESSAGE,
+ RelationGetRelationName(rel))));
+ }
+ if (entry == NULL)
+ ereport(ERROR,
+ (errmsg("too many requests for CLUSTER CONCURRENTLY at a time")),
+ (errhint("consider increasing the \"max_replication_slots\" configuration parameter")));
+
+ /*
+ * Even if the insertion of TOAST relid should fail below, the caller has
+ * to do cleanup.
+ */
+ *entered_p = true;
+
+ /*
+ * Enable the callback to remove the entry in case of exit. We should not
+ * do this earlier, otherwise an attempt to insert already existing entry
+ * could make us remove that entry (inserted by another backend) during
+ * ERROR handling.
+ */
+ Assert(!OidIsValid(clustered_rel));
+ clustered_rel = relid;
+
+ /*
+ * TOAST relation is not accessed using historic snapshot, but we enter it
+ * here to protect it from being VACUUMed by another backend. (Lock does
+ * not help in the CONCURRENT case because we cannot hold it continuously
+ * till the end of the transaction.) See the comments on locking TOAST
+ * relation in copy_table_data().
+ */
+ toastrelid = rel->rd_rel->reltoastrelid;
+ if (OidIsValid(toastrelid))
+ {
+ key.relid = toastrelid;
+ entry = (ClusteredRel *)
+ hash_search(ClusteredRelsHash, &key, HASH_ENTER_NULL, &found);
+ if (found)
+ /*
+ * If we could enter the main relation, the TOAST relation should
+ * succeed too. Nevertheless, check.
+ */
+ ereport(ERROR,
+ (errmsg("TOAST relation of \"%s\" is already being processed by CLUSTER CONCURRENTLY",
+ RelationGetRelationName(rel))));
+ if (entry == NULL)
+ ereport(ERROR,
+ (errmsg("too many requests for CLUSTER CONCURRENTLY at a time")),
+ (errhint("consider increasing the \"max_replication_slots\" configuration parameter")));
+
+ Assert(!OidIsValid(clustered_rel_toast));
+ clustered_rel_toast = toastrelid;
+ }
+ LWLockRelease(ClusteredRelsLock);
+
+ /*
+ * Make sure that other backends are aware of the new hash entry.
+ *
+ * Besides sending the invalidation message, we need to force re-opening
+ * of the relation, which includes the actual invalidation (and thus
+ * checking of our hashtable on the next access).
+ */
+ CacheInvalidateRelcacheImmediate(rel);
+ /*
+ * Since the hashtable only needs to be checked by write transactions,
+ * lock the relation in a mode that conflicts with any DML command. (The
+ * reading transactions are supposed to close the relation before opening
+ * it with higher lock.) Once we have the relation (and its index) locked,
+ * we unlock it immediately and then re-lock using the original mode.
+ */
+ nrel = 0;
+ init_rel_reopen_info(&rri[nrel++], rel_p, InvalidOid,
+ LOCK_CLUSTER_CONCURRENT, ShareLock);
+ if (index_p)
+ {
+ /*
+ * Another transaction might want to open both the relation and the
+ * index. If it already has the relation lock and is waiting for the
+ * index lock, we should release the index lock, otherwise our request
+ * for ShareLock on the relation can end up in a deadlock.
+ */
+ init_rel_reopen_info(&rri[nrel++], index_p, InvalidOid,
+ LOCK_CLUSTER_CONCURRENT, ShareLock);
+ }
+ unlock_and_close_relations(rri, nrel);
+ /*
+ * XXX It's not strictly necessary to lock the index here, but it's
+ * probably not worth teaching the "reopen API" about this special case.
+ */
+ reopen_relations(rri, nrel);
+
+ /* Switch back to the original lock. */
+ nrel = 0;
+ init_rel_reopen_info(&rri[nrel++], rel_p, InvalidOid,
+ ShareLock, LOCK_CLUSTER_CONCURRENT);
+ if (index_p)
+ init_rel_reopen_info(&rri[nrel++], index_p, InvalidOid,
+ ShareLock, LOCK_CLUSTER_CONCURRENT);
+ unlock_and_close_relations(rri, nrel);
+ reopen_relations(rri, nrel);
+}
+
+/*
+ * Call this when done with CLUSTER CONCURRENTLY.
+ *
+ * 'error' tells whether the function is being called in order to handle
+ * error.
+ */
+static void
+end_concurrent_cluster(Oid relid, bool error)
+{
+ ClusteredRel key, *entry, *entry_toast = NULL;
+
+ /* Remove the relation from the hash. */
+ memset(&key, 0, sizeof(key));
+ key.relid = relid;
+ key.dbid = MyDatabaseId;
+ LWLockAcquire(ClusteredRelsLock, LW_EXCLUSIVE);
+ entry = hash_search(ClusteredRelsHash, &key, HASH_REMOVE, NULL);
+
+ /* Disable end_concurrent_cluster_on_exit_callback(). */
+ if (OidIsValid(clustered_rel))
+ clustered_rel = InvalidOid;
+
+ /* Remove the TOAST relation if there is one. */
+ if (OidIsValid(clustered_rel_toast))
+ {
+ key.relid = clustered_rel_toast;
+ entry_toast = hash_search(ClusteredRelsHash, &key, HASH_REMOVE,
+ NULL);
+
+ clustered_rel_toast = InvalidOid;
+ }
+ else
+ key.relid = InvalidOid;
+ LWLockRelease(ClusteredRelsLock);
+
+ /*
+ * On normal completion (!error), we should not fail to remove the entry.
+ * But if that happens for any reason, make sure the transaction is
+ * aborted: if other transactions changed the contents of the relation
+ * without knowing that CLUSTER CONCURRENTLY was in progress, they might
+ * have failed to write enough information to WAL, and thus we could have
+ * produced inconsistent table contents.
+ *
+ * On the other hand, if we are already handling an error, there's no
+ * reason to worry about inconsistent contents of the new storage because
+ * the transaction is going to be rolled back anyway. Furthermore, by
+ * raising ERROR here we'd shadow the original error.
+ */
+ if (!error)
+ {
+ char *relname;
+
+ if (entry == NULL)
+ {
+ relname = get_rel_name(relid);
+ if (!relname)
+ ereport(ERROR,
+ (errmsg("cache lookup failed for relation %u",
+ relid)));
+
+ ereport(ERROR,
+ (errmsg("relation \"%s\" not found among clustered relations",
+ relname)));
+ }
+
+ /*
+ * Missing TOAST relation indicates that it could have been VACUUMed
+ * or CLUSTERed by another backend while we did not hold a lock on it.
+ */
+ if (entry_toast == NULL && OidIsValid(key.relid))
+ {
+ relname = get_rel_name(key.relid);
+ if (!relname)
+ ereport(ERROR,
+ (errmsg("cache lookup failed for relation %u",
+ key.relid)));
+
+ ereport(ERROR,
+ (errmsg("relation \"%s\" not found among clustered relations",
+ relname)));
+ }
+ }
+
+ /*
+ * Note: unlike begin_concurrent_cluster(), here we do not lock/unlock the
+ * relation: 1) On normal completion, the caller is already holding
+ * AccessExclusiveLock (till the end of the transaction), 2) on ERROR /
+ * FATAL, we try to do the cleanup as soon as possible, but the worst case is
+ * that other
+ * backends will write unnecessary information to WAL until they close the
+ * relation.
+ */
+}
+
+/*
+ * A wrapper to call end_concurrent_cluster() as a before_shmem_exit callback.
+ */
+static void
+cluster_before_shmem_exit_callback(int code, Datum arg)
+{
+ if (OidIsValid(clustered_rel) || OidIsValid(clustered_rel_toast))
+ end_concurrent_cluster(clustered_rel, true);
+}
+
+/*
+ * Check if relation is currently being processed by CLUSTER CONCURRENTLY.
+ */
+bool
+is_concurrent_cluster_in_progress(Oid relid)
+{
+ ClusteredRel key, *entry;
+
+ memset(&key, 0, sizeof(key));
+ key.relid = relid;
+ key.dbid = MyDatabaseId;
+
+ LWLockAcquire(ClusteredRelsLock, LW_SHARED);
+ entry = (ClusteredRel *)
+ hash_search(ClusteredRelsHash, &key, HASH_FIND, NULL);
+ LWLockRelease(ClusteredRelsLock);
+
+ return entry != NULL;
+}
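
[Editor's aside, not part of the patch: the memset() of the whole key before
hash_search() matters because the shared hash table hashes and compares the
raw bytes of the key, padding included. A minimal standalone sketch of the
pattern; the struct and helper names below are illustrative, not PostgreSQL
code.]

```c
#include <assert.h>
#include <string.h>

/* Hypothetical key struct, same shape as the patch's ClusteredRel. */
typedef struct HashKey
{
	unsigned int relid;
	unsigned int dbid;
} HashKey;

/*
 * Build a key the way the patch does: zero the whole struct before filling
 * the fields, so any padding bytes are deterministic and a byte-wise
 * hash/comparison (which dynahash uses by default) treats logically equal
 * keys as equal.
 */
HashKey
make_key(unsigned int relid, unsigned int dbid)
{
	HashKey		key;

	memset(&key, 0, sizeof(key));
	key.relid = relid;
	key.dbid = dbid;
	return key;
}

/* Byte-wise equality, as a stand-in for the hash table's comparison. */
int
keys_equal(const HashKey *a, const HashKey *b)
{
	return memcmp(a, b, sizeof(HashKey)) == 0;
}
```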
+
+/*
+ * Check if VACUUM FULL / CLUSTER CONCURRENTLY is already running for given
+ * relation, and if so, raise ERROR. The problem is that cluster_rel() needs
+ * to release its lock on the relation temporarily at some point, so our lock
+ * alone does not help. Commands that might break what cluster_rel() is doing
+ * should call this function first.
+ *
+ * Return without checking if 'lockmode' allows for race conditions that
+ * would make the result meaningless. In that case, cluster_rel() itself
+ * should throw an ERROR if the relation was changed in an incompatible
+ * way. However, if it has managed to do most of its work by then, a lot
+ * of CPU time might be wasted.
+ */
+void
+check_for_concurrent_cluster(Oid relid, LOCKMODE lockmode)
+{
+ /*
+ * If the caller does not have a lock that conflicts with
+ * LOCK_CLUSTER_CONCURRENT, the check makes little sense because the
+ * VACUUM FULL / CLUSTER CONCURRENTLY can start anytime after the check.
+ */
+ if (lockmode < LOCK_CLUSTER_CONCURRENT)
+ return;
+
+ if (is_concurrent_cluster_in_progress(relid))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg(CLUSTER_IN_PROGRESS_MESSAGE,
+ get_rel_name(relid))));
+}
+
+/*
+ * Check if relation is eligible for CLUSTER CONCURRENTLY and retrieve the
+ * catalog state to be passed later to check_catalog_changes.
+ *
+ * Caller is supposed to hold (at least) LOCK_CLUSTER_CONCURRENT on the
+ * relation.
+ */
+static CatalogState *
+get_catalog_state(Relation rel)
+{
+ CatalogState *result = palloc_object(CatalogState);
+ List *ind_oids;
+ ListCell *lc;
+ int ninds, i;
+ Oid reltoastrelid = rel->rd_rel->reltoastrelid;
+ char relpersistence = rel->rd_rel->relpersistence;
+ char replident = rel->rd_rel->relreplident;
+ Oid ident_idx = RelationGetReplicaIndex(rel);
+ TupleDesc td_src = RelationGetDescr(rel);
+
+ /*
+ * While gathering the catalog information, check if there is a reason not
+ * to proceed.
+ */
+ check_relation_is_clusterable_concurrently(rel, ERROR,
+ "CLUSTER (CONCURRENTLY)");
+
+ /*
+ * TOAST should not really change, but be careful. If it did, we would be
+ * unable to remove the new one from ClusteredRelsHash.
+ */
+ if (OidIsValid(clustered_rel_toast) &&
+ clustered_rel_toast != reltoastrelid)
+ ereport(ERROR,
+ (errmsg("TOAST relation changed by another transaction")));
+
+ /* No index should be dropped while we are checking it. */
+ Assert(CheckRelationLockedByMe(rel, ShareUpdateExclusiveLock, true));
+
+ ind_oids = RelationGetIndexList(rel);
+ result->ninds = ninds = list_length(ind_oids);
+ result->ind_oids = palloc_array(Oid, ninds);
+ result->ind_tupdescs = palloc_array(TupleDesc, ninds);
+ i = 0;
+ foreach(lc, ind_oids)
+ {
+ Oid ind_oid = lfirst_oid(lc);
+ Relation index;
+ TupleDesc td_ind_src, td_ind_dst;
+
+ /*
+ * A weaker lock should be o.k. for the index, but this one should not
+ * break anything either.
+ */
+ index = index_open(ind_oid, ShareUpdateExclusiveLock);
+
+ result->ind_oids[i] = RelationGetRelid(index);
+ td_ind_src = RelationGetDescr(index);
+ td_ind_dst = palloc(TupleDescSize(td_ind_src));
+ TupleDescCopy(td_ind_dst, td_ind_src);
+ result->ind_tupdescs[i] = td_ind_dst;
+ i++;
+
+ index_close(index, ShareUpdateExclusiveLock);
+ }
+
+ /* Fill-in the relation info. */
+ result->tupdesc = palloc(TupleDescSize(td_src));
+ TupleDescCopy(result->tupdesc, td_src);
+ result->reltoastrelid = reltoastrelid;
+ result->relpersistence = relpersistence;
+ result->replident = replident;
+ result->replidindex = ident_idx;
+
+ return result;
+}
+
+static void
+free_catalog_state(CatalogState *state)
+{
+ for (int i = 0; i < state->ninds; i++)
+ FreeTupleDesc(state->ind_tupdescs[i]);
+
+ FreeTupleDesc(state->tupdesc);
+ pfree(state->ind_oids);
+ pfree(state->ind_tupdescs);
+ pfree(state);
+}
+
+/*
+ * Raise ERROR if 'rel' changed in a way that does not allow further
+ * processing of CLUSTER CONCURRENTLY.
+ *
+ * Besides the relation's tuple descriptor, it's important to check indexes:
+ * concurrent change of an index definition (can it happen in any other way
+ * than dropping and re-creating the index, coincidentally with the same
+ * OID?) can be
+ * a problem because we may already have the new index built. If an index was
+ * created or dropped concurrently, we'd fail to swap the index storage. In
+ * any case, we prefer to check the indexes early to get an explicit error
+ * message about the mismatch. Furthermore, the earlier we detect the change,
+ * the fewer CPU cycles we waste.
+ *
+ * Note that we do not check constraints because the transaction which changed
+ * them must have ensured that the existing tuples satisfy the new
+ * constraints. If any DML commands were necessary for that, we will simply
+ * decode them from WAL and apply them to the new storage.
+ *
+ * Caller is supposed to hold (at least) ShareUpdateExclusiveLock on the
+ * relation.
+ */
+static void
+check_catalog_changes(Relation rel, CatalogState *cat_state)
+{
+ List *ind_oids;
+ ListCell *lc;
+ LOCKMODE lmode;
+ Oid ident_idx;
+ TupleDesc td, td_cp;
+
+ /* First, check the relation info. */
+
+ /* TOAST is not easy to change, but check. */
+ if (rel->rd_rel->reltoastrelid != cat_state->reltoastrelid)
+ ereport(ERROR,
+ errmsg("TOAST relation of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+
+ if (rel->rd_rel->relpersistence != cat_state->relpersistence)
+ ereport(ERROR,
+ errmsg("persistence of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+
+ if (cat_state->replident != rel->rd_rel->relreplident)
+ ereport(ERROR,
+ errmsg("replica identity of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+
+ ident_idx = RelationGetReplicaIndex(rel);
+ if (ident_idx == InvalidOid && rel->rd_pkindex != InvalidOid)
+ ident_idx = rel->rd_pkindex;
+ if (cat_state->replidindex != ident_idx)
+ ereport(ERROR,
+ errmsg("identity index of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+
+ /*
+ * As cat_state contains a copy (which has the constraint info cleared),
+ * create a temporary copy for the comparison.
+ */
+ td = RelationGetDescr(rel);
+ td_cp = palloc(TupleDescSize(td));
+ TupleDescCopy(td_cp, td);
+ if (!equalTupleDescs(cat_state->tupdesc, td_cp))
+ ereport(ERROR,
+ errmsg("definition of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+ FreeTupleDesc(td_cp);
+
+ /* Now we are only interested in indexes. */
+ if (cat_state->ninds == 0)
+ return;
+
+ /* No index should be dropped while we are checking the relation. */
+ lmode = ShareUpdateExclusiveLock;
+ Assert(CheckRelationLockedByMe(rel, lmode, true));
+
+ ind_oids = RelationGetIndexList(rel);
+ if (list_length(ind_oids) != cat_state->ninds)
+ goto failed_index;
+
+ foreach(lc, ind_oids)
+ {
+ Oid ind_oid = lfirst_oid(lc);
+ int i;
+ TupleDesc tupdesc;
+ Relation index;
+
+ /* Find the index in cat_state. */
+ for (i = 0; i < cat_state->ninds; i++)
+ {
+ if (cat_state->ind_oids[i] == ind_oid)
+ break;
+ }
+ /*
+ * OID not found, i.e. the index was replaced by another one. XXX
+ * Should we yet try to find if an index having the desired tuple
+ * descriptor exists? Or should we always look for the tuple
+ * descriptor and not use OIDs at all?
+ */
+ if (i == cat_state->ninds)
+ goto failed_index;
+
+ /* Check the tuple descriptor. */
+ index = try_index_open(ind_oid, lmode);
+ if (index == NULL)
+ goto failed_index;
+ tupdesc = RelationGetDescr(index);
+ if (!equalTupleDescs(cat_state->ind_tupdescs[i], tupdesc))
+ goto failed_index;
+ index_close(index, lmode);
+ }
+
+ return;
+
+failed_index:
+ ereport(ERROR,
+ (errmsg("index(es) of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel))));
+}
+
+/*
+ * This function is much like pg_create_logical_replication_slot() except that
+ * the new slot is neither released (if anyone else could read changes from
+ * our slot, we could miss changes other backends do while we copy the
+ * existing data into temporary table), nor persisted (it's easier to handle
+ * crash by restarting all the work from scratch).
+ *
+ * XXX Even though CreateInitDecodingContext() does not set state to
+ * RS_PERSISTENT, it does write the slot to disk. We rely on
+ * RestoreSlotFromDisk() to delete ephemeral slots during startup. (Both ERROR
+ * and FATAL should lead to cleanup even before the cluster goes down.)
+ */
+static LogicalDecodingContext *
+setup_logical_decoding(Oid relid, const char *slotname, TupleDesc tupdesc)
+{
+ LogicalDecodingContext *ctx;
+ ClusterDecodingState *dstate;
+
+ /* RS_TEMPORARY so that the slot gets cleaned up on ERROR. */
+ ReplicationSlotCreate(slotname, true, RS_TEMPORARY, false, false, false);
+
+ /*
+ * Neither prepare_write nor do_write callback nor update_progress is
+ * useful for us.
+ *
+ * Regarding the value of need_full_snapshot, we pass false because the
+ * table we are processing is present in ClusteredRelsHash and therefore,
+ * regarding logical decoding, treated like a catalog.
+ */
+ ctx = CreateInitDecodingContext(REPL_PLUGIN_NAME,
+ NIL,
+ false,
+ InvalidXLogRecPtr,
+ XL_ROUTINE(.page_read = read_local_xlog_page,
+ .segment_open = wal_segment_open,
+ .segment_close = wal_segment_close),
+ NULL, NULL, NULL);
+
+ /*
+ * We have no control over the setting of fast_forward, so at least check it.
+ */
+ Assert(!ctx->fast_forward);
+
+ DecodingContextFindStartpoint(ctx);
+
+ /* Some WAL records should have been read. */
+ Assert(ctx->reader->EndRecPtr != InvalidXLogRecPtr);
+
+ XLByteToSeg(ctx->reader->EndRecPtr, cluster_current_segment,
+ wal_segment_size);
+
+ /*
+ * Setup structures to store decoded changes.
+ */
+ dstate = palloc0(sizeof(ClusterDecodingState));
+ dstate->relid = relid;
+ dstate->tstore = tuplestore_begin_heap(false, false,
+ maintenance_work_mem);
+ dstate->tupdesc = tupdesc;
+
+ /* Initialize the descriptor to store the changes ... */
+ dstate->tupdesc_change = CreateTemplateTupleDesc(1);
+
+ TupleDescInitEntry(dstate->tupdesc_change, 1, NULL, BYTEAOID, -1, 0);
+ /* ... as well as the corresponding slot. */
+ dstate->tsslot = MakeSingleTupleTableSlot(dstate->tupdesc_change,
+ &TTSOpsMinimalTuple);
+
+ dstate->resowner = ResourceOwnerCreate(CurrentResourceOwner,
+ "logical decoding");
+
+ ctx->output_writer_private = dstate;
+ return ctx;
+}
+
+/*
+ * Retrieve the tuple from a change structure. The change data is not assumed
+ * to be aligned.
+ */
+static HeapTuple
+get_changed_tuple(ConcurrentChange *change)
+{
+ HeapTupleData tup_data;
+ HeapTuple result;
+ char *src;
+
+ /*
+ * Ensure alignment before accessing the fields. (This is why we can't use
+ * heap_copytuple() instead of this function.)
+ */
+ memcpy(&tup_data, &change->tup_data, sizeof(HeapTupleData));
+
+ result = (HeapTuple) palloc(HEAPTUPLESIZE + tup_data.t_len);
+ memcpy(result, &tup_data, sizeof(HeapTupleData));
+ result->t_data = (HeapTupleHeader) ((char *) result + HEAPTUPLESIZE);
+ src = (char *) change + sizeof(ConcurrentChange);
+ memcpy(result->t_data, src, result->t_len);
+
+ return result;
+}
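
[Editor's aside, not part of the patch: the memcpy-into-a-local pattern above
is the standard way to read a struct stored at an arbitrary byte offset; a
direct pointer cast could be a misaligned access. A minimal standalone
illustration; the struct and function below are hypothetical, not from the
patch.]

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical fixed-size header embedded in a byte buffer. */
typedef struct Header
{
	unsigned int len;
	unsigned int flags;
} Header;

/*
 * Read the header stored at 'offset' in 'buf'. Since the offset may not
 * satisfy the struct's alignment requirement, copy the bytes into an
 * aligned local variable instead of dereferencing (Header *) (buf + offset),
 * which would be undefined behavior on alignment-sensitive platforms.
 */
unsigned int
read_len_at(const char *buf, size_t offset)
{
	Header		h;

	memcpy(&h, buf + offset, sizeof(Header));
	return h.len;
}
```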
+
+/*
+ * Decode logical changes from the WAL sequence up to end_of_wal.
+ */
+void
+cluster_decode_concurrent_changes(LogicalDecodingContext *ctx,
+ XLogRecPtr end_of_wal)
+{
+ ClusterDecodingState *dstate;
+ ResourceOwner resowner_old;
+ PgBackendProgress progress;
+
+ /*
+ * Invalidate the "present" cache before moving to "(recent) history".
+ */
+ InvalidateSystemCaches();
+
+ dstate = (ClusterDecodingState *) ctx->output_writer_private;
+ resowner_old = CurrentResourceOwner;
+ CurrentResourceOwner = dstate->resowner;
+
+ /*
+ * reorderbuffer.c uses an internal subtransaction, whose abort ends the
+ * command progress reporting. Save the status here so we can restore it
+ * when done with the decoding.
+ */
+ memcpy(&progress, &MyBEEntry->st_progress, sizeof(PgBackendProgress));
+
+ PG_TRY();
+ {
+ while (ctx->reader->EndRecPtr < end_of_wal)
+ {
+ XLogRecord *record;
+ XLogSegNo segno_new;
+ char *errm = NULL;
+ XLogRecPtr end_lsn;
+
+ record = XLogReadRecord(ctx->reader, &errm);
+ if (errm)
+ elog(ERROR, "%s", errm);
+
+ if (record != NULL)
+ LogicalDecodingProcessRecord(ctx, ctx->reader);
+
+ /*
+ * If WAL segment boundary has been crossed, inform the decoding
+ * system that the catalog_xmin can advance. (We can confirm more
+ * often, but filling a single WAL segment should not take much
+ * time.)
+ */
+ end_lsn = ctx->reader->EndRecPtr;
+ XLByteToSeg(end_lsn, segno_new, wal_segment_size);
+ if (segno_new != cluster_current_segment)
+ {
+ LogicalConfirmReceivedLocation(end_lsn);
+ elog(DEBUG1, "cluster: confirmed receive location %X/%X",
+ (uint32) (end_lsn >> 32), (uint32) end_lsn);
+ cluster_current_segment = segno_new;
+ }
+
+ CHECK_FOR_INTERRUPTS();
+ }
+ InvalidateSystemCaches();
+ CurrentResourceOwner = resowner_old;
+ }
+ PG_CATCH();
+ {
+ InvalidateSystemCaches();
+ CurrentResourceOwner = resowner_old;
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+
+ /* Restore the progress reporting status. */
+ pgstat_progress_restore_state(&progress);
+}
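
[Editor's aside, not part of the patch: the segment-boundary test above boils
down to integer division of the LSN by wal_segment_size, which is what
XLByteToSeg() computes. A standalone sketch of the crossing check, assuming
the default 16MB segment size; the names below are illustrative.]

```c
#include <assert.h>
#include <stdint.h>

/* Default PostgreSQL WAL segment size: 16 MB. */
#define SEG_SIZE ((uint64_t) 16 * 1024 * 1024)

/* Essentially what XLByteToSeg() computes: segment number = LSN / size. */
uint64_t
lsn_to_segno(uint64_t lsn)
{
	return lsn / SEG_SIZE;
}

/*
 * True when 'end_lsn' has moved past the segment we last confirmed, i.e.
 * it is time to call LogicalConfirmReceivedLocation() so the slot's
 * catalog_xmin can advance.
 */
int
crossed_segment_boundary(uint64_t end_lsn, uint64_t current_segno)
{
	return lsn_to_segno(end_lsn) != current_segno;
}
```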
+
+/*
+ * Apply changes that happened during the initial load.
+ *
+ * Scan key is passed by caller, so it does not have to be constructed
+ * multiple times. Key entries have all fields initialized, except for
+ * sk_argument.
+ */
+static void
+apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
+ ScanKey key, int nkeys, IndexInsertState *iistate)
+{
+ TupleTableSlot *index_slot, *ident_slot;
+ HeapTuple tup_old = NULL;
+
+ if (dstate->nchanges == 0)
+ return;
+
+ /* TupleTableSlot is needed to pass the tuple to ExecInsertIndexTuples(). */
+ index_slot = MakeSingleTupleTableSlot(dstate->tupdesc, &TTSOpsHeapTuple);
+ iistate->econtext->ecxt_scantuple = index_slot;
+
+ /* A slot to fetch tuples from identity index. */
+ ident_slot = table_slot_create(rel, NULL);
+
+ while (tuplestore_gettupleslot(dstate->tstore, true, false,
+ dstate->tsslot))
+ {
+ bool shouldFree;
+ HeapTuple tup_change,
+ tup,
+ tup_exist;
+ char *change_raw;
+ ConcurrentChange *change;
+ Snapshot snapshot;
+ bool isnull[1];
+ Datum values[1];
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Get the change from the single-column tuple. */
+ tup_change = ExecFetchSlotHeapTuple(dstate->tsslot, false, &shouldFree);
+ heap_deform_tuple(tup_change, dstate->tupdesc_change, values, isnull);
+ Assert(!isnull[0]);
+
+ /* This is bytea, but char* is easier to work with. */
+ change_raw = (char *) DatumGetByteaP(values[0]);
+
+ change = (ConcurrentChange *) VARDATA(change_raw);
+
+ /* TRUNCATE change contains no tuple, so process it separately. */
+ if (change->kind == CHANGE_TRUNCATE)
+ {
+ /*
+ * All the things that ExecuteTruncateGuts() does (such as firing
+ * triggers or handling the DROP_CASCADE behavior) should have
+ * taken place on the source relation. Thus we only do the actual
+ * truncation of the new relation (and its indexes).
+ */
+ heap_truncate_one_rel(rel);
+
+ pfree(tup_change);
+ continue;
+ }
+
+ /*
+ * Extract the tuple from the change. The tuple is copied here because
+ * it might be assigned to 'tup_old', in which case it needs to
+ * survive into the next iteration.
+ */
+ tup = get_changed_tuple(change);
+
+ if (change->kind == CHANGE_UPDATE_OLD)
+ {
+ Assert(tup_old == NULL);
+ tup_old = tup;
+ }
+ else if (change->kind == CHANGE_INSERT)
+ {
+ Assert(tup_old == NULL);
+
+ apply_concurrent_insert(rel, change, tup, iistate, index_slot);
+
+ pfree(tup);
+ }
+ else if (change->kind == CHANGE_UPDATE_NEW ||
+ change->kind == CHANGE_DELETE)
+ {
+ IndexScanDesc ind_scan = NULL;
+ HeapTuple tup_key;
+
+ if (change->kind == CHANGE_UPDATE_NEW)
+ {
+ tup_key = tup_old != NULL ? tup_old : tup;
+ }
+ else
+ {
+ Assert(tup_old == NULL);
+ tup_key = tup;
+ }
+
+ /*
+ * Find the tuple to be updated or deleted.
+ *
+ * As the table being CLUSTERed concurrently is considered a
+ * "user catalog", new CID is WAL-logged and decoded. And since we
+ * use the same XID that the original DMLs did, the snapshot used
+ * for the logical decoding (by now converted to a non-historic
+ * MVCC snapshot) should see the tuples inserted previously into
+ * the new heap and/or updated there.
+ */
+ snapshot = change->snapshot;
+
+ /*
+ * Set what should be considered current transaction (and
+ * subtransactions) during visibility check.
+ *
+ * Note that this snapshot was created from a historic snapshot
+ * using SnapBuildMVCCFromHistoric(), which does not touch
+ * 'subxip'. Thus, unlike in a regular MVCC snapshot, the array
+ * does not contain (sub)transactions other than the one whose
+ * data changes we are applying.
+ */
+ SetClusterCurrentXids(snapshot->subxip, snapshot->subxcnt);
+
+ tup_exist = find_target_tuple(rel, key, nkeys, tup_key, snapshot,
+ iistate, ident_slot, &ind_scan);
+ if (tup_exist == NULL)
+ elog(ERROR, "failed to find target tuple");
+
+ if (change->kind == CHANGE_UPDATE_NEW)
+ apply_concurrent_update(rel, tup, tup_exist, change, iistate,
+ index_slot);
+ else
+ apply_concurrent_delete(rel, tup_exist, change);
+
+ ResetClusterCurrentXids();
+
+ if (tup_old != NULL)
+ {
+ pfree(tup_old);
+ tup_old = NULL;
+ }
+
+ pfree(tup);
+ index_endscan(ind_scan);
+ }
+ else
+ elog(ERROR, "unrecognized kind of change: %d", change->kind);
+
+ /* Free the snapshot if this is the last change that needed it. */
+ Assert(change->snapshot->active_count > 0);
+ change->snapshot->active_count--;
+ if (change->snapshot->active_count == 0)
+ {
+ if (change->snapshot == dstate->snapshot)
+ dstate->snapshot = NULL;
+ FreeSnapshot(change->snapshot);
+ }
+
+ /* TTSOpsMinimalTuple has .get_heap_tuple==NULL. */
+ Assert(shouldFree);
+ pfree(tup_change);
+ }
+
+ tuplestore_clear(dstate->tstore);
+ dstate->nchanges = 0;
+
+ /* Cleanup. */
+ ExecDropSingleTupleTableSlot(index_slot);
+ ExecDropSingleTupleTableSlot(ident_slot);
+}
+
+static void
+apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
+ IndexInsertState *iistate, TupleTableSlot *index_slot)
+{
+ Snapshot snapshot = change->snapshot;
+ List *recheck;
+
+ /*
+ * For INSERT, the visibility information is not important, but we use the
+ * snapshot to get CID. Index functions might need the whole snapshot
+ * anyway.
+ */
+ SetClusterCurrentXids(snapshot->subxip, snapshot->subxcnt);
+
+ /*
+ * Write the tuple into the new heap.
+ *
+ * The snapshot is the one we used to decode the insert (though converted
+ * to "non-historic" MVCC snapshot), i.e. the snapshot's curcid is the
+ * tuple CID incremented by one (due to the "new CID" WAL record that got
+ * written along with the INSERT record). Thus if we want to use the
+ * original CID, we need to subtract 1 from curcid.
+ */
+ Assert(snapshot->curcid != InvalidCommandId &&
+ snapshot->curcid > FirstCommandId);
+
+ heap_insert(rel, tup, change->xid, snapshot->curcid - 1,
+ HEAP_INSERT_NO_LOGICAL, NULL);
+
+ /*
+ * Update indexes.
+ *
+ * The snapshot is pushed in case functions used by the indexes need an
+ * active snapshot and the caller has not set one.
+ */
+ PushActiveSnapshot(snapshot);
+ ExecStoreHeapTuple(tup, index_slot, false);
+ recheck = ExecInsertIndexTuples(iistate->rri,
+ index_slot,
+ iistate->estate,
+ false, /* update */
+ false, /* noDupErr */
+ NULL, /* specConflict */
+ NIL, /* arbiterIndexes */
+ false /* onlySummarizing */
+ );
+ PopActiveSnapshot();
+ ResetClusterCurrentXids();
+
+ /*
+ * If recheck is required, it must have been performed on the source
+ * relation by now. (All the logical changes we process here are already
+ * committed.)
+ */
+ list_free(recheck);
+
+ pgstat_progress_incr_param(PROGRESS_CLUSTER_HEAP_TUPLES_INSERTED, 1);
+}
+
+static void
+apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
+ ConcurrentChange *change, IndexInsertState *iistate,
+ TupleTableSlot *index_slot)
+{
+ List *recheck;
+ LockTupleMode lockmode;
+ TU_UpdateIndexes update_indexes;
+ ItemPointerData tid_old_new_heap;
+ TM_Result res;
+ Snapshot snapshot = change->snapshot;
+ TM_FailureData tmfd;
+
+ /* Location of the existing tuple in the new heap. */
+ ItemPointerCopy(&tup_target->t_self, &tid_old_new_heap);
+
+ /*
+ * Write the new tuple into the new heap. ('tup' gets the TID assigned
+ * here.)
+ *
+ * Regarding CID, see the comment in apply_concurrent_insert().
+ */
+ Assert(snapshot->curcid != InvalidCommandId &&
+ snapshot->curcid > FirstCommandId);
+
+ res = heap_update(rel, &tid_old_new_heap, tup,
+ change->xid, snapshot->curcid - 1,
+ InvalidSnapshot,
+ false, /* no wait - only we are doing changes */
+ &tmfd, &lockmode, &update_indexes,
+ /* wal_logical */
+ false);
+ if (res != TM_Ok)
+ ereport(ERROR, (errmsg("failed to apply concurrent UPDATE")));
+
+ ExecStoreHeapTuple(tup, index_slot, false);
+
+ if (update_indexes != TU_None)
+ {
+ PushActiveSnapshot(snapshot);
+ recheck = ExecInsertIndexTuples(iistate->rri,
+ index_slot,
+ iistate->estate,
+ true, /* update */
+ false, /* noDupErr */
+ NULL, /* specConflict */
+ NIL, /* arbiterIndexes */
+ /* onlySummarizing */
+ update_indexes == TU_Summarizing);
+ PopActiveSnapshot();
+ list_free(recheck);
+ }
+
+ pgstat_progress_incr_param(PROGRESS_CLUSTER_HEAP_TUPLES_UPDATED, 1);
+}
+
+static void
+apply_concurrent_delete(Relation rel, HeapTuple tup_target,
+ ConcurrentChange *change)
+{
+ ItemPointerData tid_old_new_heap;
+ TM_Result res;
+ TM_FailureData tmfd;
+ Snapshot snapshot = change->snapshot;
+
+ /* Regarding CID, see the comment in apply_concurrent_insert(). */
+ Assert(snapshot->curcid != InvalidCommandId &&
+ snapshot->curcid > FirstCommandId);
+
+ /* Location of the existing tuple in the new heap. */
+ ItemPointerCopy(&tup_target->t_self, &tid_old_new_heap);
+
+ res = heap_delete(rel, &tid_old_new_heap, change->xid,
+ snapshot->curcid - 1, InvalidSnapshot, false,
+ &tmfd, false,
+ /* wal_logical */
+ false);
+
+ if (res != TM_Ok)
+ ereport(ERROR, (errmsg("failed to apply concurrent DELETE")));
+
+ pgstat_progress_incr_param(PROGRESS_CLUSTER_HEAP_TUPLES_DELETED, 1);
+}
+
+/*
+ * Find the tuple to be updated or deleted.
+ *
+ * 'key' is a pre-initialized scan key, into which the function will put the
+ * key values.
+ *
+ * 'tup_key' is a tuple containing the key values for the scan.
+ *
+ * On exit, '*scan_p' contains the scan descriptor used. The caller must
+ * close it once the returned tuple is no longer needed.
+ */
+static HeapTuple
+find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
+ Snapshot snapshot, IndexInsertState *iistate,
+ TupleTableSlot *ident_slot, IndexScanDesc *scan_p)
+{
+ IndexScanDesc scan;
+ Form_pg_index ident_form;
+ int2vector *ident_indkey;
+ HeapTuple result = NULL;
+
+ scan = index_beginscan(rel, iistate->ident_index, snapshot,
+ nkeys, 0);
+ *scan_p = scan;
+ index_rescan(scan, key, nkeys, NULL, 0);
+
+ /* Info needed to retrieve key values from heap tuple. */
+ ident_form = iistate->ident_index->rd_index;
+ ident_indkey = &ident_form->indkey;
+
+ /* Use the incoming tuple to finalize the scan key. */
+ for (int i = 0; i < scan->numberOfKeys; i++)
+ {
+ ScanKey entry;
+ bool isnull;
+ int16 attno_heap;
+
+ entry = &scan->keyData[i];
+ attno_heap = ident_indkey->values[i];
+ entry->sk_argument = heap_getattr(tup_key,
+ attno_heap,
+ rel->rd_att,
+ &isnull);
+ Assert(!isnull);
+ }
+ if (index_getnext_slot(scan, ForwardScanDirection, ident_slot))
+ {
+ bool shouldFree;
+
+ result = ExecFetchSlotHeapTuple(ident_slot, false, &shouldFree);
+ /* TTSOpsBufferHeapTuple has .get_heap_tuple != NULL. */
+ Assert(!shouldFree);
+ }
+
+ return result;
+}
+
+/*
+ * Decode and apply concurrent changes.
+ *
+ * Pass 'rel_src' only if its reltoastrelid is needed.
+ */
+static void
+process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
+ Relation rel_dst, Relation rel_src, ScanKey ident_key,
+ int ident_key_nentries, IndexInsertState *iistate)
+{
+ ClusterDecodingState *dstate;
+
+ pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
+ PROGRESS_CLUSTER_PHASE_CATCH_UP);
+
+ dstate = (ClusterDecodingState *) ctx->output_writer_private;
+
+ cluster_decode_concurrent_changes(ctx, end_of_wal);
+
+ if (dstate->nchanges == 0)
+ return;
+
+ PG_TRY();
+ {
+ /*
+ * Make sure that TOAST values can eventually be accessed via the old
+ * relation - see comment in copy_table_data().
+ */
+ if (rel_src)
+ rel_dst->rd_toastoid = rel_src->rd_rel->reltoastrelid;
+
+ apply_concurrent_changes(dstate, rel_dst, ident_key,
+ ident_key_nentries, iistate);
+ }
+ PG_FINALLY();
+ {
+ ResetClusterCurrentXids();
+
+ if (rel_src)
+ rel_dst->rd_toastoid = InvalidOid;
+ }
+ PG_END_TRY();
+}
+
+static IndexInsertState *
+get_index_insert_state(Relation relation, Oid ident_index_id)
+{
+ EState *estate;
+ int i;
+ IndexInsertState *result;
+
+ result = (IndexInsertState *) palloc0(sizeof(IndexInsertState));
+ estate = CreateExecutorState();
+ result->econtext = GetPerTupleExprContext(estate);
+
+ result->rri = (ResultRelInfo *) palloc(sizeof(ResultRelInfo));
+ InitResultRelInfo(result->rri, relation, 0, 0, 0);
+ ExecOpenIndices(result->rri, false);
+
+ /*
+ * Find the relcache entry of the identity index so that we spend no extra
+ * effort to open / close it.
+ */
+ for (i = 0; i < result->rri->ri_NumIndices; i++)
+ {
+ Relation ind_rel;
+
+ ind_rel = result->rri->ri_IndexRelationDescs[i];
+ if (ind_rel->rd_id == ident_index_id)
+ result->ident_index = ind_rel;
+ }
+ if (result->ident_index == NULL)
+ elog(ERROR, "failed to find identity index");
+
+ /* Only initialize fields needed by ExecInsertIndexTuples(). */
+ result->estate = estate;
+
+ return result;
+}
+
+/*
+ * Build scan key to process logical changes.
+ */
+static ScanKey
+build_identity_key(Oid ident_idx_oid, Relation rel_src, int *nentries)
+{
+ Relation ident_idx_rel;
+ Form_pg_index ident_idx;
+ int n,
+ i;
+ ScanKey result;
+
+ Assert(OidIsValid(ident_idx_oid));
+ ident_idx_rel = index_open(ident_idx_oid, AccessShareLock);
+ ident_idx = ident_idx_rel->rd_index;
+ n = ident_idx->indnatts;
+ result = (ScanKey) palloc(sizeof(ScanKeyData) * n);
+ for (i = 0; i < n; i++)
+ {
+ ScanKey entry;
+ int16 relattno;
+ Form_pg_attribute att;
+ Oid opfamily,
+ opcintype,
+ opno,
+ opcode;
+
+ entry = &result[i];
+ relattno = ident_idx->indkey.values[i];
+ if (relattno >= 1)
+ {
+ TupleDesc desc;
+
+ desc = rel_src->rd_att;
+ att = TupleDescAttr(desc, relattno - 1);
+ }
+ else
+ elog(ERROR, "unexpected attribute number %d in index", relattno);
+
+ opfamily = ident_idx_rel->rd_opfamily[i];
+ opcintype = ident_idx_rel->rd_opcintype[i];
+ opno = get_opfamily_member(opfamily, opcintype, opcintype,
+ BTEqualStrategyNumber);
+
+ if (!OidIsValid(opno))
+ elog(ERROR, "failed to find = operator for type %u", opcintype);
+
+ opcode = get_opcode(opno);
+ if (!OidIsValid(opcode))
+ elog(ERROR, "failed to find function for operator %u", opno);
+
+ /* Initialize everything but argument. */
+ ScanKeyInit(entry,
+ i + 1,
+ BTEqualStrategyNumber, opcode,
+ (Datum) NULL);
+ entry->sk_collation = att->attcollation;
+ }
+ index_close(ident_idx_rel, AccessShareLock);
+
+ *nentries = n;
+ return result;
+}
+
+static void
+free_index_insert_state(IndexInsertState *iistate)
+{
+ ExecCloseIndices(iistate->rri);
+ FreeExecutorState(iistate->estate);
+ pfree(iistate->rri);
+ pfree(iistate);
+}
+
+static void
+cleanup_logical_decoding(LogicalDecodingContext *ctx)
+{
+ ClusterDecodingState *dstate;
+
+ dstate = (ClusterDecodingState *) ctx->output_writer_private;
+
+ ExecDropSingleTupleTableSlot(dstate->tsslot);
+ FreeTupleDesc(dstate->tupdesc_change);
+ FreeTupleDesc(dstate->tupdesc);
+ tuplestore_end(dstate->tstore);
+
+ FreeDecodingContext(ctx);
+}
+
+/*
+ * The final steps of rebuild_relation() for concurrent processing.
+ *
+ * On entry, NewHeap is locked in AccessExclusiveLock mode. OldHeap and its
+ * clustering index (if one is passed) are still locked in a mode that allows
+ * concurrent data changes. On exit, both tables and their indexes are closed,
+ * but locked in AccessExclusiveLock mode.
+ */
+static void
+rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
+ Relation cl_index,
+ CatalogState *cat_state,
+ LogicalDecodingContext *ctx,
+ bool swap_toast_by_content,
+ TransactionId frozenXid,
+ MultiXactId cutoffMulti)
+{
+ LOCKMODE lmode_old;
+ List *ind_oids_new;
+ Oid old_table_oid = RelationGetRelid(OldHeap);
+ Oid new_table_oid = RelationGetRelid(NewHeap);
+ List *ind_oids_old = RelationGetIndexList(OldHeap);
+ ListCell *lc, *lc2;
+ char relpersistence;
+ bool is_system_catalog;
+ Oid ident_idx_old, ident_idx_new;
+ IndexInsertState *iistate;
+ ScanKey ident_key;
+ int ident_key_nentries;
+ XLogRecPtr wal_insert_ptr, end_of_wal;
+ char dummy_rec_data = '\0';
+ RelReopenInfo *rri = NULL;
+ int nrel;
+ Relation *ind_refs_all, *ind_refs_p;
+
+ /* Like in cluster_rel(). */
+ lmode_old = LOCK_CLUSTER_CONCURRENT;
+ Assert(CheckRelationLockedByMe(OldHeap, lmode_old, false));
+ Assert(cl_index == NULL ||
+ CheckRelationLockedByMe(cl_index, lmode_old, false));
+ /* This is expected from the caller. */
+ Assert(CheckRelationLockedByMe(NewHeap, AccessExclusiveLock, false));
+
+ ident_idx_old = RelationGetReplicaIndex(OldHeap);
+
+ /*
+ * Unlike the exclusive case, we build new indexes for the new relation
+ * rather than swapping the storage and reindexing the old relation. The
+ * point is that the index build can take some time, so we do it before we
+ * get AccessExclusiveLock on the old heap and therefore we cannot swap
+ * the heap storage yet.
+ *
+ * index_create() locks the new indexes in AccessExclusiveLock mode
+ * during creation - no need to change that.
+ */
+ ind_oids_new = build_new_indexes(NewHeap, OldHeap, ind_oids_old);
+
+ /*
+ * Processing shouldn't start without a valid identity index.
+ */
+ Assert(OidIsValid(ident_idx_old));
+
+ /* Find "identity index" on the new relation. */
+ ident_idx_new = InvalidOid;
+ forboth(lc, ind_oids_old, lc2, ind_oids_new)
+ {
+ Oid ind_old = lfirst_oid(lc);
+ Oid ind_new = lfirst_oid(lc2);
+
+ if (ident_idx_old == ind_old)
+ {
+ ident_idx_new = ind_new;
+ break;
+ }
+ }
+ if (!OidIsValid(ident_idx_new))
+ /*
+ * Should not happen, given our lock on the old relation.
+ */
+ ereport(ERROR,
+ (errmsg("identity index missing on the new relation")));
+
+ /* Executor state to update indexes. */
+ iistate = get_index_insert_state(NewHeap, ident_idx_new);
+
+ /*
+ * Build scan key that we'll use to look for rows to be updated / deleted
+ * during logical decoding.
+ */
+ ident_key = build_identity_key(ident_idx_new, OldHeap, &ident_key_nentries);
+
+ /*
+ * Flush all WAL records inserted so far (possibly except for the last
+ * incomplete page, see GetInsertRecPtr), to minimize the amount of data
+ * we need to flush while holding exclusive lock on the source table.
+ */
+ wal_insert_ptr = GetInsertRecPtr();
+ XLogFlush(wal_insert_ptr);
+ end_of_wal = GetFlushRecPtr(NULL);
+
+ /*
+ * Apply concurrent changes first time, to minimize the time we need to
+ * hold AccessExclusiveLock. (Quite some amount of WAL could have been
+ * written during the data copying and index creation.)
+ */
+ process_concurrent_changes(ctx, end_of_wal, NewHeap,
+ swap_toast_by_content ? OldHeap : NULL,
+ ident_key, ident_key_nentries, iistate);
+
+ /*
+ * Release the locks that allowed concurrent data changes, in order to
+ * acquire the AccessExclusiveLock.
+ */
+ nrel = 0;
+ /*
+ * We unlock the old relation (and its clustering index), but then we will
+ * lock the relation and *all* its indexes because we want to swap their
+ * storage.
+ *
+ * (NewHeap is already locked, as well as its indexes.)
+ */
+ rri = palloc_array(RelReopenInfo, 1 + list_length(ind_oids_old));
+ init_rel_reopen_info(&rri[nrel++], &OldHeap, InvalidOid,
+ LOCK_CLUSTER_CONCURRENT, AccessExclusiveLock);
+ /* References to the re-opened indexes will be stored in this array. */
+ ind_refs_all = palloc_array(Relation, list_length(ind_oids_old));
+ ind_refs_p = ind_refs_all;
+ /* The clustering index is a special case. */
+ if (cl_index)
+ {
+ *ind_refs_p = cl_index;
+ init_rel_reopen_info(&rri[nrel], ind_refs_p, InvalidOid,
+ LOCK_CLUSTER_CONCURRENT, AccessExclusiveLock);
+ nrel++;
+ ind_refs_p++;
+ }
+ /*
+ * Initialize also the entries for the other indexes (currently unlocked)
+ * because we will have to lock them.
+ */
+ foreach(lc, ind_oids_old)
+ {
+ Oid ind_oid;
+
+ ind_oid = lfirst_oid(lc);
+ /* Clustering index is already in the array, or there is none. */
+ if (cl_index && RelationGetRelid(cl_index) == ind_oid)
+ continue;
+
+ Assert(nrel < (1 + list_length(ind_oids_old)));
+
+ *ind_refs_p = NULL;
+ init_rel_reopen_info(&rri[nrel],
+ /*
+ * In this special case we do not have the
+ * relcache reference, use OID instead.
+ */
+ ind_refs_p,
+ ind_oid,
+ NoLock, /* Nothing to unlock. */
+ AccessExclusiveLock);
+
+ nrel++;
+ ind_refs_p++;
+ }
+ /* Perform the actual unlocking and re-locking. */
+ unlock_and_close_relations(rri, nrel);
+ reopen_relations(rri, nrel);
+
+ /*
+ * In addition, lock the OldHeap's TOAST relation that we skipped for the
+ * CONCURRENTLY option in copy_table_data(). This lock will be needed to
+ * swap the relation files.
+ */
+ if (OidIsValid(OldHeap->rd_rel->reltoastrelid))
+ LockRelationOid(OldHeap->rd_rel->reltoastrelid, AccessExclusiveLock);
+
+ /*
+ * Check if the new indexes match the old ones, i.e. no changes occurred
+ * while OldHeap was unlocked.
+ *
+ * XXX It's probably not necessary to check the relation tuple descriptor
+ * here because the logical decoding was already active when we released
+ * the lock, and thus the corresponding data changes won't be lost.
+ * However processing of those changes might take a lot of time.
+ */
+ check_catalog_changes(OldHeap, cat_state);
+
+ /*
+ * Tuples and pages of the old heap will be gone, but the heap will stay.
+ */
+ TransferPredicateLocksToHeapRelation(OldHeap);
+ /* The same for indexes. */
+ for (int i = 0; i < (nrel - 1); i++)
+ {
+ Relation index = ind_refs_all[i];
+
+ TransferPredicateLocksToHeapRelation(index);
+
+ /*
+ * References to indexes on the old relation are not needed anymore,
+ * however locks stay till the end of the transaction.
+ */
+ index_close(index, NoLock);
+ }
+ pfree(ind_refs_all);
+
+ /*
+ * Flush anything we see in WAL, to make sure that all changes committed
+ * while we were waiting for the exclusive lock are available for
+ * decoding. This should not be necessary if all backends had
+ * synchronous_commit set, but we can't rely on this setting.
+ *
+ * Unfortunately, GetInsertRecPtr() may lag behind the actual insert
+ * position, and GetLastImportantRecPtr() points at the start of the last
+ * record rather than at the end. Thus the simplest way to determine the
+ * insert position is to insert a dummy record and use its LSN.
+ *
+ * XXX Consider using GetLastImportantRecPtr() and adding the size of the
+ * last record (plus the total size of all the page headers the record
+ * spans)?
+ */
+ XLogBeginInsert();
+ XLogRegisterData(&dummy_rec_data, 1);
+ wal_insert_ptr = XLogInsert(RM_XLOG_ID, XLOG_NOOP);
+ XLogFlush(wal_insert_ptr);
+ end_of_wal = GetFlushRecPtr(NULL);
+
+ /* Apply the concurrent changes again. */
+ process_concurrent_changes(ctx, end_of_wal, NewHeap,
+ swap_toast_by_content ? OldHeap : NULL,
+ ident_key, ident_key_nentries, iistate);
+
+ /* Remember info about rel before closing OldHeap */
+ relpersistence = OldHeap->rd_rel->relpersistence;
+ is_system_catalog = IsSystemRelation(OldHeap);
+
+ pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
+ PROGRESS_CLUSTER_PHASE_SWAP_REL_FILES);
+
+ forboth(lc, ind_oids_old, lc2, ind_oids_new)
+ {
+ Oid ind_old = lfirst_oid(lc);
+ Oid ind_new = lfirst_oid(lc2);
+ Oid mapped_tables[4];
+
+ /* Zero out possible results from swap_relation_files */
+ memset(mapped_tables, 0, sizeof(mapped_tables));
+
+ swap_relation_files(ind_old, ind_new,
+ (old_table_oid == RelationRelationId),
+ swap_toast_by_content,
+ true,
+ InvalidTransactionId,
+ InvalidMultiXactId,
+ mapped_tables);
+
+#ifdef USE_ASSERT_CHECKING
+ /*
+ * Concurrent processing is not supported for system relations, so
+ * there should be no mapped tables.
+ */
+ for (int i = 0; i < 4; i++)
+ Assert(mapped_tables[i] == 0);
+#endif
+ }
+
+ /* The new indexes must be visible for deletion. */
+ CommandCounterIncrement();
+
+ /* Close the old heap but keep lock until transaction commit. */
+ table_close(OldHeap, NoLock);
+ /* Close the new heap. (We didn't have to open its indexes). */
+ table_close(NewHeap, NoLock);
+
+ /* Cleanup what we don't need anymore. (And close the identity index.) */
+ pfree(ident_key);
+ free_index_insert_state(iistate);
+
+ /*
+ * Swap the relations and their TOAST relations and TOAST indexes. This
+ * also drops the new relation and its indexes.
+ *
+ * (System catalogs are currently not supported.)
+ */
+ Assert(!is_system_catalog);
+ finish_heap_swap(old_table_oid, new_table_oid,
+ is_system_catalog,
+ swap_toast_by_content,
+ false, true, false,
+ frozenXid, cutoffMulti,
+ relpersistence);
+
+ pfree(rri);
+}
+
+/*
+ * Build indexes on NewHeap according to those on OldHeap.
+ *
+ * OldIndexes is the list of index OIDs on OldHeap.
+ *
+ * A list of OIDs of the corresponding indexes created on NewHeap is
+ * returned. The order of items matches, so the two lists can be used to swap
+ * index storage.
+ */
+static List *
+build_new_indexes(Relation NewHeap, Relation OldHeap, List *OldIndexes)
+{
+ StringInfo ind_name;
+ ListCell *lc;
+ List *result = NIL;
+
+ pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
+ PROGRESS_CLUSTER_PHASE_REBUILD_INDEX);
+
+ ind_name = makeStringInfo();
+
+ foreach(lc, OldIndexes)
+ {
+ Oid ind_oid,
+ ind_oid_new,
+ tbsp_oid;
+ Relation ind;
+ IndexInfo *ind_info;
+ int i,
+ heap_col_id;
+ List *colnames;
+ int16 indnatts;
+ Oid *collations,
+ *opclasses;
+ HeapTuple tup;
+ bool isnull;
+ Datum d;
+ oidvector *oidvec;
+ int2vector *int2vec;
+ size_t oid_arr_size;
+ size_t int2_arr_size;
+ int16 *indoptions;
+ text *reloptions = NULL;
+ bits16 flags;
+ Datum *opclassOptions;
+ NullableDatum *stattargets;
+
+ ind_oid = lfirst_oid(lc);
+ ind = index_open(ind_oid, AccessShareLock);
+ ind_info = BuildIndexInfo(ind);
+
+ tbsp_oid = ind->rd_rel->reltablespace;
+ /*
+ * Index names don't really matter here because we'll eventually use
+ * only their storage. Just make them unique within the table.
+ */
+ resetStringInfo(ind_name);
+ appendStringInfo(ind_name, "ind_%d",
+ list_cell_number(OldIndexes, lc));
+
+ flags = 0;
+ if (ind->rd_index->indisprimary)
+ flags |= INDEX_CREATE_IS_PRIMARY;
+
+ colnames = NIL;
+ indnatts = ind->rd_index->indnatts;
+ oid_arr_size = sizeof(Oid) * indnatts;
+ int2_arr_size = sizeof(int16) * indnatts;
+
+ collations = (Oid *) palloc(oid_arr_size);
+ for (i = 0; i < indnatts; i++)
+ {
+ char *colname;
+
+ heap_col_id = ind->rd_index->indkey.values[i];
+ if (heap_col_id > 0)
+ {
+ Form_pg_attribute att;
+
+ /* Normal attribute. */
+ att = TupleDescAttr(OldHeap->rd_att, heap_col_id - 1);
+ colname = pstrdup(NameStr(att->attname));
+ collations[i] = att->attcollation;
+ }
+ else if (heap_col_id == 0)
+ {
+ HeapTuple tuple;
+ Form_pg_attribute att;
+
+ /*
+ * Expression column is not present in relcache. What we need
+ * here is an attribute of the *index* relation.
+ */
+ tuple = SearchSysCache2(ATTNUM,
+ ObjectIdGetDatum(ind_oid),
+ Int16GetDatum(i + 1));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR,
+ "cache lookup failed for attribute %d of relation %u",
+ i + 1, ind_oid);
+ att = (Form_pg_attribute) GETSTRUCT(tuple);
+ colname = pstrdup(NameStr(att->attname));
+ collations[i] = att->attcollation;
+ ReleaseSysCache(tuple);
+ }
+ else
+ elog(ERROR, "unexpected column number: %d",
+ heap_col_id);
+
+ colnames = lappend(colnames, colname);
+ }
+
+ /*
+ * Special effort needed for variable length attributes of
+ * Form_pg_index.
+ */
+ tup = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(ind_oid));
+ if (!HeapTupleIsValid(tup))
+ elog(ERROR, "cache lookup failed for index %u", ind_oid);
+ d = SysCacheGetAttr(INDEXRELID, tup, Anum_pg_index_indclass, &isnull);
+ Assert(!isnull);
+ oidvec = (oidvector *) DatumGetPointer(d);
+ opclasses = (Oid *) palloc(oid_arr_size);
+ memcpy(opclasses, oidvec->values, oid_arr_size);
+
+ d = SysCacheGetAttr(INDEXRELID, tup, Anum_pg_index_indoption,
+ &isnull);
+ Assert(!isnull);
+ int2vec = (int2vector *) DatumGetPointer(d);
+ indoptions = (int16 *) palloc(int2_arr_size);
+ memcpy(indoptions, int2vec->values, int2_arr_size);
+ ReleaseSysCache(tup);
+
+ tup = SearchSysCache1(RELOID, ObjectIdGetDatum(ind_oid));
+ if (!HeapTupleIsValid(tup))
+ elog(ERROR, "cache lookup failed for index relation %u", ind_oid);
+ d = SysCacheGetAttr(RELOID, tup, Anum_pg_class_reloptions, &isnull);
+ reloptions = !isnull ? DatumGetTextPCopy(d) : NULL;
+ ReleaseSysCache(tup);
+
+ opclassOptions = palloc0(sizeof(Datum) * ind_info->ii_NumIndexAttrs);
+ for (i = 0; i < ind_info->ii_NumIndexAttrs; i++)
+ opclassOptions[i] = get_attoptions(ind_oid, i + 1);
+
+ stattargets = get_index_stattargets(ind_oid, ind_info);
+
+ /*
+ * Neither parentIndexRelid nor parentConstraintId needs to be passed
+ * since the new catalog entries (pg_constraint, pg_inherits) would
+ * eventually be dropped. Therefore there's no need to record a valid
+ * dependency on the parents.
+ */
+ ind_oid_new = index_create(NewHeap,
+ ind_name->data,
+ InvalidOid,
+ InvalidOid, /* parentIndexRelid */
+ InvalidOid, /* parentConstraintId */
+ InvalidOid,
+ ind_info,
+ colnames,
+ ind->rd_rel->relam,
+ tbsp_oid,
+ collations,
+ opclasses,
+ opclassOptions,
+ indoptions,
+ stattargets,
+ PointerGetDatum(reloptions),
+ flags, /* flags */
+ 0, /* constr_flags */
+ false, /* allow_system_table_mods */
+ false, /* is_internal */
+ NULL /* constraintId */
+ );
+ result = lappend_oid(result, ind_oid_new);
+
+ index_close(ind, AccessShareLock);
+ list_free_deep(colnames);
+ pfree(collations);
+ pfree(opclasses);
+ pfree(indoptions);
+ if (reloptions)
+ pfree(reloptions);
+ }
+
+ return result;
+}
+
+static void
+init_rel_reopen_info(RelReopenInfo *rri, Relation *rel_p, Oid relid,
+ LOCKMODE lockmode_orig, LOCKMODE lockmode_new)
+{
+ rri->rel_p = rel_p;
+ rri->relid = relid;
+ rri->lockmode_orig = lockmode_orig;
+ rri->lockmode_new = lockmode_new;
+}
+
+/*
+ * Unlock and close the relations specified by items of the 'rels' array;
+ * 'nrel' is the number of items.
+ *
+ * Information needed to (re)open the relations (or to issue a meaningful
+ * ERROR) is added to the array items.
+ */
+static void
+unlock_and_close_relations(RelReopenInfo *rels, int nrel)
+{
+ int i;
+ RelReopenInfo *rri;
+
+ /*
+ * First, retrieve the information that we will need for re-opening.
+ *
+ * We could close (and unlock) each relation as soon as we have gathered
+ * the related information, but then we would have to be careful not to
+ * unlock the table until we have the info on all its indexes. (Once we
+ * unlock the table, any index can be dropped, and thus we can fail to get
+ * the name we want to report if re-opening fails.) It seems simpler to
+ * separate the work into two iterations.
+ */
+ for (i = 0; i < nrel; i++)
+ {
+ Relation rel;
+
+ rri = &rels[i];
+ rel = *rri->rel_p;
+
+ if (rel)
+ {
+ Assert(CheckRelationLockedByMe(rel, rri->lockmode_orig, false));
+ Assert(!OidIsValid(rri->relid));
+
+ rri->relid = RelationGetRelid(rel);
+ rri->relkind = rel->rd_rel->relkind;
+ rri->relname = pstrdup(RelationGetRelationName(rel));
+ }
+ else
+ {
+ Assert(OidIsValid(rri->relid));
+
+ rri->relname = get_rel_name(rri->relid);
+ rri->relkind = get_rel_relkind(rri->relid);
+ }
+ }
+
+ /* Second, close the relations. */
+ for (i = 0; i < nrel; i++)
+ {
+ Relation rel;
+
+ rri = &rels[i];
+ rel = *rri->rel_p;
+
+ /* Close the relation if the caller passed one. */
+ if (rel)
+ {
+ if (rri->relkind == RELKIND_RELATION)
+ table_close(rel, rri->lockmode_orig);
+ else
+ {
+ Assert(rri->relkind == RELKIND_INDEX);
+
+ index_close(rel, rri->lockmode_orig);
+ }
+ }
+ }
+}
+
+/*
+ * Re-open the relations closed previously by unlock_and_close_relations().
+ */
+static void
+reopen_relations(RelReopenInfo *rels, int nrel)
+{
+ for (int i = 0; i < nrel; i++)
+ {
+ RelReopenInfo *rri = &rels[i];
+ Relation rel;
+
+ if (rri->relkind == RELKIND_RELATION)
+ {
+ rel = try_table_open(rri->relid, rri->lockmode_new);
+ }
+ else
+ {
+ Assert(rri->relkind == RELKIND_INDEX);
+
+ rel = try_index_open(rri->relid, rri->lockmode_new);
+ }
+
+ if (rel == NULL)
+ {
+ const char *kind_str;
+
+ kind_str = (rri->relkind == RELKIND_RELATION) ? "table" : "index";
+ ereport(ERROR,
+ (errmsg("could not open %s \"%s\"", kind_str,
+ rri->relname),
+ errhint("The %s could have been dropped by another transaction.",
+ kind_str)));
+ }
+ *rri->rel_p = rel;
+
+ pfree(rri->relname);
+ }
+}
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 488ca950d9..af1945e1ed 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -873,7 +873,7 @@ refresh_by_match_merge(Oid matviewOid, Oid tempOid, Oid relowner,
static void
refresh_by_heap_swap(Oid matviewOid, Oid OIDNewHeap, char relpersistence)
{
- finish_heap_swap(matviewOid, OIDNewHeap, false, false, true, true,
+ finish_heap_swap(matviewOid, OIDNewHeap, false, false, true, true, true,
RecentXmin, ReadNextMultiXactId(), relpersistence);
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 5d6151dad1..13f32ede92 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4395,6 +4395,16 @@ AlterTableInternal(Oid relid, List *cmds, bool recurse)
rel = relation_open(relid, lockmode);
+ /*
+ * If lockmode allows, check if VACUUM FULL / CLUSTER CONCURRENTLY is in
+ * progress. If lockmode is too weak, cluster_rel() should detect
+ * incompatible DDLs executed by us.
+ *
+ * XXX We might skip the changes for DDLs which do not change the tuple
+ * descriptor.
+ */
+ check_for_concurrent_cluster(relid, lockmode);
+
EventTriggerAlterTableRelid(relid);
ATController(NULL, rel, cmds, recurse, lockmode, NULL);
@@ -5861,6 +5871,7 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
finish_heap_swap(tab->relid, OIDNewHeap,
false, false, true,
!OidIsValid(tab->newTableSpace),
+ true,
RecentXmin,
ReadNextMultiXactId(),
persistence);
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 0bd000acc5..529c46c186 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -112,7 +112,8 @@ static void vac_truncate_clog(TransactionId frozenXID,
TransactionId lastSaneFrozenXid,
MultiXactId lastSaneMinMulti);
static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
- BufferAccessStrategy bstrategy);
+ BufferAccessStrategy bstrategy, bool isTopLevel,
+ bool whole_database);
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
@@ -153,6 +154,7 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
bool analyze = false;
bool freeze = false;
bool full = false;
+ bool concurrent = false;
bool disable_page_skipping = false;
bool process_main = true;
bool process_toast = true;
@@ -226,6 +228,8 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
freeze = defGetBoolean(opt);
else if (strcmp(opt->defname, "full") == 0)
full = defGetBoolean(opt);
+ else if (strcmp(opt->defname, "concurrently") == 0)
+ concurrent = defGetBoolean(opt);
else if (strcmp(opt->defname, "disable_page_skipping") == 0)
disable_page_skipping = defGetBoolean(opt);
else if (strcmp(opt->defname, "index_cleanup") == 0)
@@ -300,7 +304,7 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
(skip_locked ? VACOPT_SKIP_LOCKED : 0) |
(analyze ? VACOPT_ANALYZE : 0) |
(freeze ? VACOPT_FREEZE : 0) |
- (full ? VACOPT_FULL : 0) |
+ (full ? (concurrent ? VACOPT_FULL_CONCURRENT : VACOPT_FULL_EXCLUSIVE) : 0) |
(disable_page_skipping ? VACOPT_DISABLE_PAGE_SKIPPING : 0) |
(process_main ? VACOPT_PROCESS_MAIN : 0) |
(process_toast ? VACOPT_PROCESS_TOAST : 0) |
@@ -380,6 +384,12 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
errmsg("ONLY_DATABASE_STATS cannot be specified with other VACUUM options")));
}
+ /* This invalid combination cannot be detected from params->options. */
+ if (concurrent && !full)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("CONCURRENTLY can only be specified with VACUUM FULL")));
+
/*
* All freeze ages are zero if the FREEZE option is given; otherwise pass
* them as -1 which means to use the default values.
@@ -483,6 +493,7 @@ vacuum(List *relations, VacuumParams *params, BufferAccessStrategy bstrategy,
const char *stmttype;
volatile bool in_outer_xact,
use_own_xacts;
+ bool whole_database = false;
Assert(params != NULL);
@@ -543,7 +554,15 @@ vacuum(List *relations, VacuumParams *params, BufferAccessStrategy bstrategy,
relations = newrels;
}
else
+ {
relations = get_all_vacuum_rels(vac_context, params->options);
+ /*
+ * If all tables should be processed, the CONCURRENTLY option implies
+ * that we should skip system relations rather than raising ERRORs.
+ */
+ if (params->options & VACOPT_FULL_CONCURRENT)
+ whole_database = true;
+ }
/*
* Decide whether we need to start/commit our own transactions.
@@ -619,7 +638,8 @@ vacuum(List *relations, VacuumParams *params, BufferAccessStrategy bstrategy,
if (params->options & VACOPT_VACUUM)
{
- if (!vacuum_rel(vrel->oid, vrel->relation, params, bstrategy))
+ if (!vacuum_rel(vrel->oid, vrel->relation, params, bstrategy,
+ isTopLevel, whole_database))
continue;
}
@@ -1932,10 +1952,14 @@ vac_truncate_clog(TransactionId frozenXID,
/*
* vacuum_rel() -- vacuum one heap relation
*
- * relid identifies the relation to vacuum. If relation is supplied,
- * use the name therein for reporting any failure to open/lock the rel;
- * do not use it once we've successfully opened the rel, since it might
- * be stale.
+ * relid identifies the relation to vacuum. If relation is supplied, use
+ * the name therein for reporting any failure to open/lock the rel; do
+ * not use it once we've successfully opened the rel, since it might be
+ * stale.
+ *
+ * If whole_database is true, we are processing all the relations of the
+ * current database. In that case we might need to silently skip
+ * relations which would otherwise cause an ERROR.
*
* Returns true if it's okay to proceed with a requested ANALYZE
* operation on this table.
@@ -1950,7 +1974,8 @@ vac_truncate_clog(TransactionId frozenXID,
*/
static bool
vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
- BufferAccessStrategy bstrategy)
+ BufferAccessStrategy bstrategy, bool isTopLevel,
+ bool whole_database)
{
LOCKMODE lmode;
Relation rel;
@@ -2013,10 +2038,11 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
/*
* Determine the type of lock we want --- hard exclusive lock for a FULL
- * vacuum, but just ShareUpdateExclusiveLock for concurrent vacuum. Either
- * way, we can be sure that no other backend is vacuuming the same table.
+ * exclusive vacuum, but a weaker lock (ShareUpdateExclusiveLock) for
+ * concurrent vacuum. Either way, we can be sure that no other backend is
+ * vacuuming the same table.
*/
- lmode = (params->options & VACOPT_FULL) ?
+ lmode = (params->options & VACOPT_FULL_EXCLUSIVE) ?
AccessExclusiveLock : ShareUpdateExclusiveLock;
/* open the relation and get the appropriate lock on it */
@@ -2031,6 +2057,39 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
return false;
}
+ /*
+ * Leave if the CONCURRENTLY option was passed but the relation is not
+ * suitable for it. Note that we only skip such relations if the user
+ * wants to vacuum the whole database. In contrast, if the user specified
+ * inappropriate relation(s) explicitly, the command ends up raising an
+ * ERROR.
+ */
+ if (whole_database && (params->options & VACOPT_FULL_CONCURRENT) &&
+ !check_relation_is_clusterable_concurrently(rel, DEBUG1,
+ "VACUUM (FULL, CONCURRENTLY)"))
+ {
+ relation_close(rel, lmode);
+ PopActiveSnapshot();
+ CommitTransactionCommand();
+ return false;
+ }
+
+ /*
+ * Skip the relation if VACUUM FULL / CLUSTER CONCURRENTLY is in progress
+ * as it will drop the current storage of the relation.
+ *
+ * This check should not take place until we have a lock that prevents
+ * another backend from starting VACUUM FULL / CLUSTER CONCURRENTLY later.
+ */
+ Assert(lmode >= LOCK_CLUSTER_CONCURRENT);
+ if (is_concurrent_cluster_in_progress(relid))
+ {
+ relation_close(rel, lmode);
+ PopActiveSnapshot();
+ CommitTransactionCommand();
+ return false;
+ }
+
/*
* When recursing to a TOAST table, check privileges on the parent. NB:
* This is only safe to do because we hold a session lock on the main
@@ -2104,19 +2163,6 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
return true;
}
- /*
- * Get a session-level lock too. This will protect our access to the
- * relation across multiple transactions, so that we can vacuum the
- * relation's TOAST table (if any) secure in the knowledge that no one is
- * deleting the parent relation.
- *
- * NOTE: this cannot block, even if someone else is waiting for access,
- * because the lock manager knows that both lock requests are from the
- * same process.
- */
- lockrelid = rel->rd_lockInfo.lockRelId;
- LockRelationIdForSession(&lockrelid, lmode);
-
/*
* Set index_cleanup option based on index_cleanup reloption if it wasn't
* specified in VACUUM command, or when running in an autovacuum worker
@@ -2169,6 +2215,30 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
else
toast_relid = InvalidOid;
+ /*
+ * Get a session-level lock too. This will protect our access to the
+ * relation across multiple transactions, so that we can vacuum the
+ * relation's TOAST table (if any) secure in the knowledge that no one is
+ * deleting the parent relation.
+ *
+ * NOTE: this cannot block, even if someone else is waiting for access,
+ * because the lock manager knows that both lock requests are from the
+ * same process.
+ */
+ if (OidIsValid(toast_relid))
+ {
+ /*
+ * You might worry that, in the VACUUM (FULL, CONCURRENTLY) case,
+ * cluster_rel() needs to release all the locks on the relation at
+ * some point, but this session lock makes it impossible. In fact,
+ * cluster_rel() will eventually be called for the TOAST relation
+ * and raise an ERROR because, in concurrent mode, it cannot process
+ * the TOAST relation alone anyway.
+ */
+ lockrelid = rel->rd_lockInfo.lockRelId;
+ LockRelationIdForSession(&lockrelid, lmode);
+ }
+
/*
* Switch to the table owner's userid, so that any index functions are run
* as that user. Also lock down security-restricted operations and
@@ -2196,11 +2266,22 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
{
ClusterParams cluster_params = {0};
+ /*
+ * Invalid toast_relid means that there is no session lock on the
+ * relation. Such a lock would be a problem because it would
+ * prevent cluster_rel() from releasing all locks when it tries to
+ * get AccessExclusiveLock.
+ */
+ Assert(!OidIsValid(toast_relid));
+
if ((params->options & VACOPT_VERBOSE) != 0)
cluster_params.options |= CLUOPT_VERBOSE;
+ if ((params->options & VACOPT_FULL_CONCURRENT) != 0)
+ cluster_params.options |= CLUOPT_CONCURRENT;
+
/* VACUUM FULL is now a variant of CLUSTER; see cluster.c */
- cluster_rel(rel, InvalidOid, &cluster_params);
+ cluster_rel(rel, InvalidOid, &cluster_params, isTopLevel);
/*
* cluster_rel() should have closed the relation, lock is kept
@@ -2249,13 +2330,15 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
toast_vacuum_params.options |= VACOPT_PROCESS_MAIN;
toast_vacuum_params.toast_parent = relid;
- vacuum_rel(toast_relid, NULL, &toast_vacuum_params, bstrategy);
+ vacuum_rel(toast_relid, NULL, &toast_vacuum_params, bstrategy,
+ isTopLevel, whole_database);
}
/*
* Now release the session-level lock on the main table.
*/
- UnlockRelationIdForSession(&lockrelid, lmode);
+ if (OidIsValid(toast_relid))
+ UnlockRelationIdForSession(&lockrelid, lmode);
/* Report that we really did it. */
return true;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index d687ceee33..066d96dea2 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -467,6 +467,57 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
TransactionId xid = XLogRecGetXid(buf->record);
SnapBuild *builder = ctx->snapshot_builder;
+ /*
+ * If the change is not intended for logical decoding, do not even
+ * establish a transaction for it. This is particularly important if the
+ * record was generated by CLUSTER CONCURRENTLY because this command uses
+ * the original XID when doing changes in the new storage. The decoding
+ * subsystem probably does not expect to see the same transaction multiple
+ * times.
+ */
+ switch (info)
+ {
+ case XLOG_HEAP_INSERT:
+ {
+ xl_heap_insert *rec;
+
+ rec = (xl_heap_insert *) XLogRecGetData(buf->record);
+ /*
+ * Ignore insert records without new tuples (this does happen when
+ * raw_heap_insert marks the TOAST record as HEAP_INSERT_NO_LOGICAL).
+ */
+ if ((rec->flags & XLH_INSERT_CONTAINS_NEW_TUPLE) == 0)
+ return;
+
+ break;
+ }
+
+ case XLOG_HEAP_HOT_UPDATE:
+ case XLOG_HEAP_UPDATE:
+ {
+ xl_heap_update *rec;
+
+ rec = (xl_heap_update *) XLogRecGetData(buf->record);
+ if ((rec->flags &
+ (XLH_UPDATE_CONTAINS_NEW_TUPLE |
+ XLH_UPDATE_CONTAINS_OLD_TUPLE |
+ XLH_UPDATE_CONTAINS_OLD_KEY)) == 0)
+ return;
+
+ break;
+ }
+
+ case XLOG_HEAP_DELETE:
+ {
+ xl_heap_delete *rec;
+
+ rec = (xl_heap_delete *) XLogRecGetData(buf->record);
+ if (rec->flags & XLH_DELETE_NO_LOGICAL)
+ return;
+ break;
+ }
+ }
+
ReorderBufferProcessXid(ctx->reorder, xid, buf->origptr);
/*
@@ -903,13 +954,6 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
xlrec = (xl_heap_insert *) XLogRecGetData(r);
- /*
- * Ignore insert records without new tuples (this does happen when
- * raw_heap_insert marks the TOAST record as HEAP_INSERT_NO_LOGICAL).
- */
- if (!(xlrec->flags & XLH_INSERT_CONTAINS_NEW_TUPLE))
- return;
-
/* only interested in our database */
XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
if (target_locator.dbOid != ctx->slot->data.database)
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index e37e22f441..ed15a0b175 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -286,7 +286,7 @@ static bool ExportInProgress = false;
static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
/* snapshot building/manipulation/distribution functions */
-static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
+static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder, XLogRecPtr lsn);
static void SnapBuildFreeSnapshot(Snapshot snap);
@@ -481,12 +481,17 @@ SnapBuildSnapDecRefcount(Snapshot snap)
* Build a new snapshot, based on currently committed catalog-modifying
* transactions.
*
+ * 'lsn' is the location of the commit record (of a catalog-changing
+ * transaction) that triggered creation of the snapshot. Pass
+ * InvalidXLogRecPtr for the transaction base snapshot, or if the user of
+ * the snapshot does not need the LSN.
+ *
* In-progress transactions with catalog access are *not* allowed to modify
* these snapshots; they have to copy them and fill in appropriate ->curcid
* and ->subxip/subxcnt values.
*/
static Snapshot
-SnapBuildBuildSnapshot(SnapBuild *builder)
+SnapBuildBuildSnapshot(SnapBuild *builder, XLogRecPtr lsn)
{
Snapshot snapshot;
Size ssize;
@@ -554,6 +559,7 @@ SnapBuildBuildSnapshot(SnapBuild *builder)
snapshot->active_count = 0;
snapshot->regd_count = 0;
snapshot->snapXactCompletionCount = 0;
+ snapshot->lsn = lsn;
return snapshot;
}
@@ -569,10 +575,7 @@ Snapshot
SnapBuildInitialSnapshot(SnapBuild *builder)
{
Snapshot snap;
- TransactionId xid;
TransactionId safeXid;
- TransactionId *newxip;
- int newxcnt = 0;
Assert(XactIsoLevel == XACT_REPEATABLE_READ);
Assert(builder->building_full_snapshot);
@@ -593,7 +596,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
if (TransactionIdIsValid(MyProc->xmin))
elog(ERROR, "cannot build an initial slot snapshot when MyProc->xmin already is valid");
- snap = SnapBuildBuildSnapshot(builder);
+ snap = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
/*
* We know that snap->xmin is alive, enforced by the logical xmin
@@ -614,6 +617,47 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
MyProc->xmin = snap->xmin;
+ /* Convert the historic snapshot to MVCC snapshot. */
+ return SnapBuildMVCCFromHistoric(snap, true);
+}
+
+/*
+ * Build an MVCC snapshot for the initial data load performed by CLUSTER
+ * CONCURRENTLY command.
+ *
+ * The snapshot will only be used to scan one particular relation, which is
+ * treated like a catalog (therefore ->building_full_snapshot is not
+ * important), and the caller should already have a replication slot set up
+ * (so we do not set MyProc->xmin). XXX Do we still need to add some
+ * restrictions?
+ */
+Snapshot
+SnapBuildInitialSnapshotForCluster(SnapBuild *builder)
+{
+ Snapshot snap;
+
+ Assert(builder->state == SNAPBUILD_CONSISTENT);
+
+ snap = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
+ return SnapBuildMVCCFromHistoric(snap, false);
+}
+
+/*
+ * Turn a historic MVCC snapshot into an ordinary MVCC snapshot.
+ *
+ * Pass true for 'in_place' if the source snapshot may be modified in
+ * place. Pass false if you need a new instance, allocated as a single
+ * chunk of memory, with the source snapshot left intact.
+ */
+Snapshot
+SnapBuildMVCCFromHistoric(Snapshot snapshot, bool in_place)
+{
+ TransactionId xid;
+ TransactionId *oldxip = snapshot->xip;
+ uint32 oldxcnt = snapshot->xcnt;
+ TransactionId *newxip;
+ int newxcnt = 0;
+ Snapshot result;
+
/* allocate in transaction context */
newxip = (TransactionId *)
palloc(sizeof(TransactionId) * GetMaxSnapshotXidCount());
@@ -624,7 +668,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
* classical snapshot by marking all non-committed transactions as
* in-progress. This can be expensive.
*/
- for (xid = snap->xmin; NormalTransactionIdPrecedes(xid, snap->xmax);)
+ for (xid = snapshot->xmin; NormalTransactionIdPrecedes(xid, snapshot->xmax);)
{
void *test;
@@ -632,7 +676,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
* Check whether transaction committed using the decoding snapshot
* meaning of ->xip.
*/
- test = bsearch(&xid, snap->xip, snap->xcnt,
+ test = bsearch(&xid, snapshot->xip, snapshot->xcnt,
sizeof(TransactionId), xidComparator);
if (test == NULL)
@@ -649,11 +693,22 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
}
/* adjust remaining snapshot fields as needed */
- snap->snapshot_type = SNAPSHOT_MVCC;
- snap->xcnt = newxcnt;
- snap->xip = newxip;
+ snapshot->xcnt = newxcnt;
+ snapshot->xip = newxip;
+
+ if (in_place)
+ result = snapshot;
+ else
+ {
+ result = CopySnapshot(snapshot);
+
+ /* Restore the original values so the source is intact. */
+ snapshot->xip = oldxip;
+ snapshot->xcnt = oldxcnt;
+ }
+ result->snapshot_type = SNAPSHOT_MVCC;
- return snap;
+ return result;
}
/*
@@ -712,7 +767,7 @@ SnapBuildGetOrBuildSnapshot(SnapBuild *builder)
/* only build a new snapshot if we don't have a prebuilt one */
if (builder->snapshot == NULL)
{
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
/* increase refcount for the snapshot builder */
SnapBuildSnapIncRefcount(builder->snapshot);
}
@@ -792,7 +847,7 @@ SnapBuildProcessChange(SnapBuild *builder, TransactionId xid, XLogRecPtr lsn)
/* only build a new snapshot if we don't have a prebuilt one */
if (builder->snapshot == NULL)
{
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, lsn);
/* increase refcount for the snapshot builder */
SnapBuildSnapIncRefcount(builder->snapshot);
}
@@ -1161,7 +1216,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
if (builder->snapshot)
SnapBuildSnapDecRefcount(builder->snapshot);
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, lsn);
/* we might need to execute invalidations, add snapshot */
if (!ReorderBufferXidHasBaseSnapshot(builder->reorder, xid))
@@ -1989,7 +2044,7 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
{
SnapBuildSnapDecRefcount(builder->snapshot);
}
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
SnapBuildSnapIncRefcount(builder->snapshot);
ReorderBufferSetRestartPoint(builder->reorder, lsn);
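
For reviewers: the core trick in SnapBuildMVCCFromHistoric() is that a historic
snapshot stores *committed* XIDs in ->xip, while an MVCC snapshot stores
*in-progress* XIDs there, so the loop walks [xmin, xmax) and emits every XID
not found (via bsearch) in the committed array. A standalone sketch of that
inversion, using simplified types rather than the actual PostgreSQL structures
(no subxact handling, no NormalTransactionIdPrecedes wraparound logic):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t Xid;

static int
xid_cmp(const void *a, const void *b)
{
	Xid		xa = *(const Xid *) a;
	Xid		xb = *(const Xid *) b;

	return (xa > xb) - (xa < xb);
}

/*
 * Invert a "historic" xip array (committed XIDs, sorted) into an "MVCC" xip
 * array (in-progress XIDs): every xid in [xmin, xmax) that is NOT in the
 * committed list is considered in-progress.  Returns the new count.
 */
static int
historic_to_mvcc(Xid xmin, Xid xmax,
				 const Xid *committed, int ccnt,
				 Xid *inprogress)
{
	int			n = 0;

	for (Xid xid = xmin; xid < xmax; xid++)
	{
		if (bsearch(&xid, committed, ccnt, sizeof(Xid), xid_cmp) == NULL)
			inprogress[n++] = xid;	/* not committed => in progress */
	}
	return n;
}
```

As the patch comment notes, this can be expensive for a wide [xmin, xmax)
range, which is why it is done once per snapshot rather than per tuple.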
diff --git a/src/backend/replication/pgoutput_cluster/Makefile b/src/backend/replication/pgoutput_cluster/Makefile
new file mode 100644
index 0000000000..31471bb546
--- /dev/null
+++ b/src/backend/replication/pgoutput_cluster/Makefile
@@ -0,0 +1,32 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+# Makefile for src/backend/replication/pgoutput_cluster
+#
+# IDENTIFICATION
+# src/backend/replication/pgoutput_cluster
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/replication/pgoutput_cluster
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+ $(WIN32RES) \
+ pgoutput_cluster.o
+PGFILEDESC = "pgoutput_cluster - logical replication output plugin for CLUSTER command"
+NAME = pgoutput_cluster
+
+all: all-shared-lib
+
+include $(top_srcdir)/src/Makefile.shlib
+
+install: all installdirs install-lib
+
+installdirs: installdirs-lib
+
+uninstall: uninstall-lib
+
+clean distclean: clean-lib
+ rm -f $(OBJS)
diff --git a/src/backend/replication/pgoutput_cluster/meson.build b/src/backend/replication/pgoutput_cluster/meson.build
new file mode 100644
index 0000000000..0f033064f2
--- /dev/null
+++ b/src/backend/replication/pgoutput_cluster/meson.build
@@ -0,0 +1,18 @@
+# Copyright (c) 2022-2024, PostgreSQL Global Development Group
+
+pgoutput_cluster_sources = files(
+ 'pgoutput_cluster.c',
+)
+
+if host_system == 'windows'
+ pgoutput_cluster_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pgoutput_cluster',
+ '--FILEDESC', 'pgoutput_cluster - logical replication output plugin for CLUSTER command',])
+endif
+
+pgoutput_cluster = shared_module('pgoutput_cluster',
+ pgoutput_cluster_sources,
+ kwargs: pg_mod_args,
+)
+
+backend_targets += pgoutput_cluster
diff --git a/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c b/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
new file mode 100644
index 0000000000..9fe44017a8
--- /dev/null
+++ b/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
@@ -0,0 +1,321 @@
+/* TODO Move into src/backend/cluster/ (and rename?) */
+/*-------------------------------------------------------------------------
+ *
+ * pgoutput_cluster.c
+ * Logical Replication output plugin for CLUSTER command
+ *
+ * Copyright (c) 2012-2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heaptoast.h"
+#include "commands/cluster.h"
+#include "replication/snapbuild.h"
+
+PG_MODULE_MAGIC;
+
+static void plugin_startup(LogicalDecodingContext *ctx,
+ OutputPluginOptions *opt, bool is_init);
+static void plugin_shutdown(LogicalDecodingContext *ctx);
+static void plugin_begin_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn);
+static void plugin_commit_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ Relation rel, ReorderBufferChange *change);
+static void plugin_truncate(struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, int nrelations,
+ Relation relations[],
+ ReorderBufferChange *change);
+static void store_change(LogicalDecodingContext *ctx,
+ ConcurrentChangeKind kind, HeapTuple tuple,
+ TransactionId xid);
+
+void
+_PG_output_plugin_init(OutputPluginCallbacks *cb)
+{
+ AssertVariableIsOfType(&_PG_output_plugin_init, LogicalOutputPluginInit);
+
+ cb->startup_cb = plugin_startup;
+ cb->begin_cb = plugin_begin_txn;
+ cb->change_cb = plugin_change;
+ cb->truncate_cb = plugin_truncate;
+ cb->commit_cb = plugin_commit_txn;
+ cb->shutdown_cb = plugin_shutdown;
+}
+
+
+/* initialize this plugin */
+static void
+plugin_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
+ bool is_init)
+{
+ ctx->output_plugin_private = NULL;
+
+ /* Probably unnecessary, as we don't use the SQL interface ... */
+ opt->output_type = OUTPUT_PLUGIN_BINARY_OUTPUT;
+
+ if (ctx->output_plugin_options != NIL)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("this plugin does not accept any options")));
+ }
+}
+
+static void
+plugin_shutdown(LogicalDecodingContext *ctx)
+{
+}
+
+/*
+ * As we don't release the slot during the processing of a particular
+ * table, there's no room for an SQL interface, even for debugging
+ * purposes. Therefore we need
+ * neither OutputPluginPrepareWrite() nor OutputPluginWrite() in the plugin
+ * callbacks. (Although we might want to write custom callbacks, this API
+ * seems to be unnecessarily generic for our purposes.)
+ */
+
+/* BEGIN callback */
+static void
+plugin_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+}
+
+/* COMMIT callback */
+static void
+plugin_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+}
+
+/*
+ * Callback for individual changed tuples
+ */
+static void
+plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ Relation relation, ReorderBufferChange *change)
+{
+ ClusterDecodingState *dstate;
+ Snapshot snapshot;
+
+ dstate = (ClusterDecodingState *) ctx->output_writer_private;
+
+ /* Only interested in one particular relation. */
+ if (relation->rd_id != dstate->relid)
+ return;
+
+ /*
+ * Catalog snapshot is fine because the table we are processing is
+ * temporarily considered a user catalog table.
+ */
+ snapshot = GetCatalogSnapshot(InvalidOid);
+ Assert(snapshot->snapshot_type == SNAPSHOT_HISTORIC_MVCC);
+ Assert(!snapshot->suboverflowed);
+
+ /*
+ * This should not happen, but if we don't have enough information to
+ * apply a new snapshot, the consequences would be bad. Thus prefer ERROR
+ * to Assert().
+ */
+ if (XLogRecPtrIsInvalid(snapshot->lsn))
+ ereport(ERROR, (errmsg("snapshot has invalid LSN")));
+
+ /*
+ * reorderbuffer.c changes the catalog snapshot as soon as it sees a new
+ * CID or a commit record of a catalog-changing transaction.
+ */
+ if (dstate->snapshot == NULL || snapshot->lsn != dstate->snapshot_lsn ||
+ snapshot->curcid != dstate->snapshot->curcid)
+ {
+ /* CID should not go backwards. */
+ Assert(dstate->snapshot == NULL ||
+ snapshot->curcid >= dstate->snapshot->curcid);
+
+ /*
+ * XXX Is it a problem that the copy is created in
+ * TopTransactionContext?
+ */
+ dstate->snapshot = SnapBuildMVCCFromHistoric(snapshot, false);
+ dstate->snapshot_lsn = snapshot->lsn;
+ }
+
+ /* Decode entry depending on its type */
+ switch (change->action)
+ {
+ case REORDER_BUFFER_CHANGE_INSERT:
+ {
+ HeapTuple newtuple;
+
newtuple = change->data.tp.newtuple;
+
+ /*
+ * Identity checks in the main function should have made this
+ * impossible.
+ */
+ if (newtuple == NULL)
elog(ERROR, "incomplete insert info");
+
+ store_change(ctx, CHANGE_INSERT, newtuple, change->txn->xid);
+ }
+ break;
+ case REORDER_BUFFER_CHANGE_UPDATE:
+ {
+ HeapTuple oldtuple,
+ newtuple;
+
oldtuple = change->data.tp.oldtuple;
newtuple = change->data.tp.newtuple;
+
+ if (newtuple == NULL)
elog(ERROR, "incomplete update info");
+
+ if (oldtuple != NULL)
+ store_change(ctx, CHANGE_UPDATE_OLD, oldtuple,
+ change->txn->xid);
+
+ store_change(ctx, CHANGE_UPDATE_NEW, newtuple,
+ change->txn->xid);
+ }
+ break;
+ case REORDER_BUFFER_CHANGE_DELETE:
+ {
+ HeapTuple oldtuple;
+
oldtuple = change->data.tp.oldtuple;
+
+ if (oldtuple == NULL)
elog(ERROR, "incomplete delete info");
+
+ store_change(ctx, CHANGE_DELETE, oldtuple, change->txn->xid);
+ }
+ break;
+ default:
+ /* Should not come here */
+ Assert(false);
+ break;
+ }
+}
+
+static void
+plugin_truncate(struct LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ int nrelations, Relation relations[],
+ ReorderBufferChange *change)
+{
+ ClusterDecodingState *dstate;
+ int i;
+ Relation relation = NULL;
+
+ dstate = (ClusterDecodingState *) ctx->output_writer_private;
+
+ /* Find the relation we are processing. */
+ for (i = 0; i < nrelations; i++)
+ {
+ relation = relations[i];
+
+ if (RelationGetRelid(relation) == dstate->relid)
+ break;
+ }
+
+ /* Does the truncation affect only other relations? */
+ if (i == nrelations)
+ return;
+
+ store_change(ctx, CHANGE_TRUNCATE, NULL, InvalidTransactionId);
+}
+
+/* Store concurrent data change. */
+static void
+store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
+ HeapTuple tuple, TransactionId xid)
+{
+ ClusterDecodingState *dstate;
+ char *change_raw;
+ ConcurrentChange *change;
+ bool flattened = false;
+ Size size;
+ Datum values[1];
+ bool isnull[1];
+ char *dst;
+
+ dstate = (ClusterDecodingState *) ctx->output_writer_private;
+
+ size = MAXALIGN(VARHDRSZ) + sizeof(ConcurrentChange);
+
+ if (tuple)
+ {
+ /*
+ * ReorderBufferCommit() stores the TOAST chunks in its private memory
+ * context and frees them after having called
+ * apply_change(). Therefore we need a flat copy (including TOAST) that
+ * we eventually copy into the memory context which is available to
+ * decode_concurrent_changes().
+ */
+ if (HeapTupleHasExternal(tuple))
+ {
+ /*
+ * toast_flatten_tuple_to_datum() might be more convenient but we
+ * don't want the decompression it does.
+ */
+ tuple = toast_flatten_tuple(tuple, dstate->tupdesc);
+ flattened = true;
+ }
+
+ size += tuple->t_len;
+ }
+
+ /* XXX Isn't there any function / macro to do this? */
+ if (size >= 0x3FFFFFFF)
+ elog(ERROR, "change is too big");
+
+ /* Construct the change. */
+ change_raw = (char *) palloc0(size);
+ SET_VARSIZE(change_raw, size);
+ change = (ConcurrentChange *) VARDATA(change_raw);
+ change->kind = kind;
+
+ /* No other information is needed for TRUNCATE. */
+ if (change->kind == CHANGE_TRUNCATE)
+ goto store;
+
+ /*
+ * Copy the tuple.
+ *
+ * CAUTION: change->tup_data.t_data must be fixed on retrieval!
+ */
+ memcpy(&change->tup_data, tuple, sizeof(HeapTupleData));
+ dst = (char *) change + sizeof(ConcurrentChange);
+ memcpy(dst, tuple->t_data, tuple->t_len);
+
+ /* Initialize the other fields. */
+ change->xid = xid;
+ change->snapshot = dstate->snapshot;
+ dstate->snapshot->active_count++;
+
+ /* The data has been copied. */
+ if (flattened)
+ pfree(tuple);
+
+store:
+ /* Store as tuple of 1 bytea column. */
+ values[0] = PointerGetDatum(change_raw);
+ isnull[0] = false;
+ tuplestore_putvalues(dstate->tstore, dstate->tupdesc_change,
+ values, isnull);
+
+ /* Accounting. */
+ dstate->nchanges++;
+
+ /* Cleanup. */
+ pfree(change_raw);
+}
+
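
For reviewers: store_change() serializes each change as one bytea value whose
payload is a ConcurrentChange header followed by the raw tuple bytes, and the
CAUTION comment says tup_data.t_data must be re-pointed on retrieval because
the original pointer is stale once the struct is flat-copied. A simplified
standalone sketch of that pack/fix-up pattern (SketchChange and its fields are
stand-ins, not the patch's actual layout, and the varlena header is omitted):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAXIMUM_ALIGNOF 8
#define MAXALIGN(LEN) (((LEN) + (MAXIMUM_ALIGNOF - 1)) & ~(MAXIMUM_ALIGNOF - 1))

/* Simplified stand-in for ConcurrentChange: kind + embedded tuple header. */
typedef struct SketchChange
{
	int			kind;
	size_t		t_len;			/* plays the role of tup_data.t_len */
	char	   *t_data;			/* must be re-pointed after retrieval */
} SketchChange;

/* Pack kind + payload into one flat buffer; payload follows the struct. */
static char *
pack_change(int kind, const char *payload, size_t len)
{
	char	   *raw = calloc(1, MAXALIGN(sizeof(SketchChange)) + len);
	SketchChange *ch = (SketchChange *) raw;

	ch->kind = kind;
	ch->t_len = len;
	ch->t_data = NULL;			/* stale once flat-copied; fixed on retrieval */
	memcpy(raw + MAXALIGN(sizeof(SketchChange)), payload, len);
	return raw;
}

/* On retrieval, fix the payload pointer to point into the flat copy. */
static SketchChange *
unpack_change(char *raw)
{
	SketchChange *ch = (SketchChange *) raw;

	ch->t_data = raw + MAXALIGN(sizeof(SketchChange));
	return ch;
}
```

Flattening everything into one bytea is what lets the tuplestore spill the
change queue to disk transparently, as the ClusterDecodingState comment
explains.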
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2100150f01..a84de0611a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
#include "access/xlogprefetcher.h"
#include "access/xlogrecovery.h"
#include "commands/async.h"
+#include "commands/cluster.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/autovacuum.h"
@@ -152,6 +153,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, WaitEventCustomShmemSize());
size = add_size(size, InjectionPointShmemSize());
size = add_size(size, SlotSyncShmemSize());
+ size = add_size(size, ClusterShmemSize());
#ifdef EXEC_BACKEND
size = add_size(size, ShmemBackendArraySize());
#endif
@@ -357,6 +359,7 @@ CreateOrAttachShmemStructs(void)
StatsShmemInit();
WaitEventCustomShmemInit();
InjectionPointShmemInit();
+ ClusterShmemInit();
}
/*
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index fa66b8017e..a6dda9b520 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -1299,6 +1299,17 @@ ProcessUtilitySlow(ParseState *pstate,
lockmode = AlterTableGetLockLevel(atstmt->cmds);
relid = AlterTableLookupRelation(atstmt, lockmode);
+ /*
+ * If lockmode allows, check if VACUUM FULL / CLUSTER
+ * CONCURRENTLY is in progress. If lockmode is too weak,
+ * cluster_rel() should detect incompatible DDLs executed
+ * by us.
+ *
+ * XXX We might skip the changes for DDLs which do not
+ * change the tuple descriptor.
+ */
+ check_for_concurrent_cluster(relid, lockmode);
+
if (OidIsValid(relid))
{
AlterTableUtilityContext atcontext;
diff --git a/src/backend/utils/activity/backend_progress.c b/src/backend/utils/activity/backend_progress.c
index e7c8bfba94..c52ec92a97 100644
--- a/src/backend/utils/activity/backend_progress.c
+++ b/src/backend/utils/activity/backend_progress.c
@@ -163,3 +163,19 @@ pgstat_progress_end_command(void)
beentry->st_progress.command_target = InvalidOid;
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
+
+void
+pgstat_progress_restore_state(PgBackendProgress *backup)
+{
+ volatile PgBackendStatus *beentry = MyBEEntry;
+
+ if (!beentry || !pgstat_track_activities)
+ return;
+
+ PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
+ beentry->st_progress.command = backup->command;
+ beentry->st_progress.command_target = backup->command_target;
+ memcpy(MyBEEntry->st_progress.param, backup->param,
+ sizeof(beentry->st_progress.param));
+ PGSTAT_END_WRITE_ACTIVITY(beentry);
+}
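
For reviewers: pgstat_progress_restore_state() is the restore half of a
save/restore pair, needed because the concurrent processing runs a nested
command that clobbers the backend's single progress slot. The pattern, reduced
to a standalone sketch (Progress and the globals here are hypothetical
stand-ins for PgBackendProgress and the shared beentry, and the
PGSTAT_*_WRITE_ACTIVITY protocol is omitted):

```c
#include <assert.h>
#include <string.h>

#define N_PARAM 4

typedef struct Progress
{
	int			command;
	long		param[N_PARAM];
} Progress;

static Progress current;		/* stands in for the shared backend entry */

/* Snapshot the live progress state before a nested command overwrites it. */
static void
progress_save(Progress *backup)
{
	memcpy(backup, &current, sizeof(Progress));
}

/* Put the saved state back, as pgstat_progress_restore_state() does. */
static void
progress_restore(const Progress *backup)
{
	memcpy(&current, backup, sizeof(Progress));
}
```

One design note on the patch itself: the new function copies from
`backup->param` into `MyBEEntry->st_progress.param` while the rest of the
function uses the `beentry` local; using `beentry` there too would be more
consistent, though behavior is the same.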
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index db37beeaae..8245be7846 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -345,6 +345,7 @@ WALSummarizer "Waiting to read or update WAL summarization state."
DSMRegistry "Waiting to read or update the dynamic shared memory registry."
InjectionPoint "Waiting to read or update information related to injection points."
SerialControl "Waiting to read or update shared <filename>pg_serial</filename> state."
+ClusteredRels "Waiting to read or update information on tables being clustered concurrently."
#
# END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 603aa4157b..5a2d5d6138 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -1373,6 +1373,28 @@ CacheInvalidateRelcache(Relation relation)
RegisterRelcacheInvalidation(databaseId, relationId);
}
+/*
+ * CacheInvalidateRelcacheImmediate
+ * Send invalidation message for the specified relation's relcache entry.
+ *
+ * Currently this is used in VACUUM FULL/CLUSTER CONCURRENTLY, to make sure
+ * that other backends are aware that the command is being executed for the
+ * relation.
+ */
+void
+CacheInvalidateRelcacheImmediate(Relation relation)
+{
+ SharedInvalidationMessage msg;
+
+ msg.rc.id = SHAREDINVALRELCACHE_ID;
+ msg.rc.dbId = MyDatabaseId;
+ msg.rc.relId = RelationGetRelid(relation);
+ /* check AddCatcacheInvalidationMessage() for an explanation */
+ VALGRIND_MAKE_MEM_DEFINED(&msg, sizeof(msg));
+
+ SendSharedInvalidMessages(&msg, 1);
+}
+
/*
* CacheInvalidateRelcacheAll
* Register invalidation of the whole relcache at the end of command.
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 66ed24e401..708d1ee27a 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -64,6 +64,7 @@
#include "catalog/pg_type.h"
#include "catalog/schemapg.h"
#include "catalog/storage.h"
+#include "commands/cluster.h"
#include "commands/policy.h"
#include "commands/publicationcmds.h"
#include "commands/trigger.h"
@@ -1257,6 +1258,10 @@ retry:
/* make sure relation is marked as having no open file yet */
relation->rd_smgr = NULL;
+ /* Is CLUSTER CONCURRENTLY in progress? */
+ relation->rd_cluster_concurrent =
+ is_concurrent_cluster_in_progress(targetRelId);
+
/*
* now we can free the memory allocated for pg_class_tuple
*/
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 7d2b34d4f2..6be0fef84c 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -155,9 +155,7 @@ typedef struct ExportedSnapshot
static List *exportedSnapshots = NIL;
/* Prototypes for local functions */
-static Snapshot CopySnapshot(Snapshot snapshot);
static void UnregisterSnapshotNoOwner(Snapshot snapshot);
-static void FreeSnapshot(Snapshot snapshot);
static void SnapshotResetXmin(void);
/* ResourceOwner callbacks to track snapshot references */
@@ -570,7 +568,7 @@ SetTransactionSnapshot(Snapshot sourcesnap, VirtualTransactionId *sourcevxid,
* The copy is palloc'd in TopTransactionContext and has initial refcounts set
* to 0. The returned snapshot has the copied flag set.
*/
-static Snapshot
+Snapshot
CopySnapshot(Snapshot snapshot)
{
Snapshot newsnap;
@@ -626,7 +624,7 @@ CopySnapshot(Snapshot snapshot)
* FreeSnapshot
* Free the memory associated with a snapshot.
*/
-static void
+void
FreeSnapshot(Snapshot snapshot)
{
Assert(snapshot->regd_count == 0);
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index d453e224d9..6cab6ed5ee 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2787,7 +2787,7 @@ psql_completion(const char *text, int start, int end)
* one word, so the above test is correct.
*/
if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
- COMPLETE_WITH("VERBOSE");
+ COMPLETE_WITH("VERBOSE", "CONCURRENTLY");
}
/* COMMENT */
@@ -4764,7 +4764,8 @@ psql_completion(const char *text, int start, int end)
"DISABLE_PAGE_SKIPPING", "SKIP_LOCKED",
"INDEX_CLEANUP", "PROCESS_MAIN", "PROCESS_TOAST",
"TRUNCATE", "PARALLEL", "SKIP_DATABASE_STATS",
- "ONLY_DATABASE_STATS", "BUFFER_USAGE_LIMIT");
+ "ONLY_DATABASE_STATS", "BUFFER_USAGE_LIMIT",
+ "CONCURRENTLY");
else if (TailMatches("FULL|FREEZE|ANALYZE|VERBOSE|DISABLE_PAGE_SKIPPING|SKIP_LOCKED|PROCESS_MAIN|PROCESS_TOAST|TRUNCATE|SKIP_DATABASE_STATS|ONLY_DATABASE_STATS"))
COMPLETE_WITH("ON", "OFF");
else if (TailMatches("INDEX_CLEANUP"))
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 9e9aec88a6..e87eb2f861 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -316,21 +316,24 @@ extern BulkInsertState GetBulkInsertState(void);
extern void FreeBulkInsertState(BulkInsertState);
extern void ReleaseBulkInsertStatePin(BulkInsertState bistate);
-extern void heap_insert(Relation relation, HeapTuple tup, CommandId cid,
- int options, BulkInsertState bistate);
+extern void heap_insert(Relation relation, HeapTuple tup, TransactionId xid,
+ CommandId cid, int options, BulkInsertState bistate);
extern void heap_multi_insert(Relation relation, struct TupleTableSlot **slots,
int ntuples, CommandId cid, int options,
BulkInsertState bistate);
extern TM_Result heap_delete(Relation relation, ItemPointer tid,
- CommandId cid, Snapshot crosscheck, bool wait,
- struct TM_FailureData *tmfd, bool changingPart);
+ TransactionId xid, CommandId cid,
+ Snapshot crosscheck, bool wait,
+ struct TM_FailureData *tmfd, bool changingPart,
+ bool wal_logical);
extern void heap_finish_speculative(Relation relation, ItemPointer tid);
extern void heap_abort_speculative(Relation relation, ItemPointer tid);
extern TM_Result heap_update(Relation relation, ItemPointer otid,
- HeapTuple newtup,
+ HeapTuple newtup, TransactionId xid,
CommandId cid, Snapshot crosscheck, bool wait,
struct TM_FailureData *tmfd, LockTupleMode *lockmode,
- TU_UpdateIndexes *update_indexes);
+ TU_UpdateIndexes *update_indexes,
+ bool wal_logical);
extern TM_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
CommandId cid, LockTupleMode mode, LockWaitPolicy wait_policy,
bool follow_updates,
@@ -405,6 +408,10 @@ extern HTSV_Result HeapTupleSatisfiesVacuumHorizon(HeapTuple htup, Buffer buffer
TransactionId *dead_after);
extern void HeapTupleSetHintBits(HeapTupleHeader tuple, Buffer buffer,
uint16 infomask, TransactionId xid);
+extern bool HeapTupleMVCCInserted(HeapTuple htup, Snapshot snapshot,
+ Buffer buffer);
+extern bool HeapTupleMVCCNotDeleted(HeapTuple htup, Snapshot snapshot,
+ Buffer buffer);
extern bool HeapTupleHeaderIsOnlyLocked(HeapTupleHeader tuple);
extern bool HeapTupleIsSurelyDead(HeapTuple htup,
struct GlobalVisState *vistest);
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 42736f37e7..1c5cb7c728 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -103,6 +103,8 @@
#define XLH_DELETE_CONTAINS_OLD_KEY (1<<2)
#define XLH_DELETE_IS_SUPER (1<<3)
#define XLH_DELETE_IS_PARTITION_MOVE (1<<4)
+/* See heap_delete() */
+#define XLH_DELETE_NO_LOGICAL (1<<5)
/* convenience macro for checking whether any form of old tuple was logged */
#define XLH_DELETE_CONTAINS_OLD \
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index da661289c1..1380ba81fc 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,6 +21,7 @@
#include "access/sdir.h"
#include "access/xact.h"
#include "executor/tuptable.h"
+#include "replication/logical.h"
#include "storage/read_stream.h"
#include "utils/rel.h"
#include "utils/snapshot.h"
@@ -630,6 +631,8 @@ typedef struct TableAmRoutine
Relation OldIndex,
bool use_sort,
TransactionId OldestXmin,
+ Snapshot snapshot,
+ LogicalDecodingContext *decoding_ctx,
TransactionId *xid_cutoff,
MultiXactId *multi_cutoff,
double *num_tuples,
@@ -1667,6 +1670,10 @@ table_relation_copy_data(Relation rel, const RelFileLocator *newrlocator)
* not needed for the relation's AM
* - *xid_cutoff - ditto
* - *multi_cutoff - ditto
+ * - snapshot - if != NULL, ignore data changes done by transactions that this
+ * (MVCC) snapshot considers still in-progress or in the future.
+ * - decoding_ctx - logical decoding context, to capture concurrent data
+ * changes.
*
* Output parameters:
* - *xid_cutoff - rel's new relfrozenxid value, may be invalid
@@ -1679,6 +1686,8 @@ table_relation_copy_for_cluster(Relation OldTable, Relation NewTable,
Relation OldIndex,
bool use_sort,
TransactionId OldestXmin,
+ Snapshot snapshot,
+ LogicalDecodingContext *decoding_ctx,
TransactionId *xid_cutoff,
MultiXactId *multi_cutoff,
double *num_tuples,
@@ -1687,6 +1696,7 @@ table_relation_copy_for_cluster(Relation OldTable, Relation NewTable,
{
OldTable->rd_tableam->relation_copy_for_cluster(OldTable, NewTable, OldIndex,
use_sort, OldestXmin,
+ snapshot, decoding_ctx,
xid_cutoff, multi_cutoff,
num_tuples, tups_vacuumed,
tups_recently_dead);
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 6d4439f052..e0016631f6 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -476,6 +476,8 @@ extern Size EstimateTransactionStateSpace(void);
extern void SerializeTransactionState(Size maxsize, char *start_address);
extern void StartParallelWorkerTransaction(char *tstatespace);
extern void EndParallelWorkerTransaction(void);
+extern void SetClusterCurrentXids(TransactionId *xip, int xcnt);
+extern void ResetClusterCurrentXids(void);
extern bool IsTransactionBlock(void);
extern bool IsTransactionOrTransactionBlock(void);
extern char TransactionBlockStatusCode(void);
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 7d434f8e65..77d522561b 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -99,6 +99,9 @@ extern Oid index_concurrently_create_copy(Relation heapRelation,
Oid tablespaceOid,
const char *newName);
+extern NullableDatum *get_index_stattargets(Oid indexid,
+ IndexInfo *indInfo);
+
extern void index_concurrently_build(Oid heapRelationId,
Oid indexRelationId);
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 7492796ea2..f98b855f21 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -13,10 +13,15 @@
#ifndef CLUSTER_H
#define CLUSTER_H
+#include "nodes/execnodes.h"
#include "nodes/parsenodes.h"
#include "parser/parse_node.h"
+#include "replication/logical.h"
#include "storage/lock.h"
+#include "storage/relfilelocator.h"
#include "utils/relcache.h"
+#include "utils/resowner.h"
+#include "utils/tuplestore.h"
/* flag bits for ClusterParams->options */
@@ -24,6 +29,7 @@
#define CLUOPT_RECHECK 0x02 /* recheck relation state */
#define CLUOPT_RECHECK_ISCLUSTERED 0x04 /* recheck relation state for
* indisclustered */
+#define CLUOPT_CONCURRENT 0x08 /* allow concurrent data changes */
/* options for CLUSTER */
typedef struct ClusterParams
@@ -31,12 +37,114 @@ typedef struct ClusterParams
bits32 options; /* bitmask of CLUOPT_* */
} ClusterParams;
+/*
+ * The following definitions are used for concurrent processing.
+ */
+
+/*
+ * Lock level for the concurrent variant of CLUSTER / VACUUM FULL.
+ *
+ * Like for lazy VACUUM, we choose the strongest lock that still allows
+ * INSERT, UPDATE and DELETE.
+ *
+ * Note that the lock needs to be released temporarily a few times during the
+ * processing. In such cases it should be checked after re-locking that the
+ * relation / index hasn't changed in the system catalog while the lock was
+ * not held.
+ */
+#define LOCK_CLUSTER_CONCURRENT ShareUpdateExclusiveLock
+
+typedef enum
+{
+ CHANGE_INSERT,
+ CHANGE_UPDATE_OLD,
+ CHANGE_UPDATE_NEW,
+ CHANGE_DELETE,
+ CHANGE_TRUNCATE
+} ConcurrentChangeKind;
+
+typedef struct ConcurrentChange
+{
+ /* See the enum above. */
+ ConcurrentChangeKind kind;
+
+ /* Transaction that changes the data. */
+ TransactionId xid;
+
+ /*
+ * Historic catalog snapshot that was used to decode this change.
+ */
+ Snapshot snapshot;
+
+ /*
+ * The actual tuple.
+ *
+ * The tuple data follows the ConcurrentChange structure. Before use make
+ * sure the tuple is correctly aligned (ConcurrentChange can be stored as
+ * bytea) and that tuple->t_data is fixed.
+ */
+ HeapTupleData tup_data;
+} ConcurrentChange;
+
+/*
+ * Logical decoding state.
+ *
+ * Here we store the data changes that we decode from WAL while the table
+ * contents is being copied to a new storage. Also the necessary metadata
+ * needed to apply these changes to the table is stored here.
+ */
+typedef struct ClusterDecodingState
+{
+ /* The relation whose changes we're decoding. */
+ Oid relid;
+
+ /*
+ * Decoded changes are stored here. Although we try to avoid excessively
+ * large batches, it can happen that the changes need to be spilled to
+ * disk. The
+ * tuplestore does this transparently.
+ */
+ Tuplestorestate *tstore;
+
+ /* The current number of changes in tstore. */
+ double nchanges;
+
+ /*
+ * Descriptor to store the ConcurrentChange structure serialized (bytea).
+ * We can't store the tuple directly because tuplestore only supports
+ * minimal tuples and we may need to transfer the OID system column from the
+ * output plugin. Also we need to transfer the change kind, so it's better
+ * to put everything in the structure than to use 2 tuplestores "in
+ * parallel".
+ */
+ TupleDesc tupdesc_change;
+
+ /* Tuple descriptor needed to update indexes. */
+ TupleDesc tupdesc;
+
+ /* Slot to retrieve data from tstore. */
+ TupleTableSlot *tsslot;
+
+ /*
+ * Historic catalog snapshot that was used to decode the most recent
+ * change.
+ */
+ Snapshot snapshot;
+ /* LSN of the record */
+ XLogRecPtr snapshot_lsn;
+
+ ResourceOwner resowner;
+} ClusterDecodingState;
+
extern void cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel);
-extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params);
+extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
+ bool isTopLevel);
extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid,
LOCKMODE lockmode);
extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
-
+extern bool check_relation_is_clusterable_concurrently(Relation rel, int elevel,
+ const char *stmt);
+extern void cluster_decode_concurrent_changes(LogicalDecodingContext *ctx,
+ XLogRecPtr end_of_wal);
extern Oid make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
char relpersistence, LOCKMODE lockmode_old,
LOCKMODE *lockmode_new_p);
@@ -45,8 +153,13 @@ extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
bool swap_toast_by_content,
bool check_constraints,
bool is_internal,
+ bool reindex,
TransactionId frozenXid,
MultiXactId cutoffMulti,
char newrelpersistence);
+extern Size ClusterShmemSize(void);
+extern void ClusterShmemInit(void);
+extern bool is_concurrent_cluster_in_progress(Oid relid);
+extern void check_for_concurrent_cluster(Oid relid, LOCKMODE lockmode);
#endif /* CLUSTER_H */
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 5616d64523..03e3712ede 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -59,19 +59,22 @@
#define PROGRESS_CLUSTER_PHASE 1
#define PROGRESS_CLUSTER_INDEX_RELID 2
#define PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED 3
-#define PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN 4
-#define PROGRESS_CLUSTER_TOTAL_HEAP_BLKS 5
-#define PROGRESS_CLUSTER_HEAP_BLKS_SCANNED 6
-#define PROGRESS_CLUSTER_INDEX_REBUILD_COUNT 7
+#define PROGRESS_CLUSTER_HEAP_TUPLES_INSERTED 4
+#define PROGRESS_CLUSTER_HEAP_TUPLES_UPDATED 5
+#define PROGRESS_CLUSTER_HEAP_TUPLES_DELETED 6
+#define PROGRESS_CLUSTER_TOTAL_HEAP_BLKS 7
+#define PROGRESS_CLUSTER_HEAP_BLKS_SCANNED 8
+#define PROGRESS_CLUSTER_INDEX_REBUILD_COUNT 9
/* Phases of cluster (as advertised via PROGRESS_CLUSTER_PHASE) */
#define PROGRESS_CLUSTER_PHASE_SEQ_SCAN_HEAP 1
#define PROGRESS_CLUSTER_PHASE_INDEX_SCAN_HEAP 2
#define PROGRESS_CLUSTER_PHASE_SORT_TUPLES 3
#define PROGRESS_CLUSTER_PHASE_WRITE_NEW_HEAP 4
-#define PROGRESS_CLUSTER_PHASE_SWAP_REL_FILES 5
-#define PROGRESS_CLUSTER_PHASE_REBUILD_INDEX 6
-#define PROGRESS_CLUSTER_PHASE_FINAL_CLEANUP 7
+#define PROGRESS_CLUSTER_PHASE_CATCH_UP 5
+#define PROGRESS_CLUSTER_PHASE_SWAP_REL_FILES 6
+#define PROGRESS_CLUSTER_PHASE_REBUILD_INDEX 7
+#define PROGRESS_CLUSTER_PHASE_FINAL_CLEANUP 8
/* Commands of PROGRESS_CLUSTER */
#define PROGRESS_CLUSTER_COMMAND_CLUSTER 1
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 759f9a87d3..2f693e0fc0 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -181,13 +181,16 @@ typedef struct VacAttrStats
#define VACOPT_ANALYZE 0x02 /* do ANALYZE */
#define VACOPT_VERBOSE 0x04 /* output INFO instrumentation messages */
#define VACOPT_FREEZE 0x08 /* FREEZE option */
-#define VACOPT_FULL 0x10 /* FULL (non-concurrent) vacuum */
-#define VACOPT_SKIP_LOCKED 0x20 /* skip if cannot get lock */
-#define VACOPT_PROCESS_MAIN 0x40 /* process main relation */
-#define VACOPT_PROCESS_TOAST 0x80 /* process the TOAST table, if any */
-#define VACOPT_DISABLE_PAGE_SKIPPING 0x100 /* don't skip any pages */
-#define VACOPT_SKIP_DATABASE_STATS 0x200 /* skip vac_update_datfrozenxid() */
-#define VACOPT_ONLY_DATABASE_STATS 0x400 /* only vac_update_datfrozenxid() */
+#define VACOPT_FULL_EXCLUSIVE 0x10 /* FULL (non-concurrent) vacuum */
+#define VACOPT_FULL_CONCURRENT 0x20 /* FULL (concurrent) vacuum */
+#define VACOPT_SKIP_LOCKED 0x40 /* skip if cannot get lock */
+#define VACOPT_PROCESS_MAIN 0x80 /* process main relation */
+#define VACOPT_PROCESS_TOAST 0x100 /* process the TOAST table, if any */
+#define VACOPT_DISABLE_PAGE_SKIPPING 0x200 /* don't skip any pages */
+#define VACOPT_SKIP_DATABASE_STATS 0x400 /* skip vac_update_datfrozenxid() */
+#define VACOPT_ONLY_DATABASE_STATS 0x800 /* only vac_update_datfrozenxid() */
+
+#define VACOPT_FULL (VACOPT_FULL_EXCLUSIVE | VACOPT_FULL_CONCURRENT)
/*
* Values used by index_cleanup and truncate params.
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index a3360a1c5e..abbfb616ce 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -68,6 +68,8 @@ extern void FreeSnapshotBuilder(SnapBuild *builder);
extern void SnapBuildSnapDecRefcount(Snapshot snap);
extern Snapshot SnapBuildInitialSnapshot(SnapBuild *builder);
+extern Snapshot SnapBuildInitialSnapshotForCluster(SnapBuild *builder);
+extern Snapshot SnapBuildMVCCFromHistoric(Snapshot snapshot, bool in_place);
extern const char *SnapBuildExportSnapshot(SnapBuild *builder);
extern void SnapBuildClearExportedSnapshot(void);
extern void SnapBuildResetExportedSnapshotState(void);
diff --git a/src/include/storage/lockdefs.h b/src/include/storage/lockdefs.h
index 934ba84f6a..cac3d7f8c7 100644
--- a/src/include/storage/lockdefs.h
+++ b/src/include/storage/lockdefs.h
@@ -36,7 +36,7 @@ typedef int LOCKMODE;
#define AccessShareLock 1 /* SELECT */
#define RowShareLock 2 /* SELECT FOR UPDATE/FOR SHARE */
#define RowExclusiveLock 3 /* INSERT, UPDATE, DELETE */
-#define ShareUpdateExclusiveLock 4 /* VACUUM (non-FULL), ANALYZE, CREATE
+#define ShareUpdateExclusiveLock 4 /* VACUUM (non-exclusive), ANALYZE, CREATE
* INDEX CONCURRENTLY */
#define ShareLock 5 /* CREATE INDEX (WITHOUT CONCURRENTLY) */
#define ShareRowExclusiveLock 6 /* like EXCLUSIVE MODE, but allows ROW
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 6a2f64c54f..a5f59b6c12 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -83,3 +83,4 @@ PG_LWLOCK(49, WALSummarizer)
PG_LWLOCK(50, DSMRegistry)
PG_LWLOCK(51, InjectionPoint)
PG_LWLOCK(52, SerialControl)
+PG_LWLOCK(53, ClusteredRels)
diff --git a/src/include/utils/backend_progress.h b/src/include/utils/backend_progress.h
index e09598eafc..5ab5df9d41 100644
--- a/src/include/utils/backend_progress.h
+++ b/src/include/utils/backend_progress.h
@@ -35,7 +35,7 @@ typedef enum ProgressCommandType
/*
* Any command which wishes can advertise that it is running by setting
 * command, command_target, and param[]. command_target should be the OID of
* the relation which the command targets (we assume there's just one, as this
* is meant for utility commands), but the meaning of each element in the
* param array is command-specific.
@@ -55,6 +55,7 @@ extern void pgstat_progress_parallel_incr_param(int index, int64 incr);
extern void pgstat_progress_update_multi_param(int nparam, const int *index,
const int64 *val);
extern void pgstat_progress_end_command(void);
+extern void pgstat_progress_restore_state(PgBackendProgress *backup);
#endif /* BACKEND_PROGRESS_H */
diff --git a/src/include/utils/inval.h b/src/include/utils/inval.h
index 24695facf2..4acf9d0ed9 100644
--- a/src/include/utils/inval.h
+++ b/src/include/utils/inval.h
@@ -42,6 +42,8 @@ extern void CacheInvalidateCatalog(Oid catalogId);
extern void CacheInvalidateRelcache(Relation relation);
+extern void CacheInvalidateRelcacheImmediate(Relation relation);
+
extern void CacheInvalidateRelcacheAll(void);
extern void CacheInvalidateRelcacheByTuple(HeapTuple classTuple);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 8700204953..adda46c985 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -253,6 +253,9 @@ typedef struct RelationData
bool pgstat_enabled; /* should relation stats be counted */
/* use "struct" here to avoid needing to include pgstat.h: */
struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+ /* Is CLUSTER CONCURRENTLY being performed on this relation? */
+ bool rd_cluster_concurrent;
} RelationData;
@@ -684,7 +687,9 @@ RelationCloseSmgr(Relation relation)
#define RelationIsAccessibleInLogicalDecoding(relation) \
(XLogLogicalInfoActive() && \
RelationNeedsWAL(relation) && \
- (IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation)))
+ (IsCatalogRelation(relation) || \
+ RelationIsUsedAsCatalogTable(relation) || \
+ (relation)->rd_cluster_concurrent))
/*
* RelationIsLogicallyLogged
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index 9398a84051..f58c9108fc 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -68,6 +68,9 @@ extern Snapshot GetLatestSnapshot(void);
extern void SnapshotSetCommandId(CommandId curcid);
extern Snapshot GetOldestSnapshot(void);
+extern Snapshot CopySnapshot(Snapshot snapshot);
+extern void FreeSnapshot(Snapshot snapshot);
+
extern Snapshot GetCatalogSnapshot(Oid relid);
extern Snapshot GetNonHistoricCatalogSnapshot(Oid relid);
extern void InvalidateCatalogSnapshot(void);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 4c789279e5..22cb0702dc 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1958,17 +1958,20 @@ pg_stat_progress_cluster| SELECT s.pid,
WHEN 2 THEN 'index scanning heap'::text
WHEN 3 THEN 'sorting tuples'::text
WHEN 4 THEN 'writing new heap'::text
- WHEN 5 THEN 'swapping relation files'::text
- WHEN 6 THEN 'rebuilding index'::text
- WHEN 7 THEN 'performing final cleanup'::text
+ WHEN 5 THEN 'catch-up'::text
+ WHEN 6 THEN 'swapping relation files'::text
+ WHEN 7 THEN 'rebuilding index'::text
+ WHEN 8 THEN 'performing final cleanup'::text
ELSE NULL::text
END AS phase,
(s.param3)::oid AS cluster_index_relid,
s.param4 AS heap_tuples_scanned,
- s.param5 AS heap_tuples_written,
- s.param6 AS heap_blks_total,
- s.param7 AS heap_blks_scanned,
- s.param8 AS index_rebuild_count
+ s.param5 AS heap_tuples_inserted,
+ s.param6 AS heap_tuples_updated,
+ s.param7 AS heap_tuples_deleted,
+ s.param8 AS heap_blks_total,
+ s.param9 AS heap_blks_scanned,
+ s.param10 AS index_rebuild_count
FROM (pg_stat_get_progress_info('CLUSTER'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_progress_copy| SELECT s.pid,
--
2.45.2
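As an aside for readers following the commands/vacuum.h hunk above: the redefinition of VACOPT_FULL as the OR of the two new bits means existing `options & VACOPT_FULL` tests keep matching either variant. A small sketch (plain Python, purely illustrative; the constants mirror the values in the hunk, everything else is made up):

```python
# Mirror of the flag values from the commands/vacuum.h hunk above.
VACOPT_FULL_EXCLUSIVE = 0x10   # FULL (non-concurrent) vacuum
VACOPT_FULL_CONCURRENT = 0x20  # FULL (concurrent) vacuum
VACOPT_FULL = VACOPT_FULL_EXCLUSIVE | VACOPT_FULL_CONCURRENT  # 0x30

def wants_full(options: int) -> bool:
    """Hypothetical helper: any code testing 'options & VACOPT_FULL'
    matches either FULL variant without further changes."""
    return bool(options & VACOPT_FULL)

# Either variant is recognized as a FULL vacuum...
assert wants_full(VACOPT_FULL_EXCLUSIVE)
assert wants_full(VACOPT_FULL_CONCURRENT)
# ...while unrelated options (e.g. 0x08, FREEZE) are not,
assert not wants_full(0x08)
# and the two variants occupy distinct bits.
assert VACOPT_FULL_EXCLUSIVE & VACOPT_FULL_CONCURRENT == 0
```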
v01-0004-Call-logical_rewrite_heap_tuple-when-applying-concur.patch
From: Antonin Houska <ah@cybertec.at>
Date: Tue, 9 Jul 2024 17:46:00 +0200
Subject: [PATCH 4/4] Call logical_rewrite_heap_tuple() when applying
concurrent data changes.
This was implemented for the sake of completeness, but I think it's currently
not needed. Possible use cases could be:
1. VACUUM FULL / CLUSTER CONCURRENTLY can process system catalogs.
System catalogs are scanned using a historic snapshot during logical decoding,
and the "combo CIDs" information is needed for that. Since "combo CID" is
associated with the "file locator" and that locator is changed by VACUUM FULL
/ CLUSTER, these commands must record, for each individual tuple, where it
moved from the old file to the new one. This is what
logical_rewrite_heap_tuple() does.
However, the logical decoding subsystem currently does not support decoding of
data changes in system catalogs. Therefore, the CONCURRENTLY option cannot
be used for system catalogs.
2. VACUUM FULL / CLUSTER CONCURRENTLY is processing a relation, but once it
has released all the locks (in order to get the exclusive lock), another
backend runs VACUUM FULL / CLUSTER CONCURRENTLY on the same table. Since the
relation is treated as a system catalog while these commands are processing it
(so it can be scanned using a historic snapshot during the "initial load"), it
is important that the 2nd backend does not break decoding of the "combo CIDs"
performed by the 1st backend.
However, it's not practical to let multiple backends run VACUUM FULL / CLUSTER
CONCURRENTLY on the same relation, so we forbid that.
---
src/backend/access/heap/heapam_handler.c | 2 +-
src/backend/access/heap/rewriteheap.c | 65 ++++++-----
src/backend/commands/cluster.c | 102 ++++++++++++++----
src/backend/replication/logical/decode.c | 41 ++++++-
.../pgoutput_cluster/pgoutput_cluster.c | 21 ++--
src/include/access/rewriteheap.h | 5 +-
src/include/commands/cluster.h | 3 +
src/include/replication/reorderbuffer.h | 7 ++
8 files changed, 187 insertions(+), 59 deletions(-)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 02fd6d2983..cccfff62bd 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -735,7 +735,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
/* Initialize the rewrite operation */
rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, *xid_cutoff,
- *multi_cutoff);
+ *multi_cutoff, true);
/* Set up sorting if wanted */
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 473f3aa9be..050c8306da 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -214,10 +214,8 @@ static void raw_heap_insert(RewriteState state, HeapTuple tup);
/* internal logical remapping prototypes */
static void logical_begin_heap_rewrite(RewriteState state);
-static void logical_rewrite_heap_tuple(RewriteState state, ItemPointerData old_tid, HeapTuple new_tuple);
static void logical_end_heap_rewrite(RewriteState state);
-
/*
* Begin a rewrite of a table
*
@@ -226,18 +224,19 @@ static void logical_end_heap_rewrite(RewriteState state);
* oldest_xmin xid used by the caller to determine which tuples are dead
* freeze_xid xid before which tuples will be frozen
* cutoff_multi multixact before which multis will be removed
+ * tid_chains need to maintain TID chains?
*
* Returns an opaque RewriteState, allocated in current memory context,
* to be used in subsequent calls to the other functions.
*/
RewriteState
begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xmin,
- TransactionId freeze_xid, MultiXactId cutoff_multi)
+ TransactionId freeze_xid, MultiXactId cutoff_multi,
+ bool tid_chains)
{
RewriteState state;
MemoryContext rw_cxt;
MemoryContext old_cxt;
- HASHCTL hash_ctl;
/*
* To ease cleanup, make a separate context that will contain the
@@ -262,29 +261,34 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
state->rs_cxt = rw_cxt;
state->rs_bulkstate = smgr_bulk_start_rel(new_heap, MAIN_FORKNUM);
- /* Initialize hash tables used to track update chains */
- hash_ctl.keysize = sizeof(TidHashKey);
- hash_ctl.entrysize = sizeof(UnresolvedTupData);
- hash_ctl.hcxt = state->rs_cxt;
-
- state->rs_unresolved_tups =
- hash_create("Rewrite / Unresolved ctids",
- 128, /* arbitrary initial size */
- &hash_ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
- hash_ctl.entrysize = sizeof(OldToNewMappingData);
+ if (tid_chains)
+ {
+ HASHCTL hash_ctl;
+
+ /* Initialize hash tables used to track update chains */
+ hash_ctl.keysize = sizeof(TidHashKey);
+ hash_ctl.entrysize = sizeof(UnresolvedTupData);
+ hash_ctl.hcxt = state->rs_cxt;
+
+ state->rs_unresolved_tups =
+ hash_create("Rewrite / Unresolved ctids",
+ 128, /* arbitrary initial size */
+ &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+ hash_ctl.entrysize = sizeof(OldToNewMappingData);
+
+ state->rs_old_new_tid_map =
+ hash_create("Rewrite / Old to new tid map",
+ 128, /* arbitrary initial size */
+ &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ }
- state->rs_old_new_tid_map =
- hash_create("Rewrite / Old to new tid map",
- 128, /* arbitrary initial size */
- &hash_ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ logical_begin_heap_rewrite(state);
MemoryContextSwitchTo(old_cxt);
- logical_begin_heap_rewrite(state);
-
return state;
}
@@ -303,12 +307,15 @@ end_heap_rewrite(RewriteState state)
* Write any remaining tuples in the UnresolvedTups table. If we have any
* left, they should in fact be dead, but let's err on the safe side.
*/
- hash_seq_init(&seq_status, state->rs_unresolved_tups);
-
- while ((unresolved = hash_seq_search(&seq_status)) != NULL)
+ if (state->rs_unresolved_tups)
{
- ItemPointerSetInvalid(&unresolved->tuple->t_data->t_ctid);
- raw_heap_insert(state, unresolved->tuple);
+ hash_seq_init(&seq_status, state->rs_unresolved_tups);
+
+ while ((unresolved = hash_seq_search(&seq_status)) != NULL)
+ {
+ ItemPointerSetInvalid(&unresolved->tuple->t_data->t_ctid);
+ raw_heap_insert(state, unresolved->tuple);
+ }
}
/* Write the last page, if any */
@@ -995,7 +1002,7 @@ logical_rewrite_log_mapping(RewriteState state, TransactionId xid,
* Perform logical remapping for a tuple that's mapped from old_tid to
* new_tuple->t_self by rewrite_heap_tuple() if necessary for the tuple.
*/
-static void
+void
logical_rewrite_heap_tuple(RewriteState state, ItemPointerData old_tid,
HeapTuple new_tuple)
{
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 6397f7f8c4..42e8118b7d 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -21,6 +21,7 @@
#include "access/heapam.h"
#include "access/multixact.h"
#include "access/relscan.h"
+#include "access/rewriteheap.h"
#include "access/tableam.h"
#include "access/toast_internals.h"
#include "access/transam.h"
@@ -179,17 +180,21 @@ static LogicalDecodingContext *setup_logical_decoding(Oid relid,
static HeapTuple get_changed_tuple(ConcurrentChange *change);
static void apply_concurrent_changes(ClusterDecodingState *dstate,
Relation rel, ScanKey key, int nkeys,
- IndexInsertState *iistate);
+ IndexInsertState *iistate,
+ RewriteState rwstate);
static void apply_concurrent_insert(Relation rel, ConcurrentChange *change,
HeapTuple tup, IndexInsertState *iistate,
- TupleTableSlot *index_slot);
+ TupleTableSlot *index_slot,
+ RewriteState rwstate);
static void apply_concurrent_update(Relation rel, HeapTuple tup,
HeapTuple tup_target,
ConcurrentChange *change,
IndexInsertState *iistate,
- TupleTableSlot *index_slot);
+ TupleTableSlot *index_slot,
+ RewriteState rwstate);
static void apply_concurrent_delete(Relation rel, HeapTuple tup_target,
- ConcurrentChange *change);
+ ConcurrentChange *change,
+ RewriteState rwstate);
static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
HeapTuple tup_key,
Snapshot snapshot,
@@ -202,7 +207,8 @@ static void process_concurrent_changes(LogicalDecodingContext *ctx,
Relation rel_src,
ScanKey ident_key,
int ident_key_nentries,
- IndexInsertState *iistate);
+ IndexInsertState *iistate,
+ RewriteState rwstate);
static IndexInsertState *get_index_insert_state(Relation relation,
Oid ident_index_id);
static ScanKey build_identity_key(Oid ident_idx_oid, Relation rel_src,
@@ -3073,7 +3079,8 @@ cluster_decode_concurrent_changes(LogicalDecodingContext *ctx,
*/
static void
apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
- ScanKey key, int nkeys, IndexInsertState *iistate)
+ ScanKey key, int nkeys, IndexInsertState *iistate,
+ RewriteState rwstate)
{
TupleTableSlot *index_slot, *ident_slot;
HeapTuple tup_old = NULL;
@@ -3144,7 +3151,8 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
{
Assert(tup_old == NULL);
- apply_concurrent_insert(rel, change, tup, iistate, index_slot);
+ apply_concurrent_insert(rel, change, tup, iistate, index_slot,
+ rwstate);
pfree(tup);
}
@@ -3152,7 +3160,7 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
change->kind == CHANGE_DELETE)
{
IndexScanDesc ind_scan = NULL;
- HeapTuple tup_key;
+ HeapTuple tup_key, tup_exist_cp;
if (change->kind == CHANGE_UPDATE_NEW)
{
@@ -3193,11 +3201,23 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
if (tup_exist == NULL)
elog(ERROR, "Failed to find target tuple");
+ /*
+ * Update the mapping for xmax of the old version.
+ *
+ * Use a copy ('tup_exist' can point to a shared buffer) with xmin set
+ * to invalid, because the mapping for xmin should already have been
+ * written on insertion.
+ */
+ tup_exist_cp = heap_copytuple(tup_exist);
+ HeapTupleHeaderSetXmin(tup_exist_cp->t_data, InvalidTransactionId);
+ logical_rewrite_heap_tuple(rwstate, change->old_tid, tup_exist_cp);
+ pfree(tup_exist_cp);
+
if (change->kind == CHANGE_UPDATE_NEW)
apply_concurrent_update(rel, tup, tup_exist, change, iistate,
- index_slot);
+ index_slot, rwstate);
else
- apply_concurrent_delete(rel, tup_exist, change);
+ apply_concurrent_delete(rel, tup_exist, change, rwstate);
ResetClusterCurrentXids();
@@ -3238,9 +3258,12 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
static void
apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
- IndexInsertState *iistate, TupleTableSlot *index_slot)
+ IndexInsertState *iistate, TupleTableSlot *index_slot,
+ RewriteState rwstate)
{
+ HeapTupleHeader tup_hdr = tup->t_data;
Snapshot snapshot = change->snapshot;
+ ItemPointerData old_tid;
List *recheck;
/*
@@ -3250,6 +3273,9 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
*/
SetClusterCurrentXids(snapshot->subxip, snapshot->subxcnt);
+ /* Remember location in the old heap. */
+ ItemPointerCopy(&tup_hdr->t_ctid, &old_tid);
+
/*
* Write the tuple into the new heap.
*
@@ -3265,6 +3291,14 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
heap_insert(rel, tup, change->xid, snapshot->curcid - 1,
HEAP_INSERT_NO_LOGICAL, NULL);
+ /*
+ * Update the mapping for xmin (xmax should be invalid). This is needed
+ * because, during the processing, the table is considered a "user
+ * catalog".
+ */
+ Assert(!TransactionIdIsValid(HeapTupleHeaderGetRawXmax(tup->t_data)));
+ logical_rewrite_heap_tuple(rwstate, old_tid, tup);
+
/*
* Update indexes.
*
@@ -3298,16 +3332,19 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
static void
apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
ConcurrentChange *change, IndexInsertState *iistate,
- TupleTableSlot *index_slot)
+ TupleTableSlot *index_slot, RewriteState rwstate)
{
List *recheck;
LockTupleMode lockmode;
TU_UpdateIndexes update_indexes;
- ItemPointerData tid_old_new_heap;
+ ItemPointerData tid_new_old_heap, tid_old_new_heap;
TM_Result res;
Snapshot snapshot = change->snapshot;
TM_FailureData tmfd;
+ /* Location of the new tuple in the old heap. */
+ ItemPointerCopy(&tup->t_data->t_ctid, &tid_new_old_heap);
+
/* Location of the existing tuple in the new heap. */
ItemPointerCopy(&tup_target->t_self, &tid_old_new_heap);
@@ -3330,6 +3367,10 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
if (res != TM_Ok)
ereport(ERROR, (errmsg("failed to apply concurrent UPDATE")));
+ /* Update the mapping for xmin of the new version. */
+ Assert(!TransactionIdIsValid(HeapTupleHeaderGetRawXmax(tup->t_data)));
+ logical_rewrite_heap_tuple(rwstate, tid_new_old_heap, tup);
+
ExecStoreHeapTuple(tup, index_slot, false);
if (update_indexes != TU_None)
@@ -3353,7 +3394,7 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
static void
apply_concurrent_delete(Relation rel, HeapTuple tup_target,
- ConcurrentChange *change)
+ ConcurrentChange *change, RewriteState rwstate)
{
ItemPointerData tid_old_new_heap;
TM_Result res;
@@ -3444,7 +3485,8 @@ find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
static void
process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
Relation rel_dst, Relation rel_src, ScanKey ident_key,
- int ident_key_nentries, IndexInsertState *iistate)
+ int ident_key_nentries, IndexInsertState *iistate,
+ RewriteState rwstate)
{
ClusterDecodingState *dstate;
@@ -3468,7 +3510,7 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
rel_dst->rd_toastoid = rel_src->rd_rel->reltoastrelid;
apply_concurrent_changes(dstate, rel_dst, ident_key,
- ident_key_nentries, iistate);
+ ident_key_nentries, iistate, rwstate);
}
PG_FINALLY();
{
@@ -3631,6 +3673,7 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
bool is_system_catalog;
Oid ident_idx_old, ident_idx_new;
IndexInsertState *iistate;
+ RewriteState rwstate;
ScanKey ident_key;
int ident_key_nentries;
XLogRecPtr wal_insert_ptr, end_of_wal;
@@ -3708,10 +3751,26 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
* Apply concurrent changes first time, to minimize the time we need to
* hold AccessExclusiveLock. (Quite some amount of WAL could have been
* written during the data copying and index creation.)
+ *
+ * Now we are processing individual tuples, so pass false for
+ * 'tid_chains'. Since rwstate is now only needed for
+ * logical_begin_heap_rewrite(), none of the transaction IDs needs to be
+ * valid.
*/
+ rwstate = begin_heap_rewrite(OldHeap, NewHeap,
+ InvalidTransactionId,
+ InvalidTransactionId,
+ InvalidTransactionId,
+ false);
process_concurrent_changes(ctx, end_of_wal, NewHeap,
swap_toast_by_content ? OldHeap : NULL,
- ident_key, ident_key_nentries, iistate);
+ ident_key, ident_key_nentries, iistate,
+ rwstate);
+ /*
+ * OldHeap will be closed, so we need to initialize rwstate again for the
+ * next call of process_concurrent_changes().
+ */
+ end_heap_rewrite(rwstate);
/*
* Release the locks that allowed concurrent data changes, in order to
@@ -3833,9 +3892,16 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
end_of_wal = GetFlushRecPtr(NULL);
/* Apply the concurrent changes again. */
+ rwstate = begin_heap_rewrite(OldHeap, NewHeap,
+ InvalidTransactionId,
+ InvalidTransactionId,
+ InvalidTransactionId,
+ false);
process_concurrent_changes(ctx, end_of_wal, NewHeap,
swap_toast_by_content ? OldHeap : NULL,
- ident_key, ident_key_nentries, iistate);
+ ident_key, ident_key_nentries, iistate,
+ rwstate);
+ end_heap_rewrite(rwstate);
/* Remember info about rel before closing OldHeap */
relpersistence = OldHeap->rd_rel->relpersistence;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 066d96dea2..69a43e3510 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -951,11 +951,13 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
xl_heap_insert *xlrec;
ReorderBufferChange *change;
RelFileLocator target_locator;
+ BlockNumber blknum;
+ HeapTupleHeader tuphdr;
xlrec = (xl_heap_insert *) XLogRecGetData(r);
/* only interested in our database */
- XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
+ XLogRecGetBlockTag(r, 0, &target_locator, NULL, &blknum);
if (target_locator.dbOid != ctx->slot->data.database)
return;
@@ -980,6 +982,13 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
DecodeXLogTuple(tupledata, datalen, change->data.tp.newtuple);
+ /*
+ * CTID is needed for logical_rewrite_heap_tuple(), when doing CLUSTER
+ * CONCURRENTLY.
+ */
+ tuphdr = change->data.tp.newtuple->t_data;
+ ItemPointerSet(&tuphdr->t_ctid, blknum, xlrec->offnum);
+
change->data.tp.clear_toast_afterwards = true;
ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
@@ -1001,11 +1010,14 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
ReorderBufferChange *change;
char *data;
RelFileLocator target_locator;
+ BlockNumber old_blknum, new_blknum;
xlrec = (xl_heap_update *) XLogRecGetData(r);
+ /* Retrieve blknum, so that we can compose CTID below. */
+ XLogRecGetBlockTag(r, 0, &target_locator, NULL, &new_blknum);
+
/* only interested in our database */
- XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
if (target_locator.dbOid != ctx->slot->data.database)
return;
@@ -1022,6 +1034,7 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
{
Size datalen;
Size tuplelen;
+ HeapTupleHeader tuphdr;
data = XLogRecGetBlockData(r, 0, &datalen);
@@ -1031,6 +1044,13 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
ReorderBufferGetTupleBuf(ctx->reorder, tuplelen);
DecodeXLogTuple(data, datalen, change->data.tp.newtuple);
+
+ /*
+ * CTID is needed for logical_rewrite_heap_tuple(), when doing CLUSTER
+ * CONCURRENTLY.
+ */
+ tuphdr = change->data.tp.newtuple->t_data;
+ ItemPointerSet(&tuphdr->t_ctid, new_blknum, xlrec->new_offnum);
}
if (xlrec->flags & XLH_UPDATE_CONTAINS_OLD)
@@ -1049,6 +1069,14 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
DecodeXLogTuple(data, datalen, change->data.tp.oldtuple);
}
+ /*
+ * Remember the old tuple CTID, for the sake of
+ * logical_rewrite_heap_tuple().
+ */
+ if (!XLogRecGetBlockTagExtended(r, 1, NULL, NULL, &old_blknum, NULL))
+ old_blknum = new_blknum;
+ ItemPointerSet(&change->data.tp.old_tid, old_blknum, xlrec->old_offnum);
+
change->data.tp.clear_toast_afterwards = true;
ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
@@ -1067,11 +1095,12 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
xl_heap_delete *xlrec;
ReorderBufferChange *change;
RelFileLocator target_locator;
+ BlockNumber blknum;
xlrec = (xl_heap_delete *) XLogRecGetData(r);
/* only interested in our database */
- XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
+ XLogRecGetBlockTag(r, 0, &target_locator, NULL, &blknum);
if (target_locator.dbOid != ctx->slot->data.database)
return;
@@ -1103,6 +1132,12 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
DecodeXLogTuple((char *) xlrec + SizeOfHeapDelete,
datalen, change->data.tp.oldtuple);
+
+ /*
+ * CTID is needed for logical_rewrite_heap_tuple(), when doing CLUSTER
+ * CONCURRENTLY.
+ */
+ ItemPointerSet(&change->data.tp.old_tid, blknum, xlrec->offnum);
}
change->data.tp.clear_toast_afterwards = true;
diff --git a/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c b/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
index 9fe44017a8..2c33fbad82 100644
--- a/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
+++ b/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
@@ -34,7 +34,7 @@ static void plugin_truncate(struct LogicalDecodingContext *ctx,
ReorderBufferChange *change);
static void store_change(LogicalDecodingContext *ctx,
ConcurrentChangeKind kind, HeapTuple tuple,
- TransactionId xid);
+ TransactionId xid, ItemPointer old_tid);
void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
@@ -162,7 +162,8 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (newtuple == NULL)
elog(ERROR, "Incomplete insert info.");
- store_change(ctx, CHANGE_INSERT, newtuple, change->txn->xid);
+ store_change(ctx, CHANGE_INSERT, newtuple, change->txn->xid,
+ NULL);
}
break;
case REORDER_BUFFER_CHANGE_UPDATE:
@@ -180,10 +181,10 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (oldtuple != NULL)
store_change(ctx, CHANGE_UPDATE_OLD, oldtuple,
- change->txn->xid);
+ change->txn->xid, NULL);
store_change(ctx, CHANGE_UPDATE_NEW, newtuple,
- change->txn->xid);
+ change->txn->xid, &change->data.tp.old_tid);
}
break;
case REORDER_BUFFER_CHANGE_DELETE:
@@ -196,7 +197,8 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (oldtuple == NULL)
elog(ERROR, "Incomplete delete info.");
- store_change(ctx, CHANGE_DELETE, oldtuple, change->txn->xid);
+ store_change(ctx, CHANGE_DELETE, oldtuple, change->txn->xid,
+ &change->data.tp.old_tid);
}
break;
default:
@@ -230,13 +232,13 @@ plugin_truncate(struct LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (i == nrelations)
return;
- store_change(ctx, CHANGE_TRUNCATE, NULL, InvalidTransactionId);
+ store_change(ctx, CHANGE_TRUNCATE, NULL, InvalidTransactionId, NULL);
}
/* Store concurrent data change. */
static void
store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
- HeapTuple tuple, TransactionId xid)
+ HeapTuple tuple, TransactionId xid, ItemPointer old_tid)
{
ClusterDecodingState *dstate;
char *change_raw;
@@ -301,6 +303,11 @@ store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
change->snapshot = dstate->snapshot;
dstate->snapshot->active_count++;
+ if (old_tid)
+ ItemPointerCopy(old_tid, &change->old_tid);
+ else
+ ItemPointerSetInvalid(&change->old_tid);
+
/* The data has been copied. */
if (flattened)
pfree(tuple);
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 5866a26bdd..de62b6abf8 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -23,11 +23,14 @@ typedef struct RewriteStateData *RewriteState;
extern RewriteState begin_heap_rewrite(Relation old_heap, Relation new_heap,
TransactionId oldest_xmin, TransactionId freeze_xid,
- MultiXactId cutoff_multi);
+ MultiXactId cutoff_multi, bool tid_chains);
extern void end_heap_rewrite(RewriteState state);
extern void rewrite_heap_tuple(RewriteState state, HeapTuple old_tuple,
HeapTuple new_tuple);
extern bool rewrite_heap_dead_tuple(RewriteState state, HeapTuple old_tuple);
+extern void logical_rewrite_heap_tuple(RewriteState state,
+ ItemPointerData old_tid,
+ HeapTuple new_tuple);
/*
* On-Disk data format for an individual logical rewrite mapping.
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index f98b855f21..c394ef3871 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -71,6 +71,9 @@ typedef struct ConcurrentChange
/* Transaction that changes the data. */
TransactionId xid;
+ /* For UPDATE / DELETE, the location of the old tuple version. */
+ ItemPointerData old_tid;
+
/*
* Historic catalog snapshot that was used to decode this change.
*/
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 851a001c8b..1fa8f8bd6a 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -99,6 +99,13 @@ typedef struct ReorderBufferChange
HeapTuple oldtuple;
/* valid for INSERT || UPDATE */
HeapTuple newtuple;
+
+ /*
+ * CLUSTER CONCURRENTLY needs the old TID, even if the old tuple
+ * itself is not WAL-logged (i.e. when the identity key does not
+ * change).
+ */
+ ItemPointerData old_tid;
} tp;
/*
--
2.45.2
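The old_tid plumbing in the patch above can be illustrated with a toy model (plain Python, not PostgreSQL code; all names here are invented for illustration). The point is why store_change() must carry old_tid: while the table is being rewritten into a new heap, decoded UPDATEs and DELETEs identify rows by their TID in the *old* heap, so the apply side needs an old-TID to new-TID map to find the row to modify.

```python
def rewrite(old_heap):
    """Copy tuples into a new heap, recording where each old TID landed."""
    new_heap, tid_map = {}, {}
    for new_tid, (old_tid, tup) in enumerate(old_heap.items()):
        new_heap[new_tid] = tup
        tid_map[old_tid] = new_tid
    return new_heap, tid_map

def apply_change(new_heap, tid_map, kind, old_tid=None, tup=None):
    """Apply one decoded concurrent change to the new heap."""
    if kind == "INSERT":                 # inserts carry no old TID
        new_heap[max(new_heap, default=-1) + 1] = tup
    elif kind == "UPDATE":               # locate the row via the old TID
        new_heap[tid_map[old_tid]] = tup
    elif kind == "DELETE":
        del new_heap[tid_map[old_tid]]

old = {(0, 1): "a", (0, 2): "b"}         # TIDs modeled as (block, offset)
heap, tmap = rewrite(old)
apply_change(heap, tmap, "UPDATE", old_tid=(0, 2), tup="b2")
apply_change(heap, tmap, "DELETE", old_tid=(0, 1))
apply_change(heap, tmap, "INSERT", tup="c")
print(sorted(heap.values()))             # ['b2', 'c']
```

The real patch additionally threads old_tid through logical_rewrite_heap_tuple() so that update chains can be followed; this sketch only shows the lookup side.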
On 2024-Jul-09, Antonin Houska wrote:
Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
Is your plan to work on it soon or should I try to write a draft patch? (I
assume this is for PG >= 18.)

I don't have plans for it, so if you have resources, please go for it.
The first version is attached. The actual feature is in 0003. 0004 is probably
not necessary now, but I hadn't realized that until I coded it.
Thank you, this is great. I'll be studying this during the next
commitfest.
BTW I can apply 0003 from this email perfectly fine, but you're right
that the archives don't show the file name. I suspect the
"Content-Disposition: inline" PLUS the Content-Type text/plain are what
cause the problem -- for instance, [1] doesn't have a problem and they
do have inline content disposition, but the content-type is not
text/plain. In any case, I encourage you not to send patches as
tarballs :-)
[1]: /messages/by-id/32781.1714378236@antos
--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"The first law of live demonstrations is: do not try to use the system.
Write a script that touches nothing, so it cannot cause damage." (Jakob Nielsen)
Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
On 2024-Jul-09, Antonin Houska wrote:
Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
Is your plan to work on it soon or should I try to write a draft patch? (I
assume this is for PG >= 18.)

I don't have plans for it, so if you have resources, please go for it.
The first version is attached. The actual feature is in 0003. 0004 is probably
not necessary now, but I hadn't realized that until I coded it.

Thank you, this is great. I'll be studying this during the next
commitfest.
Thanks. I'll register it in the CF application.
BTW I can apply 0003 from this email perfectly fine, but you're right
that the archives don't show the file name. I suspect the
"Content-Disposition: inline" PLUS the Content-Type text/plain are what
cause the problem -- for instance, [1] doesn't have a problem and they
do have inline content disposition, but the content-type is not
text/plain. In any case, I encourage you not to send patches as
tarballs :-)
You're right, "Content-Disposition" is the problem. I forgot that "attachment"
is better for patches and my email client (emacs+nmh) defaults to
"inline". I'll pay attention next time.
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Hi!
I'm interested in the vacuum concurrently feature being inside the
core, so I will try to review the patch set and give valuable
feedback. For now, just a few little thoughts...
The first version is attached. The actual feature is in 0003. 0004 is probably
not necessary now, but I hadn't realized that until I coded it.
The logical replication vacuum approach is a really smart idea, I like
it. As far as I understand, pg_squeeze works well in real production
databases, which gives us hope that the concurrent vacuum feature in
core will be good too... What is the size of the biggest relation
successfully vacuumed via pg_squeeze? It looks like, in case of a big
relation or high insertion load, replication may lag and never catch
up...
However, in general, the 3rd patch is really big, very hard to
comprehend. Please consider splitting this into smaller (and
reviewable) pieces.
Also, we obviously need more tests on this. Both TAP tests and
regression tests, I suppose.
One more thing is about the pg_squeeze background workers. They act in
an autovacuum-like fashion, don't they? Maybe we can support this kind
of relation processing in core too?
Hi
ne 21. 7. 2024 v 17:13 odesílatel Kirill Reshke <reshkekirill@gmail.com>
napsal:
Hi!
I'm interested in the vacuum concurrently feature being inside the
core, so will try to review patch set and give valuable feedback. For
now, just a few little thoughts...

One more thing is about pg_squeeze background workers. They act in an
autovacuum-like fashion, don't they? Maybe we can support this kind
of relation processing in core too?
I don't think it is necessary once this feature is in core.
I agree that this feature is very important; I proposed it (and I am very
happy that Tonda implemented it), but I am not sure whether its usage
should be automated, and if it should be, then
a) probably autovacuum should do it,
b) can we move that discussion to after VACUUM FULL CONCURRENTLY is merged
upstream, please? It isn't very practical to have too many open targets.
Regards
Pavel
Also, we obviously need more tests on this. Both tap-test and
regression tests I suppose.
One simple test for this patch can be done this way:
1) create a test relation (call it vac_conc_r1 for example) and fill it
with dead tuples (insert + update or insert + delete)
2) create an injection point preventing the concurrent vacuum from
completing.
3) run the concurrent vacuum (VACUUM FULL CONCURRENTLY) in a separate
thread or in some other async way.
4) Insert new data into vac_conc_r1.
5) Release the injection point; assert that vacuum completed successfully.
6) Check that all data is present in vac_conc_r1 (data from step 1 and
from step 4).
This way we can catch some basic bugs if some paths of VACUUM
CONCURRENTLY are touched in the future.
The problem with this test is: I don't know how to do anything async
in the current TAP tests (needed in step 3). Also, a test with async
interaction may be too flaky (producing false failures) to support.
A sequential test for this feature would be much better, but I can't
think of one.
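The six steps above can be sketched in miniature (plain Python, no PostgreSQL involved; a threading.Event stands in for the injection point and a list stands in for the relation — every name here is invented for illustration):

```python
import threading

table = ["old1", "old2"]          # step 1: relation with initial tuples
injection_point = threading.Event()
done = threading.Event()

def vacuum_full_concurrently():
    snapshot = list(table)        # step 3: vacuum starts, copies initial data
    injection_point.wait()        # step 2: blocked at the injection point
    # catch up with changes made while we were blocked, then swap
    for row in table:
        if row not in snapshot:
            snapshot.append(row)
    table[:] = snapshot
    done.set()

worker = threading.Thread(target=vacuum_full_concurrently)
worker.start()
table.append("new1")              # step 4: concurrent insert
injection_point.set()             # step 5: release the injection point
worker.join()
assert done.is_set()              # vacuum completed
print(sorted(table))              # step 6: data from steps 1 and 4 survive
```

This is exactly the async shape that is awkward to express in the current TAP framework: the "vacuum" must be started, left blocked, poked from the side, then released.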
Also, should we create a cf entry for this thread already?
On Mon, Jul 22, 2024 at 01:23:03PM +0500, Kirill Reshke wrote:
Also, should we create a cf entry for this thread already?
I was wondering about this as well, but there is one for the upcoming
commitfest already:
https://commitfest.postgresql.org/49/5117/
Michael
Hi!
On Tue, 30 Jan 2024 at 15:31, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
FWIW a newer, more modern and more trustworthy alternative to pg_repack
is pg_squeeze, which I discovered almost by random chance, and soon
discovered I liked it much more.
Can you please clarify this a bit more? What is the exact reason for
pg_squeeze being more trustworthy than pg_repack?
Is there something about the logical replication approach that makes
it more bulletproof than the trigger-based repack approach?
Also, I was thinking about pg_repack vs pg_squeeze being used for the
VACUUM FULL CONCURRENTLY feature, and I'm a bit suspicious about the
latter.
If I understand correctly, we essentially parse the whole WAL to
obtain info about one particular relation's changes. That may be a big
overhead, whereas the trigger approach does not suffer from this. So,
there is a chance that VACUUM FULL CONCURRENTLY will never keep up
with the vacuumed relation's changes. Am I right?
Kirill Reshke <reshkekirill@gmail.com> wrote:
Also, I was thinking about pg_repack vs pg_squeeze being used for the
VACUUM FULL CONCURRENTLY feature, and I'm a bit suspicious about the
latter.
If I understand correctly, we essentially parse the whole WAL to
obtain info about one particular relation changes. That may be a big
overhead,
pg_squeeze is an extension but the logical decoding is performed by the core,
so there is no way to ensure that data changes of the "other tables" are not
decoded. However, it might be possible if we integrate the functionality into
the core. I'll consider doing so in the next version of [1].
whereas the trigger approach does not suffer from this. So, there is the
chance that VACUUM FULL CONCURRENTLY will never keep up with vacuumed
relation changes. Am I right?
Perhaps it can happen, but note that trigger processing is also not free and
that in this case the cost is paid by the applications. So while VACUUM FULL
CONCURRENTLY (based on logical decoding) might fail to catch up, the trigger
based solution may slow down the applications that execute DML commands while
the table is being rewritten.
[1]: https://commitfest.postgresql.org/49/5117/
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
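The decode-side cost being discussed can be illustrated with a toy sketch (plain Python, purely illustrative): the decoder has to scan every WAL record, but only changes touching the target relation are kept, so "other tables" cost decode-and-discard work rather than apply work.

```python
def decode(wal, target_relid):
    """Scan all WAL records, keep only changes for target_relid."""
    kept, scanned = [], 0
    for relid, change in wal:
        scanned += 1                  # every record must be decoded...
        if relid == target_relid:
            kept.append(change)       # ...but only some are stored/applied
    return kept, scanned

wal = [(1, "ins"), (2, "ins"), (1, "upd"), (3, "del"), (1, "del")]
kept, scanned = decode(wal, target_relid=1)
print(len(kept), scanned)             # 3 changes kept out of 5 scanned
```

An in-core implementation could in principle filter earlier, which is the possibility Antonin mentions above.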
Kirill Reshke <reshkekirill@gmail.com> wrote:
What is the size of the biggest relation successfully vacuumed
via pg_squeeze?
Looks like in case of big relation or high insertion load,
replication may lag and never catch up...
Users report problems rather than successes, so I don't know. 400 GB was
reported in [1], but it's possible that the table size for this test was
determined based on available disk space.
I think that the amount of data changes performed during the "squeezing"
matters more than the table size. In [2], one user reported "thousands of
UPSERTs per second", but the amount of data also depends on row size, which he
didn't mention.
pg_squeeze gives up if it fails to catch up a few times. The first version of
my patch does not check this, I'll add the corresponding code in the next
version.
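The give-up behaviour described above can be modeled with a short sketch (plain Python; this is an assumption about the shape of the logic, not actual pg_squeeze code): each catch-up pass applies the backlog that accumulated during the previous pass, and if the backlog stops shrinking within a bounded number of attempts, the rewrite is abandoned.

```python
def catch_up(initial_backlog, change_rate, apply_rate, max_attempts=4):
    """Return the attempt on which the backlog became small enough to
    finish under a brief exclusive lock, or None if we never caught up."""
    backlog = initial_backlog
    for attempt in range(1, max_attempts + 1):
        applied = min(backlog, apply_rate)
        backlog = backlog - applied + change_rate   # new changes keep coming
        if backlog <= change_rate:
            return attempt
    return None

print(catch_up(1000, change_rate=50, apply_rate=400))   # 3 (catches up)
print(catch_up(1000, change_rate=500, apply_rate=400))  # None (falls behind)
```

When change_rate exceeds apply_rate the backlog grows monotonically, which is the "may never catch up" scenario Kirill raised earlier in the thread.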
However, in general, the 3rd patch is really big, very hard to
comprehend. Please consider splitting this into smaller (and
reviewable) pieces.
I'll try to move some preparation steps into separate diffs, but not sure if
that will make the main diff much smaller. I prefer self-contained patches, as
also explained in [3].
Also, we obviously need more tests on this. Both tap-test and
regression tests I suppose.
Sure. The next version will use the injection points to test if "concurrent
data changes" are processed correctly.
One more thing is about pg_squeeze background workers. They act in an
autovacuum-like fashion, aren't they? Maybe we can support this kind
of relation processing in core too?
Maybe later. Even just adding the CONCURRENTLY option to CLUSTER and VACUUM
FULL requires quite some effort.
[1]: https://github.com/cybertec-postgresql/pg_squeeze/issues/51
[2]: https://github.com/cybertec-postgresql/pg_squeeze/issues/21#issuecomment-514495369
[3]: http://peter.eisentraut.org/blog/2024/05/14/when-to-split-patches-for-postgresql
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
On Fri, 2 Aug 2024 at 11:09, Antonin Houska <ah@cybertec.at> wrote:
Kirill Reshke <reshkekirill@gmail.com> wrote:
However, in general, the 3rd patch is really big, very hard to
comprehend. Please consider splitting this into smaller (and
reviewable) pieces.

I'll try to move some preparation steps into separate diffs, but not sure if
that will make the main diff much smaller. I prefer self-contained patches, as
also explained in [3].
Thanks for sharing [3], it is a useful link.
There is actually one more case when ACCESS EXCLUSIVE is held: during
table rewrite (AT set TAM, AT set Tablespace and AT alter column type
are some examples).
This can be done CONCURRENTLY too, using the same logical replication
approach, or am I missing something?
I'm not saying we must do it immediately, this should be a separate
thread, but we can do some preparation work here.
I can see that a bunch of functions which are currently placed in
cluster.c can be moved to something like
logical_rewrite_heap.c. The ConcurrentChange struct and the
apply_concurrent_insert function are examples of such.
So, if this is the case, the 0003 patch can be split in two:
the first one with general utility code for logical table rewrite,
the second one with the actual VACUUM CONCURRENTLY feature.
What do you think?
Attached is version 2, the feature itself is now in 0004.
Unlike version 1, it contains some regression tests (0006) and a new GUC to
control how long the AccessExclusiveLock may be held (0007).
Kirill Reshke <reshkekirill@gmail.com> wrote:
On Fri, 2 Aug 2024 at 11:09, Antonin Houska <ah@cybertec.at> wrote:
Kirill Reshke <reshkekirill@gmail.com> wrote:
However, in general, the 3rd patch is really big, very hard to
comprehend. Please consider splitting this into smaller (and
reviewable) pieces.

I'll try to move some preparation steps into separate diffs, but not sure if
that will make the main diff much smaller. I prefer self-contained patches, as
also explained in [3].

Thanks for sharing [3], it is a useful link.
There is actually one more case when ACCESS EXCLUSIVE is held: during
table rewrite (AT set TAM, AT set Tablespace and AT alter column type
are some examples).
This can be done CONCURRENTLY too, using the same logical replication
approach, or do I miss something?
Yes, the logical replication can potentially be used in other cases.
I'm not saying we must do it immediately, this should be a separate
thread, but we can do some preparation work here.

I can see that a bunch of functions which are currently placed in
cluster.c can be moved to something like
logical_rewrite_heap.c. ConcurrentChange struct and
apply_concurrent_insert function is one example of such.

So, if this is the case, 0003 patch can be split in two:
The first one is general utility code for logical table rewrite
The second one with actual VACUUM CONCURRENTLY feature.
What do you think?
I can imagine moving the function process_concurrent_changes() and subroutines
to a different file (e.g. rewriteheap.c), but moving it into a separate diff
that does not contain any call of the function makes little sense to me. Such
a diff would not add any useful functionality and could not be considered
refactoring either.
So far I at least moved some code to separate diffs: 0003 and 0005. I'll move
more if I find sensible opportunity in the future.
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Attachments:
v02-0001-Adjust-signature-of-cluster_rel-and-its-subroutines.patchtext/x-diffDownload
From 16414c3a1329db9264b15a44a43d661c02ac5329 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Tue, 27 Aug 2024 12:13:18 +0200
Subject: [PATCH 1/8] Adjust signature of cluster_rel() and its subroutines.
So far cluster_rel() received OID of the relation it should process and it
performed opening and locking of the relation itself. Yet copy_table_data()
received the OID as well and also had to open the relation itself. This patch
tries to eliminate the repeated opening and closing.
One particular reason for this change is that the VACUUM FULL / CLUSTER
command with the CONCURRENTLY option will need to release all locks on the
relation (and possibly on the clustering index) at some point. Since it makes
little sense to keep relation reference w/o lock, the cluster_rel() function
also closes its reference to the relation (and its index). Neither the
function nor its subroutines may open extra references because then it'd be a
bit harder to close them all.
---
src/backend/commands/cluster.c | 146 ++++++++++++++++++-------------
src/backend/commands/matview.c | 2 +-
src/backend/commands/tablecmds.c | 2 +-
src/backend/commands/vacuum.c | 12 +--
src/include/commands/cluster.h | 5 +-
5 files changed, 99 insertions(+), 68 deletions(-)
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 78f96789b0..194d143cf4 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -70,8 +70,8 @@ typedef struct
static void cluster_multiple_rels(List *rtcs, ClusterParams *params);
-static void rebuild_relation(Relation OldHeap, Oid indexOid, bool verbose);
-static void copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex,
+static void rebuild_relation(Relation OldHeap, Relation index, bool verbose);
+static void copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
bool verbose, bool *pSwapToastByContent,
TransactionId *pFreezeXid, MultiXactId *pCutoffMulti);
static List *get_tables_to_cluster(MemoryContext cluster_context);
@@ -194,11 +194,11 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
{
- /* close relation, keep lock till commit */
- table_close(rel, NoLock);
-
- /* Do the job. */
- cluster_rel(tableOid, indexOid, &params);
+ /*
+ * Do the job. (The function will close the relation, lock is kept
+ * till commit.)
+ */
+ cluster_rel(rel, indexOid, &params);
return;
}
@@ -275,6 +275,7 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
foreach(lc, rtcs)
{
RelToCluster *rtc = (RelToCluster *) lfirst(lc);
+ Relation rel;
/* Start a new transaction for each relation. */
StartTransactionCommand();
@@ -282,8 +283,13 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
/* functions in indexes may want a snapshot set */
PushActiveSnapshot(GetTransactionSnapshot());
- /* Do the job. */
- cluster_rel(rtc->tableOid, rtc->indexOid, params);
+ rel = table_open(rtc->tableOid, AccessExclusiveLock);
+
+ /*
+ * Do the job. (The function will close the relation, lock is kept
+ * till commit.)
+ */
+ cluster_rel(rel, rtc->indexOid, params);
PopActiveSnapshot();
CommitTransactionCommand();
@@ -306,16 +312,19 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
* If indexOid is InvalidOid, the table will be rewritten in physical order
* instead of index order. This is the new implementation of VACUUM FULL,
* and error messages should refer to the operation as VACUUM not CLUSTER.
+ *
+ * We expect that OldHeap is already locked in AccessExclusiveLock mode.
*/
void
-cluster_rel(Oid tableOid, Oid indexOid, ClusterParams *params)
+cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
{
- Relation OldHeap;
+ Oid tableOid = RelationGetRelid(OldHeap);
Oid save_userid;
int save_sec_context;
int save_nestlevel;
bool verbose = ((params->options & CLUOPT_VERBOSE) != 0);
bool recheck = ((params->options & CLUOPT_RECHECK) != 0);
+ Relation index = NULL;
/* Check for user-requested abort. */
CHECK_FOR_INTERRUPTS();
@@ -328,21 +337,6 @@ cluster_rel(Oid tableOid, Oid indexOid, ClusterParams *params)
pgstat_progress_update_param(PROGRESS_CLUSTER_COMMAND,
PROGRESS_CLUSTER_COMMAND_VACUUM_FULL);
- /*
- * We grab exclusive access to the target rel and index for the duration
- * of the transaction. (This is redundant for the single-transaction
- * case, since cluster() already did it.) The index lock is taken inside
- * check_index_is_clusterable.
- */
- OldHeap = try_relation_open(tableOid, AccessExclusiveLock);
-
- /* If the table has gone away, we can skip processing it */
- if (!OldHeap)
- {
- pgstat_progress_end_command();
- return;
- }
-
/*
* Switch to the table owner's userid, so that any index functions are run
* as that user. Also lock down security-restricted operations and
@@ -445,7 +439,11 @@ cluster_rel(Oid tableOid, Oid indexOid, ClusterParams *params)
/* Check heap and index are valid to cluster on */
if (OidIsValid(indexOid))
+ {
check_index_is_clusterable(OldHeap, indexOid, AccessExclusiveLock);
+ /* Open the index (It should already be locked.) */
+ index = index_open(indexOid, NoLock);
+ }
/*
* Quietly ignore the request if this is a materialized view which has not
@@ -474,9 +472,12 @@ cluster_rel(Oid tableOid, Oid indexOid, ClusterParams *params)
TransferPredicateLocksToHeapRelation(OldHeap);
/* rebuild_relation does all the dirty work */
- rebuild_relation(OldHeap, indexOid, verbose);
+ rebuild_relation(OldHeap, index, verbose);
- /* NB: rebuild_relation does table_close() on OldHeap */
+ /*
+ * NB: rebuild_relation does table_close() on OldHeap, and also on index,
+ * if the pointer is valid.
+ */
out:
/* Roll back any GUC changes executed by index functions */
@@ -625,22 +626,27 @@ mark_index_clustered(Relation rel, Oid indexOid, bool is_internal)
* rebuild_relation: rebuild an existing relation in index or physical order
*
* OldHeap: table to rebuild --- must be opened and exclusive-locked!
- * indexOid: index to cluster by, or InvalidOid to rewrite in physical order.
+ * index: index to cluster by, or NULL to rewrite in physical order. Must be
+ * opened and locked.
*
- * NB: this routine closes OldHeap at the right time; caller should not.
+ * On exit, the heap (and also the index, if one was passed) are closed, but
+ * still locked with AccessExclusiveLock.
*/
static void
-rebuild_relation(Relation OldHeap, Oid indexOid, bool verbose)
+rebuild_relation(Relation OldHeap, Relation index, bool verbose)
{
Oid tableOid = RelationGetRelid(OldHeap);
+ Oid indexOid = index ? RelationGetRelid(index) : InvalidOid;
Oid accessMethod = OldHeap->rd_rel->relam;
Oid tableSpace = OldHeap->rd_rel->reltablespace;
Oid OIDNewHeap;
+ Relation NewHeap;
char relpersistence;
bool is_system_catalog;
bool swap_toast_by_content;
TransactionId frozenXid;
MultiXactId cutoffMulti;
+ LOCKMODE lmode_new;
if (OidIsValid(indexOid))
/* Mark the correct index as clustered */
@@ -650,19 +656,40 @@ rebuild_relation(Relation OldHeap, Oid indexOid, bool verbose)
relpersistence = OldHeap->rd_rel->relpersistence;
is_system_catalog = IsSystemRelation(OldHeap);
- /* Close relcache entry, but keep lock until transaction commit */
- table_close(OldHeap, NoLock);
-
- /* Create the transient table that will receive the re-ordered data */
+ /*
+ * Create the transient table that will receive the re-ordered data.
+ *
+ * NoLock for the old heap because we already have it locked and want to
+ * keep unlocking straightforward.
+ */
+ lmode_new = AccessExclusiveLock;
OIDNewHeap = make_new_heap(tableOid, tableSpace,
accessMethod,
relpersistence,
- AccessExclusiveLock);
+ NoLock, &lmode_new);
+ Assert(lmode_new == AccessExclusiveLock || lmode_new == NoLock);
+ /* Lock iff not done above. */
+ NewHeap = table_open(OIDNewHeap, lmode_new == NoLock ?
+ AccessExclusiveLock : NoLock);
/* Copy the heap data into the new table in the desired order */
- copy_table_data(OIDNewHeap, tableOid, indexOid, verbose,
+ copy_table_data(NewHeap, OldHeap, index, verbose,
&swap_toast_by_content, &frozenXid, &cutoffMulti);
+
+ /* Close relcache entries, but keep lock until transaction commit */
+ table_close(OldHeap, NoLock);
+ if (index)
+ index_close(index, NoLock);
+
+ /*
+ * Close the new relation so it can be dropped as soon as the storage is
+ * swapped. The relation is not visible to others, so we could unlock it
+ * completely, but it's simpler to pass NoLock than to track all the locks
+ * acquired so far.
+ */
+ table_close(NewHeap, NoLock);
+
/*
* Swap the physical files of the target and transient tables, then
* rebuild the target's indexes and throw away the transient table.
@@ -683,10 +710,15 @@ rebuild_relation(Relation OldHeap, Oid indexOid, bool verbose)
*
* After this, the caller should load the new heap with transferred/modified
* data, then call finish_heap_swap to complete the operation.
+ *
+ * If a specific lock mode is needed for the new relation, pass it via the
+ * in/out parameter lockmode_new_p. On exit, the output value tells whether
+ * the lock was actually acquired.
*/
Oid
make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
- char relpersistence, LOCKMODE lockmode)
+ char relpersistence, LOCKMODE lockmode_old,
+ LOCKMODE *lockmode_new_p)
{
TupleDesc OldHeapDesc;
char NewHeapName[NAMEDATALEN];
@@ -697,8 +729,17 @@ make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
Datum reloptions;
bool isNull;
Oid namespaceid;
+ LOCKMODE lockmode_new;
- OldHeap = table_open(OIDOldHeap, lockmode);
+ if (lockmode_new_p)
+ {
+ lockmode_new = *lockmode_new_p;
+ *lockmode_new_p = NoLock;
+ }
+ else
+ lockmode_new = lockmode_old;
+
+ OldHeap = table_open(OIDOldHeap, lockmode_old);
OldHeapDesc = RelationGetDescr(OldHeap);
/*
@@ -792,7 +833,9 @@ make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
if (isNull)
reloptions = (Datum) 0;
- NewHeapCreateToastTable(OIDNewHeap, reloptions, lockmode, toastid);
+ NewHeapCreateToastTable(OIDNewHeap, reloptions, lockmode_new, toastid);
+ if (lockmode_new_p)
+ *lockmode_new_p = lockmode_new;
ReleaseSysCache(tuple);
}
@@ -811,13 +854,13 @@ make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
* *pCutoffMulti receives the MultiXactId used as a cutoff point.
*/
static void
-copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
+copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verbose,
bool *pSwapToastByContent, TransactionId *pFreezeXid,
MultiXactId *pCutoffMulti)
{
- Relation NewHeap,
- OldHeap,
- OldIndex;
+ Oid OIDOldHeap = RelationGetRelid(OldHeap);
+ Oid OIDOldIndex = OldIndex ? RelationGetRelid(OldIndex) : InvalidOid;
+ Oid OIDNewHeap = RelationGetRelid(NewHeap);
Relation relRelation;
HeapTuple reltup;
Form_pg_class relform;
@@ -836,16 +879,6 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
pg_rusage_init(&ru0);
- /*
- * Open the relations we need.
- */
- NewHeap = table_open(OIDNewHeap, AccessExclusiveLock);
- OldHeap = table_open(OIDOldHeap, AccessExclusiveLock);
- if (OidIsValid(OIDOldIndex))
- OldIndex = index_open(OIDOldIndex, AccessExclusiveLock);
- else
- OldIndex = NULL;
-
/* Store a copy of the namespace name for logging purposes */
nspname = get_namespace_name(RelationGetNamespace(OldHeap));
@@ -1001,11 +1034,6 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
tups_recently_dead,
pg_rusage_show(&ru0))));
- if (OldIndex != NULL)
- index_close(OldIndex, NoLock);
- table_close(OldHeap, NoLock);
- table_close(NewHeap, NoLock);
-
/* Update pg_class to reflect the correct values of pages and tuples. */
relRelation = table_open(RelationRelationId, RowExclusiveLock);
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 91f0fd6ea3..79558cecec 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -318,7 +318,7 @@ RefreshMatViewByOid(Oid matviewOid, bool is_create, bool skipData,
*/
OIDNewHeap = make_new_heap(matviewOid, tableSpace,
matviewRel->rd_rel->relam,
- relpersistence, ExclusiveLock);
+ relpersistence, ExclusiveLock, NULL);
LockRelationOid(OIDNewHeap, AccessExclusiveLock);
/* Generate the data, if wanted. */
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index dac39df83a..7fb088df72 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -5802,7 +5802,7 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
* unlogged anyway.
*/
OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, NewAccessMethod,
- persistence, lockmode);
+ persistence, lockmode, NULL);
/*
* Copy the heap data into the new table with the desired
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 7d8e9d2045..d32068b5d5 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -2193,15 +2193,17 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
{
ClusterParams cluster_params = {0};
- /* close relation before vacuuming, but hold lock until commit */
- relation_close(rel, NoLock);
- rel = NULL;
-
if ((params->options & VACOPT_VERBOSE) != 0)
cluster_params.options |= CLUOPT_VERBOSE;
/* VACUUM FULL is now a variant of CLUSTER; see cluster.c */
- cluster_rel(relid, InvalidOid, &cluster_params);
+ cluster_rel(rel, InvalidOid, &cluster_params);
+
+ /*
+ * cluster_rel() should have closed the relation, lock is kept
+ * till commit.
+ */
+ rel = NULL;
}
else
table_relation_vacuum(rel, params, bstrategy);
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 4e32380417..7492796ea2 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -32,13 +32,14 @@ typedef struct ClusterParams
} ClusterParams;
extern void cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel);
-extern void cluster_rel(Oid tableOid, Oid indexOid, ClusterParams *params);
+extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params);
extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid,
LOCKMODE lockmode);
extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
extern Oid make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
- char relpersistence, LOCKMODE lockmode);
+ char relpersistence, LOCKMODE lockmode_old,
+ LOCKMODE *lockmode_new_p);
extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
bool is_system_catalog,
bool swap_toast_by_content,
--
2.45.2
v02-0002-Move-progress-related-fields-from-PgBackendStatus-to.patchtext/x-diffDownload
From 6ab9ccd3bbe82fe202de3ec0945a73f6f56c4d3c Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Tue, 27 Aug 2024 12:13:18 +0200
Subject: [PATCH 2/8] Move progress related fields from PgBackendStatus to
PgBackendProgress.
VACUUM FULL / CLUSTER CONCURRENTLY will need to save and restore these fields
at some point.
---
src/backend/utils/activity/backend_progress.c | 18 +++++++++---------
src/backend/utils/activity/backend_status.c | 4 ++--
src/backend/utils/adt/pgstatfuncs.c | 6 +++---
src/include/utils/backend_progress.h | 14 ++++++++++++++
src/include/utils/backend_status.h | 14 ++------------
5 files changed, 30 insertions(+), 26 deletions(-)
diff --git a/src/backend/utils/activity/backend_progress.c b/src/backend/utils/activity/backend_progress.c
index c78c5eb507..cc9b4cf0dc 100644
--- a/src/backend/utils/activity/backend_progress.c
+++ b/src/backend/utils/activity/backend_progress.c
@@ -33,9 +33,9 @@ pgstat_progress_start_command(ProgressCommandType cmdtype, Oid relid)
return;
PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
- beentry->st_progress_command = cmdtype;
- beentry->st_progress_command_target = relid;
- MemSet(&beentry->st_progress_param, 0, sizeof(beentry->st_progress_param));
+ beentry->st_progress.command = cmdtype;
+ beentry->st_progress.command_target = relid;
+ MemSet(&beentry->st_progress.param, 0, sizeof(beentry->st_progress.param));
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
@@ -56,7 +56,7 @@ pgstat_progress_update_param(int index, int64 val)
return;
PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
- beentry->st_progress_param[index] = val;
+ beentry->st_progress.param[index] = val;
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
@@ -77,7 +77,7 @@ pgstat_progress_incr_param(int index, int64 incr)
return;
PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
- beentry->st_progress_param[index] += incr;
+ beentry->st_progress.param[index] += incr;
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
@@ -134,7 +134,7 @@ pgstat_progress_update_multi_param(int nparam, const int *index,
{
Assert(index[i] >= 0 && index[i] < PGSTAT_NUM_PROGRESS_PARAM);
- beentry->st_progress_param[index[i]] = val[i];
+ beentry->st_progress.param[index[i]] = val[i];
}
PGSTAT_END_WRITE_ACTIVITY(beentry);
@@ -155,11 +155,11 @@ pgstat_progress_end_command(void)
if (!beentry || !pgstat_track_activities)
return;
- if (beentry->st_progress_command == PROGRESS_COMMAND_INVALID)
+ if (beentry->st_progress.command == PROGRESS_COMMAND_INVALID)
return;
PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
- beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
- beentry->st_progress_command_target = InvalidOid;
+ beentry->st_progress.command = PROGRESS_COMMAND_INVALID;
+ beentry->st_progress.command_target = InvalidOid;
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 1ccf4c6d83..b54a35d91c 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -378,8 +378,8 @@ pgstat_bestart(void)
#endif
lbeentry.st_state = STATE_UNDEFINED;
- lbeentry.st_progress_command = PROGRESS_COMMAND_INVALID;
- lbeentry.st_progress_command_target = InvalidOid;
+ lbeentry.st_progress.command = PROGRESS_COMMAND_INVALID;
+ lbeentry.st_progress.command_target = InvalidOid;
lbeentry.st_query_id = UINT64CONST(0);
/*
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 3221137123..3587bd3150 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -269,7 +269,7 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
* Report values for only those backends which are running the given
* command.
*/
- if (beentry->st_progress_command != cmdtype)
+ if (beentry->st_progress.command != cmdtype)
continue;
/* Value available to all callers */
@@ -279,9 +279,9 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
/* show rest of the values including relid only to role members */
if (HAS_PGSTAT_PERMISSIONS(beentry->st_userid))
{
- values[2] = ObjectIdGetDatum(beentry->st_progress_command_target);
+ values[2] = ObjectIdGetDatum(beentry->st_progress.command_target);
for (i = 0; i < PGSTAT_NUM_PROGRESS_PARAM; i++)
- values[i + 3] = Int64GetDatum(beentry->st_progress_param[i]);
+ values[i + 3] = Int64GetDatum(beentry->st_progress.param[i]);
}
else
{
diff --git a/src/include/utils/backend_progress.h b/src/include/utils/backend_progress.h
index 7b63d38f97..e09598eafc 100644
--- a/src/include/utils/backend_progress.h
+++ b/src/include/utils/backend_progress.h
@@ -30,8 +30,22 @@ typedef enum ProgressCommandType
PROGRESS_COMMAND_COPY,
} ProgressCommandType;
+
#define PGSTAT_NUM_PROGRESS_PARAM 20
+/*
+ * Any command which wishes can advertise that it is running by setting
+ * command, command_target, and param[]. command_target should be the OID of
+ * the relation which the command targets (we assume there's just one, as this
+ * is meant for utility commands), but the meaning of each element in the
+ * param array is command-specific.
+ */
+typedef struct PgBackendProgress
+{
+ ProgressCommandType command;
+ Oid command_target;
+ int64 param[PGSTAT_NUM_PROGRESS_PARAM];
+} PgBackendProgress;
extern void pgstat_progress_start_command(ProgressCommandType cmdtype,
Oid relid);
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 7b7f6f59d0..11cdf7f95a 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -155,18 +155,8 @@ typedef struct PgBackendStatus
*/
char *st_activity_raw;
- /*
- * Command progress reporting. Any command which wishes can advertise
- * that it is running by setting st_progress_command,
- * st_progress_command_target, and st_progress_param[].
- * st_progress_command_target should be the OID of the relation which the
- * command targets (we assume there's just one, as this is meant for
- * utility commands), but the meaning of each element in the
- * st_progress_param array is command-specific.
- */
- ProgressCommandType st_progress_command;
- Oid st_progress_command_target;
- int64 st_progress_param[PGSTAT_NUM_PROGRESS_PARAM];
+ /* Command progress reporting. */
+ PgBackendProgress st_progress;
/* query identifier, optionally computed using post_parse_analyze_hook */
uint64 st_query_id;
--
2.45.2
Attachment: v02-0003-Move-conversion-of-a-historic-to-MVCC-snapshot-to-a-.patch (text/x-diff)
From a1f5c8a642099863c9d2c3080bba30d1b95474aa Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Tue, 27 Aug 2024 12:13:18 +0200
Subject: [PATCH 3/8] Move conversion of a "historic" to MVCC snapshot to a
separate function.
The conversion is now handled by SnapBuildMVCCFromHistoric(). VACUUM FULL /
CLUSTER CONCURRENTLY will also need it.
---
src/backend/replication/logical/snapbuild.c | 51 +++++++++++++++++----
src/backend/utils/time/snapmgr.c | 3 +-
src/include/replication/snapbuild.h | 1 +
src/include/utils/snapmgr.h | 1 +
4 files changed, 45 insertions(+), 11 deletions(-)
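The new function's `in_place` flag follows a common pattern: either modify the caller's object directly, or return a fresh single-chunk copy and leave the source intact. A toy sketch of that convention (illustrative names only, not the patch's code; the single-chunk allocation mirrors how CopySnapshot() keeps the struct and its arrays together):

```c
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for a snapshot: a flag plus an xid array. */
typedef struct ToySnap
{
	int			is_mvcc;		/* 0 = "historic", 1 = "MVCC" */
	int		   *xip;
	int			xcnt;
} ToySnap;

/*
 * Convert a "historic" ToySnap to an "MVCC" one. With in_place=true the
 * source itself is modified and returned; with in_place=false a new
 * instance is returned, allocated as a single chunk of memory, and the
 * source is left untouched.
 */
static ToySnap *
toy_to_mvcc(ToySnap *snap, int in_place)
{
	ToySnap    *result;

	if (in_place)
		result = snap;
	else
	{
		/* one chunk: the struct followed by its xip array */
		result = malloc(sizeof(ToySnap) + snap->xcnt * sizeof(int));
		result->xip = (int *) (result + 1);
		memcpy(result->xip, snap->xip, snap->xcnt * sizeof(int));
		result->xcnt = snap->xcnt;
	}
	result->is_mvcc = 1;
	return result;
}
```

SnapBuildInitialSnapshot() can keep passing true (it owns the snapshot), while a caller that must preserve the source passes false.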
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index ae676145e6..f96bafe5ec 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -579,10 +579,7 @@ Snapshot
SnapBuildInitialSnapshot(SnapBuild *builder)
{
Snapshot snap;
- TransactionId xid;
TransactionId safeXid;
- TransactionId *newxip;
- int newxcnt = 0;
Assert(XactIsoLevel == XACT_REPEATABLE_READ);
Assert(builder->building_full_snapshot);
@@ -624,6 +621,31 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
MyProc->xmin = snap->xmin;
+ /* Convert the historic snapshot to MVCC snapshot. */
+ return SnapBuildMVCCFromHistoric(snap, true);
+}
+
+/*
+ * Turn a historic MVCC snapshot into an ordinary MVCC snapshot.
+ *
+ * Unlike a regular (non-historic) MVCC snapshot, the xip array of this
+ * snapshot contains not only running main transactions, but also their
+ * subtransactions. This difference has no impact on XidInMVCCSnapshot().
+ *
+ * Pass true for 'in_place' if you don't care about modifying the source
+ * snapshot. If you need a new instance, and one that was allocated as a
+ * single chunk of memory, pass false.
+ */
+Snapshot
+SnapBuildMVCCFromHistoric(Snapshot snapshot, bool in_place)
+{
+ TransactionId xid;
+ TransactionId *oldxip = snapshot->xip;
+ uint32 oldxcnt = snapshot->xcnt;
+ TransactionId *newxip;
+ int newxcnt = 0;
+ Snapshot result;
+
/* allocate in transaction context */
newxip = (TransactionId *)
palloc(sizeof(TransactionId) * GetMaxSnapshotXidCount());
@@ -634,7 +656,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
* classical snapshot by marking all non-committed transactions as
* in-progress. This can be expensive.
*/
- for (xid = snap->xmin; NormalTransactionIdPrecedes(xid, snap->xmax);)
+ for (xid = snapshot->xmin; NormalTransactionIdPrecedes(xid, snapshot->xmax);)
{
void *test;
@@ -642,7 +664,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
* Check whether transaction committed using the decoding snapshot
* meaning of ->xip.
*/
- test = bsearch(&xid, snap->xip, snap->xcnt,
+ test = bsearch(&xid, snapshot->xip, snapshot->xcnt,
sizeof(TransactionId), xidComparator);
if (test == NULL)
@@ -659,11 +681,22 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
}
/* adjust remaining snapshot fields as needed */
- snap->snapshot_type = SNAPSHOT_MVCC;
- snap->xcnt = newxcnt;
- snap->xip = newxip;
+ snapshot->xcnt = newxcnt;
+ snapshot->xip = newxip;
+
+ if (in_place)
+ result = snapshot;
+ else
+ {
+ result = CopySnapshot(snapshot);
+
+ /* Restore the original values so the source is intact. */
+ snapshot->xip = oldxip;
+ snapshot->xcnt = oldxcnt;
+ }
+ result->snapshot_type = SNAPSHOT_MVCC;
- return snap;
+ return result;
}
/*
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 7d2b34d4f2..b8b500f48f 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -155,7 +155,6 @@ typedef struct ExportedSnapshot
static List *exportedSnapshots = NIL;
/* Prototypes for local functions */
-static Snapshot CopySnapshot(Snapshot snapshot);
static void UnregisterSnapshotNoOwner(Snapshot snapshot);
static void FreeSnapshot(Snapshot snapshot);
static void SnapshotResetXmin(void);
@@ -570,7 +569,7 @@ SetTransactionSnapshot(Snapshot sourcesnap, VirtualTransactionId *sourcevxid,
* The copy is palloc'd in TopTransactionContext and has initial refcounts set
* to 0. The returned snapshot has the copied flag set.
*/
-static Snapshot
+Snapshot
CopySnapshot(Snapshot snapshot)
{
Snapshot newsnap;
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index caa5113ff8..ad06e80784 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -69,6 +69,7 @@ extern void FreeSnapshotBuilder(SnapBuild *builder);
extern void SnapBuildSnapDecRefcount(Snapshot snap);
extern Snapshot SnapBuildInitialSnapshot(SnapBuild *builder);
+extern Snapshot SnapBuildMVCCFromHistoric(Snapshot snapshot, bool in_place);
extern const char *SnapBuildExportSnapshot(SnapBuild *builder);
extern void SnapBuildClearExportedSnapshot(void);
extern void SnapBuildResetExportedSnapshotState(void);
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index 9398a84051..e7ac89f484 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -68,6 +68,7 @@ extern Snapshot GetLatestSnapshot(void);
extern void SnapshotSetCommandId(CommandId curcid);
extern Snapshot GetOldestSnapshot(void);
+extern Snapshot CopySnapshot(Snapshot snapshot);
extern Snapshot GetCatalogSnapshot(Oid relid);
extern Snapshot GetNonHistoricCatalogSnapshot(Oid relid);
extern void InvalidateCatalogSnapshot(void);
--
2.45.2
Attachment: v02-0004-Add-CONCURRENTLY-option-to-both-VACUUM-FULL-and-CLUS.patch (text/plain)
From a262768e522a40e554f4886e787c9ec7b7e15fdb Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Tue, 27 Aug 2024 12:13:18 +0200
Subject: [PATCH 4/8] Add CONCURRENTLY option to both VACUUM FULL and CLUSTER
commands.
Both VACUUM FULL and CLUSTER commands copy the relation data into a new file,
create new indexes and eventually swap the files. To make sure that the old
file does not change during the copying, the relation is locked in an
exclusive mode, which prevents applications from both reading and writing. (To
keep the data consistent, we'd only need to prevent the applications from
writing, but even reading needs to be blocked before we can swap the files -
otherwise some applications could continue using the old file. Since we cannot
get a stronger lock without first releasing the weaker one, we acquire the
exclusive lock in the beginning and keep it till the end of the processing.)
This patch introduces an alternative workflow, which only requires the
exclusive lock when the relation (and index) files are being swapped.
(Supposedly, the swapping should be pretty fast.) On the other hand, when we
copy the data to the new file, we allow applications to read from the relation
and even write into it.
First, we scan the relation using a "historic snapshot", and insert all the
tuples satisfying this snapshot into the new file. Note that, before creating
that snapshot, we need to make sure that all the other backends treat the
relation as a system catalog: in particular, they must log information on new
command IDs (CIDs). We achieve that by adding the relation ID into a shared
hash table and waiting until all the transactions currently writing into the
table (i.e. transactions possibly not aware of the new entry) have finished.
Second, logical decoding is used to capture the data changes done by
applications during the copying (i.e. changes that do not satisfy the historic
snapshot mentioned above), and those are applied to the new file before we
acquire the exclusive lock we need to swap the files. (Of course, more data
changes can take place while we are waiting for the lock - these will be
applied to the new file after we have acquired the lock, before we swap the
files.)
While copying the data into the new file, we hold a lock that prevents
applications from changing the relation tuple descriptor (tuples inserted into
the old file must fit into the new file). However, as we have to release that
lock before getting the exclusive one, it's possible that someone adds or
drops a column, or changes the data type of an existing one. Therefore we have
to check the tuple descriptor before we swap the files. If we find out that
the tuple descriptor changed, ERROR is raised and all the changes are rolled
back. Since a lot of effort can be wasted in such a case, the ALTER TABLE
command also tries to check if VACUUM FULL / CLUSTER with the CONCURRENTLY
option is running on the same relation, and raises an ERROR if it is.
Like the existing implementation of both VACUUM FULL and CLUSTER commands, the
variant with the CONCURRENTLY option also requires an extra space for the new
relation and index files (which coexist with the old files for some time). In
addition, the CONCURRENTLY option might introduce a lag in releasing WAL
segments for archiving / recycling. This is due to the decoding of the data
changes done by applications concurrently. However, this lag should not be more
than a single WAL segment.
---
doc/src/sgml/monitoring.sgml | 36 +-
doc/src/sgml/ref/cluster.sgml | 114 +-
doc/src/sgml/ref/vacuum.sgml | 27 +-
src/Makefile | 1 +
src/backend/access/heap/heapam.c | 8 +-
src/backend/access/heap/heapam_handler.c | 141 +-
src/backend/access/heap/heapam_visibility.c | 30 +-
src/backend/catalog/index.c | 43 +-
src/backend/catalog/system_views.sql | 17 +-
src/backend/commands/cluster.c | 2573 ++++++++++++++++-
src/backend/commands/matview.c | 2 +-
src/backend/commands/tablecmds.c | 11 +
src/backend/commands/vacuum.c | 137 +-
src/backend/replication/logical/decode.c | 24 +
src/backend/replication/logical/snapbuild.c | 20 +
.../replication/pgoutput_cluster/Makefile | 32 +
.../replication/pgoutput_cluster/meson.build | 18 +
.../pgoutput_cluster/pgoutput_cluster.c | 277 ++
src/backend/storage/ipc/ipci.c | 3 +
src/backend/tcop/utility.c | 11 +
src/backend/utils/activity/backend_progress.c | 16 +
.../utils/activity/wait_event_names.txt | 1 +
src/backend/utils/cache/inval.c | 22 +
src/backend/utils/cache/relcache.c | 5 +
src/backend/utils/time/snapmgr.c | 3 +-
src/bin/psql/tab-complete.c | 5 +-
src/include/access/heapam.h | 4 +
src/include/access/tableam.h | 10 +
src/include/catalog/index.h | 3 +
src/include/commands/cluster.h | 104 +-
src/include/commands/progress.h | 17 +-
src/include/commands/vacuum.h | 17 +-
src/include/replication/snapbuild.h | 1 +
src/include/storage/lockdefs.h | 2 +-
src/include/storage/lwlocklist.h | 1 +
src/include/utils/backend_progress.h | 3 +-
src/include/utils/inval.h | 2 +
src/include/utils/rel.h | 7 +-
src/include/utils/snapmgr.h | 2 +
src/test/regress/expected/rules.out | 17 +-
40 files changed, 3568 insertions(+), 199 deletions(-)
create mode 100644 src/backend/replication/pgoutput_cluster/Makefile
create mode 100644 src/backend/replication/pgoutput_cluster/meson.build
create mode 100644 src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
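The two-phase flow the commit message describes can be sketched with a toy model (plain arrays instead of heap tuples and WAL; all names illustrative, none from the patch): phase 1 copies only rows visible to the snapshot taken at the start, and rows written concurrently are captured in a change log and replayed in a catch-up phase before the file swap.

```c
#define TOY_MAX 32

/* Toy table: each row records the "transaction" (xmin) that wrote it. */
typedef struct ToyTable
{
	int			n;
	int			xmin[TOY_MAX];
	int			val[TOY_MAX];
} ToyTable;

/*
 * Phase 1 (initial copy): copy only rows whose writer is visible to the
 * snapshot, modeled here as xmin < snapshot_xmax. Concurrent writers are
 * skipped; their changes arrive later via the change log.
 */
static void
toy_initial_copy(const ToyTable *old, ToyTable *dst, int snapshot_xmax)
{
	int			i;

	dst->n = 0;
	for (i = 0; i < old->n; i++)
	{
		if (old->xmin[i] < snapshot_xmax)
		{
			dst->xmin[dst->n] = old->xmin[i];
			dst->val[dst->n++] = old->val[i];
		}
	}
}

/* Phase 2 (catch-up): replay one logged concurrent insert into the copy. */
static void
toy_apply_insert(ToyTable *dst, int xid, int value)
{
	dst->xmin[dst->n] = xid;
	dst->val[dst->n++] = value;
}
```

In the real patch the "change log" is the WAL decoded by the cluster output plugin, and catch-up runs once more under the ACCESS EXCLUSIVE lock before the files are swapped.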
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 55417a6fa9..526cc581cd 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5567,14 +5567,35 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>heap_tuples_written</structfield> <type>bigint</type>
+ <structfield>heap_tuples_inserted</structfield> <type>bigint</type>
</para>
<para>
- Number of heap tuples written.
+ Number of heap tuples inserted.
This counter only advances when the phase is
<literal>seq scanning heap</literal>,
- <literal>index scanning heap</literal>
- or <literal>writing new heap</literal>.
+ <literal>index scanning heap</literal>,
+ <literal>writing new heap</literal>
+ or <literal>catch-up</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_updated</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples updated.
+ This counter only advances when the phase is <literal>catch-up</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_deleted</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples deleted.
+ This counter only advances when the phase is <literal>catch-up</literal>.
</para></entry>
</row>
@@ -5655,6 +5676,13 @@ FROM pg_stat_get_backend_idset() AS backendid;
<command>CLUSTER</command> is currently writing the new heap.
</entry>
</row>
+ <row>
+ <entry><literal>catch-up</literal></entry>
+ <entry>
+ <command>CLUSTER</command> is currently processing the DML commands
+ that other transactions executed during any of the preceding phases.
+ </entry>
+ </row>
<row>
<entry><literal>swapping relation files</literal></entry>
<entry>
diff --git a/doc/src/sgml/ref/cluster.sgml b/doc/src/sgml/ref/cluster.sgml
index c5760244e6..0fe4e9603b 100644
--- a/doc/src/sgml/ref/cluster.sgml
+++ b/doc/src/sgml/ref/cluster.sgml
@@ -26,6 +26,7 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
<phrase>where <replaceable class="parameter">option</replaceable> can be one of:</phrase>
VERBOSE [ <replaceable class="parameter">boolean</replaceable> ]
+ CONCURRENTLY [ <replaceable class="parameter">boolean</replaceable> ]
</synopsis>
</refsynopsisdiv>
@@ -69,14 +70,18 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
<replaceable class="parameter">table_name</replaceable> reclusters all the
previously-clustered tables in the current database that the calling user
has privileges for. This form of <command>CLUSTER</command> cannot be
- executed inside a transaction block.
+ executed inside a transaction block. Also, if
+ the <literal>CONCURRENTLY</literal> option is used with this form, system
+ catalogs and <acronym>TOAST</acronym> tables are not processed.
</para>
<para>
- When a table is being clustered, an <literal>ACCESS
- EXCLUSIVE</literal> lock is acquired on it. This prevents any other
- database operations (both reads and writes) from operating on the
- table until the <command>CLUSTER</command> is finished.
+ When a table is being clustered, an <literal>ACCESS EXCLUSIVE</literal>
+ lock is acquired on it. This prevents any other database operations (both
+ reads and writes) from operating on the table until
+ the <command>CLUSTER</command> is finished. If you want to keep the table
+ accessible during the clustering, consider using
+ the <literal>CONCURRENTLY</literal> option.
</para>
</refsect1>
@@ -111,6 +116,105 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><literal>CONCURRENTLY</literal></term>
+ <listitem>
+ <para>
+ Allow other transactions to use the table while it is being clustered.
+ </para>
+
+ <para>
+ Internally, <command>CLUSTER</command> copies the contents of the table
+ (ignoring dead tuples) into a new file, sorted by the specified index,
+ and also creates a new file for each index. Then it swaps the old and
+ new files for the table and all the indexes, and deletes the old
+ files. The <literal>ACCESS EXCLUSIVE</literal> lock is needed to make
+ sure that the old files do not change during the processing because the
+ changes would get lost due to the swap.
+ </para>
+
+ <para>
+ With the <literal>CONCURRENTLY</literal> option, the <literal>ACCESS
+ EXCLUSIVE</literal> lock is only acquired to swap the table and index
+ files. The data changes that took place during the creation of the new
+ table and index files are captured using logical decoding
+ (<xref linkend="logicaldecoding"/>) and applied before
+ the <literal>ACCESS EXCLUSIVE</literal> lock is requested. Thus the lock
+ is typically held only for the time needed to swap the files, which
+ should be pretty short. However, the time might still be noticeable
+ if too many data changes have been done to the table
+ while <command>CLUSTER</command> was waiting for the lock: those changes
+ must be processed before the files are swapped.
+ </para>
+
+ <para>
+ Note that <command>CLUSTER</command> with
+ the <literal>CONCURRENTLY</literal> option does not try to order the
+ rows inserted into the table after the clustering started. Also
+ note <command>CLUSTER</command> might fail to complete due to DDL
+ commands executed on the table by other transactions during the
+ clustering.
+ </para>
+
+ <note>
+ <para>
+ In addition to the temporary space requirements explained below,
+ the <literal>CONCURRENTLY</literal> option can increase the use of
+ temporary space somewhat. The reason is that other transactions can
+ perform DML operations which cannot be applied to the new file until
+ <command>CLUSTER</command> has copied all the tuples from the old
+ file. Thus the tuples inserted into the old file during the copying are
+ also stored separately in a temporary file, so they can eventually
+ be applied to the new file.
+ </para>
+
+ <para>
+ Furthermore, the data changes performed during the copying are
+ extracted from <link linkend="wal">write-ahead log</link> (WAL), and
+ this extraction (decoding) only takes place when certain amount of WAL
+ has been written. Therefore, WAL removal can be delayed by this
+ threshold. Currently the threshold is equal to the value of
+ the <link linkend="guc-wal-segment-size"><varname>wal_segment_size</varname></link>
+ configuration parameter.
+ </para>
+ </note>
+
+ <para>
+ The <literal>CONCURRENTLY</literal> option cannot be used in the
+ following cases:
+
+ <itemizedlist>
+ <listitem>
+ <para>
+ The table is a system catalog or a <acronym>TOAST</acronym> table.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <command>CLUSTER</command> is executed inside a transaction block.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The <link linkend="guc-wal-level"><varname>wal_level</varname></link>
+ configuration parameter is less than <literal>logical</literal>.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The <link linkend="guc-max-replication-slots"><varname>max_replication_slots</varname></link>
+ configuration parameter does not allow for creation of an additional
+ replication slot.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><replaceable class="parameter">boolean</replaceable></term>
<listitem>
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index 9857b35627..298cf7298d 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -39,6 +39,7 @@ VACUUM [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
SKIP_DATABASE_STATS [ <replaceable class="parameter">boolean</replaceable> ]
ONLY_DATABASE_STATS [ <replaceable class="parameter">boolean</replaceable> ]
BUFFER_USAGE_LIMIT <replaceable class="parameter">size</replaceable>
+ CONCURRENTLY [ <replaceable class="parameter">boolean</replaceable> ]
<phrase>and <replaceable class="parameter">table_and_columns</replaceable> is:</phrase>
@@ -61,8 +62,12 @@ VACUUM [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
<para>
Without a <replaceable class="parameter">table_and_columns</replaceable>
list, <command>VACUUM</command> processes every table and materialized view
- in the current database that the current user has permission to vacuum.
- With a list, <command>VACUUM</command> processes only those table(s).
+ in the current database that the current user has permission to vacuum. If
+ the <literal>CONCURRENTLY</literal> option is specified (see below), tables which
+ have not been clustered yet are silently skipped. With a
+ list, <command>VACUUM</command> processes only those table(s). If
+ the <literal>CONCURRENTLY</literal> option is specified, the list may only contain
+ tables which have already been clustered.
</para>
<para>
@@ -360,6 +365,24 @@ VACUUM [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><literal>CONCURRENTLY</literal></term>
+ <listitem>
+ <para>
+ Allow other transactions to use the table while it is being vacuumed. If
+ this option is specified, <command>VACUUM</command> can only process
+ tables which have already been clustered. For more information, see the
+ description of the <literal>CONCURRENTLY</literal> option of the
+ <xref linkend="sql-cluster"/> command.
+ </para>
+
+ <para>
+ The <literal>CONCURRENTLY</literal> option can only be used
+ if <literal>FULL</literal> is used at the same time.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><replaceable class="parameter">boolean</replaceable></term>
<listitem>
diff --git a/src/Makefile b/src/Makefile
index 2f31a2f20a..8b9d30ff72 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -23,6 +23,7 @@ SUBDIRS = \
interfaces \
backend/replication/libpqwalreceiver \
backend/replication/pgoutput \
+ backend/replication/pgoutput_cluster \
fe_utils \
bin \
pl \
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 91b20147a0..1fdcc0abee 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2079,8 +2079,14 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
/*
* If this is a catalog, we need to transmit combo CIDs to properly
* decode, so log that as well.
+ *
+ * For the main heap (as opposed to TOAST), we only receive
+ * HEAP_INSERT_NO_LOGICAL when doing VACUUM FULL / CLUSTER, in which
+ * case the visibility information does not change. Therefore, there's
+ * no need to update the decoding snapshot.
*/
- if (RelationIsAccessibleInLogicalDecoding(relation))
+ if ((options & HEAP_INSERT_NO_LOGICAL) == 0 &&
+ RelationIsAccessibleInLogicalDecoding(relation))
log_heap_new_cid(relation, heaptup);
/*
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 1c6da286d4..4ddb1c4a0c 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -33,6 +33,7 @@
#include "catalog/index.h"
#include "catalog/storage.h"
#include "catalog/storage_xlog.h"
+#include "commands/cluster.h"
#include "commands/progress.h"
#include "executor/executor.h"
#include "miscadmin.h"
@@ -53,6 +54,9 @@ static void reform_and_rewrite_tuple(HeapTuple tuple,
static bool SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
HeapTuple tuple,
OffsetNumber tupoffset);
+static bool accept_tuple_for_concurrent_copy(HeapTuple tuple,
+ Snapshot snapshot,
+ Buffer buffer);
static BlockNumber heapam_scan_get_blocks_done(HeapScanDesc hscan);
@@ -682,6 +686,8 @@ static void
heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
Relation OldIndex, bool use_sort,
TransactionId OldestXmin,
+ Snapshot snapshot,
+ LogicalDecodingContext *decoding_ctx,
TransactionId *xid_cutoff,
MultiXactId *multi_cutoff,
double *num_tuples,
@@ -702,6 +708,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
bool *isnull;
BufferHeapTupleTableSlot *hslot;
BlockNumber prev_cblock = InvalidBlockNumber;
+ bool concurrent = snapshot != NULL;
+ XLogRecPtr end_of_wal_prev = GetFlushRecPtr(NULL);
/* Remember if it's a system catalog */
is_system_catalog = IsSystemRelation(OldHeap);
@@ -782,6 +790,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
HeapTuple tuple;
Buffer buf;
bool isdead;
+ HTSV_Result vis;
CHECK_FOR_INTERRUPTS();
@@ -836,7 +845,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
LockBuffer(buf, BUFFER_LOCK_SHARE);
- switch (HeapTupleSatisfiesVacuum(tuple, OldestXmin, buf))
+ switch ((vis = HeapTupleSatisfiesVacuum(tuple, OldestXmin, buf)))
{
case HEAPTUPLE_DEAD:
/* Definitely dead */
@@ -852,14 +861,15 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
case HEAPTUPLE_INSERT_IN_PROGRESS:
/*
- * Since we hold exclusive lock on the relation, normally the
- * only way to see this is if it was inserted earlier in our
- * own transaction. However, it can happen in system
+ * As long as we hold exclusive lock on the relation, normally
+ * the only way to see this is if it was inserted earlier in
+ * our own transaction. However, it can happen in system
* catalogs, since we tend to release write lock before commit
- * there. Give a warning if neither case applies; but in any
- * case we had better copy it.
+ * there. Also, there's no exclusive lock during concurrent
+ * processing. Give a warning if neither case applies; but in
+ * any case we had better copy it.
*/
- if (!is_system_catalog &&
+ if (!is_system_catalog && !concurrent &&
!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(tuple->t_data)))
elog(WARNING, "concurrent insert in progress within table \"%s\"",
RelationGetRelationName(OldHeap));
@@ -871,7 +881,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
/*
* Similar situation to INSERT_IN_PROGRESS case.
*/
- if (!is_system_catalog &&
+ if (!is_system_catalog && !concurrent &&
!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetUpdateXid(tuple->t_data)))
elog(WARNING, "concurrent delete in progress within table \"%s\"",
RelationGetRelationName(OldHeap));
@@ -885,8 +895,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
break;
}
- LockBuffer(buf, BUFFER_LOCK_UNLOCK);
-
if (isdead)
{
*tups_vacuumed += 1;
@@ -897,9 +905,39 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
*tups_vacuumed += 1;
*tups_recently_dead -= 1;
}
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
continue;
}
+ /*
+ * Ignore concurrent changes now, they'll be processed later via
+ * logical decoding. INSERT_IN_PROGRESS is rejected right away because
+ * our snapshot should represent a point in time which should precede
+ * (or be equal to) the state of transactions as it was when the
+ * "SatisfiesVacuum" test was performed. Thus
+ * accept_tuple_for_concurrent_copy() should not consider the tuple
+ * inserted.
+ */
+ if (concurrent &&
+ (vis == HEAPTUPLE_INSERT_IN_PROGRESS ||
+ !accept_tuple_for_concurrent_copy(tuple, snapshot, buf)))
+ {
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ continue;
+ }
+
+ /*
+ * In the concurrent case, we should not unlock the buffer until the
+ * tuple has been copied to the new file: if a concurrent transaction
+ * marked it updated or deleted in between, we'd fail to replay that
+ * transaction's changes because then we'd try to perform the same
+ * UPDATE / DELETE twice. XXX Should we instead create a copy of the
+ * tuple so that the buffer can be unlocked right away?
+ */
+ if (!concurrent)
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
*num_tuples += 1;
if (tuplesort != NULL)
{
@@ -916,7 +954,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
{
const int ct_index[] = {
PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED,
- PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN
+ PROGRESS_CLUSTER_HEAP_TUPLES_INSERTED
};
int64 ct_val[2];
@@ -931,6 +969,35 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
ct_val[1] = *num_tuples;
pgstat_progress_update_multi_param(2, ct_index, ct_val);
}
+
+ /* See the comment on unlocking above. */
+ if (concurrent)
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ /*
+ * Process the WAL produced by the load, as well as by other
+ * transactions, so that the replication slot can advance and WAL does
+ * not pile up. Use wal_segment_size as a threshold so that we do not
+ * introduce the decoding overhead too often.
+ *
+ * Of course, we must not apply the changes until the initial load has
+ * completed.
+ *
+ * Note that our insertions into the new table should not be decoded
+ * as we (intentionally) do not write the logical decoding specific
+ * information to WAL.
+ */
+ if (concurrent)
+ {
+ XLogRecPtr end_of_wal;
+
+ end_of_wal = GetFlushRecPtr(NULL);
+ if ((end_of_wal - end_of_wal_prev) > wal_segment_size)
+ {
+ cluster_decode_concurrent_changes(decoding_ctx, end_of_wal);
+ end_of_wal_prev = end_of_wal;
+ }
+ }
}
if (indexScan != NULL)
@@ -974,7 +1041,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
values, isnull,
rwstate);
/* Report n_tuples */
- pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN,
+ pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_TUPLES_INSERTED,
n_tuples);
}
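As a side note, the segment-sized batching in the scan loop above (decode only once more than wal_segment_size of WAL has accumulated) can be illustrated by a tiny standalone sketch. This is plain C with made-up names (decode_rounds, a hard-coded 16 MB segment size); it only models the threshold test from the patch, not the actual decoding:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Stand-in for PostgreSQL's wal_segment_size (16 MB by default). */
static const XLogRecPtr wal_segment_size = 16 * 1024 * 1024;

/*
 * Count how many times decoding would run for a sequence of flush
 * pointers, using the same "more than one segment since the last round"
 * test as the patch. Each round stands for one call of
 * cluster_decode_concurrent_changes().
 */
static int
decode_rounds(const XLogRecPtr *flush_ptrs, int n)
{
	XLogRecPtr	end_of_wal_prev = 0;
	int			rounds = 0;

	for (int i = 0; i < n; i++)
	{
		XLogRecPtr	end_of_wal = flush_ptrs[i];

		if ((end_of_wal - end_of_wal_prev) > wal_segment_size)
		{
			rounds++;
			end_of_wal_prev = end_of_wal;
		}
	}
	return rounds;
}
```

The point of remembering end_of_wal_prev is that repeated small advances of the flush pointer do not trigger decoding; only a full segment's worth of new WAL does.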
@@ -2579,6 +2646,56 @@ SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
}
}
+/*
+ * Check if the tuple was inserted, updated or deleted while
+ * heapam_relation_copy_for_cluster() was copying the data.
+ *
+ * 'snapshot' is used to determine whether xmin/xmax was set by a transaction
+ * that is still in-progress, or one that started in the future from the
+ * snapshot perspective.
+ *
+ * Returns true if the insertion is visible to 'snapshot', but clears xmax if
+ * it was set by a transaction which is in-progress or in the future from the
+ * snapshot perspective. (The xmax will be set later, when we decode the
+ * corresponding UPDATE / DELETE from WAL.)
+ *
+ * Returns false if the insertion is not visible to 'snapshot'.
+ */
+static bool
+accept_tuple_for_concurrent_copy(HeapTuple tuple, Snapshot snapshot,
+ Buffer buffer)
+{
+ Assert(snapshot->snapshot_type == SNAPSHOT_MVCC);
+
+ /*
+ * First, check if the tuple should be rejected because it was inserted
+ * concurrently.
+ */
+ if (!HeapTupleMVCCInserted(tuple, snapshot, buffer))
+ return false;
+
+ /*
+ * If the tuple was deleted / updated but our snapshot still sees it, we
+ * need to keep it. In that case, clear the information that indicates the
+ * deletion / update. Otherwise the tuple chain would stay incomplete (as
+ * we will reject the new tuple above), and the delete / update would fail
+ * if executed later during logical decoding.
+ */
+ if (TransactionIdIsNormal(HeapTupleHeaderGetRawXmax(tuple->t_data)) &&
+ HeapTupleMVCCNotDeleted(tuple, snapshot, buffer))
+ {
+ /* TODO More work needed here? */
+ tuple->t_data->t_infomask |= HEAP_XMAX_INVALID;
+ HeapTupleHeaderSetXmax(tuple->t_data, 0);
+ }
+
+ /*
+ * Accept the tuple even if our snapshot considers it deleted - older
+ * snapshots can still see the tuple.
+ */
+ return true;
+}
+
/* ------------------------------------------------------------------------
* Definition of the heap table access method.
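To make the acceptance logic easier to review, here is a standalone model of the decision accept_tuple_for_concurrent_copy() makes. The types and names (TupleModel, accept_for_copy) are hypothetical; only the boolean structure of the patch's checks is reproduced:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of a tuple's state relative to the snapshot used for the
 * initial copy.
 */
typedef struct TupleModel
{
	bool		inserted_visible;	/* HeapTupleMVCCInserted() result */
	bool		has_xmax;		/* xmax is a normal transaction id */
	bool		delete_visible;		/* deletion visible to the snapshot */
	bool		xmax_cleared;		/* out: did we clear xmax? */
} TupleModel;

/*
 * Copy the tuple iff its insertion is visible. If a concurrent delete /
 * update is NOT yet visible, clear xmax so that the change can be replayed
 * later from WAL; if the deletion IS visible, keep the tuple anyway because
 * older snapshots may still need it.
 */
static bool
accept_for_copy(TupleModel *t)
{
	if (!t->inserted_visible)
		return false;			/* concurrent insert: decoded later */

	if (t->has_xmax && !t->delete_visible)
		t->xmax_cleared = true;	/* keep the chain replayable */

	return true;
}
```

In other words, the only tuples rejected outright are concurrently inserted ones; everything else is copied, with xmax cleared whenever the corresponding UPDATE/DELETE still has to arrive via logical decoding.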
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index 9243feed01..d702592469 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -955,16 +955,31 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
* did TransactionIdIsInProgress in each call --- to no avail, as long as the
* inserting/deleting transaction was still running --- which was more cycles
* and more contention on ProcArrayLock.
+ *
+ * The checks are split into two functions, HeapTupleMVCCInserted() and
+ * HeapTupleMVCCNotDeleted(), because they are also useful separately.
*/
static bool
HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
Buffer buffer)
{
- HeapTupleHeader tuple = htup->t_data;
-
Assert(ItemPointerIsValid(&htup->t_self));
Assert(htup->t_tableOid != InvalidOid);
+ return HeapTupleMVCCInserted(htup, snapshot, buffer) &&
+ HeapTupleMVCCNotDeleted(htup, snapshot, buffer);
+}
+
+/*
+ * HeapTupleMVCCInserted
+ * True iff the heap tuple's insertion is visible to the given MVCC
+ * snapshot.
+ */
+bool
+HeapTupleMVCCInserted(HeapTuple htup, Snapshot snapshot, Buffer buffer)
+{
+ HeapTupleHeader tuple = htup->t_data;
+
if (!HeapTupleHeaderXminCommitted(tuple))
{
if (HeapTupleHeaderXminInvalid(tuple))
@@ -1073,6 +1088,17 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
}
/* by here, the inserting transaction has committed */
+ return true;
+}
+
+/*
+ * HeapTupleMVCCNotDeleted
+ * True iff the heap tuple is not deleted according to the given MVCC
+ * snapshot.
+ */
+bool
+HeapTupleMVCCNotDeleted(HeapTuple htup, Snapshot snapshot, Buffer buffer)
+{
+ HeapTupleHeader tuple = htup->t_data;
if (tuple->t_infomask & HEAP_XMAX_INVALID) /* xid invalid or aborted */
return true;
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 33759056e3..aab2712794 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1415,22 +1415,7 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
opclassOptions[i] = get_attoptions(oldIndexId, i + 1);
- /* Extract statistic targets for each attribute */
- stattargets = palloc0_array(NullableDatum, newInfo->ii_NumIndexAttrs);
- for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
- {
- HeapTuple tp;
- Datum dat;
-
- tp = SearchSysCache2(ATTNUM, ObjectIdGetDatum(oldIndexId), Int16GetDatum(i + 1));
- if (!HeapTupleIsValid(tp))
- elog(ERROR, "cache lookup failed for attribute %d of relation %u",
- i + 1, oldIndexId);
- dat = SysCacheGetAttr(ATTNUM, tp, Anum_pg_attribute_attstattarget, &isnull);
- ReleaseSysCache(tp);
- stattargets[i].value = dat;
- stattargets[i].isnull = isnull;
- }
+ stattargets = get_index_stattargets(oldIndexId, newInfo);
/*
* Now create the new index.
@@ -1469,6 +1454,32 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
return newIndexId;
}
+NullableDatum *
+get_index_stattargets(Oid indexid, IndexInfo *indInfo)
+{
+ NullableDatum *stattargets;
+
+ /* Extract statistic targets for each attribute */
+ stattargets = palloc0_array(NullableDatum, indInfo->ii_NumIndexAttrs);
+ for (int i = 0; i < indInfo->ii_NumIndexAttrs; i++)
+ {
+ HeapTuple tp;
+ Datum dat;
+ bool isnull;
+
+ tp = SearchSysCache2(ATTNUM, ObjectIdGetDatum(indexid), Int16GetDatum(i + 1));
+ if (!HeapTupleIsValid(tp))
+ elog(ERROR, "cache lookup failed for attribute %d of relation %u",
+ i + 1, indexid);
+ dat = SysCacheGetAttr(ATTNUM, tp, Anum_pg_attribute_attstattarget, &isnull);
+ ReleaseSysCache(tp);
+ stattargets[i].value = dat;
+ stattargets[i].isnull = isnull;
+ }
+
+ return stattargets;
+}
+
/*
* index_concurrently_build
*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 19cabc9a47..fddab1cfa9 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1236,16 +1236,19 @@ CREATE VIEW pg_stat_progress_cluster AS
WHEN 2 THEN 'index scanning heap'
WHEN 3 THEN 'sorting tuples'
WHEN 4 THEN 'writing new heap'
- WHEN 5 THEN 'swapping relation files'
- WHEN 6 THEN 'rebuilding index'
- WHEN 7 THEN 'performing final cleanup'
+ WHEN 5 THEN 'catch-up'
+ WHEN 6 THEN 'swapping relation files'
+ WHEN 7 THEN 'rebuilding index'
+ WHEN 8 THEN 'performing final cleanup'
END AS phase,
CAST(S.param3 AS oid) AS cluster_index_relid,
S.param4 AS heap_tuples_scanned,
- S.param5 AS heap_tuples_written,
- S.param6 AS heap_blks_total,
- S.param7 AS heap_blks_scanned,
- S.param8 AS index_rebuild_count
+ S.param5 AS heap_tuples_inserted,
+ S.param6 AS heap_tuples_updated,
+ S.param7 AS heap_tuples_deleted,
+ S.param8 AS heap_blks_total,
+ S.param9 AS heap_blks_scanned,
+ S.param10 AS index_rebuild_count
FROM pg_stat_get_progress_info('CLUSTER') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
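For reference, with the renamed and added columns a session could watch the new catch-up phase with a query along these lines (column names taken from the view definition above):

```sql
SELECT pid, phase,
       heap_tuples_inserted, heap_tuples_updated, heap_tuples_deleted,
       heap_blks_scanned, heap_blks_total
FROM pg_stat_progress_cluster;
```

During 'catch-up', heap_tuples_updated and heap_tuples_deleted should advance while heap_blks_scanned stays at heap_blks_total.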
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 194d143cf4..7bd81ff84b 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -25,6 +25,10 @@
#include "access/toast_internals.h"
#include "access/transam.h"
#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xloginsert.h"
+#include "access/xlogutils.h"
#include "catalog/catalog.h"
#include "catalog/dependency.h"
#include "catalog/heap.h"
@@ -32,6 +36,7 @@
#include "catalog/namespace.h"
#include "catalog/objectaccess.h"
#include "catalog/pg_am.h"
+#include "catalog/pg_control.h"
#include "catalog/pg_database.h"
#include "catalog/pg_inherits.h"
#include "catalog/toasting.h"
@@ -40,10 +45,15 @@
#include "commands/progress.h"
#include "commands/tablecmds.h"
#include "commands/vacuum.h"
+#include "executor/executor.h"
#include "miscadmin.h"
#include "optimizer/optimizer.h"
#include "pgstat.h"
+#include "replication/decode.h"
+#include "replication/logical.h"
+#include "replication/snapbuild.h"
#include "storage/bufmgr.h"
+#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
#include "utils/acl.h"
@@ -57,6 +67,8 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
+typedef struct RewriteStateData *RewriteState;
+
/*
* This struct is used to pass around the information on tables to be
* clustered. We need this so we can make a list of them when invoked without
@@ -68,17 +80,184 @@ typedef struct
Oid indexOid;
} RelToCluster;
+/*
+ * The following definitions are used for concurrent processing.
+ */
+
+/*
+ * OID of the table being processed by CLUSTER CONCURRENTLY by this backend.
+ */
+static Oid clustered_rel = InvalidOid;
+/* The same for its TOAST relation. */
+static Oid clustered_rel_toast = InvalidOid;
+
+/*
+ * The locators are used to avoid logical decoding of data that we do not need
+ * for our table.
+ */
+RelFileLocator clustered_rel_locator = {.relNumber = InvalidOid};
+RelFileLocator clustered_rel_toast_locator = {.relNumber = InvalidOid};
+
+/* XXX Do we also need to mention VACUUM FULL CONCURRENTLY? */
+#define CLUSTER_IN_PROGRESS_MESSAGE \
+ "relation \"%s\" is already being processed by CLUSTER CONCURRENTLY"
+
+/*
+ * Everything we need to call ExecInsertIndexTuples().
+ */
+typedef struct IndexInsertState
+{
+ ResultRelInfo *rri;
+ EState *estate;
+ ExprContext *econtext;
+
+ Relation ident_index;
+} IndexInsertState;
-static void cluster_multiple_rels(List *rtcs, ClusterParams *params);
-static void rebuild_relation(Relation OldHeap, Relation index, bool verbose);
+/*
+ * Catalog information to check if another backend changed the relation in
+ * such a way that makes CLUSTER CONCURRENTLY unable to continue. Such changes
+ * are possible because cluster_rel() has to release its lock on the relation
+ * in order to acquire AccessExclusiveLock that it needs to swap the relation
+ * files.
+ *
+ * The most obvious problem is that the tuple descriptor has changed, since
+ * then the tuples we try to insert into the new storage are not guaranteed to
+ * fit into the storage.
+ *
+ * Another problem is the relfilenode being changed by another backend. It's
+ * not necessarily a correctness issue (e.g. when the other backend ran
+ * cluster_rel()), but it's safer for us to terminate the table processing in
+ * such cases. However, this information also needs to be checked during
+ * logical decoding, so we store it in the global variables
+ * clustered_rel_locator and clustered_rel_toast_locator above.
+ *
+ * Where possible, commands which might change the relation in an incompatible
+ * way should check if CLUSTER CONCURRENTLY is running, before they start to
+ * do the actual changes (see is_concurrent_cluster_in_progress()). Anything
+ * else must be caught by check_catalog_changes(), which uses this structure.
+ */
+typedef struct CatalogState
+{
+ /* Tuple descriptor of the relation. */
+ TupleDesc tupdesc;
+
+ /* The number of indexes tracked. */
+ int ninds;
+ /* The index OIDs. */
+ Oid *ind_oids;
+ /* The index tuple descriptors. */
+ TupleDesc *ind_tupdescs;
+
+ /* The following are copies of the corresponding fields of pg_class. */
+ Oid reltoastrelid;
+ char relpersistence;
+ char replident;
+
+ /* rd_replidindex */
+ Oid replidindex;
+} CatalogState;
+
+/* The WAL segment being decoded. */
+static XLogSegNo cluster_current_segment = 0;
+
+static void cluster_multiple_rels(List *rtcs, ClusterParams *params,
+ LOCKMODE lock_mode, bool isTopLevel);
+static void rebuild_relation(Relation OldHeap, Relation index, bool verbose,
+ bool concurrent);
static void copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
+ Snapshot snapshot, LogicalDecodingContext *decoding_ctx,
bool verbose, bool *pSwapToastByContent,
TransactionId *pFreezeXid, MultiXactId *pCutoffMulti);
static List *get_tables_to_cluster(MemoryContext cluster_context);
static List *get_tables_to_cluster_partitioned(MemoryContext cluster_context,
Oid indexOid);
static bool cluster_is_permitted_for_relation(Oid relid, Oid userid);
+static void check_concurrent_cluster_requirements(Relation rel,
+ bool isTopLevel,
+ bool isCluster);
+static void begin_concurrent_cluster(Relation *rel_p, Relation *index_p,
+ bool *entered_p);
+static void end_concurrent_cluster(Oid relid, bool error);
+static void cluster_before_shmem_exit_callback(int code, Datum arg);
+static CatalogState *get_catalog_state(Relation rel);
+static void free_catalog_state(CatalogState *state);
+static void check_catalog_changes(Relation rel, CatalogState *cat_state);
+static LogicalDecodingContext *setup_logical_decoding(Oid relid,
+ const char *slotname,
+ TupleDesc tupdesc);
+static HeapTuple get_changed_tuple(ConcurrentChange *change);
+static void apply_concurrent_changes(ClusterDecodingState *dstate,
+ Relation rel, ScanKey key, int nkeys,
+ IndexInsertState *iistate);
+static void apply_concurrent_insert(Relation rel, ConcurrentChange *change,
+ HeapTuple tup, IndexInsertState *iistate,
+ TupleTableSlot *index_slot);
+static void apply_concurrent_update(Relation rel, HeapTuple tup,
+ HeapTuple tup_target,
+ ConcurrentChange *change,
+ IndexInsertState *iistate,
+ TupleTableSlot *index_slot);
+static void apply_concurrent_delete(Relation rel, HeapTuple tup_target,
+ ConcurrentChange *change);
+static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
+ HeapTuple tup_key,
+ IndexInsertState *iistate,
+ TupleTableSlot *ident_slot,
+ IndexScanDesc *scan_p);
+static void process_concurrent_changes(LogicalDecodingContext *ctx,
+ XLogRecPtr end_of_wal,
+ Relation rel_dst,
+ Relation rel_src,
+ ScanKey ident_key,
+ int ident_key_nentries,
+ IndexInsertState *iistate);
+static IndexInsertState *get_index_insert_state(Relation relation,
+ Oid ident_index_id);
+static ScanKey build_identity_key(Oid ident_idx_oid, Relation rel_src,
+ int *nentries);
+static void free_index_insert_state(IndexInsertState *iistate);
+static void cleanup_logical_decoding(LogicalDecodingContext *ctx);
+static void rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
+ Relation cl_index,
+ CatalogState *cat_state,
+ LogicalDecodingContext *ctx,
+ bool swap_toast_by_content,
+ TransactionId frozenXid,
+ MultiXactId cutoffMulti);
+static List *build_new_indexes(Relation NewHeap, Relation OldHeap, List *OldIndexes);
+
+/*
+ * Use this API when a relation needs to be unlocked, closed and re-opened.
+ * If the relation got dropped while unlocked, raise an ERROR that mentions
+ * the relation name rather than the OID.
+ */
+typedef struct RelReopenInfo
+{
+ /*
+ * The relation to be closed. A pointer to the value is stored here so
+ * that the caller's reference is updated automatically on re-opening.
+ *
+ * When calling unlock_and_close_relations(), 'relid' can be passed
+ * instead of 'rel_p' when the caller only needs to gather information for
+ * subsequent opening.
+ */
+ Relation *rel_p;
+ Oid relid;
+ char relkind;
+ LOCKMODE lockmode_orig; /* The existing lock mode */
+ LOCKMODE lockmode_new; /* The lock mode after the relation is
+ * re-opened */
+
+ char *relname; /* Relation name, initialized automatically. */
+} RelReopenInfo;
+
+static void init_rel_reopen_info(RelReopenInfo *rri, Relation *rel_p,
+ Oid relid, LOCKMODE lockmode_orig,
+ LOCKMODE lockmode_new);
+static void unlock_and_close_relations(RelReopenInfo *rels, int nrel);
+static void reopen_relations(RelReopenInfo *rels, int nrel);
/*---------------------------------------------------------------------------
* This cluster code allows for clustering multiple tables at once. Because
@@ -110,10 +289,12 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
ListCell *lc;
ClusterParams params = {0};
bool verbose = false;
+ bool concurrent = false;
Relation rel = NULL;
Oid indexOid = InvalidOid;
MemoryContext cluster_context;
List *rtcs;
+ LOCKMODE lock_mode;
/* Parse option list */
foreach(lc, stmt->params)
@@ -122,6 +303,8 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
if (strcmp(opt->defname, "verbose") == 0)
verbose = defGetBoolean(opt);
+ else if (strcmp(opt->defname, "concurrently") == 0)
+ concurrent = defGetBoolean(opt);
else
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
@@ -130,20 +313,30 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
parser_errposition(pstate, opt->location)));
}
- params.options = (verbose ? CLUOPT_VERBOSE : 0);
+ params.options =
+ (verbose ? CLUOPT_VERBOSE : 0) |
+ (concurrent ? CLUOPT_CONCURRENT : 0);
+
+ /*
+ * Determine the lock mode expected by cluster_rel().
+ *
+ * In the exclusive case, we obtain AccessExclusiveLock right away to
+ * avoid lock-upgrade hazard in the single-transaction case. In the
+ * CONCURRENT case, the AccessExclusiveLock will only be used at the end
+ * of processing, supposedly for very short time. Until then, we'll have
+ * to unlock the relation temporarily, so there's no lock-upgrade hazard.
+ */
+ lock_mode = (params.options & CLUOPT_CONCURRENT) == 0 ?
+ AccessExclusiveLock : LOCK_CLUSTER_CONCURRENT;
if (stmt->relation != NULL)
{
/* This is the single-relation case. */
Oid tableOid;
- /*
- * Find, lock, and check permissions on the table. We obtain
- * AccessExclusiveLock right away to avoid lock-upgrade hazard in the
- * single-transaction case.
- */
+ /* Find, lock, and check permissions on the table. */
tableOid = RangeVarGetRelidExtended(stmt->relation,
- AccessExclusiveLock,
+ lock_mode,
0,
RangeVarCallbackMaintainsTable,
NULL);
@@ -198,7 +391,7 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
* Do the job. (The function will close the relation, lock is kept
* till commit.)
*/
- cluster_rel(rel, indexOid, ¶ms);
+ cluster_rel(rel, indexOid, ¶ms, isTopLevel);
return;
}
@@ -237,7 +430,7 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
rtcs = get_tables_to_cluster_partitioned(cluster_context, indexOid);
/* close relation, releasing lock on parent table */
- table_close(rel, AccessExclusiveLock);
+ table_close(rel, lock_mode);
}
else
{
@@ -246,7 +439,7 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
}
/* Do the job. */
- cluster_multiple_rels(rtcs, ¶ms);
+ cluster_multiple_rels(rtcs, ¶ms, lock_mode, isTopLevel);
/* Start a new transaction for the cleanup work. */
StartTransactionCommand();
@@ -263,7 +456,8 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
* return.
*/
static void
-cluster_multiple_rels(List *rtcs, ClusterParams *params)
+cluster_multiple_rels(List *rtcs, ClusterParams *params, LOCKMODE lock_mode,
+ bool isTopLevel)
{
ListCell *lc;
@@ -283,13 +477,19 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
/* functions in indexes may want a snapshot set */
PushActiveSnapshot(GetTransactionSnapshot());
- rel = table_open(rtc->tableOid, AccessExclusiveLock);
+ rel = table_open(rtc->tableOid, lock_mode);
- /*
- * Do the job. (The function will close the relation, lock is kept
- * till commit.)
- */
- cluster_rel(rel, rtc->indexOid, params);
+ /* Not all relations can be processed in concurrent mode. */
+ if ((params->options & CLUOPT_CONCURRENT) == 0 ||
+ check_relation_is_clusterable_concurrently(rel, DEBUG1,
+ "CLUSTER (CONCURRENTLY)"))
+ {
+ /*
+ * Do the job. (The function will close the relation, lock is kept
+ * till commit.)
+ */
+ cluster_rel(rel, rtc->indexOid, params, isTopLevel);
+ }
+ else
+ table_close(rel, lock_mode);
PopActiveSnapshot();
CommitTransactionCommand();
@@ -313,10 +513,21 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
* instead of index order. This is the new implementation of VACUUM FULL,
* and error messages should refer to the operation as VACUUM not CLUSTER.
*
- * We expect that OldHeap is already locked in AccessExclusiveLock mode.
+ * We expect that OldHeap is already locked. The lock mode is
+ * AccessExclusiveLock for normal processing and LOCK_CLUSTER_CONCURRENT for
+ * concurrent processing (so that SELECT, INSERT, UPDATE and DELETE commands
+ * work, but cluster_rel() cannot be called concurrently for the same
+ * relation).
+ *
+ * Note that, in the concurrent case, the function releases the lock at some
+ * point, in order to get AccessExclusiveLock for the final steps (i.e. to
+ * swap the relation files). To make things simpler, the caller should expect
+ * OldHeap to be closed on return, regardless of CLUOPT_CONCURRENT. (The
+ * AccessExclusiveLock is kept till the end of the transaction.)
*/
void
-cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
+cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
+ bool isTopLevel)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid save_userid;
@@ -325,6 +536,41 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
bool verbose = ((params->options & CLUOPT_VERBOSE) != 0);
bool recheck = ((params->options & CLUOPT_RECHECK) != 0);
Relation index = NULL;
+ bool concurrent = ((params->options & CLUOPT_CONCURRENT) != 0);
+ LOCKMODE lmode;
+ bool entered, success;
+
+ /* Check that the correct lock is held. */
+ lmode = !concurrent ? AccessExclusiveLock : LOCK_CLUSTER_CONCURRENT;
+
+ /*
+ * Skip the relation if it's being processed concurrently. In such a case,
+ * we cannot rely on a lock because the other backend needs to release it
+ * temporarily at some point.
+ *
+ * This check should not take place until we have a lock that prevents
+ * another backend from starting VACUUM FULL / CLUSTER CONCURRENTLY after
+ * our check.
+ */
+ Assert(CheckRelationLockedByMe(OldHeap, lmode, false));
+ if (is_concurrent_cluster_in_progress(tableOid))
+ {
+ ereport(NOTICE,
+ (errmsg(CLUSTER_IN_PROGRESS_MESSAGE,
+ RelationGetRelationName(OldHeap))));
+ table_close(OldHeap, lmode);
+ return;
+ }
+
+ /* There are specific requirements on concurrent processing. */
+ if (concurrent)
+ {
+ check_concurrent_cluster_requirements(OldHeap, isTopLevel,
+ OidIsValid(indexOid));
+
+ check_relation_is_clusterable_concurrently(OldHeap, ERROR,
+ "CLUSTER (CONCURRENTLY)");
+ }
/* Check for user-requested abort. */
CHECK_FOR_INTERRUPTS();
@@ -361,7 +607,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
/* Check that the user still has privileges for the relation */
if (!cluster_is_permitted_for_relation(tableOid, save_userid))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
goto out;
}
@@ -376,7 +622,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
*/
if (RELATION_IS_OTHER_TEMP(OldHeap))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
goto out;
}
@@ -387,7 +633,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
*/
if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(indexOid)))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
goto out;
}
@@ -398,7 +644,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
if ((params->options & CLUOPT_RECHECK_ISCLUSTERED) != 0 &&
!get_index_isclustered(indexOid))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
goto out;
}
}
@@ -414,6 +660,11 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("cannot cluster a shared catalog")));
+ /*
+ * The CONCURRENT case should have been rejected earlier because it does
+ * not support system catalogs.
+ */
+ Assert(!(OldHeap->rd_rel->relisshared && concurrent));
/*
* Don't process temp tables of other backends ... their local buffer
@@ -440,7 +691,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
/* Check heap and index are valid to cluster on */
if (OidIsValid(indexOid))
{
- check_index_is_clusterable(OldHeap, indexOid, AccessExclusiveLock);
+ check_index_is_clusterable(OldHeap, indexOid, lmode);
/* Open the index (It should already be locked.) */
index = index_open(indexOid, NoLock);
}
@@ -455,7 +706,8 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
if (OldHeap->rd_rel->relkind == RELKIND_MATVIEW &&
!RelationIsPopulated(OldHeap))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ if (index)
+ index_close(index, lmode);
+ relation_close(OldHeap, lmode);
goto out;
}
@@ -468,11 +720,42 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
* invalid, because we move tuples around. Promote them to relation
* locks. Predicate locks on indexes will be promoted when they are
* reindexed.
+ *
+ * During concurrent processing, the heap as well as its indexes stay in
+ * operation, so we postpone this step until they are locked using
+ * AccessExclusiveLock near the end of the processing.
*/
- TransferPredicateLocksToHeapRelation(OldHeap);
+ if (!concurrent)
+ TransferPredicateLocksToHeapRelation(OldHeap);
/* rebuild_relation does all the dirty work */
- rebuild_relation(OldHeap, index, verbose);
+ entered = false;
+ success = false;
+ PG_TRY();
+ {
+ /*
+ * For concurrent processing, make sure other transactions treat this
+ * table as if it was a system / user catalog, and WAL the relevant
+ * additional information. ERROR is raised if another backend is
+ * processing the same table.
+ */
+ if (concurrent)
+ {
+ Relation *index_p = index ? &index : NULL;
+
+ begin_concurrent_cluster(&OldHeap, index_p, &entered);
+ }
+
+ rebuild_relation(OldHeap, index, verbose,
+ (params->options & CLUOPT_CONCURRENT) != 0);
+ success = true;
+ }
+ PG_FINALLY();
+ {
+ if (concurrent && entered)
+ end_concurrent_cluster(tableOid, !success);
+ }
+ PG_END_TRY();
/*
* NB: rebuild_relation does table_close() on OldHeap, and also on index,
@@ -622,18 +905,99 @@ mark_index_clustered(Relation rel, Oid indexOid, bool is_internal)
table_close(pg_index, RowExclusiveLock);
}
+/*
+ * Check if the CONCURRENTLY option is legal for the relation.
+ */
+bool
+check_relation_is_clusterable_concurrently(Relation rel, int elevel,
+ const char *stmt)
+{
+ char relpersistence, replident;
+ Oid ident_idx;
+
+ /* Data changes in system relations are not logically decoded. */
+ if (IsCatalogRelation(rel))
+ {
+ ereport(elevel,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot process relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("%s is not supported for catalog relations", stmt)));
+ return false;
+ }
+
+ if (IsToastRelation(rel))
+ {
+ ereport(elevel,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot process relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("%s is not supported for TOAST relations, unless the main relation is processed too",
+ stmt)));
+ return false;
+ }
+
+ relpersistence = rel->rd_rel->relpersistence;
+ if (relpersistence != RELPERSISTENCE_PERMANENT)
+ {
+ ereport(elevel,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot process relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("CLUSTER CONCURRENTLY is only allowed for permanent relations")));
+ return false;
+ }
+
+ /* With NOTHING, WAL does not contain the old tuple. */
+ replident = rel->rd_rel->relreplident;
+ if (replident == REPLICA_IDENTITY_NOTHING)
+ {
+ ereport(elevel,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot process relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("relation \"%s\" has insufficient replication identity",
+ RelationGetRelationName(rel))));
+ return false;
+ }
+
+ /*
+ * Identity index is not set if the replica identity is FULL, but PK might
+ * exist in such a case.
+ */
+ ident_idx = RelationGetReplicaIndex(rel);
+ if (!OidIsValid(ident_idx) && OidIsValid(rel->rd_pkindex))
+ ident_idx = rel->rd_pkindex;
+ if (!OidIsValid(ident_idx))
+ {
+ ereport(elevel,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot process relation \"%s\"",
+ RelationGetRelationName(rel)),
+ (errhint("relation \"%s\" has no identity index",
+ RelationGetRelationName(rel)))));
+ return false;
+ }
+
+ return true;
+}
+
/*
* rebuild_relation: rebuild an existing relation in index or physical order
*
- * OldHeap: table to rebuild --- must be opened and exclusive-locked!
+ * OldHeap: table to rebuild --- must be opened and locked. See cluster_rel()
+ * for comments on the required lock strength.
+ *
* index: index to cluster by, or NULL to rewrite in physical order. Must be
* opened and locked.
*
* On exit, the heap (and also the index, if one was passed) are closed, but
- * still locked with AccessExclusiveLock.
+ * still locked with AccessExclusiveLock. (The function handles the lock
+ * upgrade if 'concurrent' is true.)
*/
static void
-rebuild_relation(Relation OldHeap, Relation index, bool verbose)
+rebuild_relation(Relation OldHeap, Relation index, bool verbose,
+ bool concurrent)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid indexOid = index ? RelationGetRelid(index) : InvalidOid;
@@ -642,19 +1006,83 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose)
Oid OIDNewHeap;
Relation NewHeap;
char relpersistence;
- bool is_system_catalog;
bool swap_toast_by_content;
TransactionId frozenXid;
MultiXactId cutoffMulti;
+ NameData slotname;
+ LogicalDecodingContext *ctx = NULL;
+ Snapshot snapshot = NULL;
+ CatalogState *cat_state = NULL;
LOCKMODE lmode_new;
+ if (concurrent)
+ {
+ TupleDesc tupdesc;
+ RelReopenInfo rri[2];
+ int nrel;
+
+ /*
+ * CLUSTER CONCURRENTLY is not allowed in a transaction block, so this
+ * should never fire.
+ */
+ Assert(GetTopTransactionIdIfAny() == InvalidTransactionId);
+
+ /*
+ * A single backend should not execute multiple CLUSTER commands at a
+ * time, so use PID to make the slot unique.
+ */
+ snprintf(NameStr(slotname), NAMEDATALEN, "cluster_%d", MyProcPid);
+
+ /*
+ * Gather catalog information so that we can check later if the old
+ * relation has not changed while unlocked.
+ *
+ * Since this function also checks if the relation can be processed,
+ * it's important to call it before we setup the logical decoding,
+ * because that can take some time. Not sure if it's necessary to do
+ * it even earlier.
+ */
+ cat_state = get_catalog_state(OldHeap);
+
+ tupdesc = CreateTupleDescCopy(RelationGetDescr(OldHeap));
+
+ /*
+ * Unlock the relation (and possibly the clustering index) to avoid
+ * deadlock because setup_logical_decoding() will wait for all the
+ * running transactions (with XID assigned) to finish. Some of those
+ * transactions might be waiting for a lock on our relation.
+ */
+ nrel = 0;
+ init_rel_reopen_info(&rri[nrel++], &OldHeap, InvalidOid,
+ LOCK_CLUSTER_CONCURRENT,
+ LOCK_CLUSTER_CONCURRENT);
+ if (index)
+ init_rel_reopen_info(&rri[nrel++], &index, InvalidOid,
+ LOCK_CLUSTER_CONCURRENT,
+ LOCK_CLUSTER_CONCURRENT);
+ unlock_and_close_relations(rri, nrel);
+
+ /* Prepare to capture the concurrent data changes. */
+ ctx = setup_logical_decoding(tableOid, NameStr(slotname), tupdesc);
+
+ /* Lock the table (and index) again. */
+ reopen_relations(rri, nrel);
+
+ /*
+ * Check whether the catalog state (particularly 'tupdesc') changed while
+ * the relation was unlocked.
+ */
+ check_catalog_changes(OldHeap, cat_state);
+
+ snapshot = SnapBuildInitialSnapshotForCluster(ctx->snapshot_builder);
+ }
+
if (OidIsValid(indexOid))
/* Mark the correct index as clustered */
mark_index_clustered(OldHeap, indexOid, true);
/* Remember info about rel before closing OldHeap */
relpersistence = OldHeap->rd_rel->relpersistence;
- is_system_catalog = IsSystemRelation(OldHeap);
/*
* Create the transient table that will receive the re-ordered data.
@@ -673,31 +1101,52 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose)
AccessExclusiveLock : NoLock);
/* Copy the heap data into the new table in the desired order */
- copy_table_data(NewHeap, OldHeap, index, verbose,
+ copy_table_data(NewHeap, OldHeap, index, snapshot, ctx, verbose,
&swap_toast_by_content, &frozenXid, &cutoffMulti);
+ if (concurrent)
+ {
+ rebuild_relation_finish_concurrent(NewHeap, OldHeap, index,
+ cat_state, ctx,
+ swap_toast_by_content,
+ frozenXid, cutoffMulti);
+
+ pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
+ PROGRESS_CLUSTER_PHASE_FINAL_CLEANUP);
+
+ /* Done with decoding. */
+ FreeSnapshot(snapshot);
+ free_catalog_state(cat_state);
+ cleanup_logical_decoding(ctx);
+ ReplicationSlotRelease();
+ ReplicationSlotDrop(NameStr(slotname), false);
+ }
+ else
+ {
+ bool is_system_catalog = IsSystemRelation(OldHeap);
- /* Close relcache entries, but keep lock until transaction commit */
- table_close(OldHeap, NoLock);
- if (index)
- index_close(index, NoLock);
+ /* Close relcache entries, but keep lock until transaction commit */
+ table_close(OldHeap, NoLock);
+ if (index)
+ index_close(index, NoLock);
- /*
- * Close the new relation so it can be dropped as soon as the storage is
- * swapped. The relation is not visible to others, so we could unlock it
- * completely, but it's simpler to pass NoLock than to track all the locks
- * acquired so far.
- */
- table_close(NewHeap, NoLock);
+ /*
+ * Close the new relation so it can be dropped as soon as the storage
+ * is swapped. The relation is not visible to others, so we could
+ * unlock it completely, but it's simpler to pass NoLock than to track
+ * all the locks acquired so far.
+ */
+ table_close(NewHeap, NoLock);
- /*
- * Swap the physical files of the target and transient tables, then
- * rebuild the target's indexes and throw away the transient table.
- */
- finish_heap_swap(tableOid, OIDNewHeap, is_system_catalog,
- swap_toast_by_content, false, true,
- frozenXid, cutoffMulti,
- relpersistence);
+ /*
+ * Swap the physical files of the target and transient tables, then
+ * rebuild the target's indexes and throw away the transient table.
+ */
+ finish_heap_swap(tableOid, OIDNewHeap, is_system_catalog,
+ swap_toast_by_content, false, true, true,
+ frozenXid, cutoffMulti,
+ relpersistence);
+ }
}
@@ -848,15 +1297,19 @@ make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
/*
* Do the physical copying of table data.
*
+ * 'snapshot' and 'decoding_ctx': see table_relation_copy_for_cluster(). Pass
+ * them iff concurrent processing is required.
+ *
* There are three output parameters:
* *pSwapToastByContent is set true if toast tables must be swapped by content.
* *pFreezeXid receives the TransactionId used as freeze cutoff point.
* *pCutoffMulti receives the MultiXactId used as a cutoff point.
*/
static void
-copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verbose,
- bool *pSwapToastByContent, TransactionId *pFreezeXid,
- MultiXactId *pCutoffMulti)
+copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
+ Snapshot snapshot, LogicalDecodingContext *decoding_ctx,
+ bool verbose, bool *pSwapToastByContent,
+ TransactionId *pFreezeXid, MultiXactId *pCutoffMulti)
{
Oid OIDOldHeap = RelationGetRelid(OldHeap);
Oid OIDOldIndex = OldIndex ? RelationGetRelid(OldIndex) : InvalidOid;
@@ -876,6 +1329,7 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
int elevel = verbose ? INFO : DEBUG2;
PGRUsage ru0;
char *nspname;
+ bool concurrent = snapshot != NULL;
pg_rusage_init(&ru0);
@@ -902,8 +1356,12 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
*
* We don't need to open the toast relation here, just lock it. The lock
* will be held till end of transaction.
+ *
+ * In the CONCURRENT case, the lock does not help because we need to
+ * release it temporarily at some point. Instead, we expect VACUUM /
+ * CLUSTER to skip tables which are present in ClusteredRelsHash.
*/
- if (OldHeap->rd_rel->reltoastrelid)
+ if (OldHeap->rd_rel->reltoastrelid && !concurrent)
LockRelationOid(OldHeap->rd_rel->reltoastrelid, AccessExclusiveLock);
/*
@@ -979,7 +1437,45 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
* provided, else plain seqscan.
*/
if (OldIndex != NULL && OldIndex->rd_rel->relam == BTREE_AM_OID)
+ {
+ ResourceOwner oldowner = CurrentResourceOwner;
+
+ /*
+ * In the CONCURRENT case, do the planning in a subtransaction so that
+ * we don't leave any additional locks behind us that we cannot
+ * release easily.
+ */
+ if (concurrent)
+ {
+ Assert(CheckRelationLockedByMe(OldHeap, LOCK_CLUSTER_CONCURRENT,
+ false));
+ Assert(CheckRelationLockedByMe(OldIndex, LOCK_CLUSTER_CONCURRENT,
+ false));
+ BeginInternalSubTransaction("plan_cluster_use_sort");
+ }
+
use_sort = plan_cluster_use_sort(OIDOldHeap, OIDOldIndex);
+
+ if (concurrent)
+ {
+ PgBackendProgress progress;
+
+ /*
+ * Command progress reporting gets terminated at subtransaction
+ * end. Save the status so it can be eventually restored.
+ */
+ memcpy(&progress, &MyBEEntry->st_progress,
+ sizeof(PgBackendProgress));
+
+ /* Release the locks by aborting the subtransaction. */
+ RollbackAndReleaseCurrentSubTransaction();
+
+ /* Restore the progress reporting status. */
+ pgstat_progress_restore_state(&progress);
+
+ CurrentResourceOwner = oldowner;
+ }
+ }
else
use_sort = false;
@@ -1008,7 +1504,9 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
* values (e.g. because the AM doesn't use freezing).
*/
table_relation_copy_for_cluster(OldHeap, NewHeap, OldIndex, use_sort,
- cutoffs.OldestXmin, &cutoffs.FreezeLimit,
+ cutoffs.OldestXmin, snapshot,
+ decoding_ctx,
+ &cutoffs.FreezeLimit,
&cutoffs.MultiXactCutoff,
&num_tuples, &tups_vacuumed,
&tups_recently_dead);
@@ -1017,7 +1515,11 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
*pFreezeXid = cutoffs.FreezeLimit;
*pCutoffMulti = cutoffs.MultiXactCutoff;
- /* Reset rd_toastoid just to be tidy --- it shouldn't be looked at again */
+ /*
+ * Reset rd_toastoid just to be tidy --- it shouldn't be looked at
+ * again. In the CONCURRENT case, we need to set it again before applying
+ * the concurrent changes.
+ */
NewHeap->rd_toastoid = InvalidOid;
num_pages = RelationGetNumberOfBlocks(NewHeap);
@@ -1468,14 +1970,13 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
bool swap_toast_by_content,
bool check_constraints,
bool is_internal,
+ bool reindex,
TransactionId frozenXid,
MultiXactId cutoffMulti,
char newrelpersistence)
{
ObjectAddress object;
Oid mapped_tables[4];
- int reindex_flags;
- ReindexParams reindex_params = {0};
int i;
/* Report that we are now swapping relation files */
@@ -1501,39 +2002,46 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
if (is_system_catalog)
CacheInvalidateCatalog(OIDOldHeap);
- /*
- * Rebuild each index on the relation (but not the toast table, which is
- * all-new at this point). It is important to do this before the DROP
- * step because if we are processing a system catalog that will be used
- * during DROP, we want to have its indexes available. There is no
- * advantage to the other order anyway because this is all transactional,
- * so no chance to reclaim disk space before commit. We do not need a
- * final CommandCounterIncrement() because reindex_relation does it.
- *
- * Note: because index_build is called via reindex_relation, it will never
- * set indcheckxmin true for the indexes. This is OK even though in some
- * sense we are building new indexes rather than rebuilding existing ones,
- * because the new heap won't contain any HOT chains at all, let alone
- * broken ones, so it can't be necessary to set indcheckxmin.
- */
- reindex_flags = REINDEX_REL_SUPPRESS_INDEX_USE;
- if (check_constraints)
- reindex_flags |= REINDEX_REL_CHECK_CONSTRAINTS;
+ if (reindex)
+ {
+ int reindex_flags;
+ ReindexParams reindex_params = {0};
- /*
- * Ensure that the indexes have the same persistence as the parent
- * relation.
- */
- if (newrelpersistence == RELPERSISTENCE_UNLOGGED)
- reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
- else if (newrelpersistence == RELPERSISTENCE_PERMANENT)
- reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+ /*
+ * Rebuild each index on the relation (but not the toast table, which
+ * is all-new at this point). It is important to do this before the
+ * DROP step because if we are processing a system catalog that will
+ * be used during DROP, we want to have its indexes available. There
+ * is no advantage to the other order anyway because this is all
+ * transactional, so no chance to reclaim disk space before commit.
+ * We do not need a final CommandCounterIncrement() because
+ * reindex_relation does it.
+ *
+ * Note: because index_build is called via reindex_relation, it will
+ * never set indcheckxmin true for the indexes. This is OK even though
+ * in some sense we are building new indexes rather than rebuilding
+ * existing ones, because the new heap won't contain any HOT chains at
+ * all, let alone broken ones, so it can't be necessary to set
+ * indcheckxmin.
+ */
+ reindex_flags = REINDEX_REL_SUPPRESS_INDEX_USE;
+ if (check_constraints)
+ reindex_flags |= REINDEX_REL_CHECK_CONSTRAINTS;
- /* Report that we are now reindexing relations */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_REBUILD_INDEX);
+ /*
+ * Ensure that the indexes have the same persistence as the parent
+ * relation.
+ */
+ if (newrelpersistence == RELPERSISTENCE_UNLOGGED)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
+ else if (newrelpersistence == RELPERSISTENCE_PERMANENT)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+
+ /* Report that we are now reindexing relations */
+ pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
+ PROGRESS_CLUSTER_PHASE_REBUILD_INDEX);
- reindex_relation(NULL, OIDOldHeap, reindex_flags, &reindex_params);
+ reindex_relation(NULL, OIDOldHeap, reindex_flags, &reindex_params);
+ }
/* Report that we are now doing clean up */
pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
@@ -1773,3 +2281,1884 @@ cluster_is_permitted_for_relation(Oid relid, Oid userid)
get_rel_name(relid))));
return false;
}
+
+#define REPL_PLUGIN_NAME "pgoutput_cluster"
+
+/*
+ * Each relation being processed by CLUSTER CONCURRENTLY must be in the
+ * clusteredRels hashtable.
+ */
+typedef struct ClusteredRel
+{
+ Oid relid;
+ Oid dbid;
+} ClusteredRel;
+
+static HTAB *ClusteredRelsHash = NULL;
+
+/* Maximum number of entries in the hashtable. */
+static int maxClusteredRels = 0;
+
+Size
+ClusterShmemSize(void)
+{
+ /*
+ * A replication slot is needed for the processing, so use
+ * max_replication_slots to size the hashtable. Also reserve space for
+ * TOAST relations.
+ */
+ maxClusteredRels = max_replication_slots * 2;
+
+ return hash_estimate_size(maxClusteredRels, sizeof(ClusteredRel));
+}
+
+void
+ClusterShmemInit(void)
+{
+ HASHCTL info;
+
+ info.keysize = sizeof(ClusteredRel);
+ info.entrysize = info.keysize;
+
+ ClusteredRelsHash = ShmemInitHash("Clustered Relations",
+ maxClusteredRels,
+ maxClusteredRels,
+ &info,
+ HASH_ELEM | HASH_BLOBS);
+}
+
+/*
+ * Perform a preliminary check whether CLUSTER / VACUUM FULL CONCURRENTLY is
+ * possible. Note that here we only check things that should not change if we
+ * release the relation lock temporarily. The information that can change due
+ * to unlocking is checked in get_catalog_state().
+ */
+static void
+check_concurrent_cluster_requirements(Relation rel, bool isTopLevel,
+ bool isCluster)
+{
+ const char *stmt;
+
+ if (isCluster)
+ stmt = "CLUSTER (CONCURRENTLY)";
+ else
+ stmt = "VACUUM (FULL, CONCURRENTLY)";
+
+ /*
+ * Make sure we have no XID assigned, otherwise the call of
+ * setup_logical_decoding() can cause a deadlock.
+ */
+ PreventInTransactionBlock(isTopLevel, stmt);
+
+ CheckSlotPermissions();
+
+ /*
+ * Use an existing function to check if we can use logical
+ * decoding. However, note that RecoveryInProgress() should already have
+ * raised an error, as it does for the non-concurrent VACUUM FULL / CLUSTER.
+ */
+ CheckLogicalDecodingRequirements();
+
+ /* See ClusterShmemSize() */
+ if (max_replication_slots < 2)
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ (errmsg("%s requires \"max_replication_slots\" to be at least 2",
+ stmt)));
+}
+
+/*
+ * Call this function before CLUSTER CONCURRENTLY starts to setup logical
+ * decoding. It makes sure that other users of the table put enough
+ * information into WAL.
+ *
+ * The point is that on various places we expect that the table we're
+ * processing is treated like a system catalog. For example, we need to be
+ * able to scan it using a "historic snapshot" anytime during the processing
+ * (as opposed to scanning only at the start point of the decoding, logical
+ * replication does during initial table synchronization), in order to apply
+ * concurrent UPDATE / DELETE commands.
+ *
+ * Since we need to close and reopen the relation here, the 'rel_p' and
+ * 'index_p' arguments are in/out.
+ *
+ * 'enter_p' receives a bool value telling whether relation OID was entered
+ * into the hashtable or not.
+ */
+static void
+begin_concurrent_cluster(Relation *rel_p, Relation *index_p,
+ bool *entered_p)
+{
+ Relation rel = *rel_p;
+ Oid relid, toastrelid;
+ ClusteredRel key, *entry;
+ bool found;
+ RelReopenInfo rri[2];
+ int nrel;
+ static bool before_shmem_exit_callback_setup = false;
+
+ relid = RelationGetRelid(rel);
+
+ /*
+ * Make sure that we do not leave an entry in ClusteredRelsHash if exiting
+ * due to FATAL.
+ */
+ if (!before_shmem_exit_callback_setup)
+ {
+ before_shmem_exit(cluster_before_shmem_exit_callback, 0);
+ before_shmem_exit_callback_setup = true;
+ }
+
+ memset(&key, 0, sizeof(key));
+ key.relid = relid;
+ key.dbid = MyDatabaseId;
+
+ *entered_p = false;
+ LWLockAcquire(ClusteredRelsLock, LW_EXCLUSIVE);
+ entry = (ClusteredRel *)
+ hash_search(ClusteredRelsHash, &key, HASH_ENTER_NULL, &found);
+ if (found)
+ {
+ /*
+ * Since CLUSTER CONCURRENTLY takes ShareRowExclusiveLock, a conflict
+ * should occur much earlier. However, that lock may be released
+ * temporarily, see below. Anyway, we should complain whatever the
+ * reason for the conflict might be.
+ */
+ ereport(ERROR,
+ (errmsg(CLUSTER_IN_PROGRESS_MESSAGE,
+ RelationGetRelationName(rel))));
+ }
+ if (entry == NULL)
+ ereport(ERROR,
+ (errmsg("too many requests for CLUSTER CONCURRENTLY at a time")),
+ (errhint("Consider increasing the \"max_replication_slots\" configuration parameter.")));
+
+ /*
+ * Even if the insertion of TOAST relid should fail below, the caller has
+ * to do cleanup.
+ */
+ *entered_p = true;
+
+ /*
+ * Enable the callback to remove the entry in case of exit. We should not
+ * do this earlier, otherwise an attempt to insert an already existing entry
+ * could make us remove that entry (inserted by another backend) during
+ * ERROR handling.
+ */
+ Assert(!OidIsValid(clustered_rel));
+ clustered_rel = relid;
+
+ /*
+ * The TOAST relation is not accessed using the historic snapshot, but we
+ * enter it here to protect it from being VACUUMed by another backend.
+ * (The lock does not help in the CONCURRENT case because we cannot hold
+ * it continuously till the end of the transaction.) See the comments on
+ * locking the TOAST relation in copy_table_data().
+ */
+ toastrelid = rel->rd_rel->reltoastrelid;
+ if (OidIsValid(toastrelid))
+ {
+ key.relid = toastrelid;
+ entry = (ClusteredRel *)
+ hash_search(ClusteredRelsHash, &key, HASH_ENTER_NULL, &found);
+ if (found)
+ /*
+ * If we could enter the main relation, the TOAST relation should
+ * succeed too. Nevertheless, check.
+ */
+ ereport(ERROR,
+ (errmsg("TOAST relation of \"%s\" is already being processed by CLUSTER CONCURRENTLY",
+ RelationGetRelationName(rel))));
+ if (entry == NULL)
+ ereport(ERROR,
+ (errmsg("too many requests for CLUSTER CONCURRENTLY at a time")),
+ (errhint("Consider increasing the \"max_replication_slots\" configuration parameter.")));
+
+ Assert(!OidIsValid(clustered_rel_toast));
+ clustered_rel_toast = toastrelid;
+ }
+ LWLockRelease(ClusteredRelsLock);
+
+ /*
+ * Make sure that other backends are aware of the new hash entry.
+ *
+ * Besides sending the invalidation message, we need to force re-opening
+ * of the relation, which includes the actual invalidation (and thus
+ * checking of our hashtable on the next access).
+ */
+ CacheInvalidateRelcacheImmediate(rel);
+ /*
+ * Since the hashtable only needs to be checked by write transactions,
+ * lock the relation in a mode that conflicts with any DML command. (The
+ * reading transactions are supposed to close the relation before opening
+ * it with higher lock.) Once we have the relation (and its index) locked,
+ * we unlock it immediately and then re-lock using the original mode.
+ */
+ nrel = 0;
+ init_rel_reopen_info(&rri[nrel++], rel_p, InvalidOid,
+ LOCK_CLUSTER_CONCURRENT, ShareLock);
+ if (index_p)
+ {
+ /*
+ * Another transaction might want to open both the relation and the
+ * index. If it already has the relation lock and is waiting for the
+ * index lock, we should release the index lock, otherwise our request
+ * for ShareLock on the relation can end up in a deadlock.
+ */
+ init_rel_reopen_info(&rri[nrel++], index_p, InvalidOid,
+ LOCK_CLUSTER_CONCURRENT, ShareLock);
+ }
+ unlock_and_close_relations(rri, nrel);
+ /*
+ * XXX It's not strictly necessary to lock the index here, but it's
+ * probably not worth teaching the "reopen API" about this special case.
+ */
+ reopen_relations(rri, nrel);
+
+ /* Switch back to the original lock. */
+ nrel = 0;
+ init_rel_reopen_info(&rri[nrel++], rel_p, InvalidOid,
+ ShareLock, LOCK_CLUSTER_CONCURRENT);
+ if (index_p)
+ init_rel_reopen_info(&rri[nrel++], index_p, InvalidOid,
+ ShareLock, LOCK_CLUSTER_CONCURRENT);
+ unlock_and_close_relations(rri, nrel);
+ reopen_relations(rri, nrel);
+
+ /* Avoid logical decoding of other relations. */
+ clustered_rel_locator = rel->rd_locator;
+ if (OidIsValid(toastrelid))
+ {
+ Relation toastrel;
+
+ /* Avoid logical decoding of other TOAST relations. */
+ toastrel = table_open(toastrelid, AccessShareLock);
+ clustered_rel_toast_locator = toastrel->rd_locator;
+ table_close(toastrel, AccessShareLock);
+ }
+}
+
+/*
+ * Call this when done with CLUSTER CONCURRENTLY.
+ *
+ * 'error' tells whether the function is being called in order to handle
+ * error.
+ */
+static void
+end_concurrent_cluster(Oid relid, bool error)
+{
+ ClusteredRel key, *entry, *entry_toast = NULL;
+
+ /* Remove the relation from the hash. */
+ memset(&key, 0, sizeof(key));
+ key.relid = relid;
+ key.dbid = MyDatabaseId;
+ LWLockAcquire(ClusteredRelsLock, LW_EXCLUSIVE);
+ entry = hash_search(ClusteredRelsHash, &key, HASH_REMOVE, NULL);
+
+ /* Disable cluster_before_shmem_exit_callback(). */
+ if (OidIsValid(clustered_rel))
+ clustered_rel = InvalidOid;
+
+ /* Remove the TOAST relation if there is one. */
+ if (OidIsValid(clustered_rel_toast))
+ {
+ key.relid = clustered_rel_toast;
+ entry_toast = hash_search(ClusteredRelsHash, &key, HASH_REMOVE,
+ NULL);
+
+ clustered_rel_toast = InvalidOid;
+ }
+ else
+ key.relid = InvalidOid;
+ LWLockRelease(ClusteredRelsLock);
+
+ /* Restore normal function of logical decoding. */
+ clustered_rel_locator.relNumber = InvalidOid;
+ clustered_rel_toast_locator.relNumber = InvalidOid;
+
+ /*
+ * On normal completion (!error), we should not really fail to remove the
+ * entry. But if that happened for any reason, make sure the transaction is
+ * aborted: if other transactions, while changing the contents of the
+ * relation, didn't know that CLUSTER CONCURRENTLY was in progress, they
+ * could have failed to write enough information to WAL, and thus we could
+ * have produced inconsistent table contents.
+ *
+ * On the other hand, if we are already handling an error, there's no
+ * reason to worry about inconsistent contents of the new storage because
+ * the transaction is going to be rolled back anyway. Furthermore, by
+ * raising ERROR here we'd shadow the original error.
+ */
+ if (!error)
+ {
+ char *relname;
+
+ if (entry == NULL)
+ {
+ relname = get_rel_name(relid);
+ if (!relname)
+ ereport(ERROR,
+ (errmsg("cache lookup failed for relation %u",
+ relid)));
+
+ ereport(ERROR,
+ (errmsg("relation \"%s\" not found among clustered relations",
+ relname)));
+ }
+
+ /*
+ * Missing TOAST relation indicates that it could have been VACUUMed
+ * or CLUSTERed by another backend while we did not hold a lock on it.
+ */
+ if (entry_toast == NULL && OidIsValid(key.relid))
+ {
+ relname = get_rel_name(key.relid);
+ if (!relname)
+ ereport(ERROR,
+ (errmsg("cache lookup failed for relation %u",
+ key.relid)));
+
+ ereport(ERROR,
+ (errmsg("relation \"%s\" not found among clustered relations",
+ relname)));
+ }
+ }
+
+ /*
+ * Note: unlike begin_concurrent_cluster(), here we do not lock/unlock the
+ * relation: 1) On normal completion, the caller is already holding
+ * AccessExclusiveLock (till the end of the transaction), 2) on ERROR /
+ * FATAL, we try to do the cleanup asap, but the worst case is that other
+ * backends will write unnecessary information to WAL until they close the
+ * relation.
+ */
+}
+
+/*
+ * A wrapper to call end_concurrent_cluster() as a before_shmem_exit callback.
+ */
+static void
+cluster_before_shmem_exit_callback(int code, Datum arg)
+{
+ if (OidIsValid(clustered_rel) || OidIsValid(clustered_rel_toast))
+ end_concurrent_cluster(clustered_rel, true);
+}
+
+/*
+ * Check if relation is currently being processed by CLUSTER CONCURRENTLY.
+ */
+bool
+is_concurrent_cluster_in_progress(Oid relid)
+{
+ ClusteredRel key, *entry;
+
+ memset(&key, 0, sizeof(key));
+ key.relid = relid;
+ key.dbid = MyDatabaseId;
+
+ LWLockAcquire(ClusteredRelsLock, LW_SHARED);
+ entry = (ClusteredRel *)
+ hash_search(ClusteredRelsHash, &key, HASH_FIND, NULL);
+ LWLockRelease(ClusteredRelsLock);
+
+ return entry != NULL;
+}
+
+/*
+ * Check if VACUUM FULL / CLUSTER CONCURRENTLY is already running for given
+ * relation, and if so, raise ERROR. The problem is that cluster_rel() needs
+ * to release its lock on the relation temporarily at some point, so our lock
+ * alone does not help. Commands that might break what cluster_rel() is doing
+ * should call this function first.
+ *
+ * Return without checking if 'lockmode' is too weak to prevent the race
+ * conditions which would make the result meaningless. In that case,
+ * cluster_rel() itself should throw an ERROR if the relation was changed
+ * concurrently in an incompatible way. However, if it managed to do most of
+ * its work by then, a lot of CPU time might be wasted.
+ */
+void
+check_for_concurrent_cluster(Oid relid, LOCKMODE lockmode)
+{
+ /*
+ * If the caller does not have a lock that conflicts with
+ * LOCK_CLUSTER_CONCURRENT, the check makes little sense because the
+ * VACUUM FULL / CLUSTER CONCURRENTLY can start anytime after the check.
+ */
+ if (lockmode < LOCK_CLUSTER_CONCURRENT)
+ return;
+
+ if (is_concurrent_cluster_in_progress(relid))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg(CLUSTER_IN_PROGRESS_MESSAGE,
+ get_rel_name(relid))));
+}
+
+/*
+ * Check if relation is eligible for CLUSTER CONCURRENTLY and retrieve the
+ * catalog state to be passed later to check_catalog_changes.
+ *
+ * Caller is supposed to hold (at least) LOCK_CLUSTER_CONCURRENT on the
+ * relation.
+ */
+static CatalogState *
+get_catalog_state(Relation rel)
+{
+ CatalogState *result = palloc_object(CatalogState);
+ List *ind_oids;
+ ListCell *lc;
+ int ninds, i;
+ Oid reltoastrelid = rel->rd_rel->reltoastrelid;
+ char relpersistence = rel->rd_rel->relpersistence;
+ char replident = rel->rd_rel->relreplident;
+ Oid ident_idx = RelationGetReplicaIndex(rel);
+ TupleDesc td_src = RelationGetDescr(rel);
+
+ /*
+ * While gathering the catalog information, check if there is a reason not
+ * to proceed.
+ */
+ check_relation_is_clusterable_concurrently(rel, ERROR,
+ "CLUSTER (CONCURRENTLY)");
+
+ /*
+ * TOAST should not really change, but be careful. If it did, we would be
+ * unable to remove the new one from ClusteredRelsHash.
+ */
+ if (OidIsValid(clustered_rel_toast) &&
+ clustered_rel_toast != reltoastrelid)
+ ereport(ERROR,
+ (errmsg("TOAST relation changed by another transaction")));
+
+ /* No index should be dropped while we are checking it. */
+ Assert(CheckRelationLockedByMe(rel, ShareUpdateExclusiveLock, true));
+
+ ind_oids = RelationGetIndexList(rel);
+ result->ninds = ninds = list_length(ind_oids);
+ result->ind_oids = palloc_array(Oid, ninds);
+ result->ind_tupdescs = palloc_array(TupleDesc, ninds);
+ i = 0;
+ foreach(lc, ind_oids)
+ {
+ Oid ind_oid = lfirst_oid(lc);
+ Relation index;
+ TupleDesc td_src, td_dst;
+
+ /*
+ * A weaker lock should be o.k. for the index, but this one should not
+ * break anything either.
+ */
+ index = index_open(ind_oid, ShareUpdateExclusiveLock);
+
+ result->ind_oids[i] = RelationGetRelid(index);
+ td_src = RelationGetDescr(index);
+ td_dst = palloc(TupleDescSize(td_src));
+ TupleDescCopy(td_dst, td_src);
+ result->ind_tupdescs[i] = td_dst;
+ i++;
+
+ index_close(index, ShareUpdateExclusiveLock);
+ }
+
+ /* Fill-in the relation info. */
+ result->tupdesc = palloc(TupleDescSize(td_src));
+ TupleDescCopy(result->tupdesc, td_src);
+ result->reltoastrelid = reltoastrelid;
+ result->relpersistence = relpersistence;
+ result->replident = replident;
+ result->replidindex = ident_idx;
+
+ return result;
+}
+
+static void
+free_catalog_state(CatalogState *state)
+{
+ /* We are only interested in indexes. */
+ if (state->ninds == 0)
+ return;
+
+ for (int i = 0; i < state->ninds; i++)
+ FreeTupleDesc(state->ind_tupdescs[i]);
+
+ FreeTupleDesc(state->tupdesc);
+ pfree(state->ind_oids);
+ pfree(state->ind_tupdescs);
+ pfree(state);
+}
+
+/*
+ * Raise ERROR if 'rel' changed in a way that does not allow further
+ * processing of CLUSTER CONCURRENTLY.
+ *
+ * Besides the relation's tuple descriptor, it's important to check indexes:
+ * concurrent change of index definition (can it happen in any other way than
+ * dropping and re-creating the index, accidentally with the same OID?) can be
+ * a problem because we may already have the new index built. If an index was
+ * created or dropped concurrently, we'd fail to swap the index storage. In
+ * any case, we prefer to check the indexes early to get an explicit error
+ * message about the mismatch. Furthermore, the earlier we detect the change,
+ * the fewer CPU cycles we waste.
+ *
+ * Note that we do not check constraints because the transaction which changed
+ * them must have ensured that the existing tuples satisfy the new
+ * constraints. If any DML commands were necessary for that, we will simply
+ * decode them from WAL and apply them to the new storage.
+ *
+ * Caller is supposed to hold (at least) ShareUpdateExclusiveLock on the
+ * relation.
+ */
+static void
+check_catalog_changes(Relation rel, CatalogState *cat_state)
+{
+ Oid reltoastrelid = rel->rd_rel->reltoastrelid;
+ List *ind_oids;
+ ListCell *lc;
+ LOCKMODE lmode;
+ Oid ident_idx;
+ TupleDesc td, td_cp;
+
+ /* First, check the relation info. */
+
+ /* TOAST is not easy to change, but check. */
+ if (reltoastrelid != cat_state->reltoastrelid)
+ ereport(ERROR,
+ errmsg("TOAST relation of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+
+ /*
+ * Likewise, check_for_concurrent_cluster() should prevent others from
+ * changing the relation file concurrently, but it's our responsibility to
+ * avoid data loss. (The original locators are stored outside cat_state,
+ * but the check belongs to this function.)
+ */
+ if (!RelFileLocatorEquals(rel->rd_locator, clustered_rel_locator))
+ ereport(ERROR,
+ (errmsg("file of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel))));
+ if (OidIsValid(reltoastrelid))
+ {
+ Relation toastrel;
+
+ toastrel = table_open(reltoastrelid, AccessShareLock);
+ if (!RelFileLocatorEquals(toastrel->rd_locator,
+ clustered_rel_toast_locator))
+ ereport(ERROR,
+ (errmsg("file of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(toastrel))));
+ table_close(toastrel, AccessShareLock);
+ }
+
+ if (rel->rd_rel->relpersistence != cat_state->relpersistence)
+ ereport(ERROR,
+ errmsg("persistence of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+
+ if (cat_state->replident != rel->rd_rel->relreplident)
+ ereport(ERROR,
+ errmsg("replica identity of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+
+ ident_idx = RelationGetReplicaIndex(rel);
+ if (ident_idx == InvalidOid && rel->rd_pkindex != InvalidOid)
+ ident_idx = rel->rd_pkindex;
+ if (cat_state->replidindex != ident_idx)
+ ereport(ERROR,
+ errmsg("identity index of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+
+ /*
+ * As cat_state contains a copy (which has the constraint info cleared),
+ * create a temporary copy for the comparison.
+ */
+ td = RelationGetDescr(rel);
+ td_cp = palloc(TupleDescSize(td));
+ TupleDescCopy(td_cp, td);
+ if (!equalTupleDescs(cat_state->tupdesc, td_cp))
+ ereport(ERROR,
+ errmsg("definition of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+ FreeTupleDesc(td_cp);
+
+ /* Now we are only interested in indexes. */
+ if (cat_state->ninds == 0)
+ return;
+
+ /* No index should be dropped while we are checking the relation. */
+ lmode = ShareUpdateExclusiveLock;
+ Assert(CheckRelationLockedByMe(rel, lmode, true));
+
+ ind_oids = RelationGetIndexList(rel);
+ if (list_length(ind_oids) != cat_state->ninds)
+ goto failed_index;
+
+ foreach(lc, ind_oids)
+ {
+ Oid ind_oid = lfirst_oid(lc);
+ int i;
+ TupleDesc tupdesc;
+ Relation index;
+
+ /* Find the index in cat_state. */
+ for (i = 0; i < cat_state->ninds; i++)
+ {
+ if (cat_state->ind_oids[i] == ind_oid)
+ break;
+ }
+ /*
+ * OID not found, i.e. the index was replaced by another one. XXX
+ * Should we still try to find whether an index having the desired tuple
+ * descriptor exists? Or should we always look for the tuple
+ * descriptor and not use OIDs at all?
+ */
+ if (i == cat_state->ninds)
+ goto failed_index;
+
+ /* Check the tuple descriptor. */
+ index = try_index_open(ind_oid, lmode);
+ if (index == NULL)
+ goto failed_index;
+ tupdesc = RelationGetDescr(index);
+ if (!equalTupleDescs(cat_state->ind_tupdescs[i], tupdesc))
+ goto failed_index;
+ index_close(index, lmode);
+ }
+
+ return;
+
+failed_index:
+ ereport(ERROR,
+ (errmsg("index(es) of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel))));
+}
+
+/*
+ * This function is much like pg_create_logical_replication_slot() except that
+ * the new slot is neither released (if anyone else could read changes from
+ * our slot, we could miss changes other backends make while we copy the
+ * existing data into the temporary table), nor persisted (it's easier to
+ * handle a crash by restarting all the work from scratch).
+ *
+ * XXX Even though CreateInitDecodingContext() does not set state to
+ * RS_PERSISTENT, it does write the slot to disk. We rely on
+ * RestoreSlotFromDisk() to delete ephemeral slots during startup. (Both ERROR
+ * and FATAL should lead to cleanup even before the cluster goes down.)
+ */
+static LogicalDecodingContext *
+setup_logical_decoding(Oid relid, const char *slotname, TupleDesc tupdesc)
+{
+ LogicalDecodingContext *ctx;
+ ClusterDecodingState *dstate;
+
+ /* RS_TEMPORARY so that the slot gets cleaned up on ERROR. */
+ ReplicationSlotCreate(slotname, true, RS_TEMPORARY, false, false, false);
+
+ /*
+ * Neither prepare_write nor do_write callback nor update_progress is
+ * useful for us.
+ *
+ * Regarding the value of need_full_snapshot, we pass false because the
+ * table we are processing is present in ClusteredRelsHash and therefore,
+ * regarding logical decoding, treated like a catalog.
+ */
+ ctx = CreateInitDecodingContext(REPL_PLUGIN_NAME,
+ NIL,
+ false,
+ InvalidXLogRecPtr,
+ XL_ROUTINE(.page_read = read_local_xlog_page,
+ .segment_open = wal_segment_open,
+ .segment_close = wal_segment_close),
+ NULL, NULL, NULL);
+
+ /*
+ * We don't have control on setting fast_forward, so at least check it.
+ */
+ Assert(!ctx->fast_forward);
+
+ DecodingContextFindStartpoint(ctx);
+
+ /* Some WAL records should have been read. */
+ Assert(ctx->reader->EndRecPtr != InvalidXLogRecPtr);
+
+ XLByteToSeg(ctx->reader->EndRecPtr, cluster_current_segment,
+ wal_segment_size);
+
+ /*
+ * Setup structures to store decoded changes.
+ */
+ dstate = palloc0(sizeof(ClusterDecodingState));
+ dstate->relid = relid;
+ dstate->tstore = tuplestore_begin_heap(false, false,
+ maintenance_work_mem);
+ dstate->tupdesc = tupdesc;
+
+ /* Initialize the descriptor to store the changes ... */
+ dstate->tupdesc_change = CreateTemplateTupleDesc(1);
+
+ TupleDescInitEntry(dstate->tupdesc_change, 1, NULL, BYTEAOID, -1, 0);
+ /* ... as well as the corresponding slot. */
+ dstate->tsslot = MakeSingleTupleTableSlot(dstate->tupdesc_change,
+ &TTSOpsMinimalTuple);
+
+ dstate->resowner = ResourceOwnerCreate(CurrentResourceOwner,
+ "logical decoding");
+
+ ctx->output_writer_private = dstate;
+ return ctx;
+}
+
+/*
+ * Retrieve the tuple from a change structure. No alignment of the change is
+ * assumed.
+ */
+static HeapTuple
+get_changed_tuple(ConcurrentChange *change)
+{
+ HeapTupleData tup_data;
+ HeapTuple result;
+ char *src;
+
+ /*
+ * Ensure alignment before accessing the fields. (This is why we can't use
+ * heap_copytuple() instead of this function.)
+ */
+ memcpy(&tup_data, &change->tup_data, sizeof(HeapTupleData));
+
+ result = (HeapTuple) palloc(HEAPTUPLESIZE + tup_data.t_len);
+ memcpy(result, &tup_data, sizeof(HeapTupleData));
+ result->t_data = (HeapTupleHeader) ((char *) result + HEAPTUPLESIZE);
+ src = (char *) change + sizeof(ConcurrentChange);
+ memcpy(result->t_data, src, result->t_len);
+
+ return result;
+}
+
+/*
+ * Decode logical changes from the WAL sequence up to end_of_wal.
+ */
+void
+cluster_decode_concurrent_changes(LogicalDecodingContext *ctx,
+ XLogRecPtr end_of_wal)
+{
+ ClusterDecodingState *dstate;
+ ResourceOwner resowner_old;
+ PgBackendProgress progress;
+
+ /*
+ * Invalidate the "present" cache before moving to "(recent) history".
+ */
+ InvalidateSystemCaches();
+
+ dstate = (ClusterDecodingState *) ctx->output_writer_private;
+ resowner_old = CurrentResourceOwner;
+ CurrentResourceOwner = dstate->resowner;
+
+ /*
+ * reorderbuffer.c uses internal subtransaction, whose abort ends the
+ * command progress reporting. Save the status here so we can restore when
+ * done with the decoding.
+ */
+ memcpy(&progress, &MyBEEntry->st_progress, sizeof(PgBackendProgress));
+
+ PG_TRY();
+ {
+ while (ctx->reader->EndRecPtr < end_of_wal)
+ {
+ XLogRecord *record;
+ XLogSegNo segno_new;
+ char *errm = NULL;
+ XLogRecPtr end_lsn;
+
+ record = XLogReadRecord(ctx->reader, &errm);
+ if (errm)
+ elog(ERROR, "%s", errm);
+
+ if (record != NULL)
+ LogicalDecodingProcessRecord(ctx, ctx->reader);
+
+ /*
+ * If a WAL segment boundary has been crossed, inform the decoding
+ * system that the catalog_xmin can advance. (We could confirm more
+ * often, but filling a single WAL segment should not take much
+ * time.)
+ */
+ end_lsn = ctx->reader->EndRecPtr;
+ XLByteToSeg(end_lsn, segno_new, wal_segment_size);
+ if (segno_new != cluster_current_segment)
+ {
+ LogicalConfirmReceivedLocation(end_lsn);
+ elog(DEBUG1, "cluster: confirmed receive location %X/%X",
+ (uint32) (end_lsn >> 32), (uint32) end_lsn);
+ cluster_current_segment = segno_new;
+ }
+
+ CHECK_FOR_INTERRUPTS();
+ }
+ InvalidateSystemCaches();
+ CurrentResourceOwner = resowner_old;
+ }
+ PG_CATCH();
+ {
+ InvalidateSystemCaches();
+ CurrentResourceOwner = resowner_old;
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+
+ /* Restore the progress reporting status. */
+ pgstat_progress_restore_state(&progress);
+}
+
+/*
+ * Apply changes that happened during the initial load.
+ *
+ * Scan key is passed by caller, so it does not have to be constructed
+ * multiple times. Key entries have all fields initialized, except for
+ * sk_argument.
+ */
+static void
+apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
+ ScanKey key, int nkeys, IndexInsertState *iistate)
+{
+ TupleTableSlot *index_slot, *ident_slot;
+ HeapTuple tup_old = NULL;
+
+ if (dstate->nchanges == 0)
+ return;
+
+ /* TupleTableSlot is needed to pass the tuple to ExecInsertIndexTuples(). */
+ index_slot = MakeSingleTupleTableSlot(dstate->tupdesc, &TTSOpsHeapTuple);
+ iistate->econtext->ecxt_scantuple = index_slot;
+
+ /* A slot to fetch tuples from identity index. */
+ ident_slot = table_slot_create(rel, NULL);
+
+ while (tuplestore_gettupleslot(dstate->tstore, true, false,
+ dstate->tsslot))
+ {
+ bool shouldFree;
+ HeapTuple tup_change,
+ tup,
+ tup_exist;
+ char *change_raw;
+ ConcurrentChange *change;
+ bool isnull[1];
+ Datum values[1];
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Get the change from the single-column tuple. */
+ tup_change = ExecFetchSlotHeapTuple(dstate->tsslot, false, &shouldFree);
+ heap_deform_tuple(tup_change, dstate->tupdesc_change, values, isnull);
+ Assert(!isnull[0]);
+
+ /* This is bytea, but char* is easier to work with. */
+ change_raw = (char *) DatumGetByteaP(values[0]);
+
+ change = (ConcurrentChange *) VARDATA(change_raw);
+
+ /* TRUNCATE change contains no tuple, so process it separately. */
+ if (change->kind == CHANGE_TRUNCATE)
+ {
+ /*
+ * All the things that ExecuteTruncateGuts() does (such as firing
+ * triggers or handling the DROP_CASCADE behavior) should have
+ * taken place on the source relation. Thus we only do the actual
+ * truncation of the new relation (and its indexes).
+ */
+ heap_truncate_one_rel(rel);
+
+ pfree(tup_change);
+ continue;
+ }
+
+ /*
+ * Extract the tuple from the change. The tuple is copied here because
+ * it might be assigned to 'tup_old', in which case it needs to
+ * survive into the next iteration.
+ */
+ tup = get_changed_tuple(change);
+
+ if (change->kind == CHANGE_UPDATE_OLD)
+ {
+ Assert(tup_old == NULL);
+ tup_old = tup;
+ }
+ else if (change->kind == CHANGE_INSERT)
+ {
+ Assert(tup_old == NULL);
+
+ apply_concurrent_insert(rel, change, tup, iistate, index_slot);
+
+ pfree(tup);
+ }
+ else if (change->kind == CHANGE_UPDATE_NEW ||
+ change->kind == CHANGE_DELETE)
+ {
+ IndexScanDesc ind_scan = NULL;
+ HeapTuple tup_key;
+
+ if (change->kind == CHANGE_UPDATE_NEW)
+ {
+ tup_key = tup_old != NULL ? tup_old : tup;
+ }
+ else
+ {
+ Assert(tup_old == NULL);
+ tup_key = tup;
+ }
+
+ /*
+ * Find the tuple to be updated or deleted.
+ */
+ tup_exist = find_target_tuple(rel, key, nkeys, tup_key,
+ iistate, ident_slot, &ind_scan);
+ if (tup_exist == NULL)
+ elog(ERROR, "Failed to find target tuple");
+
+ if (change->kind == CHANGE_UPDATE_NEW)
+ apply_concurrent_update(rel, tup, tup_exist, change, iistate,
+ index_slot);
+ else
+ apply_concurrent_delete(rel, tup_exist, change);
+
+ if (tup_old != NULL)
+ {
+ pfree(tup_old);
+ tup_old = NULL;
+ }
+
+ pfree(tup);
+ index_endscan(ind_scan);
+ }
+ else
+ elog(ERROR, "Unrecognized kind of change: %d", change->kind);
+
+ /* If there's any change, make it visible to the next iteration. */
+ if (change->kind != CHANGE_UPDATE_OLD)
+ {
+ CommandCounterIncrement();
+ UpdateActiveSnapshotCommandId();
+ }
+
+ /* TTSOpsMinimalTuple has .get_heap_tuple==NULL. */
+ Assert(shouldFree);
+ pfree(tup_change);
+ }
+
+ tuplestore_clear(dstate->tstore);
+ dstate->nchanges = 0;
+
+ /* Cleanup. */
+ ExecDropSingleTupleTableSlot(index_slot);
+ ExecDropSingleTupleTableSlot(ident_slot);
+}
+
+static void
+apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
+ IndexInsertState *iistate, TupleTableSlot *index_slot)
+{
+ List *recheck;
+
+ heap_insert(rel, tup, GetCurrentCommandId(true), HEAP_INSERT_NO_LOGICAL, NULL);
+
+ /*
+ * Update indexes. (Functions used by the indexes may need the active
+ * snapshot, which the caller is expected to have set.)
+ */
+ ExecStoreHeapTuple(tup, index_slot, false);
+ recheck = ExecInsertIndexTuples(iistate->rri,
+ index_slot,
+ iistate->estate,
+ false, /* update */
+ false, /* noDupErr */
+ NULL, /* specConflict */
+ NIL, /* arbiterIndexes */
+ false /* onlySummarizing */
+ );
+
+ /*
+ * If recheck is required, it must have been performed on the source
+ * relation by now. (All the logical changes we process here are already
+ * committed.)
+ */
+ list_free(recheck);
+
+ pgstat_progress_incr_param(PROGRESS_CLUSTER_HEAP_TUPLES_INSERTED, 1);
+}
+
+static void
+apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
+ ConcurrentChange *change, IndexInsertState *iistate,
+ TupleTableSlot *index_slot)
+{
+ List *recheck;
+ TU_UpdateIndexes update_indexes;
+
+ /*
+ * Write the new tuple into the new heap. ('tup' gets the TID assigned
+ * here.)
+ */
+ simple_heap_update(rel, &tup_target->t_self, tup, &update_indexes);
+
+ ExecStoreHeapTuple(tup, index_slot, false);
+
+ if (update_indexes != TU_None)
+ {
+ recheck = ExecInsertIndexTuples(iistate->rri,
+ index_slot,
+ iistate->estate,
+ true, /* update */
+ false, /* noDupErr */
+ NULL, /* specConflict */
+ NIL, /* arbiterIndexes */
+ /* onlySummarizing */
+ update_indexes == TU_Summarizing);
+ list_free(recheck);
+ }
+
+ pgstat_progress_incr_param(PROGRESS_CLUSTER_HEAP_TUPLES_UPDATED, 1);
+}
+
+static void
+apply_concurrent_delete(Relation rel, HeapTuple tup_target,
+ ConcurrentChange *change)
+{
+ simple_heap_delete(rel, &tup_target->t_self);
+
+ pgstat_progress_incr_param(PROGRESS_CLUSTER_HEAP_TUPLES_DELETED, 1);
+}
+
+/*
+ * Find the tuple to be updated or deleted.
+ *
+ * 'key' is a pre-initialized scan key, into which the function will put the
+ * key values.
+ *
+ * 'tup_key' is a tuple containing the key values for the scan.
+ *
+ * On exit, '*scan_p' contains the scan descriptor used. The caller must
+ * close it when the returned tuple is no longer needed.
+ */
+static HeapTuple
+find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
+ IndexInsertState *iistate,
+ TupleTableSlot *ident_slot, IndexScanDesc *scan_p)
+{
+ IndexScanDesc scan;
+ Form_pg_index ident_form;
+ int2vector *ident_indkey;
+ HeapTuple result = NULL;
+
+ scan = index_beginscan(rel, iistate->ident_index, GetActiveSnapshot(),
+ nkeys, 0);
+ *scan_p = scan;
+ index_rescan(scan, key, nkeys, NULL, 0);
+
+ /* Info needed to retrieve key values from heap tuple. */
+ ident_form = iistate->ident_index->rd_index;
+ ident_indkey = &ident_form->indkey;
+
+ /* Use the incoming tuple to finalize the scan key. */
+ for (int i = 0; i < scan->numberOfKeys; i++)
+ {
+ ScanKey entry;
+ bool isnull;
+ int16 attno_heap;
+
+ entry = &scan->keyData[i];
+ attno_heap = ident_indkey->values[i];
+ entry->sk_argument = heap_getattr(tup_key,
+ attno_heap,
+ rel->rd_att,
+ &isnull);
+ Assert(!isnull);
+ }
+ if (index_getnext_slot(scan, ForwardScanDirection, ident_slot))
+ {
+ bool shouldFree;
+
+ result = ExecFetchSlotHeapTuple(ident_slot, false, &shouldFree);
+ /* TTSOpsBufferHeapTuple has .get_heap_tuple != NULL. */
+ Assert(!shouldFree);
+ }
+
+ return result;
+}
+
+/*
+ * Decode and apply concurrent changes.
+ *
+ * Pass rel_src iff its reltoastrelid is needed.
+ */
+static void
+process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
+ Relation rel_dst, Relation rel_src, ScanKey ident_key,
+ int ident_key_nentries, IndexInsertState *iistate)
+{
+ ClusterDecodingState *dstate;
+
+ pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
+ PROGRESS_CLUSTER_PHASE_CATCH_UP);
+
+ dstate = (ClusterDecodingState *) ctx->output_writer_private;
+
+ cluster_decode_concurrent_changes(ctx, end_of_wal);
+
+ if (dstate->nchanges == 0)
+ return;
+
+ PG_TRY();
+ {
+ /*
+ * Make sure that TOAST values can eventually be accessed via the old
+ * relation - see comment in copy_table_data().
+ */
+ if (rel_src)
+ rel_dst->rd_toastoid = rel_src->rd_rel->reltoastrelid;
+
+ apply_concurrent_changes(dstate, rel_dst, ident_key,
+ ident_key_nentries, iistate);
+ }
+ PG_FINALLY();
+ {
+ if (rel_src)
+ rel_dst->rd_toastoid = InvalidOid;
+ }
+ PG_END_TRY();
+}
+
+static IndexInsertState *
+get_index_insert_state(Relation relation, Oid ident_index_id)
+{
+ EState *estate;
+ int i;
+ IndexInsertState *result;
+
+ result = (IndexInsertState *) palloc0(sizeof(IndexInsertState));
+ estate = CreateExecutorState();
+ result->econtext = GetPerTupleExprContext(estate);
+
+ result->rri = (ResultRelInfo *) palloc(sizeof(ResultRelInfo));
+ InitResultRelInfo(result->rri, relation, 0, 0, 0);
+ ExecOpenIndices(result->rri, false);
+
+ /*
+ * Find the relcache entry of the identity index so that we spend no extra
+ * effort to open / close it.
+ */
+ for (i = 0; i < result->rri->ri_NumIndices; i++)
+ {
+ Relation ind_rel;
+
+ ind_rel = result->rri->ri_IndexRelationDescs[i];
+ if (ind_rel->rd_id == ident_index_id)
+ result->ident_index = ind_rel;
+ }
+ if (result->ident_index == NULL)
+ elog(ERROR, "Failed to open identity index");
+
+ /* Only initialize fields needed by ExecInsertIndexTuples(). */
+ result->estate = estate;
+
+ return result;
+}
+
+/*
+ * Build scan key to process logical changes.
+ */
+static ScanKey
+build_identity_key(Oid ident_idx_oid, Relation rel_src, int *nentries)
+{
+ Relation ident_idx_rel;
+ Form_pg_index ident_idx;
+ int n,
+ i;
+ ScanKey result;
+
+ Assert(OidIsValid(ident_idx_oid));
+ ident_idx_rel = index_open(ident_idx_oid, AccessShareLock);
+ ident_idx = ident_idx_rel->rd_index;
+ n = ident_idx->indnatts;
+ result = (ScanKey) palloc(sizeof(ScanKeyData) * n);
+ for (i = 0; i < n; i++)
+ {
+ ScanKey entry;
+ int16 relattno;
+ Form_pg_attribute att;
+ Oid opfamily,
+ opcintype,
+ opno,
+ opcode;
+
+ entry = &result[i];
+ relattno = ident_idx->indkey.values[i];
+ if (relattno >= 1)
+ {
+ TupleDesc desc;
+
+ desc = rel_src->rd_att;
+ att = TupleDescAttr(desc, relattno - 1);
+ }
+ else
+ elog(ERROR, "Unexpected attribute number %d in index", relattno);
+
+ opfamily = ident_idx_rel->rd_opfamily[i];
+ opcintype = ident_idx_rel->rd_opcintype[i];
+ opno = get_opfamily_member(opfamily, opcintype, opcintype,
+ BTEqualStrategyNumber);
+
+ if (!OidIsValid(opno))
+ elog(ERROR, "Failed to find = operator for type %u", opcintype);
+
+ opcode = get_opcode(opno);
+ if (!OidIsValid(opcode))
+ elog(ERROR, "Failed to find = operator for operator %u", opno);
+
+ /* Initialize everything but argument. */
+ ScanKeyInit(entry,
+ i + 1,
+ BTEqualStrategyNumber, opcode,
+ (Datum) NULL);
+ entry->sk_collation = att->attcollation;
+ }
+ index_close(ident_idx_rel, AccessShareLock);
+
+ *nentries = n;
+ return result;
+}
+
+static void
+free_index_insert_state(IndexInsertState *iistate)
+{
+ ExecCloseIndices(iistate->rri);
+ FreeExecutorState(iistate->estate);
+ pfree(iistate->rri);
+ pfree(iistate);
+}
+
+static void
+cleanup_logical_decoding(LogicalDecodingContext *ctx)
+{
+ ClusterDecodingState *dstate;
+
+ dstate = (ClusterDecodingState *) ctx->output_writer_private;
+
+ ExecDropSingleTupleTableSlot(dstate->tsslot);
+ FreeTupleDesc(dstate->tupdesc_change);
+ FreeTupleDesc(dstate->tupdesc);
+ tuplestore_end(dstate->tstore);
+
+ FreeDecodingContext(ctx);
+}
+
+/*
+ * The final steps of rebuild_relation() for concurrent processing.
+ *
+ * On entry, NewHeap is locked in AccessExclusiveLock mode. OldHeap and its
+ * clustering index (if one is passed) are still locked in a mode that allows
+ * concurrent data changes. On exit, both tables and their indexes are closed,
+ * but locked in AccessExclusiveLock mode.
+ */
+static void
+rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
+ Relation cl_index,
+ CatalogState *cat_state,
+ LogicalDecodingContext *ctx,
+ bool swap_toast_by_content,
+ TransactionId frozenXid,
+ MultiXactId cutoffMulti)
+{
+ LOCKMODE lmode_old;
+ List *ind_oids_new;
+ Oid old_table_oid = RelationGetRelid(OldHeap);
+ Oid new_table_oid = RelationGetRelid(NewHeap);
+ List *ind_oids_old = RelationGetIndexList(OldHeap);
+ ListCell *lc, *lc2;
+ char relpersistence;
+ bool is_system_catalog;
+ Oid ident_idx_old, ident_idx_new;
+ IndexInsertState *iistate;
+ ScanKey ident_key;
+ int ident_key_nentries;
+ XLogRecPtr wal_insert_ptr, end_of_wal;
+ char dummy_rec_data = '\0';
+ RelReopenInfo *rri = NULL;
+ int nrel;
+ Relation *ind_refs_all, *ind_refs_p;
+
+ /* Like in cluster_rel(). */
+ lmode_old = LOCK_CLUSTER_CONCURRENT;
+ Assert(CheckRelationLockedByMe(OldHeap, lmode_old, false));
+ Assert(cl_index == NULL ||
+ CheckRelationLockedByMe(cl_index, lmode_old, false));
+ /* This is expected from the caller. */
+ Assert(CheckRelationLockedByMe(NewHeap, AccessExclusiveLock, false));
+
+ ident_idx_old = RelationGetReplicaIndex(OldHeap);
+
+ /*
+ * Unlike the exclusive case, we build new indexes for the new relation
+ * rather than swapping the storage and reindexing the old relation. The
+ * point is that the index build can take some time, so we do it before we
+ * get AccessExclusiveLock on the old heap and therefore we cannot swap
+ * the heap storage yet.
+ *
+ * index_create() will lock the new indexes with AccessExclusiveLock
+ * during creation - no need to change that.
+ */
+ ind_oids_new = build_new_indexes(NewHeap, OldHeap, ind_oids_old);
+
+ /*
+ * Processing shouldn't start without a valid identity index.
+ */
+ Assert(OidIsValid(ident_idx_old));
+
+ /* Find "identity index" on the new relation. */
+ ident_idx_new = InvalidOid;
+ forboth(lc, ind_oids_old, lc2, ind_oids_new)
+ {
+ Oid ind_old = lfirst_oid(lc);
+ Oid ind_new = lfirst_oid(lc2);
+
+ if (ident_idx_old == ind_old)
+ {
+ ident_idx_new = ind_new;
+ break;
+ }
+ }
+ if (!OidIsValid(ident_idx_new))
+ /*
+ * Should not happen, given our lock on the old relation.
+ */
+ ereport(ERROR,
+ (errmsg("Identity index missing on the new relation")));
+
+ /* Executor state to update indexes. */
+ iistate = get_index_insert_state(NewHeap, ident_idx_new);
+
+ /*
+ * Build scan key that we'll use to look for rows to be updated / deleted
+ * during logical decoding.
+ */
+ ident_key = build_identity_key(ident_idx_new, OldHeap, &ident_key_nentries);
+
+ /*
+ * Flush all WAL records inserted so far (possibly except for the last
+ * incomplete page, see GetInsertRecPtr), to minimize the amount of data
+ * we need to flush while holding exclusive lock on the source table.
+ */
+ wal_insert_ptr = GetInsertRecPtr();
+ XLogFlush(wal_insert_ptr);
+ end_of_wal = GetFlushRecPtr(NULL);
+
+ /*
+ * Apply the concurrent changes for the first time, to minimize the time
+ * we will need to hold AccessExclusiveLock. (A considerable amount of
+ * WAL may have been written during the data copying and index creation.)
+ */
+ process_concurrent_changes(ctx, end_of_wal, NewHeap,
+ swap_toast_by_content ? OldHeap : NULL,
+ ident_key, ident_key_nentries, iistate);
+
+ /*
+ * Release the locks that allowed concurrent data changes, in order to
+ * acquire the AccessExclusiveLock.
+ */
+ nrel = 0;
+ /*
+ * We unlock the old relation (and its clustering index), but then we will
+ * lock the relation and *all* its indexes because we want to swap their
+ * storage.
+ *
+ * (NewHeap is already locked, as well as its indexes.)
+ */
+ rri = palloc_array(RelReopenInfo, 1 + list_length(ind_oids_old));
+ init_rel_reopen_info(&rri[nrel++], &OldHeap, InvalidOid,
+ LOCK_CLUSTER_CONCURRENT, AccessExclusiveLock);
+ /* References to the re-opened indexes will be stored in this array. */
+ ind_refs_all = palloc_array(Relation, list_length(ind_oids_old));
+ ind_refs_p = ind_refs_all;
+ /* The clustering index is a special case. */
+ if (cl_index)
+ {
+ *ind_refs_p = cl_index;
+ init_rel_reopen_info(&rri[nrel], ind_refs_p, InvalidOid,
+ LOCK_CLUSTER_CONCURRENT, AccessExclusiveLock);
+ nrel++;
+ ind_refs_p++;
+ }
+ /*
+ * Initialize also the entries for the other indexes (currently unlocked)
+ * because we will have to lock them.
+ */
+ foreach(lc, ind_oids_old)
+ {
+ Oid ind_oid;
+
+ ind_oid = lfirst_oid(lc);
+ /* Clustering index is already in the array, or there is none. */
+ if (cl_index && RelationGetRelid(cl_index) == ind_oid)
+ continue;
+
+ Assert(nrel < (1 + list_length(ind_oids_old)));
+
+ *ind_refs_p = NULL;
+ init_rel_reopen_info(&rri[nrel],
+ /*
+ * In this special case we do not have the
+ * relcache reference, use OID instead.
+ */
+ ind_refs_p,
+ ind_oid,
+ NoLock, /* Nothing to unlock. */
+ AccessExclusiveLock);
+
+ nrel++;
+ ind_refs_p++;
+ }
+ /* Perform the actual unlocking and re-locking. */
+ unlock_and_close_relations(rri, nrel);
+ reopen_relations(rri, nrel);
+
+ /*
+ * In addition, lock the OldHeap's TOAST relation that we skipped for the
+ * CONCURRENTLY option in copy_table_data(). This lock will be needed to
+ * swap the relation files.
+ */
+ if (OidIsValid(OldHeap->rd_rel->reltoastrelid))
+ LockRelationOid(OldHeap->rd_rel->reltoastrelid, AccessExclusiveLock);
+
+ /*
+ * Check if the new indexes match the old ones, i.e. no changes occurred
+ * while OldHeap was unlocked.
+ *
+ * XXX It's probably not necessary to check the relation tuple descriptor
+ * here because the logical decoding was already active when we released
+ * the lock, and thus the corresponding data changes won't be lost.
+ * However, processing those changes might take a lot of time.
+ */
+ check_catalog_changes(OldHeap, cat_state);
+
+ /*
+ * Tuples and pages of the old heap will be gone, but the heap will stay.
+ */
+ TransferPredicateLocksToHeapRelation(OldHeap);
+ /* The same for indexes. */
+ for (int i = 0; i < (nrel - 1); i++)
+ {
+ Relation index = ind_refs_all[i];
+
+ TransferPredicateLocksToHeapRelation(index);
+
+ /*
+ * References to indexes on the old relation are not needed anymore,
+ * however locks stay till the end of the transaction.
+ */
+ index_close(index, NoLock);
+ }
+ pfree(ind_refs_all);
+
+ /*
+ * Flush anything we see in WAL, to make sure that all changes committed
+ * while we were waiting for the exclusive lock are available for
+ * decoding. This should not be necessary if all backends had
+ * synchronous_commit set, but we can't rely on this setting.
+ *
+ * Unfortunately, GetInsertRecPtr() may lag behind the actual insert
+ * position, and GetLastImportantRecPtr() points at the start of the last
+ * record rather than at the end. Thus the simplest way to determine the
+ * insert position is to insert a dummy record and use its LSN.
+ *
+ * XXX Consider using GetLastImportantRecPtr() and adding the size of the
+ * last record (plus the total size of all the page headers the record
+ * spans)?
+ */
+ XLogBeginInsert();
+ XLogRegisterData(&dummy_rec_data, 1);
+ wal_insert_ptr = XLogInsert(RM_XLOG_ID, XLOG_NOOP);
+ XLogFlush(wal_insert_ptr);
+ end_of_wal = GetFlushRecPtr(NULL);
+
+ /* Apply the concurrent changes again. */
+ process_concurrent_changes(ctx, end_of_wal, NewHeap,
+ swap_toast_by_content ? OldHeap : NULL,
+ ident_key, ident_key_nentries, iistate);
+
+ /* Remember info about rel before closing OldHeap */
+ relpersistence = OldHeap->rd_rel->relpersistence;
+ is_system_catalog = IsSystemRelation(OldHeap);
+
+ pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
+ PROGRESS_CLUSTER_PHASE_SWAP_REL_FILES);
+
+ forboth(lc, ind_oids_old, lc2, ind_oids_new)
+ {
+ Oid ind_old = lfirst_oid(lc);
+ Oid ind_new = lfirst_oid(lc2);
+ Oid mapped_tables[4];
+
+ /* Zero out possible results from swap_relation_files() */
+ memset(mapped_tables, 0, sizeof(mapped_tables));
+
+ swap_relation_files(ind_old, ind_new,
+ (old_table_oid == RelationRelationId),
+ swap_toast_by_content,
+ true,
+ InvalidTransactionId,
+ InvalidMultiXactId,
+ mapped_tables);
+
+#ifdef USE_ASSERT_CHECKING
+ /*
+ * Concurrent processing is not supported for system relations, so
+ * there should be no mapped tables.
+ */
+ for (int i = 0; i < 4; i++)
+ Assert(mapped_tables[i] == 0);
+#endif
+ }
+
+ /* The new indexes must be visible for deletion. */
+ CommandCounterIncrement();
+
+ /* Close the old heap but keep lock until transaction commit. */
+ table_close(OldHeap, NoLock);
+ /* Close the new heap. (We didn't have to open its indexes). */
+ table_close(NewHeap, NoLock);
+
+ /* Cleanup what we don't need anymore. (And close the identity index.) */
+ pfree(ident_key);
+ free_index_insert_state(iistate);
+
+ /*
+ * Swap the relations and their TOAST relations and TOAST indexes. This
+ * also drops the new relation and its indexes.
+ *
+ * (System catalogs are currently not supported.)
+ */
+ Assert(!is_system_catalog);
+ finish_heap_swap(old_table_oid, new_table_oid,
+ is_system_catalog,
+ swap_toast_by_content,
+ false, true, false,
+ frozenXid, cutoffMulti,
+ relpersistence);
+
+ pfree(rri);
+}
+
+/*
+ * Build indexes on NewHeap according to those on OldHeap.
+ *
+ * OldIndexes is the list of index OIDs on OldHeap.
+ *
+ * A list of OIDs of the corresponding indexes created on NewHeap is
+ * returned. The order of items does match, so we can use these arrays to swap
+ * index storage.
+ */
+static List *
+build_new_indexes(Relation NewHeap, Relation OldHeap, List *OldIndexes)
+{
+ StringInfo ind_name;
+ ListCell *lc;
+ List *result = NIL;
+
+ pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
+ PROGRESS_CLUSTER_PHASE_REBUILD_INDEX);
+
+ ind_name = makeStringInfo();
+
+ foreach(lc, OldIndexes)
+ {
+ Oid ind_oid,
+ ind_oid_new,
+ tbsp_oid;
+ Relation ind;
+ IndexInfo *ind_info;
+ int i,
+ heap_col_id;
+ List *colnames;
+ int16 indnatts;
+ Oid *collations,
+ *opclasses;
+ HeapTuple tup;
+ bool isnull;
+ Datum d;
+ oidvector *oidvec;
+ int2vector *int2vec;
+ size_t oid_arr_size;
+ size_t int2_arr_size;
+ int16 *indoptions;
+ text *reloptions = NULL;
+ bits16 flags;
+ Datum *opclassOptions;
+ NullableDatum *stattargets;
+
+ ind_oid = lfirst_oid(lc);
+ ind = index_open(ind_oid, AccessShareLock);
+ ind_info = BuildIndexInfo(ind);
+
+ tbsp_oid = ind->rd_rel->reltablespace;
+ /*
+ * Index names really don't matter; we'll eventually use only their
+ * storage. Just make them unique within the table.
+ */
+ resetStringInfo(ind_name);
+ appendStringInfo(ind_name, "ind_%d",
+ list_cell_number(OldIndexes, lc));
+
+ flags = 0;
+ if (ind->rd_index->indisprimary)
+ flags |= INDEX_CREATE_IS_PRIMARY;
+
+ colnames = NIL;
+ indnatts = ind->rd_index->indnatts;
+ oid_arr_size = sizeof(Oid) * indnatts;
+ int2_arr_size = sizeof(int16) * indnatts;
+
+ collations = (Oid *) palloc(oid_arr_size);
+ for (i = 0; i < indnatts; i++)
+ {
+ char *colname;
+
+ heap_col_id = ind->rd_index->indkey.values[i];
+ if (heap_col_id > 0)
+ {
+ Form_pg_attribute att;
+
+ /* Normal attribute. */
+ att = TupleDescAttr(OldHeap->rd_att, heap_col_id - 1);
+ colname = pstrdup(NameStr(att->attname));
+ collations[i] = att->attcollation;
+ }
+ else if (heap_col_id == 0)
+ {
+ HeapTuple tuple;
+ Form_pg_attribute att;
+
+ /*
+ * Expression column is not present in relcache. What we need
+ * here is an attribute of the *index* relation.
+ */
+ tuple = SearchSysCache2(ATTNUM,
+ ObjectIdGetDatum(ind_oid),
+ Int16GetDatum(i + 1));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR,
+ "cache lookup failed for attribute %d of relation %u",
+ i + 1, ind_oid);
+ att = (Form_pg_attribute) GETSTRUCT(tuple);
+ colname = pstrdup(NameStr(att->attname));
+ collations[i] = att->attcollation;
+ ReleaseSysCache(tuple);
+ }
+ else
+ elog(ERROR, "Unexpected column number: %d",
+ heap_col_id);
+
+ colnames = lappend(colnames, colname);
+ }
+
+ /*
+ * Special effort needed for variable length attributes of
+ * Form_pg_index.
+ */
+ tup = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(ind_oid));
+ if (!HeapTupleIsValid(tup))
+ elog(ERROR, "cache lookup failed for index %u", ind_oid);
+ d = SysCacheGetAttr(INDEXRELID, tup, Anum_pg_index_indclass, &isnull);
+ Assert(!isnull);
+ oidvec = (oidvector *) DatumGetPointer(d);
+ opclasses = (Oid *) palloc(oid_arr_size);
+ memcpy(opclasses, oidvec->values, oid_arr_size);
+
+ d = SysCacheGetAttr(INDEXRELID, tup, Anum_pg_index_indoption,
+ &isnull);
+ Assert(!isnull);
+ int2vec = (int2vector *) DatumGetPointer(d);
+ indoptions = (int16 *) palloc(int2_arr_size);
+ memcpy(indoptions, int2vec->values, int2_arr_size);
+ ReleaseSysCache(tup);
+
+ tup = SearchSysCache1(RELOID, ObjectIdGetDatum(ind_oid));
+ if (!HeapTupleIsValid(tup))
+ elog(ERROR, "cache lookup failed for index relation %u", ind_oid);
+ d = SysCacheGetAttr(RELOID, tup, Anum_pg_class_reloptions, &isnull);
+ reloptions = !isnull ? DatumGetTextPCopy(d) : NULL;
+ ReleaseSysCache(tup);
+
+ opclassOptions = palloc0(sizeof(Datum) * ind_info->ii_NumIndexAttrs);
+ for (i = 0; i < ind_info->ii_NumIndexAttrs; i++)
+ opclassOptions[i] = get_attoptions(ind_oid, i + 1);
+
+ stattargets = get_index_stattargets(ind_oid, ind_info);
+
+ /*
+ * Neither parentIndexRelid nor parentConstraintId needs to be passed
+ * since the new catalog entries (pg_constraint, pg_inherits) would
+ * eventually be dropped. Therefore there's no need to record a valid
+ * dependency on the parents.
+ */
+ ind_oid_new = index_create(NewHeap,
+ ind_name->data,
+ InvalidOid,
+ InvalidOid, /* parentIndexRelid */
+ InvalidOid, /* parentConstraintId */
+ InvalidOid,
+ ind_info,
+ colnames,
+ ind->rd_rel->relam,
+ tbsp_oid,
+ collations,
+ opclasses,
+ opclassOptions,
+ indoptions,
+ stattargets,
+ PointerGetDatum(reloptions),
+ flags, /* flags */
+ 0, /* constr_flags */
+ false, /* allow_system_table_mods */
+ false, /* is_internal */
+ NULL /* constraintId */
+ );
+ result = lappend_oid(result, ind_oid_new);
+
+ index_close(ind, AccessShareLock);
+ list_free_deep(colnames);
+ pfree(collations);
+ pfree(opclasses);
+ pfree(indoptions);
+ if (reloptions)
+ pfree(reloptions);
+ }
+
+ return result;
+}
+
+static void
+init_rel_reopen_info(RelReopenInfo *rri, Relation *rel_p, Oid relid,
+ LOCKMODE lockmode_orig, LOCKMODE lockmode_new)
+{
+ rri->rel_p = rel_p;
+ rri->relid = relid;
+ rri->lockmode_orig = lockmode_orig;
+ rri->lockmode_new = lockmode_new;
+}
+
+/*
+ * Unlock and close relations specified by items of the 'rels' array. 'nrels'
+ * is the number of items.
+ *
+ * Information needed to (re)open the relations (or to issue meaningful ERROR)
+ * is added to the array items.
+ */
+static void
+unlock_and_close_relations(RelReopenInfo *rels, int nrel)
+{
+ int i;
+ RelReopenInfo *rri;
+
+ /*
+ * First, retrieve the information that we will need for re-opening.
+ *
+ * We could close (and unlock) each relation as soon as we have gathered
+ * the related information, but then we would have to be careful not to
+ * unlock the table until we have the info on all its indexes. (Once we
+ * unlock the table, any index can be dropped, and thus we can fail to get
+ * the name we want to report if re-opening fails.) It seems simpler to
+ * separate the work into two iterations.
+ */
+ for (i = 0; i < nrel; i++)
+ {
+ Relation rel;
+
+ rri = &rels[i];
+ rel = *rri->rel_p;
+
+ if (rel)
+ {
+ Assert(CheckRelationLockedByMe(rel, rri->lockmode_orig, false));
+ Assert(!OidIsValid(rri->relid));
+
+ rri->relid = RelationGetRelid(rel);
+ rri->relkind = rel->rd_rel->relkind;
+ rri->relname = pstrdup(RelationGetRelationName(rel));
+ }
+ else
+ {
+ Assert(OidIsValid(rri->relid));
+
+ rri->relname = get_rel_name(rri->relid);
+ rri->relkind = get_rel_relkind(rri->relid);
+ }
+ }
+
+ /* Second, close the relations. */
+ for (i = 0; i < nrel; i++)
+ {
+ Relation rel;
+
+ rri = &rels[i];
+ rel = *rri->rel_p;
+
+ /* Close the relation if the caller passed one. */
+ if (rel)
+ {
+ if (rri->relkind == RELKIND_RELATION)
+ table_close(rel, rri->lockmode_orig);
+ else
+ {
+ Assert(rri->relkind == RELKIND_INDEX);
+
+ index_close(rel, rri->lockmode_orig);
+ }
+ }
+ }
+}
+
+/*
+ * Re-open the relations closed previously by unlock_and_close_relations().
+ */
+static void
+reopen_relations(RelReopenInfo *rels, int nrel)
+{
+ for (int i = 0; i < nrel; i++)
+ {
+ RelReopenInfo *rri = &rels[i];
+ Relation rel;
+
+ if (rri->relkind == RELKIND_RELATION)
+ {
+ rel = try_table_open(rri->relid, rri->lockmode_new);
+ }
+ else
+ {
+ Assert(rri->relkind == RELKIND_INDEX);
+
+ rel = try_index_open(rri->relid, rri->lockmode_new);
+ }
+
+ if (rel == NULL)
+ {
+ const char *kind_str;
+
+ kind_str = (rri->relkind == RELKIND_RELATION) ? "table" : "index";
+ ereport(ERROR,
+ (errmsg("could not open %s \"%s\"", kind_str,
+ rri->relname),
+ errhint("The %s could have been dropped by another transaction.",
+ kind_str)));
+ }
+ *rri->rel_p = rel;
+
+ pfree(rri->relname);
+ }
+}
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 79558cecec..ff89236add 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -906,7 +906,7 @@ refresh_by_match_merge(Oid matviewOid, Oid tempOid, Oid relowner,
static void
refresh_by_heap_swap(Oid matviewOid, Oid OIDNewHeap, char relpersistence)
{
- finish_heap_swap(matviewOid, OIDNewHeap, false, false, true, true,
+ finish_heap_swap(matviewOid, OIDNewHeap, false, false, true, true, true,
RecentXmin, ReadNextMultiXactId(), relpersistence);
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 7fb088df72..f58c41c794 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4390,6 +4390,16 @@ AlterTableInternal(Oid relid, List *cmds, bool recurse)
rel = relation_open(relid, lockmode);
+ /*
+ * If lockmode allows, check if VACUUM FULL / CLUSTER CONCURRENTLY is in
+ * progress. If lockmode is too weak, cluster_rel() should detect
+ * incompatible DDLs executed by us.
+ *
+ * XXX We might skip the changes for DDLs which do not change the tuple
+ * descriptor.
+ */
+ check_for_concurrent_cluster(relid, lockmode);
+
EventTriggerAlterTableRelid(relid);
ATController(NULL, rel, cmds, recurse, lockmode, NULL);
@@ -5822,6 +5832,7 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
finish_heap_swap(tab->relid, OIDNewHeap,
false, false, true,
!OidIsValid(tab->newTableSpace),
+ true,
RecentXmin,
ReadNextMultiXactId(),
persistence);
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index d32068b5d5..359fbabd5d 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -112,7 +112,8 @@ static void vac_truncate_clog(TransactionId frozenXID,
TransactionId lastSaneFrozenXid,
MultiXactId lastSaneMinMulti);
static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
- BufferAccessStrategy bstrategy);
+ BufferAccessStrategy bstrategy, bool isTopLevel,
+ bool whole_database);
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
@@ -153,6 +154,7 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
bool analyze = false;
bool freeze = false;
bool full = false;
+ bool concurrent = false;
bool disable_page_skipping = false;
bool process_main = true;
bool process_toast = true;
@@ -226,6 +228,8 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
freeze = defGetBoolean(opt);
else if (strcmp(opt->defname, "full") == 0)
full = defGetBoolean(opt);
+ else if (strcmp(opt->defname, "concurrently") == 0)
+ concurrent = defGetBoolean(opt);
else if (strcmp(opt->defname, "disable_page_skipping") == 0)
disable_page_skipping = defGetBoolean(opt);
else if (strcmp(opt->defname, "index_cleanup") == 0)
@@ -300,7 +304,7 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
(skip_locked ? VACOPT_SKIP_LOCKED : 0) |
(analyze ? VACOPT_ANALYZE : 0) |
(freeze ? VACOPT_FREEZE : 0) |
- (full ? VACOPT_FULL : 0) |
+ (full ? (concurrent ? VACOPT_FULL_CONCURRENT : VACOPT_FULL_EXCLUSIVE) : 0) |
(disable_page_skipping ? VACOPT_DISABLE_PAGE_SKIPPING : 0) |
(process_main ? VACOPT_PROCESS_MAIN : 0) |
(process_toast ? VACOPT_PROCESS_TOAST : 0) |
@@ -380,6 +384,12 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
errmsg("ONLY_DATABASE_STATS cannot be specified with other VACUUM options")));
}
+ /* This mistake cannot be detected from params->options (CONCURRENTLY alone sets no bit), so check the raw booleans here. */
+ if (concurrent && !full)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("CONCURRENTLY can only be specified with VACUUM FULL")));
+
/*
* All freeze ages are zero if the FREEZE option is given; otherwise pass
* them as -1 which means to use the default values.
@@ -483,6 +493,7 @@ vacuum(List *relations, VacuumParams *params, BufferAccessStrategy bstrategy,
const char *stmttype;
volatile bool in_outer_xact,
use_own_xacts;
+ bool whole_database = false;
Assert(params != NULL);
@@ -543,7 +554,15 @@ vacuum(List *relations, VacuumParams *params, BufferAccessStrategy bstrategy,
relations = newrels;
}
else
+ {
relations = get_all_vacuum_rels(vac_context, params->options);
+ /*
+ * If all tables should be processed, the CONCURRENTLY option implies
+ * that we should skip system relations rather than raising ERRORs.
+ */
+ if (params->options & VACOPT_FULL_CONCURRENT)
+ whole_database = true;
+ }
/*
* Decide whether we need to start/commit our own transactions.
@@ -616,7 +635,8 @@ vacuum(List *relations, VacuumParams *params, BufferAccessStrategy bstrategy,
if (params->options & VACOPT_VACUUM)
{
- if (!vacuum_rel(vrel->oid, vrel->relation, params, bstrategy))
+ if (!vacuum_rel(vrel->oid, vrel->relation, params, bstrategy,
+ isTopLevel, whole_database))
continue;
}
@@ -1929,10 +1949,14 @@ vac_truncate_clog(TransactionId frozenXID,
/*
* vacuum_rel() -- vacuum one heap relation
*
- * relid identifies the relation to vacuum. If relation is supplied,
- * use the name therein for reporting any failure to open/lock the rel;
- * do not use it once we've successfully opened the rel, since it might
- * be stale.
+ * relid identifies the relation to vacuum. If relation is supplied, use
+ * the name therein for reporting any failure to open/lock the rel; do
+ * not use it once we've successfully opened the rel, since it might be
+ * stale.
+ *
+ * If whole_database is true, we are processing all the relations of the
+ * current database. In that case we might need to silently skip
+ * relations which could otherwise cause ERROR.
*
* Returns true if it's okay to proceed with a requested ANALYZE
* operation on this table.
@@ -1947,7 +1971,8 @@ vac_truncate_clog(TransactionId frozenXID,
*/
static bool
vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
- BufferAccessStrategy bstrategy)
+ BufferAccessStrategy bstrategy, bool isTopLevel,
+ bool whole_database)
{
LOCKMODE lmode;
Relation rel;
@@ -2010,10 +2035,11 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
/*
* Determine the type of lock we want --- hard exclusive lock for a FULL
- * vacuum, but just ShareUpdateExclusiveLock for concurrent vacuum. Either
- * way, we can be sure that no other backend is vacuuming the same table.
+ * exclusive vacuum, but a weaker lock (ShareUpdateExclusiveLock) for
+ * concurrent vacuum. Either way, we can be sure that no other backend is
+ * vacuuming the same table.
*/
- lmode = (params->options & VACOPT_FULL) ?
+ lmode = (params->options & VACOPT_FULL_EXCLUSIVE) ?
AccessExclusiveLock : ShareUpdateExclusiveLock;
/* open the relation and get the appropriate lock on it */
@@ -2028,6 +2054,39 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
return false;
}
+ /*
+ * Leave if the CONCURRENTLY option was passed, but the relation is not
+ * suitable for that. Note that we only skip such relations if the user
+ * wants to vacuum the whole database. In contrast, if the user specified
+ * inappropriate relation(s) explicitly, the command will end up with
+ * ERROR.
+ */
+ if (whole_database && (params->options & VACOPT_FULL_CONCURRENT) &&
+ !check_relation_is_clusterable_concurrently(rel, DEBUG1,
+ "VACUUM (FULL, CONCURRENTLY)"))
+ {
+ relation_close(rel, lmode);
+ PopActiveSnapshot();
+ CommitTransactionCommand();
+ return false;
+ }
+
+ /*
+ * Skip the relation if VACUUM FULL / CLUSTER CONCURRENTLY is in progress
+ * as it will drop the current storage of the relation.
+ *
+ * This check should not take place until we have a lock that prevents
+ * another backend from starting VACUUM FULL / CLUSTER CONCURRENTLY later.
+ */
+ Assert(lmode >= LOCK_CLUSTER_CONCURRENT);
+ if (is_concurrent_cluster_in_progress(relid))
+ {
+ relation_close(rel, lmode);
+ PopActiveSnapshot();
+ CommitTransactionCommand();
+ return false;
+ }
+
/*
* When recursing to a TOAST table, check privileges on the parent. NB:
* This is only safe to do because we hold a session lock on the main
@@ -2101,19 +2160,6 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
return true;
}
- /*
- * Get a session-level lock too. This will protect our access to the
- * relation across multiple transactions, so that we can vacuum the
- * relation's TOAST table (if any) secure in the knowledge that no one is
- * deleting the parent relation.
- *
- * NOTE: this cannot block, even if someone else is waiting for access,
- * because the lock manager knows that both lock requests are from the
- * same process.
- */
- lockrelid = rel->rd_lockInfo.lockRelId;
- LockRelationIdForSession(&lockrelid, lmode);
-
/*
* Set index_cleanup option based on index_cleanup reloption if it wasn't
* specified in VACUUM command, or when running in an autovacuum worker
@@ -2166,6 +2212,30 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
else
toast_relid = InvalidOid;
+ /*
+ * Get a session-level lock too. This will protect our access to the
+ * relation across multiple transactions, so that we can vacuum the
+ * relation's TOAST table (if any) secure in the knowledge that no one is
+ * deleting the parent relation.
+ *
+ * NOTE: this cannot block, even if someone else is waiting for access,
+ * because the lock manager knows that both lock requests are from the
+ * same process.
+ */
+ if (OidIsValid(toast_relid))
+ {
+ /*
+ * You might worry that, in the VACUUM (FULL, CONCURRENTLY) case,
+ * cluster_rel() needs to release all the locks on the relation at
+ * some point, but this session lock makes it impossible. In fact,
+ * cluster_rel() will eventually be called for the TOAST relation
+ * and raise ERROR because, in the concurrent mode, it cannot process
+ * the TOAST relation alone anyway.
+ */
+ lockrelid = rel->rd_lockInfo.lockRelId;
+ LockRelationIdForSession(&lockrelid, lmode);
+ }
+
/*
* Switch to the table owner's userid, so that any index functions are run
* as that user. Also lock down security-restricted operations and
@@ -2193,11 +2263,22 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
{
ClusterParams cluster_params = {0};
+ /*
+ * Invalid toast_relid means that there is no session lock on the
+ * relation. Such a lock would be a problem because it would
+ * prevent cluster_rel() from releasing all locks when it tries to
+ * get AccessExclusiveLock.
+ */
+ Assert(!OidIsValid(toast_relid));
+
if ((params->options & VACOPT_VERBOSE) != 0)
cluster_params.options |= CLUOPT_VERBOSE;
+ if ((params->options & VACOPT_FULL_CONCURRENT) != 0)
+ cluster_params.options |= CLUOPT_CONCURRENT;
+
/* VACUUM FULL is now a variant of CLUSTER; see cluster.c */
- cluster_rel(rel, InvalidOid, &cluster_params);
+ cluster_rel(rel, InvalidOid, &cluster_params, isTopLevel);
/*
* cluster_rel() should have closed the relation, lock is kept
@@ -2246,13 +2327,15 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
toast_vacuum_params.options |= VACOPT_PROCESS_MAIN;
toast_vacuum_params.toast_parent = relid;
- vacuum_rel(toast_relid, NULL, &toast_vacuum_params, bstrategy);
+ vacuum_rel(toast_relid, NULL, &toast_vacuum_params, bstrategy,
+ isTopLevel, whole_database);
}
/*
* Now release the session-level lock on the main table.
*/
- UnlockRelationIdForSession(&lockrelid, lmode);
+ if (OidIsValid(toast_relid))
+ UnlockRelationIdForSession(&lockrelid, lmode);
/* Report that we really did it. */
return true;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index d687ceee33..b3fb5d1825 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -33,6 +33,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecord.h"
#include "catalog/pg_control.h"
+#include "commands/cluster.h"
#include "replication/decode.h"
#include "replication/logical.h"
#include "replication/message.h"
@@ -467,6 +468,29 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
TransactionId xid = XLogRecGetXid(buf->record);
SnapBuild *builder = ctx->snapshot_builder;
+ /*
+ * Check if CLUSTER CONCURRENTLY is being performed by this backend. If
+ * so, only decode data changes of the table that it is processing, and
+ * the changes of its TOAST relation.
+ *
+ * (TOAST locator should not be set unless the main one is.)
+ */
+ Assert(!OidIsValid(clustered_rel_toast_locator.relNumber) ||
+ OidIsValid(clustered_rel_locator.relNumber));
+
+ if (OidIsValid(clustered_rel_locator.relNumber))
+ {
+ XLogReaderState *r = buf->record;
+ RelFileLocator locator;
+
+ XLogRecGetBlockTag(r, 0, &locator, NULL, NULL);
+
+ if (!RelFileLocatorEquals(locator, clustered_rel_locator) &&
+ (!OidIsValid(clustered_rel_toast_locator.relNumber) ||
+ !RelFileLocatorEquals(locator, clustered_rel_toast_locator)))
+ return;
+ }
+
ReorderBufferProcessXid(ctx->reorder, xid, buf->origptr);
/*
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index f96bafe5ec..b5e12a5cc9 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -625,6 +625,26 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
return SnapBuildMVCCFromHistoric(snap, true);
}
+/*
+ * Build an MVCC snapshot for the initial data load performed by CLUSTER
+ * CONCURRENTLY command.
+ *
+ * The snapshot will only be used to scan one particular relation, which is
+ * treated like a catalog (therefore ->building_full_snapshot is not
+ * important), and the caller should already have a replication slot set up
+ * (so we do not set MyProc->xmin). XXX Do we need to add any restrictions?
+ */
+Snapshot
+SnapBuildInitialSnapshotForCluster(SnapBuild *builder)
+{
+ Snapshot snap;
+
+ Assert(builder->state == SNAPBUILD_CONSISTENT);
+
+ snap = SnapBuildBuildSnapshot(builder);
+ return SnapBuildMVCCFromHistoric(snap, false);
+}
+
/*
* Turn a historic MVCC snapshot into an ordinary MVCC snapshot.
*
diff --git a/src/backend/replication/pgoutput_cluster/Makefile b/src/backend/replication/pgoutput_cluster/Makefile
new file mode 100644
index 0000000000..31471bb546
--- /dev/null
+++ b/src/backend/replication/pgoutput_cluster/Makefile
@@ -0,0 +1,32 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+# Makefile for src/backend/replication/pgoutput_cluster
+#
+# IDENTIFICATION
+# src/backend/replication/pgoutput_cluster/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/replication/pgoutput_cluster
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+ $(WIN32RES) \
+ pgoutput_cluster.o
+PGFILEDESC = "pgoutput_cluster - logical replication output plugin for CLUSTER command"
+NAME = pgoutput_cluster
+
+all: all-shared-lib
+
+include $(top_srcdir)/src/Makefile.shlib
+
+install: all installdirs install-lib
+
+installdirs: installdirs-lib
+
+uninstall: uninstall-lib
+
+clean distclean: clean-lib
+ rm -f $(OBJS)
diff --git a/src/backend/replication/pgoutput_cluster/meson.build b/src/backend/replication/pgoutput_cluster/meson.build
new file mode 100644
index 0000000000..0f033064f2
--- /dev/null
+++ b/src/backend/replication/pgoutput_cluster/meson.build
@@ -0,0 +1,18 @@
+# Copyright (c) 2022-2024, PostgreSQL Global Development Group
+
+pgoutput_cluster_sources = files(
+ 'pgoutput_cluster.c',
+)
+
+if host_system == 'windows'
+ pgoutput_cluster_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pgoutput_cluster',
+ '--FILEDESC', 'pgoutput_cluster - logical replication output plugin for CLUSTER command',])
+endif
+
+pgoutput_cluster = shared_module('pgoutput_cluster',
+ pgoutput_cluster_sources,
+ kwargs: pg_mod_args,
+)
+
+backend_targets += pgoutput_cluster
diff --git a/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c b/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
new file mode 100644
index 0000000000..c6baca1171
--- /dev/null
+++ b/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
@@ -0,0 +1,277 @@
+/* TODO Move into src/backend/cluster/ (and rename?) */
+/*-------------------------------------------------------------------------
+ *
+ * pgoutput_cluster.c
+ * Logical Replication output plugin for CLUSTER command
+ *
+ * Copyright (c) 2012-2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heaptoast.h"
+#include "commands/cluster.h"
+#include "replication/snapbuild.h"
+
+PG_MODULE_MAGIC;
+
+static void plugin_startup(LogicalDecodingContext *ctx,
+ OutputPluginOptions *opt, bool is_init);
+static void plugin_shutdown(LogicalDecodingContext *ctx);
+static void plugin_begin_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn);
+static void plugin_commit_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ Relation rel, ReorderBufferChange *change);
+static void plugin_truncate(struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, int nrelations,
+ Relation relations[],
+ ReorderBufferChange *change);
+static void store_change(LogicalDecodingContext *ctx,
+ ConcurrentChangeKind kind, HeapTuple tuple);
+
+void
+_PG_output_plugin_init(OutputPluginCallbacks *cb)
+{
+ AssertVariableIsOfType(&_PG_output_plugin_init, LogicalOutputPluginInit);
+
+ cb->startup_cb = plugin_startup;
+ cb->begin_cb = plugin_begin_txn;
+ cb->change_cb = plugin_change;
+ cb->truncate_cb = plugin_truncate;
+ cb->commit_cb = plugin_commit_txn;
+ cb->shutdown_cb = plugin_shutdown;
+}
+
+
+/* initialize this plugin */
+static void
+plugin_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
+ bool is_init)
+{
+ ctx->output_plugin_private = NULL;
+
+ /* Probably unnecessary, as we don't use the SQL interface ... */
+ opt->output_type = OUTPUT_PLUGIN_BINARY_OUTPUT;
+
+ if (ctx->output_plugin_options != NIL)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("this plugin does not expect any options")));
+ }
+}
+
+static void
+plugin_shutdown(LogicalDecodingContext *ctx)
+{
+}
+
+/*
+ * As we don't release the slot during processing of a particular table,
+ * there's no room for the SQL interface, even for debugging purposes.
+ * Therefore we need
+ * neither OutputPluginPrepareWrite() nor OutputPluginWrite() in the plugin
+ * callbacks. (Although we might want to write custom callbacks, this API
+ * seems to be unnecessarily generic for our purposes.)
+ */
+
+/* BEGIN callback */
+static void
+plugin_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+}
+
+/* COMMIT callback */
+static void
+plugin_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+}
+
+/*
+ * Callback for individual changed tuples
+ */
+static void
+plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ Relation relation, ReorderBufferChange *change)
+{
+ ClusterDecodingState *dstate;
+
+ dstate = (ClusterDecodingState *) ctx->output_writer_private;
+
+ /* Only interested in one particular relation. */
+ if (relation->rd_id != dstate->relid)
+ return;
+
+ /* Decode entry depending on its type */
+ switch (change->action)
+ {
+ case REORDER_BUFFER_CHANGE_INSERT:
+ {
+ HeapTuple newtuple;
+
+ newtuple = change->data.tp.newtuple;
+
+ /*
+ * Identity checks in the main function should have made this
+ * impossible.
+ */
+ if (newtuple == NULL)
+ elog(ERROR, "incomplete insert info");
+
+ store_change(ctx, CHANGE_INSERT, newtuple);
+ }
+ break;
+ case REORDER_BUFFER_CHANGE_UPDATE:
+ {
+ HeapTuple oldtuple,
+ newtuple;
+
+ oldtuple = change->data.tp.oldtuple;
+ newtuple = change->data.tp.newtuple;
+
+ if (newtuple == NULL)
+ elog(ERROR, "incomplete update info");
+
+ if (oldtuple != NULL)
+ store_change(ctx, CHANGE_UPDATE_OLD, oldtuple);
+
+ store_change(ctx, CHANGE_UPDATE_NEW, newtuple);
+ }
+ break;
+ case REORDER_BUFFER_CHANGE_DELETE:
+ {
+ HeapTuple oldtuple;
+
+ oldtuple = change->data.tp.oldtuple;
+
+ if (oldtuple == NULL)
+ elog(ERROR, "incomplete delete info");
+
+ store_change(ctx, CHANGE_DELETE, oldtuple);
+ }
+ break;
+ default:
+ /* Should not come here */
+ Assert(false);
+ break;
+ }
+}
+
+static void
+plugin_truncate(struct LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ int nrelations, Relation relations[],
+ ReorderBufferChange *change)
+{
+ ClusterDecodingState *dstate;
+ int i;
+ Relation relation = NULL;
+
+ dstate = (ClusterDecodingState *) ctx->output_writer_private;
+
+ /* Find the relation we are processing. */
+ for (i = 0; i < nrelations; i++)
+ {
+ relation = relations[i];
+
+ if (RelationGetRelid(relation) == dstate->relid)
+ break;
+ }
+
+ /* Does the truncation only affect other relations? */
+ if (i == nrelations)
+ return;
+
+ store_change(ctx, CHANGE_TRUNCATE, NULL);
+}
+
+/* Store concurrent data change. */
+static void
+store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
+ HeapTuple tuple)
+{
+ ClusterDecodingState *dstate;
+ char *change_raw;
+ ConcurrentChange *change;
+ bool flattened = false;
+ Size size;
+ Datum values[1];
+ bool isnull[1];
+ char *dst;
+
+ dstate = (ClusterDecodingState *) ctx->output_writer_private;
+
+ size = MAXALIGN(VARHDRSZ) + sizeof(ConcurrentChange);
+
+ if (tuple)
+ {
+ /*
+ * ReorderBufferCommit() stores the TOAST chunks in its private memory
+ * context and frees them after having called
+ * apply_change(). Therefore we need a flat copy (including TOAST) that
+ * we eventually copy into the memory context which is available to
+ * decode_concurrent_changes().
+ */
+ if (HeapTupleHasExternal(tuple))
+ {
+ /*
+ * toast_flatten_tuple_to_datum() might be more convenient but we
+ * don't want the decompression it does.
+ */
+ tuple = toast_flatten_tuple(tuple, dstate->tupdesc);
+ flattened = true;
+ }
+
+ size += tuple->t_len;
+ }
+
+ /* XXX Isn't there any function / macro to do this? */
+ if (size >= 0x3FFFFFFF)
+ elog(ERROR, "change is too big");
+
+ /* Construct the change. */
+ change_raw = (char *) palloc0(size);
+ SET_VARSIZE(change_raw, size);
+ change = (ConcurrentChange *) VARDATA(change_raw);
+ change->kind = kind;
+
+ /* No other information is needed for TRUNCATE. */
+ if (change->kind == CHANGE_TRUNCATE)
+ goto store;
+
+ /*
+ * Copy the tuple.
+ *
+ * CAUTION: change->tup_data.t_data must be fixed on retrieval!
+ */
+ memcpy(&change->tup_data, tuple, sizeof(HeapTupleData));
+ dst = (char *) change + sizeof(ConcurrentChange);
+ memcpy(dst, tuple->t_data, tuple->t_len);
+
+ /* The data has been copied. */
+ if (flattened)
+ pfree(tuple);
+
+store:
+ /* Store as tuple of 1 bytea column. */
+ values[0] = PointerGetDatum(change_raw);
+ isnull[0] = false;
+ tuplestore_putvalues(dstate->tstore, dstate->tupdesc_change,
+ values, isnull);
+
+ /* Accounting. */
+ dstate->nchanges++;
+
+ /* Cleanup. */
+ pfree(change_raw);
+}
+
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 35fa2e1dda..588d853194 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
#include "access/xlogprefetcher.h"
#include "access/xlogrecovery.h"
#include "commands/async.h"
+#include "commands/cluster.h"
#include "commands/waitlsn.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -152,6 +153,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, InjectionPointShmemSize());
size = add_size(size, SlotSyncShmemSize());
size = add_size(size, WaitLSNShmemSize());
+ size = add_size(size, ClusterShmemSize());
/* include additional requested shmem from preload libraries */
size = add_size(size, total_addin_request);
@@ -339,6 +341,7 @@ CreateOrAttachShmemStructs(void)
WaitEventCustomShmemInit();
InjectionPointShmemInit();
WaitLSNShmemInit();
+ ClusterShmemInit();
}
/*
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index b2ea8125c9..04e7571e70 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -1299,6 +1299,17 @@ ProcessUtilitySlow(ParseState *pstate,
lockmode = AlterTableGetLockLevel(atstmt->cmds);
relid = AlterTableLookupRelation(atstmt, lockmode);
+ /*
+ * If lockmode allows, check if VACUUM FULL / CLUSTER
+ * CONCURRENTLY is in progress. If lockmode is too weak,
+ * cluster_rel() should detect incompatible DDLs executed
+ * by us.
+ *
+ * XXX We might skip the changes for DDLs which do not
+ * change the tuple descriptor.
+ */
+ check_for_concurrent_cluster(relid, lockmode);
+
if (OidIsValid(relid))
{
AlterTableUtilityContext atcontext;
diff --git a/src/backend/utils/activity/backend_progress.c b/src/backend/utils/activity/backend_progress.c
index cc9b4cf0dc..0ba35a847e 100644
--- a/src/backend/utils/activity/backend_progress.c
+++ b/src/backend/utils/activity/backend_progress.c
@@ -163,3 +163,19 @@ pgstat_progress_end_command(void)
beentry->st_progress.command_target = InvalidOid;
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
+
+void
+pgstat_progress_restore_state(PgBackendProgress *backup)
+{
+ volatile PgBackendStatus *beentry = MyBEEntry;
+
+ if (!beentry || !pgstat_track_activities)
+ return;
+
+ PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
+ beentry->st_progress.command = backup->command;
+ beentry->st_progress.command_target = backup->command_target;
+ memcpy((void *) beentry->st_progress.param, backup->param,
+ sizeof(beentry->st_progress.param));
+ PGSTAT_END_WRITE_ACTIVITY(beentry);
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 8efb4044d6..8b9dfe865b 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -347,6 +347,7 @@ DSMRegistry "Waiting to read or update the dynamic shared memory registry."
InjectionPoint "Waiting to read or update information related to injection points."
SerialControl "Waiting to read or update shared <filename>pg_serial</filename> state."
WaitLSN "Waiting to read or update shared Wait-for-LSN state."
+ClusteredRels "Waiting to read or update information on tables being clustered concurrently."
#
# END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 603aa4157b..5a2d5d6138 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -1373,6 +1373,28 @@ CacheInvalidateRelcache(Relation relation)
RegisterRelcacheInvalidation(databaseId, relationId);
}
+/*
+ * CacheInvalidateRelcacheImmediate
+ * Send invalidation message for the specified relation's relcache entry.
+ *
+ * Currently this is used in VACUUM FULL/CLUSTER CONCURRENTLY, to make sure
+ * that other backends are aware that the command is being executed for the
+ * relation.
+ */
+void
+CacheInvalidateRelcacheImmediate(Relation relation)
+{
+ SharedInvalidationMessage msg;
+
+ msg.rc.id = SHAREDINVALRELCACHE_ID;
+ msg.rc.dbId = MyDatabaseId;
+ msg.rc.relId = RelationGetRelid(relation);
+ /* check AddCatcacheInvalidationMessage() for an explanation */
+ VALGRIND_MAKE_MEM_DEFINED(&msg, sizeof(msg));
+
+ SendSharedInvalidMessages(&msg, 1);
+}
+
/*
* CacheInvalidateRelcacheAll
* Register invalidation of the whole relcache at the end of command.
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 66ed24e401..708d1ee27a 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -64,6 +64,7 @@
#include "catalog/pg_type.h"
#include "catalog/schemapg.h"
#include "catalog/storage.h"
+#include "commands/cluster.h"
#include "commands/policy.h"
#include "commands/publicationcmds.h"
#include "commands/trigger.h"
@@ -1257,6 +1258,10 @@ retry:
/* make sure relation is marked as having no open file yet */
relation->rd_smgr = NULL;
+ /* Is CLUSTER CONCURRENTLY in progress? */
+ relation->rd_cluster_concurrent =
+ is_concurrent_cluster_in_progress(targetRelId);
+
/*
* now we can free the memory allocated for pg_class_tuple
*/
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index b8b500f48f..6be0fef84c 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -156,7 +156,6 @@ static List *exportedSnapshots = NIL;
/* Prototypes for local functions */
static void UnregisterSnapshotNoOwner(Snapshot snapshot);
-static void FreeSnapshot(Snapshot snapshot);
static void SnapshotResetXmin(void);
/* ResourceOwner callbacks to track snapshot references */
@@ -625,7 +624,7 @@ CopySnapshot(Snapshot snapshot)
* FreeSnapshot
* Free the memory associated with a snapshot.
*/
-static void
+void
FreeSnapshot(Snapshot snapshot)
{
Assert(snapshot->regd_count == 0);
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index a7ccde6d7d..57acf2a279 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2773,7 +2773,7 @@ psql_completion(const char *text, int start, int end)
* one word, so the above test is correct.
*/
if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
- COMPLETE_WITH("VERBOSE");
+ COMPLETE_WITH("VERBOSE", "CONCURRENTLY");
}
/* COMMENT */
@@ -4744,7 +4744,8 @@ psql_completion(const char *text, int start, int end)
"DISABLE_PAGE_SKIPPING", "SKIP_LOCKED",
"INDEX_CLEANUP", "PROCESS_MAIN", "PROCESS_TOAST",
"TRUNCATE", "PARALLEL", "SKIP_DATABASE_STATS",
- "ONLY_DATABASE_STATS", "BUFFER_USAGE_LIMIT");
+ "ONLY_DATABASE_STATS", "BUFFER_USAGE_LIMIT",
+ "CONCURRENTLY");
else if (TailMatches("FULL|FREEZE|ANALYZE|VERBOSE|DISABLE_PAGE_SKIPPING|SKIP_LOCKED|PROCESS_MAIN|PROCESS_TOAST|TRUNCATE|SKIP_DATABASE_STATS|ONLY_DATABASE_STATS"))
COMPLETE_WITH("ON", "OFF");
else if (TailMatches("INDEX_CLEANUP"))
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 9e9aec88a6..8687ec8796 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -405,6 +405,10 @@ extern HTSV_Result HeapTupleSatisfiesVacuumHorizon(HeapTuple htup, Buffer buffer
TransactionId *dead_after);
extern void HeapTupleSetHintBits(HeapTupleHeader tuple, Buffer buffer,
uint16 infomask, TransactionId xid);
+extern bool HeapTupleMVCCInserted(HeapTuple htup, Snapshot snapshot,
+ Buffer buffer);
+extern bool HeapTupleMVCCNotDeleted(HeapTuple htup, Snapshot snapshot,
+ Buffer buffer);
extern bool HeapTupleHeaderIsOnlyLocked(HeapTupleHeader tuple);
extern bool HeapTupleIsSurelyDead(HeapTuple htup,
struct GlobalVisState *vistest);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index da661289c1..1380ba81fc 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,6 +21,7 @@
#include "access/sdir.h"
#include "access/xact.h"
#include "executor/tuptable.h"
+#include "replication/logical.h"
#include "storage/read_stream.h"
#include "utils/rel.h"
#include "utils/snapshot.h"
@@ -630,6 +631,8 @@ typedef struct TableAmRoutine
Relation OldIndex,
bool use_sort,
TransactionId OldestXmin,
+ Snapshot snapshot,
+ LogicalDecodingContext *decoding_ctx,
TransactionId *xid_cutoff,
MultiXactId *multi_cutoff,
double *num_tuples,
@@ -1667,6 +1670,10 @@ table_relation_copy_data(Relation rel, const RelFileLocator *newrlocator)
* not needed for the relation's AM
* - *xid_cutoff - ditto
* - *multi_cutoff - ditto
+ * - snapshot - if != NULL, ignore data changes done by transactions that this
+ * (MVCC) snapshot considers still in-progress or in the future.
+ * - decoding_ctx - logical decoding context, to capture concurrent data
+ * changes.
*
* Output parameters:
* - *xid_cutoff - rel's new relfrozenxid value, may be invalid
@@ -1679,6 +1686,8 @@ table_relation_copy_for_cluster(Relation OldTable, Relation NewTable,
Relation OldIndex,
bool use_sort,
TransactionId OldestXmin,
+ Snapshot snapshot,
+ LogicalDecodingContext *decoding_ctx,
TransactionId *xid_cutoff,
MultiXactId *multi_cutoff,
double *num_tuples,
@@ -1687,6 +1696,7 @@ table_relation_copy_for_cluster(Relation OldTable, Relation NewTable,
{
OldTable->rd_tableam->relation_copy_for_cluster(OldTable, NewTable, OldIndex,
use_sort, OldestXmin,
+ snapshot, decoding_ctx,
xid_cutoff, multi_cutoff,
num_tuples, tups_vacuumed,
tups_recently_dead);
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 7d434f8e65..77d522561b 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -99,6 +99,9 @@ extern Oid index_concurrently_create_copy(Relation heapRelation,
Oid tablespaceOid,
const char *newName);
+extern NullableDatum *get_index_stattargets(Oid indexid,
+ IndexInfo *indInfo);
+
extern void index_concurrently_build(Oid heapRelationId,
Oid indexRelationId);
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 7492796ea2..959899a7cc 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -13,10 +13,15 @@
#ifndef CLUSTER_H
#define CLUSTER_H
+#include "nodes/execnodes.h"
#include "nodes/parsenodes.h"
#include "parser/parse_node.h"
+#include "replication/logical.h"
#include "storage/lock.h"
+#include "storage/relfilelocator.h"
#include "utils/relcache.h"
+#include "utils/resowner.h"
+#include "utils/tuplestore.h"
/* flag bits for ClusterParams->options */
@@ -24,6 +29,7 @@
#define CLUOPT_RECHECK 0x02 /* recheck relation state */
#define CLUOPT_RECHECK_ISCLUSTERED 0x04 /* recheck relation state for
* indisclustered */
+#define CLUOPT_CONCURRENT 0x08 /* allow concurrent data changes */
/* options for CLUSTER */
typedef struct ClusterParams
@@ -31,12 +37,101 @@ typedef struct ClusterParams
bits32 options; /* bitmask of CLUOPT_* */
} ClusterParams;
+/*
+ * The following definitions are used for concurrent processing.
+ */
+
+extern RelFileLocator clustered_rel_locator;
+extern RelFileLocator clustered_rel_toast_locator;
+
+/*
+ * Lock level for the concurrent variant of CLUSTER / VACUUM FULL.
+ *
+ * Like for lazy VACUUM, we choose the strongest lock that still allows
+ * INSERT, UPDATE and DELETE.
+ *
+ * Note that the lock needs to be released temporarily a few times during the
+ * processing. In such cases it should be checked after re-locking that the
+ * relation / index hasn't changed in the system catalog while the lock was
+ * not held.
+ */
+#define LOCK_CLUSTER_CONCURRENT ShareUpdateExclusiveLock
+
+typedef enum
+{
+ CHANGE_INSERT,
+ CHANGE_UPDATE_OLD,
+ CHANGE_UPDATE_NEW,
+ CHANGE_DELETE,
+ CHANGE_TRUNCATE
+} ConcurrentChangeKind;
+
+typedef struct ConcurrentChange
+{
+ /* See the enum above. */
+ ConcurrentChangeKind kind;
+
+ /*
+ * The actual tuple.
+ *
+ * The tuple data follows the ConcurrentChange structure. Before use make
+ * sure the tuple is correctly aligned (ConcurrentChange can be stored as
+ * bytea) and that tuple->t_data is fixed.
+ */
+ HeapTupleData tup_data;
+} ConcurrentChange;
+
+/*
+ * Logical decoding state.
+ *
+ * Here we store the data changes that we decode from WAL while the table
+ * contents are being copied to new storage, along with the metadata
+ * needed to apply these changes to the table.
+ */
+typedef struct ClusterDecodingState
+{
+ /* The relation whose changes we're decoding. */
+ Oid relid;
+
+ /*
+ * Decoded changes are stored here. Although we try to avoid excessive
+ * batches, it can happen that the changes need to be stored to disk. The
+ * tuplestore does this transparently.
+ */
+ Tuplestorestate *tstore;
+
+ /* The current number of changes in tstore. */
+ double nchanges;
+
+ /*
+ * Descriptor to store the ConcurrentChange structure serialized (bytea).
+ * We can't store the tuple directly because the tuplestore only supports
+ * minimal tuples and we may need to transfer the OID system column from the
+ * output plugin. Also we need to transfer the change kind, so it's better
+ * to put everything in the structure than to use two tuplestores "in
+ * parallel".
+ */
+ TupleDesc tupdesc_change;
+
+ /* Tuple descriptor needed to update indexes. */
+ TupleDesc tupdesc;
+
+ /* Slot to retrieve data from tstore. */
+ TupleTableSlot *tsslot;
+
+ ResourceOwner resowner;
+} ClusterDecodingState;
+
extern void cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel);
-extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params);
+extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
+ bool isTopLevel);
extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid,
LOCKMODE lockmode);
extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
-
+extern bool check_relation_is_clusterable_concurrently(Relation rel, int elevel,
+ const char *stmt);
+extern void cluster_decode_concurrent_changes(LogicalDecodingContext *ctx,
+ XLogRecPtr end_of_wal);
extern Oid make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
char relpersistence, LOCKMODE lockmode_old,
LOCKMODE *lockmode_new_p);
@@ -45,8 +140,13 @@ extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
bool swap_toast_by_content,
bool check_constraints,
bool is_internal,
+ bool reindex,
TransactionId frozenXid,
MultiXactId cutoffMulti,
char newrelpersistence);
+extern Size ClusterShmemSize(void);
+extern void ClusterShmemInit(void);
+extern bool is_concurrent_cluster_in_progress(Oid relid);
+extern void check_for_concurrent_cluster(Oid relid, LOCKMODE lockmode);
#endif /* CLUSTER_H */
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 5616d64523..03e3712ede 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -59,19 +59,22 @@
#define PROGRESS_CLUSTER_PHASE 1
#define PROGRESS_CLUSTER_INDEX_RELID 2
#define PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED 3
-#define PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN 4
-#define PROGRESS_CLUSTER_TOTAL_HEAP_BLKS 5
-#define PROGRESS_CLUSTER_HEAP_BLKS_SCANNED 6
-#define PROGRESS_CLUSTER_INDEX_REBUILD_COUNT 7
+#define PROGRESS_CLUSTER_HEAP_TUPLES_INSERTED 4
+#define PROGRESS_CLUSTER_HEAP_TUPLES_UPDATED 5
+#define PROGRESS_CLUSTER_HEAP_TUPLES_DELETED 6
+#define PROGRESS_CLUSTER_TOTAL_HEAP_BLKS 7
+#define PROGRESS_CLUSTER_HEAP_BLKS_SCANNED 8
+#define PROGRESS_CLUSTER_INDEX_REBUILD_COUNT 9
/* Phases of cluster (as advertised via PROGRESS_CLUSTER_PHASE) */
#define PROGRESS_CLUSTER_PHASE_SEQ_SCAN_HEAP 1
#define PROGRESS_CLUSTER_PHASE_INDEX_SCAN_HEAP 2
#define PROGRESS_CLUSTER_PHASE_SORT_TUPLES 3
#define PROGRESS_CLUSTER_PHASE_WRITE_NEW_HEAP 4
-#define PROGRESS_CLUSTER_PHASE_SWAP_REL_FILES 5
-#define PROGRESS_CLUSTER_PHASE_REBUILD_INDEX 6
-#define PROGRESS_CLUSTER_PHASE_FINAL_CLEANUP 7
+#define PROGRESS_CLUSTER_PHASE_CATCH_UP 5
+#define PROGRESS_CLUSTER_PHASE_SWAP_REL_FILES 6
+#define PROGRESS_CLUSTER_PHASE_REBUILD_INDEX 7
+#define PROGRESS_CLUSTER_PHASE_FINAL_CLEANUP 8
/* Commands of PROGRESS_CLUSTER */
#define PROGRESS_CLUSTER_COMMAND_CLUSTER 1
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 759f9a87d3..2f693e0fc0 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -181,13 +181,16 @@ typedef struct VacAttrStats
#define VACOPT_ANALYZE 0x02 /* do ANALYZE */
#define VACOPT_VERBOSE 0x04 /* output INFO instrumentation messages */
#define VACOPT_FREEZE 0x08 /* FREEZE option */
-#define VACOPT_FULL 0x10 /* FULL (non-concurrent) vacuum */
-#define VACOPT_SKIP_LOCKED 0x20 /* skip if cannot get lock */
-#define VACOPT_PROCESS_MAIN 0x40 /* process main relation */
-#define VACOPT_PROCESS_TOAST 0x80 /* process the TOAST table, if any */
-#define VACOPT_DISABLE_PAGE_SKIPPING 0x100 /* don't skip any pages */
-#define VACOPT_SKIP_DATABASE_STATS 0x200 /* skip vac_update_datfrozenxid() */
-#define VACOPT_ONLY_DATABASE_STATS 0x400 /* only vac_update_datfrozenxid() */
+#define VACOPT_FULL_EXCLUSIVE 0x10 /* FULL (non-concurrent) vacuum */
+#define VACOPT_FULL_CONCURRENT 0x20 /* FULL (concurrent) vacuum */
+#define VACOPT_SKIP_LOCKED 0x40 /* skip if cannot get lock */
+#define VACOPT_PROCESS_MAIN 0x80 /* process main relation */
+#define VACOPT_PROCESS_TOAST 0x100 /* process the TOAST table, if any */
+#define VACOPT_DISABLE_PAGE_SKIPPING 0x200 /* don't skip any pages */
+#define VACOPT_SKIP_DATABASE_STATS 0x400 /* skip vac_update_datfrozenxid() */
+#define VACOPT_ONLY_DATABASE_STATS 0x800 /* only vac_update_datfrozenxid() */
+
+#define VACOPT_FULL (VACOPT_FULL_EXCLUSIVE | VACOPT_FULL_CONCURRENT)
/*
* Values used by index_cleanup and truncate params.
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index ad06e80784..b38eb0d530 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -69,6 +69,7 @@ extern void FreeSnapshotBuilder(SnapBuild *builder);
extern void SnapBuildSnapDecRefcount(Snapshot snap);
extern Snapshot SnapBuildInitialSnapshot(SnapBuild *builder);
+extern Snapshot SnapBuildInitialSnapshotForCluster(SnapBuild *builder);
extern Snapshot SnapBuildMVCCFromHistoric(Snapshot snapshot, bool in_place);
extern const char *SnapBuildExportSnapshot(SnapBuild *builder);
extern void SnapBuildClearExportedSnapshot(void);
diff --git a/src/include/storage/lockdefs.h b/src/include/storage/lockdefs.h
index 934ba84f6a..cac3d7f8c7 100644
--- a/src/include/storage/lockdefs.h
+++ b/src/include/storage/lockdefs.h
@@ -36,7 +36,7 @@ typedef int LOCKMODE;
#define AccessShareLock 1 /* SELECT */
#define RowShareLock 2 /* SELECT FOR UPDATE/FOR SHARE */
#define RowExclusiveLock 3 /* INSERT, UPDATE, DELETE */
-#define ShareUpdateExclusiveLock 4 /* VACUUM (non-FULL), ANALYZE, CREATE
+#define ShareUpdateExclusiveLock 4 /* VACUUM (non-exclusive), ANALYZE, CREATE
* INDEX CONCURRENTLY */
#define ShareLock 5 /* CREATE INDEX (WITHOUT CONCURRENTLY) */
#define ShareRowExclusiveLock 6 /* like EXCLUSIVE MODE, but allows ROW
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 88dc79b2bd..cff17a6bd0 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -84,3 +84,4 @@ PG_LWLOCK(50, DSMRegistry)
PG_LWLOCK(51, InjectionPoint)
PG_LWLOCK(52, SerialControl)
PG_LWLOCK(53, WaitLSN)
+PG_LWLOCK(54, ClusteredRels)
diff --git a/src/include/utils/backend_progress.h b/src/include/utils/backend_progress.h
index e09598eafc..5ab5df9d41 100644
--- a/src/include/utils/backend_progress.h
+++ b/src/include/utils/backend_progress.h
@@ -35,7 +35,7 @@ typedef enum ProgressCommandType
/*
* Any command which wishes can advertise that it is running by setting
- * command, command_target, and param[]. command_target should be the OID of
+ * command, command_target, and param[]. command_target should be the OID of
* the relation which the command targets (we assume there's just one, as this
* is meant for utility commands), but the meaning of each element in the
* param array is command-specific.
@@ -55,6 +55,7 @@ extern void pgstat_progress_parallel_incr_param(int index, int64 incr);
extern void pgstat_progress_update_multi_param(int nparam, const int *index,
const int64 *val);
extern void pgstat_progress_end_command(void);
+extern void pgstat_progress_restore_state(PgBackendProgress *backup);
#endif /* BACKEND_PROGRESS_H */
diff --git a/src/include/utils/inval.h b/src/include/utils/inval.h
index 24695facf2..4acf9d0ed9 100644
--- a/src/include/utils/inval.h
+++ b/src/include/utils/inval.h
@@ -42,6 +42,8 @@ extern void CacheInvalidateCatalog(Oid catalogId);
extern void CacheInvalidateRelcache(Relation relation);
+extern void CacheInvalidateRelcacheImmediate(Relation relation);
+
extern void CacheInvalidateRelcacheAll(void);
extern void CacheInvalidateRelcacheByTuple(HeapTuple classTuple);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 8700204953..adda46c985 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -253,6 +253,9 @@ typedef struct RelationData
bool pgstat_enabled; /* should relation stats be counted */
/* use "struct" here to avoid needing to include pgstat.h: */
struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+ /* Is CLUSTER CONCURRENTLY being performed on this relation? */
+ bool rd_cluster_concurrent;
} RelationData;
@@ -684,7 +687,9 @@ RelationCloseSmgr(Relation relation)
#define RelationIsAccessibleInLogicalDecoding(relation) \
(XLogLogicalInfoActive() && \
RelationNeedsWAL(relation) && \
- (IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation)))
+ (IsCatalogRelation(relation) || \
+ RelationIsUsedAsCatalogTable(relation) || \
+ (relation)->rd_cluster_concurrent))
/*
* RelationIsLogicallyLogged
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index e7ac89f484..f58c9108fc 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -69,6 +69,8 @@ extern void SnapshotSetCommandId(CommandId curcid);
extern Snapshot GetOldestSnapshot(void);
extern Snapshot CopySnapshot(Snapshot snapshot);
+extern void FreeSnapshot(Snapshot snapshot);
+
extern Snapshot GetCatalogSnapshot(Oid relid);
extern Snapshot GetNonHistoricCatalogSnapshot(Oid relid);
extern void InvalidateCatalogSnapshot(void);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 862433ee52..8a3eaf2a7f 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1958,17 +1958,20 @@ pg_stat_progress_cluster| SELECT s.pid,
WHEN 2 THEN 'index scanning heap'::text
WHEN 3 THEN 'sorting tuples'::text
WHEN 4 THEN 'writing new heap'::text
- WHEN 5 THEN 'swapping relation files'::text
- WHEN 6 THEN 'rebuilding index'::text
- WHEN 7 THEN 'performing final cleanup'::text
+ WHEN 5 THEN 'catch-up'::text
+ WHEN 6 THEN 'swapping relation files'::text
+ WHEN 7 THEN 'rebuilding index'::text
+ WHEN 8 THEN 'performing final cleanup'::text
ELSE NULL::text
END AS phase,
(s.param3)::oid AS cluster_index_relid,
s.param4 AS heap_tuples_scanned,
- s.param5 AS heap_tuples_written,
- s.param6 AS heap_blks_total,
- s.param7 AS heap_blks_scanned,
- s.param8 AS index_rebuild_count
+ s.param5 AS heap_tuples_inserted,
+ s.param6 AS heap_tuples_updated,
+ s.param7 AS heap_tuples_deleted,
+ s.param8 AS heap_blks_total,
+ s.param9 AS heap_blks_scanned,
+ s.param10 AS index_rebuild_count
FROM (pg_stat_get_progress_info('CLUSTER'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_progress_copy| SELECT s.pid,
--
2.45.2
v02-0005-Preserve-visibility-information-of-the-concurrent-da.patch (text/x-diff)
From 598d86e94165e672a88005bb22c2e1f01f5d46fe Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Tue, 27 Aug 2024 12:13:18 +0200
Subject: [PATCH 5/8] Preserve visibility information of the concurrent data
changes.
As explained in the commit message of the preceding patch of the series, the
data changes done by applications while VACUUM FULL / CLUSTER CONCURRENTLY is
copying the table contents to a new file are decoded from WAL and eventually
applied to that new file. To reduce the complexity a little, the preceding
patch uses the current transaction (i.e. the transaction opened by the VACUUM
FULL / CLUSTER command) to execute those INSERT, UPDATE and DELETE commands.
However, neither VACUUM nor CLUSTER is expected to change tuple visibility.
Therefore, this patch fixes the handling of the "concurrent data changes":
the tuples written into the new table storage now carry the same XID and
command ID (CID) as they had in the old storage.
A related change is that the data changes (INSERT, UPDATE, DELETE) we
"replay" on the new storage are not themselves logically decoded. First, the
logical decoding subsystem does not expect an already committed transaction
to be decoded again. Second, repeated decoding would just be wasted effort.
---
src/backend/access/common/toast_internals.c | 3 +-
src/backend/access/heap/heapam.c | 73 ++++++++----
src/backend/access/heap/heapam_handler.c | 14 ++-
src/backend/access/transam/xact.c | 52 ++++++++
src/backend/commands/cluster.c | 111 ++++++++++++++++--
src/backend/replication/logical/decode.c | 76 ++++++++++--
src/backend/replication/logical/snapbuild.c | 22 ++--
.../pgoutput_cluster/pgoutput_cluster.c | 68 +++++++++--
src/include/access/heapam.h | 15 ++-
src/include/access/heapam_xlog.h | 2 +
src/include/access/xact.h | 2 +
src/include/commands/cluster.h | 18 +++
12 files changed, 386 insertions(+), 70 deletions(-)
diff --git a/src/backend/access/common/toast_internals.c b/src/backend/access/common/toast_internals.c
index 90d0654e62..183055647b 100644
--- a/src/backend/access/common/toast_internals.c
+++ b/src/backend/access/common/toast_internals.c
@@ -320,7 +320,8 @@ toast_save_datum(Relation rel, Datum value,
memcpy(VARDATA(&chunk_data), data_p, chunk_size);
toasttup = heap_form_tuple(toasttupDesc, t_values, t_isnull);
- heap_insert(toastrel, toasttup, mycid, options, NULL);
+ heap_insert(toastrel, toasttup, GetCurrentTransactionId(), mycid,
+ options, NULL);
/*
* Create the index entry. We cheat a little here by not using
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 1fdcc0abee..69bf4d1c8d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -75,7 +75,8 @@ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
HeapTuple newtup, HeapTuple old_key_tuple,
- bool all_visible_cleared, bool new_all_visible_cleared);
+ bool all_visible_cleared, bool new_all_visible_cleared,
+ bool wal_logical);
static Bitmapset *HeapDetermineColumnsInfo(Relation relation,
Bitmapset *interesting_cols,
Bitmapset *external_cols,
@@ -1975,7 +1976,7 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
/*
* heap_insert - insert tuple into a heap
*
- * The new tuple is stamped with current transaction ID and the specified
+ * The new tuple is stamped with the specified transaction ID and the specified
* command ID.
*
* See table_tuple_insert for comments about most of the input flags, except
@@ -1991,15 +1992,16 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
* reflected into *tup.
*/
void
-heap_insert(Relation relation, HeapTuple tup, CommandId cid,
- int options, BulkInsertState bistate)
+heap_insert(Relation relation, HeapTuple tup, TransactionId xid,
+ CommandId cid, int options, BulkInsertState bistate)
{
- TransactionId xid = GetCurrentTransactionId();
HeapTuple heaptup;
Buffer buffer;
Buffer vmbuffer = InvalidBuffer;
bool all_visible_cleared = false;
+ Assert(TransactionIdIsValid(xid));
+
/* Cheap, simplistic check that the tuple matches the rel's rowtype. */
Assert(HeapTupleHeaderGetNatts(tup->t_data) <=
RelationGetNumberOfAttributes(relation));
@@ -2630,7 +2632,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
void
simple_heap_insert(Relation relation, HeapTuple tup)
{
- heap_insert(relation, tup, GetCurrentCommandId(true), 0, NULL);
+ heap_insert(relation, tup, GetCurrentTransactionId(),
+ GetCurrentCommandId(true), 0, NULL);
}
/*
@@ -2687,11 +2690,11 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
*/
TM_Result
heap_delete(Relation relation, ItemPointer tid,
- CommandId cid, Snapshot crosscheck, bool wait,
- TM_FailureData *tmfd, bool changingPart)
+ TransactionId xid, CommandId cid, Snapshot crosscheck, bool wait,
+ TM_FailureData *tmfd, bool changingPart,
+ bool wal_logical)
{
TM_Result result;
- TransactionId xid = GetCurrentTransactionId();
ItemId lp;
HeapTupleData tp;
Page page;
@@ -2708,6 +2711,7 @@ heap_delete(Relation relation, ItemPointer tid,
bool old_key_copied = false;
Assert(ItemPointerIsValid(tid));
+ Assert(TransactionIdIsValid(xid));
/*
* Forbid this during a parallel operation, lest it allocate a combo CID.
@@ -2933,7 +2937,8 @@ l1:
* Compute replica identity tuple before entering the critical section so
* we don't PANIC upon a memory allocation failure.
*/
- old_key_tuple = ExtractReplicaIdentity(relation, &tp, true, &old_key_copied);
+ old_key_tuple = wal_logical ?
+ ExtractReplicaIdentity(relation, &tp, true, &old_key_copied) : NULL;
/*
* If this is the first possibly-multixact-able operation in the current
@@ -3001,8 +3006,12 @@ l1:
/*
* For logical decode we need combo CIDs to properly decode the
* catalog
+ *
+ * Like in heap_insert(), visibility is unchanged when called from
+ * VACUUM FULL / CLUSTER.
*/
- if (RelationIsAccessibleInLogicalDecoding(relation))
+ if (wal_logical &&
+ RelationIsAccessibleInLogicalDecoding(relation))
log_heap_new_cid(relation, &tp);
xlrec.flags = 0;
@@ -3023,6 +3032,15 @@ l1:
xlrec.flags |= XLH_DELETE_CONTAINS_OLD_KEY;
}
+ /*
+ * Unlike UPDATE, DELETE is decoded even if there is no old key, so it
+ * does not help to clear both XLH_DELETE_CONTAINS_OLD_TUPLE and
+ * XLH_DELETE_CONTAINS_OLD_KEY. Thus we need an extra flag. TODO
+ * Consider not decoding tuples w/o the old tuple/key instead.
+ */
+ if (!wal_logical)
+ xlrec.flags |= XLH_DELETE_NO_LOGICAL;
+
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapDelete);
@@ -3112,10 +3130,11 @@ simple_heap_delete(Relation relation, ItemPointer tid)
TM_Result result;
TM_FailureData tmfd;
- result = heap_delete(relation, tid,
+ result = heap_delete(relation, tid, GetCurrentTransactionId(),
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &tmfd, false /* changingPart */ );
+ &tmfd, false, /* changingPart */
+ true /* wal_logical */);
switch (result)
{
case TM_SelfModified:
@@ -3154,12 +3173,11 @@ simple_heap_delete(Relation relation, ItemPointer tid)
*/
TM_Result
heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
- CommandId cid, Snapshot crosscheck, bool wait,
- TM_FailureData *tmfd, LockTupleMode *lockmode,
- TU_UpdateIndexes *update_indexes)
+ TransactionId xid, CommandId cid, Snapshot crosscheck,
+ bool wait, TM_FailureData *tmfd, LockTupleMode *lockmode,
+ TU_UpdateIndexes *update_indexes, bool wal_logical)
{
TM_Result result;
- TransactionId xid = GetCurrentTransactionId();
Bitmapset *hot_attrs;
Bitmapset *sum_attrs;
Bitmapset *key_attrs;
@@ -3199,6 +3217,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
infomask2_new_tuple;
Assert(ItemPointerIsValid(otid));
+ Assert(TransactionIdIsValid(xid));
/* Cheap, simplistic check that the tuple matches the rel's rowtype. */
Assert(HeapTupleHeaderGetNatts(newtup->t_data) <=
@@ -3987,8 +4006,12 @@ l2:
/*
* For logical decoding we need combo CIDs to properly decode the
* catalog.
+ *
+ * Like in heap_insert(), visibility is unchanged when called from
+ * VACUUM FULL / CLUSTER.
*/
- if (RelationIsAccessibleInLogicalDecoding(relation))
+ if (wal_logical &&
+ RelationIsAccessibleInLogicalDecoding(relation))
{
log_heap_new_cid(relation, &oldtup);
log_heap_new_cid(relation, heaptup);
@@ -3998,7 +4021,8 @@ l2:
newbuf, &oldtup, heaptup,
old_key_tuple,
all_visible_cleared,
- all_visible_cleared_new);
+ all_visible_cleared_new,
+ wal_logical);
if (newbuf != buffer)
{
PageSetLSN(BufferGetPage(newbuf), recptr);
@@ -4231,10 +4255,10 @@ simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup,
TM_FailureData tmfd;
LockTupleMode lockmode;
- result = heap_update(relation, otid, tup,
+ result = heap_update(relation, otid, tup, GetCurrentTransactionId(),
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &tmfd, &lockmode, update_indexes);
+ &tmfd, &lockmode, update_indexes, true);
switch (result)
{
case TM_SelfModified:
@@ -8363,7 +8387,8 @@ static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup, HeapTuple newtup,
HeapTuple old_key_tuple,
- bool all_visible_cleared, bool new_all_visible_cleared)
+ bool all_visible_cleared, bool new_all_visible_cleared,
+ bool wal_logical)
{
xl_heap_update xlrec;
xl_heap_header xlhdr;
@@ -8374,10 +8399,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
suffixlen = 0;
XLogRecPtr recptr;
Page page = BufferGetPage(newbuf);
- bool need_tuple_data = RelationIsLogicallyLogged(reln);
+ bool need_tuple_data;
bool init;
int bufflags;
+ need_tuple_data = RelationIsLogicallyLogged(reln) && wal_logical;
+
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 4ddb1c4a0c..a8999a3e72 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -252,7 +252,8 @@ heapam_tuple_insert(Relation relation, TupleTableSlot *slot, CommandId cid,
tuple->t_tableOid = slot->tts_tableOid;
/* Perform the insertion, and copy the resulting ItemPointer */
- heap_insert(relation, tuple, cid, options, bistate);
+ heap_insert(relation, tuple, GetCurrentTransactionId(), cid, options,
+ bistate);
ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
if (shouldFree)
@@ -275,7 +276,8 @@ heapam_tuple_insert_speculative(Relation relation, TupleTableSlot *slot,
options |= HEAP_INSERT_SPECULATIVE;
/* Perform the insertion, and copy the resulting ItemPointer */
- heap_insert(relation, tuple, cid, options, bistate);
+ heap_insert(relation, tuple, GetCurrentTransactionId(), cid, options,
+ bistate);
ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
if (shouldFree)
@@ -309,7 +311,8 @@ heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
* the storage itself is cleaning the dead tuples by itself, it is the
* time to call the index tuple deletion also.
*/
- return heap_delete(relation, tid, cid, crosscheck, wait, tmfd, changingPart);
+ return heap_delete(relation, tid, GetCurrentTransactionId(), cid,
+ crosscheck, wait, tmfd, changingPart, true);
}
@@ -327,8 +330,9 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
slot->tts_tableOid = RelationGetRelid(relation);
tuple->t_tableOid = slot->tts_tableOid;
- result = heap_update(relation, otid, tuple, cid, crosscheck, wait,
- tmfd, lockmode, update_indexes);
+ result = heap_update(relation, otid, tuple, GetCurrentTransactionId(),
+ cid, crosscheck, wait,
+ tmfd, lockmode, update_indexes, true);
ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
/*
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index dfc8cf2dcf..954356b5c2 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -125,6 +125,18 @@ static FullTransactionId XactTopFullTransactionId = {InvalidTransactionId};
static int nParallelCurrentXids = 0;
static TransactionId *ParallelCurrentXids;
+/*
+ * Another case that requires TransactionIdIsCurrentTransactionId() to behave
+ * specially is when CLUSTER CONCURRENTLY is processing data changes made in
+ * the old storage of a table by other transactions. When applying the changes
+ * to the new storage, the backend executing the CLUSTER command needs to act
+ * on behalf of those other transactions. The transactions responsible for the
+ * changes in the old storage are stored in this array, sorted by
+ * xidComparator.
+ */
+static int nClusterCurrentXids = 0;
+static TransactionId *ClusterCurrentXids = NULL;
+
/*
* Miscellaneous flag bits to record events which occur on the top level
* transaction. These flags are only persisted in MyXactFlags and are intended
@@ -971,6 +983,8 @@ TransactionIdIsCurrentTransactionId(TransactionId xid)
int low,
high;
+ Assert(nClusterCurrentXids == 0);
+
low = 0;
high = nParallelCurrentXids - 1;
while (low <= high)
@@ -990,6 +1004,21 @@ TransactionIdIsCurrentTransactionId(TransactionId xid)
return false;
}
+ /*
+ * When executing CLUSTER CONCURRENTLY, the array of current transactions
+ * is given.
+ */
+ if (nClusterCurrentXids > 0)
+ {
+ Assert(nParallelCurrentXids == 0);
+
+ return bsearch(&xid,
+ ClusterCurrentXids,
+ nClusterCurrentXids,
+ sizeof(TransactionId),
+ xidComparator) != NULL;
+ }
+
/*
* We will return true for the Xid of the current subtransaction, any of
* its subcommitted children, any of its parents, or any of their
@@ -5627,6 +5656,29 @@ EndParallelWorkerTransaction(void)
CurrentTransactionState->blockState = TBLOCK_DEFAULT;
}
+/*
+ * SetClusterCurrentXids
+ * Set the XID array that TransactionIdIsCurrentTransactionId() should
+ * use.
+ */
+void
+SetClusterCurrentXids(TransactionId *xip, int xcnt)
+{
+ ClusterCurrentXids = xip;
+ nClusterCurrentXids = xcnt;
+}
+
+/*
+ * ResetClusterCurrentXids
+ * Undo the effect of SetClusterCurrentXids().
+ */
+void
+ResetClusterCurrentXids(void)
+{
+ ClusterCurrentXids = NULL;
+ nClusterCurrentXids = 0;
+}
+
/*
* ShowTransactionState
* Debug support
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 7bd81ff84b..b9aeb237ba 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -202,6 +202,7 @@ static void apply_concurrent_delete(Relation rel, HeapTuple tup_target,
ConcurrentChange *change);
static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
HeapTuple tup_key,
+ Snapshot snapshot,
IndexInsertState *iistate,
TupleTableSlot *ident_slot,
IndexScanDesc *scan_p);
@@ -2994,6 +2995,9 @@ setup_logical_decoding(Oid relid, const char *slotname, TupleDesc tupdesc)
dstate->relid = relid;
dstate->tstore = tuplestore_begin_heap(false, false,
maintenance_work_mem);
+#ifdef USE_ASSERT_CHECKING
+ dstate->last_change_xid = InvalidTransactionId;
+#endif
dstate->tupdesc = tupdesc;
/* Initialize the descriptor to store the changes ... */
@@ -3146,6 +3150,7 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
tup_exist;
char *change_raw;
ConcurrentChange *change;
+ Snapshot snapshot;
bool isnull[1];
Datum values[1];
@@ -3214,8 +3219,30 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
/*
* Find the tuple to be updated or deleted.
+ *
+ * As the table being CLUSTERed concurrently is considered a
+ * "user catalog", the new CID is WAL-logged and decoded. And since we
+ * use the same XID that the original DMLs did, the snapshot used
+ * for the logical decoding (by now converted to a non-historic
+ * MVCC snapshot) should see the tuples inserted previously into
+ * the new heap and/or updated there.
+ */
+ snapshot = change->snapshot;
+
+ /*
+ * Set what should be considered current transaction (and
+ * subtransactions) during visibility check.
+ *
+ * Note that this snapshot was created from a historic snapshot
+ * using SnapBuildMVCCFromHistoric(), which does not touch
+ * 'subxip'. Thus, unlike in a regular MVCC snapshot, the array
+ * only contains the transactions whose data changes we are
+ * applying, and their subtransactions. That's exactly what we need
+ * to check whether a particular xact is a "current transaction".
*/
- tup_exist = find_target_tuple(rel, key, nkeys, tup_key,
+ SetClusterCurrentXids(snapshot->subxip, snapshot->subxcnt);
+
+ tup_exist = find_target_tuple(rel, key, nkeys, tup_key, snapshot,
iistate, ident_slot, &ind_scan);
if (tup_exist == NULL)
elog(ERROR, "Failed to find target tuple");
@@ -3226,6 +3253,8 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
else
apply_concurrent_delete(rel, tup_exist, change);
+ ResetClusterCurrentXids();
+
if (tup_old != NULL)
{
pfree(tup_old);
@@ -3238,11 +3267,14 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
else
elog(ERROR, "Unrecognized kind of change: %d", change->kind);
- /* If there's any change, make it visible to the next iteration. */
- if (change->kind != CHANGE_UPDATE_OLD)
+ /* Free the snapshot if this is the last change that needed it. */
+ Assert(change->snapshot->active_count > 0);
+ change->snapshot->active_count--;
+ if (change->snapshot->active_count == 0)
{
- CommandCounterIncrement();
- UpdateActiveSnapshotCommandId();
+ if (change->snapshot == dstate->snapshot)
+ dstate->snapshot = NULL;
+ FreeSnapshot(change->snapshot);
}
/* TTSOpsMinimalTuple has .get_heap_tuple==NULL. */
@@ -3262,10 +3294,30 @@ static void
apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
IndexInsertState *iistate, TupleTableSlot *index_slot)
{
+ Snapshot snapshot = change->snapshot;
List *recheck;
+ /*
+ * For INSERT, the visibility information is not important, but we use the
+ * snapshot to get CID. Index functions might need the whole snapshot
+ * anyway.
+ */
+ SetClusterCurrentXids(snapshot->subxip, snapshot->subxcnt);
- heap_insert(rel, tup, GetCurrentCommandId(true), HEAP_INSERT_NO_LOGICAL, NULL);
+ /*
+ * Write the tuple into the new heap.
+ *
+ * The snapshot is the one we used to decode the insert (though converted
+ * to "non-historic" MVCC snapshot), i.e. the snapshot's curcid is the
+ * tuple CID incremented by one (due to the "new CID" WAL record that got
+ * written along with the INSERT record). Thus if we want to use the
+ * original CID, we need to subtract 1 from curcid.
+ */
+ Assert(snapshot->curcid != InvalidCommandId &&
+ snapshot->curcid > FirstCommandId);
+
+ heap_insert(rel, tup, change->xid, snapshot->curcid - 1,
+ HEAP_INSERT_NO_LOGICAL, NULL);
/*
* Update indexes.
@@ -3273,6 +3325,7 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
* In case functions in the index need the active snapshot and caller
* hasn't set one.
*/
+ PushActiveSnapshot(snapshot);
ExecStoreHeapTuple(tup, index_slot, false);
recheck = ExecInsertIndexTuples(iistate->rri,
index_slot,
@@ -3283,6 +3336,8 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
NIL, /* arbiterIndexes */
false /* onlySummarizing */
);
+ PopActiveSnapshot();
+ ResetClusterCurrentXids();
/*
* If recheck is required, it must have been performed on the source
@@ -3300,18 +3355,36 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
TupleTableSlot *index_slot)
{
List *recheck;
+ LockTupleMode lockmode;
TU_UpdateIndexes update_indexes;
+ TM_Result res;
+ Snapshot snapshot = change->snapshot;
+ TM_FailureData tmfd;
/*
* Write the new tuple into the new heap. ('tup' gets the TID assigned
* here.)
+ *
+ * Regarding CID, see the comment in apply_concurrent_insert().
*/
- simple_heap_update(rel, &tup_target->t_self, tup, &update_indexes);
+ Assert(snapshot->curcid != InvalidCommandId &&
+ snapshot->curcid > FirstCommandId);
+
+ res = heap_update(rel, &tup_target->t_self, tup,
+ change->xid, snapshot->curcid - 1,
+ InvalidSnapshot,
+ false, /* no wait - only we are doing changes */
+ &tmfd, &lockmode, &update_indexes,
+ /* wal_logical */
+ false);
+ if (res != TM_Ok)
+ ereport(ERROR, (errmsg("failed to apply concurrent UPDATE")));
ExecStoreHeapTuple(tup, index_slot, false);
if (update_indexes != TU_None)
{
+ PushActiveSnapshot(snapshot);
recheck = ExecInsertIndexTuples(iistate->rri,
index_slot,
iistate->estate,
@@ -3321,6 +3394,7 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
NIL, /* arbiterIndexes */
/* onlySummarizing */
update_indexes == TU_Summarizing);
+ PopActiveSnapshot();
list_free(recheck);
}
@@ -3331,7 +3405,22 @@ static void
apply_concurrent_delete(Relation rel, HeapTuple tup_target,
ConcurrentChange *change)
{
- simple_heap_delete(rel, &tup_target->t_self);
+ TM_Result res;
+ TM_FailureData tmfd;
+ Snapshot snapshot = change->snapshot;
+
+ /* Regarding CID, see the comment in apply_concurrent_insert(). */
+ Assert(snapshot->curcid != InvalidCommandId &&
+ snapshot->curcid > FirstCommandId);
+
+ res = heap_delete(rel, &tup_target->t_self, change->xid,
+ snapshot->curcid - 1, InvalidSnapshot, false,
+ &tmfd, false,
+ /* wal_logical */
+ false);
+
+ if (res != TM_Ok)
+ ereport(ERROR, (errmsg("failed to apply concurrent DELETE")));
pgstat_progress_incr_param(PROGRESS_CLUSTER_HEAP_TUPLES_DELETED, 1);
}
@@ -3349,7 +3438,7 @@ apply_concurrent_delete(Relation rel, HeapTuple tup_target,
*/
static HeapTuple
find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
- IndexInsertState *iistate,
+ Snapshot snapshot, IndexInsertState *iistate,
TupleTableSlot *ident_slot, IndexScanDesc *scan_p)
{
IndexScanDesc scan;
@@ -3357,7 +3446,7 @@ find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
int2vector *ident_indkey;
HeapTuple result = NULL;
- scan = index_beginscan(rel, iistate->ident_index, GetActiveSnapshot(),
+ scan = index_beginscan(rel, iistate->ident_index, snapshot,
nkeys, 0);
*scan_p = scan;
index_rescan(scan, key, nkeys, NULL, 0);
@@ -3429,6 +3518,8 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
}
PG_FINALLY();
{
+ ResetClusterCurrentXids();
+
if (rel_src)
rel_dst->rd_toastoid = InvalidOid;
}
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index b3fb5d1825..1f30e12537 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -469,9 +469,18 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
SnapBuild *builder = ctx->snapshot_builder;
/*
- * Check if CLUSTER CONCURRENTLY is being performed by this backend. If
- * so, only decode data changes of the table that it is processing, and
- * the changes of its TOAST relation.
+ * If the change is not intended for logical decoding, do not even
+ * establish a transaction for it. This is particularly important if the
+ * record was generated by CLUSTER CONCURRENTLY because this command uses
+ * the original XID when doing changes in the new storage. The decoding
+ * subsystem probably does not expect to see the same transaction multiple
+ * times.
+ */
+
+ /*
+ * First, check if CLUSTER CONCURRENTLY is being performed by this
+ * backend. If so, only decode data changes of the table that it is
+ * processing, and the changes of its TOAST relation.
*
* (TOAST locator should not be set unless the main is.)
*/
@@ -491,6 +500,60 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
return;
}
+ /*
+ * Second, skip records which do not contain sufficient information for
+ * the decoding.
+ *
+ * The backend executing CLUSTER CONCURRENTLY should not return here
+ * because the records that passed the checks above should be
+ * eligible for decoding. However, CLUSTER CONCURRENTLY generates WAL when
+ * writing data into the new table, which should not be decoded by the
+ * other backends. This is where the other backends skip them.
+ */
+ switch (info)
+ {
+ case XLOG_HEAP_INSERT:
+ {
+ xl_heap_insert *rec;
+
+ rec = (xl_heap_insert *) XLogRecGetData(buf->record);
+ /*
+ * (Besides insertions into the main heap by CLUSTER CONCURRENTLY,
+ * this also happens when raw_heap_insert marks the TOAST record
+ * as HEAP_INSERT_NO_LOGICAL.)
+ */
+ if ((rec->flags & XLH_INSERT_CONTAINS_NEW_TUPLE) == 0)
+ return;
+
+ break;
+ }
+
+ case XLOG_HEAP_HOT_UPDATE:
+ case XLOG_HEAP_UPDATE:
+ {
+ xl_heap_update *rec;
+
+ rec = (xl_heap_update *) XLogRecGetData(buf->record);
+ if ((rec->flags &
+ (XLH_UPDATE_CONTAINS_NEW_TUPLE |
+ XLH_UPDATE_CONTAINS_OLD_TUPLE |
+ XLH_UPDATE_CONTAINS_OLD_KEY)) == 0)
+ return;
+
+ break;
+ }
+
+ case XLOG_HEAP_DELETE:
+ {
+ xl_heap_delete *rec;
+
+ rec = (xl_heap_delete *) XLogRecGetData(buf->record);
+ if (rec->flags & XLH_DELETE_NO_LOGICAL)
+ return;
+ break;
+ }
+ }
+
ReorderBufferProcessXid(ctx->reorder, xid, buf->origptr);
/*
@@ -927,13 +990,6 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
xlrec = (xl_heap_insert *) XLogRecGetData(r);
- /*
- * Ignore insert records without new tuples (this does happen when
- * raw_heap_insert marks the TOAST record as HEAP_INSERT_NO_LOGICAL).
- */
- if (!(xlrec->flags & XLH_INSERT_CONTAINS_NEW_TUPLE))
- return;
-
/* only interested in our database */
XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
if (target_locator.dbOid != ctx->slot->data.database)
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index b5e12a5cc9..bc1814e6f6 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -294,7 +294,7 @@ static bool ExportInProgress = false;
static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
/* snapshot building/manipulation/distribution functions */
-static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
+static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder, XLogRecPtr lsn);
static void SnapBuildFreeSnapshot(Snapshot snap);
@@ -491,12 +491,17 @@ SnapBuildSnapDecRefcount(Snapshot snap)
* Build a new snapshot, based on currently committed catalog-modifying
* transactions.
*
+ * 'lsn' is the location of the commit record (of a catalog-changing
+ * transaction) that triggered creation of the snapshot. Pass
+ * InvalidXLogRecPtr for the transaction base snapshot, or if the user of
+ * the snapshot does not need the LSN.
+ *
* In-progress transactions with catalog access are *not* allowed to modify
* these snapshots; they have to copy them and fill in appropriate ->curcid
* and ->subxip/subxcnt values.
*/
static Snapshot
-SnapBuildBuildSnapshot(SnapBuild *builder)
+SnapBuildBuildSnapshot(SnapBuild *builder, XLogRecPtr lsn)
{
Snapshot snapshot;
Size ssize;
@@ -564,6 +569,7 @@ SnapBuildBuildSnapshot(SnapBuild *builder)
snapshot->active_count = 0;
snapshot->regd_count = 0;
snapshot->snapXactCompletionCount = 0;
+ snapshot->lsn = lsn;
return snapshot;
}
@@ -600,7 +606,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
if (TransactionIdIsValid(MyProc->xmin))
elog(ERROR, "cannot build an initial slot snapshot when MyProc->xmin already is valid");
- snap = SnapBuildBuildSnapshot(builder);
+ snap = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
/*
* We know that snap->xmin is alive, enforced by the logical xmin
@@ -641,7 +647,7 @@ SnapBuildInitialSnapshotForCluster(SnapBuild *builder)
Assert(builder->state == SNAPBUILD_CONSISTENT);
- snap = SnapBuildBuildSnapshot(builder);
+ snap = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
return SnapBuildMVCCFromHistoric(snap, false);
}
@@ -775,7 +781,7 @@ SnapBuildGetOrBuildSnapshot(SnapBuild *builder)
/* only build a new snapshot if we don't have a prebuilt one */
if (builder->snapshot == NULL)
{
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
/* increase refcount for the snapshot builder */
SnapBuildSnapIncRefcount(builder->snapshot);
}
@@ -855,7 +861,7 @@ SnapBuildProcessChange(SnapBuild *builder, TransactionId xid, XLogRecPtr lsn)
/* only build a new snapshot if we don't have a prebuilt one */
if (builder->snapshot == NULL)
{
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, lsn);
/* increase refcount for the snapshot builder */
SnapBuildSnapIncRefcount(builder->snapshot);
}
@@ -1224,7 +1230,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
if (builder->snapshot)
SnapBuildSnapDecRefcount(builder->snapshot);
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, lsn);
/* we might need to execute invalidations, add snapshot */
if (!ReorderBufferXidHasBaseSnapshot(builder->reorder, xid))
@@ -2059,7 +2065,7 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
{
SnapBuildSnapDecRefcount(builder->snapshot);
}
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
SnapBuildSnapIncRefcount(builder->snapshot);
ReorderBufferSetRestartPoint(builder->reorder, lsn);
diff --git a/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c b/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
index c6baca1171..db6a2bcf1f 100644
--- a/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
+++ b/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
@@ -33,7 +33,8 @@ static void plugin_truncate(struct LogicalDecodingContext *ctx,
Relation relations[],
ReorderBufferChange *change);
static void store_change(LogicalDecodingContext *ctx,
- ConcurrentChangeKind kind, HeapTuple tuple);
+ ConcurrentChangeKind kind, HeapTuple tuple,
+ TransactionId xid);
void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
@@ -101,6 +102,7 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change)
{
ClusterDecodingState *dstate;
+ Snapshot snapshot;
dstate = (ClusterDecodingState *) ctx->output_writer_private;
@@ -108,6 +110,48 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (relation->rd_id != dstate->relid)
return;
+ /*
+ * Catalog snapshot is fine because the table we are processing is
+ * temporarily considered a user catalog table.
+ */
+ snapshot = GetCatalogSnapshot(InvalidOid);
+ Assert(snapshot->snapshot_type == SNAPSHOT_HISTORIC_MVCC);
+ Assert(!snapshot->suboverflowed);
+
+ /*
+ * This should not happen, but if we don't have enough information to
+ * apply a new snapshot, the consequences would be bad. Thus prefer ERROR
+ * to Assert().
+ */
+ if (XLogRecPtrIsInvalid(snapshot->lsn))
+ ereport(ERROR, (errmsg("snapshot has invalid LSN")));
+
+ /*
+ * reorderbuffer.c changes the catalog snapshot as soon as it sees a new
+ * CID or a commit record of a catalog-changing transaction.
+ */
+ if (dstate->snapshot == NULL || snapshot->lsn != dstate->snapshot_lsn ||
+ snapshot->curcid != dstate->snapshot->curcid)
+ {
+ /* CID should not go backwards. */
+ Assert(dstate->snapshot == NULL ||
+ snapshot->curcid >= dstate->snapshot->curcid ||
+ change->txn->xid != dstate->last_change_xid);
+
+ /*
+ * XXX Is it a problem that the copy is created in
+ * TopTransactionContext?
+ *
+ * XXX Wouldn't it be o.k. for SnapBuildMVCCFromHistoric() to set xcnt
+ * to 0 instead of converting xip in this case? The point is that
+ * transactions which are still in progress from the perspective of
+ * reorderbuffer.c could not be replayed yet, so we do not need to
+ * examine their XIDs.
+ */
+ dstate->snapshot = SnapBuildMVCCFromHistoric(snapshot, false);
+ dstate->snapshot_lsn = snapshot->lsn;
+ }
+
/* Decode entry depending on its type */
switch (change->action)
{
@@ -125,7 +169,7 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (newtuple == NULL)
elog(ERROR, "Incomplete insert info.");
- store_change(ctx, CHANGE_INSERT, newtuple);
+ store_change(ctx, CHANGE_INSERT, newtuple, change->txn->xid);
}
break;
case REORDER_BUFFER_CHANGE_UPDATE:
@@ -142,9 +186,11 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
elog(ERROR, "Incomplete update info.");
if (oldtuple != NULL)
- store_change(ctx, CHANGE_UPDATE_OLD, oldtuple);
+ store_change(ctx, CHANGE_UPDATE_OLD, oldtuple,
+ change->txn->xid);
- store_change(ctx, CHANGE_UPDATE_NEW, newtuple);
+ store_change(ctx, CHANGE_UPDATE_NEW, newtuple,
+ change->txn->xid);
}
break;
case REORDER_BUFFER_CHANGE_DELETE:
@@ -157,7 +203,7 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (oldtuple == NULL)
elog(ERROR, "Incomplete delete info.");
- store_change(ctx, CHANGE_DELETE, oldtuple);
+ store_change(ctx, CHANGE_DELETE, oldtuple, change->txn->xid);
}
break;
default:
@@ -191,13 +237,13 @@ plugin_truncate(struct LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (i == nrelations)
return;
- store_change(ctx, CHANGE_TRUNCATE, NULL);
+ store_change(ctx, CHANGE_TRUNCATE, NULL, InvalidTransactionId);
}
/* Store concurrent data change. */
static void
store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
- HeapTuple tuple)
+ HeapTuple tuple, TransactionId xid)
{
ClusterDecodingState *dstate;
char *change_raw;
@@ -257,6 +303,11 @@ store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
dst = (char *) change + sizeof(ConcurrentChange);
memcpy(dst, tuple->t_data, tuple->t_len);
+ /* Initialize the other fields. */
+ change->xid = xid;
+ change->snapshot = dstate->snapshot;
+ dstate->snapshot->active_count++;
+
/* The data has been copied. */
if (flattened)
pfree(tuple);
@@ -267,6 +318,9 @@ store:
isnull[0] = false;
tuplestore_putvalues(dstate->tstore, dstate->tupdesc_change,
values, isnull);
+#ifdef USE_ASSERT_CHECKING
+ dstate->last_change_xid = xid;
+#endif
/* Accounting. */
dstate->nchanges++;
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 8687ec8796..e87eb2f861 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -316,21 +316,24 @@ extern BulkInsertState GetBulkInsertState(void);
extern void FreeBulkInsertState(BulkInsertState);
extern void ReleaseBulkInsertStatePin(BulkInsertState bistate);
-extern void heap_insert(Relation relation, HeapTuple tup, CommandId cid,
- int options, BulkInsertState bistate);
+extern void heap_insert(Relation relation, HeapTuple tup, TransactionId xid,
+ CommandId cid, int options, BulkInsertState bistate);
extern void heap_multi_insert(Relation relation, struct TupleTableSlot **slots,
int ntuples, CommandId cid, int options,
BulkInsertState bistate);
extern TM_Result heap_delete(Relation relation, ItemPointer tid,
- CommandId cid, Snapshot crosscheck, bool wait,
- struct TM_FailureData *tmfd, bool changingPart);
+ TransactionId xid, CommandId cid,
+ Snapshot crosscheck, bool wait,
+ struct TM_FailureData *tmfd, bool changingPart,
+ bool wal_logical);
extern void heap_finish_speculative(Relation relation, ItemPointer tid);
extern void heap_abort_speculative(Relation relation, ItemPointer tid);
extern TM_Result heap_update(Relation relation, ItemPointer otid,
- HeapTuple newtup,
+ HeapTuple newtup, TransactionId xid,
CommandId cid, Snapshot crosscheck, bool wait,
struct TM_FailureData *tmfd, LockTupleMode *lockmode,
- TU_UpdateIndexes *update_indexes);
+ TU_UpdateIndexes *update_indexes,
+ bool wal_logical);
extern TM_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
CommandId cid, LockTupleMode mode, LockWaitPolicy wait_policy,
bool follow_updates,
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 42736f37e7..1c5cb7c728 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -103,6 +103,8 @@
#define XLH_DELETE_CONTAINS_OLD_KEY (1<<2)
#define XLH_DELETE_IS_SUPER (1<<3)
#define XLH_DELETE_IS_PARTITION_MOVE (1<<4)
+/* See heap_delete() */
+#define XLH_DELETE_NO_LOGICAL (1<<5)
/* convenience macro for checking whether any form of old tuple was logged */
#define XLH_DELETE_CONTAINS_OLD \
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 6d4439f052..e0016631f6 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -476,6 +476,8 @@ extern Size EstimateTransactionStateSpace(void);
extern void SerializeTransactionState(Size maxsize, char *start_address);
extern void StartParallelWorkerTransaction(char *tstatespace);
extern void EndParallelWorkerTransaction(void);
+extern void SetClusterCurrentXids(TransactionId *xip, int xcnt);
+extern void ResetClusterCurrentXids(void);
extern bool IsTransactionBlock(void);
extern bool IsTransactionOrTransactionBlock(void);
extern char TransactionBlockStatusCode(void);
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 959899a7cc..61ea314399 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -71,6 +71,14 @@ typedef struct ConcurrentChange
/* See the enum above. */
ConcurrentChangeKind kind;
+ /* Transaction that changes the data. */
+ TransactionId xid;
+
+ /*
+ * Historic catalog snapshot that was used to decode this change.
+ */
+ Snapshot snapshot;
+
/*
* The actual tuple.
*
@@ -99,6 +107,8 @@ typedef struct ClusterDecodingState
* tuplestore does this transparently.
*/
Tuplestorestate *tstore;
+ /* XID of the last change added to tstore. */
+ TransactionId last_change_xid PG_USED_FOR_ASSERTS_ONLY;
/* The current number of changes in tstore. */
double nchanges;
@@ -119,6 +129,14 @@ typedef struct ClusterDecodingState
/* Slot to retrieve data from tstore. */
TupleTableSlot *tsslot;
+ /*
+ * Historic catalog snapshot that was used to decode the most recent
+ * change.
+ */
+ Snapshot snapshot;
+ /* LSN of the record */
+ XLogRecPtr snapshot_lsn;
+
ResourceOwner resowner;
} ClusterDecodingState;
--
2.45.2
Attachment: v02-0006-Add-regression-tests.patch (text/x-diff)
From bedd75e6c4e8e3ce13b91d3442f40814ef64b164 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Tue, 27 Aug 2024 12:13:18 +0200
Subject: [PATCH 6/8] Add regression tests.
As this patch series adds the CONCURRENTLY option to the VACUUM FULL and
CLUSTER commands, it's appropriate to test that the "concurrent data changes"
(i.e. changes done by the application while we are copying the table contents to
the new storage) are processed correctly.
Injection points are used to stop the data copying at some point. While the
backend in charge of the copying is waiting on the injection point, another
backend runs some INSERT, UPDATE and DELETE commands on the table. Then we
wake up the first backend and let the CLUSTER (CONCURRENTLY) command
finish. Finally we check that all the "concurrent data changes" are present in
the table and that they contain the correct visibility information.
---
src/backend/commands/cluster.c | 7 +
src/test/modules/injection_points/Makefile | 3 +-
.../injection_points/expected/cluster.out | 113 ++++++++++++++
.../modules/injection_points/logical.conf | 1 +
src/test/modules/injection_points/meson.build | 3 +
.../injection_points/specs/cluster.spec | 140 ++++++++++++++++++
6 files changed, 266 insertions(+), 1 deletion(-)
create mode 100644 src/test/modules/injection_points/expected/cluster.out
create mode 100644 src/test/modules/injection_points/logical.conf
create mode 100644 src/test/modules/injection_points/specs/cluster.spec
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index b9aeb237ba..490fa3cfef 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -59,6 +59,7 @@
#include "utils/acl.h"
#include "utils/fmgroids.h"
#include "utils/guc.h"
+#include "utils/injection_point.h"
#include "utils/inval.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
@@ -3741,6 +3742,12 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
*/
ident_key = build_identity_key(ident_idx_new, OldHeap, &ident_key_nentries);
+ /*
+ * During testing, wait for another backend to perform concurrent data
+ * changes which we will process below.
+ */
+ INJECTION_POINT("cluster-concurrently-before-lock");
+
/*
* Flush all WAL records inserted so far (possibly except for the last
* incomplete page, see GetInsertRecPtr), to minimize the amount of data
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index ed28cd13a8..799b04e959 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -13,7 +13,8 @@ PGFILEDESC = "injection_points - facility for injection points"
REGRESS = injection_points
REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
-ISOLATION = inplace
+ISOLATION = inplace cluster
+ISOLATION_OPTS = --temp-config $(top_srcdir)/src/test/modules/injection_points/logical.conf
TAP_TESTS = 1
diff --git a/src/test/modules/injection_points/expected/cluster.out b/src/test/modules/injection_points/expected/cluster.out
new file mode 100644
index 0000000000..d84fff3693
--- /dev/null
+++ b/src/test/modules/injection_points/expected/cluster.out
@@ -0,0 +1,113 @@
+Parsed test spec with 2 sessions
+
+starting permutation: wait_before_lock change_existing change_new change_subxact1 change_subxact2 check2 wakeup_before_lock check1
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step wait_before_lock:
+ CLUSTER (CONCURRENTLY) clstr_test USING clstr_test_pkey;
+ <waiting ...>
+step change_existing:
+ UPDATE clstr_test SET i=10 where i=1;
+ UPDATE clstr_test SET j=20 where i=2;
+ UPDATE clstr_test SET i=30 where i=3;
+ UPDATE clstr_test SET i=40 where i=30;
+ DELETE FROM clstr_test WHERE i=4;
+
+step change_new:
+ INSERT INTO clstr_test(i, j) VALUES (5, 5), (6, 6), (7, 7), (8, 8);
+ UPDATE clstr_test SET i=50 where i=5;
+ UPDATE clstr_test SET j=60 where i=6;
+ DELETE FROM clstr_test WHERE i=7;
+
+step change_subxact1:
+ BEGIN;
+ INSERT INTO clstr_test(i, j) VALUES (100, 100);
+ SAVEPOINT s1;
+ UPDATE clstr_test SET i=101 where i=100;
+ SAVEPOINT s2;
+ UPDATE clstr_test SET i=102 where i=101;
+ COMMIT;
+
+step change_subxact2:
+ BEGIN;
+ SAVEPOINT s1;
+ INSERT INTO clstr_test(i, j) VALUES (110, 110);
+ ROLLBACK TO SAVEPOINT s1;
+ INSERT INTO clstr_test(i, j) VALUES (110, 111);
+ COMMIT;
+
+step check2:
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='clstr_test';
+
+ SELECT i, j FROM clstr_test ORDER BY i, j;
+
+ INSERT INTO data_s2(_xmin, _cmin, i, j)
+ SELECT xmin, cmin, i, j FROM clstr_test;
+
+ i| j
+---+---
+ 2| 20
+ 6| 60
+ 8| 8
+ 10| 1
+ 40| 3
+ 50| 5
+102|100
+110|111
+(8 rows)
+
+step wakeup_before_lock:
+ SELECT injection_points_wakeup('cluster-concurrently-before-lock');
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step wait_before_lock: <... completed>
+step check1:
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='clstr_test';
+
+ SELECT count(DISTINCT node) FROM relfilenodes;
+
+ SELECT i, j FROM clstr_test ORDER BY i, j;
+
+ INSERT INTO data_s1(_xmin, _cmin, i, j)
+ SELECT xmin, cmin, i, j FROM clstr_test;
+
+ SELECT count(*)
+ FROM data_s1 d1 FULL JOIN data_s2 d2 USING (_xmin, _cmin, i, j)
+ WHERE d1.i ISNULL OR d2.i ISNULL;
+
+count
+-----
+ 2
+(1 row)
+
+ i| j
+---+---
+ 2| 20
+ 6| 60
+ 8| 8
+ 10| 1
+ 40| 3
+ 50| 5
+102|100
+110|111
+(8 rows)
+
+count
+-----
+ 0
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
diff --git a/src/test/modules/injection_points/logical.conf b/src/test/modules/injection_points/logical.conf
new file mode 100644
index 0000000000..c8f264bc6c
--- /dev/null
+++ b/src/test/modules/injection_points/logical.conf
@@ -0,0 +1 @@
+wal_level = logical
\ No newline at end of file
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index c9e357f644..7739b28c19 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -42,7 +42,10 @@ tests += {
'isolation': {
'specs': [
'inplace',
+ 'cluster',
],
+ # 'cluster' requires wal_level = 'logical'.
+ 'regress_args': ['--temp-config', files('logical.conf')],
},
'tap': {
'env': {
diff --git a/src/test/modules/injection_points/specs/cluster.spec b/src/test/modules/injection_points/specs/cluster.spec
new file mode 100644
index 0000000000..5f8404c5da
--- /dev/null
+++ b/src/test/modules/injection_points/specs/cluster.spec
@@ -0,0 +1,140 @@
+# Prefix the system columns with underscore as they are not allowed as column
+# names.
+setup
+{
+ CREATE EXTENSION injection_points;
+
+ CREATE TABLE clstr_test(i int PRIMARY KEY, j int);
+ INSERT INTO clstr_test(i, j) VALUES (1, 1), (2, 2), (3, 3), (4, 4);
+
+ CREATE TABLE relfilenodes(node oid);
+
+ CREATE TABLE data_s1(_xmin xid, _cmin cid, i int, j int);
+ CREATE TABLE data_s2(_xmin xid, _cmin cid, i int, j int);
+}
+
+teardown
+{
+ DROP TABLE clstr_test;
+ DROP EXTENSION injection_points;
+
+ DROP TABLE relfilenodes;
+ DROP TABLE data_s1;
+ DROP TABLE data_s2;
+}
+
+session s1
+setup
+{
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('cluster-concurrently-before-lock', 'wait');
+}
+# Perform the initial load and wait for s2 to do some data changes.
+step wait_before_lock
+{
+ CLUSTER (CONCURRENTLY) clstr_test USING clstr_test_pkey;
+}
+# Check the table from the perspective of s1.
+#
+# Besides the contents, we also check that relfilenode has changed.
+#
+# xmin and cmin columns are used to check that we do not change tuple
+# visibility information. Since we do not expect xmin to stay unchanged across
+# test runs, it cannot appear in the output text. Instead, have each session
+# write the contents into a table and use FULL JOIN to check if the outputs
+# are identical.
+step check1
+{
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='clstr_test';
+
+ SELECT count(DISTINCT node) FROM relfilenodes;
+
+ SELECT i, j FROM clstr_test ORDER BY i, j;
+
+ INSERT INTO data_s1(_xmin, _cmin, i, j)
+ SELECT xmin, cmin, i, j FROM clstr_test;
+
+ SELECT count(*)
+ FROM data_s1 d1 FULL JOIN data_s2 d2 USING (_xmin, _cmin, i, j)
+ WHERE d1.i ISNULL OR d2.i ISNULL;
+}
+teardown
+{
+ SELECT injection_points_detach('cluster-concurrently-before-lock');
+}
+
+session s2
+# Change the existing data. UPDATE changes both key and non-key columns. Also
+# update one row twice to test whether tuple version generated by this session
+# can be found.
+step change_existing
+{
+ UPDATE clstr_test SET i=10 where i=1;
+ UPDATE clstr_test SET j=20 where i=2;
+ UPDATE clstr_test SET i=30 where i=3;
+ UPDATE clstr_test SET i=40 where i=30;
+ DELETE FROM clstr_test WHERE i=4;
+}
+# Insert new rows and UPDATE / DELETE some of them. Again, update both key and
+# non-key column.
+step change_new
+{
+ INSERT INTO clstr_test(i, j) VALUES (5, 5), (6, 6), (7, 7), (8, 8);
+ UPDATE clstr_test SET i=50 where i=5;
+ UPDATE clstr_test SET j=60 where i=6;
+ DELETE FROM clstr_test WHERE i=7;
+}
+
+# When applying concurrent data changes, we should see the effects of an
+# in-progress subtransaction.
+step change_subxact1
+{
+ BEGIN;
+ INSERT INTO clstr_test(i, j) VALUES (100, 100);
+ SAVEPOINT s1;
+ UPDATE clstr_test SET i=101 where i=100;
+ SAVEPOINT s2;
+ UPDATE clstr_test SET i=102 where i=101;
+ COMMIT;
+}
+
+# When applying concurrent data changes, we should not see the effects of a
+# rolled back subtransaction.
+step change_subxact2
+{
+ BEGIN;
+ SAVEPOINT s1;
+ INSERT INTO clstr_test(i, j) VALUES (110, 110);
+ ROLLBACK TO SAVEPOINT s1;
+ INSERT INTO clstr_test(i, j) VALUES (110, 111);
+ COMMIT;
+}
+
+# Check the table from the perspective of s2.
+step check2
+{
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='clstr_test';
+
+ SELECT i, j FROM clstr_test ORDER BY i, j;
+
+ INSERT INTO data_s2(_xmin, _cmin, i, j)
+ SELECT xmin, cmin, i, j FROM clstr_test;
+}
+step wakeup_before_lock
+{
+ SELECT injection_points_wakeup('cluster-concurrently-before-lock');
+}
+
+# Test if data changes introduced while one session is performing CLUSTER
+# (CONCURRENTLY) find their way into the table.
+permutation
+ wait_before_lock
+ change_existing
+ change_new
+ change_subxact1
+ change_subxact2
+ check2
+ wakeup_before_lock
+ check1
--
2.45.2
Attachment: v02-0007-Introduce-cluster_max_xlock_time-configuration-varia.patch (text/x-diff)
From bd38543a06b7540a6468cecc435c180cfa03a4f0 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Tue, 27 Aug 2024 12:13:18 +0200
Subject: [PATCH 7/8] Introduce cluster_max_xlock_time configuration variable.
When executing VACUUM FULL / CLUSTER (CONCURRENTLY) we need the
AccessExclusiveLock to swap the relation files, which should only take a
short time. However, on a busy system, other backends might change a
non-negligible amount of data in the table while we are waiting for the
lock. Since these changes must be applied to the new storage before the swap,
the time we eventually hold the lock might become non-negligible too.

If the user is worried about this situation, they can set cluster_max_xlock_time
to the maximum time for which the exclusive lock may be held. If that amount
of time is not sufficient to complete the VACUUM FULL / CLUSTER (CONCURRENTLY)
command, an ERROR is raised and the command is canceled.
---
doc/src/sgml/config.sgml | 32 +++++
doc/src/sgml/ref/cluster.sgml | 9 +-
src/backend/access/heap/heapam_handler.c | 3 +-
src/backend/commands/cluster.c | 133 +++++++++++++++---
src/backend/utils/misc/guc_tables.c | 14 ++
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/commands/cluster.h | 5 +-
.../injection_points/expected/cluster.out | 74 +++++++++-
.../injection_points/specs/cluster.spec | 42 ++++++
9 files changed, 291 insertions(+), 22 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 2937384b00..cd2650520d 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -10570,6 +10570,38 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
</listitem>
</varlistentry>
+ <varlistentry id="guc-cluster-max-xclock-time" xreflabel="cluster_max_xlock_time">
+ <term><varname>cluster_max_xlock_time</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>cluster_max_xlock_time</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ This is the maximum amount of time that the <command>CLUSTER</command>
+ and <command>VACUUM FULL</command> commands may hold an exclusive lock
+ on a table when run with the <literal>CONCURRENTLY</literal>
+ option. Typically, these commands should not need the lock for longer
+ than <command>TRUNCATE</command> does. However, additional time
+ might be needed if the system is very busy. (See
+ <xref linkend="sql-cluster"/> for an explanation of how
+ the <literal>CONCURRENTLY</literal> option works.)
+ </para>
+
+ <para>
+ If you want to restrict the lock time, set this variable to the
+ highest acceptable value. If the concurrent data changes cannot be
+ processed within that time, the command is canceled.
+ </para>
+
+ <para>
+ The default value is 0, which means that the lock is not released
+ until the concurrent data changes are processed.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect1>
diff --git a/doc/src/sgml/ref/cluster.sgml b/doc/src/sgml/ref/cluster.sgml
index 0fe4e9603b..0e738d21b3 100644
--- a/doc/src/sgml/ref/cluster.sgml
+++ b/doc/src/sgml/ref/cluster.sgml
@@ -141,10 +141,13 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
(<xref linkend="logicaldecoding"/>) and applied before
the <literal>ACCESS EXCLUSIVE</literal> lock is requested. Thus the lock
is typically held only for the time needed to swap the files, which
- should be pretty short. However, the time might still be noticeable
- noticeable if too many data changes have been done to the table
+ should be pretty short. However, the time might still be noticeable if
+ too many data changes have been done to the table
while <command>CLUSTER</command> was waiting for the lock: those changes
- must be processed before the files are swapped.
+ must be processed before the files are swapped. If you are worried about
+ this situation, set
+ the <link linkend="guc-cluster-max-xclock-time"><varname>cluster_max_xlock_time</varname></link>
+ configuration parameter to a value that your applications can tolerate.
</para>
<para>
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index a8999a3e72..61b8d7e8e5 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -998,7 +998,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
end_of_wal = GetFlushRecPtr(NULL);
if ((end_of_wal - end_of_wal_prev) > wal_segment_size)
{
- cluster_decode_concurrent_changes(decoding_ctx, end_of_wal);
+ cluster_decode_concurrent_changes(decoding_ctx, end_of_wal,
+ NULL);
end_of_wal_prev = end_of_wal;
}
}
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 490fa3cfef..91bd1a3bca 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -17,6 +17,8 @@
*/
#include "postgres.h"
+#include <sys/time.h>
+
#include "access/amapi.h"
#include "access/heapam.h"
#include "access/multixact.h"
@@ -103,6 +105,15 @@ RelFileLocator clustered_rel_toast_locator = {.relNumber = InvalidOid};
#define CLUSTER_IN_PROGRESS_MESSAGE \
"relation \"%s\" is already being processed by CLUSTER CONCURRENTLY"
+/*
+ * The maximum time to hold AccessExclusiveLock during the final
+ * processing. Note that only the execution time of
+ * process_concurrent_changes() is included here. The very last steps like
+ * swap_relation_files() shouldn't get blocked and it'd be wrong to consider
+ * them a reason to abort otherwise completed processing.
+ */
+int cluster_max_xlock_time = 0;
+
/*
* Everything we need to call ExecInsertIndexTuples().
*/
@@ -190,7 +201,8 @@ static LogicalDecodingContext *setup_logical_decoding(Oid relid,
static HeapTuple get_changed_tuple(ConcurrentChange *change);
static void apply_concurrent_changes(ClusterDecodingState *dstate,
Relation rel, ScanKey key, int nkeys,
- IndexInsertState *iistate);
+ IndexInsertState *iistate,
+ struct timeval *must_complete);
static void apply_concurrent_insert(Relation rel, ConcurrentChange *change,
HeapTuple tup, IndexInsertState *iistate,
TupleTableSlot *index_slot);
@@ -207,13 +219,15 @@ static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
IndexInsertState *iistate,
TupleTableSlot *ident_slot,
IndexScanDesc *scan_p);
-static void process_concurrent_changes(LogicalDecodingContext *ctx,
+static bool process_concurrent_changes(LogicalDecodingContext *ctx,
XLogRecPtr end_of_wal,
Relation rel_dst,
Relation rel_src,
ScanKey ident_key,
int ident_key_nentries,
- IndexInsertState *iistate);
+ IndexInsertState *iistate,
+ struct timeval *must_complete);
+static bool processing_time_elapsed(struct timeval *must_complete);
static IndexInsertState *get_index_insert_state(Relation relation,
Oid ident_index_id);
static ScanKey build_identity_key(Oid ident_idx_oid, Relation rel_src,
@@ -3047,7 +3061,8 @@ get_changed_tuple(ConcurrentChange *change)
*/
void
cluster_decode_concurrent_changes(LogicalDecodingContext *ctx,
- XLogRecPtr end_of_wal)
+ XLogRecPtr end_of_wal,
+ struct timeval *must_complete)
{
ClusterDecodingState *dstate;
ResourceOwner resowner_old;
@@ -3085,6 +3100,9 @@ cluster_decode_concurrent_changes(LogicalDecodingContext *ctx,
if (record != NULL)
LogicalDecodingProcessRecord(ctx, ctx->reader);
+ if (processing_time_elapsed(must_complete))
+ break;
+
/*
* If WAL segment boundary has been crossed, inform the decoding
* system that the catalog_xmin can advance. (We can confirm more
@@ -3127,7 +3145,8 @@ cluster_decode_concurrent_changes(LogicalDecodingContext *ctx,
*/
static void
apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
- ScanKey key, int nkeys, IndexInsertState *iistate)
+ ScanKey key, int nkeys, IndexInsertState *iistate,
+ struct timeval *must_complete)
{
TupleTableSlot *index_slot, *ident_slot;
HeapTuple tup_old = NULL;
@@ -3157,6 +3176,9 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
CHECK_FOR_INTERRUPTS();
+ Assert(dstate->nchanges > 0);
+ dstate->nchanges--;
+
/* Get the change from the single-column tuple. */
tup_change = ExecFetchSlotHeapTuple(dstate->tsslot, false, &shouldFree);
heap_deform_tuple(tup_change, dstate->tupdesc_change, values, isnull);
@@ -3281,10 +3303,22 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
/* TTSOpsMinimalTuple has .get_heap_tuple==NULL. */
Assert(shouldFree);
pfree(tup_change);
+
+ /*
+ * If there is a limit on the time of completion, check it
+ * now. However, make sure the loop does not break if tup_old was set
+ * in the previous iteration. In such a case we could not resume the
+ * processing in the next call.
+ */
+ if (must_complete && tup_old == NULL &&
+ processing_time_elapsed(must_complete))
+ /* The next call will process the remaining changes. */
+ break;
}
- tuplestore_clear(dstate->tstore);
- dstate->nchanges = 0;
+ /* If we could not apply all the changes, the next call will finish them. */
+ if (dstate->nchanges == 0)
+ tuplestore_clear(dstate->tstore);
/* Cleanup. */
ExecDropSingleTupleTableSlot(index_slot);
@@ -3487,11 +3521,15 @@ find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
* Decode and apply concurrent changes.
*
* Pass rel_src iff its reltoastrelid is needed.
+ *
+ * Returns true if must_complete is NULL or if managed to complete by the time
+ * *must_complete indicates.
*/
-static void
+static bool
process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
Relation rel_dst, Relation rel_src, ScanKey ident_key,
- int ident_key_nentries, IndexInsertState *iistate)
+ int ident_key_nentries, IndexInsertState *iistate,
+ struct timeval *must_complete)
{
ClusterDecodingState *dstate;
@@ -3500,10 +3538,19 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
dstate = (ClusterDecodingState *) ctx->output_writer_private;
- cluster_decode_concurrent_changes(ctx, end_of_wal);
+ cluster_decode_concurrent_changes(ctx, end_of_wal, must_complete);
+ if (processing_time_elapsed(must_complete))
+ /* Caller is responsible for applying the changes. */
+ return false;
+
+ /*
+ * *must_complete not reached, so there are really no changes. (It's
+ * possible to see no changes just because not enough time was left for
+ * the decoding.)
+ */
if (dstate->nchanges == 0)
- return;
+ return true;
PG_TRY();
{
@@ -3515,7 +3562,7 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
rel_dst->rd_toastoid = rel_src->rd_rel->reltoastrelid;
apply_concurrent_changes(dstate, rel_dst, ident_key,
- ident_key_nentries, iistate);
+ ident_key_nentries, iistate, must_complete);
}
PG_FINALLY();
{
@@ -3525,6 +3572,28 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
rel_dst->rd_toastoid = InvalidOid;
}
PG_END_TRY();
+
+ /*
+ * apply_concurrent_changes() does check the processing time, so if some
+ * changes are left, we ran out of time.
+ */
+ return dstate->nchanges == 0;
+}
+
+/*
+ * Check if the current time is beyond *must_complete.
+ */
+static bool
+processing_time_elapsed(struct timeval *must_complete)
+{
+ struct timeval now;
+
+ if (must_complete == NULL)
+ return false;
+
+ gettimeofday(&now, NULL);
+
+ return timercmp(&now, must_complete, >);
}
static IndexInsertState *
@@ -3685,6 +3754,8 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
RelReopenInfo *rri = NULL;
int nrel;
Relation *ind_refs_all, *ind_refs_p;
+ struct timeval t_end;
+ struct timeval *t_end_ptr = NULL;
/* Like in cluster_rel(). */
lmode_old = LOCK_CLUSTER_CONCURRENT;
@@ -3764,7 +3835,8 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
*/
process_concurrent_changes(ctx, end_of_wal, NewHeap,
swap_toast_by_content ? OldHeap : NULL,
- ident_key, ident_key_nentries, iistate);
+ ident_key, ident_key_nentries, iistate,
+ NULL);
/*
* Release the locks that allowed concurrent data changes, in order to
@@ -3886,9 +3958,38 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
end_of_wal = GetFlushRecPtr(NULL);
/* Apply the concurrent changes again. */
- process_concurrent_changes(ctx, end_of_wal, NewHeap,
- swap_toast_by_content ? OldHeap : NULL,
- ident_key, ident_key_nentries, iistate);
+ /*
+ * This time we have the exclusive lock on the table, so make sure that
+ * cluster_max_xlock_time is not exceeded.
+ */
+ if (cluster_max_xlock_time > 0)
+ {
+ int64 usec;
+ struct timeval t_start;
+
+ gettimeofday(&t_start, NULL);
+ /* Add the whole seconds. */
+ t_end.tv_sec = t_start.tv_sec + cluster_max_xlock_time / 1000;
+ /* Add the rest, expressed in microseconds. */
+ usec = t_start.tv_usec + 1000 * (cluster_max_xlock_time % 1000);
+ /* The number of microseconds could have overflown. */
+ t_end.tv_sec += usec / USECS_PER_SEC;
+ t_end.tv_usec = usec % USECS_PER_SEC;
+ t_end_ptr = &t_end;
+ }
+ /*
+ * During testing, stop here to simulate excessive processing time.
+ */
+ INJECTION_POINT("cluster-concurrently-after-lock");
+
+ if (!process_concurrent_changes(ctx, end_of_wal, NewHeap,
+ swap_toast_by_content ? OldHeap : NULL,
+ ident_key, ident_key_nentries, iistate,
+ t_end_ptr))
+ ereport(ERROR,
+ (errmsg("could not process concurrent data changes in time"),
+ errhint("Please consider adjusting \"cluster_max_xlock_time\".")));
+
/* Remember info about rel before closing OldHeap */
relpersistence = OldHeap->rd_rel->relpersistence;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index af227b1f24..b84a9cd866 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -39,6 +39,7 @@
#include "catalog/namespace.h"
#include "catalog/storage.h"
#include "commands/async.h"
+#include "commands/cluster.h"
#include "commands/event_trigger.h"
#include "commands/tablespace.h"
#include "commands/trigger.h"
@@ -2772,6 +2773,19 @@ struct config_int ConfigureNamesInt[] =
1600000000, 0, 2100000000,
NULL, NULL, NULL
},
+ {
+ {"cluster_max_xlock_time", PGC_USERSET, LOCK_MANAGEMENT,
+ gettext_noop("Maximum time for VACUUM FULL / CLUSTER (CONCURRENTLY) to keep the table locked."),
+ gettext_noop(
+ "The table is locked in exclusive mode during the final stage of processing. "
+ "If the lock time exceeds this value, an error is raised and the lock is "
+ "released. Set to zero if you don't care how long the lock can be held."),
+ GUC_UNIT_MS
+ },
+ &cluster_max_xlock_time,
+ 0, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
/*
* See also CheckRequiredParameterValues() if this parameter changes
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 667e0dc40a..9dc060c59f 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -724,6 +724,7 @@
#vacuum_multixact_freeze_table_age = 150000000
#vacuum_multixact_freeze_min_age = 5000000
#vacuum_multixact_failsafe_age = 1600000000
+#cluster_max_xlock_time = 0
#bytea_output = 'hex' # hex, escape
#xmlbinary = 'base64'
#xmloption = 'content'
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 61ea314399..5d904ce985 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -44,6 +44,8 @@ typedef struct ClusterParams
extern RelFileLocator clustered_rel_locator;
extern RelFileLocator clustered_rel_toast_locator;
+extern PGDLLIMPORT int cluster_max_xlock_time;
+
/*
* Lock level for the concurrent variant of CLUSTER / VACUUM FULL.
*
@@ -149,7 +151,8 @@ extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
extern bool check_relation_is_clusterable_concurrently(Relation rel, int elevel,
const char *stmt);
extern void cluster_decode_concurrent_changes(LogicalDecodingContext *ctx,
- XLogRecPtr end_of_wal);
+ XLogRecPtr end_of_wal,
+ struct timeval *must_complete);
extern Oid make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
char relpersistence, LOCKMODE lockmode_old,
LOCKMODE *lockmode_new_p);
diff --git a/src/test/modules/injection_points/expected/cluster.out b/src/test/modules/injection_points/expected/cluster.out
index d84fff3693..646e31448f 100644
--- a/src/test/modules/injection_points/expected/cluster.out
+++ b/src/test/modules/injection_points/expected/cluster.out
@@ -1,4 +1,4 @@
-Parsed test spec with 2 sessions
+Parsed test spec with 4 sessions
starting permutation: wait_before_lock change_existing change_new change_subxact1 change_subxact2 check2 wakeup_before_lock check1
injection_points_attach
@@ -111,3 +111,75 @@ injection_points_detach
(1 row)
+injection_points_detach
+-----------------------
+
+(1 row)
+
+
+starting permutation: wait_after_lock wakeup_after_lock
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step wait_after_lock:
+ CLUSTER (CONCURRENTLY) clstr_test USING clstr_test_pkey;
+ <waiting ...>
+step wakeup_after_lock:
+ SELECT injection_points_wakeup('cluster-concurrently-after-lock');
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step wait_after_lock: <... completed>
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+
+starting permutation: wait_after_lock after_lock_delay wakeup_after_lock
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step wait_after_lock:
+ CLUSTER (CONCURRENTLY) clstr_test USING clstr_test_pkey;
+ <waiting ...>
+step after_lock_delay:
+ SELECT pg_sleep(1.5);
+
+pg_sleep
+--------
+
+(1 row)
+
+step wakeup_after_lock:
+ SELECT injection_points_wakeup('cluster-concurrently-after-lock');
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step wait_after_lock: <... completed>
+ERROR: could not process concurrent data changes in time
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
diff --git a/src/test/modules/injection_points/specs/cluster.spec b/src/test/modules/injection_points/specs/cluster.spec
index 5f8404c5da..9af41bac6d 100644
--- a/src/test/modules/injection_points/specs/cluster.spec
+++ b/src/test/modules/injection_points/specs/cluster.spec
@@ -127,6 +127,34 @@ step wakeup_before_lock
SELECT injection_points_wakeup('cluster-concurrently-before-lock');
}
+session s3
+setup
+{
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('cluster-concurrently-after-lock', 'wait');
+ SET cluster_max_xlock_time TO '1s';
+}
+# Perform the initial load, lock the table in exclusive mode and wait. s4 will
+# cancel the waiting.
+step wait_after_lock
+{
+ CLUSTER (CONCURRENTLY) clstr_test USING clstr_test_pkey;
+}
+teardown
+{
+ SELECT injection_points_detach('cluster-concurrently-after-lock');
+}
+
+session s4
+step wakeup_after_lock
+{
+ SELECT injection_points_wakeup('cluster-concurrently-after-lock');
+}
+step after_lock_delay
+{
+ SELECT pg_sleep(1.5);
+}
+
# Test if data changes introduced while one session is performing CLUSTER
# (CONCURRENTLY) find their way into the table.
permutation
@@ -138,3 +166,17 @@ permutation
check2
wakeup_before_lock
check1
+
+# Test the cluster_max_xlock_time configuration variable.
+#
+# First, cancel waiting on the injection point immediately. That way, CLUSTER
+# should complete.
+permutation
+ wait_after_lock
+ wakeup_after_lock
+# Second, cancel the waiting with a delay that violates
+# cluster_max_xlock_time.
+permutation
+ wait_after_lock
+ after_lock_delay
+ wakeup_after_lock
--
2.45.2
v02-0008-Call-logical_rewrite_heap_tuple-when-applying-concur.patch
From 5aa2f7a9d353ee253edd885dc6ca238ae9aa5bad Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Tue, 27 Aug 2024 12:13:18 +0200
Subject: [PATCH 8/8] Call logical_rewrite_heap_tuple() when applying
concurrent data changes.
This was implemented for the sake of completeness, but I think it's currently
not needed. Possible use cases could be:
1. VACUUM FULL / CLUSTER CONCURRENTLY can process system catalogs.
System catalogs are scanned using a historic snapshot during logical decoding,
and the "combo CID" information is needed for that. Since a "combo CID" is
associated with a particular relation file locator, and that locator is
changed by VACUUM FULL / CLUSTER, these commands must record the mapping
information for the individual tuples being moved from the old file to the new
one. This is what logical_rewrite_heap_tuple() does.
However, the logical decoding subsystem currently does not support decoding of
data changes in system catalogs. Therefore, the CONCURRENTLY option cannot
be used for system catalogs.
2. VACUUM FULL / CLUSTER CONCURRENTLY is processing a relation, but once it
has released all its locks (in order to acquire the exclusive lock), another
backend runs VACUUM FULL / CLUSTER CONCURRENTLY on the same table. Since the
relation is treated as a system catalog while these commands are processing it
(so that it can be scanned using a historic snapshot during the "initial
load"), it is important that the second backend does not break the decoding of
"combo CIDs" performed by the first backend.
However, it is not practical to let multiple backends run VACUUM FULL /
CLUSTER CONCURRENTLY on the same relation, so we forbid that.
---
---
src/backend/access/heap/heapam_handler.c | 2 +-
src/backend/access/heap/rewriteheap.c | 65 ++++++-----
src/backend/commands/cluster.c | 110 +++++++++++++++---
src/backend/replication/logical/decode.c | 41 ++++++-
.../pgoutput_cluster/pgoutput_cluster.c | 21 ++--
src/include/access/rewriteheap.h | 5 +-
src/include/commands/cluster.h | 3 +
src/include/replication/reorderbuffer.h | 7 ++
8 files changed, 194 insertions(+), 60 deletions(-)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 61b8d7e8e5..c39a9ac41d 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -731,7 +731,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
/* Initialize the rewrite operation */
rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, *xid_cutoff,
- *multi_cutoff);
+ *multi_cutoff, true);
/* Set up sorting if wanted */
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 473f3aa9be..050c8306da 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -214,10 +214,8 @@ static void raw_heap_insert(RewriteState state, HeapTuple tup);
/* internal logical remapping prototypes */
static void logical_begin_heap_rewrite(RewriteState state);
-static void logical_rewrite_heap_tuple(RewriteState state, ItemPointerData old_tid, HeapTuple new_tuple);
static void logical_end_heap_rewrite(RewriteState state);
-
/*
* Begin a rewrite of a table
*
@@ -226,18 +224,19 @@ static void logical_end_heap_rewrite(RewriteState state);
* oldest_xmin xid used by the caller to determine which tuples are dead
* freeze_xid xid before which tuples will be frozen
* cutoff_multi multixact before which multis will be removed
+ * tid_chains need to maintain TID chains?
*
* Returns an opaque RewriteState, allocated in current memory context,
* to be used in subsequent calls to the other functions.
*/
RewriteState
begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xmin,
- TransactionId freeze_xid, MultiXactId cutoff_multi)
+ TransactionId freeze_xid, MultiXactId cutoff_multi,
+ bool tid_chains)
{
RewriteState state;
MemoryContext rw_cxt;
MemoryContext old_cxt;
- HASHCTL hash_ctl;
/*
* To ease cleanup, make a separate context that will contain the
@@ -262,29 +261,34 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
state->rs_cxt = rw_cxt;
state->rs_bulkstate = smgr_bulk_start_rel(new_heap, MAIN_FORKNUM);
- /* Initialize hash tables used to track update chains */
- hash_ctl.keysize = sizeof(TidHashKey);
- hash_ctl.entrysize = sizeof(UnresolvedTupData);
- hash_ctl.hcxt = state->rs_cxt;
-
- state->rs_unresolved_tups =
- hash_create("Rewrite / Unresolved ctids",
- 128, /* arbitrary initial size */
- &hash_ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
- hash_ctl.entrysize = sizeof(OldToNewMappingData);
+ if (tid_chains)
+ {
+ HASHCTL hash_ctl;
+
+ /* Initialize hash tables used to track update chains */
+ hash_ctl.keysize = sizeof(TidHashKey);
+ hash_ctl.entrysize = sizeof(UnresolvedTupData);
+ hash_ctl.hcxt = state->rs_cxt;
+
+ state->rs_unresolved_tups =
+ hash_create("Rewrite / Unresolved ctids",
+ 128, /* arbitrary initial size */
+ &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+ hash_ctl.entrysize = sizeof(OldToNewMappingData);
+
+ state->rs_old_new_tid_map =
+ hash_create("Rewrite / Old to new tid map",
+ 128, /* arbitrary initial size */
+ &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ }
- state->rs_old_new_tid_map =
- hash_create("Rewrite / Old to new tid map",
- 128, /* arbitrary initial size */
- &hash_ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ logical_begin_heap_rewrite(state);
MemoryContextSwitchTo(old_cxt);
- logical_begin_heap_rewrite(state);
-
return state;
}
@@ -303,12 +307,15 @@ end_heap_rewrite(RewriteState state)
* Write any remaining tuples in the UnresolvedTups table. If we have any
* left, they should in fact be dead, but let's err on the safe side.
*/
- hash_seq_init(&seq_status, state->rs_unresolved_tups);
-
- while ((unresolved = hash_seq_search(&seq_status)) != NULL)
+ if (state->rs_unresolved_tups)
{
- ItemPointerSetInvalid(&unresolved->tuple->t_data->t_ctid);
- raw_heap_insert(state, unresolved->tuple);
+ hash_seq_init(&seq_status, state->rs_unresolved_tups);
+
+ while ((unresolved = hash_seq_search(&seq_status)) != NULL)
+ {
+ ItemPointerSetInvalid(&unresolved->tuple->t_data->t_ctid);
+ raw_heap_insert(state, unresolved->tuple);
+ }
}
/* Write the last page, if any */
@@ -995,7 +1002,7 @@ logical_rewrite_log_mapping(RewriteState state, TransactionId xid,
* Perform logical remapping for a tuple that's mapped from old_tid to
* new_tuple->t_self by rewrite_heap_tuple() if necessary for the tuple.
*/
-static void
+void
logical_rewrite_heap_tuple(RewriteState state, ItemPointerData old_tid,
HeapTuple new_tuple)
{
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 91bd1a3bca..2f94e143ad 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -23,6 +23,7 @@
#include "access/heapam.h"
#include "access/multixact.h"
#include "access/relscan.h"
+#include "access/rewriteheap.h"
#include "access/tableam.h"
#include "access/toast_internals.h"
#include "access/transam.h"
@@ -202,17 +203,21 @@ static HeapTuple get_changed_tuple(ConcurrentChange *change);
static void apply_concurrent_changes(ClusterDecodingState *dstate,
Relation rel, ScanKey key, int nkeys,
IndexInsertState *iistate,
- struct timeval *must_complete);
+ struct timeval *must_complete,
+ RewriteState rwstate);
static void apply_concurrent_insert(Relation rel, ConcurrentChange *change,
HeapTuple tup, IndexInsertState *iistate,
- TupleTableSlot *index_slot);
+ TupleTableSlot *index_slot,
+ RewriteState rwstate);
static void apply_concurrent_update(Relation rel, HeapTuple tup,
HeapTuple tup_target,
ConcurrentChange *change,
IndexInsertState *iistate,
- TupleTableSlot *index_slot);
+ TupleTableSlot *index_slot,
+ RewriteState rwstate);
static void apply_concurrent_delete(Relation rel, HeapTuple tup_target,
- ConcurrentChange *change);
+ ConcurrentChange *change,
+ RewriteState rwstate);
static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
HeapTuple tup_key,
Snapshot snapshot,
@@ -226,7 +231,8 @@ static bool process_concurrent_changes(LogicalDecodingContext *ctx,
ScanKey ident_key,
int ident_key_nentries,
IndexInsertState *iistate,
- struct timeval *must_complete);
+ struct timeval *must_complete,
+ RewriteState rwstate);
static bool processing_time_elapsed(struct timeval *must_complete);
static IndexInsertState *get_index_insert_state(Relation relation,
Oid ident_index_id);
@@ -3146,7 +3152,7 @@ cluster_decode_concurrent_changes(LogicalDecodingContext *ctx,
static void
apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
ScanKey key, int nkeys, IndexInsertState *iistate,
- struct timeval *must_complete)
+ struct timeval *must_complete, RewriteState rwstate)
{
TupleTableSlot *index_slot, *ident_slot;
HeapTuple tup_old = NULL;
@@ -3220,7 +3226,8 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
{
Assert(tup_old == NULL);
- apply_concurrent_insert(rel, change, tup, iistate, index_slot);
+ apply_concurrent_insert(rel, change, tup, iistate, index_slot,
+ rwstate);
pfree(tup);
}
@@ -3228,7 +3235,7 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
change->kind == CHANGE_DELETE)
{
IndexScanDesc ind_scan = NULL;
- HeapTuple tup_key;
+ HeapTuple tup_key, tup_exist_cp;
if (change->kind == CHANGE_UPDATE_NEW)
{
@@ -3270,11 +3277,23 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
if (tup_exist == NULL)
elog(ERROR, "Failed to find target tuple");
+ /*
+ * Update the mapping for xmax of the old version.
+ *
+ * Use a copy ('tup_exist' can point to a shared buffer) with xmin set
+ * to invalid, because the mapping for xmin should already have been
+ * written on insertion.
+ */
+ tup_exist_cp = heap_copytuple(tup_exist);
+ HeapTupleHeaderSetXmin(tup_exist_cp->t_data, InvalidTransactionId);
+ logical_rewrite_heap_tuple(rwstate, change->old_tid, tup_exist_cp);
+ pfree(tup_exist_cp);
+
if (change->kind == CHANGE_UPDATE_NEW)
apply_concurrent_update(rel, tup, tup_exist, change, iistate,
- index_slot);
+ index_slot, rwstate);
else
- apply_concurrent_delete(rel, tup_exist, change);
+ apply_concurrent_delete(rel, tup_exist, change, rwstate);
ResetClusterCurrentXids();
@@ -3327,9 +3346,12 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
static void
apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
- IndexInsertState *iistate, TupleTableSlot *index_slot)
+ IndexInsertState *iistate, TupleTableSlot *index_slot,
+ RewriteState rwstate)
{
+ HeapTupleHeader tup_hdr = tup->t_data;
Snapshot snapshot = change->snapshot;
+ ItemPointerData old_tid;
List *recheck;
/*
@@ -3339,6 +3361,9 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
*/
SetClusterCurrentXids(snapshot->subxip, snapshot->subxcnt);
+ /* Remember location in the old heap. */
+ ItemPointerCopy(&tup_hdr->t_ctid, &old_tid);
+
/*
* Write the tuple into the new heap.
*
@@ -3354,6 +3379,14 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
heap_insert(rel, tup, change->xid, snapshot->curcid - 1,
HEAP_INSERT_NO_LOGICAL, NULL);
+ /*
+ * Update the mapping for xmin (xmax should be invalid). This is needed
+ * because, during the processing, the table is considered a "user
+ * catalog".
+ */
+ Assert(!TransactionIdIsValid(HeapTupleHeaderGetRawXmax(tup->t_data)));
+ logical_rewrite_heap_tuple(rwstate, old_tid, tup);
+
/*
* Update indexes.
*
@@ -3387,15 +3420,22 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
static void
apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
ConcurrentChange *change, IndexInsertState *iistate,
- TupleTableSlot *index_slot)
+ TupleTableSlot *index_slot, RewriteState rwstate)
{
List *recheck;
LockTupleMode lockmode;
TU_UpdateIndexes update_indexes;
+ ItemPointerData tid_new_old_heap, tid_old_new_heap;
TM_Result res;
Snapshot snapshot = change->snapshot;
TM_FailureData tmfd;
+ /* Location of the new tuple in the old heap. */
+ ItemPointerCopy(&tup->t_data->t_ctid, &tid_new_old_heap);
+
+ /* Location of the existing tuple in the new heap. */
+ ItemPointerCopy(&tup_target->t_self, &tid_old_new_heap);
+
/*
* Write the new tuple into the new heap. ('tup' gets the TID assigned
* here.)
@@ -3405,7 +3445,7 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
Assert(snapshot->curcid != InvalidCommandId &&
snapshot->curcid > FirstCommandId);
- res = heap_update(rel, &tup_target->t_self, tup,
+ res = heap_update(rel, &tid_old_new_heap, tup,
change->xid, snapshot->curcid - 1,
InvalidSnapshot,
false, /* no wait - only we are doing changes */
@@ -3415,6 +3455,10 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
if (res != TM_Ok)
ereport(ERROR, (errmsg("failed to apply concurrent UPDATE")));
+ /* Update the mapping for xmin of the new version. */
+ Assert(!TransactionIdIsValid(HeapTupleHeaderGetRawXmax(tup->t_data)));
+ logical_rewrite_heap_tuple(rwstate, tid_new_old_heap, tup);
+
ExecStoreHeapTuple(tup, index_slot, false);
if (update_indexes != TU_None)
@@ -3438,8 +3482,9 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
static void
apply_concurrent_delete(Relation rel, HeapTuple tup_target,
- ConcurrentChange *change)
+ ConcurrentChange *change, RewriteState rwstate)
{
+ ItemPointerData tid_old_new_heap;
TM_Result res;
TM_FailureData tmfd;
Snapshot snapshot = change->snapshot;
@@ -3448,7 +3493,10 @@ apply_concurrent_delete(Relation rel, HeapTuple tup_target,
Assert(snapshot->curcid != InvalidCommandId &&
snapshot->curcid > FirstCommandId);
- res = heap_delete(rel, &tup_target->t_self, change->xid,
+ /* Location of the existing tuple in the new heap. */
+ ItemPointerCopy(&tup_target->t_self, &tid_old_new_heap);
+
+ res = heap_delete(rel, &tid_old_new_heap, change->xid,
snapshot->curcid - 1, InvalidSnapshot, false,
&tmfd, false,
/* wal_logical */
@@ -3529,7 +3577,8 @@ static bool
process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
Relation rel_dst, Relation rel_src, ScanKey ident_key,
int ident_key_nentries, IndexInsertState *iistate,
- struct timeval *must_complete)
+ struct timeval *must_complete,
+ RewriteState rwstate)
{
ClusterDecodingState *dstate;
@@ -3562,7 +3611,8 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
rel_dst->rd_toastoid = rel_src->rd_rel->reltoastrelid;
apply_concurrent_changes(dstate, rel_dst, ident_key,
- ident_key_nentries, iistate, must_complete);
+ ident_key_nentries, iistate, must_complete,
+ rwstate);
}
PG_FINALLY();
{
@@ -3747,6 +3797,7 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
bool is_system_catalog;
Oid ident_idx_old, ident_idx_new;
IndexInsertState *iistate;
+ RewriteState rwstate;
ScanKey ident_key;
int ident_key_nentries;
XLogRecPtr wal_insert_ptr, end_of_wal;
@@ -3832,11 +3883,26 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
* Apply concurrent changes first time, to minimize the time we need to
* hold AccessExclusiveLock. (Quite some amount of WAL could have been
* written during the data copying and index creation.)
+ *
+ * Now we are processing individual tuples, so pass false for
+ * 'tid_chains'. Since rwstate is now only needed for
+ * logical_begin_heap_rewrite(), none of the transaction IDs needs to be
+ * valid.
*/
+ rwstate = begin_heap_rewrite(OldHeap, NewHeap,
+ InvalidTransactionId,
+ InvalidTransactionId,
+ InvalidTransactionId,
+ false);
process_concurrent_changes(ctx, end_of_wal, NewHeap,
swap_toast_by_content ? OldHeap : NULL,
ident_key, ident_key_nentries, iistate,
- NULL);
+ NULL, rwstate);
+ /*
+ * OldHeap will be closed, so we need to initialize rwstate again for the
+ * next call of process_concurrent_changes().
+ */
+ end_heap_rewrite(rwstate);
/*
* Release the locks that allowed concurrent data changes, in order to
@@ -3958,6 +4024,11 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
end_of_wal = GetFlushRecPtr(NULL);
/* Apply the concurrent changes again. */
+ rwstate = begin_heap_rewrite(OldHeap, NewHeap,
+ InvalidTransactionId,
+ InvalidTransactionId,
+ InvalidTransactionId,
+ false);
/*
* This time we have the exclusive lock on the table, so make sure that
* cluster_max_xlock_time is not exceeded.
@@ -3985,11 +4056,12 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
if (!process_concurrent_changes(ctx, end_of_wal, NewHeap,
swap_toast_by_content ? OldHeap : NULL,
ident_key, ident_key_nentries, iistate,
- t_end_ptr))
+ t_end_ptr, rwstate))
ereport(ERROR,
(errmsg("could not process concurrent data changes in time"),
errhint("Please consider adjusting \"cluster_max_xlock_time\".")));
+ end_heap_rewrite(rwstate);
/* Remember info about rel before closing OldHeap */
relpersistence = OldHeap->rd_rel->relpersistence;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 1f30e12537..3c9ab8fa61 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -987,11 +987,13 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
xl_heap_insert *xlrec;
ReorderBufferChange *change;
RelFileLocator target_locator;
+ BlockNumber blknum;
+ HeapTupleHeader tuphdr;
xlrec = (xl_heap_insert *) XLogRecGetData(r);
/* only interested in our database */
- XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
+ XLogRecGetBlockTag(r, 0, &target_locator, NULL, &blknum);
if (target_locator.dbOid != ctx->slot->data.database)
return;
@@ -1016,6 +1018,13 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
DecodeXLogTuple(tupledata, datalen, change->data.tp.newtuple);
+ /*
+ * CTID is needed for logical_rewrite_heap_tuple(), when doing CLUSTER
+ * CONCURRENTLY.
+ */
+ tuphdr = change->data.tp.newtuple->t_data;
+ ItemPointerSet(&tuphdr->t_ctid, blknum, xlrec->offnum);
+
change->data.tp.clear_toast_afterwards = true;
ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
@@ -1037,11 +1046,14 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
ReorderBufferChange *change;
char *data;
RelFileLocator target_locator;
+ BlockNumber old_blknum, new_blknum;
xlrec = (xl_heap_update *) XLogRecGetData(r);
+ /* Retrieve blknum, so that we can compose CTID below. */
+ XLogRecGetBlockTag(r, 0, &target_locator, NULL, &new_blknum);
+
/* only interested in our database */
- XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
if (target_locator.dbOid != ctx->slot->data.database)
return;
@@ -1058,6 +1070,7 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
{
Size datalen;
Size tuplelen;
+ HeapTupleHeader tuphdr;
data = XLogRecGetBlockData(r, 0, &datalen);
@@ -1067,6 +1080,13 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
ReorderBufferGetTupleBuf(ctx->reorder, tuplelen);
DecodeXLogTuple(data, datalen, change->data.tp.newtuple);
+
+ /*
+ * CTID is needed for logical_rewrite_heap_tuple(), when doing CLUSTER
+ * CONCURRENTLY.
+ */
+ tuphdr = change->data.tp.newtuple->t_data;
+ ItemPointerSet(&tuphdr->t_ctid, new_blknum, xlrec->new_offnum);
}
if (xlrec->flags & XLH_UPDATE_CONTAINS_OLD)
@@ -1085,6 +1105,14 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
DecodeXLogTuple(data, datalen, change->data.tp.oldtuple);
}
+ /*
+ * Remember the old tuple CTID, for the sake of
+ * logical_rewrite_heap_tuple().
+ */
+ if (!XLogRecGetBlockTagExtended(r, 1, NULL, NULL, &old_blknum, NULL))
+ old_blknum = new_blknum;
+ ItemPointerSet(&change->data.tp.old_tid, old_blknum, xlrec->old_offnum);
+
change->data.tp.clear_toast_afterwards = true;
ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
@@ -1103,11 +1131,12 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
xl_heap_delete *xlrec;
ReorderBufferChange *change;
RelFileLocator target_locator;
+ BlockNumber blknum;
xlrec = (xl_heap_delete *) XLogRecGetData(r);
/* only interested in our database */
- XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
+ XLogRecGetBlockTag(r, 0, &target_locator, NULL, &blknum);
if (target_locator.dbOid != ctx->slot->data.database)
return;
@@ -1139,6 +1168,12 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
DecodeXLogTuple((char *) xlrec + SizeOfHeapDelete,
datalen, change->data.tp.oldtuple);
+
+ /*
+ * CTID is needed for logical_rewrite_heap_tuple(), when doing CLUSTER
+ * CONCURRENTLY.
+ */
+ ItemPointerSet(&change->data.tp.old_tid, blknum, xlrec->offnum);
}
change->data.tp.clear_toast_afterwards = true;
diff --git a/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c b/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
index db6a2bcf1f..54a7e3ca68 100644
--- a/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
+++ b/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
@@ -34,7 +34,7 @@ static void plugin_truncate(struct LogicalDecodingContext *ctx,
ReorderBufferChange *change);
static void store_change(LogicalDecodingContext *ctx,
ConcurrentChangeKind kind, HeapTuple tuple,
- TransactionId xid);
+ TransactionId xid, ItemPointer old_tid);
void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
@@ -169,7 +169,8 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (newtuple == NULL)
elog(ERROR, "Incomplete insert info.");
- store_change(ctx, CHANGE_INSERT, newtuple, change->txn->xid);
+ store_change(ctx, CHANGE_INSERT, newtuple, change->txn->xid,
+ NULL);
}
break;
case REORDER_BUFFER_CHANGE_UPDATE:
@@ -187,10 +188,10 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (oldtuple != NULL)
store_change(ctx, CHANGE_UPDATE_OLD, oldtuple,
- change->txn->xid);
+ change->txn->xid, NULL);
store_change(ctx, CHANGE_UPDATE_NEW, newtuple,
- change->txn->xid);
+ change->txn->xid, &change->data.tp.old_tid);
}
break;
case REORDER_BUFFER_CHANGE_DELETE:
@@ -203,7 +204,8 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (oldtuple == NULL)
elog(ERROR, "Incomplete delete info.");
- store_change(ctx, CHANGE_DELETE, oldtuple, change->txn->xid);
+ store_change(ctx, CHANGE_DELETE, oldtuple, change->txn->xid,
+ &change->data.tp.old_tid);
}
break;
default:
@@ -237,13 +239,13 @@ plugin_truncate(struct LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (i == nrelations)
return;
- store_change(ctx, CHANGE_TRUNCATE, NULL, InvalidTransactionId);
+ store_change(ctx, CHANGE_TRUNCATE, NULL, InvalidTransactionId, NULL);
}
/* Store concurrent data change. */
static void
store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
- HeapTuple tuple, TransactionId xid)
+ HeapTuple tuple, TransactionId xid, ItemPointer old_tid)
{
ClusterDecodingState *dstate;
char *change_raw;
@@ -308,6 +310,11 @@ store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
change->snapshot = dstate->snapshot;
dstate->snapshot->active_count++;
+ if (old_tid)
+ ItemPointerCopy(old_tid, &change->old_tid);
+ else
+ ItemPointerSetInvalid(&change->old_tid);
+
/* The data has been copied. */
if (flattened)
pfree(tuple);
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 5866a26bdd..de62b6abf8 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -23,11 +23,14 @@ typedef struct RewriteStateData *RewriteState;
extern RewriteState begin_heap_rewrite(Relation old_heap, Relation new_heap,
TransactionId oldest_xmin, TransactionId freeze_xid,
- MultiXactId cutoff_multi);
+ MultiXactId cutoff_multi, bool tid_chains);
extern void end_heap_rewrite(RewriteState state);
extern void rewrite_heap_tuple(RewriteState state, HeapTuple old_tuple,
HeapTuple new_tuple);
extern bool rewrite_heap_dead_tuple(RewriteState state, HeapTuple old_tuple);
+extern void logical_rewrite_heap_tuple(RewriteState state,
+ ItemPointerData old_tid,
+ HeapTuple new_tuple);
/*
* On-Disk data format for an individual logical rewrite mapping.
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 5d904ce985..69a9aba050 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -76,6 +76,9 @@ typedef struct ConcurrentChange
/* Transaction that changes the data. */
TransactionId xid;
+ /* For UPDATE / DELETE, the location of the old tuple version. */
+ ItemPointerData old_tid;
+
/*
* Historic catalog snapshot that was used to decode this change.
*/
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 851a001c8b..1fa8f8bd6a 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -99,6 +99,13 @@ typedef struct ReorderBufferChange
HeapTuple oldtuple;
/* valid for INSERT || UPDATE */
HeapTuple newtuple;
+
+ /*
+ * CLUSTER CONCURRENTLY needs the old TID, even if the old tuple
+ * itself is not WAL-logged (i.e. when the identity key does not
+ * change).
+ */
+ ItemPointerData old_tid;
} tp;
/*
--
2.45.2
Hi,
On Tue, Aug 27, 2024 at 8:01 PM Antonin Houska <ah@cybertec.at> wrote:
Attached is version 2, the feature itself is now in 0004.
Unlike version 1, it contains some regression tests (0006) and a new GUC to
control how long the AccessExclusiveLock may be held (0007).
Kirill Reshke <reshkekirill@gmail.com> wrote:
On Fri, 2 Aug 2024 at 11:09, Antonin Houska <ah@cybertec.at> wrote:
Kirill Reshke <reshkekirill@gmail.com> wrote:
However, in general, the 3rd patch is really big, very hard to
comprehend. Please consider splitting this into smaller (and
reviewable) pieces.
I'll try to move some preparation steps into separate diffs, but not sure if
that will make the main diff much smaller. I prefer self-contained patches, as
also explained in [3].
Thanks for sharing [3], it is a useful link.
There is actually one more case when ACCESS EXCLUSIVE is held: during
table rewrite (AT set TAM, AT set Tablespace and AT alter column type
are some examples).
This can be done CONCURRENTLY too, using the same logical replication
approach, or do I miss something?
Yes, the logical replication can potentially be used in other cases.
I'm not saying we must do it immediately, this should be a separate
thread, but we can do some preparation work here.
I can see that a bunch of functions which are currently placed in
cluster.c can be moved to something like
logical_rewrite_heap.c. ConcurrentChange struct and
apply_concurrent_insert function is one example of such.
So, if this is the case, 0003 patch can be split in two:
The first one is general utility code for logical table rewrite
The second one with the actual VACUUM CONCURRENTLY feature.
What do you think?
I can imagine moving the function process_concurrent_changes() and subroutines
to a different file (e.g. rewriteheap.c), but moving it into a separate diff
that does not contain any call of the function makes little sense to me. Such
a diff would not add any useful functionality and could not be considered
refactoring either.
So far I at least moved some code to separate diffs: 0003 and 0005. I'll move
more if I find a sensible opportunity in the future.
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Thanks for working on this, I think this is a very useful feature.
The patch doesn't compile in the debug build with errors:
../postgres/src/backend/commands/cluster.c: In function ‘get_catalog_state’:
../postgres/src/backend/commands/cluster.c:2771:33: error: declaration
of ‘td_src’ shadows a previous local [-Werror=shadow=compatible-local]
2771 | TupleDesc td_src, td_dst;
| ^~~~~~
../postgres/src/backend/commands/cluster.c:2741:25: note: shadowed
declaration is here
2741 | TupleDesc td_src = RelationGetDescr(rel);
you forgot the meson build for pgoutput_cluster
diff --git a/src/backend/meson.build b/src/backend/meson.build
index 78c5726814..0f9141a4ac 100644
--- a/src/backend/meson.build
+++ b/src/backend/meson.build
@@ -194,5 +194,6 @@ pg_test_mod_args = pg_mod_args + {
subdir('jit/llvm')
subdir('replication/libpqwalreceiver')
subdir('replication/pgoutput')
+subdir('replication/pgoutput_cluster')
I noticed that you use lmode/lock_mode/lockmode, there are lmode and lockmode
in the codebase, but I remember someone proposed all changes to lockmode, how
about sticking to lockmode in your patch?
0004:
+ sure that the old files do not change during the processing because the
+ chnages would get lost due to the swap.
typo
+ files. The data changes that took place during the creation of the new
+ table and index files are captured using logical decoding
+ (<xref linkend="logicaldecoding"/>) and applied before
+ the <literal>ACCESS EXCLUSIVE</literal> lock is requested. Thus the lock
+ is typically held only for the time needed to swap the files, which
+ should be pretty short.
I remember pg_squeeze also did some logical decoding after getting the exclusive
lock, if that is still true, I guess the doc above is not precise.
+ Note that <command>CLUSTER</command> with the
+ the <literal>CONCURRENTLY</literal> option does not try to order the
+ rows inserted into the table after the clustering started.
Do you mean after the *logical decoding* started here? If CLUSTER CONCURRENTLY
does not order rows at all, why bother implementing it?
+ errhint("CLUSTER CONCURRENTLY is only allowed for permanent relations")));
errhint messages should end with a dot. Why hardcoded to "CLUSTER CONCURRENTLY"
instead of parameter *stmt*.
+ ResourceOwner oldowner = CurrentResourceOwner;
+
+ /*
+ * In the CONCURRENT case, do the planning in a subtrensaction so that
typo
I did not see VacuumStmt changes in gram.y, how do we suppose to
use the vacuum full concurrently? I tried the following but no success.
[local] postgres@demo:5432-36097=# vacuum (concurrently) aircrafts_data;
ERROR: CONCURRENTLY can only be specified with VACUUM FULL
[local] postgres@demo:5432-36097=# vacuum full (concurrently) full
aircrafts_data;
ERROR: syntax error at or near "("
LINE 1: vacuum full (concurrently) full aircrafts_data;
--
Regards
Junwang Zhao
Junwang Zhao <zhjwpku@gmail.com> wrote:
Thanks for working on this, I think this is a very useful feature.
The patch doesn't compile in the debug build with errors:
../postgres/src/backend/commands/cluster.c: In function ‘get_catalog_state’:
../postgres/src/backend/commands/cluster.c:2771:33: error: declaration
of ‘td_src’ shadows a previous local [-Werror=shadow=compatible-local]
2771 | TupleDesc td_src, td_dst;
| ^~~~~~
../postgres/src/backend/commands/cluster.c:2741:25: note: shadowed
declaration is here
2741 | TupleDesc td_src = RelationGetDescr(rel);
ok, gcc14 complains here, the compiler I used before did not. Fixed.
you forgot the meson build for pgoutput_cluster
diff --git a/src/backend/meson.build b/src/backend/meson.build
index 78c5726814..0f9141a4ac 100644
--- a/src/backend/meson.build
+++ b/src/backend/meson.build
@@ -194,5 +194,6 @@ pg_test_mod_args = pg_mod_args + {
 subdir('jit/llvm')
 subdir('replication/libpqwalreceiver')
 subdir('replication/pgoutput')
+subdir('replication/pgoutput_cluster')
Fixed, thanks. That might be the reason for the cfbot to fail when using
meson.
I noticed that you use lmode/lock_mode/lockmode, there are lmode and lockmode
in the codebase, but I remember someone proposed all changes to lockmode, how
about sticking to lockmode in your patch?
Fixed.
0004:
+ sure that the old files do not change during the processing because the
+ chnages would get lost due to the swap.
typo
Fixed.
+ files. The data changes that took place during the creation of the new
+ table and index files are captured using logical decoding
+ (<xref linkend="logicaldecoding"/>) and applied before
+ the <literal>ACCESS EXCLUSIVE</literal> lock is requested. Thus the lock
+ is typically held only for the time needed to swap the files, which
+ should be pretty short.
I remember pg_squeeze also did some logical decoding after getting the exclusive
lock, if that is still true, I guess the doc above is not precise.
The decoding takes place before requesting the lock, as well as after
that. I've adjusted the paragraph, see 0007.
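For readers skimming the patch, the two decoding passes can be summarized as
follows (a rough sketch of the control flow in
rebuild_relation_finish_concurrent() from the diff above, not real code —
function arguments are elided):

```c
/*
 * Sketch only: the actual code in the patch interleaves lock management,
 * WAL position tracking and error handling between these steps.
 */
copy_table_data();               /* initial load into the new heap; only a
                                  * lock permitting concurrent changes held */
process_concurrent_changes();    /* 1st decoding pass: drain the bulk of the
                                  * WAL written during the copy, still without
                                  * blocking writers */
/* ... upgrade to AccessExclusiveLock on the old heap ... */
process_concurrent_changes();    /* 2nd decoding pass: apply the remainder
                                  * under the exclusive lock, bounded by
                                  * cluster_max_xlock_time */
/* swap the relation files, rebuild indexes, drop the transient table */
```

So the doc's claim holds in the sense that the first pass is expected to do
most of the work; what happens under the exclusive lock is only the residue
plus the file swap.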
+ Note that <command>CLUSTER</command> with the
+ the <literal>CONCURRENTLY</literal> option does not try to order the
+ rows inserted into the table after the clustering started.
Do you mean after the *logical decoding* started here? If CLUSTER CONCURRENTLY
does not order rows at all, why bother implementing it?
The rows inserted before CLUSTER (CONCURRENTLY) started do get ordered; the
rows inserted after that do not. (Actually what matters is when the snapshot
for the initial load is created, but that happens in a very early stage of
the processing. Not sure if the user is interested in such implementation
details.)
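A short session may make the distinction concrete (hypothetical syntax,
assuming the patch is applied; table and index names are only examples):

```sql
-- Rows already present when the initial-load snapshot is taken are
-- rewritten in index order:
CLUSTER (CONCURRENTLY) aircrafts_data USING aircrafts_data_pkey;

-- Rows that other sessions insert while the command runs are captured via
-- logical decoding and applied to the new heap in decoding order, i.e.
-- they end up appended after the ordered part, not merged into it.
```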
+ errhint("CLUSTER CONCURRENTLY is only allowed for permanent relations")));
errhint messages should end with a dot. Why hardcoded to "CLUSTER CONCURRENTLY"
instead of parameter *stmt*.
Fixed.
+ ResourceOwner oldowner = CurrentResourceOwner;
+
+ /*
+ * In the CONCURRENT case, do the planning in a subtrensaction so that
typo
Fixed.
I did not see VacuumStmt changes in gram.y, how do we suppose to
use the vacuum full concurrently? I tried the following but no success.
With the "parenthesized syntax", new options can be added w/o changing
gram.y. (While the "unparenthesized syntax" is deprecated.)
[local] postgres@demo:5432-36097=# vacuum (concurrently) aircrafts_data;
ERROR: CONCURRENTLY can only be specified with VACUUM FULL
The "lazy" VACUUM works concurrently as such.
[local] postgres@demo:5432-36097=# vacuum full (concurrently) full
aircrafts_data;
ERROR: syntax error at or near "("
LINE 1: vacuum full (concurrently) full aircrafts_data;
This is not specific to the CONCURRENTLY option. For example:
postgres=# vacuum full (analyze) full aircrafts_data;
ERROR: syntax error at or near "("
LINE 1: vacuum full (analyze) full aircrafts_data;
(You seem to combine the parenthesized syntax with the unparenthesized.)
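For clarity, the two syntaxes side by side (assuming a server with the patch
applied; the table name is just an example):

```sql
-- Parenthesized syntax: all options, including FULL and the new
-- CONCURRENTLY, go inside the parentheses.
VACUUM (FULL, CONCURRENTLY) aircrafts_data;

-- Unparenthesized (deprecated) syntax: no parentheses; new options such as
-- CONCURRENTLY cannot be expressed here.
VACUUM FULL aircrafts_data;
```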
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Attachments:
v03-0001-Adjust-signature-of-cluster_rel-and-its-subroutines.patch (text/x-diff)
From bc2372ebcc95229a0657d8f1ab5f7a9976a8dbeb Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Wed, 4 Sep 2024 12:29:27 +0200
Subject: [PATCH 1/8] Adjust signature of cluster_rel() and its subroutines.
So far cluster_rel() received OID of the relation it should process and it
performed opening and locking of the relation itself. Yet copy_table_data()
received the OID as well and also had to open the relation itself. This patch
tries to eliminate the repeated opening and closing.
One particular reason for this change is that the VACUUM FULL / CLUSTER
command with the CONCURRENTLY option will need to release all locks on the
relation (and possibly on the clustering index) at some point. Since it makes
little sense to keep relation reference w/o lock, the cluster_rel() function
also closes its reference to the relation (and its index). Neither the
function nor its subroutines may open extra references because then it'd be a
bit harder to close them all.
---
src/backend/commands/cluster.c | 146 ++++++++++++++++++-------------
src/backend/commands/matview.c | 2 +-
src/backend/commands/tablecmds.c | 2 +-
src/backend/commands/vacuum.c | 12 +--
src/include/commands/cluster.h | 5 +-
5 files changed, 99 insertions(+), 68 deletions(-)
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 78f96789b0..bedc177ce4 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -70,8 +70,8 @@ typedef struct
static void cluster_multiple_rels(List *rtcs, ClusterParams *params);
-static void rebuild_relation(Relation OldHeap, Oid indexOid, bool verbose);
-static void copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex,
+static void rebuild_relation(Relation OldHeap, Relation index, bool verbose);
+static void copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
bool verbose, bool *pSwapToastByContent,
TransactionId *pFreezeXid, MultiXactId *pCutoffMulti);
static List *get_tables_to_cluster(MemoryContext cluster_context);
@@ -194,11 +194,11 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
{
- /* close relation, keep lock till commit */
- table_close(rel, NoLock);
-
- /* Do the job. */
- cluster_rel(tableOid, indexOid, ¶ms);
+ /*
+ * Do the job. (The function will close the relation, lock is kept
+ * till commit.)
+ */
+ cluster_rel(rel, indexOid, ¶ms);
return;
}
@@ -275,6 +275,7 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
foreach(lc, rtcs)
{
RelToCluster *rtc = (RelToCluster *) lfirst(lc);
+ Relation rel;
/* Start a new transaction for each relation. */
StartTransactionCommand();
@@ -282,8 +283,13 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
/* functions in indexes may want a snapshot set */
PushActiveSnapshot(GetTransactionSnapshot());
- /* Do the job. */
- cluster_rel(rtc->tableOid, rtc->indexOid, params);
+ rel = table_open(rtc->tableOid, AccessExclusiveLock);
+
+ /*
+ * Do the job. (The function will close the relation, lock is kept
+ * till commit.)
+ */
+ cluster_rel(rel, rtc->indexOid, params);
PopActiveSnapshot();
CommitTransactionCommand();
@@ -306,16 +312,19 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
* If indexOid is InvalidOid, the table will be rewritten in physical order
* instead of index order. This is the new implementation of VACUUM FULL,
* and error messages should refer to the operation as VACUUM not CLUSTER.
+ *
+ * We expect that OldHeap is already locked in AccessExclusiveLock mode.
*/
void
-cluster_rel(Oid tableOid, Oid indexOid, ClusterParams *params)
+cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
{
- Relation OldHeap;
+ Oid tableOid = RelationGetRelid(OldHeap);
Oid save_userid;
int save_sec_context;
int save_nestlevel;
bool verbose = ((params->options & CLUOPT_VERBOSE) != 0);
bool recheck = ((params->options & CLUOPT_RECHECK) != 0);
+ Relation index = NULL;
/* Check for user-requested abort. */
CHECK_FOR_INTERRUPTS();
@@ -328,21 +337,6 @@ cluster_rel(Oid tableOid, Oid indexOid, ClusterParams *params)
pgstat_progress_update_param(PROGRESS_CLUSTER_COMMAND,
PROGRESS_CLUSTER_COMMAND_VACUUM_FULL);
- /*
- * We grab exclusive access to the target rel and index for the duration
- * of the transaction. (This is redundant for the single-transaction
- * case, since cluster() already did it.) The index lock is taken inside
- * check_index_is_clusterable.
- */
- OldHeap = try_relation_open(tableOid, AccessExclusiveLock);
-
- /* If the table has gone away, we can skip processing it */
- if (!OldHeap)
- {
- pgstat_progress_end_command();
- return;
- }
-
/*
* Switch to the table owner's userid, so that any index functions are run
* as that user. Also lock down security-restricted operations and
@@ -445,7 +439,11 @@ cluster_rel(Oid tableOid, Oid indexOid, ClusterParams *params)
/* Check heap and index are valid to cluster on */
if (OidIsValid(indexOid))
+ {
check_index_is_clusterable(OldHeap, indexOid, AccessExclusiveLock);
+ /* Open the index (It should already be locked.) */
+ index = index_open(indexOid, NoLock);
+ }
/*
* Quietly ignore the request if this is a materialized view which has not
@@ -474,9 +472,12 @@ cluster_rel(Oid tableOid, Oid indexOid, ClusterParams *params)
TransferPredicateLocksToHeapRelation(OldHeap);
/* rebuild_relation does all the dirty work */
- rebuild_relation(OldHeap, indexOid, verbose);
+ rebuild_relation(OldHeap, index, verbose);
- /* NB: rebuild_relation does table_close() on OldHeap */
+ /*
+ * NB: rebuild_relation does table_close() on OldHeap, and also on index,
+ * if the pointer is valid.
+ */
out:
/* Roll back any GUC changes executed by index functions */
@@ -625,22 +626,27 @@ mark_index_clustered(Relation rel, Oid indexOid, bool is_internal)
* rebuild_relation: rebuild an existing relation in index or physical order
*
* OldHeap: table to rebuild --- must be opened and exclusive-locked!
- * indexOid: index to cluster by, or InvalidOid to rewrite in physical order.
+ * index: index to cluster by, or NULL to rewrite in physical order. Must be
+ * opened and locked.
*
- * NB: this routine closes OldHeap at the right time; caller should not.
+ * On exit, the heap (and also the index, if one was passed) are closed, but
+ * still locked with AccessExclusiveLock.
*/
static void
-rebuild_relation(Relation OldHeap, Oid indexOid, bool verbose)
+rebuild_relation(Relation OldHeap, Relation index, bool verbose)
{
Oid tableOid = RelationGetRelid(OldHeap);
+ Oid indexOid = index ? RelationGetRelid(index) : InvalidOid;
Oid accessMethod = OldHeap->rd_rel->relam;
Oid tableSpace = OldHeap->rd_rel->reltablespace;
Oid OIDNewHeap;
+ Relation NewHeap;
char relpersistence;
bool is_system_catalog;
bool swap_toast_by_content;
TransactionId frozenXid;
MultiXactId cutoffMulti;
+ LOCKMODE lockmode_new;
if (OidIsValid(indexOid))
/* Mark the correct index as clustered */
@@ -650,19 +656,40 @@ rebuild_relation(Relation OldHeap, Oid indexOid, bool verbose)
relpersistence = OldHeap->rd_rel->relpersistence;
is_system_catalog = IsSystemRelation(OldHeap);
- /* Close relcache entry, but keep lock until transaction commit */
- table_close(OldHeap, NoLock);
-
- /* Create the transient table that will receive the re-ordered data */
+ /*
+ * Create the transient table that will receive the re-ordered data.
+ *
+ * NoLock for the old heap because we already have it locked and want to
+ * keep unlocking straightforward.
+ */
+ lockmode_new = AccessExclusiveLock;
OIDNewHeap = make_new_heap(tableOid, tableSpace,
accessMethod,
relpersistence,
- AccessExclusiveLock);
+ NoLock, &lockmode_new);
+ Assert(lockmode_new == AccessExclusiveLock || lockmode_new == NoLock);
+ /* Lock iff not done above. */
+ NewHeap = table_open(OIDNewHeap, lockmode_new == NoLock ?
+ AccessExclusiveLock : NoLock);
/* Copy the heap data into the new table in the desired order */
- copy_table_data(OIDNewHeap, tableOid, indexOid, verbose,
+ copy_table_data(NewHeap, OldHeap, index, verbose,
&swap_toast_by_content, &frozenXid, &cutoffMulti);
+
+ /* Close relcache entries, but keep lock until transaction commit */
+ table_close(OldHeap, NoLock);
+ if (index)
+ index_close(index, NoLock);
+
+ /*
+ * Close the new relation so it can be dropped as soon as the storage is
+ * swapped. The relation is not visible to others, so we could unlock it
+ * completely, but it's simpler to pass NoLock than to track all the locks
+ * acquired so far.
+ */
+ table_close(NewHeap, NoLock);
+
/*
* Swap the physical files of the target and transient tables, then
* rebuild the target's indexes and throw away the transient table.
@@ -683,10 +710,15 @@ rebuild_relation(Relation OldHeap, Oid indexOid, bool verbose)
*
* After this, the caller should load the new heap with transferred/modified
* data, then call finish_heap_swap to complete the operation.
+ *
+ * If a specific lock mode is needed for the new relation, pass it via the
+ * in/out parameter lockmode_new_p. On exit, the output value tells whether
+ * the lock was actually acquired.
*/
Oid
make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
- char relpersistence, LOCKMODE lockmode)
+ char relpersistence, LOCKMODE lockmode_old,
+ LOCKMODE *lockmode_new_p)
{
TupleDesc OldHeapDesc;
char NewHeapName[NAMEDATALEN];
@@ -697,8 +729,17 @@ make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
Datum reloptions;
bool isNull;
Oid namespaceid;
+ LOCKMODE lockmode_new;
- OldHeap = table_open(OIDOldHeap, lockmode);
+ if (lockmode_new_p)
+ {
+ lockmode_new = *lockmode_new_p;
+ *lockmode_new_p = NoLock;
+ }
+ else
+ lockmode_new = lockmode_old;
+
+ OldHeap = table_open(OIDOldHeap, lockmode_old);
OldHeapDesc = RelationGetDescr(OldHeap);
/*
@@ -792,7 +833,9 @@ make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
if (isNull)
reloptions = (Datum) 0;
- NewHeapCreateToastTable(OIDNewHeap, reloptions, lockmode, toastid);
+ NewHeapCreateToastTable(OIDNewHeap, reloptions, lockmode_new, toastid);
+ if (lockmode_new_p)
+ *lockmode_new_p = lockmode_new;
ReleaseSysCache(tuple);
}
@@ -811,13 +854,13 @@ make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
* *pCutoffMulti receives the MultiXactId used as a cutoff point.
*/
static void
-copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
+copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verbose,
bool *pSwapToastByContent, TransactionId *pFreezeXid,
MultiXactId *pCutoffMulti)
{
- Relation NewHeap,
- OldHeap,
- OldIndex;
+ Oid OIDOldHeap = RelationGetRelid(OldHeap);
+ Oid OIDOldIndex = OldIndex ? RelationGetRelid(OldIndex) : InvalidOid;
+ Oid OIDNewHeap = RelationGetRelid(NewHeap);
Relation relRelation;
HeapTuple reltup;
Form_pg_class relform;
@@ -836,16 +879,6 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
pg_rusage_init(&ru0);
- /*
- * Open the relations we need.
- */
- NewHeap = table_open(OIDNewHeap, AccessExclusiveLock);
- OldHeap = table_open(OIDOldHeap, AccessExclusiveLock);
- if (OidIsValid(OIDOldIndex))
- OldIndex = index_open(OIDOldIndex, AccessExclusiveLock);
- else
- OldIndex = NULL;
-
/* Store a copy of the namespace name for logging purposes */
nspname = get_namespace_name(RelationGetNamespace(OldHeap));
@@ -1001,11 +1034,6 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
tups_recently_dead,
pg_rusage_show(&ru0))));
- if (OldIndex != NULL)
- index_close(OldIndex, NoLock);
- table_close(OldHeap, NoLock);
- table_close(NewHeap, NoLock);
-
/* Update pg_class to reflect the correct values of pages and tuples. */
relRelation = table_open(RelationRelationId, RowExclusiveLock);
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index b2457f121a..7da6647f8f 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -318,7 +318,7 @@ RefreshMatViewByOid(Oid matviewOid, bool is_create, bool skipData,
*/
OIDNewHeap = make_new_heap(matviewOid, tableSpace,
matviewRel->rd_rel->relam,
- relpersistence, ExclusiveLock);
+ relpersistence, ExclusiveLock, NULL);
LockRelationOid(OIDNewHeap, AccessExclusiveLock);
/* Generate the data, if wanted. */
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index b3cc6f8f69..2b20b03224 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -5783,7 +5783,7 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
* unlogged anyway.
*/
OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, NewAccessMethod,
- persistence, lockmode);
+ persistence, lockmode, NULL);
/*
* Copy the heap data into the new table with the desired
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 7d8e9d2045..d32068b5d5 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -2193,15 +2193,17 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
{
ClusterParams cluster_params = {0};
- /* close relation before vacuuming, but hold lock until commit */
- relation_close(rel, NoLock);
- rel = NULL;
-
if ((params->options & VACOPT_VERBOSE) != 0)
cluster_params.options |= CLUOPT_VERBOSE;
/* VACUUM FULL is now a variant of CLUSTER; see cluster.c */
- cluster_rel(relid, InvalidOid, &cluster_params);
+ cluster_rel(rel, InvalidOid, &cluster_params);
+
+ /*
+ * cluster_rel() should have closed the relation; the lock is kept
+ * until commit.
+ */
+ rel = NULL;
}
else
table_relation_vacuum(rel, params, bstrategy);
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 4e32380417..7492796ea2 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -32,13 +32,14 @@ typedef struct ClusterParams
} ClusterParams;
extern void cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel);
-extern void cluster_rel(Oid tableOid, Oid indexOid, ClusterParams *params);
+extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params);
extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid,
LOCKMODE lockmode);
extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
extern Oid make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
- char relpersistence, LOCKMODE lockmode);
+ char relpersistence, LOCKMODE lockmode_old,
+ LOCKMODE *lockmode_new_p);
extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
bool is_system_catalog,
bool swap_toast_by_content,
--
2.45.2
Attachment: v03-0002-Move-progress-related-fields-from-PgBackendStatus-to.patch (text/x-diff)
From dc0f904f49fec1a3c9f8873e74dca0a8a6206052 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Wed, 4 Sep 2024 12:29:27 +0200
Subject: [PATCH 2/8] Move progress related fields from PgBackendStatus to
PgBackendProgress.
VACUUM FULL / CLUSTER CONCURRENTLY will need to save and restore these fields
at some point.
---
src/backend/utils/activity/backend_progress.c | 18 +++++++++---------
src/backend/utils/activity/backend_status.c | 4 ++--
src/backend/utils/adt/pgstatfuncs.c | 6 +++---
src/include/utils/backend_progress.h | 14 ++++++++++++++
src/include/utils/backend_status.h | 14 ++------------
5 files changed, 30 insertions(+), 26 deletions(-)
diff --git a/src/backend/utils/activity/backend_progress.c b/src/backend/utils/activity/backend_progress.c
index c78c5eb507..cc9b4cf0dc 100644
--- a/src/backend/utils/activity/backend_progress.c
+++ b/src/backend/utils/activity/backend_progress.c
@@ -33,9 +33,9 @@ pgstat_progress_start_command(ProgressCommandType cmdtype, Oid relid)
return;
PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
- beentry->st_progress_command = cmdtype;
- beentry->st_progress_command_target = relid;
- MemSet(&beentry->st_progress_param, 0, sizeof(beentry->st_progress_param));
+ beentry->st_progress.command = cmdtype;
+ beentry->st_progress.command_target = relid;
+ MemSet(&beentry->st_progress.param, 0, sizeof(beentry->st_progress.param));
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
@@ -56,7 +56,7 @@ pgstat_progress_update_param(int index, int64 val)
return;
PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
- beentry->st_progress_param[index] = val;
+ beentry->st_progress.param[index] = val;
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
@@ -77,7 +77,7 @@ pgstat_progress_incr_param(int index, int64 incr)
return;
PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
- beentry->st_progress_param[index] += incr;
+ beentry->st_progress.param[index] += incr;
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
@@ -134,7 +134,7 @@ pgstat_progress_update_multi_param(int nparam, const int *index,
{
Assert(index[i] >= 0 && index[i] < PGSTAT_NUM_PROGRESS_PARAM);
- beentry->st_progress_param[index[i]] = val[i];
+ beentry->st_progress.param[index[i]] = val[i];
}
PGSTAT_END_WRITE_ACTIVITY(beentry);
@@ -155,11 +155,11 @@ pgstat_progress_end_command(void)
if (!beentry || !pgstat_track_activities)
return;
- if (beentry->st_progress_command == PROGRESS_COMMAND_INVALID)
+ if (beentry->st_progress.command == PROGRESS_COMMAND_INVALID)
return;
PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
- beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
- beentry->st_progress_command_target = InvalidOid;
+ beentry->st_progress.command = PROGRESS_COMMAND_INVALID;
+ beentry->st_progress.command_target = InvalidOid;
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 34a55e2177..2b77fd8526 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -378,8 +378,8 @@ pgstat_bestart(void)
#endif
lbeentry.st_state = STATE_UNDEFINED;
- lbeentry.st_progress_command = PROGRESS_COMMAND_INVALID;
- lbeentry.st_progress_command_target = InvalidOid;
+ lbeentry.st_progress.command = PROGRESS_COMMAND_INVALID;
+ lbeentry.st_progress.command_target = InvalidOid;
lbeentry.st_query_id = UINT64CONST(0);
/*
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 97dc09ac0d..a005b746df 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -269,7 +269,7 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
* Report values for only those backends which are running the given
* command.
*/
- if (beentry->st_progress_command != cmdtype)
+ if (beentry->st_progress.command != cmdtype)
continue;
/* Value available to all callers */
@@ -279,9 +279,9 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
/* show rest of the values including relid only to role members */
if (HAS_PGSTAT_PERMISSIONS(beentry->st_userid))
{
- values[2] = ObjectIdGetDatum(beentry->st_progress_command_target);
+ values[2] = ObjectIdGetDatum(beentry->st_progress.command_target);
for (i = 0; i < PGSTAT_NUM_PROGRESS_PARAM; i++)
- values[i + 3] = Int64GetDatum(beentry->st_progress_param[i]);
+ values[i + 3] = Int64GetDatum(beentry->st_progress.param[i]);
}
else
{
diff --git a/src/include/utils/backend_progress.h b/src/include/utils/backend_progress.h
index 7b63d38f97..e09598eafc 100644
--- a/src/include/utils/backend_progress.h
+++ b/src/include/utils/backend_progress.h
@@ -30,8 +30,22 @@ typedef enum ProgressCommandType
PROGRESS_COMMAND_COPY,
} ProgressCommandType;
+
#define PGSTAT_NUM_PROGRESS_PARAM 20
+/*
+ * Any command which wishes can advertise that it is running by setting
+ * command, command_target, and param[]. command_target should be the OID of
+ * the relation which the command targets (we assume there's just one, as this
+ * is meant for utility commands), but the meaning of each element in the
+ * param array is command-specific.
+ */
+typedef struct PgBackendProgress
+{
+ ProgressCommandType command;
+ Oid command_target;
+ int64 param[PGSTAT_NUM_PROGRESS_PARAM];
+} PgBackendProgress;
extern void pgstat_progress_start_command(ProgressCommandType cmdtype,
Oid relid);
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 97874300c3..335faafcef 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -155,18 +155,8 @@ typedef struct PgBackendStatus
*/
char *st_activity_raw;
- /*
- * Command progress reporting. Any command which wishes can advertise
- * that it is running by setting st_progress_command,
- * st_progress_command_target, and st_progress_param[].
- * st_progress_command_target should be the OID of the relation which the
- * command targets (we assume there's just one, as this is meant for
- * utility commands), but the meaning of each element in the
- * st_progress_param array is command-specific.
- */
- ProgressCommandType st_progress_command;
- Oid st_progress_command_target;
- int64 st_progress_param[PGSTAT_NUM_PROGRESS_PARAM];
+ /* Command progress reporting. */
+ PgBackendProgress st_progress;
/* query identifier, optionally computed using post_parse_analyze_hook */
uint64 st_query_id;
--
2.45.2
Attachment: v03-0003-Move-conversion-of-a-historic-to-MVCC-snapshot-to-a-.patch (text/x-diff)
From f1f6650647b36bbdabf8fccef967a4a9d09977a0 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Wed, 4 Sep 2024 12:29:27 +0200
Subject: [PATCH 3/8] Move conversion of a "historic" to MVCC snapshot to a
separate function.
The conversion is now handled by SnapBuildMVCCFromHistoric(). The VACUUM FULL
/ CLUSTER will also need it.
---
src/backend/replication/logical/snapbuild.c | 51 +++++++++++++++++----
src/backend/utils/time/snapmgr.c | 3 +-
src/include/replication/snapbuild.h | 1 +
src/include/utils/snapmgr.h | 1 +
4 files changed, 45 insertions(+), 11 deletions(-)
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 0450f94ba8..4923e35e92 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -579,10 +579,7 @@ Snapshot
SnapBuildInitialSnapshot(SnapBuild *builder)
{
Snapshot snap;
- TransactionId xid;
TransactionId safeXid;
- TransactionId *newxip;
- int newxcnt = 0;
Assert(XactIsoLevel == XACT_REPEATABLE_READ);
Assert(builder->building_full_snapshot);
@@ -624,6 +621,31 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
MyProc->xmin = snap->xmin;
+ /* Convert the historic snapshot to MVCC snapshot. */
+ return SnapBuildMVCCFromHistoric(snap, true);
+}
+
+/*
+ * Turn a historic MVCC snapshot into an ordinary MVCC snapshot.
+ *
+ * Unlike a regular (non-historic) MVCC snapshot, the xip array of this
+ * snapshot contains not only running main transactions, but also their
+ * subtransactions. This difference has no impact on XidInMVCCSnapshot().
+ *
+ * Pass true for 'in_place' if you don't care about modifying the source
+ * snapshot. If you need a new instance, and one that was allocated as a
+ * single chunk of memory, pass false.
+ */
+Snapshot
+SnapBuildMVCCFromHistoric(Snapshot snapshot, bool in_place)
+{
+ TransactionId xid;
+ TransactionId *oldxip = snapshot->xip;
+ uint32 oldxcnt = snapshot->xcnt;
+ TransactionId *newxip;
+ int newxcnt = 0;
+ Snapshot result;
+
/* allocate in transaction context */
newxip = (TransactionId *)
palloc(sizeof(TransactionId) * GetMaxSnapshotXidCount());
@@ -634,7 +656,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
* classical snapshot by marking all non-committed transactions as
* in-progress. This can be expensive.
*/
- for (xid = snap->xmin; NormalTransactionIdPrecedes(xid, snap->xmax);)
+ for (xid = snapshot->xmin; NormalTransactionIdPrecedes(xid, snapshot->xmax);)
{
void *test;
@@ -642,7 +664,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
* Check whether transaction committed using the decoding snapshot
* meaning of ->xip.
*/
- test = bsearch(&xid, snap->xip, snap->xcnt,
+ test = bsearch(&xid, snapshot->xip, snapshot->xcnt,
sizeof(TransactionId), xidComparator);
if (test == NULL)
@@ -659,11 +681,22 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
}
/* adjust remaining snapshot fields as needed */
- snap->snapshot_type = SNAPSHOT_MVCC;
- snap->xcnt = newxcnt;
- snap->xip = newxip;
+ snapshot->xcnt = newxcnt;
+ snapshot->xip = newxip;
+
+ if (in_place)
+ result = snapshot;
+ else
+ {
+ result = CopySnapshot(snapshot);
+
+ /* Restore the original values so the source is intact. */
+ snapshot->xip = oldxip;
+ snapshot->xcnt = oldxcnt;
+ }
+ result->snapshot_type = SNAPSHOT_MVCC;
- return snap;
+ return result;
}
/*
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 7d2b34d4f2..b8b500f48f 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -155,7 +155,6 @@ typedef struct ExportedSnapshot
static List *exportedSnapshots = NIL;
/* Prototypes for local functions */
-static Snapshot CopySnapshot(Snapshot snapshot);
static void UnregisterSnapshotNoOwner(Snapshot snapshot);
static void FreeSnapshot(Snapshot snapshot);
static void SnapshotResetXmin(void);
@@ -570,7 +569,7 @@ SetTransactionSnapshot(Snapshot sourcesnap, VirtualTransactionId *sourcevxid,
* The copy is palloc'd in TopTransactionContext and has initial refcounts set
* to 0. The returned snapshot has the copied flag set.
*/
-static Snapshot
+Snapshot
CopySnapshot(Snapshot snapshot)
{
Snapshot newsnap;
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index caa5113ff8..ad06e80784 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -69,6 +69,7 @@ extern void FreeSnapshotBuilder(SnapBuild *builder);
extern void SnapBuildSnapDecRefcount(Snapshot snap);
extern Snapshot SnapBuildInitialSnapshot(SnapBuild *builder);
+extern Snapshot SnapBuildMVCCFromHistoric(Snapshot snapshot, bool in_place);
extern const char *SnapBuildExportSnapshot(SnapBuild *builder);
extern void SnapBuildClearExportedSnapshot(void);
extern void SnapBuildResetExportedSnapshotState(void);
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index 9398a84051..e7ac89f484 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -68,6 +68,7 @@ extern Snapshot GetLatestSnapshot(void);
extern void SnapshotSetCommandId(CommandId curcid);
extern Snapshot GetOldestSnapshot(void);
+extern Snapshot CopySnapshot(Snapshot snapshot);
extern Snapshot GetCatalogSnapshot(Oid relid);
extern Snapshot GetNonHistoricCatalogSnapshot(Oid relid);
extern void InvalidateCatalogSnapshot(void);
--
2.45.2
Attachment: v03-0004-Add-CONCURRENTLY-option-to-both-VACUUM-FULL-and-CLUS.patch (text/plain)
From b4cc790dd100143ee60568b4302d8206022ec2ed Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Wed, 4 Sep 2024 12:29:28 +0200
Subject: [PATCH 4/8] Add CONCURRENTLY option to both VACUUM FULL and CLUSTER
commands.
Both VACUUM FULL and CLUSTER commands copy the relation data into a new file,
create new indexes and eventually swap the files. To make sure that the old
file does not change during the copying, the relation is locked in an
exclusive mode, which prevents applications from both reading and writing. (To
keep the data consistent, we'd only need to prevent the applications from
writing, but even reading needs to be blocked before we can swap the files -
otherwise some applications could continue using the old file. Since we cannot
get a stronger lock without releasing the weaker one first, we acquire the
exclusive lock in the beginning and keep it till the end of the processing.)
This patch introduces an alternative workflow, which only requires the
exclusive lock when the relation (and index) files are being swapped.
(Supposedly, the swapping should be pretty fast.) On the other hand, when we
copy the data to the new file, we allow applications to read from the relation
and even write into it.
First, we scan the relation using a "historic snapshot", and insert all the
tuples satisfying this snapshot into the new file. Note that, before creating
that snapshot, we need to make sure that all the other backends treat the
relation as a system catalog: in particular, they must log information on new
command IDs (CIDs). We achieve that by adding the relation ID into a shared
hash table and waiting until all the transactions currently writing into the
table (i.e. transactions possibly not aware of the new entry) have finished.
Second, logical decoding is used to capture the data changes done by
applications during the copying (i.e. changes that do not satisfy the historic
snapshot mentioned above), and those are applied to the new file before we
acquire the exclusive lock we need to swap the files. (Of course, more data
changes can take place while we are waiting for the lock - these will be
applied to the new file after we have acquired the lock, before we swap the
files.)
While copying the data into the new file, we hold a lock that prevents
applications from changing the relation tuple descriptor (tuples inserted into
the old file must fit into the new file). However, as we have to release that
lock before getting the exclusive one, it's possible that someone adds or
drops a column, or changes the data type of an existing one. Therefore we have
to check the tuple descriptor before we swap the files. If we find out that
the tuple descriptor changed, ERROR is raised and all the changes are rolled
back. Since a lot of effort can be wasted in such a case, the ALTER TABLE
command also tries to check if VACUUM FULL / CLUSTER with the CONCURRENTLY
option is running on the same relation, and raises an ERROR if it is.
Like the existing implementation of both VACUUM FULL and CLUSTER commands, the
variant with the CONCURRENTLY option also requires an extra space for the new
relation and index files (which coexist with the old files for some time). In
addition, the CONCURRENTLY option might introduce a lag in releasing WAL
segments for archiving / recycling. This is due to the decoding of the data
changes done by applications concurrently. However, this lag should not be more
than a single WAL segment.
---
doc/src/sgml/monitoring.sgml | 36 +-
doc/src/sgml/ref/cluster.sgml | 111 +-
doc/src/sgml/ref/vacuum.sgml | 27 +-
src/Makefile | 1 +
src/backend/access/heap/heapam.c | 8 +-
src/backend/access/heap/heapam_handler.c | 141 +-
src/backend/access/heap/heapam_visibility.c | 30 +-
src/backend/catalog/index.c | 43 +-
src/backend/catalog/system_views.sql | 17 +-
src/backend/commands/cluster.c | 2584 ++++++++++++++++-
src/backend/commands/matview.c | 2 +-
src/backend/commands/tablecmds.c | 11 +
src/backend/commands/vacuum.c | 137 +-
src/backend/meson.build | 1 +
src/backend/replication/logical/decode.c | 24 +
src/backend/replication/logical/snapbuild.c | 20 +
.../replication/pgoutput_cluster/Makefile | 32 +
.../replication/pgoutput_cluster/meson.build | 18 +
.../pgoutput_cluster/pgoutput_cluster.c | 277 ++
src/backend/storage/ipc/ipci.c | 3 +
src/backend/tcop/utility.c | 11 +
src/backend/utils/activity/backend_progress.c | 16 +
.../utils/activity/wait_event_names.txt | 1 +
src/backend/utils/cache/inval.c | 22 +
src/backend/utils/cache/relcache.c | 5 +
src/backend/utils/time/snapmgr.c | 3 +-
src/bin/psql/tab-complete.c | 5 +-
src/include/access/heapam.h | 4 +
src/include/access/tableam.h | 10 +
src/include/catalog/index.h | 3 +
src/include/commands/cluster.h | 104 +-
src/include/commands/progress.h | 17 +-
src/include/commands/vacuum.h | 17 +-
src/include/replication/snapbuild.h | 1 +
src/include/storage/lockdefs.h | 2 +-
src/include/storage/lwlocklist.h | 1 +
src/include/utils/backend_progress.h | 3 +-
src/include/utils/inval.h | 2 +
src/include/utils/rel.h | 7 +-
src/include/utils/snapmgr.h | 2 +
src/test/regress/expected/rules.out | 17 +-
41 files changed, 3572 insertions(+), 204 deletions(-)
create mode 100644 src/backend/replication/pgoutput_cluster/Makefile
create mode 100644 src/backend/replication/pgoutput_cluster/meson.build
create mode 100644 src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 933de6fe07..ee26b03a05 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5640,14 +5640,35 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>heap_tuples_written</structfield> <type>bigint</type>
+ <structfield>heap_tuples_inserted</structfield> <type>bigint</type>
</para>
<para>
- Number of heap tuples written.
+ Number of heap tuples inserted.
This counter only advances when the phase is
<literal>seq scanning heap</literal>,
- <literal>index scanning heap</literal>
- or <literal>writing new heap</literal>.
+ <literal>index scanning heap</literal>,
+ <literal>writing new heap</literal>
+ or <literal>catch-up</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_updated</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples updated.
+ This counter only advances when the phase is <literal>catch-up</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_deleted</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples deleted.
+ This counter only advances when the phase is <literal>catch-up</literal>.
</para></entry>
</row>
@@ -5728,6 +5749,13 @@ FROM pg_stat_get_backend_idset() AS backendid;
<command>CLUSTER</command> is currently writing the new heap.
</entry>
</row>
+ <row>
+ <entry><literal>catch-up</literal></entry>
+ <entry>
+ <command>CLUSTER</command> is currently processing the DML commands
+ that other transactions executed during any of the preceding phases.
+ </entry>
+ </row>
<row>
<entry><literal>swapping relation files</literal></entry>
<entry>
diff --git a/doc/src/sgml/ref/cluster.sgml b/doc/src/sgml/ref/cluster.sgml
index c5760244e6..d8c3edb432 100644
--- a/doc/src/sgml/ref/cluster.sgml
+++ b/doc/src/sgml/ref/cluster.sgml
@@ -26,6 +26,7 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
<phrase>where <replaceable class="parameter">option</replaceable> can be one of:</phrase>
VERBOSE [ <replaceable class="parameter">boolean</replaceable> ]
+ CONCURRENTLY [ <replaceable class="parameter">boolean</replaceable> ]
</synopsis>
</refsynopsisdiv>
@@ -69,14 +70,18 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
<replaceable class="parameter">table_name</replaceable> reclusters all the
previously-clustered tables in the current database that the calling user
has privileges for. This form of <command>CLUSTER</command> cannot be
- executed inside a transaction block.
+ executed inside a transaction block. Also, if
+ the <literal>CONCURRENTLY</literal> option is used with this form, system
+ catalogs and <acronym>TOAST</acronym> tables are not processed.
</para>
<para>
- When a table is being clustered, an <literal>ACCESS
- EXCLUSIVE</literal> lock is acquired on it. This prevents any other
- database operations (both reads and writes) from operating on the
- table until the <command>CLUSTER</command> is finished.
+ When a table is being clustered, an <literal>ACCESS EXCLUSIVE</literal>
+ lock is acquired on it. This prevents any other database operations (both
+ reads and writes) from operating on the table until
+ the <command>CLUSTER</command> is finished. If you want to keep the table
+ accessible during the clustering, consider using
+ the <literal>CONCURRENTLY</literal> option.
</para>
</refsect1>
@@ -111,6 +116,102 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><literal>CONCURRENTLY</literal></term>
+ <listitem>
+ <para>
+ Allow other transactions to use the table while it is being clustered.
+ </para>
+
+ <para>
+ Internally, <command>CLUSTER</command> copies the contents of the table
+ (ignoring dead tuples) into a new file, sorted by the specified index,
+ and also creates a new file for each index. Then it swaps the old and
+ new files for the table and all the indexes, and deletes the old
+ files. The <literal>ACCESS EXCLUSIVE</literal> lock is needed to make
+ sure that the old files do not change during the processing because the
+ changes would get lost due to the swap.
+ </para>
+
+ <para>
+ With the <literal>CONCURRENTLY</literal> option, the <literal>ACCESS
+ EXCLUSIVE</literal> lock is only acquired to swap the table and index
+ files. The data changes that took place during the creation of the new
+ table and index files are captured using logical decoding
+ (<xref linkend="logicaldecoding"/>) and applied before
+ the <literal>ACCESS EXCLUSIVE</literal> lock is requested. Thus the lock
+ is typically held only for the time needed to swap the files, which
+ should be pretty short.
+ </para>
+
+ <para>
+ Note that <command>CLUSTER</command> with
+ the <literal>CONCURRENTLY</literal> option does not try to order the
+ rows inserted into the table after the clustering started. Also
+ note that <command>CLUSTER</command> might fail to complete due to DDL
+ commands executed on the table by other transactions during the
+ clustering.
+ </para>
+
+ <note>
+ <para>
+ In addition to the temporary space requirements explained below,
+ the <literal>CONCURRENTLY</literal> option can increase temporary
+ space usage somewhat. The reason is that other transactions can
+ perform DML operations which cannot be applied to the new file until
+ <command>CLUSTER</command> has copied all the tuples from the old
+ file. Thus the tuples inserted into the old file during the copying are
+ also stored separately in a temporary file, so they can eventually
+ be applied to the new file.
+ </para>
+
+ <para>
+ Furthermore, the data changes performed during the copying are
+ extracted from <link linkend="wal">write-ahead log</link> (WAL), and
+ this extraction (decoding) only takes place when certain amount of WAL
+ has been written. Therefore, WAL removal can be delayed by this
+ threshold. Currently the threshold is equal to the value of
+ the <link linkend="guc-wal-segment-size"><varname>wal_segment_size</varname></link>
+ configuration parameter.
+ </para>
+ </note>
+
+ <para>
+ The <literal>CONCURRENTLY</literal> option cannot be used in the
+ following cases:
+
+ <itemizedlist>
+ <listitem>
+ <para>
+ The table is a system catalog or a <acronym>TOAST</acronym> table.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <command>CLUSTER</command> is executed inside a transaction block.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The <link linkend="guc-wal-level"><varname>wal_level</varname></link>
+ configuration parameter is less than <literal>logical</literal>.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The <link linkend="guc-max-replication-slots"><varname>max_replication_slots</varname></link>
+ configuration parameter does not allow for creation of an additional
+ replication slot.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><replaceable class="parameter">boolean</replaceable></term>
<listitem>
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index 9857b35627..298cf7298d 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -39,6 +39,7 @@ VACUUM [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
SKIP_DATABASE_STATS [ <replaceable class="parameter">boolean</replaceable> ]
ONLY_DATABASE_STATS [ <replaceable class="parameter">boolean</replaceable> ]
BUFFER_USAGE_LIMIT <replaceable class="parameter">size</replaceable>
+ CONCURRENTLY [ <replaceable class="parameter">boolean</replaceable> ]
<phrase>and <replaceable class="parameter">table_and_columns</replaceable> is:</phrase>
@@ -61,8 +62,12 @@ VACUUM [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
<para>
Without a <replaceable class="parameter">table_and_columns</replaceable>
list, <command>VACUUM</command> processes every table and materialized view
- in the current database that the current user has permission to vacuum.
- With a list, <command>VACUUM</command> processes only those table(s).
+ in the current database that the current user has permission to vacuum. If
+ the <literal>CONCURRENTLY</literal> option is specified (see below), tables which
+ have not been clustered yet are silently skipped. With a
+ list, <command>VACUUM</command> processes only those table(s). If
+ the <literal>CONCURRENTLY</literal> option is specified, the list may only contain
+ tables which have already been clustered.
</para>
<para>
@@ -360,6 +365,24 @@ VACUUM [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><literal>CONCURRENTLY</literal></term>
+ <listitem>
+ <para>
+ Allow other transactions to use the table while it is being vacuumed. If
+ this option is specified, <command>VACUUM</command> can only process
+ tables which have already been clustered. For more information, see the
+ description of the <literal>CONCURRENTLY</literal> option of the
+ <xref linkend="sql-cluster"/> command.
+ </para>
+
+ <para>
+ The <literal>CONCURRENTLY</literal> option can only be used
+ if <literal>FULL</literal> is used at the same time.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><replaceable class="parameter">boolean</replaceable></term>
<listitem>
diff --git a/src/Makefile b/src/Makefile
index 2f31a2f20a..8b9d30ff72 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -23,6 +23,7 @@ SUBDIRS = \
interfaces \
backend/replication/libpqwalreceiver \
backend/replication/pgoutput \
+ backend/replication/pgoutput_cluster \
fe_utils \
bin \
pl \
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 91b20147a0..1fdcc0abee 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2079,8 +2079,14 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
/*
* If this is a catalog, we need to transmit combo CIDs to properly
* decode, so log that as well.
+ *
+ * For the main heap (as opposed to TOAST), we only receive
+ * HEAP_INSERT_NO_LOGICAL when doing VACUUM FULL / CLUSTER, in which
+ * case the visibility information does not change. Therefore, there's
+ * no need to update the decoding snapshot.
*/
- if (RelationIsAccessibleInLogicalDecoding(relation))
+ if ((options & HEAP_INSERT_NO_LOGICAL) == 0 &&
+ RelationIsAccessibleInLogicalDecoding(relation))
log_heap_new_cid(relation, heaptup);
/*
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 1c6da286d4..4ddb1c4a0c 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -33,6 +33,7 @@
#include "catalog/index.h"
#include "catalog/storage.h"
#include "catalog/storage_xlog.h"
+#include "commands/cluster.h"
#include "commands/progress.h"
#include "executor/executor.h"
#include "miscadmin.h"
@@ -53,6 +54,9 @@ static void reform_and_rewrite_tuple(HeapTuple tuple,
static bool SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
HeapTuple tuple,
OffsetNumber tupoffset);
+static bool accept_tuple_for_concurrent_copy(HeapTuple tuple,
+ Snapshot snapshot,
+ Buffer buffer);
static BlockNumber heapam_scan_get_blocks_done(HeapScanDesc hscan);
@@ -682,6 +686,8 @@ static void
heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
Relation OldIndex, bool use_sort,
TransactionId OldestXmin,
+ Snapshot snapshot,
+ LogicalDecodingContext *decoding_ctx,
TransactionId *xid_cutoff,
MultiXactId *multi_cutoff,
double *num_tuples,
@@ -702,6 +708,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
bool *isnull;
BufferHeapTupleTableSlot *hslot;
BlockNumber prev_cblock = InvalidBlockNumber;
+ bool concurrent = snapshot != NULL;
+ XLogRecPtr end_of_wal_prev = GetFlushRecPtr(NULL);
/* Remember if it's a system catalog */
is_system_catalog = IsSystemRelation(OldHeap);
@@ -782,6 +790,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
HeapTuple tuple;
Buffer buf;
bool isdead;
+ HTSV_Result vis;
CHECK_FOR_INTERRUPTS();
@@ -836,7 +845,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
LockBuffer(buf, BUFFER_LOCK_SHARE);
- switch (HeapTupleSatisfiesVacuum(tuple, OldestXmin, buf))
+ switch ((vis = HeapTupleSatisfiesVacuum(tuple, OldestXmin, buf)))
{
case HEAPTUPLE_DEAD:
/* Definitely dead */
@@ -852,14 +861,15 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
case HEAPTUPLE_INSERT_IN_PROGRESS:
/*
- * Since we hold exclusive lock on the relation, normally the
- * only way to see this is if it was inserted earlier in our
- * own transaction. However, it can happen in system
+ * As long as we hold exclusive lock on the relation, normally
+ * the only way to see this is if it was inserted earlier in
+ * our own transaction. However, it can happen in system
* catalogs, since we tend to release write lock before commit
- * there. Give a warning if neither case applies; but in any
- * case we had better copy it.
+ * there. Also, there's no exclusive lock during concurrent
+ * processing. Give a warning if neither case applies; but in
+ * any case we had better copy it.
*/
- if (!is_system_catalog &&
+ if (!is_system_catalog && !concurrent &&
!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(tuple->t_data)))
elog(WARNING, "concurrent insert in progress within table \"%s\"",
RelationGetRelationName(OldHeap));
@@ -871,7 +881,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
/*
* Similar situation to INSERT_IN_PROGRESS case.
*/
- if (!is_system_catalog &&
+ if (!is_system_catalog && !concurrent &&
!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetUpdateXid(tuple->t_data)))
elog(WARNING, "concurrent delete in progress within table \"%s\"",
RelationGetRelationName(OldHeap));
@@ -885,8 +895,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
break;
}
- LockBuffer(buf, BUFFER_LOCK_UNLOCK);
-
if (isdead)
{
*tups_vacuumed += 1;
@@ -897,9 +905,39 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
*tups_vacuumed += 1;
*tups_recently_dead -= 1;
}
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
continue;
}
+ /*
+ * Ignore concurrent changes for now; they'll be processed later via
+ * logical decoding. INSERT_IN_PROGRESS is rejected right away because
+ * our snapshot represents a point in time that precedes (or equals) the
+ * state of transactions at the moment the "SatisfiesVacuum" test was
+ * performed, so accept_tuple_for_concurrent_copy() should not consider
+ * the tuple inserted.
+ */
+ if (concurrent &&
+ (vis == HEAPTUPLE_INSERT_IN_PROGRESS ||
+ !accept_tuple_for_concurrent_copy(tuple, snapshot, buf)))
+ {
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ continue;
+ }
+
+ /*
+ * In the concurrent case, we should not unlock the buffer until the
+ * tuple has been copied to the new file: if a concurrent transaction
+ * marked it updated or deleted in between, we'd fail to replay that
+ * transaction's changes because then we'd try to perform the same
+ * UPDATE / DELETE twice. XXX Should we instead create a copy of the
+ * tuple so that the buffer can be unlocked right away?
+ */
+ if (!concurrent)
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
*num_tuples += 1;
if (tuplesort != NULL)
{
@@ -916,7 +954,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
{
const int ct_index[] = {
PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED,
- PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN
+ PROGRESS_CLUSTER_HEAP_TUPLES_INSERTED
};
int64 ct_val[2];
@@ -931,6 +969,35 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
ct_val[1] = *num_tuples;
pgstat_progress_update_multi_param(2, ct_index, ct_val);
}
+
+ /* See the comment on unlocking above. */
+ if (concurrent)
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ /*
+ * Process the WAL produced by the load, as well as by other
+ * transactions, so that the replication slot can advance and WAL does
+ * not pile up. Use wal_segment_size as a threshold so that we do not
+ * introduce the decoding overhead too often.
+ *
+ * Of course, we must not apply the changes until the initial load has
+ * completed.
+ *
+ * Note that our insertions into the new table should not be decoded
+ * as we (intentionally) do not write the logical-decoding-specific
+ * information to WAL.
+ */
+ if (concurrent)
+ {
+ XLogRecPtr end_of_wal;
+
+ end_of_wal = GetFlushRecPtr(NULL);
+ if ((end_of_wal - end_of_wal_prev) > wal_segment_size)
+ {
+ cluster_decode_concurrent_changes(decoding_ctx, end_of_wal);
+ end_of_wal_prev = end_of_wal;
+ }
+ }
}
if (indexScan != NULL)
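The decode-throttling logic in the loop above only pays the decoding cost once at least one WAL segment has accumulated since the previous pass. A minimal sketch of that bookkeeping, with XLogRecPtr reduced to a plain 64-bit byte position and a fixed 16 MB segment size (both assumptions; the server reads the actual value from wal_segment_size):

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* WAL byte position, as in PostgreSQL */

/* Assumed segment size; illustrative only. */
static const uint64_t wal_segment_size = 16 * 1024 * 1024;

/* Returns true when enough WAL has accumulated since *end_of_wal_prev to
 * make another decoding pass worthwhile, and advances the bookmark so the
 * caller can decode up to end_of_wal. */
static bool
should_decode_concurrent_changes(XLogRecPtr end_of_wal,
                                 XLogRecPtr *end_of_wal_prev)
{
    if ((end_of_wal - *end_of_wal_prev) > wal_segment_size)
    {
        *end_of_wal_prev = end_of_wal;
        return true;
    }
    return false;
}
```

Throttling this way keeps the replication slot advancing (so WAL does not pile up) without invoking the decoder on every copied tuple.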
@@ -974,7 +1041,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
values, isnull,
rwstate);
/* Report n_tuples */
- pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN,
+ pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_TUPLES_INSERTED,
n_tuples);
}
@@ -2579,6 +2646,56 @@ SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
}
}
+/*
+ * Check if the tuple was inserted, updated or deleted while
+ * heapam_relation_copy_for_cluster() was copying the data.
+ *
+ * 'snapshot' is used to determine whether xmin/xmax was set by a transaction
+ * that is still in-progress, or one that started in the future from the
+ * snapshot perspective.
+ *
+ * Returns true if the insertion is visible to 'snapshot', but clears xmax if
+ * it was set by a transaction which is in-progress or in the future from the
+ * snapshot perspective. (The xmax will be set later, when we decode the
+ * corresponding UPDATE / DELETE from WAL.)
+ *
+ * Returns false if the insertion is not visible to 'snapshot'.
+ */
+static bool
+accept_tuple_for_concurrent_copy(HeapTuple tuple, Snapshot snapshot,
+ Buffer buffer)
+{
+ Assert(snapshot->snapshot_type == SNAPSHOT_MVCC);
+
+ /*
+ * First, check if the tuple should be rejected because it was inserted
+ * concurrently.
+ */
+ if (!HeapTupleMVCCInserted(tuple, snapshot, buffer))
+ return false;
+
+ /*
+ * If the tuple was deleted / updated but our snapshot still sees it, we
+ * need to keep it. In that case, clear the information that indicates the
+ * deletion / update. Otherwise the tuple chain would stay incomplete (as
+ * we will reject the new tuple above), and the delete / update would fail
+ * if executed later during logical decoding.
+ */
+ if (TransactionIdIsNormal(HeapTupleHeaderGetRawXmax(tuple->t_data)) &&
+ HeapTupleMVCCNotDeleted(tuple, snapshot, buffer))
+ {
+ /* TODO: More work needed here? */
+ tuple->t_data->t_infomask |= HEAP_XMAX_INVALID;
+ HeapTupleHeaderSetXmax(tuple->t_data, 0);
+ }
+
+ /*
+ * Accept the tuple even if our snapshot considers it deleted - older
+ * snapshots can still see the tuple.
+ */
+ return true;
+}
+
/* ------------------------------------------------------------------------
* Definition of the heap table access method.
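The decision made by accept_tuple_for_concurrent_copy() can be sketched as a pure function, with the two MVCC checks replaced by booleans supplied by the caller (the parameter names and the xmax_cleared out-flag are illustrative stand-ins, not the patch's actual interface):

```c
#include <stdbool.h>

/* Simplified sketch of the accept_tuple_for_concurrent_copy() decision.
 * "xmax_cleared" reports whether the sketch would clear the tuple's xmax
 * so the later UPDATE / DELETE can still be replayed from WAL. */
static bool
accept_for_concurrent_copy(bool inserted_per_snapshot,
                           bool xmax_is_set,
                           bool deleted_per_snapshot,
                           bool *xmax_cleared)
{
    *xmax_cleared = false;

    /* Reject tuples whose insertion our snapshot does not see;
     * logical decoding will deliver them later. */
    if (!inserted_per_snapshot)
        return false;

    /* Visible but concurrently updated / deleted: keep the tuple, but
     * forget the deletion so that replaying the change later via logical
     * decoding does not fail on an already-dead tuple. */
    if (xmax_is_set && !deleted_per_snapshot)
        *xmax_cleared = true;

    /* Accept even if our snapshot already sees it deleted; older
     * snapshots may still need it. */
    return true;
}
```
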
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index 9243feed01..d702592469 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -955,16 +955,31 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
* did TransactionIdIsInProgress in each call --- to no avail, as long as the
* inserting/deleting transaction was still running --- which was more cycles
* and more contention on ProcArrayLock.
+ *
+ * The checks are split into two functions, HeapTupleMVCCInserted() and
+ * HeapTupleMVCCNotDeleted(), because they are also useful separately.
*/
static bool
HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
Buffer buffer)
{
- HeapTupleHeader tuple = htup->t_data;
-
Assert(ItemPointerIsValid(&htup->t_self));
Assert(htup->t_tableOid != InvalidOid);
+ return HeapTupleMVCCInserted(htup, snapshot, buffer) &&
+ HeapTupleMVCCNotDeleted(htup, snapshot, buffer);
+}
+
+/*
+ * HeapTupleMVCCInserted
+ * True iff heap tuple was successfully inserted for the given MVCC
+ * snapshot.
+ */
+bool
+HeapTupleMVCCInserted(HeapTuple htup, Snapshot snapshot, Buffer buffer)
+{
+ HeapTupleHeader tuple = htup->t_data;
+
if (!HeapTupleHeaderXminCommitted(tuple))
{
if (HeapTupleHeaderXminInvalid(tuple))
@@ -1073,6 +1088,17 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
}
/* by here, the inserting transaction has committed */
+ return true;
+}
+
+/*
+ * HeapTupleMVCCNotDeleted
+ * True iff heap tuple was not deleted for the given MVCC snapshot.
+ */
+bool
+HeapTupleMVCCNotDeleted(HeapTuple htup, Snapshot snapshot, Buffer buffer)
+{
+ HeapTupleHeader tuple = htup->t_data;
if (tuple->t_infomask & HEAP_XMAX_INVALID) /* xid invalid or aborted */
return true;
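The heapam_visibility.c change splits one MVCC test into two composable predicates. The shape of the refactoring, with the tuple and snapshot details reduced to a toy struct (everything here is a stand-in, not the real HeapTuple machinery):

```c
#include <stdbool.h>

/* Toy reduction of the visibility inputs. */
typedef struct ToyTuple
{
    bool xmin_visible;   /* inserting xact committed before the snapshot */
    bool xmax_visible;   /* deleting xact committed before the snapshot */
} ToyTuple;

/* Analogue of HeapTupleMVCCInserted(). */
static bool
tuple_inserted(const ToyTuple *tup)
{
    return tup->xmin_visible;
}

/* Analogue of HeapTupleMVCCNotDeleted(). */
static bool
tuple_not_deleted(const ToyTuple *tup)
{
    return !tup->xmax_visible;
}

/* Analogue of HeapTupleSatisfiesMVCC() after the split: the original
 * semantics are preserved as the conjunction of the two predicates. */
static bool
tuple_satisfies_mvcc(const ToyTuple *tup)
{
    return tuple_inserted(tup) && tuple_not_deleted(tup);
}
```

Splitting the test this way lets the concurrent-copy path ask only "was it inserted?" or only "was it deleted?" without duplicating the visibility logic.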
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 33759056e3..aab2712794 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1415,22 +1415,7 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
opclassOptions[i] = get_attoptions(oldIndexId, i + 1);
- /* Extract statistic targets for each attribute */
- stattargets = palloc0_array(NullableDatum, newInfo->ii_NumIndexAttrs);
- for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
- {
- HeapTuple tp;
- Datum dat;
-
- tp = SearchSysCache2(ATTNUM, ObjectIdGetDatum(oldIndexId), Int16GetDatum(i + 1));
- if (!HeapTupleIsValid(tp))
- elog(ERROR, "cache lookup failed for attribute %d of relation %u",
- i + 1, oldIndexId);
- dat = SysCacheGetAttr(ATTNUM, tp, Anum_pg_attribute_attstattarget, &isnull);
- ReleaseSysCache(tp);
- stattargets[i].value = dat;
- stattargets[i].isnull = isnull;
- }
+ stattargets = get_index_stattargets(oldIndexId, newInfo);
/*
* Now create the new index.
@@ -1469,6 +1454,32 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
return newIndexId;
}
+NullableDatum *
+get_index_stattargets(Oid indexid, IndexInfo *indInfo)
+{
+ NullableDatum *stattargets;
+
+ /* Extract statistic targets for each attribute */
+ stattargets = palloc0_array(NullableDatum, indInfo->ii_NumIndexAttrs);
+ for (int i = 0; i < indInfo->ii_NumIndexAttrs; i++)
+ {
+ HeapTuple tp;
+ Datum dat;
+ bool isnull;
+
+ tp = SearchSysCache2(ATTNUM, ObjectIdGetDatum(indexid), Int16GetDatum(i + 1));
+ if (!HeapTupleIsValid(tp))
+ elog(ERROR, "cache lookup failed for attribute %d of relation %u",
+ i + 1, indexid);
+ dat = SysCacheGetAttr(ATTNUM, tp, Anum_pg_attribute_attstattarget, &isnull);
+ ReleaseSysCache(tp);
+ stattargets[i].value = dat;
+ stattargets[i].isnull = isnull;
+ }
+
+ return stattargets;
+}
+
/*
* index_concurrently_build
*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 7fd5d256a1..3b6419f878 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1236,16 +1236,19 @@ CREATE VIEW pg_stat_progress_cluster AS
WHEN 2 THEN 'index scanning heap'
WHEN 3 THEN 'sorting tuples'
WHEN 4 THEN 'writing new heap'
- WHEN 5 THEN 'swapping relation files'
- WHEN 6 THEN 'rebuilding index'
- WHEN 7 THEN 'performing final cleanup'
+ WHEN 5 THEN 'catch-up'
+ WHEN 6 THEN 'swapping relation files'
+ WHEN 7 THEN 'rebuilding index'
+ WHEN 8 THEN 'performing final cleanup'
END AS phase,
CAST(S.param3 AS oid) AS cluster_index_relid,
S.param4 AS heap_tuples_scanned,
- S.param5 AS heap_tuples_written,
- S.param6 AS heap_blks_total,
- S.param7 AS heap_blks_scanned,
- S.param8 AS index_rebuild_count
+ S.param5 AS heap_tuples_inserted,
+ S.param6 AS heap_tuples_updated,
+ S.param7 AS heap_tuples_deleted,
+ S.param8 AS heap_blks_total,
+ S.param9 AS heap_blks_scanned,
+ S.param10 AS index_rebuild_count
FROM pg_stat_get_progress_info('CLUSTER') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index bedc177ce4..77511109ce 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -25,6 +25,10 @@
#include "access/toast_internals.h"
#include "access/transam.h"
#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xloginsert.h"
+#include "access/xlogutils.h"
#include "catalog/catalog.h"
#include "catalog/dependency.h"
#include "catalog/heap.h"
@@ -32,6 +36,7 @@
#include "catalog/namespace.h"
#include "catalog/objectaccess.h"
#include "catalog/pg_am.h"
+#include "catalog/pg_control.h"
#include "catalog/pg_database.h"
#include "catalog/pg_inherits.h"
#include "catalog/toasting.h"
@@ -40,10 +45,15 @@
#include "commands/progress.h"
#include "commands/tablecmds.h"
#include "commands/vacuum.h"
+#include "executor/executor.h"
#include "miscadmin.h"
#include "optimizer/optimizer.h"
#include "pgstat.h"
+#include "replication/decode.h"
+#include "replication/logical.h"
+#include "replication/snapbuild.h"
#include "storage/bufmgr.h"
+#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
#include "utils/acl.h"
@@ -57,6 +67,8 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
+typedef struct RewriteStateData *RewriteState;
+
/*
* This struct is used to pass around the information on tables to be
* clustered. We need this so we can make a list of them when invoked without
@@ -68,17 +80,184 @@ typedef struct
Oid indexOid;
} RelToCluster;
+/*
+ * The following definitions are used for concurrent processing.
+ */
+
+/*
+ * OID of the table being processed by CLUSTER CONCURRENTLY by this backend.
+ */
+static Oid clustered_rel = InvalidOid;
+/* The same for its TOAST relation. */
+static Oid clustered_rel_toast = InvalidOid;
+
+/*
+ * The locators are used to avoid logical decoding of data that we do not need
+ * for our table.
+ */
+RelFileLocator clustered_rel_locator = {.relNumber = InvalidOid};
+RelFileLocator clustered_rel_toast_locator = {.relNumber = InvalidOid};
+
+/* XXX Do we also need to mention VACUUM FULL CONCURRENTLY? */
+#define CLUSTER_IN_PROGRESS_MESSAGE \
+ "relation \"%s\" is already being processed by CLUSTER CONCURRENTLY"
+
+/*
+ * Everything we need to call ExecInsertIndexTuples().
+ */
+typedef struct IndexInsertState
+{
+ ResultRelInfo *rri;
+ EState *estate;
+ ExprContext *econtext;
+
+ Relation ident_index;
+} IndexInsertState;
-static void cluster_multiple_rels(List *rtcs, ClusterParams *params);
-static void rebuild_relation(Relation OldHeap, Relation index, bool verbose);
+/*
+ * Catalog information to check if another backend changed the relation in
+ * such a way that makes CLUSTER CONCURRENTLY unable to continue. Such changes
+ * are possible because cluster_rel() has to release its lock on the relation
+ * in order to acquire AccessExclusiveLock that it needs to swap the relation
+ * files.
+ *
+ * The most obvious problem is that the tuple descriptor has changed, since
+ * then the tuples we try to insert into the new storage are not guaranteed to
+ * fit into the storage.
+ *
+ * Another problem is the relfilenode being changed by another backend. It's
+ * not necessarily a correctness issue (e.g. when the other backend ran
+ * cluster_rel()), but it's safer for us to terminate the table processing in
+ * such cases. This information also needs to be checked during logical
+ * decoding, so we store it in the global variables clustered_rel_locator
+ * and clustered_rel_toast_locator above.
+ *
+ * Where possible, commands that might change the relation in an incompatible
+ * way should check whether CLUSTER CONCURRENTLY is running before they make
+ * the actual changes (see is_concurrent_cluster_in_progress()). Anything
+ * else must be caught by check_catalog_changes(), which uses this structure.
+ */
+typedef struct CatalogState
+{
+ /* Tuple descriptor of the relation. */
+ TupleDesc tupdesc;
+
+ /* The number of indexes tracked. */
+ int ninds;
+ /* The index OIDs. */
+ Oid *ind_oids;
+ /* The index tuple descriptors. */
+ TupleDesc *ind_tupdescs;
+
+ /* The following are copies of the corresponding fields of pg_class. */
+ Oid reltoastrelid;
+ char relpersistence;
+ char replident;
+
+ /* rd_replidindex */
+ Oid replidindex;
+} CatalogState;
+
+/* The WAL segment being decoded. */
+static XLogSegNo cluster_current_segment = 0;
+
+static void cluster_multiple_rels(List *rtcs, ClusterParams *params,
+ LOCKMODE lockmode, bool isTopLevel);
+static void rebuild_relation(Relation OldHeap, Relation index, bool verbose,
+ bool concurrent);
static void copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
+ Snapshot snapshot, LogicalDecodingContext *decoding_ctx,
bool verbose, bool *pSwapToastByContent,
TransactionId *pFreezeXid, MultiXactId *pCutoffMulti);
static List *get_tables_to_cluster(MemoryContext cluster_context);
static List *get_tables_to_cluster_partitioned(MemoryContext cluster_context,
Oid indexOid);
static bool cluster_is_permitted_for_relation(Oid relid, Oid userid);
+static void check_concurrent_cluster_requirements(Relation rel,
+ bool isTopLevel,
+ bool isCluster);
+static void begin_concurrent_cluster(Relation *rel_p, Relation *index_p,
+ bool *entered_p);
+static void end_concurrent_cluster(Oid relid, bool error);
+static void cluster_before_shmem_exit_callback(int code, Datum arg);
+static CatalogState *get_catalog_state(Relation rel);
+static void free_catalog_state(CatalogState *state);
+static void check_catalog_changes(Relation rel, CatalogState *cat_state);
+static LogicalDecodingContext *setup_logical_decoding(Oid relid,
+ const char *slotname,
+ TupleDesc tupdesc);
+static HeapTuple get_changed_tuple(ConcurrentChange *change);
+static void apply_concurrent_changes(ClusterDecodingState *dstate,
+ Relation rel, ScanKey key, int nkeys,
+ IndexInsertState *iistate);
+static void apply_concurrent_insert(Relation rel, ConcurrentChange *change,
+ HeapTuple tup, IndexInsertState *iistate,
+ TupleTableSlot *index_slot);
+static void apply_concurrent_update(Relation rel, HeapTuple tup,
+ HeapTuple tup_target,
+ ConcurrentChange *change,
+ IndexInsertState *iistate,
+ TupleTableSlot *index_slot);
+static void apply_concurrent_delete(Relation rel, HeapTuple tup_target,
+ ConcurrentChange *change);
+static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
+ HeapTuple tup_key,
+ IndexInsertState *iistate,
+ TupleTableSlot *ident_slot,
+ IndexScanDesc *scan_p);
+static void process_concurrent_changes(LogicalDecodingContext *ctx,
+ XLogRecPtr end_of_wal,
+ Relation rel_dst,
+ Relation rel_src,
+ ScanKey ident_key,
+ int ident_key_nentries,
+ IndexInsertState *iistate);
+static IndexInsertState *get_index_insert_state(Relation relation,
+ Oid ident_index_id);
+static ScanKey build_identity_key(Oid ident_idx_oid, Relation rel_src,
+ int *nentries);
+static void free_index_insert_state(IndexInsertState *iistate);
+static void cleanup_logical_decoding(LogicalDecodingContext *ctx);
+static void rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
+ Relation cl_index,
+ CatalogState *cat_state,
+ LogicalDecodingContext *ctx,
+ bool swap_toast_by_content,
+ TransactionId frozenXid,
+ MultiXactId cutoffMulti);
+static List *build_new_indexes(Relation NewHeap, Relation OldHeap, List *OldIndexes);
+
+/*
+ * Use this API when a relation needs to be unlocked, closed and re-opened.
+ * If the relation got dropped while unlocked, raise an ERROR that mentions
+ * the relation name rather than its OID.
+ */
+typedef struct RelReopenInfo
+{
+ /*
+ * The relation to be closed. A pointer to the value is stored here so
+ * that the caller's reference is updated automatically on re-opening.
+ *
+ * When calling unlock_and_close_relations(), 'relid' can be passed
+ * instead of 'rel_p' when the caller only needs to gather information for
+ * subsequent opening.
+ */
+ Relation *rel_p;
+ Oid relid;
+ char relkind;
+ LOCKMODE lockmode_orig; /* The existing lock mode */
+ LOCKMODE lockmode_new; /* The lock mode after the relation is
+ * re-opened */
+
+ char *relname; /* Relation name, initialized automatically. */
+} RelReopenInfo;
+
+static void init_rel_reopen_info(RelReopenInfo *rri, Relation *rel_p,
+ Oid relid, LOCKMODE lockmode_orig,
+ LOCKMODE lockmode_new);
+static void unlock_and_close_relations(RelReopenInfo *rels, int nrel);
+static void reopen_relations(RelReopenInfo *rels, int nrel);
/*---------------------------------------------------------------------------
* This cluster code allows for clustering multiple tables at once. Because
@@ -110,10 +289,12 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
ListCell *lc;
ClusterParams params = {0};
bool verbose = false;
+ bool concurrent = false;
Relation rel = NULL;
Oid indexOid = InvalidOid;
MemoryContext cluster_context;
List *rtcs;
+ LOCKMODE lockmode;
/* Parse option list */
foreach(lc, stmt->params)
@@ -122,6 +303,8 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
if (strcmp(opt->defname, "verbose") == 0)
verbose = defGetBoolean(opt);
+ else if (strcmp(opt->defname, "concurrently") == 0)
+ concurrent = defGetBoolean(opt);
else
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
@@ -130,20 +313,30 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
parser_errposition(pstate, opt->location)));
}
- params.options = (verbose ? CLUOPT_VERBOSE : 0);
+ params.options =
+ (verbose ? CLUOPT_VERBOSE : 0) |
+ (concurrent ? CLUOPT_CONCURRENT : 0);
+
+ /*
+ * Determine the lock mode expected by cluster_rel().
+ *
+ * In the exclusive case, we obtain AccessExclusiveLock right away to
+ * avoid lock-upgrade hazard in the single-transaction case. In the
+ * CONCURRENT case, the AccessExclusiveLock will only be used at the end
+ * of processing, presumably for a very short time. Until then, we'll have
+ * to unlock the relation temporarily, so there's no lock-upgrade hazard.
+ */
+ lockmode = (params.options & CLUOPT_CONCURRENT) == 0 ?
+ AccessExclusiveLock : LOCK_CLUSTER_CONCURRENT;
if (stmt->relation != NULL)
{
/* This is the single-relation case. */
Oid tableOid;
- /*
- * Find, lock, and check permissions on the table. We obtain
- * AccessExclusiveLock right away to avoid lock-upgrade hazard in the
- * single-transaction case.
- */
+ /* Find, lock, and check permissions on the table. */
tableOid = RangeVarGetRelidExtended(stmt->relation,
- AccessExclusiveLock,
+ lockmode,
0,
RangeVarCallbackMaintainsTable,
NULL);
@@ -198,7 +391,7 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
* Do the job. (The function will close the relation, lock is kept
* till commit.)
*/
- cluster_rel(rel, indexOid, ¶ms);
+ cluster_rel(rel, indexOid, ¶ms, isTopLevel);
return;
}
@@ -237,7 +430,7 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
rtcs = get_tables_to_cluster_partitioned(cluster_context, indexOid);
/* close relation, releasing lock on parent table */
- table_close(rel, AccessExclusiveLock);
+ table_close(rel, lockmode);
}
else
{
@@ -246,7 +439,7 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
}
/* Do the job. */
- cluster_multiple_rels(rtcs, ¶ms);
+ cluster_multiple_rels(rtcs, ¶ms, lockmode, isTopLevel);
/* Start a new transaction for the cleanup work. */
StartTransactionCommand();
@@ -263,7 +456,8 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
* return.
*/
static void
-cluster_multiple_rels(List *rtcs, ClusterParams *params)
+cluster_multiple_rels(List *rtcs, ClusterParams *params, LOCKMODE lockmode,
+ bool isTopLevel)
{
ListCell *lc;
@@ -283,13 +477,19 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
/* functions in indexes may want a snapshot set */
PushActiveSnapshot(GetTransactionSnapshot());
- rel = table_open(rtc->tableOid, AccessExclusiveLock);
+ rel = table_open(rtc->tableOid, lockmode);
- /*
- * Do the job. (The function will close the relation, lock is kept
- * till commit.)
- */
- cluster_rel(rel, rtc->indexOid, params);
+ /* Not all relations can be processed in concurrent mode. */
+ if ((params->options & CLUOPT_CONCURRENT) == 0 ||
+ check_relation_is_clusterable_concurrently(rel, DEBUG1,
+ "CLUSTER (CONCURRENTLY)"))
+ {
+ /*
+ * Do the job. (The function will close the relation; the lock is
+ * kept till commit.)
+ */
+ cluster_rel(rel, rtc->indexOid, params, isTopLevel);
+ }
+ else
+ table_close(rel, lockmode);
PopActiveSnapshot();
CommitTransactionCommand();
@@ -313,10 +513,21 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
* instead of index order. This is the new implementation of VACUUM FULL,
* and error messages should refer to the operation as VACUUM not CLUSTER.
*
- * We expect that OldHeap is already locked in AccessExclusiveLock mode.
+ * We expect that OldHeap is already locked. The lock mode is
+ * AccessExclusiveLock for normal processing and LOCK_CLUSTER_CONCURRENT for
+ * concurrent processing (so that SELECT, INSERT, UPDATE and DELETE commands
+ * work, but cluster_rel() cannot be called concurrently for the same
+ * relation).
+ *
+ * Note that, in the concurrent case, the function releases the lock at some
+ * point, in order to get AccessExclusiveLock for the final steps (i.e. to
+ * swap the relation files). To make things simpler, the caller should expect
+ * OldHeap to be closed on return, regardless of CLUOPT_CONCURRENT. (The
+ * AccessExclusiveLock is kept till the end of the transaction.)
*/
void
-cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
+cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
+ bool isTopLevel)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid save_userid;
@@ -325,6 +536,41 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
bool verbose = ((params->options & CLUOPT_VERBOSE) != 0);
bool recheck = ((params->options & CLUOPT_RECHECK) != 0);
Relation index = NULL;
+ bool concurrent = ((params->options & CLUOPT_CONCURRENT) != 0);
+ LOCKMODE lmode;
+ bool entered, success;
+
+ /* Check that the correct lock is held. */
+ lmode = !concurrent ? AccessExclusiveLock : LOCK_CLUSTER_CONCURRENT;
+
+ /*
+ * Skip the relation if it's being processed concurrently. In such a case,
+ * we cannot rely on a lock because the other backend needs to release it
+ * temporarily at some point.
+ *
+ * This check should not take place until we have a lock that prevents
+ * another backend from starting VACUUM FULL / CLUSTER CONCURRENTLY after
+ * our check.
+ */
+ Assert(CheckRelationLockedByMe(OldHeap, lmode, false));
+ if (is_concurrent_cluster_in_progress(tableOid))
+ {
+ ereport(NOTICE,
+ (errmsg(CLUSTER_IN_PROGRESS_MESSAGE,
+ RelationGetRelationName(OldHeap))));
+ table_close(OldHeap, lmode);
+ return;
+ }
+
+ /* There are specific requirements on concurrent processing. */
+ if (concurrent)
+ {
+ check_concurrent_cluster_requirements(OldHeap, isTopLevel,
+ OidIsValid(indexOid));
+
+ check_relation_is_clusterable_concurrently(OldHeap, ERROR,
+ "CLUSTER (CONCURRENTLY)");
+ }
/* Check for user-requested abort. */
CHECK_FOR_INTERRUPTS();
@@ -361,7 +607,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
/* Check that the user still has privileges for the relation */
if (!cluster_is_permitted_for_relation(tableOid, save_userid))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
goto out;
}
@@ -376,7 +622,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
*/
if (RELATION_IS_OTHER_TEMP(OldHeap))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
goto out;
}
@@ -387,7 +633,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
*/
if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(indexOid)))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
goto out;
}
@@ -398,7 +644,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
if ((params->options & CLUOPT_RECHECK_ISCLUSTERED) != 0 &&
!get_index_isclustered(indexOid))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
goto out;
}
}
@@ -414,6 +660,11 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("cannot cluster a shared catalog")));
+ /*
+ * The CONCURRENT case should have been rejected earlier because it does
+ * not support system catalogs.
+ */
+ Assert(!(OldHeap->rd_rel->relisshared && concurrent));
/*
* Don't process temp tables of other backends ... their local buffer
@@ -440,7 +691,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
/* Check heap and index are valid to cluster on */
if (OidIsValid(indexOid))
{
- check_index_is_clusterable(OldHeap, indexOid, AccessExclusiveLock);
+ check_index_is_clusterable(OldHeap, indexOid, lmode);
/* Open the index (It should already be locked.) */
index = index_open(indexOid, NoLock);
}
@@ -455,7 +706,8 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
if (OldHeap->rd_rel->relkind == RELKIND_MATVIEW &&
!RelationIsPopulated(OldHeap))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ index_close(index, lmode);
+ relation_close(OldHeap, lmode);
goto out;
}
@@ -468,11 +720,42 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
* invalid, because we move tuples around. Promote them to relation
* locks. Predicate locks on indexes will be promoted when they are
* reindexed.
+ *
+ * During concurrent processing, the heap as well as its indexes stay in
+ * operation, so we postpone this step until they are locked using
+ * AccessExclusiveLock near the end of the processing.
*/
- TransferPredicateLocksToHeapRelation(OldHeap);
+ if (!concurrent)
+ TransferPredicateLocksToHeapRelation(OldHeap);
/* rebuild_relation does all the dirty work */
- rebuild_relation(OldHeap, index, verbose);
+ entered = false;
+ success = false;
+ PG_TRY();
+ {
+ /*
+ * For concurrent processing, make sure other transactions treat this
+ * table as if it was a system / user catalog, and WAL the relevant
+ * additional information. ERROR is raised if another backend is
+ * processing the same table.
+ */
+ if (concurrent)
+ {
+ Relation *index_p = index ? &index : NULL;
+
+ begin_concurrent_cluster(&OldHeap, index_p, &entered);
+ }
+
+ rebuild_relation(OldHeap, index, verbose,
+ (params->options & CLUOPT_CONCURRENT) != 0);
+ success = true;
+ }
+ PG_FINALLY();
+ {
+ if (concurrent && entered)
+ end_concurrent_cluster(tableOid, !success);
+ }
+ PG_END_TRY();
/*
* NB: rebuild_relation does table_close() on OldHeap, and also on index,
@@ -622,18 +905,100 @@ mark_index_clustered(Relation rel, Oid indexOid, bool is_internal)
table_close(pg_index, RowExclusiveLock);
}
+/*
+ * Check if the CONCURRENTLY option is legal for the relation.
+ */
+bool
+check_relation_is_clusterable_concurrently(Relation rel, int elevel,
+ const char *stmt)
+{
+ char relpersistence, replident;
+ Oid ident_idx;
+
+ /* Data changes in system relations are not logically decoded. */
+ if (IsCatalogRelation(rel))
+ {
+ ereport(elevel,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot process relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("%s is not supported for catalog relations.", stmt)));
+ return false;
+ }
+
+ if (IsToastRelation(rel))
+ {
+ ereport(elevel,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot process relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("%s is not supported for TOAST relations, unless the main relation is processed too.",
+ stmt)));
+ return false;
+ }
+
+ relpersistence = rel->rd_rel->relpersistence;
+ if (relpersistence != RELPERSISTENCE_PERMANENT)
+ {
+ ereport(elevel,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot process relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("%s is only allowed for permanent relations.",
+ stmt)));
+ return false;
+ }
+
+ /* With NOTHING, WAL does not contain the old tuple. */
+ replident = rel->rd_rel->relreplident;
+ if (replident == REPLICA_IDENTITY_NOTHING)
+ {
+ ereport(elevel,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot process relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("Relation \"%s\" has insufficient replication identity.",
+ RelationGetRelationName(rel))));
+ return false;
+ }
+
+ /*
+ * Identity index is not set if the replica identity is FULL, but PK might
+ * exist in such a case.
+ */
+ ident_idx = RelationGetReplicaIndex(rel);
+ if (!OidIsValid(ident_idx) && OidIsValid(rel->rd_pkindex))
+ ident_idx = rel->rd_pkindex;
+ if (!OidIsValid(ident_idx))
+ {
+ ereport(elevel,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot process relation \"%s\"",
+ RelationGetRelationName(rel)),
+ (errhint("Relation \"%s\" has no identity index.",
+ RelationGetRelationName(rel)))));
+ return false;
+ }
+
+ return true;
+}
+
/*
* rebuild_relation: rebuild an existing relation in index or physical order
*
- * OldHeap: table to rebuild --- must be opened and exclusive-locked!
+ * OldHeap: table to rebuild --- must be opened and locked. See cluster_rel()
+ * for comments on the required lock strength.
+ *
* index: index to cluster by, or NULL to rewrite in physical order. Must be
* opened and locked.
*
* On exit, the heap (and also the index, if one was passed) are closed, but
- * still locked with AccessExclusiveLock.
+ * still locked with AccessExclusiveLock. (The function handles the lock
+ * upgrade if 'concurrent' is true.)
*/
static void
-rebuild_relation(Relation OldHeap, Relation index, bool verbose)
+rebuild_relation(Relation OldHeap, Relation index, bool verbose,
+ bool concurrent)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid indexOid = index ? RelationGetRelid(index) : InvalidOid;
@@ -642,11 +1007,76 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose)
Oid OIDNewHeap;
Relation NewHeap;
char relpersistence;
- bool is_system_catalog;
bool swap_toast_by_content;
TransactionId frozenXid;
MultiXactId cutoffMulti;
- LOCKMODE lockmode_new;
+ NameData slotname;
+ LogicalDecodingContext *ctx = NULL;
+ Snapshot snapshot = NULL;
+ CatalogState *cat_state = NULL;
+ LOCKMODE lmode_new;
+
+ if (concurrent)
+ {
+ TupleDesc tupdesc;
+ RelReopenInfo rri[2];
+ int nrel;
+
+ /*
+ * CLUSTER CONCURRENTLY is not allowed in a transaction block, so this
+ * should never fire.
+ */
+ Assert(GetTopTransactionIdIfAny() == InvalidTransactionId);
+
+ /*
+ * A single backend should not execute multiple CLUSTER commands at a
+ * time, so use PID to make the slot unique.
+ */
+ snprintf(NameStr(slotname), NAMEDATALEN, "cluster_%d", MyProcPid);
+
+ /*
+ * Gather catalog information so that we can check later if the old
+ * relation has not changed while unlocked.
+ *
+ * Since this function also checks if the relation can be processed,
+ * it's important to call it before we set up the logical decoding,
+ * because that can take some time. Not sure if it's necessary to do
+ * it even earlier.
+ */
+ cat_state = get_catalog_state(OldHeap);
+
+ tupdesc = CreateTupleDescCopy(RelationGetDescr(OldHeap));
+
+ /*
+ * Unlock the relation (and possibly the clustering index) to avoid
+ * deadlock because setup_logical_decoding() will wait for all the
+ * running transactions (with XID assigned) to finish. Some of those
+ * transactions might be waiting for a lock on our relation.
+ */
+ nrel = 0;
+ init_rel_reopen_info(&rri[nrel++], &OldHeap, InvalidOid,
+ LOCK_CLUSTER_CONCURRENT,
+ LOCK_CLUSTER_CONCURRENT);
+ if (index)
+ init_rel_reopen_info(&rri[nrel++], &index, InvalidOid,
+ LOCK_CLUSTER_CONCURRENT,
+ LOCK_CLUSTER_CONCURRENT);
+ unlock_and_close_relations(rri, nrel);
+
+ /* Prepare to capture the concurrent data changes. */
+ ctx = setup_logical_decoding(tableOid, NameStr(slotname), tupdesc);
+
+ /* Lock the table (and index) again. */
+ reopen_relations(rri, nrel);
+
+ /*
+ * Check if the 'tupdesc' could have changed while the relation was
+ * unlocked.
+ */
+ check_catalog_changes(OldHeap, cat_state);
+
+ snapshot = SnapBuildInitialSnapshotForCluster(ctx->snapshot_builder);
+ }
if (OidIsValid(indexOid))
/* Mark the correct index as clustered */
@@ -654,7 +1084,6 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose)
/* Remember info about rel before closing OldHeap */
relpersistence = OldHeap->rd_rel->relpersistence;
- is_system_catalog = IsSystemRelation(OldHeap);
/*
* Create the transient table that will receive the re-ordered data.
@@ -662,42 +1091,63 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose)
* NoLock for the old heap because we already have it locked and want to
* keep unlocking straightforward.
*/
- lockmode_new = AccessExclusiveLock;
+ lmode_new = AccessExclusiveLock;
OIDNewHeap = make_new_heap(tableOid, tableSpace,
accessMethod,
relpersistence,
- NoLock, &lockmode_new);
- Assert(lockmode_new == AccessExclusiveLock || lockmode_new == NoLock);
+ NoLock, &lmode_new);
+ Assert(lmode_new == AccessExclusiveLock || lmode_new == NoLock);
/* Lock iff not done above. */
- NewHeap = table_open(OIDNewHeap, lockmode_new == NoLock ?
+ NewHeap = table_open(OIDNewHeap, lmode_new == NoLock ?
AccessExclusiveLock : NoLock);
/* Copy the heap data into the new table in the desired order */
- copy_table_data(NewHeap, OldHeap, index, verbose,
+ copy_table_data(NewHeap, OldHeap, index, snapshot, ctx, verbose,
&swap_toast_by_content, &frozenXid, &cutoffMulti);
+ if (concurrent)
+ {
+ rebuild_relation_finish_concurrent(NewHeap, OldHeap, index,
+ cat_state, ctx,
+ swap_toast_by_content,
+ frozenXid, cutoffMulti);
+
+ pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
+ PROGRESS_CLUSTER_PHASE_FINAL_CLEANUP);
+
+ /* Done with decoding. */
+ FreeSnapshot(snapshot);
+ free_catalog_state(cat_state);
+ cleanup_logical_decoding(ctx);
+ ReplicationSlotRelease();
+ ReplicationSlotDrop(NameStr(slotname), false);
+ }
+ else
+ {
+ bool is_system_catalog = IsSystemRelation(OldHeap);
- /* Close relcache entries, but keep lock until transaction commit */
- table_close(OldHeap, NoLock);
- if (index)
- index_close(index, NoLock);
+ /* Close relcache entries, but keep lock until transaction commit */
+ table_close(OldHeap, NoLock);
+ if (index)
+ index_close(index, NoLock);
- /*
- * Close the new relation so it can be dropped as soon as the storage is
- * swapped. The relation is not visible to others, so we could unlock it
- * completely, but it's simpler to pass NoLock than to track all the locks
- * acquired so far.
- */
- table_close(NewHeap, NoLock);
+ /*
+ * Close the new relation so it can be dropped as soon as the storage
+ * is swapped. The relation is not visible to others, so we could
+ * unlock it completely, but it's simpler to pass NoLock than to track
+ * all the locks acquired so far.
+ */
+ table_close(NewHeap, NoLock);
- /*
- * Swap the physical files of the target and transient tables, then
- * rebuild the target's indexes and throw away the transient table.
- */
- finish_heap_swap(tableOid, OIDNewHeap, is_system_catalog,
- swap_toast_by_content, false, true,
- frozenXid, cutoffMulti,
- relpersistence);
+ /*
+ * Swap the physical files of the target and transient tables, then
+ * rebuild the target's indexes and throw away the transient table.
+ */
+ finish_heap_swap(tableOid, OIDNewHeap, is_system_catalog,
+ swap_toast_by_content, false, true, true,
+ frozenXid, cutoffMulti,
+ relpersistence);
+ }
}
@@ -848,15 +1298,19 @@ make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
/*
* Do the physical copying of table data.
*
+ * 'snapshot' and 'decoding_ctx': see table_relation_copy_for_cluster(). Pass
+ * them iff concurrent processing is required.
+ *
* There are three output parameters:
* *pSwapToastByContent is set true if toast tables must be swapped by content.
* *pFreezeXid receives the TransactionId used as freeze cutoff point.
* *pCutoffMulti receives the MultiXactId used as a cutoff point.
*/
static void
-copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verbose,
- bool *pSwapToastByContent, TransactionId *pFreezeXid,
- MultiXactId *pCutoffMulti)
+copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
+ Snapshot snapshot, LogicalDecodingContext *decoding_ctx,
+ bool verbose, bool *pSwapToastByContent,
+ TransactionId *pFreezeXid, MultiXactId *pCutoffMulti)
{
Oid OIDOldHeap = RelationGetRelid(OldHeap);
Oid OIDOldIndex = OldIndex ? RelationGetRelid(OldIndex) : InvalidOid;
@@ -876,6 +1330,7 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
int elevel = verbose ? INFO : DEBUG2;
PGRUsage ru0;
char *nspname;
+ bool concurrent = snapshot != NULL;
pg_rusage_init(&ru0);
@@ -902,8 +1357,12 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
*
* We don't need to open the toast relation here, just lock it. The lock
* will be held till end of transaction.
+ *
+ * In the CONCURRENT case, the lock does not help because we need to
+ * release it temporarily at some point. Instead, we expect VACUUM /
+ * CLUSTER to skip tables which are present in ClusteredRelsHash.
*/
- if (OldHeap->rd_rel->reltoastrelid)
+ if (OldHeap->rd_rel->reltoastrelid && !concurrent)
LockRelationOid(OldHeap->rd_rel->reltoastrelid, AccessExclusiveLock);
/*
@@ -979,7 +1438,45 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
* provided, else plain seqscan.
*/
if (OldIndex != NULL && OldIndex->rd_rel->relam == BTREE_AM_OID)
+ {
+ ResourceOwner oldowner = CurrentResourceOwner;
+
+ /*
+ * In the CONCURRENT case, do the planning in a subtransaction so that
+ * we don't leave any additional locks behind us that we cannot
+ * release easily.
+ */
+ if (concurrent)
+ {
+ Assert(CheckRelationLockedByMe(OldHeap, LOCK_CLUSTER_CONCURRENT,
+ false));
+ Assert(CheckRelationLockedByMe(OldIndex, LOCK_CLUSTER_CONCURRENT,
+ false));
+ BeginInternalSubTransaction("plan_cluster_use_sort");
+ }
+
use_sort = plan_cluster_use_sort(OIDOldHeap, OIDOldIndex);
+
+ if (concurrent)
+ {
+ PgBackendProgress progress;
+
+ /*
+ * Command progress reporting gets terminated at subtransaction
+ * end. Save the status so it can be eventually restored.
+ */
+ memcpy(&progress, &MyBEEntry->st_progress,
+ sizeof(PgBackendProgress));
+
+ /* Release the locks by aborting the subtransaction. */
+ RollbackAndReleaseCurrentSubTransaction();
+
+ /* Restore the progress reporting status. */
+ pgstat_progress_restore_state(&progress);
+
+ CurrentResourceOwner = oldowner;
+ }
+ }
else
use_sort = false;
@@ -1008,7 +1505,9 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
* values (e.g. because the AM doesn't use freezing).
*/
table_relation_copy_for_cluster(OldHeap, NewHeap, OldIndex, use_sort,
- cutoffs.OldestXmin, &cutoffs.FreezeLimit,
+ cutoffs.OldestXmin, snapshot,
+ decoding_ctx,
+ &cutoffs.FreezeLimit,
&cutoffs.MultiXactCutoff,
&num_tuples, &tups_vacuumed,
&tups_recently_dead);
@@ -1017,7 +1516,11 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
*pFreezeXid = cutoffs.FreezeLimit;
*pCutoffMulti = cutoffs.MultiXactCutoff;
- /* Reset rd_toastoid just to be tidy --- it shouldn't be looked at again */
+ /*
+ * Reset rd_toastoid just to be tidy --- it shouldn't be looked at
+ * again. In the CONCURRENT case, we need to set it again before applying
+ * the concurrent changes.
+ */
NewHeap->rd_toastoid = InvalidOid;
num_pages = RelationGetNumberOfBlocks(NewHeap);
@@ -1468,14 +1971,13 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
bool swap_toast_by_content,
bool check_constraints,
bool is_internal,
+ bool reindex,
TransactionId frozenXid,
MultiXactId cutoffMulti,
char newrelpersistence)
{
ObjectAddress object;
Oid mapped_tables[4];
- int reindex_flags;
- ReindexParams reindex_params = {0};
int i;
/* Report that we are now swapping relation files */
@@ -1501,39 +2003,46 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
if (is_system_catalog)
CacheInvalidateCatalog(OIDOldHeap);
- /*
- * Rebuild each index on the relation (but not the toast table, which is
- * all-new at this point). It is important to do this before the DROP
- * step because if we are processing a system catalog that will be used
- * during DROP, we want to have its indexes available. There is no
- * advantage to the other order anyway because this is all transactional,
- * so no chance to reclaim disk space before commit. We do not need a
- * final CommandCounterIncrement() because reindex_relation does it.
- *
- * Note: because index_build is called via reindex_relation, it will never
- * set indcheckxmin true for the indexes. This is OK even though in some
- * sense we are building new indexes rather than rebuilding existing ones,
- * because the new heap won't contain any HOT chains at all, let alone
- * broken ones, so it can't be necessary to set indcheckxmin.
- */
- reindex_flags = REINDEX_REL_SUPPRESS_INDEX_USE;
- if (check_constraints)
- reindex_flags |= REINDEX_REL_CHECK_CONSTRAINTS;
+ if (reindex)
+ {
+ int reindex_flags;
+ ReindexParams reindex_params = {0};
- /*
- * Ensure that the indexes have the same persistence as the parent
- * relation.
- */
- if (newrelpersistence == RELPERSISTENCE_UNLOGGED)
- reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
- else if (newrelpersistence == RELPERSISTENCE_PERMANENT)
- reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+ /*
+ * Rebuild each index on the relation (but not the toast table, which
+ * is all-new at this point). It is important to do this before the
+ * DROP step because if we are processing a system catalog that will
+ * be used during DROP, we want to have its indexes available. There
+ * is no advantage to the other order anyway because this is all
+ * transactional, so no chance to reclaim disk space before commit.
+ * We do not need a final CommandCounterIncrement() because
+ * reindex_relation does it.
+ *
+ * Note: because index_build is called via reindex_relation, it will never
+ * set indcheckxmin true for the indexes. This is OK even though in some
+ * sense we are building new indexes rather than rebuilding existing ones,
+ * because the new heap won't contain any HOT chains at all, let alone
+ * broken ones, so it can't be necessary to set indcheckxmin.
+ */
+ reindex_flags = REINDEX_REL_SUPPRESS_INDEX_USE;
+ if (check_constraints)
+ reindex_flags |= REINDEX_REL_CHECK_CONSTRAINTS;
- /* Report that we are now reindexing relations */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_REBUILD_INDEX);
+ /*
+ * Ensure that the indexes have the same persistence as the parent
+ * relation.
+ */
+ if (newrelpersistence == RELPERSISTENCE_UNLOGGED)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
+ else if (newrelpersistence == RELPERSISTENCE_PERMANENT)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
- reindex_relation(NULL, OIDOldHeap, reindex_flags, &reindex_params);
+ /* Report that we are now reindexing relations */
+ pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
+ PROGRESS_CLUSTER_PHASE_REBUILD_INDEX);
+
+ reindex_relation(NULL, OIDOldHeap, reindex_flags, &reindex_params);
+ }
/* Report that we are now doing clean up */
pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
@@ -1773,3 +2282,1884 @@ cluster_is_permitted_for_relation(Oid relid, Oid userid)
get_rel_name(relid))));
return false;
}
+
+#define REPL_PLUGIN_NAME "pgoutput_cluster"
+
+/*
+ * Each relation being processed by CLUSTER CONCURRENTLY must be in the
+ * clusteredRels hashtable.
+ */
+typedef struct ClusteredRel
+{
+ Oid relid;
+ Oid dbid;
+} ClusteredRel;
+
+static HTAB *ClusteredRelsHash = NULL;
+
+/* Maximum number of entries in the hashtable. */
+static int maxClusteredRels = 0;
+
+Size
+ClusterShmemSize(void)
+{
+ /*
+ * A replication slot is needed for the processing, so use this GUC to
+ * allocate memory for the hashtable. Also reserve space for TOAST
+ * relations.
+ */
+ maxClusteredRels = max_replication_slots * 2;
+
+ return hash_estimate_size(maxClusteredRels, sizeof(ClusteredRel));
+}
+
+void
+ClusterShmemInit(void)
+{
+ HASHCTL info;
+
+ info.keysize = sizeof(ClusteredRel);
+ info.entrysize = info.keysize;
+
+ ClusteredRelsHash = ShmemInitHash("Clustered Relations",
+ maxClusteredRels,
+ maxClusteredRels,
+ &info,
+ HASH_ELEM | HASH_BLOBS);
+}
+
+/*
+ * Perform a preliminary check whether CLUSTER / VACUUM FULL CONCURRENTLY is
+ * possible. Note that here we only check things that should not change if we
+ * release the relation lock temporarily. The information that can change due
+ * to unlocking is checked in get_catalog_state().
+ */
+static void
+check_concurrent_cluster_requirements(Relation rel, bool isTopLevel,
+ bool isCluster)
+{
+ const char *stmt;
+
+ if (isCluster)
+ stmt = "CLUSTER (CONCURRENTLY)";
+ else
+ stmt = "VACUUM (FULL, CONCURRENTLY)";
+
+ /*
+ * Make sure we have no XID assigned; otherwise the call of
+ * setup_logical_decoding() can cause a deadlock.
+ */
+ PreventInTransactionBlock(isTopLevel, stmt);
+
+ CheckSlotPermissions();
+
+ /*
+ * Use an existing function to check if we can use logical
+ * decoding. However, note that RecoveryInProgress() should already have
+ * raised an error, as it does for the non-concurrent VACUUM FULL / CLUSTER.
+ */
+ CheckLogicalDecodingRequirements();
+
+ /* See ClusterShmemSize() */
+ if (max_replication_slots < 2)
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ (errmsg("%s requires \"max_replication_slots\" to be at least 2",
+ stmt)));
+}
+
+/*
+ * Call this function before CLUSTER CONCURRENTLY starts to set up logical
+ * decoding. It makes sure that other users of the table put enough
+ * information into WAL.
+ *
+ * The point is that on various places we expect that the table we're
+ * processing is treated like a system catalog. For example, we need to be
+ * able to scan it using a "historic snapshot" anytime during the processing
+ * (as opposed to scanning only at the start point of the decoding, as
+ * logical replication does during initial table synchronization), in order to apply
+ * concurrent UPDATE / DELETE commands.
+ *
+ * Since we need to close and reopen the relation here, the 'rel_p' and
+ * 'index_p' arguments are in/out.
+ *
+ * 'enter_p' receives a bool value telling whether relation OID was entered
+ * into the hashtable or not.
+ */
+static void
+begin_concurrent_cluster(Relation *rel_p, Relation *index_p,
+ bool *entered_p)
+{
+ Relation rel = *rel_p;
+ Oid relid, toastrelid;
+ ClusteredRel key, *entry;
+ bool found;
+ RelReopenInfo rri[2];
+ int nrel;
+ static bool before_shmem_exit_callback_setup = false;
+
+ relid = RelationGetRelid(rel);
+
+ /*
+ * Make sure that we do not leave an entry in ClusteredRelsHash if exiting
+ * due to FATAL.
+ */
+ if (!before_shmem_exit_callback_setup)
+ {
+ before_shmem_exit(cluster_before_shmem_exit_callback, 0);
+ before_shmem_exit_callback_setup = true;
+ }
+
+ memset(&key, 0, sizeof(key));
+ key.relid = relid;
+ key.dbid = MyDatabaseId;
+
+ *entered_p = false;
+ LWLockAcquire(ClusteredRelsLock, LW_EXCLUSIVE);
+ entry = (ClusteredRel *)
+ hash_search(ClusteredRelsHash, &key, HASH_ENTER_NULL, &found);
+ if (found)
+ {
+ /*
+ * Since CLUSTER CONCURRENTLY takes ShareRowExclusiveLock, a conflict
+ * should occur much earlier. However, that lock may be released
+ * temporarily, see below. Anyway, we should complain whatever the
+ * reason for the conflict might be.
+ */
+ ereport(ERROR,
+ (errmsg(CLUSTER_IN_PROGRESS_MESSAGE,
+ RelationGetRelationName(rel))));
+ }
+ if (entry == NULL)
+ ereport(ERROR,
+ (errmsg("too many requests for CLUSTER CONCURRENTLY at a time")),
+ (errhint("Please consider increasing the \"max_replication_slots\" configuration parameter.")));
+
+ /*
+ * Even if the insertion of TOAST relid should fail below, the caller has
+ * to do cleanup.
+ */
+ *entered_p = true;
+
+ /*
+ * Enable the callback to remove the entry in case of exit. We should not
+ * do this earlier, otherwise an attempt to insert already existing entry
+ * could make us remove that entry (inserted by another backend) during
+ * ERROR handling.
+ */
+ Assert(!OidIsValid(clustered_rel));
+ clustered_rel = relid;
+
+ /*
+ * TOAST relation is not accessed using historic snapshot, but we enter it
+ * here to protect it from being VACUUMed by another backend. (A lock does
+ * not help in the CONCURRENT case because we cannot hold it continuously
+ * till the end of the transaction.) See the comments on locking TOAST
+ * relation in copy_table_data().
+ */
+ toastrelid = rel->rd_rel->reltoastrelid;
+ if (OidIsValid(toastrelid))
+ {
+ key.relid = toastrelid;
+ entry = (ClusteredRel *)
+ hash_search(ClusteredRelsHash, &key, HASH_ENTER_NULL, &found);
+ if (found)
+ /*
+ * If we could enter the main relation, entering the TOAST
+ * relation should succeed too. Nevertheless, check.
+ */
+ ereport(ERROR,
+ (errmsg("TOAST relation of \"%s\" is already being processed by CLUSTER CONCURRENTLY",
+ RelationGetRelationName(rel))));
+ if (entry == NULL)
+ ereport(ERROR,
+ (errmsg("too many requests for CLUSTER CONCURRENTLY at a time")),
+ (errhint("Please consider increasing the \"max_replication_slots\" configuration parameter.")));
+
+ Assert(!OidIsValid(clustered_rel_toast));
+ clustered_rel_toast = toastrelid;
+ }
+ LWLockRelease(ClusteredRelsLock);
+
+ /*
+ * Make sure that other backends are aware of the new hash entry.
+ *
+ * Besides sending the invalidation message, we need to force re-opening
+ * of the relation, which includes the actual invalidation (and thus
+ * checking of our hashtable on the next access).
+ */
+ CacheInvalidateRelcacheImmediate(rel);
+ /*
+ * Since the hashtable only needs to be checked by write transactions,
+ * lock the relation in a mode that conflicts with any DML command. (The
+ * reading transactions are supposed to close the relation before opening
+ * it with higher lock.) Once we have the relation (and its index) locked,
+ * we unlock it immediately and then re-lock using the original mode.
+ */
+ nrel = 0;
+ init_rel_reopen_info(&rri[nrel++], rel_p, InvalidOid,
+ LOCK_CLUSTER_CONCURRENT, ShareLock);
+ if (index_p)
+ {
+ /*
+ * Another transaction might want to open both the relation and the
+ * index. If it already has the relation lock and is waiting for the
+ * index lock, we should release the index lock, otherwise our request
+ * for ShareLock on the relation can end up in a deadlock.
+ */
+ init_rel_reopen_info(&rri[nrel++], index_p, InvalidOid,
+ LOCK_CLUSTER_CONCURRENT, ShareLock);
+ }
+ unlock_and_close_relations(rri, nrel);
+ /*
+ * XXX It's not strictly necessary to lock the index here, but it's
+ * probably not worth teaching the "reopen API" about this special case.
+ */
+ reopen_relations(rri, nrel);
+
+ /* Switch back to the original lock. */
+ nrel = 0;
+ init_rel_reopen_info(&rri[nrel++], rel_p, InvalidOid,
+ ShareLock, LOCK_CLUSTER_CONCURRENT);
+ if (index_p)
+ init_rel_reopen_info(&rri[nrel++], index_p, InvalidOid,
+ ShareLock, LOCK_CLUSTER_CONCURRENT);
+ unlock_and_close_relations(rri, nrel);
+ reopen_relations(rri, nrel);
+
+ /* Avoid logical decoding of other relations. */
+ clustered_rel_locator = rel->rd_locator;
+ if (OidIsValid(toastrelid))
+ {
+ Relation toastrel;
+
+ /* Avoid logical decoding of other TOAST relations. */
+ toastrel = table_open(toastrelid, AccessShareLock);
+ clustered_rel_toast_locator = toastrel->rd_locator;
+ table_close(toastrel, AccessShareLock);
+ }
+}
+
+/*
+ * Call this when done with CLUSTER CONCURRENTLY.
+ *
+ * 'error' tells whether the function is being called in order to handle
+ * error.
+ */
+static void
+end_concurrent_cluster(Oid relid, bool error)
+{
+ ClusteredRel key, *entry, *entry_toast = NULL;
+
+ /* Remove the relation from the hash. */
+ memset(&key, 0, sizeof(key));
+ key.relid = relid;
+ key.dbid = MyDatabaseId;
+ LWLockAcquire(ClusteredRelsLock, LW_EXCLUSIVE);
+ entry = hash_search(ClusteredRelsHash, &key, HASH_REMOVE, NULL);
+
+ /* Disable cluster_before_shmem_exit_callback(). */
+ if (OidIsValid(clustered_rel))
+ clustered_rel = InvalidOid;
+
+ /* Remove the TOAST relation if there is one. */
+ if (OidIsValid(clustered_rel_toast))
+ {
+ key.relid = clustered_rel_toast;
+ entry_toast = hash_search(ClusteredRelsHash, &key, HASH_REMOVE,
+ NULL);
+
+ clustered_rel_toast = InvalidOid;
+ }
+ else
+ key.relid = InvalidOid;
+ LWLockRelease(ClusteredRelsLock);
+
+ /* Restore normal function of logical decoding. */
+ clustered_rel_locator.relNumber = InvalidOid;
+ clustered_rel_toast_locator.relNumber = InvalidOid;
+
+ /*
+ * On normal completion (!error), we should not really fail to remove the
+ * entry. But if we did for any reason, make sure the transaction is
+ * aborted: if other transactions, while changing the contents of the
+ * relation, didn't know that CLUSTER CONCURRENTLY was in progress, they
+ * might not have written enough information to WAL, and thus we could
+ * have produced inconsistent table contents.
+ *
+ * On the other hand, if we are already handling an error, there's no
+ * reason to worry about inconsistent contents of the new storage because
+ * the transaction is going to be rolled back anyway. Furthermore, by
+ * raising ERROR here we'd shadow the original error.
+ */
+ if (!error)
+ {
+ char *relname;
+
+ if (entry == NULL)
+ {
+ relname = get_rel_name(relid);
+ if (!relname)
+ ereport(ERROR,
+ (errmsg("cache lookup failed for relation %u",
+ relid)));
+
+ ereport(ERROR,
+ (errmsg("relation \"%s\" not found among clustered relations",
+ relname)));
+ }
+
+ /*
+ * Missing TOAST relation indicates that it could have been VACUUMed
+ * or CLUSTERed by another backend while we did not hold a lock on it.
+ */
+ if (entry_toast == NULL && OidIsValid(key.relid))
+ {
+ relname = get_rel_name(key.relid);
+ if (!relname)
+ ereport(ERROR,
+ (errmsg("cache lookup failed for relation %u",
+ key.relid)));
+
+ ereport(ERROR,
+ (errmsg("relation \"%s\" not found among clustered relations",
+ relname)));
+ }
+ }
+
+ /*
+ * Note: unlike begin_concurrent_cluster(), here we do not lock/unlock the
+ * relation: 1) On normal completion, the caller is already holding
+ * AccessExclusiveLock (till the end of the transaction), 2) on ERROR /
+ * FATAL, we try to do the cleanup asap, but the worst case is that other
+ * backends will write unnecessary information to WAL until they close the
+ * relation.
+ */
+}
+
+/*
+ * A wrapper to call end_concurrent_cluster() as a before_shmem_exit callback.
+ */
+static void
+cluster_before_shmem_exit_callback(int code, Datum arg)
+{
+ if (OidIsValid(clustered_rel) || OidIsValid(clustered_rel_toast))
+ end_concurrent_cluster(clustered_rel, true);
+}
+
+/*
+ * Check if relation is currently being processed by CLUSTER CONCURRENTLY.
+ */
+bool
+is_concurrent_cluster_in_progress(Oid relid)
+{
+ ClusteredRel key, *entry;
+
+ memset(&key, 0, sizeof(key));
+ key.relid = relid;
+ key.dbid = MyDatabaseId;
+
+ LWLockAcquire(ClusteredRelsLock, LW_SHARED);
+ entry = (ClusteredRel *)
+ hash_search(ClusteredRelsHash, &key, HASH_FIND, NULL);
+ LWLockRelease(ClusteredRelsLock);
+
+ return entry != NULL;
+}
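As a rough, self-contained illustration of the registry logic above (not the actual shared-memory code; the patch uses a dynahash table in shared memory guarded by ClusteredRelsLock, and all names below are invented for the sketch), the (dbid, relid) bookkeeping can be modeled like this:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the patch's ClusteredRel hash key. */
typedef struct
{
	uint32_t dbid;
	uint32_t relid;
	bool in_use;
} ClusteredRelSlot;

/* Stand-in for max_replication_slots * 2 (see ClusterShmemSize). */
#define MAX_CLUSTERED 8

static ClusteredRelSlot slots[MAX_CLUSTERED];

/*
 * Register a relation as being clustered.  Returns false if it is already
 * registered (another backend is processing it) or the table is full,
 * mirroring the "found" / HASH_ENTER_NULL checks in
 * begin_concurrent_cluster().
 */
static bool
enter_clustered(uint32_t dbid, uint32_t relid)
{
	int free_slot = -1;

	for (int i = 0; i < MAX_CLUSTERED; i++)
	{
		if (slots[i].in_use && slots[i].dbid == dbid && slots[i].relid == relid)
			return false;		/* conflict: already being processed */
		if (!slots[i].in_use && free_slot < 0)
			free_slot = i;
	}
	if (free_slot < 0)
		return false;			/* "too many requests ... at a time" */
	slots[free_slot] = (ClusteredRelSlot) {dbid, relid, true};
	return true;
}

/* Analog of is_concurrent_cluster_in_progress(). */
static bool
is_clustered(uint32_t dbid, uint32_t relid)
{
	for (int i = 0; i < MAX_CLUSTERED; i++)
		if (slots[i].in_use && slots[i].dbid == dbid && slots[i].relid == relid)
			return true;
	return false;
}

/* Analog of the HASH_REMOVE step in end_concurrent_cluster(). */
static void
remove_clustered(uint32_t dbid, uint32_t relid)
{
	for (int i = 0; i < MAX_CLUSTERED; i++)
		if (slots[i].in_use && slots[i].dbid == dbid && slots[i].relid == relid)
			slots[i].in_use = false;
}
```

The real code additionally serializes access with an LWLock and registers a before_shmem_exit callback so that a FATAL exit cannot leave a stale entry behind.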
+
+/*
+ * Check if VACUUM FULL / CLUSTER CONCURRENTLY is already running for given
+ * relation, and if so, raise ERROR. The problem is that cluster_rel() needs
+ * to release its lock on the relation temporarily at some point, so our lock
+ * alone does not help. Commands that might break what cluster_rel() is doing
+ * should call this function first.
+ *
+ * Return without checking if lockmode allows for race conditions which would
+ * make the result meaningless. In that case, cluster_rel() itself should
+ * throw ERROR if the relation was changed by us in an incompatible
+ * way. However, if it managed to do most of its work by then, a lot of CPU
+ * time might be wasted.
+ */
+void
+check_for_concurrent_cluster(Oid relid, LOCKMODE lockmode)
+{
+ /*
+ * If the caller does not have a lock that conflicts with
+ * LOCK_CLUSTER_CONCURRENT, the check makes little sense because the
+ * VACUUM FULL / CLUSTER CONCURRENTLY can start anytime after the check.
+ */
+ if (lockmode < LOCK_CLUSTER_CONCURRENT)
+ return;
+
+ if (is_concurrent_cluster_in_progress(relid))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg(CLUSTER_IN_PROGRESS_MESSAGE,
+ get_rel_name(relid))));
+
+}
+
+/*
+ * Check if relation is eligible for CLUSTER CONCURRENTLY and retrieve the
+ * catalog state to be passed later to check_catalog_changes.
+ *
+ * Caller is supposed to hold (at least) LOCK_CLUSTER_CONCURRENT on the
+ * relation.
+ */
+static CatalogState *
+get_catalog_state(Relation rel)
+{
+ CatalogState *result = palloc_object(CatalogState);
+ List *ind_oids;
+ ListCell *lc;
+ int ninds, i;
+ Oid reltoastrelid = rel->rd_rel->reltoastrelid;
+ char relpersistence = rel->rd_rel->relpersistence;
+ char replident = rel->rd_rel->relreplident;
+ Oid ident_idx = RelationGetReplicaIndex(rel);
+ TupleDesc td_src = RelationGetDescr(rel);
+
+ /*
+ * While gathering the catalog information, check if there is a reason not
+ * to proceed.
+ */
+ check_relation_is_clusterable_concurrently(rel, ERROR,
+ "CLUSTER (CONCURRENTLY)");
+
+ /*
+ * TOAST should not really change, but be careful. If it did, we would be
+ * unable to remove the new one from ClusteredRelsHash.
+ */
+ if (OidIsValid(clustered_rel_toast) &&
+ clustered_rel_toast != reltoastrelid)
+ ereport(ERROR,
+ (errmsg("TOAST relation changed by another transaction")));
+
+ /* No index should be dropped while we are checking it. */
+ Assert(CheckRelationLockedByMe(rel, ShareUpdateExclusiveLock, true));
+
+ ind_oids = RelationGetIndexList(rel);
+ result->ninds = ninds = list_length(ind_oids);
+ result->ind_oids = palloc_array(Oid, ninds);
+ result->ind_tupdescs = palloc_array(TupleDesc, ninds);
+ i = 0;
+ foreach(lc, ind_oids)
+ {
+ Oid ind_oid = lfirst_oid(lc);
+ Relation index;
+ TupleDesc td_ind_src, td_ind_dst;
+
+ /*
+ * Weaker lock should be o.k. for the index, but this one should not
+ * break anything either.
+ */
+ index = index_open(ind_oid, ShareUpdateExclusiveLock);
+
+ result->ind_oids[i] = RelationGetRelid(index);
+ td_ind_src = RelationGetDescr(index);
+ td_ind_dst = palloc(TupleDescSize(td_ind_src));
+ TupleDescCopy(td_ind_dst, td_ind_src);
+ result->ind_tupdescs[i] = td_ind_dst;
+ i++;
+
+ index_close(index, ShareUpdateExclusiveLock);
+ }
+
+ /* Fill-in the relation info. */
+ result->tupdesc = palloc(TupleDescSize(td_src));
+ TupleDescCopy(result->tupdesc, td_src);
+ result->reltoastrelid = reltoastrelid;
+ result->relpersistence = relpersistence;
+ result->replident = replident;
+ result->replidindex = ident_idx;
+
+ return result;
+}
+
+static void
+free_catalog_state(CatalogState *state)
+{
+ for (int i = 0; i < state->ninds; i++)
+ FreeTupleDesc(state->ind_tupdescs[i]);
+
+ FreeTupleDesc(state->tupdesc);
+ pfree(state->ind_oids);
+ pfree(state->ind_tupdescs);
+ pfree(state);
+}
+
+/*
+ * Raise ERROR if 'rel' changed in a way that does not allow further
+ * processing of CLUSTER CONCURRENTLY.
+ *
+ * Besides the relation's tuple descriptor, it's important to check indexes:
+ * concurrent change of index definition (can it happen in other way than
+ * dropping and re-creating the index, accidentally with the same OID?) can be
+ * a problem because we may already have the new index built. If an index was
+ * created or dropped concurrently, we'd fail to swap the index storage. In
+ * any case, we prefer to check the indexes early to get an explicit error
+ * message about the mismatch. Furthermore, the earlier we detect the change,
+ * the fewer CPU cycles we waste.
+ *
+ * Note that we do not check constraints because the transaction which changed
+ * them must have ensured that the existing tuples satisfy the new
+ * constraints. If any DML commands were necessary for that, we will simply
+ * decode them from WAL and apply them to the new storage.
+ *
+ * Caller is supposed to hold (at least) ShareUpdateExclusiveLock on the
+ * relation.
+ */
+static void
+check_catalog_changes(Relation rel, CatalogState *cat_state)
+{
+ Oid reltoastrelid = rel->rd_rel->reltoastrelid;
+ List *ind_oids;
+ ListCell *lc;
+ LOCKMODE lockmode;
+ Oid ident_idx;
+ TupleDesc td, td_cp;
+
+ /* First, check the relation info. */
+
+ /* TOAST is not easy to change, but check. */
+ if (reltoastrelid != cat_state->reltoastrelid)
+ ereport(ERROR,
+ errmsg("TOAST relation of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+
+ /*
+ * Likewise, check_for_concurrent_cluster() should prevent others from
+ * changing the relation file concurrently, but it's our responsibility to
+ * avoid data loss. (The original locators are stored outside cat_state,
+ * but the check belongs to this function.)
+ */
+ if (!RelFileLocatorEquals(rel->rd_locator, clustered_rel_locator))
+ ereport(ERROR,
+ (errmsg("file of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel))));
+ if (OidIsValid(reltoastrelid))
+ {
+ Relation toastrel;
+
+ toastrel = table_open(reltoastrelid, AccessShareLock);
+ if (!RelFileLocatorEquals(toastrel->rd_locator,
+ clustered_rel_toast_locator))
+ ereport(ERROR,
+ (errmsg("file of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(toastrel))));
+ table_close(toastrel, AccessShareLock);
+ }
+
+ if (rel->rd_rel->relpersistence != cat_state->relpersistence)
+ ereport(ERROR,
+ errmsg("persistence of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+
+ if (cat_state->replident != rel->rd_rel->relreplident)
+ ereport(ERROR,
+ errmsg("replica identity of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+
+ ident_idx = RelationGetReplicaIndex(rel);
+ if (ident_idx == InvalidOid && rel->rd_pkindex != InvalidOid)
+ ident_idx = rel->rd_pkindex;
+ if (cat_state->replidindex != ident_idx)
+ ereport(ERROR,
+ errmsg("identity index of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+
+ /*
+ * As cat_state contains a copy (which has the constraint info cleared),
+ * create a temporary copy for the comparison.
+ */
+ td = RelationGetDescr(rel);
+ td_cp = palloc(TupleDescSize(td));
+ TupleDescCopy(td_cp, td);
+ if (!equalTupleDescs(cat_state->tupdesc, td_cp))
+ ereport(ERROR,
+ errmsg("definition of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+ FreeTupleDesc(td_cp);
+
+ /* Now we are only interested in indexes. */
+ if (cat_state->ninds == 0)
+ return;
+
+ /* No index should be dropped while we are checking the relation. */
+ lockmode = ShareUpdateExclusiveLock;
+ Assert(CheckRelationLockedByMe(rel, lockmode, true));
+
+ ind_oids = RelationGetIndexList(rel);
+ if (list_length(ind_oids) != cat_state->ninds)
+ goto failed_index;
+
+ foreach(lc, ind_oids)
+ {
+ Oid ind_oid = lfirst_oid(lc);
+ int i;
+ TupleDesc tupdesc;
+ Relation index;
+
+ /* Find the index in cat_state. */
+ for (i = 0; i < cat_state->ninds; i++)
+ {
+ if (cat_state->ind_oids[i] == ind_oid)
+ break;
+ }
+ /*
+ * OID not found, i.e. the index was replaced by another one. XXX
+ * Should we yet try to find if an index having the desired tuple
+ * descriptor exists? Or should we always look for the tuple
+ * descriptor and not use OIDs at all?
+ */
+ if (i == cat_state->ninds)
+ goto failed_index;
+
+ /* Check the tuple descriptor. */
+ index = try_index_open(ind_oid, lockmode);
+ if (index == NULL)
+ goto failed_index;
+ tupdesc = RelationGetDescr(index);
+ if (!equalTupleDescs(cat_state->ind_tupdescs[i], tupdesc))
+ goto failed_index;
+ index_close(index, lockmode);
+ }
+
+ return;
+
+failed_index:
+ ereport(ERROR,
+ (errmsg("index(es) of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel))));
+}
+
+/*
+ * This function is much like pg_create_logical_replication_slot() except that
+ * the new slot is neither released (if anyone else could read changes from
+ * our slot, we could miss changes other backends do while we copy the
+ * existing data into temporary table), nor persisted (it's easier to handle
+ * crash by restarting all the work from scratch).
+ *
+ * XXX Even though CreateInitDecodingContext() does not set state to
+ * RS_PERSISTENT, it does write the slot to disk. We rely on
+ * RestoreSlotFromDisk() to delete ephemeral slots during startup. (Both ERROR
+ * and FATAL should lead to cleanup even before the cluster goes down.)
+ */
+static LogicalDecodingContext *
+setup_logical_decoding(Oid relid, const char *slotname, TupleDesc tupdesc)
+{
+ LogicalDecodingContext *ctx;
+ ClusterDecodingState *dstate;
+
+ /* RS_TEMPORARY so that the slot gets cleaned up on ERROR. */
+ ReplicationSlotCreate(slotname, true, RS_TEMPORARY, false, false, false);
+
+ /*
+ * Neither prepare_write nor do_write callback nor update_progress is
+ * useful for us.
+ *
+ * Regarding the value of need_full_snapshot, we pass false because the
+ * table we are processing is present in ClusteredRelsHash and therefore,
+ * regarding logical decoding, treated like a catalog.
+ */
+ ctx = CreateInitDecodingContext(REPL_PLUGIN_NAME,
+ NIL,
+ false,
+ InvalidXLogRecPtr,
+ XL_ROUTINE(.page_read = read_local_xlog_page,
+ .segment_open = wal_segment_open,
+ .segment_close = wal_segment_close),
+ NULL, NULL, NULL);
+
+ /*
+	 * We have no control over setting fast_forward, so at least check it.
+ */
+ Assert(!ctx->fast_forward);
+
+ DecodingContextFindStartpoint(ctx);
+
+ /* Some WAL records should have been read. */
+ Assert(ctx->reader->EndRecPtr != InvalidXLogRecPtr);
+
+ XLByteToSeg(ctx->reader->EndRecPtr, cluster_current_segment,
+ wal_segment_size);
+
+ /*
+ * Setup structures to store decoded changes.
+ */
+ dstate = palloc0(sizeof(ClusterDecodingState));
+ dstate->relid = relid;
+ dstate->tstore = tuplestore_begin_heap(false, false,
+ maintenance_work_mem);
+ dstate->tupdesc = tupdesc;
+
+ /* Initialize the descriptor to store the changes ... */
+ dstate->tupdesc_change = CreateTemplateTupleDesc(1);
+
+ TupleDescInitEntry(dstate->tupdesc_change, 1, NULL, BYTEAOID, -1, 0);
+ /* ... as well as the corresponding slot. */
+ dstate->tsslot = MakeSingleTupleTableSlot(dstate->tupdesc_change,
+ &TTSOpsMinimalTuple);
+
+ dstate->resowner = ResourceOwnerCreate(CurrentResourceOwner,
+ "logical decoding");
+
+ ctx->output_writer_private = dstate;
+ return ctx;
+}
+
+/*
+ * Retrieve tuple from a change structure. As for the change, no alignment is
+ * assumed.
+ */
+static HeapTuple
+get_changed_tuple(ConcurrentChange *change)
+{
+ HeapTupleData tup_data;
+ HeapTuple result;
+ char *src;
+
+ /*
+ * Ensure alignment before accessing the fields. (This is why we can't use
+ * heap_copytuple() instead of this function.)
+ */
+ memcpy(&tup_data, &change->tup_data, sizeof(HeapTupleData));
+
+ result = (HeapTuple) palloc(HEAPTUPLESIZE + tup_data.t_len);
+ memcpy(result, &tup_data, sizeof(HeapTupleData));
+ result->t_data = (HeapTupleHeader) ((char *) result + HEAPTUPLESIZE);
+ src = (char *) change + sizeof(ConcurrentChange);
+ memcpy(result->t_data, src, result->t_len);
+
+ return result;
+}
+
+/*
+ * Decode logical changes from the WAL sequence up to end_of_wal.
+ */
+void
+cluster_decode_concurrent_changes(LogicalDecodingContext *ctx,
+ XLogRecPtr end_of_wal)
+{
+ ClusterDecodingState *dstate;
+ ResourceOwner resowner_old;
+ PgBackendProgress progress;
+
+ /*
+ * Invalidate the "present" cache before moving to "(recent) history".
+ */
+ InvalidateSystemCaches();
+
+ dstate = (ClusterDecodingState *) ctx->output_writer_private;
+ resowner_old = CurrentResourceOwner;
+ CurrentResourceOwner = dstate->resowner;
+
+ /*
+	 * reorderbuffer.c uses an internal subtransaction, whose abort ends the
+	 * command progress reporting. Save the status here so we can restore it
+	 * when done with the decoding.
+ */
+ memcpy(&progress, &MyBEEntry->st_progress, sizeof(PgBackendProgress));
+
+ PG_TRY();
+ {
+ while (ctx->reader->EndRecPtr < end_of_wal)
+ {
+ XLogRecord *record;
+ XLogSegNo segno_new;
+ char *errm = NULL;
+ XLogRecPtr end_lsn;
+
+ record = XLogReadRecord(ctx->reader, &errm);
+ if (errm)
+ elog(ERROR, "%s", errm);
+
+ if (record != NULL)
+ LogicalDecodingProcessRecord(ctx, ctx->reader);
+
+ /*
+ * If WAL segment boundary has been crossed, inform the decoding
+			 * system that the catalog_xmin can advance. (We could confirm more
+			 * often, but filling a single WAL segment should not take much
+			 * time.)
+ */
+ end_lsn = ctx->reader->EndRecPtr;
+ XLByteToSeg(end_lsn, segno_new, wal_segment_size);
+ if (segno_new != cluster_current_segment)
+ {
+ LogicalConfirmReceivedLocation(end_lsn);
+ elog(DEBUG1, "cluster: confirmed receive location %X/%X",
+ (uint32) (end_lsn >> 32), (uint32) end_lsn);
+ cluster_current_segment = segno_new;
+ }
+
+ CHECK_FOR_INTERRUPTS();
+ }
+ InvalidateSystemCaches();
+ CurrentResourceOwner = resowner_old;
+ }
+ PG_CATCH();
+ {
+ InvalidateSystemCaches();
+ CurrentResourceOwner = resowner_old;
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+
+ /* Restore the progress reporting status. */
+ pgstat_progress_restore_state(&progress);
+}
+
+/*
+ * Apply changes that happened during the initial load.
+ *
+ * Scan key is passed by caller, so it does not have to be constructed
+ * multiple times. Key entries have all fields initialized, except for
+ * sk_argument.
+ */
+static void
+apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
+ ScanKey key, int nkeys, IndexInsertState *iistate)
+{
+ TupleTableSlot *index_slot, *ident_slot;
+ HeapTuple tup_old = NULL;
+
+ if (dstate->nchanges == 0)
+ return;
+
+ /* TupleTableSlot is needed to pass the tuple to ExecInsertIndexTuples(). */
+ index_slot = MakeSingleTupleTableSlot(dstate->tupdesc, &TTSOpsHeapTuple);
+ iistate->econtext->ecxt_scantuple = index_slot;
+
+ /* A slot to fetch tuples from identity index. */
+ ident_slot = table_slot_create(rel, NULL);
+
+ while (tuplestore_gettupleslot(dstate->tstore, true, false,
+ dstate->tsslot))
+ {
+ bool shouldFree;
+ HeapTuple tup_change,
+ tup,
+ tup_exist;
+ char *change_raw;
+ ConcurrentChange *change;
+ bool isnull[1];
+ Datum values[1];
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Get the change from the single-column tuple. */
+ tup_change = ExecFetchSlotHeapTuple(dstate->tsslot, false, &shouldFree);
+ heap_deform_tuple(tup_change, dstate->tupdesc_change, values, isnull);
+ Assert(!isnull[0]);
+
+ /* This is bytea, but char* is easier to work with. */
+ change_raw = (char *) DatumGetByteaP(values[0]);
+
+ change = (ConcurrentChange *) VARDATA(change_raw);
+
+ /* TRUNCATE change contains no tuple, so process it separately. */
+ if (change->kind == CHANGE_TRUNCATE)
+ {
+ /*
+ * All the things that ExecuteTruncateGuts() does (such as firing
+ * triggers or handling the DROP_CASCADE behavior) should have
+ * taken place on the source relation. Thus we only do the actual
+ * truncation of the new relation (and its indexes).
+ */
+ heap_truncate_one_rel(rel);
+
+ pfree(tup_change);
+ continue;
+ }
+
+ /*
+ * Extract the tuple from the change. The tuple is copied here because
+ * it might be assigned to 'tup_old', in which case it needs to
+ * survive into the next iteration.
+ */
+ tup = get_changed_tuple(change);
+
+ if (change->kind == CHANGE_UPDATE_OLD)
+ {
+ Assert(tup_old == NULL);
+ tup_old = tup;
+ }
+ else if (change->kind == CHANGE_INSERT)
+ {
+ Assert(tup_old == NULL);
+
+ apply_concurrent_insert(rel, change, tup, iistate, index_slot);
+
+ pfree(tup);
+ }
+ else if (change->kind == CHANGE_UPDATE_NEW ||
+ change->kind == CHANGE_DELETE)
+ {
+ IndexScanDesc ind_scan = NULL;
+ HeapTuple tup_key;
+
+ if (change->kind == CHANGE_UPDATE_NEW)
+ {
+ tup_key = tup_old != NULL ? tup_old : tup;
+ }
+ else
+ {
+ Assert(tup_old == NULL);
+ tup_key = tup;
+ }
+
+ /*
+ * Find the tuple to be updated or deleted.
+ */
+ tup_exist = find_target_tuple(rel, key, nkeys, tup_key,
+ iistate, ident_slot, &ind_scan);
+ if (tup_exist == NULL)
+				elog(ERROR, "could not find target tuple");
+
+ if (change->kind == CHANGE_UPDATE_NEW)
+ apply_concurrent_update(rel, tup, tup_exist, change, iistate,
+ index_slot);
+ else
+ apply_concurrent_delete(rel, tup_exist, change);
+
+ if (tup_old != NULL)
+ {
+ pfree(tup_old);
+ tup_old = NULL;
+ }
+
+ pfree(tup);
+ index_endscan(ind_scan);
+ }
+ else
+			elog(ERROR, "unrecognized change kind: %d", change->kind);
+
+ /* If there's any change, make it visible to the next iteration. */
+ if (change->kind != CHANGE_UPDATE_OLD)
+ {
+ CommandCounterIncrement();
+ UpdateActiveSnapshotCommandId();
+ }
+
+ /* TTSOpsMinimalTuple has .get_heap_tuple==NULL. */
+ Assert(shouldFree);
+ pfree(tup_change);
+ }
+
+ tuplestore_clear(dstate->tstore);
+ dstate->nchanges = 0;
+
+ /* Cleanup. */
+ ExecDropSingleTupleTableSlot(index_slot);
+ ExecDropSingleTupleTableSlot(ident_slot);
+}
+
+static void
+apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
+ IndexInsertState *iistate, TupleTableSlot *index_slot)
+{
+ List *recheck;
+
+ heap_insert(rel, tup, GetCurrentCommandId(true), HEAP_INSERT_NO_LOGICAL, NULL);
+
+ /*
+ * Update indexes.
+ *
+	 * (If functions used by the indexes need the active snapshot, the
+	 * caller is expected to have set one.)
+ */
+ ExecStoreHeapTuple(tup, index_slot, false);
+ recheck = ExecInsertIndexTuples(iistate->rri,
+ index_slot,
+ iistate->estate,
+ false, /* update */
+ false, /* noDupErr */
+ NULL, /* specConflict */
+ NIL, /* arbiterIndexes */
+ false /* onlySummarizing */
+ );
+
+ /*
+	 * If recheck is required, it must have been performed on the source
+ * relation by now. (All the logical changes we process here are already
+ * committed.)
+ */
+ list_free(recheck);
+
+ pgstat_progress_incr_param(PROGRESS_CLUSTER_HEAP_TUPLES_INSERTED, 1);
+}
+
+static void
+apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
+ ConcurrentChange *change, IndexInsertState *iistate,
+ TupleTableSlot *index_slot)
+{
+ List *recheck;
+ TU_UpdateIndexes update_indexes;
+
+ /*
+ * Write the new tuple into the new heap. ('tup' gets the TID assigned
+ * here.)
+ */
+ simple_heap_update(rel, &tup_target->t_self, tup, &update_indexes);
+
+ ExecStoreHeapTuple(tup, index_slot, false);
+
+ if (update_indexes != TU_None)
+ {
+ recheck = ExecInsertIndexTuples(iistate->rri,
+ index_slot,
+ iistate->estate,
+ true, /* update */
+ false, /* noDupErr */
+ NULL, /* specConflict */
+ NIL, /* arbiterIndexes */
+ /* onlySummarizing */
+ update_indexes == TU_Summarizing);
+ list_free(recheck);
+ }
+
+ pgstat_progress_incr_param(PROGRESS_CLUSTER_HEAP_TUPLES_UPDATED, 1);
+}
+
+static void
+apply_concurrent_delete(Relation rel, HeapTuple tup_target,
+ ConcurrentChange *change)
+{
+ simple_heap_delete(rel, &tup_target->t_self);
+
+ pgstat_progress_incr_param(PROGRESS_CLUSTER_HEAP_TUPLES_DELETED, 1);
+}
+
+/*
+ * Find the tuple to be updated or deleted.
+ *
+ * 'key' is a pre-initialized scan key, into which the function will put the
+ * key values.
+ *
+ * 'tup_key' is a tuple containing the key values for the scan.
+ *
+ * On exit, '*scan_p' contains the scan descriptor used. The caller must
+ * close it when it no longer needs the returned tuple.
+ */
+static HeapTuple
+find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
+ IndexInsertState *iistate,
+ TupleTableSlot *ident_slot, IndexScanDesc *scan_p)
+{
+ IndexScanDesc scan;
+ Form_pg_index ident_form;
+ int2vector *ident_indkey;
+ HeapTuple result = NULL;
+
+ scan = index_beginscan(rel, iistate->ident_index, GetActiveSnapshot(),
+ nkeys, 0);
+ *scan_p = scan;
+ index_rescan(scan, key, nkeys, NULL, 0);
+
+ /* Info needed to retrieve key values from heap tuple. */
+ ident_form = iistate->ident_index->rd_index;
+ ident_indkey = &ident_form->indkey;
+
+ /* Use the incoming tuple to finalize the scan key. */
+ for (int i = 0; i < scan->numberOfKeys; i++)
+ {
+ ScanKey entry;
+ bool isnull;
+ int16 attno_heap;
+
+ entry = &scan->keyData[i];
+ attno_heap = ident_indkey->values[i];
+ entry->sk_argument = heap_getattr(tup_key,
+ attno_heap,
+ rel->rd_att,
+ &isnull);
+ Assert(!isnull);
+ }
+ if (index_getnext_slot(scan, ForwardScanDirection, ident_slot))
+ {
+ bool shouldFree;
+
+ result = ExecFetchSlotHeapTuple(ident_slot, false, &shouldFree);
+ /* TTSOpsBufferHeapTuple has .get_heap_tuple != NULL. */
+ Assert(!shouldFree);
+ }
+
+ return result;
+}
+
+/*
+ * Decode and apply concurrent changes.
+ *
+ * Pass rel_src iff its reltoastrelid is needed.
+ */
+static void
+process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
+ Relation rel_dst, Relation rel_src, ScanKey ident_key,
+ int ident_key_nentries, IndexInsertState *iistate)
+{
+ ClusterDecodingState *dstate;
+
+ pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
+ PROGRESS_CLUSTER_PHASE_CATCH_UP);
+
+ dstate = (ClusterDecodingState *) ctx->output_writer_private;
+
+ cluster_decode_concurrent_changes(ctx, end_of_wal);
+
+ if (dstate->nchanges == 0)
+ return;
+
+ PG_TRY();
+ {
+ /*
+ * Make sure that TOAST values can eventually be accessed via the old
+ * relation - see comment in copy_table_data().
+ */
+ if (rel_src)
+ rel_dst->rd_toastoid = rel_src->rd_rel->reltoastrelid;
+
+ apply_concurrent_changes(dstate, rel_dst, ident_key,
+ ident_key_nentries, iistate);
+ }
+ PG_FINALLY();
+ {
+ if (rel_src)
+ rel_dst->rd_toastoid = InvalidOid;
+ }
+ PG_END_TRY();
+}
+
+static IndexInsertState *
+get_index_insert_state(Relation relation, Oid ident_index_id)
+{
+ EState *estate;
+ int i;
+ IndexInsertState *result;
+
+ result = (IndexInsertState *) palloc0(sizeof(IndexInsertState));
+ estate = CreateExecutorState();
+ result->econtext = GetPerTupleExprContext(estate);
+
+ result->rri = (ResultRelInfo *) palloc(sizeof(ResultRelInfo));
+ InitResultRelInfo(result->rri, relation, 0, 0, 0);
+ ExecOpenIndices(result->rri, false);
+
+ /*
+ * Find the relcache entry of the identity index so that we spend no extra
+ * effort to open / close it.
+ */
+ for (i = 0; i < result->rri->ri_NumIndices; i++)
+ {
+ Relation ind_rel;
+
+ ind_rel = result->rri->ri_IndexRelationDescs[i];
+ if (ind_rel->rd_id == ident_index_id)
+ result->ident_index = ind_rel;
+ }
+ if (result->ident_index == NULL)
+		elog(ERROR, "could not find identity index");
+
+ /* Only initialize fields needed by ExecInsertIndexTuples(). */
+ result->estate = estate;
+
+ return result;
+}
+
+/*
+ * Build scan key to process logical changes.
+ */
+static ScanKey
+build_identity_key(Oid ident_idx_oid, Relation rel_src, int *nentries)
+{
+ Relation ident_idx_rel;
+ Form_pg_index ident_idx;
+ int n,
+ i;
+ ScanKey result;
+
+ Assert(OidIsValid(ident_idx_oid));
+ ident_idx_rel = index_open(ident_idx_oid, AccessShareLock);
+ ident_idx = ident_idx_rel->rd_index;
+ n = ident_idx->indnatts;
+ result = (ScanKey) palloc(sizeof(ScanKeyData) * n);
+ for (i = 0; i < n; i++)
+ {
+ ScanKey entry;
+ int16 relattno;
+ Form_pg_attribute att;
+ Oid opfamily,
+ opcintype,
+ opno,
+ opcode;
+
+ entry = &result[i];
+ relattno = ident_idx->indkey.values[i];
+ if (relattno >= 1)
+ {
+ TupleDesc desc;
+
+ desc = rel_src->rd_att;
+ att = TupleDescAttr(desc, relattno - 1);
+ }
+ else
+			elog(ERROR, "unexpected attribute number %d in index", relattno);
+
+ opfamily = ident_idx_rel->rd_opfamily[i];
+ opcintype = ident_idx_rel->rd_opcintype[i];
+ opno = get_opfamily_member(opfamily, opcintype, opcintype,
+ BTEqualStrategyNumber);
+
+ if (!OidIsValid(opno))
+			elog(ERROR, "could not find equality operator for type %u", opcintype);
+
+ opcode = get_opcode(opno);
+ if (!OidIsValid(opcode))
+			elog(ERROR, "could not find function for operator %u", opno);
+
+ /* Initialize everything but argument. */
+ ScanKeyInit(entry,
+ i + 1,
+ BTEqualStrategyNumber, opcode,
+ (Datum) NULL);
+ entry->sk_collation = att->attcollation;
+ }
+ index_close(ident_idx_rel, AccessShareLock);
+
+ *nentries = n;
+ return result;
+}
+
+static void
+free_index_insert_state(IndexInsertState *iistate)
+{
+ ExecCloseIndices(iistate->rri);
+ FreeExecutorState(iistate->estate);
+ pfree(iistate->rri);
+ pfree(iistate);
+}
+
+static void
+cleanup_logical_decoding(LogicalDecodingContext *ctx)
+{
+ ClusterDecodingState *dstate;
+
+ dstate = (ClusterDecodingState *) ctx->output_writer_private;
+
+ ExecDropSingleTupleTableSlot(dstate->tsslot);
+ FreeTupleDesc(dstate->tupdesc_change);
+ FreeTupleDesc(dstate->tupdesc);
+ tuplestore_end(dstate->tstore);
+
+ FreeDecodingContext(ctx);
+}
+
+/*
+ * The final steps of rebuild_relation() for concurrent processing.
+ *
+ * On entry, NewHeap is locked in AccessExclusiveLock mode. OldHeap and its
+ * clustering index (if one is passed) are still locked in a mode that allows
+ * concurrent data changes. On exit, both tables and their indexes are closed,
+ * but locked in AccessExclusiveLock mode.
+ */
+static void
+rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
+ Relation cl_index,
+ CatalogState *cat_state,
+ LogicalDecodingContext *ctx,
+ bool swap_toast_by_content,
+ TransactionId frozenXid,
+ MultiXactId cutoffMulti)
+{
+ LOCKMODE lockmode_old PG_USED_FOR_ASSERTS_ONLY;
+ List *ind_oids_new;
+ Oid old_table_oid = RelationGetRelid(OldHeap);
+ Oid new_table_oid = RelationGetRelid(NewHeap);
+ List *ind_oids_old = RelationGetIndexList(OldHeap);
+ ListCell *lc, *lc2;
+ char relpersistence;
+ bool is_system_catalog;
+ Oid ident_idx_old, ident_idx_new;
+ IndexInsertState *iistate;
+ ScanKey ident_key;
+ int ident_key_nentries;
+ XLogRecPtr wal_insert_ptr, end_of_wal;
+ char dummy_rec_data = '\0';
+ RelReopenInfo *rri = NULL;
+ int nrel;
+ Relation *ind_refs_all, *ind_refs_p;
+
+ /* Like in cluster_rel(). */
+ lockmode_old = LOCK_CLUSTER_CONCURRENT;
+ Assert(CheckRelationLockedByMe(OldHeap, lockmode_old, false));
+ Assert(cl_index == NULL ||
+ CheckRelationLockedByMe(cl_index, lockmode_old, false));
+ /* This is expected from the caller. */
+ Assert(CheckRelationLockedByMe(NewHeap, AccessExclusiveLock, false));
+
+ ident_idx_old = RelationGetReplicaIndex(OldHeap);
+
+ /*
+ * Unlike the exclusive case, we build new indexes for the new relation
+ * rather than swapping the storage and reindexing the old relation. The
+ * point is that the index build can take some time, so we do it before we
+ * get AccessExclusiveLock on the old heap and therefore we cannot swap
+ * the heap storage yet.
+ *
+	 * index_create() locks the new indexes in AccessExclusiveLock mode at
+	 * creation time - no need to change that.
+ */
+ ind_oids_new = build_new_indexes(NewHeap, OldHeap, ind_oids_old);
+
+ /*
+	 * Processing shouldn't start without a valid identity index.
+ */
+ Assert(OidIsValid(ident_idx_old));
+
+ /* Find "identity index" on the new relation. */
+ ident_idx_new = InvalidOid;
+ forboth(lc, ind_oids_old, lc2, ind_oids_new)
+ {
+ Oid ind_old = lfirst_oid(lc);
+ Oid ind_new = lfirst_oid(lc2);
+
+ if (ident_idx_old == ind_old)
+ {
+ ident_idx_new = ind_new;
+ break;
+ }
+ }
+ if (!OidIsValid(ident_idx_new))
+ /*
+ * Should not happen, given our lock on the old relation.
+ */
+ ereport(ERROR,
+				(errmsg("identity index missing on the new relation")));
+
+ /* Executor state to update indexes. */
+ iistate = get_index_insert_state(NewHeap, ident_idx_new);
+
+ /*
+ * Build scan key that we'll use to look for rows to be updated / deleted
+ * during logical decoding.
+ */
+ ident_key = build_identity_key(ident_idx_new, OldHeap, &ident_key_nentries);
+
+ /*
+ * Flush all WAL records inserted so far (possibly except for the last
+ * incomplete page, see GetInsertRecPtr), to minimize the amount of data
+ * we need to flush while holding exclusive lock on the source table.
+ */
+ wal_insert_ptr = GetInsertRecPtr();
+ XLogFlush(wal_insert_ptr);
+ end_of_wal = GetFlushRecPtr(NULL);
+
+ /*
+ * Apply concurrent changes first time, to minimize the time we need to
+	 * hold AccessExclusiveLock. (A significant amount of WAL may have been
+	 * written during the data copying and index creation.)
+ */
+ process_concurrent_changes(ctx, end_of_wal, NewHeap,
+ swap_toast_by_content ? OldHeap : NULL,
+ ident_key, ident_key_nentries, iistate);
+
+ /*
+ * Release the locks that allowed concurrent data changes, in order to
+ * acquire the AccessExclusiveLock.
+ */
+ nrel = 0;
+ /*
+ * We unlock the old relation (and its clustering index), but then we will
+ * lock the relation and *all* its indexes because we want to swap their
+ * storage.
+ *
+ * (NewHeap is already locked, as well as its indexes.)
+ */
+ rri = palloc_array(RelReopenInfo, 1 + list_length(ind_oids_old));
+ init_rel_reopen_info(&rri[nrel++], &OldHeap, InvalidOid,
+ LOCK_CLUSTER_CONCURRENT, AccessExclusiveLock);
+ /* References to the re-opened indexes will be stored in this array. */
+ ind_refs_all = palloc_array(Relation, list_length(ind_oids_old));
+ ind_refs_p = ind_refs_all;
+ /* The clustering index is a special case. */
+ if (cl_index)
+ {
+ *ind_refs_p = cl_index;
+ init_rel_reopen_info(&rri[nrel], ind_refs_p, InvalidOid,
+ LOCK_CLUSTER_CONCURRENT, AccessExclusiveLock);
+ nrel++;
+ ind_refs_p++;
+ }
+ /*
+ * Initialize also the entries for the other indexes (currently unlocked)
+ * because we will have to lock them.
+ */
+ foreach(lc, ind_oids_old)
+ {
+ Oid ind_oid;
+
+ ind_oid = lfirst_oid(lc);
+ /* Clustering index is already in the array, or there is none. */
+ if (cl_index && RelationGetRelid(cl_index) == ind_oid)
+ continue;
+
+ Assert(nrel < (1 + list_length(ind_oids_old)));
+
+ *ind_refs_p = NULL;
+ init_rel_reopen_info(&rri[nrel],
+ /*
+ * In this special case we do not have the
+ * relcache reference, use OID instead.
+ */
+ ind_refs_p,
+ ind_oid,
+ NoLock, /* Nothing to unlock. */
+ AccessExclusiveLock);
+
+ nrel++;
+ ind_refs_p++;
+ }
+ /* Perform the actual unlocking and re-locking. */
+ unlock_and_close_relations(rri, nrel);
+ reopen_relations(rri, nrel);
+
+ /*
+ * In addition, lock the OldHeap's TOAST relation that we skipped for the
+ * CONCURRENTLY option in copy_table_data(). This lock will be needed to
+ * swap the relation files.
+ */
+ if (OidIsValid(OldHeap->rd_rel->reltoastrelid))
+ LockRelationOid(OldHeap->rd_rel->reltoastrelid, AccessExclusiveLock);
+
+ /*
+ * Check if the new indexes match the old ones, i.e. no changes occurred
+ * while OldHeap was unlocked.
+ *
+ * XXX It's probably not necessary to check the relation tuple descriptor
+ * here because the logical decoding was already active when we released
+ * the lock, and thus the corresponding data changes won't be lost.
+ * However processing of those changes might take a lot of time.
+ */
+ check_catalog_changes(OldHeap, cat_state);
+
+ /*
+ * Tuples and pages of the old heap will be gone, but the heap will stay.
+ */
+ TransferPredicateLocksToHeapRelation(OldHeap);
+ /* The same for indexes. */
+ for (int i = 0; i < (nrel - 1); i++)
+ {
+ Relation index = ind_refs_all[i];
+
+ TransferPredicateLocksToHeapRelation(index);
+
+ /*
+ * References to indexes on the old relation are not needed anymore,
+ * however locks stay till the end of the transaction.
+ */
+ index_close(index, NoLock);
+ }
+ pfree(ind_refs_all);
+
+ /*
+ * Flush anything we see in WAL, to make sure that all changes committed
+ * while we were waiting for the exclusive lock are available for
+ * decoding. This should not be necessary if all backends had
+ * synchronous_commit set, but we can't rely on this setting.
+ *
+ * Unfortunately, GetInsertRecPtr() may lag behind the actual insert
+ * position, and GetLastImportantRecPtr() points at the start of the last
+ * record rather than at the end. Thus the simplest way to determine the
+ * insert position is to insert a dummy record and use its LSN.
+ *
+ * XXX Consider using GetLastImportantRecPtr() and adding the size of the
+ * last record (plus the total size of all the page headers the record
+ * spans)?
+ */
+ XLogBeginInsert();
+ XLogRegisterData(&dummy_rec_data, 1);
+ wal_insert_ptr = XLogInsert(RM_XLOG_ID, XLOG_NOOP);
+ XLogFlush(wal_insert_ptr);
+ end_of_wal = GetFlushRecPtr(NULL);
+
+ /* Apply the concurrent changes again. */
+ process_concurrent_changes(ctx, end_of_wal, NewHeap,
+ swap_toast_by_content ? OldHeap : NULL,
+ ident_key, ident_key_nentries, iistate);
+
+ /* Remember info about rel before closing OldHeap */
+ relpersistence = OldHeap->rd_rel->relpersistence;
+ is_system_catalog = IsSystemRelation(OldHeap);
+
+ pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
+ PROGRESS_CLUSTER_PHASE_SWAP_REL_FILES);
+
+ forboth(lc, ind_oids_old, lc2, ind_oids_new)
+ {
+ Oid ind_old = lfirst_oid(lc);
+ Oid ind_new = lfirst_oid(lc2);
+ Oid mapped_tables[4];
+
+ /* Zero out possible results from swapped_relation_files */
+ memset(mapped_tables, 0, sizeof(mapped_tables));
+
+ swap_relation_files(ind_old, ind_new,
+ (old_table_oid == RelationRelationId),
+ swap_toast_by_content,
+ true,
+ InvalidTransactionId,
+ InvalidMultiXactId,
+ mapped_tables);
+
+#ifdef USE_ASSERT_CHECKING
+ /*
+ * Concurrent processing is not supported for system relations, so
+ * there should be no mapped tables.
+ */
+ for (int i = 0; i < 4; i++)
+ Assert(mapped_tables[i] == 0);
+#endif
+ }
+
+ /* The new indexes must be visible for deletion. */
+ CommandCounterIncrement();
+
+ /* Close the old heap but keep lock until transaction commit. */
+ table_close(OldHeap, NoLock);
+ /* Close the new heap. (We didn't have to open its indexes). */
+ table_close(NewHeap, NoLock);
+
+ /* Cleanup what we don't need anymore. (And close the identity index.) */
+ pfree(ident_key);
+ free_index_insert_state(iistate);
+
+ /*
+ * Swap the relations and their TOAST relations and TOAST indexes. This
+ * also drops the new relation and its indexes.
+ *
+ * (System catalogs are currently not supported.)
+ */
+ Assert(!is_system_catalog);
+ finish_heap_swap(old_table_oid, new_table_oid,
+ is_system_catalog,
+ swap_toast_by_content,
+ false, true, false,
+ frozenXid, cutoffMulti,
+ relpersistence);
+
+ pfree(rri);
+}
+
+/*
+ * Build indexes on NewHeap according to those on OldHeap.
+ *
+ * OldIndexes is the list of index OIDs on OldHeap.
+ *
+ * A list of OIDs of the corresponding indexes created on NewHeap is
+ * returned. The order of items matches, so we can use these lists to swap
+ * index storage.
+ */
+static List *
+build_new_indexes(Relation NewHeap, Relation OldHeap, List *OldIndexes)
+{
+ StringInfo ind_name;
+ ListCell *lc;
+ List *result = NIL;
+
+ pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
+ PROGRESS_CLUSTER_PHASE_REBUILD_INDEX);
+
+ ind_name = makeStringInfo();
+
+ foreach(lc, OldIndexes)
+ {
+ Oid ind_oid,
+ ind_oid_new,
+ tbsp_oid;
+ Relation ind;
+ IndexInfo *ind_info;
+ int i,
+ heap_col_id;
+ List *colnames;
+ int16 indnatts;
+ Oid *collations,
+ *opclasses;
+ HeapTuple tup;
+ bool isnull;
+ Datum d;
+ oidvector *oidvec;
+ int2vector *int2vec;
+ size_t oid_arr_size;
+ size_t int2_arr_size;
+ int16 *indoptions;
+ text *reloptions = NULL;
+ bits16 flags;
+ Datum *opclassOptions;
+ NullableDatum *stattargets;
+
+ ind_oid = lfirst_oid(lc);
+ ind = index_open(ind_oid, AccessShareLock);
+ ind_info = BuildIndexInfo(ind);
+
+ tbsp_oid = ind->rd_rel->reltablespace;
+ /*
+ * Index names really don't matter, as we'll eventually use only their
+ * storage. Just make them unique within the table.
+ */
+ resetStringInfo(ind_name);
+ appendStringInfo(ind_name, "ind_%d",
+ list_cell_number(OldIndexes, lc));
+
+ flags = 0;
+ if (ind->rd_index->indisprimary)
+ flags |= INDEX_CREATE_IS_PRIMARY;
+
+ colnames = NIL;
+ indnatts = ind->rd_index->indnatts;
+ oid_arr_size = sizeof(Oid) * indnatts;
+ int2_arr_size = sizeof(int16) * indnatts;
+
+ collations = (Oid *) palloc(oid_arr_size);
+ for (i = 0; i < indnatts; i++)
+ {
+ char *colname;
+
+ heap_col_id = ind->rd_index->indkey.values[i];
+ if (heap_col_id > 0)
+ {
+ Form_pg_attribute att;
+
+ /* Normal attribute. */
+ att = TupleDescAttr(OldHeap->rd_att, heap_col_id - 1);
+ colname = pstrdup(NameStr(att->attname));
+ collations[i] = att->attcollation;
+ }
+ else if (heap_col_id == 0)
+ {
+ HeapTuple tuple;
+ Form_pg_attribute att;
+
+ /*
+ * An expression column is not present in the relcache. What we
+ * need here is an attribute of the *index* relation.
+ */
+ tuple = SearchSysCache2(ATTNUM,
+ ObjectIdGetDatum(ind_oid),
+ Int16GetDatum(i + 1));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR,
+ "cache lookup failed for attribute %d of relation %u",
+ i + 1, ind_oid);
+ att = (Form_pg_attribute) GETSTRUCT(tuple);
+ colname = pstrdup(NameStr(att->attname));
+ collations[i] = att->attcollation;
+ ReleaseSysCache(tuple);
+ }
+ else
+ elog(ERROR, "unexpected column number: %d",
+ heap_col_id);
+
+ colnames = lappend(colnames, colname);
+ }
+
+ /*
+ * Special effort needed for variable length attributes of
+ * Form_pg_index.
+ */
+ tup = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(ind_oid));
+ if (!HeapTupleIsValid(tup))
+ elog(ERROR, "cache lookup failed for index %u", ind_oid);
+ d = SysCacheGetAttr(INDEXRELID, tup, Anum_pg_index_indclass, &isnull);
+ Assert(!isnull);
+ oidvec = (oidvector *) DatumGetPointer(d);
+ opclasses = (Oid *) palloc(oid_arr_size);
+ memcpy(opclasses, oidvec->values, oid_arr_size);
+
+ d = SysCacheGetAttr(INDEXRELID, tup, Anum_pg_index_indoption,
+ &isnull);
+ Assert(!isnull);
+ int2vec = (int2vector *) DatumGetPointer(d);
+ indoptions = (int16 *) palloc(int2_arr_size);
+ memcpy(indoptions, int2vec->values, int2_arr_size);
+ ReleaseSysCache(tup);
+
+ tup = SearchSysCache1(RELOID, ObjectIdGetDatum(ind_oid));
+ if (!HeapTupleIsValid(tup))
+ elog(ERROR, "cache lookup failed for index relation %u", ind_oid);
+ d = SysCacheGetAttr(RELOID, tup, Anum_pg_class_reloptions, &isnull);
+ reloptions = !isnull ? DatumGetTextPCopy(d) : NULL;
+ ReleaseSysCache(tup);
+
+ opclassOptions = palloc0(sizeof(Datum) * ind_info->ii_NumIndexAttrs);
+ for (i = 0; i < ind_info->ii_NumIndexAttrs; i++)
+ opclassOptions[i] = get_attoptions(ind_oid, i + 1);
+
+ stattargets = get_index_stattargets(ind_oid, ind_info);
+
+ /*
+ * Neither parentIndexRelid nor parentConstraintId needs to be passed
+ * since the new catalog entries (pg_constraint, pg_inherits) would
+ * eventually be dropped. Therefore there's no need to record a valid
+ * dependency on the parents.
+ */
+ ind_oid_new = index_create(NewHeap,
+ ind_name->data,
+ InvalidOid,
+ InvalidOid, /* parentIndexRelid */
+ InvalidOid, /* parentConstraintId */
+ InvalidOid,
+ ind_info,
+ colnames,
+ ind->rd_rel->relam,
+ tbsp_oid,
+ collations,
+ opclasses,
+ opclassOptions,
+ indoptions,
+ stattargets,
+ PointerGetDatum(reloptions),
+ flags, /* flags */
+ 0, /* constr_flags */
+ false, /* allow_system_table_mods */
+ false, /* is_internal */
+ NULL /* constraintId */
+ );
+ result = lappend_oid(result, ind_oid_new);
+
+ index_close(ind, AccessShareLock);
+ list_free_deep(colnames);
+ pfree(collations);
+ pfree(opclasses);
+ pfree(indoptions);
+ if (reloptions)
+ pfree(reloptions);
+ }
+
+ return result;
+}
+
+static void
+init_rel_reopen_info(RelReopenInfo *rri, Relation *rel_p, Oid relid,
+ LOCKMODE lockmode_orig, LOCKMODE lockmode_new)
+{
+ rri->rel_p = rel_p;
+ rri->relid = relid;
+ rri->lockmode_orig = lockmode_orig;
+ rri->lockmode_new = lockmode_new;
+}
+
+/*
+ * Unlock and close the relations specified by the items of the 'rels'
+ * array; 'nrel' is the number of items.
+ *
+ * Information needed to (re)open the relations (or to issue a meaningful
+ * ERROR) is added to the array items.
+ */
+static void
+unlock_and_close_relations(RelReopenInfo *rels, int nrel)
+{
+ int i;
+ RelReopenInfo *rri;
+
+ /*
+ * First, retrieve the information that we will need for re-opening.
+ *
+ * We could close (and unlock) each relation as soon as we have gathered
+ * the related information, but then we would have to be careful not to
+ * unlock the table until we have the info on all its indexes. (Once we
+ * unlock the table, any index can be dropped, and thus we can fail to get
+ * the name we want to report if re-opening fails.) It seems simpler to
+ * separate the work into two iterations.
+ */
+ for (i = 0; i < nrel; i++)
+ {
+ Relation rel;
+
+ rri = &rels[i];
+ rel = *rri->rel_p;
+
+ if (rel)
+ {
+ Assert(CheckRelationLockedByMe(rel, rri->lockmode_orig, false));
+ Assert(!OidIsValid(rri->relid));
+
+ rri->relid = RelationGetRelid(rel);
+ rri->relkind = rel->rd_rel->relkind;
+ rri->relname = pstrdup(RelationGetRelationName(rel));
+ }
+ else
+ {
+ Assert(OidIsValid(rri->relid));
+
+ rri->relname = get_rel_name(rri->relid);
+ rri->relkind = get_rel_relkind(rri->relid);
+ }
+ }
+
+ /* Second, close the relations. */
+ for (i = 0; i < nrel; i++)
+ {
+ Relation rel;
+
+ rri = &rels[i];
+ rel = *rri->rel_p;
+
+ /* Close the relation if the caller passed one. */
+ if (rel)
+ {
+ if (rri->relkind == RELKIND_RELATION)
+ table_close(rel, rri->lockmode_orig);
+ else
+ {
+ Assert(rri->relkind == RELKIND_INDEX);
+
+ index_close(rel, rri->lockmode_orig);
+ }
+ }
+ }
+}
+
+/*
+ * Re-open the relations closed previously by unlock_and_close_relations().
+ */
+static void
+reopen_relations(RelReopenInfo *rels, int nrel)
+{
+ for (int i = 0; i < nrel; i++)
+ {
+ RelReopenInfo *rri = &rels[i];
+ Relation rel;
+
+ if (rri->relkind == RELKIND_RELATION)
+ {
+ rel = try_table_open(rri->relid, rri->lockmode_new);
+ }
+ else
+ {
+ Assert(rri->relkind == RELKIND_INDEX);
+
+ rel = try_index_open(rri->relid, rri->lockmode_new);
+ }
+
+ if (rel == NULL)
+ {
+ const char *kind_str;
+
+ kind_str = (rri->relkind == RELKIND_RELATION) ? "table" : "index";
+ ereport(ERROR,
 (errmsg("could not open %s \"%s\"", kind_str,
+ rri->relname),
+ errhint("The %s could have been dropped by another transaction.",
+ kind_str)));
+ }
+ *rri->rel_p = rel;
+
+ pfree(rri->relname);
+ }
+}
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 7da6647f8f..6143f854eb 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -906,7 +906,7 @@ refresh_by_match_merge(Oid matviewOid, Oid tempOid, Oid relowner,
static void
refresh_by_heap_swap(Oid matviewOid, Oid OIDNewHeap, char relpersistence)
{
- finish_heap_swap(matviewOid, OIDNewHeap, false, false, true, true,
+ finish_heap_swap(matviewOid, OIDNewHeap, false, false, true, true, true,
RecentXmin, ReadNextMultiXactId(), relpersistence);
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 2b20b03224..2e981b604a 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4391,6 +4391,16 @@ AlterTableInternal(Oid relid, List *cmds, bool recurse)
rel = relation_open(relid, lockmode);
+ /*
+ * If lockmode allows, check if VACUUM FULL / CLUSTER CONCURRENTLY is in
+ * progress. If lockmode is too weak, cluster_rel() should detect
+ * incompatible DDLs executed by us.
+ *
+ * XXX We might skip the changes for DDLs which do not change the tuple
+ * descriptor.
+ */
+ check_for_concurrent_cluster(relid, lockmode);
+
EventTriggerAlterTableRelid(relid);
ATController(NULL, rel, cmds, recurse, lockmode, NULL);
@@ -5803,6 +5813,7 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
finish_heap_swap(tab->relid, OIDNewHeap,
false, false, true,
!OidIsValid(tab->newTableSpace),
+ true,
RecentXmin,
ReadNextMultiXactId(),
persistence);
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index d32068b5d5..359fbabd5d 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -112,7 +112,8 @@ static void vac_truncate_clog(TransactionId frozenXID,
TransactionId lastSaneFrozenXid,
MultiXactId lastSaneMinMulti);
static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
- BufferAccessStrategy bstrategy);
+ BufferAccessStrategy bstrategy, bool isTopLevel,
+ bool whole_database);
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
@@ -153,6 +154,7 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
bool analyze = false;
bool freeze = false;
bool full = false;
+ bool concurrent = false;
bool disable_page_skipping = false;
bool process_main = true;
bool process_toast = true;
@@ -226,6 +228,8 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
freeze = defGetBoolean(opt);
else if (strcmp(opt->defname, "full") == 0)
full = defGetBoolean(opt);
+ else if (strcmp(opt->defname, "concurrently") == 0)
+ concurrent = defGetBoolean(opt);
else if (strcmp(opt->defname, "disable_page_skipping") == 0)
disable_page_skipping = defGetBoolean(opt);
else if (strcmp(opt->defname, "index_cleanup") == 0)
@@ -300,7 +304,7 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
(skip_locked ? VACOPT_SKIP_LOCKED : 0) |
(analyze ? VACOPT_ANALYZE : 0) |
(freeze ? VACOPT_FREEZE : 0) |
- (full ? VACOPT_FULL : 0) |
+ (full ? (concurrent ? VACOPT_FULL_CONCURRENT : VACOPT_FULL_EXCLUSIVE) : 0) |
(disable_page_skipping ? VACOPT_DISABLE_PAGE_SKIPPING : 0) |
(process_main ? VACOPT_PROCESS_MAIN : 0) |
(process_toast ? VACOPT_PROCESS_TOAST : 0) |
@@ -380,6 +384,12 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
errmsg("ONLY_DATABASE_STATS cannot be specified with other VACUUM options")));
}
+ /* This problem cannot be identified from the options. */
+ if (concurrent && !full)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("CONCURRENTLY can only be specified with VACUUM FULL")));
+
/*
* All freeze ages are zero if the FREEZE option is given; otherwise pass
* them as -1 which means to use the default values.
@@ -483,6 +493,7 @@ vacuum(List *relations, VacuumParams *params, BufferAccessStrategy bstrategy,
const char *stmttype;
volatile bool in_outer_xact,
use_own_xacts;
+ bool whole_database = false;
Assert(params != NULL);
@@ -543,7 +554,15 @@ vacuum(List *relations, VacuumParams *params, BufferAccessStrategy bstrategy,
relations = newrels;
}
else
+ {
relations = get_all_vacuum_rels(vac_context, params->options);
+ /*
+ * If all tables should be processed, the CONCURRENTLY option implies
+ * that we should skip system relations rather than raising ERRORs.
+ */
+ if (params->options & VACOPT_FULL_CONCURRENT)
+ whole_database = true;
+ }
/*
* Decide whether we need to start/commit our own transactions.
@@ -616,7 +635,8 @@ vacuum(List *relations, VacuumParams *params, BufferAccessStrategy bstrategy,
if (params->options & VACOPT_VACUUM)
{
- if (!vacuum_rel(vrel->oid, vrel->relation, params, bstrategy))
+ if (!vacuum_rel(vrel->oid, vrel->relation, params, bstrategy,
+ isTopLevel, whole_database))
continue;
}
@@ -1929,10 +1949,14 @@ vac_truncate_clog(TransactionId frozenXID,
/*
* vacuum_rel() -- vacuum one heap relation
*
- * relid identifies the relation to vacuum. If relation is supplied,
- * use the name therein for reporting any failure to open/lock the rel;
- * do not use it once we've successfully opened the rel, since it might
- * be stale.
+ * relid identifies the relation to vacuum. If relation is supplied, use
+ * the name therein for reporting any failure to open/lock the rel; do
+ * not use it once we've successfully opened the rel, since it might be
+ * stale.
+ *
+ * If whole_database is true, we are processing all the relations of the
+ * current database. In that case we might need to silently skip
+ * relations which would otherwise cause an ERROR.
*
* Returns true if it's okay to proceed with a requested ANALYZE
* operation on this table.
@@ -1947,7 +1971,8 @@ vac_truncate_clog(TransactionId frozenXID,
*/
static bool
vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
- BufferAccessStrategy bstrategy)
+ BufferAccessStrategy bstrategy, bool isTopLevel,
+ bool whole_database)
{
LOCKMODE lmode;
Relation rel;
@@ -2010,10 +2035,11 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
/*
* Determine the type of lock we want --- hard exclusive lock for a FULL
- * vacuum, but just ShareUpdateExclusiveLock for concurrent vacuum. Either
- * way, we can be sure that no other backend is vacuuming the same table.
+ * exclusive vacuum, but a weaker lock (ShareUpdateExclusiveLock) for
+ * concurrent vacuum. Either way, we can be sure that no other backend is
+ * vacuuming the same table.
*/
- lmode = (params->options & VACOPT_FULL) ?
+ lmode = (params->options & VACOPT_FULL_EXCLUSIVE) ?
AccessExclusiveLock : ShareUpdateExclusiveLock;
/* open the relation and get the appropriate lock on it */
@@ -2028,6 +2054,39 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
return false;
}
+ /*
+ * Leave if the CONCURRENTLY option was passed, but the relation is not
+ * suitable for that. Note that we only skip such relations if the user
+ * wants to vacuum the whole database. In contrast, if the user specified
+ * inappropriate relation(s) explicitly, the command will fail with an
+ * ERROR.
+ */
+ if (whole_database && (params->options & VACOPT_FULL_CONCURRENT) &&
+ !check_relation_is_clusterable_concurrently(rel, DEBUG1,
+ "VACUUM (FULL, CONCURRENTLY)"))
+ {
+ relation_close(rel, lmode);
+ PopActiveSnapshot();
+ CommitTransactionCommand();
+ return false;
+ }
+
+ /*
+ * Skip the relation if VACUUM FULL / CLUSTER CONCURRENTLY is in progress
+ * as it will drop the current storage of the relation.
+ *
+ * This check should not take place until we have a lock that prevents
+ * another backend from starting VACUUM FULL / CLUSTER CONCURRENTLY later.
+ */
+ Assert(lmode >= LOCK_CLUSTER_CONCURRENT);
+ if (is_concurrent_cluster_in_progress(relid))
+ {
+ relation_close(rel, lmode);
+ PopActiveSnapshot();
+ CommitTransactionCommand();
+ return false;
+ }
+
/*
* When recursing to a TOAST table, check privileges on the parent. NB:
* This is only safe to do because we hold a session lock on the main
@@ -2101,19 +2160,6 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
return true;
}
- /*
- * Get a session-level lock too. This will protect our access to the
- * relation across multiple transactions, so that we can vacuum the
- * relation's TOAST table (if any) secure in the knowledge that no one is
- * deleting the parent relation.
- *
- * NOTE: this cannot block, even if someone else is waiting for access,
- * because the lock manager knows that both lock requests are from the
- * same process.
- */
- lockrelid = rel->rd_lockInfo.lockRelId;
- LockRelationIdForSession(&lockrelid, lmode);
-
/*
* Set index_cleanup option based on index_cleanup reloption if it wasn't
* specified in VACUUM command, or when running in an autovacuum worker
@@ -2166,6 +2212,30 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
else
toast_relid = InvalidOid;
+ /*
+ * Get a session-level lock too. This will protect our access to the
+ * relation across multiple transactions, so that we can vacuum the
+ * relation's TOAST table (if any) secure in the knowledge that no one is
+ * deleting the parent relation.
+ *
+ * NOTE: this cannot block, even if someone else is waiting for access,
+ * because the lock manager knows that both lock requests are from the
+ * same process.
+ */
+ if (OidIsValid(toast_relid))
+ {
+ /*
+ * You might worry that, in the VACUUM (FULL, CONCURRENTLY) case,
+ * cluster_rel() needs to release all the locks on the relation at
+ * some point, but this session lock makes it impossible. In fact,
+ * cluster_rel() will eventually be called for the TOAST relation
+ * and raise ERROR because, in the concurrent mode, it cannot process
+ * TOAST relation alone anyway.
+ */
+ lockrelid = rel->rd_lockInfo.lockRelId;
+ LockRelationIdForSession(&lockrelid, lmode);
+ }
+
/*
* Switch to the table owner's userid, so that any index functions are run
* as that user. Also lock down security-restricted operations and
@@ -2193,11 +2263,22 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
{
ClusterParams cluster_params = {0};
+ /*
+ * Invalid toast_relid means that there is no session lock on the
+ * relation. Such a lock would be a problem because it would
+ * prevent cluster_rel() from releasing all locks when it tries to
+ * get AccessExclusiveLock.
+ */
+ Assert(!OidIsValid(toast_relid));
+
if ((params->options & VACOPT_VERBOSE) != 0)
cluster_params.options |= CLUOPT_VERBOSE;
+ if ((params->options & VACOPT_FULL_CONCURRENT) != 0)
+ cluster_params.options |= CLUOPT_CONCURRENT;
+
/* VACUUM FULL is now a variant of CLUSTER; see cluster.c */
- cluster_rel(rel, InvalidOid, &cluster_params);
+ cluster_rel(rel, InvalidOid, &cluster_params, isTopLevel);
/*
* cluster_rel() should have closed the relation, lock is kept
@@ -2246,13 +2327,15 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
toast_vacuum_params.options |= VACOPT_PROCESS_MAIN;
toast_vacuum_params.toast_parent = relid;
- vacuum_rel(toast_relid, NULL, &toast_vacuum_params, bstrategy);
+ vacuum_rel(toast_relid, NULL, &toast_vacuum_params, bstrategy,
+ isTopLevel, whole_database);
}
/*
* Now release the session-level lock on the main table.
*/
- UnlockRelationIdForSession(&lockrelid, lmode);
+ if (OidIsValid(toast_relid))
+ UnlockRelationIdForSession(&lockrelid, lmode);
/* Report that we really did it. */
return true;
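The split of VACOPT_FULL into exclusive and concurrent variants is what drives the lock-mode choice in vacuum_rel(). A minimal stand-alone sketch of that decision; the flag and lock-mode values below are illustrative stand-ins, not the real definitions from vacuum.h and lockdefs.h:

```c
#include <assert.h>

/* Hypothetical option bits mirroring the patch's split of VACOPT_FULL. */
enum
{
	VACOPT_FULL_EXCLUSIVE = 1 << 0,
	VACOPT_FULL_CONCURRENT = 1 << 1,
	VACOPT_FULL = VACOPT_FULL_EXCLUSIVE | VACOPT_FULL_CONCURRENT,
};

/* Stand-ins for PostgreSQL lock modes (higher = stronger). */
enum
{
	ShareUpdateExclusiveLock = 4,
	AccessExclusiveLock = 8,
};

/*
 * Mirror of the lock-mode choice in vacuum_rel(): only the exclusive
 * flavor of FULL needs AccessExclusiveLock up front; the concurrent
 * flavor starts with the weaker ShareUpdateExclusiveLock, which still
 * guarantees that no other backend vacuums the same table.
 */
static int
choose_lockmode(int options)
{
	return (options & VACOPT_FULL_EXCLUSIVE) ?
		AccessExclusiveLock : ShareUpdateExclusiveLock;
}
```

Note that VACOPT_FULL remains the union of both bits, so existing `options & VACOPT_FULL` tests keep matching either variant.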
diff --git a/src/backend/meson.build b/src/backend/meson.build
index 78c5726814..0f9141a4ac 100644
--- a/src/backend/meson.build
+++ b/src/backend/meson.build
@@ -194,5 +194,6 @@ pg_test_mod_args = pg_mod_args + {
subdir('jit/llvm')
subdir('replication/libpqwalreceiver')
subdir('replication/pgoutput')
+subdir('replication/pgoutput_cluster')
subdir('snowball')
subdir('utils/mb/conversion_procs')
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index d687ceee33..b3fb5d1825 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -33,6 +33,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecord.h"
#include "catalog/pg_control.h"
+#include "commands/cluster.h"
#include "replication/decode.h"
#include "replication/logical.h"
#include "replication/message.h"
@@ -467,6 +468,29 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
TransactionId xid = XLogRecGetXid(buf->record);
SnapBuild *builder = ctx->snapshot_builder;
+ /*
+ * Check if CLUSTER CONCURRENTLY is being performed by this backend. If
+ * so, only decode data changes of the table that it is processing, and
+ * the changes of its TOAST relation.
+ *
+ * (The TOAST locator should not be set unless the main one is.)
+ */
+ Assert(!OidIsValid(clustered_rel_toast_locator.relNumber) ||
+ OidIsValid(clustered_rel_locator.relNumber));
+
+ if (OidIsValid(clustered_rel_locator.relNumber))
+ {
+ XLogReaderState *r = buf->record;
+ RelFileLocator locator;
+
+ XLogRecGetBlockTag(r, 0, &locator, NULL, NULL);
+
+ if (!RelFileLocatorEquals(locator, clustered_rel_locator) &&
+ (!OidIsValid(clustered_rel_toast_locator.relNumber) ||
+ !RelFileLocatorEquals(locator, clustered_rel_toast_locator)))
+ return;
+ }
+
ReorderBufferProcessXid(ctx->reorder, xid, buf->origptr);
/*
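The filter added to heap_decode() above can be modeled outside the backend: a change is decoded only when no concurrent cluster is running, or when the block's locator matches the clustered table or its TOAST relation. A self-contained sketch with simplified stand-ins (the real types are RelFileLocator and RelFileLocatorEquals; the zero relNumber test stands in for the OidIsValid() check):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for RelFileLocator. */
typedef struct
{
	uint32_t	spc;
	uint32_t	db;
	uint32_t	rel;
} locator;

static bool
loc_eq(locator a, locator b)
{
	return a.spc == b.spc && a.db == b.db && a.rel == b.rel;
}

/*
 * Should this change be decoded? Mirrors the heap_decode() filter: pass
 * everything through when no CLUSTER CONCURRENTLY is running in this
 * backend, otherwise accept only the clustered table and (if set) its
 * TOAST relation.
 */
static bool
should_decode(locator change, locator clustered, locator toast,
			  bool has_toast)
{
	if (clustered.rel == 0)		/* no CLUSTER CONCURRENTLY in progress */
		return true;
	return loc_eq(change, clustered) ||
		(has_toast && loc_eq(change, toast));
}
```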
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 4923e35e92..4492e2ae46 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -625,6 +625,26 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
return SnapBuildMVCCFromHistoric(snap, true);
}
+/*
+ * Build an MVCC snapshot for the initial data load performed by the
+ * CLUSTER CONCURRENTLY command.
+ *
+ * The snapshot will only be used to scan one particular relation, which is
+ * treated like a catalog (therefore ->building_full_snapshot is not
+ * important), and the caller should already have a replication slot set up
+ * (so we do not set MyProc->xmin). XXX Do we need to add any restrictions?
+ */
+Snapshot
+SnapBuildInitialSnapshotForCluster(SnapBuild *builder)
+{
+ Snapshot snap;
+
+ Assert(builder->state == SNAPBUILD_CONSISTENT);
+
+ snap = SnapBuildBuildSnapshot(builder);
+ return SnapBuildMVCCFromHistoric(snap, false);
+}
+
/*
* Turn a historic MVCC snapshot into an ordinary MVCC snapshot.
*
diff --git a/src/backend/replication/pgoutput_cluster/Makefile b/src/backend/replication/pgoutput_cluster/Makefile
new file mode 100644
index 0000000000..31471bb546
--- /dev/null
+++ b/src/backend/replication/pgoutput_cluster/Makefile
@@ -0,0 +1,32 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+# Makefile for src/backend/replication/pgoutput_cluster
+#
+# IDENTIFICATION
+# src/backend/replication/pgoutput_cluster
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/replication/pgoutput_cluster
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+ $(WIN32RES) \
+ pgoutput_cluster.o
+PGFILEDESC = "pgoutput_cluster - logical replication output plugin for CLUSTER command"
+NAME = pgoutput_cluster
+
+all: all-shared-lib
+
+include $(top_srcdir)/src/Makefile.shlib
+
+install: all installdirs install-lib
+
+installdirs: installdirs-lib
+
+uninstall: uninstall-lib
+
+clean distclean: clean-lib
+ rm -f $(OBJS)
diff --git a/src/backend/replication/pgoutput_cluster/meson.build b/src/backend/replication/pgoutput_cluster/meson.build
new file mode 100644
index 0000000000..0f033064f2
--- /dev/null
+++ b/src/backend/replication/pgoutput_cluster/meson.build
@@ -0,0 +1,18 @@
+# Copyright (c) 2022-2024, PostgreSQL Global Development Group
+
+pgoutput_cluster_sources = files(
+ 'pgoutput_cluster.c',
+)
+
+if host_system == 'windows'
+ pgoutput_cluster_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pgoutput_cluster',
+ '--FILEDESC', 'pgoutput_cluster - logical replication output plugin for CLUSTER command',])
+endif
+
+pgoutput_cluster = shared_module('pgoutput_cluster',
+ pgoutput_cluster_sources,
+ kwargs: pg_mod_args,
+)
+
+backend_targets += pgoutput_cluster
diff --git a/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c b/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
new file mode 100644
index 0000000000..c6baca1171
--- /dev/null
+++ b/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
@@ -0,0 +1,277 @@
+/* TODO Move into src/backend/cluster/ (and rename?) */
+/*-------------------------------------------------------------------------
+ *
+ * pgoutput_cluster.c
+ * Logical Replication output plugin for CLUSTER command
+ *
+ * Copyright (c) 2012-2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heaptoast.h"
+#include "commands/cluster.h"
+#include "replication/snapbuild.h"
+
+PG_MODULE_MAGIC;
+
+static void plugin_startup(LogicalDecodingContext *ctx,
+ OutputPluginOptions *opt, bool is_init);
+static void plugin_shutdown(LogicalDecodingContext *ctx);
+static void plugin_begin_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn);
+static void plugin_commit_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ Relation rel, ReorderBufferChange *change);
+static void plugin_truncate(struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, int nrelations,
+ Relation relations[],
+ ReorderBufferChange *change);
+static void store_change(LogicalDecodingContext *ctx,
+ ConcurrentChangeKind kind, HeapTuple tuple);
+
+void
+_PG_output_plugin_init(OutputPluginCallbacks *cb)
+{
+ AssertVariableIsOfType(&_PG_output_plugin_init, LogicalOutputPluginInit);
+
+ cb->startup_cb = plugin_startup;
+ cb->begin_cb = plugin_begin_txn;
+ cb->change_cb = plugin_change;
+ cb->truncate_cb = plugin_truncate;
+ cb->commit_cb = plugin_commit_txn;
+ cb->shutdown_cb = plugin_shutdown;
+}
+
+
+/* initialize this plugin */
+static void
+plugin_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
+ bool is_init)
+{
+ ctx->output_plugin_private = NULL;
+
+ /* Probably unnecessary, as we don't use the SQL interface ... */
+ opt->output_type = OUTPUT_PLUGIN_BINARY_OUTPUT;
+
+ if (ctx->output_plugin_options != NIL)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("this plugin does not accept any options")));
+ }
+}
+
+static void
+plugin_shutdown(LogicalDecodingContext *ctx)
+{
+}
+
+/*
+ * As we don't release the slot during the processing of a particular
+ * table, there's no room for an SQL interface, even for debugging
+ * purposes. Therefore we need
+ * neither OutputPluginPrepareWrite() nor OutputPluginWrite() in the plugin
+ * callbacks. (Although we might want to write custom callbacks, this API
+ * seems to be unnecessarily generic for our purposes.)
+ */
+
+/* BEGIN callback */
+static void
+plugin_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+}
+
+/* COMMIT callback */
+static void
+plugin_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+}
+
+/*
+ * Callback for individual changed tuples
+ */
+static void
+plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ Relation relation, ReorderBufferChange *change)
+{
+ ClusterDecodingState *dstate;
+
+ dstate = (ClusterDecodingState *) ctx->output_writer_private;
+
+ /* Only interested in one particular relation. */
+ if (relation->rd_id != dstate->relid)
+ return;
+
+ /* Decode entry depending on its type */
+ switch (change->action)
+ {
+ case REORDER_BUFFER_CHANGE_INSERT:
+ {
+ HeapTuple newtuple;
+
+ newtuple = change->data.tp.newtuple;
+
+ /*
+ * Identity checks in the main function should have made this
+ * impossible.
+ */
+ if (newtuple == NULL)
+ elog(ERROR, "incomplete insert info");
+
+ store_change(ctx, CHANGE_INSERT, newtuple);
+ }
+ break;
+ case REORDER_BUFFER_CHANGE_UPDATE:
+ {
+ HeapTuple oldtuple,
+ newtuple;
+
+ oldtuple = change->data.tp.oldtuple;
+ newtuple = change->data.tp.newtuple;
+
+ if (newtuple == NULL)
+ elog(ERROR, "incomplete update info");
+
+ if (oldtuple != NULL)
+ store_change(ctx, CHANGE_UPDATE_OLD, oldtuple);
+
+ store_change(ctx, CHANGE_UPDATE_NEW, newtuple);
+ }
+ break;
+ case REORDER_BUFFER_CHANGE_DELETE:
+ {
+ HeapTuple oldtuple;
+
+ oldtuple = change->data.tp.oldtuple;
+
+ if (oldtuple == NULL)
+ elog(ERROR, "incomplete delete info");
+
+ store_change(ctx, CHANGE_DELETE, oldtuple);
+ }
+ break;
+ default:
+ /* Should not come here */
+ Assert(false);
+ break;
+ }
+}
+
+static void
+plugin_truncate(struct LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ int nrelations, Relation relations[],
+ ReorderBufferChange *change)
+{
+ ClusterDecodingState *dstate;
+ int i;
+ Relation relation = NULL;
+
+ dstate = (ClusterDecodingState *) ctx->output_writer_private;
+
+ /* Find the relation we are processing. */
+ for (i = 0; i < nrelations; i++)
+ {
+ relation = relations[i];
+
+ if (RelationGetRelid(relation) == dstate->relid)
+ break;
+ }
+
+ /* Is this a truncation of another relation? */
+ if (i == nrelations)
+ return;
+
+ store_change(ctx, CHANGE_TRUNCATE, NULL);
+}
+
+/* Store concurrent data change. */
+static void
+store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
+ HeapTuple tuple)
+{
+ ClusterDecodingState *dstate;
+ char *change_raw;
+ ConcurrentChange *change;
+ bool flattened = false;
+ Size size;
+ Datum values[1];
+ bool isnull[1];
+ char *dst;
+
+ dstate = (ClusterDecodingState *) ctx->output_writer_private;
+
+ size = MAXALIGN(VARHDRSZ) + sizeof(ConcurrentChange);
+
+ if (tuple)
+ {
+ /*
+ * ReorderBufferCommit() stores the TOAST chunks in its private memory
+ * context and frees them after having called
+ * apply_change(). Therefore we need a flat copy (including TOAST) that
+ * we eventually copy into the memory context which is available to
+ * decode_concurrent_changes().
+ */
+ if (HeapTupleHasExternal(tuple))
+ {
+ /*
+ * toast_flatten_tuple_to_datum() might be more convenient but we
+ * don't want the decompression it does.
+ */
+ tuple = toast_flatten_tuple(tuple, dstate->tupdesc);
+ flattened = true;
+ }
+
+ size += tuple->t_len;
+ }
+
+ /* XXX Isn't there any function / macro to do this? */
+ if (size >= 0x3FFFFFFF)
+ elog(ERROR, "change is too big");
+
+ /* Construct the change. */
+ change_raw = (char *) palloc0(size);
+ SET_VARSIZE(change_raw, size);
+ change = (ConcurrentChange *) VARDATA(change_raw);
+ change->kind = kind;
+
+ /* No other information is needed for TRUNCATE. */
+ if (change->kind == CHANGE_TRUNCATE)
+ goto store;
+
+ /*
+ * Copy the tuple.
+ *
+ * CAUTION: change->tup_data.t_data must be fixed on retrieval!
+ */
+ memcpy(&change->tup_data, tuple, sizeof(HeapTupleData));
+ dst = (char *) change + sizeof(ConcurrentChange);
+ memcpy(dst, tuple->t_data, tuple->t_len);
+
+ /* The data has been copied. */
+ if (flattened)
+ pfree(tuple);
+
+store:
+ /* Store as a tuple with a single bytea column. */
+ values[0] = PointerGetDatum(change_raw);
+ isnull[0] = false;
+ tuplestore_putvalues(dstate->tstore, dstate->tupdesc_change,
+ values, isnull);
+
+ /* Accounting. */
+ dstate->nchanges++;
+
+ /* Cleanup. */
+ pfree(change_raw);
+}
+
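store_change() above serializes each change into one contiguous buffer: a header (the ConcurrentChange) followed immediately by the raw tuple bytes, wrapped in a varlena and stored as a single bytea column of the tuplestore. The layout can be sketched in plain C with a hypothetical header type standing in for ConcurrentChange:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for ConcurrentChange: a kind tag plus payload length. */
typedef struct
{
	int		kind;			/* stand-in for ConcurrentChangeKind */
	size_t	len;			/* length of the payload that follows */
} change_hdr;

/* Serialize: header first, payload bytes immediately after, one buffer. */
static char *
pack_change(int kind, const char *data, size_t len, size_t *out_size)
{
	size_t	size = sizeof(change_hdr) + len;
	char   *buf = malloc(size);
	change_hdr hdr = {kind, len};

	memcpy(buf, &hdr, sizeof(hdr));
	memcpy(buf + sizeof(hdr), data, len);
	*out_size = size;
	return buf;
}

/*
 * Deserialize: the payload pointer must be recomputed from the buffer,
 * just as the patch warns that change->tup_data.t_data must be fixed on
 * retrieval.
 */
static const char *
unpack_change(const char *buf, int *kind, size_t *len)
{
	change_hdr	hdr;

	memcpy(&hdr, buf, sizeof(hdr));
	*kind = hdr.kind;
	*len = hdr.len;
	return buf + sizeof(change_hdr);
}
```

This is only the serialization idea; the real code additionally MAXALIGNs the varlena header and flattens external TOAST pointers before copying.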
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 6caeca3a8e..4a3c5c8fdc 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
#include "access/xlogprefetcher.h"
#include "access/xlogrecovery.h"
#include "commands/async.h"
+#include "commands/cluster.h"
#include "commands/waitlsn.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -152,6 +153,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, InjectionPointShmemSize());
size = add_size(size, SlotSyncShmemSize());
size = add_size(size, WaitLSNShmemSize());
+ size = add_size(size, ClusterShmemSize());
/* include additional requested shmem from preload libraries */
size = add_size(size, total_addin_request);
@@ -339,6 +341,7 @@ CreateOrAttachShmemStructs(void)
WaitEventCustomShmemInit();
InjectionPointShmemInit();
WaitLSNShmemInit();
+ ClusterShmemInit();
}
/*
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index b2ea8125c9..04e7571e70 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -1299,6 +1299,17 @@ ProcessUtilitySlow(ParseState *pstate,
lockmode = AlterTableGetLockLevel(atstmt->cmds);
relid = AlterTableLookupRelation(atstmt, lockmode);
+ /*
+ * If lockmode allows, check if VACUUM FULL / CLUSTER
+ * CONCURRENT is in progress. If lockmode is too weak,
+ * cluster_rel() should detect incompatible DDLs executed
+ * by us.
+ *
+ * XXX We might skip the changes for DDLs which do not
+ * change the tuple descriptor.
+ */
+ check_for_concurrent_cluster(relid, lockmode);
+
if (OidIsValid(relid))
{
AlterTableUtilityContext atcontext;
diff --git a/src/backend/utils/activity/backend_progress.c b/src/backend/utils/activity/backend_progress.c
index cc9b4cf0dc..0ba35a847e 100644
--- a/src/backend/utils/activity/backend_progress.c
+++ b/src/backend/utils/activity/backend_progress.c
@@ -163,3 +163,19 @@ pgstat_progress_end_command(void)
beentry->st_progress.command_target = InvalidOid;
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
+
+void
+pgstat_progress_restore_state(PgBackendProgress *backup)
+{
+ volatile PgBackendStatus *beentry = MyBEEntry;
+
+ if (!beentry || !pgstat_track_activities)
+ return;
+
+ PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
+ beentry->st_progress.command = backup->command;
+ beentry->st_progress.command_target = backup->command_target;
+ memcpy(MyBEEntry->st_progress.param, backup->param,
+ sizeof(beentry->st_progress.param));
+ PGSTAT_END_WRITE_ACTIVITY(beentry);
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 8efb4044d6..8b9dfe865b 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -347,6 +347,7 @@ DSMRegistry "Waiting to read or update the dynamic shared memory registry."
InjectionPoint "Waiting to read or update information related to injection points."
SerialControl "Waiting to read or update shared <filename>pg_serial</filename> state."
WaitLSN "Waiting to read or update shared Wait-for-LSN state."
+ClusteredRels "Waiting to read or update information on tables being clustered concurrently."
#
# END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 603aa4157b..5a2d5d6138 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -1373,6 +1373,28 @@ CacheInvalidateRelcache(Relation relation)
RegisterRelcacheInvalidation(databaseId, relationId);
}
+/*
+ * CacheInvalidateRelcacheImmediate
+ * Send invalidation message for the specified relation's relcache entry.
+ *
+ * Currently this is used in VACUUM FULL/CLUSTER CONCURRENTLY, to make sure
+ * that other backends are aware that the command is being executed for the
+ * relation.
+ */
+void
+CacheInvalidateRelcacheImmediate(Relation relation)
+{
+ SharedInvalidationMessage msg;
+
+ msg.rc.id = SHAREDINVALRELCACHE_ID;
+ msg.rc.dbId = MyDatabaseId;
+ msg.rc.relId = RelationGetRelid(relation);
+ /* check AddCatcacheInvalidationMessage() for an explanation */
+ VALGRIND_MAKE_MEM_DEFINED(&msg, sizeof(msg));
+
+ SendSharedInvalidMessages(&msg, 1);
+}
+
/*
* CacheInvalidateRelcacheAll
* Register invalidation of the whole relcache at the end of command.
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 63efc55f09..c160051b2f 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -64,6 +64,7 @@
#include "catalog/pg_type.h"
#include "catalog/schemapg.h"
#include "catalog/storage.h"
+#include "commands/cluster.h"
#include "commands/policy.h"
#include "commands/publicationcmds.h"
#include "commands/trigger.h"
@@ -1257,6 +1258,10 @@ retry:
/* make sure relation is marked as having no open file yet */
relation->rd_smgr = NULL;
+ /* Is CLUSTER CONCURRENTLY in progress? */
+ relation->rd_cluster_concurrent =
+ is_concurrent_cluster_in_progress(targetRelId);
+
/*
* now we can free the memory allocated for pg_class_tuple
*/
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index b8b500f48f..6be0fef84c 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -156,7 +156,6 @@ static List *exportedSnapshots = NIL;
/* Prototypes for local functions */
static void UnregisterSnapshotNoOwner(Snapshot snapshot);
-static void FreeSnapshot(Snapshot snapshot);
static void SnapshotResetXmin(void);
/* ResourceOwner callbacks to track snapshot references */
@@ -625,7 +624,7 @@ CopySnapshot(Snapshot snapshot)
* FreeSnapshot
* Free the memory associated with a snapshot.
*/
-static void
+void
FreeSnapshot(Snapshot snapshot)
{
Assert(snapshot->regd_count == 0);
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index a7ccde6d7d..57acf2a279 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2773,7 +2773,7 @@ psql_completion(const char *text, int start, int end)
* one word, so the above test is correct.
*/
if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
- COMPLETE_WITH("VERBOSE");
+ COMPLETE_WITH("VERBOSE", "CONCURRENTLY");
}
/* COMMENT */
@@ -4744,7 +4744,8 @@ psql_completion(const char *text, int start, int end)
"DISABLE_PAGE_SKIPPING", "SKIP_LOCKED",
"INDEX_CLEANUP", "PROCESS_MAIN", "PROCESS_TOAST",
"TRUNCATE", "PARALLEL", "SKIP_DATABASE_STATS",
- "ONLY_DATABASE_STATS", "BUFFER_USAGE_LIMIT");
+ "ONLY_DATABASE_STATS", "BUFFER_USAGE_LIMIT",
+ "CONCURRENTLY");
else if (TailMatches("FULL|FREEZE|ANALYZE|VERBOSE|DISABLE_PAGE_SKIPPING|SKIP_LOCKED|PROCESS_MAIN|PROCESS_TOAST|TRUNCATE|SKIP_DATABASE_STATS|ONLY_DATABASE_STATS"))
COMPLETE_WITH("ON", "OFF");
else if (TailMatches("INDEX_CLEANUP"))
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 9e9aec88a6..8687ec8796 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -405,6 +405,10 @@ extern HTSV_Result HeapTupleSatisfiesVacuumHorizon(HeapTuple htup, Buffer buffer
TransactionId *dead_after);
extern void HeapTupleSetHintBits(HeapTupleHeader tuple, Buffer buffer,
uint16 infomask, TransactionId xid);
+extern bool HeapTupleMVCCInserted(HeapTuple htup, Snapshot snapshot,
+ Buffer buffer);
+extern bool HeapTupleMVCCNotDeleted(HeapTuple htup, Snapshot snapshot,
+ Buffer buffer);
extern bool HeapTupleHeaderIsOnlyLocked(HeapTupleHeader tuple);
extern bool HeapTupleIsSurelyDead(HeapTuple htup,
struct GlobalVisState *vistest);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index da661289c1..1380ba81fc 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,6 +21,7 @@
#include "access/sdir.h"
#include "access/xact.h"
#include "executor/tuptable.h"
+#include "replication/logical.h"
#include "storage/read_stream.h"
#include "utils/rel.h"
#include "utils/snapshot.h"
@@ -630,6 +631,8 @@ typedef struct TableAmRoutine
Relation OldIndex,
bool use_sort,
TransactionId OldestXmin,
+ Snapshot snapshot,
+ LogicalDecodingContext *decoding_ctx,
TransactionId *xid_cutoff,
MultiXactId *multi_cutoff,
double *num_tuples,
@@ -1667,6 +1670,10 @@ table_relation_copy_data(Relation rel, const RelFileLocator *newrlocator)
* not needed for the relation's AM
* - *xid_cutoff - ditto
* - *multi_cutoff - ditto
+ * - snapshot - if != NULL, ignore data changes done by transactions that this
+ * (MVCC) snapshot considers still in-progress or in the future.
+ * - decoding_ctx - logical decoding context, to capture concurrent data
+ * changes.
*
* Output parameters:
* - *xid_cutoff - rel's new relfrozenxid value, may be invalid
@@ -1679,6 +1686,8 @@ table_relation_copy_for_cluster(Relation OldTable, Relation NewTable,
Relation OldIndex,
bool use_sort,
TransactionId OldestXmin,
+ Snapshot snapshot,
+ LogicalDecodingContext *decoding_ctx,
TransactionId *xid_cutoff,
MultiXactId *multi_cutoff,
double *num_tuples,
@@ -1687,6 +1696,7 @@ table_relation_copy_for_cluster(Relation OldTable, Relation NewTable,
{
OldTable->rd_tableam->relation_copy_for_cluster(OldTable, NewTable, OldIndex,
use_sort, OldestXmin,
+ snapshot, decoding_ctx,
xid_cutoff, multi_cutoff,
num_tuples, tups_vacuumed,
tups_recently_dead);
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 7d434f8e65..77d522561b 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -99,6 +99,9 @@ extern Oid index_concurrently_create_copy(Relation heapRelation,
Oid tablespaceOid,
const char *newName);
+extern NullableDatum *get_index_stattargets(Oid indexid,
+ IndexInfo *indInfo);
+
extern void index_concurrently_build(Oid heapRelationId,
Oid indexRelationId);
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 7492796ea2..959899a7cc 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -13,10 +13,15 @@
#ifndef CLUSTER_H
#define CLUSTER_H
+#include "nodes/execnodes.h"
#include "nodes/parsenodes.h"
#include "parser/parse_node.h"
+#include "replication/logical.h"
#include "storage/lock.h"
+#include "storage/relfilelocator.h"
#include "utils/relcache.h"
+#include "utils/resowner.h"
+#include "utils/tuplestore.h"
/* flag bits for ClusterParams->options */
@@ -24,6 +29,7 @@
#define CLUOPT_RECHECK 0x02 /* recheck relation state */
#define CLUOPT_RECHECK_ISCLUSTERED 0x04 /* recheck relation state for
* indisclustered */
+#define CLUOPT_CONCURRENT 0x08 /* allow concurrent data changes */
/* options for CLUSTER */
typedef struct ClusterParams
@@ -31,12 +37,101 @@ typedef struct ClusterParams
bits32 options; /* bitmask of CLUOPT_* */
} ClusterParams;
+/*
+ * The following definitions are used for concurrent processing.
+ */
+
+extern RelFileLocator clustered_rel_locator;
+extern RelFileLocator clustered_rel_toast_locator;
+
+/*
+ * Lock level for the concurrent variant of CLUSTER / VACUUM FULL.
+ *
+ * Like for lazy VACUUM, we choose the strongest lock that still allows
+ * INSERT, UPDATE and DELETE.
+ *
+ * Note that the lock needs to be released temporarily a few times during the
+ * processing. In such cases it should be checked after re-locking that the
+ * relation / index hasn't changed in the system catalog while the lock was
+ * not held.
+ */
+#define LOCK_CLUSTER_CONCURRENT ShareUpdateExclusiveLock
+
+typedef enum
+{
+ CHANGE_INSERT,
+ CHANGE_UPDATE_OLD,
+ CHANGE_UPDATE_NEW,
+ CHANGE_DELETE,
+ CHANGE_TRUNCATE
+} ConcurrentChangeKind;
+
+typedef struct ConcurrentChange
+{
+ /* See the enum above. */
+ ConcurrentChangeKind kind;
+
+ /*
+ * The actual tuple.
+ *
+ * The tuple data follows the ConcurrentChange structure. Before use, make
+ * sure the tuple is correctly aligned (ConcurrentChange can be stored as
+ * bytea) and that tup_data.t_data is fixed up.
+ */
+ HeapTupleData tup_data;
+} ConcurrentChange;
+
+/*
+ * Logical decoding state.
+ *
+ * Here we store the data changes that we decode from WAL while the table
+ * contents are being copied to the new storage, along with the metadata
+ * needed to apply those changes to the table.
+ */
+typedef struct ClusterDecodingState
+{
+ /* The relation whose changes we're decoding. */
+ Oid relid;
+
+ /*
+ * Decoded changes are stored here. Although we try to avoid excessively
+ * large batches, the changes may need to be spilled to disk. The
+ * tuplestore does this transparently.
+ */
+ Tuplestorestate *tstore;
+
+ /* The current number of changes in tstore. */
+ double nchanges;
+
+ /*
+ * Descriptor to store the ConcurrentChange structure serialized as bytea.
+ * We can't store the tuple directly because tuplestore only supports
+ * minimal tuples and we may need to transfer the OID system column from
+ * the output plugin. Also we need to transfer the change kind, so it's
+ * better to put everything in one structure than to use two tuplestores
+ * "in parallel".
+ */
+ TupleDesc tupdesc_change;
+
+ /* Tuple descriptor needed to update indexes. */
+ TupleDesc tupdesc;
+
+ /* Slot to retrieve data from tstore. */
+ TupleTableSlot *tsslot;
+
+ ResourceOwner resowner;
+} ClusterDecodingState;
+
extern void cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel);
-extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params);
+extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
+ bool isTopLevel);
extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid,
LOCKMODE lockmode);
extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
-
+extern bool check_relation_is_clusterable_concurrently(Relation rel, int elevel,
+ const char *stmt);
+extern void cluster_decode_concurrent_changes(LogicalDecodingContext *ctx,
+ XLogRecPtr end_of_wal);
extern Oid make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
char relpersistence, LOCKMODE lockmode_old,
LOCKMODE *lockmode_new_p);
@@ -45,8 +140,13 @@ extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
bool swap_toast_by_content,
bool check_constraints,
bool is_internal,
+ bool reindex,
TransactionId frozenXid,
MultiXactId cutoffMulti,
char newrelpersistence);
+extern Size ClusterShmemSize(void);
+extern void ClusterShmemInit(void);
+extern bool is_concurrent_cluster_in_progress(Oid relid);
+extern void check_for_concurrent_cluster(Oid relid, LOCKMODE lockmode);
#endif /* CLUSTER_H */
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 5616d64523..03e3712ede 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -59,19 +59,22 @@
#define PROGRESS_CLUSTER_PHASE 1
#define PROGRESS_CLUSTER_INDEX_RELID 2
#define PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED 3
-#define PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN 4
-#define PROGRESS_CLUSTER_TOTAL_HEAP_BLKS 5
-#define PROGRESS_CLUSTER_HEAP_BLKS_SCANNED 6
-#define PROGRESS_CLUSTER_INDEX_REBUILD_COUNT 7
+#define PROGRESS_CLUSTER_HEAP_TUPLES_INSERTED 4
+#define PROGRESS_CLUSTER_HEAP_TUPLES_UPDATED 5
+#define PROGRESS_CLUSTER_HEAP_TUPLES_DELETED 6
+#define PROGRESS_CLUSTER_TOTAL_HEAP_BLKS 7
+#define PROGRESS_CLUSTER_HEAP_BLKS_SCANNED 8
+#define PROGRESS_CLUSTER_INDEX_REBUILD_COUNT 9
/* Phases of cluster (as advertised via PROGRESS_CLUSTER_PHASE) */
#define PROGRESS_CLUSTER_PHASE_SEQ_SCAN_HEAP 1
#define PROGRESS_CLUSTER_PHASE_INDEX_SCAN_HEAP 2
#define PROGRESS_CLUSTER_PHASE_SORT_TUPLES 3
#define PROGRESS_CLUSTER_PHASE_WRITE_NEW_HEAP 4
-#define PROGRESS_CLUSTER_PHASE_SWAP_REL_FILES 5
-#define PROGRESS_CLUSTER_PHASE_REBUILD_INDEX 6
-#define PROGRESS_CLUSTER_PHASE_FINAL_CLEANUP 7
+#define PROGRESS_CLUSTER_PHASE_CATCH_UP 5
+#define PROGRESS_CLUSTER_PHASE_SWAP_REL_FILES 6
+#define PROGRESS_CLUSTER_PHASE_REBUILD_INDEX 7
+#define PROGRESS_CLUSTER_PHASE_FINAL_CLEANUP 8
/* Commands of PROGRESS_CLUSTER */
#define PROGRESS_CLUSTER_COMMAND_CLUSTER 1
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 759f9a87d3..2f693e0fc0 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -181,13 +181,16 @@ typedef struct VacAttrStats
#define VACOPT_ANALYZE 0x02 /* do ANALYZE */
#define VACOPT_VERBOSE 0x04 /* output INFO instrumentation messages */
#define VACOPT_FREEZE 0x08 /* FREEZE option */
-#define VACOPT_FULL 0x10 /* FULL (non-concurrent) vacuum */
-#define VACOPT_SKIP_LOCKED 0x20 /* skip if cannot get lock */
-#define VACOPT_PROCESS_MAIN 0x40 /* process main relation */
-#define VACOPT_PROCESS_TOAST 0x80 /* process the TOAST table, if any */
-#define VACOPT_DISABLE_PAGE_SKIPPING 0x100 /* don't skip any pages */
-#define VACOPT_SKIP_DATABASE_STATS 0x200 /* skip vac_update_datfrozenxid() */
-#define VACOPT_ONLY_DATABASE_STATS 0x400 /* only vac_update_datfrozenxid() */
+#define VACOPT_FULL_EXCLUSIVE 0x10 /* FULL (non-concurrent) vacuum */
+#define VACOPT_FULL_CONCURRENT 0x20 /* FULL (concurrent) vacuum */
+#define VACOPT_SKIP_LOCKED 0x40 /* skip if cannot get lock */
+#define VACOPT_PROCESS_MAIN 0x80 /* process main relation */
+#define VACOPT_PROCESS_TOAST 0x100 /* process the TOAST table, if any */
+#define VACOPT_DISABLE_PAGE_SKIPPING 0x200 /* don't skip any pages */
+#define VACOPT_SKIP_DATABASE_STATS 0x400 /* skip vac_update_datfrozenxid() */
+#define VACOPT_ONLY_DATABASE_STATS 0x800 /* only vac_update_datfrozenxid() */
+
+#define VACOPT_FULL (VACOPT_FULL_EXCLUSIVE | VACOPT_FULL_CONCURRENT)
/*
* Values used by index_cleanup and truncate params.
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index ad06e80784..b38eb0d530 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -69,6 +69,7 @@ extern void FreeSnapshotBuilder(SnapBuild *builder);
extern void SnapBuildSnapDecRefcount(Snapshot snap);
extern Snapshot SnapBuildInitialSnapshot(SnapBuild *builder);
+extern Snapshot SnapBuildInitialSnapshotForCluster(SnapBuild *builder);
extern Snapshot SnapBuildMVCCFromHistoric(Snapshot snapshot, bool in_place);
extern const char *SnapBuildExportSnapshot(SnapBuild *builder);
extern void SnapBuildClearExportedSnapshot(void);
diff --git a/src/include/storage/lockdefs.h b/src/include/storage/lockdefs.h
index 934ba84f6a..cac3d7f8c7 100644
--- a/src/include/storage/lockdefs.h
+++ b/src/include/storage/lockdefs.h
@@ -36,7 +36,7 @@ typedef int LOCKMODE;
#define AccessShareLock 1 /* SELECT */
#define RowShareLock 2 /* SELECT FOR UPDATE/FOR SHARE */
#define RowExclusiveLock 3 /* INSERT, UPDATE, DELETE */
-#define ShareUpdateExclusiveLock 4 /* VACUUM (non-FULL), ANALYZE, CREATE
+#define ShareUpdateExclusiveLock 4 /* VACUUM (non-exclusive), ANALYZE, CREATE
* INDEX CONCURRENTLY */
#define ShareLock 5 /* CREATE INDEX (WITHOUT CONCURRENTLY) */
#define ShareRowExclusiveLock 6 /* like EXCLUSIVE MODE, but allows ROW
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 88dc79b2bd..cff17a6bd0 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -84,3 +84,4 @@ PG_LWLOCK(50, DSMRegistry)
PG_LWLOCK(51, InjectionPoint)
PG_LWLOCK(52, SerialControl)
PG_LWLOCK(53, WaitLSN)
+PG_LWLOCK(54, ClusteredRels)
diff --git a/src/include/utils/backend_progress.h b/src/include/utils/backend_progress.h
index e09598eafc..5ab5df9d41 100644
--- a/src/include/utils/backend_progress.h
+++ b/src/include/utils/backend_progress.h
@@ -35,7 +35,7 @@ typedef enum ProgressCommandType
/*
* Any command which wishes can advertise that it is running by setting
- * command, command_target, and param[]. command_target should be the OID of
+ * command, command_target, and param[]. command_target should be the OID of
* the relation which the command targets (we assume there's just one, as this
* is meant for utility commands), but the meaning of each element in the
* param array is command-specific.
@@ -55,6 +55,7 @@ extern void pgstat_progress_parallel_incr_param(int index, int64 incr);
extern void pgstat_progress_update_multi_param(int nparam, const int *index,
const int64 *val);
extern void pgstat_progress_end_command(void);
+extern void pgstat_progress_restore_state(PgBackendProgress *backup);
#endif /* BACKEND_PROGRESS_H */
diff --git a/src/include/utils/inval.h b/src/include/utils/inval.h
index 24695facf2..4acf9d0ed9 100644
--- a/src/include/utils/inval.h
+++ b/src/include/utils/inval.h
@@ -42,6 +42,8 @@ extern void CacheInvalidateCatalog(Oid catalogId);
extern void CacheInvalidateRelcache(Relation relation);
+extern void CacheInvalidateRelcacheImmediate(Relation relation);
+
extern void CacheInvalidateRelcacheAll(void);
extern void CacheInvalidateRelcacheByTuple(HeapTuple classTuple);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 8700204953..adda46c985 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -253,6 +253,9 @@ typedef struct RelationData
bool pgstat_enabled; /* should relation stats be counted */
/* use "struct" here to avoid needing to include pgstat.h: */
struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+ /* Is CLUSTER CONCURRENTLY being performed on this relation? */
+ bool rd_cluster_concurrent;
} RelationData;
@@ -684,7 +687,9 @@ RelationCloseSmgr(Relation relation)
#define RelationIsAccessibleInLogicalDecoding(relation) \
(XLogLogicalInfoActive() && \
RelationNeedsWAL(relation) && \
- (IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation)))
+ (IsCatalogRelation(relation) || \
+ RelationIsUsedAsCatalogTable(relation) || \
+ (relation)->rd_cluster_concurrent))
/*
* RelationIsLogicallyLogged
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index e7ac89f484..f58c9108fc 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -69,6 +69,8 @@ extern void SnapshotSetCommandId(CommandId curcid);
extern Snapshot GetOldestSnapshot(void);
extern Snapshot CopySnapshot(Snapshot snapshot);
+extern void FreeSnapshot(Snapshot snapshot);
+
extern Snapshot GetCatalogSnapshot(Oid relid);
extern Snapshot GetNonHistoricCatalogSnapshot(Oid relid);
extern void InvalidateCatalogSnapshot(void);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index a1626f3fae..9a43db2722 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1958,17 +1958,20 @@ pg_stat_progress_cluster| SELECT s.pid,
WHEN 2 THEN 'index scanning heap'::text
WHEN 3 THEN 'sorting tuples'::text
WHEN 4 THEN 'writing new heap'::text
- WHEN 5 THEN 'swapping relation files'::text
- WHEN 6 THEN 'rebuilding index'::text
- WHEN 7 THEN 'performing final cleanup'::text
+ WHEN 5 THEN 'catch-up'::text
+ WHEN 6 THEN 'swapping relation files'::text
+ WHEN 7 THEN 'rebuilding index'::text
+ WHEN 8 THEN 'performing final cleanup'::text
ELSE NULL::text
END AS phase,
(s.param3)::oid AS cluster_index_relid,
s.param4 AS heap_tuples_scanned,
- s.param5 AS heap_tuples_written,
- s.param6 AS heap_blks_total,
- s.param7 AS heap_blks_scanned,
- s.param8 AS index_rebuild_count
+ s.param5 AS heap_tuples_inserted,
+ s.param6 AS heap_tuples_updated,
+ s.param7 AS heap_tuples_deleted,
+ s.param8 AS heap_blks_total,
+ s.param9 AS heap_blks_scanned,
+ s.param10 AS index_rebuild_count
FROM (pg_stat_get_progress_info('CLUSTER'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_progress_copy| SELECT s.pid,
--
2.45.2
Attachment: v03-0005-Preserve-visibility-information-of-the-concurrent-da.patch (text/x-diff)
From 9597a99562ef05a24794dbe9262c29b4a469f6e2 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Wed, 4 Sep 2024 12:29:28 +0200
Subject: [PATCH 5/8] Preserve visibility information of the concurrent data
changes.
As explained in the commit message of the preceding patch of the series, the
data changes done by applications while VACUUM FULL / CLUSTER CONCURRENTLY is
copying the table contents to a new file are decoded from WAL and eventually
also applied to the new file. To reduce the complexity a little bit, the
preceding patch uses the current transaction (i.e. the transaction opened by the
VACUUM FULL / CLUSTER command) to execute those INSERT, UPDATE and DELETE
commands.
However, neither VACUUM nor CLUSTER is expected to change the visibility of
tuples. Therefore, this patch fixes the handling of the "concurrent data
changes". Now the tuples written into the new table storage have the same XID
and command ID (CID) as they had in the old storage.
A related change is that the data changes (INSERT, UPDATE, DELETE) we
"replay" on the new storage are not themselves logically decoded. First, the
logical decoding subsystem does not expect an already committed transaction to
be decoded again. Second, repeated decoding would just be wasted effort.
---
src/backend/access/common/toast_internals.c | 3 +-
src/backend/access/heap/heapam.c | 73 ++++++++----
src/backend/access/heap/heapam_handler.c | 14 ++-
src/backend/access/transam/xact.c | 52 ++++++++
src/backend/commands/cluster.c | 111 ++++++++++++++++--
src/backend/replication/logical/decode.c | 76 ++++++++++--
src/backend/replication/logical/snapbuild.c | 22 ++--
.../pgoutput_cluster/pgoutput_cluster.c | 68 +++++++++--
src/include/access/heapam.h | 15 ++-
src/include/access/heapam_xlog.h | 2 +
src/include/access/xact.h | 2 +
src/include/commands/cluster.h | 18 +++
12 files changed, 386 insertions(+), 70 deletions(-)
diff --git a/src/backend/access/common/toast_internals.c b/src/backend/access/common/toast_internals.c
index 90d0654e62..183055647b 100644
--- a/src/backend/access/common/toast_internals.c
+++ b/src/backend/access/common/toast_internals.c
@@ -320,7 +320,8 @@ toast_save_datum(Relation rel, Datum value,
memcpy(VARDATA(&chunk_data), data_p, chunk_size);
toasttup = heap_form_tuple(toasttupDesc, t_values, t_isnull);
- heap_insert(toastrel, toasttup, mycid, options, NULL);
+ heap_insert(toastrel, toasttup, GetCurrentTransactionId(), mycid,
+ options, NULL);
/*
* Create the index entry. We cheat a little here by not using
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 1fdcc0abee..69bf4d1c8d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -75,7 +75,8 @@ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
HeapTuple newtup, HeapTuple old_key_tuple,
- bool all_visible_cleared, bool new_all_visible_cleared);
+ bool all_visible_cleared, bool new_all_visible_cleared,
+ bool wal_logical);
static Bitmapset *HeapDetermineColumnsInfo(Relation relation,
Bitmapset *interesting_cols,
Bitmapset *external_cols,
@@ -1975,7 +1976,7 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
/*
* heap_insert - insert tuple into a heap
*
- * The new tuple is stamped with current transaction ID and the specified
+ * The new tuple is stamped with the specified transaction ID and the specified
* command ID.
*
* See table_tuple_insert for comments about most of the input flags, except
@@ -1991,15 +1992,16 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
* reflected into *tup.
*/
void
-heap_insert(Relation relation, HeapTuple tup, CommandId cid,
- int options, BulkInsertState bistate)
+heap_insert(Relation relation, HeapTuple tup, TransactionId xid,
+ CommandId cid, int options, BulkInsertState bistate)
{
- TransactionId xid = GetCurrentTransactionId();
HeapTuple heaptup;
Buffer buffer;
Buffer vmbuffer = InvalidBuffer;
bool all_visible_cleared = false;
+ Assert(TransactionIdIsValid(xid));
+
/* Cheap, simplistic check that the tuple matches the rel's rowtype. */
Assert(HeapTupleHeaderGetNatts(tup->t_data) <=
RelationGetNumberOfAttributes(relation));
@@ -2630,7 +2632,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
void
simple_heap_insert(Relation relation, HeapTuple tup)
{
- heap_insert(relation, tup, GetCurrentCommandId(true), 0, NULL);
+ heap_insert(relation, tup, GetCurrentTransactionId(),
+ GetCurrentCommandId(true), 0, NULL);
}
/*
@@ -2687,11 +2690,11 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
*/
TM_Result
heap_delete(Relation relation, ItemPointer tid,
- CommandId cid, Snapshot crosscheck, bool wait,
- TM_FailureData *tmfd, bool changingPart)
+ TransactionId xid, CommandId cid, Snapshot crosscheck, bool wait,
+ TM_FailureData *tmfd, bool changingPart,
+ bool wal_logical)
{
TM_Result result;
- TransactionId xid = GetCurrentTransactionId();
ItemId lp;
HeapTupleData tp;
Page page;
@@ -2708,6 +2711,7 @@ heap_delete(Relation relation, ItemPointer tid,
bool old_key_copied = false;
Assert(ItemPointerIsValid(tid));
+ Assert(TransactionIdIsValid(xid));
/*
* Forbid this during a parallel operation, lest it allocate a combo CID.
@@ -2933,7 +2937,8 @@ l1:
* Compute replica identity tuple before entering the critical section so
* we don't PANIC upon a memory allocation failure.
*/
- old_key_tuple = ExtractReplicaIdentity(relation, &tp, true, &old_key_copied);
+ old_key_tuple = wal_logical ?
+ ExtractReplicaIdentity(relation, &tp, true, &old_key_copied) : NULL;
/*
* If this is the first possibly-multixact-able operation in the current
@@ -3001,8 +3006,12 @@ l1:
/*
* For logical decode we need combo CIDs to properly decode the
* catalog
+ *
+ * Like in heap_insert(), visibility is unchanged when called from
+ * VACUUM FULL / CLUSTER.
*/
- if (RelationIsAccessibleInLogicalDecoding(relation))
+ if (wal_logical &&
+ RelationIsAccessibleInLogicalDecoding(relation))
log_heap_new_cid(relation, &tp);
xlrec.flags = 0;
@@ -3023,6 +3032,15 @@ l1:
xlrec.flags |= XLH_DELETE_CONTAINS_OLD_KEY;
}
+ /*
+ * Unlike UPDATE, DELETE is decoded even if there is no old key, so it
+ * does not help to clear both XLH_DELETE_CONTAINS_OLD_TUPLE and
+ * XLH_DELETE_CONTAINS_OLD_KEY. Thus we need an extra flag. TODO
+ * Consider not decoding tuples without the old tuple/key instead.
+ */
+ if (!wal_logical)
+ xlrec.flags |= XLH_DELETE_NO_LOGICAL;
+
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapDelete);
@@ -3112,10 +3130,11 @@ simple_heap_delete(Relation relation, ItemPointer tid)
TM_Result result;
TM_FailureData tmfd;
- result = heap_delete(relation, tid,
+ result = heap_delete(relation, tid, GetCurrentTransactionId(),
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &tmfd, false /* changingPart */ );
+ &tmfd, false, /* changingPart */
+ true /* wal_logical */);
switch (result)
{
case TM_SelfModified:
@@ -3154,12 +3173,11 @@ simple_heap_delete(Relation relation, ItemPointer tid)
*/
TM_Result
heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
- CommandId cid, Snapshot crosscheck, bool wait,
- TM_FailureData *tmfd, LockTupleMode *lockmode,
- TU_UpdateIndexes *update_indexes)
+ TransactionId xid, CommandId cid, Snapshot crosscheck,
+ bool wait, TM_FailureData *tmfd, LockTupleMode *lockmode,
+ TU_UpdateIndexes *update_indexes, bool wal_logical)
{
TM_Result result;
- TransactionId xid = GetCurrentTransactionId();
Bitmapset *hot_attrs;
Bitmapset *sum_attrs;
Bitmapset *key_attrs;
@@ -3199,6 +3217,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
infomask2_new_tuple;
Assert(ItemPointerIsValid(otid));
+ Assert(TransactionIdIsValid(xid));
/* Cheap, simplistic check that the tuple matches the rel's rowtype. */
Assert(HeapTupleHeaderGetNatts(newtup->t_data) <=
@@ -3987,8 +4006,12 @@ l2:
/*
* For logical decoding we need combo CIDs to properly decode the
* catalog.
+ *
+ * Like in heap_insert(), visibility is unchanged when called from
+ * VACUUM FULL / CLUSTER.
*/
- if (RelationIsAccessibleInLogicalDecoding(relation))
+ if (wal_logical &&
+ RelationIsAccessibleInLogicalDecoding(relation))
{
log_heap_new_cid(relation, &oldtup);
log_heap_new_cid(relation, heaptup);
@@ -3998,7 +4021,8 @@ l2:
newbuf, &oldtup, heaptup,
old_key_tuple,
all_visible_cleared,
- all_visible_cleared_new);
+ all_visible_cleared_new,
+ wal_logical);
if (newbuf != buffer)
{
PageSetLSN(BufferGetPage(newbuf), recptr);
@@ -4231,10 +4255,10 @@ simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup,
TM_FailureData tmfd;
LockTupleMode lockmode;
- result = heap_update(relation, otid, tup,
+ result = heap_update(relation, otid, tup, GetCurrentTransactionId(),
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &tmfd, &lockmode, update_indexes);
+ &tmfd, &lockmode, update_indexes, true);
switch (result)
{
case TM_SelfModified:
@@ -8363,7 +8387,8 @@ static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup, HeapTuple newtup,
HeapTuple old_key_tuple,
- bool all_visible_cleared, bool new_all_visible_cleared)
+ bool all_visible_cleared, bool new_all_visible_cleared,
+ bool wal_logical)
{
xl_heap_update xlrec;
xl_heap_header xlhdr;
@@ -8374,10 +8399,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
suffixlen = 0;
XLogRecPtr recptr;
Page page = BufferGetPage(newbuf);
- bool need_tuple_data = RelationIsLogicallyLogged(reln);
+ bool need_tuple_data;
bool init;
int bufflags;
+ need_tuple_data = RelationIsLogicallyLogged(reln) && wal_logical;
+
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 4ddb1c4a0c..a8999a3e72 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -252,7 +252,8 @@ heapam_tuple_insert(Relation relation, TupleTableSlot *slot, CommandId cid,
tuple->t_tableOid = slot->tts_tableOid;
/* Perform the insertion, and copy the resulting ItemPointer */
- heap_insert(relation, tuple, cid, options, bistate);
+ heap_insert(relation, tuple, GetCurrentTransactionId(), cid, options,
+ bistate);
ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
if (shouldFree)
@@ -275,7 +276,8 @@ heapam_tuple_insert_speculative(Relation relation, TupleTableSlot *slot,
options |= HEAP_INSERT_SPECULATIVE;
/* Perform the insertion, and copy the resulting ItemPointer */
- heap_insert(relation, tuple, cid, options, bistate);
+ heap_insert(relation, tuple, GetCurrentTransactionId(), cid, options,
+ bistate);
ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
if (shouldFree)
@@ -309,7 +311,8 @@ heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
* the storage itself is cleaning the dead tuples by itself, it is the
* time to call the index tuple deletion also.
*/
- return heap_delete(relation, tid, cid, crosscheck, wait, tmfd, changingPart);
+ return heap_delete(relation, tid, GetCurrentTransactionId(), cid,
+ crosscheck, wait, tmfd, changingPart, true);
}
@@ -327,8 +330,9 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
slot->tts_tableOid = RelationGetRelid(relation);
tuple->t_tableOid = slot->tts_tableOid;
- result = heap_update(relation, otid, tuple, cid, crosscheck, wait,
- tmfd, lockmode, update_indexes);
+ result = heap_update(relation, otid, tuple, GetCurrentTransactionId(),
+ cid, crosscheck, wait,
+ tmfd, lockmode, update_indexes, true);
ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
/*
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 87700c7c5c..159d2c7983 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -125,6 +125,18 @@ static FullTransactionId XactTopFullTransactionId = {InvalidTransactionId};
static int nParallelCurrentXids = 0;
static TransactionId *ParallelCurrentXids;
+/*
+ * Another case that requires TransactionIdIsCurrentTransactionId() to behave
+ * specially is when CLUSTER CONCURRENTLY is processing data changes made in
+ * the old storage of a table by other transactions. When applying the changes
+ * to the new storage, the backend executing the CLUSTER command needs to act
+ * on behalf of those other transactions. The transactions responsible for the
+ * changes in the old storage are stored in this array, sorted by
+ * xidComparator.
+ */
+static int nClusterCurrentXids = 0;
+static TransactionId *ClusterCurrentXids = NULL;
+
/*
* Miscellaneous flag bits to record events which occur on the top level
* transaction. These flags are only persisted in MyXactFlags and are intended
@@ -971,6 +983,8 @@ TransactionIdIsCurrentTransactionId(TransactionId xid)
int low,
high;
+ Assert(nClusterCurrentXids == 0);
+
low = 0;
high = nParallelCurrentXids - 1;
while (low <= high)
@@ -990,6 +1004,21 @@ TransactionIdIsCurrentTransactionId(TransactionId xid)
return false;
}
+ /*
+ * When executing CLUSTER CONCURRENTLY, the array of current transactions
+ * is given.
+ */
+ if (nClusterCurrentXids > 0)
+ {
+ Assert(nParallelCurrentXids == 0);
+
+ return bsearch(&xid,
+ ClusterCurrentXids,
+ nClusterCurrentXids,
+ sizeof(TransactionId),
+ xidComparator) != NULL;
+ }
+
/*
* We will return true for the Xid of the current subtransaction, any of
* its subcommitted children, any of its parents, or any of their
@@ -5627,6 +5656,29 @@ EndParallelWorkerTransaction(void)
CurrentTransactionState->blockState = TBLOCK_DEFAULT;
}
+/*
+ * SetClusterCurrentXids
+ * Set the XID array that TransactionIdIsCurrentTransactionId() should
+ * use.
+ */
+void
+SetClusterCurrentXids(TransactionId *xip, int xcnt)
+{
+ ClusterCurrentXids = xip;
+ nClusterCurrentXids = xcnt;
+}
+
+/*
+ * ResetClusterCurrentXids
+ * Undo the effect of SetClusterCurrentXids().
+ */
+void
+ResetClusterCurrentXids(void)
+{
+ ClusterCurrentXids = NULL;
+ nClusterCurrentXids = 0;
+}
+
/*
* ShowTransactionState
* Debug support
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 77511109ce..34cb588a1e 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -202,6 +202,7 @@ static void apply_concurrent_delete(Relation rel, HeapTuple tup_target,
ConcurrentChange *change);
static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
HeapTuple tup_key,
+ Snapshot snapshot,
IndexInsertState *iistate,
TupleTableSlot *ident_slot,
IndexScanDesc *scan_p);
@@ -2995,6 +2996,9 @@ setup_logical_decoding(Oid relid, const char *slotname, TupleDesc tupdesc)
dstate->relid = relid;
dstate->tstore = tuplestore_begin_heap(false, false,
maintenance_work_mem);
+#ifdef USE_ASSERT_CHECKING
+ dstate->last_change_xid = InvalidTransactionId;
+#endif
dstate->tupdesc = tupdesc;
/* Initialize the descriptor to store the changes ... */
@@ -3147,6 +3151,7 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
tup_exist;
char *change_raw;
ConcurrentChange *change;
+ Snapshot snapshot;
bool isnull[1];
Datum values[1];
@@ -3215,8 +3220,30 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
/*
* Find the tuple to be updated or deleted.
+ *
+ * As the table being CLUSTERed concurrently is considered a
+ * "user catalog", the new CID is WAL-logged and decoded. And since we
+ * use the same XID that the original DMLs did, the snapshot used
+ * for the logical decoding (by now converted to a non-historic
+ * MVCC snapshot) should see the tuples inserted previously into
+ * the new heap and/or updated there.
+ */
+ snapshot = change->snapshot;
+
+ /*
+ * Set what should be considered current transaction (and
+ * subtransactions) during visibility check.
+ *
+ * Note that this snapshot was created from a historic snapshot
+ * using SnapBuildMVCCFromHistoric(), which does not touch
+ * 'subxip'. Thus, unlike in a regular MVCC snapshot, the array
+ * only contains the transactions whose data changes we are
+ * applying, and their subtransactions. That's exactly what we need
+ * to check whether a particular xact is a "current transaction".
*/
- tup_exist = find_target_tuple(rel, key, nkeys, tup_key,
+ SetClusterCurrentXids(snapshot->subxip, snapshot->subxcnt);
+
+ tup_exist = find_target_tuple(rel, key, nkeys, tup_key, snapshot,
iistate, ident_slot, &ind_scan);
if (tup_exist == NULL)
elog(ERROR, "Failed to find target tuple");
@@ -3227,6 +3254,8 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
else
apply_concurrent_delete(rel, tup_exist, change);
+ ResetClusterCurrentXids();
+
if (tup_old != NULL)
{
pfree(tup_old);
@@ -3239,11 +3268,14 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
else
elog(ERROR, "Unrecognized kind of change: %d", change->kind);
- /* If there's any change, make it visible to the next iteration. */
- if (change->kind != CHANGE_UPDATE_OLD)
+ /* Free the snapshot if this is the last change that needed it. */
+ Assert(change->snapshot->active_count > 0);
+ change->snapshot->active_count--;
+ if (change->snapshot->active_count == 0)
{
- CommandCounterIncrement();
- UpdateActiveSnapshotCommandId();
+ if (change->snapshot == dstate->snapshot)
+ dstate->snapshot = NULL;
+ FreeSnapshot(change->snapshot);
}
/* TTSOpsMinimalTuple has .get_heap_tuple==NULL. */
@@ -3263,10 +3295,30 @@ static void
apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
IndexInsertState *iistate, TupleTableSlot *index_slot)
{
+ Snapshot snapshot = change->snapshot;
List *recheck;
+ /*
+ * For INSERT, the visibility information is not important, but we use the
+ * snapshot to get CID. Index functions might need the whole snapshot
+ * anyway.
+ */
+ SetClusterCurrentXids(snapshot->subxip, snapshot->subxcnt);
- heap_insert(rel, tup, GetCurrentCommandId(true), HEAP_INSERT_NO_LOGICAL, NULL);
+ /*
+ * Write the tuple into the new heap.
+ *
+ * The snapshot is the one we used to decode the insert (though converted
+ * to "non-historic" MVCC snapshot), i.e. the snapshot's curcid is the
+ * tuple CID incremented by one (due to the "new CID" WAL record that got
+ * written along with the INSERT record). Thus if we want to use the
+ * original CID, we need to subtract 1 from curcid.
+ */
+ Assert(snapshot->curcid != InvalidCommandId &&
+ snapshot->curcid > FirstCommandId);
+
+ heap_insert(rel, tup, change->xid, snapshot->curcid - 1,
+ HEAP_INSERT_NO_LOGICAL, NULL);
/*
* Update indexes.
@@ -3274,6 +3326,7 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
* In case functions in the index need the active snapshot and caller
* hasn't set one.
*/
+ PushActiveSnapshot(snapshot);
ExecStoreHeapTuple(tup, index_slot, false);
recheck = ExecInsertIndexTuples(iistate->rri,
index_slot,
@@ -3284,6 +3337,8 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
NIL, /* arbiterIndexes */
false /* onlySummarizing */
);
+ PopActiveSnapshot();
+ ResetClusterCurrentXids();
/*
 * If recheck is required, it must have been performed on the source
@@ -3301,18 +3356,36 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
TupleTableSlot *index_slot)
{
List *recheck;
+ LockTupleMode lockmode;
TU_UpdateIndexes update_indexes;
+ TM_Result res;
+ Snapshot snapshot = change->snapshot;
+ TM_FailureData tmfd;
/*
* Write the new tuple into the new heap. ('tup' gets the TID assigned
* here.)
+ *
+ * Regarding CID, see the comment in apply_concurrent_insert().
*/
- simple_heap_update(rel, &tup_target->t_self, tup, &update_indexes);
+ Assert(snapshot->curcid != InvalidCommandId &&
+ snapshot->curcid > FirstCommandId);
+
+ res = heap_update(rel, &tup_target->t_self, tup,
+ change->xid, snapshot->curcid - 1,
+ InvalidSnapshot,
+ false, /* no wait - only we are doing changes */
+ &tmfd, &lockmode, &update_indexes,
+ /* wal_logical */
+ false);
+ if (res != TM_Ok)
+ ereport(ERROR, (errmsg("failed to apply concurrent UPDATE")));
ExecStoreHeapTuple(tup, index_slot, false);
if (update_indexes != TU_None)
{
+ PushActiveSnapshot(snapshot);
recheck = ExecInsertIndexTuples(iistate->rri,
index_slot,
iistate->estate,
@@ -3322,6 +3395,7 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
NIL, /* arbiterIndexes */
/* onlySummarizing */
update_indexes == TU_Summarizing);
+ PopActiveSnapshot();
list_free(recheck);
}
@@ -3332,7 +3406,22 @@ static void
apply_concurrent_delete(Relation rel, HeapTuple tup_target,
ConcurrentChange *change)
{
- simple_heap_delete(rel, &tup_target->t_self);
+ TM_Result res;
+ TM_FailureData tmfd;
+ Snapshot snapshot = change->snapshot;
+
+ /* Regarding CID, see the comment in apply_concurrent_insert(). */
+ Assert(snapshot->curcid != InvalidCommandId &&
+ snapshot->curcid > FirstCommandId);
+
+ res = heap_delete(rel, &tup_target->t_self, change->xid,
+ snapshot->curcid - 1, InvalidSnapshot, false,
+ &tmfd, false,
+ /* wal_logical */
+ false);
+
+ if (res != TM_Ok)
+ ereport(ERROR, (errmsg("failed to apply concurrent DELETE")));
pgstat_progress_incr_param(PROGRESS_CLUSTER_HEAP_TUPLES_DELETED, 1);
}
@@ -3350,7 +3439,7 @@ apply_concurrent_delete(Relation rel, HeapTuple tup_target,
*/
static HeapTuple
find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
- IndexInsertState *iistate,
+ Snapshot snapshot, IndexInsertState *iistate,
TupleTableSlot *ident_slot, IndexScanDesc *scan_p)
{
IndexScanDesc scan;
@@ -3358,7 +3447,7 @@ find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
int2vector *ident_indkey;
HeapTuple result = NULL;
- scan = index_beginscan(rel, iistate->ident_index, GetActiveSnapshot(),
+ scan = index_beginscan(rel, iistate->ident_index, snapshot,
nkeys, 0);
*scan_p = scan;
index_rescan(scan, key, nkeys, NULL, 0);
@@ -3430,6 +3519,8 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
}
PG_FINALLY();
{
+ ResetClusterCurrentXids();
+
if (rel_src)
rel_dst->rd_toastoid = InvalidOid;
}
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index b3fb5d1825..1f30e12537 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -469,9 +469,18 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
SnapBuild *builder = ctx->snapshot_builder;
/*
- * Check if CLUSTER CONCURRENTLY is being performed by this backend. If
- * so, only decode data changes of the table that it is processing, and
- * the changes of its TOAST relation.
+ * If the change is not intended for logical decoding, do not even
+ * establish a transaction for it. This is particularly important if the
+ * record was generated by CLUSTER CONCURRENTLY because this command uses
+ * the original XID when doing changes in the new storage. The decoding
+ * subsystem probably does not expect to see the same transaction multiple
+ * times.
+ */
+
+ /*
+ * First, check if CLUSTER CONCURRENTLY is being performed by this
+ * backend. If so, only decode data changes of the table that it is
+ * processing, and the changes of its TOAST relation.
*
* (TOAST locator should not be set unless the main is.)
*/
@@ -491,6 +500,60 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
return;
}
+ /*
+ * Second, skip records which do not contain sufficient information for
+ * the decoding.
+ *
+ * The backend executing CLUSTER CONCURRENTLY should not return here
+ * because the records which passed the checks above should be
+ * eligible for decoding. However, CLUSTER CONCURRENTLY generates WAL when
+ * writing data into the new table, which should not be decoded by the
+ * other backends. This is where the other backends skip them.
+ */
+ switch (info)
+ {
+ case XLOG_HEAP_INSERT:
+ {
+ xl_heap_insert *rec;
+
+ rec = (xl_heap_insert *) XLogRecGetData(buf->record);
+ /*
+ * (Besides insertion into the main heap by CLUSTER CONCURRENTLY,
+ * this also happens when raw_heap_insert marks the TOAST record as
+ * HEAP_INSERT_NO_LOGICAL.)
+ */
+ if ((rec->flags & XLH_INSERT_CONTAINS_NEW_TUPLE) == 0)
+ return;
+
+ break;
+ }
+
+ case XLOG_HEAP_HOT_UPDATE:
+ case XLOG_HEAP_UPDATE:
+ {
+ xl_heap_update *rec;
+
+ rec = (xl_heap_update *) XLogRecGetData(buf->record);
+ if ((rec->flags &
+ (XLH_UPDATE_CONTAINS_NEW_TUPLE |
+ XLH_UPDATE_CONTAINS_OLD_TUPLE |
+ XLH_UPDATE_CONTAINS_OLD_KEY)) == 0)
+ return;
+
+ break;
+ }
+
+ case XLOG_HEAP_DELETE:
+ {
+ xl_heap_delete *rec;
+
+ rec = (xl_heap_delete *) XLogRecGetData(buf->record);
+ if (rec->flags & XLH_DELETE_NO_LOGICAL)
+ return;
+ break;
+ }
+ }
+
ReorderBufferProcessXid(ctx->reorder, xid, buf->origptr);
/*
@@ -927,13 +990,6 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
xlrec = (xl_heap_insert *) XLogRecGetData(r);
- /*
- * Ignore insert records without new tuples (this does happen when
- * raw_heap_insert marks the TOAST record as HEAP_INSERT_NO_LOGICAL).
- */
- if (!(xlrec->flags & XLH_INSERT_CONTAINS_NEW_TUPLE))
- return;
-
/* only interested in our database */
XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
if (target_locator.dbOid != ctx->slot->data.database)
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 4492e2ae46..8e1f4bb851 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -294,7 +294,7 @@ static bool ExportInProgress = false;
static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
/* snapshot building/manipulation/distribution functions */
-static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
+static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder, XLogRecPtr lsn);
static void SnapBuildFreeSnapshot(Snapshot snap);
@@ -491,12 +491,17 @@ SnapBuildSnapDecRefcount(Snapshot snap)
* Build a new snapshot, based on currently committed catalog-modifying
* transactions.
*
+ * 'lsn' is the location of the commit record (of a catalog-changing
+ * transaction) that triggered creation of the snapshot. Pass
+ * InvalidXLogRecPtr for the transaction base snapshot, or if the user of
+ * the snapshot does not need the LSN.
+ *
* In-progress transactions with catalog access are *not* allowed to modify
* these snapshots; they have to copy them and fill in appropriate ->curcid
* and ->subxip/subxcnt values.
*/
static Snapshot
-SnapBuildBuildSnapshot(SnapBuild *builder)
+SnapBuildBuildSnapshot(SnapBuild *builder, XLogRecPtr lsn)
{
Snapshot snapshot;
Size ssize;
@@ -564,6 +569,7 @@ SnapBuildBuildSnapshot(SnapBuild *builder)
snapshot->active_count = 0;
snapshot->regd_count = 0;
snapshot->snapXactCompletionCount = 0;
+ snapshot->lsn = lsn;
return snapshot;
}
@@ -600,7 +606,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
if (TransactionIdIsValid(MyProc->xmin))
elog(ERROR, "cannot build an initial slot snapshot when MyProc->xmin already is valid");
- snap = SnapBuildBuildSnapshot(builder);
+ snap = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
/*
* We know that snap->xmin is alive, enforced by the logical xmin
@@ -641,7 +647,7 @@ SnapBuildInitialSnapshotForCluster(SnapBuild *builder)
Assert(builder->state == SNAPBUILD_CONSISTENT);
- snap = SnapBuildBuildSnapshot(builder);
+ snap = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
return SnapBuildMVCCFromHistoric(snap, false);
}
@@ -775,7 +781,7 @@ SnapBuildGetOrBuildSnapshot(SnapBuild *builder)
/* only build a new snapshot if we don't have a prebuilt one */
if (builder->snapshot == NULL)
{
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
/* increase refcount for the snapshot builder */
SnapBuildSnapIncRefcount(builder->snapshot);
}
@@ -855,7 +861,7 @@ SnapBuildProcessChange(SnapBuild *builder, TransactionId xid, XLogRecPtr lsn)
/* only build a new snapshot if we don't have a prebuilt one */
if (builder->snapshot == NULL)
{
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, lsn);
/* increase refcount for the snapshot builder */
SnapBuildSnapIncRefcount(builder->snapshot);
}
@@ -1224,7 +1230,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
if (builder->snapshot)
SnapBuildSnapDecRefcount(builder->snapshot);
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, lsn);
/* we might need to execute invalidations, add snapshot */
if (!ReorderBufferXidHasBaseSnapshot(builder->reorder, xid))
@@ -2062,7 +2068,7 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
{
SnapBuildSnapDecRefcount(builder->snapshot);
}
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
SnapBuildSnapIncRefcount(builder->snapshot);
ReorderBufferSetRestartPoint(builder->reorder, lsn);
diff --git a/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c b/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
index c6baca1171..db6a2bcf1f 100644
--- a/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
+++ b/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
@@ -33,7 +33,8 @@ static void plugin_truncate(struct LogicalDecodingContext *ctx,
Relation relations[],
ReorderBufferChange *change);
static void store_change(LogicalDecodingContext *ctx,
- ConcurrentChangeKind kind, HeapTuple tuple);
+ ConcurrentChangeKind kind, HeapTuple tuple,
+ TransactionId xid);
void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
@@ -101,6 +102,7 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change)
{
ClusterDecodingState *dstate;
+ Snapshot snapshot;
dstate = (ClusterDecodingState *) ctx->output_writer_private;
@@ -108,6 +110,48 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (relation->rd_id != dstate->relid)
return;
+ /*
+ * Catalog snapshot is fine because the table we are processing is
+ * temporarily considered a user catalog table.
+ */
+ snapshot = GetCatalogSnapshot(InvalidOid);
+ Assert(snapshot->snapshot_type == SNAPSHOT_HISTORIC_MVCC);
+ Assert(!snapshot->suboverflowed);
+
+ /*
+ * This should not happen, but if we don't have enough information to
+ * apply a new snapshot, the consequences would be bad. Thus prefer ERROR
+ * to Assert().
+ */
+ if (XLogRecPtrIsInvalid(snapshot->lsn))
+ ereport(ERROR, (errmsg("snapshot has invalid LSN")));
+
+ /*
+ * reorderbuffer.c changes the catalog snapshot as soon as it sees a new
+ * CID or a commit record of a catalog-changing transaction.
+ */
+ if (dstate->snapshot == NULL || snapshot->lsn != dstate->snapshot_lsn ||
+ snapshot->curcid != dstate->snapshot->curcid)
+ {
+ /* CID should not go backwards. */
+ Assert(dstate->snapshot == NULL ||
+ snapshot->curcid >= dstate->snapshot->curcid ||
+ change->txn->xid != dstate->last_change_xid);
+
+ /*
+ * XXX Is it a problem that the copy is created in
+ * TopTransactionContext?
+ *
+ * XXX Wouldn't it be o.k. for SnapBuildMVCCFromHistoric() to set xcnt
+ * to 0 instead of converting xip in this case? The point is that
+ * transactions which are still in progress from the perspective of
+ * reorderbuffer.c could not be replayed yet, so we do not need to
+ * examine their XIDs.
+ */
+ dstate->snapshot = SnapBuildMVCCFromHistoric(snapshot, false);
+ dstate->snapshot_lsn = snapshot->lsn;
+ }
+
/* Decode entry depending on its type */
switch (change->action)
{
@@ -125,7 +169,7 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (newtuple == NULL)
elog(ERROR, "Incomplete insert info.");
- store_change(ctx, CHANGE_INSERT, newtuple);
+ store_change(ctx, CHANGE_INSERT, newtuple, change->txn->xid);
}
break;
case REORDER_BUFFER_CHANGE_UPDATE:
@@ -142,9 +186,11 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
elog(ERROR, "Incomplete update info.");
if (oldtuple != NULL)
- store_change(ctx, CHANGE_UPDATE_OLD, oldtuple);
+ store_change(ctx, CHANGE_UPDATE_OLD, oldtuple,
+ change->txn->xid);
- store_change(ctx, CHANGE_UPDATE_NEW, newtuple);
+ store_change(ctx, CHANGE_UPDATE_NEW, newtuple,
+ change->txn->xid);
}
break;
case REORDER_BUFFER_CHANGE_DELETE:
@@ -157,7 +203,7 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (oldtuple == NULL)
elog(ERROR, "Incomplete delete info.");
- store_change(ctx, CHANGE_DELETE, oldtuple);
+ store_change(ctx, CHANGE_DELETE, oldtuple, change->txn->xid);
}
break;
default:
@@ -191,13 +237,13 @@ plugin_truncate(struct LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (i == nrelations)
return;
- store_change(ctx, CHANGE_TRUNCATE, NULL);
+ store_change(ctx, CHANGE_TRUNCATE, NULL, InvalidTransactionId);
}
/* Store concurrent data change. */
static void
store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
- HeapTuple tuple)
+ HeapTuple tuple, TransactionId xid)
{
ClusterDecodingState *dstate;
char *change_raw;
@@ -257,6 +303,11 @@ store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
dst = (char *) change + sizeof(ConcurrentChange);
memcpy(dst, tuple->t_data, tuple->t_len);
+ /* Initialize the other fields. */
+ change->xid = xid;
+ change->snapshot = dstate->snapshot;
+ dstate->snapshot->active_count++;
+
/* The data has been copied. */
if (flattened)
pfree(tuple);
@@ -267,6 +318,9 @@ store:
isnull[0] = false;
tuplestore_putvalues(dstate->tstore, dstate->tupdesc_change,
values, isnull);
+#ifdef USE_ASSERT_CHECKING
+ dstate->last_change_xid = xid;
+#endif
/* Accounting. */
dstate->nchanges++;
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 8687ec8796..e87eb2f861 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -316,21 +316,24 @@ extern BulkInsertState GetBulkInsertState(void);
extern void FreeBulkInsertState(BulkInsertState);
extern void ReleaseBulkInsertStatePin(BulkInsertState bistate);
-extern void heap_insert(Relation relation, HeapTuple tup, CommandId cid,
- int options, BulkInsertState bistate);
+extern void heap_insert(Relation relation, HeapTuple tup, TransactionId xid,
+ CommandId cid, int options, BulkInsertState bistate);
extern void heap_multi_insert(Relation relation, struct TupleTableSlot **slots,
int ntuples, CommandId cid, int options,
BulkInsertState bistate);
extern TM_Result heap_delete(Relation relation, ItemPointer tid,
- CommandId cid, Snapshot crosscheck, bool wait,
- struct TM_FailureData *tmfd, bool changingPart);
+ TransactionId xid, CommandId cid,
+ Snapshot crosscheck, bool wait,
+ struct TM_FailureData *tmfd, bool changingPart,
+ bool wal_logical);
extern void heap_finish_speculative(Relation relation, ItemPointer tid);
extern void heap_abort_speculative(Relation relation, ItemPointer tid);
extern TM_Result heap_update(Relation relation, ItemPointer otid,
- HeapTuple newtup,
+ HeapTuple newtup, TransactionId xid,
CommandId cid, Snapshot crosscheck, bool wait,
struct TM_FailureData *tmfd, LockTupleMode *lockmode,
- TU_UpdateIndexes *update_indexes);
+ TU_UpdateIndexes *update_indexes,
+ bool wal_logical);
extern TM_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
CommandId cid, LockTupleMode mode, LockWaitPolicy wait_policy,
bool follow_updates,
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 42736f37e7..1c5cb7c728 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -103,6 +103,8 @@
#define XLH_DELETE_CONTAINS_OLD_KEY (1<<2)
#define XLH_DELETE_IS_SUPER (1<<3)
#define XLH_DELETE_IS_PARTITION_MOVE (1<<4)
+/* See heap_delete() */
+#define XLH_DELETE_NO_LOGICAL (1<<5)
/* convenience macro for checking whether any form of old tuple was logged */
#define XLH_DELETE_CONTAINS_OLD \
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 6d4439f052..e0016631f6 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -476,6 +476,8 @@ extern Size EstimateTransactionStateSpace(void);
extern void SerializeTransactionState(Size maxsize, char *start_address);
extern void StartParallelWorkerTransaction(char *tstatespace);
extern void EndParallelWorkerTransaction(void);
+extern void SetClusterCurrentXids(TransactionId *xip, int xcnt);
+extern void ResetClusterCurrentXids(void);
extern bool IsTransactionBlock(void);
extern bool IsTransactionOrTransactionBlock(void);
extern char TransactionBlockStatusCode(void);
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 959899a7cc..61ea314399 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -71,6 +71,14 @@ typedef struct ConcurrentChange
/* See the enum above. */
ConcurrentChangeKind kind;
+ /* Transaction that changes the data. */
+ TransactionId xid;
+
+ /*
+ * Historic catalog snapshot that was used to decode this change.
+ */
+ Snapshot snapshot;
+
/*
* The actual tuple.
*
@@ -99,6 +107,8 @@ typedef struct ClusterDecodingState
* tuplestore does this transparently.
*/
Tuplestorestate *tstore;
+ /* XID of the last change added to tstore. */
+ TransactionId last_change_xid PG_USED_FOR_ASSERTS_ONLY;
/* The current number of changes in tstore. */
double nchanges;
@@ -119,6 +129,14 @@ typedef struct ClusterDecodingState
/* Slot to retrieve data from tstore. */
TupleTableSlot *tsslot;
+ /*
+ * Historic catalog snapshot that was used to decode the most recent
+ * change.
+ */
+ Snapshot snapshot;
+ /* LSN of the record */
+ XLogRecPtr snapshot_lsn;
+
ResourceOwner resowner;
} ClusterDecodingState;
--
2.45.2
v03-0006-Add-regression-tests.patch (text/x-diff)
From 6cc3f2c4b1d68215cac7d321b617488ee27973e1 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Wed, 4 Sep 2024 12:29:28 +0200
Subject: [PATCH 6/8] Add regression tests.
As this patch series adds the CONCURRENTLY option to the VACUUM FULL and
CLUSTER commands, it's appropriate to test that the "concurrent data changes"
(i.e. changes done by applications while we are copying the table contents to
the new storage) are processed correctly.
Injection points are used to stop the data copying at some point. While the
backend in charge of the copying is waiting on the injection point, another
backend runs some INSERT, UPDATE and DELETE commands on the table. Then we
wake up the first backend and let the CLUSTER (CONCURRENTLY) command
finish. Finally we check that all the "concurrent data changes" are present in
the table and that they contain the correct visibility information.
---
src/backend/commands/cluster.c | 7 +
src/test/modules/injection_points/Makefile | 3 +-
.../injection_points/expected/cluster.out | 113 ++++++++++++++
.../modules/injection_points/logical.conf | 1 +
src/test/modules/injection_points/meson.build | 3 +
.../injection_points/specs/cluster.spec | 140 ++++++++++++++++++
6 files changed, 266 insertions(+), 1 deletion(-)
create mode 100644 src/test/modules/injection_points/expected/cluster.out
create mode 100644 src/test/modules/injection_points/logical.conf
create mode 100644 src/test/modules/injection_points/specs/cluster.spec
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 34cb588a1e..d0debe0333 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -59,6 +59,7 @@
#include "utils/acl.h"
#include "utils/fmgroids.h"
#include "utils/guc.h"
+#include "utils/injection_point.h"
#include "utils/inval.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
@@ -3742,6 +3743,12 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
*/
ident_key = build_identity_key(ident_idx_new, OldHeap, &ident_key_nentries);
+ /*
+ * During testing, wait for another backend to perform concurrent data
+ * changes which we will process below.
+ */
+ INJECTION_POINT("cluster-concurrently-before-lock");
+
/*
* Flush all WAL records inserted so far (possibly except for the last
* incomplete page, see GetInsertRecPtr), to minimize the amount of data
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 1c1c2d0b13..4a133aad9e 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -13,7 +13,8 @@ PGFILEDESC = "injection_points - facility for injection points"
REGRESS = injection_points
REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
-ISOLATION = inplace
+ISOLATION = inplace cluster
+ISOLATION_OPTS = --temp-config $(top_srcdir)/src/test/modules/injection_points/logical.conf
TAP_TESTS = 1
diff --git a/src/test/modules/injection_points/expected/cluster.out b/src/test/modules/injection_points/expected/cluster.out
new file mode 100644
index 0000000000..d84fff3693
--- /dev/null
+++ b/src/test/modules/injection_points/expected/cluster.out
@@ -0,0 +1,113 @@
+Parsed test spec with 2 sessions
+
+starting permutation: wait_before_lock change_existing change_new change_subxact1 change_subxact2 check2 wakeup_before_lock check1
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step wait_before_lock:
+ CLUSTER (CONCURRENTLY) clstr_test USING clstr_test_pkey;
+ <waiting ...>
+step change_existing:
+ UPDATE clstr_test SET i=10 where i=1;
+ UPDATE clstr_test SET j=20 where i=2;
+ UPDATE clstr_test SET i=30 where i=3;
+ UPDATE clstr_test SET i=40 where i=30;
+ DELETE FROM clstr_test WHERE i=4;
+
+step change_new:
+ INSERT INTO clstr_test(i, j) VALUES (5, 5), (6, 6), (7, 7), (8, 8);
+ UPDATE clstr_test SET i=50 where i=5;
+ UPDATE clstr_test SET j=60 where i=6;
+ DELETE FROM clstr_test WHERE i=7;
+
+step change_subxact1:
+ BEGIN;
+ INSERT INTO clstr_test(i, j) VALUES (100, 100);
+ SAVEPOINT s1;
+ UPDATE clstr_test SET i=101 where i=100;
+ SAVEPOINT s2;
+ UPDATE clstr_test SET i=102 where i=101;
+ COMMIT;
+
+step change_subxact2:
+ BEGIN;
+ SAVEPOINT s1;
+ INSERT INTO clstr_test(i, j) VALUES (110, 110);
+ ROLLBACK TO SAVEPOINT s1;
+ INSERT INTO clstr_test(i, j) VALUES (110, 111);
+ COMMIT;
+
+step check2:
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='clstr_test';
+
+ SELECT i, j FROM clstr_test ORDER BY i, j;
+
+ INSERT INTO data_s2(_xmin, _cmin, i, j)
+ SELECT xmin, cmin, i, j FROM clstr_test;
+
+ i| j
+---+---
+ 2| 20
+ 6| 60
+ 8| 8
+ 10| 1
+ 40| 3
+ 50| 5
+102|100
+110|111
+(8 rows)
+
+step wakeup_before_lock:
+ SELECT injection_points_wakeup('cluster-concurrently-before-lock');
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step wait_before_lock: <... completed>
+step check1:
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='clstr_test';
+
+ SELECT count(DISTINCT node) FROM relfilenodes;
+
+ SELECT i, j FROM clstr_test ORDER BY i, j;
+
+ INSERT INTO data_s1(_xmin, _cmin, i, j)
+ SELECT xmin, cmin, i, j FROM clstr_test;
+
+ SELECT count(*)
+ FROM data_s1 d1 FULL JOIN data_s2 d2 USING (_xmin, _cmin, i, j)
+ WHERE d1.i ISNULL OR d2.i ISNULL;
+
+count
+-----
+ 2
+(1 row)
+
+ i| j
+---+---
+ 2| 20
+ 6| 60
+ 8| 8
+ 10| 1
+ 40| 3
+ 50| 5
+102|100
+110|111
+(8 rows)
+
+count
+-----
+ 0
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
diff --git a/src/test/modules/injection_points/logical.conf b/src/test/modules/injection_points/logical.conf
new file mode 100644
index 0000000000..c8f264bc6c
--- /dev/null
+++ b/src/test/modules/injection_points/logical.conf
@@ -0,0 +1 @@
+wal_level = logical
\ No newline at end of file
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index c9e357f644..7739b28c19 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -42,7 +42,10 @@ tests += {
'isolation': {
'specs': [
'inplace',
+ 'cluster',
],
+ # 'cluster' requires wal_level = 'logical'.
+ 'regress_args': ['--temp-config', files('logical.conf')],
},
'tap': {
'env': {
diff --git a/src/test/modules/injection_points/specs/cluster.spec b/src/test/modules/injection_points/specs/cluster.spec
new file mode 100644
index 0000000000..5f8404c5da
--- /dev/null
+++ b/src/test/modules/injection_points/specs/cluster.spec
@@ -0,0 +1,140 @@
+# Prefix the system columns with underscore as they are not allowed as column
+# names.
+setup
+{
+ CREATE EXTENSION injection_points;
+
+ CREATE TABLE clstr_test(i int PRIMARY KEY, j int);
+ INSERT INTO clstr_test(i, j) VALUES (1, 1), (2, 2), (3, 3), (4, 4);
+
+ CREATE TABLE relfilenodes(node oid);
+
+ CREATE TABLE data_s1(_xmin xid, _cmin cid, i int, j int);
+ CREATE TABLE data_s2(_xmin xid, _cmin cid, i int, j int);
+}
+
+teardown
+{
+ DROP TABLE clstr_test;
+ DROP EXTENSION injection_points;
+
+ DROP TABLE relfilenodes;
+ DROP TABLE data_s1;
+ DROP TABLE data_s2;
+}
+
+session s1
+setup
+{
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('cluster-concurrently-before-lock', 'wait');
+}
+# Perform the initial load and wait for s2 to do some data changes.
+step wait_before_lock
+{
+ CLUSTER (CONCURRENTLY) clstr_test USING clstr_test_pkey;
+}
+# Check the table from the perspective of s1.
+#
+# Besides the contents, we also check that relfilenode has changed.
+#
+# xmin and cmin columns are used to check that we do not change tuple
+# visibility information. Since we do not expect xmin to stay unchanged across
+# test runs, it cannot appear in the output text. Instead, have each session
+# write the contents into a table and use FULL JOIN to check if the outputs
+# are identical.
+step check1
+{
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='clstr_test';
+
+ SELECT count(DISTINCT node) FROM relfilenodes;
+
+ SELECT i, j FROM clstr_test ORDER BY i, j;
+
+ INSERT INTO data_s1(_xmin, _cmin, i, j)
+ SELECT xmin, cmin, i, j FROM clstr_test;
+
+ SELECT count(*)
+ FROM data_s1 d1 FULL JOIN data_s2 d2 USING (_xmin, _cmin, i, j)
+ WHERE d1.i ISNULL OR d2.i ISNULL;
+}
+teardown
+{
+ SELECT injection_points_detach('cluster-concurrently-before-lock');
+}
+
+session s2
+# Change the existing data. UPDATE changes both key and non-key columns. Also
+# update one row twice to test whether a tuple version generated by this
+# session can be found.
+step change_existing
+{
+ UPDATE clstr_test SET i=10 where i=1;
+ UPDATE clstr_test SET j=20 where i=2;
+ UPDATE clstr_test SET i=30 where i=3;
+ UPDATE clstr_test SET i=40 where i=30;
+ DELETE FROM clstr_test WHERE i=4;
+}
+# Insert new rows and UPDATE / DELETE some of them. Again, update both key and
+# non-key columns.
+step change_new
+{
+ INSERT INTO clstr_test(i, j) VALUES (5, 5), (6, 6), (7, 7), (8, 8);
+ UPDATE clstr_test SET i=50 where i=5;
+ UPDATE clstr_test SET j=60 where i=6;
+ DELETE FROM clstr_test WHERE i=7;
+}
+
+# When applying concurrent data changes, we should see the effects of an
+# in-progress subtransaction.
+step change_subxact1
+{
+ BEGIN;
+ INSERT INTO clstr_test(i, j) VALUES (100, 100);
+ SAVEPOINT s1;
+ UPDATE clstr_test SET i=101 where i=100;
+ SAVEPOINT s2;
+ UPDATE clstr_test SET i=102 where i=101;
+ COMMIT;
+}
+
+# When applying concurrent data changes, we should not see the effects of a
+# rolled back subtransaction.
+step change_subxact2
+{
+ BEGIN;
+ SAVEPOINT s1;
+ INSERT INTO clstr_test(i, j) VALUES (110, 110);
+ ROLLBACK TO SAVEPOINT s1;
+ INSERT INTO clstr_test(i, j) VALUES (110, 111);
+ COMMIT;
+}
+
+# Check the table from the perspective of s2.
+step check2
+{
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='clstr_test';
+
+ SELECT i, j FROM clstr_test ORDER BY i, j;
+
+ INSERT INTO data_s2(_xmin, _cmin, i, j)
+ SELECT xmin, cmin, i, j FROM clstr_test;
+}
+step wakeup_before_lock
+{
+ SELECT injection_points_wakeup('cluster-concurrently-before-lock');
+}
+
+# Test if data changes introduced while one session is performing CLUSTER
+# (CONCURRENTLY) find their way into the table.
+permutation
+ wait_before_lock
+ change_existing
+ change_new
+ change_subxact1
+ change_subxact2
+ check2
+ wakeup_before_lock
+ check1
--
2.45.2
Attachment: v03-0007-Introduce-cluster_max_xlock_time-configuration-varia.patch (text/x-diff)
From c39ccdacfa64cafe840fd466022f39dc83ab21a2 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Wed, 4 Sep 2024 12:29:28 +0200
Subject: [PATCH 7/8] Introduce cluster_max_xlock_time configuration variable.
When executing VACUUM FULL / CLUSTER (CONCURRENTLY), we need an
AccessExclusiveLock to swap the relation files, which should only take a
short time. However, on a busy system, other backends might change a
non-negligible amount of data in the table while we are waiting for the
lock. Since these changes must be applied to the new storage before the
swap, the time we eventually hold the lock might become non-negligible too.
If the user is worried about this situation, they can set
cluster_max_xlock_time to the maximum time for which the exclusive lock may
be held. If that amount of time is not sufficient to complete the VACUUM
FULL / CLUSTER (CONCURRENTLY) command, an ERROR is raised and the command is
canceled.
---
doc/src/sgml/config.sgml | 32 +++++
doc/src/sgml/ref/cluster.sgml | 9 +-
src/backend/access/heap/heapam_handler.c | 3 +-
src/backend/commands/cluster.c | 133 +++++++++++++++---
src/backend/utils/misc/guc_tables.c | 14 ++
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/commands/cluster.h | 5 +-
.../injection_points/expected/cluster.out | 74 +++++++++-
.../injection_points/specs/cluster.spec | 42 ++++++
9 files changed, 293 insertions(+), 20 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 0aec11f443..0b55028b79 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -10566,6 +10566,38 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
</listitem>
</varlistentry>
+ <varlistentry id="guc-cluster-max-xclock-time" xreflabel="cluster_max_xlock_time">
+ <term><varname>cluster_max_xlock_time</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>cluster_max_xlock_time</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ This is the maximum amount of time to hold an exclusive lock on a
+ table by commands <command>CLUSTER</command> and <command>VACUUM
+ FULL</command> with the <literal>CONCURRENTLY</literal>
+ option. Typically, these commands should not need the lock for a
+ longer time than <command>TRUNCATE</command> does. However, additional
+ time might be needed if the system is very busy. (See
+ <xref linkend="sql-cluster"/> for an explanation of how
+ the <literal>CONCURRENTLY</literal> option works.)
+ </para>
+
+ <para>
+ If you want to restrict the lock time, set this variable to the
+ highest acceptable value. If the command cannot finish its
+ processing and release the lock within that time, it will be
+ cancelled.
+ </para>
+
+ <para>
+ The default value is 0, which means that the lock is not released
+ until the concurrent data changes are processed.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect1>
diff --git a/doc/src/sgml/ref/cluster.sgml b/doc/src/sgml/ref/cluster.sgml
index d8c3edb432..182e4f7592 100644
--- a/doc/src/sgml/ref/cluster.sgml
+++ b/doc/src/sgml/ref/cluster.sgml
@@ -141,7 +141,14 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
(<xref linkend="logicaldecoding"/>) and applied before
the <literal>ACCESS EXCLUSIVE</literal> lock is requested. Thus the lock
is typically held only for the time needed to swap the files, which
- should be pretty short.
+ should be pretty short. However, the time might still be noticeable if
+ too many data changes have been done to the table while
+ <command>CLUSTER</command> was waiting for the lock: those changes must
+ be processed just before the files are swapped, while the
+ <literal>ACCESS EXCLUSIVE</literal> lock is being held. If you are
+ worried about this situation, set
+ the <link linkend="guc-cluster-max-xclock-time"><varname>cluster_max_xlock_time</varname></link>
+ configuration parameter to a value that your applications can tolerate.
</para>
<para>
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index a8999a3e72..61b8d7e8e5 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -998,7 +998,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
end_of_wal = GetFlushRecPtr(NULL);
if ((end_of_wal - end_of_wal_prev) > wal_segment_size)
{
- cluster_decode_concurrent_changes(decoding_ctx, end_of_wal);
+ cluster_decode_concurrent_changes(decoding_ctx, end_of_wal,
+ NULL);
end_of_wal_prev = end_of_wal;
}
}
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index d0debe0333..1648269f6d 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -17,6 +17,8 @@
*/
#include "postgres.h"
+#include <sys/time.h>
+
#include "access/amapi.h"
#include "access/heapam.h"
#include "access/multixact.h"
@@ -103,6 +105,15 @@ RelFileLocator clustered_rel_toast_locator = {.relNumber = InvalidOid};
#define CLUSTER_IN_PROGRESS_MESSAGE \
"relation \"%s\" is already being processed by CLUSTER CONCURRENTLY"
+/*
+ * The maximum time to hold AccessExclusiveLock during the final
+ * processing. Note that only the execution time of
+ * process_concurrent_changes() is included here. The very last steps like
+ * swap_relation_files() shouldn't get blocked and it'd be wrong to consider
+ * them a reason to abort otherwise completed processing.
+ */
+int cluster_max_xlock_time = 0;
+
/*
* Everything we need to call ExecInsertIndexTuples().
*/
@@ -190,7 +201,8 @@ static LogicalDecodingContext *setup_logical_decoding(Oid relid,
static HeapTuple get_changed_tuple(ConcurrentChange *change);
static void apply_concurrent_changes(ClusterDecodingState *dstate,
Relation rel, ScanKey key, int nkeys,
- IndexInsertState *iistate);
+ IndexInsertState *iistate,
+ struct timeval *must_complete);
static void apply_concurrent_insert(Relation rel, ConcurrentChange *change,
HeapTuple tup, IndexInsertState *iistate,
TupleTableSlot *index_slot);
@@ -207,13 +219,15 @@ static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
IndexInsertState *iistate,
TupleTableSlot *ident_slot,
IndexScanDesc *scan_p);
-static void process_concurrent_changes(LogicalDecodingContext *ctx,
+static bool process_concurrent_changes(LogicalDecodingContext *ctx,
XLogRecPtr end_of_wal,
Relation rel_dst,
Relation rel_src,
ScanKey ident_key,
int ident_key_nentries,
- IndexInsertState *iistate);
+ IndexInsertState *iistate,
+ struct timeval *must_complete);
+static bool processing_time_elapsed(struct timeval *must_complete);
static IndexInsertState *get_index_insert_state(Relation relation,
Oid ident_index_id);
static ScanKey build_identity_key(Oid ident_idx_oid, Relation rel_src,
@@ -3048,7 +3062,8 @@ get_changed_tuple(ConcurrentChange *change)
*/
void
cluster_decode_concurrent_changes(LogicalDecodingContext *ctx,
- XLogRecPtr end_of_wal)
+ XLogRecPtr end_of_wal,
+ struct timeval *must_complete)
{
ClusterDecodingState *dstate;
ResourceOwner resowner_old;
@@ -3086,6 +3101,9 @@ cluster_decode_concurrent_changes(LogicalDecodingContext *ctx,
if (record != NULL)
LogicalDecodingProcessRecord(ctx, ctx->reader);
+ if (processing_time_elapsed(must_complete))
+ break;
+
/*
* If WAL segment boundary has been crossed, inform the decoding
* system that the catalog_xmin can advance. (We can confirm more
@@ -3128,7 +3146,8 @@ cluster_decode_concurrent_changes(LogicalDecodingContext *ctx,
*/
static void
apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
- ScanKey key, int nkeys, IndexInsertState *iistate)
+ ScanKey key, int nkeys, IndexInsertState *iistate,
+ struct timeval *must_complete)
{
TupleTableSlot *index_slot, *ident_slot;
HeapTuple tup_old = NULL;
@@ -3158,6 +3177,9 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
CHECK_FOR_INTERRUPTS();
+ Assert(dstate->nchanges > 0);
+ dstate->nchanges--;
+
/* Get the change from the single-column tuple. */
tup_change = ExecFetchSlotHeapTuple(dstate->tsslot, false, &shouldFree);
heap_deform_tuple(tup_change, dstate->tupdesc_change, values, isnull);
@@ -3282,10 +3304,22 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
/* TTSOpsMinimalTuple has .get_heap_tuple==NULL. */
Assert(shouldFree);
pfree(tup_change);
+
+ /*
+ * If there is a limit on the time of completion, check it
+ * now. However, make sure the loop does not break if tup_old was set
+ * in the previous iteration. In such a case we could not resume the
+ * processing in the next call.
+ */
+ if (must_complete && tup_old == NULL &&
+ processing_time_elapsed(must_complete))
+ /* The next call will process the remaining changes. */
+ break;
}
- tuplestore_clear(dstate->tstore);
- dstate->nchanges = 0;
+ /* If we could not apply all the changes, the next call will do. */
+ if (dstate->nchanges == 0)
+ tuplestore_clear(dstate->tstore);
/* Cleanup. */
ExecDropSingleTupleTableSlot(index_slot);
@@ -3488,11 +3522,15 @@ find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
* Decode and apply concurrent changes.
*
* Pass rel_src iff its reltoastrelid is needed.
+ *
+ * Returns true if must_complete is NULL or if we managed to complete by the
+ * time *must_complete indicates.
*/
-static void
+static bool
process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
Relation rel_dst, Relation rel_src, ScanKey ident_key,
- int ident_key_nentries, IndexInsertState *iistate)
+ int ident_key_nentries, IndexInsertState *iistate,
+ struct timeval *must_complete)
{
ClusterDecodingState *dstate;
@@ -3501,10 +3539,19 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
dstate = (ClusterDecodingState *) ctx->output_writer_private;
- cluster_decode_concurrent_changes(ctx, end_of_wal);
+ cluster_decode_concurrent_changes(ctx, end_of_wal, must_complete);
+ if (processing_time_elapsed(must_complete))
+ /* Caller is responsible for applying the changes. */
+ return false;
+
+ /*
+ * *must_complete not reached, so there are really no changes. (It's
+ * possible to see no changes just because not enough time was left for
+ * the decoding.)
+ */
if (dstate->nchanges == 0)
- return;
+ return true;
PG_TRY();
{
@@ -3516,7 +3563,7 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
rel_dst->rd_toastoid = rel_src->rd_rel->reltoastrelid;
apply_concurrent_changes(dstate, rel_dst, ident_key,
- ident_key_nentries, iistate);
+ ident_key_nentries, iistate, must_complete);
}
PG_FINALLY();
{
@@ -3526,6 +3573,28 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
rel_dst->rd_toastoid = InvalidOid;
}
PG_END_TRY();
+
+ /*
+ * apply_concurrent_changes() does check the processing time, so if some
+ * changes are left, we ran out of time.
+ */
+ return dstate->nchanges == 0;
+}
+
+/*
+ * Check if the current time is beyond *must_complete.
+ */
+static bool
+processing_time_elapsed(struct timeval *must_complete)
+{
+ struct timeval now;
+
+ if (must_complete == NULL)
+ return false;
+
+ gettimeofday(&now, NULL);
+
+ return timercmp(&now, must_complete, >);
}
static IndexInsertState *
@@ -3686,6 +3755,8 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
RelReopenInfo *rri = NULL;
int nrel;
Relation *ind_refs_all, *ind_refs_p;
+ struct timeval t_end;
+ struct timeval *t_end_ptr = NULL;
/* Like in cluster_rel(). */
lockmode_old = LOCK_CLUSTER_CONCURRENT;
@@ -3765,7 +3836,8 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
*/
process_concurrent_changes(ctx, end_of_wal, NewHeap,
swap_toast_by_content ? OldHeap : NULL,
- ident_key, ident_key_nentries, iistate);
+ ident_key, ident_key_nentries, iistate,
+ NULL);
/*
* Release the locks that allowed concurrent data changes, in order to
@@ -3887,9 +3959,38 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
end_of_wal = GetFlushRecPtr(NULL);
/* Apply the concurrent changes again. */
- process_concurrent_changes(ctx, end_of_wal, NewHeap,
- swap_toast_by_content ? OldHeap : NULL,
- ident_key, ident_key_nentries, iistate);
+ /*
+ * This time we have the exclusive lock on the table, so make sure that
+ * cluster_max_xlock_time is not exceeded.
+ */
+ if (cluster_max_xlock_time > 0)
+ {
+ int64 usec;
+ struct timeval t_start;
+
+ gettimeofday(&t_start, NULL);
+ /* Add the whole seconds. */
+ t_end.tv_sec = t_start.tv_sec + cluster_max_xlock_time / 1000;
+ /* Add the rest, expressed in microseconds. */
+ usec = t_start.tv_usec + 1000 * (cluster_max_xlock_time % 1000);
+ /* The number of microseconds could have overflown. */
+ t_end.tv_sec += usec / USECS_PER_SEC;
+ t_end.tv_usec = usec % USECS_PER_SEC;
+ t_end_ptr = &t_end;
+ }
+ /*
+ * During testing, stop here to simulate excessive processing time.
+ */
+ INJECTION_POINT("cluster-concurrently-after-lock");
+
+ if (!process_concurrent_changes(ctx, end_of_wal, NewHeap,
+ swap_toast_by_content ? OldHeap : NULL,
+ ident_key, ident_key_nentries, iistate,
+ t_end_ptr))
+ ereport(ERROR,
+ (errmsg("could not process concurrent data changes in time"),
+ errhint("Consider increasing \"cluster_max_xlock_time\".")));
+
/* Remember info about rel before closing OldHeap */
relpersistence = OldHeap->rd_rel->relpersistence;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 686309db58..02d3805475 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -39,6 +39,7 @@
#include "catalog/namespace.h"
#include "catalog/storage.h"
#include "commands/async.h"
+#include "commands/cluster.h"
#include "commands/event_trigger.h"
#include "commands/tablespace.h"
#include "commands/trigger.h"
@@ -2772,6 +2773,19 @@ struct config_int ConfigureNamesInt[] =
1600000000, 0, 2100000000,
NULL, NULL, NULL
},
+ {
+ {"cluster_max_xlock_time", PGC_USERSET, LOCK_MANAGEMENT,
+ gettext_noop("Maximum time for VACUUM FULL / CLUSTER (CONCURRENTLY) to keep the table locked."),
+ gettext_noop(
+ "The table is locked in exclusive mode during the final stage of processing. "
+ "If the lock time exceeds this value, an error is raised and the lock is "
+ "released. Set to zero if you don't care how long the lock can be held."
+ GUC_UNIT_MS
+ },
+ &cluster_max_xlock_time,
+ 0, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
/*
* See also CheckRequiredParameterValues() if this parameter changes
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 667e0dc40a..9dc060c59f 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -724,6 +724,7 @@
#vacuum_multixact_freeze_table_age = 150000000
#vacuum_multixact_freeze_min_age = 5000000
#vacuum_multixact_failsafe_age = 1600000000
+#cluster_max_xlock_time = 0
#bytea_output = 'hex' # hex, escape
#xmlbinary = 'base64'
#xmloption = 'content'
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 61ea314399..5d904ce985 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -44,6 +44,8 @@ typedef struct ClusterParams
extern RelFileLocator clustered_rel_locator;
extern RelFileLocator clustered_rel_toast_locator;
+extern PGDLLIMPORT int cluster_max_xlock_time;
+
/*
* Lock level for the concurrent variant of CLUSTER / VACUUM FULL.
*
@@ -149,7 +151,8 @@ extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
extern bool check_relation_is_clusterable_concurrently(Relation rel, int elevel,
const char *stmt);
extern void cluster_decode_concurrent_changes(LogicalDecodingContext *ctx,
- XLogRecPtr end_of_wal);
+ XLogRecPtr end_of_wal,
+ struct timeval *must_complete);
extern Oid make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
char relpersistence, LOCKMODE lockmode_old,
LOCKMODE *lockmode_new_p);
diff --git a/src/test/modules/injection_points/expected/cluster.out b/src/test/modules/injection_points/expected/cluster.out
index d84fff3693..646e31448f 100644
--- a/src/test/modules/injection_points/expected/cluster.out
+++ b/src/test/modules/injection_points/expected/cluster.out
@@ -1,4 +1,4 @@
-Parsed test spec with 2 sessions
+Parsed test spec with 4 sessions
starting permutation: wait_before_lock change_existing change_new change_subxact1 change_subxact2 check2 wakeup_before_lock check1
injection_points_attach
@@ -111,3 +111,75 @@ injection_points_detach
(1 row)
+injection_points_detach
+-----------------------
+
+(1 row)
+
+
+starting permutation: wait_after_lock wakeup_after_lock
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step wait_after_lock:
+ CLUSTER (CONCURRENTLY) clstr_test USING clstr_test_pkey;
+ <waiting ...>
+step wakeup_after_lock:
+ SELECT injection_points_wakeup('cluster-concurrently-after-lock');
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step wait_after_lock: <... completed>
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+
+starting permutation: wait_after_lock after_lock_delay wakeup_after_lock
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step wait_after_lock:
+ CLUSTER (CONCURRENTLY) clstr_test USING clstr_test_pkey;
+ <waiting ...>
+step after_lock_delay:
+ SELECT pg_sleep(1.5);
+
+pg_sleep
+--------
+
+(1 row)
+
+step wakeup_after_lock:
+ SELECT injection_points_wakeup('cluster-concurrently-after-lock');
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step wait_after_lock: <... completed>
+ERROR: could not process concurrent data changes in time
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
diff --git a/src/test/modules/injection_points/specs/cluster.spec b/src/test/modules/injection_points/specs/cluster.spec
index 5f8404c5da..9af41bac6d 100644
--- a/src/test/modules/injection_points/specs/cluster.spec
+++ b/src/test/modules/injection_points/specs/cluster.spec
@@ -127,6 +127,34 @@ step wakeup_before_lock
SELECT injection_points_wakeup('cluster-concurrently-before-lock');
}
+session s3
+setup
+{
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('cluster-concurrently-after-lock', 'wait');
+ SET cluster_max_xlock_time TO '1s';
+}
+# Perform the initial load, lock the table in exclusive mode and wait. s4 will
+# cancel the waiting.
+step wait_after_lock
+{
+ CLUSTER (CONCURRENTLY) clstr_test USING clstr_test_pkey;
+}
+teardown
+{
+ SELECT injection_points_detach('cluster-concurrently-after-lock');
+}
+
+session s4
+step wakeup_after_lock
+{
+ SELECT injection_points_wakeup('cluster-concurrently-after-lock');
+}
+step after_lock_delay
+{
+ SELECT pg_sleep(1.5);
+}
+
# Test if data changes introduced while one session is performing CLUSTER
# (CONCURRENTLY) find their way into the table.
permutation
@@ -138,3 +166,17 @@ permutation
check2
wakeup_before_lock
check1
+
+# Test the cluster_max_xlock_time configuration variable.
+#
+# First, cancel waiting on the injection point immediately. That way, CLUSTER
+# should complete.
+permutation
+ wait_after_lock
+ wakeup_after_lock
+# Second, cancel the waiting with a delay that violates
+# cluster_max_xlock_time.
+permutation
+ wait_after_lock
+ after_lock_delay
+ wakeup_after_lock
--
2.45.2
Attachment: v03-0008-Call-logical_rewrite_heap_tuple-when-applying-concur.patch (text/x-diff)
From e0ad5b2f572bbe1e0e10b40043ae124bce2673a5 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Wed, 4 Sep 2024 12:29:28 +0200
Subject: [PATCH 8/8] Call logical_rewrite_heap_tuple() when applying
concurrent data changes.
This was implemented for the sake of completeness, but I think it's currently
not needed. Possible use cases could be:
1. VACUUM FULL / CLUSTER CONCURRENTLY can process system catalogs.
System catalogs are scanned using a historic snapshot during logical decoding,
and the "combo CIDs" information is needed for that. Since "combo CID" is
associated with the "file locator" and that locator is changed by VACUUM FULL
/ CLUSTER, these commands must record the information on individual tuples
being moved from the old file to the new one. This is what
logical_rewrite_heap_tuple() does.
However, the logical decoding subsystem currently does not support decoding of
data changes in the system catalog. Therefore, the CONCURRENTLY option cannot
be used for system catalogs.
2. VACUUM FULL / CLUSTER CONCURRENTLY is processing a relation, but once it
has released all the locks (in order to get the exclusive lock), another
backend runs VACUUM FULL / CLUSTER CONCURRENTLY on the same table. Since the
relation is treated as a system catalog while these commands are processing it
(so it can be scanned using a historic snapshot during the "initial load"), it
is important that the 2nd backend does not break decoding of the "combo CIDs"
performed by the 1st backend.
However, it's not practical to let multiple backends run VACUUM FULL / CLUSTER
CONCURRENTLY on the same relation, so we forbid that.
---
src/backend/access/heap/heapam_handler.c | 2 +-
src/backend/access/heap/rewriteheap.c | 65 ++++++-----
src/backend/commands/cluster.c | 110 +++++++++++++++---
src/backend/replication/logical/decode.c | 41 ++++++-
.../pgoutput_cluster/pgoutput_cluster.c | 21 ++--
src/include/access/rewriteheap.h | 5 +-
src/include/commands/cluster.h | 3 +
src/include/replication/reorderbuffer.h | 7 ++
8 files changed, 194 insertions(+), 60 deletions(-)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 61b8d7e8e5..c39a9ac41d 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -731,7 +731,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
/* Initialize the rewrite operation */
rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, *xid_cutoff,
- *multi_cutoff);
+ *multi_cutoff, true);
/* Set up sorting if wanted */
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 09ef220449..86881e8638 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -214,10 +214,8 @@ static void raw_heap_insert(RewriteState state, HeapTuple tup);
/* internal logical remapping prototypes */
static void logical_begin_heap_rewrite(RewriteState state);
-static void logical_rewrite_heap_tuple(RewriteState state, ItemPointerData old_tid, HeapTuple new_tuple);
static void logical_end_heap_rewrite(RewriteState state);
-
/*
* Begin a rewrite of a table
*
@@ -226,18 +224,19 @@ static void logical_end_heap_rewrite(RewriteState state);
* oldest_xmin xid used by the caller to determine which tuples are dead
* freeze_xid xid before which tuples will be frozen
* cutoff_multi multixact before which multis will be removed
+ * tid_chains need to maintain TID chains?
*
* Returns an opaque RewriteState, allocated in current memory context,
* to be used in subsequent calls to the other functions.
*/
RewriteState
begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xmin,
- TransactionId freeze_xid, MultiXactId cutoff_multi)
+ TransactionId freeze_xid, MultiXactId cutoff_multi,
+ bool tid_chains)
{
RewriteState state;
MemoryContext rw_cxt;
MemoryContext old_cxt;
- HASHCTL hash_ctl;
/*
* To ease cleanup, make a separate context that will contain the
@@ -262,29 +261,34 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
state->rs_cxt = rw_cxt;
state->rs_bulkstate = smgr_bulk_start_rel(new_heap, MAIN_FORKNUM);
- /* Initialize hash tables used to track update chains */
- hash_ctl.keysize = sizeof(TidHashKey);
- hash_ctl.entrysize = sizeof(UnresolvedTupData);
- hash_ctl.hcxt = state->rs_cxt;
-
- state->rs_unresolved_tups =
- hash_create("Rewrite / Unresolved ctids",
- 128, /* arbitrary initial size */
- &hash_ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
- hash_ctl.entrysize = sizeof(OldToNewMappingData);
+ if (tid_chains)
+ {
+ HASHCTL hash_ctl;
+
+ /* Initialize hash tables used to track update chains */
+ hash_ctl.keysize = sizeof(TidHashKey);
+ hash_ctl.entrysize = sizeof(UnresolvedTupData);
+ hash_ctl.hcxt = state->rs_cxt;
+
+ state->rs_unresolved_tups =
+ hash_create("Rewrite / Unresolved ctids",
+ 128, /* arbitrary initial size */
+ &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+ hash_ctl.entrysize = sizeof(OldToNewMappingData);
+
+ state->rs_old_new_tid_map =
+ hash_create("Rewrite / Old to new tid map",
+ 128, /* arbitrary initial size */
+ &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ }
- state->rs_old_new_tid_map =
- hash_create("Rewrite / Old to new tid map",
- 128, /* arbitrary initial size */
- &hash_ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ logical_begin_heap_rewrite(state);
MemoryContextSwitchTo(old_cxt);
- logical_begin_heap_rewrite(state);
-
return state;
}
@@ -303,12 +307,15 @@ end_heap_rewrite(RewriteState state)
* Write any remaining tuples in the UnresolvedTups table. If we have any
* left, they should in fact be dead, but let's err on the safe side.
*/
- hash_seq_init(&seq_status, state->rs_unresolved_tups);
-
- while ((unresolved = hash_seq_search(&seq_status)) != NULL)
+ if (state->rs_unresolved_tups)
{
- ItemPointerSetInvalid(&unresolved->tuple->t_data->t_ctid);
- raw_heap_insert(state, unresolved->tuple);
+ hash_seq_init(&seq_status, state->rs_unresolved_tups);
+
+ while ((unresolved = hash_seq_search(&seq_status)) != NULL)
+ {
+ ItemPointerSetInvalid(&unresolved->tuple->t_data->t_ctid);
+ raw_heap_insert(state, unresolved->tuple);
+ }
}
/* Write the last page, if any */
@@ -995,7 +1002,7 @@ logical_rewrite_log_mapping(RewriteState state, TransactionId xid,
* Perform logical remapping for a tuple that's mapped from old_tid to
* new_tuple->t_self by rewrite_heap_tuple() if necessary for the tuple.
*/
-static void
+void
logical_rewrite_heap_tuple(RewriteState state, ItemPointerData old_tid,
HeapTuple new_tuple)
{
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 1648269f6d..389e0ff184 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -23,6 +23,7 @@
#include "access/heapam.h"
#include "access/multixact.h"
#include "access/relscan.h"
+#include "access/rewriteheap.h"
#include "access/tableam.h"
#include "access/toast_internals.h"
#include "access/transam.h"
@@ -202,17 +203,21 @@ static HeapTuple get_changed_tuple(ConcurrentChange *change);
static void apply_concurrent_changes(ClusterDecodingState *dstate,
Relation rel, ScanKey key, int nkeys,
IndexInsertState *iistate,
- struct timeval *must_complete);
+ struct timeval *must_complete,
+ RewriteState rwstate);
static void apply_concurrent_insert(Relation rel, ConcurrentChange *change,
HeapTuple tup, IndexInsertState *iistate,
- TupleTableSlot *index_slot);
+ TupleTableSlot *index_slot,
+ RewriteState rwstate);
static void apply_concurrent_update(Relation rel, HeapTuple tup,
HeapTuple tup_target,
ConcurrentChange *change,
IndexInsertState *iistate,
- TupleTableSlot *index_slot);
+ TupleTableSlot *index_slot,
+ RewriteState rwstate);
static void apply_concurrent_delete(Relation rel, HeapTuple tup_target,
- ConcurrentChange *change);
+ ConcurrentChange *change,
+ RewriteState rwstate);
static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
HeapTuple tup_key,
Snapshot snapshot,
@@ -226,7 +231,8 @@ static bool process_concurrent_changes(LogicalDecodingContext *ctx,
ScanKey ident_key,
int ident_key_nentries,
IndexInsertState *iistate,
- struct timeval *must_complete);
+ struct timeval *must_complete,
+ RewriteState rwstate);
static bool processing_time_elapsed(struct timeval *must_complete);
static IndexInsertState *get_index_insert_state(Relation relation,
Oid ident_index_id);
@@ -3147,7 +3153,7 @@ cluster_decode_concurrent_changes(LogicalDecodingContext *ctx,
static void
apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
ScanKey key, int nkeys, IndexInsertState *iistate,
- struct timeval *must_complete)
+ struct timeval *must_complete, RewriteState rwstate)
{
TupleTableSlot *index_slot, *ident_slot;
HeapTuple tup_old = NULL;
@@ -3221,7 +3227,8 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
{
Assert(tup_old == NULL);
- apply_concurrent_insert(rel, change, tup, iistate, index_slot);
+ apply_concurrent_insert(rel, change, tup, iistate, index_slot,
+ rwstate);
pfree(tup);
}
@@ -3229,7 +3236,7 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
change->kind == CHANGE_DELETE)
{
IndexScanDesc ind_scan = NULL;
- HeapTuple tup_key;
+ HeapTuple tup_key, tup_exist_cp;
if (change->kind == CHANGE_UPDATE_NEW)
{
@@ -3271,11 +3278,23 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
if (tup_exist == NULL)
elog(ERROR, "Failed to find target tuple");
+ /*
+ * Update the mapping for xmax of the old version.
+ *
+ * Use a copy ('tup_exist' can point to a shared buffer) with xmin set
+ * to invalid, because the mapping for xmin should have been written on
+ * insertion.
+ */
+ tup_exist_cp = heap_copytuple(tup_exist);
+ HeapTupleHeaderSetXmin(tup_exist_cp->t_data, InvalidTransactionId);
+ logical_rewrite_heap_tuple(rwstate, change->old_tid, tup_exist_cp);
+ pfree(tup_exist_cp);
+
if (change->kind == CHANGE_UPDATE_NEW)
apply_concurrent_update(rel, tup, tup_exist, change, iistate,
- index_slot);
+ index_slot, rwstate);
else
- apply_concurrent_delete(rel, tup_exist, change);
+ apply_concurrent_delete(rel, tup_exist, change, rwstate);
ResetClusterCurrentXids();
@@ -3328,9 +3347,12 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
static void
apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
- IndexInsertState *iistate, TupleTableSlot *index_slot)
+ IndexInsertState *iistate, TupleTableSlot *index_slot,
+ RewriteState rwstate)
{
+ HeapTupleHeader tup_hdr = tup->t_data;
Snapshot snapshot = change->snapshot;
+ ItemPointerData old_tid;
List *recheck;
/*
@@ -3340,6 +3362,9 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
*/
SetClusterCurrentXids(snapshot->subxip, snapshot->subxcnt);
+ /* Remember location in the old heap. */
+ ItemPointerCopy(&tup_hdr->t_ctid, &old_tid);
+
/*
* Write the tuple into the new heap.
*
@@ -3355,6 +3380,14 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
heap_insert(rel, tup, change->xid, snapshot->curcid - 1,
HEAP_INSERT_NO_LOGICAL, NULL);
+ /*
+ * Update the mapping for xmin (xmax should be invalid). This is needed
+ * because, during the processing, the table is considered a "user
+ * catalog".
+ */
+ Assert(!TransactionIdIsValid(HeapTupleHeaderGetRawXmax(tup->t_data)));
+ logical_rewrite_heap_tuple(rwstate, old_tid, tup);
+
/*
* Update indexes.
*
@@ -3388,15 +3421,22 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
static void
apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
ConcurrentChange *change, IndexInsertState *iistate,
- TupleTableSlot *index_slot)
+ TupleTableSlot *index_slot, RewriteState rwstate)
{
List *recheck;
LockTupleMode lockmode;
TU_UpdateIndexes update_indexes;
+ ItemPointerData tid_new_old_heap, tid_old_new_heap;
TM_Result res;
Snapshot snapshot = change->snapshot;
TM_FailureData tmfd;
+ /* Location of the new tuple in the old heap. */
+ ItemPointerCopy(&tup->t_data->t_ctid, &tid_new_old_heap);
+
+ /* Location of the existing tuple in the new heap. */
+ ItemPointerCopy(&tup_target->t_self, &tid_old_new_heap);
+
/*
* Write the new tuple into the new heap. ('tup' gets the TID assigned
* here.)
@@ -3406,7 +3446,7 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
Assert(snapshot->curcid != InvalidCommandId &&
snapshot->curcid > FirstCommandId);
- res = heap_update(rel, &tup_target->t_self, tup,
+ res = heap_update(rel, &tid_old_new_heap, tup,
change->xid, snapshot->curcid - 1,
InvalidSnapshot,
false, /* no wait - only we are doing changes */
@@ -3416,6 +3456,10 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
if (res != TM_Ok)
ereport(ERROR, (errmsg("failed to apply concurrent UPDATE")));
+ /* Update the mapping for xmin of the new version. */
+ Assert(!TransactionIdIsValid(HeapTupleHeaderGetRawXmax(tup->t_data)));
+ logical_rewrite_heap_tuple(rwstate, tid_new_old_heap, tup);
+
ExecStoreHeapTuple(tup, index_slot, false);
if (update_indexes != TU_None)
@@ -3439,8 +3483,9 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
static void
apply_concurrent_delete(Relation rel, HeapTuple tup_target,
- ConcurrentChange *change)
+ ConcurrentChange *change, RewriteState rwstate)
{
+ ItemPointerData tid_old_new_heap;
TM_Result res;
TM_FailureData tmfd;
Snapshot snapshot = change->snapshot;
@@ -3449,7 +3494,10 @@ apply_concurrent_delete(Relation rel, HeapTuple tup_target,
Assert(snapshot->curcid != InvalidCommandId &&
snapshot->curcid > FirstCommandId);
- res = heap_delete(rel, &tup_target->t_self, change->xid,
+ /* Location of the existing tuple in the new heap. */
+ ItemPointerCopy(&tup_target->t_self, &tid_old_new_heap);
+
+ res = heap_delete(rel, &tid_old_new_heap, change->xid,
snapshot->curcid - 1, InvalidSnapshot, false,
&tmfd, false,
/* wal_logical */
@@ -3530,7 +3578,8 @@ static bool
process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
Relation rel_dst, Relation rel_src, ScanKey ident_key,
int ident_key_nentries, IndexInsertState *iistate,
- struct timeval *must_complete)
+ struct timeval *must_complete,
+ RewriteState rwstate)
{
ClusterDecodingState *dstate;
@@ -3563,7 +3612,8 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
rel_dst->rd_toastoid = rel_src->rd_rel->reltoastrelid;
apply_concurrent_changes(dstate, rel_dst, ident_key,
- ident_key_nentries, iistate, must_complete);
+ ident_key_nentries, iistate, must_complete,
+ rwstate);
}
PG_FINALLY();
{
@@ -3748,6 +3798,7 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
bool is_system_catalog;
Oid ident_idx_old, ident_idx_new;
IndexInsertState *iistate;
+ RewriteState rwstate;
ScanKey ident_key;
int ident_key_nentries;
XLogRecPtr wal_insert_ptr, end_of_wal;
@@ -3833,11 +3884,26 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
* Apply concurrent changes first time, to minimize the time we need to
* hold AccessExclusiveLock. (Quite some amount of WAL could have been
* written during the data copying and index creation.)
+ *
+ * Now we are processing individual tuples, so pass false for
+ * 'tid_chains'. Since rwstate is now only needed for
+ * logical_begin_heap_rewrite(), none of the transaction IDs needs to be
+ * valid.
*/
+ rwstate = begin_heap_rewrite(OldHeap, NewHeap,
+ InvalidTransactionId,
+ InvalidTransactionId,
+ InvalidTransactionId,
+ false);
process_concurrent_changes(ctx, end_of_wal, NewHeap,
swap_toast_by_content ? OldHeap : NULL,
ident_key, ident_key_nentries, iistate,
- NULL);
+ NULL, rwstate);
+ /*
+ * OldHeap will be closed, so we need to initialize rwstate again for the
+ * next call of process_concurrent_changes().
+ */
+ end_heap_rewrite(rwstate);
/*
* Release the locks that allowed concurrent data changes, in order to
@@ -3959,6 +4025,11 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
end_of_wal = GetFlushRecPtr(NULL);
/* Apply the concurrent changes again. */
+ rwstate = begin_heap_rewrite(OldHeap, NewHeap,
+ InvalidTransactionId,
+ InvalidTransactionId,
+ InvalidTransactionId,
+ false);
/*
* This time we have the exclusive lock on the table, so make sure that
* cluster_max_xlock_time is not exceeded.
@@ -3986,11 +4057,12 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
if (!process_concurrent_changes(ctx, end_of_wal, NewHeap,
swap_toast_by_content ? OldHeap : NULL,
ident_key, ident_key_nentries, iistate,
- t_end_ptr))
+ t_end_ptr, rwstate))
ereport(ERROR,
(errmsg("could not process concurrent data changes in time"),
errhint("Please consider adjusting \"cluster_max_xlock_time\".")));
+ end_heap_rewrite(rwstate);
/* Remember info about rel before closing OldHeap */
relpersistence = OldHeap->rd_rel->relpersistence;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 1f30e12537..3c9ab8fa61 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -987,11 +987,13 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
xl_heap_insert *xlrec;
ReorderBufferChange *change;
RelFileLocator target_locator;
+ BlockNumber blknum;
+ HeapTupleHeader tuphdr;
xlrec = (xl_heap_insert *) XLogRecGetData(r);
/* only interested in our database */
- XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
+ XLogRecGetBlockTag(r, 0, &target_locator, NULL, &blknum);
if (target_locator.dbOid != ctx->slot->data.database)
return;
@@ -1016,6 +1018,13 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
DecodeXLogTuple(tupledata, datalen, change->data.tp.newtuple);
+ /*
+ * CTID is needed for logical_rewrite_heap_tuple(), when doing CLUSTER
+ * CONCURRENTLY.
+ */
+ tuphdr = change->data.tp.newtuple->t_data;
+ ItemPointerSet(&tuphdr->t_ctid, blknum, xlrec->offnum);
+
change->data.tp.clear_toast_afterwards = true;
ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
@@ -1037,11 +1046,14 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
ReorderBufferChange *change;
char *data;
RelFileLocator target_locator;
+ BlockNumber old_blknum, new_blknum;
xlrec = (xl_heap_update *) XLogRecGetData(r);
+ /* Retrieve blknum, so that we can compose CTID below. */
+ XLogRecGetBlockTag(r, 0, &target_locator, NULL, &new_blknum);
+
/* only interested in our database */
- XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
if (target_locator.dbOid != ctx->slot->data.database)
return;
@@ -1058,6 +1070,7 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
{
Size datalen;
Size tuplelen;
+ HeapTupleHeader tuphdr;
data = XLogRecGetBlockData(r, 0, &datalen);
@@ -1067,6 +1080,13 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
ReorderBufferGetTupleBuf(ctx->reorder, tuplelen);
DecodeXLogTuple(data, datalen, change->data.tp.newtuple);
+
+ /*
+ * CTID is needed for logical_rewrite_heap_tuple(), when doing CLUSTER
+ * CONCURRENTLY.
+ */
+ tuphdr = change->data.tp.newtuple->t_data;
+ ItemPointerSet(&tuphdr->t_ctid, new_blknum, xlrec->new_offnum);
}
if (xlrec->flags & XLH_UPDATE_CONTAINS_OLD)
@@ -1085,6 +1105,14 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
DecodeXLogTuple(data, datalen, change->data.tp.oldtuple);
}
+ /*
+ * Remember the old tuple CTID, for the sake of
+ * logical_rewrite_heap_tuple().
+ */
+ if (!XLogRecGetBlockTagExtended(r, 1, NULL, NULL, &old_blknum, NULL))
+ old_blknum = new_blknum;
+ ItemPointerSet(&change->data.tp.old_tid, old_blknum, xlrec->old_offnum);
+
change->data.tp.clear_toast_afterwards = true;
ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
@@ -1103,11 +1131,12 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
xl_heap_delete *xlrec;
ReorderBufferChange *change;
RelFileLocator target_locator;
+ BlockNumber blknum;
xlrec = (xl_heap_delete *) XLogRecGetData(r);
/* only interested in our database */
- XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
+ XLogRecGetBlockTag(r, 0, &target_locator, NULL, &blknum);
if (target_locator.dbOid != ctx->slot->data.database)
return;
@@ -1139,6 +1168,12 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
DecodeXLogTuple((char *) xlrec + SizeOfHeapDelete,
datalen, change->data.tp.oldtuple);
+
+ /*
+ * CTID is needed for logical_rewrite_heap_tuple(), when doing CLUSTER
+ * CONCURRENTLY.
+ */
+ ItemPointerSet(&change->data.tp.old_tid, blknum, xlrec->offnum);
}
change->data.tp.clear_toast_afterwards = true;
diff --git a/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c b/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
index db6a2bcf1f..54a7e3ca68 100644
--- a/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
+++ b/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
@@ -34,7 +34,7 @@ static void plugin_truncate(struct LogicalDecodingContext *ctx,
ReorderBufferChange *change);
static void store_change(LogicalDecodingContext *ctx,
ConcurrentChangeKind kind, HeapTuple tuple,
- TransactionId xid);
+ TransactionId xid, ItemPointer old_tid);
void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
@@ -169,7 +169,8 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (newtuple == NULL)
elog(ERROR, "Incomplete insert info.");
- store_change(ctx, CHANGE_INSERT, newtuple, change->txn->xid);
+ store_change(ctx, CHANGE_INSERT, newtuple, change->txn->xid,
+ NULL);
}
break;
case REORDER_BUFFER_CHANGE_UPDATE:
@@ -187,10 +188,10 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (oldtuple != NULL)
store_change(ctx, CHANGE_UPDATE_OLD, oldtuple,
- change->txn->xid);
+ change->txn->xid, NULL);
store_change(ctx, CHANGE_UPDATE_NEW, newtuple,
- change->txn->xid);
+ change->txn->xid, &change->data.tp.old_tid);
}
break;
case REORDER_BUFFER_CHANGE_DELETE:
@@ -203,7 +204,8 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (oldtuple == NULL)
elog(ERROR, "Incomplete delete info.");
- store_change(ctx, CHANGE_DELETE, oldtuple, change->txn->xid);
+ store_change(ctx, CHANGE_DELETE, oldtuple, change->txn->xid,
+ &change->data.tp.old_tid);
}
break;
default:
@@ -237,13 +239,13 @@ plugin_truncate(struct LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (i == nrelations)
return;
- store_change(ctx, CHANGE_TRUNCATE, NULL, InvalidTransactionId);
+ store_change(ctx, CHANGE_TRUNCATE, NULL, InvalidTransactionId, NULL);
}
/* Store concurrent data change. */
static void
store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
- HeapTuple tuple, TransactionId xid)
+ HeapTuple tuple, TransactionId xid, ItemPointer old_tid)
{
ClusterDecodingState *dstate;
char *change_raw;
@@ -308,6 +310,11 @@ store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
change->snapshot = dstate->snapshot;
dstate->snapshot->active_count++;
+ if (old_tid)
+ ItemPointerCopy(old_tid, &change->old_tid);
+ else
+ ItemPointerSetInvalid(&change->old_tid);
+
/* The data has been copied. */
if (flattened)
pfree(tuple);
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 5866a26bdd..de62b6abf8 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -23,11 +23,14 @@ typedef struct RewriteStateData *RewriteState;
extern RewriteState begin_heap_rewrite(Relation old_heap, Relation new_heap,
TransactionId oldest_xmin, TransactionId freeze_xid,
- MultiXactId cutoff_multi);
+ MultiXactId cutoff_multi, bool tid_chains);
extern void end_heap_rewrite(RewriteState state);
extern void rewrite_heap_tuple(RewriteState state, HeapTuple old_tuple,
HeapTuple new_tuple);
extern bool rewrite_heap_dead_tuple(RewriteState state, HeapTuple old_tuple);
+extern void logical_rewrite_heap_tuple(RewriteState state,
+ ItemPointerData old_tid,
+ HeapTuple new_tuple);
/*
* On-Disk data format for an individual logical rewrite mapping.
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 5d904ce985..69a9aba050 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -76,6 +76,9 @@ typedef struct ConcurrentChange
/* Transaction that changes the data. */
TransactionId xid;
+ /* For UPDATE / DELETE, the location of the old tuple version. */
+ ItemPointerData old_tid;
+
/*
* Historic catalog snapshot that was used to decode this change.
*/
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index e332635f70..009bbaa1fa 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -104,6 +104,13 @@ typedef struct ReorderBufferChange
HeapTuple oldtuple;
/* valid for INSERT || UPDATE */
HeapTuple newtuple;
+
+ /*
+ * CLUSTER CONCURRENTLY needs the old TID, even if the old tuple
+ * itself is not WAL-logged (i.e. when the identity key does not
+ * change).
+ */
+ ItemPointerData old_tid;
} tp;
/*
--
2.45.2
While trying to figure out why the regression tests sometimes fail on the
cfbot (not sure about the reason yet), I fixed some confusing code in
begin_concurrent_cluster() and end_concurrent_cluster(). Those functions
should now be a bit easier to understand.
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Attachments:
v04-0001-Adjust-signature-of-cluster_rel-and-its-subroutines.patch
From 4568b1a8ac7f74f54720a413fd8cd9cda1be32a7 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Fri, 6 Sep 2024 09:55:54 +0200
Subject: [PATCH 1/8] Adjust signature of cluster_rel() and its subroutines.
So far, cluster_rel() received the OID of the relation it should process and
performed the opening and locking of the relation itself. Yet copy_table_data()
received the OID as well and also had to open the relation itself. This patch
eliminates the repeated opening and closing.
One particular reason for this change is that the VACUUM FULL / CLUSTER
command with the CONCURRENTLY option will need to release all locks on the
relation (and possibly on the clustering index) at some point. Since it makes
little sense to keep a relation reference without a lock, the cluster_rel()
function also closes its reference to the relation (and its index). Neither
the function nor its subroutines may open extra references, because then it
would be a bit harder to close them all.
---
src/backend/commands/cluster.c | 146 ++++++++++++++++++-------------
src/backend/commands/matview.c | 2 +-
src/backend/commands/tablecmds.c | 2 +-
src/backend/commands/vacuum.c | 12 +--
src/include/commands/cluster.h | 5 +-
5 files changed, 99 insertions(+), 68 deletions(-)
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 78f96789b0..bedc177ce4 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -70,8 +70,8 @@ typedef struct
static void cluster_multiple_rels(List *rtcs, ClusterParams *params);
-static void rebuild_relation(Relation OldHeap, Oid indexOid, bool verbose);
-static void copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex,
+static void rebuild_relation(Relation OldHeap, Relation index, bool verbose);
+static void copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
bool verbose, bool *pSwapToastByContent,
TransactionId *pFreezeXid, MultiXactId *pCutoffMulti);
static List *get_tables_to_cluster(MemoryContext cluster_context);
@@ -194,11 +194,11 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
{
- /* close relation, keep lock till commit */
- table_close(rel, NoLock);
-
- /* Do the job. */
- cluster_rel(tableOid, indexOid, &params);
+ /*
+ * Do the job. (The function will close the relation, lock is kept
+ * till commit.)
+ */
+ cluster_rel(rel, indexOid, &params);
return;
}
@@ -275,6 +275,7 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
foreach(lc, rtcs)
{
RelToCluster *rtc = (RelToCluster *) lfirst(lc);
+ Relation rel;
/* Start a new transaction for each relation. */
StartTransactionCommand();
@@ -282,8 +283,13 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
/* functions in indexes may want a snapshot set */
PushActiveSnapshot(GetTransactionSnapshot());
- /* Do the job. */
- cluster_rel(rtc->tableOid, rtc->indexOid, params);
+ rel = table_open(rtc->tableOid, AccessExclusiveLock);
+
+ /*
+ * Do the job. (The function will close the relation, lock is kept
+ * till commit.)
+ */
+ cluster_rel(rel, rtc->indexOid, params);
PopActiveSnapshot();
CommitTransactionCommand();
@@ -306,16 +312,19 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
* If indexOid is InvalidOid, the table will be rewritten in physical order
* instead of index order. This is the new implementation of VACUUM FULL,
* and error messages should refer to the operation as VACUUM not CLUSTER.
+ *
+ * We expect that OldHeap is already locked in AccessExclusiveLock mode.
*/
void
-cluster_rel(Oid tableOid, Oid indexOid, ClusterParams *params)
+cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
{
- Relation OldHeap;
+ Oid tableOid = RelationGetRelid(OldHeap);
Oid save_userid;
int save_sec_context;
int save_nestlevel;
bool verbose = ((params->options & CLUOPT_VERBOSE) != 0);
bool recheck = ((params->options & CLUOPT_RECHECK) != 0);
+ Relation index = NULL;
/* Check for user-requested abort. */
CHECK_FOR_INTERRUPTS();
@@ -328,21 +337,6 @@ cluster_rel(Oid tableOid, Oid indexOid, ClusterParams *params)
pgstat_progress_update_param(PROGRESS_CLUSTER_COMMAND,
PROGRESS_CLUSTER_COMMAND_VACUUM_FULL);
- /*
- * We grab exclusive access to the target rel and index for the duration
- * of the transaction. (This is redundant for the single-transaction
- * case, since cluster() already did it.) The index lock is taken inside
- * check_index_is_clusterable.
- */
- OldHeap = try_relation_open(tableOid, AccessExclusiveLock);
-
- /* If the table has gone away, we can skip processing it */
- if (!OldHeap)
- {
- pgstat_progress_end_command();
- return;
- }
-
/*
* Switch to the table owner's userid, so that any index functions are run
* as that user. Also lock down security-restricted operations and
@@ -445,7 +439,11 @@ cluster_rel(Oid tableOid, Oid indexOid, ClusterParams *params)
/* Check heap and index are valid to cluster on */
if (OidIsValid(indexOid))
+ {
check_index_is_clusterable(OldHeap, indexOid, AccessExclusiveLock);
+ /* Open the index. (It should already be locked.) */
+ index = index_open(indexOid, NoLock);
+ }
/*
* Quietly ignore the request if this is a materialized view which has not
@@ -474,9 +472,12 @@ cluster_rel(Oid tableOid, Oid indexOid, ClusterParams *params)
TransferPredicateLocksToHeapRelation(OldHeap);
/* rebuild_relation does all the dirty work */
- rebuild_relation(OldHeap, indexOid, verbose);
+ rebuild_relation(OldHeap, index, verbose);
- /* NB: rebuild_relation does table_close() on OldHeap */
+ /*
+ * NB: rebuild_relation does table_close() on OldHeap, and also on index,
+ * if the pointer is valid.
+ */
out:
/* Roll back any GUC changes executed by index functions */
@@ -625,22 +626,27 @@ mark_index_clustered(Relation rel, Oid indexOid, bool is_internal)
* rebuild_relation: rebuild an existing relation in index or physical order
*
* OldHeap: table to rebuild --- must be opened and exclusive-locked!
- * indexOid: index to cluster by, or InvalidOid to rewrite in physical order.
+ * index: index to cluster by, or NULL to rewrite in physical order. Must be
+ * opened and locked.
*
- * NB: this routine closes OldHeap at the right time; caller should not.
+ * On exit, the heap (and also the index, if one was passed) are closed, but
+ * still locked with AccessExclusiveLock.
*/
static void
-rebuild_relation(Relation OldHeap, Oid indexOid, bool verbose)
+rebuild_relation(Relation OldHeap, Relation index, bool verbose)
{
Oid tableOid = RelationGetRelid(OldHeap);
+ Oid indexOid = index ? RelationGetRelid(index) : InvalidOid;
Oid accessMethod = OldHeap->rd_rel->relam;
Oid tableSpace = OldHeap->rd_rel->reltablespace;
Oid OIDNewHeap;
+ Relation NewHeap;
char relpersistence;
bool is_system_catalog;
bool swap_toast_by_content;
TransactionId frozenXid;
MultiXactId cutoffMulti;
+ LOCKMODE lockmode_new;
if (OidIsValid(indexOid))
/* Mark the correct index as clustered */
@@ -650,19 +656,40 @@ rebuild_relation(Relation OldHeap, Oid indexOid, bool verbose)
relpersistence = OldHeap->rd_rel->relpersistence;
is_system_catalog = IsSystemRelation(OldHeap);
- /* Close relcache entry, but keep lock until transaction commit */
- table_close(OldHeap, NoLock);
-
- /* Create the transient table that will receive the re-ordered data */
+ /*
+ * Create the transient table that will receive the re-ordered data.
+ *
+ * NoLock for the old heap because we already have it locked and want to
+ * keep unlocking straightforward.
+ */
+ lockmode_new = AccessExclusiveLock;
OIDNewHeap = make_new_heap(tableOid, tableSpace,
accessMethod,
relpersistence,
- AccessExclusiveLock);
+ NoLock, &lockmode_new);
+ Assert(lockmode_new == AccessExclusiveLock || lockmode_new == NoLock);
+ /* Lock iff not done above. */
+ NewHeap = table_open(OIDNewHeap, lockmode_new == NoLock ?
+ AccessExclusiveLock : NoLock);
/* Copy the heap data into the new table in the desired order */
- copy_table_data(OIDNewHeap, tableOid, indexOid, verbose,
+ copy_table_data(NewHeap, OldHeap, index, verbose,
&swap_toast_by_content, &frozenXid, &cutoffMulti);
+
+ /* Close relcache entries, but keep lock until transaction commit */
+ table_close(OldHeap, NoLock);
+ if (index)
+ index_close(index, NoLock);
+
+ /*
+ * Close the new relation so it can be dropped as soon as the storage is
+ * swapped. The relation is not visible to others, so we could unlock it
+ * completely, but it's simpler to pass NoLock than to track all the locks
+ * acquired so far.
+ */
+ table_close(NewHeap, NoLock);
+
/*
* Swap the physical files of the target and transient tables, then
* rebuild the target's indexes and throw away the transient table.
@@ -683,10 +710,15 @@ rebuild_relation(Relation OldHeap, Oid indexOid, bool verbose)
*
* After this, the caller should load the new heap with transferred/modified
* data, then call finish_heap_swap to complete the operation.
+ *
+ * If a specific lock mode is needed for the new relation, pass it via the
+ * in/out parameter lockmode_new_p. On exit, the output value tells whether
+ * the lock was actually acquired.
*/
Oid
make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
- char relpersistence, LOCKMODE lockmode)
+ char relpersistence, LOCKMODE lockmode_old,
+ LOCKMODE *lockmode_new_p)
{
TupleDesc OldHeapDesc;
char NewHeapName[NAMEDATALEN];
@@ -697,8 +729,17 @@ make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
Datum reloptions;
bool isNull;
Oid namespaceid;
+ LOCKMODE lockmode_new;
- OldHeap = table_open(OIDOldHeap, lockmode);
+ if (lockmode_new_p)
+ {
+ lockmode_new = *lockmode_new_p;
+ *lockmode_new_p = NoLock;
+ }
+ else
+ lockmode_new = lockmode_old;
+
+ OldHeap = table_open(OIDOldHeap, lockmode_old);
OldHeapDesc = RelationGetDescr(OldHeap);
/*
@@ -792,7 +833,9 @@ make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
if (isNull)
reloptions = (Datum) 0;
- NewHeapCreateToastTable(OIDNewHeap, reloptions, lockmode, toastid);
+ NewHeapCreateToastTable(OIDNewHeap, reloptions, lockmode_new, toastid);
+ if (lockmode_new_p)
+ *lockmode_new_p = lockmode_new;
ReleaseSysCache(tuple);
}
@@ -811,13 +854,13 @@ make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
* *pCutoffMulti receives the MultiXactId used as a cutoff point.
*/
static void
-copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
+copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verbose,
bool *pSwapToastByContent, TransactionId *pFreezeXid,
MultiXactId *pCutoffMulti)
{
- Relation NewHeap,
- OldHeap,
- OldIndex;
+ Oid OIDOldHeap = RelationGetRelid(OldHeap);
+ Oid OIDOldIndex = OldIndex ? RelationGetRelid(OldIndex) : InvalidOid;
+ Oid OIDNewHeap = RelationGetRelid(NewHeap);
Relation relRelation;
HeapTuple reltup;
Form_pg_class relform;
@@ -836,16 +879,6 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
pg_rusage_init(&ru0);
- /*
- * Open the relations we need.
- */
- NewHeap = table_open(OIDNewHeap, AccessExclusiveLock);
- OldHeap = table_open(OIDOldHeap, AccessExclusiveLock);
- if (OidIsValid(OIDOldIndex))
- OldIndex = index_open(OIDOldIndex, AccessExclusiveLock);
- else
- OldIndex = NULL;
-
/* Store a copy of the namespace name for logging purposes */
nspname = get_namespace_name(RelationGetNamespace(OldHeap));
@@ -1001,11 +1034,6 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
tups_recently_dead,
pg_rusage_show(&ru0))));
- if (OldIndex != NULL)
- index_close(OldIndex, NoLock);
- table_close(OldHeap, NoLock);
- table_close(NewHeap, NoLock);
-
/* Update pg_class to reflect the correct values of pages and tuples. */
relRelation = table_open(RelationRelationId, RowExclusiveLock);
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index b2457f121a..7da6647f8f 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -318,7 +318,7 @@ RefreshMatViewByOid(Oid matviewOid, bool is_create, bool skipData,
*/
OIDNewHeap = make_new_heap(matviewOid, tableSpace,
matviewRel->rd_rel->relam,
- relpersistence, ExclusiveLock);
+ relpersistence, ExclusiveLock, NULL);
LockRelationOid(OIDNewHeap, AccessExclusiveLock);
/* Generate the data, if wanted. */
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index b3cc6f8f69..2b20b03224 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -5783,7 +5783,7 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
* unlogged anyway.
*/
OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, NewAccessMethod,
- persistence, lockmode);
+ persistence, lockmode, NULL);
/*
* Copy the heap data into the new table with the desired
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 7d8e9d2045..d32068b5d5 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -2193,15 +2193,17 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
{
ClusterParams cluster_params = {0};
- /* close relation before vacuuming, but hold lock until commit */
- relation_close(rel, NoLock);
- rel = NULL;
-
if ((params->options & VACOPT_VERBOSE) != 0)
cluster_params.options |= CLUOPT_VERBOSE;
/* VACUUM FULL is now a variant of CLUSTER; see cluster.c */
- cluster_rel(relid, InvalidOid, &cluster_params);
+ cluster_rel(rel, InvalidOid, &cluster_params);
+
+ /*
+ * cluster_rel() should have closed the relation; the lock is kept
+ * until commit.
+ */
+ rel = NULL;
}
else
table_relation_vacuum(rel, params, bstrategy);
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 4e32380417..7492796ea2 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -32,13 +32,14 @@ typedef struct ClusterParams
} ClusterParams;
extern void cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel);
-extern void cluster_rel(Oid tableOid, Oid indexOid, ClusterParams *params);
+extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params);
extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid,
LOCKMODE lockmode);
extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
extern Oid make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
- char relpersistence, LOCKMODE lockmode);
+ char relpersistence, LOCKMODE lockmode_old,
+ LOCKMODE *lockmode_new_p);
extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
bool is_system_catalog,
bool swap_toast_by_content,
--
2.45.2
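The vacuum.c hunk above changes the calling convention: vacuum_rel() now hands its open Relation to cluster_rel(), which becomes responsible for closing it (while the heavyweight lock is still held until commit). A minimal C sketch of this ownership transfer, using hypothetical stand-in types rather than the real PostgreSQL definitions:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for PostgreSQL's Relation; illustration only. */
typedef struct RelationSketch
{
    int is_open;
} RelationSketch;

/* After the patch, cluster_rel() receives the open relation and closes
 * it itself; the lock is kept until commit. */
static void
cluster_rel_sketch(RelationSketch *rel)
{
    /* ... rewrite the heap here ... */
    rel->is_open = 0;           /* cluster_rel() closes the relation */
}

/* vacuum_rel() no longer closes the relation before the call; it only
 * forgets its pointer afterwards, mirroring the hunk in vacuum.c. */
static RelationSketch *
vacuum_full_sketch(RelationSketch *rel)
{
    cluster_rel_sketch(rel);    /* consumes and closes rel */
    return NULL;                /* rel must not be used any more */
}
```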
v04-0002-Move-progress-related-fields-from-PgBackendStatus-to.patch
From e0fbd580cbe3f3f6ba41d40638aa4c57c4830d23 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Fri, 6 Sep 2024 09:55:54 +0200
Subject: [PATCH 2/8] Move progress related fields from PgBackendStatus to
PgBackendProgress.
VACUUM FULL / CLUSTER CONCURRENTLY will need to save and restore these fields
at some point.
---
src/backend/utils/activity/backend_progress.c | 18 +++++++++---------
src/backend/utils/activity/backend_status.c | 4 ++--
src/backend/utils/adt/pgstatfuncs.c | 6 +++---
src/include/utils/backend_progress.h | 14 ++++++++++++++
src/include/utils/backend_status.h | 14 ++------------
5 files changed, 30 insertions(+), 26 deletions(-)
diff --git a/src/backend/utils/activity/backend_progress.c b/src/backend/utils/activity/backend_progress.c
index c78c5eb507..cc9b4cf0dc 100644
--- a/src/backend/utils/activity/backend_progress.c
+++ b/src/backend/utils/activity/backend_progress.c
@@ -33,9 +33,9 @@ pgstat_progress_start_command(ProgressCommandType cmdtype, Oid relid)
return;
PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
- beentry->st_progress_command = cmdtype;
- beentry->st_progress_command_target = relid;
- MemSet(&beentry->st_progress_param, 0, sizeof(beentry->st_progress_param));
+ beentry->st_progress.command = cmdtype;
+ beentry->st_progress.command_target = relid;
+ MemSet(&beentry->st_progress.param, 0, sizeof(beentry->st_progress.param));
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
@@ -56,7 +56,7 @@ pgstat_progress_update_param(int index, int64 val)
return;
PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
- beentry->st_progress_param[index] = val;
+ beentry->st_progress.param[index] = val;
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
@@ -77,7 +77,7 @@ pgstat_progress_incr_param(int index, int64 incr)
return;
PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
- beentry->st_progress_param[index] += incr;
+ beentry->st_progress.param[index] += incr;
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
@@ -134,7 +134,7 @@ pgstat_progress_update_multi_param(int nparam, const int *index,
{
Assert(index[i] >= 0 && index[i] < PGSTAT_NUM_PROGRESS_PARAM);
- beentry->st_progress_param[index[i]] = val[i];
+ beentry->st_progress.param[index[i]] = val[i];
}
PGSTAT_END_WRITE_ACTIVITY(beentry);
@@ -155,11 +155,11 @@ pgstat_progress_end_command(void)
if (!beentry || !pgstat_track_activities)
return;
- if (beentry->st_progress_command == PROGRESS_COMMAND_INVALID)
+ if (beentry->st_progress.command == PROGRESS_COMMAND_INVALID)
return;
PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
- beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
- beentry->st_progress_command_target = InvalidOid;
+ beentry->st_progress.command = PROGRESS_COMMAND_INVALID;
+ beentry->st_progress.command_target = InvalidOid;
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 34a55e2177..2b77fd8526 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -378,8 +378,8 @@ pgstat_bestart(void)
#endif
lbeentry.st_state = STATE_UNDEFINED;
- lbeentry.st_progress_command = PROGRESS_COMMAND_INVALID;
- lbeentry.st_progress_command_target = InvalidOid;
+ lbeentry.st_progress.command = PROGRESS_COMMAND_INVALID;
+ lbeentry.st_progress.command_target = InvalidOid;
lbeentry.st_query_id = UINT64CONST(0);
/*
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 97dc09ac0d..a005b746df 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -269,7 +269,7 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
* Report values for only those backends which are running the given
* command.
*/
- if (beentry->st_progress_command != cmdtype)
+ if (beentry->st_progress.command != cmdtype)
continue;
/* Value available to all callers */
@@ -279,9 +279,9 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
/* show rest of the values including relid only to role members */
if (HAS_PGSTAT_PERMISSIONS(beentry->st_userid))
{
- values[2] = ObjectIdGetDatum(beentry->st_progress_command_target);
+ values[2] = ObjectIdGetDatum(beentry->st_progress.command_target);
for (i = 0; i < PGSTAT_NUM_PROGRESS_PARAM; i++)
- values[i + 3] = Int64GetDatum(beentry->st_progress_param[i]);
+ values[i + 3] = Int64GetDatum(beentry->st_progress.param[i]);
}
else
{
diff --git a/src/include/utils/backend_progress.h b/src/include/utils/backend_progress.h
index 7b63d38f97..e09598eafc 100644
--- a/src/include/utils/backend_progress.h
+++ b/src/include/utils/backend_progress.h
@@ -30,8 +30,22 @@ typedef enum ProgressCommandType
PROGRESS_COMMAND_COPY,
} ProgressCommandType;
+
#define PGSTAT_NUM_PROGRESS_PARAM 20
+/*
+ * Any command which wishes can advertise that it is running by setting
+ * command, command_target, and param[]. command_target should be the OID of
+ * the relation which the command targets (we assume there's just one, as this
+ * is meant for utility commands), but the meaning of each element in the
+ * param array is command-specific.
+ */
+typedef struct PgBackendProgress
+{
+ ProgressCommandType command;
+ Oid command_target;
+ int64 param[PGSTAT_NUM_PROGRESS_PARAM];
+} PgBackendProgress;
extern void pgstat_progress_start_command(ProgressCommandType cmdtype,
Oid relid);
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 97874300c3..335faafcef 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -155,18 +155,8 @@ typedef struct PgBackendStatus
*/
char *st_activity_raw;
- /*
- * Command progress reporting. Any command which wishes can advertise
- * that it is running by setting st_progress_command,
- * st_progress_command_target, and st_progress_param[].
- * st_progress_command_target should be the OID of the relation which the
- * command targets (we assume there's just one, as this is meant for
- * utility commands), but the meaning of each element in the
- * st_progress_param array is command-specific.
- */
- ProgressCommandType st_progress_command;
- Oid st_progress_command_target;
- int64 st_progress_param[PGSTAT_NUM_PROGRESS_PARAM];
+ /* Command progress reporting. */
+ PgBackendProgress st_progress;
/* query identifier, optionally computed using post_parse_analyze_hook */
uint64 st_query_id;
--
2.45.2
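The point of this refactoring is that once the three progress fields live in one struct, saving and restoring them (which VACUUM FULL / CLUSTER CONCURRENTLY needs to do) becomes a plain struct assignment. A simplified sketch, with `int` and `unsigned` standing in for the real ProgressCommandType and Oid types:

```c
#include <assert.h>
#include <stdint.h>

#define PGSTAT_NUM_PROGRESS_PARAM 20

/* Simplified version of the PgBackendProgress struct from the patch. */
typedef struct PgBackendProgress
{
    int         command;        /* ProgressCommandType in the real code */
    unsigned    command_target; /* Oid in the real code */
    int64_t     param[PGSTAT_NUM_PROGRESS_PARAM];
} PgBackendProgress;

/* Saving and restoring the whole progress state is one assignment each
 * way, instead of copying three separate fields by hand. */
static PgBackendProgress
progress_save(const PgBackendProgress *st)
{
    return *st;
}

static void
progress_restore(PgBackendProgress *st, const PgBackendProgress *saved)
{
    *st = *saved;
}
```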
v04-0003-Move-conversion-of-a-historic-to-MVCC-snapshot-to-a-.patch
From 8d1e2c93f4da77b298b14054f9ea8429ab699d53 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Fri, 6 Sep 2024 09:55:54 +0200
Subject: [PATCH 3/8] Move conversion of a "historic" to MVCC snapshot to a
separate function.
The conversion is now handled by SnapBuildMVCCFromHistoric(). The VACUUM FULL
/ CLUSTER will also need it.
---
src/backend/replication/logical/snapbuild.c | 51 +++++++++++++++++----
src/backend/utils/time/snapmgr.c | 3 +-
src/include/replication/snapbuild.h | 1 +
src/include/utils/snapmgr.h | 1 +
4 files changed, 45 insertions(+), 11 deletions(-)
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 0450f94ba8..4923e35e92 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -579,10 +579,7 @@ Snapshot
SnapBuildInitialSnapshot(SnapBuild *builder)
{
Snapshot snap;
- TransactionId xid;
TransactionId safeXid;
- TransactionId *newxip;
- int newxcnt = 0;
Assert(XactIsoLevel == XACT_REPEATABLE_READ);
Assert(builder->building_full_snapshot);
@@ -624,6 +621,31 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
MyProc->xmin = snap->xmin;
+ /* Convert the historic snapshot to MVCC snapshot. */
+ return SnapBuildMVCCFromHistoric(snap, true);
+}
+
+/*
+ * Turn a historic MVCC snapshot into an ordinary MVCC snapshot.
+ *
+ * Unlike a regular (non-historic) MVCC snapshot, the xip array of this
+ * snapshot contains not only running main transactions, but also their
+ * subtransactions. This difference has no impact on XidInMVCCSnapshot().
+ *
+ * Pass true for 'in_place' if you don't care about modifying the source
+ * snapshot. If you need a new instance, and one that was allocated as a
+ * single chunk of memory, pass false.
+ */
+Snapshot
+SnapBuildMVCCFromHistoric(Snapshot snapshot, bool in_place)
+{
+ TransactionId xid;
+ TransactionId *oldxip = snapshot->xip;
+ uint32 oldxcnt = snapshot->xcnt;
+ TransactionId *newxip;
+ int newxcnt = 0;
+ Snapshot result;
+
/* allocate in transaction context */
newxip = (TransactionId *)
palloc(sizeof(TransactionId) * GetMaxSnapshotXidCount());
@@ -634,7 +656,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
* classical snapshot by marking all non-committed transactions as
* in-progress. This can be expensive.
*/
- for (xid = snap->xmin; NormalTransactionIdPrecedes(xid, snap->xmax);)
+ for (xid = snapshot->xmin; NormalTransactionIdPrecedes(xid, snapshot->xmax);)
{
void *test;
@@ -642,7 +664,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
* Check whether transaction committed using the decoding snapshot
* meaning of ->xip.
*/
- test = bsearch(&xid, snap->xip, snap->xcnt,
+ test = bsearch(&xid, snapshot->xip, snapshot->xcnt,
sizeof(TransactionId), xidComparator);
if (test == NULL)
@@ -659,11 +681,22 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
}
/* adjust remaining snapshot fields as needed */
- snap->snapshot_type = SNAPSHOT_MVCC;
- snap->xcnt = newxcnt;
- snap->xip = newxip;
+ snapshot->xcnt = newxcnt;
+ snapshot->xip = newxip;
+
+ if (in_place)
+ result = snapshot;
+ else
+ {
+ result = CopySnapshot(snapshot);
+
+ /* Restore the original values so the source is intact. */
+ snapshot->xip = oldxip;
+ snapshot->xcnt = oldxcnt;
+ }
+ result->snapshot_type = SNAPSHOT_MVCC;
- return snap;
+ return result;
}
/*
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 7d2b34d4f2..b8b500f48f 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -155,7 +155,6 @@ typedef struct ExportedSnapshot
static List *exportedSnapshots = NIL;
/* Prototypes for local functions */
-static Snapshot CopySnapshot(Snapshot snapshot);
static void UnregisterSnapshotNoOwner(Snapshot snapshot);
static void FreeSnapshot(Snapshot snapshot);
static void SnapshotResetXmin(void);
@@ -570,7 +569,7 @@ SetTransactionSnapshot(Snapshot sourcesnap, VirtualTransactionId *sourcevxid,
* The copy is palloc'd in TopTransactionContext and has initial refcounts set
* to 0. The returned snapshot has the copied flag set.
*/
-static Snapshot
+Snapshot
CopySnapshot(Snapshot snapshot)
{
Snapshot newsnap;
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index caa5113ff8..ad06e80784 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -69,6 +69,7 @@ extern void FreeSnapshotBuilder(SnapBuild *builder);
extern void SnapBuildSnapDecRefcount(Snapshot snap);
extern Snapshot SnapBuildInitialSnapshot(SnapBuild *builder);
+extern Snapshot SnapBuildMVCCFromHistoric(Snapshot snapshot, bool in_place);
extern const char *SnapBuildExportSnapshot(SnapBuild *builder);
extern void SnapBuildClearExportedSnapshot(void);
extern void SnapBuildResetExportedSnapshotState(void);
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index 9398a84051..e7ac89f484 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -68,6 +68,7 @@ extern Snapshot GetLatestSnapshot(void);
extern void SnapshotSetCommandId(CommandId curcid);
extern Snapshot GetOldestSnapshot(void);
+extern Snapshot CopySnapshot(Snapshot snapshot);
extern Snapshot GetCatalogSnapshot(Oid relid);
extern Snapshot GetNonHistoricCatalogSnapshot(Oid relid);
extern void InvalidateCatalogSnapshot(void);
--
2.45.2
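The core of SnapBuildMVCCFromHistoric() above is the loop that inverts the meaning of the xip array: a historic snapshot's xip lists *committed* XIDs, while an ordinary MVCC snapshot's xip lists *in-progress* ones. A self-contained sketch of that inversion (hypothetical function name, ignoring wraparound, subtransaction, and allocation details):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t TransactionId;

static int
xid_cmp(const void *a, const void *b)
{
    TransactionId x = *(const TransactionId *) a;
    TransactionId y = *(const TransactionId *) b;

    return (x < y) ? -1 : (x > y);
}

/* Every XID in [xmin, xmax) that is NOT found in the sorted array of
 * committed XIDs is treated as in-progress and copied into the new
 * xip array.  Returns the number of entries written to newxip. */
static int
historic_to_mvcc_xip(TransactionId xmin, TransactionId xmax,
                     const TransactionId *committed, int ncommitted,
                     TransactionId *newxip)
{
    int newxcnt = 0;

    for (TransactionId xid = xmin; xid < xmax; xid++)
    {
        if (bsearch(&xid, committed, ncommitted,
                    sizeof(TransactionId), xid_cmp) == NULL)
            newxip[newxcnt++] = xid;    /* not committed -> in progress */
    }
    return newxcnt;
}
```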
v04-0004-Add-CONCURRENTLY-option-to-both-VACUUM-FULL-and-CLUS.patch
From 93f81b25107fba231460115063782232238ab124 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Fri, 6 Sep 2024 09:55:54 +0200
Subject: [PATCH 4/8] Add CONCURRENTLY option to both VACUUM FULL and CLUSTER
commands.
Both VACUUM FULL and CLUSTER commands copy the relation data into a new file,
create new indexes and eventually swap the files. To make sure that the old
file does not change during the copying, the relation is locked in an
exclusive mode, which prevents applications from both reading and writing. (To
keep the data consistent, we'd only need to prevent the applications from
writing, but even reading needs to be blocked before we can swap the files -
otherwise some applications could continue using the old file. Since we cannot
get a stronger lock without releasing the weaker one first, we acquire the
exclusive lock at the beginning and keep it until the end of the processing.)
This patch introduces an alternative workflow, which only requires the
exclusive lock when the relation (and index) files are being swapped.
(Supposedly, the swapping should be pretty fast.) On the other hand, when we
copy the data to the new file, we allow applications to read from the relation
and even write into it.
First, we scan the relation using a "historic snapshot", and insert all the
tuples satisfying this snapshot into the new file. Note that, before creating
that snapshot, we need to make sure that all the other backends treat the
relation as a system catalog: in particular, they must log information on new
command IDs (CIDs). We achieve that by adding the relation ID into a shared
hash table and waiting until all the transactions currently writing into the
table (i.e. transactions possibly not aware of the new entry) have finished.
Second, logical decoding is used to capture the data changes done by
applications during the copying (i.e. changes that do not satisfy the historic
snapshot mentioned above), and those are applied to the new file before we
acquire the exclusive lock we need to swap the files. (Of course, more data
changes can take place while we are waiting for the lock - these will be
applied to the new file after we have acquired the lock, before we swap the
files.)
While copying the data into the new file, we hold a lock that prevents
applications from changing the relation tuple descriptor (tuples inserted into
the old file must fit into the new file). However, as we have to release that
lock before getting the exclusive one, it's possible that someone adds or
drops a column, or changes the data type of an existing one. Therefore we have
to check the tuple descriptor before we swap the files. If we find out that
the tuple descriptor changed, ERROR is raised and all the changes are rolled
back. Since a lot of effort can be wasted in such a case, the ALTER TABLE
command also tries to check if VACUUM FULL / CLUSTER with the CONCURRENTLY
option is running on the same relation, and raises an ERROR if it is.
Like the existing implementation of both VACUUM FULL and CLUSTER commands, the
variant with the CONCURRENTLY option also requires an extra space for the new
relation and index files (which coexist with the old files for some time). In
addition, the CONCURRENTLY option might introduce a lag in releasing WAL
segments for archiving / recycling. This is due to the decoding of the data
changes performed concurrently by applications. However, this lag should not
exceed a single WAL segment.
---
doc/src/sgml/monitoring.sgml | 36 +-
doc/src/sgml/ref/cluster.sgml | 111 +-
doc/src/sgml/ref/vacuum.sgml | 27 +-
src/Makefile | 1 +
src/backend/access/heap/heapam.c | 8 +-
src/backend/access/heap/heapam_handler.c | 141 +-
src/backend/access/heap/heapam_visibility.c | 30 +-
src/backend/catalog/index.c | 43 +-
src/backend/catalog/system_views.sql | 17 +-
src/backend/commands/cluster.c | 2576 ++++++++++++++++-
src/backend/commands/matview.c | 2 +-
src/backend/commands/tablecmds.c | 11 +
src/backend/commands/vacuum.c | 137 +-
src/backend/meson.build | 1 +
src/backend/replication/logical/decode.c | 24 +
src/backend/replication/logical/snapbuild.c | 20 +
.../replication/pgoutput_cluster/Makefile | 32 +
.../replication/pgoutput_cluster/meson.build | 18 +
.../pgoutput_cluster/pgoutput_cluster.c | 277 ++
src/backend/storage/ipc/ipci.c | 3 +
src/backend/tcop/utility.c | 11 +
src/backend/utils/activity/backend_progress.c | 16 +
.../utils/activity/wait_event_names.txt | 1 +
src/backend/utils/cache/inval.c | 22 +
src/backend/utils/cache/relcache.c | 5 +
src/backend/utils/time/snapmgr.c | 3 +-
src/bin/psql/tab-complete.c | 5 +-
src/include/access/heapam.h | 4 +
src/include/access/tableam.h | 10 +
src/include/catalog/index.h | 3 +
src/include/commands/cluster.h | 104 +-
src/include/commands/progress.h | 17 +-
src/include/commands/vacuum.h | 17 +-
src/include/replication/snapbuild.h | 1 +
src/include/storage/lockdefs.h | 2 +-
src/include/storage/lwlocklist.h | 1 +
src/include/utils/backend_progress.h | 3 +-
src/include/utils/inval.h | 2 +
src/include/utils/rel.h | 7 +-
src/include/utils/snapmgr.h | 2 +
src/test/regress/expected/rules.out | 17 +-
41 files changed, 3564 insertions(+), 204 deletions(-)
create mode 100644 src/backend/replication/pgoutput_cluster/Makefile
create mode 100644 src/backend/replication/pgoutput_cluster/meson.build
create mode 100644 src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 933de6fe07..ee26b03a05 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5640,14 +5640,35 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>heap_tuples_written</structfield> <type>bigint</type>
+ <structfield>heap_tuples_inserted</structfield> <type>bigint</type>
</para>
<para>
- Number of heap tuples written.
+ Number of heap tuples inserted.
This counter only advances when the phase is
<literal>seq scanning heap</literal>,
- <literal>index scanning heap</literal>
- or <literal>writing new heap</literal>.
+ <literal>index scanning heap</literal>,
+ <literal>writing new heap</literal>
+ or <literal>catch-up</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_updated</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples updated.
+ This counter only advances when the phase is <literal>catch-up</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_deleted</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples deleted.
+ This counter only advances when the phase is <literal>catch-up</literal>.
</para></entry>
</row>
@@ -5728,6 +5749,13 @@ FROM pg_stat_get_backend_idset() AS backendid;
<command>CLUSTER</command> is currently writing the new heap.
</entry>
</row>
+ <row>
+ <entry><literal>catch-up</literal></entry>
+ <entry>
+ <command>CLUSTER</command> is currently processing the DML commands
+ that other transactions executed during any of the preceding phases.
+ </entry>
+ </row>
<row>
<entry><literal>swapping relation files</literal></entry>
<entry>
diff --git a/doc/src/sgml/ref/cluster.sgml b/doc/src/sgml/ref/cluster.sgml
index c5760244e6..d8c3edb432 100644
--- a/doc/src/sgml/ref/cluster.sgml
+++ b/doc/src/sgml/ref/cluster.sgml
@@ -26,6 +26,7 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
<phrase>where <replaceable class="parameter">option</replaceable> can be one of:</phrase>
VERBOSE [ <replaceable class="parameter">boolean</replaceable> ]
+ CONCURRENTLY [ <replaceable class="parameter">boolean</replaceable> ]
</synopsis>
</refsynopsisdiv>
@@ -69,14 +70,18 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
<replaceable class="parameter">table_name</replaceable> reclusters all the
previously-clustered tables in the current database that the calling user
has privileges for. This form of <command>CLUSTER</command> cannot be
- executed inside a transaction block.
+ executed inside a transaction block. Also, if
+ the <literal>CONCURRENTLY</literal> option is used with this form, system
+ catalogs and <acronym>TOAST</acronym> tables are not processed.
</para>
<para>
- When a table is being clustered, an <literal>ACCESS
- EXCLUSIVE</literal> lock is acquired on it. This prevents any other
- database operations (both reads and writes) from operating on the
- table until the <command>CLUSTER</command> is finished.
+ When a table is being clustered, an <literal>ACCESS EXCLUSIVE</literal>
+ lock is acquired on it. This prevents any other database operations (both
+ reads and writes) from operating on the table until
+ the <command>CLUSTER</command> is finished. If you want to keep the table
+ accessible during the clustering, consider using
+ the <literal>CONCURRENTLY</literal> option.
</para>
</refsect1>
@@ -111,6 +116,102 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><literal>CONCURRENTLY</literal></term>
+ <listitem>
+ <para>
+ Allow other transactions to use the table while it is being clustered.
+ </para>
+
+ <para>
+ Internally, <command>CLUSTER</command> copies the contents of the table
+ (ignoring dead tuples) into a new file, sorted by the specified index,
+ and also creates a new file for each index. Then it swaps the old and
+ new files for the table and all the indexes, and deletes the old
+ files. The <literal>ACCESS EXCLUSIVE</literal> lock is needed to make
+ sure that the old files do not change during the processing because the
+ changes would get lost due to the swap.
+ </para>
+
+ <para>
+ With the <literal>CONCURRENTLY</literal> option, the <literal>ACCESS
+ EXCLUSIVE</literal> lock is only acquired to swap the table and index
+ files. The data changes that took place during the creation of the new
+ table and index files are captured using logical decoding
+ (<xref linkend="logicaldecoding"/>) and applied before
+ the <literal>ACCESS EXCLUSIVE</literal> lock is requested. Thus the lock
+ is typically held only for the time needed to swap the files, which
+ should be pretty short.
+ </para>
+
+ <para>
+ Note that <command>CLUSTER</command> with
+ the <literal>CONCURRENTLY</literal> option does not try to order the
+ rows inserted into the table after the clustering started. Also
+ note that <command>CLUSTER</command> might fail to complete due to DDL
+ commands executed on the table by other transactions during the
+ clustering.
+ </para>
+
+ <note>
+ <para>
+ In addition to the temporary space requirements explained below,
+ the <literal>CONCURRENTLY</literal> option can increase the use of
+ temporary space a bit more. The reason is that other transactions can
+ perform DML operations which cannot be applied to the new file until
+ <command>CLUSTER</command> has copied all the tuples from the old
+ file. Thus the tuples inserted into the old file during the copying are
+ also stored separately in a temporary file, so they can eventually
+ be applied to the new file.
+ </para>
+
+ <para>
+ Furthermore, the data changes performed during the copying are
+ extracted from <link linkend="wal">write-ahead log</link> (WAL), and
+ this extraction (decoding) only takes place when a certain amount of WAL
+ has been written. Therefore, WAL removal can be delayed by this
+ threshold. Currently the threshold is equal to the value of
+ the <link linkend="guc-wal-segment-size"><varname>wal_segment_size</varname></link>
+ configuration parameter.
+ </para>
+ </note>
+
+ <para>
+ The <literal>CONCURRENTLY</literal> option cannot be used in the
+ following cases:
+
+ <itemizedlist>
+ <listitem>
+ <para>
+ The table is a system catalog or a <acronym>TOAST</acronym> table.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <command>CLUSTER</command> is executed inside a transaction block.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The <link linkend="guc-wal-level"><varname>wal_level</varname></link>
+ configuration parameter is less than <literal>logical</literal>.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The <link linkend="guc-max-replication-slots"><varname>max_replication_slots</varname></link>
+ configuration parameter does not allow for creation of an additional
+ replication slot.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><replaceable class="parameter">boolean</replaceable></term>
<listitem>
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index 9857b35627..298cf7298d 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -39,6 +39,7 @@ VACUUM [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
SKIP_DATABASE_STATS [ <replaceable class="parameter">boolean</replaceable> ]
ONLY_DATABASE_STATS [ <replaceable class="parameter">boolean</replaceable> ]
BUFFER_USAGE_LIMIT <replaceable class="parameter">size</replaceable>
+ CONCURRENTLY [ <replaceable class="parameter">boolean</replaceable> ]
<phrase>and <replaceable class="parameter">table_and_columns</replaceable> is:</phrase>
@@ -61,8 +62,12 @@ VACUUM [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
<para>
Without a <replaceable class="parameter">table_and_columns</replaceable>
list, <command>VACUUM</command> processes every table and materialized view
- in the current database that the current user has permission to vacuum.
- With a list, <command>VACUUM</command> processes only those table(s).
+ in the current database that the current user has permission to vacuum. If
+ the <literal>CONCURRENTLY</literal> option is specified (see below), tables
+ which have not been clustered yet are silently skipped. With a
+ list, <command>VACUUM</command> processes only those table(s). If
+ the <literal>CONCURRENTLY</literal> option is specified, the list may only
+ contain tables which have already been clustered.
</para>
<para>
@@ -360,6 +365,24 @@ VACUUM [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><literal>CONCURRENTLY</literal></term>
+ <listitem>
+ <para>
+ Allow other transactions to use the table while it is being vacuumed. If
+ this option is specified, <command>VACUUM</command> can only process
+ tables which have already been clustered. For more information, see the
+ description of the <literal>CONCURRENTLY</literal> of the
+ <xref linkend="sql-cluster"/> command.
+ </para>
+
+ <para>
+ The <literal>CONCURRENTLY</literal> option can only be used
+ if <literal>FULL</literal> is used at the same time.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><replaceable class="parameter">boolean</replaceable></term>
<listitem>
diff --git a/src/Makefile b/src/Makefile
index 2f31a2f20a..8b9d30ff72 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -23,6 +23,7 @@ SUBDIRS = \
interfaces \
backend/replication/libpqwalreceiver \
backend/replication/pgoutput \
+ backend/replication/pgoutput_cluster \
fe_utils \
bin \
pl \
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 91b20147a0..1fdcc0abee 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2079,8 +2079,14 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
/*
* If this is a catalog, we need to transmit combo CIDs to properly
* decode, so log that as well.
+ *
+ * For the main heap (as opposed to TOAST), we only receive
+ * HEAP_INSERT_NO_LOGICAL when doing VACUUM FULL / CLUSTER, in which
+ * case the visibility information does not change. Therefore, there's
+ * no need to update the decoding snapshot.
*/
- if (RelationIsAccessibleInLogicalDecoding(relation))
+ if ((options & HEAP_INSERT_NO_LOGICAL) == 0 &&
+ RelationIsAccessibleInLogicalDecoding(relation))
log_heap_new_cid(relation, heaptup);
/*
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 1c6da286d4..4ddb1c4a0c 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -33,6 +33,7 @@
#include "catalog/index.h"
#include "catalog/storage.h"
#include "catalog/storage_xlog.h"
+#include "commands/cluster.h"
#include "commands/progress.h"
#include "executor/executor.h"
#include "miscadmin.h"
@@ -53,6 +54,9 @@ static void reform_and_rewrite_tuple(HeapTuple tuple,
static bool SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
HeapTuple tuple,
OffsetNumber tupoffset);
+static bool accept_tuple_for_concurrent_copy(HeapTuple tuple,
+ Snapshot snapshot,
+ Buffer buffer);
static BlockNumber heapam_scan_get_blocks_done(HeapScanDesc hscan);
@@ -682,6 +686,8 @@ static void
heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
Relation OldIndex, bool use_sort,
TransactionId OldestXmin,
+ Snapshot snapshot,
+ LogicalDecodingContext *decoding_ctx,
TransactionId *xid_cutoff,
MultiXactId *multi_cutoff,
double *num_tuples,
@@ -702,6 +708,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
bool *isnull;
BufferHeapTupleTableSlot *hslot;
BlockNumber prev_cblock = InvalidBlockNumber;
+ bool concurrent = snapshot != NULL;
+ XLogRecPtr end_of_wal_prev = GetFlushRecPtr(NULL);
/* Remember if it's a system catalog */
is_system_catalog = IsSystemRelation(OldHeap);
@@ -782,6 +790,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
HeapTuple tuple;
Buffer buf;
bool isdead;
+ HTSV_Result vis;
CHECK_FOR_INTERRUPTS();
@@ -836,7 +845,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
LockBuffer(buf, BUFFER_LOCK_SHARE);
- switch (HeapTupleSatisfiesVacuum(tuple, OldestXmin, buf))
+ switch ((vis = HeapTupleSatisfiesVacuum(tuple, OldestXmin, buf)))
{
case HEAPTUPLE_DEAD:
/* Definitely dead */
@@ -852,14 +861,15 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
case HEAPTUPLE_INSERT_IN_PROGRESS:
/*
- * Since we hold exclusive lock on the relation, normally the
- * only way to see this is if it was inserted earlier in our
- * own transaction. However, it can happen in system
+ * As long as we hold exclusive lock on the relation, normally
+ * the only way to see this is if it was inserted earlier in
+ * our own transaction. However, it can happen in system
* catalogs, since we tend to release write lock before commit
- * there. Give a warning if neither case applies; but in any
- * case we had better copy it.
+ * there. Also, there's no exclusive lock during concurrent
+ * processing. Give a warning if neither case applies; but in
+ * any case we had better copy it.
*/
- if (!is_system_catalog &&
+ if (!is_system_catalog && !concurrent &&
!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(tuple->t_data)))
elog(WARNING, "concurrent insert in progress within table \"%s\"",
RelationGetRelationName(OldHeap));
@@ -871,7 +881,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
/*
* Similar situation to INSERT_IN_PROGRESS case.
*/
- if (!is_system_catalog &&
+ if (!is_system_catalog && !concurrent &&
!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetUpdateXid(tuple->t_data)))
elog(WARNING, "concurrent delete in progress within table \"%s\"",
RelationGetRelationName(OldHeap));
@@ -885,8 +895,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
break;
}
- LockBuffer(buf, BUFFER_LOCK_UNLOCK);
-
if (isdead)
{
*tups_vacuumed += 1;
@@ -897,9 +905,39 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
*tups_vacuumed += 1;
*tups_recently_dead -= 1;
}
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
continue;
}
+ /*
+ * Ignore concurrent changes for now; they'll be processed later via
+ * logical decoding. INSERT_IN_PROGRESS is rejected right away because
+ * our snapshot should represent a point in time which should precede
+ * (or be equal to) the state of transactions as it was when the
+ * "SatisfiesVacuum" test was performed. Thus
+ * accept_tuple_for_concurrent_copy() should not consider the tuple
+ * inserted.
+ */
+ if (concurrent &&
+ (vis == HEAPTUPLE_INSERT_IN_PROGRESS ||
+ !accept_tuple_for_concurrent_copy(tuple, snapshot, buf)))
+ {
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ continue;
+ }
+
+ /*
+ * In the concurrent case, we should not unlock the buffer until the
+ * tuple has been copied to the new file: if a concurrent transaction
+ * marked it updated or deleted in between, we'd fail to replay that
+ * transaction's changes because then we'd try to perform the same
+ * UPDATE / DELETE twice. XXX Should we instead create a copy of the
+ * tuple so that the buffer can be unlocked right away?
+ */
+ if (!concurrent)
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
*num_tuples += 1;
if (tuplesort != NULL)
{
@@ -916,7 +954,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
{
const int ct_index[] = {
PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED,
- PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN
+ PROGRESS_CLUSTER_HEAP_TUPLES_INSERTED
};
int64 ct_val[2];
@@ -931,6 +969,35 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
ct_val[1] = *num_tuples;
pgstat_progress_update_multi_param(2, ct_index, ct_val);
}
+
+ /* See the comment on unlocking above. */
+ if (concurrent)
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ /*
+ * Process the WAL produced by the load, as well as by other
+ * transactions, so that the replication slot can advance and WAL does
+ * not pile up. Use wal_segment_size as a threshold so that we do not
+ * introduce the decoding overhead too often.
+ *
+ * Of course, we must not apply the changes until the initial load has
+ * completed.
+ *
+ * Note that our insertions into the new table should not be decoded
+ * as we (intentionally) do not write the logical decoding specific
+ * information to WAL.
+ */
+ if (concurrent)
+ {
+ XLogRecPtr end_of_wal;
+
+ end_of_wal = GetFlushRecPtr(NULL);
+ if ((end_of_wal - end_of_wal_prev) > wal_segment_size)
+ {
+ cluster_decode_concurrent_changes(decoding_ctx, end_of_wal);
+ end_of_wal_prev = end_of_wal;
+ }
+ }
}
if (indexScan != NULL)
@@ -974,7 +1041,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
values, isnull,
rwstate);
/* Report n_tuples */
- pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN,
+ pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_TUPLES_INSERTED,
n_tuples);
}
@@ -2579,6 +2646,56 @@ SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
}
}
+/*
+ * Check if the tuple was inserted, updated or deleted while
+ * heapam_relation_copy_for_cluster() was copying the data.
+ *
+ * 'snapshot' is used to determine whether xmin/xmax was set by a transaction
+ * that is still in-progress, or one that started in the future from the
+ * snapshot perspective.
+ *
+ * Returns true if the insertion is visible to 'snapshot', but clear xmax if
+ * it was set by a transaction which is in-progress or in the future from the
+ * snapshot perspective. (The xmax will be set later, when we decode the
+ * corresponding UPDATE / DELETE from WAL.)
+ *
+ * Returns false if the insertion is not visible to 'snapshot'.
+ */
+static bool
+accept_tuple_for_concurrent_copy(HeapTuple tuple, Snapshot snapshot,
+ Buffer buffer)
+{
+ Assert(snapshot->snapshot_type == SNAPSHOT_MVCC);
+
+ /*
+ * First, check if the tuple should be rejected because it was inserted
+ * concurrently.
+ */
+ if (!HeapTupleMVCCInserted(tuple, snapshot, buffer))
+ return false;
+
+ /*
+ * If the tuple was deleted / updated but our snapshot still sees it, we
+ * need to keep it. In that case, clear the information that indicates the
+ * deletion / update. Otherwise the tuple chain would stay incomplete (as
+ * the first check above rejects the new tuple version), and the delete /
+ * update would fail if executed later during logical decoding.
+ */
+ if (TransactionIdIsNormal(HeapTupleHeaderGetRawXmax(tuple->t_data)) &&
+ HeapTupleMVCCNotDeleted(tuple, snapshot, buffer))
+ {
+ /* TODO More work needed here? */
+ tuple->t_data->t_infomask |= HEAP_XMAX_INVALID;
+ HeapTupleHeaderSetXmax(tuple->t_data, 0);
+ }
+
+ /*
+ * Accept the tuple even if our snapshot considers it deleted - older
+ * snapshots can still see the tuple.
+ */
+ return true;
+}
+
/* ------------------------------------------------------------------------
* Definition of the heap table access method.
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index 9243feed01..d702592469 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -955,16 +955,31 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
* did TransactionIdIsInProgress in each call --- to no avail, as long as the
* inserting/deleting transaction was still running --- which was more cycles
* and more contention on ProcArrayLock.
+ *
+ * The checks are split into two functions, HeapTupleMVCCInserted() and
+ * HeapTupleMVCCNotDeleted(), because they are also useful separately.
*/
static bool
HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
Buffer buffer)
{
- HeapTupleHeader tuple = htup->t_data;
-
Assert(ItemPointerIsValid(&htup->t_self));
Assert(htup->t_tableOid != InvalidOid);
+ return HeapTupleMVCCInserted(htup, snapshot, buffer) &&
+ HeapTupleMVCCNotDeleted(htup, snapshot, buffer);
+}
+
+/*
+ * HeapTupleMVCCInserted
+ * True iff heap tuple was successfully inserted for the given MVCC
+ * snapshot.
+ */
+bool
+HeapTupleMVCCInserted(HeapTuple htup, Snapshot snapshot, Buffer buffer)
+{
+ HeapTupleHeader tuple = htup->t_data;
+
if (!HeapTupleHeaderXminCommitted(tuple))
{
if (HeapTupleHeaderXminInvalid(tuple))
@@ -1073,6 +1088,17 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
}
/* by here, the inserting transaction has committed */
+ return true;
+}
+
+/*
+ * HeapTupleMVCCNotDeleted
+ * True iff heap tuple was not deleted for the given MVCC snapshot.
+ */
+bool
+HeapTupleMVCCNotDeleted(HeapTuple htup, Snapshot snapshot, Buffer buffer)
+{
+ HeapTupleHeader tuple = htup->t_data;
if (tuple->t_infomask & HEAP_XMAX_INVALID) /* xid invalid or aborted */
return true;
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 33759056e3..aab2712794 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1415,22 +1415,7 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
opclassOptions[i] = get_attoptions(oldIndexId, i + 1);
- /* Extract statistic targets for each attribute */
- stattargets = palloc0_array(NullableDatum, newInfo->ii_NumIndexAttrs);
- for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
- {
- HeapTuple tp;
- Datum dat;
-
- tp = SearchSysCache2(ATTNUM, ObjectIdGetDatum(oldIndexId), Int16GetDatum(i + 1));
- if (!HeapTupleIsValid(tp))
- elog(ERROR, "cache lookup failed for attribute %d of relation %u",
- i + 1, oldIndexId);
- dat = SysCacheGetAttr(ATTNUM, tp, Anum_pg_attribute_attstattarget, &isnull);
- ReleaseSysCache(tp);
- stattargets[i].value = dat;
- stattargets[i].isnull = isnull;
- }
+ stattargets = get_index_stattargets(oldIndexId, newInfo);
/*
* Now create the new index.
@@ -1469,6 +1454,32 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
return newIndexId;
}
+NullableDatum *
+get_index_stattargets(Oid indexid, IndexInfo *indInfo)
+{
+ NullableDatum *stattargets;
+
+ /* Extract statistic targets for each attribute */
+ stattargets = palloc0_array(NullableDatum, indInfo->ii_NumIndexAttrs);
+ for (int i = 0; i < indInfo->ii_NumIndexAttrs; i++)
+ {
+ HeapTuple tp;
+ Datum dat;
+ bool isnull;
+
+ tp = SearchSysCache2(ATTNUM, ObjectIdGetDatum(indexid), Int16GetDatum(i + 1));
+ if (!HeapTupleIsValid(tp))
+ elog(ERROR, "cache lookup failed for attribute %d of relation %u",
+ i + 1, indexid);
+ dat = SysCacheGetAttr(ATTNUM, tp, Anum_pg_attribute_attstattarget, &isnull);
+ ReleaseSysCache(tp);
+ stattargets[i].value = dat;
+ stattargets[i].isnull = isnull;
+ }
+
+ return stattargets;
+}
+
/*
* index_concurrently_build
*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 7fd5d256a1..3b6419f878 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1236,16 +1236,19 @@ CREATE VIEW pg_stat_progress_cluster AS
WHEN 2 THEN 'index scanning heap'
WHEN 3 THEN 'sorting tuples'
WHEN 4 THEN 'writing new heap'
- WHEN 5 THEN 'swapping relation files'
- WHEN 6 THEN 'rebuilding index'
- WHEN 7 THEN 'performing final cleanup'
+ WHEN 5 THEN 'catch-up'
+ WHEN 6 THEN 'swapping relation files'
+ WHEN 7 THEN 'rebuilding index'
+ WHEN 8 THEN 'performing final cleanup'
END AS phase,
CAST(S.param3 AS oid) AS cluster_index_relid,
S.param4 AS heap_tuples_scanned,
- S.param5 AS heap_tuples_written,
- S.param6 AS heap_blks_total,
- S.param7 AS heap_blks_scanned,
- S.param8 AS index_rebuild_count
+ S.param5 AS heap_tuples_inserted,
+ S.param6 AS heap_tuples_updated,
+ S.param7 AS heap_tuples_deleted,
+ S.param8 AS heap_blks_total,
+ S.param9 AS heap_blks_scanned,
+ S.param10 AS index_rebuild_count
FROM pg_stat_get_progress_info('CLUSTER') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index bedc177ce4..b5698c9baf 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -25,6 +25,10 @@
#include "access/toast_internals.h"
#include "access/transam.h"
#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xloginsert.h"
+#include "access/xlogutils.h"
#include "catalog/catalog.h"
#include "catalog/dependency.h"
#include "catalog/heap.h"
@@ -32,6 +36,7 @@
#include "catalog/namespace.h"
#include "catalog/objectaccess.h"
#include "catalog/pg_am.h"
+#include "catalog/pg_control.h"
#include "catalog/pg_database.h"
#include "catalog/pg_inherits.h"
#include "catalog/toasting.h"
@@ -40,10 +45,15 @@
#include "commands/progress.h"
#include "commands/tablecmds.h"
#include "commands/vacuum.h"
+#include "executor/executor.h"
#include "miscadmin.h"
#include "optimizer/optimizer.h"
#include "pgstat.h"
+#include "replication/decode.h"
+#include "replication/logical.h"
+#include "replication/snapbuild.h"
#include "storage/bufmgr.h"
+#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
#include "utils/acl.h"
@@ -57,6 +67,8 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
+typedef struct RewriteStateData *RewriteState;
+
/*
* This struct is used to pass around the information on tables to be
* clustered. We need this so we can make a list of them when invoked without
@@ -68,17 +80,183 @@ typedef struct
Oid indexOid;
} RelToCluster;
+/*
+ * The following definitions are used for concurrent processing.
+ */
+
+/*
+ * OID of the table being processed by this backend.
+ */
+static Oid clustered_rel = InvalidOid;
+/* The same for its TOAST relation. */
+static Oid clustered_rel_toast = InvalidOid;
+
+/*
+ * The locators are used to avoid logical decoding of data that we do not need
+ * for our table.
+ */
+RelFileLocator clustered_rel_locator = {.relNumber = InvalidOid};
+RelFileLocator clustered_rel_toast_locator = {.relNumber = InvalidOid};
+
+/* XXX Do we also need to mention VACUUM FULL CONCURRENTLY? */
+#define CLUSTER_IN_PROGRESS_MESSAGE \
+ "relation \"%s\" is already being processed by CLUSTER CONCURRENTLY"
+
+/*
+ * Everything we need to call ExecInsertIndexTuples().
+ */
+typedef struct IndexInsertState
+{
+ ResultRelInfo *rri;
+ EState *estate;
+ ExprContext *econtext;
+
+ Relation ident_index;
+} IndexInsertState;
-static void cluster_multiple_rels(List *rtcs, ClusterParams *params);
-static void rebuild_relation(Relation OldHeap, Relation index, bool verbose);
+/*
+ * Catalog information to check if another backend changed the relation in
+ * such a way that makes CLUSTER CONCURRENTLY unable to continue. Such changes
+ * are possible because cluster_rel() has to release its lock on the relation
+ * in order to acquire AccessExclusiveLock that it needs to swap the relation
+ * files.
+ *
+ * The most obvious problem is that the tuple descriptor has changed, since
+ * then the tuples we try to insert into the new storage are not guaranteed to
+ * fit into the storage.
+ *
+ * Another problem is a relfilenode change performed by another backend. It's
+ * not necessarily a correctness issue (e.g. when the other backend ran
+ * cluster_rel()), but it's safer for us to terminate the table processing in
+ * such cases. However, this information also needs to be checked during
+ * logical decoding, so we store it in the global variables
+ * clustered_rel_locator and clustered_rel_toast_locator above.
+ *
+ * Where possible, commands which might change the relation in an incompatible
+ * way should check if CLUSTER CONCURRENTLY is running, before they start to
+ * do the actual changes (see is_concurrent_cluster_in_progress()). Anything
+ * else must be caught by check_catalog_changes(), which uses this structure.
+ */
+typedef struct CatalogState
+{
+ /* Tuple descriptor of the relation. */
+ TupleDesc tupdesc;
+
+ /* The number of indexes tracked. */
+ int ninds;
+ /* The index OIDs. */
+ Oid *ind_oids;
+ /* The index tuple descriptors. */
+ TupleDesc *ind_tupdescs;
+
+ /* The following are copies of the corresponding fields of pg_class. */
+ char relpersistence;
+ char replident;
+
+ /* rd_replidindex */
+ Oid replidindex;
+} CatalogState;
+
+/* The WAL segment being decoded. */
+static XLogSegNo cluster_current_segment = 0;
+
+static void cluster_multiple_rels(List *rtcs, ClusterParams *params,
+ LOCKMODE lockmode, bool isTopLevel);
+static void rebuild_relation(Relation OldHeap, Relation index, bool verbose,
+ bool concurrent);
static void copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
+ Snapshot snapshot, LogicalDecodingContext *decoding_ctx,
bool verbose, bool *pSwapToastByContent,
TransactionId *pFreezeXid, MultiXactId *pCutoffMulti);
static List *get_tables_to_cluster(MemoryContext cluster_context);
static List *get_tables_to_cluster_partitioned(MemoryContext cluster_context,
Oid indexOid);
static bool cluster_is_permitted_for_relation(Oid relid, Oid userid);
+static void check_concurrent_cluster_requirements(Relation rel,
+ bool isTopLevel,
+ bool isCluster);
+static void begin_concurrent_cluster(Relation *rel_p, Relation *index_p,
+ bool *entered_p);
+static void end_concurrent_cluster(bool error);
+static void cluster_before_shmem_exit_callback(int code, Datum arg);
+static CatalogState *get_catalog_state(Relation rel);
+static void free_catalog_state(CatalogState *state);
+static void check_catalog_changes(Relation rel, CatalogState *cat_state);
+static LogicalDecodingContext *setup_logical_decoding(Oid relid,
+ const char *slotname,
+ TupleDesc tupdesc);
+static HeapTuple get_changed_tuple(ConcurrentChange *change);
+static void apply_concurrent_changes(ClusterDecodingState *dstate,
+ Relation rel, ScanKey key, int nkeys,
+ IndexInsertState *iistate);
+static void apply_concurrent_insert(Relation rel, ConcurrentChange *change,
+ HeapTuple tup, IndexInsertState *iistate,
+ TupleTableSlot *index_slot);
+static void apply_concurrent_update(Relation rel, HeapTuple tup,
+ HeapTuple tup_target,
+ ConcurrentChange *change,
+ IndexInsertState *iistate,
+ TupleTableSlot *index_slot);
+static void apply_concurrent_delete(Relation rel, HeapTuple tup_target,
+ ConcurrentChange *change);
+static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
+ HeapTuple tup_key,
+ IndexInsertState *iistate,
+ TupleTableSlot *ident_slot,
+ IndexScanDesc *scan_p);
+static void process_concurrent_changes(LogicalDecodingContext *ctx,
+ XLogRecPtr end_of_wal,
+ Relation rel_dst,
+ Relation rel_src,
+ ScanKey ident_key,
+ int ident_key_nentries,
+ IndexInsertState *iistate);
+static IndexInsertState *get_index_insert_state(Relation relation,
+ Oid ident_index_id);
+static ScanKey build_identity_key(Oid ident_idx_oid, Relation rel_src,
+ int *nentries);
+static void free_index_insert_state(IndexInsertState *iistate);
+static void cleanup_logical_decoding(LogicalDecodingContext *ctx);
+static void rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
+ Relation cl_index,
+ CatalogState *cat_state,
+ LogicalDecodingContext *ctx,
+ bool swap_toast_by_content,
+ TransactionId frozenXid,
+ MultiXactId cutoffMulti);
+static List *build_new_indexes(Relation NewHeap, Relation OldHeap, List *OldIndexes);
+
+/*
+ * Use this API when relation needs to be unlocked, closed and re-opened. If
+ * the relation got dropped while being unlocked, raise ERROR that mentions
+ * the relation name rather than OID.
+ */
+typedef struct RelReopenInfo
+{
+ /*
+ * The relation to be closed. A pointer to the value is stored here so
+ * that the caller's reference is updated automatically on re-opening.
+ *
+ * When calling unlock_and_close_relations(), 'relid' can be passed
+ * instead of 'rel_p' when the caller only needs to gather information for
+ * subsequent opening.
+ */
+ Relation *rel_p;
+ Oid relid;
+ char relkind;
+ LOCKMODE lockmode_orig; /* The existing lock mode */
+ LOCKMODE lockmode_new; /* The lock mode after the relation is
+ * re-opened */
+
+ char *relname; /* Relation name, initialized automatically. */
+} RelReopenInfo;
+
+static void init_rel_reopen_info(RelReopenInfo *rri, Relation *rel_p,
+ Oid relid, LOCKMODE lockmode_orig,
+ LOCKMODE lockmode_new);
+static void unlock_and_close_relations(RelReopenInfo *rels, int nrel);
+static void reopen_relations(RelReopenInfo *rels, int nrel);
/*---------------------------------------------------------------------------
* This cluster code allows for clustering multiple tables at once. Because
@@ -110,10 +288,12 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
ListCell *lc;
ClusterParams params = {0};
bool verbose = false;
+ bool concurrent = false;
Relation rel = NULL;
Oid indexOid = InvalidOid;
MemoryContext cluster_context;
List *rtcs;
+ LOCKMODE lockmode;
/* Parse option list */
foreach(lc, stmt->params)
@@ -122,6 +302,8 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
if (strcmp(opt->defname, "verbose") == 0)
verbose = defGetBoolean(opt);
+ else if (strcmp(opt->defname, "concurrently") == 0)
+ concurrent = defGetBoolean(opt);
else
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
@@ -130,20 +312,30 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
parser_errposition(pstate, opt->location)));
}
- params.options = (verbose ? CLUOPT_VERBOSE : 0);
+ params.options =
+ (verbose ? CLUOPT_VERBOSE : 0) |
+ (concurrent ? CLUOPT_CONCURRENT : 0);
+
+ /*
+ * Determine the lock mode expected by cluster_rel().
+ *
+ * In the exclusive case, we obtain AccessExclusiveLock right away to
+ * avoid lock-upgrade hazard in the single-transaction case. In the
+ * CONCURRENT case, the AccessExclusiveLock will only be used at the end
+ * of processing, supposedly for very short time. Until then, we'll have
+ * to unlock the relation temporarily, so there's no lock-upgrade hazard.
+ */
+ lockmode = (params.options & CLUOPT_CONCURRENT) == 0 ?
+ AccessExclusiveLock : LOCK_CLUSTER_CONCURRENT;
if (stmt->relation != NULL)
{
/* This is the single-relation case. */
Oid tableOid;
- /*
- * Find, lock, and check permissions on the table. We obtain
- * AccessExclusiveLock right away to avoid lock-upgrade hazard in the
- * single-transaction case.
- */
+ /* Find, lock, and check permissions on the table. */
tableOid = RangeVarGetRelidExtended(stmt->relation,
- AccessExclusiveLock,
+ lockmode,
0,
RangeVarCallbackMaintainsTable,
NULL);
@@ -198,7 +390,7 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
* Do the job. (The function will close the relation, lock is kept
* till commit.)
*/
- cluster_rel(rel, indexOid, ¶ms);
+ cluster_rel(rel, indexOid, ¶ms, isTopLevel);
return;
}
@@ -237,7 +429,7 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
rtcs = get_tables_to_cluster_partitioned(cluster_context, indexOid);
/* close relation, releasing lock on parent table */
- table_close(rel, AccessExclusiveLock);
+ table_close(rel, lockmode);
}
else
{
@@ -246,7 +438,7 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
}
/* Do the job. */
- cluster_multiple_rels(rtcs, ¶ms);
+ cluster_multiple_rels(rtcs, ¶ms, lockmode, isTopLevel);
/* Start a new transaction for the cleanup work. */
StartTransactionCommand();
@@ -263,7 +455,8 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
* return.
*/
static void
-cluster_multiple_rels(List *rtcs, ClusterParams *params)
+cluster_multiple_rels(List *rtcs, ClusterParams *params, LOCKMODE lockmode,
+ bool isTopLevel)
{
ListCell *lc;
@@ -283,13 +476,19 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
/* functions in indexes may want a snapshot set */
PushActiveSnapshot(GetTransactionSnapshot());
- rel = table_open(rtc->tableOid, AccessExclusiveLock);
+ rel = table_open(rtc->tableOid, lockmode);
- /*
- * Do the job. (The function will close the relation, lock is kept
- * till commit.)
- */
- cluster_rel(rel, rtc->indexOid, params);
+ /* Not all relations can be processed in concurrent mode. */
+ if ((params->options & CLUOPT_CONCURRENT) == 0 ||
+ check_relation_is_clusterable_concurrently(rel, DEBUG1,
+ "CLUSTER (CONCURRENTLY)"))
+ {
+ /*
+ * Do the job. (The function will close the relation, lock is kept
+ * till commit.)
+ */
+ cluster_rel(rel, rtc->indexOid, params, isTopLevel);
+ }
+ else
+ table_close(rel, lockmode);
PopActiveSnapshot();
CommitTransactionCommand();
@@ -313,10 +512,21 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
* instead of index order. This is the new implementation of VACUUM FULL,
* and error messages should refer to the operation as VACUUM not CLUSTER.
*
- * We expect that OldHeap is already locked in AccessExclusiveLock mode.
+ * We expect that OldHeap is already locked. The lock mode is
+ * AccessExclusiveLock for normal processing and LOCK_CLUSTER_CONCURRENT for
+ * concurrent processing (so that SELECT, INSERT, UPDATE and DELETE commands
+ * work, but cluster_rel() cannot be called concurrently for the same
+ * relation).
+ *
+ * Note that, in the concurrent case, the function releases the lock at some
+ * point, in order to get AccessExclusiveLock for the final steps (i.e. to
+ * swap the relation files). To make things simpler, the caller should expect
+ * OldHeap to be closed on return, regardless of CLUOPT_CONCURRENT. (The
+ * AccessExclusiveLock is kept till the end of the transaction.)
*/
void
-cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
+cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
+ bool isTopLevel)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid save_userid;
@@ -325,6 +535,41 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
bool verbose = ((params->options & CLUOPT_VERBOSE) != 0);
bool recheck = ((params->options & CLUOPT_RECHECK) != 0);
Relation index = NULL;
+ bool concurrent = ((params->options & CLUOPT_CONCURRENT) != 0);
+ LOCKMODE lmode;
+ bool entered, success;
+
+ /* Check that the correct lock is held. */
+ lmode = !concurrent ? AccessExclusiveLock : LOCK_CLUSTER_CONCURRENT;
+
+ /*
+ * Skip the relation if it's being processed concurrently. In such a case,
+ * we cannot rely on a lock because the other backend needs to release it
+ * temporarily at some point.
+ *
+ * This check should not take place until we have a lock that prevents
+ * another backend from starting VACUUM FULL / CLUSTER CONCURRENTLY after
+ * our check.
+ */
+ Assert(CheckRelationLockedByMe(OldHeap, lmode, false));
+ if (is_concurrent_cluster_in_progress(tableOid))
+ {
+ ereport(NOTICE,
+ (errmsg(CLUSTER_IN_PROGRESS_MESSAGE,
+ RelationGetRelationName(OldHeap))));
+ table_close(OldHeap, lmode);
+ return;
+ }
+
+ /* There are specific requirements on concurrent processing. */
+ if (concurrent)
+ {
+ check_concurrent_cluster_requirements(OldHeap, isTopLevel,
+ OidIsValid(indexOid));
+
+ check_relation_is_clusterable_concurrently(OldHeap, ERROR,
+ "CLUSTER (CONCURRENTLY)");
+ }
/* Check for user-requested abort. */
CHECK_FOR_INTERRUPTS();
@@ -361,7 +606,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
/* Check that the user still has privileges for the relation */
if (!cluster_is_permitted_for_relation(tableOid, save_userid))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
goto out;
}
@@ -376,7 +621,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
*/
if (RELATION_IS_OTHER_TEMP(OldHeap))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
goto out;
}
@@ -387,7 +632,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
*/
if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(indexOid)))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
goto out;
}
@@ -398,7 +643,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
if ((params->options & CLUOPT_RECHECK_ISCLUSTERED) != 0 &&
!get_index_isclustered(indexOid))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
goto out;
}
}
@@ -414,6 +659,11 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("cannot cluster a shared catalog")));
+ /*
+ * The CONCURRENT case should have been rejected earlier because it does
+ * not support system catalogs.
+ */
+ Assert(!(OldHeap->rd_rel->relisshared && concurrent));
/*
* Don't process temp tables of other backends ... their local buffer
@@ -440,7 +690,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
/* Check heap and index are valid to cluster on */
if (OidIsValid(indexOid))
{
- check_index_is_clusterable(OldHeap, indexOid, AccessExclusiveLock);
+ check_index_is_clusterable(OldHeap, indexOid, lmode);
/* Open the index (It should already be locked.) */
index = index_open(indexOid, NoLock);
}
@@ -455,7 +705,8 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
if (OldHeap->rd_rel->relkind == RELKIND_MATVIEW &&
!RelationIsPopulated(OldHeap))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ index_close(index, lmode);
+ relation_close(OldHeap, lmode);
goto out;
}
@@ -468,11 +719,42 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
* invalid, because we move tuples around. Promote them to relation
* locks. Predicate locks on indexes will be promoted when they are
* reindexed.
+ *
+ * During concurrent processing, the heap as well as its indexes stay in
+ * operation, so we postpone this step until they are locked using
+ * AccessExclusiveLock near the end of the processing.
*/
- TransferPredicateLocksToHeapRelation(OldHeap);
+ if (!concurrent)
+ TransferPredicateLocksToHeapRelation(OldHeap);
/* rebuild_relation does all the dirty work */
- rebuild_relation(OldHeap, index, verbose);
+ entered = false;
+ success = false;
+ PG_TRY();
+ {
+ /*
+ * For concurrent processing, make sure other transactions treat this
+ * table as if it were a system / user catalog, and WAL the relevant
+ * additional information. ERROR is raised if another backend is
+ * processing the same table.
+ */
+ if (concurrent)
+ {
+ Relation *index_p = index ? &index : NULL;
+
+ begin_concurrent_cluster(&OldHeap, index_p, &entered);
+ }
+
+ rebuild_relation(OldHeap, index, verbose,
+ (params->options & CLUOPT_CONCURRENT) != 0);
+ success = true;
+ }
+ PG_FINALLY();
+ {
+ if (concurrent && entered)
+ end_concurrent_cluster(!success);
+ }
+ PG_END_TRY();
/*
* NB: rebuild_relation does table_close() on OldHeap, and also on index,
@@ -622,18 +904,100 @@ mark_index_clustered(Relation rel, Oid indexOid, bool is_internal)
table_close(pg_index, RowExclusiveLock);
}
+/*
+ * Check if the CONCURRENTLY option is legal for the relation.
+ */
+bool
+check_relation_is_clusterable_concurrently(Relation rel, int elevel,
+ const char *stmt)
+{
+ char relpersistence, replident;
+ Oid ident_idx;
+
+ /* Data changes in system relations are not logically decoded. */
+ if (IsCatalogRelation(rel))
+ {
+ ereport(elevel,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot process relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("%s is not supported for catalog relations.", stmt)));
+ return false;
+ }
+
+ if (IsToastRelation(rel))
+ {
+ ereport(elevel,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot process relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("%s is not supported for TOAST relations, unless the main relation is processed too.",
+ stmt)));
+ return false;
+ }
+
+ relpersistence = rel->rd_rel->relpersistence;
+ if (relpersistence != RELPERSISTENCE_PERMANENT)
+ {
+ ereport(elevel,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot process relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("%s is only allowed for permanent relations.",
+ stmt)));
+ return false;
+ }
+
+ /* With NOTHING, WAL does not contain the old tuple. */
+ replident = rel->rd_rel->relreplident;
+ if (replident == REPLICA_IDENTITY_NOTHING)
+ {
+ ereport(elevel,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot process relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("Relation \"%s\" has insufficient replication identity.",
+ RelationGetRelationName(rel))));
+ return false;
+ }
+
+ /*
+ * Identity index is not set if the replica identity is FULL, but PK might
+ * exist in such a case.
+ */
+ ident_idx = RelationGetReplicaIndex(rel);
+ if (!OidIsValid(ident_idx) && OidIsValid(rel->rd_pkindex))
+ ident_idx = rel->rd_pkindex;
+ if (!OidIsValid(ident_idx))
+ {
+ ereport(elevel,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot process relation \"%s\"",
+ RelationGetRelationName(rel)),
+ (errhint("Relation \"%s\" has no identity index.",
+ RelationGetRelationName(rel)))));
+ return false;
+ }
+
+ return true;
+}
+
/*
* rebuild_relation: rebuild an existing relation in index or physical order
*
- * OldHeap: table to rebuild --- must be opened and exclusive-locked!
+ * OldHeap: table to rebuild --- must be opened and locked. See cluster_rel()
+ * for comments on the required lock strength.
+ *
* index: index to cluster by, or NULL to rewrite in physical order. Must be
* opened and locked.
*
* On exit, the heap (and also the index, if one was passed) are closed, but
- * still locked with AccessExclusiveLock.
+ * still locked with AccessExclusiveLock. (The function handles the lock
+ * upgrade if 'concurrent' is true.)
*/
static void
-rebuild_relation(Relation OldHeap, Relation index, bool verbose)
+rebuild_relation(Relation OldHeap, Relation index, bool verbose,
+ bool concurrent)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid indexOid = index ? RelationGetRelid(index) : InvalidOid;
@@ -642,11 +1006,76 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose)
Oid OIDNewHeap;
Relation NewHeap;
char relpersistence;
- bool is_system_catalog;
bool swap_toast_by_content;
TransactionId frozenXid;
MultiXactId cutoffMulti;
- LOCKMODE lockmode_new;
+ NameData slotname;
+ LogicalDecodingContext *ctx = NULL;
+ Snapshot snapshot = NULL;
+ CatalogState *cat_state = NULL;
+ LOCKMODE lmode_new;
+
+ if (concurrent)
+ {
+ TupleDesc tupdesc;
+ RelReopenInfo rri[2];
+ int nrel;
+
+ /*
+ * CLUSTER CONCURRENTLY is not allowed in a transaction block, so this
+ * should never fire.
+ */
+ Assert(GetTopTransactionIdIfAny() == InvalidTransactionId);
+
+ /*
+ * A single backend should not execute multiple CLUSTER commands at a
+ * time, so use PID to make the slot unique.
+ */
+ snprintf(NameStr(slotname), NAMEDATALEN, "cluster_%d", MyProcPid);
+
+ /*
+ * Gather catalog information so that we can check later whether the
+ * old relation changed while it was unlocked.
+ *
+ * Since this function also checks if the relation can be processed,
+ * it's important to call it before we set up the logical decoding,
+ * because that can take some time. Not sure if it's necessary to do
+ * it even earlier.
+ */
+ cat_state = get_catalog_state(OldHeap);
+
+ tupdesc = CreateTupleDescCopy(RelationGetDescr(OldHeap));
+
+ /*
+ * Unlock the relation (and possibly the clustering index) to avoid
+ * deadlock because setup_logical_decoding() will wait for all the
+ * running transactions (with XID assigned) to finish. Some of those
+ * transactions might be waiting for a lock on our relation.
+ */
+ nrel = 0;
+ init_rel_reopen_info(&rri[nrel++], &OldHeap, InvalidOid,
+ LOCK_CLUSTER_CONCURRENT,
+ LOCK_CLUSTER_CONCURRENT);
+ if (index)
+ init_rel_reopen_info(&rri[nrel++], &index, InvalidOid,
+ LOCK_CLUSTER_CONCURRENT,
+ LOCK_CLUSTER_CONCURRENT);
+ unlock_and_close_relations(rri, nrel);
+
+ /* Prepare to capture the concurrent data changes. */
+ ctx = setup_logical_decoding(tableOid, NameStr(slotname), tupdesc);
+
+ /* Lock the table (and index) again. */
+ reopen_relations(rri, nrel);
+
+ /*
+ * Check whether 'tupdesc' could have changed while the relation was
+ * unlocked.
+ */
+ check_catalog_changes(OldHeap, cat_state);
+
+ snapshot = SnapBuildInitialSnapshotForCluster(ctx->snapshot_builder);
+ }
if (OidIsValid(indexOid))
/* Mark the correct index as clustered */
@@ -654,7 +1083,6 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose)
/* Remember info about rel before closing OldHeap */
relpersistence = OldHeap->rd_rel->relpersistence;
- is_system_catalog = IsSystemRelation(OldHeap);
/*
* Create the transient table that will receive the re-ordered data.
@@ -662,42 +1090,63 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose)
* NoLock for the old heap because we already have it locked and want to
* keep unlocking straightforward.
*/
- lockmode_new = AccessExclusiveLock;
+ lmode_new = AccessExclusiveLock;
OIDNewHeap = make_new_heap(tableOid, tableSpace,
accessMethod,
relpersistence,
- NoLock, &lockmode_new);
- Assert(lockmode_new == AccessExclusiveLock || lockmode_new == NoLock);
+ NoLock, &lmode_new);
+ Assert(lmode_new == AccessExclusiveLock || lmode_new == NoLock);
/* Lock iff not done above. */
- NewHeap = table_open(OIDNewHeap, lockmode_new == NoLock ?
+ NewHeap = table_open(OIDNewHeap, lmode_new == NoLock ?
AccessExclusiveLock : NoLock);
/* Copy the heap data into the new table in the desired order */
- copy_table_data(NewHeap, OldHeap, index, verbose,
+ copy_table_data(NewHeap, OldHeap, index, snapshot, ctx, verbose,
&swap_toast_by_content, &frozenXid, &cutoffMulti);
+ if (concurrent)
+ {
+ rebuild_relation_finish_concurrent(NewHeap, OldHeap, index,
+ cat_state, ctx,
+ swap_toast_by_content,
+ frozenXid, cutoffMulti);
+
+ pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
+ PROGRESS_CLUSTER_PHASE_FINAL_CLEANUP);
+
+ /* Done with decoding. */
+ FreeSnapshot(snapshot);
+ free_catalog_state(cat_state);
+ cleanup_logical_decoding(ctx);
+ ReplicationSlotRelease();
+ ReplicationSlotDrop(NameStr(slotname), false);
+ }
+ else
+ {
+ bool is_system_catalog = IsSystemRelation(OldHeap);
- /* Close relcache entries, but keep lock until transaction commit */
- table_close(OldHeap, NoLock);
- if (index)
- index_close(index, NoLock);
+ /* Close relcache entries, but keep lock until transaction commit */
+ table_close(OldHeap, NoLock);
+ if (index)
+ index_close(index, NoLock);
- /*
- * Close the new relation so it can be dropped as soon as the storage is
- * swapped. The relation is not visible to others, so we could unlock it
- * completely, but it's simpler to pass NoLock than to track all the locks
- * acquired so far.
- */
- table_close(NewHeap, NoLock);
+ /*
+ * Close the new relation so it can be dropped as soon as the storage
+ * is swapped. The relation is not visible to others, so we could
+ * unlock it completely, but it's simpler to pass NoLock than to track
+ * all the locks acquired so far.
+ */
+ table_close(NewHeap, NoLock);
- /*
- * Swap the physical files of the target and transient tables, then
- * rebuild the target's indexes and throw away the transient table.
- */
- finish_heap_swap(tableOid, OIDNewHeap, is_system_catalog,
- swap_toast_by_content, false, true,
- frozenXid, cutoffMulti,
- relpersistence);
+ /*
+ * Swap the physical files of the target and transient tables, then
+ * rebuild the target's indexes and throw away the transient table.
+ */
+ finish_heap_swap(tableOid, OIDNewHeap, is_system_catalog,
+ swap_toast_by_content, false, true, true,
+ frozenXid, cutoffMulti,
+ relpersistence);
+ }
}
@@ -848,15 +1297,19 @@ make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
/*
* Do the physical copying of table data.
*
+ * 'snapshot' and 'decoding_ctx': see table_relation_copy_for_cluster(). Pass
+ * iff concurrent processing is required.
+ *
* There are three output parameters:
* *pSwapToastByContent is set true if toast tables must be swapped by content.
* *pFreezeXid receives the TransactionId used as freeze cutoff point.
* *pCutoffMulti receives the MultiXactId used as a cutoff point.
*/
static void
-copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verbose,
- bool *pSwapToastByContent, TransactionId *pFreezeXid,
- MultiXactId *pCutoffMulti)
+copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
+ Snapshot snapshot, LogicalDecodingContext *decoding_ctx,
+ bool verbose, bool *pSwapToastByContent,
+ TransactionId *pFreezeXid, MultiXactId *pCutoffMulti)
{
Oid OIDOldHeap = RelationGetRelid(OldHeap);
Oid OIDOldIndex = OldIndex ? RelationGetRelid(OldIndex) : InvalidOid;
@@ -876,6 +1329,7 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
int elevel = verbose ? INFO : DEBUG2;
PGRUsage ru0;
char *nspname;
+ bool concurrent = snapshot != NULL;
pg_rusage_init(&ru0);
@@ -902,8 +1356,12 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
*
* We don't need to open the toast relation here, just lock it. The lock
* will be held till end of transaction.
+ *
+ * In the CONCURRENT case, the lock does not help because we need to
+ * release it temporarily at some point. Instead, we expect VACUUM /
+ * CLUSTER to skip tables which are present in ClusteredRelsHash.
*/
- if (OldHeap->rd_rel->reltoastrelid)
+ if (OldHeap->rd_rel->reltoastrelid && !concurrent)
LockRelationOid(OldHeap->rd_rel->reltoastrelid, AccessExclusiveLock);
/*
@@ -979,7 +1437,45 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
* provided, else plain seqscan.
*/
if (OldIndex != NULL && OldIndex->rd_rel->relam == BTREE_AM_OID)
+ {
+ ResourceOwner oldowner = CurrentResourceOwner;
+
+ /*
+ * In the CONCURRENT case, do the planning in a subtransaction so that
+ * we don't leave any additional locks behind us that we cannot
+ * release easily.
+ */
+ if (concurrent)
+ {
+ Assert(CheckRelationLockedByMe(OldHeap, LOCK_CLUSTER_CONCURRENT,
+ false));
+ Assert(CheckRelationLockedByMe(OldIndex, LOCK_CLUSTER_CONCURRENT,
+ false));
+ BeginInternalSubTransaction("plan_cluster_use_sort");
+ }
+
use_sort = plan_cluster_use_sort(OIDOldHeap, OIDOldIndex);
+
+ if (concurrent)
+ {
+ PgBackendProgress progress;
+
+ /*
+ * Command progress reporting gets terminated at subtransaction
+ * end. Save the status so it can be eventually restored.
+ */
+ memcpy(&progress, &MyBEEntry->st_progress,
+ sizeof(PgBackendProgress));
+
+ /* Release the locks by aborting the subtransaction. */
+ RollbackAndReleaseCurrentSubTransaction();
+
+ /* Restore the progress reporting status. */
+ pgstat_progress_restore_state(&progress);
+
+ CurrentResourceOwner = oldowner;
+ }
+ }
else
use_sort = false;
@@ -1008,7 +1504,9 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
* values (e.g. because the AM doesn't use freezing).
*/
table_relation_copy_for_cluster(OldHeap, NewHeap, OldIndex, use_sort,
- cutoffs.OldestXmin, &cutoffs.FreezeLimit,
+ cutoffs.OldestXmin, snapshot,
+ decoding_ctx,
+ &cutoffs.FreezeLimit,
&cutoffs.MultiXactCutoff,
&num_tuples, &tups_vacuumed,
&tups_recently_dead);
@@ -1017,7 +1515,11 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
*pFreezeXid = cutoffs.FreezeLimit;
*pCutoffMulti = cutoffs.MultiXactCutoff;
- /* Reset rd_toastoid just to be tidy --- it shouldn't be looked at again */
+ /*
+ * Reset rd_toastoid just to be tidy --- it shouldn't be looked at
+ * again. In the CONCURRENT case, we need to set it again before applying
+ * the concurrent changes.
+ */
NewHeap->rd_toastoid = InvalidOid;
num_pages = RelationGetNumberOfBlocks(NewHeap);
@@ -1468,14 +1970,13 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
bool swap_toast_by_content,
bool check_constraints,
bool is_internal,
+ bool reindex,
TransactionId frozenXid,
MultiXactId cutoffMulti,
char newrelpersistence)
{
ObjectAddress object;
Oid mapped_tables[4];
- int reindex_flags;
- ReindexParams reindex_params = {0};
int i;
/* Report that we are now swapping relation files */
@@ -1501,39 +2002,46 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
if (is_system_catalog)
CacheInvalidateCatalog(OIDOldHeap);
- /*
- * Rebuild each index on the relation (but not the toast table, which is
- * all-new at this point). It is important to do this before the DROP
- * step because if we are processing a system catalog that will be used
- * during DROP, we want to have its indexes available. There is no
- * advantage to the other order anyway because this is all transactional,
- * so no chance to reclaim disk space before commit. We do not need a
- * final CommandCounterIncrement() because reindex_relation does it.
- *
- * Note: because index_build is called via reindex_relation, it will never
- * set indcheckxmin true for the indexes. This is OK even though in some
- * sense we are building new indexes rather than rebuilding existing ones,
- * because the new heap won't contain any HOT chains at all, let alone
- * broken ones, so it can't be necessary to set indcheckxmin.
- */
- reindex_flags = REINDEX_REL_SUPPRESS_INDEX_USE;
- if (check_constraints)
- reindex_flags |= REINDEX_REL_CHECK_CONSTRAINTS;
+ if (reindex)
+ {
+ int reindex_flags;
+ ReindexParams reindex_params = {0};
- /*
- * Ensure that the indexes have the same persistence as the parent
- * relation.
- */
- if (newrelpersistence == RELPERSISTENCE_UNLOGGED)
- reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
- else if (newrelpersistence == RELPERSISTENCE_PERMANENT)
- reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+ /*
+ * Rebuild each index on the relation (but not the toast table, which
+ * is all-new at this point). It is important to do this before the
+ * DROP step because if we are processing a system catalog that will
+ * be used during DROP, we want to have its indexes available. There
+ * is no advantage to the other order anyway because this is all
+ * transactional, so no chance to reclaim disk space before commit.
+ * We do not need a final CommandCounterIncrement() because
+ * reindex_relation does it.
+ *
+ * Note: because index_build is called via reindex_relation, it will never
+ * set indcheckxmin true for the indexes. This is OK even though in some
+ * sense we are building new indexes rather than rebuilding existing ones,
+ * because the new heap won't contain any HOT chains at all, let alone
+ * broken ones, so it can't be necessary to set indcheckxmin.
+ */
+ reindex_flags = REINDEX_REL_SUPPRESS_INDEX_USE;
+ if (check_constraints)
+ reindex_flags |= REINDEX_REL_CHECK_CONSTRAINTS;
- /* Report that we are now reindexing relations */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_REBUILD_INDEX);
+ /*
+ * Ensure that the indexes have the same persistence as the parent
+ * relation.
+ */
+ if (newrelpersistence == RELPERSISTENCE_UNLOGGED)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
+ else if (newrelpersistence == RELPERSISTENCE_PERMANENT)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
- reindex_relation(NULL, OIDOldHeap, reindex_flags, &reindex_params);
+ /* Report that we are now reindexing relations */
+ pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
+ PROGRESS_CLUSTER_PHASE_REBUILD_INDEX);
+
+ reindex_relation(NULL, OIDOldHeap, reindex_flags, &reindex_params);
+ }
/* Report that we are now doing clean up */
pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
@@ -1773,3 +2281,1877 @@ cluster_is_permitted_for_relation(Oid relid, Oid userid)
get_rel_name(relid))));
return false;
}
+
+#define REPL_PLUGIN_NAME "pgoutput_cluster"
+
+/*
+ * Each relation being processed by CLUSTER CONCURRENTLY must be in the
+ * clusteredRels hashtable.
+ */
+typedef struct ClusteredRel
+{
+ Oid relid;
+ Oid dbid;
+} ClusteredRel;
+
+static HTAB *ClusteredRelsHash = NULL;
+
+/* Maximum number of entries in the hashtable. */
+static int maxClusteredRels = 0;
+
+Size
+ClusterShmemSize(void)
+{
+ /*
+ * A replication slot is needed for the processing, so use this GUC to
+ * size the hashtable. Also reserve space for TOAST relations.
+ */
+ maxClusteredRels = max_replication_slots * 2;
+
+ return hash_estimate_size(maxClusteredRels, sizeof(ClusteredRel));
+}
+
+void
+ClusterShmemInit(void)
+{
+ HASHCTL info;
+
+ info.keysize = sizeof(ClusteredRel);
+ info.entrysize = info.keysize;
+
+ ClusteredRelsHash = ShmemInitHash("Clustered Relations",
+ maxClusteredRels,
+ maxClusteredRels,
+ &info,
+ HASH_ELEM | HASH_BLOBS);
+}
+
+/*
+ * Perform a preliminary check whether CLUSTER / VACUUM FULL CONCURRENTLY is
+ * possible. Note that here we only check things that should not change if we
+ * release the relation lock temporarily. The information that can change due
+ * to unlocking is checked in get_catalog_state().
+ */
+static void
+check_concurrent_cluster_requirements(Relation rel, bool isTopLevel,
+ bool isCluster)
+{
+ const char *stmt;
+
+ if (isCluster)
+ stmt = "CLUSTER (CONCURRENTLY)";
+ else
+ stmt = "VACUUM (FULL, CONCURRENTLY)";
+
+ /*
+ * Make sure we have no XID assigned, otherwise the call of
+ * setup_logical_decoding() can cause a deadlock.
+ */
+ PreventInTransactionBlock(isTopLevel, stmt);
+
+ CheckSlotPermissions();
+
+ /*
+ * Use an existing function to check if we can use logical decoding.
+ * However, note that RecoveryInProgress() should already have raised an
+ * error, as it does for non-concurrent VACUUM FULL / CLUSTER.
+ */
+ CheckLogicalDecodingRequirements();
+
+ /* See ClusterShmemSize() */
+ if (max_replication_slots < 2)
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ (errmsg("%s requires \"max_replication_slots\" to be at least 2",
+ stmt)));
+}
+
+/*
+ * Call this function before CLUSTER CONCURRENTLY starts to setup logical
+ * decoding. It makes sure that other users of the table put enough
+ * information into WAL.
+ *
+ * The point is that on various places we expect that the table we're
+ * processing is treated like a system catalog. For example, we need to be
+ * able to scan it using a "historic snapshot" anytime during the processing
+ * (as opposed to scanning only at the start point of the decoding, logical
+ * replication does during initial table synchronization), in order to apply
+ * concurrent UPDATE / DELETE commands.
+ *
+ * Since we need to close and reopen the relation here, the 'rel_p' and
+ * 'index_p' arguments are in/out.
+ *
+ * 'entered_p' receives a bool value telling whether the relation OID was
+ * entered into the hashtable or not.
+ */
+static void
+begin_concurrent_cluster(Relation *rel_p, Relation *index_p,
+ bool *entered_p)
+{
+ Relation rel = *rel_p;
+ Oid relid, toastrelid;
+ ClusteredRel key, *entry;
+ bool found;
+ RelReopenInfo rri[2];
+ int nrel;
+ static bool before_shmem_exit_callback_setup = false;
+
+ relid = RelationGetRelid(rel);
+
+ /*
+ * Make sure that we do not leave an entry in ClusteredRelsHash if exiting
+ * due to FATAL.
+ */
+ if (!before_shmem_exit_callback_setup)
+ {
+ before_shmem_exit(cluster_before_shmem_exit_callback, 0);
+ before_shmem_exit_callback_setup = true;
+ }
+
+ memset(&key, 0, sizeof(key));
+ key.relid = relid;
+ key.dbid = MyDatabaseId;
+
+ *entered_p = false;
+ LWLockAcquire(ClusteredRelsLock, LW_EXCLUSIVE);
+ entry = (ClusteredRel *)
+ hash_search(ClusteredRelsHash, &key, HASH_ENTER_NULL, &found);
+ if (found)
+ {
+ /*
+ * Since CLUSTER CONCURRENTLY takes ShareRowExclusiveLock, a conflict
+ * should occur much earlier. However, that lock may be released
+ * temporarily, see below. In any case, we should complain whatever
+ * the reason for the conflict might be.
+ */
+ ereport(ERROR,
+ (errmsg(CLUSTER_IN_PROGRESS_MESSAGE,
+ RelationGetRelationName(rel))));
+ }
+ if (entry == NULL)
+ ereport(ERROR,
+ (errmsg("too many requests for CLUSTER CONCURRENTLY at a time")),
+ (errhint("Please consider increasing the \"max_replication_slots\" configuration parameter.")));
+
+ /*
+ * Even if the insertion of TOAST relid should fail below, the caller has
+ * to do cleanup.
+ */
+ *entered_p = true;
+
+ /*
+ * Enable the callback to remove the entry in case of exit. We should not
+ * do this earlier, otherwise an attempt to insert an already existing
+ * entry could make us remove that entry (inserted by another backend)
+ * during ERROR handling.
+ */
+ Assert(!OidIsValid(clustered_rel));
+ clustered_rel = relid;
+
+ /*
+ * TOAST relation is not accessed using historic snapshot, but we enter it
+ * here to protect it from being VACUUMed by another backend. (Lock does
+ * not help in the CONCURRENT case because cannot hold it continuously
+ * till the end of the transaction.) See the comments on locking TOAST
+ * relation in copy_table_data().
+ */
+ toastrelid = rel->rd_rel->reltoastrelid;
+ if (OidIsValid(toastrelid))
+ {
+ key.relid = toastrelid;
+ entry = (ClusteredRel *)
+ hash_search(ClusteredRelsHash, &key, HASH_ENTER_NULL, &found);
+ if (found)
+ /*
+ * If we could enter the main relation, entering the TOAST relation
+ * should succeed too. Nevertheless, check.
+ */
+ ereport(ERROR,
+ (errmsg("TOAST relation of \"%s\" is already being processed by CLUSTER CONCURRENTLY",
+ RelationGetRelationName(rel))));
+ if (entry == NULL)
+ ereport(ERROR,
+ (errmsg("too many requests for CLUSTER CONCURRENTLY at a time")),
+ (errhint("Please consider increasing the \"max_replication_slots\" configuration parameter.")));
+
+ Assert(!OidIsValid(clustered_rel_toast));
+ clustered_rel_toast = toastrelid;
+ }
+ LWLockRelease(ClusteredRelsLock);
+
+ /*
+ * Make sure that other backends are aware of the new hash entry.
+ *
+ * Besides sending the invalidation message, we need to force re-opening
+ * of the relation, which includes the actual invalidation (and thus
+ * checking of our hashtable on the next access).
+ */
+ CacheInvalidateRelcacheImmediate(rel);
+ /*
+ * Since the hashtable only needs to be checked by write transactions,
+ * lock the relation in a mode that conflicts with any DML command. (The
+ * reading transactions are supposed to close the relation before opening
+ * it with higher lock.) Once we have the relation (and its index) locked,
+ * we unlock it immediately and then re-lock using the original mode.
+ */
+ nrel = 0;
+ init_rel_reopen_info(&rri[nrel++], rel_p, InvalidOid,
+ LOCK_CLUSTER_CONCURRENT, ShareLock);
+ if (index_p)
+ {
+ /*
+ * Another transaction might want to open both the relation and the
+ * index. If it already has the relation lock and is waiting for the
+ * index lock, we should release the index lock, otherwise our request
+ * for ShareLock on the relation can end up in a deadlock.
+ */
+ init_rel_reopen_info(&rri[nrel++], index_p, InvalidOid,
+ LOCK_CLUSTER_CONCURRENT, ShareLock);
+ }
+ unlock_and_close_relations(rri, nrel);
+ /*
+ * XXX It's not strictly necessary to lock the index here, but it's
+ * probably not worth teaching the "reopen API" about this special case.
+ */
+ reopen_relations(rri, nrel);
+
+ /* Switch back to the original lock. */
+ nrel = 0;
+ init_rel_reopen_info(&rri[nrel++], rel_p, InvalidOid,
+ ShareLock, LOCK_CLUSTER_CONCURRENT);
+ if (index_p)
+ init_rel_reopen_info(&rri[nrel++], index_p, InvalidOid,
+ ShareLock, LOCK_CLUSTER_CONCURRENT);
+ unlock_and_close_relations(rri, nrel);
+ reopen_relations(rri, nrel);
+
+ /* Avoid logical decoding of other relations by this backend. */
+ clustered_rel_locator = rel->rd_locator;
+ if (OidIsValid(toastrelid))
+ {
+ Relation toastrel;
+
+ /* Avoid logical decoding of other TOAST relations. */
+ toastrel = table_open(toastrelid, AccessShareLock);
+ clustered_rel_toast_locator = toastrel->rd_locator;
+ table_close(toastrel, AccessShareLock);
+ }
+}
+
+/*
+ * Call this when done with CLUSTER CONCURRENTLY.
+ *
+ * 'error' tells whether the function is being called in order to handle
+ * error.
+ */
+static void
+end_concurrent_cluster(bool error)
+{
+ ClusteredRel key, *entry = NULL, *entry_toast = NULL;
+ Oid relid = clustered_rel;
+ Oid toastrelid = clustered_rel_toast;
+
+ memset(&key, 0, sizeof(key));
+ key.dbid = MyDatabaseId;
+
+ /*
+ * Acquire the lock unconditionally: either of the entries may need to be
+ * removed, and the TOAST entry can exist independently of the main one.
+ */
+ LWLockAcquire(ClusteredRelsLock, LW_EXCLUSIVE);
+
+ /* Remove the relation from the hash if we managed to insert it. */
+ if (OidIsValid(clustered_rel))
+ {
+ key.relid = clustered_rel;
+ entry = hash_search(ClusteredRelsHash, &key, HASH_REMOVE, NULL);
+
+ /*
+ * By clearing this variable we also disable
+ * cluster_before_shmem_exit_callback().
+ */
+ clustered_rel = InvalidOid;
+ }
+
+ /* Remove the TOAST relation if there is one. */
+ if (OidIsValid(clustered_rel_toast))
+ {
+ key.relid = clustered_rel_toast;
+ entry_toast = hash_search(ClusteredRelsHash, &key, HASH_REMOVE,
+ NULL);
+
+ clustered_rel_toast = InvalidOid;
+ }
+ LWLockRelease(ClusteredRelsLock);
+
+ /* Restore normal function of logical decoding. */
+ clustered_rel_locator.relNumber = InvalidOid;
+ clustered_rel_toast_locator.relNumber = InvalidOid;
+
+ /*
+ * On normal completion (!error), we should not really fail to remove the
+ * entry. But if it wasn't there for any reason, raise ERROR to make sure
+ * the transaction is aborted: if other transactions, while changing the
+ * contents of the relation, didn't know that CLUSTER CONCURRENTLY was in
+ * progress, they might have failed to write enough information to WAL,
+ * and thus we could have produced inconsistent table contents.
+ *
+ * On the other hand, if we are already handling an error, there's no
+ * reason to worry about inconsistent contents of the new storage because
+ * the transaction is going to be rolled back anyway. Furthermore, by
+ * raising ERROR here we'd shadow the original error.
+ */
+ if (!error)
+ {
+ char *relname;
+
+ if (OidIsValid(relid) && entry == NULL)
+ {
+ relname = get_rel_name(relid);
+ if (!relname)
+ ereport(ERROR,
+ (errmsg("cache lookup failed for relation %u",
+ relid)));
+
+ ereport(ERROR,
+ (errmsg("relation \"%s\" not found among clustered relations",
+ relname)));
+ }
+
+ /*
+ * Likewise, the TOAST relation should not have disappeared.
+ */
+ if (OidIsValid(toastrelid) && entry_toast == NULL)
+ {
+ relname = get_rel_name(key.relid);
+ if (!relname)
+ ereport(ERROR,
+ (errmsg("cache lookup failed for relation %u",
+ key.relid)));
+
+ ereport(ERROR,
+ (errmsg("relation \"%s\" not found among clustered relations",
+ relname)));
+ }
+ }
+
+ /*
+ * Note: unlike begin_concurrent_cluster(), here we do not lock/unlock the
+ * relation: 1) On normal completion, the caller is already holding
+ * AccessExclusiveLock (till the end of the transaction), 2) on ERROR /
+ * FATAL, we try to do the cleanup asap, but the worst case is that other
+ * backends will write unnecessary information to WAL until they close the
+ * relation.
+ */
+}
+
+/*
+ * A wrapper to call end_concurrent_cluster() as a before_shmem_exit callback.
+ */
+static void
+cluster_before_shmem_exit_callback(int code, Datum arg)
+{
+ if (OidIsValid(clustered_rel) || OidIsValid(clustered_rel_toast))
+ end_concurrent_cluster(true);
+}
+
+/*
+ * Check if relation is currently being processed by CLUSTER CONCURRENTLY.
+ */
+bool
+is_concurrent_cluster_in_progress(Oid relid)
+{
+ ClusteredRel key, *entry;
+
+ memset(&key, 0, sizeof(key));
+ key.relid = relid;
+ key.dbid = MyDatabaseId;
+
+ LWLockAcquire(ClusteredRelsLock, LW_SHARED);
+ entry = (ClusteredRel *)
+ hash_search(ClusteredRelsHash, &key, HASH_FIND, NULL);
+ LWLockRelease(ClusteredRelsLock);
+
+ return entry != NULL;
+}
+
+/*
+ * Check if VACUUM FULL / CLUSTER CONCURRENTLY is already running for given
+ * relation, and if so, raise ERROR. The problem is that cluster_rel() needs
+ * to release its lock on the relation temporarily at some point, so our lock
+ * alone does not help. Commands that might break what cluster_rel() is doing
+ * should call this function first.
+ *
+ * Return without checking if lockmode allows for race conditions which would
+ * make the result meaningless. In that case, cluster_rel() itself should
+ * throw ERROR if the relation was changed by us in an incompatible
+ * way. However, if it managed to do most of its work by then, a lot of CPU
+ * time might be wasted.
+ */
+void
+check_for_concurrent_cluster(Oid relid, LOCKMODE lockmode)
+{
+ /*
+ * If the caller does not have a lock that conflicts with
+ * LOCK_CLUSTER_CONCURRENT, the check makes little sense because the
+ * VACUUM FULL / CLUSTER CONCURRENTLY can start anytime after the check.
+ */
+ if (lockmode < LOCK_CLUSTER_CONCURRENT)
+ return;
+
+ if (is_concurrent_cluster_in_progress(relid))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg(CLUSTER_IN_PROGRESS_MESSAGE,
+ get_rel_name(relid))));
+}
+
+/*
+ * Check if relation is eligible for CLUSTER CONCURRENTLY and retrieve the
+ * catalog state to be passed later to check_catalog_changes.
+ *
+ * Caller is supposed to hold (at least) LOCK_CLUSTER_CONCURRENT on the
+ * relation.
+ */
+static CatalogState *
+get_catalog_state(Relation rel)
+{
+ CatalogState *result = palloc_object(CatalogState);
+ List *ind_oids;
+ ListCell *lc;
+ int ninds, i;
+ char relpersistence = rel->rd_rel->relpersistence;
+ char replident = rel->rd_rel->relreplident;
+ Oid ident_idx = RelationGetReplicaIndex(rel);
+ TupleDesc td_src = RelationGetDescr(rel);
+
+ /*
+ * While gathering the catalog information, check if there is a reason not
+ * to proceed.
+ */
+ check_relation_is_clusterable_concurrently(rel, ERROR,
+ "CLUSTER (CONCURRENTLY)");
+
+ /* No index should be dropped while we are checking it. */
+ Assert(CheckRelationLockedByMe(rel, ShareUpdateExclusiveLock, true));
+
+ ind_oids = RelationGetIndexList(rel);
+ result->ninds = ninds = list_length(ind_oids);
+ result->ind_oids = palloc_array(Oid, ninds);
+ result->ind_tupdescs = palloc_array(TupleDesc, ninds);
+ i = 0;
+ foreach(lc, ind_oids)
+ {
+ Oid ind_oid = lfirst_oid(lc);
+ Relation index;
+ TupleDesc td_ind_src, td_ind_dst;
+
+ /*
+ * A weaker lock should be OK for the index, but this one should not
+ * break anything either.
+ */
+ index = index_open(ind_oid, ShareUpdateExclusiveLock);
+
+ result->ind_oids[i] = RelationGetRelid(index);
+ td_ind_src = RelationGetDescr(index);
+ td_ind_dst = palloc(TupleDescSize(td_ind_src));
+ TupleDescCopy(td_ind_dst, td_ind_src);
+ result->ind_tupdescs[i] = td_ind_dst;
+ i++;
+
+ index_close(index, ShareUpdateExclusiveLock);
+ }
+
+ /* Fill-in the relation info. */
+ result->tupdesc = palloc(TupleDescSize(td_src));
+ TupleDescCopy(result->tupdesc, td_src);
+ result->relpersistence = relpersistence;
+ result->replident = replident;
+ result->replidindex = ident_idx;
+
+ return result;
+}
+
+static void
+free_catalog_state(CatalogState *state)
+{
+ for (int i = 0; i < state->ninds; i++)
+ FreeTupleDesc(state->ind_tupdescs[i]);
+
+ FreeTupleDesc(state->tupdesc);
+ pfree(state->ind_oids);
+ pfree(state->ind_tupdescs);
+ pfree(state);
+}
+
+/*
+ * Raise ERROR if 'rel' changed in a way that does not allow further
+ * processing of CLUSTER CONCURRENTLY.
+ *
+ * Besides the relation's tuple descriptor, it's important to check indexes:
+ * concurrent change of the index definition (can it happen in any other way
+ * than dropping and re-creating the index, accidentally with the same OID?)
+ * a problem because we may already have the new index built. If an index was
+ * created or dropped concurrently, we'd fail to swap the index storage. In
+ * any case, we prefer to check the indexes early to get an explicit error
+ * message about the mismatch. Furthermore, the earlier we detect the change,
+ * the fewer CPU cycles we waste.
+ *
+ * Note that we do not check constraints because the transaction which changed
+ * them must have ensured that the existing tuples satisfy the new
+ * constraints. If any DML commands were necessary for that, we will simply
+ * decode them from WAL and apply them to the new storage.
+ *
+ * Caller is supposed to hold (at least) ShareUpdateExclusiveLock on the
+ * relation.
+ */
+static void
+check_catalog_changes(Relation rel, CatalogState *cat_state)
+{
+ Oid reltoastrelid = rel->rd_rel->reltoastrelid;
+ List *ind_oids;
+ ListCell *lc;
+ LOCKMODE lockmode;
+ Oid ident_idx;
+ TupleDesc td, td_cp;
+
+ /* First, check the relation info. */
+
+ /* TOAST is not easy to change, but check. */
+ if (reltoastrelid != clustered_rel_toast)
+ ereport(ERROR,
+ errmsg("TOAST relation of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+
+ /*
+ * Likewise, check_for_concurrent_cluster() should prevent others from
+ * changing the relation file concurrently, but it's our responsibility to
+ * avoid data loss. (The original locators are stored outside cat_state,
+ * but the check belongs to this function.)
+ */
+ if (!RelFileLocatorEquals(rel->rd_locator, clustered_rel_locator))
+ ereport(ERROR,
+ (errmsg("file of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel))));
+ if (OidIsValid(reltoastrelid))
+ {
+ Relation toastrel;
+
+ toastrel = table_open(reltoastrelid, AccessShareLock);
+ if (!RelFileLocatorEquals(toastrel->rd_locator,
+ clustered_rel_toast_locator))
+ ereport(ERROR,
+ (errmsg("file of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(toastrel))));
+ table_close(toastrel, AccessShareLock);
+ }
+
+ if (rel->rd_rel->relpersistence != cat_state->relpersistence)
+ ereport(ERROR,
+ errmsg("persistence of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+
+ if (cat_state->replident != rel->rd_rel->relreplident)
+ ereport(ERROR,
+ errmsg("replica identity of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+
+ ident_idx = RelationGetReplicaIndex(rel);
+ if (!OidIsValid(ident_idx) && OidIsValid(rel->rd_pkindex))
+ ident_idx = rel->rd_pkindex;
+ if (cat_state->replidindex != ident_idx)
+ ereport(ERROR,
+ errmsg("identity index of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+
+ /*
+ * cat_state contains a copy whose constraint info has been cleared, so
+ * create a matching temporary copy of the current descriptor for the
+ * comparison.
+ */
+ td = RelationGetDescr(rel);
+ td_cp = palloc(TupleDescSize(td));
+ TupleDescCopy(td_cp, td);
+ if (!equalTupleDescs(cat_state->tupdesc, td_cp))
+ ereport(ERROR,
+ errmsg("definition of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+ FreeTupleDesc(td_cp);
+
+ /* Now we are only interested in indexes. */
+ if (cat_state->ninds == 0)
+ return;
+
+ /* No index should be dropped while we are checking the relation. */
+ lockmode = ShareUpdateExclusiveLock;
+ Assert(CheckRelationLockedByMe(rel, lockmode, true));
+
+ ind_oids = RelationGetIndexList(rel);
+ if (list_length(ind_oids) != cat_state->ninds)
+ goto failed_index;
+
+ foreach(lc, ind_oids)
+ {
+ Oid ind_oid = lfirst_oid(lc);
+ int i;
+ TupleDesc tupdesc;
+ Relation index;
+
+ /* Find the index in cat_state. */
+ for (i = 0; i < cat_state->ninds; i++)
+ {
+ if (cat_state->ind_oids[i] == ind_oid)
+ break;
+ }
+ /*
+ * OID not found, i.e. the index was replaced by another one. XXX
+ * Should we still try to find an index that has the desired tuple
+ * descriptor? Or should we always look up by tuple descriptor and
+ * not use OIDs at all?
+ */
+ if (i == cat_state->ninds)
+ goto failed_index;
+
+ /* Check the tuple descriptor. */
+ index = try_index_open(ind_oid, lockmode);
+ if (index == NULL)
+ goto failed_index;
+ tupdesc = RelationGetDescr(index);
+ if (!equalTupleDescs(cat_state->ind_tupdescs[i], tupdesc))
+ goto failed_index;
+ index_close(index, lockmode);
+ }
+
+ return;
+
+failed_index:
+ ereport(ERROR,
+ (errmsg("index(es) of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel))));
+}
+
+/*
+ * This function is much like pg_create_logical_replication_slot() except that
+ * the new slot is neither released (if anyone else could read changes from
+ * our slot, we could miss changes that other backends make while we copy
+ * the existing data into the temporary table), nor persisted (it's easier
+ * to handle a crash by restarting all the work from scratch).
+ *
+ * XXX Even though CreateInitDecodingContext() does not set state to
+ * RS_PERSISTENT, it does write the slot to disk. We rely on
+ * RestoreSlotFromDisk() to delete ephemeral slots during startup. (Both ERROR
+ * and FATAL should lead to cleanup even before the cluster goes down.)
+ */
+static LogicalDecodingContext *
+setup_logical_decoding(Oid relid, const char *slotname, TupleDesc tupdesc)
+{
+ LogicalDecodingContext *ctx;
+ ClusterDecodingState *dstate;
+
+ /* RS_TEMPORARY so that the slot gets cleaned up on ERROR. */
+ ReplicationSlotCreate(slotname, true, RS_TEMPORARY, false, false, false);
+
+ /*
+ * Neither prepare_write nor do_write callback nor update_progress is
+ * useful for us.
+ *
+ * Regarding the value of need_full_snapshot, we pass false because the
+ * table we are processing is present in ClusteredRelsHash and therefore,
+ * regarding logical decoding, treated like a catalog.
+ */
+ ctx = CreateInitDecodingContext(REPL_PLUGIN_NAME,
+ NIL,
+ false,
+ InvalidXLogRecPtr,
+ XL_ROUTINE(.page_read = read_local_xlog_page,
+ .segment_open = wal_segment_open,
+ .segment_close = wal_segment_close),
+ NULL, NULL, NULL);
+
+ /*
+ * We have no control over the fast_forward setting, so at least check it.
+ */
+ Assert(!ctx->fast_forward);
+
+ DecodingContextFindStartpoint(ctx);
+
+ /* Some WAL records should have been read. */
+ Assert(ctx->reader->EndRecPtr != InvalidXLogRecPtr);
+
+ XLByteToSeg(ctx->reader->EndRecPtr, cluster_current_segment,
+ wal_segment_size);
+
+ /*
+ * Set up the structures to store the decoded changes.
+ */
+ dstate = palloc0(sizeof(ClusterDecodingState));
+ dstate->relid = relid;
+ dstate->tstore = tuplestore_begin_heap(false, false,
+ maintenance_work_mem);
+ dstate->tupdesc = tupdesc;
+
+ /* Initialize the descriptor to store the changes ... */
+ dstate->tupdesc_change = CreateTemplateTupleDesc(1);
+
+ TupleDescInitEntry(dstate->tupdesc_change, 1, NULL, BYTEAOID, -1, 0);
+ /* ... as well as the corresponding slot. */
+ dstate->tsslot = MakeSingleTupleTableSlot(dstate->tupdesc_change,
+ &TTSOpsMinimalTuple);
+
+ dstate->resowner = ResourceOwnerCreate(CurrentResourceOwner,
+ "logical decoding");
+
+ ctx->output_writer_private = dstate;
+ return ctx;
+}
+
+/*
+ * Retrieve tuple from a change structure. As for the change, no alignment is
+ * assumed.
+ */
+static HeapTuple
+get_changed_tuple(ConcurrentChange *change)
+{
+ HeapTupleData tup_data;
+ HeapTuple result;
+ char *src;
+
+ /*
+ * Ensure alignment before accessing the fields. (This is why we can't use
+ * heap_copytuple() instead of this function.)
+ */
+ memcpy(&tup_data, &change->tup_data, sizeof(HeapTupleData));
+
+ result = (HeapTuple) palloc(HEAPTUPLESIZE + tup_data.t_len);
+ memcpy(result, &tup_data, sizeof(HeapTupleData));
+ result->t_data = (HeapTupleHeader) ((char *) result + HEAPTUPLESIZE);
+ src = (char *) change + sizeof(ConcurrentChange);
+ memcpy(result->t_data, src, result->t_len);
+
+ return result;
+}
+
+/*
+ * Decode logical changes from the WAL sequence up to end_of_wal.
+ */
+void
+cluster_decode_concurrent_changes(LogicalDecodingContext *ctx,
+ XLogRecPtr end_of_wal)
+{
+ ClusterDecodingState *dstate;
+ ResourceOwner resowner_old;
+ PgBackendProgress progress;
+
+ /*
+ * Invalidate the "present" cache before moving to "(recent) history".
+ */
+ InvalidateSystemCaches();
+
+ dstate = (ClusterDecodingState *) ctx->output_writer_private;
+ resowner_old = CurrentResourceOwner;
+ CurrentResourceOwner = dstate->resowner;
+
+ /*
+ * reorderbuffer.c uses an internal subtransaction, whose abort ends the
+ * command progress reporting. Save the status here so that we can
+ * restore it when done with the decoding.
+ */
+ memcpy(&progress, &MyBEEntry->st_progress, sizeof(PgBackendProgress));
+
+ PG_TRY();
+ {
+ while (ctx->reader->EndRecPtr < end_of_wal)
+ {
+ XLogRecord *record;
+ XLogSegNo segno_new;
+ char *errm = NULL;
+ XLogRecPtr end_lsn;
+
+ record = XLogReadRecord(ctx->reader, &errm);
+ if (errm)
+ elog(ERROR, "%s", errm);
+
+ if (record != NULL)
+ LogicalDecodingProcessRecord(ctx, ctx->reader);
+
+ /*
+ * If a WAL segment boundary has been crossed, inform the decoding
+ * system that the catalog_xmin can advance. (We could confirm more
+ * often, but filling a single WAL segment should not take much
+ * time.)
+ */
+ end_lsn = ctx->reader->EndRecPtr;
+ XLByteToSeg(end_lsn, segno_new, wal_segment_size);
+ if (segno_new != cluster_current_segment)
+ {
+ LogicalConfirmReceivedLocation(end_lsn);
+ elog(DEBUG1, "cluster: confirmed receive location %X/%X",
+ (uint32) (end_lsn >> 32), (uint32) end_lsn);
+ cluster_current_segment = segno_new;
+ }
+
+ CHECK_FOR_INTERRUPTS();
+ }
+ InvalidateSystemCaches();
+ CurrentResourceOwner = resowner_old;
+ }
+ PG_CATCH();
+ {
+ InvalidateSystemCaches();
+ CurrentResourceOwner = resowner_old;
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+
+ /* Restore the progress reporting status. */
+ pgstat_progress_restore_state(&progress);
+}
+
+/*
+ * Apply changes that happened during the initial load.
+ *
+ * The scan key is passed by the caller so that it does not have to be
+ * constructed multiple times. All fields of the key entries are
+ * initialized, except for sk_argument.
+ */
+static void
+apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
+ ScanKey key, int nkeys, IndexInsertState *iistate)
+{
+ TupleTableSlot *index_slot, *ident_slot;
+ HeapTuple tup_old = NULL;
+
+ if (dstate->nchanges == 0)
+ return;
+
+ /* TupleTableSlot is needed to pass the tuple to ExecInsertIndexTuples(). */
+ index_slot = MakeSingleTupleTableSlot(dstate->tupdesc, &TTSOpsHeapTuple);
+ iistate->econtext->ecxt_scantuple = index_slot;
+
+ /* A slot to fetch tuples from identity index. */
+ ident_slot = table_slot_create(rel, NULL);
+
+ while (tuplestore_gettupleslot(dstate->tstore, true, false,
+ dstate->tsslot))
+ {
+ bool shouldFree;
+ HeapTuple tup_change,
+ tup,
+ tup_exist;
+ char *change_raw;
+ ConcurrentChange *change;
+ bool isnull[1];
+ Datum values[1];
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Get the change from the single-column tuple. */
+ tup_change = ExecFetchSlotHeapTuple(dstate->tsslot, false, &shouldFree);
+ heap_deform_tuple(tup_change, dstate->tupdesc_change, values, isnull);
+ Assert(!isnull[0]);
+
+ /* This is bytea, but char* is easier to work with. */
+ change_raw = (char *) DatumGetByteaP(values[0]);
+
+ change = (ConcurrentChange *) VARDATA(change_raw);
+
+ /* TRUNCATE change contains no tuple, so process it separately. */
+ if (change->kind == CHANGE_TRUNCATE)
+ {
+ /*
+ * All the things that ExecuteTruncateGuts() does (such as firing
+ * triggers or handling the DROP_CASCADE behavior) should have
+ * taken place on the source relation. Thus we only do the actual
+ * truncation of the new relation (and its indexes).
+ */
+ heap_truncate_one_rel(rel);
+
+ pfree(tup_change);
+ continue;
+ }
+
+ /*
+ * Extract the tuple from the change. The tuple is copied here because
+ * it might be assigned to 'tup_old', in which case it needs to
+ * survive into the next iteration.
+ */
+ tup = get_changed_tuple(change);
+
+ if (change->kind == CHANGE_UPDATE_OLD)
+ {
+ Assert(tup_old == NULL);
+ tup_old = tup;
+ }
+ else if (change->kind == CHANGE_INSERT)
+ {
+ Assert(tup_old == NULL);
+
+ apply_concurrent_insert(rel, change, tup, iistate, index_slot);
+
+ pfree(tup);
+ }
+ else if (change->kind == CHANGE_UPDATE_NEW ||
+ change->kind == CHANGE_DELETE)
+ {
+ IndexScanDesc ind_scan = NULL;
+ HeapTuple tup_key;
+
+ if (change->kind == CHANGE_UPDATE_NEW)
+ {
+ tup_key = tup_old != NULL ? tup_old : tup;
+ }
+ else
+ {
+ Assert(tup_old == NULL);
+ tup_key = tup;
+ }
+
+ /*
+ * Find the tuple to be updated or deleted.
+ */
+ tup_exist = find_target_tuple(rel, key, nkeys, tup_key,
+ iistate, ident_slot, &ind_scan);
+ if (tup_exist == NULL)
+ elog(ERROR, "failed to find target tuple");
+
+ if (change->kind == CHANGE_UPDATE_NEW)
+ apply_concurrent_update(rel, tup, tup_exist, change, iistate,
+ index_slot);
+ else
+ apply_concurrent_delete(rel, tup_exist, change);
+
+ if (tup_old != NULL)
+ {
+ pfree(tup_old);
+ tup_old = NULL;
+ }
+
+ pfree(tup);
+ index_endscan(ind_scan);
+ }
+ else
+ elog(ERROR, "unrecognized change kind: %d", change->kind);
+
+ /* If there's any change, make it visible to the next iteration. */
+ if (change->kind != CHANGE_UPDATE_OLD)
+ {
+ CommandCounterIncrement();
+ UpdateActiveSnapshotCommandId();
+ }
+
+ /* TTSOpsMinimalTuple has .get_heap_tuple==NULL. */
+ Assert(shouldFree);
+ pfree(tup_change);
+ }
+
+ tuplestore_clear(dstate->tstore);
+ dstate->nchanges = 0;
+
+ /* Cleanup. */
+ ExecDropSingleTupleTableSlot(index_slot);
+ ExecDropSingleTupleTableSlot(ident_slot);
+}
+
+static void
+apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
+ IndexInsertState *iistate, TupleTableSlot *index_slot)
+{
+ List *recheck;
+
+ heap_insert(rel, tup, GetCurrentCommandId(true), HEAP_INSERT_NO_LOGICAL, NULL);
+
+ /*
+ * Update indexes. (Functions used by the indexes may need the active
+ * snapshot, which the caller is expected to have set.)
+ */
+ ExecStoreHeapTuple(tup, index_slot, false);
+ recheck = ExecInsertIndexTuples(iistate->rri,
+ index_slot,
+ iistate->estate,
+ false, /* update */
+ false, /* noDupErr */
+ NULL, /* specConflict */
+ NIL, /* arbiterIndexes */
+ false /* onlySummarizing */
+ );
+
+ /*
+ * If recheck is required, it must have been performed on the source
+ * relation by now. (All the logical changes we process here are already
+ * committed.)
+ */
+ list_free(recheck);
+
+ pgstat_progress_incr_param(PROGRESS_CLUSTER_HEAP_TUPLES_INSERTED, 1);
+}
+
+static void
+apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
+ ConcurrentChange *change, IndexInsertState *iistate,
+ TupleTableSlot *index_slot)
+{
+ List *recheck;
+ TU_UpdateIndexes update_indexes;
+
+ /*
+ * Write the new tuple into the new heap. ('tup' gets the TID assigned
+ * here.)
+ */
+ simple_heap_update(rel, &tup_target->t_self, tup, &update_indexes);
+
+ ExecStoreHeapTuple(tup, index_slot, false);
+
+ if (update_indexes != TU_None)
+ {
+ recheck = ExecInsertIndexTuples(iistate->rri,
+ index_slot,
+ iistate->estate,
+ true, /* update */
+ false, /* noDupErr */
+ NULL, /* specConflict */
+ NIL, /* arbiterIndexes */
+ /* onlySummarizing */
+ update_indexes == TU_Summarizing);
+ list_free(recheck);
+ }
+
+ pgstat_progress_incr_param(PROGRESS_CLUSTER_HEAP_TUPLES_UPDATED, 1);
+}
+
+static void
+apply_concurrent_delete(Relation rel, HeapTuple tup_target,
+ ConcurrentChange *change)
+{
+ simple_heap_delete(rel, &tup_target->t_self);
+
+ pgstat_progress_incr_param(PROGRESS_CLUSTER_HEAP_TUPLES_DELETED, 1);
+}
+
+/*
+ * Find the tuple to be updated or deleted.
+ *
+ * 'key' is a pre-initialized scan key, into which the function will put the
+ * key values.
+ *
+ * 'tup_key' is a tuple containing the key values for the scan.
+ *
+ * On exit, '*scan_p' contains the scan descriptor used. The caller must
+ * close it when the returned tuple is no longer needed.
+ */
+static HeapTuple
+find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
+ IndexInsertState *iistate,
+ TupleTableSlot *ident_slot, IndexScanDesc *scan_p)
+{
+ IndexScanDesc scan;
+ Form_pg_index ident_form;
+ int2vector *ident_indkey;
+ HeapTuple result = NULL;
+
+ scan = index_beginscan(rel, iistate->ident_index, GetActiveSnapshot(),
+ nkeys, 0);
+ *scan_p = scan;
+ index_rescan(scan, key, nkeys, NULL, 0);
+
+ /* Info needed to retrieve key values from heap tuple. */
+ ident_form = iistate->ident_index->rd_index;
+ ident_indkey = &ident_form->indkey;
+
+ /* Use the incoming tuple to finalize the scan key. */
+ for (int i = 0; i < scan->numberOfKeys; i++)
+ {
+ ScanKey entry;
+ bool isnull;
+ int16 attno_heap;
+
+ entry = &scan->keyData[i];
+ attno_heap = ident_indkey->values[i];
+ entry->sk_argument = heap_getattr(tup_key,
+ attno_heap,
+ rel->rd_att,
+ &isnull);
+ Assert(!isnull);
+ }
+ if (index_getnext_slot(scan, ForwardScanDirection, ident_slot))
+ {
+ bool shouldFree;
+
+ result = ExecFetchSlotHeapTuple(ident_slot, false, &shouldFree);
+ /* TTSOpsBufferHeapTuple has .get_heap_tuple != NULL. */
+ Assert(!shouldFree);
+ }
+
+ return result;
+}
+
+/*
+ * Decode and apply concurrent changes.
+ *
+ * Pass rel_src iff its reltoastrelid is needed.
+ */
+static void
+process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
+ Relation rel_dst, Relation rel_src, ScanKey ident_key,
+ int ident_key_nentries, IndexInsertState *iistate)
+{
+ ClusterDecodingState *dstate;
+
+ pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
+ PROGRESS_CLUSTER_PHASE_CATCH_UP);
+
+ dstate = (ClusterDecodingState *) ctx->output_writer_private;
+
+ cluster_decode_concurrent_changes(ctx, end_of_wal);
+
+ if (dstate->nchanges == 0)
+ return;
+
+ PG_TRY();
+ {
+ /*
+ * Make sure that TOAST values can eventually be accessed via the old
+ * relation - see comment in copy_table_data().
+ */
+ if (rel_src)
+ rel_dst->rd_toastoid = rel_src->rd_rel->reltoastrelid;
+
+ apply_concurrent_changes(dstate, rel_dst, ident_key,
+ ident_key_nentries, iistate);
+ }
+ PG_FINALLY();
+ {
+ if (rel_src)
+ rel_dst->rd_toastoid = InvalidOid;
+ }
+ PG_END_TRY();
+}
+
+static IndexInsertState *
+get_index_insert_state(Relation relation, Oid ident_index_id)
+{
+ EState *estate;
+ int i;
+ IndexInsertState *result;
+
+ result = (IndexInsertState *) palloc0(sizeof(IndexInsertState));
+ estate = CreateExecutorState();
+ result->econtext = GetPerTupleExprContext(estate);
+
+ result->rri = (ResultRelInfo *) palloc(sizeof(ResultRelInfo));
+ InitResultRelInfo(result->rri, relation, 0, 0, 0);
+ ExecOpenIndices(result->rri, false);
+
+ /*
+ * Find the relcache entry of the identity index so that we spend no extra
+ * effort to open / close it.
+ */
+ for (i = 0; i < result->rri->ri_NumIndices; i++)
+ {
+ Relation ind_rel;
+
+ ind_rel = result->rri->ri_IndexRelationDescs[i];
+ if (ind_rel->rd_id == ident_index_id)
+ result->ident_index = ind_rel;
+ }
+ if (result->ident_index == NULL)
+ elog(ERROR, "failed to open identity index");
+
+ /* Only initialize fields needed by ExecInsertIndexTuples(). */
+ result->estate = estate;
+
+ return result;
+}
+
+/*
+ * Build scan key to process logical changes.
+ */
+static ScanKey
+build_identity_key(Oid ident_idx_oid, Relation rel_src, int *nentries)
+{
+ Relation ident_idx_rel;
+ Form_pg_index ident_idx;
+ int n,
+ i;
+ ScanKey result;
+
+ Assert(OidIsValid(ident_idx_oid));
+ ident_idx_rel = index_open(ident_idx_oid, AccessShareLock);
+ ident_idx = ident_idx_rel->rd_index;
+ n = ident_idx->indnatts;
+ result = (ScanKey) palloc(sizeof(ScanKeyData) * n);
+ for (i = 0; i < n; i++)
+ {
+ ScanKey entry;
+ int16 relattno;
+ Form_pg_attribute att;
+ Oid opfamily,
+ opcintype,
+ opno,
+ opcode;
+
+ entry = &result[i];
+ relattno = ident_idx->indkey.values[i];
+ if (relattno >= 1)
+ {
+ TupleDesc desc;
+
+ desc = rel_src->rd_att;
+ att = TupleDescAttr(desc, relattno - 1);
+ }
+ else
+ elog(ERROR, "unexpected attribute number %d in index", relattno);
+
+ opfamily = ident_idx_rel->rd_opfamily[i];
+ opcintype = ident_idx_rel->rd_opcintype[i];
+ opno = get_opfamily_member(opfamily, opcintype, opcintype,
+ BTEqualStrategyNumber);
+
+ if (!OidIsValid(opno))
+ elog(ERROR, "failed to find equality operator for type %u", opcintype);
+
+ opcode = get_opcode(opno);
+ if (!OidIsValid(opcode))
+ elog(ERROR, "failed to find function for operator %u", opno);
+
+ /* Initialize everything but argument. */
+ ScanKeyInit(entry,
+ i + 1,
+ BTEqualStrategyNumber, opcode,
+ (Datum) NULL);
+ entry->sk_collation = att->attcollation;
+ }
+ index_close(ident_idx_rel, AccessShareLock);
+
+ *nentries = n;
+ return result;
+}
+
+static void
+free_index_insert_state(IndexInsertState *iistate)
+{
+ ExecCloseIndices(iistate->rri);
+ FreeExecutorState(iistate->estate);
+ pfree(iistate->rri);
+ pfree(iistate);
+}
+
+static void
+cleanup_logical_decoding(LogicalDecodingContext *ctx)
+{
+ ClusterDecodingState *dstate;
+
+ dstate = (ClusterDecodingState *) ctx->output_writer_private;
+
+ ExecDropSingleTupleTableSlot(dstate->tsslot);
+ FreeTupleDesc(dstate->tupdesc_change);
+ FreeTupleDesc(dstate->tupdesc);
+ tuplestore_end(dstate->tstore);
+
+ FreeDecodingContext(ctx);
+}
+
+/*
+ * The final steps of rebuild_relation() for concurrent processing.
+ *
+ * On entry, NewHeap is locked in AccessExclusiveLock mode. OldHeap and its
+ * clustering index (if one is passed) are still locked in a mode that allows
+ * concurrent data changes. On exit, both tables and their indexes are closed,
+ * but locked in AccessExclusiveLock mode.
+ */
+static void
+rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
+ Relation cl_index,
+ CatalogState *cat_state,
+ LogicalDecodingContext *ctx,
+ bool swap_toast_by_content,
+ TransactionId frozenXid,
+ MultiXactId cutoffMulti)
+{
+ LOCKMODE lockmode_old PG_USED_FOR_ASSERTS_ONLY;
+ List *ind_oids_new;
+ Oid old_table_oid = RelationGetRelid(OldHeap);
+ Oid new_table_oid = RelationGetRelid(NewHeap);
+ List *ind_oids_old = RelationGetIndexList(OldHeap);
+ ListCell *lc, *lc2;
+ char relpersistence;
+ bool is_system_catalog;
+ Oid ident_idx_old, ident_idx_new;
+ IndexInsertState *iistate;
+ ScanKey ident_key;
+ int ident_key_nentries;
+ XLogRecPtr wal_insert_ptr, end_of_wal;
+ char dummy_rec_data = '\0';
+ RelReopenInfo *rri = NULL;
+ int nrel;
+ Relation *ind_refs_all, *ind_refs_p;
+
+ /* Like in cluster_rel(). */
+ lockmode_old = LOCK_CLUSTER_CONCURRENT;
+ Assert(CheckRelationLockedByMe(OldHeap, lockmode_old, false));
+ Assert(cl_index == NULL ||
+ CheckRelationLockedByMe(cl_index, lockmode_old, false));
+ /* This is expected from the caller. */
+ Assert(CheckRelationLockedByMe(NewHeap, AccessExclusiveLock, false));
+
+ ident_idx_old = RelationGetReplicaIndex(OldHeap);
+
+ /*
+ * Unlike the exclusive case, we build new indexes for the new relation
+ * rather than swapping the storage and reindexing the old relation. The
+ * point is that the index build can take some time, so we do it before we
+ * get AccessExclusiveLock on the old heap and therefore we cannot swap
+ * the heap storage yet.
+ *
+ * index_create() will lock the new indexes with AccessExclusiveLock
+ * during creation - no need to change that.
+ */
+ ind_oids_new = build_new_indexes(NewHeap, OldHeap, ind_oids_old);
+
+ /*
+ * Processing shouldn't start without a valid identity index.
+ */
+ Assert(OidIsValid(ident_idx_old));
+
+ /* Find "identity index" on the new relation. */
+ ident_idx_new = InvalidOid;
+ forboth(lc, ind_oids_old, lc2, ind_oids_new)
+ {
+ Oid ind_old = lfirst_oid(lc);
+ Oid ind_new = lfirst_oid(lc2);
+
+ if (ident_idx_old == ind_old)
+ {
+ ident_idx_new = ind_new;
+ break;
+ }
+ }
+ if (!OidIsValid(ident_idx_new))
+ /*
+ * Should not happen, given our lock on the old relation.
+ */
+ ereport(ERROR,
+ (errmsg("identity index missing on the new relation")));
+
+ /* Executor state to update indexes. */
+ iistate = get_index_insert_state(NewHeap, ident_idx_new);
+
+ /*
+ * Build scan key that we'll use to look for rows to be updated / deleted
+ * during logical decoding.
+ */
+ ident_key = build_identity_key(ident_idx_new, OldHeap, &ident_key_nentries);
+
+ /*
+ * Flush all WAL records inserted so far (possibly except for the last
+ * incomplete page, see GetInsertRecPtr), to minimize the amount of data
+ * we need to flush while holding exclusive lock on the source table.
+ */
+ wal_insert_ptr = GetInsertRecPtr();
+ XLogFlush(wal_insert_ptr);
+ end_of_wal = GetFlushRecPtr(NULL);
+
+ /*
+ * Apply the concurrent changes for the first time, to minimize the time
+ * we will need to hold AccessExclusiveLock. (Quite a lot of WAL may
+ * have been written during the data copying and index creation.)
+ */
+ process_concurrent_changes(ctx, end_of_wal, NewHeap,
+ swap_toast_by_content ? OldHeap : NULL,
+ ident_key, ident_key_nentries, iistate);
+
+ /*
+ * Release the locks that allowed concurrent data changes, in order to
+ * acquire the AccessExclusiveLock.
+ */
+ nrel = 0;
+ /*
+ * We unlock the old relation (and its clustering index), but then we will
+ * lock the relation and *all* its indexes because we want to swap their
+ * storage.
+ *
+ * (NewHeap is already locked, as well as its indexes.)
+ */
+ rri = palloc_array(RelReopenInfo, 1 + list_length(ind_oids_old));
+ init_rel_reopen_info(&rri[nrel++], &OldHeap, InvalidOid,
+ LOCK_CLUSTER_CONCURRENT, AccessExclusiveLock);
+ /* References to the re-opened indexes will be stored in this array. */
+ ind_refs_all = palloc_array(Relation, list_length(ind_oids_old));
+ ind_refs_p = ind_refs_all;
+ /* The clustering index is a special case. */
+ if (cl_index)
+ {
+ *ind_refs_p = cl_index;
+ init_rel_reopen_info(&rri[nrel], ind_refs_p, InvalidOid,
+ LOCK_CLUSTER_CONCURRENT, AccessExclusiveLock);
+ nrel++;
+ ind_refs_p++;
+ }
+ /*
+ * Initialize also the entries for the other indexes (currently unlocked)
+ * because we will have to lock them.
+ */
+ foreach(lc, ind_oids_old)
+ {
+ Oid ind_oid;
+
+ ind_oid = lfirst_oid(lc);
+ /* Clustering index is already in the array, or there is none. */
+ if (cl_index && RelationGetRelid(cl_index) == ind_oid)
+ continue;
+
+ Assert(nrel < (1 + list_length(ind_oids_old)));
+
+ *ind_refs_p = NULL;
+ init_rel_reopen_info(&rri[nrel],
+ /*
+ * In this special case we do not have a
+ * relcache reference; use the OID instead.
+ */
+ ind_refs_p,
+ ind_oid,
+ NoLock, /* Nothing to unlock. */
+ AccessExclusiveLock);
+
+ nrel++;
+ ind_refs_p++;
+ }
+ /* Perform the actual unlocking and re-locking. */
+ unlock_and_close_relations(rri, nrel);
+ reopen_relations(rri, nrel);
+
+ /*
+ * In addition, lock the OldHeap's TOAST relation that we skipped for the
+ * CONCURRENTLY option in copy_table_data(). This lock will be needed to
+ * swap the relation files.
+ */
+ if (OidIsValid(OldHeap->rd_rel->reltoastrelid))
+ LockRelationOid(OldHeap->rd_rel->reltoastrelid, AccessExclusiveLock);
+
+ /*
+ * Check if the new indexes match the old ones, i.e. no changes occurred
+ * while OldHeap was unlocked.
+ *
+ * XXX It's probably not necessary to check the relation tuple descriptor
+ * here because the logical decoding was already active when we released
+ * the lock, and thus the corresponding data changes won't be lost.
+ * However, processing those changes might take a lot of time.
+ */
+ check_catalog_changes(OldHeap, cat_state);
+
+ /*
+ * Tuples and pages of the old heap will be gone, but the heap will stay.
+ */
+ TransferPredicateLocksToHeapRelation(OldHeap);
+ /* The same for indexes. */
+ for (int i = 0; i < (nrel - 1); i++)
+ {
+ Relation index = ind_refs_all[i];
+
+ TransferPredicateLocksToHeapRelation(index);
+
+ /*
+ * References to indexes on the old relation are not needed anymore,
+ * however locks stay till the end of the transaction.
+ */
+ index_close(index, NoLock);
+ }
+ pfree(ind_refs_all);
+
+ /*
+ * Flush anything we see in WAL, to make sure that all changes committed
+ * while we were waiting for the exclusive lock are available for
+ * decoding. This should not be necessary if all backends had
+ * synchronous_commit set, but we can't rely on this setting.
+ *
+ * Unfortunately, GetInsertRecPtr() may lag behind the actual insert
+ * position, and GetLastImportantRecPtr() points at the start of the last
+ * record rather than at the end. Thus the simplest way to determine the
+ * insert position is to insert a dummy record and use its LSN.
+ *
+ * XXX Consider using GetLastImportantRecPtr() and adding the size of the
+ * last record (plus the total size of all the page headers the record
+ * spans)?
+ */
+ XLogBeginInsert();
+ XLogRegisterData(&dummy_rec_data, 1);
+ wal_insert_ptr = XLogInsert(RM_XLOG_ID, XLOG_NOOP);
+ XLogFlush(wal_insert_ptr);
+ end_of_wal = GetFlushRecPtr(NULL);
+
+ /* Apply the concurrent changes again. */
+ process_concurrent_changes(ctx, end_of_wal, NewHeap,
+ swap_toast_by_content ? OldHeap : NULL,
+ ident_key, ident_key_nentries, iistate);
+
+ /* Remember info about rel before closing OldHeap */
+ relpersistence = OldHeap->rd_rel->relpersistence;
+ is_system_catalog = IsSystemRelation(OldHeap);
+
+ pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
+ PROGRESS_CLUSTER_PHASE_SWAP_REL_FILES);
+
+ forboth(lc, ind_oids_old, lc2, ind_oids_new)
+ {
+ Oid ind_old = lfirst_oid(lc);
+ Oid ind_new = lfirst_oid(lc2);
+ Oid mapped_tables[4];
+
+ /* Zero out possible results from swap_relation_files */
+ memset(mapped_tables, 0, sizeof(mapped_tables));
+
+ swap_relation_files(ind_old, ind_new,
+ (old_table_oid == RelationRelationId),
+ swap_toast_by_content,
+ true,
+ InvalidTransactionId,
+ InvalidMultiXactId,
+ mapped_tables);
+
+#ifdef USE_ASSERT_CHECKING
+ /*
+ * Concurrent processing is not supported for system relations, so
+ * there should be no mapped tables.
+ */
+ for (int i = 0; i < 4; i++)
+ Assert(mapped_tables[i] == 0);
+#endif
+ }
+
+ /* The new indexes must be visible for deletion. */
+ CommandCounterIncrement();
+
+ /* Close the old heap but keep lock until transaction commit. */
+ table_close(OldHeap, NoLock);
+ /* Close the new heap. (We didn't have to open its indexes.) */
+ table_close(NewHeap, NoLock);
+
+ /* Cleanup what we don't need anymore. (And close the identity index.) */
+ pfree(ident_key);
+ free_index_insert_state(iistate);
+
+ /*
+ * Swap the relations and their TOAST relations and TOAST indexes. This
+ * also drops the new relation and its indexes.
+ *
+ * (System catalogs are currently not supported.)
+ */
+ Assert(!is_system_catalog);
+ finish_heap_swap(old_table_oid, new_table_oid,
+ is_system_catalog,
+ swap_toast_by_content,
+ false, true, false,
+ frozenXid, cutoffMulti,
+ relpersistence);
+
+ pfree(rri);
+}
+
+/*
+ * Build indexes on NewHeap according to those on OldHeap.
+ *
+ * OldIndexes is the list of index OIDs on OldHeap.
+ *
+ * A list of OIDs of the corresponding indexes created on NewHeap is
+ * returned. The order of items matches OldIndexes, so the two lists can
+ * be used to swap the index storage.
+ */
+static List *
+build_new_indexes(Relation NewHeap, Relation OldHeap, List *OldIndexes)
+{
+ StringInfo ind_name;
+ ListCell *lc;
+ List *result = NIL;
+
+ pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
+ PROGRESS_CLUSTER_PHASE_REBUILD_INDEX);
+
+ ind_name = makeStringInfo();
+
+ foreach(lc, OldIndexes)
+ {
+ Oid ind_oid,
+ ind_oid_new,
+ tbsp_oid;
+ Relation ind;
+ IndexInfo *ind_info;
+ int i,
+ heap_col_id;
+ List *colnames;
+ int16 indnatts;
+ Oid *collations,
+ *opclasses;
+ HeapTuple tup;
+ bool isnull;
+ Datum d;
+ oidvector *oidvec;
+ int2vector *int2vec;
+ size_t oid_arr_size;
+ size_t int2_arr_size;
+ int16 *indoptions;
+ text *reloptions = NULL;
+ bits16 flags;
+ Datum *opclassOptions;
+ NullableDatum *stattargets;
+
+ ind_oid = lfirst_oid(lc);
+ ind = index_open(ind_oid, AccessShareLock);
+ ind_info = BuildIndexInfo(ind);
+
+ tbsp_oid = ind->rd_rel->reltablespace;
+ /*
+ * Index names really don't matter here, as we'll eventually use only
+ * their storage. Just make them unique within the table.
+ */
+ resetStringInfo(ind_name);
+ appendStringInfo(ind_name, "ind_%d",
+ list_cell_number(OldIndexes, lc));
+
+ flags = 0;
+ if (ind->rd_index->indisprimary)
+ flags |= INDEX_CREATE_IS_PRIMARY;
+
+ colnames = NIL;
+ indnatts = ind->rd_index->indnatts;
+ oid_arr_size = sizeof(Oid) * indnatts;
+ int2_arr_size = sizeof(int16) * indnatts;
+
+ collations = (Oid *) palloc(oid_arr_size);
+ for (i = 0; i < indnatts; i++)
+ {
+ char *colname;
+
+ heap_col_id = ind->rd_index->indkey.values[i];
+ if (heap_col_id > 0)
+ {
+ Form_pg_attribute att;
+
+ /* Normal attribute. */
+ att = TupleDescAttr(OldHeap->rd_att, heap_col_id - 1);
+ colname = pstrdup(NameStr(att->attname));
+ collations[i] = att->attcollation;
+ }
+ else if (heap_col_id == 0)
+ {
+ HeapTuple tuple;
+ Form_pg_attribute att;
+
+ /*
+ * An expression column is not present in the heap's relcache
+ * entry. What we need here is an attribute of the *index*
+ * relation.
+ */
+ tuple = SearchSysCache2(ATTNUM,
+ ObjectIdGetDatum(ind_oid),
+ Int16GetDatum(i + 1));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR,
+ "cache lookup failed for attribute %d of relation %u",
+ i + 1, ind_oid);
+ att = (Form_pg_attribute) GETSTRUCT(tuple);
+ colname = pstrdup(NameStr(att->attname));
+ collations[i] = att->attcollation;
+ ReleaseSysCache(tuple);
+ }
+ else
+ elog(ERROR, "unexpected column number: %d",
+ heap_col_id);
+
+ colnames = lappend(colnames, colname);
+ }
+
+ /*
+ * Special effort needed for variable length attributes of
+ * Form_pg_index.
+ */
+ tup = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(ind_oid));
+ if (!HeapTupleIsValid(tup))
+ elog(ERROR, "cache lookup failed for index %u", ind_oid);
+ d = SysCacheGetAttr(INDEXRELID, tup, Anum_pg_index_indclass, &isnull);
+ Assert(!isnull);
+ oidvec = (oidvector *) DatumGetPointer(d);
+ opclasses = (Oid *) palloc(oid_arr_size);
+ memcpy(opclasses, oidvec->values, oid_arr_size);
+
+ d = SysCacheGetAttr(INDEXRELID, tup, Anum_pg_index_indoption,
+ &isnull);
+ Assert(!isnull);
+ int2vec = (int2vector *) DatumGetPointer(d);
+ indoptions = (int16 *) palloc(int2_arr_size);
+ memcpy(indoptions, int2vec->values, int2_arr_size);
+ ReleaseSysCache(tup);
+
+ tup = SearchSysCache1(RELOID, ObjectIdGetDatum(ind_oid));
+ if (!HeapTupleIsValid(tup))
+ elog(ERROR, "cache lookup failed for index relation %u", ind_oid);
+ d = SysCacheGetAttr(RELOID, tup, Anum_pg_class_reloptions, &isnull);
+ reloptions = !isnull ? DatumGetTextPCopy(d) : NULL;
+ ReleaseSysCache(tup);
+
+ opclassOptions = palloc0(sizeof(Datum) * ind_info->ii_NumIndexAttrs);
+ for (i = 0; i < ind_info->ii_NumIndexAttrs; i++)
+ opclassOptions[i] = get_attoptions(ind_oid, i + 1);
+
+ stattargets = get_index_stattargets(ind_oid, ind_info);
+
+ /*
+ * Neither parentIndexRelid nor parentConstraintId needs to be passed
+ * since the new catalog entries (pg_constraint, pg_inherits) would
+ * eventually be dropped. Therefore there's no need to record a valid
+ * dependency on the parents.
+ */
+ ind_oid_new = index_create(NewHeap,
+ ind_name->data,
+ InvalidOid,
+ InvalidOid, /* parentIndexRelid */
+ InvalidOid, /* parentConstraintId */
+ InvalidOid,
+ ind_info,
+ colnames,
+ ind->rd_rel->relam,
+ tbsp_oid,
+ collations,
+ opclasses,
+ opclassOptions,
+ indoptions,
+ stattargets,
+ PointerGetDatum(reloptions),
+ flags, /* flags */
+ 0, /* constr_flags */
+ false, /* allow_system_table_mods */
+ false, /* is_internal */
+ NULL /* constraintId */
+ );
+ result = lappend_oid(result, ind_oid_new);
+
+ index_close(ind, AccessShareLock);
+ list_free_deep(colnames);
+ pfree(collations);
+ pfree(opclasses);
+ pfree(indoptions);
+ if (reloptions)
+ pfree(reloptions);
+ }
+
+ return result;
+}
+
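The `heap_col_id` branches earlier in this function rely on the pg_index.indkey convention: a positive value is a table attribute number, while 0 marks an expression column that exists only in the index. A minimal standalone sketch of that classification (plain C; the function and strings are illustrative, not the backend API):

```c
#include <assert.h>
#include <string.h>

/*
 * Classify an indkey entry the way the loop above does: positive values
 * reference a heap column, 0 marks an index expression, anything else is
 * corrupt. Simplified illustration only.
 */
static const char *
classify_indkey(int heap_col_id)
{
	if (heap_col_id > 0)
		return "heap column";	/* look up in the table's tuple descriptor */
	else if (heap_col_id == 0)
		return "expression";	/* look up in the index relation instead */
	else
		return "invalid";		/* the real loop raises ERROR here */
}
```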
+static void
+init_rel_reopen_info(RelReopenInfo *rri, Relation *rel_p, Oid relid,
+ LOCKMODE lockmode_orig, LOCKMODE lockmode_new)
+{
+ rri->rel_p = rel_p;
+ rri->relid = relid;
+ rri->lockmode_orig = lockmode_orig;
+ rri->lockmode_new = lockmode_new;
+}
+
+/*
+ * Unlock and close the relations specified by the items of the 'rels'
+ * array. 'nrel' is the number of items.
+ *
+ * Information needed to (re)open the relations (or to issue a meaningful ERROR)
+ * is added to the array items.
+ */
+static void
+unlock_and_close_relations(RelReopenInfo *rels, int nrel)
+{
+ int i;
+ RelReopenInfo *rri;
+
+ /*
+ * First, retrieve the information that we will need for re-opening.
+ *
+ * We could close (and unlock) each relation as soon as we have gathered
+ * the related information, but then we would have to be careful not to
+ * unlock the table until we have the info on all its indexes. (Once we
+ * unlock the table, any index can be dropped, and thus we can fail to get
+ * the name we want to report if re-opening fails.) It seems simpler to
+ * separate the work into two iterations.
+ */
+ for (i = 0; i < nrel; i++)
+ {
+ Relation rel;
+
+ rri = &rels[i];
+ rel = *rri->rel_p;
+
+ if (rel)
+ {
+ Assert(CheckRelationLockedByMe(rel, rri->lockmode_orig, false));
+ Assert(!OidIsValid(rri->relid));
+
+ rri->relid = RelationGetRelid(rel);
+ rri->relkind = rel->rd_rel->relkind;
+ rri->relname = pstrdup(RelationGetRelationName(rel));
+ }
+ else
+ {
+ Assert(OidIsValid(rri->relid));
+
+ rri->relname = get_rel_name(rri->relid);
+ rri->relkind = get_rel_relkind(rri->relid);
+ }
+ }
+
+ /* Second, close the relations. */
+ for (i = 0; i < nrel; i++)
+ {
+ Relation rel;
+
+ rri = &rels[i];
+ rel = *rri->rel_p;
+
+ /* Close the relation if the caller passed one. */
+ if (rel)
+ {
+ if (rri->relkind == RELKIND_RELATION)
+ table_close(rel, rri->lockmode_orig);
+ else
+ {
+ Assert(rri->relkind == RELKIND_INDEX);
+
+ index_close(rel, rri->lockmode_orig);
+ }
+ }
+ }
+}
+
+/*
+ * Re-open the relations closed previously by unlock_and_close_relations().
+ */
+static void
+reopen_relations(RelReopenInfo *rels, int nrel)
+{
+ for (int i = 0; i < nrel; i++)
+ {
+ RelReopenInfo *rri = &rels[i];
+ Relation rel;
+
+ if (rri->relkind == RELKIND_RELATION)
+ {
+ rel = try_table_open(rri->relid, rri->lockmode_new);
+ }
+ else
+ {
+ Assert(rri->relkind == RELKIND_INDEX);
+
+ rel = try_index_open(rri->relid, rri->lockmode_new);
+ }
+
+ if (rel == NULL)
+ {
+ const char *kind_str;
+
+ kind_str = (rri->relkind == RELKIND_RELATION) ? "table" : "index";
+ ereport(ERROR,
+ (errmsg("could not open %s \"%s\"", kind_str,
+ rri->relname),
+ errhint("The %s could have been dropped by another transaction.",
+ kind_str)));
+ }
+ *rri->rel_p = rel;
+
+ pfree(rri->relname);
+ }
+}
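The unlock_and_close_relations()/reopen_relations() pair above deliberately captures relation names before unlocking, so that a later reopen failure can still be reported with a meaningful name. A standalone model of that two-phase pattern (the struct, the fake catalog, and the function names are illustrative stand-ins, not the backend API):

```c
/*
 * Simplified model of the unlock/close + reopen pattern. "Relations" are
 * entries in a fake catalog; dropping one between the close and the reopen
 * makes the reopen fail, which is exactly the case the pattern must report
 * with a name captured while the lock was still held.
 */
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_RELS 8

static int fake_catalog[MAX_RELS];	/* 1 = relation exists */

typedef struct ReopenInfo
{
	int		relid;
	char	relname[32];	/* captured BEFORE closing, for error reports */
	int		open;
} ReopenInfo;

static void
close_all(ReopenInfo *rels, int nrel)
{
	/* First pass: capture names while every relation is still "locked". */
	for (int i = 0; i < nrel; i++)
		snprintf(rels[i].relname, sizeof(rels[i].relname), "rel_%d",
				 rels[i].relid);
	/* Second pass: actually close. */
	for (int i = 0; i < nrel; i++)
		rels[i].open = 0;
}

/* Returns the index of the first relation that vanished, or -1 on success. */
static int
reopen_all(ReopenInfo *rels, int nrel)
{
	for (int i = 0; i < nrel; i++)
	{
		if (!fake_catalog[rels[i].relid])
			return i;			/* caller reports rels[i].relname */
		rels[i].open = 1;
	}
	return -1;
}
```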
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 7da6647f8f..6143f854eb 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -906,7 +906,7 @@ refresh_by_match_merge(Oid matviewOid, Oid tempOid, Oid relowner,
static void
refresh_by_heap_swap(Oid matviewOid, Oid OIDNewHeap, char relpersistence)
{
- finish_heap_swap(matviewOid, OIDNewHeap, false, false, true, true,
+ finish_heap_swap(matviewOid, OIDNewHeap, false, false, true, true, true,
RecentXmin, ReadNextMultiXactId(), relpersistence);
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 2b20b03224..2e981b604a 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4391,6 +4391,16 @@ AlterTableInternal(Oid relid, List *cmds, bool recurse)
rel = relation_open(relid, lockmode);
+ /*
+ * If lockmode allows, check if VACUUM FULL / CLUSTER CONCURRENTLY is in
+ * progress. If lockmode is too weak, cluster_rel() should detect
+ * incompatible DDLs executed by us.
+ *
+ * XXX We might skip the changes for DDLs which do not change the tuple
+ * descriptor.
+ */
+ check_for_concurrent_cluster(relid, lockmode);
+
EventTriggerAlterTableRelid(relid);
ATController(NULL, rel, cmds, recurse, lockmode, NULL);
@@ -5803,6 +5813,7 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
finish_heap_swap(tab->relid, OIDNewHeap,
false, false, true,
!OidIsValid(tab->newTableSpace),
+ true,
RecentXmin,
ReadNextMultiXactId(),
persistence);
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index d32068b5d5..359fbabd5d 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -112,7 +112,8 @@ static void vac_truncate_clog(TransactionId frozenXID,
TransactionId lastSaneFrozenXid,
MultiXactId lastSaneMinMulti);
static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
- BufferAccessStrategy bstrategy);
+ BufferAccessStrategy bstrategy, bool isTopLevel,
+ bool whole_database);
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
@@ -153,6 +154,7 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
bool analyze = false;
bool freeze = false;
bool full = false;
+ bool concurrent = false;
bool disable_page_skipping = false;
bool process_main = true;
bool process_toast = true;
@@ -226,6 +228,8 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
freeze = defGetBoolean(opt);
else if (strcmp(opt->defname, "full") == 0)
full = defGetBoolean(opt);
+ else if (strcmp(opt->defname, "concurrently") == 0)
+ concurrent = defGetBoolean(opt);
else if (strcmp(opt->defname, "disable_page_skipping") == 0)
disable_page_skipping = defGetBoolean(opt);
else if (strcmp(opt->defname, "index_cleanup") == 0)
@@ -300,7 +304,7 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
(skip_locked ? VACOPT_SKIP_LOCKED : 0) |
(analyze ? VACOPT_ANALYZE : 0) |
(freeze ? VACOPT_FREEZE : 0) |
- (full ? VACOPT_FULL : 0) |
+ (full ? (concurrent ? VACOPT_FULL_CONCURRENT : VACOPT_FULL_EXCLUSIVE) : 0) |
(disable_page_skipping ? VACOPT_DISABLE_PAGE_SKIPPING : 0) |
(process_main ? VACOPT_PROCESS_MAIN : 0) |
(process_toast ? VACOPT_PROCESS_TOAST : 0) |
@@ -380,6 +384,12 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
errmsg("ONLY_DATABASE_STATS cannot be specified with other VACUUM options")));
}
+ /*
+ * This combination cannot be detected from the option bitmask alone,
+ * because CONCURRENTLY is only recorded there when FULL is also given.
+ */
+ if (concurrent && !full)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("CONCURRENTLY can only be specified with VACUUM FULL")));
+
/*
* All freeze ages are zero if the FREEZE option is given; otherwise pass
* them as -1 which means to use the default values.
@@ -483,6 +493,7 @@ vacuum(List *relations, VacuumParams *params, BufferAccessStrategy bstrategy,
const char *stmttype;
volatile bool in_outer_xact,
use_own_xacts;
+ bool whole_database = false;
Assert(params != NULL);
@@ -543,7 +554,15 @@ vacuum(List *relations, VacuumParams *params, BufferAccessStrategy bstrategy,
relations = newrels;
}
else
+ {
relations = get_all_vacuum_rels(vac_context, params->options);
+ /*
+ * If all tables should be processed, the CONCURRENTLY option implies
+ * that we should skip system relations rather than raising ERRORs.
+ */
+ if (params->options & VACOPT_FULL_CONCURRENT)
+ whole_database = true;
+ }
/*
* Decide whether we need to start/commit our own transactions.
@@ -616,7 +635,8 @@ vacuum(List *relations, VacuumParams *params, BufferAccessStrategy bstrategy,
if (params->options & VACOPT_VACUUM)
{
- if (!vacuum_rel(vrel->oid, vrel->relation, params, bstrategy))
+ if (!vacuum_rel(vrel->oid, vrel->relation, params, bstrategy,
+ isTopLevel, whole_database))
continue;
}
@@ -1929,10 +1949,14 @@ vac_truncate_clog(TransactionId frozenXID,
/*
* vacuum_rel() -- vacuum one heap relation
*
- * relid identifies the relation to vacuum. If relation is supplied,
- * use the name therein for reporting any failure to open/lock the rel;
- * do not use it once we've successfully opened the rel, since it might
- * be stale.
+ * relid identifies the relation to vacuum. If relation is supplied, use
+ * the name therein for reporting any failure to open/lock the rel; do
+ * not use it once we've successfully opened the rel, since it might be
+ * stale.
+ *
+ * If whole_database is true, we are processing all the relations of the
+ * current database. In that case we might need to silently skip
+ * relations which would otherwise cause an ERROR.
*
* Returns true if it's okay to proceed with a requested ANALYZE
* operation on this table.
@@ -1947,7 +1971,8 @@ vac_truncate_clog(TransactionId frozenXID,
*/
static bool
vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
- BufferAccessStrategy bstrategy)
+ BufferAccessStrategy bstrategy, bool isTopLevel,
+ bool whole_database)
{
LOCKMODE lmode;
Relation rel;
@@ -2010,10 +2035,11 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
/*
* Determine the type of lock we want --- hard exclusive lock for a FULL
- * vacuum, but just ShareUpdateExclusiveLock for concurrent vacuum. Either
- * way, we can be sure that no other backend is vacuuming the same table.
+ * exclusive vacuum, but a weaker lock (ShareUpdateExclusiveLock) for
+ * concurrent vacuum. Either way, we can be sure that no other backend is
+ * vacuuming the same table.
*/
- lmode = (params->options & VACOPT_FULL) ?
+ lmode = (params->options & VACOPT_FULL_EXCLUSIVE) ?
AccessExclusiveLock : ShareUpdateExclusiveLock;
/* open the relation and get the appropriate lock on it */
@@ -2028,6 +2054,39 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
return false;
}
+ /*
+ * Leave if the CONCURRENTLY option was passed, but the relation is not
+ * suitable for that. Note that we only skip such relations if the user
+ * wants to vacuum the whole database. In contrast, if the user specified
+ * inappropriate relation(s) explicitly, the command will fail with an
+ * ERROR.
+ */
+ if (whole_database && (params->options & VACOPT_FULL_CONCURRENT) &&
+ !check_relation_is_clusterable_concurrently(rel, DEBUG1,
+ "VACUUM (FULL, CONCURRENTLY)"))
+ {
+ relation_close(rel, lmode);
+ PopActiveSnapshot();
+ CommitTransactionCommand();
+ return false;
+ }
+
+ /*
+ * Skip the relation if VACUUM FULL / CLUSTER CONCURRENTLY is in progress
+ * as it will drop the current storage of the relation.
+ *
+ * This check should not take place until we have a lock that prevents
+ * another backend from starting VACUUM FULL / CLUSTER CONCURRENTLY later.
+ */
+ Assert(lmode >= LOCK_CLUSTER_CONCURRENT);
+ if (is_concurrent_cluster_in_progress(relid))
+ {
+ relation_close(rel, lmode);
+ PopActiveSnapshot();
+ CommitTransactionCommand();
+ return false;
+ }
+
/*
* When recursing to a TOAST table, check privileges on the parent. NB:
* This is only safe to do because we hold a session lock on the main
@@ -2101,19 +2160,6 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
return true;
}
- /*
- * Get a session-level lock too. This will protect our access to the
- * relation across multiple transactions, so that we can vacuum the
- * relation's TOAST table (if any) secure in the knowledge that no one is
- * deleting the parent relation.
- *
- * NOTE: this cannot block, even if someone else is waiting for access,
- * because the lock manager knows that both lock requests are from the
- * same process.
- */
- lockrelid = rel->rd_lockInfo.lockRelId;
- LockRelationIdForSession(&lockrelid, lmode);
-
/*
* Set index_cleanup option based on index_cleanup reloption if it wasn't
* specified in VACUUM command, or when running in an autovacuum worker
@@ -2166,6 +2212,30 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
else
toast_relid = InvalidOid;
+ /*
+ * Get a session-level lock too. This will protect our access to the
+ * relation across multiple transactions, so that we can vacuum the
+ * relation's TOAST table (if any) secure in the knowledge that no one is
+ * deleting the parent relation.
+ *
+ * NOTE: this cannot block, even if someone else is waiting for access,
+ * because the lock manager knows that both lock requests are from the
+ * same process.
+ */
+ if (OidIsValid(toast_relid))
+ {
+ /*
+ * You might worry that, in the VACUUM (FULL, CONCURRENTLY) case,
+ * cluster_rel() needs to release all the locks on the relation at
+ * some point, but this session lock makes it impossible. In fact,
+ * cluster_rel() will eventually be called for the TOAST relation
+ * and raise ERROR because, in the concurrent mode, it cannot process
+ * TOAST relation alone anyway.
+ */
+ lockrelid = rel->rd_lockInfo.lockRelId;
+ LockRelationIdForSession(&lockrelid, lmode);
+ }
+
/*
* Switch to the table owner's userid, so that any index functions are run
* as that user. Also lock down security-restricted operations and
@@ -2193,11 +2263,22 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
{
ClusterParams cluster_params = {0};
+ /*
+ * Invalid toast_relid means that there is no session lock on the
+ * relation. Such a lock would be a problem because it would
+ * prevent cluster_rel() from releasing all locks when it tries to
+ * get AccessExclusiveLock.
+ */
+ Assert(!OidIsValid(toast_relid));
+
if ((params->options & VACOPT_VERBOSE) != 0)
cluster_params.options |= CLUOPT_VERBOSE;
+ if ((params->options & VACOPT_FULL_CONCURRENT) != 0)
+ cluster_params.options |= CLUOPT_CONCURRENT;
+
/* VACUUM FULL is now a variant of CLUSTER; see cluster.c */
- cluster_rel(rel, InvalidOid, &cluster_params);
+ cluster_rel(rel, InvalidOid, &cluster_params, isTopLevel);
/*
* cluster_rel() should have closed the relation, lock is kept
@@ -2246,13 +2327,15 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
toast_vacuum_params.options |= VACOPT_PROCESS_MAIN;
toast_vacuum_params.toast_parent = relid;
- vacuum_rel(toast_relid, NULL, &toast_vacuum_params, bstrategy);
+ vacuum_rel(toast_relid, NULL, &toast_vacuum_params, bstrategy,
+ isTopLevel, whole_database);
}
/*
* Now release the session-level lock on the main table.
*/
- UnlockRelationIdForSession(&lockrelid, lmode);
+ if (OidIsValid(toast_relid))
+ UnlockRelationIdForSession(&lockrelid, lmode);
/* Report that we really did it. */
return true;
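The lock-mode decision in vacuum_rel() above is the crux of the patch: only the exclusive flavour of FULL takes AccessExclusiveLock, while plain VACUUM and VACUUM (FULL, CONCURRENTLY) both run under ShareUpdateExclusiveLock, which still excludes concurrent vacuums of the same table. A standalone sketch of that decision (the flag and lock-mode values here are made-up stand-ins; only the shape mirrors the patch):

```c
#include <assert.h>

/* Illustrative values, not the backend's actual definitions. */
#define VACOPT_FULL_EXCLUSIVE	0x01
#define VACOPT_FULL_CONCURRENT	0x02

#define ShareUpdateExclusiveLock	4
#define AccessExclusiveLock			8

/*
 * Pick the heavyweight lock mode for vacuuming one relation.  Only the
 * exclusive form of FULL needs AccessExclusiveLock; the concurrent form
 * deliberately stays at ShareUpdateExclusiveLock so readers and writers
 * are not blocked during the bulk of the work.
 */
static int
vacuum_lock_mode(int options)
{
	return (options & VACOPT_FULL_EXCLUSIVE) ?
		AccessExclusiveLock : ShareUpdateExclusiveLock;
}
```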
diff --git a/src/backend/meson.build b/src/backend/meson.build
index 78c5726814..0f9141a4ac 100644
--- a/src/backend/meson.build
+++ b/src/backend/meson.build
@@ -194,5 +194,6 @@ pg_test_mod_args = pg_mod_args + {
subdir('jit/llvm')
subdir('replication/libpqwalreceiver')
subdir('replication/pgoutput')
+subdir('replication/pgoutput_cluster')
subdir('snowball')
subdir('utils/mb/conversion_procs')
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index d687ceee33..752deb39f7 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -33,6 +33,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecord.h"
#include "catalog/pg_control.h"
+#include "commands/cluster.h"
#include "replication/decode.h"
#include "replication/logical.h"
#include "replication/message.h"
@@ -467,6 +468,29 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
TransactionId xid = XLogRecGetXid(buf->record);
SnapBuild *builder = ctx->snapshot_builder;
+ /*
+ * Check if CLUSTER CONCURRENTLY is being performed by this backend. If
+ * so, only decode data changes of the table that it is processing, and
+ * the changes of its TOAST relation.
+ *
+ * (The TOAST locator should not be set unless the main one is.)
+ */
+ Assert(!OidIsValid(clustered_rel_toast_locator.relNumber) ||
+ OidIsValid(clustered_rel_locator.relNumber));
+
+ if (OidIsValid(clustered_rel_locator.relNumber))
+ {
+ XLogReaderState *r = buf->record;
+ RelFileLocator locator;
+
+ /* Not all records contain the block. */
+ if (XLogRecGetBlockTagExtended(r, 0, &locator, NULL, NULL, NULL) &&
+ !RelFileLocatorEquals(locator, clustered_rel_locator) &&
+ (!OidIsValid(clustered_rel_toast_locator.relNumber) ||
+ !RelFileLocatorEquals(locator, clustered_rel_toast_locator)))
+ return;
+ }
+
ReorderBufferProcessXid(ctx->reorder, xid, buf->origptr);
/*
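The filter added to heap_decode() above skips records that carry a block reference but touch neither the clustered relation nor its TOAST relation. A self-contained model of that predicate (simplified locator struct; `rel == 0` stands in for the InvalidOid checks, and all names are illustrative):

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for RelFileLocator. */
typedef struct Locator
{
	unsigned	db;
	unsigned	rel;
} Locator;

static bool
locator_equals(Locator a, Locator b)
{
	return a.db == b.db && a.rel == b.rel;
}

/*
 * Decide whether a decoded record should be processed.  'clustered' and
 * 'toast' model clustered_rel_locator / clustered_rel_toast_locator;
 * rel == 0 means "not set".  Records without a block reference are never
 * filtered, mirroring the XLogRecGetBlockTagExtended() check above.
 */
static bool
should_decode(Locator clustered, Locator toast, bool has_block, Locator block)
{
	if (clustered.rel == 0)
		return true;			/* no CLUSTER CONCURRENTLY in this backend */
	if (!has_block)
		return true;			/* not all records contain the block */
	if (locator_equals(block, clustered))
		return true;			/* the table being clustered */
	if (toast.rel != 0 && locator_equals(block, toast))
		return true;			/* its TOAST relation */
	return false;
}
```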
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 4923e35e92..4492e2ae46 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -625,6 +625,26 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
return SnapBuildMVCCFromHistoric(snap, true);
}
+/*
+ * Build an MVCC snapshot for the initial data load performed by CLUSTER
+ * CONCURRENTLY command.
+ *
+ * The snapshot will only be used to scan one particular relation, which is
+ * treated like a catalog (therefore ->building_full_snapshot is not
+ * important), and the caller should already have a replication slot setup (so
+ * we do not set MyProc->xmin). XXX Do we need to add some restrictions here?
+ */
+Snapshot
+SnapBuildInitialSnapshotForCluster(SnapBuild *builder)
+{
+ Snapshot snap;
+
+ Assert(builder->state == SNAPBUILD_CONSISTENT);
+
+ snap = SnapBuildBuildSnapshot(builder);
+ return SnapBuildMVCCFromHistoric(snap, false);
+}
+
/*
* Turn a historic MVCC snapshot into an ordinary MVCC snapshot.
*
diff --git a/src/backend/replication/pgoutput_cluster/Makefile b/src/backend/replication/pgoutput_cluster/Makefile
new file mode 100644
index 0000000000..31471bb546
--- /dev/null
+++ b/src/backend/replication/pgoutput_cluster/Makefile
@@ -0,0 +1,32 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+# Makefile for src/backend/replication/pgoutput_cluster
+#
+# IDENTIFICATION
+# src/backend/replication/pgoutput_cluster
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/replication/pgoutput_cluster
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+ $(WIN32RES) \
+ pgoutput_cluster.o
+PGFILEDESC = "pgoutput_cluster - logical replication output plugin for CLUSTER command"
+NAME = pgoutput_cluster
+
+all: all-shared-lib
+
+include $(top_srcdir)/src/Makefile.shlib
+
+install: all installdirs install-lib
+
+installdirs: installdirs-lib
+
+uninstall: uninstall-lib
+
+clean distclean: clean-lib
+ rm -f $(OBJS)
diff --git a/src/backend/replication/pgoutput_cluster/meson.build b/src/backend/replication/pgoutput_cluster/meson.build
new file mode 100644
index 0000000000..0f033064f2
--- /dev/null
+++ b/src/backend/replication/pgoutput_cluster/meson.build
@@ -0,0 +1,18 @@
+# Copyright (c) 2022-2024, PostgreSQL Global Development Group
+
+pgoutput_cluster_sources = files(
+ 'pgoutput_cluster.c',
+)
+
+if host_system == 'windows'
+ pgoutput_cluster_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pgoutput_cluster',
+ '--FILEDESC', 'pgoutput_cluster - logical replication output plugin for CLUSTER command',])
+endif
+
+pgoutput_cluster = shared_module('pgoutput_cluster',
+ pgoutput_cluster_sources,
+ kwargs: pg_mod_args,
+)
+
+backend_targets += pgoutput_cluster
diff --git a/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c b/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
new file mode 100644
index 0000000000..c6baca1171
--- /dev/null
+++ b/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
@@ -0,0 +1,277 @@
+/* TODO Move into src/backend/cluster/ (and rename?) */
+/*-------------------------------------------------------------------------
+ *
+ * pgoutput_cluster.c
+ * Logical Replication output plugin for CLUSTER command
+ *
+ * Copyright (c) 2012-2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heaptoast.h"
+#include "commands/cluster.h"
+#include "replication/snapbuild.h"
+
+PG_MODULE_MAGIC;
+
+static void plugin_startup(LogicalDecodingContext *ctx,
+ OutputPluginOptions *opt, bool is_init);
+static void plugin_shutdown(LogicalDecodingContext *ctx);
+static void plugin_begin_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn);
+static void plugin_commit_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ Relation rel, ReorderBufferChange *change);
+static void plugin_truncate(struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, int nrelations,
+ Relation relations[],
+ ReorderBufferChange *change);
+static void store_change(LogicalDecodingContext *ctx,
+ ConcurrentChangeKind kind, HeapTuple tuple);
+
+void
+_PG_output_plugin_init(OutputPluginCallbacks *cb)
+{
+ AssertVariableIsOfType(&_PG_output_plugin_init, LogicalOutputPluginInit);
+
+ cb->startup_cb = plugin_startup;
+ cb->begin_cb = plugin_begin_txn;
+ cb->change_cb = plugin_change;
+ cb->truncate_cb = plugin_truncate;
+ cb->commit_cb = plugin_commit_txn;
+ cb->shutdown_cb = plugin_shutdown;
+}
+
+
+/* initialize this plugin */
+static void
+plugin_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
+ bool is_init)
+{
+ ctx->output_plugin_private = NULL;
+
+ /* Probably unnecessary, as we don't use the SQL interface ... */
+ opt->output_type = OUTPUT_PLUGIN_BINARY_OUTPUT;
+
+ if (ctx->output_plugin_options != NIL)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("this plugin does not expect any options")));
+ }
+}
+
+static void
+plugin_shutdown(LogicalDecodingContext *ctx)
+{
+}
+
+/*
+ * As we don't release the slot while processing a particular table, there's
+ * no room for the SQL interface, even for debugging purposes. Therefore we need
+ * neither OutputPluginPrepareWrite() nor OutputPluginWrite() in the plugin
+ * callbacks. (Although we might want to write custom callbacks, this API
+ * seems to be unnecessarily generic for our purposes.)
+ */
+
+/* BEGIN callback */
+static void
+plugin_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+}
+
+/* COMMIT callback */
+static void
+plugin_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+}
+
+/*
+ * Callback for individual changed tuples
+ */
+static void
+plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ Relation relation, ReorderBufferChange *change)
+{
+ ClusterDecodingState *dstate;
+
+ dstate = (ClusterDecodingState *) ctx->output_writer_private;
+
+ /* Only interested in one particular relation. */
+ if (relation->rd_id != dstate->relid)
+ return;
+
+ /* Decode entry depending on its type */
+ switch (change->action)
+ {
+ case REORDER_BUFFER_CHANGE_INSERT:
+ {
+ HeapTuple newtuple;
+
+ newtuple = change->data.tp.newtuple;
+
+ /*
+ * Identity checks in the main function should have made this
+ * impossible.
+ */
+ if (newtuple == NULL)
+ elog(ERROR, "incomplete insert info");
+
+ store_change(ctx, CHANGE_INSERT, newtuple);
+ }
+ break;
+ case REORDER_BUFFER_CHANGE_UPDATE:
+ {
+ HeapTuple oldtuple,
+ newtuple;
+
+ oldtuple = change->data.tp.oldtuple;
+ newtuple = change->data.tp.newtuple;
+
+ if (newtuple == NULL)
+ elog(ERROR, "incomplete update info");
+
+ if (oldtuple != NULL)
+ store_change(ctx, CHANGE_UPDATE_OLD, oldtuple);
+
+ store_change(ctx, CHANGE_UPDATE_NEW, newtuple);
+ }
+ break;
+ case REORDER_BUFFER_CHANGE_DELETE:
+ {
+ HeapTuple oldtuple;
+
+ oldtuple = change->data.tp.oldtuple;
+
+ if (oldtuple == NULL)
+ elog(ERROR, "incomplete delete info");
+
+ store_change(ctx, CHANGE_DELETE, oldtuple);
+ }
+ break;
+ default:
+ /* Should not come here */
+ Assert(false);
+ break;
+ }
+}
+
+static void
+plugin_truncate(struct LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ int nrelations, Relation relations[],
+ ReorderBufferChange *change)
+{
+ ClusterDecodingState *dstate;
+ int i;
+ Relation relation = NULL;
+
+ dstate = (ClusterDecodingState *) ctx->output_writer_private;
+
+ /* Find the relation we are processing. */
+ for (i = 0; i < nrelations; i++)
+ {
+ relation = relations[i];
+
+ if (RelationGetRelid(relation) == dstate->relid)
+ break;
+ }
+
+ /* Is this a truncation of some other relation? */
+ if (i == nrelations)
+ return;
+
+ store_change(ctx, CHANGE_TRUNCATE, NULL);
+}
+
+/* Store concurrent data change. */
+static void
+store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
+ HeapTuple tuple)
+{
+ ClusterDecodingState *dstate;
+ char *change_raw;
+ ConcurrentChange *change;
+ bool flattened = false;
+ Size size;
+ Datum values[1];
+ bool isnull[1];
+ char *dst;
+
+ dstate = (ClusterDecodingState *) ctx->output_writer_private;
+
+ size = MAXALIGN(VARHDRSZ) + sizeof(ConcurrentChange);
+
+ if (tuple)
+ {
+ /*
+ * ReorderBufferCommit() stores the TOAST chunks in its private memory
+ * context and frees them after having called
+ * apply_change(). Therefore we need a flat copy (including TOAST) that
+ * we eventually copy into the memory context which is available to
+ * decode_concurrent_changes().
+ */
+ if (HeapTupleHasExternal(tuple))
+ {
+ /*
+ * toast_flatten_tuple_to_datum() might be more convenient but we
+ * don't want the decompression it does.
+ */
+ tuple = toast_flatten_tuple(tuple, dstate->tupdesc);
+ flattened = true;
+ }
+
+ size += tuple->t_len;
+ }
+
+ /* XXX Isn't there any function / macro to do this? */
+ if (size >= 0x3FFFFFFF)
+ elog(ERROR, "change is too big");
+
+ /* Construct the change. */
+ change_raw = (char *) palloc0(size);
+ SET_VARSIZE(change_raw, size);
+ change = (ConcurrentChange *) VARDATA(change_raw);
+ change->kind = kind;
+
+ /* No other information is needed for TRUNCATE. */
+ if (change->kind == CHANGE_TRUNCATE)
+ goto store;
+
+ /*
+ * Copy the tuple.
+ *
+ * CAUTION: change->tup_data.t_data must be fixed on retrieval!
+ */
+ memcpy(&change->tup_data, tuple, sizeof(HeapTupleData));
+ dst = (char *) change + sizeof(ConcurrentChange);
+ memcpy(dst, tuple->t_data, tuple->t_len);
+
+ /* The data has been copied. */
+ if (flattened)
+ pfree(tuple);
+
+store:
+ /* Store as tuple of 1 bytea column. */
+ values[0] = PointerGetDatum(change_raw);
+ isnull[0] = false;
+ tuplestore_putvalues(dstate->tstore, dstate->tupdesc_change,
+ values, isnull);
+
+ /* Accounting. */
+ dstate->nchanges++;
+
+ /* Cleanup. */
+ pfree(change_raw);
+}
+
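store_change() above flattens each decoded change into a single bytea value: a varlena length header, a fixed ConcurrentChange header, then the raw tuple bytes. A standalone sketch of that layout (plain C stand-ins for SET_VARSIZE/VARDATA; the struct and size limit details are simplified, not the patch's exact definitions):

```c
/*
 * Sketch of the bytea layout built by store_change(): a 4-byte length
 * header, a fixed header describing the change, then the tuple data.
 */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct Change
{
	int		kind;
	int		tuple_len;
	/* tuple bytes follow immediately after this struct */
} Change;

#define VARHDRSZ 4				/* stand-in for the varlena header size */

/*
 * Serialize kind + tuple into one flat allocation, as store_change() does.
 * Returns the raw buffer; total size is reported via *size_p.
 */
static char *
pack_change(int kind, const char *tuple, int tuple_len, size_t *size_p)
{
	size_t		size = VARHDRSZ + sizeof(Change) + (size_t) tuple_len;
	char	   *raw = calloc(1, size);
	unsigned int len = (unsigned int) size;
	Change	   *c = (Change *) (raw + VARHDRSZ);

	memcpy(raw, &len, VARHDRSZ);	/* stands in for SET_VARSIZE() */
	c->kind = kind;
	c->tuple_len = tuple_len;
	if (tuple_len > 0)
		memcpy((char *) c + sizeof(Change), tuple, tuple_len);
	*size_p = size;
	return raw;
}
```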
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 6caeca3a8e..4a3c5c8fdc 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
#include "access/xlogprefetcher.h"
#include "access/xlogrecovery.h"
#include "commands/async.h"
+#include "commands/cluster.h"
#include "commands/waitlsn.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -152,6 +153,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, InjectionPointShmemSize());
size = add_size(size, SlotSyncShmemSize());
size = add_size(size, WaitLSNShmemSize());
+ size = add_size(size, ClusterShmemSize());
/* include additional requested shmem from preload libraries */
size = add_size(size, total_addin_request);
@@ -339,6 +341,7 @@ CreateOrAttachShmemStructs(void)
WaitEventCustomShmemInit();
InjectionPointShmemInit();
WaitLSNShmemInit();
+ ClusterShmemInit();
}
/*
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index b2ea8125c9..04e7571e70 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -1299,6 +1299,17 @@ ProcessUtilitySlow(ParseState *pstate,
lockmode = AlterTableGetLockLevel(atstmt->cmds);
relid = AlterTableLookupRelation(atstmt, lockmode);
+ /*
+ * If lockmode allows, check if VACUUM FULL / CLUSTER
+ * CONCURRENT is in progress. If lockmode is too weak,
+ * cluster_rel() should detect incompatible DDLs executed
+ * by us.
+ *
+ * XXX We might skip the changes for DDLs which do not
+ * change the tuple descriptor.
+ */
+ check_for_concurrent_cluster(relid, lockmode);
+
if (OidIsValid(relid))
{
AlterTableUtilityContext atcontext;
diff --git a/src/backend/utils/activity/backend_progress.c b/src/backend/utils/activity/backend_progress.c
index cc9b4cf0dc..0ba35a847e 100644
--- a/src/backend/utils/activity/backend_progress.c
+++ b/src/backend/utils/activity/backend_progress.c
@@ -163,3 +163,19 @@ pgstat_progress_end_command(void)
beentry->st_progress.command_target = InvalidOid;
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
+
+void
+pgstat_progress_restore_state(PgBackendProgress *backup)
+{
+ volatile PgBackendStatus *beentry = MyBEEntry;
+
+ if (!beentry || !pgstat_track_activities)
+ return;
+
+ PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
+ beentry->st_progress.command = backup->command;
+ beentry->st_progress.command_target = backup->command_target;
+ memcpy((void *) beentry->st_progress.param, backup->param,
+ sizeof(beentry->st_progress.param));
+ PGSTAT_END_WRITE_ACTIVITY(beentry);
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 8efb4044d6..8b9dfe865b 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -347,6 +347,7 @@ DSMRegistry "Waiting to read or update the dynamic shared memory registry."
InjectionPoint "Waiting to read or update information related to injection points."
SerialControl "Waiting to read or update shared <filename>pg_serial</filename> state."
WaitLSN "Waiting to read or update shared Wait-for-LSN state."
+ClusteredRels "Waiting to read or update information on tables being clustered concurrently."
#
# END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 603aa4157b..5a2d5d6138 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -1373,6 +1373,28 @@ CacheInvalidateRelcache(Relation relation)
RegisterRelcacheInvalidation(databaseId, relationId);
}
+/*
+ * CacheInvalidateRelcacheImmediate
+ * Send invalidation message for the specified relation's relcache entry.
+ *
+ * Currently this is used in VACUUM FULL/CLUSTER CONCURRENTLY, to make sure
+ * that other backends are aware that the command is being executed for the
+ * relation.
+ */
+void
+CacheInvalidateRelcacheImmediate(Relation relation)
+{
+ SharedInvalidationMessage msg;
+
+ msg.rc.id = SHAREDINVALRELCACHE_ID;
+ msg.rc.dbId = MyDatabaseId;
+ msg.rc.relId = RelationGetRelid(relation);
+ /* check AddCatcacheInvalidationMessage() for an explanation */
+ VALGRIND_MAKE_MEM_DEFINED(&msg, sizeof(msg));
+
+ SendSharedInvalidMessages(&msg, 1);
+}
+
/*
* CacheInvalidateRelcacheAll
* Register invalidation of the whole relcache at the end of command.
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 63efc55f09..c160051b2f 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -64,6 +64,7 @@
#include "catalog/pg_type.h"
#include "catalog/schemapg.h"
#include "catalog/storage.h"
+#include "commands/cluster.h"
#include "commands/policy.h"
#include "commands/publicationcmds.h"
#include "commands/trigger.h"
@@ -1257,6 +1258,10 @@ retry:
/* make sure relation is marked as having no open file yet */
relation->rd_smgr = NULL;
+ /* Is CLUSTER CONCURRENTLY in progress? */
+ relation->rd_cluster_concurrent =
+ is_concurrent_cluster_in_progress(targetRelId);
+
/*
* now we can free the memory allocated for pg_class_tuple
*/
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index b8b500f48f..6be0fef84c 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -156,7 +156,6 @@ static List *exportedSnapshots = NIL;
/* Prototypes for local functions */
static void UnregisterSnapshotNoOwner(Snapshot snapshot);
-static void FreeSnapshot(Snapshot snapshot);
static void SnapshotResetXmin(void);
/* ResourceOwner callbacks to track snapshot references */
@@ -625,7 +624,7 @@ CopySnapshot(Snapshot snapshot)
* FreeSnapshot
* Free the memory associated with a snapshot.
*/
-static void
+void
FreeSnapshot(Snapshot snapshot)
{
Assert(snapshot->regd_count == 0);
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index a7ccde6d7d..57acf2a279 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2773,7 +2773,7 @@ psql_completion(const char *text, int start, int end)
* one word, so the above test is correct.
*/
if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
- COMPLETE_WITH("VERBOSE");
+ COMPLETE_WITH("VERBOSE", "CONCURRENTLY");
}
/* COMMENT */
@@ -4744,7 +4744,8 @@ psql_completion(const char *text, int start, int end)
"DISABLE_PAGE_SKIPPING", "SKIP_LOCKED",
"INDEX_CLEANUP", "PROCESS_MAIN", "PROCESS_TOAST",
"TRUNCATE", "PARALLEL", "SKIP_DATABASE_STATS",
- "ONLY_DATABASE_STATS", "BUFFER_USAGE_LIMIT");
+ "ONLY_DATABASE_STATS", "BUFFER_USAGE_LIMIT",
+ "CONCURRENTLY");
else if (TailMatches("FULL|FREEZE|ANALYZE|VERBOSE|DISABLE_PAGE_SKIPPING|SKIP_LOCKED|PROCESS_MAIN|PROCESS_TOAST|TRUNCATE|SKIP_DATABASE_STATS|ONLY_DATABASE_STATS"))
COMPLETE_WITH("ON", "OFF");
else if (TailMatches("INDEX_CLEANUP"))
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 9e9aec88a6..8687ec8796 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -405,6 +405,10 @@ extern HTSV_Result HeapTupleSatisfiesVacuumHorizon(HeapTuple htup, Buffer buffer
TransactionId *dead_after);
extern void HeapTupleSetHintBits(HeapTupleHeader tuple, Buffer buffer,
uint16 infomask, TransactionId xid);
+extern bool HeapTupleMVCCInserted(HeapTuple htup, Snapshot snapshot,
+ Buffer buffer);
+extern bool HeapTupleMVCCNotDeleted(HeapTuple htup, Snapshot snapshot,
+ Buffer buffer);
extern bool HeapTupleHeaderIsOnlyLocked(HeapTupleHeader tuple);
extern bool HeapTupleIsSurelyDead(HeapTuple htup,
struct GlobalVisState *vistest);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index da661289c1..1380ba81fc 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,6 +21,7 @@
#include "access/sdir.h"
#include "access/xact.h"
#include "executor/tuptable.h"
+#include "replication/logical.h"
#include "storage/read_stream.h"
#include "utils/rel.h"
#include "utils/snapshot.h"
@@ -630,6 +631,8 @@ typedef struct TableAmRoutine
Relation OldIndex,
bool use_sort,
TransactionId OldestXmin,
+ Snapshot snapshot,
+ LogicalDecodingContext *decoding_ctx,
TransactionId *xid_cutoff,
MultiXactId *multi_cutoff,
double *num_tuples,
@@ -1667,6 +1670,10 @@ table_relation_copy_data(Relation rel, const RelFileLocator *newrlocator)
* not needed for the relation's AM
* - *xid_cutoff - ditto
* - *multi_cutoff - ditto
+ * - snapshot - if != NULL, ignore data changes done by transactions that this
+ * (MVCC) snapshot considers still in-progress or in the future.
+ * - decoding_ctx - logical decoding context, to capture concurrent data
+ * changes.
*
* Output parameters:
* - *xid_cutoff - rel's new relfrozenxid value, may be invalid
@@ -1679,6 +1686,8 @@ table_relation_copy_for_cluster(Relation OldTable, Relation NewTable,
Relation OldIndex,
bool use_sort,
TransactionId OldestXmin,
+ Snapshot snapshot,
+ LogicalDecodingContext *decoding_ctx,
TransactionId *xid_cutoff,
MultiXactId *multi_cutoff,
double *num_tuples,
@@ -1687,6 +1696,7 @@ table_relation_copy_for_cluster(Relation OldTable, Relation NewTable,
{
OldTable->rd_tableam->relation_copy_for_cluster(OldTable, NewTable, OldIndex,
use_sort, OldestXmin,
+ snapshot, decoding_ctx,
xid_cutoff, multi_cutoff,
num_tuples, tups_vacuumed,
tups_recently_dead);
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 7d434f8e65..77d522561b 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -99,6 +99,9 @@ extern Oid index_concurrently_create_copy(Relation heapRelation,
Oid tablespaceOid,
const char *newName);
+extern NullableDatum *get_index_stattargets(Oid indexid,
+ IndexInfo *indInfo);
+
extern void index_concurrently_build(Oid heapRelationId,
Oid indexRelationId);
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 7492796ea2..959899a7cc 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -13,10 +13,15 @@
#ifndef CLUSTER_H
#define CLUSTER_H
+#include "nodes/execnodes.h"
#include "nodes/parsenodes.h"
#include "parser/parse_node.h"
+#include "replication/logical.h"
#include "storage/lock.h"
+#include "storage/relfilelocator.h"
#include "utils/relcache.h"
+#include "utils/resowner.h"
+#include "utils/tuplestore.h"
/* flag bits for ClusterParams->options */
@@ -24,6 +29,7 @@
#define CLUOPT_RECHECK 0x02 /* recheck relation state */
#define CLUOPT_RECHECK_ISCLUSTERED 0x04 /* recheck relation state for
* indisclustered */
+#define CLUOPT_CONCURRENT 0x08 /* allow concurrent data changes */
/* options for CLUSTER */
typedef struct ClusterParams
@@ -31,12 +37,101 @@ typedef struct ClusterParams
bits32 options; /* bitmask of CLUOPT_* */
} ClusterParams;
+/*
+ * The following definitions are used for concurrent processing.
+ */
+
+extern RelFileLocator clustered_rel_locator;
+extern RelFileLocator clustered_rel_toast_locator;
+
+/*
+ * Lock level for the concurrent variant of CLUSTER / VACUUM FULL.
+ *
+ * Like for lazy VACUUM, we choose the strongest lock that still allows
+ * INSERT, UPDATE and DELETE.
+ *
+ * Note that the lock needs to be released temporarily a few times during the
+ * processing. In such cases it should be checked after re-locking that the
+ * relation / index hasn't changed in the system catalog while the lock was
+ * not held.
+ */
+#define LOCK_CLUSTER_CONCURRENT ShareUpdateExclusiveLock
+
+typedef enum
+{
+ CHANGE_INSERT,
+ CHANGE_UPDATE_OLD,
+ CHANGE_UPDATE_NEW,
+ CHANGE_DELETE,
+ CHANGE_TRUNCATE
+} ConcurrentChangeKind;
+
+typedef struct ConcurrentChange
+{
+ /* See the enum above. */
+ ConcurrentChangeKind kind;
+
+ /*
+ * The actual tuple.
+ *
+ * The tuple data follows the ConcurrentChange structure. Before use make
+ * sure the tuple is correctly aligned (ConcurrentChange can be stored as
+ * bytea) and that tuple->t_data is fixed.
+ */
+ HeapTupleData tup_data;
+} ConcurrentChange;
+
+/*
+ * Logical decoding state.
+ *
+ * Here we store the data changes that we decode from WAL while the table
+ * contents are being copied to the new storage, as well as the metadata
+ * needed to apply these changes to the table.
+ */
+typedef struct ClusterDecodingState
+{
+ /* The relation whose changes we're decoding. */
+ Oid relid;
+
+ /*
+ * Decoded changes are stored here. Although we try to avoid excessive
+ * batches, it can happen that the changes need to be stored to disk. The
+ * tuplestore does this transparently.
+ */
+ Tuplestorestate *tstore;
+
+ /* The current number of changes in tstore. */
+ double nchanges;
+
+ /*
+ * Descriptor to store the ConcurrentChange structure serialized (bytea).
+	 * We can't store the tuple directly because tuplestore only supports
+	 * minimal tuples, and we may need to transfer the OID system column from
+	 * the output plugin. We also need to transfer the change kind, so it's
+	 * better to put everything in one structure than to use two tuplestores
+	 * "in parallel".
+ */
+ TupleDesc tupdesc_change;
+
+ /* Tuple descriptor needed to update indexes. */
+ TupleDesc tupdesc;
+
+ /* Slot to retrieve data from tstore. */
+ TupleTableSlot *tsslot;
+
+ ResourceOwner resowner;
+} ClusterDecodingState;
+
extern void cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel);
-extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params);
+extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
+ bool isTopLevel);
extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid,
LOCKMODE lockmode);
extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
-
+extern bool check_relation_is_clusterable_concurrently(Relation rel, int elevel,
+ const char *stmt);
+extern void cluster_decode_concurrent_changes(LogicalDecodingContext *ctx,
+ XLogRecPtr end_of_wal);
extern Oid make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
char relpersistence, LOCKMODE lockmode_old,
LOCKMODE *lockmode_new_p);
@@ -45,8 +140,13 @@ extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
bool swap_toast_by_content,
bool check_constraints,
bool is_internal,
+ bool reindex,
TransactionId frozenXid,
MultiXactId cutoffMulti,
char newrelpersistence);
+extern Size ClusterShmemSize(void);
+extern void ClusterShmemInit(void);
+extern bool is_concurrent_cluster_in_progress(Oid relid);
+extern void check_for_concurrent_cluster(Oid relid, LOCKMODE lockmode);
#endif /* CLUSTER_H */
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 5616d64523..03e3712ede 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -59,19 +59,22 @@
#define PROGRESS_CLUSTER_PHASE 1
#define PROGRESS_CLUSTER_INDEX_RELID 2
#define PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED 3
-#define PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN 4
-#define PROGRESS_CLUSTER_TOTAL_HEAP_BLKS 5
-#define PROGRESS_CLUSTER_HEAP_BLKS_SCANNED 6
-#define PROGRESS_CLUSTER_INDEX_REBUILD_COUNT 7
+#define PROGRESS_CLUSTER_HEAP_TUPLES_INSERTED 4
+#define PROGRESS_CLUSTER_HEAP_TUPLES_UPDATED 5
+#define PROGRESS_CLUSTER_HEAP_TUPLES_DELETED 6
+#define PROGRESS_CLUSTER_TOTAL_HEAP_BLKS 7
+#define PROGRESS_CLUSTER_HEAP_BLKS_SCANNED 8
+#define PROGRESS_CLUSTER_INDEX_REBUILD_COUNT 9
/* Phases of cluster (as advertised via PROGRESS_CLUSTER_PHASE) */
#define PROGRESS_CLUSTER_PHASE_SEQ_SCAN_HEAP 1
#define PROGRESS_CLUSTER_PHASE_INDEX_SCAN_HEAP 2
#define PROGRESS_CLUSTER_PHASE_SORT_TUPLES 3
#define PROGRESS_CLUSTER_PHASE_WRITE_NEW_HEAP 4
-#define PROGRESS_CLUSTER_PHASE_SWAP_REL_FILES 5
-#define PROGRESS_CLUSTER_PHASE_REBUILD_INDEX 6
-#define PROGRESS_CLUSTER_PHASE_FINAL_CLEANUP 7
+#define PROGRESS_CLUSTER_PHASE_CATCH_UP 5
+#define PROGRESS_CLUSTER_PHASE_SWAP_REL_FILES 6
+#define PROGRESS_CLUSTER_PHASE_REBUILD_INDEX 7
+#define PROGRESS_CLUSTER_PHASE_FINAL_CLEANUP 8
/* Commands of PROGRESS_CLUSTER */
#define PROGRESS_CLUSTER_COMMAND_CLUSTER 1
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 759f9a87d3..2f693e0fc0 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -181,13 +181,16 @@ typedef struct VacAttrStats
#define VACOPT_ANALYZE 0x02 /* do ANALYZE */
#define VACOPT_VERBOSE 0x04 /* output INFO instrumentation messages */
#define VACOPT_FREEZE 0x08 /* FREEZE option */
-#define VACOPT_FULL 0x10 /* FULL (non-concurrent) vacuum */
-#define VACOPT_SKIP_LOCKED 0x20 /* skip if cannot get lock */
-#define VACOPT_PROCESS_MAIN 0x40 /* process main relation */
-#define VACOPT_PROCESS_TOAST 0x80 /* process the TOAST table, if any */
-#define VACOPT_DISABLE_PAGE_SKIPPING 0x100 /* don't skip any pages */
-#define VACOPT_SKIP_DATABASE_STATS 0x200 /* skip vac_update_datfrozenxid() */
-#define VACOPT_ONLY_DATABASE_STATS 0x400 /* only vac_update_datfrozenxid() */
+#define VACOPT_FULL_EXCLUSIVE 0x10 /* FULL (non-concurrent) vacuum */
+#define VACOPT_FULL_CONCURRENT 0x20 /* FULL (concurrent) vacuum */
+#define VACOPT_SKIP_LOCKED 0x40 /* skip if cannot get lock */
+#define VACOPT_PROCESS_MAIN 0x80 /* process main relation */
+#define VACOPT_PROCESS_TOAST 0x100 /* process the TOAST table, if any */
+#define VACOPT_DISABLE_PAGE_SKIPPING 0x200 /* don't skip any pages */
+#define VACOPT_SKIP_DATABASE_STATS 0x400 /* skip vac_update_datfrozenxid() */
+#define VACOPT_ONLY_DATABASE_STATS 0x800 /* only vac_update_datfrozenxid() */
+
+#define VACOPT_FULL (VACOPT_FULL_EXCLUSIVE | VACOPT_FULL_CONCURRENT)
/*
* Values used by index_cleanup and truncate params.
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index ad06e80784..b38eb0d530 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -69,6 +69,7 @@ extern void FreeSnapshotBuilder(SnapBuild *builder);
extern void SnapBuildSnapDecRefcount(Snapshot snap);
extern Snapshot SnapBuildInitialSnapshot(SnapBuild *builder);
+extern Snapshot SnapBuildInitialSnapshotForCluster(SnapBuild *builder);
extern Snapshot SnapBuildMVCCFromHistoric(Snapshot snapshot, bool in_place);
extern const char *SnapBuildExportSnapshot(SnapBuild *builder);
extern void SnapBuildClearExportedSnapshot(void);
diff --git a/src/include/storage/lockdefs.h b/src/include/storage/lockdefs.h
index 934ba84f6a..cac3d7f8c7 100644
--- a/src/include/storage/lockdefs.h
+++ b/src/include/storage/lockdefs.h
@@ -36,7 +36,7 @@ typedef int LOCKMODE;
#define AccessShareLock 1 /* SELECT */
#define RowShareLock 2 /* SELECT FOR UPDATE/FOR SHARE */
#define RowExclusiveLock 3 /* INSERT, UPDATE, DELETE */
-#define ShareUpdateExclusiveLock 4 /* VACUUM (non-FULL), ANALYZE, CREATE
+#define ShareUpdateExclusiveLock 4 /* VACUUM (non-exclusive), ANALYZE, CREATE
* INDEX CONCURRENTLY */
#define ShareLock 5 /* CREATE INDEX (WITHOUT CONCURRENTLY) */
#define ShareRowExclusiveLock 6 /* like EXCLUSIVE MODE, but allows ROW
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 88dc79b2bd..cff17a6bd0 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -84,3 +84,4 @@ PG_LWLOCK(50, DSMRegistry)
PG_LWLOCK(51, InjectionPoint)
PG_LWLOCK(52, SerialControl)
PG_LWLOCK(53, WaitLSN)
+PG_LWLOCK(54, ClusteredRels)
diff --git a/src/include/utils/backend_progress.h b/src/include/utils/backend_progress.h
index e09598eafc..5ab5df9d41 100644
--- a/src/include/utils/backend_progress.h
+++ b/src/include/utils/backend_progress.h
@@ -35,7 +35,7 @@ typedef enum ProgressCommandType
/*
* Any command which wishes can advertise that it is running by setting
- * command, command_target, and param[]. command_target should be the OID of
+ * command, command_target, and param[]. command_target should be the OID of
* the relation which the command targets (we assume there's just one, as this
* is meant for utility commands), but the meaning of each element in the
* param array is command-specific.
@@ -55,6 +55,7 @@ extern void pgstat_progress_parallel_incr_param(int index, int64 incr);
extern void pgstat_progress_update_multi_param(int nparam, const int *index,
const int64 *val);
extern void pgstat_progress_end_command(void);
+extern void pgstat_progress_restore_state(PgBackendProgress *backup);
#endif /* BACKEND_PROGRESS_H */
diff --git a/src/include/utils/inval.h b/src/include/utils/inval.h
index 24695facf2..4acf9d0ed9 100644
--- a/src/include/utils/inval.h
+++ b/src/include/utils/inval.h
@@ -42,6 +42,8 @@ extern void CacheInvalidateCatalog(Oid catalogId);
extern void CacheInvalidateRelcache(Relation relation);
+extern void CacheInvalidateRelcacheImmediate(Relation relation);
+
extern void CacheInvalidateRelcacheAll(void);
extern void CacheInvalidateRelcacheByTuple(HeapTuple classTuple);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 8700204953..adda46c985 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -253,6 +253,9 @@ typedef struct RelationData
bool pgstat_enabled; /* should relation stats be counted */
/* use "struct" here to avoid needing to include pgstat.h: */
struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+ /* Is CLUSTER CONCURRENTLY being performed on this relation? */
+ bool rd_cluster_concurrent;
} RelationData;
@@ -684,7 +687,9 @@ RelationCloseSmgr(Relation relation)
#define RelationIsAccessibleInLogicalDecoding(relation) \
(XLogLogicalInfoActive() && \
RelationNeedsWAL(relation) && \
- (IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation)))
+ (IsCatalogRelation(relation) || \
+ RelationIsUsedAsCatalogTable(relation) || \
+ (relation)->rd_cluster_concurrent))
/*
* RelationIsLogicallyLogged
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index e7ac89f484..f58c9108fc 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -69,6 +69,8 @@ extern void SnapshotSetCommandId(CommandId curcid);
extern Snapshot GetOldestSnapshot(void);
extern Snapshot CopySnapshot(Snapshot snapshot);
+extern void FreeSnapshot(Snapshot snapshot);
+
extern Snapshot GetCatalogSnapshot(Oid relid);
extern Snapshot GetNonHistoricCatalogSnapshot(Oid relid);
extern void InvalidateCatalogSnapshot(void);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index a1626f3fae..9a43db2722 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1958,17 +1958,20 @@ pg_stat_progress_cluster| SELECT s.pid,
WHEN 2 THEN 'index scanning heap'::text
WHEN 3 THEN 'sorting tuples'::text
WHEN 4 THEN 'writing new heap'::text
- WHEN 5 THEN 'swapping relation files'::text
- WHEN 6 THEN 'rebuilding index'::text
- WHEN 7 THEN 'performing final cleanup'::text
+ WHEN 5 THEN 'catch-up'::text
+ WHEN 6 THEN 'swapping relation files'::text
+ WHEN 7 THEN 'rebuilding index'::text
+ WHEN 8 THEN 'performing final cleanup'::text
ELSE NULL::text
END AS phase,
(s.param3)::oid AS cluster_index_relid,
s.param4 AS heap_tuples_scanned,
- s.param5 AS heap_tuples_written,
- s.param6 AS heap_blks_total,
- s.param7 AS heap_blks_scanned,
- s.param8 AS index_rebuild_count
+ s.param5 AS heap_tuples_inserted,
+ s.param6 AS heap_tuples_updated,
+ s.param7 AS heap_tuples_deleted,
+ s.param8 AS heap_blks_total,
+ s.param9 AS heap_blks_scanned,
+ s.param10 AS index_rebuild_count
FROM (pg_stat_get_progress_info('CLUSTER'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_progress_copy| SELECT s.pid,
--
2.45.2
Attachment: v04-0005-Preserve-visibility-information-of-the-concurrent-da.patch (text/x-diff)
From 66bce6a7fdc63a1072ffd9c86ac0855edd0094f7 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Fri, 6 Sep 2024 09:55:54 +0200
Subject: [PATCH 5/8] Preserve visibility information of the concurrent data
changes.
As explained in the commit message of the preceding patch of the series, the
data changes done by applications while VACUUM FULL / CLUSTER CONCURRENTLY is
copying the table contents to a new file are decoded from WAL and eventually
also applied to the new file. To reduce the complexity a little bit, the
preceding patch uses the current transaction (i.e. the transaction opened by the
VACUUM FULL / CLUSTER command) to execute those INSERT, UPDATE and DELETE
commands.
However, neither VACUUM nor CLUSTER is expected to change visibility of
tuples. Therefore, this patch fixes the handling of the "concurrent data
changes". Now the tuples written into the new table storage have the same XID
and command ID (CID) as they had in the old storage.
A related change is that the data changes (INSERT, UPDATE, DELETE) we
"replay" on the new storage are not themselves logically decoded. First, the
logical decoding subsystem does not expect an already-committed transaction to
be decoded again. Second, repeated decoding would be wasted effort.
---
src/backend/access/common/toast_internals.c | 3 +-
src/backend/access/heap/heapam.c | 73 ++++++++----
src/backend/access/heap/heapam_handler.c | 14 ++-
src/backend/access/transam/xact.c | 52 ++++++++
src/backend/commands/cluster.c | 111 ++++++++++++++++--
src/backend/replication/logical/decode.c | 76 ++++++++++--
src/backend/replication/logical/snapbuild.c | 22 ++--
.../pgoutput_cluster/pgoutput_cluster.c | 68 +++++++++--
src/include/access/heapam.h | 15 ++-
src/include/access/heapam_xlog.h | 2 +
src/include/access/xact.h | 2 +
src/include/commands/cluster.h | 18 +++
12 files changed, 386 insertions(+), 70 deletions(-)
diff --git a/src/backend/access/common/toast_internals.c b/src/backend/access/common/toast_internals.c
index 90d0654e62..183055647b 100644
--- a/src/backend/access/common/toast_internals.c
+++ b/src/backend/access/common/toast_internals.c
@@ -320,7 +320,8 @@ toast_save_datum(Relation rel, Datum value,
memcpy(VARDATA(&chunk_data), data_p, chunk_size);
toasttup = heap_form_tuple(toasttupDesc, t_values, t_isnull);
- heap_insert(toastrel, toasttup, mycid, options, NULL);
+ heap_insert(toastrel, toasttup, GetCurrentTransactionId(), mycid,
+ options, NULL);
/*
* Create the index entry. We cheat a little here by not using
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 1fdcc0abee..69bf4d1c8d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -75,7 +75,8 @@ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
HeapTuple newtup, HeapTuple old_key_tuple,
- bool all_visible_cleared, bool new_all_visible_cleared);
+ bool all_visible_cleared, bool new_all_visible_cleared,
+ bool wal_logical);
static Bitmapset *HeapDetermineColumnsInfo(Relation relation,
Bitmapset *interesting_cols,
Bitmapset *external_cols,
@@ -1975,7 +1976,7 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
/*
* heap_insert - insert tuple into a heap
*
- * The new tuple is stamped with current transaction ID and the specified
+ * The new tuple is stamped with the specified transaction ID and the specified
* command ID.
*
* See table_tuple_insert for comments about most of the input flags, except
@@ -1991,15 +1992,16 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
* reflected into *tup.
*/
void
-heap_insert(Relation relation, HeapTuple tup, CommandId cid,
- int options, BulkInsertState bistate)
+heap_insert(Relation relation, HeapTuple tup, TransactionId xid,
+ CommandId cid, int options, BulkInsertState bistate)
{
- TransactionId xid = GetCurrentTransactionId();
HeapTuple heaptup;
Buffer buffer;
Buffer vmbuffer = InvalidBuffer;
bool all_visible_cleared = false;
+ Assert(TransactionIdIsValid(xid));
+
/* Cheap, simplistic check that the tuple matches the rel's rowtype. */
Assert(HeapTupleHeaderGetNatts(tup->t_data) <=
RelationGetNumberOfAttributes(relation));
@@ -2630,7 +2632,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
void
simple_heap_insert(Relation relation, HeapTuple tup)
{
- heap_insert(relation, tup, GetCurrentCommandId(true), 0, NULL);
+ heap_insert(relation, tup, GetCurrentTransactionId(),
+ GetCurrentCommandId(true), 0, NULL);
}
/*
@@ -2687,11 +2690,11 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
*/
TM_Result
heap_delete(Relation relation, ItemPointer tid,
- CommandId cid, Snapshot crosscheck, bool wait,
- TM_FailureData *tmfd, bool changingPart)
+ TransactionId xid, CommandId cid, Snapshot crosscheck, bool wait,
+ TM_FailureData *tmfd, bool changingPart,
+ bool wal_logical)
{
TM_Result result;
- TransactionId xid = GetCurrentTransactionId();
ItemId lp;
HeapTupleData tp;
Page page;
@@ -2708,6 +2711,7 @@ heap_delete(Relation relation, ItemPointer tid,
bool old_key_copied = false;
Assert(ItemPointerIsValid(tid));
+ Assert(TransactionIdIsValid(xid));
/*
* Forbid this during a parallel operation, lest it allocate a combo CID.
@@ -2933,7 +2937,8 @@ l1:
* Compute replica identity tuple before entering the critical section so
* we don't PANIC upon a memory allocation failure.
*/
- old_key_tuple = ExtractReplicaIdentity(relation, &tp, true, &old_key_copied);
+ old_key_tuple = wal_logical ?
+ ExtractReplicaIdentity(relation, &tp, true, &old_key_copied) : NULL;
/*
* If this is the first possibly-multixact-able operation in the current
@@ -3001,8 +3006,12 @@ l1:
/*
* For logical decode we need combo CIDs to properly decode the
* catalog
+ *
+ * Like in heap_insert(), visibility is unchanged when called from
+ * VACUUM FULL / CLUSTER.
*/
- if (RelationIsAccessibleInLogicalDecoding(relation))
+ if (wal_logical &&
+ RelationIsAccessibleInLogicalDecoding(relation))
log_heap_new_cid(relation, &tp);
xlrec.flags = 0;
@@ -3023,6 +3032,15 @@ l1:
xlrec.flags |= XLH_DELETE_CONTAINS_OLD_KEY;
}
+ /*
+ * Unlike UPDATE, DELETE is decoded even if there is no old key, so it
+ * does not help to clear both XLH_DELETE_CONTAINS_OLD_TUPLE and
+ * XLH_DELETE_CONTAINS_OLD_KEY. Thus we need an extra flag. TODO
+ * Consider not decoding tuples w/o the old tuple/key instead.
+ */
+ if (!wal_logical)
+ xlrec.flags |= XLH_DELETE_NO_LOGICAL;
+
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapDelete);
@@ -3112,10 +3130,11 @@ simple_heap_delete(Relation relation, ItemPointer tid)
TM_Result result;
TM_FailureData tmfd;
- result = heap_delete(relation, tid,
+ result = heap_delete(relation, tid, GetCurrentTransactionId(),
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &tmfd, false /* changingPart */ );
+ &tmfd, false, /* changingPart */
+ true /* wal_logical */);
switch (result)
{
case TM_SelfModified:
@@ -3154,12 +3173,11 @@ simple_heap_delete(Relation relation, ItemPointer tid)
*/
TM_Result
heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
- CommandId cid, Snapshot crosscheck, bool wait,
- TM_FailureData *tmfd, LockTupleMode *lockmode,
- TU_UpdateIndexes *update_indexes)
+ TransactionId xid, CommandId cid, Snapshot crosscheck,
+ bool wait, TM_FailureData *tmfd, LockTupleMode *lockmode,
+ TU_UpdateIndexes *update_indexes, bool wal_logical)
{
TM_Result result;
- TransactionId xid = GetCurrentTransactionId();
Bitmapset *hot_attrs;
Bitmapset *sum_attrs;
Bitmapset *key_attrs;
@@ -3199,6 +3217,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
infomask2_new_tuple;
Assert(ItemPointerIsValid(otid));
+ Assert(TransactionIdIsValid(xid));
/* Cheap, simplistic check that the tuple matches the rel's rowtype. */
Assert(HeapTupleHeaderGetNatts(newtup->t_data) <=
@@ -3987,8 +4006,12 @@ l2:
/*
* For logical decoding we need combo CIDs to properly decode the
* catalog.
+ *
+ * Like in heap_insert(), visibility is unchanged when called from
+ * VACUUM FULL / CLUSTER.
*/
- if (RelationIsAccessibleInLogicalDecoding(relation))
+ if (wal_logical &&
+ RelationIsAccessibleInLogicalDecoding(relation))
{
log_heap_new_cid(relation, &oldtup);
log_heap_new_cid(relation, heaptup);
@@ -3998,7 +4021,8 @@ l2:
newbuf, &oldtup, heaptup,
old_key_tuple,
all_visible_cleared,
- all_visible_cleared_new);
+ all_visible_cleared_new,
+ wal_logical);
if (newbuf != buffer)
{
PageSetLSN(BufferGetPage(newbuf), recptr);
@@ -4231,10 +4255,10 @@ simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup,
TM_FailureData tmfd;
LockTupleMode lockmode;
- result = heap_update(relation, otid, tup,
+ result = heap_update(relation, otid, tup, GetCurrentTransactionId(),
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &tmfd, &lockmode, update_indexes);
+ &tmfd, &lockmode, update_indexes, true);
switch (result)
{
case TM_SelfModified:
@@ -8363,7 +8387,8 @@ static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup, HeapTuple newtup,
HeapTuple old_key_tuple,
- bool all_visible_cleared, bool new_all_visible_cleared)
+ bool all_visible_cleared, bool new_all_visible_cleared,
+ bool wal_logical)
{
xl_heap_update xlrec;
xl_heap_header xlhdr;
@@ -8374,10 +8399,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
suffixlen = 0;
XLogRecPtr recptr;
Page page = BufferGetPage(newbuf);
- bool need_tuple_data = RelationIsLogicallyLogged(reln);
+ bool need_tuple_data;
bool init;
int bufflags;
+ need_tuple_data = RelationIsLogicallyLogged(reln) && wal_logical;
+
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 4ddb1c4a0c..a8999a3e72 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -252,7 +252,8 @@ heapam_tuple_insert(Relation relation, TupleTableSlot *slot, CommandId cid,
tuple->t_tableOid = slot->tts_tableOid;
/* Perform the insertion, and copy the resulting ItemPointer */
- heap_insert(relation, tuple, cid, options, bistate);
+ heap_insert(relation, tuple, GetCurrentTransactionId(), cid, options,
+ bistate);
ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
if (shouldFree)
@@ -275,7 +276,8 @@ heapam_tuple_insert_speculative(Relation relation, TupleTableSlot *slot,
options |= HEAP_INSERT_SPECULATIVE;
/* Perform the insertion, and copy the resulting ItemPointer */
- heap_insert(relation, tuple, cid, options, bistate);
+ heap_insert(relation, tuple, GetCurrentTransactionId(), cid, options,
+ bistate);
ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
if (shouldFree)
@@ -309,7 +311,8 @@ heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
* the storage itself is cleaning the dead tuples by itself, it is the
* time to call the index tuple deletion also.
*/
- return heap_delete(relation, tid, cid, crosscheck, wait, tmfd, changingPart);
+ return heap_delete(relation, tid, GetCurrentTransactionId(), cid,
+ crosscheck, wait, tmfd, changingPart, true);
}
@@ -327,8 +330,9 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
slot->tts_tableOid = RelationGetRelid(relation);
tuple->t_tableOid = slot->tts_tableOid;
- result = heap_update(relation, otid, tuple, cid, crosscheck, wait,
- tmfd, lockmode, update_indexes);
+ result = heap_update(relation, otid, tuple, GetCurrentTransactionId(),
+ cid, crosscheck, wait,
+ tmfd, lockmode, update_indexes, true);
ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
/*
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 87700c7c5c..159d2c7983 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -125,6 +125,18 @@ static FullTransactionId XactTopFullTransactionId = {InvalidTransactionId};
static int nParallelCurrentXids = 0;
static TransactionId *ParallelCurrentXids;
+/*
+ * Another case that requires TransactionIdIsCurrentTransactionId() to behave
+ * specially is when CLUSTER CONCURRENTLY is processing data changes made in
+ * the old storage of a table by other transactions. When applying the changes
+ * to the new storage, the backend executing the CLUSTER command needs to act
+ * on behalf on those other transactions. The transactions responsible for the
+ * changes in the old storage are stored in this array, sorted by
+ * xidComparator.
+ */
+static int nClusterCurrentXids = 0;
+static TransactionId *ClusterCurrentXids = NULL;
+
/*
* Miscellaneous flag bits to record events which occur on the top level
* transaction. These flags are only persisted in MyXactFlags and are intended
@@ -971,6 +983,8 @@ TransactionIdIsCurrentTransactionId(TransactionId xid)
int low,
high;
+ Assert(nClusterCurrentXids == 0);
+
low = 0;
high = nParallelCurrentXids - 1;
while (low <= high)
@@ -990,6 +1004,21 @@ TransactionIdIsCurrentTransactionId(TransactionId xid)
return false;
}
+ /*
+ * When executing CLUSTER CONCURRENTLY, the array of current transactions
+ * is given.
+ */
+ if (nClusterCurrentXids > 0)
+ {
+ Assert(nParallelCurrentXids == 0);
+
+ return bsearch(&xid,
+ ClusterCurrentXids,
+ nClusterCurrentXids,
+ sizeof(TransactionId),
+ xidComparator) != NULL;
+ }
+
/*
* We will return true for the Xid of the current subtransaction, any of
* its subcommitted children, any of its parents, or any of their
@@ -5627,6 +5656,29 @@ EndParallelWorkerTransaction(void)
CurrentTransactionState->blockState = TBLOCK_DEFAULT;
}
+/*
+ * SetClusterCurrentXids
+ * Set the XID array that TransactionIdIsCurrentTransactionId() should
+ * use.
+ */
+void
+SetClusterCurrentXids(TransactionId *xip, int xcnt)
+{
+ ClusterCurrentXids = xip;
+ nClusterCurrentXids = xcnt;
+}
+
+/*
+ * ResetClusterCurrentXids
+ * Undo the effect of SetClusterCurrentXids().
+ */
+void
+ResetClusterCurrentXids(void)
+{
+ ClusterCurrentXids = NULL;
+ nClusterCurrentXids = 0;
+}
+
/*
* ShowTransactionState
* Debug support
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index b5698c9baf..23e40562bd 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -201,6 +201,7 @@ static void apply_concurrent_delete(Relation rel, HeapTuple tup_target,
ConcurrentChange *change);
static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
HeapTuple tup_key,
+ Snapshot snapshot,
IndexInsertState *iistate,
TupleTableSlot *ident_slot,
IndexScanDesc *scan_p);
@@ -2987,6 +2988,9 @@ setup_logical_decoding(Oid relid, const char *slotname, TupleDesc tupdesc)
dstate->relid = relid;
dstate->tstore = tuplestore_begin_heap(false, false,
maintenance_work_mem);
+#ifdef USE_ASSERT_CHECKING
+ dstate->last_change_xid = InvalidTransactionId;
+#endif
dstate->tupdesc = tupdesc;
/* Initialize the descriptor to store the changes ... */
@@ -3139,6 +3143,7 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
tup_exist;
char *change_raw;
ConcurrentChange *change;
+ Snapshot snapshot;
bool isnull[1];
Datum values[1];
@@ -3207,8 +3212,30 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
/*
* Find the tuple to be updated or deleted.
+ *
+ * As the table being CLUSTERed concurrently is considered a
+ * "user catalog", the new CID is WAL-logged and decoded. And since we
+ * use the same XID that the original DMLs did, the snapshot used
+ * for the logical decoding (by now converted to a non-historic
+ * MVCC snapshot) should see the tuples inserted previously into
+ * the new heap and/or updated there.
+ */
+ snapshot = change->snapshot;
+
+ /*
+ * Set what should be considered current transaction (and
+ * subtransactions) during visibility check.
+ *
+ * Note that this snapshot was created from a historic snapshot
+ * using SnapBuildMVCCFromHistoric(), which does not touch
+ * 'subxip'. Thus, unlike in a regular MVCC snapshot, the array
+ * only contains the transactions whose data changes we are
+ * only contains the transactions whose data changes we are
+ * applying, and their subtransactions. That's exactly what we need
+ * to check whether a particular xact is a "current transaction".
*/
- tup_exist = find_target_tuple(rel, key, nkeys, tup_key,
+ SetClusterCurrentXids(snapshot->subxip, snapshot->subxcnt);
+
+ tup_exist = find_target_tuple(rel, key, nkeys, tup_key, snapshot,
iistate, ident_slot, &ind_scan);
if (tup_exist == NULL)
elog(ERROR, "Failed to find target tuple");
@@ -3219,6 +3246,8 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
else
apply_concurrent_delete(rel, tup_exist, change);
+ ResetClusterCurrentXids();
+
if (tup_old != NULL)
{
pfree(tup_old);
@@ -3231,11 +3260,14 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
else
elog(ERROR, "Unrecognized kind of change: %d", change->kind);
- /* If there's any change, make it visible to the next iteration. */
- if (change->kind != CHANGE_UPDATE_OLD)
+ /* Free the snapshot if this is the last change that needed it. */
+ Assert(change->snapshot->active_count > 0);
+ change->snapshot->active_count--;
+ if (change->snapshot->active_count == 0)
{
- CommandCounterIncrement();
- UpdateActiveSnapshotCommandId();
+ if (change->snapshot == dstate->snapshot)
+ dstate->snapshot = NULL;
+ FreeSnapshot(change->snapshot);
}
/* TTSOpsMinimalTuple has .get_heap_tuple==NULL. */
@@ -3255,10 +3287,30 @@ static void
apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
IndexInsertState *iistate, TupleTableSlot *index_slot)
{
+ Snapshot snapshot = change->snapshot;
List *recheck;
+ /*
+ * For INSERT, the visibility information is not important, but we use the
+ * snapshot to get CID. Index functions might need the whole snapshot
+ * anyway.
+ */
+ SetClusterCurrentXids(snapshot->subxip, snapshot->subxcnt);
- heap_insert(rel, tup, GetCurrentCommandId(true), HEAP_INSERT_NO_LOGICAL, NULL);
+ /*
+ * Write the tuple into the new heap.
+ *
+ * The snapshot is the one we used to decode the insert (though converted
+ * to "non-historic" MVCC snapshot), i.e. the snapshot's curcid is the
+ * tuple CID incremented by one (due to the "new CID" WAL record that got
+ * written along with the INSERT record). Thus if we want to use the
+ * original CID, we need to subtract 1 from curcid.
+ */
+ Assert(snapshot->curcid != InvalidCommandId &&
+ snapshot->curcid > FirstCommandId);
+
+ heap_insert(rel, tup, change->xid, snapshot->curcid - 1,
+ HEAP_INSERT_NO_LOGICAL, NULL);
/*
* Update indexes.
@@ -3266,6 +3318,7 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
* In case functions in the index need the active snapshot and caller
* hasn't set one.
*/
+ PushActiveSnapshot(snapshot);
ExecStoreHeapTuple(tup, index_slot, false);
recheck = ExecInsertIndexTuples(iistate->rri,
index_slot,
@@ -3276,6 +3329,8 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
NIL, /* arbiterIndexes */
false /* onlySummarizing */
);
+ PopActiveSnapshot();
+ ResetClusterCurrentXids();
/*
 * If recheck is required, it must have been performed on the source
@@ -3293,18 +3348,36 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
TupleTableSlot *index_slot)
{
List *recheck;
+ LockTupleMode lockmode;
TU_UpdateIndexes update_indexes;
+ TM_Result res;
+ Snapshot snapshot = change->snapshot;
+ TM_FailureData tmfd;
/*
* Write the new tuple into the new heap. ('tup' gets the TID assigned
* here.)
+ *
+ * Regarding CID, see the comment in apply_concurrent_insert().
*/
- simple_heap_update(rel, &tup_target->t_self, tup, &update_indexes);
+ Assert(snapshot->curcid != InvalidCommandId &&
+ snapshot->curcid > FirstCommandId);
+
+ res = heap_update(rel, &tup_target->t_self, tup,
+ change->xid, snapshot->curcid - 1,
+ InvalidSnapshot,
+ false, /* no wait - only we are doing changes */
+ &tmfd, &lockmode, &update_indexes,
+ /* wal_logical */
+ false);
+ if (res != TM_Ok)
+ ereport(ERROR, (errmsg("failed to apply concurrent UPDATE")));
ExecStoreHeapTuple(tup, index_slot, false);
if (update_indexes != TU_None)
{
+ PushActiveSnapshot(snapshot);
recheck = ExecInsertIndexTuples(iistate->rri,
index_slot,
iistate->estate,
@@ -3314,6 +3387,7 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
NIL, /* arbiterIndexes */
/* onlySummarizing */
update_indexes == TU_Summarizing);
+ PopActiveSnapshot();
list_free(recheck);
}
@@ -3324,7 +3398,22 @@ static void
apply_concurrent_delete(Relation rel, HeapTuple tup_target,
ConcurrentChange *change)
{
- simple_heap_delete(rel, &tup_target->t_self);
+ TM_Result res;
+ TM_FailureData tmfd;
+ Snapshot snapshot = change->snapshot;
+
+ /* Regarding CID, see the comment in apply_concurrent_insert(). */
+ Assert(snapshot->curcid != InvalidCommandId &&
+ snapshot->curcid > FirstCommandId);
+
+ res = heap_delete(rel, &tup_target->t_self, change->xid,
+ snapshot->curcid - 1, InvalidSnapshot, false,
+ &tmfd, false,
+ /* wal_logical */
+ false);
+
+ if (res != TM_Ok)
+ ereport(ERROR, (errmsg("failed to apply concurrent DELETE")));
pgstat_progress_incr_param(PROGRESS_CLUSTER_HEAP_TUPLES_DELETED, 1);
}
@@ -3342,7 +3431,7 @@ apply_concurrent_delete(Relation rel, HeapTuple tup_target,
*/
static HeapTuple
find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
- IndexInsertState *iistate,
+ Snapshot snapshot, IndexInsertState *iistate,
TupleTableSlot *ident_slot, IndexScanDesc *scan_p)
{
IndexScanDesc scan;
@@ -3350,7 +3439,7 @@ find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
int2vector *ident_indkey;
HeapTuple result = NULL;
- scan = index_beginscan(rel, iistate->ident_index, GetActiveSnapshot(),
+ scan = index_beginscan(rel, iistate->ident_index, snapshot,
nkeys, 0);
*scan_p = scan;
index_rescan(scan, key, nkeys, NULL, 0);
@@ -3422,6 +3511,8 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
}
PG_FINALLY();
{
+ ResetClusterCurrentXids();
+
if (rel_src)
rel_dst->rd_toastoid = InvalidOid;
}
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 752deb39f7..7526c1a381 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -469,9 +469,18 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
SnapBuild *builder = ctx->snapshot_builder;
/*
- * Check if CLUSTER CONCURRENTLY is being performed by this backend. If
- * so, only decode data changes of the table that it is processing, and
- * the changes of its TOAST relation.
+ * If the change is not intended for logical decoding, do not even
+ * establish a transaction for it. This is particularly important if the
+ * record was generated by CLUSTER CONCURRENTLY because this command uses
+ * the original XID when doing changes in the new storage. The decoding
+ * subsystem probably does not expect to see the same transaction multiple
+ * times.
+ */
+
+ /*
+ * First, check if CLUSTER CONCURRENTLY is being performed by this
+ * backend. If so, only decode data changes of the table that it is
+ * processing, and the changes of its TOAST relation.
*
* (TOAST locator should not be set unless the main is.)
*/
@@ -491,6 +500,60 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
return;
}
+ /*
+ * Second, skip records which do not contain sufficient information for
+ * the decoding.
+ *
+ * The backend executing CLUSTER CONCURRENTLY should not return here
+ * because the records which passed the checks above should be
+ * eligible for decoding. However, CLUSTER CONCURRENTLY generates WAL when
+ * writing data into the new table, which should not be decoded by the
+ * other backends. This is where the other backends skip them.
+ */
+ switch (info)
+ {
+ case XLOG_HEAP_INSERT:
+ {
+ xl_heap_insert *rec;
+
+ rec = (xl_heap_insert *) XLogRecGetData(buf->record);
+ /*
+ * (Besides insertion into the main heap by CLUSTER CONCURRENTLY,
+ * this does happen when raw_heap_insert marks the TOAST record as
+ * HEAP_INSERT_NO_LOGICAL).
+ */
+ if ((rec->flags & XLH_INSERT_CONTAINS_NEW_TUPLE) == 0)
+ return;
+
+ break;
+ }
+
+ case XLOG_HEAP_HOT_UPDATE:
+ case XLOG_HEAP_UPDATE:
+ {
+ xl_heap_update *rec;
+
+ rec = (xl_heap_update *) XLogRecGetData(buf->record);
+ if ((rec->flags &
+ (XLH_UPDATE_CONTAINS_NEW_TUPLE |
+ XLH_UPDATE_CONTAINS_OLD_TUPLE |
+ XLH_UPDATE_CONTAINS_OLD_KEY)) == 0)
+ return;
+
+ break;
+ }
+
+ case XLOG_HEAP_DELETE:
+ {
+ xl_heap_delete *rec;
+
+ rec = (xl_heap_delete *) XLogRecGetData(buf->record);
+ if (rec->flags & XLH_DELETE_NO_LOGICAL)
+ return;
+ break;
+ }
+ }
+
ReorderBufferProcessXid(ctx->reorder, xid, buf->origptr);
/*
@@ -927,13 +990,6 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
xlrec = (xl_heap_insert *) XLogRecGetData(r);
- /*
- * Ignore insert records without new tuples (this does happen when
- * raw_heap_insert marks the TOAST record as HEAP_INSERT_NO_LOGICAL).
- */
- if (!(xlrec->flags & XLH_INSERT_CONTAINS_NEW_TUPLE))
- return;
-
/* only interested in our database */
XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
if (target_locator.dbOid != ctx->slot->data.database)
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 4492e2ae46..8e1f4bb851 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -294,7 +294,7 @@ static bool ExportInProgress = false;
static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
/* snapshot building/manipulation/distribution functions */
-static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
+static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder, XLogRecPtr lsn);
static void SnapBuildFreeSnapshot(Snapshot snap);
@@ -491,12 +491,17 @@ SnapBuildSnapDecRefcount(Snapshot snap)
* Build a new snapshot, based on currently committed catalog-modifying
* transactions.
*
+ * 'lsn' is the location of the commit record (of a catalog-changing
+ * transaction) that triggered creation of the snapshot. Pass
+ * InvalidXLogRecPtr for the transaction base snapshot or if the user of
+ * the snapshot should not need the LSN.
+ *
* In-progress transactions with catalog access are *not* allowed to modify
* these snapshots; they have to copy them and fill in appropriate ->curcid
* and ->subxip/subxcnt values.
*/
static Snapshot
-SnapBuildBuildSnapshot(SnapBuild *builder)
+SnapBuildBuildSnapshot(SnapBuild *builder, XLogRecPtr lsn)
{
Snapshot snapshot;
Size ssize;
@@ -564,6 +569,7 @@ SnapBuildBuildSnapshot(SnapBuild *builder)
snapshot->active_count = 0;
snapshot->regd_count = 0;
snapshot->snapXactCompletionCount = 0;
+ snapshot->lsn = lsn;
return snapshot;
}
@@ -600,7 +606,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
if (TransactionIdIsValid(MyProc->xmin))
elog(ERROR, "cannot build an initial slot snapshot when MyProc->xmin already is valid");
- snap = SnapBuildBuildSnapshot(builder);
+ snap = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
/*
* We know that snap->xmin is alive, enforced by the logical xmin
@@ -641,7 +647,7 @@ SnapBuildInitialSnapshotForCluster(SnapBuild *builder)
Assert(builder->state == SNAPBUILD_CONSISTENT);
- snap = SnapBuildBuildSnapshot(builder);
+ snap = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
return SnapBuildMVCCFromHistoric(snap, false);
}
@@ -775,7 +781,7 @@ SnapBuildGetOrBuildSnapshot(SnapBuild *builder)
/* only build a new snapshot if we don't have a prebuilt one */
if (builder->snapshot == NULL)
{
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
/* increase refcount for the snapshot builder */
SnapBuildSnapIncRefcount(builder->snapshot);
}
@@ -855,7 +861,7 @@ SnapBuildProcessChange(SnapBuild *builder, TransactionId xid, XLogRecPtr lsn)
/* only build a new snapshot if we don't have a prebuilt one */
if (builder->snapshot == NULL)
{
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, lsn);
/* increase refcount for the snapshot builder */
SnapBuildSnapIncRefcount(builder->snapshot);
}
@@ -1224,7 +1230,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
if (builder->snapshot)
SnapBuildSnapDecRefcount(builder->snapshot);
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, lsn);
/* we might need to execute invalidations, add snapshot */
if (!ReorderBufferXidHasBaseSnapshot(builder->reorder, xid))
@@ -2062,7 +2068,7 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
{
SnapBuildSnapDecRefcount(builder->snapshot);
}
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
SnapBuildSnapIncRefcount(builder->snapshot);
ReorderBufferSetRestartPoint(builder->reorder, lsn);
diff --git a/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c b/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
index c6baca1171..db6a2bcf1f 100644
--- a/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
+++ b/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
@@ -33,7 +33,8 @@ static void plugin_truncate(struct LogicalDecodingContext *ctx,
Relation relations[],
ReorderBufferChange *change);
static void store_change(LogicalDecodingContext *ctx,
- ConcurrentChangeKind kind, HeapTuple tuple);
+ ConcurrentChangeKind kind, HeapTuple tuple,
+ TransactionId xid);
void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
@@ -101,6 +102,7 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change)
{
ClusterDecodingState *dstate;
+ Snapshot snapshot;
dstate = (ClusterDecodingState *) ctx->output_writer_private;
@@ -108,6 +110,48 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (relation->rd_id != dstate->relid)
return;
+ /*
+ * Catalog snapshot is fine because the table we are processing is
+ * temporarily considered a user catalog table.
+ */
+ snapshot = GetCatalogSnapshot(InvalidOid);
+ Assert(snapshot->snapshot_type == SNAPSHOT_HISTORIC_MVCC);
+ Assert(!snapshot->suboverflowed);
+
+ /*
+ * This should not happen, but if we don't have enough information to
+ * apply a new snapshot, the consequences would be bad. Thus prefer ERROR
+ * to Assert().
+ */
+ if (XLogRecPtrIsInvalid(snapshot->lsn))
+ ereport(ERROR, (errmsg("snapshot has invalid LSN")));
+
+ /*
+ * reorderbuffer.c changes the catalog snapshot as soon as it sees a new
+ * CID or a commit record of a catalog-changing transaction.
+ */
+ if (dstate->snapshot == NULL || snapshot->lsn != dstate->snapshot_lsn ||
+ snapshot->curcid != dstate->snapshot->curcid)
+ {
+ /* CID should not go backwards. */
+ Assert(dstate->snapshot == NULL ||
+ snapshot->curcid >= dstate->snapshot->curcid ||
+ change->txn->xid != dstate->last_change_xid);
+
+ /*
+ * XXX Is it a problem that the copy is created in
+ * TopTransactionContext?
+ *
+ * XXX Wouldn't it be o.k. for SnapBuildMVCCFromHistoric() to set xcnt
+ * to 0 instead of converting xip in this case? The point is that
+ * transactions which are still in progress from the perspective of
+ * reorderbuffer.c could not be replayed yet, so we do not need to
+ * examine their XIDs.
+ */
+ dstate->snapshot = SnapBuildMVCCFromHistoric(snapshot, false);
+ dstate->snapshot_lsn = snapshot->lsn;
+ }
+
/* Decode entry depending on its type */
switch (change->action)
{
@@ -125,7 +169,7 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (newtuple == NULL)
elog(ERROR, "Incomplete insert info.");
- store_change(ctx, CHANGE_INSERT, newtuple);
+ store_change(ctx, CHANGE_INSERT, newtuple, change->txn->xid);
}
break;
case REORDER_BUFFER_CHANGE_UPDATE:
@@ -142,9 +186,11 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
elog(ERROR, "Incomplete update info.");
if (oldtuple != NULL)
- store_change(ctx, CHANGE_UPDATE_OLD, oldtuple);
+ store_change(ctx, CHANGE_UPDATE_OLD, oldtuple,
+ change->txn->xid);
- store_change(ctx, CHANGE_UPDATE_NEW, newtuple);
+ store_change(ctx, CHANGE_UPDATE_NEW, newtuple,
+ change->txn->xid);
}
break;
case REORDER_BUFFER_CHANGE_DELETE:
@@ -157,7 +203,7 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (oldtuple == NULL)
elog(ERROR, "Incomplete delete info.");
- store_change(ctx, CHANGE_DELETE, oldtuple);
+ store_change(ctx, CHANGE_DELETE, oldtuple, change->txn->xid);
}
break;
default:
@@ -191,13 +237,13 @@ plugin_truncate(struct LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (i == nrelations)
return;
- store_change(ctx, CHANGE_TRUNCATE, NULL);
+ store_change(ctx, CHANGE_TRUNCATE, NULL, InvalidTransactionId);
}
/* Store concurrent data change. */
static void
store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
- HeapTuple tuple)
+ HeapTuple tuple, TransactionId xid)
{
ClusterDecodingState *dstate;
char *change_raw;
@@ -257,6 +303,11 @@ store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
dst = (char *) change + sizeof(ConcurrentChange);
memcpy(dst, tuple->t_data, tuple->t_len);
+ /* Initialize the other fields. */
+ change->xid = xid;
+ change->snapshot = dstate->snapshot;
+ dstate->snapshot->active_count++;
+
/* The data has been copied. */
if (flattened)
pfree(tuple);
@@ -267,6 +318,9 @@ store:
isnull[0] = false;
tuplestore_putvalues(dstate->tstore, dstate->tupdesc_change,
values, isnull);
+#ifdef USE_ASSERT_CHECKING
+ dstate->last_change_xid = xid;
+#endif
/* Accounting. */
dstate->nchanges++;
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 8687ec8796..e87eb2f861 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -316,21 +316,24 @@ extern BulkInsertState GetBulkInsertState(void);
extern void FreeBulkInsertState(BulkInsertState);
extern void ReleaseBulkInsertStatePin(BulkInsertState bistate);
-extern void heap_insert(Relation relation, HeapTuple tup, CommandId cid,
- int options, BulkInsertState bistate);
+extern void heap_insert(Relation relation, HeapTuple tup, TransactionId xid,
+ CommandId cid, int options, BulkInsertState bistate);
extern void heap_multi_insert(Relation relation, struct TupleTableSlot **slots,
int ntuples, CommandId cid, int options,
BulkInsertState bistate);
extern TM_Result heap_delete(Relation relation, ItemPointer tid,
- CommandId cid, Snapshot crosscheck, bool wait,
- struct TM_FailureData *tmfd, bool changingPart);
+ TransactionId xid, CommandId cid,
+ Snapshot crosscheck, bool wait,
+ struct TM_FailureData *tmfd, bool changingPart,
+ bool wal_logical);
extern void heap_finish_speculative(Relation relation, ItemPointer tid);
extern void heap_abort_speculative(Relation relation, ItemPointer tid);
extern TM_Result heap_update(Relation relation, ItemPointer otid,
- HeapTuple newtup,
+ HeapTuple newtup, TransactionId xid,
CommandId cid, Snapshot crosscheck, bool wait,
struct TM_FailureData *tmfd, LockTupleMode *lockmode,
- TU_UpdateIndexes *update_indexes);
+ TU_UpdateIndexes *update_indexes,
+ bool wal_logical);
extern TM_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
CommandId cid, LockTupleMode mode, LockWaitPolicy wait_policy,
bool follow_updates,
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 42736f37e7..1c5cb7c728 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -103,6 +103,8 @@
#define XLH_DELETE_CONTAINS_OLD_KEY (1<<2)
#define XLH_DELETE_IS_SUPER (1<<3)
#define XLH_DELETE_IS_PARTITION_MOVE (1<<4)
+/* See heap_delete() */
+#define XLH_DELETE_NO_LOGICAL (1<<5)
/* convenience macro for checking whether any form of old tuple was logged */
#define XLH_DELETE_CONTAINS_OLD \
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 6d4439f052..e0016631f6 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -476,6 +476,8 @@ extern Size EstimateTransactionStateSpace(void);
extern void SerializeTransactionState(Size maxsize, char *start_address);
extern void StartParallelWorkerTransaction(char *tstatespace);
extern void EndParallelWorkerTransaction(void);
+extern void SetClusterCurrentXids(TransactionId *xip, int xcnt);
+extern void ResetClusterCurrentXids(void);
extern bool IsTransactionBlock(void);
extern bool IsTransactionOrTransactionBlock(void);
extern char TransactionBlockStatusCode(void);
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 959899a7cc..61ea314399 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -71,6 +71,14 @@ typedef struct ConcurrentChange
/* See the enum above. */
ConcurrentChangeKind kind;
+ /* Transaction that changes the data. */
+ TransactionId xid;
+
+ /*
+ * Historic catalog snapshot that was used to decode this change.
+ */
+ Snapshot snapshot;
+
/*
* The actual tuple.
*
@@ -99,6 +107,8 @@ typedef struct ClusterDecodingState
* tuplestore does this transparently.
*/
Tuplestorestate *tstore;
+ /* XID of the last change added to tstore. */
+ TransactionId last_change_xid PG_USED_FOR_ASSERTS_ONLY;
/* The current number of changes in tstore. */
double nchanges;
@@ -119,6 +129,14 @@ typedef struct ClusterDecodingState
/* Slot to retrieve data from tstore. */
TupleTableSlot *tsslot;
+ /*
+ * Historic catalog snapshot that was used to decode the most recent
+ * change.
+ */
+ Snapshot snapshot;
+ /* LSN of the record that triggered creation of the snapshot. */
+ XLogRecPtr snapshot_lsn;
+
ResourceOwner resowner;
} ClusterDecodingState;
--
2.45.2
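The xact.c hunks above make TransactionIdIsCurrentTransactionId() consult a sorted array of "foreign" XIDs while CLUSTER CONCURRENTLY replays changes. The mechanism can be modeled roughly as follows (an illustrative Python sketch only; the names mirror the patch but none of this is the actual implementation):

```python
from bisect import bisect_left

# Sorted array of XIDs (plus their subtransactions) whose changes are being
# applied -- the analogue of ClusterCurrentXids / nClusterCurrentXids.
_cluster_current_xids: list[int] = []

def set_cluster_current_xids(xids: list[int]) -> None:
    """Analogue of SetClusterCurrentXids(); input must be sorted."""
    global _cluster_current_xids
    _cluster_current_xids = xids

def reset_cluster_current_xids() -> None:
    """Analogue of ResetClusterCurrentXids()."""
    global _cluster_current_xids
    _cluster_current_xids = []

def xid_is_current(xid: int, my_xids: set[int]) -> bool:
    """Analogue of TransactionIdIsCurrentTransactionId(): while the override
    array is set, 'current' means 'present in the array' (binary search,
    like bsearch with xidComparator); otherwise fall back to the backend's
    own (sub)transactions."""
    if _cluster_current_xids:
        i = bisect_left(_cluster_current_xids, xid)
        return i < len(_cluster_current_xids) and _cluster_current_xids[i] == xid
    return xid in my_xids
```

The point of the override is that, during apply, tuples stamped with the original writers' XIDs must look like "our own" changes to the visibility code, even though the CLUSTER backend never ran those transactions.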
Attachment: v04-0005-Preserve-visibility-information-of-the-concurrent-da.patch (text/x-diff)
From 66bce6a7fdc63a1072ffd9c86ac0855edd0094f7 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Fri, 6 Sep 2024 09:55:54 +0200
Subject: [PATCH 5/8] Preserve visibility information of the concurrent data
changes.
As explained in the commit message of the preceding patch of the series, the
data changes done by applications while VACUUM FULL / CLUSTER CONCURRENTLY is
copying the table contents to a new file are decoded from WAL and eventually
also applied to the new file. To reduce the complexity a little bit, the
preceding patch uses the current transaction (i.e. the transaction opened by the
VACUUM FULL / CLUSTER command) to execute those INSERT, UPDATE and DELETE
commands.
However, neither VACUUM nor CLUSTER is expected to change visibility of
tuples. Therefore, this patch fixes the handling of the "concurrent data
changes". Now the tuples written into the new table storage have the same XID
and command ID (CID) as they had in the old storage.
A related change made here is that the data changes (INSERT, UPDATE, DELETE) we
"replay" on the new storage are not logically decoded. First, the logical
decoding subsystem does not expect an already committed transaction to be
decoded again. Second, repeated decoding would be just wasted effort.
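Conceptually, each decoded change carries the original XID and a snapshot whose curcid is one past the original CID (the "new CID" WAL record bumped it during decoding), so the apply path stamps the tuple with xid and curcid - 1. A minimal sketch of that bookkeeping (illustrative only; the dataclasses are hypothetical stand-ins for ConcurrentChange and the heap tuple header):

```python
from dataclasses import dataclass

@dataclass
class Change:
    kind: str     # 'INSERT', 'UPDATE', or 'DELETE'
    xid: int      # XID of the transaction that made the change
    curcid: int   # snapshot's curcid: the original CID + 1

@dataclass
class StoredTuple:
    data: bytes
    xmin: int     # inserting XID, same as in the old storage
    cid: int      # inserting CID, same as in the old storage

def apply_insert(change: Change, data: bytes) -> StoredTuple:
    """Analogue of apply_concurrent_insert(): stamp the tuple with the
    original transaction's XID and CID (curcid - 1), not with the
    CLUSTER backend's own transaction."""
    assert change.curcid > 0  # mirrors the curcid > FirstCommandId Assert
    return StoredTuple(data=data, xmin=change.xid, cid=change.curcid - 1)
```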
---
src/backend/access/common/toast_internals.c | 3 +-
src/backend/access/heap/heapam.c | 73 ++++++++----
src/backend/access/heap/heapam_handler.c | 14 ++-
src/backend/access/transam/xact.c | 52 ++++++++
src/backend/commands/cluster.c | 111 ++++++++++++++++--
src/backend/replication/logical/decode.c | 76 ++++++++++--
src/backend/replication/logical/snapbuild.c | 22 ++--
.../pgoutput_cluster/pgoutput_cluster.c | 68 +++++++++--
src/include/access/heapam.h | 15 ++-
src/include/access/heapam_xlog.h | 2 +
src/include/access/xact.h | 2 +
src/include/commands/cluster.h | 18 +++
12 files changed, 386 insertions(+), 70 deletions(-)
diff --git a/src/backend/access/common/toast_internals.c b/src/backend/access/common/toast_internals.c
index 90d0654e62..183055647b 100644
--- a/src/backend/access/common/toast_internals.c
+++ b/src/backend/access/common/toast_internals.c
@@ -320,7 +320,8 @@ toast_save_datum(Relation rel, Datum value,
memcpy(VARDATA(&chunk_data), data_p, chunk_size);
toasttup = heap_form_tuple(toasttupDesc, t_values, t_isnull);
- heap_insert(toastrel, toasttup, mycid, options, NULL);
+ heap_insert(toastrel, toasttup, GetCurrentTransactionId(), mycid,
+ options, NULL);
/*
* Create the index entry. We cheat a little here by not using
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 1fdcc0abee..69bf4d1c8d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -75,7 +75,8 @@ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
HeapTuple newtup, HeapTuple old_key_tuple,
- bool all_visible_cleared, bool new_all_visible_cleared);
+ bool all_visible_cleared, bool new_all_visible_cleared,
+ bool wal_logical);
static Bitmapset *HeapDetermineColumnsInfo(Relation relation,
Bitmapset *interesting_cols,
Bitmapset *external_cols,
@@ -1975,7 +1976,7 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
/*
* heap_insert - insert tuple into a heap
*
- * The new tuple is stamped with current transaction ID and the specified
+ * The new tuple is stamped with the specified transaction ID and the specified
* command ID.
*
* See table_tuple_insert for comments about most of the input flags, except
@@ -1991,15 +1992,16 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
* reflected into *tup.
*/
void
-heap_insert(Relation relation, HeapTuple tup, CommandId cid,
- int options, BulkInsertState bistate)
+heap_insert(Relation relation, HeapTuple tup, TransactionId xid,
+ CommandId cid, int options, BulkInsertState bistate)
{
- TransactionId xid = GetCurrentTransactionId();
HeapTuple heaptup;
Buffer buffer;
Buffer vmbuffer = InvalidBuffer;
bool all_visible_cleared = false;
+ Assert(TransactionIdIsValid(xid));
+
/* Cheap, simplistic check that the tuple matches the rel's rowtype. */
Assert(HeapTupleHeaderGetNatts(tup->t_data) <=
RelationGetNumberOfAttributes(relation));
@@ -2630,7 +2632,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
void
simple_heap_insert(Relation relation, HeapTuple tup)
{
- heap_insert(relation, tup, GetCurrentCommandId(true), 0, NULL);
+ heap_insert(relation, tup, GetCurrentTransactionId(),
+ GetCurrentCommandId(true), 0, NULL);
}
/*
@@ -2687,11 +2690,11 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
*/
TM_Result
heap_delete(Relation relation, ItemPointer tid,
- CommandId cid, Snapshot crosscheck, bool wait,
- TM_FailureData *tmfd, bool changingPart)
+ TransactionId xid, CommandId cid, Snapshot crosscheck, bool wait,
+ TM_FailureData *tmfd, bool changingPart,
+ bool wal_logical)
{
TM_Result result;
- TransactionId xid = GetCurrentTransactionId();
ItemId lp;
HeapTupleData tp;
Page page;
@@ -2708,6 +2711,7 @@ heap_delete(Relation relation, ItemPointer tid,
bool old_key_copied = false;
Assert(ItemPointerIsValid(tid));
+ Assert(TransactionIdIsValid(xid));
/*
* Forbid this during a parallel operation, lest it allocate a combo CID.
@@ -2933,7 +2937,8 @@ l1:
* Compute replica identity tuple before entering the critical section so
* we don't PANIC upon a memory allocation failure.
*/
- old_key_tuple = ExtractReplicaIdentity(relation, &tp, true, &old_key_copied);
+ old_key_tuple = wal_logical ?
+ ExtractReplicaIdentity(relation, &tp, true, &old_key_copied) : NULL;
/*
* If this is the first possibly-multixact-able operation in the current
@@ -3001,8 +3006,12 @@ l1:
/*
* For logical decode we need combo CIDs to properly decode the
* catalog
+ *
+ * Like in heap_insert(), visibility is unchanged when called from
+ * VACUUM FULL / CLUSTER.
*/
- if (RelationIsAccessibleInLogicalDecoding(relation))
+ if (wal_logical &&
+ RelationIsAccessibleInLogicalDecoding(relation))
log_heap_new_cid(relation, &tp);
xlrec.flags = 0;
@@ -3023,6 +3032,15 @@ l1:
xlrec.flags |= XLH_DELETE_CONTAINS_OLD_KEY;
}
+ /*
+ * Unlike UPDATE, DELETE is decoded even if there is no old key, so it
+ * does not help to clear both XLH_DELETE_CONTAINS_OLD_TUPLE and
+ * XLH_DELETE_CONTAINS_OLD_KEY. Thus we need an extra flag. TODO
+ * Consider not decoding tuples without the old tuple/key instead.
+ */
+ if (!wal_logical)
+ xlrec.flags |= XLH_DELETE_NO_LOGICAL;
+
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapDelete);
@@ -3112,10 +3130,11 @@ simple_heap_delete(Relation relation, ItemPointer tid)
TM_Result result;
TM_FailureData tmfd;
- result = heap_delete(relation, tid,
+ result = heap_delete(relation, tid, GetCurrentTransactionId(),
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &tmfd, false /* changingPart */ );
+ &tmfd, false, /* changingPart */
+ true /* wal_logical */);
switch (result)
{
case TM_SelfModified:
@@ -3154,12 +3173,11 @@ simple_heap_delete(Relation relation, ItemPointer tid)
*/
TM_Result
heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
- CommandId cid, Snapshot crosscheck, bool wait,
- TM_FailureData *tmfd, LockTupleMode *lockmode,
- TU_UpdateIndexes *update_indexes)
+ TransactionId xid, CommandId cid, Snapshot crosscheck,
+ bool wait, TM_FailureData *tmfd, LockTupleMode *lockmode,
+ TU_UpdateIndexes *update_indexes, bool wal_logical)
{
TM_Result result;
- TransactionId xid = GetCurrentTransactionId();
Bitmapset *hot_attrs;
Bitmapset *sum_attrs;
Bitmapset *key_attrs;
@@ -3199,6 +3217,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
infomask2_new_tuple;
Assert(ItemPointerIsValid(otid));
+ Assert(TransactionIdIsValid(xid));
/* Cheap, simplistic check that the tuple matches the rel's rowtype. */
Assert(HeapTupleHeaderGetNatts(newtup->t_data) <=
@@ -3987,8 +4006,12 @@ l2:
/*
* For logical decoding we need combo CIDs to properly decode the
* catalog.
+ *
+ * Like in heap_insert(), visibility is unchanged when called from
+ * VACUUM FULL / CLUSTER.
*/
- if (RelationIsAccessibleInLogicalDecoding(relation))
+ if (wal_logical &&
+ RelationIsAccessibleInLogicalDecoding(relation))
{
log_heap_new_cid(relation, &oldtup);
log_heap_new_cid(relation, heaptup);
@@ -3998,7 +4021,8 @@ l2:
newbuf, &oldtup, heaptup,
old_key_tuple,
all_visible_cleared,
- all_visible_cleared_new);
+ all_visible_cleared_new,
+ wal_logical);
if (newbuf != buffer)
{
PageSetLSN(BufferGetPage(newbuf), recptr);
@@ -4231,10 +4255,10 @@ simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup,
TM_FailureData tmfd;
LockTupleMode lockmode;
- result = heap_update(relation, otid, tup,
+ result = heap_update(relation, otid, tup, GetCurrentTransactionId(),
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &tmfd, &lockmode, update_indexes);
+ &tmfd, &lockmode, update_indexes, true);
switch (result)
{
case TM_SelfModified:
@@ -8363,7 +8387,8 @@ static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup, HeapTuple newtup,
HeapTuple old_key_tuple,
- bool all_visible_cleared, bool new_all_visible_cleared)
+ bool all_visible_cleared, bool new_all_visible_cleared,
+ bool wal_logical)
{
xl_heap_update xlrec;
xl_heap_header xlhdr;
@@ -8374,10 +8399,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
suffixlen = 0;
XLogRecPtr recptr;
Page page = BufferGetPage(newbuf);
- bool need_tuple_data = RelationIsLogicallyLogged(reln);
+ bool need_tuple_data;
bool init;
int bufflags;
+ need_tuple_data = RelationIsLogicallyLogged(reln) && wal_logical;
+
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 4ddb1c4a0c..a8999a3e72 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -252,7 +252,8 @@ heapam_tuple_insert(Relation relation, TupleTableSlot *slot, CommandId cid,
tuple->t_tableOid = slot->tts_tableOid;
/* Perform the insertion, and copy the resulting ItemPointer */
- heap_insert(relation, tuple, cid, options, bistate);
+ heap_insert(relation, tuple, GetCurrentTransactionId(), cid, options,
+ bistate);
ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
if (shouldFree)
@@ -275,7 +276,8 @@ heapam_tuple_insert_speculative(Relation relation, TupleTableSlot *slot,
options |= HEAP_INSERT_SPECULATIVE;
/* Perform the insertion, and copy the resulting ItemPointer */
- heap_insert(relation, tuple, cid, options, bistate);
+ heap_insert(relation, tuple, GetCurrentTransactionId(), cid, options,
+ bistate);
ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
if (shouldFree)
@@ -309,7 +311,8 @@ heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
* the storage itself is cleaning the dead tuples by itself, it is the
* time to call the index tuple deletion also.
*/
- return heap_delete(relation, tid, cid, crosscheck, wait, tmfd, changingPart);
+ return heap_delete(relation, tid, GetCurrentTransactionId(), cid,
+ crosscheck, wait, tmfd, changingPart, true);
}
@@ -327,8 +330,9 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
slot->tts_tableOid = RelationGetRelid(relation);
tuple->t_tableOid = slot->tts_tableOid;
- result = heap_update(relation, otid, tuple, cid, crosscheck, wait,
- tmfd, lockmode, update_indexes);
+ result = heap_update(relation, otid, tuple, GetCurrentTransactionId(),
+ cid, crosscheck, wait,
+ tmfd, lockmode, update_indexes, true);
ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
/*
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 87700c7c5c..159d2c7983 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -125,6 +125,18 @@ static FullTransactionId XactTopFullTransactionId = {InvalidTransactionId};
static int nParallelCurrentXids = 0;
static TransactionId *ParallelCurrentXids;
+/*
+ * Another case that requires TransactionIdIsCurrentTransactionId() to behave
+ * specially is when CLUSTER CONCURRENTLY is processing data changes made in
+ * the old storage of a table by other transactions. When applying the changes
+ * to the new storage, the backend executing the CLUSTER command needs to act
+ * on behalf on those other transactions. The transactions responsible for the
+ * changes in the old storage are stored in this array, sorted by
+ * xidComparator.
+ */
+static int nClusterCurrentXids = 0;
+static TransactionId *ClusterCurrentXids = NULL;
+
/*
* Miscellaneous flag bits to record events which occur on the top level
* transaction. These flags are only persisted in MyXactFlags and are intended
@@ -971,6 +983,8 @@ TransactionIdIsCurrentTransactionId(TransactionId xid)
int low,
high;
+ Assert(nClusterCurrentXids == 0);
+
low = 0;
high = nParallelCurrentXids - 1;
while (low <= high)
@@ -990,6 +1004,21 @@ TransactionIdIsCurrentTransactionId(TransactionId xid)
return false;
}
+ /*
+ * When executing CLUSTER CONCURRENTLY, the array of current transactions
+ * is given.
+ */
+ if (nClusterCurrentXids > 0)
+ {
+ Assert(nParallelCurrentXids == 0);
+
+ return bsearch(&xid,
+ ClusterCurrentXids,
+ nClusterCurrentXids,
+ sizeof(TransactionId),
+ xidComparator) != NULL;
+ }
+
/*
* We will return true for the Xid of the current subtransaction, any of
* its subcommitted children, any of its parents, or any of their
@@ -5627,6 +5656,29 @@ EndParallelWorkerTransaction(void)
CurrentTransactionState->blockState = TBLOCK_DEFAULT;
}
+/*
+ * SetClusterCurrentXids
+ * Set the XID array that TransactionIdIsCurrentTransactionId() should
+ * use.
+ */
+void
+SetClusterCurrentXids(TransactionId *xip, int xcnt)
+{
+ ClusterCurrentXids = xip;
+ nClusterCurrentXids = xcnt;
+}
+
+/*
+ * ResetClusterCurrentXids
+ * Undo the effect of SetClusterCurrentXids().
+ */
+void
+ResetClusterCurrentXids(void)
+{
+ ClusterCurrentXids = NULL;
+ nClusterCurrentXids = 0;
+}
+
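As a reviewer aid (not part of the patch): the new branch in TransactionIdIsCurrentTransactionId() is just a bsearch over a sorted XID array installed by the CLUSTER backend. The standalone sketch below mimics that lookup with made-up names (`set_cluster_current_xids` etc. are hypothetical stand-ins, and the comparator is plain numeric order, as in xidComparator):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t TransactionId;

/* Numeric comparison, like PostgreSQL's xidComparator: the CLUSTER xid
 * array is sorted by plain value, not by transaction age. */
static int
xid_cmp(const void *a, const void *b)
{
    TransactionId xa = *(const TransactionId *) a;
    TransactionId xb = *(const TransactionId *) b;

    if (xa < xb)
        return -1;
    if (xa > xb)
        return 1;
    return 0;
}

static TransactionId *cluster_xids = NULL;
static int nclusterxids = 0;

/* Analogous to SetClusterCurrentXids(): install the array of XIDs whose
 * changes are being applied to the new storage. */
static void
set_cluster_current_xids(TransactionId *xip, int xcnt)
{
    cluster_xids = xip;
    nclusterxids = xcnt;
}

/* Analogous to ResetClusterCurrentXids(). */
static void
reset_cluster_current_xids(void)
{
    cluster_xids = NULL;
    nclusterxids = 0;
}

/* The CLUSTER CONCURRENTLY branch of the "is this xid current?" check:
 * membership in the installed array decides the answer. */
static int
is_cluster_current_xid(TransactionId xid)
{
    if (nclusterxids == 0)
        return 0;
    return bsearch(&xid, cluster_xids, nclusterxids,
                   sizeof(TransactionId), xid_cmp) != NULL;
}
```

The point of the array being sorted is that the visibility check can run once per tuple at bsearch cost, and resetting the array restores the normal code path.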
/*
* ShowTransactionState
* Debug support
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index b5698c9baf..23e40562bd 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -201,6 +201,7 @@ static void apply_concurrent_delete(Relation rel, HeapTuple tup_target,
ConcurrentChange *change);
static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
HeapTuple tup_key,
+ Snapshot snapshot,
IndexInsertState *iistate,
TupleTableSlot *ident_slot,
IndexScanDesc *scan_p);
@@ -2987,6 +2988,9 @@ setup_logical_decoding(Oid relid, const char *slotname, TupleDesc tupdesc)
dstate->relid = relid;
dstate->tstore = tuplestore_begin_heap(false, false,
maintenance_work_mem);
+#ifdef USE_ASSERT_CHECKING
+ dstate->last_change_xid = InvalidTransactionId;
+#endif
dstate->tupdesc = tupdesc;
/* Initialize the descriptor to store the changes ... */
@@ -3139,6 +3143,7 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
tup_exist;
char *change_raw;
ConcurrentChange *change;
+ Snapshot snapshot;
bool isnull[1];
Datum values[1];
@@ -3207,8 +3212,30 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
/*
* Find the tuple to be updated or deleted.
+ *
+ * As the table being CLUSTERed concurrently is considered a
+ * "user catalog", the new CID is WAL-logged and decoded. And since we
+ * use the same XID that the original DMLs did, the snapshot used
+ * for the logical decoding (by now converted to a non-historic
+ * MVCC snapshot) should see the tuples inserted previously into
+ * the new heap and/or updated there.
+ */
+ snapshot = change->snapshot;
+
+ /*
+ * Set what should be considered current transaction (and
+ * subtransactions) during visibility check.
+ *
+ * Note that this snapshot was created from a historic snapshot
+ * using SnapBuildMVCCFromHistoric(), which does not touch
+ * 'subxip'. Thus, unlike in a regular MVCC snapshot, the array
+ * only contains the transactions whose data changes we are
+ * applying, and its subtransactions. That's exactly what we need
+ * to check whether a particular xact is a "current transaction".
*/
- tup_exist = find_target_tuple(rel, key, nkeys, tup_key,
+ SetClusterCurrentXids(snapshot->subxip, snapshot->subxcnt);
+
+ tup_exist = find_target_tuple(rel, key, nkeys, tup_key, snapshot,
iistate, ident_slot, &ind_scan);
if (tup_exist == NULL)
elog(ERROR, "Failed to find target tuple");
@@ -3219,6 +3246,8 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
else
apply_concurrent_delete(rel, tup_exist, change);
+ ResetClusterCurrentXids();
+
if (tup_old != NULL)
{
pfree(tup_old);
@@ -3231,11 +3260,14 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
else
elog(ERROR, "Unrecognized kind of change: %d", change->kind);
- /* If there's any change, make it visible to the next iteration. */
- if (change->kind != CHANGE_UPDATE_OLD)
+ /* Free the snapshot if this is the last change that needed it. */
+ Assert(change->snapshot->active_count > 0);
+ change->snapshot->active_count--;
+ if (change->snapshot->active_count == 0)
{
- CommandCounterIncrement();
- UpdateActiveSnapshotCommandId();
+ if (change->snapshot == dstate->snapshot)
+ dstate->snapshot = NULL;
+ FreeSnapshot(change->snapshot);
}
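Not part of the patch, but to illustrate the refcounting replacing the old CommandCounterIncrement() call above: each stored change pins its snapshot via `active_count`, and whichever consumer drops the count to zero frees the snapshot, clearing the decoding state's cached pointer if it still points there. A toy sketch (all names here are invented for the sketch):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-ins for Snapshot and ClusterDecodingState. */
typedef struct ToySnapshot { int active_count; } ToySnapshot;
typedef struct ToyDecodingState { ToySnapshot *snapshot; } ToyDecodingState;

static int toy_free_calls = 0;

static void
toy_free_snapshot(ToySnapshot *snap)
{
    toy_free_calls++;
    free(snap);
}

/* Release one change's reference to its snapshot; returns 1 when this
 * was the last reference and the snapshot got freed. */
static int
toy_release_change_snapshot(ToyDecodingState *dstate, ToySnapshot *snap)
{
    assert(snap->active_count > 0);
    snap->active_count--;
    if (snap->active_count == 0)
    {
        /* Do not leave a dangling cached pointer behind. */
        if (snap == dstate->snapshot)
            dstate->snapshot = NULL;
        toy_free_snapshot(snap);
        return 1;
    }
    return 0;
}
```

The `dstate->snapshot = NULL` step mirrors the patch: the decoding state may still cache the snapshot that the last change just released.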
/* TTSOpsMinimalTuple has .get_heap_tuple==NULL. */
@@ -3255,10 +3287,30 @@ static void
apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
IndexInsertState *iistate, TupleTableSlot *index_slot)
{
+ Snapshot snapshot = change->snapshot;
List *recheck;
+ /*
+ * For INSERT, the visibility information is not important, but we use the
+ * snapshot to get CID. Index functions might need the whole snapshot
+ * anyway.
+ */
+ SetClusterCurrentXids(snapshot->subxip, snapshot->subxcnt);
- heap_insert(rel, tup, GetCurrentCommandId(true), HEAP_INSERT_NO_LOGICAL, NULL);
+ /*
+ * Write the tuple into the new heap.
+ *
+ * The snapshot is the one we used to decode the insert (though converted
+ * to "non-historic" MVCC snapshot), i.e. the snapshot's curcid is the
+ * tuple CID incremented by one (due to the "new CID" WAL record that got
+ * written along with the INSERT record). Thus if we want to use the
+ * original CID, we need to subtract 1 from curcid.
+ */
+ Assert(snapshot->curcid != InvalidCommandId &&
+ snapshot->curcid > FirstCommandId);
+
+ heap_insert(rel, tup, change->xid, snapshot->curcid - 1,
+ HEAP_INSERT_NO_LOGICAL, NULL);
/*
* Update indexes.
@@ -3266,6 +3318,7 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
* In case functions in the index need the active snapshot and caller
* hasn't set one.
*/
+ PushActiveSnapshot(snapshot);
ExecStoreHeapTuple(tup, index_slot, false);
recheck = ExecInsertIndexTuples(iistate->rri,
index_slot,
@@ -3276,6 +3329,8 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
NIL, /* arbiterIndexes */
false /* onlySummarizing */
);
+ PopActiveSnapshot();
+ ResetClusterCurrentXids();
/*
* If recheck is required, it must have been performed on the source
@@ -3293,18 +3348,36 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
TupleTableSlot *index_slot)
{
List *recheck;
+ LockTupleMode lockmode;
TU_UpdateIndexes update_indexes;
+ TM_Result res;
+ Snapshot snapshot = change->snapshot;
+ TM_FailureData tmfd;
/*
* Write the new tuple into the new heap. ('tup' gets the TID assigned
* here.)
+ *
+ * Regarding CID, see the comment in apply_concurrent_insert().
*/
- simple_heap_update(rel, &tup_target->t_self, tup, &update_indexes);
+ Assert(snapshot->curcid != InvalidCommandId &&
+ snapshot->curcid > FirstCommandId);
+
+ res = heap_update(rel, &tup_target->t_self, tup,
+ change->xid, snapshot->curcid - 1,
+ InvalidSnapshot,
+ false, /* no wait - only we are doing changes */
+ &tmfd, &lockmode, &update_indexes,
+ /* wal_logical */
+ false);
+ if (res != TM_Ok)
+ ereport(ERROR, (errmsg("failed to apply concurrent UPDATE")));
ExecStoreHeapTuple(tup, index_slot, false);
if (update_indexes != TU_None)
{
+ PushActiveSnapshot(snapshot);
recheck = ExecInsertIndexTuples(iistate->rri,
index_slot,
iistate->estate,
@@ -3314,6 +3387,7 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
NIL, /* arbiterIndexes */
/* onlySummarizing */
update_indexes == TU_Summarizing);
+ PopActiveSnapshot();
list_free(recheck);
}
@@ -3324,7 +3398,22 @@ static void
apply_concurrent_delete(Relation rel, HeapTuple tup_target,
ConcurrentChange *change)
{
- simple_heap_delete(rel, &tup_target->t_self);
+ TM_Result res;
+ TM_FailureData tmfd;
+ Snapshot snapshot = change->snapshot;
+
+ /* Regarding CID, see the comment in apply_concurrent_insert(). */
+ Assert(snapshot->curcid != InvalidCommandId &&
+ snapshot->curcid > FirstCommandId);
+
+ res = heap_delete(rel, &tup_target->t_self, change->xid,
+ snapshot->curcid - 1, InvalidSnapshot, false,
+ &tmfd, false,
+ /* wal_logical */
+ false);
+
+ if (res != TM_Ok)
+ ereport(ERROR, (errmsg("failed to apply concurrent DELETE")));
pgstat_progress_incr_param(PROGRESS_CLUSTER_HEAP_TUPLES_DELETED, 1);
}
@@ -3342,7 +3431,7 @@ apply_concurrent_delete(Relation rel, HeapTuple tup_target,
*/
static HeapTuple
find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
- IndexInsertState *iistate,
+ Snapshot snapshot, IndexInsertState *iistate,
TupleTableSlot *ident_slot, IndexScanDesc *scan_p)
{
IndexScanDesc scan;
@@ -3350,7 +3439,7 @@ find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
int2vector *ident_indkey;
HeapTuple result = NULL;
- scan = index_beginscan(rel, iistate->ident_index, GetActiveSnapshot(),
+ scan = index_beginscan(rel, iistate->ident_index, snapshot,
nkeys, 0);
*scan_p = scan;
index_rescan(scan, key, nkeys, NULL, 0);
@@ -3422,6 +3511,8 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
}
PG_FINALLY();
{
+ ResetClusterCurrentXids();
+
if (rel_src)
rel_dst->rd_toastoid = InvalidOid;
}
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 752deb39f7..7526c1a381 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -469,9 +469,18 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
SnapBuild *builder = ctx->snapshot_builder;
/*
- * Check if CLUSTER CONCURRENTLY is being performed by this backend. If
- * so, only decode data changes of the table that it is processing, and
- * the changes of its TOAST relation.
+ * If the change is not intended for logical decoding, do not even
+ * establish transaction for it. This is particularly important if the
+ * record was generated by CLUSTER CONCURRENTLY because this command uses
+ * the original XID when doing changes in the new storage. The decoding
+ * subsystem probably does not expect to see the same transaction multiple
+ * times.
+ */
+
+ /*
+ * First, check if CLUSTER CONCURRENTLY is being performed by this
+ * backend. If so, only decode data changes of the table that it is
+ * processing, and the changes of its TOAST relation.
*
* (TOAST locator should not be set unless the main is.)
*/
@@ -491,6 +500,60 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
return;
}
+ /*
+ * Second, skip records which do not contain sufficient information for
+ * the decoding.
+ *
+ * The backend executing CLUSTER CONCURRENTLY should not return here
+ * because the records which passed the checks above should be
+ * eligible for decoding. However, CLUSTER CONCURRENTLY generates WAL when
+ * writing data into the new table, which should not be decoded by the
+ * other backends. This is where the other backends skip them.
+ */
+ switch (info)
+ {
+ case XLOG_HEAP_INSERT:
+ {
+ xl_heap_insert *rec;
+
+ rec = (xl_heap_insert *) XLogRecGetData(buf->record);
+ /*
+ * (Besides insertion into the main heap by CLUSTER CONCURRENTLY,
+ * this does happen when raw_heap_insert marks the TOAST record as
+ * HEAP_INSERT_NO_LOGICAL).
+ */
+ if ((rec->flags & XLH_INSERT_CONTAINS_NEW_TUPLE) == 0)
+ return;
+
+ break;
+ }
+
+ case XLOG_HEAP_HOT_UPDATE:
+ case XLOG_HEAP_UPDATE:
+ {
+ xl_heap_update *rec;
+
+ rec = (xl_heap_update *) XLogRecGetData(buf->record);
+ if ((rec->flags &
+ (XLH_UPDATE_CONTAINS_NEW_TUPLE |
+ XLH_UPDATE_CONTAINS_OLD_TUPLE |
+ XLH_UPDATE_CONTAINS_OLD_KEY)) == 0)
+ return;
+
+ break;
+ }
+
+ case XLOG_HEAP_DELETE:
+ {
+ xl_heap_delete *rec;
+
+ rec = (xl_heap_delete *) XLogRecGetData(buf->record);
+ if (rec->flags & XLH_DELETE_NO_LOGICAL)
+ return;
+ break;
+ }
+ }
+
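For review purposes only (not patch code): the switch above boils down to a per-record-type flag test deciding whether a heap WAL record carries enough payload to decode. The sketch below restates that predicate standalone; only `XLH_DELETE_NO_LOGICAL` (1<<5) is taken from the patch, the other bit positions are placeholders for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative flag bits; only XLH_DELETE_NO_LOGICAL matches the patch
 * (heapam_xlog.h), the rest are placeholder positions for this sketch. */
#define XLH_INSERT_CONTAINS_NEW_TUPLE (1 << 3)
#define XLH_UPDATE_CONTAINS_OLD_TUPLE (1 << 0)
#define XLH_UPDATE_CONTAINS_OLD_KEY   (1 << 1)
#define XLH_UPDATE_CONTAINS_NEW_TUPLE (1 << 4)
#define XLH_DELETE_NO_LOGICAL         (1 << 5)

enum rec_kind { REC_INSERT, REC_UPDATE, REC_DELETE };

/* Returns 1 when the record carries enough information for logical
 * decoding; records written by CLUSTER CONCURRENTLY into the new heap
 * fail these tests and are skipped before any transaction state is
 * established for them. */
static int
record_is_decodable(enum rec_kind kind, uint8_t flags)
{
    switch (kind)
    {
        case REC_INSERT:
            /* No new tuple: nothing to decode (also the TOAST
             * HEAP_INSERT_NO_LOGICAL case). */
            return (flags & XLH_INSERT_CONTAINS_NEW_TUPLE) != 0;
        case REC_UPDATE:
            /* UPDATE needs at least one of the tuple/key images. */
            return (flags & (XLH_UPDATE_CONTAINS_NEW_TUPLE |
                             XLH_UPDATE_CONTAINS_OLD_TUPLE |
                             XLH_UPDATE_CONTAINS_OLD_KEY)) != 0;
        case REC_DELETE:
            /* DELETE is decodable even without an old key, hence the
             * explicit opt-out flag. */
            return (flags & XLH_DELETE_NO_LOGICAL) == 0;
    }
    return 0;
}
```

This also shows why DELETE needs the extra `XLH_DELETE_NO_LOGICAL` flag: unlike UPDATE, an all-zero flags word on DELETE is still decodable, so absence of payload flags cannot serve as the skip signal.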
ReorderBufferProcessXid(ctx->reorder, xid, buf->origptr);
/*
@@ -927,13 +990,6 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
xlrec = (xl_heap_insert *) XLogRecGetData(r);
- /*
- * Ignore insert records without new tuples (this does happen when
- * raw_heap_insert marks the TOAST record as HEAP_INSERT_NO_LOGICAL).
- */
- if (!(xlrec->flags & XLH_INSERT_CONTAINS_NEW_TUPLE))
- return;
-
/* only interested in our database */
XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
if (target_locator.dbOid != ctx->slot->data.database)
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 4492e2ae46..8e1f4bb851 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -294,7 +294,7 @@ static bool ExportInProgress = false;
static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
/* snapshot building/manipulation/distribution functions */
-static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
+static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder, XLogRecPtr lsn);
static void SnapBuildFreeSnapshot(Snapshot snap);
@@ -491,12 +491,17 @@ SnapBuildSnapDecRefcount(Snapshot snap)
* Build a new snapshot, based on currently committed catalog-modifying
* transactions.
*
+ * 'lsn' is the location of the commit record (of a catalog-changing
+ * transaction) that triggered creation of the snapshot. Pass
+ * InvalidXLogRecPtr for the transaction base snapshot or if the user of
+ * the snapshot should not need the LSN.
+ *
* In-progress transactions with catalog access are *not* allowed to modify
* these snapshots; they have to copy them and fill in appropriate ->curcid
* and ->subxip/subxcnt values.
*/
static Snapshot
-SnapBuildBuildSnapshot(SnapBuild *builder)
+SnapBuildBuildSnapshot(SnapBuild *builder, XLogRecPtr lsn)
{
Snapshot snapshot;
Size ssize;
@@ -564,6 +569,7 @@ SnapBuildBuildSnapshot(SnapBuild *builder)
snapshot->active_count = 0;
snapshot->regd_count = 0;
snapshot->snapXactCompletionCount = 0;
+ snapshot->lsn = lsn;
return snapshot;
}
@@ -600,7 +606,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
if (TransactionIdIsValid(MyProc->xmin))
elog(ERROR, "cannot build an initial slot snapshot when MyProc->xmin already is valid");
- snap = SnapBuildBuildSnapshot(builder);
+ snap = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
/*
* We know that snap->xmin is alive, enforced by the logical xmin
@@ -641,7 +647,7 @@ SnapBuildInitialSnapshotForCluster(SnapBuild *builder)
Assert(builder->state == SNAPBUILD_CONSISTENT);
- snap = SnapBuildBuildSnapshot(builder);
+ snap = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
return SnapBuildMVCCFromHistoric(snap, false);
}
@@ -775,7 +781,7 @@ SnapBuildGetOrBuildSnapshot(SnapBuild *builder)
/* only build a new snapshot if we don't have a prebuilt one */
if (builder->snapshot == NULL)
{
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
/* increase refcount for the snapshot builder */
SnapBuildSnapIncRefcount(builder->snapshot);
}
@@ -855,7 +861,7 @@ SnapBuildProcessChange(SnapBuild *builder, TransactionId xid, XLogRecPtr lsn)
/* only build a new snapshot if we don't have a prebuilt one */
if (builder->snapshot == NULL)
{
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, lsn);
/* increase refcount for the snapshot builder */
SnapBuildSnapIncRefcount(builder->snapshot);
}
@@ -1224,7 +1230,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
if (builder->snapshot)
SnapBuildSnapDecRefcount(builder->snapshot);
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, lsn);
/* we might need to execute invalidations, add snapshot */
if (!ReorderBufferXidHasBaseSnapshot(builder->reorder, xid))
@@ -2062,7 +2068,7 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
{
SnapBuildSnapDecRefcount(builder->snapshot);
}
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
SnapBuildSnapIncRefcount(builder->snapshot);
ReorderBufferSetRestartPoint(builder->reorder, lsn);
diff --git a/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c b/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
index c6baca1171..db6a2bcf1f 100644
--- a/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
+++ b/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
@@ -33,7 +33,8 @@ static void plugin_truncate(struct LogicalDecodingContext *ctx,
Relation relations[],
ReorderBufferChange *change);
static void store_change(LogicalDecodingContext *ctx,
- ConcurrentChangeKind kind, HeapTuple tuple);
+ ConcurrentChangeKind kind, HeapTuple tuple,
+ TransactionId xid);
void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
@@ -101,6 +102,7 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change)
{
ClusterDecodingState *dstate;
+ Snapshot snapshot;
dstate = (ClusterDecodingState *) ctx->output_writer_private;
@@ -108,6 +110,48 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (relation->rd_id != dstate->relid)
return;
+ /*
+ * Catalog snapshot is fine because the table we are processing is
+ * temporarily considered a user catalog table.
+ */
+ snapshot = GetCatalogSnapshot(InvalidOid);
+ Assert(snapshot->snapshot_type == SNAPSHOT_HISTORIC_MVCC);
+ Assert(!snapshot->suboverflowed);
+
+ /*
+ * This should not happen, but if we don't have enough information to
+ * apply a new snapshot, the consequences would be bad. Thus prefer ERROR
+ * to Assert().
+ */
+ if (XLogRecPtrIsInvalid(snapshot->lsn))
+ ereport(ERROR, (errmsg("snapshot has invalid LSN")));
+
+ /*
+ * reorderbuffer.c changes the catalog snapshot as soon as it sees a new
+ * CID or a commit record of a catalog-changing transaction.
+ */
+ if (dstate->snapshot == NULL || snapshot->lsn != dstate->snapshot_lsn ||
+ snapshot->curcid != dstate->snapshot->curcid)
+ {
+ /* CID should not go backwards. */
+ Assert(dstate->snapshot == NULL ||
+ snapshot->curcid >= dstate->snapshot->curcid ||
+ change->txn->xid != dstate->last_change_xid);
+
+ /*
+ * XXX Is it a problem that the copy is created in
+ * TopTransactionContext?
+ *
+ * XXX Wouldn't it be o.k. for SnapBuildMVCCFromHistoric() to set xcnt
+ * to 0 instead of converting xip in this case? The point is that
+ * transactions which are still in progress from the perspective of
+ * reorderbuffer.c could not be replayed yet, so we do not need to
+ * examine their XIDs.
+ */
+ dstate->snapshot = SnapBuildMVCCFromHistoric(snapshot, false);
+ dstate->snapshot_lsn = snapshot->lsn;
+ }
+
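Again as an illustration rather than patch code: the cached-snapshot test above treats the pair (snapshot LSN, curcid) as a cache key, rebuilding the MVCC snapshot whenever either component moved. A minimal restatement of that staleness condition, with invented names:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
typedef uint32_t CommandId;

/* The cached snapshot must be rebuilt when there is none yet, when the
 * catalog snapshot's LSN changed (commit of a catalog-changing xact), or
 * when its curcid changed (a "new CID" record in the same transaction). */
static int
snapshot_is_stale(int have_cached,
                  XLogRecPtr cached_lsn, CommandId cached_cid,
                  XLogRecPtr lsn, CommandId cid)
{
    return !have_cached || lsn != cached_lsn || cid != cached_cid;
}
```

This is why SnapBuildBuildSnapshot() now records the triggering LSN in the snapshot: without it, the output plugin could not tell two distinct catalog snapshots with equal curcid apart.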
/* Decode entry depending on its type */
switch (change->action)
{
@@ -125,7 +169,7 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (newtuple == NULL)
elog(ERROR, "Incomplete insert info.");
- store_change(ctx, CHANGE_INSERT, newtuple);
+ store_change(ctx, CHANGE_INSERT, newtuple, change->txn->xid);
}
break;
case REORDER_BUFFER_CHANGE_UPDATE:
@@ -142,9 +186,11 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
elog(ERROR, "Incomplete update info.");
if (oldtuple != NULL)
- store_change(ctx, CHANGE_UPDATE_OLD, oldtuple);
+ store_change(ctx, CHANGE_UPDATE_OLD, oldtuple,
+ change->txn->xid);
- store_change(ctx, CHANGE_UPDATE_NEW, newtuple);
+ store_change(ctx, CHANGE_UPDATE_NEW, newtuple,
+ change->txn->xid);
}
break;
case REORDER_BUFFER_CHANGE_DELETE:
@@ -157,7 +203,7 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (oldtuple == NULL)
elog(ERROR, "Incomplete delete info.");
- store_change(ctx, CHANGE_DELETE, oldtuple);
+ store_change(ctx, CHANGE_DELETE, oldtuple, change->txn->xid);
}
break;
default:
@@ -191,13 +237,13 @@ plugin_truncate(struct LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (i == nrelations)
return;
- store_change(ctx, CHANGE_TRUNCATE, NULL);
+ store_change(ctx, CHANGE_TRUNCATE, NULL, InvalidTransactionId);
}
/* Store concurrent data change. */
static void
store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
- HeapTuple tuple)
+ HeapTuple tuple, TransactionId xid)
{
ClusterDecodingState *dstate;
char *change_raw;
@@ -257,6 +303,11 @@ store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
dst = (char *) change + sizeof(ConcurrentChange);
memcpy(dst, tuple->t_data, tuple->t_len);
+ /* Initialize the other fields. */
+ change->xid = xid;
+ change->snapshot = dstate->snapshot;
+ dstate->snapshot->active_count++;
+
/* The data has been copied. */
if (flattened)
pfree(tuple);
@@ -267,6 +318,9 @@ store:
isnull[0] = false;
tuplestore_putvalues(dstate->tstore, dstate->tupdesc_change,
values, isnull);
+#ifdef USE_ASSERT_CHECKING
+ dstate->last_change_xid = xid;
+#endif
/* Accounting. */
dstate->nchanges++;
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 8687ec8796..e87eb2f861 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -316,21 +316,24 @@ extern BulkInsertState GetBulkInsertState(void);
extern void FreeBulkInsertState(BulkInsertState);
extern void ReleaseBulkInsertStatePin(BulkInsertState bistate);
-extern void heap_insert(Relation relation, HeapTuple tup, CommandId cid,
- int options, BulkInsertState bistate);
+extern void heap_insert(Relation relation, HeapTuple tup, TransactionId xid,
+ CommandId cid, int options, BulkInsertState bistate);
extern void heap_multi_insert(Relation relation, struct TupleTableSlot **slots,
int ntuples, CommandId cid, int options,
BulkInsertState bistate);
extern TM_Result heap_delete(Relation relation, ItemPointer tid,
- CommandId cid, Snapshot crosscheck, bool wait,
- struct TM_FailureData *tmfd, bool changingPart);
+ TransactionId xid, CommandId cid,
+ Snapshot crosscheck, bool wait,
+ struct TM_FailureData *tmfd, bool changingPart,
+ bool wal_logical);
extern void heap_finish_speculative(Relation relation, ItemPointer tid);
extern void heap_abort_speculative(Relation relation, ItemPointer tid);
extern TM_Result heap_update(Relation relation, ItemPointer otid,
- HeapTuple newtup,
+ HeapTuple newtup, TransactionId xid,
CommandId cid, Snapshot crosscheck, bool wait,
struct TM_FailureData *tmfd, LockTupleMode *lockmode,
- TU_UpdateIndexes *update_indexes);
+ TU_UpdateIndexes *update_indexes,
+ bool wal_logical);
extern TM_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
CommandId cid, LockTupleMode mode, LockWaitPolicy wait_policy,
bool follow_updates,
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 42736f37e7..1c5cb7c728 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -103,6 +103,8 @@
#define XLH_DELETE_CONTAINS_OLD_KEY (1<<2)
#define XLH_DELETE_IS_SUPER (1<<3)
#define XLH_DELETE_IS_PARTITION_MOVE (1<<4)
+/* See heap_delete() */
+#define XLH_DELETE_NO_LOGICAL (1<<5)
/* convenience macro for checking whether any form of old tuple was logged */
#define XLH_DELETE_CONTAINS_OLD \
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 6d4439f052..e0016631f6 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -476,6 +476,8 @@ extern Size EstimateTransactionStateSpace(void);
extern void SerializeTransactionState(Size maxsize, char *start_address);
extern void StartParallelWorkerTransaction(char *tstatespace);
extern void EndParallelWorkerTransaction(void);
+extern void SetClusterCurrentXids(TransactionId *xip, int xcnt);
+extern void ResetClusterCurrentXids(void);
extern bool IsTransactionBlock(void);
extern bool IsTransactionOrTransactionBlock(void);
extern char TransactionBlockStatusCode(void);
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 959899a7cc..61ea314399 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -71,6 +71,14 @@ typedef struct ConcurrentChange
/* See the enum above. */
ConcurrentChangeKind kind;
+ /* Transaction that changes the data. */
+ TransactionId xid;
+
+ /*
+ * Historic catalog snapshot that was used to decode this change.
+ */
+ Snapshot snapshot;
+
/*
* The actual tuple.
*
@@ -99,6 +107,8 @@ typedef struct ClusterDecodingState
* tuplestore does this transparently.
*/
Tuplestorestate *tstore;
+ /* XID of the last change added to tstore. */
+ TransactionId last_change_xid PG_USED_FOR_ASSERTS_ONLY;
/* The current number of changes in tstore. */
double nchanges;
@@ -119,6 +129,14 @@ typedef struct ClusterDecodingState
/* Slot to retrieve data from tstore. */
TupleTableSlot *tsslot;
+ /*
+ * Historic catalog snapshot that was used to decode the most recent
+ * change.
+ */
+ Snapshot snapshot;
+ /* LSN of the record */
+ XLogRecPtr snapshot_lsn;
+
ResourceOwner resowner;
} ClusterDecodingState;
--
2.45.2
Attachment: v04-0006-Add-regression-tests.patch (text/x-diff)
From 5c12f131b3ec3876fe903b1a4f4123b7792a55f6 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Fri, 6 Sep 2024 09:55:54 +0200
Subject: [PATCH 6/8] Add regression tests.
As this patch series adds the CONCURRENTLY option to the VACUUM FULL and
CLUSTER commands, it's appropriate to test that the "concurrent data changes"
(i.e. changes done by the application while we are copying the table contents to
the new storage) are processed correctly.
Injection points are used to stop the data copying at some point. While the
backend in charge of the copying is waiting on the injection point, another
backend runs some INSERT, UPDATE and DELETE commands on the table. Then we
wake up the first backend and let the CLUSTER (CONCURRENTLY) command
finish. Finally we check that all the "concurrent data changes" are present in
the table and that they contain the correct visibility information.
---
src/backend/commands/cluster.c | 7 +
src/test/modules/injection_points/Makefile | 3 +-
.../injection_points/expected/cluster.out | 113 ++++++++++++++
.../modules/injection_points/logical.conf | 1 +
src/test/modules/injection_points/meson.build | 3 +
.../injection_points/specs/cluster.spec | 140 ++++++++++++++++++
6 files changed, 266 insertions(+), 1 deletion(-)
create mode 100644 src/test/modules/injection_points/expected/cluster.out
create mode 100644 src/test/modules/injection_points/logical.conf
create mode 100644 src/test/modules/injection_points/specs/cluster.spec
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 23e40562bd..87f7106731 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -59,6 +59,7 @@
#include "utils/acl.h"
#include "utils/fmgroids.h"
#include "utils/guc.h"
+#include "utils/injection_point.h"
#include "utils/inval.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
@@ -3734,6 +3735,12 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
*/
ident_key = build_identity_key(ident_idx_new, OldHeap, &ident_key_nentries);
+ /*
+ * During testing, wait for another backend to perform concurrent data
+ * changes which we will process below.
+ */
+ INJECTION_POINT("cluster-concurrently-before-lock");
+
/*
* Flush all WAL records inserted so far (possibly except for the last
* incomplete page, see GetInsertRecPtr), to minimize the amount of data
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 1c1c2d0b13..4a133aad9e 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -13,7 +13,8 @@ PGFILEDESC = "injection_points - facility for injection points"
REGRESS = injection_points
REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
-ISOLATION = inplace
+ISOLATION = inplace cluster
+ISOLATION_OPTS = --temp-config $(top_srcdir)/src/test/modules/injection_points/logical.conf
TAP_TESTS = 1
diff --git a/src/test/modules/injection_points/expected/cluster.out b/src/test/modules/injection_points/expected/cluster.out
new file mode 100644
index 0000000000..d84fff3693
--- /dev/null
+++ b/src/test/modules/injection_points/expected/cluster.out
@@ -0,0 +1,113 @@
+Parsed test spec with 2 sessions
+
+starting permutation: wait_before_lock change_existing change_new change_subxact1 change_subxact2 check2 wakeup_before_lock check1
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step wait_before_lock:
+ CLUSTER (CONCURRENTLY) clstr_test USING clstr_test_pkey;
+ <waiting ...>
+step change_existing:
+ UPDATE clstr_test SET i=10 where i=1;
+ UPDATE clstr_test SET j=20 where i=2;
+ UPDATE clstr_test SET i=30 where i=3;
+ UPDATE clstr_test SET i=40 where i=30;
+ DELETE FROM clstr_test WHERE i=4;
+
+step change_new:
+ INSERT INTO clstr_test(i, j) VALUES (5, 5), (6, 6), (7, 7), (8, 8);
+ UPDATE clstr_test SET i=50 where i=5;
+ UPDATE clstr_test SET j=60 where i=6;
+ DELETE FROM clstr_test WHERE i=7;
+
+step change_subxact1:
+ BEGIN;
+ INSERT INTO clstr_test(i, j) VALUES (100, 100);
+ SAVEPOINT s1;
+ UPDATE clstr_test SET i=101 where i=100;
+ SAVEPOINT s2;
+ UPDATE clstr_test SET i=102 where i=101;
+ COMMIT;
+
+step change_subxact2:
+ BEGIN;
+ SAVEPOINT s1;
+ INSERT INTO clstr_test(i, j) VALUES (110, 110);
+ ROLLBACK TO SAVEPOINT s1;
+ INSERT INTO clstr_test(i, j) VALUES (110, 111);
+ COMMIT;
+
+step check2:
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='clstr_test';
+
+ SELECT i, j FROM clstr_test ORDER BY i, j;
+
+ INSERT INTO data_s2(_xmin, _cmin, i, j)
+ SELECT xmin, cmin, i, j FROM clstr_test;
+
+ i| j
+---+---
+ 2| 20
+ 6| 60
+ 8| 8
+ 10| 1
+ 40| 3
+ 50| 5
+102|100
+110|111
+(8 rows)
+
+step wakeup_before_lock:
+ SELECT injection_points_wakeup('cluster-concurrently-before-lock');
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step wait_before_lock: <... completed>
+step check1:
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='clstr_test';
+
+ SELECT count(DISTINCT node) FROM relfilenodes;
+
+ SELECT i, j FROM clstr_test ORDER BY i, j;
+
+ INSERT INTO data_s1(_xmin, _cmin, i, j)
+ SELECT xmin, cmin, i, j FROM clstr_test;
+
+ SELECT count(*)
+ FROM data_s1 d1 FULL JOIN data_s2 d2 USING (_xmin, _cmin, i, j)
+ WHERE d1.i ISNULL OR d2.i ISNULL;
+
+count
+-----
+ 2
+(1 row)
+
+ i| j
+---+---
+ 2| 20
+ 6| 60
+ 8| 8
+ 10| 1
+ 40| 3
+ 50| 5
+102|100
+110|111
+(8 rows)
+
+count
+-----
+ 0
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
diff --git a/src/test/modules/injection_points/logical.conf b/src/test/modules/injection_points/logical.conf
new file mode 100644
index 0000000000..c8f264bc6c
--- /dev/null
+++ b/src/test/modules/injection_points/logical.conf
@@ -0,0 +1 @@
+wal_level = logical
\ No newline at end of file
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index c9e357f644..7739b28c19 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -42,7 +42,10 @@ tests += {
'isolation': {
'specs': [
'inplace',
+ 'cluster',
],
+ # 'cluster' requires wal_level = 'logical'.
+ 'regress_args': ['--temp-config', files('logical.conf')],
},
'tap': {
'env': {
diff --git a/src/test/modules/injection_points/specs/cluster.spec b/src/test/modules/injection_points/specs/cluster.spec
new file mode 100644
index 0000000000..5f8404c5da
--- /dev/null
+++ b/src/test/modules/injection_points/specs/cluster.spec
@@ -0,0 +1,140 @@
+# Prefix the system columns with underscore as they are not allowed as column
+# names.
+setup
+{
+ CREATE EXTENSION injection_points;
+
+ CREATE TABLE clstr_test(i int PRIMARY KEY, j int);
+ INSERT INTO clstr_test(i, j) VALUES (1, 1), (2, 2), (3, 3), (4, 4);
+
+ CREATE TABLE relfilenodes(node oid);
+
+ CREATE TABLE data_s1(_xmin xid, _cmin cid, i int, j int);
+ CREATE TABLE data_s2(_xmin xid, _cmin cid, i int, j int);
+}
+
+teardown
+{
+ DROP TABLE clstr_test;
+ DROP EXTENSION injection_points;
+
+ DROP TABLE relfilenodes;
+ DROP TABLE data_s1;
+ DROP TABLE data_s2;
+}
+
+session s1
+setup
+{
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('cluster-concurrently-before-lock', 'wait');
+}
+# Perform the initial load and wait for s2 to do some data changes.
+step wait_before_lock
+{
+ CLUSTER (CONCURRENTLY) clstr_test USING clstr_test_pkey;
+}
+# Check the table from the perspective of s1.
+#
+# Besides the contents, we also check that relfilenode has changed.
+#
+# xmin and cmin columns are used to check that we do not change tuple
+# visibility information. Since we do not expect xmin to stay unchanged across
+# test runs, it cannot appear in the output text. Instead, have each session
+# write the contents into a table and use FULL JOIN to check if the outputs
+# are identical.
+step check1
+{
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='clstr_test';
+
+ SELECT count(DISTINCT node) FROM relfilenodes;
+
+ SELECT i, j FROM clstr_test ORDER BY i, j;
+
+ INSERT INTO data_s1(_xmin, _cmin, i, j)
+ SELECT xmin, cmin, i, j FROM clstr_test;
+
+ SELECT count(*)
+ FROM data_s1 d1 FULL JOIN data_s2 d2 USING (_xmin, _cmin, i, j)
+ WHERE d1.i ISNULL OR d2.i ISNULL;
+}
+teardown
+{
+ SELECT injection_points_detach('cluster-concurrently-before-lock');
+}
+
+session s2
+# Change the existing data. UPDATE changes both key and non-key columns. Also
+# update one row twice to test whether tuple version generated by this session
+# can be found.
+step change_existing
+{
+ UPDATE clstr_test SET i=10 where i=1;
+ UPDATE clstr_test SET j=20 where i=2;
+ UPDATE clstr_test SET i=30 where i=3;
+ UPDATE clstr_test SET i=40 where i=30;
+ DELETE FROM clstr_test WHERE i=4;
+}
+# Insert new rows and UPDATE / DELETE some of them. Again, update both key and
+# non-key column.
+step change_new
+{
+ INSERT INTO clstr_test(i, j) VALUES (5, 5), (6, 6), (7, 7), (8, 8);
+ UPDATE clstr_test SET i=50 where i=5;
+ UPDATE clstr_test SET j=60 where i=6;
+ DELETE FROM clstr_test WHERE i=7;
+}
+
+# When applying concurrent data changes, we should see the effects of an
+# in-progress subtransaction.
+step change_subxact1
+{
+ BEGIN;
+ INSERT INTO clstr_test(i, j) VALUES (100, 100);
+ SAVEPOINT s1;
+ UPDATE clstr_test SET i=101 where i=100;
+ SAVEPOINT s2;
+ UPDATE clstr_test SET i=102 where i=101;
+ COMMIT;
+}
+
+# When applying concurrent data changes, we should not see the effects of a
+# rolled back subtransaction.
+step change_subxact2
+{
+ BEGIN;
+ SAVEPOINT s1;
+ INSERT INTO clstr_test(i, j) VALUES (110, 110);
+ ROLLBACK TO SAVEPOINT s1;
+ INSERT INTO clstr_test(i, j) VALUES (110, 111);
+ COMMIT;
+}
+
+# Check the table from the perspective of s2.
+step check2
+{
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='clstr_test';
+
+ SELECT i, j FROM clstr_test ORDER BY i, j;
+
+ INSERT INTO data_s2(_xmin, _cmin, i, j)
+ SELECT xmin, cmin, i, j FROM clstr_test;
+}
+step wakeup_before_lock
+{
+ SELECT injection_points_wakeup('cluster-concurrently-before-lock');
+}
+
+# Test if data changes introduced while one session is performing CLUSTER
+# (CONCURRENTLY) find their way into the table.
+permutation
+ wait_before_lock
+ change_existing
+ change_new
+ change_subxact1
+ change_subxact2
+ check2
+ wakeup_before_lock
+ check1
--
2.45.2
Attachment: v04-0007-Introduce-cluster_max_xlock_time-configuration-varia.patch (text/x-diff)
From d3cb8bbd1d7b695c82b090f820fbc9f748e9fcff Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Fri, 6 Sep 2024 09:55:54 +0200
Subject: [PATCH 7/8] Introduce cluster_max_xlock_time configuration variable.
When executing VACUUM FULL / CLUSTER (CONCURRENTLY), we need the
AccessExclusiveLock to swap the relation files, which should only take a short
time. However, on a busy system, other backends might change a non-negligible
amount of data in the table while we are waiting for the lock. Since these
changes must be applied to the new storage before the swap, the time we
eventually hold the lock might become non-negligible too.
If users are worried about this situation, they can set cluster_max_xlock_time
to the maximum time for which the exclusive lock may be held. If that amount
of time is not sufficient to complete the VACUUM FULL / CLUSTER (CONCURRENTLY)
command, an ERROR is raised and the command is canceled.
---
doc/src/sgml/config.sgml | 32 +++++
doc/src/sgml/ref/cluster.sgml | 9 +-
src/backend/access/heap/heapam_handler.c | 3 +-
src/backend/commands/cluster.c | 133 +++++++++++++++---
src/backend/utils/misc/guc_tables.c | 14 ++
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/commands/cluster.h | 5 +-
.../injection_points/expected/cluster.out | 74 +++++++++-
.../injection_points/specs/cluster.spec | 42 ++++++
9 files changed, 293 insertions(+), 20 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 0aec11f443..0b55028b79 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -10566,6 +10566,38 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
</listitem>
</varlistentry>
+ <varlistentry id="guc-cluster-max-xclock-time" xreflabel="cluster_max_xlock_time">
+ <term><varname>cluster_max_xlock_time</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>cluster_max_xlock_time</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ This is the maximum amount of time for which the
+ <command>CLUSTER</command> and <command>VACUUM FULL</command> commands
+ with the <literal>CONCURRENTLY</literal> option may hold an exclusive
+ lock on a table. Typically, these commands should not need the lock
+ for longer than <command>TRUNCATE</command> does, but additional time
+ might be needed on a very busy system. (See
+ <xref linkend="sql-cluster"/> for an explanation of how
+ the <literal>CONCURRENTLY</literal> option works.)
+ </para>
+
+ <para>
+ If you want to restrict the lock time, set this variable to the
+ highest acceptable value. If the processing cannot be completed
+ before this time elapses, the command is cancelled with an
+ error.
+ </para>
+
+ <para>
+ The default value is 0, which means that the lock is not released
+ until the concurrent data changes are processed.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect1>
diff --git a/doc/src/sgml/ref/cluster.sgml b/doc/src/sgml/ref/cluster.sgml
index d8c3edb432..182e4f7592 100644
--- a/doc/src/sgml/ref/cluster.sgml
+++ b/doc/src/sgml/ref/cluster.sgml
@@ -141,7 +141,14 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
(<xref linkend="logicaldecoding"/>) and applied before
the <literal>ACCESS EXCLUSIVE</literal> lock is requested. Thus the lock
is typically held only for the time needed to swap the files, which
- should be pretty short.
+ should be pretty short. However, the time can still be noticeable if
+ many data changes were made to the table while
+ <command>CLUSTER</command> was waiting for the lock: those changes must
+ be processed just before the files are swapped, while the
+ <literal>ACCESS EXCLUSIVE</literal> lock is being held. If you are
+ worried about this situation, set
+ the <link linkend="guc-cluster-max-xclock-time"><varname>cluster_max_xlock_time</varname></link>
+ configuration parameter to a value that your applications can tolerate.
</para>
<para>
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index a8999a3e72..61b8d7e8e5 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -998,7 +998,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
end_of_wal = GetFlushRecPtr(NULL);
if ((end_of_wal - end_of_wal_prev) > wal_segment_size)
{
- cluster_decode_concurrent_changes(decoding_ctx, end_of_wal);
+ cluster_decode_concurrent_changes(decoding_ctx, end_of_wal,
+ NULL);
end_of_wal_prev = end_of_wal;
}
}
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 87f7106731..a2c072a223 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -17,6 +17,8 @@
*/
#include "postgres.h"
+#include <sys/time.h>
+
#include "access/amapi.h"
#include "access/heapam.h"
#include "access/multixact.h"
@@ -103,6 +105,15 @@ RelFileLocator clustered_rel_toast_locator = {.relNumber = InvalidOid};
#define CLUSTER_IN_PROGRESS_MESSAGE \
"relation \"%s\" is already being processed by CLUSTER CONCURRENTLY"
+/*
+ * The maximum time to hold AccessExclusiveLock during the final
+ * processing. Note that only the execution time of
+ * process_concurrent_changes() is included here. The very last steps like
+ * swap_relation_files() shouldn't get blocked and it'd be wrong to consider
+ * them a reason to abort otherwise completed processing.
+ */
+int cluster_max_xlock_time = 0;
+
/*
* Everything we need to call ExecInsertIndexTuples().
*/
@@ -189,7 +200,8 @@ static LogicalDecodingContext *setup_logical_decoding(Oid relid,
static HeapTuple get_changed_tuple(ConcurrentChange *change);
static void apply_concurrent_changes(ClusterDecodingState *dstate,
Relation rel, ScanKey key, int nkeys,
- IndexInsertState *iistate);
+ IndexInsertState *iistate,
+ struct timeval *must_complete);
static void apply_concurrent_insert(Relation rel, ConcurrentChange *change,
HeapTuple tup, IndexInsertState *iistate,
TupleTableSlot *index_slot);
@@ -206,13 +218,15 @@ static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
IndexInsertState *iistate,
TupleTableSlot *ident_slot,
IndexScanDesc *scan_p);
-static void process_concurrent_changes(LogicalDecodingContext *ctx,
+static bool process_concurrent_changes(LogicalDecodingContext *ctx,
XLogRecPtr end_of_wal,
Relation rel_dst,
Relation rel_src,
ScanKey ident_key,
int ident_key_nentries,
- IndexInsertState *iistate);
+ IndexInsertState *iistate,
+ struct timeval *must_complete);
+static bool processing_time_elapsed(struct timeval *must_complete);
static IndexInsertState *get_index_insert_state(Relation relation,
Oid ident_index_id);
static ScanKey build_identity_key(Oid ident_idx_oid, Relation rel_src,
@@ -3040,7 +3054,8 @@ get_changed_tuple(ConcurrentChange *change)
*/
void
cluster_decode_concurrent_changes(LogicalDecodingContext *ctx,
- XLogRecPtr end_of_wal)
+ XLogRecPtr end_of_wal,
+ struct timeval *must_complete)
{
ClusterDecodingState *dstate;
ResourceOwner resowner_old;
@@ -3078,6 +3093,9 @@ cluster_decode_concurrent_changes(LogicalDecodingContext *ctx,
if (record != NULL)
LogicalDecodingProcessRecord(ctx, ctx->reader);
+ if (processing_time_elapsed(must_complete))
+ break;
+
/*
* If WAL segment boundary has been crossed, inform the decoding
* system that the catalog_xmin can advance. (We can confirm more
@@ -3120,7 +3138,8 @@ cluster_decode_concurrent_changes(LogicalDecodingContext *ctx,
*/
static void
apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
- ScanKey key, int nkeys, IndexInsertState *iistate)
+ ScanKey key, int nkeys, IndexInsertState *iistate,
+ struct timeval *must_complete)
{
TupleTableSlot *index_slot, *ident_slot;
HeapTuple tup_old = NULL;
@@ -3150,6 +3169,9 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
CHECK_FOR_INTERRUPTS();
+ Assert(dstate->nchanges > 0);
+ dstate->nchanges--;
+
/* Get the change from the single-column tuple. */
tup_change = ExecFetchSlotHeapTuple(dstate->tsslot, false, &shouldFree);
heap_deform_tuple(tup_change, dstate->tupdesc_change, values, isnull);
@@ -3274,10 +3296,22 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
/* TTSOpsMinimalTuple has .get_heap_tuple==NULL. */
Assert(shouldFree);
pfree(tup_change);
+
+ /*
+ * If there is a limit on the time of completion, check it
+ * now. However, make sure the loop does not break if tup_old was set
+ * in the previous iteration. In such a case we could not resume the
+ * processing in the next call.
+ */
+ if (must_complete && tup_old == NULL &&
+ processing_time_elapsed(must_complete))
+ /* The next call will process the remaining changes. */
+ break;
}
- tuplestore_clear(dstate->tstore);
- dstate->nchanges = 0;
+ /* If we could not apply all the changes, the next call will do so. */
+ if (dstate->nchanges == 0)
+ tuplestore_clear(dstate->tstore);
/* Cleanup. */
ExecDropSingleTupleTableSlot(index_slot);
@@ -3480,11 +3514,15 @@ find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
* Decode and apply concurrent changes.
*
* Pass rel_src iff its reltoastrelid is needed.
+ *
+ * Returns true if must_complete is NULL or if managed to complete by the time
+ * *must_complete indicates.
*/
-static void
+static bool
process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
Relation rel_dst, Relation rel_src, ScanKey ident_key,
- int ident_key_nentries, IndexInsertState *iistate)
+ int ident_key_nentries, IndexInsertState *iistate,
+ struct timeval *must_complete)
{
ClusterDecodingState *dstate;
@@ -3493,10 +3531,19 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
dstate = (ClusterDecodingState *) ctx->output_writer_private;
- cluster_decode_concurrent_changes(ctx, end_of_wal);
+ cluster_decode_concurrent_changes(ctx, end_of_wal, must_complete);
+ if (processing_time_elapsed(must_complete))
+ /* Caller is responsible for applying the changes. */
+ return false;
+
+ /*
+ * *must_complete was not reached, so if there are no changes, there
+ * really are none. (Otherwise, zero changes might simply mean that
+ * not enough time was left for the decoding.)
+ */
if (dstate->nchanges == 0)
- return;
+ return true;
PG_TRY();
{
@@ -3508,7 +3555,7 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
rel_dst->rd_toastoid = rel_src->rd_rel->reltoastrelid;
apply_concurrent_changes(dstate, rel_dst, ident_key,
- ident_key_nentries, iistate);
+ ident_key_nentries, iistate, must_complete);
}
PG_FINALLY();
{
@@ -3518,6 +3565,28 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
rel_dst->rd_toastoid = InvalidOid;
}
PG_END_TRY();
+
+ /*
+ * apply_concurrent_changes() does check the processing time, so if some
+ * changes are left, we ran out of time.
+ */
+ return dstate->nchanges == 0;
+}
+
+/*
+ * Check if the current time is beyond *must_complete.
+ */
+static bool
+processing_time_elapsed(struct timeval *must_complete)
+{
+ struct timeval now;
+
+ if (must_complete == NULL)
+ return false;
+
+ gettimeofday(&now, NULL);
+
+ return timercmp(&now, must_complete, >);
}
static IndexInsertState *
@@ -3678,6 +3747,8 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
RelReopenInfo *rri = NULL;
int nrel;
Relation *ind_refs_all, *ind_refs_p;
+ struct timeval t_end;
+ struct timeval *t_end_ptr = NULL;
/* Like in cluster_rel(). */
lockmode_old = LOCK_CLUSTER_CONCURRENT;
@@ -3757,7 +3828,8 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
*/
process_concurrent_changes(ctx, end_of_wal, NewHeap,
swap_toast_by_content ? OldHeap : NULL,
- ident_key, ident_key_nentries, iistate);
+ ident_key, ident_key_nentries, iistate,
+ NULL);
/*
* Release the locks that allowed concurrent data changes, in order to
@@ -3879,9 +3951,38 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
end_of_wal = GetFlushRecPtr(NULL);
/* Apply the concurrent changes again. */
- process_concurrent_changes(ctx, end_of_wal, NewHeap,
- swap_toast_by_content ? OldHeap : NULL,
- ident_key, ident_key_nentries, iistate);
+ /*
+ * This time we have the exclusive lock on the table, so make sure that
+ * cluster_max_xlock_time is not exceeded.
+ */
+ if (cluster_max_xlock_time > 0)
+ {
+ int64 usec;
+ struct timeval t_start;
+
+ gettimeofday(&t_start, NULL);
+ /* Add the whole seconds. */
+ t_end.tv_sec = t_start.tv_sec + cluster_max_xlock_time / 1000;
+ /* Add the rest, expressed in microseconds. */
+ usec = t_start.tv_usec + 1000 * (cluster_max_xlock_time % 1000);
+ /* The number of microseconds could have overflowed. */
+ t_end.tv_sec += usec / USECS_PER_SEC;
+ t_end.tv_usec = usec % USECS_PER_SEC;
+ t_end_ptr = &t_end;
+ }
+ /*
+ * During testing, stop here to simulate excessive processing time.
+ */
+ INJECTION_POINT("cluster-concurrently-after-lock");
+
+ if (!process_concurrent_changes(ctx, end_of_wal, NewHeap,
+ swap_toast_by_content ? OldHeap : NULL,
+ ident_key, ident_key_nentries, iistate,
+ t_end_ptr))
+ ereport(ERROR,
+ (errmsg("could not process concurrent data changes in time"),
+ errhint("Please consider adjusting \"cluster_max_xlock_time\".")));
+
/* Remember info about rel before closing OldHeap */
relpersistence = OldHeap->rd_rel->relpersistence;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 686309db58..02d3805475 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -39,6 +39,7 @@
#include "catalog/namespace.h"
#include "catalog/storage.h"
#include "commands/async.h"
+#include "commands/cluster.h"
#include "commands/event_trigger.h"
#include "commands/tablespace.h"
#include "commands/trigger.h"
@@ -2772,6 +2773,19 @@ struct config_int ConfigureNamesInt[] =
1600000000, 0, 2100000000,
NULL, NULL, NULL
},
+ {
+ {"cluster_max_xlock_time", PGC_USERSET, LOCK_MANAGEMENT,
+ gettext_noop("Maximum time for VACUUM FULL / CLUSTER (CONCURRENTLY) to keep the table locked."),
+ gettext_noop(
+ "The table is locked in exclusive mode during the final stage of processing. "
+ "If the lock time exceeds this value, an error is raised and the lock is "
+ "released. Set to zero if you don't care how long the lock can be held."),
+ GUC_UNIT_MS
+ },
+ &cluster_max_xlock_time,
+ 0, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
/*
* See also CheckRequiredParameterValues() if this parameter changes
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 667e0dc40a..9dc060c59f 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -724,6 +724,7 @@
#vacuum_multixact_freeze_table_age = 150000000
#vacuum_multixact_freeze_min_age = 5000000
#vacuum_multixact_failsafe_age = 1600000000
+#cluster_max_xlock_time = 0
#bytea_output = 'hex' # hex, escape
#xmlbinary = 'base64'
#xmloption = 'content'
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 61ea314399..5d904ce985 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -44,6 +44,8 @@ typedef struct ClusterParams
extern RelFileLocator clustered_rel_locator;
extern RelFileLocator clustered_rel_toast_locator;
+extern PGDLLIMPORT int cluster_max_xlock_time;
+
/*
* Lock level for the concurrent variant of CLUSTER / VACUUM FULL.
*
@@ -149,7 +151,8 @@ extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
extern bool check_relation_is_clusterable_concurrently(Relation rel, int elevel,
const char *stmt);
extern void cluster_decode_concurrent_changes(LogicalDecodingContext *ctx,
- XLogRecPtr end_of_wal);
+ XLogRecPtr end_of_wal,
+ struct timeval *must_complete);
extern Oid make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
char relpersistence, LOCKMODE lockmode_old,
LOCKMODE *lockmode_new_p);
diff --git a/src/test/modules/injection_points/expected/cluster.out b/src/test/modules/injection_points/expected/cluster.out
index d84fff3693..646e31448f 100644
--- a/src/test/modules/injection_points/expected/cluster.out
+++ b/src/test/modules/injection_points/expected/cluster.out
@@ -1,4 +1,4 @@
-Parsed test spec with 2 sessions
+Parsed test spec with 4 sessions
starting permutation: wait_before_lock change_existing change_new change_subxact1 change_subxact2 check2 wakeup_before_lock check1
injection_points_attach
@@ -111,3 +111,75 @@ injection_points_detach
(1 row)
+injection_points_detach
+-----------------------
+
+(1 row)
+
+
+starting permutation: wait_after_lock wakeup_after_lock
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step wait_after_lock:
+ CLUSTER (CONCURRENTLY) clstr_test USING clstr_test_pkey;
+ <waiting ...>
+step wakeup_after_lock:
+ SELECT injection_points_wakeup('cluster-concurrently-after-lock');
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step wait_after_lock: <... completed>
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+
+starting permutation: wait_after_lock after_lock_delay wakeup_after_lock
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step wait_after_lock:
+ CLUSTER (CONCURRENTLY) clstr_test USING clstr_test_pkey;
+ <waiting ...>
+step after_lock_delay:
+ SELECT pg_sleep(1.5);
+
+pg_sleep
+--------
+
+(1 row)
+
+step wakeup_after_lock:
+ SELECT injection_points_wakeup('cluster-concurrently-after-lock');
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step wait_after_lock: <... completed>
+ERROR: could not process concurrent data changes in time
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
diff --git a/src/test/modules/injection_points/specs/cluster.spec b/src/test/modules/injection_points/specs/cluster.spec
index 5f8404c5da..9af41bac6d 100644
--- a/src/test/modules/injection_points/specs/cluster.spec
+++ b/src/test/modules/injection_points/specs/cluster.spec
@@ -127,6 +127,34 @@ step wakeup_before_lock
SELECT injection_points_wakeup('cluster-concurrently-before-lock');
}
+session s3
+setup
+{
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('cluster-concurrently-after-lock', 'wait');
+ SET cluster_max_xlock_time TO '1s';
+}
+# Perform the initial load, lock the table in exclusive mode and wait. s4 will
+# cancel the waiting.
+step wait_after_lock
+{
+ CLUSTER (CONCURRENTLY) clstr_test USING clstr_test_pkey;
+}
+teardown
+{
+ SELECT injection_points_detach('cluster-concurrently-after-lock');
+}
+
+session s4
+step wakeup_after_lock
+{
+ SELECT injection_points_wakeup('cluster-concurrently-after-lock');
+}
+step after_lock_delay
+{
+ SELECT pg_sleep(1.5);
+}
+
# Test if data changes introduced while one session is performing CLUSTER
# (CONCURRENTLY) find their way into the table.
permutation
@@ -138,3 +166,17 @@ permutation
check2
wakeup_before_lock
check1
+
+# Test the cluster_max_xlock_time configuration variable.
+#
+# First, cancel waiting on the injection point immediately. That way, CLUSTER
+# should complete.
+permutation
+ wait_after_lock
+ wakeup_after_lock
+# Second, cancel the waiting with a delay that violates
+# cluster_max_xlock_time.
+permutation
+ wait_after_lock
+ after_lock_delay
+ wakeup_after_lock
--
2.45.2
Attachment: v04-0008-Call-logical_rewrite_heap_tuple-when-applying-concur.patch (text/x-diff)
From 5b6db1efcd1caa9a96ecfcec475be429bc901f18 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Fri, 6 Sep 2024 09:55:54 +0200
Subject: [PATCH 8/8] Call logical_rewrite_heap_tuple() when applying
concurrent data changes.
This was implemented for the sake of completeness, but I think it's currently
not needed. Possible use cases could be:
1. VACUUM FULL / CLUSTER CONCURRENTLY can process system catalogs.
System catalogs are scanned using a historic snapshot during logical decoding,
and the "combo CID" information is needed for that. Since a "combo CID" is
associated with the "file locator" and that locator is changed by VACUUM FULL
/ CLUSTER, these commands must record this information for each tuple moved
from the old file to the new one. This is what
logical_rewrite_heap_tuple() does.
However, the logical decoding subsystem currently does not support decoding of
data changes in the system catalog. Therefore, the CONCURRENTLY option cannot
be used for system catalogs.
2. VACUUM FULL / CLUSTER CONCURRENTLY is processing a relation, but once it
has released all the locks (in order to get the exclusive lock), another
backend runs VACUUM FULL / CLUSTER CONCURRENTLY on the same table. Since the
relation is treated as a system catalog while these commands are processing it
(so it can be scanned using a historic snapshot during the "initial load"), it
is important that the second backend does not break decoding of the "combo
CIDs" performed by the first backend.
However, it's not practical to let multiple backends run VACUUM FULL / CLUSTER
CONCURRENTLY on the same relation, so we forbid that.
---
src/backend/access/heap/heapam_handler.c | 2 +-
src/backend/access/heap/rewriteheap.c | 65 ++++++-----
src/backend/commands/cluster.c | 110 +++++++++++++++---
src/backend/replication/logical/decode.c | 41 ++++++-
.../pgoutput_cluster/pgoutput_cluster.c | 21 ++--
src/include/access/rewriteheap.h | 5 +-
src/include/commands/cluster.h | 3 +
src/include/replication/reorderbuffer.h | 7 ++
8 files changed, 194 insertions(+), 60 deletions(-)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 61b8d7e8e5..c39a9ac41d 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -731,7 +731,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
/* Initialize the rewrite operation */
rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, *xid_cutoff,
- *multi_cutoff);
+ *multi_cutoff, true);
/* Set up sorting if wanted */
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 09ef220449..86881e8638 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -214,10 +214,8 @@ static void raw_heap_insert(RewriteState state, HeapTuple tup);
/* internal logical remapping prototypes */
static void logical_begin_heap_rewrite(RewriteState state);
-static void logical_rewrite_heap_tuple(RewriteState state, ItemPointerData old_tid, HeapTuple new_tuple);
static void logical_end_heap_rewrite(RewriteState state);
-
/*
* Begin a rewrite of a table
*
@@ -226,18 +224,19 @@ static void logical_end_heap_rewrite(RewriteState state);
* oldest_xmin xid used by the caller to determine which tuples are dead
* freeze_xid xid before which tuples will be frozen
* cutoff_multi multixact before which multis will be removed
+ * tid_chains need to maintain TID chains?
*
* Returns an opaque RewriteState, allocated in current memory context,
* to be used in subsequent calls to the other functions.
*/
RewriteState
begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xmin,
- TransactionId freeze_xid, MultiXactId cutoff_multi)
+ TransactionId freeze_xid, MultiXactId cutoff_multi,
+ bool tid_chains)
{
RewriteState state;
MemoryContext rw_cxt;
MemoryContext old_cxt;
- HASHCTL hash_ctl;
/*
* To ease cleanup, make a separate context that will contain the
@@ -262,29 +261,34 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
state->rs_cxt = rw_cxt;
state->rs_bulkstate = smgr_bulk_start_rel(new_heap, MAIN_FORKNUM);
- /* Initialize hash tables used to track update chains */
- hash_ctl.keysize = sizeof(TidHashKey);
- hash_ctl.entrysize = sizeof(UnresolvedTupData);
- hash_ctl.hcxt = state->rs_cxt;
-
- state->rs_unresolved_tups =
- hash_create("Rewrite / Unresolved ctids",
- 128, /* arbitrary initial size */
- &hash_ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
- hash_ctl.entrysize = sizeof(OldToNewMappingData);
+ if (tid_chains)
+ {
+ HASHCTL hash_ctl;
+
+ /* Initialize hash tables used to track update chains */
+ hash_ctl.keysize = sizeof(TidHashKey);
+ hash_ctl.entrysize = sizeof(UnresolvedTupData);
+ hash_ctl.hcxt = state->rs_cxt;
+
+ state->rs_unresolved_tups =
+ hash_create("Rewrite / Unresolved ctids",
+ 128, /* arbitrary initial size */
+ &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+ hash_ctl.entrysize = sizeof(OldToNewMappingData);
+
+ state->rs_old_new_tid_map =
+ hash_create("Rewrite / Old to new tid map",
+ 128, /* arbitrary initial size */
+ &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ }
- state->rs_old_new_tid_map =
- hash_create("Rewrite / Old to new tid map",
- 128, /* arbitrary initial size */
- &hash_ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ logical_begin_heap_rewrite(state);
MemoryContextSwitchTo(old_cxt);
- logical_begin_heap_rewrite(state);
-
return state;
}
@@ -303,12 +307,15 @@ end_heap_rewrite(RewriteState state)
* Write any remaining tuples in the UnresolvedTups table. If we have any
* left, they should in fact be dead, but let's err on the safe side.
*/
- hash_seq_init(&seq_status, state->rs_unresolved_tups);
-
- while ((unresolved = hash_seq_search(&seq_status)) != NULL)
+ if (state->rs_unresolved_tups)
{
- ItemPointerSetInvalid(&unresolved->tuple->t_data->t_ctid);
- raw_heap_insert(state, unresolved->tuple);
+ hash_seq_init(&seq_status, state->rs_unresolved_tups);
+
+ while ((unresolved = hash_seq_search(&seq_status)) != NULL)
+ {
+ ItemPointerSetInvalid(&unresolved->tuple->t_data->t_ctid);
+ raw_heap_insert(state, unresolved->tuple);
+ }
}
/* Write the last page, if any */
@@ -995,7 +1002,7 @@ logical_rewrite_log_mapping(RewriteState state, TransactionId xid,
* Perform logical remapping for a tuple that's mapped from old_tid to
* new_tuple->t_self by rewrite_heap_tuple() if necessary for the tuple.
*/
-static void
+void
logical_rewrite_heap_tuple(RewriteState state, ItemPointerData old_tid,
HeapTuple new_tuple)
{
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index a2c072a223..97e3e57305 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -23,6 +23,7 @@
#include "access/heapam.h"
#include "access/multixact.h"
#include "access/relscan.h"
+#include "access/rewriteheap.h"
#include "access/tableam.h"
#include "access/toast_internals.h"
#include "access/transam.h"
@@ -201,17 +202,21 @@ static HeapTuple get_changed_tuple(ConcurrentChange *change);
static void apply_concurrent_changes(ClusterDecodingState *dstate,
Relation rel, ScanKey key, int nkeys,
IndexInsertState *iistate,
- struct timeval *must_complete);
+ struct timeval *must_complete,
+ RewriteState rwstate);
static void apply_concurrent_insert(Relation rel, ConcurrentChange *change,
HeapTuple tup, IndexInsertState *iistate,
- TupleTableSlot *index_slot);
+ TupleTableSlot *index_slot,
+ RewriteState rwstate);
static void apply_concurrent_update(Relation rel, HeapTuple tup,
HeapTuple tup_target,
ConcurrentChange *change,
IndexInsertState *iistate,
- TupleTableSlot *index_slot);
+ TupleTableSlot *index_slot,
+ RewriteState rwstate);
static void apply_concurrent_delete(Relation rel, HeapTuple tup_target,
- ConcurrentChange *change);
+ ConcurrentChange *change,
+ RewriteState rwstate);
static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
HeapTuple tup_key,
Snapshot snapshot,
@@ -225,7 +230,8 @@ static bool process_concurrent_changes(LogicalDecodingContext *ctx,
ScanKey ident_key,
int ident_key_nentries,
IndexInsertState *iistate,
- struct timeval *must_complete);
+ struct timeval *must_complete,
+ RewriteState rwstate);
static bool processing_time_elapsed(struct timeval *must_complete);
static IndexInsertState *get_index_insert_state(Relation relation,
Oid ident_index_id);
@@ -3139,7 +3145,7 @@ cluster_decode_concurrent_changes(LogicalDecodingContext *ctx,
static void
apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
ScanKey key, int nkeys, IndexInsertState *iistate,
- struct timeval *must_complete)
+ struct timeval *must_complete, RewriteState rwstate)
{
TupleTableSlot *index_slot, *ident_slot;
HeapTuple tup_old = NULL;
@@ -3213,7 +3219,8 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
{
Assert(tup_old == NULL);
- apply_concurrent_insert(rel, change, tup, iistate, index_slot);
+ apply_concurrent_insert(rel, change, tup, iistate, index_slot,
+ rwstate);
pfree(tup);
}
@@ -3221,7 +3228,7 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
change->kind == CHANGE_DELETE)
{
IndexScanDesc ind_scan = NULL;
- HeapTuple tup_key;
+ HeapTuple tup_key, tup_exist_cp;
if (change->kind == CHANGE_UPDATE_NEW)
{
@@ -3263,11 +3270,23 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
if (tup_exist == NULL)
elog(ERROR, "Failed to find target tuple");
+ /*
+ * Update the mapping for xmax of the old version.
+ *
+ * Use a copy ('tup_exist' can point to shared buffer) with xmin
+ * invalid because mapping of that should have been written on
+ * insertion.
+ */
+ tup_exist_cp = heap_copytuple(tup_exist);
+ HeapTupleHeaderSetXmin(tup_exist_cp->t_data, InvalidTransactionId);
+ logical_rewrite_heap_tuple(rwstate, change->old_tid, tup_exist_cp);
+ pfree(tup_exist_cp);
+
if (change->kind == CHANGE_UPDATE_NEW)
apply_concurrent_update(rel, tup, tup_exist, change, iistate,
- index_slot);
+ index_slot, rwstate);
else
- apply_concurrent_delete(rel, tup_exist, change);
+ apply_concurrent_delete(rel, tup_exist, change, rwstate);
ResetClusterCurrentXids();
@@ -3320,9 +3339,12 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
static void
apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
- IndexInsertState *iistate, TupleTableSlot *index_slot)
+ IndexInsertState *iistate, TupleTableSlot *index_slot,
+ RewriteState rwstate)
{
+ HeapTupleHeader tup_hdr = tup->t_data;
Snapshot snapshot = change->snapshot;
+ ItemPointerData old_tid;
List *recheck;
/*
@@ -3332,6 +3354,9 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
*/
SetClusterCurrentXids(snapshot->subxip, snapshot->subxcnt);
+ /* Remember location in the old heap. */
+ ItemPointerCopy(&tup_hdr->t_ctid, &old_tid);
+
/*
* Write the tuple into the new heap.
*
@@ -3347,6 +3372,14 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
heap_insert(rel, tup, change->xid, snapshot->curcid - 1,
HEAP_INSERT_NO_LOGICAL, NULL);
+ /*
+ * Update the mapping for xmin. (xmax should be invalid). This is needed
+ * because, during the processing, the table is considered an "user
+ * catalog".
+ */
+ Assert(!TransactionIdIsValid(HeapTupleHeaderGetRawXmax(tup->t_data)));
+ logical_rewrite_heap_tuple(rwstate, old_tid, tup);
+
/*
* Update indexes.
*
@@ -3380,15 +3413,22 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
static void
apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
ConcurrentChange *change, IndexInsertState *iistate,
- TupleTableSlot *index_slot)
+ TupleTableSlot *index_slot, RewriteState rwstate)
{
List *recheck;
LockTupleMode lockmode;
TU_UpdateIndexes update_indexes;
+ ItemPointerData tid_new_old_heap, tid_old_new_heap;
TM_Result res;
Snapshot snapshot = change->snapshot;
TM_FailureData tmfd;
+ /* Location of the new tuple in the old heap. */
+ ItemPointerCopy(&tup->t_data->t_ctid, &tid_new_old_heap);
+
+ /* Location of the existing tuple in the new heap. */
+ ItemPointerCopy(&tup_target->t_self, &tid_old_new_heap);
+
/*
* Write the new tuple into the new heap. ('tup' gets the TID assigned
* here.)
@@ -3398,7 +3438,7 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
Assert(snapshot->curcid != InvalidCommandId &&
snapshot->curcid > FirstCommandId);
- res = heap_update(rel, &tup_target->t_self, tup,
+ res = heap_update(rel, &tid_old_new_heap, tup,
change->xid, snapshot->curcid - 1,
InvalidSnapshot,
false, /* no wait - only we are doing changes */
@@ -3408,6 +3448,10 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
if (res != TM_Ok)
ereport(ERROR, (errmsg("failed to apply concurrent UPDATE")));
+ /* Update the mapping for xmin of the new version. */
+ Assert(!TransactionIdIsValid(HeapTupleHeaderGetRawXmax(tup->t_data)));
+ logical_rewrite_heap_tuple(rwstate, tid_new_old_heap, tup);
+
ExecStoreHeapTuple(tup, index_slot, false);
if (update_indexes != TU_None)
@@ -3431,8 +3475,9 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
static void
apply_concurrent_delete(Relation rel, HeapTuple tup_target,
- ConcurrentChange *change)
+ ConcurrentChange *change, RewriteState rwstate)
{
+ ItemPointerData tid_old_new_heap;
TM_Result res;
TM_FailureData tmfd;
Snapshot snapshot = change->snapshot;
@@ -3441,7 +3486,10 @@ apply_concurrent_delete(Relation rel, HeapTuple tup_target,
Assert(snapshot->curcid != InvalidCommandId &&
snapshot->curcid > FirstCommandId);
- res = heap_delete(rel, &tup_target->t_self, change->xid,
+ /* Location of the existing tuple in the new heap. */
+ ItemPointerCopy(&tup_target->t_self, &tid_old_new_heap);
+
+ res = heap_delete(rel, &tid_old_new_heap, change->xid,
snapshot->curcid - 1, InvalidSnapshot, false,
&tmfd, false,
/* wal_logical */
@@ -3522,7 +3570,8 @@ static bool
process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
Relation rel_dst, Relation rel_src, ScanKey ident_key,
int ident_key_nentries, IndexInsertState *iistate,
- struct timeval *must_complete)
+ struct timeval *must_complete,
+ RewriteState rwstate)
{
ClusterDecodingState *dstate;
@@ -3555,7 +3604,8 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
rel_dst->rd_toastoid = rel_src->rd_rel->reltoastrelid;
apply_concurrent_changes(dstate, rel_dst, ident_key,
- ident_key_nentries, iistate, must_complete);
+ ident_key_nentries, iistate, must_complete,
+ rwstate);
}
PG_FINALLY();
{
@@ -3740,6 +3790,7 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
bool is_system_catalog;
Oid ident_idx_old, ident_idx_new;
IndexInsertState *iistate;
+ RewriteState rwstate;
ScanKey ident_key;
int ident_key_nentries;
XLogRecPtr wal_insert_ptr, end_of_wal;
@@ -3825,11 +3876,26 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
* Apply concurrent changes first time, to minimize the time we need to
* hold AccessExclusiveLock. (Quite some amount of WAL could have been
* written during the data copying and index creation.)
+ *
+ * Now we are processing individual tuples, so pass false for
+ * 'tid_chains'. Since rwstate is now only needed for
+ * logical_begin_heap_rewrite(), none of the transaction IDs needs to be
+ * valid.
*/
+ rwstate = begin_heap_rewrite(OldHeap, NewHeap,
+ InvalidTransactionId,
+ InvalidTransactionId,
+ InvalidTransactionId,
+ false);
process_concurrent_changes(ctx, end_of_wal, NewHeap,
swap_toast_by_content ? OldHeap : NULL,
ident_key, ident_key_nentries, iistate,
- NULL);
+ NULL, rwstate);
+ /*
+ * OldHeap will be closed, so we need to initialize rwstate again for the
+ * next call of process_concurrent_changes().
+ */
+ end_heap_rewrite(rwstate);
/*
* Release the locks that allowed concurrent data changes, in order to
@@ -3951,6 +4017,11 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
end_of_wal = GetFlushRecPtr(NULL);
/* Apply the concurrent changes again. */
+ rwstate = begin_heap_rewrite(OldHeap, NewHeap,
+ InvalidTransactionId,
+ InvalidTransactionId,
+ InvalidTransactionId,
+ false);
/*
* This time we have the exclusive lock on the table, so make sure that
* cluster_max_xlock_time is not exceeded.
@@ -3978,11 +4049,12 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
if (!process_concurrent_changes(ctx, end_of_wal, NewHeap,
swap_toast_by_content ? OldHeap : NULL,
ident_key, ident_key_nentries, iistate,
- t_end_ptr))
+ t_end_ptr, rwstate))
ereport(ERROR,
(errmsg("could not process concurrent data changes in time"),
errhint("Please consider adjusting \"cluster_max_xlock_time\".")));
+ end_heap_rewrite(rwstate);
/* Remember info about rel before closing OldHeap */
relpersistence = OldHeap->rd_rel->relpersistence;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 7526c1a381..398aa0c77d 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -987,11 +987,13 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
xl_heap_insert *xlrec;
ReorderBufferChange *change;
RelFileLocator target_locator;
+ BlockNumber blknum;
+ HeapTupleHeader tuphdr;
xlrec = (xl_heap_insert *) XLogRecGetData(r);
/* only interested in our database */
- XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
+ XLogRecGetBlockTag(r, 0, &target_locator, NULL, &blknum);
if (target_locator.dbOid != ctx->slot->data.database)
return;
@@ -1016,6 +1018,13 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
DecodeXLogTuple(tupledata, datalen, change->data.tp.newtuple);
+ /*
+ * CTID is needed for logical_rewrite_heap_tuple(), when doing CLUSTER
+ * CONCURRENTLY.
+ */
+ tuphdr = change->data.tp.newtuple->t_data;
+ ItemPointerSet(&tuphdr->t_ctid, blknum, xlrec->offnum);
+
change->data.tp.clear_toast_afterwards = true;
ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
@@ -1037,11 +1046,14 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
ReorderBufferChange *change;
char *data;
RelFileLocator target_locator;
+ BlockNumber old_blknum, new_blknum;
xlrec = (xl_heap_update *) XLogRecGetData(r);
+ /* Retrieve blknum, so that we can compose CTID below. */
+ XLogRecGetBlockTag(r, 0, &target_locator, NULL, &new_blknum);
+
/* only interested in our database */
- XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
if (target_locator.dbOid != ctx->slot->data.database)
return;
@@ -1058,6 +1070,7 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
{
Size datalen;
Size tuplelen;
+ HeapTupleHeader tuphdr;
data = XLogRecGetBlockData(r, 0, &datalen);
@@ -1067,6 +1080,13 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
ReorderBufferGetTupleBuf(ctx->reorder, tuplelen);
DecodeXLogTuple(data, datalen, change->data.tp.newtuple);
+
+ /*
+ * CTID is needed for logical_rewrite_heap_tuple(), when doing CLUSTER
+ * CONCURRENTLY.
+ */
+ tuphdr = change->data.tp.newtuple->t_data;
+ ItemPointerSet(&tuphdr->t_ctid, new_blknum, xlrec->new_offnum);
}
if (xlrec->flags & XLH_UPDATE_CONTAINS_OLD)
@@ -1085,6 +1105,14 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
DecodeXLogTuple(data, datalen, change->data.tp.oldtuple);
}
+ /*
+ * Remember the old tuple CTID, for the sake of
+ * logical_rewrite_heap_tuple().
+ */
+ if (!XLogRecGetBlockTagExtended(r, 1, NULL, NULL, &old_blknum, NULL))
+ old_blknum = new_blknum;
+ ItemPointerSet(&change->data.tp.old_tid, old_blknum, xlrec->old_offnum);
+
change->data.tp.clear_toast_afterwards = true;
ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
@@ -1103,11 +1131,12 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
xl_heap_delete *xlrec;
ReorderBufferChange *change;
RelFileLocator target_locator;
+ BlockNumber blknum;
xlrec = (xl_heap_delete *) XLogRecGetData(r);
/* only interested in our database */
- XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
+ XLogRecGetBlockTag(r, 0, &target_locator, NULL, &blknum);
if (target_locator.dbOid != ctx->slot->data.database)
return;
@@ -1139,6 +1168,12 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
DecodeXLogTuple((char *) xlrec + SizeOfHeapDelete,
datalen, change->data.tp.oldtuple);
+
+ /*
+ * CTID is needed for logical_rewrite_heap_tuple(), when doing CLUSTER
+ * CONCURRENTLY.
+ */
+ ItemPointerSet(&change->data.tp.old_tid, blknum, xlrec->offnum);
}
change->data.tp.clear_toast_afterwards = true;
diff --git a/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c b/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
index db6a2bcf1f..54a7e3ca68 100644
--- a/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
+++ b/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
@@ -34,7 +34,7 @@ static void plugin_truncate(struct LogicalDecodingContext *ctx,
ReorderBufferChange *change);
static void store_change(LogicalDecodingContext *ctx,
ConcurrentChangeKind kind, HeapTuple tuple,
- TransactionId xid);
+ TransactionId xid, ItemPointer old_tid);
void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
@@ -169,7 +169,8 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (newtuple == NULL)
elog(ERROR, "Incomplete insert info.");
- store_change(ctx, CHANGE_INSERT, newtuple, change->txn->xid);
+ store_change(ctx, CHANGE_INSERT, newtuple, change->txn->xid,
+ NULL);
}
break;
case REORDER_BUFFER_CHANGE_UPDATE:
@@ -187,10 +188,10 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (oldtuple != NULL)
store_change(ctx, CHANGE_UPDATE_OLD, oldtuple,
- change->txn->xid);
+ change->txn->xid, NULL);
store_change(ctx, CHANGE_UPDATE_NEW, newtuple,
- change->txn->xid);
+ change->txn->xid, &change->data.tp.old_tid);
}
break;
case REORDER_BUFFER_CHANGE_DELETE:
@@ -203,7 +204,8 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (oldtuple == NULL)
elog(ERROR, "Incomplete delete info.");
- store_change(ctx, CHANGE_DELETE, oldtuple, change->txn->xid);
+ store_change(ctx, CHANGE_DELETE, oldtuple, change->txn->xid,
+ &change->data.tp.old_tid);
}
break;
default:
@@ -237,13 +239,13 @@ plugin_truncate(struct LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (i == nrelations)
return;
- store_change(ctx, CHANGE_TRUNCATE, NULL, InvalidTransactionId);
+ store_change(ctx, CHANGE_TRUNCATE, NULL, InvalidTransactionId, NULL);
}
/* Store concurrent data change. */
static void
store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
- HeapTuple tuple, TransactionId xid)
+ HeapTuple tuple, TransactionId xid, ItemPointer old_tid)
{
ClusterDecodingState *dstate;
char *change_raw;
@@ -308,6 +310,11 @@ store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
change->snapshot = dstate->snapshot;
dstate->snapshot->active_count++;
+ if (old_tid)
+ ItemPointerCopy(old_tid, &change->old_tid);
+ else
+ ItemPointerSetInvalid(&change->old_tid);
+
/* The data has been copied. */
if (flattened)
pfree(tuple);
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 5866a26bdd..de62b6abf8 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -23,11 +23,14 @@ typedef struct RewriteStateData *RewriteState;
extern RewriteState begin_heap_rewrite(Relation old_heap, Relation new_heap,
TransactionId oldest_xmin, TransactionId freeze_xid,
- MultiXactId cutoff_multi);
+ MultiXactId cutoff_multi, bool tid_chains);
extern void end_heap_rewrite(RewriteState state);
extern void rewrite_heap_tuple(RewriteState state, HeapTuple old_tuple,
HeapTuple new_tuple);
extern bool rewrite_heap_dead_tuple(RewriteState state, HeapTuple old_tuple);
+extern void logical_rewrite_heap_tuple(RewriteState state,
+ ItemPointerData old_tid,
+ HeapTuple new_tuple);
/*
* On-Disk data format for an individual logical rewrite mapping.
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 5d904ce985..69a9aba050 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -76,6 +76,9 @@ typedef struct ConcurrentChange
/* Transaction that changes the data. */
TransactionId xid;
+ /* For UPDATE / DELETE, the location of the old tuple version. */
+ ItemPointerData old_tid;
+
/*
* Historic catalog snapshot that was used to decode this change.
*/
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index e332635f70..009bbaa1fa 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -104,6 +104,13 @@ typedef struct ReorderBufferChange
HeapTuple oldtuple;
/* valid for INSERT || UPDATE */
HeapTuple newtuple;
+
+ /*
+ * CLUSTER CONCURRENTLY needs the old TID, even if the old tuple
+ * itself is not WAL-logged (i.e. when the identity key does not
+ * change).
+ */
+ ItemPointerData old_tid;
} tp;
/*
--
2.45.2
On Wed, Sep 4, 2024 at 7:41 PM Antonin Houska <ah@cybertec.at> wrote:
Junwang Zhao <zhjwpku@gmail.com> wrote:
Thanks for working on this, I think this is a very useful feature.
The patch doesn't compile in the debug build with errors:
../postgres/src/backend/commands/cluster.c: In function ‘get_catalog_state’:
../postgres/src/backend/commands/cluster.c:2771:33: error: declaration
of ‘td_src’ shadows a previous local [-Werror=shadow=compatible-local]
2771 | TupleDesc td_src, td_dst;
| ^~~~~~
../postgres/src/backend/commands/cluster.c:2741:25: note: shadowed
declaration is here
2741 | TupleDesc td_src = RelationGetDescr(rel);

ok, gcc14 complains here, the compiler I used before did not. Fixed.
you forgot the meson build for pgoutput_cluster
diff --git a/src/backend/meson.build b/src/backend/meson.build
index 78c5726814..0f9141a4ac 100644
--- a/src/backend/meson.build
+++ b/src/backend/meson.build
@@ -194,5 +194,6 @@ pg_test_mod_args = pg_mod_args + {
 subdir('jit/llvm')
 subdir('replication/libpqwalreceiver')
 subdir('replication/pgoutput')
+subdir('replication/pgoutput_cluster')

Fixed, thanks. That might be the reason for the cfbot to fail when using
meson.

I noticed that you use lmode/lock_mode/lockmode, there are lmode and lockmode
in the codebase, but I remember someone proposed all changes to lockmode, how
about sticking to lockmode in your patch?

Fixed.
0004:
+ sure that the old files do not change during the processing because the
+ chnages would get lost due to the swap.

typo

Fixed.
+ files. The data changes that took place during the creation of the new
+ table and index files are captured using logical decoding
+ (<xref linkend="logicaldecoding"/>) and applied before
+ the <literal>ACCESS EXCLUSIVE</literal> lock is requested. Thus the lock
+ is typically held only for the time needed to swap the files, which
+ should be pretty short.

I remember pg_squeeze also did some logical decoding after getting the
exclusive lock, if that is still true, I guess the doc above is not precise.

The decoding takes place before requesting the lock, as well as after
that. I've adjusted the paragraph, see 0007.

+ Note that <command>CLUSTER</command> with the
+ the <literal>CONCURRENTLY</literal> option does not try to order the
+ rows inserted into the table after the clustering started.

Do you mean after the *logical decoding* started here? If CLUSTER CONCURRENTLY
does not order rows at all, why bother implementing it?

The rows inserted before CLUSTER (CONCURRENTLY) started do get ordered, the
rows inserted after that do not. (Actually what matters is when the snapshot
for the initial load is created, but that happens in a very early stage of the
processing. Not sure if the user is interested in such implementation
details.)

+ errhint("CLUSTER CONCURRENTLY is only allowed for permanent relations")));

errhint messages should end with a dot. Why hardcoded to "CLUSTER CONCURRENTLY"
instead of parameter *stmt*?

Fixed.

+ ResourceOwner oldowner = CurrentResourceOwner;
+
+ /*
+ * In the CONCURRENT case, do the planning in a subtrensaction so that

typo

Fixed.

I did not see VacuumStmt changes in gram.y, how are we supposed to
use vacuum full concurrently? I tried the following but no success.

With the "parenthesized syntax", new options can be added w/o changing
gram.y. (While the "unparenthesized syntax" is deprecated.)

[local] postgres@demo:5432-36097=# vacuum (concurrently) aircrafts_data;
ERROR: CONCURRENTLY can only be specified with VACUUM FULL

The "lazy" VACUUM works concurrently as such.

[local] postgres@demo:5432-36097=# vacuum full (concurrently) full
aircrafts_data;
ERROR: syntax error at or near "("
LINE 1: vacuum full (concurrently) full aircrafts_data;

This is not specific to the CONCURRENTLY option. For example:

postgres=# vacuum full (analyze) full aircrafts_data;
ERROR: syntax error at or near "("
LINE 1: vacuum full (analyze) full aircrafts_data;

(You seem to combine the parenthesized syntax with the unparenthesized.)
Yeah, my mistake, *vacuum (full, concurrently)* works.
+ if (TransactionIdIsNormal(HeapTupleHeaderGetRawXmax(tuple->t_data)) &&
+ HeapTupleMVCCNotDeleted(tuple, snapshot, buffer))
+ {
+ /* TODO More work needed here?*/
+ tuple->t_data->t_infomask |= HEAP_XMAX_INVALID;
+ HeapTupleHeaderSetXmax(tuple->t_data, 0);
+ }
I don't quite understand the above code, IIUC xmax and xmax invalid
are set directly on the buffer page. What if the command failed? Will
this break the visibility rules?
btw, v4-0006 failed to apply.
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
--
Regards
Junwang Zhao
Junwang Zhao <zhjwpku@gmail.com> wrote:
+ if (TransactionIdIsNormal(HeapTupleHeaderGetRawXmax(tuple->t_data)) &&
+ HeapTupleMVCCNotDeleted(tuple, snapshot, buffer))
+ {
+ /* TODO More work needed here?*/
+ tuple->t_data->t_infomask |= HEAP_XMAX_INVALID;
+ HeapTupleHeaderSetXmax(tuple->t_data, 0);
+ }

I don't quite understand the above code, IIUC xmax and xmax invalid
are set directly on the buffer page. What if the command failed? Will
this break the visibility rules?
Oh, that's too bad. Of course, the tuple must be copied. Fixed in the next
version. Thanks!
(0009 was added to get some debugging information from the cfbot. It fails
sometimes and I'm not able to reproduce the errors.)
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Attachments:
v05-0001-Adjust-signature-of-cluster_rel-and-its-subroutines.patch (text/x-diff)
From 69ea081fe2d14e4a227d1493a1f73def3facb8d4 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Wed, 9 Oct 2024 09:44:20 +0200
Subject: [PATCH 1/8] Adjust signature of cluster_rel() and its subroutines.
So far cluster_rel() received OID of the relation it should process and it
performed opening and locking of the relation itself. Yet copy_table_data()
received the OID as well and also had to open the relation itself. This patch
tries to eliminate the repeated opening and closing.
One particular reason for this change is that the VACUUM FULL / CLUSTER
command with the CONCURRENTLY option will need to release all locks on the
relation (and possibly on the clustering index) at some point. Since it makes
little sense to keep relation reference w/o lock, the cluster_rel() function
also closes its reference to the relation (and its index). Neither the
function nor its subroutines may open extra references because then it'd be a
bit harder to close them all.
---
src/backend/commands/cluster.c | 146 ++++++++++++++++++-------------
src/backend/commands/matview.c | 2 +-
src/backend/commands/tablecmds.c | 2 +-
src/backend/commands/vacuum.c | 12 +--
src/include/commands/cluster.h | 5 +-
5 files changed, 99 insertions(+), 68 deletions(-)
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 78f96789b0..bedc177ce4 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -70,8 +70,8 @@ typedef struct
static void cluster_multiple_rels(List *rtcs, ClusterParams *params);
-static void rebuild_relation(Relation OldHeap, Oid indexOid, bool verbose);
-static void copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex,
+static void rebuild_relation(Relation OldHeap, Relation index, bool verbose);
+static void copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
bool verbose, bool *pSwapToastByContent,
TransactionId *pFreezeXid, MultiXactId *pCutoffMulti);
static List *get_tables_to_cluster(MemoryContext cluster_context);
@@ -194,11 +194,11 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
{
- /* close relation, keep lock till commit */
- table_close(rel, NoLock);
-
- /* Do the job. */
- cluster_rel(tableOid, indexOid, ¶ms);
+ /*
+ * Do the job. (The function will close the relation, lock is kept
+ * till commit.)
+ */
+ cluster_rel(rel, indexOid, ¶ms);
return;
}
@@ -275,6 +275,7 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
foreach(lc, rtcs)
{
RelToCluster *rtc = (RelToCluster *) lfirst(lc);
+ Relation rel;
/* Start a new transaction for each relation. */
StartTransactionCommand();
@@ -282,8 +283,13 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
/* functions in indexes may want a snapshot set */
PushActiveSnapshot(GetTransactionSnapshot());
- /* Do the job. */
- cluster_rel(rtc->tableOid, rtc->indexOid, params);
+ rel = table_open(rtc->tableOid, AccessExclusiveLock);
+
+ /*
+ * Do the job. (The function will close the relation, lock is kept
+ * till commit.)
+ */
+ cluster_rel(rel, rtc->indexOid, params);
PopActiveSnapshot();
CommitTransactionCommand();
@@ -306,16 +312,19 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
* If indexOid is InvalidOid, the table will be rewritten in physical order
* instead of index order. This is the new implementation of VACUUM FULL,
* and error messages should refer to the operation as VACUUM not CLUSTER.
+ *
+ * We expect that OldHeap is already locked in AccessExclusiveLock mode.
*/
void
-cluster_rel(Oid tableOid, Oid indexOid, ClusterParams *params)
+cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
{
- Relation OldHeap;
+ Oid tableOid = RelationGetRelid(OldHeap);
Oid save_userid;
int save_sec_context;
int save_nestlevel;
bool verbose = ((params->options & CLUOPT_VERBOSE) != 0);
bool recheck = ((params->options & CLUOPT_RECHECK) != 0);
+ Relation index = NULL;
/* Check for user-requested abort. */
CHECK_FOR_INTERRUPTS();
@@ -328,21 +337,6 @@ cluster_rel(Oid tableOid, Oid indexOid, ClusterParams *params)
pgstat_progress_update_param(PROGRESS_CLUSTER_COMMAND,
PROGRESS_CLUSTER_COMMAND_VACUUM_FULL);
- /*
- * We grab exclusive access to the target rel and index for the duration
- * of the transaction. (This is redundant for the single-transaction
- * case, since cluster() already did it.) The index lock is taken inside
- * check_index_is_clusterable.
- */
- OldHeap = try_relation_open(tableOid, AccessExclusiveLock);
-
- /* If the table has gone away, we can skip processing it */
- if (!OldHeap)
- {
- pgstat_progress_end_command();
- return;
- }
-
/*
* Switch to the table owner's userid, so that any index functions are run
* as that user. Also lock down security-restricted operations and
@@ -445,7 +439,11 @@ cluster_rel(Oid tableOid, Oid indexOid, ClusterParams *params)
/* Check heap and index are valid to cluster on */
if (OidIsValid(indexOid))
+ {
check_index_is_clusterable(OldHeap, indexOid, AccessExclusiveLock);
+ /* Open the index (It should already be locked.) */
+ index = index_open(indexOid, NoLock);
+ }
/*
* Quietly ignore the request if this is a materialized view which has not
@@ -474,9 +472,12 @@ cluster_rel(Oid tableOid, Oid indexOid, ClusterParams *params)
TransferPredicateLocksToHeapRelation(OldHeap);
/* rebuild_relation does all the dirty work */
- rebuild_relation(OldHeap, indexOid, verbose);
+ rebuild_relation(OldHeap, index, verbose);
- /* NB: rebuild_relation does table_close() on OldHeap */
+ /*
+ * NB: rebuild_relation does table_close() on OldHeap, and also on index,
+ * if the pointer is valid.
+ */
out:
/* Roll back any GUC changes executed by index functions */
@@ -625,22 +626,27 @@ mark_index_clustered(Relation rel, Oid indexOid, bool is_internal)
* rebuild_relation: rebuild an existing relation in index or physical order
*
* OldHeap: table to rebuild --- must be opened and exclusive-locked!
- * indexOid: index to cluster by, or InvalidOid to rewrite in physical order.
+ * index: index to cluster by, or NULL to rewrite in physical order. Must be
+ * opened and locked.
*
- * NB: this routine closes OldHeap at the right time; caller should not.
+ * On exit, the heap (and the index, if one was passed) is closed, but
+ * still locked with AccessExclusiveLock.
*/
static void
-rebuild_relation(Relation OldHeap, Oid indexOid, bool verbose)
+rebuild_relation(Relation OldHeap, Relation index, bool verbose)
{
Oid tableOid = RelationGetRelid(OldHeap);
+ Oid indexOid = index ? RelationGetRelid(index) : InvalidOid;
Oid accessMethod = OldHeap->rd_rel->relam;
Oid tableSpace = OldHeap->rd_rel->reltablespace;
Oid OIDNewHeap;
+ Relation NewHeap;
char relpersistence;
bool is_system_catalog;
bool swap_toast_by_content;
TransactionId frozenXid;
MultiXactId cutoffMulti;
+ LOCKMODE lockmode_new;
if (OidIsValid(indexOid))
/* Mark the correct index as clustered */
@@ -650,19 +656,40 @@ rebuild_relation(Relation OldHeap, Oid indexOid, bool verbose)
relpersistence = OldHeap->rd_rel->relpersistence;
is_system_catalog = IsSystemRelation(OldHeap);
- /* Close relcache entry, but keep lock until transaction commit */
- table_close(OldHeap, NoLock);
-
- /* Create the transient table that will receive the re-ordered data */
+ /*
+ * Create the transient table that will receive the re-ordered data.
+ *
+ * NoLock for the old heap because we already have it locked and want to
+ * keep unlocking straightforward.
+ */
+ lockmode_new = AccessExclusiveLock;
OIDNewHeap = make_new_heap(tableOid, tableSpace,
accessMethod,
relpersistence,
- AccessExclusiveLock);
+ NoLock, &lockmode_new);
+ Assert(lockmode_new == AccessExclusiveLock || lockmode_new == NoLock);
+ /* Lock iff not done above. */
+ NewHeap = table_open(OIDNewHeap, lockmode_new == NoLock ?
+ AccessExclusiveLock : NoLock);
/* Copy the heap data into the new table in the desired order */
- copy_table_data(OIDNewHeap, tableOid, indexOid, verbose,
+ copy_table_data(NewHeap, OldHeap, index, verbose,
&swap_toast_by_content, &frozenXid, &cutoffMulti);
+
+ /* Close relcache entries, but keep lock until transaction commit */
+ table_close(OldHeap, NoLock);
+ if (index)
+ index_close(index, NoLock);
+
+ /*
+ * Close the new relation so it can be dropped as soon as the storage is
+ * swapped. The relation is not visible to others, so we could unlock it
+ * completely, but it's simpler to pass NoLock than to track all the locks
+ * acquired so far.
+ */
+ table_close(NewHeap, NoLock);
+
/*
* Swap the physical files of the target and transient tables, then
* rebuild the target's indexes and throw away the transient table.
@@ -683,10 +710,15 @@ rebuild_relation(Relation OldHeap, Oid indexOid, bool verbose)
*
* After this, the caller should load the new heap with transferred/modified
* data, then call finish_heap_swap to complete the operation.
+ *
+ * If a specific lock mode is needed for the new relation, pass it via the
+ * in/out parameter lockmode_new_p. On exit, the output value tells whether
+ * the lock was actually acquired.
*/
Oid
make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
- char relpersistence, LOCKMODE lockmode)
+ char relpersistence, LOCKMODE lockmode_old,
+ LOCKMODE *lockmode_new_p)
{
TupleDesc OldHeapDesc;
char NewHeapName[NAMEDATALEN];
@@ -697,8 +729,17 @@ make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
Datum reloptions;
bool isNull;
Oid namespaceid;
+ LOCKMODE lockmode_new;
- OldHeap = table_open(OIDOldHeap, lockmode);
+ if (lockmode_new_p)
+ {
+ lockmode_new = *lockmode_new_p;
+ *lockmode_new_p = NoLock;
+ }
+ else
+ lockmode_new = lockmode_old;
+
+ OldHeap = table_open(OIDOldHeap, lockmode_old);
OldHeapDesc = RelationGetDescr(OldHeap);
/*
@@ -792,7 +833,9 @@ make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
if (isNull)
reloptions = (Datum) 0;
- NewHeapCreateToastTable(OIDNewHeap, reloptions, lockmode, toastid);
+ NewHeapCreateToastTable(OIDNewHeap, reloptions, lockmode_new, toastid);
+ if (lockmode_new_p)
+ *lockmode_new_p = lockmode_new;
ReleaseSysCache(tuple);
}
@@ -811,13 +854,13 @@ make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
* *pCutoffMulti receives the MultiXactId used as a cutoff point.
*/
static void
-copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
+copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verbose,
bool *pSwapToastByContent, TransactionId *pFreezeXid,
MultiXactId *pCutoffMulti)
{
- Relation NewHeap,
- OldHeap,
- OldIndex;
+ Oid OIDOldHeap = RelationGetRelid(OldHeap);
+ Oid OIDOldIndex = OldIndex ? RelationGetRelid(OldIndex) : InvalidOid;
+ Oid OIDNewHeap = RelationGetRelid(NewHeap);
Relation relRelation;
HeapTuple reltup;
Form_pg_class relform;
@@ -836,16 +879,6 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
pg_rusage_init(&ru0);
- /*
- * Open the relations we need.
- */
- NewHeap = table_open(OIDNewHeap, AccessExclusiveLock);
- OldHeap = table_open(OIDOldHeap, AccessExclusiveLock);
- if (OidIsValid(OIDOldIndex))
- OldIndex = index_open(OIDOldIndex, AccessExclusiveLock);
- else
- OldIndex = NULL;
-
/* Store a copy of the namespace name for logging purposes */
nspname = get_namespace_name(RelationGetNamespace(OldHeap));
@@ -1001,11 +1034,6 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
tups_recently_dead,
pg_rusage_show(&ru0))));
- if (OldIndex != NULL)
- index_close(OldIndex, NoLock);
- table_close(OldHeap, NoLock);
- table_close(NewHeap, NoLock);
-
/* Update pg_class to reflect the correct values of pages and tuples. */
relRelation = table_open(RelationRelationId, RowExclusiveLock);
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 010097873d..8eaf951cc1 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -318,7 +318,7 @@ RefreshMatViewByOid(Oid matviewOid, bool is_create, bool skipData,
*/
OIDNewHeap = make_new_heap(matviewOid, tableSpace,
matviewRel->rd_rel->relam,
- relpersistence, ExclusiveLock);
+ relpersistence, ExclusiveLock, NULL);
LockRelationOid(OIDNewHeap, AccessExclusiveLock);
/* Generate the data, if wanted. */
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index af8c05b91f..cd6125ba39 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -5839,7 +5839,7 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
* unlogged anyway.
*/
OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, NewAccessMethod,
- persistence, lockmode);
+ persistence, lockmode, NULL);
/*
* Copy the heap data into the new table with the desired
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index ac8f5d9c25..d000422800 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -2218,15 +2218,17 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
{
ClusterParams cluster_params = {0};
- /* close relation before vacuuming, but hold lock until commit */
- relation_close(rel, NoLock);
- rel = NULL;
-
if ((params->options & VACOPT_VERBOSE) != 0)
cluster_params.options |= CLUOPT_VERBOSE;
/* VACUUM FULL is now a variant of CLUSTER; see cluster.c */
- cluster_rel(relid, InvalidOid, &cluster_params);
+ cluster_rel(rel, InvalidOid, &cluster_params);
+
+ /*
+ * cluster_rel() should have closed the relation; the lock is kept
+ * until commit.
+ */
+ rel = NULL;
}
else
table_relation_vacuum(rel, params, bstrategy);
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 4e32380417..7492796ea2 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -32,13 +32,14 @@ typedef struct ClusterParams
} ClusterParams;
extern void cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel);
-extern void cluster_rel(Oid tableOid, Oid indexOid, ClusterParams *params);
+extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params);
extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid,
LOCKMODE lockmode);
extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
extern Oid make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
- char relpersistence, LOCKMODE lockmode);
+ char relpersistence, LOCKMODE lockmode_old,
+ LOCKMODE *lockmode_new_p);
extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
bool is_system_catalog,
bool swap_toast_by_content,
--
2.45.2
v05-0002-Move-progress-related-fields-from-PgBackendStatus-to.patch (text/x-diff)
From 9dba024c6de20151ad2ce31c94ecc76287bbc1bd Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Wed, 9 Oct 2024 09:44:20 +0200
Subject: [PATCH 2/8] Move progress related fields from PgBackendStatus to
PgBackendProgress.
VACUUM FULL / CLUSTER CONCURRENTLY will need to save and restore these fields
at some point.
---
src/backend/utils/activity/backend_progress.c | 18 +++++++++---------
src/backend/utils/activity/backend_status.c | 4 ++--
src/backend/utils/adt/pgstatfuncs.c | 6 +++---
src/include/utils/backend_progress.h | 14 ++++++++++++++
src/include/utils/backend_status.h | 14 ++------------
5 files changed, 30 insertions(+), 26 deletions(-)
diff --git a/src/backend/utils/activity/backend_progress.c b/src/backend/utils/activity/backend_progress.c
index c78c5eb507..cc9b4cf0dc 100644
--- a/src/backend/utils/activity/backend_progress.c
+++ b/src/backend/utils/activity/backend_progress.c
@@ -33,9 +33,9 @@ pgstat_progress_start_command(ProgressCommandType cmdtype, Oid relid)
return;
PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
- beentry->st_progress_command = cmdtype;
- beentry->st_progress_command_target = relid;
- MemSet(&beentry->st_progress_param, 0, sizeof(beentry->st_progress_param));
+ beentry->st_progress.command = cmdtype;
+ beentry->st_progress.command_target = relid;
+ MemSet(&beentry->st_progress.param, 0, sizeof(beentry->st_progress.param));
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
@@ -56,7 +56,7 @@ pgstat_progress_update_param(int index, int64 val)
return;
PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
- beentry->st_progress_param[index] = val;
+ beentry->st_progress.param[index] = val;
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
@@ -77,7 +77,7 @@ pgstat_progress_incr_param(int index, int64 incr)
return;
PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
- beentry->st_progress_param[index] += incr;
+ beentry->st_progress.param[index] += incr;
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
@@ -134,7 +134,7 @@ pgstat_progress_update_multi_param(int nparam, const int *index,
{
Assert(index[i] >= 0 && index[i] < PGSTAT_NUM_PROGRESS_PARAM);
- beentry->st_progress_param[index[i]] = val[i];
+ beentry->st_progress.param[index[i]] = val[i];
}
PGSTAT_END_WRITE_ACTIVITY(beentry);
@@ -155,11 +155,11 @@ pgstat_progress_end_command(void)
if (!beentry || !pgstat_track_activities)
return;
- if (beentry->st_progress_command == PROGRESS_COMMAND_INVALID)
+ if (beentry->st_progress.command == PROGRESS_COMMAND_INVALID)
return;
PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
- beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
- beentry->st_progress_command_target = InvalidOid;
+ beentry->st_progress.command = PROGRESS_COMMAND_INVALID;
+ beentry->st_progress.command_target = InvalidOid;
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 34a55e2177..2b77fd8526 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -378,8 +378,8 @@ pgstat_bestart(void)
#endif
lbeentry.st_state = STATE_UNDEFINED;
- lbeentry.st_progress_command = PROGRESS_COMMAND_INVALID;
- lbeentry.st_progress_command_target = InvalidOid;
+ lbeentry.st_progress.command = PROGRESS_COMMAND_INVALID;
+ lbeentry.st_progress.command_target = InvalidOid;
lbeentry.st_query_id = UINT64CONST(0);
/*
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index f7b50e0b5a..b7d175f4b8 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -269,7 +269,7 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
* Report values for only those backends which are running the given
* command.
*/
- if (beentry->st_progress_command != cmdtype)
+ if (beentry->st_progress.command != cmdtype)
continue;
/* Value available to all callers */
@@ -279,9 +279,9 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
/* show rest of the values including relid only to role members */
if (HAS_PGSTAT_PERMISSIONS(beentry->st_userid))
{
- values[2] = ObjectIdGetDatum(beentry->st_progress_command_target);
+ values[2] = ObjectIdGetDatum(beentry->st_progress.command_target);
for (i = 0; i < PGSTAT_NUM_PROGRESS_PARAM; i++)
- values[i + 3] = Int64GetDatum(beentry->st_progress_param[i]);
+ values[i + 3] = Int64GetDatum(beentry->st_progress.param[i]);
}
else
{
diff --git a/src/include/utils/backend_progress.h b/src/include/utils/backend_progress.h
index 7b63d38f97..e09598eafc 100644
--- a/src/include/utils/backend_progress.h
+++ b/src/include/utils/backend_progress.h
@@ -30,8 +30,22 @@ typedef enum ProgressCommandType
PROGRESS_COMMAND_COPY,
} ProgressCommandType;
+
#define PGSTAT_NUM_PROGRESS_PARAM 20
+/*
+ * Any command which wishes can advertise that it is running by setting
+ * command, command_target, and param[]. command_target should be the OID of
+ * the relation which the command targets (we assume there's just one, as this
+ * is meant for utility commands), but the meaning of each element in the
+ * param array is command-specific.
+ */
+typedef struct PgBackendProgress
+{
+ ProgressCommandType command;
+ Oid command_target;
+ int64 param[PGSTAT_NUM_PROGRESS_PARAM];
+} PgBackendProgress;
extern void pgstat_progress_start_command(ProgressCommandType cmdtype,
Oid relid);
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 97874300c3..335faafcef 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -155,18 +155,8 @@ typedef struct PgBackendStatus
*/
char *st_activity_raw;
- /*
- * Command progress reporting. Any command which wishes can advertise
- * that it is running by setting st_progress_command,
- * st_progress_command_target, and st_progress_param[].
- * st_progress_command_target should be the OID of the relation which the
- * command targets (we assume there's just one, as this is meant for
- * utility commands), but the meaning of each element in the
- * st_progress_param array is command-specific.
- */
- ProgressCommandType st_progress_command;
- Oid st_progress_command_target;
- int64 st_progress_param[PGSTAT_NUM_PROGRESS_PARAM];
+ /* Command progress reporting. */
+ PgBackendProgress st_progress;
/* query identifier, optionally computed using post_parse_analyze_hook */
uint64 st_query_id;
--
2.45.2
v05-0003-Move-conversion-of-a-historic-to-MVCC-snapshot-to-a-.patch (text/x-diff)
From 59f4425f176c17cc67421271dd30fdef96dd6e72 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Wed, 9 Oct 2024 09:44:20 +0200
Subject: [PATCH 3/8] Move conversion of a "historic" to MVCC snapshot to a
separate function.
The conversion is now handled by SnapBuildMVCCFromHistoric(). The VACUUM FULL
/ CLUSTER will also need it.
---
src/backend/replication/logical/snapbuild.c | 51 +++++++++++++++++----
src/backend/utils/time/snapmgr.c | 3 +-
src/include/replication/snapbuild.h | 1 +
src/include/utils/snapmgr.h | 1 +
4 files changed, 45 insertions(+), 11 deletions(-)
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 0450f94ba8..4923e35e92 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -579,10 +579,7 @@ Snapshot
SnapBuildInitialSnapshot(SnapBuild *builder)
{
Snapshot snap;
- TransactionId xid;
TransactionId safeXid;
- TransactionId *newxip;
- int newxcnt = 0;
Assert(XactIsoLevel == XACT_REPEATABLE_READ);
Assert(builder->building_full_snapshot);
@@ -624,6 +621,31 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
MyProc->xmin = snap->xmin;
+ /* Convert the historic snapshot to MVCC snapshot. */
+ return SnapBuildMVCCFromHistoric(snap, true);
+}
+
+/*
+ * Turn a historic MVCC snapshot into an ordinary MVCC snapshot.
+ *
+ * Unlike a regular (non-historic) MVCC snapshot, the xip array of this
+ * snapshot contains not only running main transactions, but also their
+ * subtransactions. This difference has no impact on XidInMVCCSnapshot().
+ *
+ * Pass true for 'in_place' if you don't care about modifying the source
+ * snapshot. If you need a new instance, and one that was allocated as a
+ * single chunk of memory, pass false.
+ */
+Snapshot
+SnapBuildMVCCFromHistoric(Snapshot snapshot, bool in_place)
+{
+ TransactionId xid;
+ TransactionId *oldxip = snapshot->xip;
+ uint32 oldxcnt = snapshot->xcnt;
+ TransactionId *newxip;
+ int newxcnt = 0;
+ Snapshot result;
+
/* allocate in transaction context */
newxip = (TransactionId *)
palloc(sizeof(TransactionId) * GetMaxSnapshotXidCount());
@@ -634,7 +656,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
* classical snapshot by marking all non-committed transactions as
* in-progress. This can be expensive.
*/
- for (xid = snap->xmin; NormalTransactionIdPrecedes(xid, snap->xmax);)
+ for (xid = snapshot->xmin; NormalTransactionIdPrecedes(xid, snapshot->xmax);)
{
void *test;
@@ -642,7 +664,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
* Check whether transaction committed using the decoding snapshot
* meaning of ->xip.
*/
- test = bsearch(&xid, snap->xip, snap->xcnt,
+ test = bsearch(&xid, snapshot->xip, snapshot->xcnt,
sizeof(TransactionId), xidComparator);
if (test == NULL)
@@ -659,11 +681,22 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
}
/* adjust remaining snapshot fields as needed */
- snap->snapshot_type = SNAPSHOT_MVCC;
- snap->xcnt = newxcnt;
- snap->xip = newxip;
+ snapshot->xcnt = newxcnt;
+ snapshot->xip = newxip;
+
+ if (in_place)
+ result = snapshot;
+ else
+ {
+ result = CopySnapshot(snapshot);
+
+ /* Restore the original values so the source is intact. */
+ snapshot->xip = oldxip;
+ snapshot->xcnt = oldxcnt;
+ }
+ result->snapshot_type = SNAPSHOT_MVCC;
- return snap;
+ return result;
}
/*
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 7d2b34d4f2..b8b500f48f 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -155,7 +155,6 @@ typedef struct ExportedSnapshot
static List *exportedSnapshots = NIL;
/* Prototypes for local functions */
-static Snapshot CopySnapshot(Snapshot snapshot);
static void UnregisterSnapshotNoOwner(Snapshot snapshot);
static void FreeSnapshot(Snapshot snapshot);
static void SnapshotResetXmin(void);
@@ -570,7 +569,7 @@ SetTransactionSnapshot(Snapshot sourcesnap, VirtualTransactionId *sourcevxid,
* The copy is palloc'd in TopTransactionContext and has initial refcounts set
* to 0. The returned snapshot has the copied flag set.
*/
-static Snapshot
+Snapshot
CopySnapshot(Snapshot snapshot)
{
Snapshot newsnap;
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index caa5113ff8..ad06e80784 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -69,6 +69,7 @@ extern void FreeSnapshotBuilder(SnapBuild *builder);
extern void SnapBuildSnapDecRefcount(Snapshot snap);
extern Snapshot SnapBuildInitialSnapshot(SnapBuild *builder);
+extern Snapshot SnapBuildMVCCFromHistoric(Snapshot snapshot, bool in_place);
extern const char *SnapBuildExportSnapshot(SnapBuild *builder);
extern void SnapBuildClearExportedSnapshot(void);
extern void SnapBuildResetExportedSnapshotState(void);
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index 9398a84051..e7ac89f484 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -68,6 +68,7 @@ extern Snapshot GetLatestSnapshot(void);
extern void SnapshotSetCommandId(CommandId curcid);
extern Snapshot GetOldestSnapshot(void);
+extern Snapshot CopySnapshot(Snapshot snapshot);
extern Snapshot GetCatalogSnapshot(Oid relid);
extern Snapshot GetNonHistoricCatalogSnapshot(Oid relid);
extern void InvalidateCatalogSnapshot(void);
--
2.45.2
v05-0004-Add-CONCURRENTLY-option-to-both-VACUUM-FULL-and-CLUS.patch (text/plain)
From 21378d2109f0731c4a7559fbfcf587372e646849 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Wed, 9 Oct 2024 09:44:20 +0200
Subject: [PATCH 4/8] Add CONCURRENTLY option to both VACUUM FULL and CLUSTER
commands.
Both VACUUM FULL and CLUSTER commands copy the relation data into a new file,
create new indexes and eventually swap the files. To make sure that the old
file does not change during the copying, the relation is locked in an
exclusive mode, which prevents applications from both reading and writing. (To
keep the data consistent, we'd only need to prevent the applications from
writing, but even reading needs to be blocked before we can swap the files -
otherwise some applications could continue using the old file. Since we cannot
get stronger lock without releasing the weaker one first, we acquire the
exclusive lock in the beginning and keep it till the end of the processing.)
This patch introduces an alternative workflow, which only requires the
exclusive lock when the relation (and index) files are being swapped.
(Supposedly, the swapping should be pretty fast.) On the other hand, when we
copy the data to the new file, we allow applications to read from the relation
and even write into it.
First, we scan the relation using a "historic snapshot", and insert all the
tuples satisfying this snapshot into the new file. Note that, before creating
that snapshot, we need to make sure that all the other backends treat the
relation as a system catalog: in particular, they must log information on new
command IDs (CIDs). We achieve that by adding the relation ID into a shared
hash table and waiting until all the transactions currently writing into the
table (i.e. transactions possibly not aware of the new entry) have finished.
Second, logical decoding is used to capture the data changes done by
applications during the copying (i.e. changes that do not satisfy the historic
snapshot mentioned above), and those are applied to the new file before we
acquire the exclusive lock we need to swap the files. (Of course, more data
changes can take place while we are waiting for the lock - these will be
applied to the new file after we have acquired the lock, before we swap the
files.)
While copying the data into the new file, we hold a lock that prevents
applications from changing the relation tuple descriptor (tuples inserted into
the old file must fit into the new file). However, as we have to release that
lock before getting the exclusive one, it's possible that someone adds or
drops a column, or changes the data type of an existing one. Therefore we have
to check the tuple descriptor before we swap the files. If we find out that
the tuple descriptor changed, ERROR is raised and all the changes are rolled
back. Since a lot of effort can be wasted in such a case, the ALTER TABLE
command also tries to check if VACUUM FULL / CLUSTER with the CONCURRENTLY
option is running on the same relation, and raises an ERROR if it is.
Like the existing implementation of both VACUUM FULL and CLUSTER commands, the
variant with the CONCURRENTLY option also requires an extra space for the new
relation and index files (which coexist with the old files for some time). In
addition, the CONCURRENTLY option might introduce a lag in releasing WAL
segments for archiving / recycling. This is due to the decoding of the data
changes done by application concurrently. However, this lag should not be more
than a single WAL segment.
---
doc/src/sgml/monitoring.sgml | 36 +-
doc/src/sgml/ref/cluster.sgml | 111 +-
doc/src/sgml/ref/vacuum.sgml | 27 +-
src/Makefile | 1 +
src/backend/access/heap/heapam.c | 8 +-
src/backend/access/heap/heapam_handler.c | 145 +-
src/backend/access/heap/heapam_visibility.c | 30 +-
src/backend/catalog/index.c | 43 +-
src/backend/catalog/system_views.sql | 17 +-
src/backend/commands/cluster.c | 2576 ++++++++++++++++-
src/backend/commands/matview.c | 2 +-
src/backend/commands/tablecmds.c | 11 +
src/backend/commands/vacuum.c | 137 +-
src/backend/meson.build | 1 +
src/backend/replication/logical/decode.c | 24 +
src/backend/replication/logical/snapbuild.c | 20 +
.../replication/pgoutput_cluster/Makefile | 32 +
.../replication/pgoutput_cluster/meson.build | 18 +
.../pgoutput_cluster/pgoutput_cluster.c | 277 ++
src/backend/storage/ipc/ipci.c | 3 +
src/backend/tcop/utility.c | 11 +
src/backend/utils/activity/backend_progress.c | 16 +
.../utils/activity/wait_event_names.txt | 1 +
src/backend/utils/cache/inval.c | 22 +
src/backend/utils/cache/relcache.c | 5 +
src/backend/utils/time/snapmgr.c | 3 +-
src/bin/psql/tab-complete.in.c | 5 +-
src/include/access/heapam.h | 4 +
src/include/access/tableam.h | 10 +
src/include/catalog/index.h | 3 +
src/include/commands/cluster.h | 104 +-
src/include/commands/progress.h | 17 +-
src/include/commands/vacuum.h | 17 +-
src/include/replication/snapbuild.h | 1 +
src/include/storage/lockdefs.h | 2 +-
src/include/storage/lwlocklist.h | 1 +
src/include/utils/backend_progress.h | 3 +-
src/include/utils/inval.h | 2 +
src/include/utils/rel.h | 7 +-
src/include/utils/snapmgr.h | 2 +
src/test/regress/expected/rules.out | 17 +-
41 files changed, 3568 insertions(+), 204 deletions(-)
create mode 100644 src/backend/replication/pgoutput_cluster/Makefile
create mode 100644 src/backend/replication/pgoutput_cluster/meson.build
create mode 100644 src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 331315f8d3..5205f2026b 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5670,14 +5670,35 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>heap_tuples_written</structfield> <type>bigint</type>
+ <structfield>heap_tuples_inserted</structfield> <type>bigint</type>
</para>
<para>
- Number of heap tuples written.
+ Number of heap tuples inserted.
This counter only advances when the phase is
<literal>seq scanning heap</literal>,
- <literal>index scanning heap</literal>
- or <literal>writing new heap</literal>.
+ <literal>index scanning heap</literal>,
+ <literal>writing new heap</literal>
+ or <literal>catch-up</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_updated</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples updated.
+ This counter only advances when the phase is <literal>catch-up</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_deleted</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples deleted.
+ This counter only advances when the phase is <literal>catch-up</literal>.
</para></entry>
</row>
@@ -5758,6 +5779,13 @@ FROM pg_stat_get_backend_idset() AS backendid;
<command>CLUSTER</command> is currently writing the new heap.
</entry>
</row>
+ <row>
+ <entry><literal>catch-up</literal></entry>
+ <entry>
+ <command>CLUSTER</command> is currently processing the DML commands
+ that other transactions executed during any of the preceding phases.
+ </entry>
+ </row>
<row>
<entry><literal>swapping relation files</literal></entry>
<entry>
diff --git a/doc/src/sgml/ref/cluster.sgml b/doc/src/sgml/ref/cluster.sgml
index c5760244e6..d8c3edb432 100644
--- a/doc/src/sgml/ref/cluster.sgml
+++ b/doc/src/sgml/ref/cluster.sgml
@@ -26,6 +26,7 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
<phrase>where <replaceable class="parameter">option</replaceable> can be one of:</phrase>
VERBOSE [ <replaceable class="parameter">boolean</replaceable> ]
+ CONCURRENTLY [ <replaceable class="parameter">boolean</replaceable> ]
</synopsis>
</refsynopsisdiv>
@@ -69,14 +70,18 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
<replaceable class="parameter">table_name</replaceable> reclusters all the
previously-clustered tables in the current database that the calling user
has privileges for. This form of <command>CLUSTER</command> cannot be
- executed inside a transaction block.
+ executed inside a transaction block. Also, if
+ the <literal>CONCURRENTLY</literal> option is used with this form, system
+ catalogs and <acronym>TOAST</acronym> tables are not processed.
</para>
<para>
- When a table is being clustered, an <literal>ACCESS
- EXCLUSIVE</literal> lock is acquired on it. This prevents any other
- database operations (both reads and writes) from operating on the
- table until the <command>CLUSTER</command> is finished.
+ When a table is being clustered, an <literal>ACCESS EXCLUSIVE</literal>
+ lock is acquired on it. This prevents any other database operations (both
+ reads and writes) from operating on the table until
+ the <command>CLUSTER</command> is finished. If you want to keep the table
+ accessible during the clustering, consider using
+ the <literal>CONCURRENTLY</literal> option.
</para>
</refsect1>
@@ -111,6 +116,102 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><literal>CONCURRENTLY</literal></term>
+ <listitem>
+ <para>
+ Allow other transactions to use the table while it is being clustered.
+ </para>
+
+ <para>
+ Internally, <command>CLUSTER</command> copies the contents of the table
+ (ignoring dead tuples) into a new file, sorted by the specified index,
+ and also creates a new file for each index. Then it swaps the old and
+ new files for the table and all the indexes, and deletes the old
+ files. The <literal>ACCESS EXCLUSIVE</literal> lock is needed to make
+ sure that the old files do not change during the processing because the
+ changes would get lost due to the swap.
+ </para>
+
+ <para>
+ With the <literal>CONCURRENTLY</literal> option, the <literal>ACCESS
+ EXCLUSIVE</literal> lock is only acquired to swap the table and index
+ files. The data changes that took place during the creation of the new
+ table and index files are captured using logical decoding
+ (<xref linkend="logicaldecoding"/>) and applied before
+ the <literal>ACCESS EXCLUSIVE</literal> lock is requested. Thus the lock
+ is typically held only for the time needed to swap the files, which
+ should be pretty short.
+ </para>
+
+ <para>
+ Note that <command>CLUSTER</command> with
+ the <literal>CONCURRENTLY</literal> option does not try to order the
+ rows inserted into the table after the clustering started. Also
+ note <command>CLUSTER</command> might fail to complete due to DDL
+ commands executed on the table by other transactions during the
+ clustering.
+ </para>
+
+ <note>
+ <para>
+ In addition to the temporary space requirements explained below,
+ the <literal>CONCURRENTLY</literal> option can increase the use of
+ temporary space a bit more. The reason is that other transactions can
+ perform DML operations which cannot be applied to the new file until
+ <command>CLUSTER</command> has copied all the tuples from the old
+ file. Thus the tuples inserted into the old file during the copying are
+ also stored separately in a temporary file, so they can eventually
+ be applied to the new file.
+ </para>
+
+ <para>
+ Furthermore, the data changes performed during the copying are
+ extracted from <link linkend="wal">write-ahead log</link> (WAL), and
+ this extraction (decoding) only takes place when a certain amount of WAL
+ has been written. Therefore, WAL removal can be delayed by this
+ threshold. Currently the threshold is equal to the value of
+ the <link linkend="guc-wal-segment-size"><varname>wal_segment_size</varname></link>
+ configuration parameter.
+ </para>
+ </note>
+
+ <para>
+ The <literal>CONCURRENTLY</literal> option cannot be used in the
+ following cases:
+
+ <itemizedlist>
+ <listitem>
+ <para>
+ The table is a system catalog or a <acronym>TOAST</acronym> table.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <command>CLUSTER</command> is executed inside a transaction block.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The <link linkend="guc-wal-level"><varname>wal_level</varname></link>
+ configuration parameter is less than <literal>logical</literal>.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The <link linkend="guc-max-replication-slots"><varname>max_replication_slots</varname></link>
+ configuration parameter does not allow for creation of an additional
+ replication slot.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><replaceable class="parameter">boolean</replaceable></term>
<listitem>
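
As a usage illustration of the syntax documented above (the table and index names here are hypothetical, and this assumes the patch is applied and the listed prerequisites, such as <literal>wal_level = logical</literal>, are met):

```sql
-- Cluster a table by an index while keeping it available for reads and
-- writes; the ACCESS EXCLUSIVE lock is only taken for the final file swap.
CLUSTER (VERBOSE, CONCURRENTLY) orders USING orders_date_idx;
```
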
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index 9110938fab..f1008f5013 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -39,6 +39,7 @@ VACUUM [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
SKIP_DATABASE_STATS [ <replaceable class="parameter">boolean</replaceable> ]
ONLY_DATABASE_STATS [ <replaceable class="parameter">boolean</replaceable> ]
BUFFER_USAGE_LIMIT <replaceable class="parameter">size</replaceable>
+ CONCURRENTLY [ <replaceable class="parameter">boolean</replaceable> ]
<phrase>and <replaceable class="parameter">table_and_columns</replaceable> is:</phrase>
@@ -61,8 +62,12 @@ VACUUM [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
<para>
Without a <replaceable class="parameter">table_and_columns</replaceable>
list, <command>VACUUM</command> processes every table and materialized view
- in the current database that the current user has permission to vacuum.
- With a list, <command>VACUUM</command> processes only those table(s).
+ in the current database that the current user has permission to vacuum. If
+ the <literal>CONCURRENTLY</literal> option is specified (see below), tables
+ which have not been clustered yet are silently skipped. With a
+ list, <command>VACUUM</command> processes only those table(s). If
+ the <literal>CONCURRENTLY</literal> option is specified, the list may only
+ contain tables which have already been clustered.
</para>
<para>
@@ -360,6 +365,24 @@ VACUUM [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><literal>CONCURRENTLY</literal></term>
+ <listitem>
+ <para>
+ Allow other transactions to use the table while it is being vacuumed. If
+ this option is specified, <command>VACUUM</command> can only process
+ tables which have already been clustered. For more information, see the
+ description of the <literal>CONCURRENTLY</literal> option of the
+ <xref linkend="sql-cluster"/> command.
+ </para>
+
+ <para>
+ The <literal>CONCURRENTLY</literal> option can only be used
+ if <literal>FULL</literal> is used at the same time.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><replaceable class="parameter">boolean</replaceable></term>
<listitem>
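
The corresponding <command>VACUUM</command> invocation might look as follows; per the documentation above, <literal>CONCURRENTLY</literal> is only accepted together with <literal>FULL</literal>, and only for a table that has already been clustered (the table name is hypothetical):

```sql
-- Rewrite the table without blocking concurrent reads and writes.
VACUUM (FULL, CONCURRENTLY, VERBOSE) orders;
```
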
diff --git a/src/Makefile b/src/Makefile
index 2f31a2f20a..8b9d30ff72 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -23,6 +23,7 @@ SUBDIRS = \
interfaces \
backend/replication/libpqwalreceiver \
backend/replication/pgoutput \
+ backend/replication/pgoutput_cluster \
fe_utils \
bin \
pl \
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index da5e656a08..229fefed14 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2070,8 +2070,14 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
/*
* If this is a catalog, we need to transmit combo CIDs to properly
* decode, so log that as well.
+ *
+ * For the main heap (as opposed to TOAST), we only receive
+ * HEAP_INSERT_NO_LOGICAL when doing VACUUM FULL / CLUSTER, in which
+ * case the visibility information does not change. Therefore, there's
+ * no need to update the decoding snapshot.
*/
- if (RelationIsAccessibleInLogicalDecoding(relation))
+ if ((options & HEAP_INSERT_NO_LOGICAL) == 0 &&
+ RelationIsAccessibleInLogicalDecoding(relation))
log_heap_new_cid(relation, heaptup);
/*
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 1c6da286d4..6cba141c11 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -33,6 +33,7 @@
#include "catalog/index.h"
#include "catalog/storage.h"
#include "catalog/storage_xlog.h"
+#include "commands/cluster.h"
#include "commands/progress.h"
#include "executor/executor.h"
#include "miscadmin.h"
@@ -53,6 +54,9 @@ static void reform_and_rewrite_tuple(HeapTuple tuple,
static bool SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
HeapTuple tuple,
OffsetNumber tupoffset);
+static HeapTuple accept_tuple_for_concurrent_copy(HeapTuple tuple,
+ Snapshot snapshot,
+ Buffer buffer);
static BlockNumber heapam_scan_get_blocks_done(HeapScanDesc hscan);
@@ -682,6 +686,8 @@ static void
heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
Relation OldIndex, bool use_sort,
TransactionId OldestXmin,
+ Snapshot snapshot,
+ LogicalDecodingContext *decoding_ctx,
TransactionId *xid_cutoff,
MultiXactId *multi_cutoff,
double *num_tuples,
@@ -702,6 +708,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
bool *isnull;
BufferHeapTupleTableSlot *hslot;
BlockNumber prev_cblock = InvalidBlockNumber;
+ bool concurrent = snapshot != NULL;
+ XLogRecPtr end_of_wal_prev = GetFlushRecPtr(NULL);
/* Remember if it's a system catalog */
is_system_catalog = IsSystemRelation(OldHeap);
@@ -780,8 +788,10 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
for (;;)
{
HeapTuple tuple;
+ bool tuple_copied = false;
Buffer buf;
bool isdead;
+ HTSV_Result vis;
CHECK_FOR_INTERRUPTS();
@@ -836,7 +846,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
LockBuffer(buf, BUFFER_LOCK_SHARE);
- switch (HeapTupleSatisfiesVacuum(tuple, OldestXmin, buf))
+ switch ((vis = HeapTupleSatisfiesVacuum(tuple, OldestXmin, buf)))
{
case HEAPTUPLE_DEAD:
/* Definitely dead */
@@ -852,14 +862,15 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
case HEAPTUPLE_INSERT_IN_PROGRESS:
/*
- * Since we hold exclusive lock on the relation, normally the
- * only way to see this is if it was inserted earlier in our
- * own transaction. However, it can happen in system
+ * As long as we hold exclusive lock on the relation, normally
+ * the only way to see this is if it was inserted earlier in
+ * our own transaction. However, it can happen in system
* catalogs, since we tend to release write lock before commit
- * there. Give a warning if neither case applies; but in any
- * case we had better copy it.
+ * there. Also, there's no exclusive lock during concurrent
+ * processing. Give a warning if neither case applies; but in
+ * any case we had better copy it.
*/
- if (!is_system_catalog &&
+ if (!is_system_catalog && !concurrent &&
!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(tuple->t_data)))
elog(WARNING, "concurrent insert in progress within table \"%s\"",
RelationGetRelationName(OldHeap));
@@ -871,7 +882,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
/*
* Similar situation to INSERT_IN_PROGRESS case.
*/
- if (!is_system_catalog &&
+ if (!is_system_catalog && !concurrent &&
!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetUpdateXid(tuple->t_data)))
elog(WARNING, "concurrent delete in progress within table \"%s\"",
RelationGetRelationName(OldHeap));
@@ -885,8 +896,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
break;
}
- LockBuffer(buf, BUFFER_LOCK_UNLOCK);
-
if (isdead)
{
*tups_vacuumed += 1;
@@ -897,9 +906,47 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
*tups_vacuumed += 1;
*tups_recently_dead -= 1;
}
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
continue;
}
+ if (concurrent)
+ {
+ /*
+ * Ignore concurrent changes now, they'll be processed later via
+ * logical decoding.
+ *
+ * INSERT_IN_PROGRESS is rejected right away because our snapshot
+ * represents a point in time which precedes (or is equal to) the
+ * state of transactions as it was when the
+ * "SatisfiesVacuum" test was performed. Thus
+ * accept_tuple_for_concurrent_copy() should not consider the
+ * tuple inserted.
+ */
+ if (vis == HEAPTUPLE_INSERT_IN_PROGRESS)
+ tuple = NULL;
+ else
+ tuple = accept_tuple_for_concurrent_copy(tuple, snapshot,
+ buf);
+ /* Tuple not suitable for the new heap? */
+ if (tuple == NULL)
+ {
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ continue;
+ }
+
+ /* Remember that we have to free the tuple eventually. */
+ tuple_copied = true;
+ }
+
+ /*
+ * In the concurrent case, we have a copy of the tuple, so we don't
+ * worry whether the source tuple will be deleted / updated after we
+ * release the lock.
+ */
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
*num_tuples += 1;
if (tuplesort != NULL)
{
@@ -916,7 +963,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
{
const int ct_index[] = {
PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED,
- PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN
+ PROGRESS_CLUSTER_HEAP_TUPLES_INSERTED
};
int64 ct_val[2];
@@ -931,6 +978,33 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
ct_val[1] = *num_tuples;
pgstat_progress_update_multi_param(2, ct_index, ct_val);
}
+ if (tuple_copied)
+ heap_freetuple(tuple);
+
+ /*
+ * Process the WAL produced by the load, as well as by other
+ * transactions, so that the replication slot can advance and WAL does
+ * not pile up. Use wal_segment_size as a threshold so that we do not
+ * introduce the decoding overhead too often.
+ *
+ * Of course, we must not apply the changes until the initial load has
+ * completed.
+ *
+ * Note that our insertions into the new table should not be decoded
+ * as we (intentionally) do not write the logical decoding specific
+ * information to WAL.
+ */
+ if (concurrent)
+ {
+ XLogRecPtr end_of_wal;
+
+ end_of_wal = GetFlushRecPtr(NULL);
+ if ((end_of_wal - end_of_wal_prev) > wal_segment_size)
+ {
+ cluster_decode_concurrent_changes(decoding_ctx, end_of_wal);
+ end_of_wal_prev = end_of_wal;
+ }
+ }
}
if (indexScan != NULL)
@@ -974,7 +1048,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
values, isnull,
rwstate);
/* Report n_tuples */
- pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN,
+ pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_TUPLES_INSERTED,
n_tuples);
}
@@ -2579,6 +2653,53 @@ SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
}
}
+/*
+ * Return copy of 'tuple' if it has been inserted according to 'snapshot', or
+ * NULL if the insertion took place in the future. If the tuple is already
+ * marked as deleted or updated by a transaction that 'snapshot' still
+ * considers running, clear the deletion / update XID in the header of the
+ * copied tuple. This way the returned tuple is suitable for insertion into
+ * the new heap.
+ */
+static HeapTuple
+accept_tuple_for_concurrent_copy(HeapTuple tuple, Snapshot snapshot,
+ Buffer buffer)
+{
+ HeapTuple result;
+
+ Assert(snapshot->snapshot_type == SNAPSHOT_MVCC);
+
+ /*
+ * First, check if the tuple insertion is visible by our snapshot.
+ */
+ if (!HeapTupleMVCCInserted(tuple, snapshot, buffer))
+ return NULL;
+
+ result = heap_copytuple(tuple);
+
+ /*
+ * If the tuple was deleted / updated but our snapshot still sees it, we
+ * need to keep it. In that case, clear the information that indicates the
+ * deletion / update. Otherwise the tuple chain would stay incomplete (as
+ * we will reject the new tuple above), and the delete / update would fail
+ * if executed later during logical decoding.
+ */
+ if (TransactionIdIsNormal(HeapTupleHeaderGetRawXmax(result->t_data)) &&
+ HeapTupleMVCCNotDeleted(result, snapshot, buffer))
+ {
+ /* TODO More work needed here? */
+ result->t_data->t_infomask |= HEAP_XMAX_INVALID;
+ HeapTupleHeaderSetXmax(result->t_data, 0);
+ }
+
+ /*
+ * Accept the tuple even if our snapshot considers it deleted - older
+ * snapshots can still see the tuple, while the decoded transactions
+ * should not try to update / delete it again.
+ */
+ return result;
+}
+
/* ------------------------------------------------------------------------
* Definition of the heap table access method.
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index 9243feed01..d702592469 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -955,16 +955,31 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
* did TransactionIdIsInProgress in each call --- to no avail, as long as the
* inserting/deleting transaction was still running --- which was more cycles
* and more contention on ProcArrayLock.
+ *
+ * The checks are split into two functions, HeapTupleMVCCInserted() and
+ * HeapTupleMVCCNotDeleted(), because they are also useful separately.
*/
static bool
HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
Buffer buffer)
{
- HeapTupleHeader tuple = htup->t_data;
-
Assert(ItemPointerIsValid(&htup->t_self));
Assert(htup->t_tableOid != InvalidOid);
+ return HeapTupleMVCCInserted(htup, snapshot, buffer) &&
+ HeapTupleMVCCNotDeleted(htup, snapshot, buffer);
+}
+
+/*
+ * HeapTupleMVCCInserted
+ * True iff heap tuple was successfully inserted for the given MVCC
+ * snapshot.
+ */
+bool
+HeapTupleMVCCInserted(HeapTuple htup, Snapshot snapshot, Buffer buffer)
+{
+ HeapTupleHeader tuple = htup->t_data;
+
if (!HeapTupleHeaderXminCommitted(tuple))
{
if (HeapTupleHeaderXminInvalid(tuple))
@@ -1073,6 +1088,17 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
}
/* by here, the inserting transaction has committed */
+ return true;
+}
+
+/*
+ * HeapTupleMVCCNotDeleted
+ * True iff heap tuple was not deleted for the given MVCC snapshot.
+ */
+bool
+HeapTupleMVCCNotDeleted(HeapTuple htup, Snapshot snapshot, Buffer buffer)
+{
+ HeapTupleHeader tuple = htup->t_data;
if (tuple->t_infomask & HEAP_XMAX_INVALID) /* xid invalid or aborted */
return true;
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 6084dfa97c..cafd37917b 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1416,22 +1416,7 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
opclassOptions[i] = get_attoptions(oldIndexId, i + 1);
- /* Extract statistic targets for each attribute */
- stattargets = palloc0_array(NullableDatum, newInfo->ii_NumIndexAttrs);
- for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
- {
- HeapTuple tp;
- Datum dat;
-
- tp = SearchSysCache2(ATTNUM, ObjectIdGetDatum(oldIndexId), Int16GetDatum(i + 1));
- if (!HeapTupleIsValid(tp))
- elog(ERROR, "cache lookup failed for attribute %d of relation %u",
- i + 1, oldIndexId);
- dat = SysCacheGetAttr(ATTNUM, tp, Anum_pg_attribute_attstattarget, &isnull);
- ReleaseSysCache(tp);
- stattargets[i].value = dat;
- stattargets[i].isnull = isnull;
- }
+ stattargets = get_index_stattargets(oldIndexId, newInfo);
/*
* Now create the new index.
@@ -1470,6 +1455,32 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
return newIndexId;
}
+NullableDatum *
+get_index_stattargets(Oid indexid, IndexInfo *indInfo)
+{
+ NullableDatum *stattargets;
+
+ /* Extract statistic targets for each attribute */
+ stattargets = palloc0_array(NullableDatum, indInfo->ii_NumIndexAttrs);
+ for (int i = 0; i < indInfo->ii_NumIndexAttrs; i++)
+ {
+ HeapTuple tp;
+ Datum dat;
+ bool isnull;
+
+ tp = SearchSysCache2(ATTNUM, ObjectIdGetDatum(indexid), Int16GetDatum(i + 1));
+ if (!HeapTupleIsValid(tp))
+ elog(ERROR, "cache lookup failed for attribute %d of relation %u",
+ i + 1, indexid);
+ dat = SysCacheGetAttr(ATTNUM, tp, Anum_pg_attribute_attstattarget, &isnull);
+ ReleaseSysCache(tp);
+ stattargets[i].value = dat;
+ stattargets[i].isnull = isnull;
+ }
+
+ return stattargets;
+}
+
/*
* index_concurrently_build
*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 3456b821bc..93030b9527 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1238,16 +1238,19 @@ CREATE VIEW pg_stat_progress_cluster AS
WHEN 2 THEN 'index scanning heap'
WHEN 3 THEN 'sorting tuples'
WHEN 4 THEN 'writing new heap'
- WHEN 5 THEN 'swapping relation files'
- WHEN 6 THEN 'rebuilding index'
- WHEN 7 THEN 'performing final cleanup'
+ WHEN 5 THEN 'catch-up'
+ WHEN 6 THEN 'swapping relation files'
+ WHEN 7 THEN 'rebuilding index'
+ WHEN 8 THEN 'performing final cleanup'
END AS phase,
CAST(S.param3 AS oid) AS cluster_index_relid,
S.param4 AS heap_tuples_scanned,
- S.param5 AS heap_tuples_written,
- S.param6 AS heap_blks_total,
- S.param7 AS heap_blks_scanned,
- S.param8 AS index_rebuild_count
+ S.param5 AS heap_tuples_inserted,
+ S.param6 AS heap_tuples_updated,
+ S.param7 AS heap_tuples_deleted,
+ S.param8 AS heap_blks_total,
+ S.param9 AS heap_blks_scanned,
+ S.param10 AS index_rebuild_count
FROM pg_stat_get_progress_info('CLUSTER') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
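
Given the view changes above, a concurrent <command>CLUSTER</command> could be monitored from another session roughly like this (the <literal>catch-up</literal> phase and the inserted/updated/deleted counters come from this patch):

```sql
SELECT pid, relid::regclass AS table_name, phase,
       heap_tuples_scanned,
       heap_tuples_inserted,
       heap_tuples_updated,
       heap_tuples_deleted
FROM pg_stat_progress_cluster;
```
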
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index bedc177ce4..b5698c9baf 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -25,6 +25,10 @@
#include "access/toast_internals.h"
#include "access/transam.h"
#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xloginsert.h"
+#include "access/xlogutils.h"
#include "catalog/catalog.h"
#include "catalog/dependency.h"
#include "catalog/heap.h"
@@ -32,6 +36,7 @@
#include "catalog/namespace.h"
#include "catalog/objectaccess.h"
#include "catalog/pg_am.h"
+#include "catalog/pg_control.h"
#include "catalog/pg_database.h"
#include "catalog/pg_inherits.h"
#include "catalog/toasting.h"
@@ -40,10 +45,15 @@
#include "commands/progress.h"
#include "commands/tablecmds.h"
#include "commands/vacuum.h"
+#include "executor/executor.h"
#include "miscadmin.h"
#include "optimizer/optimizer.h"
#include "pgstat.h"
+#include "replication/decode.h"
+#include "replication/logical.h"
+#include "replication/snapbuild.h"
#include "storage/bufmgr.h"
+#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
#include "utils/acl.h"
@@ -57,6 +67,8 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
+typedef struct RewriteStateData *RewriteState;
+
/*
* This struct is used to pass around the information on tables to be
* clustered. We need this so we can make a list of them when invoked without
@@ -68,17 +80,183 @@ typedef struct
Oid indexOid;
} RelToCluster;
+/*
+ * The following definitions are used for concurrent processing.
+ */
+
+/*
+ * OID of the table being processed by this backend.
+ */
+static Oid clustered_rel = InvalidOid;
+/* The same for its TOAST relation. */
+static Oid clustered_rel_toast = InvalidOid;
+
+/*
+ * The locators are used to avoid logical decoding of data that we do not need
+ * for our table.
+ */
+RelFileLocator clustered_rel_locator = {.relNumber = InvalidOid};
+RelFileLocator clustered_rel_toast_locator = {.relNumber = InvalidOid};
+
+/* XXX Do we also need to mention VACUUM FULL CONCURRENTLY? */
+#define CLUSTER_IN_PROGRESS_MESSAGE \
+ "relation \"%s\" is already being processed by CLUSTER CONCURRENTLY"
+
+/*
+ * Everything we need to call ExecInsertIndexTuples().
+ */
+typedef struct IndexInsertState
+{
+ ResultRelInfo *rri;
+ EState *estate;
+ ExprContext *econtext;
+
+ Relation ident_index;
+} IndexInsertState;
-static void cluster_multiple_rels(List *rtcs, ClusterParams *params);
-static void rebuild_relation(Relation OldHeap, Relation index, bool verbose);
+/*
+ * Catalog information to check if another backend changed the relation in
+ * such a way that makes CLUSTER CONCURRENTLY unable to continue. Such changes
+ * are possible because cluster_rel() has to release its lock on the relation
+ * in order to acquire AccessExclusiveLock that it needs to swap the relation
+ * files.
+ *
+ * The most obvious problem is that the tuple descriptor has changed,
+ * because then the tuples we try to insert into the new storage are not
+ * guaranteed to fit.
+ *
+ * Another problem is the relfilenode being changed by another backend. It's
+ * not necessarily a correctness issue (e.g. when the other backend ran
+ * cluster_rel()), but it's safer for us to terminate the table processing in
+ * such cases. However, this information also needs to be checked during
+ * logical decoding, so we store it in global variables clustered_rel_locator
+ * and clustered_rel_toast_locator above.
+ *
+ * Where possible, commands which might change the relation in an incompatible
+ * way should check if CLUSTER CONCURRENTLY is running, before they start to
+ * do the actual changes (see is_concurrent_cluster_in_progress()). Anything
+ * else must be caught by check_catalog_changes(), which uses this structure.
+ */
+typedef struct CatalogState
+{
+ /* Tuple descriptor of the relation. */
+ TupleDesc tupdesc;
+
+ /* The number of indexes tracked. */
+ int ninds;
+ /* The index OIDs. */
+ Oid *ind_oids;
+ /* The index tuple descriptors. */
+ TupleDesc *ind_tupdescs;
+
+ /* The following are copies of the corresponding fields of pg_class. */
+ char relpersistence;
+ char replident;
+
+ /* rd_replidindex */
+ Oid replidindex;
+} CatalogState;
+
+/* The WAL segment being decoded. */
+static XLogSegNo cluster_current_segment = 0;
+
+static void cluster_multiple_rels(List *rtcs, ClusterParams *params,
+ LOCKMODE lockmode, bool isTopLevel);
+static void rebuild_relation(Relation OldHeap, Relation index, bool verbose,
+ bool concurrent);
static void copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
+ Snapshot snapshot, LogicalDecodingContext *decoding_ctx,
bool verbose, bool *pSwapToastByContent,
TransactionId *pFreezeXid, MultiXactId *pCutoffMulti);
static List *get_tables_to_cluster(MemoryContext cluster_context);
static List *get_tables_to_cluster_partitioned(MemoryContext cluster_context,
Oid indexOid);
static bool cluster_is_permitted_for_relation(Oid relid, Oid userid);
+static void check_concurrent_cluster_requirements(Relation rel,
+ bool isTopLevel,
+ bool isCluster);
+static void begin_concurrent_cluster(Relation *rel_p, Relation *index_p,
+ bool *entered_p);
+static void end_concurrent_cluster(bool error);
+static void cluster_before_shmem_exit_callback(int code, Datum arg);
+static CatalogState *get_catalog_state(Relation rel);
+static void free_catalog_state(CatalogState *state);
+static void check_catalog_changes(Relation rel, CatalogState *cat_state);
+static LogicalDecodingContext *setup_logical_decoding(Oid relid,
+ const char *slotname,
+ TupleDesc tupdesc);
+static HeapTuple get_changed_tuple(ConcurrentChange *change);
+static void apply_concurrent_changes(ClusterDecodingState *dstate,
+ Relation rel, ScanKey key, int nkeys,
+ IndexInsertState *iistate);
+static void apply_concurrent_insert(Relation rel, ConcurrentChange *change,
+ HeapTuple tup, IndexInsertState *iistate,
+ TupleTableSlot *index_slot);
+static void apply_concurrent_update(Relation rel, HeapTuple tup,
+ HeapTuple tup_target,
+ ConcurrentChange *change,
+ IndexInsertState *iistate,
+ TupleTableSlot *index_slot);
+static void apply_concurrent_delete(Relation rel, HeapTuple tup_target,
+ ConcurrentChange *change);
+static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
+ HeapTuple tup_key,
+ IndexInsertState *iistate,
+ TupleTableSlot *ident_slot,
+ IndexScanDesc *scan_p);
+static void process_concurrent_changes(LogicalDecodingContext *ctx,
+ XLogRecPtr end_of_wal,
+ Relation rel_dst,
+ Relation rel_src,
+ ScanKey ident_key,
+ int ident_key_nentries,
+ IndexInsertState *iistate);
+static IndexInsertState *get_index_insert_state(Relation relation,
+ Oid ident_index_id);
+static ScanKey build_identity_key(Oid ident_idx_oid, Relation rel_src,
+ int *nentries);
+static void free_index_insert_state(IndexInsertState *iistate);
+static void cleanup_logical_decoding(LogicalDecodingContext *ctx);
+static void rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
+ Relation cl_index,
+ CatalogState *cat_state,
+ LogicalDecodingContext *ctx,
+ bool swap_toast_by_content,
+ TransactionId frozenXid,
+ MultiXactId cutoffMulti);
+static List *build_new_indexes(Relation NewHeap, Relation OldHeap, List *OldIndexes);
+
+/*
+ * Use this API when relation needs to be unlocked, closed and re-opened. If
+ * the relation got dropped while being unlocked, raise ERROR that mentions
+ * the relation name rather than OID.
+ */
+typedef struct RelReopenInfo
+{
+ /*
+ * The relation to be closed. A pointer to the value is stored here so
+ * that the caller's reference is updated automatically on re-opening.
+ *
+ * When calling unlock_and_close_relations(), 'relid' can be passed
+ * instead of 'rel_p' when the caller only needs to gather information for
+ * subsequent opening.
+ */
+ Relation *rel_p;
+ Oid relid;
+ char relkind;
+ LOCKMODE lockmode_orig; /* The existing lock mode */
+ LOCKMODE lockmode_new; /* The lock mode after the relation is
+ * re-opened */
+
+ char *relname; /* Relation name, initialized automatically. */
+} RelReopenInfo;
+
+static void init_rel_reopen_info(RelReopenInfo *rri, Relation *rel_p,
+ Oid relid, LOCKMODE lockmode_orig,
+ LOCKMODE lockmode_new);
+static void unlock_and_close_relations(RelReopenInfo *rels, int nrel);
+static void reopen_relations(RelReopenInfo *rels, int nrel);
/*---------------------------------------------------------------------------
* This cluster code allows for clustering multiple tables at once. Because
@@ -110,10 +288,12 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
ListCell *lc;
ClusterParams params = {0};
bool verbose = false;
+ bool concurrent = false;
Relation rel = NULL;
Oid indexOid = InvalidOid;
MemoryContext cluster_context;
List *rtcs;
+ LOCKMODE lockmode;
/* Parse option list */
foreach(lc, stmt->params)
@@ -122,6 +302,8 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
if (strcmp(opt->defname, "verbose") == 0)
verbose = defGetBoolean(opt);
+ else if (strcmp(opt->defname, "concurrently") == 0)
+ concurrent = defGetBoolean(opt);
else
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
@@ -130,20 +312,30 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
parser_errposition(pstate, opt->location)));
}
- params.options = (verbose ? CLUOPT_VERBOSE : 0);
+ params.options =
+ (verbose ? CLUOPT_VERBOSE : 0) |
+ (concurrent ? CLUOPT_CONCURRENT : 0);
+
+ /*
+ * Determine the lock mode expected by cluster_rel().
+ *
+ * In the exclusive case, we obtain AccessExclusiveLock right away to
+ * avoid lock-upgrade hazard in the single-transaction case. In the
+ * CONCURRENT case, the AccessExclusiveLock will only be used at the end
+ * of processing, supposedly for very short time. Until then, we'll have
+ * to unlock the relation temporarily, so there's no lock-upgrade hazard.
+ */
+ lockmode = (params.options & CLUOPT_CONCURRENT) == 0 ?
+ AccessExclusiveLock : LOCK_CLUSTER_CONCURRENT;
if (stmt->relation != NULL)
{
/* This is the single-relation case. */
Oid tableOid;
- /*
- * Find, lock, and check permissions on the table. We obtain
- * AccessExclusiveLock right away to avoid lock-upgrade hazard in the
- * single-transaction case.
- */
+ /* Find, lock, and check permissions on the table. */
tableOid = RangeVarGetRelidExtended(stmt->relation,
- AccessExclusiveLock,
+ lockmode,
0,
RangeVarCallbackMaintainsTable,
NULL);
@@ -198,7 +390,7 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
* Do the job. (The function will close the relation, lock is kept
* till commit.)
*/
- cluster_rel(rel, indexOid, &params);
+ cluster_rel(rel, indexOid, &params, isTopLevel);
return;
}
@@ -237,7 +429,7 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
rtcs = get_tables_to_cluster_partitioned(cluster_context, indexOid);
/* close relation, releasing lock on parent table */
- table_close(rel, AccessExclusiveLock);
+ table_close(rel, lockmode);
}
else
{
@@ -246,7 +438,7 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
}
/* Do the job. */
- cluster_multiple_rels(rtcs, &params);
+ cluster_multiple_rels(rtcs, &params, lockmode, isTopLevel);
/* Start a new transaction for the cleanup work. */
StartTransactionCommand();
@@ -263,7 +455,8 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
* return.
*/
static void
-cluster_multiple_rels(List *rtcs, ClusterParams *params)
+cluster_multiple_rels(List *rtcs, ClusterParams *params, LOCKMODE lockmode,
+ bool isTopLevel)
{
ListCell *lc;
@@ -283,13 +476,19 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
/* functions in indexes may want a snapshot set */
PushActiveSnapshot(GetTransactionSnapshot());
- rel = table_open(rtc->tableOid, AccessExclusiveLock);
+ rel = table_open(rtc->tableOid, lockmode);
- /*
- * Do the job. (The function will close the relation, lock is kept
- * till commit.)
- */
- cluster_rel(rel, rtc->indexOid, params);
+ /* Not all relations can be processed in concurrent mode. */
+ if ((params->options & CLUOPT_CONCURRENT) == 0 ||
+ check_relation_is_clusterable_concurrently(rel, DEBUG1,
+ "CLUSTER (CONCURRENTLY)"))
+ {
+ /*
+ * Do the job. (The function will close the relation, lock is kept
+ * till commit.)
+ */
+ cluster_rel(rel, rtc->indexOid, params, isTopLevel);
+ }
+ else
+ table_close(rel, lockmode);
PopActiveSnapshot();
CommitTransactionCommand();
@@ -313,10 +512,21 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
* instead of index order. This is the new implementation of VACUUM FULL,
* and error messages should refer to the operation as VACUUM not CLUSTER.
*
- * We expect that OldHeap is already locked in AccessExclusiveLock mode.
+ * We expect that OldHeap is already locked. The lock mode is
+ * AccessExclusiveLock for normal processing and LOCK_CLUSTER_CONCURRENT for
+ * concurrent processing (so that SELECT, INSERT, UPDATE and DELETE commands
+ * work, but cluster_rel() cannot be called concurrently for the same
+ * relation).
+ *
+ * Note that, in the concurrent case, the function releases the lock at some
+ * point, in order to get AccessExclusiveLock for the final steps (i.e. to
+ * swap the relation files). To make things simpler, the caller should expect
+ * OldHeap to be closed on return, regardless of CLUOPT_CONCURRENT. (The
+ * AccessExclusiveLock is kept till the end of the transaction.)
*/
void
-cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
+cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
+ bool isTopLevel)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid save_userid;
@@ -325,6 +535,41 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
bool verbose = ((params->options & CLUOPT_VERBOSE) != 0);
bool recheck = ((params->options & CLUOPT_RECHECK) != 0);
Relation index = NULL;
+ bool concurrent = ((params->options & CLUOPT_CONCURRENT) != 0);
+ LOCKMODE lmode;
+ bool entered, success;
+
+ /* Determine the lock mode the caller is expected to hold. */
+ lmode = !concurrent ? AccessExclusiveLock : LOCK_CLUSTER_CONCURRENT;
+
+ /*
+ * Skip the relation if it's being processed concurrently. In such a case,
+ * we cannot rely on a lock because the other backend needs to release it
+ * temporarily at some point.
+ *
+ * This check should not take place until we have a lock that prevents
+ * another backend from starting VACUUM FULL / CLUSTER CONCURRENTLY after
+ * our check.
+ */
+ Assert(CheckRelationLockedByMe(OldHeap, lmode, false));
+ if (is_concurrent_cluster_in_progress(tableOid))
+ {
+ ereport(NOTICE,
+ (errmsg(CLUSTER_IN_PROGRESS_MESSAGE,
+ RelationGetRelationName(OldHeap))));
+ table_close(OldHeap, lmode);
+ return;
+ }
+
+ /* There are specific requirements on concurrent processing. */
+ if (concurrent)
+ {
+ check_concurrent_cluster_requirements(OldHeap, isTopLevel,
+ OidIsValid(indexOid));
+
+ check_relation_is_clusterable_concurrently(OldHeap, ERROR,
+ "CLUSTER (CONCURRENTLY)");
+ }
/* Check for user-requested abort. */
CHECK_FOR_INTERRUPTS();
@@ -361,7 +606,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
/* Check that the user still has privileges for the relation */
if (!cluster_is_permitted_for_relation(tableOid, save_userid))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
goto out;
}
@@ -376,7 +621,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
*/
if (RELATION_IS_OTHER_TEMP(OldHeap))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
goto out;
}
@@ -387,7 +632,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
*/
if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(indexOid)))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
goto out;
}
@@ -398,7 +643,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
if ((params->options & CLUOPT_RECHECK_ISCLUSTERED) != 0 &&
!get_index_isclustered(indexOid))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
goto out;
}
}
@@ -414,6 +659,11 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("cannot cluster a shared catalog")));
+ /*
+ * The CONCURRENT case should have been rejected earlier because it does
+ * not support system catalogs.
+ */
+ Assert(!(OldHeap->rd_rel->relisshared && concurrent));
/*
* Don't process temp tables of other backends ... their local buffer
@@ -440,7 +690,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
/* Check heap and index are valid to cluster on */
if (OidIsValid(indexOid))
{
- check_index_is_clusterable(OldHeap, indexOid, AccessExclusiveLock);
+ check_index_is_clusterable(OldHeap, indexOid, lmode);
/* Open the index (It should already be locked.) */
index = index_open(indexOid, NoLock);
}
@@ -455,7 +705,8 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
if (OldHeap->rd_rel->relkind == RELKIND_MATVIEW &&
!RelationIsPopulated(OldHeap))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ index_close(index, lmode);
+ relation_close(OldHeap, lmode);
goto out;
}
@@ -468,11 +719,42 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
* invalid, because we move tuples around. Promote them to relation
* locks. Predicate locks on indexes will be promoted when they are
* reindexed.
+ *
+ * During concurrent processing, the heap as well as its indexes stay in
+ * operation, so we postpone this step until they are locked using
+ * AccessExclusiveLock near the end of the processing.
*/
- TransferPredicateLocksToHeapRelation(OldHeap);
+ if (!concurrent)
+ TransferPredicateLocksToHeapRelation(OldHeap);
/* rebuild_relation does all the dirty work */
- rebuild_relation(OldHeap, index, verbose);
+ entered = false;
+ success = false;
+ PG_TRY();
+ {
+ /*
+ * For concurrent processing, make sure other transactions treat this
+ * table as if it were a system / user catalog, and WAL-log the relevant
+ * additional information. ERROR is raised if another backend is
+ * processing the same table.
+ */
+ if (concurrent)
+ {
+ Relation *index_p = index ? &index : NULL;
+
+ begin_concurrent_cluster(&OldHeap, index_p, &entered);
+ }
+
+ rebuild_relation(OldHeap, index, verbose, concurrent);
+ success = true;
+ }
+ PG_FINALLY();
+ {
+ if (concurrent && entered)
+ end_concurrent_cluster(!success);
+ }
+ PG_END_TRY();
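The entered/success bookkeeping above is worth isolating: the cleanup must run on both the normal and the error path, but only if begin_concurrent_cluster() actually inserted the hash entry. A minimal standalone sketch of that control flow, using setjmp/longjmp to stand in for PG_TRY/PG_FINALLY (all names here are hypothetical, not the patch's code):

```c
#include <setjmp.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the shared-hash bookkeeping in the patch. */
static bool hash_entered;           /* entry present in the "shared hash" */
static bool cleanup_saw_error;      /* argument passed to the cleanup */
static jmp_buf err_env;

static void begin_work(void)
{
    hash_entered = true;            /* begin_concurrent_cluster() */
}

static void end_work(bool error)
{
    cleanup_saw_error = error;      /* end_concurrent_cluster(!success) */
    hash_entered = false;
}

/*
 * Mirrors the PG_TRY/PG_FINALLY shape in cluster_rel(): cleanup runs on
 * both the normal and the error path, but only if begin_work() ran.
 */
static bool run(bool fail_in_body)
{
    volatile bool entered = false;  /* volatile: modified between setjmp */
    volatile bool success = false;  /* and longjmp, read afterwards */

    if (setjmp(err_env) == 0)
    {
        begin_work();
        entered = true;
        if (fail_in_body)
            longjmp(err_env, 1);    /* "ERROR" in rebuild_relation() */
        success = true;
    }
    if (entered)                    /* the PG_FINALLY part */
        end_work(!success);
    return success;
}
```

Both paths leave the hash entry removed; only the error path tells the cleanup that it is handling an error.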
/*
* NB: rebuild_relation does table_close() on OldHeap, and also on index,
@@ -622,18 +904,100 @@ mark_index_clustered(Relation rel, Oid indexOid, bool is_internal)
table_close(pg_index, RowExclusiveLock);
}
+/*
+ * Check if the CONCURRENTLY option is legal for the relation.
+ */
+bool
+check_relation_is_clusterable_concurrently(Relation rel, int elevel,
+ const char *stmt)
+{
+ char relpersistence, replident;
+ Oid ident_idx;
+
+ /* Data changes in system relations are not logically decoded. */
+ if (IsCatalogRelation(rel))
+ {
+ ereport(elevel,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot process relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("%s is not supported for catalog relations.", stmt)));
+ return false;
+ }
+
+ if (IsToastRelation(rel))
+ {
+ ereport(elevel,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot process relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("%s is not supported for TOAST relations, unless the main relation is processed too.",
+ stmt)));
+ return false;
+ }
+
+ relpersistence = rel->rd_rel->relpersistence;
+ if (relpersistence != RELPERSISTENCE_PERMANENT)
+ {
+ ereport(elevel,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot process relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("%s is only allowed for permanent relations.",
+ stmt)));
+ return false;
+ }
+
+ /* With NOTHING, WAL does not contain the old tuple. */
+ replident = rel->rd_rel->relreplident;
+ if (replident == REPLICA_IDENTITY_NOTHING)
+ {
+ ereport(elevel,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot process relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("Relation \"%s\" has replica identity NOTHING.",
+ RelationGetRelationName(rel))));
+ return false;
+ }
+
+ /*
+ * Identity index is not set if the replica identity is FULL, but PK might
+ * exist in such a case.
+ */
+ ident_idx = RelationGetReplicaIndex(rel);
+ if (!OidIsValid(ident_idx) && OidIsValid(rel->rd_pkindex))
+ ident_idx = rel->rd_pkindex;
+ if (!OidIsValid(ident_idx))
+ {
+ ereport(elevel,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot process relation \"%s\"",
+ RelationGetRelationName(rel)),
+ (errhint("Relation \"%s\" has no identity index.",
+ RelationGetRelationName(rel)))));
+ return false;
+ }
+
+ return true;
+}
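The identity-index fallback in the function above is small enough to isolate. A simplified, hypothetical mirror of the decision (the real code consults RelationGetReplicaIndex() and rd_pkindex):

```c
typedef unsigned int Oid;
#define InvalidOid ((Oid) 0)

/*
 * Simplified mirror of the lookup in
 * check_relation_is_clusterable_concurrently(): the replica identity
 * index wins; with REPLICA IDENTITY FULL there is no identity index,
 * but an existing primary key is acceptable. InvalidOid means the
 * relation must be rejected.
 */
static Oid choose_identity_index(Oid replident_idx, Oid pk_idx)
{
    if (replident_idx != InvalidOid)
        return replident_idx;
    return pk_idx;
}
```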
+
/*
* rebuild_relation: rebuild an existing relation in index or physical order
*
- * OldHeap: table to rebuild --- must be opened and exclusive-locked!
+ * OldHeap: table to rebuild --- must be opened and locked. See cluster_rel()
+ * for comments on the required lock strength.
+ *
* index: index to cluster by, or NULL to rewrite in physical order. Must be
* opened and locked.
*
* On exit, the heap (and also the index, if one was passed) are closed, but
- * still locked with AccessExclusiveLock.
+ * still locked with AccessExclusiveLock. (The function handles the lock
+ * upgrade if 'concurrent' is true.)
*/
static void
-rebuild_relation(Relation OldHeap, Relation index, bool verbose)
+rebuild_relation(Relation OldHeap, Relation index, bool verbose,
+ bool concurrent)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid indexOid = index ? RelationGetRelid(index) : InvalidOid;
@@ -642,11 +1006,76 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose)
Oid OIDNewHeap;
Relation NewHeap;
char relpersistence;
- bool is_system_catalog;
bool swap_toast_by_content;
TransactionId frozenXid;
MultiXactId cutoffMulti;
- LOCKMODE lockmode_new;
+ NameData slotname;
+ LogicalDecodingContext *ctx = NULL;
+ Snapshot snapshot = NULL;
+ CatalogState *cat_state = NULL;
+ LOCKMODE lmode_new;
+
+ if (concurrent)
+ {
+ TupleDesc tupdesc;
+ RelReopenInfo rri[2];
+ int nrel;
+
+ /*
+ * CLUSTER CONCURRENTLY is not allowed in a transaction block, so this
+ * should never fire.
+ */
+ Assert(GetTopTransactionIdIfAny() == InvalidTransactionId);
+
+ /*
+ * A single backend should not execute multiple CLUSTER commands at a
+ * time, so use PID to make the slot unique.
+ */
+ snprintf(NameStr(slotname), NAMEDATALEN, "cluster_%d", MyProcPid);
+
+ /*
+ * Gather catalog information so that we can check later if the old
+ * relation has not changed while unlocked.
+ *
+ * Since this function also checks if the relation can be processed,
+ * it's important to call it before we set up the logical decoding,
+ * because that can take some time. Not sure if it's necessary to do
+ * it even earlier.
+ */
+ cat_state = get_catalog_state(OldHeap);
+
+ tupdesc = CreateTupleDescCopy(RelationGetDescr(OldHeap));
+
+ /*
+ * Unlock the relation (and possibly the clustering index) to avoid
+ * deadlock because setup_logical_decoding() will wait for all the
+ * running transactions (with XID assigned) to finish. Some of those
+ * transactions might be waiting for a lock on our relation.
+ */
+ nrel = 0;
+ init_rel_reopen_info(&rri[nrel++], &OldHeap, InvalidOid,
+ LOCK_CLUSTER_CONCURRENT,
+ LOCK_CLUSTER_CONCURRENT);
+ if (index)
+ init_rel_reopen_info(&rri[nrel++], &index, InvalidOid,
+ LOCK_CLUSTER_CONCURRENT,
+ LOCK_CLUSTER_CONCURRENT);
+ unlock_and_close_relations(rri, nrel);
+
+ /* Prepare to capture the concurrent data changes. */
+ ctx = setup_logical_decoding(tableOid, NameStr(slotname), tupdesc);
+
+ /* Lock the table (and index) again. */
+ reopen_relations(rri, nrel);
+
+ /*
+ * Check if a 'tupdesc' could have changed while the relation was
+ * unlocked.
+ */
+ check_catalog_changes(OldHeap, cat_state);
+
+ snapshot = SnapBuildInitialSnapshotForCluster(ctx->snapshot_builder);
+ }
if (OidIsValid(indexOid))
/* Mark the correct index as clustered */
@@ -654,7 +1083,6 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose)
/* Remember info about rel before closing OldHeap */
relpersistence = OldHeap->rd_rel->relpersistence;
- is_system_catalog = IsSystemRelation(OldHeap);
/*
* Create the transient table that will receive the re-ordered data.
@@ -662,42 +1090,63 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose)
* NoLock for the old heap because we already have it locked and want to
* keep unlocking straightforward.
*/
- lockmode_new = AccessExclusiveLock;
+ lmode_new = AccessExclusiveLock;
OIDNewHeap = make_new_heap(tableOid, tableSpace,
accessMethod,
relpersistence,
- NoLock, &lockmode_new);
- Assert(lockmode_new == AccessExclusiveLock || lockmode_new == NoLock);
+ NoLock, &lmode_new);
+ Assert(lmode_new == AccessExclusiveLock || lmode_new == NoLock);
/* Lock iff not done above. */
- NewHeap = table_open(OIDNewHeap, lockmode_new == NoLock ?
+ NewHeap = table_open(OIDNewHeap, lmode_new == NoLock ?
AccessExclusiveLock : NoLock);
/* Copy the heap data into the new table in the desired order */
- copy_table_data(NewHeap, OldHeap, index, verbose,
+ copy_table_data(NewHeap, OldHeap, index, snapshot, ctx, verbose,
&swap_toast_by_content, &frozenXid, &cutoffMulti);
+ if (concurrent)
+ {
+ rebuild_relation_finish_concurrent(NewHeap, OldHeap, index,
+ cat_state, ctx,
+ swap_toast_by_content,
+ frozenXid, cutoffMulti);
+
+ pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
+ PROGRESS_CLUSTER_PHASE_FINAL_CLEANUP);
+
+ /* Done with decoding. */
+ FreeSnapshot(snapshot);
+ free_catalog_state(cat_state);
+ cleanup_logical_decoding(ctx);
+ ReplicationSlotRelease();
+ ReplicationSlotDrop(NameStr(slotname), false);
+ }
+ else
+ {
+ bool is_system_catalog = IsSystemRelation(OldHeap);
- /* Close relcache entries, but keep lock until transaction commit */
- table_close(OldHeap, NoLock);
- if (index)
- index_close(index, NoLock);
+ /* Close relcache entries, but keep lock until transaction commit */
+ table_close(OldHeap, NoLock);
+ if (index)
+ index_close(index, NoLock);
- /*
- * Close the new relation so it can be dropped as soon as the storage is
- * swapped. The relation is not visible to others, so we could unlock it
- * completely, but it's simpler to pass NoLock than to track all the locks
- * acquired so far.
- */
- table_close(NewHeap, NoLock);
+ /*
+ * Close the new relation so it can be dropped as soon as the storage
+ * is swapped. The relation is not visible to others, so we could
+ * unlock it completely, but it's simpler to pass NoLock than to track
+ * all the locks acquired so far.
+ */
+ table_close(NewHeap, NoLock);
- /*
- * Swap the physical files of the target and transient tables, then
- * rebuild the target's indexes and throw away the transient table.
- */
- finish_heap_swap(tableOid, OIDNewHeap, is_system_catalog,
- swap_toast_by_content, false, true,
- frozenXid, cutoffMulti,
- relpersistence);
+ /*
+ * Swap the physical files of the target and transient tables, then
+ * rebuild the target's indexes and throw away the transient table.
+ */
+ finish_heap_swap(tableOid, OIDNewHeap, is_system_catalog,
+ swap_toast_by_content, false, true, true,
+ frozenXid, cutoffMulti,
+ relpersistence);
+ }
}
@@ -848,15 +1297,19 @@ make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
/*
* Do the physical copying of table data.
*
+ * 'snapshot' and 'decoding_ctx': see table_relation_copy_for_cluster().
+ * Both must be passed iff concurrent processing is required.
+ *
* There are three output parameters:
* *pSwapToastByContent is set true if toast tables must be swapped by content.
* *pFreezeXid receives the TransactionId used as freeze cutoff point.
* *pCutoffMulti receives the MultiXactId used as a cutoff point.
*/
static void
-copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verbose,
- bool *pSwapToastByContent, TransactionId *pFreezeXid,
- MultiXactId *pCutoffMulti)
+copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
+ Snapshot snapshot, LogicalDecodingContext *decoding_ctx,
+ bool verbose, bool *pSwapToastByContent,
+ TransactionId *pFreezeXid, MultiXactId *pCutoffMulti)
{
Oid OIDOldHeap = RelationGetRelid(OldHeap);
Oid OIDOldIndex = OldIndex ? RelationGetRelid(OldIndex) : InvalidOid;
@@ -876,6 +1329,7 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
int elevel = verbose ? INFO : DEBUG2;
PGRUsage ru0;
char *nspname;
+ bool concurrent = snapshot != NULL;
pg_rusage_init(&ru0);
@@ -902,8 +1356,12 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
*
* We don't need to open the toast relation here, just lock it. The lock
* will be held till end of transaction.
+ *
+ * In the CONCURRENT case, the lock does not help because we need to
+ * release it temporarily at some point. Instead, we expect VACUUM /
+ * CLUSTER to skip tables which are present in ClusteredRelsHash.
*/
- if (OldHeap->rd_rel->reltoastrelid)
+ if (OldHeap->rd_rel->reltoastrelid && !concurrent)
LockRelationOid(OldHeap->rd_rel->reltoastrelid, AccessExclusiveLock);
/*
@@ -979,7 +1437,45 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
* provided, else plain seqscan.
*/
if (OldIndex != NULL && OldIndex->rd_rel->relam == BTREE_AM_OID)
+ {
+ ResourceOwner oldowner = CurrentResourceOwner;
+
+ /*
+ * In the CONCURRENT case, do the planning in a subtransaction so that
+ * we don't leave any additional locks behind us that we cannot
+ * release easily.
+ */
+ if (concurrent)
+ {
+ Assert(CheckRelationLockedByMe(OldHeap, LOCK_CLUSTER_CONCURRENT,
+ false));
+ Assert(CheckRelationLockedByMe(OldIndex, LOCK_CLUSTER_CONCURRENT,
+ false));
+ BeginInternalSubTransaction("plan_cluster_use_sort");
+ }
+
use_sort = plan_cluster_use_sort(OIDOldHeap, OIDOldIndex);
+
+ if (concurrent)
+ {
+ PgBackendProgress progress;
+
+ /*
+ * Command progress reporting gets terminated at subtransaction
+ * end. Save the status so it can be eventually restored.
+ */
+ memcpy(&progress, &MyBEEntry->st_progress,
+ sizeof(PgBackendProgress));
+
+ /* Release the locks by aborting the subtransaction. */
+ RollbackAndReleaseCurrentSubTransaction();
+
+ /* Restore the progress reporting status. */
+ pgstat_progress_restore_state(&progress);
+
+ CurrentResourceOwner = oldowner;
+ }
+ }
else
use_sort = false;
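The save/rollback/restore dance around plan_cluster_use_sort() boils down to snapshotting a struct across a reset. A standalone sketch of the pattern (MiniProgress is a made-up miniature of PgBackendProgress, not the real struct):

```c
#include <string.h>
#include <stdint.h>

/* Made-up miniature of PgBackendProgress: counters that a
 * subtransaction abort would reset to zero. */
typedef struct MiniProgress
{
    int64_t phase;
    int64_t param;
} MiniProgress;

static MiniProgress backend_progress;

/* Stands in for the reset that subtransaction abort performs. */
static void subxact_abort(void)
{
    memset(&backend_progress, 0, sizeof(backend_progress));
}

/*
 * The pattern from copy_table_data(): snapshot the state, let the
 * abort wipe it, then restore the snapshot.
 */
static MiniProgress plan_in_subxact(MiniProgress current)
{
    MiniProgress saved;

    backend_progress = current;
    saved = backend_progress;   /* memcpy() of st_progress in the patch */
    subxact_abort();            /* RollbackAndReleaseCurrentSubTransaction */
    backend_progress = saved;   /* pgstat_progress_restore_state */
    return backend_progress;
}
```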
@@ -1008,7 +1504,9 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
* values (e.g. because the AM doesn't use freezing).
*/
table_relation_copy_for_cluster(OldHeap, NewHeap, OldIndex, use_sort,
- cutoffs.OldestXmin, &cutoffs.FreezeLimit,
+ cutoffs.OldestXmin, snapshot,
+ decoding_ctx,
+ &cutoffs.FreezeLimit,
&cutoffs.MultiXactCutoff,
&num_tuples, &tups_vacuumed,
&tups_recently_dead);
@@ -1017,7 +1515,11 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
*pFreezeXid = cutoffs.FreezeLimit;
*pCutoffMulti = cutoffs.MultiXactCutoff;
- /* Reset rd_toastoid just to be tidy --- it shouldn't be looked at again */
+ /*
+ * Reset rd_toastoid just to be tidy --- it shouldn't be looked at
+ * again. In the CONCURRENT case, we need to set it again before applying
+ * the concurrent changes.
+ */
NewHeap->rd_toastoid = InvalidOid;
num_pages = RelationGetNumberOfBlocks(NewHeap);
@@ -1468,14 +1970,13 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
bool swap_toast_by_content,
bool check_constraints,
bool is_internal,
+ bool reindex,
TransactionId frozenXid,
MultiXactId cutoffMulti,
char newrelpersistence)
{
ObjectAddress object;
Oid mapped_tables[4];
- int reindex_flags;
- ReindexParams reindex_params = {0};
int i;
/* Report that we are now swapping relation files */
@@ -1501,39 +2002,46 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
if (is_system_catalog)
CacheInvalidateCatalog(OIDOldHeap);
- /*
- * Rebuild each index on the relation (but not the toast table, which is
- * all-new at this point). It is important to do this before the DROP
- * step because if we are processing a system catalog that will be used
- * during DROP, we want to have its indexes available. There is no
- * advantage to the other order anyway because this is all transactional,
- * so no chance to reclaim disk space before commit. We do not need a
- * final CommandCounterIncrement() because reindex_relation does it.
- *
- * Note: because index_build is called via reindex_relation, it will never
- * set indcheckxmin true for the indexes. This is OK even though in some
- * sense we are building new indexes rather than rebuilding existing ones,
- * because the new heap won't contain any HOT chains at all, let alone
- * broken ones, so it can't be necessary to set indcheckxmin.
- */
- reindex_flags = REINDEX_REL_SUPPRESS_INDEX_USE;
- if (check_constraints)
- reindex_flags |= REINDEX_REL_CHECK_CONSTRAINTS;
+ if (reindex)
+ {
+ int reindex_flags;
+ ReindexParams reindex_params = {0};
- /*
- * Ensure that the indexes have the same persistence as the parent
- * relation.
- */
- if (newrelpersistence == RELPERSISTENCE_UNLOGGED)
- reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
- else if (newrelpersistence == RELPERSISTENCE_PERMANENT)
- reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+ /*
+ * Rebuild each index on the relation (but not the toast table, which
+ * is all-new at this point). It is important to do this before the
+ * DROP step because if we are processing a system catalog that will
+ * be used during DROP, we want to have its indexes available. There
+ * is no advantage to the other order anyway because this is all
+ * transactional, so no chance to reclaim disk space before commit.
+ * We do not need a final CommandCounterIncrement() because
+ * reindex_relation does it.
+ *
+ * Note: because index_build is called via reindex_relation, it will
+ * never set indcheckxmin true for the indexes. This is OK even though
+ * in some sense we are building new indexes rather than rebuilding
+ * existing ones, because the new heap won't contain any HOT chains at
+ * all, let alone broken ones, so it can't be necessary to set
+ * indcheckxmin.
+ */
+ reindex_flags = REINDEX_REL_SUPPRESS_INDEX_USE;
+ if (check_constraints)
+ reindex_flags |= REINDEX_REL_CHECK_CONSTRAINTS;
- /* Report that we are now reindexing relations */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_REBUILD_INDEX);
+ /*
+ * Ensure that the indexes have the same persistence as the parent
+ * relation.
+ */
+ if (newrelpersistence == RELPERSISTENCE_UNLOGGED)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
+ else if (newrelpersistence == RELPERSISTENCE_PERMANENT)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
- reindex_relation(NULL, OIDOldHeap, reindex_flags, &reindex_params);
+ /* Report that we are now reindexing relations */
+ pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
+ PROGRESS_CLUSTER_PHASE_REBUILD_INDEX);
+
+ reindex_relation(NULL, OIDOldHeap, reindex_flags, &reindex_params);
+ }
/* Report that we are now doing clean up */
pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
@@ -1773,3 +2281,1877 @@ cluster_is_permitted_for_relation(Oid relid, Oid userid)
get_rel_name(relid))));
return false;
}
+
+#define REPL_PLUGIN_NAME "pgoutput_cluster"
+
+/*
+ * Each relation being processed by CLUSTER CONCURRENTLY must be in the
+ * clusteredRels hashtable.
+ */
+typedef struct ClusteredRel
+{
+ Oid relid;
+ Oid dbid;
+} ClusteredRel;
+
+static HTAB *ClusteredRelsHash = NULL;
+
+/* Maximum number of entries in the hashtable. */
+static int maxClusteredRels = 0;
+
+Size
+ClusterShmemSize(void)
+{
+ /*
+ * Each concurrent CLUSTER needs a replication slot, so use this GUC to
+ * size the hashtable. Also reserve space for TOAST relations.
+ */
+ maxClusteredRels = max_replication_slots * 2;
+
+ return hash_estimate_size(maxClusteredRels, sizeof(ClusteredRel));
+}
+
+void
+ClusterShmemInit(void)
+{
+ HASHCTL info;
+
+ info.keysize = sizeof(ClusteredRel);
+ info.entrysize = info.keysize;
+
+ ClusteredRelsHash = ShmemInitHash("Clustered Relations",
+ maxClusteredRels,
+ maxClusteredRels,
+ &info,
+ HASH_ELEM | HASH_BLOBS);
+}
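Two details of the shared hashtable above are worth spelling out: the sizing rule (two entries per replication slot, leaving room for each table's TOAST relation) and the composite (relid, dbid) key, since the same relation OID can occur in different databases. A hypothetical sketch:

```c
#include <stdbool.h>

typedef unsigned int Oid;

/* Same shape as the patch's ClusteredRel hash key. */
typedef struct ClusteredRelKey
{
    Oid relid;
    Oid dbid;
} ClusteredRelKey;

/* Sizing rule from ClusterShmemSize(): one entry for the table plus
 * one for its TOAST relation, per replication slot. */
static int max_clustered_rels(int max_replication_slots)
{
    return max_replication_slots * 2;
}

/* The key must match on both fields: relation OIDs are only unique
 * within a database, not across the whole cluster. */
static bool keys_equal(ClusteredRelKey a, ClusteredRelKey b)
{
    return a.relid == b.relid && a.dbid == b.dbid;
}
```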
+
+/*
+ * Perform a preliminary check whether CLUSTER / VACUUM FULL CONCURRENTLY is
+ * possible. Note that here we only check things that should not change if we
+ * release the relation lock temporarily. The information that can change due
+ * to unlocking is checked in get_catalog_state().
+ */
+static void
+check_concurrent_cluster_requirements(Relation rel, bool isTopLevel,
+ bool isCluster)
+{
+ const char *stmt;
+
+ if (isCluster)
+ stmt = "CLUSTER (CONCURRENTLY)";
+ else
+ stmt = "VACUUM (FULL, CONCURRENTLY)";
+
+ /*
+ * Make sure we have no XID assigned, otherwise call of
+ * setup_logical_decoding() can cause a deadlock.
+ */
+ PreventInTransactionBlock(isTopLevel, stmt);
+
+ CheckSlotPermissions();
+
+ /*
+ * Use an existing function to check if we can use logical
+ * decoding. However, note that RecoveryInProgress() should already have
+ * raised an error, as it does for the non-concurrent VACUUM FULL / CLUSTER.
+ */
+ CheckLogicalDecodingRequirements();
+
+ /* See ClusterShmemSize() */
+ if (max_replication_slots < 2)
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ (errmsg("%s requires \"max_replication_slots\" to be at least 2",
+ stmt)));
+}
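The preconditions enforced above can be summarized as a pure predicate. A simplified, hypothetical reduction (the real checks raise ERROR with specific messages rather than returning false):

```c
#include <stdbool.h>

/*
 * Condensed mirror of check_concurrent_cluster_requirements(): no open
 * transaction block, logical decoding must be usable, and there must be
 * enough replication slots.
 */
static bool concurrent_cluster_allowed(bool in_xact_block,
                                       bool logical_decoding_ok,
                                       int max_replication_slots)
{
    if (in_xact_block)
        return false;           /* PreventInTransactionBlock() */
    if (!logical_decoding_ok)
        return false;           /* CheckLogicalDecodingRequirements() */
    /* See ClusterShmemSize(): heap + TOAST need two hash entries. */
    return max_replication_slots >= 2;
}
```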
+
+/*
+ * Call this function before CLUSTER CONCURRENTLY starts to setup logical
+ * decoding. It makes sure that other users of the table put enough
+ * information into WAL.
+ *
+ * The point is that on various places we expect that the table we're
+ * processing is treated like a system catalog. For example, we need to be
+ * able to scan it using a "historic snapshot" anytime during the processing
+ * (as opposed to scanning only at the start point of the decoding, as
+ * logical replication does during initial table synchronization), to apply
+ * concurrent UPDATE / DELETE commands.
+ *
+ * Since we need to close and reopen the relation here, the 'rel_p' and
+ * 'index_p' arguments are in/out.
+ *
+ * 'enter_p' receives a bool value telling whether relation OID was entered
+ * into the hashtable or not.
+ */
+static void
+begin_concurrent_cluster(Relation *rel_p, Relation *index_p,
+ bool *entered_p)
+{
+ Relation rel = *rel_p;
+ Oid relid, toastrelid;
+ ClusteredRel key, *entry;
+ bool found;
+ RelReopenInfo rri[2];
+ int nrel;
+ static bool before_shmem_exit_callback_setup = false;
+
+ relid = RelationGetRelid(rel);
+
+ /*
+ * Make sure that we do not leave an entry in ClusteredRelsHash if exiting
+ * due to FATAL.
+ */
+ if (!before_shmem_exit_callback_setup)
+ {
+ before_shmem_exit(cluster_before_shmem_exit_callback, 0);
+ before_shmem_exit_callback_setup = true;
+ }
+
+ memset(&key, 0, sizeof(key));
+ key.relid = relid;
+ key.dbid = MyDatabaseId;
+
+ *entered_p = false;
+ LWLockAcquire(ClusteredRelsLock, LW_EXCLUSIVE);
+ entry = (ClusteredRel *)
+ hash_search(ClusteredRelsHash, &key, HASH_ENTER_NULL, &found);
+ if (found)
+ {
+ /*
+ * Since CLUSTER CONCURRENTLY takes ShareRowExclusiveLock, a conflict
+ * should occur much earlier. However, that lock may be released
+ * temporarily, see below. Anyway, we should complain whatever the
+ * reason of the conflict might be.
+ */
+ ereport(ERROR,
+ (errmsg(CLUSTER_IN_PROGRESS_MESSAGE,
+ RelationGetRelationName(rel))));
+ }
+ if (entry == NULL)
+ ereport(ERROR,
+ (errmsg("too many requests for CLUSTER CONCURRENTLY at a time")),
+ (errhint("Please consider increasing the \"max_replication_slots\" configuration parameter.")));
+
+ /*
+ * Even if the insertion of the TOAST relid fails below, the caller
+ * has to do the cleanup.
+ */
+ *entered_p = true;
+
+ /*
+ * Enable the callback to remove the entry in case of exit. We should not
+ * do this earlier, otherwise an attempt to insert already existing entry
+ * could make us remove that entry (inserted by another backend) during
+ * ERROR handling.
+ */
+ Assert(!OidIsValid(clustered_rel));
+ clustered_rel = relid;
+
+ /*
+ * TOAST relation is not accessed using historic snapshot, but we enter it
+ * here to protect it from being VACUUMed by another backend. (Lock does
+ * not help in the CONCURRENT case because we cannot hold it continuously
+ * till the end of the transaction.) See the comments on locking TOAST
+ * relation in copy_table_data().
+ */
+ toastrelid = rel->rd_rel->reltoastrelid;
+ if (OidIsValid(toastrelid))
+ {
+ key.relid = toastrelid;
+ entry = (ClusteredRel *)
+ hash_search(ClusteredRelsHash, &key, HASH_ENTER_NULL, &found);
+ if (found)
+ /*
+ * If we could enter the main relation, the TOAST relation should
+ * succeed too. Nevertheless, check.
+ */
+ ereport(ERROR,
+ (errmsg("TOAST relation of \"%s\" is already being processed by CLUSTER CONCURRENTLY",
+ RelationGetRelationName(rel))));
+ if (entry == NULL)
+ ereport(ERROR,
+ (errmsg("too many requests for CLUSTER CONCURRENTLY at a time")),
+ (errhint("Please consider increasing the \"max_replication_slots\" configuration parameter.")));
+
+ Assert(!OidIsValid(clustered_rel_toast));
+ clustered_rel_toast = toastrelid;
+ }
+ LWLockRelease(ClusteredRelsLock);
+
+ /*
+ * Make sure that other backends are aware of the new hash entry.
+ *
+ * Besides sending the invalidation message, we need to force re-opening
+ * of the relation, which includes the actual invalidation (and thus
+ * checking of our hashtable on the next access).
+ */
+ CacheInvalidateRelcacheImmediate(rel);
+ /*
+ * Since the hashtable only needs to be checked by write transactions,
+ * lock the relation in a mode that conflicts with any DML command. (The
+ * reading transactions are supposed to close the relation before opening
+ * it with a higher lock.) Once we have the relation (and its index) locked,
+ * we unlock it immediately and then re-lock using the original mode.
+ */
+ nrel = 0;
+ init_rel_reopen_info(&rri[nrel++], rel_p, InvalidOid,
+ LOCK_CLUSTER_CONCURRENT, ShareLock);
+ if (index_p)
+ {
+ /*
+ * Another transaction might want to open both the relation and the
+ * index. If it already has the relation lock and is waiting for the
+ * index lock, we should release the index lock, otherwise our request
+ * for ShareLock on the relation can end up in a deadlock.
+ */
+ init_rel_reopen_info(&rri[nrel++], index_p, InvalidOid,
+ LOCK_CLUSTER_CONCURRENT, ShareLock);
+ }
+ unlock_and_close_relations(rri, nrel);
+ /*
+ * XXX It's not strictly necessary to lock the index here, but it's
+ * probably not worth teaching the "reopen API" about this special case.
+ */
+ reopen_relations(rri, nrel);
+
+ /* Switch back to the original lock. */
+ nrel = 0;
+ init_rel_reopen_info(&rri[nrel++], rel_p, InvalidOid,
+ ShareLock, LOCK_CLUSTER_CONCURRENT);
+ if (index_p)
+ init_rel_reopen_info(&rri[nrel++], index_p, InvalidOid,
+ ShareLock, LOCK_CLUSTER_CONCURRENT);
+ unlock_and_close_relations(rri, nrel);
+ reopen_relations(rri, nrel);
+
+ /* Avoid logical decoding of other relations by this backend. */
+ clustered_rel_locator = rel->rd_locator;
+ if (OidIsValid(toastrelid))
+ {
+ Relation toastrel;
+
+ /* Avoid logical decoding of other TOAST relations. */
+ toastrel = table_open(toastrelid, AccessShareLock);
+ clustered_rel_toast_locator = toastrel->rd_locator;
+ table_close(toastrel, AccessShareLock);
+ }
+}
+
+/*
+ * Call this when done with CLUSTER CONCURRENTLY.
+ *
+ * 'error' tells whether the function is being called in order to handle
+ * error.
+ */
+static void
+end_concurrent_cluster(bool error)
+{
+ ClusteredRel key, *entry = NULL, *entry_toast = NULL;
+ Oid relid = clustered_rel;
+ Oid toastrelid = clustered_rel_toast;
+
+ /*
+ * Acquire the lock unconditionally: the LWLockRelease() below is
+ * unconditional too, and the TOAST entry is removed under the same
+ * lock even when the main entry was never inserted.
+ */
+ memset(&key, 0, sizeof(key));
+ key.dbid = MyDatabaseId;
+ LWLockAcquire(ClusteredRelsLock, LW_EXCLUSIVE);
+
+ /* Remove the relation from the hash if we managed to insert one. */
+ if (OidIsValid(clustered_rel))
+ {
+ key.relid = clustered_rel;
+ entry = hash_search(ClusteredRelsHash, &key, HASH_REMOVE, NULL);
+
+ /*
+ * By clearing this variable we also disable
+ * cluster_before_shmem_exit_callback().
+ */
+ clustered_rel = InvalidOid;
+ }
+
+ /* Remove the TOAST relation if there is one. */
+ if (OidIsValid(clustered_rel_toast))
+ {
+ key.relid = clustered_rel_toast;
+ entry_toast = hash_search(ClusteredRelsHash, &key, HASH_REMOVE,
+ NULL);
+
+ clustered_rel_toast = InvalidOid;
+ }
+ LWLockRelease(ClusteredRelsLock);
+
+ /* Restore normal function of logical decoding. */
+ clustered_rel_locator.relNumber = InvalidOid;
+ clustered_rel_toast_locator.relNumber = InvalidOid;
+
+ /*
+ * On normal completion (!error), we should not really fail to remove the
+ * entry. But if it wasn't there for any reason, raise ERROR to make sure
+ * the transaction is aborted: if other transactions, while changing the
+ * contents of the relation, didn't know that CLUSTER CONCURRENTLY was in
+ * progress, they might not have written enough information to WAL, and
+ * thus we could have ended up with inconsistent table contents.
+ *
+ * On the other hand, if we are already handling an error, there's no
+ * reason to worry about inconsistent contents of the new storage because
+ * the transaction is going to be rolled back anyway. Furthermore, by
+ * raising ERROR here we'd shadow the original error.
+ */
+ if (!error)
+ {
+ char *relname;
+
+ if (OidIsValid(relid) && entry == NULL)
+ {
+ relname = get_rel_name(relid);
+ if (!relname)
+ elog(ERROR, "cache lookup failed for relation %u",
+ relid);
+
+ ereport(ERROR,
+ (errmsg("relation \"%s\" not found among clustered relations",
+ relname)));
+ }
+
+ /*
+ * Likewise, the TOAST relation should not have disappeared.
+ */
+ if (OidIsValid(toastrelid) && entry_toast == NULL)
+ {
+ relname = get_rel_name(toastrelid);
+ if (!relname)
+ elog(ERROR, "cache lookup failed for relation %u",
+ toastrelid);
+
+ ereport(ERROR,
+ (errmsg("relation \"%s\" not found among clustered relations",
+ relname)));
+ }
+ }
+
+ /*
+ * Note: unlike begin_concurrent_cluster(), here we do not lock/unlock the
+ * relation: 1) On normal completion, the caller is already holding
+ * AccessExclusiveLock (till the end of the transaction), 2) on ERROR /
+ * FATAL, we try to do the cleanup asap, but the worst case is that other
+ * backends will write unnecessary information to WAL until they close the
+ * relation.
+ */
+}
+
+/*
+ * A wrapper to call end_concurrent_cluster() as a before_shmem_exit callback.
+ */
+static void
+cluster_before_shmem_exit_callback(int code, Datum arg)
+{
+ if (OidIsValid(clustered_rel) || OidIsValid(clustered_rel_toast))
+ end_concurrent_cluster(true);
+}
+
+/*
+ * Check if relation is currently being processed by CLUSTER CONCURRENTLY.
+ */
+bool
+is_concurrent_cluster_in_progress(Oid relid)
+{
+ ClusteredRel key, *entry;
+
+ memset(&key, 0, sizeof(key));
+ key.relid = relid;
+ key.dbid = MyDatabaseId;
+
+ LWLockAcquire(ClusteredRelsLock, LW_SHARED);
+ entry = (ClusteredRel *)
+ hash_search(ClusteredRelsHash, &key, HASH_FIND, NULL);
+ LWLockRelease(ClusteredRelsLock);
+
+ return entry != NULL;
+}
+
+/*
+ * Check if VACUUM FULL / CLUSTER CONCURRENTLY is already running for given
+ * relation, and if so, raise ERROR. The problem is that cluster_rel() needs
+ * to release its lock on the relation temporarily at some point, so our lock
+ * alone does not help. Commands that might break what cluster_rel() is doing
+ * should call this function first.
+ *
+ * Return without checking if lockmode allows for race conditions which would
+ * make the result meaningless. In that case, cluster_rel() itself should
+ * throw ERROR if the relation was changed by us in an incompatible
+ * way. However, if it managed to do most of its work by then, a lot of CPU
+ * time might be wasted.
+ */
+void
+check_for_concurrent_cluster(Oid relid, LOCKMODE lockmode)
+{
+ /*
+ * If the caller does not have a lock that conflicts with
+ * LOCK_CLUSTER_CONCURRENT, the check makes little sense because the
+ * VACUUM FULL / CLUSTER CONCURRENTLY can start anytime after the check.
+ */
+ if (lockmode < LOCK_CLUSTER_CONCURRENT)
+ return;
+
+ if (is_concurrent_cluster_in_progress(relid))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg(CLUSTER_IN_PROGRESS_MESSAGE,
+ get_rel_name(relid))));
+}
+
+/*
+ * Check if relation is eligible for CLUSTER CONCURRENTLY and retrieve the
+ * catalog state to be passed later to check_catalog_changes.
+ *
+ * Caller is supposed to hold (at least) LOCK_CLUSTER_CONCURRENT on the
+ * relation.
+ */
+static CatalogState *
+get_catalog_state(Relation rel)
+{
+ CatalogState *result = palloc_object(CatalogState);
+ List *ind_oids;
+ ListCell *lc;
+ int ninds, i;
+ char relpersistence = rel->rd_rel->relpersistence;
+ char replident = rel->rd_rel->relreplident;
+ Oid ident_idx = RelationGetReplicaIndex(rel);
+ TupleDesc td_src = RelationGetDescr(rel);
+
+ /*
+ * While gathering the catalog information, check if there is a reason not
+ * to proceed.
+ */
+ check_relation_is_clusterable_concurrently(rel, ERROR,
+ "CLUSTER (CONCURRENTLY)");
+
+ /* No index should be dropped while we are checking it. */
+ Assert(CheckRelationLockedByMe(rel, ShareUpdateExclusiveLock, true));
+
+ ind_oids = RelationGetIndexList(rel);
+ result->ninds = ninds = list_length(ind_oids);
+ result->ind_oids = palloc_array(Oid, ninds);
+ result->ind_tupdescs = palloc_array(TupleDesc, ninds);
+ i = 0;
+ foreach(lc, ind_oids)
+ {
+ Oid ind_oid = lfirst_oid(lc);
+ Relation index;
+ TupleDesc td_ind_src, td_ind_dst;
+
+ /*
+ * A weaker lock should be OK for the index, but this one should not
+ * break anything either.
+ */
+ index = index_open(ind_oid, ShareUpdateExclusiveLock);
+
+ result->ind_oids[i] = RelationGetRelid(index);
+ td_ind_src = RelationGetDescr(index);
+ td_ind_dst = palloc(TupleDescSize(td_ind_src));
+ TupleDescCopy(td_ind_dst, td_ind_src);
+ result->ind_tupdescs[i] = td_ind_dst;
+ i++;
+
+ index_close(index, ShareUpdateExclusiveLock);
+ }
+
+ /* Fill in the relation info. */
+ result->tupdesc = palloc(TupleDescSize(td_src));
+ TupleDescCopy(result->tupdesc, td_src);
+ result->relpersistence = relpersistence;
+ result->replident = replident;
+ result->replidindex = ident_idx;
+
+ return result;
+}
+
+static void
+free_catalog_state(CatalogState *state)
+{
+ /* We are only interested in indexes. */
+ if (state->ninds == 0)
+ return;
+
+ for (int i = 0; i < state->ninds; i++)
+ FreeTupleDesc(state->ind_tupdescs[i]);
+
+ FreeTupleDesc(state->tupdesc);
+ pfree(state->ind_oids);
+ pfree(state->ind_tupdescs);
+ pfree(state);
+}
+
+/*
+ * Raise ERROR if 'rel' changed in a way that does not allow further
+ * processing of CLUSTER CONCURRENTLY.
+ *
+ * Besides the relation's tuple descriptor, it's important to check indexes:
+ * concurrent change of an index definition (can it happen in any way other
+ * than dropping and re-creating the index, accidentally with the same
+ * OID?) can be
+ * a problem because we may already have the new index built. If an index was
+ * created or dropped concurrently, we'd fail to swap the index storage. In
+ * any case, we prefer to check the indexes early to get an explicit error
+ * message about the mismatch. Furthermore, the earlier we detect the change,
+ * the fewer CPU cycles we waste.
+ *
+ * Note that we do not check constraints because the transaction which changed
+ * them must have ensured that the existing tuples satisfy the new
+ * constraints. If any DML commands were necessary for that, we will simply
+ * decode them from WAL and apply them to the new storage.
+ *
+ * Caller is supposed to hold (at least) ShareUpdateExclusiveLock on the
+ * relation.
+ */
+static void
+check_catalog_changes(Relation rel, CatalogState *cat_state)
+{
+ Oid reltoastrelid = rel->rd_rel->reltoastrelid;
+ List *ind_oids;
+ ListCell *lc;
+ LOCKMODE lockmode;
+ Oid ident_idx;
+ TupleDesc td, td_cp;
+
+ /* First, check the relation info. */
+
+ /* The TOAST relation is unlikely to change, but check it anyway. */
+ if (reltoastrelid != clustered_rel_toast)
+ ereport(ERROR,
+ errmsg("TOAST relation of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+
+ /*
+ * Likewise, check_for_concurrent_cluster() should prevent others from
+ * changing the relation file concurrently, but it's our responsibility to
+ * avoid data loss. (The original locators are stored outside cat_state,
+ * but the check belongs to this function.)
+ */
+ if (!RelFileLocatorEquals(rel->rd_locator, clustered_rel_locator))
+ ereport(ERROR,
+ (errmsg("file of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel))));
+ if (OidIsValid(reltoastrelid))
+ {
+ Relation toastrel;
+
+ toastrel = table_open(reltoastrelid, AccessShareLock);
+ if (!RelFileLocatorEquals(toastrel->rd_locator,
+ clustered_rel_toast_locator))
+ ereport(ERROR,
+ (errmsg("file of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(toastrel))));
+ table_close(toastrel, AccessShareLock);
+ }
+
+ if (rel->rd_rel->relpersistence != cat_state->relpersistence)
+ ereport(ERROR,
+ errmsg("persistence of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+
+ if (cat_state->replident != rel->rd_rel->relreplident)
+ ereport(ERROR,
+ errmsg("replica identity of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+
+ ident_idx = RelationGetReplicaIndex(rel);
+ if (!OidIsValid(ident_idx) && OidIsValid(rel->rd_pkindex))
+ ident_idx = rel->rd_pkindex;
+ if (cat_state->replidindex != ident_idx)
+ ereport(ERROR,
+ errmsg("identity index of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+
+ /*
+ * As cat_state contains a copy (which has the constraint info cleared),
+ * create a temporary copy for the comparison.
+ */
+ td = RelationGetDescr(rel);
+ td_cp = palloc(TupleDescSize(td));
+ TupleDescCopy(td_cp, td);
+ if (!equalTupleDescs(cat_state->tupdesc, td_cp))
+ ereport(ERROR,
+ errmsg("definition of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+ FreeTupleDesc(td_cp);
+
+ /* Now we are only interested in indexes. */
+ if (cat_state->ninds == 0)
+ return;
+
+ /* No index should be dropped while we are checking the relation. */
+ lockmode = ShareUpdateExclusiveLock;
+ Assert(CheckRelationLockedByMe(rel, lockmode, true));
+
+ ind_oids = RelationGetIndexList(rel);
+ if (list_length(ind_oids) != cat_state->ninds)
+ goto failed_index;
+
+ foreach(lc, ind_oids)
+ {
+ Oid ind_oid = lfirst_oid(lc);
+ int i;
+ TupleDesc tupdesc;
+ Relation index;
+
+ /* Find the index in cat_state. */
+ for (i = 0; i < cat_state->ninds; i++)
+ {
+ if (cat_state->ind_oids[i] == ind_oid)
+ break;
+ }
+ /*
+ * OID not found, i.e. the index was replaced by another one. XXX
+ * Should we still try to find out whether an index having the desired
+ * tuple descriptor exists? Or should we always look for the tuple
+ * descriptor and not use OIDs at all?
+ */
+ if (i == cat_state->ninds)
+ goto failed_index;
+
+ /* Check the tuple descriptor. */
+ index = try_index_open(ind_oid, lockmode);
+ if (index == NULL)
+ goto failed_index;
+ tupdesc = RelationGetDescr(index);
+ if (!equalTupleDescs(cat_state->ind_tupdescs[i], tupdesc))
+ goto failed_index;
+ index_close(index, lockmode);
+ }
+
+ return;
+
+failed_index:
+ ereport(ERROR,
+ (errmsg("index(es) of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel))));
+}
+
+/*
+ * This function is much like pg_create_logical_replication_slot() except that
+ * the new slot is neither released (if anyone else could read changes from
+ * our slot, we could miss changes other backends do while we copy the
+ * existing data into a temporary table), nor persisted (it's easier to
+ * handle a crash by restarting all the work from scratch).
+ *
+ * XXX Even though CreateInitDecodingContext() does not set state to
+ * RS_PERSISTENT, it does write the slot to disk. We rely on
+ * RestoreSlotFromDisk() to delete ephemeral slots during startup. (Both ERROR
+ * and FATAL should lead to cleanup even before the cluster goes down.)
+ */
+static LogicalDecodingContext *
+setup_logical_decoding(Oid relid, const char *slotname, TupleDesc tupdesc)
+{
+ LogicalDecodingContext *ctx;
+ ClusterDecodingState *dstate;
+
+ /* RS_TEMPORARY so that the slot gets cleaned up on ERROR. */
+ ReplicationSlotCreate(slotname, true, RS_TEMPORARY, false, false, false);
+
+ /*
+ * None of the prepare_write, do_write and update_progress callbacks is
+ * useful for us.
+ *
+ * Regarding the value of need_full_snapshot, we pass false because the
+ * table we are processing is present in ClusteredRelsHash and therefore,
+ * regarding logical decoding, treated like a catalog.
+ */
+ ctx = CreateInitDecodingContext(REPL_PLUGIN_NAME,
+ NIL,
+ false,
+ InvalidXLogRecPtr,
+ XL_ROUTINE(.page_read = read_local_xlog_page,
+ .segment_open = wal_segment_open,
+ .segment_close = wal_segment_close),
+ NULL, NULL, NULL);
+
+ /*
+ * We have no control over setting fast_forward, so at least check it.
+ */
+ Assert(!ctx->fast_forward);
+
+ DecodingContextFindStartpoint(ctx);
+
+ /* Some WAL records should have been read. */
+ Assert(ctx->reader->EndRecPtr != InvalidXLogRecPtr);
+
+ XLByteToSeg(ctx->reader->EndRecPtr, cluster_current_segment,
+ wal_segment_size);
+
+ /*
+ * Setup structures to store decoded changes.
+ */
+ dstate = palloc0(sizeof(ClusterDecodingState));
+ dstate->relid = relid;
+ dstate->tstore = tuplestore_begin_heap(false, false,
+ maintenance_work_mem);
+ dstate->tupdesc = tupdesc;
+
+ /* Initialize the descriptor to store the changes ... */
+ dstate->tupdesc_change = CreateTemplateTupleDesc(1);
+
+ TupleDescInitEntry(dstate->tupdesc_change, 1, NULL, BYTEAOID, -1, 0);
+ /* ... as well as the corresponding slot. */
+ dstate->tsslot = MakeSingleTupleTableSlot(dstate->tupdesc_change,
+ &TTSOpsMinimalTuple);
+
+ dstate->resowner = ResourceOwnerCreate(CurrentResourceOwner,
+ "logical decoding");
+
+ ctx->output_writer_private = dstate;
+ return ctx;
+}
+
+/*
+ * Retrieve a tuple from a change structure. No alignment of the change
+ * data is assumed.
+ */
+static HeapTuple
+get_changed_tuple(ConcurrentChange *change)
+{
+ HeapTupleData tup_data;
+ HeapTuple result;
+ char *src;
+
+ /*
+ * Ensure alignment before accessing the fields. (This is why we can't use
+ * heap_copytuple() instead of this function.)
+ */
+ memcpy(&tup_data, &change->tup_data, sizeof(HeapTupleData));
+
+ result = (HeapTuple) palloc(HEAPTUPLESIZE + tup_data.t_len);
+ memcpy(result, &tup_data, sizeof(HeapTupleData));
+ result->t_data = (HeapTupleHeader) ((char *) result + HEAPTUPLESIZE);
+ src = (char *) change + sizeof(ConcurrentChange);
+ memcpy(result->t_data, src, result->t_len);
+
+ return result;
+}
+
+/*
+ * Decode logical changes from the WAL sequence up to end_of_wal.
+ */
+void
+cluster_decode_concurrent_changes(LogicalDecodingContext *ctx,
+ XLogRecPtr end_of_wal)
+{
+ ClusterDecodingState *dstate;
+ ResourceOwner resowner_old;
+ PgBackendProgress progress;
+
+ /*
+ * Invalidate the "present" cache before moving to "(recent) history".
+ */
+ InvalidateSystemCaches();
+
+ dstate = (ClusterDecodingState *) ctx->output_writer_private;
+ resowner_old = CurrentResourceOwner;
+ CurrentResourceOwner = dstate->resowner;
+
+ /*
+ * reorderbuffer.c uses an internal subtransaction, whose abort ends the
+ * command progress reporting. Save the status here so we can restore when
+ * done with the decoding.
+ */
+ memcpy(&progress, &MyBEEntry->st_progress, sizeof(PgBackendProgress));
+
+ PG_TRY();
+ {
+ while (ctx->reader->EndRecPtr < end_of_wal)
+ {
+ XLogRecord *record;
+ XLogSegNo segno_new;
+ char *errm = NULL;
+ XLogRecPtr end_lsn;
+
+ record = XLogReadRecord(ctx->reader, &errm);
+ if (errm)
+ elog(ERROR, "%s", errm);
+
+ if (record != NULL)
+ LogicalDecodingProcessRecord(ctx, ctx->reader);
+
+ /*
+ * If a WAL segment boundary has been crossed, inform the decoding
+ * system that catalog_xmin can advance. (We could confirm more
+ * often, but filling a single WAL segment should not take much
+ * time.)
+ */
+ end_lsn = ctx->reader->EndRecPtr;
+ XLByteToSeg(end_lsn, segno_new, wal_segment_size);
+ if (segno_new != cluster_current_segment)
+ {
+ LogicalConfirmReceivedLocation(end_lsn);
+ elog(DEBUG1, "cluster: confirmed receive location %X/%X",
+ (uint32) (end_lsn >> 32), (uint32) end_lsn);
+ cluster_current_segment = segno_new;
+ }
+
+ CHECK_FOR_INTERRUPTS();
+ }
+ InvalidateSystemCaches();
+ CurrentResourceOwner = resowner_old;
+ }
+ PG_CATCH();
+ {
+ InvalidateSystemCaches();
+ CurrentResourceOwner = resowner_old;
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+
+ /* Restore the progress reporting status. */
+ pgstat_progress_restore_state(&progress);
+}
+
+/*
+ * Apply changes that happened during the initial load.
+ *
+ * Scan key is passed by caller, so it does not have to be constructed
+ * multiple times. Key entries have all fields initialized, except for
+ * sk_argument.
+ */
+static void
+apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
+ ScanKey key, int nkeys, IndexInsertState *iistate)
+{
+ TupleTableSlot *index_slot, *ident_slot;
+ HeapTuple tup_old = NULL;
+
+ if (dstate->nchanges == 0)
+ return;
+
+ /* TupleTableSlot is needed to pass the tuple to ExecInsertIndexTuples(). */
+ index_slot = MakeSingleTupleTableSlot(dstate->tupdesc, &TTSOpsHeapTuple);
+ iistate->econtext->ecxt_scantuple = index_slot;
+
+ /* A slot to fetch tuples from identity index. */
+ ident_slot = table_slot_create(rel, NULL);
+
+ while (tuplestore_gettupleslot(dstate->tstore, true, false,
+ dstate->tsslot))
+ {
+ bool shouldFree;
+ HeapTuple tup_change,
+ tup,
+ tup_exist;
+ char *change_raw;
+ ConcurrentChange *change;
+ bool isnull[1];
+ Datum values[1];
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Get the change from the single-column tuple. */
+ tup_change = ExecFetchSlotHeapTuple(dstate->tsslot, false, &shouldFree);
+ heap_deform_tuple(tup_change, dstate->tupdesc_change, values, isnull);
+ Assert(!isnull[0]);
+
+ /* This is bytea, but char* is easier to work with. */
+ change_raw = (char *) DatumGetByteaP(values[0]);
+
+ change = (ConcurrentChange *) VARDATA(change_raw);
+
+ /* TRUNCATE change contains no tuple, so process it separately. */
+ if (change->kind == CHANGE_TRUNCATE)
+ {
+ /*
+ * All the things that ExecuteTruncateGuts() does (such as firing
+ * triggers or handling the DROP_CASCADE behavior) should have
+ * taken place on the source relation. Thus we only do the actual
+ * truncation of the new relation (and its indexes).
+ */
+ heap_truncate_one_rel(rel);
+
+ pfree(tup_change);
+ continue;
+ }
+
+ /*
+ * Extract the tuple from the change. The tuple is copied here because
+ * it might be assigned to 'tup_old', in which case it needs to
+ * survive into the next iteration.
+ */
+ tup = get_changed_tuple(change);
+
+ if (change->kind == CHANGE_UPDATE_OLD)
+ {
+ Assert(tup_old == NULL);
+ tup_old = tup;
+ }
+ else if (change->kind == CHANGE_INSERT)
+ {
+ Assert(tup_old == NULL);
+
+ apply_concurrent_insert(rel, change, tup, iistate, index_slot);
+
+ pfree(tup);
+ }
+ else if (change->kind == CHANGE_UPDATE_NEW ||
+ change->kind == CHANGE_DELETE)
+ {
+ IndexScanDesc ind_scan = NULL;
+ HeapTuple tup_key;
+
+ if (change->kind == CHANGE_UPDATE_NEW)
+ {
+ tup_key = tup_old != NULL ? tup_old : tup;
+ }
+ else
+ {
+ Assert(tup_old == NULL);
+ tup_key = tup;
+ }
+
+ /*
+ * Find the tuple to be updated or deleted.
+ */
+ tup_exist = find_target_tuple(rel, key, nkeys, tup_key,
+ iistate, ident_slot, &ind_scan);
+ if (tup_exist == NULL)
+ elog(ERROR, "failed to find target tuple");
+
+ if (change->kind == CHANGE_UPDATE_NEW)
+ apply_concurrent_update(rel, tup, tup_exist, change, iistate,
+ index_slot);
+ else
+ apply_concurrent_delete(rel, tup_exist, change);
+
+ if (tup_old != NULL)
+ {
+ pfree(tup_old);
+ tup_old = NULL;
+ }
+
+ pfree(tup);
+ index_endscan(ind_scan);
+ }
+ else
+ elog(ERROR, "unrecognized kind of change: %d", change->kind);
+
+ /* If there's any change, make it visible to the next iteration. */
+ if (change->kind != CHANGE_UPDATE_OLD)
+ {
+ CommandCounterIncrement();
+ UpdateActiveSnapshotCommandId();
+ }
+
+ /* TTSOpsMinimalTuple has .get_heap_tuple==NULL. */
+ Assert(shouldFree);
+ pfree(tup_change);
+ }
+
+ tuplestore_clear(dstate->tstore);
+ dstate->nchanges = 0;
+
+ /* Cleanup. */
+ ExecDropSingleTupleTableSlot(index_slot);
+ ExecDropSingleTupleTableSlot(ident_slot);
+}
+
+static void
+apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
+ IndexInsertState *iistate, TupleTableSlot *index_slot)
+{
+ List *recheck;
+
+ heap_insert(rel, tup, GetCurrentCommandId(true), HEAP_INSERT_NO_LOGICAL, NULL);
+
+ /*
+ * Update indexes. (If functions used by any index need an active
+ * snapshot, the caller should have set one.)
+ */
+ ExecStoreHeapTuple(tup, index_slot, false);
+ recheck = ExecInsertIndexTuples(iistate->rri,
+ index_slot,
+ iistate->estate,
+ false, /* update */
+ false, /* noDupErr */
+ NULL, /* specConflict */
+ NIL, /* arbiterIndexes */
+ false /* onlySummarizing */
+ );
+
+ /*
+ * If recheck is required, it must have been performed on the source
+ * relation by now. (All the logical changes we process here are already
+ * committed.)
+ */
+ list_free(recheck);
+
+ pgstat_progress_incr_param(PROGRESS_CLUSTER_HEAP_TUPLES_INSERTED, 1);
+}
+
+static void
+apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
+ ConcurrentChange *change, IndexInsertState *iistate,
+ TupleTableSlot *index_slot)
+{
+ List *recheck;
+ TU_UpdateIndexes update_indexes;
+
+ /*
+ * Write the new tuple into the new heap. ('tup' gets the TID assigned
+ * here.)
+ */
+ simple_heap_update(rel, &tup_target->t_self, tup, &update_indexes);
+
+ ExecStoreHeapTuple(tup, index_slot, false);
+
+ if (update_indexes != TU_None)
+ {
+ recheck = ExecInsertIndexTuples(iistate->rri,
+ index_slot,
+ iistate->estate,
+ true, /* update */
+ false, /* noDupErr */
+ NULL, /* specConflict */
+ NIL, /* arbiterIndexes */
+ /* onlySummarizing */
+ update_indexes == TU_Summarizing);
+ list_free(recheck);
+ }
+
+ pgstat_progress_incr_param(PROGRESS_CLUSTER_HEAP_TUPLES_UPDATED, 1);
+}
+
+static void
+apply_concurrent_delete(Relation rel, HeapTuple tup_target,
+ ConcurrentChange *change)
+{
+ simple_heap_delete(rel, &tup_target->t_self);
+
+ pgstat_progress_incr_param(PROGRESS_CLUSTER_HEAP_TUPLES_DELETED, 1);
+}
+
+/*
+ * Find the tuple to be updated or deleted.
+ *
+ * 'key' is a pre-initialized scan key, into which the function will put the
+ * key values.
+ *
+ * 'tup_key' is a tuple containing the key values for the scan.
+ *
+ * On exit, '*scan_p' contains the scan descriptor used. The caller must
+ * close it when the returned tuple is no longer needed.
+ */
+static HeapTuple
+find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
+ IndexInsertState *iistate,
+ TupleTableSlot *ident_slot, IndexScanDesc *scan_p)
+{
+ IndexScanDesc scan;
+ Form_pg_index ident_form;
+ int2vector *ident_indkey;
+ HeapTuple result = NULL;
+
+ scan = index_beginscan(rel, iistate->ident_index, GetActiveSnapshot(),
+ nkeys, 0);
+ *scan_p = scan;
+ index_rescan(scan, key, nkeys, NULL, 0);
+
+ /* Info needed to retrieve key values from heap tuple. */
+ ident_form = iistate->ident_index->rd_index;
+ ident_indkey = &ident_form->indkey;
+
+ /* Use the incoming tuple to finalize the scan key. */
+ for (int i = 0; i < scan->numberOfKeys; i++)
+ {
+ ScanKey entry;
+ bool isnull;
+ int16 attno_heap;
+
+ entry = &scan->keyData[i];
+ attno_heap = ident_indkey->values[i];
+ entry->sk_argument = heap_getattr(tup_key,
+ attno_heap,
+ rel->rd_att,
+ &isnull);
+ Assert(!isnull);
+ }
+ if (index_getnext_slot(scan, ForwardScanDirection, ident_slot))
+ {
+ bool shouldFree;
+
+ result = ExecFetchSlotHeapTuple(ident_slot, false, &shouldFree);
+ /* TTSOpsBufferHeapTuple has .get_heap_tuple != NULL. */
+ Assert(!shouldFree);
+ }
+
+ return result;
+}
+
+/*
+ * Decode and apply concurrent changes.
+ *
+ * Pass rel_src iff its reltoastrelid is needed.
+ */
+static void
+process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
+ Relation rel_dst, Relation rel_src, ScanKey ident_key,
+ int ident_key_nentries, IndexInsertState *iistate)
+{
+ ClusterDecodingState *dstate;
+
+ pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
+ PROGRESS_CLUSTER_PHASE_CATCH_UP);
+
+ dstate = (ClusterDecodingState *) ctx->output_writer_private;
+
+ cluster_decode_concurrent_changes(ctx, end_of_wal);
+
+ if (dstate->nchanges == 0)
+ return;
+
+ PG_TRY();
+ {
+ /*
+ * Make sure that TOAST values can eventually be accessed via the old
+ * relation - see comment in copy_table_data().
+ */
+ if (rel_src)
+ rel_dst->rd_toastoid = rel_src->rd_rel->reltoastrelid;
+
+ apply_concurrent_changes(dstate, rel_dst, ident_key,
+ ident_key_nentries, iistate);
+ }
+ PG_FINALLY();
+ {
+ if (rel_src)
+ rel_dst->rd_toastoid = InvalidOid;
+ }
+ PG_END_TRY();
+}
+
+static IndexInsertState *
+get_index_insert_state(Relation relation, Oid ident_index_id)
+{
+ EState *estate;
+ int i;
+ IndexInsertState *result;
+
+ result = (IndexInsertState *) palloc0(sizeof(IndexInsertState));
+ estate = CreateExecutorState();
+ result->econtext = GetPerTupleExprContext(estate);
+
+ result->rri = (ResultRelInfo *) palloc(sizeof(ResultRelInfo));
+ InitResultRelInfo(result->rri, relation, 0, 0, 0);
+ ExecOpenIndices(result->rri, false);
+
+ /*
+ * Find the relcache entry of the identity index so that we spend no extra
+ * effort to open / close it.
+ */
+ for (i = 0; i < result->rri->ri_NumIndices; i++)
+ {
+ Relation ind_rel;
+
+ ind_rel = result->rri->ri_IndexRelationDescs[i];
+ if (ind_rel->rd_id == ident_index_id)
+ result->ident_index = ind_rel;
+ }
+ if (result->ident_index == NULL)
+ elog(ERROR, "failed to find the identity index");
+
+ /* Only initialize fields needed by ExecInsertIndexTuples(). */
+ result->estate = estate;
+
+ return result;
+}
+
+/*
+ * Build scan key to process logical changes.
+ */
+static ScanKey
+build_identity_key(Oid ident_idx_oid, Relation rel_src, int *nentries)
+{
+ Relation ident_idx_rel;
+ Form_pg_index ident_idx;
+ int n,
+ i;
+ ScanKey result;
+
+ Assert(OidIsValid(ident_idx_oid));
+ ident_idx_rel = index_open(ident_idx_oid, AccessShareLock);
+ ident_idx = ident_idx_rel->rd_index;
+ n = ident_idx->indnatts;
+ result = (ScanKey) palloc(sizeof(ScanKeyData) * n);
+ for (i = 0; i < n; i++)
+ {
+ ScanKey entry;
+ int16 relattno;
+ Form_pg_attribute att;
+ Oid opfamily,
+ opcintype,
+ opno,
+ opcode;
+
+ entry = &result[i];
+ relattno = ident_idx->indkey.values[i];
+ if (relattno >= 1)
+ {
+ TupleDesc desc;
+
+ desc = rel_src->rd_att;
+ att = TupleDescAttr(desc, relattno - 1);
+ }
+ else
+ elog(ERROR, "unexpected attribute number %d in index", relattno);
+
+ opfamily = ident_idx_rel->rd_opfamily[i];
+ opcintype = ident_idx_rel->rd_opcintype[i];
+ opno = get_opfamily_member(opfamily, opcintype, opcintype,
+ BTEqualStrategyNumber);
+
+ if (!OidIsValid(opno))
+ elog(ERROR, "could not find equality operator for type %u", opcintype);
+
+ opcode = get_opcode(opno);
+ if (!OidIsValid(opcode))
+ elog(ERROR, "missing oprcode for operator %u", opno);
+
+ /* Initialize everything but argument. */
+ ScanKeyInit(entry,
+ i + 1,
+ BTEqualStrategyNumber, opcode,
+ (Datum) NULL);
+ entry->sk_collation = att->attcollation;
+ }
+ index_close(ident_idx_rel, AccessShareLock);
+
+ *nentries = n;
+ return result;
+}
+
+static void
+free_index_insert_state(IndexInsertState *iistate)
+{
+ ExecCloseIndices(iistate->rri);
+ FreeExecutorState(iistate->estate);
+ pfree(iistate->rri);
+ pfree(iistate);
+}
+
+static void
+cleanup_logical_decoding(LogicalDecodingContext *ctx)
+{
+ ClusterDecodingState *dstate;
+
+ dstate = (ClusterDecodingState *) ctx->output_writer_private;
+
+ ExecDropSingleTupleTableSlot(dstate->tsslot);
+ FreeTupleDesc(dstate->tupdesc_change);
+ FreeTupleDesc(dstate->tupdesc);
+ tuplestore_end(dstate->tstore);
+
+ FreeDecodingContext(ctx);
+}
+
+/*
+ * The final steps of rebuild_relation() for concurrent processing.
+ *
+ * On entry, NewHeap is locked in AccessExclusiveLock mode. OldHeap and its
+ * clustering index (if one is passed) are still locked in a mode that allows
+ * concurrent data changes. On exit, both tables and their indexes are closed,
+ * but locked in AccessExclusiveLock mode.
+ */
+static void
+rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
+ Relation cl_index,
+ CatalogState *cat_state,
+ LogicalDecodingContext *ctx,
+ bool swap_toast_by_content,
+ TransactionId frozenXid,
+ MultiXactId cutoffMulti)
+{
+ LOCKMODE lockmode_old PG_USED_FOR_ASSERTS_ONLY;
+ List *ind_oids_new;
+ Oid old_table_oid = RelationGetRelid(OldHeap);
+ Oid new_table_oid = RelationGetRelid(NewHeap);
+ List *ind_oids_old = RelationGetIndexList(OldHeap);
+ ListCell *lc, *lc2;
+ char relpersistence;
+ bool is_system_catalog;
+ Oid ident_idx_old, ident_idx_new;
+ IndexInsertState *iistate;
+ ScanKey ident_key;
+ int ident_key_nentries;
+ XLogRecPtr wal_insert_ptr, end_of_wal;
+ char dummy_rec_data = '\0';
+ RelReopenInfo *rri = NULL;
+ int nrel;
+ Relation *ind_refs_all, *ind_refs_p;
+
+ /* Like in cluster_rel(). */
+ lockmode_old = LOCK_CLUSTER_CONCURRENT;
+ Assert(CheckRelationLockedByMe(OldHeap, lockmode_old, false));
+ Assert(cl_index == NULL ||
+ CheckRelationLockedByMe(cl_index, lockmode_old, false));
+ /* This is expected from the caller. */
+ Assert(CheckRelationLockedByMe(NewHeap, AccessExclusiveLock, false));
+
+ ident_idx_old = RelationGetReplicaIndex(OldHeap);
+
+ /*
+ * Unlike the exclusive case, we build new indexes for the new relation
+ * rather than swapping the storage and reindexing the old relation. The
+ * point is that the index build can take some time, so we do it before we
+ * get AccessExclusiveLock on the old heap and therefore we cannot swap
+ * the heap storage yet.
+ *
+ * index_create() will lock the new indexes with AccessExclusiveLock on
+ * creation - no need to change that.
+ */
+ ind_oids_new = build_new_indexes(NewHeap, OldHeap, ind_oids_old);
+
+ /*
+ * Processing should not start without a valid identity index.
+ */
+ Assert(OidIsValid(ident_idx_old));
+
+ /* Find "identity index" on the new relation. */
+ ident_idx_new = InvalidOid;
+ forboth(lc, ind_oids_old, lc2, ind_oids_new)
+ {
+ Oid ind_old = lfirst_oid(lc);
+ Oid ind_new = lfirst_oid(lc2);
+
+ if (ident_idx_old == ind_old)
+ {
+ ident_idx_new = ind_new;
+ break;
+ }
+ }
+ if (!OidIsValid(ident_idx_new))
+ /*
+ * Should not happen, given our lock on the old relation.
+ */
+ ereport(ERROR,
+ (errmsg("identity index missing on the new relation")));
+
+ /* Executor state to update indexes. */
+ iistate = get_index_insert_state(NewHeap, ident_idx_new);
+
+ /*
+ * Build scan key that we'll use to look for rows to be updated / deleted
+ * during logical decoding.
+ */
+ ident_key = build_identity_key(ident_idx_new, OldHeap, &ident_key_nentries);
+
+ /*
+ * Flush all WAL records inserted so far (possibly except for the last
+ * incomplete page, see GetInsertRecPtr), to minimize the amount of data
+ * we need to flush while holding exclusive lock on the source table.
+ */
+ wal_insert_ptr = GetInsertRecPtr();
+ XLogFlush(wal_insert_ptr);
+ end_of_wal = GetFlushRecPtr(NULL);
+
+ /*
+ * Apply concurrent changes first time, to minimize the time we need to
+ * hold AccessExclusiveLock. (Quite some amount of WAL could have been
+ * written during the data copying and index creation.)
+ */
+ process_concurrent_changes(ctx, end_of_wal, NewHeap,
+ swap_toast_by_content ? OldHeap : NULL,
+ ident_key, ident_key_nentries, iistate);
+
+ /*
+ * Release the locks that allowed concurrent data changes, in order to
+ * acquire the AccessExclusiveLock.
+ */
+ nrel = 0;
+ /*
+ * We unlock the old relation (and its clustering index), but then we will
+ * lock the relation and *all* its indexes because we want to swap their
+ * storage.
+ *
+ * (NewHeap is already locked, as well as its indexes.)
+ */
+ rri = palloc_array(RelReopenInfo, 1 + list_length(ind_oids_old));
+ init_rel_reopen_info(&rri[nrel++], &OldHeap, InvalidOid,
+ LOCK_CLUSTER_CONCURRENT, AccessExclusiveLock);
+ /* References to the re-opened indexes will be stored in this array. */
+ ind_refs_all = palloc_array(Relation, list_length(ind_oids_old));
+ ind_refs_p = ind_refs_all;
+ /* The clustering index is a special case. */
+ if (cl_index)
+ {
+ *ind_refs_p = cl_index;
+ init_rel_reopen_info(&rri[nrel], ind_refs_p, InvalidOid,
+ LOCK_CLUSTER_CONCURRENT, AccessExclusiveLock);
+ nrel++;
+ ind_refs_p++;
+ }
+ /*
+ * Also initialize the entries for the other indexes (currently unlocked)
+ * because we will have to lock them.
+ */
+ */
+ foreach(lc, ind_oids_old)
+ {
+ Oid ind_oid;
+
+ ind_oid = lfirst_oid(lc);
+ /* Clustering index is already in the array, or there is none. */
+ if (cl_index && RelationGetRelid(cl_index) == ind_oid)
+ continue;
+
+ Assert(nrel < (1 + list_length(ind_oids_old)));
+
+ *ind_refs_p = NULL;
+ init_rel_reopen_info(&rri[nrel],
+ /*
+ * In this special case we do not have the
+ * relcache reference, use OID instead.
+ */
+ ind_refs_p,
+ ind_oid,
+ NoLock, /* Nothing to unlock. */
+ AccessExclusiveLock);
+
+ nrel++;
+ ind_refs_p++;
+ }
+ /* Perform the actual unlocking and re-locking. */
+ unlock_and_close_relations(rri, nrel);
+ reopen_relations(rri, nrel);
+
+ /*
+ * In addition, lock the OldHeap's TOAST relation that we skipped for the
+ * CONCURRENTLY option in copy_table_data(). This lock will be needed to
+ * swap the relation files.
+ */
+ if (OidIsValid(OldHeap->rd_rel->reltoastrelid))
+ LockRelationOid(OldHeap->rd_rel->reltoastrelid, AccessExclusiveLock);
+
+ /*
+ * Check if the new indexes match the old ones, i.e. no changes occurred
+ * while OldHeap was unlocked.
+ *
+ * XXX It's probably not necessary to check the relation tuple descriptor
+ * here because the logical decoding was already active when we released
+ * the lock, and thus the corresponding data changes won't be lost.
+ * However, processing those changes might take a lot of time.
+ */
+ check_catalog_changes(OldHeap, cat_state);
+
+ /*
+ * Tuples and pages of the old heap will be gone, but the heap will stay.
+ */
+ TransferPredicateLocksToHeapRelation(OldHeap);
+ /* The same for indexes. */
+ for (int i = 0; i < (nrel - 1); i++)
+ {
+ Relation index = ind_refs_all[i];
+
+ TransferPredicateLocksToHeapRelation(index);
+
+ /*
+ * References to indexes on the old relation are not needed anymore;
+ * however, the locks stay until the end of the transaction.
+ */
+ index_close(index, NoLock);
+ }
+ pfree(ind_refs_all);
+
+ /*
+ * Flush anything we see in WAL, to make sure that all changes committed
+ * while we were waiting for the exclusive lock are available for
+ * decoding. This should not be necessary if all backends had
+ * synchronous_commit set, but we can't rely on this setting.
+ *
+ * Unfortunately, GetInsertRecPtr() may lag behind the actual insert
+ * position, and GetLastImportantRecPtr() points at the start of the last
+ * record rather than at the end. Thus the simplest way to determine the
+ * insert position is to insert a dummy record and use its LSN.
+ *
+ * XXX Consider using GetLastImportantRecPtr() and adding the size of the
+ * last record (plus the total size of all the page headers the record
+ * spans)?
+ */
+ XLogBeginInsert();
+ XLogRegisterData(&dummy_rec_data, 1);
+ wal_insert_ptr = XLogInsert(RM_XLOG_ID, XLOG_NOOP);
+ XLogFlush(wal_insert_ptr);
+ end_of_wal = GetFlushRecPtr(NULL);
+
+ /* Apply the concurrent changes again. */
+ process_concurrent_changes(ctx, end_of_wal, NewHeap,
+ swap_toast_by_content ? OldHeap : NULL,
+ ident_key, ident_key_nentries, iistate);
+
+ /* Remember info about rel before closing OldHeap */
+ relpersistence = OldHeap->rd_rel->relpersistence;
+ is_system_catalog = IsSystemRelation(OldHeap);
+
+ pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
+ PROGRESS_CLUSTER_PHASE_SWAP_REL_FILES);
+
+ forboth(lc, ind_oids_old, lc2, ind_oids_new)
+ {
+ Oid ind_old = lfirst_oid(lc);
+ Oid ind_new = lfirst_oid(lc2);
+ Oid mapped_tables[4];
+
+ /* Zero out possible results from swap_relation_files */
+ memset(mapped_tables, 0, sizeof(mapped_tables));
+
+ swap_relation_files(ind_old, ind_new,
+ (old_table_oid == RelationRelationId),
+ swap_toast_by_content,
+ true,
+ InvalidTransactionId,
+ InvalidMultiXactId,
+ mapped_tables);
+
+#ifdef USE_ASSERT_CHECKING
+ /*
+ * Concurrent processing is not supported for system relations, so
+ * there should be no mapped tables.
+ */
+ for (int i = 0; i < 4; i++)
+ Assert(mapped_tables[i] == 0);
+#endif
+ }
+
+ /* The new indexes must be visible for deletion. */
+ CommandCounterIncrement();
+
+ /* Close the old heap but keep lock until transaction commit. */
+ table_close(OldHeap, NoLock);
+ /* Close the new heap. (We didn't have to open its indexes.) */
+ table_close(NewHeap, NoLock);
+
+ /* Cleanup what we don't need anymore. (And close the identity index.) */
+ pfree(ident_key);
+ free_index_insert_state(iistate);
+
+ /*
+ * Swap the relations and their TOAST relations and TOAST indexes. This
+ * also drops the new relation and its indexes.
+ *
+ * (System catalogs are currently not supported.)
+ */
+ Assert(!is_system_catalog);
+ finish_heap_swap(old_table_oid, new_table_oid,
+ is_system_catalog,
+ swap_toast_by_content,
+ false, true, false,
+ frozenXid, cutoffMulti,
+ relpersistence);
+
+ pfree(rri);
+}
+
+/*
+ * Build indexes on NewHeap according to those on OldHeap.
+ *
+ * OldIndexes is the list of index OIDs on OldHeap.
+ *
+ * A list of OIDs of the corresponding indexes created on NewHeap is
+ * returned. The order of items matches, so these two lists can be used to
+ * swap the index storage.
+ */
+static List *
+build_new_indexes(Relation NewHeap, Relation OldHeap, List *OldIndexes)
+{
+ StringInfo ind_name;
+ ListCell *lc;
+ List *result = NIL;
+
+ pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
+ PROGRESS_CLUSTER_PHASE_REBUILD_INDEX);
+
+ ind_name = makeStringInfo();
+
+ foreach(lc, OldIndexes)
+ {
+ Oid ind_oid,
+ ind_oid_new,
+ tbsp_oid;
+ Relation ind;
+ IndexInfo *ind_info;
+ int i,
+ heap_col_id;
+ List *colnames;
+ int16 indnatts;
+ Oid *collations,
+ *opclasses;
+ HeapTuple tup;
+ bool isnull;
+ Datum d;
+ oidvector *oidvec;
+ int2vector *int2vec;
+ size_t oid_arr_size;
+ size_t int2_arr_size;
+ int16 *indoptions;
+ text *reloptions = NULL;
+ bits16 flags;
+ Datum *opclassOptions;
+ NullableDatum *stattargets;
+
+ ind_oid = lfirst_oid(lc);
+ ind = index_open(ind_oid, AccessShareLock);
+ ind_info = BuildIndexInfo(ind);
+
+ tbsp_oid = ind->rd_rel->reltablespace;
+ /*
+ * Index names don't really matter, as we will eventually use only their
+ * storage. Just make them unique within the table.
+ */
+ resetStringInfo(ind_name);
+ appendStringInfo(ind_name, "ind_%d",
+ list_cell_number(OldIndexes, lc));
+
+ flags = 0;
+ if (ind->rd_index->indisprimary)
+ flags |= INDEX_CREATE_IS_PRIMARY;
+
+ colnames = NIL;
+ indnatts = ind->rd_index->indnatts;
+ oid_arr_size = sizeof(Oid) * indnatts;
+ int2_arr_size = sizeof(int16) * indnatts;
+
+ collations = (Oid *) palloc(oid_arr_size);
+ for (i = 0; i < indnatts; i++)
+ {
+ char *colname;
+
+ heap_col_id = ind->rd_index->indkey.values[i];
+ if (heap_col_id > 0)
+ {
+ Form_pg_attribute att;
+
+ /* Normal attribute. */
+ att = TupleDescAttr(OldHeap->rd_att, heap_col_id - 1);
+ colname = pstrdup(NameStr(att->attname));
+ collations[i] = att->attcollation;
+ }
+ else if (heap_col_id == 0)
+ {
+ HeapTuple tuple;
+ Form_pg_attribute att;
+
+ /*
+ * An expression column is not present in the relcache. What we need
+ * here is an attribute of the *index* relation.
+ */
+ tuple = SearchSysCache2(ATTNUM,
+ ObjectIdGetDatum(ind_oid),
+ Int16GetDatum(i + 1));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR,
+ "cache lookup failed for attribute %d of relation %u",
+ i + 1, ind_oid);
+ att = (Form_pg_attribute) GETSTRUCT(tuple);
+ colname = pstrdup(NameStr(att->attname));
+ collations[i] = att->attcollation;
+ ReleaseSysCache(tuple);
+ }
+ else
+ elog(ERROR, "unexpected column number: %d",
+ heap_col_id);
+
+ colnames = lappend(colnames, colname);
+ }
+
+ /*
+ * Special effort needed for variable length attributes of
+ * Form_pg_index.
+ */
+ tup = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(ind_oid));
+ if (!HeapTupleIsValid(tup))
+ elog(ERROR, "cache lookup failed for index %u", ind_oid);
+ d = SysCacheGetAttr(INDEXRELID, tup, Anum_pg_index_indclass, &isnull);
+ Assert(!isnull);
+ oidvec = (oidvector *) DatumGetPointer(d);
+ opclasses = (Oid *) palloc(oid_arr_size);
+ memcpy(opclasses, oidvec->values, oid_arr_size);
+
+ d = SysCacheGetAttr(INDEXRELID, tup, Anum_pg_index_indoption,
+ &isnull);
+ Assert(!isnull);
+ int2vec = (int2vector *) DatumGetPointer(d);
+ indoptions = (int16 *) palloc(int2_arr_size);
+ memcpy(indoptions, int2vec->values, int2_arr_size);
+ ReleaseSysCache(tup);
+
+ tup = SearchSysCache1(RELOID, ObjectIdGetDatum(ind_oid));
+ if (!HeapTupleIsValid(tup))
+ elog(ERROR, "cache lookup failed for index relation %u", ind_oid);
+ d = SysCacheGetAttr(RELOID, tup, Anum_pg_class_reloptions, &isnull);
+ reloptions = !isnull ? DatumGetTextPCopy(d) : NULL;
+ ReleaseSysCache(tup);
+
+ opclassOptions = palloc0(sizeof(Datum) * ind_info->ii_NumIndexAttrs);
+ for (i = 0; i < ind_info->ii_NumIndexAttrs; i++)
+ opclassOptions[i] = get_attoptions(ind_oid, i + 1);
+
+ stattargets = get_index_stattargets(ind_oid, ind_info);
+
+ /*
+ * Neither parentIndexRelid nor parentConstraintId needs to be passed
+ * since the new catalog entries (pg_constraint, pg_inherits) would
+ * eventually be dropped. Therefore there's no need to record valid
+ * dependency on parents.
+ */
+ ind_oid_new = index_create(NewHeap,
+ ind_name->data,
+ InvalidOid,
+ InvalidOid, /* parentIndexRelid */
+ InvalidOid, /* parentConstraintId */
+ InvalidOid,
+ ind_info,
+ colnames,
+ ind->rd_rel->relam,
+ tbsp_oid,
+ collations,
+ opclasses,
+ opclassOptions,
+ indoptions,
+ stattargets,
+ PointerGetDatum(reloptions),
+ flags, /* flags */
+ 0, /* constr_flags */
+ false, /* allow_system_table_mods */
+ false, /* is_internal */
+ NULL /* constraintId */
+ );
+ result = lappend_oid(result, ind_oid_new);
+
+ index_close(ind, AccessShareLock);
+ list_free_deep(colnames);
+ pfree(collations);
+ pfree(opclasses);
+ pfree(indoptions);
+ if (reloptions)
+ pfree(reloptions);
+ }
+
+ return result;
+}
+
+static void
+init_rel_reopen_info(RelReopenInfo *rri, Relation *rel_p, Oid relid,
+ LOCKMODE lockmode_orig, LOCKMODE lockmode_new)
+{
+ rri->rel_p = rel_p;
+ rri->relid = relid;
+ rri->lockmode_orig = lockmode_orig;
+ rri->lockmode_new = lockmode_new;
+}
+
+/*
+ * Unlock and close relations specified by items of the 'rels' array. 'nrel'
+ * is the number of items.
+ *
+ * Information needed to (re)open the relations (or to issue meaningful ERROR)
+ * is added to the array items.
+ */
+static void
+unlock_and_close_relations(RelReopenInfo *rels, int nrel)
+{
+ int i;
+ RelReopenInfo *rri;
+
+ /*
+ * First, retrieve the information that we will need for re-opening.
+ *
+ * We could close (and unlock) each relation as soon as we have gathered
+ * the related information, but then we would have to be careful not to
+ * unlock the table until we have the info on all its indexes. (Once we
+ * unlock the table, any index can be dropped, and thus we can fail to get
+ * the name we want to report if re-opening fails.) It seems simpler to
+ * separate the work into two iterations.
+ */
+ for (i = 0; i < nrel; i++)
+ {
+ Relation rel;
+
+ rri = &rels[i];
+ rel = *rri->rel_p;
+
+ if (rel)
+ {
+ Assert(CheckRelationLockedByMe(rel, rri->lockmode_orig, false));
+ Assert(!OidIsValid(rri->relid));
+
+ rri->relid = RelationGetRelid(rel);
+ rri->relkind = rel->rd_rel->relkind;
+ rri->relname = pstrdup(RelationGetRelationName(rel));
+ }
+ else
+ {
+ Assert(OidIsValid(rri->relid));
+
+ rri->relname = get_rel_name(rri->relid);
+ rri->relkind = get_rel_relkind(rri->relid);
+ }
+ }
+
+ /* Second, close the relations. */
+ for (i = 0; i < nrel; i++)
+ {
+ Relation rel;
+
+ rri = &rels[i];
+ rel = *rri->rel_p;
+
+ /* Close the relation if the caller passed one. */
+ if (rel)
+ {
+ if (rri->relkind == RELKIND_RELATION)
+ table_close(rel, rri->lockmode_orig);
+ else
+ {
+ Assert(rri->relkind == RELKIND_INDEX);
+
+ index_close(rel, rri->lockmode_orig);
+ }
+ }
+ }
+}
+
+/*
+ * Re-open the relations closed previously by unlock_and_close_relations().
+ */
+static void
+reopen_relations(RelReopenInfo *rels, int nrel)
+{
+ for (int i = 0; i < nrel; i++)
+ {
+ RelReopenInfo *rri = &rels[i];
+ Relation rel;
+
+ if (rri->relkind == RELKIND_RELATION)
+ {
+ rel = try_table_open(rri->relid, rri->lockmode_new);
+ }
+ else
+ {
+ Assert(rri->relkind == RELKIND_INDEX);
+
+ rel = try_index_open(rri->relid, rri->lockmode_new);
+ }
+
+ if (rel == NULL)
+ {
+ const char *kind_str;
+
+ kind_str = (rri->relkind == RELKIND_RELATION) ? "table" : "index";
+ ereport(ERROR,
+ (errmsg("could not open %s \"%s\"", kind_str,
+ rri->relname),
+ errhint("The %s could have been dropped by another transaction.",
+ kind_str)));
+ }
+ *rri->rel_p = rel;
+
+ pfree(rri->relname);
+ }
+}
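As an aside, the reason unlock_and_close_relations() gathers names and OIDs in a first pass before closing anything can be illustrated with a toy version of the same two-phase pattern. Everything below (MiniRel, capture_info, reopen) is a hypothetical stand-in for the relcache and lock manager machinery, not part of the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Simplified stand-in for RelReopenInfo: remember enough to reopen a
 * relation (and to issue a meaningful error) after unlocking it.
 */
typedef struct
{
    int         relid;          /* captured while still locked */
    char        relname[64];    /* captured while still locked, for errors */
    bool        open;
} MiniRel;

/* Phase 1: gather the OID and name while the relation is still locked. */
static void
capture_info(MiniRel *rel, int relid, const char *name)
{
    rel->relid = relid;
    snprintf(rel->relname, sizeof(rel->relname), "%s", name);
    rel->open = true;
}

/* Phase 2 (later): try to reopen by OID; the saved name survives a drop. */
static bool
reopen(MiniRel *rel, bool still_exists)
{
    if (!still_exists)
    {
        /* The name is still available even though the relation is gone. */
        fprintf(stderr, "could not open \"%s\"\n", rel->relname);
        return false;
    }
    rel->open = true;
    return true;
}
```

The point is simply that once the lock is released the relation (or any of its indexes) may be dropped, so the name used in the error message has to be saved while it is still safe to read.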
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 8eaf951cc1..b9b72d723e 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -905,7 +905,7 @@ refresh_by_match_merge(Oid matviewOid, Oid tempOid, Oid relowner,
static void
refresh_by_heap_swap(Oid matviewOid, Oid OIDNewHeap, char relpersistence)
{
- finish_heap_swap(matviewOid, OIDNewHeap, false, false, true, true,
+ finish_heap_swap(matviewOid, OIDNewHeap, false, false, true, true, true,
RecentXmin, ReadNextMultiXactId(), relpersistence);
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index cd6125ba39..8bd51b71a3 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4409,6 +4409,16 @@ AlterTableInternal(Oid relid, List *cmds, bool recurse)
rel = relation_open(relid, lockmode);
+ /*
+ * If lockmode allows, check if VACUUM FULL / CLUSTER CONCURRENTLY is in
+ * progress. If lockmode is too weak, cluster_rel() should detect
+ * incompatible DDLs executed by us.
+ *
+ * XXX We might skip the changes for DDLs which do not change the tuple
+ * descriptor.
+ */
+ check_for_concurrent_cluster(relid, lockmode);
+
EventTriggerAlterTableRelid(relid);
ATController(NULL, rel, cmds, recurse, lockmode, NULL);
@@ -5859,6 +5869,7 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
finish_heap_swap(tab->relid, OIDNewHeap,
false, false, true,
!OidIsValid(tab->newTableSpace),
+ true,
RecentXmin,
ReadNextMultiXactId(),
persistence);
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index d000422800..94dbe5d30a 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -112,7 +112,8 @@ static void vac_truncate_clog(TransactionId frozenXID,
TransactionId lastSaneFrozenXid,
MultiXactId lastSaneMinMulti);
static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
- BufferAccessStrategy bstrategy);
+ BufferAccessStrategy bstrategy, bool isTopLevel,
+ bool whole_database);
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
@@ -153,6 +154,7 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
bool analyze = false;
bool freeze = false;
bool full = false;
+ bool concurrent = false;
bool disable_page_skipping = false;
bool process_main = true;
bool process_toast = true;
@@ -226,6 +228,8 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
freeze = defGetBoolean(opt);
else if (strcmp(opt->defname, "full") == 0)
full = defGetBoolean(opt);
+ else if (strcmp(opt->defname, "concurrently") == 0)
+ concurrent = defGetBoolean(opt);
else if (strcmp(opt->defname, "disable_page_skipping") == 0)
disable_page_skipping = defGetBoolean(opt);
else if (strcmp(opt->defname, "index_cleanup") == 0)
@@ -300,7 +304,7 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
(skip_locked ? VACOPT_SKIP_LOCKED : 0) |
(analyze ? VACOPT_ANALYZE : 0) |
(freeze ? VACOPT_FREEZE : 0) |
- (full ? VACOPT_FULL : 0) |
+ (full ? (concurrent ? VACOPT_FULL_CONCURRENT : VACOPT_FULL_EXCLUSIVE) : 0) |
(disable_page_skipping ? VACOPT_DISABLE_PAGE_SKIPPING : 0) |
(process_main ? VACOPT_PROCESS_MAIN : 0) |
(process_toast ? VACOPT_PROCESS_TOAST : 0) |
@@ -380,6 +384,12 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
errmsg("ONLY_DATABASE_STATS cannot be specified with other VACUUM options")));
}
+ /* This mistake cannot be detected from the option flags alone. */
+ if (concurrent && !full)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("CONCURRENTLY can only be specified with VACUUM FULL")));
+
/*
* All freeze ages are zero if the FREEZE option is given; otherwise pass
* them as -1 which means to use the default values.
@@ -483,6 +493,7 @@ vacuum(List *relations, VacuumParams *params, BufferAccessStrategy bstrategy,
const char *stmttype;
volatile bool in_outer_xact,
use_own_xacts;
+ bool whole_database = false;
Assert(params != NULL);
@@ -543,7 +554,15 @@ vacuum(List *relations, VacuumParams *params, BufferAccessStrategy bstrategy,
relations = newrels;
}
else
+ {
relations = get_all_vacuum_rels(vac_context, params->options);
+ /*
+ * If all tables should be processed, the CONCURRENTLY option implies
+ * that we should skip system relations rather than raising ERRORs.
+ */
+ if (params->options & VACOPT_FULL_CONCURRENT)
+ whole_database = true;
+ }
/*
* Decide whether we need to start/commit our own transactions.
@@ -616,7 +635,8 @@ vacuum(List *relations, VacuumParams *params, BufferAccessStrategy bstrategy,
if (params->options & VACOPT_VACUUM)
{
- if (!vacuum_rel(vrel->oid, vrel->relation, params, bstrategy))
+ if (!vacuum_rel(vrel->oid, vrel->relation, params, bstrategy,
+ isTopLevel, whole_database))
continue;
}
@@ -1954,10 +1974,14 @@ vac_truncate_clog(TransactionId frozenXID,
/*
* vacuum_rel() -- vacuum one heap relation
*
- * relid identifies the relation to vacuum. If relation is supplied,
- * use the name therein for reporting any failure to open/lock the rel;
- * do not use it once we've successfully opened the rel, since it might
- * be stale.
+ * relid identifies the relation to vacuum. If relation is supplied, use
+ * the name therein for reporting any failure to open/lock the rel; do
+ * not use it once we've successfully opened the rel, since it might be
+ * stale.
+ *
+ * If whole_database is true, we are processing all the relations of the
+ * current database. In that case we might need to silently skip
+ * relations which would otherwise cause an ERROR.
*
* Returns true if it's okay to proceed with a requested ANALYZE
* operation on this table.
@@ -1972,7 +1996,8 @@ vac_truncate_clog(TransactionId frozenXID,
*/
static bool
vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
- BufferAccessStrategy bstrategy)
+ BufferAccessStrategy bstrategy, bool isTopLevel,
+ bool whole_database)
{
LOCKMODE lmode;
Relation rel;
@@ -2035,10 +2060,11 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
/*
* Determine the type of lock we want --- hard exclusive lock for a FULL
- * vacuum, but just ShareUpdateExclusiveLock for concurrent vacuum. Either
- * way, we can be sure that no other backend is vacuuming the same table.
+ * exclusive vacuum, but a weaker lock (ShareUpdateExclusiveLock) for
+ * concurrent vacuum. Either way, we can be sure that no other backend is
+ * vacuuming the same table.
*/
- lmode = (params->options & VACOPT_FULL) ?
+ lmode = (params->options & VACOPT_FULL_EXCLUSIVE) ?
AccessExclusiveLock : ShareUpdateExclusiveLock;
/* open the relation and get the appropriate lock on it */
@@ -2053,6 +2079,39 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
return false;
}
+ /*
+ * Leave if the CONCURRENTLY option was passed, but the relation is not
+ * suitable for that. Note that we only skip such relations if the user
+ * wants to vacuum the whole database. In contrast, if the user specified
+ * inappropriate relation(s) explicitly, the command ends up with an
+ * ERROR.
+ */
+ if (whole_database && (params->options & VACOPT_FULL_CONCURRENT) &&
+ !check_relation_is_clusterable_concurrently(rel, DEBUG1,
+ "VACUUM (FULL, CONCURRENTLY)"))
+ {
+ relation_close(rel, lmode);
+ PopActiveSnapshot();
+ CommitTransactionCommand();
+ return false;
+ }
+
+ /*
+ * Skip the relation if VACUUM FULL / CLUSTER CONCURRENTLY is in progress
+ * as it will drop the current storage of the relation.
+ *
+ * This check should not take place until we have a lock that prevents
+ * another backend from starting VACUUM FULL / CLUSTER CONCURRENTLY later.
+ */
+ Assert(lmode >= LOCK_CLUSTER_CONCURRENT);
+ if (is_concurrent_cluster_in_progress(relid))
+ {
+ relation_close(rel, lmode);
+ PopActiveSnapshot();
+ CommitTransactionCommand();
+ return false;
+ }
+
/*
* When recursing to a TOAST table, check privileges on the parent. NB:
* This is only safe to do because we hold a session lock on the main
@@ -2126,19 +2185,6 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
return true;
}
- /*
- * Get a session-level lock too. This will protect our access to the
- * relation across multiple transactions, so that we can vacuum the
- * relation's TOAST table (if any) secure in the knowledge that no one is
- * deleting the parent relation.
- *
- * NOTE: this cannot block, even if someone else is waiting for access,
- * because the lock manager knows that both lock requests are from the
- * same process.
- */
- lockrelid = rel->rd_lockInfo.lockRelId;
- LockRelationIdForSession(&lockrelid, lmode);
-
/*
* Set index_cleanup option based on index_cleanup reloption if it wasn't
* specified in VACUUM command, or when running in an autovacuum worker
@@ -2191,6 +2237,30 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
else
toast_relid = InvalidOid;
+ /*
+ * Get a session-level lock too. This will protect our access to the
+ * relation across multiple transactions, so that we can vacuum the
+ * relation's TOAST table (if any) secure in the knowledge that no one is
+ * deleting the parent relation.
+ *
+ * NOTE: this cannot block, even if someone else is waiting for access,
+ * because the lock manager knows that both lock requests are from the
+ * same process.
+ */
+ if (OidIsValid(toast_relid))
+ {
+ /*
+ * You might worry that, in the VACUUM (FULL, CONCURRENTLY) case,
+ * cluster_rel() needs to release all the locks on the relation at
+ * some point, but this session lock makes it impossible. In fact,
+ * cluster_rel() will eventually be called for the TOAST relation
+ * and raise ERROR because, in the concurrent mode, it cannot process
+ * TOAST relation alone anyway.
+ */
+ lockrelid = rel->rd_lockInfo.lockRelId;
+ LockRelationIdForSession(&lockrelid, lmode);
+ }
+
/*
* Switch to the table owner's userid, so that any index functions are run
* as that user. Also lock down security-restricted operations and
@@ -2218,11 +2288,22 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
{
ClusterParams cluster_params = {0};
+ /*
+ * Invalid toast_relid means that there is no session lock on the
+ * relation. Such a lock would be a problem because it would
+ * prevent cluster_rel() from releasing all locks when it tries to
+ * get AccessExclusiveLock.
+ */
+ Assert(!OidIsValid(toast_relid));
+
if ((params->options & VACOPT_VERBOSE) != 0)
cluster_params.options |= CLUOPT_VERBOSE;
+ if ((params->options & VACOPT_FULL_CONCURRENT) != 0)
+ cluster_params.options |= CLUOPT_CONCURRENT;
+
/* VACUUM FULL is now a variant of CLUSTER; see cluster.c */
- cluster_rel(rel, InvalidOid, &cluster_params);
+ cluster_rel(rel, InvalidOid, &cluster_params, isTopLevel);
/*
* cluster_rel() should have closed the relation, lock is kept
@@ -2271,13 +2352,15 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
toast_vacuum_params.options |= VACOPT_PROCESS_MAIN;
toast_vacuum_params.toast_parent = relid;
- vacuum_rel(toast_relid, NULL, &toast_vacuum_params, bstrategy);
+ vacuum_rel(toast_relid, NULL, &toast_vacuum_params, bstrategy,
+ isTopLevel, whole_database);
}
/*
* Now release the session-level lock on the main table.
*/
- UnlockRelationIdForSession(&lockrelid, lmode);
+ if (OidIsValid(toast_relid))
+ UnlockRelationIdForSession(&lockrelid, lmode);
/* Report that we really did it. */
return true;
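The option plumbing above splits VACOPT_FULL into an exclusive and a concurrent variant and picks the initial lock strength from them. A compact sketch of just that bit logic, with illustrative constant values and a toy lock-mode enum standing in for the real definitions in vacuum.h and lockdefs.h:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative values; the real VACOPT_* flags are defined in vacuum.h. */
#define VACOPT_FULL_EXCLUSIVE  0x01
#define VACOPT_FULL_CONCURRENT 0x02

typedef enum
{
    ShareUpdateExclusiveLock,
    AccessExclusiveLock
} MiniLockMode;

/*
 * Mirrors the flag construction in ExecVacuum(): plain FULL maps to the
 * exclusive variant, FULL + CONCURRENTLY to the concurrent one, and
 * CONCURRENTLY alone is rejected with an ERROR before this point.
 */
static int
full_flags(bool full, bool concurrent)
{
    return full ? (concurrent ? VACOPT_FULL_CONCURRENT
                   : VACOPT_FULL_EXCLUSIVE) : 0;
}

/*
 * Mirrors the lock choice in vacuum_rel(): only the exclusive variant
 * takes AccessExclusiveLock up front; the concurrent one starts with
 * ShareUpdateExclusiveLock and upgrades at swap time.
 */
static MiniLockMode
lock_for(int options)
{
    return (options & VACOPT_FULL_EXCLUSIVE) ?
        AccessExclusiveLock : ShareUpdateExclusiveLock;
}
```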
diff --git a/src/backend/meson.build b/src/backend/meson.build
index 78c5726814..0f9141a4ac 100644
--- a/src/backend/meson.build
+++ b/src/backend/meson.build
@@ -194,5 +194,6 @@ pg_test_mod_args = pg_mod_args + {
subdir('jit/llvm')
subdir('replication/libpqwalreceiver')
subdir('replication/pgoutput')
+subdir('replication/pgoutput_cluster')
subdir('snowball')
subdir('utils/mb/conversion_procs')
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index d687ceee33..752deb39f7 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -33,6 +33,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecord.h"
#include "catalog/pg_control.h"
+#include "commands/cluster.h"
#include "replication/decode.h"
#include "replication/logical.h"
#include "replication/message.h"
@@ -467,6 +468,29 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
TransactionId xid = XLogRecGetXid(buf->record);
SnapBuild *builder = ctx->snapshot_builder;
+ /*
+ * Check if CLUSTER CONCURRENTLY is being performed by this backend. If
+ * so, only decode data changes of the table that it is processing, and
+ * the changes of its TOAST relation.
+ *
+ * (The TOAST locator should not be set unless the main one is.)
+ */
+ Assert(!OidIsValid(clustered_rel_toast_locator.relNumber) ||
+ OidIsValid(clustered_rel_locator.relNumber));
+
+ if (OidIsValid(clustered_rel_locator.relNumber))
+ {
+ XLogReaderState *r = buf->record;
+ RelFileLocator locator;
+
+ /* Not all records contain the block. */
+ if (XLogRecGetBlockTagExtended(r, 0, &locator, NULL, NULL, NULL) &&
+ !RelFileLocatorEquals(locator, clustered_rel_locator) &&
+ (!OidIsValid(clustered_rel_toast_locator.relNumber) ||
+ !RelFileLocatorEquals(locator, clustered_rel_toast_locator)))
+ return;
+ }
+
ReorderBufferProcessXid(ctx->reorder, xid, buf->origptr);
/*
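The new filter in heap_decode() boils down to a small predicate over relfile locators: skip any heap record whose block belongs neither to the relation being clustered nor to its TOAST relation. A simplified version, with a one-field MiniLocator standing in for RelFileLocator (hypothetical names, not part of the patch):

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified RelFileLocator: only the relfilenumber matters here. */
typedef struct
{
    unsigned    rel_number;
} MiniLocator;

#define InvalidNumber 0

static bool
locator_equals(MiniLocator a, MiniLocator b)
{
    return a.rel_number == b.rel_number;
}

/*
 * Mirrors the early-exit test in heap_decode(): while CLUSTER
 * CONCURRENTLY runs in this backend, decode only the changes of the
 * clustered relation and of its TOAST relation.
 */
static bool
skip_record(MiniLocator clustered, MiniLocator toast, MiniLocator rec)
{
    if (clustered.rel_number == InvalidNumber)
        return false;           /* no CLUSTER CONCURRENTLY in progress */

    if (locator_equals(rec, clustered))
        return false;           /* change of the clustered relation */
    if (toast.rel_number != InvalidNumber && locator_equals(rec, toast))
        return false;           /* change of its TOAST relation */

    return true;                /* some other table: ignore its changes */
}
```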
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 4923e35e92..4492e2ae46 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -625,6 +625,26 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
return SnapBuildMVCCFromHistoric(snap, true);
}
+/*
+ * Build an MVCC snapshot for the initial data load performed by the CLUSTER
+ * CONCURRENTLY command.
+ *
+ * The snapshot will only be used to scan one particular relation, which is
+ * treated like a catalog (therefore ->building_full_snapshot is not
+ * important), and the caller should already have a replication slot set up
+ * (so we do not set MyProc->xmin). XXX Do we still need to add some
+ * restrictions?
+ */
+Snapshot
+SnapBuildInitialSnapshotForCluster(SnapBuild *builder)
+{
+ Snapshot snap;
+
+ Assert(builder->state == SNAPBUILD_CONSISTENT);
+
+ snap = SnapBuildBuildSnapshot(builder);
+ return SnapBuildMVCCFromHistoric(snap, false);
+}
+
/*
* Turn a historic MVCC snapshot into an ordinary MVCC snapshot.
*
diff --git a/src/backend/replication/pgoutput_cluster/Makefile b/src/backend/replication/pgoutput_cluster/Makefile
new file mode 100644
index 0000000000..31471bb546
--- /dev/null
+++ b/src/backend/replication/pgoutput_cluster/Makefile
@@ -0,0 +1,32 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+# Makefile for src/backend/replication/pgoutput_cluster
+#
+# IDENTIFICATION
+# src/backend/replication/pgoutput_cluster
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/replication/pgoutput_cluster
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+ $(WIN32RES) \
+ pgoutput_cluster.o
+PGFILEDESC = "pgoutput_cluster - logical replication output plugin for CLUSTER command"
+NAME = pgoutput_cluster
+
+all: all-shared-lib
+
+include $(top_srcdir)/src/Makefile.shlib
+
+install: all installdirs install-lib
+
+installdirs: installdirs-lib
+
+uninstall: uninstall-lib
+
+clean distclean: clean-lib
+ rm -f $(OBJS)
diff --git a/src/backend/replication/pgoutput_cluster/meson.build b/src/backend/replication/pgoutput_cluster/meson.build
new file mode 100644
index 0000000000..0f033064f2
--- /dev/null
+++ b/src/backend/replication/pgoutput_cluster/meson.build
@@ -0,0 +1,18 @@
+# Copyright (c) 2022-2024, PostgreSQL Global Development Group
+
+pgoutput_cluster_sources = files(
+ 'pgoutput_cluster.c',
+)
+
+if host_system == 'windows'
+ pgoutput_cluster_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pgoutput_cluster',
+ '--FILEDESC', 'pgoutput_cluster - logical replication output plugin for CLUSTER command',])
+endif
+
+pgoutput_cluster = shared_module('pgoutput_cluster',
+ pgoutput_cluster_sources,
+ kwargs: pg_mod_args,
+)
+
+backend_targets += pgoutput_cluster
diff --git a/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c b/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
new file mode 100644
index 0000000000..c6baca1171
--- /dev/null
+++ b/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
@@ -0,0 +1,277 @@
+/* TODO Move into src/backend/cluster/ (and rename?) */
+/*-------------------------------------------------------------------------
+ *
+ * pgoutput_cluster.c
+ * Logical Replication output plugin for CLUSTER command
+ *
+ * Copyright (c) 2012-2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heaptoast.h"
+#include "commands/cluster.h"
+#include "replication/snapbuild.h"
+
+PG_MODULE_MAGIC;
+
+static void plugin_startup(LogicalDecodingContext *ctx,
+ OutputPluginOptions *opt, bool is_init);
+static void plugin_shutdown(LogicalDecodingContext *ctx);
+static void plugin_begin_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn);
+static void plugin_commit_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ Relation rel, ReorderBufferChange *change);
+static void plugin_truncate(struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, int nrelations,
+ Relation relations[],
+ ReorderBufferChange *change);
+static void store_change(LogicalDecodingContext *ctx,
+ ConcurrentChangeKind kind, HeapTuple tuple);
+
+void
+_PG_output_plugin_init(OutputPluginCallbacks *cb)
+{
+ AssertVariableIsOfType(&_PG_output_plugin_init, LogicalOutputPluginInit);
+
+ cb->startup_cb = plugin_startup;
+ cb->begin_cb = plugin_begin_txn;
+ cb->change_cb = plugin_change;
+ cb->truncate_cb = plugin_truncate;
+ cb->commit_cb = plugin_commit_txn;
+ cb->shutdown_cb = plugin_shutdown;
+}
+
+
+/* initialize this plugin */
+static void
+plugin_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
+ bool is_init)
+{
+ ctx->output_plugin_private = NULL;
+
+ /* Probably unnecessary, as we don't use the SQL interface ... */
+ opt->output_type = OUTPUT_PLUGIN_BINARY_OUTPUT;
+
+ if (ctx->output_plugin_options != NIL)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("this plugin does not expect any options")));
+ }
+}
+
+static void
+plugin_shutdown(LogicalDecodingContext *ctx)
+{
+}
+
+/*
+ * As we don't release the slot while processing a particular table, there's
+ * no room for an SQL interface, even for debugging purposes. Therefore we need
+ * neither OutputPluginPrepareWrite() nor OutputPluginWrite() in the plugin
+ * callbacks. (Although we might want to write custom callbacks, this API
+ * seems to be unnecessarily generic for our purposes.)
+ */
+
+/* BEGIN callback */
+static void
+plugin_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+}
+
+/* COMMIT callback */
+static void
+plugin_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+}
+
+/*
+ * Callback for individual changed tuples
+ */
+static void
+plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ Relation relation, ReorderBufferChange *change)
+{
+ ClusterDecodingState *dstate;
+
+ dstate = (ClusterDecodingState *) ctx->output_writer_private;
+
+ /* Only interested in one particular relation. */
+ if (relation->rd_id != dstate->relid)
+ return;
+
+ /* Decode entry depending on its type */
+ switch (change->action)
+ {
+ case REORDER_BUFFER_CHANGE_INSERT:
+ {
+ HeapTuple newtuple;
+
+ newtuple = change->data.tp.newtuple;
+
+ /*
+ * Identity checks in the main function should have made this
+ * impossible.
+ */
+ if (newtuple == NULL)
+ elog(ERROR, "incomplete insert info");
+
+ store_change(ctx, CHANGE_INSERT, newtuple);
+ }
+ break;
+ case REORDER_BUFFER_CHANGE_UPDATE:
+ {
+ HeapTuple oldtuple,
+ newtuple;
+
+ oldtuple = change->data.tp.oldtuple;
+ newtuple = change->data.tp.newtuple;
+
+ if (newtuple == NULL)
+ elog(ERROR, "incomplete update info");
+
+ if (oldtuple != NULL)
+ store_change(ctx, CHANGE_UPDATE_OLD, oldtuple);
+
+ store_change(ctx, CHANGE_UPDATE_NEW, newtuple);
+ }
+ break;
+ case REORDER_BUFFER_CHANGE_DELETE:
+ {
+ HeapTuple oldtuple;
+
+ oldtuple = change->data.tp.oldtuple;
+
+ if (oldtuple == NULL)
+ elog(ERROR, "incomplete delete info");
+
+ store_change(ctx, CHANGE_DELETE, oldtuple);
+ }
+ break;
+ default:
+ /* Should not come here */
+ Assert(false);
+ break;
+ }
+}
+
+static void
+plugin_truncate(struct LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ int nrelations, Relation relations[],
+ ReorderBufferChange *change)
+{
+ ClusterDecodingState *dstate;
+ int i;
+ Relation relation = NULL;
+
+ dstate = (ClusterDecodingState *) ctx->output_writer_private;
+
+ /* Find the relation we are processing. */
+ for (i = 0; i < nrelations; i++)
+ {
+ relation = relations[i];
+
+ if (RelationGetRelid(relation) == dstate->relid)
+ break;
+ }
+
+ /* Skip if the truncation does not affect the relation we process. */
+ if (i == nrelations)
+ return;
+
+ store_change(ctx, CHANGE_TRUNCATE, NULL);
+}
+
+/* Store concurrent data change. */
+static void
+store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
+ HeapTuple tuple)
+{
+ ClusterDecodingState *dstate;
+ char *change_raw;
+ ConcurrentChange *change;
+ bool flattened = false;
+ Size size;
+ Datum values[1];
+ bool isnull[1];
+ char *dst;
+
+ dstate = (ClusterDecodingState *) ctx->output_writer_private;
+
+ size = MAXALIGN(VARHDRSZ) + sizeof(ConcurrentChange);
+
+ if (tuple)
+ {
+ /*
+ * ReorderBufferCommit() stores the TOAST chunks in its private memory
+ * context and frees them after having called
+ * apply_change(). Therefore we need a flat copy (including TOAST) that
+ * we eventually copy into the memory context which is available to
+ * decode_concurrent_changes().
+ */
+ if (HeapTupleHasExternal(tuple))
+ {
+ /*
+ * toast_flatten_tuple_to_datum() might be more convenient but we
+ * don't want the decompression it does.
+ */
+ tuple = toast_flatten_tuple(tuple, dstate->tupdesc);
+ flattened = true;
+ }
+
+ size += tuple->t_len;
+ }
+
+ /* XXX Isn't there any function / macro to do this? */
+ if (size >= 0x3FFFFFFF)
+ elog(ERROR, "change is too big");
+
+ /* Construct the change. */
+ change_raw = (char *) palloc0(size);
+ SET_VARSIZE(change_raw, size);
+ change = (ConcurrentChange *) VARDATA(change_raw);
+ change->kind = kind;
+
+ /* No other information is needed for TRUNCATE. */
+ if (change->kind == CHANGE_TRUNCATE)
+ goto store;
+
+ /*
+ * Copy the tuple.
+ *
+ * CAUTION: change->tup_data.t_data must be fixed on retrieval!
+ */
+ memcpy(&change->tup_data, tuple, sizeof(HeapTupleData));
+ dst = (char *) change + sizeof(ConcurrentChange);
+ memcpy(dst, tuple->t_data, tuple->t_len);
+
+ /* The data has been copied. */
+ if (flattened)
+ pfree(tuple);
+
+store:
+ /* Store as tuple of 1 bytea column. */
+ values[0] = PointerGetDatum(change_raw);
+ isnull[0] = false;
+ tuplestore_putvalues(dstate->tstore, dstate->tupdesc_change,
+ values, isnull);
+
+ /* Accounting. */
+ dstate->nchanges++;
+
+ /* Cleanup. */
+ pfree(change_raw);
+}
+
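The CAUTION in store_change() matters because the HeapTupleData header and the tuple body are serialized into one contiguous allocation, so the t_data pointer embedded in the copied header goes stale. A minimal standalone sketch of this store/fix-up pattern (plain C with illustrative stand-in types, not the actual PostgreSQL structures):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-ins for HeapTupleData and ConcurrentChange. */
typedef struct
{
	size_t		t_len;
	char	   *t_data;
} FakeTuple;

typedef struct
{
	int			kind;
	FakeTuple	tup_data;
} FakeChange;

/*
 * Serialize: copy the tuple header into the change structure, then append
 * the tuple body right after it.  The t_data pointer inside the copy still
 * points at the original buffer, i.e. it is stale.
 */
static char *
store(const FakeTuple *src, int kind)
{
	char	   *raw = malloc(sizeof(FakeChange) + src->t_len);
	FakeChange *chg = (FakeChange *) raw;

	chg->kind = kind;
	chg->tup_data = *src;		/* struct copy; t_data is now stale */
	memcpy(raw + sizeof(FakeChange), src->t_data, src->t_len);
	return raw;
}

/* Retrieve: fix t_data so it points into the serialized blob. */
static FakeChange *
restore(char *raw)
{
	FakeChange *chg = (FakeChange *) raw;

	chg->tup_data.t_data = raw + sizeof(FakeChange);
	return chg;
}
```

The same fix-up has to be performed by whatever reads the bytea back from the tuplestore before the tuple can be used.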
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 10fc18f252..4d44e89faf 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
#include "access/xlogprefetcher.h"
#include "access/xlogrecovery.h"
#include "commands/async.h"
+#include "commands/cluster.h"
#include "commands/waitlsn.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -152,6 +153,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, InjectionPointShmemSize());
size = add_size(size, SlotSyncShmemSize());
size = add_size(size, WaitLSNShmemSize());
+ size = add_size(size, ClusterShmemSize());
/* include additional requested shmem from preload libraries */
size = add_size(size, total_addin_request);
@@ -345,6 +347,7 @@ CreateOrAttachShmemStructs(void)
WaitEventCustomShmemInit();
InjectionPointShmemInit();
WaitLSNShmemInit();
+ ClusterShmemInit();
}
/*
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index b2ea8125c9..04e7571e70 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -1299,6 +1299,17 @@ ProcessUtilitySlow(ParseState *pstate,
lockmode = AlterTableGetLockLevel(atstmt->cmds);
relid = AlterTableLookupRelation(atstmt, lockmode);
+ /*
+ * If lockmode allows, check whether VACUUM FULL / CLUSTER
+ * CONCURRENTLY is in progress. If lockmode is too weak,
+ * cluster_rel() should detect incompatible DDL executed
+ * by us.
+ *
+ * XXX We might skip the changes for DDLs which do not
+ * change the tuple descriptor.
+ */
+ check_for_concurrent_cluster(relid, lockmode);
+
if (OidIsValid(relid))
{
AlterTableUtilityContext atcontext;
diff --git a/src/backend/utils/activity/backend_progress.c b/src/backend/utils/activity/backend_progress.c
index cc9b4cf0dc..0ba35a847e 100644
--- a/src/backend/utils/activity/backend_progress.c
+++ b/src/backend/utils/activity/backend_progress.c
@@ -163,3 +163,19 @@ pgstat_progress_end_command(void)
beentry->st_progress.command_target = InvalidOid;
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
+
+void
+pgstat_progress_restore_state(PgBackendProgress *backup)
+{
+ volatile PgBackendStatus *beentry = MyBEEntry;
+
+ if (!beentry || !pgstat_track_activities)
+ return;
+
+ PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
+ beentry->st_progress.command = backup->command;
+ beentry->st_progress.command_target = backup->command_target;
+ memcpy((char *) beentry->st_progress.param, backup->param,
+ sizeof(beentry->st_progress.param));
+ PGSTAT_END_WRITE_ACTIVITY(beentry);
+}
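The new restore function pairs with a caller-made backup of PgBackendProgress: snapshot the state before a nested command overwrites it, then put it back afterwards. A standalone sketch of that pattern, using a simplified struct rather than the real PgBackendStatus machinery:

```c
#include <string.h>

/* Simplified stand-in for PgBackendProgress. */
typedef struct
{
	int			command;
	unsigned	command_target;
	long		param[4];
} Progress;

/* Stand-in for the shared backend entry. */
static Progress current;

/* Save the current progress state (struct copy includes param[]). */
static void
progress_backup(Progress *backup)
{
	*backup = current;
}

/* Restore a previously saved state, as pgstat_progress_restore_state() does. */
static void
progress_restore(const Progress *backup)
{
	current.command = backup->command;
	current.command_target = backup->command_target;
	memcpy(current.param, backup->param, sizeof(current.param));
}
```

In the patch the backup itself is presumably taken by the caller before starting the nested progress-reporting command; only the restore side is shown in this hunk.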
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 8efb4044d6..8b9dfe865b 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -347,6 +347,7 @@ DSMRegistry "Waiting to read or update the dynamic shared memory registry."
InjectionPoint "Waiting to read or update information related to injection points."
SerialControl "Waiting to read or update shared <filename>pg_serial</filename> state."
WaitLSN "Waiting to read or update shared Wait-for-LSN state."
+ClusteredRels "Waiting to read or update information on tables being clustered concurrently."
#
# END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 603aa4157b..5a2d5d6138 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -1373,6 +1373,28 @@ CacheInvalidateRelcache(Relation relation)
RegisterRelcacheInvalidation(databaseId, relationId);
}
+/*
+ * CacheInvalidateRelcacheImmediate
+ * Send invalidation message for the specified relation's relcache entry.
+ *
+ * Currently this is used in VACUUM FULL/CLUSTER CONCURRENTLY, to make sure
+ * that other backends are aware that the command is being executed for the
+ * relation.
+ */
+void
+CacheInvalidateRelcacheImmediate(Relation relation)
+{
+ SharedInvalidationMessage msg;
+
+ msg.rc.id = SHAREDINVALRELCACHE_ID;
+ msg.rc.dbId = MyDatabaseId;
+ msg.rc.relId = RelationGetRelid(relation);
+ /* check AddCatcacheInvalidationMessage() for an explanation */
+ VALGRIND_MAKE_MEM_DEFINED(&msg, sizeof(msg));
+
+ SendSharedInvalidMessages(&msg, 1);
+}
+
/*
* CacheInvalidateRelcacheAll
* Register invalidation of the whole relcache at the end of command.
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index c326f687eb..35b4609324 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -64,6 +64,7 @@
#include "catalog/pg_type.h"
#include "catalog/schemapg.h"
#include "catalog/storage.h"
+#include "commands/cluster.h"
#include "commands/policy.h"
#include "commands/publicationcmds.h"
#include "commands/trigger.h"
@@ -1257,6 +1258,10 @@ retry:
/* make sure relation is marked as having no open file yet */
relation->rd_smgr = NULL;
+ /* Is CLUSTER CONCURRENTLY in progress? */
+ relation->rd_cluster_concurrent =
+ is_concurrent_cluster_in_progress(targetRelId);
+
/*
* now we can free the memory allocated for pg_class_tuple
*/
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index b8b500f48f..6be0fef84c 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -156,7 +156,6 @@ static List *exportedSnapshots = NIL;
/* Prototypes for local functions */
static void UnregisterSnapshotNoOwner(Snapshot snapshot);
-static void FreeSnapshot(Snapshot snapshot);
static void SnapshotResetXmin(void);
/* ResourceOwner callbacks to track snapshot references */
@@ -625,7 +624,7 @@ CopySnapshot(Snapshot snapshot)
* FreeSnapshot
* Free the memory associated with a snapshot.
*/
-static void
+void
FreeSnapshot(Snapshot snapshot)
{
Assert(snapshot->regd_count == 0);
diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index a9f4d205e1..5d61292c7e 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -3104,7 +3104,7 @@ match_previous_words(int pattern_id,
* one word, so the above test is correct.
*/
if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
- COMPLETE_WITH("VERBOSE");
+ COMPLETE_WITH("VERBOSE", "CONCURRENTLY");
}
/* COMMENT */
@@ -5103,7 +5103,8 @@ match_previous_words(int pattern_id,
"DISABLE_PAGE_SKIPPING", "SKIP_LOCKED",
"INDEX_CLEANUP", "PROCESS_MAIN", "PROCESS_TOAST",
"TRUNCATE", "PARALLEL", "SKIP_DATABASE_STATS",
- "ONLY_DATABASE_STATS", "BUFFER_USAGE_LIMIT");
+ "ONLY_DATABASE_STATS", "BUFFER_USAGE_LIMIT",
+ "CONCURRENTLY");
else if (TailMatches("FULL|FREEZE|ANALYZE|VERBOSE|DISABLE_PAGE_SKIPPING|SKIP_LOCKED|PROCESS_MAIN|PROCESS_TOAST|TRUNCATE|SKIP_DATABASE_STATS|ONLY_DATABASE_STATS"))
COMPLETE_WITH("ON", "OFF");
else if (TailMatches("INDEX_CLEANUP"))
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index b951466ced..e917d387d5 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -412,6 +412,10 @@ extern HTSV_Result HeapTupleSatisfiesVacuumHorizon(HeapTuple htup, Buffer buffer
TransactionId *dead_after);
extern void HeapTupleSetHintBits(HeapTupleHeader tuple, Buffer buffer,
uint16 infomask, TransactionId xid);
+extern bool HeapTupleMVCCInserted(HeapTuple htup, Snapshot snapshot,
+ Buffer buffer);
+extern bool HeapTupleMVCCNotDeleted(HeapTuple htup, Snapshot snapshot,
+ Buffer buffer);
extern bool HeapTupleHeaderIsOnlyLocked(HeapTupleHeader tuple);
extern bool HeapTupleIsSurelyDead(HeapTuple htup,
struct GlobalVisState *vistest);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index da661289c1..1380ba81fc 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,6 +21,7 @@
#include "access/sdir.h"
#include "access/xact.h"
#include "executor/tuptable.h"
+#include "replication/logical.h"
#include "storage/read_stream.h"
#include "utils/rel.h"
#include "utils/snapshot.h"
@@ -630,6 +631,8 @@ typedef struct TableAmRoutine
Relation OldIndex,
bool use_sort,
TransactionId OldestXmin,
+ Snapshot snapshot,
+ LogicalDecodingContext *decoding_ctx,
TransactionId *xid_cutoff,
MultiXactId *multi_cutoff,
double *num_tuples,
@@ -1667,6 +1670,10 @@ table_relation_copy_data(Relation rel, const RelFileLocator *newrlocator)
* not needed for the relation's AM
* - *xid_cutoff - ditto
* - *multi_cutoff - ditto
+ * - snapshot - if != NULL, ignore data changes done by transactions that this
+ * (MVCC) snapshot considers still in-progress or in the future.
+ * - decoding_ctx - logical decoding context, to capture concurrent data
+ * changes.
*
* Output parameters:
* - *xid_cutoff - rel's new relfrozenxid value, may be invalid
@@ -1679,6 +1686,8 @@ table_relation_copy_for_cluster(Relation OldTable, Relation NewTable,
Relation OldIndex,
bool use_sort,
TransactionId OldestXmin,
+ Snapshot snapshot,
+ LogicalDecodingContext *decoding_ctx,
TransactionId *xid_cutoff,
MultiXactId *multi_cutoff,
double *num_tuples,
@@ -1687,6 +1696,7 @@ table_relation_copy_for_cluster(Relation OldTable, Relation NewTable,
{
OldTable->rd_tableam->relation_copy_for_cluster(OldTable, NewTable, OldIndex,
use_sort, OldestXmin,
+ snapshot, decoding_ctx,
xid_cutoff, multi_cutoff,
num_tuples, tups_vacuumed,
tups_recently_dead);
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 2dea96f47c..943fe71ba6 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -100,6 +100,9 @@ extern Oid index_concurrently_create_copy(Relation heapRelation,
Oid tablespaceOid,
const char *newName);
+extern NullableDatum *get_index_stattargets(Oid indexid,
+ IndexInfo *indInfo);
+
extern void index_concurrently_build(Oid heapRelationId,
Oid indexRelationId);
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 7492796ea2..959899a7cc 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -13,10 +13,15 @@
#ifndef CLUSTER_H
#define CLUSTER_H
+#include "nodes/execnodes.h"
#include "nodes/parsenodes.h"
#include "parser/parse_node.h"
+#include "replication/logical.h"
#include "storage/lock.h"
+#include "storage/relfilelocator.h"
#include "utils/relcache.h"
+#include "utils/resowner.h"
+#include "utils/tuplestore.h"
/* flag bits for ClusterParams->options */
@@ -24,6 +29,7 @@
#define CLUOPT_RECHECK 0x02 /* recheck relation state */
#define CLUOPT_RECHECK_ISCLUSTERED 0x04 /* recheck relation state for
* indisclustered */
+#define CLUOPT_CONCURRENT 0x08 /* allow concurrent data changes */
/* options for CLUSTER */
typedef struct ClusterParams
@@ -31,12 +37,101 @@ typedef struct ClusterParams
bits32 options; /* bitmask of CLUOPT_* */
} ClusterParams;
+/*
+ * The following definitions are used for concurrent processing.
+ */
+
+extern RelFileLocator clustered_rel_locator;
+extern RelFileLocator clustered_rel_toast_locator;
+
+/*
+ * Lock level for the concurrent variant of CLUSTER / VACUUM FULL.
+ *
+ * Like for lazy VACUUM, we choose the strongest lock that still allows
+ * INSERT, UPDATE and DELETE.
+ *
+ * Note that the lock needs to be released temporarily a few times during the
+ * processing. In such cases it should be checked after re-locking that the
+ * relation / index hasn't changed in the system catalog while the lock was
+ * not held.
+ */
+#define LOCK_CLUSTER_CONCURRENT ShareUpdateExclusiveLock
+
+typedef enum
+{
+ CHANGE_INSERT,
+ CHANGE_UPDATE_OLD,
+ CHANGE_UPDATE_NEW,
+ CHANGE_DELETE,
+ CHANGE_TRUNCATE
+} ConcurrentChangeKind;
+
+typedef struct ConcurrentChange
+{
+ /* See the enum above. */
+ ConcurrentChangeKind kind;
+
+ /*
+ * The actual tuple.
+ *
+ * The tuple data follows the ConcurrentChange structure. Before use make
+ * sure the tuple is correctly aligned (ConcurrentChange can be stored as
+ * bytea) and that tuple->t_data is fixed.
+ */
+ HeapTupleData tup_data;
+} ConcurrentChange;
+
+/*
+ * Logical decoding state.
+ *
+ * Here we store the data changes that we decode from WAL while the table
+ * contents is being copied to a new storage. Also the necessary metadata
+ * needed to apply these changes to the table is stored here.
+ */
+typedef struct ClusterDecodingState
+{
+ /* The relation whose changes we're decoding. */
+ Oid relid;
+
+ /*
+ * Decoded changes are stored here. Although we try to avoid excessive
+ * batches, it can happen that the changes need to be stored to disk. The
+ * tuplestore does this transparently.
+ */
+ Tuplestorestate *tstore;
+
+ /* The current number of changes in tstore. */
+ double nchanges;
+
+ /*
+ * Descriptor to store the ConcurrentChange structure serialized (bytea).
+ * We can't store the tuple directly because tuplestore only supports
+ * minimal tuples and we may need to transfer the OID system column from the
+ * output plugin. Also we need to transfer the change kind, so it's better
+ * to put everything in the structure than to use 2 tuplestores "in
+ * parallel".
+ */
+ TupleDesc tupdesc_change;
+
+ /* Tuple descriptor needed to update indexes. */
+ TupleDesc tupdesc;
+
+ /* Slot to retrieve data from tstore. */
+ TupleTableSlot *tsslot;
+
+ ResourceOwner resowner;
+} ClusterDecodingState;
+
extern void cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel);
-extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params);
+extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
+ bool isTopLevel);
extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid,
LOCKMODE lockmode);
extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
-
+extern bool check_relation_is_clusterable_concurrently(Relation rel, int elevel,
+ const char *stmt);
+extern void cluster_decode_concurrent_changes(LogicalDecodingContext *ctx,
+ XLogRecPtr end_of_wal);
extern Oid make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
char relpersistence, LOCKMODE lockmode_old,
LOCKMODE *lockmode_new_p);
@@ -45,8 +140,13 @@ extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
bool swap_toast_by_content,
bool check_constraints,
bool is_internal,
+ bool reindex,
TransactionId frozenXid,
MultiXactId cutoffMulti,
char newrelpersistence);
+extern Size ClusterShmemSize(void);
+extern void ClusterShmemInit(void);
+extern bool is_concurrent_cluster_in_progress(Oid relid);
+extern void check_for_concurrent_cluster(Oid relid, LOCKMODE lockmode);
#endif /* CLUSTER_H */
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 5616d64523..03e3712ede 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -59,19 +59,22 @@
#define PROGRESS_CLUSTER_PHASE 1
#define PROGRESS_CLUSTER_INDEX_RELID 2
#define PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED 3
-#define PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN 4
-#define PROGRESS_CLUSTER_TOTAL_HEAP_BLKS 5
-#define PROGRESS_CLUSTER_HEAP_BLKS_SCANNED 6
-#define PROGRESS_CLUSTER_INDEX_REBUILD_COUNT 7
+#define PROGRESS_CLUSTER_HEAP_TUPLES_INSERTED 4
+#define PROGRESS_CLUSTER_HEAP_TUPLES_UPDATED 5
+#define PROGRESS_CLUSTER_HEAP_TUPLES_DELETED 6
+#define PROGRESS_CLUSTER_TOTAL_HEAP_BLKS 7
+#define PROGRESS_CLUSTER_HEAP_BLKS_SCANNED 8
+#define PROGRESS_CLUSTER_INDEX_REBUILD_COUNT 9
/* Phases of cluster (as advertised via PROGRESS_CLUSTER_PHASE) */
#define PROGRESS_CLUSTER_PHASE_SEQ_SCAN_HEAP 1
#define PROGRESS_CLUSTER_PHASE_INDEX_SCAN_HEAP 2
#define PROGRESS_CLUSTER_PHASE_SORT_TUPLES 3
#define PROGRESS_CLUSTER_PHASE_WRITE_NEW_HEAP 4
-#define PROGRESS_CLUSTER_PHASE_SWAP_REL_FILES 5
-#define PROGRESS_CLUSTER_PHASE_REBUILD_INDEX 6
-#define PROGRESS_CLUSTER_PHASE_FINAL_CLEANUP 7
+#define PROGRESS_CLUSTER_PHASE_CATCH_UP 5
+#define PROGRESS_CLUSTER_PHASE_SWAP_REL_FILES 6
+#define PROGRESS_CLUSTER_PHASE_REBUILD_INDEX 7
+#define PROGRESS_CLUSTER_PHASE_FINAL_CLEANUP 8
/* Commands of PROGRESS_CLUSTER */
#define PROGRESS_CLUSTER_COMMAND_CLUSTER 1
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 759f9a87d3..2f693e0fc0 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -181,13 +181,16 @@ typedef struct VacAttrStats
#define VACOPT_ANALYZE 0x02 /* do ANALYZE */
#define VACOPT_VERBOSE 0x04 /* output INFO instrumentation messages */
#define VACOPT_FREEZE 0x08 /* FREEZE option */
-#define VACOPT_FULL 0x10 /* FULL (non-concurrent) vacuum */
-#define VACOPT_SKIP_LOCKED 0x20 /* skip if cannot get lock */
-#define VACOPT_PROCESS_MAIN 0x40 /* process main relation */
-#define VACOPT_PROCESS_TOAST 0x80 /* process the TOAST table, if any */
-#define VACOPT_DISABLE_PAGE_SKIPPING 0x100 /* don't skip any pages */
-#define VACOPT_SKIP_DATABASE_STATS 0x200 /* skip vac_update_datfrozenxid() */
-#define VACOPT_ONLY_DATABASE_STATS 0x400 /* only vac_update_datfrozenxid() */
+#define VACOPT_FULL_EXCLUSIVE 0x10 /* FULL (non-concurrent) vacuum */
+#define VACOPT_FULL_CONCURRENT 0x20 /* FULL (concurrent) vacuum */
+#define VACOPT_SKIP_LOCKED 0x40 /* skip if cannot get lock */
+#define VACOPT_PROCESS_MAIN 0x80 /* process main relation */
+#define VACOPT_PROCESS_TOAST 0x100 /* process the TOAST table, if any */
+#define VACOPT_DISABLE_PAGE_SKIPPING 0x200 /* don't skip any pages */
+#define VACOPT_SKIP_DATABASE_STATS 0x400 /* skip vac_update_datfrozenxid() */
+#define VACOPT_ONLY_DATABASE_STATS 0x800 /* only vac_update_datfrozenxid() */
+
+#define VACOPT_FULL (VACOPT_FULL_EXCLUSIVE | VACOPT_FULL_CONCURRENT)
/*
* Values used by index_cleanup and truncate params.
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index ad06e80784..b38eb0d530 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -69,6 +69,7 @@ extern void FreeSnapshotBuilder(SnapBuild *builder);
extern void SnapBuildSnapDecRefcount(Snapshot snap);
extern Snapshot SnapBuildInitialSnapshot(SnapBuild *builder);
+extern Snapshot SnapBuildInitialSnapshotForCluster(SnapBuild *builder);
extern Snapshot SnapBuildMVCCFromHistoric(Snapshot snapshot, bool in_place);
extern const char *SnapBuildExportSnapshot(SnapBuild *builder);
extern void SnapBuildClearExportedSnapshot(void);
diff --git a/src/include/storage/lockdefs.h b/src/include/storage/lockdefs.h
index 810b297edf..7dfa180fa1 100644
--- a/src/include/storage/lockdefs.h
+++ b/src/include/storage/lockdefs.h
@@ -36,7 +36,7 @@ typedef int LOCKMODE;
#define AccessShareLock 1 /* SELECT */
#define RowShareLock 2 /* SELECT FOR UPDATE/FOR SHARE */
#define RowExclusiveLock 3 /* INSERT, UPDATE, DELETE */
-#define ShareUpdateExclusiveLock 4 /* VACUUM (non-FULL), ANALYZE, CREATE
+#define ShareUpdateExclusiveLock 4 /* VACUUM (non-exclusive), ANALYZE, CREATE
* INDEX CONCURRENTLY */
#define ShareLock 5 /* CREATE INDEX (WITHOUT CONCURRENTLY) */
#define ShareRowExclusiveLock 6 /* like EXCLUSIVE MODE, but allows ROW
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 88dc79b2bd..cff17a6bd0 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -84,3 +84,4 @@ PG_LWLOCK(50, DSMRegistry)
PG_LWLOCK(51, InjectionPoint)
PG_LWLOCK(52, SerialControl)
PG_LWLOCK(53, WaitLSN)
+PG_LWLOCK(54, ClusteredRels)
diff --git a/src/include/utils/backend_progress.h b/src/include/utils/backend_progress.h
index e09598eafc..5ab5df9d41 100644
--- a/src/include/utils/backend_progress.h
+++ b/src/include/utils/backend_progress.h
@@ -35,7 +35,7 @@ typedef enum ProgressCommandType
/*
* Any command which wishes can advertise that it is running by setting
- * command, command_target, and param[]. command_target should be the OID of
+ * command, command_target, and param[]. command_target should be the OID of
* the relation which the command targets (we assume there's just one, as this
* is meant for utility commands), but the meaning of each element in the
* param array is command-specific.
@@ -55,6 +55,7 @@ extern void pgstat_progress_parallel_incr_param(int index, int64 incr);
extern void pgstat_progress_update_multi_param(int nparam, const int *index,
const int64 *val);
extern void pgstat_progress_end_command(void);
+extern void pgstat_progress_restore_state(PgBackendProgress *backup);
#endif /* BACKEND_PROGRESS_H */
diff --git a/src/include/utils/inval.h b/src/include/utils/inval.h
index 24695facf2..4acf9d0ed9 100644
--- a/src/include/utils/inval.h
+++ b/src/include/utils/inval.h
@@ -42,6 +42,8 @@ extern void CacheInvalidateCatalog(Oid catalogId);
extern void CacheInvalidateRelcache(Relation relation);
+extern void CacheInvalidateRelcacheImmediate(Relation relation);
+
extern void CacheInvalidateRelcacheAll(void);
extern void CacheInvalidateRelcacheByTuple(HeapTuple classTuple);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 8700204953..adda46c985 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -253,6 +253,9 @@ typedef struct RelationData
bool pgstat_enabled; /* should relation stats be counted */
/* use "struct" here to avoid needing to include pgstat.h: */
struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+ /* Is CLUSTER CONCURRENTLY being performed on this relation? */
+ bool rd_cluster_concurrent;
} RelationData;
@@ -684,7 +687,9 @@ RelationCloseSmgr(Relation relation)
#define RelationIsAccessibleInLogicalDecoding(relation) \
(XLogLogicalInfoActive() && \
RelationNeedsWAL(relation) && \
- (IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation)))
+ (IsCatalogRelation(relation) || \
+ RelationIsUsedAsCatalogTable(relation) || \
+ (relation)->rd_cluster_concurrent))
/*
* RelationIsLogicallyLogged
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index e7ac89f484..f58c9108fc 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -69,6 +69,8 @@ extern void SnapshotSetCommandId(CommandId curcid);
extern Snapshot GetOldestSnapshot(void);
extern Snapshot CopySnapshot(Snapshot snapshot);
+extern void FreeSnapshot(Snapshot snapshot);
+
extern Snapshot GetCatalogSnapshot(Oid relid);
extern Snapshot GetNonHistoricCatalogSnapshot(Oid relid);
extern void InvalidateCatalogSnapshot(void);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 2b47013f11..a307fc79a5 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1960,17 +1960,20 @@ pg_stat_progress_cluster| SELECT s.pid,
WHEN 2 THEN 'index scanning heap'::text
WHEN 3 THEN 'sorting tuples'::text
WHEN 4 THEN 'writing new heap'::text
- WHEN 5 THEN 'swapping relation files'::text
- WHEN 6 THEN 'rebuilding index'::text
- WHEN 7 THEN 'performing final cleanup'::text
+ WHEN 5 THEN 'catch-up'::text
+ WHEN 6 THEN 'swapping relation files'::text
+ WHEN 7 THEN 'rebuilding index'::text
+ WHEN 8 THEN 'performing final cleanup'::text
ELSE NULL::text
END AS phase,
(s.param3)::oid AS cluster_index_relid,
s.param4 AS heap_tuples_scanned,
- s.param5 AS heap_tuples_written,
- s.param6 AS heap_blks_total,
- s.param7 AS heap_blks_scanned,
- s.param8 AS index_rebuild_count
+ s.param5 AS heap_tuples_inserted,
+ s.param6 AS heap_tuples_updated,
+ s.param7 AS heap_tuples_deleted,
+ s.param8 AS heap_blks_total,
+ s.param9 AS heap_blks_scanned,
+ s.param10 AS index_rebuild_count
FROM (pg_stat_get_progress_info('CLUSTER'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_progress_copy| SELECT s.pid,
--
2.45.2
Attachment: v05-0005-Preserve-visibility-information-of-the-concurrent-da.patch (text/x-diff)
From 69a9a533c7946fff5f0cfe871a709e70c11d2ce4 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Wed, 9 Oct 2024 09:44:20 +0200
Subject: [PATCH 5/8] Preserve visibility information of the concurrent data
changes.
As explained in the commit message of the preceding patch of the series, the
data changes done by applications while VACUUM FULL / CLUSTER CONCURRENTLY is
copying the table contents to a new file are decoded from WAL and eventually
also applied to the new file. To reduce the complexity a little bit, the
preceding patch uses the current transaction (i.e. transaction opened by the
VACUUM FULL / CLUSTER command) to execute those INSERT, UPDATE and DELETE
commands.
However, neither VACUUM nor CLUSTER is expected to change visibility of
tuples. Therefore, this patch fixes the handling of the "concurrent data
changes". Now the tuples written into the new table storage have the same XID
and command ID (CID) as they had in the old storage.
A related change is that the data changes (INSERT, UPDATE, DELETE) we "replay"
on the new storage are not logically decoded themselves. First, the logical
decoding subsystem does not expect an already committed transaction to be
decoded again. Second, repeated decoding would just be wasted effort.
---
src/backend/access/common/toast_internals.c | 3 +-
src/backend/access/heap/heapam.c | 73 ++++++++----
src/backend/access/heap/heapam_handler.c | 14 ++-
src/backend/access/transam/xact.c | 52 ++++++++
src/backend/commands/cluster.c | 111 ++++++++++++++++--
src/backend/replication/logical/decode.c | 76 ++++++++++--
src/backend/replication/logical/snapbuild.c | 22 ++--
.../pgoutput_cluster/pgoutput_cluster.c | 68 +++++++++--
src/include/access/heapam.h | 15 ++-
src/include/access/heapam_xlog.h | 2 +
src/include/access/xact.h | 2 +
src/include/commands/cluster.h | 18 +++
12 files changed, 386 insertions(+), 70 deletions(-)
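The reason for preserving XID and CID can be illustrated with a deliberately simplified visibility model (this is a toy, not PostgreSQL's actual HeapTupleSatisfiesMVCC logic): if the replayed tuples were stamped with the rewriting transaction's own XID, older snapshots would consider them in progress and thus invisible, even though the original inserting transaction committed long ago.

```c
#include <stdbool.h>

typedef unsigned int Xid;

/* Toy tuple header: only the inserting transaction ID. */
typedef struct
{
	Xid			xmin;
} ToyTupleHeader;

/* Toy snapshot: transactions with xid >= horizon count as in progress. */
typedef struct
{
	Xid			xmin_horizon;
} ToySnapshot;

/* A tuple is visible iff its inserting transaction is older than the horizon. */
static bool
toy_visible(const ToyTupleHeader *tup, const ToySnapshot *snap)
{
	return tup->xmin < snap->xmin_horizon;
}
```

Keeping the original XID/CID on the rewritten tuples, as this patch does, makes the copy indistinguishable from the original as far as MVCC visibility is concerned.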
diff --git a/src/backend/access/common/toast_internals.c b/src/backend/access/common/toast_internals.c
index 90d0654e62..183055647b 100644
--- a/src/backend/access/common/toast_internals.c
+++ b/src/backend/access/common/toast_internals.c
@@ -320,7 +320,8 @@ toast_save_datum(Relation rel, Datum value,
memcpy(VARDATA(&chunk_data), data_p, chunk_size);
toasttup = heap_form_tuple(toasttupDesc, t_values, t_isnull);
- heap_insert(toastrel, toasttup, mycid, options, NULL);
+ heap_insert(toastrel, toasttup, GetCurrentTransactionId(), mycid,
+ options, NULL);
/*
* Create the index entry. We cheat a little here by not using
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 229fefed14..842bb3cf7a 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -58,7 +58,8 @@ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
HeapTuple newtup, HeapTuple old_key_tuple,
- bool all_visible_cleared, bool new_all_visible_cleared);
+ bool all_visible_cleared, bool new_all_visible_cleared,
+ bool wal_logical);
#ifdef USE_ASSERT_CHECKING
static void check_lock_if_inplace_updateable_rel(Relation relation,
ItemPointer otid,
@@ -1966,7 +1967,7 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
/*
* heap_insert - insert tuple into a heap
*
- * The new tuple is stamped with current transaction ID and the specified
+ * The new tuple is stamped with the specified transaction ID and the specified
* command ID.
*
* See table_tuple_insert for comments about most of the input flags, except
@@ -1982,15 +1983,16 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
* reflected into *tup.
*/
void
-heap_insert(Relation relation, HeapTuple tup, CommandId cid,
- int options, BulkInsertState bistate)
+heap_insert(Relation relation, HeapTuple tup, TransactionId xid,
+ CommandId cid, int options, BulkInsertState bistate)
{
- TransactionId xid = GetCurrentTransactionId();
HeapTuple heaptup;
Buffer buffer;
Buffer vmbuffer = InvalidBuffer;
bool all_visible_cleared = false;
+ Assert(TransactionIdIsValid(xid));
+
/* Cheap, simplistic check that the tuple matches the rel's rowtype. */
Assert(HeapTupleHeaderGetNatts(tup->t_data) <=
RelationGetNumberOfAttributes(relation));
@@ -2621,7 +2623,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
void
simple_heap_insert(Relation relation, HeapTuple tup)
{
- heap_insert(relation, tup, GetCurrentCommandId(true), 0, NULL);
+ heap_insert(relation, tup, GetCurrentTransactionId(),
+ GetCurrentCommandId(true), 0, NULL);
}
/*
@@ -2678,11 +2681,11 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
*/
TM_Result
heap_delete(Relation relation, ItemPointer tid,
- CommandId cid, Snapshot crosscheck, bool wait,
- TM_FailureData *tmfd, bool changingPart)
+ TransactionId xid, CommandId cid, Snapshot crosscheck, bool wait,
+ TM_FailureData *tmfd, bool changingPart,
+ bool wal_logical)
{
TM_Result result;
- TransactionId xid = GetCurrentTransactionId();
ItemId lp;
HeapTupleData tp;
Page page;
@@ -2699,6 +2702,7 @@ heap_delete(Relation relation, ItemPointer tid,
bool old_key_copied = false;
Assert(ItemPointerIsValid(tid));
+ Assert(TransactionIdIsValid(xid));
/*
* Forbid this during a parallel operation, lest it allocate a combo CID.
@@ -2924,7 +2928,8 @@ l1:
* Compute replica identity tuple before entering the critical section so
* we don't PANIC upon a memory allocation failure.
*/
- old_key_tuple = ExtractReplicaIdentity(relation, &tp, true, &old_key_copied);
+ old_key_tuple = wal_logical ?
+ ExtractReplicaIdentity(relation, &tp, true, &old_key_copied) : NULL;
/*
* If this is the first possibly-multixact-able operation in the current
@@ -2992,8 +2997,12 @@ l1:
/*
* For logical decode we need combo CIDs to properly decode the
* catalog
+ *
+ * Like in heap_insert(), visibility is unchanged when called from
+ * VACUUM FULL / CLUSTER.
*/
- if (RelationIsAccessibleInLogicalDecoding(relation))
+ if (wal_logical &&
+ RelationIsAccessibleInLogicalDecoding(relation))
log_heap_new_cid(relation, &tp);
xlrec.flags = 0;
@@ -3014,6 +3023,15 @@ l1:
xlrec.flags |= XLH_DELETE_CONTAINS_OLD_KEY;
}
+ /*
+ * Unlike UPDATE, DELETE is decoded even if there is no old key, so it
+ * does not help to clear both XLH_DELETE_CONTAINS_OLD_TUPLE and
+ * XLH_DELETE_CONTAINS_OLD_KEY. Thus we need an extra flag. TODO
+ * Consider not decoding tuples w/o the old tuple/key instead.
+ */
+ if (!wal_logical)
+ xlrec.flags |= XLH_DELETE_NO_LOGICAL;
+
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapDelete);
@@ -3103,10 +3121,11 @@ simple_heap_delete(Relation relation, ItemPointer tid)
TM_Result result;
TM_FailureData tmfd;
- result = heap_delete(relation, tid,
+ result = heap_delete(relation, tid, GetCurrentTransactionId(),
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &tmfd, false /* changingPart */ );
+ &tmfd, false, /* changingPart */
+ true /* wal_logical */);
switch (result)
{
case TM_SelfModified:
@@ -3145,12 +3164,11 @@ simple_heap_delete(Relation relation, ItemPointer tid)
*/
TM_Result
heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
- CommandId cid, Snapshot crosscheck, bool wait,
- TM_FailureData *tmfd, LockTupleMode *lockmode,
- TU_UpdateIndexes *update_indexes)
+ TransactionId xid, CommandId cid, Snapshot crosscheck,
+ bool wait, TM_FailureData *tmfd, LockTupleMode *lockmode,
+ TU_UpdateIndexes *update_indexes, bool wal_logical)
{
TM_Result result;
- TransactionId xid = GetCurrentTransactionId();
Bitmapset *hot_attrs;
Bitmapset *sum_attrs;
Bitmapset *key_attrs;
@@ -3190,6 +3208,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
infomask2_new_tuple;
Assert(ItemPointerIsValid(otid));
+ Assert(TransactionIdIsValid(xid));
/* Cheap, simplistic check that the tuple matches the rel's rowtype. */
Assert(HeapTupleHeaderGetNatts(newtup->t_data) <=
@@ -3982,8 +4001,12 @@ l2:
/*
* For logical decoding we need combo CIDs to properly decode the
* catalog.
+ *
+ * Like in heap_insert(), visibility is unchanged when called from
+ * VACUUM FULL / CLUSTER.
*/
- if (RelationIsAccessibleInLogicalDecoding(relation))
+ if (wal_logical &&
+ RelationIsAccessibleInLogicalDecoding(relation))
{
log_heap_new_cid(relation, &oldtup);
log_heap_new_cid(relation, heaptup);
@@ -3993,7 +4016,8 @@ l2:
newbuf, &oldtup, heaptup,
old_key_tuple,
all_visible_cleared,
- all_visible_cleared_new);
+ all_visible_cleared_new,
+ wal_logical);
if (newbuf != buffer)
{
PageSetLSN(BufferGetPage(newbuf), recptr);
@@ -4348,10 +4372,10 @@ simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup,
TM_FailureData tmfd;
LockTupleMode lockmode;
- result = heap_update(relation, otid, tup,
+ result = heap_update(relation, otid, tup, GetCurrentTransactionId(),
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &tmfd, &lockmode, update_indexes);
+ &tmfd, &lockmode, update_indexes, true);
switch (result)
{
case TM_SelfModified:
@@ -8599,7 +8623,8 @@ static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup, HeapTuple newtup,
HeapTuple old_key_tuple,
- bool all_visible_cleared, bool new_all_visible_cleared)
+ bool all_visible_cleared, bool new_all_visible_cleared,
+ bool wal_logical)
{
xl_heap_update xlrec;
xl_heap_header xlhdr;
@@ -8610,10 +8635,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
suffixlen = 0;
XLogRecPtr recptr;
Page page = BufferGetPage(newbuf);
- bool need_tuple_data = RelationIsLogicallyLogged(reln);
+ bool need_tuple_data;
bool init;
int bufflags;
+ need_tuple_data = RelationIsLogicallyLogged(reln) && wal_logical;
+
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 6cba141c11..21fd4e8977 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -252,7 +252,8 @@ heapam_tuple_insert(Relation relation, TupleTableSlot *slot, CommandId cid,
tuple->t_tableOid = slot->tts_tableOid;
/* Perform the insertion, and copy the resulting ItemPointer */
- heap_insert(relation, tuple, cid, options, bistate);
+ heap_insert(relation, tuple, GetCurrentTransactionId(), cid, options,
+ bistate);
ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
if (shouldFree)
@@ -275,7 +276,8 @@ heapam_tuple_insert_speculative(Relation relation, TupleTableSlot *slot,
options |= HEAP_INSERT_SPECULATIVE;
/* Perform the insertion, and copy the resulting ItemPointer */
- heap_insert(relation, tuple, cid, options, bistate);
+ heap_insert(relation, tuple, GetCurrentTransactionId(), cid, options,
+ bistate);
ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
if (shouldFree)
@@ -309,7 +311,8 @@ heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
* the storage itself is cleaning the dead tuples by itself, it is the
* time to call the index tuple deletion also.
*/
- return heap_delete(relation, tid, cid, crosscheck, wait, tmfd, changingPart);
+ return heap_delete(relation, tid, GetCurrentTransactionId(), cid,
+ crosscheck, wait, tmfd, changingPart, true);
}
@@ -327,8 +330,9 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
slot->tts_tableOid = RelationGetRelid(relation);
tuple->t_tableOid = slot->tts_tableOid;
- result = heap_update(relation, otid, tuple, cid, crosscheck, wait,
- tmfd, lockmode, update_indexes);
+ result = heap_update(relation, otid, tuple, GetCurrentTransactionId(),
+ cid, crosscheck, wait,
+ tmfd, lockmode, update_indexes, true);
ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
/*
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 87700c7c5c..159d2c7983 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -125,6 +125,18 @@ static FullTransactionId XactTopFullTransactionId = {InvalidTransactionId};
static int nParallelCurrentXids = 0;
static TransactionId *ParallelCurrentXids;
+/*
+ * Another case that requires TransactionIdIsCurrentTransactionId() to behave
+ * specially is when CLUSTER CONCURRENTLY is processing data changes made in
+ * the old storage of a table by other transactions. When applying the changes
+ * to the new storage, the backend executing the CLUSTER command needs to act
+ * on behalf on those other transactions. The transactions responsible for the
+ * changes in the old storage are stored in this array, sorted by
+ * xidComparator.
+ */
+static int nClusterCurrentXids = 0;
+static TransactionId *ClusterCurrentXids = NULL;
+
/*
* Miscellaneous flag bits to record events which occur on the top level
* transaction. These flags are only persisted in MyXactFlags and are intended
@@ -971,6 +983,8 @@ TransactionIdIsCurrentTransactionId(TransactionId xid)
int low,
high;
+ Assert(nClusterCurrentXids == 0);
+
low = 0;
high = nParallelCurrentXids - 1;
while (low <= high)
@@ -990,6 +1004,21 @@ TransactionIdIsCurrentTransactionId(TransactionId xid)
return false;
}
+ /*
+ * When executing CLUSTER CONCURRENTLY, the array of current transactions
+ * is given.
+ */
+ if (nClusterCurrentXids > 0)
+ {
+ Assert(nParallelCurrentXids == 0);
+
+ return bsearch(&xid,
+ ClusterCurrentXids,
+ nClusterCurrentXids,
+ sizeof(TransactionId),
+ xidComparator) != NULL;
+ }
+
/*
* We will return true for the Xid of the current subtransaction, any of
* its subcommitted children, any of its parents, or any of their
@@ -5627,6 +5656,29 @@ EndParallelWorkerTransaction(void)
CurrentTransactionState->blockState = TBLOCK_DEFAULT;
}
+/*
+ * SetClusterCurrentXids
+ * Set the XID array that TransactionIdIsCurrentTransactionId() should
+ * use.
+ */
+void
+SetClusterCurrentXids(TransactionId *xip, int xcnt)
+{
+ ClusterCurrentXids = xip;
+ nClusterCurrentXids = xcnt;
+}
+
+/*
+ * ResetClusterCurrentXids
+ * Undo the effect of SetClusterCurrentXids().
+ */
+void
+ResetClusterCurrentXids(void)
+{
+ ClusterCurrentXids = NULL;
+ nClusterCurrentXids = 0;
+}
+
/*
* ShowTransactionState
* Debug support
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index b5698c9baf..23e40562bd 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -201,6 +201,7 @@ static void apply_concurrent_delete(Relation rel, HeapTuple tup_target,
ConcurrentChange *change);
static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
HeapTuple tup_key,
+ Snapshot snapshot,
IndexInsertState *iistate,
TupleTableSlot *ident_slot,
IndexScanDesc *scan_p);
@@ -2987,6 +2988,9 @@ setup_logical_decoding(Oid relid, const char *slotname, TupleDesc tupdesc)
dstate->relid = relid;
dstate->tstore = tuplestore_begin_heap(false, false,
maintenance_work_mem);
+#ifdef USE_ASSERT_CHECKING
+ dstate->last_change_xid = InvalidTransactionId;
+#endif
dstate->tupdesc = tupdesc;
/* Initialize the descriptor to store the changes ... */
@@ -3139,6 +3143,7 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
tup_exist;
char *change_raw;
ConcurrentChange *change;
+ Snapshot snapshot;
bool isnull[1];
Datum values[1];
@@ -3207,8 +3212,30 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
/*
* Find the tuple to be updated or deleted.
+ *
+ * As the table being CLUSTERed concurrently is considered a
+ * "user catalog", new CID is WAL-logged and decoded. And since we
+ * use the same XID that the original DMLs did, the snapshot used
+ * for the logical decoding (by now converted to a non-historic
+ * MVCC snapshot) should see the tuples inserted previously into
+ * the new heap and/or updated there.
+ */
+ snapshot = change->snapshot;
+
+ /*
+ * Set what should be considered current transaction (and
+ * subtransactions) during visibility check.
+ *
+ * Note that this snapshot was created from a historic snapshot
+ * using SnapBuildMVCCFromHistoric(), which does not touch
+ * 'subxip'. Thus, unlike in a regular MVCC snapshot, the array
+ * only contains the transactions whose data changes we are
+ * applying, and its subtransactions. That's exactly what we need
+ * to check whether a particular xact is a "current transaction".
*/
- tup_exist = find_target_tuple(rel, key, nkeys, tup_key,
+ SetClusterCurrentXids(snapshot->subxip, snapshot->subxcnt);
+
+ tup_exist = find_target_tuple(rel, key, nkeys, tup_key, snapshot,
iistate, ident_slot, &ind_scan);
if (tup_exist == NULL)
elog(ERROR, "Failed to find target tuple");
@@ -3219,6 +3246,8 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
else
apply_concurrent_delete(rel, tup_exist, change);
+ ResetClusterCurrentXids();
+
if (tup_old != NULL)
{
pfree(tup_old);
@@ -3231,11 +3260,14 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
else
elog(ERROR, "Unrecognized kind of change: %d", change->kind);
- /* If there's any change, make it visible to the next iteration. */
- if (change->kind != CHANGE_UPDATE_OLD)
+ /* Free the snapshot if this is the last change that needed it. */
+ Assert(change->snapshot->active_count > 0);
+ change->snapshot->active_count--;
+ if (change->snapshot->active_count == 0)
{
- CommandCounterIncrement();
- UpdateActiveSnapshotCommandId();
+ if (change->snapshot == dstate->snapshot)
+ dstate->snapshot = NULL;
+ FreeSnapshot(change->snapshot);
}
/* TTSOpsMinimalTuple has .get_heap_tuple==NULL. */
@@ -3255,10 +3287,30 @@ static void
apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
IndexInsertState *iistate, TupleTableSlot *index_slot)
{
+ Snapshot snapshot = change->snapshot;
List *recheck;
+ /*
+ * For INSERT, the visibility information is not important, but we use the
+ * snapshot to get CID. Index functions might need the whole snapshot
+ * anyway.
+ */
+ SetClusterCurrentXids(snapshot->subxip, snapshot->subxcnt);
- heap_insert(rel, tup, GetCurrentCommandId(true), HEAP_INSERT_NO_LOGICAL, NULL);
+ /*
+ * Write the tuple into the new heap.
+ *
+ * The snapshot is the one we used to decode the insert (though converted
+ * to "non-historic" MVCC snapshot), i.e. the snapshot's curcid is the
+ * tuple CID incremented by one (due to the "new CID" WAL record that got
+ * written along with the INSERT record). Thus if we want to use the
+ * original CID, we need to subtract 1 from curcid.
+ */
+ Assert(snapshot->curcid != InvalidCommandId &&
+ snapshot->curcid > FirstCommandId);
+
+ heap_insert(rel, tup, change->xid, snapshot->curcid - 1,
+ HEAP_INSERT_NO_LOGICAL, NULL);
/*
* Update indexes.
@@ -3266,6 +3318,7 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
* In case functions in the index need the active snapshot and caller
* hasn't set one.
*/
+ PushActiveSnapshot(snapshot);
ExecStoreHeapTuple(tup, index_slot, false);
recheck = ExecInsertIndexTuples(iistate->rri,
index_slot,
@@ -3276,6 +3329,8 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
NIL, /* arbiterIndexes */
false /* onlySummarizing */
);
+ PopActiveSnapshot();
+ ResetClusterCurrentXids();
/*
* If recheck is required, it must have been performed on the source
@@ -3293,18 +3348,36 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
TupleTableSlot *index_slot)
{
List *recheck;
+ LockTupleMode lockmode;
TU_UpdateIndexes update_indexes;
+ TM_Result res;
+ Snapshot snapshot = change->snapshot;
+ TM_FailureData tmfd;
/*
* Write the new tuple into the new heap. ('tup' gets the TID assigned
* here.)
+ *
+ * Regarding CID, see the comment in apply_concurrent_insert().
*/
- simple_heap_update(rel, &tup_target->t_self, tup, &update_indexes);
+ Assert(snapshot->curcid != InvalidCommandId &&
+ snapshot->curcid > FirstCommandId);
+
+ res = heap_update(rel, &tup_target->t_self, tup,
+ change->xid, snapshot->curcid - 1,
+ InvalidSnapshot,
+ false, /* no wait - only we are doing changes */
+ &tmfd, &lockmode, &update_indexes,
+ /* wal_logical */
+ false);
+ if (res != TM_Ok)
+ ereport(ERROR, (errmsg("failed to apply concurrent UPDATE")));
ExecStoreHeapTuple(tup, index_slot, false);
if (update_indexes != TU_None)
{
+ PushActiveSnapshot(snapshot);
recheck = ExecInsertIndexTuples(iistate->rri,
index_slot,
iistate->estate,
@@ -3314,6 +3387,7 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
NIL, /* arbiterIndexes */
/* onlySummarizing */
update_indexes == TU_Summarizing);
+ PopActiveSnapshot();
list_free(recheck);
}
@@ -3324,7 +3398,22 @@ static void
apply_concurrent_delete(Relation rel, HeapTuple tup_target,
ConcurrentChange *change)
{
- simple_heap_delete(rel, &tup_target->t_self);
+ TM_Result res;
+ TM_FailureData tmfd;
+ Snapshot snapshot = change->snapshot;
+
+ /* Regarding CID, see the comment in apply_concurrent_insert(). */
+ Assert(snapshot->curcid != InvalidCommandId &&
+ snapshot->curcid > FirstCommandId);
+
+ res = heap_delete(rel, &tup_target->t_self, change->xid,
+ snapshot->curcid - 1, InvalidSnapshot, false,
+ &tmfd, false,
+ /* wal_logical */
+ false);
+
+ if (res != TM_Ok)
+ ereport(ERROR, (errmsg("failed to apply concurrent DELETE")));
pgstat_progress_incr_param(PROGRESS_CLUSTER_HEAP_TUPLES_DELETED, 1);
}
@@ -3342,7 +3431,7 @@ apply_concurrent_delete(Relation rel, HeapTuple tup_target,
*/
static HeapTuple
find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
- IndexInsertState *iistate,
+ Snapshot snapshot, IndexInsertState *iistate,
TupleTableSlot *ident_slot, IndexScanDesc *scan_p)
{
IndexScanDesc scan;
@@ -3350,7 +3439,7 @@ find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
int2vector *ident_indkey;
HeapTuple result = NULL;
- scan = index_beginscan(rel, iistate->ident_index, GetActiveSnapshot(),
+ scan = index_beginscan(rel, iistate->ident_index, snapshot,
nkeys, 0);
*scan_p = scan;
index_rescan(scan, key, nkeys, NULL, 0);
@@ -3422,6 +3511,8 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
}
PG_FINALLY();
{
+ ResetClusterCurrentXids();
+
if (rel_src)
rel_dst->rd_toastoid = InvalidOid;
}
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 752deb39f7..7526c1a381 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -469,9 +469,18 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
SnapBuild *builder = ctx->snapshot_builder;
/*
- * Check if CLUSTER CONCURRENTLY is being performed by this backend. If
- * so, only decode data changes of the table that it is processing, and
- * the changes of its TOAST relation.
+ * If the change is not intended for logical decoding, do not even
+ * establish a transaction for it. This is particularly important if the
+ * record was generated by CLUSTER CONCURRENTLY because this command uses
+ * the original XID when doing changes in the new storage. The decoding
+ * subsystem probably does not expect to see the same transaction multiple
+ * times.
+ */
+
+ /*
+ * First, check if CLUSTER CONCURRENTLY is being performed by this
+ * backend. If so, only decode data changes of the table that it is
+ * processing, and the changes of its TOAST relation.
*
* (TOAST locator should not be set unless the main is.)
*/
@@ -491,6 +500,60 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
return;
}
+ /*
+ * Second, skip records which do not contain sufficient information for
+ * the decoding.
+ *
+ * The backend executing CLUSTER CONCURRENTLY should not return here
+ * because the records which passed the checks above should be eligible
+ * for decoding. However, CLUSTER CONCURRENTLY generates WAL when
+ * writing data into the new table, which should not be decoded by the
+ * other backends. This is where the other backends skip them.
+ */
+ switch (info)
+ {
+ case XLOG_HEAP_INSERT:
+ {
+ xl_heap_insert *rec;
+
+ rec = (xl_heap_insert *) XLogRecGetData(buf->record);
+ /*
+ * (Besides insertion into the main heap by CLUSTER CONCURRENTLY,
+ * this does happen when raw_heap_insert marks the TOAST record as
+ * HEAP_INSERT_NO_LOGICAL).
+ */
+ if ((rec->flags & XLH_INSERT_CONTAINS_NEW_TUPLE) == 0)
+ return;
+
+ break;
+ }
+
+ case XLOG_HEAP_HOT_UPDATE:
+ case XLOG_HEAP_UPDATE:
+ {
+ xl_heap_update *rec;
+
+ rec = (xl_heap_update *) XLogRecGetData(buf->record);
+ if ((rec->flags &
+ (XLH_UPDATE_CONTAINS_NEW_TUPLE |
+ XLH_UPDATE_CONTAINS_OLD_TUPLE |
+ XLH_UPDATE_CONTAINS_OLD_KEY)) == 0)
+ return;
+
+ break;
+ }
+
+ case XLOG_HEAP_DELETE:
+ {
+ xl_heap_delete *rec;
+
+ rec = (xl_heap_delete *) XLogRecGetData(buf->record);
+ if (rec->flags & XLH_DELETE_NO_LOGICAL)
+ return;
+ break;
+ }
+ }
+
ReorderBufferProcessXid(ctx->reorder, xid, buf->origptr);
/*
@@ -927,13 +990,6 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
xlrec = (xl_heap_insert *) XLogRecGetData(r);
- /*
- * Ignore insert records without new tuples (this does happen when
- * raw_heap_insert marks the TOAST record as HEAP_INSERT_NO_LOGICAL).
- */
- if (!(xlrec->flags & XLH_INSERT_CONTAINS_NEW_TUPLE))
- return;
-
/* only interested in our database */
XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
if (target_locator.dbOid != ctx->slot->data.database)
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 4492e2ae46..8e1f4bb851 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -294,7 +294,7 @@ static bool ExportInProgress = false;
static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
/* snapshot building/manipulation/distribution functions */
-static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
+static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder, XLogRecPtr lsn);
static void SnapBuildFreeSnapshot(Snapshot snap);
@@ -491,12 +491,17 @@ SnapBuildSnapDecRefcount(Snapshot snap)
* Build a new snapshot, based on currently committed catalog-modifying
* transactions.
*
+ * 'lsn' is the location of the commit record (of a catalog-changing
+ * transaction) that triggered creation of the snapshot. Pass
+ * InvalidXLogRecPtr for the transaction base snapshot, or if the user of
+ * the snapshot does not need the LSN.
+ *
* In-progress transactions with catalog access are *not* allowed to modify
* these snapshots; they have to copy them and fill in appropriate ->curcid
* and ->subxip/subxcnt values.
*/
static Snapshot
-SnapBuildBuildSnapshot(SnapBuild *builder)
+SnapBuildBuildSnapshot(SnapBuild *builder, XLogRecPtr lsn)
{
Snapshot snapshot;
Size ssize;
@@ -564,6 +569,7 @@ SnapBuildBuildSnapshot(SnapBuild *builder)
snapshot->active_count = 0;
snapshot->regd_count = 0;
snapshot->snapXactCompletionCount = 0;
+ snapshot->lsn = lsn;
return snapshot;
}
@@ -600,7 +606,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
if (TransactionIdIsValid(MyProc->xmin))
elog(ERROR, "cannot build an initial slot snapshot when MyProc->xmin already is valid");
- snap = SnapBuildBuildSnapshot(builder);
+ snap = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
/*
* We know that snap->xmin is alive, enforced by the logical xmin
@@ -641,7 +647,7 @@ SnapBuildInitialSnapshotForCluster(SnapBuild *builder)
Assert(builder->state == SNAPBUILD_CONSISTENT);
- snap = SnapBuildBuildSnapshot(builder);
+ snap = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
return SnapBuildMVCCFromHistoric(snap, false);
}
@@ -775,7 +781,7 @@ SnapBuildGetOrBuildSnapshot(SnapBuild *builder)
/* only build a new snapshot if we don't have a prebuilt one */
if (builder->snapshot == NULL)
{
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
/* increase refcount for the snapshot builder */
SnapBuildSnapIncRefcount(builder->snapshot);
}
@@ -855,7 +861,7 @@ SnapBuildProcessChange(SnapBuild *builder, TransactionId xid, XLogRecPtr lsn)
/* only build a new snapshot if we don't have a prebuilt one */
if (builder->snapshot == NULL)
{
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, lsn);
/* increase refcount for the snapshot builder */
SnapBuildSnapIncRefcount(builder->snapshot);
}
@@ -1224,7 +1230,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
if (builder->snapshot)
SnapBuildSnapDecRefcount(builder->snapshot);
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, lsn);
/* we might need to execute invalidations, add snapshot */
if (!ReorderBufferXidHasBaseSnapshot(builder->reorder, xid))
@@ -2062,7 +2068,7 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
{
SnapBuildSnapDecRefcount(builder->snapshot);
}
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
SnapBuildSnapIncRefcount(builder->snapshot);
ReorderBufferSetRestartPoint(builder->reorder, lsn);
diff --git a/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c b/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
index c6baca1171..db6a2bcf1f 100644
--- a/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
+++ b/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
@@ -33,7 +33,8 @@ static void plugin_truncate(struct LogicalDecodingContext *ctx,
Relation relations[],
ReorderBufferChange *change);
static void store_change(LogicalDecodingContext *ctx,
- ConcurrentChangeKind kind, HeapTuple tuple);
+ ConcurrentChangeKind kind, HeapTuple tuple,
+ TransactionId xid);
void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
@@ -101,6 +102,7 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change)
{
ClusterDecodingState *dstate;
+ Snapshot snapshot;
dstate = (ClusterDecodingState *) ctx->output_writer_private;
@@ -108,6 +110,48 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (relation->rd_id != dstate->relid)
return;
+ /*
+ * Catalog snapshot is fine because the table we are processing is
+ * temporarily considered a user catalog table.
+ */
+ snapshot = GetCatalogSnapshot(InvalidOid);
+ Assert(snapshot->snapshot_type == SNAPSHOT_HISTORIC_MVCC);
+ Assert(!snapshot->suboverflowed);
+
+ /*
+ * This should not happen, but if we don't have enough information to
+ * apply a new snapshot, the consequences would be bad. Thus prefer ERROR
+ * to Assert().
+ */
+ if (XLogRecPtrIsInvalid(snapshot->lsn))
+ ereport(ERROR, (errmsg("snapshot has invalid LSN")));
+
+ /*
+ * reorderbuffer.c changes the catalog snapshot as soon as it sees a new
+ * CID or a commit record of a catalog-changing transaction.
+ */
+ if (dstate->snapshot == NULL || snapshot->lsn != dstate->snapshot_lsn ||
+ snapshot->curcid != dstate->snapshot->curcid)
+ {
+ /* CID should not go backwards. */
+ Assert(dstate->snapshot == NULL ||
+ snapshot->curcid >= dstate->snapshot->curcid ||
+ change->txn->xid != dstate->last_change_xid);
+
+ /*
+ * XXX Is it a problem that the copy is created in
+ * TopTransactionContext?
+ *
+ * XXX Wouldn't it be o.k. for SnapBuildMVCCFromHistoric() to set xcnt
+ * to 0 instead of converting xip in this case? The point is that
+ * transactions which are still in progress from the perspective of
+ * reorderbuffer.c could not be replayed yet, so we do not need to
+ * examine their XIDs.
+ */
+ dstate->snapshot = SnapBuildMVCCFromHistoric(snapshot, false);
+ dstate->snapshot_lsn = snapshot->lsn;
+ }
+
/* Decode entry depending on its type */
switch (change->action)
{
@@ -125,7 +169,7 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (newtuple == NULL)
elog(ERROR, "Incomplete insert info.");
- store_change(ctx, CHANGE_INSERT, newtuple);
+ store_change(ctx, CHANGE_INSERT, newtuple, change->txn->xid);
}
break;
case REORDER_BUFFER_CHANGE_UPDATE:
@@ -142,9 +186,11 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
elog(ERROR, "Incomplete update info.");
if (oldtuple != NULL)
- store_change(ctx, CHANGE_UPDATE_OLD, oldtuple);
+ store_change(ctx, CHANGE_UPDATE_OLD, oldtuple,
+ change->txn->xid);
- store_change(ctx, CHANGE_UPDATE_NEW, newtuple);
+ store_change(ctx, CHANGE_UPDATE_NEW, newtuple,
+ change->txn->xid);
}
break;
case REORDER_BUFFER_CHANGE_DELETE:
@@ -157,7 +203,7 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (oldtuple == NULL)
elog(ERROR, "Incomplete delete info.");
- store_change(ctx, CHANGE_DELETE, oldtuple);
+ store_change(ctx, CHANGE_DELETE, oldtuple, change->txn->xid);
}
break;
default:
@@ -191,13 +237,13 @@ plugin_truncate(struct LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (i == nrelations)
return;
- store_change(ctx, CHANGE_TRUNCATE, NULL);
+ store_change(ctx, CHANGE_TRUNCATE, NULL, InvalidTransactionId);
}
/* Store concurrent data change. */
static void
store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
- HeapTuple tuple)
+ HeapTuple tuple, TransactionId xid)
{
ClusterDecodingState *dstate;
char *change_raw;
@@ -257,6 +303,11 @@ store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
dst = (char *) change + sizeof(ConcurrentChange);
memcpy(dst, tuple->t_data, tuple->t_len);
+ /* Initialize the other fields. */
+ change->xid = xid;
+ change->snapshot = dstate->snapshot;
+ dstate->snapshot->active_count++;
+
/* The data has been copied. */
if (flattened)
pfree(tuple);
@@ -267,6 +318,9 @@ store:
isnull[0] = false;
tuplestore_putvalues(dstate->tstore, dstate->tupdesc_change,
values, isnull);
+#ifdef USE_ASSERT_CHECKING
+ dstate->last_change_xid = xid;
+#endif
/* Accounting. */
dstate->nchanges++;
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index e917d387d5..a6e3483394 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -317,21 +317,24 @@ extern BulkInsertState GetBulkInsertState(void);
extern void FreeBulkInsertState(BulkInsertState);
extern void ReleaseBulkInsertStatePin(BulkInsertState bistate);
-extern void heap_insert(Relation relation, HeapTuple tup, CommandId cid,
- int options, BulkInsertState bistate);
+extern void heap_insert(Relation relation, HeapTuple tup, TransactionId xid,
+ CommandId cid, int options, BulkInsertState bistate);
extern void heap_multi_insert(Relation relation, struct TupleTableSlot **slots,
int ntuples, CommandId cid, int options,
BulkInsertState bistate);
extern TM_Result heap_delete(Relation relation, ItemPointer tid,
- CommandId cid, Snapshot crosscheck, bool wait,
- struct TM_FailureData *tmfd, bool changingPart);
+ TransactionId xid, CommandId cid,
+ Snapshot crosscheck, bool wait,
+ struct TM_FailureData *tmfd, bool changingPart,
+ bool wal_logical);
extern void heap_finish_speculative(Relation relation, ItemPointer tid);
extern void heap_abort_speculative(Relation relation, ItemPointer tid);
extern TM_Result heap_update(Relation relation, ItemPointer otid,
- HeapTuple newtup,
+ HeapTuple newtup, TransactionId xid,
CommandId cid, Snapshot crosscheck, bool wait,
struct TM_FailureData *tmfd, LockTupleMode *lockmode,
- TU_UpdateIndexes *update_indexes);
+ TU_UpdateIndexes *update_indexes,
+ bool wal_logical);
extern TM_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
CommandId cid, LockTupleMode mode, LockWaitPolicy wait_policy,
bool follow_updates,
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 42736f37e7..1c5cb7c728 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -103,6 +103,8 @@
#define XLH_DELETE_CONTAINS_OLD_KEY (1<<2)
#define XLH_DELETE_IS_SUPER (1<<3)
#define XLH_DELETE_IS_PARTITION_MOVE (1<<4)
+/* See heap_delete() */
+#define XLH_DELETE_NO_LOGICAL (1<<5)
/* convenience macro for checking whether any form of old tuple was logged */
#define XLH_DELETE_CONTAINS_OLD \
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index fb64d7413a..2f9be7afaa 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -482,6 +482,8 @@ extern Size EstimateTransactionStateSpace(void);
extern void SerializeTransactionState(Size maxsize, char *start_address);
extern void StartParallelWorkerTransaction(char *tstatespace);
extern void EndParallelWorkerTransaction(void);
+extern void SetClusterCurrentXids(TransactionId *xip, int xcnt);
+extern void ResetClusterCurrentXids(void);
extern bool IsTransactionBlock(void);
extern bool IsTransactionOrTransactionBlock(void);
extern char TransactionBlockStatusCode(void);
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 959899a7cc..61ea314399 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -71,6 +71,14 @@ typedef struct ConcurrentChange
/* See the enum above. */
ConcurrentChangeKind kind;
+ /* Transaction that changes the data. */
+ TransactionId xid;
+
+ /*
+ * Historic catalog snapshot that was used to decode this change.
+ */
+ Snapshot snapshot;
+
/*
* The actual tuple.
*
@@ -99,6 +107,8 @@ typedef struct ClusterDecodingState
* tuplestore does this transparently.
*/
Tuplestorestate *tstore;
+ /* XID of the last change added to tstore. */
+ TransactionId last_change_xid PG_USED_FOR_ASSERTS_ONLY;
/* The current number of changes in tstore. */
double nchanges;
@@ -119,6 +129,14 @@ typedef struct ClusterDecodingState
/* Slot to retrieve data from tstore. */
TupleTableSlot *tsslot;
+ /*
+ * Historic catalog snapshot that was used to decode the most recent
+ * change.
+ */
+ Snapshot snapshot;
+ /* LSN of the record */
+ XLogRecPtr snapshot_lsn;
+
ResourceOwner resowner;
} ClusterDecodingState;
--
2.45.2
v05-0006-Add-regression-tests.patch (text/x-diff)
From 5381319122394bdc8ecb153382b76180650e4236 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Wed, 9 Oct 2024 09:44:20 +0200
Subject: [PATCH 6/8] Add regression tests.
As this patch series adds the CONCURRENTLY option to the VACUUM FULL and
CLUSTER commands, it's appropriate to test that the "concurrent data changes"
(i.e. changes done by applications while we are copying the table contents to
the new storage) are processed correctly.
Injection points are used to pause the data copying at a well-defined point.
While the backend in charge of the copying is waiting on the injection point,
another backend runs some INSERT, UPDATE and DELETE commands on the table. Then
we wake up the first backend and let the CLUSTER (CONCURRENTLY) command
finish. Finally, we check that all the "concurrent data changes" are present in
the table and that they contain the correct visibility information.
---
src/backend/commands/cluster.c | 7 +
src/test/modules/injection_points/Makefile | 3 +-
.../injection_points/expected/cluster.out | 113 ++++++++++++++
.../modules/injection_points/logical.conf | 1 +
src/test/modules/injection_points/meson.build | 3 +
.../injection_points/specs/cluster.spec | 140 ++++++++++++++++++
6 files changed, 266 insertions(+), 1 deletion(-)
create mode 100644 src/test/modules/injection_points/expected/cluster.out
create mode 100644 src/test/modules/injection_points/logical.conf
create mode 100644 src/test/modules/injection_points/specs/cluster.spec
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 23e40562bd..87f7106731 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -59,6 +59,7 @@
#include "utils/acl.h"
#include "utils/fmgroids.h"
#include "utils/guc.h"
+#include "utils/injection_point.h"
#include "utils/inval.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
@@ -3734,6 +3735,12 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
*/
ident_key = build_identity_key(ident_idx_new, OldHeap, &ident_key_nentries);
+ /*
+ * During testing, wait for another backend to perform concurrent data
+ * changes which we will process below.
+ */
+ INJECTION_POINT("cluster-concurrently-before-lock");
+
/*
* Flush all WAL records inserted so far (possibly except for the last
* incomplete page, see GetInsertRecPtr), to minimize the amount of data
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 8cb8c498e2..c2f7eb3c01 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -13,7 +13,8 @@ PGFILEDESC = "injection_points - facility for injection points"
REGRESS = injection_points reindex_conc
REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
-ISOLATION = inplace
+ISOLATION = inplace cluster
+ISOLATION_OPTS = --temp-config $(top_srcdir)/src/test/modules/injection_points/logical.conf
TAP_TESTS = 1
diff --git a/src/test/modules/injection_points/expected/cluster.out b/src/test/modules/injection_points/expected/cluster.out
new file mode 100644
index 0000000000..d84fff3693
--- /dev/null
+++ b/src/test/modules/injection_points/expected/cluster.out
@@ -0,0 +1,113 @@
+Parsed test spec with 2 sessions
+
+starting permutation: wait_before_lock change_existing change_new change_subxact1 change_subxact2 check2 wakeup_before_lock check1
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step wait_before_lock:
+ CLUSTER (CONCURRENTLY) clstr_test USING clstr_test_pkey;
+ <waiting ...>
+step change_existing:
+ UPDATE clstr_test SET i=10 where i=1;
+ UPDATE clstr_test SET j=20 where i=2;
+ UPDATE clstr_test SET i=30 where i=3;
+ UPDATE clstr_test SET i=40 where i=30;
+ DELETE FROM clstr_test WHERE i=4;
+
+step change_new:
+ INSERT INTO clstr_test(i, j) VALUES (5, 5), (6, 6), (7, 7), (8, 8);
+ UPDATE clstr_test SET i=50 where i=5;
+ UPDATE clstr_test SET j=60 where i=6;
+ DELETE FROM clstr_test WHERE i=7;
+
+step change_subxact1:
+ BEGIN;
+ INSERT INTO clstr_test(i, j) VALUES (100, 100);
+ SAVEPOINT s1;
+ UPDATE clstr_test SET i=101 where i=100;
+ SAVEPOINT s2;
+ UPDATE clstr_test SET i=102 where i=101;
+ COMMIT;
+
+step change_subxact2:
+ BEGIN;
+ SAVEPOINT s1;
+ INSERT INTO clstr_test(i, j) VALUES (110, 110);
+ ROLLBACK TO SAVEPOINT s1;
+ INSERT INTO clstr_test(i, j) VALUES (110, 111);
+ COMMIT;
+
+step check2:
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='clstr_test';
+
+ SELECT i, j FROM clstr_test ORDER BY i, j;
+
+ INSERT INTO data_s2(_xmin, _cmin, i, j)
+ SELECT xmin, cmin, i, j FROM clstr_test;
+
+ i| j
+---+---
+ 2| 20
+ 6| 60
+ 8| 8
+ 10| 1
+ 40| 3
+ 50| 5
+102|100
+110|111
+(8 rows)
+
+step wakeup_before_lock:
+ SELECT injection_points_wakeup('cluster-concurrently-before-lock');
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step wait_before_lock: <... completed>
+step check1:
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='clstr_test';
+
+ SELECT count(DISTINCT node) FROM relfilenodes;
+
+ SELECT i, j FROM clstr_test ORDER BY i, j;
+
+ INSERT INTO data_s1(_xmin, _cmin, i, j)
+ SELECT xmin, cmin, i, j FROM clstr_test;
+
+ SELECT count(*)
+ FROM data_s1 d1 FULL JOIN data_s2 d2 USING (_xmin, _cmin, i, j)
+ WHERE d1.i ISNULL OR d2.i ISNULL;
+
+count
+-----
+ 2
+(1 row)
+
+ i| j
+---+---
+ 2| 20
+ 6| 60
+ 8| 8
+ 10| 1
+ 40| 3
+ 50| 5
+102|100
+110|111
+(8 rows)
+
+count
+-----
+ 0
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
diff --git a/src/test/modules/injection_points/logical.conf b/src/test/modules/injection_points/logical.conf
new file mode 100644
index 0000000000..c8f264bc6c
--- /dev/null
+++ b/src/test/modules/injection_points/logical.conf
@@ -0,0 +1 @@
+wal_level = logical
\ No newline at end of file
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index fdb5a25d7b..890ca012d1 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -43,7 +43,10 @@ tests += {
'isolation': {
'specs': [
'inplace',
+ 'cluster',
],
+ # 'cluster' requires wal_level = 'logical'.
+ 'regress_args': ['--temp-config', files('logical.conf')],
},
'tap': {
'env': {
diff --git a/src/test/modules/injection_points/specs/cluster.spec b/src/test/modules/injection_points/specs/cluster.spec
new file mode 100644
index 0000000000..5f8404c5da
--- /dev/null
+++ b/src/test/modules/injection_points/specs/cluster.spec
@@ -0,0 +1,140 @@
+# Prefix the system columns with underscore as they are not allowed as column
+# names.
+setup
+{
+ CREATE EXTENSION injection_points;
+
+ CREATE TABLE clstr_test(i int PRIMARY KEY, j int);
+ INSERT INTO clstr_test(i, j) VALUES (1, 1), (2, 2), (3, 3), (4, 4);
+
+ CREATE TABLE relfilenodes(node oid);
+
+ CREATE TABLE data_s1(_xmin xid, _cmin cid, i int, j int);
+ CREATE TABLE data_s2(_xmin xid, _cmin cid, i int, j int);
+}
+
+teardown
+{
+ DROP TABLE clstr_test;
+ DROP EXTENSION injection_points;
+
+ DROP TABLE relfilenodes;
+ DROP TABLE data_s1;
+ DROP TABLE data_s2;
+}
+
+session s1
+setup
+{
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('cluster-concurrently-before-lock', 'wait');
+}
+# Perform the initial load and wait for s2 to do some data changes.
+step wait_before_lock
+{
+ CLUSTER (CONCURRENTLY) clstr_test USING clstr_test_pkey;
+}
+# Check the table from the perspective of s1.
+#
+# Besides the contents, we also check that relfilenode has changed.
+#
+# xmin and cmin columns are used to check that we do not change tuple
+# visibility information. Since we do not expect xmin to stay unchanged across
+# test runs, it cannot appear in the output text. Instead, have each session
+# write the contents into a table and use FULL JOIN to check if the outputs
+# are identical.
+step check1
+{
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='clstr_test';
+
+ SELECT count(DISTINCT node) FROM relfilenodes;
+
+ SELECT i, j FROM clstr_test ORDER BY i, j;
+
+ INSERT INTO data_s1(_xmin, _cmin, i, j)
+ SELECT xmin, cmin, i, j FROM clstr_test;
+
+ SELECT count(*)
+ FROM data_s1 d1 FULL JOIN data_s2 d2 USING (_xmin, _cmin, i, j)
+ WHERE d1.i ISNULL OR d2.i ISNULL;
+}
+teardown
+{
+ SELECT injection_points_detach('cluster-concurrently-before-lock');
+}
+
+session s2
+# Change the existing data. UPDATE changes both key and non-key columns. Also
+# update one row twice to test whether tuple version generated by this session
+# can be found.
+step change_existing
+{
+ UPDATE clstr_test SET i=10 where i=1;
+ UPDATE clstr_test SET j=20 where i=2;
+ UPDATE clstr_test SET i=30 where i=3;
+ UPDATE clstr_test SET i=40 where i=30;
+ DELETE FROM clstr_test WHERE i=4;
+}
+# Insert new rows and UPDATE / DELETE some of them. Again, update both key and
+# non-key column.
+step change_new
+{
+ INSERT INTO clstr_test(i, j) VALUES (5, 5), (6, 6), (7, 7), (8, 8);
+ UPDATE clstr_test SET i=50 where i=5;
+ UPDATE clstr_test SET j=60 where i=6;
+ DELETE FROM clstr_test WHERE i=7;
+}
+
+# When applying concurrent data changes, we should see the effects of an
+# in-progress subtransaction.
+step change_subxact1
+{
+ BEGIN;
+ INSERT INTO clstr_test(i, j) VALUES (100, 100);
+ SAVEPOINT s1;
+ UPDATE clstr_test SET i=101 where i=100;
+ SAVEPOINT s2;
+ UPDATE clstr_test SET i=102 where i=101;
+ COMMIT;
+}
+
+# When applying concurrent data changes, we should not see the effects of a
+# rolled back subtransaction.
+step change_subxact2
+{
+ BEGIN;
+ SAVEPOINT s1;
+ INSERT INTO clstr_test(i, j) VALUES (110, 110);
+ ROLLBACK TO SAVEPOINT s1;
+ INSERT INTO clstr_test(i, j) VALUES (110, 111);
+ COMMIT;
+}
+
+# Check the table from the perspective of s2.
+step check2
+{
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='clstr_test';
+
+ SELECT i, j FROM clstr_test ORDER BY i, j;
+
+ INSERT INTO data_s2(_xmin, _cmin, i, j)
+ SELECT xmin, cmin, i, j FROM clstr_test;
+}
+step wakeup_before_lock
+{
+ SELECT injection_points_wakeup('cluster-concurrently-before-lock');
+}
+
+# Test if data changes introduced while one session is performing CLUSTER
+# (CONCURRENTLY) find their way into the table.
+permutation
+ wait_before_lock
+ change_existing
+ change_new
+ change_subxact1
+ change_subxact2
+ check2
+ wakeup_before_lock
+ check1
--
2.45.2
v05-0007-Introduce-cluster_max_xlock_time-configuration-varia.patch (text/x-diff)
From e9d00f16d8ea47c05e70fbc384f10f4bbadf30f8 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Wed, 9 Oct 2024 09:44:20 +0200
Subject: [PATCH 7/8] Introduce cluster_max_xlock_time configuration variable.
When executing VACUUM FULL / CLUSTER (CONCURRENTLY), we need an
AccessExclusiveLock to swap the relation files, which should only take a
short time. However, on a busy system, other backends might change a
non-negligible amount of data in the table while we are waiting for the
lock. Since these changes must be applied to the new storage before the swap,
the time we eventually hold the lock might become non-negligible too.
Users worried about this situation can set cluster_max_xlock_time to the
maximum time for which the exclusive lock may be held. If that amount of time
is not sufficient to complete the VACUUM FULL / CLUSTER (CONCURRENTLY)
command, an ERROR is raised and the command is canceled.
---
doc/src/sgml/config.sgml | 32 +++++
doc/src/sgml/ref/cluster.sgml | 9 +-
src/backend/access/heap/heapam_handler.c | 3 +-
src/backend/commands/cluster.c | 133 +++++++++++++++---
src/backend/utils/misc/guc_tables.c | 14 ++
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/commands/cluster.h | 5 +-
.../injection_points/expected/cluster.out | 74 +++++++++-
.../injection_points/specs/cluster.spec | 42 ++++++
9 files changed, 293 insertions(+), 20 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 9707d5238d..34f3665014 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -10566,6 +10566,38 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
</listitem>
</varlistentry>
+ <varlistentry id="guc-cluster-max-xclock-time" xreflabel="cluster_max_xlock_time">
+ <term><varname>cluster_max_xlock_time</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>cluster_max_xlock_time</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ This is the maximum amount of time for which the commands
+ <command>CLUSTER</command> and <command>VACUUM FULL</command> with
+ the <literal>CONCURRENTLY</literal> option may hold an exclusive
+ lock on a table. Typically, these commands should not need the lock
+ for longer than <command>TRUNCATE</command> does. However,
+ additional time might be needed if the system is very busy. (See
+ <xref linkend="sql-cluster"/> for an explanation of how
+ the <literal>CONCURRENTLY</literal> option works.)
+ </para>
+
+ <para>
+ If you want to restrict the lock time, set this variable to the
+ highest acceptable value. If the concurrent data changes cannot be
+ processed within this time, the command is cancelled with an error
+ and the lock is released.
+ </para>
+
+ <para>
+ The default value is 0, which means that the lock is not released
+ until the concurrent data changes are processed.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect1>
diff --git a/doc/src/sgml/ref/cluster.sgml b/doc/src/sgml/ref/cluster.sgml
index d8c3edb432..182e4f7592 100644
--- a/doc/src/sgml/ref/cluster.sgml
+++ b/doc/src/sgml/ref/cluster.sgml
@@ -141,7 +141,14 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
(<xref linkend="logicaldecoding"/>) and applied before
the <literal>ACCESS EXCLUSIVE</literal> lock is requested. Thus the lock
is typically held only for the time needed to swap the files, which
- should be pretty short.
+ should be pretty short. However, the time might still be noticeable if
+ too many data changes have been done to the table while
+ <command>CLUSTER</command> was waiting for the lock: those changes must
+ be processed just before the files are swapped, while the
+ <literal>ACCESS EXCLUSIVE</literal> lock is being held. If you are
+ worried about this situation, set
+ the <link linkend="guc-cluster-max-xclock-time"><varname>cluster_max_xlock_time</varname></link>
+ configuration parameter to a value that your applications can tolerate.
</para>
<para>
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 21fd4e8977..cae8bd7dea 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1005,7 +1005,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
end_of_wal = GetFlushRecPtr(NULL);
if ((end_of_wal - end_of_wal_prev) > wal_segment_size)
{
- cluster_decode_concurrent_changes(decoding_ctx, end_of_wal);
+ cluster_decode_concurrent_changes(decoding_ctx, end_of_wal,
+ NULL);
end_of_wal_prev = end_of_wal;
}
}
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 87f7106731..a2c072a223 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -17,6 +17,8 @@
*/
#include "postgres.h"
+#include <sys/time.h>
+
#include "access/amapi.h"
#include "access/heapam.h"
#include "access/multixact.h"
@@ -103,6 +105,15 @@ RelFileLocator clustered_rel_toast_locator = {.relNumber = InvalidOid};
#define CLUSTER_IN_PROGRESS_MESSAGE \
"relation \"%s\" is already being processed by CLUSTER CONCURRENTLY"
+/*
+ * The maximum time to hold AccessExclusiveLock during the final
+ * processing. Note that only the execution time of
+ * process_concurrent_changes() is included here. The very last steps like
+ * swap_relation_files() shouldn't get blocked and it'd be wrong to consider
+ * them a reason to abort otherwise completed processing.
+ */
+int cluster_max_xlock_time = 0;
+
/*
* Everything we need to call ExecInsertIndexTuples().
*/
@@ -189,7 +200,8 @@ static LogicalDecodingContext *setup_logical_decoding(Oid relid,
static HeapTuple get_changed_tuple(ConcurrentChange *change);
static void apply_concurrent_changes(ClusterDecodingState *dstate,
Relation rel, ScanKey key, int nkeys,
- IndexInsertState *iistate);
+ IndexInsertState *iistate,
+ struct timeval *must_complete);
static void apply_concurrent_insert(Relation rel, ConcurrentChange *change,
HeapTuple tup, IndexInsertState *iistate,
TupleTableSlot *index_slot);
@@ -206,13 +218,15 @@ static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
IndexInsertState *iistate,
TupleTableSlot *ident_slot,
IndexScanDesc *scan_p);
-static void process_concurrent_changes(LogicalDecodingContext *ctx,
+static bool process_concurrent_changes(LogicalDecodingContext *ctx,
XLogRecPtr end_of_wal,
Relation rel_dst,
Relation rel_src,
ScanKey ident_key,
int ident_key_nentries,
- IndexInsertState *iistate);
+ IndexInsertState *iistate,
+ struct timeval *must_complete);
+static bool processing_time_elapsed(struct timeval *must_complete);
static IndexInsertState *get_index_insert_state(Relation relation,
Oid ident_index_id);
static ScanKey build_identity_key(Oid ident_idx_oid, Relation rel_src,
@@ -3040,7 +3054,8 @@ get_changed_tuple(ConcurrentChange *change)
*/
void
cluster_decode_concurrent_changes(LogicalDecodingContext *ctx,
- XLogRecPtr end_of_wal)
+ XLogRecPtr end_of_wal,
+ struct timeval *must_complete)
{
ClusterDecodingState *dstate;
ResourceOwner resowner_old;
@@ -3078,6 +3093,9 @@ cluster_decode_concurrent_changes(LogicalDecodingContext *ctx,
if (record != NULL)
LogicalDecodingProcessRecord(ctx, ctx->reader);
+ if (processing_time_elapsed(must_complete))
+ break;
+
/*
* If WAL segment boundary has been crossed, inform the decoding
* system that the catalog_xmin can advance. (We can confirm more
@@ -3120,7 +3138,8 @@ cluster_decode_concurrent_changes(LogicalDecodingContext *ctx,
*/
static void
apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
- ScanKey key, int nkeys, IndexInsertState *iistate)
+ ScanKey key, int nkeys, IndexInsertState *iistate,
+ struct timeval *must_complete)
{
TupleTableSlot *index_slot, *ident_slot;
HeapTuple tup_old = NULL;
@@ -3150,6 +3169,9 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
CHECK_FOR_INTERRUPTS();
+ Assert(dstate->nchanges > 0);
+ dstate->nchanges--;
+
/* Get the change from the single-column tuple. */
tup_change = ExecFetchSlotHeapTuple(dstate->tsslot, false, &shouldFree);
heap_deform_tuple(tup_change, dstate->tupdesc_change, values, isnull);
@@ -3274,10 +3296,22 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
/* TTSOpsMinimalTuple has .get_heap_tuple==NULL. */
Assert(shouldFree);
pfree(tup_change);
+
+ /*
+ * If there is a limit on the time of completion, check it
+ * now. However, make sure the loop does not break if tup_old was set
+ * in the previous iteration. In such a case we could not resume the
+ * processing in the next call.
+ */
+ if (must_complete && tup_old == NULL &&
+ processing_time_elapsed(must_complete))
+ /* The next call will process the remaining changes. */
+ break;
}
- tuplestore_clear(dstate->tstore);
- dstate->nchanges = 0;
+ /* If we could not apply all the changes, the next call will do. */
+ if (dstate->nchanges == 0)
+ tuplestore_clear(dstate->tstore);
/* Cleanup. */
ExecDropSingleTupleTableSlot(index_slot);
@@ -3480,11 +3514,15 @@ find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
* Decode and apply concurrent changes.
*
* Pass rel_src iff its reltoastrelid is needed.
+ *
+ * Returns true if must_complete is NULL or if managed to complete by the time
+ * *must_complete indicates.
*/
-static void
+static bool
process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
Relation rel_dst, Relation rel_src, ScanKey ident_key,
- int ident_key_nentries, IndexInsertState *iistate)
+ int ident_key_nentries, IndexInsertState *iistate,
+ struct timeval *must_complete)
{
ClusterDecodingState *dstate;
@@ -3493,10 +3531,19 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
dstate = (ClusterDecodingState *) ctx->output_writer_private;
- cluster_decode_concurrent_changes(ctx, end_of_wal);
+ cluster_decode_concurrent_changes(ctx, end_of_wal, must_complete);
+ if (processing_time_elapsed(must_complete))
+ /* Caller is responsible for applying the changes. */
+ return false;
+
+ /*
+ * *must_complete not reached, so there are really no changes. (It's
+ * possible to see no changes just because not enough time was left for
+ * the decoding.)
+ */
if (dstate->nchanges == 0)
- return;
+ return true;
PG_TRY();
{
@@ -3508,7 +3555,7 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
rel_dst->rd_toastoid = rel_src->rd_rel->reltoastrelid;
apply_concurrent_changes(dstate, rel_dst, ident_key,
- ident_key_nentries, iistate);
+ ident_key_nentries, iistate, must_complete);
}
PG_FINALLY();
{
@@ -3518,6 +3565,28 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
rel_dst->rd_toastoid = InvalidOid;
}
PG_END_TRY();
+
+ /*
+ * apply_concurrent_changes() does check the processing time, so if some
+ * changes are left, we ran out of time.
+ */
+ return dstate->nchanges == 0;
+}
+
+/*
+ * Check if the current time is beyond *must_complete.
+ */
+static bool
+processing_time_elapsed(struct timeval *must_complete)
+{
+ struct timeval now;
+
+ if (must_complete == NULL)
+ return false;
+
+ gettimeofday(&now, NULL);
+
+ return timercmp(&now, must_complete, >);
}
static IndexInsertState *
@@ -3678,6 +3747,8 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
RelReopenInfo *rri = NULL;
int nrel;
Relation *ind_refs_all, *ind_refs_p;
+ struct timeval t_end;
+ struct timeval *t_end_ptr = NULL;
/* Like in cluster_rel(). */
lockmode_old = LOCK_CLUSTER_CONCURRENT;
@@ -3757,7 +3828,8 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
*/
process_concurrent_changes(ctx, end_of_wal, NewHeap,
swap_toast_by_content ? OldHeap : NULL,
- ident_key, ident_key_nentries, iistate);
+ ident_key, ident_key_nentries, iistate,
+ NULL);
/*
* Release the locks that allowed concurrent data changes, in order to
@@ -3879,9 +3951,38 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
end_of_wal = GetFlushRecPtr(NULL);
/* Apply the concurrent changes again. */
- process_concurrent_changes(ctx, end_of_wal, NewHeap,
- swap_toast_by_content ? OldHeap : NULL,
- ident_key, ident_key_nentries, iistate);
+ /*
+ * This time we have the exclusive lock on the table, so make sure that
+ * cluster_max_xlock_time is not exceeded.
+ */
+ if (cluster_max_xlock_time > 0)
+ {
+ int64 usec;
+ struct timeval t_start;
+
+ gettimeofday(&t_start, NULL);
+ /* Add the whole seconds. */
+ t_end.tv_sec = t_start.tv_sec + cluster_max_xlock_time / 1000;
+ /* Add the rest, expressed in microseconds. */
+ usec = t_start.tv_usec + 1000 * (cluster_max_xlock_time % 1000);
+ /* The number of microseconds could have overflown. */
+ t_end.tv_sec += usec / USECS_PER_SEC;
+ t_end.tv_usec = usec % USECS_PER_SEC;
+ t_end_ptr = &t_end;
+ }
+ /*
+ * During testing, stop here to simulate excessive processing time.
+ */
+ INJECTION_POINT("cluster-concurrently-after-lock");
+
+ if (!process_concurrent_changes(ctx, end_of_wal, NewHeap,
+ swap_toast_by_content ? OldHeap : NULL,
+ ident_key, ident_key_nentries, iistate,
+ t_end_ptr))
+ ereport(ERROR,
+ (errmsg("could not process concurrent data changes in time"),
+ errhint("Please consider adjusting \"cluster_max_xlock_time\".")));
+
/* Remember info about rel before closing OldHeap */
relpersistence = OldHeap->rd_rel->relpersistence;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 686309db58..02d3805475 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -39,6 +39,7 @@
#include "catalog/namespace.h"
#include "catalog/storage.h"
#include "commands/async.h"
+#include "commands/cluster.h"
#include "commands/event_trigger.h"
#include "commands/tablespace.h"
#include "commands/trigger.h"
@@ -2772,6 +2773,19 @@ struct config_int ConfigureNamesInt[] =
1600000000, 0, 2100000000,
NULL, NULL, NULL
},
+ {
+ {"cluster_max_xlock_time", PGC_USERSET, LOCK_MANAGEMENT,
+ gettext_noop("Maximum time for VACUUM FULL / CLUSTER (CONCURRENTLY) to keep the table locked."),
+ gettext_noop(
+ "The table is locked in exclusive mode during the final stage of processing. "
+ "If the lock time exceeds this value, an error is raised and the lock is "
+ "released. Set to zero if you don't care how long the lock can be held."),
+ GUC_UNIT_MS
+ },
+ &cluster_max_xlock_time,
+ 0, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
/*
* See also CheckRequiredParameterValues() if this parameter changes
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 667e0dc40a..9dc060c59f 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -724,6 +724,7 @@
#vacuum_multixact_freeze_table_age = 150000000
#vacuum_multixact_freeze_min_age = 5000000
#vacuum_multixact_failsafe_age = 1600000000
+#cluster_max_xlock_time = 0
#bytea_output = 'hex' # hex, escape
#xmlbinary = 'base64'
#xmloption = 'content'
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 61ea314399..5d904ce985 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -44,6 +44,8 @@ typedef struct ClusterParams
extern RelFileLocator clustered_rel_locator;
extern RelFileLocator clustered_rel_toast_locator;
+extern PGDLLIMPORT int cluster_max_xlock_time;
+
/*
* Lock level for the concurrent variant of CLUSTER / VACUUM FULL.
*
@@ -149,7 +151,8 @@ extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
extern bool check_relation_is_clusterable_concurrently(Relation rel, int elevel,
const char *stmt);
extern void cluster_decode_concurrent_changes(LogicalDecodingContext *ctx,
- XLogRecPtr end_of_wal);
+ XLogRecPtr end_of_wal,
+ struct timeval *must_complete);
extern Oid make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
char relpersistence, LOCKMODE lockmode_old,
LOCKMODE *lockmode_new_p);
diff --git a/src/test/modules/injection_points/expected/cluster.out b/src/test/modules/injection_points/expected/cluster.out
index d84fff3693..646e31448f 100644
--- a/src/test/modules/injection_points/expected/cluster.out
+++ b/src/test/modules/injection_points/expected/cluster.out
@@ -1,4 +1,4 @@
-Parsed test spec with 2 sessions
+Parsed test spec with 4 sessions
starting permutation: wait_before_lock change_existing change_new change_subxact1 change_subxact2 check2 wakeup_before_lock check1
injection_points_attach
@@ -111,3 +111,75 @@ injection_points_detach
(1 row)
+injection_points_detach
+-----------------------
+
+(1 row)
+
+
+starting permutation: wait_after_lock wakeup_after_lock
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step wait_after_lock:
+ CLUSTER (CONCURRENTLY) clstr_test USING clstr_test_pkey;
+ <waiting ...>
+step wakeup_after_lock:
+ SELECT injection_points_wakeup('cluster-concurrently-after-lock');
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step wait_after_lock: <... completed>
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+
+starting permutation: wait_after_lock after_lock_delay wakeup_after_lock
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step wait_after_lock:
+ CLUSTER (CONCURRENTLY) clstr_test USING clstr_test_pkey;
+ <waiting ...>
+step after_lock_delay:
+ SELECT pg_sleep(1.5);
+
+pg_sleep
+--------
+
+(1 row)
+
+step wakeup_after_lock:
+ SELECT injection_points_wakeup('cluster-concurrently-after-lock');
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step wait_after_lock: <... completed>
+ERROR: could not process concurrent data changes in time
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
diff --git a/src/test/modules/injection_points/specs/cluster.spec b/src/test/modules/injection_points/specs/cluster.spec
index 5f8404c5da..9af41bac6d 100644
--- a/src/test/modules/injection_points/specs/cluster.spec
+++ b/src/test/modules/injection_points/specs/cluster.spec
@@ -127,6 +127,34 @@ step wakeup_before_lock
SELECT injection_points_wakeup('cluster-concurrently-before-lock');
}
+session s3
+setup
+{
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('cluster-concurrently-after-lock', 'wait');
+ SET cluster_max_xlock_time TO '1s';
+}
+# Perform the initial load, lock the table in exclusive mode and wait. s4 will
+# cancel the waiting.
+step wait_after_lock
+{
+ CLUSTER (CONCURRENTLY) clstr_test USING clstr_test_pkey;
+}
+teardown
+{
+ SELECT injection_points_detach('cluster-concurrently-after-lock');
+}
+
+session s4
+step wakeup_after_lock
+{
+ SELECT injection_points_wakeup('cluster-concurrently-after-lock');
+}
+step after_lock_delay
+{
+ SELECT pg_sleep(1.5);
+}
+
# Test if data changes introduced while one session is performing CLUSTER
# (CONCURRENTLY) find their way into the table.
permutation
@@ -138,3 +166,17 @@ permutation
check2
wakeup_before_lock
check1
+
+# Test the cluster_max_xlock_time configuration variable.
+#
+# First, cancel waiting on the injection point immediately. That way, CLUSTER
+# should complete.
+permutation
+ wait_after_lock
+ wakeup_after_lock
+# Second, cancel the waiting with a delay that violates
+# cluster_max_xlock_time.
+permutation
+ wait_after_lock
+ after_lock_delay
+ wakeup_after_lock
--
2.45.2
v05-0008-Call-logical_rewrite_heap_tuple-when-applying-concur.patch (text/x-diff)
From a10e3f87fb3e8495ccd46013b047594d7d562de5 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Wed, 9 Oct 2024 09:44:21 +0200
Subject: [PATCH 8/8] Call logical_rewrite_heap_tuple() when applying
concurrent data changes.
This was implemented for the sake of completeness, but I think it's currently
not needed. Possible use cases could be:
1. VACUUM FULL / CLUSTER CONCURRENTLY can process system catalogs.
System catalogs are scanned using a historic snapshot during logical decoding,
and the "combo CIDs" information is needed for that. Since "combo CID" is
associated with the "file locator" and that locator is changed by VACUUM FULL
/ CLUSTER, these commands must record the information on individual tuples
being moved from the old file to the new one. This is what
logical_rewrite_heap_tuple() does.
However, the logical decoding subsystem currently does not support decoding of
data changes in the system catalog. Therefore, the CONCURRENTLY option cannot
be used for system catalogs.
2. VACUUM FULL / CLUSTER CONCURRENTLY is processing a relation, but once it
has released all the locks (in order to get the exclusive lock), another
backend runs VACUUM FULL / CLUSTER CONCURRENTLY on the same table. Since the
relation is treated as a system catalog while these commands are processing it
(so that it can be scanned using a historic snapshot during the "initial
load"), it is important that the second backend does not break decoding of the
"combo CIDs" performed by the first backend.
However, it's not practical to let multiple backends run VACUUM FULL / CLUSTER
CONCURRENTLY on the same relation, so we forbid that.
---
src/backend/access/heap/heapam_handler.c | 2 +-
src/backend/access/heap/rewriteheap.c | 65 +++++-----
src/backend/commands/cluster.c | 113 ++++++++++++++----
src/backend/replication/logical/decode.c | 41 ++++++-
.../pgoutput_cluster/pgoutput_cluster.c | 21 ++--
src/include/access/rewriteheap.h | 5 +-
src/include/commands/cluster.h | 3 +
src/include/replication/reorderbuffer.h | 7 ++
8 files changed, 195 insertions(+), 62 deletions(-)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index cae8bd7dea..15b9a30e2f 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -731,7 +731,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
/* Initialize the rewrite operation */
rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, *xid_cutoff,
- *multi_cutoff);
+ *multi_cutoff, true);
/* Set up sorting if wanted */
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 09ef220449..86881e8638 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -214,10 +214,8 @@ static void raw_heap_insert(RewriteState state, HeapTuple tup);
/* internal logical remapping prototypes */
static void logical_begin_heap_rewrite(RewriteState state);
-static void logical_rewrite_heap_tuple(RewriteState state, ItemPointerData old_tid, HeapTuple new_tuple);
static void logical_end_heap_rewrite(RewriteState state);
-
/*
* Begin a rewrite of a table
*
@@ -226,18 +224,19 @@ static void logical_end_heap_rewrite(RewriteState state);
* oldest_xmin xid used by the caller to determine which tuples are dead
* freeze_xid xid before which tuples will be frozen
* cutoff_multi multixact before which multis will be removed
+ * tid_chains need to maintain TID chains?
*
* Returns an opaque RewriteState, allocated in current memory context,
* to be used in subsequent calls to the other functions.
*/
RewriteState
begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xmin,
- TransactionId freeze_xid, MultiXactId cutoff_multi)
+ TransactionId freeze_xid, MultiXactId cutoff_multi,
+ bool tid_chains)
{
RewriteState state;
MemoryContext rw_cxt;
MemoryContext old_cxt;
- HASHCTL hash_ctl;
/*
* To ease cleanup, make a separate context that will contain the
@@ -262,29 +261,34 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
state->rs_cxt = rw_cxt;
state->rs_bulkstate = smgr_bulk_start_rel(new_heap, MAIN_FORKNUM);
- /* Initialize hash tables used to track update chains */
- hash_ctl.keysize = sizeof(TidHashKey);
- hash_ctl.entrysize = sizeof(UnresolvedTupData);
- hash_ctl.hcxt = state->rs_cxt;
-
- state->rs_unresolved_tups =
- hash_create("Rewrite / Unresolved ctids",
- 128, /* arbitrary initial size */
- &hash_ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
- hash_ctl.entrysize = sizeof(OldToNewMappingData);
+ if (tid_chains)
+ {
+ HASHCTL hash_ctl;
+
+ /* Initialize hash tables used to track update chains */
+ hash_ctl.keysize = sizeof(TidHashKey);
+ hash_ctl.entrysize = sizeof(UnresolvedTupData);
+ hash_ctl.hcxt = state->rs_cxt;
+
+ state->rs_unresolved_tups =
+ hash_create("Rewrite / Unresolved ctids",
+ 128, /* arbitrary initial size */
+ &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+ hash_ctl.entrysize = sizeof(OldToNewMappingData);
+
+ state->rs_old_new_tid_map =
+ hash_create("Rewrite / Old to new tid map",
+ 128, /* arbitrary initial size */
+ &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ }
- state->rs_old_new_tid_map =
- hash_create("Rewrite / Old to new tid map",
- 128, /* arbitrary initial size */
- &hash_ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ logical_begin_heap_rewrite(state);
MemoryContextSwitchTo(old_cxt);
- logical_begin_heap_rewrite(state);
-
return state;
}
@@ -303,12 +307,15 @@ end_heap_rewrite(RewriteState state)
* Write any remaining tuples in the UnresolvedTups table. If we have any
* left, they should in fact be dead, but let's err on the safe side.
*/
- hash_seq_init(&seq_status, state->rs_unresolved_tups);
-
- while ((unresolved = hash_seq_search(&seq_status)) != NULL)
+ if (state->rs_unresolved_tups)
{
- ItemPointerSetInvalid(&unresolved->tuple->t_data->t_ctid);
- raw_heap_insert(state, unresolved->tuple);
+ hash_seq_init(&seq_status, state->rs_unresolved_tups);
+
+ while ((unresolved = hash_seq_search(&seq_status)) != NULL)
+ {
+ ItemPointerSetInvalid(&unresolved->tuple->t_data->t_ctid);
+ raw_heap_insert(state, unresolved->tuple);
+ }
}
/* Write the last page, if any */
@@ -995,7 +1002,7 @@ logical_rewrite_log_mapping(RewriteState state, TransactionId xid,
* Perform logical remapping for a tuple that's mapped from old_tid to
* new_tuple->t_self by rewrite_heap_tuple() if necessary for the tuple.
*/
-static void
+void
logical_rewrite_heap_tuple(RewriteState state, ItemPointerData old_tid,
HeapTuple new_tuple)
{
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index a2c072a223..59343e0290 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -23,6 +23,7 @@
#include "access/heapam.h"
#include "access/multixact.h"
#include "access/relscan.h"
+#include "access/rewriteheap.h"
#include "access/tableam.h"
#include "access/toast_internals.h"
#include "access/transam.h"
@@ -201,17 +202,21 @@ static HeapTuple get_changed_tuple(ConcurrentChange *change);
static void apply_concurrent_changes(ClusterDecodingState *dstate,
Relation rel, ScanKey key, int nkeys,
IndexInsertState *iistate,
- struct timeval *must_complete);
+ struct timeval *must_complete,
+ RewriteState rwstate);
static void apply_concurrent_insert(Relation rel, ConcurrentChange *change,
HeapTuple tup, IndexInsertState *iistate,
- TupleTableSlot *index_slot);
+ TupleTableSlot *index_slot,
+ RewriteState rwstate);
static void apply_concurrent_update(Relation rel, HeapTuple tup,
HeapTuple tup_target,
ConcurrentChange *change,
IndexInsertState *iistate,
- TupleTableSlot *index_slot);
+ TupleTableSlot *index_slot,
+ RewriteState rwstate);
static void apply_concurrent_delete(Relation rel, HeapTuple tup_target,
- ConcurrentChange *change);
+ ConcurrentChange *change,
+ RewriteState rwstate);
static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
HeapTuple tup_key,
Snapshot snapshot,
@@ -225,7 +230,8 @@ static bool process_concurrent_changes(LogicalDecodingContext *ctx,
ScanKey ident_key,
int ident_key_nentries,
IndexInsertState *iistate,
- struct timeval *must_complete);
+ struct timeval *must_complete,
+ RewriteState rwstate);
static bool processing_time_elapsed(struct timeval *must_complete);
static IndexInsertState *get_index_insert_state(Relation relation,
Oid ident_index_id);
@@ -761,8 +767,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
begin_concurrent_cluster(&OldHeap, index_p, &entered);
}
- rebuild_relation(OldHeap, index, verbose,
- (params->options & CLUOPT_CONCURRENT) != 0);
+ rebuild_relation(OldHeap, index, verbose, concurrent);
success = true;
}
PG_FINALLY();
@@ -3139,7 +3144,7 @@ cluster_decode_concurrent_changes(LogicalDecodingContext *ctx,
static void
apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
ScanKey key, int nkeys, IndexInsertState *iistate,
- struct timeval *must_complete)
+ struct timeval *must_complete, RewriteState rwstate)
{
TupleTableSlot *index_slot, *ident_slot;
HeapTuple tup_old = NULL;
@@ -3213,7 +3218,8 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
{
Assert(tup_old == NULL);
- apply_concurrent_insert(rel, change, tup, iistate, index_slot);
+ apply_concurrent_insert(rel, change, tup, iistate, index_slot,
+ rwstate);
pfree(tup);
}
@@ -3221,7 +3227,7 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
change->kind == CHANGE_DELETE)
{
IndexScanDesc ind_scan = NULL;
- HeapTuple tup_key;
+ HeapTuple tup_key, tup_exist_cp;
if (change->kind == CHANGE_UPDATE_NEW)
{
@@ -3263,11 +3269,23 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
if (tup_exist == NULL)
elog(ERROR, "Failed to find target tuple");
+ /*
+ * Update the mapping for xmax of the old version.
+ *
+ * Use a copy ('tup_exist' can point to shared buffer) with xmin
+ * invalid because mapping of that should have been written on
+ * insertion.
+ */
+ tup_exist_cp = heap_copytuple(tup_exist);
+ HeapTupleHeaderSetXmin(tup_exist_cp->t_data, InvalidTransactionId);
+ logical_rewrite_heap_tuple(rwstate, change->old_tid, tup_exist_cp);
+ pfree(tup_exist_cp);
+
if (change->kind == CHANGE_UPDATE_NEW)
apply_concurrent_update(rel, tup, tup_exist, change, iistate,
- index_slot);
+ index_slot, rwstate);
else
- apply_concurrent_delete(rel, tup_exist, change);
+ apply_concurrent_delete(rel, tup_exist, change, rwstate);
ResetClusterCurrentXids();
@@ -3320,9 +3338,12 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
static void
apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
- IndexInsertState *iistate, TupleTableSlot *index_slot)
+ IndexInsertState *iistate, TupleTableSlot *index_slot,
+ RewriteState rwstate)
{
+ HeapTupleHeader tup_hdr = tup->t_data;
Snapshot snapshot = change->snapshot;
+ ItemPointerData old_tid;
List *recheck;
/*
@@ -3332,6 +3353,9 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
*/
SetClusterCurrentXids(snapshot->subxip, snapshot->subxcnt);
+ /* Remember location in the old heap. */
+ ItemPointerCopy(&tup_hdr->t_ctid, &old_tid);
+
/*
* Write the tuple into the new heap.
*
@@ -3347,6 +3371,14 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
heap_insert(rel, tup, change->xid, snapshot->curcid - 1,
HEAP_INSERT_NO_LOGICAL, NULL);
+ /*
+ * Update the mapping for xmin. (xmax should be invalid). This is needed
+	 * because, during the processing, the table is considered a "user
+	 * catalog".
+ */
+ Assert(!TransactionIdIsValid(HeapTupleHeaderGetRawXmax(tup->t_data)));
+ logical_rewrite_heap_tuple(rwstate, old_tid, tup);
+
/*
* Update indexes.
*
@@ -3380,15 +3412,22 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
static void
apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
ConcurrentChange *change, IndexInsertState *iistate,
- TupleTableSlot *index_slot)
+ TupleTableSlot *index_slot, RewriteState rwstate)
{
List *recheck;
LockTupleMode lockmode;
TU_UpdateIndexes update_indexes;
+ ItemPointerData tid_new_old_heap, tid_old_new_heap;
TM_Result res;
Snapshot snapshot = change->snapshot;
TM_FailureData tmfd;
+ /* Location of the new tuple in the old heap. */
+ ItemPointerCopy(&tup->t_data->t_ctid, &tid_new_old_heap);
+
+ /* Location of the existing tuple in the new heap. */
+ ItemPointerCopy(&tup_target->t_self, &tid_old_new_heap);
+
/*
* Write the new tuple into the new heap. ('tup' gets the TID assigned
* here.)
@@ -3398,7 +3437,7 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
Assert(snapshot->curcid != InvalidCommandId &&
snapshot->curcid > FirstCommandId);
- res = heap_update(rel, &tup_target->t_self, tup,
+ res = heap_update(rel, &tid_old_new_heap, tup,
change->xid, snapshot->curcid - 1,
InvalidSnapshot,
false, /* no wait - only we are doing changes */
@@ -3408,6 +3447,10 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
if (res != TM_Ok)
ereport(ERROR, (errmsg("failed to apply concurrent UPDATE")));
+ /* Update the mapping for xmin of the new version. */
+ Assert(!TransactionIdIsValid(HeapTupleHeaderGetRawXmax(tup->t_data)));
+ logical_rewrite_heap_tuple(rwstate, tid_new_old_heap, tup);
+
ExecStoreHeapTuple(tup, index_slot, false);
if (update_indexes != TU_None)
@@ -3431,8 +3474,9 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
static void
apply_concurrent_delete(Relation rel, HeapTuple tup_target,
- ConcurrentChange *change)
+ ConcurrentChange *change, RewriteState rwstate)
{
+ ItemPointerData tid_old_new_heap;
TM_Result res;
TM_FailureData tmfd;
Snapshot snapshot = change->snapshot;
@@ -3441,7 +3485,10 @@ apply_concurrent_delete(Relation rel, HeapTuple tup_target,
Assert(snapshot->curcid != InvalidCommandId &&
snapshot->curcid > FirstCommandId);
- res = heap_delete(rel, &tup_target->t_self, change->xid,
+ /* Location of the existing tuple in the new heap. */
+ ItemPointerCopy(&tup_target->t_self, &tid_old_new_heap);
+
+ res = heap_delete(rel, &tid_old_new_heap, change->xid,
snapshot->curcid - 1, InvalidSnapshot, false,
&tmfd, false,
/* wal_logical */
@@ -3522,7 +3569,8 @@ static bool
process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
Relation rel_dst, Relation rel_src, ScanKey ident_key,
int ident_key_nentries, IndexInsertState *iistate,
- struct timeval *must_complete)
+ struct timeval *must_complete,
+ RewriteState rwstate)
{
ClusterDecodingState *dstate;
@@ -3555,7 +3603,8 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
rel_dst->rd_toastoid = rel_src->rd_rel->reltoastrelid;
apply_concurrent_changes(dstate, rel_dst, ident_key,
- ident_key_nentries, iistate, must_complete);
+ ident_key_nentries, iistate, must_complete,
+ rwstate);
}
PG_FINALLY();
{
@@ -3740,6 +3789,7 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
bool is_system_catalog;
Oid ident_idx_old, ident_idx_new;
IndexInsertState *iistate;
+ RewriteState rwstate;
ScanKey ident_key;
int ident_key_nentries;
XLogRecPtr wal_insert_ptr, end_of_wal;
@@ -3825,11 +3875,26 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
* Apply concurrent changes first time, to minimize the time we need to
* hold AccessExclusiveLock. (Quite some amount of WAL could have been
* written during the data copying and index creation.)
+ *
+ * Now we are processing individual tuples, so pass false for
+ * 'tid_chains'. Since rwstate is now only needed for
+ * logical_begin_heap_rewrite(), none of the transaction IDs needs to be
+ * valid.
*/
+ rwstate = begin_heap_rewrite(OldHeap, NewHeap,
+ InvalidTransactionId,
+ InvalidTransactionId,
+ InvalidTransactionId,
+ false);
process_concurrent_changes(ctx, end_of_wal, NewHeap,
swap_toast_by_content ? OldHeap : NULL,
ident_key, ident_key_nentries, iistate,
- NULL);
+ NULL, rwstate);
+ /*
+ * OldHeap will be closed, so we need to initialize rwstate again for the
+ * next call of process_concurrent_changes().
+ */
+ end_heap_rewrite(rwstate);
/*
* Release the locks that allowed concurrent data changes, in order to
@@ -3951,6 +4016,11 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
end_of_wal = GetFlushRecPtr(NULL);
/* Apply the concurrent changes again. */
+ rwstate = begin_heap_rewrite(OldHeap, NewHeap,
+ InvalidTransactionId,
+ InvalidTransactionId,
+ InvalidTransactionId,
+ false);
/*
* This time we have the exclusive lock on the table, so make sure that
* cluster_max_xlock_time is not exceeded.
@@ -3978,11 +4048,12 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
if (!process_concurrent_changes(ctx, end_of_wal, NewHeap,
swap_toast_by_content ? OldHeap : NULL,
ident_key, ident_key_nentries, iistate,
- t_end_ptr))
+ t_end_ptr, rwstate))
ereport(ERROR,
(errmsg("could not process concurrent data changes in time"),
errhint("Please consider adjusting \"cluster_max_xlock_time\".")));
+ end_heap_rewrite(rwstate);
/* Remember info about rel before closing OldHeap */
relpersistence = OldHeap->rd_rel->relpersistence;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 7526c1a381..398aa0c77d 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -987,11 +987,13 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
xl_heap_insert *xlrec;
ReorderBufferChange *change;
RelFileLocator target_locator;
+ BlockNumber blknum;
+ HeapTupleHeader tuphdr;
xlrec = (xl_heap_insert *) XLogRecGetData(r);
/* only interested in our database */
- XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
+ XLogRecGetBlockTag(r, 0, &target_locator, NULL, &blknum);
if (target_locator.dbOid != ctx->slot->data.database)
return;
@@ -1016,6 +1018,13 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
DecodeXLogTuple(tupledata, datalen, change->data.tp.newtuple);
+ /*
+ * CTID is needed for logical_rewrite_heap_tuple(), when doing CLUSTER
+ * CONCURRENTLY.
+ */
+ tuphdr = change->data.tp.newtuple->t_data;
+ ItemPointerSet(&tuphdr->t_ctid, blknum, xlrec->offnum);
+
change->data.tp.clear_toast_afterwards = true;
ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
@@ -1037,11 +1046,14 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
ReorderBufferChange *change;
char *data;
RelFileLocator target_locator;
+ BlockNumber old_blknum, new_blknum;
xlrec = (xl_heap_update *) XLogRecGetData(r);
+ /* Retrieve blknum, so that we can compose CTID below. */
+ XLogRecGetBlockTag(r, 0, &target_locator, NULL, &new_blknum);
+
/* only interested in our database */
- XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
if (target_locator.dbOid != ctx->slot->data.database)
return;
@@ -1058,6 +1070,7 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
{
Size datalen;
Size tuplelen;
+ HeapTupleHeader tuphdr;
data = XLogRecGetBlockData(r, 0, &datalen);
@@ -1067,6 +1080,13 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
ReorderBufferGetTupleBuf(ctx->reorder, tuplelen);
DecodeXLogTuple(data, datalen, change->data.tp.newtuple);
+
+ /*
+ * CTID is needed for logical_rewrite_heap_tuple(), when doing CLUSTER
+ * CONCURRENTLY.
+ */
+ tuphdr = change->data.tp.newtuple->t_data;
+ ItemPointerSet(&tuphdr->t_ctid, new_blknum, xlrec->new_offnum);
}
if (xlrec->flags & XLH_UPDATE_CONTAINS_OLD)
@@ -1085,6 +1105,14 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
DecodeXLogTuple(data, datalen, change->data.tp.oldtuple);
}
+ /*
+ * Remember the old tuple CTID, for the sake of
+ * logical_rewrite_heap_tuple().
+ */
+ if (!XLogRecGetBlockTagExtended(r, 1, NULL, NULL, &old_blknum, NULL))
+ old_blknum = new_blknum;
+ ItemPointerSet(&change->data.tp.old_tid, old_blknum, xlrec->old_offnum);
+
change->data.tp.clear_toast_afterwards = true;
ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
@@ -1103,11 +1131,12 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
xl_heap_delete *xlrec;
ReorderBufferChange *change;
RelFileLocator target_locator;
+ BlockNumber blknum;
xlrec = (xl_heap_delete *) XLogRecGetData(r);
/* only interested in our database */
- XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
+ XLogRecGetBlockTag(r, 0, &target_locator, NULL, &blknum);
if (target_locator.dbOid != ctx->slot->data.database)
return;
@@ -1139,6 +1168,12 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
DecodeXLogTuple((char *) xlrec + SizeOfHeapDelete,
datalen, change->data.tp.oldtuple);
+
+ /*
+ * CTID is needed for logical_rewrite_heap_tuple(), when doing CLUSTER
+ * CONCURRENTLY.
+ */
+ ItemPointerSet(&change->data.tp.old_tid, blknum, xlrec->offnum);
}
change->data.tp.clear_toast_afterwards = true;
diff --git a/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c b/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
index db6a2bcf1f..54a7e3ca68 100644
--- a/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
+++ b/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
@@ -34,7 +34,7 @@ static void plugin_truncate(struct LogicalDecodingContext *ctx,
ReorderBufferChange *change);
static void store_change(LogicalDecodingContext *ctx,
ConcurrentChangeKind kind, HeapTuple tuple,
- TransactionId xid);
+ TransactionId xid, ItemPointer old_tid);
void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
@@ -169,7 +169,8 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (newtuple == NULL)
elog(ERROR, "Incomplete insert info.");
- store_change(ctx, CHANGE_INSERT, newtuple, change->txn->xid);
+ store_change(ctx, CHANGE_INSERT, newtuple, change->txn->xid,
+ NULL);
}
break;
case REORDER_BUFFER_CHANGE_UPDATE:
@@ -187,10 +188,10 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (oldtuple != NULL)
store_change(ctx, CHANGE_UPDATE_OLD, oldtuple,
- change->txn->xid);
+ change->txn->xid, NULL);
store_change(ctx, CHANGE_UPDATE_NEW, newtuple,
- change->txn->xid);
+ change->txn->xid, &change->data.tp.old_tid);
}
break;
case REORDER_BUFFER_CHANGE_DELETE:
@@ -203,7 +204,8 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (oldtuple == NULL)
elog(ERROR, "Incomplete delete info.");
- store_change(ctx, CHANGE_DELETE, oldtuple, change->txn->xid);
+ store_change(ctx, CHANGE_DELETE, oldtuple, change->txn->xid,
+ &change->data.tp.old_tid);
}
break;
default:
@@ -237,13 +239,13 @@ plugin_truncate(struct LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (i == nrelations)
return;
- store_change(ctx, CHANGE_TRUNCATE, NULL, InvalidTransactionId);
+ store_change(ctx, CHANGE_TRUNCATE, NULL, InvalidTransactionId, NULL);
}
/* Store concurrent data change. */
static void
store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
- HeapTuple tuple, TransactionId xid)
+ HeapTuple tuple, TransactionId xid, ItemPointer old_tid)
{
ClusterDecodingState *dstate;
char *change_raw;
@@ -308,6 +310,11 @@ store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
change->snapshot = dstate->snapshot;
dstate->snapshot->active_count++;
+ if (old_tid)
+ ItemPointerCopy(old_tid, &change->old_tid);
+ else
+ ItemPointerSetInvalid(&change->old_tid);
+
/* The data has been copied. */
if (flattened)
pfree(tuple);
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 5866a26bdd..de62b6abf8 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -23,11 +23,14 @@ typedef struct RewriteStateData *RewriteState;
extern RewriteState begin_heap_rewrite(Relation old_heap, Relation new_heap,
TransactionId oldest_xmin, TransactionId freeze_xid,
- MultiXactId cutoff_multi);
+ MultiXactId cutoff_multi, bool tid_chains);
extern void end_heap_rewrite(RewriteState state);
extern void rewrite_heap_tuple(RewriteState state, HeapTuple old_tuple,
HeapTuple new_tuple);
extern bool rewrite_heap_dead_tuple(RewriteState state, HeapTuple old_tuple);
+extern void logical_rewrite_heap_tuple(RewriteState state,
+ ItemPointerData old_tid,
+ HeapTuple new_tuple);
/*
* On-Disk data format for an individual logical rewrite mapping.
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 5d904ce985..69a9aba050 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -76,6 +76,9 @@ typedef struct ConcurrentChange
/* Transaction that changes the data. */
TransactionId xid;
+ /* For UPDATE / DELETE, the location of the old tuple version. */
+ ItemPointerData old_tid;
+
/*
* Historic catalog snapshot that was used to decode this change.
*/
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index e332635f70..009bbaa1fa 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -104,6 +104,13 @@ typedef struct ReorderBufferChange
HeapTuple oldtuple;
/* valid for INSERT || UPDATE */
HeapTuple newtuple;
+
+ /*
+ * CLUSTER CONCURRENTLY needs the old TID, even if the old tuple
+ * itself is not WAL-logged (i.e. when the identity key does not
+ * change).
+ */
+ ItemPointerData old_tid;
} tp;
/*
--
2.45.2
v05-0009-elog.patch
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 59343e0290..7100421fd9 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -2467,6 +2467,8 @@ begin_concurrent_cluster(Relation *rel_p, Relation *index_p,
* could make us remove that entry (inserted by another backend) during
* ERROR handling.
*/
+ ereport(LOG, (errmsg("setting clustered_rel to %u, current value is %u",
+ relid, clustered_rel)));
Assert(!OidIsValid(clustered_rel));
clustered_rel = relid;
@@ -2548,6 +2550,14 @@ begin_concurrent_cluster(Relation *rel_p, Relation *index_p,
reopen_relations(rri, nrel);
/* Avoid logical decoding of other relations by this backend. */
+ {
+ RelFileLocator *cur = &clustered_rel_locator;
+ RelFileLocator *new = &rel->rd_locator;
+
+ ereport(LOG, (errmsg("setting clustered_rel_locator to {%u, %u, %u}, the current value is {%u, %u, %u}",
+ new->spcOid, new->dbOid, new->relNumber,
+ cur->spcOid, cur->dbOid, cur->relNumber)));
+ }
clustered_rel_locator = rel->rd_locator;
if (OidIsValid(toastrelid))
{
@@ -2586,6 +2596,10 @@ end_concurrent_cluster(bool error)
* By clearing this variable we also disable
* cluster_before_shmem_exit_callback().
*/
+ ereport(LOG,
+ (errmsg("setting clustered_rel to 0, current value is %u",
+ clustered_rel)));
+
clustered_rel = InvalidOid;
}
@@ -2850,10 +2864,22 @@ check_catalog_changes(Relation rel, CatalogState *cat_state)
* avoid data loss. (The original locators are stored outside cat_state,
* but the check belongs to this function.)
*/
+ {
+ RelFileLocator *exp = &clustered_rel_locator;
+ RelFileLocator *act = &rel->rd_locator;
+
+ ereport(LOG,
+			(errmsg("Expected value of clustered_rel_locator is {%u, %u, %u}, actual value is {%u, %u, %u}",
+ exp->spcOid, exp->dbOid, exp->relNumber,
+ act->spcOid, act->dbOid, act->relNumber)));
+ }
+
if (!RelFileLocatorEquals(rel->rd_locator, clustered_rel_locator))
+ {
ereport(ERROR,
(errmsg("file of relation \"%s\" changed by another transaction",
RelationGetRelationName(rel))));
+ }
if (OidIsValid(reltoastrelid))
{
Relation toastrel;
On 2024-Oct-09, Antonin Houska wrote:
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index 9110938fab..f1008f5013 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -61,8 +62,12 @@ VACUUM [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
   <para>
    Without a <replaceable class="parameter">table_and_columns</replaceable>
    list, <command>VACUUM</command> processes every table and materialized view
-   in the current database that the current user has permission to vacuum.
-   With a list, <command>VACUUM</command> processes only those table(s).
+   in the current database that the current user has permission to vacuum. If
+   the <literal>CONCURRENTLY</literal> is specified (see below), tables which
+   have not been clustered yet are silently skipped. With a
+   list, <command>VACUUM</command> processes only those table(s). If
+   the <literal>CONCURRENTLY</literal> is specified, the list may only contain
+   tables which have already been clustered.
   </para>
The idea that VACUUM CONCURRENTLY can only process tables that have been
clustered sounds very strange to me. I don't think such a restriction
would really fly. However, I think this may just be a documentation
mistake; can you please clarify? I am tempted to suggest that VACUUM
CONCURRENTLY should receive a table list; without a list, it should
raise an error. This is not supposed to be a routine maintenance
command that you can run on all your tables, after all. Heck, maybe
don't even accept a partitioned table -- the user can process one
partition at a time, if they need that.
I don't believe in the need for the LOCK_CLUSTER_CONCURRENT define; IMO
the code should just use ShareUpdateExclusiveLock where needed.
In 0001, the new API of make_new_heap() is somewhat bizarre regarding
the output lockmode_new_p parameter. I didn't find any place in the
patch series where we use that to return a different lock level than the
caller gave; the only case where we do something that looks funny is when
a toast table is involved. But I don't think I fully understand what is
going on in that case. I'm likely missing something here, but isn't it
simpler to just state that make_new_heap will obtain a lock on the new
heap, and that the immediately following table_open needn't acquire a
lock (or, in the case of RefreshMatViewByOid, no LockRelationOid is
necessary)?
Anyway, I propose some cosmetic cleanups for 0001 in attachment,
including changing make_new_heap to assume a non-null value of
lockmode_new_p. I didn't go as far as making it no longer a pointer,
but if it can be done, then I suggest we should do that. I didn't try
to apply the next patches in the series after this one.
--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
Attachments:
0001-Minor-code-review.patch.nocfbot
From eec9e6dfc4aa5a4f52a82065e3d4973cdbbff09f Mon Sep 17 00:00:00 2001
From: Álvaro Herrera <alvherre@alvh.no-ip.org>
Date: Fri, 6 Dec 2024 14:39:02 +0100
Subject: [PATCH] Minor code review
---
src/backend/commands/cluster.c | 72 ++++++++++++--------------------
src/backend/commands/matview.c | 7 +++-
src/backend/commands/tablecmds.c | 5 ++-
src/backend/commands/vacuum.c | 5 +--
4 files changed, 36 insertions(+), 53 deletions(-)
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index e32abf15e69..4a62aff46bd 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -191,13 +191,11 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
stmt->indexname, stmt->relation->relname)));
}
+ /* For non-partitioned tables, do what we came here to do. */
if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
{
- /*
- * Do the job. (The function will close the relation, lock is kept
- * till commit.)
- */
cluster_rel(rel, indexOid, &params);
+ /* cluster_rel closes the relation, but keeps lock */
return;
}
@@ -284,11 +282,9 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
rel = table_open(rtc->tableOid, AccessExclusiveLock);
- /*
- * Do the job. (The function will close the relation, lock is kept
- * till commit.)
- */
+ /* Process this table */
cluster_rel(rel, rtc->indexOid, params);
+ /* cluster_rel closes the relation, but keeps lock */
PopActiveSnapshot();
CommitTransactionCommand();
@@ -301,8 +297,7 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
* This clusters the table by creating a new, clustered table and
* swapping the relfilenumbers of the new table and the old table, so
* the OID of the original table is preserved. Thus we do not lose
- * GRANT, inheritance nor references to this table (this was a bug
- * in releases through 7.3).
+ * GRANT, inheritance nor references to this table.
*
* Indexes are rebuilt too, via REINDEX. Since we are effectively bulk-loading
* the new table, it's better to create the indexes afterwards than to fill
@@ -311,8 +306,6 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
* If indexOid is InvalidOid, the table will be rewritten in physical order
* instead of index order. This is the new implementation of VACUUM FULL,
* and error messages should refer to the operation as VACUUM not CLUSTER.
- *
- * We expect that OldHeap is already locked in AccessExclusiveLock mode.
*/
void
cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
@@ -325,6 +318,8 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
bool recheck = ((params->options & CLUOPT_RECHECK) != 0);
Relation index = NULL;
+ Assert(CheckRelationLockedByMe(OldHeap, AccessExclusiveLock, false));
+
/* Check for user-requested abort. */
CHECK_FOR_INTERRUPTS();
@@ -472,11 +467,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
/* rebuild_relation does all the dirty work */
rebuild_relation(OldHeap, index, verbose);
-
- /*
- * NB: rebuild_relation does table_close() on OldHeap, and also on index,
- * if the pointer is valid.
- */
+ /* rebuild_relation closes OldHeap, and index if valid */
out:
/* Roll back any GUC changes executed by index functions */
@@ -635,7 +626,6 @@ static void
rebuild_relation(Relation OldHeap, Relation index, bool verbose)
{
Oid tableOid = RelationGetRelid(OldHeap);
- Oid indexOid = index ? RelationGetRelid(index) : InvalidOid;
Oid accessMethod = OldHeap->rd_rel->relam;
Oid tableSpace = OldHeap->rd_rel->reltablespace;
Oid OIDNewHeap;
@@ -647,9 +637,9 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose)
MultiXactId cutoffMulti;
LOCKMODE lockmode_new;
- if (OidIsValid(indexOid))
+ if (index)
/* Mark the correct index as clustered */
- mark_index_clustered(OldHeap, indexOid, true);
+ mark_index_clustered(OldHeap, RelationGetRelid(index), true);
/* Remember info about rel before closing OldHeap */
relpersistence = OldHeap->rd_rel->relpersistence;
@@ -666,10 +656,8 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose)
accessMethod,
relpersistence,
NoLock, &lockmode_new);
- Assert(lockmode_new == AccessExclusiveLock || lockmode_new == NoLock);
- /* Lock iff not done above. */
- NewHeap = table_open(OIDNewHeap, lockmode_new == NoLock ?
- AccessExclusiveLock : NoLock);
+ /* NewHeap already locked by make_new_heap */
+ NewHeap = table_open(OIDNewHeap, NoLock);
/* Copy the heap data into the new table in the desired order */
copy_table_data(NewHeap, OldHeap, index, verbose,
@@ -683,9 +671,8 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose)
/*
* Close the new relation so it can be dropped as soon as the storage is
- * swapped. The relation is not visible to others, so we could unlock it
- * completely, but it's simpler to pass NoLock than to track all the locks
- * acquired so far.
+ * swapped. The relation is not visible to others, so no need to unlock it
+ * explicitly.
*/
table_close(NewHeap, NoLock);
@@ -710,9 +697,8 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose)
* After this, the caller should load the new heap with transferred/modified
* data, then call finish_heap_swap to complete the operation.
*
- * If a specific lock mode is needed for the new relation, pass it via the
- * in/out parameter lockmode_new_p. On exit, the output value tells whether
- * the lock was actually acquired.
+ * Locking: lockmode_old is acquired on OldHeap, unless it is NoLock;
+ * *lockmode_new_p is acquired on NewHeap.
*/
Oid
make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
@@ -730,13 +716,8 @@ make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
Oid namespaceid;
LOCKMODE lockmode_new;
- if (lockmode_new_p)
- {
- lockmode_new = *lockmode_new_p;
- *lockmode_new_p = NoLock;
- }
- else
- lockmode_new = lockmode_old;
+ lockmode_new = *lockmode_new_p;
+ *lockmode_new_p = NoLock;
OldHeap = table_open(OIDOldHeap, lockmode_old);
OldHeapDesc = RelationGetDescr(OldHeap);
@@ -833,8 +814,7 @@ make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
reloptions = (Datum) 0;
NewHeapCreateToastTable(OIDNewHeap, reloptions, lockmode_new, toastid);
- if (lockmode_new_p)
- *lockmode_new_p = lockmode_new;
+ *lockmode_new_p = lockmode_new;
ReleaseSysCache(tuple);
}
@@ -857,9 +837,6 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
bool *pSwapToastByContent, TransactionId *pFreezeXid,
MultiXactId *pCutoffMulti)
{
- Oid OIDOldHeap = RelationGetRelid(OldHeap);
- Oid OIDOldIndex = OldIndex ? RelationGetRelid(OldIndex) : InvalidOid;
- Oid OIDNewHeap = RelationGetRelid(NewHeap);
Relation relRelation;
HeapTuple reltup;
Form_pg_class relform;
@@ -978,7 +955,8 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
* provided, else plain seqscan.
*/
if (OldIndex != NULL && OldIndex->rd_rel->relam == BTREE_AM_OID)
- use_sort = plan_cluster_use_sort(OIDOldHeap, OIDOldIndex);
+ use_sort = plan_cluster_use_sort(RelationGetRelid(OldHeap),
+ RelationGetRelid(OldIndex));
else
use_sort = false;
@@ -1036,16 +1014,18 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
/* Update pg_class to reflect the correct values of pages and tuples. */
relRelation = table_open(RelationRelationId, RowExclusiveLock);
- reltup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(OIDNewHeap));
+ reltup = SearchSysCacheCopy1(RELOID,
+ ObjectIdGetDatum(RelationGetRelid(NewHeap)));
if (!HeapTupleIsValid(reltup))
- elog(ERROR, "cache lookup failed for relation %u", OIDNewHeap);
+ elog(ERROR, "cache lookup failed for relation %u",
+ RelationGetRelid(NewHeap));
relform = (Form_pg_class) GETSTRUCT(reltup);
relform->relpages = num_pages;
relform->reltuples = num_tuples;
/* Don't update the stats for pg_class. See swap_relation_files. */
- if (OIDOldHeap != RelationRelationId)
+ if (RelationGetRelid(OldHeap) != RelationRelationId)
CatalogTupleUpdate(relRelation, &reltup->t_self, reltup);
else
CacheInvalidateRelcacheByTuple(reltup);
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 8eaf951cc16..6b2bcb168b6 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -173,6 +173,7 @@ RefreshMatViewByOid(Oid matviewOid, bool is_create, bool skipData,
Oid tableSpace;
Oid relowner;
Oid OIDNewHeap;
+ LOCKMODE newrel_lock;
uint64 processed = 0;
char relpersistence;
Oid save_userid;
@@ -316,10 +317,12 @@ RefreshMatViewByOid(Oid matviewOid, bool is_create, bool skipData,
* it against access by any other process until commit (by which time it
* will be gone).
*/
+ newrel_lock = AccessExclusiveLock;
OIDNewHeap = make_new_heap(matviewOid, tableSpace,
matviewRel->rd_rel->relam,
- relpersistence, ExclusiveLock, NULL);
- LockRelationOid(OIDNewHeap, AccessExclusiveLock);
+ relpersistence, ExclusiveLock,
+ &newrel_lock);
+ /* lock on new rel needn't be explicitly released */
/* Generate the data, if wanted. */
if (!skipData)
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index f2aa05a2e7c..f982af8a1d4 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -5799,6 +5799,7 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
Oid NewAccessMethod;
Oid NewTableSpace;
char persistence;
+ LOCKMODE newrel_lock;
OldHeap = table_open(tab->relid, NoLock);
@@ -5888,8 +5889,10 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
* persistence. That wouldn't work for pg_class, but that can't be
* unlogged anyway.
*/
+ newrel_lock = lockmode;
OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, NewAccessMethod,
- persistence, lockmode, NULL);
+ persistence, lockmode, &newrel_lock);
+ /* lock on NewHeap needn't be explicitly released */
/*
* Copy the heap data into the new table with the desired
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 3c2ec7101d7..a0158b1fcde 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -2223,11 +2223,8 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
/* VACUUM FULL is now a variant of CLUSTER; see cluster.c */
cluster_rel(rel, InvalidOid, &cluster_params);
+ /* cluster_rel closes the relation, but keeps lock */
- /*
- * cluster_rel() should have closed the relation, lock is kept
- * till commit.
- */
rel = NULL;
}
else
--
2.39.5
Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
On 2024-Oct-09, Antonin Houska wrote:
@@ -61,8 +62,12 @@ VACUUM [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
    <para>
     Without a <replaceable class="parameter">table_and_columns</replaceable>
     list, <command>VACUUM</command> processes every table and materialized view
-    in the current database that the current user has permission to vacuum.
-    With a list, <command>VACUUM</command> processes only those table(s).
+    in the current database that the current user has permission to vacuum. If
+    the <literal>CONCURRENTLY</literal> is specified (see below), tables which
+    have not been clustered yet are silently skipped. With a
+    list, <command>VACUUM</command> processes only those table(s). If
+    the <literal>CONCURRENTLY</literal> is specified, the list may only contain
+    tables which have already been clustered.
    </para>

The idea that VACUUM CONCURRENTLY can only process tables that have been
clustered sounds very strange to me. I don't think such a restriction
would really fly. However, I think this may just be a documentation
mistake; can you please clarify?
Right, it was a documentation problem. I think the fact that VACUUM FULL is
internally (almost) an alias for CLUSTER is what distracted me.
I am tempted to suggest that VACUUM CONCURRENTLY should receive a table
list; without a list, it should raise an error. This is not supposed to be
a routine maintenance command that you can run on all your tables, after
all.
ok, implemented
Heck, maybe don't even accept a partitioned table -- the user can
process one partition at a time, if they need that.
I also thought of this but forgot. Done now.
I don't believe in the need for the LOCK_CLUSTER_CONCURRENT define; IMO
the code should just use ShareUpdateExclusiveLock where needed.
ok, removed
In 0001, the new API of make_new_heap() is somewhat bizarre regarding
the output lockmode_new_p parameter.
Oh, it was too messy. I think I was thinking of too many things at once (such
as locking the old heap, the new heap and the new heap's TOAST). Also, one
thing that might have contributed to the confusion is that make_new_heap() has
the 'lockmode' argument, which receives various values from various
callers. However, both the new heap and its TOAST relation are eventually
created by heap_create_with_catalog(), and this function always leaves the new
relation locked in AccessExclusiveLock mode. Maybe this needs some refactoring.
Therefore I reverted the changes around make_new_heap() and simply pass NoLock
for lockmode in cluster.c.
Anyway, I propose some cosmetic cleanups for 0001 in attachment,
including changing make_new_heap to assume a non-null value of
lockmode_new_p. I didn't go as far as making it no longer a pointer,
but if it can be done, then I suggest we should do that. I didn't try
to apply the next patches in the series after this one.
Thanks, applied (except for the changes related to make_new_heap(), since that
function is not touched by the next version of the patch).
Next version is attached. It also tries to fix CI problems reported on
machines with -DRELCACHE_FORCE_RELEASE.
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Attachments:
v06-0001-Adjust-signature-of-cluster_rel-and-its-subroutines.patch (text/x-diff)
From 3e46b7c75ff8b93d6b960ffde3222255cbd52066 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Wed, 11 Dec 2024 19:22:41 +0100
Subject: [PATCH 1/8] Adjust signature of cluster_rel() and its subroutines.
So far cluster_rel() received the OID of the relation it should process and
performed the opening and locking of the relation itself. Yet copy_table_data()
received the OID as well and also had to open the relation itself. This patch
eliminates the repeated opening and closing.

One particular reason for this change is that the VACUUM FULL / CLUSTER
command with the CONCURRENTLY option will need to release all locks on the
relation (and possibly on the clustering index) at some point. Since it makes
little sense to keep a relation reference without a lock, the cluster_rel()
function now also closes its reference to the relation (and its index).
Neither the function nor its subroutines may open extra references, as those
would make it harder to close them all.
---
src/backend/commands/cluster.c | 130 ++++++++++++++++-----------------
src/backend/commands/vacuum.c | 9 +--
src/include/commands/cluster.h | 2 +-
3 files changed, 70 insertions(+), 71 deletions(-)
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index ae0863d9a2..7ec605b0bd 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -69,8 +69,8 @@ typedef struct
static void cluster_multiple_rels(List *rtcs, ClusterParams *params);
-static void rebuild_relation(Relation OldHeap, Oid indexOid, bool verbose);
-static void copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex,
+static void rebuild_relation(Relation OldHeap, Relation index, bool verbose);
+static void copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
bool verbose, bool *pSwapToastByContent,
TransactionId *pFreezeXid, MultiXactId *pCutoffMulti);
static List *get_tables_to_cluster(MemoryContext cluster_context);
@@ -191,13 +191,11 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
stmt->indexname, stmt->relation->relname)));
}
+ /* For non-partitioned tables, do what we came here to do. */
if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
{
- /* close relation, keep lock till commit */
- table_close(rel, NoLock);
-
- /* Do the job. */
- cluster_rel(tableOid, indexOid, &params);
+ cluster_rel(rel, indexOid, &params);
+ /* cluster_rel closes the relation, but keeps lock */
return;
}
@@ -274,6 +272,7 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
foreach(lc, rtcs)
{
RelToCluster *rtc = (RelToCluster *) lfirst(lc);
+ Relation rel;
/* Start a new transaction for each relation. */
StartTransactionCommand();
@@ -281,8 +280,11 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
/* functions in indexes may want a snapshot set */
PushActiveSnapshot(GetTransactionSnapshot());
- /* Do the job. */
- cluster_rel(rtc->tableOid, rtc->indexOid, params);
+ rel = table_open(rtc->tableOid, AccessExclusiveLock);
+
+ /* Process this table */
+ cluster_rel(rel, rtc->indexOid, params);
+ /* cluster_rel closes the relation, but keeps lock */
PopActiveSnapshot();
CommitTransactionCommand();
@@ -295,8 +297,7 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
* This clusters the table by creating a new, clustered table and
* swapping the relfilenumbers of the new table and the old table, so
* the OID of the original table is preserved. Thus we do not lose
- * GRANT, inheritance nor references to this table (this was a bug
- * in releases through 7.3).
+ * GRANT, inheritance nor references to this table.
*
* Indexes are rebuilt too, via REINDEX. Since we are effectively bulk-loading
* the new table, it's better to create the indexes afterwards than to fill
@@ -307,14 +308,17 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
* and error messages should refer to the operation as VACUUM not CLUSTER.
*/
void
-cluster_rel(Oid tableOid, Oid indexOid, ClusterParams *params)
+cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
{
- Relation OldHeap;
+ Oid tableOid = RelationGetRelid(OldHeap);
Oid save_userid;
int save_sec_context;
int save_nestlevel;
bool verbose = ((params->options & CLUOPT_VERBOSE) != 0);
bool recheck = ((params->options & CLUOPT_RECHECK) != 0);
+ Relation index = NULL;
+
+ Assert(CheckRelationLockedByMe(OldHeap, AccessExclusiveLock, false));
/* Check for user-requested abort. */
CHECK_FOR_INTERRUPTS();
@@ -327,21 +331,6 @@ cluster_rel(Oid tableOid, Oid indexOid, ClusterParams *params)
pgstat_progress_update_param(PROGRESS_CLUSTER_COMMAND,
PROGRESS_CLUSTER_COMMAND_VACUUM_FULL);
- /*
- * We grab exclusive access to the target rel and index for the duration
- * of the transaction. (This is redundant for the single-transaction
- * case, since cluster() already did it.) The index lock is taken inside
- * check_index_is_clusterable.
- */
- OldHeap = try_relation_open(tableOid, AccessExclusiveLock);
-
- /* If the table has gone away, we can skip processing it */
- if (!OldHeap)
- {
- pgstat_progress_end_command();
- return;
- }
-
/*
* Switch to the table owner's userid, so that any index functions are run
* as that user. Also lock down security-restricted operations and
@@ -444,7 +433,11 @@ cluster_rel(Oid tableOid, Oid indexOid, ClusterParams *params)
/* Check heap and index are valid to cluster on */
if (OidIsValid(indexOid))
+ {
check_index_is_clusterable(OldHeap, indexOid, AccessExclusiveLock);
+ /* Open the index (It should already be locked.) */
+ index = index_open(indexOid, NoLock);
+ }
/*
* Quietly ignore the request if this is a materialized view which has not
@@ -473,9 +466,8 @@ cluster_rel(Oid tableOid, Oid indexOid, ClusterParams *params)
TransferPredicateLocksToHeapRelation(OldHeap);
/* rebuild_relation does all the dirty work */
- rebuild_relation(OldHeap, indexOid, verbose);
-
- /* NB: rebuild_relation does table_close() on OldHeap */
+ rebuild_relation(OldHeap, index, verbose);
+ /* rebuild_relation closes OldHeap, and index if valid */
out:
/* Roll back any GUC changes executed by index functions */
@@ -624,44 +616,67 @@ mark_index_clustered(Relation rel, Oid indexOid, bool is_internal)
* rebuild_relation: rebuild an existing relation in index or physical order
*
* OldHeap: table to rebuild --- must be opened and exclusive-locked!
- * indexOid: index to cluster by, or InvalidOid to rewrite in physical order.
+ * index: index to cluster by, or NULL to rewrite in physical order. Must be
+ * opened and locked.
*
- * NB: this routine closes OldHeap at the right time; caller should not.
+ * On exit, the heap (and also the index, if one was passed) are closed, but
+ * still locked with AccessExclusiveLock.
*/
static void
-rebuild_relation(Relation OldHeap, Oid indexOid, bool verbose)
+rebuild_relation(Relation OldHeap, Relation index, bool verbose)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid accessMethod = OldHeap->rd_rel->relam;
Oid tableSpace = OldHeap->rd_rel->reltablespace;
Oid OIDNewHeap;
+ Relation NewHeap;
char relpersistence;
bool is_system_catalog;
bool swap_toast_by_content;
TransactionId frozenXid;
MultiXactId cutoffMulti;
- if (OidIsValid(indexOid))
+ if (index)
/* Mark the correct index as clustered */
- mark_index_clustered(OldHeap, indexOid, true);
+ mark_index_clustered(OldHeap, RelationGetRelid(index), true);
/* Remember info about rel before closing OldHeap */
relpersistence = OldHeap->rd_rel->relpersistence;
is_system_catalog = IsSystemRelation(OldHeap);
- /* Close relcache entry, but keep lock until transaction commit */
- table_close(OldHeap, NoLock);
-
- /* Create the transient table that will receive the re-ordered data */
+ /*
+ * Create the transient table that will receive the re-ordered data.
+ *
+ * NoLock for the old heap because we already have it locked and want to
+ * keep unlocking straightforward. The new heap (and its TOAST if one
+ * exists) will be locked in AccessExclusiveLock mode on return. Since others
+ * can't see it yet, we do not care.
+ */
OIDNewHeap = make_new_heap(tableOid, tableSpace,
accessMethod,
relpersistence,
- AccessExclusiveLock);
+ NoLock);
+ NewHeap = table_open(OIDNewHeap, NoLock);
+ /* NewHeap already locked by make_new_heap */
+ Assert(CheckRelationLockedByMe(NewHeap, AccessExclusiveLock, false));
/* Copy the heap data into the new table in the desired order */
- copy_table_data(OIDNewHeap, tableOid, indexOid, verbose,
+ copy_table_data(NewHeap, OldHeap, index, verbose,
&swap_toast_by_content, &frozenXid, &cutoffMulti);
+
+ /* Close relcache entries, but keep lock until transaction commit */
+ table_close(OldHeap, NoLock);
+ if (index)
+ index_close(index, NoLock);
+
+ /*
+ * Close the new relation so it can be dropped as soon as the storage is
+ * swapped. The relation is not visible to others, so no need to unlock it
+ * explicitly.
+ */
+ table_close(NewHeap, NoLock);
+
/*
* Swap the physical files of the target and transient tables, then
* rebuild the target's indexes and throw away the transient table.
@@ -810,13 +825,10 @@ make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
* *pCutoffMulti receives the MultiXactId used as a cutoff point.
*/
static void
-copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
+copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verbose,
bool *pSwapToastByContent, TransactionId *pFreezeXid,
MultiXactId *pCutoffMulti)
{
- Relation NewHeap,
- OldHeap,
- OldIndex;
Relation relRelation;
HeapTuple reltup;
Form_pg_class relform;
@@ -835,16 +847,6 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
pg_rusage_init(&ru0);
- /*
- * Open the relations we need.
- */
- NewHeap = table_open(OIDNewHeap, AccessExclusiveLock);
- OldHeap = table_open(OIDOldHeap, AccessExclusiveLock);
- if (OidIsValid(OIDOldIndex))
- OldIndex = index_open(OIDOldIndex, AccessExclusiveLock);
- else
- OldIndex = NULL;
-
/* Store a copy of the namespace name for logging purposes */
nspname = get_namespace_name(RelationGetNamespace(OldHeap));
@@ -945,7 +947,8 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
* provided, else plain seqscan.
*/
if (OldIndex != NULL && OldIndex->rd_rel->relam == BTREE_AM_OID)
- use_sort = plan_cluster_use_sort(OIDOldHeap, OIDOldIndex);
+ use_sort = plan_cluster_use_sort(RelationGetRelid(OldHeap),
+ RelationGetRelid(OldIndex));
else
use_sort = false;
@@ -1000,24 +1003,21 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
tups_recently_dead,
pg_rusage_show(&ru0))));
- if (OldIndex != NULL)
- index_close(OldIndex, NoLock);
- table_close(OldHeap, NoLock);
- table_close(NewHeap, NoLock);
-
/* Update pg_class to reflect the correct values of pages and tuples. */
relRelation = table_open(RelationRelationId, RowExclusiveLock);
- reltup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(OIDNewHeap));
+ reltup = SearchSysCacheCopy1(RELOID,
+ ObjectIdGetDatum(RelationGetRelid(NewHeap)));
if (!HeapTupleIsValid(reltup))
- elog(ERROR, "cache lookup failed for relation %u", OIDNewHeap);
+ elog(ERROR, "cache lookup failed for relation %u",
+ RelationGetRelid(NewHeap));
relform = (Form_pg_class) GETSTRUCT(reltup);
relform->relpages = num_pages;
relform->reltuples = num_tuples;
/* Don't update the stats for pg_class. See swap_relation_files. */
- if (OIDOldHeap != RelationRelationId)
+ if (RelationGetRelid(OldHeap) != RelationRelationId)
CatalogTupleUpdate(relRelation, &reltup->t_self, reltup);
else
CacheInvalidateRelcacheByTuple(reltup);
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index bb639ef51f..a0158b1fcd 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -2218,15 +2218,14 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
{
ClusterParams cluster_params = {0};
- /* close relation before vacuuming, but hold lock until commit */
- relation_close(rel, NoLock);
- rel = NULL;
-
if ((params->options & VACOPT_VERBOSE) != 0)
cluster_params.options |= CLUOPT_VERBOSE;
/* VACUUM FULL is now a variant of CLUSTER; see cluster.c */
- cluster_rel(relid, InvalidOid, &cluster_params);
+ cluster_rel(rel, InvalidOid, &cluster_params);
+ /* cluster_rel closes the relation, but keeps lock */
+
+ rel = NULL;
}
else
table_relation_vacuum(rel, params, bstrategy);
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 4e32380417..2d8e363015 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -32,7 +32,7 @@ typedef struct ClusterParams
} ClusterParams;
extern void cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel);
-extern void cluster_rel(Oid tableOid, Oid indexOid, ClusterParams *params);
+extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params);
extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid,
LOCKMODE lockmode);
extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
--
2.45.2
v06-0002-Move-progress-related-fields-from-PgBackendStatus-to.patch (text/x-diff)
From a3898b0e02472136b05a7c6d6eea442311686efd Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Wed, 11 Dec 2024 19:22:41 +0100
Subject: [PATCH 2/8] Move progress related fields from PgBackendStatus to
PgBackendProgress.
VACUUM FULL / CLUSTER CONCURRENTLY will need to save and restore these fields
at some point.
---
src/backend/utils/activity/backend_progress.c | 18 +++++++++---------
src/backend/utils/activity/backend_status.c | 4 ++--
src/backend/utils/adt/pgstatfuncs.c | 6 +++---
src/include/utils/backend_progress.h | 14 ++++++++++++++
src/include/utils/backend_status.h | 14 ++------------
5 files changed, 30 insertions(+), 26 deletions(-)
diff --git a/src/backend/utils/activity/backend_progress.c b/src/backend/utils/activity/backend_progress.c
index b80249a79e..55c8ddd89e 100644
--- a/src/backend/utils/activity/backend_progress.c
+++ b/src/backend/utils/activity/backend_progress.c
@@ -32,9 +32,9 @@ pgstat_progress_start_command(ProgressCommandType cmdtype, Oid relid)
return;
PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
- beentry->st_progress_command = cmdtype;
- beentry->st_progress_command_target = relid;
- MemSet(&beentry->st_progress_param, 0, sizeof(beentry->st_progress_param));
+ beentry->st_progress.command = cmdtype;
+ beentry->st_progress.command_target = relid;
+ MemSet(&beentry->st_progress.param, 0, sizeof(beentry->st_progress.param));
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
@@ -55,7 +55,7 @@ pgstat_progress_update_param(int index, int64 val)
return;
PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
- beentry->st_progress_param[index] = val;
+ beentry->st_progress.param[index] = val;
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
@@ -76,7 +76,7 @@ pgstat_progress_incr_param(int index, int64 incr)
return;
PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
- beentry->st_progress_param[index] += incr;
+ beentry->st_progress.param[index] += incr;
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
@@ -133,7 +133,7 @@ pgstat_progress_update_multi_param(int nparam, const int *index,
{
Assert(index[i] >= 0 && index[i] < PGSTAT_NUM_PROGRESS_PARAM);
- beentry->st_progress_param[index[i]] = val[i];
+ beentry->st_progress.param[index[i]] = val[i];
}
PGSTAT_END_WRITE_ACTIVITY(beentry);
@@ -154,11 +154,11 @@ pgstat_progress_end_command(void)
if (!beentry || !pgstat_track_activities)
return;
- if (beentry->st_progress_command == PROGRESS_COMMAND_INVALID)
+ if (beentry->st_progress.command == PROGRESS_COMMAND_INVALID)
return;
PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
- beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
- beentry->st_progress_command_target = InvalidOid;
+ beentry->st_progress.command = PROGRESS_COMMAND_INVALID;
+ beentry->st_progress.command_target = InvalidOid;
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 22c6dc378c..49e856e604 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -376,8 +376,8 @@ pgstat_bestart(void)
#endif
lbeentry.st_state = STATE_UNDEFINED;
- lbeentry.st_progress_command = PROGRESS_COMMAND_INVALID;
- lbeentry.st_progress_command_target = InvalidOid;
+ lbeentry.st_progress.command = PROGRESS_COMMAND_INVALID;
+ lbeentry.st_progress.command_target = InvalidOid;
lbeentry.st_query_id = UINT64CONST(0);
/*
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 60a397dc56..5f7928e865 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -269,7 +269,7 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
* Report values for only those backends which are running the given
* command.
*/
- if (beentry->st_progress_command != cmdtype)
+ if (beentry->st_progress.command != cmdtype)
continue;
/* Value available to all callers */
@@ -279,9 +279,9 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
/* show rest of the values including relid only to role members */
if (HAS_PGSTAT_PERMISSIONS(beentry->st_userid))
{
- values[2] = ObjectIdGetDatum(beentry->st_progress_command_target);
+ values[2] = ObjectIdGetDatum(beentry->st_progress.command_target);
for (i = 0; i < PGSTAT_NUM_PROGRESS_PARAM; i++)
- values[i + 3] = Int64GetDatum(beentry->st_progress_param[i]);
+ values[i + 3] = Int64GetDatum(beentry->st_progress.param[i]);
}
else
{
diff --git a/src/include/utils/backend_progress.h b/src/include/utils/backend_progress.h
index 7b63d38f97..e09598eafc 100644
--- a/src/include/utils/backend_progress.h
+++ b/src/include/utils/backend_progress.h
@@ -30,8 +30,22 @@ typedef enum ProgressCommandType
PROGRESS_COMMAND_COPY,
} ProgressCommandType;
+
#define PGSTAT_NUM_PROGRESS_PARAM 20
+/*
+ * Any command which wishes can advertise that it is running by setting
+ * command, command_target, and param[]. command_target should be the OID of
+ * the relation which the command targets (we assume there's just one, as this
+ * is meant for utility commands), but the meaning of each element in the
+ * param array is command-specific.
+ */
+typedef struct PgBackendProgress
+{
+ ProgressCommandType command;
+ Oid command_target;
+ int64 param[PGSTAT_NUM_PROGRESS_PARAM];
+} PgBackendProgress;
extern void pgstat_progress_start_command(ProgressCommandType cmdtype,
Oid relid);
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 4e8b39a66d..77fc409f9f 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -155,18 +155,8 @@ typedef struct PgBackendStatus
*/
char *st_activity_raw;
- /*
- * Command progress reporting. Any command which wishes can advertise
- * that it is running by setting st_progress_command,
- * st_progress_command_target, and st_progress_param[].
- * st_progress_command_target should be the OID of the relation which the
- * command targets (we assume there's just one, as this is meant for
- * utility commands), but the meaning of each element in the
- * st_progress_param array is command-specific.
- */
- ProgressCommandType st_progress_command;
- Oid st_progress_command_target;
- int64 st_progress_param[PGSTAT_NUM_PROGRESS_PARAM];
+ /* Command progress reporting. */
+ PgBackendProgress st_progress;
/* query identifier, optionally computed using post_parse_analyze_hook */
uint64 st_query_id;
--
2.45.2
v06-0003-Move-conversion-of-a-historic-to-MVCC-snapshot-to-a-.patch (text/x-diff)
From be7dc7ed0a0cc4a5fd30b68f06f29ccd17b16adc Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Wed, 11 Dec 2024 19:22:41 +0100
Subject: [PATCH 3/8] Move conversion of a "historic" to MVCC snapshot to a
separate function.
The conversion is now handled by SnapBuildMVCCFromHistoric(). VACUUM FULL /
CLUSTER CONCURRENTLY will also need it.
---
src/backend/replication/logical/snapbuild.c | 51 +++++++++++++++++----
src/backend/utils/time/snapmgr.c | 3 +-
src/include/replication/snapbuild.h | 1 +
src/include/utils/snapmgr.h | 1 +
4 files changed, 45 insertions(+), 11 deletions(-)
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index a6a4da3266..097dc82f6f 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -440,10 +440,7 @@ Snapshot
SnapBuildInitialSnapshot(SnapBuild *builder)
{
Snapshot snap;
- TransactionId xid;
TransactionId safeXid;
- TransactionId *newxip;
- int newxcnt = 0;
Assert(XactIsoLevel == XACT_REPEATABLE_READ);
Assert(builder->building_full_snapshot);
@@ -485,6 +482,31 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
MyProc->xmin = snap->xmin;
+ /* Convert the historic snapshot to MVCC snapshot. */
+ return SnapBuildMVCCFromHistoric(snap, true);
+}
+
+/*
+ * Turn a historic MVCC snapshot into an ordinary MVCC snapshot.
+ *
+ * Unlike a regular (non-historic) MVCC snapshot, the xip array of this
+ * snapshot contains not only running main transactions, but also their
+ * subtransactions. This difference has no impact on XidInMVCCSnapshot().
+ *
+ * Pass true for 'in_place' if you don't care about modifying the source
+ * snapshot. If you need a new instance, and one that was allocated as a
+ * single chunk of memory, pass false.
+ */
+Snapshot
+SnapBuildMVCCFromHistoric(Snapshot snapshot, bool in_place)
+{
+ TransactionId xid;
+ TransactionId *oldxip = snapshot->xip;
+ uint32 oldxcnt = snapshot->xcnt;
+ TransactionId *newxip;
+ int newxcnt = 0;
+ Snapshot result;
+
/* allocate in transaction context */
newxip = (TransactionId *)
palloc(sizeof(TransactionId) * GetMaxSnapshotXidCount());
@@ -495,7 +517,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
* classical snapshot by marking all non-committed transactions as
* in-progress. This can be expensive.
*/
- for (xid = snap->xmin; NormalTransactionIdPrecedes(xid, snap->xmax);)
+ for (xid = snapshot->xmin; NormalTransactionIdPrecedes(xid, snapshot->xmax);)
{
void *test;
@@ -503,7 +525,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
* Check whether transaction committed using the decoding snapshot
* meaning of ->xip.
*/
- test = bsearch(&xid, snap->xip, snap->xcnt,
+ test = bsearch(&xid, snapshot->xip, snapshot->xcnt,
sizeof(TransactionId), xidComparator);
if (test == NULL)
@@ -520,11 +542,22 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
}
/* adjust remaining snapshot fields as needed */
- snap->snapshot_type = SNAPSHOT_MVCC;
- snap->xcnt = newxcnt;
- snap->xip = newxip;
+ snapshot->xcnt = newxcnt;
+ snapshot->xip = newxip;
+
+ if (in_place)
+ result = snapshot;
+ else
+ {
+ result = CopySnapshot(snapshot);
+
+ /* Restore the original values so the source is intact. */
+ snapshot->xip = oldxip;
+ snapshot->xcnt = oldxcnt;
+ }
+ result->snapshot_type = SNAPSHOT_MVCC;
- return snap;
+ return result;
}
/*
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index a1a0c2adeb..4c573b2ded 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -153,7 +153,6 @@ typedef struct ExportedSnapshot
static List *exportedSnapshots = NIL;
/* Prototypes for local functions */
-static Snapshot CopySnapshot(Snapshot snapshot);
static void UnregisterSnapshotNoOwner(Snapshot snapshot);
static void FreeSnapshot(Snapshot snapshot);
static void SnapshotResetXmin(void);
@@ -536,7 +535,7 @@ SetTransactionSnapshot(Snapshot sourcesnap, VirtualTransactionId *sourcevxid,
* The copy is palloc'd in TopTransactionContext and has initial refcounts set
* to 0. The returned snapshot has the copied flag set.
*/
-static Snapshot
+Snapshot
CopySnapshot(Snapshot snapshot)
{
Snapshot newsnap;
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 3c1454df99..cb2a400cdc 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -73,6 +73,7 @@ extern void FreeSnapshotBuilder(SnapBuild *builder);
extern void SnapBuildSnapDecRefcount(Snapshot snap);
extern Snapshot SnapBuildInitialSnapshot(SnapBuild *builder);
+extern Snapshot SnapBuildMVCCFromHistoric(Snapshot snapshot, bool in_place);
extern const char *SnapBuildExportSnapshot(SnapBuild *builder);
extern void SnapBuildClearExportedSnapshot(void);
extern void SnapBuildResetExportedSnapshotState(void);
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index afc284e9c3..874c59b60d 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -61,6 +61,7 @@ extern Snapshot GetTransactionSnapshot(void);
extern Snapshot GetLatestSnapshot(void);
extern void SnapshotSetCommandId(CommandId curcid);
+extern Snapshot CopySnapshot(Snapshot snapshot);
extern Snapshot GetCatalogSnapshot(Oid relid);
extern Snapshot GetNonHistoricCatalogSnapshot(Oid relid);
extern void InvalidateCatalogSnapshot(void);
--
2.45.2
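The conversion that patch 0003 factors out (turning a "historic" snapshot, whose xip array lists committed XIDs, into an ordinary MVCC snapshot, whose xip array lists in-progress XIDs) can be illustrated with a self-contained toy model. `ToySnapshot` and `toy_mvcc_from_historic` are invented names for illustration only, not PostgreSQL's actual types or functions:

```c
#include <stdbool.h>
#include <stdlib.h>

typedef unsigned int TxnId;

typedef struct ToySnapshot
{
	TxnId		xmin;		/* all XIDs below this are settled */
	TxnId		xmax;		/* all XIDs at or above this are in the future */
	TxnId	   *xip;		/* historic: committed XIDs; MVCC: in-progress */
	int			xcnt;
	bool		is_mvcc;
} ToySnapshot;

static int
cmp_txn(const void *a, const void *b)
{
	TxnId		x = *(const TxnId *) a;
	TxnId		y = *(const TxnId *) b;

	return (x > y) - (x < y);
}

/*
 * Invert the meaning of xip: every XID in [xmin, xmax) that is NOT in the
 * historic (committed) list is recorded as in-progress in the MVCC list.
 * With in_place=false the caller gets a new snapshot and the source keeps
 * its original xip array, mirroring the patch's CopySnapshot() branch.
 */
static ToySnapshot *
toy_mvcc_from_historic(ToySnapshot *snap, bool in_place)
{
	TxnId	   *newxip = malloc(sizeof(TxnId) * (snap->xmax - snap->xmin));
	int			newxcnt = 0;
	ToySnapshot *result;

	for (TxnId xid = snap->xmin; xid < snap->xmax; xid++)
	{
		if (bsearch(&xid, snap->xip, snap->xcnt,
					sizeof(TxnId), cmp_txn) == NULL)
			newxip[newxcnt++] = xid;	/* not committed -> treat as running */
	}

	if (in_place)
		result = snap;
	else
	{
		result = malloc(sizeof(ToySnapshot));
		*result = *snap;	/* source keeps its own xip/xcnt */
	}
	result->xip = newxip;
	result->xcnt = newxcnt;
	result->is_mvcc = true;
	return result;
}
```

With xmin = 5, xmax = 10 and committed XIDs {6, 8}, the MVCC xip comes out as {5, 7, 9}, and in copy mode the source snapshot is left untouched.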
Attachment: v06-0004-Add-CONCURRENTLY-option-to-both-VACUUM-FULL-and-CLUS.patch (text/plain)
From ecd061fdcaab721454046685f9209943ab7910f7 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Wed, 11 Dec 2024 19:22:42 +0100
Subject: [PATCH 4/8] Add CONCURRENTLY option to both VACUUM FULL and CLUSTER
commands.
Both VACUUM FULL and CLUSTER commands copy the relation data into a new file,
create new indexes and eventually swap the files. To make sure that the old
file does not change during the copying, the relation is locked in an
exclusive mode, which prevents applications from both reading and writing. (To
keep the data consistent, we'd only need to prevent the applications from
writing, but even reading needs to be blocked before we can swap the files -
otherwise some applications could continue using the old file. Since we cannot
get stronger lock without releasing the weaker one first, we acquire the
exclusive lock in the beginning and keep it till the end of the processing.)
This patch introduces an alternative workflow, which only requires the
exclusive lock when the relation (and index) files are being swapped.
(Supposedly, the swapping should be pretty fast.) On the other hand, when we
copy the data to the new file, we allow applications to read from the relation
and even write into it.
First, we scan the relation using a "historic snapshot", and insert all the
tuples satisfying this snapshot into the new file. Note that, before creating
that snapshot, we need to make sure that all the other backends treat the
relation as a system catalog: in particular, they must log information on new
command IDs (CIDs). We achieve that by adding the relation ID into a shared
hash table and waiting until all the transactions currently writing into the
table (i.e. transactions possibly not aware of the new entry) have finished.
Second, logical decoding is used to capture the data changes done by
applications during the copying (i.e. changes that do not satisfy the historic
snapshot mentioned above), and those are applied to the new file before we
acquire the exclusive lock we need to swap the files. (Of course, more data
changes can take place while we are waiting for the lock - these will be
applied to the new file after we have acquired the lock, before we swap the
files.)
While copying the data into the new file, we hold a lock that prevents
applications from changing the relation tuple descriptor (tuples inserted into
the old file must fit into the new file). However, as we have to release that
lock before getting the exclusive one, it's possible that someone adds or
drops a column, or changes the data type of an existing one. Therefore we have
to check the tuple descriptor before we swap the files. If we find out that
the tuple descriptor changed, ERROR is raised and all the changes are rolled
back. Since a lot of effort can be wasted in such a case, the ALTER TABLE
command also tries to check if VACUUM FULL / CLUSTER with the CONCURRENTLY
option is running on the same relation, and raises an ERROR if it is.
Like the existing implementation of both VACUUM FULL and CLUSTER commands, the
variant with the CONCURRENTLY option also requires an extra space for the new
relation and index files (which coexist with the old files for some time). In
addition, the CONCURRENTLY option might introduce a lag in releasing WAL
segments for archiving / recycling. This is due to the decoding of the data
changes done by applications concurrently. However, this lag should not be more
than a single WAL segment.
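The workflow described in this commit message (initial copy under a snapshot, capture of concurrent DML, catch-up replay before the brief exclusive lock) can be sketched as a self-contained simulation. All names here (`ToyRow`, `Change`, `concurrent_rewrite`, and so on) are invented for illustration; the real patch drives the catch-up through logical decoding of WAL, not an in-memory log:

```c
#define MAX_ROWS 16

typedef struct { int key; int val; } ToyRow;

typedef enum { CH_INSERT, CH_UPDATE, CH_DELETE } ChangeKind;
typedef struct { ChangeKind kind; int key; int val; } Change;

static int
find_row(const ToyRow *tbl, int n, int key)
{
	for (int i = 0; i < n; i++)
		if (tbl[i].key == key)
			return i;
	return -1;
}

/*
 * Replay one captured change against the new file (a stand-in for applying
 * a logically decoded change). Updates and deletes assume the key exists,
 * which holds here because every change was captured after the copy began.
 */
static void
apply_change(ToyRow *tbl, int *n, Change c)
{
	int			i = find_row(tbl, *n, c.key);

	switch (c.kind)
	{
		case CH_INSERT:
			tbl[(*n)++] = (ToyRow) {c.key, c.val};
			break;
		case CH_UPDATE:
			tbl[i].val = c.val;
			break;
		case CH_DELETE:
			tbl[i] = tbl[--(*n)];	/* order not preserved, as in a heap */
			break;
	}
}

/*
 * Phase 1: copy the rows visible when processing started.
 * Phase 2: replay the DML captured while the copy was running ("catch-up").
 * In the real patch, only the final file swap needs the exclusive lock.
 */
static int
concurrent_rewrite(const ToyRow *old_tbl, int old_n,
				   const Change *log, int log_n,
				   ToyRow *new_tbl)
{
	int			new_n = 0;

	for (int i = 0; i < old_n; i++)		/* initial copy */
		new_tbl[new_n++] = old_tbl[i];
	for (int i = 0; i < log_n; i++)		/* catch-up */
		apply_change(new_tbl, &new_n, log[i]);
	return new_n;
}
```

Starting from rows {1:10, 2:20} with a concurrent update of key 1, insert of key 3, and delete of key 2, the rewritten table ends up holding {1:11, 3:30}, matching what the old file would have contained.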
---
doc/src/sgml/monitoring.sgml | 36 +-
doc/src/sgml/ref/cluster.sgml | 116 +-
doc/src/sgml/ref/vacuum.sgml | 22 +-
src/Makefile | 1 +
src/backend/access/heap/heapam.c | 8 +-
src/backend/access/heap/heapam_handler.c | 145 +-
src/backend/access/heap/heapam_visibility.c | 30 +-
src/backend/catalog/index.c | 43 +-
src/backend/catalog/system_views.sql | 17 +-
src/backend/commands/cluster.c | 2572 ++++++++++++++++-
src/backend/commands/matview.c | 2 +-
src/backend/commands/tablecmds.c | 11 +
src/backend/commands/vacuum.c | 126 +-
src/backend/meson.build | 1 +
src/backend/replication/logical/decode.c | 24 +
src/backend/replication/logical/snapbuild.c | 20 +
.../replication/pgoutput_cluster/Makefile | 32 +
.../replication/pgoutput_cluster/meson.build | 18 +
.../pgoutput_cluster/pgoutput_cluster.c | 288 ++
src/backend/storage/ipc/ipci.c | 3 +
src/backend/tcop/utility.c | 11 +
src/backend/utils/activity/backend_progress.c | 16 +
.../utils/activity/wait_event_names.txt | 1 +
src/backend/utils/cache/inval.c | 22 +
src/backend/utils/cache/relcache.c | 5 +
src/backend/utils/time/snapmgr.c | 3 +-
src/bin/psql/tab-complete.in.c | 5 +-
src/include/access/heapam.h | 4 +
src/include/access/tableam.h | 10 +
src/include/catalog/index.h | 3 +
src/include/commands/cluster.h | 94 +-
src/include/commands/progress.h | 17 +-
src/include/commands/vacuum.h | 17 +-
src/include/replication/snapbuild.h | 1 +
src/include/storage/lockdefs.h | 5 +-
src/include/storage/lwlocklist.h | 1 +
src/include/utils/backend_progress.h | 3 +-
src/include/utils/inval.h | 2 +
src/include/utils/rel.h | 7 +-
src/include/utils/snapmgr.h | 2 +
src/test/regress/expected/rules.out | 17 +-
41 files changed, 3563 insertions(+), 198 deletions(-)
create mode 100644 src/backend/replication/pgoutput_cluster/Makefile
create mode 100644 src/backend/replication/pgoutput_cluster/meson.build
create mode 100644 src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 840d7f8161..6abf639b3e 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5688,14 +5688,35 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>heap_tuples_written</structfield> <type>bigint</type>
+ <structfield>heap_tuples_inserted</structfield> <type>bigint</type>
</para>
<para>
- Number of heap tuples written.
+ Number of heap tuples inserted.
This counter only advances when the phase is
<literal>seq scanning heap</literal>,
- <literal>index scanning heap</literal>
- or <literal>writing new heap</literal>.
+ <literal>index scanning heap</literal>,
+ <literal>writing new heap</literal>
+ or <literal>catch-up</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_updated</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples updated.
+ This counter only advances when the phase is <literal>catch-up</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_deleted</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples deleted.
+ This counter only advances when the phase is <literal>catch-up</literal>.
</para></entry>
</row>
@@ -5776,6 +5797,13 @@ FROM pg_stat_get_backend_idset() AS backendid;
<command>CLUSTER</command> is currently writing the new heap.
</entry>
</row>
+ <row>
+ <entry><literal>catch-up</literal></entry>
+ <entry>
+ <command>CLUSTER</command> is currently processing the DML commands
+ that other transactions executed during any of the preceding phases.
+ </entry>
+ </row>
<row>
<entry><literal>swapping relation files</literal></entry>
<entry>
diff --git a/doc/src/sgml/ref/cluster.sgml b/doc/src/sgml/ref/cluster.sgml
index c5760244e6..526f0c5843 100644
--- a/doc/src/sgml/ref/cluster.sgml
+++ b/doc/src/sgml/ref/cluster.sgml
@@ -26,6 +26,7 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
<phrase>where <replaceable class="parameter">option</replaceable> can be one of:</phrase>
VERBOSE [ <replaceable class="parameter">boolean</replaceable> ]
+ CONCURRENTLY [ <replaceable class="parameter">boolean</replaceable> ]
</synopsis>
</refsynopsisdiv>
@@ -69,14 +70,17 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
<replaceable class="parameter">table_name</replaceable> reclusters all the
previously-clustered tables in the current database that the calling user
has privileges for. This form of <command>CLUSTER</command> cannot be
- executed inside a transaction block.
+ executed inside a transaction block. Also, this form is not allowed if
+ the <literal>CONCURRENTLY</literal> option is used.
</para>
<para>
- When a table is being clustered, an <literal>ACCESS
- EXCLUSIVE</literal> lock is acquired on it. This prevents any other
- database operations (both reads and writes) from operating on the
- table until the <command>CLUSTER</command> is finished.
+ When a table is being clustered, an <literal>ACCESS EXCLUSIVE</literal>
+ lock is acquired on it. This prevents any other database operations (both
+ reads and writes) from operating on the table until
+ the <command>CLUSTER</command> is finished. If you want to keep the table
+ accessible during the clustering, consider using
+ the <literal>CONCURRENTLY</literal> option.
</para>
</refsect1>
@@ -111,6 +115,108 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><literal>CONCURRENTLY</literal></term>
+ <listitem>
+ <para>
+ Allow other transactions to use the table while it is being clustered.
+ </para>
+
+ <para>
+ Internally, <command>CLUSTER</command> copies the contents of the table
+ (ignoring dead tuples) into a new file, sorted by the specified index,
+ and also creates a new file for each index. Then it swaps the old and
+ new files for the table and all the indexes, and deletes the old
+ files. The <literal>ACCESS EXCLUSIVE</literal> lock is needed to make
+ sure that the old files do not change during the processing because the
+ changes would get lost due to the swap.
+ </para>
+
+ <para>
+ With the <literal>CONCURRENTLY</literal> option, the <literal>ACCESS
+ EXCLUSIVE</literal> lock is only acquired to swap the table and index
+ files. The data changes that took place during the creation of the new
+ table and index files are captured using logical decoding
+ (<xref linkend="logicaldecoding"/>) and applied before
+ the <literal>ACCESS EXCLUSIVE</literal> lock is requested. Thus the lock
+ is typically held only for the time needed to swap the files, which
+ should be pretty short.
+ </para>
+
+ <para>
+ Note that <command>CLUSTER</command> with the
+ <literal>CONCURRENTLY</literal> option does not try to order the
+ rows inserted into the table after the clustering started. Also
+ note that <command>CLUSTER</command> might fail to complete due to DDL
+ commands executed on the table by other transactions during the
+ clustering.
+ </para>
+
+ <note>
+ <para>
+ In addition to the temporary space requirements explained below,
+ the <literal>CONCURRENTLY</literal> option can increase the use of
+ temporary space somewhat. The reason is that other transactions can
+ perform DML operations which cannot be applied to the new file until
+ <command>CLUSTER</command> has copied all the tuples from the old
+ file. Thus the tuples inserted into the old file during the copying are
+ also stored separately in a temporary file, so they can eventually
+ be applied to the new file.
+ </para>
+
+ <para>
+ Furthermore, the data changes performed during the copying are
+ extracted from the <link linkend="wal">write-ahead log</link> (WAL), and
+ this extraction (decoding) only takes place when a certain amount of WAL
+ has been written. Therefore, WAL removal can be delayed by this
+ threshold. Currently the threshold is equal to the value of
+ the <link linkend="guc-wal-segment-size"><varname>wal_segment_size</varname></link>
+ configuration parameter.
+ </para>
+ </note>
+
+ <para>
+ The <literal>CONCURRENTLY</literal> option cannot be used in the
+ following cases:
+
+ <itemizedlist>
+ <listitem>
+ <para>
+ The table is partitioned.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The table is a system catalog or a <acronym>TOAST</acronym> table.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <command>CLUSTER</command> is executed inside a transaction block.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The <link linkend="guc-wal-level"><varname>wal_level</varname></link>
+ configuration parameter is less than <literal>logical</literal>.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The <link linkend="guc-max-replication-slots"><varname>max_replication_slots</varname></link>
+ configuration parameter does not allow for creation of an additional
+ replication slot.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><replaceable class="parameter">boolean</replaceable></term>
<listitem>
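The WAL-threshold behavior described in the documentation above (decoding only runs once roughly wal_segment_size bytes of new WAL have accumulated, which bounds both the decoding overhead and the WAL-removal lag) boils down to an accumulate-and-trigger check. The following is a generic sketch with invented names, not the patch's actual code:

```c
#include <stdbool.h>

typedef unsigned long long ToyLSN;	/* byte position in the WAL stream */

/*
 * Return true when enough new WAL has accumulated since the last decode
 * pass. On a trigger, remember the new position, so the undecoded backlog
 * (and thus the delay in releasing WAL segments) stays bounded by roughly
 * one threshold's worth of WAL.
 */
static bool
should_decode(ToyLSN *last_decoded, ToyLSN end_of_wal, ToyLSN threshold)
{
	if (end_of_wal - *last_decoded > threshold)
	{
		*last_decoded = end_of_wal;
		return true;
	}
	return false;
}
```

With a threshold of 16, a caller polling at WAL positions 10, 20, 30, 40 decodes at 20 and again at 40, never letting more than about one threshold of WAL pile up.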
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index 9110938fab..da2bcd85c0 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -39,6 +39,7 @@ VACUUM [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
SKIP_DATABASE_STATS [ <replaceable class="parameter">boolean</replaceable> ]
ONLY_DATABASE_STATS [ <replaceable class="parameter">boolean</replaceable> ]
BUFFER_USAGE_LIMIT <replaceable class="parameter">size</replaceable>
+ CONCURRENTLY [ <replaceable class="parameter">boolean</replaceable> ]
<phrase>and <replaceable class="parameter">table_and_columns</replaceable> is:</phrase>
@@ -62,7 +63,8 @@ VACUUM [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
Without a <replaceable class="parameter">table_and_columns</replaceable>
list, <command>VACUUM</command> processes every table and materialized view
in the current database that the current user has permission to vacuum.
- With a list, <command>VACUUM</command> processes only those table(s).
+ With a list, <command>VACUUM</command> processes only those table(s). The
+ list is required if the <literal>CONCURRENTLY</literal> option is used.
</para>
<para>
@@ -360,6 +362,24 @@ VACUUM [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><literal>CONCURRENTLY</literal></term>
+ <listitem>
+ <para>
+ Allow other transactions to use the table while it is being vacuumed. If
+ this option is specified, <command>VACUUM</command> can only process
+ tables which have already been clustered. For more information, see the
+ description of the <literal>CONCURRENTLY</literal> option of the
+ <xref linkend="sql-cluster"/> command.
+ </para>
+
+ <para>
+ The <literal>CONCURRENTLY</literal> option can only be used
+ if <literal>FULL</literal> is used at the same time.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><replaceable class="parameter">boolean</replaceable></term>
<listitem>
diff --git a/src/Makefile b/src/Makefile
index 2f31a2f20a..8b9d30ff72 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -23,6 +23,7 @@ SUBDIRS = \
interfaces \
backend/replication/libpqwalreceiver \
backend/replication/pgoutput \
+ backend/replication/pgoutput_cluster \
fe_utils \
bin \
pl \
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d00300c5dc..a842b84415 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2070,8 +2070,14 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
/*
* If this is a catalog, we need to transmit combo CIDs to properly
* decode, so log that as well.
+ *
+ * For the main heap (as opposed to TOAST), we only receive
+ * HEAP_INSERT_NO_LOGICAL when doing VACUUM FULL / CLUSTER, in which
+ * case the visibility information does not change. Therefore, there's
+ * no need to update the decoding snapshot.
*/
- if (RelationIsAccessibleInLogicalDecoding(relation))
+ if ((options & HEAP_INSERT_NO_LOGICAL) == 0 &&
+ RelationIsAccessibleInLogicalDecoding(relation))
log_heap_new_cid(relation, heaptup);
/*
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index a8d95e0f1c..06cd85b34b 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -33,6 +33,7 @@
#include "catalog/index.h"
#include "catalog/storage.h"
#include "catalog/storage_xlog.h"
+#include "commands/cluster.h"
#include "commands/progress.h"
#include "executor/executor.h"
#include "miscadmin.h"
@@ -53,6 +54,9 @@ static void reform_and_rewrite_tuple(HeapTuple tuple,
static bool SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
HeapTuple tuple,
OffsetNumber tupoffset);
+static HeapTuple accept_tuple_for_concurrent_copy(HeapTuple tuple,
+ Snapshot snapshot,
+ Buffer buffer);
static BlockNumber heapam_scan_get_blocks_done(HeapScanDesc hscan);
@@ -681,6 +685,8 @@ static void
heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
Relation OldIndex, bool use_sort,
TransactionId OldestXmin,
+ Snapshot snapshot,
+ LogicalDecodingContext *decoding_ctx,
TransactionId *xid_cutoff,
MultiXactId *multi_cutoff,
double *num_tuples,
@@ -701,6 +707,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
bool *isnull;
BufferHeapTupleTableSlot *hslot;
BlockNumber prev_cblock = InvalidBlockNumber;
+ bool concurrent = snapshot != NULL;
+ XLogRecPtr end_of_wal_prev = GetFlushRecPtr(NULL);
/* Remember if it's a system catalog */
is_system_catalog = IsSystemRelation(OldHeap);
@@ -779,8 +787,10 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
for (;;)
{
HeapTuple tuple;
+ bool tuple_copied = false;
Buffer buf;
bool isdead;
+ HTSV_Result vis;
CHECK_FOR_INTERRUPTS();
@@ -835,7 +845,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
LockBuffer(buf, BUFFER_LOCK_SHARE);
- switch (HeapTupleSatisfiesVacuum(tuple, OldestXmin, buf))
+ switch ((vis = HeapTupleSatisfiesVacuum(tuple, OldestXmin, buf)))
{
case HEAPTUPLE_DEAD:
/* Definitely dead */
@@ -851,14 +861,15 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
case HEAPTUPLE_INSERT_IN_PROGRESS:
/*
- * Since we hold exclusive lock on the relation, normally the
- * only way to see this is if it was inserted earlier in our
- * own transaction. However, it can happen in system
+ * As long as we hold exclusive lock on the relation, normally
+ * the only way to see this is if it was inserted earlier in
+ * our own transaction. However, it can happen in system
* catalogs, since we tend to release write lock before commit
- * there. Give a warning if neither case applies; but in any
- * case we had better copy it.
+ * there. Also, there's no exclusive lock during concurrent
+ * processing. Give a warning if neither case applies; but in
+ * any case we had better copy it.
*/
- if (!is_system_catalog &&
+ if (!is_system_catalog && !concurrent &&
!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(tuple->t_data)))
elog(WARNING, "concurrent insert in progress within table \"%s\"",
RelationGetRelationName(OldHeap));
@@ -870,7 +881,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
/*
* Similar situation to INSERT_IN_PROGRESS case.
*/
- if (!is_system_catalog &&
+ if (!is_system_catalog && !concurrent &&
!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetUpdateXid(tuple->t_data)))
elog(WARNING, "concurrent delete in progress within table \"%s\"",
RelationGetRelationName(OldHeap));
@@ -884,8 +895,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
break;
}
- LockBuffer(buf, BUFFER_LOCK_UNLOCK);
-
if (isdead)
{
*tups_vacuumed += 1;
@@ -896,9 +905,47 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
*tups_vacuumed += 1;
*tups_recently_dead -= 1;
}
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
continue;
}
+ if (concurrent)
+ {
+ /*
+ * Ignore concurrent changes now; they'll be processed later via
+ * logical decoding.
+ *
+ * INSERT_IN_PROGRESS is rejected right away because our snapshot
+ * should represent a point in time which should precede (or be
+ * equal to) the state of transactions as it was when the
+ * "SatisfiesVacuum" test was performed. Thus
+ * accept_tuple_for_concurrent_copy() should not consider the
+ * tuple inserted.
+ */
+ if (vis == HEAPTUPLE_INSERT_IN_PROGRESS)
+ tuple = NULL;
+ else
+ tuple = accept_tuple_for_concurrent_copy(tuple, snapshot,
+ buf);
+ /* Tuple not suitable for the new heap? */
+ if (tuple == NULL)
+ {
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ continue;
+ }
+
+ /* Remember that we have to free the tuple eventually. */
+ tuple_copied = true;
+ }
+
+ /*
+ * In the concurrent case, we have a copy of the tuple, so we don't
+ * worry whether the source tuple will be deleted / updated after we
+ * release the lock.
+ */
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
*num_tuples += 1;
if (tuplesort != NULL)
{
@@ -915,7 +962,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
{
const int ct_index[] = {
PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED,
- PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN
+ PROGRESS_CLUSTER_HEAP_TUPLES_INSERTED
};
int64 ct_val[2];
@@ -930,6 +977,33 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
ct_val[1] = *num_tuples;
pgstat_progress_update_multi_param(2, ct_index, ct_val);
}
+ if (tuple_copied)
+ heap_freetuple(tuple);
+
+ /*
+ * Process the WAL produced by the load, as well as by other
+ * transactions, so that the replication slot can advance and WAL does
+ * not pile up. Use wal_segment_size as a threshold so that we do not
+ * introduce the decoding overhead too often.
+ *
+ * Of course, we must not apply the changes until the initial load has
+ * completed.
+ *
+ * Note that our insertions into the new table should not be decoded
+ * as we (intentionally) do not write the logical decoding specific
+ * information to WAL.
+ */
+ if (concurrent)
+ {
+ XLogRecPtr end_of_wal;
+
+ end_of_wal = GetFlushRecPtr(NULL);
+ if ((end_of_wal - end_of_wal_prev) > wal_segment_size)
+ {
+ cluster_decode_concurrent_changes(decoding_ctx, end_of_wal);
+ end_of_wal_prev = end_of_wal;
+ }
+ }
}
if (indexScan != NULL)
@@ -973,7 +1047,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
values, isnull,
rwstate);
/* Report n_tuples */
- pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN,
+ pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_TUPLES_INSERTED,
n_tuples);
}
@@ -2612,6 +2686,53 @@ SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
}
}
+/*
+ * Return a copy of 'tuple' if it has been inserted according to 'snapshot', or
+ * NULL if the insertion took place in the future. If the tuple is already
+ * marked as deleted or updated by a transaction that 'snapshot' still
+ * considers running, clear the deletion / update XID in the header of the
+ * copied tuple. This way the returned tuple is suitable for insertion into
+ * the new heap.
+ */
+static HeapTuple
+accept_tuple_for_concurrent_copy(HeapTuple tuple, Snapshot snapshot,
+ Buffer buffer)
+{
+ HeapTuple result;
+
+ Assert(snapshot->snapshot_type == SNAPSHOT_MVCC);
+
+ /*
+ * First, check if the tuple insertion is visible by our snapshot.
+ */
+ if (!HeapTupleMVCCInserted(tuple, snapshot, buffer))
+ return NULL;
+
+ result = heap_copytuple(tuple);
+
+ /*
+ * If the tuple was deleted / updated but our snapshot still sees it, we
+ * need to keep it. In that case, clear the information that indicates the
+ * deletion / update. Otherwise the tuple chain would stay incomplete (as
+ * we will reject the new tuple above), and the delete / update would fail
+ * if executed later during logical decoding.
+ */
+ if (TransactionIdIsNormal(HeapTupleHeaderGetRawXmax(result->t_data)) &&
+ HeapTupleMVCCNotDeleted(result, snapshot, buffer))
+ {
+ /* TODO More work needed here? */
+ result->t_data->t_infomask |= HEAP_XMAX_INVALID;
+ HeapTupleHeaderSetXmax(result->t_data, 0);
+ }
+
+ /*
+ * Accept the tuple even if our snapshot considers it deleted - older
+ * snapshots can still see the tuple, while the decoded transactions
+ * should not try to update / delete it again.
+ */
+ return result;
+}
+
/* ------------------------------------------------------------------------
* Definition of the heap table access method.
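The two decisions that accept_tuple_for_concurrent_copy() makes above (skip tuples inserted after the snapshot, and clear a deletion the snapshot does not yet see so that decoding can re-apply it later) can be modeled with a toy single-cutoff "snapshot". `ToyTuple` and `toy_accept_tuple` are invented names, and visibility here is deliberately simplistic compared to the heap's real MVCC rules:

```c
#include <stdlib.h>

typedef struct
{
	unsigned	xmin;		/* inserting transaction, 0 = none */
	unsigned	xmax;		/* deleting transaction, 0 = none */
	int			payload;
} ToyTuple;

/* In this toy, a transaction is visible iff its XID is below the cutoff. */
static int
visible(unsigned xid, unsigned cutoff)
{
	return xid != 0 && xid < cutoff;
}

/*
 * Return a copy suitable for the new heap, or NULL if the insertion lies in
 * the snapshot's future (that tuple will arrive via the change stream
 * instead). A deletion the snapshot cannot see yet is cleared from the
 * copy, so that the later replay of that delete finds the tuple alive.
 */
static ToyTuple *
toy_accept_tuple(const ToyTuple *t, unsigned cutoff)
{
	ToyTuple   *copy;

	if (!visible(t->xmin, cutoff))
		return NULL;

	copy = malloc(sizeof(ToyTuple));
	*copy = *t;
	if (copy->xmax != 0 && !visible(copy->xmax, cutoff))
		copy->xmax = 0;		/* deletion not visible yet: keep tuple alive */
	return copy;
}
```

Note the asymmetry: a deletion that *is* visible to the snapshot keeps its xmax and the tuple is still copied, matching the patch's rationale that older snapshots may still see such a tuple while decoded transactions must not delete it twice.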
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index 9243feed01..d702592469 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -955,16 +955,31 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
* did TransactionIdIsInProgress in each call --- to no avail, as long as the
* inserting/deleting transaction was still running --- which was more cycles
* and more contention on ProcArrayLock.
+ *
+ * The checks are split into two functions, HeapTupleMVCCInserted() and
+ * HeapTupleMVCCNotDeleted(), because they are also useful separately.
*/
static bool
HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
Buffer buffer)
{
- HeapTupleHeader tuple = htup->t_data;
-
Assert(ItemPointerIsValid(&htup->t_self));
Assert(htup->t_tableOid != InvalidOid);
+ return HeapTupleMVCCInserted(htup, snapshot, buffer) &&
+ HeapTupleMVCCNotDeleted(htup, snapshot, buffer);
+}
+
+/*
+ * HeapTupleMVCCInserted
+ * True iff heap tuple was successfully inserted for the given MVCC
+ * snapshot.
+ */
+bool
+HeapTupleMVCCInserted(HeapTuple htup, Snapshot snapshot, Buffer buffer)
+{
+ HeapTupleHeader tuple = htup->t_data;
+
if (!HeapTupleHeaderXminCommitted(tuple))
{
if (HeapTupleHeaderXminInvalid(tuple))
@@ -1073,6 +1088,17 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
}
/* by here, the inserting transaction has committed */
+ return true;
+}
+
+/*
+ * HeapTupleMVCCNotDeleted
+ * True iff heap tuple was not deleted for the given MVCC snapshot.
+ */
+bool
+HeapTupleMVCCNotDeleted(HeapTuple htup, Snapshot snapshot, Buffer buffer)
+{
+ HeapTupleHeader tuple = htup->t_data;
if (tuple->t_infomask & HEAP_XMAX_INVALID) /* xid invalid or aborted */
return true;
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 1c3a9e06d3..a1940c1fb9 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1416,22 +1416,7 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
opclassOptions[i] = get_attoptions(oldIndexId, i + 1);
- /* Extract statistic targets for each attribute */
- stattargets = palloc0_array(NullableDatum, newInfo->ii_NumIndexAttrs);
- for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
- {
- HeapTuple tp;
- Datum dat;
-
- tp = SearchSysCache2(ATTNUM, ObjectIdGetDatum(oldIndexId), Int16GetDatum(i + 1));
- if (!HeapTupleIsValid(tp))
- elog(ERROR, "cache lookup failed for attribute %d of relation %u",
- i + 1, oldIndexId);
- dat = SysCacheGetAttr(ATTNUM, tp, Anum_pg_attribute_attstattarget, &isnull);
- ReleaseSysCache(tp);
- stattargets[i].value = dat;
- stattargets[i].isnull = isnull;
- }
+ stattargets = get_index_stattargets(oldIndexId, newInfo);
/*
* Now create the new index.
@@ -1470,6 +1455,32 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
return newIndexId;
}
+NullableDatum *
+get_index_stattargets(Oid indexid, IndexInfo *indInfo)
+{
+ NullableDatum *stattargets;
+
+ /* Extract statistic targets for each attribute */
+ stattargets = palloc0_array(NullableDatum, indInfo->ii_NumIndexAttrs);
+ for (int i = 0; i < indInfo->ii_NumIndexAttrs; i++)
+ {
+ HeapTuple tp;
+ Datum dat;
+ bool isnull;
+
+ tp = SearchSysCache2(ATTNUM, ObjectIdGetDatum(indexid), Int16GetDatum(i + 1));
+ if (!HeapTupleIsValid(tp))
+ elog(ERROR, "cache lookup failed for attribute %d of relation %u",
+ i + 1, indexid);
+ dat = SysCacheGetAttr(ATTNUM, tp, Anum_pg_attribute_attstattarget, &isnull);
+ ReleaseSysCache(tp);
+ stattargets[i].value = dat;
+ stattargets[i].isnull = isnull;
+ }
+
+ return stattargets;
+}
+
/*
* index_concurrently_build
*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index da9a8fe99f..3652b8a9c5 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1240,16 +1240,19 @@ CREATE VIEW pg_stat_progress_cluster AS
WHEN 2 THEN 'index scanning heap'
WHEN 3 THEN 'sorting tuples'
WHEN 4 THEN 'writing new heap'
- WHEN 5 THEN 'swapping relation files'
- WHEN 6 THEN 'rebuilding index'
- WHEN 7 THEN 'performing final cleanup'
+ WHEN 5 THEN 'catch-up'
+ WHEN 6 THEN 'swapping relation files'
+ WHEN 7 THEN 'rebuilding index'
+ WHEN 8 THEN 'performing final cleanup'
END AS phase,
CAST(S.param3 AS oid) AS cluster_index_relid,
S.param4 AS heap_tuples_scanned,
- S.param5 AS heap_tuples_written,
- S.param6 AS heap_blks_total,
- S.param7 AS heap_blks_scanned,
- S.param8 AS index_rebuild_count
+ S.param5 AS heap_tuples_inserted,
+ S.param6 AS heap_tuples_updated,
+ S.param7 AS heap_tuples_deleted,
+ S.param8 AS heap_blks_total,
+ S.param9 AS heap_blks_scanned,
+ S.param10 AS index_rebuild_count
FROM pg_stat_get_progress_info('CLUSTER') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 7ec605b0bd..4a4b51f77d 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -25,6 +25,10 @@
#include "access/toast_internals.h"
#include "access/transam.h"
#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xloginsert.h"
+#include "access/xlogutils.h"
#include "catalog/catalog.h"
#include "catalog/dependency.h"
#include "catalog/heap.h"
@@ -32,6 +36,7 @@
#include "catalog/namespace.h"
#include "catalog/objectaccess.h"
#include "catalog/pg_am.h"
+#include "catalog/pg_control.h"
#include "catalog/pg_inherits.h"
#include "catalog/toasting.h"
#include "commands/cluster.h"
@@ -39,10 +44,15 @@
#include "commands/progress.h"
#include "commands/tablecmds.h"
#include "commands/vacuum.h"
+#include "executor/executor.h"
#include "miscadmin.h"
#include "optimizer/optimizer.h"
#include "pgstat.h"
+#include "replication/decode.h"
+#include "replication/logical.h"
+#include "replication/snapbuild.h"
#include "storage/bufmgr.h"
+#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
#include "utils/acl.h"
@@ -56,6 +66,8 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
+typedef struct RewriteStateData *RewriteState;
+
/*
* This struct is used to pass around the information on tables to be
* clustered. We need this so we can make a list of them when invoked without
@@ -67,17 +79,183 @@ typedef struct
Oid indexOid;
} RelToCluster;
+/*
+ * The following definitions are used for concurrent processing.
+ */
+
+/*
+ * OID of the table being processed by this backend.
+ */
+static Oid clustered_rel = InvalidOid;
+/* The same for its TOAST relation. */
+static Oid clustered_rel_toast = InvalidOid;
+
+/*
+ * The locators are used to avoid logical decoding of data that we do not need
+ * for our table.
+ */
+RelFileLocator clustered_rel_locator = {.relNumber = InvalidOid};
+RelFileLocator clustered_rel_toast_locator = {.relNumber = InvalidOid};
+
+/* XXX Do we also need to mention VACUUM FULL CONCURRENTLY? */
+#define CLUSTER_IN_PROGRESS_MESSAGE \
+ "relation \"%s\" is already being processed by CLUSTER CONCURRENTLY"
+
+/*
+ * Everything we need to call ExecInsertIndexTuples().
+ */
+typedef struct IndexInsertState
+{
+ ResultRelInfo *rri;
+ EState *estate;
+ ExprContext *econtext;
+
+ Relation ident_index;
+} IndexInsertState;
+
-static void cluster_multiple_rels(List *rtcs, ClusterParams *params);
-static void rebuild_relation(Relation OldHeap, Relation index, bool verbose);
+/*
+ * Catalog information to check if another backend changed the relation in
+ * such a way that makes CLUSTER CONCURRENTLY unable to continue. Such changes
+ * are possible because cluster_rel() has to release its lock on the relation
+ * in order to acquire the AccessExclusiveLock it needs to swap the relation
+ * files.
+ *
+ * The most obvious problem is a change of the tuple descriptor, since then
+ * the tuples we try to insert into the new storage are no longer guaranteed
+ * to fit into it.
+ *
+ * Another problem is the relfilenode being changed by another backend.
+ * That is not necessarily a correctness issue (e.g. when the other backend
+ * ran cluster_rel()), but it's safer for us to terminate the table
+ * processing in such cases. However, this information also needs to be
+ * checked during logical decoding, so we store it in the global variables
+ * clustered_rel_locator and clustered_rel_toast_locator above.
+ *
+ * Where possible, commands which might change the relation in an incompatible
+ * way should check if CLUSTER CONCURRENTLY is running, before they start to
+ * do the actual changes (see is_concurrent_cluster_in_progress()). Anything
+ * else must be caught by check_catalog_changes(), which uses this structure.
+ */
+typedef struct CatalogState
+{
+ /* Tuple descriptor of the relation. */
+ TupleDesc tupdesc;
+
+ /* The number of indexes tracked. */
+ int ninds;
+ /* The index OIDs. */
+ Oid *ind_oids;
+ /* The index tuple descriptors. */
+ TupleDesc *ind_tupdescs;
+
+ /* The following are copies of the corresponding fields of pg_class. */
+ char relpersistence;
+ char replident;
+
+ /* rd_replidindex */
+ Oid replidindex;
+} CatalogState;
+
+/* The WAL segment being decoded. */
+static XLogSegNo cluster_current_segment = 0;
+
+static void cluster_multiple_rels(List *rtcs, ClusterParams *params,
+ LOCKMODE lockmode, bool isTopLevel);
+static void rebuild_relation(Relation OldHeap, Relation index, bool verbose,
+ bool concurrent, bool is_vacuum);
static void copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
+ Snapshot snapshot, LogicalDecodingContext *decoding_ctx,
bool verbose, bool *pSwapToastByContent,
TransactionId *pFreezeXid, MultiXactId *pCutoffMulti);
static List *get_tables_to_cluster(MemoryContext cluster_context);
static List *get_tables_to_cluster_partitioned(MemoryContext cluster_context,
Oid indexOid);
static bool cluster_is_permitted_for_relation(Oid relid, Oid userid);
+static void check_concurrent_cluster_requirements(Relation rel,
+ bool isTopLevel,
+ bool isCluster);
+static void begin_concurrent_cluster(Relation *rel_p, Relation *index_p,
+ bool *entered_p);
+static void end_concurrent_cluster(bool error);
+static void cluster_before_shmem_exit_callback(int code, Datum arg);
+static CatalogState *get_catalog_state(Relation rel, bool is_vacuum);
+static void free_catalog_state(CatalogState *state);
+static void check_catalog_changes(Relation rel, CatalogState *cat_state);
+static LogicalDecodingContext *setup_logical_decoding(Oid relid,
+ const char *slotname,
+ TupleDesc tupdesc);
+static HeapTuple get_changed_tuple(char *change);
+static void apply_concurrent_changes(ClusterDecodingState *dstate,
+ Relation rel, ScanKey key, int nkeys,
+ IndexInsertState *iistate);
+static void apply_concurrent_insert(Relation rel, ConcurrentChange *change,
+ HeapTuple tup, IndexInsertState *iistate,
+ TupleTableSlot *index_slot);
+static void apply_concurrent_update(Relation rel, HeapTuple tup,
+ HeapTuple tup_target,
+ ConcurrentChange *change,
+ IndexInsertState *iistate,
+ TupleTableSlot *index_slot);
+static void apply_concurrent_delete(Relation rel, HeapTuple tup_target,
+ ConcurrentChange *change);
+static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
+ HeapTuple tup_key,
+ IndexInsertState *iistate,
+ TupleTableSlot *ident_slot,
+ IndexScanDesc *scan_p);
+static void process_concurrent_changes(LogicalDecodingContext *ctx,
+ XLogRecPtr end_of_wal,
+ Relation rel_dst,
+ Relation rel_src,
+ ScanKey ident_key,
+ int ident_key_nentries,
+ IndexInsertState *iistate);
+static IndexInsertState *get_index_insert_state(Relation relation,
+ Oid ident_index_id);
+static ScanKey build_identity_key(Oid ident_idx_oid, Relation rel_src,
+ int *nentries);
+static void free_index_insert_state(IndexInsertState *iistate);
+static void cleanup_logical_decoding(LogicalDecodingContext *ctx);
+static void rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
+ Relation cl_index,
+ CatalogState *cat_state,
+ LogicalDecodingContext *ctx,
+ bool swap_toast_by_content,
+ TransactionId frozenXid,
+ MultiXactId cutoffMulti);
+static List *build_new_indexes(Relation NewHeap, Relation OldHeap, List *OldIndexes);
+
+/*
+ * Use this API when a relation needs to be unlocked, closed and re-opened.
+ * If the relation got dropped while unlocked, raise an ERROR that mentions
+ * the relation name rather than the OID.
+ */
+typedef struct RelReopenInfo
+{
+ /*
+ * The relation to be closed. A pointer to the value is stored here so
+ * that the caller's reference is updated automatically on re-opening.
+ *
+ * When calling unlock_and_close_relations(), 'relid' can be passed
+ * instead of 'rel_p' when the caller only needs to gather information for
+ * subsequent opening.
+ */
+ Relation *rel_p;
+ Oid relid;
+ char relkind;
+ LOCKMODE lockmode_orig; /* The existing lock mode */
+ LOCKMODE lockmode_new; /* The lock mode after the relation is
+ * re-opened */
+
+ char *relname; /* Relation name, initialized automatically. */
+} RelReopenInfo;
+
+static void init_rel_reopen_info(RelReopenInfo *rri, Relation *rel_p,
+ Oid relid, LOCKMODE lockmode_orig,
+ LOCKMODE lockmode_new);
+static void unlock_and_close_relations(RelReopenInfo *rels, int nrel);
+static void reopen_relations(RelReopenInfo *rels, int nrel);
/*---------------------------------------------------------------------------
* This cluster code allows for clustering multiple tables at once. Because
@@ -109,10 +287,12 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
ListCell *lc;
ClusterParams params = {0};
bool verbose = false;
+ bool concurrent = false;
Relation rel = NULL;
Oid indexOid = InvalidOid;
MemoryContext cluster_context;
List *rtcs;
+ LOCKMODE lockmode;
/* Parse option list */
foreach(lc, stmt->params)
@@ -121,6 +301,8 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
if (strcmp(opt->defname, "verbose") == 0)
verbose = defGetBoolean(opt);
+ else if (strcmp(opt->defname, "concurrently") == 0)
+ concurrent = defGetBoolean(opt);
else
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
@@ -129,20 +311,30 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
parser_errposition(pstate, opt->location)));
}
- params.options = (verbose ? CLUOPT_VERBOSE : 0);
+ params.options =
+ (verbose ? CLUOPT_VERBOSE : 0) |
+ (concurrent ? CLUOPT_CONCURRENT : 0);
+
+ /*
+ * Determine the lock mode expected by cluster_rel().
+ *
+ * In the exclusive case, we obtain AccessExclusiveLock right away to
+ * avoid a lock-upgrade hazard in the single-transaction case. In the
+ * CONCURRENT case, the AccessExclusiveLock will only be taken at the end
+ * of processing, supposedly for a very short time. Until then, we'll
+ * have to unlock the relation temporarily, so there's no lock-upgrade
+ * hazard.
+ */
+ lockmode = (params.options & CLUOPT_CONCURRENT) == 0 ?
+ AccessExclusiveLock : ShareUpdateExclusiveLock;
if (stmt->relation != NULL)
{
/* This is the single-relation case. */
Oid tableOid;
- /*
- * Find, lock, and check permissions on the table. We obtain
- * AccessExclusiveLock right away to avoid lock-upgrade hazard in the
- * single-transaction case.
- */
+ /* Find, lock, and check permissions on the table. */
tableOid = RangeVarGetRelidExtended(stmt->relation,
- AccessExclusiveLock,
+ lockmode,
0,
RangeVarCallbackMaintainsTable,
NULL);
@@ -194,7 +386,7 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
/* For non-partitioned tables, do what we came here to do. */
if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
{
- cluster_rel(rel, indexOid, ¶ms);
+ cluster_rel(rel, indexOid, ¶ms, isTopLevel, false);
/* cluster_rel closes the relation, but keeps lock */
return;
@@ -202,10 +394,29 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
}
/*
- * By here, we know we are in a multi-table situation. In order to avoid
- * holding locks for too long, we want to process each table in its own
- * transaction. This forces us to disallow running inside a user
- * transaction block.
+ * By here, we know we are in a multi-table situation.
+ *
+ * Concurrent processing is currently considered rather special (e.g. in
+ * terms of resources consumed) so it is not performed in bulk.
+ */
+ if (params.options & CLUOPT_CONCURRENT)
+ {
+ if (rel != NULL)
+ {
+ Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
+ ereport(ERROR,
+ (errmsg("CLUSTER (CONCURRENTLY) not supported for partitioned tables"),
+ errhint("Consider running the command for individual partitions.")));
+ }
+ else
+ ereport(ERROR,
+ (errmsg("CLUSTER (CONCURRENTLY) requires explicit table name")));
+ }
+
+ /*
+ * In order to avoid holding locks for too long, we want to process each
+ * table in its own transaction. This forces us to disallow running
+ * inside a user transaction block.
*/
PreventInTransactionBlock(isTopLevel, "CLUSTER");
@@ -230,11 +441,14 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
if (rel != NULL)
{
Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
+ /* See the ereport() above. */
+ Assert((params.options & CLUOPT_CONCURRENT) == 0);
+
check_index_is_clusterable(rel, indexOid, AccessShareLock);
rtcs = get_tables_to_cluster_partitioned(cluster_context, indexOid);
/* close relation, releasing lock on parent table */
- table_close(rel, AccessExclusiveLock);
+ table_close(rel, lockmode);
}
else
{
@@ -243,7 +457,7 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
}
/* Do the job. */
- cluster_multiple_rels(rtcs, ¶ms);
+ cluster_multiple_rels(rtcs, ¶ms, lockmode, isTopLevel);
/* Start a new transaction for the cleanup work. */
StartTransactionCommand();
@@ -260,7 +474,8 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
* return.
*/
static void
-cluster_multiple_rels(List *rtcs, ClusterParams *params)
+cluster_multiple_rels(List *rtcs, ClusterParams *params, LOCKMODE lockmode,
+ bool isTopLevel)
{
ListCell *lc;
@@ -280,10 +495,10 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
/* functions in indexes may want a snapshot set */
PushActiveSnapshot(GetTransactionSnapshot());
- rel = table_open(rtc->tableOid, AccessExclusiveLock);
+ rel = table_open(rtc->tableOid, lockmode);
/* Process this table */
- cluster_rel(rel, rtc->indexOid, params);
+ cluster_rel(rel, rtc->indexOid, params, isTopLevel, false);
/* cluster_rel closes the relation, but keeps lock */
PopActiveSnapshot();
@@ -306,9 +521,16 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
* If indexOid is InvalidOid, the table will be rewritten in physical order
* instead of index order. This is the new implementation of VACUUM FULL,
* and error messages should refer to the operation as VACUUM not CLUSTER.
+ *
+ * Note that, in the concurrent case, the function releases the lock at some
+ * point, in order to get AccessExclusiveLock for the final steps (i.e. to
+ * swap the relation files). To make things simpler, the caller should expect
+ * OldHeap to be closed on return, regardless of CLUOPT_CONCURRENT. (The
+ * AccessExclusiveLock is kept till the end of the transaction.)
*/
void
-cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
+cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
+ bool isTopLevel, bool isVacuum)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid save_userid;
@@ -317,8 +539,46 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
bool verbose = ((params->options & CLUOPT_VERBOSE) != 0);
bool recheck = ((params->options & CLUOPT_RECHECK) != 0);
Relation index = NULL;
+ bool concurrent = ((params->options & CLUOPT_CONCURRENT) != 0);
+ LOCKMODE lmode;
+ bool entered, success;
+
+ /*
+ * Check that the correct lock is held. The lock mode is
+ * AccessExclusiveLock for normal processing and ShareUpdateExclusiveLock
+ * for concurrent processing (so that SELECT, INSERT, UPDATE and DELETE
+ * commands work, but cluster_rel() cannot be called concurrently for the
+ * same relation).
+ */
+ lmode = !concurrent ? AccessExclusiveLock : ShareUpdateExclusiveLock;
+
+ /*
+ * Skip the relation if it's being processed concurrently. In such a case,
+ * we cannot rely on a lock because the other backend needs to release it
+ * temporarily at some point.
+ *
+ * This check should not take place until we have a lock that prevents
+ * another backend from starting VACUUM FULL / CLUSTER CONCURRENTLY after
+ * our check.
+ */
+ Assert(CheckRelationLockedByMe(OldHeap, lmode, false));
+ if (is_concurrent_cluster_in_progress(tableOid))
+ {
+ ereport(NOTICE,
+ (errmsg(CLUSTER_IN_PROGRESS_MESSAGE,
+ RelationGetRelationName(OldHeap))));
+ table_close(OldHeap, lmode);
+ return;
+ }
+
+ /* There are specific requirements on concurrent processing. */
+ if (concurrent)
+ {
+ check_concurrent_cluster_requirements(OldHeap, isTopLevel,
+ OidIsValid(indexOid));
- Assert(CheckRelationLockedByMe(OldHeap, AccessExclusiveLock, false));
+ check_relation_is_clusterable_concurrently(OldHeap, isVacuum);
+ }
/* Check for user-requested abort. */
CHECK_FOR_INTERRUPTS();
@@ -355,7 +615,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
/* Check that the user still has privileges for the relation */
if (!cluster_is_permitted_for_relation(tableOid, save_userid))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
goto out;
}
@@ -370,7 +630,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
*/
if (RELATION_IS_OTHER_TEMP(OldHeap))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
goto out;
}
@@ -381,7 +641,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
*/
if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(indexOid)))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
goto out;
}
@@ -392,7 +652,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
if ((params->options & CLUOPT_RECHECK_ISCLUSTERED) != 0 &&
!get_index_isclustered(indexOid))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
goto out;
}
}
@@ -408,6 +668,11 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("cannot cluster a shared catalog")));
+ /*
+ * The CONCURRENT case should have been rejected earlier because it does
+ * not support system catalogs.
+ */
+ Assert(!(OldHeap->rd_rel->relisshared && concurrent));
/*
* Don't process temp tables of other backends ... their local buffer
@@ -434,7 +699,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
/* Check heap and index are valid to cluster on */
if (OidIsValid(indexOid))
{
- check_index_is_clusterable(OldHeap, indexOid, AccessExclusiveLock);
+ check_index_is_clusterable(OldHeap, indexOid, lmode);
/* Open the index (It should already be locked.) */
index = index_open(indexOid, NoLock);
}
@@ -449,7 +714,8 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
if (OldHeap->rd_rel->relkind == RELKIND_MATVIEW &&
!RelationIsPopulated(OldHeap))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ index_close(index, lmode);
+ relation_close(OldHeap, lmode);
goto out;
}
@@ -462,11 +728,42 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
* invalid, because we move tuples around. Promote them to relation
* locks. Predicate locks on indexes will be promoted when they are
* reindexed.
+ *
+ * During concurrent processing, the heap as well as its indexes stay in
+ * operation, so we postpone this step until they are locked using
+ * AccessExclusiveLock near the end of the processing.
*/
- TransferPredicateLocksToHeapRelation(OldHeap);
+ if (!concurrent)
+ TransferPredicateLocksToHeapRelation(OldHeap);
/* rebuild_relation does all the dirty work */
- rebuild_relation(OldHeap, index, verbose);
+ entered = false;
+ success = false;
+ PG_TRY();
+ {
+ /*
+ * For concurrent processing, make sure other transactions treat this
+ * table as if it were a system / user catalog, and WAL the relevant
+ * additional information. ERROR is raised if another backend is
+ * processing the same table.
+ */
+ if (concurrent)
+ {
+ Relation *index_p = index ? &index : NULL;
+
+ begin_concurrent_cluster(&OldHeap, index_p, &entered);
+ }
+
+ rebuild_relation(OldHeap, index, verbose, concurrent, isVacuum);
+ success = true;
+ }
+ PG_FINALLY();
+ {
+ if (concurrent && entered)
+ end_concurrent_cluster(!success);
+ }
+ PG_END_TRY();
+
/* rebuild_relation closes OldHeap, and index if valid */
out:
@@ -612,18 +909,84 @@ mark_index_clustered(Relation rel, Oid indexOid, bool is_internal)
table_close(pg_index, RowExclusiveLock);
}
+/*
+ * Check if the CONCURRENTLY option is legal for the relation.
+ */
+void
+check_relation_is_clusterable_concurrently(Relation rel, bool is_vacuum)
+{
+ char relpersistence, replident;
+ Oid ident_idx;
+ const char *cmd = is_vacuum ? "VACUUM" : "CLUSTER";
+
+ /* Data changes in system relations are not logically decoded. */
+ if (IsCatalogRelation(rel))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot process relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("%s (CONCURRENTLY) is not supported for catalog relations.",
+ cmd)));
+
+ if (IsToastRelation(rel))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot process relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("%s (CONCURRENTLY) is not supported for TOAST relations, unless the main relation is processed too.",
+ cmd)));
+
+ relpersistence = rel->rd_rel->relpersistence;
+ if (relpersistence != RELPERSISTENCE_PERMANENT)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot process relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("%s (CONCURRENTLY) is only allowed for permanent relations.",
+ cmd)));
+
+ /* With NOTHING, WAL does not contain the old tuple. */
+ replident = rel->rd_rel->relreplident;
+ if (replident == REPLICA_IDENTITY_NOTHING)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot process relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("Relation \"%s\" has insufficient replication identity.",
+ RelationGetRelationName(rel))));
+
+ /*
+ * The identity index is not set if the replica identity is FULL, but a
+ * PK might exist in such a case.
+ */
+ ident_idx = RelationGetReplicaIndex(rel);
+ if (!OidIsValid(ident_idx) && OidIsValid(rel->rd_pkindex))
+ ident_idx = rel->rd_pkindex;
+ if (!OidIsValid(ident_idx))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot process relation \"%s\"",
+ RelationGetRelationName(rel)),
+ (errhint("Relation \"%s\" has no identity index.",
+ RelationGetRelationName(rel)))));
+}
+
/*
* rebuild_relation: rebuild an existing relation in index or physical order
*
- * OldHeap: table to rebuild --- must be opened and exclusive-locked!
+ * OldHeap: table to rebuild --- must be opened and locked. See cluster_rel()
+ * for comments on the required lock strength.
+ *
* index: index to cluster by, or NULL to rewrite in physical order. Must be
* opened and locked.
*
* On exit, the heap (and also the index, if one was passed) are closed, but
- * still locked with AccessExclusiveLock.
+ * still locked with AccessExclusiveLock. (The function handles the lock
+ * upgrade if 'concurrent' is true.)
*/
static void
-rebuild_relation(Relation OldHeap, Relation index, bool verbose)
+rebuild_relation(Relation OldHeap, Relation index, bool verbose,
+ bool concurrent, bool is_vacuum)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid accessMethod = OldHeap->rd_rel->relam;
@@ -631,10 +994,75 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose)
Oid OIDNewHeap;
Relation NewHeap;
char relpersistence;
- bool is_system_catalog;
bool swap_toast_by_content;
TransactionId frozenXid;
MultiXactId cutoffMulti;
+ NameData slotname;
+ LogicalDecodingContext *ctx = NULL;
+ Snapshot snapshot = NULL;
+ CatalogState *cat_state = NULL;
+
+ if (concurrent)
+ {
+ TupleDesc tupdesc;
+ RelReopenInfo rri[2];
+ int nrel;
+
+ /*
+ * CLUSTER CONCURRENTLY is not allowed in a transaction block, so this
+ * should never fire.
+ */
+ Assert(GetTopTransactionIdIfAny() == InvalidTransactionId);
+
+ /*
+ * A single backend should not execute multiple CLUSTER commands at a
+ * time, so use PID to make the slot unique.
+ */
+ snprintf(NameStr(slotname), NAMEDATALEN, "cluster_%d", MyProcPid);
+
+ /*
+ * Gather catalog information so that we can check later if the old
+ * relation has not changed while unlocked.
+ *
+ * Since this function also checks if the relation can be processed,
+ * it's important to call it before we spend a notable amount of time
+ * setting up the logical decoding. It is not clear whether it needs to
+ * be called even earlier.
+ */
+ cat_state = get_catalog_state(OldHeap, is_vacuum);
+
+ tupdesc = CreateTupleDescCopy(RelationGetDescr(OldHeap));
+
+ /*
+ * Unlock the relation (and possibly the clustering index) to avoid
+ * deadlock because setup_logical_decoding() will wait for all the
+ * running transactions (with XID assigned) to finish. Some of those
+ * transactions might be waiting for a lock on our relation.
+ */
+ nrel = 0;
+ init_rel_reopen_info(&rri[nrel++], &OldHeap, InvalidOid,
+ ShareUpdateExclusiveLock,
+ ShareUpdateExclusiveLock);
+ if (index)
+ init_rel_reopen_info(&rri[nrel++], &index, InvalidOid,
+ ShareUpdateExclusiveLock,
+ ShareUpdateExclusiveLock);
+ unlock_and_close_relations(rri, nrel);
+
+ /* Prepare to capture the concurrent data changes. */
+ ctx = setup_logical_decoding(tableOid, NameStr(slotname), tupdesc);
+
+ /* Lock the table (and index) again. */
+ reopen_relations(rri, nrel);
+
+ /*
+ * Check whether 'tupdesc' has changed while the relation was
+ * unlocked.
+ */
+ check_catalog_changes(OldHeap, cat_state);
+
+ snapshot = SnapBuildInitialSnapshotForCluster(ctx->snapshot_builder);
+ }
if (index)
/* Mark the correct index as clustered */
@@ -642,7 +1070,6 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose)
/* Remember info about rel before closing OldHeap */
relpersistence = OldHeap->rd_rel->relpersistence;
- is_system_catalog = IsSystemRelation(OldHeap);
/*
* Create the transient table that will receive the re-ordered data.
@@ -661,30 +1088,51 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose)
Assert(CheckRelationLockedByMe(NewHeap, AccessExclusiveLock, false));
/* Copy the heap data into the new table in the desired order */
- copy_table_data(NewHeap, OldHeap, index, verbose,
+ copy_table_data(NewHeap, OldHeap, index, snapshot, ctx, verbose,
&swap_toast_by_content, &frozenXid, &cutoffMulti);
+ if (concurrent)
+ {
+ rebuild_relation_finish_concurrent(NewHeap, OldHeap, index,
+ cat_state, ctx,
+ swap_toast_by_content,
+ frozenXid, cutoffMulti);
+
+ pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
+ PROGRESS_CLUSTER_PHASE_FINAL_CLEANUP);
+
+ /* Done with decoding. */
+ FreeSnapshot(snapshot);
+ free_catalog_state(cat_state);
+ cleanup_logical_decoding(ctx);
+ ReplicationSlotRelease();
+ ReplicationSlotDrop(NameStr(slotname), false);
+ }
+ else
+ {
+ bool is_system_catalog = IsSystemRelation(OldHeap);
- /* Close relcache entries, but keep lock until transaction commit */
- table_close(OldHeap, NoLock);
- if (index)
- index_close(index, NoLock);
+ /* Close relcache entries, but keep lock until transaction commit */
+ table_close(OldHeap, NoLock);
+ if (index)
+ index_close(index, NoLock);
- /*
- * Close the new relation so it can be dropped as soon as the storage is
- * swapped. The relation is not visible to others, so no need to unlock it
- * explicitly.
- */
- table_close(NewHeap, NoLock);
+ /*
+ * Close the new relation so it can be dropped as soon as the storage
+ * is swapped. The relation is not visible to others, so no need to
+ * unlock it explicitly.
+ */
+ table_close(NewHeap, NoLock);
- /*
- * Swap the physical files of the target and transient tables, then
- * rebuild the target's indexes and throw away the transient table.
- */
- finish_heap_swap(tableOid, OIDNewHeap, is_system_catalog,
- swap_toast_by_content, false, true,
- frozenXid, cutoffMulti,
- relpersistence);
+ /*
+ * Swap the physical files of the target and transient tables, then
+ * rebuild the target's indexes and throw away the transient table.
+ */
+ finish_heap_swap(tableOid, OIDNewHeap, is_system_catalog,
+ swap_toast_by_content, false, true, true,
+ frozenXid, cutoffMulti,
+ relpersistence);
+ }
}
@@ -819,15 +1267,19 @@ make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
/*
* Do the physical copying of table data.
*
+ * 'snapshot' and 'decoding_ctx': see table_relation_copy_for_cluster(). Pass
+ * iff concurrent processing is required.
+ *
* There are three output parameters:
* *pSwapToastByContent is set true if toast tables must be swapped by content.
* *pFreezeXid receives the TransactionId used as freeze cutoff point.
* *pCutoffMulti receives the MultiXactId used as a cutoff point.
*/
static void
-copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verbose,
- bool *pSwapToastByContent, TransactionId *pFreezeXid,
- MultiXactId *pCutoffMulti)
+copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
+ Snapshot snapshot, LogicalDecodingContext *decoding_ctx,
+ bool verbose, bool *pSwapToastByContent,
+ TransactionId *pFreezeXid, MultiXactId *pCutoffMulti)
{
Relation relRelation;
HeapTuple reltup;
@@ -844,6 +1296,7 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
int elevel = verbose ? INFO : DEBUG2;
PGRUsage ru0;
char *nspname;
+ bool concurrent = snapshot != NULL;
pg_rusage_init(&ru0);
@@ -870,8 +1323,12 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
*
* We don't need to open the toast relation here, just lock it. The lock
* will be held till end of transaction.
+ *
+ * In the CONCURRENT case, the lock does not help because we need to
+ * release it temporarily at some point. Instead, we expect VACUUM /
+ * CLUSTER to skip tables which are present in ClusteredRelsHash.
*/
- if (OldHeap->rd_rel->reltoastrelid)
+ if (OldHeap->rd_rel->reltoastrelid && !concurrent)
LockRelationOid(OldHeap->rd_rel->reltoastrelid, AccessExclusiveLock);
/*
@@ -947,8 +1404,46 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
* provided, else plain seqscan.
*/
if (OldIndex != NULL && OldIndex->rd_rel->relam == BTREE_AM_OID)
+ {
+ ResourceOwner oldowner = CurrentResourceOwner;
+
+ /*
+ * In the CONCURRENT case, do the planning in a subtransaction so that
+ * we don't leave any additional locks behind us that we cannot
+ * release easily.
+ */
+ if (concurrent)
+ {
+ Assert(CheckRelationLockedByMe(OldHeap, ShareUpdateExclusiveLock,
+ false));
+ Assert(CheckRelationLockedByMe(OldIndex, ShareUpdateExclusiveLock,
+ false));
+ BeginInternalSubTransaction("plan_cluster_use_sort");
+ }
+
use_sort = plan_cluster_use_sort(RelationGetRelid(OldHeap),
RelationGetRelid(OldIndex));
+
+ if (concurrent)
+ {
+ PgBackendProgress progress;
+
+ /*
+ * Command progress reporting gets terminated at subtransaction
+ * end. Save the status so it can be eventually restored.
+ */
+ memcpy(&progress, &MyBEEntry->st_progress,
+ sizeof(PgBackendProgress));
+
+ /* Release the locks by aborting the subtransaction. */
+ RollbackAndReleaseCurrentSubTransaction();
+
+ /* Restore the progress reporting status. */
+ pgstat_progress_restore_state(&progress);
+
+ CurrentResourceOwner = oldowner;
+ }
+ }
else
use_sort = false;
@@ -977,7 +1472,9 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
* values (e.g. because the AM doesn't use freezing).
*/
table_relation_copy_for_cluster(OldHeap, NewHeap, OldIndex, use_sort,
- cutoffs.OldestXmin, &cutoffs.FreezeLimit,
+ cutoffs.OldestXmin, snapshot,
+ decoding_ctx,
+ &cutoffs.FreezeLimit,
&cutoffs.MultiXactCutoff,
&num_tuples, &tups_vacuumed,
&tups_recently_dead);
@@ -986,7 +1483,11 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
*pFreezeXid = cutoffs.FreezeLimit;
*pCutoffMulti = cutoffs.MultiXactCutoff;
- /* Reset rd_toastoid just to be tidy --- it shouldn't be looked at again */
+ /*
+ * Reset rd_toastoid just to be tidy --- it shouldn't be looked at
+ * again. In the CONCURRENT case, we need to set it again before applying
+ * the concurrent changes.
+ */
NewHeap->rd_toastoid = InvalidOid;
num_pages = RelationGetNumberOfBlocks(NewHeap);
@@ -1439,14 +1940,13 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
bool swap_toast_by_content,
bool check_constraints,
bool is_internal,
+ bool reindex,
TransactionId frozenXid,
MultiXactId cutoffMulti,
char newrelpersistence)
{
ObjectAddress object;
Oid mapped_tables[4];
- int reindex_flags;
- ReindexParams reindex_params = {0};
int i;
/* Report that we are now swapping relation files */
@@ -1472,39 +1972,46 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
if (is_system_catalog)
CacheInvalidateCatalog(OIDOldHeap);
- /*
- * Rebuild each index on the relation (but not the toast table, which is
- * all-new at this point). It is important to do this before the DROP
- * step because if we are processing a system catalog that will be used
- * during DROP, we want to have its indexes available. There is no
- * advantage to the other order anyway because this is all transactional,
- * so no chance to reclaim disk space before commit. We do not need a
- * final CommandCounterIncrement() because reindex_relation does it.
- *
- * Note: because index_build is called via reindex_relation, it will never
- * set indcheckxmin true for the indexes. This is OK even though in some
- * sense we are building new indexes rather than rebuilding existing ones,
- * because the new heap won't contain any HOT chains at all, let alone
- * broken ones, so it can't be necessary to set indcheckxmin.
- */
- reindex_flags = REINDEX_REL_SUPPRESS_INDEX_USE;
- if (check_constraints)
- reindex_flags |= REINDEX_REL_CHECK_CONSTRAINTS;
+ if (reindex)
+ {
+ int reindex_flags;
+ ReindexParams reindex_params = {0};
- /*
- * Ensure that the indexes have the same persistence as the parent
- * relation.
- */
- if (newrelpersistence == RELPERSISTENCE_UNLOGGED)
- reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
- else if (newrelpersistence == RELPERSISTENCE_PERMANENT)
- reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+ /*
+ * Rebuild each index on the relation (but not the toast table, which
+ * is all-new at this point). It is important to do this before the
+ * DROP step because if we are processing a system catalog that will
+ * be used during DROP, we want to have its indexes available. There
+ * is no advantage to the other order anyway because this is all
+ * transactional, so no chance to reclaim disk space before commit.
+ * We do not need a final CommandCounterIncrement() because
+ * reindex_relation does it.
+ *
+ * Note: because index_build is called via reindex_relation, it will never
+ * set indcheckxmin true for the indexes. This is OK even though in some
+ * sense we are building new indexes rather than rebuilding existing ones,
+ * because the new heap won't contain any HOT chains at all, let alone
+ * broken ones, so it can't be necessary to set indcheckxmin.
+ */
+ reindex_flags = REINDEX_REL_SUPPRESS_INDEX_USE;
+ if (check_constraints)
+ reindex_flags |= REINDEX_REL_CHECK_CONSTRAINTS;
- /* Report that we are now reindexing relations */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_REBUILD_INDEX);
+ /*
+ * Ensure that the indexes have the same persistence as the parent
+ * relation.
+ */
+ if (newrelpersistence == RELPERSISTENCE_UNLOGGED)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
+ else if (newrelpersistence == RELPERSISTENCE_PERMANENT)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
- reindex_relation(NULL, OIDOldHeap, reindex_flags, &reindex_params);
+ /* Report that we are now reindexing relations */
+ pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
+ PROGRESS_CLUSTER_PHASE_REBUILD_INDEX);
+
+ reindex_relation(NULL, OIDOldHeap, reindex_flags, &reindex_params);
+ }
/* Report that we are now doing clean up */
pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
@@ -1744,3 +2251,1886 @@ cluster_is_permitted_for_relation(Oid relid, Oid userid)
get_rel_name(relid))));
return false;
}
+
+#define REPL_PLUGIN_NAME "pgoutput_cluster"
+
+/*
+ * Each relation being processed by CLUSTER CONCURRENTLY must be in the
+ * clusteredRels hashtable.
+ */
+typedef struct ClusteredRel
+{
+ Oid relid;
+ Oid dbid;
+} ClusteredRel;
+
+static HTAB *ClusteredRelsHash = NULL;
+
+/* Maximum number of entries in the hashtable. */
+static int maxClusteredRels = 0;
+
+Size
+ClusterShmemSize(void)
+{
+ /*
+ * A replication slot is needed for the processing, so use this GUC to
+ * allocate memory for the hashtable. Also reserve space for TOAST
+ * relations.
+ */
+ maxClusteredRels = max_replication_slots * 2;
+
+ return hash_estimate_size(maxClusteredRels, sizeof(ClusteredRel));
+}
+
+void
+ClusterShmemInit(void)
+{
+ HASHCTL info;
+
+ info.keysize = sizeof(ClusteredRel);
+ info.entrysize = info.keysize;
+
+ ClusteredRelsHash = ShmemInitHash("Clustered Relations",
+ maxClusteredRels,
+ maxClusteredRels,
+ &info,
+ HASH_ELEM | HASH_BLOBS);
+}
+
+/*
+ * Perform a preliminary check whether CLUSTER / VACUUM FULL CONCURRENTLY is
+ * possible. Note that here we only check things that should not change if we
+ * release the relation lock temporarily. The information that can change due
+ * to unlocking is checked in get_catalog_state().
+ */
+static void
+check_concurrent_cluster_requirements(Relation rel, bool isTopLevel,
+ bool isCluster)
+{
+ const char *stmt;
+
+ if (isCluster)
+ stmt = "CLUSTER (CONCURRENTLY)";
+ else
+ stmt = "VACUUM (FULL, CONCURRENTLY)";
+
+ /*
+ * Make sure we have no XID assigned, otherwise a call to
+ * setup_logical_decoding() can cause a deadlock.
+ */
+ PreventInTransactionBlock(isTopLevel, stmt);
+
+ CheckSlotPermissions();
+
+ /*
+ * Use an existing function to check if we can use logical
+ * decoding. However, note that RecoveryInProgress() should already have
+ * raised an error, as it does for the non-concurrent VACUUM FULL / CLUSTER.
+ */
+ CheckLogicalDecodingRequirements();
+
+ /* See ClusterShmemSize() */
+ if (max_replication_slots < 2)
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ (errmsg("%s requires \"max_replication_slots\" to be at least 2",
+ stmt)));
+}
+
+/*
+ * Call this function before CLUSTER CONCURRENTLY starts to setup logical
+ * decoding. It makes sure that other users of the table put enough
+ * information into WAL.
+ *
+ * The point is that in various places we expect the table we're
+ * processing to be treated like a system catalog. For example, we need to be
+ * able to scan it using a "historic snapshot" anytime during the processing
+ * (as opposed to scanning only at the start point of the decoding, as
+ * logical replication does during initial table synchronization), in order
+ * to apply concurrent UPDATE / DELETE commands.
+ *
+ * Since we need to close and reopen the relation here, the 'rel_p' and
+ * 'index_p' arguments are in/out.
+ *
+ * 'entered_p' receives a bool value telling whether the relation OID was
+ * entered into the hashtable or not.
+ */
+static void
+begin_concurrent_cluster(Relation *rel_p, Relation *index_p,
+ bool *entered_p)
+{
+ Relation rel = *rel_p;
+ Oid relid, toastrelid;
+ ClusteredRel key, *entry;
+ bool found;
+ RelReopenInfo rri[2];
+ int nrel;
+ static bool before_shmem_exit_callback_setup = false;
+
+ relid = RelationGetRelid(rel);
+
+ /*
+ * Make sure that we do not leave an entry in ClusteredRelsHash if exiting
+ * due to FATAL.
+ */
+ if (!before_shmem_exit_callback_setup)
+ {
+ before_shmem_exit(cluster_before_shmem_exit_callback, 0);
+ before_shmem_exit_callback_setup = true;
+ }
+
+ memset(&key, 0, sizeof(key));
+ key.relid = relid;
+ key.dbid = MyDatabaseId;
+
+ *entered_p = false;
+ LWLockAcquire(ClusteredRelsLock, LW_EXCLUSIVE);
+ entry = (ClusteredRel *)
+ hash_search(ClusteredRelsHash, &key, HASH_ENTER_NULL, &found);
+ if (found)
+ {
+ /*
+ * Since CLUSTER CONCURRENTLY takes ShareRowExclusiveLock, a conflict
+ * should occur much earlier. However, that lock may be released
+ * temporarily, see below. Anyway, we should complain whatever the
+ * reason for the conflict might be.
+ */
+ ereport(ERROR,
+ (errmsg(CLUSTER_IN_PROGRESS_MESSAGE,
+ RelationGetRelationName(rel))));
+ }
+ if (entry == NULL)
+ ereport(ERROR,
+ (errmsg("too many requests for CLUSTER CONCURRENTLY at a time")),
+ (errhint("Please consider increasing the \"max_replication_slots\" configuration parameter.")));
+
+ /*
+ * Even if the insertion of TOAST relid should fail below, the caller has
+ * to do cleanup.
+ */
+ *entered_p = true;
+
+ /*
+ * Enable the callback to remove the entry in case of exit. We should not
+ * do this earlier, otherwise an attempt to insert an already existing
+ * entry could make us remove that entry (inserted by another backend)
+ * during ERROR handling.
+ */
+ Assert(!OidIsValid(clustered_rel));
+ clustered_rel = relid;
+
+ /*
+ * The TOAST relation is not accessed using a historic snapshot, but we
+ * enter it here to protect it from being VACUUMed by another backend. (A
+ * lock does not help in the CONCURRENT case because we cannot hold it
+ * continuously till the end of the transaction.) See the comments on
+ * locking the TOAST relation in copy_table_data().
+ */
+ toastrelid = rel->rd_rel->reltoastrelid;
+ if (OidIsValid(toastrelid))
+ {
+ key.relid = toastrelid;
+ entry = (ClusteredRel *)
+ hash_search(ClusteredRelsHash, &key, HASH_ENTER_NULL, &found);
+ if (found)
+ /*
+ * If we could enter the main relation, entering the TOAST relation
+ * should succeed too. Nevertheless, check.
+ */
+ ereport(ERROR,
+ (errmsg("TOAST relation of \"%s\" is already being processed by CLUSTER CONCURRENTLY",
+ RelationGetRelationName(rel))));
+ if (entry == NULL)
+ ereport(ERROR,
+ (errmsg("too many requests for CLUSTER CONCURRENTLY at a time")),
+ (errhint("Please consider increasing the \"max_replication_slots\" configuration parameter.")));
+
+ Assert(!OidIsValid(clustered_rel_toast));
+ clustered_rel_toast = toastrelid;
+ }
+ LWLockRelease(ClusteredRelsLock);
+
+ /*
+ * Make sure that other backends are aware of the new hash entry.
+ *
+ * Besides sending the invalidation message, we need to force re-opening
+ * of the relation, which includes the actual invalidation (and thus
+ * checking of our hashtable on the next access).
+ */
+ CacheInvalidateRelcacheImmediate(rel);
+ /*
+ * Since the hashtable only needs to be checked by write transactions,
+ * lock the relation in a mode that conflicts with any DML command. (The
+ * reading transactions are supposed to close the relation before opening
+ * it with a higher lock.) Once we have the relation (and its index) locked,
+ * we unlock it immediately and then re-lock using the original mode.
+ */
+ nrel = 0;
+ init_rel_reopen_info(&rri[nrel++], rel_p, InvalidOid,
+ ShareUpdateExclusiveLock, ShareLock);
+ if (index_p)
+ {
+ /*
+ * Another transaction might want to open both the relation and the
+ * index. If it already has the relation lock and is waiting for the
+ * index lock, we should release the index lock, otherwise our request
+ * for ShareLock on the relation can end up in a deadlock.
+ */
+ init_rel_reopen_info(&rri[nrel++], index_p, InvalidOid,
+ ShareUpdateExclusiveLock, ShareLock);
+ }
+ unlock_and_close_relations(rri, nrel);
+ /*
+ * XXX It's not strictly necessary to lock the index here, but it's
+ * probably not worth teaching the "reopen API" about this special case.
+ */
+ reopen_relations(rri, nrel);
+
+ /* Switch back to the original lock. */
+ nrel = 0;
+ init_rel_reopen_info(&rri[nrel++], rel_p, InvalidOid,
+ ShareLock, ShareUpdateExclusiveLock);
+ if (index_p)
+ init_rel_reopen_info(&rri[nrel++], index_p, InvalidOid,
+ ShareLock, ShareUpdateExclusiveLock);
+ unlock_and_close_relations(rri, nrel);
+ reopen_relations(rri, nrel);
+ /* Make sure the reopened relcache entry is used, not the old one. */
+ rel = *rel_p;
+
+ /* Avoid logical decoding of other relations by this backend. */
+ clustered_rel_locator = rel->rd_locator;
+ if (OidIsValid(toastrelid))
+ {
+ Relation toastrel;
+
+ /* Avoid logical decoding of other TOAST relations. */
+ toastrel = table_open(toastrelid, AccessShareLock);
+ clustered_rel_toast_locator = toastrel->rd_locator;
+ table_close(toastrel, AccessShareLock);
+ }
+}
+
+/*
+ * Call this when done with CLUSTER CONCURRENTLY.
+ *
+ * 'error' tells whether the function is being called in order to handle
+ * error.
+ */
+static void
+end_concurrent_cluster(bool error)
+{
+ ClusteredRel key;
+ ClusteredRel *entry = NULL, *entry_toast = NULL;
+ Oid relid = clustered_rel;
+ Oid toastrelid = clustered_rel_toast;
+
+ memset(&key, 0, sizeof(key));
+ key.dbid = MyDatabaseId;
+ LWLockAcquire(ClusteredRelsLock, LW_EXCLUSIVE);
+
+ /* Remove the relation from the hash if we managed to insert one. */
+ if (OidIsValid(clustered_rel))
+ {
+ key.relid = clustered_rel;
+ entry = hash_search(ClusteredRelsHash, &key, HASH_REMOVE, NULL);
+
+ /*
+ * By clearing this variable we also disable
+ * cluster_before_shmem_exit_callback().
+ */
+ clustered_rel = InvalidOid;
+ }
+
+ /* Remove the TOAST relation if there is one. */
+ if (OidIsValid(clustered_rel_toast))
+ {
+ key.relid = clustered_rel_toast;
+ entry_toast = hash_search(ClusteredRelsHash, &key, HASH_REMOVE,
+ NULL);
+
+ clustered_rel_toast = InvalidOid;
+ }
+ LWLockRelease(ClusteredRelsLock);
+
+ /* Restore normal function of logical decoding. */
+ clustered_rel_locator.relNumber = InvalidOid;
+ clustered_rel_toast_locator.relNumber = InvalidOid;
+
+ /*
+ * On normal completion (!error), we should not really fail to remove the
+ * entry. But if it wasn't there for any reason, raise ERROR to make sure
+ * the transaction is aborted: if other transactions, while changing the
+ * contents of the relation, didn't know that CLUSTER CONCURRENTLY was in
+ * progress, they could have failed to write enough information to WAL,
+ * and thus we could have produced inconsistent table contents.
+ *
+ * On the other hand, if we are already handling an error, there's no
+ * reason to worry about inconsistent contents of the new storage because
+ * the transaction is going to be rolled back anyway. Furthermore, by
+ * raising ERROR here we'd shadow the original error.
+ */
+ if (!error)
+ {
+ char *relname;
+
+ if (OidIsValid(relid) && entry == NULL)
+ {
+ relname = get_rel_name(relid);
+ if (!relname)
+ ereport(ERROR,
+ (errmsg("cache lookup failed for relation %u",
+ relid)));
+
+ ereport(ERROR,
+ (errmsg("relation \"%s\" not found among clustered relations",
+ relname)));
+ }
+
+ /*
+ * Likewise, the TOAST relation should not have disappeared.
+ */
+ if (OidIsValid(toastrelid) && entry_toast == NULL)
+ {
+ relname = get_rel_name(key.relid);
+ if (!relname)
+ ereport(ERROR,
+ (errmsg("cache lookup failed for relation %u",
+ key.relid)));
+
+ ereport(ERROR,
+ (errmsg("relation \"%s\" not found among clustered relations",
+ relname)));
+ }
+ }
+
+ /*
+ * Note: unlike begin_concurrent_cluster(), here we do not lock/unlock the
+ * relation: 1) On normal completion, the caller is already holding
+ * AccessExclusiveLock (till the end of the transaction), 2) on ERROR /
+ * FATAL, we try to do the cleanup asap, but the worst case is that other
+ * backends will write unnecessary information to WAL until they close the
+ * relation.
+ */
+}
+
+/*
+ * A wrapper to call end_concurrent_cluster() as a before_shmem_exit callback.
+ */
+static void
+cluster_before_shmem_exit_callback(int code, Datum arg)
+{
+ if (OidIsValid(clustered_rel) || OidIsValid(clustered_rel_toast))
+ end_concurrent_cluster(true);
+}
+
+/*
+ * Check if relation is currently being processed by CLUSTER CONCURRENTLY.
+ */
+bool
+is_concurrent_cluster_in_progress(Oid relid)
+{
+ ClusteredRel key, *entry;
+
+ memset(&key, 0, sizeof(key));
+ key.relid = relid;
+ key.dbid = MyDatabaseId;
+
+ LWLockAcquire(ClusteredRelsLock, LW_SHARED);
+ entry = (ClusteredRel *)
+ hash_search(ClusteredRelsHash, &key, HASH_FIND, NULL);
+ LWLockRelease(ClusteredRelsLock);
+
+ return entry != NULL;
+}
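+/*
+ * A standalone sketch (not part of the patch) of why the key struct is
+ * zeroed with memset() before the fields are assigned, as done above for
+ * ClusteredRel. The names DemoKey and make_key are hypothetical; only the
+ * pattern mirrors the patch.
+ *
+ * #include <assert.h>
+ * #include <stdint.h>
+ * #include <string.h>
+ *
+ * typedef struct DemoKey
+ * {
+ *     uint32_t relid;
+ *     uint32_t dbid;
+ * } DemoKey;
+ *
+ * // Hash tables created with HASH_BLOBS hash and compare keys as raw
+ * // bytes, so every byte of the key -- including any padding the struct
+ * // might contain -- must be deterministic. Zeroing the struct before
+ * // assigning the fields guarantees that two logically equal keys are
+ * // also byte-wise equal.
+ * static void
+ * make_key(DemoKey *key, uint32_t relid, uint32_t dbid)
+ * {
+ *     memset(key, 0, sizeof(*key));
+ *     key->relid = relid;
+ *     key->dbid = dbid;
+ * }
+ *
+ * int
+ * main(void)
+ * {
+ *     DemoKey a, b;
+ *
+ *     make_key(&a, 16384, 5);
+ *     make_key(&b, 16384, 5);
+ *     assert(memcmp(&a, &b, sizeof(DemoKey)) == 0);
+ *     return 0;
+ * }
+ */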
+
+/*
+ * Check if VACUUM FULL / CLUSTER CONCURRENTLY is already running for given
+ * relation, and if so, raise ERROR. The problem is that cluster_rel() needs
+ * to release its lock on the relation temporarily at some point, so our lock
+ * alone does not help. Commands that might break what cluster_rel() is doing
+ * should call this function first.
+ *
+ * Return without checking if 'lockmode' allows for race conditions that
+ * would make the result meaningless. In that case, cluster_rel() itself should
+ * throw ERROR if the relation was changed by us in an incompatible
+ * way. However, if it managed to do most of its work by then, a lot of CPU
+ * time might be wasted.
+ */
+void
+check_for_concurrent_cluster(Oid relid, LOCKMODE lockmode)
+{
+ /*
+ * If the caller does not have a lock that conflicts with
+ * ShareUpdateExclusiveLock, the check makes little sense because the
+ * VACUUM FULL / CLUSTER CONCURRENTLY can start anytime after the check.
+ */
+ if (lockmode < ShareUpdateExclusiveLock)
+ return;
+
+ if (is_concurrent_cluster_in_progress(relid))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg(CLUSTER_IN_PROGRESS_MESSAGE,
+ get_rel_name(relid))));
+}
+
+/*
+ * Check if relation is eligible for CLUSTER CONCURRENTLY and retrieve the
+ * catalog state to be passed later to check_catalog_changes.
+ *
+ * Caller is supposed to hold (at least) ShareUpdateExclusiveLock on the
+ * relation.
+ */
+static CatalogState *
+get_catalog_state(Relation rel, bool is_vacuum)
+{
+ CatalogState *result = palloc_object(CatalogState);
+ List *ind_oids;
+ ListCell *lc;
+ int ninds, i;
+ char relpersistence = rel->rd_rel->relpersistence;
+ char replident = rel->rd_rel->relreplident;
+ Oid ident_idx = RelationGetReplicaIndex(rel);
+ TupleDesc td_src = RelationGetDescr(rel);
+
+ /*
+ * While gathering the catalog information, check if there is a reason not
+ * to proceed.
+ *
+ * This check was already performed, but the relation has been unlocked
+ * since then (see begin_concurrent_cluster()). check_catalog_changes()
+ * should catch any "disruptive" changes in the future.
+ */
+ check_relation_is_clusterable_concurrently(rel, is_vacuum);
+
+ /* No index should be dropped while we are checking it. */
+ Assert(CheckRelationLockedByMe(rel, ShareUpdateExclusiveLock, true));
+
+ ind_oids = RelationGetIndexList(rel);
+ result->ninds = ninds = list_length(ind_oids);
+ result->ind_oids = palloc_array(Oid, ninds);
+ result->ind_tupdescs = palloc_array(TupleDesc, ninds);
+ i = 0;
+ foreach(lc, ind_oids)
+ {
+ Oid ind_oid = lfirst_oid(lc);
+ Relation index;
+ TupleDesc td_ind_src, td_ind_dst;
+
+ /*
+ * A weaker lock should be OK for the index, but this one should not
+ * break anything either.
+ */
+ index = index_open(ind_oid, ShareUpdateExclusiveLock);
+
+ result->ind_oids[i] = RelationGetRelid(index);
+ td_ind_src = RelationGetDescr(index);
+ td_ind_dst = palloc(TupleDescSize(td_ind_src));
+ TupleDescCopy(td_ind_dst, td_ind_src);
+ result->ind_tupdescs[i] = td_ind_dst;
+ i++;
+
+ index_close(index, ShareUpdateExclusiveLock);
+ }
+
+ /* Fill-in the relation info. */
+ result->tupdesc = palloc(TupleDescSize(td_src));
+ TupleDescCopy(result->tupdesc, td_src);
+ result->relpersistence = relpersistence;
+ result->replident = replident;
+ result->replidindex = ident_idx;
+
+ return result;
+}
+
+static void
+free_catalog_state(CatalogState *state)
+{
+
+ for (int i = 0; i < state->ninds; i++)
+ FreeTupleDesc(state->ind_tupdescs[i]);
+
+ FreeTupleDesc(state->tupdesc);
+ pfree(state->ind_oids);
+ pfree(state->ind_tupdescs);
+ pfree(state);
+}
+
+/*
+ * Raise ERROR if 'rel' changed in a way that does not allow further
+ * processing of CLUSTER CONCURRENTLY.
+ *
+ * Besides the relation's tuple descriptor, it's important to check indexes:
+ * a concurrent change of the index definition (can it happen in any other
+ * way than dropping and re-creating the index, accidentally with the same
+ * OID?) can be a problem because we may already have the new index built.
+ * created or dropped concurrently, we'd fail to swap the index storage. In
+ * any case, we prefer to check the indexes early to get an explicit error
+ * message about the mismatch. Furthermore, the earlier we detect the change,
+ * the fewer CPU cycles we waste.
+ *
+ * Note that we do not check constraints because the transaction which changed
+ * them must have ensured that the existing tuples satisfy the new
+ * constraints. If any DML commands were necessary for that, we will simply
+ * decode them from WAL and apply them to the new storage.
+ *
+ * Caller is supposed to hold (at least) ShareUpdateExclusiveLock on the
+ * relation.
+ */
+static void
+check_catalog_changes(Relation rel, CatalogState *cat_state)
+{
+ Oid reltoastrelid = rel->rd_rel->reltoastrelid;
+ List *ind_oids;
+ ListCell *lc;
+ LOCKMODE lockmode;
+ Oid ident_idx;
+ TupleDesc td, td_cp;
+
+ /* First, check the relation info. */
+
+ /* TOAST is not easy to change, but check. */
+ if (reltoastrelid != clustered_rel_toast)
+ ereport(ERROR,
+ errmsg("TOAST relation of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+
+ /*
+ * Likewise, check_for_concurrent_cluster() should prevent others from
+ * changing the relation file concurrently, but it's our responsibility to
+ * avoid data loss. (The original locators are stored outside cat_state,
+ * but the check belongs to this function.)
+ */
+ if (!RelFileLocatorEquals(rel->rd_locator, clustered_rel_locator))
+ ereport(ERROR,
+ (errmsg("file of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel))));
+ if (OidIsValid(reltoastrelid))
+ {
+ Relation toastrel;
+
+ toastrel = table_open(reltoastrelid, AccessShareLock);
+ if (!RelFileLocatorEquals(toastrel->rd_locator,
+ clustered_rel_toast_locator))
+ ereport(ERROR,
+ (errmsg("file of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(toastrel))));
+ table_close(toastrel, AccessShareLock);
+ }
+
+ if (rel->rd_rel->relpersistence != cat_state->relpersistence)
+ ereport(ERROR,
+ errmsg("persistence of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+
+ if (cat_state->replident != rel->rd_rel->relreplident)
+ ereport(ERROR,
+ errmsg("replica identity of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+
+ ident_idx = RelationGetReplicaIndex(rel);
+ if (ident_idx == InvalidOid && rel->rd_pkindex != InvalidOid)
+ ident_idx = rel->rd_pkindex;
+ if (cat_state->replidindex != ident_idx)
+ ereport(ERROR,
+ errmsg("identity index of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+
+ /*
+ * As cat_state contains a copy (which has the constraint info cleared),
+ * create a temporary copy for the comparison.
+ */
+ td = RelationGetDescr(rel);
+ td_cp = palloc(TupleDescSize(td));
+ TupleDescCopy(td_cp, td);
+ if (!equalTupleDescs(cat_state->tupdesc, td_cp))
+ ereport(ERROR,
+ errmsg("definition of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+ FreeTupleDesc(td_cp);
+
+ /* Now we are only interested in indexes. */
+ if (cat_state->ninds == 0)
+ return;
+
+ /* No index should be dropped while we are checking the relation. */
+ lockmode = ShareUpdateExclusiveLock;
+ Assert(CheckRelationLockedByMe(rel, lockmode, true));
+
+ ind_oids = RelationGetIndexList(rel);
+ if (list_length(ind_oids) != cat_state->ninds)
+ goto failed_index;
+
+ foreach(lc, ind_oids)
+ {
+ Oid ind_oid = lfirst_oid(lc);
+ int i;
+ TupleDesc tupdesc;
+ Relation index;
+
+ /* Find the index in cat_state. */
+ for (i = 0; i < cat_state->ninds; i++)
+ {
+ if (cat_state->ind_oids[i] == ind_oid)
+ break;
+ }
+ /*
+ * OID not found, i.e. the index was replaced by another one. XXX
+ * Should we still try to find whether an index having the desired
+ * tuple descriptor exists? Or should we always look for the tuple
+ * descriptor and not use OIDs at all?
+ */
+ if (i == cat_state->ninds)
+ goto failed_index;
+
+ /* Check the tuple descriptor. */
+ index = try_index_open(ind_oid, lockmode);
+ if (index == NULL)
+ goto failed_index;
+ tupdesc = RelationGetDescr(index);
+ if (!equalTupleDescs(cat_state->ind_tupdescs[i], tupdesc))
+ goto failed_index;
+ index_close(index, lockmode);
+ }
+
+ return;
+
+failed_index:
+ ereport(ERROR,
+ (errmsg("index(es) of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel))));
+}
+
+/*
+ * This function is much like pg_create_logical_replication_slot() except that
+ * the new slot is neither released (if anyone else could read changes from
+ * our slot, we could miss changes other backends make while we copy the
+ * existing data into the temporary table), nor persisted (it's easier to
+ * handle a crash by restarting all the work from scratch).
+ *
+ * XXX Even though CreateInitDecodingContext() does not set state to
+ * RS_PERSISTENT, it does write the slot to disk. We rely on
+ * RestoreSlotFromDisk() to delete ephemeral slots during startup. (Both ERROR
+ * and FATAL should lead to cleanup even before the cluster goes down.)
+ */
+static LogicalDecodingContext *
+setup_logical_decoding(Oid relid, const char *slotname, TupleDesc tupdesc)
+{
+ LogicalDecodingContext *ctx;
+ ClusterDecodingState *dstate;
+
+ /* RS_TEMPORARY so that the slot gets cleaned up on ERROR. */
+ ReplicationSlotCreate(slotname, true, RS_TEMPORARY, false, false, false);
+
+ /*
+ * Neither prepare_write nor do_write callback nor update_progress is
+ * useful for us.
+ *
+ * Regarding the value of need_full_snapshot, we pass false because the
+ * table we are processing is present in ClusteredRelsHash and therefore,
+ * regarding logical decoding, treated like a catalog.
+ */
+ ctx = CreateInitDecodingContext(REPL_PLUGIN_NAME,
+ NIL,
+ false,
+ InvalidXLogRecPtr,
+ XL_ROUTINE(.page_read = read_local_xlog_page,
+ .segment_open = wal_segment_open,
+ .segment_close = wal_segment_close),
+ NULL, NULL, NULL);
+
+ /*
+ * We have no control over the setting of fast_forward, so at least check it.
+ */
+ Assert(!ctx->fast_forward);
+
+ DecodingContextFindStartpoint(ctx);
+
+ /* Some WAL records should have been read. */
+ Assert(ctx->reader->EndRecPtr != InvalidXLogRecPtr);
+
+ XLByteToSeg(ctx->reader->EndRecPtr, cluster_current_segment,
+ wal_segment_size);
+
+ /*
+ * Setup structures to store decoded changes.
+ */
+ dstate = palloc0(sizeof(ClusterDecodingState));
+ dstate->relid = relid;
+ dstate->tstore = tuplestore_begin_heap(false, false,
+ maintenance_work_mem);
+ dstate->tupdesc = tupdesc;
+
+ /* Initialize the descriptor to store the changes ... */
+ dstate->tupdesc_change = CreateTemplateTupleDesc(1);
+
+ TupleDescInitEntry(dstate->tupdesc_change, 1, NULL, BYTEAOID, -1, 0);
+ /* ... as well as the corresponding slot. */
+ dstate->tsslot = MakeSingleTupleTableSlot(dstate->tupdesc_change,
+ &TTSOpsMinimalTuple);
+
+ dstate->resowner = ResourceOwnerCreate(CurrentResourceOwner,
+ "logical decoding");
+
+ ctx->output_writer_private = dstate;
+ return ctx;
+}
+
+/*
+ * Retrieve tuple from ConcurrentChange structure.
+ *
+ * The input data starts with the structure but it might not be appropriately
+ * aligned.
+ */
+static HeapTuple
+get_changed_tuple(char *change)
+{
+ HeapTupleData tup_data;
+ HeapTuple result;
+ char *src;
+
+ /*
+ * Ensure alignment before accessing the fields. (This is why we can't use
+ * heap_copytuple() instead of this function.)
+ */
+ src = change + offsetof(ConcurrentChange, tup_data);
+ memcpy(&tup_data, src, sizeof(HeapTupleData));
+
+ result = (HeapTuple) palloc(HEAPTUPLESIZE + tup_data.t_len);
+ memcpy(result, &tup_data, sizeof(HeapTupleData));
+ result->t_data = (HeapTupleHeader) ((char *) result + HEAPTUPLESIZE);
+ src = change + SizeOfConcurrentChange;
+ memcpy(result->t_data, src, result->t_len);
+
+ return result;
+}
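+/*
+ * A standalone sketch (not part of the patch) of the alignment-safe read
+ * that get_changed_tuple() performs above: the change data may start at an
+ * arbitrary offset, so the header is memcpy'd into an aligned local before
+ * any field is accessed. DemoChange and read_change are hypothetical names.
+ *
+ * #include <assert.h>
+ * #include <stdint.h>
+ * #include <string.h>
+ *
+ * typedef struct DemoChange
+ * {
+ *     uint32_t kind;
+ *     uint32_t len;
+ * } DemoChange;
+ *
+ * // Read a DemoChange that starts at a possibly unaligned offset inside a
+ * // byte buffer. Casting the buffer to DemoChange * and dereferencing it
+ * // would be undefined behavior on strict-alignment platforms, so copy
+ * // the bytes into a properly aligned local instead.
+ * static DemoChange
+ * read_change(const char *buf, size_t offset)
+ * {
+ *     DemoChange tmp;
+ *
+ *     memcpy(&tmp, buf + offset, sizeof(DemoChange));
+ *     return tmp;
+ * }
+ *
+ * int
+ * main(void)
+ * {
+ *     char buf[1 + sizeof(DemoChange)];
+ *     DemoChange src = {.kind = 7, .len = 42};
+ *     DemoChange dst;
+ *
+ *     // Store the struct at offset 1, deliberately misaligning it.
+ *     memcpy(buf + 1, &src, sizeof(src));
+ *
+ *     dst = read_change(buf, 1);
+ *     assert(dst.kind == 7 && dst.len == 42);
+ *     return 0;
+ * }
+ */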
+
+/*
+ * Decode logical changes from the WAL sequence up to end_of_wal.
+ */
+void
+cluster_decode_concurrent_changes(LogicalDecodingContext *ctx,
+ XLogRecPtr end_of_wal)
+{
+ ClusterDecodingState *dstate;
+ ResourceOwner resowner_old;
+ PgBackendProgress progress;
+
+ /*
+ * Invalidate the "present" cache before moving to "(recent) history".
+ */
+ InvalidateSystemCaches();
+
+ dstate = (ClusterDecodingState *) ctx->output_writer_private;
+ resowner_old = CurrentResourceOwner;
+ CurrentResourceOwner = dstate->resowner;
+
+ /*
+ * reorderbuffer.c uses an internal subtransaction, whose abort ends the
+ * command progress reporting. Save the status here so we can restore it
+ * when done with the decoding.
+ */
+ memcpy(&progress, &MyBEEntry->st_progress, sizeof(PgBackendProgress));
+
+ PG_TRY();
+ {
+ while (ctx->reader->EndRecPtr < end_of_wal)
+ {
+ XLogRecord *record;
+ XLogSegNo segno_new;
+ char *errm = NULL;
+ XLogRecPtr end_lsn;
+
+ record = XLogReadRecord(ctx->reader, &errm);
+ if (errm)
+ elog(ERROR, "%s", errm);
+
+ if (record != NULL)
+ LogicalDecodingProcessRecord(ctx, ctx->reader);
+
+ /*
+ * If a WAL segment boundary has been crossed, inform the decoding
+ * system that the catalog_xmin can advance. (We could confirm more
+ * often, but filling a single WAL segment should not take much
+ * time.)
+ */
+ end_lsn = ctx->reader->EndRecPtr;
+ XLByteToSeg(end_lsn, segno_new, wal_segment_size);
+ if (segno_new != cluster_current_segment)
+ {
+ LogicalConfirmReceivedLocation(end_lsn);
+ elog(DEBUG1, "cluster: confirmed receive location %X/%X",
+ (uint32) (end_lsn >> 32), (uint32) end_lsn);
+ cluster_current_segment = segno_new;
+ }
+
+ CHECK_FOR_INTERRUPTS();
+ }
+ InvalidateSystemCaches();
+ CurrentResourceOwner = resowner_old;
+ }
+ PG_CATCH();
+ {
+ InvalidateSystemCaches();
+ CurrentResourceOwner = resowner_old;
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+
+ /* Restore the progress reporting status. */
+ pgstat_progress_restore_state(&progress);
+}
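+/*
+ * A standalone sketch (not part of the patch) of the segment-boundary test
+ * used in the decoding loop above. XLByteToSeg() in PostgreSQL reduces to
+ * dividing the LSN by the segment size; lsn_to_segment below is a
+ * hypothetical stand-in for that macro.
+ *
+ * #include <assert.h>
+ * #include <stdint.h>
+ *
+ * typedef uint64_t XLogRecPtr;
+ * typedef uint64_t XLogSegNo;
+ *
+ * // Simplified equivalent of XLByteToSeg(): the segment number is the
+ * // LSN divided by the segment size.
+ * static XLogSegNo
+ * lsn_to_segment(XLogRecPtr lsn, uint64_t wal_segment_size)
+ * {
+ *     return lsn / wal_segment_size;
+ * }
+ *
+ * int
+ * main(void)
+ * {
+ *     uint64_t segsz = 16 * 1024 * 1024;   // default 16MB WAL segments
+ *
+ *     // Two LSNs within the same segment map to the same number...
+ *     assert(lsn_to_segment(segsz + 100, segsz) ==
+ *            lsn_to_segment(segsz + 200, segsz));
+ *     // ...and crossing a segment boundary changes it, which is the
+ *     // condition the loop uses to confirm the receive location.
+ *     assert(lsn_to_segment(2 * segsz, segsz) ==
+ *            lsn_to_segment(segsz, segsz) + 1);
+ *     return 0;
+ * }
+ */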
+
+/*
+ * Apply changes that happened during the initial load.
+ *
+ * The scan key is passed by the caller, so it does not have to be
+ * constructed multiple times. The key entries have all fields initialized,
+ * except for sk_argument.
+ */
+static void
+apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
+ ScanKey key, int nkeys, IndexInsertState *iistate)
+{
+ TupleTableSlot *index_slot, *ident_slot;
+ HeapTuple tup_old = NULL;
+
+ if (dstate->nchanges == 0)
+ return;
+
+ /* TupleTableSlot is needed to pass the tuple to ExecInsertIndexTuples(). */
+ index_slot = MakeSingleTupleTableSlot(dstate->tupdesc, &TTSOpsHeapTuple);
+ iistate->econtext->ecxt_scantuple = index_slot;
+
+ /* A slot to fetch tuples from identity index. */
+ ident_slot = table_slot_create(rel, NULL);
+
+ while (tuplestore_gettupleslot(dstate->tstore, true, false,
+ dstate->tsslot))
+ {
+ bool shouldFree;
+ HeapTuple tup_change,
+ tup,
+ tup_exist;
+ char *change_raw, *src;
+ ConcurrentChange change;
+ bool isnull[1];
+ Datum values[1];
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Get the change from the single-column tuple. */
+ tup_change = ExecFetchSlotHeapTuple(dstate->tsslot, false, &shouldFree);
+ heap_deform_tuple(tup_change, dstate->tupdesc_change, values, isnull);
+ Assert(!isnull[0]);
+
+ /* Make sure we access aligned data. */
+ change_raw = (char *) DatumGetByteaP(values[0]);
+ src = (char *) VARDATA(change_raw);
+ memcpy(&change, src, SizeOfConcurrentChange);
+
+ /* TRUNCATE change contains no tuple, so process it separately. */
+ if (change.kind == CHANGE_TRUNCATE)
+ {
+ /*
+ * All the things that ExecuteTruncateGuts() does (such as firing
+ * triggers or handling the DROP_CASCADE behavior) should have
+ * taken place on the source relation. Thus we only do the actual
+ * truncation of the new relation (and its indexes).
+ */
+ heap_truncate_one_rel(rel);
+
+ pfree(tup_change);
+ continue;
+ }
+
+ /*
+ * Extract the tuple from the change. The tuple is copied here because
+ * it might be assigned to 'tup_old', in which case it needs to
+ * survive into the next iteration.
+ */
+ tup = get_changed_tuple(src);
+
+ if (change.kind == CHANGE_UPDATE_OLD)
+ {
+ Assert(tup_old == NULL);
+ tup_old = tup;
+ }
+ else if (change.kind == CHANGE_INSERT)
+ {
+ Assert(tup_old == NULL);
+
+ apply_concurrent_insert(rel, &change, tup, iistate, index_slot);
+
+ pfree(tup);
+ }
+ else if (change.kind == CHANGE_UPDATE_NEW ||
+ change.kind == CHANGE_DELETE)
+ {
+ IndexScanDesc ind_scan = NULL;
+ HeapTuple tup_key;
+
+ if (change.kind == CHANGE_UPDATE_NEW)
+ {
+ tup_key = tup_old != NULL ? tup_old : tup;
+ }
+ else
+ {
+ Assert(tup_old == NULL);
+ tup_key = tup;
+ }
+
+ /*
+ * Find the tuple to be updated or deleted.
+ */
+ tup_exist = find_target_tuple(rel, key, nkeys, tup_key,
+ iistate, ident_slot, &ind_scan);
+ if (tup_exist == NULL)
+ elog(ERROR, "failed to find target tuple");
+
+ if (change.kind == CHANGE_UPDATE_NEW)
+ apply_concurrent_update(rel, tup, tup_exist, &change, iistate,
+ index_slot);
+ else
+ apply_concurrent_delete(rel, tup_exist, &change);
+
+ if (tup_old != NULL)
+ {
+ pfree(tup_old);
+ tup_old = NULL;
+ }
+
+ pfree(tup);
+ index_endscan(ind_scan);
+ }
+ else
+ elog(ERROR, "unrecognized change kind: %d", change.kind);
+
+ /* If there's any change, make it visible to the next iteration. */
+ if (change.kind != CHANGE_UPDATE_OLD)
+ {
+ CommandCounterIncrement();
+ UpdateActiveSnapshotCommandId();
+ }
+
+ /* TTSOpsMinimalTuple has .get_heap_tuple==NULL. */
+ Assert(shouldFree);
+ pfree(tup_change);
+ }
+
+ tuplestore_clear(dstate->tstore);
+ dstate->nchanges = 0;
+
+ /* Cleanup. */
+ ExecDropSingleTupleTableSlot(index_slot);
+ ExecDropSingleTupleTableSlot(ident_slot);
+}
+
+static void
+apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
+ IndexInsertState *iistate, TupleTableSlot *index_slot)
+{
+ List *recheck;
+
+ heap_insert(rel, tup, GetCurrentCommandId(true), HEAP_INSERT_NO_LOGICAL, NULL);
+
+ /*
+ * Update indexes. Functions used by the index expressions may need an
+ * active snapshot; the caller is expected to have set one.
+ */
+ ExecStoreHeapTuple(tup, index_slot, false);
+ recheck = ExecInsertIndexTuples(iistate->rri,
+ index_slot,
+ iistate->estate,
+ false, /* update */
+ false, /* noDupErr */
+ NULL, /* specConflict */
+ NIL, /* arbiterIndexes */
+ false /* onlySummarizing */
+ );
+
+ /*
+ * If recheck is required, it must have been performed on the source
+ * relation by now. (All the logical changes we process here are already
+ * committed.)
+ */
+ list_free(recheck);
+
+ pgstat_progress_incr_param(PROGRESS_CLUSTER_HEAP_TUPLES_INSERTED, 1);
+}
+
+static void
+apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
+ ConcurrentChange *change, IndexInsertState *iistate,
+ TupleTableSlot *index_slot)
+{
+ List *recheck;
+ TU_UpdateIndexes update_indexes;
+
+ /*
+ * Apply the update to the new heap. ('tup' gets its TID assigned
+ * here.)
+ */
+ simple_heap_update(rel, &tup_target->t_self, tup, &update_indexes);
+
+ ExecStoreHeapTuple(tup, index_slot, false);
+
+ if (update_indexes != TU_None)
+ {
+ recheck = ExecInsertIndexTuples(iistate->rri,
+ index_slot,
+ iistate->estate,
+ true, /* update */
+ false, /* noDupErr */
+ NULL, /* specConflict */
+ NIL, /* arbiterIndexes */
+ /* onlySummarizing */
+ update_indexes == TU_Summarizing);
+ list_free(recheck);
+ }
+
+ pgstat_progress_incr_param(PROGRESS_CLUSTER_HEAP_TUPLES_UPDATED, 1);
+}
+
+static void
+apply_concurrent_delete(Relation rel, HeapTuple tup_target,
+ ConcurrentChange *change)
+{
+ simple_heap_delete(rel, &tup_target->t_self);
+
+ pgstat_progress_incr_param(PROGRESS_CLUSTER_HEAP_TUPLES_DELETED, 1);
+}
+
+/*
+ * Find the tuple to be updated or deleted.
+ *
+ * 'key' is a pre-initialized scan key, into which the function will put the
+ * key values.
+ *
+ * 'tup_key' is a tuple containing the key values for the scan.
+ *
+ * On exit, '*scan_p' contains the scan descriptor used. The caller must
+ * close it when the returned tuple is no longer needed.
+ */
+static HeapTuple
+find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
+ IndexInsertState *iistate,
+ TupleTableSlot *ident_slot, IndexScanDesc *scan_p)
+{
+ IndexScanDesc scan;
+ Form_pg_index ident_form;
+ int2vector *ident_indkey;
+ HeapTuple result = NULL;
+
+ scan = index_beginscan(rel, iistate->ident_index, GetActiveSnapshot(),
+ nkeys, 0);
+ *scan_p = scan;
+ index_rescan(scan, key, nkeys, NULL, 0);
+
+ /* Info needed to retrieve key values from heap tuple. */
+ ident_form = iistate->ident_index->rd_index;
+ ident_indkey = &ident_form->indkey;
+
+ /* Use the incoming tuple to finalize the scan key. */
+ for (int i = 0; i < scan->numberOfKeys; i++)
+ {
+ ScanKey entry;
+ bool isnull;
+ int16 attno_heap;
+
+ entry = &scan->keyData[i];
+ attno_heap = ident_indkey->values[i];
+ entry->sk_argument = heap_getattr(tup_key,
+ attno_heap,
+ rel->rd_att,
+ &isnull);
+ Assert(!isnull);
+ }
+ if (index_getnext_slot(scan, ForwardScanDirection, ident_slot))
+ {
+ bool shouldFree;
+
+ result = ExecFetchSlotHeapTuple(ident_slot, false, &shouldFree);
+ /* TTSOpsBufferHeapTuple has .get_heap_tuple != NULL. */
+ Assert(!shouldFree);
+ }
+
+ return result;
+}
+
+/*
+ * Decode and apply concurrent changes.
+ *
+ * Pass rel_src iff its reltoastrelid is needed.
+ */
+static void
+process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
+ Relation rel_dst, Relation rel_src, ScanKey ident_key,
+ int ident_key_nentries, IndexInsertState *iistate)
+{
+ ClusterDecodingState *dstate;
+
+ pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
+ PROGRESS_CLUSTER_PHASE_CATCH_UP);
+
+ dstate = (ClusterDecodingState *) ctx->output_writer_private;
+
+ cluster_decode_concurrent_changes(ctx, end_of_wal);
+
+ if (dstate->nchanges == 0)
+ return;
+
+ PG_TRY();
+ {
+ /*
+ * Make sure that TOAST values can eventually be accessed via the old
+ * relation - see comment in copy_table_data().
+ */
+ if (rel_src)
+ rel_dst->rd_toastoid = rel_src->rd_rel->reltoastrelid;
+
+ apply_concurrent_changes(dstate, rel_dst, ident_key,
+ ident_key_nentries, iistate);
+ }
+ PG_FINALLY();
+ {
+ if (rel_src)
+ rel_dst->rd_toastoid = InvalidOid;
+ }
+ PG_END_TRY();
+}
+
+static IndexInsertState *
+get_index_insert_state(Relation relation, Oid ident_index_id)
+{
+ EState *estate;
+ int i;
+ IndexInsertState *result;
+
+ result = (IndexInsertState *) palloc0(sizeof(IndexInsertState));
+ estate = CreateExecutorState();
+ result->econtext = GetPerTupleExprContext(estate);
+
+ result->rri = (ResultRelInfo *) palloc(sizeof(ResultRelInfo));
+ InitResultRelInfo(result->rri, relation, 0, 0, 0);
+ ExecOpenIndices(result->rri, false);
+
+ /*
+ * Find the relcache entry of the identity index so that we spend no extra
+ * effort to open / close it.
+ */
+ for (i = 0; i < result->rri->ri_NumIndices; i++)
+ {
+ Relation ind_rel;
+
+ ind_rel = result->rri->ri_IndexRelationDescs[i];
+ if (ind_rel->rd_id == ident_index_id)
+ result->ident_index = ind_rel;
+ }
+ if (result->ident_index == NULL)
+ elog(ERROR, "failed to open identity index");
+
+ /* Only initialize fields needed by ExecInsertIndexTuples(). */
+ result->estate = estate;
+
+ return result;
+}
+
+/*
+ * Build scan key to process logical changes.
+ */
+static ScanKey
+build_identity_key(Oid ident_idx_oid, Relation rel_src, int *nentries)
+{
+ Relation ident_idx_rel;
+ Form_pg_index ident_idx;
+ int n,
+ i;
+ ScanKey result;
+
+ Assert(OidIsValid(ident_idx_oid));
+ ident_idx_rel = index_open(ident_idx_oid, AccessShareLock);
+ ident_idx = ident_idx_rel->rd_index;
+ n = ident_idx->indnatts;
+ result = (ScanKey) palloc(sizeof(ScanKeyData) * n);
+ for (i = 0; i < n; i++)
+ {
+ ScanKey entry;
+ int16 relattno;
+ Form_pg_attribute att;
+ Oid opfamily,
+ opcintype,
+ opno,
+ opcode;
+
+ entry = &result[i];
+ relattno = ident_idx->indkey.values[i];
+ if (relattno >= 1)
+ {
+ TupleDesc desc;
+
+ desc = rel_src->rd_att;
+ att = TupleDescAttr(desc, relattno - 1);
+ }
+ else
+ elog(ERROR, "unexpected attribute number %d in index", relattno);
+
+ opfamily = ident_idx_rel->rd_opfamily[i];
+ opcintype = ident_idx_rel->rd_opcintype[i];
+ opno = get_opfamily_member(opfamily, opcintype, opcintype,
+ BTEqualStrategyNumber);
+
+ if (!OidIsValid(opno))
+ elog(ERROR, "failed to find equality operator for type %u", opcintype);
+
+ opcode = get_opcode(opno);
+ if (!OidIsValid(opcode))
+ elog(ERROR, "failed to find function for operator %u", opno);
+
+ /* Initialize everything but argument. */
+ ScanKeyInit(entry,
+ i + 1,
+ BTEqualStrategyNumber, opcode,
+ (Datum) NULL);
+ entry->sk_collation = att->attcollation;
+ }
+ index_close(ident_idx_rel, AccessShareLock);
+
+ *nentries = n;
+ return result;
+}
+
+static void
+free_index_insert_state(IndexInsertState *iistate)
+{
+ ExecCloseIndices(iistate->rri);
+ FreeExecutorState(iistate->estate);
+ pfree(iistate->rri);
+ pfree(iistate);
+}
+
+static void
+cleanup_logical_decoding(LogicalDecodingContext *ctx)
+{
+ ClusterDecodingState *dstate;
+
+ dstate = (ClusterDecodingState *) ctx->output_writer_private;
+
+ ExecDropSingleTupleTableSlot(dstate->tsslot);
+ FreeTupleDesc(dstate->tupdesc_change);
+ FreeTupleDesc(dstate->tupdesc);
+ tuplestore_end(dstate->tstore);
+
+ FreeDecodingContext(ctx);
+}
+
+/*
+ * The final steps of rebuild_relation() for concurrent processing.
+ *
+ * On entry, NewHeap is locked in AccessExclusiveLock mode. OldHeap and its
+ * clustering index (if one is passed) are still locked in a mode that allows
+ * concurrent data changes. On exit, both tables and their indexes are closed,
+ * but locked in AccessExclusiveLock mode.
+ */
+static void
+rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
+ Relation cl_index,
+ CatalogState *cat_state,
+ LogicalDecodingContext *ctx,
+ bool swap_toast_by_content,
+ TransactionId frozenXid,
+ MultiXactId cutoffMulti)
+{
+ LOCKMODE lockmode_old PG_USED_FOR_ASSERTS_ONLY;
+ List *ind_oids_new;
+ Oid old_table_oid = RelationGetRelid(OldHeap);
+ Oid new_table_oid = RelationGetRelid(NewHeap);
+ List *ind_oids_old = RelationGetIndexList(OldHeap);
+ ListCell *lc, *lc2;
+ char relpersistence;
+ bool is_system_catalog;
+ Oid ident_idx_old, ident_idx_new;
+ IndexInsertState *iistate;
+ ScanKey ident_key;
+ int ident_key_nentries;
+ XLogRecPtr wal_insert_ptr, end_of_wal;
+ char dummy_rec_data = '\0';
+ RelReopenInfo *rri = NULL;
+ int nrel;
+ Relation *ind_refs_all, *ind_refs_p;
+
+ /* Like in cluster_rel(). */
+ lockmode_old = ShareUpdateExclusiveLock;
+ Assert(CheckRelationLockedByMe(OldHeap, lockmode_old, false));
+ Assert(cl_index == NULL ||
+ CheckRelationLockedByMe(cl_index, lockmode_old, false));
+ /* This is expected from the caller. */
+ Assert(CheckRelationLockedByMe(NewHeap, AccessExclusiveLock, false));
+
+ ident_idx_old = RelationGetReplicaIndex(OldHeap);
+
+ /*
+ * Unlike the exclusive case, we build new indexes for the new relation
+ * rather than swapping the storage and reindexing the old relation. The
+ * point is that the index build can take some time, so we do it before
+ * acquiring AccessExclusiveLock on the old heap; therefore we cannot swap
+ * the heap storage yet.
+ *
+ * index_create() locks the new indexes with AccessExclusiveLock during
+ * creation - no need to change that.
+ */
+ ind_oids_new = build_new_indexes(NewHeap, OldHeap, ind_oids_old);
+
+ /*
+ * Processing shouldn't start without a valid identity index.
+ */
+ Assert(OidIsValid(ident_idx_old));
+
+ /* Find "identity index" on the new relation. */
+ ident_idx_new = InvalidOid;
+ forboth(lc, ind_oids_old, lc2, ind_oids_new)
+ {
+ Oid ind_old = lfirst_oid(lc);
+ Oid ind_new = lfirst_oid(lc2);
+
+ if (ident_idx_old == ind_old)
+ {
+ ident_idx_new = ind_new;
+ break;
+ }
+ }
+ if (!OidIsValid(ident_idx_new))
+ /*
+ * Should not happen, given our lock on the old relation.
+ */
+ ereport(ERROR,
+ (errmsg("identity index missing on the new relation")));
+
+ /* Executor state to update indexes. */
+ iistate = get_index_insert_state(NewHeap, ident_idx_new);
+
+ /*
+ * Build scan key that we'll use to look for rows to be updated / deleted
+ * during logical decoding.
+ */
+ ident_key = build_identity_key(ident_idx_new, OldHeap, &ident_key_nentries);
+
+ /*
+ * Flush all WAL records inserted so far (possibly except for the last
+ * incomplete page, see GetInsertRecPtr), to minimize the amount of data
+ * we need to flush while holding exclusive lock on the source table.
+ */
+ wal_insert_ptr = GetInsertRecPtr();
+ XLogFlush(wal_insert_ptr);
+ end_of_wal = GetFlushRecPtr(NULL);
+
+ /*
+ * Apply concurrent changes first time, to minimize the time we need to
+ * hold AccessExclusiveLock. (Quite some amount of WAL could have been
+ * written during the data copying and index creation.)
+ */
+ process_concurrent_changes(ctx, end_of_wal, NewHeap,
+ swap_toast_by_content ? OldHeap : NULL,
+ ident_key, ident_key_nentries, iistate);
+
+ /*
+ * Release the locks that allowed concurrent data changes, in order to
+ * acquire the AccessExclusiveLock.
+ */
+ nrel = 0;
+ /*
+ * We unlock the old relation (and its clustering index), but then we will
+ * lock the relation and *all* its indexes because we want to swap their
+ * storage.
+ *
+ * (NewHeap is already locked, as well as its indexes.)
+ */
+ rri = palloc_array(RelReopenInfo, 1 + list_length(ind_oids_old));
+ init_rel_reopen_info(&rri[nrel++], &OldHeap, InvalidOid,
+ ShareUpdateExclusiveLock, AccessExclusiveLock);
+ /* References to the re-opened indexes will be stored in this array. */
+ ind_refs_all = palloc_array(Relation, list_length(ind_oids_old));
+ ind_refs_p = ind_refs_all;
+ /* The clustering index is a special case. */
+ if (cl_index)
+ {
+ *ind_refs_p = cl_index;
+ init_rel_reopen_info(&rri[nrel], ind_refs_p, InvalidOid,
+ ShareUpdateExclusiveLock, AccessExclusiveLock);
+ nrel++;
+ ind_refs_p++;
+ }
+ /*
+ * Also initialize the entries for the other indexes (currently
+ * unlocked), because we will have to lock them too.
+ */
+ foreach(lc, ind_oids_old)
+ {
+ Oid ind_oid;
+
+ ind_oid = lfirst_oid(lc);
+ /* Clustering index is already in the array, or there is none. */
+ if (cl_index && RelationGetRelid(cl_index) == ind_oid)
+ continue;
+
+ Assert(nrel < (1 + list_length(ind_oids_old)));
+
+ *ind_refs_p = NULL;
+ init_rel_reopen_info(&rri[nrel],
+ /*
+ * In this special case we do not have the
+ * relcache reference, use OID instead.
+ */
+ ind_refs_p,
+ ind_oid,
+ NoLock, /* Nothing to unlock. */
+ AccessExclusiveLock);
+
+ nrel++;
+ ind_refs_p++;
+ }
+ /* Perform the actual unlocking and re-locking. */
+ unlock_and_close_relations(rri, nrel);
+ reopen_relations(rri, nrel);
+
+ /*
+ * In addition, lock the OldHeap's TOAST relation that we skipped for the
+ * CONCURRENTLY option in copy_table_data(). This lock will be needed to
+ * swap the relation files.
+ */
+ if (OidIsValid(OldHeap->rd_rel->reltoastrelid))
+ LockRelationOid(OldHeap->rd_rel->reltoastrelid, AccessExclusiveLock);
+
+ /*
+ * Check if the new indexes match the old ones, i.e. no changes occurred
+ * while OldHeap was unlocked.
+ *
+ * XXX It's probably not necessary to check the relation tuple descriptor
+ * here because the logical decoding was already active when we released
+ * the lock, and thus the corresponding data changes won't be lost.
+ * However processing of those changes might take a lot of time.
+ */
+ check_catalog_changes(OldHeap, cat_state);
+
+ /*
+ * Tuples and pages of the old heap will be gone, but the heap will stay.
+ */
+ TransferPredicateLocksToHeapRelation(OldHeap);
+ /* The same for indexes. */
+ for (int i = 0; i < (nrel - 1); i++)
+ {
+ Relation index = ind_refs_all[i];
+
+ TransferPredicateLocksToHeapRelation(index);
+
+ /*
+ * References to indexes on the old relation are not needed anymore,
+ * however locks stay till the end of the transaction.
+ */
+ index_close(index, NoLock);
+ }
+ pfree(ind_refs_all);
+
+ /*
+ * Flush anything we see in WAL, to make sure that all changes committed
+ * while we were waiting for the exclusive lock are available for
+ * decoding. This should not be necessary if all backends had
+ * synchronous_commit set, but we can't rely on this setting.
+ *
+ * Unfortunately, GetInsertRecPtr() may lag behind the actual insert
+ * position, and GetLastImportantRecPtr() points at the start of the last
+ * record rather than at the end. Thus the simplest way to determine the
+ * insert position is to insert a dummy record and use its LSN.
+ *
+ * XXX Consider using GetLastImportantRecPtr() and adding the size of the
+ * last record (plus the total size of all the page headers the record
+ * spans)?
+ */
+ XLogBeginInsert();
+ XLogRegisterData(&dummy_rec_data, 1);
+ wal_insert_ptr = XLogInsert(RM_XLOG_ID, XLOG_NOOP);
+ XLogFlush(wal_insert_ptr);
+ end_of_wal = GetFlushRecPtr(NULL);
+
+ /* Apply the concurrent changes again. */
+ process_concurrent_changes(ctx, end_of_wal, NewHeap,
+ swap_toast_by_content ? OldHeap : NULL,
+ ident_key, ident_key_nentries, iistate);
+
+ /* Remember info about rel before closing OldHeap */
+ relpersistence = OldHeap->rd_rel->relpersistence;
+ is_system_catalog = IsSystemRelation(OldHeap);
+
+ pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
+ PROGRESS_CLUSTER_PHASE_SWAP_REL_FILES);
+
+ forboth(lc, ind_oids_old, lc2, ind_oids_new)
+ {
+ Oid ind_old = lfirst_oid(lc);
+ Oid ind_new = lfirst_oid(lc2);
+ Oid mapped_tables[4];
+
+ /* Zero out possible results from swap_relation_files(). */
+ memset(mapped_tables, 0, sizeof(mapped_tables));
+
+ swap_relation_files(ind_old, ind_new,
+ (old_table_oid == RelationRelationId),
+ swap_toast_by_content,
+ true,
+ InvalidTransactionId,
+ InvalidMultiXactId,
+ mapped_tables);
+
+#ifdef USE_ASSERT_CHECKING
+ /*
+ * Concurrent processing is not supported for system relations, so
+ * there should be no mapped tables.
+ */
+ for (int i = 0; i < 4; i++)
+ Assert(mapped_tables[i] == 0);
+#endif
+ }
+
+ /* The new indexes must be visible for deletion. */
+ CommandCounterIncrement();
+
+ /* Close the old heap but keep lock until transaction commit. */
+ table_close(OldHeap, NoLock);
+ /* Close the new heap. (We didn't have to open its indexes). */
+ table_close(NewHeap, NoLock);
+
+ /* Cleanup what we don't need anymore. (And close the identity index.) */
+ pfree(ident_key);
+ free_index_insert_state(iistate);
+
+ /*
+ * Swap the relations and their TOAST relations and TOAST indexes. This
+ * also drops the new relation and its indexes.
+ *
+ * (System catalogs are currently not supported.)
+ */
+ Assert(!is_system_catalog);
+ finish_heap_swap(old_table_oid, new_table_oid,
+ is_system_catalog,
+ swap_toast_by_content,
+ false, true, false,
+ frozenXid, cutoffMulti,
+ relpersistence);
+
+ pfree(rri);
+}
+
+/*
+ * Build indexes on NewHeap according to those on OldHeap.
+ *
+ * OldIndexes is the list of index OIDs on OldHeap.
+ *
+ * A list of OIDs of the corresponding indexes created on NewHeap is
+ * returned. The order of items matches OldIndexes, so the two lists can be
+ * used to swap the index storage.
+ */
+static List *
+build_new_indexes(Relation NewHeap, Relation OldHeap, List *OldIndexes)
+{
+ StringInfo ind_name;
+ ListCell *lc;
+ List *result = NIL;
+
+ pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
+ PROGRESS_CLUSTER_PHASE_REBUILD_INDEX);
+
+ ind_name = makeStringInfo();
+
+ foreach(lc, OldIndexes)
+ {
+ Oid ind_oid,
+ ind_oid_new,
+ tbsp_oid;
+ Relation ind;
+ IndexInfo *ind_info;
+ int i,
+ heap_col_id;
+ List *colnames;
+ int16 indnatts;
+ Oid *collations,
+ *opclasses;
+ HeapTuple tup;
+ bool isnull;
+ Datum d;
+ oidvector *oidvec;
+ int2vector *int2vec;
+ size_t oid_arr_size;
+ size_t int2_arr_size;
+ int16 *indoptions;
+ text *reloptions = NULL;
+ bits16 flags;
+ Datum *opclassOptions;
+ NullableDatum *stattargets;
+
+ ind_oid = lfirst_oid(lc);
+ ind = index_open(ind_oid, AccessShareLock);
+ ind_info = BuildIndexInfo(ind);
+
+ tbsp_oid = ind->rd_rel->reltablespace;
+ /*
+ * Index names don't really matter, we'll eventually use only their
+ * storage. Just make them unique within the table.
+ */
+ resetStringInfo(ind_name);
+ appendStringInfo(ind_name, "ind_%d",
+ list_cell_number(OldIndexes, lc));
+
+ flags = 0;
+ if (ind->rd_index->indisprimary)
+ flags |= INDEX_CREATE_IS_PRIMARY;
+
+ colnames = NIL;
+ indnatts = ind->rd_index->indnatts;
+ oid_arr_size = sizeof(Oid) * indnatts;
+ int2_arr_size = sizeof(int16) * indnatts;
+
+ collations = (Oid *) palloc(oid_arr_size);
+ for (i = 0; i < indnatts; i++)
+ {
+ char *colname;
+
+ heap_col_id = ind->rd_index->indkey.values[i];
+ if (heap_col_id > 0)
+ {
+ Form_pg_attribute att;
+
+ /* Normal attribute. */
+ att = TupleDescAttr(OldHeap->rd_att, heap_col_id - 1);
+ colname = pstrdup(NameStr(att->attname));
+ collations[i] = att->attcollation;
+ }
+ else if (heap_col_id == 0)
+ {
+ HeapTuple tuple;
+ Form_pg_attribute att;
+
+ /*
+ * Expression column is not present in relcache. What we need
+ * here is an attribute of the *index* relation.
+ */
+ tuple = SearchSysCache2(ATTNUM,
+ ObjectIdGetDatum(ind_oid),
+ Int16GetDatum(i + 1));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR,
+ "cache lookup failed for attribute %d of relation %u",
+ i + 1, ind_oid);
+ att = (Form_pg_attribute) GETSTRUCT(tuple);
+ colname = pstrdup(NameStr(att->attname));
+ collations[i] = att->attcollation;
+ ReleaseSysCache(tuple);
+ }
+ else
+ elog(ERROR, "unexpected column number: %d",
+ heap_col_id);
+
+ colnames = lappend(colnames, colname);
+ }
+
+ /*
+ * Special effort needed for variable length attributes of
+ * Form_pg_index.
+ */
+ tup = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(ind_oid));
+ if (!HeapTupleIsValid(tup))
+ elog(ERROR, "cache lookup failed for index %u", ind_oid);
+ d = SysCacheGetAttr(INDEXRELID, tup, Anum_pg_index_indclass, &isnull);
+ Assert(!isnull);
+ oidvec = (oidvector *) DatumGetPointer(d);
+ opclasses = (Oid *) palloc(oid_arr_size);
+ memcpy(opclasses, oidvec->values, oid_arr_size);
+
+ d = SysCacheGetAttr(INDEXRELID, tup, Anum_pg_index_indoption,
+ &isnull);
+ Assert(!isnull);
+ int2vec = (int2vector *) DatumGetPointer(d);
+ indoptions = (int16 *) palloc(int2_arr_size);
+ memcpy(indoptions, int2vec->values, int2_arr_size);
+ ReleaseSysCache(tup);
+
+ tup = SearchSysCache1(RELOID, ObjectIdGetDatum(ind_oid));
+ if (!HeapTupleIsValid(tup))
+ elog(ERROR, "cache lookup failed for index relation %u", ind_oid);
+ d = SysCacheGetAttr(RELOID, tup, Anum_pg_class_reloptions, &isnull);
+ reloptions = !isnull ? DatumGetTextPCopy(d) : NULL;
+ ReleaseSysCache(tup);
+
+ opclassOptions = palloc0(sizeof(Datum) * ind_info->ii_NumIndexAttrs);
+ for (i = 0; i < ind_info->ii_NumIndexAttrs; i++)
+ opclassOptions[i] = get_attoptions(ind_oid, i + 1);
+
+ stattargets = get_index_stattargets(ind_oid, ind_info);
+
+ /*
+ * Neither parentIndexRelid nor parentConstraintId needs to be passed
+ * since the new catalog entries (pg_constraint, pg_inherits) would
+ * eventually be dropped. Therefore there's no need to record a valid
+ * dependency on the parents.
+ */
+ ind_oid_new = index_create(NewHeap,
+ ind_name->data,
+ InvalidOid,
+ InvalidOid, /* parentIndexRelid */
+ InvalidOid, /* parentConstraintId */
+ InvalidOid,
+ ind_info,
+ colnames,
+ ind->rd_rel->relam,
+ tbsp_oid,
+ collations,
+ opclasses,
+ opclassOptions,
+ indoptions,
+ stattargets,
+ PointerGetDatum(reloptions),
+ flags, /* flags */
+ 0, /* constr_flags */
+ false, /* allow_system_table_mods */
+ false, /* is_internal */
+ NULL /* constraintId */
+ );
+ result = lappend_oid(result, ind_oid_new);
+
+ index_close(ind, AccessShareLock);
+ list_free_deep(colnames);
+ pfree(collations);
+ pfree(opclasses);
+ pfree(indoptions);
+ if (reloptions)
+ pfree(reloptions);
+ }
+
+ return result;
+}
+
+static void
+init_rel_reopen_info(RelReopenInfo *rri, Relation *rel_p, Oid relid,
+ LOCKMODE lockmode_orig, LOCKMODE lockmode_new)
+{
+ rri->rel_p = rel_p;
+ rri->relid = relid;
+ rri->lockmode_orig = lockmode_orig;
+ rri->lockmode_new = lockmode_new;
+}
+
+/*
+ * Unlock and close relations specified by items of the 'rels' array. 'nrel'
+ * is the number of items.
+ *
+ * Information needed to (re)open the relations (or to issue meaningful ERROR)
+ * is added to the array items.
+ */
+static void
+unlock_and_close_relations(RelReopenInfo *rels, int nrel)
+{
+ int i;
+ RelReopenInfo *rri;
+
+ /*
+ * First, retrieve the information that we will need for re-opening.
+ *
+ * We could close (and unlock) each relation as soon as we have gathered
+ * the related information, but then we would have to be careful not to
+ * unlock the table until we have the info on all its indexes. (Once we
+ * unlock the table, any index can be dropped, and thus we can fail to get
+ * the name we want to report if re-opening fails.) It seems simpler to
+ * separate the work into two iterations.
+ */
+ for (i = 0; i < nrel; i++)
+ {
+ Relation rel;
+
+ rri = &rels[i];
+ rel = *rri->rel_p;
+
+ if (rel)
+ {
+ Assert(CheckRelationLockedByMe(rel, rri->lockmode_orig, false));
+ Assert(!OidIsValid(rri->relid));
+
+ rri->relid = RelationGetRelid(rel);
+ rri->relkind = rel->rd_rel->relkind;
+ rri->relname = pstrdup(RelationGetRelationName(rel));
+ }
+ else
+ {
+ Assert(OidIsValid(rri->relid));
+
+ rri->relname = get_rel_name(rri->relid);
+ rri->relkind = get_rel_relkind(rri->relid);
+ }
+ }
+
+ /* Second, close the relations. */
+ for (i = 0; i < nrel; i++)
+ {
+ Relation rel;
+
+ rri = &rels[i];
+ rel = *rri->rel_p;
+
+ /* Close the relation if the caller passed one. */
+ if (rel)
+ {
+ if (rri->relkind == RELKIND_RELATION)
+ table_close(rel, rri->lockmode_orig);
+ else
+ {
+ Assert(rri->relkind == RELKIND_INDEX);
+
+ index_close(rel, rri->lockmode_orig);
+ }
+ }
+ }
+}
+
+/*
+ * Re-open the relations closed previously by unlock_and_close_relations().
+ */
+static void
+reopen_relations(RelReopenInfo *rels, int nrel)
+{
+ for (int i = 0; i < nrel; i++)
+ {
+ RelReopenInfo *rri = &rels[i];
+ Relation rel;
+
+ if (rri->relkind == RELKIND_RELATION)
+ {
+ rel = try_table_open(rri->relid, rri->lockmode_new);
+ }
+ else
+ {
+ Assert(rri->relkind == RELKIND_INDEX);
+
+ rel = try_index_open(rri->relid, rri->lockmode_new);
+ }
+
+ if (rel == NULL)
+ {
+ const char *kind_str;
+
+ kind_str = (rri->relkind == RELKIND_RELATION) ? "table" : "index";
+ ereport(ERROR,
+ (errmsg("could not open %s \"%s\"", kind_str,
+ rri->relname),
+ errhint("The %s could have been dropped by another transaction.",
+ kind_str)));
+ }
+ *rri->rel_p = rel;
+
+ pfree(rri->relname);
+ }
+}
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 694da8291e..4fafa2f807 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -905,7 +905,7 @@ refresh_by_match_merge(Oid matviewOid, Oid tempOid, Oid relowner,
static void
refresh_by_heap_swap(Oid matviewOid, Oid OIDNewHeap, char relpersistence)
{
- finish_heap_swap(matviewOid, OIDNewHeap, false, false, true, true,
+ finish_heap_swap(matviewOid, OIDNewHeap, false, false, true, true, true,
RecentXmin, ReadNextMultiXactId(), relpersistence);
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 6ccae4cb4a..b0d6318592 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4480,6 +4480,16 @@ AlterTableInternal(Oid relid, List *cmds, bool recurse)
rel = relation_open(relid, lockmode);
+ /*
+ * If lockmode allows, check if VACUUM FULL / CLUSTER CONCURRENTLY is in
+ * progress. If lockmode is too weak, cluster_rel() should detect
+ * incompatible DDLs executed by us.
+ *
+ * XXX We might skip the changes for DDLs which do not change the tuple
+ * descriptor.
+ */
+ check_for_concurrent_cluster(relid, lockmode);
+
EventTriggerAlterTableRelid(relid);
ATController(NULL, rel, cmds, recurse, lockmode, NULL);
@@ -5909,6 +5919,7 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
finish_heap_swap(tab->relid, OIDNewHeap,
false, false, true,
!OidIsValid(tab->newTableSpace),
+ true,
RecentXmin,
ReadNextMultiXactId(),
persistence);
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index a0158b1fcd..333ce98060 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -111,7 +111,7 @@ static void vac_truncate_clog(TransactionId frozenXID,
TransactionId lastSaneFrozenXid,
MultiXactId lastSaneMinMulti);
static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
- BufferAccessStrategy bstrategy);
+ BufferAccessStrategy bstrategy, bool isTopLevel);
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
@@ -153,6 +153,7 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
bool analyze = false;
bool freeze = false;
bool full = false;
+ bool concurrent = false;
bool disable_page_skipping = false;
bool process_main = true;
bool process_toast = true;
@@ -226,6 +227,8 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
freeze = defGetBoolean(opt);
else if (strcmp(opt->defname, "full") == 0)
full = defGetBoolean(opt);
+ else if (strcmp(opt->defname, "concurrently") == 0)
+ concurrent = defGetBoolean(opt);
else if (strcmp(opt->defname, "disable_page_skipping") == 0)
disable_page_skipping = defGetBoolean(opt);
else if (strcmp(opt->defname, "index_cleanup") == 0)
@@ -300,7 +303,7 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
(skip_locked ? VACOPT_SKIP_LOCKED : 0) |
(analyze ? VACOPT_ANALYZE : 0) |
(freeze ? VACOPT_FREEZE : 0) |
- (full ? VACOPT_FULL : 0) |
+ (full ? (concurrent ? VACOPT_FULL_CONCURRENT : VACOPT_FULL_EXCLUSIVE) : 0) |
(disable_page_skipping ? VACOPT_DISABLE_PAGE_SKIPPING : 0) |
(process_main ? VACOPT_PROCESS_MAIN : 0) |
(process_toast ? VACOPT_PROCESS_TOAST : 0) |
@@ -380,6 +383,12 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
errmsg("ONLY_DATABASE_STATS cannot be specified with other VACUUM options")));
}
+ /* This combination cannot be detected from the option bits, so check it here. */
+ if (concurrent && !full)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("CONCURRENTLY can only be specified with VACUUM FULL")));
+
/*
* All freeze ages are zero if the FREEZE option is given; otherwise pass
* them as -1 which means to use the default values.
@@ -543,7 +552,17 @@ vacuum(List *relations, VacuumParams *params, BufferAccessStrategy bstrategy,
relations = newrels;
}
else
+ {
+ /*
+ * Concurrent processing is currently considered rather special so it
+ * is not performed in bulk.
+ */
+ if (params->options & VACOPT_FULL_CONCURRENT)
+ ereport(ERROR,
+ (errmsg("VACUUM (CONCURRENTLY) requires an explicit list of tables")));
+
relations = get_all_vacuum_rels(vac_context, params->options);
+ }
/*
* Decide whether we need to start/commit our own transactions.
@@ -616,7 +635,8 @@ vacuum(List *relations, VacuumParams *params, BufferAccessStrategy bstrategy,
if (params->options & VACOPT_VACUUM)
{
- if (!vacuum_rel(vrel->oid, vrel->relation, params, bstrategy))
+ if (!vacuum_rel(vrel->oid, vrel->relation, params, bstrategy,
+ isTopLevel))
continue;
}
@@ -960,6 +980,17 @@ expand_vacuum_rel(VacuumRelation *vrel, MemoryContext vac_context,
(errmsg("VACUUM ONLY of partitioned table \"%s\" has no effect",
vrel->relation->relname)));
+ /*
+ * Concurrent processing is currently considered rather special
+ * (e.g. in terms of resources consumed) so it is not performed in
+ * bulk.
+ */
+ if (is_partitioned_table && (options & VACOPT_FULL_CONCURRENT))
+ ereport(ERROR,
+ (errmsg("VACUUM (CONCURRENTLY) is not supported for partitioned tables"),
+ errhint("Consider running the command for individual partitions.")));
+
ReleaseSysCache(tuple);
/*
@@ -1954,10 +1985,10 @@ vac_truncate_clog(TransactionId frozenXID,
/*
* vacuum_rel() -- vacuum one heap relation
*
- * relid identifies the relation to vacuum. If relation is supplied,
- * use the name therein for reporting any failure to open/lock the rel;
- * do not use it once we've successfully opened the rel, since it might
- * be stale.
+ * relid identifies the relation to vacuum. If relation is supplied, use
+ * the name therein for reporting any failure to open/lock the rel; do
+ * not use it once we've successfully opened the rel, since it might be
+ * stale.
*
* Returns true if it's okay to proceed with a requested ANALYZE
* operation on this table.
@@ -1972,7 +2003,7 @@ vac_truncate_clog(TransactionId frozenXID,
*/
static bool
vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
- BufferAccessStrategy bstrategy)
+ BufferAccessStrategy bstrategy, bool isTopLevel)
{
LOCKMODE lmode;
Relation rel;
@@ -2035,10 +2066,11 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
/*
* Determine the type of lock we want --- hard exclusive lock for a FULL
- * vacuum, but just ShareUpdateExclusiveLock for concurrent vacuum. Either
- * way, we can be sure that no other backend is vacuuming the same table.
+ * exclusive vacuum, but a weaker lock (ShareUpdateExclusiveLock) for
+ * concurrent vacuum. Either way, we can be sure that no other backend is
+ * vacuuming the same table.
*/
- lmode = (params->options & VACOPT_FULL) ?
+ lmode = (params->options & VACOPT_FULL_EXCLUSIVE) ?
AccessExclusiveLock : ShareUpdateExclusiveLock;
/* open the relation and get the appropriate lock on it */
@@ -2053,6 +2085,22 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
return false;
}
+ /*
+ * Skip the relation if VACUUM FULL / CLUSTER CONCURRENTLY is in progress
+ * as it will drop the current storage of the relation.
+ *
+ * This check should not take place until we have a lock that prevents
+ * another backend from starting VACUUM FULL / CLUSTER CONCURRENTLY later.
+ */
+ Assert(lmode >= ShareUpdateExclusiveLock);
+ if (is_concurrent_cluster_in_progress(relid))
+ {
+ relation_close(rel, lmode);
+ PopActiveSnapshot();
+ CommitTransactionCommand();
+ return false;
+ }
+
/*
* When recursing to a TOAST table, check privileges on the parent. NB:
* This is only safe to do because we hold a session lock on the main
@@ -2126,19 +2174,6 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
return true;
}
- /*
- * Get a session-level lock too. This will protect our access to the
- * relation across multiple transactions, so that we can vacuum the
- * relation's TOAST table (if any) secure in the knowledge that no one is
- * deleting the parent relation.
- *
- * NOTE: this cannot block, even if someone else is waiting for access,
- * because the lock manager knows that both lock requests are from the
- * same process.
- */
- lockrelid = rel->rd_lockInfo.lockRelId;
- LockRelationIdForSession(&lockrelid, lmode);
-
/*
* Set index_cleanup option based on index_cleanup reloption if it wasn't
* specified in VACUUM command, or when running in an autovacuum worker
@@ -2191,6 +2226,30 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
else
toast_relid = InvalidOid;
+ /*
+ * Get a session-level lock too. This will protect our access to the
+ * relation across multiple transactions, so that we can vacuum the
+ * relation's TOAST table (if any) secure in the knowledge that no one is
+ * deleting the parent relation.
+ *
+ * NOTE: this cannot block, even if someone else is waiting for access,
+ * because the lock manager knows that both lock requests are from the
+ * same process.
+ */
+ if (OidIsValid(toast_relid))
+ {
+ /*
+ * You might worry that, in the VACUUM (FULL, CONCURRENTLY) case,
+ * cluster_rel() needs to release all the locks on the relation at
+ * some point, but this session lock makes it impossible. In fact,
+ * cluster_rel() will eventually be called for the TOAST relation
+ * and raise ERROR because, in the concurrent mode, it cannot process
+ * the TOAST relation alone anyway.
+ */
+ lockrelid = rel->rd_lockInfo.lockRelId;
+ LockRelationIdForSession(&lockrelid, lmode);
+ }
+
/*
* Switch to the table owner's userid, so that any index functions are run
* as that user. Also lock down security-restricted operations and
@@ -2218,11 +2277,22 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
{
ClusterParams cluster_params = {0};
+ /*
+ * Invalid toast_relid means that there is no session lock on the
+ * relation. Such a lock would be a problem because it would
+ * prevent cluster_rel() from releasing all locks when it tries to
+ * get AccessExclusiveLock.
+ */
+ Assert(!OidIsValid(toast_relid));
+
if ((params->options & VACOPT_VERBOSE) != 0)
cluster_params.options |= CLUOPT_VERBOSE;
+ if ((params->options & VACOPT_FULL_CONCURRENT) != 0)
+ cluster_params.options |= CLUOPT_CONCURRENT;
+
/* VACUUM FULL is now a variant of CLUSTER; see cluster.c */
- cluster_rel(rel, InvalidOid, &cluster_params);
+ cluster_rel(rel, InvalidOid, &cluster_params, isTopLevel, true);
/* cluster_rel closes the relation, but keeps lock */
rel = NULL;
@@ -2268,13 +2338,15 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
toast_vacuum_params.options |= VACOPT_PROCESS_MAIN;
toast_vacuum_params.toast_parent = relid;
- vacuum_rel(toast_relid, NULL, &toast_vacuum_params, bstrategy);
+ vacuum_rel(toast_relid, NULL, &toast_vacuum_params, bstrategy,
+ isTopLevel);
}
/*
* Now release the session-level lock on the main table.
*/
- UnlockRelationIdForSession(&lockrelid, lmode);
+ if (OidIsValid(toast_relid))
+ UnlockRelationIdForSession(&lockrelid, lmode);
/* Report that we really did it. */
return true;
diff --git a/src/backend/meson.build b/src/backend/meson.build
index 78c5726814..0f9141a4ac 100644
--- a/src/backend/meson.build
+++ b/src/backend/meson.build
@@ -194,5 +194,6 @@ pg_test_mod_args = pg_mod_args + {
subdir('jit/llvm')
subdir('replication/libpqwalreceiver')
subdir('replication/pgoutput')
+subdir('replication/pgoutput_cluster')
subdir('snowball')
subdir('utils/mb/conversion_procs')
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index e73576ad12..06a9d4a61f 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -33,6 +33,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecord.h"
#include "catalog/pg_control.h"
+#include "commands/cluster.h"
#include "replication/decode.h"
#include "replication/logical.h"
#include "replication/message.h"
@@ -467,6 +468,29 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
TransactionId xid = XLogRecGetXid(buf->record);
SnapBuild *builder = ctx->snapshot_builder;
+ /*
+ * Check if CLUSTER CONCURRENTLY is being performed by this backend. If
+ * so, only decode data changes of the table that it is processing, and
+ * the changes of its TOAST relation.
+ *
+ * (The TOAST locator should not be set unless the main one is.)
+ */
+ Assert(!OidIsValid(clustered_rel_toast_locator.relNumber) ||
+ OidIsValid(clustered_rel_locator.relNumber));
+
+ if (OidIsValid(clustered_rel_locator.relNumber))
+ {
+ XLogReaderState *r = buf->record;
+ RelFileLocator locator;
+
+ /* Not all records contain a block reference. */
+ if (XLogRecGetBlockTagExtended(r, 0, &locator, NULL, NULL, NULL) &&
+ !RelFileLocatorEquals(locator, clustered_rel_locator) &&
+ (!OidIsValid(clustered_rel_toast_locator.relNumber) ||
+ !RelFileLocatorEquals(locator, clustered_rel_toast_locator)))
+ return;
+ }
+
ReorderBufferProcessXid(ctx->reorder, xid, buf->origptr);
/*
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 097dc82f6f..61a57053c7 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -486,6 +486,26 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
return SnapBuildMVCCFromHistoric(snap, true);
}
+/*
+ * Build an MVCC snapshot for the initial data load performed by CLUSTER
+ * CONCURRENTLY command.
+ *
+ * The snapshot will only be used to scan one particular relation, which is
+ * treated like a catalog (therefore ->building_full_snapshot is not
+ * important), and the caller should already have a replication slot setup (so
+ * we do not set MyProc->xmin). XXX Do we need to add any restrictions?
+ */
+Snapshot
+SnapBuildInitialSnapshotForCluster(SnapBuild *builder)
+{
+ Snapshot snap;
+
+ Assert(builder->state == SNAPBUILD_CONSISTENT);
+
+ snap = SnapBuildBuildSnapshot(builder);
+ return SnapBuildMVCCFromHistoric(snap, false);
+}
+
/*
* Turn a historic MVCC snapshot into an ordinary MVCC snapshot.
*
diff --git a/src/backend/replication/pgoutput_cluster/Makefile b/src/backend/replication/pgoutput_cluster/Makefile
new file mode 100644
index 0000000000..31471bb546
--- /dev/null
+++ b/src/backend/replication/pgoutput_cluster/Makefile
@@ -0,0 +1,32 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+# Makefile for src/backend/replication/pgoutput_cluster
+#
+# IDENTIFICATION
+# src/backend/replication/pgoutput_cluster
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/replication/pgoutput_cluster
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+ $(WIN32RES) \
+ pgoutput_cluster.o
+PGFILEDESC = "pgoutput_cluster - logical replication output plugin for CLUSTER command"
+NAME = pgoutput_cluster
+
+all: all-shared-lib
+
+include $(top_srcdir)/src/Makefile.shlib
+
+install: all installdirs install-lib
+
+installdirs: installdirs-lib
+
+uninstall: uninstall-lib
+
+clean distclean: clean-lib
+ rm -f $(OBJS)
diff --git a/src/backend/replication/pgoutput_cluster/meson.build b/src/backend/replication/pgoutput_cluster/meson.build
new file mode 100644
index 0000000000..0f033064f2
--- /dev/null
+++ b/src/backend/replication/pgoutput_cluster/meson.build
@@ -0,0 +1,18 @@
+# Copyright (c) 2022-2024, PostgreSQL Global Development Group
+
+pgoutput_cluster_sources = files(
+ 'pgoutput_cluster.c',
+)
+
+if host_system == 'windows'
+ pgoutput_cluster_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pgoutput_cluster',
+ '--FILEDESC', 'pgoutput_cluster - logical replication output plugin for CLUSTER command',])
+endif
+
+pgoutput_cluster = shared_module('pgoutput_cluster',
+ pgoutput_cluster_sources,
+ kwargs: pg_mod_args,
+)
+
+backend_targets += pgoutput_cluster
diff --git a/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c b/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
new file mode 100644
index 0000000000..43f7b34297
--- /dev/null
+++ b/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
@@ -0,0 +1,288 @@
+/* TODO Move into src/backend/cluster/ (and rename?) */
+/*-------------------------------------------------------------------------
+ *
+ * pgoutput_cluster.c
+ * Logical Replication output plugin for CLUSTER command
+ *
+ * Copyright (c) 2012-2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heaptoast.h"
+#include "commands/cluster.h"
+#include "replication/snapbuild.h"
+
+PG_MODULE_MAGIC;
+
+static void plugin_startup(LogicalDecodingContext *ctx,
+ OutputPluginOptions *opt, bool is_init);
+static void plugin_shutdown(LogicalDecodingContext *ctx);
+static void plugin_begin_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn);
+static void plugin_commit_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ Relation rel, ReorderBufferChange *change);
+static void plugin_truncate(struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, int nrelations,
+ Relation relations[],
+ ReorderBufferChange *change);
+static void store_change(LogicalDecodingContext *ctx,
+ ConcurrentChangeKind kind, HeapTuple tuple);
+
+void
+_PG_output_plugin_init(OutputPluginCallbacks *cb)
+{
+ AssertVariableIsOfType(&_PG_output_plugin_init, LogicalOutputPluginInit);
+
+ cb->startup_cb = plugin_startup;
+ cb->begin_cb = plugin_begin_txn;
+ cb->change_cb = plugin_change;
+ cb->truncate_cb = plugin_truncate;
+ cb->commit_cb = plugin_commit_txn;
+ cb->shutdown_cb = plugin_shutdown;
+}
+
+
+/* initialize this plugin */
+static void
+plugin_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
+ bool is_init)
+{
+ ctx->output_plugin_private = NULL;
+
+ /* Probably unnecessary, as we don't use the SQL interface ... */
+ opt->output_type = OUTPUT_PLUGIN_BINARY_OUTPUT;
+
+ if (ctx->output_plugin_options != NIL)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("this plugin does not expect any options")));
+ }
+}
+
+static void
+plugin_shutdown(LogicalDecodingContext *ctx)
+{
+}
+
+/*
+ * As we don't release the slot while processing a particular table, there's
+ * no room for an SQL interface, even for debugging purposes. Therefore we need
+ * neither OutputPluginPrepareWrite() nor OutputPluginWrite() in the plugin
+ * callbacks. (Although we might want to write custom callbacks, this API
+ * seems to be unnecessarily generic for our purposes.)
+ */
+
+/* BEGIN callback */
+static void
+plugin_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+}
+
+/* COMMIT callback */
+static void
+plugin_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+}
+
+/*
+ * Callback for individual changed tuples
+ */
+static void
+plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ Relation relation, ReorderBufferChange *change)
+{
+ ClusterDecodingState *dstate;
+
+ dstate = (ClusterDecodingState *) ctx->output_writer_private;
+
+ /* Only interested in one particular relation. */
+ if (relation->rd_id != dstate->relid)
+ return;
+
+ /* Decode entry depending on its type */
+ switch (change->action)
+ {
+ case REORDER_BUFFER_CHANGE_INSERT:
+ {
+ HeapTuple newtuple;
+
+ newtuple = change->data.tp.newtuple;
+
+ /*
+ * Identity checks in the main function should have made this
+ * impossible.
+ */
+ if (newtuple == NULL)
+ elog(ERROR, "incomplete insert info");
+
+ store_change(ctx, CHANGE_INSERT, newtuple);
+ }
+ break;
+ case REORDER_BUFFER_CHANGE_UPDATE:
+ {
+ HeapTuple oldtuple,
+ newtuple;
+
+ oldtuple = change->data.tp.oldtuple;
+ newtuple = change->data.tp.newtuple;
+
+ if (newtuple == NULL)
+ elog(ERROR, "incomplete update info");
+
+ if (oldtuple != NULL)
+ store_change(ctx, CHANGE_UPDATE_OLD, oldtuple);
+
+ store_change(ctx, CHANGE_UPDATE_NEW, newtuple);
+ }
+ break;
+ case REORDER_BUFFER_CHANGE_DELETE:
+ {
+ HeapTuple oldtuple;
+
+ oldtuple = change->data.tp.oldtuple;
+
+ if (oldtuple == NULL)
+ elog(ERROR, "incomplete delete info");
+
+ store_change(ctx, CHANGE_DELETE, oldtuple);
+ }
+ break;
+ default:
+ /* Should not come here */
+ Assert(false);
+ break;
+ }
+}
+
+static void
+plugin_truncate(struct LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ int nrelations, Relation relations[],
+ ReorderBufferChange *change)
+{
+ ClusterDecodingState *dstate;
+ int i;
+ Relation relation = NULL;
+
+ dstate = (ClusterDecodingState *) ctx->output_writer_private;
+
+ /* Find the relation we are processing. */
+ for (i = 0; i < nrelations; i++)
+ {
+ relation = relations[i];
+
+ if (RelationGetRelid(relation) == dstate->relid)
+ break;
+ }
+
+ /* Nothing to do if the truncation does not affect our relation. */
+ if (i == nrelations)
+ return;
+
+ store_change(ctx, CHANGE_TRUNCATE, NULL);
+}
+
+/* Store concurrent data change. */
+static void
+store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
+ HeapTuple tuple)
+{
+ ClusterDecodingState *dstate;
+ char *change_raw;
+ ConcurrentChange change;
+ bool flattened = false;
+ Size size;
+ Datum values[1];
+ bool isnull[1];
+ char *dst, *dst_start;
+
+ dstate = (ClusterDecodingState *) ctx->output_writer_private;
+
+ size = MAXALIGN(VARHDRSZ) + SizeOfConcurrentChange;
+
+ if (tuple)
+ {
+ /*
+ * ReorderBufferCommit() stores the TOAST chunks in its private memory
+ * context and frees them after having called
+ * apply_change(). Therefore we need flat copy (including TOAST) that
+ * we eventually copy into the memory context which is available to
+ * decode_concurrent_changes().
+ */
+ if (HeapTupleHasExternal(tuple))
+ {
+ /*
+ * toast_flatten_tuple_to_datum() might be more convenient but we
+ * don't want the decompression it does.
+ */
+ tuple = toast_flatten_tuple(tuple, dstate->tupdesc);
+ flattened = true;
+ }
+
+ size += tuple->t_len;
+ }
+
+ /* XXX Isn't there any function / macro to do this? */
+ if (size >= 0x3FFFFFFF)
+ elog(ERROR, "change is too big");
+
+ /* Construct the change. */
+ change_raw = (char *) palloc0(size);
+ SET_VARSIZE(change_raw, size);
+ /*
+ * Since the varlena alignment might not be sufficient for the structure,
+ * set the fields in a local instance and remember where it should
+ * eventually be copied.
+ */
+ change.kind = kind;
+ dst_start = (char *) VARDATA(change_raw);
+
+ /* No other information is needed for TRUNCATE. */
+ if (change.kind == CHANGE_TRUNCATE)
+ goto store;
+
+ /*
+ * Copy the tuple.
+ *
+ * CAUTION: change->tup_data.t_data must be fixed on retrieval!
+ */
+ memcpy(&change.tup_data, tuple, sizeof(HeapTupleData));
+ dst = dst_start + SizeOfConcurrentChange;
+ memcpy(dst, tuple->t_data, tuple->t_len);
+
+ /* The data has been copied. */
+ if (flattened)
+ pfree(tuple);
+
+store:
+ /* Copy the structure so it can be stored. */
+ memcpy(dst_start, &change, SizeOfConcurrentChange);
+
+ /* Store as a tuple with a single bytea column. */
+ values[0] = PointerGetDatum(change_raw);
+ isnull[0] = false;
+ tuplestore_putvalues(dstate->tstore, dstate->tupdesc_change,
+ values, isnull);
+
+ /* Accounting. */
+ dstate->nchanges++;
+
+ /* Cleanup. */
+ pfree(change_raw);
+}
+
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 7783ba854f..11ae537a8d 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
#include "access/xlogprefetcher.h"
#include "access/xlogrecovery.h"
#include "commands/async.h"
+#include "commands/cluster.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/autovacuum.h"
@@ -148,6 +149,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, WaitEventCustomShmemSize());
size = add_size(size, InjectionPointShmemSize());
size = add_size(size, SlotSyncShmemSize());
+ size = add_size(size, ClusterShmemSize());
/* include additional requested shmem from preload libraries */
size = add_size(size, total_addin_request);
@@ -340,6 +342,7 @@ CreateOrAttachShmemStructs(void)
StatsShmemInit();
WaitEventCustomShmemInit();
InjectionPointShmemInit();
+ ClusterShmemInit();
}
/*
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index f28bf37105..81f3a0a141 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -1302,6 +1302,17 @@ ProcessUtilitySlow(ParseState *pstate,
lockmode = AlterTableGetLockLevel(atstmt->cmds);
relid = AlterTableLookupRelation(atstmt, lockmode);
+ /*
+ * If lockmode allows, check if VACUUM FULL / CLUSTER
+ * CONCURRENTLY is in progress. If lockmode is too weak,
+ * cluster_rel() should detect incompatible DDLs executed
+ * by us.
+ *
+ * XXX We might skip the changes for DDLs which do not
+ * change the tuple descriptor.
+ */
+ check_for_concurrent_cluster(relid, lockmode);
+
if (OidIsValid(relid))
{
AlterTableUtilityContext atcontext;
diff --git a/src/backend/utils/activity/backend_progress.c b/src/backend/utils/activity/backend_progress.c
index 55c8ddd89e..bab78bd34f 100644
--- a/src/backend/utils/activity/backend_progress.c
+++ b/src/backend/utils/activity/backend_progress.c
@@ -162,3 +162,19 @@ pgstat_progress_end_command(void)
beentry->st_progress.command_target = InvalidOid;
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
+
+void
+pgstat_progress_restore_state(PgBackendProgress *backup)
+{
+ volatile PgBackendStatus *beentry = MyBEEntry;
+
+ if (!beentry || !pgstat_track_activities)
+ return;
+
+ PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
+ beentry->st_progress.command = backup->command;
+ beentry->st_progress.command_target = backup->command_target;
+ memcpy(MyBEEntry->st_progress.param, backup->param,
+ sizeof(beentry->st_progress.param));
+ PGSTAT_END_WRITE_ACTIVITY(beentry);
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 16144c2b72..5dc361d5d6 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -345,6 +345,7 @@ WALSummarizer "Waiting to read or update WAL summarization state."
DSMRegistry "Waiting to read or update the dynamic shared memory registry."
InjectionPoint "Waiting to read or update information related to injection points."
SerialControl "Waiting to read or update shared <filename>pg_serial</filename> state."
+ClusteredRels "Waiting to read or update information on tables being clustered concurrently."
#
# END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index fc972ed17d..d652bf60cf 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -1565,6 +1565,28 @@ CacheInvalidateRelcache(Relation relation)
databaseId, relationId);
}
+/*
+ * CacheInvalidateRelcacheImmediate
+ * Send invalidation message for the specified relation's relcache entry.
+ *
+ * Currently this is used in VACUUM FULL/CLUSTER CONCURRENTLY, to make sure
+ * that other backends are aware that the command is being executed for the
+ * relation.
+ */
+void
+CacheInvalidateRelcacheImmediate(Relation relation)
+{
+ SharedInvalidationMessage msg;
+
+ msg.rc.id = SHAREDINVALRELCACHE_ID;
+ msg.rc.dbId = MyDatabaseId;
+ msg.rc.relId = RelationGetRelid(relation);
+ /* check AddCatcacheInvalidationMessage() for an explanation */
+ VALGRIND_MAKE_MEM_DEFINED(&msg, sizeof(msg));
+
+ SendSharedInvalidMessages(&msg, 1);
+}
+
/*
* CacheInvalidateRelcacheAll
* Register invalidation of the whole relcache at the end of command.
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 422509f18d..b20a7405e6 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -64,6 +64,7 @@
#include "catalog/pg_type.h"
#include "catalog/schemapg.h"
#include "catalog/storage.h"
+#include "commands/cluster.h"
#include "commands/policy.h"
#include "commands/publicationcmds.h"
#include "commands/trigger.h"
@@ -1258,6 +1259,10 @@ retry:
/* make sure relation is marked as having no open file yet */
relation->rd_smgr = NULL;
+ /* Is CLUSTER CONCURRENTLY in progress? */
+ relation->rd_cluster_concurrent =
+ is_concurrent_cluster_in_progress(targetRelId);
+
/*
* now we can free the memory allocated for pg_class_tuple
*/
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 4c573b2ded..d7c1ba2f5b 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -154,7 +154,6 @@ static List *exportedSnapshots = NIL;
/* Prototypes for local functions */
static void UnregisterSnapshotNoOwner(Snapshot snapshot);
-static void FreeSnapshot(Snapshot snapshot);
static void SnapshotResetXmin(void);
/* ResourceOwner callbacks to track snapshot references */
@@ -591,7 +590,7 @@ CopySnapshot(Snapshot snapshot)
* FreeSnapshot
* Free the memory associated with a snapshot.
*/
-static void
+void
FreeSnapshot(Snapshot snapshot)
{
Assert(snapshot->regd_count == 0);
diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index bbd08770c3..cd3fdd3659 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -3104,7 +3104,7 @@ match_previous_words(int pattern_id,
* one word, so the above test is correct.
*/
if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
- COMPLETE_WITH("VERBOSE");
+ COMPLETE_WITH("VERBOSE", "CONCURRENTLY");
}
/* COMMENT */
@@ -5103,7 +5103,8 @@ match_previous_words(int pattern_id,
"DISABLE_PAGE_SKIPPING", "SKIP_LOCKED",
"INDEX_CLEANUP", "PROCESS_MAIN", "PROCESS_TOAST",
"TRUNCATE", "PARALLEL", "SKIP_DATABASE_STATS",
- "ONLY_DATABASE_STATS", "BUFFER_USAGE_LIMIT");
+ "ONLY_DATABASE_STATS", "BUFFER_USAGE_LIMIT",
+ "CONCURRENTLY");
else if (TailMatches("FULL|FREEZE|ANALYZE|VERBOSE|DISABLE_PAGE_SKIPPING|SKIP_LOCKED|PROCESS_MAIN|PROCESS_TOAST|TRUNCATE|SKIP_DATABASE_STATS|ONLY_DATABASE_STATS"))
COMPLETE_WITH("ON", "OFF");
else if (TailMatches("INDEX_CLEANUP"))
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 96cf82f97b..e4a32fc391 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -413,6 +413,10 @@ extern HTSV_Result HeapTupleSatisfiesVacuumHorizon(HeapTuple htup, Buffer buffer
TransactionId *dead_after);
extern void HeapTupleSetHintBits(HeapTupleHeader tuple, Buffer buffer,
uint16 infomask, TransactionId xid);
+extern bool HeapTupleMVCCInserted(HeapTuple htup, Snapshot snapshot,
+ Buffer buffer);
+extern bool HeapTupleMVCCNotDeleted(HeapTuple htup, Snapshot snapshot,
+ Buffer buffer);
extern bool HeapTupleHeaderIsOnlyLocked(HeapTupleHeader tuple);
extern bool HeapTupleIsSurelyDead(HeapTuple htup,
struct GlobalVisState *vistest);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index adb478a93c..fbc898028f 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,6 +21,7 @@
#include "access/sdir.h"
#include "access/xact.h"
#include "executor/tuptable.h"
+#include "replication/logical.h"
#include "storage/read_stream.h"
#include "utils/rel.h"
#include "utils/snapshot.h"
@@ -629,6 +630,8 @@ typedef struct TableAmRoutine
Relation OldIndex,
bool use_sort,
TransactionId OldestXmin,
+ Snapshot snapshot,
+ LogicalDecodingContext *decoding_ctx,
TransactionId *xid_cutoff,
MultiXactId *multi_cutoff,
double *num_tuples,
@@ -1676,6 +1679,10 @@ table_relation_copy_data(Relation rel, const RelFileLocator *newrlocator)
* not needed for the relation's AM
* - *xid_cutoff - ditto
* - *multi_cutoff - ditto
+ * - snapshot - if != NULL, ignore data changes done by transactions that this
+ * (MVCC) snapshot considers still in-progress or in the future.
+ * - decoding_ctx - logical decoding context, to capture concurrent data
+ * changes.
*
* Output parameters:
* - *xid_cutoff - rel's new relfrozenxid value, may be invalid
@@ -1688,6 +1695,8 @@ table_relation_copy_for_cluster(Relation OldTable, Relation NewTable,
Relation OldIndex,
bool use_sort,
TransactionId OldestXmin,
+ Snapshot snapshot,
+ LogicalDecodingContext *decoding_ctx,
TransactionId *xid_cutoff,
MultiXactId *multi_cutoff,
double *num_tuples,
@@ -1696,6 +1705,7 @@ table_relation_copy_for_cluster(Relation OldTable, Relation NewTable,
{
OldTable->rd_tableam->relation_copy_for_cluster(OldTable, NewTable, OldIndex,
use_sort, OldestXmin,
+ snapshot, decoding_ctx,
xid_cutoff, multi_cutoff,
num_tuples, tups_vacuumed,
tups_recently_dead);
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 2dea96f47c..943fe71ba6 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -100,6 +100,9 @@ extern Oid index_concurrently_create_copy(Relation heapRelation,
Oid tablespaceOid,
const char *newName);
+extern NullableDatum *get_index_stattargets(Oid indexid,
+ IndexInfo *indInfo);
+
extern void index_concurrently_build(Oid heapRelationId,
Oid indexRelationId);
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 2d8e363015..c0f2cdabf0 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -13,10 +13,15 @@
#ifndef CLUSTER_H
#define CLUSTER_H
+#include "nodes/execnodes.h"
#include "nodes/parsenodes.h"
#include "parser/parse_node.h"
+#include "replication/logical.h"
#include "storage/lock.h"
+#include "storage/relfilelocator.h"
#include "utils/relcache.h"
+#include "utils/resowner.h"
+#include "utils/tuplestore.h"
/* flag bits for ClusterParams->options */
@@ -24,6 +29,7 @@
#define CLUOPT_RECHECK 0x02 /* recheck relation state */
#define CLUOPT_RECHECK_ISCLUSTERED 0x04 /* recheck relation state for
* indisclustered */
+#define CLUOPT_CONCURRENT 0x08 /* allow concurrent data changes */
/* options for CLUSTER */
typedef struct ClusterParams
@@ -31,12 +37,91 @@ typedef struct ClusterParams
bits32 options; /* bitmask of CLUOPT_* */
} ClusterParams;
+/*
+ * The following definitions are used for concurrent processing.
+ */
+
+extern RelFileLocator clustered_rel_locator;
+extern RelFileLocator clustered_rel_toast_locator;
+
+typedef enum
+{
+ CHANGE_INSERT,
+ CHANGE_UPDATE_OLD,
+ CHANGE_UPDATE_NEW,
+ CHANGE_DELETE,
+ CHANGE_TRUNCATE
+} ConcurrentChangeKind;
+
+typedef struct ConcurrentChange
+{
+ /* See the enum above. */
+ ConcurrentChangeKind kind;
+
+ /*
+ * The actual tuple.
+ *
+ * The tuple data follows the ConcurrentChange structure. Before use make
+ * sure the tuple is correctly aligned (ConcurrentChange can be stored as
+ * bytea) and that tuple->t_data is fixed.
+ */
+ HeapTupleData tup_data;
+} ConcurrentChange;
+
+#define SizeOfConcurrentChange (offsetof(ConcurrentChange, tup_data) + \
+ sizeof(HeapTupleData))
+
+/*
+ * Logical decoding state.
+ *
+ * Here we store the data changes that we decode from WAL while the table
+ * contents are being copied to the new storage, as well as the metadata
+ * needed to apply these changes to the table.
+ */
+typedef struct ClusterDecodingState
+{
+ /* The relation whose changes we're decoding. */
+ Oid relid;
+
+ /*
+ * Decoded changes are stored here. Although we try to avoid excessively
+ * large batches, the changes may still need to be spilled to disk. The
+ * tuplestore does this transparently.
+ */
+ Tuplestorestate *tstore;
+
+ /* The current number of changes in tstore. */
+ double nchanges;
+
+ /*
+ * Descriptor to store the ConcurrentChange structure serialized (bytea).
+ * We can't store the tuple directly because the tuplestore only supports
+ * minimal tuples and we may need to transfer the OID system column from
+ * the output plugin. Also we need to transfer the change kind, so it's
+ * better to put everything in one structure than to use two tuplestores
+ * "in parallel".
+ */
+ TupleDesc tupdesc_change;
+
+ /* Tuple descriptor needed to update indexes. */
+ TupleDesc tupdesc;
+
+ /* Slot to retrieve data from tstore. */
+ TupleTableSlot *tsslot;
+
+ ResourceOwner resowner;
+} ClusterDecodingState;
+
extern void cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel);
-extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params);
+extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
+ bool isTopLevel, bool isVacuum);
extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid,
LOCKMODE lockmode);
extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
-
+extern void check_relation_is_clusterable_concurrently(Relation rel,
+ bool is_vacuum);
+extern void cluster_decode_concurrent_changes(LogicalDecodingContext *ctx,
+ XLogRecPtr end_of_wal);
extern Oid make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
char relpersistence, LOCKMODE lockmode);
extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
@@ -44,8 +129,13 @@ extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
bool swap_toast_by_content,
bool check_constraints,
bool is_internal,
+ bool reindex,
TransactionId frozenXid,
MultiXactId cutoffMulti,
char newrelpersistence);
+extern Size ClusterShmemSize(void);
+extern void ClusterShmemInit(void);
+extern bool is_concurrent_cluster_in_progress(Oid relid);
+extern void check_for_concurrent_cluster(Oid relid, LOCKMODE lockmode);
#endif /* CLUSTER_H */
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 5616d64523..03e3712ede 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -59,19 +59,22 @@
#define PROGRESS_CLUSTER_PHASE 1
#define PROGRESS_CLUSTER_INDEX_RELID 2
#define PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED 3
-#define PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN 4
-#define PROGRESS_CLUSTER_TOTAL_HEAP_BLKS 5
-#define PROGRESS_CLUSTER_HEAP_BLKS_SCANNED 6
-#define PROGRESS_CLUSTER_INDEX_REBUILD_COUNT 7
+#define PROGRESS_CLUSTER_HEAP_TUPLES_INSERTED 4
+#define PROGRESS_CLUSTER_HEAP_TUPLES_UPDATED 5
+#define PROGRESS_CLUSTER_HEAP_TUPLES_DELETED 6
+#define PROGRESS_CLUSTER_TOTAL_HEAP_BLKS 7
+#define PROGRESS_CLUSTER_HEAP_BLKS_SCANNED 8
+#define PROGRESS_CLUSTER_INDEX_REBUILD_COUNT 9
/* Phases of cluster (as advertised via PROGRESS_CLUSTER_PHASE) */
#define PROGRESS_CLUSTER_PHASE_SEQ_SCAN_HEAP 1
#define PROGRESS_CLUSTER_PHASE_INDEX_SCAN_HEAP 2
#define PROGRESS_CLUSTER_PHASE_SORT_TUPLES 3
#define PROGRESS_CLUSTER_PHASE_WRITE_NEW_HEAP 4
-#define PROGRESS_CLUSTER_PHASE_SWAP_REL_FILES 5
-#define PROGRESS_CLUSTER_PHASE_REBUILD_INDEX 6
-#define PROGRESS_CLUSTER_PHASE_FINAL_CLEANUP 7
+#define PROGRESS_CLUSTER_PHASE_CATCH_UP 5
+#define PROGRESS_CLUSTER_PHASE_SWAP_REL_FILES 6
+#define PROGRESS_CLUSTER_PHASE_REBUILD_INDEX 7
+#define PROGRESS_CLUSTER_PHASE_FINAL_CLEANUP 8
/* Commands of PROGRESS_CLUSTER */
#define PROGRESS_CLUSTER_COMMAND_CLUSTER 1
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 759f9a87d3..2f693e0fc0 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -181,13 +181,16 @@ typedef struct VacAttrStats
#define VACOPT_ANALYZE 0x02 /* do ANALYZE */
#define VACOPT_VERBOSE 0x04 /* output INFO instrumentation messages */
#define VACOPT_FREEZE 0x08 /* FREEZE option */
-#define VACOPT_FULL 0x10 /* FULL (non-concurrent) vacuum */
-#define VACOPT_SKIP_LOCKED 0x20 /* skip if cannot get lock */
-#define VACOPT_PROCESS_MAIN 0x40 /* process main relation */
-#define VACOPT_PROCESS_TOAST 0x80 /* process the TOAST table, if any */
-#define VACOPT_DISABLE_PAGE_SKIPPING 0x100 /* don't skip any pages */
-#define VACOPT_SKIP_DATABASE_STATS 0x200 /* skip vac_update_datfrozenxid() */
-#define VACOPT_ONLY_DATABASE_STATS 0x400 /* only vac_update_datfrozenxid() */
+#define VACOPT_FULL_EXCLUSIVE 0x10 /* FULL (non-concurrent) vacuum */
+#define VACOPT_FULL_CONCURRENT 0x20 /* FULL (concurrent) vacuum */
+#define VACOPT_SKIP_LOCKED 0x40 /* skip if cannot get lock */
+#define VACOPT_PROCESS_MAIN 0x80 /* process main relation */
+#define VACOPT_PROCESS_TOAST 0x100 /* process the TOAST table, if any */
+#define VACOPT_DISABLE_PAGE_SKIPPING 0x200 /* don't skip any pages */
+#define VACOPT_SKIP_DATABASE_STATS 0x400 /* skip vac_update_datfrozenxid() */
+#define VACOPT_ONLY_DATABASE_STATS 0x800 /* only vac_update_datfrozenxid() */
+
+#define VACOPT_FULL (VACOPT_FULL_EXCLUSIVE | VACOPT_FULL_CONCURRENT)
/*
* Values used by index_cleanup and truncate params.
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index cb2a400cdc..8b8a7d3634 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -73,6 +73,7 @@ extern void FreeSnapshotBuilder(SnapBuild *builder);
extern void SnapBuildSnapDecRefcount(Snapshot snap);
extern Snapshot SnapBuildInitialSnapshot(SnapBuild *builder);
+extern Snapshot SnapBuildInitialSnapshotForCluster(SnapBuild *builder);
extern Snapshot SnapBuildMVCCFromHistoric(Snapshot snapshot, bool in_place);
extern const char *SnapBuildExportSnapshot(SnapBuild *builder);
extern void SnapBuildClearExportedSnapshot(void);
diff --git a/src/include/storage/lockdefs.h b/src/include/storage/lockdefs.h
index 810b297edf..2a1583f367 100644
--- a/src/include/storage/lockdefs.h
+++ b/src/include/storage/lockdefs.h
@@ -36,8 +36,9 @@ typedef int LOCKMODE;
#define AccessShareLock 1 /* SELECT */
#define RowShareLock 2 /* SELECT FOR UPDATE/FOR SHARE */
#define RowExclusiveLock 3 /* INSERT, UPDATE, DELETE */
-#define ShareUpdateExclusiveLock 4 /* VACUUM (non-FULL), ANALYZE, CREATE
- * INDEX CONCURRENTLY */
+#define ShareUpdateExclusiveLock 4 /* VACUUM (non-exclusive), ANALYZE, CREATE
+ * INDEX CONCURRENTLY, CLUSTER
+ * CONCURRENTLY */
#define ShareLock 5 /* CREATE INDEX (WITHOUT CONCURRENTLY) */
#define ShareRowExclusiveLock 6 /* like EXCLUSIVE MODE, but allows ROW
* SHARE */
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 6a2f64c54f..b24c003c53 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -83,3 +83,4 @@ PG_LWLOCK(49, WALSummarizer)
PG_LWLOCK(50, DSMRegistry)
PG_LWLOCK(51, InjectionPoint)
PG_LWLOCK(52, SerialControl)
+PG_LWLOCK(54, ClusteredRels)
diff --git a/src/include/utils/backend_progress.h b/src/include/utils/backend_progress.h
index e09598eafc..5ab5df9d41 100644
--- a/src/include/utils/backend_progress.h
+++ b/src/include/utils/backend_progress.h
@@ -35,7 +35,7 @@ typedef enum ProgressCommandType
/*
* Any command which wishes can advertise that it is running by setting
- * command, command_target, and param[]. command_target should be the OID of
+ * command, command_target, and param[]. command_target should be the OID of
* the relation which the command targets (we assume there's just one, as this
* is meant for utility commands), but the meaning of each element in the
* param array is command-specific.
@@ -55,6 +55,7 @@ extern void pgstat_progress_parallel_incr_param(int index, int64 incr);
extern void pgstat_progress_update_multi_param(int nparam, const int *index,
const int64 *val);
extern void pgstat_progress_end_command(void);
+extern void pgstat_progress_restore_state(PgBackendProgress *backup);
#endif /* BACKEND_PROGRESS_H */
diff --git a/src/include/utils/inval.h b/src/include/utils/inval.h
index 299cd7585f..6c15b035f9 100644
--- a/src/include/utils/inval.h
+++ b/src/include/utils/inval.h
@@ -49,6 +49,8 @@ extern void CacheInvalidateCatalog(Oid catalogId);
extern void CacheInvalidateRelcache(Relation relation);
+extern void CacheInvalidateRelcacheImmediate(Relation relation);
+
extern void CacheInvalidateRelcacheAll(void);
extern void CacheInvalidateRelcacheByTuple(HeapTuple classTuple);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 8700204953..adda46c985 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -253,6 +253,9 @@ typedef struct RelationData
bool pgstat_enabled; /* should relation stats be counted */
/* use "struct" here to avoid needing to include pgstat.h: */
struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+ /* Is CLUSTER CONCURRENTLY being performed on this relation? */
+ bool rd_cluster_concurrent;
} RelationData;
@@ -684,7 +687,9 @@ RelationCloseSmgr(Relation relation)
#define RelationIsAccessibleInLogicalDecoding(relation) \
(XLogLogicalInfoActive() && \
RelationNeedsWAL(relation) && \
- (IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation)))
+ (IsCatalogRelation(relation) || \
+ RelationIsUsedAsCatalogTable(relation) || \
+ (relation)->rd_cluster_concurrent))
/*
* RelationIsLogicallyLogged
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index 874c59b60d..91c70621ec 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -62,6 +62,8 @@ extern Snapshot GetLatestSnapshot(void);
extern void SnapshotSetCommandId(CommandId curcid);
extern Snapshot CopySnapshot(Snapshot snapshot);
+extern void FreeSnapshot(Snapshot snapshot);
+
extern Snapshot GetCatalogSnapshot(Oid relid);
extern Snapshot GetNonHistoricCatalogSnapshot(Oid relid);
extern void InvalidateCatalogSnapshot(void);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 3014d047fe..81300642a5 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1962,17 +1962,20 @@ pg_stat_progress_cluster| SELECT s.pid,
WHEN 2 THEN 'index scanning heap'::text
WHEN 3 THEN 'sorting tuples'::text
WHEN 4 THEN 'writing new heap'::text
- WHEN 5 THEN 'swapping relation files'::text
- WHEN 6 THEN 'rebuilding index'::text
- WHEN 7 THEN 'performing final cleanup'::text
+ WHEN 5 THEN 'catch-up'::text
+ WHEN 6 THEN 'swapping relation files'::text
+ WHEN 7 THEN 'rebuilding index'::text
+ WHEN 8 THEN 'performing final cleanup'::text
ELSE NULL::text
END AS phase,
(s.param3)::oid AS cluster_index_relid,
s.param4 AS heap_tuples_scanned,
- s.param5 AS heap_tuples_written,
- s.param6 AS heap_blks_total,
- s.param7 AS heap_blks_scanned,
- s.param8 AS index_rebuild_count
+ s.param5 AS heap_tuples_inserted,
+ s.param6 AS heap_tuples_updated,
+ s.param7 AS heap_tuples_deleted,
+ s.param8 AS heap_blks_total,
+ s.param9 AS heap_blks_scanned,
+ s.param10 AS index_rebuild_count
FROM (pg_stat_get_progress_info('CLUSTER'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_progress_copy| SELECT s.pid,
--
2.45.2
Attachment: v06-0005-Preserve-visibility-information-of-the-concurrent-da.patch (text/x-diff)
From cdd01791055122cd3c68478d33db02d5d1e6f31c Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Wed, 11 Dec 2024 19:22:42 +0100
Subject: [PATCH 5/8] Preserve visibility information of the concurrent data
changes.
As explained in the commit message of the preceding patch in the series, the
data changes done by applications while VACUUM FULL / CLUSTER CONCURRENTLY is
copying the table contents to a new file are decoded from WAL and eventually
applied to the new file as well. To reduce the complexity a little, the
preceding patch uses the current transaction (i.e. the transaction opened by
the VACUUM FULL / CLUSTER command) to execute those INSERT, UPDATE and DELETE
commands.
However, neither VACUUM nor CLUSTER is expected to change the visibility of
tuples. Therefore, this patch fixes the handling of the "concurrent data
changes". Now the tuples written into the new table storage have the same XID
and command ID (CID) as they had in the old storage.
A related change made here is that the data changes (INSERT, UPDATE, DELETE) we
"replay" on the new storage are not logically decoded themselves. First, the
logical decoding subsystem does not expect an already committed transaction to
be decoded again. Second, repeated decoding would just be wasted effort.
---
src/backend/access/common/toast_internals.c | 3 +-
src/backend/access/heap/heapam.c | 73 ++++++++----
src/backend/access/heap/heapam_handler.c | 14 ++-
src/backend/access/transam/xact.c | 52 ++++++++
src/backend/commands/cluster.c | 111 ++++++++++++++++--
src/backend/replication/logical/decode.c | 76 ++++++++++--
src/backend/replication/logical/snapbuild.c | 22 ++--
.../pgoutput_cluster/pgoutput_cluster.c | 68 +++++++++--
src/include/access/heapam.h | 15 ++-
src/include/access/heapam_xlog.h | 2 +
src/include/access/xact.h | 2 +
src/include/commands/cluster.h | 18 +++
src/include/utils/snapshot.h | 3 +
13 files changed, 389 insertions(+), 70 deletions(-)
diff --git a/src/backend/access/common/toast_internals.c b/src/backend/access/common/toast_internals.c
index 1939cfb4d2..47b05b2135 100644
--- a/src/backend/access/common/toast_internals.c
+++ b/src/backend/access/common/toast_internals.c
@@ -320,7 +320,8 @@ toast_save_datum(Relation rel, Datum value,
memcpy(VARDATA(&chunk_data), data_p, chunk_size);
toasttup = heap_form_tuple(toasttupDesc, t_values, t_isnull);
- heap_insert(toastrel, toasttup, mycid, options, NULL);
+ heap_insert(toastrel, toasttup, GetCurrentTransactionId(), mycid,
+ options, NULL);
/*
* Create the index entry. We cheat a little here by not using
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index a842b84415..d56d0f5164 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -58,7 +58,8 @@ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
HeapTuple newtup, HeapTuple old_key_tuple,
- bool all_visible_cleared, bool new_all_visible_cleared);
+ bool all_visible_cleared, bool new_all_visible_cleared,
+ bool wal_logical);
#ifdef USE_ASSERT_CHECKING
static void check_lock_if_inplace_updateable_rel(Relation relation,
ItemPointer otid,
@@ -1966,7 +1967,7 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
/*
* heap_insert - insert tuple into a heap
*
- * The new tuple is stamped with current transaction ID and the specified
+ * The new tuple is stamped with the specified transaction ID and the specified
* command ID.
*
* See table_tuple_insert for comments about most of the input flags, except
@@ -1982,15 +1983,16 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
* reflected into *tup.
*/
void
-heap_insert(Relation relation, HeapTuple tup, CommandId cid,
- int options, BulkInsertState bistate)
+heap_insert(Relation relation, HeapTuple tup, TransactionId xid,
+ CommandId cid, int options, BulkInsertState bistate)
{
- TransactionId xid = GetCurrentTransactionId();
HeapTuple heaptup;
Buffer buffer;
Buffer vmbuffer = InvalidBuffer;
bool all_visible_cleared = false;
+ Assert(TransactionIdIsValid(xid));
+
/* Cheap, simplistic check that the tuple matches the rel's rowtype. */
Assert(HeapTupleHeaderGetNatts(tup->t_data) <=
RelationGetNumberOfAttributes(relation));
@@ -2621,7 +2623,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
void
simple_heap_insert(Relation relation, HeapTuple tup)
{
- heap_insert(relation, tup, GetCurrentCommandId(true), 0, NULL);
+ heap_insert(relation, tup, GetCurrentTransactionId(),
+ GetCurrentCommandId(true), 0, NULL);
}
/*
@@ -2678,11 +2681,11 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
*/
TM_Result
heap_delete(Relation relation, ItemPointer tid,
- CommandId cid, Snapshot crosscheck, bool wait,
- TM_FailureData *tmfd, bool changingPart)
+ TransactionId xid, CommandId cid, Snapshot crosscheck, bool wait,
+ TM_FailureData *tmfd, bool changingPart,
+ bool wal_logical)
{
TM_Result result;
- TransactionId xid = GetCurrentTransactionId();
ItemId lp;
HeapTupleData tp;
Page page;
@@ -2699,6 +2702,7 @@ heap_delete(Relation relation, ItemPointer tid,
bool old_key_copied = false;
Assert(ItemPointerIsValid(tid));
+ Assert(TransactionIdIsValid(xid));
/*
* Forbid this during a parallel operation, lest it allocate a combo CID.
@@ -2924,7 +2928,8 @@ l1:
* Compute replica identity tuple before entering the critical section so
* we don't PANIC upon a memory allocation failure.
*/
- old_key_tuple = ExtractReplicaIdentity(relation, &tp, true, &old_key_copied);
+ old_key_tuple = wal_logical ?
+ ExtractReplicaIdentity(relation, &tp, true, &old_key_copied) : NULL;
/*
* If this is the first possibly-multixact-able operation in the current
@@ -2992,8 +2997,12 @@ l1:
/*
* For logical decode we need combo CIDs to properly decode the
* catalog
+ *
+ * Like in heap_insert(), visibility is unchanged when called from
+ * VACUUM FULL / CLUSTER.
*/
- if (RelationIsAccessibleInLogicalDecoding(relation))
+ if (wal_logical &&
+ RelationIsAccessibleInLogicalDecoding(relation))
log_heap_new_cid(relation, &tp);
xlrec.flags = 0;
@@ -3014,6 +3023,15 @@ l1:
xlrec.flags |= XLH_DELETE_CONTAINS_OLD_KEY;
}
+ /*
+ * Unlike UPDATE, DELETE is decoded even if there is no old key, so it
+ * does not help to clear both XLH_DELETE_CONTAINS_OLD_TUPLE and
+ * XLH_DELETE_CONTAINS_OLD_KEY. Thus we need an extra flag. TODO:
+ * Consider not decoding tuples w/o the old tuple/key instead.
+ */
+ if (!wal_logical)
+ xlrec.flags |= XLH_DELETE_NO_LOGICAL;
+
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapDelete);
@@ -3103,10 +3121,11 @@ simple_heap_delete(Relation relation, ItemPointer tid)
TM_Result result;
TM_FailureData tmfd;
- result = heap_delete(relation, tid,
+ result = heap_delete(relation, tid, GetCurrentTransactionId(),
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &tmfd, false /* changingPart */ );
+ &tmfd, false /* changingPart */ ,
+ true /* wal_logical */);
switch (result)
{
case TM_SelfModified:
@@ -3145,12 +3164,11 @@ simple_heap_delete(Relation relation, ItemPointer tid)
*/
TM_Result
heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
- CommandId cid, Snapshot crosscheck, bool wait,
- TM_FailureData *tmfd, LockTupleMode *lockmode,
- TU_UpdateIndexes *update_indexes)
+ TransactionId xid, CommandId cid, Snapshot crosscheck,
+ bool wait, TM_FailureData *tmfd, LockTupleMode *lockmode,
+ TU_UpdateIndexes *update_indexes, bool wal_logical)
{
TM_Result result;
- TransactionId xid = GetCurrentTransactionId();
Bitmapset *hot_attrs;
Bitmapset *sum_attrs;
Bitmapset *key_attrs;
@@ -3190,6 +3208,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
infomask2_new_tuple;
Assert(ItemPointerIsValid(otid));
+ Assert(TransactionIdIsValid(xid));
/* Cheap, simplistic check that the tuple matches the rel's rowtype. */
Assert(HeapTupleHeaderGetNatts(newtup->t_data) <=
@@ -3982,8 +4001,12 @@ l2:
/*
* For logical decoding we need combo CIDs to properly decode the
* catalog.
+ *
+ * Like in heap_insert(), visibility is unchanged when called from
+ * VACUUM FULL / CLUSTER.
*/
- if (RelationIsAccessibleInLogicalDecoding(relation))
+ if (wal_logical &&
+ RelationIsAccessibleInLogicalDecoding(relation))
{
log_heap_new_cid(relation, &oldtup);
log_heap_new_cid(relation, heaptup);
@@ -3993,7 +4016,8 @@ l2:
newbuf, &oldtup, heaptup,
old_key_tuple,
all_visible_cleared,
- all_visible_cleared_new);
+ all_visible_cleared_new,
+ wal_logical);
if (newbuf != buffer)
{
PageSetLSN(BufferGetPage(newbuf), recptr);
@@ -4348,10 +4372,10 @@ simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup,
TM_FailureData tmfd;
LockTupleMode lockmode;
- result = heap_update(relation, otid, tup,
+ result = heap_update(relation, otid, tup, GetCurrentTransactionId(),
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &tmfd, &lockmode, update_indexes);
+ &tmfd, &lockmode, update_indexes, true);
switch (result)
{
case TM_SelfModified:
@@ -8682,7 +8706,8 @@ static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup, HeapTuple newtup,
HeapTuple old_key_tuple,
- bool all_visible_cleared, bool new_all_visible_cleared)
+ bool all_visible_cleared, bool new_all_visible_cleared,
+ bool wal_logical)
{
xl_heap_update xlrec;
xl_heap_header xlhdr;
@@ -8693,10 +8718,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
suffixlen = 0;
XLogRecPtr recptr;
Page page = BufferGetPage(newbuf);
- bool need_tuple_data = RelationIsLogicallyLogged(reln);
+ bool need_tuple_data;
bool init;
int bufflags;
+ need_tuple_data = RelationIsLogicallyLogged(reln) && wal_logical;
+
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 06cd85b34b..16525a4669 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -252,7 +252,8 @@ heapam_tuple_insert(Relation relation, TupleTableSlot *slot, CommandId cid,
tuple->t_tableOid = slot->tts_tableOid;
/* Perform the insertion, and copy the resulting ItemPointer */
- heap_insert(relation, tuple, cid, options, bistate);
+ heap_insert(relation, tuple, GetCurrentTransactionId(), cid, options,
+ bistate);
ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
if (shouldFree)
@@ -275,7 +276,8 @@ heapam_tuple_insert_speculative(Relation relation, TupleTableSlot *slot,
options |= HEAP_INSERT_SPECULATIVE;
/* Perform the insertion, and copy the resulting ItemPointer */
- heap_insert(relation, tuple, cid, options, bistate);
+ heap_insert(relation, tuple, GetCurrentTransactionId(), cid, options,
+ bistate);
ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
if (shouldFree)
@@ -309,7 +311,8 @@ heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
* the storage itself is cleaning the dead tuples by itself, it is the
* time to call the index tuple deletion also.
*/
- return heap_delete(relation, tid, cid, crosscheck, wait, tmfd, changingPart);
+ return heap_delete(relation, tid, GetCurrentTransactionId(), cid,
+ crosscheck, wait, tmfd, changingPart, true);
}
@@ -327,8 +330,9 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
slot->tts_tableOid = RelationGetRelid(relation);
tuple->t_tableOid = slot->tts_tableOid;
- result = heap_update(relation, otid, tuple, cid, crosscheck, wait,
- tmfd, lockmode, update_indexes);
+ result = heap_update(relation, otid, tuple, GetCurrentTransactionId(),
+ cid, crosscheck, wait,
+ tmfd, lockmode, update_indexes, true);
ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
/*
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 3ebd7c4041..c227520757 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -125,6 +125,18 @@ static FullTransactionId XactTopFullTransactionId = {InvalidTransactionId};
static int nParallelCurrentXids = 0;
static TransactionId *ParallelCurrentXids;
+/*
+ * Another case that requires TransactionIdIsCurrentTransactionId() to behave
+ * specially is when CLUSTER CONCURRENTLY is processing data changes made in
+ * the old storage of a table by other transactions. When applying the changes
+ * to the new storage, the backend executing the CLUSTER command needs to act
+ * on behalf of those other transactions. The transactions responsible for the
+ * changes in the old storage are stored in this array, sorted by
+ * xidComparator.
+ */
+static int nClusterCurrentXids = 0;
+static TransactionId *ClusterCurrentXids = NULL;
+
/*
* Miscellaneous flag bits to record events which occur on the top level
* transaction. These flags are only persisted in MyXactFlags and are intended
@@ -971,6 +983,8 @@ TransactionIdIsCurrentTransactionId(TransactionId xid)
int low,
high;
+ Assert(nClusterCurrentXids == 0);
+
low = 0;
high = nParallelCurrentXids - 1;
while (low <= high)
@@ -990,6 +1004,21 @@ TransactionIdIsCurrentTransactionId(TransactionId xid)
return false;
}
+ /*
+ * When executing CLUSTER CONCURRENTLY, the array of current transactions
+ * is given.
+ */
+ if (nClusterCurrentXids > 0)
+ {
+ Assert(nParallelCurrentXids == 0);
+
+ return bsearch(&xid,
+ ClusterCurrentXids,
+ nClusterCurrentXids,
+ sizeof(TransactionId),
+ xidComparator) != NULL;
+ }
+
/*
* We will return true for the Xid of the current subtransaction, any of
* its subcommitted children, any of its parents, or any of their
@@ -5628,6 +5657,29 @@ EndParallelWorkerTransaction(void)
CurrentTransactionState->blockState = TBLOCK_DEFAULT;
}
+/*
+ * SetClusterCurrentXids
+ * Set the XID array that TransactionIdIsCurrentTransactionId() should
+ * use.
+ */
+void
+SetClusterCurrentXids(TransactionId *xip, int xcnt)
+{
+ ClusterCurrentXids = xip;
+ nClusterCurrentXids = xcnt;
+}
+
+/*
+ * ResetClusterCurrentXids
+ * Undo the effect of SetClusterCurrentXids().
+ */
+void
+ResetClusterCurrentXids(void)
+{
+ ClusterCurrentXids = NULL;
+ nClusterCurrentXids = 0;
+}
+
/*
* ShowTransactionState
* Debug support
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 4a4b51f77d..1cb3201c5a 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -200,6 +200,7 @@ static void apply_concurrent_delete(Relation rel, HeapTuple tup_target,
ConcurrentChange *change);
static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
HeapTuple tup_key,
+ Snapshot snapshot,
IndexInsertState *iistate,
TupleTableSlot *ident_slot,
IndexScanDesc *scan_p);
@@ -2963,6 +2964,9 @@ setup_logical_decoding(Oid relid, const char *slotname, TupleDesc tupdesc)
dstate->relid = relid;
dstate->tstore = tuplestore_begin_heap(false, false,
maintenance_work_mem);
+#ifdef USE_ASSERT_CHECKING
+ dstate->last_change_xid = InvalidTransactionId;
+#endif
dstate->tupdesc = tupdesc;
/* Initialize the descriptor to store the changes ... */
@@ -3118,6 +3122,7 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
tup_exist;
char *change_raw, *src;
ConcurrentChange change;
+ Snapshot snapshot;
bool isnull[1];
Datum values[1];
@@ -3186,8 +3191,30 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
/*
* Find the tuple to be updated or deleted.
+ *
+ * As the table being CLUSTERed concurrently is considered a
+ * "user catalog", the new CID is WAL-logged and decoded. And since we
+ * use the same XID that the original DMLs did, the snapshot used
+ * for the logical decoding (by now converted to a non-historic
+ * MVCC snapshot) should see the tuples inserted previously into
+ * the new heap and/or updated there.
+ */
+ snapshot = change.snapshot;
+
+ /*
+ * Set what should be considered current transaction (and
+ * subtransactions) during visibility check.
+ *
+ * Note that this snapshot was created from a historic snapshot
+ * using SnapBuildMVCCFromHistoric(), which does not touch
+ * 'subxip'. Thus, unlike in a regular MVCC snapshot, the array
+ * only contains the transactions whose data changes we are
+ * applying, and their subtransactions. That's exactly what we need
+ * to check whether a particular xact is a "current transaction".
*/
- tup_exist = find_target_tuple(rel, key, nkeys, tup_key,
+ SetClusterCurrentXids(snapshot->subxip, snapshot->subxcnt);
+
+ tup_exist = find_target_tuple(rel, key, nkeys, tup_key, snapshot,
iistate, ident_slot, &ind_scan);
if (tup_exist == NULL)
elog(ERROR, "Failed to find target tuple");
@@ -3198,6 +3225,8 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
else
apply_concurrent_delete(rel, tup_exist, &change);
+ ResetClusterCurrentXids();
+
if (tup_old != NULL)
{
pfree(tup_old);
@@ -3210,11 +3239,14 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
else
elog(ERROR, "Unrecognized kind of change: %d", change.kind);
- /* If there's any change, make it visible to the next iteration. */
- if (change.kind != CHANGE_UPDATE_OLD)
+ /* Free the snapshot if this is the last change that needed it. */
+ Assert(change.snapshot->active_count > 0);
+ change.snapshot->active_count--;
+ if (change.snapshot->active_count == 0)
{
- CommandCounterIncrement();
- UpdateActiveSnapshotCommandId();
+ if (change.snapshot == dstate->snapshot)
+ dstate->snapshot = NULL;
+ FreeSnapshot(change.snapshot);
}
/* TTSOpsMinimalTuple has .get_heap_tuple==NULL. */
@@ -3234,10 +3266,30 @@ static void
apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
IndexInsertState *iistate, TupleTableSlot *index_slot)
{
+ Snapshot snapshot = change->snapshot;
List *recheck;
+ /*
+ * For INSERT, the visibility information is not important, but we use the
+ * snapshot to get the CID. Index functions might need the whole snapshot
+ * anyway.
+ */
+ SetClusterCurrentXids(snapshot->subxip, snapshot->subxcnt);
- heap_insert(rel, tup, GetCurrentCommandId(true), HEAP_INSERT_NO_LOGICAL, NULL);
+ /*
+ * Write the tuple into the new heap.
+ *
+ * The snapshot is the one we used to decode the insert (though converted
+ * to "non-historic" MVCC snapshot), i.e. the snapshot's curcid is the
+ * tuple CID incremented by one (due to the "new CID" WAL record that got
+ * written along with the INSERT record). Thus if we want to use the
+ * original CID, we need to subtract 1 from curcid.
+ */
+ Assert(snapshot->curcid != InvalidCommandId &&
+ snapshot->curcid > FirstCommandId);
+
+ heap_insert(rel, tup, change->xid, snapshot->curcid - 1,
+ HEAP_INSERT_NO_LOGICAL, NULL);
/*
* Update indexes.
@@ -3245,6 +3297,7 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
* In case functions in the index need the active snapshot and caller
* hasn't set one.
*/
+ PushActiveSnapshot(snapshot);
ExecStoreHeapTuple(tup, index_slot, false);
recheck = ExecInsertIndexTuples(iistate->rri,
index_slot,
@@ -3255,6 +3308,8 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
NIL, /* arbiterIndexes */
false /* onlySummarizing */
);
+ PopActiveSnapshot();
+ ResetClusterCurrentXids();
/*
* If recheck is required, it must have been performed on the source
@@ -3272,18 +3327,36 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
TupleTableSlot *index_slot)
{
List *recheck;
+ LockTupleMode lockmode;
TU_UpdateIndexes update_indexes;
+ TM_Result res;
+ Snapshot snapshot = change->snapshot;
+ TM_FailureData tmfd;
/*
* Write the new tuple into the new heap. ('tup' gets the TID assigned
* here.)
+ *
+ * Regarding CID, see the comment in apply_concurrent_insert().
*/
- simple_heap_update(rel, &tup_target->t_self, tup, &update_indexes);
+ Assert(snapshot->curcid != InvalidCommandId &&
+ snapshot->curcid > FirstCommandId);
+
+ res = heap_update(rel, &tup_target->t_self, tup,
+ change->xid, snapshot->curcid - 1,
+ InvalidSnapshot,
+ false, /* no wait - only we are doing changes */
+ &tmfd, &lockmode, &update_indexes,
+ /* wal_logical */
+ false);
+ if (res != TM_Ok)
+ ereport(ERROR, (errmsg("failed to apply concurrent UPDATE")));
ExecStoreHeapTuple(tup, index_slot, false);
if (update_indexes != TU_None)
{
+ PushActiveSnapshot(snapshot);
recheck = ExecInsertIndexTuples(iistate->rri,
index_slot,
iistate->estate,
@@ -3293,6 +3366,7 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
NIL, /* arbiterIndexes */
/* onlySummarizing */
update_indexes == TU_Summarizing);
+ PopActiveSnapshot();
list_free(recheck);
}
@@ -3303,7 +3377,22 @@ static void
apply_concurrent_delete(Relation rel, HeapTuple tup_target,
ConcurrentChange *change)
{
- simple_heap_delete(rel, &tup_target->t_self);
+ TM_Result res;
+ TM_FailureData tmfd;
+ Snapshot snapshot = change->snapshot;
+
+ /* Regarding CID, see the comment in apply_concurrent_insert(). */
+ Assert(snapshot->curcid != InvalidCommandId &&
+ snapshot->curcid > FirstCommandId);
+
+ res = heap_delete(rel, &tup_target->t_self, change->xid,
+ snapshot->curcid - 1, InvalidSnapshot, false,
+ &tmfd, false,
+ /* wal_logical */
+ false);
+
+ if (res != TM_Ok)
+ ereport(ERROR, (errmsg("failed to apply concurrent DELETE")));
pgstat_progress_incr_param(PROGRESS_CLUSTER_HEAP_TUPLES_DELETED, 1);
}
@@ -3321,7 +3410,7 @@ apply_concurrent_delete(Relation rel, HeapTuple tup_target,
*/
static HeapTuple
find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
- IndexInsertState *iistate,
+ Snapshot snapshot, IndexInsertState *iistate,
TupleTableSlot *ident_slot, IndexScanDesc *scan_p)
{
IndexScanDesc scan;
@@ -3329,7 +3418,7 @@ find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
int2vector *ident_indkey;
HeapTuple result = NULL;
- scan = index_beginscan(rel, iistate->ident_index, GetActiveSnapshot(),
+ scan = index_beginscan(rel, iistate->ident_index, snapshot,
nkeys, 0);
*scan_p = scan;
index_rescan(scan, key, nkeys, NULL, 0);
@@ -3401,6 +3490,8 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
}
PG_FINALLY();
{
+ ResetClusterCurrentXids();
+
if (rel_src)
rel_dst->rd_toastoid = InvalidOid;
}
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 06a9d4a61f..140b063a6c 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -469,9 +469,18 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
SnapBuild *builder = ctx->snapshot_builder;
/*
- * Check if CLUSTER CONCURRENTLY is being performed by this backend. If
- * so, only decode data changes of the table that it is processing, and
- * the changes of its TOAST relation.
+ * If the change is not intended for logical decoding, do not even
+ * establish a transaction for it. This is particularly important if the
+ * record was generated by CLUSTER CONCURRENTLY because this command uses
+ * the original XID when doing changes in the new storage. The decoding
+ * subsystem probably does not expect to see the same transaction multiple
+ * times.
+ */
+
+ /*
+ * First, check if CLUSTER CONCURRENTLY is being performed by this
+ * backend. If so, only decode data changes of the table that it is
+ * processing, and the changes of its TOAST relation.
*
* (TOAST locator should not be set unless the main is.)
*/
@@ -491,6 +500,60 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
return;
}
+ /*
+ * Second, skip records which do not contain sufficient information for
+ * the decoding.
+ *
+ * The backend executing CLUSTER CONCURRENTLY should not return here
+ * because the records which passed the checks above should be eligible
+ * for decoding. However, CLUSTER CONCURRENTLY generates WAL when
+ * writing data into the new table, which should not be decoded by the
+ * other backends. This is where the other backends skip them.
+ */
+ switch (info)
+ {
+ case XLOG_HEAP_INSERT:
+ {
+ xl_heap_insert *rec;
+
+ rec = (xl_heap_insert *) XLogRecGetData(buf->record);
+ /*
+ * (Besides insertions into the main heap by CLUSTER CONCURRENTLY,
+ * this does happen when raw_heap_insert marks the TOAST record as
+ * HEAP_INSERT_NO_LOGICAL.)
+ */
+ if ((rec->flags & XLH_INSERT_CONTAINS_NEW_TUPLE) == 0)
+ return;
+
+ break;
+ }
+
+ case XLOG_HEAP_HOT_UPDATE:
+ case XLOG_HEAP_UPDATE:
+ {
+ xl_heap_update *rec;
+
+ rec = (xl_heap_update *) XLogRecGetData(buf->record);
+ if ((rec->flags &
+ (XLH_UPDATE_CONTAINS_NEW_TUPLE |
+ XLH_UPDATE_CONTAINS_OLD_TUPLE |
+ XLH_UPDATE_CONTAINS_OLD_KEY)) == 0)
+ return;
+
+ break;
+ }
+
+ case XLOG_HEAP_DELETE:
+ {
+ xl_heap_delete *rec;
+
+ rec = (xl_heap_delete *) XLogRecGetData(buf->record);
+ if (rec->flags & XLH_DELETE_NO_LOGICAL)
+ return;
+ break;
+ }
+ }
+
ReorderBufferProcessXid(ctx->reorder, xid, buf->origptr);
/*
@@ -923,13 +986,6 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
xlrec = (xl_heap_insert *) XLogRecGetData(r);
- /*
- * Ignore insert records without new tuples (this does happen when
- * raw_heap_insert marks the TOAST record as HEAP_INSERT_NO_LOGICAL).
- */
- if (!(xlrec->flags & XLH_INSERT_CONTAINS_NEW_TUPLE))
- return;
-
/* only interested in our database */
XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
if (target_locator.dbOid != ctx->slot->data.database)
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 61a57053c7..09ce0db562 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -155,7 +155,7 @@ static bool ExportInProgress = false;
static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
/* snapshot building/manipulation/distribution functions */
-static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
+static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder, XLogRecPtr lsn);
static void SnapBuildFreeSnapshot(Snapshot snap);
@@ -352,12 +352,17 @@ SnapBuildSnapDecRefcount(Snapshot snap)
* Build a new snapshot, based on currently committed catalog-modifying
* transactions.
*
+ * 'lsn' is the location of the commit record (of a catalog-changing
+ * transaction) that triggered creation of the snapshot. Pass
+ * InvalidXLogRecPtr for the transaction base snapshot or if it the user of
+ * the snapshot should not need the LSN.
+ *
* In-progress transactions with catalog access are *not* allowed to modify
* these snapshots; they have to copy them and fill in appropriate ->curcid
* and ->subxip/subxcnt values.
*/
static Snapshot
-SnapBuildBuildSnapshot(SnapBuild *builder)
+SnapBuildBuildSnapshot(SnapBuild *builder, XLogRecPtr lsn)
{
Snapshot snapshot;
Size ssize;
@@ -425,6 +430,7 @@ SnapBuildBuildSnapshot(SnapBuild *builder)
snapshot->active_count = 0;
snapshot->regd_count = 0;
snapshot->snapXactCompletionCount = 0;
+ snapshot->lsn = lsn;
return snapshot;
}
@@ -461,7 +467,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
if (TransactionIdIsValid(MyProc->xmin))
elog(ERROR, "cannot build an initial slot snapshot when MyProc->xmin already is valid");
- snap = SnapBuildBuildSnapshot(builder);
+ snap = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
/*
* We know that snap->xmin is alive, enforced by the logical xmin
@@ -502,7 +508,7 @@ SnapBuildInitialSnapshotForCluster(SnapBuild *builder)
Assert(builder->state == SNAPBUILD_CONSISTENT);
- snap = SnapBuildBuildSnapshot(builder);
+ snap = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
return SnapBuildMVCCFromHistoric(snap, false);
}
@@ -636,7 +642,7 @@ SnapBuildGetOrBuildSnapshot(SnapBuild *builder)
/* only build a new snapshot if we don't have a prebuilt one */
if (builder->snapshot == NULL)
{
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
/* increase refcount for the snapshot builder */
SnapBuildSnapIncRefcount(builder->snapshot);
}
@@ -716,7 +722,7 @@ SnapBuildProcessChange(SnapBuild *builder, TransactionId xid, XLogRecPtr lsn)
/* only build a new snapshot if we don't have a prebuilt one */
if (builder->snapshot == NULL)
{
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, lsn);
/* increase refcount for the snapshot builder */
SnapBuildSnapIncRefcount(builder->snapshot);
}
@@ -1085,7 +1091,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
if (builder->snapshot)
SnapBuildSnapDecRefcount(builder->snapshot);
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, lsn);
/* we might need to execute invalidations, add snapshot */
if (!ReorderBufferXidHasBaseSnapshot(builder->reorder, xid))
@@ -1910,7 +1916,7 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
{
SnapBuildSnapDecRefcount(builder->snapshot);
}
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
SnapBuildSnapIncRefcount(builder->snapshot);
ReorderBufferSetRestartPoint(builder->reorder, lsn);
diff --git a/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c b/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
index 43f7b34297..8e915c55fb 100644
--- a/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
+++ b/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
@@ -33,7 +33,8 @@ static void plugin_truncate(struct LogicalDecodingContext *ctx,
Relation relations[],
ReorderBufferChange *change);
static void store_change(LogicalDecodingContext *ctx,
- ConcurrentChangeKind kind, HeapTuple tuple);
+ ConcurrentChangeKind kind, HeapTuple tuple,
+ TransactionId xid);
void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
@@ -101,6 +102,7 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change)
{
ClusterDecodingState *dstate;
+ Snapshot snapshot;
dstate = (ClusterDecodingState *) ctx->output_writer_private;
@@ -108,6 +110,48 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (relation->rd_id != dstate->relid)
return;
+ /*
+ * Catalog snapshot is fine because the table we are processing is
+ * temporarily considered a user catalog table.
+ */
+ snapshot = GetCatalogSnapshot(InvalidOid);
+ Assert(snapshot->snapshot_type == SNAPSHOT_HISTORIC_MVCC);
+ Assert(!snapshot->suboverflowed);
+
+ /*
+ * This should not happen, but if we don't have enough information to
+ * apply a new snapshot, the consequences would be bad. Thus prefer ERROR
+ * to Assert().
+ */
+ if (XLogRecPtrIsInvalid(snapshot->lsn))
+ ereport(ERROR, (errmsg("snapshot has invalid LSN")));
+
+ /*
+ * reorderbuffer.c changes the catalog snapshot as soon as it sees a new
+ * CID or a commit record of a catalog-changing transaction.
+ */
+ if (dstate->snapshot == NULL || snapshot->lsn != dstate->snapshot_lsn ||
+ snapshot->curcid != dstate->snapshot->curcid)
+ {
+ /* CID should not go backwards. */
+ Assert(dstate->snapshot == NULL ||
+ snapshot->curcid >= dstate->snapshot->curcid ||
+ change->txn->xid != dstate->last_change_xid);
+
+ /*
+ * XXX Is it a problem that the copy is created in
+ * TopTransactionContext?
+ *
+ * XXX Wouldn't it be o.k. for SnapBuildMVCCFromHistoric() to set xcnt
+ * to 0 instead of converting xip in this case? The point is that
+ * transactions which are still in progress from the perspective of
+ * reorderbuffer.c could not be replayed yet, so we do not need to
+ * examine their XIDs.
+ */
+ dstate->snapshot = SnapBuildMVCCFromHistoric(snapshot, false);
+ dstate->snapshot_lsn = snapshot->lsn;
+ }
+
/* Decode entry depending on its type */
switch (change->action)
{
@@ -125,7 +169,7 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (newtuple == NULL)
elog(ERROR, "Incomplete insert info.");
- store_change(ctx, CHANGE_INSERT, newtuple);
+ store_change(ctx, CHANGE_INSERT, newtuple, change->txn->xid);
}
break;
case REORDER_BUFFER_CHANGE_UPDATE:
@@ -142,9 +186,11 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
elog(ERROR, "Incomplete update info.");
if (oldtuple != NULL)
- store_change(ctx, CHANGE_UPDATE_OLD, oldtuple);
+ store_change(ctx, CHANGE_UPDATE_OLD, oldtuple,
+ change->txn->xid);
- store_change(ctx, CHANGE_UPDATE_NEW, newtuple);
+ store_change(ctx, CHANGE_UPDATE_NEW, newtuple,
+ change->txn->xid);
}
break;
case REORDER_BUFFER_CHANGE_DELETE:
@@ -157,7 +203,7 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (oldtuple == NULL)
elog(ERROR, "Incomplete delete info.");
- store_change(ctx, CHANGE_DELETE, oldtuple);
+ store_change(ctx, CHANGE_DELETE, oldtuple, change->txn->xid);
}
break;
default:
@@ -191,13 +237,13 @@ plugin_truncate(struct LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (i == nrelations)
return;
- store_change(ctx, CHANGE_TRUNCATE, NULL);
+ store_change(ctx, CHANGE_TRUNCATE, NULL, InvalidTransactionId);
}
/* Store concurrent data change. */
static void
store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
- HeapTuple tuple)
+ HeapTuple tuple, TransactionId xid)
{
ClusterDecodingState *dstate;
char *change_raw;
@@ -265,6 +311,11 @@ store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
dst = dst_start + SizeOfConcurrentChange;
memcpy(dst, tuple->t_data, tuple->t_len);
+ /* Initialize the other fields. */
+ change.xid = xid;
+ change.snapshot = dstate->snapshot;
+ dstate->snapshot->active_count++;
+
/* The data has been copied. */
if (flattened)
pfree(tuple);
@@ -278,6 +329,9 @@ store:
isnull[0] = false;
tuplestore_putvalues(dstate->tstore, dstate->tupdesc_change,
values, isnull);
+#ifdef USE_ASSERT_CHECKING
+ dstate->last_change_xid = xid;
+#endif
/* Accounting. */
dstate->nchanges++;
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index e4a32fc391..4f8bb5677f 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -317,21 +317,24 @@ extern BulkInsertState GetBulkInsertState(void);
extern void FreeBulkInsertState(BulkInsertState);
extern void ReleaseBulkInsertStatePin(BulkInsertState bistate);
-extern void heap_insert(Relation relation, HeapTuple tup, CommandId cid,
- int options, BulkInsertState bistate);
+extern void heap_insert(Relation relation, HeapTuple tup, TransactionId xid,
+ CommandId cid, int options, BulkInsertState bistate);
extern void heap_multi_insert(Relation relation, struct TupleTableSlot **slots,
int ntuples, CommandId cid, int options,
BulkInsertState bistate);
extern TM_Result heap_delete(Relation relation, ItemPointer tid,
- CommandId cid, Snapshot crosscheck, bool wait,
- struct TM_FailureData *tmfd, bool changingPart);
+ TransactionId xid, CommandId cid,
+ Snapshot crosscheck, bool wait,
+ struct TM_FailureData *tmfd, bool changingPart,
+ bool wal_logical);
extern void heap_finish_speculative(Relation relation, ItemPointer tid);
extern void heap_abort_speculative(Relation relation, ItemPointer tid);
extern TM_Result heap_update(Relation relation, ItemPointer otid,
- HeapTuple newtup,
+ HeapTuple newtup, TransactionId xid,
CommandId cid, Snapshot crosscheck, bool wait,
struct TM_FailureData *tmfd, LockTupleMode *lockmode,
- TU_UpdateIndexes *update_indexes);
+ TU_UpdateIndexes *update_indexes,
+ bool wal_logical);
extern TM_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
CommandId cid, LockTupleMode mode, LockWaitPolicy wait_policy,
bool follow_updates,
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 4591e9a918..90eea6dcd8 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -104,6 +104,8 @@
#define XLH_DELETE_CONTAINS_OLD_KEY (1<<2)
#define XLH_DELETE_IS_SUPER (1<<3)
#define XLH_DELETE_IS_PARTITION_MOVE (1<<4)
+/* See heap_delete() */
+#define XLH_DELETE_NO_LOGICAL (1<<5)
/* convenience macro for checking whether any form of old tuple was logged */
#define XLH_DELETE_CONTAINS_OLD \
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index fb64d7413a..2f9be7afaa 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -482,6 +482,8 @@ extern Size EstimateTransactionStateSpace(void);
extern void SerializeTransactionState(Size maxsize, char *start_address);
extern void StartParallelWorkerTransaction(char *tstatespace);
extern void EndParallelWorkerTransaction(void);
+extern void SetClusterCurrentXids(TransactionId *xip, int xcnt);
+extern void ResetClusterCurrentXids(void);
extern bool IsTransactionBlock(void);
extern bool IsTransactionOrTransactionBlock(void);
extern char TransactionBlockStatusCode(void);
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index c0f2cdabf0..e64b21c862 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -58,6 +58,14 @@ typedef struct ConcurrentChange
/* See the enum above. */
ConcurrentChangeKind kind;
+ /* Transaction that changes the data. */
+ TransactionId xid;
+
+ /*
+ * Historic catalog snapshot that was used to decode this change.
+ */
+ Snapshot snapshot;
+
/*
* The actual tuple.
*
@@ -89,6 +97,8 @@ typedef struct ClusterDecodingState
* tuplestore does this transparently.
*/
Tuplestorestate *tstore;
+ /* XID of the last change added to tstore. */
+ TransactionId last_change_xid PG_USED_FOR_ASSERTS_ONLY;
/* The current number of changes in tstore. */
double nchanges;
@@ -109,6 +119,14 @@ typedef struct ClusterDecodingState
/* Slot to retrieve data from tstore. */
TupleTableSlot *tsslot;
+ /*
+ * Historic catalog snapshot that was used to decode the most recent
+ * change.
+ */
+ Snapshot snapshot;
+ /* LSN of the snapshot above. */
+ XLogRecPtr snapshot_lsn;
+
ResourceOwner resowner;
} ClusterDecodingState;
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 7d3ba38f2c..01d7ca8420 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -13,6 +13,7 @@
#ifndef SNAPSHOT_H
#define SNAPSHOT_H
+#include "access/xlogdefs.h"
#include "lib/pairingheap.h"
@@ -201,6 +202,8 @@ typedef struct SnapshotData
uint32 regd_count; /* refcount on RegisteredSnapshots */
pairingheap_node ph_node; /* link in the RegisteredSnapshots heap */
+ XLogRecPtr lsn; /* position in the WAL stream when taken */
+
/*
* The transaction completion count at the time GetSnapshotData() built
* this snapshot. Allows to avoid re-computing static snapshots when no
--
2.45.2
Attachment: v06-0006-Add-regression-tests.patch (text/x-diff)
From a6dab84df8d619a8beec9d742dc97b7f32e189c3 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Wed, 11 Dec 2024 19:22:42 +0100
Subject: [PATCH 6/8] Add regression tests.
As this patch series adds the CONCURRENTLY option to the VACUUM FULL and
CLUSTER commands, it's appropriate to test that the "concurrent data changes"
(i.e. changes done by applications while we are copying the table contents to
the new storage) are processed correctly.
Injection points are used to stop the data copying at some point. While the
backend in charge of the copying is waiting on the injection point, another
backend runs some INSERT, UPDATE and DELETE commands on the table. Then we
wake up the first backend and let the CLUSTER (CONCURRENTLY) command
finish. Finally we check that all the "concurrent data changes" are present in
the table and that they contain the correct visibility information.
---
src/backend/commands/cluster.c | 7 +
src/test/modules/injection_points/Makefile | 3 +-
.../injection_points/expected/cluster.out | 113 ++++++++++++++
.../modules/injection_points/logical.conf | 1 +
src/test/modules/injection_points/meson.build | 3 +
.../injection_points/specs/cluster.spec | 140 ++++++++++++++++++
6 files changed, 266 insertions(+), 1 deletion(-)
create mode 100644 src/test/modules/injection_points/expected/cluster.out
create mode 100644 src/test/modules/injection_points/logical.conf
create mode 100644 src/test/modules/injection_points/specs/cluster.spec
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 1cb3201c5a..422ae200aa 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -58,6 +58,7 @@
#include "utils/acl.h"
#include "utils/fmgroids.h"
#include "utils/guc.h"
+#include "utils/injection_point.h"
#include "utils/inval.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
@@ -3713,6 +3714,12 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
*/
ident_key = build_identity_key(ident_idx_new, OldHeap, &ident_key_nentries);
+ /*
+ * During testing, wait for another backend to perform concurrent data
+ * changes which we will process below.
+ */
+ INJECTION_POINT("cluster-concurrently-before-lock");
+
/*
* Flush all WAL records inserted so far (possibly except for the last
* incomplete page, see GetInsertRecPtr), to minimize the amount of data
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 0753a9df58..e40ebec1bb 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -13,7 +13,8 @@ PGFILEDESC = "injection_points - facility for injection points"
REGRESS = injection_points reindex_conc
REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
-ISOLATION = basic inplace
+ISOLATION = basic inplace cluster
+ISOLATION_OPTS = --temp-config $(top_srcdir)/src/test/modules/injection_points/logical.conf
TAP_TESTS = 1
diff --git a/src/test/modules/injection_points/expected/cluster.out b/src/test/modules/injection_points/expected/cluster.out
new file mode 100644
index 0000000000..d84fff3693
--- /dev/null
+++ b/src/test/modules/injection_points/expected/cluster.out
@@ -0,0 +1,113 @@
+Parsed test spec with 2 sessions
+
+starting permutation: wait_before_lock change_existing change_new change_subxact1 change_subxact2 check2 wakeup_before_lock check1
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step wait_before_lock:
+ CLUSTER (CONCURRENTLY) clstr_test USING clstr_test_pkey;
+ <waiting ...>
+step change_existing:
+ UPDATE clstr_test SET i=10 where i=1;
+ UPDATE clstr_test SET j=20 where i=2;
+ UPDATE clstr_test SET i=30 where i=3;
+ UPDATE clstr_test SET i=40 where i=30;
+ DELETE FROM clstr_test WHERE i=4;
+
+step change_new:
+ INSERT INTO clstr_test(i, j) VALUES (5, 5), (6, 6), (7, 7), (8, 8);
+ UPDATE clstr_test SET i=50 where i=5;
+ UPDATE clstr_test SET j=60 where i=6;
+ DELETE FROM clstr_test WHERE i=7;
+
+step change_subxact1:
+ BEGIN;
+ INSERT INTO clstr_test(i, j) VALUES (100, 100);
+ SAVEPOINT s1;
+ UPDATE clstr_test SET i=101 where i=100;
+ SAVEPOINT s2;
+ UPDATE clstr_test SET i=102 where i=101;
+ COMMIT;
+
+step change_subxact2:
+ BEGIN;
+ SAVEPOINT s1;
+ INSERT INTO clstr_test(i, j) VALUES (110, 110);
+ ROLLBACK TO SAVEPOINT s1;
+ INSERT INTO clstr_test(i, j) VALUES (110, 111);
+ COMMIT;
+
+step check2:
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='clstr_test';
+
+ SELECT i, j FROM clstr_test ORDER BY i, j;
+
+ INSERT INTO data_s2(_xmin, _cmin, i, j)
+ SELECT xmin, cmin, i, j FROM clstr_test;
+
+ i| j
+---+---
+ 2| 20
+ 6| 60
+ 8| 8
+ 10| 1
+ 40| 3
+ 50| 5
+102|100
+110|111
+(8 rows)
+
+step wakeup_before_lock:
+ SELECT injection_points_wakeup('cluster-concurrently-before-lock');
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step wait_before_lock: <... completed>
+step check1:
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='clstr_test';
+
+ SELECT count(DISTINCT node) FROM relfilenodes;
+
+ SELECT i, j FROM clstr_test ORDER BY i, j;
+
+ INSERT INTO data_s1(_xmin, _cmin, i, j)
+ SELECT xmin, cmin, i, j FROM clstr_test;
+
+ SELECT count(*)
+ FROM data_s1 d1 FULL JOIN data_s2 d2 USING (_xmin, _cmin, i, j)
+ WHERE d1.i ISNULL OR d2.i ISNULL;
+
+count
+-----
+ 2
+(1 row)
+
+ i| j
+---+---
+ 2| 20
+ 6| 60
+ 8| 8
+ 10| 1
+ 40| 3
+ 50| 5
+102|100
+110|111
+(8 rows)
+
+count
+-----
+ 0
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
diff --git a/src/test/modules/injection_points/logical.conf b/src/test/modules/injection_points/logical.conf
new file mode 100644
index 0000000000..c8f264bc6c
--- /dev/null
+++ b/src/test/modules/injection_points/logical.conf
@@ -0,0 +1 @@
+wal_level = logical
\ No newline at end of file
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 58f1900115..fe65ba44ae 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -44,7 +44,10 @@ tests += {
'specs': [
'basic',
'inplace',
+ 'cluster',
],
+ # 'cluster' requires wal_level = 'logical'.
+ 'regress_args': ['--temp-config', files('logical.conf')],
},
'tap': {
'env': {
diff --git a/src/test/modules/injection_points/specs/cluster.spec b/src/test/modules/injection_points/specs/cluster.spec
new file mode 100644
index 0000000000..5f8404c5da
--- /dev/null
+++ b/src/test/modules/injection_points/specs/cluster.spec
@@ -0,0 +1,140 @@
+# Prefix the system columns with underscore as they are not allowed as column
+# names.
+setup
+{
+ CREATE EXTENSION injection_points;
+
+ CREATE TABLE clstr_test(i int PRIMARY KEY, j int);
+ INSERT INTO clstr_test(i, j) VALUES (1, 1), (2, 2), (3, 3), (4, 4);
+
+ CREATE TABLE relfilenodes(node oid);
+
+ CREATE TABLE data_s1(_xmin xid, _cmin cid, i int, j int);
+ CREATE TABLE data_s2(_xmin xid, _cmin cid, i int, j int);
+}
+
+teardown
+{
+ DROP TABLE clstr_test;
+ DROP EXTENSION injection_points;
+
+ DROP TABLE relfilenodes;
+ DROP TABLE data_s1;
+ DROP TABLE data_s2;
+}
+
+session s1
+setup
+{
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('cluster-concurrently-before-lock', 'wait');
+}
+# Perform the initial load and wait for s2 to do some data changes.
+step wait_before_lock
+{
+ CLUSTER (CONCURRENTLY) clstr_test USING clstr_test_pkey;
+}
+# Check the table from the perspective of s1.
+#
+# Besides the contents, we also check that relfilenode has changed.
+#
+# xmin and cmin columns are used to check that we do not change tuple
+# visibility information. Since we do not expect xmin to stay unchanged across
+# test runs, it cannot appear in the output text. Instead, have each session
+# write the contents into a table and use FULL JOIN to check if the outputs
+# are identical.
+step check1
+{
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='clstr_test';
+
+ SELECT count(DISTINCT node) FROM relfilenodes;
+
+ SELECT i, j FROM clstr_test ORDER BY i, j;
+
+ INSERT INTO data_s1(_xmin, _cmin, i, j)
+ SELECT xmin, cmin, i, j FROM clstr_test;
+
+ SELECT count(*)
+ FROM data_s1 d1 FULL JOIN data_s2 d2 USING (_xmin, _cmin, i, j)
+ WHERE d1.i ISNULL OR d2.i ISNULL;
+}
+teardown
+{
+ SELECT injection_points_detach('cluster-concurrently-before-lock');
+}
+
+session s2
+# Change the existing data. UPDATE changes both key and non-key columns. Also
+# update one row twice to test whether tuple version generated by this session
+# can be found.
+step change_existing
+{
+ UPDATE clstr_test SET i=10 where i=1;
+ UPDATE clstr_test SET j=20 where i=2;
+ UPDATE clstr_test SET i=30 where i=3;
+ UPDATE clstr_test SET i=40 where i=30;
+ DELETE FROM clstr_test WHERE i=4;
+}
+# Insert new rows and UPDATE / DELETE some of them. Again, update both key and
+# non-key column.
+step change_new
+{
+ INSERT INTO clstr_test(i, j) VALUES (5, 5), (6, 6), (7, 7), (8, 8);
+ UPDATE clstr_test SET i=50 where i=5;
+ UPDATE clstr_test SET j=60 where i=6;
+ DELETE FROM clstr_test WHERE i=7;
+}
+
+# When applying concurrent data changes, we should see the effects of an
+# in-progress subtransaction.
+step change_subxact1
+{
+ BEGIN;
+ INSERT INTO clstr_test(i, j) VALUES (100, 100);
+ SAVEPOINT s1;
+ UPDATE clstr_test SET i=101 where i=100;
+ SAVEPOINT s2;
+ UPDATE clstr_test SET i=102 where i=101;
+ COMMIT;
+}
+
+# When applying concurrent data changes, we should not see the effects of a
+# rolled back subtransaction.
+step change_subxact2
+{
+ BEGIN;
+ SAVEPOINT s1;
+ INSERT INTO clstr_test(i, j) VALUES (110, 110);
+ ROLLBACK TO SAVEPOINT s1;
+ INSERT INTO clstr_test(i, j) VALUES (110, 111);
+ COMMIT;
+}
+
+# Check the table from the perspective of s2.
+step check2
+{
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='clstr_test';
+
+ SELECT i, j FROM clstr_test ORDER BY i, j;
+
+ INSERT INTO data_s2(_xmin, _cmin, i, j)
+ SELECT xmin, cmin, i, j FROM clstr_test;
+}
+step wakeup_before_lock
+{
+ SELECT injection_points_wakeup('cluster-concurrently-before-lock');
+}
+
+# Test if data changes introduced while one session is performing CLUSTER
+# (CONCURRENTLY) find their way into the table.
+permutation
+ wait_before_lock
+ change_existing
+ change_new
+ change_subxact1
+ change_subxact2
+ check2
+ wakeup_before_lock
+ check1
--
2.45.2
Attachment: v06-0007-Introduce-cluster_max_xlock_time-configuration-varia.patch (text/x-diff)
From ed52599e3f8983d7271a91144e5f4f1c8e26c0bf Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Wed, 11 Dec 2024 19:22:42 +0100
Subject: [PATCH 7/8] Introduce cluster_max_xlock_time configuration variable.
When executing VACUUM FULL / CLUSTER (CONCURRENTLY), we need the
AccessExclusiveLock to swap the relation files, which should take a fairly
short time. However, on a busy system, other backends might change a
non-negligible amount of data in the table while we are waiting for the
lock. Since these changes must be applied to the new storage before the
swap, the time we eventually hold the lock might become non-negligible too.
If users are worried about this situation, they can set
cluster_max_xlock_time to the maximum time for which the exclusive lock may
be held. If this amount of time is not sufficient to complete the VACUUM
FULL / CLUSTER (CONCURRENTLY) command, an ERROR is raised and the command is
canceled.
---
doc/src/sgml/config.sgml | 32 +++++
doc/src/sgml/ref/cluster.sgml | 9 +-
src/backend/access/heap/heapam_handler.c | 3 +-
src/backend/commands/cluster.c | 133 +++++++++++++++---
src/backend/utils/misc/guc_tables.c | 14 ++
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/commands/cluster.h | 5 +-
.../injection_points/expected/cluster.out | 74 +++++++++-
.../injection_points/specs/cluster.spec | 42 ++++++
9 files changed, 293 insertions(+), 20 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e0c8325a39..94d9e7a141 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -10638,6 +10638,38 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
</listitem>
</varlistentry>
+ <varlistentry id="guc-cluster-max-xclock-time" xreflabel="cluster_max_xlock_time">
+ <term><varname>cluster_max_xlock_time</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>cluster_max_xlock_time</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ The maximum amount of time for which the <command>CLUSTER</command>
+ and <command>VACUUM FULL</command> commands with the
+ <literal>CONCURRENTLY</literal> option may hold an exclusive lock on
+ a table. Typically, these commands should not need the lock for a
+ longer time than <command>TRUNCATE</command> does. However,
+ additional time might be needed if the system is very busy. (See
+ <xref linkend="sql-cluster"/> for an explanation of how
+ the <literal>CONCURRENTLY</literal> option works.)
+ </para>
+
+ <para>
+ To restrict the lock time, set this variable to the highest
+ acceptable value. If it turns out during the processing that the
+ lock would have to be held for longer than that, the command is
+ cancelled.
+ </para>
+
+ <para>
+ The default value is 0, which means no time limit: the lock is held
+ until all the concurrent data changes have been processed.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect1>
diff --git a/doc/src/sgml/ref/cluster.sgml b/doc/src/sgml/ref/cluster.sgml
index 526f0c5843..5175a1b8da 100644
--- a/doc/src/sgml/ref/cluster.sgml
+++ b/doc/src/sgml/ref/cluster.sgml
@@ -140,7 +140,14 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
(<xref linkend="logicaldecoding"/>) and applied before
the <literal>ACCESS EXCLUSIVE</literal> lock is requested. Thus the lock
is typically held only for the time needed to swap the files, which
- should be pretty short.
+ should be pretty short. However, the time might still be noticeable if
+ too many data changes have been done to the table while
+ <command>CLUSTER</command> was waiting for the lock: those changes must
+ be processed just before the files are swapped, while the
+ <literal>ACCESS EXCLUSIVE</literal> lock is being held. If you are
+ worried about this situation, set
+ the <link linkend="guc-cluster-max-xclock-time"><varname>cluster_max_xlock_time</varname></link>
+ configuration parameter to a value that your applications can tolerate.
</para>
<para>
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 16525a4669..585e97335c 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1004,7 +1004,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
end_of_wal = GetFlushRecPtr(NULL);
if ((end_of_wal - end_of_wal_prev) > wal_segment_size)
{
- cluster_decode_concurrent_changes(decoding_ctx, end_of_wal);
+ cluster_decode_concurrent_changes(decoding_ctx, end_of_wal,
+ NULL);
end_of_wal_prev = end_of_wal;
}
}
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 422ae200aa..53b0f7f9b3 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -17,6 +17,8 @@
*/
#include "postgres.h"
+#include <sys/time.h>
+
#include "access/amapi.h"
#include "access/heapam.h"
#include "access/multixact.h"
@@ -102,6 +104,15 @@ RelFileLocator clustered_rel_toast_locator = {.relNumber = InvalidOid};
#define CLUSTER_IN_PROGRESS_MESSAGE \
"relation \"%s\" is already being processed by CLUSTER CONCURRENTLY"
+/*
+ * The maximum time to hold AccessExclusiveLock during the final
+ * processing. Note that only the execution time of
+ * process_concurrent_changes() is included here. The very last steps like
+ * swap_relation_files() shouldn't get blocked and it'd be wrong to consider
+ * them a reason to abort otherwise completed processing.
+ */
+int cluster_max_xlock_time = 0;
+
/*
* Everything we need to call ExecInsertIndexTuples().
*/
@@ -188,7 +199,8 @@ static LogicalDecodingContext *setup_logical_decoding(Oid relid,
static HeapTuple get_changed_tuple(char *change);
static void apply_concurrent_changes(ClusterDecodingState *dstate,
Relation rel, ScanKey key, int nkeys,
- IndexInsertState *iistate);
+ IndexInsertState *iistate,
+ struct timeval *must_complete);
static void apply_concurrent_insert(Relation rel, ConcurrentChange *change,
HeapTuple tup, IndexInsertState *iistate,
TupleTableSlot *index_slot);
@@ -205,13 +217,15 @@ static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
IndexInsertState *iistate,
TupleTableSlot *ident_slot,
IndexScanDesc *scan_p);
-static void process_concurrent_changes(LogicalDecodingContext *ctx,
+static bool process_concurrent_changes(LogicalDecodingContext *ctx,
XLogRecPtr end_of_wal,
Relation rel_dst,
Relation rel_src,
ScanKey ident_key,
int ident_key_nentries,
- IndexInsertState *iistate);
+ IndexInsertState *iistate,
+ struct timeval *must_complete);
+static bool processing_time_elapsed(struct timeval *must_complete);
static IndexInsertState *get_index_insert_state(Relation relation,
Oid ident_index_id);
static ScanKey build_identity_key(Oid ident_idx_oid, Relation rel_src,
@@ -3019,7 +3033,8 @@ get_changed_tuple(char *change)
*/
void
cluster_decode_concurrent_changes(LogicalDecodingContext *ctx,
- XLogRecPtr end_of_wal)
+ XLogRecPtr end_of_wal,
+ struct timeval *must_complete)
{
ClusterDecodingState *dstate;
ResourceOwner resowner_old;
@@ -3057,6 +3072,9 @@ cluster_decode_concurrent_changes(LogicalDecodingContext *ctx,
if (record != NULL)
LogicalDecodingProcessRecord(ctx, ctx->reader);
+ if (processing_time_elapsed(must_complete))
+ break;
+
/*
* If WAL segment boundary has been crossed, inform the decoding
* system that the catalog_xmin can advance. (We can confirm more
@@ -3099,7 +3117,8 @@ cluster_decode_concurrent_changes(LogicalDecodingContext *ctx,
*/
static void
apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
- ScanKey key, int nkeys, IndexInsertState *iistate)
+ ScanKey key, int nkeys, IndexInsertState *iistate,
+ struct timeval *must_complete)
{
TupleTableSlot *index_slot, *ident_slot;
HeapTuple tup_old = NULL;
@@ -3129,6 +3148,9 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
CHECK_FOR_INTERRUPTS();
+ Assert(dstate->nchanges > 0);
+ dstate->nchanges--;
+
/* Get the change from the single-column tuple. */
tup_change = ExecFetchSlotHeapTuple(dstate->tsslot, false, &shouldFree);
heap_deform_tuple(tup_change, dstate->tupdesc_change, values, isnull);
@@ -3253,10 +3275,22 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
/* TTSOpsMinimalTuple has .get_heap_tuple==NULL. */
Assert(shouldFree);
pfree(tup_change);
+
+ /*
+ * If there is a limit on the completion time, check it now. However,
+ * make sure the loop does not break if tup_old was set in the previous
+ * iteration: in that case we could not resume the processing in the
+ * next call.
+ */
+ if (must_complete && tup_old == NULL &&
+ processing_time_elapsed(must_complete))
+ /* The next call will process the remaining changes. */
+ break;
}
- tuplestore_clear(dstate->tstore);
- dstate->nchanges = 0;
+ /* If we could not apply all the changes, the next call will do so. */
+ if (dstate->nchanges == 0)
+ tuplestore_clear(dstate->tstore);
/* Cleanup. */
ExecDropSingleTupleTableSlot(index_slot);
@@ -3459,11 +3493,15 @@ find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
* Decode and apply concurrent changes.
*
* Pass rel_src iff its reltoastrelid is needed.
+ *
+ * Returns true if must_complete is NULL, or if we managed to complete by
+ * the time *must_complete indicates.
*/
-static void
+static bool
process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
Relation rel_dst, Relation rel_src, ScanKey ident_key,
- int ident_key_nentries, IndexInsertState *iistate)
+ int ident_key_nentries, IndexInsertState *iistate,
+ struct timeval *must_complete)
{
ClusterDecodingState *dstate;
@@ -3472,10 +3510,19 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
dstate = (ClusterDecodingState *) ctx->output_writer_private;
- cluster_decode_concurrent_changes(ctx, end_of_wal);
+ cluster_decode_concurrent_changes(ctx, end_of_wal, must_complete);
+
+ if (processing_time_elapsed(must_complete))
+ /* Caller is responsible for applying the changes. */
+ return false;
+ /*
+ * *must_complete was not reached, so zero changes means there really
+ * are none. (Otherwise we might see no changes merely because not
+ * enough time was left for the decoding.)
+ */
if (dstate->nchanges == 0)
- return;
+ return true;
PG_TRY();
{
@@ -3487,7 +3534,7 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
rel_dst->rd_toastoid = rel_src->rd_rel->reltoastrelid;
apply_concurrent_changes(dstate, rel_dst, ident_key,
- ident_key_nentries, iistate);
+ ident_key_nentries, iistate, must_complete);
}
PG_FINALLY();
{
@@ -3497,6 +3544,28 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
rel_dst->rd_toastoid = InvalidOid;
}
PG_END_TRY();
+
+ /*
+ * apply_concurrent_changes() does check the processing time, so if some
+ * changes are left, we ran out of time.
+ */
+ return dstate->nchanges == 0;
+}
+
+/*
+ * Check if the current time is beyond *must_complete.
+ */
+static bool
+processing_time_elapsed(struct timeval *must_complete)
+{
+ struct timeval now;
+
+ if (must_complete == NULL)
+ return false;
+
+ gettimeofday(&now, NULL);
+
+ return timercmp(&now, must_complete, >);
}
static IndexInsertState *
@@ -3657,6 +3726,8 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
RelReopenInfo *rri = NULL;
int nrel;
Relation *ind_refs_all, *ind_refs_p;
+ struct timeval t_end;
+ struct timeval *t_end_ptr = NULL;
/* Like in cluster_rel(). */
lockmode_old = ShareUpdateExclusiveLock;
@@ -3736,7 +3807,8 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
*/
process_concurrent_changes(ctx, end_of_wal, NewHeap,
swap_toast_by_content ? OldHeap : NULL,
- ident_key, ident_key_nentries, iistate);
+ ident_key, ident_key_nentries, iistate,
+ NULL);
/*
* Release the locks that allowed concurrent data changes, in order to
@@ -3858,9 +3930,38 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
end_of_wal = GetFlushRecPtr(NULL);
/* Apply the concurrent changes again. */
- process_concurrent_changes(ctx, end_of_wal, NewHeap,
- swap_toast_by_content ? OldHeap : NULL,
- ident_key, ident_key_nentries, iistate);
+ /*
+ * This time we have the exclusive lock on the table, so make sure that
+ * cluster_max_xlock_time is not exceeded.
+ */
+ if (cluster_max_xlock_time > 0)
+ {
+ int64 usec;
+ struct timeval t_start;
+
+ gettimeofday(&t_start, NULL);
+ /* Add the whole seconds. */
+ t_end.tv_sec = t_start.tv_sec + cluster_max_xlock_time / 1000;
+ /* Add the rest, expressed in microseconds. */
+ usec = t_start.tv_usec + 1000 * (cluster_max_xlock_time % 1000);
+ /* The number of microseconds could have overflowed. */
+ t_end.tv_sec += usec / USECS_PER_SEC;
+ t_end.tv_usec = usec % USECS_PER_SEC;
+ t_end_ptr = &t_end;
+ }
+ /*
+ * During testing, stop here to simulate excessive processing time.
+ */
+ INJECTION_POINT("cluster-concurrently-after-lock");
+
+ if (!process_concurrent_changes(ctx, end_of_wal, NewHeap,
+ swap_toast_by_content ? OldHeap : NULL,
+ ident_key, ident_key_nentries, iistate,
+ t_end_ptr))
+ ereport(ERROR,
+ (errmsg("could not process concurrent data changes in time"),
+ errhint("Please consider adjusting \"cluster_max_xlock_time\".")));
+
/* Remember info about rel before closing OldHeap */
relpersistence = OldHeap->rd_rel->relpersistence;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 8cf1afbad2..d78f5d8984 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -39,6 +39,7 @@
#include "catalog/namespace.h"
#include "catalog/storage.h"
#include "commands/async.h"
+#include "commands/cluster.h"
#include "commands/event_trigger.h"
#include "commands/tablespace.h"
#include "commands/trigger.h"
@@ -2791,6 +2792,19 @@ struct config_int ConfigureNamesInt[] =
1600000000, 0, 2100000000,
NULL, NULL, NULL
},
+ {
+ {"cluster_max_xlock_time", PGC_USERSET, LOCK_MANAGEMENT,
gettext_noop("Maximum time for VACUUM FULL / CLUSTER (CONCURRENTLY) to keep the table locked."),
gettext_noop(
"The table is locked in exclusive mode during the final stage of processing. "
"If the lock time exceeds this value, an error is raised and the lock is "
"released. Set to zero if you don't care how long the lock can be held."),
+ GUC_UNIT_MS
+ },
+ &cluster_max_xlock_time,
+ 0, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
/*
* See also CheckRequiredParameterValues() if this parameter changes
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index a2ac7575ca..2f6022134c 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -727,6 +727,7 @@
#vacuum_multixact_freeze_table_age = 150000000
#vacuum_multixact_freeze_min_age = 5000000
#vacuum_multixact_failsafe_age = 1600000000
+#cluster_max_xlock_time = 0
#bytea_output = 'hex' # hex, escape
#xmlbinary = 'base64'
#xmloption = 'content'
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index e64b21c862..89cbb6be59 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -44,6 +44,8 @@ typedef struct ClusterParams
extern RelFileLocator clustered_rel_locator;
extern RelFileLocator clustered_rel_toast_locator;
+extern PGDLLIMPORT int cluster_max_xlock_time;
+
typedef enum
{
CHANGE_INSERT,
@@ -139,7 +141,8 @@ extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
extern void check_relation_is_clusterable_concurrently(Relation rel,
bool is_vacuum);
extern void cluster_decode_concurrent_changes(LogicalDecodingContext *ctx,
- XLogRecPtr end_of_wal);
+ XLogRecPtr end_of_wal,
+ struct timeval *must_complete);
extern Oid make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
char relpersistence, LOCKMODE lockmode);
extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
diff --git a/src/test/modules/injection_points/expected/cluster.out b/src/test/modules/injection_points/expected/cluster.out
index d84fff3693..646e31448f 100644
--- a/src/test/modules/injection_points/expected/cluster.out
+++ b/src/test/modules/injection_points/expected/cluster.out
@@ -1,4 +1,4 @@
-Parsed test spec with 2 sessions
+Parsed test spec with 4 sessions
starting permutation: wait_before_lock change_existing change_new change_subxact1 change_subxact2 check2 wakeup_before_lock check1
injection_points_attach
@@ -111,3 +111,75 @@ injection_points_detach
(1 row)
+injection_points_detach
+-----------------------
+
+(1 row)
+
+
+starting permutation: wait_after_lock wakeup_after_lock
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step wait_after_lock:
+ CLUSTER (CONCURRENTLY) clstr_test USING clstr_test_pkey;
+ <waiting ...>
+step wakeup_after_lock:
+ SELECT injection_points_wakeup('cluster-concurrently-after-lock');
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step wait_after_lock: <... completed>
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+
+starting permutation: wait_after_lock after_lock_delay wakeup_after_lock
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step wait_after_lock:
+ CLUSTER (CONCURRENTLY) clstr_test USING clstr_test_pkey;
+ <waiting ...>
+step after_lock_delay:
+ SELECT pg_sleep(1.5);
+
+pg_sleep
+--------
+
+(1 row)
+
+step wakeup_after_lock:
+ SELECT injection_points_wakeup('cluster-concurrently-after-lock');
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step wait_after_lock: <... completed>
+ERROR: could not process concurrent data changes in time
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
diff --git a/src/test/modules/injection_points/specs/cluster.spec b/src/test/modules/injection_points/specs/cluster.spec
index 5f8404c5da..9af41bac6d 100644
--- a/src/test/modules/injection_points/specs/cluster.spec
+++ b/src/test/modules/injection_points/specs/cluster.spec
@@ -127,6 +127,34 @@ step wakeup_before_lock
SELECT injection_points_wakeup('cluster-concurrently-before-lock');
}
+session s3
+setup
+{
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('cluster-concurrently-after-lock', 'wait');
+ SET cluster_max_xlock_time TO '1s';
+}
+# Perform the initial load, lock the table in exclusive mode and wait. s4 will
+# cancel the waiting.
+step wait_after_lock
+{
+ CLUSTER (CONCURRENTLY) clstr_test USING clstr_test_pkey;
+}
+teardown
+{
+ SELECT injection_points_detach('cluster-concurrently-after-lock');
+}
+
+session s4
+step wakeup_after_lock
+{
+ SELECT injection_points_wakeup('cluster-concurrently-after-lock');
+}
+step after_lock_delay
+{
+ SELECT pg_sleep(1.5);
+}
+
# Test if data changes introduced while one session is performing CLUSTER
# (CONCURRENTLY) find their way into the table.
permutation
@@ -138,3 +166,17 @@ permutation
check2
wakeup_before_lock
check1
+
+# Test the cluster_max_xlock_time configuration variable.
+#
+# First, cancel waiting on the injection point immediately. That way, CLUSTER
+# should complete.
+permutation
+ wait_after_lock
+ wakeup_after_lock
+# Second, cancel the waiting with a delay that violates
+# cluster_max_xlock_time.
+permutation
+ wait_after_lock
+ after_lock_delay
+ wakeup_after_lock
--
2.45.2
Attachment: v06-0008-Call-logical_rewrite_heap_tuple-when-applying-concur.patch (text/x-diff)
From 26772eb30bc1f47613a8e661349943eae677e3a7 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Wed, 11 Dec 2024 19:22:42 +0100
Subject: [PATCH 8/8] Call logical_rewrite_heap_tuple() when applying
concurrent data changes.
This was implemented for the sake of completeness, but I think it's currently
not needed. Possible use cases could be:
1. VACUUM FULL / CLUSTER CONCURRENTLY can process system catalogs.
System catalogs are scanned using a historic snapshot during logical decoding,
and the "combo CIDs" information is needed for that. Since "combo CID" is
associated with the "file locator" and that locator is changed by VACUUM FULL
/ CLUSTER, these commands must record the information on individual tuples
being moved from the old file to the new one. This is what
logical_rewrite_heap_tuple() does.
However, the logical decoding subsystem currently does not support decoding of
data changes in system catalogs. Therefore, the CONCURRENTLY option cannot
be used for system catalogs.
2. VACUUM FULL / CLUSTER CONCURRENTLY is processing a relation, but after it
has temporarily released all its locks (in order to acquire the exclusive
lock), another backend runs VACUUM FULL / CLUSTER CONCURRENTLY on the same
table. Since the
relation is treated as a system catalog while these commands are processing it
(so it can be scanned using a historic snapshot during the "initial load"), it
is important that the 2nd backend does not break decoding of the "combo CIDs"
performed by the 1st backend.
However, it's not practical to let multiple backends run VACUUM FULL / CLUSTER
CONCURRENTLY on the same relation, so we forbid that.
---
src/backend/access/heap/heapam_handler.c | 2 +-
src/backend/access/heap/rewriteheap.c | 65 ++++++-----
src/backend/commands/cluster.c | 110 +++++++++++++++---
src/backend/replication/logical/decode.c | 41 ++++++-
.../pgoutput_cluster/pgoutput_cluster.c | 21 ++--
src/include/access/rewriteheap.h | 5 +-
src/include/commands/cluster.h | 3 +
src/include/replication/reorderbuffer.h | 7 ++
8 files changed, 194 insertions(+), 60 deletions(-)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 585e97335c..c01a6192c2 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -730,7 +730,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
/* Initialize the rewrite operation */
rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, *xid_cutoff,
- *multi_cutoff);
+ *multi_cutoff, true);
/* Set up sorting if wanted */
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 09ef220449..86881e8638 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -214,10 +214,8 @@ static void raw_heap_insert(RewriteState state, HeapTuple tup);
/* internal logical remapping prototypes */
static void logical_begin_heap_rewrite(RewriteState state);
-static void logical_rewrite_heap_tuple(RewriteState state, ItemPointerData old_tid, HeapTuple new_tuple);
static void logical_end_heap_rewrite(RewriteState state);
-
/*
* Begin a rewrite of a table
*
@@ -226,18 +224,19 @@ static void logical_end_heap_rewrite(RewriteState state);
* oldest_xmin xid used by the caller to determine which tuples are dead
* freeze_xid xid before which tuples will be frozen
* cutoff_multi multixact before which multis will be removed
+ * tid_chains need to maintain TID chains?
*
* Returns an opaque RewriteState, allocated in current memory context,
* to be used in subsequent calls to the other functions.
*/
RewriteState
begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xmin,
- TransactionId freeze_xid, MultiXactId cutoff_multi)
+ TransactionId freeze_xid, MultiXactId cutoff_multi,
+ bool tid_chains)
{
RewriteState state;
MemoryContext rw_cxt;
MemoryContext old_cxt;
- HASHCTL hash_ctl;
/*
* To ease cleanup, make a separate context that will contain the
@@ -262,29 +261,34 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
state->rs_cxt = rw_cxt;
state->rs_bulkstate = smgr_bulk_start_rel(new_heap, MAIN_FORKNUM);
- /* Initialize hash tables used to track update chains */
- hash_ctl.keysize = sizeof(TidHashKey);
- hash_ctl.entrysize = sizeof(UnresolvedTupData);
- hash_ctl.hcxt = state->rs_cxt;
-
- state->rs_unresolved_tups =
- hash_create("Rewrite / Unresolved ctids",
- 128, /* arbitrary initial size */
- &hash_ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
- hash_ctl.entrysize = sizeof(OldToNewMappingData);
+ if (tid_chains)
+ {
+ HASHCTL hash_ctl;
+
+ /* Initialize hash tables used to track update chains */
+ hash_ctl.keysize = sizeof(TidHashKey);
+ hash_ctl.entrysize = sizeof(UnresolvedTupData);
+ hash_ctl.hcxt = state->rs_cxt;
+
+ state->rs_unresolved_tups =
+ hash_create("Rewrite / Unresolved ctids",
+ 128, /* arbitrary initial size */
+ &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+ hash_ctl.entrysize = sizeof(OldToNewMappingData);
+
+ state->rs_old_new_tid_map =
+ hash_create("Rewrite / Old to new tid map",
+ 128, /* arbitrary initial size */
+ &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ }
- state->rs_old_new_tid_map =
- hash_create("Rewrite / Old to new tid map",
- 128, /* arbitrary initial size */
- &hash_ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ logical_begin_heap_rewrite(state);
MemoryContextSwitchTo(old_cxt);
- logical_begin_heap_rewrite(state);
-
return state;
}
@@ -303,12 +307,15 @@ end_heap_rewrite(RewriteState state)
* Write any remaining tuples in the UnresolvedTups table. If we have any
* left, they should in fact be dead, but let's err on the safe side.
*/
- hash_seq_init(&seq_status, state->rs_unresolved_tups);
-
- while ((unresolved = hash_seq_search(&seq_status)) != NULL)
+ if (state->rs_unresolved_tups)
{
- ItemPointerSetInvalid(&unresolved->tuple->t_data->t_ctid);
- raw_heap_insert(state, unresolved->tuple);
+ hash_seq_init(&seq_status, state->rs_unresolved_tups);
+
+ while ((unresolved = hash_seq_search(&seq_status)) != NULL)
+ {
+ ItemPointerSetInvalid(&unresolved->tuple->t_data->t_ctid);
+ raw_heap_insert(state, unresolved->tuple);
+ }
}
/* Write the last page, if any */
@@ -995,7 +1002,7 @@ logical_rewrite_log_mapping(RewriteState state, TransactionId xid,
* Perform logical remapping for a tuple that's mapped from old_tid to
* new_tuple->t_self by rewrite_heap_tuple() if necessary for the tuple.
*/
-static void
+void
logical_rewrite_heap_tuple(RewriteState state, ItemPointerData old_tid,
HeapTuple new_tuple)
{
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 53b0f7f9b3..59dffc3bdd 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -23,6 +23,7 @@
#include "access/heapam.h"
#include "access/multixact.h"
#include "access/relscan.h"
+#include "access/rewriteheap.h"
#include "access/tableam.h"
#include "access/toast_internals.h"
#include "access/transam.h"
@@ -200,17 +201,21 @@ static HeapTuple get_changed_tuple(char *change);
static void apply_concurrent_changes(ClusterDecodingState *dstate,
Relation rel, ScanKey key, int nkeys,
IndexInsertState *iistate,
- struct timeval *must_complete);
+ struct timeval *must_complete,
+ RewriteState rwstate);
static void apply_concurrent_insert(Relation rel, ConcurrentChange *change,
HeapTuple tup, IndexInsertState *iistate,
- TupleTableSlot *index_slot);
+ TupleTableSlot *index_slot,
+ RewriteState rwstate);
static void apply_concurrent_update(Relation rel, HeapTuple tup,
HeapTuple tup_target,
ConcurrentChange *change,
IndexInsertState *iistate,
- TupleTableSlot *index_slot);
+ TupleTableSlot *index_slot,
+ RewriteState rwstate);
static void apply_concurrent_delete(Relation rel, HeapTuple tup_target,
- ConcurrentChange *change);
+ ConcurrentChange *change,
+ RewriteState rwstate);
static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
HeapTuple tup_key,
Snapshot snapshot,
@@ -224,7 +229,8 @@ static bool process_concurrent_changes(LogicalDecodingContext *ctx,
ScanKey ident_key,
int ident_key_nentries,
IndexInsertState *iistate,
- struct timeval *must_complete);
+ struct timeval *must_complete,
+ RewriteState rwstate);
static bool processing_time_elapsed(struct timeval *must_complete);
static IndexInsertState *get_index_insert_state(Relation relation,
Oid ident_index_id);
@@ -3118,7 +3124,7 @@ cluster_decode_concurrent_changes(LogicalDecodingContext *ctx,
static void
apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
ScanKey key, int nkeys, IndexInsertState *iistate,
- struct timeval *must_complete)
+ struct timeval *must_complete, RewriteState rwstate)
{
TupleTableSlot *index_slot, *ident_slot;
HeapTuple tup_old = NULL;
@@ -3192,7 +3198,8 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
{
Assert(tup_old == NULL);
- apply_concurrent_insert(rel, &change, tup, iistate, index_slot);
+ apply_concurrent_insert(rel, &change, tup, iistate, index_slot,
+ rwstate);
pfree(tup);
}
@@ -3200,7 +3207,7 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
change.kind == CHANGE_DELETE)
{
IndexScanDesc ind_scan = NULL;
- HeapTuple tup_key;
+ HeapTuple tup_key, tup_exist_cp;
if (change.kind == CHANGE_UPDATE_NEW)
{
@@ -3242,11 +3249,23 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
if (tup_exist == NULL)
elog(ERROR, "Failed to find target tuple");
+ /*
+ * Update the mapping for xmax of the old version.
+ *
+ * Use a copy ('tup_exist' can point to a shared buffer) with xmin set
+ * to invalid, because the mapping for xmin should already have been
+ * written on insertion.
+ */
+ tup_exist_cp = heap_copytuple(tup_exist);
+ HeapTupleHeaderSetXmin(tup_exist_cp->t_data, InvalidTransactionId);
+ logical_rewrite_heap_tuple(rwstate, change.old_tid, tup_exist_cp);
+ pfree(tup_exist_cp);
+
if (change.kind == CHANGE_UPDATE_NEW)
apply_concurrent_update(rel, tup, tup_exist, &change, iistate,
- index_slot);
+ index_slot, rwstate);
else
- apply_concurrent_delete(rel, tup_exist, &change);
+ apply_concurrent_delete(rel, tup_exist, &change, rwstate);
ResetClusterCurrentXids();
@@ -3299,9 +3318,12 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
static void
apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
- IndexInsertState *iistate, TupleTableSlot *index_slot)
+ IndexInsertState *iistate, TupleTableSlot *index_slot,
+ RewriteState rwstate)
{
+ HeapTupleHeader tup_hdr = tup->t_data;
Snapshot snapshot = change->snapshot;
+ ItemPointerData old_tid;
List *recheck;
/*
@@ -3311,6 +3333,9 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
*/
SetClusterCurrentXids(snapshot->subxip, snapshot->subxcnt);
+ /* Remember location in the old heap. */
+ ItemPointerCopy(&tup_hdr->t_ctid, &old_tid);
+
/*
* Write the tuple into the new heap.
*
@@ -3326,6 +3351,14 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
heap_insert(rel, tup, change->xid, snapshot->curcid - 1,
HEAP_INSERT_NO_LOGICAL, NULL);
+ /*
+ * Update the mapping for xmin (xmax should be invalid). This is needed
+ * because, during the processing, the table is considered a "user
+ * catalog".
+ */
+ Assert(!TransactionIdIsValid(HeapTupleHeaderGetRawXmax(tup->t_data)));
+ logical_rewrite_heap_tuple(rwstate, old_tid, tup);
+
/*
* Update indexes.
*
@@ -3359,15 +3392,22 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
static void
apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
ConcurrentChange *change, IndexInsertState *iistate,
- TupleTableSlot *index_slot)
+ TupleTableSlot *index_slot, RewriteState rwstate)
{
List *recheck;
LockTupleMode lockmode;
TU_UpdateIndexes update_indexes;
+ ItemPointerData tid_new_old_heap, tid_old_new_heap;
TM_Result res;
Snapshot snapshot = change->snapshot;
TM_FailureData tmfd;
+ /* Location of the new tuple in the old heap. */
+ ItemPointerCopy(&tup->t_data->t_ctid, &tid_new_old_heap);
+
+ /* Location of the existing tuple in the new heap. */
+ ItemPointerCopy(&tup_target->t_self, &tid_old_new_heap);
+
/*
* Write the new tuple into the new heap. ('tup' gets the TID assigned
* here.)
@@ -3377,7 +3417,7 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
Assert(snapshot->curcid != InvalidCommandId &&
snapshot->curcid > FirstCommandId);
- res = heap_update(rel, &tup_target->t_self, tup,
+ res = heap_update(rel, &tid_old_new_heap, tup,
change->xid, snapshot->curcid - 1,
InvalidSnapshot,
false, /* no wait - only we are doing changes */
@@ -3387,6 +3427,10 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
if (res != TM_Ok)
ereport(ERROR, (errmsg("failed to apply concurrent UPDATE")));
+ /* Update the mapping for xmin of the new version. */
+ Assert(!TransactionIdIsValid(HeapTupleHeaderGetRawXmax(tup->t_data)));
+ logical_rewrite_heap_tuple(rwstate, tid_new_old_heap, tup);
+
ExecStoreHeapTuple(tup, index_slot, false);
if (update_indexes != TU_None)
@@ -3410,8 +3454,9 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
static void
apply_concurrent_delete(Relation rel, HeapTuple tup_target,
- ConcurrentChange *change)
+ ConcurrentChange *change, RewriteState rwstate)
{
+ ItemPointerData tid_old_new_heap;
TM_Result res;
TM_FailureData tmfd;
Snapshot snapshot = change->snapshot;
@@ -3420,7 +3465,10 @@ apply_concurrent_delete(Relation rel, HeapTuple tup_target,
Assert(snapshot->curcid != InvalidCommandId &&
snapshot->curcid > FirstCommandId);
- res = heap_delete(rel, &tup_target->t_self, change->xid,
+ /* Location of the existing tuple in the new heap. */
+ ItemPointerCopy(&tup_target->t_self, &tid_old_new_heap);
+
+ res = heap_delete(rel, &tid_old_new_heap, change->xid,
snapshot->curcid - 1, InvalidSnapshot, false,
&tmfd, false,
/* wal_logical */
@@ -3501,7 +3549,8 @@ static bool
process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
Relation rel_dst, Relation rel_src, ScanKey ident_key,
int ident_key_nentries, IndexInsertState *iistate,
- struct timeval *must_complete)
+ struct timeval *must_complete,
+ RewriteState rwstate)
{
ClusterDecodingState *dstate;
@@ -3534,7 +3583,8 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
rel_dst->rd_toastoid = rel_src->rd_rel->reltoastrelid;
apply_concurrent_changes(dstate, rel_dst, ident_key,
- ident_key_nentries, iistate, must_complete);
+ ident_key_nentries, iistate, must_complete,
+ rwstate);
}
PG_FINALLY();
{
@@ -3719,6 +3769,7 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
bool is_system_catalog;
Oid ident_idx_old, ident_idx_new;
IndexInsertState *iistate;
+ RewriteState rwstate;
ScanKey ident_key;
int ident_key_nentries;
XLogRecPtr wal_insert_ptr, end_of_wal;
@@ -3804,11 +3855,26 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
* Apply concurrent changes first time, to minimize the time we need to
* hold AccessExclusiveLock. (Quite some amount of WAL could have been
* written during the data copying and index creation.)
+ *
+ * Now we are processing individual tuples, so pass false for
+ * 'tid_chains'. Since rwstate is now only needed for
+ * logical_begin_heap_rewrite(), none of the transaction IDs needs to be
+ * valid.
*/
+ rwstate = begin_heap_rewrite(OldHeap, NewHeap,
+ InvalidTransactionId,
+ InvalidTransactionId,
+ InvalidTransactionId,
+ false);
process_concurrent_changes(ctx, end_of_wal, NewHeap,
swap_toast_by_content ? OldHeap : NULL,
ident_key, ident_key_nentries, iistate,
- NULL);
+ NULL, rwstate);
+ /*
+ * OldHeap will be closed, so we need to initialize rwstate again for the
+ * next call of process_concurrent_changes().
+ */
+ end_heap_rewrite(rwstate);
/*
* Release the locks that allowed concurrent data changes, in order to
@@ -3930,6 +3996,11 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
end_of_wal = GetFlushRecPtr(NULL);
/* Apply the concurrent changes again. */
+ rwstate = begin_heap_rewrite(OldHeap, NewHeap,
+ InvalidTransactionId,
+ InvalidTransactionId,
+ InvalidTransactionId,
+ false);
/*
* This time we have the exclusive lock on the table, so make sure that
* cluster_max_xlock_time is not exceeded.
@@ -3957,11 +4028,12 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
if (!process_concurrent_changes(ctx, end_of_wal, NewHeap,
swap_toast_by_content ? OldHeap : NULL,
ident_key, ident_key_nentries, iistate,
- t_end_ptr))
+ t_end_ptr, rwstate))
ereport(ERROR,
(errmsg("could not process concurrent data changes in time"),
errhint("Please consider adjusting \"cluster_max_xlock_time\".")));
+ end_heap_rewrite(rwstate);
/* Remember info about rel before closing OldHeap */
relpersistence = OldHeap->rd_rel->relpersistence;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 140b063a6c..5e3f85fe78 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -983,11 +983,13 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
xl_heap_insert *xlrec;
ReorderBufferChange *change;
RelFileLocator target_locator;
+ BlockNumber blknum;
+ HeapTupleHeader tuphdr;
xlrec = (xl_heap_insert *) XLogRecGetData(r);
/* only interested in our database */
- XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
+ XLogRecGetBlockTag(r, 0, &target_locator, NULL, &blknum);
if (target_locator.dbOid != ctx->slot->data.database)
return;
@@ -1012,6 +1014,13 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
DecodeXLogTuple(tupledata, datalen, change->data.tp.newtuple);
+ /*
+ * CTID is needed for logical_rewrite_heap_tuple(), when doing CLUSTER
+ * CONCURRENTLY.
+ */
+ tuphdr = change->data.tp.newtuple->t_data;
+ ItemPointerSet(&tuphdr->t_ctid, blknum, xlrec->offnum);
+
change->data.tp.clear_toast_afterwards = true;
ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
@@ -1033,11 +1042,14 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
ReorderBufferChange *change;
char *data;
RelFileLocator target_locator;
+ BlockNumber old_blknum, new_blknum;
xlrec = (xl_heap_update *) XLogRecGetData(r);
+ /* Retrieve blknum, so that we can compose CTID below. */
+ XLogRecGetBlockTag(r, 0, &target_locator, NULL, &new_blknum);
+
/* only interested in our database */
- XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
if (target_locator.dbOid != ctx->slot->data.database)
return;
@@ -1054,6 +1066,7 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
{
Size datalen;
Size tuplelen;
+ HeapTupleHeader tuphdr;
data = XLogRecGetBlockData(r, 0, &datalen);
@@ -1063,6 +1076,13 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
ReorderBufferGetTupleBuf(ctx->reorder, tuplelen);
DecodeXLogTuple(data, datalen, change->data.tp.newtuple);
+
+ /*
+ * CTID is needed for logical_rewrite_heap_tuple(), when doing CLUSTER
+ * CONCURRENTLY.
+ */
+ tuphdr = change->data.tp.newtuple->t_data;
+ ItemPointerSet(&tuphdr->t_ctid, new_blknum, xlrec->new_offnum);
}
if (xlrec->flags & XLH_UPDATE_CONTAINS_OLD)
@@ -1081,6 +1101,14 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
DecodeXLogTuple(data, datalen, change->data.tp.oldtuple);
}
+ /*
+ * Remember the old tuple CTID, for the sake of
+ * logical_rewrite_heap_tuple().
+ */
+ if (!XLogRecGetBlockTagExtended(r, 1, NULL, NULL, &old_blknum, NULL))
+ old_blknum = new_blknum;
+ ItemPointerSet(&change->data.tp.old_tid, old_blknum, xlrec->old_offnum);
+
change->data.tp.clear_toast_afterwards = true;
ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
@@ -1099,11 +1127,12 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
xl_heap_delete *xlrec;
ReorderBufferChange *change;
RelFileLocator target_locator;
+ BlockNumber blknum;
xlrec = (xl_heap_delete *) XLogRecGetData(r);
/* only interested in our database */
- XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
+ XLogRecGetBlockTag(r, 0, &target_locator, NULL, &blknum);
if (target_locator.dbOid != ctx->slot->data.database)
return;
@@ -1135,6 +1164,12 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
DecodeXLogTuple((char *) xlrec + SizeOfHeapDelete,
datalen, change->data.tp.oldtuple);
+
+ /*
+ * CTID is needed for logical_rewrite_heap_tuple(), when doing CLUSTER
+ * CONCURRENTLY.
+ */
+ ItemPointerSet(&change->data.tp.old_tid, blknum, xlrec->offnum);
}
change->data.tp.clear_toast_afterwards = true;
diff --git a/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c b/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
index 8e915c55fb..f153d1b128 100644
--- a/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
+++ b/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
@@ -34,7 +34,7 @@ static void plugin_truncate(struct LogicalDecodingContext *ctx,
ReorderBufferChange *change);
static void store_change(LogicalDecodingContext *ctx,
ConcurrentChangeKind kind, HeapTuple tuple,
- TransactionId xid);
+ TransactionId xid, ItemPointer old_tid);
void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
@@ -169,7 +169,8 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (newtuple == NULL)
elog(ERROR, "Incomplete insert info.");
- store_change(ctx, CHANGE_INSERT, newtuple, change->txn->xid);
+ store_change(ctx, CHANGE_INSERT, newtuple, change->txn->xid,
+ NULL);
}
break;
case REORDER_BUFFER_CHANGE_UPDATE:
@@ -187,10 +188,10 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (oldtuple != NULL)
store_change(ctx, CHANGE_UPDATE_OLD, oldtuple,
- change->txn->xid);
+ change->txn->xid, NULL);
store_change(ctx, CHANGE_UPDATE_NEW, newtuple,
- change->txn->xid);
+ change->txn->xid, &change->data.tp.old_tid);
}
break;
case REORDER_BUFFER_CHANGE_DELETE:
@@ -203,7 +204,8 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (oldtuple == NULL)
elog(ERROR, "Incomplete delete info.");
- store_change(ctx, CHANGE_DELETE, oldtuple, change->txn->xid);
+ store_change(ctx, CHANGE_DELETE, oldtuple, change->txn->xid,
+ &change->data.tp.old_tid);
}
break;
default:
@@ -237,13 +239,13 @@ plugin_truncate(struct LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (i == nrelations)
return;
- store_change(ctx, CHANGE_TRUNCATE, NULL, InvalidTransactionId);
+ store_change(ctx, CHANGE_TRUNCATE, NULL, InvalidTransactionId, NULL);
}
/* Store concurrent data change. */
static void
store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
- HeapTuple tuple, TransactionId xid)
+ HeapTuple tuple, TransactionId xid, ItemPointer old_tid)
{
ClusterDecodingState *dstate;
char *change_raw;
@@ -316,6 +318,11 @@ store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
change.snapshot = dstate->snapshot;
dstate->snapshot->active_count++;
+ if (old_tid)
+ ItemPointerCopy(old_tid, &change.old_tid);
+ else
+ ItemPointerSetInvalid(&change.old_tid);
+
/* The data has been copied. */
if (flattened)
pfree(tuple);
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 5866a26bdd..de62b6abf8 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -23,11 +23,14 @@ typedef struct RewriteStateData *RewriteState;
extern RewriteState begin_heap_rewrite(Relation old_heap, Relation new_heap,
TransactionId oldest_xmin, TransactionId freeze_xid,
- MultiXactId cutoff_multi);
+ MultiXactId cutoff_multi, bool tid_chains);
extern void end_heap_rewrite(RewriteState state);
extern void rewrite_heap_tuple(RewriteState state, HeapTuple old_tuple,
HeapTuple new_tuple);
extern bool rewrite_heap_dead_tuple(RewriteState state, HeapTuple old_tuple);
+extern void logical_rewrite_heap_tuple(RewriteState state,
+ ItemPointerData old_tid,
+ HeapTuple new_tuple);
/*
* On-Disk data format for an individual logical rewrite mapping.
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 89cbb6be59..6844405d25 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -63,6 +63,9 @@ typedef struct ConcurrentChange
/* Transaction that changes the data. */
TransactionId xid;
+ /* For UPDATE / DELETE, the location of the old tuple version. */
+ ItemPointerData old_tid;
+
/*
* Historic catalog snapshot that was used to decode this change.
*/
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 3bc365a7b0..81dc80596e 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -104,6 +104,13 @@ typedef struct ReorderBufferChange
HeapTuple oldtuple;
/* valid for INSERT || UPDATE */
HeapTuple newtuple;
+
+ /*
+ * CLUSTER CONCURRENTLY needs the old TID, even if the old tuple
+ * itself is not WAL-logged (i.e. when the identity key does not
+ * change).
+ */
+ ItemPointerData old_tid;
} tp;
/*
--
2.45.2
On 2024-Dec-11, Antonin Houska wrote:
Oh, it was too messy. I think I was thinking of too many things at once (such
as locking the old heap, the new heap and the new heap's TOAST). Also, one
thing that might have contributed to the confusion is that make_new_heap() has
the 'lockmode' argument, which receives various values from various
callers. However, both the new heap and its TOAST relation are eventually
created by heap_create_with_catalog(), and this function always leaves the new
relation locked in AccessExclusiveMode. Maybe this needs some refactoring.

Therefore I reverted the changes around make_new_heap() and simply pass NoLock
for lockmode in cluster.c
Cool, thanks, I have pushed this. I made some additional minor changes,
nothing earth-shattering.
Meanwhile the patch 0004 has some seemingly trivial conflicts. If you
want to rebase, I'd appreciate that. In the meantime I'll give a look
at the next two other API changes.
I'm not happy with the idea of having this new command be VACUUM (FULL
CONCURRENTLY). It's a bit of an absurd name if you ask me. Heck, even
VACUUM (FULL) seems a bit absurd nowadays.
Maybe we should have a new toplevel command. Some ideas that have been
thrown around:
- RETABLE (it's like REINDEX, but for tables)
- ALTER TABLE <tab> SQUEEZE
- SQUEEZE <table>
- VACUUM (SQUEEZE)
- VACUUM (COMPACT)
- MAINTAIN <tab> COMPACT
- MAINTAIN <tab> SQUEEZE
--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
Hi
On Thu, Jan 9, 2025 at 14:35, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
On 2024-Dec-11, Antonin Houska wrote:
Oh, it was too messy. I think I was thinking of too many things at once
(such
as locking the old heap, the new heap and the new heap's TOAST). Also,
one
thing that might have contributed to the confusion is that
make_new_heap() has
the 'lockmode' argument, which receives various values from various
callers. However, both the new heap and its TOAST relation are eventually
created by heap_create_with_catalog(), and this function always leaves the new
relation locked in AccessExclusiveMode. Maybe this needs some
refactoring.
Therefore I reverted the changes around make_new_heap() and simply pass
NoLock
for lockmode in cluster.c
Cool, thanks, I have pushed this. I made some additional minor changes,
nothing earth-shattering.

Meanwhile the patch 0004 has some seemingly trivial conflicts. If you
want to rebase, I'd appreciate that. In the meantime I'll give a look
at the next two other API changes.

I'm not happy with the idea of having this new command be VACUUM (FULL
CONCURRENTLY). It's a bit of an absurd name if you ask me. Heck, even
VACUUM (FULL) seems a bit absurd nowadays.
Although it can sound absurd, it makes perfect sense to me: both "FULL"
and "CONCURRENTLY" are terms that have been in use for years.
Maybe we can introduce a synonym like COMPACT for FULL.
I don't see a strong benefit in introducing a new command (with almost
identical functionality) just because the words sound strange. But I
understand the concern.
There is an inconsistency in the behaviour of the VACUUM command: lazy
VACUUM is concurrent by default, but VACUUM FULL is not, so I understand
the request to introduce a new command. But is it really necessary?
Introducing synonyms for FULL, like COMPACT or SQUEEZE, could be enough
and could work well. That would be better than introducing a synonym for
the VACUUM command itself.
Regards
Pavel
Maybe we should have a new toplevel command. Some ideas that have been
thrown around:
- RETABLE (it's like REINDEX, but for tables)
- ALTER TABLE <tab> SQUEEZE
- SQUEEZE <table>
- VACUUM (SQUEEZE)
- VACUUM (COMPACT)
- MAINTAIN <tab> COMPACT
- MAINTAIN <tab> SQUEEZE
--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
On 2024-Dec-11, Antonin Houska wrote:
Oh, it was too messy. I think I was thinking of too many things at once (such
as locking the old heap, the new heap and the new heap's TOAST). Also, one
thing that might have contributed to the confusion is that make_new_heap() has
the 'lockmode' argument, which receives various values from various
callers. However, both the new heap and its TOAST relation are eventually
created by heap_create_with_catalog(), and this function always leaves the new
relation locked in AccessExclusiveMode. Maybe this needs some refactoring.

Therefore I reverted the changes around make_new_heap() and simply pass NoLock
for lockmode in cluster.c

Cool, thanks, I have pushed this. I made some additional minor changes,
nothing earth-shattering.
It seems you accidentally fixed another problem :-) I was referring to the
'lockmode' argument of make_new_heap(). I can try to write a patch for that
but ...
Meanwhile the patch 0004 has some seemingly trivial conflicts. If you
want to rebase, I'd appreciate that. In the meantime I'll give a look
at the next two other API changes.
... I can apply v06 even though I do have the commit ebd8fc7e47 in my working
tree. (And the CF bot does not complain (yet?).) Have you removed the
'lockmode' argument also from make_new_heap() and forgotten to push it? This
change would probably cause a conflict with v06.
I'm not happy with the idea of having this new command be VACUUM (FULL
CONCURRENTLY). It's a bit of an absurd name if you ask me. Heck, even
VACUUM (FULL) seems a bit absurd nowadays.Maybe we should have a new toplevel command. Some ideas that have been
thrown around:- RETABLE (it's like REINDEX, but for tables)
- ALTER TABLE <tab> SQUEEZE
- SQUEEZE <table>
- VACUUM (SQUEEZE)
- VACUUM (COMPACT)
- MAINTAIN <tab> COMPACT
- MAINTAIN <tab> SQUEEZE
I recall that DB2 has a REORG command, which can also do clustering [1].
Regardless of the name of the new command, should it also handle the
non-concurrent cases? In that case we'd probably need to mark CLUSTER and
VACUUM (FULL) as deprecated.
[1]: https://www.ibm.com/docs/en/db2/12.1?topic=commands-reorg-table
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
On 2025-Jan-09, Antonin Houska wrote:
It seems you accidentally fixed another problem :-) I was referring to the
'lockmode' argument of make_new_heap(). I can try to write a patch for that
but ...

Meanwhile the patch 0004 has some seemingly trivial conflicts. If you
want to rebase, I'd appreciate that. In the meantime I'll give a look
at the next two other API changes.

... I can apply v06 even though I do have the commit ebd8fc7e47 in my working
tree. (And the CF bot does not complain (yet?).) Have you removed the
'lockmode' argument also from make_new_heap() and forgot to push it? This
change would probably cause a conflict with v06.
Hmm, I'm not sure what you mean. My changes are just comment updates,
plus I added asserts that relations are locked to match what the
comments say (this made me touch matview.c, including removal of a
pointless lock acquisition there); also in cluster_rel I moved one
initialization to a different place. A diff between your patch and what
I committed is below. But neither of those changes causes the conflict
to apply 0002,3,4 from v06 to master; rather the conflict comes from
older commits to cluster.c. I'm really surprised that you see no
conflicts. Maybe I'm doing something wrong. (I tried "git am", "git
apply -3" and "patch -p1" and they all result in a few rejects.)
Maybe we should have a new toplevel command. Some ideas that have been
thrown around:
- RETABLE (it's like REINDEX, but for tables)
- ALTER TABLE <tab> SQUEEZE
- SQUEEZE <table>
- VACUUM (SQUEEZE)
- VACUUM (COMPACT)
- MAINTAIN <tab> COMPACT
- MAINTAIN <tab> SQUEEZE

I recall that DB2 has a REORG command, which can also do clustering [1]
That's an idea, but ... man, is that ugly!
Regardless of the name of the new command, should it also handle the
non-concurrent cases? In that case we'd probably need to mark CLUSTER and
VACUUM (FULL) as deprecated.
Hmm, I don't feel a need to change CLUSTER (since it does one pretty
well defined thing), but I would be okay adding a command that does both
things (concurrent and non-concurrent, chosen as per the options specified),
and redefine VACUUM FULL as "an obsolete alias for New Shiny Command".
I'm not strongly opposed to also aliasing CLUSTER, note, I just think
it's not really necessary.
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 4c0d6b80d2a..99193f5c886 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -316,7 +316,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
int save_nestlevel;
bool verbose = ((params->options & CLUOPT_VERBOSE) != 0);
bool recheck = ((params->options & CLUOPT_RECHECK) != 0);
- Relation index = NULL;
+ Relation index;
Assert(CheckRelationLockedByMe(OldHeap, AccessExclusiveLock, false));
@@ -434,10 +434,13 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
/* Check heap and index are valid to cluster on */
if (OidIsValid(indexOid))
{
+ /* verify the index is good and lock it */
check_index_is_clusterable(OldHeap, indexOid, AccessExclusiveLock);
- /* Open the index (It should already be locked.) */
+ /* also open it */
index = index_open(indexOid, NoLock);
}
+ else
+ index = NULL;
/*
* Quietly ignore the request if this is a materialized view which has not
@@ -615,12 +618,12 @@ mark_index_clustered(Relation rel, Oid indexOid, bool is_internal)
/*
* rebuild_relation: rebuild an existing relation in index or physical order
*
- * OldHeap: table to rebuild --- must be opened and exclusive-locked!
- * index: index to cluster by, or NULL to rewrite in physical order. Must be
- * opened and locked.
+ * OldHeap: table to rebuild.
+ * index: index to cluster by, or NULL to rewrite in physical order.
*
- * On exit, the heap (and also the index, if one was passed) are closed, but
- * still locked with AccessExclusiveLock.
+ * On entry, heap and index (if one is given) must be open, and
+ * AccessExclusiveLock held on them.
+ * On exit, they are closed, but locks on them are not released.
*/
static void
rebuild_relation(Relation OldHeap, Relation index, bool verbose)
@@ -636,6 +639,9 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose)
TransactionId frozenXid;
MultiXactId cutoffMulti;
+ Assert(CheckRelationLockedByMe(OldHeap, AccessExclusiveLock, false) &&
+ (index == NULL || CheckRelationLockedByMe(index, AccessExclusiveLock, false)));
+
if (index)
/* Mark the correct index as clustered */
mark_index_clustered(OldHeap, RelationGetRelid(index), true);
@@ -647,18 +653,15 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose)
/*
* Create the transient table that will receive the re-ordered data.
*
- * NoLock for the old heap because we already have it locked and want to
- * keep unlocking straightforward. The new heap (and its TOAST if one
- * exists) will be locked in AccessExclusiveMode on return. Since others
- * can't see it yet, we do not care.
+ * OldHeap is already locked, so no need to lock it again. make_new_heap
+ * obtains AccessExclusiveLock on the new heap and its toast table.
*/
OIDNewHeap = make_new_heap(tableOid, tableSpace,
accessMethod,
relpersistence,
NoLock);
+ Assert(CheckRelationOidLockedByMe(OIDNewHeap, AccessExclusiveLock, false));
NewHeap = table_open(OIDNewHeap, NoLock);
- /* NewHeap already locked by make_new_heap */
- Assert(CheckRelationLockedByMe(NewHeap, AccessExclusiveLock, false));
/* Copy the heap data into the new table in the desired order */
copy_table_data(NewHeap, OldHeap, index, verbose,
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 4b3d4822872..c12817091ed 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -319,7 +319,7 @@ RefreshMatViewByOid(Oid matviewOid, bool is_create, bool skipData,
OIDNewHeap = make_new_heap(matviewOid, tableSpace,
matviewRel->rd_rel->relam,
relpersistence, ExclusiveLock);
- LockRelationOid(OIDNewHeap, AccessExclusiveLock);
+ Assert(CheckRelationOidLockedByMe(OIDNewHeap, AccessExclusiveLock, false));
/* Generate the data, if wanted. */
if (!skipData)
--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"You can only live once, but if you do it well, once is enough"
Pavel Stehule <pavel.stehule@gmail.com> wrote:
Hi
On Thu, Jan 9, 2025 at 14:35, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
On 2024-Dec-11, Antonin Houska wrote:
Oh, it was too messy. I think I was thinking of too many things at once (such
as locking the old heap, the new heap and the new heap's TOAST). Also, one
thing that might have contributed to the confusion is that make_new_heap() has
the 'lockmode' argument, which receives various values from various
callers. However, both the new heap and its TOAST relation are eventually
created by heap_create_with_catalog(), and this function always leaves the new
relation locked in AccessExclusiveMode. Maybe this needs some refactoring.

Therefore I reverted the changes around make_new_heap() and simply pass NoLock
for lockmode in cluster.c

Cool, thanks, I have pushed this. I made some additional minor changes,
nothing earth-shattering.

Meanwhile the patch 0004 has some seemingly trivial conflicts. If you
want to rebase, I'd appreciate that. In the meantime I'll give a look
at the next two other API changes.

I'm not happy with the idea of having this new command be VACUUM (FULL
CONCURRENTLY). It's a bit of an absurd name if you ask me. Heck, even
VACUUM (FULL) seems a bit absurd nowadays.

Although it can sound absurd, it makes perfect sense to me: both "FULL" and "CONCURRENTLY" are terms that have been in use for years.
Maybe we can introduce a synonym like COMPACT for FULL.
Yes, at first glance, FULL might indicate to users that it processes the whole
table; however, VACUUM does that regardless of this option. COMPACT would be
more accurate because it would convey that, besides removing dead tuples,
unused space is reclaimed as well.
However I'm not sure if the FULL option should have been added to VACUUM at
all. Note that, internally, it uses a completely different approach to the
problem of garbage collection. As a consequence, there are several options
which are not compatible with the FULL option: PARALLEL,
DISABLE_PAGE_SKIPPING, BUFFER_USAGE_LIMIT, and maybe some more.
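To illustrate the incompatibility (a sketch; `mytable` is a placeholder, and exact error wording can differ across server versions):

```sql
-- a lazy VACUUM accepts these options:
VACUUM (PARALLEL 2, BUFFER_USAGE_LIMIT '256kB') mytable;

-- but combining them with FULL is rejected, since VACUUM FULL is a
-- table rewrite (CLUSTER under the hood), not an in-place cleanup:
VACUUM (FULL, PARALLEL 2) mytable;
-- ERROR:  VACUUM FULL cannot be performed in parallel
```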
Thus I understand Alvaro's objections against VACUUM (FULL, CONCURRENTLY).
I don't see a strong benefit for introducing a new command (with almost all
identical functionality) just because the words sound strange.
If we turn the FULL option into an alias for the new command, and remove that
after "some time", then there is no identical functionality anymore.
The new functionality overlaps with CLUSTER, except that it works
CONCURRENTLY. However, invoking the new functionality via CLUSTER
(CONCURRENTLY) is not a complete solution because it's also usable w/o
ordering. That's why a new command makes sense to me.
After all, the new code aims primarily at bloat removal rather than at
ordering. Note that it only orders the existing rows, but does not even try to
order the rows inserted into the table while the data is being copied to the
new file.
Therefore I can imagine adding a new command that acts like VACUUM (FULL,
CONCURRENTLY), but does not try to be CLUSTER (CONCURRENTLY).
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
On Fri, Jan 10, 2025 at 06:31, Antonin Houska <ah@cybertec.at> wrote:
Thus I understand Alvaro's objections against VACUUM (FULL, CONCURRENTLY).
Therefore I can imagine adding a new command that acts like VACUUM (FULL,
CONCURRENTLY), but does not try to be CLUSTER (CONCURRENTLY).
If VACUUM FULL and CLUSTER do the same thing, why not have a single command?
--VACUUM FULL would be
RECREATE TABLE [ CONCURRENTLY ] table_name
--CLUSTER would be
RECREATE TABLE [ CONCURRENTLY ] table_name CLUSTERED [ON index_name]
--Maybe someday reordering fields would be
RECREATE TABLE [ CONCURRENTLY ] table_name CLUSTERED [ON index_name] [USING
FIELDS (FLD4,FLD5,FLD3,FLD1,FLD2)]
regards
Marcos
On 2025-01-09 Th 8:35 AM, Alvaro Herrera wrote:
I'm not happy with the idea of having this new command be VACUUM (FULL
CONCURRENTLY). It's a bit of an absurd name if you ask me. Heck, even
VACUUM (FULL) seems a bit absurd nowadays.

Maybe we should have a new toplevel command. Some ideas that have been
thrown around:
- RETABLE (it's like REINDEX, but for tables)
- ALTER TABLE <tab> SQUEEZE
- SQUEEZE <table>
- VACUUM (SQUEEZE)
- VACUUM (COMPACT)
- MAINTAIN <tab> COMPACT
- MAINTAIN <tab> SQUEEZE
My $0.02:
COMPACT tablename ...
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
- RETABLE (it's like REINDEX, but for tables)
- ALTER TABLE <tab> SQUEEZE
- SQUEEZE <table>
- VACUUM (SQUEEZE)
- VACUUM (COMPACT)
- MAINTAIN <tab> COMPACT
- MAINTAIN <tab> SQUEEZE

My $0.02:
COMPACT tablename ...
+1 to "COMPACT tablename"
Also, the current pg_stat_progress_cluster view tracks
progress for both CLUSTER and VACUUM FULL, while pg_stat_progress_vacuum
tracks progress for VACUUM. I suggest we create a new progress
view for VACUUM FULL (or the new command agreed upon).
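For reference, the shared view distinguishes the two commands via its `command` column; a sketch of monitoring either operation today (column list abbreviated, as of recent releases):

```sql
SELECT pid, relid::regclass, command, phase,
       heap_tuples_scanned, heap_tuples_written
FROM pg_stat_progress_cluster;
-- command shows either 'CLUSTER' or 'VACUUM FULL'
```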
Regards,
Sami
Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
On 2025-Jan-09, Antonin Houska wrote:
It seems you accidentally fixed another problem :-) I was referring to the
'lockmode' argument of make_new_heap(). I can try to write a patch for that
but ...

Meanwhile the patch 0004 has some seemingly trivial conflicts. If you
want to rebase, I'd appreciate that. In the meantime I'll give a look
at the next two other API changes.
This is the patch series rebased on top of the commit cc811f92ba.
I haven't addressed the problem of a new command yet - for that I'd like to
see some sort of consensus, so that I do not have to do all the related
changes many times.
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Attachments:
v07-0002-Move-progress-related-fields-from-PgBackendStatus-to.patch
From 8fc954ac4f917fa696391a16d75bafeb4f8b2ee0 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Mon, 13 Jan 2025 14:29:54 +0100
Subject: [PATCH 2/8] Move progress related fields from PgBackendStatus to
PgBackendProgress.
VACUUM FULL / CLUSTER CONCURRENTLY will need to save and restore these fields
at some point. This is because plan_cluster_use_sort() has to be called in a
subtransaction (so that it does not leave any additional locks on the table)
and rollback of that subtransaction clears the progress information.
---
src/backend/utils/activity/backend_progress.c | 18 +++++++++---------
src/backend/utils/activity/backend_status.c | 4 ++--
src/backend/utils/adt/pgstatfuncs.c | 6 +++---
src/include/utils/backend_progress.h | 14 ++++++++++++++
src/include/utils/backend_status.h | 14 ++------------
5 files changed, 30 insertions(+), 26 deletions(-)
diff --git a/src/backend/utils/activity/backend_progress.c b/src/backend/utils/activity/backend_progress.c
index 99a8c73bf0..eebc968193 100644
--- a/src/backend/utils/activity/backend_progress.c
+++ b/src/backend/utils/activity/backend_progress.c
@@ -32,9 +32,9 @@ pgstat_progress_start_command(ProgressCommandType cmdtype, Oid relid)
return;
PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
- beentry->st_progress_command = cmdtype;
- beentry->st_progress_command_target = relid;
- MemSet(&beentry->st_progress_param, 0, sizeof(beentry->st_progress_param));
+ beentry->st_progress.command = cmdtype;
+ beentry->st_progress.command_target = relid;
+ MemSet(&beentry->st_progress.param, 0, sizeof(beentry->st_progress.param));
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
@@ -55,7 +55,7 @@ pgstat_progress_update_param(int index, int64 val)
return;
PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
- beentry->st_progress_param[index] = val;
+ beentry->st_progress.param[index] = val;
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
@@ -76,7 +76,7 @@ pgstat_progress_incr_param(int index, int64 incr)
return;
PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
- beentry->st_progress_param[index] += incr;
+ beentry->st_progress.param[index] += incr;
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
@@ -133,7 +133,7 @@ pgstat_progress_update_multi_param(int nparam, const int *index,
{
Assert(index[i] >= 0 && index[i] < PGSTAT_NUM_PROGRESS_PARAM);
- beentry->st_progress_param[index[i]] = val[i];
+ beentry->st_progress.param[index[i]] = val[i];
}
PGSTAT_END_WRITE_ACTIVITY(beentry);
@@ -154,11 +154,11 @@ pgstat_progress_end_command(void)
if (!beentry || !pgstat_track_activities)
return;
- if (beentry->st_progress_command == PROGRESS_COMMAND_INVALID)
+ if (beentry->st_progress.command == PROGRESS_COMMAND_INVALID)
return;
PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
- beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
- beentry->st_progress_command_target = InvalidOid;
+ beentry->st_progress.command = PROGRESS_COMMAND_INVALID;
+ beentry->st_progress.command_target = InvalidOid;
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 731342799a..b49cbf147d 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -376,8 +376,8 @@ pgstat_bestart(void)
#endif
lbeentry.st_state = STATE_UNDEFINED;
- lbeentry.st_progress_command = PROGRESS_COMMAND_INVALID;
- lbeentry.st_progress_command_target = InvalidOid;
+ lbeentry.st_progress.command = PROGRESS_COMMAND_INVALID;
+ lbeentry.st_progress.command_target = InvalidOid;
lbeentry.st_query_id = UINT64CONST(0);
/*
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 5f8d20a406..25887f06e7 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -269,7 +269,7 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
* Report values for only those backends which are running the given
* command.
*/
- if (beentry->st_progress_command != cmdtype)
+ if (beentry->st_progress.command != cmdtype)
continue;
/* Value available to all callers */
@@ -279,9 +279,9 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
/* show rest of the values including relid only to role members */
if (HAS_PGSTAT_PERMISSIONS(beentry->st_userid))
{
- values[2] = ObjectIdGetDatum(beentry->st_progress_command_target);
+ values[2] = ObjectIdGetDatum(beentry->st_progress.command_target);
for (i = 0; i < PGSTAT_NUM_PROGRESS_PARAM; i++)
- values[i + 3] = Int64GetDatum(beentry->st_progress_param[i]);
+ values[i + 3] = Int64GetDatum(beentry->st_progress.param[i]);
}
else
{
diff --git a/src/include/utils/backend_progress.h b/src/include/utils/backend_progress.h
index dda813ab40..739629cb21 100644
--- a/src/include/utils/backend_progress.h
+++ b/src/include/utils/backend_progress.h
@@ -30,8 +30,22 @@ typedef enum ProgressCommandType
PROGRESS_COMMAND_COPY,
} ProgressCommandType;
+
#define PGSTAT_NUM_PROGRESS_PARAM 20
+/*
+ * Any command which wishes can advertise that it is running by setting
+ * command, command_target, and param[]. command_target should be the OID of
+ * the relation which the command targets (we assume there's just one, as this
+ * is meant for utility commands), but the meaning of each element in the
+ * param array is command-specific.
+ */
+typedef struct PgBackendProgress
+{
+ ProgressCommandType command;
+ Oid command_target;
+ int64 param[PGSTAT_NUM_PROGRESS_PARAM];
+} PgBackendProgress;
extern void pgstat_progress_start_command(ProgressCommandType cmdtype,
Oid relid);
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index d3d4ff6c5c..a73c76a442 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -155,18 +155,8 @@ typedef struct PgBackendStatus
*/
char *st_activity_raw;
- /*
- * Command progress reporting. Any command which wishes can advertise
- * that it is running by setting st_progress_command,
- * st_progress_command_target, and st_progress_param[].
- * st_progress_command_target should be the OID of the relation which the
- * command targets (we assume there's just one, as this is meant for
- * utility commands), but the meaning of each element in the
- * st_progress_param array is command-specific.
- */
- ProgressCommandType st_progress_command;
- Oid st_progress_command_target;
- int64 st_progress_param[PGSTAT_NUM_PROGRESS_PARAM];
+ /* Command progress reporting. */
+ PgBackendProgress st_progress;
/* query identifier, optionally computed using post_parse_analyze_hook */
uint64 st_query_id;
--
2.45.2
v07-0003-Move-conversion-of-a-historic-to-MVCC-snapshot-to-a-.patch
From aa264f420d1d616f2e29aa0a966b77fa818b2ed1 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Mon, 13 Jan 2025 14:29:54 +0100
Subject: [PATCH 3/8] Move conversion of a "historic" to MVCC snapshot to a
separate function.
The conversion is now handled by SnapBuildMVCCFromHistoric(). The VACUUM FULL
/ CLUSTER will also need it.
---
src/backend/replication/logical/snapbuild.c | 51 +++++++++++++++++----
src/backend/utils/time/snapmgr.c | 3 +-
src/include/replication/snapbuild.h | 1 +
src/include/utils/snapmgr.h | 1 +
4 files changed, 45 insertions(+), 11 deletions(-)
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index bbedd3de31..84bf0503a5 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -440,10 +440,7 @@ Snapshot
SnapBuildInitialSnapshot(SnapBuild *builder)
{
Snapshot snap;
- TransactionId xid;
TransactionId safeXid;
- TransactionId *newxip;
- int newxcnt = 0;
Assert(XactIsoLevel == XACT_REPEATABLE_READ);
Assert(builder->building_full_snapshot);
@@ -485,6 +482,31 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
MyProc->xmin = snap->xmin;
+ /* Convert the historic snapshot to MVCC snapshot. */
+ return SnapBuildMVCCFromHistoric(snap, true);
+}
+
+/*
+ * Turn a historic MVCC snapshot into an ordinary MVCC snapshot.
+ *
+ * Unlike a regular (non-historic) MVCC snapshot, the xip array of this
+ * snapshot contains not only running main transactions, but also their
+ * subtransactions. This difference does has no impact on XidInMVCCSnapshot().
+ *
+ * Pass true for 'in_place' if you don't care about modifying the source
+ * snapshot. If you need a new instance, and one that was allocated as a
+ * single chunk of memory, pass false.
+ */
+Snapshot
+SnapBuildMVCCFromHistoric(Snapshot snapshot, bool in_place)
+{
+ TransactionId xid;
+ TransactionId *oldxip = snapshot->xip;
+ uint32 oldxcnt = snapshot->xcnt;
+ TransactionId *newxip;
+ int newxcnt = 0;
+ Snapshot result;
+
/* allocate in transaction context */
newxip = (TransactionId *)
palloc(sizeof(TransactionId) * GetMaxSnapshotXidCount());
@@ -495,7 +517,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
* classical snapshot by marking all non-committed transactions as
* in-progress. This can be expensive.
*/
- for (xid = snap->xmin; NormalTransactionIdPrecedes(xid, snap->xmax);)
+ for (xid = snapshot->xmin; NormalTransactionIdPrecedes(xid, snapshot->xmax);)
{
void *test;
@@ -503,7 +525,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
* Check whether transaction committed using the decoding snapshot
* meaning of ->xip.
*/
- test = bsearch(&xid, snap->xip, snap->xcnt,
+ test = bsearch(&xid, snapshot->xip, snapshot->xcnt,
sizeof(TransactionId), xidComparator);
if (test == NULL)
@@ -520,11 +542,22 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
}
/* adjust remaining snapshot fields as needed */
- snap->snapshot_type = SNAPSHOT_MVCC;
- snap->xcnt = newxcnt;
- snap->xip = newxip;
+ snapshot->xcnt = newxcnt;
+ snapshot->xip = newxip;
+
+ if (in_place)
+ result = snapshot;
+ else
+ {
+ result = CopySnapshot(snapshot);
+
+ /* Restore the original values so the source is intact. */
+ snapshot->xip = oldxip;
+ snapshot->xcnt = oldxcnt;
+ }
+ result->snapshot_type = SNAPSHOT_MVCC;
- return snap;
+ return result;
}
/*
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 8f1508b1ee..42bded373b 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -153,7 +153,6 @@ typedef struct ExportedSnapshot
static List *exportedSnapshots = NIL;
/* Prototypes for local functions */
-static Snapshot CopySnapshot(Snapshot snapshot);
static void UnregisterSnapshotNoOwner(Snapshot snapshot);
static void FreeSnapshot(Snapshot snapshot);
static void SnapshotResetXmin(void);
@@ -532,7 +531,7 @@ SetTransactionSnapshot(Snapshot sourcesnap, VirtualTransactionId *sourcevxid,
* The copy is palloc'd in TopTransactionContext and has initial refcounts set
* to 0. The returned snapshot has the copied flag set.
*/
-static Snapshot
+Snapshot
CopySnapshot(Snapshot snapshot)
{
Snapshot newsnap;
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 44031dcf6e..6d4d2d1814 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -73,6 +73,7 @@ extern void FreeSnapshotBuilder(SnapBuild *builder);
extern void SnapBuildSnapDecRefcount(Snapshot snap);
extern Snapshot SnapBuildInitialSnapshot(SnapBuild *builder);
+extern Snapshot SnapBuildMVCCFromHistoric(Snapshot snapshot, bool in_place);
extern const char *SnapBuildExportSnapshot(SnapBuild *builder);
extern void SnapBuildClearExportedSnapshot(void);
extern void SnapBuildResetExportedSnapshotState(void);
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index d346be7164..147b190210 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -60,6 +60,7 @@ extern Snapshot GetTransactionSnapshot(void);
extern Snapshot GetLatestSnapshot(void);
extern void SnapshotSetCommandId(CommandId curcid);
+extern Snapshot CopySnapshot(Snapshot snapshot);
extern Snapshot GetCatalogSnapshot(Oid relid);
extern Snapshot GetNonHistoricCatalogSnapshot(Oid relid);
extern void InvalidateCatalogSnapshot(void);
--
2.45.2
Attachment: v07-0004-Add-CONCURRENTLY-option-to-both-VACUUM-FULL-and-CLUS.patch (text/plain)
From bf2ec8c5d753de340140839f1b061044ec4c1149 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Mon, 13 Jan 2025 14:29:54 +0100
Subject: [PATCH 4/8] Add CONCURRENTLY option to both VACUUM FULL and CLUSTER
commands.
Both VACUUM FULL and CLUSTER commands copy the relation data into a new file,
create new indexes and eventually swap the files. To make sure that the old
file does not change during the copying, the relation is locked in an
exclusive mode, which prevents applications from both reading and writing. (To
keep the data consistent, we'd only need to prevent the applications from
writing, but even reading needs to be blocked before we can swap the files -
otherwise some applications could continue using the old file. Since we cannot
get stronger lock without releasing the weaker one first, we acquire the
exclusive lock in the beginning and keep it till the end of the processing.)
This patch introduces an alternative workflow, which only requires the
exclusive lock when the relation (and index) files are being swapped.
(Supposedly, the swapping should be pretty fast.) On the other hand, when we
copy the data to the new file, we allow applications to read from the relation
and even write into it.
First, we scan the relation using a "historic snapshot", and insert all the
tuples satisfying this snapshot into the new file. Note that, before creating
that snapshot, we need to make sure that all the other backends treat the
relation as a system catalog: in particular, they must log information on new
command IDs (CIDs). We achieve that by adding the relation ID into a shared
hash table and waiting until all the transactions currently writing into the
table (i.e. transactions possibly not aware of the new entry) have finished.
Second, logical decoding is used to capture the data changes done by
applications during the copying (i.e. changes that do not satisfy the historic
snapshot mentioned above), and those are applied to the new file before we
acquire the exclusive lock we need to swap the files. (Of course, more data
changes can take place while we are waiting for the lock - these will be
applied to the new file after we have acquired the lock, before we swap the
files.)
While copying the data into the new file, we hold a lock that prevents
applications from changing the relation tuple descriptor (tuples inserted into
the old file must fit into the new file). However, as we have to release that
lock before getting the exclusive one, it's possible that someone adds or
drops a column, or changes the data type of an existing one. Therefore we have
to check the tuple descriptor before we swap the files. If we find out that
the tuple descriptor changed, ERROR is raised and all the changes are rolled
back. Since a lot of effort can be wasted in such a case, the ALTER TABLE
command also tries to check if VACUUM FULL / CLUSTER with the CONCURRENTLY
option is running on the same relation, and raises an ERROR if it is.
Like the existing implementation of both VACUUM FULL and CLUSTER commands, the
variant with the CONCURRENTLY option also requires an extra space for the new
relation and index files (which coexist with the old files for some time). In
addition, the CONCURRENTLY option might introduce a lag in releasing WAL
segments for archiving / recycling. This is due to the decoding of the data
changes done by applications concurrently. However, this lag should not be more
than a single WAL segment.
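The workflow this commit message describes can be condensed into a toy model (invented types and helpers, not PostgreSQL code; a sketch of the control flow only): phase 1 copies everything visible to the initial snapshot into the new table, phase 2 ("catch-up") replays the changes other transactions made during the copy — standing in for logical decoding of WAL — and phase 3 would then swap the files under a short ACCESS EXCLUSIVE lock.

```c
#include <assert.h>

/* Toy model: a "table" is just a small set of integer keys. */
#define MAXROWS 16

typedef struct { int keys[MAXROWS]; int n; } Table;

typedef enum { CH_INSERT, CH_DELETE } ChangeKind;
typedef struct { ChangeKind kind; int key; } Change;

static void
tbl_insert(Table *t, int key)
{
    t->keys[t->n++] = key;
}

static void
tbl_delete(Table *t, int key)
{
    for (int i = 0; i < t->n; i++)
        if (t->keys[i] == key)
        {
            t->keys[i] = t->keys[--t->n];   /* unordered removal */
            return;
        }
}

static int
tbl_contains(const Table *t, int key)
{
    for (int i = 0; i < t->n; i++)
        if (t->keys[i] == key)
            return 1;
    return 0;
}

/*
 * Phase 1: copy the rows visible to the snapshot into the new table.
 * Phase 2: replay the concurrent changes captured while copying.
 * (Phase 3, swapping the files under a short lock, is not modeled.)
 */
static void
cluster_concurrently(const Table *old_visible,
                     const Change *chlog, int nlog,
                     Table *new_table)
{
    *new_table = *old_visible;          /* initial copy */
    for (int i = 0; i < nlog; i++)      /* catch-up */
    {
        if (chlog[i].kind == CH_INSERT)
            tbl_insert(new_table, chlog[i].key);
        else
            tbl_delete(new_table, chlog[i].key);
    }
}
```

For example, if the snapshot sees keys {1, 2, 3} and concurrent transactions insert 4 and delete 2 during the copy, the new table ends up holding {1, 3, 4} after catch-up, matching the old table's logical state at swap time.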
---
doc/src/sgml/monitoring.sgml | 36 +-
doc/src/sgml/ref/cluster.sgml | 116 +-
doc/src/sgml/ref/vacuum.sgml | 22 +-
src/Makefile | 1 +
src/backend/access/heap/heapam.c | 8 +-
src/backend/access/heap/heapam_handler.c | 145 +-
src/backend/access/heap/heapam_visibility.c | 30 +-
src/backend/catalog/index.c | 43 +-
src/backend/catalog/system_views.sql | 17 +-
src/backend/commands/cluster.c | 2585 ++++++++++++++++-
src/backend/commands/matview.c | 2 +-
src/backend/commands/tablecmds.c | 11 +
src/backend/commands/vacuum.c | 126 +-
src/backend/meson.build | 1 +
src/backend/replication/logical/decode.c | 24 +
src/backend/replication/logical/snapbuild.c | 20 +
.../replication/pgoutput_cluster/Makefile | 32 +
.../replication/pgoutput_cluster/meson.build | 18 +
.../pgoutput_cluster/pgoutput_cluster.c | 288 ++
src/backend/storage/ipc/ipci.c | 3 +
src/backend/tcop/utility.c | 11 +
src/backend/utils/activity/backend_progress.c | 16 +
.../utils/activity/wait_event_names.txt | 1 +
src/backend/utils/cache/inval.c | 22 +
src/backend/utils/cache/relcache.c | 5 +
src/backend/utils/time/snapmgr.c | 3 +-
src/bin/psql/tab-complete.in.c | 5 +-
src/include/access/heapam.h | 4 +
src/include/access/tableam.h | 10 +
src/include/catalog/index.h | 3 +
src/include/commands/cluster.h | 94 +-
src/include/commands/progress.h | 17 +-
src/include/commands/vacuum.h | 17 +-
src/include/replication/snapbuild.h | 1 +
src/include/storage/lockdefs.h | 5 +-
src/include/storage/lwlocklist.h | 1 +
src/include/utils/backend_progress.h | 3 +-
src/include/utils/inval.h | 2 +
src/include/utils/rel.h | 7 +-
src/include/utils/snapmgr.h | 2 +
src/test/regress/expected/rules.out | 17 +-
41 files changed, 3572 insertions(+), 202 deletions(-)
create mode 100644 src/backend/replication/pgoutput_cluster/Makefile
create mode 100644 src/backend/replication/pgoutput_cluster/meson.build
create mode 100644 src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index d0d176cc54..985b20a81e 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5727,14 +5727,35 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>heap_tuples_written</structfield> <type>bigint</type>
+ <structfield>heap_tuples_inserted</structfield> <type>bigint</type>
</para>
<para>
- Number of heap tuples written.
+ Number of heap tuples inserted.
This counter only advances when the phase is
<literal>seq scanning heap</literal>,
- <literal>index scanning heap</literal>
- or <literal>writing new heap</literal>.
+ <literal>index scanning heap</literal>,
+ <literal>writing new heap</literal>
+ or <literal>catch-up</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_updated</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples updated.
+ This counter only advances when the phase is <literal>catch-up</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_deleted</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples deleted.
+ This counter only advances when the phase is <literal>catch-up</literal>.
</para></entry>
</row>
@@ -5815,6 +5836,13 @@ FROM pg_stat_get_backend_idset() AS backendid;
<command>CLUSTER</command> is currently writing the new heap.
</entry>
</row>
+ <row>
+ <entry><literal>catch-up</literal></entry>
+ <entry>
+ <command>CLUSTER</command> is currently processing the DML commands
+ that other transactions executed during any of the preceding phases.
+ </entry>
+ </row>
<row>
<entry><literal>swapping relation files</literal></entry>
<entry>
diff --git a/doc/src/sgml/ref/cluster.sgml b/doc/src/sgml/ref/cluster.sgml
index 8811f169ea..356b40e3fe 100644
--- a/doc/src/sgml/ref/cluster.sgml
+++ b/doc/src/sgml/ref/cluster.sgml
@@ -26,6 +26,7 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
<phrase>where <replaceable class="parameter">option</replaceable> can be one of:</phrase>
VERBOSE [ <replaceable class="parameter">boolean</replaceable> ]
+ CONCURRENTLY [ <replaceable class="parameter">boolean</replaceable> ]
</synopsis>
</refsynopsisdiv>
@@ -69,14 +70,17 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
<replaceable class="parameter">table_name</replaceable> reclusters all the
previously-clustered tables in the current database that the calling user
has privileges for. This form of <command>CLUSTER</command> cannot be
- executed inside a transaction block.
+ executed inside a transaction block. Also, this form is not allowed if
+ the <literal>CONCURRENTLY</literal> option is used.
</para>
<para>
- When a table is being clustered, an <literal>ACCESS
- EXCLUSIVE</literal> lock is acquired on it. This prevents any other
- database operations (both reads and writes) from operating on the
- table until the <command>CLUSTER</command> is finished.
+ When a table is being clustered, an <literal>ACCESS EXCLUSIVE</literal>
+ lock is acquired on it. This prevents any other database operations (both
+ reads and writes) from operating on the table until
+ the <command>CLUSTER</command> is finished. If you want to keep the table
+ accessible during the clustering, consider using
+ the <literal>CONCURRENTLY</literal> option.
</para>
</refsect1>
@@ -112,6 +116,108 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><literal>CONCURRENTLY</literal></term>
+ <listitem>
+ <para>
+ Allow other transactions to use the table while it is being clustered.
+ </para>
+
+ <para>
+ Internally, <command>CLUSTER</command> copies the contents of the table
+ (ignoring dead tuples) into a new file, sorted by the specified index,
+ and also creates a new file for each index. Then it swaps the old and
+ new files for the table and all the indexes, and deletes the old
+ files. The <literal>ACCESS EXCLUSIVE</literal> lock is needed to make
+ sure that the old files do not change during the processing because the
+ changes would get lost due to the swap.
+ </para>
+
+ <para>
+ With the <literal>CONCURRENTLY</literal> option, the <literal>ACCESS
+ EXCLUSIVE</literal> lock is only acquired to swap the table and index
+ files. The data changes that took place during the creation of the new
+ table and index files are captured using logical decoding
+ (<xref linkend="logicaldecoding"/>) and applied before
+ the <literal>ACCESS EXCLUSIVE</literal> lock is requested. Thus the lock
+ is typically held only for the time needed to swap the files, which
+ should be pretty short.
+ </para>
+
+ <para>
+ Note that <command>CLUSTER</command> with
+ the <literal>CONCURRENTLY</literal> option does not try to order the
+ rows inserted into the table after the clustering started. Also
+ note <command>CLUSTER</command> might fail to complete due to DDL
+ commands executed on the table by other transactions during the
+ clustering.
+ </para>
+
+ <note>
+ <para>
+ In addition to the temporary space requirements explained below,
+ the <literal>CONCURRENTLY</literal> option can increase the use of
+ temporary space somewhat. The reason is that other transactions can
+ perform DML operations which cannot be applied to the new file until
+ <command>CLUSTER</command> has copied all the tuples from the old
+ file. Thus the tuples inserted into the old file during the copying are
+ also stored separately in a temporary file, so they can eventually
+ be applied to the new file.
+ </para>
+
+ <para>
+ Furthermore, the data changes performed during the copying are
+ extracted from <link linkend="wal">write-ahead log</link> (WAL), and
+ this extraction (decoding) only takes place when a certain amount of WAL
+ has been written. Therefore, WAL removal can be delayed by this
+ threshold. Currently the threshold is equal to the value of
+ the <link linkend="guc-wal-segment-size"><varname>wal_segment_size</varname></link>
+ configuration parameter.
+ </para>
+ </note>
+
+ <para>
+ The <literal>CONCURRENTLY</literal> option cannot be used in the
+ following cases:
+
+ <itemizedlist>
+ <listitem>
+ <para>
+ The table is partitioned.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The table is a system catalog or a <acronym>TOAST</acronym> table.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <command>CLUSTER</command> is executed inside a transaction block.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The <link linkend="guc-wal-level"><varname>wal_level</varname></link>
+ configuration parameter is less than <literal>logical</literal>.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The <link linkend="guc-max-replication-slots"><varname>max_replication_slots</varname></link>
+ configuration parameter does not allow for creation of an additional
+ replication slot.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><replaceable class="parameter">boolean</replaceable></term>
<listitem>
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index 971b1237d4..ba8669026a 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -39,6 +39,7 @@ VACUUM [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
SKIP_DATABASE_STATS [ <replaceable class="parameter">boolean</replaceable> ]
ONLY_DATABASE_STATS [ <replaceable class="parameter">boolean</replaceable> ]
BUFFER_USAGE_LIMIT <replaceable class="parameter">size</replaceable>
+ CONCURRENTLY [ <replaceable class="parameter">boolean</replaceable> ]
<phrase>and <replaceable class="parameter">table_and_columns</replaceable> is:</phrase>
@@ -62,7 +63,8 @@ VACUUM [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
Without a <replaceable class="parameter">table_and_columns</replaceable>
list, <command>VACUUM</command> processes every table and materialized view
in the current database that the current user has permission to vacuum.
- With a list, <command>VACUUM</command> processes only those table(s).
+ With a list, <command>VACUUM</command> processes only those table(s). The
+ list is required if the <literal>CONCURRENTLY</literal> option is used.
</para>
<para>
@@ -361,6 +363,24 @@ VACUUM [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><literal>CONCURRENTLY</literal></term>
+ <listitem>
+ <para>
+ Allow other transactions to use the table while it is being vacuumed. If
+ this option is specified, <command>VACUUM</command> can only process
+ tables which have already been clustered. For more information, see the
+ description of the <literal>CONCURRENTLY</literal> option of the
+ <xref linkend="sql-cluster"/> command.
+ </para>
+
+ <para>
+ The <literal>CONCURRENTLY</literal> option can only be used
+ if <literal>FULL</literal> is used at the same time.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><replaceable class="parameter">boolean</replaceable></term>
<listitem>
diff --git a/src/Makefile b/src/Makefile
index 2f31a2f20a..8b9d30ff72 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -23,6 +23,7 @@ SUBDIRS = \
interfaces \
backend/replication/libpqwalreceiver \
backend/replication/pgoutput \
+ backend/replication/pgoutput_cluster \
fe_utils \
bin \
pl \
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 485525f4d6..552993d4ef 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2073,8 +2073,14 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
/*
* If this is a catalog, we need to transmit combo CIDs to properly
* decode, so log that as well.
+ *
+ * For the main heap (as opposed to TOAST), we only receive
+ * HEAP_INSERT_NO_LOGICAL when doing VACUUM FULL / CLUSTER, in which
+ * case the visibility information does not change. Therefore, there's
+ * no need to update the decoding snapshot.
*/
- if (RelationIsAccessibleInLogicalDecoding(relation))
+ if ((options & HEAP_INSERT_NO_LOGICAL) == 0 &&
+ RelationIsAccessibleInLogicalDecoding(relation))
log_heap_new_cid(relation, heaptup);
/*
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index e817f8f8f8..c5ec21ca2f 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -33,6 +33,7 @@
#include "catalog/index.h"
#include "catalog/storage.h"
#include "catalog/storage_xlog.h"
+#include "commands/cluster.h"
#include "commands/progress.h"
#include "executor/executor.h"
#include "miscadmin.h"
@@ -53,6 +54,9 @@ static void reform_and_rewrite_tuple(HeapTuple tuple,
static bool SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
HeapTuple tuple,
OffsetNumber tupoffset);
+static HeapTuple accept_tuple_for_concurrent_copy(HeapTuple tuple,
+ Snapshot snapshot,
+ Buffer buffer);
static BlockNumber heapam_scan_get_blocks_done(HeapScanDesc hscan);
@@ -681,6 +685,8 @@ static void
heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
Relation OldIndex, bool use_sort,
TransactionId OldestXmin,
+ Snapshot snapshot,
+ LogicalDecodingContext *decoding_ctx,
TransactionId *xid_cutoff,
MultiXactId *multi_cutoff,
double *num_tuples,
@@ -701,6 +707,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
bool *isnull;
BufferHeapTupleTableSlot *hslot;
BlockNumber prev_cblock = InvalidBlockNumber;
+ bool concurrent = snapshot != NULL;
+ XLogRecPtr end_of_wal_prev = GetFlushRecPtr(NULL);
/* Remember if it's a system catalog */
is_system_catalog = IsSystemRelation(OldHeap);
@@ -779,8 +787,10 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
for (;;)
{
HeapTuple tuple;
+ bool tuple_copied = false;
Buffer buf;
bool isdead;
+ HTSV_Result vis;
CHECK_FOR_INTERRUPTS();
@@ -835,7 +845,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
LockBuffer(buf, BUFFER_LOCK_SHARE);
- switch (HeapTupleSatisfiesVacuum(tuple, OldestXmin, buf))
+ switch ((vis = HeapTupleSatisfiesVacuum(tuple, OldestXmin, buf)))
{
case HEAPTUPLE_DEAD:
/* Definitely dead */
@@ -851,14 +861,15 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
case HEAPTUPLE_INSERT_IN_PROGRESS:
/*
- * Since we hold exclusive lock on the relation, normally the
- * only way to see this is if it was inserted earlier in our
- * own transaction. However, it can happen in system
+ * As long as we hold exclusive lock on the relation, normally
+ * the only way to see this is if it was inserted earlier in
+ * our own transaction. However, it can happen in system
* catalogs, since we tend to release write lock before commit
- * there. Give a warning if neither case applies; but in any
- * case we had better copy it.
+ * there. Also, there's no exclusive lock during concurrent
+ * processing. Give a warning if neither case applies; but in
+ * any case we had better copy it.
*/
- if (!is_system_catalog &&
+ if (!is_system_catalog && !concurrent &&
!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(tuple->t_data)))
elog(WARNING, "concurrent insert in progress within table \"%s\"",
RelationGetRelationName(OldHeap));
@@ -870,7 +881,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
/*
* Similar situation to INSERT_IN_PROGRESS case.
*/
- if (!is_system_catalog &&
+ if (!is_system_catalog && !concurrent &&
!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetUpdateXid(tuple->t_data)))
elog(WARNING, "concurrent delete in progress within table \"%s\"",
RelationGetRelationName(OldHeap));
@@ -884,8 +895,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
break;
}
- LockBuffer(buf, BUFFER_LOCK_UNLOCK);
-
if (isdead)
{
*tups_vacuumed += 1;
@@ -896,9 +905,47 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
*tups_vacuumed += 1;
*tups_recently_dead -= 1;
}
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
continue;
}
+ if (concurrent)
+ {
+ /*
+ * Ignore concurrent changes now, they'll be processed later via
+ * logical decoding.
+ *
+ * INSERT_IN_PROGRESS is rejected right away because our snapshot
+ * should represent a point in time which should precede (or be
+ * equal to) the state of transactions as it was when the
+ * "SatisfiesVacuum" test was performed. Thus
+ * accept_tuple_for_concurrent_copy() should not consider the
+ * tuple inserted.
+ */
+ if (vis == HEAPTUPLE_INSERT_IN_PROGRESS)
+ tuple = NULL;
+ else
+ tuple = accept_tuple_for_concurrent_copy(tuple, snapshot,
+ buf);
+ /* Tuple not suitable for the new heap? */
+ if (tuple == NULL)
+ {
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ continue;
+ }
+
+ /* Remember that we have to free the tuple eventually. */
+ tuple_copied = true;
+ }
+
+ /*
+ * In the concurrent case, we have a copy of the tuple, so we don't
+ * worry whether the source tuple will be deleted / updated after we
+ * release the lock.
+ */
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
*num_tuples += 1;
if (tuplesort != NULL)
{
@@ -915,7 +962,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
{
const int ct_index[] = {
PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED,
- PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN
+ PROGRESS_CLUSTER_HEAP_TUPLES_INSERTED
};
int64 ct_val[2];
@@ -930,6 +977,33 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
ct_val[1] = *num_tuples;
pgstat_progress_update_multi_param(2, ct_index, ct_val);
}
+ if (tuple_copied)
+ heap_freetuple(tuple);
+
+ /*
+ * Process the WAL produced by the load, as well as by other
+ * transactions, so that the replication slot can advance and WAL does
+ * not pile up. Use wal_segment_size as a threshold so that we do not
+ * introduce the decoding overhead too often.
+ *
+ * Of course, we must not apply the changes until the initial load has
+ * completed.
+ *
+ * Note that our insertions into the new table should not be decoded
+ * as we (intentionally) do not write the logical decoding specific
+ * information to WAL.
+ */
+ if (concurrent)
+ {
+ XLogRecPtr end_of_wal;
+
+ end_of_wal = GetFlushRecPtr(NULL);
+ if ((end_of_wal - end_of_wal_prev) > wal_segment_size)
+ {
+ cluster_decode_concurrent_changes(decoding_ctx, end_of_wal);
+ end_of_wal_prev = end_of_wal;
+ }
+ }
}
if (indexScan != NULL)
@@ -973,7 +1047,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
values, isnull,
rwstate);
/* Report n_tuples */
- pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN,
+ pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_TUPLES_INSERTED,
n_tuples);
}
@@ -2609,6 +2683,53 @@ SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
}
}
+/*
+ * Return copy of 'tuple' if it has been inserted according to 'snapshot', or
+ * NULL if the insertion took place in the future. If the tuple is already
+ * marked as deleted or updated by a transaction that 'snapshot' still
+ * considers running, clear the deletion / update XID in the header of the
+ * copied tuple. This way the returned tuple is suitable for insertion into
+ * the new heap.
+ */
+static HeapTuple
+accept_tuple_for_concurrent_copy(HeapTuple tuple, Snapshot snapshot,
+ Buffer buffer)
+{
+ HeapTuple result;
+
+ Assert(snapshot->snapshot_type == SNAPSHOT_MVCC);
+
+ /*
+ * First, check whether the tuple insertion is visible to our snapshot.
+ */
+ if (!HeapTupleMVCCInserted(tuple, snapshot, buffer))
+ return NULL;
+
+ result = heap_copytuple(tuple);
+
+ /*
+ * If the tuple was deleted / updated but our snapshot still sees it, we
+ * need to keep it. In that case, clear the information that indicates the
+ * deletion / update. Otherwise the tuple chain would stay incomplete (as
+ * we will reject the new tuple above), and the delete / update would fail
+ * if executed later during logical decoding.
+ */
+ if (TransactionIdIsNormal(HeapTupleHeaderGetRawXmax(result->t_data)) &&
+ HeapTupleMVCCNotDeleted(result, snapshot, buffer))
+ {
+ /* TODO More work needed here? */
+ result->t_data->t_infomask |= HEAP_XMAX_INVALID;
+ HeapTupleHeaderSetXmax(result->t_data, 0);
+ }
+
+ /*
+ * Accept the tuple even if our snapshot considers it deleted - older
+ * snapshots can still see the tuple, while the decoded transactions
+ * should not try to update / delete it again.
+ */
+ return result;
+}
+
/* ------------------------------------------------------------------------
* Definition of the heap table access method.
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index e146605bd5..d9be93aadc 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -955,16 +955,31 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
* did TransactionIdIsInProgress in each call --- to no avail, as long as the
* inserting/deleting transaction was still running --- which was more cycles
* and more contention on ProcArrayLock.
+ *
+ * The checks are split into two functions, HeapTupleMVCCInserted() and
+ * HeapTupleMVCCNotDeleted(), because they are also useful separately.
*/
static bool
HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
Buffer buffer)
{
- HeapTupleHeader tuple = htup->t_data;
-
Assert(ItemPointerIsValid(&htup->t_self));
Assert(htup->t_tableOid != InvalidOid);
+ return HeapTupleMVCCInserted(htup, snapshot, buffer) &&
+ HeapTupleMVCCNotDeleted(htup, snapshot, buffer);
+}
+
+/*
+ * HeapTupleMVCCInserted
+ * True iff heap tuple was successfully inserted for the given MVCC
+ * snapshot.
+ */
+bool
+HeapTupleMVCCInserted(HeapTuple htup, Snapshot snapshot, Buffer buffer)
+{
+ HeapTupleHeader tuple = htup->t_data;
+
if (!HeapTupleHeaderXminCommitted(tuple))
{
if (HeapTupleHeaderXminInvalid(tuple))
@@ -1073,6 +1088,17 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
}
/* by here, the inserting transaction has committed */
+ return true;
+}
+
+/*
+ * HeapTupleMVCCNotDeleted
+ * True iff heap tuple was not deleted for the given MVCC snapshot.
+ */
+bool
+HeapTupleMVCCNotDeleted(HeapTuple htup, Snapshot snapshot, Buffer buffer)
+{
+ HeapTupleHeader tuple = htup->t_data;
if (tuple->t_infomask & HEAP_XMAX_INVALID) /* xid invalid or aborted */
return true;
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 7377912b41..1f9aa5bf8f 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1417,22 +1417,7 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
opclassOptions[i] = get_attoptions(oldIndexId, i + 1);
- /* Extract statistic targets for each attribute */
- stattargets = palloc0_array(NullableDatum, newInfo->ii_NumIndexAttrs);
- for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
- {
- HeapTuple tp;
- Datum dat;
-
- tp = SearchSysCache2(ATTNUM, ObjectIdGetDatum(oldIndexId), Int16GetDatum(i + 1));
- if (!HeapTupleIsValid(tp))
- elog(ERROR, "cache lookup failed for attribute %d of relation %u",
- i + 1, oldIndexId);
- dat = SysCacheGetAttr(ATTNUM, tp, Anum_pg_attribute_attstattarget, &isnull);
- ReleaseSysCache(tp);
- stattargets[i].value = dat;
- stattargets[i].isnull = isnull;
- }
+ stattargets = get_index_stattargets(oldIndexId, newInfo);
/*
* Now create the new index.
@@ -1471,6 +1456,32 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
return newIndexId;
}
+NullableDatum *
+get_index_stattargets(Oid indexid, IndexInfo *indInfo)
+{
+ NullableDatum *stattargets;
+
+ /* Extract statistic targets for each attribute */
+ stattargets = palloc0_array(NullableDatum, indInfo->ii_NumIndexAttrs);
+ for (int i = 0; i < indInfo->ii_NumIndexAttrs; i++)
+ {
+ HeapTuple tp;
+ Datum dat;
+ bool isnull;
+
+ tp = SearchSysCache2(ATTNUM, ObjectIdGetDatum(indexid), Int16GetDatum(i + 1));
+ if (!HeapTupleIsValid(tp))
+ elog(ERROR, "cache lookup failed for attribute %d of relation %u",
+ i + 1, indexid);
+ dat = SysCacheGetAttr(ATTNUM, tp, Anum_pg_attribute_attstattarget, &isnull);
+ ReleaseSysCache(tp);
+ stattargets[i].value = dat;
+ stattargets[i].isnull = isnull;
+ }
+
+ return stattargets;
+}
+
/*
* index_concurrently_build
*
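The hunk above only moves the per-attribute loop into a reusable helper, `get_index_stattargets()`. The shape of the result can be sketched with stand-in types (in PostgreSQL, `NullableDatum` pairs a `Datum` with an is-null flag; the lookup callback here is a hypothetical stand-in for the `SearchSysCache2`/`SysCacheGetAttr` sequence):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Stand-ins for PostgreSQL's Datum / NullableDatum. */
typedef long Datum;
typedef struct { Datum value; bool isnull; } NullableDatumSketch;

/* Hypothetical per-attribute catalog lookup. */
typedef Datum (*attr_lookup_fn)(int attnum, bool *isnull);

/*
 * Mirrors get_index_stattargets(): one NullableDatum per index
 * attribute, filled from a per-attribute catalog lookup.
 */
static NullableDatumSketch *
collect_stattargets(int natts, attr_lookup_fn lookup)
{
    NullableDatumSketch *targets = calloc(natts, sizeof(*targets));

    for (int i = 0; i < natts; i++)
        targets[i].value = lookup(i + 1, &targets[i].isnull);
    return targets;
}

/* Example lookup used below: attribute 2 has no explicit target. */
static Datum
sample_lookup(int attnum, bool *isnull)
{
    *isnull = (attnum == 2);
    return attnum * 10;
}
```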
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 7a595c84db..d1e75b5f44 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1240,16 +1240,19 @@ CREATE VIEW pg_stat_progress_cluster AS
WHEN 2 THEN 'index scanning heap'
WHEN 3 THEN 'sorting tuples'
WHEN 4 THEN 'writing new heap'
- WHEN 5 THEN 'swapping relation files'
- WHEN 6 THEN 'rebuilding index'
- WHEN 7 THEN 'performing final cleanup'
+ WHEN 5 THEN 'catch-up'
+ WHEN 6 THEN 'swapping relation files'
+ WHEN 7 THEN 'rebuilding index'
+ WHEN 8 THEN 'performing final cleanup'
END AS phase,
CAST(S.param3 AS oid) AS cluster_index_relid,
S.param4 AS heap_tuples_scanned,
- S.param5 AS heap_tuples_written,
- S.param6 AS heap_blks_total,
- S.param7 AS heap_blks_scanned,
- S.param8 AS index_rebuild_count
+ S.param5 AS heap_tuples_inserted,
+ S.param6 AS heap_tuples_updated,
+ S.param7 AS heap_tuples_deleted,
+ S.param8 AS heap_blks_total,
+ S.param9 AS heap_blks_scanned,
+ S.param10 AS index_rebuild_count
FROM pg_stat_get_progress_info('CLUSTER') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
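The view change above inserts a new "catch-up" phase as number 5 and shifts the later phases up by one. A small sketch of the resulting phase-number-to-name mapping (phase 1 and the default are assumptions, since they fall outside the hunk shown):

```c
#include <assert.h>
#include <string.h>

/*
 * Mirrors the updated CASE in pg_stat_progress_cluster. The new
 * "catch-up" phase becomes 5; "swapping relation files" and later
 * phases each move up by one.
 */
static const char *
cluster_phase_name(int phase)
{
    switch (phase)
    {
        case 2: return "index scanning heap";
        case 3: return "sorting tuples";
        case 4: return "writing new heap";
        case 5: return "catch-up";
        case 6: return "swapping relation files";
        case 7: return "rebuilding index";
        case 8: return "performing final cleanup";
        default: return "unknown"; /* phases outside the hunk shown */
    }
}
```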
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 99193f5c88..c9cc061c45 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -25,6 +25,10 @@
#include "access/toast_internals.h"
#include "access/transam.h"
#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xloginsert.h"
+#include "access/xlogutils.h"
#include "catalog/catalog.h"
#include "catalog/dependency.h"
#include "catalog/heap.h"
@@ -32,6 +36,7 @@
#include "catalog/namespace.h"
#include "catalog/objectaccess.h"
#include "catalog/pg_am.h"
+#include "catalog/pg_control.h"
#include "catalog/pg_inherits.h"
#include "catalog/toasting.h"
#include "commands/cluster.h"
@@ -39,10 +44,15 @@
#include "commands/progress.h"
#include "commands/tablecmds.h"
#include "commands/vacuum.h"
+#include "executor/executor.h"
#include "miscadmin.h"
#include "optimizer/optimizer.h"
#include "pgstat.h"
+#include "replication/decode.h"
+#include "replication/logical.h"
+#include "replication/snapbuild.h"
#include "storage/bufmgr.h"
+#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
#include "utils/acl.h"
@@ -56,6 +66,8 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
+typedef struct RewriteStateData *RewriteState;
+
/*
* This struct is used to pass around the information on tables to be
* clustered. We need this so we can make a list of them when invoked without
@@ -67,17 +79,183 @@ typedef struct
Oid indexOid;
} RelToCluster;
+/*
+ * The following definitions are used for concurrent processing.
+ */
+
+/*
+ * OID of the table being processed by this backend.
+ */
+static Oid clustered_rel = InvalidOid;
+/* The same for its TOAST relation. */
+static Oid clustered_rel_toast = InvalidOid;
+
+/*
+ * The locators are used to avoid logical decoding of data that we do not need
+ * for our table.
+ */
+RelFileLocator clustered_rel_locator = {.relNumber = InvalidOid};
+RelFileLocator clustered_rel_toast_locator = {.relNumber = InvalidOid};
+
+/* XXX Do we also need to mention VACUUM FULL CONCURRENTLY? */
+#define CLUSTER_IN_PROGRESS_MESSAGE \
+ "relation \"%s\" is already being processed by CLUSTER CONCURRENTLY"
+
+/*
+ * Everything we need to call ExecInsertIndexTuples().
+ */
+typedef struct IndexInsertState
+{
+ ResultRelInfo *rri;
+ EState *estate;
+ ExprContext *econtext;
+
+ Relation ident_index;
+} IndexInsertState;
-static void cluster_multiple_rels(List *rtcs, ClusterParams *params);
-static void rebuild_relation(Relation OldHeap, Relation index, bool verbose);
+/*
+ * Catalog information to check if another backend changed the relation in
+ * such a way that makes CLUSTER CONCURRENTLY unable to continue. Such changes
+ * are possible because cluster_rel() has to release its lock on the relation
+ * in order to acquire AccessExclusiveLock that it needs to swap the relation
+ * files.
+ *
+ * The most obvious problem is that the tuple descriptor has changed, since
+ * then the tuples we try to insert into the new storage are not guaranteed to
+ * fit into the storage.
+ *
+ * Another problem is the relfilenode being changed by another backend. It's
+ * not necessarily a correctness issue (e.g. when the other backend ran
+ * cluster_rel()), but it's safer for us to terminate the table processing in
+ * such cases. However, this information also needs to be checked during
+ * logical decoding, so we store it in global variables clustered_rel_locator
+ * and clustered_rel_toast_locator above.
+ *
+ * Where possible, commands which might change the relation in an incompatible
+ * way should check if CLUSTER CONCURRENTLY is running, before they start to
+ * do the actual changes (see is_concurrent_cluster_in_progress()). Anything
+ * else must be caught by check_catalog_changes(), which uses this structure.
+ */
+typedef struct CatalogState
+{
+ /* Tuple descriptor of the relation. */
+ TupleDesc tupdesc;
+
+ /* The number of indexes tracked. */
+ int ninds;
+ /* The index OIDs. */
+ Oid *ind_oids;
+ /* The index tuple descriptors. */
+ TupleDesc *ind_tupdescs;
+
+ /* The following are copies of the corresponding fields of pg_class. */
+ char relpersistence;
+ char replident;
+
+ /* rd_replidindex */
+ Oid replidindex;
+} CatalogState;
+
+/* The WAL segment being decoded. */
+static XLogSegNo cluster_current_segment = 0;
+
+static void cluster_multiple_rels(List *rtcs, ClusterParams *params,
+ LOCKMODE lockmode, bool isTopLevel);
+static void rebuild_relation(Relation OldHeap, Relation index, bool verbose,
+ bool concurrent, bool is_vacuum);
static void copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
+ Snapshot snapshot, LogicalDecodingContext *decoding_ctx,
bool verbose, bool *pSwapToastByContent,
TransactionId *pFreezeXid, MultiXactId *pCutoffMulti);
static List *get_tables_to_cluster(MemoryContext cluster_context);
static List *get_tables_to_cluster_partitioned(MemoryContext cluster_context,
Oid indexOid);
static bool cluster_is_permitted_for_relation(Oid relid, Oid userid);
+static void check_concurrent_cluster_requirements(Relation rel,
+ bool isTopLevel,
+ bool isCluster);
+static void begin_concurrent_cluster(Relation *rel_p, Relation *index_p,
+ bool *entered_p);
+static void end_concurrent_cluster(bool error);
+static void cluster_before_shmem_exit_callback(int code, Datum arg);
+static CatalogState *get_catalog_state(Relation rel, bool is_vacuum);
+static void free_catalog_state(CatalogState *state);
+static void check_catalog_changes(Relation rel, CatalogState *cat_state);
+static LogicalDecodingContext *setup_logical_decoding(Oid relid,
+ const char *slotname,
+ TupleDesc tupdesc);
+static HeapTuple get_changed_tuple(char *change);
+static void apply_concurrent_changes(ClusterDecodingState *dstate,
+ Relation rel, ScanKey key, int nkeys,
+ IndexInsertState *iistate);
+static void apply_concurrent_insert(Relation rel, ConcurrentChange *change,
+ HeapTuple tup, IndexInsertState *iistate,
+ TupleTableSlot *index_slot);
+static void apply_concurrent_update(Relation rel, HeapTuple tup,
+ HeapTuple tup_target,
+ ConcurrentChange *change,
+ IndexInsertState *iistate,
+ TupleTableSlot *index_slot);
+static void apply_concurrent_delete(Relation rel, HeapTuple tup_target,
+ ConcurrentChange *change);
+static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
+ HeapTuple tup_key,
+ IndexInsertState *iistate,
+ TupleTableSlot *ident_slot,
+ IndexScanDesc *scan_p);
+static void process_concurrent_changes(LogicalDecodingContext *ctx,
+ XLogRecPtr end_of_wal,
+ Relation rel_dst,
+ Relation rel_src,
+ ScanKey ident_key,
+ int ident_key_nentries,
+ IndexInsertState *iistate);
+static IndexInsertState *get_index_insert_state(Relation relation,
+ Oid ident_index_id);
+static ScanKey build_identity_key(Oid ident_idx_oid, Relation rel_src,
+ int *nentries);
+static void free_index_insert_state(IndexInsertState *iistate);
+static void cleanup_logical_decoding(LogicalDecodingContext *ctx);
+static void rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
+ Relation cl_index,
+ CatalogState *cat_state,
+ LogicalDecodingContext *ctx,
+ bool swap_toast_by_content,
+ TransactionId frozenXid,
+ MultiXactId cutoffMulti);
+static List *build_new_indexes(Relation NewHeap, Relation OldHeap, List *OldIndexes);
+
+/*
+ * Use this API when a relation needs to be unlocked, closed and re-opened.
+ * If the relation got dropped while unlocked, raise an ERROR that mentions
+ * the relation name rather than the OID.
+ */
+typedef struct RelReopenInfo
+{
+ /*
+ * The relation to be closed. A pointer to the value is stored here so
+ * that the caller's reference is updated automatically on re-opening.
+ *
+ * When calling unlock_and_close_relations(), 'relid' can be passed
+ * instead of 'rel_p' when the caller only needs to gather information for
+ * subsequent opening.
+ */
+ Relation *rel_p;
+ Oid relid;
+ char relkind;
+ LOCKMODE lockmode_orig; /* The existing lock mode */
+ LOCKMODE lockmode_new; /* The lock mode after the relation is
+ * re-opened */
+
+ char *relname; /* Relation name, initialized automatically. */
+} RelReopenInfo;
+
+static void init_rel_reopen_info(RelReopenInfo *rri, Relation *rel_p,
+ Oid relid, LOCKMODE lockmode_orig,
+ LOCKMODE lockmode_new);
+static void unlock_and_close_relations(RelReopenInfo *rels, int nrel);
+static void reopen_relations(RelReopenInfo *rels, int nrel);
/*---------------------------------------------------------------------------
* This cluster code allows for clustering multiple tables at once. Because
@@ -109,10 +287,12 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
ListCell *lc;
ClusterParams params = {0};
bool verbose = false;
+ bool concurrent = false;
Relation rel = NULL;
Oid indexOid = InvalidOid;
MemoryContext cluster_context;
List *rtcs;
+ LOCKMODE lockmode;
/* Parse option list */
foreach(lc, stmt->params)
@@ -121,6 +301,8 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
if (strcmp(opt->defname, "verbose") == 0)
verbose = defGetBoolean(opt);
+ else if (strcmp(opt->defname, "concurrently") == 0)
+ concurrent = defGetBoolean(opt);
else
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
@@ -129,20 +311,30 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
parser_errposition(pstate, opt->location)));
}
- params.options = (verbose ? CLUOPT_VERBOSE : 0);
+ params.options =
+ (verbose ? CLUOPT_VERBOSE : 0) |
+ (concurrent ? CLUOPT_CONCURRENT : 0);
+
+ /*
+ * Determine the lock mode expected by cluster_rel().
+ *
+ * In the exclusive case, we obtain AccessExclusiveLock right away to
+ * avoid lock-upgrade hazard in the single-transaction case. In the
+ * CONCURRENT case, the AccessExclusiveLock will only be used at the end
+ * of processing, supposedly for very short time. Until then, we'll have
+ * to unlock the relation temporarily, so there's no lock-upgrade hazard.
+ */
+ lockmode = (params.options & CLUOPT_CONCURRENT) == 0 ?
+ AccessExclusiveLock : ShareUpdateExclusiveLock;
if (stmt->relation != NULL)
{
/* This is the single-relation case. */
Oid tableOid;
- /*
- * Find, lock, and check permissions on the table. We obtain
- * AccessExclusiveLock right away to avoid lock-upgrade hazard in the
- * single-transaction case.
- */
+ /* Find, lock, and check permissions on the table. */
tableOid = RangeVarGetRelidExtended(stmt->relation,
- AccessExclusiveLock,
+ lockmode,
0,
RangeVarCallbackMaintainsTable,
NULL);
@@ -194,7 +386,7 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
/* For non-partitioned tables, do what we came here to do. */
if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
{
- cluster_rel(rel, indexOid, &params);
+ cluster_rel(rel, indexOid, &params, isTopLevel, false);
/* cluster_rel closes the relation, but keeps lock */
return;
@@ -202,10 +394,29 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
}
/*
- * By here, we know we are in a multi-table situation. In order to avoid
- * holding locks for too long, we want to process each table in its own
- * transaction. This forces us to disallow running inside a user
- * transaction block.
+ * By here, we know we are in a multi-table situation.
+ *
+ * Concurrent processing is currently considered rather special (e.g. in
+ * terms of resources consumed) so it is not performed in bulk.
+ */
+ if (params.options & CLUOPT_CONCURRENT)
+ {
+ if (rel != NULL)
+ {
+ Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
+ ereport(ERROR,
+ (errmsg("CLUSTER (CONCURRENTLY) not supported for partitioned tables"),
+ errhint("Consider running the command for individual partitions.")));
+ }
+ else
+ ereport(ERROR,
+ (errmsg("CLUSTER (CONCURRENTLY) requires explicit table name")));
+ }
+
+ /*
+ * In order to avoid holding locks for too long, we want to process each
+ * table in its own transaction. This forces us to disallow running
+ * inside a user transaction block.
*/
PreventInTransactionBlock(isTopLevel, "CLUSTER");
@@ -230,11 +441,14 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
if (rel != NULL)
{
Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
+ /* See the ereport() above. */
+ Assert((params.options & CLUOPT_CONCURRENT) == 0);
+
check_index_is_clusterable(rel, indexOid, AccessShareLock);
rtcs = get_tables_to_cluster_partitioned(cluster_context, indexOid);
/* close relation, releasing lock on parent table */
- table_close(rel, AccessExclusiveLock);
+ table_close(rel, lockmode);
}
else
{
@@ -243,7 +457,7 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
}
/* Do the job. */
- cluster_multiple_rels(rtcs, &params);
+ cluster_multiple_rels(rtcs, &params, lockmode, isTopLevel);
/* Start a new transaction for the cleanup work. */
StartTransactionCommand();
@@ -260,7 +474,8 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
* return.
*/
static void
-cluster_multiple_rels(List *rtcs, ClusterParams *params)
+cluster_multiple_rels(List *rtcs, ClusterParams *params, LOCKMODE lockmode,
+ bool isTopLevel)
{
ListCell *lc;
@@ -280,10 +495,10 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
/* functions in indexes may want a snapshot set */
PushActiveSnapshot(GetTransactionSnapshot());
- rel = table_open(rtc->tableOid, AccessExclusiveLock);
+ rel = table_open(rtc->tableOid, lockmode);
/* Process this table */
- cluster_rel(rel, rtc->indexOid, params);
+ cluster_rel(rel, rtc->indexOid, params, isTopLevel, false);
/* cluster_rel closes the relation, but keeps lock */
PopActiveSnapshot();
@@ -306,9 +521,16 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
* If indexOid is InvalidOid, the table will be rewritten in physical order
* instead of index order. This is the new implementation of VACUUM FULL,
* and error messages should refer to the operation as VACUUM not CLUSTER.
+ *
+ * Note that, in the concurrent case, the function releases the lock at some
+ * point, in order to get AccessExclusiveLock for the final steps (i.e. to
+ * swap the relation files). To make things simpler, the caller should expect
+ * OldHeap to be closed on return, regardless of CLUOPT_CONCURRENT. (The
+ * AccessExclusiveLock is kept till the end of the transaction.)
*/
void
-cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
+cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
+ bool isTopLevel, bool isVacuum)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid save_userid;
@@ -317,8 +539,46 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
bool verbose = ((params->options & CLUOPT_VERBOSE) != 0);
bool recheck = ((params->options & CLUOPT_RECHECK) != 0);
Relation index;
+ bool concurrent = ((params->options & CLUOPT_CONCURRENT) != 0);
+ LOCKMODE lmode;
+ bool entered, success;
+
+ /*
+ * Check that the correct lock is held. The lock mode is
+ * AccessExclusiveLock for normal processing and ShareUpdateExclusiveLock
+ * for concurrent processing (so that SELECT, INSERT, UPDATE and DELETE
+ * commands work, but cluster_rel() cannot be called concurrently for the
+ * same relation).
+ */
+ lmode = !concurrent ? AccessExclusiveLock : ShareUpdateExclusiveLock;
+
+ /*
+ * Skip the relation if it's being processed concurrently. In such a case,
+ * we cannot rely on a lock because the other backend needs to release it
+ * temporarily at some point.
+ *
+ * This check should not take place until we have a lock that prevents
+ * another backend from starting VACUUM FULL / CLUSTER CONCURRENTLY after
+ * our check.
+ */
+ Assert(CheckRelationLockedByMe(OldHeap, lmode, false));
+ if (is_concurrent_cluster_in_progress(tableOid))
+ {
+ ereport(NOTICE,
+ (errmsg(CLUSTER_IN_PROGRESS_MESSAGE,
+ RelationGetRelationName(OldHeap))));
+ table_close(OldHeap, lmode);
+ return;
+ }
+
+ /* There are specific requirements on concurrent processing. */
+ if (concurrent)
+ {
+ check_concurrent_cluster_requirements(OldHeap, isTopLevel,
+ OidIsValid(indexOid));
- Assert(CheckRelationLockedByMe(OldHeap, AccessExclusiveLock, false));
+ check_relation_is_clusterable_concurrently(OldHeap, isVacuum);
+ }
/* Check for user-requested abort. */
CHECK_FOR_INTERRUPTS();
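The lock-mode selection above is the crux of the concurrent variant. A trivial standalone sketch of the decision (the enum values are stand-ins, not PostgreSQL's actual lock level numbers):

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for PostgreSQL's lock levels. */
typedef enum
{
    SHARE_UPDATE_EXCLUSIVE = 4,
    ACCESS_EXCLUSIVE = 8
} LockModeSketch;

/*
 * Mirrors cluster_rel(): exclusive processing takes AccessExclusiveLock
 * up front; concurrent processing holds only ShareUpdateExclusiveLock,
 * which still blocks a second CLUSTER on the same table while letting
 * SELECT, INSERT, UPDATE and DELETE proceed.
 */
static LockModeSketch
cluster_lockmode(bool concurrent)
{
    return concurrent ? SHARE_UPDATE_EXCLUSIVE : ACCESS_EXCLUSIVE;
}
```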
@@ -355,7 +615,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
/* Check that the user still has privileges for the relation */
if (!cluster_is_permitted_for_relation(tableOid, save_userid))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
goto out;
}
@@ -370,7 +630,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
*/
if (RELATION_IS_OTHER_TEMP(OldHeap))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
goto out;
}
@@ -381,7 +641,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
*/
if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(indexOid)))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
goto out;
}
@@ -392,7 +652,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
if ((params->options & CLUOPT_RECHECK_ISCLUSTERED) != 0 &&
!get_index_isclustered(indexOid))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
goto out;
}
}
@@ -408,6 +668,11 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("cannot cluster a shared catalog")));
+ /*
+ * The CONCURRENT case should have been rejected earlier because it does
+ * not support system catalogs.
+ */
+ Assert(!(OldHeap->rd_rel->relisshared && concurrent));
/*
* Don't process temp tables of other backends ... their local buffer
@@ -435,7 +700,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
if (OidIsValid(indexOid))
{
/* verify the index is good and lock it */
- check_index_is_clusterable(OldHeap, indexOid, AccessExclusiveLock);
+ check_index_is_clusterable(OldHeap, indexOid, lmode);
/* also open it */
index = index_open(indexOid, NoLock);
}
@@ -452,7 +717,8 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
if (OldHeap->rd_rel->relkind == RELKIND_MATVIEW &&
!RelationIsPopulated(OldHeap))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ index_close(index, lmode);
+ relation_close(OldHeap, lmode);
goto out;
}
@@ -465,11 +731,42 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
* invalid, because we move tuples around. Promote them to relation
* locks. Predicate locks on indexes will be promoted when they are
* reindexed.
+ *
+ * During concurrent processing, the heap as well as its indexes stay in
+ * operation, so we postpone this step until they are locked using
+ * AccessExclusiveLock near the end of the processing.
*/
- TransferPredicateLocksToHeapRelation(OldHeap);
+ if (!concurrent)
+ TransferPredicateLocksToHeapRelation(OldHeap);
/* rebuild_relation does all the dirty work */
- rebuild_relation(OldHeap, index, verbose);
+ entered = false;
+ success = false;
+ PG_TRY();
+ {
+ /*
+ * For concurrent processing, make sure other transactions treat this
+ * table as if it were a system / user catalog, and WAL the relevant
+ * additional information. ERROR is raised if another backend is
+ * processing the same table.
+ */
+ if (concurrent)
+ {
+ Relation *index_p = index ? &index : NULL;
+
+ begin_concurrent_cluster(&OldHeap, index_p, &entered);
+ }
+
+ rebuild_relation(OldHeap, index, verbose, concurrent, isVacuum);
+ success = true;
+ }
+ PG_FINALLY();
+ {
+ if (concurrent && entered)
+ end_concurrent_cluster(!success);
+ }
+ PG_END_TRY();
+
/* rebuild_relation closes OldHeap, and index if valid */
out:
@@ -615,18 +912,86 @@ mark_index_clustered(Relation rel, Oid indexOid, bool is_internal)
table_close(pg_index, RowExclusiveLock);
}
+/*
+ * Check if the CONCURRENTLY option is legal for the relation.
+ */
+void
+check_relation_is_clusterable_concurrently(Relation rel, bool is_vacuum)
+{
+ char relpersistence, replident;
+ Oid ident_idx;
+ const char *cmd = is_vacuum ? "VACUUM" : "CLUSTER";
+
+ /* Data changes in system relations are not logically decoded. */
+ if (IsCatalogRelation(rel))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot process relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("%s (CONCURRENTLY) is not supported for catalog relations.",
+ cmd)));
+
+ if (IsToastRelation(rel))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot process relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("%s (CONCURRENTLY) is not supported for TOAST relations, unless the main relation is processed too.",
+ cmd)));
+
+ relpersistence = rel->rd_rel->relpersistence;
+ if (relpersistence != RELPERSISTENCE_PERMANENT)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot process relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("%s (CONCURRENTLY) is only allowed for permanent relations.",
+ cmd)));
+
+ /* With NOTHING, WAL does not contain the old tuple. */
+ replident = rel->rd_rel->relreplident;
+ if (replident == REPLICA_IDENTITY_NOTHING)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot process relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("Relation \"%s\" has insufficient replication identity.",
+ RelationGetRelationName(rel))));
+
+ /*
+ * Identity index is not set if the replica identity is FULL, but PK might
+ * exist in such a case.
+ */
+ ident_idx = RelationGetReplicaIndex(rel);
+ if (!OidIsValid(ident_idx) && OidIsValid(rel->rd_pkindex))
+ ident_idx = rel->rd_pkindex;
+ if (!OidIsValid(ident_idx))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot process relation \"%s\"",
+ RelationGetRelationName(rel)),
+ (errhint("Relation \"%s\" has no identity index.",
+ RelationGetRelationName(rel)))));
+}
+
/*
* rebuild_relation: rebuild an existing relation in index or physical order
*
- * OldHeap: table to rebuild.
+ * OldHeap: table to rebuild. See cluster_rel() for comments on the required
+ * lock strength.
+ *
* index: index to cluster by, or NULL to rewrite in physical order.
*
- * On entry, heap and index (if one is given) must be open, and
- * AccessExclusiveLock held on them.
- * On exit, they are closed, but locks on them are not released.
+ * On entry, heap and index (if one is given) must be open, and the
+ * appropriate lock held on them (AccessExclusiveLock for exclusive processing
+ * and ShareUpdateExclusiveLock for concurrent processing).
+ *
+ * On exit, they are closed, but still locked with AccessExclusiveLock. (The
+ * function handles the lock upgrade if 'concurrent' is true.)
*/
static void
-rebuild_relation(Relation OldHeap, Relation index, bool verbose)
+rebuild_relation(Relation OldHeap, Relation index, bool verbose,
+ bool concurrent, bool is_vacuum)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid accessMethod = OldHeap->rd_rel->relam;
@@ -634,13 +999,81 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose)
Oid OIDNewHeap;
Relation NewHeap;
char relpersistence;
- bool is_system_catalog;
bool swap_toast_by_content;
TransactionId frozenXid;
MultiXactId cutoffMulti;
+ NameData slotname;
+ LogicalDecodingContext *ctx = NULL;
+ Snapshot snapshot = NULL;
+ CatalogState *cat_state = NULL;
+ LOCKMODE lmode;
+
+ lmode = !concurrent ? AccessExclusiveLock : ShareUpdateExclusiveLock;
+
+ Assert(CheckRelationLockedByMe(OldHeap, lmode, false) &&
+ (index == NULL || CheckRelationLockedByMe(index, lmode, false)));
+
+ if (concurrent)
+ {
+ TupleDesc tupdesc;
+ RelReopenInfo rri[2];
+ int nrel;
+
+ /*
+ * CLUSTER CONCURRENTLY is not allowed in a transaction block, so this
+ * should never fire.
+ */
+ Assert(GetTopTransactionIdIfAny() == InvalidTransactionId);
+
+ /*
+ * A single backend should not execute multiple CLUSTER commands at a
+ * time, so use PID to make the slot unique.
+ */
+ snprintf(NameStr(slotname), NAMEDATALEN, "cluster_%d", MyProcPid);
+
+ /*
+ * Gather catalog information so that we can check later if the old
+ * relation has not changed while unlocked.
+ *
+ * Since this function also checks if the relation can be processed,
+ * it's important to call it before we spend a notable amount of time
+ * setting up the logical decoding. Not sure though if it's necessary to do
+ * it even earlier.
+ */
+ cat_state = get_catalog_state(OldHeap, is_vacuum);
+
+ tupdesc = CreateTupleDescCopy(RelationGetDescr(OldHeap));
+
+ /*
+ * Unlock the relation (and possibly the clustering index) to avoid
+ * deadlock because setup_logical_decoding() will wait for all the
+ * running transactions (with XID assigned) to finish. Some of those
+ * transactions might be waiting for a lock on our relation.
+ */
+ nrel = 0;
+ init_rel_reopen_info(&rri[nrel++], &OldHeap, InvalidOid,
+ ShareUpdateExclusiveLock,
+ ShareUpdateExclusiveLock);
+ if (index)
+ init_rel_reopen_info(&rri[nrel++], &index, InvalidOid,
+ ShareUpdateExclusiveLock,
+ ShareUpdateExclusiveLock);
+ unlock_and_close_relations(rri, nrel);
+
+ /* Prepare to capture the concurrent data changes. */
+ ctx = setup_logical_decoding(tableOid, NameStr(slotname), tupdesc);
+
+ /* Lock the table (and index) again. */
+ reopen_relations(rri, nrel);
+
+ /*
+ * Check whether 'tupdesc' changed while the relation was unlocked.
+ */
+ check_catalog_changes(OldHeap, cat_state);
- Assert(CheckRelationLockedByMe(OldHeap, AccessExclusiveLock, false) &&
- (index == NULL || CheckRelationLockedByMe(index, AccessExclusiveLock, false)));
+ snapshot = SnapBuildInitialSnapshotForCluster(ctx->snapshot_builder);
+ }
if (index)
/* Mark the correct index as clustered */
@@ -648,7 +1081,6 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose)
/* Remember info about rel before closing OldHeap */
relpersistence = OldHeap->rd_rel->relpersistence;
- is_system_catalog = IsSystemRelation(OldHeap);
/*
* Create the transient table that will receive the re-ordered data.
@@ -664,30 +1096,51 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose)
NewHeap = table_open(OIDNewHeap, NoLock);
/* Copy the heap data into the new table in the desired order */
- copy_table_data(NewHeap, OldHeap, index, verbose,
+ copy_table_data(NewHeap, OldHeap, index, snapshot, ctx, verbose,
&swap_toast_by_content, &frozenXid, &cutoffMulti);
+ if (concurrent)
+ {
+ rebuild_relation_finish_concurrent(NewHeap, OldHeap, index,
+ cat_state, ctx,
+ swap_toast_by_content,
+ frozenXid, cutoffMulti);
+
+ pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
+ PROGRESS_CLUSTER_PHASE_FINAL_CLEANUP);
+
+ /* Done with decoding. */
+ FreeSnapshot(snapshot);
+ free_catalog_state(cat_state);
+ cleanup_logical_decoding(ctx);
+ ReplicationSlotRelease();
+ ReplicationSlotDrop(NameStr(slotname), false);
+ }
+ else
+ {
+ bool is_system_catalog = IsSystemRelation(OldHeap);
- /* Close relcache entries, but keep lock until transaction commit */
- table_close(OldHeap, NoLock);
- if (index)
- index_close(index, NoLock);
+ /* Close relcache entries, but keep lock until transaction commit */
+ table_close(OldHeap, NoLock);
+ if (index)
+ index_close(index, NoLock);
- /*
- * Close the new relation so it can be dropped as soon as the storage is
- * swapped. The relation is not visible to others, so no need to unlock it
- * explicitly.
- */
- table_close(NewHeap, NoLock);
+ /*
+ * Close the new relation so it can be dropped as soon as the storage
+ * is swapped. The relation is not visible to others, so no need to
+ * unlock it explicitly.
+ */
+ table_close(NewHeap, NoLock);
- /*
- * Swap the physical files of the target and transient tables, then
- * rebuild the target's indexes and throw away the transient table.
- */
- finish_heap_swap(tableOid, OIDNewHeap, is_system_catalog,
- swap_toast_by_content, false, true,
- frozenXid, cutoffMulti,
- relpersistence);
+ /*
+ * Swap the physical files of the target and transient tables, then
+ * rebuild the target's indexes and throw away the transient table.
+ */
+ finish_heap_swap(tableOid, OIDNewHeap, is_system_catalog,
+ swap_toast_by_content, false, true, true,
+ frozenXid, cutoffMulti,
+ relpersistence);
+ }
}
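The concurrent branch of `rebuild_relation()` above has a delicate ordering: the relation must be unlocked before the decoding slot is created (slot setup waits for in-flight transactions, some of which may themselves be waiting on our lock), and the catalog state must be re-verified after relocking. A schematic sketch of that ordering, with hypothetical callbacks standing in for the PostgreSQL internals:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical callbacks standing in for the internals used above. */
typedef struct
{
    void (*unlock_rel)(void);
    void (*setup_decoding)(void);   /* waits for running transactions */
    void (*relock_rel)(void);
    bool (*catalog_unchanged)(void);
} ConcurrentClusterOps;

/*
 * Returns false if the relation changed incompatibly while unlocked,
 * in which case the caller must abandon the rebuild.
 */
static bool
prepare_concurrent_rebuild(const ConcurrentClusterOps *ops)
{
    /*
     * Unlock first: slot creation waits for transactions that might in
     * turn be waiting for our lock, so holding it could deadlock.
     */
    ops->unlock_rel();
    ops->setup_decoding();
    ops->relock_rel();

    /* The relation was unlocked, so its definition may have changed. */
    return ops->catalog_unchanged();
}

/* Example instrumentation used below. */
static int sketch_step;
static void note_unlock(void) { sketch_step = 1; }
static void note_setup(void)  { sketch_step = 2; }
static void note_relock(void) { sketch_step = 3; }
static bool note_check(void)  { sketch_step = 4; return true; }
```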
@@ -822,15 +1275,19 @@ make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
/*
* Do the physical copying of table data.
*
+ * 'snapshot' and 'decoding_ctx': see table_relation_copy_for_cluster().
+ * Pass them iff concurrent processing is required.
+ *
* There are three output parameters:
* *pSwapToastByContent is set true if toast tables must be swapped by content.
* *pFreezeXid receives the TransactionId used as freeze cutoff point.
* *pCutoffMulti receives the MultiXactId used as a cutoff point.
*/
static void
-copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verbose,
- bool *pSwapToastByContent, TransactionId *pFreezeXid,
- MultiXactId *pCutoffMulti)
+copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
+ Snapshot snapshot, LogicalDecodingContext *decoding_ctx,
+ bool verbose, bool *pSwapToastByContent,
+ TransactionId *pFreezeXid, MultiXactId *pCutoffMulti)
{
Relation relRelation;
HeapTuple reltup;
@@ -847,6 +1304,7 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
int elevel = verbose ? INFO : DEBUG2;
PGRUsage ru0;
char *nspname;
+ bool concurrent = snapshot != NULL;
pg_rusage_init(&ru0);
@@ -873,8 +1331,12 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
*
* We don't need to open the toast relation here, just lock it. The lock
* will be held till end of transaction.
+ *
+ * In the CONCURRENT case, the lock does not help because we need to
+ * release it temporarily at some point. Instead, we expect VACUUM /
+ * CLUSTER to skip tables which are present in ClusteredRelsHash.
*/
- if (OldHeap->rd_rel->reltoastrelid)
+ if (OldHeap->rd_rel->reltoastrelid && !concurrent)
LockRelationOid(OldHeap->rd_rel->reltoastrelid, AccessExclusiveLock);
/*
@@ -950,8 +1412,46 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
* provided, else plain seqscan.
*/
if (OldIndex != NULL && OldIndex->rd_rel->relam == BTREE_AM_OID)
+ {
+ ResourceOwner oldowner = CurrentResourceOwner;
+
+ /*
+ * In the CONCURRENT case, do the planning in a subtransaction so that
+ * we don't leave any additional locks behind us that we cannot
+ * release easily.
+ */
+ if (concurrent)
+ {
+ Assert(CheckRelationLockedByMe(OldHeap, ShareUpdateExclusiveLock,
+ false));
+ Assert(CheckRelationLockedByMe(OldIndex, ShareUpdateExclusiveLock,
+ false));
+ BeginInternalSubTransaction("plan_cluster_use_sort");
+ }
+
use_sort = plan_cluster_use_sort(RelationGetRelid(OldHeap),
RelationGetRelid(OldIndex));
+
+ if (concurrent)
+ {
+ PgBackendProgress progress;
+
+ /*
+ * Command progress reporting gets terminated at subtransaction
+ * end. Save the status so it can eventually be restored.
+ */
+ memcpy(&progress, &MyBEEntry->st_progress,
+ sizeof(PgBackendProgress));
+
+ /* Release the locks by aborting the subtransaction. */
+ RollbackAndReleaseCurrentSubTransaction();
+
+ /* Restore the progress reporting status. */
+ pgstat_progress_restore_state(&progress);
+
+ CurrentResourceOwner = oldowner;
+ }
+ }
else
use_sort = false;
@@ -980,7 +1480,9 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
* values (e.g. because the AM doesn't use freezing).
*/
table_relation_copy_for_cluster(OldHeap, NewHeap, OldIndex, use_sort,
- cutoffs.OldestXmin, &cutoffs.FreezeLimit,
+ cutoffs.OldestXmin, snapshot,
+ decoding_ctx,
+ &cutoffs.FreezeLimit,
&cutoffs.MultiXactCutoff,
&num_tuples, &tups_vacuumed,
&tups_recently_dead);
@@ -989,7 +1491,11 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
*pFreezeXid = cutoffs.FreezeLimit;
*pCutoffMulti = cutoffs.MultiXactCutoff;
- /* Reset rd_toastoid just to be tidy --- it shouldn't be looked at again */
+ /*
+ * Reset rd_toastoid just to be tidy --- it shouldn't be looked at
+ * again. In the CONCURRENT case, we need to set it again before applying
+ * the concurrent changes.
+ */
NewHeap->rd_toastoid = InvalidOid;
num_pages = RelationGetNumberOfBlocks(NewHeap);
@@ -1442,14 +1948,13 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
bool swap_toast_by_content,
bool check_constraints,
bool is_internal,
+ bool reindex,
TransactionId frozenXid,
MultiXactId cutoffMulti,
char newrelpersistence)
{
ObjectAddress object;
Oid mapped_tables[4];
- int reindex_flags;
- ReindexParams reindex_params = {0};
int i;
/* Report that we are now swapping relation files */
@@ -1475,39 +1980,46 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
if (is_system_catalog)
CacheInvalidateCatalog(OIDOldHeap);
- /*
- * Rebuild each index on the relation (but not the toast table, which is
- * all-new at this point). It is important to do this before the DROP
- * step because if we are processing a system catalog that will be used
- * during DROP, we want to have its indexes available. There is no
- * advantage to the other order anyway because this is all transactional,
- * so no chance to reclaim disk space before commit. We do not need a
- * final CommandCounterIncrement() because reindex_relation does it.
- *
- * Note: because index_build is called via reindex_relation, it will never
- * set indcheckxmin true for the indexes. This is OK even though in some
- * sense we are building new indexes rather than rebuilding existing ones,
- * because the new heap won't contain any HOT chains at all, let alone
- * broken ones, so it can't be necessary to set indcheckxmin.
- */
- reindex_flags = REINDEX_REL_SUPPRESS_INDEX_USE;
- if (check_constraints)
- reindex_flags |= REINDEX_REL_CHECK_CONSTRAINTS;
+ if (reindex)
+ {
+ int reindex_flags;
+ ReindexParams reindex_params = {0};
- /*
- * Ensure that the indexes have the same persistence as the parent
- * relation.
- */
- if (newrelpersistence == RELPERSISTENCE_UNLOGGED)
- reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
- else if (newrelpersistence == RELPERSISTENCE_PERMANENT)
- reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+ /*
+ * Rebuild each index on the relation (but not the toast table, which
+ * is all-new at this point). It is important to do this before the
+ * DROP step because if we are processing a system catalog that will
+ * be used during DROP, we want to have its indexes available. There
+ * is no advantage to the other order anyway because this is all
+ * transactional, so no chance to reclaim disk space before commit.
+ * We do not need a final CommandCounterIncrement() because
+ * reindex_relation does it.
+ *
+ * Note: because index_build is called via reindex_relation, it will never
+ * set indcheckxmin true for the indexes. This is OK even though in some
+ * sense we are building new indexes rather than rebuilding existing ones,
+ * because the new heap won't contain any HOT chains at all, let alone
+ * broken ones, so it can't be necessary to set indcheckxmin.
+ */
+ reindex_flags = REINDEX_REL_SUPPRESS_INDEX_USE;
+ if (check_constraints)
+ reindex_flags |= REINDEX_REL_CHECK_CONSTRAINTS;
- /* Report that we are now reindexing relations */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_REBUILD_INDEX);
+ /*
+ * Ensure that the indexes have the same persistence as the parent
+ * relation.
+ */
+ if (newrelpersistence == RELPERSISTENCE_UNLOGGED)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
+ else if (newrelpersistence == RELPERSISTENCE_PERMANENT)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
- reindex_relation(NULL, OIDOldHeap, reindex_flags, &reindex_params);
+ /* Report that we are now reindexing relations */
+ pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
+ PROGRESS_CLUSTER_PHASE_REBUILD_INDEX);
+
+ reindex_relation(NULL, OIDOldHeap, reindex_flags, &reindex_params);
+ }
/* Report that we are now doing clean up */
pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
@@ -1747,3 +2259,1886 @@ cluster_is_permitted_for_relation(Oid relid, Oid userid)
get_rel_name(relid))));
return false;
}
+
+#define REPL_PLUGIN_NAME "pgoutput_cluster"
+
+/*
+ * Each relation being processed by CLUSTER CONCURRENTLY must be in the
+ * clusteredRels hashtable.
+ */
+typedef struct ClusteredRel
+{
+ Oid relid;
+ Oid dbid;
+} ClusteredRel;
+
+static HTAB *ClusteredRelsHash = NULL;
+
+/* Maximum number of entries in the hashtable. */
+static int maxClusteredRels = 0;
+
+Size
+ClusterShmemSize(void)
+{
+ /*
+ * A replication slot is needed for the processing, so use
+ * max_replication_slots to size the hashtable. Also reserve space for
+ * TOAST relations.
+ */
+ maxClusteredRels = max_replication_slots * 2;
+
+ return hash_estimate_size(maxClusteredRels, sizeof(ClusteredRel));
+}
+
+void
+ClusterShmemInit(void)
+{
+ HASHCTL info;
+
+ info.keysize = sizeof(ClusteredRel);
+ info.entrysize = info.keysize;
+
+ ClusteredRelsHash = ShmemInitHash("Clustered Relations",
+ maxClusteredRels,
+ maxClusteredRels,
+ &info,
+ HASH_ELEM | HASH_BLOBS);
+}
+
+/*
+ * Perform a preliminary check whether CLUSTER / VACUUM FULL CONCURRENTLY is
+ * possible. Note that here we only check things that should not change if we
+ * release the relation lock temporarily. The information that can change due
+ * to unlocking is checked in get_catalog_state().
+ */
+static void
+check_concurrent_cluster_requirements(Relation rel, bool isTopLevel,
+ bool isCluster)
+{
+ const char *stmt;
+
+ if (isCluster)
+ stmt = "CLUSTER (CONCURRENTLY)";
+ else
+ stmt = "VACUUM (FULL, CONCURRENTLY)";
+
+ /*
+ * Make sure we have no XID assigned, otherwise the call of
+ * setup_logical_decoding() can cause a deadlock.
+ */
+ PreventInTransactionBlock(isTopLevel, stmt);
+
+ CheckSlotPermissions();
+
+ /*
+ * Use an existing function to check if we can use logical
+ * decoding. However note that RecoveryInProgress() should already have
+ * caused an error, as it does for the non-concurrent VACUUM FULL / CLUSTER.
+ */
+ CheckLogicalDecodingRequirements();
+
+ /* See ClusterShmemSize() */
+ if (max_replication_slots < 2)
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ (errmsg("%s requires \"max_replication_slots\" to be at least 2",
+ stmt)));
+}
+
+/*
+ * Call this function before CLUSTER CONCURRENTLY starts to setup logical
+ * decoding. It makes sure that other users of the table put enough
+ * information into WAL.
+ *
+ * The point is that on various places we expect that the table we're
+ * processing is treated like a system catalog. For example, we need to be
+ * able to scan it using a "historic snapshot" anytime during the processing
+ * (as opposed to scanning only at the start point of the decoding, as logical
+ * replication does during initial table synchronization), in order to apply
+ * concurrent UPDATE / DELETE commands.
+ *
+ * Since we need to close and reopen the relation here, the 'rel_p' and
+ * 'index_p' arguments are in/out.
+ *
+ * 'enter_p' receives a bool value telling whether relation OID was entered
+ * into the hashtable or not.
+ */
+static void
+begin_concurrent_cluster(Relation *rel_p, Relation *index_p,
+ bool *entered_p)
+{
+ Relation rel = *rel_p;
+ Oid relid, toastrelid;
+ ClusteredRel key, *entry;
+ bool found;
+ RelReopenInfo rri[2];
+ int nrel;
+ static bool before_shmem_exit_callback_setup = false;
+
+ relid = RelationGetRelid(rel);
+
+ /*
+ * Make sure that we do not leave an entry in ClusteredRelsHash if exiting
+ * due to FATAL.
+ */
+ if (!before_shmem_exit_callback_setup)
+ {
+ before_shmem_exit(cluster_before_shmem_exit_callback, 0);
+ before_shmem_exit_callback_setup = true;
+ }
+
+ memset(&key, 0, sizeof(key));
+ key.relid = relid;
+ key.dbid = MyDatabaseId;
+
+ *entered_p = false;
+ LWLockAcquire(ClusteredRelsLock, LW_EXCLUSIVE);
+ entry = (ClusteredRel *)
+ hash_search(ClusteredRelsHash, &key, HASH_ENTER_NULL, &found);
+ if (found)
+ {
+ /*
+ * Since CLUSTER CONCURRENTLY takes ShareRowExclusiveLock, a conflict
+ * should occur much earlier. However that lock may be released
+ * temporarily, see below. Anyway, we should complain whatever the
+ * reason of the conflict might be.
+ */
+ ereport(ERROR,
+ (errmsg(CLUSTER_IN_PROGRESS_MESSAGE,
+ RelationGetRelationName(rel))));
+ }
+ if (entry == NULL)
+ ereport(ERROR,
+ (errmsg("too many requests for CLUSTER CONCURRENTLY at a time")),
+ (errhint("Please consider increasing the \"max_replication_slots\" configuration parameter.")));
+
+ /*
+ * Even if the insertion of the TOAST relid fails below, the caller has
+ * to do the cleanup.
+ */
+ *entered_p = true;
+
+ /*
+ * Enable the callback to remove the entry in case of exit. We should not
+ * do this earlier, otherwise an attempt to insert an already existing entry
+ * could make us remove that entry (inserted by another backend) during
+ * ERROR handling.
+ */
+ Assert(!OidIsValid(clustered_rel));
+ clustered_rel = relid;
+
+ /*
+ * TOAST relation is not accessed using historic snapshot, but we enter it
+ * here to protect it from being VACUUMed by another backend. (Lock does
+ * not help in the CONCURRENT case because we cannot hold it continuously
+ * till the end of the transaction.) See the comments on locking TOAST
+ * relation in copy_table_data().
+ */
+ toastrelid = rel->rd_rel->reltoastrelid;
+ if (OidIsValid(toastrelid))
+ {
+ key.relid = toastrelid;
+ entry = (ClusteredRel *)
+ hash_search(ClusteredRelsHash, &key, HASH_ENTER_NULL, &found);
+ if (found)
+ /*
+ * If we could enter the main relation, the TOAST relation should
+ * succeed too. Nevertheless, check.
+ */
+ ereport(ERROR,
+ (errmsg("TOAST relation of \"%s\" is already being processed by CLUSTER CONCURRENTLY",
+ RelationGetRelationName(rel))));
+ if (entry == NULL)
+ ereport(ERROR,
+ (errmsg("too many requests for CLUSTER CONCURRENTLY at a time")),
+ (errhint("Please consider increasing the \"max_replication_slots\" configuration parameter.")));
+
+ Assert(!OidIsValid(clustered_rel_toast));
+ clustered_rel_toast = toastrelid;
+ }
+ LWLockRelease(ClusteredRelsLock);
+
+ /*
+ * Make sure that other backends are aware of the new hash entry.
+ *
+ * Besides sending the invalidation message, we need to force re-opening
+ * of the relation, which includes the actual invalidation (and thus
+ * checking of our hashtable on the next access).
+ */
+ CacheInvalidateRelcacheImmediate(rel);
+ /*
+ * Since the hashtable only needs to be checked by write transactions,
+ * lock the relation in a mode that conflicts with any DML command. (The
+ * reading transactions are supposed to close the relation before opening
+ * it with a higher lock.) Once we have the relation (and its index) locked,
+ * we unlock it immediately and then re-lock using the original mode.
+ */
+ nrel = 0;
+ init_rel_reopen_info(&rri[nrel++], rel_p, InvalidOid,
+ ShareUpdateExclusiveLock, ShareLock);
+ if (index_p)
+ {
+ /*
+ * Another transaction might want to open both the relation and the
+ * index. If it already has the relation lock and is waiting for the
+ * index lock, we should release the index lock, otherwise our request
+ * for ShareLock on the relation can end up in a deadlock.
+ */
+ init_rel_reopen_info(&rri[nrel++], index_p, InvalidOid,
+ ShareUpdateExclusiveLock, ShareLock);
+ }
+ unlock_and_close_relations(rri, nrel);
+ /*
+ * XXX It's not strictly necessary to lock the index here, but it's
+ * probably not worth teaching the "reopen API" about this special case.
+ */
+ reopen_relations(rri, nrel);
+
+ /* Switch back to the original lock. */
+ nrel = 0;
+ init_rel_reopen_info(&rri[nrel++], rel_p, InvalidOid,
+ ShareLock, ShareUpdateExclusiveLock);
+ if (index_p)
+ init_rel_reopen_info(&rri[nrel++], index_p, InvalidOid,
+ ShareLock, ShareUpdateExclusiveLock);
+ unlock_and_close_relations(rri, nrel);
+ reopen_relations(rri, nrel);
+ /* Make sure the reopened relcache entry is used, not the old one. */
+ rel = *rel_p;
+
+ /* Avoid logical decoding of other relations by this backend. */
+ clustered_rel_locator = rel->rd_locator;
+ if (OidIsValid(toastrelid))
+ {
+ Relation toastrel;
+
+ /* Avoid logical decoding of other TOAST relations. */
+ toastrel = table_open(toastrelid, AccessShareLock);
+ clustered_rel_toast_locator = toastrel->rd_locator;
+ table_close(toastrel, AccessShareLock);
+ }
+}
+
+/*
+ * Call this when done with CLUSTER CONCURRENTLY.
+ *
+ * 'error' tells whether the function is being called in order to handle
+ * error.
+ */
+static void
+end_concurrent_cluster(bool error)
+{
+ ClusteredRel key;
+ ClusteredRel *entry = NULL, *entry_toast = NULL;
+ Oid relid = clustered_rel;
+ Oid toastrelid = clustered_rel_toast;
+
+ /* Remove the relation from the hash if we managed to insert one. */
+ if (OidIsValid(clustered_rel))
+ {
+ memset(&key, 0, sizeof(key));
+ key.relid = clustered_rel;
+ key.dbid = MyDatabaseId;
+ LWLockAcquire(ClusteredRelsLock, LW_EXCLUSIVE);
+ entry = hash_search(ClusteredRelsHash, &key, HASH_REMOVE, NULL);
+
+ /*
+ * By clearing this variable we also disable
+ * cluster_before_shmem_exit_callback().
+ */
+ clustered_rel = InvalidOid;
+ }
+
+ /* Remove the TOAST relation if there is one. */
+ if (OidIsValid(clustered_rel_toast))
+ {
+ key.relid = clustered_rel_toast;
+ entry_toast = hash_search(ClusteredRelsHash, &key, HASH_REMOVE,
+ NULL);
+
+ clustered_rel_toast = InvalidOid;
+ }
+ LWLockRelease(ClusteredRelsLock);
+
+ /* Restore normal function of logical decoding. */
+ clustered_rel_locator.relNumber = InvalidOid;
+ clustered_rel_toast_locator.relNumber = InvalidOid;
+
+ /*
+ * On normal completion (!error), we should not really fail to remove the
+ * entry. But if it wasn't there for any reason, raise ERROR to make sure
+ * the transaction is aborted: if other transactions, while changing the
+ * contents of the relation, didn't know that CLUSTER CONCURRENTLY was in
+ * progress, they could have failed to WAL-log enough information, and thus
+ * we could have produced inconsistent table contents.
+ *
+ * On the other hand, if we are already handling an error, there's no
+ * reason to worry about inconsistent contents of the new storage because
+ * the transaction is going to be rolled back anyway. Furthermore, by
+ * raising ERROR here we'd shadow the original error.
+ */
+ if (!error)
+ {
+ char *relname;
+
+ if (OidIsValid(relid) && entry == NULL)
+ {
+ relname = get_rel_name(relid);
+ if (!relname)
+ ereport(ERROR,
+ (errmsg("cache lookup failed for relation %u",
+ relid)));
+
+ ereport(ERROR,
+ (errmsg("relation \"%s\" not found among clustered relations",
+ relname)));
+ }
+
+ /*
+ * Likewise, the TOAST relation should not have disappeared.
+ */
+ if (OidIsValid(toastrelid) && entry_toast == NULL)
+ {
+ relname = get_rel_name(key.relid);
+ if (!relname)
+ ereport(ERROR,
+ (errmsg("cache lookup failed for relation %u",
+ key.relid)));
+
+ ereport(ERROR,
+ (errmsg("relation \"%s\" not found among clustered relations",
+ relname)));
+ }
+ }
+
+ /*
+ * Note: unlike begin_concurrent_cluster(), here we do not lock/unlock the
+ * relation: 1) On normal completion, the caller is already holding
+ * AccessExclusiveLock (till the end of the transaction), 2) on ERROR /
+ * FATAL, we try to do the cleanup asap, but the worst case is that other
+ * backends will write unnecessary information to WAL until they close the
+ * relation.
+ */
+}
+
+/*
+ * A wrapper to call end_concurrent_cluster() as a before_shmem_exit callback.
+ */
+static void
+cluster_before_shmem_exit_callback(int code, Datum arg)
+{
+ if (OidIsValid(clustered_rel) || OidIsValid(clustered_rel_toast))
+ end_concurrent_cluster(true);
+}
+
+/*
+ * Check if relation is currently being processed by CLUSTER CONCURRENTLY.
+ */
+bool
+is_concurrent_cluster_in_progress(Oid relid)
+{
+ ClusteredRel key, *entry;
+
+ memset(&key, 0, sizeof(key));
+ key.relid = relid;
+ key.dbid = MyDatabaseId;
+
+ LWLockAcquire(ClusteredRelsLock, LW_SHARED);
+ entry = (ClusteredRel *)
+ hash_search(ClusteredRelsHash, &key, HASH_FIND, NULL);
+ LWLockRelease(ClusteredRelsLock);
+
+ return entry != NULL;
+}
+
+/*
+ * Check if VACUUM FULL / CLUSTER CONCURRENTLY is already running for given
+ * relation, and if so, raise ERROR. The problem is that cluster_rel() needs
+ * to release its lock on the relation temporarily at some point, so our lock
+ * alone does not help. Commands that might break what cluster_rel() is doing
+ * should call this function first.
+ *
+ * Return without checking if lockmode allows for race conditions which would
+ * make the result meaningless. In that case, cluster_rel() itself should
+ * throw ERROR if the relation was changed by us in an incompatible
+ * way. However, if it managed to do most of its work by then, a lot of CPU
+ * time might be wasted.
+ */
+void
+check_for_concurrent_cluster(Oid relid, LOCKMODE lockmode)
+{
+ /*
+ * If the caller does not have a lock that conflicts with
+ * ShareUpdateExclusiveLock, the check makes little sense because the
+ * VACUUM FULL / CLUSTER CONCURRENTLY can start anytime after the check.
+ */
+ if (lockmode < ShareUpdateExclusiveLock)
+ return;
+
+ if (is_concurrent_cluster_in_progress(relid))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg(CLUSTER_IN_PROGRESS_MESSAGE,
+ get_rel_name(relid))));
+}
+
+/*
+ * Check if relation is eligible for CLUSTER CONCURRENTLY and retrieve the
+ * catalog state to be passed later to check_catalog_changes.
+ *
+ * Caller is supposed to hold (at least) ShareUpdateExclusiveLock on the
+ * relation.
+ */
+static CatalogState *
+get_catalog_state(Relation rel, bool is_vacuum)
+{
+ CatalogState *result = palloc_object(CatalogState);
+ List *ind_oids;
+ ListCell *lc;
+ int ninds, i;
+ char relpersistence = rel->rd_rel->relpersistence;
+ char replident = rel->rd_rel->relreplident;
+ Oid ident_idx = RelationGetReplicaIndex(rel);
+ TupleDesc td_src = RelationGetDescr(rel);
+
+ /*
+ * While gathering the catalog information, check if there is a reason not
+ * to proceed.
+ *
+ * This function was already called, but the relation has been unlocked
+ * since then (see begin_concurrent_cluster()). check_catalog_changes()
+ * should catch any "disruptive" changes in the future.
+ */
+ check_relation_is_clusterable_concurrently(rel, is_vacuum);
+
+ /* No index should be dropped while we are checking it. */
+ Assert(CheckRelationLockedByMe(rel, ShareUpdateExclusiveLock, true));
+
+ ind_oids = RelationGetIndexList(rel);
+ result->ninds = ninds = list_length(ind_oids);
+ result->ind_oids = palloc_array(Oid, ninds);
+ result->ind_tupdescs = palloc_array(TupleDesc, ninds);
+ i = 0;
+ foreach(lc, ind_oids)
+ {
+ Oid ind_oid = lfirst_oid(lc);
+ Relation index;
+ TupleDesc td_ind_src, td_ind_dst;
+
+ /*
+ * A weaker lock should be OK for the index, but this one should not
+ * break anything either.
+ */
+ index = index_open(ind_oid, ShareUpdateExclusiveLock);
+
+ result->ind_oids[i] = RelationGetRelid(index);
+ td_ind_src = RelationGetDescr(index);
+ td_ind_dst = palloc(TupleDescSize(td_ind_src));
+ TupleDescCopy(td_ind_dst, td_ind_src);
+ result->ind_tupdescs[i] = td_ind_dst;
+ i++;
+
+ index_close(index, ShareUpdateExclusiveLock);
+ }
+
+ /* Fill-in the relation info. */
+ result->tupdesc = palloc(TupleDescSize(td_src));
+ TupleDescCopy(result->tupdesc, td_src);
+ result->relpersistence = relpersistence;
+ result->replident = replident;
+ result->replidindex = ident_idx;
+
+ return result;
+}
+
+static void
+free_catalog_state(CatalogState *state)
+{
+ /* We are only interested in indexes. */
+ if (state->ninds == 0)
+ return;
+
+ for (int i = 0; i < state->ninds; i++)
+ FreeTupleDesc(state->ind_tupdescs[i]);
+
+ FreeTupleDesc(state->tupdesc);
+ pfree(state->ind_oids);
+ pfree(state->ind_tupdescs);
+ pfree(state);
+}
+
+/*
+ * Raise ERROR if 'rel' changed in a way that does not allow further
+ * processing of CLUSTER CONCURRENTLY.
+ *
+ * Besides the relation's tuple descriptor, it's important to check indexes:
+ * concurrent change of index definition (can it happen in any other way than
+ * dropping and re-creating the index, accidentally with the same OID?) can be
+ * a problem because we may already have the new index built. If an index was
+ * created or dropped concurrently, we'd fail to swap the index storage. In
+ * any case, we prefer to check the indexes early to get an explicit error
+ * message about the mismatch. Furthermore, the earlier we detect the change,
+ * the fewer CPU cycles we waste.
+ *
+ * Note that we do not check constraints because the transaction which changed
+ * them must have ensured that the existing tuples satisfy the new
+ * constraints. If any DML commands were necessary for that, we will simply
+ * decode them from WAL and apply them to the new storage.
+ *
+ * Caller is supposed to hold (at least) ShareUpdateExclusiveLock on the
+ * relation.
+ */
+static void
+check_catalog_changes(Relation rel, CatalogState *cat_state)
+{
+ Oid reltoastrelid = rel->rd_rel->reltoastrelid;
+ List *ind_oids;
+ ListCell *lc;
+ LOCKMODE lockmode;
+ Oid ident_idx;
+ TupleDesc td, td_cp;
+
+ /* First, check the relation info. */
+
+ /* TOAST is not easy to change, but check. */
+ if (reltoastrelid != clustered_rel_toast)
+ ereport(ERROR,
+ errmsg("TOAST relation of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+
+ /*
+ * Likewise, check_for_concurrent_cluster() should prevent others from
+ * changing the relation file concurrently, but it's our responsibility to
+ * avoid data loss. (The original locators are stored outside cat_state,
+ * but the check belongs to this function.)
+ */
+ if (!RelFileLocatorEquals(rel->rd_locator, clustered_rel_locator))
+ ereport(ERROR,
+ (errmsg("file of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel))));
+ if (OidIsValid(reltoastrelid))
+ {
+ Relation toastrel;
+
+ toastrel = table_open(reltoastrelid, AccessShareLock);
+ if (!RelFileLocatorEquals(toastrel->rd_locator,
+ clustered_rel_toast_locator))
+ ereport(ERROR,
+ (errmsg("file of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(toastrel))));
+ table_close(toastrel, AccessShareLock);
+ }
+
+ if (rel->rd_rel->relpersistence != cat_state->relpersistence)
+ ereport(ERROR,
+ errmsg("persistence of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+
+ if (cat_state->replident != rel->rd_rel->relreplident)
+ ereport(ERROR,
+ errmsg("replica identity of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+
+ ident_idx = RelationGetReplicaIndex(rel);
+ if (ident_idx == InvalidOid && rel->rd_pkindex != InvalidOid)
+ ident_idx = rel->rd_pkindex;
+ if (cat_state->replidindex != ident_idx)
+ ereport(ERROR,
+ errmsg("identity index of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+
+ /*
+ * As cat_state contains a copy (which has the constraint info cleared),
+ * create a temporary copy for the comparison.
+ */
+ td = RelationGetDescr(rel);
+ td_cp = palloc(TupleDescSize(td));
+ TupleDescCopy(td_cp, td);
+ if (!equalTupleDescs(cat_state->tupdesc, td_cp))
+ ereport(ERROR,
+ errmsg("definition of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+ FreeTupleDesc(td_cp);
+
+ /* Now we are only interested in indexes. */
+ if (cat_state->ninds == 0)
+ return;
+
+ /* No index should be dropped while we are checking the relation. */
+ lockmode = ShareUpdateExclusiveLock;
+ Assert(CheckRelationLockedByMe(rel, lockmode, true));
+
+ ind_oids = RelationGetIndexList(rel);
+ if (list_length(ind_oids) != cat_state->ninds)
+ goto failed_index;
+
+ foreach(lc, ind_oids)
+ {
+ Oid ind_oid = lfirst_oid(lc);
+ int i;
+ TupleDesc tupdesc;
+ Relation index;
+
+ /* Find the index in cat_state. */
+ for (i = 0; i < cat_state->ninds; i++)
+ {
+ if (cat_state->ind_oids[i] == ind_oid)
+ break;
+ }
+ /*
+ * OID not found, i.e. the index was replaced by another one. XXX
+ * Should we still try to find out whether an index having the desired
+ * tuple descriptor exists? Or should we always look for the tuple
+ * descriptor and not use OIDs at all?
+ */
+ if (i == cat_state->ninds)
+ goto failed_index;
+
+ /* Check the tuple descriptor. */
+ index = try_index_open(ind_oid, lockmode);
+ if (index == NULL)
+ goto failed_index;
+ tupdesc = RelationGetDescr(index);
+ if (!equalTupleDescs(cat_state->ind_tupdescs[i], tupdesc))
+ goto failed_index;
+ index_close(index, lockmode);
+ }
+
+ return;
+
+failed_index:
+ ereport(ERROR,
+ (errmsg("index(es) of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel))));
+}
+
+/*
+ * This function is much like pg_create_logical_replication_slot() except that
+ * the new slot is neither released (if anyone else could read changes from
+ * our slot, we could miss changes other backends do while we copy the
+ * existing data into temporary table), nor persisted (it's easier to handle
+ * crash by restarting all the work from scratch).
+ *
+ * XXX Even though CreateInitDecodingContext() does not set state to
+ * RS_PERSISTENT, it does write the slot to disk. We rely on
+ * RestoreSlotFromDisk() to delete ephemeral slots during startup. (Both ERROR
+ * and FATAL should lead to cleanup even before the cluster goes down.)
+ */
+static LogicalDecodingContext *
+setup_logical_decoding(Oid relid, const char *slotname, TupleDesc tupdesc)
+{
+ LogicalDecodingContext *ctx;
+ ClusterDecodingState *dstate;
+
+ /* RS_TEMPORARY so that the slot gets cleaned up on ERROR. */
+ ReplicationSlotCreate(slotname, true, RS_TEMPORARY, false, false, false);
+
+ /*
+ * Neither prepare_write nor do_write callback nor update_progress is
+ * useful for us.
+ *
+ * Regarding the value of need_full_snapshot, we pass false because the
+ * table we are processing is present in ClusteredRelsHash and therefore,
+ * regarding logical decoding, is treated like a catalog.
+ */
+ ctx = CreateInitDecodingContext(REPL_PLUGIN_NAME,
+ NIL,
+ false,
+ InvalidXLogRecPtr,
+ XL_ROUTINE(.page_read = read_local_xlog_page,
+ .segment_open = wal_segment_open,
+ .segment_close = wal_segment_close),
+ NULL, NULL, NULL);
+
+ /*
+ * We don't have control over setting fast_forward, so at least check it.
+ */
+ Assert(!ctx->fast_forward);
+
+ DecodingContextFindStartpoint(ctx);
+
+ /* Some WAL records should have been read. */
+ Assert(ctx->reader->EndRecPtr != InvalidXLogRecPtr);
+
+ XLByteToSeg(ctx->reader->EndRecPtr, cluster_current_segment,
+ wal_segment_size);
+
+ /*
+ * Setup structures to store decoded changes.
+ */
+ dstate = palloc0(sizeof(ClusterDecodingState));
+ dstate->relid = relid;
+ dstate->tstore = tuplestore_begin_heap(false, false,
+ maintenance_work_mem);
+ dstate->tupdesc = tupdesc;
+
+ /* Initialize the descriptor to store the changes ... */
+ dstate->tupdesc_change = CreateTemplateTupleDesc(1);
+
+ TupleDescInitEntry(dstate->tupdesc_change, 1, NULL, BYTEAOID, -1, 0);
+ /* ... as well as the corresponding slot. */
+ dstate->tsslot = MakeSingleTupleTableSlot(dstate->tupdesc_change,
+ &TTSOpsMinimalTuple);
+
+ dstate->resowner = ResourceOwnerCreate(CurrentResourceOwner,
+ "logical decoding");
+
+ ctx->output_writer_private = dstate;
+ return ctx;
+}
+
+/*
+ * Retrieve tuple from ConcurrentChange structure.
+ *
+ * The input data starts with the structure but it might not be appropriately
+ * aligned.
+ */
+static HeapTuple
+get_changed_tuple(char *change)
+{
+ HeapTupleData tup_data;
+ HeapTuple result;
+ char *src;
+
+ /*
+ * Ensure alignment before accessing the fields. (This is why we can't use
+ * heap_copytuple() instead of this function.)
+ */
+ src = change + offsetof(ConcurrentChange, tup_data);
+ memcpy(&tup_data, src, sizeof(HeapTupleData));
+
+ result = (HeapTuple) palloc(HEAPTUPLESIZE + tup_data.t_len);
+ memcpy(result, &tup_data, sizeof(HeapTupleData));
+ result->t_data = (HeapTupleHeader) ((char *) result + HEAPTUPLESIZE);
+ src = change + SizeOfConcurrentChange;
+ memcpy(result->t_data, src, result->t_len);
+
+ return result;
+}
+
+/*
+ * Decode logical changes from the WAL sequence up to end_of_wal.
+ */
+void
+cluster_decode_concurrent_changes(LogicalDecodingContext *ctx,
+ XLogRecPtr end_of_wal)
+{
+ ClusterDecodingState *dstate;
+ ResourceOwner resowner_old;
+ PgBackendProgress progress;
+
+ /*
+ * Invalidate the "present" cache before moving to "(recent) history".
+ */
+ InvalidateSystemCaches();
+
+ dstate = (ClusterDecodingState *) ctx->output_writer_private;
+ resowner_old = CurrentResourceOwner;
+ CurrentResourceOwner = dstate->resowner;
+
+ /*
+ * reorderbuffer.c uses an internal subtransaction, whose abort ends the
+ * command progress reporting. Save the status here so we can restore it
+ * when done with the decoding.
+ */
+ memcpy(&progress, &MyBEEntry->st_progress, sizeof(PgBackendProgress));
+
+ PG_TRY();
+ {
+ while (ctx->reader->EndRecPtr < end_of_wal)
+ {
+ XLogRecord *record;
+ XLogSegNo segno_new;
+ char *errm = NULL;
+ XLogRecPtr end_lsn;
+
+ record = XLogReadRecord(ctx->reader, &errm);
+ if (errm)
+ elog(ERROR, "%s", errm);
+
+ if (record != NULL)
+ LogicalDecodingProcessRecord(ctx, ctx->reader);
+
+ /*
+ * If WAL segment boundary has been crossed, inform the decoding
+ * system that the catalog_xmin can advance. (We can confirm more
+ * often, but filling a single WAL segment should not take much
+ * time.)
+ */
+ end_lsn = ctx->reader->EndRecPtr;
+ XLByteToSeg(end_lsn, segno_new, wal_segment_size);
+ if (segno_new != cluster_current_segment)
+ {
+ LogicalConfirmReceivedLocation(end_lsn);
+ elog(DEBUG1, "cluster: confirmed receive location %X/%X",
+ (uint32) (end_lsn >> 32), (uint32) end_lsn);
+ cluster_current_segment = segno_new;
+ }
+
+ CHECK_FOR_INTERRUPTS();
+ }
+ InvalidateSystemCaches();
+ CurrentResourceOwner = resowner_old;
+ }
+ PG_CATCH();
+ {
+ InvalidateSystemCaches();
+ CurrentResourceOwner = resowner_old;
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+
+ /* Restore the progress reporting status. */
+ pgstat_progress_restore_state(&progress);
+}
+
+/*
+ * Apply changes that happened during the initial load.
+ *
+ * The scan key is passed by the caller, so it does not have to be constructed
+ * multiple times. Key entries have all fields initialized, except for
+ * sk_argument.
+ */
+static void
+apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
+ ScanKey key, int nkeys, IndexInsertState *iistate)
+{
+ TupleTableSlot *index_slot, *ident_slot;
+ HeapTuple tup_old = NULL;
+
+ if (dstate->nchanges == 0)
+ return;
+
+ /* TupleTableSlot is needed to pass the tuple to ExecInsertIndexTuples(). */
+ index_slot = MakeSingleTupleTableSlot(dstate->tupdesc, &TTSOpsHeapTuple);
+ iistate->econtext->ecxt_scantuple = index_slot;
+
+ /* A slot to fetch tuples from identity index. */
+ ident_slot = table_slot_create(rel, NULL);
+
+ while (tuplestore_gettupleslot(dstate->tstore, true, false,
+ dstate->tsslot))
+ {
+ bool shouldFree;
+ HeapTuple tup_change,
+ tup,
+ tup_exist;
+ char *change_raw, *src;
+ ConcurrentChange change;
+ bool isnull[1];
+ Datum values[1];
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Get the change from the single-column tuple. */
+ tup_change = ExecFetchSlotHeapTuple(dstate->tsslot, false, &shouldFree);
+ heap_deform_tuple(tup_change, dstate->tupdesc_change, values, isnull);
+ Assert(!isnull[0]);
+
+ /* Make sure we access aligned data. */
+ change_raw = (char *) DatumGetByteaP(values[0]);
+ src = (char *) VARDATA(change_raw);
+ memcpy(&change, src, SizeOfConcurrentChange);
+
+ /* TRUNCATE change contains no tuple, so process it separately. */
+ if (change.kind == CHANGE_TRUNCATE)
+ {
+ /*
+ * All the things that ExecuteTruncateGuts() does (such as firing
+ * triggers or handling the DROP_CASCADE behavior) should have
+ * taken place on the source relation. Thus we only do the actual
+ * truncation of the new relation (and its indexes).
+ */
+ heap_truncate_one_rel(rel);
+
+ pfree(tup_change);
+ continue;
+ }
+
+ /*
+ * Extract the tuple from the change. The tuple is copied here because
+ * it might be assigned to 'tup_old', in which case it needs to
+ * survive into the next iteration.
+ */
+ tup = get_changed_tuple(src);
+
+ if (change.kind == CHANGE_UPDATE_OLD)
+ {
+ Assert(tup_old == NULL);
+ tup_old = tup;
+ }
+ else if (change.kind == CHANGE_INSERT)
+ {
+ Assert(tup_old == NULL);
+
+ apply_concurrent_insert(rel, &change, tup, iistate, index_slot);
+
+ pfree(tup);
+ }
+ else if (change.kind == CHANGE_UPDATE_NEW ||
+ change.kind == CHANGE_DELETE)
+ {
+ IndexScanDesc ind_scan = NULL;
+ HeapTuple tup_key;
+
+ if (change.kind == CHANGE_UPDATE_NEW)
+ {
+ tup_key = tup_old != NULL ? tup_old : tup;
+ }
+ else
+ {
+ Assert(tup_old == NULL);
+ tup_key = tup;
+ }
+
+ /*
+ * Find the tuple to be updated or deleted.
+ */
+ tup_exist = find_target_tuple(rel, key, nkeys, tup_key,
+ iistate, ident_slot, &ind_scan);
+ if (tup_exist == NULL)
+ elog(ERROR, "failed to find target tuple");
+
+ if (change.kind == CHANGE_UPDATE_NEW)
+ apply_concurrent_update(rel, tup, tup_exist, &change, iistate,
+ index_slot);
+ else
+ apply_concurrent_delete(rel, tup_exist, &change);
+
+ if (tup_old != NULL)
+ {
+ pfree(tup_old);
+ tup_old = NULL;
+ }
+
+ pfree(tup);
+ index_endscan(ind_scan);
+ }
+ else
+ elog(ERROR, "unrecognized change kind: %d", change.kind);
+
+ /* If there's any change, make it visible to the next iteration. */
+ if (change.kind != CHANGE_UPDATE_OLD)
+ {
+ CommandCounterIncrement();
+ UpdateActiveSnapshotCommandId();
+ }
+
+ /* TTSOpsMinimalTuple has .get_heap_tuple==NULL. */
+ Assert(shouldFree);
+ pfree(tup_change);
+ }
+
+ tuplestore_clear(dstate->tstore);
+ dstate->nchanges = 0;
+
+ /* Cleanup. */
+ ExecDropSingleTupleTableSlot(index_slot);
+ ExecDropSingleTupleTableSlot(ident_slot);
+}
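The control flow in `apply_concurrent_changes()` treats an UPDATE as a pair of changes: `CHANGE_UPDATE_OLD` only stashes the old key tuple, and `CHANGE_UPDATE_NEW` consumes it (falling back to the new tuple when the key did not change). A toy model of that pairing, with invented names standing in for the patch's types:

```c
#include <assert.h>

/* Invented stand-ins for the patch's ConcurrentChange kinds. */
typedef enum { SK_INSERT, SK_UPDATE_OLD, SK_UPDATE_NEW, SK_DELETE } SketchKind;

typedef struct
{
    int stash;              /* key tuple saved by the UPDATE_OLD half */
    int has_stash;
    int inserts, updates, deletes;
} SketchState;

/* Returns the key used to locate the target row, or -1 when the change
 * needs no lookup (INSERT, or the stash-only UPDATE_OLD half). */
static int
sketch_apply(SketchState *st, SketchKind kind, int tuple)
{
    int key = -1;

    switch (kind)
    {
        case SK_UPDATE_OLD:
            assert(!st->has_stash); /* at most one pending OLD half */
            st->stash = tuple;
            st->has_stash = 1;
            break;
        case SK_INSERT:
            assert(!st->has_stash);
            st->inserts++;
            break;
        case SK_UPDATE_NEW:
            /* key comes from the stashed OLD tuple if the key changed,
             * otherwise from the NEW tuple itself */
            key = st->has_stash ? st->stash : tuple;
            st->has_stash = 0;
            st->updates++;
            break;
        case SK_DELETE:
            assert(!st->has_stash);
            st->deletes++;
            key = tuple;
            break;
    }
    return key;
}
```

This mirrors why the patch's `tup_old` must be NULL for every kind except `CHANGE_UPDATE_NEW`: an OLD half may only be immediately followed by its NEW half.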
+
+static void
+apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
+ IndexInsertState *iistate, TupleTableSlot *index_slot)
+{
+ List *recheck;
+
+ heap_insert(rel, tup, GetCurrentCommandId(true), HEAP_INSERT_NO_LOGICAL, NULL);
+
+ /*
+ * Update indexes. (Functions evaluated during the index insertions may
+ * need an active snapshot, which the caller is expected to have set.)
+ */
+ ExecStoreHeapTuple(tup, index_slot, false);
+ recheck = ExecInsertIndexTuples(iistate->rri,
+ index_slot,
+ iistate->estate,
+ false, /* update */
+ false, /* noDupErr */
+ NULL, /* specConflict */
+ NIL, /* arbiterIndexes */
+ false /* onlySummarizing */
+ );
+
+ /*
+ * If recheck is required, it must have been performed on the source
+ * relation by now. (All the logical changes we process here are already
+ * committed.)
+ */
+ list_free(recheck);
+
+ pgstat_progress_incr_param(PROGRESS_CLUSTER_HEAP_TUPLES_INSERTED, 1);
+}
+
+static void
+apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
+ ConcurrentChange *change, IndexInsertState *iistate,
+ TupleTableSlot *index_slot)
+{
+ List *recheck;
+ TU_UpdateIndexes update_indexes;
+
+ /*
+ * Write the new tuple into the new heap. ('tup' gets the TID assigned
+ * here.)
+ */
+ simple_heap_update(rel, &tup_target->t_self, tup, &update_indexes);
+
+ ExecStoreHeapTuple(tup, index_slot, false);
+
+ if (update_indexes != TU_None)
+ {
+ recheck = ExecInsertIndexTuples(iistate->rri,
+ index_slot,
+ iistate->estate,
+ true, /* update */
+ false, /* noDupErr */
+ NULL, /* specConflict */
+ NIL, /* arbiterIndexes */
+ /* onlySummarizing */
+ update_indexes == TU_Summarizing);
+ list_free(recheck);
+ }
+
+ pgstat_progress_incr_param(PROGRESS_CLUSTER_HEAP_TUPLES_UPDATED, 1);
+}
+
+static void
+apply_concurrent_delete(Relation rel, HeapTuple tup_target,
+ ConcurrentChange *change)
+{
+ simple_heap_delete(rel, &tup_target->t_self);
+
+ pgstat_progress_incr_param(PROGRESS_CLUSTER_HEAP_TUPLES_DELETED, 1);
+}
+
+/*
+ * Find the tuple to be updated or deleted.
+ *
+ * 'key' is a pre-initialized scan key, into which the function will put the
+ * key values.
+ *
+ * 'tup_key' is a tuple containing the key values for the scan.
+ *
+ * On exit, '*scan_p' contains the scan descriptor used. The caller must
+ * close it once the returned tuple is no longer needed.
+ */
+static HeapTuple
+find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
+ IndexInsertState *iistate,
+ TupleTableSlot *ident_slot, IndexScanDesc *scan_p)
+{
+ IndexScanDesc scan;
+ Form_pg_index ident_form;
+ int2vector *ident_indkey;
+ HeapTuple result = NULL;
+
+ scan = index_beginscan(rel, iistate->ident_index, GetActiveSnapshot(),
+ nkeys, 0);
+ *scan_p = scan;
+ index_rescan(scan, key, nkeys, NULL, 0);
+
+ /* Info needed to retrieve key values from heap tuple. */
+ ident_form = iistate->ident_index->rd_index;
+ ident_indkey = &ident_form->indkey;
+
+ /* Use the incoming tuple to finalize the scan key. */
+ for (int i = 0; i < scan->numberOfKeys; i++)
+ {
+ ScanKey entry;
+ bool isnull;
+ int16 attno_heap;
+
+ entry = &scan->keyData[i];
+ attno_heap = ident_indkey->values[i];
+ entry->sk_argument = heap_getattr(tup_key,
+ attno_heap,
+ rel->rd_att,
+ &isnull);
+ Assert(!isnull);
+ }
+ if (index_getnext_slot(scan, ForwardScanDirection, ident_slot))
+ {
+ bool shouldFree;
+
+ result = ExecFetchSlotHeapTuple(ident_slot, false, &shouldFree);
+ /* TTSOpsBufferHeapTuple has .get_heap_tuple != NULL. */
+ Assert(!shouldFree);
+ }
+
+ return result;
+}
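The key-finalization loop in `find_target_tuple()` maps each index column `i` to heap attribute number `indkey[i]` and copies that attribute's value from the incoming tuple into the scan key. Reduced to plain arrays (hypothetical names; the real code uses `heap_getattr` with 1-based attribute numbers):

```c
#include <assert.h>

/* Fill one scan-key argument per index column: column i of the identity
 * index stores heap attribute number indkey[i] (1-based), and the key
 * argument is that attribute's value from the incoming tuple. */
static void
finalize_keys(long *sk_argument, const int *indkey, int nkeys,
              const long *heap_values)   /* heap_values[attno - 1] */
{
    for (int i = 0; i < nkeys; i++)
        sk_argument[i] = heap_values[indkey[i] - 1];
}
```

The indirection matters because the identity index may cover heap columns in any order, so index column positions and heap attribute numbers generally differ.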
+
+/*
+ * Decode and apply concurrent changes.
+ *
+ * Pass rel_src iff its reltoastrelid is needed.
+ */
+static void
+process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
+ Relation rel_dst, Relation rel_src, ScanKey ident_key,
+ int ident_key_nentries, IndexInsertState *iistate)
+{
+ ClusterDecodingState *dstate;
+
+ pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
+ PROGRESS_CLUSTER_PHASE_CATCH_UP);
+
+ dstate = (ClusterDecodingState *) ctx->output_writer_private;
+
+ cluster_decode_concurrent_changes(ctx, end_of_wal);
+
+ if (dstate->nchanges == 0)
+ return;
+
+ PG_TRY();
+ {
+ /*
+ * Make sure that TOAST values can eventually be accessed via the old
+ * relation - see comment in copy_table_data().
+ */
+ if (rel_src)
+ rel_dst->rd_toastoid = rel_src->rd_rel->reltoastrelid;
+
+ apply_concurrent_changes(dstate, rel_dst, ident_key,
+ ident_key_nentries, iistate);
+ }
+ PG_FINALLY();
+ {
+ if (rel_src)
+ rel_dst->rd_toastoid = InvalidOid;
+ }
+ PG_END_TRY();
+}
+
+static IndexInsertState *
+get_index_insert_state(Relation relation, Oid ident_index_id)
+{
+ EState *estate;
+ int i;
+ IndexInsertState *result;
+
+ result = (IndexInsertState *) palloc0(sizeof(IndexInsertState));
+ estate = CreateExecutorState();
+ result->econtext = GetPerTupleExprContext(estate);
+
+ result->rri = (ResultRelInfo *) palloc(sizeof(ResultRelInfo));
+ InitResultRelInfo(result->rri, relation, 0, 0, 0);
+ ExecOpenIndices(result->rri, false);
+
+ /*
+ * Find the relcache entry of the identity index so that we spend no extra
+ * effort to open / close it.
+ */
+ for (i = 0; i < result->rri->ri_NumIndices; i++)
+ {
+ Relation ind_rel;
+
+ ind_rel = result->rri->ri_IndexRelationDescs[i];
+ if (ind_rel->rd_id == ident_index_id)
+ result->ident_index = ind_rel;
+ }
+ if (result->ident_index == NULL)
+ elog(ERROR, "failed to open identity index");
+
+ /* Only initialize fields needed by ExecInsertIndexTuples(). */
+ result->estate = estate;
+
+ return result;
+}
+
+/*
+ * Build scan key to process logical changes.
+ */
+static ScanKey
+build_identity_key(Oid ident_idx_oid, Relation rel_src, int *nentries)
+{
+ Relation ident_idx_rel;
+ Form_pg_index ident_idx;
+ int n,
+ i;
+ ScanKey result;
+
+ Assert(OidIsValid(ident_idx_oid));
+ ident_idx_rel = index_open(ident_idx_oid, AccessShareLock);
+ ident_idx = ident_idx_rel->rd_index;
+ n = ident_idx->indnatts;
+ result = (ScanKey) palloc(sizeof(ScanKeyData) * n);
+ for (i = 0; i < n; i++)
+ {
+ ScanKey entry;
+ int16 relattno;
+ Form_pg_attribute att;
+ Oid opfamily,
+ opcintype,
+ opno,
+ opcode;
+
+ entry = &result[i];
+ relattno = ident_idx->indkey.values[i];
+ if (relattno >= 1)
+ {
+ TupleDesc desc;
+
+ desc = rel_src->rd_att;
+ att = TupleDescAttr(desc, relattno - 1);
+ }
+ else
+ elog(ERROR, "unexpected attribute number %d in index", relattno);
+
+ opfamily = ident_idx_rel->rd_opfamily[i];
+ opcintype = ident_idx_rel->rd_opcintype[i];
+ opno = get_opfamily_member(opfamily, opcintype, opcintype,
+ BTEqualStrategyNumber);
+
+ if (!OidIsValid(opno))
+ elog(ERROR, "failed to find = operator for type %u", opcintype);
+
+ opcode = get_opcode(opno);
+ if (!OidIsValid(opcode))
+ elog(ERROR, "failed to find function for operator %u", opno);
+
+ /* Initialize everything but argument. */
+ ScanKeyInit(entry,
+ i + 1,
+ BTEqualStrategyNumber, opcode,
+ (Datum) NULL);
+ entry->sk_collation = att->attcollation;
+ }
+ index_close(ident_idx_rel, AccessShareLock);
+
+ *nentries = n;
+ return result;
+}
+
+static void
+free_index_insert_state(IndexInsertState *iistate)
+{
+ ExecCloseIndices(iistate->rri);
+ FreeExecutorState(iistate->estate);
+ pfree(iistate->rri);
+ pfree(iistate);
+}
+
+static void
+cleanup_logical_decoding(LogicalDecodingContext *ctx)
+{
+ ClusterDecodingState *dstate;
+
+ dstate = (ClusterDecodingState *) ctx->output_writer_private;
+
+ ExecDropSingleTupleTableSlot(dstate->tsslot);
+ FreeTupleDesc(dstate->tupdesc_change);
+ FreeTupleDesc(dstate->tupdesc);
+ tuplestore_end(dstate->tstore);
+
+ FreeDecodingContext(ctx);
+}
+
+/*
+ * The final steps of rebuild_relation() for concurrent processing.
+ *
+ * On entry, NewHeap is locked in AccessExclusiveLock mode. OldHeap and its
+ * clustering index (if one is passed) are still locked in a mode that allows
+ * concurrent data changes. On exit, both tables and their indexes are closed,
+ * but locked in AccessExclusiveLock mode.
+ */
+static void
+rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
+ Relation cl_index,
+ CatalogState *cat_state,
+ LogicalDecodingContext *ctx,
+ bool swap_toast_by_content,
+ TransactionId frozenXid,
+ MultiXactId cutoffMulti)
+{
+ LOCKMODE lockmode_old PG_USED_FOR_ASSERTS_ONLY;
+ List *ind_oids_new;
+ Oid old_table_oid = RelationGetRelid(OldHeap);
+ Oid new_table_oid = RelationGetRelid(NewHeap);
+ List *ind_oids_old = RelationGetIndexList(OldHeap);
+ ListCell *lc, *lc2;
+ char relpersistence;
+ bool is_system_catalog;
+ Oid ident_idx_old, ident_idx_new;
+ IndexInsertState *iistate;
+ ScanKey ident_key;
+ int ident_key_nentries;
+ XLogRecPtr wal_insert_ptr, end_of_wal;
+ char dummy_rec_data = '\0';
+ RelReopenInfo *rri = NULL;
+ int nrel;
+ Relation *ind_refs_all, *ind_refs_p;
+
+ /* Like in cluster_rel(). */
+ lockmode_old = ShareUpdateExclusiveLock;
+ Assert(CheckRelationLockedByMe(OldHeap, lockmode_old, false));
+ Assert(cl_index == NULL ||
+ CheckRelationLockedByMe(cl_index, lockmode_old, false));
+ /* This is expected from the caller. */
+ Assert(CheckRelationLockedByMe(NewHeap, AccessExclusiveLock, false));
+
+ ident_idx_old = RelationGetReplicaIndex(OldHeap);
+
+ /*
+ * Unlike the exclusive case, we build new indexes for the new relation
+ * rather than swapping the storage and reindexing the old relation. The
+ * point is that the index build can take some time, so we do it before we
+ * get AccessExclusiveLock on the old heap and therefore we cannot swap
+ * the heap storage yet.
+ *
+ * index_create() locks the new indexes in AccessExclusiveLock mode
+ * during creation - no need to change that.
+ */
+ ind_oids_new = build_new_indexes(NewHeap, OldHeap, ind_oids_old);
+
+ /*
+ * Processing should not start without a valid identity index.
+ */
+ Assert(OidIsValid(ident_idx_old));
+
+ /* Find "identity index" on the new relation. */
+ ident_idx_new = InvalidOid;
+ forboth(lc, ind_oids_old, lc2, ind_oids_new)
+ {
+ Oid ind_old = lfirst_oid(lc);
+ Oid ind_new = lfirst_oid(lc2);
+
+ if (ident_idx_old == ind_old)
+ {
+ ident_idx_new = ind_new;
+ break;
+ }
+ }
+ if (!OidIsValid(ident_idx_new))
+ /*
+ * Should not happen, given our lock on the old relation.
+ */
+ ereport(ERROR,
+ (errmsg("identity index missing on the new relation")));
+
+ /* Executor state to update indexes. */
+ iistate = get_index_insert_state(NewHeap, ident_idx_new);
+
+ /*
+ * Build scan key that we'll use to look for rows to be updated / deleted
+ * during logical decoding.
+ */
+ ident_key = build_identity_key(ident_idx_new, OldHeap, &ident_key_nentries);
+
+ /*
+ * Flush all WAL records inserted so far (possibly except for the last
+ * incomplete page, see GetInsertRecPtr), to minimize the amount of data
+ * we need to flush while holding exclusive lock on the source table.
+ */
+ wal_insert_ptr = GetInsertRecPtr();
+ XLogFlush(wal_insert_ptr);
+ end_of_wal = GetFlushRecPtr(NULL);
+
+ /*
+ * Apply concurrent changes first time, to minimize the time we need to
+ * hold AccessExclusiveLock. (Quite some amount of WAL could have been
+ * written during the data copying and index creation.)
+ */
+ process_concurrent_changes(ctx, end_of_wal, NewHeap,
+ swap_toast_by_content ? OldHeap : NULL,
+ ident_key, ident_key_nentries, iistate);
+
+ /*
+ * Release the locks that allowed concurrent data changes, in order to
+ * acquire the AccessExclusiveLock.
+ */
+ nrel = 0;
+ /*
+ * We unlock the old relation (and its clustering index), but then we will
+ * lock the relation and *all* its indexes because we want to swap their
+ * storage.
+ *
+ * (NewHeap is already locked, as well as its indexes.)
+ */
+ rri = palloc_array(RelReopenInfo, 1 + list_length(ind_oids_old));
+ init_rel_reopen_info(&rri[nrel++], &OldHeap, InvalidOid,
+ ShareUpdateExclusiveLock, AccessExclusiveLock);
+ /* References to the re-opened indexes will be stored in this array. */
+ ind_refs_all = palloc_array(Relation, list_length(ind_oids_old));
+ ind_refs_p = ind_refs_all;
+ /* The clustering index is a special case. */
+ if (cl_index)
+ {
+ *ind_refs_p = cl_index;
+ init_rel_reopen_info(&rri[nrel], ind_refs_p, InvalidOid,
+ ShareUpdateExclusiveLock, AccessExclusiveLock);
+ nrel++;
+ ind_refs_p++;
+ }
+ /*
+ * Also initialize the entries for the other indexes (currently
+ * unlocked), because we will have to lock them.
+ */
+ foreach(lc, ind_oids_old)
+ {
+ Oid ind_oid;
+
+ ind_oid = lfirst_oid(lc);
+ /* Clustering index is already in the array, or there is none. */
+ if (cl_index && RelationGetRelid(cl_index) == ind_oid)
+ continue;
+
+ Assert(nrel < (1 + list_length(ind_oids_old)));
+
+ *ind_refs_p = NULL;
+ init_rel_reopen_info(&rri[nrel],
+ /*
+ * In this special case we do not have the
+ * relcache reference, use OID instead.
+ */
+ ind_refs_p,
+ ind_oid,
+ NoLock, /* Nothing to unlock. */
+ AccessExclusiveLock);
+
+ nrel++;
+ ind_refs_p++;
+ }
+ /* Perform the actual unlocking and re-locking. */
+ unlock_and_close_relations(rri, nrel);
+ reopen_relations(rri, nrel);
+
+ /*
+ * In addition, lock the OldHeap's TOAST relation that we skipped for the
+ * CONCURRENTLY option in copy_table_data(). This lock will be needed to
+ * swap the relation files.
+ */
+ if (OidIsValid(OldHeap->rd_rel->reltoastrelid))
+ LockRelationOid(OldHeap->rd_rel->reltoastrelid, AccessExclusiveLock);
+
+ /*
+ * Check if the new indexes match the old ones, i.e. no changes occurred
+ * while OldHeap was unlocked.
+ *
+ * XXX It's probably not necessary to check the relation tuple descriptor
+ * here because the logical decoding was already active when we released
+ * the lock, and thus the corresponding data changes won't be lost.
+ * However processing of those changes might take a lot of time.
+ */
+ check_catalog_changes(OldHeap, cat_state);
+
+ /*
+ * Tuples and pages of the old heap will be gone, but the heap will stay.
+ */
+ TransferPredicateLocksToHeapRelation(OldHeap);
+ /* The same for indexes. */
+ for (int i = 0; i < (nrel - 1); i++)
+ {
+ Relation index = ind_refs_all[i];
+
+ TransferPredicateLocksToHeapRelation(index);
+
+ /*
+ * References to indexes on the old relation are not needed anymore,
+ * however locks stay till the end of the transaction.
+ */
+ index_close(index, NoLock);
+ }
+ pfree(ind_refs_all);
+
+ /*
+ * Flush anything we see in WAL, to make sure that all changes committed
+ * while we were waiting for the exclusive lock are available for
+ * decoding. This should not be necessary if all backends had
+ * synchronous_commit set, but we can't rely on this setting.
+ *
+ * Unfortunately, GetInsertRecPtr() may lag behind the actual insert
+ * position, and GetLastImportantRecPtr() points at the start of the last
+ * record rather than at the end. Thus the simplest way to determine the
+ * insert position is to insert a dummy record and use its LSN.
+ *
+ * XXX Consider using GetLastImportantRecPtr() and adding the size of the
+ * last record (plus the total size of all the page headers the record
+ * spans)?
+ */
+ XLogBeginInsert();
+ XLogRegisterData(&dummy_rec_data, 1);
+ wal_insert_ptr = XLogInsert(RM_XLOG_ID, XLOG_NOOP);
+ XLogFlush(wal_insert_ptr);
+ end_of_wal = GetFlushRecPtr(NULL);
+
+ /* Apply the concurrent changes again. */
+ process_concurrent_changes(ctx, end_of_wal, NewHeap,
+ swap_toast_by_content ? OldHeap : NULL,
+ ident_key, ident_key_nentries, iistate);
+
+ /* Remember info about rel before closing OldHeap */
+ relpersistence = OldHeap->rd_rel->relpersistence;
+ is_system_catalog = IsSystemRelation(OldHeap);
+
+ pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
+ PROGRESS_CLUSTER_PHASE_SWAP_REL_FILES);
+
+ forboth(lc, ind_oids_old, lc2, ind_oids_new)
+ {
+ Oid ind_old = lfirst_oid(lc);
+ Oid ind_new = lfirst_oid(lc2);
+ Oid mapped_tables[4];
+
+ /* Zero out possible results from swap_relation_files */
+ memset(mapped_tables, 0, sizeof(mapped_tables));
+
+ swap_relation_files(ind_old, ind_new,
+ (old_table_oid == RelationRelationId),
+ swap_toast_by_content,
+ true,
+ InvalidTransactionId,
+ InvalidMultiXactId,
+ mapped_tables);
+
+#ifdef USE_ASSERT_CHECKING
+ /*
+ * Concurrent processing is not supported for system relations, so
+ * there should be no mapped tables.
+ */
+ for (int i = 0; i < 4; i++)
+ Assert(mapped_tables[i] == 0);
+#endif
+ }
+
+ /* The new indexes must be visible for deletion. */
+ CommandCounterIncrement();
+
+ /* Close the old heap but keep lock until transaction commit. */
+ table_close(OldHeap, NoLock);
+ /* Close the new heap. (We didn't have to open its indexes). */
+ table_close(NewHeap, NoLock);
+
+ /* Cleanup what we don't need anymore. (And close the identity index.) */
+ pfree(ident_key);
+ free_index_insert_state(iistate);
+
+ /*
+ * Swap the relations and their TOAST relations and TOAST indexes. This
+ * also drops the new relation and its indexes.
+ *
+ * (System catalogs are currently not supported.)
+ */
+ Assert(!is_system_catalog);
+ finish_heap_swap(old_table_oid, new_table_oid,
+ is_system_catalog,
+ swap_toast_by_content,
+ false, true, false,
+ frozenXid, cutoffMulti,
+ relpersistence);
+
+ pfree(rri);
+}
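The overall shape of `rebuild_relation_finish_concurrent()` - bulk copy under a weak lock, a first catch-up pass to drain most of the backlog, then a short final pass under `AccessExclusiveLock` - can be modeled as replaying a change log over an array. Everything here is an illustrative toy, not the patch's WAL decoding:

```c
#include <assert.h>
#include <string.h>

#define NROWS 8

/* Replay logged changes [*applied, upto) into the copy. Each log entry
 * packs (row index << 8) | new value, a stand-in for decoded WAL. */
static void
replay(int *copy, const int *log, int *applied, int upto)
{
    for (; *applied < upto; (*applied)++)
        copy[log[*applied] >> 8] = log[*applied] & 0xff;
}
```

The point of the two passes is that the first one runs while writers are still allowed, so the final pass under the exclusive lock only has to replay the few changes that slipped in between, keeping the blocking window short.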
+
+/*
+ * Build indexes on NewHeap according to those on OldHeap.
+ *
+ * OldIndexes is the list of index OIDs on OldHeap.
+ *
+ * A list of OIDs of the corresponding indexes created on NewHeap is
+ * returned. The order of items matches OldIndexes, so the two lists can
+ * be used to swap index storage.
+ */
+static List *
+build_new_indexes(Relation NewHeap, Relation OldHeap, List *OldIndexes)
+{
+ StringInfo ind_name;
+ ListCell *lc;
+ List *result = NIL;
+
+ pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
+ PROGRESS_CLUSTER_PHASE_REBUILD_INDEX);
+
+ ind_name = makeStringInfo();
+
+ foreach(lc, OldIndexes)
+ {
+ Oid ind_oid,
+ ind_oid_new,
+ tbsp_oid;
+ Relation ind;
+ IndexInfo *ind_info;
+ int i,
+ heap_col_id;
+ List *colnames;
+ int16 indnatts;
+ Oid *collations,
+ *opclasses;
+ HeapTuple tup;
+ bool isnull;
+ Datum d;
+ oidvector *oidvec;
+ int2vector *int2vec;
+ size_t oid_arr_size;
+ size_t int2_arr_size;
+ int16 *indoptions;
+ text *reloptions = NULL;
+ bits16 flags;
+ Datum *opclassOptions;
+ NullableDatum *stattargets;
+
+ ind_oid = lfirst_oid(lc);
+ ind = index_open(ind_oid, AccessShareLock);
+ ind_info = BuildIndexInfo(ind);
+
+ tbsp_oid = ind->rd_rel->reltablespace;
+ /*
+ * Index names do not really matter because we will eventually use
+ * only their storage. Just make them unique within the table.
+ */
+ resetStringInfo(ind_name);
+ appendStringInfo(ind_name, "ind_%d",
+ list_cell_number(OldIndexes, lc));
+
+ flags = 0;
+ if (ind->rd_index->indisprimary)
+ flags |= INDEX_CREATE_IS_PRIMARY;
+
+ colnames = NIL;
+ indnatts = ind->rd_index->indnatts;
+ oid_arr_size = sizeof(Oid) * indnatts;
+ int2_arr_size = sizeof(int16) * indnatts;
+
+ collations = (Oid *) palloc(oid_arr_size);
+ for (i = 0; i < indnatts; i++)
+ {
+ char *colname;
+
+ heap_col_id = ind->rd_index->indkey.values[i];
+ if (heap_col_id > 0)
+ {
+ Form_pg_attribute att;
+
+ /* Normal attribute. */
+ att = TupleDescAttr(OldHeap->rd_att, heap_col_id - 1);
+ colname = pstrdup(NameStr(att->attname));
+ collations[i] = att->attcollation;
+ }
+ else if (heap_col_id == 0)
+ {
+ HeapTuple tuple;
+ Form_pg_attribute att;
+
+ /*
+ * Expression column is not present in relcache. What we need
+ * here is an attribute of the *index* relation.
+ */
+ tuple = SearchSysCache2(ATTNUM,
+ ObjectIdGetDatum(ind_oid),
+ Int16GetDatum(i + 1));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR,
+ "cache lookup failed for attribute %d of relation %u",
+ i + 1, ind_oid);
+ att = (Form_pg_attribute) GETSTRUCT(tuple);
+ colname = pstrdup(NameStr(att->attname));
+ collations[i] = att->attcollation;
+ ReleaseSysCache(tuple);
+ }
+ else
+ elog(ERROR, "unexpected column number: %d",
+ heap_col_id);
+
+ colnames = lappend(colnames, colname);
+ }
+
+ /*
+ * Special effort needed for variable length attributes of
+ * Form_pg_index.
+ */
+ tup = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(ind_oid));
+ if (!HeapTupleIsValid(tup))
+ elog(ERROR, "cache lookup failed for index %u", ind_oid);
+ d = SysCacheGetAttr(INDEXRELID, tup, Anum_pg_index_indclass, &isnull);
+ Assert(!isnull);
+ oidvec = (oidvector *) DatumGetPointer(d);
+ opclasses = (Oid *) palloc(oid_arr_size);
+ memcpy(opclasses, oidvec->values, oid_arr_size);
+
+ d = SysCacheGetAttr(INDEXRELID, tup, Anum_pg_index_indoption,
+ &isnull);
+ Assert(!isnull);
+ int2vec = (int2vector *) DatumGetPointer(d);
+ indoptions = (int16 *) palloc(int2_arr_size);
+ memcpy(indoptions, int2vec->values, int2_arr_size);
+ ReleaseSysCache(tup);
+
+ tup = SearchSysCache1(RELOID, ObjectIdGetDatum(ind_oid));
+ if (!HeapTupleIsValid(tup))
+ elog(ERROR, "cache lookup failed for index relation %u", ind_oid);
+ d = SysCacheGetAttr(RELOID, tup, Anum_pg_class_reloptions, &isnull);
+ reloptions = !isnull ? DatumGetTextPCopy(d) : NULL;
+ ReleaseSysCache(tup);
+
+ opclassOptions = palloc0(sizeof(Datum) * ind_info->ii_NumIndexAttrs);
+ for (i = 0; i < ind_info->ii_NumIndexAttrs; i++)
+ opclassOptions[i] = get_attoptions(ind_oid, i + 1);
+
+ stattargets = get_index_stattargets(ind_oid, ind_info);
+
+ /*
+ * Neither parentIndexRelid nor parentConstraintId needs to be passed
+ * since the new catalog entries (pg_constraint, pg_inherits) would
+ * eventually be dropped. Therefore there's no need to record valid
+ * dependency on parents.
+ */
+ ind_oid_new = index_create(NewHeap,
+ ind_name->data,
+ InvalidOid,
+ InvalidOid, /* parentIndexRelid */
+ InvalidOid, /* parentConstraintId */
+ InvalidOid,
+ ind_info,
+ colnames,
+ ind->rd_rel->relam,
+ tbsp_oid,
+ collations,
+ opclasses,
+ opclassOptions,
+ indoptions,
+ stattargets,
+ PointerGetDatum(reloptions),
+ flags, /* flags */
+ 0, /* constr_flags */
+ false, /* allow_system_table_mods */
+ false, /* is_internal */
+ NULL /* constraintId */
+ );
+ result = lappend_oid(result, ind_oid_new);
+
+ index_close(ind, AccessShareLock);
+ list_free_deep(colnames);
+ pfree(collations);
+ pfree(opclasses);
+ pfree(indoptions);
+ if (reloptions)
+ pfree(reloptions);
+ }
+
+ return result;
+}
+
+static void
+init_rel_reopen_info(RelReopenInfo *rri, Relation *rel_p, Oid relid,
+ LOCKMODE lockmode_orig, LOCKMODE lockmode_new)
+{
+ rri->rel_p = rel_p;
+ rri->relid = relid;
+ rri->lockmode_orig = lockmode_orig;
+ rri->lockmode_new = lockmode_new;
+}
+
+/*
+ * Unlock and close the relations specified by the items of the 'rels'
+ * array. 'nrel' is the number of items.
+ *
+ * Information needed to (re)open the relations (or to issue a meaningful
+ * is added to the array items.
+ */
+static void
+unlock_and_close_relations(RelReopenInfo *rels, int nrel)
+{
+ int i;
+ RelReopenInfo *rri;
+
+ /*
+ * First, retrieve the information that we will need for re-opening.
+ *
+ * We could close (and unlock) each relation as soon as we have gathered
+ * the related information, but then we would have to be careful not to
+ * unlock the table until we have the info on all its indexes. (Once we
+ * unlock the table, any index can be dropped, and thus we can fail to get
+ * the name we want to report if re-opening fails.) It seems simpler to
+ * separate the work into two iterations.
+ */
+ for (i = 0; i < nrel; i++)
+ {
+ Relation rel;
+
+ rri = &rels[i];
+ rel = *rri->rel_p;
+
+ if (rel)
+ {
+ Assert(CheckRelationLockedByMe(rel, rri->lockmode_orig, false));
+ Assert(!OidIsValid(rri->relid));
+
+ rri->relid = RelationGetRelid(rel);
+ rri->relkind = rel->rd_rel->relkind;
+ rri->relname = pstrdup(RelationGetRelationName(rel));
+ }
+ else
+ {
+ Assert(OidIsValid(rri->relid));
+
+ rri->relname = get_rel_name(rri->relid);
+ rri->relkind = get_rel_relkind(rri->relid);
+ }
+ }
+
+ /* Second, close the relations. */
+ for (i = 0; i < nrel; i++)
+ {
+ Relation rel;
+
+ rri = &rels[i];
+ rel = *rri->rel_p;
+
+ /* Close the relation if the caller passed one. */
+ if (rel)
+ {
+ if (rri->relkind == RELKIND_RELATION)
+ table_close(rel, rri->lockmode_orig);
+ else
+ {
+ Assert(rri->relkind == RELKIND_INDEX);
+
+ index_close(rel, rri->lockmode_orig);
+ }
+ }
+ }
+}
+
+/*
+ * Re-open the relations closed previously by unlock_and_close_relations().
+ */
+static void
+reopen_relations(RelReopenInfo *rels, int nrel)
+{
+ for (int i = 0; i < nrel; i++)
+ {
+ RelReopenInfo *rri = &rels[i];
+ Relation rel;
+
+ if (rri->relkind == RELKIND_RELATION)
+ {
+ rel = try_table_open(rri->relid, rri->lockmode_new);
+ }
+ else
+ {
+ Assert(rri->relkind == RELKIND_INDEX);
+
+ rel = try_index_open(rri->relid, rri->lockmode_new);
+ }
+
+ if (rel == NULL)
+ {
+ const char *kind_str;
+
+ kind_str = (rri->relkind == RELKIND_RELATION) ? "table" : "index";
+ ereport(ERROR,
+ (errmsg("could not open %s \"%s\"", kind_str,
+ rri->relname),
+ errhint("The %s could have been dropped by another transaction.",
+ kind_str)));
+ }
+ *rri->rel_p = rel;
+
+ pfree(rri->relname);
+ }
+}
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index c12817091e..8f0197378d 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -905,7 +905,7 @@ refresh_by_match_merge(Oid matviewOid, Oid tempOid, Oid relowner,
static void
refresh_by_heap_swap(Oid matviewOid, Oid OIDNewHeap, char relpersistence)
{
- finish_heap_swap(matviewOid, OIDNewHeap, false, false, true, true,
+ finish_heap_swap(matviewOid, OIDNewHeap, false, false, true, true, true,
RecentXmin, ReadNextMultiXactId(), relpersistence);
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 4fc54bd6eb..f070ffc960 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4500,6 +4500,16 @@ AlterTableInternal(Oid relid, List *cmds, bool recurse)
rel = relation_open(relid, lockmode);
+ /*
+ * If lockmode allows, check if VACUUM FULL / CLUSTER CONCURRENTLY is in
+ * progress. If lockmode is too weak, cluster_rel() should detect
+ * incompatible DDLs executed by us.
+ *
+ * XXX We might skip the changes for DDLs which do not change the tuple
+ * descriptor.
+ */
+ check_for_concurrent_cluster(relid, lockmode);
+
EventTriggerAlterTableRelid(relid);
ATController(NULL, rel, cmds, recurse, lockmode, NULL);
@@ -5929,6 +5939,7 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
finish_heap_swap(tab->relid, OIDNewHeap,
false, false, true,
!OidIsValid(tab->newTableSpace),
+ true,
RecentXmin,
ReadNextMultiXactId(),
persistence);
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index e6745e6145..516a064fed 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -111,7 +111,7 @@ static void vac_truncate_clog(TransactionId frozenXID,
TransactionId lastSaneFrozenXid,
MultiXactId lastSaneMinMulti);
static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
- BufferAccessStrategy bstrategy);
+ BufferAccessStrategy bstrategy, bool isTopLevel);
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
@@ -153,6 +153,7 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
bool analyze = false;
bool freeze = false;
bool full = false;
+ bool concurrent = false;
bool disable_page_skipping = false;
bool process_main = true;
bool process_toast = true;
@@ -226,6 +227,8 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
freeze = defGetBoolean(opt);
else if (strcmp(opt->defname, "full") == 0)
full = defGetBoolean(opt);
+ else if (strcmp(opt->defname, "concurrently") == 0)
+ concurrent = defGetBoolean(opt);
else if (strcmp(opt->defname, "disable_page_skipping") == 0)
disable_page_skipping = defGetBoolean(opt);
else if (strcmp(opt->defname, "index_cleanup") == 0)
@@ -300,7 +303,7 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
(skip_locked ? VACOPT_SKIP_LOCKED : 0) |
(analyze ? VACOPT_ANALYZE : 0) |
(freeze ? VACOPT_FREEZE : 0) |
- (full ? VACOPT_FULL : 0) |
+ (full ? (concurrent ? VACOPT_FULL_CONCURRENT : VACOPT_FULL_EXCLUSIVE) : 0) |
(disable_page_skipping ? VACOPT_DISABLE_PAGE_SKIPPING : 0) |
(process_main ? VACOPT_PROCESS_MAIN : 0) |
(process_toast ? VACOPT_PROCESS_TOAST : 0) |
@@ -380,6 +383,12 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
errmsg("ONLY_DATABASE_STATS cannot be specified with other VACUUM options")));
}
+ /* This problem cannot be identified from the options. */
+ if (concurrent && !full)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("CONCURRENTLY can only be specified with VACUUM FULL")));
+
/*
* All freeze ages are zero if the FREEZE option is given; otherwise pass
* them as -1 which means to use the default values.
@@ -543,7 +552,17 @@ vacuum(List *relations, VacuumParams *params, BufferAccessStrategy bstrategy,
relations = newrels;
}
else
+ {
+ /*
+ * Concurrent processing is currently considered rather special so it
+ * is not performed in bulk.
+ */
+ if (params->options & VACOPT_FULL_CONCURRENT)
+ ereport(ERROR,
+ (errmsg("VACUUM (CONCURRENTLY) requires explicit list of tables")));
+
relations = get_all_vacuum_rels(vac_context, params->options);
+ }
/*
* Decide whether we need to start/commit our own transactions.
@@ -616,7 +635,8 @@ vacuum(List *relations, VacuumParams *params, BufferAccessStrategy bstrategy,
if (params->options & VACOPT_VACUUM)
{
- if (!vacuum_rel(vrel->oid, vrel->relation, params, bstrategy))
+ if (!vacuum_rel(vrel->oid, vrel->relation, params, bstrategy,
+ isTopLevel))
continue;
}
@@ -960,6 +980,17 @@ expand_vacuum_rel(VacuumRelation *vrel, MemoryContext vac_context,
(errmsg("VACUUM ONLY of partitioned table \"%s\" has no effect",
vrel->relation->relname)));
+ /*
+ * Concurrent processing is currently considered rather special
+ * (e.g. in terms of resources consumed) so it is not performed in
+ * bulk.
+ */
+ if (is_partitioned_table && (options & VACOPT_FULL_CONCURRENT))
+ ereport(ERROR,
+ (errmsg("VACUUM (CONCURRENTLY) not supported for partitioned tables"),
+ errhint("Consider running the command for individual partitions.")));
+
ReleaseSysCache(tuple);
/*
@@ -1954,10 +1985,10 @@ vac_truncate_clog(TransactionId frozenXID,
/*
* vacuum_rel() -- vacuum one heap relation
*
- * relid identifies the relation to vacuum. If relation is supplied,
- * use the name therein for reporting any failure to open/lock the rel;
- * do not use it once we've successfully opened the rel, since it might
- * be stale.
+ * relid identifies the relation to vacuum. If relation is supplied, use
+ * the name therein for reporting any failure to open/lock the rel; do
+ * not use it once we've successfully opened the rel, since it might be
+ * stale.
*
* Returns true if it's okay to proceed with a requested ANALYZE
* operation on this table.
@@ -1972,7 +2003,7 @@ vac_truncate_clog(TransactionId frozenXID,
*/
static bool
vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
- BufferAccessStrategy bstrategy)
+ BufferAccessStrategy bstrategy, bool isTopLevel)
{
LOCKMODE lmode;
Relation rel;
@@ -2035,10 +2066,11 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
/*
* Determine the type of lock we want --- hard exclusive lock for a FULL
- * vacuum, but just ShareUpdateExclusiveLock for concurrent vacuum. Either
- * way, we can be sure that no other backend is vacuuming the same table.
+ * exclusive vacuum, but a weaker lock (ShareUpdateExclusiveLock) for
+ * concurrent vacuum. Either way, we can be sure that no other backend is
+ * vacuuming the same table.
*/
- lmode = (params->options & VACOPT_FULL) ?
+ lmode = (params->options & VACOPT_FULL_EXCLUSIVE) ?
AccessExclusiveLock : ShareUpdateExclusiveLock;
/* open the relation and get the appropriate lock on it */
@@ -2053,6 +2085,22 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
return false;
}
+ /*
+ * Skip the relation if VACUUM FULL / CLUSTER CONCURRENTLY is in progress
+ * as it will drop the current storage of the relation.
+ *
+ * This check should not take place until we have a lock that prevents
+ * another backend from starting VACUUM FULL / CLUSTER CONCURRENTLY later.
+ */
+ Assert(lmode >= ShareUpdateExclusiveLock);
+ if (is_concurrent_cluster_in_progress(relid))
+ {
+ relation_close(rel, lmode);
+ PopActiveSnapshot();
+ CommitTransactionCommand();
+ return false;
+ }
+
/*
* When recursing to a TOAST table, check privileges on the parent. NB:
* This is only safe to do because we hold a session lock on the main
@@ -2126,19 +2174,6 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
return true;
}
- /*
- * Get a session-level lock too. This will protect our access to the
- * relation across multiple transactions, so that we can vacuum the
- * relation's TOAST table (if any) secure in the knowledge that no one is
- * deleting the parent relation.
- *
- * NOTE: this cannot block, even if someone else is waiting for access,
- * because the lock manager knows that both lock requests are from the
- * same process.
- */
- lockrelid = rel->rd_lockInfo.lockRelId;
- LockRelationIdForSession(&lockrelid, lmode);
-
/*
* Set index_cleanup option based on index_cleanup reloption if it wasn't
* specified in VACUUM command, or when running in an autovacuum worker
@@ -2191,6 +2226,30 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
else
toast_relid = InvalidOid;
+ /*
+ * Get a session-level lock too. This will protect our access to the
+ * relation across multiple transactions, so that we can vacuum the
+ * relation's TOAST table (if any) secure in the knowledge that no one is
+ * deleting the parent relation.
+ *
+ * NOTE: this cannot block, even if someone else is waiting for access,
+ * because the lock manager knows that both lock requests are from the
+ * same process.
+ */
+ if (OidIsValid(toast_relid))
+ {
+ /*
+ * You might worry that, in the VACUUM (FULL, CONCURRENTLY) case,
+ * cluster_rel() needs to release all the locks on the relation at
+ * some point, but this session lock makes it impossible. In fact,
+ * cluster_rel() will eventually be called for the TOAST relation
+ * and raise an ERROR because, in the concurrent mode, it cannot process a
+ * TOAST relation alone anyway.
+ */
+ lockrelid = rel->rd_lockInfo.lockRelId;
+ LockRelationIdForSession(&lockrelid, lmode);
+ }
+
/*
* Switch to the table owner's userid, so that any index functions are run
* as that user. Also lock down security-restricted operations and
@@ -2218,11 +2277,22 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
{
ClusterParams cluster_params = {0};
+ /*
+ * Invalid toast_relid means that there is no session lock on the
+ * relation. Such a lock would be a problem because it would
+ * prevent cluster_rel() from releasing all locks when it tries to
+ * get AccessExclusiveLock.
+ */
+ Assert(!OidIsValid(toast_relid));
+
if ((params->options & VACOPT_VERBOSE) != 0)
cluster_params.options |= CLUOPT_VERBOSE;
+ if ((params->options & VACOPT_FULL_CONCURRENT) != 0)
+ cluster_params.options |= CLUOPT_CONCURRENT;
+
/* VACUUM FULL is now a variant of CLUSTER; see cluster.c */
- cluster_rel(rel, InvalidOid, &cluster_params);
+ cluster_rel(rel, InvalidOid, &cluster_params, isTopLevel, true);
/* cluster_rel closes the relation, but keeps lock */
rel = NULL;
@@ -2268,13 +2338,15 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
toast_vacuum_params.options |= VACOPT_PROCESS_MAIN;
toast_vacuum_params.toast_parent = relid;
- vacuum_rel(toast_relid, NULL, &toast_vacuum_params, bstrategy);
+ vacuum_rel(toast_relid, NULL, &toast_vacuum_params, bstrategy,
+ isTopLevel);
}
/*
* Now release the session-level lock on the main table.
*/
- UnlockRelationIdForSession(&lockrelid, lmode);
+ if (OidIsValid(toast_relid))
+ UnlockRelationIdForSession(&lockrelid, lmode);
/* Report that we really did it. */
return true;
diff --git a/src/backend/meson.build b/src/backend/meson.build
index 2b0db21480..4b00a9b8c6 100644
--- a/src/backend/meson.build
+++ b/src/backend/meson.build
@@ -194,5 +194,6 @@ pg_test_mod_args = pg_mod_args + {
subdir('jit/llvm')
subdir('replication/libpqwalreceiver')
subdir('replication/pgoutput')
+subdir('replication/pgoutput_cluster')
subdir('snowball')
subdir('utils/mb/conversion_procs')
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 0bff0f1065..8f45a7a168 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -33,6 +33,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecord.h"
#include "catalog/pg_control.h"
+#include "commands/cluster.h"
#include "replication/decode.h"
#include "replication/logical.h"
#include "replication/message.h"
@@ -467,6 +468,29 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
TransactionId xid = XLogRecGetXid(buf->record);
SnapBuild *builder = ctx->snapshot_builder;
+ /*
+ * Check if CLUSTER CONCURRENTLY is being performed by this backend. If
+ * so, only decode data changes of the table that it is processing, and
+ * the changes of its TOAST relation.
+ *
+ * (The TOAST locator should not be set unless the main one is.)
+ */
+ Assert(!OidIsValid(clustered_rel_toast_locator.relNumber) ||
+ OidIsValid(clustered_rel_locator.relNumber));
+
+ if (OidIsValid(clustered_rel_locator.relNumber))
+ {
+ XLogReaderState *r = buf->record;
+ RelFileLocator locator;
+
+ /* Not all records contain the block. */
+ if (XLogRecGetBlockTagExtended(r, 0, &locator, NULL, NULL, NULL) &&
+ !RelFileLocatorEquals(locator, clustered_rel_locator) &&
+ (!OidIsValid(clustered_rel_toast_locator.relNumber) ||
+ !RelFileLocatorEquals(locator, clustered_rel_toast_locator)))
+ return;
+ }
+
ReorderBufferProcessXid(ctx->reorder, xid, buf->origptr);
/*
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 84bf0503a5..86a9b0335a 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -486,6 +486,26 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
return SnapBuildMVCCFromHistoric(snap, true);
}
+/*
+ * Build an MVCC snapshot for the initial data load performed by CLUSTER
+ * CONCURRENTLY command.
+ *
+ * The snapshot will only be used to scan one particular relation, which is
+ * treated like a catalog (therefore ->building_full_snapshot is not
+ * important), and the caller should already have a replication slot setup (so
+ * we do not set MyProc->xmin). XXX Do we yet need to add some restrictions?
+ */
+Snapshot
+SnapBuildInitialSnapshotForCluster(SnapBuild *builder)
+{
+ Snapshot snap;
+
+ Assert(builder->state == SNAPBUILD_CONSISTENT);
+
+ snap = SnapBuildBuildSnapshot(builder);
+ return SnapBuildMVCCFromHistoric(snap, false);
+}
+
/*
* Turn a historic MVCC snapshot into an ordinary MVCC snapshot.
*
diff --git a/src/backend/replication/pgoutput_cluster/Makefile b/src/backend/replication/pgoutput_cluster/Makefile
new file mode 100644
index 0000000000..31471bb546
--- /dev/null
+++ b/src/backend/replication/pgoutput_cluster/Makefile
@@ -0,0 +1,32 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+# Makefile for src/backend/replication/pgoutput_cluster
+#
+# IDENTIFICATION
+# src/backend/replication/pgoutput_cluster
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/replication/pgoutput_cluster
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+ $(WIN32RES) \
+ pgoutput_cluster.o
+PGFILEDESC = "pgoutput_cluster - logical replication output plugin for CLUSTER command"
+NAME = pgoutput_cluster
+
+all: all-shared-lib
+
+include $(top_srcdir)/src/Makefile.shlib
+
+install: all installdirs install-lib
+
+installdirs: installdirs-lib
+
+uninstall: uninstall-lib
+
+clean distclean: clean-lib
+ rm -f $(OBJS)
diff --git a/src/backend/replication/pgoutput_cluster/meson.build b/src/backend/replication/pgoutput_cluster/meson.build
new file mode 100644
index 0000000000..0f033064f2
--- /dev/null
+++ b/src/backend/replication/pgoutput_cluster/meson.build
@@ -0,0 +1,18 @@
+# Copyright (c) 2022-2024, PostgreSQL Global Development Group
+
+pgoutput_cluster_sources = files(
+ 'pgoutput_cluster.c',
+)
+
+if host_system == 'windows'
+ pgoutput_cluster_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pgoutput_cluster',
+ '--FILEDESC', 'pgoutput_cluster - logical replication output plugin for CLUSTER command',])
+endif
+
+pgoutput_cluster = shared_module('pgoutput_cluster',
+ pgoutput_cluster_sources,
+ kwargs: pg_mod_args,
+)
+
+backend_targets += pgoutput_cluster
diff --git a/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c b/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
new file mode 100644
index 0000000000..43f7b34297
--- /dev/null
+++ b/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
@@ -0,0 +1,288 @@
+/* TODO Move into src/backend/cluster/ (and rename?) */
+/*-------------------------------------------------------------------------
+ *
+ * pgoutput_cluster.c
+ * Logical Replication output plugin for CLUSTER command
+ *
+ * Copyright (c) 2012-2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heaptoast.h"
+#include "commands/cluster.h"
+#include "replication/snapbuild.h"
+
+PG_MODULE_MAGIC;
+
+static void plugin_startup(LogicalDecodingContext *ctx,
+ OutputPluginOptions *opt, bool is_init);
+static void plugin_shutdown(LogicalDecodingContext *ctx);
+static void plugin_begin_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn);
+static void plugin_commit_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ Relation rel, ReorderBufferChange *change);
+static void plugin_truncate(struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, int nrelations,
+ Relation relations[],
+ ReorderBufferChange *change);
+static void store_change(LogicalDecodingContext *ctx,
+ ConcurrentChangeKind kind, HeapTuple tuple);
+
+void
+_PG_output_plugin_init(OutputPluginCallbacks *cb)
+{
+ AssertVariableIsOfType(&_PG_output_plugin_init, LogicalOutputPluginInit);
+
+ cb->startup_cb = plugin_startup;
+ cb->begin_cb = plugin_begin_txn;
+ cb->change_cb = plugin_change;
+ cb->truncate_cb = plugin_truncate;
+ cb->commit_cb = plugin_commit_txn;
+ cb->shutdown_cb = plugin_shutdown;
+}
+
+
+/* initialize this plugin */
+static void
+plugin_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
+ bool is_init)
+{
+ ctx->output_plugin_private = NULL;
+
+ /* Probably unnecessary, as we don't use the SQL interface ... */
+ opt->output_type = OUTPUT_PLUGIN_BINARY_OUTPUT;
+
+ if (ctx->output_plugin_options != NIL)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("this plugin does not expect any options")));
+ }
+}
+
+static void
+plugin_shutdown(LogicalDecodingContext *ctx)
+{
+}
+
+/*
+ * As we don't release the slot while processing a particular table, there's
+ * no room for a SQL interface, even for debugging purposes. Therefore we need
+ * neither OutputPluginPrepareWrite() nor OutputPluginWrite() in the plugin
+ * callbacks. (Although we might want to write custom callbacks, this API
+ * seems to be unnecessarily generic for our purposes.)
+ */
+
+/* BEGIN callback */
+static void
+plugin_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+}
+
+/* COMMIT callback */
+static void
+plugin_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+}
+
+/*
+ * Callback for individual changed tuples
+ */
+static void
+plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ Relation relation, ReorderBufferChange *change)
+{
+ ClusterDecodingState *dstate;
+
+ dstate = (ClusterDecodingState *) ctx->output_writer_private;
+
+ /* Only interested in one particular relation. */
+ if (relation->rd_id != dstate->relid)
+ return;
+
+ /* Decode entry depending on its type */
+ switch (change->action)
+ {
+ case REORDER_BUFFER_CHANGE_INSERT:
+ {
+ HeapTuple newtuple;
+
newtuple = change->data.tp.newtuple;
+
+ /*
+ * Identity checks in the main function should have made this
+ * impossible.
+ */
+ if (newtuple == NULL)
elog(ERROR, "incomplete insert info");
+
+ store_change(ctx, CHANGE_INSERT, newtuple);
+ }
+ break;
+ case REORDER_BUFFER_CHANGE_UPDATE:
+ {
+ HeapTuple oldtuple,
+ newtuple;
+
oldtuple = change->data.tp.oldtuple;
newtuple = change->data.tp.newtuple;
+
+ if (newtuple == NULL)
elog(ERROR, "incomplete update info");
+
+ if (oldtuple != NULL)
+ store_change(ctx, CHANGE_UPDATE_OLD, oldtuple);
+
+ store_change(ctx, CHANGE_UPDATE_NEW, newtuple);
+ }
+ break;
+ case REORDER_BUFFER_CHANGE_DELETE:
+ {
+ HeapTuple oldtuple;
+
oldtuple = change->data.tp.oldtuple;
+
+ if (oldtuple == NULL)
elog(ERROR, "incomplete delete info");
+
+ store_change(ctx, CHANGE_DELETE, oldtuple);
+ }
+ break;
+ default:
+ /* Should not come here */
+ Assert(false);
+ break;
+ }
+}
+
+static void
+plugin_truncate(struct LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ int nrelations, Relation relations[],
+ ReorderBufferChange *change)
+{
+ ClusterDecodingState *dstate;
+ int i;
+ Relation relation = NULL;
+
+ dstate = (ClusterDecodingState *) ctx->output_writer_private;
+
+ /* Find the relation we are processing. */
+ for (i = 0; i < nrelations; i++)
+ {
+ relation = relations[i];
+
+ if (RelationGetRelid(relation) == dstate->relid)
+ break;
+ }
+
+ /* Is this truncation of another relation? */
+ if (i == nrelations)
+ return;
+
+ store_change(ctx, CHANGE_TRUNCATE, NULL);
+}
+
+/* Store concurrent data change. */
+static void
+store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
+ HeapTuple tuple)
+{
+ ClusterDecodingState *dstate;
+ char *change_raw;
+ ConcurrentChange change;
+ bool flattened = false;
+ Size size;
+ Datum values[1];
+ bool isnull[1];
+ char *dst, *dst_start;
+
+ dstate = (ClusterDecodingState *) ctx->output_writer_private;
+
+ size = MAXALIGN(VARHDRSZ) + SizeOfConcurrentChange;
+
+ if (tuple)
+ {
+ /*
+ * ReorderBufferCommit() stores the TOAST chunks in its private memory
+ * context and frees them after having called
+ * apply_change(). Therefore we need flat copy (including TOAST) that
+ * we eventually copy into the memory context which is available to
+ * decode_concurrent_changes().
+ */
+ if (HeapTupleHasExternal(tuple))
+ {
+ /*
+ * toast_flatten_tuple_to_datum() might be more convenient but we
+ * don't want the decompression it does.
+ */
+ tuple = toast_flatten_tuple(tuple, dstate->tupdesc);
+ flattened = true;
+ }
+
+ size += tuple->t_len;
+ }
+
+ /* XXX Isn't there any function / macro to do this? */
+ if (size >= 0x3FFFFFFF)
elog(ERROR, "change is too big");
+
+ /* Construct the change. */
+ change_raw = (char *) palloc0(size);
+ SET_VARSIZE(change_raw, size);
+ /*
+ * Since the varlena alignment might not be sufficient for the structure,
+ * set the fields in a local instance and remember where it should
+ * eventually be copied.
+ */
+ change.kind = kind;
+ dst_start = (char *) VARDATA(change_raw);
+
+ /* No other information is needed for TRUNCATE; the header alone is
+ * copied at the "store" label. */
+ if (change.kind == CHANGE_TRUNCATE)
+ goto store;
+
+ /*
+ * Copy the tuple.
+ *
+ * CAUTION: change->tup_data.t_data must be fixed on retrieval!
+ */
+ memcpy(&change.tup_data, tuple, sizeof(HeapTupleData));
+ dst = dst_start + SizeOfConcurrentChange;
+ memcpy(dst, tuple->t_data, tuple->t_len);
+
+ /* The data has been copied. */
+ if (flattened)
+ pfree(tuple);
+
+store:
+ /* Copy the structure so it can be stored. */
+ memcpy(dst_start, &change, SizeOfConcurrentChange);
+
+ /* Store as tuple of 1 bytea column. */
+ values[0] = PointerGetDatum(change_raw);
+ isnull[0] = false;
+ tuplestore_putvalues(dstate->tstore, dstate->tupdesc_change,
+ values, isnull);
+
+ /* Accounting. */
+ dstate->nchanges++;
+
+ /* Cleanup. */
+ pfree(change_raw);
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 174eed7036..56df243a8a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
#include "access/xlogprefetcher.h"
#include "access/xlogrecovery.h"
#include "commands/async.h"
+#include "commands/cluster.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/autovacuum.h"
@@ -148,6 +149,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, WaitEventCustomShmemSize());
size = add_size(size, InjectionPointShmemSize());
size = add_size(size, SlotSyncShmemSize());
+ size = add_size(size, ClusterShmemSize());
/* include additional requested shmem from preload libraries */
size = add_size(size, total_addin_request);
@@ -340,6 +342,7 @@ CreateOrAttachShmemStructs(void)
StatsShmemInit();
WaitEventCustomShmemInit();
InjectionPointShmemInit();
+ ClusterShmemInit();
}
/*
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 25fe3d5801..7e5484dae4 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -1302,6 +1302,17 @@ ProcessUtilitySlow(ParseState *pstate,
lockmode = AlterTableGetLockLevel(atstmt->cmds);
relid = AlterTableLookupRelation(atstmt, lockmode);
+ /*
+ * If lockmode allows, check if VACUUM FULL / CLUSTER
+ * CONCURRENT is in progress. If lockmode is too weak,
+ * cluster_rel() should detect incompatible DDLs executed
+ * by us.
+ *
+ * XXX We might skip the changes for DDLs which do not
+ * change the tuple descriptor.
+ */
+ check_for_concurrent_cluster(relid, lockmode);
+
if (OidIsValid(relid))
{
AlterTableUtilityContext atcontext;
diff --git a/src/backend/utils/activity/backend_progress.c b/src/backend/utils/activity/backend_progress.c
index eebc968193..e2c84baba9 100644
--- a/src/backend/utils/activity/backend_progress.c
+++ b/src/backend/utils/activity/backend_progress.c
@@ -162,3 +162,19 @@ pgstat_progress_end_command(void)
beentry->st_progress.command_target = InvalidOid;
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
+
+void
+pgstat_progress_restore_state(PgBackendProgress *backup)
+{
+ volatile PgBackendStatus *beentry = MyBEEntry;
+
+ if (!beentry || !pgstat_track_activities)
+ return;
+
+ PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
+ beentry->st_progress.command = backup->command;
+ beentry->st_progress.command_target = backup->command_target;
+ memcpy(MyBEEntry->st_progress.param, backup->param,
+ sizeof(beentry->st_progress.param));
+ PGSTAT_END_WRITE_ACTIVITY(beentry);
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 0b53cba807..a28fba18a7 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -345,6 +345,7 @@ WALSummarizer "Waiting to read or update WAL summarization state."
DSMRegistry "Waiting to read or update the dynamic shared memory registry."
InjectionPoint "Waiting to read or update information related to injection points."
SerialControl "Waiting to read or update shared <filename>pg_serial</filename> state."
+ClusteredRels "Waiting to read or update information on tables being clustered concurrently."
#
# END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 254947e943..e2d030e6a1 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -1565,6 +1565,28 @@ CacheInvalidateRelcache(Relation relation)
databaseId, relationId);
}
+/*
+ * CacheInvalidateRelcacheImmediate
+ * Send invalidation message for the specified relation's relcache entry.
+ *
+ * Currently this is used in VACUUM FULL/CLUSTER CONCURRENTLY, to make sure
+ * that other backends are aware that the command is being executed for the
+ * relation.
+ */
+void
+CacheInvalidateRelcacheImmediate(Relation relation)
+{
+ SharedInvalidationMessage msg;
+
+ msg.rc.id = SHAREDINVALRELCACHE_ID;
+ msg.rc.dbId = MyDatabaseId;
+ msg.rc.relId = RelationGetRelid(relation);
+ /* check AddCatcacheInvalidationMessage() for an explanation */
+ VALGRIND_MAKE_MEM_DEFINED(&msg, sizeof(msg));
+
+ SendSharedInvalidMessages(&msg, 1);
+}
+
/*
* CacheInvalidateRelcacheAll
* Register invalidation of the whole relcache at the end of command.
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 43219a9629..55557c7589 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -64,6 +64,7 @@
#include "catalog/pg_type.h"
#include "catalog/schemapg.h"
#include "catalog/storage.h"
+#include "commands/cluster.h"
#include "commands/policy.h"
#include "commands/publicationcmds.h"
#include "commands/trigger.h"
@@ -1246,6 +1247,10 @@ retry:
/* make sure relation is marked as having no open file yet */
relation->rd_smgr = NULL;
+ /* Is CLUSTER CONCURRENTLY in progress? */
+ relation->rd_cluster_concurrent =
+ is_concurrent_cluster_in_progress(targetRelId);
+
/*
* now we can free the memory allocated for pg_class_tuple
*/
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 42bded373b..103d1249bb 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -154,7 +154,6 @@ static List *exportedSnapshots = NIL;
/* Prototypes for local functions */
static void UnregisterSnapshotNoOwner(Snapshot snapshot);
-static void FreeSnapshot(Snapshot snapshot);
static void SnapshotResetXmin(void);
/* ResourceOwner callbacks to track snapshot references */
@@ -587,7 +586,7 @@ CopySnapshot(Snapshot snapshot)
* FreeSnapshot
* Free the memory associated with a snapshot.
*/
-static void
+void
FreeSnapshot(Snapshot snapshot)
{
Assert(snapshot->regd_count == 0);
diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index 81cbf10aa2..14c2058997 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -3119,7 +3119,7 @@ match_previous_words(int pattern_id,
* one word, so the above test is correct.
*/
if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
- COMPLETE_WITH("VERBOSE");
+ COMPLETE_WITH("VERBOSE", "CONCURRENTLY");
}
/* COMMENT */
@@ -5133,7 +5133,8 @@ match_previous_words(int pattern_id,
"DISABLE_PAGE_SKIPPING", "SKIP_LOCKED",
"INDEX_CLEANUP", "PROCESS_MAIN", "PROCESS_TOAST",
"TRUNCATE", "PARALLEL", "SKIP_DATABASE_STATS",
- "ONLY_DATABASE_STATS", "BUFFER_USAGE_LIMIT");
+ "ONLY_DATABASE_STATS", "BUFFER_USAGE_LIMIT",
+ "CONCURRENTLY");
else if (TailMatches("FULL|FREEZE|ANALYZE|VERBOSE|DISABLE_PAGE_SKIPPING|SKIP_LOCKED|PROCESS_MAIN|PROCESS_TOAST|TRUNCATE|SKIP_DATABASE_STATS|ONLY_DATABASE_STATS"))
COMPLETE_WITH("ON", "OFF");
else if (TailMatches("INDEX_CLEANUP"))
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 7d06dad83f..9b1fb15d8c 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -413,6 +413,10 @@ extern HTSV_Result HeapTupleSatisfiesVacuumHorizon(HeapTuple htup, Buffer buffer
TransactionId *dead_after);
extern void HeapTupleSetHintBits(HeapTupleHeader tuple, Buffer buffer,
uint16 infomask, TransactionId xid);
+extern bool HeapTupleMVCCInserted(HeapTuple htup, Snapshot snapshot,
+ Buffer buffer);
+extern bool HeapTupleMVCCNotDeleted(HeapTuple htup, Snapshot snapshot,
+ Buffer buffer);
extern bool HeapTupleHeaderIsOnlyLocked(HeapTupleHeader tuple);
extern bool HeapTupleIsSurelyDead(HeapTuple htup,
struct GlobalVisState *vistest);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 09b9b394e0..8efe1c23de 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,6 +21,7 @@
#include "access/sdir.h"
#include "access/xact.h"
#include "executor/tuptable.h"
+#include "replication/logical.h"
#include "storage/read_stream.h"
#include "utils/rel.h"
#include "utils/snapshot.h"
@@ -629,6 +630,8 @@ typedef struct TableAmRoutine
Relation OldIndex,
bool use_sort,
TransactionId OldestXmin,
+ Snapshot snapshot,
+ LogicalDecodingContext *decoding_ctx,
TransactionId *xid_cutoff,
MultiXactId *multi_cutoff,
double *num_tuples,
@@ -1672,6 +1675,10 @@ table_relation_copy_data(Relation rel, const RelFileLocator *newrlocator)
* not needed for the relation's AM
* - *xid_cutoff - ditto
* - *multi_cutoff - ditto
+ * - snapshot - if != NULL, ignore data changes done by transactions that this
+ * (MVCC) snapshot considers still in-progress or in the future.
+ * - decoding_ctx - logical decoding context, to capture concurrent data
+ * changes.
*
* Output parameters:
* - *xid_cutoff - rel's new relfrozenxid value, may be invalid
@@ -1684,6 +1691,8 @@ table_relation_copy_for_cluster(Relation OldTable, Relation NewTable,
Relation OldIndex,
bool use_sort,
TransactionId OldestXmin,
+ Snapshot snapshot,
+ LogicalDecodingContext *decoding_ctx,
TransactionId *xid_cutoff,
MultiXactId *multi_cutoff,
double *num_tuples,
@@ -1692,6 +1701,7 @@ table_relation_copy_for_cluster(Relation OldTable, Relation NewTable,
{
OldTable->rd_tableam->relation_copy_for_cluster(OldTable, NewTable, OldIndex,
use_sort, OldestXmin,
+ snapshot, decoding_ctx,
xid_cutoff, multi_cutoff,
num_tuples, tups_vacuumed,
tups_recently_dead);
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 4daa8bef5e..66431cc19e 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -100,6 +100,9 @@ extern Oid index_concurrently_create_copy(Relation heapRelation,
Oid tablespaceOid,
const char *newName);
+extern NullableDatum *get_index_stattargets(Oid indexid,
+ IndexInfo *indInfo);
+
extern void index_concurrently_build(Oid heapRelationId,
Oid indexRelationId);
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 60088a64cb..d420930d6b 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -13,10 +13,15 @@
#ifndef CLUSTER_H
#define CLUSTER_H
+#include "nodes/execnodes.h"
#include "nodes/parsenodes.h"
#include "parser/parse_node.h"
+#include "replication/logical.h"
#include "storage/lock.h"
+#include "storage/relfilelocator.h"
#include "utils/relcache.h"
+#include "utils/resowner.h"
+#include "utils/tuplestore.h"
/* flag bits for ClusterParams->options */
@@ -24,6 +29,7 @@
#define CLUOPT_RECHECK 0x02 /* recheck relation state */
#define CLUOPT_RECHECK_ISCLUSTERED 0x04 /* recheck relation state for
* indisclustered */
+#define CLUOPT_CONCURRENT 0x08 /* allow concurrent data changes */
/* options for CLUSTER */
typedef struct ClusterParams
@@ -31,12 +37,91 @@ typedef struct ClusterParams
bits32 options; /* bitmask of CLUOPT_* */
} ClusterParams;
+/*
+ * The following definitions are used for concurrent processing.
+ */
+
+extern RelFileLocator clustered_rel_locator;
+extern RelFileLocator clustered_rel_toast_locator;
+
+typedef enum
+{
+ CHANGE_INSERT,
+ CHANGE_UPDATE_OLD,
+ CHANGE_UPDATE_NEW,
+ CHANGE_DELETE,
+ CHANGE_TRUNCATE
+} ConcurrentChangeKind;
+
+typedef struct ConcurrentChange
+{
+ /* See the enum above. */
+ ConcurrentChangeKind kind;
+
+ /*
+ * The actual tuple.
+ *
+ * The tuple data follows the ConcurrentChange structure. Before use make
+ * sure the tuple is correctly aligned (ConcurrentChange can be stored as
+ * bytea) and that tuple->t_data is fixed.
+ */
+ HeapTupleData tup_data;
+} ConcurrentChange;
+
+#define SizeOfConcurrentChange (offsetof(ConcurrentChange, tup_data) + \
+ sizeof(HeapTupleData))
+
+/*
+ * Logical decoding state.
+ *
+ * Here we store the data changes that we decode from WAL while the table
+ * contents are being copied to the new storage, along with the metadata
+ * needed to apply those changes to the table.
+ */
+typedef struct ClusterDecodingState
+{
+ /* The relation whose changes we're decoding. */
+ Oid relid;
+
+ /*
+ * Decoded changes are stored here. Although we try to avoid excessive
+ * batches, it can happen that the changes need to be stored to disk. The
+ * tuplestore does this transparently.
+ */
+ Tuplestorestate *tstore;
+
+ /* The current number of changes in tstore. */
+ double nchanges;
+
+ /*
+ * Descriptor to store the ConcurrentChange structure serialized (bytea).
+ * We can't store the tuple directly because tuplestore only supports
+ * minimal tuples and we may need to transfer the OID system column from
+ * the output plugin. We also need to transfer the change kind, so it's
+ * better to put everything in one structure than to use two tuplestores
+ * "in parallel".
+ */
+ TupleDesc tupdesc_change;
+
+ /* Tuple descriptor needed to update indexes. */
+ TupleDesc tupdesc;
+
+ /* Slot to retrieve data from tstore. */
+ TupleTableSlot *tsslot;
+
+ ResourceOwner resowner;
+} ClusterDecodingState;
+
extern void cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel);
-extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params);
+extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
+ bool isTopLevel, bool isVacuum);
extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid,
LOCKMODE lockmode);
extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
-
+extern void check_relation_is_clusterable_concurrently(Relation rel,
+ bool is_vacuum);
+extern void cluster_decode_concurrent_changes(LogicalDecodingContext *ctx,
+ XLogRecPtr end_of_wal);
extern Oid make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
char relpersistence, LOCKMODE lockmode);
extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
@@ -44,8 +129,13 @@ extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
bool swap_toast_by_content,
bool check_constraints,
bool is_internal,
+ bool reindex,
TransactionId frozenXid,
MultiXactId cutoffMulti,
char newrelpersistence);
+extern Size ClusterShmemSize(void);
+extern void ClusterShmemInit(void);
+extern bool is_concurrent_cluster_in_progress(Oid relid);
+extern void check_for_concurrent_cluster(Oid relid, LOCKMODE lockmode);
#endif /* CLUSTER_H */
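The `ConcurrentChange` header above is stored with the tuple data packed directly behind it, which is why `SizeOfConcurrentChange` is computed with `offsetof` rather than `sizeof`. A minimal standalone sketch of that pack/unpack pattern follows; the names (`ChangeHeader`, `pack_change`, `unpack_kind`) and the `len` field are invented for illustration and are not part of the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-ins for the patch's ConcurrentChangeKind / ConcurrentChange. */
typedef enum
{
	CHANGE_INSERT,
	CHANGE_UPDATE_OLD,
	CHANGE_UPDATE_NEW,
	CHANGE_DELETE
} ChangeKind;

typedef struct
{
	ChangeKind	kind;
	uint32_t	len;		/* length of the tuple data that follows */
} ChangeHeader;

/* Like SizeOfConcurrentChange: header size up to and including the last field. */
#define SizeOfChangeHeader (offsetof(ChangeHeader, len) + sizeof(uint32_t))

/* Serialize the header with the payload packed right behind it. */
static char *
pack_change(ChangeKind kind, const char *data, uint32_t len)
{
	char	   *buf = malloc(SizeOfChangeHeader + len);
	ChangeHeader hdr;

	hdr.kind = kind;
	hdr.len = len;
	memcpy(buf, &hdr, SizeOfChangeHeader);
	memcpy(buf + SizeOfChangeHeader, data, len);
	return buf;
}

/*
 * Read the header back via memcpy, so the caller need not assume the flat
 * buffer is suitably aligned -- the same concern the struct comment about
 * bytea storage raises.
 */
static ChangeKind
unpack_kind(const char *buf)
{
	ChangeHeader hdr;

	memcpy(&hdr, buf, SizeOfChangeHeader);
	return hdr.kind;
}
```

The memcpy-based unpack is the key point: a flat bytea buffer retrieved from a tuplestore carries no alignment guarantee, so reading fields through a cast pointer would be undefined behavior on strict-alignment platforms.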
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 18e3179ef6..e1d53f2a0a 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -59,19 +59,22 @@
#define PROGRESS_CLUSTER_PHASE 1
#define PROGRESS_CLUSTER_INDEX_RELID 2
#define PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED 3
-#define PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN 4
-#define PROGRESS_CLUSTER_TOTAL_HEAP_BLKS 5
-#define PROGRESS_CLUSTER_HEAP_BLKS_SCANNED 6
-#define PROGRESS_CLUSTER_INDEX_REBUILD_COUNT 7
+#define PROGRESS_CLUSTER_HEAP_TUPLES_INSERTED 4
+#define PROGRESS_CLUSTER_HEAP_TUPLES_UPDATED 5
+#define PROGRESS_CLUSTER_HEAP_TUPLES_DELETED 6
+#define PROGRESS_CLUSTER_TOTAL_HEAP_BLKS 7
+#define PROGRESS_CLUSTER_HEAP_BLKS_SCANNED 8
+#define PROGRESS_CLUSTER_INDEX_REBUILD_COUNT 9
/* Phases of cluster (as advertised via PROGRESS_CLUSTER_PHASE) */
#define PROGRESS_CLUSTER_PHASE_SEQ_SCAN_HEAP 1
#define PROGRESS_CLUSTER_PHASE_INDEX_SCAN_HEAP 2
#define PROGRESS_CLUSTER_PHASE_SORT_TUPLES 3
#define PROGRESS_CLUSTER_PHASE_WRITE_NEW_HEAP 4
-#define PROGRESS_CLUSTER_PHASE_SWAP_REL_FILES 5
-#define PROGRESS_CLUSTER_PHASE_REBUILD_INDEX 6
-#define PROGRESS_CLUSTER_PHASE_FINAL_CLEANUP 7
+#define PROGRESS_CLUSTER_PHASE_CATCH_UP 5
+#define PROGRESS_CLUSTER_PHASE_SWAP_REL_FILES 6
+#define PROGRESS_CLUSTER_PHASE_REBUILD_INDEX 7
+#define PROGRESS_CLUSTER_PHASE_FINAL_CLEANUP 8
/* Commands of PROGRESS_CLUSTER */
#define PROGRESS_CLUSTER_COMMAND_CLUSTER 1
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 12d0b61950..05c1fce969 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -181,13 +181,16 @@ typedef struct VacAttrStats
#define VACOPT_ANALYZE 0x02 /* do ANALYZE */
#define VACOPT_VERBOSE 0x04 /* output INFO instrumentation messages */
#define VACOPT_FREEZE 0x08 /* FREEZE option */
-#define VACOPT_FULL 0x10 /* FULL (non-concurrent) vacuum */
-#define VACOPT_SKIP_LOCKED 0x20 /* skip if cannot get lock */
-#define VACOPT_PROCESS_MAIN 0x40 /* process main relation */
-#define VACOPT_PROCESS_TOAST 0x80 /* process the TOAST table, if any */
-#define VACOPT_DISABLE_PAGE_SKIPPING 0x100 /* don't skip any pages */
-#define VACOPT_SKIP_DATABASE_STATS 0x200 /* skip vac_update_datfrozenxid() */
-#define VACOPT_ONLY_DATABASE_STATS 0x400 /* only vac_update_datfrozenxid() */
+#define VACOPT_FULL_EXCLUSIVE 0x10 /* FULL (non-concurrent) vacuum */
+#define VACOPT_FULL_CONCURRENT 0x20 /* FULL (concurrent) vacuum */
+#define VACOPT_SKIP_LOCKED 0x40 /* skip if cannot get lock */
+#define VACOPT_PROCESS_MAIN 0x80 /* process main relation */
+#define VACOPT_PROCESS_TOAST 0x100 /* process the TOAST table, if any */
+#define VACOPT_DISABLE_PAGE_SKIPPING 0x200 /* don't skip any pages */
+#define VACOPT_SKIP_DATABASE_STATS 0x400 /* skip vac_update_datfrozenxid() */
+#define VACOPT_ONLY_DATABASE_STATS 0x800 /* only vac_update_datfrozenxid() */
+
+#define VACOPT_FULL (VACOPT_FULL_EXCLUSIVE | VACOPT_FULL_CONCURRENT)
/*
* Values used by index_cleanup and truncate params.
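Splitting `VACOPT_FULL` into two bits while keeping the old name as their union means existing callers that test "any FULL variant" keep working with a single bitwise AND. A small sketch using the flag values from the patched vacuum.h (the helper function names are illustrative, not from the patch):

```c
#include <assert.h>

/* Flag values as defined in the patched vacuum.h. */
#define VACOPT_FULL_EXCLUSIVE	0x10	/* FULL (non-concurrent) vacuum */
#define VACOPT_FULL_CONCURRENT	0x20	/* FULL (concurrent) vacuum */
#define VACOPT_SKIP_LOCKED		0x40	/* skip if cannot get lock */
#define VACOPT_FULL (VACOPT_FULL_EXCLUSIVE | VACOPT_FULL_CONCURRENT)

/* "Any kind of VACUUM FULL?" -- one bitwise test covers both variants. */
static int
is_any_full(int options)
{
	return (options & VACOPT_FULL) != 0;
}

/* "Specifically the concurrent variant?" -- test the individual bit. */
static int
is_full_concurrent(int options)
{
	return (options & VACOPT_FULL_CONCURRENT) != 0;
}
```

Code that must distinguish the variants (e.g. to pick the lock level) tests the individual bits, while legacy checks against `VACOPT_FULL` remain correct for either.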
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 6d4d2d1814..7c2e1f2ceb 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -73,6 +73,7 @@ extern void FreeSnapshotBuilder(SnapBuild *builder);
extern void SnapBuildSnapDecRefcount(Snapshot snap);
extern Snapshot SnapBuildInitialSnapshot(SnapBuild *builder);
+extern Snapshot SnapBuildInitialSnapshotForCluster(SnapBuild *builder);
extern Snapshot SnapBuildMVCCFromHistoric(Snapshot snapshot, bool in_place);
extern const char *SnapBuildExportSnapshot(SnapBuild *builder);
extern void SnapBuildClearExportedSnapshot(void);
diff --git a/src/include/storage/lockdefs.h b/src/include/storage/lockdefs.h
index 7f3ba0352f..b7b94dacf8 100644
--- a/src/include/storage/lockdefs.h
+++ b/src/include/storage/lockdefs.h
@@ -36,8 +36,9 @@ typedef int LOCKMODE;
#define AccessShareLock 1 /* SELECT */
#define RowShareLock 2 /* SELECT FOR UPDATE/FOR SHARE */
#define RowExclusiveLock 3 /* INSERT, UPDATE, DELETE */
-#define ShareUpdateExclusiveLock 4 /* VACUUM (non-FULL), ANALYZE, CREATE
- * INDEX CONCURRENTLY */
+#define ShareUpdateExclusiveLock 4 /* VACUUM (non-exclusive), ANALYZE, CREATE
+ * INDEX CONCURRENTLY, CLUSTER
+ * CONCURRENTLY */
#define ShareLock 5 /* CREATE INDEX (WITHOUT CONCURRENTLY) */
#define ShareRowExclusiveLock 6 /* like EXCLUSIVE MODE, but allows ROW
* SHARE */
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index cf56545238..9211124d10 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -83,3 +83,4 @@ PG_LWLOCK(49, WALSummarizer)
PG_LWLOCK(50, DSMRegistry)
PG_LWLOCK(51, InjectionPoint)
PG_LWLOCK(52, SerialControl)
+PG_LWLOCK(54, ClusteredRels)
diff --git a/src/include/utils/backend_progress.h b/src/include/utils/backend_progress.h
index 739629cb21..5b9903de08 100644
--- a/src/include/utils/backend_progress.h
+++ b/src/include/utils/backend_progress.h
@@ -35,7 +35,7 @@ typedef enum ProgressCommandType
/*
* Any command which wishes can advertise that it is running by setting
- * command, command_target, and param[]. command_target should be the OID of
+ * command, command_target, and param[]. command_target should be the OID of
* the relation which the command targets (we assume there's just one, as this
* is meant for utility commands), but the meaning of each element in the
* param array is command-specific.
@@ -55,6 +55,7 @@ extern void pgstat_progress_parallel_incr_param(int index, int64 incr);
extern void pgstat_progress_update_multi_param(int nparam, const int *index,
const int64 *val);
extern void pgstat_progress_end_command(void);
+extern void pgstat_progress_restore_state(PgBackendProgress *backup);
#endif /* BACKEND_PROGRESS_H */
diff --git a/src/include/utils/inval.h b/src/include/utils/inval.h
index 40658ba2ff..6b2faed672 100644
--- a/src/include/utils/inval.h
+++ b/src/include/utils/inval.h
@@ -49,6 +49,8 @@ extern void CacheInvalidateCatalog(Oid catalogId);
extern void CacheInvalidateRelcache(Relation relation);
+extern void CacheInvalidateRelcacheImmediate(Relation relation);
+
extern void CacheInvalidateRelcacheAll(void);
extern void CacheInvalidateRelcacheByTuple(HeapTuple classTuple);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 33d1e4a4e2..243b80d7fa 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -253,6 +253,9 @@ typedef struct RelationData
bool pgstat_enabled; /* should relation stats be counted */
/* use "struct" here to avoid needing to include pgstat.h: */
struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+ /* Is CLUSTER CONCURRENTLY being performed on this relation? */
+ bool rd_cluster_concurrent;
} RelationData;
@@ -684,7 +687,9 @@ RelationCloseSmgr(Relation relation)
#define RelationIsAccessibleInLogicalDecoding(relation) \
(XLogLogicalInfoActive() && \
RelationNeedsWAL(relation) && \
- (IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation)))
+ (IsCatalogRelation(relation) || \
+ RelationIsUsedAsCatalogTable(relation) || \
+ (relation)->rd_cluster_concurrent))
/*
* RelationIsLogicallyLogged
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index 147b190210..5eeabdc6c4 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -61,6 +61,8 @@ extern Snapshot GetLatestSnapshot(void);
extern void SnapshotSetCommandId(CommandId curcid);
extern Snapshot CopySnapshot(Snapshot snapshot);
+extern void FreeSnapshot(Snapshot snapshot);
+
extern Snapshot GetCatalogSnapshot(Oid relid);
extern Snapshot GetNonHistoricCatalogSnapshot(Oid relid);
extern void InvalidateCatalogSnapshot(void);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 3014d047fe..81300642a5 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1962,17 +1962,20 @@ pg_stat_progress_cluster| SELECT s.pid,
WHEN 2 THEN 'index scanning heap'::text
WHEN 3 THEN 'sorting tuples'::text
WHEN 4 THEN 'writing new heap'::text
- WHEN 5 THEN 'swapping relation files'::text
- WHEN 6 THEN 'rebuilding index'::text
- WHEN 7 THEN 'performing final cleanup'::text
+ WHEN 5 THEN 'catch-up'::text
+ WHEN 6 THEN 'swapping relation files'::text
+ WHEN 7 THEN 'rebuilding index'::text
+ WHEN 8 THEN 'performing final cleanup'::text
ELSE NULL::text
END AS phase,
(s.param3)::oid AS cluster_index_relid,
s.param4 AS heap_tuples_scanned,
- s.param5 AS heap_tuples_written,
- s.param6 AS heap_blks_total,
- s.param7 AS heap_blks_scanned,
- s.param8 AS index_rebuild_count
+ s.param5 AS heap_tuples_inserted,
+ s.param6 AS heap_tuples_updated,
+ s.param7 AS heap_tuples_deleted,
+ s.param8 AS heap_blks_total,
+ s.param9 AS heap_blks_scanned,
+ s.param10 AS index_rebuild_count
FROM (pg_stat_get_progress_info('CLUSTER'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_progress_copy| SELECT s.pid,
--
2.45.2
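Because the patch inserts the new "catch-up" phase as number 5, the old phases 5-7 shift to 6-8, as the regression output above shows. Monitoring code that matches on phase numbers must account for that shift; a sketch of the new mapping (earlier phases omitted, since only phases 2-8 appear in the quoted view definition):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Phase labels as they appear in the updated pg_stat_progress_cluster view.
 */
static const char *
cluster_phase_name(int phase)
{
	switch (phase)
	{
		case 2: return "index scanning heap";
		case 3: return "sorting tuples";
		case 4: return "writing new heap";
		case 5: return "catch-up";	/* new phase introduced by the patch */
		case 6: return "swapping relation files";
		case 7: return "rebuilding index";
		case 8: return "performing final cleanup";
		default: return NULL;
	}
}
```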
Attachment: v07-0005-Preserve-visibility-information-of-the-concurrent-da.patch (text/x-diff)
From 46a22614127c4cf1d45a0c47ccafbfac3243fee5 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Mon, 13 Jan 2025 14:29:54 +0100
Subject: [PATCH 5/8] Preserve visibility information of the concurrent data
changes.
As explained in the commit message of the preceding patch of the series, the
data changes done by applications while VACUUM FULL / CLUSTER CONCURRENTLY is
copying the table contents to a new file are decoded from WAL and eventually
also applied to the new file. To reduce the complexity a little bit, the
preceding patch uses the current transaction (i.e. transaction opened by the
VACUUM FULL / CLUSTER command) to execute those INSERT, UPDATE and DELETE
commands.
However, neither VACUUM nor CLUSTER is expected to change visibility of
tuples. Therefore, this patch fixes the handling of the "concurrent data
changes". Now the tuples written into the new table storage have the same XID
and command ID (CID) as they had in the old storage.
A related change made here is that the data changes (INSERT, UPDATE, DELETE) we
"replay" on the new storage are not themselves logically decoded. First, the
logical decoding subsystem does not expect an already-committed transaction to
be decoded again. Second, repeated decoding would just be wasted effort.
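The core idea of the commit message — that reusing the original XID and CID keeps visibility decisions unchanged, while restamping rows with the CLUSTER backend's own transaction would not — can be illustrated with a deliberately toy model. This is not PostgreSQL's real visibility logic; `ToyTuple`, `ToySnap` and the rules below are invented purely for illustration:

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t Xid;
typedef uint32_t Cid;

/* Toy tuple header: only the inserting transaction and command are kept. */
typedef struct
{
	Xid		xmin;
	Cid		cmin;
} ToyTuple;

/*
 * Toy snapshot: pretend every xid below xmax committed before the snapshot,
 * except our own xid, whose rows are visible only when cmin < curcid.
 */
typedef struct
{
	Xid		xmax;
	Xid		my_xid;
	Cid		curcid;
} ToySnap;

static int
toy_visible(ToyTuple t, ToySnap s)
{
	if (t.xmin == s.my_xid)
		return t.cmin < s.curcid;	/* our own in-progress work */
	return t.xmin < s.xmax;			/* committed before the snapshot */
}

/* Copying the row while PRESERVING xmin/cmin leaves visibility unchanged. */
static ToyTuple
copy_preserving(ToyTuple src)
{
	return src;
}

/* Restamping with the copier's own xid/cid can change the answer. */
static ToyTuple
copy_restamped(ToyTuple src, Xid copier_xid, Cid copier_cid)
{
	ToyTuple	t = {copier_xid, copier_cid};

	(void) src;					/* original identity is discarded */
	return t;
}
```

In the toy model, a row committed by xid 100 stays visible to an unrelated snapshot after a preserving copy, but becomes invisible once restamped with the copier's identity — which is exactly the anomaly the patch avoids.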
---
src/backend/access/common/toast_internals.c | 3 +-
src/backend/access/heap/heapam.c | 73 ++++++++----
src/backend/access/heap/heapam_handler.c | 14 ++-
src/backend/access/transam/xact.c | 52 ++++++++
src/backend/commands/cluster.c | 111 ++++++++++++++++--
src/backend/replication/logical/decode.c | 76 ++++++++++--
src/backend/replication/logical/snapbuild.c | 22 ++--
.../pgoutput_cluster/pgoutput_cluster.c | 68 +++++++++--
src/include/access/heapam.h | 15 ++-
src/include/access/heapam_xlog.h | 2 +
src/include/access/xact.h | 2 +
src/include/commands/cluster.h | 18 +++
src/include/utils/snapshot.h | 3 +
13 files changed, 389 insertions(+), 70 deletions(-)
diff --git a/src/backend/access/common/toast_internals.c b/src/backend/access/common/toast_internals.c
index 7d8be8346c..75d889ec72 100644
--- a/src/backend/access/common/toast_internals.c
+++ b/src/backend/access/common/toast_internals.c
@@ -320,7 +320,8 @@ toast_save_datum(Relation rel, Datum value,
memcpy(VARDATA(&chunk_data), data_p, chunk_size);
toasttup = heap_form_tuple(toasttupDesc, t_values, t_isnull);
- heap_insert(toastrel, toasttup, mycid, options, NULL);
+ heap_insert(toastrel, toasttup, GetCurrentTransactionId(), mycid,
+ options, NULL);
/*
* Create the index entry. We cheat a little here by not using
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 552993d4ef..e20634c030 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -58,7 +58,8 @@ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
HeapTuple newtup, HeapTuple old_key_tuple,
- bool all_visible_cleared, bool new_all_visible_cleared);
+ bool all_visible_cleared, bool new_all_visible_cleared,
+ bool wal_logical);
#ifdef USE_ASSERT_CHECKING
static void check_lock_if_inplace_updateable_rel(Relation relation,
ItemPointer otid,
@@ -1969,7 +1970,7 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
/*
* heap_insert - insert tuple into a heap
*
- * The new tuple is stamped with current transaction ID and the specified
+ * The new tuple is stamped with the specified transaction ID and the specified
* command ID.
*
* See table_tuple_insert for comments about most of the input flags, except
@@ -1985,15 +1986,16 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
* reflected into *tup.
*/
void
-heap_insert(Relation relation, HeapTuple tup, CommandId cid,
- int options, BulkInsertState bistate)
+heap_insert(Relation relation, HeapTuple tup, TransactionId xid,
+ CommandId cid, int options, BulkInsertState bistate)
{
- TransactionId xid = GetCurrentTransactionId();
HeapTuple heaptup;
Buffer buffer;
Buffer vmbuffer = InvalidBuffer;
bool all_visible_cleared = false;
+ Assert(TransactionIdIsValid(xid));
+
/* Cheap, simplistic check that the tuple matches the rel's rowtype. */
Assert(HeapTupleHeaderGetNatts(tup->t_data) <=
RelationGetNumberOfAttributes(relation));
@@ -2624,7 +2626,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
void
simple_heap_insert(Relation relation, HeapTuple tup)
{
- heap_insert(relation, tup, GetCurrentCommandId(true), 0, NULL);
+ heap_insert(relation, tup, GetCurrentTransactionId(),
+ GetCurrentCommandId(true), 0, NULL);
}
/*
@@ -2681,11 +2684,11 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
*/
TM_Result
heap_delete(Relation relation, ItemPointer tid,
- CommandId cid, Snapshot crosscheck, bool wait,
- TM_FailureData *tmfd, bool changingPart)
+ TransactionId xid, CommandId cid, Snapshot crosscheck, bool wait,
+ TM_FailureData *tmfd, bool changingPart,
+ bool wal_logical)
{
TM_Result result;
- TransactionId xid = GetCurrentTransactionId();
ItemId lp;
HeapTupleData tp;
Page page;
@@ -2702,6 +2705,7 @@ heap_delete(Relation relation, ItemPointer tid,
bool old_key_copied = false;
Assert(ItemPointerIsValid(tid));
+ Assert(TransactionIdIsValid(xid));
/*
* Forbid this during a parallel operation, lest it allocate a combo CID.
@@ -2927,7 +2931,8 @@ l1:
* Compute replica identity tuple before entering the critical section so
* we don't PANIC upon a memory allocation failure.
*/
- old_key_tuple = ExtractReplicaIdentity(relation, &tp, true, &old_key_copied);
+ old_key_tuple = wal_logical ?
+ ExtractReplicaIdentity(relation, &tp, true, &old_key_copied) : NULL;
/*
* If this is the first possibly-multixact-able operation in the current
@@ -2995,8 +3000,12 @@ l1:
/*
* For logical decode we need combo CIDs to properly decode the
* catalog
+ *
+ * Like in heap_insert(), visibility is unchanged when called from
+ * VACUUM FULL / CLUSTER.
*/
- if (RelationIsAccessibleInLogicalDecoding(relation))
+ if (wal_logical &&
+ RelationIsAccessibleInLogicalDecoding(relation))
log_heap_new_cid(relation, &tp);
xlrec.flags = 0;
@@ -3017,6 +3026,15 @@ l1:
xlrec.flags |= XLH_DELETE_CONTAINS_OLD_KEY;
}
+ /*
+ * Unlike UPDATE, a DELETE is decoded even if there is no old key, so
+ * clearing both XLH_DELETE_CONTAINS_OLD_TUPLE and
+ * XLH_DELETE_CONTAINS_OLD_KEY does not help. Thus we need an extra
+ * flag. TODO: Consider instead not decoding tuples that lack the old
+ * tuple/key.
+ */
+ if (!wal_logical)
+ xlrec.flags |= XLH_DELETE_NO_LOGICAL;
+
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapDelete);
@@ -3106,10 +3124,11 @@ simple_heap_delete(Relation relation, ItemPointer tid)
TM_Result result;
TM_FailureData tmfd;
- result = heap_delete(relation, tid,
+ result = heap_delete(relation, tid, GetCurrentTransactionId(),
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &tmfd, false /* changingPart */ );
+ &tmfd, false, /* changingPart */
+ true /* wal_logical */);
switch (result)
{
case TM_SelfModified:
@@ -3148,12 +3167,11 @@ simple_heap_delete(Relation relation, ItemPointer tid)
*/
TM_Result
heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
- CommandId cid, Snapshot crosscheck, bool wait,
- TM_FailureData *tmfd, LockTupleMode *lockmode,
- TU_UpdateIndexes *update_indexes)
+ TransactionId xid, CommandId cid, Snapshot crosscheck,
+ bool wait, TM_FailureData *tmfd, LockTupleMode *lockmode,
+ TU_UpdateIndexes *update_indexes, bool wal_logical)
{
TM_Result result;
- TransactionId xid = GetCurrentTransactionId();
Bitmapset *hot_attrs;
Bitmapset *sum_attrs;
Bitmapset *key_attrs;
@@ -3193,6 +3211,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
infomask2_new_tuple;
Assert(ItemPointerIsValid(otid));
+ Assert(TransactionIdIsValid(xid));
/* Cheap, simplistic check that the tuple matches the rel's rowtype. */
Assert(HeapTupleHeaderGetNatts(newtup->t_data) <=
@@ -3985,8 +4004,12 @@ l2:
/*
* For logical decoding we need combo CIDs to properly decode the
* catalog.
+ *
+ * Like in heap_insert(), visibility is unchanged when called from
+ * VACUUM FULL / CLUSTER.
*/
- if (RelationIsAccessibleInLogicalDecoding(relation))
+ if (wal_logical &&
+ RelationIsAccessibleInLogicalDecoding(relation))
{
log_heap_new_cid(relation, &oldtup);
log_heap_new_cid(relation, heaptup);
@@ -3996,7 +4019,8 @@ l2:
newbuf, &oldtup, heaptup,
old_key_tuple,
all_visible_cleared,
- all_visible_cleared_new);
+ all_visible_cleared_new,
+ wal_logical);
if (newbuf != buffer)
{
PageSetLSN(BufferGetPage(newbuf), recptr);
@@ -4351,10 +4375,10 @@ simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup,
TM_FailureData tmfd;
LockTupleMode lockmode;
- result = heap_update(relation, otid, tup,
+ result = heap_update(relation, otid, tup, GetCurrentTransactionId(),
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &tmfd, &lockmode, update_indexes);
+ &tmfd, &lockmode, update_indexes, true);
switch (result)
{
case TM_SelfModified:
@@ -8685,7 +8709,8 @@ static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup, HeapTuple newtup,
HeapTuple old_key_tuple,
- bool all_visible_cleared, bool new_all_visible_cleared)
+ bool all_visible_cleared, bool new_all_visible_cleared,
+ bool wal_logical)
{
xl_heap_update xlrec;
xl_heap_header xlhdr;
@@ -8696,10 +8721,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
suffixlen = 0;
XLogRecPtr recptr;
Page page = BufferGetPage(newbuf);
- bool need_tuple_data = RelationIsLogicallyLogged(reln);
+ bool need_tuple_data;
bool init;
int bufflags;
+ need_tuple_data = RelationIsLogicallyLogged(reln) && wal_logical;
+
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index c5ec21ca2f..c315abac02 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -252,7 +252,8 @@ heapam_tuple_insert(Relation relation, TupleTableSlot *slot, CommandId cid,
tuple->t_tableOid = slot->tts_tableOid;
/* Perform the insertion, and copy the resulting ItemPointer */
- heap_insert(relation, tuple, cid, options, bistate);
+ heap_insert(relation, tuple, GetCurrentTransactionId(), cid, options,
+ bistate);
ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
if (shouldFree)
@@ -275,7 +276,8 @@ heapam_tuple_insert_speculative(Relation relation, TupleTableSlot *slot,
options |= HEAP_INSERT_SPECULATIVE;
/* Perform the insertion, and copy the resulting ItemPointer */
- heap_insert(relation, tuple, cid, options, bistate);
+ heap_insert(relation, tuple, GetCurrentTransactionId(), cid, options,
+ bistate);
ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
if (shouldFree)
@@ -309,7 +311,8 @@ heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
* the storage itself is cleaning the dead tuples by itself, it is the
* time to call the index tuple deletion also.
*/
- return heap_delete(relation, tid, cid, crosscheck, wait, tmfd, changingPart);
+ return heap_delete(relation, tid, GetCurrentTransactionId(), cid,
+ crosscheck, wait, tmfd, changingPart, true);
}
@@ -327,8 +330,9 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
slot->tts_tableOid = RelationGetRelid(relation);
tuple->t_tableOid = slot->tts_tableOid;
- result = heap_update(relation, otid, tuple, cid, crosscheck, wait,
- tmfd, lockmode, update_indexes);
+ result = heap_update(relation, otid, tuple, GetCurrentTransactionId(),
+ cid, crosscheck, wait,
+ tmfd, lockmode, update_indexes, true);
ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
/*
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d331ab90d7..e6a7414f9e 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -125,6 +125,18 @@ static FullTransactionId XactTopFullTransactionId = {InvalidTransactionId};
static int nParallelCurrentXids = 0;
static TransactionId *ParallelCurrentXids;
+/*
+ * Another case that requires TransactionIdIsCurrentTransactionId() to behave
+ * specially is when CLUSTER CONCURRENTLY is processing data changes made in
+ * the old storage of a table by other transactions. When applying the changes
+ * to the new storage, the backend executing the CLUSTER command needs to act
+ * on behalf on those other transactions. The transactions responsible for the
+ * changes in the old storage are stored in this array, sorted by
+ * xidComparator.
+ */
+static int nClusterCurrentXids = 0;
+static TransactionId *ClusterCurrentXids = NULL;
+
/*
* Miscellaneous flag bits to record events which occur on the top level
* transaction. These flags are only persisted in MyXactFlags and are intended
@@ -971,6 +983,8 @@ TransactionIdIsCurrentTransactionId(TransactionId xid)
int low,
high;
+ Assert(nClusterCurrentXids == 0);
+
low = 0;
high = nParallelCurrentXids - 1;
while (low <= high)
@@ -990,6 +1004,21 @@ TransactionIdIsCurrentTransactionId(TransactionId xid)
return false;
}
+ /*
+ * When executing CLUSTER CONCURRENTLY, the array of current transactions
+ * is given.
+ */
+ if (nClusterCurrentXids > 0)
+ {
+ Assert(nParallelCurrentXids == 0);
+
+ return bsearch(&xid,
+ ClusterCurrentXids,
+ nClusterCurrentXids,
+ sizeof(TransactionId),
+ xidComparator) != NULL;
+ }
+
/*
* We will return true for the Xid of the current subtransaction, any of
* its subcommitted children, any of its parents, or any of their
@@ -5628,6 +5657,29 @@ EndParallelWorkerTransaction(void)
CurrentTransactionState->blockState = TBLOCK_DEFAULT;
}
+/*
+ * SetClusterCurrentXids
+ * Set the XID array that TransactionIdIsCurrentTransactionId() should
+ * use.
+ */
+void
+SetClusterCurrentXids(TransactionId *xip, int xcnt)
+{
+ ClusterCurrentXids = xip;
+ nClusterCurrentXids = xcnt;
+}
+
+/*
+ * ResetClusterCurrentXids
+ * Undo the effect of SetClusterCurrentXids().
+ */
+void
+ResetClusterCurrentXids(void)
+{
+ ClusterCurrentXids = NULL;
+ nClusterCurrentXids = 0;
+}
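The lookup that `SetClusterCurrentXids()` enables is a plain `bsearch` over the sorted array, as the new branch in `TransactionIdIsCurrentTransactionId()` shows. A self-contained sketch of that membership test; `xid_cmp` mirrors the contract of PostgreSQL's `xidComparator` (raw numeric order, not wraparound-aware age), and `xid_in_set` is an invented wrapper name:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t TransactionId;

/*
 * Plain numeric comparison, matching the contract of xidComparator: the
 * array is sorted by raw xid value, not by transaction age.
 */
static int
xid_cmp(const void *a, const void *b)
{
	TransactionId xa = *(const TransactionId *) a;
	TransactionId xb = *(const TransactionId *) b;

	if (xa < xb)
		return -1;
	if (xa > xb)
		return 1;
	return 0;
}

/*
 * Membership test over a sorted xid array, mirroring the bsearch that
 * TransactionIdIsCurrentTransactionId() performs while ClusterCurrentXids
 * is set.
 */
static int
xid_in_set(TransactionId xid, const TransactionId *xip, int xcnt)
{
	return bsearch(&xid, xip, xcnt, sizeof(TransactionId), xid_cmp) != NULL;
}
```

Sorting the array once up front and binary-searching per check keeps the per-tuple visibility test at O(log n) even for transactions with many subtransactions.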
+
/*
* ShowTransactionState
* Debug support
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index c9cc061c45..3a1a51a56a 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -200,6 +200,7 @@ static void apply_concurrent_delete(Relation rel, HeapTuple tup_target,
ConcurrentChange *change);
static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
HeapTuple tup_key,
+ Snapshot snapshot,
IndexInsertState *iistate,
TupleTableSlot *ident_slot,
IndexScanDesc *scan_p);
@@ -2971,6 +2972,9 @@ setup_logical_decoding(Oid relid, const char *slotname, TupleDesc tupdesc)
dstate->relid = relid;
dstate->tstore = tuplestore_begin_heap(false, false,
maintenance_work_mem);
+#ifdef USE_ASSERT_CHECKING
+ dstate->last_change_xid = InvalidTransactionId;
+#endif
dstate->tupdesc = tupdesc;
/* Initialize the descriptor to store the changes ... */
@@ -3126,6 +3130,7 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
tup_exist;
char *change_raw, *src;
ConcurrentChange change;
+ Snapshot snapshot;
bool isnull[1];
Datum values[1];
@@ -3194,8 +3199,30 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
/*
* Find the tuple to be updated or deleted.
+ *
+ * As the table being CLUSTERed concurrently is considered a
+ * "user catalog", the new CID is WAL-logged and decoded. And since
+ * we use the same XID that the original DMLs did, the snapshot used
+ * for the logical decoding (by now converted to a non-historic
+ * MVCC snapshot) should see the tuples inserted previously into
+ * the new heap and/or updated there.
+ */
+ snapshot = change.snapshot;
+
+ /*
+ * Set what should be considered current transaction (and
+ * subtransactions) during visibility check.
+ *
+ * Note that this snapshot was created from a historic snapshot
+ * using SnapBuildMVCCFromHistoric(), which does not touch
+ * 'subxip'. Thus, unlike in a regular MVCC snapshot, the array
+ * only contains the transactions whose data changes we are
+ * applying, and their subtransactions. That's exactly what we need
+ * to check whether a particular xact is a "current transaction".
*/
- tup_exist = find_target_tuple(rel, key, nkeys, tup_key,
+ SetClusterCurrentXids(snapshot->subxip, snapshot->subxcnt);
+
+ tup_exist = find_target_tuple(rel, key, nkeys, tup_key, snapshot,
iistate, ident_slot, &ind_scan);
if (tup_exist == NULL)
elog(ERROR, "Failed to find target tuple");
@@ -3206,6 +3233,8 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
else
apply_concurrent_delete(rel, tup_exist, &change);
+ ResetClusterCurrentXids();
+
if (tup_old != NULL)
{
pfree(tup_old);
@@ -3218,11 +3247,14 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
else
elog(ERROR, "Unrecognized kind of change: %d", change.kind);
- /* If there's any change, make it visible to the next iteration. */
- if (change.kind != CHANGE_UPDATE_OLD)
+ /* Free the snapshot if this is the last change that needed it. */
+ Assert(change.snapshot->active_count > 0);
+ change.snapshot->active_count--;
+ if (change.snapshot->active_count == 0)
{
- CommandCounterIncrement();
- UpdateActiveSnapshotCommandId();
+ if (change.snapshot == dstate->snapshot)
+ dstate->snapshot = NULL;
+ FreeSnapshot(change.snapshot);
}
/* TTSOpsMinimalTuple has .get_heap_tuple==NULL. */
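The hunk above replaces the per-change `CommandCounterIncrement()` with reference counting on the snapshot: each decoded change pins the snapshot via `active_count`, and the change that drops the count to zero frees it. A generic sketch of that pattern, with invented names (`RefSnap`, `snap_pin`, `snap_unpin`) standing in for the patch's snapshot bookkeeping:

```c
#include <assert.h>
#include <stdlib.h>

/* Invented stand-in for a reference-counted snapshot. */
typedef struct
{
	int		active_count;	/* how many pending changes still use it */
} RefSnap;

static RefSnap *
snap_create(void)
{
	RefSnap    *s = malloc(sizeof(RefSnap));

	s->active_count = 0;
	return s;
}

/* Each decoded change that captures the snapshot bumps the count... */
static void
snap_pin(RefSnap *s)
{
	s->active_count++;
}

/* ...and applying the change drops it; the last user frees the snapshot. */
static int
snap_unpin(RefSnap *s)
{
	assert(s->active_count > 0);
	if (--s->active_count == 0)
	{
		free(s);
		return 1;			/* freed */
	}
	return 0;				/* still in use by another pending change */
}

/* Two changes share one snapshot; only the second unpin frees it. */
static int
demo(void)
{
	RefSnap    *s = snap_create();

	snap_pin(s);
	snap_pin(s);
	if (snap_unpin(s))		/* one user still left -- must not free */
		return -1;
	return snap_unpin(s);	/* last user frees it */
}
```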
@@ -3242,10 +3274,30 @@ static void
apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
IndexInsertState *iistate, TupleTableSlot *index_slot)
{
+ Snapshot snapshot = change->snapshot;
List *recheck;
+ /*
+ * For INSERT, the visibility information is not important, but we use the
+ * snapshot to get CID. Index functions might need the whole snapshot
+ * anyway.
+ */
+ SetClusterCurrentXids(snapshot->subxip, snapshot->subxcnt);
- heap_insert(rel, tup, GetCurrentCommandId(true), HEAP_INSERT_NO_LOGICAL, NULL);
+ /*
+ * Write the tuple into the new heap.
+ *
+ * The snapshot is the one we used to decode the insert (though converted
+ * to "non-historic" MVCC snapshot), i.e. the snapshot's curcid is the
+ * tuple CID incremented by one (due to the "new CID" WAL record that got
+ * written along with the INSERT record). Thus if we want to use the
+ * original CID, we need to subtract 1 from curcid.
+ */
+ Assert(snapshot->curcid != InvalidCommandId &&
+ snapshot->curcid > FirstCommandId);
+
+ heap_insert(rel, tup, change->xid, snapshot->curcid - 1,
+ HEAP_INSERT_NO_LOGICAL, NULL);
/*
* Update indexes.
@@ -3253,6 +3305,7 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
* In case functions in the index need the active snapshot and caller
* hasn't set one.
*/
+ PushActiveSnapshot(snapshot);
ExecStoreHeapTuple(tup, index_slot, false);
recheck = ExecInsertIndexTuples(iistate->rri,
index_slot,
@@ -3263,6 +3316,8 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
NIL, /* arbiterIndexes */
false /* onlySummarizing */
);
+ PopActiveSnapshot();
+ ResetClusterCurrentXids();
/*
	 * If recheck is required, it must have been performed on the source
@@ -3280,18 +3335,36 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
TupleTableSlot *index_slot)
{
List *recheck;
+ LockTupleMode lockmode;
TU_UpdateIndexes update_indexes;
+ TM_Result res;
+ Snapshot snapshot = change->snapshot;
+ TM_FailureData tmfd;
/*
* Write the new tuple into the new heap. ('tup' gets the TID assigned
* here.)
+ *
+ * Regarding CID, see the comment in apply_concurrent_insert().
*/
- simple_heap_update(rel, &tup_target->t_self, tup, &update_indexes);
+ Assert(snapshot->curcid != InvalidCommandId &&
+ snapshot->curcid > FirstCommandId);
+
+ res = heap_update(rel, &tup_target->t_self, tup,
+ change->xid, snapshot->curcid - 1,
+ InvalidSnapshot,
+ false, /* no wait - only we are doing changes */
+ &tmfd, &lockmode, &update_indexes,
+ /* wal_logical */
+ false);
+ if (res != TM_Ok)
+ ereport(ERROR, (errmsg("failed to apply concurrent UPDATE")));
ExecStoreHeapTuple(tup, index_slot, false);
if (update_indexes != TU_None)
{
+ PushActiveSnapshot(snapshot);
recheck = ExecInsertIndexTuples(iistate->rri,
index_slot,
iistate->estate,
@@ -3301,6 +3374,7 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
NIL, /* arbiterIndexes */
/* onlySummarizing */
update_indexes == TU_Summarizing);
+ PopActiveSnapshot();
list_free(recheck);
}
@@ -3311,7 +3385,22 @@ static void
apply_concurrent_delete(Relation rel, HeapTuple tup_target,
ConcurrentChange *change)
{
- simple_heap_delete(rel, &tup_target->t_self);
+ TM_Result res;
+ TM_FailureData tmfd;
+ Snapshot snapshot = change->snapshot;
+
+ /* Regarding CID, see the comment in apply_concurrent_insert(). */
+ Assert(snapshot->curcid != InvalidCommandId &&
+ snapshot->curcid > FirstCommandId);
+
+ res = heap_delete(rel, &tup_target->t_self, change->xid,
+ snapshot->curcid - 1, InvalidSnapshot, false,
+ &tmfd, false,
+ /* wal_logical */
+ false);
+
+ if (res != TM_Ok)
+ ereport(ERROR, (errmsg("failed to apply concurrent DELETE")));
pgstat_progress_incr_param(PROGRESS_CLUSTER_HEAP_TUPLES_DELETED, 1);
}
@@ -3329,7 +3418,7 @@ apply_concurrent_delete(Relation rel, HeapTuple tup_target,
*/
static HeapTuple
find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
- IndexInsertState *iistate,
+ Snapshot snapshot, IndexInsertState *iistate,
TupleTableSlot *ident_slot, IndexScanDesc *scan_p)
{
IndexScanDesc scan;
@@ -3337,7 +3426,7 @@ find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
int2vector *ident_indkey;
HeapTuple result = NULL;
- scan = index_beginscan(rel, iistate->ident_index, GetActiveSnapshot(),
+ scan = index_beginscan(rel, iistate->ident_index, snapshot,
nkeys, 0);
*scan_p = scan;
index_rescan(scan, key, nkeys, NULL, 0);
@@ -3409,6 +3498,8 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
}
PG_FINALLY();
{
+ ResetClusterCurrentXids();
+
if (rel_src)
rel_dst->rd_toastoid = InvalidOid;
}
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 8f45a7a168..23766ccfb6 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -469,9 +469,18 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
SnapBuild *builder = ctx->snapshot_builder;
/*
- * Check if CLUSTER CONCURRENTLY is being performed by this backend. If
- * so, only decode data changes of the table that it is processing, and
- * the changes of its TOAST relation.
+ * If the change is not intended for logical decoding, do not even
+ * establish a transaction for it. This is particularly important if the
+ * record was generated by CLUSTER CONCURRENTLY because this command uses
+ * the original XID when doing changes in the new storage. The decoding
+ * subsystem probably does not expect to see the same transaction multiple
+ * times.
+ */
+
+ /*
+ * First, check if CLUSTER CONCURRENTLY is being performed by this
+ * backend. If so, only decode data changes of the table that it is
+ * processing, and the changes of its TOAST relation.
*
* (TOAST locator should not be set unless the main is.)
*/
@@ -491,6 +500,60 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
return;
}
+ /*
+ * Second, skip records which do not contain sufficient information for
+ * the decoding.
+ *
+ * The backend executing CLUSTER CONCURRENTLY should not return here
+ * because the records which passed the checks above should be eligible
+ * for decoding. However, CLUSTER CONCURRENTLY generates WAL when
+ * writing data into the new table, which should not be decoded by the
+ * other backends. This is where the other backends skip them.
+ */
+ switch (info)
+ {
+ case XLOG_HEAP_INSERT:
+ {
+ xl_heap_insert *rec;
+
+ rec = (xl_heap_insert *) XLogRecGetData(buf->record);
+ /*
+ * (Besides insertion into the main heap by CLUSTER CONCURRENTLY,
+ * this does happen when raw_heap_insert marks the TOAST record as
+ * HEAP_INSERT_NO_LOGICAL).
+ */
+ if ((rec->flags & XLH_INSERT_CONTAINS_NEW_TUPLE) == 0)
+ return;
+
+ break;
+ }
+
+ case XLOG_HEAP_HOT_UPDATE:
+ case XLOG_HEAP_UPDATE:
+ {
+ xl_heap_update *rec;
+
+ rec = (xl_heap_update *) XLogRecGetData(buf->record);
+ if ((rec->flags &
+ (XLH_UPDATE_CONTAINS_NEW_TUPLE |
+ XLH_UPDATE_CONTAINS_OLD_TUPLE |
+ XLH_UPDATE_CONTAINS_OLD_KEY)) == 0)
+ return;
+
+ break;
+ }
+
+ case XLOG_HEAP_DELETE:
+ {
+ xl_heap_delete *rec;
+
+ rec = (xl_heap_delete *) XLogRecGetData(buf->record);
+ if (rec->flags & XLH_DELETE_NO_LOGICAL)
+ return;
+ break;
+ }
+ }
+
ReorderBufferProcessXid(ctx->reorder, xid, buf->origptr);
/*
@@ -923,13 +986,6 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
xlrec = (xl_heap_insert *) XLogRecGetData(r);
- /*
- * Ignore insert records without new tuples (this does happen when
- * raw_heap_insert marks the TOAST record as HEAP_INSERT_NO_LOGICAL).
- */
- if (!(xlrec->flags & XLH_INSERT_CONTAINS_NEW_TUPLE))
- return;
-
/* only interested in our database */
XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
if (target_locator.dbOid != ctx->slot->data.database)
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 86a9b0335a..2d4ce8b37f 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -155,7 +155,7 @@ static bool ExportInProgress = false;
static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
/* snapshot building/manipulation/distribution functions */
-static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
+static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder, XLogRecPtr lsn);
static void SnapBuildFreeSnapshot(Snapshot snap);
@@ -352,12 +352,17 @@ SnapBuildSnapDecRefcount(Snapshot snap)
* Build a new snapshot, based on currently committed catalog-modifying
* transactions.
*
+ * 'lsn' is the location of the commit record (of a catalog-changing
+ * transaction) that triggered creation of the snapshot. Pass
+ * InvalidXLogRecPtr for the transaction base snapshot or if the user of
+ * the snapshot does not need the LSN.
+ *
* In-progress transactions with catalog access are *not* allowed to modify
* these snapshots; they have to copy them and fill in appropriate ->curcid
* and ->subxip/subxcnt values.
*/
static Snapshot
-SnapBuildBuildSnapshot(SnapBuild *builder)
+SnapBuildBuildSnapshot(SnapBuild *builder, XLogRecPtr lsn)
{
Snapshot snapshot;
Size ssize;
@@ -425,6 +430,7 @@ SnapBuildBuildSnapshot(SnapBuild *builder)
snapshot->active_count = 0;
snapshot->regd_count = 0;
snapshot->snapXactCompletionCount = 0;
+ snapshot->lsn = lsn;
return snapshot;
}
@@ -461,7 +467,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
if (TransactionIdIsValid(MyProc->xmin))
elog(ERROR, "cannot build an initial slot snapshot when MyProc->xmin already is valid");
- snap = SnapBuildBuildSnapshot(builder);
+ snap = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
/*
* We know that snap->xmin is alive, enforced by the logical xmin
@@ -502,7 +508,7 @@ SnapBuildInitialSnapshotForCluster(SnapBuild *builder)
Assert(builder->state == SNAPBUILD_CONSISTENT);
- snap = SnapBuildBuildSnapshot(builder);
+ snap = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
return SnapBuildMVCCFromHistoric(snap, false);
}
@@ -636,7 +642,7 @@ SnapBuildGetOrBuildSnapshot(SnapBuild *builder)
/* only build a new snapshot if we don't have a prebuilt one */
if (builder->snapshot == NULL)
{
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
/* increase refcount for the snapshot builder */
SnapBuildSnapIncRefcount(builder->snapshot);
}
@@ -716,7 +722,7 @@ SnapBuildProcessChange(SnapBuild *builder, TransactionId xid, XLogRecPtr lsn)
/* only build a new snapshot if we don't have a prebuilt one */
if (builder->snapshot == NULL)
{
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, lsn);
/* increase refcount for the snapshot builder */
SnapBuildSnapIncRefcount(builder->snapshot);
}
@@ -1085,7 +1091,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
if (builder->snapshot)
SnapBuildSnapDecRefcount(builder->snapshot);
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, lsn);
/* we might need to execute invalidations, add snapshot */
if (!ReorderBufferXidHasBaseSnapshot(builder->reorder, xid))
@@ -1910,7 +1916,7 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
{
SnapBuildSnapDecRefcount(builder->snapshot);
}
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
SnapBuildSnapIncRefcount(builder->snapshot);
ReorderBufferSetRestartPoint(builder->reorder, lsn);
diff --git a/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c b/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
index 43f7b34297..8e915c55fb 100644
--- a/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
+++ b/src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
@@ -33,7 +33,8 @@ static void plugin_truncate(struct LogicalDecodingContext *ctx,
Relation relations[],
ReorderBufferChange *change);
static void store_change(LogicalDecodingContext *ctx,
- ConcurrentChangeKind kind, HeapTuple tuple);
+ ConcurrentChangeKind kind, HeapTuple tuple,
+ TransactionId xid);
void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
@@ -101,6 +102,7 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change)
{
ClusterDecodingState *dstate;
+ Snapshot snapshot;
dstate = (ClusterDecodingState *) ctx->output_writer_private;
@@ -108,6 +110,48 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (relation->rd_id != dstate->relid)
return;
+ /*
+ * Catalog snapshot is fine because the table we are processing is
+ * temporarily considered a user catalog table.
+ */
+ snapshot = GetCatalogSnapshot(InvalidOid);
+ Assert(snapshot->snapshot_type == SNAPSHOT_HISTORIC_MVCC);
+ Assert(!snapshot->suboverflowed);
+
+ /*
+ * This should not happen, but if we don't have enough information to
+ * apply a new snapshot, the consequences would be bad. Thus prefer ERROR
+ * to Assert().
+ */
+ if (XLogRecPtrIsInvalid(snapshot->lsn))
+ ereport(ERROR, (errmsg("snapshot has invalid LSN")));
+
+ /*
+ * reorderbuffer.c changes the catalog snapshot as soon as it sees a new
+ * CID or a commit record of a catalog-changing transaction.
+ */
+ if (dstate->snapshot == NULL || snapshot->lsn != dstate->snapshot_lsn ||
+ snapshot->curcid != dstate->snapshot->curcid)
+ {
+ /* CID should not go backwards. */
+ Assert(dstate->snapshot == NULL ||
+ snapshot->curcid >= dstate->snapshot->curcid ||
+ change->txn->xid != dstate->last_change_xid);
+
+ /*
+ * XXX Is it a problem that the copy is created in
+ * TopTransactionContext?
+ *
+ * XXX Wouldn't it be o.k. for SnapBuildMVCCFromHistoric() to set xcnt
+ * to 0 instead of converting xip in this case? The point is that
+ * transactions which are still in progress from the perspective of
+ * reorderbuffer.c could not be replayed yet, so we do not need to
+ * examine their XIDs.
+ */
+ dstate->snapshot = SnapBuildMVCCFromHistoric(snapshot, false);
+ dstate->snapshot_lsn = snapshot->lsn;
+ }
+
/* Decode entry depending on its type */
switch (change->action)
{
@@ -125,7 +169,7 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (newtuple == NULL)
elog(ERROR, "Incomplete insert info.");
- store_change(ctx, CHANGE_INSERT, newtuple);
+ store_change(ctx, CHANGE_INSERT, newtuple, change->txn->xid);
}
break;
case REORDER_BUFFER_CHANGE_UPDATE:
@@ -142,9 +186,11 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
elog(ERROR, "Incomplete update info.");
if (oldtuple != NULL)
- store_change(ctx, CHANGE_UPDATE_OLD, oldtuple);
+ store_change(ctx, CHANGE_UPDATE_OLD, oldtuple,
+ change->txn->xid);
- store_change(ctx, CHANGE_UPDATE_NEW, newtuple);
+ store_change(ctx, CHANGE_UPDATE_NEW, newtuple,
+ change->txn->xid);
}
break;
case REORDER_BUFFER_CHANGE_DELETE:
@@ -157,7 +203,7 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (oldtuple == NULL)
elog(ERROR, "Incomplete delete info.");
- store_change(ctx, CHANGE_DELETE, oldtuple);
+ store_change(ctx, CHANGE_DELETE, oldtuple, change->txn->xid);
}
break;
default:
@@ -191,13 +237,13 @@ plugin_truncate(struct LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (i == nrelations)
return;
- store_change(ctx, CHANGE_TRUNCATE, NULL);
+ store_change(ctx, CHANGE_TRUNCATE, NULL, InvalidTransactionId);
}
/* Store concurrent data change. */
static void
store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
- HeapTuple tuple)
+ HeapTuple tuple, TransactionId xid)
{
ClusterDecodingState *dstate;
char *change_raw;
@@ -265,6 +311,11 @@ store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
dst = dst_start + SizeOfConcurrentChange;
memcpy(dst, tuple->t_data, tuple->t_len);
+ /* Initialize the other fields. */
+ change.xid = xid;
+ change.snapshot = dstate->snapshot;
+ dstate->snapshot->active_count++;
+
/* The data has been copied. */
if (flattened)
pfree(tuple);
@@ -278,6 +329,9 @@ store:
isnull[0] = false;
tuplestore_putvalues(dstate->tstore, dstate->tupdesc_change,
values, isnull);
+#ifdef USE_ASSERT_CHECKING
+ dstate->last_change_xid = xid;
+#endif
/* Accounting. */
dstate->nchanges++;
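The snapshot handling that plugin_change() gains above (rebuild the MVCC snapshot only when the catalog snapshot's identity changes, and pin it once per stored change) can be sketched as follows; the names and the dict-based "snapshot" are illustrative, not the real API:

```python
# Sketch of the snapshot caching in plugin_change(): the historic catalog
# snapshot is converted to an MVCC one only when its (lsn, curcid) identity
# changes, and each stored change pins the snapshot via active_count.

class DecodingState:
    def __init__(self):
        self.snapshot = None       # last converted MVCC snapshot
        self.snapshot_lsn = None   # LSN of the catalog snapshot it came from
        self.conversions = 0       # how often we had to reconvert

    def snapshot_for_change(self, catalog_lsn, curcid, convert):
        # Reconvert when we have no snapshot yet, or when reorderbuffer
        # installed a new catalog snapshot (new LSN) or advanced the CID.
        if (self.snapshot is None
                or catalog_lsn != self.snapshot_lsn
                or curcid != self.snapshot["curcid"]):
            self.snapshot = convert(catalog_lsn, curcid)
            self.snapshot_lsn = catalog_lsn
            self.conversions += 1
        self.snapshot["active_count"] += 1  # one pin per stored change
        return self.snapshot


def convert(lsn, curcid):
    # Stand-in for SnapBuildMVCCFromHistoric().
    return {"lsn": lsn, "curcid": curcid, "active_count": 0}
```

Successive changes decoded under the same catalog snapshot thus share one converted copy instead of converting per change.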
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 9b1fb15d8c..5a6444e969 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -317,21 +317,24 @@ extern BulkInsertState GetBulkInsertState(void);
extern void FreeBulkInsertState(BulkInsertState);
extern void ReleaseBulkInsertStatePin(BulkInsertState bistate);
-extern void heap_insert(Relation relation, HeapTuple tup, CommandId cid,
- int options, BulkInsertState bistate);
+extern void heap_insert(Relation relation, HeapTuple tup, TransactionId xid,
+ CommandId cid, int options, BulkInsertState bistate);
extern void heap_multi_insert(Relation relation, struct TupleTableSlot **slots,
int ntuples, CommandId cid, int options,
BulkInsertState bistate);
extern TM_Result heap_delete(Relation relation, ItemPointer tid,
- CommandId cid, Snapshot crosscheck, bool wait,
- struct TM_FailureData *tmfd, bool changingPart);
+ TransactionId xid, CommandId cid,
+ Snapshot crosscheck, bool wait,
+ struct TM_FailureData *tmfd, bool changingPart,
+ bool wal_logical);
extern void heap_finish_speculative(Relation relation, ItemPointer tid);
extern void heap_abort_speculative(Relation relation, ItemPointer tid);
extern TM_Result heap_update(Relation relation, ItemPointer otid,
- HeapTuple newtup,
+ HeapTuple newtup, TransactionId xid,
CommandId cid, Snapshot crosscheck, bool wait,
struct TM_FailureData *tmfd, LockTupleMode *lockmode,
- TU_UpdateIndexes *update_indexes);
+ TU_UpdateIndexes *update_indexes,
+ bool wal_logical);
extern TM_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
CommandId cid, LockTupleMode mode, LockWaitPolicy wait_policy,
bool follow_updates,
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 277df6b3cf..8d4af07f84 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -104,6 +104,8 @@
#define XLH_DELETE_CONTAINS_OLD_KEY (1<<2)
#define XLH_DELETE_IS_SUPER (1<<3)
#define XLH_DELETE_IS_PARTITION_MOVE (1<<4)
+/* See heap_delete() */
+#define XLH_DELETE_NO_LOGICAL (1<<5)
/* convenience macro for checking whether any form of old tuple was logged */
#define XLH_DELETE_CONTAINS_OLD \
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index b2bc10ee04..1a0b173d48 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -482,6 +482,8 @@ extern Size EstimateTransactionStateSpace(void);
extern void SerializeTransactionState(Size maxsize, char *start_address);
extern void StartParallelWorkerTransaction(char *tstatespace);
extern void EndParallelWorkerTransaction(void);
+extern void SetClusterCurrentXids(TransactionId *xip, int xcnt);
+extern void ResetClusterCurrentXids(void);
extern bool IsTransactionBlock(void);
extern bool IsTransactionOrTransactionBlock(void);
extern char TransactionBlockStatusCode(void);
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index d420930d6b..8945e46e64 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -58,6 +58,14 @@ typedef struct ConcurrentChange
/* See the enum above. */
ConcurrentChangeKind kind;
+ /* Transaction that changes the data. */
+ TransactionId xid;
+
+ /*
+ * Historic catalog snapshot that was used to decode this change.
+ */
+ Snapshot snapshot;
+
/*
* The actual tuple.
*
@@ -89,6 +97,8 @@ typedef struct ClusterDecodingState
* tuplestore does this transparently.
*/
Tuplestorestate *tstore;
+ /* XID of the last change added to tstore. */
+ TransactionId last_change_xid PG_USED_FOR_ASSERTS_ONLY;
/* The current number of changes in tstore. */
double nchanges;
@@ -109,6 +119,14 @@ typedef struct ClusterDecodingState
/* Slot to retrieve data from tstore. */
TupleTableSlot *tsslot;
+ /*
+ * Historic catalog snapshot that was used to decode the most recent
+ * change.
+ */
+ Snapshot snapshot;
+ /* LSN of the record */
+ XLogRecPtr snapshot_lsn;
+
ResourceOwner resowner;
} ClusterDecodingState;
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 0e546ec149..014f27db7d 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -13,6 +13,7 @@
#ifndef SNAPSHOT_H
#define SNAPSHOT_H
+#include "access/xlogdefs.h"
#include "lib/pairingheap.h"
@@ -201,6 +202,8 @@ typedef struct SnapshotData
uint32 regd_count; /* refcount on RegisteredSnapshots */
pairingheap_node ph_node; /* link in the RegisteredSnapshots heap */
+ XLogRecPtr lsn; /* position in the WAL stream when taken */
+
/*
* The transaction completion count at the time GetSnapshotData() built
* this snapshot. Allows to avoid re-computing static snapshots when no
--
2.45.2
v07-0006-Add-regression-tests.patch (text/x-diff)
From 4d1b029b59a33973cdcd475ed4e4748375557843 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Mon, 13 Jan 2025 14:29:54 +0100
Subject: [PATCH 6/8] Add regression tests.
As this patch series adds the CONCURRENTLY option to the VACUUM FULL and
CLUSTER commands, it's appropriate to test that the "concurrent data changes"
(i.e. changes done by application while we are copying the table contents to
the new storage) are processed correctly.
Injection points are used to stop the data copying at some point. While the
backend in charge of the copying is waiting on the injection point, another
backend runs some INSERT, UPDATE and DELETE commands on the table. Then we
wake up the first backend and let the CLUSTER (CONCURRENTLY) command
finish. Finally we check that all the "concurrent data changes" are present in
the table and that they contain the correct visibility information.
---
src/backend/commands/cluster.c | 7 +
src/test/modules/injection_points/Makefile | 3 +-
.../injection_points/expected/cluster.out | 113 ++++++++++++++
.../modules/injection_points/logical.conf | 1 +
src/test/modules/injection_points/meson.build | 3 +
.../injection_points/specs/cluster.spec | 140 ++++++++++++++++++
6 files changed, 266 insertions(+), 1 deletion(-)
create mode 100644 src/test/modules/injection_points/expected/cluster.out
create mode 100644 src/test/modules/injection_points/logical.conf
create mode 100644 src/test/modules/injection_points/specs/cluster.spec
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 3a1a51a56a..8e73da73ee 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -58,6 +58,7 @@
#include "utils/acl.h"
#include "utils/fmgroids.h"
#include "utils/guc.h"
+#include "utils/injection_point.h"
#include "utils/inval.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
@@ -3721,6 +3722,12 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
*/
ident_key = build_identity_key(ident_idx_new, OldHeap, &ident_key_nentries);
+ /*
+ * During testing, wait for another backend to perform concurrent data
+ * changes which we will process below.
+ */
+ INJECTION_POINT("cluster-concurrently-before-lock");
+
/*
* Flush all WAL records inserted so far (possibly except for the last
* incomplete page, see GetInsertRecPtr), to minimize the amount of data
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 0753a9df58..e40ebec1bb 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -13,7 +13,8 @@ PGFILEDESC = "injection_points - facility for injection points"
REGRESS = injection_points reindex_conc
REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
-ISOLATION = basic inplace
+ISOLATION = basic inplace cluster
+ISOLATION_OPTS = --temp-config $(top_srcdir)/src/test/modules/injection_points/logical.conf
TAP_TESTS = 1
diff --git a/src/test/modules/injection_points/expected/cluster.out b/src/test/modules/injection_points/expected/cluster.out
new file mode 100644
index 0000000000..d84fff3693
--- /dev/null
+++ b/src/test/modules/injection_points/expected/cluster.out
@@ -0,0 +1,113 @@
+Parsed test spec with 2 sessions
+
+starting permutation: wait_before_lock change_existing change_new change_subxact1 change_subxact2 check2 wakeup_before_lock check1
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step wait_before_lock:
+ CLUSTER (CONCURRENTLY) clstr_test USING clstr_test_pkey;
+ <waiting ...>
+step change_existing:
+ UPDATE clstr_test SET i=10 where i=1;
+ UPDATE clstr_test SET j=20 where i=2;
+ UPDATE clstr_test SET i=30 where i=3;
+ UPDATE clstr_test SET i=40 where i=30;
+ DELETE FROM clstr_test WHERE i=4;
+
+step change_new:
+ INSERT INTO clstr_test(i, j) VALUES (5, 5), (6, 6), (7, 7), (8, 8);
+ UPDATE clstr_test SET i=50 where i=5;
+ UPDATE clstr_test SET j=60 where i=6;
+ DELETE FROM clstr_test WHERE i=7;
+
+step change_subxact1:
+ BEGIN;
+ INSERT INTO clstr_test(i, j) VALUES (100, 100);
+ SAVEPOINT s1;
+ UPDATE clstr_test SET i=101 where i=100;
+ SAVEPOINT s2;
+ UPDATE clstr_test SET i=102 where i=101;
+ COMMIT;
+
+step change_subxact2:
+ BEGIN;
+ SAVEPOINT s1;
+ INSERT INTO clstr_test(i, j) VALUES (110, 110);
+ ROLLBACK TO SAVEPOINT s1;
+ INSERT INTO clstr_test(i, j) VALUES (110, 111);
+ COMMIT;
+
+step check2:
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='clstr_test';
+
+ SELECT i, j FROM clstr_test ORDER BY i, j;
+
+ INSERT INTO data_s2(_xmin, _cmin, i, j)
+ SELECT xmin, cmin, i, j FROM clstr_test;
+
+ i| j
+---+---
+ 2| 20
+ 6| 60
+ 8| 8
+ 10| 1
+ 40| 3
+ 50| 5
+102|100
+110|111
+(8 rows)
+
+step wakeup_before_lock:
+ SELECT injection_points_wakeup('cluster-concurrently-before-lock');
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step wait_before_lock: <... completed>
+step check1:
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='clstr_test';
+
+ SELECT count(DISTINCT node) FROM relfilenodes;
+
+ SELECT i, j FROM clstr_test ORDER BY i, j;
+
+ INSERT INTO data_s1(_xmin, _cmin, i, j)
+ SELECT xmin, cmin, i, j FROM clstr_test;
+
+ SELECT count(*)
+ FROM data_s1 d1 FULL JOIN data_s2 d2 USING (_xmin, _cmin, i, j)
+ WHERE d1.i ISNULL OR d2.i ISNULL;
+
+count
+-----
+ 2
+(1 row)
+
+ i| j
+---+---
+ 2| 20
+ 6| 60
+ 8| 8
+ 10| 1
+ 40| 3
+ 50| 5
+102|100
+110|111
+(8 rows)
+
+count
+-----
+ 0
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
diff --git a/src/test/modules/injection_points/logical.conf b/src/test/modules/injection_points/logical.conf
new file mode 100644
index 0000000000..c8f264bc6c
--- /dev/null
+++ b/src/test/modules/injection_points/logical.conf
@@ -0,0 +1 @@
+wal_level = logical
\ No newline at end of file
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 989b4db226..8c404ddd61 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -44,7 +44,10 @@ tests += {
'specs': [
'basic',
'inplace',
+ 'cluster',
],
+ # 'cluster' requires wal_level = 'logical'.
+ 'regress_args': ['--temp-config', files('logical.conf')],
},
'tap': {
'env': {
diff --git a/src/test/modules/injection_points/specs/cluster.spec b/src/test/modules/injection_points/specs/cluster.spec
new file mode 100644
index 0000000000..5f8404c5da
--- /dev/null
+++ b/src/test/modules/injection_points/specs/cluster.spec
@@ -0,0 +1,140 @@
+# Prefix the system columns with underscore as they are not allowed as column
+# names.
+setup
+{
+ CREATE EXTENSION injection_points;
+
+ CREATE TABLE clstr_test(i int PRIMARY KEY, j int);
+ INSERT INTO clstr_test(i, j) VALUES (1, 1), (2, 2), (3, 3), (4, 4);
+
+ CREATE TABLE relfilenodes(node oid);
+
+ CREATE TABLE data_s1(_xmin xid, _cmin cid, i int, j int);
+ CREATE TABLE data_s2(_xmin xid, _cmin cid, i int, j int);
+}
+
+teardown
+{
+ DROP TABLE clstr_test;
+ DROP EXTENSION injection_points;
+
+ DROP TABLE relfilenodes;
+ DROP TABLE data_s1;
+ DROP TABLE data_s2;
+}
+
+session s1
+setup
+{
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('cluster-concurrently-before-lock', 'wait');
+}
+# Perform the initial load and wait for s2 to do some data changes.
+step wait_before_lock
+{
+ CLUSTER (CONCURRENTLY) clstr_test USING clstr_test_pkey;
+}
+# Check the table from the perspective of s1.
+#
+# Besides the contents, we also check that relfilenode has changed.
+#
+# xmin and cmin columns are used to check that we do not change tuple
+# visibility information. Since we do not expect xmin to stay unchanged across
+# test runs, it cannot appear in the output text. Instead, have each session
+# write the contents into a table and use FULL JOIN to check if the outputs
+# are identical.
+step check1
+{
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='clstr_test';
+
+ SELECT count(DISTINCT node) FROM relfilenodes;
+
+ SELECT i, j FROM clstr_test ORDER BY i, j;
+
+ INSERT INTO data_s1(_xmin, _cmin, i, j)
+ SELECT xmin, cmin, i, j FROM clstr_test;
+
+ SELECT count(*)
+ FROM data_s1 d1 FULL JOIN data_s2 d2 USING (_xmin, _cmin, i, j)
+ WHERE d1.i ISNULL OR d2.i ISNULL;
+}
+teardown
+{
+ SELECT injection_points_detach('cluster-concurrently-before-lock');
+}
+
+session s2
+# Change the existing data. UPDATE changes both key and non-key columns. Also
+# update one row twice to test whether tuple version generated by this session
+# can be found.
+step change_existing
+{
+ UPDATE clstr_test SET i=10 where i=1;
+ UPDATE clstr_test SET j=20 where i=2;
+ UPDATE clstr_test SET i=30 where i=3;
+ UPDATE clstr_test SET i=40 where i=30;
+ DELETE FROM clstr_test WHERE i=4;
+}
+# Insert new rows and UPDATE / DELETE some of them. Again, update both key and
+# non-key column.
+step change_new
+{
+ INSERT INTO clstr_test(i, j) VALUES (5, 5), (6, 6), (7, 7), (8, 8);
+ UPDATE clstr_test SET i=50 where i=5;
+ UPDATE clstr_test SET j=60 where i=6;
+ DELETE FROM clstr_test WHERE i=7;
+}
+
+# When applying concurrent data changes, we should see the effects of an
+# in-progress subtransaction.
+step change_subxact1
+{
+ BEGIN;
+ INSERT INTO clstr_test(i, j) VALUES (100, 100);
+ SAVEPOINT s1;
+ UPDATE clstr_test SET i=101 where i=100;
+ SAVEPOINT s2;
+ UPDATE clstr_test SET i=102 where i=101;
+ COMMIT;
+}
+
+# When applying concurrent data changes, we should not see the effects of a
+# rolled back subtransaction.
+step change_subxact2
+{
+ BEGIN;
+ SAVEPOINT s1;
+ INSERT INTO clstr_test(i, j) VALUES (110, 110);
+ ROLLBACK TO SAVEPOINT s1;
+ INSERT INTO clstr_test(i, j) VALUES (110, 111);
+ COMMIT;
+}
+
+# Check the table from the perspective of s2.
+step check2
+{
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='clstr_test';
+
+ SELECT i, j FROM clstr_test ORDER BY i, j;
+
+ INSERT INTO data_s2(_xmin, _cmin, i, j)
+ SELECT xmin, cmin, i, j FROM clstr_test;
+}
+step wakeup_before_lock
+{
+ SELECT injection_points_wakeup('cluster-concurrently-before-lock');
+}
+
+# Test if data changes introduced while one session is performing CLUSTER
+# (CONCURRENTLY) find their way into the table.
+permutation
+ wait_before_lock
+ change_existing
+ change_new
+ change_subxact1
+ change_subxact2
+ check2
+ wakeup_before_lock
+ check1
--
2.45.2
v07-0007-Introduce-cluster_max_xlock_time-configuration-varia.patch (text/x-diff)
From 6c314f3e135938def75aa8b9efde86f782503d10 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Mon, 13 Jan 2025 14:29:54 +0100
Subject: [PATCH 7/8] Introduce cluster_max_xlock_time configuration variable.
When executing VACUUM FULL / CLUSTER (CONCURRENTLY) we need the
AccessExclusiveLock to swap the relation files, which should only take a
short time. However, on a busy system, other backends might change a
non-negligible amount of data in the table while we are waiting for the
lock. Since these changes must be applied to the new storage before the swap,
the time we eventually hold the lock might become non-negligible too.
If the user is worried about this situation, they can set cluster_max_xlock_time
to the maximum time for which the exclusive lock may be held. If this amount
of time is not sufficient to complete the VACUUM FULL / CLUSTER (CONCURRENTLY)
command, ERROR is raised and the command is canceled.
---
doc/src/sgml/config.sgml | 32 +++++
doc/src/sgml/ref/cluster.sgml | 9 +-
src/backend/access/heap/heapam_handler.c | 3 +-
src/backend/commands/cluster.c | 133 +++++++++++++++---
src/backend/utils/misc/guc_tables.c | 14 ++
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/commands/cluster.h | 5 +-
.../injection_points/expected/cluster.out | 74 +++++++++-
.../injection_points/specs/cluster.spec | 42 ++++++
9 files changed, 293 insertions(+), 20 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 3f41a17b1f..695d1fe2a4 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -10701,6 +10701,38 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
</listitem>
</varlistentry>
+ <varlistentry id="guc-cluster-max-xclock-time" xreflabel="cluster_max_xlock_time">
+ <term><varname>cluster_max_xlock_time</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>cluster_max_xlock_time</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ This is the maximum amount of time to hold an exclusive lock on a
+ table by commands <command>CLUSTER</command> and <command>VACUUM
+ FULL</command> with the <literal>CONCURRENTLY</literal>
+ option. Typically, these commands should not need the lock for longer
+ than <command>TRUNCATE</command> does. However, additional time
+ might be needed if the system is too busy. (See
+ <xref linkend="sql-cluster"/> for an explanation of how
+ the <literal>CONCURRENTLY</literal> option works.)
+ </para>
+
+ <para>
+ If you want to restrict the lock time, set this variable to the
+ highest acceptable value. If it appears during the processing that
+ additional time is needed to release the lock, the command will be
+ cancelled.
+ </para>
+
+ <para>
+ The default value is 0, which means that the lock is not released
+ until the concurrent data changes are processed.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect1>
diff --git a/doc/src/sgml/ref/cluster.sgml b/doc/src/sgml/ref/cluster.sgml
index 356b40e3fe..ebb85d9d47 100644
--- a/doc/src/sgml/ref/cluster.sgml
+++ b/doc/src/sgml/ref/cluster.sgml
@@ -141,7 +141,14 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
(<xref linkend="logicaldecoding"/>) and applied before
the <literal>ACCESS EXCLUSIVE</literal> lock is requested. Thus the lock
is typically held only for the time needed to swap the files, which
- should be pretty short.
+ should be pretty short. However, the time might still be noticeable if
+ too many data changes have been done to the table while
+ <command>CLUSTER</command> was waiting for the lock: those changes must
+ be processed just before the files are swapped, while the
+ <literal>ACCESS EXCLUSIVE</literal> lock is being held. If you are
+ worried about this situation, set
+ the <link linkend="guc-cluster-max-xclock-time"><varname>cluster_max_xlock_time</varname></link>
+ configuration parameter to a value that your applications can tolerate.
</para>
<para>
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index c315abac02..13c67f0e5f 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1004,7 +1004,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
end_of_wal = GetFlushRecPtr(NULL);
if ((end_of_wal - end_of_wal_prev) > wal_segment_size)
{
- cluster_decode_concurrent_changes(decoding_ctx, end_of_wal);
+ cluster_decode_concurrent_changes(decoding_ctx, end_of_wal,
+ NULL);
end_of_wal_prev = end_of_wal;
}
}
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 8e73da73ee..dfae550123 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -17,6 +17,8 @@
*/
#include "postgres.h"
+#include <sys/time.h>
+
#include "access/amapi.h"
#include "access/heapam.h"
#include "access/multixact.h"
@@ -102,6 +104,15 @@ RelFileLocator clustered_rel_toast_locator = {.relNumber = InvalidOid};
#define CLUSTER_IN_PROGRESS_MESSAGE \
"relation \"%s\" is already being processed by CLUSTER CONCURRENTLY"
+/*
+ * The maximum time to hold AccessExclusiveLock during the final
+ * processing. Note that only the execution time of
+ * process_concurrent_changes() is included here. The very last steps like
+ * swap_relation_files() shouldn't get blocked and it'd be wrong to consider
+ * them a reason to abort otherwise completed processing.
+ */
+int cluster_max_xlock_time = 0;
+
/*
* Everything we need to call ExecInsertIndexTuples().
*/
@@ -188,7 +199,8 @@ static LogicalDecodingContext *setup_logical_decoding(Oid relid,
static HeapTuple get_changed_tuple(char *change);
static void apply_concurrent_changes(ClusterDecodingState *dstate,
Relation rel, ScanKey key, int nkeys,
- IndexInsertState *iistate);
+ IndexInsertState *iistate,
+ struct timeval *must_complete);
static void apply_concurrent_insert(Relation rel, ConcurrentChange *change,
HeapTuple tup, IndexInsertState *iistate,
TupleTableSlot *index_slot);
@@ -205,13 +217,15 @@ static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
IndexInsertState *iistate,
TupleTableSlot *ident_slot,
IndexScanDesc *scan_p);
-static void process_concurrent_changes(LogicalDecodingContext *ctx,
+static bool process_concurrent_changes(LogicalDecodingContext *ctx,
XLogRecPtr end_of_wal,
Relation rel_dst,
Relation rel_src,
ScanKey ident_key,
int ident_key_nentries,
- IndexInsertState *iistate);
+ IndexInsertState *iistate,
+ struct timeval *must_complete);
+static bool processing_time_elapsed(struct timeval *must_complete);
static IndexInsertState *get_index_insert_state(Relation relation,
Oid ident_index_id);
static ScanKey build_identity_key(Oid ident_idx_oid, Relation rel_src,
@@ -3027,7 +3041,8 @@ get_changed_tuple(char *change)
*/
void
cluster_decode_concurrent_changes(LogicalDecodingContext *ctx,
- XLogRecPtr end_of_wal)
+ XLogRecPtr end_of_wal,
+ struct timeval *must_complete)
{
ClusterDecodingState *dstate;
ResourceOwner resowner_old;
@@ -3065,6 +3080,9 @@ cluster_decode_concurrent_changes(LogicalDecodingContext *ctx,
if (record != NULL)
LogicalDecodingProcessRecord(ctx, ctx->reader);
+ if (processing_time_elapsed(must_complete))
+ break;
+
/*
* If WAL segment boundary has been crossed, inform the decoding
* system that the catalog_xmin can advance. (We can confirm more
@@ -3107,7 +3125,8 @@ cluster_decode_concurrent_changes(LogicalDecodingContext *ctx,
*/
static void
apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
- ScanKey key, int nkeys, IndexInsertState *iistate)
+ ScanKey key, int nkeys, IndexInsertState *iistate,
+ struct timeval *must_complete)
{
TupleTableSlot *index_slot, *ident_slot;
HeapTuple tup_old = NULL;
@@ -3137,6 +3156,9 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
CHECK_FOR_INTERRUPTS();
+ Assert(dstate->nchanges > 0);
+ dstate->nchanges--;
+
/* Get the change from the single-column tuple. */
tup_change = ExecFetchSlotHeapTuple(dstate->tsslot, false, &shouldFree);
heap_deform_tuple(tup_change, dstate->tupdesc_change, values, isnull);
@@ -3261,10 +3283,22 @@ apply_concurrent_changes(ClusterDecodingState *dstate, Relation rel,
/* TTSOpsMinimalTuple has .get_heap_tuple==NULL. */
Assert(shouldFree);
pfree(tup_change);
+
+ /*
+ * If there is a limit on the time of completion, check it
+ * now. However, make sure the loop does not break if tup_old was set
+ * in the previous iteration. In such a case we could not resume the
+ * processing in the next call.
+ */
+ if (must_complete && tup_old == NULL &&
+ processing_time_elapsed(must_complete))
+ /* The next call will process the remaining changes. */
+ break;
}
- tuplestore_clear(dstate->tstore);
- dstate->nchanges = 0;
+ /* If we could not apply all the changes, the next call will do. */
+ if (dstate->nchanges == 0)
+ tuplestore_clear(dstate->tstore);
/* Cleanup. */
ExecDropSingleTupleTableSlot(index_slot);
@@ -3467,11 +3501,15 @@ find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
* Decode and apply concurrent changes.
*
* Pass rel_src iff its reltoastrelid is needed.
+ *
+ * Returns true if must_complete is NULL or if managed to complete by the time
+ * *must_complete indicates.
*/
-static void
+static bool
process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
Relation rel_dst, Relation rel_src, ScanKey ident_key,
- int ident_key_nentries, IndexInsertState *iistate)
+ int ident_key_nentries, IndexInsertState *iistate,
+ struct timeval *must_complete)
{
ClusterDecodingState *dstate;
@@ -3480,10 +3518,19 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
dstate = (ClusterDecodingState *) ctx->output_writer_private;
- cluster_decode_concurrent_changes(ctx, end_of_wal);
+ cluster_decode_concurrent_changes(ctx, end_of_wal, must_complete);
+
+ if (processing_time_elapsed(must_complete))
+ /* Caller is responsible for applying the changes. */
+ return false;
+ /*
+ * *must_complete not reached, so there are really no changes. (It's
+ * possible to see no changes just because not enough time was left for
+ * the decoding.)
+ */
if (dstate->nchanges == 0)
- return;
+ return true;
PG_TRY();
{
@@ -3495,7 +3542,7 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
rel_dst->rd_toastoid = rel_src->rd_rel->reltoastrelid;
apply_concurrent_changes(dstate, rel_dst, ident_key,
- ident_key_nentries, iistate);
+ ident_key_nentries, iistate, must_complete);
}
PG_FINALLY();
{
@@ -3505,6 +3552,28 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
rel_dst->rd_toastoid = InvalidOid;
}
PG_END_TRY();
+
+ /*
+ * apply_concurrent_changes() does check the processing time, so if some
+ * changes are left, we ran out of time.
+ */
+ return dstate->nchanges == 0;
+}
+
+/*
+ * Check if the current time is beyond *must_complete.
+ */
+static bool
+processing_time_elapsed(struct timeval *must_complete)
+{
+ struct timeval now;
+
+ if (must_complete == NULL)
+ return false;
+
+ gettimeofday(&now, NULL);
+
+ return timercmp(&now, must_complete, >);
}
static IndexInsertState *
@@ -3665,6 +3734,8 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
RelReopenInfo *rri = NULL;
int nrel;
Relation *ind_refs_all, *ind_refs_p;
+ struct timeval t_end;
+ struct timeval *t_end_ptr = NULL;
/* Like in cluster_rel(). */
lockmode_old = ShareUpdateExclusiveLock;
@@ -3744,7 +3815,8 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
*/
process_concurrent_changes(ctx, end_of_wal, NewHeap,
swap_toast_by_content ? OldHeap : NULL,
- ident_key, ident_key_nentries, iistate);
+ ident_key, ident_key_nentries, iistate,
+ NULL);
/*
* Release the locks that allowed concurrent data changes, in order to
@@ -3866,9 +3938,38 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
end_of_wal = GetFlushRecPtr(NULL);
/* Apply the concurrent changes again. */
- process_concurrent_changes(ctx, end_of_wal, NewHeap,
- swap_toast_by_content ? OldHeap : NULL,
- ident_key, ident_key_nentries, iistate);
+ /*
+ * This time we have the exclusive lock on the table, so make sure that
+ * cluster_max_xlock_time is not exceeded.
+ */
+ if (cluster_max_xlock_time > 0)
+ {
+ int64 usec;
+ struct timeval t_start;
+
+ gettimeofday(&t_start, NULL);
+ /* Add the whole seconds. */
+ t_end.tv_sec = t_start.tv_sec + cluster_max_xlock_time / 1000;
+ /* Add the rest, expressed in microseconds. */
+ usec = t_start.tv_usec + 1000 * (cluster_max_xlock_time % 1000);
+ /* The number of microseconds could have overflown. */
+ t_end.tv_sec += usec / USECS_PER_SEC;
+ t_end.tv_usec = usec % USECS_PER_SEC;
+ t_end_ptr = &t_end;
+ }
+ /*
+ * During testing, stop here to simulate excessive processing time.
+ */
+ INJECTION_POINT("cluster-concurrently-after-lock");
+
+ if (!process_concurrent_changes(ctx, end_of_wal, NewHeap,
+ swap_toast_by_content ? OldHeap : NULL,
+ ident_key, ident_key_nentries, iistate,
+ t_end_ptr))
+ ereport(ERROR,
+ (errmsg("could not process concurrent data changes in time"),
+ errhint("Please consider adjusting \"cluster_max_xlock_time\".")));
+
/* Remember info about rel before closing OldHeap */
relpersistence = OldHeap->rd_rel->relpersistence;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index c9d8cd796a..7f4686f31e 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -39,6 +39,7 @@
#include "catalog/namespace.h"
#include "catalog/storage.h"
#include "commands/async.h"
+#include "commands/cluster.h"
#include "commands/event_trigger.h"
#include "commands/tablespace.h"
#include "commands/trigger.h"
@@ -2791,6 +2792,19 @@ struct config_int ConfigureNamesInt[] =
1600000000, 0, 2100000000,
NULL, NULL, NULL
},
+ {
+ {"cluster_max_xlock_time", PGC_USERSET, LOCK_MANAGEMENT,
gettext_noop("Maximum time for VACUUM FULL / CLUSTER (CONCURRENTLY) to keep the table locked."),
+ gettext_noop(
+ "The table is locked in exclusive mode during the final stage of processing. "
"If the lock time exceeds this value, an error is raised and the lock is "
+ "released. Set to zero if you don't care how long the lock can be held."),
+ GUC_UNIT_MS
+ },
+ &cluster_max_xlock_time,
+ 0, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
/*
* See also CheckRequiredParameterValues() if this parameter changes
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index b2bc43383d..eef7be70c5 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -728,6 +728,7 @@ autovacuum_worker_slots = 16 # autovacuum worker slots to allocate
#vacuum_multixact_freeze_table_age = 150000000
#vacuum_multixact_freeze_min_age = 5000000
#vacuum_multixact_failsafe_age = 1600000000
+#cluster_max_xlock_time = 0
#bytea_output = 'hex' # hex, escape
#xmlbinary = 'base64'
#xmloption = 'content'
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 8945e46e64..72221e71d5 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -44,6 +44,8 @@ typedef struct ClusterParams
extern RelFileLocator clustered_rel_locator;
extern RelFileLocator clustered_rel_toast_locator;
+extern PGDLLIMPORT int cluster_max_xlock_time;
+
typedef enum
{
CHANGE_INSERT,
@@ -139,7 +141,8 @@ extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
extern void check_relation_is_clusterable_concurrently(Relation rel,
bool is_vacuum);
extern void cluster_decode_concurrent_changes(LogicalDecodingContext *ctx,
- XLogRecPtr end_of_wal);
+ XLogRecPtr end_of_wal,
+ struct timeval *must_complete);
extern Oid make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
char relpersistence, LOCKMODE lockmode);
extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
diff --git a/src/test/modules/injection_points/expected/cluster.out b/src/test/modules/injection_points/expected/cluster.out
index d84fff3693..646e31448f 100644
--- a/src/test/modules/injection_points/expected/cluster.out
+++ b/src/test/modules/injection_points/expected/cluster.out
@@ -1,4 +1,4 @@
-Parsed test spec with 2 sessions
+Parsed test spec with 4 sessions
starting permutation: wait_before_lock change_existing change_new change_subxact1 change_subxact2 check2 wakeup_before_lock check1
injection_points_attach
@@ -111,3 +111,75 @@ injection_points_detach
(1 row)
+injection_points_detach
+-----------------------
+
+(1 row)
+
+
+starting permutation: wait_after_lock wakeup_after_lock
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step wait_after_lock:
+ CLUSTER (CONCURRENTLY) clstr_test USING clstr_test_pkey;
+ <waiting ...>
+step wakeup_after_lock:
+ SELECT injection_points_wakeup('cluster-concurrently-after-lock');
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step wait_after_lock: <... completed>
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+
+starting permutation: wait_after_lock after_lock_delay wakeup_after_lock
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step wait_after_lock:
+ CLUSTER (CONCURRENTLY) clstr_test USING clstr_test_pkey;
+ <waiting ...>
+step after_lock_delay:
+ SELECT pg_sleep(1.5);
+
+pg_sleep
+--------
+
+(1 row)
+
+step wakeup_after_lock:
+ SELECT injection_points_wakeup('cluster-concurrently-after-lock');
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step wait_after_lock: <... completed>
+ERROR: could not process concurrent data changes in time
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
diff --git a/src/test/modules/injection_points/specs/cluster.spec b/src/test/modules/injection_points/specs/cluster.spec
index 5f8404c5da..9af41bac6d 100644
--- a/src/test/modules/injection_points/specs/cluster.spec
+++ b/src/test/modules/injection_points/specs/cluster.spec
@@ -127,6 +127,34 @@ step wakeup_before_lock
SELECT injection_points_wakeup('cluster-concurrently-before-lock');
}
+session s3
+setup
+{
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('cluster-concurrently-after-lock', 'wait');
+ SET cluster_max_xlock_time TO '1s';
+}
+# Perform the initial load, lock the table in exclusive mode and wait. s4 will
+# cancel the waiting.
+step wait_after_lock
+{
+ CLUSTER (CONCURRENTLY) clstr_test USING clstr_test_pkey;
+}
+teardown
+{
+ SELECT injection_points_detach('cluster-concurrently-after-lock');
+}
+
+session s4
+step wakeup_after_lock
+{
+ SELECT injection_points_wakeup('cluster-concurrently-after-lock');
+}
+step after_lock_delay
+{
+ SELECT pg_sleep(1.5);
+}
+
# Test if data changes introduced while one session is performing CLUSTER
# (CONCURRENTLY) find their way into the table.
permutation
@@ -138,3 +166,17 @@ permutation
check2
wakeup_before_lock
check1
+
+# Test the cluster_max_xlock_time configuration variable.
+#
+# First, cancel waiting on the injection point immediately. That way, CLUSTER
+# should complete.
+permutation
+ wait_after_lock
+ wakeup_after_lock
+# Second, cancel the waiting with a delay that violates
+# cluster_max_xlock_time.
+permutation
+ wait_after_lock
+ after_lock_delay
+ wakeup_after_lock
--
2.45.2
Hi,
On Sat, Jan 11, 2025 at 09:01:54AM -0500, Andrew Dunstan wrote:
On 2025-01-09 Th 8:35 AM, Alvaro Herrera wrote:
Maybe we should have a new toplevel command. Some ideas that have been
thrown around:- RETABLE (it's like REINDEX, but for tables)
- ALTER TABLE <tab> SQUEEZE
- SQUEEZE <table>
- VACUUM (SQUEEZE)
- VACUUM (COMPACT)
- MAINTAIN <tab> COMPACT
- MAINTAIN <tab> SQUEEZE
I don't like any of them a lot :-/
COMPACT tablename ...
That sounds like it would compress content rather than just rewrite it
normally to get rid of bloat.
I think REORG (or REPACK, but that has no history elsewhere) would fit
best, we don't need to emulate the myriad of DB2 options...
Michael
On Mon, Jan 13, 2025 at 8:56 AM Michael Banck <mbanck@gmx.net> wrote:
Hi,
On Sat, Jan 11, 2025 at 09:01:54AM -0500, Andrew Dunstan wrote:
On 2025-01-09 Th 8:35 AM, Alvaro Herrera wrote:
Maybe we should have a new toplevel command. Some ideas that have been
thrown around:
- RETABLE (it's like REINDEX, but for tables)
- ALTER TABLE <tab> SQUEEZE
- SQUEEZE <table>
- VACUUM (SQUEEZE)
- VACUUM (COMPACT)
- MAINTAIN <tab> COMPACT
- MAINTAIN <tab> SQUEEZE
I don't like any of them a lot :-/
Agreed, though I do believe there would be a positive gain from
eliminating the overloaded CLUSTER term.
COMPACT tablename ...
That sounds like it would compress content rather than just rewrite it
normally to get rid of bloat.
I think REORG (or REPACK, but that has no history elsewhere) would fit
best, we don't need to emulate the myriad of DB2 options...
I would like REPACK if I didn't believe it would lead to confusion
with pg_repack (which, afaict, seems to have better performance
characteristics, so will probably hang around).
Actually, I wonder if we are too focused on the idea this is a
vaccum/bloat related tool. The original idea behind CLUSTER was not
related to vacuum or bloat management, but performance. There are
other reasons to want to rewrite a table as well (think dropped
columns or new column defaults). Is ALTER TABLE <table> REWRITE an
option? Current needed options would be for clustering or running
concurrently, but even without those options sometimes you just want
to rewrite the table, and this is probably more straightforward
than making something up.
Robert Treat
https://xzilla.net
On 2025-01-15 We 11:13 AM, Robert Treat wrote:
On Mon, Jan 13, 2025 at 8:56 AM Michael Banck <mbanck@gmx.net> wrote:
Hi,
On Sat, Jan 11, 2025 at 09:01:54AM -0500, Andrew Dunstan wrote:
On 2025-01-09 Th 8:35 AM, Alvaro Herrera wrote:
Maybe we should have a new toplevel command. Some ideas that have been
thrown around:
- RETABLE (it's like REINDEX, but for tables)
- ALTER TABLE <tab> SQUEEZE
- SQUEEZE <table>
- VACUUM (SQUEEZE)
- VACUUM (COMPACT)
- MAINTAIN <tab> COMPACT
- MAINTAIN <tab> SQUEEZE
I don't like any of them a lot :-/
Agreed, though I do believe there would be a positive gain from
eliminating the overloaded CLUSTER term.
COMPACT tablename ...
That sounds like it would compress content rather than just rewrite it
normally to get rid of bloat.
I think REORG (or REPACK, but that has no history elsewhere) would fit
best, we don't need to emulate the myriad of DB2 options...
I would like REPACK if I didn't believe it would lead to confusion
with pg_repack (which, afaict, seems to have better performance
characteristics, so will probably hang around).
Actually, I wonder if we are too focused on the idea this is a
vacuum/bloat related tool. The original idea behind CLUSTER was not
related to vacuum or bloat management, but performance. There are
other reasons to want to rewrite a table as well (think dropped
columns or new column defaults). Is ALTER TABLE <table> REWRITE an
option? Current needed options would be for clustering or running
concurrently, but even without those options sometimes you just want
to rewrite the table, and this is probably more straightforward
than making something up.
I really don't like any of the ALTER TABLE variants, because that's
about changing the table's definition, and this operation doesn't do
that. I could live with REORG as a top level verb if you don't like COMPACT.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Hi,
On Mon, Jan 13, 2025 at 02:48:31PM +0100, Antonin Houska wrote:
Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
On 2025-Jan-09, Antonin Houska wrote:
It seems you accidentally fixed another problem :-) I was referring to the
'lockmode' argument of make_new_heap(). I can try to write a patch for that
but ...
Meanwhile the patch 0004 has some seemingly trivial conflicts. If you
want to rebase, I'd appreciate that. In the meantime I'll give a look
at the next two other API changes.
This is the patch series rebased on top of the commit cc811f92ba.
I haven't addressed the problem of a new command yet - for that I'd like to
see some sort of consensus, so that I do not have to do all the related
changes many times.
Well, looks like this patch-set is blocked on the bikeshedding part?
Somebody should call a shot here, then.
Michael
On 2025-Jan-30, Michael Banck wrote:
I haven't addressed the problem of a new command yet - for that I'd like to
see some sort of consensus, so that I do not have to do all the related
changes many times.
Well, looks like this patch-set is blocked on the bikeshedding part?
Somebody should call a shot here, then.
A bunch of people discussed this patch in today's developer meeting in
Brussels. There's pretty much a consensus on using the verb REPACK
CONCURRENTLY for this new command -- where unadorned REPACK would be
VACUUM FULL, and we'd have something like REPACK WITH INDEX or maybe
REPACK USING INDEX to take the CLUSTER place.
For the record, there was an observation that 1) if logical decoding is
not enabled, REPACK CONCURRENTLY would not work, and 2) that sites being
forced to enable logical decoding (even if transiently) to allow this,
might take a considerable performance hit, and that we shouldn't
entangle our features in that way. I don't have an opinion on these
things at this point; knowing more about exactly what the performance
impact is would be good. Regarding logical decoding, the conversation
continued that maybe it'd be good if the feature can be automatically
enabled transiently for that particular table for as long as needed, and
disabled afterwards. But like with the previous concern, I don't really
have an opinion without understanding it more deeply.
--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
On 2025-Jan-30, Michael Banck wrote:
I haven't addressed the problem of a new command yet - for that I'd like to
see some sort of consensus, so that I do not have to do all the related
changes many times.
Well, looks like this patch-set is blocked on the bikeshedding part?
Somebody should call a shot here, then.
A bunch of people discussed this patch in today's developer meeting in
Brussels. There's pretty much a consensus on using the verb REPACK
CONCURRENTLY for this new command -- where unadorned REPACK would be
VACUUM FULL, and we'd have something like REPACK WITH INDEX or maybe
REPACK USING INDEX to take the CLUSTER place.
Thanks for discussing the patch.
I assume the patch should mark CLUSTER deprecated rather than removing it
immediately.
For the record, there was an observation that 1) if logical decoding is
not enabled, REPACK CONCURRENTLY would not work, and 2) that sites being
forced to enable logical decoding (even if transiently) to allow this,
might take a considerable performance hit, and that we shouldn't
entangle our features in that way. I don't have an opinion on these
things at this point; knowing more about exactly what the performance
impact is would be good. Regarding logical decoding, the conversation
continued that maybe it'd be good if the feature can be automatically
enabled transiently for that particular table for as long as needed, and
disabled afterwards. But like with the previous concern, I don't really
have an opinion without understanding it more deeply.
Enabling the logical decoding transiently makes sense to me.
I also agree that tables not being REPACKed should be treated as not being
logically decoded, i.e. the logical decoding specific information should not
be written to WAL for them. Neither time nor energy should be wasted :-)
I'll try to implement these requirements in the next version.
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
On 2025-Jan-31, Antonin Houska wrote:
I assume the patch should mark CLUSTER deprecated rather than removing it
immediately.
Yeah, we should certainly not make any statements fail that work today.
Same goes for VACUUM FULL.
I also agree that tables not being REPACKed should be treated as not being
logically decoded, i.e. the logical decoding specific information should not
be written to WAL for them. Neither time nor energy should be wasted :-)
Cool.
Something that Robert Haas just mentioned to me is handling of row
locks: if concurrent transactions are keeping rows in the original table
locked (especially SELECT FOR KEY SHARE, since that's not considered by
logical decoding at present and it would be possible to break foreign
keys if we just do nothing), then we need these to be "transferred" to
the new table somehow.
--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"¿Qué importan los años? Lo que realmente importa es comprobar que
a fin de cuentas la mejor edad de la vida es estar vivo" (Mafalda)
Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
On 2025-Jan-31, Antonin Houska wrote:
Something that Robert Haas just mentioned to me is handling of row
locks: if concurrent transactions are keeping rows in the original table
locked (especially SELECT FOR KEY SHARE, since that's not considered by
logical decoding at present and it would be possible to break foreign
keys if we just do nothing), then we need these to be "transferred" to
the new table somehow.
The current implementation acquires AccessExclusiveLock on the table
(supposedly for very short time) so it can swap the table and index
files. Once we have that lock, I think the transactions holding the row locks
should no longer be running. Or can the row lock "survive" the table lock
somehow?
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
On 2025-Jan-31, Antonin Houska wrote:
Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
Something that Robert Haas just mentioned to me is handling of row
locks: if concurrent transactions are keeping rows in the original table
locked (especially SELECT FOR KEY SHARE, since that's not considered by
logical decoding at present and it would be possible to break foreign
keys if we just do nothing), then we need these to be "transferred" to
the new table somehow.
The current implementation acquires AccessExclusiveLock on the table
(supposedly for very short time) so it can swap the table and index
files. Once we have that lock, I think the transactions holding the row locks
should no longer be running. Or can the row lock "survive" the table lock
somehow?
Oh right, I forgot about this step. That seems like it should be
sufficient to protect against that problem.
--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Al principio era UNIX, y UNIX habló y dijo: "Hello world\n".
No dijo "Hello New Jersey\n", ni "Hello USA\n".
On Thu, 30 Jan 2025 at 16:29, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
On 2025-Jan-30, Michael Banck wrote:
I haven't addressed the problem of a new command yet - for that I'd like to
see some sort of consensus, so that I do not have to do all the related
changes many times.
Well, looks like this patch-set is blocked on the bikeshedding part?
Somebody should call a shot here, then.
A bunch of people discussed this patch in today's developer meeting in
Brussels. There's pretty much a consensus on using the verb REPACK
CONCURRENTLY for this new command -- where unadorned REPACK would be
VACUUM FULL, and we'd have something like REPACK WITH INDEX or maybe
REPACK USING INDEX to take the CLUSTER place.
For the record, there was an observation that [...]
Further observations:
First, due to the XLog-based change detection this feature can't work
for unlogged tables without first changing them to logged (which
implies first writing the whole table to XLog, to not cause issues on
any replicas). However, documentation for this limitation seems to be
missing from the patches, and I hope a solution can be found without
requiring LOGGED.
Second, I'm concerned about long-running snapshots: While I've not
read the patches fully, I think they work something like the
following:
1. Mark some start LSN as start for decoding changes
2. Do the usual REPACK operations, but with reduced locking
3. Apply the decoded changes
4. Switch the relfilenodes over
For (2), I think the scan needs a snapshot to guarantee we keep the
original tuples of updates around, which will hold back any other
VACUUM activity in the database. For CIC/RIC, a solution is being
created [0], but I'm not sure the same can be applied to this REPACK
CONCURRENTLY: while CIC/RIC doesn't care much about cross-page update
chains (it's only interested in TID+field values for possibly-live
tuples), REPACK seems to require access to the fields of the old
versions of updated tuples to correctly apply updates, thus requiring
a single snapshot for the full scan.
Maybe that's something that can be further improved upon, maybe not.
REPACK CONCURRENTLY is an improvement over the current situation
w.r.t. locks, but it'd be nice if this new system does not impact the
visibility horizons of the cluster by more than the current.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
[0]: /messages/by-id/CANtu0oiLc-+7h9zfzOVy2cv2UuYk_5MUReVLnVbOay6OgD_KGg@mail.gmail.com
Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
Further observations:
First, due to the XLog-based change detection this feature can't work
for unlogged tables without first changing them to logged (which
implies first writing the whole table to XLog, to not cause issues on
any replicas). However, documentation for this limitation seems to be
missing from the patches, and I hope a solution can be found without
requiring LOGGED.
Currently I've got no idea how to handle UNLOGGED table. I'll at least fix the
documentation.
Second, I'm concerned about long-running snapshots: While I've not
read the patches fully, I think they work something like the
following:
1. Mark some start LSN as start for decoding changes
2. Do the usual REPACK operations, but with reduced locking
3. Apply the decoded changes
4. Switch the relfilenodes over
For (2), I think the scan needs a snapshot to guarantee we keep the
original tuples of updates around, which will hold back any other
VACUUM activity in the database. For CIC/RIC, a solution is being
created [0], but I'm not sure the same can be applied to this REPACK
CONCURRENTLY: while CIC/RIC doesn't care much about cross-page update
chains (it's only interested in TID+field values for possibly-live
tuples), REPACK seems to require access to the fields of the old
versions of updated tuples to correctly apply updates, thus requiring
a single snapshot for the full scan.
Maybe that's something that can be further improved upon, maybe not.
REPACK CONCURRENTLY is an improvement over the current situation
w.r.t. locks, but it'd be nice if this new system does not impact the
visibility horizons of the cluster by more than the current.
A single snapshot is used because there is a single stream of decoded data
changes. Thus a new version of a tuple is either visible to the snapshot or it
appears in the stream, but not both.
If part of the table was scanned using one snapshot, and another part with
another one, it'd be difficult to "put things together". For example, if the
first scan does not see a tuple for which the corresponding stream contains an
UPDATE change (because the old version is in the not-yet-scanned part of the
table), that UPDATE needs to be moved to the stream associated with another
snapshot. But that snapshot might not see that tuple either because it was
either deleted in between, or should be found by yet another scan.
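A toy model of that invariant (made-up rows and LSNs, not the patch's code): a version committed at or before the snapshot LSN is produced by the initial scan, a later one by the decoded change stream, and no version by both paths:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SNAPSHOT_LSN 100

typedef struct
{
    char key;
    int  value;
    int  commit_lsn;
} Version;

/* Row versions in the old table, in commit order. */
static const Version versions[] = {
    {'a', 1, 50},   /* before snapshot: seen by the scan */
    {'b', 2, 60},   /* before snapshot: seen by the scan */
    {'b', 3, 150},  /* update after snapshot: arrives via the stream */
    {'c', 4, 200},  /* insert after snapshot: arrives via the stream */
};

/*
 * Rebuild the table: initial scan first, then apply the change stream
 * in commit order.  result[] is indexed by key ('a'..'z'), -1 = absent.
 * Returns the number of distinct rows in the new table.
 */
static int
rebuild(int result[26])
{
    int nrows = 0;

    memset(result, -1, 26 * sizeof(int));
    for (int pass = 0; pass < 2; pass++)    /* pass 0 = scan, 1 = stream */
        for (size_t i = 0; i < sizeof(versions) / sizeof(versions[0]); i++)
        {
            const Version *v = &versions[i];
            bool from_scan = v->commit_lsn <= SNAPSHOT_LSN;

            /* Each version is handled by exactly one of the two passes. */
            if (from_scan == (pass == 0))
            {
                if (result[v->key - 'a'] < 0)
                    nrows++;
                result[v->key - 'a'] = v->value;
            }
        }
    return nrows;
}
```

With two snapshots instead of one, the version ('b', 3, 150) could fall between them, reaching neither a scan nor a stream, which is the bookkeeping problem described above.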
Doing the repacking in several steps might be interesting, but I admit I
haven't yet thought that far.
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
On 2025-Jan-31, Antonin Houska wrote:
Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
First, due to the XLog-based change detection this feature can't work
for unlogged tables without first changing them to logged (which
implies first writing the whole table to XLog, to not cause issues on
any replicas). However, documentation for this limitation seems to be
missing from the patches, and I hope a solution can be found without
requiring LOGGED.
Currently I've got no idea how to handle UNLOGGED table. I'll at least fix the
documentation.
Yeah, I think it should be possible, but it's going to require
complicated additional changes to support. I suggest that in the first
version we leave this out, and we can implement it afterwards.
For (2), I think the scan needs a snapshot to guarantee we keep the
original tuples of updates around, which will hold back any other
VACUUM activity in the database.
A single snapshot is used because there is a single stream of decoded data
changes. Thus a new version of a tuple is either visible to the snapshot or it
appears in the stream, but not both.
I agree with Matthias that this is going to be a problem. In fact, if
we need to keep the snapshot for long enough (depending on how long it
takes to scan the table), then the snapshot that it needs to keep would
disrupt vacuuming on all the other tables, causing more bloat. If it's
bad enough (say because the table is big enough to take hours to repack
and recreate the indexes on), the bloat situation might be worse after
REPACK has completed than it was before.
But -- again -- I think we need to limit the complexity of this patch,
or otherwise we're never going to get it done. So I propose that in our
first implementation we continue to use a single snapshot, and we can
try to find ways to grab fresh snapshots from time to time as a later
improvement on the patch. Customers in situations so bad that they
can't use REPACK to fix their bloat in 18, are already unable to fix it
in earlier versions, so this would not be a regression.
--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"No tengo por qué estar de acuerdo con lo que pienso"
(Carlos Caszeli)
Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
I agree with Matthias that this is going to be a problem. In fact, if
we need to keep the snapshot for long enough (depending on how long it
takes to scan the table), then the snapshot that it needs to keep would
disrupt vacuuming on all the other tables, causing more bloat.
I thought about it more during the afternoon. I think that in this case
(i.e. snapshot created by the logical replication system), the xmin horizon is
controlled by the xmin of the replication slot rather than that of the
snapshot. And I think that the slot we use for REPACK can have xmin set to
invalid (unlike catalog_xmin) as long as we ensure that (even "lazy") VACUUM
ignores a table that is being processed by REPACK. In other words, REPACK does
not have to disrupt vacuuming of the other tables. Please correct me if I'm
wrong.
Since the current proposal of REPACK already stores the relation OID in the
shared memory (so that all backends know that they should write enough
information to WAL when doing changes in the table), disabling VACUUM for that
table should not be difficult.
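A minimal sketch of that interlock (toy Python; the registry and all names
are hypothetical stand-ins for the shared-memory slot the proposal
describes): REPACK registers the relation OID, and lazy VACUUM consults the
registry and skips any table currently being repacked.

```python
# Toy interlock: REPACK registers the relation OID in shared state; lazy
# VACUUM checks the registry and skips any table currently being repacked.

repack_in_progress = set()  # stands in for the shared-memory slot

def start_repack(reloid):
    repack_in_progress.add(reloid)

def end_repack(reloid):
    repack_in_progress.discard(reloid)

def vacuum(reloid):
    if reloid in repack_in_progress:
        return "skipped"   # leave the repacked table alone
    return "vacuumed"

start_repack(16384)
results = [vacuum(16384), vacuum(16385)]  # other tables are unaffected
end_repack(16384)
results.append(vacuum(16384))
print(results)  # ['skipped', 'vacuumed', 'vacuumed']
```

The point of the sketch is only that vacuuming of other tables proceeds
normally while the repacked table is excluded for the duration.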
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
From bf2ec8c5d753de340140839f1b061044ec4c1149 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Mon, 13 Jan 2025 14:29:54 +0100
Subject: [PATCH 4/8] Add CONCURRENTLY option to both VACUUM FULL and CLUSTER
commands.
@@ -950,8 +1412,46 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
+	if (concurrent)
+	{
+		PgBackendProgress progress;
+
+		/*
+		 * Command progress reporting gets terminated at subtransaction
+		 * end. Save the status so it can be eventually restored.
+		 */
+		memcpy(&progress, &MyBEEntry->st_progress,
+			   sizeof(PgBackendProgress));
+
+		/* Release the locks by aborting the subtransaction. */
+		RollbackAndReleaseCurrentSubTransaction();
+
+		/* Restore the progress reporting status. */
+		pgstat_progress_restore_state(&progress);
+
+		CurrentResourceOwner = oldowner;
+	}
I was looking at 0002 to see if it'd make sense to commit it ahead of a
fuller review of the rest, and I find that the reason for that patch is
this hunk you have here in copy_table_data -- you want to avoid a
subtransaction abort (which you use to release planner lock) clobbering
the status. I think this a bad idea. It might be better to handle this
in a different way, for instance
1) maybe have a flag that says "do not reset progress status during
subtransaction abort"; REPACK would set that flag, so it'd be able to
continue its business without having to memcpy the current status (which
seems like quite a hack) or restoring it afterwards.
2) maybe subtransaction abort is not the best way to release the
planning locks anyway. I think it might be better to have a
ResourceOwner that owns those locks, and we do ResourceOwnerRelease()
which would release them. I think this would be a novel usage of
ResourceOwner so it needs more research. But if this works, then we
don't need the subtransaction at all, and therefore we don't need
backend progress restore at all either.
--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
1) maybe have a flag that says "do not reset progress status during
subtransaction abort"; REPACK would set that flag, so it'd be able to
continue its business without having to memcpy the current status (which
seems like quite a hack) or restoring it afterwards.
2) maybe subtransaction abort is not the best way to release the
planning locks anyway. I think it might be better to have a
ResourceOwner that owns those locks, and we do ResourceOwnerRelease()
which would release them. I think this would be a novel usage of
ResourceOwner so it needs more research. But if this works, then we
don't need the subtransaction at all, and therefore we don't need
backend progress restore at all either.
If this needs change, I prefer 2) because it's less invasive: 1) still affects
the progress monitoring code. I'll look at it.
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Antonin Houska <ah@cybertec.at> wrote:
If this needs change, I prefer 2) because it's less invasive: 1) still affects
the progress monitoring code. I'll look at it.
Below is what I suggest now. It resembles the use of PortalData.resowner in
the sense that it's a resource owner separate from the resource owner of the
transaction.
Although it's better to use a resource owner than a subtransaction here, we
still need to restore the progress state in
cluster_decode_concurrent_changes() (see v07-0004-) because a subtransaction
abort that clears it can take place during the decoding.
My preference would still be to save and restore the progress state in this
case (although a new function like pgstat_progress_save_state() would be
better than memcpy()). What do you think?
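The save/restore idea could look roughly like this (toy Python, not the
actual C implementation; pgstat_progress_save_state() is the hypothetical
function proposed above): snapshot the reported state before an operation
that may abort a subtransaction and wipe it, then put it back afterwards.

```python
# Toy sketch of save/restore of command progress state around an operation
# (here, a simulated subtransaction abort) that clears the reported state.

progress = {"phase": "index scanning heap", "tuples": 1234}

def save_state():
    # analogous to the proposed pgstat_progress_save_state()
    return dict(progress)

def subtransaction_abort():
    # abort terminates progress reporting, wiping the state
    progress.clear()

def restore_state(saved):
    # analogous to pgstat_progress_restore_state()
    progress.update(saved)

saved = save_state()
subtransaction_abort()
assert progress == {}
restore_state(saved)
print(progress["phase"])  # index scanning heap
```

This is only the semantics under discussion; whether the saved copy lives on
the stack (as in the memcpy() hunk) or behind a dedicated function is the
open question in the message above.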
@@ -950,8 +1412,48 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
* provided, else plain seqscan.
*/
if (OldIndex != NULL && OldIndex->rd_rel->relam == BTREE_AM_OID)
+ {
+ ResourceOwner oldowner = NULL;
+ ResourceOwner resowner = NULL;
+
+ /*
+ * In the CONCURRENT case, use a dedicated resource owner so we don't
+ * leave any additional locks behind us that we cannot release easily.
+ */
+ if (concurrent)
+ {
+ Assert(CheckRelationLockedByMe(OldHeap, ShareUpdateExclusiveLock,
+ false));
+ Assert(CheckRelationLockedByMe(OldIndex, ShareUpdateExclusiveLock,
+ false));
+
+ resowner = ResourceOwnerCreate(CurrentResourceOwner,
+ "plan_cluster_use_sort");
+ oldowner = CurrentResourceOwner;
+ CurrentResourceOwner = resowner;
+ }
+
use_sort = plan_cluster_use_sort(RelationGetRelid(OldHeap),
RelationGetRelid(OldIndex));
+
+ if (concurrent)
+ {
+ CurrentResourceOwner = oldowner;
+
+ /*
+ * We are primarily concerned about locks, but if the planner
+ * happened to allocate any other resources, we should release
+ * them too because we're going to delete the whole resowner.
+ */
+ ResourceOwnerRelease(resowner, RESOURCE_RELEASE_BEFORE_LOCKS,
+ false, false);
+ ResourceOwnerRelease(resowner, RESOURCE_RELEASE_LOCKS,
+ false, false);
+ ResourceOwnerRelease(resowner, RESOURCE_RELEASE_AFTER_LOCKS,
+ false, false);
+ ResourceOwnerDelete(resowner);
+ }
+ }
else
use_sort = false;
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
On Thu, Jan 30, 2025 at 04:29:35PM +0100, Alvaro Herrera wrote:
A bunch of people discussed this patch in today's developer meeting in
Brussels. There's pretty much a consensus on using the verb REPACK
CONCURRENTLY for this new command -- where unadorned REPACK would be
VACUUM FULL, and we'd have something like REPACK WITH INDEX or maybe
REPACK USING INDEX to take the CLUSTER place.
+1
One small thing I thought of after the meeting was that this effectively
forces users to always specify an index if they want to REPACK WITH INDEX.
Today, CLUSTER will use the same index as before if one is not specified.
IMHO requiring users to specify the index is entirely reasonable, but I
figured I'd at least note the behavior change.
--
nathan
Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
On 2025-Jan-30, Michael Banck wrote:
I haven't addressed the problem of a new command yet - for that I'd like to
see some sort of consensus, so that I do not have to do all the related
changes many times.
Well, looks like this patch-set is blocked on the bikeshedding part?
Somebody should call a shot here, then.
A bunch of people discussed this patch in today's developer meeting in
Brussels. There's pretty much a consensus on using the verb REPACK
CONCURRENTLY for this new command -- where unadorned REPACK would be
VACUUM FULL, and we'd have something like REPACK WITH INDEX or maybe
REPACK USING INDEX to take the CLUSTER place.
This is a patch that adds the REPACK command (w/o CONCURRENTLY). I'll
incorporate it into the patch series but it'd be great if this part was a
little bit stable before I start to rebase the depending patches. Thanks.
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Attachments:
0001-Add-REPACK-command.patchtext/x-diffDownload
From b09ae021ea8fa7b4a90775c25e3f0ddaa711ef82 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Wed, 19 Feb 2025 16:17:37 +0100
Subject: [PATCH] Add REPACK command.
The existing CLUSTER command as well as VACUUM with the FULL option both
reclaim unused space by rewriting the table. Now that we want to enhance this
functionality (in particular, by adding a new option CONCURRENTLY), we should
enhance both commands because they are both implemented by the same function
(cluster.c:cluster_rel). However, adding the same option to two different
commands is not very user-friendly. Therefore it was decided to create a new
command and to declare both CLUSTER command and the FULL option of VACUUM
deprecated. Future enhancements to this rewriting code will only affect the
new command.
Like CLUSTER, the REPACK command reorders the table according to the specified
index. Unlike CLUSTER, REPACK does not require an index: if only a table is
specified, the command acts as VACUUM FULL. As we don't want to remove CLUSTER
and VACUUM FULL yet, there are three callers of the cluster_rel() function
now: REPACK, CLUSTER and VACUUM FULL. When we need to distinguish who is
calling this function (mostly for logging, but also for progress reporting),
we can no longer use the OID of the clustering index: both REPACK and VACUUM
FULL can pass InvalidOid. Therefore, this patch introduces a new enumeration
type ClusterCommand, and adds an argument of this type to the cluster_rel()
function and to all the functions that need to distinguish the caller.
Like CLUSTER and VACUUM FULL, the REPACK command without arguments processes
all the tables on which the current user has the MAINTAIN privilege.
A new view, pg_stat_progress_repack, is added to monitor the progress of
REPACK. Currently it displays the same information as pg_stat_progress_cluster
(except that column names might differ), but it'll also display the status of
the REPACK CONCURRENTLY command in the future, so the view definitions will
eventually diverge.
Regarding user documentation, the patch moves the information on clustering
from cluster.sgml to the new file repack.sgml. cluster.sgml now contains a
link that points to the related section of repack.sgml. A note on deprecation
and a link to repack.sgml are added to both cluster.sgml and vacuum.sgml.
---
doc/src/sgml/monitoring.sgml | 230 +++++++++++
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/cluster.sgml | 79 +---
doc/src/sgml/ref/repack.sgml | 254 ++++++++++++
doc/src/sgml/ref/vacuum.sgml | 8 +
doc/src/sgml/reference.sgml | 1 +
src/backend/access/heap/heapam_handler.c | 32 +-
src/backend/catalog/index.c | 2 +-
src/backend/catalog/system_views.sql | 27 ++
src/backend/commands/cluster.c | 496 +++++++++++++++++------
src/backend/commands/tablecmds.c | 3 +-
src/backend/commands/vacuum.c | 3 +-
src/backend/parser/gram.y | 64 ++-
src/backend/tcop/utility.c | 9 +
src/backend/utils/adt/pgstatfuncs.c | 2 +
src/bin/psql/tab-complete.in.c | 31 +-
src/include/commands/cluster.h | 22 +-
src/include/commands/progress.h | 59 ++-
src/include/nodes/parsenodes.h | 13 +
src/include/parser/kwlist.h | 1 +
src/include/tcop/cmdtaglist.h | 1 +
src/include/utils/backend_progress.h | 1 +
src/test/regress/expected/cluster.out | 180 ++++++++
src/test/regress/expected/rules.out | 27 ++
src/test/regress/sql/cluster.sql | 73 ++++
src/tools/pgindent/typedefs.list | 2 +
26 files changed, 1385 insertions(+), 236 deletions(-)
create mode 100644 doc/src/sgml/ref/repack.sgml
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 928a6eb64b..8a1ed9b645 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -400,6 +400,14 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_progress_repack</structname><indexterm><primary>pg_stat_progress_repack</primary></indexterm></entry>
+ <entry>One row for each backend running
+ <command>REPACK</command>, showing current progress. See
+ <xref linkend="repack-progress-reporting"/>.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_progress_basebackup</structname><indexterm><primary>pg_stat_progress_basebackup</primary></indexterm></entry>
<entry>One row for each WAL sender process streaming a base backup,
@@ -5916,6 +5924,228 @@ FROM pg_stat_get_backend_idset() AS backendid;
</table>
</sect2>
+ <sect2 id="repack-progress-reporting">
+ <title>REPACK Progress Reporting</title>
+
+ <indexterm>
+ <primary>pg_stat_progress_repack</primary>
+ </indexterm>
+
+ <para>
+ Whenever <command>REPACK</command> is running,
+ the <structname>pg_stat_progress_repack</structname> view will contain a
+ row for each backend that is currently running the command. The tables
+ below describe the information that will be reported and provide
+ information about how to interpret it.
+ </para>
+
+ <table id="pg-stat-progress-repack-view" xreflabel="pg_stat_progress_repack">
+ <title><structname>pg_stat_progress_repack</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>pid</structfield> <type>integer</type>
+ </para>
+ <para>
+ Process ID of backend.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>datid</structfield> <type>oid</type>
+ </para>
+ <para>
+ OID of the database to which this backend is connected.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>datname</structfield> <type>name</type>
+ </para>
+ <para>
+ Name of the database to which this backend is connected.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>relid</structfield> <type>oid</type>
+ </para>
+ <para>
+ OID of the table being repacked.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>command</structfield> <type>text</type>
+ </para>
+ <para>
+ The command that is running. Currently, the only value
+ is <literal>REPACK</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>phase</structfield> <type>text</type>
+ </para>
+ <para>
+ Current processing phase. See <xref linkend="repack-phases"/>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>repack_index_relid</structfield> <type>oid</type>
+ </para>
+ <para>
+ If the table is being scanned using an index, this is the OID of the
+ index being used; otherwise, it is zero.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_scanned</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples scanned.
+ This counter only advances when the phase is
+ <literal>seq scanning heap</literal>,
+ <literal>index scanning heap</literal>
+ or <literal>writing new heap</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_written</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples written.
+ This counter only advances when the phase is
+ <literal>seq scanning heap</literal>,
+ <literal>index scanning heap</literal>
+ or <literal>writing new heap</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_blks_total</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Total number of heap blocks in the table. This number is reported
+ as of the beginning of <literal>seq scanning heap</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_blks_scanned</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap blocks scanned. This counter only advances when the
+ phase is <literal>seq scanning heap</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>index_rebuild_count</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of indexes rebuilt. This counter only advances when the phase
+ is <literal>rebuilding index</literal>.
+ </para></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <table id="repack-phases">
+ <title>REPACK Phases</title>
+ <tgroup cols="2">
+ <colspec colname="col1" colwidth="1*"/>
+ <colspec colname="col2" colwidth="2*"/>
+ <thead>
+ <row>
+ <entry>Phase</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry><literal>initializing</literal></entry>
+ <entry>
+ The command is preparing to begin scanning the heap. This phase is
+ expected to be very brief.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>seq scanning heap</literal></entry>
+ <entry>
+ The command is currently scanning the table using a sequential scan.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>index scanning heap</literal></entry>
+ <entry>
+ <command>REPACK</command> is currently scanning the table using an index scan.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>sorting tuples</literal></entry>
+ <entry>
+ <command>REPACK</command> is currently sorting tuples.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>writing new heap</literal></entry>
+ <entry>
+ <command>REPACK</command> is currently writing the new heap.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>swapping relation files</literal></entry>
+ <entry>
+ The command is currently swapping newly-built files into place.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>rebuilding index</literal></entry>
+ <entry>
+ The command is currently rebuilding an index.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>performing final cleanup</literal></entry>
+ <entry>
+ The command is performing final cleanup. When this phase is
+ completed, <command>REPACK</command> will end.
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+ </sect2>
+
<sect2 id="copy-progress-reporting">
<title>COPY Progress Reporting</title>
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867..c0ef654fcb 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -167,6 +167,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY refreshMaterializedView SYSTEM "refresh_materialized_view.sgml">
<!ENTITY reindex SYSTEM "reindex.sgml">
<!ENTITY releaseSavepoint SYSTEM "release_savepoint.sgml">
+<!ENTITY repack SYSTEM "repack.sgml">
<!ENTITY reset SYSTEM "reset.sgml">
<!ENTITY revoke SYSTEM "revoke.sgml">
<!ENTITY rollback SYSTEM "rollback.sgml">
diff --git a/doc/src/sgml/ref/cluster.sgml b/doc/src/sgml/ref/cluster.sgml
index 8811f169ea..54bb2362c8 100644
--- a/doc/src/sgml/ref/cluster.sgml
+++ b/doc/src/sgml/ref/cluster.sgml
@@ -42,17 +42,23 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
<replaceable class="parameter">table_name</replaceable>.
</para>
- <para>
- When a table is clustered, it is physically reordered
- based on the index information. Clustering is a one-time operation:
- when the table is subsequently updated, the changes are
- not clustered. That is, no attempt is made to store new or
- updated rows according to their index order. (If one wishes, one can
- periodically recluster by issuing the command again. Also, setting
- the table's <literal>fillfactor</literal> storage parameter to less than
- 100% can aid in preserving cluster ordering during updates, since updated
- rows are kept on the same page if enough space is available there.)
- </para>
+ <warning>
+ <para>
+ The <command>CLUSTER</command> command is deprecated in favor of
+ <xref linkend="sql-repack"/>.
+ </para>
+ </warning>
+
+ <note>
+ <para>
+ <xref linkend="sql-repack-notes-on-clustering"/> explain how clustering
+ works, whether it is initiated by <command>CLUSTER</command> or
+ by <command>REPACK</command>. The notable difference between the two is
+ that <command>REPACK</command> does not remember the index used last
+ time. Thus if you don't specify an index, <command>REPACK</command>
+ rewrites the table but does not try to cluster it.
+ </para>
+ </note>
<para>
When a table is clustered, <productname>PostgreSQL</productname>
@@ -136,63 +142,12 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
on the table.
</para>
- <para>
- In cases where you are accessing single rows randomly
- within a table, the actual order of the data in the
- table is unimportant. However, if you tend to access some
- data more than others, and there is an index that groups
- them together, you will benefit from using <command>CLUSTER</command>.
- If you are requesting a range of indexed values from a table, or a
- single indexed value that has multiple rows that match,
- <command>CLUSTER</command> will help because once the index identifies the
- table page for the first row that matches, all other rows
- that match are probably already on the same table page,
- and so you save disk accesses and speed up the query.
- </para>
-
- <para>
- <command>CLUSTER</command> can re-sort the table using either an index scan
- on the specified index, or (if the index is a b-tree) a sequential
- scan followed by sorting. It will attempt to choose the method that
- will be faster, based on planner cost parameters and available statistical
- information.
- </para>
-
<para>
While <command>CLUSTER</command> is running, the <xref
linkend="guc-search-path"/> is temporarily changed to <literal>pg_catalog,
pg_temp</literal>.
</para>
- <para>
- When an index scan is used, a temporary copy of the table is created that
- contains the table data in the index order. Temporary copies of each
- index on the table are created as well. Therefore, you need free space on
- disk at least equal to the sum of the table size and the index sizes.
- </para>
-
- <para>
- When a sequential scan and sort is used, a temporary sort file is
- also created, so that the peak temporary space requirement is as much
- as double the table size, plus the index sizes. This method is often
- faster than the index scan method, but if the disk space requirement is
- intolerable, you can disable this choice by temporarily setting <xref
- linkend="guc-enable-sort"/> to <literal>off</literal>.
- </para>
-
- <para>
- It is advisable to set <xref linkend="guc-maintenance-work-mem"/> to
- a reasonably large value (but not more than the amount of RAM you can
- dedicate to the <command>CLUSTER</command> operation) before clustering.
- </para>
-
- <para>
- Because the planner records statistics about the ordering of
- tables, it is advisable to run <link linkend="sql-analyze"><command>ANALYZE</command></link>
- on the newly clustered table.
- Otherwise, the planner might make poor choices of query plans.
- </para>
-
<para>
Because <command>CLUSTER</command> remembers which indexes are clustered,
one can cluster the tables one wants clustered manually the first time,
diff --git a/doc/src/sgml/ref/repack.sgml b/doc/src/sgml/ref/repack.sgml
new file mode 100644
index 0000000000..e2b96c12fb
--- /dev/null
+++ b/doc/src/sgml/ref/repack.sgml
@@ -0,0 +1,254 @@
+<!--
+doc/src/sgml/ref/repack.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-repack">
+ <indexterm zone="sql-repack">
+ <primary>REPACK</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle>REPACK</refentrytitle>
+ <manvolnum>7</manvolnum>
+ <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>REPACK</refname>
+ <refpurpose>cluster a table according to an index</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <replaceable class="parameter">table_name</replaceable> [ USING INDEX <replaceable class="parameter">index_name</replaceable> ] ]
+
+<phrase>where <replaceable class="parameter">option</replaceable> can be one of:</phrase>
+
+ VERBOSE [ <replaceable class="parameter">boolean</replaceable> ]
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+
+ <para>
+ <command>REPACK</command> reclaims storage occupied by dead
+ tuples. Unlike <command>VACUUM</command>, it does so by rewriting the
+ entire contents of the table specified
+ by <replaceable class="parameter">table_name</replaceable> into a new disk
+ file with no extra space, allowing unused space to be returned to the
+ operating
+ system.
+ </para>
+
+ <para>
+ Without
+ a <replaceable class="parameter">table_name</replaceable>, <command>REPACK</command>
+ processes every table and materialized view in the current database that
+ the current user has the <literal>MAINTAIN</literal> privilege on. This
+ form of <command>REPACK</command> cannot be executed inside a transaction
+ block.
+ </para>
+
+ <para>
+ If <replaceable class="parameter">index_name</replaceable> is specified,
+   the table is clustered by this index. See
+   <xref linkend="sql-repack-notes-on-clustering"/> for details.
+ </para>
+
+ <para>
+ When a table is being repacked, an <literal>ACCESS EXCLUSIVE</literal> lock
+   is acquired on it. This prevents any other database operation (both reads
+   and writes) on the table until the <command>REPACK</command> is finished.
+ </para>
+
+ <refsect2 id="sql-repack-notes-on-clustering" xreflabel="Notes on Clustering">
+ <title>Notes on Clustering</title>
+
+ <para>
+ When a table is clustered, it is physically reordered based on the index
+ information. Clustering is a one-time operation: when the table is
+ subsequently updated, the changes are not clustered. That is, no attempt
+ is made to store new or updated rows according to their index order. (If
+ one wishes, one can periodically recluster by issuing the command again.
+ Also, setting the table's <literal>fillfactor</literal> storage parameter
+ to less than 100% can aid in preserving cluster ordering during updates,
+ since updated rows are kept on the same page if enough space is available
+ there.)
+ </para>
+
+ <para>
+ In cases where you are accessing single rows randomly within a table, the
+ actual order of the data in the table is unimportant. However, if you tend
+ to access some data more than others, and there is an index that groups
+ them together, you will benefit from using <command>REPACK</command>. If
+ you are requesting a range of indexed values from a table, or a single
+ indexed value that has multiple rows that match,
+ <command>REPACK</command> will help because once the index identifies the
+ table page for the first row that matches, all other rows that match are
+ probably already on the same table page, and so you save disk accesses and
+ speed up the query.
+ </para>
+
+ <para>
+ <command>REPACK</command> can re-sort the table using either an index scan
+ on the specified index (if the index is a b-tree), or a sequential scan
+ followed by sorting. It will attempt to choose the method that will be
+ faster, based on planner cost parameters and available statistical
+ information.
+ </para>
+
+ <para>
+ Because the planner records statistics about the ordering of tables, it is
+ advisable to
+ run <link linkend="sql-analyze"><command>ANALYZE</command></link> on the
+ newly repacked table. Otherwise, the planner might make poor choices of
+ query plans.
+ </para>
+ </refsect2>
+
+ <refsect2 id="sql-repack-notes-on-resources" xreflabel="Notes on Resources">
+ <title>Notes on Resources</title>
+
+ <para>
+ When an index scan or a sequential scan without sort is used, a temporary
+ copy of the table is created that contains the table data in the index
+ order. Temporary copies of each index on the table are created as well.
+ Therefore, you need free space on disk at least equal to the sum of the
+ table size and the index sizes.
+ </para>
+
+ <para>
+ When a sequential scan and sort is used, a temporary sort file is also
+ created, so that the peak temporary space requirement is as much as double
+ the table size, plus the index sizes. This method is often faster than
+ the index scan method, but if the disk space requirement is intolerable,
+ you can disable this choice by temporarily setting
+ <xref linkend="guc-enable-sort"/> to <literal>off</literal>.
+ </para>
+
+ <para>
+ It is advisable to set <xref linkend="guc-maintenance-work-mem"/> to a
+ reasonably large value (but not more than the amount of RAM you can
+ dedicate to the <command>REPACK</command> operation) before repacking.
+ </para>
+ </refsect2>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Parameters</title>
+
+ <variablelist>
+ <varlistentry>
+ <term><replaceable class="parameter">table_name</replaceable></term>
+ <listitem>
+ <para>
+ The name (possibly schema-qualified) of a table.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><replaceable class="parameter">index_name</replaceable></term>
+ <listitem>
+ <para>
+ The name of an index.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>VERBOSE</literal></term>
+ <listitem>
+ <para>
+      Prints a progress report at <literal>INFO</literal> level as each
+      table is repacked.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><replaceable class="parameter">boolean</replaceable></term>
+ <listitem>
+ <para>
+ Specifies whether the selected option should be turned on or off.
+ You can write <literal>TRUE</literal>, <literal>ON</literal>, or
+ <literal>1</literal> to enable the option, and <literal>FALSE</literal>,
+ <literal>OFF</literal>, or <literal>0</literal> to disable it. The
+ <replaceable class="parameter">boolean</replaceable> value can also
+ be omitted, in which case <literal>TRUE</literal> is assumed.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </refsect1>
+
+ <refsect1>
+ <title>Notes</title>
+
+ <para>
+ To repack a table, one must have the <literal>MAINTAIN</literal> privilege
+ on the table.
+ </para>
+
+ <para>
+ While <command>REPACK</command> is running, the <xref
+ linkend="guc-search-path"/> is temporarily changed to <literal>pg_catalog,
+ pg_temp</literal>.
+ </para>
+
+ <para>
+ Each backend running <command>REPACK</command> will report its progress
+ in the <structname>pg_stat_progress_repack</structname> view. See
+ <xref linkend="repack-progress-reporting"/> for details.
+ </para>
+
+ <para>
+ Repacking a partitioned table repacks each of its partitions. If an index
+ is specified, each partition is clustered using the partition of that
+ index. <command>REPACK</command> on a partitioned table cannot be executed
+ inside a transaction block.
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Examples</title>
+
+ <para>
+ Repack the table <literal>employees</literal>:
+<programlisting>
+REPACK employees;
+</programlisting>
+ </para>
+
+
+ <para>
+ Cluster the table <literal>employees</literal> on the basis of its
+ index <literal>employees_ind</literal>:
+<programlisting>
+REPACK employees USING INDEX employees_ind;
+</programlisting>
+ </para>
+
+ <para>
+   Repack all tables in the current database on which you have
+   the <literal>MAINTAIN</literal> privilege:
+<programlisting>
+REPACK;
+</programlisting></para>
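+
+  <para>
+   Repack the table <literal>employees</literal>, reporting progress
+   at <literal>INFO</literal> level:
+<programlisting>
+REPACK (VERBOSE) employees;
+</programlisting>
+  </para>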
+ </refsect1>
+
+ <refsect1>
+ <title>Compatibility</title>
+
+ <para>
+ There is no <command>REPACK</command> statement in the SQL standard.
+ </para>
+
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index 971b1237d4..2b5a5d0ac4 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -98,6 +98,14 @@ VACUUM [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
<varlistentry>
<term><literal>FULL</literal></term>
<listitem>
+
+ <warning>
+ <para>
+ The <command>FULL</command> parameter is deprecated in favor of
+ <xref linkend="sql-repack"/>.
+ </para>
+ </warning>
+
<para>
Selects <quote>full</quote> vacuum, which can reclaim more
space, but takes much longer and exclusively locks the table.
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83f..229912d35b 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -195,6 +195,7 @@
&refreshMaterializedView;
&reindex;
&releaseSavepoint;
+ &repack;
&reset;
&revoke;
&rollback;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index c0bec01415..5c9dcc938d 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -737,13 +737,13 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
if (OldIndex != NULL && !use_sort)
{
const int ci_index[] = {
- PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_INDEX_RELID
+ PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_INDEX_RELID
};
int64 ci_val[2];
/* Set phase and OIDOldIndex to columns */
- ci_val[0] = PROGRESS_CLUSTER_PHASE_INDEX_SCAN_HEAP;
+ ci_val[0] = PROGRESS_REPACK_PHASE_INDEX_SCAN_HEAP;
ci_val[1] = RelationGetRelid(OldIndex);
pgstat_progress_update_multi_param(2, ci_index, ci_val);
@@ -755,15 +755,15 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
else
{
/* In scan-and-sort mode and also VACUUM FULL, set phase */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_SEQ_SCAN_HEAP);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_SEQ_SCAN_HEAP);
tableScan = table_beginscan(OldHeap, SnapshotAny, 0, (ScanKey) NULL);
heapScan = (HeapScanDesc) tableScan;
indexScan = NULL;
/* Set total heap blocks */
- pgstat_progress_update_param(PROGRESS_CLUSTER_TOTAL_HEAP_BLKS,
+ pgstat_progress_update_param(PROGRESS_REPACK_TOTAL_HEAP_BLKS,
heapScan->rs_nblocks);
}
@@ -805,7 +805,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
* is manually updated to the correct value when the table
* scan finishes.
*/
- pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_BLKS_SCANNED,
+ pgstat_progress_update_param(PROGRESS_REPACK_HEAP_BLKS_SCANNED,
heapScan->rs_nblocks);
break;
}
@@ -821,7 +821,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
*/
if (prev_cblock != heapScan->rs_cblock)
{
- pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_BLKS_SCANNED,
+ pgstat_progress_update_param(PROGRESS_REPACK_HEAP_BLKS_SCANNED,
(heapScan->rs_cblock +
heapScan->rs_nblocks -
heapScan->rs_startblock
@@ -908,14 +908,14 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
* In scan-and-sort mode, report increase in number of tuples
* scanned
*/
- pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED,
+ pgstat_progress_update_param(PROGRESS_REPACK_HEAP_TUPLES_SCANNED,
*num_tuples);
}
else
{
const int ct_index[] = {
- PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED,
- PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN
+ PROGRESS_REPACK_HEAP_TUPLES_SCANNED,
+ PROGRESS_REPACK_HEAP_TUPLES_WRITTEN
};
int64 ct_val[2];
@@ -948,14 +948,14 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
double n_tuples = 0;
/* Report that we are now sorting tuples */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_SORT_TUPLES);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_SORT_TUPLES);
tuplesort_performsort(tuplesort);
/* Report that we are now writing new heap */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_WRITE_NEW_HEAP);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_WRITE_NEW_HEAP);
for (;;)
{
@@ -973,7 +973,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
values, isnull,
rwstate);
/* Report n_tuples */
- pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN,
+ pgstat_progress_update_param(PROGRESS_REPACK_HEAP_TUPLES_WRITTEN,
n_tuples);
}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index cdabf78024..a3c93eea5f 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -4052,7 +4052,7 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
Assert(!ReindexIsProcessingIndex(indexOid));
/* Set index rebuild count */
- pgstat_progress_update_param(PROGRESS_CLUSTER_INDEX_REBUILD_COUNT,
+ pgstat_progress_update_param(PROGRESS_REPACK_INDEX_REBUILD_COUNT,
i);
i++;
}
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index eff0990957..412627e22b 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1266,6 +1266,33 @@ CREATE VIEW pg_stat_progress_cluster AS
FROM pg_stat_get_progress_info('CLUSTER') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
+CREATE VIEW pg_stat_progress_repack AS
+ SELECT
+ S.pid AS pid,
+ S.datid AS datid,
+ D.datname AS datname,
+ S.relid AS relid,
+ CASE S.param1 WHEN 1 THEN 'REPACK'
+ END AS command,
+ CASE S.param2 WHEN 0 THEN 'initializing'
+ WHEN 1 THEN 'seq scanning heap'
+ WHEN 2 THEN 'index scanning heap'
+ WHEN 3 THEN 'sorting tuples'
+ WHEN 4 THEN 'writing new heap'
+ WHEN 5 THEN 'swapping relation files'
+ WHEN 6 THEN 'rebuilding index'
+ WHEN 7 THEN 'performing final cleanup'
+ END AS phase,
+ CAST(S.param3 AS oid) AS repack_index_relid,
+ S.param4 AS heap_tuples_scanned,
+ S.param5 AS heap_tuples_written,
+ S.param6 AS heap_blks_total,
+ S.param7 AS heap_blks_scanned,
+ S.param8 AS index_rebuild_count
+ FROM pg_stat_get_progress_info('REPACK') AS S
+ LEFT JOIN pg_database D ON S.datid = D.oid;
+
+
CREATE VIEW pg_stat_progress_create_index AS
SELECT
S.pid AS pid, S.datid AS datid, D.datname AS datname,
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 99193f5c88..d0f2588a97 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -46,6 +46,7 @@
#include "storage/lmgr.h"
#include "storage/predicate.h"
#include "utils/acl.h"
 #include "utils/fmgroids.h"
+#include "utils/formatting.h"
#include "utils/guc.h"
#include "utils/inval.h"
@@ -67,17 +68,33 @@ typedef struct
Oid indexOid;
} RelToCluster;
-
-static void cluster_multiple_rels(List *rtcs, ClusterParams *params);
-static void rebuild_relation(Relation OldHeap, Relation index, bool verbose);
+/*
+ * Map the value of ClusterCommand to string.
+ */
+#define CLUSTER_COMMAND_STR(cmd) ((cmd) == CLUSTER_COMMAND_CLUSTER ? \
+ "cluster" : \
+ ((cmd) == CLUSTER_COMMAND_REPACK ? \
+ "repack" : "vacuum"))
+
+static void cluster_multiple_rels(List *rtcs, ClusterParams *params,
+ ClusterCommand cmd);
+static void rebuild_relation(Relation OldHeap, Relation index, bool verbose,
+ ClusterCommand cmd);
static void copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
- bool verbose, bool *pSwapToastByContent,
+ bool verbose, ClusterCommand cmd,
+ bool *pSwapToastByContent,
TransactionId *pFreezeXid, MultiXactId *pCutoffMulti);
static List *get_tables_to_cluster(MemoryContext cluster_context);
+static List *get_tables_to_repack(MemoryContext repack_context);
static List *get_tables_to_cluster_partitioned(MemoryContext cluster_context,
- Oid indexOid);
-static bool cluster_is_permitted_for_relation(Oid relid, Oid userid);
-
+ Oid relid, bool rel_is_index,
+ ClusterCommand cmd);
+static bool cluster_is_permitted_for_relation(Oid relid, Oid userid,
+ ClusterCommand cmd);
+static Relation process_single_relation(RangeVar *relation, char *indexname,
+ ClusterCommand cmd,
+ ClusterParams *params,
+ Oid *indexOid_p);
/*---------------------------------------------------------------------------
* This cluster code allows for clustering multiple tables at once. Because
@@ -133,72 +150,11 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
if (stmt->relation != NULL)
{
- /* This is the single-relation case. */
- Oid tableOid;
-
- /*
- * Find, lock, and check permissions on the table. We obtain
- * AccessExclusiveLock right away to avoid lock-upgrade hazard in the
- * single-transaction case.
- */
- tableOid = RangeVarGetRelidExtended(stmt->relation,
- AccessExclusiveLock,
- 0,
- RangeVarCallbackMaintainsTable,
- NULL);
- rel = table_open(tableOid, NoLock);
-
- /*
- * Reject clustering a remote temp table ... their local buffer
- * manager is not going to cope.
- */
- if (RELATION_IS_OTHER_TEMP(rel))
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot cluster temporary tables of other sessions")));
-
- if (stmt->indexname == NULL)
- {
- ListCell *index;
-
- /* We need to find the index that has indisclustered set. */
- foreach(index, RelationGetIndexList(rel))
- {
- indexOid = lfirst_oid(index);
- if (get_index_isclustered(indexOid))
- break;
- indexOid = InvalidOid;
- }
-
- if (!OidIsValid(indexOid))
- ereport(ERROR,
- (errcode(ERRCODE_UNDEFINED_OBJECT),
- errmsg("there is no previously clustered index for table \"%s\"",
- stmt->relation->relname)));
- }
- else
- {
- /*
- * The index is expected to be in the same namespace as the
- * relation.
- */
- indexOid = get_relname_relid(stmt->indexname,
- rel->rd_rel->relnamespace);
- if (!OidIsValid(indexOid))
- ereport(ERROR,
- (errcode(ERRCODE_UNDEFINED_OBJECT),
- errmsg("index \"%s\" for table \"%s\" does not exist",
- stmt->indexname, stmt->relation->relname)));
- }
-
- /* For non-partitioned tables, do what we came here to do. */
- if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
- {
- cluster_rel(rel, indexOid, ¶ms);
- /* cluster_rel closes the relation, but keeps lock */
-
+ rel = process_single_relation(stmt->relation, stmt->indexname,
+ CLUSTER_COMMAND_CLUSTER, ¶ms,
+ &indexOid);
+ if (rel == NULL)
return;
- }
}
/*
@@ -230,8 +186,11 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
if (rel != NULL)
{
Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
- check_index_is_clusterable(rel, indexOid, AccessShareLock);
- rtcs = get_tables_to_cluster_partitioned(cluster_context, indexOid);
+ check_index_is_clusterable(rel, indexOid, AccessShareLock,
+ CLUSTER_COMMAND_CLUSTER);
+ rtcs = get_tables_to_cluster_partitioned(cluster_context, indexOid,
+ true,
+ CLUSTER_COMMAND_CLUSTER);
/* close relation, releasing lock on parent table */
table_close(rel, AccessExclusiveLock);
@@ -243,7 +202,7 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
}
/* Do the job. */
- cluster_multiple_rels(rtcs, ¶ms);
+ cluster_multiple_rels(rtcs, ¶ms, CLUSTER_COMMAND_CLUSTER);
/* Start a new transaction for the cleanup work. */
StartTransactionCommand();
@@ -260,7 +219,8 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
* return.
*/
static void
-cluster_multiple_rels(List *rtcs, ClusterParams *params)
+cluster_multiple_rels(List *rtcs, ClusterParams *params,
+ ClusterCommand cmd)
{
ListCell *lc;
@@ -283,7 +243,7 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
rel = table_open(rtc->tableOid, AccessExclusiveLock);
/* Process this table */
- cluster_rel(rel, rtc->indexOid, params);
+ cluster_rel(rel, rtc->indexOid, params, cmd);
/* cluster_rel closes the relation, but keeps lock */
PopActiveSnapshot();
@@ -306,9 +266,13 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
* If indexOid is InvalidOid, the table will be rewritten in physical order
* instead of index order. This is the new implementation of VACUUM FULL,
* and error messages should refer to the operation as VACUUM not CLUSTER.
+ *
+ * 'cmd' indicates which command is being executed. REPACK should be the only
+ * caller of this function in the future.
*/
void
-cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
+cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
+ ClusterCommand cmd)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid save_userid;
@@ -317,19 +281,33 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
bool verbose = ((params->options & CLUOPT_VERBOSE) != 0);
bool recheck = ((params->options & CLUOPT_RECHECK) != 0);
Relation index;
+ const char *cmd_str = CLUSTER_COMMAND_STR(cmd);
Assert(CheckRelationLockedByMe(OldHeap, AccessExclusiveLock, false));
/* Check for user-requested abort. */
CHECK_FOR_INTERRUPTS();
- pgstat_progress_start_command(PROGRESS_COMMAND_CLUSTER, tableOid);
- if (OidIsValid(indexOid))
- pgstat_progress_update_param(PROGRESS_CLUSTER_COMMAND,
+ if (cmd == CLUSTER_COMMAND_REPACK)
+ pgstat_progress_start_command(PROGRESS_COMMAND_REPACK, tableOid);
+ else
+ pgstat_progress_start_command(PROGRESS_COMMAND_CLUSTER, tableOid);
+
+ if (cmd == CLUSTER_COMMAND_REPACK)
+ pgstat_progress_update_param(PROGRESS_REPACK_COMMAND,
+ PROGRESS_REPACK_COMMAND_REPACK);
+ else if (OidIsValid(indexOid))
+ {
+ Assert(cmd == CLUSTER_COMMAND_CLUSTER);
+ pgstat_progress_update_param(PROGRESS_REPACK_COMMAND,
PROGRESS_CLUSTER_COMMAND_CLUSTER);
+ }
else
- pgstat_progress_update_param(PROGRESS_CLUSTER_COMMAND,
+ {
+ Assert(cmd == CLUSTER_COMMAND_VACUUM);
+ pgstat_progress_update_param(PROGRESS_REPACK_COMMAND,
PROGRESS_CLUSTER_COMMAND_VACUUM_FULL);
+ }
/*
* Switch to the table owner's userid, so that any index functions are run
@@ -353,7 +331,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
if (recheck)
{
/* Check that the user still has privileges for the relation */
- if (!cluster_is_permitted_for_relation(tableOid, save_userid))
+ if (!cluster_is_permitted_for_relation(tableOid, save_userid, cmd))
{
relation_close(OldHeap, AccessExclusiveLock);
goto out;
@@ -403,39 +381,38 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
* would work in most respects, but the index would only get marked as
* indisclustered in the current database, leading to unexpected behavior
* if CLUSTER were later invoked in another database.
+ *
+ * REPACK does not set indisclustered. XXX Not sure I understand the
+ * comment above: how can an attribute be set "only in the current
+ * database"?
*/
- if (OidIsValid(indexOid) && OldHeap->rd_rel->relisshared)
+ if (cmd == CLUSTER_COMMAND_CLUSTER && OldHeap->rd_rel->relisshared)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot cluster a shared catalog")));
+ errmsg("cannot %s a shared catalog", cmd_str)));
/*
* Don't process temp tables of other backends ... their local buffer
* manager is not going to cope.
*/
if (RELATION_IS_OTHER_TEMP(OldHeap))
- {
- if (OidIsValid(indexOid))
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot cluster temporary tables of other sessions")));
- else
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot vacuum temporary tables of other sessions")));
- }
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot %s temporary tables of other sessions",
+ cmd_str)));
/*
* Also check for active uses of the relation in the current transaction,
* including open scans and pending AFTER trigger events.
*/
- CheckTableNotInUse(OldHeap, OidIsValid(indexOid) ? "CLUSTER" : "VACUUM");
+ CheckTableNotInUse(OldHeap, asc_toupper(cmd_str, strlen(cmd_str)));
/* Check heap and index are valid to cluster on */
if (OidIsValid(indexOid))
{
/* verify the index is good and lock it */
- check_index_is_clusterable(OldHeap, indexOid, AccessExclusiveLock);
+ check_index_is_clusterable(OldHeap, indexOid, AccessExclusiveLock,
+ cmd);
/* also open it */
index = index_open(indexOid, NoLock);
}
@@ -469,7 +446,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
TransferPredicateLocksToHeapRelation(OldHeap);
/* rebuild_relation does all the dirty work */
- rebuild_relation(OldHeap, index, verbose);
+ rebuild_relation(OldHeap, index, verbose, cmd);
/* rebuild_relation closes OldHeap, and index if valid */
out:
@@ -491,9 +468,11 @@ out:
* protection here.
*/
void
-check_index_is_clusterable(Relation OldHeap, Oid indexOid, LOCKMODE lockmode)
+check_index_is_clusterable(Relation OldHeap, Oid indexOid, LOCKMODE lockmode,
+ ClusterCommand cmd)
{
Relation OldIndex;
+ const char *cmd_str = CLUSTER_COMMAND_STR(cmd);
OldIndex = index_open(indexOid, lockmode);
@@ -512,8 +491,8 @@ check_index_is_clusterable(Relation OldHeap, Oid indexOid, LOCKMODE lockmode)
if (!OldIndex->rd_indam->amclusterable)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot cluster on index \"%s\" because access method does not support clustering",
- RelationGetRelationName(OldIndex))));
+ errmsg("cannot %s on index \"%s\" because access method does not support clustering",
+ cmd_str, RelationGetRelationName(OldIndex))));
/*
* Disallow clustering on incomplete indexes (those that might not index
@@ -524,7 +503,8 @@ check_index_is_clusterable(Relation OldHeap, Oid indexOid, LOCKMODE lockmode)
if (!heap_attisnull(OldIndex->rd_indextuple, Anum_pg_index_indpred, NULL))
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot cluster on partial index \"%s\"",
+ errmsg("cannot %s on partial index \"%s\"",
+ cmd_str,
RelationGetRelationName(OldIndex))));
/*
@@ -538,8 +518,8 @@ check_index_is_clusterable(Relation OldHeap, Oid indexOid, LOCKMODE lockmode)
if (!OldIndex->rd_index->indisvalid)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot cluster on invalid index \"%s\"",
- RelationGetRelationName(OldIndex))));
+ errmsg("cannot %s on invalid index \"%s\"",
+ cmd_str, RelationGetRelationName(OldIndex))));
/* Drop relcache refcnt on OldIndex, but keep lock */
index_close(OldIndex, NoLock);
@@ -626,7 +606,8 @@ mark_index_clustered(Relation rel, Oid indexOid, bool is_internal)
* On exit, they are closed, but locks on them are not released.
*/
static void
-rebuild_relation(Relation OldHeap, Relation index, bool verbose)
+rebuild_relation(Relation OldHeap, Relation index, bool verbose,
+ ClusterCommand cmd)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid accessMethod = OldHeap->rd_rel->relam;
@@ -664,7 +645,7 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose)
NewHeap = table_open(OIDNewHeap, NoLock);
/* Copy the heap data into the new table in the desired order */
- copy_table_data(NewHeap, OldHeap, index, verbose,
+ copy_table_data(NewHeap, OldHeap, index, verbose, cmd,
&swap_toast_by_content, &frozenXid, &cutoffMulti);
@@ -829,8 +810,8 @@ make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
*/
static void
copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verbose,
- bool *pSwapToastByContent, TransactionId *pFreezeXid,
- MultiXactId *pCutoffMulti)
+ ClusterCommand cmd, bool *pSwapToastByContent,
+ TransactionId *pFreezeXid, MultiXactId *pCutoffMulti)
{
Relation relRelation;
HeapTuple reltup;
@@ -845,6 +826,7 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
tups_recently_dead = 0;
BlockNumber num_pages;
int elevel = verbose ? INFO : DEBUG2;
+ const char *cmd_str = CLUSTER_COMMAND_STR(cmd);
PGRUsage ru0;
char *nspname;
@@ -958,18 +940,21 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
/* Log what we're doing */
if (OldIndex != NULL && !use_sort)
ereport(elevel,
- (errmsg("clustering \"%s.%s\" using index scan on \"%s\"",
+ (errmsg("%sing \"%s.%s\" using index scan on \"%s\"",
+ cmd_str,
nspname,
RelationGetRelationName(OldHeap),
RelationGetRelationName(OldIndex))));
else if (use_sort)
ereport(elevel,
- (errmsg("clustering \"%s.%s\" using sequential scan and sort",
+ (errmsg("%sing \"%s.%s\" using sequential scan and sort",
+ cmd_str,
nspname,
RelationGetRelationName(OldHeap))));
else
ereport(elevel,
- (errmsg("vacuuming \"%s.%s\"",
+ (errmsg("%sing \"%s.%s\"",
+ cmd_str,
nspname,
RelationGetRelationName(OldHeap))));
@@ -1453,8 +1438,8 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
int i;
/* Report that we are now swapping relation files */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_SWAP_REL_FILES);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_SWAP_REL_FILES);
/* Zero out possible results from swapped_relation_files */
memset(mapped_tables, 0, sizeof(mapped_tables));
@@ -1504,14 +1489,14 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
/* Report that we are now reindexing relations */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_REBUILD_INDEX);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_REBUILD_INDEX);
reindex_relation(NULL, OIDOldHeap, reindex_flags, &reindex_params);
/* Report that we are now doing clean up */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_FINAL_CLEANUP);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_FINAL_CLEANUP);
/*
* If the relation being rebuilt is pg_class, swap_relation_files()
@@ -1661,7 +1646,8 @@ get_tables_to_cluster(MemoryContext cluster_context)
index = (Form_pg_index) GETSTRUCT(indexTuple);
- if (!cluster_is_permitted_for_relation(index->indrelid, GetUserId()))
+ if (!cluster_is_permitted_for_relation(index->indrelid, GetUserId(),
+ CLUSTER_COMMAND_CLUSTER))
continue;
/* Use a permanent memory context for the result list */
@@ -1682,14 +1668,67 @@ get_tables_to_cluster(MemoryContext cluster_context)
}
/*
- * Given an index on a partitioned table, return a list of RelToCluster for
+ * Like get_tables_to_cluster(), but do not care about indexes.
+ */
+static List *
+get_tables_to_repack(MemoryContext repack_context)
+{
+ Relation relrelation;
+ TableScanDesc scan;
+ HeapTuple tuple;
+ MemoryContext old_context;
+ List *rtcs = NIL;
+
+ /*
+	 * Scan pg_class for all plain relations that the current user has the
+	 * appropriate privileges for.
+ */
+ relrelation = table_open(RelationRelationId, AccessShareLock);
+ scan = table_beginscan_catalog(relrelation, 0, NULL);
+ while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+ {
+ RelToCluster *rtc;
+		Form_pg_class classform = (Form_pg_class) GETSTRUCT(tuple);
+		Oid			relid = classform->oid;
+
+ /* Only interested in relations. */
+ if (get_rel_relkind(relid) != RELKIND_RELATION)
+ continue;
+
+ if (!cluster_is_permitted_for_relation(relid, GetUserId(),
+ CLUSTER_COMMAND_REPACK))
+ continue;
+
+ /* Use a permanent memory context for the result list */
+ old_context = MemoryContextSwitchTo(repack_context);
+
+ rtc = (RelToCluster *) palloc(sizeof(RelToCluster));
+ rtc->tableOid = relid;
+ rtc->indexOid = InvalidOid;
+ rtcs = lappend(rtcs, rtc);
+
+ MemoryContextSwitchTo(old_context);
+ }
+ table_endscan(scan);
+
+ relation_close(relrelation, AccessShareLock);
+
+ return rtcs;
+}
+
+/*
+ * Given a partitioned table or its index, return a list of RelToCluster for
* all the children leaves tables/indexes.
*
* Like expand_vacuum_rel, but here caller must hold AccessExclusiveLock
* on the table containing the index.
+ *
+ * 'rel_is_index' tells whether 'relid' is that of an index (true) or of the
+ * owning relation.
*/
static List *
-get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
+get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid relid,
+ bool rel_is_index, ClusterCommand cmd)
{
List *inhoids;
ListCell *lc;
@@ -1697,17 +1736,33 @@ get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
MemoryContext old_context;
/* Do not lock the children until they're processed */
- inhoids = find_all_inheritors(indexOid, NoLock, NULL);
+ inhoids = find_all_inheritors(relid, NoLock, NULL);
foreach(lc, inhoids)
{
- Oid indexrelid = lfirst_oid(lc);
- Oid relid = IndexGetRelation(indexrelid, false);
+ Oid inhoid = lfirst_oid(lc);
+ Oid inhrelid,
+ inhindid;
RelToCluster *rtc;
- /* consider only leaf indexes */
- if (get_rel_relkind(indexrelid) != RELKIND_INDEX)
- continue;
+ if (rel_is_index)
+ {
+ /* consider only leaf indexes */
+ if (get_rel_relkind(inhoid) != RELKIND_INDEX)
+ continue;
+
+ inhrelid = IndexGetRelation(inhoid, false);
+ inhindid = inhoid;
+ }
+ else
+ {
+ /* consider only leaf relations */
+ if (get_rel_relkind(inhoid) != RELKIND_RELATION)
+ continue;
+
+ inhrelid = inhoid;
+ inhindid = InvalidOid;
+ }
/*
* It's possible that the user does not have privileges to CLUSTER the
@@ -1715,15 +1770,15 @@ get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
* table. We skip any partitions which the user is not permitted to
* CLUSTER.
*/
- if (!cluster_is_permitted_for_relation(relid, GetUserId()))
+ if (!cluster_is_permitted_for_relation(inhrelid, GetUserId(), cmd))
continue;
/* Use a permanent memory context for the result list */
old_context = MemoryContextSwitchTo(cluster_context);
rtc = (RelToCluster *) palloc(sizeof(RelToCluster));
- rtc->tableOid = relid;
- rtc->indexOid = indexrelid;
+ rtc->tableOid = inhrelid;
+ rtc->indexOid = inhindid;
rtcs = lappend(rtcs, rtc);
MemoryContextSwitchTo(old_context);
@@ -1737,13 +1792,192 @@ get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
* function emits a WARNING.
*/
static bool
-cluster_is_permitted_for_relation(Oid relid, Oid userid)
+cluster_is_permitted_for_relation(Oid relid, Oid userid, ClusterCommand cmd)
{
if (pg_class_aclcheck(relid, userid, ACL_MAINTAIN) == ACLCHECK_OK)
return true;
ereport(WARNING,
- (errmsg("permission denied to cluster \"%s\", skipping it",
+ (errmsg("permission denied to %s \"%s\", skipping it",
+ CLUSTER_COMMAND_STR(cmd),
get_rel_name(relid))));
return false;
}
+
+/*
+ * REPACK is intended to be a replacement for both CLUSTER and VACUUM FULL.
+ */
+void
+repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
+{
+ ListCell *lc;
+ ClusterParams params = {0};
+ bool verbose = false;
+ Relation rel = NULL;
+ Oid indexOid = InvalidOid;
+ MemoryContext repack_context;
+ List *rtcs;
+
+ /* Parse option list */
+ foreach(lc, stmt->params)
+ {
+ DefElem *opt = (DefElem *) lfirst(lc);
+
+ if (strcmp(opt->defname, "verbose") == 0)
+ verbose = defGetBoolean(opt);
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("unrecognized REPACK option \"%s\"",
+ opt->defname),
+ parser_errposition(pstate, opt->location)));
+ }
+
+ params.options = (verbose ? CLUOPT_VERBOSE : 0);
+
+ if (stmt->relation != NULL)
+ {
+ rel = process_single_relation(stmt->relation, stmt->indexname,
+ CLUSTER_COMMAND_REPACK, ¶ms,
+ &indexOid);
+ if (rel == NULL)
+ return;
+ }
+
+ /*
+ * By here, we know we are in a multi-table situation. In order to avoid
+ * holding locks for too long, we want to process each table in its own
+ * transaction. This forces us to disallow running inside a user
+ * transaction block.
+ */
+ PreventInTransactionBlock(isTopLevel, "REPACK");
+
+ /* Also, we need a memory context to hold our list of relations */
+ repack_context = AllocSetContextCreate(PortalContext,
+ "Repack",
+ ALLOCSET_DEFAULT_SIZES);
+
+ params.options |= CLUOPT_RECHECK;
+ if (rel != NULL)
+ {
+ Oid relid;
+ bool rel_is_index;
+
+ Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
+
+ if (OidIsValid(indexOid))
+ {
+ relid = indexOid;
+ rel_is_index = true;
+ }
+ else
+ {
+ relid = RelationGetRelid(rel);
+ rel_is_index = false;
+ }
+ rtcs = get_tables_to_cluster_partitioned(repack_context, relid,
+ rel_is_index,
+ CLUSTER_COMMAND_REPACK);
+
+ /* close relation, releasing lock on parent table */
+ table_close(rel, AccessExclusiveLock);
+ }
+ else
+ rtcs = get_tables_to_repack(repack_context);
+
+ /* Do the job. */
+ cluster_multiple_rels(rtcs, ¶ms, CLUSTER_COMMAND_REPACK);
+
+ /* Start a new transaction for the cleanup work. */
+ StartTransactionCommand();
+
+ /* Clean up working storage */
+ MemoryContextDelete(repack_context);
+}
+
+/*
+ * REPACK a single relation.
+ *
+ * Return NULL if done, or a relation reference if the caller still needs to
+ * process the relation (because it is partitioned).
+ */
+static Relation
+process_single_relation(RangeVar *relation, char *indexname,
+ ClusterCommand cmd, ClusterParams *params,
+ Oid *indexOid_p)
+{
+ Relation rel;
+ Oid indexOid = InvalidOid;
+
+ /* This is the single-relation case. */
+ Oid tableOid;
+
+ /*
+ * Find, lock, and check permissions on the table. We obtain
+ * AccessExclusiveLock right away to avoid lock-upgrade hazard in the
+ * single-transaction case.
+ */
+ tableOid = RangeVarGetRelidExtended(relation,
+ AccessExclusiveLock,
+ 0,
+ RangeVarCallbackMaintainsTable,
+ NULL);
+ rel = table_open(tableOid, NoLock);
+
+ /*
+ * Reject clustering a remote temp table ... their local buffer manager is
+ * not going to cope.
+ */
+ if (RELATION_IS_OTHER_TEMP(rel))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot %s temporary tables of other sessions",
+ CLUSTER_COMMAND_STR(cmd))));
+
+ if (indexname == NULL && cmd == CLUSTER_COMMAND_CLUSTER)
+ {
+ ListCell *index;
+
+ /* We need to find the index that has indisclustered set. */
+ foreach(index, RelationGetIndexList(rel))
+ {
+ indexOid = lfirst_oid(index);
+ if (get_index_isclustered(indexOid))
+ break;
+ indexOid = InvalidOid;
+ }
+
+ if (!OidIsValid(indexOid))
+ ereport(ERROR,
+ (errcode(ERRCODE_UNDEFINED_OBJECT),
+ errmsg("there is no previously clustered index for table \"%s\"",
+ relation->relname)));
+ }
+ else if (indexname != NULL)
+ {
+ /*
+ * The index is expected to be in the same namespace as the relation.
+ */
+ indexOid = get_relname_relid(indexname,
+ rel->rd_rel->relnamespace);
+ if (!OidIsValid(indexOid))
+ ereport(ERROR,
+ (errcode(ERRCODE_UNDEFINED_OBJECT),
+ errmsg("index \"%s\" for table \"%s\" does not exist",
+ indexname, relation->relname)));
+ }
+
+ *indexOid_p = indexOid;
+
+ /* For non-partitioned tables, do what we came here to do. */
+ if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+ {
+ cluster_rel(rel, indexOid, params, cmd);
+ /* cluster_rel closes the relation, but keeps lock */
+
+ return NULL;
+ }
+
+ return rel;
+}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 72a1b64c2a..25d21d8838 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -15525,7 +15525,8 @@ ATExecClusterOn(Relation rel, const char *indexName, LOCKMODE lockmode)
indexName, RelationGetRelationName(rel))));
/* Check index is valid to cluster on */
- check_index_is_clusterable(rel, indexOid, lockmode);
+ check_index_is_clusterable(rel, indexOid, lockmode,
+ CLUSTER_COMMAND_CLUSTER);
/* And do the work */
mark_index_clustered(rel, indexOid, false);
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 0239d9bae6..59dddcd31f 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -2248,7 +2248,8 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
cluster_params.options |= CLUOPT_VERBOSE;
/* VACUUM FULL is now a variant of CLUSTER; see cluster.c */
- cluster_rel(rel, InvalidOid, &cluster_params);
+ cluster_rel(rel, InvalidOid, &cluster_params,
+ CLUSTER_COMMAND_VACUUM);
/* cluster_rel closes the relation, but keeps lock */
rel = NULL;
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index d3887628d4..3077606a17 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -298,7 +298,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
GrantStmt GrantRoleStmt ImportForeignSchemaStmt IndexStmt InsertStmt
ListenStmt LoadStmt LockStmt MergeStmt NotifyStmt ExplainableStmt PreparableStmt
CreateFunctionStmt AlterFunctionStmt ReindexStmt RemoveAggrStmt
- RemoveFuncStmt RemoveOperStmt RenameStmt ReturnStmt RevokeStmt RevokeRoleStmt
+ RemoveFuncStmt RemoveOperStmt RenameStmt RepackStmt ReturnStmt RevokeStmt RevokeRoleStmt
RuleActionStmt RuleActionStmtOrEmpty RuleStmt
SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
UnlistenStmt UpdateStmt VacuumStmt
@@ -381,7 +381,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
%type <str> copy_file_name
access_method_clause attr_name
table_access_method_clause name cursor_name file_name
- cluster_index_specification
+ cluster_index_specification repack_index_specification
%type <list> func_name handler_name qual_Op qual_all_Op subquery_Op
opt_inline_handler opt_validator validator_clause
@@ -764,7 +764,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
QUOTE QUOTES
RANGE READ REAL REASSIGN RECURSIVE REF_P REFERENCES REFERENCING
- REFRESH REINDEX RELATIVE_P RELEASE RENAME REPEATABLE REPLACE REPLICA
+ REFRESH REINDEX RELATIVE_P RELEASE RENAME REPACK REPEATABLE REPLACE REPLICA
RESET RESTART RESTRICT RETURN RETURNING RETURNS REVOKE RIGHT ROLE ROLLBACK ROLLUP
ROUTINE ROUTINES ROW ROWS RULE
@@ -1100,6 +1100,7 @@ stmt:
| RemoveFuncStmt
| RemoveOperStmt
| RenameStmt
+ | RepackStmt
| RevokeStmt
| RevokeRoleStmt
| RuleStmt
@@ -11869,6 +11870,61 @@ cluster_index_specification:
| /*EMPTY*/ { $$ = NULL; }
;
+/*****************************************************************************
+ *
+ * QUERY:
+ * REPACK <qualified_name> [ USING INDEX <index_name> ]
+ * REPACK (options) <qualified_name> [ USING INDEX <index_name> ]
+ *
+ *****************************************************************************/
+
+RepackStmt:
+ REPACK qualified_name repack_index_specification
+ {
+ RepackStmt *n = makeNode(RepackStmt);
+
+ n->relation = $2;
+ n->indexname = $3;
+ n->params = NIL;
+ $$ = (Node *) n;
+ }
+
+ | REPACK '(' utility_option_list ')' qualified_name repack_index_specification
+ {
+ RepackStmt *n = makeNode(RepackStmt);
+
+ n->relation = $5;
+ n->indexname = $6;
+ n->params = $3;
+ $$ = (Node *) n;
+ }
+
+ | REPACK
+ {
+ RepackStmt *n = makeNode(RepackStmt);
+
+ n->relation = NULL;
+ n->indexname = NULL;
+ n->params = NIL;
+ $$ = (Node *) n;
+ }
+
+ | REPACK '(' utility_option_list ')'
+ {
+ RepackStmt *n = makeNode(RepackStmt);
+
+ n->relation = NULL;
+ n->indexname = NULL;
+ n->params = $3;
+ $$ = (Node *) n;
+ }
+ ;
+
+repack_index_specification:
+ USING INDEX name { $$ = $3; }
+ | /*EMPTY*/ { $$ = NULL; }
+ ;
+
/*****************************************************************************
*
@@ -17909,6 +17965,7 @@ unreserved_keyword:
| RELATIVE_P
| RELEASE
| RENAME
+ | REPACK
| REPEATABLE
| REPLACE
| REPLICA
@@ -18540,6 +18597,7 @@ bare_label_keyword:
| RELATIVE_P
| RELEASE
| RENAME
+ | REPACK
| REPEATABLE
| REPLACE
| REPLICA
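For reference, the grammar rules above accept the following invocation forms (a usage sketch of the intended behavior; table and index names are taken from the regression tests in this patch):

```sql
-- Repack a single table, ordering tuples by an index (CLUSTER-like):
REPACK clstr_tst USING INDEX clstr_tst_c;

-- Repack without an index: rewrite only, no tuple ordering (VACUUM FULL-like):
REPACK clstr_tst;

-- With a parenthesized option list; VERBOSE is the only option
-- recognized by repack() so far:
REPACK (VERBOSE) clstr_tst;

-- No relation given: process all tables the current user is permitted
-- to maintain, one transaction per table (hence disallowed inside a
-- user transaction block):
REPACK;
```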
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 25fe3d5801..bf3ba3c2ae 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -280,6 +280,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_ClusterStmt:
case T_ReindexStmt:
case T_VacuumStmt:
+ case T_RepackStmt:
{
/*
* These commands write WAL, so they're not strictly
@@ -862,6 +863,10 @@ standard_ProcessUtility(PlannedStmt *pstmt,
ExecVacuum(pstate, (VacuumStmt *) parsetree, isTopLevel);
break;
+ case T_RepackStmt:
+ repack(pstate, (RepackStmt *) parsetree, isTopLevel);
+ break;
+
case T_ExplainStmt:
ExplainQuery(pstate, (ExplainStmt *) parsetree, params, dest);
break;
@@ -2869,6 +2874,10 @@ CreateCommandTag(Node *parsetree)
tag = CMDTAG_ANALYZE;
break;
+ case T_RepackStmt:
+ tag = CMDTAG_REPACK;
+ break;
+
case T_ExplainStmt:
tag = CMDTAG_EXPLAIN;
break;
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index e9096a8849..fdfb63e3ba 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -268,6 +268,8 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
cmdtype = PROGRESS_COMMAND_ANALYZE;
else if (pg_strcasecmp(cmd, "CLUSTER") == 0)
cmdtype = PROGRESS_COMMAND_CLUSTER;
+ else if (pg_strcasecmp(cmd, "REPACK") == 0)
+ cmdtype = PROGRESS_COMMAND_REPACK;
else if (pg_strcasecmp(cmd, "CREATE INDEX") == 0)
cmdtype = PROGRESS_COMMAND_CREATE_INDEX;
else if (pg_strcasecmp(cmd, "BASEBACKUP") == 0)
diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index eb8bc12872..5c32ef1cfb 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -1223,7 +1223,7 @@ static const char *const sql_commands[] = {
"DELETE FROM", "DISCARD", "DO", "DROP", "END", "EXECUTE", "EXPLAIN",
"FETCH", "GRANT", "IMPORT FOREIGN SCHEMA", "INSERT INTO", "LISTEN", "LOAD", "LOCK",
"MERGE INTO", "MOVE", "NOTIFY", "PREPARE",
- "REASSIGN", "REFRESH MATERIALIZED VIEW", "REINDEX", "RELEASE",
+ "REASSIGN", "REFRESH MATERIALIZED VIEW", "REINDEX", "RELEASE", "REPACK",
"RESET", "REVOKE", "ROLLBACK",
"SAVEPOINT", "SECURITY LABEL", "SELECT", "SET", "SHOW", "START",
"TABLE", "TRUNCATE", "UNLISTEN", "UPDATE", "VACUUM", "VALUES", "WITH",
@@ -4909,6 +4909,35 @@ match_previous_words(int pattern_id,
COMPLETE_WITH_QUERY(Query_for_list_of_tablespaces);
}
+/* REPACK */
+ else if (Matches("REPACK"))
+ COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_clusterables);
+ else if (Matches("REPACK", "(*)"))
+ COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_clusterables);
+ /* If we have REPACK <sth>, then add "USING" */
+ else if (Matches("REPACK", MatchAnyExcept("(")))
+ COMPLETE_WITH("USING INDEX");
+ /* If we have REPACK (*) <sth>, then add "USING" */
+ else if (Matches("REPACK", "(*)", MatchAny))
+ COMPLETE_WITH("USING INDEX");
+ /* If we have REPACK <sth> USING, then add the index as well */
+ else if (Matches("REPACK", MatchAny, "USING", "INDEX"))
+ {
+ set_completion_reference(prev3_wd);
+ COMPLETE_WITH_SCHEMA_QUERY(Query_for_index_of_table);
+ }
+ else if (HeadMatches("REPACK", "(*") &&
+ !HeadMatches("REPACK", "(*)"))
+ {
+ /*
+ * This fires if we're in an unfinished parenthesized option list.
+ * get_previous_words treats a completed parenthesized option list as
+ * one word, so the above test is correct.
+ */
+ if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
+ COMPLETE_WITH("VERBOSE");
+ }
+
/* SECURITY LABEL */
else if (Matches("SECURITY"))
COMPLETE_WITH("LABEL");
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 60088a64cb..c2976905e4 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -31,10 +31,27 @@ typedef struct ClusterParams
bits32 options; /* bitmask of CLUOPT_* */
} ClusterParams;
+/*
+ * cluster.c currently implements three nearly identical commands: CLUSTER,
+ * VACUUM FULL and REPACK. Where needed, use this enumeration to distinguish
+ * which of these commands is being executed.
+ *
+ * Remove this enumeration when the (now deprecated) CLUSTER and VACUUM
+ * FULL commands are removed.
+ */
+typedef enum ClusterCommand
+{
+ CLUSTER_COMMAND_CLUSTER,
+ CLUSTER_COMMAND_REPACK,
+ CLUSTER_COMMAND_VACUUM
+} ClusterCommand;
+
extern void cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel);
-extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params);
+extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
+ ClusterCommand cmd);
extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid,
- LOCKMODE lockmode);
+ LOCKMODE lockmode,
+ ClusterCommand cmd);
extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
extern Oid make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
@@ -48,4 +65,5 @@ extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
MultiXactId cutoffMulti,
char newrelpersistence);
+extern void repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel);
#endif /* CLUSTER_H */
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 7c736e7b03..931dab215e 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -56,28 +56,51 @@
#define PROGRESS_ANALYZE_PHASE_COMPUTE_EXT_STATS 4
#define PROGRESS_ANALYZE_PHASE_FINALIZE_ANALYZE 5
-/* Progress parameters for cluster */
-#define PROGRESS_CLUSTER_COMMAND 0
-#define PROGRESS_CLUSTER_PHASE 1
-#define PROGRESS_CLUSTER_INDEX_RELID 2
-#define PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED 3
-#define PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN 4
-#define PROGRESS_CLUSTER_TOTAL_HEAP_BLKS 5
-#define PROGRESS_CLUSTER_HEAP_BLKS_SCANNED 6
-#define PROGRESS_CLUSTER_INDEX_REBUILD_COUNT 7
-
-/* Phases of cluster (as advertised via PROGRESS_CLUSTER_PHASE) */
-#define PROGRESS_CLUSTER_PHASE_SEQ_SCAN_HEAP 1
-#define PROGRESS_CLUSTER_PHASE_INDEX_SCAN_HEAP 2
-#define PROGRESS_CLUSTER_PHASE_SORT_TUPLES 3
-#define PROGRESS_CLUSTER_PHASE_WRITE_NEW_HEAP 4
-#define PROGRESS_CLUSTER_PHASE_SWAP_REL_FILES 5
-#define PROGRESS_CLUSTER_PHASE_REBUILD_INDEX 6
-#define PROGRESS_CLUSTER_PHASE_FINAL_CLEANUP 7
+/*
+ * Progress parameters for REPACK.
+ *
+ * CAUTION: These values are also used by CLUSTER. When enhancing REPACK, add
+ * the new values at the end of the list to avoid renumbering.
+ */
+#define PROGRESS_REPACK_COMMAND 0
+#define PROGRESS_REPACK_PHASE 1
+#define PROGRESS_REPACK_INDEX_RELID 2
+#define PROGRESS_REPACK_HEAP_TUPLES_SCANNED 3
+#define PROGRESS_REPACK_HEAP_TUPLES_WRITTEN 4
+#define PROGRESS_REPACK_TOTAL_HEAP_BLKS 5
+#define PROGRESS_REPACK_HEAP_BLKS_SCANNED 6
+#define PROGRESS_REPACK_INDEX_REBUILD_COUNT 7
+
+/*
+ * Phases of repack (as advertised via PROGRESS_REPACK_PHASE).
+ *
+ * CAUTION: These values are also used by CLUSTER. When enhancing REPACK, add
+ * the new values at the end of the list to avoid renumbering.
+ */
+#define PROGRESS_REPACK_PHASE_SEQ_SCAN_HEAP 1
+#define PROGRESS_REPACK_PHASE_INDEX_SCAN_HEAP 2
+#define PROGRESS_REPACK_PHASE_SORT_TUPLES 3
+#define PROGRESS_REPACK_PHASE_WRITE_NEW_HEAP 4
+#define PROGRESS_REPACK_PHASE_SWAP_REL_FILES 5
+#define PROGRESS_REPACK_PHASE_REBUILD_INDEX 6
+#define PROGRESS_REPACK_PHASE_FINAL_CLEANUP 7
+
+/* Commands of PROGRESS_REPACK */
+#define PROGRESS_REPACK_COMMAND_REPACK 1
+
+/*
+ * Progress parameters for cluster.
+ *
+ * Although we need to report REPACK and CLUSTER in separate views, the
+ * parameters and phases of CLUSTER are a subset of those of REPACK. Therefore
+ * we just use the appropriate values defined for REPACK above instead of
+ * defining a separate set of constants here.
+ */
/* Commands of PROGRESS_CLUSTER */
#define PROGRESS_CLUSTER_COMMAND_CLUSTER 1
#define PROGRESS_CLUSTER_COMMAND_VACUUM_FULL 2
+#define PROGRESS_CLUSTER_COMMAND_REPACK 3
/* Progress parameters for CREATE INDEX */
/* 3, 4 and 5 reserved for "waitfor" metrics */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 8dd421fa0e..f8225722fd 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3903,6 +3903,19 @@ typedef struct ClusterStmt
List *params; /* list of DefElem nodes */
} ClusterStmt;
+/* ----------------------
+ * Repack Statement
+ * ----------------------
+ */
+typedef struct RepackStmt
+{
+ NodeTag type;
+ RangeVar *relation; /* relation being repacked */
+ char *indexname; /* order tuples by this index */
+ List *params; /* list of DefElem nodes */
+} RepackStmt;
+
/* ----------------------
* Vacuum and Analyze Statements
*
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 40cf090ce6..0932d6fce5 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -373,6 +373,7 @@ PG_KEYWORD("reindex", REINDEX, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("relative", RELATIVE_P, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("release", RELEASE, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("rename", RENAME, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("repack", REPACK, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("repeatable", REPEATABLE, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("replace", REPLACE, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("replica", REPLICA, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d5..cceb312f2b 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -196,6 +196,7 @@ PG_CMDTAG(CMDTAG_REASSIGN_OWNED, "REASSIGN OWNED", false, false, false)
PG_CMDTAG(CMDTAG_REFRESH_MATERIALIZED_VIEW, "REFRESH MATERIALIZED VIEW", true, false, false)
PG_CMDTAG(CMDTAG_REINDEX, "REINDEX", true, false, false)
PG_CMDTAG(CMDTAG_RELEASE, "RELEASE", false, false, false)
+PG_CMDTAG(CMDTAG_REPACK, "REPACK", false, false, false)
PG_CMDTAG(CMDTAG_RESET, "RESET", false, false, false)
PG_CMDTAG(CMDTAG_REVOKE, "REVOKE", true, false, false)
PG_CMDTAG(CMDTAG_REVOKE_ROLE, "REVOKE ROLE", false, false, false)
diff --git a/src/include/utils/backend_progress.h b/src/include/utils/backend_progress.h
index dda813ab40..da3d14bb97 100644
--- a/src/include/utils/backend_progress.h
+++ b/src/include/utils/backend_progress.h
@@ -25,6 +25,7 @@ typedef enum ProgressCommandType
PROGRESS_COMMAND_VACUUM,
PROGRESS_COMMAND_ANALYZE,
PROGRESS_COMMAND_CLUSTER,
+ PROGRESS_COMMAND_REPACK,
PROGRESS_COMMAND_CREATE_INDEX,
PROGRESS_COMMAND_BASEBACKUP,
PROGRESS_COMMAND_COPY,
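With PROGRESS_COMMAND_REPACK wired up, a running REPACK can be observed from another session through the new progress view (a monitoring sketch; the column names follow the pg_stat_progress_repack view definition this patch adds to rules.out):

```sql
-- Watch REPACK progress in other sessions; the view mirrors
-- pg_stat_progress_cluster, since the phases are shared.
SELECT pid, datname, relid::regclass AS relation,
       command, phase,
       heap_blks_scanned, heap_blks_total,
       index_rebuild_count
FROM pg_stat_progress_repack;
```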
diff --git a/src/test/regress/expected/cluster.out b/src/test/regress/expected/cluster.out
index 4d40a6809a..ed7df29b8e 100644
--- a/src/test/regress/expected/cluster.out
+++ b/src/test/regress/expected/cluster.out
@@ -254,6 +254,120 @@ ORDER BY 1;
clstr_tst_pkey
(3 rows)
+-- REPACK handles individual tables identically to CLUSTER, but it's worth
+-- checking that it handles table hierarchies identically as well.
+REPACK clstr_tst USING INDEX clstr_tst_c;
+-- Verify that inheritance link still works
+INSERT INTO clstr_tst_inh VALUES (0, 100, 'in child table 2');
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst;
+ a | b | c | substring | length
+----+-----+------------------+--------------------------------+--------
+ 10 | 14 | catorce | |
+ 18 | 5 | cinco | |
+ 9 | 4 | cuatro | |
+ 26 | 19 | diecinueve | |
+ 12 | 18 | dieciocho | |
+ 30 | 16 | dieciseis | |
+ 24 | 17 | diecisiete | |
+ 2 | 10 | diez | |
+ 23 | 12 | doce | |
+ 11 | 2 | dos | |
+ 25 | 9 | nueve | |
+ 31 | 8 | ocho | |
+ 1 | 11 | once | |
+ 28 | 15 | quince | |
+ 32 | 6 | seis | xyzzyxyzzyxyzzyxyzzyxyzzyxyzzy | 500000
+ 29 | 7 | siete | |
+ 15 | 13 | trece | |
+ 22 | 30 | treinta | |
+ 17 | 32 | treinta y dos | |
+ 3 | 31 | treinta y uno | |
+ 5 | 3 | tres | |
+ 20 | 1 | uno | |
+ 6 | 20 | veinte | |
+ 14 | 25 | veinticinco | |
+ 21 | 24 | veinticuatro | |
+ 4 | 22 | veintidos | |
+ 19 | 29 | veintinueve | |
+ 16 | 28 | veintiocho | |
+ 27 | 26 | veintiseis | |
+ 13 | 27 | veintisiete | |
+ 7 | 23 | veintitres | |
+ 8 | 21 | veintiuno | |
+ 0 | 100 | in child table | |
+ 0 | 100 | in child table 2 | |
+(34 rows)
+
+-- Verify that foreign key link still works
+INSERT INTO clstr_tst (b, c) VALUES (1111, 'this should fail');
+ERROR: insert or update on table "clstr_tst" violates foreign key constraint "clstr_tst_con"
+DETAIL: Key (b)=(1111) is not present in table "clstr_tst_s".
+SELECT conname FROM pg_constraint WHERE conrelid = 'clstr_tst'::regclass
+ORDER BY 1;
+ conname
+----------------------
+ clstr_tst_a_not_null
+ clstr_tst_con
+ clstr_tst_pkey
+(3 rows)
+
+-- Yet another code path: REPACK w/o index.
+REPACK clstr_tst;
+-- Verify that inheritance link still works
+INSERT INTO clstr_tst_inh VALUES (0, 100, 'in child table 3');
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst;
+ a | b | c | substring | length
+----+-----+------------------+--------------------------------+--------
+ 10 | 14 | catorce | |
+ 18 | 5 | cinco | |
+ 9 | 4 | cuatro | |
+ 26 | 19 | diecinueve | |
+ 12 | 18 | dieciocho | |
+ 30 | 16 | dieciseis | |
+ 24 | 17 | diecisiete | |
+ 2 | 10 | diez | |
+ 23 | 12 | doce | |
+ 11 | 2 | dos | |
+ 25 | 9 | nueve | |
+ 31 | 8 | ocho | |
+ 1 | 11 | once | |
+ 28 | 15 | quince | |
+ 32 | 6 | seis | xyzzyxyzzyxyzzyxyzzyxyzzyxyzzy | 500000
+ 29 | 7 | siete | |
+ 15 | 13 | trece | |
+ 22 | 30 | treinta | |
+ 17 | 32 | treinta y dos | |
+ 3 | 31 | treinta y uno | |
+ 5 | 3 | tres | |
+ 20 | 1 | uno | |
+ 6 | 20 | veinte | |
+ 14 | 25 | veinticinco | |
+ 21 | 24 | veinticuatro | |
+ 4 | 22 | veintidos | |
+ 19 | 29 | veintinueve | |
+ 16 | 28 | veintiocho | |
+ 27 | 26 | veintiseis | |
+ 13 | 27 | veintisiete | |
+ 7 | 23 | veintitres | |
+ 8 | 21 | veintiuno | |
+ 0 | 100 | in child table | |
+ 0 | 100 | in child table 2 | |
+ 0 | 100 | in child table 3 | |
+(35 rows)
+
+-- Verify that foreign key link still works
+INSERT INTO clstr_tst (b, c) VALUES (1111, 'this should fail');
+ERROR: insert or update on table "clstr_tst" violates foreign key constraint "clstr_tst_con"
+DETAIL: Key (b)=(1111) is not present in table "clstr_tst_s".
+SELECT conname FROM pg_constraint WHERE conrelid = 'clstr_tst'::regclass
+ORDER BY 1;
+ conname
+----------------------
+ clstr_tst_a_not_null
+ clstr_tst_con
+ clstr_tst_pkey
+(3 rows)
+
SELECT relname, relkind,
EXISTS(SELECT 1 FROM pg_class WHERE oid = c.reltoastrelid) AS hastoast
FROM pg_class c WHERE relname LIKE 'clstr_tst%' ORDER BY relname;
@@ -381,6 +495,35 @@ SELECT * FROM clstr_1;
2
(2 rows)
+-- REPACK w/o argument performs no ordering, so we can only check which tables
+-- have the relfilenode changed.
+RESET SESSION AUTHORIZATION;
+CREATE TEMP TABLE relnodes_old AS
+(SELECT relname, relfilenode
+FROM pg_class
+WHERE relname IN ('clstr_1', 'clstr_2', 'clstr_3'));
+SET SESSION AUTHORIZATION regress_clstr_user;
+SET client_min_messages = ERROR; -- order of "skipping" warnings may vary
+REPACK;
+RESET client_min_messages;
+RESET SESSION AUTHORIZATION;
+CREATE TEMP TABLE relnodes_new AS
+(SELECT relname, relfilenode
+FROM pg_class
+WHERE relname IN ('clstr_1', 'clstr_2', 'clstr_3'));
+-- Do the actual comparison. Unlike CLUSTER, clstr_3 should have been
+-- processed too, because REPACK does not require a clustering index.
+SELECT o.relname FROM relnodes_old o
+JOIN relnodes_new n ON o.relname = n.relname
+WHERE o.relfilenode <> n.relfilenode
+ORDER BY o.relname;
+ relname
+---------
+ clstr_1
+ clstr_3
+(2 rows)
+
+SET SESSION AUTHORIZATION regress_clstr_user;
-- Test MVCC-safety of cluster. There isn't much we can do to verify the
-- results with a single backend...
CREATE TABLE clustertest (key int PRIMARY KEY);
@@ -495,6 +638,43 @@ ALTER TABLE clstrpart SET WITHOUT CLUSTER;
ERROR: cannot mark index clustered in partitioned table
ALTER TABLE clstrpart CLUSTER ON clstrpart_idx;
ERROR: cannot mark index clustered in partitioned table
+-- Check that REPACK sets new relfilenodes: it should process exactly the same
+-- tables as CLUSTER did.
+DROP TABLE old_cluster_info;
+DROP TABLE new_cluster_info;
+CREATE TEMP TABLE old_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+REPACK clstrpart USING INDEX clstrpart_idx;
+CREATE TEMP TABLE new_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM old_cluster_info AS old JOIN new_cluster_info AS new USING (relname) ORDER BY relname COLLATE "C";
+ relname | level | relkind | ?column?
+-------------+-------+---------+----------
+ clstrpart | 0 | p | t
+ clstrpart1 | 1 | p | t
+ clstrpart11 | 2 | r | f
+ clstrpart12 | 2 | p | t
+ clstrpart2 | 1 | r | f
+ clstrpart3 | 1 | p | t
+ clstrpart33 | 2 | r | f
+(7 rows)
+
+-- And finally the same for REPACK w/o index.
+DROP TABLE old_cluster_info;
+DROP TABLE new_cluster_info;
+CREATE TEMP TABLE old_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+REPACK clstrpart;
+CREATE TEMP TABLE new_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM old_cluster_info AS old JOIN new_cluster_info AS new USING (relname) ORDER BY relname COLLATE "C";
+ relname | level | relkind | ?column?
+-------------+-------+---------+----------
+ clstrpart | 0 | p | t
+ clstrpart1 | 1 | p | t
+ clstrpart11 | 2 | r | f
+ clstrpart12 | 2 | p | t
+ clstrpart2 | 1 | r | f
+ clstrpart3 | 1 | p | t
+ clstrpart33 | 2 | r | f
+(7 rows)
+
DROP TABLE clstrpart;
-- Ownership of partitions is checked
CREATE TABLE ptnowner(i int unique) PARTITION BY LIST (i);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 5baba8d39f..d678256816 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2041,6 +2041,33 @@ pg_stat_progress_create_index| SELECT s.pid,
s.param15 AS partitions_done
FROM (pg_stat_get_progress_info('CREATE INDEX'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
+pg_stat_progress_repack| SELECT s.pid,
+ s.datid,
+ d.datname,
+ s.relid,
+ CASE s.param1
+ WHEN 1 THEN 'REPACK'::text
+ ELSE NULL::text
+ END AS command,
+ CASE s.param2
+ WHEN 0 THEN 'initializing'::text
+ WHEN 1 THEN 'seq scanning heap'::text
+ WHEN 2 THEN 'index scanning heap'::text
+ WHEN 3 THEN 'sorting tuples'::text
+ WHEN 4 THEN 'writing new heap'::text
+ WHEN 5 THEN 'swapping relation files'::text
+ WHEN 6 THEN 'rebuilding index'::text
+ WHEN 7 THEN 'performing final cleanup'::text
+ ELSE NULL::text
+ END AS phase,
+ (s.param3)::oid AS repack_index_relid,
+ s.param4 AS heap_tuples_scanned,
+ s.param5 AS heap_tuples_written,
+ s.param6 AS heap_blks_total,
+ s.param7 AS heap_blks_scanned,
+ s.param8 AS index_rebuild_count
+ FROM (pg_stat_get_progress_info('REPACK'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
+ LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_progress_vacuum| SELECT s.pid,
s.datid,
d.datname,
diff --git a/src/test/regress/sql/cluster.sql b/src/test/regress/sql/cluster.sql
index b7115f8610..e348e26fbf 100644
--- a/src/test/regress/sql/cluster.sql
+++ b/src/test/regress/sql/cluster.sql
@@ -76,6 +76,33 @@ INSERT INTO clstr_tst (b, c) VALUES (1111, 'this should fail');
SELECT conname FROM pg_constraint WHERE conrelid = 'clstr_tst'::regclass
ORDER BY 1;
+-- REPACK handles individual tables identically to CLUSTER, but it's worth
+-- checking that it handles table hierarchies identically as well.
+REPACK clstr_tst USING INDEX clstr_tst_c;
+
+-- Verify that inheritance link still works
+INSERT INTO clstr_tst_inh VALUES (0, 100, 'in child table 2');
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst;
+
+-- Verify that foreign key link still works
+INSERT INTO clstr_tst (b, c) VALUES (1111, 'this should fail');
+
+SELECT conname FROM pg_constraint WHERE conrelid = 'clstr_tst'::regclass
+ORDER BY 1;
+
+-- Yet another code path: REPACK w/o index.
+REPACK clstr_tst;
+
+-- Verify that inheritance link still works
+INSERT INTO clstr_tst_inh VALUES (0, 100, 'in child table 3');
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst;
+
+-- Verify that foreign key link still works
+INSERT INTO clstr_tst (b, c) VALUES (1111, 'this should fail');
+
+SELECT conname FROM pg_constraint WHERE conrelid = 'clstr_tst'::regclass
+ORDER BY 1;
+
SELECT relname, relkind,
EXISTS(SELECT 1 FROM pg_class WHERE oid = c.reltoastrelid) AS hastoast
@@ -159,6 +186,34 @@ INSERT INTO clstr_1 VALUES (1);
CLUSTER clstr_1;
SELECT * FROM clstr_1;
+-- REPACK w/o argument performs no ordering, so we can only check which tables
+-- have the relfilenode changed.
+RESET SESSION AUTHORIZATION;
+CREATE TEMP TABLE relnodes_old AS
+(SELECT relname, relfilenode
+FROM pg_class
+WHERE relname IN ('clstr_1', 'clstr_2', 'clstr_3'));
+
+SET SESSION AUTHORIZATION regress_clstr_user;
+SET client_min_messages = ERROR; -- order of "skipping" warnings may vary
+REPACK;
+RESET client_min_messages;
+
+RESET SESSION AUTHORIZATION;
+CREATE TEMP TABLE relnodes_new AS
+(SELECT relname, relfilenode
+FROM pg_class
+WHERE relname IN ('clstr_1', 'clstr_2', 'clstr_3'));
+
+-- Do the actual comparison. Unlike CLUSTER, clstr_3 should have been
+-- processed too, because REPACK does not require a clustering index.
+SELECT o.relname FROM relnodes_old o
+JOIN relnodes_new n ON o.relname = n.relname
+WHERE o.relfilenode <> n.relfilenode
+ORDER BY o.relname;
+
+SET SESSION AUTHORIZATION regress_clstr_user;
+
-- Test MVCC-safety of cluster. There isn't much we can do to verify the
-- results with a single backend...
@@ -229,6 +284,24 @@ SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM o
CLUSTER clstrpart;
ALTER TABLE clstrpart SET WITHOUT CLUSTER;
ALTER TABLE clstrpart CLUSTER ON clstrpart_idx;
+
+-- Check that REPACK sets new relfilenodes: it should process exactly the same
+-- tables as CLUSTER did.
+DROP TABLE old_cluster_info;
+DROP TABLE new_cluster_info;
+CREATE TEMP TABLE old_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+REPACK clstrpart USING INDEX clstrpart_idx;
+CREATE TEMP TABLE new_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM old_cluster_info AS old JOIN new_cluster_info AS new USING (relname) ORDER BY relname COLLATE "C";
+
+-- And finally the same for REPACK w/o index.
+DROP TABLE old_cluster_info;
+DROP TABLE new_cluster_info;
+CREATE TEMP TABLE old_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+REPACK clstrpart;
+CREATE TEMP TABLE new_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM old_cluster_info AS old JOIN new_cluster_info AS new USING (relname) ORDER BY relname COLLATE "C";
+
DROP TABLE clstrpart;
-- Ownership of partitions is checked
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 64c6bf7a89..00d0fc7296 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -408,6 +408,7 @@ ClientCertName
ClientConnectionInfo
ClientData
ClientSocket
+ClusterCommand
ClonePtrType
ClosePortalStmt
ClosePtrType
@@ -2452,6 +2453,7 @@ ReorderBufferTupleCidKey
ReorderBufferUpdateProgressTxnCB
ReorderTuple
RepOriginId
+RepackStmt
ReparameterizeForeignPathByChild_function
ReplaceVarsFromTargetList_context
ReplaceVarsNoMatchOption
--
2.43.5
You mentioned fillfactor only in the cluster notes; it would be good to mention it
in the refsynopsisdiv too, I think.
+ <para>
+ <command>REPACK</command> reclaims storage occupied by dead
+ tuples. Unlike <command>VACUUM</command>, it does so by rewriting the
+ entire contents of the table specified
+ by <replaceable class="parameter">table_name</replaceable> into a new disk
- file with no extra space, allowing unused space to be returned to the
+ file with no extra space, obeying fillfactor settings, allowing unused
+ space to be returned to the operating system.
+ </para>
regards
Marcos
On Wed, Feb 19, 2025 at 14:08, Antonin Houska <ah@cybertec.at>
wrote:
Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
On 2025-Jan-30, Michael Banck wrote:
I haven't addressed the problem of a new command yet - for that I'd like to
see some sort of consensus, so that I do not have to do all the related
changes many times.
Well, looks like this patch-set is blocked on the bikeshedding part?
Somebody should call a shot here, then.
A bunch of people discussed this patch in today's developer meeting in
Brussels. There's pretty much a consensus on using the verb REPACK
CONCURRENTLY for this new command -- where unadorned REPACK would be
VACUUM FULL, and we'd have something like REPACK WITH INDEX or maybe
REPACK USING INDEX to take the CLUSTER place.

This is a patch that adds the REPACK command (w/o CONCURRENTLY). I'll
incorporate it into the patch series, but it'd be great if this part were a
little bit stable before I start to rebase the depending patches. Thanks.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Marcos Pegoraro <marcos@f10.com.br> wrote:
You mentioned fillfactor only in the cluster notes; it would be good to mention it
in the refsynopsisdiv too, I think.
ok, I've added a note to the first paragraph.
Attached here is the REPACK command as well as the patch set that adds the
CONCURRENTLY option. The new symbols have been renamed so they resemble REPACK
rather than CLUSTER.
Please note that 0008 is a new part which makes the setting wal_level=logical
unnecessary.
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Attachments:
v08-0001-Add-REPACK-command.patch (text/x-diff)
From 2f83bde90c324ce515455a03135bcde34dd70760 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Wed, 26 Feb 2025 09:17:20 +0100
Subject: [PATCH 1/9] Add REPACK command.
The existing CLUSTER command as well as VACUUM with the FULL option both
reclaim unused space by rewriting the table. Now that we want to enhance this
functionality (in particular, by adding a new option CONCURRENTLY), we should
enhance both commands because they are both implemented by the same function
(cluster.c:cluster_rel). However, adding the same option to two different
commands is not very user-friendly. Therefore it was decided to create a new
command and to declare both CLUSTER command and the FULL option of VACUUM
deprecated. Future enhancements to this rewriting code will only affect the
new command.
Like CLUSTER, the REPACK command reorders the table according to the specified
index. Unlike CLUSTER, REPACK does not require an index: if only the table is
specified, the command acts as VACUUM FULL. As we don't want to remove CLUSTER
and VACUUM FULL yet, there are three callers of the cluster_rel() function
now: REPACK, CLUSTER and VACUUM FULL. When we need to distinguish who is
calling this function (mostly for logging, but also for progress reporting),
we can no longer use the OID of the clustering index: both REPACK and VACUUM
FULL can pass InvalidOid. Therefore, this patch introduces a new enumeration
type ClusterCommand, and adds an argument of this type to the cluster_rel()
function and to all the functions that need to distinguish the caller.
Like CLUSTER and VACUUM FULL, the REPACK command without arguments processes
all the tables on which the current user has the MAINTAIN privilege.
A new pg_stat_progress_repack view is added to monitor the progress of
REPACK. Currently it displays the same information as pg_stat_progress_cluster
(except that column names might differ), but it'll also display the status of
the REPACK CONCURRENTLY command in the future, so the view definitions will
eventually diverge.
Regarding user documentation, the patch moves the information on clustering
from cluster.sgml to the new file repack.sgml. cluster.sgml now contains a
link that points to the related section of repack.sgml. A note on deprecation
and a link to repack.sgml are added to both cluster.sgml and vacuum.sgml.
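The caller-disambiguation argument in the commit message can be made concrete with a small sketch. This is purely illustrative Python modeling the reasoning, not the patch's C code; the names loosely mirror the patch's ClusterCommand enumeration:

```python
# Once REPACK and VACUUM FULL can both pass InvalidOid as the clustering
# index, the index OID alone no longer tells cluster_rel() which command
# invoked it, so an explicit command tag must be threaded through for
# logging and progress reporting.

from enum import Enum

INVALID_OID = 0  # PostgreSQL's InvalidOid

class ClusterCommand(Enum):
    CLUSTER_COMMAND_CLUSTER = 1
    CLUSTER_COMMAND_REPACK = 2
    CLUSTER_COMMAND_VACUUM_FULL = 3

def describe_call(index_oid, cmd):
    """Build a log prefix; the index OID alone is ambiguous without cmd."""
    if index_oid != INVALID_OID:
        return f"{cmd.name}: clustering using index {index_oid}"
    # Both REPACK (without USING INDEX) and VACUUM FULL reach here.
    return f"{cmd.name}: rewriting table without clustering"

print(describe_call(INVALID_OID, ClusterCommand.CLUSTER_COMMAND_REPACK))
print(describe_call(INVALID_OID, ClusterCommand.CLUSTER_COMMAND_VACUUM_FULL))
```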
---
doc/src/sgml/monitoring.sgml | 230 +++++++++++
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/cluster.sgml | 79 +---
doc/src/sgml/ref/repack.sgml | 254 ++++++++++++
doc/src/sgml/ref/vacuum.sgml | 8 +
doc/src/sgml/reference.sgml | 1 +
src/backend/access/heap/heapam_handler.c | 32 +-
src/backend/catalog/index.c | 2 +-
src/backend/catalog/system_views.sql | 27 ++
src/backend/commands/cluster.c | 496 +++++++++++++++++------
src/backend/commands/tablecmds.c | 3 +-
src/backend/commands/vacuum.c | 3 +-
src/backend/parser/gram.y | 63 ++-
src/backend/tcop/utility.c | 9 +
src/backend/utils/adt/pgstatfuncs.c | 2 +
src/bin/psql/tab-complete.in.c | 31 +-
src/include/commands/cluster.h | 22 +-
src/include/commands/progress.h | 60 ++-
src/include/nodes/parsenodes.h | 13 +
src/include/parser/kwlist.h | 1 +
src/include/tcop/cmdtaglist.h | 1 +
src/include/utils/backend_progress.h | 1 +
src/test/regress/expected/cluster.out | 180 ++++++++
src/test/regress/expected/rules.out | 27 ++
src/test/regress/sql/cluster.sql | 73 ++++
src/tools/pgindent/typedefs.list | 2 +
26 files changed, 1385 insertions(+), 236 deletions(-)
create mode 100644 doc/src/sgml/ref/repack.sgml
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 9178f1d34e..58e1becf02 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -400,6 +400,14 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_progress_repack</structname><indexterm><primary>pg_stat_progress_repack</primary></indexterm></entry>
+ <entry>One row for each backend running
+ <command>REPACK</command>, showing current progress. See
+ <xref linkend="repack-progress-reporting"/>.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_progress_basebackup</structname><indexterm><primary>pg_stat_progress_basebackup</primary></indexterm></entry>
<entry>One row for each WAL sender process streaming a base backup,
@@ -5885,6 +5893,228 @@ FROM pg_stat_get_backend_idset() AS backendid;
</table>
</sect2>
+ <sect2 id="repack-progress-reporting">
+ <title>REPACK Progress Reporting</title>
+
+ <indexterm>
+ <primary>pg_stat_progress_repack</primary>
+ </indexterm>
+
+ <para>
+ Whenever <command>REPACK</command> is running,
+ the <structname>pg_stat_progress_repack</structname> view will contain a
+ row for each backend that is currently running the command. The tables
+ below describe the information that will be reported and provide
+ information about how to interpret it.
+ </para>
+
+ <table id="pg-stat-progress-repack-view" xreflabel="pg_stat_progress_repack">
+ <title><structname>pg_stat_progress_repack</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>pid</structfield> <type>integer</type>
+ </para>
+ <para>
+ Process ID of backend.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>datid</structfield> <type>oid</type>
+ </para>
+ <para>
+ OID of the database to which this backend is connected.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>datname</structfield> <type>name</type>
+ </para>
+ <para>
+ Name of the database to which this backend is connected.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>relid</structfield> <type>oid</type>
+ </para>
+ <para>
+ OID of the table being repacked.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>command</structfield> <type>text</type>
+ </para>
+ <para>
+ The command that is running. Currently, the only value
+ is <literal>REPACK</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>phase</structfield> <type>text</type>
+ </para>
+ <para>
+ Current processing phase. See <xref linkend="repack-phases"/>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>repack_index_relid</structfield> <type>oid</type>
+ </para>
+ <para>
+ If the table is being scanned using an index, this is the OID of the
+ index being used; otherwise, it is zero.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_scanned</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples scanned.
+ This counter only advances when the phase is
+ <literal>seq scanning heap</literal>,
+ <literal>index scanning heap</literal>
+ or <literal>writing new heap</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_written</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples written.
+ This counter only advances when the phase is
+ <literal>seq scanning heap</literal>,
+ <literal>index scanning heap</literal>
+ or <literal>writing new heap</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_blks_total</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Total number of heap blocks in the table. This number is reported
+ as of the beginning of <literal>seq scanning heap</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_blks_scanned</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap blocks scanned. This counter only advances when the
+ phase is <literal>seq scanning heap</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>index_rebuild_count</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of indexes rebuilt. This counter only advances when the phase
+ is <literal>rebuilding index</literal>.
+ </para></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <table id="repack-phases">
+ <title>REPACK Phases</title>
+ <tgroup cols="2">
+ <colspec colname="col1" colwidth="1*"/>
+ <colspec colname="col2" colwidth="2*"/>
+ <thead>
+ <row>
+ <entry>Phase</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry><literal>initializing</literal></entry>
+ <entry>
+ The command is preparing to begin scanning the heap. This phase is
+ expected to be very brief.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>seq scanning heap</literal></entry>
+ <entry>
+ The command is currently scanning the table using a sequential scan.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>index scanning heap</literal></entry>
+ <entry>
+ <command>REPACK</command> is currently scanning the table using an index scan.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>sorting tuples</literal></entry>
+ <entry>
+ <command>REPACK</command> is currently sorting tuples.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>writing new heap</literal></entry>
+ <entry>
+ <command>REPACK</command> is currently writing the new heap.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>swapping relation files</literal></entry>
+ <entry>
+ The command is currently swapping newly-built files into place.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>rebuilding index</literal></entry>
+ <entry>
+ The command is currently rebuilding an index.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>performing final cleanup</literal></entry>
+ <entry>
+ The command is performing final cleanup. When this phase is
+ completed, <command>REPACK</command> will end.
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+ </sect2>
+
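To illustrate how a monitoring tool might consume the view documented above: the scan-progress columns only advance in particular phases, so a completion estimate has to check the phase first. A hedged sketch — the phase and column names come from the documentation above, the sample numbers are invented:

```python
# Rough completion fraction for the "seq scanning heap" phase of
# pg_stat_progress_repack. heap_blks_total is reported as of the start of
# that phase; outside it (or before it is set) no estimate is meaningful.

def seq_scan_progress(phase, heap_blks_scanned, heap_blks_total):
    """Fraction of the heap scanned, or None when not applicable."""
    if phase != 'seq scanning heap' or heap_blks_total == 0:
        return None
    return heap_blks_scanned / heap_blks_total

print(seq_scan_progress('seq scanning heap', 250, 1000))  # 0.25
print(seq_scan_progress('sorting tuples', 1000, 1000))    # None
```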
<sect2 id="copy-progress-reporting">
<title>COPY Progress Reporting</title>
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867..c0ef654fcb 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -167,6 +167,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY refreshMaterializedView SYSTEM "refresh_materialized_view.sgml">
<!ENTITY reindex SYSTEM "reindex.sgml">
<!ENTITY releaseSavepoint SYSTEM "release_savepoint.sgml">
+<!ENTITY repack SYSTEM "repack.sgml">
<!ENTITY reset SYSTEM "reset.sgml">
<!ENTITY revoke SYSTEM "revoke.sgml">
<!ENTITY rollback SYSTEM "rollback.sgml">
diff --git a/doc/src/sgml/ref/cluster.sgml b/doc/src/sgml/ref/cluster.sgml
index 8811f169ea..54bb2362c8 100644
--- a/doc/src/sgml/ref/cluster.sgml
+++ b/doc/src/sgml/ref/cluster.sgml
@@ -42,17 +42,23 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
<replaceable class="parameter">table_name</replaceable>.
</para>
- <para>
- When a table is clustered, it is physically reordered
- based on the index information. Clustering is a one-time operation:
- when the table is subsequently updated, the changes are
- not clustered. That is, no attempt is made to store new or
- updated rows according to their index order. (If one wishes, one can
- periodically recluster by issuing the command again. Also, setting
- the table's <literal>fillfactor</literal> storage parameter to less than
- 100% can aid in preserving cluster ordering during updates, since updated
- rows are kept on the same page if enough space is available there.)
- </para>
+ <warning>
+ <para>
+ The <command>CLUSTER</command> command is deprecated in favor of
+ <xref linkend="sql-repack"/>.
+ </para>
+ </warning>
+
+ <note>
+ <para>
+ <xref linkend="sql-repack-notes-on-clustering"/> explains how clustering
+ works, whether it is initiated by <command>CLUSTER</command> or
+ by <command>REPACK</command>. The notable difference between the two is
+ that <command>REPACK</command> does not remember the index used last
+ time. Thus if you don't specify an index, <command>REPACK</command>
+ rewrites the table but does not try to cluster it.
+ </para>
+ </note>
<para>
When a table is clustered, <productname>PostgreSQL</productname>
@@ -136,63 +142,12 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
on the table.
</para>
- <para>
- In cases where you are accessing single rows randomly
- within a table, the actual order of the data in the
- table is unimportant. However, if you tend to access some
- data more than others, and there is an index that groups
- them together, you will benefit from using <command>CLUSTER</command>.
- If you are requesting a range of indexed values from a table, or a
- single indexed value that has multiple rows that match,
- <command>CLUSTER</command> will help because once the index identifies the
- table page for the first row that matches, all other rows
- that match are probably already on the same table page,
- and so you save disk accesses and speed up the query.
- </para>
-
- <para>
- <command>CLUSTER</command> can re-sort the table using either an index scan
- on the specified index, or (if the index is a b-tree) a sequential
- scan followed by sorting. It will attempt to choose the method that
- will be faster, based on planner cost parameters and available statistical
- information.
- </para>
-
<para>
While <command>CLUSTER</command> is running, the <xref
linkend="guc-search-path"/> is temporarily changed to <literal>pg_catalog,
pg_temp</literal>.
</para>
- <para>
- When an index scan is used, a temporary copy of the table is created that
- contains the table data in the index order. Temporary copies of each
- index on the table are created as well. Therefore, you need free space on
- disk at least equal to the sum of the table size and the index sizes.
- </para>
-
- <para>
- When a sequential scan and sort is used, a temporary sort file is
- also created, so that the peak temporary space requirement is as much
- as double the table size, plus the index sizes. This method is often
- faster than the index scan method, but if the disk space requirement is
- intolerable, you can disable this choice by temporarily setting <xref
- linkend="guc-enable-sort"/> to <literal>off</literal>.
- </para>
-
- <para>
- It is advisable to set <xref linkend="guc-maintenance-work-mem"/> to
- a reasonably large value (but not more than the amount of RAM you can
- dedicate to the <command>CLUSTER</command> operation) before clustering.
- </para>
-
- <para>
- Because the planner records statistics about the ordering of
- tables, it is advisable to run <link linkend="sql-analyze"><command>ANALYZE</command></link>
- on the newly clustered table.
- Otherwise, the planner might make poor choices of query plans.
- </para>
-
<para>
Because <command>CLUSTER</command> remembers which indexes are clustered,
one can cluster the tables one wants clustered manually the first time,
diff --git a/doc/src/sgml/ref/repack.sgml b/doc/src/sgml/ref/repack.sgml
new file mode 100644
index 0000000000..84f3c3e3f2
--- /dev/null
+++ b/doc/src/sgml/ref/repack.sgml
@@ -0,0 +1,254 @@
+<!--
+doc/src/sgml/ref/repack.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-repack">
+ <indexterm zone="sql-repack">
+ <primary>REPACK</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle>REPACK</refentrytitle>
+ <manvolnum>7</manvolnum>
+ <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>REPACK</refname>
+ <refpurpose>reclaim storage in a table, optionally clustering it according to an index</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <replaceable class="parameter">table_name</replaceable> [ USING INDEX <replaceable class="parameter">index_name</replaceable> ] ]
+
+<phrase>where <replaceable class="parameter">option</replaceable> can be one of:</phrase>
+
+ VERBOSE [ <replaceable class="parameter">boolean</replaceable> ]
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+
+ <para>
+ <command>REPACK</command> reclaims storage occupied by dead
+ tuples. Unlike <command>VACUUM</command>, it does so by rewriting the
+ entire contents of the table specified
+ by <replaceable class="parameter">table_name</replaceable> into a new disk
+ file with no extra space (except for the space guaranteed by
+ the <literal>fillfactor</literal> storage parameter), allowing unused space
+ to be returned to the operating system.
+ </para>
+
+ <para>
+ Without
+ a <replaceable class="parameter">table_name</replaceable>, <command>REPACK</command>
+ processes every table and materialized view in the current database that
+ the current user has the <literal>MAINTAIN</literal> privilege on. This
+ form of <command>REPACK</command> cannot be executed inside a transaction
+ block.
+ </para>
+
+ <para>
+ If <replaceable class="parameter">index_name</replaceable> is specified,
+ the table is clustered by this index. Please see the notes on clustering
+ below.
+ </para>
+
+ <para>
+ When a table is being repacked, an <literal>ACCESS EXCLUSIVE</literal> lock
+ is acquired on it. This prevents any other database operations (both reads
+ and writes) from operating on the table until the <command>REPACK</command>
+ is finished.
+ </para>
+
+ <refsect2 id="sql-repack-notes-on-clustering" xreflabel="Notes on Clustering">
+ <title>Notes on Clustering</title>
+
+ <para>
+ When a table is clustered, it is physically reordered based on the index
+ information. Clustering is a one-time operation: when the table is
+ subsequently updated, the changes are not clustered. That is, no attempt
+ is made to store new or updated rows according to their index order. (If
+ one wishes, one can periodically recluster by issuing the command again.
+ Also, setting the table's <literal>fillfactor</literal> storage parameter
+ to less than 100% can aid in preserving cluster ordering during updates,
+ since updated rows are kept on the same page if enough space is available
+ there.)
+ </para>
+
+ <para>
+ In cases where you are accessing single rows randomly within a table, the
+ actual order of the data in the table is unimportant. However, if you tend
+ to access some data more than others, and there is an index that groups
+ them together, you will benefit from using <command>REPACK</command>. If
+ you are requesting a range of indexed values from a table, or a single
+ indexed value that has multiple rows that match,
+ <command>REPACK</command> will help because once the index identifies the
+ table page for the first row that matches, all other rows that match are
+ probably already on the same table page, and so you save disk accesses and
+ speed up the query.
+ </para>
+
+ <para>
+ <command>REPACK</command> can re-sort the table using either an index scan
+ on the specified index, or (if the index is a b-tree) a sequential scan
+ followed by sorting. It will attempt to choose the method that will be
+ faster, based on planner cost parameters and available statistical
+ information.
+ </para>
+
+ <para>
+ Because the planner records statistics about the ordering of tables, it is
+ advisable to
+ run <link linkend="sql-analyze"><command>ANALYZE</command></link> on the
+ newly repacked table. Otherwise, the planner might make poor choices of
+ query plans.
+ </para>
+ </refsect2>
+
+ <refsect2 id="sql-repack-notes-on-resources" xreflabel="Notes on Resources">
+ <title>Notes on Resources</title>
+
+ <para>
+ When an index scan or a sequential scan without sort is used, a temporary
+ copy of the table is created that contains the table data in the index
+ order. Temporary copies of each index on the table are created as well.
+ Therefore, you need free space on disk at least equal to the sum of the
+ table size and the index sizes.
+ </para>
+
+ <para>
+ When a sequential scan and sort is used, a temporary sort file is also
+ created, so that the peak temporary space requirement is as much as double
+ the table size, plus the index sizes. This method is often faster than
+ the index scan method, but if the disk space requirement is intolerable,
+ you can disable this choice by temporarily setting
+ <xref linkend="guc-enable-sort"/> to <literal>off</literal>.
+ </para>
+
+ <para>
+ It is advisable to set <xref linkend="guc-maintenance-work-mem"/> to a
+ reasonably large value (but not more than the amount of RAM you can
+ dedicate to the <command>REPACK</command> operation) before repacking.
+ </para>
+ </refsect2>
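The disk-space arithmetic in the resource notes above can be sketched directly. This is illustrative only (byte counts are made up), modeling the documented rule of thumb: table + indexes for the copy, plus up to another table's worth when a sort file is used:

```python
# Rough upper bound on extra disk space needed while REPACK runs, per the
# resource notes: a temporary copy of the table and of each index, plus a
# temporary sort file (up to table size) for the seq-scan-and-sort method.

def peak_temp_space(table_size, index_sizes, uses_sort):
    """Estimated peak temporary disk usage in bytes."""
    base = table_size + sum(index_sizes)
    return base + (table_size if uses_sort else 0)

table = 8 * 1024**3           # 8 GiB table
indexes = [1 * 1024**3] * 2   # two 1 GiB indexes

print(peak_temp_space(table, indexes, uses_sort=False) // 1024**3)  # 10
print(peak_temp_space(table, indexes, uses_sort=True) // 1024**3)   # 18
```

When the sort-file requirement is intolerable, the docs note that setting enable_sort to off steers the choice back to the index-scan method.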
+
+ </refsect1>
+
+ <refsect1>
+ <title>Parameters</title>
+
+ <variablelist>
+ <varlistentry>
+ <term><replaceable class="parameter">table_name</replaceable></term>
+ <listitem>
+ <para>
+ The name (possibly schema-qualified) of a table.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><replaceable class="parameter">index_name</replaceable></term>
+ <listitem>
+ <para>
+ The name of an index.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>VERBOSE</literal></term>
+ <listitem>
+ <para>
+ Prints a progress report at <literal>INFO</literal> level as each
+ table is repacked.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><replaceable class="parameter">boolean</replaceable></term>
+ <listitem>
+ <para>
+ Specifies whether the selected option should be turned on or off.
+ You can write <literal>TRUE</literal>, <literal>ON</literal>, or
+ <literal>1</literal> to enable the option, and <literal>FALSE</literal>,
+ <literal>OFF</literal>, or <literal>0</literal> to disable it. The
+ <replaceable class="parameter">boolean</replaceable> value can also
+ be omitted, in which case <literal>TRUE</literal> is assumed.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </refsect1>
+
+ <refsect1>
+ <title>Notes</title>
+
+ <para>
+ To repack a table, one must have the <literal>MAINTAIN</literal> privilege
+ on the table.
+ </para>
+
+ <para>
+ While <command>REPACK</command> is running, the <xref
+ linkend="guc-search-path"/> is temporarily changed to <literal>pg_catalog,
+ pg_temp</literal>.
+ </para>
+
+ <para>
+ Each backend running <command>REPACK</command> will report its progress
+ in the <structname>pg_stat_progress_repack</structname> view. See
+ <xref linkend="repack-progress-reporting"/> for details.
+ </para>
+
+ <para>
+ Repacking a partitioned table repacks each of its partitions. If an index
+ is specified, each partition is clustered using the partition of that
+ index. <command>REPACK</command> on a partitioned table cannot be executed
+ inside a transaction block.
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Examples</title>
+
+ <para>
+ Repack the table <literal>employees</literal>:
+<programlisting>
+REPACK employees;
+</programlisting>
+ </para>
+
+
+ <para>
+ Cluster the table <literal>employees</literal> on the basis of its
+ index <literal>employees_ind</literal>:
+<programlisting>
+REPACK employees USING INDEX employees_ind;
+</programlisting>
+ </para>
+
+ <para>
+ Repack all tables in the database on which you have
+ the <literal>MAINTAIN</literal> privilege:
+<programlisting>
+REPACK;
+</programlisting></para>
+ </refsect1>
+
+ <refsect1>
+ <title>Compatibility</title>
+
+ <para>
+ There is no <command>REPACK</command> statement in the SQL standard.
+ </para>
+
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index 971b1237d4..2b5a5d0ac4 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -98,6 +98,14 @@ VACUUM [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
<varlistentry>
<term><literal>FULL</literal></term>
<listitem>
+
+ <warning>
+ <para>
+ The <command>FULL</command> parameter is deprecated in favor of
+ <xref linkend="sql-repack"/>.
+ </para>
+ </warning>
+
<para>
Selects <quote>full</quote> vacuum, which can reclaim more
space, but takes much longer and exclusively locks the table.
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83f..229912d35b 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -195,6 +195,7 @@
&refreshMaterializedView;
&reindex;
&releaseSavepoint;
+ &repack;
&reset;
&revoke;
&rollback;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index e78682c3ce..5c3cab8bc2 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -737,13 +737,13 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
if (OldIndex != NULL && !use_sort)
{
const int ci_index[] = {
- PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_INDEX_RELID
+ PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_INDEX_RELID
};
int64 ci_val[2];
/* Set phase and OIDOldIndex to columns */
- ci_val[0] = PROGRESS_CLUSTER_PHASE_INDEX_SCAN_HEAP;
+ ci_val[0] = PROGRESS_REPACK_PHASE_INDEX_SCAN_HEAP;
ci_val[1] = RelationGetRelid(OldIndex);
pgstat_progress_update_multi_param(2, ci_index, ci_val);
@@ -755,15 +755,15 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
else
{
/* In scan-and-sort mode and also VACUUM FULL, set phase */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_SEQ_SCAN_HEAP);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_SEQ_SCAN_HEAP);
tableScan = table_beginscan(OldHeap, SnapshotAny, 0, (ScanKey) NULL);
heapScan = (HeapScanDesc) tableScan;
indexScan = NULL;
/* Set total heap blocks */
- pgstat_progress_update_param(PROGRESS_CLUSTER_TOTAL_HEAP_BLKS,
+ pgstat_progress_update_param(PROGRESS_REPACK_TOTAL_HEAP_BLKS,
heapScan->rs_nblocks);
}
@@ -805,7 +805,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
* is manually updated to the correct value when the table
* scan finishes.
*/
- pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_BLKS_SCANNED,
+ pgstat_progress_update_param(PROGRESS_REPACK_HEAP_BLKS_SCANNED,
heapScan->rs_nblocks);
break;
}
@@ -821,7 +821,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
*/
if (prev_cblock != heapScan->rs_cblock)
{
- pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_BLKS_SCANNED,
+ pgstat_progress_update_param(PROGRESS_REPACK_HEAP_BLKS_SCANNED,
(heapScan->rs_cblock +
heapScan->rs_nblocks -
heapScan->rs_startblock
@@ -908,14 +908,14 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
* In scan-and-sort mode, report increase in number of tuples
* scanned
*/
- pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED,
+ pgstat_progress_update_param(PROGRESS_REPACK_HEAP_TUPLES_SCANNED,
*num_tuples);
}
else
{
const int ct_index[] = {
- PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED,
- PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN
+ PROGRESS_REPACK_HEAP_TUPLES_SCANNED,
+ PROGRESS_REPACK_HEAP_TUPLES_WRITTEN
};
int64 ct_val[2];
@@ -948,14 +948,14 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
double n_tuples = 0;
/* Report that we are now sorting tuples */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_SORT_TUPLES);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_SORT_TUPLES);
tuplesort_performsort(tuplesort);
/* Report that we are now writing new heap */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_WRITE_NEW_HEAP);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_WRITE_NEW_HEAP);
for (;;)
{
@@ -973,7 +973,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
values, isnull,
rwstate);
/* Report n_tuples */
- pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN,
+ pgstat_progress_update_param(PROGRESS_REPACK_HEAP_TUPLES_WRITTEN,
n_tuples);
}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index f37b990c81..c84f67059a 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -4051,7 +4051,7 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
Assert(!ReindexIsProcessingIndex(indexOid));
/* Set index rebuild count */
- pgstat_progress_update_param(PROGRESS_CLUSTER_INDEX_REBUILD_COUNT,
+ pgstat_progress_update_param(PROGRESS_REPACK_INDEX_REBUILD_COUNT,
i);
i++;
}
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index a4d2cfdcaf..b8209b2acd 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1262,6 +1262,33 @@ CREATE VIEW pg_stat_progress_cluster AS
FROM pg_stat_get_progress_info('CLUSTER') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
+CREATE VIEW pg_stat_progress_repack AS
+ SELECT
+ S.pid AS pid,
+ S.datid AS datid,
+ D.datname AS datname,
+ S.relid AS relid,
+ CASE S.param1 WHEN 1 THEN 'REPACK'
+ END AS command,
+ CASE S.param2 WHEN 0 THEN 'initializing'
+ WHEN 1 THEN 'seq scanning heap'
+ WHEN 2 THEN 'index scanning heap'
+ WHEN 3 THEN 'sorting tuples'
+ WHEN 4 THEN 'writing new heap'
+ WHEN 5 THEN 'swapping relation files'
+ WHEN 6 THEN 'rebuilding index'
+ WHEN 7 THEN 'performing final cleanup'
+ END AS phase,
+ CAST(S.param3 AS oid) AS repack_index_relid,
+ S.param4 AS heap_tuples_scanned,
+ S.param5 AS heap_tuples_written,
+ S.param6 AS heap_blks_total,
+ S.param7 AS heap_blks_scanned,
+ S.param8 AS index_rebuild_count
+ FROM pg_stat_get_progress_info('REPACK') AS S
+ LEFT JOIN pg_database D ON S.datid = D.oid;
+
CREATE VIEW pg_stat_progress_create_index AS
SELECT
S.pid AS pid, S.datid AS datid, D.datname AS datname,
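As an illustrative usage sketch (not part of the patch): once the view above exists, a second session could monitor a running REPACK like this; the column names follow the view definition in the hunk above.

```sql
-- Watch a running REPACK from another session (illustrative query only).
SELECT pid,
       relid::regclass AS relation,
       phase,
       heap_blks_scanned,
       heap_blks_total,
       index_rebuild_count
FROM pg_stat_progress_repack;
```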
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 99193f5c88..d0f2588a97 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -46,6 +46,7 @@
#include "storage/lmgr.h"
#include "storage/predicate.h"
#include "utils/acl.h"
+#include "utils/formatting.h"
#include "utils/fmgroids.h"
#include "utils/guc.h"
#include "utils/inval.h"
@@ -67,17 +68,33 @@ typedef struct
Oid indexOid;
} RelToCluster;
-
-static void cluster_multiple_rels(List *rtcs, ClusterParams *params);
-static void rebuild_relation(Relation OldHeap, Relation index, bool verbose);
+/*
+ * Map the value of ClusterCommand to string.
+ */
+#define CLUSTER_COMMAND_STR(cmd) ((cmd) == CLUSTER_COMMAND_CLUSTER ? \
+ "cluster" : \
+ ((cmd) == CLUSTER_COMMAND_REPACK ? \
+ "repack" : "vacuum"))
+
+static void cluster_multiple_rels(List *rtcs, ClusterParams *params,
+ ClusterCommand cmd);
+static void rebuild_relation(Relation OldHeap, Relation index, bool verbose,
+ ClusterCommand cmd);
static void copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
- bool verbose, bool *pSwapToastByContent,
+ bool verbose, ClusterCommand cmd,
+ bool *pSwapToastByContent,
TransactionId *pFreezeXid, MultiXactId *pCutoffMulti);
static List *get_tables_to_cluster(MemoryContext cluster_context);
+static List *get_tables_to_repack(MemoryContext repack_context);
static List *get_tables_to_cluster_partitioned(MemoryContext cluster_context,
- Oid indexOid);
-static bool cluster_is_permitted_for_relation(Oid relid, Oid userid);
-
+ Oid relid, bool rel_is_index,
+ ClusterCommand cmd);
+static bool cluster_is_permitted_for_relation(Oid relid, Oid userid,
+ ClusterCommand cmd);
+static Relation process_single_relation(RangeVar *relation, char *indexname,
+ ClusterCommand cmd,
+ ClusterParams *params,
+ Oid *indexOid_p);
/*---------------------------------------------------------------------------
* This cluster code allows for clustering multiple tables at once. Because
@@ -133,72 +150,11 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
if (stmt->relation != NULL)
{
- /* This is the single-relation case. */
- Oid tableOid;
-
- /*
- * Find, lock, and check permissions on the table. We obtain
- * AccessExclusiveLock right away to avoid lock-upgrade hazard in the
- * single-transaction case.
- */
- tableOid = RangeVarGetRelidExtended(stmt->relation,
- AccessExclusiveLock,
- 0,
- RangeVarCallbackMaintainsTable,
- NULL);
- rel = table_open(tableOid, NoLock);
-
- /*
- * Reject clustering a remote temp table ... their local buffer
- * manager is not going to cope.
- */
- if (RELATION_IS_OTHER_TEMP(rel))
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot cluster temporary tables of other sessions")));
-
- if (stmt->indexname == NULL)
- {
- ListCell *index;
-
- /* We need to find the index that has indisclustered set. */
- foreach(index, RelationGetIndexList(rel))
- {
- indexOid = lfirst_oid(index);
- if (get_index_isclustered(indexOid))
- break;
- indexOid = InvalidOid;
- }
-
- if (!OidIsValid(indexOid))
- ereport(ERROR,
- (errcode(ERRCODE_UNDEFINED_OBJECT),
- errmsg("there is no previously clustered index for table \"%s\"",
- stmt->relation->relname)));
- }
- else
- {
- /*
- * The index is expected to be in the same namespace as the
- * relation.
- */
- indexOid = get_relname_relid(stmt->indexname,
- rel->rd_rel->relnamespace);
- if (!OidIsValid(indexOid))
- ereport(ERROR,
- (errcode(ERRCODE_UNDEFINED_OBJECT),
- errmsg("index \"%s\" for table \"%s\" does not exist",
- stmt->indexname, stmt->relation->relname)));
- }
-
- /* For non-partitioned tables, do what we came here to do. */
- if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
- {
- cluster_rel(rel, indexOid, ¶ms);
- /* cluster_rel closes the relation, but keeps lock */
-
+ rel = process_single_relation(stmt->relation, stmt->indexname,
+ CLUSTER_COMMAND_CLUSTER, ¶ms,
+ &indexOid);
+ if (rel == NULL)
return;
- }
}
/*
@@ -230,8 +186,11 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
if (rel != NULL)
{
Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
- check_index_is_clusterable(rel, indexOid, AccessShareLock);
- rtcs = get_tables_to_cluster_partitioned(cluster_context, indexOid);
+ check_index_is_clusterable(rel, indexOid, AccessShareLock,
+ CLUSTER_COMMAND_CLUSTER);
+ rtcs = get_tables_to_cluster_partitioned(cluster_context, indexOid,
+ true,
+ CLUSTER_COMMAND_CLUSTER);
/* close relation, releasing lock on parent table */
table_close(rel, AccessExclusiveLock);
@@ -243,7 +202,7 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
}
/* Do the job. */
- cluster_multiple_rels(rtcs, ¶ms);
+ cluster_multiple_rels(rtcs, ¶ms, CLUSTER_COMMAND_CLUSTER);
/* Start a new transaction for the cleanup work. */
StartTransactionCommand();
@@ -260,7 +219,8 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
* return.
*/
static void
-cluster_multiple_rels(List *rtcs, ClusterParams *params)
+cluster_multiple_rels(List *rtcs, ClusterParams *params,
+ ClusterCommand cmd)
{
ListCell *lc;
@@ -283,7 +243,7 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
rel = table_open(rtc->tableOid, AccessExclusiveLock);
/* Process this table */
- cluster_rel(rel, rtc->indexOid, params);
+ cluster_rel(rel, rtc->indexOid, params, cmd);
/* cluster_rel closes the relation, but keeps lock */
PopActiveSnapshot();
@@ -306,9 +266,13 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
* If indexOid is InvalidOid, the table will be rewritten in physical order
* instead of index order. This is the new implementation of VACUUM FULL,
* and error messages should refer to the operation as VACUUM not CLUSTER.
+ *
+ * 'cmd' indicates which command is being executed. REPACK should be the only
+ * caller of this function in the future.
*/
void
-cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
+cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
+ ClusterCommand cmd)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid save_userid;
@@ -317,19 +281,33 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
bool verbose = ((params->options & CLUOPT_VERBOSE) != 0);
bool recheck = ((params->options & CLUOPT_RECHECK) != 0);
Relation index;
+ const char *cmd_str = CLUSTER_COMMAND_STR(cmd);
Assert(CheckRelationLockedByMe(OldHeap, AccessExclusiveLock, false));
/* Check for user-requested abort. */
CHECK_FOR_INTERRUPTS();
- pgstat_progress_start_command(PROGRESS_COMMAND_CLUSTER, tableOid);
- if (OidIsValid(indexOid))
- pgstat_progress_update_param(PROGRESS_CLUSTER_COMMAND,
+ if (cmd == CLUSTER_COMMAND_REPACK)
+ pgstat_progress_start_command(PROGRESS_COMMAND_REPACK, tableOid);
+ else
+ pgstat_progress_start_command(PROGRESS_COMMAND_CLUSTER, tableOid);
+
+ if (cmd == CLUSTER_COMMAND_REPACK)
+ pgstat_progress_update_param(PROGRESS_REPACK_COMMAND,
+ PROGRESS_REPACK_COMMAND_REPACK);
+ else if (OidIsValid(indexOid))
+ {
+ Assert(cmd == CLUSTER_COMMAND_CLUSTER);
+ pgstat_progress_update_param(PROGRESS_REPACK_COMMAND,
PROGRESS_CLUSTER_COMMAND_CLUSTER);
+ }
else
- pgstat_progress_update_param(PROGRESS_CLUSTER_COMMAND,
+ {
+ Assert(cmd == CLUSTER_COMMAND_VACUUM);
+ pgstat_progress_update_param(PROGRESS_REPACK_COMMAND,
PROGRESS_CLUSTER_COMMAND_VACUUM_FULL);
+ }
/*
* Switch to the table owner's userid, so that any index functions are run
@@ -353,7 +331,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
if (recheck)
{
/* Check that the user still has privileges for the relation */
- if (!cluster_is_permitted_for_relation(tableOid, save_userid))
+ if (!cluster_is_permitted_for_relation(tableOid, save_userid, cmd))
{
relation_close(OldHeap, AccessExclusiveLock);
goto out;
@@ -403,39 +381,38 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
* would work in most respects, but the index would only get marked as
* indisclustered in the current database, leading to unexpected behavior
* if CLUSTER were later invoked in another database.
+ *
+ * REPACK does not set indisclustered. XXX Not sure I understand the
+ * comment above: how can an attribute be set "only in the current
+ * database"?
*/
- if (OidIsValid(indexOid) && OldHeap->rd_rel->relisshared)
+ if (cmd == CLUSTER_COMMAND_CLUSTER && OldHeap->rd_rel->relisshared)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot cluster a shared catalog")));
+ errmsg("cannot %s a shared catalog", cmd_str)));
/*
* Don't process temp tables of other backends ... their local buffer
* manager is not going to cope.
*/
if (RELATION_IS_OTHER_TEMP(OldHeap))
- {
- if (OidIsValid(indexOid))
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot cluster temporary tables of other sessions")));
- else
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot vacuum temporary tables of other sessions")));
- }
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot %s temporary tables of other sessions",
+ cmd_str)));
/*
* Also check for active uses of the relation in the current transaction,
* including open scans and pending AFTER trigger events.
*/
- CheckTableNotInUse(OldHeap, OidIsValid(indexOid) ? "CLUSTER" : "VACUUM");
+ CheckTableNotInUse(OldHeap, asc_toupper(cmd_str, strlen(cmd_str)));
/* Check heap and index are valid to cluster on */
if (OidIsValid(indexOid))
{
/* verify the index is good and lock it */
- check_index_is_clusterable(OldHeap, indexOid, AccessExclusiveLock);
+ check_index_is_clusterable(OldHeap, indexOid, AccessExclusiveLock,
+ cmd);
/* also open it */
index = index_open(indexOid, NoLock);
}
@@ -469,7 +446,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
TransferPredicateLocksToHeapRelation(OldHeap);
/* rebuild_relation does all the dirty work */
- rebuild_relation(OldHeap, index, verbose);
+ rebuild_relation(OldHeap, index, verbose, cmd);
/* rebuild_relation closes OldHeap, and index if valid */
out:
@@ -491,9 +468,11 @@ out:
* protection here.
*/
void
-check_index_is_clusterable(Relation OldHeap, Oid indexOid, LOCKMODE lockmode)
+check_index_is_clusterable(Relation OldHeap, Oid indexOid, LOCKMODE lockmode,
+ ClusterCommand cmd)
{
Relation OldIndex;
+ const char *cmd_str = CLUSTER_COMMAND_STR(cmd);
OldIndex = index_open(indexOid, lockmode);
@@ -512,8 +491,8 @@ check_index_is_clusterable(Relation OldHeap, Oid indexOid, LOCKMODE lockmode)
if (!OldIndex->rd_indam->amclusterable)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot cluster on index \"%s\" because access method does not support clustering",
- RelationGetRelationName(OldIndex))));
+ errmsg("cannot %s on index \"%s\" because access method does not support clustering",
+ cmd_str, RelationGetRelationName(OldIndex))));
/*
* Disallow clustering on incomplete indexes (those that might not index
@@ -524,7 +503,8 @@ check_index_is_clusterable(Relation OldHeap, Oid indexOid, LOCKMODE lockmode)
if (!heap_attisnull(OldIndex->rd_indextuple, Anum_pg_index_indpred, NULL))
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot cluster on partial index \"%s\"",
+ errmsg("cannot %s on partial index \"%s\"",
+ cmd_str,
RelationGetRelationName(OldIndex))));
/*
@@ -538,8 +518,8 @@ check_index_is_clusterable(Relation OldHeap, Oid indexOid, LOCKMODE lockmode)
if (!OldIndex->rd_index->indisvalid)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot cluster on invalid index \"%s\"",
- RelationGetRelationName(OldIndex))));
+ errmsg("cannot %s on invalid index \"%s\"",
+ cmd_str, RelationGetRelationName(OldIndex))));
/* Drop relcache refcnt on OldIndex, but keep lock */
index_close(OldIndex, NoLock);
@@ -626,7 +606,8 @@ mark_index_clustered(Relation rel, Oid indexOid, bool is_internal)
* On exit, they are closed, but locks on them are not released.
*/
static void
-rebuild_relation(Relation OldHeap, Relation index, bool verbose)
+rebuild_relation(Relation OldHeap, Relation index, bool verbose,
+ ClusterCommand cmd)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid accessMethod = OldHeap->rd_rel->relam;
@@ -664,7 +645,7 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose)
NewHeap = table_open(OIDNewHeap, NoLock);
/* Copy the heap data into the new table in the desired order */
- copy_table_data(NewHeap, OldHeap, index, verbose,
+ copy_table_data(NewHeap, OldHeap, index, verbose, cmd,
&swap_toast_by_content, &frozenXid, &cutoffMulti);
@@ -829,8 +810,8 @@ make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
*/
static void
copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verbose,
- bool *pSwapToastByContent, TransactionId *pFreezeXid,
- MultiXactId *pCutoffMulti)
+ ClusterCommand cmd, bool *pSwapToastByContent,
+ TransactionId *pFreezeXid, MultiXactId *pCutoffMulti)
{
Relation relRelation;
HeapTuple reltup;
@@ -845,6 +826,7 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
tups_recently_dead = 0;
BlockNumber num_pages;
int elevel = verbose ? INFO : DEBUG2;
+ const char *cmd_str = CLUSTER_COMMAND_STR(cmd);
PGRUsage ru0;
char *nspname;
@@ -958,18 +940,21 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
/* Log what we're doing */
if (OldIndex != NULL && !use_sort)
ereport(elevel,
- (errmsg("clustering \"%s.%s\" using index scan on \"%s\"",
+ (errmsg("%sing \"%s.%s\" using index scan on \"%s\"",
+ cmd_str,
nspname,
RelationGetRelationName(OldHeap),
RelationGetRelationName(OldIndex))));
else if (use_sort)
ereport(elevel,
- (errmsg("clustering \"%s.%s\" using sequential scan and sort",
+ (errmsg("%sing \"%s.%s\" using sequential scan and sort",
+ cmd_str,
nspname,
RelationGetRelationName(OldHeap))));
else
ereport(elevel,
- (errmsg("vacuuming \"%s.%s\"",
+ (errmsg("%sing \"%s.%s\"",
+ cmd_str,
nspname,
RelationGetRelationName(OldHeap))));
@@ -1453,8 +1438,8 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
int i;
/* Report that we are now swapping relation files */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_SWAP_REL_FILES);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_SWAP_REL_FILES);
/* Zero out possible results from swapped_relation_files */
memset(mapped_tables, 0, sizeof(mapped_tables));
@@ -1504,14 +1489,14 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
/* Report that we are now reindexing relations */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_REBUILD_INDEX);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_REBUILD_INDEX);
reindex_relation(NULL, OIDOldHeap, reindex_flags, &reindex_params);
/* Report that we are now doing clean up */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_FINAL_CLEANUP);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_FINAL_CLEANUP);
/*
* If the relation being rebuilt is pg_class, swap_relation_files()
@@ -1661,7 +1646,8 @@ get_tables_to_cluster(MemoryContext cluster_context)
index = (Form_pg_index) GETSTRUCT(indexTuple);
- if (!cluster_is_permitted_for_relation(index->indrelid, GetUserId()))
+ if (!cluster_is_permitted_for_relation(index->indrelid, GetUserId(),
+ CLUSTER_COMMAND_CLUSTER))
continue;
/* Use a permanent memory context for the result list */
@@ -1682,14 +1668,67 @@ get_tables_to_cluster(MemoryContext cluster_context)
}
/*
- * Given an index on a partitioned table, return a list of RelToCluster for
+ * Like get_tables_to_cluster(), but do not care about indexes.
+ */
+static List *
+get_tables_to_repack(MemoryContext repack_context)
+{
+ Relation relrelation;
+ TableScanDesc scan;
+ HeapTuple tuple;
+ MemoryContext old_context;
+ List *rtcs = NIL;
+
+ /*
+ * Scan pg_class for plain relations that the current user has the
+ * appropriate privileges for.
+ */
+ relrelation = table_open(RelationRelationId, AccessShareLock);
+ scan = table_beginscan_catalog(relrelation, 0, NULL);
+ while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+ {
+ RelToCluster *rtc;
+ Form_pg_class classform = (Form_pg_class) GETSTRUCT(tuple);
+ Oid relid = classform->oid;
+
+ /* Only interested in relations. */
+ if (get_rel_relkind(relid) != RELKIND_RELATION)
+ continue;
+
+ if (!cluster_is_permitted_for_relation(relid, GetUserId(),
+ CLUSTER_COMMAND_REPACK))
+ continue;
+
+ /* Use a permanent memory context for the result list */
+ old_context = MemoryContextSwitchTo(repack_context);
+
+ rtc = (RelToCluster *) palloc(sizeof(RelToCluster));
+ rtc->tableOid = relid;
+ rtc->indexOid = InvalidOid;
+ rtcs = lappend(rtcs, rtc);
+
+ MemoryContextSwitchTo(old_context);
+ }
+ table_endscan(scan);
+
+ relation_close(relrelation, AccessShareLock);
+
+ return rtcs;
+}
+
+/*
+ * Given a partitioned table or its index, return a list of RelToCluster for
* all the leaf tables/indexes.
*
* Like expand_vacuum_rel, but here caller must hold AccessExclusiveLock
* on the table containing the index.
+ *
+ * 'rel_is_index' tells whether 'relid' is that of an index (true) or of the
+ * owning relation.
*/
static List *
-get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
+get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid relid,
+ bool rel_is_index, ClusterCommand cmd)
{
List *inhoids;
ListCell *lc;
@@ -1697,17 +1736,33 @@ get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
MemoryContext old_context;
/* Do not lock the children until they're processed */
- inhoids = find_all_inheritors(indexOid, NoLock, NULL);
+ inhoids = find_all_inheritors(relid, NoLock, NULL);
foreach(lc, inhoids)
{
- Oid indexrelid = lfirst_oid(lc);
- Oid relid = IndexGetRelation(indexrelid, false);
+ Oid inhoid = lfirst_oid(lc);
+ Oid inhrelid,
+ inhindid;
RelToCluster *rtc;
- /* consider only leaf indexes */
- if (get_rel_relkind(indexrelid) != RELKIND_INDEX)
- continue;
+ if (rel_is_index)
+ {
+ /* consider only leaf indexes */
+ if (get_rel_relkind(inhoid) != RELKIND_INDEX)
+ continue;
+
+ inhrelid = IndexGetRelation(inhoid, false);
+ inhindid = inhoid;
+ }
+ else
+ {
+ /* consider only leaf relations */
+ if (get_rel_relkind(inhoid) != RELKIND_RELATION)
+ continue;
+
+ inhrelid = inhoid;
+ inhindid = InvalidOid;
+ }
/*
* It's possible that the user does not have privileges to CLUSTER the
@@ -1715,15 +1770,15 @@ get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
* table. We skip any partitions which the user is not permitted to
* CLUSTER.
*/
- if (!cluster_is_permitted_for_relation(relid, GetUserId()))
+ if (!cluster_is_permitted_for_relation(inhrelid, GetUserId(), cmd))
continue;
/* Use a permanent memory context for the result list */
old_context = MemoryContextSwitchTo(cluster_context);
rtc = (RelToCluster *) palloc(sizeof(RelToCluster));
- rtc->tableOid = relid;
- rtc->indexOid = indexrelid;
+ rtc->tableOid = inhrelid;
+ rtc->indexOid = inhindid;
rtcs = lappend(rtcs, rtc);
MemoryContextSwitchTo(old_context);
@@ -1737,13 +1792,192 @@ get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
* function emits a WARNING.
*/
static bool
-cluster_is_permitted_for_relation(Oid relid, Oid userid)
+cluster_is_permitted_for_relation(Oid relid, Oid userid, ClusterCommand cmd)
{
if (pg_class_aclcheck(relid, userid, ACL_MAINTAIN) == ACLCHECK_OK)
return true;
ereport(WARNING,
- (errmsg("permission denied to cluster \"%s\", skipping it",
+ (errmsg("permission denied to %s \"%s\", skipping it",
+ CLUSTER_COMMAND_STR(cmd),
get_rel_name(relid))));
return false;
}
+
+/*
+ * REPACK is intended to be a replacement for both CLUSTER and VACUUM FULL.
+ */
+void
+repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
+{
+ ListCell *lc;
+ ClusterParams params = {0};
+ bool verbose = false;
+ Relation rel = NULL;
+ Oid indexOid = InvalidOid;
+ MemoryContext repack_context;
+ List *rtcs;
+
+ /* Parse option list */
+ foreach(lc, stmt->params)
+ {
+ DefElem *opt = (DefElem *) lfirst(lc);
+
+ if (strcmp(opt->defname, "verbose") == 0)
+ verbose = defGetBoolean(opt);
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("unrecognized REPACK option \"%s\"",
+ opt->defname),
+ parser_errposition(pstate, opt->location)));
+ }
+
+ params.options = (verbose ? CLUOPT_VERBOSE : 0);
+
+ if (stmt->relation != NULL)
+ {
+ rel = process_single_relation(stmt->relation, stmt->indexname,
+ CLUSTER_COMMAND_REPACK, ¶ms,
+ &indexOid);
+ if (rel == NULL)
+ return;
+ }
+
+ /*
+ * By here, we know we are in a multi-table situation. In order to avoid
+ * holding locks for too long, we want to process each table in its own
+ * transaction. This forces us to disallow running inside a user
+ * transaction block.
+ */
+ PreventInTransactionBlock(isTopLevel, "REPACK");
+
+ /* Also, we need a memory context to hold our list of relations */
+ repack_context = AllocSetContextCreate(PortalContext,
+ "Repack",
+ ALLOCSET_DEFAULT_SIZES);
+
+ params.options |= CLUOPT_RECHECK;
+ if (rel != NULL)
+ {
+ Oid relid;
+ bool rel_is_index;
+
+ Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
+
+ if (OidIsValid(indexOid))
+ {
+ relid = indexOid;
+ rel_is_index = true;
+ }
+ else
+ {
+ relid = RelationGetRelid(rel);
+ rel_is_index = false;
+ }
+ rtcs = get_tables_to_cluster_partitioned(repack_context, relid,
+ rel_is_index,
+ CLUSTER_COMMAND_REPACK);
+
+ /* close relation, releasing lock on parent table */
+ table_close(rel, AccessExclusiveLock);
+ }
+ else
+ rtcs = get_tables_to_repack(repack_context);
+
+ /* Do the job. */
+ cluster_multiple_rels(rtcs, ¶ms, CLUSTER_COMMAND_REPACK);
+
+ /* Start a new transaction for the cleanup work. */
+ StartTransactionCommand();
+
+ /* Clean up working storage */
+ MemoryContextDelete(repack_context);
+}
+
+/*
+ * Process a single relation for CLUSTER or REPACK.
+ *
+ * Return NULL if the relation was processed here, or a reference to the
+ * relation if the caller still needs to process it (because the relation
+ * is partitioned).
+ */
+static Relation
+process_single_relation(RangeVar *relation, char *indexname,
+ ClusterCommand cmd, ClusterParams *params,
+ Oid *indexOid_p)
+{
+ Relation rel;
+ Oid indexOid = InvalidOid;
+ Oid tableOid;
+
+ /*
+ * Find, lock, and check permissions on the table. We obtain
+ * AccessExclusiveLock right away to avoid lock-upgrade hazard in the
+ * single-transaction case.
+ */
+ tableOid = RangeVarGetRelidExtended(relation,
+ AccessExclusiveLock,
+ 0,
+ RangeVarCallbackMaintainsTable,
+ NULL);
+ rel = table_open(tableOid, NoLock);
+
+ /*
+ * Reject processing a temp table of another session ... their local
+ * buffer manager is not going to cope.
+ */
+ if (RELATION_IS_OTHER_TEMP(rel))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot %s temporary tables of other sessions",
+ CLUSTER_COMMAND_STR(cmd))));
+
+ if (indexname == NULL && cmd == CLUSTER_COMMAND_CLUSTER)
+ {
+ ListCell *index;
+
+ /* We need to find the index that has indisclustered set. */
+ foreach(index, RelationGetIndexList(rel))
+ {
+ indexOid = lfirst_oid(index);
+ if (get_index_isclustered(indexOid))
+ break;
+ indexOid = InvalidOid;
+ }
+
+ if (!OidIsValid(indexOid))
+ ereport(ERROR,
+ (errcode(ERRCODE_UNDEFINED_OBJECT),
+ errmsg("there is no previously clustered index for table \"%s\"",
+ relation->relname)));
+ }
+ else if (indexname != NULL)
+ {
+ /*
+ * The index is expected to be in the same namespace as the relation.
+ */
+ indexOid = get_relname_relid(indexname,
+ rel->rd_rel->relnamespace);
+ if (!OidIsValid(indexOid))
+ ereport(ERROR,
+ (errcode(ERRCODE_UNDEFINED_OBJECT),
+ errmsg("index \"%s\" for table \"%s\" does not exist",
+ indexname, relation->relname)));
+ }
+
+ *indexOid_p = indexOid;
+
+ /* For non-partitioned tables, do what we came here to do. */
+ if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+ {
+ cluster_rel(rel, indexOid, params, cmd);
+ /* cluster_rel closes the relation, but keeps lock */
+
+ return NULL;
+ }
+
+ return rel;
+}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index ce7d115667..901cb321c3 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -15510,7 +15510,8 @@ ATExecClusterOn(Relation rel, const char *indexName, LOCKMODE lockmode)
indexName, RelationGetRelationName(rel))));
/* Check index is valid to cluster on */
- check_index_is_clusterable(rel, indexOid, lockmode);
+ check_index_is_clusterable(rel, indexOid, lockmode,
+ CLUSTER_COMMAND_CLUSTER);
/* And do the work */
mark_index_clustered(rel, indexOid, false);
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 0239d9bae6..59dddcd31f 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -2248,7 +2248,8 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
cluster_params.options |= CLUOPT_VERBOSE;
/* VACUUM FULL is now a variant of CLUSTER; see cluster.c */
- cluster_rel(rel, InvalidOid, &cluster_params);
+ cluster_rel(rel, InvalidOid, &cluster_params,
+ CLUSTER_COMMAND_VACUUM);
/* cluster_rel closes the relation, but keeps lock */
rel = NULL;
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 7d99c9355c..8b4c226495 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -298,7 +298,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
GrantStmt GrantRoleStmt ImportForeignSchemaStmt IndexStmt InsertStmt
ListenStmt LoadStmt LockStmt MergeStmt NotifyStmt ExplainableStmt PreparableStmt
CreateFunctionStmt AlterFunctionStmt ReindexStmt RemoveAggrStmt
- RemoveFuncStmt RemoveOperStmt RenameStmt ReturnStmt RevokeStmt RevokeRoleStmt
+ RemoveFuncStmt RemoveOperStmt RenameStmt RepackStmt ReturnStmt RevokeStmt RevokeRoleStmt
RuleActionStmt RuleActionStmtOrEmpty RuleStmt
SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
UnlistenStmt UpdateStmt VacuumStmt
@@ -381,7 +381,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
%type <str> copy_file_name
access_method_clause attr_name
table_access_method_clause name cursor_name file_name
- cluster_index_specification
+ cluster_index_specification repack_index_specification
%type <list> func_name handler_name qual_Op qual_all_Op subquery_Op
opt_inline_handler opt_validator validator_clause
@@ -764,7 +764,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
QUOTE QUOTES
RANGE READ REAL REASSIGN RECURSIVE REF_P REFERENCES REFERENCING
- REFRESH REINDEX RELATIVE_P RELEASE RENAME REPEATABLE REPLACE REPLICA
+ REFRESH REINDEX RELATIVE_P RELEASE RENAME REPACK REPEATABLE REPLACE REPLICA
RESET RESTART RESTRICT RETURN RETURNING RETURNS REVOKE RIGHT ROLE ROLLBACK ROLLUP
ROUTINE ROUTINES ROW ROWS RULE
@@ -1100,6 +1100,7 @@ stmt:
| RemoveFuncStmt
| RemoveOperStmt
| RenameStmt
+ | RepackStmt
| RevokeStmt
| RevokeRoleStmt
| RuleStmt
@@ -11869,6 +11870,60 @@ cluster_index_specification:
| /*EMPTY*/ { $$ = NULL; }
;
+/*****************************************************************************
+ *
+ * QUERY:
+ * REPACK [ (options) ] [ <qualified_name> [ USING INDEX <index_name> ] ]
+ *
+ *****************************************************************************/
+
+RepackStmt:
+ REPACK qualified_name repack_index_specification
+ {
+ RepackStmt *n = makeNode(RepackStmt);
+
+ n->relation = $2;
+ n->indexname = $3;
+ n->params = NIL;
+ $$ = (Node *) n;
+ }
+
+ | REPACK '(' utility_option_list ')' qualified_name repack_index_specification
+ {
+ RepackStmt *n = makeNode(RepackStmt);
+
+ n->relation = $5;
+ n->indexname = $6;
+ n->params = $3;
+ $$ = (Node *) n;
+ }
+
+ | REPACK
+ {
+ RepackStmt *n = makeNode(RepackStmt);
+
+ n->relation = NULL;
+ n->indexname = NULL;
+ n->params = NIL;
+ $$ = (Node *) n;
+ }
+
+ | REPACK '(' utility_option_list ')'
+ {
+ RepackStmt *n = makeNode(RepackStmt);
+
+ n->relation = NULL;
+ n->indexname = NULL;
+ n->params = $3;
+ $$ = (Node *) n;
+ }
+ ;
+
+repack_index_specification:
+ USING INDEX name { $$ = $3; }
+ | /*EMPTY*/ { $$ = NULL; }
+ ;
+
/*****************************************************************************
*
@@ -17909,6 +17964,7 @@ unreserved_keyword:
| RELATIVE_P
| RELEASE
| RENAME
+ | REPACK
| REPEATABLE
| REPLACE
| REPLICA
@@ -18540,6 +18596,7 @@ bare_label_keyword:
| RELATIVE_P
| RELEASE
| RENAME
+ | REPACK
| REPEATABLE
| REPLACE
| REPLICA
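For reference, the four grammar productions added above accept these statement shapes (hypothetical table `t` and index `t_idx`):

```sql
REPACK;                               -- every table the user may repack
REPACK (VERBOSE);                     -- same, with per-table messages
REPACK t;                             -- one table, rewritten in physical order
REPACK (VERBOSE) t USING INDEX t_idx; -- one table, rewritten in index order
```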
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 25fe3d5801..bf3ba3c2ae 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -280,6 +280,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_ClusterStmt:
case T_ReindexStmt:
case T_VacuumStmt:
+ case T_RepackStmt:
{
/*
* These commands write WAL, so they're not strictly
@@ -862,6 +863,10 @@ standard_ProcessUtility(PlannedStmt *pstmt,
ExecVacuum(pstate, (VacuumStmt *) parsetree, isTopLevel);
break;
+ case T_RepackStmt:
+ repack(pstate, (RepackStmt *) parsetree, isTopLevel);
+ break;
+
case T_ExplainStmt:
ExplainQuery(pstate, (ExplainStmt *) parsetree, params, dest);
break;
@@ -2869,6 +2874,10 @@ CreateCommandTag(Node *parsetree)
tag = CMDTAG_ANALYZE;
break;
+ case T_RepackStmt:
+ tag = CMDTAG_REPACK;
+ break;
+
case T_ExplainStmt:
tag = CMDTAG_EXPLAIN;
break;
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 0ea41299e0..02ac18fca6 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -268,6 +268,8 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
cmdtype = PROGRESS_COMMAND_ANALYZE;
else if (pg_strcasecmp(cmd, "CLUSTER") == 0)
cmdtype = PROGRESS_COMMAND_CLUSTER;
+ else if (pg_strcasecmp(cmd, "REPACK") == 0)
+ cmdtype = PROGRESS_COMMAND_REPACK;
else if (pg_strcasecmp(cmd, "CREATE INDEX") == 0)
cmdtype = PROGRESS_COMMAND_CREATE_INDEX;
else if (pg_strcasecmp(cmd, "BASEBACKUP") == 0)
diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index 8432be641a..72338fffb2 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -1223,7 +1223,7 @@ static const char *const sql_commands[] = {
"DELETE FROM", "DISCARD", "DO", "DROP", "END", "EXECUTE", "EXPLAIN",
"FETCH", "GRANT", "IMPORT FOREIGN SCHEMA", "INSERT INTO", "LISTEN", "LOAD", "LOCK",
"MERGE INTO", "MOVE", "NOTIFY", "PREPARE",
- "REASSIGN", "REFRESH MATERIALIZED VIEW", "REINDEX", "RELEASE",
+ "REASSIGN", "REFRESH MATERIALIZED VIEW", "REINDEX", "RELEASE", "REPACK",
"RESET", "REVOKE", "ROLLBACK",
"SAVEPOINT", "SECURITY LABEL", "SELECT", "SET", "SHOW", "START",
"TABLE", "TRUNCATE", "UNLISTEN", "UPDATE", "VACUUM", "VALUES", "WITH",
@@ -4909,6 +4909,35 @@ match_previous_words(int pattern_id,
COMPLETE_WITH_QUERY(Query_for_list_of_tablespaces);
}
+/* REPACK */
+ else if (Matches("REPACK"))
+ COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_clusterables);
+ else if (Matches("REPACK", "(*)"))
+ COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_clusterables);
+ /* If we have REPACK <sth>, then add "USING INDEX" */
+ else if (Matches("REPACK", MatchAnyExcept("(")))
+ COMPLETE_WITH("USING INDEX");
+ /* If we have REPACK (*) <sth>, then add "USING INDEX" */
+ else if (Matches("REPACK", "(*)", MatchAny))
+ COMPLETE_WITH("USING INDEX");
+ /* If we have REPACK <sth> USING, then add the index as well */
+ else if (Matches("REPACK", MatchAny, "USING", "INDEX"))
+ {
+ set_completion_reference(prev3_wd);
+ COMPLETE_WITH_SCHEMA_QUERY(Query_for_index_of_table);
+ }
+ else if (HeadMatches("REPACK", "(*") &&
+ !HeadMatches("REPACK", "(*)"))
+ {
+ /*
+ * This fires if we're in an unfinished parenthesized option list.
+ * get_previous_words treats a completed parenthesized option list as
+ * one word, so the above test is correct.
+ */
+ if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
+ COMPLETE_WITH("VERBOSE");
+ }
+
/* SECURITY LABEL */
else if (Matches("SECURITY"))
COMPLETE_WITH("LABEL");
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 60088a64cb..c2976905e4 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -31,10 +31,27 @@ typedef struct ClusterParams
bits32 options; /* bitmask of CLUOPT_* */
} ClusterParams;
+/*
+ * cluster.c currently implements three nearly identical commands: CLUSTER,
+ * VACUUM FULL and REPACK. Where needed, use this enumeration to distinguish
+ * which of these commands is being executed.
+ *
+ * Remove this stuff when removing the (now deprecated) CLUSTER and VACUUM
+ * FULL commands.
+ */
+typedef enum ClusterCommand
+{
+ CLUSTER_COMMAND_CLUSTER,
+ CLUSTER_COMMAND_REPACK,
+ CLUSTER_COMMAND_VACUUM
+} ClusterCommand;
+
extern void cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel);
-extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params);
+extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
+ ClusterCommand cmd);
extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid,
- LOCKMODE lockmode);
+ LOCKMODE lockmode,
+ ClusterCommand cmd);
extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
extern Oid make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
@@ -48,4 +65,5 @@ extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
MultiXactId cutoffMulti,
char newrelpersistence);
+extern void repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel);
#endif /* CLUSTER_H */
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 7c736e7b03..7644267e14 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -56,24 +56,48 @@
#define PROGRESS_ANALYZE_PHASE_COMPUTE_EXT_STATS 4
#define PROGRESS_ANALYZE_PHASE_FINALIZE_ANALYZE 5
-/* Progress parameters for cluster */
-#define PROGRESS_CLUSTER_COMMAND 0
-#define PROGRESS_CLUSTER_PHASE 1
-#define PROGRESS_CLUSTER_INDEX_RELID 2
-#define PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED 3
-#define PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN 4
-#define PROGRESS_CLUSTER_TOTAL_HEAP_BLKS 5
-#define PROGRESS_CLUSTER_HEAP_BLKS_SCANNED 6
-#define PROGRESS_CLUSTER_INDEX_REBUILD_COUNT 7
-
-/* Phases of cluster (as advertised via PROGRESS_CLUSTER_PHASE) */
-#define PROGRESS_CLUSTER_PHASE_SEQ_SCAN_HEAP 1
-#define PROGRESS_CLUSTER_PHASE_INDEX_SCAN_HEAP 2
-#define PROGRESS_CLUSTER_PHASE_SORT_TUPLES 3
-#define PROGRESS_CLUSTER_PHASE_WRITE_NEW_HEAP 4
-#define PROGRESS_CLUSTER_PHASE_SWAP_REL_FILES 5
-#define PROGRESS_CLUSTER_PHASE_REBUILD_INDEX 6
-#define PROGRESS_CLUSTER_PHASE_FINAL_CLEANUP 7
+/*
+ * Progress parameters for REPACK.
+ *
+ * Note: Since REPACK shares some code with CLUSTER, some of these values
+ * are also used by CLUSTER. (CLUSTER is now deprecated, so it makes no
+ * sense to introduce a separate set of constants.)
+ */
+#define PROGRESS_REPACK_COMMAND 0
+#define PROGRESS_REPACK_PHASE 1
+#define PROGRESS_REPACK_INDEX_RELID 2
+#define PROGRESS_REPACK_HEAP_TUPLES_SCANNED 3
+#define PROGRESS_REPACK_HEAP_TUPLES_WRITTEN 4
+#define PROGRESS_REPACK_TOTAL_HEAP_BLKS 5
+#define PROGRESS_REPACK_HEAP_BLKS_SCANNED 6
+#define PROGRESS_REPACK_INDEX_REBUILD_COUNT 7
+
+/*
+ * Phases of repack (as advertised via PROGRESS_REPACK_PHASE).
+ *
+ * Note: Since REPACK shares some code with CLUSTER, some of these values
+ * are also used by CLUSTER. (CLUSTER is now deprecated, so it makes no
+ * sense to introduce a separate set of constants.)
+ */
+#define PROGRESS_REPACK_PHASE_SEQ_SCAN_HEAP 1
+#define PROGRESS_REPACK_PHASE_INDEX_SCAN_HEAP 2
+#define PROGRESS_REPACK_PHASE_SORT_TUPLES 3
+#define PROGRESS_REPACK_PHASE_WRITE_NEW_HEAP 4
+#define PROGRESS_REPACK_PHASE_SWAP_REL_FILES 5
+#define PROGRESS_REPACK_PHASE_REBUILD_INDEX 6
+#define PROGRESS_REPACK_PHASE_FINAL_CLEANUP 7
+
+/* Commands of PROGRESS_REPACK */
+#define PROGRESS_REPACK_COMMAND_REPACK 1
+
+/*
+ * Progress parameters for cluster.
+ *
+ * Although we need to report REPACK and CLUSTER in separate views, the
+ * parameters and phases of CLUSTER are a subset of those of REPACK. Therefore
+ * we just use the appropriate values defined for REPACK above instead of
+ * defining a separate set of constants here.
+ */
/* Commands of PROGRESS_CLUSTER */
#define PROGRESS_CLUSTER_COMMAND_CLUSTER 1
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 0b208f51bd..03ed0450df 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3914,6 +3914,19 @@ typedef struct ClusterStmt
List *params; /* list of DefElem nodes */
} ClusterStmt;
+/* ----------------------
+ * Repack Statement
+ * ----------------------
+ */
+typedef struct RepackStmt
+{
+ NodeTag type;
+ RangeVar *relation; /* relation being repacked */
+ char *indexname; /* order tuples by this index */
+ List *params; /* list of DefElem nodes */
+} RepackStmt;
+
+
/* ----------------------
* Vacuum and Analyze Statements
*
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 40cf090ce6..0932d6fce5 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -373,6 +373,7 @@ PG_KEYWORD("reindex", REINDEX, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("relative", RELATIVE_P, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("release", RELEASE, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("rename", RENAME, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("repack", REPACK, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("repeatable", REPEATABLE, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("replace", REPLACE, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("replica", REPLICA, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d5..cceb312f2b 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -196,6 +196,7 @@ PG_CMDTAG(CMDTAG_REASSIGN_OWNED, "REASSIGN OWNED", false, false, false)
PG_CMDTAG(CMDTAG_REFRESH_MATERIALIZED_VIEW, "REFRESH MATERIALIZED VIEW", true, false, false)
PG_CMDTAG(CMDTAG_REINDEX, "REINDEX", true, false, false)
PG_CMDTAG(CMDTAG_RELEASE, "RELEASE", false, false, false)
+PG_CMDTAG(CMDTAG_REPACK, "REPACK", false, false, false)
PG_CMDTAG(CMDTAG_RESET, "RESET", false, false, false)
PG_CMDTAG(CMDTAG_REVOKE, "REVOKE", true, false, false)
PG_CMDTAG(CMDTAG_REVOKE_ROLE, "REVOKE ROLE", false, false, false)
diff --git a/src/include/utils/backend_progress.h b/src/include/utils/backend_progress.h
index dda813ab40..da3d14bb97 100644
--- a/src/include/utils/backend_progress.h
+++ b/src/include/utils/backend_progress.h
@@ -25,6 +25,7 @@ typedef enum ProgressCommandType
PROGRESS_COMMAND_VACUUM,
PROGRESS_COMMAND_ANALYZE,
PROGRESS_COMMAND_CLUSTER,
+ PROGRESS_COMMAND_REPACK,
PROGRESS_COMMAND_CREATE_INDEX,
PROGRESS_COMMAND_BASEBACKUP,
PROGRESS_COMMAND_COPY,
diff --git a/src/test/regress/expected/cluster.out b/src/test/regress/expected/cluster.out
index 4d40a6809a..ed7df29b8e 100644
--- a/src/test/regress/expected/cluster.out
+++ b/src/test/regress/expected/cluster.out
@@ -254,6 +254,120 @@ ORDER BY 1;
clstr_tst_pkey
(3 rows)
+-- REPACK handles individual tables identically to CLUSTER, but it's worth
+-- checking whether it handles table hierarchies identically as well.
+REPACK clstr_tst USING INDEX clstr_tst_c;
+-- Verify that inheritance link still works
+INSERT INTO clstr_tst_inh VALUES (0, 100, 'in child table 2');
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst;
+ a | b | c | substring | length
+----+-----+------------------+--------------------------------+--------
+ 10 | 14 | catorce | |
+ 18 | 5 | cinco | |
+ 9 | 4 | cuatro | |
+ 26 | 19 | diecinueve | |
+ 12 | 18 | dieciocho | |
+ 30 | 16 | dieciseis | |
+ 24 | 17 | diecisiete | |
+ 2 | 10 | diez | |
+ 23 | 12 | doce | |
+ 11 | 2 | dos | |
+ 25 | 9 | nueve | |
+ 31 | 8 | ocho | |
+ 1 | 11 | once | |
+ 28 | 15 | quince | |
+ 32 | 6 | seis | xyzzyxyzzyxyzzyxyzzyxyzzyxyzzy | 500000
+ 29 | 7 | siete | |
+ 15 | 13 | trece | |
+ 22 | 30 | treinta | |
+ 17 | 32 | treinta y dos | |
+ 3 | 31 | treinta y uno | |
+ 5 | 3 | tres | |
+ 20 | 1 | uno | |
+ 6 | 20 | veinte | |
+ 14 | 25 | veinticinco | |
+ 21 | 24 | veinticuatro | |
+ 4 | 22 | veintidos | |
+ 19 | 29 | veintinueve | |
+ 16 | 28 | veintiocho | |
+ 27 | 26 | veintiseis | |
+ 13 | 27 | veintisiete | |
+ 7 | 23 | veintitres | |
+ 8 | 21 | veintiuno | |
+ 0 | 100 | in child table | |
+ 0 | 100 | in child table 2 | |
+(34 rows)
+
+-- Verify that foreign key link still works
+INSERT INTO clstr_tst (b, c) VALUES (1111, 'this should fail');
+ERROR: insert or update on table "clstr_tst" violates foreign key constraint "clstr_tst_con"
+DETAIL: Key (b)=(1111) is not present in table "clstr_tst_s".
+SELECT conname FROM pg_constraint WHERE conrelid = 'clstr_tst'::regclass
+ORDER BY 1;
+ conname
+----------------------
+ clstr_tst_a_not_null
+ clstr_tst_con
+ clstr_tst_pkey
+(3 rows)
+
+-- Yet another code path: REPACK w/o index.
+REPACK clstr_tst;
+-- Verify that inheritance link still works
+INSERT INTO clstr_tst_inh VALUES (0, 100, 'in child table 3');
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst;
+ a | b | c | substring | length
+----+-----+------------------+--------------------------------+--------
+ 10 | 14 | catorce | |
+ 18 | 5 | cinco | |
+ 9 | 4 | cuatro | |
+ 26 | 19 | diecinueve | |
+ 12 | 18 | dieciocho | |
+ 30 | 16 | dieciseis | |
+ 24 | 17 | diecisiete | |
+ 2 | 10 | diez | |
+ 23 | 12 | doce | |
+ 11 | 2 | dos | |
+ 25 | 9 | nueve | |
+ 31 | 8 | ocho | |
+ 1 | 11 | once | |
+ 28 | 15 | quince | |
+ 32 | 6 | seis | xyzzyxyzzyxyzzyxyzzyxyzzyxyzzy | 500000
+ 29 | 7 | siete | |
+ 15 | 13 | trece | |
+ 22 | 30 | treinta | |
+ 17 | 32 | treinta y dos | |
+ 3 | 31 | treinta y uno | |
+ 5 | 3 | tres | |
+ 20 | 1 | uno | |
+ 6 | 20 | veinte | |
+ 14 | 25 | veinticinco | |
+ 21 | 24 | veinticuatro | |
+ 4 | 22 | veintidos | |
+ 19 | 29 | veintinueve | |
+ 16 | 28 | veintiocho | |
+ 27 | 26 | veintiseis | |
+ 13 | 27 | veintisiete | |
+ 7 | 23 | veintitres | |
+ 8 | 21 | veintiuno | |
+ 0 | 100 | in child table | |
+ 0 | 100 | in child table 2 | |
+ 0 | 100 | in child table 3 | |
+(35 rows)
+
+-- Verify that foreign key link still works
+INSERT INTO clstr_tst (b, c) VALUES (1111, 'this should fail');
+ERROR: insert or update on table "clstr_tst" violates foreign key constraint "clstr_tst_con"
+DETAIL: Key (b)=(1111) is not present in table "clstr_tst_s".
+SELECT conname FROM pg_constraint WHERE conrelid = 'clstr_tst'::regclass
+ORDER BY 1;
+ conname
+----------------------
+ clstr_tst_a_not_null
+ clstr_tst_con
+ clstr_tst_pkey
+(3 rows)
+
SELECT relname, relkind,
EXISTS(SELECT 1 FROM pg_class WHERE oid = c.reltoastrelid) AS hastoast
FROM pg_class c WHERE relname LIKE 'clstr_tst%' ORDER BY relname;
@@ -381,6 +495,35 @@ SELECT * FROM clstr_1;
2
(2 rows)
+-- REPACK w/o argument performs no ordering, so we can only check which
+-- tables have had their relfilenode changed.
+RESET SESSION AUTHORIZATION;
+CREATE TEMP TABLE relnodes_old AS
+(SELECT relname, relfilenode
+FROM pg_class
+WHERE relname IN ('clstr_1', 'clstr_2', 'clstr_3'));
+SET SESSION AUTHORIZATION regress_clstr_user;
+SET client_min_messages = ERROR; -- order of "skipping" warnings may vary
+REPACK;
+RESET client_min_messages;
+RESET SESSION AUTHORIZATION;
+CREATE TEMP TABLE relnodes_new AS
+(SELECT relname, relfilenode
+FROM pg_class
+WHERE relname IN ('clstr_1', 'clstr_2', 'clstr_3'));
+-- Do the actual comparison. Unlike CLUSTER, clstr_3 should have been
+-- processed too, because REPACK does not require a clustering index.
+SELECT o.relname FROM relnodes_old o
+JOIN relnodes_new n ON o.relname = n.relname
+WHERE o.relfilenode <> n.relfilenode
+ORDER BY o.relname;
+ relname
+---------
+ clstr_1
+ clstr_3
+(2 rows)
+
+SET SESSION AUTHORIZATION regress_clstr_user;
-- Test MVCC-safety of cluster. There isn't much we can do to verify the
-- results with a single backend...
CREATE TABLE clustertest (key int PRIMARY KEY);
@@ -495,6 +638,43 @@ ALTER TABLE clstrpart SET WITHOUT CLUSTER;
ERROR: cannot mark index clustered in partitioned table
ALTER TABLE clstrpart CLUSTER ON clstrpart_idx;
ERROR: cannot mark index clustered in partitioned table
+-- Check that REPACK sets new relfilenodes: it should process exactly the same
+-- tables as CLUSTER did.
+DROP TABLE old_cluster_info;
+DROP TABLE new_cluster_info;
+CREATE TEMP TABLE old_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+REPACK clstrpart USING INDEX clstrpart_idx;
+CREATE TEMP TABLE new_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM old_cluster_info AS old JOIN new_cluster_info AS new USING (relname) ORDER BY relname COLLATE "C";
+ relname | level | relkind | ?column?
+-------------+-------+---------+----------
+ clstrpart | 0 | p | t
+ clstrpart1 | 1 | p | t
+ clstrpart11 | 2 | r | f
+ clstrpart12 | 2 | p | t
+ clstrpart2 | 1 | r | f
+ clstrpart3 | 1 | p | t
+ clstrpart33 | 2 | r | f
+(7 rows)
+
+-- And finally the same for REPACK w/o index.
+DROP TABLE old_cluster_info;
+DROP TABLE new_cluster_info;
+CREATE TEMP TABLE old_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+REPACK clstrpart;
+CREATE TEMP TABLE new_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM old_cluster_info AS old JOIN new_cluster_info AS new USING (relname) ORDER BY relname COLLATE "C";
+ relname | level | relkind | ?column?
+-------------+-------+---------+----------
+ clstrpart | 0 | p | t
+ clstrpart1 | 1 | p | t
+ clstrpart11 | 2 | r | f
+ clstrpart12 | 2 | p | t
+ clstrpart2 | 1 | r | f
+ clstrpart3 | 1 | p | t
+ clstrpart33 | 2 | r | f
+(7 rows)
+
DROP TABLE clstrpart;
-- Ownership of partitions is checked
CREATE TABLE ptnowner(i int unique) PARTITION BY LIST (i);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 62f69ac20b..50d87af2fd 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2041,6 +2041,33 @@ pg_stat_progress_create_index| SELECT s.pid,
s.param15 AS partitions_done
FROM (pg_stat_get_progress_info('CREATE INDEX'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
+pg_stat_progress_repack| SELECT s.pid,
+ s.datid,
+ d.datname,
+ s.relid,
+ CASE s.param1
+ WHEN 1 THEN 'REPACK'::text
+ ELSE NULL::text
+ END AS command,
+ CASE s.param2
+ WHEN 0 THEN 'initializing'::text
+ WHEN 1 THEN 'seq scanning heap'::text
+ WHEN 2 THEN 'index scanning heap'::text
+ WHEN 3 THEN 'sorting tuples'::text
+ WHEN 4 THEN 'writing new heap'::text
+ WHEN 5 THEN 'swapping relation files'::text
+ WHEN 6 THEN 'rebuilding index'::text
+ WHEN 7 THEN 'performing final cleanup'::text
+ ELSE NULL::text
+ END AS phase,
+ (s.param3)::oid AS repack_index_relid,
+ s.param4 AS heap_tuples_scanned,
+ s.param5 AS heap_tuples_written,
+ s.param6 AS heap_blks_total,
+ s.param7 AS heap_blks_scanned,
+ s.param8 AS index_rebuild_count
+ FROM (pg_stat_get_progress_info('REPACK'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
+ LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_progress_vacuum| SELECT s.pid,
s.datid,
d.datname,
diff --git a/src/test/regress/sql/cluster.sql b/src/test/regress/sql/cluster.sql
index b7115f8610..e348e26fbf 100644
--- a/src/test/regress/sql/cluster.sql
+++ b/src/test/regress/sql/cluster.sql
@@ -76,6 +76,33 @@ INSERT INTO clstr_tst (b, c) VALUES (1111, 'this should fail');
SELECT conname FROM pg_constraint WHERE conrelid = 'clstr_tst'::regclass
ORDER BY 1;
+-- REPACK handles individual tables identically to CLUSTER, but it's worth
+-- checking whether it handles table hierarchies identically as well.
+REPACK clstr_tst USING INDEX clstr_tst_c;
+
+-- Verify that inheritance link still works
+INSERT INTO clstr_tst_inh VALUES (0, 100, 'in child table 2');
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst;
+
+-- Verify that foreign key link still works
+INSERT INTO clstr_tst (b, c) VALUES (1111, 'this should fail');
+
+SELECT conname FROM pg_constraint WHERE conrelid = 'clstr_tst'::regclass
+ORDER BY 1;
+
+-- Yet another code path: REPACK w/o index.
+REPACK clstr_tst;
+
+-- Verify that inheritance link still works
+INSERT INTO clstr_tst_inh VALUES (0, 100, 'in child table 3');
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst;
+
+-- Verify that foreign key link still works
+INSERT INTO clstr_tst (b, c) VALUES (1111, 'this should fail');
+
+SELECT conname FROM pg_constraint WHERE conrelid = 'clstr_tst'::regclass
+ORDER BY 1;
+
SELECT relname, relkind,
EXISTS(SELECT 1 FROM pg_class WHERE oid = c.reltoastrelid) AS hastoast
@@ -159,6 +186,34 @@ INSERT INTO clstr_1 VALUES (1);
CLUSTER clstr_1;
SELECT * FROM clstr_1;
+-- REPACK w/o argument performs no ordering, so we can only check which
+-- tables have had their relfilenode changed.
+RESET SESSION AUTHORIZATION;
+CREATE TEMP TABLE relnodes_old AS
+(SELECT relname, relfilenode
+FROM pg_class
+WHERE relname IN ('clstr_1', 'clstr_2', 'clstr_3'));
+
+SET SESSION AUTHORIZATION regress_clstr_user;
+SET client_min_messages = ERROR; -- order of "skipping" warnings may vary
+REPACK;
+RESET client_min_messages;
+
+RESET SESSION AUTHORIZATION;
+CREATE TEMP TABLE relnodes_new AS
+(SELECT relname, relfilenode
+FROM pg_class
+WHERE relname IN ('clstr_1', 'clstr_2', 'clstr_3'));
+
+-- Do the actual comparison. Unlike CLUSTER, clstr_3 should have been
+-- processed too, because REPACK does not require a clustering index.
+SELECT o.relname FROM relnodes_old o
+JOIN relnodes_new n ON o.relname = n.relname
+WHERE o.relfilenode <> n.relfilenode
+ORDER BY o.relname;
+
+SET SESSION AUTHORIZATION regress_clstr_user;
+
-- Test MVCC-safety of cluster. There isn't much we can do to verify the
-- results with a single backend...
@@ -229,6 +284,24 @@ SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM o
CLUSTER clstrpart;
ALTER TABLE clstrpart SET WITHOUT CLUSTER;
ALTER TABLE clstrpart CLUSTER ON clstrpart_idx;
+
+-- Check that REPACK sets new relfilenodes: it should process exactly the same
+-- tables as CLUSTER did.
+DROP TABLE old_cluster_info;
+DROP TABLE new_cluster_info;
+CREATE TEMP TABLE old_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+REPACK clstrpart USING INDEX clstrpart_idx;
+CREATE TEMP TABLE new_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM old_cluster_info AS old JOIN new_cluster_info AS new USING (relname) ORDER BY relname COLLATE "C";
+
+-- And finally the same for REPACK w/o index.
+DROP TABLE old_cluster_info;
+DROP TABLE new_cluster_info;
+CREATE TEMP TABLE old_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+REPACK clstrpart;
+CREATE TEMP TABLE new_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM old_cluster_info AS old JOIN new_cluster_info AS new USING (relname) ORDER BY relname COLLATE "C";
+
DROP TABLE clstrpart;
-- Ownership of partitions is checked
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index cfbab589d6..098a7a602c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -412,6 +412,7 @@ ClientCertName
ClientConnectionInfo
ClientData
ClientSocket
+ClusterCommand
ClonePtrType
ClosePortalStmt
ClosePtrType
@@ -2464,6 +2465,7 @@ ReorderBufferTupleCidKey
ReorderBufferUpdateProgressTxnCB
ReorderTuple
RepOriginId
+RepackStmt
ReparameterizeForeignPathByChild_function
ReplaceVarsFromTargetList_context
ReplaceVarsNoMatchOption
--
2.43.5
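For readers following the grammar hunk above, the four RepackStmt productions correspond to these invocations. This is just an illustrative sketch against a hypothetical table `t` with index `t_idx` (both names are placeholders, not from the patch); VERBOSE is the option offered by the tab-completion hunk:

```sql
-- single table, optionally ordered by an index (USING INDEX clause)
REPACK t;
REPACK t USING INDEX t_idx;

-- parenthesized option list, with or without a target table
REPACK (VERBOSE) t USING INDEX t_idx;
REPACK (VERBOSE);  -- no table given: process all tables the user may repack
```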
v08-0002-Move-progress-related-fields-from-PgBackendStatus-to.patch
From 7921185eb112e7c1839981e6d5309493c697a1c1 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Wed, 26 Feb 2025 09:17:20 +0100
Subject: [PATCH 2/9] Move progress related fields from PgBackendStatus to
PgBackendProgress.
REPACK CONCURRENTLY will need to save and restore these fields at some
point. This is because plan_cluster_use_sort() has to be called in a
subtransaction (so that it does not leave any additional locks on the table)
and rollback of that subtransaction clears the progress information.
---
src/backend/access/heap/vacuumlazy.c | 2 +-
src/backend/commands/analyze.c | 2 +-
src/backend/utils/activity/backend_progress.c | 18 +++++++++---------
src/backend/utils/activity/backend_status.c | 4 ++--
src/backend/utils/adt/pgstatfuncs.c | 6 +++---
src/include/utils/backend_progress.h | 14 ++++++++++++++
src/include/utils/backend_status.h | 14 ++------------
7 files changed, 32 insertions(+), 28 deletions(-)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 1af18a78a2..5a0eb1ca80 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -1100,7 +1100,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* the st_progress_param array.
*/
appendStringInfo(&buf, _("delay time: %.3f ms\n"),
- (double) MyBEEntry->st_progress_param[PROGRESS_VACUUM_DELAY_TIME] / 1000000.0);
+ (double) MyBEEntry->st_progress.param[PROGRESS_VACUUM_DELAY_TIME] / 1000000.0);
}
if (track_io_timing)
{
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index cd75954951..49fe8c43ce 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -815,7 +815,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
* only updated by the calling process.
*/
appendStringInfo(&buf, _("delay time: %.3f ms\n"),
- (double) MyBEEntry->st_progress_param[PROGRESS_ANALYZE_DELAY_TIME] / 1000000.0);
+ (double) MyBEEntry->st_progress.param[PROGRESS_ANALYZE_DELAY_TIME] / 1000000.0);
}
if (track_io_timing)
{
diff --git a/src/backend/utils/activity/backend_progress.c b/src/backend/utils/activity/backend_progress.c
index 99a8c73bf0..eebc968193 100644
--- a/src/backend/utils/activity/backend_progress.c
+++ b/src/backend/utils/activity/backend_progress.c
@@ -32,9 +32,9 @@ pgstat_progress_start_command(ProgressCommandType cmdtype, Oid relid)
return;
PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
- beentry->st_progress_command = cmdtype;
- beentry->st_progress_command_target = relid;
- MemSet(&beentry->st_progress_param, 0, sizeof(beentry->st_progress_param));
+ beentry->st_progress.command = cmdtype;
+ beentry->st_progress.command_target = relid;
+ MemSet(&beentry->st_progress.param, 0, sizeof(beentry->st_progress.param));
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
@@ -55,7 +55,7 @@ pgstat_progress_update_param(int index, int64 val)
return;
PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
- beentry->st_progress_param[index] = val;
+ beentry->st_progress.param[index] = val;
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
@@ -76,7 +76,7 @@ pgstat_progress_incr_param(int index, int64 incr)
return;
PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
- beentry->st_progress_param[index] += incr;
+ beentry->st_progress.param[index] += incr;
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
@@ -133,7 +133,7 @@ pgstat_progress_update_multi_param(int nparam, const int *index,
{
Assert(index[i] >= 0 && index[i] < PGSTAT_NUM_PROGRESS_PARAM);
- beentry->st_progress_param[index[i]] = val[i];
+ beentry->st_progress.param[index[i]] = val[i];
}
PGSTAT_END_WRITE_ACTIVITY(beentry);
@@ -154,11 +154,11 @@ pgstat_progress_end_command(void)
if (!beentry || !pgstat_track_activities)
return;
- if (beentry->st_progress_command == PROGRESS_COMMAND_INVALID)
+ if (beentry->st_progress.command == PROGRESS_COMMAND_INVALID)
return;
PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
- beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
- beentry->st_progress_command_target = InvalidOid;
+ beentry->st_progress.command = PROGRESS_COMMAND_INVALID;
+ beentry->st_progress.command_target = InvalidOid;
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 5f68ef26ad..647bf863f0 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -376,8 +376,8 @@ pgstat_bestart(void)
#endif
lbeentry.st_state = STATE_UNDEFINED;
- lbeentry.st_progress_command = PROGRESS_COMMAND_INVALID;
- lbeentry.st_progress_command_target = InvalidOid;
+ lbeentry.st_progress.command = PROGRESS_COMMAND_INVALID;
+ lbeentry.st_progress.command_target = InvalidOid;
lbeentry.st_query_id = UINT64CONST(0);
/*
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 02ac18fca6..6603b2a64b 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -299,7 +299,7 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
* Report values for only those backends which are running the given
* command.
*/
- if (beentry->st_progress_command != cmdtype)
+ if (beentry->st_progress.command != cmdtype)
continue;
/* Value available to all callers */
@@ -309,9 +309,9 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
/* show rest of the values including relid only to role members */
if (HAS_PGSTAT_PERMISSIONS(beentry->st_userid))
{
- values[2] = ObjectIdGetDatum(beentry->st_progress_command_target);
+ values[2] = ObjectIdGetDatum(beentry->st_progress.command_target);
for (i = 0; i < PGSTAT_NUM_PROGRESS_PARAM; i++)
- values[i + 3] = Int64GetDatum(beentry->st_progress_param[i]);
+ values[i + 3] = Int64GetDatum(beentry->st_progress.param[i]);
}
else
{
diff --git a/src/include/utils/backend_progress.h b/src/include/utils/backend_progress.h
index da3d14bb97..2f1de46d05 100644
--- a/src/include/utils/backend_progress.h
+++ b/src/include/utils/backend_progress.h
@@ -31,8 +31,22 @@ typedef enum ProgressCommandType
PROGRESS_COMMAND_COPY,
} ProgressCommandType;
+
#define PGSTAT_NUM_PROGRESS_PARAM 20
+/*
+ * Any command which wishes can advertise that it is running by setting
+ * command, command_target, and param[]. command_target should be the OID of
+ * the relation which the command targets (we assume there's just one, as this
+ * is meant for utility commands), but the meaning of each element in the
+ * param array is command-specific.
+ */
+typedef struct PgBackendProgress
+{
+ ProgressCommandType command;
+ Oid command_target;
+ int64 param[PGSTAT_NUM_PROGRESS_PARAM];
+} PgBackendProgress;
extern void pgstat_progress_start_command(ProgressCommandType cmdtype,
Oid relid);
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index d3d4ff6c5c..a73c76a442 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -155,18 +155,8 @@ typedef struct PgBackendStatus
*/
char *st_activity_raw;
- /*
- * Command progress reporting. Any command which wishes can advertise
- * that it is running by setting st_progress_command,
- * st_progress_command_target, and st_progress_param[].
- * st_progress_command_target should be the OID of the relation which the
- * command targets (we assume there's just one, as this is meant for
- * utility commands), but the meaning of each element in the
- * st_progress_param array is command-specific.
- */
- ProgressCommandType st_progress_command;
- Oid st_progress_command_target;
- int64 st_progress_param[PGSTAT_NUM_PROGRESS_PARAM];
+ /* Command progress reporting. */
+ PgBackendProgress st_progress;
/* query identifier, optionally computed using post_parse_analyze_hook */
uint64 st_query_id;
--
2.43.5
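Since patch 0002 only relocates the progress fields into PgBackendProgress, a running REPACK remains observable through the pg_stat_progress_repack view added in patch 0001. A sketch of a monitoring query, using the column names from that view's definition (the percentage expression is my own addition, not part of the patch):

```sql
SELECT pid, relid::regclass AS table_name, phase,
       heap_blks_scanned, heap_blks_total,
       round(100.0 * heap_blks_scanned / NULLIF(heap_blks_total, 0), 1)
           AS pct_scanned   -- NULLIF avoids division by zero early in the scan
FROM pg_stat_progress_repack;
```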
v08-0003-Move-conversion-of-a-historic-to-MVCC-snapshot-to-a-.patch
From ced00311269d331e5b91985e08b2330aa8069dbd Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Wed, 26 Feb 2025 09:17:20 +0100
Subject: [PATCH 3/9] Move conversion of a "historic" to MVCC snapshot to a
separate function.
The conversion is now handled by SnapBuildMVCCFromHistoric(). REPACK
CONCURRENTLY will also need it.
---
src/backend/replication/logical/snapbuild.c | 51 +++++++++++++++++----
src/backend/utils/time/snapmgr.c | 3 +-
src/include/replication/snapbuild.h | 1 +
src/include/utils/snapmgr.h | 1 +
4 files changed, 45 insertions(+), 11 deletions(-)
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index bd0680dcbe..8c83ff6feb 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -440,10 +440,7 @@ Snapshot
SnapBuildInitialSnapshot(SnapBuild *builder)
{
Snapshot snap;
- TransactionId xid;
TransactionId safeXid;
- TransactionId *newxip;
- int newxcnt = 0;
Assert(XactIsoLevel == XACT_REPEATABLE_READ);
Assert(builder->building_full_snapshot);
@@ -485,6 +482,31 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
MyProc->xmin = snap->xmin;
+ /* Convert the historic snapshot to MVCC snapshot. */
+ return SnapBuildMVCCFromHistoric(snap, true);
+}
+
+/*
+ * Turn a historic MVCC snapshot into an ordinary MVCC snapshot.
+ *
+ * Unlike a regular (non-historic) MVCC snapshot, the xip array of this
+ * snapshot contains not only running main transactions, but also their
+ * subtransactions. This difference has no impact on XidInMVCCSnapshot().
+ *
+ * Pass true for 'in_place' if you don't care about modifying the source
+ * snapshot. If you need a new instance, and one that was allocated as a
+ * single chunk of memory, pass false.
+ */
+Snapshot
+SnapBuildMVCCFromHistoric(Snapshot snapshot, bool in_place)
+{
+ TransactionId xid;
+ TransactionId *oldxip = snapshot->xip;
+ uint32 oldxcnt = snapshot->xcnt;
+ TransactionId *newxip;
+ int newxcnt = 0;
+ Snapshot result;
+
/* allocate in transaction context */
newxip = (TransactionId *)
palloc(sizeof(TransactionId) * GetMaxSnapshotXidCount());
@@ -495,7 +517,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
* classical snapshot by marking all non-committed transactions as
* in-progress. This can be expensive.
*/
- for (xid = snap->xmin; NormalTransactionIdPrecedes(xid, snap->xmax);)
+ for (xid = snapshot->xmin; NormalTransactionIdPrecedes(xid, snapshot->xmax);)
{
void *test;
@@ -503,7 +525,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
* Check whether transaction committed using the decoding snapshot
* meaning of ->xip.
*/
- test = bsearch(&xid, snap->xip, snap->xcnt,
+ test = bsearch(&xid, snapshot->xip, snapshot->xcnt,
sizeof(TransactionId), xidComparator);
if (test == NULL)
@@ -520,11 +542,22 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
}
/* adjust remaining snapshot fields as needed */
- snap->snapshot_type = SNAPSHOT_MVCC;
- snap->xcnt = newxcnt;
- snap->xip = newxip;
+ snapshot->xcnt = newxcnt;
+ snapshot->xip = newxip;
+
+ if (in_place)
+ result = snapshot;
+ else
+ {
+ result = CopySnapshot(snapshot);
+
+ /* Restore the original values so the source is intact. */
+ snapshot->xip = oldxip;
+ snapshot->xcnt = oldxcnt;
+ }
+ result->snapshot_type = SNAPSHOT_MVCC;
- return snap;
+ return result;
}
/*
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 8f1508b1ee..42bded373b 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -153,7 +153,6 @@ typedef struct ExportedSnapshot
static List *exportedSnapshots = NIL;
/* Prototypes for local functions */
-static Snapshot CopySnapshot(Snapshot snapshot);
static void UnregisterSnapshotNoOwner(Snapshot snapshot);
static void FreeSnapshot(Snapshot snapshot);
static void SnapshotResetXmin(void);
@@ -532,7 +531,7 @@ SetTransactionSnapshot(Snapshot sourcesnap, VirtualTransactionId *sourcevxid,
* The copy is palloc'd in TopTransactionContext and has initial refcounts set
* to 0. The returned snapshot has the copied flag set.
*/
-static Snapshot
+Snapshot
CopySnapshot(Snapshot snapshot)
{
Snapshot newsnap;
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 44031dcf6e..6d4d2d1814 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -73,6 +73,7 @@ extern void FreeSnapshotBuilder(SnapBuild *builder);
extern void SnapBuildSnapDecRefcount(Snapshot snap);
extern Snapshot SnapBuildInitialSnapshot(SnapBuild *builder);
+extern Snapshot SnapBuildMVCCFromHistoric(Snapshot snapshot, bool in_place);
extern const char *SnapBuildExportSnapshot(SnapBuild *builder);
extern void SnapBuildClearExportedSnapshot(void);
extern void SnapBuildResetExportedSnapshotState(void);
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index d346be7164..147b190210 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -60,6 +60,7 @@ extern Snapshot GetTransactionSnapshot(void);
extern Snapshot GetLatestSnapshot(void);
extern void SnapshotSetCommandId(CommandId curcid);
+extern Snapshot CopySnapshot(Snapshot snapshot);
extern Snapshot GetCatalogSnapshot(Oid relid);
extern Snapshot GetNonHistoricCatalogSnapshot(Oid relid);
extern void InvalidateCatalogSnapshot(void);
--
2.43.5
Attachment: v08-0004-Add-CONCURRENTLY-option-to-REPACK-command.patch (text/plain)
From 8da24b040720e5a30517076825760c7516c42bfa Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Wed, 26 Feb 2025 09:17:20 +0100
Subject: [PATCH 4/9] Add CONCURRENTLY option to REPACK command.
The REPACK command copies the relation data into a new file, creates new
indexes and eventually swaps the files. To make sure that the old file does
not change during the copying, the relation is locked in an exclusive mode,
which prevents applications from both reading and writing. (To keep the data
consistent, we'd only need to prevent the applications from writing, but even
reading needs to be blocked before we can swap the files - otherwise some
applications could continue using the old file. Since we cannot get stronger
lock without releasing the weaker one first, we acquire the exclusive lock at
the beginning and keep it until the end of the processing.)
This patch introduces an alternative workflow, which only requires the
exclusive lock when the relation (and index) files are being swapped.
(Supposedly, the swapping should be pretty fast.) On the other hand, when we
copy the data to the new file, we allow applications to read from the relation
and even write into it.
First, we scan the relation using a "historic snapshot", and insert all the
tuples satisfying this snapshot into the new file. Note that, before creating
that snapshot, we need to make sure that all the other backends treat the
relation as a system catalog: in particular, they must log information on new
command IDs (CIDs). We achieve that by adding the relation ID into a shared
hash table and waiting until all the transactions currently writing into the
table (i.e. transactions possibly not aware of the new entry) have finished.
Second, logical decoding is used to capture the data changes done by
applications during the copying (i.e. changes that do not satisfy the historic
snapshot mentioned above), and those are applied to the new file before we
acquire the exclusive lock we need to swap the files. (Of course, more data
changes can take place while we are waiting for the lock - these will be
applied to the new file after we have acquired the lock, before we swap the
files.)
While copying the data into the new file, we hold a lock that prevents
applications from changing the relation tuple descriptor (tuples inserted into
the old file must fit into the new file). However, as we have to release that
lock before getting the exclusive one, it's possible that someone adds or
drops a column, or changes the data type of an existing one. Therefore we have
to check the tuple descriptor before we swap the files. If we find out that
the tuple descriptor changed, ERROR is raised and all the changes are rolled
back. Since a lot of effort can be wasted in such a case, the ALTER TABLE
command also tries to check if REPACK CONCURRENTLY is running on the same
relation, and raises an ERROR if it is.
Like the existing implementation of REPACK, the variant with the CONCURRENTLY
option also requires an extra space for the new relation and index files
(which coexist with the old files for some time). In addition, the
CONCURRENTLY option might introduce a lag in releasing WAL segments for
archiving / recycling. This is due to the decoding of the data changes done by
applications concurrently. However, this lag should not be more than a single
WAL segment.
---
doc/src/sgml/monitoring.sgml | 65 +-
doc/src/sgml/ref/repack.sgml | 116 +-
src/Makefile | 1 +
src/backend/access/heap/heapam.c | 8 +-
src/backend/access/heap/heapam_handler.c | 145 +-
src/backend/access/heap/heapam_visibility.c | 30 +-
src/backend/catalog/index.c | 43 +-
src/backend/catalog/system_views.sql | 30 +-
src/backend/commands/cluster.c | 2667 ++++++++++++++++-
src/backend/commands/matview.c | 2 +-
src/backend/commands/tablecmds.c | 11 +
src/backend/commands/vacuum.c | 12 +-
src/backend/meson.build | 1 +
src/backend/parser/gram.y | 17 +-
src/backend/replication/logical/decode.c | 24 +
src/backend/replication/logical/snapbuild.c | 20 +
.../replication/pgoutput_repack/Makefile | 32 +
.../replication/pgoutput_repack/meson.build | 18 +
.../pgoutput_repack/pgoutput_repack.c | 286 ++
src/backend/storage/ipc/ipci.c | 3 +
src/backend/tcop/utility.c | 10 +
src/backend/utils/activity/backend_progress.c | 16 +
.../utils/activity/wait_event_names.txt | 1 +
src/backend/utils/cache/inval.c | 21 +
src/backend/utils/cache/relcache.c | 5 +
src/backend/utils/time/snapmgr.c | 3 +-
src/bin/psql/tab-complete.in.c | 24 +-
src/include/access/heapam.h | 4 +
src/include/access/tableam.h | 10 +
src/include/catalog/index.h | 3 +
src/include/commands/cluster.h | 93 +-
src/include/commands/progress.h | 17 +-
src/include/nodes/parsenodes.h | 1 +
src/include/replication/snapbuild.h | 1 +
src/include/storage/lockdefs.h | 5 +-
src/include/storage/lwlocklist.h | 1 +
src/include/utils/backend_progress.h | 3 +-
src/include/utils/inval.h | 2 +
src/include/utils/rel.h | 7 +-
src/include/utils/snapmgr.h | 2 +
src/test/regress/expected/rules.out | 29 +-
41 files changed, 3536 insertions(+), 253 deletions(-)
create mode 100644 src/backend/replication/pgoutput_repack/Makefile
create mode 100644 src/backend/replication/pgoutput_repack/meson.build
create mode 100644 src/backend/replication/pgoutput_repack/pgoutput_repack.c
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 58e1becf02..8d73c01c55 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5780,14 +5780,35 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>heap_tuples_written</structfield> <type>bigint</type>
+ <structfield>heap_tuples_inserted</structfield> <type>bigint</type>
</para>
<para>
- Number of heap tuples written.
+ Number of heap tuples inserted.
This counter only advances when the phase is
<literal>seq scanning heap</literal>,
- <literal>index scanning heap</literal>
- or <literal>writing new heap</literal>.
+ <literal>index scanning heap</literal>,
+ <literal>writing new heap</literal>
+ or <literal>catch-up</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_updated</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples updated.
+ This counter only advances when the phase is <literal>catch-up</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_deleted</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples deleted.
+ This counter only advances when the phase is <literal>catch-up</literal>.
</para></entry>
</row>
@@ -6003,14 +6024,35 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>heap_tuples_written</structfield> <type>bigint</type>
+ <structfield>heap_tuples_inserted</structfield> <type>bigint</type>
</para>
<para>
- Number of heap tuples written.
+ Number of heap tuples inserted.
This counter only advances when the phase is
<literal>seq scanning heap</literal>,
- <literal>index scanning heap</literal>
- or <literal>writing new heap</literal>.
+ <literal>index scanning heap</literal>,
+ <literal>writing new heap</literal>
+ or <literal>catch-up</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_updated</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples updated.
+ This counter only advances when the phase is <literal>catch-up</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_deleted</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples deleted.
+ This counter only advances when the phase is <literal>catch-up</literal>.
</para></entry>
</row>
@@ -6091,6 +6133,13 @@ FROM pg_stat_get_backend_idset() AS backendid;
<command>REPACK</command> is currently writing the new heap.
</entry>
</row>
+ <row>
+ <entry><literal>catch-up</literal></entry>
+ <entry>
+ <command>REPACK</command> is currently processing the DML commands that
+ other transactions executed during any of the preceding phases.
+ </entry>
+ </row>
<row>
<entry><literal>swapping relation files</literal></entry>
<entry>
diff --git a/doc/src/sgml/ref/repack.sgml b/doc/src/sgml/ref/repack.sgml
index 84f3c3e3f2..9ee640e351 100644
--- a/doc/src/sgml/ref/repack.sgml
+++ b/doc/src/sgml/ref/repack.sgml
@@ -22,6 +22,7 @@ PostgreSQL documentation
<refsynopsisdiv>
<synopsis>
REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <replaceable class="parameter">table_name</replaceable> [ USING INDEX <replaceable class="parameter">index_name</replaceable> ] ]
+REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] CONCURRENTLY <replaceable class="parameter">table_name</replaceable> [ USING INDEX <replaceable class="parameter">index_name</replaceable> ]
<phrase>where <replaceable class="parameter">option</replaceable> can be one of:</phrase>
@@ -48,7 +49,8 @@ REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
processes every table and materialized view in the current database that
the current user has the <literal>MAINTAIN</literal> privilege on. This
form of <command>REPACK</command> cannot be executed inside a transaction
- block.
+ block. Also, this form is not allowed if
+ the <literal>CONCURRENTLY</literal> option is used.
</para>
<para>
@@ -61,7 +63,8 @@ REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
When a table is being repacked, an <literal>ACCESS EXCLUSIVE</literal> lock
is acquired on it. This prevents any other database operations (both reads
and writes) from operating on the table until the <command>REPACK</command>
- is finished.
+ is finished. If you want to keep the table accessible during the repacking,
+ consider using the <literal>CONCURRENTLY</literal> option.
</para>
<refsect2 id="sql-repack-notes-on-clustering" xreflabel="Notes on Clustering">
@@ -160,6 +163,115 @@ REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><literal>CONCURRENTLY</literal></term>
+ <listitem>
+ <para>
+ Allow other transactions to use the table while it is being repacked.
+ </para>
+
+ <para>
+ Internally, <command>REPACK</command> copies the contents of the table
+ (ignoring dead tuples) into a new file, sorted by the specified index,
+ and also creates a new file for each index. Then it swaps the old and
+ new files for the table and all the indexes, and deletes the old
+ files. The <literal>ACCESS EXCLUSIVE</literal> lock is needed to make
+ sure that the old files do not change during the processing because the
+ changes would get lost due to the swap.
+ </para>
+
+ <para>
+ With the <literal>CONCURRENTLY</literal> option, the <literal>ACCESS
+ EXCLUSIVE</literal> lock is only acquired to swap the table and index
+ files. The data changes that took place during the creation of the new
+ table and index files are captured using logical decoding
+ (<xref linkend="logicaldecoding"/>) and applied before
+ the <literal>ACCESS EXCLUSIVE</literal> lock is requested. Thus the lock
+ is typically held only for the time needed to swap the files, which
+ should be pretty short.
+ </para>
+
+ <para>
+ Note that <command>REPACK</command> with
+ the <literal>CONCURRENTLY</literal> option does not try to order the
+ rows inserted into the table after the repacking started. Also
+ note that <command>REPACK</command> might fail to complete due to DDL
+ commands executed on the table by other transactions during the
+ repacking.
+ </para>
+
+ <note>
+ <para>
+ In addition to the temporary space requirements explained in
+ <xref linkend="sql-repack-notes-on-resources"/>,
+ the <literal>CONCURRENTLY</literal> option can increase the use of
+ temporary space somewhat. The reason is that other transactions can
+ perform DML operations which cannot be applied to the new file until
+ <command>REPACK</command> has copied all the tuples from the old
+ file. Thus the tuples inserted into the old file during the copying are
+ also stored separately in a temporary file, so they can eventually
+ be applied to the new file.
+ </para>
+
+ <para>
+ Furthermore, the data changes performed during the copying are
+ extracted from <link linkend="wal">write-ahead log</link> (WAL), and
+ this extraction (decoding) only takes place when a certain amount of WAL
+ has been written. Therefore, WAL removal can be delayed by this
+ threshold. Currently the threshold is equal to the value of
+ the <link linkend="guc-wal-segment-size"><varname>wal_segment_size</varname></link>
+ configuration parameter.
+ </para>
+ </note>
+
+ <para>
+ The <literal>CONCURRENTLY</literal> option cannot be used in the
+ following cases:
+
+ <itemizedlist>
+ <listitem>
+ <para>
+ The table is <literal>UNLOGGED</literal>.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The table is partitioned.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The table is a system catalog or a <acronym>TOAST</acronym> table.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <command>REPACK</command> is executed inside a transaction block.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The <link linkend="guc-wal-level"><varname>wal_level</varname></link>
+ configuration parameter is less than <literal>logical</literal>.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The <link linkend="guc-max-replication-slots"><varname>max_replication_slots</varname></link>
+ configuration parameter does not allow for creation of an additional
+ replication slot.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><literal>VERBOSE</literal></term>
<listitem>
diff --git a/src/Makefile b/src/Makefile
index 2f31a2f20a..b18c9a14ff 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -23,6 +23,7 @@ SUBDIRS = \
interfaces \
backend/replication/libpqwalreceiver \
backend/replication/pgoutput \
+ backend/replication/pgoutput_repack \
fe_utils \
bin \
pl \
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index fa7935a0ed..cb856a74ee 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2093,8 +2093,14 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
/*
* If this is a catalog, we need to transmit combo CIDs to properly
* decode, so log that as well.
+ *
+ * For the main heap (as opposed to TOAST), we only receive
+ * HEAP_INSERT_NO_LOGICAL when doing REPACK CONCURRENTLY, in which
+ * case the visibility information does not change. Therefore, there's
+ * no need to update the decoding snapshot.
*/
- if (RelationIsAccessibleInLogicalDecoding(relation))
+ if ((options & HEAP_INSERT_NO_LOGICAL) == 0 &&
+ RelationIsAccessibleInLogicalDecoding(relation))
log_heap_new_cid(relation, heaptup);
/*
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 5c3cab8bc2..b2bfd05dc9 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -33,6 +33,7 @@
#include "catalog/index.h"
#include "catalog/storage.h"
#include "catalog/storage_xlog.h"
+#include "commands/cluster.h"
#include "commands/progress.h"
#include "executor/executor.h"
#include "miscadmin.h"
@@ -53,6 +54,9 @@ static void reform_and_rewrite_tuple(HeapTuple tuple,
static bool SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
HeapTuple tuple,
OffsetNumber tupoffset);
+static HeapTuple accept_tuple_for_concurrent_copy(HeapTuple tuple,
+ Snapshot snapshot,
+ Buffer buffer);
static BlockNumber heapam_scan_get_blocks_done(HeapScanDesc hscan);
@@ -681,6 +685,8 @@ static void
heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
Relation OldIndex, bool use_sort,
TransactionId OldestXmin,
+ Snapshot snapshot,
+ LogicalDecodingContext *decoding_ctx,
TransactionId *xid_cutoff,
MultiXactId *multi_cutoff,
double *num_tuples,
@@ -701,6 +707,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
bool *isnull;
BufferHeapTupleTableSlot *hslot;
BlockNumber prev_cblock = InvalidBlockNumber;
+ bool concurrent = snapshot != NULL;
+ XLogRecPtr end_of_wal_prev = GetFlushRecPtr(NULL);
/* Remember if it's a system catalog */
is_system_catalog = IsSystemRelation(OldHeap);
@@ -779,8 +787,10 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
for (;;)
{
HeapTuple tuple;
+ bool tuple_copied = false;
Buffer buf;
bool isdead;
+ HTSV_Result vis;
CHECK_FOR_INTERRUPTS();
@@ -835,7 +845,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
LockBuffer(buf, BUFFER_LOCK_SHARE);
- switch (HeapTupleSatisfiesVacuum(tuple, OldestXmin, buf))
+ switch ((vis = HeapTupleSatisfiesVacuum(tuple, OldestXmin, buf)))
{
case HEAPTUPLE_DEAD:
/* Definitely dead */
@@ -851,14 +861,15 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
case HEAPTUPLE_INSERT_IN_PROGRESS:
/*
- * Since we hold exclusive lock on the relation, normally the
- * only way to see this is if it was inserted earlier in our
- * own transaction. However, it can happen in system
+ * As long as we hold exclusive lock on the relation, normally
+ * the only way to see this is if it was inserted earlier in
+ * our own transaction. However, it can happen in system
* catalogs, since we tend to release write lock before commit
- * there. Give a warning if neither case applies; but in any
- * case we had better copy it.
+ * there. Also, there's no exclusive lock during concurrent
+ * processing. Give a warning if neither case applies; but in
+ * any case we had better copy it.
*/
- if (!is_system_catalog &&
+ if (!is_system_catalog && !concurrent &&
!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(tuple->t_data)))
elog(WARNING, "concurrent insert in progress within table \"%s\"",
RelationGetRelationName(OldHeap));
@@ -870,7 +881,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
/*
* Similar situation to INSERT_IN_PROGRESS case.
*/
- if (!is_system_catalog &&
+ if (!is_system_catalog && !concurrent &&
!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetUpdateXid(tuple->t_data)))
elog(WARNING, "concurrent delete in progress within table \"%s\"",
RelationGetRelationName(OldHeap));
@@ -884,8 +895,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
break;
}
- LockBuffer(buf, BUFFER_LOCK_UNLOCK);
-
if (isdead)
{
*tups_vacuumed += 1;
@@ -896,9 +905,47 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
*tups_vacuumed += 1;
*tups_recently_dead -= 1;
}
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
continue;
}
+ if (concurrent)
+ {
+ /*
+ * Ignore concurrent changes now, they'll be processed later via
+ * logical decoding.
+ *
+ * INSERT_IN_PROGRESS is rejected right away because our snapshot
+ * represents a point in time that precedes (or equals) the state
+ * of transactions as it was when the
+ * "SatisfiesVacuum" test was performed. Thus
+ * accept_tuple_for_concurrent_copy() should not consider the
+ * tuple inserted.
+ */
+ if (vis == HEAPTUPLE_INSERT_IN_PROGRESS)
+ tuple = NULL;
+ else
+ tuple = accept_tuple_for_concurrent_copy(tuple, snapshot,
+ buf);
+ /* Tuple not suitable for the new heap? */
+ if (tuple == NULL)
+ {
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ continue;
+ }
+
+ /* Remember that we have to free the tuple eventually. */
+ tuple_copied = true;
+ }
+
+ /*
+ * In the concurrent case, we have a copy of the tuple, so we don't
+ * worry whether the source tuple will be deleted / updated after we
+ * release the lock.
+ */
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
*num_tuples += 1;
if (tuplesort != NULL)
{
@@ -915,7 +962,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
{
const int ct_index[] = {
PROGRESS_REPACK_HEAP_TUPLES_SCANNED,
- PROGRESS_REPACK_HEAP_TUPLES_WRITTEN
+ PROGRESS_REPACK_HEAP_TUPLES_INSERTED
};
int64 ct_val[2];
@@ -930,6 +977,33 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
ct_val[1] = *num_tuples;
pgstat_progress_update_multi_param(2, ct_index, ct_val);
}
+ if (tuple_copied)
+ heap_freetuple(tuple);
+
+ /*
+ * Process the WAL produced by the load, as well as by other
+ * transactions, so that the replication slot can advance and WAL does
+ * not pile up. Use wal_segment_size as a threshold so that we do not
+ * introduce the decoding overhead too often.
+ *
+ * Of course, we must not apply the changes until the initial load has
+ * completed.
+ *
+ * Note that our insertions into the new table should not be decoded
+ * as we (intentionally) do not write the logical decoding specific
+ * information to WAL.
+ */
+ if (concurrent)
+ {
+ XLogRecPtr end_of_wal;
+
+ end_of_wal = GetFlushRecPtr(NULL);
+ if ((end_of_wal - end_of_wal_prev) > wal_segment_size)
+ {
+ repack_decode_concurrent_changes(decoding_ctx, end_of_wal);
+ end_of_wal_prev = end_of_wal;
+ }
+ }
}
if (indexScan != NULL)
@@ -973,7 +1047,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
values, isnull,
rwstate);
/* Report n_tuples */
- pgstat_progress_update_param(PROGRESS_REPACK_HEAP_TUPLES_WRITTEN,
+ pgstat_progress_update_param(PROGRESS_REPACK_HEAP_TUPLES_INSERTED,
n_tuples);
}
@@ -2626,6 +2700,53 @@ SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
}
}
+/*
+ * Return copy of 'tuple' if it has been inserted according to 'snapshot', or
+ * NULL if the insertion took place in the future. If the tuple is already
+ * marked as deleted or updated by a transaction that 'snapshot' still
+ * considers running, clear the deletion / update XID in the header of the
+ * copied tuple. This way the returned tuple is suitable for insertion into
+ * the new heap.
+ */
+static HeapTuple
+accept_tuple_for_concurrent_copy(HeapTuple tuple, Snapshot snapshot,
+ Buffer buffer)
+{
+ HeapTuple result;
+
+ Assert(snapshot->snapshot_type == SNAPSHOT_MVCC);
+
+ /*
+ * First, check if the tuple insertion is visible by our snapshot.
+ */
+ if (!HeapTupleMVCCInserted(tuple, snapshot, buffer))
+ return NULL;
+
+ result = heap_copytuple(tuple);
+
+ /*
+ * If the tuple was deleted / updated but our snapshot still sees it, we
+ * need to keep it. In that case, clear the information that indicates the
+ * deletion / update. Otherwise the tuple chain would stay incomplete (as
+ * we will reject the new tuple above), and the delete / update would fail
+ * if executed later during logical decoding.
+ */
+ if (TransactionIdIsNormal(HeapTupleHeaderGetRawXmax(result->t_data)) &&
+ HeapTupleMVCCNotDeleted(result, snapshot, buffer))
+ {
+ /* TODO: more work needed here? */
+ result->t_data->t_infomask |= HEAP_XMAX_INVALID;
+ HeapTupleHeaderSetXmax(result->t_data, 0);
+ }
+
+ /*
+ * Accept the tuple even if our snapshot considers it deleted - older
+ * snapshots can still see the tuple, while the decoded transactions
+ * should not try to update / delete it again.
+ */
+ return result;
+}
+
/* ------------------------------------------------------------------------
* Definition of the heap table access method.
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index e146605bd5..d9be93aadc 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -955,16 +955,31 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
* did TransactionIdIsInProgress in each call --- to no avail, as long as the
* inserting/deleting transaction was still running --- which was more cycles
* and more contention on ProcArrayLock.
+ *
+ * The checks are split into two functions, HeapTupleMVCCInserted() and
+ * HeapTupleMVCCNotDeleted(), because they are also useful separately.
*/
static bool
HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
Buffer buffer)
{
- HeapTupleHeader tuple = htup->t_data;
-
Assert(ItemPointerIsValid(&htup->t_self));
Assert(htup->t_tableOid != InvalidOid);
+ return HeapTupleMVCCInserted(htup, snapshot, buffer) &&
+ HeapTupleMVCCNotDeleted(htup, snapshot, buffer);
+}
+
+/*
+ * HeapTupleMVCCInserted
+ * True iff heap tuple was successfully inserted for the given MVCC
+ * snapshot.
+ */
+bool
+HeapTupleMVCCInserted(HeapTuple htup, Snapshot snapshot, Buffer buffer)
+{
+ HeapTupleHeader tuple = htup->t_data;
+
if (!HeapTupleHeaderXminCommitted(tuple))
{
if (HeapTupleHeaderXminInvalid(tuple))
@@ -1073,6 +1088,17 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
}
/* by here, the inserting transaction has committed */
+ return true;
+}
+
+/*
+ * HeapTupleMVCCNotDeleted
+ * True iff heap tuple was not deleted for the given MVCC snapshot.
+ */
+bool
+HeapTupleMVCCNotDeleted(HeapTuple htup, Snapshot snapshot, Buffer buffer)
+{
+ HeapTupleHeader tuple = htup->t_data;
if (tuple->t_infomask & HEAP_XMAX_INVALID) /* xid invalid or aborted */
return true;
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index c84f67059a..39b121c0b8 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1417,22 +1417,7 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
opclassOptions[i] = get_attoptions(oldIndexId, i + 1);
- /* Extract statistic targets for each attribute */
- stattargets = palloc0_array(NullableDatum, newInfo->ii_NumIndexAttrs);
- for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
- {
- HeapTuple tp;
- Datum dat;
-
- tp = SearchSysCache2(ATTNUM, ObjectIdGetDatum(oldIndexId), Int16GetDatum(i + 1));
- if (!HeapTupleIsValid(tp))
- elog(ERROR, "cache lookup failed for attribute %d of relation %u",
- i + 1, oldIndexId);
- dat = SysCacheGetAttr(ATTNUM, tp, Anum_pg_attribute_attstattarget, &isnull);
- ReleaseSysCache(tp);
- stattargets[i].value = dat;
- stattargets[i].isnull = isnull;
- }
+ stattargets = get_index_stattargets(oldIndexId, newInfo);
/*
* Now create the new index.
@@ -1471,6 +1456,32 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
return newIndexId;
}
+NullableDatum *
+get_index_stattargets(Oid indexid, IndexInfo *indInfo)
+{
+ NullableDatum *stattargets;
+
+ /* Extract statistic targets for each attribute */
+ stattargets = palloc0_array(NullableDatum, indInfo->ii_NumIndexAttrs);
+ for (int i = 0; i < indInfo->ii_NumIndexAttrs; i++)
+ {
+ HeapTuple tp;
+ Datum dat;
+ bool isnull;
+
+ tp = SearchSysCache2(ATTNUM, ObjectIdGetDatum(indexid), Int16GetDatum(i + 1));
+ if (!HeapTupleIsValid(tp))
+ elog(ERROR, "cache lookup failed for attribute %d of relation %u",
+ i + 1, indexid);
+ dat = SysCacheGetAttr(ATTNUM, tp, Anum_pg_attribute_attstattarget, &isnull);
+ ReleaseSysCache(tp);
+ stattargets[i].value = dat;
+ stattargets[i].isnull = isnull;
+ }
+
+ return stattargets;
+}
+
/*
* index_concurrently_build
*
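The index.c change is a pure extraction: the per-attribute syscache loop moves out of index_concurrently_create_copy() into get_index_stattargets(), which returns an allocated array of nullable values. A standalone sketch of that shape (stub lookup instead of SearchSysCache2(), plain calloc instead of palloc0_array):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Simplified stand-in for NullableDatum: a value plus a null flag. */
typedef struct ToyNullableDatum
{
	long		value;
	bool		isnull;
} ToyNullableDatum;

/* Stub for the per-attribute catalog lookup; returns false for "NULL". */
typedef bool (*attr_lookup_fn) (int attnum, long *value_out);

/*
 * Shape of get_index_stattargets(): allocate one slot per index attribute
 * and fill each from a catalog lookup, recording NULLs explicitly.
 */
static ToyNullableDatum *
get_stattargets(int natts, attr_lookup_fn lookup)
{
	ToyNullableDatum *targets = calloc(natts, sizeof(ToyNullableDatum));

	for (int i = 0; i < natts; i++)
	{
		long		val;
		bool		found = lookup(i + 1, &val);

		targets[i].isnull = !found;
		targets[i].value = found ? val : 0;
	}
	return targets;
}

/* Example lookup: attribute 2 has no statistics target (NULL). */
static bool
sample_lookup(int attnum, long *value_out)
{
	if (attnum == 2)
		return false;
	*value_out = attnum * 100;
	return true;
}
```

The refactoring lets the concurrent-repack path reuse the same extraction when it rebuilds indexes on the new heap.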
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index b8209b2acd..c301d83d9b 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1249,16 +1249,17 @@ CREATE VIEW pg_stat_progress_cluster AS
WHEN 2 THEN 'index scanning heap'
WHEN 3 THEN 'sorting tuples'
WHEN 4 THEN 'writing new heap'
- WHEN 5 THEN 'swapping relation files'
- WHEN 6 THEN 'rebuilding index'
- WHEN 7 THEN 'performing final cleanup'
+ -- 5 is 'catch-up', but that should not appear here.
+ WHEN 6 THEN 'swapping relation files'
+ WHEN 7 THEN 'rebuilding index'
+ WHEN 8 THEN 'performing final cleanup'
END AS phase,
CAST(S.param3 AS oid) AS cluster_index_relid,
S.param4 AS heap_tuples_scanned,
S.param5 AS heap_tuples_written,
- S.param6 AS heap_blks_total,
- S.param7 AS heap_blks_scanned,
- S.param8 AS index_rebuild_count
+ S.param8 AS heap_blks_total,
+ S.param9 AS heap_blks_scanned,
+ S.param10 AS index_rebuild_count
FROM pg_stat_get_progress_info('CLUSTER') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
@@ -1275,16 +1276,19 @@ CREATE VIEW pg_stat_progress_repack AS
WHEN 2 THEN 'index scanning heap'
WHEN 3 THEN 'sorting tuples'
WHEN 4 THEN 'writing new heap'
- WHEN 5 THEN 'swapping relation files'
- WHEN 6 THEN 'rebuilding index'
- WHEN 7 THEN 'performing final cleanup'
+ WHEN 5 THEN 'catch-up'
+ WHEN 6 THEN 'swapping relation files'
+ WHEN 7 THEN 'rebuilding index'
+ WHEN 8 THEN 'performing final cleanup'
END AS phase,
CAST(S.param3 AS oid) AS repack_index_relid,
S.param4 AS heap_tuples_scanned,
- S.param5 AS heap_tuples_written,
- S.param6 AS heap_blks_total,
- S.param7 AS heap_blks_scanned,
- S.param8 AS index_rebuild_count
+ S.param5 AS heap_tuples_inserted,
+ S.param6 AS heap_tuples_updated,
+ S.param7 AS heap_tuples_deleted,
+ S.param8 AS heap_blks_total,
+ S.param9 AS heap_blks_scanned,
+ S.param10 AS index_rebuild_count
FROM pg_stat_get_progress_info('REPACK') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
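The view changes renumber the later phases because REPACK inserts a new phase 5, 'catch-up', while pg_stat_progress_cluster keeps the same numbering but never reports it. A small standalone sketch of the mapping (the name of phase 1 is assumed from the unchanged part of the view, which this hunk does not show):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Phase numbers as used by both progress views after the renumbering. */
enum
{
	PH_CATCH_UP = 5,
	PH_SWAP = 6,
	PH_REBUILD = 7,
	PH_CLEANUP = 8,
};

static const char *
repack_phase_name(int phase)
{
	switch (phase)
	{
		case 1: return "seq scanning heap";	/* assumed, not in this hunk */
		case 2: return "index scanning heap";
		case 3: return "sorting tuples";
		case 4: return "writing new heap";
		case PH_CATCH_UP: return "catch-up";
		case PH_SWAP: return "swapping relation files";
		case PH_REBUILD: return "rebuilding index";
		case PH_CLEANUP: return "performing final cleanup";
		default: return NULL;
	}
}

/* The CLUSTER view shares the numbering but should never see 'catch-up'. */
static const char *
cluster_phase_name(int phase)
{
	return (phase == PH_CATCH_UP) ? NULL : repack_phase_name(phase);
}
```

Keeping one shared numbering (rather than reusing 5 for different phases in the two views) means a single progress-parameter encoding works for both commands.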
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index d0f2588a97..592ff6041b 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -25,6 +25,10 @@
#include "access/toast_internals.h"
#include "access/transam.h"
#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xloginsert.h"
+#include "access/xlogutils.h"
#include "catalog/catalog.h"
#include "catalog/dependency.h"
#include "catalog/heap.h"
@@ -32,6 +36,7 @@
#include "catalog/namespace.h"
#include "catalog/objectaccess.h"
#include "catalog/pg_am.h"
+#include "catalog/pg_control.h"
#include "catalog/pg_inherits.h"
#include "catalog/toasting.h"
#include "commands/cluster.h"
@@ -39,10 +44,15 @@
#include "commands/progress.h"
#include "commands/tablecmds.h"
#include "commands/vacuum.h"
+#include "executor/executor.h"
#include "miscadmin.h"
#include "optimizer/optimizer.h"
#include "pgstat.h"
+#include "replication/decode.h"
+#include "replication/logical.h"
+#include "replication/snapbuild.h"
#include "storage/bufmgr.h"
+#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
#include "utils/acl.h"
@@ -76,14 +86,96 @@ typedef struct
((cmd) == CLUSTER_COMMAND_REPACK ? \
"repack" : "vacuum"))
+/*
+ * The following definitions are used for concurrent processing.
+ */
+
+/*
+ * OID of the table being repacked by this backend.
+ */
+static Oid repacked_rel = InvalidOid;
+/* The same for its TOAST relation. */
+static Oid repacked_rel_toast = InvalidOid;
+
+/*
+ * The locators are used to avoid logical decoding of data that we do not need
+ * for our table.
+ */
+RelFileLocator repacked_rel_locator = {.relNumber = InvalidOid};
+RelFileLocator repacked_rel_toast_locator = {.relNumber = InvalidOid};
+
+#define REPACK_CONCURRENT_IN_PROGRESS_MSG \
+ "relation \"%s\" is already being processed by REPACK CONCURRENTLY"
+
+/*
+ * Everything we need to call ExecInsertIndexTuples().
+ */
+typedef struct IndexInsertState
+{
+ ResultRelInfo *rri;
+ EState *estate;
+ ExprContext *econtext;
+
+ Relation ident_index;
+} IndexInsertState;
+
+/*
+ * Catalog information to check if another backend changed the relation in
+ * such a way that makes CLUSTER CONCURRENTLY unable to continue. Such changes
+ * are possible because cluster_rel() has to release its lock on the relation
+ * in order to acquire AccessExclusiveLock that it needs to swap the relation
+ * files.
+ *
+ * The most obvious problem is that the tuple descriptor has changed, since
+ * then the tuples we try to insert into the new storage are not guaranteed to
+ * fit into the storage.
+ *
+ * Another problem is a relfilenode change by another backend. It's not
+ * necessarily a correctness issue (e.g. when the other backend ran
+ * cluster_rel()), but it's safer for us to terminate the table processing in
+ * such cases. However, this information also needs to be checked during
+ * logical decoding, so we store it in the global variables repacked_rel_locator
+ * and repacked_rel_toast_locator above.
+ *
+ * Where possible, commands which might change the relation in an incompatible
+ * way should check if REPACK CONCURRENTLY is running, before they start to do
+ * the actual changes (see is_concurrent_repack_in_progress()). Anything else
+ * must be caught by check_catalog_changes(), which uses this structure.
+ */
+typedef struct CatalogState
+{
+ /* Tuple descriptor of the relation. */
+ TupleDesc tupdesc;
+
+ /* The number of indexes tracked. */
+ int ninds;
+ /* The index OIDs. */
+ Oid *ind_oids;
+ /* The index tuple descriptors. */
+ TupleDesc *ind_tupdescs;
+
+ /* The following are copies of the corresponding fields of pg_class. */
+ char relpersistence;
+ char replident;
+
+ /* rd_replidindex */
+ Oid replidindex;
+} CatalogState;
+
+/* The WAL segment being decoded. */
+static XLogSegNo repack_current_segment = 0;
+
static void cluster_multiple_rels(List *rtcs, ClusterParams *params,
- ClusterCommand cmd);
+ ClusterCommand cmd, LOCKMODE lockmode,
+ bool isTopLevel);
static void rebuild_relation(Relation OldHeap, Relation index, bool verbose,
- ClusterCommand cmd);
+ ClusterCommand cmd, bool concurrent);
static void copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
+ Snapshot snapshot, LogicalDecodingContext *decoding_ctx,
bool verbose, ClusterCommand cmd,
bool *pSwapToastByContent,
- TransactionId *pFreezeXid, MultiXactId *pCutoffMulti);
+ TransactionId *pFreezeXid,
+ MultiXactId *pCutoffMulti);
static List *get_tables_to_cluster(MemoryContext cluster_context);
static List *get_tables_to_repack(MemoryContext repack_context);
static List *get_tables_to_cluster_partitioned(MemoryContext cluster_context,
@@ -91,8 +183,91 @@ static List *get_tables_to_cluster_partitioned(MemoryContext cluster_context,
ClusterCommand cmd);
static bool cluster_is_permitted_for_relation(Oid relid, Oid userid,
ClusterCommand cmd);
+static void begin_concurrent_repack(Relation *rel_p, Relation *index_p,
+ bool *entered_p);
+static void end_concurrent_repack(bool error);
+static void cluster_before_shmem_exit_callback(int code, Datum arg);
+static CatalogState *get_catalog_state(Relation rel);
+static void free_catalog_state(CatalogState *state);
+static void check_catalog_changes(Relation rel, CatalogState *cat_state);
+static LogicalDecodingContext *setup_logical_decoding(Oid relid,
+ const char *slotname,
+ TupleDesc tupdesc);
+static HeapTuple get_changed_tuple(char *change);
+static void apply_concurrent_changes(RepackDecodingState *dstate,
+ Relation rel, ScanKey key, int nkeys,
+ IndexInsertState *iistate);
+static void apply_concurrent_insert(Relation rel, ConcurrentChange *change,
+ HeapTuple tup, IndexInsertState *iistate,
+ TupleTableSlot *index_slot);
+static void apply_concurrent_update(Relation rel, HeapTuple tup,
+ HeapTuple tup_target,
+ ConcurrentChange *change,
+ IndexInsertState *iistate,
+ TupleTableSlot *index_slot);
+static void apply_concurrent_delete(Relation rel, HeapTuple tup_target,
+ ConcurrentChange *change);
+static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
+ HeapTuple tup_key,
+ IndexInsertState *iistate,
+ TupleTableSlot *ident_slot,
+ IndexScanDesc *scan_p);
+static void process_concurrent_changes(LogicalDecodingContext *ctx,
+ XLogRecPtr end_of_wal,
+ Relation rel_dst,
+ Relation rel_src,
+ ScanKey ident_key,
+ int ident_key_nentries,
+ IndexInsertState *iistate);
+static IndexInsertState *get_index_insert_state(Relation relation,
+ Oid ident_index_id);
+static ScanKey build_identity_key(Oid ident_idx_oid, Relation rel_src,
+ int *nentries);
+static void free_index_insert_state(IndexInsertState *iistate);
+static void cleanup_logical_decoding(LogicalDecodingContext *ctx);
+static void rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
+ Relation cl_index,
+ CatalogState *cat_state,
+ LogicalDecodingContext *ctx,
+ bool swap_toast_by_content,
+ TransactionId frozenXid,
+ MultiXactId cutoffMulti);
+static List *build_new_indexes(Relation NewHeap, Relation OldHeap, List *OldIndexes);
+
+/*
+ * Use this API when a relation needs to be unlocked, closed and re-opened. If
+ * the relation got dropped while being unlocked, raise ERROR that mentions
+ * the relation name rather than OID.
+ */
+typedef struct RelReopenInfo
+{
+ /*
+ * The relation to be closed. A pointer to the value is stored here so
+ * that the caller's reference is updated automatically on re-opening.
+ *
+ * When calling unlock_and_close_relations(), 'relid' can be passed
+ * instead of 'rel_p' when the caller only needs to gather information for
+ * subsequent opening.
+ */
+ Relation *rel_p;
+ Oid relid;
+
+ char relkind;
+ LOCKMODE lockmode_orig; /* The existing lock mode */
+ LOCKMODE lockmode_new; /* The lock mode after the relation is
+ * re-opened */
+
+ char *relname; /* Relation name, initialized automatically. */
+} RelReopenInfo;
+
+static void init_rel_reopen_info(RelReopenInfo *rri, Relation *rel_p,
+ Oid relid, LOCKMODE lockmode_orig,
+ LOCKMODE lockmode_new);
+static void unlock_and_close_relations(RelReopenInfo *rels, int nrel);
+static void reopen_relations(RelReopenInfo *rels, int nrel);
static Relation process_single_relation(RangeVar *relation, char *indexname,
- ClusterCommand cmd,
+ ClusterCommand cmd, LOCKMODE lockmode,
+ bool isTopLevel,
ClusterParams *params,
Oid *indexOid_p);
@@ -151,8 +326,9 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
if (stmt->relation != NULL)
{
rel = process_single_relation(stmt->relation, stmt->indexname,
- CLUSTER_COMMAND_CLUSTER, ¶ms,
- &indexOid);
+ CLUSTER_COMMAND_CLUSTER,
+ AccessExclusiveLock, isTopLevel,
+ ¶ms, &indexOid);
if (rel == NULL)
return;
}
@@ -202,7 +378,8 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
}
/* Do the job. */
- cluster_multiple_rels(rtcs, ¶ms, CLUSTER_COMMAND_CLUSTER);
+ cluster_multiple_rels(rtcs, ¶ms, CLUSTER_COMMAND_CLUSTER,
+ AccessExclusiveLock, isTopLevel);
/* Start a new transaction for the cleanup work. */
StartTransactionCommand();
@@ -219,8 +396,8 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
* return.
*/
static void
-cluster_multiple_rels(List *rtcs, ClusterParams *params,
- ClusterCommand cmd)
+cluster_multiple_rels(List *rtcs, ClusterParams *params, ClusterCommand cmd,
+ LOCKMODE lockmode, bool isTopLevel)
{
ListCell *lc;
@@ -240,10 +417,10 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params,
/* functions in indexes may want a snapshot set */
PushActiveSnapshot(GetTransactionSnapshot());
- rel = table_open(rtc->tableOid, AccessExclusiveLock);
+ rel = table_open(rtc->tableOid, lockmode);
/* Process this table */
- cluster_rel(rel, rtc->indexOid, params, cmd);
+ cluster_rel(rel, rtc->indexOid, params, cmd, isTopLevel);
/* cluster_rel closes the relation, but keeps lock */
PopActiveSnapshot();
@@ -267,12 +444,18 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params,
* instead of index order. This is the new implementation of VACUUM FULL,
* and error messages should refer to the operation as VACUUM not CLUSTER.
*
+ * Note that, in the concurrent case, the function releases the lock at some
+ * point, in order to get AccessExclusiveLock for the final steps (i.e. to
+ * swap the relation files). To make things simpler, the caller should expect
+ * OldHeap to be closed on return, regardless of CLUOPT_CONCURRENT. (The
+ * AccessExclusiveLock is kept till the end of the transaction.)
+ *
* 'cmd' indicates which command is being executed. REPACK should be the only
* caller of this function in the future.
*/
void
cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
- ClusterCommand cmd)
+ ClusterCommand cmd, bool isTopLevel)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid save_userid;
@@ -282,8 +465,53 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
bool recheck = ((params->options & CLUOPT_RECHECK) != 0);
Relation index;
const char *cmd_str = CLUSTER_COMMAND_STR(cmd);
+ bool concurrent = ((params->options & CLUOPT_CONCURRENT) != 0);
+ LOCKMODE lmode;
+ bool entered, success;
+
+ /*
+ * Check that the correct lock is held. The lock mode is
+ * AccessExclusiveLock for normal processing and ShareUpdateExclusiveLock
+ * for concurrent processing (so that SELECT, INSERT, UPDATE and DELETE
+ * commands work, but cluster_rel() cannot be called concurrently for the
+ * same relation).
+ */
+ lmode = !concurrent ? AccessExclusiveLock : ShareUpdateExclusiveLock;
+
+ /*
+ * Skip the relation if it's being processed concurrently. In such a case,
+ * we cannot rely on a lock because the other backend needs to release it
+ * temporarily at some point.
+ *
+ * This check should not take place until we have a lock that prevents
+ * another backend from starting REPACK CONCURRENTLY after our check.
+ */
+ Assert(CheckRelationLockedByMe(OldHeap, lmode, false));
+ if (is_concurrent_repack_in_progress(tableOid))
+ {
+ ereport(NOTICE,
+ (errmsg(REPACK_CONCURRENT_IN_PROGRESS_MSG,
+ RelationGetRelationName(OldHeap))));
+ table_close(OldHeap, lmode);
+ return;
+ }
+
+ /* There are specific requirements on concurrent processing. */
+ if (concurrent)
+ {
+ /*
+ * Make sure we have no XID assigned, otherwise the call to
+ * setup_logical_decoding() can cause a deadlock.
+ *
+ * The existence of a transaction block does not actually imply that an
+ * XID was already assigned, but it very likely was. We might want to
+ * check the result of GetCurrentTransactionIdIfAny() instead, but that
+ * would be less clear from the user's perspective.
+ */
+ PreventInTransactionBlock(isTopLevel, "REPACK CONCURRENTLY");
- Assert(CheckRelationLockedByMe(OldHeap, AccessExclusiveLock, false));
+ can_repack_concurrently(OldHeap);
+ }
/* Check for user-requested abort. */
CHECK_FOR_INTERRUPTS();
@@ -333,7 +561,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
/* Check that the user still has privileges for the relation */
if (!cluster_is_permitted_for_relation(tableOid, save_userid, cmd))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
goto out;
}
@@ -348,7 +576,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
*/
if (RELATION_IS_OTHER_TEMP(OldHeap))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
goto out;
}
@@ -359,7 +587,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
*/
if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(indexOid)))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
goto out;
}
@@ -370,7 +598,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
if ((params->options & CLUOPT_RECHECK_ISCLUSTERED) != 0 &&
!get_index_isclustered(indexOid))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
goto out;
}
}
@@ -390,6 +618,11 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("cannot %s a shared catalog", cmd_str)));
+ /*
+ * The CONCURRENTLY case should have been rejected earlier because it does
+ * not support system catalogs.
+ */
+ Assert(!(OldHeap->rd_rel->relisshared && concurrent));
/*
* Don't process temp tables of other backends ... their local buffer
@@ -411,8 +644,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
if (OidIsValid(indexOid))
{
/* verify the index is good and lock it */
- check_index_is_clusterable(OldHeap, indexOid, AccessExclusiveLock,
- cmd);
+ check_index_is_clusterable(OldHeap, indexOid, lmode, cmd);
/* also open it */
index = index_open(indexOid, NoLock);
}
@@ -429,7 +661,8 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
if (OldHeap->rd_rel->relkind == RELKIND_MATVIEW &&
!RelationIsPopulated(OldHeap))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ index_close(index, lmode);
+ relation_close(OldHeap, lmode);
goto out;
}
@@ -442,11 +675,42 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
* invalid, because we move tuples around. Promote them to relation
* locks. Predicate locks on indexes will be promoted when they are
* reindexed.
+ *
+ * During concurrent processing, the heap as well as its indexes stay in
+ * operation, so we postpone this step until they are locked using
+ * AccessExclusiveLock near the end of the processing.
*/
- TransferPredicateLocksToHeapRelation(OldHeap);
+ if (!concurrent)
+ TransferPredicateLocksToHeapRelation(OldHeap);
/* rebuild_relation does all the dirty work */
- rebuild_relation(OldHeap, index, verbose, cmd);
+ entered = false;
+ success = false;
+ PG_TRY();
+ {
+ /*
+ * For concurrent processing, make sure other transactions treat this
+ * table as if it was a system / user catalog, and WAL the relevant
+ * additional information. ERROR is raised if another backend is
+ * processing the same table.
+ */
+ if (concurrent)
+ {
+ Relation *index_p = index ? &index : NULL;
+
+ begin_concurrent_repack(&OldHeap, index_p, &entered);
+ }
+
+ rebuild_relation(OldHeap, index, verbose, cmd, concurrent);
+ success = true;
+ }
+ PG_FINALLY();
+ {
+ if (concurrent && entered)
+ end_concurrent_repack(!success);
+ }
+ PG_END_TRY();
+
/* rebuild_relation closes OldHeap, and index if valid */
out:
@@ -595,19 +859,86 @@ mark_index_clustered(Relation rel, Oid indexOid, bool is_internal)
table_close(pg_index, RowExclusiveLock);
}
+/*
+ * Check if the CONCURRENTLY option is legal for the relation.
+ */
+void
+can_repack_concurrently(Relation rel)
+{
+ char relpersistence, replident;
+ Oid ident_idx;
+
+ /* Data changes in system relations are not logically decoded. */
+ if (IsCatalogRelation(rel))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot repack relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("REPACK CONCURRENTLY is not supported for catalog relations.")));
+
+ /*
+ * reorderbuffer.c does not seem to handle processing of a TOAST relation
+ * alone.
+ */
+ if (IsToastRelation(rel))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot repack relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("REPACK (CONCURRENTLY) is not supported for TOAST relations, unless the main relation is repacked too.")));
+
+ relpersistence = rel->rd_rel->relpersistence;
+ if (relpersistence != RELPERSISTENCE_PERMANENT)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot repack relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("REPACK (CONCURRENTLY) is only allowed for permanent relations.")));
+
+ /* With NOTHING, WAL does not contain the old tuple. */
+ replident = rel->rd_rel->relreplident;
+ if (replident == REPLICA_IDENTITY_NOTHING)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot repack relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("Relation \"%s\" has insufficient replication identity.",
+ RelationGetRelationName(rel))));
+
+ /*
+ * Identity index is not set if the replica identity is FULL, but PK might
+ * exist in such a case.
+ */
+ ident_idx = RelationGetReplicaIndex(rel);
+ if (!OidIsValid(ident_idx) && OidIsValid(rel->rd_pkindex))
+ ident_idx = rel->rd_pkindex;
+ if (!OidIsValid(ident_idx))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot process relation \"%s\"",
+ RelationGetRelationName(rel)),
+ (errhint("Relation \"%s\" has no identity index.",
+ RelationGetRelationName(rel)))));
+}
+
/*
* rebuild_relation: rebuild an existing relation in index or physical order
*
- * OldHeap: table to rebuild.
+ * OldHeap: table to rebuild. See cluster_rel() for comments on the required
+ * lock strength.
+ *
* index: index to cluster by, or NULL to rewrite in physical order.
*
- * On entry, heap and index (if one is given) must be open, and
- * AccessExclusiveLock held on them.
- * On exit, they are closed, but locks on them are not released.
+ * On entry, heap and index (if one is given) must be open, and the
+ * appropriate lock held on them (AccessExclusiveLock for exclusive processing
+ * and ShareUpdateExclusiveLock for concurrent processing).
+ *
+ * On exit, they are closed, but still locked with AccessExclusiveLock. (The
+ * function handles the lock upgrade if 'concurrent' is true.)
*/
static void
rebuild_relation(Relation OldHeap, Relation index, bool verbose,
- ClusterCommand cmd)
+ ClusterCommand cmd, bool concurrent)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid accessMethod = OldHeap->rd_rel->relam;
@@ -615,13 +946,81 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose,
Oid OIDNewHeap;
Relation NewHeap;
char relpersistence;
- bool is_system_catalog;
bool swap_toast_by_content;
TransactionId frozenXid;
MultiXactId cutoffMulti;
+ NameData slotname;
+ LogicalDecodingContext *ctx = NULL;
+ Snapshot snapshot = NULL;
+ CatalogState *cat_state = NULL;
+ LOCKMODE lmode;
+
+ lmode = !concurrent ? AccessExclusiveLock : ShareUpdateExclusiveLock;
+
+ Assert(CheckRelationLockedByMe(OldHeap, lmode, false) &&
+ (index == NULL || CheckRelationLockedByMe(index, lmode, false)));
+
+ if (concurrent)
+ {
+ TupleDesc tupdesc;
+ RelReopenInfo rri[2];
+ int nrel;
+
+ /*
+ * REPACK CONCURRENTLY is not allowed in a transaction block, so this
+ * should never fire.
+ */
+ Assert(GetTopTransactionIdIfAny() == InvalidTransactionId);
- Assert(CheckRelationLockedByMe(OldHeap, AccessExclusiveLock, false) &&
- (index == NULL || CheckRelationLockedByMe(index, AccessExclusiveLock, false)));
+ /*
+ * A single backend should not execute multiple REPACK commands at a
+ * time, so use PID to make the slot unique.
+ */
+ snprintf(NameStr(slotname), NAMEDATALEN, "repack_%d", MyProcPid);
+
+ /*
+ * Gather catalog information so that we can check later if the old
+ * relation has not changed while unlocked.
+ *
+ * Since this function also checks if the relation can be processed,
+ * it's important to call it before we spend a notable amount of time
+ * setting up the logical decoding. It is not clear whether it needs to
+ * be called even earlier.
+ */
+ cat_state = get_catalog_state(OldHeap);
+
+ tupdesc = CreateTupleDescCopy(RelationGetDescr(OldHeap));
+
+ /*
+ * Unlock the relation (and possibly the clustering index) to avoid
+ * deadlock because setup_logical_decoding() will wait for all the
+ * running transactions (with XID assigned) to finish. Some of those
+ * transactions might be waiting for a lock on our relation.
+ */
+ nrel = 0;
+ init_rel_reopen_info(&rri[nrel++], &OldHeap, InvalidOid,
+ ShareUpdateExclusiveLock,
+ ShareUpdateExclusiveLock);
+ if (index)
+ init_rel_reopen_info(&rri[nrel++], &index, InvalidOid,
+ ShareUpdateExclusiveLock,
+ ShareUpdateExclusiveLock);
+ unlock_and_close_relations(rri, nrel);
+
+ /* Prepare to capture the concurrent data changes. */
+ ctx = setup_logical_decoding(tableOid, NameStr(slotname), tupdesc);
+
+ /* Lock the table (and index) again. */
+ reopen_relations(rri, nrel);
+
+ /*
+ * Check if 'tupdesc' could have changed while the relation was
+ * unlocked.
+ */
+ check_catalog_changes(OldHeap, cat_state);
+
+ snapshot = SnapBuildInitialSnapshotForRepack(ctx->snapshot_builder);
+ }
if (index)
/* Mark the correct index as clustered */
@@ -629,7 +1028,6 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose,
/* Remember info about rel before closing OldHeap */
relpersistence = OldHeap->rd_rel->relpersistence;
- is_system_catalog = IsSystemRelation(OldHeap);
/*
* Create the transient table that will receive the re-ordered data.
@@ -645,30 +1043,51 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose,
NewHeap = table_open(OIDNewHeap, NoLock);
/* Copy the heap data into the new table in the desired order */
- copy_table_data(NewHeap, OldHeap, index, verbose, cmd,
- &swap_toast_by_content, &frozenXid, &cutoffMulti);
+ copy_table_data(NewHeap, OldHeap, index, snapshot, ctx, verbose,
+ cmd, &swap_toast_by_content, &frozenXid, &cutoffMulti);
+ if (concurrent)
+ {
+ rebuild_relation_finish_concurrent(NewHeap, OldHeap, index,
+ cat_state, ctx,
+ swap_toast_by_content,
+ frozenXid, cutoffMulti);
+
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_FINAL_CLEANUP);
+
+ /* Done with decoding. */
+ FreeSnapshot(snapshot);
+ free_catalog_state(cat_state);
+ cleanup_logical_decoding(ctx);
+ ReplicationSlotRelease();
+ ReplicationSlotDrop(NameStr(slotname), false);
+ }
+ else
+ {
+ bool is_system_catalog = IsSystemRelation(OldHeap);
- /* Close relcache entries, but keep lock until transaction commit */
- table_close(OldHeap, NoLock);
- if (index)
- index_close(index, NoLock);
+ /* Close relcache entries, but keep lock until transaction commit */
+ table_close(OldHeap, NoLock);
+ if (index)
+ index_close(index, NoLock);
- /*
- * Close the new relation so it can be dropped as soon as the storage is
- * swapped. The relation is not visible to others, so no need to unlock it
- * explicitly.
- */
- table_close(NewHeap, NoLock);
+ /*
+ * Close the new relation so it can be dropped as soon as the storage
+ * is swapped. The relation is not visible to others, so no need to
+ * unlock it explicitly.
+ */
+ table_close(NewHeap, NoLock);
- /*
- * Swap the physical files of the target and transient tables, then
- * rebuild the target's indexes and throw away the transient table.
- */
- finish_heap_swap(tableOid, OIDNewHeap, is_system_catalog,
- swap_toast_by_content, false, true,
- frozenXid, cutoffMulti,
- relpersistence);
+ /*
+ * Swap the physical files of the target and transient tables, then
+ * rebuild the target's indexes and throw away the transient table.
+ */
+ finish_heap_swap(tableOid, OIDNewHeap, is_system_catalog,
+ swap_toast_by_content, false, true, true,
+ frozenXid, cutoffMulti,
+ relpersistence);
+ }
}
@@ -803,14 +1222,18 @@ make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
/*
* Do the physical copying of table data.
*
+ * 'snapshot' and 'decoding_ctx': see table_relation_copy_for_cluster(). Pass
+ * iff concurrent processing is required.
+ *
* There are three output parameters:
* *pSwapToastByContent is set true if toast tables must be swapped by content.
* *pFreezeXid receives the TransactionId used as freeze cutoff point.
* *pCutoffMulti receives the MultiXactId used as a cutoff point.
*/
static void
-copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verbose,
- ClusterCommand cmd, bool *pSwapToastByContent,
+copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
+ Snapshot snapshot, LogicalDecodingContext *decoding_ctx,
+ bool verbose, ClusterCommand cmd, bool *pSwapToastByContent,
TransactionId *pFreezeXid, MultiXactId *pCutoffMulti)
{
Relation relRelation;
@@ -829,6 +1252,7 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
const char *cmd_str = CLUSTER_COMMAND_STR(cmd);
PGRUsage ru0;
char *nspname;
+ bool concurrent = snapshot != NULL;
pg_rusage_init(&ru0);
@@ -855,8 +1279,12 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
*
* We don't need to open the toast relation here, just lock it. The lock
* will be held till end of transaction.
+ *
+ * In the REPACK CONCURRENTLY case, the lock does not help because we need
+ * to release it temporarily at some point. Instead, we expect VACUUM /
+ * CLUSTER to skip tables which are present in RepackedRelsHash.
*/
- if (OldHeap->rd_rel->reltoastrelid)
+ if (OldHeap->rd_rel->reltoastrelid && !concurrent)
LockRelationOid(OldHeap->rd_rel->reltoastrelid, AccessExclusiveLock);
/*
@@ -932,8 +1360,48 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
* provided, else plain seqscan.
*/
if (OldIndex != NULL && OldIndex->rd_rel->relam == BTREE_AM_OID)
+ {
+ ResourceOwner oldowner = NULL;
+ ResourceOwner resowner = NULL;
+
+ /*
+ * In the CONCURRENT case, use a dedicated resource owner so we don't
+ * leave any additional locks behind us that we cannot release easily.
+ */
+ if (concurrent)
+ {
+ Assert(CheckRelationLockedByMe(OldHeap, ShareUpdateExclusiveLock,
+ false));
+ Assert(CheckRelationLockedByMe(OldIndex, ShareUpdateExclusiveLock,
+ false));
+
+ resowner = ResourceOwnerCreate(CurrentResourceOwner,
+ "plan_cluster_use_sort");
+ oldowner = CurrentResourceOwner;
+ CurrentResourceOwner = resowner;
+ }
+
use_sort = plan_cluster_use_sort(RelationGetRelid(OldHeap),
RelationGetRelid(OldIndex));
+
+ if (concurrent)
+ {
+ CurrentResourceOwner = oldowner;
+
+ /*
+ * We are primarily concerned about locks, but if the planner
+ * happened to allocate any other resources, we should release
+ * them too because we're going to delete the whole resowner.
+ */
+ ResourceOwnerRelease(resowner, RESOURCE_RELEASE_BEFORE_LOCKS,
+ false, false);
+ ResourceOwnerRelease(resowner, RESOURCE_RELEASE_LOCKS,
+ false, false);
+ ResourceOwnerRelease(resowner, RESOURCE_RELEASE_AFTER_LOCKS,
+ false, false);
+ ResourceOwnerDelete(resowner);
+ }
+ }
else
use_sort = false;
@@ -965,7 +1433,9 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
* values (e.g. because the AM doesn't use freezing).
*/
table_relation_copy_for_cluster(OldHeap, NewHeap, OldIndex, use_sort,
- cutoffs.OldestXmin, &cutoffs.FreezeLimit,
+ cutoffs.OldestXmin, snapshot,
+ decoding_ctx,
+ &cutoffs.FreezeLimit,
&cutoffs.MultiXactCutoff,
&num_tuples, &tups_vacuumed,
&tups_recently_dead);
@@ -974,7 +1444,11 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
*pFreezeXid = cutoffs.FreezeLimit;
*pCutoffMulti = cutoffs.MultiXactCutoff;
- /* Reset rd_toastoid just to be tidy --- it shouldn't be looked at again */
+ /*
+ * Reset rd_toastoid just to be tidy --- it shouldn't be looked at
+ * again. In the CONCURRENTLY case, we need to set it again before
+ * applying the concurrent changes.
+ */
NewHeap->rd_toastoid = InvalidOid;
num_pages = RelationGetNumberOfBlocks(NewHeap);
@@ -1427,14 +1901,13 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
bool swap_toast_by_content,
bool check_constraints,
bool is_internal,
+ bool reindex,
TransactionId frozenXid,
MultiXactId cutoffMulti,
char newrelpersistence)
{
ObjectAddress object;
Oid mapped_tables[4];
- int reindex_flags;
- ReindexParams reindex_params = {0};
int i;
/* Report that we are now swapping relation files */
@@ -1460,39 +1933,46 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
if (is_system_catalog)
CacheInvalidateCatalog(OIDOldHeap);
- /*
- * Rebuild each index on the relation (but not the toast table, which is
- * all-new at this point). It is important to do this before the DROP
- * step because if we are processing a system catalog that will be used
- * during DROP, we want to have its indexes available. There is no
- * advantage to the other order anyway because this is all transactional,
- * so no chance to reclaim disk space before commit. We do not need a
- * final CommandCounterIncrement() because reindex_relation does it.
- *
- * Note: because index_build is called via reindex_relation, it will never
- * set indcheckxmin true for the indexes. This is OK even though in some
- * sense we are building new indexes rather than rebuilding existing ones,
- * because the new heap won't contain any HOT chains at all, let alone
- * broken ones, so it can't be necessary to set indcheckxmin.
- */
- reindex_flags = REINDEX_REL_SUPPRESS_INDEX_USE;
- if (check_constraints)
- reindex_flags |= REINDEX_REL_CHECK_CONSTRAINTS;
+ if (reindex)
+ {
+ int reindex_flags;
+ ReindexParams reindex_params = {0};
- /*
- * Ensure that the indexes have the same persistence as the parent
- * relation.
- */
- if (newrelpersistence == RELPERSISTENCE_UNLOGGED)
- reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
- else if (newrelpersistence == RELPERSISTENCE_PERMANENT)
- reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+ /*
+ * Rebuild each index on the relation (but not the toast table, which
+ * is all-new at this point). It is important to do this before the
+ * DROP step because if we are processing a system catalog that will
+ * be used during DROP, we want to have its indexes available. There
+ * is no advantage to the other order anyway because this is all
+ * transactional, so no chance to reclaim disk space before commit.
+ * We do not need a final CommandCounterIncrement() because
+ * reindex_relation does it.
+ *
+ * Note: because index_build is called via reindex_relation, it will never
+ * set indcheckxmin true for the indexes. This is OK even though in some
+ * sense we are building new indexes rather than rebuilding existing ones,
+ * because the new heap won't contain any HOT chains at all, let alone
+ * broken ones, so it can't be necessary to set indcheckxmin.
+ */
+ reindex_flags = REINDEX_REL_SUPPRESS_INDEX_USE;
+ if (check_constraints)
+ reindex_flags |= REINDEX_REL_CHECK_CONSTRAINTS;
- /* Report that we are now reindexing relations */
- pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
- PROGRESS_REPACK_PHASE_REBUILD_INDEX);
+ /*
+ * Ensure that the indexes have the same persistence as the parent
+ * relation.
+ */
+ if (newrelpersistence == RELPERSISTENCE_UNLOGGED)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
+ else if (newrelpersistence == RELPERSISTENCE_PERMANENT)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+
+ /* Report that we are now reindexing relations */
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_REBUILD_INDEX);
- reindex_relation(NULL, OIDOldHeap, reindex_flags, &reindex_params);
+ reindex_relation(NULL, OIDOldHeap, reindex_flags, &reindex_params);
+ }
/* Report that we are now doing clean up */
pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
@@ -1804,89 +2284,1975 @@ cluster_is_permitted_for_relation(Oid relid, Oid userid, ClusterCommand cmd)
return false;
}
+#define REPL_PLUGIN_NAME "pgoutput_repack"
+
/*
- * REPACK is intended to be a replacement of both CLUSTER and VACUUM FULL.
+ * Each relation being processed by REPACK CONCURRENTLY must be in the
+ * repackedRels hashtable.
*/
+typedef struct RepackedRel
+{
+ Oid relid;
+ Oid dbid;
+} RepackedRel;
+
+static HTAB *RepackedRelsHash = NULL;
+
+/* Maximum number of entries in the hashtable. */
+static int maxRepackedRels = 0;
+
+Size
+RepackShmemSize(void)
+{
+ /*
+ * A replication slot is needed for the processing, so use this GUC to
+ * allocate memory for the hashtable.
+ */
+ maxRepackedRels = max_replication_slots;
+
+ return hash_estimate_size(maxRepackedRels, sizeof(RepackedRel));
+}
+
void
-repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
+RepackShmemInit(void)
{
- ListCell *lc;
- ClusterParams params = {0};
- bool verbose = false;
- Relation rel = NULL;
- Oid indexOid = InvalidOid;
- MemoryContext repack_context;
- List *rtcs;
+ HASHCTL info;
- /* Parse option list */
- foreach(lc, stmt->params)
- {
- DefElem *opt = (DefElem *) lfirst(lc);
+ info.keysize = sizeof(RepackedRel);
+ info.entrysize = info.keysize;
- if (strcmp(opt->defname, "verbose") == 0)
- verbose = defGetBoolean(opt);
- else
- ereport(ERROR,
- (errcode(ERRCODE_SYNTAX_ERROR),
- errmsg("unrecognized REPACK option \"%s\"",
- opt->defname),
- parser_errposition(pstate, opt->location)));
+ RepackedRelsHash = ShmemInitHash("Repacked Relations",
+ maxRepackedRels,
+ maxRepackedRels,
+ &info,
+ HASH_ELEM | HASH_BLOBS);
+}
+
+/*
+ * Call this function before REPACK CONCURRENTLY starts, to set up logical
+ * decoding. It makes sure that other users of the table put enough
+ * information into WAL.
+ *
+ * The point is that on various places we expect that the table we're
+ * processing is treated like a system catalog. For example, we need to be
+ * able to scan it using a "historic snapshot" anytime during the processing
+ * (as opposed to scanning only at the start point of the decoding, as logical
+ * replication does during initial table synchronization), in order to apply
+ * concurrent UPDATE / DELETE commands.
+ *
+ * Since we need to close and reopen the relation here, the 'rel_p' and
+ * 'index_p' arguments are in/out.
+ *
+ * 'entered_p' receives a bool value telling whether the relation OID was
+ * entered into the hashtable or not.
+ */
+static void
+begin_concurrent_repack(Relation *rel_p, Relation *index_p,
+ bool *entered_p)
+{
+ Relation rel = *rel_p;
+ Oid relid, toastrelid;
+ RepackedRel key, *entry;
+ bool found;
+ RelReopenInfo rri[2];
+ int nrel;
+ static bool before_shmem_exit_callback_setup = false;
+
+ relid = RelationGetRelid(rel);
+
+ /*
+ * Make sure that we do not leave an entry in RepackedRelsHash if exiting
+ * due to FATAL.
+ */
+ if (!before_shmem_exit_callback_setup)
+ {
+ before_shmem_exit(cluster_before_shmem_exit_callback, 0);
+ before_shmem_exit_callback_setup = true;
}
- params.options = (verbose ? CLUOPT_VERBOSE : 0);
+ memset(&key, 0, sizeof(key));
+ key.relid = relid;
+ key.dbid = MyDatabaseId;
- if (stmt->relation != NULL)
+ *entered_p = false;
+ LWLockAcquire(RepackedRelsLock, LW_EXCLUSIVE);
+ entry = (RepackedRel *)
+ hash_search(RepackedRelsHash, &key, HASH_ENTER_NULL, &found);
+ if (found)
{
- rel = process_single_relation(stmt->relation, stmt->indexname,
- CLUSTER_COMMAND_REPACK, ¶ms,
- &indexOid);
- if (rel == NULL)
- return;
+ /*
+ * Since REPACK CONCURRENTLY takes ShareRowExclusiveLock, a conflict
+ * should occur much earlier. However that lock may be released
+ * temporarily, see below. Anyway, we should complain whatever the
+ * reason of the conflict might be.
+ */
+ ereport(ERROR,
+ (errmsg(REPACK_CONCURRENT_IN_PROGRESS_MSG,
+ RelationGetRelationName(rel))));
}
+ if (entry == NULL)
+ ereport(ERROR,
+ (errmsg("too many requests for REPACK CONCURRENTLY at a time")),
+ (errhint("Please consider increasing the \"max_replication_slots\" configuration parameter.")));
/*
- * By here, we know we are in a multi-table situation. In order to avoid
- * holding locks for too long, we want to process each table in its own
- * transaction. This forces us to disallow running inside a user
- * transaction block.
+ * Even if the insertion of the TOAST relid fails below, the caller
+ * has to do the cleanup.
*/
- PreventInTransactionBlock(isTopLevel, "REPACK");
+ *entered_p = true;
- /* Also, we need a memory context to hold our list of relations */
- repack_context = AllocSetContextCreate(PortalContext,
- "Repack",
- ALLOCSET_DEFAULT_SIZES);
+ /*
+ * Enable the callback to remove the entry in case of exit. We should not
+ * do this earlier, otherwise an attempt to insert an already existing
+ * entry could make us remove that entry (inserted by another backend)
+ * during ERROR handling.
+ */
+ Assert(!OidIsValid(repacked_rel));
+ repacked_rel = relid;
- params.options |= CLUOPT_RECHECK;
- if (rel != NULL)
+ /*
+ * TOAST relation is not accessed using historic snapshot, but we enter it
+ * here to protect it from being VACUUMed by another backend. (Lock does
+ * not help in the CONCURRENTLY case because we cannot hold it continuously
+ * until the end of the transaction.) See the comments on locking TOAST
+ * relation in copy_table_data().
+ */
+ toastrelid = rel->rd_rel->reltoastrelid;
+ if (OidIsValid(toastrelid))
{
- Oid relid;
- bool rel_is_index;
+ key.relid = toastrelid;
+ entry = (RepackedRel *)
+ hash_search(RepackedRelsHash, &key, HASH_ENTER_NULL, &found);
+ if (found)
+ /*
+ * If we managed to enter the main relation, entering the TOAST
+ * relation should succeed too. Nevertheless, check.
+ */
+ ereport(ERROR,
+ (errmsg("TOAST relation of \"%s\" is already being processed by REPACK CONCURRENTLY",
+ RelationGetRelationName(rel))));
+ if (entry == NULL)
+ ereport(ERROR,
+ (errmsg("too many requests for REPACK CONCURRENTLY at a time")),
+ (errhint("Please consider increasing the \"max_replication_slots\" configuration parameter.")));
- Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
+ Assert(!OidIsValid(repacked_rel_toast));
+ repacked_rel_toast = toastrelid;
+ }
+ LWLockRelease(RepackedRelsLock);
- if (OidIsValid(indexOid))
- {
- relid = indexOid;
- rel_is_index = true;
- }
- else
- {
- relid = RelationGetRelid(rel);
- rel_is_index = false;
- }
- rtcs = get_tables_to_cluster_partitioned(repack_context, relid,
- rel_is_index,
- CLUSTER_COMMAND_REPACK);
+ /*
+ * Make sure that other backends are aware of the new hash entry.
+ *
+ * Besides sending the invalidation message, we need to force re-opening
+ * of the relation, which includes the actual invalidation (and thus
+ * checking of our hashtable on the next access).
+ */
+ CacheInvalidateRelcacheImmediate(rel);
+ /*
+ * Since the hashtable only needs to be checked by write transactions,
+ * lock the relation in a mode that conflicts with any DML command. (The
+ * reading transactions are supposed to close the relation before opening
+ * it with higher lock.) Once we have the relation (and its index) locked,
+ * we unlock it immediately and then re-lock using the original mode.
+ */
+ nrel = 0;
+ init_rel_reopen_info(&rri[nrel++], rel_p, InvalidOid,
+ ShareUpdateExclusiveLock, ShareLock);
+ if (index_p)
+ {
+ /*
+ * Another transaction might want to open both the relation and the
+ * index. If it already has the relation lock and is waiting for the
+ * index lock, we should release the index lock, otherwise our request
+ * for ShareLock on the relation can end up in a deadlock.
+ */
+ init_rel_reopen_info(&rri[nrel++], index_p, InvalidOid,
+ ShareUpdateExclusiveLock, ShareLock);
+ }
+ unlock_and_close_relations(rri, nrel);
+ /*
+ * XXX It's not strictly necessary to lock the index here, but it's
+ * probably not worth teaching the "reopen API" about this special case.
+ */
+ reopen_relations(rri, nrel);
+
+ /* Switch back to the original lock. */
+ nrel = 0;
+ init_rel_reopen_info(&rri[nrel++], rel_p, InvalidOid,
+ ShareLock, ShareUpdateExclusiveLock);
+ if (index_p)
+ init_rel_reopen_info(&rri[nrel++], index_p, InvalidOid,
+ ShareLock, ShareUpdateExclusiveLock);
+ unlock_and_close_relations(rri, nrel);
+ reopen_relations(rri, nrel);
+ /* Make sure the reopened relcache entry is used, not the old one. */
+ rel = *rel_p;
+
+ /* Avoid logical decoding of other relations by this backend. */
+ repacked_rel_locator = rel->rd_locator;
+ if (OidIsValid(toastrelid))
+ {
+ Relation toastrel;
+
+ /* Avoid logical decoding of other TOAST relations. */
+ toastrel = table_open(toastrelid, AccessShareLock);
+ repacked_rel_toast_locator = toastrel->rd_locator;
+ table_close(toastrel, AccessShareLock);
+ }
+}
+
+/*
+ * Call this when done with REPACK CONCURRENTLY.
+ *
+ * 'error' tells whether the function is being called in order to handle
+ * error.
+ */
+static void
+end_concurrent_repack(bool error)
+{
+ RepackedRel key;
+ RepackedRel *entry = NULL, *entry_toast = NULL;
+ Oid relid = repacked_rel;
+ Oid toastrelid = repacked_rel_toast;
+
+ /* Remove the relation from the hash if we managed to insert one. */
+ if (OidIsValid(repacked_rel))
+ {
+ memset(&key, 0, sizeof(key));
+ key.relid = repacked_rel;
+ key.dbid = MyDatabaseId;
+ LWLockAcquire(RepackedRelsLock, LW_EXCLUSIVE);
+ entry = hash_search(RepackedRelsHash, &key, HASH_REMOVE, NULL);
+
+ /*
+ * By clearing this variable we also disable
+ * cluster_before_shmem_exit_callback().
+ */
+ repacked_rel = InvalidOid;
+ }
+
+ /* Remove the TOAST relation if there is one. */
+ if (OidIsValid(repacked_rel_toast))
+ {
+ key.relid = repacked_rel_toast;
+ entry_toast = hash_search(RepackedRelsHash, &key, HASH_REMOVE,
+ NULL);
+
+ repacked_rel_toast = InvalidOid;
+ }
+ LWLockRelease(RepackedRelsLock);
+
+ /* Restore normal function of logical decoding. */
+ repacked_rel_locator.relNumber = InvalidOid;
+ repacked_rel_toast_locator.relNumber = InvalidOid;
+
+ /*
+ * On normal completion (!error), we should not really fail to remove the
+ * entry. But if it wasn't there for any reason, raise ERROR to make sure
+ * the transaction is aborted: if other transactions, while changing the
+ * contents of the relation, didn't know that REPACK CONCURRENTLY was in
+ * progress, they might have failed to write enough information to WAL,
+ * and thus we could have produced inconsistent table contents.
+ *
+ * On the other hand, if we are already handling an error, there's no
+ * reason to worry about inconsistent contents of the new storage because
+ * the transaction is going to be rolled back anyway. Furthermore, by
+ * raising ERROR here we'd shadow the original error.
+ */
+ if (!error)
+ {
+ char *relname;
+
+ if (OidIsValid(relid) && entry == NULL)
+ {
+ relname = get_rel_name(relid);
+ if (!relname)
+ ereport(ERROR,
+ (errmsg("cache lookup failed for relation %u",
+ relid)));
+
+ ereport(ERROR,
+ (errmsg("relation \"%s\" not found among repacked relations",
+ relname)));
+ }
+
+ /*
+ * Likewise, the TOAST relation should not have disappeared.
+ */
+ if (OidIsValid(toastrelid) && entry_toast == NULL)
+ {
+ relname = get_rel_name(key.relid);
+ if (!relname)
+ ereport(ERROR,
+ (errmsg("cache lookup failed for relation %u",
+ key.relid)));
+
+ ereport(ERROR,
+ (errmsg("relation \"%s\" not found among repacked relations",
+ relname)));
+ }
+ }
+
+ /*
+ * Note: unlike begin_concurrent_repack(), here we do not lock/unlock the
+ * relation: 1) On normal completion, the caller is already holding
+ * AccessExclusiveLock (till the end of the transaction), 2) on ERROR /
+ * FATAL, we try to do the cleanup asap, but the worst case is that other
+ * backends will write unnecessary information to WAL until they close the
+ * relation.
+ */
+}
+
+/*
+ * A wrapper to call end_concurrent_repack() as a before_shmem_exit callback.
+ */
+static void
+cluster_before_shmem_exit_callback(int code, Datum arg)
+{
+ if (OidIsValid(repacked_rel) || OidIsValid(repacked_rel_toast))
+ end_concurrent_repack(true);
+}
+
+/*
+ * Check if relation is currently being processed by REPACK CONCURRENTLY.
+ */
+bool
+is_concurrent_repack_in_progress(Oid relid)
+{
+ RepackedRel key, *entry;
+
+ memset(&key, 0, sizeof(key));
+ key.relid = relid;
+ key.dbid = MyDatabaseId;
+
+ LWLockAcquire(RepackedRelsLock, LW_SHARED);
+ entry = (RepackedRel *)
+ hash_search(RepackedRelsHash, &key, HASH_FIND, NULL);
+ LWLockRelease(RepackedRelsLock);
+
+ return entry != NULL;
+}
+
+/*
+ * Check if REPACK CONCURRENTLY is already running for given relation, and if
+ * so, raise ERROR. The problem is that cluster_rel() needs to release its
+ * lock on the relation temporarily at some point, so our lock alone does not
+ * help. Commands that might break what cluster_rel() is doing should call
+ * this function first.
+ *
+ * Return without checking if lockmode allows for race conditions which would
+ * make the result meaningless. In that case, cluster_rel() itself should
+ * throw ERROR if the relation was changed by us in an incompatible
+ * way. However, if it managed to do most of its work by then, a lot of CPU
+ * time might be wasted.
+ */
+void
+check_for_concurrent_repack(Oid relid, LOCKMODE lockmode)
+{
+ /*
+ * If the caller does not have a lock that conflicts with
+ * ShareUpdateExclusiveLock, the check makes little sense because REPACK
+ * CONCURRENTLY can start anytime after the check.
+ */
+ if (lockmode < ShareUpdateExclusiveLock)
+ return;
+
+ /*
+ * The caller has a lock which conflicts with REPACK CONCURRENTLY, so if
+ * that's not running now, it cannot start until the caller's transaction
+ * has completed.
+ */
+ if (is_concurrent_repack_in_progress(relid))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg(REPACK_CONCURRENT_IN_PROGRESS_MSG,
+ get_rel_name(relid))));
+
+}
+
+/*
+ * Check if relation is eligible for REPACK CONCURRENTLY and retrieve the
+ * catalog state to be passed later to check_catalog_changes.
+ *
+ * Caller is supposed to hold (at least) ShareUpdateExclusiveLock on the
+ * relation.
+ */
+static CatalogState *
+get_catalog_state(Relation rel)
+{
+ CatalogState *result = palloc_object(CatalogState);
+ List *ind_oids;
+ ListCell *lc;
+ int ninds, i;
+ char relpersistence = rel->rd_rel->relpersistence;
+ char replident = rel->rd_rel->relreplident;
+ Oid ident_idx = RelationGetReplicaIndex(rel);
+ TupleDesc td_src = RelationGetDescr(rel);
+
+ /*
+ * While gathering the catalog information, check if there is a reason not
+ * to proceed.
+ *
+ * This function was already called earlier, but the relation has been
+ * unlocked since then (see begin_concurrent_repack()).
+ * check_catalog_changes() should catch any "disruptive" changes in the
+ * future.
+ */
+ can_repack_concurrently(rel);
+
+ /* No index should be dropped while we are checking it. */
+ Assert(CheckRelationLockedByMe(rel, ShareUpdateExclusiveLock, true));
+
+ ind_oids = RelationGetIndexList(rel);
+ result->ninds = ninds = list_length(ind_oids);
+ result->ind_oids = palloc_array(Oid, ninds);
+ result->ind_tupdescs = palloc_array(TupleDesc, ninds);
+ i = 0;
+ foreach(lc, ind_oids)
+ {
+ Oid ind_oid = lfirst_oid(lc);
+ Relation index;
+ TupleDesc td_ind_src, td_ind_dst;
+
+ /*
+ * A weaker lock should be OK for the index, but this one should not
+ * break anything either.
+ */
+ index = index_open(ind_oid, ShareUpdateExclusiveLock);
+
+ result->ind_oids[i] = RelationGetRelid(index);
+ td_ind_src = RelationGetDescr(index);
+ td_ind_dst = palloc(TupleDescSize(td_ind_src));
+ TupleDescCopy(td_ind_dst, td_ind_src);
+ result->ind_tupdescs[i] = td_ind_dst;
+ i++;
+
+ index_close(index, ShareUpdateExclusiveLock);
+ }
+
+ /* Fill-in the relation info. */
+ result->tupdesc = palloc(TupleDescSize(td_src));
+ TupleDescCopy(result->tupdesc, td_src);
+ result->relpersistence = relpersistence;
+ result->replident = replident;
+ result->replidindex = ident_idx;
+
+ return result;
+}
+
+static void
+free_catalog_state(CatalogState *state)
+{
+ /* We are only interested in indexes. */
+ if (state->ninds == 0)
+ return;
+
+ for (int i = 0; i < state->ninds; i++)
+ FreeTupleDesc(state->ind_tupdescs[i]);
+
+ FreeTupleDesc(state->tupdesc);
+ pfree(state->ind_oids);
+ pfree(state->ind_tupdescs);
+ pfree(state);
+}
+
+/*
+ * Raise ERROR if 'rel' changed in a way that does not allow further
+ * processing of REPACK CONCURRENTLY.
+ *
+ * Besides the relation's tuple descriptor, it's important to check indexes:
+ * concurrent change of index definition (can it happen in other way than
+ * dropping and re-creating the index, accidentally with the same OID?) can be
+ * a problem because we may already have the new index built. If an index was
+ * created or dropped concurrently, we'd fail to swap the index storage. In
+ * any case, we prefer to check the indexes early to get an explicit error
+ * message about the mismatch. Furthermore, the earlier we detect the change,
+ * the fewer CPU cycles we waste.
+ *
+ * Note that we do not check constraints because the transaction which changed
+ * them must have ensured that the existing tuples satisfy the new
+ * constraints. If any DML commands were necessary for that, we will simply
+ * decode them from WAL and apply them to the new storage.
+ *
+ * Caller is supposed to hold (at least) ShareUpdateExclusiveLock on the
+ * relation.
+ */
+static void
+check_catalog_changes(Relation rel, CatalogState *cat_state)
+{
+ Oid reltoastrelid = rel->rd_rel->reltoastrelid;
+ List *ind_oids;
+ ListCell *lc;
+ LOCKMODE lockmode;
+ Oid ident_idx;
+ TupleDesc td, td_cp;
+
+ /* First, check the relation info. */
+
+ /* TOAST is not easy to change, but check. */
+ if (reltoastrelid != repacked_rel_toast)
+ ereport(ERROR,
+ errmsg("TOAST relation of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+
+ /*
+ * Likewise, check_for_concurrent_repack() should prevent others from
+ * changing the relation file concurrently, but it's our responsibility to
+ * avoid data loss. (The original locators are stored outside cat_state,
+ * but the check belongs to this function.)
+ */
+ if (!RelFileLocatorEquals(rel->rd_locator, repacked_rel_locator))
+ ereport(ERROR,
+ (errmsg("file of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel))));
+ if (OidIsValid(reltoastrelid))
+ {
+ Relation toastrel;
+
+ toastrel = table_open(reltoastrelid, AccessShareLock);
+ if (!RelFileLocatorEquals(toastrel->rd_locator,
+ repacked_rel_toast_locator))
+ ereport(ERROR,
+ (errmsg("file of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(toastrel))));
+ table_close(toastrel, AccessShareLock);
+ }
+
+ if (rel->rd_rel->relpersistence != cat_state->relpersistence)
+ ereport(ERROR,
+ errmsg("persistence of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+
+ if (cat_state->replident != rel->rd_rel->relreplident)
+ ereport(ERROR,
+ errmsg("replica identity of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+
+ ident_idx = RelationGetReplicaIndex(rel);
+ if (ident_idx == InvalidOid && rel->rd_pkindex != InvalidOid)
+ ident_idx = rel->rd_pkindex;
+ if (cat_state->replidindex != ident_idx)
+ ereport(ERROR,
+ errmsg("identity index of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+
+ /*
+ * As cat_state contains a copy (which has the constraint info cleared),
+ * create a temporary copy for the comparison.
+ */
+ td = RelationGetDescr(rel);
+ td_cp = palloc(TupleDescSize(td));
+ TupleDescCopy(td_cp, td);
+ if (!equalTupleDescs(cat_state->tupdesc, td_cp))
+ ereport(ERROR,
+ errmsg("definition of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+ FreeTupleDesc(td_cp);
+
+ /* Now we are only interested in indexes. */
+ if (cat_state->ninds == 0)
+ return;
+
+ /* No index should be dropped while we are checking the relation. */
+ lockmode = ShareUpdateExclusiveLock;
+ Assert(CheckRelationLockedByMe(rel, lockmode, true));
+
+ ind_oids = RelationGetIndexList(rel);
+ if (list_length(ind_oids) != cat_state->ninds)
+ goto failed_index;
+
+ foreach(lc, ind_oids)
+ {
+ Oid ind_oid = lfirst_oid(lc);
+ int i;
+ TupleDesc tupdesc;
+ Relation index;
+
+ /* Find the index in cat_state. */
+ for (i = 0; i < cat_state->ninds; i++)
+ {
+ if (cat_state->ind_oids[i] == ind_oid)
+ break;
+ }
+ /*
+ * OID not found, i.e. the index was replaced by another one. XXX
+ * Should we still try to find whether an index with the desired tuple
+ * descriptor exists? Or should we always look for the tuple
+ * descriptor and not use OIDs at all?
+ */
+ if (i == cat_state->ninds)
+ goto failed_index;
+
+ /* Check the tuple descriptor. */
+ index = try_index_open(ind_oid, lockmode);
+ if (index == NULL)
+ goto failed_index;
+ tupdesc = RelationGetDescr(index);
+ if (!equalTupleDescs(cat_state->ind_tupdescs[i], tupdesc))
+ goto failed_index;
+ index_close(index, lockmode);
+ }
+
+ return;
+
+failed_index:
+ ereport(ERROR,
+ (errmsg("index(es) of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel))));
+}
+
+/*
+ * This function is much like pg_create_logical_replication_slot() except that
+ * the new slot is neither released (if anyone else could read changes from
+ * our slot, we could miss changes other backends do while we copy the
+ * existing data into temporary table), nor persisted (it's easier to handle
+ * crash by restarting all the work from scratch).
+ */
+static LogicalDecodingContext *
+setup_logical_decoding(Oid relid, const char *slotname, TupleDesc tupdesc)
+{
+ LogicalDecodingContext *ctx;
+ RepackDecodingState *dstate;
+
+ /*
+ * Check if we can use logical decoding.
+ */
+ CheckSlotPermissions();
+ CheckLogicalDecodingRequirements();
+
+ /* RS_TEMPORARY so that the slot gets cleaned up on ERROR. */
+ ReplicationSlotCreate(slotname, true, RS_TEMPORARY, false, false, false);
+
+ /*
+ * Neither prepare_write nor do_write callback nor update_progress is
+ * useful for us.
+ *
+ * Regarding the value of need_full_snapshot, we pass false because the
+ * table we are processing is present in RepackedRelsHash and therefore,
+ * regarding logical decoding, treated like a catalog.
+ */
+ ctx = CreateInitDecodingContext(REPL_PLUGIN_NAME,
+ NIL,
+ false,
+ InvalidXLogRecPtr,
+ XL_ROUTINE(.page_read = read_local_xlog_page,
+ .segment_open = wal_segment_open,
+ .segment_close = wal_segment_close),
+ NULL, NULL, NULL);
+
+ /*
+ * We don't have control on setting fast_forward, so at least check it.
+ */
+ Assert(!ctx->fast_forward);
+
+ DecodingContextFindStartpoint(ctx);
+
+ /* Some WAL records should have been read. */
+ Assert(ctx->reader->EndRecPtr != InvalidXLogRecPtr);
+
+ XLByteToSeg(ctx->reader->EndRecPtr, repack_current_segment,
+ wal_segment_size);
+
+ /*
+ * Setup structures to store decoded changes.
+ */
+ dstate = palloc0(sizeof(RepackDecodingState));
+ dstate->relid = relid;
+ dstate->tstore = tuplestore_begin_heap(false, false,
+ maintenance_work_mem);
+ dstate->tupdesc = tupdesc;
+
+ /* Initialize the descriptor to store the changes ... */
+ dstate->tupdesc_change = CreateTemplateTupleDesc(1);
+
+ TupleDescInitEntry(dstate->tupdesc_change, 1, NULL, BYTEAOID, -1, 0);
+ /* ... as well as the corresponding slot. */
+ dstate->tsslot = MakeSingleTupleTableSlot(dstate->tupdesc_change,
+ &TTSOpsMinimalTuple);
+
+ dstate->resowner = ResourceOwnerCreate(CurrentResourceOwner,
+ "logical decoding");
+
+ ctx->output_writer_private = dstate;
+ return ctx;
+}
+
+/*
+ * Retrieve tuple from ConcurrentChange structure.
+ *
+ * The input data starts with the structure but it might not be appropriately
+ * aligned.
+ */
+static HeapTuple
+get_changed_tuple(char *change)
+{
+ HeapTupleData tup_data;
+ HeapTuple result;
+ char *src;
+
+ /*
+ * Ensure alignment before accessing the fields. (This is why we can't use
+ * heap_copytuple() instead of this function.)
+ */
+ src = change + offsetof(ConcurrentChange, tup_data);
+ memcpy(&tup_data, src, sizeof(HeapTupleData));
+
+ result = (HeapTuple) palloc(HEAPTUPLESIZE + tup_data.t_len);
+ memcpy(result, &tup_data, sizeof(HeapTupleData));
+ result->t_data = (HeapTupleHeader) ((char *) result + HEAPTUPLESIZE);
+ src = change + SizeOfConcurrentChange;
+ memcpy(result->t_data, src, result->t_len);
+
+ return result;
+}
+
+/*
+ * Decode logical changes from the WAL sequence up to end_of_wal.
+ */
+void
+repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
+ XLogRecPtr end_of_wal)
+{
+ RepackDecodingState *dstate;
+ ResourceOwner resowner_old;
+ PgBackendProgress progress;
+
+ /*
+ * Invalidate the "present" cache before moving to "(recent) history".
+ */
+ InvalidateSystemCaches();
+
+ dstate = (RepackDecodingState *) ctx->output_writer_private;
+ resowner_old = CurrentResourceOwner;
+ CurrentResourceOwner = dstate->resowner;
+
+ /*
+ * reorderbuffer.c uses internal subtransaction, whose abort ends the
+ * command progress reporting. Save the status here so we can restore when
+ * done with the decoding.
+ */
+ memcpy(&progress, &MyBEEntry->st_progress, sizeof(PgBackendProgress));
+
+ PG_TRY();
+ {
+ while (ctx->reader->EndRecPtr < end_of_wal)
+ {
+ XLogRecord *record;
+ XLogSegNo segno_new;
+ char *errm = NULL;
+ XLogRecPtr end_lsn;
+
+ record = XLogReadRecord(ctx->reader, &errm);
+ if (errm)
+ elog(ERROR, "%s", errm);
+
+ if (record != NULL)
+ LogicalDecodingProcessRecord(ctx, ctx->reader);
+
+ /*
+ * If a WAL segment boundary has been crossed, inform the decoding
+ * system that the catalog_xmin can advance. (We can confirm more
+ * often, but filling a single WAL segment should not take much
+ * time.)
+ */
+ end_lsn = ctx->reader->EndRecPtr;
+ XLByteToSeg(end_lsn, segno_new, wal_segment_size);
+ if (segno_new != repack_current_segment)
+ {
+ LogicalConfirmReceivedLocation(end_lsn);
+ elog(DEBUG1, "REPACK: confirmed receive location %X/%X",
+ (uint32) (end_lsn >> 32), (uint32) end_lsn);
+ repack_current_segment = segno_new;
+ }
+
+ CHECK_FOR_INTERRUPTS();
+ }
+ InvalidateSystemCaches();
+ CurrentResourceOwner = resowner_old;
+ }
+ PG_CATCH();
+ {
+ InvalidateSystemCaches();
+ CurrentResourceOwner = resowner_old;
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+
+ /* Restore the progress reporting status. */
+ pgstat_progress_restore_state(&progress);
+}
+
+/*
+ * Apply changes that happened during the initial load.
+ *
+ * Scan key is passed by caller, so it does not have to be constructed
+ * multiple times. Key entries have all fields initialized, except for
+ * sk_argument.
+ */
+static void
+apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
+ ScanKey key, int nkeys, IndexInsertState *iistate)
+{
+ TupleTableSlot *index_slot, *ident_slot;
+ HeapTuple tup_old = NULL;
+
+ if (dstate->nchanges == 0)
+ return;
+
+ /* TupleTableSlot is needed to pass the tuple to ExecInsertIndexTuples(). */
+ index_slot = MakeSingleTupleTableSlot(dstate->tupdesc, &TTSOpsHeapTuple);
+ iistate->econtext->ecxt_scantuple = index_slot;
+
+ /* A slot to fetch tuples from identity index. */
+ ident_slot = table_slot_create(rel, NULL);
+
+ while (tuplestore_gettupleslot(dstate->tstore, true, false,
+ dstate->tsslot))
+ {
+ bool shouldFree;
+ HeapTuple tup_change,
+ tup,
+ tup_exist;
+ char *change_raw, *src;
+ ConcurrentChange change;
+ bool isnull[1];
+ Datum values[1];
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Get the change from the single-column tuple. */
+ tup_change = ExecFetchSlotHeapTuple(dstate->tsslot, false, &shouldFree);
+ heap_deform_tuple(tup_change, dstate->tupdesc_change, values, isnull);
+ Assert(!isnull[0]);
+
+ /* Make sure we access aligned data. */
+ change_raw = (char *) DatumGetByteaP(values[0]);
+ src = (char *) VARDATA(change_raw);
+ memcpy(&change, src, SizeOfConcurrentChange);
+
+ /* A TRUNCATE change contains no tuple, so process it separately. */
+ if (change.kind == CHANGE_TRUNCATE)
+ {
+ /*
+ * All the things that ExecuteTruncateGuts() does (such as firing
+ * triggers or handling the DROP_CASCADE behavior) should have
+ * taken place on the source relation. Thus we only do the actual
+ * truncation of the new relation (and its indexes).
+ */
+ heap_truncate_one_rel(rel);
+
+ pfree(tup_change);
+ continue;
+ }
+
+ /*
+ * Extract the tuple from the change. The tuple is copied here because
+ * it might be assigned to 'tup_old', in which case it needs to
+ * survive into the next iteration.
+ */
+ tup = get_changed_tuple(src);
+
+ if (change.kind == CHANGE_UPDATE_OLD)
+ {
+ Assert(tup_old == NULL);
+ tup_old = tup;
+ }
+ else if (change.kind == CHANGE_INSERT)
+ {
+ Assert(tup_old == NULL);
+
+ apply_concurrent_insert(rel, &change, tup, iistate, index_slot);
+
+ pfree(tup);
+ }
+ else if (change.kind == CHANGE_UPDATE_NEW ||
+ change.kind == CHANGE_DELETE)
+ {
+ IndexScanDesc ind_scan = NULL;
+ HeapTuple tup_key;
+
+ if (change.kind == CHANGE_UPDATE_NEW)
+ {
+ tup_key = tup_old != NULL ? tup_old : tup;
+ }
+ else
+ {
+ Assert(tup_old == NULL);
+ tup_key = tup;
+ }
+
+ /*
+ * Find the tuple to be updated or deleted.
+ */
+ tup_exist = find_target_tuple(rel, key, nkeys, tup_key,
+ iistate, ident_slot, &ind_scan);
+ if (tup_exist == NULL)
+ elog(ERROR, "failed to find target tuple");
+
+ if (change.kind == CHANGE_UPDATE_NEW)
+ apply_concurrent_update(rel, tup, tup_exist, &change, iistate,
+ index_slot);
+ else
+ apply_concurrent_delete(rel, tup_exist, &change);
+
+ if (tup_old != NULL)
+ {
+ pfree(tup_old);
+ tup_old = NULL;
+ }
+
+ pfree(tup);
+ index_endscan(ind_scan);
+ }
+ else
+ elog(ERROR, "unrecognized kind of change: %d", change.kind);
+
+ /* If there's any change, make it visible to the next iteration. */
+ if (change.kind != CHANGE_UPDATE_OLD)
+ {
+ CommandCounterIncrement();
+ UpdateActiveSnapshotCommandId();
+ }
+
+ /* TTSOpsMinimalTuple has .get_heap_tuple==NULL. */
+ Assert(shouldFree);
+ pfree(tup_change);
+ }
+
+ tuplestore_clear(dstate->tstore);
+ dstate->nchanges = 0;
+
+ /* Cleanup. */
+ ExecDropSingleTupleTableSlot(index_slot);
+ ExecDropSingleTupleTableSlot(ident_slot);
+}
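For readers skimming the patch, the control flow of the loop above may be easier to see in isolation. The sketch below is a toy model (invented names, an array standing in for the heap): a decoded UPDATE arrives as an (OLD, NEW) pair, the OLD half only stashes the identity key, and the NEW half consumes it, exactly as `tup_old` does above.

```c
/* Toy model of the change-application loop; not the patch's actual API. */
#include <assert.h>
#include <stddef.h>

enum kind { CH_INSERT, CH_UPDATE_OLD, CH_UPDATE_NEW, CH_DELETE };

struct change { enum kind kind; int id; int val; };
struct row { int id; int val; int live; };

#define NROWS 8
static struct row table[NROWS];
static int nrows = 0;

static struct row *find_row(int id)
{
	for (int i = 0; i < nrows; i++)
		if (table[i].live && table[i].id == id)
			return &table[i];
	return NULL;
}

/* Apply a stream of changes; 'old_id' plays the role of tup_old. */
void apply_changes(const struct change *ch, int n)
{
	int old_id = -1;			/* -1 ~ tup_old == NULL */

	for (int i = 0; i < n; i++)
	{
		switch (ch[i].kind)
		{
			case CH_INSERT:
				table[nrows++] = (struct row) { ch[i].id, ch[i].val, 1 };
				break;
			case CH_UPDATE_OLD:
				assert(old_id == -1);
				old_id = ch[i].id;	/* stash the identity key */
				break;
			case CH_UPDATE_NEW:
			{
				/* Use the stashed key if present, else the new tuple's. */
				int key = (old_id != -1) ? old_id : ch[i].id;
				struct row *r = find_row(key);

				assert(r != NULL);
				r->id = ch[i].id;
				r->val = ch[i].val;
				old_id = -1;
				break;
			}
			case CH_DELETE:
			{
				struct row *r = find_row(ch[i].id);

				assert(r != NULL);
				r->live = 0;
				break;
			}
		}
	}
}
```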
+
+static void
+apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
+ IndexInsertState *iistate, TupleTableSlot *index_slot)
+{
+ List *recheck;
+
+ heap_insert(rel, tup, GetCurrentCommandId(true), HEAP_INSERT_NO_LOGICAL, NULL);
+
+ /*
+ * Update indexes. Functions used by expression indexes may need the
+ * active snapshot, which the caller is expected to have set.
+ */
+ ExecStoreHeapTuple(tup, index_slot, false);
+ recheck = ExecInsertIndexTuples(iistate->rri,
+ index_slot,
+ iistate->estate,
+ false, /* update */
+ false, /* noDupErr */
+ NULL, /* specConflict */
+ NIL, /* arbiterIndexes */
+ false /* onlySummarizing */
+ );
+
+ /*
+ * If recheck is required, it must have been performed on the source
+ * relation by now. (All the logical changes we process here are already
+ * committed.)
+ */
+ list_free(recheck);
+
+ pgstat_progress_incr_param(PROGRESS_REPACK_HEAP_TUPLES_INSERTED, 1);
+}
+
+static void
+apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
+ ConcurrentChange *change, IndexInsertState *iistate,
+ TupleTableSlot *index_slot)
+{
+ List *recheck;
+ TU_UpdateIndexes update_indexes;
+
+ /*
+ * Write the new tuple into the new heap. ('tup' gets the TID assigned
+ * here.)
+ */
+ simple_heap_update(rel, &tup_target->t_self, tup, &update_indexes);
+
+ ExecStoreHeapTuple(tup, index_slot, false);
+
+ if (update_indexes != TU_None)
+ {
+ recheck = ExecInsertIndexTuples(iistate->rri,
+ index_slot,
+ iistate->estate,
+ true, /* update */
+ false, /* noDupErr */
+ NULL, /* specConflict */
+ NIL, /* arbiterIndexes */
+ /* onlySummarizing */
+ update_indexes == TU_Summarizing);
+ list_free(recheck);
+ }
+
+ pgstat_progress_incr_param(PROGRESS_REPACK_HEAP_TUPLES_UPDATED, 1);
+}
+
+static void
+apply_concurrent_delete(Relation rel, HeapTuple tup_target,
+ ConcurrentChange *change)
+{
+ simple_heap_delete(rel, &tup_target->t_self);
+
+ pgstat_progress_incr_param(PROGRESS_REPACK_HEAP_TUPLES_DELETED, 1);
+}
+
+/*
+ * Find the tuple to be updated or deleted.
+ *
+ * 'key' is a pre-initialized scan key, into which the function will put the
+ * key values.
+ *
+ * 'tup_key' is a tuple containing the key values for the scan.
+ *
+ * On exit, '*scan_p' contains the scan descriptor used. The caller must
+ * close it when the returned tuple is no longer needed.
+ */
+static HeapTuple
+find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
+ IndexInsertState *iistate,
+ TupleTableSlot *ident_slot, IndexScanDesc *scan_p)
+{
+ IndexScanDesc scan;
+ Form_pg_index ident_form;
+ int2vector *ident_indkey;
+ HeapTuple result = NULL;
+
+ scan = index_beginscan(rel, iistate->ident_index, GetActiveSnapshot(),
+ nkeys, 0);
+ *scan_p = scan;
+ index_rescan(scan, key, nkeys, NULL, 0);
+
+ /* Info needed to retrieve key values from heap tuple. */
+ ident_form = iistate->ident_index->rd_index;
+ ident_indkey = &ident_form->indkey;
+
+ /* Use the incoming tuple to finalize the scan key. */
+ for (int i = 0; i < scan->numberOfKeys; i++)
+ {
+ ScanKey entry;
+ bool isnull;
+ int16 attno_heap;
+
+ entry = &scan->keyData[i];
+ attno_heap = ident_indkey->values[i];
+ entry->sk_argument = heap_getattr(tup_key,
+ attno_heap,
+ rel->rd_att,
+ &isnull);
+ Assert(!isnull);
+ }
+ if (index_getnext_slot(scan, ForwardScanDirection, ident_slot))
+ {
+ bool shouldFree;
+
+ result = ExecFetchSlotHeapTuple(ident_slot, false, &shouldFree);
+ /* TTSOpsBufferHeapTuple has .get_heap_tuple != NULL. */
+ Assert(!shouldFree);
+ }
+
+ return result;
+}
+
+/*
+ * Decode and apply concurrent changes.
+ *
+ * Pass rel_src iff its reltoastrelid is needed.
+ */
+static void
+process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
+ Relation rel_dst, Relation rel_src, ScanKey ident_key,
+ int ident_key_nentries, IndexInsertState *iistate)
+{
+ RepackDecodingState *dstate;
+
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_CATCH_UP);
+
+ dstate = (RepackDecodingState *) ctx->output_writer_private;
+
+ repack_decode_concurrent_changes(ctx, end_of_wal);
+
+ if (dstate->nchanges == 0)
+ return;
+
+ PG_TRY();
+ {
+ /*
+ * Make sure that TOAST values can eventually be accessed via the old
+ * relation - see comment in copy_table_data().
+ */
+ if (rel_src)
+ rel_dst->rd_toastoid = rel_src->rd_rel->reltoastrelid;
+
+ apply_concurrent_changes(dstate, rel_dst, ident_key,
+ ident_key_nentries, iistate);
+ }
+ PG_FINALLY();
+ {
+ if (rel_src)
+ rel_dst->rd_toastoid = InvalidOid;
+ }
+ PG_END_TRY();
+}
+
+static IndexInsertState *
+get_index_insert_state(Relation relation, Oid ident_index_id)
+{
+ EState *estate;
+ int i;
+ IndexInsertState *result;
+
+ result = (IndexInsertState *) palloc0(sizeof(IndexInsertState));
+ estate = CreateExecutorState();
+ result->econtext = GetPerTupleExprContext(estate);
+
+ result->rri = (ResultRelInfo *) palloc(sizeof(ResultRelInfo));
+ InitResultRelInfo(result->rri, relation, 0, 0, 0);
+ ExecOpenIndices(result->rri, false);
+
+ /*
+ * Find the relcache entry of the identity index so that we spend no extra
+ * effort to open / close it.
+ */
+ for (i = 0; i < result->rri->ri_NumIndices; i++)
+ {
+ Relation ind_rel;
+
+ ind_rel = result->rri->ri_IndexRelationDescs[i];
+ if (ind_rel->rd_id == ident_index_id)
+ result->ident_index = ind_rel;
+ }
+ if (result->ident_index == NULL)
+ elog(ERROR, "failed to open identity index");
+
+ /* Only initialize fields needed by ExecInsertIndexTuples(). */
+ result->estate = estate;
+
+ return result;
+}
+
+/*
+ * Build scan key to process logical changes.
+ */
+static ScanKey
+build_identity_key(Oid ident_idx_oid, Relation rel_src, int *nentries)
+{
+ Relation ident_idx_rel;
+ Form_pg_index ident_idx;
+ int n,
+ i;
+ ScanKey result;
+
+ Assert(OidIsValid(ident_idx_oid));
+ ident_idx_rel = index_open(ident_idx_oid, AccessShareLock);
+ ident_idx = ident_idx_rel->rd_index;
+ n = ident_idx->indnatts;
+ result = (ScanKey) palloc(sizeof(ScanKeyData) * n);
+ for (i = 0; i < n; i++)
+ {
+ ScanKey entry;
+ int16 relattno;
+ Form_pg_attribute att;
+ Oid opfamily,
+ opcintype,
+ opno,
+ opcode;
+
+ entry = &result[i];
+ relattno = ident_idx->indkey.values[i];
+ if (relattno >= 1)
+ {
+ TupleDesc desc;
+
+ desc = rel_src->rd_att;
+ att = TupleDescAttr(desc, relattno - 1);
+ }
+ else
+ elog(ERROR, "unexpected attribute number %d in index", relattno);
+
+ opfamily = ident_idx_rel->rd_opfamily[i];
+ opcintype = ident_idx_rel->rd_opcintype[i];
+ opno = get_opfamily_member(opfamily, opcintype, opcintype,
+ BTEqualStrategyNumber);
+
+ if (!OidIsValid(opno))
+ elog(ERROR, "failed to find = operator for type %u", opcintype);
+
+ opcode = get_opcode(opno);
+ if (!OidIsValid(opcode))
+ elog(ERROR, "failed to find = operator for operator %u", opno);
+
+ /* Initialize everything but argument. */
+ ScanKeyInit(entry,
+ i + 1,
+ BTEqualStrategyNumber, opcode,
+ (Datum) NULL);
+ entry->sk_collation = att->attcollation;
+ }
+ index_close(ident_idx_rel, AccessShareLock);
+
+ *nentries = n;
+ return result;
+}
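The division of labor between build_identity_key() and find_target_tuple() above (build the key "skeleton" once, fill in only the argument values per incoming tuple) can be sketched in plain C. This is a hypothetical illustration with invented names; tuples are modeled as int arrays indexed by heap attribute number minus one.

```c
/* Sketch of the two-step scan-key handling; invented, simplified types. */
#include <assert.h>

#define MAXKEYS 4

struct scan_key
{
	int			heap_attno;		/* which heap column this key compares */
	int			argument;		/* the value to compare against */
};

/* Build the skeleton: which heap columns make up the identity key. */
int build_key(struct scan_key *key, const int *indkey, int natts)
{
	for (int i = 0; i < natts; i++)
	{
		key[i].heap_attno = indkey[i];
		key[i].argument = 0;	/* filled in later, per tuple */
	}
	return natts;
}

/* Fill the arguments from a "tuple" (array indexed by attno - 1). */
void fill_key(struct scan_key *key, int nkeys, const int *tuple)
{
	for (int i = 0; i < nkeys; i++)
		key[i].argument = tuple[key[i].heap_attno - 1];
}

/* Does 'tuple' match all key columns? */
int key_matches(const struct scan_key *key, int nkeys, const int *tuple)
{
	for (int i = 0; i < nkeys; i++)
		if (tuple[key[i].heap_attno - 1] != key[i].argument)
			return 0;
	return 1;
}
```

The point of the split is the same as in the patch: the per-change work is reduced to copying a few Datum values into a pre-built ScanKey array.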
+
+static void
+free_index_insert_state(IndexInsertState *iistate)
+{
+ ExecCloseIndices(iistate->rri);
+ FreeExecutorState(iistate->estate);
+ pfree(iistate->rri);
+ pfree(iistate);
+}
+
+static void
+cleanup_logical_decoding(LogicalDecodingContext *ctx)
+{
+ RepackDecodingState *dstate;
+
+ dstate = (RepackDecodingState *) ctx->output_writer_private;
+
+ ExecDropSingleTupleTableSlot(dstate->tsslot);
+ FreeTupleDesc(dstate->tupdesc_change);
+ FreeTupleDesc(dstate->tupdesc);
+ tuplestore_end(dstate->tstore);
+
+ FreeDecodingContext(ctx);
+}
+
+/*
+ * The final steps of rebuild_relation() for concurrent processing.
+ *
+ * On entry, NewHeap is locked in AccessExclusiveLock mode. OldHeap and its
+ * clustering index (if one is passed) are still locked in a mode that allows
+ * concurrent data changes. On exit, both tables and their indexes are closed,
+ * but locked in AccessExclusiveLock mode.
+ */
+static void
+rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
+ Relation cl_index,
+ CatalogState *cat_state,
+ LogicalDecodingContext *ctx,
+ bool swap_toast_by_content,
+ TransactionId frozenXid,
+ MultiXactId cutoffMulti)
+{
+ LOCKMODE lockmode_old PG_USED_FOR_ASSERTS_ONLY;
+ List *ind_oids_new;
+ Oid old_table_oid = RelationGetRelid(OldHeap);
+ Oid new_table_oid = RelationGetRelid(NewHeap);
+ List *ind_oids_old = RelationGetIndexList(OldHeap);
+ ListCell *lc, *lc2;
+ char relpersistence;
+ bool is_system_catalog;
+ Oid ident_idx_old, ident_idx_new;
+ IndexInsertState *iistate;
+ ScanKey ident_key;
+ int ident_key_nentries;
+ XLogRecPtr wal_insert_ptr, end_of_wal;
+ char dummy_rec_data = '\0';
+ RelReopenInfo *rri = NULL;
+ int nrel;
+ Relation *ind_refs_all, *ind_refs_p;
+
+ /* Like in cluster_rel(). */
+ lockmode_old = ShareUpdateExclusiveLock;
+ Assert(CheckRelationLockedByMe(OldHeap, lockmode_old, false));
+ Assert(cl_index == NULL ||
+ CheckRelationLockedByMe(cl_index, lockmode_old, false));
+ /* This is expected from the caller. */
+ Assert(CheckRelationLockedByMe(NewHeap, AccessExclusiveLock, false));
+
+ ident_idx_old = RelationGetReplicaIndex(OldHeap);
+
+ /*
+ * Unlike the exclusive case, we build new indexes for the new relation
+ * rather than swapping the storage and reindexing the old relation. The
+ * point is that the index build can take some time, so we do it before we
+ * get AccessExclusiveLock on the old heap and therefore we cannot swap
+ * the heap storage yet.
+ *
+ * index_create() locks the new indexes in AccessExclusiveLock mode
+ * during creation - no need to change that.
+ */
+ ind_oids_new = build_new_indexes(NewHeap, OldHeap, ind_oids_old);
+
+ /*
+ * Processing shouldn't start without a valid identity index.
+ */
+ Assert(OidIsValid(ident_idx_old));
+
+ /* Find "identity index" on the new relation. */
+ ident_idx_new = InvalidOid;
+ forboth(lc, ind_oids_old, lc2, ind_oids_new)
+ {
+ Oid ind_old = lfirst_oid(lc);
+ Oid ind_new = lfirst_oid(lc2);
+
+ if (ident_idx_old == ind_old)
+ {
+ ident_idx_new = ind_new;
+ break;
+ }
+ }
+ if (!OidIsValid(ident_idx_new))
+ /*
+ * Should not happen, given our lock on the old relation.
+ */
+ ereport(ERROR,
+ (errmsg("identity index missing on the new relation")));
+
+ /* Executor state to update indexes. */
+ iistate = get_index_insert_state(NewHeap, ident_idx_new);
+
+ /*
+ * Build scan key that we'll use to look for rows to be updated / deleted
+ * during logical decoding.
+ */
+ ident_key = build_identity_key(ident_idx_new, OldHeap, &ident_key_nentries);
+
+ /*
+ * Flush all WAL records inserted so far (possibly except for the last
+ * incomplete page, see GetInsertRecPtr), to minimize the amount of data
+ * we need to flush while holding exclusive lock on the source table.
+ */
+ wal_insert_ptr = GetInsertRecPtr();
+ XLogFlush(wal_insert_ptr);
+ end_of_wal = GetFlushRecPtr(NULL);
+
+ /*
+ * Apply concurrent changes first time, to minimize the time we need to
+ * hold AccessExclusiveLock. (Quite a lot of WAL could have been
+ * written during the data copying and index creation.)
+ */
+ process_concurrent_changes(ctx, end_of_wal, NewHeap,
+ swap_toast_by_content ? OldHeap : NULL,
+ ident_key, ident_key_nentries, iistate);
+
+ /*
+ * Release the locks that allowed concurrent data changes, in order to
+ * acquire the AccessExclusiveLock.
+ */
+ nrel = 0;
+ /*
+ * We unlock the old relation (and its clustering index), but then we will
+ * lock the relation and *all* its indexes because we want to swap their
+ * storage.
+ *
+ * (NewHeap is already locked, as well as its indexes.)
+ */
+ rri = palloc_array(RelReopenInfo, 1 + list_length(ind_oids_old));
+ init_rel_reopen_info(&rri[nrel++], &OldHeap, InvalidOid,
+ ShareUpdateExclusiveLock, AccessExclusiveLock);
+ /* References to the re-opened indexes will be stored in this array. */
+ ind_refs_all = palloc_array(Relation, list_length(ind_oids_old));
+ ind_refs_p = ind_refs_all;
+ /* The clustering index is a special case. */
+ if (cl_index)
+ {
+ *ind_refs_p = cl_index;
+ init_rel_reopen_info(&rri[nrel], ind_refs_p, InvalidOid,
+ ShareUpdateExclusiveLock, AccessExclusiveLock);
+ nrel++;
+ ind_refs_p++;
+ }
+ /*
+ * Also initialize the entries for the other indexes (currently
+ * unlocked), because we will have to lock them too.
+ */
+ foreach(lc, ind_oids_old)
+ {
+ Oid ind_oid;
+
+ ind_oid = lfirst_oid(lc);
+ /* Clustering index is already in the array, or there is none. */
+ if (cl_index && RelationGetRelid(cl_index) == ind_oid)
+ continue;
+
+ Assert(nrel < (1 + list_length(ind_oids_old)));
+
+ *ind_refs_p = NULL;
+ init_rel_reopen_info(&rri[nrel],
+ /*
+ * In this special case we do not have the
+ * relcache reference, use OID instead.
+ */
+ ind_refs_p,
+ ind_oid,
+ NoLock, /* Nothing to unlock. */
+ AccessExclusiveLock);
+
+ nrel++;
+ ind_refs_p++;
+ }
+ /* Perform the actual unlocking and re-locking. */
+ unlock_and_close_relations(rri, nrel);
+ reopen_relations(rri, nrel);
+
+ /*
+ * In addition, lock the OldHeap's TOAST relation that we skipped for the
+ * CONCURRENTLY option in copy_table_data(). This lock will be needed to
+ * swap the relation files.
+ */
+ if (OidIsValid(OldHeap->rd_rel->reltoastrelid))
+ LockRelationOid(OldHeap->rd_rel->reltoastrelid, AccessExclusiveLock);
+
+ /*
+ * Check if the new indexes match the old ones, i.e. no changes occurred
+ * while OldHeap was unlocked.
+ *
+ * XXX It's probably not necessary to check the relation tuple descriptor
+ * here because the logical decoding was already active when we released
+ * the lock, and thus the corresponding data changes won't be lost.
+ * However processing of those changes might take a lot of time.
+ */
+ check_catalog_changes(OldHeap, cat_state);
+
+ /*
+ * Tuples and pages of the old heap will be gone, but the heap will stay.
+ */
+ TransferPredicateLocksToHeapRelation(OldHeap);
+ /* The same for indexes. */
+ for (int i = 0; i < (nrel - 1); i++)
+ {
+ Relation index = ind_refs_all[i];
+
+ TransferPredicateLocksToHeapRelation(index);
+
+ /*
+ * References to indexes on the old relation are not needed anymore,
+ * however locks stay till the end of the transaction.
+ */
+ index_close(index, NoLock);
+ }
+ pfree(ind_refs_all);
+
+ /*
+ * Flush anything we see in WAL, to make sure that all changes committed
+ * while we were waiting for the exclusive lock are available for
+ * decoding. This should not be necessary if all backends had
+ * synchronous_commit set, but we can't rely on this setting.
+ *
+ * Unfortunately, GetInsertRecPtr() may lag behind the actual insert
+ * position, and GetLastImportantRecPtr() points at the start of the last
+ * record rather than at the end. Thus the simplest way to determine the
+ * insert position is to insert a dummy record and use its LSN.
+ *
+ * XXX Consider using GetLastImportantRecPtr() and adding the size of the
+ * last record (plus the total size of all the page headers the record
+ * spans)?
+ */
+ XLogBeginInsert();
+ XLogRegisterData(&dummy_rec_data, 1);
+ wal_insert_ptr = XLogInsert(RM_XLOG_ID, XLOG_NOOP);
+ XLogFlush(wal_insert_ptr);
+ end_of_wal = GetFlushRecPtr(NULL);
+
+ /* Apply the concurrent changes again. */
+ process_concurrent_changes(ctx, end_of_wal, NewHeap,
+ swap_toast_by_content ? OldHeap : NULL,
+ ident_key, ident_key_nentries, iistate);
+
+ /* Remember info about rel before closing OldHeap */
+ relpersistence = OldHeap->rd_rel->relpersistence;
+ is_system_catalog = IsSystemRelation(OldHeap);
+
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_SWAP_REL_FILES);
+
+ forboth(lc, ind_oids_old, lc2, ind_oids_new)
+ {
+ Oid ind_old = lfirst_oid(lc);
+ Oid ind_new = lfirst_oid(lc2);
+ Oid mapped_tables[4];
+
+ /* Zero out possible results from swap_relation_files */
+ memset(mapped_tables, 0, sizeof(mapped_tables));
+
+ swap_relation_files(ind_old, ind_new,
+ (old_table_oid == RelationRelationId),
+ swap_toast_by_content,
+ true,
+ InvalidTransactionId,
+ InvalidMultiXactId,
+ mapped_tables);
+
+#ifdef USE_ASSERT_CHECKING
+ /*
+ * Concurrent processing is not supported for system relations, so
+ * there should be no mapped tables.
+ */
+ for (int i = 0; i < 4; i++)
+ Assert(mapped_tables[i] == 0);
+#endif
+ }
+
+ /* The new indexes must be visible for deletion. */
+ CommandCounterIncrement();
+
+ /* Close the old heap but keep lock until transaction commit. */
+ table_close(OldHeap, NoLock);
+ /* Close the new heap. (We didn't have to open its indexes). */
+ table_close(NewHeap, NoLock);
+
+ /* Cleanup what we don't need anymore. (And close the identity index.) */
+ pfree(ident_key);
+ free_index_insert_state(iistate);
+
+ /*
+ * Swap the relations and their TOAST relations and TOAST indexes. This
+ * also drops the new relation and its indexes.
+ *
+ * (System catalogs are currently not supported.)
+ */
+ Assert(!is_system_catalog);
+ finish_heap_swap(old_table_oid, new_table_oid,
+ is_system_catalog,
+ swap_toast_by_content,
+ false, true, false,
+ frozenXid, cutoffMulti,
+ relpersistence);
+
+ pfree(rri);
+}
+
+/*
+ * Build indexes on NewHeap according to those on OldHeap.
+ *
+ * OldIndexes is the list of index OIDs on OldHeap.
+ *
+ * A list of OIDs of the corresponding indexes created on NewHeap is
+ * returned. The order of items matches that of OldIndexes, so the two
+ * lists can be used to swap the index storage.
+ */
+static List *
+build_new_indexes(Relation NewHeap, Relation OldHeap, List *OldIndexes)
+{
+ StringInfo ind_name;
+ ListCell *lc;
+ List *result = NIL;
+
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_REBUILD_INDEX);
+
+ ind_name = makeStringInfo();
+
+ foreach(lc, OldIndexes)
+ {
+ Oid ind_oid,
+ ind_oid_new,
+ tbsp_oid;
+ Relation ind;
+ IndexInfo *ind_info;
+ int i,
+ heap_col_id;
+ List *colnames;
+ int16 indnatts;
+ Oid *collations,
+ *opclasses;
+ HeapTuple tup;
+ bool isnull;
+ Datum d;
+ oidvector *oidvec;
+ int2vector *int2vec;
+ size_t oid_arr_size;
+ size_t int2_arr_size;
+ int16 *indoptions;
+ text *reloptions = NULL;
+ bits16 flags;
+ Datum *opclassOptions;
+ NullableDatum *stattargets;
+
+ ind_oid = lfirst_oid(lc);
+ ind = index_open(ind_oid, AccessShareLock);
+ ind_info = BuildIndexInfo(ind);
+
+ tbsp_oid = ind->rd_rel->reltablespace;
+ /*
+ * The index names don't really matter, as we'll eventually use only
+ * their storage. Just make them unique within the table.
+ */
+ resetStringInfo(ind_name);
+ appendStringInfo(ind_name, "ind_%d",
+ list_cell_number(OldIndexes, lc));
+
+ flags = 0;
+ if (ind->rd_index->indisprimary)
+ flags |= INDEX_CREATE_IS_PRIMARY;
+
+ colnames = NIL;
+ indnatts = ind->rd_index->indnatts;
+ oid_arr_size = sizeof(Oid) * indnatts;
+ int2_arr_size = sizeof(int16) * indnatts;
+
+ collations = (Oid *) palloc(oid_arr_size);
+ for (i = 0; i < indnatts; i++)
+ {
+ char *colname;
+
+ heap_col_id = ind->rd_index->indkey.values[i];
+ if (heap_col_id > 0)
+ {
+ Form_pg_attribute att;
+
+ /* Normal attribute. */
+ att = TupleDescAttr(OldHeap->rd_att, heap_col_id - 1);
+ colname = pstrdup(NameStr(att->attname));
+ collations[i] = att->attcollation;
+ }
+ else if (heap_col_id == 0)
+ {
+ HeapTuple tuple;
+ Form_pg_attribute att;
+
+ /*
+ * Expression column is not present in relcache. What we need
+ * here is an attribute of the *index* relation.
+ */
+ tuple = SearchSysCache2(ATTNUM,
+ ObjectIdGetDatum(ind_oid),
+ Int16GetDatum(i + 1));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR,
+ "cache lookup failed for attribute %d of relation %u",
+ i + 1, ind_oid);
+ att = (Form_pg_attribute) GETSTRUCT(tuple);
+ colname = pstrdup(NameStr(att->attname));
+ collations[i] = att->attcollation;
+ ReleaseSysCache(tuple);
+ }
+ else
+ elog(ERROR, "unexpected column number: %d", heap_col_id);
+
+ colnames = lappend(colnames, colname);
+ }
+
+ /*
+ * Special effort needed for variable length attributes of
+ * Form_pg_index.
+ */
+ tup = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(ind_oid));
+ if (!HeapTupleIsValid(tup))
+ elog(ERROR, "cache lookup failed for index %u", ind_oid);
+ d = SysCacheGetAttr(INDEXRELID, tup, Anum_pg_index_indclass, &isnull);
+ Assert(!isnull);
+ oidvec = (oidvector *) DatumGetPointer(d);
+ opclasses = (Oid *) palloc(oid_arr_size);
+ memcpy(opclasses, oidvec->values, oid_arr_size);
+
+ d = SysCacheGetAttr(INDEXRELID, tup, Anum_pg_index_indoption,
+ &isnull);
+ Assert(!isnull);
+ int2vec = (int2vector *) DatumGetPointer(d);
+ indoptions = (int16 *) palloc(int2_arr_size);
+ memcpy(indoptions, int2vec->values, int2_arr_size);
+ ReleaseSysCache(tup);
+
+ tup = SearchSysCache1(RELOID, ObjectIdGetDatum(ind_oid));
+ if (!HeapTupleIsValid(tup))
+ elog(ERROR, "cache lookup failed for index relation %u", ind_oid);
+ d = SysCacheGetAttr(RELOID, tup, Anum_pg_class_reloptions, &isnull);
+ reloptions = !isnull ? DatumGetTextPCopy(d) : NULL;
+ ReleaseSysCache(tup);
+
+ opclassOptions = palloc0(sizeof(Datum) * ind_info->ii_NumIndexAttrs);
+ for (i = 0; i < ind_info->ii_NumIndexAttrs; i++)
+ opclassOptions[i] = get_attoptions(ind_oid, i + 1);
+
+ stattargets = get_index_stattargets(ind_oid, ind_info);
+
+ /*
+ * Neither parentIndexRelid nor parentConstraintId needs to be passed
+ * since the new catalog entries (pg_constraint, pg_inherits) would
+ * eventually be dropped. Therefore there's no need to record valid
+ * dependency on parents.
+ */
+ ind_oid_new = index_create(NewHeap,
+ ind_name->data,
+ InvalidOid,
+ InvalidOid, /* parentIndexRelid */
+ InvalidOid, /* parentConstraintId */
+ InvalidOid,
+ ind_info,
+ colnames,
+ ind->rd_rel->relam,
+ tbsp_oid,
+ collations,
+ opclasses,
+ opclassOptions,
+ indoptions,
+ stattargets,
+ PointerGetDatum(reloptions),
+ flags, /* flags */
+ 0, /* constr_flags */
+ false, /* allow_system_table_mods */
+ false, /* is_internal */
+ NULL /* constraintId */
+ );
+ result = lappend_oid(result, ind_oid_new);
+
+ index_close(ind, AccessShareLock);
+ list_free_deep(colnames);
+ pfree(collations);
+ pfree(opclasses);
+ pfree(indoptions);
+ if (reloptions)
+ pfree(reloptions);
+ }
+
+ return result;
+}
+
+static void
+init_rel_reopen_info(RelReopenInfo *rri, Relation *rel_p, Oid relid,
+ LOCKMODE lockmode_orig, LOCKMODE lockmode_new)
+{
+ rri->rel_p = rel_p;
+ rri->relid = relid;
+ rri->lockmode_orig = lockmode_orig;
+ rri->lockmode_new = lockmode_new;
+}
+
+/*
+ * Unlock and close relations specified by items of the 'rels' array.
+ * 'nrel' is the number of items.
+ *
+ * Information needed to (re)open the relations (or to issue meaningful ERROR)
+ * is added to the array items.
+ */
+static void
+unlock_and_close_relations(RelReopenInfo *rels, int nrel)
+{
+ int i;
+ RelReopenInfo *rri;
+
+ /*
+ * First, retrieve the information that we will need for re-opening.
+ *
+ * We could close (and unlock) each relation as soon as we have gathered
+ * the related information, but then we would have to be careful not to
+ * unlock the table until we have the info on all its indexes. (Once we
+ * unlock the table, any index can be dropped, and thus we can fail to get
+ * the name we want to report if re-opening fails.) It seems simpler to
+ * separate the work into two iterations.
+ */
+ for (i = 0; i < nrel; i++)
+ {
+ Relation rel;
+
+ rri = &rels[i];
+ rel = *rri->rel_p;
+
+ if (rel)
+ {
+ Assert(CheckRelationLockedByMe(rel, rri->lockmode_orig, false));
+ Assert(!OidIsValid(rri->relid));
+
+ rri->relid = RelationGetRelid(rel);
+ rri->relkind = rel->rd_rel->relkind;
+ rri->relname = pstrdup(RelationGetRelationName(rel));
+ }
+ else
+ {
+ Assert(OidIsValid(rri->relid));
+
+ rri->relname = get_rel_name(rri->relid);
+ rri->relkind = get_rel_relkind(rri->relid);
+ }
+ }
+
+ /* Second, close the relations. */
+ for (i = 0; i < nrel; i++)
+ {
+ Relation rel;
+
+ rri = &rels[i];
+ rel = *rri->rel_p;
+
+ /* Close the relation if the caller passed one. */
+ if (rel)
+ {
+ if (rri->relkind == RELKIND_RELATION)
+ table_close(rel, rri->lockmode_orig);
+ else
+ {
+ Assert(rri->relkind == RELKIND_INDEX);
+
+ index_close(rel, rri->lockmode_orig);
+ }
+ }
+ }
+}
+
+/*
+ * Re-open the relations closed previously by unlock_and_close_relations().
+ */
+static void
+reopen_relations(RelReopenInfo *rels, int nrel)
+{
+ for (int i = 0; i < nrel; i++)
+ {
+ RelReopenInfo *rri = &rels[i];
+ Relation rel;
+
+ if (rri->relkind == RELKIND_RELATION)
+ {
+ rel = try_table_open(rri->relid, rri->lockmode_new);
+ }
+ else
+ {
+ Assert(rri->relkind == RELKIND_INDEX);
+
+ rel = try_index_open(rri->relid, rri->lockmode_new);
+ }
+
+ if (rel == NULL)
+ {
+ const char *kind_str;
+
+ kind_str = (rri->relkind == RELKIND_RELATION) ? "table" : "index";
+ ereport(ERROR,
+ (errmsg("could not open %s \"%s\"", kind_str,
+ rri->relname),
+ errhint("The %s could have been dropped by another transaction.",
+ kind_str)));
+ }
+ *rri->rel_p = rel;
+
+ pfree(rri->relname);
+ }
+}
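The reason unlock_and_close_relations() gathers names in a first pass before closing anything is so that reopen_relations() can still report a meaningful name if a relation was dropped while unlocked. A toy illustration of that bookkeeping, with invented names and a pretend catalog (object 2 "gets dropped" between the phases):

```c
/* Toy model of the close-then-reopen dance; structures are invented. */
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct reopen_info
{
	int			id;
	char		name[16];		/* remembered for error reporting */
	int			open;
};

/* A pretend catalog: object 2 disappears between the two phases. */
static int object_exists(int id)
{
	return id != 2;
}

void close_all(struct reopen_info *infos, int n)
{
	/* Phase 1: remember the names while the objects are still open. */
	for (int i = 0; i < n; i++)
		snprintf(infos[i].name, sizeof(infos[i].name), "obj%d", infos[i].id);

	/* Phase 2: actually close (and unlock) everything. */
	for (int i = 0; i < n; i++)
		infos[i].open = 0;
}

/* Returns the index of the first object that failed to reopen, or -1. */
int reopen_all(struct reopen_info *infos, int n)
{
	for (int i = 0; i < n; i++)
	{
		if (!object_exists(infos[i].id))
			return i;			/* caller reports infos[i].name */
		infos[i].open = 1;
	}
	return -1;
}
```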
+
+/*
+ * REPACK is intended to be a replacement of both CLUSTER and VACUUM FULL.
+ */
+void
+repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
+{
+ ListCell *lc;
+ ClusterParams params = {0};
+ bool verbose = false;
+ Relation rel = NULL;
+ Oid indexOid = InvalidOid;
+ MemoryContext repack_context;
+ List *rtcs;
+ LOCKMODE lockmode;
+
+ /* Parse option list */
+ foreach(lc, stmt->params)
+ {
+ DefElem *opt = (DefElem *) lfirst(lc);
+
+ if (strcmp(opt->defname, "verbose") == 0)
+ verbose = defGetBoolean(opt);
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("unrecognized REPACK option \"%s\"",
+ opt->defname),
+ parser_errposition(pstate, opt->location)));
+ }
+
+ params.options =
+ (verbose ? CLUOPT_VERBOSE : 0) |
+ (stmt->concurrent ? CLUOPT_CONCURRENT : 0);
+
+ /*
+ * Determine the lock mode expected by cluster_rel().
+ *
+ * In the exclusive case, we obtain AccessExclusiveLock right away to
+ * avoid lock-upgrade hazard in the single-transaction case. In the
+ * CONCURRENTLY case, the AccessExclusiveLock will only be used at the end
+ * of processing, supposedly for a very short time. Until then, we'll have
+ * to unlock the relation temporarily, so there's no lock-upgrade hazard.
+ */
+ lockmode = (params.options & CLUOPT_CONCURRENT) == 0 ?
+ AccessExclusiveLock : ShareUpdateExclusiveLock;
+
+ if (stmt->relation != NULL)
+ {
+ rel = process_single_relation(stmt->relation, stmt->indexname,
+ CLUSTER_COMMAND_REPACK, lockmode,
+ isTopLevel, ¶ms, &indexOid);
+ if (rel == NULL)
+ return;
+ }
+
+ /*
+ * By here, we know we are in a multi-table situation.
+ *
+ * Concurrent processing is currently considered rather special (e.g. in
+ * terms of resources consumed) so it is not performed in bulk.
+ */
+ if (params.options & CLUOPT_CONCURRENT)
+ {
+ if (rel != NULL)
+ {
+ Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
+ ereport(ERROR,
+ (errmsg("REPACK CONCURRENTLY not supported for partitioned tables"),
+ errhint("Consider running the command for individual partitions.")));
+ }
+ else
+ ereport(ERROR,
+ (errmsg("REPACK CONCURRENTLY requires explicit table name")));
+ }
+
+ /*
+ * In order to avoid holding locks for too long, we want to process each
+ * table in its own transaction. This forces us to disallow running
+ * inside a user transaction block.
+ */
+ PreventInTransactionBlock(isTopLevel, "REPACK");
+
+ /* Also, we need a memory context to hold our list of relations */
+ repack_context = AllocSetContextCreate(PortalContext,
+ "Repack",
+ ALLOCSET_DEFAULT_SIZES);
+
+ params.options |= CLUOPT_RECHECK;
+ if (rel != NULL)
+ {
+ Oid relid;
+ bool rel_is_index;
+
+ Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
+ /* See the ereport() above. */
+ Assert((params.options & CLUOPT_CONCURRENT) == 0);
+
+ if (OidIsValid(indexOid))
+ {
+ relid = indexOid;
+ rel_is_index = true;
+ }
+ else
+ {
+ relid = RelationGetRelid(rel);
+ rel_is_index = false;
+ }
+ rtcs = get_tables_to_cluster_partitioned(repack_context, relid,
+ rel_is_index,
+ CLUSTER_COMMAND_REPACK);
/* close relation, releasing lock on parent table */
- table_close(rel, AccessExclusiveLock);
+ table_close(rel, lockmode);
}
else
rtcs = get_tables_to_repack(repack_context);
/* Do the job. */
- cluster_multiple_rels(rtcs, ¶ms, CLUSTER_COMMAND_REPACK);
+ cluster_multiple_rels(rtcs, ¶ms, CLUSTER_COMMAND_REPACK, lockmode,
+ isTopLevel);
+
/* Start a new transaction for the cleanup work. */
StartTransactionCommand();
@@ -1904,7 +4270,8 @@ repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
*/
static Relation
process_single_relation(RangeVar *relation, char *indexname,
- ClusterCommand cmd, ClusterParams *params,
+ ClusterCommand cmd, LOCKMODE lockmode,
+ bool isTopLevel, ClusterParams *params,
Oid *indexOid_p)
{
Relation rel;
@@ -1914,12 +4281,10 @@ process_single_relation(RangeVar *relation, char *indexname,
Oid tableOid;
/*
- * Find, lock, and check permissions on the table. We obtain
- * AccessExclusiveLock right away to avoid lock-upgrade hazard in the
- * single-transaction case.
+ * Find, lock, and check permissions on the table.
*/
tableOid = RangeVarGetRelidExtended(relation,
- AccessExclusiveLock,
+ lockmode,
0,
RangeVarCallbackMaintainsTable,
NULL);
@@ -1973,7 +4338,7 @@ process_single_relation(RangeVar *relation, char *indexname,
/* For non-partitioned tables, do what we came here to do. */
if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
{
- cluster_rel(rel, indexOid, params, cmd);
+ cluster_rel(rel, indexOid, params, cmd, isTopLevel);
/* cluster_rel closes the relation, but keeps lock */
return NULL;
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 0bfbc5ca6d..eae34fbe6c 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -906,7 +906,7 @@ refresh_by_match_merge(Oid matviewOid, Oid tempOid, Oid relowner,
static void
refresh_by_heap_swap(Oid matviewOid, Oid OIDNewHeap, char relpersistence)
{
- finish_heap_swap(matviewOid, OIDNewHeap, false, false, true, true,
+ finish_heap_swap(matviewOid, OIDNewHeap, false, false, true, true, true,
RecentXmin, ReadNextMultiXactId(), relpersistence);
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 901cb321c3..364f2b6a81 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4527,6 +4527,16 @@ AlterTableInternal(Oid relid, List *cmds, bool recurse)
rel = relation_open(relid, lockmode);
+ /*
+ * If lockmode allows, check if REPACK CONCURRENTLY is in progress. If
+ * lockmode is too weak, cluster_rel() should detect incompatible DDLs
+ * executed by us.
+ *
+ * XXX We might skip the changes for DDLs which do not change the tuple
+ * descriptor.
+ */
+ check_for_concurrent_repack(relid, lockmode);
+
EventTriggerAlterTableRelid(relid);
ATController(NULL, rel, cmds, recurse, lockmode, NULL);
@@ -5960,6 +5970,7 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
finish_heap_swap(tab->relid, OIDNewHeap,
false, false, true,
!OidIsValid(tab->newTableSpace),
+ true,
RecentXmin,
ReadNextMultiXactId(),
persistence);
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 59dddcd31f..30e1bb5719 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -123,7 +123,7 @@ static void vac_truncate_clog(TransactionId frozenXID,
TransactionId lastSaneFrozenXid,
MultiXactId lastSaneMinMulti);
static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
- BufferAccessStrategy bstrategy);
+ BufferAccessStrategy bstrategy, bool isTopLevel);
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
@@ -633,7 +633,8 @@ vacuum(List *relations, VacuumParams *params, BufferAccessStrategy bstrategy,
if (params->options & VACOPT_VACUUM)
{
- if (!vacuum_rel(vrel->oid, vrel->relation, params, bstrategy))
+ if (!vacuum_rel(vrel->oid, vrel->relation, params, bstrategy,
+ isTopLevel))
continue;
}
@@ -1989,7 +1990,7 @@ vac_truncate_clog(TransactionId frozenXID,
*/
static bool
vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
- BufferAccessStrategy bstrategy)
+ BufferAccessStrategy bstrategy, bool isTopLevel)
{
LOCKMODE lmode;
Relation rel;
@@ -2249,7 +2250,7 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
/* VACUUM FULL is now a variant of CLUSTER; see cluster.c */
cluster_rel(rel, InvalidOid, &cluster_params,
- CLUSTER_COMMAND_VACUUM);
+ CLUSTER_COMMAND_VACUUM, isTopLevel);
/* cluster_rel closes the relation, but keeps lock */
rel = NULL;
@@ -2295,7 +2296,8 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
toast_vacuum_params.options |= VACOPT_PROCESS_MAIN;
toast_vacuum_params.toast_parent = relid;
- vacuum_rel(toast_relid, NULL, &toast_vacuum_params, bstrategy);
+ vacuum_rel(toast_relid, NULL, &toast_vacuum_params, bstrategy,
+ isTopLevel);
}
/*
diff --git a/src/backend/meson.build b/src/backend/meson.build
index 2b0db21480..50aa385a58 100644
--- a/src/backend/meson.build
+++ b/src/backend/meson.build
@@ -194,5 +194,6 @@ pg_test_mod_args = pg_mod_args + {
subdir('jit/llvm')
subdir('replication/libpqwalreceiver')
subdir('replication/pgoutput')
+subdir('replication/pgoutput_repack')
subdir('snowball')
subdir('utils/mb/conversion_procs')
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 8b4c226495..5c937b7db8 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -11874,27 +11874,30 @@ cluster_index_specification:
*
* QUERY:
* REPACK [ (options) ] [ <qualified_name> [ USING INDEX <index_name> ] ]
+ * REPACK [ (options) ] CONCURRENTLY <qualified_name> [ USING INDEX <index_name> ]
*
*****************************************************************************/
RepackStmt:
- REPACK qualified_name repack_index_specification
+ REPACK opt_concurrently qualified_name repack_index_specification
{
RepackStmt *n = makeNode(RepackStmt);
- n->relation = $2;
- n->indexname = $3;
+ n->concurrent = $2;
+ n->relation = $3;
+ n->indexname = $4;
n->params = NIL;
$$ = (Node *) n;
}
- | REPACK '(' utility_option_list ')' qualified_name repack_index_specification
+ | REPACK '(' utility_option_list ')' opt_concurrently qualified_name repack_index_specification
{
RepackStmt *n = makeNode(RepackStmt);
- n->relation = $5;
- n->indexname = $6;
n->params = $3;
+ n->concurrent = $5;
+ n->relation = $6;
+ n->indexname = $7;
$$ = (Node *) n;
}
@@ -11905,6 +11908,7 @@ RepackStmt:
n->relation = NULL;
n->indexname = NULL;
n->params = NIL;
+ n->concurrent = false;
$$ = (Node *) n;
}
@@ -11915,6 +11919,7 @@ RepackStmt:
n->relation = NULL;
n->indexname = NULL;
n->params = $3;
+ n->concurrent = false;
$$ = (Node *) n;
}
;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 24d88f368d..a6df190747 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -33,6 +33,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecord.h"
#include "catalog/pg_control.h"
+#include "commands/cluster.h"
#include "replication/decode.h"
#include "replication/logical.h"
#include "replication/message.h"
@@ -467,6 +468,29 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
TransactionId xid = XLogRecGetXid(buf->record);
SnapBuild *builder = ctx->snapshot_builder;
+ /*
+ * Check if REPACK CONCURRENTLY is being performed by this backend. If so,
+ * only decode data changes of the table that it is processing, and the
+ * changes of its TOAST relation.
+ *
+ * (The TOAST locator should not be set unless the main one is.)
+ */
+ Assert(!OidIsValid(repacked_rel_toast_locator.relNumber) ||
+ OidIsValid(repacked_rel_locator.relNumber));
+
+ if (OidIsValid(repacked_rel_locator.relNumber))
+ {
+ XLogReaderState *r = buf->record;
+ RelFileLocator locator;
+
+ /* Not all records contain the block. */
+ if (XLogRecGetBlockTagExtended(r, 0, &locator, NULL, NULL, NULL) &&
+ !RelFileLocatorEquals(locator, repacked_rel_locator) &&
+ (!OidIsValid(repacked_rel_toast_locator.relNumber) ||
+ !RelFileLocatorEquals(locator, repacked_rel_toast_locator)))
+ return;
+ }
+
ReorderBufferProcessXid(ctx->reorder, xid, buf->origptr);
/*
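The filter added to heap_decode() above boils down to a small predicate: once REPACK CONCURRENTLY has registered a relation (and possibly its TOAST relation), only records whose block tag matches one of those relfilelocators are decoded. As a rough standalone sketch of that predicate, with simplified stand-in types (the real `RelFileLocator` and `RelFileLocatorEquals` live in `storage/relfilelocator.h`; `relNumber == 0` plays the role of the invalid locator):

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for PostgreSQL's RelFileLocator. */
typedef struct RelFileLocator
{
	uint32_t	spcOid;
	uint32_t	dbOid;
	uint32_t	relNumber;		/* 0 means "not set" */
} RelFileLocator;

static bool
locator_equals(const RelFileLocator *a, const RelFileLocator *b)
{
	return a->spcOid == b->spcOid && a->dbOid == b->dbOid &&
		a->relNumber == b->relNumber;
}

/*
 * Mirror of the early-return condition in heap_decode(): if no repack is
 * in progress, decode everything; otherwise keep the record only if it
 * belongs to the repacked table or to its TOAST relation.
 */
static bool
should_decode(const RelFileLocator *rec,
			  const RelFileLocator *repacked,
			  const RelFileLocator *toast)
{
	if (repacked->relNumber == 0)
		return true;			/* no REPACK CONCURRENTLY in progress */
	if (locator_equals(rec, repacked))
		return true;
	if (toast->relNumber != 0 && locator_equals(rec, toast))
		return true;
	return false;
}
```

This is only an illustration of the skip logic; the patch additionally has to handle records that carry no block reference at all, which it decodes unconditionally.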
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 8c83ff6feb..c54a1277cc 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -486,6 +486,26 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
return SnapBuildMVCCFromHistoric(snap, true);
}
+/*
+ * Build an MVCC snapshot for the initial data load performed by REPACK
+ * CONCURRENTLY command.
+ *
+ * The snapshot will only be used to scan one particular relation, which is
+ * treated like a catalog (therefore ->building_full_snapshot is not
+ * important), and the caller should already have a replication slot setup (so
+ * we do not set MyProc->xmin). XXX Do we need to add any more restrictions?
+ */
+Snapshot
+SnapBuildInitialSnapshotForRepack(SnapBuild *builder)
+{
+ Snapshot snap;
+
+ Assert(builder->state == SNAPBUILD_CONSISTENT);
+
+ snap = SnapBuildBuildSnapshot(builder);
+ return SnapBuildMVCCFromHistoric(snap, false);
+}
+
/*
* Turn a historic MVCC snapshot into an ordinary MVCC snapshot.
*
diff --git a/src/backend/replication/pgoutput_repack/Makefile b/src/backend/replication/pgoutput_repack/Makefile
new file mode 100644
index 0000000000..4efeb713b7
--- /dev/null
+++ b/src/backend/replication/pgoutput_repack/Makefile
@@ -0,0 +1,32 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+# Makefile for src/backend/replication/pgoutput_repack
+#
+# IDENTIFICATION
+# src/backend/replication/pgoutput_repack
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/replication/pgoutput_repack
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+ $(WIN32RES) \
+ pgoutput_repack.o
+PGFILEDESC = "pgoutput_repack - logical replication output plugin for REPACK command"
+NAME = pgoutput_repack
+
+all: all-shared-lib
+
+include $(top_srcdir)/src/Makefile.shlib
+
+install: all installdirs install-lib
+
+installdirs: installdirs-lib
+
+uninstall: uninstall-lib
+
+clean distclean: clean-lib
+ rm -f $(OBJS)
diff --git a/src/backend/replication/pgoutput_repack/meson.build b/src/backend/replication/pgoutput_repack/meson.build
new file mode 100644
index 0000000000..133e865a4a
--- /dev/null
+++ b/src/backend/replication/pgoutput_repack/meson.build
@@ -0,0 +1,18 @@
+# Copyright (c) 2022-2024, PostgreSQL Global Development Group
+
+pgoutput_repack_sources = files(
+ 'pgoutput_repack.c',
+)
+
+if host_system == 'windows'
+ pgoutput_repack_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pgoutput_repack',
+ '--FILEDESC', 'pgoutput_repack - logical replication output plugin for REPACK command',])
+endif
+
+pgoutput_repack = shared_module('pgoutput_repack',
+ pgoutput_repack_sources,
+ kwargs: pg_mod_args,
+)
+
+backend_targets += pgoutput_repack
diff --git a/src/backend/replication/pgoutput_repack/pgoutput_repack.c b/src/backend/replication/pgoutput_repack/pgoutput_repack.c
new file mode 100644
index 0000000000..1ef9b3cbfd
--- /dev/null
+++ b/src/backend/replication/pgoutput_repack/pgoutput_repack.c
@@ -0,0 +1,286 @@
+/*-------------------------------------------------------------------------
+ *
+ * pgoutput_repack.c
+ * Logical Replication output plugin for REPACK command
+ *
+ * Copyright (c) 2012-2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/replication/pgoutput_repack/pgoutput_repack.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heaptoast.h"
+#include "commands/cluster.h"
+#include "replication/snapbuild.h"
+
+PG_MODULE_MAGIC;
+
+static void plugin_startup(LogicalDecodingContext *ctx,
+ OutputPluginOptions *opt, bool is_init);
+static void plugin_shutdown(LogicalDecodingContext *ctx);
+static void plugin_begin_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn);
+static void plugin_commit_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ Relation rel, ReorderBufferChange *change);
+static void plugin_truncate(struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, int nrelations,
+ Relation relations[],
+ ReorderBufferChange *change);
+static void store_change(LogicalDecodingContext *ctx,
+ ConcurrentChangeKind kind, HeapTuple tuple);
+
+void
+_PG_output_plugin_init(OutputPluginCallbacks *cb)
+{
+ AssertVariableIsOfType(&_PG_output_plugin_init, LogicalOutputPluginInit);
+
+ cb->startup_cb = plugin_startup;
+ cb->begin_cb = plugin_begin_txn;
+ cb->change_cb = plugin_change;
+ cb->truncate_cb = plugin_truncate;
+ cb->commit_cb = plugin_commit_txn;
+ cb->shutdown_cb = plugin_shutdown;
+}
+
+
+/* initialize this plugin */
+static void
+plugin_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
+ bool is_init)
+{
+ ctx->output_plugin_private = NULL;
+
+ /* Probably unnecessary, as we don't use the SQL interface ... */
+ opt->output_type = OUTPUT_PLUGIN_BINARY_OUTPUT;
+
+ if (ctx->output_plugin_options != NIL)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("this plugin does not expect any options")));
+ }
+}
+
+static void
+plugin_shutdown(LogicalDecodingContext *ctx)
+{
+}
+
+/*
+ * As we don't release the slot during processing of particular table, there's
+ * no room for SQL interface, even for debugging purposes. Therefore we need
+ * neither OutputPluginPrepareWrite() nor OutputPluginWrite() in the plugin
+ * callbacks. (Although we might want to write custom callbacks, this API
+ * seems to be unnecessarily generic for our purposes.)
+ */
+
+/* BEGIN callback */
+static void
+plugin_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+}
+
+/* COMMIT callback */
+static void
+plugin_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+}
+
+/*
+ * Callback for individual changed tuples
+ */
+static void
+plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ Relation relation, ReorderBufferChange *change)
+{
+ RepackDecodingState *dstate;
+
+ dstate = (RepackDecodingState *) ctx->output_writer_private;
+
+ /* Only interested in one particular relation. */
+ if (relation->rd_id != dstate->relid)
+ return;
+
+ /* Decode entry depending on its type */
+ switch (change->action)
+ {
+ case REORDER_BUFFER_CHANGE_INSERT:
+ {
+ HeapTuple newtuple;
+
+ newtuple = change->data.tp.newtuple;
+
+ /*
+ * Identity checks in the main function should have made this
+ * impossible.
+ */
+ if (newtuple == NULL)
+ elog(ERROR, "incomplete insert info");
+
+ store_change(ctx, CHANGE_INSERT, newtuple);
+ }
+ break;
+ case REORDER_BUFFER_CHANGE_UPDATE:
+ {
+ HeapTuple oldtuple,
+ newtuple;
+
+ oldtuple = change->data.tp.oldtuple;
+ newtuple = change->data.tp.newtuple;
+
+ if (newtuple == NULL)
+ elog(ERROR, "incomplete update info");
+
+ if (oldtuple != NULL)
+ store_change(ctx, CHANGE_UPDATE_OLD, oldtuple);
+
+ store_change(ctx, CHANGE_UPDATE_NEW, newtuple);
+ }
+ break;
+ case REORDER_BUFFER_CHANGE_DELETE:
+ {
+ HeapTuple oldtuple;
+
+ oldtuple = change->data.tp.oldtuple;
+
+ if (oldtuple == NULL)
+ elog(ERROR, "incomplete delete info");
+
+ store_change(ctx, CHANGE_DELETE, oldtuple);
+ }
+ break;
+ default:
+ /* Should not come here */
+ Assert(false);
+ break;
+ }
+}
+
+static void
+plugin_truncate(struct LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ int nrelations, Relation relations[],
+ ReorderBufferChange *change)
+{
+ RepackDecodingState *dstate;
+ int i;
+ Relation relation = NULL;
+
+ dstate = (RepackDecodingState *) ctx->output_writer_private;
+
+ /* Find the relation we are processing. */
+ for (i = 0; i < nrelations; i++)
+ {
+ relation = relations[i];
+
+ if (RelationGetRelid(relation) == dstate->relid)
+ break;
+ }
+
+ /* Does the truncation affect only other relations? */
+ if (i == nrelations)
+ return;
+
+ store_change(ctx, CHANGE_TRUNCATE, NULL);
+}
+
+/* Store concurrent data change. */
+static void
+store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
+ HeapTuple tuple)
+{
+ RepackDecodingState *dstate;
+ char *change_raw;
+ ConcurrentChange change;
+ bool flattened = false;
+ Size size;
+ Datum values[1];
+ bool isnull[1];
+ char *dst, *dst_start;
+
+ dstate = (RepackDecodingState *) ctx->output_writer_private;
+
+ size = MAXALIGN(VARHDRSZ) + SizeOfConcurrentChange;
+
+ if (tuple)
+ {
+ /*
+ * ReorderBufferCommit() stores the TOAST chunks in its private memory
+ * context and frees them after having called
+ * apply_change(). Therefore we need a flat copy (including TOAST) that
+ * we eventually copy into the memory context which is available to
+ * decode_concurrent_changes().
+ */
+ if (HeapTupleHasExternal(tuple))
+ {
+ /*
+ * toast_flatten_tuple_to_datum() might be more convenient but we
+ * don't want the decompression it does.
+ */
+ tuple = toast_flatten_tuple(tuple, dstate->tupdesc);
+ flattened = true;
+ }
+
+ size += tuple->t_len;
+ }
+
+ /* XXX Isn't there any function / macro to do this? */
+ if (size >= 0x3FFFFFFF)
+ elog(ERROR, "concurrent change is too big");
+
+ /* Construct the change. */
+ change_raw = (char *) palloc0(size);
+ SET_VARSIZE(change_raw, size);
+ /*
+ * Since the varlena alignment might not be sufficient for the structure,
+ * set the fields in a local instance and remember where it should
+ * eventually be copied.
+ */
+ change.kind = kind;
+ dst_start = (char *) VARDATA(change_raw);
+
+ /* No other information is needed for TRUNCATE. */
+ if (change.kind == CHANGE_TRUNCATE)
+ goto store;
+
+ /*
+ * Copy the tuple.
+ *
+ * CAUTION: change->tup_data.t_data must be fixed on retrieval!
+ */
+ memcpy(&change.tup_data, tuple, sizeof(HeapTupleData));
+ dst = dst_start + SizeOfConcurrentChange;
+ memcpy(dst, tuple->t_data, tuple->t_len);
+
+ /* The data has been copied. */
+ if (flattened)
+ pfree(tuple);
+
+store:
+ /* Copy the structure so it can be stored. */
+ memcpy(dst_start, &change, SizeOfConcurrentChange);
+
+ /* Store as tuple of 1 bytea column. */
+ values[0] = PointerGetDatum(change_raw);
+ isnull[0] = false;
+ tuplestore_putvalues(dstate->tstore, dstate->tupdesc_change,
+ values, isnull);
+
+ /* Accounting. */
+ dstate->nchanges++;
+
+ /* Cleanup. */
+ pfree(change_raw);
+}
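store_change() above flattens each change into a single bytea: a fixed-size header (ConcurrentChange) followed immediately by the raw tuple bytes, with the caveat from the CAUTION comment that the embedded `t_data` pointer is meaningless after the flat copy and must be fixed on retrieval. A minimal standalone illustration of that layout, using simplified stand-in types rather than the real `HeapTupleData`/`ConcurrentChange` (no varlena header or alignment handling here):

```c
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

/* Stand-ins for HeapTupleData and ConcurrentChange, trimmed down. */
typedef struct MiniTuple
{
	uint32_t	t_len;
	char	   *t_data;			/* points at the payload bytes */
} MiniTuple;

typedef struct MiniChange
{
	int			kind;
	MiniTuple	tup;
} MiniChange;

/* Serialize: header followed by payload, in one flat allocation. */
static char *
change_serialize(int kind, const char *payload, uint32_t len)
{
	char	   *buf = malloc(sizeof(MiniChange) + len);
	MiniChange	hdr;

	hdr.kind = kind;
	hdr.tup.t_len = len;
	hdr.tup.t_data = NULL;		/* invalid after the flat copy */
	memcpy(buf, &hdr, sizeof(MiniChange));
	memcpy(buf + sizeof(MiniChange), payload, len);
	return buf;
}

/* Retrieval: "fix" t_data so it points into the flat buffer again. */
static MiniChange *
change_fixup(char *buf)
{
	MiniChange *change = (MiniChange *) buf;

	change->tup.t_data = buf + sizeof(MiniChange);
	return change;
}
```

The real code additionally wraps the buffer in a varlena and stores it as a one-column tuple in the tuplestore, but the pointer fixup step is the same idea.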
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 174eed7036..07e477d279 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
#include "access/xlogprefetcher.h"
#include "access/xlogrecovery.h"
#include "commands/async.h"
+#include "commands/cluster.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/autovacuum.h"
@@ -148,6 +149,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, WaitEventCustomShmemSize());
size = add_size(size, InjectionPointShmemSize());
size = add_size(size, SlotSyncShmemSize());
+ size = add_size(size, RepackShmemSize());
/* include additional requested shmem from preload libraries */
size = add_size(size, total_addin_request);
@@ -340,6 +342,7 @@ CreateOrAttachShmemStructs(void)
StatsShmemInit();
WaitEventCustomShmemInit();
InjectionPointShmemInit();
+ RepackShmemInit();
}
/*
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index bf3ba3c2ae..4ee4c47487 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -1307,6 +1307,16 @@ ProcessUtilitySlow(ParseState *pstate,
lockmode = AlterTableGetLockLevel(atstmt->cmds);
relid = AlterTableLookupRelation(atstmt, lockmode);
+ /*
+ * If lockmode allows, check if REPACK CONCURRENTLY is in
+ * progress. If lockmode is too weak, cluster_rel() should
+ * detect incompatible DDLs executed by us.
+ *
+ * XXX We might skip the changes for DDLs which do not
+ * change the tuple descriptor.
+ */
+ check_for_concurrent_repack(relid, lockmode);
+
if (OidIsValid(relid))
{
AlterTableUtilityContext atcontext;
diff --git a/src/backend/utils/activity/backend_progress.c b/src/backend/utils/activity/backend_progress.c
index eebc968193..e2c84baba9 100644
--- a/src/backend/utils/activity/backend_progress.c
+++ b/src/backend/utils/activity/backend_progress.c
@@ -162,3 +162,19 @@ pgstat_progress_end_command(void)
beentry->st_progress.command_target = InvalidOid;
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
+
+void
+pgstat_progress_restore_state(PgBackendProgress *backup)
+{
+ volatile PgBackendStatus *beentry = MyBEEntry;
+
+ if (!beentry || !pgstat_track_activities)
+ return;
+
+ PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
+ beentry->st_progress.command = backup->command;
+ beentry->st_progress.command_target = backup->command_target;
+ memcpy(MyBEEntry->st_progress.param, backup->param,
+ sizeof(beentry->st_progress.param));
+ PGSTAT_END_WRITE_ACTIVITY(beentry);
+}
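pgstat_progress_restore_state() above is the mirror image of taking a backup of the progress entry by struct copy: the command, target and parameter array are written back wholesale under the write-activity protocol. A minimal sketch of the save/restore pattern with a simplified stand-in struct (none of the shared-memory or PGSTAT_* machinery):

```c
#include <string.h>
#include <stdint.h>

#define N_PARAM 20				/* stand-in for PGSTAT_NUM_PROGRESS_PARAM */

/* Simplified stand-in for PgBackendProgress. */
typedef struct MiniProgress
{
	int			command;
	uint32_t	command_target;
	int64_t		param[N_PARAM];
} MiniProgress;

/* Taking the backup is a plain struct copy. */
static void
progress_backup(const MiniProgress *live, MiniProgress *backup)
{
	*backup = *live;
}

/* Restoring copies the fields back, like the patch's restore function. */
static void
progress_restore(MiniProgress *live, const MiniProgress *backup)
{
	live->command = backup->command;
	live->command_target = backup->command_target;
	memcpy(live->param, backup->param, sizeof(live->param));
}
```

This save/restore pair lets REPACK CONCURRENTLY report a nested operation (such as an index build) and then resume its own progress view afterwards.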
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index e199f07162..5a0097d53b 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -346,6 +346,7 @@ WALSummarizer "Waiting to read or update WAL summarization state."
DSMRegistry "Waiting to read or update the dynamic shared memory registry."
InjectionPoint "Waiting to read or update information related to injection points."
SerialControl "Waiting to read or update shared <filename>pg_serial</filename> state."
+RepackedRels "Waiting to read or update information on tables being repacked concurrently."
#
# END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 700ccb6df9..cb92ddb1e3 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -1569,6 +1569,27 @@ CacheInvalidateRelcache(Relation relation)
databaseId, relationId);
}
+/*
+ * CacheInvalidateRelcacheImmediate
+ * Send invalidation message for the specified relation's relcache entry.
+ *
+ * Currently this is used in REPACK CONCURRENTLY, to make sure that other
+ * backends are aware that the command is being executed for the relation.
+ */
+void
+CacheInvalidateRelcacheImmediate(Relation relation)
+{
+ SharedInvalidationMessage msg;
+
+ msg.rc.id = SHAREDINVALRELCACHE_ID;
+ msg.rc.dbId = MyDatabaseId;
+ msg.rc.relId = RelationGetRelid(relation);
+ /* check AddCatcacheInvalidationMessage() for an explanation */
+ VALGRIND_MAKE_MEM_DEFINED(&msg, sizeof(msg));
+
+ SendSharedInvalidMessages(&msg, 1);
+}
+
/*
* CacheInvalidateRelcacheAll
* Register invalidation of the whole relcache at the end of command.
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 398114373e..1273149178 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -64,6 +64,7 @@
#include "catalog/pg_type.h"
#include "catalog/schemapg.h"
#include "catalog/storage.h"
+#include "commands/cluster.h"
#include "commands/policy.h"
#include "commands/publicationcmds.h"
#include "commands/trigger.h"
@@ -1249,6 +1250,10 @@ retry:
/* make sure relation is marked as having no open file yet */
relation->rd_smgr = NULL;
+ /* Is REPACK CONCURRENTLY in progress? */
+ relation->rd_repack_concurrent =
+ is_concurrent_repack_in_progress(targetRelId);
+
/*
* now we can free the memory allocated for pg_class_tuple
*/
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 42bded373b..103d1249bb 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -154,7 +154,6 @@ static List *exportedSnapshots = NIL;
/* Prototypes for local functions */
static void UnregisterSnapshotNoOwner(Snapshot snapshot);
-static void FreeSnapshot(Snapshot snapshot);
static void SnapshotResetXmin(void);
/* ResourceOwner callbacks to track snapshot references */
@@ -587,7 +586,7 @@ CopySnapshot(Snapshot snapshot)
* FreeSnapshot
* Free the memory associated with a snapshot.
*/
-static void
+void
FreeSnapshot(Snapshot snapshot)
{
Assert(snapshot->regd_count == 0);
diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index 72338fffb2..9aefd06481 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -4910,18 +4910,26 @@ match_previous_words(int pattern_id,
}
/* REPACK */
- else if (Matches("REPACK"))
+ else if (Matches("REPACK") || Matches("REPACK", "(*)"))
+ COMPLETE_WITH_SCHEMA_QUERY_PLUS(Query_for_list_of_clusterables,
+ "CONCURRENTLY");
+ else if (Matches("REPACK", "CONCURRENTLY"))
COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_clusterables);
- else if (Matches("REPACK", "(*)"))
+ else if (Matches("REPACK", "(*)", "CONCURRENTLY"))
COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_clusterables);
- /* If we have REPACK <sth>, then add "USING INDEX" */
- else if (Matches("REPACK", MatchAnyExcept("(")))
+ /* If we have REPACK [ CONCURRENTLY ] <sth>, then add "USING INDEX" */
+ else if (Matches("REPACK", MatchAnyExcept("(|CONCURRENTLY")) ||
+ Matches("REPACK", "CONCURRENTLY", MatchAnyExcept("(")))
COMPLETE_WITH("USING INDEX");
- /* If we have REPACK (*) <sth>, then add "USING INDEX" */
- else if (Matches("REPACK", "(*)", MatchAny))
+ /* If we have REPACK (*) [ CONCURRENTLY ] <sth>, then add "USING INDEX" */
+ else if (Matches("REPACK", "(*)", MatchAnyExcept("CONCURRENTLY")) ||
+ Matches("REPACK", "(*)", "CONCURRENTLY", MatchAnyExcept("(")))
COMPLETE_WITH("USING INDEX");
- /* If we have REPACK <sth> USING, then add the index as well */
- else if (Matches("REPACK", MatchAny, "USING", "INDEX"))
+ /*
+ * Complete ... [ (*) ] [ CONCURRENTLY ] <sth> USING INDEX, with a list of
+ * indexes for <sth>.
+ */
+ else if (TailMatches(MatchAnyExcept("(|CONCURRENTLY"), "USING", "INDEX"))
{
set_completion_reference(prev3_wd);
COMPLETE_WITH_SCHEMA_QUERY(Query_for_index_of_table);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 1640d9c32f..bdeb2f8354 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -421,6 +421,10 @@ extern HTSV_Result HeapTupleSatisfiesVacuumHorizon(HeapTuple htup, Buffer buffer
TransactionId *dead_after);
extern void HeapTupleSetHintBits(HeapTupleHeader tuple, Buffer buffer,
uint16 infomask, TransactionId xid);
+extern bool HeapTupleMVCCInserted(HeapTuple htup, Snapshot snapshot,
+ Buffer buffer);
+extern bool HeapTupleMVCCNotDeleted(HeapTuple htup, Snapshot snapshot,
+ Buffer buffer);
extern bool HeapTupleHeaderIsOnlyLocked(HeapTupleHeader tuple);
extern bool HeapTupleIsSurelyDead(HeapTuple htup,
struct GlobalVisState *vistest);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 131c050c15..aa3190986a 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,6 +21,7 @@
#include "access/sdir.h"
#include "access/xact.h"
#include "executor/tuptable.h"
+#include "replication/logical.h"
#include "storage/read_stream.h"
#include "utils/rel.h"
#include "utils/snapshot.h"
@@ -630,6 +631,8 @@ typedef struct TableAmRoutine
Relation OldIndex,
bool use_sort,
TransactionId OldestXmin,
+ Snapshot snapshot,
+ LogicalDecodingContext *decoding_ctx,
TransactionId *xid_cutoff,
MultiXactId *multi_cutoff,
double *num_tuples,
@@ -1673,6 +1676,10 @@ table_relation_copy_data(Relation rel, const RelFileLocator *newrlocator)
* not needed for the relation's AM
* - *xid_cutoff - ditto
* - *multi_cutoff - ditto
+ * - snapshot - if != NULL, ignore data changes done by transactions that this
+ * (MVCC) snapshot considers still in-progress or in the future.
+ * - decoding_ctx - logical decoding context, to capture concurrent data
+ * changes.
*
* Output parameters:
* - *xid_cutoff - rel's new relfrozenxid value, may be invalid
@@ -1685,6 +1692,8 @@ table_relation_copy_for_cluster(Relation OldTable, Relation NewTable,
Relation OldIndex,
bool use_sort,
TransactionId OldestXmin,
+ Snapshot snapshot,
+ LogicalDecodingContext *decoding_ctx,
TransactionId *xid_cutoff,
MultiXactId *multi_cutoff,
double *num_tuples,
@@ -1693,6 +1702,7 @@ table_relation_copy_for_cluster(Relation OldTable, Relation NewTable,
{
OldTable->rd_tableam->relation_copy_for_cluster(OldTable, NewTable, OldIndex,
use_sort, OldestXmin,
+ snapshot, decoding_ctx,
xid_cutoff, multi_cutoff,
num_tuples, tups_vacuumed,
tups_recently_dead);
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 4daa8bef5e..66431cc19e 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -100,6 +100,9 @@ extern Oid index_concurrently_create_copy(Relation heapRelation,
Oid tablespaceOid,
const char *newName);
+extern NullableDatum *get_index_stattargets(Oid indexid,
+ IndexInfo *indInfo);
+
extern void index_concurrently_build(Oid heapRelationId,
Oid indexRelationId);
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index c2976905e4..6fb5f5509c 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -13,10 +13,15 @@
#ifndef CLUSTER_H
#define CLUSTER_H
+#include "nodes/execnodes.h"
#include "nodes/parsenodes.h"
#include "parser/parse_node.h"
+#include "replication/logical.h"
#include "storage/lock.h"
+#include "storage/relfilelocator.h"
#include "utils/relcache.h"
+#include "utils/resowner.h"
+#include "utils/tuplestore.h"
/* flag bits for ClusterParams->options */
@@ -24,6 +29,7 @@
#define CLUOPT_RECHECK 0x02 /* recheck relation state */
#define CLUOPT_RECHECK_ISCLUSTERED 0x04 /* recheck relation state for
* indisclustered */
+#define CLUOPT_CONCURRENT 0x08 /* allow concurrent data changes */
/* options for CLUSTER */
typedef struct ClusterParams
@@ -46,14 +52,91 @@ typedef enum ClusterCommand
CLUSTER_COMMAND_VACUUM
} ClusterCommand;
+/*
+ * The following definitions are used by REPACK CONCURRENTLY.
+ */
+
+extern RelFileLocator repacked_rel_locator;
+extern RelFileLocator repacked_rel_toast_locator;
+
+typedef enum
+{
+ CHANGE_INSERT,
+ CHANGE_UPDATE_OLD,
+ CHANGE_UPDATE_NEW,
+ CHANGE_DELETE,
+ CHANGE_TRUNCATE
+} ConcurrentChangeKind;
+
+typedef struct ConcurrentChange
+{
+ /* See the enum above. */
+ ConcurrentChangeKind kind;
+
+ /*
+ * The actual tuple.
+ *
+ * The tuple data follows the ConcurrentChange structure. Before use make
+ * sure the tuple is correctly aligned (ConcurrentChange can be stored as
+ * bytea) and that tuple->t_data is fixed.
+ */
+ HeapTupleData tup_data;
+} ConcurrentChange;
+
+#define SizeOfConcurrentChange (offsetof(ConcurrentChange, tup_data) + \
+ sizeof(HeapTupleData))
+
+/*
+ * Logical decoding state.
+ *
+ * Here we store the data changes that we decode from WAL while the table
+ * contents are being copied to the new storage, along with the metadata
+ * needed to apply these changes to the table.
+ */
+typedef struct RepackDecodingState
+{
+ /* The relation whose changes we're decoding. */
+ Oid relid;
+
+ /*
+ * Decoded changes are stored here. Although we try to avoid excessive
+ * batches, it can happen that the changes need to be stored to disk. The
+ * tuplestore does this transparently.
+ */
+ Tuplestorestate *tstore;
+
+ /* The current number of changes in tstore. */
+ double nchanges;
+
+ /*
+ * Descriptor to store the ConcurrentChange structure serialized (bytea).
+ * We can't store the tuple directly because tuplestore only supports
+ * minimum tuple and we may need to transfer OID system column from the
+ * output plugin. Also we need to transfer the change kind, so it's better
+ * to put everything in the structure than to use 2 tuplestores "in
+ * parallel".
+ */
+ TupleDesc tupdesc_change;
+
+ /* Tuple descriptor needed to update indexes. */
+ TupleDesc tupdesc;
+
+ /* Slot to retrieve data from tstore. */
+ TupleTableSlot *tsslot;
+
+ ResourceOwner resowner;
+} RepackDecodingState;
+
extern void cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel);
extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
- ClusterCommand cmd);
+ ClusterCommand cmd, bool isTopLevel);
extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid,
LOCKMODE lockmode,
ClusterCommand cmd);
extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
-
+extern void can_repack_concurrently(Relation rel);
+extern void repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
+ XLogRecPtr end_of_wal);
extern Oid make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
char relpersistence, LOCKMODE lockmode);
extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
@@ -61,9 +144,15 @@ extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
bool swap_toast_by_content,
bool check_constraints,
bool is_internal,
+ bool reindex,
TransactionId frozenXid,
MultiXactId cutoffMulti,
char newrelpersistence);
+extern Size RepackShmemSize(void);
+extern void RepackShmemInit(void);
+extern bool is_concurrent_repack_in_progress(Oid relid);
+extern void check_for_concurrent_repack(Oid relid, LOCKMODE lockmode);
+
extern void repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel);
#endif /* CLUSTER_H */
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 7644267e14..6b1b1a4c1a 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -67,10 +67,12 @@
#define PROGRESS_REPACK_PHASE 1
#define PROGRESS_REPACK_INDEX_RELID 2
#define PROGRESS_REPACK_HEAP_TUPLES_SCANNED 3
-#define PROGRESS_REPACK_HEAP_TUPLES_WRITTEN 4
-#define PROGRESS_REPACK_TOTAL_HEAP_BLKS 5
-#define PROGRESS_REPACK_HEAP_BLKS_SCANNED 6
-#define PROGRESS_REPACK_INDEX_REBUILD_COUNT 7
+#define PROGRESS_REPACK_HEAP_TUPLES_INSERTED 4
+#define PROGRESS_REPACK_HEAP_TUPLES_UPDATED 5
+#define PROGRESS_REPACK_HEAP_TUPLES_DELETED 6
+#define PROGRESS_REPACK_TOTAL_HEAP_BLKS 7
+#define PROGRESS_REPACK_HEAP_BLKS_SCANNED 8
+#define PROGRESS_REPACK_INDEX_REBUILD_COUNT 9
/*
* Phases of repack (as advertised via PROGRESS_REPACK_PHASE).
@@ -83,9 +85,10 @@
#define PROGRESS_REPACK_PHASE_INDEX_SCAN_HEAP 2
#define PROGRESS_REPACK_PHASE_SORT_TUPLES 3
#define PROGRESS_REPACK_PHASE_WRITE_NEW_HEAP 4
-#define PROGRESS_REPACK_PHASE_SWAP_REL_FILES 5
-#define PROGRESS_REPACK_PHASE_REBUILD_INDEX 6
-#define PROGRESS_REPACK_PHASE_FINAL_CLEANUP 7
+#define PROGRESS_REPACK_PHASE_CATCH_UP 5
+#define PROGRESS_REPACK_PHASE_SWAP_REL_FILES 6
+#define PROGRESS_REPACK_PHASE_REBUILD_INDEX 7
+#define PROGRESS_REPACK_PHASE_FINAL_CLEANUP 8
/* Commands of PROGRESS_REPACK */
#define PROGRESS_REPACK_COMMAND_REPACK 1
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 03ed0450df..d6053cab9f 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3924,6 +3924,7 @@ typedef struct RepackStmt
RangeVar *relation; /* relation being repacked */
char *indexname; /* order tuples by this index */
List *params; /* list of DefElem nodes */
+ bool concurrent; /* allow concurrent access? */
} RepackStmt;
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 6d4d2d1814..802fc4b082 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -73,6 +73,7 @@ extern void FreeSnapshotBuilder(SnapBuild *builder);
extern void SnapBuildSnapDecRefcount(Snapshot snap);
extern Snapshot SnapBuildInitialSnapshot(SnapBuild *builder);
+extern Snapshot SnapBuildInitialSnapshotForRepack(SnapBuild *builder);
extern Snapshot SnapBuildMVCCFromHistoric(Snapshot snapshot, bool in_place);
extern const char *SnapBuildExportSnapshot(SnapBuild *builder);
extern void SnapBuildClearExportedSnapshot(void);
diff --git a/src/include/storage/lockdefs.h b/src/include/storage/lockdefs.h
index 7f3ba0352f..b0d81b736d 100644
--- a/src/include/storage/lockdefs.h
+++ b/src/include/storage/lockdefs.h
@@ -36,8 +36,9 @@ typedef int LOCKMODE;
#define AccessShareLock 1 /* SELECT */
#define RowShareLock 2 /* SELECT FOR UPDATE/FOR SHARE */
#define RowExclusiveLock 3 /* INSERT, UPDATE, DELETE */
-#define ShareUpdateExclusiveLock 4 /* VACUUM (non-FULL), ANALYZE, CREATE
- * INDEX CONCURRENTLY */
+#define ShareUpdateExclusiveLock 4 /* VACUUM (non-exclusive), ANALYZE, CREATE
+ * INDEX CONCURRENTLY, REPACK
+ * CONCURRENTLY */
#define ShareLock 5 /* CREATE INDEX (WITHOUT CONCURRENTLY) */
#define ShareRowExclusiveLock 6 /* like EXCLUSIVE MODE, but allows ROW
* SHARE */
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index cf56545238..f07973b458 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -83,3 +83,4 @@ PG_LWLOCK(49, WALSummarizer)
PG_LWLOCK(50, DSMRegistry)
PG_LWLOCK(51, InjectionPoint)
PG_LWLOCK(52, SerialControl)
+PG_LWLOCK(53, RepackedRels)
diff --git a/src/include/utils/backend_progress.h b/src/include/utils/backend_progress.h
index 2f1de46d05..cea341276b 100644
--- a/src/include/utils/backend_progress.h
+++ b/src/include/utils/backend_progress.h
@@ -36,7 +36,7 @@ typedef enum ProgressCommandType
/*
* Any command which wishes can advertise that it is running by setting
- * command, command_target, and param[]. command_target should be the OID of
+ * command, command_target, and param[]. command_target should be the OID of
* the relation which the command targets (we assume there's just one, as this
* is meant for utility commands), but the meaning of each element in the
* param array is command-specific.
@@ -56,6 +56,7 @@ extern void pgstat_progress_parallel_incr_param(int index, int64 incr);
extern void pgstat_progress_update_multi_param(int nparam, const int *index,
const int64 *val);
extern void pgstat_progress_end_command(void);
+extern void pgstat_progress_restore_state(PgBackendProgress *backup);
#endif /* BACKEND_PROGRESS_H */
diff --git a/src/include/utils/inval.h b/src/include/utils/inval.h
index 40658ba2ff..6b2faed672 100644
--- a/src/include/utils/inval.h
+++ b/src/include/utils/inval.h
@@ -49,6 +49,8 @@ extern void CacheInvalidateCatalog(Oid catalogId);
extern void CacheInvalidateRelcache(Relation relation);
+extern void CacheInvalidateRelcacheImmediate(Relation relation);
+
extern void CacheInvalidateRelcacheAll(void);
extern void CacheInvalidateRelcacheByTuple(HeapTuple classTuple);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index db3e504c3d..741b29226d 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -253,6 +253,9 @@ typedef struct RelationData
bool pgstat_enabled; /* should relation stats be counted */
/* use "struct" here to avoid needing to include pgstat.h: */
struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+ /* Is REPACK CONCURRENTLY being performed on this relation? */
+ bool rd_repack_concurrent;
} RelationData;
@@ -691,7 +694,9 @@ RelationCloseSmgr(Relation relation)
#define RelationIsAccessibleInLogicalDecoding(relation) \
(XLogLogicalInfoActive() && \
RelationNeedsWAL(relation) && \
- (IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation)))
+ (IsCatalogRelation(relation) || \
+ RelationIsUsedAsCatalogTable(relation) || \
+ (relation)->rd_repack_concurrent))
/*
* RelationIsLogicallyLogged
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index 147b190210..5eeabdc6c4 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -61,6 +61,8 @@ extern Snapshot GetLatestSnapshot(void);
extern void SnapshotSetCommandId(CommandId curcid);
extern Snapshot CopySnapshot(Snapshot snapshot);
+extern void FreeSnapshot(Snapshot snapshot);
+
extern Snapshot GetCatalogSnapshot(Oid relid);
extern Snapshot GetNonHistoricCatalogSnapshot(Oid relid);
extern void InvalidateCatalogSnapshot(void);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 50d87af2fd..587c0c85b0 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1969,17 +1969,17 @@ pg_stat_progress_cluster| SELECT s.pid,
WHEN 2 THEN 'index scanning heap'::text
WHEN 3 THEN 'sorting tuples'::text
WHEN 4 THEN 'writing new heap'::text
- WHEN 5 THEN 'swapping relation files'::text
- WHEN 6 THEN 'rebuilding index'::text
- WHEN 7 THEN 'performing final cleanup'::text
+ WHEN 6 THEN 'swapping relation files'::text
+ WHEN 7 THEN 'rebuilding index'::text
+ WHEN 8 THEN 'performing final cleanup'::text
ELSE NULL::text
END AS phase,
(s.param3)::oid AS cluster_index_relid,
s.param4 AS heap_tuples_scanned,
s.param5 AS heap_tuples_written,
- s.param6 AS heap_blks_total,
- s.param7 AS heap_blks_scanned,
- s.param8 AS index_rebuild_count
+ s.param8 AS heap_blks_total,
+ s.param9 AS heap_blks_scanned,
+ s.param10 AS index_rebuild_count
FROM (pg_stat_get_progress_info('CLUSTER'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_progress_copy| SELECT s.pid,
@@ -2055,17 +2055,20 @@ pg_stat_progress_repack| SELECT s.pid,
WHEN 2 THEN 'index scanning heap'::text
WHEN 3 THEN 'sorting tuples'::text
WHEN 4 THEN 'writing new heap'::text
- WHEN 5 THEN 'swapping relation files'::text
- WHEN 6 THEN 'rebuilding index'::text
- WHEN 7 THEN 'performing final cleanup'::text
+ WHEN 5 THEN 'catch-up'::text
+ WHEN 6 THEN 'swapping relation files'::text
+ WHEN 7 THEN 'rebuilding index'::text
+ WHEN 8 THEN 'performing final cleanup'::text
ELSE NULL::text
END AS phase,
(s.param3)::oid AS repack_index_relid,
s.param4 AS heap_tuples_scanned,
- s.param5 AS heap_tuples_written,
- s.param6 AS heap_blks_total,
- s.param7 AS heap_blks_scanned,
- s.param8 AS index_rebuild_count
+ s.param5 AS heap_tuples_inserted,
+ s.param6 AS heap_tuples_updated,
+ s.param7 AS heap_tuples_deleted,
+ s.param8 AS heap_blks_total,
+ s.param9 AS heap_blks_scanned,
+ s.param10 AS index_rebuild_count
FROM (pg_stat_get_progress_info('REPACK'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_progress_vacuum| SELECT s.pid,
--
2.43.5
v08-0005-Preserve-visibility-information-of-the-concurrent-da.patch
From d69d88cade2d164ca019c3fd5eea8e8ff2a6eea3 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Wed, 26 Feb 2025 09:17:20 +0100
Subject: [PATCH 5/9] Preserve visibility information of the concurrent data
changes.
As explained in the commit message of the preceding patch of the series, the
data changes done by applications while REPACK CONCURRENTLY is copying the
table contents to a new file are decoded from WAL and eventually also applied
to the new file. To reduce the complexity a little, the preceding patch uses
the current transaction (i.e. the transaction opened by the REPACK command)
to execute those INSERT, UPDATE and DELETE commands.
However, REPACK is not expected to change visibility of tuples. Therefore,
this patch fixes the handling of the "concurrent data changes". Now the tuples
written into the new table storage have the same XID and command ID (CID) as
they had in the old storage.
A related change made here is that the data changes (INSERT, UPDATE, DELETE)
we "replay" on the new storage are not logically decoded. First, the logical
decoding subsystem does not expect an already committed transaction to be
decoded again. Second, repeated decoding would just be wasted effort.
---
src/backend/access/common/toast_internals.c | 3 +-
src/backend/access/heap/heapam.c | 73 ++++++++----
src/backend/access/heap/heapam_handler.c | 14 ++-
src/backend/access/transam/xact.c | 52 ++++++++
src/backend/commands/cluster.c | 111 ++++++++++++++++--
src/backend/replication/logical/decode.c | 76 ++++++++++--
src/backend/replication/logical/snapbuild.c | 22 ++--
.../pgoutput_repack/pgoutput_repack.c | 68 +++++++++--
src/include/access/heapam.h | 15 ++-
src/include/access/heapam_xlog.h | 2 +
src/include/access/xact.h | 2 +
src/include/commands/cluster.h | 18 +++
src/include/utils/snapshot.h | 3 +
13 files changed, 389 insertions(+), 70 deletions(-)
diff --git a/src/backend/access/common/toast_internals.c b/src/backend/access/common/toast_internals.c
index 7d8be8346c..75d889ec72 100644
--- a/src/backend/access/common/toast_internals.c
+++ b/src/backend/access/common/toast_internals.c
@@ -320,7 +320,8 @@ toast_save_datum(Relation rel, Datum value,
memcpy(VARDATA(&chunk_data), data_p, chunk_size);
toasttup = heap_form_tuple(toasttupDesc, t_values, t_isnull);
- heap_insert(toastrel, toasttup, mycid, options, NULL);
+ heap_insert(toastrel, toasttup, GetCurrentTransactionId(), mycid,
+ options, NULL);
/*
* Create the index entry. We cheat a little here by not using
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index cb856a74ee..66d21e3c9f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -60,7 +60,8 @@ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
HeapTuple newtup, HeapTuple old_key_tuple,
- bool all_visible_cleared, bool new_all_visible_cleared);
+ bool all_visible_cleared, bool new_all_visible_cleared,
+ bool wal_logical);
#ifdef USE_ASSERT_CHECKING
static void check_lock_if_inplace_updateable_rel(Relation relation,
ItemPointer otid,
@@ -1989,7 +1990,7 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
/*
* heap_insert - insert tuple into a heap
*
- * The new tuple is stamped with current transaction ID and the specified
+ * The new tuple is stamped with specified transaction ID and the specified
* command ID.
*
* See table_tuple_insert for comments about most of the input flags, except
@@ -2005,15 +2006,16 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
* reflected into *tup.
*/
void
-heap_insert(Relation relation, HeapTuple tup, CommandId cid,
- int options, BulkInsertState bistate)
+heap_insert(Relation relation, HeapTuple tup, TransactionId xid,
+ CommandId cid, int options, BulkInsertState bistate)
{
- TransactionId xid = GetCurrentTransactionId();
HeapTuple heaptup;
Buffer buffer;
Buffer vmbuffer = InvalidBuffer;
bool all_visible_cleared = false;
+ Assert(TransactionIdIsValid(xid));
+
/* Cheap, simplistic check that the tuple matches the rel's rowtype. */
Assert(HeapTupleHeaderGetNatts(tup->t_data) <=
RelationGetNumberOfAttributes(relation));
@@ -2644,7 +2646,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
void
simple_heap_insert(Relation relation, HeapTuple tup)
{
- heap_insert(relation, tup, GetCurrentCommandId(true), 0, NULL);
+ heap_insert(relation, tup, GetCurrentTransactionId(),
+ GetCurrentCommandId(true), 0, NULL);
}
/*
@@ -2701,11 +2704,11 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
*/
TM_Result
heap_delete(Relation relation, ItemPointer tid,
- CommandId cid, Snapshot crosscheck, bool wait,
- TM_FailureData *tmfd, bool changingPart)
+ TransactionId xid, CommandId cid, Snapshot crosscheck, bool wait,
+ TM_FailureData *tmfd, bool changingPart,
+ bool wal_logical)
{
TM_Result result;
- TransactionId xid = GetCurrentTransactionId();
ItemId lp;
HeapTupleData tp;
Page page;
@@ -2722,6 +2725,7 @@ heap_delete(Relation relation, ItemPointer tid,
bool old_key_copied = false;
Assert(ItemPointerIsValid(tid));
+ Assert(TransactionIdIsValid(xid));
/*
* Forbid this during a parallel operation, lest it allocate a combo CID.
@@ -2947,7 +2951,8 @@ l1:
* Compute replica identity tuple before entering the critical section so
* we don't PANIC upon a memory allocation failure.
*/
- old_key_tuple = ExtractReplicaIdentity(relation, &tp, true, &old_key_copied);
+ old_key_tuple = wal_logical ?
+ ExtractReplicaIdentity(relation, &tp, true, &old_key_copied) : NULL;
/*
* If this is the first possibly-multixact-able operation in the current
@@ -3015,8 +3020,12 @@ l1:
/*
* For logical decode we need combo CIDs to properly decode the
* catalog
+ *
+ * Like in heap_insert(), visibility is unchanged when called from
+ * VACUUM FULL / CLUSTER.
*/
- if (RelationIsAccessibleInLogicalDecoding(relation))
+ if (wal_logical &&
+ RelationIsAccessibleInLogicalDecoding(relation))
log_heap_new_cid(relation, &tp);
xlrec.flags = 0;
@@ -3037,6 +3046,15 @@ l1:
xlrec.flags |= XLH_DELETE_CONTAINS_OLD_KEY;
}
+ /*
+ * Unlike UPDATE, DELETE is decoded even if there is no old key, so it
+ * does not help to clear both XLH_DELETE_CONTAINS_OLD_TUPLE and
+ * XLH_DELETE_CONTAINS_OLD_KEY. Thus we need an extra flag. TODO
+ * Consider not decoding tuples w/o the old tuple/key instead.
+ */
+ if (!wal_logical)
+ xlrec.flags |= XLH_DELETE_NO_LOGICAL;
+
XLogBeginInsert();
XLogRegisterData(&xlrec, SizeOfHeapDelete);
@@ -3126,10 +3144,11 @@ simple_heap_delete(Relation relation, ItemPointer tid)
TM_Result result;
TM_FailureData tmfd;
- result = heap_delete(relation, tid,
+ result = heap_delete(relation, tid, GetCurrentTransactionId(),
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &tmfd, false /* changingPart */ );
+ &tmfd, false /* changingPart */ ,
+ true /* wal_logical */);
switch (result)
{
case TM_SelfModified:
@@ -3168,12 +3187,11 @@ simple_heap_delete(Relation relation, ItemPointer tid)
*/
TM_Result
heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
- CommandId cid, Snapshot crosscheck, bool wait,
- TM_FailureData *tmfd, LockTupleMode *lockmode,
- TU_UpdateIndexes *update_indexes)
+ TransactionId xid, CommandId cid, Snapshot crosscheck,
+ bool wait, TM_FailureData *tmfd, LockTupleMode *lockmode,
+ TU_UpdateIndexes *update_indexes, bool wal_logical)
{
TM_Result result;
- TransactionId xid = GetCurrentTransactionId();
Bitmapset *hot_attrs;
Bitmapset *sum_attrs;
Bitmapset *key_attrs;
@@ -3213,6 +3231,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
infomask2_new_tuple;
Assert(ItemPointerIsValid(otid));
+ Assert(TransactionIdIsValid(xid));
/* Cheap, simplistic check that the tuple matches the rel's rowtype. */
Assert(HeapTupleHeaderGetNatts(newtup->t_data) <=
@@ -4050,8 +4069,12 @@ l2:
/*
* For logical decoding we need combo CIDs to properly decode the
* catalog.
+ *
+ * Like in heap_insert(), visibility is unchanged when called from
+ * VACUUM FULL / CLUSTER.
*/
- if (RelationIsAccessibleInLogicalDecoding(relation))
+ if (wal_logical &&
+ RelationIsAccessibleInLogicalDecoding(relation))
{
log_heap_new_cid(relation, &oldtup);
log_heap_new_cid(relation, heaptup);
@@ -4061,7 +4084,8 @@ l2:
newbuf, &oldtup, heaptup,
old_key_tuple,
all_visible_cleared,
- all_visible_cleared_new);
+ all_visible_cleared_new,
+ wal_logical);
if (newbuf != buffer)
{
PageSetLSN(BufferGetPage(newbuf), recptr);
@@ -4416,10 +4440,10 @@ simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup,
TM_FailureData tmfd;
LockTupleMode lockmode;
- result = heap_update(relation, otid, tup,
+ result = heap_update(relation, otid, tup, GetCurrentTransactionId(),
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &tmfd, &lockmode, update_indexes);
+ &tmfd, &lockmode, update_indexes, true);
switch (result)
{
case TM_SelfModified:
@@ -8750,7 +8774,8 @@ static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup, HeapTuple newtup,
HeapTuple old_key_tuple,
- bool all_visible_cleared, bool new_all_visible_cleared)
+ bool all_visible_cleared, bool new_all_visible_cleared,
+ bool wal_logical)
{
xl_heap_update xlrec;
xl_heap_header xlhdr;
@@ -8761,10 +8786,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
suffixlen = 0;
XLogRecPtr recptr;
Page page = BufferGetPage(newbuf);
- bool need_tuple_data = RelationIsLogicallyLogged(reln);
+ bool need_tuple_data;
bool init;
int bufflags;
+ need_tuple_data = RelationIsLogicallyLogged(reln) && wal_logical;
+
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index b2bfd05dc9..5876a96e79 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -252,7 +252,8 @@ heapam_tuple_insert(Relation relation, TupleTableSlot *slot, CommandId cid,
tuple->t_tableOid = slot->tts_tableOid;
/* Perform the insertion, and copy the resulting ItemPointer */
- heap_insert(relation, tuple, cid, options, bistate);
+ heap_insert(relation, tuple, GetCurrentTransactionId(), cid, options,
+ bistate);
ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
if (shouldFree)
@@ -275,7 +276,8 @@ heapam_tuple_insert_speculative(Relation relation, TupleTableSlot *slot,
options |= HEAP_INSERT_SPECULATIVE;
/* Perform the insertion, and copy the resulting ItemPointer */
- heap_insert(relation, tuple, cid, options, bistate);
+ heap_insert(relation, tuple, GetCurrentTransactionId(), cid, options,
+ bistate);
ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
if (shouldFree)
@@ -309,7 +311,8 @@ heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
* the storage itself is cleaning the dead tuples by itself, it is the
* time to call the index tuple deletion also.
*/
- return heap_delete(relation, tid, cid, crosscheck, wait, tmfd, changingPart);
+ return heap_delete(relation, tid, GetCurrentTransactionId(), cid,
+ crosscheck, wait, tmfd, changingPart, true);
}
@@ -327,8 +330,9 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
slot->tts_tableOid = RelationGetRelid(relation);
tuple->t_tableOid = slot->tts_tableOid;
- result = heap_update(relation, otid, tuple, cid, crosscheck, wait,
- tmfd, lockmode, update_indexes);
+ result = heap_update(relation, otid, tuple, GetCurrentTransactionId(),
+ cid, crosscheck, wait,
+ tmfd, lockmode, update_indexes, true);
ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
/*
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 1b4f21a88d..0bfd329847 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -125,6 +125,18 @@ static FullTransactionId XactTopFullTransactionId = {InvalidTransactionId};
static int nParallelCurrentXids = 0;
static TransactionId *ParallelCurrentXids;
+/*
+ * Another case that requires TransactionIdIsCurrentTransactionId() to behave
+ * specially is when REPACK CONCURRENTLY is processing data changes made in
+ * the old storage of a table by other transactions. When applying the changes
+ * to the new storage, the backend executing the REPACK command needs to act
+ * on behalf of those other transactions. The transactions responsible for the
+ * changes in the old storage are stored in this array, sorted by
+ * xidComparator.
+ */
+static int nRepackCurrentXids = 0;
+static TransactionId *RepackCurrentXids = NULL;
+
/*
* Miscellaneous flag bits to record events which occur on the top level
* transaction. These flags are only persisted in MyXactFlags and are intended
@@ -971,6 +983,8 @@ TransactionIdIsCurrentTransactionId(TransactionId xid)
int low,
high;
+ Assert(nRepackCurrentXids == 0);
+
low = 0;
high = nParallelCurrentXids - 1;
while (low <= high)
@@ -990,6 +1004,21 @@ TransactionIdIsCurrentTransactionId(TransactionId xid)
return false;
}
+ /*
+ * When executing REPACK CONCURRENTLY, the array of current transactions
+ * is provided via SetRepackCurrentXids().
+ */
+ if (nRepackCurrentXids > 0)
+ {
+ Assert(nParallelCurrentXids == 0);
+
+ return bsearch(&xid,
+ RepackCurrentXids,
+ nRepackCurrentXids,
+ sizeof(TransactionId),
+ xidComparator) != NULL;
+ }
+
/*
* We will return true for the Xid of the current subtransaction, any of
* its subcommitted children, any of its parents, or any of their
@@ -5628,6 +5657,29 @@ EndParallelWorkerTransaction(void)
CurrentTransactionState->blockState = TBLOCK_DEFAULT;
}
+/*
+ * SetRepackCurrentXids
+ * Set the XID array that TransactionIdIsCurrentTransactionId() should
+ * use.
+ */
+void
+SetRepackCurrentXids(TransactionId *xip, int xcnt)
+{
+ RepackCurrentXids = xip;
+ nRepackCurrentXids = xcnt;
+}
+
+/*
+ * ResetRepackCurrentXids
+ * Undo the effect of SetRepackCurrentXids().
+ */
+void
+ResetRepackCurrentXids(void)
+{
+ RepackCurrentXids = NULL;
+ nRepackCurrentXids = 0;
+}
+
/*
* ShowTransactionState
* Debug support
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 592ff6041b..b336d760f2 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -209,6 +209,7 @@ static void apply_concurrent_delete(Relation rel, HeapTuple tup_target,
ConcurrentChange *change);
static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
HeapTuple tup_key,
+ Snapshot snapshot,
IndexInsertState *iistate,
TupleTableSlot *ident_slot,
IndexScanDesc *scan_p);
@@ -2960,6 +2961,9 @@ setup_logical_decoding(Oid relid, const char *slotname, TupleDesc tupdesc)
dstate->relid = relid;
dstate->tstore = tuplestore_begin_heap(false, false,
maintenance_work_mem);
+#ifdef USE_ASSERT_CHECKING
+ dstate->last_change_xid = InvalidTransactionId;
+#endif
dstate->tupdesc = tupdesc;
/* Initialize the descriptor to store the changes ... */
@@ -3115,6 +3119,7 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
tup_exist;
char *change_raw, *src;
ConcurrentChange change;
+ Snapshot snapshot;
bool isnull[1];
Datum values[1];
@@ -3183,8 +3188,30 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
/*
* Find the tuple to be updated or deleted.
+ *
+ * As the table being repacked concurrently is considered a
+ * "user catalog", the new CID is WAL-logged and decoded. And since we
+ * use the same XID that the original DMLs did, the snapshot used
+ * for the logical decoding (by now converted to a non-historic
+ * MVCC snapshot) should see the tuples inserted previously into
+ * the new heap and/or updated there.
+ */
+ snapshot = change.snapshot;
+
+ /*
+ * Set what should be considered current transaction (and
+ * subtransactions) during visibility check.
+ *
+ * Note that this snapshot was created from a historic snapshot
+ * using SnapBuildMVCCFromHistoric(), which does not touch
+ * 'subxip'. Thus, unlike in a regular MVCC snapshot, the array
+ * only contains the transactions whose data changes we are
+ * applying, and their subtransactions. That's exactly what we need
+ * to check whether a particular xact is a "current transaction".
*/
- tup_exist = find_target_tuple(rel, key, nkeys, tup_key,
+ SetRepackCurrentXids(snapshot->subxip, snapshot->subxcnt);
+
+ tup_exist = find_target_tuple(rel, key, nkeys, tup_key, snapshot,
iistate, ident_slot, &ind_scan);
if (tup_exist == NULL)
elog(ERROR, "Failed to find target tuple");
@@ -3195,6 +3222,8 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
else
apply_concurrent_delete(rel, tup_exist, &change);
+ ResetRepackCurrentXids();
+
if (tup_old != NULL)
{
pfree(tup_old);
@@ -3207,11 +3236,14 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
else
elog(ERROR, "Unrecognized kind of change: %d", change.kind);
- /* If there's any change, make it visible to the next iteration. */
- if (change.kind != CHANGE_UPDATE_OLD)
+ /* Free the snapshot if this is the last change that needed it. */
+ Assert(change.snapshot->active_count > 0);
+ change.snapshot->active_count--;
+ if (change.snapshot->active_count == 0)
{
- CommandCounterIncrement();
- UpdateActiveSnapshotCommandId();
+ if (change.snapshot == dstate->snapshot)
+ dstate->snapshot = NULL;
+ FreeSnapshot(change.snapshot);
}
/* TTSOpsMinimalTuple has .get_heap_tuple==NULL. */
@@ -3231,10 +3263,30 @@ static void
apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
IndexInsertState *iistate, TupleTableSlot *index_slot)
{
+ Snapshot snapshot = change->snapshot;
List *recheck;
+ /*
+ * For INSERT, the visibility information is not important, but we use the
+ * snapshot to get CID. Index functions might need the whole snapshot
+ * anyway.
+ */
+ SetRepackCurrentXids(snapshot->subxip, snapshot->subxcnt);
- heap_insert(rel, tup, GetCurrentCommandId(true), HEAP_INSERT_NO_LOGICAL, NULL);
+ /*
+ * Write the tuple into the new heap.
+ *
+ * The snapshot is the one we used to decode the insert (though converted
+ * to "non-historic" MVCC snapshot), i.e. the snapshot's curcid is the
+ * tuple CID incremented by one (due to the "new CID" WAL record that got
+ * written along with the INSERT record). Thus if we want to use the
+ * original CID, we need to subtract 1 from curcid.
+ */
+ Assert(snapshot->curcid != InvalidCommandId &&
+ snapshot->curcid > FirstCommandId);
+
+ heap_insert(rel, tup, change->xid, snapshot->curcid - 1,
+ HEAP_INSERT_NO_LOGICAL, NULL);
/*
* Update indexes.
@@ -3242,6 +3294,7 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
* In case functions in the index need the active snapshot and caller
* hasn't set one.
*/
+ PushActiveSnapshot(snapshot);
ExecStoreHeapTuple(tup, index_slot, false);
recheck = ExecInsertIndexTuples(iistate->rri,
index_slot,
@@ -3252,6 +3305,8 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
NIL, /* arbiterIndexes */
false /* onlySummarizing */
);
+ PopActiveSnapshot();
+ ResetRepackCurrentXids();
/*
* If recheck is required, it must have been performed on the source
@@ -3269,18 +3324,36 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
TupleTableSlot *index_slot)
{
List *recheck;
+ LockTupleMode lockmode;
TU_UpdateIndexes update_indexes;
+ TM_Result res;
+ Snapshot snapshot = change->snapshot;
+ TM_FailureData tmfd;
/*
* Write the new tuple into the new heap. ('tup' gets the TID assigned
* here.)
+ *
+ * Regarding CID, see the comment in apply_concurrent_insert().
*/
- simple_heap_update(rel, &tup_target->t_self, tup, &update_indexes);
+ Assert(snapshot->curcid != InvalidCommandId &&
+ snapshot->curcid > FirstCommandId);
+
+ res = heap_update(rel, &tup_target->t_self, tup,
+ change->xid, snapshot->curcid - 1,
+ InvalidSnapshot,
+ false, /* no wait - only we are doing changes */
+ &tmfd, &lockmode, &update_indexes,
+ /* wal_logical */
+ false);
+ if (res != TM_Ok)
+ ereport(ERROR, (errmsg("failed to apply concurrent UPDATE")));
ExecStoreHeapTuple(tup, index_slot, false);
if (update_indexes != TU_None)
{
+ PushActiveSnapshot(snapshot);
recheck = ExecInsertIndexTuples(iistate->rri,
index_slot,
iistate->estate,
@@ -3290,6 +3363,7 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
NIL, /* arbiterIndexes */
/* onlySummarizing */
update_indexes == TU_Summarizing);
+ PopActiveSnapshot();
list_free(recheck);
}
@@ -3300,7 +3374,22 @@ static void
apply_concurrent_delete(Relation rel, HeapTuple tup_target,
ConcurrentChange *change)
{
- simple_heap_delete(rel, &tup_target->t_self);
+ TM_Result res;
+ TM_FailureData tmfd;
+ Snapshot snapshot = change->snapshot;
+
+ /* Regarding CID, see the comment in apply_concurrent_insert(). */
+ Assert(snapshot->curcid != InvalidCommandId &&
+ snapshot->curcid > FirstCommandId);
+
+ res = heap_delete(rel, &tup_target->t_self, change->xid,
+ snapshot->curcid - 1, InvalidSnapshot, false,
+ &tmfd, false,
+ /* wal_logical */
+ false);
+
+ if (res != TM_Ok)
+ ereport(ERROR, (errmsg("failed to apply concurrent DELETE")));
pgstat_progress_incr_param(PROGRESS_REPACK_HEAP_TUPLES_DELETED, 1);
}
@@ -3318,7 +3407,7 @@ apply_concurrent_delete(Relation rel, HeapTuple tup_target,
*/
static HeapTuple
find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
- IndexInsertState *iistate,
+ Snapshot snapshot, IndexInsertState *iistate,
TupleTableSlot *ident_slot, IndexScanDesc *scan_p)
{
IndexScanDesc scan;
@@ -3326,7 +3415,7 @@ find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
int2vector *ident_indkey;
HeapTuple result = NULL;
- scan = index_beginscan(rel, iistate->ident_index, GetActiveSnapshot(),
+ scan = index_beginscan(rel, iistate->ident_index, snapshot,
nkeys, 0);
*scan_p = scan;
index_rescan(scan, key, nkeys, NULL, 0);
@@ -3398,6 +3487,8 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
}
PG_FINALLY();
{
+ ResetRepackCurrentXids();
+
if (rel_src)
rel_dst->rd_toastoid = InvalidOid;
}
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index a6df190747..55abda75d1 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -469,9 +469,18 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
SnapBuild *builder = ctx->snapshot_builder;
/*
- * Check if REPACK CONCURRENTLY is being performed by this backend. If so,
- * only decode data changes of the table that it is processing, and the
- * changes of its TOAST relation.
+ * If the change is not intended for logical decoding, do not even
+ * establish a transaction for it. This is particularly important if the
+ * record was generated by REPACK CONCURRENTLY because this command uses
+ * the original XID when doing changes in the new storage. The decoding
+ * subsystem probably does not expect to see the same transaction multiple
+ * times.
+ */
+
+ /*
+ * First, check if REPACK CONCURRENTLY is being performed by this
+ * backend. If so, only decode data changes of the table that it is
+ * processing, and the changes of its TOAST relation.
*
* (TOAST locator should not be set unless the main is.)
*/
@@ -491,6 +500,60 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
return;
}
+ /*
+ * Second, skip records which do not contain sufficient information for
+ * the decoding.
+ *
+ * The backend executing REPACK CONCURRENTLY should not return here
+ * because the records which passed the checks above should be
+ * eligible for decoding. However, REPACK CONCURRENTLY generates WAL when
+ * writing data into the new table, which should not be decoded by the
+ * other backends. This is where the other backends skip them.
+ */
+ switch (info)
+ {
+ case XLOG_HEAP_INSERT:
+ {
+ xl_heap_insert *rec;
+
+ rec = (xl_heap_insert *) XLogRecGetData(buf->record);
+ /*
+ * (Besides insertion into the main heap by REPACK CONCURRENTLY,
+ * this also happens when raw_heap_insert marks the TOAST record as
+ * HEAP_INSERT_NO_LOGICAL.)
+ */
+ if ((rec->flags & XLH_INSERT_CONTAINS_NEW_TUPLE) == 0)
+ return;
+
+ break;
+ }
+
+ case XLOG_HEAP_HOT_UPDATE:
+ case XLOG_HEAP_UPDATE:
+ {
+ xl_heap_update *rec;
+
+ rec = (xl_heap_update *) XLogRecGetData(buf->record);
+ if ((rec->flags &
+ (XLH_UPDATE_CONTAINS_NEW_TUPLE |
+ XLH_UPDATE_CONTAINS_OLD_TUPLE |
+ XLH_UPDATE_CONTAINS_OLD_KEY)) == 0)
+ return;
+
+ break;
+ }
+
+ case XLOG_HEAP_DELETE:
+ {
+ xl_heap_delete *rec;
+
+ rec = (xl_heap_delete *) XLogRecGetData(buf->record);
+ if (rec->flags & XLH_DELETE_NO_LOGICAL)
+ return;
+ break;
+ }
+ }
+
ReorderBufferProcessXid(ctx->reorder, xid, buf->origptr);
/*
@@ -923,13 +986,6 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
xlrec = (xl_heap_insert *) XLogRecGetData(r);
- /*
- * Ignore insert records without new tuples (this does happen when
- * raw_heap_insert marks the TOAST record as HEAP_INSERT_NO_LOGICAL).
- */
- if (!(xlrec->flags & XLH_INSERT_CONTAINS_NEW_TUPLE))
- return;
-
/* only interested in our database */
XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
if (target_locator.dbOid != ctx->slot->data.database)
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index c54a1277cc..554fe83f4b 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -155,7 +155,7 @@ static bool ExportInProgress = false;
static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
/* snapshot building/manipulation/distribution functions */
-static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
+static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder, XLogRecPtr lsn);
static void SnapBuildFreeSnapshot(Snapshot snap);
@@ -352,12 +352,17 @@ SnapBuildSnapDecRefcount(Snapshot snap)
* Build a new snapshot, based on currently committed catalog-modifying
* transactions.
*
+ * 'lsn' is the location of the commit record (of a catalog-changing
+ * transaction) that triggered creation of the snapshot. Pass
+ * InvalidXLogRecPtr for the transaction base snapshot, or if the user of
+ * the snapshot does not need the LSN.
+ *
* In-progress transactions with catalog access are *not* allowed to modify
* these snapshots; they have to copy them and fill in appropriate ->curcid
* and ->subxip/subxcnt values.
*/
static Snapshot
-SnapBuildBuildSnapshot(SnapBuild *builder)
+SnapBuildBuildSnapshot(SnapBuild *builder, XLogRecPtr lsn)
{
Snapshot snapshot;
Size ssize;
@@ -425,6 +430,7 @@ SnapBuildBuildSnapshot(SnapBuild *builder)
snapshot->active_count = 0;
snapshot->regd_count = 0;
snapshot->snapXactCompletionCount = 0;
+ snapshot->lsn = lsn;
return snapshot;
}
@@ -461,7 +467,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
if (TransactionIdIsValid(MyProc->xmin))
elog(ERROR, "cannot build an initial slot snapshot when MyProc->xmin already is valid");
- snap = SnapBuildBuildSnapshot(builder);
+ snap = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
/*
* We know that snap->xmin is alive, enforced by the logical xmin
@@ -502,7 +508,7 @@ SnapBuildInitialSnapshotForRepack(SnapBuild *builder)
Assert(builder->state == SNAPBUILD_CONSISTENT);
- snap = SnapBuildBuildSnapshot(builder);
+ snap = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
return SnapBuildMVCCFromHistoric(snap, false);
}
@@ -636,7 +642,7 @@ SnapBuildGetOrBuildSnapshot(SnapBuild *builder)
/* only build a new snapshot if we don't have a prebuilt one */
if (builder->snapshot == NULL)
{
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
/* increase refcount for the snapshot builder */
SnapBuildSnapIncRefcount(builder->snapshot);
}
@@ -716,7 +722,7 @@ SnapBuildProcessChange(SnapBuild *builder, TransactionId xid, XLogRecPtr lsn)
/* only build a new snapshot if we don't have a prebuilt one */
if (builder->snapshot == NULL)
{
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, lsn);
/* increase refcount for the snapshot builder */
SnapBuildSnapIncRefcount(builder->snapshot);
}
@@ -1085,7 +1091,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
if (builder->snapshot)
SnapBuildSnapDecRefcount(builder->snapshot);
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, lsn);
/* we might need to execute invalidations, add snapshot */
if (!ReorderBufferXidHasBaseSnapshot(builder->reorder, xid))
@@ -1910,7 +1916,7 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
{
SnapBuildSnapDecRefcount(builder->snapshot);
}
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
SnapBuildSnapIncRefcount(builder->snapshot);
ReorderBufferSetRestartPoint(builder->reorder, lsn);
diff --git a/src/backend/replication/pgoutput_repack/pgoutput_repack.c b/src/backend/replication/pgoutput_repack/pgoutput_repack.c
index 1ef9b3cbfd..d42d93a8b6 100644
--- a/src/backend/replication/pgoutput_repack/pgoutput_repack.c
+++ b/src/backend/replication/pgoutput_repack/pgoutput_repack.c
@@ -32,7 +32,8 @@ static void plugin_truncate(struct LogicalDecodingContext *ctx,
Relation relations[],
ReorderBufferChange *change);
static void store_change(LogicalDecodingContext *ctx,
- ConcurrentChangeKind kind, HeapTuple tuple);
+ ConcurrentChangeKind kind, HeapTuple tuple,
+ TransactionId xid);
void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
@@ -100,6 +101,7 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change)
{
RepackDecodingState *dstate;
+ Snapshot snapshot;
dstate = (RepackDecodingState *) ctx->output_writer_private;
@@ -107,6 +109,48 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (relation->rd_id != dstate->relid)
return;
+ /*
+ * Catalog snapshot is fine because the table we are processing is
+ * temporarily considered a user catalog table.
+ */
+ snapshot = GetCatalogSnapshot(InvalidOid);
+ Assert(snapshot->snapshot_type == SNAPSHOT_HISTORIC_MVCC);
+ Assert(!snapshot->suboverflowed);
+
+ /*
+ * This should not happen, but if we don't have enough information to
+ * apply a new snapshot, the consequences would be bad. Thus prefer ERROR
+ * to Assert().
+ */
+ if (XLogRecPtrIsInvalid(snapshot->lsn))
+ ereport(ERROR, (errmsg("snapshot has invalid LSN")));
+
+ /*
+ * reorderbuffer.c changes the catalog snapshot as soon as it sees a new
+ * CID or a commit record of a catalog-changing transaction.
+ */
+ if (dstate->snapshot == NULL || snapshot->lsn != dstate->snapshot_lsn ||
+ snapshot->curcid != dstate->snapshot->curcid)
+ {
+ /* CID should not go backwards. */
+ Assert(dstate->snapshot == NULL ||
+ snapshot->curcid >= dstate->snapshot->curcid ||
+ change->txn->xid != dstate->last_change_xid);
+
+ /*
+ * XXX Is it a problem that the copy is created in
+ * TopTransactionContext?
+ *
+ * XXX Wouldn't it be o.k. for SnapBuildMVCCFromHistoric() to set xcnt
+ * to 0 instead of converting xip in this case? The point is that
+ * transactions which are still in progress from the perspective of
+ * reorderbuffer.c could not be replayed yet, so we do not need to
+ * examine their XIDs.
+ */
+ dstate->snapshot = SnapBuildMVCCFromHistoric(snapshot, false);
+ dstate->snapshot_lsn = snapshot->lsn;
+ }
+
/* Decode entry depending on its type */
switch (change->action)
{
@@ -124,7 +168,7 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (newtuple == NULL)
elog(ERROR, "Incomplete insert info.");
- store_change(ctx, CHANGE_INSERT, newtuple);
+ store_change(ctx, CHANGE_INSERT, newtuple, change->txn->xid);
}
break;
case REORDER_BUFFER_CHANGE_UPDATE:
@@ -141,9 +185,11 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
elog(ERROR, "Incomplete update info.");
if (oldtuple != NULL)
- store_change(ctx, CHANGE_UPDATE_OLD, oldtuple);
+ store_change(ctx, CHANGE_UPDATE_OLD, oldtuple,
+ change->txn->xid);
- store_change(ctx, CHANGE_UPDATE_NEW, newtuple);
+ store_change(ctx, CHANGE_UPDATE_NEW, newtuple,
+ change->txn->xid);
}
break;
case REORDER_BUFFER_CHANGE_DELETE:
@@ -156,7 +202,7 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (oldtuple == NULL)
elog(ERROR, "Incomplete delete info.");
- store_change(ctx, CHANGE_DELETE, oldtuple);
+ store_change(ctx, CHANGE_DELETE, oldtuple, change->txn->xid);
}
break;
default:
@@ -190,13 +236,13 @@ plugin_truncate(struct LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (i == nrelations)
return;
- store_change(ctx, CHANGE_TRUNCATE, NULL);
+ store_change(ctx, CHANGE_TRUNCATE, NULL, InvalidTransactionId);
}
/* Store concurrent data change. */
static void
store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
- HeapTuple tuple)
+ HeapTuple tuple, TransactionId xid)
{
RepackDecodingState *dstate;
char *change_raw;
@@ -264,6 +310,11 @@ store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
dst = dst_start + SizeOfConcurrentChange;
memcpy(dst, tuple->t_data, tuple->t_len);
+ /* Initialize the other fields. */
+ change.xid = xid;
+ change.snapshot = dstate->snapshot;
+ dstate->snapshot->active_count++;
+
/* The data has been copied. */
if (flattened)
pfree(tuple);
@@ -277,6 +328,9 @@ store:
isnull[0] = false;
tuplestore_putvalues(dstate->tstore, dstate->tupdesc_change,
values, isnull);
+#ifdef USE_ASSERT_CHECKING
+ dstate->last_change_xid = xid;
+#endif
/* Accounting. */
dstate->nchanges++;
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index bdeb2f8354..b0c6f1d916 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -325,21 +325,24 @@ extern BulkInsertState GetBulkInsertState(void);
extern void FreeBulkInsertState(BulkInsertState);
extern void ReleaseBulkInsertStatePin(BulkInsertState bistate);
-extern void heap_insert(Relation relation, HeapTuple tup, CommandId cid,
- int options, BulkInsertState bistate);
+extern void heap_insert(Relation relation, HeapTuple tup, TransactionId xid,
+ CommandId cid, int options, BulkInsertState bistate);
extern void heap_multi_insert(Relation relation, struct TupleTableSlot **slots,
int ntuples, CommandId cid, int options,
BulkInsertState bistate);
extern TM_Result heap_delete(Relation relation, ItemPointer tid,
- CommandId cid, Snapshot crosscheck, bool wait,
- struct TM_FailureData *tmfd, bool changingPart);
+ TransactionId xid, CommandId cid,
+ Snapshot crosscheck, bool wait,
+ struct TM_FailureData *tmfd, bool changingPart,
+ bool wal_logical);
extern void heap_finish_speculative(Relation relation, ItemPointer tid);
extern void heap_abort_speculative(Relation relation, ItemPointer tid);
extern TM_Result heap_update(Relation relation, ItemPointer otid,
- HeapTuple newtup,
+ HeapTuple newtup, TransactionId xid,
CommandId cid, Snapshot crosscheck, bool wait,
struct TM_FailureData *tmfd, LockTupleMode *lockmode,
- TU_UpdateIndexes *update_indexes);
+ TU_UpdateIndexes *update_indexes,
+ bool wal_logical);
extern TM_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
CommandId cid, LockTupleMode mode, LockWaitPolicy wait_policy,
bool follow_updates,
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 277df6b3cf..8d4af07f84 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -104,6 +104,8 @@
#define XLH_DELETE_CONTAINS_OLD_KEY (1<<2)
#define XLH_DELETE_IS_SUPER (1<<3)
#define XLH_DELETE_IS_PARTITION_MOVE (1<<4)
+/* See heap_delete() */
+#define XLH_DELETE_NO_LOGICAL (1<<5)
/* convenience macro for checking whether any form of old tuple was logged */
#define XLH_DELETE_CONTAINS_OLD \
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index b2bc10ee04..fbb66d559b 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -482,6 +482,8 @@ extern Size EstimateTransactionStateSpace(void);
extern void SerializeTransactionState(Size maxsize, char *start_address);
extern void StartParallelWorkerTransaction(char *tstatespace);
extern void EndParallelWorkerTransaction(void);
+extern void SetRepackCurrentXids(TransactionId *xip, int xcnt);
+extern void ResetRepackCurrentXids(void);
extern bool IsTransactionBlock(void);
extern bool IsTransactionOrTransactionBlock(void);
extern char TransactionBlockStatusCode(void);
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 6fb5f5509c..ef3cb55751 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -73,6 +73,14 @@ typedef struct ConcurrentChange
/* See the enum above. */
ConcurrentChangeKind kind;
+ /* Transaction that changes the data. */
+ TransactionId xid;
+
+ /*
+ * Historic catalog snapshot that was used to decode this change.
+ */
+ Snapshot snapshot;
+
/*
* The actual tuple.
*
@@ -104,6 +112,8 @@ typedef struct RepackDecodingState
* tuplestore does this transparently.
*/
Tuplestorestate *tstore;
+ /* XID of the last change added to tstore. */
+ TransactionId last_change_xid PG_USED_FOR_ASSERTS_ONLY;
/* The current number of changes in tstore. */
double nchanges;
@@ -124,6 +134,14 @@ typedef struct RepackDecodingState
/* Slot to retrieve data from tstore. */
TupleTableSlot *tsslot;
+ /*
+ * Historic catalog snapshot that was used to decode the most recent
+ * change.
+ */
+ Snapshot snapshot;
+ /* LSN of the record */
+ XLogRecPtr snapshot_lsn;
+
ResourceOwner resowner;
} RepackDecodingState;
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 0e546ec149..014f27db7d 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -13,6 +13,7 @@
#ifndef SNAPSHOT_H
#define SNAPSHOT_H
+#include "access/xlogdefs.h"
#include "lib/pairingheap.h"
@@ -201,6 +202,8 @@ typedef struct SnapshotData
uint32 regd_count; /* refcount on RegisteredSnapshots */
pairingheap_node ph_node; /* link in the RegisteredSnapshots heap */
+ XLogRecPtr lsn; /* position in the WAL stream when taken */
+
/*
* The transaction completion count at the time GetSnapshotData() built
* this snapshot. Allows to avoid re-computing static snapshots when no
--
2.43.5
v08-0006-Add-regression-tests.patch (text/x-diff)
From 2d84095db2a68290e0e462ed7e84759eeed3568f Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Wed, 26 Feb 2025 09:17:20 +0100
Subject: [PATCH 6/9] Add regression tests.
As this patch series adds the CONCURRENTLY option to the REPACK command, it's
appropriate to test that the "concurrent data changes" (i.e. changes done by
applications while we are copying the table contents to the new storage) are
processed correctly.
Injection points are used to stop the data copying at some point. While the
backend in charge of the copying is waiting on the injection point, another
backend runs some INSERT, UPDATE and DELETE commands on the table. Then we
wake up the first backend and let the REPACK CONCURRENTLY command
finish. Finally we check that all the "concurrent data changes" are present in
the table and that they contain the correct visibility information.
---
src/backend/commands/cluster.c | 7 +
src/test/modules/injection_points/Makefile | 3 +-
.../injection_points/expected/repack.out | 113 ++++++++++++++
.../modules/injection_points/logical.conf | 1 +
src/test/modules/injection_points/meson.build | 4 +
.../injection_points/specs/repack.spec | 140 ++++++++++++++++++
6 files changed, 267 insertions(+), 1 deletion(-)
create mode 100644 src/test/modules/injection_points/expected/repack.out
create mode 100644 src/test/modules/injection_points/logical.conf
create mode 100644 src/test/modules/injection_points/specs/repack.spec
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index b336d760f2..1ff0fcd1d9 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -59,6 +59,7 @@
#include "utils/formatting.h"
#include "utils/fmgroids.h"
#include "utils/guc.h"
+#include "utils/injection_point.h"
#include "utils/inval.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
@@ -3710,6 +3711,12 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
*/
ident_key = build_identity_key(ident_idx_new, OldHeap, &ident_key_nentries);
+ /*
+ * During testing, wait for another backend to perform concurrent data
+ * changes which we will process below.
+ */
+ INJECTION_POINT("repack-concurrently-before-lock");
+
/*
* Flush all WAL records inserted so far (possibly except for the last
* incomplete page, see GetInsertRecPtr), to minimize the amount of data
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index e680991f8d..405d0811b4 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -14,7 +14,8 @@ PGFILEDESC = "injection_points - facility for injection points"
REGRESS = injection_points hashagg reindex_conc
REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
-ISOLATION = basic inplace syscache-update-pruned
+ISOLATION = basic inplace syscache-update-pruned repack
+ISOLATION_OPTS = --temp-config $(top_srcdir)/src/test/modules/injection_points/logical.conf
TAP_TESTS = 1
diff --git a/src/test/modules/injection_points/expected/repack.out b/src/test/modules/injection_points/expected/repack.out
new file mode 100644
index 0000000000..49a736ed61
--- /dev/null
+++ b/src/test/modules/injection_points/expected/repack.out
@@ -0,0 +1,113 @@
+Parsed test spec with 2 sessions
+
+starting permutation: wait_before_lock change_existing change_new change_subxact1 change_subxact2 check2 wakeup_before_lock check1
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step wait_before_lock:
+ REPACK CONCURRENTLY repack_test USING INDEX repack_test_pkey;
+ <waiting ...>
+step change_existing:
+ UPDATE repack_test SET i=10 where i=1;
+ UPDATE repack_test SET j=20 where i=2;
+ UPDATE repack_test SET i=30 where i=3;
+ UPDATE repack_test SET i=40 where i=30;
+ DELETE FROM repack_test WHERE i=4;
+
+step change_new:
+ INSERT INTO repack_test(i, j) VALUES (5, 5), (6, 6), (7, 7), (8, 8);
+ UPDATE repack_test SET i=50 where i=5;
+ UPDATE repack_test SET j=60 where i=6;
+ DELETE FROM repack_test WHERE i=7;
+
+step change_subxact1:
+ BEGIN;
+ INSERT INTO repack_test(i, j) VALUES (100, 100);
+ SAVEPOINT s1;
+ UPDATE repack_test SET i=101 where i=100;
+ SAVEPOINT s2;
+ UPDATE repack_test SET i=102 where i=101;
+ COMMIT;
+
+step change_subxact2:
+ BEGIN;
+ SAVEPOINT s1;
+ INSERT INTO repack_test(i, j) VALUES (110, 110);
+ ROLLBACK TO SAVEPOINT s1;
+ INSERT INTO repack_test(i, j) VALUES (110, 111);
+ COMMIT;
+
+step check2:
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='repack_test';
+
+ SELECT i, j FROM repack_test ORDER BY i, j;
+
+ INSERT INTO data_s2(_xmin, _cmin, i, j)
+ SELECT xmin, cmin, i, j FROM repack_test;
+
+ i| j
+---+---
+ 2| 20
+ 6| 60
+ 8| 8
+ 10| 1
+ 40| 3
+ 50| 5
+102|100
+110|111
+(8 rows)
+
+step wakeup_before_lock:
+ SELECT injection_points_wakeup('repack-concurrently-before-lock');
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step wait_before_lock: <... completed>
+step check1:
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='repack_test';
+
+ SELECT count(DISTINCT node) FROM relfilenodes;
+
+ SELECT i, j FROM repack_test ORDER BY i, j;
+
+ INSERT INTO data_s1(_xmin, _cmin, i, j)
+ SELECT xmin, cmin, i, j FROM repack_test;
+
+ SELECT count(*)
+ FROM data_s1 d1 FULL JOIN data_s2 d2 USING (_xmin, _cmin, i, j)
+ WHERE d1.i ISNULL OR d2.i ISNULL;
+
+count
+-----
+ 2
+(1 row)
+
+ i| j
+---+---
+ 2| 20
+ 6| 60
+ 8| 8
+ 10| 1
+ 40| 3
+ 50| 5
+102|100
+110|111
+(8 rows)
+
+count
+-----
+ 0
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
diff --git a/src/test/modules/injection_points/logical.conf b/src/test/modules/injection_points/logical.conf
new file mode 100644
index 0000000000..c8f264bc6c
--- /dev/null
+++ b/src/test/modules/injection_points/logical.conf
@@ -0,0 +1 @@
+wal_level = logical
\ No newline at end of file
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index d61149712f..0e3c47ba99 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -46,9 +46,13 @@ tests += {
'specs': [
'basic',
'inplace',
+ 'repack',
'syscache-update-pruned',
],
'runningcheck': false, # see syscache-update-pruned
+ # 'repack' requires wal_level = 'logical'.
+ 'regress_args': ['--temp-config', files('logical.conf')],
+
},
'tap': {
'env': {
diff --git a/src/test/modules/injection_points/specs/repack.spec b/src/test/modules/injection_points/specs/repack.spec
new file mode 100644
index 0000000000..5aa8983f98
--- /dev/null
+++ b/src/test/modules/injection_points/specs/repack.spec
@@ -0,0 +1,140 @@
+# Prefix the system columns with underscore as they are not allowed as column
+# names.
+setup
+{
+ CREATE EXTENSION injection_points;
+
+ CREATE TABLE repack_test(i int PRIMARY KEY, j int);
+ INSERT INTO repack_test(i, j) VALUES (1, 1), (2, 2), (3, 3), (4, 4);
+
+ CREATE TABLE relfilenodes(node oid);
+
+ CREATE TABLE data_s1(_xmin xid, _cmin cid, i int, j int);
+ CREATE TABLE data_s2(_xmin xid, _cmin cid, i int, j int);
+}
+
+teardown
+{
+ DROP TABLE repack_test;
+ DROP EXTENSION injection_points;
+
+ DROP TABLE relfilenodes;
+ DROP TABLE data_s1;
+ DROP TABLE data_s2;
+}
+
+session s1
+setup
+{
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('repack-concurrently-before-lock', 'wait');
+}
+# Perform the initial load and wait for s2 to do some data changes.
+step wait_before_lock
+{
+ REPACK CONCURRENTLY repack_test USING INDEX repack_test_pkey;
+}
+# Check the table from the perspective of s1.
+#
+# Besides the contents, we also check that relfilenode has changed.
+#
+# xmin and cmin columns are used to check that we do not change tuple
+# visibility information. Since we do not expect xmin to stay unchanged across
+# test runs, it cannot appear in the output text. Instead, have each session
+# write the contents into a table and use FULL JOIN to check if the outputs
+# are identical.
+step check1
+{
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='repack_test';
+
+ SELECT count(DISTINCT node) FROM relfilenodes;
+
+ SELECT i, j FROM repack_test ORDER BY i, j;
+
+ INSERT INTO data_s1(_xmin, _cmin, i, j)
+ SELECT xmin, cmin, i, j FROM repack_test;
+
+ SELECT count(*)
+ FROM data_s1 d1 FULL JOIN data_s2 d2 USING (_xmin, _cmin, i, j)
+ WHERE d1.i ISNULL OR d2.i ISNULL;
+}
+teardown
+{
+ SELECT injection_points_detach('repack-concurrently-before-lock');
+}
+
+session s2
+# Change the existing data. UPDATE changes both key and non-key columns. Also
+# update one row twice to test whether tuple version generated by this session
+# can be found.
+step change_existing
+{
+ UPDATE repack_test SET i=10 where i=1;
+ UPDATE repack_test SET j=20 where i=2;
+ UPDATE repack_test SET i=30 where i=3;
+ UPDATE repack_test SET i=40 where i=30;
+ DELETE FROM repack_test WHERE i=4;
+}
+# Insert new rows and UPDATE / DELETE some of them. Again, update both key and
+# non-key column.
+step change_new
+{
+ INSERT INTO repack_test(i, j) VALUES (5, 5), (6, 6), (7, 7), (8, 8);
+ UPDATE repack_test SET i=50 where i=5;
+ UPDATE repack_test SET j=60 where i=6;
+ DELETE FROM repack_test WHERE i=7;
+}
+
+# When applying concurrent data changes, we should see the effects of an
+# in-progress subtransaction.
+step change_subxact1
+{
+ BEGIN;
+ INSERT INTO repack_test(i, j) VALUES (100, 100);
+ SAVEPOINT s1;
+ UPDATE repack_test SET i=101 where i=100;
+ SAVEPOINT s2;
+ UPDATE repack_test SET i=102 where i=101;
+ COMMIT;
+}
+
+# When applying concurrent data changes, we should not see the effects of a
+# rolled back subtransaction.
+step change_subxact2
+{
+ BEGIN;
+ SAVEPOINT s1;
+ INSERT INTO repack_test(i, j) VALUES (110, 110);
+ ROLLBACK TO SAVEPOINT s1;
+ INSERT INTO repack_test(i, j) VALUES (110, 111);
+ COMMIT;
+}
+
+# Check the table from the perspective of s2.
+step check2
+{
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='repack_test';
+
+ SELECT i, j FROM repack_test ORDER BY i, j;
+
+ INSERT INTO data_s2(_xmin, _cmin, i, j)
+ SELECT xmin, cmin, i, j FROM repack_test;
+}
+step wakeup_before_lock
+{
+ SELECT injection_points_wakeup('repack-concurrently-before-lock');
+}
+
+# Test if data changes introduced while one session is performing REPACK
+# CONCURRENTLY find their way into the table.
+permutation
+ wait_before_lock
+ change_existing
+ change_new
+ change_subxact1
+ change_subxact2
+ check2
+ wakeup_before_lock
+ check1
--
2.43.5
v08-0007-Introduce-repack_max_xlock_time-configuration-variab.patch (text/x-diff)
From 25026937685b8c6b146f95efdd4dfedffc1716e0 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Wed, 26 Feb 2025 09:17:20 +0100
Subject: [PATCH 7/9] Introduce repack_max_xlock_time configuration variable.
When executing REPACK CONCURRENTLY, we need the AccessExclusiveLock to swap
the relation files, which should only take a short time. However, on a busy
system, other backends might change a non-negligible amount of data in the
table while we are waiting for the lock. Since these changes must be applied
to the new storage before the swap, the time we eventually hold the lock
might become non-negligible too.
If the user is worried about this situation, they can set
repack_max_xlock_time to the maximum time for which the exclusive lock may
be held. If this amount of time is not sufficient to complete the REPACK
CONCURRENTLY command, an ERROR is raised and the command is canceled.
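To make the deadline mechanism concrete, here is a minimal standalone C sketch of the kind of struct timeval arithmetic such a timeout relies on. The names deadline_from_ms and deadline_elapsed are hypothetical, chosen for illustration only; the patch itself uses an internal helper called processing_time_elapsed(), and this sketch only demonstrates the technique under those assumptions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/time.h>

/*
 * Hypothetical helper: return true once *now has passed *must_complete.
 * A NULL deadline means "no limit", analogous to the GUC's default of 0.
 */
static bool
deadline_elapsed(const struct timeval *must_complete,
				 const struct timeval *now)
{
	if (must_complete == NULL)
		return false;
	if (now->tv_sec != must_complete->tv_sec)
		return now->tv_sec > must_complete->tv_sec;
	return now->tv_usec > must_complete->tv_usec;
}

/*
 * Hypothetical helper: compute the deadline as now + max_ms milliseconds,
 * normalizing tv_usec into the [0, 1000000) range.
 */
static struct timeval
deadline_from_ms(struct timeval now, int max_ms)
{
	struct timeval t = now;

	t.tv_sec += max_ms / 1000;
	t.tv_usec += (max_ms % 1000) * 1000;
	if (t.tv_usec >= 1000000)
	{
		t.tv_sec += 1;
		t.tv_usec -= 1000000;
	}
	return t;
}
```

The apply loop would call the elapsed check periodically (e.g. once per change applied) and break out with an error once it returns true, which matches the behavior the commit message describes.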
---
doc/src/sgml/config.sgml | 31 ++++
doc/src/sgml/ref/repack.sgml | 9 +-
src/backend/access/heap/heapam_handler.c | 3 +-
src/backend/commands/cluster.c | 133 +++++++++++++++---
src/backend/utils/misc/guc_tables.c | 14 ++
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/commands/cluster.h | 5 +-
.../injection_points/expected/repack.out | 74 +++++++++-
.../injection_points/specs/repack.spec | 42 ++++++
9 files changed, 292 insertions(+), 20 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e55700f35b..758cf6849d 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -10888,6 +10888,37 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
</listitem>
</varlistentry>
+ <varlistentry id="guc-repack-max-xclock-time" xreflabel="repack_max_xlock_time">
+ <term><varname>repack_max_xlock_time</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>repack_max_xlock_time</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ This is the maximum amount of time for which <command>REPACK</command>
+ with the <literal>CONCURRENTLY</literal> option may hold an exclusive
+ lock on a table. Typically, the command should not need the lock for a
+ longer time than <command>TRUNCATE</command> does. However, additional
+ time might be needed if the system is very busy. (See
+ <xref linkend="sql-repack"/> for an explanation of how
+ the <literal>CONCURRENTLY</literal> option works.)
+ </para>
+
+ <para>
+ If you want to restrict the lock time, set this variable to the
+ highest acceptable value. If it turns out during processing that the
+ lock would have to be held longer than that, the command is
+ cancelled.
+ </para>
+
+ <para>
+ The default value is 0, which means that the lock is not released
+ until the concurrent data changes are processed.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect1>
diff --git a/doc/src/sgml/ref/repack.sgml b/doc/src/sgml/ref/repack.sgml
index 9ee640e351..0c250689d1 100644
--- a/doc/src/sgml/ref/repack.sgml
+++ b/doc/src/sgml/ref/repack.sgml
@@ -188,7 +188,14 @@ REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] CONCU
(<xref linkend="logicaldecoding"/>) and applied before
the <literal>ACCESS EXCLUSIVE</literal> lock is requested. Thus the lock
is typically held only for the time needed to swap the files, which
- should be pretty short.
+ should be pretty short. However, the time might still be noticeable if
+ a large number of data changes were made to the table while
+ <command>REPACK</command> was waiting for the lock: those changes must
+ be processed just before the files are swapped, while the
+ <literal>ACCESS EXCLUSIVE</literal> lock is being held. If you are
+ worried about this situation, set
+ the <link linkend="guc-repack-max-xclock-time"><varname>repack_max_xlock_time</varname></link>
+ configuration parameter to a value that your applications can tolerate.
</para>
<para>
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 5876a96e79..beec45b18e 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1004,7 +1004,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
end_of_wal = GetFlushRecPtr(NULL);
if ((end_of_wal - end_of_wal_prev) > wal_segment_size)
{
- repack_decode_concurrent_changes(decoding_ctx, end_of_wal);
+ repack_decode_concurrent_changes(decoding_ctx, end_of_wal,
+ NULL);
end_of_wal_prev = end_of_wal;
}
}
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 1ff0fcd1d9..a5790d77b5 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -17,6 +17,8 @@
*/
#include "postgres.h"
+#include <sys/time.h>
+
#include "access/amapi.h"
#include "access/heapam.h"
#include "access/multixact.h"
@@ -108,6 +110,15 @@ RelFileLocator repacked_rel_toast_locator = {.relNumber = InvalidOid};
#define REPACK_CONCURRENT_IN_PROGRESS_MSG \
"relation \"%s\" is already being processed by REPACK CONCURRENTLY"
+/*
+ * The maximum time to hold AccessExclusiveLock during the final
+ * processing. Note that only the execution time of
+ * process_concurrent_changes() is included here. The very last steps like
+ * swap_relation_files() shouldn't get blocked and it'd be wrong to consider
+ * them a reason to abort otherwise completed processing.
+ */
+int repack_max_xlock_time = 0;
+
/*
* Everything we need to call ExecInsertIndexTuples().
*/
@@ -197,7 +208,8 @@ static LogicalDecodingContext *setup_logical_decoding(Oid relid,
static HeapTuple get_changed_tuple(char *change);
static void apply_concurrent_changes(RepackDecodingState *dstate,
Relation rel, ScanKey key, int nkeys,
- IndexInsertState *iistate);
+ IndexInsertState *iistate,
+ struct timeval *must_complete);
static void apply_concurrent_insert(Relation rel, ConcurrentChange *change,
HeapTuple tup, IndexInsertState *iistate,
TupleTableSlot *index_slot);
@@ -214,13 +226,15 @@ static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
IndexInsertState *iistate,
TupleTableSlot *ident_slot,
IndexScanDesc *scan_p);
-static void process_concurrent_changes(LogicalDecodingContext *ctx,
+static bool process_concurrent_changes(LogicalDecodingContext *ctx,
XLogRecPtr end_of_wal,
Relation rel_dst,
Relation rel_src,
ScanKey ident_key,
int ident_key_nentries,
- IndexInsertState *iistate);
+ IndexInsertState *iistate,
+ struct timeval *must_complete);
+static bool processing_time_elapsed(struct timeval *must_complete);
static IndexInsertState *get_index_insert_state(Relation relation,
Oid ident_index_id);
static ScanKey build_identity_key(Oid ident_idx_oid, Relation rel_src,
@@ -3016,7 +3030,8 @@ get_changed_tuple(char *change)
*/
void
repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
- XLogRecPtr end_of_wal)
+ XLogRecPtr end_of_wal,
+ struct timeval *must_complete)
{
RepackDecodingState *dstate;
ResourceOwner resowner_old;
@@ -3054,6 +3069,9 @@ repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
if (record != NULL)
LogicalDecodingProcessRecord(ctx, ctx->reader);
+ if (processing_time_elapsed(must_complete))
+ break;
+
/*
* If WAL segment boundary has been crossed, inform the decoding
* system that the catalog_xmin can advance. (We can confirm more
@@ -3096,7 +3114,8 @@ repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
*/
static void
apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
- ScanKey key, int nkeys, IndexInsertState *iistate)
+ ScanKey key, int nkeys, IndexInsertState *iistate,
+ struct timeval *must_complete)
{
TupleTableSlot *index_slot, *ident_slot;
HeapTuple tup_old = NULL;
@@ -3126,6 +3145,9 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
CHECK_FOR_INTERRUPTS();
+ Assert(dstate->nchanges > 0);
+ dstate->nchanges--;
+
/* Get the change from the single-column tuple. */
tup_change = ExecFetchSlotHeapTuple(dstate->tsslot, false, &shouldFree);
heap_deform_tuple(tup_change, dstate->tupdesc_change, values, isnull);
@@ -3250,10 +3272,22 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
/* TTSOpsMinimalTuple has .get_heap_tuple==NULL. */
Assert(shouldFree);
pfree(tup_change);
+
+ /*
+ * If there is a limit on the time of completion, check it
+ * now. However, make sure the loop does not break if tup_old was set
+ * in the previous iteration. In such a case we could not resume the
+ * processing in the next call.
+ */
+ if (must_complete && tup_old == NULL &&
+ processing_time_elapsed(must_complete))
+ /* The next call will process the remaining changes. */
+ break;
}
- tuplestore_clear(dstate->tstore);
- dstate->nchanges = 0;
+ /* If we could not apply all the changes, the next call will do so. */
+ if (dstate->nchanges == 0)
+ tuplestore_clear(dstate->tstore);
/* Cleanup. */
ExecDropSingleTupleTableSlot(index_slot);
@@ -3456,11 +3490,15 @@ find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
* Decode and apply concurrent changes.
*
* Pass rel_src iff its reltoastrelid is needed.
+ *
+ * Returns true if must_complete is NULL or if we managed to complete by the
+ * time *must_complete indicates.
*/
-static void
+static bool
process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
Relation rel_dst, Relation rel_src, ScanKey ident_key,
- int ident_key_nentries, IndexInsertState *iistate)
+ int ident_key_nentries, IndexInsertState *iistate,
+ struct timeval *must_complete)
{
RepackDecodingState *dstate;
@@ -3469,10 +3507,19 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
dstate = (RepackDecodingState *) ctx->output_writer_private;
- repack_decode_concurrent_changes(ctx, end_of_wal);
+ repack_decode_concurrent_changes(ctx, end_of_wal, must_complete);
+
+ if (processing_time_elapsed(must_complete))
+ /* Caller is responsible for applying the changes. */
+ return false;
+ /*
+ * *must_complete was not reached, so zero changes means there really are
+ * none. (Otherwise we could see no changes merely because not enough time
+ * was left for the decoding.)
+ */
if (dstate->nchanges == 0)
- return;
+ return true;
PG_TRY();
{
@@ -3484,7 +3531,7 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
rel_dst->rd_toastoid = rel_src->rd_rel->reltoastrelid;
apply_concurrent_changes(dstate, rel_dst, ident_key,
- ident_key_nentries, iistate);
+ ident_key_nentries, iistate, must_complete);
}
PG_FINALLY();
{
@@ -3494,6 +3541,28 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
rel_dst->rd_toastoid = InvalidOid;
}
PG_END_TRY();
+
+ /*
+ * apply_concurrent_changes() does check the processing time, so if some
+ * changes are left, we ran out of time.
+ */
+ return dstate->nchanges == 0;
+}
+
+/*
+ * Check if the current time is beyond *must_complete.
+ */
+static bool
+processing_time_elapsed(struct timeval *must_complete)
+{
+ struct timeval now;
+
+ if (must_complete == NULL)
+ return false;
+
+ gettimeofday(&now, NULL);
+
+ return timercmp(&now, must_complete, >);
}
static IndexInsertState *
@@ -3654,6 +3723,8 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
RelReopenInfo *rri = NULL;
int nrel;
Relation *ind_refs_all, *ind_refs_p;
+ struct timeval t_end;
+ struct timeval *t_end_ptr = NULL;
/* Like in cluster_rel(). */
lockmode_old = ShareUpdateExclusiveLock;
@@ -3733,7 +3804,8 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
*/
process_concurrent_changes(ctx, end_of_wal, NewHeap,
swap_toast_by_content ? OldHeap : NULL,
- ident_key, ident_key_nentries, iistate);
+ ident_key, ident_key_nentries, iistate,
+ NULL);
/*
* Release the locks that allowed concurrent data changes, in order to
@@ -3855,9 +3927,38 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
end_of_wal = GetFlushRecPtr(NULL);
/* Apply the concurrent changes again. */
- process_concurrent_changes(ctx, end_of_wal, NewHeap,
- swap_toast_by_content ? OldHeap : NULL,
- ident_key, ident_key_nentries, iistate);
+ /*
+ * This time we have the exclusive lock on the table, so make sure that
+ * repack_max_xlock_time is not exceeded.
+ */
+ if (repack_max_xlock_time > 0)
+ {
+ int64 usec;
+ struct timeval t_start;
+
+ gettimeofday(&t_start, NULL);
+ /* Add the whole seconds. */
+ t_end.tv_sec = t_start.tv_sec + repack_max_xlock_time / 1000;
+ /* Add the rest, expressed in microseconds. */
+ usec = t_start.tv_usec + 1000 * (repack_max_xlock_time % 1000);
+ /* The number of microseconds could have overflowed. */
+ t_end.tv_sec += usec / USECS_PER_SEC;
+ t_end.tv_usec = usec % USECS_PER_SEC;
+ t_end_ptr = &t_end;
+ }
+ /*
+ * During testing, stop here to simulate excessive processing time.
+ */
+ INJECTION_POINT("repack-concurrently-after-lock");
+
+ if (!process_concurrent_changes(ctx, end_of_wal, NewHeap,
+ swap_toast_by_content ? OldHeap : NULL,
+ ident_key, ident_key_nentries, iistate,
+ t_end_ptr))
+ ereport(ERROR,
+ (errmsg("could not process concurrent data changes in time"),
+ errhint("Please consider adjusting \"repack_max_xlock_time\".")));
+
/* Remember info about rel before closing OldHeap */
relpersistence = OldHeap->rd_rel->relpersistence;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index ad25cbb39c..a1a19f2cbd 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -39,6 +39,7 @@
#include "catalog/namespace.h"
#include "catalog/storage.h"
#include "commands/async.h"
+#include "commands/cluster.h"
#include "commands/event_trigger.h"
#include "commands/tablespace.h"
#include "commands/trigger.h"
@@ -2814,6 +2815,19 @@ struct config_int ConfigureNamesInt[] =
1600000000, 0, 2100000000,
NULL, NULL, NULL
},
+ {
+ {"repack_max_xlock_time", PGC_USERSET, LOCK_MANAGEMENT,
+ gettext_noop("Maximum time for REPACK CONCURRENTLY to keep table locked."),
+ gettext_noop(
+ "The table is locked in exclusive mode during the final stage of processing. "
"If the lock time exceeds this value, an error is raised and the lock is "
+ "released. Set to zero if you don't care how long the lock can be held."),
+ GUC_UNIT_MS
+ },
+ &repack_max_xlock_time,
+ 0, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
/*
* See also CheckRequiredParameterValues() if this parameter changes
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 5362ff8051..b59f8aae7e 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -744,6 +744,7 @@ autovacuum_worker_slots = 16 # autovacuum worker slots to allocate
#lock_timeout = 0 # in milliseconds, 0 is disabled
#idle_in_transaction_session_timeout = 0 # in milliseconds, 0 is disabled
#idle_session_timeout = 0 # in milliseconds, 0 is disabled
+#repack_max_xlock_time = 0
#bytea_output = 'hex' # hex, escape
#xmlbinary = 'base64'
#xmloption = 'content'
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index ef3cb55751..f5600bf4f6 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -59,6 +59,8 @@ typedef enum ClusterCommand
extern RelFileLocator repacked_rel_locator;
extern RelFileLocator repacked_rel_toast_locator;
+extern PGDLLIMPORT int repack_max_xlock_time;
+
typedef enum
{
CHANGE_INSERT,
@@ -154,7 +156,8 @@ extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid,
extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
extern void can_repack_concurrently(Relation rel);
extern void repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
- XLogRecPtr end_of_wal);
+ XLogRecPtr end_of_wal,
+ struct timeval *must_complete);
extern Oid make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
char relpersistence, LOCKMODE lockmode);
extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
diff --git a/src/test/modules/injection_points/expected/repack.out b/src/test/modules/injection_points/expected/repack.out
index 49a736ed61..f2728d9422 100644
--- a/src/test/modules/injection_points/expected/repack.out
+++ b/src/test/modules/injection_points/expected/repack.out
@@ -1,4 +1,4 @@
-Parsed test spec with 2 sessions
+Parsed test spec with 4 sessions
starting permutation: wait_before_lock change_existing change_new change_subxact1 change_subxact2 check2 wakeup_before_lock check1
injection_points_attach
@@ -111,3 +111,75 @@ injection_points_detach
(1 row)
+injection_points_detach
+-----------------------
+
+(1 row)
+
+
+starting permutation: wait_after_lock wakeup_after_lock
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step wait_after_lock:
+ REPACK CONCURRENTLY repack_test USING INDEX repack_test_pkey;
+ <waiting ...>
+step wakeup_after_lock:
+ SELECT injection_points_wakeup('repack-concurrently-after-lock');
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step wait_after_lock: <... completed>
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+
+starting permutation: wait_after_lock after_lock_delay wakeup_after_lock
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step wait_after_lock:
+ REPACK CONCURRENTLY repack_test USING INDEX repack_test_pkey;
+ <waiting ...>
+step after_lock_delay:
+ SELECT pg_sleep(1.5);
+
+pg_sleep
+--------
+
+(1 row)
+
+step wakeup_after_lock:
+ SELECT injection_points_wakeup('repack-concurrently-after-lock');
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step wait_after_lock: <... completed>
+ERROR: could not process concurrent data changes in time
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
diff --git a/src/test/modules/injection_points/specs/repack.spec b/src/test/modules/injection_points/specs/repack.spec
index 5aa8983f98..0f45f9d254 100644
--- a/src/test/modules/injection_points/specs/repack.spec
+++ b/src/test/modules/injection_points/specs/repack.spec
@@ -127,6 +127,34 @@ step wakeup_before_lock
SELECT injection_points_wakeup('repack-concurrently-before-lock');
}
+session s3
+setup
+{
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('repack-concurrently-after-lock', 'wait');
+ SET repack_max_xlock_time TO '1s';
+}
+# Perform the initial load, lock the table in exclusive mode and wait. s4 will
+# cancel the waiting.
+step wait_after_lock
+{
+ REPACK CONCURRENTLY repack_test USING INDEX repack_test_pkey;
+}
+teardown
+{
+ SELECT injection_points_detach('repack-concurrently-after-lock');
+}
+
+session s4
+step wakeup_after_lock
+{
+ SELECT injection_points_wakeup('repack-concurrently-after-lock');
+}
+step after_lock_delay
+{
+ SELECT pg_sleep(1.5);
+}
+
# Test if data changes introduced while one session is performing REPACK
# CONCURRENTLY find their way into the table.
permutation
@@ -138,3 +166,17 @@ permutation
check2
wakeup_before_lock
check1
+
+# Test the repack_max_xlock_time configuration variable.
+#
+# First, cancel waiting on the injection point immediately. That way, REPACK
+# should complete.
+permutation
+ wait_after_lock
+ wakeup_after_lock
+# Second, cancel the waiting with a delay that violates
+# repack_max_xlock_time.
+permutation
+ wait_after_lock
+ after_lock_delay
+ wakeup_after_lock
--
2.43.5
Attachment: v08-0008-Enable-logical-decoding-transiently-only-for-REPACK-.patch (text/x-diff)
From c5adedcec21f4523243af2c0e5c0d63ef2e54b23 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Wed, 26 Feb 2025 09:17:21 +0100
Subject: [PATCH 8/9] Enable logical decoding transiently, only for REPACK
CONCURRENTLY.
As REPACK CONCURRENTLY uses logical decoding, it requires wal_level to be set
to 'logical', while 'replica' is the default value. If logical replication is
not used, users will probably be reluctant to set the GUC to 'logical' because
it can affect server performance (by writing additional information to WAL)
and because it cannot be changed to 'logical' only for the time REPACK
CONCURRENTLY is running: a change of this GUC requires a server restart to take
effect.
This patch teaches the postgres backend to recognize whether it should consider
wal_level='logical' "locally" for a particular transaction, even if the
wal_level GUC is actually set to 'replica'. It also ensures that the
logical-decoding-specific information is added to WAL only for the tables which
are currently being processed by REPACK CONCURRENTLY.
If the logical decoding is enabled this way, only temporary replication slots
should be created. The problem with a permanent slot is that it is restored
during server restart, and the restore fails if wal_level is not "globally"
'logical'.
There is independent work in progress to enable logical decoding transiently
[1]. ISTM that this is too "heavyweight" a solution for our problem. And I think
that these two approaches are not mutually exclusive: once [1] is committed,
we only need to adjust the XLogLogicalInfoActive() macro.
[1] https://www.postgresql.org/message-id/CAD21AoCVLeLYq09pQPaWs%2BJwdni5FuJ8v2jgq-u9_uFbcp6UbA%40mail.gmail.com
---
src/backend/access/transam/parallel.c | 8 ++
src/backend/access/transam/xact.c | 106 ++++++++++++++---
src/backend/access/transam/xlog.c | 1 +
src/backend/commands/cluster.c | 107 ++++++++++++++----
src/backend/replication/logical/logical.c | 9 +-
src/backend/storage/ipc/standby.c | 4 +-
src/include/access/xlog.h | 15 ++-
src/include/commands/cluster.h | 1 +
src/include/utils/rel.h | 6 +-
src/test/modules/injection_points/Makefile | 1 -
.../modules/injection_points/logical.conf | 1 -
src/test/modules/injection_points/meson.build | 3 -
12 files changed, 216 insertions(+), 46 deletions(-)
delete mode 100644 src/test/modules/injection_points/logical.conf
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 4ab5df9213..4e6e92c4db 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -97,6 +97,7 @@ typedef struct FixedParallelState
TimestampTz xact_ts;
TimestampTz stmt_ts;
SerializableXactHandle serializable_xact_handle;
+ int wal_level_transient;
/* Mutex protects remaining fields. */
slock_t mutex;
@@ -351,6 +352,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
fps->xact_ts = GetCurrentTransactionStartTimestamp();
fps->stmt_ts = GetCurrentStatementStartTimestamp();
fps->serializable_xact_handle = ShareSerializableXact();
+ fps->wal_level_transient = wal_level_transient;
SpinLockInit(&fps->mutex);
fps->last_xlog_end = 0;
shm_toc_insert(pcxt->toc, PARALLEL_KEY_FIXED, fps);
@@ -1546,6 +1548,12 @@ ParallelWorkerMain(Datum main_arg)
/* Attach to the leader's serializable transaction, if SERIALIZABLE. */
AttachSerializableXact(fps->serializable_xact_handle);
+ /*
+ * Restore the information about whether this worker should behave as if
+ * wal_level were WAL_LEVEL_LOGICAL.
+ */
+ wal_level_transient = fps->wal_level_transient;
+
/*
* We've initialized all of our state now; nothing should change
* hereafter.
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 0bfd329847..0e4f234440 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -36,6 +36,7 @@
#include "catalog/pg_enum.h"
#include "catalog/storage.h"
#include "commands/async.h"
+#include "commands/cluster.h"
#include "commands/tablecmds.h"
#include "commands/trigger.h"
#include "common/pg_prng.h"
@@ -137,6 +138,12 @@ static TransactionId *ParallelCurrentXids;
static int nRepackCurrentXids = 0;
static TransactionId *RepackCurrentXids = NULL;
+/*
+ * Have we determined the value of wal_level_transient for the current
+ * transaction?
+ */
+static bool wal_level_transient_checked = false;
+
/*
* Miscellaneous flag bits to record events which occur on the top level
* transaction. These flags are only persisted in MyXactFlags and are intended
@@ -648,6 +655,7 @@ AssignTransactionId(TransactionState s)
bool isSubXact = (s->parent != NULL);
ResourceOwner currentOwner;
bool log_unknown_top = false;
+ bool set_wal_level_transient = false;
/* Assert that caller didn't screw up */
Assert(!FullTransactionIdIsValid(s->fullTransactionId));
@@ -662,6 +670,32 @@ AssignTransactionId(TransactionState s)
(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
errmsg("cannot assign transaction IDs during a parallel operation")));
+ /*
+ * The first call (i.e. the first write) in the transaction tree
+ * determines whether the whole transaction assumes logical decoding or
+ * not.
+ */
+ if (!wal_level_transient_checked)
+ {
+ Assert(wal_level_transient == WAL_LEVEL_MINIMAL);
+
+ /*
+ * Do not repeat the check when calling this function for parent
+ * transactions.
+ */
+ wal_level_transient_checked = true;
+
+ /*
+ * Remember that the actual check is needed. We cannot do it until the
+ * top-level transaction has its XID assigned, see comments below.
+ *
+ * There is no use case for overriding MINIMAL, and LOGICAL cannot be
+ * overridden as such.
+ */
+ if (wal_level == WAL_LEVEL_REPLICA)
+ set_wal_level_transient = true;
+ }
+
/*
* Ensure parent(s) have XIDs, so that a child always has an XID later
* than its parent. Mustn't recurse here, or we might get a stack
@@ -691,20 +725,6 @@ AssignTransactionId(TransactionState s)
pfree(parents);
}
- /*
- * When wal_level=logical, guarantee that a subtransaction's xid can only
- * be seen in the WAL stream if its toplevel xid has been logged before.
- * If necessary we log an xact_assignment record with fewer than
- * PGPROC_MAX_CACHED_SUBXIDS. Note that it is fine if didLogXid isn't set
- * for a transaction even though it appears in a WAL record, we just might
- * superfluously log something. That can happen when an xid is included
- * somewhere inside a wal record, but not in XLogRecord->xl_xid, like in
- * xl_standby_locks.
- */
- if (isSubXact && XLogLogicalInfoActive() &&
- !TopTransactionStateData.didLogXid)
- log_unknown_top = true;
-
/*
* Generate a new FullTransactionId and record its xid in PGPROC and
* pg_subtrans.
@@ -729,6 +749,54 @@ AssignTransactionId(TransactionState s)
if (!isSubXact)
RegisterPredicateLockingXid(XidFromFullTransactionId(s->fullTransactionId));
+ /*
+ * Check if this transaction should consider wal_level=logical.
+ *
+ * Sometimes we need to turn on the logical decoding transiently although
+ * wal_level=WAL_LEVEL_REPLICA. Currently we do so when at least one table
+ * is being clustered concurrently, i.e. when we should assume that
+ * changes done by this transaction will be decoded. In such a case we
+ * adjust the value of XLogLogicalInfoActive() by setting
+ * wal_level_transient to LOGICAL.
+ *
+ * It's important not to do this check until the XID of the top-level
+ * transaction is in ProcGlobal: if the decoding becomes mandatory right
+ * after the check, our transaction will fail to write the necessary
+ * information to WAL. However, if the top-level transaction is already in
+ * ProcGlobal, its XID is guaranteed to appear in the xl_running_xacts
+ * record and therefore the snapshot builder will not try to decode the
+ * transaction (because it assumes it could have missed the initial part
+ * of the transaction).
+ *
+ * On the other hand, if the decoding became mandatory between the actual
+ * XID assignment and now, the transaction will write the decoding-specific
+ * information to WAL unnecessarily. Let's assume that such race conditions do
+ * not happen too often.
+ */
+ if (set_wal_level_transient)
+ {
+ /*
+ * Check for the operation that enables the logical decoding
+ * transiently.
+ */
+ if (is_concurrent_repack_in_progress(InvalidOid))
+ wal_level_transient = WAL_LEVEL_LOGICAL;
+ }
+
+ /*
+ * When wal_level=logical, guarantee that a subtransaction's xid can only
+ * be seen in the WAL stream if its toplevel xid has been logged before.
+ * If necessary we log an xact_assignment record with fewer than
+ * PGPROC_MAX_CACHED_SUBXIDS. Note that it is fine if didLogXid isn't set
+ * for a transaction even though it appears in a WAL record, we just might
+ * superfluously log something. That can happen when an xid is included
+ * somewhere inside a wal record, but not in XLogRecord->xl_xid, like in
+ * xl_standby_locks.
+ */
+ if (isSubXact && XLogLogicalInfoActive() &&
+ !TopTransactionStateData.didLogXid)
+ log_unknown_top = true;
+
/*
* Acquire lock on the transaction XID. (We assume this cannot block.) We
* have to ensure that the lock is assigned to the transaction's own
@@ -2243,6 +2311,16 @@ StartTransaction(void)
if (TransactionTimeout > 0)
enable_timeout_after(TRANSACTION_TIMEOUT, TransactionTimeout);
+ /*
+ * wal_level_transient can override wal_level for individual transactions,
+ * which effectively enables logical decoding for them. At the moment we
+ * don't know if this transaction will write any data changes to be
+ * decoded. Should it do, AssignTransactionId() will check if the decoding
+ * needs to be considered.
+ */
+ wal_level_transient = WAL_LEVEL_MINIMAL;
+ wal_level_transient_checked = false;
+
ShowTransactionState("StartTransaction");
}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 799fc739e1..4a8a43a6da 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -129,6 +129,7 @@ bool wal_recycle = true;
bool log_checkpoints = true;
int wal_sync_method = DEFAULT_WAL_SYNC_METHOD;
int wal_level = WAL_LEVEL_REPLICA;
+int wal_level_transient = WAL_LEVEL_MINIMAL;
int CommitDelay = 0; /* precommit delay in microseconds */
int CommitSiblings = 5; /* # concurrent xacts needed to sleep */
int wal_retrieve_retry_interval = 5000;
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index a5790d77b5..910ff9fa91 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -1298,7 +1298,7 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
*
* In the REPACK CONCURRENTLY case, the lock does not help because we need
* to release it temporarily at some point. Instead, we expect VACUUM /
- * CLUSTER to skip tables which are present in RepackedRelsHash.
+ * CLUSTER to skip tables which are present in repackedRels->hashtable.
*/
if (OldHeap->rd_rel->reltoastrelid && !concurrent)
LockRelationOid(OldHeap->rd_rel->reltoastrelid, AccessExclusiveLock);
@@ -2312,7 +2312,16 @@ typedef struct RepackedRel
Oid dbid;
} RepackedRel;
-static HTAB *RepackedRelsHash = NULL;
+typedef struct RepackedRels
+{
+ /* Hashtable of RepackedRel elements. */
+ HTAB *hashtable;
+
+ /* The number of elements in the hashtable. */
+ pg_atomic_uint32 nrels;
+} RepackedRels;
+
+static RepackedRels *repackedRels = NULL;
/* Maximum number of entries in the hashtable. */
static int maxRepackedRels = 0;
@@ -2320,28 +2329,44 @@ static int maxRepackedRels = 0;
Size
RepackShmemSize(void)
{
+ Size result;
+
+ result = sizeof(RepackedRels);
+
/*
* A replication slot is needed for the processing, so use this GUC to
* allocate memory for the hashtable.
*/
maxRepackedRels = max_replication_slots;
- return hash_estimate_size(maxRepackedRels, sizeof(RepackedRel));
+ result += hash_estimate_size(maxRepackedRels, sizeof(RepackedRel));
+ return result;
}
void
RepackShmemInit(void)
{
+ bool found;
HASHCTL info;
+ repackedRels = ShmemInitStruct("Repacked Relations",
+ sizeof(RepackedRels),
+ &found);
+ if (!IsUnderPostmaster)
+ {
+ Assert(!found);
+ pg_atomic_init_u32(&repackedRels->nrels, 0);
+ }
+ else
+ Assert(found);
+
info.keysize = sizeof(RepackedRel);
info.entrysize = info.keysize;
-
- RepackedRelsHash = ShmemInitHash("Repacked Relations",
- maxRepackedRels,
- maxRepackedRels,
- &info,
- HASH_ELEM | HASH_BLOBS);
+ repackedRels->hashtable = ShmemInitHash("Repacked Relations Hash",
+ maxRepackedRels,
+ maxRepackedRels,
+ &info,
+ HASH_ELEM | HASH_BLOBS);
}
/*
@@ -2373,12 +2398,13 @@ begin_concurrent_repack(Relation *rel_p, Relation *index_p,
RelReopenInfo rri[2];
int nrel;
static bool before_shmem_exit_callback_setup = false;
+ uint32 nrels PG_USED_FOR_ASSERTS_ONLY;
relid = RelationGetRelid(rel);
/*
- * Make sure that we do not leave an entry in RepackedRelsHash if exiting
- * due to FATAL.
+ * Make sure that we do not leave an entry in repackedRels->hashtable if
+ * exiting due to FATAL.
*/
if (!before_shmem_exit_callback_setup)
{
@@ -2393,7 +2419,7 @@ begin_concurrent_repack(Relation *rel_p, Relation *index_p,
*entered_p = false;
LWLockAcquire(RepackedRelsLock, LW_EXCLUSIVE);
entry = (RepackedRel *)
- hash_search(RepackedRelsHash, &key, HASH_ENTER_NULL, &found);
+ hash_search(repackedRels->hashtable, &key, HASH_ENTER_NULL, &found);
if (found)
{
/*
@@ -2411,6 +2437,10 @@ begin_concurrent_repack(Relation *rel_p, Relation *index_p,
(errmsg("too many requests for REPACK CONCURRENTLY at a time")),
(errhint("Please consider increasing the \"max_replication_slots\" configuration parameter.")));
+ /* Increment the number of relations. */
+ nrels = pg_atomic_fetch_add_u32(&repackedRels->nrels, 1);
+ Assert(nrels < maxRepackedRels);
+
/*
* Even if the insertion of TOAST relid should fail below, the caller has
* to do cleanup.
@@ -2438,7 +2468,8 @@ begin_concurrent_repack(Relation *rel_p, Relation *index_p,
{
key.relid = toastrelid;
entry = (RepackedRel *)
- hash_search(RepackedRelsHash, &key, HASH_ENTER_NULL, &found);
+ hash_search(repackedRels->hashtable, &key, HASH_ENTER_NULL,
+ &found);
if (found)
/*
* If we could enter the main fork the TOAST should succeed
@@ -2452,6 +2483,10 @@ begin_concurrent_repack(Relation *rel_p, Relation *index_p,
(errmsg("too many requests for REPACK CONCURRENTLY at a time")),
(errhint("Please consider increasing the \"max_replication_slots\" configuration parameter.")));
+ /* Increment the number of relations. */
+ nrels = pg_atomic_fetch_add_u32(&repackedRels->nrels, 1);
+ Assert(nrels < maxRepackedRels);
+
Assert(!OidIsValid(repacked_rel_toast));
repacked_rel_toast = toastrelid;
}
@@ -2531,6 +2566,7 @@ end_concurrent_repack(bool error)
RepackedRel *entry = NULL, *entry_toast = NULL;
Oid relid = repacked_rel;
Oid toastrelid = repacked_rel_toast;
+ uint32 nrels PG_USED_FOR_ASSERTS_ONLY;
/* Remove the relation from the hash if we managed to insert one. */
if (OidIsValid(repacked_rel))
@@ -2539,23 +2575,32 @@ end_concurrent_repack(bool error)
key.relid = repacked_rel;
key.dbid = MyDatabaseId;
LWLockAcquire(RepackedRelsLock, LW_EXCLUSIVE);
- entry = hash_search(RepackedRelsHash, &key, HASH_REMOVE, NULL);
+ entry = hash_search(repackedRels->hashtable, &key, HASH_REMOVE,
+ NULL);
/*
* By clearing this variable we also disable
* cluster_before_shmem_exit_callback().
*/
repacked_rel = InvalidOid;
+
+ /* Decrement the number of relations. */
+ nrels = pg_atomic_fetch_sub_u32(&repackedRels->nrels, 1);
+ Assert(nrels > 0);
}
/* Remove the TOAST relation if there is one. */
if (OidIsValid(repacked_rel_toast))
{
key.relid = repacked_rel_toast;
- entry_toast = hash_search(RepackedRelsHash, &key, HASH_REMOVE,
+ entry_toast = hash_search(repackedRels->hashtable, &key, HASH_REMOVE,
NULL);
repacked_rel_toast = InvalidOid;
+
+ /* Decrement the number of relations. */
+ nrels = pg_atomic_fetch_sub_u32(&repackedRels->nrels, 1);
+ Assert(nrels > 0);
}
LWLockRelease(RepackedRelsLock);
@@ -2621,7 +2666,7 @@ end_concurrent_repack(bool error)
}
/*
- * A wrapper to call end_concurrent_repack() as a before_shmem_exit callback.
+ * A wrapper to call end_concurrent_cluster() as a before_shmem_exit callback.
*/
static void
cluster_before_shmem_exit_callback(int code, Datum arg)
@@ -2632,24 +2677,48 @@ cluster_before_shmem_exit_callback(int code, Datum arg)
/*
* Check if relation is currently being processed by REPACK CONCURRENTLY.
+ *
+ * If relid is InvalidOid, check if any relation is being processed.
*/
bool
is_concurrent_repack_in_progress(Oid relid)
{
RepackedRel key, *entry;
+ /*
+ * If the caller is interested whether any relation is being repacked,
+ * just use the counter.
+ */
+ if (!OidIsValid(relid))
+ {
+ if (pg_atomic_read_u32(&repackedRels->nrels) > 0)
+ return true;
+ else
+ return false;
+ }
+
+ /* For particular relation we need to search in the hashtable. */
memset(&key, 0, sizeof(key));
key.relid = relid;
key.dbid = MyDatabaseId;
LWLockAcquire(RepackedRelsLock, LW_SHARED);
entry = (RepackedRel *)
- hash_search(RepackedRelsHash, &key, HASH_FIND, NULL);
+ hash_search(repackedRels->hashtable, &key, HASH_FIND, NULL);
LWLockRelease(RepackedRelsLock);
return entry != NULL;
}
+/*
+ * Is this backend performing REPACK CONCURRENTLY?
+ */
+bool
+is_concurrent_repack_run_by_me(void)
+{
+ return OidIsValid(repacked_rel);
+}
+
/*
* Check if REPACK CONCURRENTLY is already running for given relation, and if
* so, raise ERROR. The problem is that cluster_rel() needs to release its
@@ -2944,8 +3013,8 @@ setup_logical_decoding(Oid relid, const char *slotname, TupleDesc tupdesc)
* useful for us.
*
* Regarding the value of need_full_snapshot, we pass false because the
- * table we are processing is present in RepackedRelsHash and therefore,
- * regarding logical decoding, treated like a catalog.
+ * table we are processing is present in repackedRels->hashtable and
+ * therefore, regarding logical decoding, treated like a catalog.
*/
ctx = CreateInitDecodingContext(REPL_PLUGIN_NAME,
NIL,
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 8ea846bfc3..e5790d3fe8 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -30,6 +30,7 @@
#include "access/xact.h"
#include "access/xlogutils.h"
+#include "commands/cluster.h"
#include "fmgr.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -112,10 +113,12 @@ CheckLogicalDecodingRequirements(void)
/*
* NB: Adding a new requirement likely means that RestoreSlotFromDisk()
- * needs the same check.
+ * needs the same check. (Except that only temporary slots should be
+ * created for REPACK CONCURRENTLY, which effectively raises wal_level to
+ * LOGICAL.)
*/
-
- if (wal_level < WAL_LEVEL_LOGICAL)
+ if ((wal_level < WAL_LEVEL_LOGICAL && !is_concurrent_repack_run_by_me())
+ || wal_level < WAL_LEVEL_REPLICA)
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("logical decoding requires \"wal_level\" >= \"logical\"")));
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 5acb4508f8..413bcc1add 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -1313,13 +1313,13 @@ LogStandbySnapshot(void)
* record. Fortunately this routine isn't executed frequently, and it's
* only a shared lock.
*/
- if (wal_level < WAL_LEVEL_LOGICAL)
+ if (!XLogLogicalInfoActive())
LWLockRelease(ProcArrayLock);
recptr = LogCurrentRunningXacts(running);
/* Release lock if we kept it longer ... */
- if (wal_level >= WAL_LEVEL_LOGICAL)
+ if (XLogLogicalInfoActive())
LWLockRelease(ProcArrayLock);
/* GetRunningTransactionData() acquired XidGenLock, we must release it */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index d313099c02..a325bb1d16 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -95,6 +95,12 @@ typedef enum RecoveryState
extern PGDLLIMPORT int wal_level;
+/*
+ * wal_level_transient overrides wal_level if logical decoding needs to be
+ * enabled transiently.
+ */
+extern PGDLLIMPORT int wal_level_transient;
+
/* Is WAL archiving enabled (always or only while server is running normally)? */
#define XLogArchivingActive() \
(AssertMacro(XLogArchiveMode == ARCHIVE_MODE_OFF || wal_level >= WAL_LEVEL_REPLICA), XLogArchiveMode > ARCHIVE_MODE_OFF)
@@ -122,8 +128,13 @@ extern PGDLLIMPORT int wal_level;
/* Do we need to WAL-log information required only for Hot Standby and logical replication? */
#define XLogStandbyInfoActive() (wal_level >= WAL_LEVEL_REPLICA)
-/* Do we need to WAL-log information required only for logical replication? */
-#define XLogLogicalInfoActive() (wal_level >= WAL_LEVEL_LOGICAL)
+/*
+ * Do we need to WAL-log information required only for logical replication?
+ *
+ * wal_level_transient overrides wal_level if logical decoding needs to be
+ * active transiently.
+ */
+#define XLogLogicalInfoActive() (Max(wal_level, wal_level_transient) == WAL_LEVEL_LOGICAL)
#ifdef WAL_DEBUG
extern PGDLLIMPORT bool XLOG_DEBUG;
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index f5600bf4f6..3ed3066b36 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -173,6 +173,7 @@ extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
extern Size RepackShmemSize(void);
extern void RepackShmemInit(void);
extern bool is_concurrent_repack_in_progress(Oid relid);
+extern bool is_concurrent_repack_run_by_me(void);
extern void check_for_concurrent_repack(Oid relid, LOCKMODE lockmode);
extern void repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 741b29226d..81f5348a4f 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -709,12 +709,16 @@ RelationCloseSmgr(Relation relation)
* it would complicate decoding slightly for little gain). Note that we *do*
* log information for user defined catalog tables since they presumably are
* interesting to the user...
+ *
+ * If particular relations require it, logical decoding can be active
+ * even if wal_level is REPLICA. Do not log other relations in that case.
*/
#define RelationIsLogicallyLogged(relation) \
(XLogLogicalInfoActive() && \
RelationNeedsWAL(relation) && \
(relation)->rd_rel->relkind != RELKIND_FOREIGN_TABLE && \
- !IsCatalogRelation(relation))
+ !IsCatalogRelation(relation) && \
+ (wal_level == WAL_LEVEL_LOGICAL || (relation)->rd_repack_concurrent))
/* routines in utils/cache/relcache.c */
extern void RelationIncrementReferenceCount(Relation rel);
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 405d0811b4..4f6c0ca3a8 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -15,7 +15,6 @@ REGRESS = injection_points hashagg reindex_conc
REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
ISOLATION = basic inplace syscache-update-pruned repack
-ISOLATION_OPTS = --temp-config $(top_srcdir)/src/test/modules/injection_points/logical.conf
TAP_TESTS = 1
diff --git a/src/test/modules/injection_points/logical.conf b/src/test/modules/injection_points/logical.conf
deleted file mode 100644
index c8f264bc6c..0000000000
--- a/src/test/modules/injection_points/logical.conf
+++ /dev/null
@@ -1 +0,0 @@
-wal_level = logical
\ No newline at end of file
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 0e3c47ba99..716e5619aa 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -50,9 +50,6 @@ tests += {
'syscache-update-pruned',
],
'runningcheck': false, # see syscache-update-pruned
- # 'repack' requires wal_level = 'logical'.
- 'regress_args': ['--temp-config', files('logical.conf')],
-
},
'tap': {
'env': {
--
2.43.5
Attachment: v08-0009-Call-logical_rewrite_heap_tuple-when-applying-concur.patch (text/x-diff)
From 36fa8657637b9d1738d03405807a2ae1799e3637 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Wed, 26 Feb 2025 09:17:21 +0100
Subject: [PATCH 9/9] Call logical_rewrite_heap_tuple() when applying
concurrent data changes.
This was implemented for the sake of completeness, but I think it's currently
not needed. Possible use cases could be:
1. REPACK CONCURRENTLY can process system catalogs.
System catalogs are scanned using a historic snapshot during logical decoding,
and the "combo CIDs" information is needed for that. Since "combo CID" is
associated with the "file locator" and that locator is changed by REPACK, this
command must record the information on individual tuples being moved from the
old file to the new one. This is what logical_rewrite_heap_tuple() does.
However, the logical decoding subsystem currently does not support decoding of
data changes in the system catalog. Therefore, the CONCURRENTLY option cannot
be used for system catalogs.
2. REPACK CONCURRENTLY is processing a relation, but once it has released all
the locks (in order to get the exclusive lock), another backend runs REPACK
CONCURRENTLY on the same table. Since the relation is treated as a system
catalog while these commands are processing it (so it can be scanned using a
historic snapshot during the "initial load"), it is important that the 2nd
backend does not break decoding of the "combo CIDs" performed by the 1st
backend.
However, it's not practical to let multiple backends run REPACK CONCURRENTLY
on the same relation, so we forbid that.
---
src/backend/access/heap/heapam_handler.c | 2 +-
src/backend/access/heap/rewriteheap.c | 65 ++++++-----
src/backend/commands/cluster.c | 110 +++++++++++++++---
src/backend/replication/logical/decode.c | 41 ++++++-
.../pgoutput_repack/pgoutput_repack.c | 21 ++--
src/include/access/rewriteheap.h | 5 +-
src/include/commands/cluster.h | 3 +
src/include/replication/reorderbuffer.h | 7 ++
8 files changed, 194 insertions(+), 60 deletions(-)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index beec45b18e..f8528b3acd 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -730,7 +730,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
/* Initialize the rewrite operation */
rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, *xid_cutoff,
- *multi_cutoff);
+ *multi_cutoff, true);
/* Set up sorting if wanted */
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index e6d2b5fced..94b603423d 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -214,10 +214,8 @@ static void raw_heap_insert(RewriteState state, HeapTuple tup);
/* internal logical remapping prototypes */
static void logical_begin_heap_rewrite(RewriteState state);
-static void logical_rewrite_heap_tuple(RewriteState state, ItemPointerData old_tid, HeapTuple new_tuple);
static void logical_end_heap_rewrite(RewriteState state);
-
/*
* Begin a rewrite of a table
*
@@ -226,18 +224,19 @@ static void logical_end_heap_rewrite(RewriteState state);
* oldest_xmin xid used by the caller to determine which tuples are dead
* freeze_xid xid before which tuples will be frozen
* cutoff_multi multixact before which multis will be removed
+ * tid_chains need to maintain TID chains?
*
* Returns an opaque RewriteState, allocated in current memory context,
* to be used in subsequent calls to the other functions.
*/
RewriteState
begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xmin,
- TransactionId freeze_xid, MultiXactId cutoff_multi)
+ TransactionId freeze_xid, MultiXactId cutoff_multi,
+ bool tid_chains)
{
RewriteState state;
MemoryContext rw_cxt;
MemoryContext old_cxt;
- HASHCTL hash_ctl;
/*
* To ease cleanup, make a separate context that will contain the
@@ -262,29 +261,34 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
state->rs_cxt = rw_cxt;
state->rs_bulkstate = smgr_bulk_start_rel(new_heap, MAIN_FORKNUM);
- /* Initialize hash tables used to track update chains */
- hash_ctl.keysize = sizeof(TidHashKey);
- hash_ctl.entrysize = sizeof(UnresolvedTupData);
- hash_ctl.hcxt = state->rs_cxt;
-
- state->rs_unresolved_tups =
- hash_create("Rewrite / Unresolved ctids",
- 128, /* arbitrary initial size */
- &hash_ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
- hash_ctl.entrysize = sizeof(OldToNewMappingData);
+ if (tid_chains)
+ {
+ HASHCTL hash_ctl;
+
+ /* Initialize hash tables used to track update chains */
+ hash_ctl.keysize = sizeof(TidHashKey);
+ hash_ctl.entrysize = sizeof(UnresolvedTupData);
+ hash_ctl.hcxt = state->rs_cxt;
+
+ state->rs_unresolved_tups =
+ hash_create("Rewrite / Unresolved ctids",
+ 128, /* arbitrary initial size */
+ &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+ hash_ctl.entrysize = sizeof(OldToNewMappingData);
+
+ state->rs_old_new_tid_map =
+ hash_create("Rewrite / Old to new tid map",
+ 128, /* arbitrary initial size */
+ &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ }
- state->rs_old_new_tid_map =
- hash_create("Rewrite / Old to new tid map",
- 128, /* arbitrary initial size */
- &hash_ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ logical_begin_heap_rewrite(state);
MemoryContextSwitchTo(old_cxt);
- logical_begin_heap_rewrite(state);
-
return state;
}
@@ -303,12 +307,15 @@ end_heap_rewrite(RewriteState state)
* Write any remaining tuples in the UnresolvedTups table. If we have any
* left, they should in fact be dead, but let's err on the safe side.
*/
- hash_seq_init(&seq_status, state->rs_unresolved_tups);
-
- while ((unresolved = hash_seq_search(&seq_status)) != NULL)
+ if (state->rs_unresolved_tups)
{
- ItemPointerSetInvalid(&unresolved->tuple->t_data->t_ctid);
- raw_heap_insert(state, unresolved->tuple);
+ hash_seq_init(&seq_status, state->rs_unresolved_tups);
+
+ while ((unresolved = hash_seq_search(&seq_status)) != NULL)
+ {
+ ItemPointerSetInvalid(&unresolved->tuple->t_data->t_ctid);
+ raw_heap_insert(state, unresolved->tuple);
+ }
}
/* Write the last page, if any */
@@ -995,7 +1002,7 @@ logical_rewrite_log_mapping(RewriteState state, TransactionId xid,
* Perform logical remapping for a tuple that's mapped from old_tid to
* new_tuple->t_self by rewrite_heap_tuple() if necessary for the tuple.
*/
-static void
+void
logical_rewrite_heap_tuple(RewriteState state, ItemPointerData old_tid,
HeapTuple new_tuple)
{
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 910ff9fa91..562d778f62 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -23,6 +23,7 @@
#include "access/heapam.h"
#include "access/multixact.h"
#include "access/relscan.h"
+#include "access/rewriteheap.h"
#include "access/tableam.h"
#include "access/toast_internals.h"
#include "access/transam.h"
@@ -209,17 +210,21 @@ static HeapTuple get_changed_tuple(char *change);
static void apply_concurrent_changes(RepackDecodingState *dstate,
Relation rel, ScanKey key, int nkeys,
IndexInsertState *iistate,
- struct timeval *must_complete);
+ struct timeval *must_complete,
+ RewriteState rwstate);
static void apply_concurrent_insert(Relation rel, ConcurrentChange *change,
HeapTuple tup, IndexInsertState *iistate,
- TupleTableSlot *index_slot);
+ TupleTableSlot *index_slot,
+ RewriteState rwstate);
static void apply_concurrent_update(Relation rel, HeapTuple tup,
HeapTuple tup_target,
ConcurrentChange *change,
IndexInsertState *iistate,
- TupleTableSlot *index_slot);
+ TupleTableSlot *index_slot,
+ RewriteState rwstate);
static void apply_concurrent_delete(Relation rel, HeapTuple tup_target,
- ConcurrentChange *change);
+ ConcurrentChange *change,
+ RewriteState rwstate);
static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
HeapTuple tup_key,
Snapshot snapshot,
@@ -233,7 +238,8 @@ static bool process_concurrent_changes(LogicalDecodingContext *ctx,
ScanKey ident_key,
int ident_key_nentries,
IndexInsertState *iistate,
- struct timeval *must_complete);
+ struct timeval *must_complete,
+ RewriteState rwstate);
static bool processing_time_elapsed(struct timeval *must_complete);
static IndexInsertState *get_index_insert_state(Relation relation,
Oid ident_index_id);
@@ -3184,7 +3190,7 @@ repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
static void
apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
ScanKey key, int nkeys, IndexInsertState *iistate,
- struct timeval *must_complete)
+ struct timeval *must_complete, RewriteState rwstate)
{
TupleTableSlot *index_slot, *ident_slot;
HeapTuple tup_old = NULL;
@@ -3258,7 +3264,8 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
{
Assert(tup_old == NULL);
- apply_concurrent_insert(rel, &change, tup, iistate, index_slot);
+ apply_concurrent_insert(rel, &change, tup, iistate, index_slot,
+ rwstate);
pfree(tup);
}
@@ -3266,7 +3273,7 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
change.kind == CHANGE_DELETE)
{
IndexScanDesc ind_scan = NULL;
- HeapTuple tup_key;
+ HeapTuple tup_key, tup_exist_cp;
if (change.kind == CHANGE_UPDATE_NEW)
{
@@ -3308,11 +3315,23 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
if (tup_exist == NULL)
elog(ERROR, "Failed to find target tuple");
+ /*
+ * Update the mapping for xmax of the old version.
+ *
+ * Use a copy ('tup_exist' can point to shared buffer) with xmin
+ * invalid because mapping of that should have been written on
+ * insertion.
+ */
+ tup_exist_cp = heap_copytuple(tup_exist);
+ HeapTupleHeaderSetXmin(tup_exist_cp->t_data, InvalidTransactionId);
+ logical_rewrite_heap_tuple(rwstate, change.old_tid, tup_exist_cp);
+ pfree(tup_exist_cp);
+
if (change.kind == CHANGE_UPDATE_NEW)
apply_concurrent_update(rel, tup, tup_exist, &change, iistate,
- index_slot);
+ index_slot, rwstate);
else
- apply_concurrent_delete(rel, tup_exist, &change);
+ apply_concurrent_delete(rel, tup_exist, &change, rwstate);
ResetRepackCurrentXids();
@@ -3365,9 +3384,12 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
static void
apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
- IndexInsertState *iistate, TupleTableSlot *index_slot)
+ IndexInsertState *iistate, TupleTableSlot *index_slot,
+ RewriteState rwstate)
{
+ HeapTupleHeader tup_hdr = tup->t_data;
Snapshot snapshot = change->snapshot;
+ ItemPointerData old_tid;
List *recheck;
/*
@@ -3377,6 +3399,9 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
*/
SetRepackCurrentXids(snapshot->subxip, snapshot->subxcnt);
+ /* Remember location in the old heap. */
+ ItemPointerCopy(&tup_hdr->t_ctid, &old_tid);
+
/*
* Write the tuple into the new heap.
*
@@ -3392,6 +3417,14 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
heap_insert(rel, tup, change->xid, snapshot->curcid - 1,
HEAP_INSERT_NO_LOGICAL, NULL);
+ /*
+ * Update the mapping for xmin. (xmax should be invalid). This is needed
+ * because, during the processing, the table is considered a "user
+ * catalog".
+ */
+ Assert(!TransactionIdIsValid(HeapTupleHeaderGetRawXmax(tup->t_data)));
+ logical_rewrite_heap_tuple(rwstate, old_tid, tup);
+
/*
* Update indexes.
*
@@ -3425,15 +3458,22 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
static void
apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
ConcurrentChange *change, IndexInsertState *iistate,
- TupleTableSlot *index_slot)
+ TupleTableSlot *index_slot, RewriteState rwstate)
{
List *recheck;
LockTupleMode lockmode;
TU_UpdateIndexes update_indexes;
+ ItemPointerData tid_new_old_heap, tid_old_new_heap;
TM_Result res;
Snapshot snapshot = change->snapshot;
TM_FailureData tmfd;
+ /* Location of the new tuple in the old heap. */
+ ItemPointerCopy(&tup->t_data->t_ctid, &tid_new_old_heap);
+
+ /* Location of the existing tuple in the new heap. */
+ ItemPointerCopy(&tup_target->t_self, &tid_old_new_heap);
+
/*
* Write the new tuple into the new heap. ('tup' gets the TID assigned
* here.)
@@ -3443,7 +3483,7 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
Assert(snapshot->curcid != InvalidCommandId &&
snapshot->curcid > FirstCommandId);
- res = heap_update(rel, &tup_target->t_self, tup,
+ res = heap_update(rel, &tid_old_new_heap, tup,
change->xid, snapshot->curcid - 1,
InvalidSnapshot,
false, /* no wait - only we are doing changes */
@@ -3453,6 +3493,10 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
if (res != TM_Ok)
ereport(ERROR, (errmsg("failed to apply concurrent UPDATE")));
+ /* Update the mapping for xmin of the new version. */
+ Assert(!TransactionIdIsValid(HeapTupleHeaderGetRawXmax(tup->t_data)));
+ logical_rewrite_heap_tuple(rwstate, tid_new_old_heap, tup);
+
ExecStoreHeapTuple(tup, index_slot, false);
if (update_indexes != TU_None)
@@ -3476,8 +3520,9 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
static void
apply_concurrent_delete(Relation rel, HeapTuple tup_target,
- ConcurrentChange *change)
+ ConcurrentChange *change, RewriteState rwstate)
{
+ ItemPointerData tid_old_new_heap;
TM_Result res;
TM_FailureData tmfd;
Snapshot snapshot = change->snapshot;
@@ -3486,7 +3531,10 @@ apply_concurrent_delete(Relation rel, HeapTuple tup_target,
Assert(snapshot->curcid != InvalidCommandId &&
snapshot->curcid > FirstCommandId);
- res = heap_delete(rel, &tup_target->t_self, change->xid,
+ /* Location of the existing tuple in the new heap. */
+ ItemPointerCopy(&tup_target->t_self, &tid_old_new_heap);
+
+ res = heap_delete(rel, &tid_old_new_heap, change->xid,
snapshot->curcid - 1, InvalidSnapshot, false,
&tmfd, false,
/* wal_logical */
@@ -3567,7 +3615,8 @@ static bool
process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
Relation rel_dst, Relation rel_src, ScanKey ident_key,
int ident_key_nentries, IndexInsertState *iistate,
- struct timeval *must_complete)
+ struct timeval *must_complete,
+ RewriteState rwstate)
{
RepackDecodingState *dstate;
@@ -3600,7 +3649,8 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
rel_dst->rd_toastoid = rel_src->rd_rel->reltoastrelid;
apply_concurrent_changes(dstate, rel_dst, ident_key,
- ident_key_nentries, iistate, must_complete);
+ ident_key_nentries, iistate, must_complete,
+ rwstate);
}
PG_FINALLY();
{
@@ -3785,6 +3835,7 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
bool is_system_catalog;
Oid ident_idx_old, ident_idx_new;
IndexInsertState *iistate;
+ RewriteState rwstate;
ScanKey ident_key;
int ident_key_nentries;
XLogRecPtr wal_insert_ptr, end_of_wal;
@@ -3870,11 +3921,26 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
* Apply concurrent changes first time, to minimize the time we need to
* hold AccessExclusiveLock. (Quite some amount of WAL could have been
* written during the data copying and index creation.)
+ *
+ * Now we are processing individual tuples, so pass false for
+ * 'tid_chains'. Since rwstate is now only needed for
+ * logical_begin_heap_rewrite(), none of the transaction IDs needs to be
+ * valid.
*/
+ rwstate = begin_heap_rewrite(OldHeap, NewHeap,
+ InvalidTransactionId,
+ InvalidTransactionId,
+ InvalidTransactionId,
+ false);
process_concurrent_changes(ctx, end_of_wal, NewHeap,
swap_toast_by_content ? OldHeap : NULL,
ident_key, ident_key_nentries, iistate,
- NULL);
+ NULL, rwstate);
+ /*
+ * OldHeap will be closed, so we need to initialize rwstate again for the
+ * next call of process_concurrent_changes().
+ */
+ end_heap_rewrite(rwstate);
/*
* Release the locks that allowed concurrent data changes, in order to
@@ -3996,6 +4062,11 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
end_of_wal = GetFlushRecPtr(NULL);
/* Apply the concurrent changes again. */
+ rwstate = begin_heap_rewrite(OldHeap, NewHeap,
+ InvalidTransactionId,
+ InvalidTransactionId,
+ InvalidTransactionId,
+ false);
/*
* This time we have the exclusive lock on the table, so make sure that
* repack_max_xlock_time is not exceeded.
@@ -4023,11 +4094,12 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
if (!process_concurrent_changes(ctx, end_of_wal, NewHeap,
swap_toast_by_content ? OldHeap : NULL,
ident_key, ident_key_nentries, iistate,
- t_end_ptr))
+ t_end_ptr, rwstate))
ereport(ERROR,
(errmsg("could not process concurrent data changes in time"),
errhint("Please consider adjusting \"repack_max_xlock_time\".")));
+ end_heap_rewrite(rwstate);
/* Remember info about rel before closing OldHeap */
relpersistence = OldHeap->rd_rel->relpersistence;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 55abda75d1..973867c58a 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -983,11 +983,13 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
xl_heap_insert *xlrec;
ReorderBufferChange *change;
RelFileLocator target_locator;
+ BlockNumber blknum;
+ HeapTupleHeader tuphdr;
xlrec = (xl_heap_insert *) XLogRecGetData(r);
/* only interested in our database */
- XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
+ XLogRecGetBlockTag(r, 0, &target_locator, NULL, &blknum);
if (target_locator.dbOid != ctx->slot->data.database)
return;
@@ -1012,6 +1014,13 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
DecodeXLogTuple(tupledata, datalen, change->data.tp.newtuple);
+ /*
+ * CTID is needed for logical_rewrite_heap_tuple(), when doing REPACK
+ * CONCURRENTLY.
+ */
+ tuphdr = change->data.tp.newtuple->t_data;
+ ItemPointerSet(&tuphdr->t_ctid, blknum, xlrec->offnum);
+
change->data.tp.clear_toast_afterwards = true;
ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
@@ -1033,11 +1042,14 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
ReorderBufferChange *change;
char *data;
RelFileLocator target_locator;
+ BlockNumber old_blknum, new_blknum;
xlrec = (xl_heap_update *) XLogRecGetData(r);
+ /* Retrieve blknum, so that we can compose CTID below. */
+ XLogRecGetBlockTag(r, 0, &target_locator, NULL, &new_blknum);
+
/* only interested in our database */
- XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
if (target_locator.dbOid != ctx->slot->data.database)
return;
@@ -1054,6 +1066,7 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
{
Size datalen;
Size tuplelen;
+ HeapTupleHeader tuphdr;
data = XLogRecGetBlockData(r, 0, &datalen);
@@ -1063,6 +1076,13 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
ReorderBufferGetTupleBuf(ctx->reorder, tuplelen);
DecodeXLogTuple(data, datalen, change->data.tp.newtuple);
+
+ /*
+ * CTID is needed for logical_rewrite_heap_tuple(), when doing REPACK
+ * CONCURRENTLY.
+ */
+ tuphdr = change->data.tp.newtuple->t_data;
+ ItemPointerSet(&tuphdr->t_ctid, new_blknum, xlrec->new_offnum);
}
if (xlrec->flags & XLH_UPDATE_CONTAINS_OLD)
@@ -1081,6 +1101,14 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
DecodeXLogTuple(data, datalen, change->data.tp.oldtuple);
}
+ /*
+ * Remember the old tuple CTID, for the sake of
+ * logical_rewrite_heap_tuple().
+ */
+ if (!XLogRecGetBlockTagExtended(r, 1, NULL, NULL, &old_blknum, NULL))
+ old_blknum = new_blknum;
+ ItemPointerSet(&change->data.tp.old_tid, old_blknum, xlrec->old_offnum);
+
change->data.tp.clear_toast_afterwards = true;
ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
@@ -1099,11 +1127,12 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
xl_heap_delete *xlrec;
ReorderBufferChange *change;
RelFileLocator target_locator;
+ BlockNumber blknum;
xlrec = (xl_heap_delete *) XLogRecGetData(r);
/* only interested in our database */
- XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
+ XLogRecGetBlockTag(r, 0, &target_locator, NULL, &blknum);
if (target_locator.dbOid != ctx->slot->data.database)
return;
@@ -1135,6 +1164,12 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
DecodeXLogTuple((char *) xlrec + SizeOfHeapDelete,
datalen, change->data.tp.oldtuple);
+
+ /*
+ * CTID is needed for logical_rewrite_heap_tuple(), when doing REPACK
+ * CONCURRENTLY.
+ */
+ ItemPointerSet(&change->data.tp.old_tid, blknum, xlrec->offnum);
}
change->data.tp.clear_toast_afterwards = true;
diff --git a/src/backend/replication/pgoutput_repack/pgoutput_repack.c b/src/backend/replication/pgoutput_repack/pgoutput_repack.c
index d42d93a8b6..71b010c351 100644
--- a/src/backend/replication/pgoutput_repack/pgoutput_repack.c
+++ b/src/backend/replication/pgoutput_repack/pgoutput_repack.c
@@ -33,7 +33,7 @@ static void plugin_truncate(struct LogicalDecodingContext *ctx,
ReorderBufferChange *change);
static void store_change(LogicalDecodingContext *ctx,
ConcurrentChangeKind kind, HeapTuple tuple,
- TransactionId xid);
+ TransactionId xid, ItemPointer old_tid);
void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
@@ -168,7 +168,8 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (newtuple == NULL)
elog(ERROR, "Incomplete insert info.");
- store_change(ctx, CHANGE_INSERT, newtuple, change->txn->xid);
+ store_change(ctx, CHANGE_INSERT, newtuple, change->txn->xid,
+ NULL);
}
break;
case REORDER_BUFFER_CHANGE_UPDATE:
@@ -186,10 +187,10 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (oldtuple != NULL)
store_change(ctx, CHANGE_UPDATE_OLD, oldtuple,
- change->txn->xid);
+ change->txn->xid, NULL);
store_change(ctx, CHANGE_UPDATE_NEW, newtuple,
- change->txn->xid);
+ change->txn->xid, &change->data.tp.old_tid);
}
break;
case REORDER_BUFFER_CHANGE_DELETE:
@@ -202,7 +203,8 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (oldtuple == NULL)
elog(ERROR, "Incomplete delete info.");
- store_change(ctx, CHANGE_DELETE, oldtuple, change->txn->xid);
+ store_change(ctx, CHANGE_DELETE, oldtuple, change->txn->xid,
+ &change->data.tp.old_tid);
}
break;
default:
@@ -236,13 +238,13 @@ plugin_truncate(struct LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (i == nrelations)
return;
- store_change(ctx, CHANGE_TRUNCATE, NULL, InvalidTransactionId);
+ store_change(ctx, CHANGE_TRUNCATE, NULL, InvalidTransactionId, NULL);
}
/* Store concurrent data change. */
static void
store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
- HeapTuple tuple, TransactionId xid)
+ HeapTuple tuple, TransactionId xid, ItemPointer old_tid)
{
RepackDecodingState *dstate;
char *change_raw;
@@ -315,6 +317,11 @@ store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
change.snapshot = dstate->snapshot;
dstate->snapshot->active_count++;
+ if (old_tid)
+ ItemPointerCopy(old_tid, &change.old_tid);
+ else
+ ItemPointerSetInvalid(&change.old_tid);
+
/* The data has been copied. */
if (flattened)
pfree(tuple);
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 99c3f362ad..eebda35c7c 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -23,11 +23,14 @@ typedef struct RewriteStateData *RewriteState;
extern RewriteState begin_heap_rewrite(Relation old_heap, Relation new_heap,
TransactionId oldest_xmin, TransactionId freeze_xid,
- MultiXactId cutoff_multi);
+ MultiXactId cutoff_multi, bool tid_chains);
extern void end_heap_rewrite(RewriteState state);
extern void rewrite_heap_tuple(RewriteState state, HeapTuple old_tuple,
HeapTuple new_tuple);
extern bool rewrite_heap_dead_tuple(RewriteState state, HeapTuple old_tuple);
+extern void logical_rewrite_heap_tuple(RewriteState state,
+ ItemPointerData old_tid,
+ HeapTuple new_tuple);
/*
* On-Disk data format for an individual logical rewrite mapping.
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 3ed3066b36..db029c62cf 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -78,6 +78,9 @@ typedef struct ConcurrentChange
/* Transaction that changes the data. */
TransactionId xid;
+ /* For UPDATE / DELETE, the location of the old tuple version. */
+ ItemPointerData old_tid;
+
/*
* Historic catalog snapshot that was used to decode this change.
*/
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 517a8e3634..d0b1b48ef0 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -104,6 +104,13 @@ typedef struct ReorderBufferChange
HeapTuple oldtuple;
/* valid for INSERT || UPDATE */
HeapTuple newtuple;
+
+ /*
+ * REPACK CONCURRENTLY needs the old TID, even if the old tuple
+ * itself is not WAL-logged (i.e. when the identity key does not
+ * change).
+ */
+ ItemPointerData old_tid;
} tp;
/*
--
2.43.5
On 2025-Feb-26, Antonin Houska wrote:
@@ -403,39 +381,38 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
 * would work in most respects, but the index would only get marked as
 * indisclustered in the current database, leading to unexpected behavior
 * if CLUSTER were later invoked in another database.
+ *
+ * REPACK does not set indisclustered. XXX Not sure I understand the
+ * comment above: how can an attribute be set "only in the current
+ * database"?
 */
Regarding this XXX comment, what's going on here is this: a CLUSTER
command needs to remember the index that a table is clustered on. We
keep track of this in pg_index.indisclustered. But pg_index is a local
relation, not shared across databases -- so the current CLUSTER command
can effect the update on the current database's pg_index only, not on
other databases. So if the user were to run CLUSTER on one database
specifying an index, then connect to another one and expect CLUSTER
without specifying an index to honor the previously specified index,
that would not work. Naturally this is only a problem for shared
catalogs. Not being able to handle this for shared catalogs is not a
big loss.
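For illustration (this query is not part of the patch set), the marking Álvaro describes lives in the pg_index catalog of whichever database you are connected to, so the clustered index can be inspected per database:

```sql
-- List tables marked as clustered, in the *current* database only.
-- Running the same query after \connect other_db consults that
-- database's own pg_index, which a CLUSTER run here never updated --
-- hence the per-database behavior described above.
SELECT i.indrelid::regclass   AS table_name,
       i.indexrelid::regclass AS clustered_index
FROM pg_index AS i
WHERE i.indisclustered;
```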
--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
Thanks for the explanation. The reason I failed to understand this was probably
that I tried to imagine something worse.
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
I rebased this patch series; here's v09. No substantive changes from v08.
I made sure the tree still compiles after each commit.
I did look at 0002 again (and renamed the members of the new struct by
adding a p_ prefix, as well as fixing the references to the old names
that were in a few code comments here and there; I don't think these
changes are "substantive"), and ended up wondering why we need that
change in the first place. According to the comment where the progress
restore function is called, it's because reorderbuffer.c uses a
subtransaction internally. But I went to look at reorderbuffer.c and
noticed that the subtransaction is only used "when using the SQL
function interface, because that creates a transaction already". So
maybe we should look into making REPACK use reorderbuffer without having
to open a transaction block.
I didn't do anything about that, in particular I didn't actually try to
run REPACK to see whether the transaction is needed. I'll be looking at
that in the next couple of days.
--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"Once again, thank you and all of the developers for your hard work on
PostgreSQL. This is by far the most pleasant management experience of
any database I've worked on." (Dan Harris)
http://archives.postgresql.org/pgsql-performance/2006-04/msg00247.php
Attachments:
v09-0001-Add-REPACK-command.patch (text/x-diff; charset=utf-8)
From 2383d9c8ca228df1f0ff7f19e94d6d42f5c34ad2 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Wed, 26 Feb 2025 09:17:20 +0100
Subject: [PATCH v09 1/9] Add REPACK command.
The existing CLUSTER command and VACUUM with the FULL option both reclaim
unused space by rewriting the table. Now that we want to enhance this
functionality (in particular, by adding a new CONCURRENTLY option), we would
have to enhance both commands, because they are both implemented by the same
function (cluster.c:cluster_rel). However, adding the same option to two
different commands is not very user-friendly. Therefore it was decided to
create a new command and to declare both the CLUSTER command and the FULL
option of VACUUM deprecated. Future enhancements to this rewriting code will
only affect the new command.
Like CLUSTER, the REPACK command reorders the table according to the specified
index. Unlike CLUSTER, REPACK does not require an index: if only a table is
specified, the command behaves like VACUUM FULL. As we don't want to remove
CLUSTER and VACUUM FULL yet, there are three callers of the cluster_rel()
function now: REPACK, CLUSTER and VACUUM FULL. When we need to distinguish who
is calling this function (mostly for logging, but also for progress
reporting), we can no longer use the OID of the clustering index: both REPACK
and VACUUM FULL can pass InvalidOid. Therefore, this patch introduces a new
enumeration type ClusterCommand, and adds an argument of this type to the
cluster_rel() function and to all the functions that need to distinguish the
caller.
Like CLUSTER and VACUUM FULL, the REPACK command without arguments processes
all the tables on which the current user has the MAINTAIN privilege.
A new pg_stat_progress_repack view is added to monitor the progress of
REPACK. Currently it displays the same information as pg_stat_progress_cluster
(except that column names might differ), but it'll also display the status of
the REPACK CONCURRENTLY command in the future, so the view definitions will
eventually diverge.
Regarding user documentation, the patch moves the information on clustering
from cluster.sgml to the new file repack.sgml. cluster.sgml now contains a
link that points to the related section of repack.sgml. A note on deprecation
and a link to repack.sgml are added to both cluster.sgml and vacuum.sgml.
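As a usage sketch, the new view can be watched from a second session while
REPACK runs. The column names below are taken from this patch's view
definition; actual output depends on a REPACK command being in progress:

```sql
-- Run in another session during REPACK; returns no rows when no backend
-- is currently running the command.
SELECT pid, relid::regclass AS relation, command, phase,
       heap_blks_scanned, heap_blks_total, index_rebuild_count
FROM pg_stat_progress_repack;
```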
---
doc/src/sgml/monitoring.sgml | 230 +++++++++++
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/cluster.sgml | 79 +---
doc/src/sgml/ref/repack.sgml | 254 ++++++++++++
doc/src/sgml/ref/vacuum.sgml | 8 +
doc/src/sgml/reference.sgml | 1 +
src/backend/access/heap/heapam_handler.c | 32 +-
src/backend/catalog/index.c | 2 +-
src/backend/catalog/system_views.sql | 27 ++
src/backend/commands/cluster.c | 496 +++++++++++++++++------
src/backend/commands/tablecmds.c | 3 +-
src/backend/commands/vacuum.c | 3 +-
src/backend/parser/gram.y | 63 ++-
src/backend/tcop/utility.c | 9 +
src/backend/utils/adt/pgstatfuncs.c | 2 +
src/bin/psql/tab-complete.in.c | 31 +-
src/include/commands/cluster.h | 22 +-
src/include/commands/progress.h | 58 ++-
src/include/nodes/parsenodes.h | 13 +
src/include/parser/kwlist.h | 1 +
src/include/tcop/cmdtaglist.h | 1 +
src/include/utils/backend_progress.h | 1 +
src/test/regress/expected/cluster.out | 180 ++++++++
src/test/regress/expected/rules.out | 27 ++
src/test/regress/sql/cluster.sql | 73 ++++
src/tools/pgindent/typedefs.list | 2 +
26 files changed, 1384 insertions(+), 235 deletions(-)
create mode 100644 doc/src/sgml/ref/repack.sgml
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index aaa6586d3a4..5643edd614e 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -400,6 +400,14 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_progress_repack</structname><indexterm><primary>pg_stat_progress_repack</primary></indexterm></entry>
+ <entry>One row for each backend running
+ <command>REPACK</command>, showing current progress. See
+ <xref linkend="repack-progress-reporting"/>.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_progress_basebackup</structname><indexterm><primary>pg_stat_progress_basebackup</primary></indexterm></entry>
<entry>One row for each WAL sender process streaming a base backup,
@@ -5926,6 +5934,228 @@ FROM pg_stat_get_backend_idset() AS backendid;
</table>
</sect2>
+ <sect2 id="repack-progress-reporting">
+ <title>REPACK Progress Reporting</title>
+
+ <indexterm>
+ <primary>pg_stat_progress_repack</primary>
+ </indexterm>
+
+ <para>
+ Whenever <command>REPACK</command> is running,
+ the <structname>pg_stat_progress_repack</structname> view will contain a
+ row for each backend that is currently running the command. The tables
+ below describe the information that will be reported and provide
+ information about how to interpret it.
+ </para>
+
+ <table id="pg-stat-progress-repack-view" xreflabel="pg_stat_progress_repack">
+ <title><structname>pg_stat_progress_repack</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>pid</structfield> <type>integer</type>
+ </para>
+ <para>
+ Process ID of backend.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>datid</structfield> <type>oid</type>
+ </para>
+ <para>
+ OID of the database to which this backend is connected.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>datname</structfield> <type>name</type>
+ </para>
+ <para>
+ Name of the database to which this backend is connected.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>relid</structfield> <type>oid</type>
+ </para>
+ <para>
+ OID of the table being repacked.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>command</structfield> <type>text</type>
+ </para>
+ <para>
+ The command that is running. Currently, the only value
+ is <literal>REPACK</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>phase</structfield> <type>text</type>
+ </para>
+ <para>
+ Current processing phase. See <xref linkend="repack-phases"/>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>repack_index_relid</structfield> <type>oid</type>
+ </para>
+ <para>
+ If the table is being scanned using an index, this is the OID of the
+ index being used; otherwise, it is zero.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_scanned</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples scanned.
+ This counter only advances when the phase is
+ <literal>seq scanning heap</literal>,
+ <literal>index scanning heap</literal>
+ or <literal>writing new heap</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_written</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples written.
+ This counter only advances when the phase is
+ <literal>seq scanning heap</literal>,
+ <literal>index scanning heap</literal>
+ or <literal>writing new heap</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_blks_total</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Total number of heap blocks in the table. This number is reported
+ as of the beginning of <literal>seq scanning heap</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_blks_scanned</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap blocks scanned. This counter only advances when the
+ phase is <literal>seq scanning heap</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>index_rebuild_count</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of indexes rebuilt. This counter only advances when the phase
+ is <literal>rebuilding index</literal>.
+ </para></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <table id="repack-phases">
+ <title>REPACK Phases</title>
+ <tgroup cols="2">
+ <colspec colname="col1" colwidth="1*"/>
+ <colspec colname="col2" colwidth="2*"/>
+ <thead>
+ <row>
+ <entry>Phase</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry><literal>initializing</literal></entry>
+ <entry>
+ The command is preparing to begin scanning the heap. This phase is
+ expected to be very brief.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>seq scanning heap</literal></entry>
+ <entry>
+ The command is currently scanning the table using a sequential scan.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>index scanning heap</literal></entry>
+ <entry>
+ <command>REPACK</command> is currently scanning the table using an index scan.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>sorting tuples</literal></entry>
+ <entry>
+ <command>REPACK</command> is currently sorting tuples.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>writing new heap</literal></entry>
+ <entry>
+ <command>REPACK</command> is currently writing the new heap.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>swapping relation files</literal></entry>
+ <entry>
+ The command is currently swapping newly-built files into place.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>rebuilding index</literal></entry>
+ <entry>
+ The command is currently rebuilding an index.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>performing final cleanup</literal></entry>
+ <entry>
+ The command is performing final cleanup. When this phase is
+ completed, <command>REPACK</command> will end.
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+ </sect2>
+
<sect2 id="copy-progress-reporting">
<title>COPY Progress Reporting</title>
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..c0ef654fcb4 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -167,6 +167,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY refreshMaterializedView SYSTEM "refresh_materialized_view.sgml">
<!ENTITY reindex SYSTEM "reindex.sgml">
<!ENTITY releaseSavepoint SYSTEM "release_savepoint.sgml">
+<!ENTITY repack SYSTEM "repack.sgml">
<!ENTITY reset SYSTEM "reset.sgml">
<!ENTITY revoke SYSTEM "revoke.sgml">
<!ENTITY rollback SYSTEM "rollback.sgml">
diff --git a/doc/src/sgml/ref/cluster.sgml b/doc/src/sgml/ref/cluster.sgml
index 8811f169ea0..54bb2362c84 100644
--- a/doc/src/sgml/ref/cluster.sgml
+++ b/doc/src/sgml/ref/cluster.sgml
@@ -42,17 +42,23 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
<replaceable class="parameter">table_name</replaceable>.
</para>
- <para>
- When a table is clustered, it is physically reordered
- based on the index information. Clustering is a one-time operation:
- when the table is subsequently updated, the changes are
- not clustered. That is, no attempt is made to store new or
- updated rows according to their index order. (If one wishes, one can
- periodically recluster by issuing the command again. Also, setting
- the table's <literal>fillfactor</literal> storage parameter to less than
- 100% can aid in preserving cluster ordering during updates, since updated
- rows are kept on the same page if enough space is available there.)
- </para>
+ <warning>
+ <para>
+ The <command>CLUSTER</command> command is deprecated in favor of
+ <xref linkend="sql-repack"/>.
+ </para>
+ </warning>
+
+ <note>
+ <para>
+ <xref linkend="sql-repack-notes-on-clustering"/> explains how clustering
+ works, whether it is initiated by <command>CLUSTER</command> or
+ by <command>REPACK</command>. The notable difference between the two is
+ that <command>REPACK</command> does not remember the index used last
+ time. Thus if you don't specify an index, <command>REPACK</command>
+ rewrites the table but does not try to cluster it.
+ </para>
+ </note>
<para>
When a table is clustered, <productname>PostgreSQL</productname>
@@ -136,63 +142,12 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
on the table.
</para>
- <para>
- In cases where you are accessing single rows randomly
- within a table, the actual order of the data in the
- table is unimportant. However, if you tend to access some
- data more than others, and there is an index that groups
- them together, you will benefit from using <command>CLUSTER</command>.
- If you are requesting a range of indexed values from a table, or a
- single indexed value that has multiple rows that match,
- <command>CLUSTER</command> will help because once the index identifies the
- table page for the first row that matches, all other rows
- that match are probably already on the same table page,
- and so you save disk accesses and speed up the query.
- </para>
-
- <para>
- <command>CLUSTER</command> can re-sort the table using either an index scan
- on the specified index, or (if the index is a b-tree) a sequential
- scan followed by sorting. It will attempt to choose the method that
- will be faster, based on planner cost parameters and available statistical
- information.
- </para>
-
<para>
While <command>CLUSTER</command> is running, the <xref
linkend="guc-search-path"/> is temporarily changed to <literal>pg_catalog,
pg_temp</literal>.
</para>
- <para>
- When an index scan is used, a temporary copy of the table is created that
- contains the table data in the index order. Temporary copies of each
- index on the table are created as well. Therefore, you need free space on
- disk at least equal to the sum of the table size and the index sizes.
- </para>
-
- <para>
- When a sequential scan and sort is used, a temporary sort file is
- also created, so that the peak temporary space requirement is as much
- as double the table size, plus the index sizes. This method is often
- faster than the index scan method, but if the disk space requirement is
- intolerable, you can disable this choice by temporarily setting <xref
- linkend="guc-enable-sort"/> to <literal>off</literal>.
- </para>
-
- <para>
- It is advisable to set <xref linkend="guc-maintenance-work-mem"/> to
- a reasonably large value (but not more than the amount of RAM you can
- dedicate to the <command>CLUSTER</command> operation) before clustering.
- </para>
-
- <para>
- Because the planner records statistics about the ordering of
- tables, it is advisable to run <link linkend="sql-analyze"><command>ANALYZE</command></link>
- on the newly clustered table.
- Otherwise, the planner might make poor choices of query plans.
- </para>
-
<para>
Because <command>CLUSTER</command> remembers which indexes are clustered,
one can cluster the tables one wants clustered manually the first time,
diff --git a/doc/src/sgml/ref/repack.sgml b/doc/src/sgml/ref/repack.sgml
new file mode 100644
index 00000000000..84f3c3e3f2b
--- /dev/null
+++ b/doc/src/sgml/ref/repack.sgml
@@ -0,0 +1,254 @@
+<!--
+doc/src/sgml/ref/repack.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-repack">
+ <indexterm zone="sql-repack">
+ <primary>REPACK</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle>REPACK</refentrytitle>
+ <manvolnum>7</manvolnum>
+ <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>REPACK</refname>
+ <refpurpose>cluster a table according to an index</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <replaceable class="parameter">table_name</replaceable> [ USING INDEX <replaceable class="parameter">index_name</replaceable> ] ]
+
+<phrase>where <replaceable class="parameter">option</replaceable> can be one of:</phrase>
+
+ VERBOSE [ <replaceable class="parameter">boolean</replaceable> ]
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+
+ <para>
+ <command>REPACK</command> reclaims storage occupied by dead
+ tuples. Unlike <command>VACUUM</command>, it does so by rewriting the
+ entire contents of the table specified
+ by <replaceable class="parameter">table_name</replaceable> into a new disk
+ file with no extra space (except for the space guaranteed by
+ the <literal>fillfactor</literal> storage parameter), allowing unused space
+ to be returned to the operating system.
+ </para>
+
+ <para>
+ Without
+ a <replaceable class="parameter">table_name</replaceable>, <command>REPACK</command>
+ processes every table and materialized view in the current database that
+ the current user has the <literal>MAINTAIN</literal> privilege on. This
+ form of <command>REPACK</command> cannot be executed inside a transaction
+ block.
+ </para>
+
+ <para>
+ If <replaceable class="parameter">index_name</replaceable> is specified,
+ the table is clustered by this index. Please see the notes on clustering
+ below.
+ </para>
+
+ <para>
+ When a table is being repacked, an <literal>ACCESS EXCLUSIVE</literal> lock
+ is acquired on it. This prevents any other database operations (both reads
+ and writes) from operating on the table until the <command>REPACK</command>
+ is finished.
+ </para>
+
+ <refsect2 id="sql-repack-notes-on-clustering" xreflabel="Notes on Clustering">
+ <title>Notes on Clustering</title>
+
+ <para>
+ When a table is clustered, it is physically reordered based on the index
+ information. Clustering is a one-time operation: when the table is
+ subsequently updated, the changes are not clustered. That is, no attempt
+ is made to store new or updated rows according to their index order. (If
+ one wishes, one can periodically recluster by issuing the command again.
+ Also, setting the table's <literal>fillfactor</literal> storage parameter
+ to less than 100% can aid in preserving cluster ordering during updates,
+ since updated rows are kept on the same page if enough space is available
+ there.)
+ </para>
+
+ <para>
+ In cases where you are accessing single rows randomly within a table, the
+ actual order of the data in the table is unimportant. However, if you tend
+ to access some data more than others, and there is an index that groups
+ them together, you will benefit from using <command>REPACK</command>. If
+ you are requesting a range of indexed values from a table, or a single
+ indexed value that has multiple rows that match,
+ <command>REPACK</command> will help because once the index identifies the
+ table page for the first row that matches, all other rows that match are
+ probably already on the same table page, and so you save disk accesses and
+ speed up the query.
+ </para>
+
+ <para>
+ <command>REPACK</command> can re-sort the table using either an index scan
+ on the specified index, or (if the index is a b-tree) a sequential scan
+ followed by sorting. It will attempt to choose the method that will be
+ faster, based on planner cost parameters and available statistical
+ information.
+ </para>
+
+ <para>
+ Because the planner records statistics about the ordering of tables, it is
+ advisable to
+ run <link linkend="sql-analyze"><command>ANALYZE</command></link> on the
+ newly repacked table. Otherwise, the planner might make poor choices of
+ query plans.
+ </para>
+ </refsect2>
+
+ <refsect2 id="sql-repack-notes-on-resources" xreflabel="Notes on Resources">
+ <title>Notes on Resources</title>
+
+ <para>
+ When an index scan or a sequential scan without sort is used, a temporary
+ copy of the table is created that contains the table data in the index
+ order. Temporary copies of each index on the table are created as well.
+ Therefore, you need free space on disk at least equal to the sum of the
+ table size and the index sizes.
+ </para>
+
+ <para>
+ When a sequential scan and sort is used, a temporary sort file is also
+ created, so that the peak temporary space requirement is as much as double
+ the table size, plus the index sizes. This method is often faster than
+ the index scan method, but if the disk space requirement is intolerable,
+ you can disable this choice by temporarily setting
+ <xref linkend="guc-enable-sort"/> to <literal>off</literal>.
+ </para>
+
+ <para>
+ It is advisable to set <xref linkend="guc-maintenance-work-mem"/> to a
+ reasonably large value (but not more than the amount of RAM you can
+ dedicate to the <command>REPACK</command> operation) before repacking.
+ </para>
+ </refsect2>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Parameters</title>
+
+ <variablelist>
+ <varlistentry>
+ <term><replaceable class="parameter">table_name</replaceable></term>
+ <listitem>
+ <para>
+ The name (possibly schema-qualified) of a table.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><replaceable class="parameter">index_name</replaceable></term>
+ <listitem>
+ <para>
+ The name of an index.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>VERBOSE</literal></term>
+ <listitem>
+ <para>
+ Prints a progress report as each table is clustered
+ at <literal>INFO</literal> level.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><replaceable class="parameter">boolean</replaceable></term>
+ <listitem>
+ <para>
+ Specifies whether the selected option should be turned on or off.
+ You can write <literal>TRUE</literal>, <literal>ON</literal>, or
+ <literal>1</literal> to enable the option, and <literal>FALSE</literal>,
+ <literal>OFF</literal>, or <literal>0</literal> to disable it. The
+ <replaceable class="parameter">boolean</replaceable> value can also
+ be omitted, in which case <literal>TRUE</literal> is assumed.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </refsect1>
+
+ <refsect1>
+ <title>Notes</title>
+
+ <para>
+ To repack a table, one must have the <literal>MAINTAIN</literal> privilege
+ on the table.
+ </para>
+
+ <para>
+ While <command>REPACK</command> is running, the <xref
+ linkend="guc-search-path"/> is temporarily changed to <literal>pg_catalog,
+ pg_temp</literal>.
+ </para>
+
+ <para>
+ Each backend running <command>REPACK</command> will report its progress
+ in the <structname>pg_stat_progress_repack</structname> view. See
+ <xref linkend="repack-progress-reporting"/> for details.
+ </para>
+
+ <para>
+ Repacking a partitioned table repacks each of its partitions. If an index
+ is specified, each partition is clustered using the partition of that
+ index. <command>REPACK</command> on a partitioned table cannot be executed
+ inside a transaction block.
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Examples</title>
+
+ <para>
+ Repack the table <literal>employees</literal>:
+<programlisting>
+REPACK employees;
+</programlisting>
+ </para>
+
+
+ <para>
+ Cluster the table <literal>employees</literal> on the basis of its
+ index <literal>employees_ind</literal>:
+<programlisting>
+REPACK employees USING INDEX employees_ind;
+</programlisting>
+ </para>
+
+ <para>
+ Repack all tables in the database on which you have
+ the <literal>MAINTAIN</literal> privilege:
+<programlisting>
+REPACK;
+</programlisting></para>
+ </refsect1>
+
+ <refsect1>
+ <title>Compatibility</title>
+
+ <para>
+ There is no <command>REPACK</command> statement in the SQL standard.
+ </para>
+
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index bd5dcaf86a5..735a2a7703a 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -98,6 +98,14 @@ VACUUM [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
<varlistentry>
<term><literal>FULL</literal></term>
<listitem>
+
+ <warning>
+ <para>
+ The <command>FULL</command> parameter is deprecated in favor of
+ <xref linkend="sql-repack"/>.
+ </para>
+ </warning>
+
<para>
Selects <quote>full</quote> vacuum, which can reclaim more
space, but takes much longer and exclusively locks the table.
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..229912d35b7 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -195,6 +195,7 @@
&refreshMaterializedView;
&reindex;
&releaseSavepoint;
+ &repack;
&reset;
&revoke;
&rollback;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 4da4dc84580..dfc95ee46b2 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -741,13 +741,13 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
if (OldIndex != NULL && !use_sort)
{
const int ci_index[] = {
- PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_INDEX_RELID
+ PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_INDEX_RELID
};
int64 ci_val[2];
/* Set phase and OIDOldIndex to columns */
- ci_val[0] = PROGRESS_CLUSTER_PHASE_INDEX_SCAN_HEAP;
+ ci_val[0] = PROGRESS_REPACK_PHASE_INDEX_SCAN_HEAP;
ci_val[1] = RelationGetRelid(OldIndex);
pgstat_progress_update_multi_param(2, ci_index, ci_val);
@@ -759,15 +759,15 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
else
{
/* In scan-and-sort mode and also VACUUM FULL, set phase */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_SEQ_SCAN_HEAP);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_SEQ_SCAN_HEAP);
tableScan = table_beginscan(OldHeap, SnapshotAny, 0, (ScanKey) NULL);
heapScan = (HeapScanDesc) tableScan;
indexScan = NULL;
/* Set total heap blocks */
- pgstat_progress_update_param(PROGRESS_CLUSTER_TOTAL_HEAP_BLKS,
+ pgstat_progress_update_param(PROGRESS_REPACK_TOTAL_HEAP_BLKS,
heapScan->rs_nblocks);
}
@@ -809,7 +809,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
* is manually updated to the correct value when the table
* scan finishes.
*/
- pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_BLKS_SCANNED,
+ pgstat_progress_update_param(PROGRESS_REPACK_HEAP_BLKS_SCANNED,
heapScan->rs_nblocks);
break;
}
@@ -825,7 +825,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
*/
if (prev_cblock != heapScan->rs_cblock)
{
- pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_BLKS_SCANNED,
+ pgstat_progress_update_param(PROGRESS_REPACK_HEAP_BLKS_SCANNED,
(heapScan->rs_cblock +
heapScan->rs_nblocks -
heapScan->rs_startblock
@@ -912,14 +912,14 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
* In scan-and-sort mode, report increase in number of tuples
* scanned
*/
- pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED,
+ pgstat_progress_update_param(PROGRESS_REPACK_HEAP_TUPLES_SCANNED,
*num_tuples);
}
else
{
const int ct_index[] = {
- PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED,
- PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN
+ PROGRESS_REPACK_HEAP_TUPLES_SCANNED,
+ PROGRESS_REPACK_HEAP_TUPLES_WRITTEN
};
int64 ct_val[2];
@@ -952,14 +952,14 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
double n_tuples = 0;
/* Report that we are now sorting tuples */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_SORT_TUPLES);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_SORT_TUPLES);
tuplesort_performsort(tuplesort);
/* Report that we are now writing new heap */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_WRITE_NEW_HEAP);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_WRITE_NEW_HEAP);
for (;;)
{
@@ -977,7 +977,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
values, isnull,
rwstate);
/* Report n_tuples */
- pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN,
+ pgstat_progress_update_param(PROGRESS_REPACK_HEAP_TUPLES_WRITTEN,
n_tuples);
}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 739a92bdcc1..466cf0fdef6 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -4079,7 +4079,7 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
Assert(!ReindexIsProcessingIndex(indexOid));
/* Set index rebuild count */
- pgstat_progress_update_param(PROGRESS_CLUSTER_INDEX_REBUILD_COUNT,
+ pgstat_progress_update_param(PROGRESS_REPACK_INDEX_REBUILD_COUNT,
i);
i++;
}
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index a4d2cfdcaf5..b8209b2acd5 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1262,6 +1262,33 @@ CREATE VIEW pg_stat_progress_cluster AS
FROM pg_stat_get_progress_info('CLUSTER') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
+CREATE VIEW pg_stat_progress_repack AS
+ SELECT
+ S.pid AS pid,
+ S.datid AS datid,
+ D.datname AS datname,
+ S.relid AS relid,
+ CASE S.param1 WHEN 1 THEN 'REPACK'
+ END AS command,
+ CASE S.param2 WHEN 0 THEN 'initializing'
+ WHEN 1 THEN 'seq scanning heap'
+ WHEN 2 THEN 'index scanning heap'
+ WHEN 3 THEN 'sorting tuples'
+ WHEN 4 THEN 'writing new heap'
+ WHEN 5 THEN 'swapping relation files'
+ WHEN 6 THEN 'rebuilding index'
+ WHEN 7 THEN 'performing final cleanup'
+ END AS phase,
+ CAST(S.param3 AS oid) AS repack_index_relid,
+ S.param4 AS heap_tuples_scanned,
+ S.param5 AS heap_tuples_written,
+ S.param6 AS heap_blks_total,
+ S.param7 AS heap_blks_scanned,
+ S.param8 AS index_rebuild_count
+ FROM pg_stat_get_progress_info('REPACK') AS S
+ LEFT JOIN pg_database D ON S.datid = D.oid;
+
+
CREATE VIEW pg_stat_progress_create_index AS
SELECT
S.pid AS pid, S.datid AS datid, D.datname AS datname,
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 54a08e4102e..9ae3d87e412 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -46,6 +46,7 @@
#include "storage/lmgr.h"
#include "storage/predicate.h"
#include "utils/acl.h"
+#include "utils/formatting.h"
#include "utils/fmgroids.h"
#include "utils/guc.h"
#include "utils/inval.h"
@@ -67,17 +68,33 @@ typedef struct
Oid indexOid;
} RelToCluster;
+/*
+ * Map the value of ClusterCommand to string.
+ */
+#define CLUSTER_COMMAND_STR(cmd) ((cmd) == CLUSTER_COMMAND_CLUSTER ? \
+ "cluster" : \
+ ((cmd) == CLUSTER_COMMAND_REPACK ? \
+ "repack" : "vacuum"))
-static void cluster_multiple_rels(List *rtcs, ClusterParams *params);
-static void rebuild_relation(Relation OldHeap, Relation index, bool verbose);
+static void cluster_multiple_rels(List *rtcs, ClusterParams *params,
+ ClusterCommand cmd);
+static void rebuild_relation(Relation OldHeap, Relation index, bool verbose,
+ ClusterCommand cmd);
static void copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
- bool verbose, bool *pSwapToastByContent,
+ bool verbose, ClusterCommand cmd,
+ bool *pSwapToastByContent,
TransactionId *pFreezeXid, MultiXactId *pCutoffMulti);
static List *get_tables_to_cluster(MemoryContext cluster_context);
+static List *get_tables_to_repack(MemoryContext repack_context);
static List *get_tables_to_cluster_partitioned(MemoryContext cluster_context,
- Oid indexOid);
-static bool cluster_is_permitted_for_relation(Oid relid, Oid userid);
-
+ Oid relid, bool rel_is_index,
+ ClusterCommand cmd);
+static bool cluster_is_permitted_for_relation(Oid relid, Oid userid,
+ ClusterCommand cmd);
+static Relation process_single_relation(RangeVar *relation, char *indexname,
+ ClusterCommand cmd,
+ ClusterParams *params,
+ Oid *indexOid_p);
/*---------------------------------------------------------------------------
* This cluster code allows for clustering multiple tables at once. Because
@@ -133,72 +150,11 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
if (stmt->relation != NULL)
{
- /* This is the single-relation case. */
- Oid tableOid;
-
- /*
- * Find, lock, and check permissions on the table. We obtain
- * AccessExclusiveLock right away to avoid lock-upgrade hazard in the
- * single-transaction case.
- */
- tableOid = RangeVarGetRelidExtended(stmt->relation,
- AccessExclusiveLock,
- 0,
- RangeVarCallbackMaintainsTable,
- NULL);
- rel = table_open(tableOid, NoLock);
-
- /*
- * Reject clustering a remote temp table ... their local buffer
- * manager is not going to cope.
- */
- if (RELATION_IS_OTHER_TEMP(rel))
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot cluster temporary tables of other sessions")));
-
- if (stmt->indexname == NULL)
- {
- ListCell *index;
-
- /* We need to find the index that has indisclustered set. */
- foreach(index, RelationGetIndexList(rel))
- {
- indexOid = lfirst_oid(index);
- if (get_index_isclustered(indexOid))
- break;
- indexOid = InvalidOid;
- }
-
- if (!OidIsValid(indexOid))
- ereport(ERROR,
- (errcode(ERRCODE_UNDEFINED_OBJECT),
- errmsg("there is no previously clustered index for table \"%s\"",
- stmt->relation->relname)));
- }
- else
- {
- /*
- * The index is expected to be in the same namespace as the
- * relation.
- */
- indexOid = get_relname_relid(stmt->indexname,
- rel->rd_rel->relnamespace);
- if (!OidIsValid(indexOid))
- ereport(ERROR,
- (errcode(ERRCODE_UNDEFINED_OBJECT),
- errmsg("index \"%s\" for table \"%s\" does not exist",
- stmt->indexname, stmt->relation->relname)));
- }
-
- /* For non-partitioned tables, do what we came here to do. */
- if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
- {
- cluster_rel(rel, indexOid, ¶ms);
- /* cluster_rel closes the relation, but keeps lock */
-
+ rel = process_single_relation(stmt->relation, stmt->indexname,
+ CLUSTER_COMMAND_CLUSTER, ¶ms,
+ &indexOid);
+ if (rel == NULL)
return;
- }
}
/*
@@ -230,8 +186,11 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
if (rel != NULL)
{
Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
- check_index_is_clusterable(rel, indexOid, AccessShareLock);
- rtcs = get_tables_to_cluster_partitioned(cluster_context, indexOid);
+ check_index_is_clusterable(rel, indexOid, AccessShareLock,
+ CLUSTER_COMMAND_CLUSTER);
+ rtcs = get_tables_to_cluster_partitioned(cluster_context, indexOid,
+ true,
+ CLUSTER_COMMAND_CLUSTER);
/* close relation, releasing lock on parent table */
table_close(rel, AccessExclusiveLock);
@@ -243,7 +202,7 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
}
/* Do the job. */
- cluster_multiple_rels(rtcs, ¶ms);
+ cluster_multiple_rels(rtcs, ¶ms, CLUSTER_COMMAND_CLUSTER);
/* Start a new transaction for the cleanup work. */
StartTransactionCommand();
@@ -260,7 +219,8 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
* return.
*/
static void
-cluster_multiple_rels(List *rtcs, ClusterParams *params)
+cluster_multiple_rels(List *rtcs, ClusterParams *params,
+ ClusterCommand cmd)
{
ListCell *lc;
@@ -283,7 +243,7 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
rel = table_open(rtc->tableOid, AccessExclusiveLock);
/* Process this table */
- cluster_rel(rel, rtc->indexOid, params);
+ cluster_rel(rel, rtc->indexOid, params, cmd);
/* cluster_rel closes the relation, but keeps lock */
PopActiveSnapshot();
@@ -306,9 +266,13 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
* If indexOid is InvalidOid, the table will be rewritten in physical order
* instead of index order. This is the new implementation of VACUUM FULL,
* and error messages should refer to the operation as VACUUM not CLUSTER.
+ *
+ * 'cmd' indicates which command is being executed. REPACK should be the only
+ * caller of this function in the future.
*/
void
-cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
+cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
+ ClusterCommand cmd)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid save_userid;
@@ -317,19 +281,33 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
bool verbose = ((params->options & CLUOPT_VERBOSE) != 0);
bool recheck = ((params->options & CLUOPT_RECHECK) != 0);
Relation index;
+ const char *cmd_str = CLUSTER_COMMAND_STR(cmd);
Assert(CheckRelationLockedByMe(OldHeap, AccessExclusiveLock, false));
/* Check for user-requested abort. */
CHECK_FOR_INTERRUPTS();
- pgstat_progress_start_command(PROGRESS_COMMAND_CLUSTER, tableOid);
- if (OidIsValid(indexOid))
- pgstat_progress_update_param(PROGRESS_CLUSTER_COMMAND,
- PROGRESS_CLUSTER_COMMAND_CLUSTER);
+ if (cmd == CLUSTER_COMMAND_REPACK)
+ pgstat_progress_start_command(PROGRESS_COMMAND_REPACK, tableOid);
else
- pgstat_progress_update_param(PROGRESS_CLUSTER_COMMAND,
+ pgstat_progress_start_command(PROGRESS_COMMAND_CLUSTER, tableOid);
+
+ if (cmd == CLUSTER_COMMAND_REPACK)
+ pgstat_progress_update_param(PROGRESS_REPACK_COMMAND,
+ PROGRESS_REPACK_COMMAND_REPACK);
+ else if (OidIsValid(indexOid))
+ {
+ Assert(cmd == CLUSTER_COMMAND_CLUSTER);
+ pgstat_progress_update_param(PROGRESS_REPACK_COMMAND,
+ PROGRESS_CLUSTER_COMMAND_CLUSTER);
+ }
+ else
+ {
+ Assert(cmd == CLUSTER_COMMAND_VACUUM);
+ pgstat_progress_update_param(PROGRESS_REPACK_COMMAND,
PROGRESS_CLUSTER_COMMAND_VACUUM_FULL);
+ }
/*
* Switch to the table owner's userid, so that any index functions are run
@@ -353,7 +331,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
if (recheck)
{
/* Check that the user still has privileges for the relation */
- if (!cluster_is_permitted_for_relation(tableOid, save_userid))
+ if (!cluster_is_permitted_for_relation(tableOid, save_userid, cmd))
{
relation_close(OldHeap, AccessExclusiveLock);
goto out;
@@ -403,39 +381,38 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
* would work in most respects, but the index would only get marked as
* indisclustered in the current database, leading to unexpected behavior
* if CLUSTER were later invoked in another database.
+ *
+ * REPACK does not set indisclustered. XXX Not sure I understand the
+ * comment above: how can an attribute be set "only in the current
+ * database"?
*/
- if (OidIsValid(indexOid) && OldHeap->rd_rel->relisshared)
+ if (cmd == CLUSTER_COMMAND_CLUSTER && OldHeap->rd_rel->relisshared)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot cluster a shared catalog")));
+ errmsg("cannot %s a shared catalog", cmd_str)));
/*
* Don't process temp tables of other backends ... their local buffer
* manager is not going to cope.
*/
if (RELATION_IS_OTHER_TEMP(OldHeap))
- {
- if (OidIsValid(indexOid))
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot cluster temporary tables of other sessions")));
- else
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot vacuum temporary tables of other sessions")));
- }
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot %s temporary tables of other sessions",
+ cmd_str)));
/*
* Also check for active uses of the relation in the current transaction,
* including open scans and pending AFTER trigger events.
*/
- CheckTableNotInUse(OldHeap, OidIsValid(indexOid) ? "CLUSTER" : "VACUUM");
+ CheckTableNotInUse(OldHeap, asc_toupper(cmd_str, strlen(cmd_str)));
/* Check heap and index are valid to cluster on */
if (OidIsValid(indexOid))
{
/* verify the index is good and lock it */
- check_index_is_clusterable(OldHeap, indexOid, AccessExclusiveLock);
+ check_index_is_clusterable(OldHeap, indexOid, AccessExclusiveLock,
+ cmd);
/* also open it */
index = index_open(indexOid, NoLock);
}
@@ -469,7 +446,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
TransferPredicateLocksToHeapRelation(OldHeap);
/* rebuild_relation does all the dirty work */
- rebuild_relation(OldHeap, index, verbose);
+ rebuild_relation(OldHeap, index, verbose, cmd);
/* rebuild_relation closes OldHeap, and index if valid */
out:
@@ -491,9 +468,11 @@ out:
* protection here.
*/
void
-check_index_is_clusterable(Relation OldHeap, Oid indexOid, LOCKMODE lockmode)
+check_index_is_clusterable(Relation OldHeap, Oid indexOid, LOCKMODE lockmode,
+ ClusterCommand cmd)
{
Relation OldIndex;
+ const char *cmd_str = CLUSTER_COMMAND_STR(cmd);
OldIndex = index_open(indexOid, lockmode);
@@ -512,8 +491,8 @@ check_index_is_clusterable(Relation OldHeap, Oid indexOid, LOCKMODE lockmode)
if (!OldIndex->rd_indam->amclusterable)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot cluster on index \"%s\" because access method does not support clustering",
- RelationGetRelationName(OldIndex))));
+ errmsg("cannot %s on index \"%s\" because access method does not support clustering",
+ cmd_str, RelationGetRelationName(OldIndex))));
/*
* Disallow clustering on incomplete indexes (those that might not index
@@ -524,7 +503,8 @@ check_index_is_clusterable(Relation OldHeap, Oid indexOid, LOCKMODE lockmode)
if (!heap_attisnull(OldIndex->rd_indextuple, Anum_pg_index_indpred, NULL))
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot cluster on partial index \"%s\"",
+ errmsg("cannot %s on partial index \"%s\"",
+ cmd_str,
RelationGetRelationName(OldIndex))));
/*
@@ -538,8 +518,8 @@ check_index_is_clusterable(Relation OldHeap, Oid indexOid, LOCKMODE lockmode)
if (!OldIndex->rd_index->indisvalid)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot cluster on invalid index \"%s\"",
- RelationGetRelationName(OldIndex))));
+ errmsg("cannot %s on invalid index \"%s\"",
+ cmd_str, RelationGetRelationName(OldIndex))));
/* Drop relcache refcnt on OldIndex, but keep lock */
index_close(OldIndex, NoLock);
@@ -626,7 +606,8 @@ mark_index_clustered(Relation rel, Oid indexOid, bool is_internal)
* On exit, they are closed, but locks on them are not released.
*/
static void
-rebuild_relation(Relation OldHeap, Relation index, bool verbose)
+rebuild_relation(Relation OldHeap, Relation index, bool verbose,
+ ClusterCommand cmd)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid accessMethod = OldHeap->rd_rel->relam;
@@ -664,7 +645,7 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose)
NewHeap = table_open(OIDNewHeap, NoLock);
/* Copy the heap data into the new table in the desired order */
- copy_table_data(NewHeap, OldHeap, index, verbose,
+ copy_table_data(NewHeap, OldHeap, index, verbose, cmd,
&swap_toast_by_content, &frozenXid, &cutoffMulti);
@@ -829,8 +810,8 @@ make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
*/
static void
copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verbose,
- bool *pSwapToastByContent, TransactionId *pFreezeXid,
- MultiXactId *pCutoffMulti)
+ ClusterCommand cmd, bool *pSwapToastByContent,
+ TransactionId *pFreezeXid, MultiXactId *pCutoffMulti)
{
Relation relRelation;
HeapTuple reltup;
@@ -845,6 +826,7 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
tups_recently_dead = 0;
BlockNumber num_pages;
int elevel = verbose ? INFO : DEBUG2;
+ const char *cmd_str = CLUSTER_COMMAND_STR(cmd);
PGRUsage ru0;
char *nspname;
@@ -958,18 +940,21 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
/* Log what we're doing */
if (OldIndex != NULL && !use_sort)
ereport(elevel,
- (errmsg("clustering \"%s.%s\" using index scan on \"%s\"",
+ (errmsg("%sing \"%s.%s\" using index scan on \"%s\"",
+ cmd_str,
nspname,
RelationGetRelationName(OldHeap),
RelationGetRelationName(OldIndex))));
else if (use_sort)
ereport(elevel,
- (errmsg("clustering \"%s.%s\" using sequential scan and sort",
+ (errmsg("%sing \"%s.%s\" using sequential scan and sort",
+ cmd_str,
nspname,
RelationGetRelationName(OldHeap))));
else
ereport(elevel,
- (errmsg("vacuuming \"%s.%s\"",
+ (errmsg("%sing \"%s.%s\"",
+ cmd_str,
nspname,
RelationGetRelationName(OldHeap))));
@@ -1458,8 +1443,8 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
int i;
/* Report that we are now swapping relation files */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_SWAP_REL_FILES);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_SWAP_REL_FILES);
/* Zero out possible results from swapped_relation_files */
memset(mapped_tables, 0, sizeof(mapped_tables));
@@ -1509,14 +1494,14 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
/* Report that we are now reindexing relations */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_REBUILD_INDEX);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_REBUILD_INDEX);
reindex_relation(NULL, OIDOldHeap, reindex_flags, &reindex_params);
/* Report that we are now doing clean up */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_FINAL_CLEANUP);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_FINAL_CLEANUP);
/*
* If the relation being rebuilt is pg_class, swap_relation_files()
@@ -1666,7 +1651,8 @@ get_tables_to_cluster(MemoryContext cluster_context)
index = (Form_pg_index) GETSTRUCT(indexTuple);
- if (!cluster_is_permitted_for_relation(index->indrelid, GetUserId()))
+ if (!cluster_is_permitted_for_relation(index->indrelid, GetUserId(),
+ CLUSTER_COMMAND_CLUSTER))
continue;
/* Use a permanent memory context for the result list */
@@ -1687,14 +1673,67 @@ get_tables_to_cluster(MemoryContext cluster_context)
}
/*
- * Given an index on a partitioned table, return a list of RelToCluster for
+ * Like get_tables_to_cluster(), but do not care about indexes.
+ */
+static List *
+get_tables_to_repack(MemoryContext repack_context)
+{
+ Relation relrelation;
+ TableScanDesc scan;
+ HeapTuple tuple;
+ MemoryContext old_context;
+ List *rtcs = NIL;
+
+ /*
+	 * Collect all plain relations for which the current user has the
+	 * appropriate privileges; unlike CLUSTER, no clustered index is required.
+ */
+ relrelation = table_open(RelationRelationId, AccessShareLock);
+ scan = table_beginscan_catalog(relrelation, 0, NULL);
+ while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+ {
+ RelToCluster *rtc;
+ Form_pg_class relrelation = (Form_pg_class) GETSTRUCT(tuple);
+ Oid relid = relrelation->oid;
+
+ /* Only interested in relations. */
+ if (get_rel_relkind(relid) != RELKIND_RELATION)
+ continue;
+
+ if (!cluster_is_permitted_for_relation(relid, GetUserId(),
+ CLUSTER_COMMAND_REPACK))
+ continue;
+
+ /* Use a permanent memory context for the result list */
+ old_context = MemoryContextSwitchTo(repack_context);
+
+ rtc = (RelToCluster *) palloc(sizeof(RelToCluster));
+ rtc->tableOid = relid;
+ rtc->indexOid = InvalidOid;
+ rtcs = lappend(rtcs, rtc);
+
+ MemoryContextSwitchTo(old_context);
+ }
+ table_endscan(scan);
+
+ relation_close(relrelation, AccessShareLock);
+
+ return rtcs;
+}
+
+/*
+ * Given a partitioned table or its index, return a list of RelToCluster for
 * all the leaf tables/indexes.
*
* Like expand_vacuum_rel, but here caller must hold AccessExclusiveLock
* on the table containing the index.
+ *
+ * 'rel_is_index' tells whether 'relid' is that of an index (true) or of the
+ * owning relation (false).
*/
static List *
-get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
+get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid relid,
+ bool rel_is_index, ClusterCommand cmd)
{
List *inhoids;
ListCell *lc;
@@ -1702,17 +1741,33 @@ get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
MemoryContext old_context;
/* Do not lock the children until they're processed */
- inhoids = find_all_inheritors(indexOid, NoLock, NULL);
+ inhoids = find_all_inheritors(relid, NoLock, NULL);
foreach(lc, inhoids)
{
- Oid indexrelid = lfirst_oid(lc);
- Oid relid = IndexGetRelation(indexrelid, false);
+ Oid inhoid = lfirst_oid(lc);
+ Oid inhrelid,
+ inhindid;
RelToCluster *rtc;
- /* consider only leaf indexes */
- if (get_rel_relkind(indexrelid) != RELKIND_INDEX)
- continue;
+ if (rel_is_index)
+ {
+ /* consider only leaf indexes */
+ if (get_rel_relkind(inhoid) != RELKIND_INDEX)
+ continue;
+
+ inhrelid = IndexGetRelation(inhoid, false);
+ inhindid = inhoid;
+ }
+ else
+ {
+ /* consider only leaf relations */
+ if (get_rel_relkind(inhoid) != RELKIND_RELATION)
+ continue;
+
+ inhrelid = inhoid;
+ inhindid = InvalidOid;
+ }
/*
* It's possible that the user does not have privileges to CLUSTER the
@@ -1720,15 +1775,15 @@ get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
* table. We skip any partitions which the user is not permitted to
* CLUSTER.
*/
- if (!cluster_is_permitted_for_relation(relid, GetUserId()))
+ if (!cluster_is_permitted_for_relation(inhrelid, GetUserId(), cmd))
continue;
/* Use a permanent memory context for the result list */
old_context = MemoryContextSwitchTo(cluster_context);
rtc = (RelToCluster *) palloc(sizeof(RelToCluster));
- rtc->tableOid = relid;
- rtc->indexOid = indexrelid;
+ rtc->tableOid = inhrelid;
+ rtc->indexOid = inhindid;
rtcs = lappend(rtcs, rtc);
MemoryContextSwitchTo(old_context);
@@ -1742,13 +1797,192 @@ get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
* function emits a WARNING.
*/
static bool
-cluster_is_permitted_for_relation(Oid relid, Oid userid)
+cluster_is_permitted_for_relation(Oid relid, Oid userid, ClusterCommand cmd)
{
if (pg_class_aclcheck(relid, userid, ACL_MAINTAIN) == ACLCHECK_OK)
return true;
ereport(WARNING,
- (errmsg("permission denied to cluster \"%s\", skipping it",
+ (errmsg("permission denied to %s \"%s\", skipping it",
+ CLUSTER_COMMAND_STR(cmd),
get_rel_name(relid))));
return false;
}
+
+/*
+ * REPACK is intended to be a replacement for both CLUSTER and VACUUM FULL.
+ */
+void
+repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
+{
+ ListCell *lc;
+ ClusterParams params = {0};
+ bool verbose = false;
+ Relation rel = NULL;
+ Oid indexOid = InvalidOid;
+ MemoryContext repack_context;
+ List *rtcs;
+
+ /* Parse option list */
+ foreach(lc, stmt->params)
+ {
+ DefElem *opt = (DefElem *) lfirst(lc);
+
+ if (strcmp(opt->defname, "verbose") == 0)
+ verbose = defGetBoolean(opt);
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("unrecognized REPACK option \"%s\"",
+ opt->defname),
+ parser_errposition(pstate, opt->location)));
+ }
+
+ params.options = (verbose ? CLUOPT_VERBOSE : 0);
+
+ if (stmt->relation != NULL)
+ {
+ rel = process_single_relation(stmt->relation, stmt->indexname,
+ CLUSTER_COMMAND_REPACK, ¶ms,
+ &indexOid);
+ if (rel == NULL)
+ return;
+ }
+
+ /*
+ * By here, we know we are in a multi-table situation. In order to avoid
+ * holding locks for too long, we want to process each table in its own
+ * transaction. This forces us to disallow running inside a user
+ * transaction block.
+ */
+ PreventInTransactionBlock(isTopLevel, "REPACK");
+
+ /* Also, we need a memory context to hold our list of relations */
+ repack_context = AllocSetContextCreate(PortalContext,
+ "Repack",
+ ALLOCSET_DEFAULT_SIZES);
+
+ params.options |= CLUOPT_RECHECK;
+ if (rel != NULL)
+ {
+ Oid relid;
+ bool rel_is_index;
+
+ Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
+
+ if (OidIsValid(indexOid))
+ {
+ relid = indexOid;
+ rel_is_index = true;
+ }
+ else
+ {
+ relid = RelationGetRelid(rel);
+ rel_is_index = false;
+ }
+ rtcs = get_tables_to_cluster_partitioned(repack_context, relid,
+ rel_is_index,
+ CLUSTER_COMMAND_REPACK);
+
+ /* close relation, releasing lock on parent table */
+ table_close(rel, AccessExclusiveLock);
+ }
+ else
+ rtcs = get_tables_to_repack(repack_context);
+
+ /* Do the job. */
+ cluster_multiple_rels(rtcs, ¶ms, CLUSTER_COMMAND_REPACK);
+
+ /* Start a new transaction for the cleanup work. */
+ StartTransactionCommand();
+
+ /* Clean up working storage */
+ MemoryContextDelete(repack_context);
+}
+
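Because cluster_multiple_rels() commits after each table, the PreventInTransactionBlock() call above means the multi-table form must run outside an explicit transaction block. A sketch of the expected behavior on a server built with this patch (the error wording follows the standard PreventInTransactionBlock message):

```sql
BEGIN;
REPACK;   -- expected to fail with something like:
          -- ERROR:  REPACK cannot run inside a transaction block
ROLLBACK;

REPACK;   -- outside a transaction block: each table is processed
          -- in its own transaction
```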
+/*
+ * REPACK a single relation.
+ *
+ * Return NULL if done, or a reference to the relation if the caller still
+ * needs to process it (because the relation is partitioned).
+ */
+static Relation
+process_single_relation(RangeVar *relation, char *indexname,
+ ClusterCommand cmd, ClusterParams *params,
+ Oid *indexOid_p)
+{
+ Relation rel;
+ Oid indexOid = InvalidOid;
+
+ /* This is the single-relation case. */
+ Oid tableOid;
+
+ /*
+ * Find, lock, and check permissions on the table. We obtain
+ * AccessExclusiveLock right away to avoid lock-upgrade hazard in the
+ * single-transaction case.
+ */
+ tableOid = RangeVarGetRelidExtended(relation,
+ AccessExclusiveLock,
+ 0,
+ RangeVarCallbackMaintainsTable,
+ NULL);
+ rel = table_open(tableOid, NoLock);
+
+ /*
+ * Reject clustering a remote temp table ... their local buffer manager is
+ * not going to cope.
+ */
+ if (RELATION_IS_OTHER_TEMP(rel))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot %s temporary tables of other sessions",
+ CLUSTER_COMMAND_STR(cmd))));
+
+ if (indexname == NULL && cmd == CLUSTER_COMMAND_CLUSTER)
+ {
+ ListCell *index;
+
+ /* We need to find the index that has indisclustered set. */
+ foreach(index, RelationGetIndexList(rel))
+ {
+ indexOid = lfirst_oid(index);
+ if (get_index_isclustered(indexOid))
+ break;
+ indexOid = InvalidOid;
+ }
+
+ if (!OidIsValid(indexOid))
+ ereport(ERROR,
+ (errcode(ERRCODE_UNDEFINED_OBJECT),
+ errmsg("there is no previously clustered index for table \"%s\"",
+ relation->relname)));
+ }
+ else if (indexname != NULL)
+ {
+ /*
+ * The index is expected to be in the same namespace as the relation.
+ */
+ indexOid = get_relname_relid(indexname,
+ rel->rd_rel->relnamespace);
+ if (!OidIsValid(indexOid))
+ ereport(ERROR,
+ (errcode(ERRCODE_UNDEFINED_OBJECT),
+ errmsg("index \"%s\" for table \"%s\" does not exist",
+ indexname, relation->relname)));
+ }
+
+ *indexOid_p = indexOid;
+
+ /* For non-partitioned tables, do what we came here to do. */
+ if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+ {
+ cluster_rel(rel, indexOid, params, cmd);
+ /* cluster_rel closes the relation, but keeps lock */
+
+ return NULL;
+ }
+
+ return rel;
+}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 129c97fdf28..ebee88e474c 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -15727,7 +15727,8 @@ ATExecClusterOn(Relation rel, const char *indexName, LOCKMODE lockmode)
indexName, RelationGetRelationName(rel))));
/* Check index is valid to cluster on */
- check_index_is_clusterable(rel, indexOid, lockmode);
+ check_index_is_clusterable(rel, indexOid, lockmode,
+ CLUSTER_COMMAND_CLUSTER);
/* And do the work */
mark_index_clustered(rel, indexOid, false);
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index f0a7b87808d..61018482089 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -2263,7 +2263,8 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
cluster_params.options |= CLUOPT_VERBOSE;
/* VACUUM FULL is now a variant of CLUSTER; see cluster.c */
- cluster_rel(rel, InvalidOid, &cluster_params);
+ cluster_rel(rel, InvalidOid, &cluster_params,
+ CLUSTER_COMMAND_VACUUM);
/* cluster_rel closes the relation, but keeps lock */
rel = NULL;
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 271ae26cbaf..d53808a406e 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -298,7 +298,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
GrantStmt GrantRoleStmt ImportForeignSchemaStmt IndexStmt InsertStmt
ListenStmt LoadStmt LockStmt MergeStmt NotifyStmt ExplainableStmt PreparableStmt
CreateFunctionStmt AlterFunctionStmt ReindexStmt RemoveAggrStmt
- RemoveFuncStmt RemoveOperStmt RenameStmt ReturnStmt RevokeStmt RevokeRoleStmt
+ RemoveFuncStmt RemoveOperStmt RenameStmt RepackStmt ReturnStmt RevokeStmt RevokeRoleStmt
RuleActionStmt RuleActionStmtOrEmpty RuleStmt
SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
UnlistenStmt UpdateStmt VacuumStmt
@@ -381,7 +381,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
%type <str> copy_file_name
access_method_clause attr_name
table_access_method_clause name cursor_name file_name
- cluster_index_specification
+ cluster_index_specification repack_index_specification
%type <list> func_name handler_name qual_Op qual_all_Op subquery_Op
opt_inline_handler opt_validator validator_clause
@@ -764,7 +764,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
QUOTE QUOTES
RANGE READ REAL REASSIGN RECURSIVE REF_P REFERENCES REFERENCING
- REFRESH REINDEX RELATIVE_P RELEASE RENAME REPEATABLE REPLACE REPLICA
+ REFRESH REINDEX RELATIVE_P RELEASE RENAME REPACK REPEATABLE REPLACE REPLICA
RESET RESTART RESTRICT RETURN RETURNING RETURNS REVOKE RIGHT ROLE ROLLBACK ROLLUP
ROUTINE ROUTINES ROW ROWS RULE
@@ -1100,6 +1100,7 @@ stmt:
| RemoveFuncStmt
| RemoveOperStmt
| RenameStmt
+ | RepackStmt
| RevokeStmt
| RevokeRoleStmt
| RuleStmt
@@ -11897,6 +11898,60 @@ cluster_index_specification:
| /*EMPTY*/ { $$ = NULL; }
;
+/*****************************************************************************
+ *
+ * QUERY:
+ * REPACK [ (options) ] [ <qualified_name> [ USING INDEX <index_name> ] ]
+ *
+ *****************************************************************************/
+
+RepackStmt:
+ REPACK qualified_name repack_index_specification
+ {
+ RepackStmt *n = makeNode(RepackStmt);
+
+ n->relation = $2;
+ n->indexname = $3;
+ n->params = NIL;
+ $$ = (Node *) n;
+ }
+
+ | REPACK '(' utility_option_list ')' qualified_name repack_index_specification
+ {
+ RepackStmt *n = makeNode(RepackStmt);
+
+ n->relation = $5;
+ n->indexname = $6;
+ n->params = $3;
+ $$ = (Node *) n;
+ }
+
+ | REPACK
+ {
+ RepackStmt *n = makeNode(RepackStmt);
+
+ n->relation = NULL;
+ n->indexname = NULL;
+ n->params = NIL;
+ $$ = (Node *) n;
+ }
+
+ | REPACK '(' utility_option_list ')'
+ {
+ RepackStmt *n = makeNode(RepackStmt);
+
+ n->relation = NULL;
+ n->indexname = NULL;
+ n->params = $3;
+ $$ = (Node *) n;
+ }
+ ;
+
+repack_index_specification:
+ USING INDEX name { $$ = $3; }
+ | /*EMPTY*/ { $$ = NULL; }
+ ;
+
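The productions above admit four statement shapes. As a sketch, all of the following should parse once the patch is in place (the table and index names here are hypothetical):

```sql
REPACK;                                          -- all eligible tables
REPACK (VERBOSE);                                -- all tables, with options
REPACK mytab;                                    -- one table, physical order
REPACK (VERBOSE) mytab USING INDEX mytab_pkey;   -- one table, index order
```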
/*****************************************************************************
*
@@ -17937,6 +17992,7 @@ unreserved_keyword:
| RELATIVE_P
| RELEASE
| RENAME
+ | REPACK
| REPEATABLE
| REPLACE
| REPLICA
@@ -18568,6 +18624,7 @@ bare_label_keyword:
| RELATIVE_P
| RELEASE
| RENAME
+ | REPACK
| REPEATABLE
| REPLACE
| REPLICA
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 25fe3d58016..bf3ba3c2ae7 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -280,6 +280,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_ClusterStmt:
case T_ReindexStmt:
case T_VacuumStmt:
+ case T_RepackStmt:
{
/*
* These commands write WAL, so they're not strictly
@@ -862,6 +863,10 @@ standard_ProcessUtility(PlannedStmt *pstmt,
ExecVacuum(pstate, (VacuumStmt *) parsetree, isTopLevel);
break;
+ case T_RepackStmt:
+ repack(pstate, (RepackStmt *) parsetree, isTopLevel);
+ break;
+
case T_ExplainStmt:
ExplainQuery(pstate, (ExplainStmt *) parsetree, params, dest);
break;
@@ -2869,6 +2874,10 @@ CreateCommandTag(Node *parsetree)
tag = CMDTAG_ANALYZE;
break;
+ case T_RepackStmt:
+ tag = CMDTAG_REPACK;
+ break;
+
case T_ExplainStmt:
tag = CMDTAG_EXPLAIN;
break;
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 662ce46cbc2..ed24efc1a65 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -268,6 +268,8 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
cmdtype = PROGRESS_COMMAND_ANALYZE;
else if (pg_strcasecmp(cmd, "CLUSTER") == 0)
cmdtype = PROGRESS_COMMAND_CLUSTER;
+ else if (pg_strcasecmp(cmd, "REPACK") == 0)
+ cmdtype = PROGRESS_COMMAND_REPACK;
else if (pg_strcasecmp(cmd, "CREATE INDEX") == 0)
cmdtype = PROGRESS_COMMAND_CREATE_INDEX;
else if (pg_strcasecmp(cmd, "BASEBACKUP") == 0)
diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index 9a4d993e2bc..6886dfbb824 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -1223,7 +1223,7 @@ static const char *const sql_commands[] = {
"DELETE FROM", "DISCARD", "DO", "DROP", "END", "EXECUTE", "EXPLAIN",
"FETCH", "GRANT", "IMPORT FOREIGN SCHEMA", "INSERT INTO", "LISTEN", "LOAD", "LOCK",
"MERGE INTO", "MOVE", "NOTIFY", "PREPARE",
- "REASSIGN", "REFRESH MATERIALIZED VIEW", "REINDEX", "RELEASE",
+ "REASSIGN", "REFRESH MATERIALIZED VIEW", "REINDEX", "RELEASE", "REPACK",
"RESET", "REVOKE", "ROLLBACK",
"SAVEPOINT", "SECURITY LABEL", "SELECT", "SET", "SHOW", "START",
"TABLE", "TRUNCATE", "UNLISTEN", "UPDATE", "VACUUM", "VALUES", "WITH",
@@ -4910,6 +4910,35 @@ match_previous_words(int pattern_id,
COMPLETE_WITH_QUERY(Query_for_list_of_tablespaces);
}
+/* REPACK */
+ else if (Matches("REPACK"))
+ COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_clusterables);
+ else if (Matches("REPACK", "(*)"))
+ COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_clusterables);
+ /* If we have REPACK <sth>, then add "USING INDEX" */
+ else if (Matches("REPACK", MatchAnyExcept("(")))
+ COMPLETE_WITH("USING INDEX");
+ /* If we have REPACK (*) <sth>, then add "USING INDEX" */
+ else if (Matches("REPACK", "(*)", MatchAny))
+ COMPLETE_WITH("USING INDEX");
+ /* If we have REPACK <sth> USING, then add the index as well */
+ else if (Matches("REPACK", MatchAny, "USING", "INDEX"))
+ {
+ set_completion_reference(prev3_wd);
+ COMPLETE_WITH_SCHEMA_QUERY(Query_for_index_of_table);
+ }
+ else if (HeadMatches("REPACK", "(*") &&
+ !HeadMatches("REPACK", "(*)"))
+ {
+ /*
+ * This fires if we're in an unfinished parenthesized option list.
+ * get_previous_words treats a completed parenthesized option list as
+ * one word, so the above test is correct.
+ */
+ if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
+ COMPLETE_WITH("VERBOSE");
+ }
+
/* SECURITY LABEL */
else if (Matches("SECURITY"))
COMPLETE_WITH("LABEL");
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 60088a64cbb..c2976905e4d 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -31,10 +31,27 @@ typedef struct ClusterParams
bits32 options; /* bitmask of CLUOPT_* */
} ClusterParams;
+/*
+ * cluster.c currently implements three nearly identical commands: CLUSTER,
+ * VACUUM FULL and REPACK. Where needed, use this enumeration to distinguish
+ * which of these commands is being executed.
+ *
+ * Remove this enum when the (now deprecated) CLUSTER and VACUUM FULL
+ * commands are removed.
+ */
+typedef enum ClusterCommand
+{
+ CLUSTER_COMMAND_CLUSTER,
+ CLUSTER_COMMAND_REPACK,
+ CLUSTER_COMMAND_VACUUM
+} ClusterCommand;
+
extern void cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel);
-extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params);
+extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
+ ClusterCommand cmd);
extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid,
- LOCKMODE lockmode);
+ LOCKMODE lockmode,
+ ClusterCommand cmd);
extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
extern Oid make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
@@ -48,4 +65,5 @@ extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
MultiXactId cutoffMulti,
char newrelpersistence);
+extern void repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel);
#endif /* CLUSTER_H */
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 7c736e7b03b..7644267e14f 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -56,24 +56,48 @@
#define PROGRESS_ANALYZE_PHASE_COMPUTE_EXT_STATS 4
#define PROGRESS_ANALYZE_PHASE_FINALIZE_ANALYZE 5
-/* Progress parameters for cluster */
-#define PROGRESS_CLUSTER_COMMAND 0
-#define PROGRESS_CLUSTER_PHASE 1
-#define PROGRESS_CLUSTER_INDEX_RELID 2
-#define PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED 3
-#define PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN 4
-#define PROGRESS_CLUSTER_TOTAL_HEAP_BLKS 5
-#define PROGRESS_CLUSTER_HEAP_BLKS_SCANNED 6
-#define PROGRESS_CLUSTER_INDEX_REBUILD_COUNT 7
+/*
+ * Progress parameters for REPACK.
+ *
+ * Note: Since REPACK shares some code with CLUSTER, (some of) these values
+ * are also used by CLUSTER. (CLUSTER is now deprecated, so it makes no sense
+ * to introduce a separate set of constants.)
+ */
+#define PROGRESS_REPACK_COMMAND 0
+#define PROGRESS_REPACK_PHASE 1
+#define PROGRESS_REPACK_INDEX_RELID 2
+#define PROGRESS_REPACK_HEAP_TUPLES_SCANNED 3
+#define PROGRESS_REPACK_HEAP_TUPLES_WRITTEN 4
+#define PROGRESS_REPACK_TOTAL_HEAP_BLKS 5
+#define PROGRESS_REPACK_HEAP_BLKS_SCANNED 6
+#define PROGRESS_REPACK_INDEX_REBUILD_COUNT 7
-/* Phases of cluster (as advertised via PROGRESS_CLUSTER_PHASE) */
-#define PROGRESS_CLUSTER_PHASE_SEQ_SCAN_HEAP 1
-#define PROGRESS_CLUSTER_PHASE_INDEX_SCAN_HEAP 2
-#define PROGRESS_CLUSTER_PHASE_SORT_TUPLES 3
-#define PROGRESS_CLUSTER_PHASE_WRITE_NEW_HEAP 4
-#define PROGRESS_CLUSTER_PHASE_SWAP_REL_FILES 5
-#define PROGRESS_CLUSTER_PHASE_REBUILD_INDEX 6
-#define PROGRESS_CLUSTER_PHASE_FINAL_CLEANUP 7
+/*
+ * Phases of repack (as advertised via PROGRESS_REPACK_PHASE).
+ *
+ * Note: Since REPACK shares some code with CLUSTER, (some of) these values
+ * are also used by CLUSTER. (CLUSTER is now deprecated, so it makes no sense
+ * to introduce a separate set of constants.)
+ */
+#define PROGRESS_REPACK_PHASE_SEQ_SCAN_HEAP 1
+#define PROGRESS_REPACK_PHASE_INDEX_SCAN_HEAP 2
+#define PROGRESS_REPACK_PHASE_SORT_TUPLES 3
+#define PROGRESS_REPACK_PHASE_WRITE_NEW_HEAP 4
+#define PROGRESS_REPACK_PHASE_SWAP_REL_FILES 5
+#define PROGRESS_REPACK_PHASE_REBUILD_INDEX 6
+#define PROGRESS_REPACK_PHASE_FINAL_CLEANUP 7
+
+/* Commands of PROGRESS_REPACK */
+#define PROGRESS_REPACK_COMMAND_REPACK 1
+
+/*
+ * Progress parameters for cluster.
+ *
+ * Although we need to report REPACK and CLUSTER in separate views, the
+ * parameters and phases of CLUSTER are a subset of those of REPACK. Therefore
+ * we just use the appropriate values defined for REPACK above instead of
+ * defining a separate set of constants here.
+ */
/* Commands of PROGRESS_CLUSTER */
#define PROGRESS_CLUSTER_COMMAND_CLUSTER 1
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 23c9e3c5abf..d32a4d9f2db 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3916,6 +3916,19 @@ typedef struct ClusterStmt
List *params; /* list of DefElem nodes */
} ClusterStmt;
+/* ----------------------
+ * Repack Statement
+ * ----------------------
+ */
+typedef struct RepackStmt
+{
+ NodeTag type;
+ RangeVar *relation; /* relation being repacked */
+ char *indexname; /* order tuples by this index */
+ List *params; /* list of DefElem nodes */
+} RepackStmt;
+
+
/* ----------------------
* Vacuum and Analyze Statements
*
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 40cf090ce61..0932d6fce5b 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -373,6 +373,7 @@ PG_KEYWORD("reindex", REINDEX, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("relative", RELATIVE_P, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("release", RELEASE, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("rename", RENAME, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("repack", REPACK, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("repeatable", REPEATABLE, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("replace", REPLACE, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("replica", REPLICA, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..cceb312f2b3 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -196,6 +196,7 @@ PG_CMDTAG(CMDTAG_REASSIGN_OWNED, "REASSIGN OWNED", false, false, false)
PG_CMDTAG(CMDTAG_REFRESH_MATERIALIZED_VIEW, "REFRESH MATERIALIZED VIEW", true, false, false)
PG_CMDTAG(CMDTAG_REINDEX, "REINDEX", true, false, false)
PG_CMDTAG(CMDTAG_RELEASE, "RELEASE", false, false, false)
+PG_CMDTAG(CMDTAG_REPACK, "REPACK", false, false, false)
PG_CMDTAG(CMDTAG_RESET, "RESET", false, false, false)
PG_CMDTAG(CMDTAG_REVOKE, "REVOKE", true, false, false)
PG_CMDTAG(CMDTAG_REVOKE_ROLE, "REVOKE ROLE", false, false, false)
diff --git a/src/include/utils/backend_progress.h b/src/include/utils/backend_progress.h
index dda813ab407..da3d14bb97b 100644
--- a/src/include/utils/backend_progress.h
+++ b/src/include/utils/backend_progress.h
@@ -25,6 +25,7 @@ typedef enum ProgressCommandType
PROGRESS_COMMAND_VACUUM,
PROGRESS_COMMAND_ANALYZE,
PROGRESS_COMMAND_CLUSTER,
+ PROGRESS_COMMAND_REPACK,
PROGRESS_COMMAND_CREATE_INDEX,
PROGRESS_COMMAND_BASEBACKUP,
PROGRESS_COMMAND_COPY,
diff --git a/src/test/regress/expected/cluster.out b/src/test/regress/expected/cluster.out
index 4d40a6809ab..ed7df29b8e5 100644
--- a/src/test/regress/expected/cluster.out
+++ b/src/test/regress/expected/cluster.out
@@ -254,6 +254,120 @@ ORDER BY 1;
clstr_tst_pkey
(3 rows)
+-- REPACK handles individual tables identically to CLUSTER, but it's worth
+-- checking if it handles table hierarchies identically as well.
+REPACK clstr_tst USING INDEX clstr_tst_c;
+-- Verify that inheritance link still works
+INSERT INTO clstr_tst_inh VALUES (0, 100, 'in child table 2');
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst;
+ a | b | c | substring | length
+----+-----+------------------+--------------------------------+--------
+ 10 | 14 | catorce | |
+ 18 | 5 | cinco | |
+ 9 | 4 | cuatro | |
+ 26 | 19 | diecinueve | |
+ 12 | 18 | dieciocho | |
+ 30 | 16 | dieciseis | |
+ 24 | 17 | diecisiete | |
+ 2 | 10 | diez | |
+ 23 | 12 | doce | |
+ 11 | 2 | dos | |
+ 25 | 9 | nueve | |
+ 31 | 8 | ocho | |
+ 1 | 11 | once | |
+ 28 | 15 | quince | |
+ 32 | 6 | seis | xyzzyxyzzyxyzzyxyzzyxyzzyxyzzy | 500000
+ 29 | 7 | siete | |
+ 15 | 13 | trece | |
+ 22 | 30 | treinta | |
+ 17 | 32 | treinta y dos | |
+ 3 | 31 | treinta y uno | |
+ 5 | 3 | tres | |
+ 20 | 1 | uno | |
+ 6 | 20 | veinte | |
+ 14 | 25 | veinticinco | |
+ 21 | 24 | veinticuatro | |
+ 4 | 22 | veintidos | |
+ 19 | 29 | veintinueve | |
+ 16 | 28 | veintiocho | |
+ 27 | 26 | veintiseis | |
+ 13 | 27 | veintisiete | |
+ 7 | 23 | veintitres | |
+ 8 | 21 | veintiuno | |
+ 0 | 100 | in child table | |
+ 0 | 100 | in child table 2 | |
+(34 rows)
+
+-- Verify that foreign key link still works
+INSERT INTO clstr_tst (b, c) VALUES (1111, 'this should fail');
+ERROR: insert or update on table "clstr_tst" violates foreign key constraint "clstr_tst_con"
+DETAIL: Key (b)=(1111) is not present in table "clstr_tst_s".
+SELECT conname FROM pg_constraint WHERE conrelid = 'clstr_tst'::regclass
+ORDER BY 1;
+ conname
+----------------------
+ clstr_tst_a_not_null
+ clstr_tst_con
+ clstr_tst_pkey
+(3 rows)
+
+-- Yet another code path: REPACK w/o index.
+REPACK clstr_tst;
+-- Verify that inheritance link still works
+INSERT INTO clstr_tst_inh VALUES (0, 100, 'in child table 3');
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst;
+ a | b | c | substring | length
+----+-----+------------------+--------------------------------+--------
+ 10 | 14 | catorce | |
+ 18 | 5 | cinco | |
+ 9 | 4 | cuatro | |
+ 26 | 19 | diecinueve | |
+ 12 | 18 | dieciocho | |
+ 30 | 16 | dieciseis | |
+ 24 | 17 | diecisiete | |
+ 2 | 10 | diez | |
+ 23 | 12 | doce | |
+ 11 | 2 | dos | |
+ 25 | 9 | nueve | |
+ 31 | 8 | ocho | |
+ 1 | 11 | once | |
+ 28 | 15 | quince | |
+ 32 | 6 | seis | xyzzyxyzzyxyzzyxyzzyxyzzyxyzzy | 500000
+ 29 | 7 | siete | |
+ 15 | 13 | trece | |
+ 22 | 30 | treinta | |
+ 17 | 32 | treinta y dos | |
+ 3 | 31 | treinta y uno | |
+ 5 | 3 | tres | |
+ 20 | 1 | uno | |
+ 6 | 20 | veinte | |
+ 14 | 25 | veinticinco | |
+ 21 | 24 | veinticuatro | |
+ 4 | 22 | veintidos | |
+ 19 | 29 | veintinueve | |
+ 16 | 28 | veintiocho | |
+ 27 | 26 | veintiseis | |
+ 13 | 27 | veintisiete | |
+ 7 | 23 | veintitres | |
+ 8 | 21 | veintiuno | |
+ 0 | 100 | in child table | |
+ 0 | 100 | in child table 2 | |
+ 0 | 100 | in child table 3 | |
+(35 rows)
+
+-- Verify that foreign key link still works
+INSERT INTO clstr_tst (b, c) VALUES (1111, 'this should fail');
+ERROR: insert or update on table "clstr_tst" violates foreign key constraint "clstr_tst_con"
+DETAIL: Key (b)=(1111) is not present in table "clstr_tst_s".
+SELECT conname FROM pg_constraint WHERE conrelid = 'clstr_tst'::regclass
+ORDER BY 1;
+ conname
+----------------------
+ clstr_tst_a_not_null
+ clstr_tst_con
+ clstr_tst_pkey
+(3 rows)
+
SELECT relname, relkind,
EXISTS(SELECT 1 FROM pg_class WHERE oid = c.reltoastrelid) AS hastoast
FROM pg_class c WHERE relname LIKE 'clstr_tst%' ORDER BY relname;
@@ -381,6 +495,35 @@ SELECT * FROM clstr_1;
2
(2 rows)
+-- REPACK w/o argument performs no ordering, so we can only check which tables
+-- got a new relfilenode.
+RESET SESSION AUTHORIZATION;
+CREATE TEMP TABLE relnodes_old AS
+(SELECT relname, relfilenode
+FROM pg_class
+WHERE relname IN ('clstr_1', 'clstr_2', 'clstr_3'));
+SET SESSION AUTHORIZATION regress_clstr_user;
+SET client_min_messages = ERROR; -- order of "skipping" warnings may vary
+REPACK;
+RESET client_min_messages;
+RESET SESSION AUTHORIZATION;
+CREATE TEMP TABLE relnodes_new AS
+(SELECT relname, relfilenode
+FROM pg_class
+WHERE relname IN ('clstr_1', 'clstr_2', 'clstr_3'));
+-- Do the actual comparison. Unlike CLUSTER, clstr_3 should have been
+-- processed because REPACK does not require a clustering index.
+SELECT o.relname FROM relnodes_old o
+JOIN relnodes_new n ON o.relname = n.relname
+WHERE o.relfilenode <> n.relfilenode
+ORDER BY o.relname;
+ relname
+---------
+ clstr_1
+ clstr_3
+(2 rows)
+
+SET SESSION AUTHORIZATION regress_clstr_user;
-- Test MVCC-safety of cluster. There isn't much we can do to verify the
-- results with a single backend...
CREATE TABLE clustertest (key int PRIMARY KEY);
@@ -495,6 +638,43 @@ ALTER TABLE clstrpart SET WITHOUT CLUSTER;
ERROR: cannot mark index clustered in partitioned table
ALTER TABLE clstrpart CLUSTER ON clstrpart_idx;
ERROR: cannot mark index clustered in partitioned table
+-- Check that REPACK sets new relfilenodes: it should process exactly the same
+-- tables as CLUSTER did.
+DROP TABLE old_cluster_info;
+DROP TABLE new_cluster_info;
+CREATE TEMP TABLE old_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+REPACK clstrpart USING INDEX clstrpart_idx;
+CREATE TEMP TABLE new_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM old_cluster_info AS old JOIN new_cluster_info AS new USING (relname) ORDER BY relname COLLATE "C";
+ relname | level | relkind | ?column?
+-------------+-------+---------+----------
+ clstrpart | 0 | p | t
+ clstrpart1 | 1 | p | t
+ clstrpart11 | 2 | r | f
+ clstrpart12 | 2 | p | t
+ clstrpart2 | 1 | r | f
+ clstrpart3 | 1 | p | t
+ clstrpart33 | 2 | r | f
+(7 rows)
+
+-- And finally the same for REPACK w/o index.
+DROP TABLE old_cluster_info;
+DROP TABLE new_cluster_info;
+CREATE TEMP TABLE old_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+REPACK clstrpart;
+CREATE TEMP TABLE new_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM old_cluster_info AS old JOIN new_cluster_info AS new USING (relname) ORDER BY relname COLLATE "C";
+ relname | level | relkind | ?column?
+-------------+-------+---------+----------
+ clstrpart | 0 | p | t
+ clstrpart1 | 1 | p | t
+ clstrpart11 | 2 | r | f
+ clstrpart12 | 2 | p | t
+ clstrpart2 | 1 | r | f
+ clstrpart3 | 1 | p | t
+ clstrpart33 | 2 | r | f
+(7 rows)
+
DROP TABLE clstrpart;
-- Ownership of partitions is checked
CREATE TABLE ptnowner(i int unique) PARTITION BY LIST (i);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 62f69ac20b2..50d87af2fdf 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2041,6 +2041,33 @@ pg_stat_progress_create_index| SELECT s.pid,
s.param15 AS partitions_done
FROM (pg_stat_get_progress_info('CREATE INDEX'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
+pg_stat_progress_repack| SELECT s.pid,
+ s.datid,
+ d.datname,
+ s.relid,
+ CASE s.param1
+ WHEN 1 THEN 'REPACK'::text
+ ELSE NULL::text
+ END AS command,
+ CASE s.param2
+ WHEN 0 THEN 'initializing'::text
+ WHEN 1 THEN 'seq scanning heap'::text
+ WHEN 2 THEN 'index scanning heap'::text
+ WHEN 3 THEN 'sorting tuples'::text
+ WHEN 4 THEN 'writing new heap'::text
+ WHEN 5 THEN 'swapping relation files'::text
+ WHEN 6 THEN 'rebuilding index'::text
+ WHEN 7 THEN 'performing final cleanup'::text
+ ELSE NULL::text
+ END AS phase,
+ (s.param3)::oid AS repack_index_relid,
+ s.param4 AS heap_tuples_scanned,
+ s.param5 AS heap_tuples_written,
+ s.param6 AS heap_blks_total,
+ s.param7 AS heap_blks_scanned,
+ s.param8 AS index_rebuild_count
+ FROM (pg_stat_get_progress_info('REPACK'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
+ LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_progress_vacuum| SELECT s.pid,
s.datid,
d.datname,
diff --git a/src/test/regress/sql/cluster.sql b/src/test/regress/sql/cluster.sql
index b7115f86104..e348e26fbfa 100644
--- a/src/test/regress/sql/cluster.sql
+++ b/src/test/regress/sql/cluster.sql
@@ -76,6 +76,33 @@ INSERT INTO clstr_tst (b, c) VALUES (1111, 'this should fail');
SELECT conname FROM pg_constraint WHERE conrelid = 'clstr_tst'::regclass
ORDER BY 1;
+-- REPACK handles individual tables identically to CLUSTER, but it's worth
+-- checking if it handles table hierarchies identically as well.
+REPACK clstr_tst USING INDEX clstr_tst_c;
+
+-- Verify that inheritance link still works
+INSERT INTO clstr_tst_inh VALUES (0, 100, 'in child table 2');
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst;
+
+-- Verify that foreign key link still works
+INSERT INTO clstr_tst (b, c) VALUES (1111, 'this should fail');
+
+SELECT conname FROM pg_constraint WHERE conrelid = 'clstr_tst'::regclass
+ORDER BY 1;
+
+-- Yet another code path: REPACK w/o index.
+REPACK clstr_tst;
+
+-- Verify that inheritance link still works
+INSERT INTO clstr_tst_inh VALUES (0, 100, 'in child table 3');
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst;
+
+-- Verify that foreign key link still works
+INSERT INTO clstr_tst (b, c) VALUES (1111, 'this should fail');
+
+SELECT conname FROM pg_constraint WHERE conrelid = 'clstr_tst'::regclass
+ORDER BY 1;
+
SELECT relname, relkind,
EXISTS(SELECT 1 FROM pg_class WHERE oid = c.reltoastrelid) AS hastoast
@@ -159,6 +186,34 @@ INSERT INTO clstr_1 VALUES (1);
CLUSTER clstr_1;
SELECT * FROM clstr_1;
+-- REPACK w/o argument performs no ordering, so we can only check which tables
+-- got a new relfilenode.
+RESET SESSION AUTHORIZATION;
+CREATE TEMP TABLE relnodes_old AS
+(SELECT relname, relfilenode
+FROM pg_class
+WHERE relname IN ('clstr_1', 'clstr_2', 'clstr_3'));
+
+SET SESSION AUTHORIZATION regress_clstr_user;
+SET client_min_messages = ERROR; -- order of "skipping" warnings may vary
+REPACK;
+RESET client_min_messages;
+
+RESET SESSION AUTHORIZATION;
+CREATE TEMP TABLE relnodes_new AS
+(SELECT relname, relfilenode
+FROM pg_class
+WHERE relname IN ('clstr_1', 'clstr_2', 'clstr_3'));
+
+-- Do the actual comparison. Unlike CLUSTER, clstr_3 should have been
+-- processed because REPACK does not require a clustering index.
+SELECT o.relname FROM relnodes_old o
+JOIN relnodes_new n ON o.relname = n.relname
+WHERE o.relfilenode <> n.relfilenode
+ORDER BY o.relname;
+
+SET SESSION AUTHORIZATION regress_clstr_user;
+
-- Test MVCC-safety of cluster. There isn't much we can do to verify the
-- results with a single backend...
@@ -229,6 +284,24 @@ SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM o
CLUSTER clstrpart;
ALTER TABLE clstrpart SET WITHOUT CLUSTER;
ALTER TABLE clstrpart CLUSTER ON clstrpart_idx;
+
+-- Check that REPACK sets new relfilenodes: it should process exactly the same
+-- tables as CLUSTER did.
+DROP TABLE old_cluster_info;
+DROP TABLE new_cluster_info;
+CREATE TEMP TABLE old_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+REPACK clstrpart USING INDEX clstrpart_idx;
+CREATE TEMP TABLE new_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM old_cluster_info AS old JOIN new_cluster_info AS new USING (relname) ORDER BY relname COLLATE "C";
+
+-- And finally the same for REPACK w/o index.
+DROP TABLE old_cluster_info;
+DROP TABLE new_cluster_info;
+CREATE TEMP TABLE old_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+REPACK clstrpart;
+CREATE TEMP TABLE new_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM old_cluster_info AS old JOIN new_cluster_info AS new USING (relname) ORDER BY relname COLLATE "C";
+
DROP TABLE clstrpart;
-- Ownership of partitions is checked
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index bfa276d2d35..7e51d48be44 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -415,6 +415,7 @@ ClientCertName
ClientConnectionInfo
ClientData
ClientSocket
+ClusterCommand
ClonePtrType
ClosePortalStmt
ClosePtrType
@@ -2499,6 +2500,7 @@ ReorderBufferTupleCidKey
ReorderBufferUpdateProgressTxnCB
ReorderTuple
RepOriginId
+RepackStmt
ReparameterizeForeignPathByChild_function
ReplaceVarsFromTargetList_context
ReplaceVarsNoMatchOption
--
2.39.5
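For reviewers skimming the grammar, tab-completion, and progress-view changes in this patch, a short usage sketch may help. This is hypothetical until the series is applied; the statement forms follow the grammar added above, and the view name follows the pg_stat_progress_repack definition in rules.out:

```sql
-- Rewrite a table without ordering (the VACUUM FULL-like form):
REPACK clstr_tst;

-- Rewrite and order tuples by an index (the CLUSTER-like form):
REPACK clstr_tst USING INDEX clstr_tst_c;

-- With a parenthesized option list, as the psql completion suggests:
REPACK (VERBOSE) clstr_tst;

-- Monitor from another session via the new progress view:
SELECT pid, relid::regclass, phase, heap_tuples_scanned
FROM pg_stat_progress_repack;
```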
Attachment: v09-0002-Move-progress-related-fields-from-PgBackendStatu.patch (text/x-diff; charset=utf-8)
From 748d8fca4c396b6e96d500ecdf4a3048bce97d60 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=C3=81lvaro=20Herrera?= <alvherre@alvh.no-ip.org>
Date: Thu, 20 Mar 2025 15:46:14 +0100
Subject: [PATCH v09 2/9] Move progress-related fields from PgBackendStatus to
PgBackendProgress
---
src/backend/access/heap/vacuumlazy.c | 4 +--
src/backend/commands/analyze.c | 2 +-
src/backend/utils/activity/backend_progress.c | 33 ++++++++++---------
src/backend/utils/activity/backend_status.c | 9 ++---
src/backend/utils/adt/pgstatfuncs.c | 6 ++--
src/include/utils/backend_progress.h | 15 ++++++++-
src/include/utils/backend_status.h | 14 ++------
7 files changed, 44 insertions(+), 39 deletions(-)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 2cbcf5e5db2..76c8ec15dde 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -1107,10 +1107,10 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* We bypass the changecount mechanism because this value is
* only updated by the calling process. We also rely on the
* above call to pgstat_progress_end_command() to not clear
- * the st_progress_param array.
+ * the st_progress.p_param array.
*/
appendStringInfo(&buf, _("delay time: %.3f ms\n"),
- (double) MyBEEntry->st_progress_param[PROGRESS_VACUUM_DELAY_TIME] / 1000000.0);
+ (double) MyBEEntry->st_progress.p_param[PROGRESS_VACUUM_DELAY_TIME] / 1000000.0);
}
if (track_io_timing)
{
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 2b5fbdcbd82..8d88b665f18 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -815,7 +815,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
* only updated by the calling process.
*/
appendStringInfo(&buf, _("delay time: %.3f ms\n"),
- (double) MyBEEntry->st_progress_param[PROGRESS_ANALYZE_DELAY_TIME] / 1000000.0);
+ (double) MyBEEntry->st_progress.p_param[PROGRESS_ANALYZE_DELAY_TIME] / 1000000.0);
}
if (track_io_timing)
{
diff --git a/src/backend/utils/activity/backend_progress.c b/src/backend/utils/activity/backend_progress.c
index 99a8c73bf04..17b5d87446b 100644
--- a/src/backend/utils/activity/backend_progress.c
+++ b/src/backend/utils/activity/backend_progress.c
@@ -19,8 +19,8 @@
/*-----------
* pgstat_progress_start_command() -
*
- * Set st_progress_command (and st_progress_command_target) in own backend
- * entry. Also, zero-initialize st_progress_param array.
+ * Set st_progress.p_command (and st_progress.p_command_target) in own backend
+ * entry. Also, zero-initialize st_progress.p_param array.
*-----------
*/
void
@@ -32,16 +32,17 @@ pgstat_progress_start_command(ProgressCommandType cmdtype, Oid relid)
return;
PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
- beentry->st_progress_command = cmdtype;
- beentry->st_progress_command_target = relid;
- MemSet(&beentry->st_progress_param, 0, sizeof(beentry->st_progress_param));
+ beentry->st_progress.p_command = cmdtype;
+ beentry->st_progress.p_command_target = relid;
+ MemSet(&beentry->st_progress.p_param, 0,
+ sizeof(beentry->st_progress.p_param));
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
/*-----------
* pgstat_progress_update_param() -
*
- * Update index'th member in st_progress_param[] of own backend entry.
+ * Update index'th member in st_progress.p_param[] of own backend entry.
*-----------
*/
void
@@ -55,14 +56,14 @@ pgstat_progress_update_param(int index, int64 val)
return;
PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
- beentry->st_progress_param[index] = val;
+ beentry->st_progress.p_param[index] = val;
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
/*-----------
* pgstat_progress_incr_param() -
*
- * Increment index'th member in st_progress_param[] of own backend entry.
+ * Increment index'th member in st_progress.p_param[] of own backend entry.
*-----------
*/
void
@@ -76,7 +77,7 @@ pgstat_progress_incr_param(int index, int64 incr)
return;
PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
- beentry->st_progress_param[index] += incr;
+ beentry->st_progress.p_param[index] += incr;
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
@@ -113,7 +114,7 @@ pgstat_progress_parallel_incr_param(int index, int64 incr)
/*-----------
* pgstat_progress_update_multi_param() -
*
- * Update multiple members in st_progress_param[] of own backend entry.
+ * Update multiple members in st_progress.p_param[] of own backend entry.
* This is atomic; readers won't see intermediate states.
*-----------
*/
@@ -133,7 +134,7 @@ pgstat_progress_update_multi_param(int nparam, const int *index,
{
Assert(index[i] >= 0 && index[i] < PGSTAT_NUM_PROGRESS_PARAM);
- beentry->st_progress_param[index[i]] = val[i];
+ beentry->st_progress.p_param[index[i]] = val[i];
}
PGSTAT_END_WRITE_ACTIVITY(beentry);
@@ -142,8 +143,8 @@ pgstat_progress_update_multi_param(int nparam, const int *index,
/*-----------
* pgstat_progress_end_command() -
*
- * Reset st_progress_command (and st_progress_command_target) in own backend
- * entry. This signals the end of the command.
+ * Reset st_progress.p_command (and st_progress.p_command_target) in own
+ * backend entry. This signals the end of the command.
*-----------
*/
void
@@ -154,11 +155,11 @@ pgstat_progress_end_command(void)
if (!beentry || !pgstat_track_activities)
return;
- if (beentry->st_progress_command == PROGRESS_COMMAND_INVALID)
+ if (beentry->st_progress.p_command == PROGRESS_COMMAND_INVALID)
return;
PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
- beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
- beentry->st_progress_command_target = InvalidOid;
+ beentry->st_progress.p_command = PROGRESS_COMMAND_INVALID;
+ beentry->st_progress.p_command_target = InvalidOid;
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 7681b4ba5a9..db2b4391969 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -318,13 +318,14 @@ pgstat_bestart_initial(void)
lbeentry.st_gss = false;
lbeentry.st_state = STATE_STARTING;
- lbeentry.st_progress_command = PROGRESS_COMMAND_INVALID;
- lbeentry.st_progress_command_target = InvalidOid;
lbeentry.st_query_id = UINT64CONST(0);
+ lbeentry.st_progress.p_command = PROGRESS_COMMAND_INVALID;
+ lbeentry.st_progress.p_command_target = InvalidOid;
+
/*
- * we don't zero st_progress_param here to save cycles; nobody should
- * examine it until st_progress_command has been set to something other
+ * we don't zero st_progress.p_param here to save cycles; nobody should
+ * examine it until st_progress.p_command has been set to something other
* than PROGRESS_COMMAND_INVALID
*/
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index ed24efc1a65..0f27780abae 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -299,7 +299,7 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
* Report values for only those backends which are running the given
* command.
*/
- if (beentry->st_progress_command != cmdtype)
+ if (beentry->st_progress.p_command != cmdtype)
continue;
/* Value available to all callers */
@@ -309,9 +309,9 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
/* show rest of the values including relid only to role members */
if (HAS_PGSTAT_PERMISSIONS(beentry->st_userid))
{
- values[2] = ObjectIdGetDatum(beentry->st_progress_command_target);
+ values[2] = ObjectIdGetDatum(beentry->st_progress.p_command_target);
for (i = 0; i < PGSTAT_NUM_PROGRESS_PARAM; i++)
- values[i + 3] = Int64GetDatum(beentry->st_progress_param[i]);
+ values[i + 3] = Int64GetDatum(beentry->st_progress.p_param[i]);
}
else
{
diff --git a/src/include/utils/backend_progress.h b/src/include/utils/backend_progress.h
index da3d14bb97b..10aaec9b15c 100644
--- a/src/include/utils/backend_progress.h
+++ b/src/include/utils/backend_progress.h
@@ -31,8 +31,21 @@ typedef enum ProgressCommandType
PROGRESS_COMMAND_COPY,
} ProgressCommandType;
-#define PGSTAT_NUM_PROGRESS_PARAM 20
+/*
+ * Any command which wishes can advertise that it is running by setting
+ * command, command_target, and param[]. command_target should be the OID of
+ * the relation which the command targets (we assume there's just one, as this
+ * is meant for utility commands), but the meaning of each element in the
+ * param array is command-specific.
+ */
+#define PGSTAT_NUM_PROGRESS_PARAM 20
+typedef struct PgBackendProgress
+{
+ ProgressCommandType p_command;
+ Oid p_command_target;
+ int64 p_param[PGSTAT_NUM_PROGRESS_PARAM];
+} PgBackendProgress;
extern void pgstat_progress_start_command(ProgressCommandType cmdtype,
Oid relid);
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 1c9b4fe14d0..8e024274d76 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -156,18 +156,8 @@ typedef struct PgBackendStatus
*/
char *st_activity_raw;
- /*
- * Command progress reporting. Any command which wishes can advertise
- * that it is running by setting st_progress_command,
- * st_progress_command_target, and st_progress_param[].
- * st_progress_command_target should be the OID of the relation which the
- * command targets (we assume there's just one, as this is meant for
- * utility commands), but the meaning of each element in the
- * st_progress_param array is command-specific.
- */
- ProgressCommandType st_progress_command;
- Oid st_progress_command_target;
- int64 st_progress_param[PGSTAT_NUM_PROGRESS_PARAM];
+ /* Command progress reporting. */
+ PgBackendProgress st_progress;
/* query identifier, optionally computed using post_parse_analyze_hook */
uint64 st_query_id;
--
2.39.5
Attachment: v09-0003-Move-conversion-of-a-historic-to-MVCC-snapshot-t.patch (text/x-diff; charset=utf-8)
From 633a6e42ad563bd0cc0dab0e60bfe7833871b68f Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Wed, 26 Feb 2025 09:17:20 +0100
Subject: [PATCH v09 3/9] Move conversion of a "historic" to MVCC snapshot to a
separate function.
The conversion is now handled by SnapBuildMVCCFromHistoric(). REPACK
CONCURRENTLY will also need it.
---
src/backend/replication/logical/snapbuild.c | 51 +++++++++++++++++----
src/backend/utils/time/snapmgr.c | 3 +-
src/include/replication/snapbuild.h | 1 +
src/include/utils/snapmgr.h | 1 +
4 files changed, 45 insertions(+), 11 deletions(-)
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index b64e53de017..2c336b47fdb 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -440,10 +440,7 @@ Snapshot
SnapBuildInitialSnapshot(SnapBuild *builder)
{
Snapshot snap;
- TransactionId xid;
TransactionId safeXid;
- TransactionId *newxip;
- int newxcnt = 0;
Assert(XactIsoLevel == XACT_REPEATABLE_READ);
Assert(builder->building_full_snapshot);
@@ -485,6 +482,31 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
MyProc->xmin = snap->xmin;
+ /* Convert the historic snapshot to MVCC snapshot. */
+ return SnapBuildMVCCFromHistoric(snap, true);
+}
+
+/*
+ * Turn a historic MVCC snapshot into an ordinary MVCC snapshot.
+ *
+ * Unlike a regular (non-historic) MVCC snapshot, the xip array of this
+ * snapshot contains not only running main transactions, but also their
+ * subtransactions. This difference has no impact on XidInMVCCSnapshot().
+ *
+ * Pass true for 'in_place' if you don't care about modifying the source
+ * snapshot. If you need a new instance, and one that was allocated as a
+ * single chunk of memory, pass false.
+ */
+Snapshot
+SnapBuildMVCCFromHistoric(Snapshot snapshot, bool in_place)
+{
+ TransactionId xid;
+ TransactionId *oldxip = snapshot->xip;
+ uint32 oldxcnt = snapshot->xcnt;
+ TransactionId *newxip;
+ int newxcnt = 0;
+ Snapshot result;
+
/* allocate in transaction context */
newxip = (TransactionId *)
palloc(sizeof(TransactionId) * GetMaxSnapshotXidCount());
@@ -495,7 +517,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
* classical snapshot by marking all non-committed transactions as
* in-progress. This can be expensive.
*/
- for (xid = snap->xmin; NormalTransactionIdPrecedes(xid, snap->xmax);)
+ for (xid = snapshot->xmin; NormalTransactionIdPrecedes(xid, snapshot->xmax);)
{
void *test;
@@ -503,7 +525,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
* Check whether transaction committed using the decoding snapshot
* meaning of ->xip.
*/
- test = bsearch(&xid, snap->xip, snap->xcnt,
+ test = bsearch(&xid, snapshot->xip, snapshot->xcnt,
sizeof(TransactionId), xidComparator);
if (test == NULL)
@@ -520,11 +542,22 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
}
/* adjust remaining snapshot fields as needed */
- snap->snapshot_type = SNAPSHOT_MVCC;
- snap->xcnt = newxcnt;
- snap->xip = newxip;
+ snapshot->xcnt = newxcnt;
+ snapshot->xip = newxip;
- return snap;
+ if (in_place)
+ result = snapshot;
+ else
+ {
+ result = CopySnapshot(snapshot);
+
+ /* Restore the original values so the source is intact. */
+ snapshot->xip = oldxip;
+ snapshot->xcnt = oldxcnt;
+ }
+ result->snapshot_type = SNAPSHOT_MVCC;
+
+ return result;
}
/*
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index ea35f30f494..70a6b8902d1 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -212,7 +212,6 @@ typedef struct ExportedSnapshot
static List *exportedSnapshots = NIL;
/* Prototypes for local functions */
-static Snapshot CopySnapshot(Snapshot snapshot);
static void UnregisterSnapshotNoOwner(Snapshot snapshot);
static void FreeSnapshot(Snapshot snapshot);
static void SnapshotResetXmin(void);
@@ -591,7 +590,7 @@ SetTransactionSnapshot(Snapshot sourcesnap, VirtualTransactionId *sourcevxid,
* The copy is palloc'd in TopTransactionContext and has initial refcounts set
* to 0. The returned snapshot has the copied flag set.
*/
-static Snapshot
+Snapshot
CopySnapshot(Snapshot snapshot)
{
Snapshot newsnap;
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 44031dcf6e3..6d4d2d1814c 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -73,6 +73,7 @@ extern void FreeSnapshotBuilder(SnapBuild *builder);
extern void SnapBuildSnapDecRefcount(Snapshot snap);
extern Snapshot SnapBuildInitialSnapshot(SnapBuild *builder);
+extern Snapshot SnapBuildMVCCFromHistoric(Snapshot snapshot, bool in_place);
extern const char *SnapBuildExportSnapshot(SnapBuild *builder);
extern void SnapBuildClearExportedSnapshot(void);
extern void SnapBuildResetExportedSnapshotState(void);
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index d346be71642..147b190210a 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -60,6 +60,7 @@ extern Snapshot GetTransactionSnapshot(void);
extern Snapshot GetLatestSnapshot(void);
extern void SnapshotSetCommandId(CommandId curcid);
+extern Snapshot CopySnapshot(Snapshot snapshot);
extern Snapshot GetCatalogSnapshot(Oid relid);
extern Snapshot GetNonHistoricCatalogSnapshot(Oid relid);
extern void InvalidateCatalogSnapshot(void);
--
2.39.5
Attachment: v09-0004-Add-CONCURRENTLY-option-to-REPACK-command.patch (text/x-diff; charset=utf-8)
From 2f312c8db1771e4da1fd06d0d3340c89519efa26 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Wed, 26 Feb 2025 09:17:20 +0100
Subject: [PATCH v09 4/9] Add CONCURRENTLY option to REPACK command.
The REPACK command copies the relation data into a new file, creates new
indexes and eventually swaps the files. To make sure that the old file does
not change during the copying, the relation is locked in an exclusive mode,
which prevents applications from both reading and writing. (To keep the data
consistent, we'd only need to prevent the applications from writing, but even
reading needs to be blocked before we can swap the files - otherwise some
applications could continue using the old file. Since we cannot get a stronger
lock without releasing the weaker one first, we acquire the exclusive lock at
the beginning and keep it till the end of the processing.)
This patch introduces an alternative workflow, which only requires the
exclusive lock when the relation (and index) files are being swapped.
(Supposedly, the swapping should be pretty fast.) On the other hand, when we
copy the data to the new file, we allow applications to read from the relation
and even write into it.
First, we scan the relation using a "historic snapshot", and insert all the
tuples satisfying this snapshot into the new file. Note that, before creating
that snapshot, we need to make sure that all the other backends treat the
relation as a system catalog: in particular, they must log information on new
command IDs (CIDs). We achieve that by adding the relation ID into a shared
hash table and waiting until all the transactions currently writing into the
table (i.e. transactions possibly not aware of the new entry) have finished.
Second, logical decoding is used to capture the data changes done by
applications during the copying (i.e. changes that do not satisfy the historic
snapshot mentioned above), and those are applied to the new file before we
acquire the exclusive lock we need to swap the files. (Of course, more data
changes can take place while we are waiting for the lock - these will be
applied to the new file after we have acquired the lock, before we swap the
files.)
While copying the data into the new file, we hold a lock that prevents
applications from changing the relation tuple descriptor (tuples inserted into
the old file must fit into the new file). However, as we have to release that
lock before getting the exclusive one, it's possible that someone adds or
drops a column, or changes the data type of an existing one. Therefore we have
to check the tuple descriptor before we swap the files. If we find out that
the tuple descriptor changed, ERROR is raised and all the changes are rolled
back. Since a lot of effort can be wasted in such a case, the ALTER TABLE
command also tries to check if REPACK CONCURRENTLY is running on the same
relation, and raises an ERROR if it is.
Like the existing implementation of REPACK, the variant with the CONCURRENTLY
option also requires extra space for the new relation and index files
(which coexist with the old files for some time). In addition, the
CONCURRENTLY option might introduce a lag in releasing WAL segments for
archiving / recycling. This is due to the decoding of the data changes done by
applications concurrently. However, this lag should not be more than a single
WAL segment.
---
doc/src/sgml/monitoring.sgml | 65 +-
doc/src/sgml/ref/repack.sgml | 116 +-
src/Makefile | 1 +
src/backend/access/heap/heapam.c | 8 +-
src/backend/access/heap/heapam_handler.c | 145 +-
src/backend/access/heap/heapam_visibility.c | 30 +-
src/backend/catalog/index.c | 43 +-
src/backend/catalog/system_views.sql | 30 +-
src/backend/commands/cluster.c | 2572 ++++++++++++++++-
src/backend/commands/matview.c | 2 +-
src/backend/commands/tablecmds.c | 11 +
src/backend/commands/vacuum.c | 12 +-
src/backend/meson.build | 1 +
src/backend/parser/gram.y | 17 +-
src/backend/replication/logical/decode.c | 24 +
src/backend/replication/logical/snapbuild.c | 20 +
.../replication/pgoutput_repack/Makefile | 32 +
.../replication/pgoutput_repack/meson.build | 18 +
.../pgoutput_repack/pgoutput_repack.c | 286 ++
src/backend/storage/ipc/ipci.c | 3 +
src/backend/tcop/utility.c | 10 +
src/backend/utils/activity/backend_progress.c | 16 +
.../utils/activity/wait_event_names.txt | 1 +
src/backend/utils/cache/inval.c | 21 +
src/backend/utils/cache/relcache.c | 5 +
src/backend/utils/time/snapmgr.c | 3 +-
src/bin/psql/tab-complete.in.c | 24 +-
src/include/access/heapam.h | 4 +
src/include/access/tableam.h | 10 +
src/include/catalog/index.h | 3 +
src/include/commands/cluster.h | 93 +-
src/include/commands/progress.h | 17 +-
src/include/nodes/parsenodes.h | 1 +
src/include/replication/snapbuild.h | 1 +
src/include/storage/lockdefs.h | 5 +-
src/include/storage/lwlocklist.h | 1 +
src/include/utils/backend_progress.h | 3 +-
src/include/utils/inval.h | 2 +
src/include/utils/rel.h | 7 +-
src/include/utils/snapmgr.h | 2 +
src/test/regress/expected/rules.out | 29 +-
41 files changed, 3489 insertions(+), 205 deletions(-)
create mode 100644 src/backend/replication/pgoutput_repack/Makefile
create mode 100644 src/backend/replication/pgoutput_repack/meson.build
create mode 100644 src/backend/replication/pgoutput_repack/pgoutput_repack.c
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 5643edd614e..606736f279a 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5821,14 +5821,35 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>heap_tuples_written</structfield> <type>bigint</type>
+ <structfield>heap_tuples_inserted</structfield> <type>bigint</type>
</para>
<para>
- Number of heap tuples written.
+ Number of heap tuples inserted.
This counter only advances when the phase is
<literal>seq scanning heap</literal>,
- <literal>index scanning heap</literal>
- or <literal>writing new heap</literal>.
+ <literal>index scanning heap</literal>,
+ <literal>writing new heap</literal>
+ or <literal>catch-up</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_updated</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples updated.
+ This counter only advances when the phase is <literal>catch-up</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_deleted</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples deleted.
+ This counter only advances when the phase is <literal>catch-up</literal>.
</para></entry>
</row>
@@ -6044,14 +6065,35 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>heap_tuples_written</structfield> <type>bigint</type>
+ <structfield>heap_tuples_inserted</structfield> <type>bigint</type>
</para>
<para>
- Number of heap tuples written.
+ Number of heap tuples inserted.
This counter only advances when the phase is
<literal>seq scanning heap</literal>,
- <literal>index scanning heap</literal>
- or <literal>writing new heap</literal>.
+ <literal>index scanning heap</literal>,
+ <literal>writing new heap</literal>
+ or <literal>catch-up</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_updated</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples updated.
+ This counter only advances when the phase is <literal>catch-up</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_deleted</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples deleted.
+ This counter only advances when the phase is <literal>catch-up</literal>.
</para></entry>
</row>
@@ -6132,6 +6174,13 @@ FROM pg_stat_get_backend_idset() AS backendid;
<command>REPACK</command> is currently writing the new heap.
</entry>
</row>
+ <row>
+ <entry><literal>catch-up</literal></entry>
+ <entry>
+ <command>REPACK</command> is currently processing the DML commands that
+ other transactions executed during any of the preceding phases.
+ </entry>
+ </row>
<row>
<entry><literal>swapping relation files</literal></entry>
<entry>
diff --git a/doc/src/sgml/ref/repack.sgml b/doc/src/sgml/ref/repack.sgml
index 84f3c3e3f2b..9ee640e3517 100644
--- a/doc/src/sgml/ref/repack.sgml
+++ b/doc/src/sgml/ref/repack.sgml
@@ -22,6 +22,7 @@ PostgreSQL documentation
<refsynopsisdiv>
<synopsis>
REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <replaceable class="parameter">table_name</replaceable> [ USING INDEX <replaceable class="parameter">index_name</replaceable> ] ]
+REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] CONCURRENTLY <replaceable class="parameter">table_name</replaceable> [ USING INDEX <replaceable class="parameter">index_name</replaceable> ]
<phrase>where <replaceable class="parameter">option</replaceable> can be one of:</phrase>
@@ -48,7 +49,8 @@ REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
processes every table and materialized view in the current database that
the current user has the <literal>MAINTAIN</literal> privilege on. This
form of <command>REPACK</command> cannot be executed inside a transaction
- block.
+ block. Also, this form is not allowed if
+ the <literal>CONCURRENTLY</literal> option is used.
</para>
<para>
@@ -61,7 +63,8 @@ REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
When a table is being repacked, an <literal>ACCESS EXCLUSIVE</literal> lock
is acquired on it. This prevents any other database operations (both reads
and writes) from operating on the table until the <command>REPACK</command>
- is finished.
+ is finished. If you want to keep the table accessible during the repacking,
+ consider using the <literal>CONCURRENTLY</literal> option.
</para>
<refsect2 id="sql-repack-notes-on-clustering" xreflabel="Notes on Clustering">
@@ -160,6 +163,115 @@ REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><literal>CONCURRENTLY</literal></term>
+ <listitem>
+ <para>
+ Allow other transactions to use the table while it is being repacked.
+ </para>
+
+ <para>
+ Internally, <command>REPACK</command> copies the contents of the table
+ (ignoring dead tuples) into a new file, sorted by the specified index,
+ and also creates a new file for each index. Then it swaps the old and
+ new files for the table and all the indexes, and deletes the old
+ files. The <literal>ACCESS EXCLUSIVE</literal> lock is needed to make
+ sure that the old files do not change during the processing because the
+ changes would get lost due to the swap.
+ </para>
+
+ <para>
+ With the <literal>CONCURRENTLY</literal> option, the <literal>ACCESS
+ EXCLUSIVE</literal> lock is only acquired to swap the table and index
+ files. The data changes that took place during the creation of the new
+ table and index files are captured using logical decoding
+ (<xref linkend="logicaldecoding"/>) and applied before
+ the <literal>ACCESS EXCLUSIVE</literal> lock is requested. Thus the lock
+ is typically held only for the time needed to swap the files, which
+ should be pretty short.
+ </para>
+
+ <para>
+ Note that <command>REPACK</command> with
+ the <literal>CONCURRENTLY</literal> option does not try to order the
+ rows inserted into the table after the repacking started. Also
+ note that <command>REPACK</command> might fail to complete due to DDL
+ commands executed on the table by other transactions during the
+ repacking.
+ </para>
+
+ <note>
+ <para>
+ In addition to the temporary space requirements explained in
+ <xref linkend="sql-repack-notes-on-resources"/>,
+ the <literal>CONCURRENTLY</literal> option can add a bit more to the
+ temporary space usage. The reason is that other transactions can
+ perform DML operations which cannot be applied to the new file until
+ <command>REPACK</command> has copied all the tuples from the old
+ file. Thus the tuples inserted into the old file during the copying are
+ also stored separately in a temporary file, so they can eventually
+ be applied to the new file.
+ </para>
+
+ <para>
+ Furthermore, the data changes performed during the copying are
+ extracted from <link linkend="wal">write-ahead log</link> (WAL), and
+ this extraction (decoding) only takes place when a certain amount of WAL
+ has been written. Therefore, WAL removal can be delayed by this
+ threshold. Currently the threshold is equal to the value of
+ the <link linkend="guc-wal-segment-size"><varname>wal_segment_size</varname></link>
+ configuration parameter.
+ </para>
+ </note>
+
+ <para>
+ The <literal>CONCURRENTLY</literal> option cannot be used in the
+ following cases:
+
+ <itemizedlist>
+ <listitem>
+ <para>
+ The table is <literal>UNLOGGED</literal>.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The table is partitioned.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The table is a system catalog or a <acronym>TOAST</acronym> table.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <command>REPACK</command> is executed inside a transaction block.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The <link linkend="guc-wal-level"><varname>wal_level</varname></link>
+ configuration parameter is less than <literal>logical</literal>.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The <link linkend="guc-max-replication-slots"><varname>max_replication_slots</varname></link>
+ configuration parameter does not allow for creation of an additional
+ replication slot.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><literal>VERBOSE</literal></term>
<listitem>
diff --git a/src/Makefile b/src/Makefile
index 2f31a2f20a7..b18c9a14ffa 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -23,6 +23,7 @@ SUBDIRS = \
interfaces \
backend/replication/libpqwalreceiver \
backend/replication/pgoutput \
+ backend/replication/pgoutput_repack \
fe_utils \
bin \
pl \
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b12b583c4d9..1be1ef22d1e 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2174,8 +2174,14 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
/*
* If this is a catalog, we need to transmit combo CIDs to properly
* decode, so log that as well.
+ *
+ * For the main heap (as opposed to TOAST), we only receive
+ * HEAP_INSERT_NO_LOGICAL when doing REPACK CONCURRENTLY, in which
+ * case the visibility information does not change. Therefore, there's
+ * no need to update the decoding snapshot.
*/
- if (RelationIsAccessibleInLogicalDecoding(relation))
+ if ((options & HEAP_INSERT_NO_LOGICAL) == 0 &&
+ RelationIsAccessibleInLogicalDecoding(relation))
log_heap_new_cid(relation, heaptup);
/*
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index dfc95ee46b2..6e228addb47 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -33,6 +33,7 @@
#include "catalog/index.h"
#include "catalog/storage.h"
#include "catalog/storage_xlog.h"
+#include "commands/cluster.h"
#include "commands/progress.h"
#include "executor/executor.h"
#include "miscadmin.h"
@@ -53,6 +54,9 @@ static void reform_and_rewrite_tuple(HeapTuple tuple,
static bool SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
HeapTuple tuple,
OffsetNumber tupoffset);
+static HeapTuple accept_tuple_for_concurrent_copy(HeapTuple tuple,
+ Snapshot snapshot,
+ Buffer buffer);
static BlockNumber heapam_scan_get_blocks_done(HeapScanDesc hscan);
@@ -685,6 +689,8 @@ static void
heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
Relation OldIndex, bool use_sort,
TransactionId OldestXmin,
+ Snapshot snapshot,
+ LogicalDecodingContext *decoding_ctx,
TransactionId *xid_cutoff,
MultiXactId *multi_cutoff,
double *num_tuples,
@@ -705,6 +711,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
bool *isnull;
BufferHeapTupleTableSlot *hslot;
BlockNumber prev_cblock = InvalidBlockNumber;
+ bool concurrent = snapshot != NULL;
+ XLogRecPtr end_of_wal_prev = GetFlushRecPtr(NULL);
/* Remember if it's a system catalog */
is_system_catalog = IsSystemRelation(OldHeap);
@@ -783,8 +791,10 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
for (;;)
{
HeapTuple tuple;
+ bool tuple_copied = false;
Buffer buf;
bool isdead;
+ HTSV_Result vis;
CHECK_FOR_INTERRUPTS();
@@ -839,7 +849,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
LockBuffer(buf, BUFFER_LOCK_SHARE);
- switch (HeapTupleSatisfiesVacuum(tuple, OldestXmin, buf))
+ switch ((vis = HeapTupleSatisfiesVacuum(tuple, OldestXmin, buf)))
{
case HEAPTUPLE_DEAD:
/* Definitely dead */
@@ -855,14 +865,15 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
case HEAPTUPLE_INSERT_IN_PROGRESS:
/*
- * Since we hold exclusive lock on the relation, normally the
- * only way to see this is if it was inserted earlier in our
- * own transaction. However, it can happen in system
+ * As long as we hold exclusive lock on the relation, normally
+ * the only way to see this is if it was inserted earlier in
+ * our own transaction. However, it can happen in system
* catalogs, since we tend to release write lock before commit
- * there. Give a warning if neither case applies; but in any
- * case we had better copy it.
+ * there. Also, there's no exclusive lock during concurrent
+ * processing. Give a warning if neither case applies; but in
+ * any case we had better copy it.
*/
- if (!is_system_catalog &&
+ if (!is_system_catalog && !concurrent &&
!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(tuple->t_data)))
elog(WARNING, "concurrent insert in progress within table \"%s\"",
RelationGetRelationName(OldHeap));
@@ -874,7 +885,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
/*
* Similar situation to INSERT_IN_PROGRESS case.
*/
- if (!is_system_catalog &&
+ if (!is_system_catalog && !concurrent &&
!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetUpdateXid(tuple->t_data)))
elog(WARNING, "concurrent delete in progress within table \"%s\"",
RelationGetRelationName(OldHeap));
@@ -888,8 +899,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
break;
}
- LockBuffer(buf, BUFFER_LOCK_UNLOCK);
-
if (isdead)
{
*tups_vacuumed += 1;
@@ -900,9 +909,47 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
*tups_vacuumed += 1;
*tups_recently_dead -= 1;
}
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
continue;
}
+ if (concurrent)
+ {
+ /*
+ * Ignore concurrent changes now; they'll be processed later via
+ * logical decoding.
+ *
+ * INSERT_IN_PROGRESS is rejected right away because our snapshot
+ * represents a point in time that precedes (or equals) the state of
+ * transactions as it was when the "SatisfiesVacuum" test was
+ * performed. Thus
+ * accept_tuple_for_concurrent_copy() should not consider the
+ * tuple inserted.
+ */
+ if (vis == HEAPTUPLE_INSERT_IN_PROGRESS)
+ tuple = NULL;
+ else
+ tuple = accept_tuple_for_concurrent_copy(tuple, snapshot,
+ buf);
+ /* Tuple not suitable for the new heap? */
+ if (tuple == NULL)
+ {
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ continue;
+ }
+
+ /* Remember that we have to free the tuple eventually. */
+ tuple_copied = true;
+ }
+
+ /*
+ * In the concurrent case, we have a copy of the tuple, so we don't
+ * worry whether the source tuple will be deleted / updated after we
+ * release the lock.
+ */
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
*num_tuples += 1;
if (tuplesort != NULL)
{
@@ -919,7 +966,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
{
const int ct_index[] = {
PROGRESS_REPACK_HEAP_TUPLES_SCANNED,
- PROGRESS_REPACK_HEAP_TUPLES_WRITTEN
+ PROGRESS_REPACK_HEAP_TUPLES_INSERTED
};
int64 ct_val[2];
@@ -934,6 +981,33 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
ct_val[1] = *num_tuples;
pgstat_progress_update_multi_param(2, ct_index, ct_val);
}
+ if (tuple_copied)
+ heap_freetuple(tuple);
+
+ /*
+ * Process the WAL produced by the load, as well as by other
+ * transactions, so that the replication slot can advance and WAL does
+ * not pile up. Use wal_segment_size as a threshold so that we do not
+ * introduce the decoding overhead too often.
+ *
+ * Of course, we must not apply the changes until the initial load has
+ * completed.
+ *
+ * Note that our insertions into the new table should not be decoded
+ * as we (intentionally) do not write the logical decoding specific
+ * information to WAL.
+ */
+ if (concurrent)
+ {
+ XLogRecPtr end_of_wal;
+
+ end_of_wal = GetFlushRecPtr(NULL);
+ if ((end_of_wal - end_of_wal_prev) > wal_segment_size)
+ {
+ repack_decode_concurrent_changes(decoding_ctx, end_of_wal);
+ end_of_wal_prev = end_of_wal;
+ }
+ }
}
if (indexScan != NULL)
@@ -977,7 +1051,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
values, isnull,
rwstate);
/* Report n_tuples */
- pgstat_progress_update_param(PROGRESS_REPACK_HEAP_TUPLES_WRITTEN,
+ pgstat_progress_update_param(PROGRESS_REPACK_HEAP_TUPLES_INSERTED,
n_tuples);
}
@@ -2023,6 +2097,53 @@ heapam_scan_get_blocks_done(HeapScanDesc hscan)
return blocks_done;
}
+/*
+ * Return a copy of 'tuple' if it has been inserted according to 'snapshot', or
+ * NULL if the insertion took place in the future. If the tuple is already
+ * marked as deleted or updated by a transaction that 'snapshot' still
+ * considers running, clear the deletion / update XID in the header of the
+ * copied tuple. This way the returned tuple is suitable for insertion into
+ * the new heap.
+ */
+static HeapTuple
+accept_tuple_for_concurrent_copy(HeapTuple tuple, Snapshot snapshot,
+ Buffer buffer)
+{
+ HeapTuple result;
+
+ Assert(snapshot->snapshot_type == SNAPSHOT_MVCC);
+
+ /*
+ * First, check if the tuple insertion is visible by our snapshot.
+ */
+ if (!HeapTupleMVCCInserted(tuple, snapshot, buffer))
+ return NULL;
+
+ result = heap_copytuple(tuple);
+
+ /*
+ * If the tuple was deleted / updated but our snapshot still sees it, we
+ * need to keep it. In that case, clear the information that indicates the
+ * deletion / update. Otherwise the tuple chain would stay incomplete (as
+ * we will reject the new tuple above), and the delete / update would fail
+ * if executed later during logical decoding.
+ */
+ if (TransactionIdIsNormal(HeapTupleHeaderGetRawXmax(result->t_data)) &&
+ HeapTupleMVCCNotDeleted(result, snapshot, buffer))
+ {
+ /* TODO: More work needed here? */
+ result->t_data->t_infomask |= HEAP_XMAX_INVALID;
+ HeapTupleHeaderSetXmax(result->t_data, 0);
+ }
+
+ /*
+ * Accept the tuple even if our snapshot considers it deleted - older
+ * snapshots can still see the tuple, while the decoded transactions
+ * should not try to update / delete it again.
+ */
+ return result;
+}
+
/* ------------------------------------------------------------------------
* Miscellaneous callbacks for the heap AM
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index 05f6946fe60..a46e1812b21 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -955,13 +955,14 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
* did TransactionIdIsInProgress in each call --- to no avail, as long as the
* inserting/deleting transaction was still running --- which was more cycles
* and more contention on ProcArrayLock.
+ *
+ * The checks are split into two functions, HeapTupleMVCCInserted() and
+ * HeapTupleMVCCNotDeleted(), because they are also useful separately.
*/
static bool
HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
Buffer buffer)
{
- HeapTupleHeader tuple = htup->t_data;
-
/*
* Assert that the caller has registered the snapshot. This function
* doesn't care about the registration as such, but in general you
@@ -974,6 +975,20 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
Assert(ItemPointerIsValid(&htup->t_self));
Assert(htup->t_tableOid != InvalidOid);
+ return HeapTupleMVCCInserted(htup, snapshot, buffer) &&
+ HeapTupleMVCCNotDeleted(htup, snapshot, buffer);
+}
+
+/*
+ * HeapTupleMVCCInserted
+ * True iff heap tuple was successfully inserted for the given MVCC
+ * snapshot.
+ */
+bool
+HeapTupleMVCCInserted(HeapTuple htup, Snapshot snapshot, Buffer buffer)
+{
+ HeapTupleHeader tuple = htup->t_data;
+
if (!HeapTupleHeaderXminCommitted(tuple))
{
if (HeapTupleHeaderXminInvalid(tuple))
@@ -1082,6 +1097,17 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
}
/* by here, the inserting transaction has committed */
+ return true;
+}
+
+/*
+ * HeapTupleMVCCNotDeleted
+ * True iff heap tuple was not deleted for the given MVCC snapshot.
+ */
+bool
+HeapTupleMVCCNotDeleted(HeapTuple htup, Snapshot snapshot, Buffer buffer)
+{
+ HeapTupleHeader tuple = htup->t_data;
if (tuple->t_infomask & HEAP_XMAX_INVALID) /* xid invalid or aborted */
return true;
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 466cf0fdef6..c70521d1d54 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1418,22 +1418,7 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
opclassOptions[i] = get_attoptions(oldIndexId, i + 1);
- /* Extract statistic targets for each attribute */
- stattargets = palloc0_array(NullableDatum, newInfo->ii_NumIndexAttrs);
- for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
- {
- HeapTuple tp;
- Datum dat;
-
- tp = SearchSysCache2(ATTNUM, ObjectIdGetDatum(oldIndexId), Int16GetDatum(i + 1));
- if (!HeapTupleIsValid(tp))
- elog(ERROR, "cache lookup failed for attribute %d of relation %u",
- i + 1, oldIndexId);
- dat = SysCacheGetAttr(ATTNUM, tp, Anum_pg_attribute_attstattarget, &isnull);
- ReleaseSysCache(tp);
- stattargets[i].value = dat;
- stattargets[i].isnull = isnull;
- }
+ stattargets = get_index_stattargets(oldIndexId, newInfo);
/*
* Now create the new index.
@@ -1472,6 +1457,32 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
return newIndexId;
}
+NullableDatum *
+get_index_stattargets(Oid indexid, IndexInfo *indInfo)
+{
+ NullableDatum *stattargets;
+
+ /* Extract statistic targets for each attribute */
+ stattargets = palloc0_array(NullableDatum, indInfo->ii_NumIndexAttrs);
+ for (int i = 0; i < indInfo->ii_NumIndexAttrs; i++)
+ {
+ HeapTuple tp;
+ Datum dat;
+ bool isnull;
+
+ tp = SearchSysCache2(ATTNUM, ObjectIdGetDatum(indexid), Int16GetDatum(i + 1));
+ if (!HeapTupleIsValid(tp))
+ elog(ERROR, "cache lookup failed for attribute %d of relation %u",
+ i + 1, indexid);
+ dat = SysCacheGetAttr(ATTNUM, tp, Anum_pg_attribute_attstattarget, &isnull);
+ ReleaseSysCache(tp);
+ stattargets[i].value = dat;
+ stattargets[i].isnull = isnull;
+ }
+
+ return stattargets;
+}
+
/*
* index_concurrently_build
*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index b8209b2acd5..c301d83d9b2 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1249,16 +1249,17 @@ CREATE VIEW pg_stat_progress_cluster AS
WHEN 2 THEN 'index scanning heap'
WHEN 3 THEN 'sorting tuples'
WHEN 4 THEN 'writing new heap'
- WHEN 5 THEN 'swapping relation files'
- WHEN 6 THEN 'rebuilding index'
- WHEN 7 THEN 'performing final cleanup'
+ -- 5 is 'catch-up', but that should not appear here.
+ WHEN 6 THEN 'swapping relation files'
+ WHEN 7 THEN 'rebuilding index'
+ WHEN 8 THEN 'performing final cleanup'
END AS phase,
CAST(S.param3 AS oid) AS cluster_index_relid,
S.param4 AS heap_tuples_scanned,
S.param5 AS heap_tuples_written,
- S.param6 AS heap_blks_total,
- S.param7 AS heap_blks_scanned,
- S.param8 AS index_rebuild_count
+ S.param8 AS heap_blks_total,
+ S.param9 AS heap_blks_scanned,
+ S.param10 AS index_rebuild_count
FROM pg_stat_get_progress_info('CLUSTER') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
@@ -1275,16 +1276,19 @@ CREATE VIEW pg_stat_progress_repack AS
WHEN 2 THEN 'index scanning heap'
WHEN 3 THEN 'sorting tuples'
WHEN 4 THEN 'writing new heap'
- WHEN 5 THEN 'swapping relation files'
- WHEN 6 THEN 'rebuilding index'
- WHEN 7 THEN 'performing final cleanup'
+ WHEN 5 THEN 'catch-up'
+ WHEN 6 THEN 'swapping relation files'
+ WHEN 7 THEN 'rebuilding index'
+ WHEN 8 THEN 'performing final cleanup'
END AS phase,
CAST(S.param3 AS oid) AS repack_index_relid,
S.param4 AS heap_tuples_scanned,
- S.param5 AS heap_tuples_written,
- S.param6 AS heap_blks_total,
- S.param7 AS heap_blks_scanned,
- S.param8 AS index_rebuild_count
+ S.param5 AS heap_tuples_inserted,
+ S.param6 AS heap_tuples_updated,
+ S.param7 AS heap_tuples_deleted,
+ S.param8 AS heap_blks_total,
+ S.param9 AS heap_blks_scanned,
+ S.param10 AS index_rebuild_count
FROM pg_stat_get_progress_info('REPACK') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 9ae3d87e412..25a0b9c6119 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -25,6 +25,10 @@
#include "access/toast_internals.h"
#include "access/transam.h"
#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xloginsert.h"
+#include "access/xlogutils.h"
#include "catalog/catalog.h"
#include "catalog/dependency.h"
#include "catalog/heap.h"
@@ -32,6 +36,7 @@
#include "catalog/namespace.h"
#include "catalog/objectaccess.h"
#include "catalog/pg_am.h"
+#include "catalog/pg_control.h"
#include "catalog/pg_inherits.h"
#include "catalog/toasting.h"
#include "commands/cluster.h"
@@ -39,10 +44,15 @@
#include "commands/progress.h"
#include "commands/tablecmds.h"
#include "commands/vacuum.h"
+#include "executor/executor.h"
#include "miscadmin.h"
#include "optimizer/optimizer.h"
#include "pgstat.h"
+#include "replication/decode.h"
+#include "replication/logical.h"
+#include "replication/snapbuild.h"
#include "storage/bufmgr.h"
+#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
#include "utils/acl.h"
@@ -76,14 +86,96 @@ typedef struct
((cmd) == CLUSTER_COMMAND_REPACK ? \
"repack" : "vacuum"))
+/*
+ * The following definitions are used for concurrent processing.
+ */
+
+/*
+ * OID of the table being repacked by this backend.
+ */
+static Oid repacked_rel = InvalidOid;
+/* The same for its TOAST relation. */
+static Oid repacked_rel_toast = InvalidOid;
+
+/*
+ * The locators are used to avoid logical decoding of data that we do not need
+ * for our table.
+ */
+RelFileLocator repacked_rel_locator = {.relNumber = InvalidOid};
+RelFileLocator repacked_rel_toast_locator = {.relNumber = InvalidOid};
+
+#define REPACK_CONCURRENT_IN_PROGRESS_MSG \
+ "relation \"%s\" is already being processed by REPACK CONCURRENTLY"
+
+/*
+ * Everything we need to call ExecInsertIndexTuples().
+ */
+typedef struct IndexInsertState
+{
+ ResultRelInfo *rri;
+ EState *estate;
+ ExprContext *econtext;
+
+ Relation ident_index;
+} IndexInsertState;
+
+/*
+ * Catalog information to check if another backend changed the relation in
+ * such a way that makes REPACK CONCURRENTLY unable to continue. Such changes
+ * are possible because cluster_rel() has to release its lock on the relation
+ * in order to acquire AccessExclusiveLock that it needs to swap the relation
+ * files.
+ *
+ * The most obvious problem is that the tuple descriptor has changed, since
+ * then the tuples we try to insert into the new storage are not guaranteed to
+ * fit into the storage.
+ *
+ * Another problem is the relfilenode being changed by another backend. It's
+ * not necessarily a correctness issue (e.g. when the other backend ran
+ * cluster_rel()), but it's safer for us to terminate the table processing in
+ * such cases. However, this information also needs to be checked during
+ * logical decoding, so we store it in the global variables
+ * repacked_rel_locator and repacked_rel_toast_locator above.
+ *
+ * Where possible, commands which might change the relation in an incompatible
+ * way should check if REPACK CONCURRENTLY is running, before they start to do
+ * the actual changes (see is_concurrent_repack_in_progress()). Anything else
+ * must be caught by check_catalog_changes(), which uses this structure.
+ */
+typedef struct CatalogState
+{
+ /* Tuple descriptor of the relation. */
+ TupleDesc tupdesc;
+
+ /* The number of indexes tracked. */
+ int ninds;
+ /* The index OIDs. */
+ Oid *ind_oids;
+ /* The index tuple descriptors. */
+ TupleDesc *ind_tupdescs;
+
+ /* The following are copies of the corresponding fields of pg_class. */
+ char relpersistence;
+ char replident;
+
+ /* rd_replidindex */
+ Oid replidindex;
+} CatalogState;
+
+/* The WAL segment being decoded. */
+static XLogSegNo repack_current_segment = 0;
+
static void cluster_multiple_rels(List *rtcs, ClusterParams *params,
- ClusterCommand cmd);
+ ClusterCommand cmd, LOCKMODE lockmode,
+ bool isTopLevel);
static void rebuild_relation(Relation OldHeap, Relation index, bool verbose,
- ClusterCommand cmd);
+ ClusterCommand cmd, bool concurrent);
static void copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
+ Snapshot snapshot, LogicalDecodingContext *decoding_ctx,
bool verbose, ClusterCommand cmd,
bool *pSwapToastByContent,
- TransactionId *pFreezeXid, MultiXactId *pCutoffMulti);
+ TransactionId *pFreezeXid,
+ MultiXactId *pCutoffMulti);
static List *get_tables_to_cluster(MemoryContext cluster_context);
static List *get_tables_to_repack(MemoryContext repack_context);
static List *get_tables_to_cluster_partitioned(MemoryContext cluster_context,
@@ -91,8 +183,91 @@ static List *get_tables_to_cluster_partitioned(MemoryContext cluster_context,
ClusterCommand cmd);
static bool cluster_is_permitted_for_relation(Oid relid, Oid userid,
ClusterCommand cmd);
+static void begin_concurrent_repack(Relation *rel_p, Relation *index_p,
+ bool *entered_p);
+static void end_concurrent_repack(bool error);
+static void cluster_before_shmem_exit_callback(int code, Datum arg);
+static CatalogState *get_catalog_state(Relation rel);
+static void free_catalog_state(CatalogState *state);
+static void check_catalog_changes(Relation rel, CatalogState *cat_state);
+static LogicalDecodingContext *setup_logical_decoding(Oid relid,
+ const char *slotname,
+ TupleDesc tupdesc);
+static HeapTuple get_changed_tuple(char *change);
+static void apply_concurrent_changes(RepackDecodingState *dstate,
+ Relation rel, ScanKey key, int nkeys,
+ IndexInsertState *iistate);
+static void apply_concurrent_insert(Relation rel, ConcurrentChange *change,
+ HeapTuple tup, IndexInsertState *iistate,
+ TupleTableSlot *index_slot);
+static void apply_concurrent_update(Relation rel, HeapTuple tup,
+ HeapTuple tup_target,
+ ConcurrentChange *change,
+ IndexInsertState *iistate,
+ TupleTableSlot *index_slot);
+static void apply_concurrent_delete(Relation rel, HeapTuple tup_target,
+ ConcurrentChange *change);
+static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
+ HeapTuple tup_key,
+ IndexInsertState *iistate,
+ TupleTableSlot *ident_slot,
+ IndexScanDesc *scan_p);
+static void process_concurrent_changes(LogicalDecodingContext *ctx,
+ XLogRecPtr end_of_wal,
+ Relation rel_dst,
+ Relation rel_src,
+ ScanKey ident_key,
+ int ident_key_nentries,
+ IndexInsertState *iistate);
+static IndexInsertState *get_index_insert_state(Relation relation,
+ Oid ident_index_id);
+static ScanKey build_identity_key(Oid ident_idx_oid, Relation rel_src,
+ int *nentries);
+static void free_index_insert_state(IndexInsertState *iistate);
+static void cleanup_logical_decoding(LogicalDecodingContext *ctx);
+static void rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
+ Relation cl_index,
+ CatalogState *cat_state,
+ LogicalDecodingContext *ctx,
+ bool swap_toast_by_content,
+ TransactionId frozenXid,
+ MultiXactId cutoffMulti);
+static List *build_new_indexes(Relation NewHeap, Relation OldHeap, List *OldIndexes);
+
+/*
+ * Use this API when a relation needs to be unlocked, closed and re-opened.
+ * If the relation got dropped while unlocked, raise an ERROR that mentions
+ * the relation name rather than the OID.
+ */
+typedef struct RelReopenInfo
+{
+ /*
+ * The relation to be closed. A pointer to the value is stored here so
+ * that the caller's reference is updated automatically on re-opening.
+ *
+ * When calling unlock_and_close_relations(), 'relid' can be passed
+ * instead of 'rel_p' when the caller only needs to gather information for
+ * subsequent opening.
+ */
+ Relation *rel_p;
+ Oid relid;
+
+ char relkind;
+ LOCKMODE lockmode_orig; /* The existing lock mode */
+ LOCKMODE lockmode_new; /* The lock mode after the relation is
+ * re-opened */
+
+ char *relname; /* Relation name, initialized automatically. */
+} RelReopenInfo;
+
+static void init_rel_reopen_info(RelReopenInfo *rri, Relation *rel_p,
+ Oid relid, LOCKMODE lockmode_orig,
+ LOCKMODE lockmode_new);
+static void unlock_and_close_relations(RelReopenInfo *rels, int nrel);
+static void reopen_relations(RelReopenInfo *rels, int nrel);
static Relation process_single_relation(RangeVar *relation, char *indexname,
- ClusterCommand cmd,
+ ClusterCommand cmd, LOCKMODE lockmode,
+ bool isTopLevel,
ClusterParams *params,
Oid *indexOid_p);
@@ -151,8 +326,9 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
if (stmt->relation != NULL)
{
rel = process_single_relation(stmt->relation, stmt->indexname,
- CLUSTER_COMMAND_CLUSTER, ¶ms,
- &indexOid);
+ CLUSTER_COMMAND_CLUSTER,
+ AccessExclusiveLock, isTopLevel,
+ ¶ms, &indexOid);
if (rel == NULL)
return;
}
@@ -202,7 +378,8 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
}
/* Do the job. */
- cluster_multiple_rels(rtcs, ¶ms, CLUSTER_COMMAND_CLUSTER);
+ cluster_multiple_rels(rtcs, ¶ms, CLUSTER_COMMAND_CLUSTER,
+ AccessExclusiveLock, isTopLevel);
/* Start a new transaction for the cleanup work. */
StartTransactionCommand();
@@ -219,8 +396,8 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
* return.
*/
static void
-cluster_multiple_rels(List *rtcs, ClusterParams *params,
- ClusterCommand cmd)
+cluster_multiple_rels(List *rtcs, ClusterParams *params, ClusterCommand cmd,
+ LOCKMODE lockmode, bool isTopLevel)
{
ListCell *lc;
@@ -240,10 +417,10 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params,
/* functions in indexes may want a snapshot set */
PushActiveSnapshot(GetTransactionSnapshot());
- rel = table_open(rtc->tableOid, AccessExclusiveLock);
+ rel = table_open(rtc->tableOid, lockmode);
/* Process this table */
- cluster_rel(rel, rtc->indexOid, params, cmd);
+ cluster_rel(rel, rtc->indexOid, params, cmd, isTopLevel);
/* cluster_rel closes the relation, but keeps lock */
PopActiveSnapshot();
@@ -267,12 +444,18 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params,
* instead of index order. This is the new implementation of VACUUM FULL,
* and error messages should refer to the operation as VACUUM not CLUSTER.
*
+ * Note that, in the concurrent case, the function releases the lock at some
+ * point, in order to get AccessExclusiveLock for the final steps (i.e. to
+ * swap the relation files). To make things simpler, the caller should expect
+ * OldHeap to be closed on return, regardless of CLUOPT_CONCURRENT. (The
+ * AccessExclusiveLock is kept till the end of the transaction.)
+ *
 * 'cmd' indicates which command is being executed. REPACK should be the only
* caller of this function in the future.
*/
void
cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
- ClusterCommand cmd)
+ ClusterCommand cmd, bool isTopLevel)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid save_userid;
@@ -282,8 +465,53 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
bool recheck = ((params->options & CLUOPT_RECHECK) != 0);
Relation index;
const char *cmd_str = CLUSTER_COMMAND_STR(cmd);
+ bool concurrent = ((params->options & CLUOPT_CONCURRENT) != 0);
+ LOCKMODE lmode;
+ bool entered, success;
- Assert(CheckRelationLockedByMe(OldHeap, AccessExclusiveLock, false));
+ /*
+ * Check that the correct lock is held. The lock mode is
+ * AccessExclusiveLock for normal processing and ShareUpdateExclusiveLock
+ * for concurrent processing (so that SELECT, INSERT, UPDATE and DELETE
+ * commands work, but cluster_rel() cannot be called concurrently for the
+ * same relation).
+ */
+ lmode = !concurrent ? AccessExclusiveLock : ShareUpdateExclusiveLock;
+
+ /*
+ * Skip the relation if it's being processed concurrently. In such a case,
+ * we cannot rely on a lock because the other backend needs to release it
+ * temporarily at some point.
+ *
+ * This check should not take place until we have a lock that prevents
+ * another backend from starting REPACK CONCURRENTLY after our check.
+ */
+ Assert(CheckRelationLockedByMe(OldHeap, lmode, false));
+ if (is_concurrent_repack_in_progress(tableOid))
+ {
+ ereport(NOTICE,
+ (errmsg(REPACK_CONCURRENT_IN_PROGRESS_MSG,
+ RelationGetRelationName(OldHeap))));
+ table_close(OldHeap, lmode);
+ return;
+ }
+
+ /* There are specific requirements on concurrent processing. */
+ if (concurrent)
+ {
+ /*
+ * Make sure we have no XID assigned, otherwise a call to
+ * setup_logical_decoding() can cause a deadlock.
+ *
+ * The existence of a transaction block does not actually imply that an
+ * XID was already assigned, but it very likely is. We might want to check
+ * the result of GetCurrentTransactionIdIfAny() instead, but that
+ * would be less clear from the user's perspective.
+ */
+ PreventInTransactionBlock(isTopLevel, "REPACK CONCURRENTLY");
+
+ can_repack_concurrently(OldHeap);
+ }
/* Check for user-requested abort. */
CHECK_FOR_INTERRUPTS();
@@ -333,7 +561,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
/* Check that the user still has privileges for the relation */
if (!cluster_is_permitted_for_relation(tableOid, save_userid, cmd))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
goto out;
}
@@ -348,7 +576,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
*/
if (RELATION_IS_OTHER_TEMP(OldHeap))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
goto out;
}
@@ -359,7 +587,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
*/
if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(indexOid)))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
goto out;
}
@@ -370,7 +598,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
if ((params->options & CLUOPT_RECHECK_ISCLUSTERED) != 0 &&
!get_index_isclustered(indexOid))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
goto out;
}
}
@@ -390,6 +618,11 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("cannot %s a shared catalog", cmd_str)));
+ /*
+ * The CONCURRENTLY case should have been rejected earlier because it does
+ * not support system catalogs.
+ */
+ Assert(!(OldHeap->rd_rel->relisshared && concurrent));
/*
* Don't process temp tables of other backends ... their local buffer
@@ -411,8 +644,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
if (OidIsValid(indexOid))
{
/* verify the index is good and lock it */
- check_index_is_clusterable(OldHeap, indexOid, AccessExclusiveLock,
- cmd);
+ check_index_is_clusterable(OldHeap, indexOid, lmode, cmd);
/* also open it */
index = index_open(indexOid, NoLock);
}
@@ -429,7 +661,8 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
if (OldHeap->rd_rel->relkind == RELKIND_MATVIEW &&
!RelationIsPopulated(OldHeap))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ index_close(index, lmode);
+ relation_close(OldHeap, lmode);
goto out;
}
@@ -442,11 +675,42 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
* invalid, because we move tuples around. Promote them to relation
* locks. Predicate locks on indexes will be promoted when they are
* reindexed.
+ *
+ * During concurrent processing, the heap as well as its indexes stay in
+ * operation, so we postpone this step until they are locked using
+ * AccessExclusiveLock near the end of the processing.
*/
- TransferPredicateLocksToHeapRelation(OldHeap);
+ if (!concurrent)
+ TransferPredicateLocksToHeapRelation(OldHeap);
/* rebuild_relation does all the dirty work */
- rebuild_relation(OldHeap, index, verbose, cmd);
+ entered = false;
+ success = false;
+ PG_TRY();
+ {
+ /*
+ * For concurrent processing, make sure other transactions treat this
+ * table as if it was a system / user catalog, and WAL the relevant
+ * additional information. ERROR is raised if another backend is
+ * processing the same table.
+ */
+ if (concurrent)
+ {
+ Relation *index_p = index ? &index : NULL;
+
+ begin_concurrent_repack(&OldHeap, index_p, &entered);
+ }
+
+ rebuild_relation(OldHeap, index, verbose, cmd, concurrent);
+ success = true;
+ }
+ PG_FINALLY();
+ {
+ if (concurrent && entered)
+ end_concurrent_repack(!success);
+ }
+ PG_END_TRY();
+
/* rebuild_relation closes OldHeap, and index if valid */
out:
@@ -595,19 +859,86 @@ mark_index_clustered(Relation rel, Oid indexOid, bool is_internal)
table_close(pg_index, RowExclusiveLock);
}
+/*
+ * Check if the CONCURRENTLY option is legal for the relation.
+ */
+void
+can_repack_concurrently(Relation rel)
+{
+ char relpersistence, replident;
+ Oid ident_idx;
+
+ /* Data changes in system relations are not logically decoded. */
+ if (IsCatalogRelation(rel))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot repack relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("REPACK CONCURRENTLY is not supported for catalog relations.")));
+
+ /*
+ * reorderbuffer.c does not seem to handle processing of TOAST relation
+ * alone.
+ */
+ if (IsToastRelation(rel))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot repack relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("REPACK (CONCURRENTLY) is not supported for TOAST relations, unless the main relation is repacked too.")));
+
+ relpersistence = rel->rd_rel->relpersistence;
+ if (relpersistence != RELPERSISTENCE_PERMANENT)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot repack relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("REPACK (CONCURRENTLY) is only allowed for permanent relations.")));
+
+ /* With NOTHING, WAL does not contain the old tuple. */
+ replident = rel->rd_rel->relreplident;
+ if (replident == REPLICA_IDENTITY_NOTHING)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot repack relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("Relation \"%s\" has insufficient replication identity.",
+ RelationGetRelationName(rel))));
+
+ /*
+ * Identity index is not set if the replica identity is FULL, but PK might
+ * exist in such a case.
+ */
+ ident_idx = RelationGetReplicaIndex(rel);
+ if (!OidIsValid(ident_idx) && OidIsValid(rel->rd_pkindex))
+ ident_idx = rel->rd_pkindex;
+ if (!OidIsValid(ident_idx))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot process relation \"%s\"",
+ RelationGetRelationName(rel)),
+ (errhint("Relation \"%s\" has no identity index.",
+ RelationGetRelationName(rel)))));
+}
+
/*
* rebuild_relation: rebuild an existing relation in index or physical order
*
- * OldHeap: table to rebuild.
+ * OldHeap: table to rebuild. See cluster_rel() for comments on the required
+ * lock strength.
+ *
* index: index to cluster by, or NULL to rewrite in physical order.
*
- * On entry, heap and index (if one is given) must be open, and
- * AccessExclusiveLock held on them.
- * On exit, they are closed, but locks on them are not released.
+ * On entry, heap and index (if one is given) must be open, and the
+ * appropriate lock held on them (AccessExclusiveLock for exclusive processing
+ * and ShareUpdateExclusiveLock for concurrent processing).
+ *
+ * On exit, they are closed, but still locked with AccessExclusiveLock. (The
+ * function handles the lock upgrade if 'concurrent' is true.)
*/
static void
rebuild_relation(Relation OldHeap, Relation index, bool verbose,
- ClusterCommand cmd)
+ ClusterCommand cmd, bool concurrent)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid accessMethod = OldHeap->rd_rel->relam;
@@ -615,13 +946,81 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose,
Oid OIDNewHeap;
Relation NewHeap;
char relpersistence;
- bool is_system_catalog;
bool swap_toast_by_content;
TransactionId frozenXid;
MultiXactId cutoffMulti;
+ NameData slotname;
+ LogicalDecodingContext *ctx = NULL;
+ Snapshot snapshot = NULL;
+ CatalogState *cat_state = NULL;
+ LOCKMODE lmode;
- Assert(CheckRelationLockedByMe(OldHeap, AccessExclusiveLock, false) &&
- (index == NULL || CheckRelationLockedByMe(index, AccessExclusiveLock, false)));
+ lmode = !concurrent ? AccessExclusiveLock : ShareUpdateExclusiveLock;
+
+ Assert(CheckRelationLockedByMe(OldHeap, lmode, false) &&
+ (index == NULL || CheckRelationLockedByMe(index, lmode, false)));
+
+ if (concurrent)
+ {
+ TupleDesc tupdesc;
+ RelReopenInfo rri[2];
+ int nrel;
+
+ /*
+ * REPACK CONCURRENTLY is not allowed in a transaction block, so this
+ * should never fire.
+ */
+ Assert(GetTopTransactionIdIfAny() == InvalidTransactionId);
+
+ /*
+ * A single backend should not execute multiple REPACK commands at a
+ * time, so use PID to make the slot unique.
+ */
+ snprintf(NameStr(slotname), NAMEDATALEN, "repack_%d", MyProcPid);
+
+ /*
+ * Gather catalog information so that we can check later if the old
+ * relation has not changed while unlocked.
+ *
+ * Since this function also checks if the relation can be processed,
+ * it's important to call it before we spend a notable amount of time
+ * setting up the logical decoding. It is not clear whether it needs to
+ * be done even earlier.
+ */
+ cat_state = get_catalog_state(OldHeap);
+
+ tupdesc = CreateTupleDescCopy(RelationGetDescr(OldHeap));
+
+ /*
+ * Unlock the relation (and possibly the clustering index) to avoid
+ * deadlock because setup_logical_decoding() will wait for all the
+ * running transactions (with XID assigned) to finish. Some of those
+ * transactions might be waiting for a lock on our relation.
+ */
+ nrel = 0;
+ init_rel_reopen_info(&rri[nrel++], &OldHeap, InvalidOid,
+ ShareUpdateExclusiveLock,
+ ShareUpdateExclusiveLock);
+ if (index)
+ init_rel_reopen_info(&rri[nrel++], &index, InvalidOid,
+ ShareUpdateExclusiveLock,
+ ShareUpdateExclusiveLock);
+ unlock_and_close_relations(rri, nrel);
+
+ /* Prepare to capture the concurrent data changes. */
+ ctx = setup_logical_decoding(tableOid, NameStr(slotname), tupdesc);
+
+ /* Lock the table (and index) again. */
+ reopen_relations(rri, nrel);
+
+ /*
+ * Check if the tuple descriptor could have changed while the relation
+ * was unlocked.
+ */
+ check_catalog_changes(OldHeap, cat_state);
+
+ snapshot = SnapBuildInitialSnapshotForRepack(ctx->snapshot_builder);
+ }
if (index)
/* Mark the correct index as clustered */
@@ -629,7 +1028,6 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose,
/* Remember info about rel before closing OldHeap */
relpersistence = OldHeap->rd_rel->relpersistence;
- is_system_catalog = IsSystemRelation(OldHeap);
/*
* Create the transient table that will receive the re-ordered data.
@@ -645,30 +1043,51 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose,
NewHeap = table_open(OIDNewHeap, NoLock);
/* Copy the heap data into the new table in the desired order */
- copy_table_data(NewHeap, OldHeap, index, verbose, cmd,
- &swap_toast_by_content, &frozenXid, &cutoffMulti);
+ copy_table_data(NewHeap, OldHeap, index, snapshot, ctx, verbose,
+ cmd, &swap_toast_by_content, &frozenXid, &cutoffMulti);
+ if (concurrent)
+ {
+ rebuild_relation_finish_concurrent(NewHeap, OldHeap, index,
+ cat_state, ctx,
+ swap_toast_by_content,
+ frozenXid, cutoffMulti);
- /* Close relcache entries, but keep lock until transaction commit */
- table_close(OldHeap, NoLock);
- if (index)
- index_close(index, NoLock);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_FINAL_CLEANUP);
- /*
- * Close the new relation so it can be dropped as soon as the storage is
- * swapped. The relation is not visible to others, so no need to unlock it
- * explicitly.
- */
- table_close(NewHeap, NoLock);
+ /* Done with decoding. */
+ FreeSnapshot(snapshot);
+ free_catalog_state(cat_state);
+ cleanup_logical_decoding(ctx);
+ ReplicationSlotRelease();
+ ReplicationSlotDrop(NameStr(slotname), false);
+ }
+ else
+ {
+ bool is_system_catalog = IsSystemRelation(OldHeap);
- /*
- * Swap the physical files of the target and transient tables, then
- * rebuild the target's indexes and throw away the transient table.
- */
- finish_heap_swap(tableOid, OIDNewHeap, is_system_catalog,
- swap_toast_by_content, false, true,
- frozenXid, cutoffMulti,
- relpersistence);
+ /* Close relcache entries, but keep lock until transaction commit */
+ table_close(OldHeap, NoLock);
+ if (index)
+ index_close(index, NoLock);
+
+ /*
+ * Close the new relation so it can be dropped as soon as the storage
+ * is swapped. The relation is not visible to others, so no need to
+ * unlock it explicitly.
+ */
+ table_close(NewHeap, NoLock);
+
+ /*
+ * Swap the physical files of the target and transient tables, then
+ * rebuild the target's indexes and throw away the transient table.
+ */
+ finish_heap_swap(tableOid, OIDNewHeap, is_system_catalog,
+ swap_toast_by_content, false, true, true,
+ frozenXid, cutoffMulti,
+ relpersistence);
+ }
}
@@ -803,14 +1222,18 @@ make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
/*
* Do the physical copying of table data.
*
+ * 'snapshot' and 'decoding_ctx': see table_relation_copy_for_cluster(). Pass
+ * iff concurrent processing is required.
+ *
* There are three output parameters:
* *pSwapToastByContent is set true if toast tables must be swapped by content.
* *pFreezeXid receives the TransactionId used as freeze cutoff point.
* *pCutoffMulti receives the MultiXactId used as a cutoff point.
*/
static void
-copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verbose,
- ClusterCommand cmd, bool *pSwapToastByContent,
+copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
+ Snapshot snapshot, LogicalDecodingContext *decoding_ctx,
+ bool verbose, ClusterCommand cmd, bool *pSwapToastByContent,
TransactionId *pFreezeXid, MultiXactId *pCutoffMulti)
{
Relation relRelation;
@@ -829,6 +1252,7 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
const char *cmd_str = CLUSTER_COMMAND_STR(cmd);
PGRUsage ru0;
char *nspname;
+ bool concurrent = snapshot != NULL;
pg_rusage_init(&ru0);
@@ -855,8 +1279,12 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
*
* We don't need to open the toast relation here, just lock it. The lock
* will be held till end of transaction.
+ *
+ * In the REPACK CONCURRENTLY case, the lock does not help because we need
+ * to release it temporarily at some point. Instead, we expect VACUUM /
+ * CLUSTER to skip tables which are present in RepackedRelsHash.
*/
- if (OldHeap->rd_rel->reltoastrelid)
+ if (OldHeap->rd_rel->reltoastrelid && !concurrent)
LockRelationOid(OldHeap->rd_rel->reltoastrelid, AccessExclusiveLock);
/*
@@ -932,8 +1360,48 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
* provided, else plain seqscan.
*/
if (OldIndex != NULL && OldIndex->rd_rel->relam == BTREE_AM_OID)
+ {
+ ResourceOwner oldowner = NULL;
+ ResourceOwner resowner = NULL;
+
+ /*
+ * In the CONCURRENT case, use a dedicated resource owner so we don't
+ * leave any additional locks behind us that we cannot release easily.
+ */
+ if (concurrent)
+ {
+ Assert(CheckRelationLockedByMe(OldHeap, ShareUpdateExclusiveLock,
+ false));
+ Assert(CheckRelationLockedByMe(OldIndex, ShareUpdateExclusiveLock,
+ false));
+
+ resowner = ResourceOwnerCreate(CurrentResourceOwner,
+ "plan_cluster_use_sort");
+ oldowner = CurrentResourceOwner;
+ CurrentResourceOwner = resowner;
+ }
+
use_sort = plan_cluster_use_sort(RelationGetRelid(OldHeap),
RelationGetRelid(OldIndex));
+
+ if (concurrent)
+ {
+ CurrentResourceOwner = oldowner;
+
+ /*
+ * We are primarily concerned about locks, but if the planner
+ * happened to allocate any other resources, we should release
+ * them too because we're going to delete the whole resowner.
+ */
+ ResourceOwnerRelease(resowner, RESOURCE_RELEASE_BEFORE_LOCKS,
+ false, false);
+ ResourceOwnerRelease(resowner, RESOURCE_RELEASE_LOCKS,
+ false, false);
+ ResourceOwnerRelease(resowner, RESOURCE_RELEASE_AFTER_LOCKS,
+ false, false);
+ ResourceOwnerDelete(resowner);
+ }
+ }
else
use_sort = false;
@@ -965,7 +1433,9 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
* values (e.g. because the AM doesn't use freezing).
*/
table_relation_copy_for_cluster(OldHeap, NewHeap, OldIndex, use_sort,
- cutoffs.OldestXmin, &cutoffs.FreezeLimit,
+ cutoffs.OldestXmin, snapshot,
+ decoding_ctx,
+ &cutoffs.FreezeLimit,
&cutoffs.MultiXactCutoff,
&num_tuples, &tups_vacuumed,
&tups_recently_dead);
@@ -974,7 +1444,11 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
*pFreezeXid = cutoffs.FreezeLimit;
*pCutoffMulti = cutoffs.MultiXactCutoff;
- /* Reset rd_toastoid just to be tidy --- it shouldn't be looked at again */
+ /*
+ * Reset rd_toastoid just to be tidy --- it shouldn't be looked at
+ * again. In the CONCURRENTLY case, we need to set it again before
+ * applying the concurrent changes.
+ */
NewHeap->rd_toastoid = InvalidOid;
num_pages = RelationGetNumberOfBlocks(NewHeap);
@@ -1432,14 +1906,13 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
bool swap_toast_by_content,
bool check_constraints,
bool is_internal,
+ bool reindex,
TransactionId frozenXid,
MultiXactId cutoffMulti,
char newrelpersistence)
{
ObjectAddress object;
Oid mapped_tables[4];
- int reindex_flags;
- ReindexParams reindex_params = {0};
int i;
/* Report that we are now swapping relation files */
@@ -1465,39 +1938,46 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
if (is_system_catalog)
CacheInvalidateCatalog(OIDOldHeap);
- /*
- * Rebuild each index on the relation (but not the toast table, which is
- * all-new at this point). It is important to do this before the DROP
- * step because if we are processing a system catalog that will be used
- * during DROP, we want to have its indexes available. There is no
- * advantage to the other order anyway because this is all transactional,
- * so no chance to reclaim disk space before commit. We do not need a
- * final CommandCounterIncrement() because reindex_relation does it.
- *
- * Note: because index_build is called via reindex_relation, it will never
- * set indcheckxmin true for the indexes. This is OK even though in some
- * sense we are building new indexes rather than rebuilding existing ones,
- * because the new heap won't contain any HOT chains at all, let alone
- * broken ones, so it can't be necessary to set indcheckxmin.
- */
- reindex_flags = REINDEX_REL_SUPPRESS_INDEX_USE;
- if (check_constraints)
- reindex_flags |= REINDEX_REL_CHECK_CONSTRAINTS;
+ if (reindex)
+ {
+ int reindex_flags;
+ ReindexParams reindex_params = {0};
- /*
- * Ensure that the indexes have the same persistence as the parent
- * relation.
- */
- if (newrelpersistence == RELPERSISTENCE_UNLOGGED)
- reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
- else if (newrelpersistence == RELPERSISTENCE_PERMANENT)
- reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+ /*
+ * Rebuild each index on the relation (but not the toast table, which
+ * is all-new at this point). It is important to do this before the
+ * DROP step because if we are processing a system catalog that will
+ * be used during DROP, we want to have its indexes available. There
+ * is no advantage to the other order anyway because this is all
+ * transactional, so no chance to reclaim disk space before commit.
+ * We do not need a final CommandCounterIncrement() because
+ * reindex_relation does it.
+ *
+ * Note: because index_build is called via reindex_relation, it will never
+ * set indcheckxmin true for the indexes. This is OK even though in some
+ * sense we are building new indexes rather than rebuilding existing ones,
+ * because the new heap won't contain any HOT chains at all, let alone
+ * broken ones, so it can't be necessary to set indcheckxmin.
+ */
+ reindex_flags = REINDEX_REL_SUPPRESS_INDEX_USE;
+ if (check_constraints)
+ reindex_flags |= REINDEX_REL_CHECK_CONSTRAINTS;
- /* Report that we are now reindexing relations */
- pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
- PROGRESS_REPACK_PHASE_REBUILD_INDEX);
+ /*
+ * Ensure that the indexes have the same persistence as the parent
+ * relation.
+ */
+ if (newrelpersistence == RELPERSISTENCE_UNLOGGED)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
+ else if (newrelpersistence == RELPERSISTENCE_PERMANENT)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
- reindex_relation(NULL, OIDOldHeap, reindex_flags, &reindex_params);
+ /* Report that we are now reindexing relations */
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_REBUILD_INDEX);
+
+ reindex_relation(NULL, OIDOldHeap, reindex_flags, &reindex_params);
+ }
/* Report that we are now doing clean up */
pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
@@ -1809,6 +2289,1855 @@ cluster_is_permitted_for_relation(Oid relid, Oid userid, ClusterCommand cmd)
return false;
}
+#define REPL_PLUGIN_NAME "pgoutput_repack"
+
+/*
+ * Each relation being processed by REPACK CONCURRENTLY must be in the
+ * repackedRels hashtable.
+ */
+typedef struct RepackedRel
+{
+ Oid relid;
+ Oid dbid;
+} RepackedRel;
+
+static HTAB *RepackedRelsHash = NULL;
+
+/* Maximum number of entries in the hashtable. */
+static int maxRepackedRels = 0;
+
+Size
+RepackShmemSize(void)
+{
+ /*
+ * A replication slot is needed for the processing, so use this GUC to
+ * allocate memory for the hashtable.
+ */
+ maxRepackedRels = max_replication_slots;
+
+ return hash_estimate_size(maxRepackedRels, sizeof(RepackedRel));
+}
+
+void
+RepackShmemInit(void)
+{
+ HASHCTL info;
+
+ info.keysize = sizeof(RepackedRel);
+ info.entrysize = info.keysize;
+
+ RepackedRelsHash = ShmemInitHash("Repacked Relations",
+ maxRepackedRels,
+ maxRepackedRels,
+ &info,
+ HASH_ELEM | HASH_BLOBS);
+}
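The shared registry initialized above is effectively a fixed-capacity set keyed by (relid, dbid), sized from max_replication_slots. Its enter/duplicate/overflow behavior can be sketched standalone with a plain array (hypothetical names; the patch itself uses a shmem hash protected by RepackedRelsLock):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the RepackedRel hash key: (relid, dbid). */
typedef struct DemoKey
{
	uint32_t	relid;
	uint32_t	dbid;
} DemoKey;

#define MAX_ENTRIES 8			/* analogue of maxRepackedRels */
static DemoKey entries[MAX_ENTRIES];
static int	nentries = 0;

/*
 * Analogue of hash_search(..., HASH_ENTER_NULL, &found): report an existing
 * entry via *found, return false when the table is full.
 */
static bool
demo_enter(uint32_t relid, uint32_t dbid, bool *found)
{
	*found = false;
	for (int i = 0; i < nentries; i++)
	{
		if (entries[i].relid == relid && entries[i].dbid == dbid)
		{
			*found = true;		/* relation already being repacked */
			return true;
		}
	}
	if (nentries == MAX_ENTRIES)
		return false;			/* "too many requests" error path */
	entries[nentries].relid = relid;
	entries[nentries].dbid = dbid;
	nentries++;
	return true;
}
```

In the patch, the 'found' case raises the "already being processed" ERROR and the overflow case suggests raising max_replication_slots.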
+
+/*
+ * Call this function before REPACK CONCURRENTLY starts, to set up logical
+ * decoding. It makes sure that other users of the table put enough
+ * information into WAL.
+ *
+ * The point is that on various places we expect that the table we're
+ * processing is treated like a system catalog. For example, we need to be
+ * able to scan it using a "historic snapshot" anytime during the processing
+ * (as opposed to scanning only at the start point of the decoding, as
+ * logical replication does during initial table synchronization), in order to apply
+ * concurrent UPDATE / DELETE commands.
+ *
+ * Since we need to close and reopen the relation here, the 'rel_p' and
+ * 'index_p' arguments are in/out.
+ *
+ * 'entered_p' receives a bool value telling whether the relation OID was
+ * entered into the hashtable or not.
+ */
+static void
+begin_concurrent_repack(Relation *rel_p, Relation *index_p,
+ bool *entered_p)
+{
+ Relation rel = *rel_p;
+ Oid relid, toastrelid;
+ RepackedRel key, *entry;
+ bool found;
+ RelReopenInfo rri[2];
+ int nrel;
+ static bool before_shmem_exit_callback_setup = false;
+
+ relid = RelationGetRelid(rel);
+
+ /*
+ * Make sure that we do not leave an entry in RepackedRelsHash if exiting
+ * due to FATAL.
+ */
+ if (!before_shmem_exit_callback_setup)
+ {
+ before_shmem_exit(cluster_before_shmem_exit_callback, 0);
+ before_shmem_exit_callback_setup = true;
+ }
+
+ memset(&key, 0, sizeof(key));
+ key.relid = relid;
+ key.dbid = MyDatabaseId;
+
+ *entered_p = false;
+ LWLockAcquire(RepackedRelsLock, LW_EXCLUSIVE);
+ entry = (RepackedRel *)
+ hash_search(RepackedRelsHash, &key, HASH_ENTER_NULL, &found);
+ if (found)
+ {
+ /*
+ * Since REPACK CONCURRENTLY takes ShareRowExclusiveLock, a conflict
+ * should occur much earlier. However, that lock may be released
+ * temporarily; see below. In any case, we should complain whatever
+ * the reason for the conflict might be.
+ */
+ ereport(ERROR,
+ (errmsg(REPACK_CONCURRENT_IN_PROGRESS_MSG,
+ RelationGetRelationName(rel))));
+ }
+ if (entry == NULL)
+ ereport(ERROR,
+ (errmsg("too many requests for REPACK CONCURRENTLY at a time")),
+ (errhint("Please consider increasing the \"max_replication_slots\" configuration parameter.")));
+
+ /*
+ * Even if the insertion of TOAST relid should fail below, the caller has
+ * to do cleanup.
+ */
+ *entered_p = true;
+
+ /*
+ * Enable the callback to remove the entry in case of exit. We should not
+ * do this earlier, otherwise an attempt to insert already existing entry
+ * could make us remove that entry (inserted by another backend) during
+ * ERROR handling.
+ */
+ Assert(!OidIsValid(repacked_rel));
+ repacked_rel = relid;
+
+ /*
+ * TOAST relation is not accessed using historic snapshot, but we enter it
+ * here to protect it from being VACUUMed by another backend. (A lock does
+ * not help in the CONCURRENTLY case because we cannot hold it continuously
+ * until the end of the transaction.) See the comments on locking TOAST
+ * relation in copy_table_data().
+ */
+ toastrelid = rel->rd_rel->reltoastrelid;
+ if (OidIsValid(toastrelid))
+ {
+ key.relid = toastrelid;
+ entry = (RepackedRel *)
+ hash_search(RepackedRelsHash, &key, HASH_ENTER_NULL, &found);
+ if (found)
+ /*
+ * If we could enter the main relation, entering the TOAST
+ * relation should succeed too. Nevertheless, check.
+ */
+ ereport(ERROR,
+ (errmsg("TOAST relation of \"%s\" is already being processed by REPACK CONCURRENTLY",
+ RelationGetRelationName(rel))));
+ if (entry == NULL)
+ ereport(ERROR,
+ (errmsg("too many requests for REPACK CONCURRENTLY at a time")),
+ (errhint("Please consider increasing the \"max_replication_slots\" configuration parameter.")));
+
+ Assert(!OidIsValid(repacked_rel_toast));
+ repacked_rel_toast = toastrelid;
+ }
+ LWLockRelease(RepackedRelsLock);
+
+ /*
+ * Make sure that other backends are aware of the new hash entry.
+ *
+ * Besides sending the invalidation message, we need to force re-opening
+ * of the relation, which includes the actual invalidation (and thus
+ * checking of our hashtable on the next access).
+ */
+ CacheInvalidateRelcacheImmediate(rel);
+ /*
+ * Since the hashtable only needs to be checked by write transactions,
+ * lock the relation in a mode that conflicts with any DML command. (The
+ * reading transactions are supposed to close the relation before opening
+ * it with higher lock.) Once we have the relation (and its index) locked,
+ * we unlock it immediately and then re-lock using the original mode.
+ */
+ nrel = 0;
+ init_rel_reopen_info(&rri[nrel++], rel_p, InvalidOid,
+ ShareUpdateExclusiveLock, ShareLock);
+ if (index_p)
+ {
+ /*
+ * Another transaction might want to open both the relation and the
+ * index. If it already has the relation lock and is waiting for the
+ * index lock, we should release the index lock, otherwise our request
+ * for ShareLock on the relation can end up in a deadlock.
+ */
+ init_rel_reopen_info(&rri[nrel++], index_p, InvalidOid,
+ ShareUpdateExclusiveLock, ShareLock);
+ }
+ unlock_and_close_relations(rri, nrel);
+ /*
+ * XXX It's not strictly necessary to lock the index here, but it's
+ * probably not worth teaching the "reopen API" about this special case.
+ */
+ reopen_relations(rri, nrel);
+
+ /* Switch back to the original lock. */
+ nrel = 0;
+ init_rel_reopen_info(&rri[nrel++], rel_p, InvalidOid,
+ ShareLock, ShareUpdateExclusiveLock);
+ if (index_p)
+ init_rel_reopen_info(&rri[nrel++], index_p, InvalidOid,
+ ShareLock, ShareUpdateExclusiveLock);
+ unlock_and_close_relations(rri, nrel);
+ reopen_relations(rri, nrel);
+ /* Make sure the reopened relcache entry is used, not the old one. */
+ rel = *rel_p;
+
+ /* Avoid logical decoding of other relations by this backend. */
+ repacked_rel_locator = rel->rd_locator;
+ if (OidIsValid(toastrelid))
+ {
+ Relation toastrel;
+
+ /* Avoid logical decoding of other TOAST relations. */
+ toastrel = table_open(toastrelid, AccessShareLock);
+ repacked_rel_toast_locator = toastrel->rd_locator;
+ table_close(toastrel, AccessShareLock);
+ }
+}
+
+/*
+ * Call this when done with REPACK CONCURRENTLY.
+ *
+ * 'error' tells whether the function is being called in order to handle
+ * error.
+ */
+static void
+end_concurrent_repack(bool error)
+{
+ RepackedRel key;
+ RepackedRel *entry = NULL, *entry_toast = NULL;
+ Oid relid = repacked_rel;
+ Oid toastrelid = repacked_rel_toast;
+
+ /* Remove the relation from the hash if we managed to insert one. */
+ if (OidIsValid(repacked_rel))
+ {
+ memset(&key, 0, sizeof(key));
+ key.relid = repacked_rel;
+ key.dbid = MyDatabaseId;
+ LWLockAcquire(RepackedRelsLock, LW_EXCLUSIVE);
+ entry = hash_search(RepackedRelsHash, &key, HASH_REMOVE, NULL);
+
+ /*
+ * By clearing this variable we also disable
+ * cluster_before_shmem_exit_callback().
+ */
+ repacked_rel = InvalidOid;
+ }
+
+ /* Remove the TOAST relation if there is one. */
+ if (OidIsValid(repacked_rel_toast))
+ {
+ key.relid = repacked_rel_toast;
+ entry_toast = hash_search(RepackedRelsHash, &key, HASH_REMOVE,
+ NULL);
+
+ repacked_rel_toast = InvalidOid;
+ }
+ LWLockRelease(RepackedRelsLock);
+
+ /* Restore normal function of logical decoding. */
+ repacked_rel_locator.relNumber = InvalidOid;
+ repacked_rel_toast_locator.relNumber = InvalidOid;
+
+ /*
+ * On normal completion (!error), we should not really fail to remove the
+ * entry. But if it wasn't there for any reason, raise ERROR to make sure
+ * the transaction is aborted: if other transactions, while changing the
+ * contents of the relation, didn't know that REPACK CONCURRENTLY was in
+ * progress, they could have failed to write enough information to WAL,
+ * and thus we could have produced inconsistent table contents.
+ *
+ * On the other hand, if we are already handling an error, there's no
+ * reason to worry about inconsistent contents of the new storage because
+ * the transaction is going to be rolled back anyway. Furthermore, by
+ * raising ERROR here we'd shadow the original error.
+ */
+ if (!error)
+ {
+ char *relname;
+
+ if (OidIsValid(relid) && entry == NULL)
+ {
+ relname = get_rel_name(relid);
+ if (!relname)
+ ereport(ERROR,
+ (errmsg("cache lookup failed for relation %u",
+ relid)));
+
+ ereport(ERROR,
+ (errmsg("relation \"%s\" not found among repacked relations",
+ relname)));
+ }
+
+ /*
+ * Likewise, the TOAST relation should not have disappeared.
+ */
+ if (OidIsValid(toastrelid) && entry_toast == NULL)
+ {
+ relname = get_rel_name(key.relid);
+ if (!relname)
+ ereport(ERROR,
+ (errmsg("cache lookup failed for relation %u",
+ key.relid)));
+
+ ereport(ERROR,
+ (errmsg("relation \"%s\" not found among repacked relations",
+ relname)));
+ }
+ }
+
+ /*
+ * Note: unlike begin_concurrent_repack(), here we do not lock/unlock the
+ * relation: 1) On normal completion, the caller is already holding
+ * AccessExclusiveLock (till the end of the transaction), 2) on ERROR /
+ * FATAL, we try to do the cleanup asap, but the worst case is that other
+ * backends will write unnecessary information to WAL until they close the
+ * relation.
+ */
+}
+
+/*
+ * A wrapper to call end_concurrent_repack() as a before_shmem_exit callback.
+ */
+static void
+cluster_before_shmem_exit_callback(int code, Datum arg)
+{
+ if (OidIsValid(repacked_rel) || OidIsValid(repacked_rel_toast))
+ end_concurrent_repack(true);
+}
+
+/*
+ * Check if relation is currently being processed by REPACK CONCURRENTLY.
+ */
+bool
+is_concurrent_repack_in_progress(Oid relid)
+{
+ RepackedRel key, *entry;
+
+ memset(&key, 0, sizeof(key));
+ key.relid = relid;
+ key.dbid = MyDatabaseId;
+
+ LWLockAcquire(RepackedRelsLock, LW_SHARED);
+ entry = (RepackedRel *)
+ hash_search(RepackedRelsHash, &key, HASH_FIND, NULL);
+ LWLockRelease(RepackedRelsLock);
+
+ return entry != NULL;
+}
+
+/*
+ * Check if REPACK CONCURRENTLY is already running for given relation, and if
+ * so, raise ERROR. The problem is that cluster_rel() needs to release its
+ * lock on the relation temporarily at some point, so our lock alone does not
+ * help. Commands that might break what cluster_rel() is doing should call
+ * this function first.
+ *
+ * Return without checking if lockmode allows for race conditions which would
+ * make the result meaningless. In that case, cluster_rel() itself should
+ * throw ERROR if the relation was changed by us in an incompatible
+ * way. However, if it managed to do most of its work by then, a lot of CPU
+ * time might be wasted.
+ */
+void
+check_for_concurrent_repack(Oid relid, LOCKMODE lockmode)
+{
+ /*
+ * If the caller does not have a lock that conflicts with
+ * ShareUpdateExclusiveLock, the check makes little sense because REPACK
+ * CONCURRENTLY can start anytime after the check.
+ */
+ if (lockmode < ShareUpdateExclusiveLock)
+ return;
+
+ /*
+ * The caller has a lock which conflicts with REPACK CONCURRENTLY, so if
+ * that's not running now, it cannot start until the caller's transaction
+ * has completed.
+ */
+ if (is_concurrent_repack_in_progress(relid))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg(REPACK_CONCURRENT_IN_PROGRESS_MSG,
+ get_rel_name(relid))));
+
+}
+
+/*
+ * Check if relation is eligible for REPACK CONCURRENTLY and retrieve the
+ * catalog state to be passed later to check_catalog_changes.
+ *
+ * Caller is supposed to hold (at least) ShareUpdateExclusiveLock on the
+ * relation.
+ */
+static CatalogState *
+get_catalog_state(Relation rel)
+{
+ CatalogState *result = palloc_object(CatalogState);
+ List *ind_oids;
+ ListCell *lc;
+ int ninds, i;
+ char relpersistence = rel->rd_rel->relpersistence;
+ char replident = rel->rd_rel->relreplident;
+ Oid ident_idx = RelationGetReplicaIndex(rel);
+ TupleDesc td_src = RelationGetDescr(rel);
+
+ /*
+ * While gathering the catalog information, check if there is a reason not
+ * to proceed.
+ *
+ * This check was already performed, but the relation has been unlocked
+ * since then (see begin_concurrent_repack()). check_catalog_changes()
+ * should catch
+ * any "disruptive" changes in the future.
+ */
+ can_repack_concurrently(rel);
+
+ /* No index should be dropped while we are checking it. */
+ Assert(CheckRelationLockedByMe(rel, ShareUpdateExclusiveLock, true));
+
+ ind_oids = RelationGetIndexList(rel);
+ result->ninds = ninds = list_length(ind_oids);
+ result->ind_oids = palloc_array(Oid, ninds);
+ result->ind_tupdescs = palloc_array(TupleDesc, ninds);
+ i = 0;
+ foreach(lc, ind_oids)
+ {
+ Oid ind_oid = lfirst_oid(lc);
+ Relation index;
+ TupleDesc td_ind_src, td_ind_dst;
+
+ /*
+ * A weaker lock should be OK for the index, but this one should not
+ * break anything either.
+ */
+ index = index_open(ind_oid, ShareUpdateExclusiveLock);
+
+ result->ind_oids[i] = RelationGetRelid(index);
+ td_ind_src = RelationGetDescr(index);
+ td_ind_dst = palloc(TupleDescSize(td_ind_src));
+ TupleDescCopy(td_ind_dst, td_ind_src);
+ result->ind_tupdescs[i] = td_ind_dst;
+ i++;
+
+ index_close(index, ShareUpdateExclusiveLock);
+ }
+
+ /* Fill-in the relation info. */
+ result->tupdesc = palloc(TupleDescSize(td_src));
+ TupleDescCopy(result->tupdesc, td_src);
+ result->relpersistence = relpersistence;
+ result->replident = replident;
+ result->replidindex = ident_idx;
+
+ return result;
+}
+
+static void
+free_catalog_state(CatalogState *state)
+{
+ /* We are only interested in indexes. */
+ if (state->ninds == 0)
+ return;
+
+ for (int i = 0; i < state->ninds; i++)
+ FreeTupleDesc(state->ind_tupdescs[i]);
+
+ FreeTupleDesc(state->tupdesc);
+ pfree(state->ind_oids);
+ pfree(state->ind_tupdescs);
+ pfree(state);
+}
+
+/*
+ * Raise ERROR if 'rel' changed in a way that does not allow further
+ * processing of REPACK CONCURRENTLY.
+ *
+ * Besides the relation's tuple descriptor, it's important to check indexes:
+ * concurrent change of index definition (can it happen in other way than
+ * dropping and re-creating the index, accidentally with the same OID?) can be
+ * a problem because we may already have the new index built. If an index was
+ * created or dropped concurrently, we'd fail to swap the index storage. In
+ * any case, we prefer to check the indexes early to get an explicit error
+ * message about the mismatch. Furthermore, the earlier we detect the change,
+ * the fewer CPU cycles we waste.
+ *
+ * Note that we do not check constraints because the transaction which changed
+ * them must have ensured that the existing tuples satisfy the new
+ * constraints. If any DML commands were necessary for that, we will simply
+ * decode them from WAL and apply them to the new storage.
+ *
+ * Caller is supposed to hold (at least) ShareUpdateExclusiveLock on the
+ * relation.
+ */
+static void
+check_catalog_changes(Relation rel, CatalogState *cat_state)
+{
+ Oid reltoastrelid = rel->rd_rel->reltoastrelid;
+ List *ind_oids;
+ ListCell *lc;
+ LOCKMODE lockmode;
+ Oid ident_idx;
+ TupleDesc td, td_cp;
+
+ /* First, check the relation info. */
+
+ /* TOAST is not easy to change, but check. */
+ if (reltoastrelid != repacked_rel_toast)
+ ereport(ERROR,
+ errmsg("TOAST relation of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+
+ /*
+ * Likewise, check_for_concurrent_repack() should prevent others from
+ * changing the relation file concurrently, but it's our responsibility to
+ * avoid data loss. (The original locators are stored outside cat_state,
+ * but the check belongs to this function.)
+ */
+ if (!RelFileLocatorEquals(rel->rd_locator, repacked_rel_locator))
+ ereport(ERROR,
+ (errmsg("file of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel))));
+ if (OidIsValid(reltoastrelid))
+ {
+ Relation toastrel;
+
+ toastrel = table_open(reltoastrelid, AccessShareLock);
+ if (!RelFileLocatorEquals(toastrel->rd_locator,
+ repacked_rel_toast_locator))
+ ereport(ERROR,
+ (errmsg("file of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(toastrel))));
+ table_close(toastrel, AccessShareLock);
+ }
+
+ if (rel->rd_rel->relpersistence != cat_state->relpersistence)
+ ereport(ERROR,
+ errmsg("persistence of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+
+ if (cat_state->replident != rel->rd_rel->relreplident)
+ ereport(ERROR,
+ errmsg("replica identity of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+
+ ident_idx = RelationGetReplicaIndex(rel);
+ if (ident_idx == InvalidOid && rel->rd_pkindex != InvalidOid)
+ ident_idx = rel->rd_pkindex;
+ if (cat_state->replidindex != ident_idx)
+ ereport(ERROR,
+ errmsg("identity index of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+
+ /*
+ * As cat_state contains a copy (which has the constraint info cleared),
+ * create a temporary copy for the comparison.
+ */
+ td = RelationGetDescr(rel);
+ td_cp = palloc(TupleDescSize(td));
+ TupleDescCopy(td_cp, td);
+ if (!equalTupleDescs(cat_state->tupdesc, td_cp))
+ ereport(ERROR,
+ errmsg("definition of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+ FreeTupleDesc(td_cp);
+
+ /* Now we are only interested in indexes. */
+ if (cat_state->ninds == 0)
+ return;
+
+ /* No index should be dropped while we are checking the relation. */
+ lockmode = ShareUpdateExclusiveLock;
+ Assert(CheckRelationLockedByMe(rel, lockmode, true));
+
+ ind_oids = RelationGetIndexList(rel);
+ if (list_length(ind_oids) != cat_state->ninds)
+ goto failed_index;
+
+ foreach(lc, ind_oids)
+ {
+ Oid ind_oid = lfirst_oid(lc);
+ int i;
+ TupleDesc tupdesc;
+ Relation index;
+
+ /* Find the index in cat_state. */
+ for (i = 0; i < cat_state->ninds; i++)
+ {
+ if (cat_state->ind_oids[i] == ind_oid)
+ break;
+ }
+ /*
+ * OID not found, i.e. the index was replaced by another one. XXX
+ * Should we still try to find out whether an index with the desired tuple
+ * descriptor exists? Or should we always look for the tuple
+ * descriptor and not use OIDs at all?
+ */
+ if (i == cat_state->ninds)
+ goto failed_index;
+
+ /* Check the tuple descriptor. */
+ index = try_index_open(ind_oid, lockmode);
+ if (index == NULL)
+ goto failed_index;
+ tupdesc = RelationGetDescr(index);
+ if (!equalTupleDescs(cat_state->ind_tupdescs[i], tupdesc))
+ goto failed_index;
+ index_close(index, lockmode);
+ }
+
+ return;
+
+failed_index:
+ ereport(ERROR,
+ (errmsg("index(es) of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel))));
+}
+
+/*
+ * This function is much like pg_create_logical_replication_slot() except that
+ * the new slot is neither released (if anyone else could read changes from
+ * our slot, we could miss changes other backends do while we copy the
+ * existing data into temporary table), nor persisted (it's easier to handle
+ * crash by restarting all the work from scratch).
+ */
+static LogicalDecodingContext *
+setup_logical_decoding(Oid relid, const char *slotname, TupleDesc tupdesc)
+{
+ LogicalDecodingContext *ctx;
+ RepackDecodingState *dstate;
+
+ /*
+ * Check if we can use logical decoding.
+ */
+ CheckSlotPermissions();
+ CheckLogicalDecodingRequirements();
+
+ /* RS_TEMPORARY so that the slot gets cleaned up on ERROR. */
+ ReplicationSlotCreate(slotname, true, RS_TEMPORARY, false, false, false);
+
+ /*
+ * Neither prepare_write nor do_write callback nor update_progress is
+ * useful for us.
+ *
+ * Regarding the value of need_full_snapshot, we pass false because the
+ * table we are processing is present in RepackedRelsHash and therefore,
+ * regarding logical decoding, treated like a catalog.
+ */
+ ctx = CreateInitDecodingContext(REPL_PLUGIN_NAME,
+ NIL,
+ false,
+ InvalidXLogRecPtr,
+ XL_ROUTINE(.page_read = read_local_xlog_page,
+ .segment_open = wal_segment_open,
+ .segment_close = wal_segment_close),
+ NULL, NULL, NULL);
+
+ /*
+ * We don't have control over setting fast_forward, so at least check it.
+ */
+ Assert(!ctx->fast_forward);
+
+ DecodingContextFindStartpoint(ctx);
+
+ /* Some WAL records should have been read. */
+ Assert(ctx->reader->EndRecPtr != InvalidXLogRecPtr);
+
+ XLByteToSeg(ctx->reader->EndRecPtr, repack_current_segment,
+ wal_segment_size);
+
+ /*
+ * Setup structures to store decoded changes.
+ */
+ dstate = palloc0(sizeof(RepackDecodingState));
+ dstate->relid = relid;
+ dstate->tstore = tuplestore_begin_heap(false, false,
+ maintenance_work_mem);
+ dstate->tupdesc = tupdesc;
+
+ /* Initialize the descriptor to store the changes ... */
+ dstate->tupdesc_change = CreateTemplateTupleDesc(1);
+
+ TupleDescInitEntry(dstate->tupdesc_change, 1, NULL, BYTEAOID, -1, 0);
+ /* ... as well as the corresponding slot. */
+ dstate->tsslot = MakeSingleTupleTableSlot(dstate->tupdesc_change,
+ &TTSOpsMinimalTuple);
+
+ dstate->resowner = ResourceOwnerCreate(CurrentResourceOwner,
+ "logical decoding");
+
+ ctx->output_writer_private = dstate;
+ return ctx;
+}
+
+/*
+ * Retrieve tuple from ConcurrentChange structure.
+ *
+ * The input data starts with the structure but it might not be appropriately
+ * aligned.
+ */
+static HeapTuple
+get_changed_tuple(char *change)
+{
+ HeapTupleData tup_data;
+ HeapTuple result;
+ char *src;
+
+ /*
+ * Ensure alignment before accessing the fields. (This is why we can't use
+ * heap_copytuple() instead of this function.)
+ */
+ src = change + offsetof(ConcurrentChange, tup_data);
+ memcpy(&tup_data, src, sizeof(HeapTupleData));
+
+ result = (HeapTuple) palloc(HEAPTUPLESIZE + tup_data.t_len);
+ memcpy(result, &tup_data, sizeof(HeapTupleData));
+ result->t_data = (HeapTupleHeader) ((char *) result + HEAPTUPLESIZE);
+ src = change + SizeOfConcurrentChange;
+ memcpy(result->t_data, src, result->t_len);
+
+ return result;
+}
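get_changed_tuple() copies the header and tuple bytes via memcpy() because the serialized change data is not guaranteed to be suitably aligned for direct struct access. A minimal standalone illustration of that copy-before-access pattern (hypothetical DemoHeader type, not the patch's ConcurrentChange):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* A header that, in the real patch, precedes the serialized tuple data. */
typedef struct DemoHeader
{
	uint32_t	kind;
	uint32_t	len;
} DemoHeader;

/*
 * Read the header from a possibly misaligned buffer: memcpy() into a
 * properly aligned local variable instead of casting the pointer, which
 * would be undefined behavior on strict-alignment platforms.
 */
static uint32_t
read_kind(const char *buf)
{
	DemoHeader	hdr;

	memcpy(&hdr, buf, sizeof(hdr));	/* safe regardless of alignment */
	return hdr.kind;
}
```

This is the same reason the comment above notes that heap_copytuple() cannot be used here: it would dereference the unaligned HeapTupleData directly.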
+
+/*
+ * Decode logical changes from the WAL sequence up to end_of_wal.
+ */
+void
+repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
+ XLogRecPtr end_of_wal)
+{
+ RepackDecodingState *dstate;
+ ResourceOwner resowner_old;
+ PgBackendProgress progress;
+
+ /*
+ * Invalidate the "present" cache before moving to "(recent) history".
+ */
+ InvalidateSystemCaches();
+
+ dstate = (RepackDecodingState *) ctx->output_writer_private;
+ resowner_old = CurrentResourceOwner;
+ CurrentResourceOwner = dstate->resowner;
+
+ /*
+ * reorderbuffer.c uses internal subtransaction, whose abort ends the
+ * command progress reporting. Save the status here so we can restore when
+ * done with the decoding.
+ */
+ memcpy(&progress, &MyBEEntry->st_progress, sizeof(PgBackendProgress));
+
+ PG_TRY();
+ {
+ while (ctx->reader->EndRecPtr < end_of_wal)
+ {
+ XLogRecord *record;
+ XLogSegNo segno_new;
+ char *errm = NULL;
+ XLogRecPtr end_lsn;
+
+ record = XLogReadRecord(ctx->reader, &errm);
+ if (errm)
+ elog(ERROR, "%s", errm);
+
+ if (record != NULL)
+ LogicalDecodingProcessRecord(ctx, ctx->reader);
+
+ /*
+ * If WAL segment boundary has been crossed, inform the decoding
+ * system that the catalog_xmin can advance. (We can confirm more
+ * often, but filling a single WAL segment should not take much
+ * time.)
+ */
+ end_lsn = ctx->reader->EndRecPtr;
+ XLByteToSeg(end_lsn, segno_new, wal_segment_size);
+ if (segno_new != repack_current_segment)
+ {
+ LogicalConfirmReceivedLocation(end_lsn);
+ elog(DEBUG1, "REPACK: confirmed receive location %X/%X",
+ (uint32) (end_lsn >> 32), (uint32) end_lsn);
+ repack_current_segment = segno_new;
+ }
+
+ CHECK_FOR_INTERRUPTS();
+ }
+ InvalidateSystemCaches();
+ CurrentResourceOwner = resowner_old;
+ }
+ PG_CATCH();
+ {
+ InvalidateSystemCaches();
+ CurrentResourceOwner = resowner_old;
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+
+ /* Restore the progress reporting status. */
+ pgstat_progress_restore_state(&progress);
+}
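The segment-boundary test above (the XLByteToSeg result compared against repack_current_segment) amounts to integer division of the LSN by the WAL segment size. A tiny sketch of that arithmetic, assuming the default 16 MB segment size:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Equivalent of XLByteToSeg: which WAL segment an LSN falls into. */
static uint64_t
lsn_to_segno(XLogRecPtr lsn, uint64_t wal_segment_size)
{
	return lsn / wal_segment_size;
}
```

Confirming the receive location only when this quotient changes bounds the calls to LogicalConfirmReceivedLocation() to roughly one per segment.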
+
+/*
+ * Apply changes that happened during the initial load.
+ *
+ * Scan key is passed by caller, so it does not have to be constructed
+ * multiple times. Key entries have all fields initialized, except for
+ * sk_argument.
+ */
+static void
+apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
+ ScanKey key, int nkeys, IndexInsertState *iistate)
+{
+ TupleTableSlot *index_slot, *ident_slot;
+ HeapTuple tup_old = NULL;
+
+ if (dstate->nchanges == 0)
+ return;
+
+ /* TupleTableSlot is needed to pass the tuple to ExecInsertIndexTuples(). */
+ index_slot = MakeSingleTupleTableSlot(dstate->tupdesc, &TTSOpsHeapTuple);
+ iistate->econtext->ecxt_scantuple = index_slot;
+
+ /* A slot to fetch tuples from identity index. */
+ ident_slot = table_slot_create(rel, NULL);
+
+ while (tuplestore_gettupleslot(dstate->tstore, true, false,
+ dstate->tsslot))
+ {
+ bool shouldFree;
+ HeapTuple tup_change,
+ tup,
+ tup_exist;
+ char *change_raw, *src;
+ ConcurrentChange change;
+ bool isnull[1];
+ Datum values[1];
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Get the change from the single-column tuple. */
+ tup_change = ExecFetchSlotHeapTuple(dstate->tsslot, false, &shouldFree);
+ heap_deform_tuple(tup_change, dstate->tupdesc_change, values, isnull);
+ Assert(!isnull[0]);
+
+ /* Make sure we access aligned data. */
+ change_raw = (char *) DatumGetByteaP(values[0]);
+ src = (char *) VARDATA(change_raw);
+ memcpy(&change, src, SizeOfConcurrentChange);
+
+ /* TRUNCATE change contains no tuple, so process it separately. */
+ if (change.kind == CHANGE_TRUNCATE)
+ {
+ /*
+ * All the things that ExecuteTruncateGuts() does (such as firing
+ * triggers or handling the DROP_CASCADE behavior) should have
+ * taken place on the source relation. Thus we only do the actual
+ * truncation of the new relation (and its indexes).
+ */
+ heap_truncate_one_rel(rel);
+
+ pfree(tup_change);
+ continue;
+ }
+
+ /*
+ * Extract the tuple from the change. The tuple is copied here because
+ * it might be assigned to 'tup_old', in which case it needs to
+ * survive into the next iteration.
+ */
+ tup = get_changed_tuple(src);
+
+ if (change.kind == CHANGE_UPDATE_OLD)
+ {
+ Assert(tup_old == NULL);
+ tup_old = tup;
+ }
+ else if (change.kind == CHANGE_INSERT)
+ {
+ Assert(tup_old == NULL);
+
+ apply_concurrent_insert(rel, &change, tup, iistate, index_slot);
+
+ pfree(tup);
+ }
+ else if (change.kind == CHANGE_UPDATE_NEW ||
+ change.kind == CHANGE_DELETE)
+ {
+ IndexScanDesc ind_scan = NULL;
+ HeapTuple tup_key;
+
+ if (change.kind == CHANGE_UPDATE_NEW)
+ {
+ tup_key = tup_old != NULL ? tup_old : tup;
+ }
+ else
+ {
+ Assert(tup_old == NULL);
+ tup_key = tup;
+ }
+
+ /*
+ * Find the tuple to be updated or deleted.
+ */
+ tup_exist = find_target_tuple(rel, key, nkeys, tup_key,
+ iistate, ident_slot, &ind_scan);
+ if (tup_exist == NULL)
+ elog(ERROR, "failed to find target tuple");
+
+ if (change.kind == CHANGE_UPDATE_NEW)
+ apply_concurrent_update(rel, tup, tup_exist, &change, iistate,
+ index_slot);
+ else
+ apply_concurrent_delete(rel, tup_exist, &change);
+
+ if (tup_old != NULL)
+ {
+ pfree(tup_old);
+ tup_old = NULL;
+ }
+
+ pfree(tup);
+ index_endscan(ind_scan);
+ }
+ else
+ elog(ERROR, "unrecognized kind of change: %d", change.kind);
+
+ /* If there's any change, make it visible to the next iteration. */
+ if (change.kind != CHANGE_UPDATE_OLD)
+ {
+ CommandCounterIncrement();
+ UpdateActiveSnapshotCommandId();
+ }
+
+ /* TTSOpsMinimalTuple has .get_heap_tuple==NULL. */
+ Assert(shouldFree);
+ pfree(tup_change);
+ }
+
+ tuplestore_clear(dstate->tstore);
+ dstate->nchanges = 0;
+
+ /* Cleanup. */
+ ExecDropSingleTupleTableSlot(index_slot);
+ ExecDropSingleTupleTableSlot(ident_slot);
+}
+
+static void
+apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
+ IndexInsertState *iistate, TupleTableSlot *index_slot)
+{
+ List *recheck;
+
+ heap_insert(rel, tup, GetCurrentCommandId(true), HEAP_INSERT_NO_LOGICAL, NULL);
+
+ /*
+ * Update indexes.
+ *
+ * Functions evaluated by the index might need the active snapshot,
+ * which the caller is expected to have set.
+ */
+ ExecStoreHeapTuple(tup, index_slot, false);
+ recheck = ExecInsertIndexTuples(iistate->rri,
+ index_slot,
+ iistate->estate,
+ false, /* update */
+ false, /* noDupErr */
+ NULL, /* specConflict */
+ NIL, /* arbiterIndexes */
+ false /* onlySummarizing */
+ );
+
+ /*
+ * If recheck is required, it must have been performed on the source
+ * relation by now. (All the logical changes we process here are already
+ * committed.)
+ */
+ list_free(recheck);
+
+ pgstat_progress_incr_param(PROGRESS_REPACK_HEAP_TUPLES_INSERTED, 1);
+}
+
+static void
+apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
+ ConcurrentChange *change, IndexInsertState *iistate,
+ TupleTableSlot *index_slot)
+{
+ List *recheck;
+ TU_UpdateIndexes update_indexes;
+
+ /*
+ * Write the new tuple into the new heap. ('tup' gets the TID assigned
+ * here.)
+ */
+ simple_heap_update(rel, &tup_target->t_self, tup, &update_indexes);
+
+ ExecStoreHeapTuple(tup, index_slot, false);
+
+ if (update_indexes != TU_None)
+ {
+ recheck = ExecInsertIndexTuples(iistate->rri,
+ index_slot,
+ iistate->estate,
+ true, /* update */
+ false, /* noDupErr */
+ NULL, /* specConflict */
+ NIL, /* arbiterIndexes */
+ /* onlySummarizing */
+ update_indexes == TU_Summarizing);
+ list_free(recheck);
+ }
+
+ pgstat_progress_incr_param(PROGRESS_REPACK_HEAP_TUPLES_UPDATED, 1);
+}
+
+static void
+apply_concurrent_delete(Relation rel, HeapTuple tup_target,
+ ConcurrentChange *change)
+{
+ simple_heap_delete(rel, &tup_target->t_self);
+
+ pgstat_progress_incr_param(PROGRESS_REPACK_HEAP_TUPLES_DELETED, 1);
+}
+
+/*
+ * Find the tuple to be updated or deleted.
+ *
+ * 'key' is a pre-initialized scan key, into which the function will put the
+ * key values.
+ *
+ * 'tup_key' is a tuple containing the key values for the scan.
+ *
+ * On exit, '*scan_p' contains the scan descriptor used. The caller must
+ * close it when the returned tuple is no longer needed.
+ */
+static HeapTuple
+find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
+ IndexInsertState *iistate,
+ TupleTableSlot *ident_slot, IndexScanDesc *scan_p)
+{
+ IndexScanDesc scan;
+ Form_pg_index ident_form;
+ int2vector *ident_indkey;
+ HeapTuple result = NULL;
+
+ /* XXX no instrumentation for now */
+ scan = index_beginscan(rel, iistate->ident_index, GetActiveSnapshot(),
+ NULL, nkeys, 0);
+ *scan_p = scan;
+ index_rescan(scan, key, nkeys, NULL, 0);
+
+ /* Info needed to retrieve key values from heap tuple. */
+ ident_form = iistate->ident_index->rd_index;
+ ident_indkey = &ident_form->indkey;
+
+ /* Use the incoming tuple to finalize the scan key. */
+ for (int i = 0; i < scan->numberOfKeys; i++)
+ {
+ ScanKey entry;
+ bool isnull;
+ int16 attno_heap;
+
+ entry = &scan->keyData[i];
+ attno_heap = ident_indkey->values[i];
+ entry->sk_argument = heap_getattr(tup_key,
+ attno_heap,
+ rel->rd_att,
+ &isnull);
+ Assert(!isnull);
+ }
+ if (index_getnext_slot(scan, ForwardScanDirection, ident_slot))
+ {
+ bool shouldFree;
+
+ result = ExecFetchSlotHeapTuple(ident_slot, false, &shouldFree);
+ /* TTSOpsBufferHeapTuple has .get_heap_tuple != NULL. */
+ Assert(!shouldFree);
+ }
+
+ return result;
+}
+
+/*
+ * Decode and apply concurrent changes.
+ *
+ * Pass rel_src iff its reltoastrelid is needed.
+ */
+static void
+process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
+ Relation rel_dst, Relation rel_src, ScanKey ident_key,
+ int ident_key_nentries, IndexInsertState *iistate)
+{
+ RepackDecodingState *dstate;
+
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_CATCH_UP);
+
+ dstate = (RepackDecodingState *) ctx->output_writer_private;
+
+ repack_decode_concurrent_changes(ctx, end_of_wal);
+
+ if (dstate->nchanges == 0)
+ return;
+
+ PG_TRY();
+ {
+ /*
+ * Make sure that TOAST values can eventually be accessed via the old
+ * relation - see comment in copy_table_data().
+ */
+ if (rel_src)
+ rel_dst->rd_toastoid = rel_src->rd_rel->reltoastrelid;
+
+ apply_concurrent_changes(dstate, rel_dst, ident_key,
+ ident_key_nentries, iistate);
+ }
+ PG_FINALLY();
+ {
+ if (rel_src)
+ rel_dst->rd_toastoid = InvalidOid;
+ }
+ PG_END_TRY();
+}
+
+static IndexInsertState *
+get_index_insert_state(Relation relation, Oid ident_index_id)
+{
+ EState *estate;
+ int i;
+ IndexInsertState *result;
+
+ result = (IndexInsertState *) palloc0(sizeof(IndexInsertState));
+ estate = CreateExecutorState();
+ result->econtext = GetPerTupleExprContext(estate);
+
+ result->rri = (ResultRelInfo *) palloc(sizeof(ResultRelInfo));
+ InitResultRelInfo(result->rri, relation, 0, 0, 0);
+ ExecOpenIndices(result->rri, false);
+
+ /*
+ * Find the relcache entry of the identity index so that we spend no extra
+ * effort to open / close it.
+ */
+ for (i = 0; i < result->rri->ri_NumIndices; i++)
+ {
+ Relation ind_rel;
+
+ ind_rel = result->rri->ri_IndexRelationDescs[i];
+ if (ind_rel->rd_id == ident_index_id)
+ result->ident_index = ind_rel;
+ }
+ if (result->ident_index == NULL)
+ elog(ERROR, "Failed to open identity index");
+
+ /* Only initialize fields needed by ExecInsertIndexTuples(). */
+ result->estate = estate;
+
+ return result;
+}
+
+/*
+ * Build scan key to process logical changes.
+ */
+static ScanKey
+build_identity_key(Oid ident_idx_oid, Relation rel_src, int *nentries)
+{
+ Relation ident_idx_rel;
+ Form_pg_index ident_idx;
+ int n,
+ i;
+ ScanKey result;
+
+ Assert(OidIsValid(ident_idx_oid));
+ ident_idx_rel = index_open(ident_idx_oid, AccessShareLock);
+ ident_idx = ident_idx_rel->rd_index;
+ n = ident_idx->indnatts;
+ result = (ScanKey) palloc(sizeof(ScanKeyData) * n);
+ for (i = 0; i < n; i++)
+ {
+ ScanKey entry;
+ int16 relattno;
+ Form_pg_attribute att;
+ Oid opfamily,
+ opcintype,
+ opno,
+ opcode;
+
+ entry = &result[i];
+ relattno = ident_idx->indkey.values[i];
+ if (relattno >= 1)
+ {
+ TupleDesc desc;
+
+ desc = rel_src->rd_att;
+ att = TupleDescAttr(desc, relattno - 1);
+ }
+ else
+ elog(ERROR, "Unexpected attribute number %d in index", relattno);
+
+ opfamily = ident_idx_rel->rd_opfamily[i];
+ opcintype = ident_idx_rel->rd_opcintype[i];
+ opno = get_opfamily_member(opfamily, opcintype, opcintype,
+ BTEqualStrategyNumber);
+
+ if (!OidIsValid(opno))
+ elog(ERROR, "Failed to find = operator for type %u", opcintype);
+
+ opcode = get_opcode(opno);
+ if (!OidIsValid(opcode))
+ elog(ERROR, "Failed to find = operator for operator %u", opno);
+
+ /* Initialize everything but argument. */
+ ScanKeyInit(entry,
+ i + 1,
+ BTEqualStrategyNumber, opcode,
+ (Datum) NULL);
+ entry->sk_collation = att->attcollation;
+ }
+ index_close(ident_idx_rel, AccessShareLock);
+
+ *nentries = n;
+ return result;
+}
+
+static void
+free_index_insert_state(IndexInsertState *iistate)
+{
+ ExecCloseIndices(iistate->rri);
+ FreeExecutorState(iistate->estate);
+ pfree(iistate->rri);
+ pfree(iistate);
+}
+
+static void
+cleanup_logical_decoding(LogicalDecodingContext *ctx)
+{
+ RepackDecodingState *dstate;
+
+ dstate = (RepackDecodingState *) ctx->output_writer_private;
+
+ ExecDropSingleTupleTableSlot(dstate->tsslot);
+ FreeTupleDesc(dstate->tupdesc_change);
+ FreeTupleDesc(dstate->tupdesc);
+ tuplestore_end(dstate->tstore);
+
+ FreeDecodingContext(ctx);
+}
+
+/*
+ * The final steps of rebuild_relation() for concurrent processing.
+ *
+ * On entry, NewHeap is locked in AccessExclusiveLock mode. OldHeap and its
+ * clustering index (if one is passed) are still locked in a mode that allows
+ * concurrent data changes. On exit, both tables and their indexes are closed,
+ * but locked in AccessExclusiveLock mode.
+ */
+static void
+rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
+ Relation cl_index,
+ CatalogState *cat_state,
+ LogicalDecodingContext *ctx,
+ bool swap_toast_by_content,
+ TransactionId frozenXid,
+ MultiXactId cutoffMulti)
+{
+ LOCKMODE lockmode_old PG_USED_FOR_ASSERTS_ONLY;
+ List *ind_oids_new;
+ Oid old_table_oid = RelationGetRelid(OldHeap);
+ Oid new_table_oid = RelationGetRelid(NewHeap);
+ List *ind_oids_old = RelationGetIndexList(OldHeap);
+ ListCell *lc, *lc2;
+ char relpersistence;
+ bool is_system_catalog;
+ Oid ident_idx_old, ident_idx_new;
+ IndexInsertState *iistate;
+ ScanKey ident_key;
+ int ident_key_nentries;
+ XLogRecPtr wal_insert_ptr, end_of_wal;
+ char dummy_rec_data = '\0';
+ RelReopenInfo *rri = NULL;
+ int nrel;
+ Relation *ind_refs_all, *ind_refs_p;
+
+ /* Like in cluster_rel(). */
+ lockmode_old = ShareUpdateExclusiveLock;
+ Assert(CheckRelationLockedByMe(OldHeap, lockmode_old, false));
+ Assert(cl_index == NULL ||
+ CheckRelationLockedByMe(cl_index, lockmode_old, false));
+ /* This is expected from the caller. */
+ Assert(CheckRelationLockedByMe(NewHeap, AccessExclusiveLock, false));
+
+ ident_idx_old = RelationGetReplicaIndex(OldHeap);
+
+ /*
+ * Unlike the exclusive case, we build new indexes for the new relation
+ * rather than swapping the storage and reindexing the old relation. The
+ * point is that the index build can take some time, so we do it before we
+ * acquire AccessExclusiveLock on the old heap; consequently, we cannot
+ * swap the heap storage yet.
+ *
+ * index_create() will lock the new indexes with AccessExclusiveLock during
+ * creation - no need to change that.
+ */
+ ind_oids_new = build_new_indexes(NewHeap, OldHeap, ind_oids_old);
+
+ /*
+ * Processing shouldn't start without a valid identity index.
+ */
+ Assert(OidIsValid(ident_idx_old));
+
+ /* Find "identity index" on the new relation. */
+ ident_idx_new = InvalidOid;
+ forboth(lc, ind_oids_old, lc2, ind_oids_new)
+ {
+ Oid ind_old = lfirst_oid(lc);
+ Oid ind_new = lfirst_oid(lc2);
+
+ if (ident_idx_old == ind_old)
+ {
+ ident_idx_new = ind_new;
+ break;
+ }
+ }
+ if (!OidIsValid(ident_idx_new))
+ /*
+ * Should not happen, given our lock on the old relation.
+ */
+ ereport(ERROR,
+ (errmsg("identity index missing on the new relation")));
+
+ /* Executor state to update indexes. */
+ iistate = get_index_insert_state(NewHeap, ident_idx_new);
+
+ /*
+ * Build scan key that we'll use to look for rows to be updated / deleted
+ * during logical decoding.
+ */
+ ident_key = build_identity_key(ident_idx_new, OldHeap, &ident_key_nentries);
+
+ /*
+ * Flush all WAL records inserted so far (possibly except for the last
+ * incomplete page, see GetInsertRecPtr), to minimize the amount of data
+ * we need to flush while holding exclusive lock on the source table.
+ */
+ wal_insert_ptr = GetInsertRecPtr();
+ XLogFlush(wal_insert_ptr);
+ end_of_wal = GetFlushRecPtr(NULL);
+
+ /*
+ * Apply the concurrent changes for the first time, to minimize the time we
+ * hold AccessExclusiveLock. (Quite some amount of WAL could have been
+ * written during the data copying and index creation.)
+ */
+ process_concurrent_changes(ctx, end_of_wal, NewHeap,
+ swap_toast_by_content ? OldHeap : NULL,
+ ident_key, ident_key_nentries, iistate);
+
+ /*
+ * Release the locks that allowed concurrent data changes, in order to
+ * acquire the AccessExclusiveLock.
+ */
+ nrel = 0;
+ /*
+ * We unlock the old relation (and its clustering index), but then we will
+ * lock the relation and *all* its indexes because we want to swap their
+ * storage.
+ *
+ * (NewHeap is already locked, as well as its indexes.)
+ */
+ rri = palloc_array(RelReopenInfo, 1 + list_length(ind_oids_old));
+ init_rel_reopen_info(&rri[nrel++], &OldHeap, InvalidOid,
+ ShareUpdateExclusiveLock, AccessExclusiveLock);
+ /* References to the re-opened indexes will be stored in this array. */
+ ind_refs_all = palloc_array(Relation, list_length(ind_oids_old));
+ ind_refs_p = ind_refs_all;
+ /* The clustering index is a special case. */
+ if (cl_index)
+ {
+ *ind_refs_p = cl_index;
+ init_rel_reopen_info(&rri[nrel], ind_refs_p, InvalidOid,
+ ShareUpdateExclusiveLock, AccessExclusiveLock);
+ nrel++;
+ ind_refs_p++;
+ }
+ /*
+ * Initialize also the entries for the other indexes (currently unlocked)
+ * because we will have to lock them.
+ */
+ foreach(lc, ind_oids_old)
+ {
+ Oid ind_oid;
+
+ ind_oid = lfirst_oid(lc);
+ /* Clustering index is already in the array, or there is none. */
+ if (cl_index && RelationGetRelid(cl_index) == ind_oid)
+ continue;
+
+ Assert(nrel < (1 + list_length(ind_oids_old)));
+
+ *ind_refs_p = NULL;
+ init_rel_reopen_info(&rri[nrel],
+ /*
+ * In this special case we do not have the
+ * relcache reference, use OID instead.
+ */
+ ind_refs_p,
+ ind_oid,
+ NoLock, /* Nothing to unlock. */
+ AccessExclusiveLock);
+
+ nrel++;
+ ind_refs_p++;
+ }
+ /* Perform the actual unlocking and re-locking. */
+ unlock_and_close_relations(rri, nrel);
+ reopen_relations(rri, nrel);
+
+ /*
+ * In addition, lock the OldHeap's TOAST relation that we skipped for the
+ * CONCURRENTLY option in copy_table_data(). This lock will be needed to
+ * swap the relation files.
+ */
+ if (OidIsValid(OldHeap->rd_rel->reltoastrelid))
+ LockRelationOid(OldHeap->rd_rel->reltoastrelid, AccessExclusiveLock);
+
+ /*
+ * Check if the new indexes match the old ones, i.e. no changes occurred
+ * while OldHeap was unlocked.
+ *
+ * XXX It's probably not necessary to check the relation tuple descriptor
+ * here because the logical decoding was already active when we released
+ * the lock, and thus the corresponding data changes won't be lost.
+ * However processing of those changes might take a lot of time.
+ */
+ check_catalog_changes(OldHeap, cat_state);
+
+ /*
+ * Tuples and pages of the old heap will be gone, but the heap will stay.
+ */
+ TransferPredicateLocksToHeapRelation(OldHeap);
+ /* The same for indexes. */
+ for (int i = 0; i < (nrel - 1); i++)
+ {
+ Relation index = ind_refs_all[i];
+
+ TransferPredicateLocksToHeapRelation(index);
+
+ /*
+ * References to indexes on the old relation are not needed anymore,
+ * however locks stay till the end of the transaction.
+ */
+ index_close(index, NoLock);
+ }
+ pfree(ind_refs_all);
+
+ /*
+ * Flush anything we see in WAL, to make sure that all changes committed
+ * while we were waiting for the exclusive lock are available for
+ * decoding. This should not be necessary if all backends had
+ * synchronous_commit set, but we can't rely on this setting.
+ *
+ * Unfortunately, GetInsertRecPtr() may lag behind the actual insert
+ * position, and GetLastImportantRecPtr() points at the start of the last
+ * record rather than at the end. Thus the simplest way to determine the
+ * insert position is to insert a dummy record and use its LSN.
+ *
+ * XXX Consider using GetLastImportantRecPtr() and adding the size of the
+ * last record (plus the total size of all the page headers the record
+ * spans)?
+ */
+ XLogBeginInsert();
+ XLogRegisterData(&dummy_rec_data, 1);
+ wal_insert_ptr = XLogInsert(RM_XLOG_ID, XLOG_NOOP);
+ XLogFlush(wal_insert_ptr);
+ end_of_wal = GetFlushRecPtr(NULL);
+
+ /* Apply the concurrent changes again. */
+ process_concurrent_changes(ctx, end_of_wal, NewHeap,
+ swap_toast_by_content ? OldHeap : NULL,
+ ident_key, ident_key_nentries, iistate);
+
+ /* Remember info about rel before closing OldHeap */
+ relpersistence = OldHeap->rd_rel->relpersistence;
+ is_system_catalog = IsSystemRelation(OldHeap);
+
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_SWAP_REL_FILES);
+
+ forboth(lc, ind_oids_old, lc2, ind_oids_new)
+ {
+ Oid ind_old = lfirst_oid(lc);
+ Oid ind_new = lfirst_oid(lc2);
+ Oid mapped_tables[4];
+
+ /* Zero out possible results from swap_relation_files() */
+ memset(mapped_tables, 0, sizeof(mapped_tables));
+
+ swap_relation_files(ind_old, ind_new,
+ (old_table_oid == RelationRelationId),
+ swap_toast_by_content,
+ true,
+ InvalidTransactionId,
+ InvalidMultiXactId,
+ mapped_tables);
+
+#ifdef USE_ASSERT_CHECKING
+ /*
+ * Concurrent processing is not supported for system relations, so
+ * there should be no mapped tables.
+ */
+ for (int i = 0; i < 4; i++)
+ Assert(mapped_tables[i] == 0);
+#endif
+ }
+
+ /* The new indexes must be visible for deletion. */
+ CommandCounterIncrement();
+
+ /* Close the old heap but keep lock until transaction commit. */
+ table_close(OldHeap, NoLock);
+ /* Close the new heap. (We didn't have to open its indexes). */
+ table_close(NewHeap, NoLock);
+
+ /* Cleanup what we don't need anymore. (And close the identity index.) */
+ pfree(ident_key);
+ free_index_insert_state(iistate);
+
+ /*
+ * Swap the relations and their TOAST relations and TOAST indexes. This
+ * also drops the new relation and its indexes.
+ *
+ * (System catalogs are currently not supported.)
+ */
+ Assert(!is_system_catalog);
+ finish_heap_swap(old_table_oid, new_table_oid,
+ is_system_catalog,
+ swap_toast_by_content,
+ false, true, false,
+ frozenXid, cutoffMulti,
+ relpersistence);
+
+ pfree(rri);
+}
+
+/*
+ * Build indexes on NewHeap according to those on OldHeap.
+ *
+ * OldIndexes is the list of index OIDs on OldHeap.
+ *
+ * A list of OIDs of the corresponding indexes created on NewHeap is
+ * returned. The order of items matches, so we can use these lists to swap
+ * index storage.
+ */
+static List *
+build_new_indexes(Relation NewHeap, Relation OldHeap, List *OldIndexes)
+{
+ StringInfo ind_name;
+ ListCell *lc;
+ List *result = NIL;
+
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_REBUILD_INDEX);
+
+ ind_name = makeStringInfo();
+
+ foreach(lc, OldIndexes)
+ {
+ Oid ind_oid,
+ ind_oid_new,
+ tbsp_oid;
+ Relation ind;
+ IndexInfo *ind_info;
+ int i,
+ heap_col_id;
+ List *colnames;
+ int16 indnatts;
+ Oid *collations,
+ *opclasses;
+ HeapTuple tup;
+ bool isnull;
+ Datum d;
+ oidvector *oidvec;
+ int2vector *int2vec;
+ size_t oid_arr_size;
+ size_t int2_arr_size;
+ int16 *indoptions;
+ text *reloptions = NULL;
+ bits16 flags;
+ Datum *opclassOptions;
+ NullableDatum *stattargets;
+
+ ind_oid = lfirst_oid(lc);
+ ind = index_open(ind_oid, AccessShareLock);
+ ind_info = BuildIndexInfo(ind);
+
+ tbsp_oid = ind->rd_rel->reltablespace;
+ /*
+ * The index names don't really matter, since we'll eventually use only
+ * their storage. Just make them unique within the table.
+ */
+ resetStringInfo(ind_name);
+ appendStringInfo(ind_name, "ind_%d",
+ list_cell_number(OldIndexes, lc));
+
+ flags = 0;
+ if (ind->rd_index->indisprimary)
+ flags |= INDEX_CREATE_IS_PRIMARY;
+
+ colnames = NIL;
+ indnatts = ind->rd_index->indnatts;
+ oid_arr_size = sizeof(Oid) * indnatts;
+ int2_arr_size = sizeof(int16) * indnatts;
+
+ collations = (Oid *) palloc(oid_arr_size);
+ for (i = 0; i < indnatts; i++)
+ {
+ char *colname;
+
+ heap_col_id = ind->rd_index->indkey.values[i];
+ if (heap_col_id > 0)
+ {
+ Form_pg_attribute att;
+
+ /* Normal attribute. */
+ att = TupleDescAttr(OldHeap->rd_att, heap_col_id - 1);
+ colname = pstrdup(NameStr(att->attname));
+ collations[i] = att->attcollation;
+ }
+ else if (heap_col_id == 0)
+ {
+ HeapTuple tuple;
+ Form_pg_attribute att;
+
+ /*
+ * Expression column is not present in relcache. What we need
+ * here is an attribute of the *index* relation.
+ */
+ tuple = SearchSysCache2(ATTNUM,
+ ObjectIdGetDatum(ind_oid),
+ Int16GetDatum(i + 1));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR,
+ "cache lookup failed for attribute %d of relation %u",
+ i + 1, ind_oid);
+ att = (Form_pg_attribute) GETSTRUCT(tuple);
+ colname = pstrdup(NameStr(att->attname));
+ collations[i] = att->attcollation;
+ ReleaseSysCache(tuple);
+ }
+ else
+ elog(ERROR, "unexpected column number: %d",
+ heap_col_id);
+
+ colnames = lappend(colnames, colname);
+ }
+
+ /*
+ * Special effort needed for variable length attributes of
+ * Form_pg_index.
+ */
+ tup = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(ind_oid));
+ if (!HeapTupleIsValid(tup))
+ elog(ERROR, "cache lookup failed for index %u", ind_oid);
+ d = SysCacheGetAttr(INDEXRELID, tup, Anum_pg_index_indclass, &isnull);
+ Assert(!isnull);
+ oidvec = (oidvector *) DatumGetPointer(d);
+ opclasses = (Oid *) palloc(oid_arr_size);
+ memcpy(opclasses, oidvec->values, oid_arr_size);
+
+ d = SysCacheGetAttr(INDEXRELID, tup, Anum_pg_index_indoption,
+ &isnull);
+ Assert(!isnull);
+ int2vec = (int2vector *) DatumGetPointer(d);
+ indoptions = (int16 *) palloc(int2_arr_size);
+ memcpy(indoptions, int2vec->values, int2_arr_size);
+ ReleaseSysCache(tup);
+
+ tup = SearchSysCache1(RELOID, ObjectIdGetDatum(ind_oid));
+ if (!HeapTupleIsValid(tup))
+ elog(ERROR, "cache lookup failed for index relation %u", ind_oid);
+ d = SysCacheGetAttr(RELOID, tup, Anum_pg_class_reloptions, &isnull);
+ reloptions = !isnull ? DatumGetTextPCopy(d) : NULL;
+ ReleaseSysCache(tup);
+
+ opclassOptions = palloc0(sizeof(Datum) * ind_info->ii_NumIndexAttrs);
+ for (i = 0; i < ind_info->ii_NumIndexAttrs; i++)
+ opclassOptions[i] = get_attoptions(ind_oid, i + 1);
+
+ stattargets = get_index_stattargets(ind_oid, ind_info);
+
+ /*
+ * Neither parentIndexRelid nor parentConstraintId needs to be passed
+ * since the new catalog entries (pg_constraint, pg_inherits) would
+ * eventually be dropped. Therefore there's no need to record valid
+ * dependency on parents.
+ */
+ ind_oid_new = index_create(NewHeap,
+ ind_name->data,
+ InvalidOid,
+ InvalidOid, /* parentIndexRelid */
+ InvalidOid, /* parentConstraintId */
+ InvalidOid,
+ ind_info,
+ colnames,
+ ind->rd_rel->relam,
+ tbsp_oid,
+ collations,
+ opclasses,
+ opclassOptions,
+ indoptions,
+ stattargets,
+ PointerGetDatum(reloptions),
+ flags, /* flags */
+ 0, /* constr_flags */
+ false, /* allow_system_table_mods */
+ false, /* is_internal */
+ NULL /* constraintId */
+ );
+ result = lappend_oid(result, ind_oid_new);
+
+ index_close(ind, AccessShareLock);
+ list_free_deep(colnames);
+ pfree(collations);
+ pfree(opclasses);
+ pfree(indoptions);
+ if (reloptions)
+ pfree(reloptions);
+ }
+
+ return result;
+}
+
+static void
+init_rel_reopen_info(RelReopenInfo *rri, Relation *rel_p, Oid relid,
+ LOCKMODE lockmode_orig, LOCKMODE lockmode_new)
+{
+ rri->rel_p = rel_p;
+ rri->relid = relid;
+ rri->lockmode_orig = lockmode_orig;
+ rri->lockmode_new = lockmode_new;
+}
+
+/*
+ * Unlock and close the relations specified by the items of the 'rels' array.
+ * 'nrel' is the number of items.
+ *
+ * Information needed to (re)open the relations (or to issue a meaningful ERROR)
+ * is added to the array items.
+ */
+static void
+unlock_and_close_relations(RelReopenInfo *rels, int nrel)
+{
+ int i;
+ RelReopenInfo *rri;
+
+ /*
+ * First, retrieve the information that we will need for re-opening.
+ *
+ * We could close (and unlock) each relation as soon as we have gathered
+ * the related information, but then we would have to be careful not to
+ * unlock the table until we have the info on all its indexes. (Once we
+ * unlock the table, any index can be dropped, and thus we can fail to get
+ * the name we want to report if re-opening fails.) It seems simpler to
+ * separate the work into two iterations.
+ */
+ for (i = 0; i < nrel; i++)
+ {
+ Relation rel;
+
+ rri = &rels[i];
+ rel = *rri->rel_p;
+
+ if (rel)
+ {
+ Assert(CheckRelationLockedByMe(rel, rri->lockmode_orig, false));
+ Assert(!OidIsValid(rri->relid));
+
+ rri->relid = RelationGetRelid(rel);
+ rri->relkind = rel->rd_rel->relkind;
+ rri->relname = pstrdup(RelationGetRelationName(rel));
+ }
+ else
+ {
+ Assert(OidIsValid(rri->relid));
+
+ rri->relname = get_rel_name(rri->relid);
+ rri->relkind = get_rel_relkind(rri->relid);
+ }
+ }
+
+ /* Second, close the relations. */
+ for (i = 0; i < nrel; i++)
+ {
+ Relation rel;
+
+ rri = &rels[i];
+ rel = *rri->rel_p;
+
+ /* Close the relation if the caller passed one. */
+ if (rel)
+ {
+ if (rri->relkind == RELKIND_RELATION)
+ table_close(rel, rri->lockmode_orig);
+ else
+ {
+ Assert(rri->relkind == RELKIND_INDEX);
+
+ index_close(rel, rri->lockmode_orig);
+ }
+ }
+ }
+}
+
+/*
+ * Re-open the relations closed previously by unlock_and_close_relations().
+ */
+static void
+reopen_relations(RelReopenInfo *rels, int nrel)
+{
+ for (int i = 0; i < nrel; i++)
+ {
+ RelReopenInfo *rri = &rels[i];
+ Relation rel;
+
+ if (rri->relkind == RELKIND_RELATION)
+ {
+ rel = try_table_open(rri->relid, rri->lockmode_new);
+ }
+ else
+ {
+ Assert(rri->relkind == RELKIND_INDEX);
+
+ rel = try_index_open(rri->relid, rri->lockmode_new);
+ }
+
+ if (rel == NULL)
+ {
+ const char *kind_str;
+
+ kind_str = (rri->relkind == RELKIND_RELATION) ? "table" : "index";
+ ereport(ERROR,
(errmsg("could not open %s \"%s\"", kind_str,
+ rri->relname),
+ errhint("The %s could have been dropped by another transaction.",
+ kind_str)));
+ }
+ *rri->rel_p = rel;
+
+ pfree(rri->relname);
+ }
+}
+
/*
* REPACK is intended to be a replacement of both CLUSTER and VACUUM FULL.
*/
@@ -1822,6 +4151,7 @@ repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
Oid indexOid = InvalidOid;
MemoryContext repack_context;
List *rtcs;
+ LOCKMODE lockmode;
/* Parse option list */
foreach(lc, stmt->params)
@@ -1838,22 +4168,55 @@ repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
parser_errposition(pstate, opt->location)));
}
- params.options = (verbose ? CLUOPT_VERBOSE : 0);
+ params.options =
+ (verbose ? CLUOPT_VERBOSE : 0) |
+ (stmt->concurrent ? CLUOPT_CONCURRENT : 0);
+
+ /*
+ * Determine the lock mode expected by cluster_rel().
+ *
+ * In the exclusive case, we obtain AccessExclusiveLock right away to
+ * avoid lock-upgrade hazard in the single-transaction case. In the
+ * CONCURRENTLY case, the AccessExclusiveLock will only be used at the end
+ * of processing, supposedly for a very short time. Until then, we'll have
+ * to unlock the relation temporarily, so there's no lock-upgrade hazard.
+ */
+ lockmode = (params.options & CLUOPT_CONCURRENT) == 0 ?
+ AccessExclusiveLock : ShareUpdateExclusiveLock;
if (stmt->relation != NULL)
{
rel = process_single_relation(stmt->relation, stmt->indexname,
- CLUSTER_COMMAND_REPACK, &params,
- &indexOid);
+ CLUSTER_COMMAND_REPACK, lockmode,
+ isTopLevel, &params, &indexOid);
if (rel == NULL)
return;
}
/*
- * By here, we know we are in a multi-table situation. In order to avoid
- * holding locks for too long, we want to process each table in its own
- * transaction. This forces us to disallow running inside a user
- * transaction block.
+ * By here, we know we are in a multi-table situation.
+ *
+ * Concurrent processing is currently considered rather special (e.g. in
+ * terms of resources consumed), so it is not performed in bulk.
+ */
+ if (params.options & CLUOPT_CONCURRENT)
+ {
+ if (rel != NULL)
+ {
+ Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
+ ereport(ERROR,
+ (errmsg("REPACK CONCURRENTLY not supported for partitioned tables"),
+ errhint("Consider running the command for individual partitions.")));
+ }
+ else
+ ereport(ERROR,
+ (errmsg("REPACK CONCURRENTLY requires an explicit table name")));
+ }
+
+ /*
+ * In order to avoid holding locks for too long, we want to process each
+ * table in its own transaction. This forces us to disallow running
+ * inside a user transaction block.
*/
PreventInTransactionBlock(isTopLevel, "REPACK");
@@ -1869,6 +4232,8 @@ repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
bool rel_is_index;
Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
+ /* See the ereport() above. */
+ Assert((params.options & CLUOPT_CONCURRENT) == 0);
if (OidIsValid(indexOid))
{
@@ -1885,13 +4250,15 @@ repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
CLUSTER_COMMAND_REPACK);
/* close relation, releasing lock on parent table */
- table_close(rel, AccessExclusiveLock);
+ table_close(rel, lockmode);
}
else
rtcs = get_tables_to_repack(repack_context);
/* Do the job. */
- cluster_multiple_rels(rtcs, &params, CLUSTER_COMMAND_REPACK);
+ cluster_multiple_rels(rtcs, &params, CLUSTER_COMMAND_REPACK, lockmode,
+ isTopLevel);
+
/* Start a new transaction for the cleanup work. */
StartTransactionCommand();
@@ -1909,7 +4276,8 @@ repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
*/
static Relation
process_single_relation(RangeVar *relation, char *indexname,
- ClusterCommand cmd, ClusterParams *params,
+ ClusterCommand cmd, LOCKMODE lockmode,
+ bool isTopLevel, ClusterParams *params,
Oid *indexOid_p)
{
Relation rel;
@@ -1919,12 +4287,10 @@ process_single_relation(RangeVar *relation, char *indexname,
Oid tableOid;
/*
- * Find, lock, and check permissions on the table. We obtain
- * AccessExclusiveLock right away to avoid lock-upgrade hazard in the
- * single-transaction case.
+ * Find, lock, and check permissions on the table.
*/
tableOid = RangeVarGetRelidExtended(relation,
- AccessExclusiveLock,
+ lockmode,
0,
RangeVarCallbackMaintainsTable,
NULL);
@@ -1978,7 +4344,7 @@ process_single_relation(RangeVar *relation, char *indexname,
/* For non-partitioned tables, do what we came here to do. */
if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
{
- cluster_rel(rel, indexOid, params, cmd);
+ cluster_rel(rel, indexOid, params, cmd, isTopLevel);
/* cluster_rel closes the relation, but keeps lock */
return NULL;
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index e7854add178..df879c2a18d 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -904,7 +904,7 @@ refresh_by_match_merge(Oid matviewOid, Oid tempOid, Oid relowner,
static void
refresh_by_heap_swap(Oid matviewOid, Oid OIDNewHeap, char relpersistence)
{
- finish_heap_swap(matviewOid, OIDNewHeap, false, false, true, true,
+ finish_heap_swap(matviewOid, OIDNewHeap, false, false, true, true, true,
RecentXmin, ReadNextMultiXactId(), relpersistence);
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index ebee88e474c..4ffa9b41c88 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4528,6 +4528,16 @@ AlterTableInternal(Oid relid, List *cmds, bool recurse)
rel = relation_open(relid, lockmode);
+ /*
+ * If lockmode allows, check if REPACK CONCURRENTLY is in progress. If
+ * lockmode is too weak, cluster_rel() should detect incompatible DDLs
+ * executed by us.
+ *
+ * XXX We might skip the changes for DDLs which do not change the tuple
+ * descriptor.
+ */
+ check_for_concurrent_repack(relid, lockmode);
+
EventTriggerAlterTableRelid(relid);
ATController(NULL, rel, cmds, recurse, lockmode, NULL);
@@ -5961,6 +5971,7 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
finish_heap_swap(tab->relid, OIDNewHeap,
false, false, true,
!OidIsValid(tab->newTableSpace),
+ true,
RecentXmin,
ReadNextMultiXactId(),
persistence);
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 61018482089..6e914a7020a 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -124,7 +124,7 @@ static void vac_truncate_clog(TransactionId frozenXID,
TransactionId lastSaneFrozenXid,
MultiXactId lastSaneMinMulti);
static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
- BufferAccessStrategy bstrategy);
+ BufferAccessStrategy bstrategy, bool isTopLevel);
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
@@ -634,7 +634,8 @@ vacuum(List *relations, VacuumParams *params, BufferAccessStrategy bstrategy,
if (params->options & VACOPT_VACUUM)
{
- if (!vacuum_rel(vrel->oid, vrel->relation, params, bstrategy))
+ if (!vacuum_rel(vrel->oid, vrel->relation, params, bstrategy,
+ isTopLevel))
continue;
}
@@ -1996,7 +1997,7 @@ vac_truncate_clog(TransactionId frozenXID,
*/
static bool
vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
- BufferAccessStrategy bstrategy)
+ BufferAccessStrategy bstrategy, bool isTopLevel)
{
LOCKMODE lmode;
Relation rel;
@@ -2264,7 +2265,7 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
/* VACUUM FULL is now a variant of CLUSTER; see cluster.c */
cluster_rel(rel, InvalidOid, &cluster_params,
- CLUSTER_COMMAND_VACUUM);
+ CLUSTER_COMMAND_VACUUM, isTopLevel);
/* cluster_rel closes the relation, but keeps lock */
rel = NULL;
@@ -2310,7 +2311,8 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
toast_vacuum_params.options |= VACOPT_PROCESS_MAIN;
toast_vacuum_params.toast_parent = relid;
- vacuum_rel(toast_relid, NULL, &toast_vacuum_params, bstrategy);
+ vacuum_rel(toast_relid, NULL, &toast_vacuum_params, bstrategy,
+ isTopLevel);
}
/*
diff --git a/src/backend/meson.build b/src/backend/meson.build
index 2b0db214804..50aa385a581 100644
--- a/src/backend/meson.build
+++ b/src/backend/meson.build
@@ -194,5 +194,6 @@ pg_test_mod_args = pg_mod_args + {
subdir('jit/llvm')
subdir('replication/libpqwalreceiver')
subdir('replication/pgoutput')
+subdir('replication/pgoutput_repack')
subdir('snowball')
subdir('utils/mb/conversion_procs')
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index d53808a406e..ea7ad798450 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -11902,27 +11902,30 @@ cluster_index_specification:
*
* QUERY:
* REPACK [ (options) ] [ <qualified_name> [ USING INDEX <index_name> ] ]
+ * REPACK [ (options) ] CONCURRENTLY <qualified_name> [ USING INDEX <index_name> ]
*
*****************************************************************************/
RepackStmt:
- REPACK qualified_name repack_index_specification
+ REPACK opt_concurrently qualified_name repack_index_specification
{
RepackStmt *n = makeNode(RepackStmt);
- n->relation = $2;
- n->indexname = $3;
+ n->concurrent = $2;
+ n->relation = $3;
+ n->indexname = $4;
n->params = NIL;
$$ = (Node *) n;
}
- | REPACK '(' utility_option_list ')' qualified_name repack_index_specification
+ | REPACK '(' utility_option_list ')' opt_concurrently qualified_name repack_index_specification
{
RepackStmt *n = makeNode(RepackStmt);
- n->relation = $5;
- n->indexname = $6;
n->params = $3;
+ n->concurrent = $5;
+ n->relation = $6;
+ n->indexname = $7;
$$ = (Node *) n;
}
@@ -11933,6 +11936,7 @@ RepackStmt:
n->relation = NULL;
n->indexname = NULL;
n->params = NIL;
+ n->concurrent = false;
$$ = (Node *) n;
}
@@ -11943,6 +11947,7 @@ RepackStmt:
n->relation = NULL;
n->indexname = NULL;
n->params = $3;
+ n->concurrent = false;
$$ = (Node *) n;
}
;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 78f9a0a11c4..00f7bbc5f59 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -33,6 +33,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecord.h"
#include "catalog/pg_control.h"
+#include "commands/cluster.h"
#include "replication/decode.h"
#include "replication/logical.h"
#include "replication/message.h"
@@ -467,6 +468,29 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
TransactionId xid = XLogRecGetXid(buf->record);
SnapBuild *builder = ctx->snapshot_builder;
+ /*
+ * Check if REPACK CONCURRENTLY is being performed by this backend. If so,
+ * only decode data changes of the table that it is processing, and the
+ * changes of its TOAST relation.
+ *
+ * (The TOAST locator should not be set unless the main one is.)
+ */
+ Assert(!OidIsValid(repacked_rel_toast_locator.relNumber) ||
+ OidIsValid(repacked_rel_locator.relNumber));
+
+ if (OidIsValid(repacked_rel_locator.relNumber))
+ {
+ XLogReaderState *r = buf->record;
+ RelFileLocator locator;
+
+ /* Not all records contain the block. */
+ if (XLogRecGetBlockTagExtended(r, 0, &locator, NULL, NULL, NULL) &&
+ !RelFileLocatorEquals(locator, repacked_rel_locator) &&
+ (!OidIsValid(repacked_rel_toast_locator.relNumber) ||
+ !RelFileLocatorEquals(locator, repacked_rel_toast_locator)))
+ return;
+ }
+
ReorderBufferProcessXid(ctx->reorder, xid, buf->origptr);
/*
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 2c336b47fdb..da0a1d227e4 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -486,6 +486,26 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
return SnapBuildMVCCFromHistoric(snap, true);
}
+/*
+ * Build an MVCC snapshot for the initial data load performed by REPACK
+ * CONCURRENTLY command.
+ *
+ * The snapshot will only be used to scan one particular relation, which is
+ * treated like a catalog (therefore ->building_full_snapshot does not
+ * matter), and the caller should already have a replication slot set up (so
+ * we do not set MyProc->xmin). XXX Do we need to add any other restrictions?
+ */
+Snapshot
+SnapBuildInitialSnapshotForRepack(SnapBuild *builder)
+{
+ Snapshot snap;
+
+ Assert(builder->state == SNAPBUILD_CONSISTENT);
+
+ snap = SnapBuildBuildSnapshot(builder);
+ return SnapBuildMVCCFromHistoric(snap, false);
+}
+
/*
* Turn a historic MVCC snapshot into an ordinary MVCC snapshot.
*
diff --git a/src/backend/replication/pgoutput_repack/Makefile b/src/backend/replication/pgoutput_repack/Makefile
new file mode 100644
index 00000000000..4efeb713b70
--- /dev/null
+++ b/src/backend/replication/pgoutput_repack/Makefile
@@ -0,0 +1,32 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+# Makefile for src/backend/replication/pgoutput_repack
+#
+# IDENTIFICATION
+# src/backend/replication/pgoutput_repack/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/replication/pgoutput_repack
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+ $(WIN32RES) \
+ pgoutput_repack.o
+PGFILEDESC = "pgoutput_repack - logical replication output plugin for REPACK command"
+NAME = pgoutput_repack
+
+all: all-shared-lib
+
+include $(top_srcdir)/src/Makefile.shlib
+
+install: all installdirs install-lib
+
+installdirs: installdirs-lib
+
+uninstall: uninstall-lib
+
+clean distclean: clean-lib
+ rm -f $(OBJS)
diff --git a/src/backend/replication/pgoutput_repack/meson.build b/src/backend/replication/pgoutput_repack/meson.build
new file mode 100644
index 00000000000..133e865a4a0
--- /dev/null
+++ b/src/backend/replication/pgoutput_repack/meson.build
@@ -0,0 +1,18 @@
+# Copyright (c) 2022-2024, PostgreSQL Global Development Group
+
+pgoutput_repack_sources = files(
+ 'pgoutput_repack.c',
+)
+
+if host_system == 'windows'
+ pgoutput_repack_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pgoutput_repack',
+ '--FILEDESC', 'pgoutput_repack - logical replication output plugin for REPACK command',])
+endif
+
+pgoutput_repack = shared_module('pgoutput_repack',
+ pgoutput_repack_sources,
+ kwargs: pg_mod_args,
+)
+
+backend_targets += pgoutput_repack
diff --git a/src/backend/replication/pgoutput_repack/pgoutput_repack.c b/src/backend/replication/pgoutput_repack/pgoutput_repack.c
new file mode 100644
index 00000000000..1ef9b3cbfd7
--- /dev/null
+++ b/src/backend/replication/pgoutput_repack/pgoutput_repack.c
@@ -0,0 +1,286 @@
+/*-------------------------------------------------------------------------
+ *
+ * pgoutput_repack.c
+ * Logical Replication output plugin for REPACK command
+ *
+ * Copyright (c) 2012-2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/replication/pgoutput_repack/pgoutput_repack.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heaptoast.h"
+#include "commands/cluster.h"
+#include "replication/snapbuild.h"
+
+PG_MODULE_MAGIC;
+
+static void plugin_startup(LogicalDecodingContext *ctx,
+ OutputPluginOptions *opt, bool is_init);
+static void plugin_shutdown(LogicalDecodingContext *ctx);
+static void plugin_begin_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn);
+static void plugin_commit_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ Relation rel, ReorderBufferChange *change);
+static void plugin_truncate(struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, int nrelations,
+ Relation relations[],
+ ReorderBufferChange *change);
+static void store_change(LogicalDecodingContext *ctx,
+ ConcurrentChangeKind kind, HeapTuple tuple);
+
+void
+_PG_output_plugin_init(OutputPluginCallbacks *cb)
+{
+ AssertVariableIsOfType(&_PG_output_plugin_init, LogicalOutputPluginInit);
+
+ cb->startup_cb = plugin_startup;
+ cb->begin_cb = plugin_begin_txn;
+ cb->change_cb = plugin_change;
+ cb->truncate_cb = plugin_truncate;
+ cb->commit_cb = plugin_commit_txn;
+ cb->shutdown_cb = plugin_shutdown;
+}
+
+
+/* initialize this plugin */
+static void
+plugin_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
+ bool is_init)
+{
+ ctx->output_plugin_private = NULL;
+
+ /* Probably unnecessary, as we don't use the SQL interface ... */
+ opt->output_type = OUTPUT_PLUGIN_BINARY_OUTPUT;
+
+ if (ctx->output_plugin_options != NIL)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("this plugin does not accept any options")));
+ }
+}
+
+static void
+plugin_shutdown(LogicalDecodingContext *ctx)
+{
+}
+
+/*
+ * As we don't release the slot during the processing of a particular table,
+ * there's no room for an SQL interface, even for debugging purposes.
+ * Therefore we need
+ * neither OutputPluginPrepareWrite() nor OutputPluginWrite() in the plugin
+ * callbacks. (Although we might want to write custom callbacks, this API
+ * seems to be unnecessarily generic for our purposes.)
+ */
+
+/* BEGIN callback */
+static void
+plugin_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+}
+
+/* COMMIT callback */
+static void
+plugin_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+}
+
+/*
+ * Callback for individual changed tuples
+ */
+static void
+plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ Relation relation, ReorderBufferChange *change)
+{
+ RepackDecodingState *dstate;
+
+ dstate = (RepackDecodingState *) ctx->output_writer_private;
+
+ /* Only interested in one particular relation. */
+ if (relation->rd_id != dstate->relid)
+ return;
+
+ /* Decode entry depending on its type */
+ switch (change->action)
+ {
+ case REORDER_BUFFER_CHANGE_INSERT:
+ {
+ HeapTuple newtuple;
+
newtuple = change->data.tp.newtuple;
+
+ /*
+ * Identity checks in the main function should have made this
+ * impossible.
+ */
+ if (newtuple == NULL)
elog(ERROR, "incomplete insert info");
+
+ store_change(ctx, CHANGE_INSERT, newtuple);
+ }
+ break;
+ case REORDER_BUFFER_CHANGE_UPDATE:
+ {
+ HeapTuple oldtuple,
+ newtuple;
+
oldtuple = change->data.tp.oldtuple;
newtuple = change->data.tp.newtuple;
+
+ if (newtuple == NULL)
elog(ERROR, "incomplete update info");
+
+ if (oldtuple != NULL)
+ store_change(ctx, CHANGE_UPDATE_OLD, oldtuple);
+
+ store_change(ctx, CHANGE_UPDATE_NEW, newtuple);
+ }
+ break;
+ case REORDER_BUFFER_CHANGE_DELETE:
+ {
+ HeapTuple oldtuple;
+
oldtuple = change->data.tp.oldtuple;
+
+ if (oldtuple == NULL)
elog(ERROR, "incomplete delete info");
+
+ store_change(ctx, CHANGE_DELETE, oldtuple);
+ }
+ break;
+ default:
+ /* Should not come here */
+ Assert(false);
+ break;
+ }
+}
+
+static void
+plugin_truncate(struct LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ int nrelations, Relation relations[],
+ ReorderBufferChange *change)
+{
+ RepackDecodingState *dstate;
+ int i;
+ Relation relation = NULL;
+
+ dstate = (RepackDecodingState *) ctx->output_writer_private;
+
+ /* Find the relation we are processing. */
+ for (i = 0; i < nrelations; i++)
+ {
+ relation = relations[i];
+
+ if (RelationGetRelid(relation) == dstate->relid)
+ break;
+ }
+
+ /* Nothing to do if the truncation only affects other relations. */
+ if (i == nrelations)
+ return;
+
+ store_change(ctx, CHANGE_TRUNCATE, NULL);
+}
+
+/* Store concurrent data change. */
+static void
+store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
+ HeapTuple tuple)
+{
+ RepackDecodingState *dstate;
+ char *change_raw;
+ ConcurrentChange change;
+ bool flattened = false;
+ Size size;
+ Datum values[1];
+ bool isnull[1];
+ char *dst, *dst_start;
+
+ dstate = (RepackDecodingState *) ctx->output_writer_private;
+
+ size = MAXALIGN(VARHDRSZ) + SizeOfConcurrentChange;
+
+ if (tuple)
+ {
+ /*
+ * ReorderBufferCommit() stores the TOAST chunks in its private memory
+ * context and frees them after having called
+ * apply_change(). Therefore we need flat copy (including TOAST) that
+ * we eventually copy into the memory context which is available to
+ * decode_concurrent_changes().
+ */
+ if (HeapTupleHasExternal(tuple))
+ {
+ /*
+ * toast_flatten_tuple_to_datum() might be more convenient but we
+ * don't want the decompression it does.
+ */
+ tuple = toast_flatten_tuple(tuple, dstate->tupdesc);
+ flattened = true;
+ }
+
+ size += tuple->t_len;
+ }
+
+ /* XXX Isn't there any function / macro to do this? */
+ if (size >= 0x3FFFFFFF)
+ elog(ERROR, "concurrent change is too big");
+
+ /* Construct the change. */
+ change_raw = (char *) palloc0(size);
+ SET_VARSIZE(change_raw, size);
+ /*
+ * Since the varlena alignment might not be sufficient for the structure,
+ * set the fields in a local instance and remember where it should
+ * eventually be copied.
+ */
+ change.kind = kind;
+ dst_start = (char *) VARDATA(change_raw);
+
+ /* No other information is needed for TRUNCATE. */
+ if (change.kind == CHANGE_TRUNCATE)
+ {
+ memcpy(dst_start, &change, SizeOfConcurrentChange);
+ goto store;
+ }
+
+ /*
+ * Copy the tuple.
+ *
+ * CAUTION: change->tup_data.t_data must be fixed on retrieval!
+ */
+ memcpy(&change.tup_data, tuple, sizeof(HeapTupleData));
+ dst = dst_start + SizeOfConcurrentChange;
+ memcpy(dst, tuple->t_data, tuple->t_len);
+
+ /* The data has been copied. */
+ if (flattened)
+ pfree(tuple);
+
+store:
+ /* Copy the structure so it can be stored. */
+ memcpy(dst_start, &change, SizeOfConcurrentChange);
+
+ /* Store as tuple of 1 bytea column. */
+ values[0] = PointerGetDatum(change_raw);
+ isnull[0] = false;
+ tuplestore_putvalues(dstate->tstore, dstate->tupdesc_change,
+ values, isnull);
+
+ /* Accounting. */
+ dstate->nchanges++;
+
+ /* Cleanup. */
+ pfree(change_raw);
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..e24e1795aa9 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
#include "access/xlogprefetcher.h"
#include "access/xlogrecovery.h"
#include "commands/async.h"
+#include "commands/cluster.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/autovacuum.h"
@@ -150,6 +151,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, InjectionPointShmemSize());
size = add_size(size, SlotSyncShmemSize());
size = add_size(size, AioShmemSize());
+ size = add_size(size, RepackShmemSize());
/* include additional requested shmem from preload libraries */
size = add_size(size, total_addin_request);
@@ -343,6 +345,7 @@ CreateOrAttachShmemStructs(void)
WaitEventCustomShmemInit();
InjectionPointShmemInit();
AioShmemInit();
+ RepackShmemInit();
}
/*
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index bf3ba3c2ae7..4ee4c474874 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -1307,6 +1307,16 @@ ProcessUtilitySlow(ParseState *pstate,
lockmode = AlterTableGetLockLevel(atstmt->cmds);
relid = AlterTableLookupRelation(atstmt, lockmode);
+ /*
+ * If lockmode allows, check whether REPACK CONCURRENTLY is in
+ * progress. If lockmode is too weak, cluster_rel() should
+ * detect incompatible DDL commands executed by us.
+ *
+ * XXX We might skip the changes for DDLs which do not
+ * change the tuple descriptor.
+ */
+ check_for_concurrent_repack(relid, lockmode);
+
if (OidIsValid(relid))
{
AlterTableUtilityContext atcontext;
diff --git a/src/backend/utils/activity/backend_progress.c b/src/backend/utils/activity/backend_progress.c
index 17b5d87446b..fcd5d396b21 100644
--- a/src/backend/utils/activity/backend_progress.c
+++ b/src/backend/utils/activity/backend_progress.c
@@ -163,3 +163,19 @@ pgstat_progress_end_command(void)
beentry->st_progress.p_command_target = InvalidOid;
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
+
+void
+pgstat_progress_restore_state(PgBackendProgress *backup)
+{
+ volatile PgBackendStatus *beentry = MyBEEntry;
+
+ if (!beentry || !pgstat_track_activities)
+ return;
+
+ PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
+ beentry->st_progress.p_command = backup->p_command;
+ beentry->st_progress.p_command_target = backup->p_command_target;
+ memcpy(unvolatize(PgBackendStatus *, beentry)->st_progress.p_param,
+ backup->p_param, sizeof(beentry->st_progress.p_param));
+ PGSTAT_END_WRITE_ACTIVITY(beentry);
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 9fa12a555e8..ef04ed32cab 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -349,6 +349,7 @@ DSMRegistry "Waiting to read or update the dynamic shared memory registry."
InjectionPoint "Waiting to read or update information related to injection points."
SerialControl "Waiting to read or update shared <filename>pg_serial</filename> state."
AioWorkerSubmissionQueue "Waiting to access AIO worker submission queue."
+RepackedRels "Waiting to read or update information on tables being repacked concurrently."
#
# END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 4eb67720737..2f25cd86fe0 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -1633,6 +1633,27 @@ CacheInvalidateRelcache(Relation relation)
databaseId, relationId);
}
+/*
+ * CacheInvalidateRelcacheImmediate
+ * Send invalidation message for the specified relation's relcache entry.
+ *
+ * Currently this is used in REPACK CONCURRENTLY, to make sure that other
+ * backends are aware that the command is being executed for the relation.
+ */
+void
+CacheInvalidateRelcacheImmediate(Relation relation)
+{
+ SharedInvalidationMessage msg;
+
+ msg.rc.id = SHAREDINVALRELCACHE_ID;
+ msg.rc.dbId = MyDatabaseId;
+ msg.rc.relId = RelationGetRelid(relation);
+ /* check AddCatcacheInvalidationMessage() for an explanation */
+ VALGRIND_MAKE_MEM_DEFINED(&msg, sizeof(msg));
+
+ SendSharedInvalidMessages(&msg, 1);
+}
+
/*
* CacheInvalidateRelcacheAll
* Register invalidation of the whole relcache at the end of command.
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 9f54a9e72b7..679cc6be1d1 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -64,6 +64,7 @@
#include "catalog/pg_type.h"
#include "catalog/schemapg.h"
#include "catalog/storage.h"
+#include "commands/cluster.h"
#include "commands/policy.h"
#include "commands/publicationcmds.h"
#include "commands/trigger.h"
@@ -1252,6 +1253,10 @@ retry:
/* make sure relation is marked as having no open file yet */
relation->rd_smgr = NULL;
+ /* Is REPACK CONCURRENTLY in progress? */
+ relation->rd_repack_concurrent =
+ is_concurrent_repack_in_progress(targetRelId);
+
/*
* now we can free the memory allocated for pg_class_tuple
*/
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 70a6b8902d1..7f1c220e00b 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -213,7 +213,6 @@ static List *exportedSnapshots = NIL;
/* Prototypes for local functions */
static void UnregisterSnapshotNoOwner(Snapshot snapshot);
-static void FreeSnapshot(Snapshot snapshot);
static void SnapshotResetXmin(void);
/* ResourceOwner callbacks to track snapshot references */
@@ -646,7 +645,7 @@ CopySnapshot(Snapshot snapshot)
* FreeSnapshot
* Free the memory associated with a snapshot.
*/
-static void
+void
FreeSnapshot(Snapshot snapshot)
{
Assert(snapshot->regd_count == 0);
diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index 6886dfbb824..cfdc9833715 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -4911,18 +4911,26 @@ match_previous_words(int pattern_id,
}
/* REPACK */
- else if (Matches("REPACK"))
+ else if (Matches("REPACK") || Matches("REPACK", "(*)"))
+ COMPLETE_WITH_SCHEMA_QUERY_PLUS(Query_for_list_of_clusterables,
+ "CONCURRENTLY");
+ else if (Matches("REPACK", "CONCURRENTLY"))
COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_clusterables);
- else if (Matches("REPACK", "(*)"))
+ else if (Matches("REPACK", "(*)", "CONCURRENTLY"))
COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_clusterables);
- /* If we have REPACK <sth>, then add "USING INDEX" */
- else if (Matches("REPACK", MatchAnyExcept("(")))
+ /* If we have REPACK [ CONCURRENTLY ] <sth>, then add "USING INDEX" */
+ else if (Matches("REPACK", MatchAnyExcept("(|CONCURRENTLY")) ||
+ Matches("REPACK", "CONCURRENTLY", MatchAnyExcept("(")))
COMPLETE_WITH("USING INDEX");
- /* If we have REPACK (*) <sth>, then add "USING INDEX" */
- else if (Matches("REPACK", "(*)", MatchAny))
+ /* If we have REPACK (*) [ CONCURRENTLY ] <sth>, then add "USING INDEX" */
+ else if (Matches("REPACK", "(*)", MatchAnyExcept("CONCURRENTLY")) ||
+ Matches("REPACK", "(*)", "CONCURRENTLY", MatchAnyExcept("(")))
COMPLETE_WITH("USING INDEX");
- /* If we have REPACK <sth> USING, then add the index as well */
- else if (Matches("REPACK", MatchAny, "USING", "INDEX"))
+ /*
+ * Complete ... [ (*) ] [ CONCURRENTLY ] <sth> USING INDEX, with a list of
+ * indexes for <sth>.
+ */
+ else if (TailMatches(MatchAnyExcept("(|CONCURRENTLY"), "USING", "INDEX"))
{
set_completion_reference(prev3_wd);
COMPLETE_WITH_SCHEMA_QUERY(Query_for_index_of_table);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 1640d9c32f7..bdeb2f83540 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -421,6 +421,10 @@ extern HTSV_Result HeapTupleSatisfiesVacuumHorizon(HeapTuple htup, Buffer buffer
TransactionId *dead_after);
extern void HeapTupleSetHintBits(HeapTupleHeader tuple, Buffer buffer,
uint16 infomask, TransactionId xid);
+extern bool HeapTupleMVCCInserted(HeapTuple htup, Snapshot snapshot,
+ Buffer buffer);
+extern bool HeapTupleMVCCNotDeleted(HeapTuple htup, Snapshot snapshot,
+ Buffer buffer);
extern bool HeapTupleHeaderIsOnlyLocked(HeapTupleHeader tuple);
extern bool HeapTupleIsSurelyDead(HeapTuple htup,
struct GlobalVisState *vistest);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index b8cb1e744ad..b1ca73d6ea5 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,6 +21,7 @@
#include "access/sdir.h"
#include "access/xact.h"
#include "executor/tuptable.h"
+#include "replication/logical.h"
#include "storage/read_stream.h"
#include "utils/rel.h"
#include "utils/snapshot.h"
@@ -630,6 +631,8 @@ typedef struct TableAmRoutine
Relation OldIndex,
bool use_sort,
TransactionId OldestXmin,
+ Snapshot snapshot,
+ LogicalDecodingContext *decoding_ctx,
TransactionId *xid_cutoff,
MultiXactId *multi_cutoff,
double *num_tuples,
@@ -1637,6 +1640,10 @@ table_relation_copy_data(Relation rel, const RelFileLocator *newrlocator)
* not needed for the relation's AM
* - *xid_cutoff - ditto
* - *multi_cutoff - ditto
+ * - snapshot - if != NULL, ignore data changes done by transactions that this
+ * (MVCC) snapshot considers still in-progress or in the future.
+ * - decoding_ctx - logical decoding context, to capture concurrent data
+ * changes.
*
* Output parameters:
* - *xid_cutoff - rel's new relfrozenxid value, may be invalid
@@ -1649,6 +1656,8 @@ table_relation_copy_for_cluster(Relation OldTable, Relation NewTable,
Relation OldIndex,
bool use_sort,
TransactionId OldestXmin,
+ Snapshot snapshot,
+ LogicalDecodingContext *decoding_ctx,
TransactionId *xid_cutoff,
MultiXactId *multi_cutoff,
double *num_tuples,
@@ -1657,6 +1666,7 @@ table_relation_copy_for_cluster(Relation OldTable, Relation NewTable,
{
OldTable->rd_tableam->relation_copy_for_cluster(OldTable, NewTable, OldIndex,
use_sort, OldestXmin,
+ snapshot, decoding_ctx,
xid_cutoff, multi_cutoff,
num_tuples, tups_vacuumed,
tups_recently_dead);
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 4daa8bef5ee..66431cc19e5 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -100,6 +100,9 @@ extern Oid index_concurrently_create_copy(Relation heapRelation,
Oid tablespaceOid,
const char *newName);
+extern NullableDatum *get_index_stattargets(Oid indexid,
+ IndexInfo *indInfo);
+
extern void index_concurrently_build(Oid heapRelationId,
Oid indexRelationId);
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index c2976905e4d..6fb5f5509c6 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -13,10 +13,15 @@
#ifndef CLUSTER_H
#define CLUSTER_H
+#include "nodes/execnodes.h"
#include "nodes/parsenodes.h"
#include "parser/parse_node.h"
+#include "replication/logical.h"
#include "storage/lock.h"
+#include "storage/relfilelocator.h"
#include "utils/relcache.h"
+#include "utils/resowner.h"
+#include "utils/tuplestore.h"
/* flag bits for ClusterParams->options */
@@ -24,6 +29,7 @@
#define CLUOPT_RECHECK 0x02 /* recheck relation state */
#define CLUOPT_RECHECK_ISCLUSTERED 0x04 /* recheck relation state for
* indisclustered */
+#define CLUOPT_CONCURRENT 0x08 /* allow concurrent data changes */
/* options for CLUSTER */
typedef struct ClusterParams
@@ -46,14 +52,91 @@ typedef enum ClusterCommand
CLUSTER_COMMAND_VACUUM
} ClusterCommand;
+/*
+ * The following definitions are used by REPACK CONCURRENTLY.
+ */
+
+extern RelFileLocator repacked_rel_locator;
+extern RelFileLocator repacked_rel_toast_locator;
+
+typedef enum
+{
+ CHANGE_INSERT,
+ CHANGE_UPDATE_OLD,
+ CHANGE_UPDATE_NEW,
+ CHANGE_DELETE,
+ CHANGE_TRUNCATE
+} ConcurrentChangeKind;
+
+typedef struct ConcurrentChange
+{
+ /* See the enum above. */
+ ConcurrentChangeKind kind;
+
+ /*
+ * The actual tuple.
+ *
+ * The tuple data follows the ConcurrentChange structure. Before use make
+ * sure the tuple is correctly aligned (ConcurrentChange can be stored as
+ * bytea) and that tuple->t_data is fixed.
+ */
+ HeapTupleData tup_data;
+} ConcurrentChange;
+
+#define SizeOfConcurrentChange (offsetof(ConcurrentChange, tup_data) + \
+ sizeof(HeapTupleData))
+
+/*
+ * Logical decoding state.
+ *
+ * Here we store the data changes that we decode from WAL while the table
+ * contents is being copied to a new storage. Also the necessary metadata
+ * needed to apply these changes to the table is stored here.
+ */
+typedef struct RepackDecodingState
+{
+ /* The relation whose changes we're decoding. */
+ Oid relid;
+
+ /*
+ * Decoded changes are stored here. Although we try to avoid excessively
+ * large batches, the changes may still need to spill to disk. The
+ * tuplestore does this transparently.
+ */
+ Tuplestorestate *tstore;
+
+ /* The current number of changes in tstore. */
+ double nchanges;
+
+ /*
+ * Descriptor to store the ConcurrentChange structure serialized (bytea).
+ * We can't store the tuple directly because the tuplestore only supports
+ * minimal tuples and we may need to transfer the OID system column from the
+ * output plugin. Also we need to transfer the change kind, so it's better
+ * to put everything in the structure than to use 2 tuplestores "in
+ * parallel".
+ */
+ TupleDesc tupdesc_change;
+
+ /* Tuple descriptor needed to update indexes. */
+ TupleDesc tupdesc;
+
+ /* Slot to retrieve data from tstore. */
+ TupleTableSlot *tsslot;
+
+ ResourceOwner resowner;
+} RepackDecodingState;
+
extern void cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel);
extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
- ClusterCommand cmd);
+ ClusterCommand cmd, bool isTopLevel);
extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid,
LOCKMODE lockmode,
ClusterCommand cmd);
extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
-
+extern void can_repack_concurrently(Relation rel);
+extern void repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
+ XLogRecPtr end_of_wal);
extern Oid make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
char relpersistence, LOCKMODE lockmode);
extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
@@ -61,9 +144,15 @@ extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
bool swap_toast_by_content,
bool check_constraints,
bool is_internal,
+ bool reindex,
TransactionId frozenXid,
MultiXactId cutoffMulti,
char newrelpersistence);
+extern Size RepackShmemSize(void);
+extern void RepackShmemInit(void);
+extern bool is_concurrent_repack_in_progress(Oid relid);
+extern void check_for_concurrent_repack(Oid relid, LOCKMODE lockmode);
+
extern void repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel);
#endif /* CLUSTER_H */
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 7644267e14f..6b1b1a4c1a7 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -67,10 +67,12 @@
#define PROGRESS_REPACK_PHASE 1
#define PROGRESS_REPACK_INDEX_RELID 2
#define PROGRESS_REPACK_HEAP_TUPLES_SCANNED 3
-#define PROGRESS_REPACK_HEAP_TUPLES_WRITTEN 4
-#define PROGRESS_REPACK_TOTAL_HEAP_BLKS 5
-#define PROGRESS_REPACK_HEAP_BLKS_SCANNED 6
-#define PROGRESS_REPACK_INDEX_REBUILD_COUNT 7
+#define PROGRESS_REPACK_HEAP_TUPLES_INSERTED 4
+#define PROGRESS_REPACK_HEAP_TUPLES_UPDATED 5
+#define PROGRESS_REPACK_HEAP_TUPLES_DELETED 6
+#define PROGRESS_REPACK_TOTAL_HEAP_BLKS 7
+#define PROGRESS_REPACK_HEAP_BLKS_SCANNED 8
+#define PROGRESS_REPACK_INDEX_REBUILD_COUNT 9
/*
* Phases of repack (as advertised via PROGRESS_REPACK_PHASE).
@@ -83,9 +85,10 @@
#define PROGRESS_REPACK_PHASE_INDEX_SCAN_HEAP 2
#define PROGRESS_REPACK_PHASE_SORT_TUPLES 3
#define PROGRESS_REPACK_PHASE_WRITE_NEW_HEAP 4
-#define PROGRESS_REPACK_PHASE_SWAP_REL_FILES 5
-#define PROGRESS_REPACK_PHASE_REBUILD_INDEX 6
-#define PROGRESS_REPACK_PHASE_FINAL_CLEANUP 7
+#define PROGRESS_REPACK_PHASE_CATCH_UP 5
+#define PROGRESS_REPACK_PHASE_SWAP_REL_FILES 6
+#define PROGRESS_REPACK_PHASE_REBUILD_INDEX 7
+#define PROGRESS_REPACK_PHASE_FINAL_CLEANUP 8
/* Commands of PROGRESS_REPACK */
#define PROGRESS_REPACK_COMMAND_REPACK 1
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index d32a4d9f2db..e36a32b83b2 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3926,6 +3926,7 @@ typedef struct RepackStmt
RangeVar *relation; /* relation being repacked */
char *indexname; /* order tuples by this index */
List *params; /* list of DefElem nodes */
+ bool concurrent; /* allow concurrent access? */
} RepackStmt;
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 6d4d2d1814c..802fc4b0823 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -73,6 +73,7 @@ extern void FreeSnapshotBuilder(SnapBuild *builder);
extern void SnapBuildSnapDecRefcount(Snapshot snap);
extern Snapshot SnapBuildInitialSnapshot(SnapBuild *builder);
+extern Snapshot SnapBuildInitialSnapshotForRepack(SnapBuild *builder);
extern Snapshot SnapBuildMVCCFromHistoric(Snapshot snapshot, bool in_place);
extern const char *SnapBuildExportSnapshot(SnapBuild *builder);
extern void SnapBuildClearExportedSnapshot(void);
diff --git a/src/include/storage/lockdefs.h b/src/include/storage/lockdefs.h
index 7f3ba0352f6..b0d81b736db 100644
--- a/src/include/storage/lockdefs.h
+++ b/src/include/storage/lockdefs.h
@@ -36,8 +36,9 @@ typedef int LOCKMODE;
#define AccessShareLock 1 /* SELECT */
#define RowShareLock 2 /* SELECT FOR UPDATE/FOR SHARE */
#define RowExclusiveLock 3 /* INSERT, UPDATE, DELETE */
-#define ShareUpdateExclusiveLock 4 /* VACUUM (non-FULL), ANALYZE, CREATE
- * INDEX CONCURRENTLY */
+#define ShareUpdateExclusiveLock 4 /* VACUUM (non-exclusive), ANALYZE, CREATE
+ * INDEX CONCURRENTLY, REPACK
+ * CONCURRENTLY */
#define ShareLock 5 /* CREATE INDEX (WITHOUT CONCURRENTLY) */
#define ShareRowExclusiveLock 6 /* like EXCLUSIVE MODE, but allows ROW
* SHARE */
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 932024b1b0b..fe9d85e5f95 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -84,3 +84,4 @@ PG_LWLOCK(50, DSMRegistry)
PG_LWLOCK(51, InjectionPoint)
PG_LWLOCK(52, SerialControl)
PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, RepackedRels)
diff --git a/src/include/utils/backend_progress.h b/src/include/utils/backend_progress.h
index 10aaec9b15c..5be04c53eda 100644
--- a/src/include/utils/backend_progress.h
+++ b/src/include/utils/backend_progress.h
@@ -34,7 +34,7 @@ typedef enum ProgressCommandType
/*
* Any command which wishes can advertise that it is running by setting
- * command, command_target, and param[]. command_target should be the OID of
+ * command, command_target, and param[]. command_target should be the OID of
* the relation which the command targets (we assume there's just one, as this
* is meant for utility commands), but the meaning of each element in the
* param array is command-specific.
@@ -55,6 +55,7 @@ extern void pgstat_progress_parallel_incr_param(int index, int64 incr);
extern void pgstat_progress_update_multi_param(int nparam, const int *index,
const int64 *val);
extern void pgstat_progress_end_command(void);
+extern void pgstat_progress_restore_state(PgBackendProgress *backup);
#endif /* BACKEND_PROGRESS_H */
diff --git a/src/include/utils/inval.h b/src/include/utils/inval.h
index 9b871caef62..3409f942098 100644
--- a/src/include/utils/inval.h
+++ b/src/include/utils/inval.h
@@ -50,6 +50,8 @@ extern void CacheInvalidateCatalog(Oid catalogId);
extern void CacheInvalidateRelcache(Relation relation);
+extern void CacheInvalidateRelcacheImmediate(Relation relation);
+
extern void CacheInvalidateRelcacheAll(void);
extern void CacheInvalidateRelcacheByTuple(HeapTuple classTuple);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index d94fddd7cef..cb485d26f44 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -253,6 +253,9 @@ typedef struct RelationData
bool pgstat_enabled; /* should relation stats be counted */
/* use "struct" here to avoid needing to include pgstat.h: */
struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+ /* Is REPACK CONCURRENTLY being performed on this relation? */
+ bool rd_repack_concurrent;
} RelationData;
@@ -692,7 +695,9 @@ RelationCloseSmgr(Relation relation)
#define RelationIsAccessibleInLogicalDecoding(relation) \
(XLogLogicalInfoActive() && \
RelationNeedsWAL(relation) && \
- (IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation)))
+ (IsCatalogRelation(relation) || \
+ RelationIsUsedAsCatalogTable(relation) || \
+ (relation)->rd_repack_concurrent))
/*
* RelationIsLogicallyLogged
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index 147b190210a..5eeabdc6c4f 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -61,6 +61,8 @@ extern Snapshot GetLatestSnapshot(void);
extern void SnapshotSetCommandId(CommandId curcid);
extern Snapshot CopySnapshot(Snapshot snapshot);
+extern void FreeSnapshot(Snapshot snapshot);
+
extern Snapshot GetCatalogSnapshot(Oid relid);
extern Snapshot GetNonHistoricCatalogSnapshot(Oid relid);
extern void InvalidateCatalogSnapshot(void);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 50d87af2fdf..587c0c85b02 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1969,17 +1969,17 @@ pg_stat_progress_cluster| SELECT s.pid,
WHEN 2 THEN 'index scanning heap'::text
WHEN 3 THEN 'sorting tuples'::text
WHEN 4 THEN 'writing new heap'::text
- WHEN 5 THEN 'swapping relation files'::text
- WHEN 6 THEN 'rebuilding index'::text
- WHEN 7 THEN 'performing final cleanup'::text
+ WHEN 6 THEN 'swapping relation files'::text
+ WHEN 7 THEN 'rebuilding index'::text
+ WHEN 8 THEN 'performing final cleanup'::text
ELSE NULL::text
END AS phase,
(s.param3)::oid AS cluster_index_relid,
s.param4 AS heap_tuples_scanned,
s.param5 AS heap_tuples_written,
- s.param6 AS heap_blks_total,
- s.param7 AS heap_blks_scanned,
- s.param8 AS index_rebuild_count
+ s.param8 AS heap_blks_total,
+ s.param9 AS heap_blks_scanned,
+ s.param10 AS index_rebuild_count
FROM (pg_stat_get_progress_info('CLUSTER'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_progress_copy| SELECT s.pid,
@@ -2055,17 +2055,20 @@ pg_stat_progress_repack| SELECT s.pid,
WHEN 2 THEN 'index scanning heap'::text
WHEN 3 THEN 'sorting tuples'::text
WHEN 4 THEN 'writing new heap'::text
- WHEN 5 THEN 'swapping relation files'::text
- WHEN 6 THEN 'rebuilding index'::text
- WHEN 7 THEN 'performing final cleanup'::text
+ WHEN 5 THEN 'catch-up'::text
+ WHEN 6 THEN 'swapping relation files'::text
+ WHEN 7 THEN 'rebuilding index'::text
+ WHEN 8 THEN 'performing final cleanup'::text
ELSE NULL::text
END AS phase,
(s.param3)::oid AS repack_index_relid,
s.param4 AS heap_tuples_scanned,
- s.param5 AS heap_tuples_written,
- s.param6 AS heap_blks_total,
- s.param7 AS heap_blks_scanned,
- s.param8 AS index_rebuild_count
+ s.param5 AS heap_tuples_inserted,
+ s.param6 AS heap_tuples_updated,
+ s.param7 AS heap_tuples_deleted,
+ s.param8 AS heap_blks_total,
+ s.param9 AS heap_blks_scanned,
+ s.param10 AS index_rebuild_count
FROM (pg_stat_get_progress_info('REPACK'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_progress_vacuum| SELECT s.pid,
--
2.39.5
v09-0005-Preserve-visibility-information-of-the-concurren.patch
From 35c857f430e9d4f09ba7a8fd6ac901d7b166d40b Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Wed, 26 Feb 2025 09:17:20 +0100
Subject: [PATCH v09 5/9] Preserve visibility information of the concurrent
data changes.
As explained in the commit message of the preceding patch of the series, the
data changes done by applications while REPACK CONCURRENTLY is copying the
table contents to a new file are decoded from WAL and eventually also applied
to the new file. To reduce the complexity a little bit, the preceding patch
uses the current transaction (i.e. the transaction opened by the REPACK command)
to execute those INSERT, UPDATE and DELETE commands.
However, REPACK is not expected to change visibility of tuples. Therefore,
this patch fixes the handling of the "concurrent data changes". Now the tuples
written into the new table storage have the same XID and command ID (CID) as
they had in the old storage.
A related change made here is that the data changes (INSERT, UPDATE, DELETE) we
"replay" on the new storage are not themselves logically decoded. First, the
logical decoding subsystem does not expect an already committed transaction to
be decoded again. Second, repeated decoding would just be wasted effort.
---
src/backend/access/common/toast_internals.c | 3 +-
src/backend/access/heap/heapam.c | 73 ++++++++----
src/backend/access/heap/heapam_handler.c | 14 ++-
src/backend/access/transam/xact.c | 52 ++++++++
src/backend/commands/cluster.c | 111 ++++++++++++++++--
src/backend/replication/logical/decode.c | 76 ++++++++++--
src/backend/replication/logical/snapbuild.c | 22 ++--
.../pgoutput_repack/pgoutput_repack.c | 68 +++++++++--
src/include/access/heapam.h | 15 ++-
src/include/access/heapam_xlog.h | 2 +
src/include/access/xact.h | 2 +
src/include/commands/cluster.h | 18 +++
src/include/utils/snapshot.h | 3 +
13 files changed, 389 insertions(+), 70 deletions(-)
diff --git a/src/backend/access/common/toast_internals.c b/src/backend/access/common/toast_internals.c
index 7d8be8346ce..75d889ec72c 100644
--- a/src/backend/access/common/toast_internals.c
+++ b/src/backend/access/common/toast_internals.c
@@ -320,7 +320,8 @@ toast_save_datum(Relation rel, Datum value,
memcpy(VARDATA(&chunk_data), data_p, chunk_size);
toasttup = heap_form_tuple(toasttupDesc, t_values, t_isnull);
- heap_insert(toastrel, toasttup, mycid, options, NULL);
+ heap_insert(toastrel, toasttup, GetCurrentTransactionId(), mycid,
+ options, NULL);
/*
* Create the index entry. We cheat a little here by not using
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 1be1ef22d1e..c7d7cbe2f65 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -60,7 +60,8 @@ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
HeapTuple newtup, HeapTuple old_key_tuple,
- bool all_visible_cleared, bool new_all_visible_cleared);
+ bool all_visible_cleared, bool new_all_visible_cleared,
+ bool wal_logical);
#ifdef USE_ASSERT_CHECKING
static void check_lock_if_inplace_updateable_rel(Relation relation,
ItemPointer otid,
@@ -2070,7 +2071,7 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
/*
* heap_insert - insert tuple into a heap
*
- * The new tuple is stamped with current transaction ID and the specified
+ * The new tuple is stamped with specified transaction ID and the specified
* command ID.
*
* See table_tuple_insert for comments about most of the input flags, except
@@ -2086,15 +2087,16 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
* reflected into *tup.
*/
void
-heap_insert(Relation relation, HeapTuple tup, CommandId cid,
- int options, BulkInsertState bistate)
+heap_insert(Relation relation, HeapTuple tup, TransactionId xid,
+ CommandId cid, int options, BulkInsertState bistate)
{
- TransactionId xid = GetCurrentTransactionId();
HeapTuple heaptup;
Buffer buffer;
Buffer vmbuffer = InvalidBuffer;
bool all_visible_cleared = false;
+ Assert(TransactionIdIsValid(xid));
+
/* Cheap, simplistic check that the tuple matches the rel's rowtype. */
Assert(HeapTupleHeaderGetNatts(tup->t_data) <=
RelationGetNumberOfAttributes(relation));
@@ -2725,7 +2727,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
void
simple_heap_insert(Relation relation, HeapTuple tup)
{
- heap_insert(relation, tup, GetCurrentCommandId(true), 0, NULL);
+ heap_insert(relation, tup, GetCurrentTransactionId(),
+ GetCurrentCommandId(true), 0, NULL);
}
/*
@@ -2782,11 +2785,11 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
*/
TM_Result
heap_delete(Relation relation, ItemPointer tid,
- CommandId cid, Snapshot crosscheck, bool wait,
- TM_FailureData *tmfd, bool changingPart)
+ TransactionId xid, CommandId cid, Snapshot crosscheck, bool wait,
+ TM_FailureData *tmfd, bool changingPart,
+ bool wal_logical)
{
TM_Result result;
- TransactionId xid = GetCurrentTransactionId();
ItemId lp;
HeapTupleData tp;
Page page;
@@ -2803,6 +2806,7 @@ heap_delete(Relation relation, ItemPointer tid,
bool old_key_copied = false;
Assert(ItemPointerIsValid(tid));
+ Assert(TransactionIdIsValid(xid));
/*
* Forbid this during a parallel operation, lest it allocate a combo CID.
@@ -3028,7 +3032,8 @@ l1:
* Compute replica identity tuple before entering the critical section so
* we don't PANIC upon a memory allocation failure.
*/
- old_key_tuple = ExtractReplicaIdentity(relation, &tp, true, &old_key_copied);
+ old_key_tuple = wal_logical ?
+ ExtractReplicaIdentity(relation, &tp, true, &old_key_copied) : NULL;
/*
* If this is the first possibly-multixact-able operation in the current
@@ -3096,8 +3101,12 @@ l1:
/*
* For logical decode we need combo CIDs to properly decode the
* catalog
+ *
+ * Like in heap_insert(), visibility is unchanged when called from
+ * VACUUM FULL / CLUSTER.
*/
- if (RelationIsAccessibleInLogicalDecoding(relation))
+ if (wal_logical &&
+ RelationIsAccessibleInLogicalDecoding(relation))
log_heap_new_cid(relation, &tp);
xlrec.flags = 0;
@@ -3118,6 +3127,15 @@ l1:
xlrec.flags |= XLH_DELETE_CONTAINS_OLD_KEY;
}
+ /*
+ * Unlike UPDATE, DELETE is decoded even if there is no old key, so it
+ * does not help to clear both XLH_DELETE_CONTAINS_OLD_TUPLE and
+ * XLH_DELETE_CONTAINS_OLD_KEY. Thus we need an extra flag. TODO
+ * Consider not decoding tuples w/o the old tuple/key instead.
+ */
+ if (!wal_logical)
+ xlrec.flags |= XLH_DELETE_NO_LOGICAL;
+
XLogBeginInsert();
XLogRegisterData(&xlrec, SizeOfHeapDelete);
@@ -3207,10 +3225,11 @@ simple_heap_delete(Relation relation, ItemPointer tid)
TM_Result result;
TM_FailureData tmfd;
- result = heap_delete(relation, tid,
+ result = heap_delete(relation, tid, GetCurrentTransactionId(),
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &tmfd, false /* changingPart */ );
&tmfd, false /* changingPart */ ,
+ true /* wal_logical */);
switch (result)
{
case TM_SelfModified:
@@ -3249,12 +3268,11 @@ simple_heap_delete(Relation relation, ItemPointer tid)
*/
TM_Result
heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
- CommandId cid, Snapshot crosscheck, bool wait,
- TM_FailureData *tmfd, LockTupleMode *lockmode,
- TU_UpdateIndexes *update_indexes)
+ TransactionId xid, CommandId cid, Snapshot crosscheck,
+ bool wait, TM_FailureData *tmfd, LockTupleMode *lockmode,
+ TU_UpdateIndexes *update_indexes, bool wal_logical)
{
TM_Result result;
- TransactionId xid = GetCurrentTransactionId();
Bitmapset *hot_attrs;
Bitmapset *sum_attrs;
Bitmapset *key_attrs;
@@ -3294,6 +3312,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
infomask2_new_tuple;
Assert(ItemPointerIsValid(otid));
+ Assert(TransactionIdIsValid(xid));
/* Cheap, simplistic check that the tuple matches the rel's rowtype. */
Assert(HeapTupleHeaderGetNatts(newtup->t_data) <=
@@ -4131,8 +4150,12 @@ l2:
/*
* For logical decoding we need combo CIDs to properly decode the
* catalog.
+ *
+ * Like in heap_insert(), visibility is unchanged when called from
+ * VACUUM FULL / CLUSTER.
*/
- if (RelationIsAccessibleInLogicalDecoding(relation))
+ if (wal_logical &&
+ RelationIsAccessibleInLogicalDecoding(relation))
{
log_heap_new_cid(relation, &oldtup);
log_heap_new_cid(relation, heaptup);
@@ -4142,7 +4165,8 @@ l2:
newbuf, &oldtup, heaptup,
old_key_tuple,
all_visible_cleared,
- all_visible_cleared_new);
+ all_visible_cleared_new,
+ wal_logical);
if (newbuf != buffer)
{
PageSetLSN(BufferGetPage(newbuf), recptr);
@@ -4497,10 +4521,10 @@ simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup,
TM_FailureData tmfd;
LockTupleMode lockmode;
- result = heap_update(relation, otid, tup,
+ result = heap_update(relation, otid, tup, GetCurrentTransactionId(),
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &tmfd, &lockmode, update_indexes);
+ &tmfd, &lockmode, update_indexes, true);
switch (result)
{
case TM_SelfModified:
@@ -8833,7 +8857,8 @@ static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup, HeapTuple newtup,
HeapTuple old_key_tuple,
- bool all_visible_cleared, bool new_all_visible_cleared)
+ bool all_visible_cleared, bool new_all_visible_cleared,
+ bool wal_logical)
{
xl_heap_update xlrec;
xl_heap_header xlhdr;
@@ -8844,10 +8869,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
suffixlen = 0;
XLogRecPtr recptr;
Page page = BufferGetPage(newbuf);
- bool need_tuple_data = RelationIsLogicallyLogged(reln);
+ bool need_tuple_data;
bool init;
int bufflags;
+ need_tuple_data = RelationIsLogicallyLogged(reln) && wal_logical;
+
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 6e228addb47..485d22b9488 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -256,7 +256,8 @@ heapam_tuple_insert(Relation relation, TupleTableSlot *slot, CommandId cid,
tuple->t_tableOid = slot->tts_tableOid;
/* Perform the insertion, and copy the resulting ItemPointer */
- heap_insert(relation, tuple, cid, options, bistate);
+ heap_insert(relation, tuple, GetCurrentTransactionId(), cid, options,
+ bistate);
ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
if (shouldFree)
@@ -279,7 +280,8 @@ heapam_tuple_insert_speculative(Relation relation, TupleTableSlot *slot,
options |= HEAP_INSERT_SPECULATIVE;
/* Perform the insertion, and copy the resulting ItemPointer */
- heap_insert(relation, tuple, cid, options, bistate);
+ heap_insert(relation, tuple, GetCurrentTransactionId(), cid, options,
+ bistate);
ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
if (shouldFree)
@@ -313,7 +315,8 @@ heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
* the storage itself is cleaning the dead tuples by itself, it is the
* time to call the index tuple deletion also.
*/
- return heap_delete(relation, tid, cid, crosscheck, wait, tmfd, changingPart);
+ return heap_delete(relation, tid, GetCurrentTransactionId(), cid,
+ crosscheck, wait, tmfd, changingPart, true);
}
@@ -331,8 +334,9 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
slot->tts_tableOid = RelationGetRelid(relation);
tuple->t_tableOid = slot->tts_tableOid;
- result = heap_update(relation, otid, tuple, cid, crosscheck, wait,
- tmfd, lockmode, update_indexes);
+ result = heap_update(relation, otid, tuple, GetCurrentTransactionId(),
+ cid, crosscheck, wait,
+ tmfd, lockmode, update_indexes, true);
ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
/*
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b885513f765..e766be7b81d 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -126,6 +126,18 @@ static FullTransactionId XactTopFullTransactionId = {InvalidTransactionId};
static int nParallelCurrentXids = 0;
static TransactionId *ParallelCurrentXids;
+/*
+ * Another case that requires TransactionIdIsCurrentTransactionId() to behave
+ * specially is when REPACK CONCURRENTLY is processing data changes made in
+ * the old storage of a table by other transactions. When applying the changes
+ * to the new storage, the backend executing the REPACK command needs to act
+ * on behalf of those other transactions. The transactions responsible for the
+ * changes in the old storage are stored in this array, sorted by
+ * xidComparator.
+ */
+static int nRepackCurrentXids = 0;
+static TransactionId *RepackCurrentXids = NULL;
+
/*
* Miscellaneous flag bits to record events which occur on the top level
* transaction. These flags are only persisted in MyXactFlags and are intended
@@ -972,6 +984,8 @@ TransactionIdIsCurrentTransactionId(TransactionId xid)
int low,
high;
+ Assert(nRepackCurrentXids == 0);
+
low = 0;
high = nParallelCurrentXids - 1;
while (low <= high)
@@ -991,6 +1005,21 @@ TransactionIdIsCurrentTransactionId(TransactionId xid)
return false;
}
+ /*
+ * When executing REPACK CONCURRENTLY, the array of current transactions
+ * is provided explicitly via SetRepackCurrentXids().
+ */
+ if (nRepackCurrentXids > 0)
+ {
+ Assert(nParallelCurrentXids == 0);
+
+ return bsearch(&xid,
+ RepackCurrentXids,
+ nRepackCurrentXids,
+ sizeof(TransactionId),
+ xidComparator) != NULL;
+ }
+
/*
* We will return true for the Xid of the current subtransaction, any of
* its subcommitted children, any of its parents, or any of their
@@ -5640,6 +5669,29 @@ EndParallelWorkerTransaction(void)
CurrentTransactionState->blockState = TBLOCK_DEFAULT;
}
+/*
+ * SetRepackCurrentXids
+ * Set the XID array that TransactionIdIsCurrentTransactionId() should
+ * use.
+ */
+void
+SetRepackCurrentXids(TransactionId *xip, int xcnt)
+{
+ RepackCurrentXids = xip;
+ nRepackCurrentXids = xcnt;
+}
+
+/*
+ * ResetRepackCurrentXids
+ * Undo the effect of SetRepackCurrentXids().
+ */
+void
+ResetRepackCurrentXids(void)
+{
+ RepackCurrentXids = NULL;
+ nRepackCurrentXids = 0;
+}
+
/*
* ShowTransactionState
* Debug support
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 25a0b9c6119..8e8fe22d6d8 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -209,6 +209,7 @@ static void apply_concurrent_delete(Relation rel, HeapTuple tup_target,
ConcurrentChange *change);
static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
HeapTuple tup_key,
+ Snapshot snapshot,
IndexInsertState *iistate,
TupleTableSlot *ident_slot,
IndexScanDesc *scan_p);
@@ -2965,6 +2966,9 @@ setup_logical_decoding(Oid relid, const char *slotname, TupleDesc tupdesc)
dstate->relid = relid;
dstate->tstore = tuplestore_begin_heap(false, false,
maintenance_work_mem);
+#ifdef USE_ASSERT_CHECKING
+ dstate->last_change_xid = InvalidTransactionId;
+#endif
dstate->tupdesc = tupdesc;
/* Initialize the descriptor to store the changes ... */
@@ -3120,6 +3124,7 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
tup_exist;
char *change_raw, *src;
ConcurrentChange change;
+ Snapshot snapshot;
bool isnull[1];
Datum values[1];
@@ -3188,8 +3193,30 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
/*
* Find the tuple to be updated or deleted.
+ *
+ * As the table being repacked concurrently is temporarily treated as a
+ * "user catalog", the new CID is WAL-logged and decoded. And since we
+ * use the same XID that the original DMLs did, the snapshot used
+ * for the logical decoding (by now converted to a non-historic
+ * MVCC snapshot) should see the tuples inserted previously into
+ * the new heap and/or updated there.
*/
- tup_exist = find_target_tuple(rel, key, nkeys, tup_key,
+ snapshot = change.snapshot;
+
+ /*
+ * Set what should be considered current transaction (and
+ * subtransactions) during visibility check.
+ *
+ * Note that this snapshot was created from a historic snapshot
+ * using SnapBuildMVCCFromHistoric(), which does not touch
+ * 'subxip'. Thus, unlike in a regular MVCC snapshot, the array
+ * only contains the transactions whose data changes we are
+ * applying, and their subtransactions. That's exactly what we need
+ * to check whether a particular xact is a "current transaction".
+ */
+ SetRepackCurrentXids(snapshot->subxip, snapshot->subxcnt);
+
+ tup_exist = find_target_tuple(rel, key, nkeys, tup_key, snapshot,
iistate, ident_slot, &ind_scan);
if (tup_exist == NULL)
elog(ERROR, "Failed to find target tuple");
@@ -3200,6 +3227,8 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
else
apply_concurrent_delete(rel, tup_exist, &change);
+ ResetRepackCurrentXids();
+
if (tup_old != NULL)
{
pfree(tup_old);
@@ -3212,11 +3241,14 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
else
elog(ERROR, "Unrecognized kind of change: %d", change.kind);
- /* If there's any change, make it visible to the next iteration. */
- if (change.kind != CHANGE_UPDATE_OLD)
+ /* Free the snapshot if this is the last change that needed it. */
+ Assert(change.snapshot->active_count > 0);
+ change.snapshot->active_count--;
+ if (change.snapshot->active_count == 0)
{
- CommandCounterIncrement();
- UpdateActiveSnapshotCommandId();
+ if (change.snapshot == dstate->snapshot)
+ dstate->snapshot = NULL;
+ FreeSnapshot(change.snapshot);
}
/* TTSOpsMinimalTuple has .get_heap_tuple==NULL. */
@@ -3236,10 +3268,30 @@ static void
apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
IndexInsertState *iistate, TupleTableSlot *index_slot)
{
+ Snapshot snapshot = change->snapshot;
List *recheck;
+ /*
+ * For INSERT, the visibility information is not important, but we use the
+ * snapshot to get CID. Index functions might need the whole snapshot
+ * anyway.
+ */
+ SetRepackCurrentXids(snapshot->subxip, snapshot->subxcnt);
- heap_insert(rel, tup, GetCurrentCommandId(true), HEAP_INSERT_NO_LOGICAL, NULL);
+ /*
+ * Write the tuple into the new heap.
+ *
+ * The snapshot is the one we used to decode the insert (though converted
+ * to "non-historic" MVCC snapshot), i.e. the snapshot's curcid is the
+ * tuple CID incremented by one (due to the "new CID" WAL record that got
+ * written along with the INSERT record). Thus if we want to use the
+ * original CID, we need to subtract 1 from curcid.
+ */
+ Assert(snapshot->curcid != InvalidCommandId &&
+ snapshot->curcid > FirstCommandId);
+
+ heap_insert(rel, tup, change->xid, snapshot->curcid - 1,
+ HEAP_INSERT_NO_LOGICAL, NULL);
/*
* Update indexes.
@@ -3247,6 +3299,7 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
* In case functions in the index need the active snapshot and caller
* hasn't set one.
*/
+ PushActiveSnapshot(snapshot);
ExecStoreHeapTuple(tup, index_slot, false);
recheck = ExecInsertIndexTuples(iistate->rri,
index_slot,
@@ -3257,6 +3310,8 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
NIL, /* arbiterIndexes */
false /* onlySummarizing */
);
+ PopActiveSnapshot();
+ ResetRepackCurrentXids();
/*
* If recheck is required, it must have been performed on the source
@@ -3274,18 +3329,36 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
TupleTableSlot *index_slot)
{
List *recheck;
+ LockTupleMode lockmode;
TU_UpdateIndexes update_indexes;
+ TM_Result res;
+ Snapshot snapshot = change->snapshot;
+ TM_FailureData tmfd;
/*
* Write the new tuple into the new heap. ('tup' gets the TID assigned
* here.)
+ *
+ * Regarding CID, see the comment in apply_concurrent_insert().
*/
- simple_heap_update(rel, &tup_target->t_self, tup, &update_indexes);
+ Assert(snapshot->curcid != InvalidCommandId &&
+ snapshot->curcid > FirstCommandId);
+
+ res = heap_update(rel, &tup_target->t_self, tup,
+ change->xid, snapshot->curcid - 1,
+ InvalidSnapshot,
+ false, /* no wait - only we are doing changes */
+ &tmfd, &lockmode, &update_indexes,
+ /* wal_logical */
+ false);
+ if (res != TM_Ok)
+ ereport(ERROR, (errmsg("failed to apply concurrent UPDATE")));
ExecStoreHeapTuple(tup, index_slot, false);
if (update_indexes != TU_None)
{
+ PushActiveSnapshot(snapshot);
recheck = ExecInsertIndexTuples(iistate->rri,
index_slot,
iistate->estate,
@@ -3295,6 +3368,7 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
NIL, /* arbiterIndexes */
/* onlySummarizing */
update_indexes == TU_Summarizing);
+ PopActiveSnapshot();
list_free(recheck);
}
@@ -3305,7 +3379,22 @@ static void
apply_concurrent_delete(Relation rel, HeapTuple tup_target,
ConcurrentChange *change)
{
- simple_heap_delete(rel, &tup_target->t_self);
+ TM_Result res;
+ TM_FailureData tmfd;
+ Snapshot snapshot = change->snapshot;
+
+ /* Regarding CID, see the comment in apply_concurrent_insert(). */
+ Assert(snapshot->curcid != InvalidCommandId &&
+ snapshot->curcid > FirstCommandId);
+
+ res = heap_delete(rel, &tup_target->t_self, change->xid,
+ snapshot->curcid - 1, InvalidSnapshot, false,
+ &tmfd, false,
+ /* wal_logical */
+ false);
+
+ if (res != TM_Ok)
+ ereport(ERROR, (errmsg("failed to apply concurrent DELETE")));
pgstat_progress_incr_param(PROGRESS_REPACK_HEAP_TUPLES_DELETED, 1);
}
@@ -3323,7 +3412,7 @@ apply_concurrent_delete(Relation rel, HeapTuple tup_target,
*/
static HeapTuple
find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
- IndexInsertState *iistate,
+ Snapshot snapshot, IndexInsertState *iistate,
TupleTableSlot *ident_slot, IndexScanDesc *scan_p)
{
IndexScanDesc scan;
@@ -3332,7 +3421,7 @@ find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
HeapTuple result = NULL;
/* XXX no instrumentation for now */
- scan = index_beginscan(rel, iistate->ident_index, GetActiveSnapshot(),
+ scan = index_beginscan(rel, iistate->ident_index, snapshot,
NULL, nkeys, 0);
*scan_p = scan;
index_rescan(scan, key, nkeys, NULL, 0);
@@ -3404,6 +3493,8 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
}
PG_FINALLY();
{
+ ResetRepackCurrentXids();
+
if (rel_src)
rel_dst->rd_toastoid = InvalidOid;
}
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 00f7bbc5f59..5cdb6299d81 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -469,9 +469,18 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
SnapBuild *builder = ctx->snapshot_builder;
/*
- * Check if REPACK CONCURRENTLY is being performed by this backend. If so,
- * only decode data changes of the table that it is processing, and the
- * changes of its TOAST relation.
+ * If the change is not intended for logical decoding, do not even
+ * establish a transaction for it. This is particularly important if the
+ * record was generated by REPACK CONCURRENTLY, because this command uses
+ * the original XID when making changes in the new storage. The decoding
+ * subsystem probably does not expect to see the same transaction decoded
+ * multiple times.
+ */
+
+ /*
+ * First, check if REPACK CONCURRENTLY is being performed by this
+ * backend. If so, only decode data changes of the table that it is
+ * processing, and the changes of its TOAST relation.
*
* (TOAST locator should not be set unless the main is.)
*/
@@ -491,6 +500,60 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
return;
}
+ /*
+ * Second, skip records which do not contain sufficient information for
+ * the decoding.
+ *
+ * The backend executing REPACK CONCURRENTLY should not return here,
+ * because the records which passed the checks above should be eligible
+ * for decoding. However, REPACK CONCURRENTLY generates WAL when writing
+ * data into the new table, and that WAL should not be decoded by other
+ * backends. This is where those backends skip it.
+ */
+ switch (info)
+ {
+ case XLOG_HEAP_INSERT:
+ {
+ xl_heap_insert *rec;
+
+ rec = (xl_heap_insert *) XLogRecGetData(buf->record);
+ /*
+ * (Besides insertion into the main heap by REPACK CONCURRENTLY,
+ * this also happens when raw_heap_insert marks the TOAST record as
+ * HEAP_INSERT_NO_LOGICAL.)
+ */
+ if ((rec->flags & XLH_INSERT_CONTAINS_NEW_TUPLE) == 0)
+ return;
+
+ break;
+ }
+
+ case XLOG_HEAP_HOT_UPDATE:
+ case XLOG_HEAP_UPDATE:
+ {
+ xl_heap_update *rec;
+
+ rec = (xl_heap_update *) XLogRecGetData(buf->record);
+ if ((rec->flags &
+ (XLH_UPDATE_CONTAINS_NEW_TUPLE |
+ XLH_UPDATE_CONTAINS_OLD_TUPLE |
+ XLH_UPDATE_CONTAINS_OLD_KEY)) == 0)
+ return;
+
+ break;
+ }
+
+ case XLOG_HEAP_DELETE:
+ {
+ xl_heap_delete *rec;
+
+ rec = (xl_heap_delete *) XLogRecGetData(buf->record);
+ if (rec->flags & XLH_DELETE_NO_LOGICAL)
+ return;
+ break;
+ }
+ }
+
ReorderBufferProcessXid(ctx->reorder, xid, buf->origptr);
/*
@@ -923,13 +986,6 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
xlrec = (xl_heap_insert *) XLogRecGetData(r);
- /*
- * Ignore insert records without new tuples (this does happen when
- * raw_heap_insert marks the TOAST record as HEAP_INSERT_NO_LOGICAL).
- */
- if (!(xlrec->flags & XLH_INSERT_CONTAINS_NEW_TUPLE))
- return;
-
/* only interested in our database */
XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
if (target_locator.dbOid != ctx->slot->data.database)
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index da0a1d227e4..3497466da2f 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -155,7 +155,7 @@ static bool ExportInProgress = false;
static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
/* snapshot building/manipulation/distribution functions */
-static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
+static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder, XLogRecPtr lsn);
static void SnapBuildFreeSnapshot(Snapshot snap);
@@ -352,12 +352,17 @@ SnapBuildSnapDecRefcount(Snapshot snap)
* Build a new snapshot, based on currently committed catalog-modifying
* transactions.
*
+ * 'lsn' is the location of the commit record (of a catalog-changing
+ * transaction) that triggered creation of the snapshot. Pass
+ * InvalidXLogRecPtr for the transaction base snapshot or if it the user of
+ * the snapshot should not need the LSN.
+ *
* In-progress transactions with catalog access are *not* allowed to modify
* these snapshots; they have to copy them and fill in appropriate ->curcid
* and ->subxip/subxcnt values.
*/
static Snapshot
-SnapBuildBuildSnapshot(SnapBuild *builder)
+SnapBuildBuildSnapshot(SnapBuild *builder, XLogRecPtr lsn)
{
Snapshot snapshot;
Size ssize;
@@ -425,6 +430,7 @@ SnapBuildBuildSnapshot(SnapBuild *builder)
snapshot->active_count = 0;
snapshot->regd_count = 0;
snapshot->snapXactCompletionCount = 0;
+ snapshot->lsn = lsn;
return snapshot;
}
@@ -461,7 +467,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
if (TransactionIdIsValid(MyProc->xmin))
elog(ERROR, "cannot build an initial slot snapshot when MyProc->xmin already is valid");
- snap = SnapBuildBuildSnapshot(builder);
+ snap = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
/*
* We know that snap->xmin is alive, enforced by the logical xmin
@@ -502,7 +508,7 @@ SnapBuildInitialSnapshotForRepack(SnapBuild *builder)
Assert(builder->state == SNAPBUILD_CONSISTENT);
- snap = SnapBuildBuildSnapshot(builder);
+ snap = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
return SnapBuildMVCCFromHistoric(snap, false);
}
@@ -636,7 +642,7 @@ SnapBuildGetOrBuildSnapshot(SnapBuild *builder)
/* only build a new snapshot if we don't have a prebuilt one */
if (builder->snapshot == NULL)
{
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
/* increase refcount for the snapshot builder */
SnapBuildSnapIncRefcount(builder->snapshot);
}
@@ -716,7 +722,7 @@ SnapBuildProcessChange(SnapBuild *builder, TransactionId xid, XLogRecPtr lsn)
/* only build a new snapshot if we don't have a prebuilt one */
if (builder->snapshot == NULL)
{
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, lsn);
/* increase refcount for the snapshot builder */
SnapBuildSnapIncRefcount(builder->snapshot);
}
@@ -1085,7 +1091,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
if (builder->snapshot)
SnapBuildSnapDecRefcount(builder->snapshot);
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, lsn);
/* we might need to execute invalidations, add snapshot */
if (!ReorderBufferXidHasBaseSnapshot(builder->reorder, xid))
@@ -1910,7 +1916,7 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
{
SnapBuildSnapDecRefcount(builder->snapshot);
}
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
SnapBuildSnapIncRefcount(builder->snapshot);
ReorderBufferSetRestartPoint(builder->reorder, lsn);
diff --git a/src/backend/replication/pgoutput_repack/pgoutput_repack.c b/src/backend/replication/pgoutput_repack/pgoutput_repack.c
index 1ef9b3cbfd7..d42d93a8b64 100644
--- a/src/backend/replication/pgoutput_repack/pgoutput_repack.c
+++ b/src/backend/replication/pgoutput_repack/pgoutput_repack.c
@@ -32,7 +32,8 @@ static void plugin_truncate(struct LogicalDecodingContext *ctx,
Relation relations[],
ReorderBufferChange *change);
static void store_change(LogicalDecodingContext *ctx,
- ConcurrentChangeKind kind, HeapTuple tuple);
+ ConcurrentChangeKind kind, HeapTuple tuple,
+ TransactionId xid);
void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
@@ -100,6 +101,7 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change)
{
RepackDecodingState *dstate;
+ Snapshot snapshot;
dstate = (RepackDecodingState *) ctx->output_writer_private;
@@ -107,6 +109,48 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (relation->rd_id != dstate->relid)
return;
+ /*
+ * Catalog snapshot is fine because the table we are processing is
+ * temporarily considered a user catalog table.
+ */
+ snapshot = GetCatalogSnapshot(InvalidOid);
+ Assert(snapshot->snapshot_type == SNAPSHOT_HISTORIC_MVCC);
+ Assert(!snapshot->suboverflowed);
+
+ /*
+ * This should not happen, but if we don't have enough information to
+ * apply a new snapshot, the consequences would be bad. Thus prefer ERROR
+ * to Assert().
+ */
+ if (XLogRecPtrIsInvalid(snapshot->lsn))
+ ereport(ERROR, (errmsg("snapshot has invalid LSN")));
+
+ /*
+ * reorderbuffer.c changes the catalog snapshot as soon as it sees a new
+ * CID or a commit record of a catalog-changing transaction.
+ */
+ if (dstate->snapshot == NULL || snapshot->lsn != dstate->snapshot_lsn ||
+ snapshot->curcid != dstate->snapshot->curcid)
+ {
+ /* CID should not go backwards. */
+ Assert(dstate->snapshot == NULL ||
+ snapshot->curcid >= dstate->snapshot->curcid ||
+ change->txn->xid != dstate->last_change_xid);
+
+ /*
+ * XXX Is it a problem that the copy is created in
+ * TopTransactionContext?
+ *
+ * XXX Wouldn't it be o.k. for SnapBuildMVCCFromHistoric() to set xcnt
+ * to 0 instead of converting xip in this case? The point is that
+ * transactions which are still in progress from the perspective of
+ * reorderbuffer.c could not be replayed yet, so we do not need to
+ * examine their XIDs.
+ */
+ dstate->snapshot = SnapBuildMVCCFromHistoric(snapshot, false);
+ dstate->snapshot_lsn = snapshot->lsn;
+ }
+
/* Decode entry depending on its type */
switch (change->action)
{
@@ -124,7 +168,7 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (newtuple == NULL)
elog(ERROR, "Incomplete insert info.");
- store_change(ctx, CHANGE_INSERT, newtuple);
+ store_change(ctx, CHANGE_INSERT, newtuple, change->txn->xid);
}
break;
case REORDER_BUFFER_CHANGE_UPDATE:
@@ -141,9 +185,11 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
elog(ERROR, "Incomplete update info.");
if (oldtuple != NULL)
- store_change(ctx, CHANGE_UPDATE_OLD, oldtuple);
+ store_change(ctx, CHANGE_UPDATE_OLD, oldtuple,
+ change->txn->xid);
- store_change(ctx, CHANGE_UPDATE_NEW, newtuple);
+ store_change(ctx, CHANGE_UPDATE_NEW, newtuple,
+ change->txn->xid);
}
break;
case REORDER_BUFFER_CHANGE_DELETE:
@@ -156,7 +202,7 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (oldtuple == NULL)
elog(ERROR, "Incomplete delete info.");
- store_change(ctx, CHANGE_DELETE, oldtuple);
+ store_change(ctx, CHANGE_DELETE, oldtuple, change->txn->xid);
}
break;
default:
@@ -190,13 +236,13 @@ plugin_truncate(struct LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (i == nrelations)
return;
- store_change(ctx, CHANGE_TRUNCATE, NULL);
+ store_change(ctx, CHANGE_TRUNCATE, NULL, InvalidTransactionId);
}
/* Store concurrent data change. */
static void
store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
- HeapTuple tuple)
+ HeapTuple tuple, TransactionId xid)
{
RepackDecodingState *dstate;
char *change_raw;
@@ -264,6 +310,11 @@ store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
dst = dst_start + SizeOfConcurrentChange;
memcpy(dst, tuple->t_data, tuple->t_len);
+ /* Initialize the other fields. */
+ change.xid = xid;
+ change.snapshot = dstate->snapshot;
+ dstate->snapshot->active_count++;
+
/* The data has been copied. */
if (flattened)
pfree(tuple);
@@ -277,6 +328,9 @@ store:
isnull[0] = false;
tuplestore_putvalues(dstate->tstore, dstate->tupdesc_change,
values, isnull);
+#ifdef USE_ASSERT_CHECKING
+ dstate->last_change_xid = xid;
+#endif
/* Accounting. */
dstate->nchanges++;
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index bdeb2f83540..b0c6f1d916f 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -325,21 +325,24 @@ extern BulkInsertState GetBulkInsertState(void);
extern void FreeBulkInsertState(BulkInsertState);
extern void ReleaseBulkInsertStatePin(BulkInsertState bistate);
-extern void heap_insert(Relation relation, HeapTuple tup, CommandId cid,
- int options, BulkInsertState bistate);
+extern void heap_insert(Relation relation, HeapTuple tup, TransactionId xid,
+ CommandId cid, int options, BulkInsertState bistate);
extern void heap_multi_insert(Relation relation, struct TupleTableSlot **slots,
int ntuples, CommandId cid, int options,
BulkInsertState bistate);
extern TM_Result heap_delete(Relation relation, ItemPointer tid,
- CommandId cid, Snapshot crosscheck, bool wait,
- struct TM_FailureData *tmfd, bool changingPart);
+ TransactionId xid, CommandId cid,
+ Snapshot crosscheck, bool wait,
+ struct TM_FailureData *tmfd, bool changingPart,
+ bool wal_logical);
extern void heap_finish_speculative(Relation relation, ItemPointer tid);
extern void heap_abort_speculative(Relation relation, ItemPointer tid);
extern TM_Result heap_update(Relation relation, ItemPointer otid,
- HeapTuple newtup,
+ HeapTuple newtup, TransactionId xid,
CommandId cid, Snapshot crosscheck, bool wait,
struct TM_FailureData *tmfd, LockTupleMode *lockmode,
- TU_UpdateIndexes *update_indexes);
+ TU_UpdateIndexes *update_indexes,
+ bool wal_logical);
extern TM_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
CommandId cid, LockTupleMode mode, LockWaitPolicy wait_policy,
bool follow_updates,
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 277df6b3cf0..8d4af07f840 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -104,6 +104,8 @@
#define XLH_DELETE_CONTAINS_OLD_KEY (1<<2)
#define XLH_DELETE_IS_SUPER (1<<3)
#define XLH_DELETE_IS_PARTITION_MOVE (1<<4)
+/* See heap_delete() */
+#define XLH_DELETE_NO_LOGICAL (1<<5)
/* convenience macro for checking whether any form of old tuple was logged */
#define XLH_DELETE_CONTAINS_OLD \
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index b2bc10ee041..fbb66d559b6 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -482,6 +482,8 @@ extern Size EstimateTransactionStateSpace(void);
extern void SerializeTransactionState(Size maxsize, char *start_address);
extern void StartParallelWorkerTransaction(char *tstatespace);
extern void EndParallelWorkerTransaction(void);
+extern void SetRepackCurrentXids(TransactionId *xip, int xcnt);
+extern void ResetRepackCurrentXids(void);
extern bool IsTransactionBlock(void);
extern bool IsTransactionOrTransactionBlock(void);
extern char TransactionBlockStatusCode(void);
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 6fb5f5509c6..ef3cb557516 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -73,6 +73,14 @@ typedef struct ConcurrentChange
/* See the enum above. */
ConcurrentChangeKind kind;
+ /* Transaction that changes the data. */
+ TransactionId xid;
+
+ /*
+ * Historic catalog snapshot that was used to decode this change.
+ */
+ Snapshot snapshot;
+
/*
* The actual tuple.
*
@@ -104,6 +112,8 @@ typedef struct RepackDecodingState
* tuplestore does this transparently.
*/
Tuplestorestate *tstore;
+ /* XID of the last change added to tstore. */
+ TransactionId last_change_xid PG_USED_FOR_ASSERTS_ONLY;
/* The current number of changes in tstore. */
double nchanges;
@@ -124,6 +134,14 @@ typedef struct RepackDecodingState
/* Slot to retrieve data from tstore. */
TupleTableSlot *tsslot;
+ /*
+ * Historic catalog snapshot that was used to decode the most recent
+ * change.
+ */
+ Snapshot snapshot;
+ /* LSN of the record */
+ XLogRecPtr snapshot_lsn;
+
ResourceOwner resowner;
} RepackDecodingState;
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 0e546ec1497..014f27db7d7 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -13,6 +13,7 @@
#ifndef SNAPSHOT_H
#define SNAPSHOT_H
+#include "access/xlogdefs.h"
#include "lib/pairingheap.h"
@@ -201,6 +202,8 @@ typedef struct SnapshotData
uint32 regd_count; /* refcount on RegisteredSnapshots */
pairingheap_node ph_node; /* link in the RegisteredSnapshots heap */
+ XLogRecPtr lsn; /* position in the WAL stream when taken */
+
/*
* The transaction completion count at the time GetSnapshotData() built
* this snapshot. Allows to avoid re-computing static snapshots when no
--
2.39.5
Attachment: v09-0006-Add-regression-tests.patch (text/x-diff; charset=utf-8)
From f2db64b5aa2827837d5ea88ba3f1bd195d920586 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Wed, 26 Feb 2025 09:17:20 +0100
Subject: [PATCH v09 6/9] Add regression tests.
As this patch series adds the CONCURRENTLY option to the REPACK command, it's
appropriate to test that the "concurrent data changes" (i.e. changes done by
the application while we are copying the table contents to the new storage)
are processed correctly.
Injection points are used to stop the data copying at some point. While the
backend in charge of the copying is waiting on the injection point, another
backend runs some INSERT, UPDATE and DELETE commands on the table. Then we
wake up the first backend and let the REPACK CONCURRENTLY command
finish. Finally we check that all the "concurrent data changes" are present in
the table and that they contain the correct visibility information.
---
src/backend/commands/cluster.c | 7 +
src/test/modules/injection_points/Makefile | 3 +-
.../injection_points/expected/repack.out | 113 ++++++++++++++
.../modules/injection_points/logical.conf | 1 +
src/test/modules/injection_points/meson.build | 4 +
.../injection_points/specs/repack.spec | 140 ++++++++++++++++++
6 files changed, 267 insertions(+), 1 deletion(-)
create mode 100644 src/test/modules/injection_points/expected/repack.out
create mode 100644 src/test/modules/injection_points/logical.conf
create mode 100644 src/test/modules/injection_points/specs/repack.spec
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 8e8fe22d6d8..1dafca4531f 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -59,6 +59,7 @@
#include "utils/formatting.h"
#include "utils/fmgroids.h"
#include "utils/guc.h"
+#include "utils/injection_point.h"
#include "utils/inval.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
@@ -3716,6 +3717,12 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
*/
ident_key = build_identity_key(ident_idx_new, OldHeap, &ident_key_nentries);
+ /*
+ * During testing, wait for another backend to perform concurrent data
+ * changes which we will process below.
+ */
+ INJECTION_POINT("repack-concurrently-before-lock");
+
/*
* Flush all WAL records inserted so far (possibly except for the last
* incomplete page, see GetInsertRecPtr), to minimize the amount of data
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index e680991f8d4..405d0811b4f 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -14,7 +14,8 @@ PGFILEDESC = "injection_points - facility for injection points"
REGRESS = injection_points hashagg reindex_conc
REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
-ISOLATION = basic inplace syscache-update-pruned
+ISOLATION = basic inplace syscache-update-pruned repack
+ISOLATION_OPTS = --temp-config $(top_srcdir)/src/test/modules/injection_points/logical.conf
TAP_TESTS = 1
diff --git a/src/test/modules/injection_points/expected/repack.out b/src/test/modules/injection_points/expected/repack.out
new file mode 100644
index 00000000000..49a736ed617
--- /dev/null
+++ b/src/test/modules/injection_points/expected/repack.out
@@ -0,0 +1,113 @@
+Parsed test spec with 2 sessions
+
+starting permutation: wait_before_lock change_existing change_new change_subxact1 change_subxact2 check2 wakeup_before_lock check1
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step wait_before_lock:
+ REPACK CONCURRENTLY repack_test USING INDEX repack_test_pkey;
+ <waiting ...>
+step change_existing:
+ UPDATE repack_test SET i=10 where i=1;
+ UPDATE repack_test SET j=20 where i=2;
+ UPDATE repack_test SET i=30 where i=3;
+ UPDATE repack_test SET i=40 where i=30;
+ DELETE FROM repack_test WHERE i=4;
+
+step change_new:
+ INSERT INTO repack_test(i, j) VALUES (5, 5), (6, 6), (7, 7), (8, 8);
+ UPDATE repack_test SET i=50 where i=5;
+ UPDATE repack_test SET j=60 where i=6;
+ DELETE FROM repack_test WHERE i=7;
+
+step change_subxact1:
+ BEGIN;
+ INSERT INTO repack_test(i, j) VALUES (100, 100);
+ SAVEPOINT s1;
+ UPDATE repack_test SET i=101 where i=100;
+ SAVEPOINT s2;
+ UPDATE repack_test SET i=102 where i=101;
+ COMMIT;
+
+step change_subxact2:
+ BEGIN;
+ SAVEPOINT s1;
+ INSERT INTO repack_test(i, j) VALUES (110, 110);
+ ROLLBACK TO SAVEPOINT s1;
+ INSERT INTO repack_test(i, j) VALUES (110, 111);
+ COMMIT;
+
+step check2:
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='repack_test';
+
+ SELECT i, j FROM repack_test ORDER BY i, j;
+
+ INSERT INTO data_s2(_xmin, _cmin, i, j)
+ SELECT xmin, cmin, i, j FROM repack_test;
+
+ i| j
+---+---
+ 2| 20
+ 6| 60
+ 8| 8
+ 10| 1
+ 40| 3
+ 50| 5
+102|100
+110|111
+(8 rows)
+
+step wakeup_before_lock:
+ SELECT injection_points_wakeup('repack-concurrently-before-lock');
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step wait_before_lock: <... completed>
+step check1:
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='repack_test';
+
+ SELECT count(DISTINCT node) FROM relfilenodes;
+
+ SELECT i, j FROM repack_test ORDER BY i, j;
+
+ INSERT INTO data_s1(_xmin, _cmin, i, j)
+ SELECT xmin, cmin, i, j FROM repack_test;
+
+ SELECT count(*)
+ FROM data_s1 d1 FULL JOIN data_s2 d2 USING (_xmin, _cmin, i, j)
+ WHERE d1.i ISNULL OR d2.i ISNULL;
+
+count
+-----
+ 2
+(1 row)
+
+ i| j
+---+---
+ 2| 20
+ 6| 60
+ 8| 8
+ 10| 1
+ 40| 3
+ 50| 5
+102|100
+110|111
+(8 rows)
+
+count
+-----
+ 0
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
diff --git a/src/test/modules/injection_points/logical.conf b/src/test/modules/injection_points/logical.conf
new file mode 100644
index 00000000000..c8f264bc6cb
--- /dev/null
+++ b/src/test/modules/injection_points/logical.conf
@@ -0,0 +1 @@
+wal_level = logical
\ No newline at end of file
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index d61149712fd..0e3c47ba999 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -46,9 +46,13 @@ tests += {
'specs': [
'basic',
'inplace',
+ 'repack',
'syscache-update-pruned',
],
'runningcheck': false, # see syscache-update-pruned
+ # 'repack' requires wal_level = 'logical'.
+ 'regress_args': ['--temp-config', files('logical.conf')],
+
},
'tap': {
'env': {
diff --git a/src/test/modules/injection_points/specs/repack.spec b/src/test/modules/injection_points/specs/repack.spec
new file mode 100644
index 00000000000..5aa8983f98d
--- /dev/null
+++ b/src/test/modules/injection_points/specs/repack.spec
@@ -0,0 +1,140 @@
+# Prefix the system columns with underscore as they are not allowed as column
+# names.
+setup
+{
+ CREATE EXTENSION injection_points;
+
+ CREATE TABLE repack_test(i int PRIMARY KEY, j int);
+ INSERT INTO repack_test(i, j) VALUES (1, 1), (2, 2), (3, 3), (4, 4);
+
+ CREATE TABLE relfilenodes(node oid);
+
+ CREATE TABLE data_s1(_xmin xid, _cmin cid, i int, j int);
+ CREATE TABLE data_s2(_xmin xid, _cmin cid, i int, j int);
+}
+
+teardown
+{
+ DROP TABLE repack_test;
+ DROP EXTENSION injection_points;
+
+ DROP TABLE relfilenodes;
+ DROP TABLE data_s1;
+ DROP TABLE data_s2;
+}
+
+session s1
+setup
+{
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('repack-concurrently-before-lock', 'wait');
+}
+# Perform the initial load and wait for s2 to do some data changes.
+step wait_before_lock
+{
+ REPACK CONCURRENTLY repack_test USING INDEX repack_test_pkey;
+}
+# Check the table from the perspective of s1.
+#
+# Besides the contents, we also check that relfilenode has changed.
+#
+# xmin and cmin columns are used to check that we do not change tuple
+# visibility information. Since we do not expect xmin to stay unchanged across
+# test runs, it cannot appear in the output text. Instead, have each session
+# write the contents into a table and use FULL JOIN to check if the outputs
+# are identical.
+step check1
+{
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='repack_test';
+
+ SELECT count(DISTINCT node) FROM relfilenodes;
+
+ SELECT i, j FROM repack_test ORDER BY i, j;
+
+ INSERT INTO data_s1(_xmin, _cmin, i, j)
+ SELECT xmin, cmin, i, j FROM repack_test;
+
+ SELECT count(*)
+ FROM data_s1 d1 FULL JOIN data_s2 d2 USING (_xmin, _cmin, i, j)
+ WHERE d1.i ISNULL OR d2.i ISNULL;
+}
+teardown
+{
+ SELECT injection_points_detach('repack-concurrently-before-lock');
+}
+
+session s2
+# Change the existing data. UPDATE changes both key and non-key columns. Also
+# update one row twice to test whether tuple version generated by this session
+# can be found.
+step change_existing
+{
+ UPDATE repack_test SET i=10 where i=1;
+ UPDATE repack_test SET j=20 where i=2;
+ UPDATE repack_test SET i=30 where i=3;
+ UPDATE repack_test SET i=40 where i=30;
+ DELETE FROM repack_test WHERE i=4;
+}
+# Insert new rows and UPDATE / DELETE some of them. Again, update both key and
+# non-key column.
+step change_new
+{
+ INSERT INTO repack_test(i, j) VALUES (5, 5), (6, 6), (7, 7), (8, 8);
+ UPDATE repack_test SET i=50 where i=5;
+ UPDATE repack_test SET j=60 where i=6;
+ DELETE FROM repack_test WHERE i=7;
+}
+
+# When applying concurrent data changes, we should see the effects of an
+# in-progress subtransaction.
+step change_subxact1
+{
+ BEGIN;
+ INSERT INTO repack_test(i, j) VALUES (100, 100);
+ SAVEPOINT s1;
+ UPDATE repack_test SET i=101 where i=100;
+ SAVEPOINT s2;
+ UPDATE repack_test SET i=102 where i=101;
+ COMMIT;
+}
+
+# When applying concurrent data changes, we should not see the effects of a
+# rolled back subtransaction.
+step change_subxact2
+{
+ BEGIN;
+ SAVEPOINT s1;
+ INSERT INTO repack_test(i, j) VALUES (110, 110);
+ ROLLBACK TO SAVEPOINT s1;
+ INSERT INTO repack_test(i, j) VALUES (110, 111);
+ COMMIT;
+}
+
+# Check the table from the perspective of s2.
+step check2
+{
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='repack_test';
+
+ SELECT i, j FROM repack_test ORDER BY i, j;
+
+ INSERT INTO data_s2(_xmin, _cmin, i, j)
+ SELECT xmin, cmin, i, j FROM repack_test;
+}
+step wakeup_before_lock
+{
+ SELECT injection_points_wakeup('repack-concurrently-before-lock');
+}
+
+# Test if data changes introduced while one session is performing REPACK
+# CONCURRENTLY find their way into the table.
+permutation
+ wait_before_lock
+ change_existing
+ change_new
+ change_subxact1
+ change_subxact2
+ check2
+ wakeup_before_lock
+ check1
--
2.39.5
Attachment: v09-0007-Introduce-repack_max_xlock_time-configuration-va.patch (text/x-diff; charset=utf-8)
From fa8d6a472038e8faad02fe6f12e3f0e68d0c2b17 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Wed, 26 Feb 2025 09:17:20 +0100
Subject: [PATCH v09 7/9] Introduce repack_max_xlock_time configuration
variable.
When executing REPACK CONCURRENTLY, we need the AccessExclusiveLock to swap
the relation files, and that should only require a short time. However, on a
busy system, other backends might change a non-negligible amount of data in
the table while we are waiting for the lock. Since these changes must be
applied to the new storage before the swap, the time we eventually hold the
lock might become non-negligible too.
If the user is worried about this situation, they can set
repack_max_xlock_time to the maximum time for which the exclusive lock may be
held. If this amount of time is not sufficient to complete the REPACK
CONCURRENTLY command, an ERROR is raised and the command is canceled.
---
doc/src/sgml/config.sgml | 31 ++++
doc/src/sgml/ref/repack.sgml | 9 +-
src/backend/access/heap/heapam_handler.c | 3 +-
src/backend/commands/cluster.c | 133 +++++++++++++++---
src/backend/utils/misc/guc_tables.c | 16 ++-
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/commands/cluster.h | 5 +-
.../injection_points/expected/repack.out | 74 +++++++++-
.../injection_points/specs/repack.spec | 42 ++++++
9 files changed, 293 insertions(+), 21 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index bdcefa8140b..f6f248080ea 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -11213,6 +11213,37 @@ dynamic_library_path = '/usr/local/lib/postgresql:$libdir'
</listitem>
</varlistentry>
+ <varlistentry id="guc-repack-max-xclock-time" xreflabel="repack_max_xlock_time">
+ <term><varname>repack_max_xlock_time</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>repack_max_xlock_time</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ This is the maximum amount of time to hold an exclusive lock on a
+ table by <command>REPACK</command> with
+ the <literal>CONCURRENTLY</literal> option. Typically, these commands
+ should not need the lock for longer time
+ than <command>TRUNCATE</command> does. However, additional time might
+ be needed if the system is too busy. (See <xref linkend="sql-repack"/>
+ for explanation how the <literal>CONCURRENTLY</literal> option works.)
+ </para>
+
+ <para>
+ If you want to restrict the lock time, set this variable to the
+ highest acceptable value. If it turns out during the processing that
+ additional time would be needed to release the lock, the command is
+ canceled.
+ </para>
+
+ <para>
+ The default value is 0, which means that the lock is not released
+ until the concurrent data changes are processed.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect1>
diff --git a/doc/src/sgml/ref/repack.sgml b/doc/src/sgml/ref/repack.sgml
index 9ee640e3517..0c250689d13 100644
--- a/doc/src/sgml/ref/repack.sgml
+++ b/doc/src/sgml/ref/repack.sgml
@@ -188,7 +188,14 @@ REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] CONCU
(<xref linkend="logicaldecoding"/>) and applied before
the <literal>ACCESS EXCLUSIVE</literal> lock is requested. Thus the lock
is typically held only for the time needed to swap the files, which
- should be pretty short.
+ should be pretty short. However, the time might still be noticeable if
+ too many data changes have been done to the table while
+ <command>REPACK</command> was waiting for the lock: those changes must
+ be processed just before the files are swapped, while the
+ <literal>ACCESS EXCLUSIVE</literal> lock is being held. If you are
+ worried about this situation, set
+ the <link linkend="guc-repack-max-xclock-time"><varname>repack_max_xlock_time</varname></link>
+ configuration parameter to a value that your applications can tolerate.
</para>
<para>
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 485d22b9488..9bb37eb83fb 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1008,7 +1008,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
end_of_wal = GetFlushRecPtr(NULL);
if ((end_of_wal - end_of_wal_prev) > wal_segment_size)
{
- repack_decode_concurrent_changes(decoding_ctx, end_of_wal);
+ repack_decode_concurrent_changes(decoding_ctx, end_of_wal,
+ NULL);
end_of_wal_prev = end_of_wal;
}
}
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 1dafca4531f..0a2bacbb1df 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -17,6 +17,8 @@
*/
#include "postgres.h"
+#include <sys/time.h>
+
#include "access/amapi.h"
#include "access/heapam.h"
#include "access/multixact.h"
@@ -108,6 +110,15 @@ RelFileLocator repacked_rel_toast_locator = {.relNumber = InvalidOid};
#define REPACK_CONCURRENT_IN_PROGRESS_MSG \
"relation \"%s\" is already being processed by REPACK CONCURRENTLY"
+/*
+ * The maximum time to hold AccessExclusiveLock during the final
+ * processing. Note that only the execution time of
+ * process_concurrent_changes() is included here. The very last steps like
+ * swap_relation_files() shouldn't get blocked and it'd be wrong to consider
+ * them a reason to abort otherwise completed processing.
+ */
+int repack_max_xlock_time = 0;
+
/*
* Everything we need to call ExecInsertIndexTuples().
*/
@@ -197,7 +208,8 @@ static LogicalDecodingContext *setup_logical_decoding(Oid relid,
static HeapTuple get_changed_tuple(char *change);
static void apply_concurrent_changes(RepackDecodingState *dstate,
Relation rel, ScanKey key, int nkeys,
- IndexInsertState *iistate);
+ IndexInsertState *iistate,
+ struct timeval *must_complete);
static void apply_concurrent_insert(Relation rel, ConcurrentChange *change,
HeapTuple tup, IndexInsertState *iistate,
TupleTableSlot *index_slot);
@@ -214,13 +226,15 @@ static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
IndexInsertState *iistate,
TupleTableSlot *ident_slot,
IndexScanDesc *scan_p);
-static void process_concurrent_changes(LogicalDecodingContext *ctx,
+static bool process_concurrent_changes(LogicalDecodingContext *ctx,
XLogRecPtr end_of_wal,
Relation rel_dst,
Relation rel_src,
ScanKey ident_key,
int ident_key_nentries,
- IndexInsertState *iistate);
+ IndexInsertState *iistate,
+ struct timeval *must_complete);
+static bool processing_time_elapsed(struct timeval *must_complete);
static IndexInsertState *get_index_insert_state(Relation relation,
Oid ident_index_id);
static ScanKey build_identity_key(Oid ident_idx_oid, Relation rel_src,
@@ -3021,7 +3035,8 @@ get_changed_tuple(char *change)
*/
void
repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
- XLogRecPtr end_of_wal)
+ XLogRecPtr end_of_wal,
+ struct timeval *must_complete)
{
RepackDecodingState *dstate;
ResourceOwner resowner_old;
@@ -3059,6 +3074,9 @@ repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
if (record != NULL)
LogicalDecodingProcessRecord(ctx, ctx->reader);
+ if (processing_time_elapsed(must_complete))
+ break;
+
/*
* If WAL segment boundary has been crossed, inform the decoding
* system that the catalog_xmin can advance. (We can confirm more
@@ -3101,7 +3119,8 @@ repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
*/
static void
apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
- ScanKey key, int nkeys, IndexInsertState *iistate)
+ ScanKey key, int nkeys, IndexInsertState *iistate,
+ struct timeval *must_complete)
{
TupleTableSlot *index_slot, *ident_slot;
HeapTuple tup_old = NULL;
@@ -3131,6 +3150,9 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
CHECK_FOR_INTERRUPTS();
+ Assert(dstate->nchanges > 0);
+ dstate->nchanges--;
+
/* Get the change from the single-column tuple. */
tup_change = ExecFetchSlotHeapTuple(dstate->tsslot, false, &shouldFree);
heap_deform_tuple(tup_change, dstate->tupdesc_change, values, isnull);
@@ -3255,10 +3277,22 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
/* TTSOpsMinimalTuple has .get_heap_tuple==NULL. */
Assert(shouldFree);
pfree(tup_change);
+
+ /*
+ * If there is a limit on the time of completion, check it
+ * now. However, make sure the loop does not break if tup_old was set
+ * in the previous iteration. In such a case we could not resume the
+ * processing in the next call.
+ */
+ if (must_complete && tup_old == NULL &&
+ processing_time_elapsed(must_complete))
+ /* The next call will process the remaining changes. */
+ break;
}
- tuplestore_clear(dstate->tstore);
- dstate->nchanges = 0;
+ /* If we could not apply all the changes, the next call will do. */
+ if (dstate->nchanges == 0)
+ tuplestore_clear(dstate->tstore);
/* Cleanup. */
ExecDropSingleTupleTableSlot(index_slot);
@@ -3462,11 +3496,15 @@ find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
* Decode and apply concurrent changes.
*
* Pass rel_src iff its reltoastrelid is needed.
+ *
+ * Returns true if must_complete is NULL or if managed to complete by the time
+ * *must_complete indicates.
*/
-static void
+static bool
process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
Relation rel_dst, Relation rel_src, ScanKey ident_key,
- int ident_key_nentries, IndexInsertState *iistate)
+ int ident_key_nentries, IndexInsertState *iistate,
+ struct timeval *must_complete)
{
RepackDecodingState *dstate;
@@ -3475,10 +3513,19 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
dstate = (RepackDecodingState *) ctx->output_writer_private;
- repack_decode_concurrent_changes(ctx, end_of_wal);
+ repack_decode_concurrent_changes(ctx, end_of_wal, must_complete);
+ if (processing_time_elapsed(must_complete))
+ /* Caller is responsible for applying the changes. */
+ return false;
+
+ /*
+ * *must_complete not reached, so there are really no changes. (It's
+ * possible to see no changes just because not enough time was left for
+ * the decoding.)
+ */
if (dstate->nchanges == 0)
- return;
+ return true;
PG_TRY();
{
@@ -3490,7 +3537,7 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
rel_dst->rd_toastoid = rel_src->rd_rel->reltoastrelid;
apply_concurrent_changes(dstate, rel_dst, ident_key,
- ident_key_nentries, iistate);
+ ident_key_nentries, iistate, must_complete);
}
PG_FINALLY();
{
@@ -3500,6 +3547,28 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
rel_dst->rd_toastoid = InvalidOid;
}
PG_END_TRY();
+
+ /*
+ * apply_concurrent_changes() does check the processing time, so if some
+ * changes are left, we ran out of time.
+ */
+ return dstate->nchanges == 0;
+}
+
+/*
+ * Check if the current time is beyond *must_complete.
+ */
+static bool
+processing_time_elapsed(struct timeval *must_complete)
+{
+ struct timeval now;
+
+ if (must_complete == NULL)
+ return false;
+
+ gettimeofday(&now, NULL);
+
+ return timercmp(&now, must_complete, >);
}
static IndexInsertState *
@@ -3660,6 +3729,8 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
RelReopenInfo *rri = NULL;
int nrel;
Relation *ind_refs_all, *ind_refs_p;
+ struct timeval t_end;
+ struct timeval *t_end_ptr = NULL;
/* Like in cluster_rel(). */
lockmode_old = ShareUpdateExclusiveLock;
@@ -3739,7 +3810,8 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
*/
process_concurrent_changes(ctx, end_of_wal, NewHeap,
swap_toast_by_content ? OldHeap : NULL,
- ident_key, ident_key_nentries, iistate);
+ ident_key, ident_key_nentries, iistate,
+ NULL);
/*
* Release the locks that allowed concurrent data changes, in order to
@@ -3861,9 +3933,38 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
end_of_wal = GetFlushRecPtr(NULL);
/* Apply the concurrent changes again. */
- process_concurrent_changes(ctx, end_of_wal, NewHeap,
- swap_toast_by_content ? OldHeap : NULL,
- ident_key, ident_key_nentries, iistate);
+ /*
+ * This time we have the exclusive lock on the table, so make sure that
+ * repack_max_xlock_time is not exceeded.
+ */
+ if (repack_max_xlock_time > 0)
+ {
+ int64 usec;
+ struct timeval t_start;
+
+ gettimeofday(&t_start, NULL);
+ /* Add the whole seconds. */
+ t_end.tv_sec = t_start.tv_sec + repack_max_xlock_time / 1000;
+ /* Add the rest, expressed in microseconds. */
+ usec = t_start.tv_usec + 1000 * (repack_max_xlock_time % 1000);
+ /* The number of microseconds could have overflowed. */
+ t_end.tv_sec += usec / USECS_PER_SEC;
+ t_end.tv_usec = usec % USECS_PER_SEC;
+ t_end_ptr = &t_end;
+ }
+ /*
+ * During testing, stop here to simulate excessive processing time.
+ */
+ INJECTION_POINT("repack-concurrently-after-lock");
+
+ if (!process_concurrent_changes(ctx, end_of_wal, NewHeap,
+ swap_toast_by_content ? OldHeap : NULL,
+ ident_key, ident_key_nentries, iistate,
+ t_end_ptr))
+ ereport(ERROR,
+ (errmsg("could not process concurrent data changes in time"),
+ errhint("Please consider adjusting \"repack_max_xlock_time\".")));
+
/* Remember info about rel before closing OldHeap */
relpersistence = OldHeap->rd_rel->relpersistence;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 97cfd6e5a82..0ba416a1982 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -39,8 +39,9 @@
#include "catalog/namespace.h"
#include "catalog/storage.h"
#include "commands/async.h"
-#include "commands/extension.h"
+#include "commands/cluster.h"
#include "commands/event_trigger.h"
+#include "commands/extension.h"
#include "commands/tablespace.h"
#include "commands/trigger.h"
#include "commands/user.h"
@@ -2837,6 +2838,19 @@ struct config_int ConfigureNamesInt[] =
1600000000, 0, 2100000000,
NULL, NULL, NULL
},
+ {
+ {"repack_max_xlock_time", PGC_USERSET, LOCK_MANAGEMENT,
+ gettext_noop("Maximum time for REPACK CONCURRENTLY to keep the table locked."),
+ gettext_noop(
+ "The table is locked in exclusive mode during the final stage of processing. "
+ "If the lock time exceeds this value, an error is raised and the lock is "
+ "released. Set to zero if you don't care how long the lock can be held."),
+ GUC_UNIT_MS
+ },
+ &repack_max_xlock_time,
+ 0, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
/*
* See also CheckRequiredParameterValues() if this parameter changes
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 9f31e4071c7..d25c67f7047 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -761,6 +761,7 @@ autovacuum_worker_slots = 16 # autovacuum worker slots to allocate
#lock_timeout = 0 # in milliseconds, 0 is disabled
#idle_in_transaction_session_timeout = 0 # in milliseconds, 0 is disabled
#idle_session_timeout = 0 # in milliseconds, 0 is disabled
+#repack_max_xlock_time = 0
#bytea_output = 'hex' # hex, escape
#xmlbinary = 'base64'
#xmloption = 'content'
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index ef3cb557516..f5600bf4f62 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -59,6 +59,8 @@ typedef enum ClusterCommand
extern RelFileLocator repacked_rel_locator;
extern RelFileLocator repacked_rel_toast_locator;
+extern PGDLLIMPORT int repack_max_xlock_time;
+
typedef enum
{
CHANGE_INSERT,
@@ -154,7 +156,8 @@ extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid,
extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
extern void can_repack_concurrently(Relation rel);
extern void repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
- XLogRecPtr end_of_wal);
+ XLogRecPtr end_of_wal,
+ struct timeval *must_complete);
extern Oid make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
char relpersistence, LOCKMODE lockmode);
extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
diff --git a/src/test/modules/injection_points/expected/repack.out b/src/test/modules/injection_points/expected/repack.out
index 49a736ed617..f2728d94222 100644
--- a/src/test/modules/injection_points/expected/repack.out
+++ b/src/test/modules/injection_points/expected/repack.out
@@ -1,4 +1,4 @@
-Parsed test spec with 2 sessions
+Parsed test spec with 4 sessions
starting permutation: wait_before_lock change_existing change_new change_subxact1 change_subxact2 check2 wakeup_before_lock check1
injection_points_attach
@@ -111,3 +111,75 @@ injection_points_detach
(1 row)
+injection_points_detach
+-----------------------
+
+(1 row)
+
+
+starting permutation: wait_after_lock wakeup_after_lock
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step wait_after_lock:
+ REPACK CONCURRENTLY repack_test USING INDEX repack_test_pkey;
+ <waiting ...>
+step wakeup_after_lock:
+ SELECT injection_points_wakeup('repack-concurrently-after-lock');
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step wait_after_lock: <... completed>
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+
+starting permutation: wait_after_lock after_lock_delay wakeup_after_lock
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step wait_after_lock:
+ REPACK CONCURRENTLY repack_test USING INDEX repack_test_pkey;
+ <waiting ...>
+step after_lock_delay:
+ SELECT pg_sleep(1.5);
+
+pg_sleep
+--------
+
+(1 row)
+
+step wakeup_after_lock:
+ SELECT injection_points_wakeup('repack-concurrently-after-lock');
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step wait_after_lock: <... completed>
+ERROR: could not process concurrent data changes in time
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
diff --git a/src/test/modules/injection_points/specs/repack.spec b/src/test/modules/injection_points/specs/repack.spec
index 5aa8983f98d..0f45f9d2544 100644
--- a/src/test/modules/injection_points/specs/repack.spec
+++ b/src/test/modules/injection_points/specs/repack.spec
@@ -127,6 +127,34 @@ step wakeup_before_lock
SELECT injection_points_wakeup('repack-concurrently-before-lock');
}
+session s3
+setup
+{
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('repack-concurrently-after-lock', 'wait');
+ SET repack_max_xlock_time TO '1s';
+}
+# Perform the initial load, lock the table in exclusive mode and wait. s4 will
+# cancel the waiting.
+step wait_after_lock
+{
+ REPACK CONCURRENTLY repack_test USING INDEX repack_test_pkey;
+}
+teardown
+{
+ SELECT injection_points_detach('repack-concurrently-after-lock');
+}
+
+session s4
+step wakeup_after_lock
+{
+ SELECT injection_points_wakeup('repack-concurrently-after-lock');
+}
+step after_lock_delay
+{
+ SELECT pg_sleep(1.5);
+}
+
# Test if data changes introduced while one session is performing REPACK
# CONCURRENTLY find their way into the table.
permutation
@@ -138,3 +166,17 @@ permutation
check2
wakeup_before_lock
check1
+
+# Test the repack_max_xlock_time configuration variable.
+#
+# First, cancel waiting on the injection point immediately. That way, REPACK
+# should complete.
+permutation
+ wait_after_lock
+ wakeup_after_lock
+# Second, cancel the waiting with a delay that violates
+# repack_max_xlock_time.
+permutation
+ wait_after_lock
+ after_lock_delay
+ wakeup_after_lock
--
2.39.5
v09-0008-Enable-logical-decoding-transiently-only-for-REP.patch (text/x-diff, charset=utf-8)
From 07daab55b8b4bb86dce43eb4dcc252bc5b736376 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Wed, 26 Feb 2025 09:17:21 +0100
Subject: [PATCH v09 8/9] Enable logical decoding transiently, only for REPACK
CONCURRENTLY.
As REPACK CONCURRENTLY uses logical decoding, it requires wal_level to be set
to 'logical', while 'replica' is the default value. If logical replication is
not used, users will probably be reluctant to set the GUC to 'logical',
because it can affect server performance (by writing additional information
to WAL) and because it cannot be raised to 'logical' only for the time REPACK
CONCURRENTLY is running: a change of this GUC requires a server restart to
take effect.
This patch teaches the postgres backend to recognize whether it should
consider wal_level='logical' "locally" for a particular transaction, even if
the wal_level GUC is actually set to 'replica'. It also ensures that the
logical-decoding-specific information is added to WAL only for the tables
currently being processed by REPACK CONCURRENTLY.
If logical decoding is enabled this way, only temporary replication slots
should be created. The problem with a permanent slot is that it is restored
during server restart, and the restore fails if wal_level is not "globally"
'logical'.
There is independent work in progress to enable logical decoding transiently
[1]. ISTM that it is too heavyweight a solution for our problem. And I think
the two approaches are not mutually exclusive: once [1] is committed, we only
need to adjust the XLogLogicalInfoActive() macro.
[1] https://www.postgresql.org/message-id/CAD21AoCVLeLYq09pQPaWs%2BJwdni5FuJ8v2jgq-u9_uFbcp6UbA%40mail.gmail.com
---
src/backend/access/transam/parallel.c | 8 ++
src/backend/access/transam/xact.c | 106 ++++++++++++++---
src/backend/access/transam/xlog.c | 1 +
src/backend/commands/cluster.c | 107 ++++++++++++++----
src/backend/replication/logical/logical.c | 9 +-
src/backend/storage/ipc/standby.c | 4 +-
src/include/access/xlog.h | 15 ++-
src/include/commands/cluster.h | 1 +
src/include/utils/rel.h | 6 +-
src/test/modules/injection_points/Makefile | 1 -
.../modules/injection_points/logical.conf | 1 -
src/test/modules/injection_points/meson.build | 3 -
12 files changed, 216 insertions(+), 46 deletions(-)
delete mode 100644 src/test/modules/injection_points/logical.conf
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 94db1ec3012..a33318ea7bd 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -98,6 +98,7 @@ typedef struct FixedParallelState
TimestampTz xact_ts;
TimestampTz stmt_ts;
SerializableXactHandle serializable_xact_handle;
+ int wal_level_transient;
/* Mutex protects remaining fields. */
slock_t mutex;
@@ -355,6 +356,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
fps->xact_ts = GetCurrentTransactionStartTimestamp();
fps->stmt_ts = GetCurrentStatementStartTimestamp();
fps->serializable_xact_handle = ShareSerializableXact();
+ fps->wal_level_transient = wal_level_transient;
SpinLockInit(&fps->mutex);
fps->last_xlog_end = 0;
shm_toc_insert(pcxt->toc, PARALLEL_KEY_FIXED, fps);
@@ -1550,6 +1552,12 @@ ParallelWorkerMain(Datum main_arg)
/* Attach to the leader's serializable transaction, if SERIALIZABLE. */
AttachSerializableXact(fps->serializable_xact_handle);
+ /*
+ * Restore the information about whether this worker should behave as if
+ * wal_level were WAL_LEVEL_LOGICAL.
+ */
+ wal_level_transient = fps->wal_level_transient;
+
/*
* We've initialized all of our state now; nothing should change
* hereafter.
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e766be7b81d..479fe62b1c7 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -36,6 +36,7 @@
#include "catalog/pg_enum.h"
#include "catalog/storage.h"
#include "commands/async.h"
+#include "commands/cluster.h"
#include "commands/tablecmds.h"
#include "commands/trigger.h"
#include "common/pg_prng.h"
@@ -138,6 +139,12 @@ static TransactionId *ParallelCurrentXids;
static int nRepackCurrentXids = 0;
static TransactionId *RepackCurrentXids = NULL;
+/*
+ * Have we determined the value of wal_level_transient for the current
+ * transaction?
+ */
+static bool wal_level_transient_checked = false;
+
/*
* Miscellaneous flag bits to record events which occur on the top level
* transaction. These flags are only persisted in MyXactFlags and are intended
@@ -649,6 +656,7 @@ AssignTransactionId(TransactionState s)
bool isSubXact = (s->parent != NULL);
ResourceOwner currentOwner;
bool log_unknown_top = false;
+ bool set_wal_level_transient = false;
/* Assert that caller didn't screw up */
Assert(!FullTransactionIdIsValid(s->fullTransactionId));
@@ -663,6 +671,32 @@ AssignTransactionId(TransactionState s)
(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
errmsg("cannot assign transaction IDs during a parallel operation")));
+ /*
+ * The first call (i.e. the first write) in the transaction tree
+ * determines whether the whole transaction assumes logical decoding or
+ * not.
+ */
+ if (!wal_level_transient_checked)
+ {
+ Assert(wal_level_transient == WAL_LEVEL_MINIMAL);
+
+ /*
+ * Do not repeat the check when calling this function for parent
+ * transactions.
+ */
+ wal_level_transient_checked = true;
+
+ /*
+ * Remember that the actual check is needed. We cannot do it until the
+ * top-level transaction has its XID assigned, see comments below.
+ *
+ * There is no use case for overriding MINIMAL, and LOGICAL cannot be
+ * overridden as such.
+ */
+ if (wal_level == WAL_LEVEL_REPLICA)
+ set_wal_level_transient = true;
+ }
+
/*
* Ensure parent(s) have XIDs, so that a child always has an XID later
* than its parent. Mustn't recurse here, or we might get a stack
@@ -692,20 +726,6 @@ AssignTransactionId(TransactionState s)
pfree(parents);
}
- /*
- * When wal_level=logical, guarantee that a subtransaction's xid can only
- * be seen in the WAL stream if its toplevel xid has been logged before.
- * If necessary we log an xact_assignment record with fewer than
- * PGPROC_MAX_CACHED_SUBXIDS. Note that it is fine if didLogXid isn't set
- * for a transaction even though it appears in a WAL record, we just might
- * superfluously log something. That can happen when an xid is included
- * somewhere inside a wal record, but not in XLogRecord->xl_xid, like in
- * xl_standby_locks.
- */
- if (isSubXact && XLogLogicalInfoActive() &&
- !TopTransactionStateData.didLogXid)
- log_unknown_top = true;
-
/*
* Generate a new FullTransactionId and record its xid in PGPROC and
* pg_subtrans.
@@ -730,6 +750,54 @@ AssignTransactionId(TransactionState s)
if (!isSubXact)
RegisterPredicateLockingXid(XidFromFullTransactionId(s->fullTransactionId));
+ /*
+ * Check if this transaction should consider wal_level=logical.
+ *
+ * Sometimes we need to turn on logical decoding transiently although
+ * wal_level=WAL_LEVEL_REPLICA. Currently we do so when at least one table
+ * is being repacked concurrently, i.e. when we should assume that
+ * changes done by this transaction will be decoded. In such a case we
+ * adjust the value of XLogLogicalInfoActive() by setting
+ * wal_level_transient to LOGICAL.
+ *
+ * It's important not to do this check until the XID of the top-level
+ * transaction is in ProcGlobal: if the decoding becomes mandatory right
+ * after the check, our transaction will fail to write the necessary
+ * information to WAL. However, if the top-level transaction is already in
+ * ProcGlobal, its XID is guaranteed to appear in the xl_running_xacts
+ * record and therefore the snapshot builder will not try to decode the
+ * transaction (because it assumes it could have missed the initial part
+ * of the transaction).
+ *
+ * On the other hand, if the decoding became mandatory between the actual
+ * XID assignment and now, the transaction will WAL-log the
+ * decoding-specific information unnecessarily. Let's assume that such
+ * race conditions do not happen too often.
+ */
+ if (set_wal_level_transient)
+ {
+ /*
+ * Check for the operation that enables the logical decoding
+ * transiently.
+ */
+ if (is_concurrent_repack_in_progress(InvalidOid))
+ wal_level_transient = WAL_LEVEL_LOGICAL;
+ }
+
+ /*
+ * When wal_level=logical, guarantee that a subtransaction's xid can only
+ * be seen in the WAL stream if its toplevel xid has been logged before.
+ * If necessary we log an xact_assignment record with fewer than
+ * PGPROC_MAX_CACHED_SUBXIDS. Note that it is fine if didLogXid isn't set
+ * for a transaction even though it appears in a WAL record, we just might
+ * superfluously log something. That can happen when an xid is included
+ * somewhere inside a wal record, but not in XLogRecord->xl_xid, like in
+ * xl_standby_locks.
+ */
+ if (isSubXact && XLogLogicalInfoActive() &&
+ !TopTransactionStateData.didLogXid)
+ log_unknown_top = true;
+
/*
* Acquire lock on the transaction XID. (We assume this cannot block.) We
* have to ensure that the lock is assigned to the transaction's own
@@ -2244,6 +2312,16 @@ StartTransaction(void)
if (TransactionTimeout > 0)
enable_timeout_after(TRANSACTION_TIMEOUT, TransactionTimeout);
+ /*
+ * wal_level_transient can override wal_level for individual transactions,
+ * which effectively enables logical decoding for them. At the moment we
+ * don't know if this transaction will write any data changes to be
+ * decoded. Should it do, AssignTransactionId() will check if the decoding
+ * needs to be considered.
+ */
+ wal_level_transient = WAL_LEVEL_MINIMAL;
+ wal_level_transient_checked = false;
+
ShowTransactionState("StartTransaction");
}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4b6c694a3f7..1b131e1436f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -129,6 +129,7 @@ bool wal_recycle = true;
bool log_checkpoints = true;
int wal_sync_method = DEFAULT_WAL_SYNC_METHOD;
int wal_level = WAL_LEVEL_REPLICA;
+int wal_level_transient = WAL_LEVEL_MINIMAL;
int CommitDelay = 0; /* precommit delay in microseconds */
int CommitSiblings = 5; /* # concurrent xacts needed to sleep */
int wal_retrieve_retry_interval = 5000;
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 0a2bacbb1df..fa6db0932c0 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -1298,7 +1298,7 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
*
* In the REPACK CONCURRENTLY case, the lock does not help because we need
* to release it temporarily at some point. Instead, we expect VACUUM /
- * CLUSTER to skip tables which are present in RepackedRelsHash.
+ * CLUSTER to skip tables which are present in repackedRels->hashtable.
*/
if (OldHeap->rd_rel->reltoastrelid && !concurrent)
LockRelationOid(OldHeap->rd_rel->reltoastrelid, AccessExclusiveLock);
@@ -2317,7 +2317,16 @@ typedef struct RepackedRel
Oid dbid;
} RepackedRel;
-static HTAB *RepackedRelsHash = NULL;
+typedef struct RepackedRels
+{
+ /* Hashtable of RepackedRel elements. */
+ HTAB *hashtable;
+
+ /* The number of elements in the hashtable. */
+ pg_atomic_uint32 nrels;
+} RepackedRels;
+
+static RepackedRels *repackedRels = NULL;
/* Maximum number of entries in the hashtable. */
static int maxRepackedRels = 0;
@@ -2325,28 +2334,44 @@ static int maxRepackedRels = 0;
Size
RepackShmemSize(void)
{
+ Size result;
+
+ result = sizeof(RepackedRels);
+
/*
* A replication slot is needed for the processing, so use this GUC to
* allocate memory for the hashtable.
*/
maxRepackedRels = max_replication_slots;
- return hash_estimate_size(maxRepackedRels, sizeof(RepackedRel));
+ result += hash_estimate_size(maxRepackedRels, sizeof(RepackedRel));
+ return result;
}
void
RepackShmemInit(void)
{
+ bool found;
HASHCTL info;
+ repackedRels = ShmemInitStruct("Repacked Relations",
+ sizeof(RepackedRels),
+ &found);
+ if (!IsUnderPostmaster)
+ {
+ Assert(!found);
+ pg_atomic_init_u32(&repackedRels->nrels, 0);
+ }
+ else
+ Assert(found);
+
info.keysize = sizeof(RepackedRel);
info.entrysize = info.keysize;
-
- RepackedRelsHash = ShmemInitHash("Repacked Relations",
- maxRepackedRels,
- maxRepackedRels,
- &info,
- HASH_ELEM | HASH_BLOBS);
+ repackedRels->hashtable = ShmemInitHash("Repacked Relations Hash",
+ maxRepackedRels,
+ maxRepackedRels,
+ &info,
+ HASH_ELEM | HASH_BLOBS);
}
/*
@@ -2378,12 +2403,13 @@ begin_concurrent_repack(Relation *rel_p, Relation *index_p,
RelReopenInfo rri[2];
int nrel;
static bool before_shmem_exit_callback_setup = false;
+ uint32 nrels PG_USED_FOR_ASSERTS_ONLY;
relid = RelationGetRelid(rel);
/*
- * Make sure that we do not leave an entry in RepackedRelsHash if exiting
- * due to FATAL.
+ * Make sure that we do not leave an entry in repackedRels->hashtable if
+ * exiting due to FATAL.
*/
if (!before_shmem_exit_callback_setup)
{
@@ -2398,7 +2424,7 @@ begin_concurrent_repack(Relation *rel_p, Relation *index_p,
*entered_p = false;
LWLockAcquire(RepackedRelsLock, LW_EXCLUSIVE);
entry = (RepackedRel *)
- hash_search(RepackedRelsHash, &key, HASH_ENTER_NULL, &found);
+ hash_search(repackedRels->hashtable, &key, HASH_ENTER_NULL, &found);
if (found)
{
/*
@@ -2416,6 +2442,10 @@ begin_concurrent_repack(Relation *rel_p, Relation *index_p,
(errmsg("too many requests for REPACK CONCURRENTLY at a time")),
(errhint("Please consider increasing the \"max_replication_slots\" configuration parameter.")));
+ /* Increment the number of relations. */
+ nrels = pg_atomic_fetch_add_u32(&repackedRels->nrels, 1);
+ Assert(nrels < maxRepackedRels);
+
/*
* Even if the insertion of TOAST relid should fail below, the caller has
* to do cleanup.
@@ -2443,7 +2473,8 @@ begin_concurrent_repack(Relation *rel_p, Relation *index_p,
{
key.relid = toastrelid;
entry = (RepackedRel *)
- hash_search(RepackedRelsHash, &key, HASH_ENTER_NULL, &found);
+ hash_search(repackedRels->hashtable, &key, HASH_ENTER_NULL,
+ &found);
if (found)
/*
* If we could enter the main fork the TOAST should succeed
@@ -2457,6 +2488,10 @@ begin_concurrent_repack(Relation *rel_p, Relation *index_p,
(errmsg("too many requests for REPACK CONCURRENTLY at a time")),
(errhint("Please consider increasing the \"max_replication_slots\" configuration parameter.")));
+ /* Increment the number of relations. */
+ nrels = pg_atomic_fetch_add_u32(&repackedRels->nrels, 1);
+ Assert(nrels < maxRepackedRels);
+
Assert(!OidIsValid(repacked_rel_toast));
repacked_rel_toast = toastrelid;
}
@@ -2536,6 +2571,7 @@ end_concurrent_repack(bool error)
RepackedRel *entry = NULL, *entry_toast = NULL;
Oid relid = repacked_rel;
Oid toastrelid = repacked_rel_toast;
+ uint32 nrels PG_USED_FOR_ASSERTS_ONLY;
/* Remove the relation from the hash if we managed to insert one. */
if (OidIsValid(repacked_rel))
@@ -2544,23 +2580,32 @@ end_concurrent_repack(bool error)
key.relid = repacked_rel;
key.dbid = MyDatabaseId;
LWLockAcquire(RepackedRelsLock, LW_EXCLUSIVE);
- entry = hash_search(RepackedRelsHash, &key, HASH_REMOVE, NULL);
+ entry = hash_search(repackedRels->hashtable, &key, HASH_REMOVE,
+ NULL);
/*
* By clearing this variable we also disable
* cluster_before_shmem_exit_callback().
*/
repacked_rel = InvalidOid;
+
+ /* Decrement the number of relations. */
+ nrels = pg_atomic_fetch_sub_u32(&repackedRels->nrels, 1);
+ Assert(nrels > 0);
}
/* Remove the TOAST relation if there is one. */
if (OidIsValid(repacked_rel_toast))
{
key.relid = repacked_rel_toast;
- entry_toast = hash_search(RepackedRelsHash, &key, HASH_REMOVE,
+ entry_toast = hash_search(repackedRels->hashtable, &key, HASH_REMOVE,
NULL);
repacked_rel_toast = InvalidOid;
+
+ /* Decrement the number of relations. */
+ nrels = pg_atomic_fetch_sub_u32(&repackedRels->nrels, 1);
+ Assert(nrels > 0);
}
LWLockRelease(RepackedRelsLock);
@@ -2626,7 +2671,7 @@ end_concurrent_repack(bool error)
}
/*
- * A wrapper to call end_concurrent_repack() as a before_shmem_exit callback.
+ * A wrapper to call end_concurrent_repack() as a before_shmem_exit callback.
*/
static void
cluster_before_shmem_exit_callback(int code, Datum arg)
@@ -2637,24 +2682,48 @@ cluster_before_shmem_exit_callback(int code, Datum arg)
/*
* Check if relation is currently being processed by REPACK CONCURRENTLY.
+ *
+ * If relid is InvalidOid, check if any relation is being processed.
*/
bool
is_concurrent_repack_in_progress(Oid relid)
{
RepackedRel key, *entry;
+ /*
+ * If the caller is interested in whether any relation is being repacked,
+ * just use the counter.
+ */
+ if (!OidIsValid(relid))
+ {
+ if (pg_atomic_read_u32(&repackedRels->nrels) > 0)
+ return true;
+ else
+ return false;
+ }
+
+ /* For particular relation we need to search in the hashtable. */
memset(&key, 0, sizeof(key));
key.relid = relid;
key.dbid = MyDatabaseId;
LWLockAcquire(RepackedRelsLock, LW_SHARED);
entry = (RepackedRel *)
- hash_search(RepackedRelsHash, &key, HASH_FIND, NULL);
+ hash_search(repackedRels->hashtable, &key, HASH_FIND, NULL);
LWLockRelease(RepackedRelsLock);
return entry != NULL;
}
+/*
+ * Is this backend performing REPACK CONCURRENTLY?
+ */
+bool
+is_concurrent_repack_run_by_me(void)
+{
+ return OidIsValid(repacked_rel);
+}
+
/*
* Check if REPACK CONCURRENTLY is already running for given relation, and if
* so, raise ERROR. The problem is that cluster_rel() needs to release its
@@ -2949,8 +3018,8 @@ setup_logical_decoding(Oid relid, const char *slotname, TupleDesc tupdesc)
* useful for us.
*
* Regarding the value of need_full_snapshot, we pass false because the
- * table we are processing is present in RepackedRelsHash and therefore,
- * regarding logical decoding, treated like a catalog.
+ * table we are processing is present in repackedRels->hashtable and
+ * therefore, regarding logical decoding, treated like a catalog.
*/
ctx = CreateInitDecodingContext(REPL_PLUGIN_NAME,
NIL,
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 8ea846bfc3b..e5790d3fe84 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -30,6 +30,7 @@
#include "access/xact.h"
#include "access/xlogutils.h"
+#include "commands/cluster.h"
#include "fmgr.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -112,10 +113,12 @@ CheckLogicalDecodingRequirements(void)
/*
* NB: Adding a new requirement likely means that RestoreSlotFromDisk()
- * needs the same check.
+ * needs the same check. (Except that only temporary slots should be
+ * created for REPACK CONCURRENTLY, which effectively raises wal_level to
+ * LOGICAL.)
*/
-
- if (wal_level < WAL_LEVEL_LOGICAL)
+ if ((wal_level < WAL_LEVEL_LOGICAL && !is_concurrent_repack_run_by_me())
+ || wal_level < WAL_LEVEL_REPLICA)
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("logical decoding requires \"wal_level\" >= \"logical\"")));
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 5acb4508f85..413bcc1addb 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -1313,13 +1313,13 @@ LogStandbySnapshot(void)
* record. Fortunately this routine isn't executed frequently, and it's
* only a shared lock.
*/
- if (wal_level < WAL_LEVEL_LOGICAL)
+ if (!XLogLogicalInfoActive())
LWLockRelease(ProcArrayLock);
recptr = LogCurrentRunningXacts(running);
/* Release lock if we kept it longer ... */
- if (wal_level >= WAL_LEVEL_LOGICAL)
+ if (XLogLogicalInfoActive())
LWLockRelease(ProcArrayLock);
/* GetRunningTransactionData() acquired XidGenLock, we must release it */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index d313099c027..a325bb1d16b 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -95,6 +95,12 @@ typedef enum RecoveryState
extern PGDLLIMPORT int wal_level;
+/*
+ * wal_level_transient overrides wal_level if logical decoding needs to be
+ * enabled transiently.
+ */
+extern PGDLLIMPORT int wal_level_transient;
+
/* Is WAL archiving enabled (always or only while server is running normally)? */
#define XLogArchivingActive() \
(AssertMacro(XLogArchiveMode == ARCHIVE_MODE_OFF || wal_level >= WAL_LEVEL_REPLICA), XLogArchiveMode > ARCHIVE_MODE_OFF)
@@ -122,8 +128,13 @@ extern PGDLLIMPORT int wal_level;
/* Do we need to WAL-log information required only for Hot Standby and logical replication? */
#define XLogStandbyInfoActive() (wal_level >= WAL_LEVEL_REPLICA)
-/* Do we need to WAL-log information required only for logical replication? */
-#define XLogLogicalInfoActive() (wal_level >= WAL_LEVEL_LOGICAL)
+/*
+ * Do we need to WAL-log information required only for logical replication?
+ *
+ * wal_level_transient overrides wal_level if logical decoding needs to be
+ * active transiently.
+ */
+#define XLogLogicalInfoActive() (Max(wal_level, wal_level_transient) == WAL_LEVEL_LOGICAL)
#ifdef WAL_DEBUG
extern PGDLLIMPORT bool XLOG_DEBUG;
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index f5600bf4f62..3ed3066b364 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -173,6 +173,7 @@ extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
extern Size RepackShmemSize(void);
extern void RepackShmemInit(void);
extern bool is_concurrent_repack_in_progress(Oid relid);
+extern bool is_concurrent_repack_run_by_me(void);
extern void check_for_concurrent_repack(Oid relid, LOCKMODE lockmode);
extern void repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index cb485d26f44..88bacc109ff 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -710,12 +710,16 @@ RelationCloseSmgr(Relation relation)
* it would complicate decoding slightly for little gain). Note that we *do*
* log information for user defined catalog tables since they presumably are
* interesting to the user...
+ *
+ * If particular relations require it, logical decoding can be active
+ * even if wal_level is REPLICA. Do not log other relations in that case.
*/
#define RelationIsLogicallyLogged(relation) \
(XLogLogicalInfoActive() && \
RelationNeedsWAL(relation) && \
(relation)->rd_rel->relkind != RELKIND_FOREIGN_TABLE && \
- !IsCatalogRelation(relation))
+ !IsCatalogRelation(relation) && \
+ (wal_level == WAL_LEVEL_LOGICAL || (relation)->rd_repack_concurrent))
/* routines in utils/cache/relcache.c */
extern void RelationIncrementReferenceCount(Relation rel);
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 405d0811b4f..4f6c0ca3a8a 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -15,7 +15,6 @@ REGRESS = injection_points hashagg reindex_conc
REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
ISOLATION = basic inplace syscache-update-pruned repack
-ISOLATION_OPTS = --temp-config $(top_srcdir)/src/test/modules/injection_points/logical.conf
TAP_TESTS = 1
diff --git a/src/test/modules/injection_points/logical.conf b/src/test/modules/injection_points/logical.conf
deleted file mode 100644
index c8f264bc6cb..00000000000
--- a/src/test/modules/injection_points/logical.conf
+++ /dev/null
@@ -1 +0,0 @@
-wal_level = logical
\ No newline at end of file
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 0e3c47ba999..716e5619aa7 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -50,9 +50,6 @@ tests += {
'syscache-update-pruned',
],
'runningcheck': false, # see syscache-update-pruned
- # 'repack' requires wal_level = 'logical'.
- 'regress_args': ['--temp-config', files('logical.conf')],
-
},
'tap': {
'env': {
--
2.39.5
v09-0009-Call-logical_rewrite_heap_tuple-when-applying-co.patch (text/x-diff, charset=utf-8)
From 90c9a1caae1463b5a08dda64be3fee1feea9f03b Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Wed, 26 Feb 2025 09:17:21 +0100
Subject: [PATCH v09 9/9] Call logical_rewrite_heap_tuple() when applying
concurrent data changes.
This was implemented for the sake of completeness, but I think it's currently
not needed. Possible use cases could be:
1. REPACK CONCURRENTLY can process system catalogs.
System catalogs are scanned using a historic snapshot during logical decoding,
and the "combo CIDs" information is needed for that. Since "combo CID" is
associated with the "file locator" and that locator is changed by REPACK, this
command must record the information on individual tuples being moved from the
old file to the new one. This is what logical_rewrite_heap_tuple() does.
However, the logical decoding subsystem currently does not support decoding of
data changes in the system catalog. Therefore, the CONCURRENTLY option cannot
be used for system catalogs.
2. REPACK CONCURRENTLY is processing a relation, but once it has released all
the locks (in order to get the exclusive lock), another backend runs REPACK
CONCURRENTLY on the same table. Since the relation is treated as a system
catalog while these commands are processing it (so it can be scanned using a
historic snapshot during the "initial load"), it is important that the 2nd
backend does not break decoding of the "combo CIDs" performed by the 1st
backend.
However, it's not practical to let multiple backends run REPACK CONCURRENTLY
on the same relation, so we forbid that.
---
src/backend/access/heap/heapam_handler.c | 2 +-
src/backend/access/heap/rewriteheap.c | 59 +++++-----
src/backend/commands/cluster.c | 110 +++++++++++++++---
src/backend/replication/logical/decode.c | 41 ++++++-
.../pgoutput_repack/pgoutput_repack.c | 21 ++--
src/include/access/rewriteheap.h | 5 +-
src/include/commands/cluster.h | 3 +
src/include/replication/reorderbuffer.h | 7 ++
8 files changed, 191 insertions(+), 57 deletions(-)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 9bb37eb83fb..994e6510333 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -734,7 +734,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
/* Initialize the rewrite operation */
rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, *xid_cutoff,
- *multi_cutoff);
+ *multi_cutoff, true);
/* Set up sorting if wanted */
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index e6d2b5fced1..94b603423db 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -214,10 +214,8 @@ static void raw_heap_insert(RewriteState state, HeapTuple tup);
/* internal logical remapping prototypes */
static void logical_begin_heap_rewrite(RewriteState state);
-static void logical_rewrite_heap_tuple(RewriteState state, ItemPointerData old_tid, HeapTuple new_tuple);
static void logical_end_heap_rewrite(RewriteState state);
-
/*
* Begin a rewrite of a table
*
@@ -226,18 +224,19 @@ static void logical_end_heap_rewrite(RewriteState state);
* oldest_xmin xid used by the caller to determine which tuples are dead
* freeze_xid xid before which tuples will be frozen
* cutoff_multi multixact before which multis will be removed
+ * tid_chains need to maintain TID chains?
*
* Returns an opaque RewriteState, allocated in current memory context,
* to be used in subsequent calls to the other functions.
*/
RewriteState
begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xmin,
- TransactionId freeze_xid, MultiXactId cutoff_multi)
+ TransactionId freeze_xid, MultiXactId cutoff_multi,
+ bool tid_chains)
{
RewriteState state;
MemoryContext rw_cxt;
MemoryContext old_cxt;
- HASHCTL hash_ctl;
/*
* To ease cleanup, make a separate context that will contain the
@@ -262,29 +261,34 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
state->rs_cxt = rw_cxt;
state->rs_bulkstate = smgr_bulk_start_rel(new_heap, MAIN_FORKNUM);
- /* Initialize hash tables used to track update chains */
- hash_ctl.keysize = sizeof(TidHashKey);
- hash_ctl.entrysize = sizeof(UnresolvedTupData);
- hash_ctl.hcxt = state->rs_cxt;
+ if (tid_chains)
+ {
+ HASHCTL hash_ctl;
- state->rs_unresolved_tups =
- hash_create("Rewrite / Unresolved ctids",
- 128, /* arbitrary initial size */
- &hash_ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ /* Initialize hash tables used to track update chains */
+ hash_ctl.keysize = sizeof(TidHashKey);
+ hash_ctl.entrysize = sizeof(UnresolvedTupData);
+ hash_ctl.hcxt = state->rs_cxt;
- hash_ctl.entrysize = sizeof(OldToNewMappingData);
+ state->rs_unresolved_tups =
+ hash_create("Rewrite / Unresolved ctids",
+ 128, /* arbitrary initial size */
+ &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
- state->rs_old_new_tid_map =
- hash_create("Rewrite / Old to new tid map",
- 128, /* arbitrary initial size */
- &hash_ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ hash_ctl.entrysize = sizeof(OldToNewMappingData);
- MemoryContextSwitchTo(old_cxt);
+ state->rs_old_new_tid_map =
+ hash_create("Rewrite / Old to new tid map",
+ 128, /* arbitrary initial size */
+ &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ }
logical_begin_heap_rewrite(state);
+ MemoryContextSwitchTo(old_cxt);
+
return state;
}
@@ -303,12 +307,15 @@ end_heap_rewrite(RewriteState state)
* Write any remaining tuples in the UnresolvedTups table. If we have any
* left, they should in fact be dead, but let's err on the safe side.
*/
- hash_seq_init(&seq_status, state->rs_unresolved_tups);
-
- while ((unresolved = hash_seq_search(&seq_status)) != NULL)
+ if (state->rs_unresolved_tups)
{
- ItemPointerSetInvalid(&unresolved->tuple->t_data->t_ctid);
- raw_heap_insert(state, unresolved->tuple);
+ hash_seq_init(&seq_status, state->rs_unresolved_tups);
+
+ while ((unresolved = hash_seq_search(&seq_status)) != NULL)
+ {
+ ItemPointerSetInvalid(&unresolved->tuple->t_data->t_ctid);
+ raw_heap_insert(state, unresolved->tuple);
+ }
}
/* Write the last page, if any */
@@ -995,7 +1002,7 @@ logical_rewrite_log_mapping(RewriteState state, TransactionId xid,
* Perform logical remapping for a tuple that's mapped from old_tid to
* new_tuple->t_self by rewrite_heap_tuple() if necessary for the tuple.
*/
-static void
+void
logical_rewrite_heap_tuple(RewriteState state, ItemPointerData old_tid,
HeapTuple new_tuple)
{
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index fa6db0932c0..bb8b6824aec 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -23,6 +23,7 @@
#include "access/heapam.h"
#include "access/multixact.h"
#include "access/relscan.h"
+#include "access/rewriteheap.h"
#include "access/tableam.h"
#include "access/toast_internals.h"
#include "access/transam.h"
@@ -209,17 +210,21 @@ static HeapTuple get_changed_tuple(char *change);
static void apply_concurrent_changes(RepackDecodingState *dstate,
Relation rel, ScanKey key, int nkeys,
IndexInsertState *iistate,
- struct timeval *must_complete);
+ struct timeval *must_complete,
+ RewriteState rwstate);
static void apply_concurrent_insert(Relation rel, ConcurrentChange *change,
HeapTuple tup, IndexInsertState *iistate,
- TupleTableSlot *index_slot);
+ TupleTableSlot *index_slot,
+ RewriteState rwstate);
static void apply_concurrent_update(Relation rel, HeapTuple tup,
HeapTuple tup_target,
ConcurrentChange *change,
IndexInsertState *iistate,
- TupleTableSlot *index_slot);
+ TupleTableSlot *index_slot,
+ RewriteState rwstate);
static void apply_concurrent_delete(Relation rel, HeapTuple tup_target,
- ConcurrentChange *change);
+ ConcurrentChange *change,
+ RewriteState rwstate);
static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
HeapTuple tup_key,
Snapshot snapshot,
@@ -233,7 +238,8 @@ static bool process_concurrent_changes(LogicalDecodingContext *ctx,
ScanKey ident_key,
int ident_key_nentries,
IndexInsertState *iistate,
- struct timeval *must_complete);
+ struct timeval *must_complete,
+ RewriteState rwstate);
static bool processing_time_elapsed(struct timeval *must_complete);
static IndexInsertState *get_index_insert_state(Relation relation,
Oid ident_index_id);
@@ -3189,7 +3195,7 @@ repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
static void
apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
ScanKey key, int nkeys, IndexInsertState *iistate,
- struct timeval *must_complete)
+ struct timeval *must_complete, RewriteState rwstate)
{
TupleTableSlot *index_slot, *ident_slot;
HeapTuple tup_old = NULL;
@@ -3263,7 +3269,8 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
{
Assert(tup_old == NULL);
- apply_concurrent_insert(rel, &change, tup, iistate, index_slot);
+ apply_concurrent_insert(rel, &change, tup, iistate, index_slot,
+ rwstate);
pfree(tup);
}
@@ -3271,7 +3278,7 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
change.kind == CHANGE_DELETE)
{
IndexScanDesc ind_scan = NULL;
- HeapTuple tup_key;
+ HeapTuple tup_key, tup_exist_cp;
if (change.kind == CHANGE_UPDATE_NEW)
{
@@ -3313,11 +3320,23 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
if (tup_exist == NULL)
elog(ERROR, "Failed to find target tuple");
+ /*
+ * Update the mapping for xmax of the old version.
+ *
+ * Use a copy ('tup_exist' can point to shared buffer) with xmin
+ * invalid because mapping of that should have been written on
+ * insertion.
+ */
+ tup_exist_cp = heap_copytuple(tup_exist);
+ HeapTupleHeaderSetXmin(tup_exist_cp->t_data, InvalidTransactionId);
+ logical_rewrite_heap_tuple(rwstate, change.old_tid, tup_exist_cp);
+ pfree(tup_exist_cp);
+
if (change.kind == CHANGE_UPDATE_NEW)
apply_concurrent_update(rel, tup, tup_exist, &change, iistate,
- index_slot);
+ index_slot, rwstate);
else
- apply_concurrent_delete(rel, tup_exist, &change);
+ apply_concurrent_delete(rel, tup_exist, &change, rwstate);
ResetRepackCurrentXids();
@@ -3370,9 +3389,12 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
static void
apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
- IndexInsertState *iistate, TupleTableSlot *index_slot)
+ IndexInsertState *iistate, TupleTableSlot *index_slot,
+ RewriteState rwstate)
{
+ HeapTupleHeader tup_hdr = tup->t_data;
Snapshot snapshot = change->snapshot;
+ ItemPointerData old_tid;
List *recheck;
/*
@@ -3382,6 +3404,9 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
*/
SetRepackCurrentXids(snapshot->subxip, snapshot->subxcnt);
+ /* Remember location in the old heap. */
+ ItemPointerCopy(&tup_hdr->t_ctid, &old_tid);
+
/*
* Write the tuple into the new heap.
*
@@ -3397,6 +3422,14 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
heap_insert(rel, tup, change->xid, snapshot->curcid - 1,
HEAP_INSERT_NO_LOGICAL, NULL);
+ /*
+ * Update the mapping for xmin. (xmax should be invalid). This is needed
+ * because, during the processing, the table is considered an "user
+ * catalog".
+ */
+ Assert(!TransactionIdIsValid(HeapTupleHeaderGetRawXmax(tup->t_data)));
+ logical_rewrite_heap_tuple(rwstate, old_tid, tup);
+
/*
* Update indexes.
*
@@ -3430,15 +3463,22 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
static void
apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
ConcurrentChange *change, IndexInsertState *iistate,
- TupleTableSlot *index_slot)
+ TupleTableSlot *index_slot, RewriteState rwstate)
{
List *recheck;
LockTupleMode lockmode;
TU_UpdateIndexes update_indexes;
+ ItemPointerData tid_new_old_heap, tid_old_new_heap;
TM_Result res;
Snapshot snapshot = change->snapshot;
TM_FailureData tmfd;
+ /* Location of the new tuple in the old heap. */
+ ItemPointerCopy(&tup->t_data->t_ctid, &tid_new_old_heap);
+
+ /* Location of the existing tuple in the new heap. */
+ ItemPointerCopy(&tup_target->t_self, &tid_old_new_heap);
+
/*
* Write the new tuple into the new heap. ('tup' gets the TID assigned
* here.)
@@ -3448,7 +3488,7 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
Assert(snapshot->curcid != InvalidCommandId &&
snapshot->curcid > FirstCommandId);
- res = heap_update(rel, &tup_target->t_self, tup,
+ res = heap_update(rel, &tid_old_new_heap, tup,
change->xid, snapshot->curcid - 1,
InvalidSnapshot,
false, /* no wait - only we are doing changes */
@@ -3458,6 +3498,10 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
if (res != TM_Ok)
ereport(ERROR, (errmsg("failed to apply concurrent UPDATE")));
+ /* Update the mapping for xmin of the new version. */
+ Assert(!TransactionIdIsValid(HeapTupleHeaderGetRawXmax(tup->t_data)));
+ logical_rewrite_heap_tuple(rwstate, tid_new_old_heap, tup);
+
ExecStoreHeapTuple(tup, index_slot, false);
if (update_indexes != TU_None)
@@ -3481,8 +3525,9 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
static void
apply_concurrent_delete(Relation rel, HeapTuple tup_target,
- ConcurrentChange *change)
+ ConcurrentChange *change, RewriteState rwstate)
{
+ ItemPointerData tid_old_new_heap;
TM_Result res;
TM_FailureData tmfd;
Snapshot snapshot = change->snapshot;
@@ -3491,7 +3536,10 @@ apply_concurrent_delete(Relation rel, HeapTuple tup_target,
Assert(snapshot->curcid != InvalidCommandId &&
snapshot->curcid > FirstCommandId);
- res = heap_delete(rel, &tup_target->t_self, change->xid,
+ /* Location of the existing tuple in the new heap. */
+ ItemPointerCopy(&tup_target->t_self, &tid_old_new_heap);
+
+ res = heap_delete(rel, &tid_old_new_heap, change->xid,
snapshot->curcid - 1, InvalidSnapshot, false,
&tmfd, false,
/* wal_logical */
@@ -3573,7 +3621,8 @@ static bool
process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
Relation rel_dst, Relation rel_src, ScanKey ident_key,
int ident_key_nentries, IndexInsertState *iistate,
- struct timeval *must_complete)
+ struct timeval *must_complete,
+ RewriteState rwstate)
{
RepackDecodingState *dstate;
@@ -3606,7 +3655,8 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
rel_dst->rd_toastoid = rel_src->rd_rel->reltoastrelid;
apply_concurrent_changes(dstate, rel_dst, ident_key,
- ident_key_nentries, iistate, must_complete);
+ ident_key_nentries, iistate, must_complete,
+ rwstate);
}
PG_FINALLY();
{
@@ -3791,6 +3841,7 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
bool is_system_catalog;
Oid ident_idx_old, ident_idx_new;
IndexInsertState *iistate;
+ RewriteState rwstate;
ScanKey ident_key;
int ident_key_nentries;
XLogRecPtr wal_insert_ptr, end_of_wal;
@@ -3876,11 +3927,26 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
* Apply concurrent changes first time, to minimize the time we need to
* hold AccessExclusiveLock. (Quite some amount of WAL could have been
* written during the data copying and index creation.)
+ *
+ * Now we are processing individual tuples, so pass false for
+ * 'tid_chains'. Since rwstate is now only needed for
+ * logical_begin_heap_rewrite(), none of the transaction IDs needs to be
+ * valid.
*/
+ rwstate = begin_heap_rewrite(OldHeap, NewHeap,
+ InvalidTransactionId,
+ InvalidTransactionId,
+ InvalidTransactionId,
+ false);
process_concurrent_changes(ctx, end_of_wal, NewHeap,
swap_toast_by_content ? OldHeap : NULL,
ident_key, ident_key_nentries, iistate,
- NULL);
+ NULL, rwstate);
+ /*
+ * OldHeap will be closed, so we need to initialize rwstate again for the
+ * next call of process_concurrent_changes().
+ */
+ end_heap_rewrite(rwstate);
/*
* Release the locks that allowed concurrent data changes, in order to
@@ -4002,6 +4068,11 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
end_of_wal = GetFlushRecPtr(NULL);
/* Apply the concurrent changes again. */
+ rwstate = begin_heap_rewrite(OldHeap, NewHeap,
+ InvalidTransactionId,
+ InvalidTransactionId,
+ InvalidTransactionId,
+ false);
/*
* This time we have the exclusive lock on the table, so make sure that
* repack_max_xlock_time is not exceeded.
@@ -4029,11 +4100,12 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
if (!process_concurrent_changes(ctx, end_of_wal, NewHeap,
swap_toast_by_content ? OldHeap : NULL,
ident_key, ident_key_nentries, iistate,
- t_end_ptr))
+ t_end_ptr, rwstate))
ereport(ERROR,
(errmsg("could not process concurrent data changes in time"),
errhint("Please consider adjusting \"repack_max_xlock_time\".")));
+ end_heap_rewrite(rwstate);
/* Remember info about rel before closing OldHeap */
relpersistence = OldHeap->rd_rel->relpersistence;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5cdb6299d81..8ad5612b888 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -983,11 +983,13 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
xl_heap_insert *xlrec;
ReorderBufferChange *change;
RelFileLocator target_locator;
+ BlockNumber blknum;
+ HeapTupleHeader tuphdr;
xlrec = (xl_heap_insert *) XLogRecGetData(r);
/* only interested in our database */
- XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
+ XLogRecGetBlockTag(r, 0, &target_locator, NULL, &blknum);
if (target_locator.dbOid != ctx->slot->data.database)
return;
@@ -1012,6 +1014,13 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
DecodeXLogTuple(tupledata, datalen, change->data.tp.newtuple);
+ /*
+ * CTID is needed for logical_rewrite_heap_tuple(), when doing REPACK
+ * CONCURRENTLY.
+ */
+ tuphdr = change->data.tp.newtuple->t_data;
+ ItemPointerSet(&tuphdr->t_ctid, blknum, xlrec->offnum);
+
change->data.tp.clear_toast_afterwards = true;
ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
@@ -1033,11 +1042,14 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
ReorderBufferChange *change;
char *data;
RelFileLocator target_locator;
+ BlockNumber old_blknum, new_blknum;
xlrec = (xl_heap_update *) XLogRecGetData(r);
+ /* Retrieve blknum, so that we can compose CTID below. */
+ XLogRecGetBlockTag(r, 0, &target_locator, NULL, &new_blknum);
+
/* only interested in our database */
- XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
if (target_locator.dbOid != ctx->slot->data.database)
return;
@@ -1054,6 +1066,7 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
{
Size datalen;
Size tuplelen;
+ HeapTupleHeader tuphdr;
data = XLogRecGetBlockData(r, 0, &datalen);
@@ -1063,6 +1076,13 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
ReorderBufferAllocTupleBuf(ctx->reorder, tuplelen);
DecodeXLogTuple(data, datalen, change->data.tp.newtuple);
+
+ /*
+ * CTID is needed for logical_rewrite_heap_tuple(), when doing REPACK
+ * CONCURRENTLY.
+ */
+ tuphdr = change->data.tp.newtuple->t_data;
+ ItemPointerSet(&tuphdr->t_ctid, new_blknum, xlrec->new_offnum);
}
if (xlrec->flags & XLH_UPDATE_CONTAINS_OLD)
@@ -1081,6 +1101,14 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
DecodeXLogTuple(data, datalen, change->data.tp.oldtuple);
}
+ /*
+ * Remember the old tuple CTID, for the sake of
+ * logical_rewrite_heap_tuple().
+ */
+ if (!XLogRecGetBlockTagExtended(r, 1, NULL, NULL, &old_blknum, NULL))
+ old_blknum = new_blknum;
+ ItemPointerSet(&change->data.tp.old_tid, old_blknum, xlrec->old_offnum);
+
change->data.tp.clear_toast_afterwards = true;
ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
@@ -1099,11 +1127,12 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
xl_heap_delete *xlrec;
ReorderBufferChange *change;
RelFileLocator target_locator;
+ BlockNumber blknum;
xlrec = (xl_heap_delete *) XLogRecGetData(r);
/* only interested in our database */
- XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
+ XLogRecGetBlockTag(r, 0, &target_locator, NULL, &blknum);
if (target_locator.dbOid != ctx->slot->data.database)
return;
@@ -1135,6 +1164,12 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
DecodeXLogTuple((char *) xlrec + SizeOfHeapDelete,
datalen, change->data.tp.oldtuple);
+
+ /*
+ * CTID is needed for logical_rewrite_heap_tuple(), when doing REPACK
+ * CONCURRENTLY.
+ */
+ ItemPointerSet(&change->data.tp.old_tid, blknum, xlrec->offnum);
}
change->data.tp.clear_toast_afterwards = true;
diff --git a/src/backend/replication/pgoutput_repack/pgoutput_repack.c b/src/backend/replication/pgoutput_repack/pgoutput_repack.c
index d42d93a8b64..71b010c3516 100644
--- a/src/backend/replication/pgoutput_repack/pgoutput_repack.c
+++ b/src/backend/replication/pgoutput_repack/pgoutput_repack.c
@@ -33,7 +33,7 @@ static void plugin_truncate(struct LogicalDecodingContext *ctx,
ReorderBufferChange *change);
static void store_change(LogicalDecodingContext *ctx,
ConcurrentChangeKind kind, HeapTuple tuple,
- TransactionId xid);
+ TransactionId xid, ItemPointer old_tid);
void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
@@ -168,7 +168,8 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (newtuple == NULL)
elog(ERROR, "Incomplete insert info.");
- store_change(ctx, CHANGE_INSERT, newtuple, change->txn->xid);
+ store_change(ctx, CHANGE_INSERT, newtuple, change->txn->xid,
+ NULL);
}
break;
case REORDER_BUFFER_CHANGE_UPDATE:
@@ -186,10 +187,10 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (oldtuple != NULL)
store_change(ctx, CHANGE_UPDATE_OLD, oldtuple,
- change->txn->xid);
+ change->txn->xid, NULL);
store_change(ctx, CHANGE_UPDATE_NEW, newtuple,
- change->txn->xid);
+ change->txn->xid, &change->data.tp.old_tid);
}
break;
case REORDER_BUFFER_CHANGE_DELETE:
@@ -202,7 +203,8 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (oldtuple == NULL)
elog(ERROR, "Incomplete delete info.");
- store_change(ctx, CHANGE_DELETE, oldtuple, change->txn->xid);
+ store_change(ctx, CHANGE_DELETE, oldtuple, change->txn->xid,
+ &change->data.tp.old_tid);
}
break;
default:
@@ -236,13 +238,13 @@ plugin_truncate(struct LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (i == nrelations)
return;
- store_change(ctx, CHANGE_TRUNCATE, NULL, InvalidTransactionId);
+ store_change(ctx, CHANGE_TRUNCATE, NULL, InvalidTransactionId, NULL);
}
/* Store concurrent data change. */
static void
store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
- HeapTuple tuple, TransactionId xid)
+ HeapTuple tuple, TransactionId xid, ItemPointer old_tid)
{
RepackDecodingState *dstate;
char *change_raw;
@@ -315,6 +317,11 @@ store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
change.snapshot = dstate->snapshot;
dstate->snapshot->active_count++;
+ if (old_tid)
+ ItemPointerCopy(old_tid, &change.old_tid);
+ else
+ ItemPointerSetInvalid(&change.old_tid);
+
/* The data has been copied. */
if (flattened)
pfree(tuple);
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 99c3f362adc..eebda35c7cb 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -23,11 +23,14 @@ typedef struct RewriteStateData *RewriteState;
extern RewriteState begin_heap_rewrite(Relation old_heap, Relation new_heap,
TransactionId oldest_xmin, TransactionId freeze_xid,
- MultiXactId cutoff_multi);
+ MultiXactId cutoff_multi, bool tid_chains);
extern void end_heap_rewrite(RewriteState state);
extern void rewrite_heap_tuple(RewriteState state, HeapTuple old_tuple,
HeapTuple new_tuple);
extern bool rewrite_heap_dead_tuple(RewriteState state, HeapTuple old_tuple);
+extern void logical_rewrite_heap_tuple(RewriteState state,
+ ItemPointerData old_tid,
+ HeapTuple new_tuple);
/*
* On-Disk data format for an individual logical rewrite mapping.
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 3ed3066b364..db029c62cf1 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -78,6 +78,9 @@ typedef struct ConcurrentChange
/* Transaction that changes the data. */
TransactionId xid;
+ /* For UPDATE / DELETE, the location of the old tuple version. */
+ ItemPointerData old_tid;
+
/*
* Historic catalog snapshot that was used to decode this change.
*/
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 3be0cbd7ebe..03f89cae038 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -104,6 +104,13 @@ typedef struct ReorderBufferChange
HeapTuple oldtuple;
/* valid for INSERT || UPDATE */
HeapTuple newtuple;
+
+ /*
+ * REPACK CONCURRENTLY needs the old TID, even if the old tuple
+ * itself is not WAL-logged (i.e. when the identity key does not
+ * change).
+ */
+ ItemPointerData old_tid;
} tp;
/*
--
2.39.5
On Thu, Mar 20, 2025 at 1:32 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
I rebased this patch series; here's v09. No substantive changes from v08.
I made sure the tree still compiles after each commit.

I did look at 0002 again (and renamed the members of the new struct by
adding a p_ prefix, as well as fixing the references to the old names
that were in a few code comments here and there; I don't think these
changes are "substantive"), and ended up wondering why we need that
change in the first place. According to the comment where the progress
restore function is called, it's because reorderbuffer.c uses a
subtransaction internally. But I went to look at reorderbuffer.c and
noticed that the subtransaction is only used "when using the SQL
function interface, because that creates a transaction already". So
maybe we should look into making REPACK use reorderbuffer without having
to open a transaction block.

I didn't do anything about that, in particular I didn't actually try to
run REPACK to see whether the transaction is needed. I'll be looking at
that in the next couple of days.
Is there a README or a long comment in here someplace that is a good
place to read to understand the overall design of this feature?
--
Robert Haas
EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> wrote:
Is there a README or a long comment in here someplace that is a good
place to read to understand the overall design of this feature?
I tried to explain how it works in the commit messages. The one in 0004 is
probably the most important one.
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
On Thu, Mar 20, 2025 at 2:09 PM Antonin Houska <ah@cybertec.at> wrote:
Robert Haas <robertmhaas@gmail.com> wrote:
Is there a README or a long comment in here someplace that is a good
place to read to understand the overall design of this feature?

I tried to explain how it works in the commit messages. The one in 0004 is
probably the most important one.
Thanks. A couple of comments/questions:
- I don't understand why this commit message seems to think that we
can't acquire a stronger lock while already holding a weaker one. We
can do that, and in some cases we do precisely that. Such locking
patterns can result in deadlock e.g. if I take AccessShareLock and you
take AccessShareLock and then I tried to upgrade to
AccessExclusiveLock and then you try to upgrade to
AccessExclusiveLock, somebody is going to have to ERROR out. But that
doesn't keep us from doing that in some places where it seems better
than the alternatives, and the alternative chosen by the patch
(possibly discovering at the very end that all our work has been in
vain) does not seem better than risking a deadlock.
- On what basis do you make the statement in the last paragraph that
the decoding-related lag should not exceed one WAL segment? I guess
logical decoding probably keeps up pretty well most of the time but
this seems like a very strong guarantee for something I didn't know we
had any kind of guarantee about.
- What happens if we crash?
--
Robert Haas
EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Mar 20, 2025 at 2:09 PM Antonin Houska <ah@cybertec.at> wrote:
Robert Haas <robertmhaas@gmail.com> wrote:
Is there a README or a long comment in here someplace that is a good
place to read to understand the overall design of this feature?

I tried to explain how it works in the commit messages. The one in 0004 is
probably the most important one.

Thanks. A couple of comments/questions:
- I don't understand why this commit message seems to think that we
can't acquire a stronger lock while already holding a weaker one. We
can do that, and in some cases we do precisely that.
Can you please give me an example? I don't recall seeing a lock upgrade in the
tree. That's the reason I tried rather hard to avoid that.
Such locking
patterns can result in deadlock e.g. if I take AccessShareLock and you
take AccessShareLock and then I tried to upgrade to
AccessExclusiveLock and then you try to upgrade to
AccessExclusiveLock, somebody is going to have to ERROR out. But that
doesn't keep us from doing that in some places where it seems better
than the alternatives, and the alternative chosen by the patch
(possibly discovering at the very end that all our work has been in
vain) does not seem better than risking a deadlock.
I see. Only the backends that do upgrade their lock are exposed to the risk of
deadlock, e.g. two backends running REPACK CONCURRENTLY on the same table, and
that should not happen too often.
I'll consider your objection - it should make the patch a bit simpler.
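The upgrade-then-deadlock scenario discussed above can be sketched abstractly. This is a toy Python model, not PostgreSQL code: ToyLock, the session names, and the timeout are illustrative stand-ins for the lock manager and the deadlock detector choosing a victim.

```python
import threading

class ToyLock:
    """Toy shared/exclusive lock that supports upgrade attempts."""
    def __init__(self):
        self.cond = threading.Condition()
        self.shared = set()                # names of shared-lock holders

    def acquire_shared(self, who):
        with self.cond:
            self.shared.add(who)

    def try_upgrade(self, who, timeout):
        """Upgrade to exclusive: wait until we are the only shared holder.
        The timeout stands in for the deadlock detector picking a victim."""
        with self.cond:
            return self.cond.wait_for(lambda: self.shared == {who},
                                      timeout=timeout)

    def release(self, who):
        with self.cond:
            self.shared.discard(who)
            self.cond.notify_all()

lock = ToyLock()
both_hold_shared = threading.Barrier(2)
results = {}

def session(who, timeout):
    lock.acquire_shared(who)
    both_hold_shared.wait()                # both sessions now hold shared
    ok = lock.try_upgrade(who, timeout)
    if not ok:
        lock.release(who)                  # the "ERROR out" path: back off
    results[who] = ok

a = threading.Thread(target=session, args=("A", 0.2))
b = threading.Thread(target=session, args=("B", 2.0))
a.start(); b.start()
a.join(); b.join()

# A is chosen as the victim and backs off; B's upgrade then succeeds.
assert results == {"A": False, "B": True}
```

With both sessions holding the shared lock, neither upgrade can complete until the other backs off entirely, which is exactly why only backends that upgrade their lock are exposed to this deadlock.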
- On what basis do you make the statement in the last paragraph that
the decoding-related lag should not exceed one WAL segment? I guess
logical decoding probably keeps up pretty well most of the time but
this seems like a very strong guarantee for something I didn't know we
had any kind of guarantee about.
The patch itself does guarantee that by checking the amount of unprocessed WAL
regularly when it's copying the data into the new table. If too much WAL
appears to be unprocessed, it enforces the decoding before the copying is
resumed.
The WAL decoding during the "initial load" phase can actually be handled by a
background worker (not sure it's necessary in the initial implementation),
which would make a significant lag rather unlikely. But even then we should
probably enforce a certain limit on the lag (e.g. because the background worker is
not guaranteed to start).
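Mechanically, the bookkeeping described above amounts to something like the following. This is an abstract Python sketch with made-up batch sizes; the patch itself works in terms of LSNs rather than a byte counter.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024        # default 16MB WAL segment

def copy_with_lag_limit(wal_per_batch, decode):
    """After each batch of rows copied, check how much concurrently
    generated WAL is still undecoded; once it reaches one segment,
    run the decoder before resuming the copy."""
    undecoded = 0
    for wal_bytes in wal_per_batch:
        undecoded += wal_bytes             # WAL produced while copying
        if undecoded >= WAL_SEGMENT_SIZE:
            decode(undecoded)              # catch up before continuing
            undecoded = 0
    return undecoded                       # final lag, below one segment

decoded = []
leftover = copy_with_lag_limit([4 << 20] * 10, decoded.append)
assert decoded == [16 << 20, 16 << 20]     # decoder invoked twice
assert leftover == 8 << 20                 # remaining lag under a segment
```

Because the check runs between batches, the undecoded WAL can never grow far past the segment size before the decoder is forced to catch up, regardless of whether a background worker is available.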
- What happens if we crash?
The replication slot we create is RS_TEMPORARY, so it disappears after
restart. Everything else is as if the current implementation of CLUSTER ends
due to a crash.
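The crash behavior can be pictured with a trivial model. This is illustrative Python only (SlotManager and the slot names are invented), mirroring the fact that RS_TEMPORARY slots are dropped at restart while persistent slots survive.

```python
class SlotManager:
    """Minimal model of replication-slot persistence across a restart."""
    def __init__(self):
        self.slots = {}

    def create(self, name, temporary):
        self.slots[name] = {"temporary": temporary}

    def crash_restart(self):
        # Temporary slots (like the one REPACK CONCURRENTLY creates)
        # do not survive a restart; persistent slots do.
        self.slots = {n: s for n, s in self.slots.items()
                      if not s["temporary"]}

mgr = SlotManager()
mgr.create("repack_slot", temporary=True)    # created by the command
mgr.create("standby_slot", temporary=False)  # an unrelated persistent slot
mgr.crash_restart()
assert list(mgr.slots) == ["standby_slot"]
```

So after a crash no slot-specific cleanup is needed; only the usual leftovers of an interrupted CLUSTER-style rewrite remain.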
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
I rebased this patch series; here's v09. No substantive changes from v08.
I made sure the tree still compiles after each commit.
Thanks.
I did look at 0002 again (and renamed the members of the new struct by
adding a p_ prefix, as well as fixing the references to the old names
that were in a few code comments here and there; I don't think these
changes are "substantive"), and ended up wondering why we need that
change in the first place. According to the comment where the progress
restore function is called, it's because reorderbuffer.c uses a
subtransaction internally. But I went to look at reorderbuffer.c and
noticed that the subtransaction is only used "when using the SQL
function interface, because that creates a transaction already". So
maybe we should look into making REPACK use reorderbuffer without having
to open a transaction block.
Which part of reorderbuffer.c do you mean? ISTM that the use of subtransactions
is more extensive. At least ReorderBufferImmediateInvalidation() appears to
rely on it, which in turn is called by xact_decode().
(I don't claim that saving and restoring the progress state is perfect, but I
don't have a better idea right now.)
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
On 2025-Mar-22, Antonin Houska wrote:
Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
I rebased this patch series; here's v09. No substantive changes from v08.
I made sure the tree still compiles after each commit.
I rebased again, fixing a compiler warning reported by CI and applying
pgindent to each individual patch. I'm slowly starting to become more
familiar with the whole of this new code.
I did look at 0002 again [...], and ended up wondering why we need that
change in the first place. According to the comment where the progress
restore function is called, it's because reorderbuffer.c uses a
subtransaction internally. But I went to look at reorderbuffer.c and
noticed that the subtransaction is only used "when using the SQL
function interface, because that creates a transaction already". So
maybe we should look into making REPACK use reorderbuffer without having
to open a transaction block.

Which part of reorderbuffer.c do you mean? ISTM that the use of subtransactions
is more extensive. At least ReorderBufferImmediateInvalidation() appears to
rely on it, which in turn is called by xact_decode().
Ah, right, I was not looking hard enough. Something to keep in mind --
though I'm still not convinced that it's best to achieve this by
introducing a mechanism to restore progress state. Maybe allowing a
transaction to abort without clobbering the progress state somehow (not
trivial to implement at present though, because of layers of functions
you need to traverse with such a flag; maybe have a global in xact.c
that you set by calling a function? Not sure -- might be worse.) Not a
super critical consideration, but this point prevents me from pushing
patch 0002 here, as it may turn out that it's not needed.
But nothing prevents me from pushing 0003, so I'll see about doing that
soon, unless I see some other problem.
I also noticed that CI is complaining of a problem in Windows, which is
easily reproducible in non-Windows by defining EXEC_BACKEND. The
backtrace is this:
#0 0x000055d4fc24fe96 in hash_search (hashp=0x5606dc2a8c88, keyPtr=0x7ffeab341928, action=HASH_FIND, foundPtr=0x0)
at ../../../../../../../../../pgsql/source/master/src/backend/utils/hash/dynahash.c:960
960 return hash_search_with_hash_value(hashp,
(gdb) bt
#0 0x000055d4fc24fe96 in hash_search (hashp=0x5606dc2a8c88, keyPtr=0x7ffeab341928, action=HASH_FIND, foundPtr=0x0)
at ../../../../../../../../../pgsql/source/master/src/backend/utils/hash/dynahash.c:960
#1 0x000055d4fbea0a46 in is_concurrent_repack_in_progress (relid=21973)
at ../../../../../../../../pgsql/source/master/src/backend/commands/cluster.c:2729
#2 is_concurrent_repack_in_progress (relid=relid@entry=2964)
at ../../../../../../../../pgsql/source/master/src/backend/commands/cluster.c:2706
#3 0x000055d4fc237a87 in RelationBuildDesc (targetRelId=2964, insertIt=insertIt@entry=true)
at ../../../../../../../../../pgsql/source/master/src/backend/utils/cache/relcache.c:1257
#4 0x000055d4fc239456 in RelationIdGetRelation (relationId=<optimized out>, relationId@entry=2964)
at ../../../../../../../../../pgsql/source/master/src/backend/utils/cache/relcache.c:2105
So apparently we're trying to dereference a hash table which isn't
properly set up in the child process.
--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"The gods do not protect fools. Fools receive protection from other,
better-equipped fools" (Luis Wu, Ringworld)
Attachments:
v10-0001-Add-REPACK-command.patch (text/x-diff; charset=utf-8)
From 5a1bf17b520b759a1c048953b186fd6d14861664 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Wed, 26 Feb 2025 09:17:20 +0100
Subject: [PATCH v10 1/9] Add REPACK command.
The existing CLUSTER command as well as VACUUM with the FULL option both
reclaim unused space by rewriting the table. Now that we want to enhance this
functionality (in particular, by adding a new option CONCURRENTLY), we should
enhance both commands because they are both implemented by the same function
(cluster.c:cluster_rel). However, adding the same option to two different
commands is not very user-friendly. Therefore it was decided to create a new
command and to declare both CLUSTER command and the FULL option of VACUUM
deprecated. Future enhancements to this rewriting code will only affect the
new command.
Like CLUSTER, the REPACK command reorders the table according to the specified
index. Unlike CLUSTER, REPACK does not require an index: if only a table is
specified, the command acts as VACUUM FULL. As we don't want to remove CLUSTER
and VACUUM FULL yet, there are three callers of the cluster_rel() function
now: REPACK, CLUSTER and VACUUM FULL. When we need to distinguish who is
calling this function (mostly for logging, but also for progress reporting),
we can no longer use the OID of the clustering index: both REPACK and VACUUM
FULL can pass InvalidOid. Therefore, this patch introduces a new enumeration
type ClusterCommand, and adds an argument of this type to the cluster_rel()
function and to all the functions that need to distinguish the caller.
Like CLUSTER and VACUUM FULL, the REPACK command without arguments processes
all the tables on which the current user has the MAINTAIN privilege.
A new pg_stat_progress_repack view is added to monitor the progress of
REPACK. Currently it displays the same information as pg_stat_progress_cluster
(except that column names might differ), but it'll also display the status of
the REPACK CONCURRENTLY command in the future, so the view definitions will
eventually diverge.
Regarding user documentation, the patch moves the information on clustering
from cluster.sgml to the new file repack.sgml. cluster.sgml now contains a
link that points to the related section of repack.sgml. A note on deprecation
and a link to repack.sgml are added to both cluster.sgml and vacuum.sgml.
---
doc/src/sgml/monitoring.sgml | 230 +++++++++++
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/cluster.sgml | 79 +---
doc/src/sgml/ref/repack.sgml | 254 ++++++++++++
doc/src/sgml/ref/vacuum.sgml | 8 +
doc/src/sgml/reference.sgml | 1 +
src/backend/access/heap/heapam_handler.c | 32 +-
src/backend/catalog/index.c | 2 +-
src/backend/catalog/system_views.sql | 27 ++
src/backend/commands/cluster.c | 496 +++++++++++++++++------
src/backend/commands/tablecmds.c | 3 +-
src/backend/commands/vacuum.c | 3 +-
src/backend/parser/gram.y | 63 ++-
src/backend/tcop/utility.c | 9 +
src/backend/utils/adt/pgstatfuncs.c | 2 +
src/bin/psql/tab-complete.in.c | 31 +-
src/include/commands/cluster.h | 22 +-
src/include/commands/progress.h | 58 ++-
src/include/nodes/parsenodes.h | 13 +
src/include/parser/kwlist.h | 1 +
src/include/tcop/cmdtaglist.h | 1 +
src/include/utils/backend_progress.h | 1 +
src/test/regress/expected/cluster.out | 180 ++++++++
src/test/regress/expected/rules.out | 27 ++
src/test/regress/sql/cluster.sql | 73 ++++
src/tools/pgindent/typedefs.list | 2 +
26 files changed, 1384 insertions(+), 235 deletions(-)
create mode 100644 doc/src/sgml/ref/repack.sgml
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 0960f5ba94a..8776f51844b 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -400,6 +400,14 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_progress_repack</structname><indexterm><primary>pg_stat_progress_repack</primary></indexterm></entry>
+ <entry>One row for each backend running
+ <command>REPACK</command>, showing current progress. See
+ <xref linkend="repack-progress-reporting"/>.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_progress_basebackup</structname><indexterm><primary>pg_stat_progress_basebackup</primary></indexterm></entry>
<entry>One row for each WAL sender process streaming a base backup,
@@ -5938,6 +5946,228 @@ FROM pg_stat_get_backend_idset() AS backendid;
</table>
</sect2>
+ <sect2 id="repack-progress-reporting">
+ <title>REPACK Progress Reporting</title>
+
+ <indexterm>
+ <primary>pg_stat_progress_repack</primary>
+ </indexterm>
+
+ <para>
+ Whenever <command>REPACK</command> is running,
+ the <structname>pg_stat_progress_repack</structname> view will contain a
+ row for each backend that is currently running the command. The tables
+ below describe the information that will be reported and provide
+ information about how to interpret it.
+ </para>
+
+ <table id="pg-stat-progress-repack-view" xreflabel="pg_stat_progress_repack">
+ <title><structname>pg_stat_progress_repack</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>pid</structfield> <type>integer</type>
+ </para>
+ <para>
+ Process ID of backend.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>datid</structfield> <type>oid</type>
+ </para>
+ <para>
+ OID of the database to which this backend is connected.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>datname</structfield> <type>name</type>
+ </para>
+ <para>
+ Name of the database to which this backend is connected.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>relid</structfield> <type>oid</type>
+ </para>
+ <para>
+ OID of the table being repacked.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>command</structfield> <type>text</type>
+ </para>
+ <para>
+ The command that is running. Currently, the only value
+ is <literal>REPACK</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>phase</structfield> <type>text</type>
+ </para>
+ <para>
+ Current processing phase. See <xref linkend="repack-phases"/>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>repack_index_relid</structfield> <type>oid</type>
+ </para>
+ <para>
+ If the table is being scanned using an index, this is the OID of the
+ index being used; otherwise, it is zero.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_scanned</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples scanned.
+ This counter only advances when the phase is
+ <literal>seq scanning heap</literal>,
+ <literal>index scanning heap</literal>
+ or <literal>writing new heap</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_written</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples written.
+ This counter only advances when the phase is
+ <literal>seq scanning heap</literal>,
+ <literal>index scanning heap</literal>
+ or <literal>writing new heap</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_blks_total</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Total number of heap blocks in the table. This number is reported
+ as of the beginning of <literal>seq scanning heap</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_blks_scanned</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap blocks scanned. This counter only advances when the
+ phase is <literal>seq scanning heap</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>index_rebuild_count</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of indexes rebuilt. This counter only advances when the phase
+ is <literal>rebuilding index</literal>.
+ </para></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <table id="repack-phases">
+ <title>REPACK Phases</title>
+ <tgroup cols="2">
+ <colspec colname="col1" colwidth="1*"/>
+ <colspec colname="col2" colwidth="2*"/>
+ <thead>
+ <row>
+ <entry>Phase</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry><literal>initializing</literal></entry>
+ <entry>
+ The command is preparing to begin scanning the heap. This phase is
+ expected to be very brief.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>seq scanning heap</literal></entry>
+ <entry>
+ The command is currently scanning the table using a sequential scan.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>index scanning heap</literal></entry>
+ <entry>
+ <command>REPACK</command> is currently scanning the table using an index scan.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>sorting tuples</literal></entry>
+ <entry>
+ <command>REPACK</command> is currently sorting tuples.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>writing new heap</literal></entry>
+ <entry>
+ <command>REPACK</command> is currently writing the new heap.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>swapping relation files</literal></entry>
+ <entry>
+ The command is currently swapping newly-built files into place.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>rebuilding index</literal></entry>
+ <entry>
+ The command is currently rebuilding an index.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>performing final cleanup</literal></entry>
+ <entry>
+ The command is performing final cleanup. When this phase is
+ completed, <command>REPACK</command> will end.
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+ </sect2>
+
<sect2 id="copy-progress-reporting">
<title>COPY Progress Reporting</title>
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..c0ef654fcb4 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -167,6 +167,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY refreshMaterializedView SYSTEM "refresh_materialized_view.sgml">
<!ENTITY reindex SYSTEM "reindex.sgml">
<!ENTITY releaseSavepoint SYSTEM "release_savepoint.sgml">
+<!ENTITY repack SYSTEM "repack.sgml">
<!ENTITY reset SYSTEM "reset.sgml">
<!ENTITY revoke SYSTEM "revoke.sgml">
<!ENTITY rollback SYSTEM "rollback.sgml">
diff --git a/doc/src/sgml/ref/cluster.sgml b/doc/src/sgml/ref/cluster.sgml
index 8811f169ea0..54bb2362c84 100644
--- a/doc/src/sgml/ref/cluster.sgml
+++ b/doc/src/sgml/ref/cluster.sgml
@@ -42,17 +42,23 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
<replaceable class="parameter">table_name</replaceable>.
</para>
- <para>
- When a table is clustered, it is physically reordered
- based on the index information. Clustering is a one-time operation:
- when the table is subsequently updated, the changes are
- not clustered. That is, no attempt is made to store new or
- updated rows according to their index order. (If one wishes, one can
- periodically recluster by issuing the command again. Also, setting
- the table's <literal>fillfactor</literal> storage parameter to less than
- 100% can aid in preserving cluster ordering during updates, since updated
- rows are kept on the same page if enough space is available there.)
- </para>
+ <warning>
+ <para>
+ The <command>CLUSTER</command> command is deprecated in favor of
+ <xref linkend="sql-repack"/>.
+ </para>
+ </warning>
+
+ <note>
+ <para>
+ <xref linkend="sql-repack-notes-on-clustering"/> explains how clustering
+ works, whether it is initiated by <command>CLUSTER</command> or
+ by <command>REPACK</command>. The notable difference between the two is
+ that <command>REPACK</command> does not remember the index used last
+ time. Thus if you don't specify an index, <command>REPACK</command>
+ rewrites the table but does not try to cluster it.
+ </para>
+ </note>
<para>
When a table is clustered, <productname>PostgreSQL</productname>
@@ -136,63 +142,12 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
on the table.
</para>
- <para>
- In cases where you are accessing single rows randomly
- within a table, the actual order of the data in the
- table is unimportant. However, if you tend to access some
- data more than others, and there is an index that groups
- them together, you will benefit from using <command>CLUSTER</command>.
- If you are requesting a range of indexed values from a table, or a
- single indexed value that has multiple rows that match,
- <command>CLUSTER</command> will help because once the index identifies the
- table page for the first row that matches, all other rows
- that match are probably already on the same table page,
- and so you save disk accesses and speed up the query.
- </para>
-
- <para>
- <command>CLUSTER</command> can re-sort the table using either an index scan
- on the specified index, or (if the index is a b-tree) a sequential
- scan followed by sorting. It will attempt to choose the method that
- will be faster, based on planner cost parameters and available statistical
- information.
- </para>
-
<para>
While <command>CLUSTER</command> is running, the <xref
linkend="guc-search-path"/> is temporarily changed to <literal>pg_catalog,
pg_temp</literal>.
</para>
- <para>
- When an index scan is used, a temporary copy of the table is created that
- contains the table data in the index order. Temporary copies of each
- index on the table are created as well. Therefore, you need free space on
- disk at least equal to the sum of the table size and the index sizes.
- </para>
-
- <para>
- When a sequential scan and sort is used, a temporary sort file is
- also created, so that the peak temporary space requirement is as much
- as double the table size, plus the index sizes. This method is often
- faster than the index scan method, but if the disk space requirement is
- intolerable, you can disable this choice by temporarily setting <xref
- linkend="guc-enable-sort"/> to <literal>off</literal>.
- </para>
-
- <para>
- It is advisable to set <xref linkend="guc-maintenance-work-mem"/> to
- a reasonably large value (but not more than the amount of RAM you can
- dedicate to the <command>CLUSTER</command> operation) before clustering.
- </para>
-
- <para>
- Because the planner records statistics about the ordering of
- tables, it is advisable to run <link linkend="sql-analyze"><command>ANALYZE</command></link>
- on the newly clustered table.
- Otherwise, the planner might make poor choices of query plans.
- </para>
-
<para>
Because <command>CLUSTER</command> remembers which indexes are clustered,
one can cluster the tables one wants clustered manually the first time,
diff --git a/doc/src/sgml/ref/repack.sgml b/doc/src/sgml/ref/repack.sgml
new file mode 100644
index 00000000000..84f3c3e3f2b
--- /dev/null
+++ b/doc/src/sgml/ref/repack.sgml
@@ -0,0 +1,254 @@
+<!--
+doc/src/sgml/ref/repack.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-repack">
+ <indexterm zone="sql-repack">
+ <primary>REPACK</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle>REPACK</refentrytitle>
+ <manvolnum>7</manvolnum>
+ <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>REPACK</refname>
+ <refpurpose>cluster a table according to an index</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <replaceable class="parameter">table_name</replaceable> [ USING INDEX <replaceable class="parameter">index_name</replaceable> ] ]
+
+<phrase>where <replaceable class="parameter">option</replaceable> can be one of:</phrase>
+
+ VERBOSE [ <replaceable class="parameter">boolean</replaceable> ]
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+
+ <para>
+ <command>REPACK</command> reclaims storage occupied by dead
+ tuples. Unlike <command>VACUUM</command>, it does so by rewriting the
+ entire contents of the table specified
+ by <replaceable class="parameter">table_name</replaceable> into a new disk
+ file with no extra space (except for the space guaranteed by
+ the <literal>fillfactor</literal> storage parameter), allowing unused space
+ to be returned to the operating system.
+ </para>
+
+ <para>
+ Without
+ a <replaceable class="parameter">table_name</replaceable>, <command>REPACK</command>
+ processes every table and materialized view in the current database that
+ the current user has the <literal>MAINTAIN</literal> privilege on. This
+ form of <command>REPACK</command> cannot be executed inside a transaction
+ block.
+ </para>
+
+ <para>
+ If <replaceable class="parameter">index_name</replaceable> is specified,
+ the table is clustered by this index. Please see the notes on clustering
+ below.
+ </para>
+
+ <para>
+ When a table is being repacked, an <literal>ACCESS EXCLUSIVE</literal> lock
+ is acquired on it. This prevents any other database operations (both reads
+ and writes) from operating on the table until the <command>REPACK</command>
+ is finished.
+ </para>
+
+ <refsect2 id="sql-repack-notes-on-clustering" xreflabel="Notes on Clustering">
+ <title>Notes on Clustering</title>
+
+ <para>
+ When a table is clustered, it is physically reordered based on the index
+ information. Clustering is a one-time operation: when the table is
+ subsequently updated, the changes are not clustered. That is, no attempt
+ is made to store new or updated rows according to their index order. (If
+ one wishes, one can periodically recluster by issuing the command again.
+ Also, setting the table's <literal>fillfactor</literal> storage parameter
+ to less than 100% can aid in preserving cluster ordering during updates,
+ since updated rows are kept on the same page if enough space is available
+ there.)
+ </para>
+
+ <para>
+ In cases where you are accessing single rows randomly within a table, the
+ actual order of the data in the table is unimportant. However, if you tend
+ to access some data more than others, and there is an index that groups
+ them together, you will benefit from using <command>REPACK</command>. If
+ you are requesting a range of indexed values from a table, or a single
+ indexed value that has multiple rows that match,
+ <command>REPACK</command> will help because once the index identifies the
+ table page for the first row that matches, all other rows that match are
+ probably already on the same table page, and so you save disk accesses and
+ speed up the query.
+ </para>
+
+ <para>
+ <command>REPACK</command> can re-sort the table using either an index scan
+ on the specified index, or (if the index is a b-tree) a sequential scan
+ followed by sorting. It will attempt to choose the method that will be
+ faster, based on planner cost parameters and available statistical
+ information.
+ </para>
+
+ <para>
+ Because the planner records statistics about the ordering of tables, it is
+ advisable to
+ run <link linkend="sql-analyze"><command>ANALYZE</command></link> on the
+ newly repacked table. Otherwise, the planner might make poor choices of
+ query plans.
+ </para>
+ </refsect2>
+
+ <refsect2 id="sql-repack-notes-on-resources" xreflabel="Notes on Resources">
+ <title>Notes on Resources</title>
+
+ <para>
+ When an index scan or a sequential scan without sort is used, a temporary
+ copy of the table is created that contains the table data in the index
+ order. Temporary copies of each index on the table are created as well.
+ Therefore, you need free space on disk at least equal to the sum of the
+ table size and the index sizes.
+ </para>
+
+ <para>
+ When a sequential scan and sort is used, a temporary sort file is also
+ created, so that the peak temporary space requirement is as much as double
+ the table size, plus the index sizes. This method is often faster than
+ the index scan method, but if the disk space requirement is intolerable,
+ you can disable this choice by temporarily setting
+ <xref linkend="guc-enable-sort"/> to <literal>off</literal>.
+ </para>
+
+ <para>
+ It is advisable to set <xref linkend="guc-maintenance-work-mem"/> to a
+ reasonably large value (but not more than the amount of RAM you can
+ dedicate to the <command>REPACK</command> operation) before repacking.
+ </para>
+ </refsect2>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Parameters</title>
+
+ <variablelist>
+ <varlistentry>
+ <term><replaceable class="parameter">table_name</replaceable></term>
+ <listitem>
+ <para>
+ The name (possibly schema-qualified) of a table.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><replaceable class="parameter">index_name</replaceable></term>
+ <listitem>
+ <para>
+ The name of an index.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>VERBOSE</literal></term>
+ <listitem>
+ <para>
+ Prints a progress report as each table is repacked
+ at <literal>INFO</literal> level.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><replaceable class="parameter">boolean</replaceable></term>
+ <listitem>
+ <para>
+ Specifies whether the selected option should be turned on or off.
+ You can write <literal>TRUE</literal>, <literal>ON</literal>, or
+ <literal>1</literal> to enable the option, and <literal>FALSE</literal>,
+ <literal>OFF</literal>, or <literal>0</literal> to disable it. The
+ <replaceable class="parameter">boolean</replaceable> value can also
+ be omitted, in which case <literal>TRUE</literal> is assumed.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </refsect1>
+
+ <refsect1>
+ <title>Notes</title>
+
+ <para>
+ To repack a table, one must have the <literal>MAINTAIN</literal> privilege
+ on the table.
+ </para>
+
+ <para>
+ While <command>REPACK</command> is running, the <xref
+ linkend="guc-search-path"/> is temporarily changed to <literal>pg_catalog,
+ pg_temp</literal>.
+ </para>
+
+ <para>
+ Each backend running <command>REPACK</command> will report its progress
+ in the <structname>pg_stat_progress_repack</structname> view. See
+ <xref linkend="repack-progress-reporting"/> for details.
+ </para>
+
+ <para>
+ Repacking a partitioned table repacks each of its partitions. If an index
+ is specified, each partition is clustered using the partition of that
+ index. <command>REPACK</command> on a partitioned table cannot be executed
+ inside a transaction block.
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Examples</title>
+
+ <para>
+ Repack the table <literal>employees</literal>:
+<programlisting>
+REPACK employees;
+</programlisting>
+ </para>
+
+
+ <para>
+ Cluster the table <literal>employees</literal> on the basis of its
+ index <literal>employees_ind</literal>:
+<programlisting>
+REPACK employees USING INDEX employees_ind;
+</programlisting>
+ </para>
+
+ <para>
+ Repack all tables in the database on which you have
+ the <literal>MAINTAIN</literal> privilege:
+<programlisting>
+REPACK;
+</programlisting></para>
+ </refsect1>
+
+ <refsect1>
+ <title>Compatibility</title>
+
+ <para>
+ There is no <command>REPACK</command> statement in the SQL standard.
+ </para>
+
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index bd5dcaf86a5..735a2a7703a 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -98,6 +98,14 @@ VACUUM [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
<varlistentry>
<term><literal>FULL</literal></term>
<listitem>
+
+ <warning>
+ <para>
+ The <command>FULL</command> parameter is deprecated in favor of
+ <xref linkend="sql-repack"/>.
+ </para>
+ </warning>
+
<para>
Selects <quote>full</quote> vacuum, which can reclaim more
space, but takes much longer and exclusively locks the table.
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..229912d35b7 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -195,6 +195,7 @@
&refreshMaterializedView;
&reindex;
&releaseSavepoint;
+ &repack;
&reset;
&revoke;
&rollback;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 24d3765aa20..18e349c3466 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -741,13 +741,13 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
if (OldIndex != NULL && !use_sort)
{
const int ci_index[] = {
- PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_INDEX_RELID
+ PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_INDEX_RELID
};
int64 ci_val[2];
/* Set phase and OIDOldIndex to columns */
- ci_val[0] = PROGRESS_CLUSTER_PHASE_INDEX_SCAN_HEAP;
+ ci_val[0] = PROGRESS_REPACK_PHASE_INDEX_SCAN_HEAP;
ci_val[1] = RelationGetRelid(OldIndex);
pgstat_progress_update_multi_param(2, ci_index, ci_val);
@@ -759,15 +759,15 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
else
{
/* In scan-and-sort mode and also VACUUM FULL, set phase */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_SEQ_SCAN_HEAP);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_SEQ_SCAN_HEAP);
tableScan = table_beginscan(OldHeap, SnapshotAny, 0, (ScanKey) NULL);
heapScan = (HeapScanDesc) tableScan;
indexScan = NULL;
/* Set total heap blocks */
- pgstat_progress_update_param(PROGRESS_CLUSTER_TOTAL_HEAP_BLKS,
+ pgstat_progress_update_param(PROGRESS_REPACK_TOTAL_HEAP_BLKS,
heapScan->rs_nblocks);
}
@@ -809,7 +809,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
* is manually updated to the correct value when the table
* scan finishes.
*/
- pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_BLKS_SCANNED,
+ pgstat_progress_update_param(PROGRESS_REPACK_HEAP_BLKS_SCANNED,
heapScan->rs_nblocks);
break;
}
@@ -825,7 +825,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
*/
if (prev_cblock != heapScan->rs_cblock)
{
- pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_BLKS_SCANNED,
+ pgstat_progress_update_param(PROGRESS_REPACK_HEAP_BLKS_SCANNED,
(heapScan->rs_cblock +
heapScan->rs_nblocks -
heapScan->rs_startblock
@@ -912,14 +912,14 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
* In scan-and-sort mode, report increase in number of tuples
* scanned
*/
- pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED,
+ pgstat_progress_update_param(PROGRESS_REPACK_HEAP_TUPLES_SCANNED,
*num_tuples);
}
else
{
const int ct_index[] = {
- PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED,
- PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN
+ PROGRESS_REPACK_HEAP_TUPLES_SCANNED,
+ PROGRESS_REPACK_HEAP_TUPLES_WRITTEN
};
int64 ct_val[2];
@@ -952,14 +952,14 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
double n_tuples = 0;
/* Report that we are now sorting tuples */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_SORT_TUPLES);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_SORT_TUPLES);
tuplesort_performsort(tuplesort);
/* Report that we are now writing new heap */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_WRITE_NEW_HEAP);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_WRITE_NEW_HEAP);
for (;;)
{
@@ -977,7 +977,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
values, isnull,
rwstate);
/* Report n_tuples */
- pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN,
+ pgstat_progress_update_param(PROGRESS_REPACK_HEAP_TUPLES_WRITTEN,
n_tuples);
}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 739a92bdcc1..466cf0fdef6 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -4079,7 +4079,7 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
Assert(!ReindexIsProcessingIndex(indexOid));
/* Set index rebuild count */
- pgstat_progress_update_param(PROGRESS_CLUSTER_INDEX_REBUILD_COUNT,
+ pgstat_progress_update_param(PROGRESS_REPACK_INDEX_REBUILD_COUNT,
i);
i++;
}
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 31d269b7ee0..5de46bcac52 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1262,6 +1262,33 @@ CREATE VIEW pg_stat_progress_cluster AS
FROM pg_stat_get_progress_info('CLUSTER') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
+CREATE VIEW pg_stat_progress_repack AS
+ SELECT
+ S.pid AS pid,
+ S.datid AS datid,
+ D.datname AS datname,
+ S.relid AS relid,
+ CASE S.param1 WHEN 1 THEN 'REPACK'
+ END AS command,
+ CASE S.param2 WHEN 0 THEN 'initializing'
+ WHEN 1 THEN 'seq scanning heap'
+ WHEN 2 THEN 'index scanning heap'
+ WHEN 3 THEN 'sorting tuples'
+ WHEN 4 THEN 'writing new heap'
+ WHEN 5 THEN 'swapping relation files'
+ WHEN 6 THEN 'rebuilding index'
+ WHEN 7 THEN 'performing final cleanup'
+ END AS phase,
+ CAST(S.param3 AS oid) AS repack_index_relid,
+ S.param4 AS heap_tuples_scanned,
+ S.param5 AS heap_tuples_written,
+ S.param6 AS heap_blks_total,
+ S.param7 AS heap_blks_scanned,
+ S.param8 AS index_rebuild_count
+ FROM pg_stat_get_progress_info('REPACK') AS S
+ LEFT JOIN pg_database D ON S.datid = D.oid;
+
+
CREATE VIEW pg_stat_progress_create_index AS
SELECT
S.pid AS pid, S.datid AS datid, D.datname AS datname,
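As an aside for reviewers: the new view can be watched from a second session while a REPACK runs, in the same way as pg_stat_progress_cluster. This is only an illustrative query against the view defined above; the percentage expression is my own addition, not part of the patch:

```sql
-- Sketch: monitor an in-flight REPACK from another session.
SELECT pid,
       relid::regclass AS table_name,
       command,
       phase,
       heap_blks_scanned,
       heap_blks_total,
       round(100.0 * heap_blks_scanned
             / NULLIF(heap_blks_total, 0), 1) AS pct_scanned
FROM pg_stat_progress_repack;
```

heap_blks_total is only meaningful during the heap-scan phases, hence the NULLIF guard against division by zero before the scan has started.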
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 54a08e4102e..9ae3d87e412 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -46,6 +46,7 @@
#include "storage/lmgr.h"
#include "storage/predicate.h"
#include "utils/acl.h"
+#include "utils/formatting.h"
#include "utils/fmgroids.h"
#include "utils/guc.h"
#include "utils/inval.h"
@@ -67,17 +68,33 @@ typedef struct
Oid indexOid;
} RelToCluster;
+/*
+ * Map a ClusterCommand value to its lowercase command name string.
+ */
+#define CLUSTER_COMMAND_STR(cmd) ((cmd) == CLUSTER_COMMAND_CLUSTER ? \
+ "cluster" : \
+ ((cmd) == CLUSTER_COMMAND_REPACK ? \
+ "repack" : "vacuum"))
-static void cluster_multiple_rels(List *rtcs, ClusterParams *params);
-static void rebuild_relation(Relation OldHeap, Relation index, bool verbose);
+static void cluster_multiple_rels(List *rtcs, ClusterParams *params,
+ ClusterCommand cmd);
+static void rebuild_relation(Relation OldHeap, Relation index, bool verbose,
+ ClusterCommand cmd);
static void copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
- bool verbose, bool *pSwapToastByContent,
+ bool verbose, ClusterCommand cmd,
+ bool *pSwapToastByContent,
TransactionId *pFreezeXid, MultiXactId *pCutoffMulti);
static List *get_tables_to_cluster(MemoryContext cluster_context);
+static List *get_tables_to_repack(MemoryContext repack_context);
static List *get_tables_to_cluster_partitioned(MemoryContext cluster_context,
- Oid indexOid);
-static bool cluster_is_permitted_for_relation(Oid relid, Oid userid);
-
+ Oid relid, bool rel_is_index,
+ ClusterCommand cmd);
+static bool cluster_is_permitted_for_relation(Oid relid, Oid userid,
+ ClusterCommand cmd);
+static Relation process_single_relation(RangeVar *relation, char *indexname,
+ ClusterCommand cmd,
+ ClusterParams *params,
+ Oid *indexOid_p);
/*---------------------------------------------------------------------------
* This cluster code allows for clustering multiple tables at once. Because
@@ -133,72 +150,11 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
if (stmt->relation != NULL)
{
- /* This is the single-relation case. */
- Oid tableOid;
-
- /*
- * Find, lock, and check permissions on the table. We obtain
- * AccessExclusiveLock right away to avoid lock-upgrade hazard in the
- * single-transaction case.
- */
- tableOid = RangeVarGetRelidExtended(stmt->relation,
- AccessExclusiveLock,
- 0,
- RangeVarCallbackMaintainsTable,
- NULL);
- rel = table_open(tableOid, NoLock);
-
- /*
- * Reject clustering a remote temp table ... their local buffer
- * manager is not going to cope.
- */
- if (RELATION_IS_OTHER_TEMP(rel))
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot cluster temporary tables of other sessions")));
-
- if (stmt->indexname == NULL)
- {
- ListCell *index;
-
- /* We need to find the index that has indisclustered set. */
- foreach(index, RelationGetIndexList(rel))
- {
- indexOid = lfirst_oid(index);
- if (get_index_isclustered(indexOid))
- break;
- indexOid = InvalidOid;
- }
-
- if (!OidIsValid(indexOid))
- ereport(ERROR,
- (errcode(ERRCODE_UNDEFINED_OBJECT),
- errmsg("there is no previously clustered index for table \"%s\"",
- stmt->relation->relname)));
- }
- else
- {
- /*
- * The index is expected to be in the same namespace as the
- * relation.
- */
- indexOid = get_relname_relid(stmt->indexname,
- rel->rd_rel->relnamespace);
- if (!OidIsValid(indexOid))
- ereport(ERROR,
- (errcode(ERRCODE_UNDEFINED_OBJECT),
- errmsg("index \"%s\" for table \"%s\" does not exist",
- stmt->indexname, stmt->relation->relname)));
- }
-
- /* For non-partitioned tables, do what we came here to do. */
- if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
- {
- cluster_rel(rel, indexOid, ¶ms);
- /* cluster_rel closes the relation, but keeps lock */
-
+ rel = process_single_relation(stmt->relation, stmt->indexname,
+ CLUSTER_COMMAND_CLUSTER, ¶ms,
+ &indexOid);
+ if (rel == NULL)
return;
- }
}
/*
@@ -230,8 +186,11 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
if (rel != NULL)
{
Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
- check_index_is_clusterable(rel, indexOid, AccessShareLock);
- rtcs = get_tables_to_cluster_partitioned(cluster_context, indexOid);
+ check_index_is_clusterable(rel, indexOid, AccessShareLock,
+ CLUSTER_COMMAND_CLUSTER);
+ rtcs = get_tables_to_cluster_partitioned(cluster_context, indexOid,
+ true,
+ CLUSTER_COMMAND_CLUSTER);
/* close relation, releasing lock on parent table */
table_close(rel, AccessExclusiveLock);
@@ -243,7 +202,7 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
}
/* Do the job. */
- cluster_multiple_rels(rtcs, ¶ms);
+ cluster_multiple_rels(rtcs, ¶ms, CLUSTER_COMMAND_CLUSTER);
/* Start a new transaction for the cleanup work. */
StartTransactionCommand();
@@ -260,7 +219,8 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
* return.
*/
static void
-cluster_multiple_rels(List *rtcs, ClusterParams *params)
+cluster_multiple_rels(List *rtcs, ClusterParams *params,
+ ClusterCommand cmd)
{
ListCell *lc;
@@ -283,7 +243,7 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
rel = table_open(rtc->tableOid, AccessExclusiveLock);
/* Process this table */
- cluster_rel(rel, rtc->indexOid, params);
+ cluster_rel(rel, rtc->indexOid, params, cmd);
/* cluster_rel closes the relation, but keeps lock */
PopActiveSnapshot();
@@ -306,9 +266,13 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
* If indexOid is InvalidOid, the table will be rewritten in physical order
* instead of index order. This is the new implementation of VACUUM FULL,
* and error messages should refer to the operation as VACUUM not CLUSTER.
+ *
+ * 'cmd' indicates which command is being executed. REPACK should be the only
+ * caller of this function in the future.
*/
void
-cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
+cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
+ ClusterCommand cmd)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid save_userid;
@@ -317,19 +281,33 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
bool verbose = ((params->options & CLUOPT_VERBOSE) != 0);
bool recheck = ((params->options & CLUOPT_RECHECK) != 0);
Relation index;
+ const char *cmd_str = CLUSTER_COMMAND_STR(cmd);
Assert(CheckRelationLockedByMe(OldHeap, AccessExclusiveLock, false));
/* Check for user-requested abort. */
CHECK_FOR_INTERRUPTS();
- pgstat_progress_start_command(PROGRESS_COMMAND_CLUSTER, tableOid);
- if (OidIsValid(indexOid))
- pgstat_progress_update_param(PROGRESS_CLUSTER_COMMAND,
- PROGRESS_CLUSTER_COMMAND_CLUSTER);
+ if (cmd == CLUSTER_COMMAND_REPACK)
+ pgstat_progress_start_command(PROGRESS_COMMAND_REPACK, tableOid);
else
- pgstat_progress_update_param(PROGRESS_CLUSTER_COMMAND,
+ pgstat_progress_start_command(PROGRESS_COMMAND_CLUSTER, tableOid);
+
+ if (cmd == CLUSTER_COMMAND_REPACK)
+ pgstat_progress_update_param(PROGRESS_REPACK_COMMAND,
+ PROGRESS_REPACK_COMMAND_REPACK);
+ else if (OidIsValid(indexOid))
+ {
+ Assert(cmd == CLUSTER_COMMAND_CLUSTER);
+ pgstat_progress_update_param(PROGRESS_REPACK_COMMAND,
+ PROGRESS_CLUSTER_COMMAND_CLUSTER);
+ }
+ else
+ {
+ Assert(cmd == CLUSTER_COMMAND_VACUUM);
+ pgstat_progress_update_param(PROGRESS_REPACK_COMMAND,
PROGRESS_CLUSTER_COMMAND_VACUUM_FULL);
+ }
/*
* Switch to the table owner's userid, so that any index functions are run
@@ -353,7 +331,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
if (recheck)
{
/* Check that the user still has privileges for the relation */
- if (!cluster_is_permitted_for_relation(tableOid, save_userid))
+ if (!cluster_is_permitted_for_relation(tableOid, save_userid, cmd))
{
relation_close(OldHeap, AccessExclusiveLock);
goto out;
@@ -403,39 +381,38 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
* would work in most respects, but the index would only get marked as
* indisclustered in the current database, leading to unexpected behavior
* if CLUSTER were later invoked in another database.
+ *
+ * REPACK does not set indisclustered. XXX Not sure I understand the
+ * comment above: how can an attribute be set "only in the current
+ * database"?
*/
- if (OidIsValid(indexOid) && OldHeap->rd_rel->relisshared)
+ if (cmd == CLUSTER_COMMAND_CLUSTER && OldHeap->rd_rel->relisshared)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot cluster a shared catalog")));
+ errmsg("cannot %s a shared catalog", cmd_str)));
/*
* Don't process temp tables of other backends ... their local buffer
* manager is not going to cope.
*/
if (RELATION_IS_OTHER_TEMP(OldHeap))
- {
- if (OidIsValid(indexOid))
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot cluster temporary tables of other sessions")));
- else
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot vacuum temporary tables of other sessions")));
- }
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot %s temporary tables of other sessions",
+ cmd_str)));
/*
* Also check for active uses of the relation in the current transaction,
* including open scans and pending AFTER trigger events.
*/
- CheckTableNotInUse(OldHeap, OidIsValid(indexOid) ? "CLUSTER" : "VACUUM");
+ CheckTableNotInUse(OldHeap, asc_toupper(cmd_str, strlen(cmd_str)));
/* Check heap and index are valid to cluster on */
if (OidIsValid(indexOid))
{
/* verify the index is good and lock it */
- check_index_is_clusterable(OldHeap, indexOid, AccessExclusiveLock);
+ check_index_is_clusterable(OldHeap, indexOid, AccessExclusiveLock,
+ cmd);
/* also open it */
index = index_open(indexOid, NoLock);
}
@@ -469,7 +446,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
TransferPredicateLocksToHeapRelation(OldHeap);
/* rebuild_relation does all the dirty work */
- rebuild_relation(OldHeap, index, verbose);
+ rebuild_relation(OldHeap, index, verbose, cmd);
/* rebuild_relation closes OldHeap, and index if valid */
out:
@@ -491,9 +468,11 @@ out:
* protection here.
*/
void
-check_index_is_clusterable(Relation OldHeap, Oid indexOid, LOCKMODE lockmode)
+check_index_is_clusterable(Relation OldHeap, Oid indexOid, LOCKMODE lockmode,
+ ClusterCommand cmd)
{
Relation OldIndex;
+ const char *cmd_str = CLUSTER_COMMAND_STR(cmd);
OldIndex = index_open(indexOid, lockmode);
@@ -512,8 +491,8 @@ check_index_is_clusterable(Relation OldHeap, Oid indexOid, LOCKMODE lockmode)
if (!OldIndex->rd_indam->amclusterable)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot cluster on index \"%s\" because access method does not support clustering",
- RelationGetRelationName(OldIndex))));
+ errmsg("cannot %s on index \"%s\" because access method does not support clustering",
+ cmd_str, RelationGetRelationName(OldIndex))));
/*
* Disallow clustering on incomplete indexes (those that might not index
@@ -524,7 +503,8 @@ check_index_is_clusterable(Relation OldHeap, Oid indexOid, LOCKMODE lockmode)
if (!heap_attisnull(OldIndex->rd_indextuple, Anum_pg_index_indpred, NULL))
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot cluster on partial index \"%s\"",
+ errmsg("cannot %s on partial index \"%s\"",
+ cmd_str,
RelationGetRelationName(OldIndex))));
/*
@@ -538,8 +518,8 @@ check_index_is_clusterable(Relation OldHeap, Oid indexOid, LOCKMODE lockmode)
if (!OldIndex->rd_index->indisvalid)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot cluster on invalid index \"%s\"",
- RelationGetRelationName(OldIndex))));
+ errmsg("cannot %s on invalid index \"%s\"",
+ cmd_str, RelationGetRelationName(OldIndex))));
/* Drop relcache refcnt on OldIndex, but keep lock */
index_close(OldIndex, NoLock);
@@ -626,7 +606,8 @@ mark_index_clustered(Relation rel, Oid indexOid, bool is_internal)
* On exit, they are closed, but locks on them are not released.
*/
static void
-rebuild_relation(Relation OldHeap, Relation index, bool verbose)
+rebuild_relation(Relation OldHeap, Relation index, bool verbose,
+ ClusterCommand cmd)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid accessMethod = OldHeap->rd_rel->relam;
@@ -664,7 +645,7 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose)
NewHeap = table_open(OIDNewHeap, NoLock);
/* Copy the heap data into the new table in the desired order */
- copy_table_data(NewHeap, OldHeap, index, verbose,
+ copy_table_data(NewHeap, OldHeap, index, verbose, cmd,
&swap_toast_by_content, &frozenXid, &cutoffMulti);
@@ -829,8 +810,8 @@ make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
*/
static void
copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verbose,
- bool *pSwapToastByContent, TransactionId *pFreezeXid,
- MultiXactId *pCutoffMulti)
+ ClusterCommand cmd, bool *pSwapToastByContent,
+ TransactionId *pFreezeXid, MultiXactId *pCutoffMulti)
{
Relation relRelation;
HeapTuple reltup;
@@ -845,6 +826,7 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
tups_recently_dead = 0;
BlockNumber num_pages;
int elevel = verbose ? INFO : DEBUG2;
+ const char *cmd_str = CLUSTER_COMMAND_STR(cmd);
PGRUsage ru0;
char *nspname;
@@ -958,18 +940,21 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
/* Log what we're doing */
if (OldIndex != NULL && !use_sort)
ereport(elevel,
- (errmsg("clustering \"%s.%s\" using index scan on \"%s\"",
+ (errmsg("%sing \"%s.%s\" using index scan on \"%s\"",
+ cmd_str,
nspname,
RelationGetRelationName(OldHeap),
RelationGetRelationName(OldIndex))));
else if (use_sort)
ereport(elevel,
- (errmsg("clustering \"%s.%s\" using sequential scan and sort",
+ (errmsg("%sing \"%s.%s\" using sequential scan and sort",
+ cmd_str,
nspname,
RelationGetRelationName(OldHeap))));
else
ereport(elevel,
- (errmsg("vacuuming \"%s.%s\"",
+ (errmsg("%sing \"%s.%s\"",
+ cmd_str,
nspname,
RelationGetRelationName(OldHeap))));
@@ -1458,8 +1443,8 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
int i;
/* Report that we are now swapping relation files */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_SWAP_REL_FILES);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_SWAP_REL_FILES);
/* Zero out possible results from swapped_relation_files */
memset(mapped_tables, 0, sizeof(mapped_tables));
@@ -1509,14 +1494,14 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
/* Report that we are now reindexing relations */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_REBUILD_INDEX);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_REBUILD_INDEX);
reindex_relation(NULL, OIDOldHeap, reindex_flags, &reindex_params);
/* Report that we are now doing clean up */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_FINAL_CLEANUP);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_FINAL_CLEANUP);
/*
* If the relation being rebuilt is pg_class, swap_relation_files()
@@ -1666,7 +1651,8 @@ get_tables_to_cluster(MemoryContext cluster_context)
index = (Form_pg_index) GETSTRUCT(indexTuple);
- if (!cluster_is_permitted_for_relation(index->indrelid, GetUserId()))
+ if (!cluster_is_permitted_for_relation(index->indrelid, GetUserId(),
+ CLUSTER_COMMAND_CLUSTER))
continue;
/* Use a permanent memory context for the result list */
@@ -1687,14 +1673,67 @@ get_tables_to_cluster(MemoryContext cluster_context)
}
/*
- * Given an index on a partitioned table, return a list of RelToCluster for
+ * Like get_tables_to_cluster(), but do not care about indexes.
+ */
+static List *
+get_tables_to_repack(MemoryContext repack_context)
+{
+ Relation relrelation;
+ TableScanDesc scan;
+ HeapTuple tuple;
+ MemoryContext old_context;
+ List *rtcs = NIL;
+
+ /*
+ * Get all plain relations that the current user has the appropriate
+ * privileges for.
+ */
+ relrelation = table_open(RelationRelationId, AccessShareLock);
+ scan = table_beginscan_catalog(relrelation, 0, NULL);
+ while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+ {
+ RelToCluster *rtc;
+ Form_pg_class classForm = (Form_pg_class) GETSTRUCT(tuple);
+ Oid relid = classForm->oid;
+
+ /* Only interested in relations. */
+ if (get_rel_relkind(relid) != RELKIND_RELATION)
+ continue;
+
+ if (!cluster_is_permitted_for_relation(relid, GetUserId(),
+ CLUSTER_COMMAND_REPACK))
+ continue;
+
+ /* Use a permanent memory context for the result list */
+ old_context = MemoryContextSwitchTo(repack_context);
+
+ rtc = (RelToCluster *) palloc(sizeof(RelToCluster));
+ rtc->tableOid = relid;
+ rtc->indexOid = InvalidOid;
+ rtcs = lappend(rtcs, rtc);
+
+ MemoryContextSwitchTo(old_context);
+ }
+ table_endscan(scan);
+
+ relation_close(relrelation, AccessShareLock);
+
+ return rtcs;
+}
+
+/*
+ * Given a partitioned table or its index, return a list of RelToCluster for
* all the children leaves tables/indexes.
*
* Like expand_vacuum_rel, but here caller must hold AccessExclusiveLock
* on the table containing the index.
+ *
+ * 'rel_is_index' tells whether 'relid' is that of an index (true) or of the
+ * owning relation.
*/
static List *
-get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
+get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid relid,
+ bool rel_is_index, ClusterCommand cmd)
{
List *inhoids;
ListCell *lc;
@@ -1702,17 +1741,33 @@ get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
MemoryContext old_context;
/* Do not lock the children until they're processed */
- inhoids = find_all_inheritors(indexOid, NoLock, NULL);
+ inhoids = find_all_inheritors(relid, NoLock, NULL);
foreach(lc, inhoids)
{
- Oid indexrelid = lfirst_oid(lc);
- Oid relid = IndexGetRelation(indexrelid, false);
+ Oid inhoid = lfirst_oid(lc);
+ Oid inhrelid,
+ inhindid;
RelToCluster *rtc;
- /* consider only leaf indexes */
- if (get_rel_relkind(indexrelid) != RELKIND_INDEX)
- continue;
+ if (rel_is_index)
+ {
+ /* consider only leaf indexes */
+ if (get_rel_relkind(inhoid) != RELKIND_INDEX)
+ continue;
+
+ inhrelid = IndexGetRelation(inhoid, false);
+ inhindid = inhoid;
+ }
+ else
+ {
+ /* consider only leaf relations */
+ if (get_rel_relkind(inhoid) != RELKIND_RELATION)
+ continue;
+
+ inhrelid = inhoid;
+ inhindid = InvalidOid;
+ }
/*
* It's possible that the user does not have privileges to CLUSTER the
@@ -1720,15 +1775,15 @@ get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
* table. We skip any partitions which the user is not permitted to
* CLUSTER.
*/
- if (!cluster_is_permitted_for_relation(relid, GetUserId()))
+ if (!cluster_is_permitted_for_relation(inhrelid, GetUserId(), cmd))
continue;
/* Use a permanent memory context for the result list */
old_context = MemoryContextSwitchTo(cluster_context);
rtc = (RelToCluster *) palloc(sizeof(RelToCluster));
- rtc->tableOid = relid;
- rtc->indexOid = indexrelid;
+ rtc->tableOid = inhrelid;
+ rtc->indexOid = inhindid;
rtcs = lappend(rtcs, rtc);
MemoryContextSwitchTo(old_context);
@@ -1742,13 +1797,192 @@ get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
* function emits a WARNING.
*/
static bool
-cluster_is_permitted_for_relation(Oid relid, Oid userid)
+cluster_is_permitted_for_relation(Oid relid, Oid userid, ClusterCommand cmd)
{
if (pg_class_aclcheck(relid, userid, ACL_MAINTAIN) == ACLCHECK_OK)
return true;
ereport(WARNING,
- (errmsg("permission denied to cluster \"%s\", skipping it",
+ (errmsg("permission denied to %s \"%s\", skipping it",
+ CLUSTER_COMMAND_STR(cmd),
get_rel_name(relid))));
return false;
}
+
+/*
+ * REPACK is intended to be a replacement for both CLUSTER and VACUUM FULL.
+ */
+void
+repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
+{
+ ListCell *lc;
+ ClusterParams params = {0};
+ bool verbose = false;
+ Relation rel = NULL;
+ Oid indexOid = InvalidOid;
+ MemoryContext repack_context;
+ List *rtcs;
+
+ /* Parse option list */
+ foreach(lc, stmt->params)
+ {
+ DefElem *opt = (DefElem *) lfirst(lc);
+
+ if (strcmp(opt->defname, "verbose") == 0)
+ verbose = defGetBoolean(opt);
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("unrecognized REPACK option \"%s\"",
+ opt->defname),
+ parser_errposition(pstate, opt->location)));
+ }
+
+ params.options = (verbose ? CLUOPT_VERBOSE : 0);
+
+ if (stmt->relation != NULL)
+ {
+ rel = process_single_relation(stmt->relation, stmt->indexname,
+ CLUSTER_COMMAND_REPACK, ¶ms,
+ &indexOid);
+ if (rel == NULL)
+ return;
+ }
+
+ /*
+ * By here, we know we are in a multi-table situation. In order to avoid
+ * holding locks for too long, we want to process each table in its own
+ * transaction. This forces us to disallow running inside a user
+ * transaction block.
+ */
+ PreventInTransactionBlock(isTopLevel, "REPACK");
+
+ /* Also, we need a memory context to hold our list of relations */
+ repack_context = AllocSetContextCreate(PortalContext,
+ "Repack",
+ ALLOCSET_DEFAULT_SIZES);
+
+ params.options |= CLUOPT_RECHECK;
+ if (rel != NULL)
+ {
+ Oid relid;
+ bool rel_is_index;
+
+ Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
+
+ if (OidIsValid(indexOid))
+ {
+ relid = indexOid;
+ rel_is_index = true;
+ }
+ else
+ {
+ relid = RelationGetRelid(rel);
+ rel_is_index = false;
+ }
+ rtcs = get_tables_to_cluster_partitioned(repack_context, relid,
+ rel_is_index,
+ CLUSTER_COMMAND_REPACK);
+
+ /* close relation, releasing lock on parent table */
+ table_close(rel, AccessExclusiveLock);
+ }
+ else
+ rtcs = get_tables_to_repack(repack_context);
+
+ /* Do the job. */
+ cluster_multiple_rels(rtcs, ¶ms, CLUSTER_COMMAND_REPACK);
+
+ /* Start a new transaction for the cleanup work. */
+ StartTransactionCommand();
+
+ /* Clean up working storage */
+ MemoryContextDelete(repack_context);
+}
+
+/*
+ * Process a single relation for CLUSTER or REPACK.
+ *
+ * Return NULL if the work is done, or a reference to the still-open relation
+ * if the caller needs to process it further (because it is partitioned).
+ */
+static Relation
+process_single_relation(RangeVar *relation, char *indexname,
+ ClusterCommand cmd, ClusterParams *params,
+ Oid *indexOid_p)
+{
+ Relation rel;
+ Oid indexOid = InvalidOid;
+
+ /* This is the single-relation case. */
+ Oid tableOid;
+
+ /*
+ * Find, lock, and check permissions on the table. We obtain
+ * AccessExclusiveLock right away to avoid lock-upgrade hazard in the
+ * single-transaction case.
+ */
+ tableOid = RangeVarGetRelidExtended(relation,
+ AccessExclusiveLock,
+ 0,
+ RangeVarCallbackMaintainsTable,
+ NULL);
+ rel = table_open(tableOid, NoLock);
+
+ /*
+ * Reject clustering a remote temp table ... their local buffer manager is
+ * not going to cope.
+ */
+ if (RELATION_IS_OTHER_TEMP(rel))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot %s temporary tables of other sessions",
+ CLUSTER_COMMAND_STR(cmd))));
+
+ if (indexname == NULL && cmd == CLUSTER_COMMAND_CLUSTER)
+ {
+ ListCell *index;
+
+ /* We need to find the index that has indisclustered set. */
+ foreach(index, RelationGetIndexList(rel))
+ {
+ indexOid = lfirst_oid(index);
+ if (get_index_isclustered(indexOid))
+ break;
+ indexOid = InvalidOid;
+ }
+
+ if (!OidIsValid(indexOid))
+ ereport(ERROR,
+ (errcode(ERRCODE_UNDEFINED_OBJECT),
+ errmsg("there is no previously clustered index for table \"%s\"",
+ relation->relname)));
+ }
+ else if (indexname != NULL)
+ {
+ /*
+ * The index is expected to be in the same namespace as the relation.
+ */
+ indexOid = get_relname_relid(indexname,
+ rel->rd_rel->relnamespace);
+ if (!OidIsValid(indexOid))
+ ereport(ERROR,
+ (errcode(ERRCODE_UNDEFINED_OBJECT),
+ errmsg("index \"%s\" for table \"%s\" does not exist",
+ indexname, relation->relname)));
+ }
+
+ *indexOid_p = indexOid;
+
+ /* For non-partitioned tables, do what we came here to do. */
+ if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+ {
+ cluster_rel(rel, indexOid, params, cmd);
+ /* cluster_rel closes the relation, but keeps lock */
+
+ return NULL;
+ }
+
+ return rel;
+}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 778e956b1ff..e59ea0468c2 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -15739,7 +15739,8 @@ ATExecClusterOn(Relation rel, const char *indexName, LOCKMODE lockmode)
indexName, RelationGetRelationName(rel))));
/* Check index is valid to cluster on */
- check_index_is_clusterable(rel, indexOid, lockmode);
+ check_index_is_clusterable(rel, indexOid, lockmode,
+ CLUSTER_COMMAND_CLUSTER);
/* And do the work */
mark_index_clustered(rel, indexOid, false);
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index f0a7b87808d..61018482089 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -2263,7 +2263,8 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
cluster_params.options |= CLUOPT_VERBOSE;
/* VACUUM FULL is now a variant of CLUSTER; see cluster.c */
- cluster_rel(rel, InvalidOid, &cluster_params);
+ cluster_rel(rel, InvalidOid, &cluster_params,
+ CLUSTER_COMMAND_VACUUM);
/* cluster_rel closes the relation, but keeps lock */
rel = NULL;
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 271ae26cbaf..d53808a406e 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -298,7 +298,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
GrantStmt GrantRoleStmt ImportForeignSchemaStmt IndexStmt InsertStmt
ListenStmt LoadStmt LockStmt MergeStmt NotifyStmt ExplainableStmt PreparableStmt
CreateFunctionStmt AlterFunctionStmt ReindexStmt RemoveAggrStmt
- RemoveFuncStmt RemoveOperStmt RenameStmt ReturnStmt RevokeStmt RevokeRoleStmt
+ RemoveFuncStmt RemoveOperStmt RenameStmt RepackStmt ReturnStmt RevokeStmt RevokeRoleStmt
RuleActionStmt RuleActionStmtOrEmpty RuleStmt
SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
UnlistenStmt UpdateStmt VacuumStmt
@@ -381,7 +381,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
%type <str> copy_file_name
access_method_clause attr_name
table_access_method_clause name cursor_name file_name
- cluster_index_specification
+ cluster_index_specification repack_index_specification
%type <list> func_name handler_name qual_Op qual_all_Op subquery_Op
opt_inline_handler opt_validator validator_clause
@@ -764,7 +764,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
QUOTE QUOTES
RANGE READ REAL REASSIGN RECURSIVE REF_P REFERENCES REFERENCING
- REFRESH REINDEX RELATIVE_P RELEASE RENAME REPEATABLE REPLACE REPLICA
+ REFRESH REINDEX RELATIVE_P RELEASE RENAME REPACK REPEATABLE REPLACE REPLICA
RESET RESTART RESTRICT RETURN RETURNING RETURNS REVOKE RIGHT ROLE ROLLBACK ROLLUP
ROUTINE ROUTINES ROW ROWS RULE
@@ -1100,6 +1100,7 @@ stmt:
| RemoveFuncStmt
| RemoveOperStmt
| RenameStmt
+ | RepackStmt
| RevokeStmt
| RevokeRoleStmt
| RuleStmt
@@ -11897,6 +11898,60 @@ cluster_index_specification:
| /*EMPTY*/ { $$ = NULL; }
;
+/*****************************************************************************
+ *
+ * QUERY:
+ * REPACK [ (options) ] [ <qualified_name> [ USING INDEX <index_name> ] ]
+ *
+ *****************************************************************************/
+
+RepackStmt:
+ REPACK qualified_name repack_index_specification
+ {
+ RepackStmt *n = makeNode(RepackStmt);
+
+ n->relation = $2;
+ n->indexname = $3;
+ n->params = NIL;
+ $$ = (Node *) n;
+ }
+
+ | REPACK '(' utility_option_list ')' qualified_name repack_index_specification
+ {
+ RepackStmt *n = makeNode(RepackStmt);
+
+ n->relation = $5;
+ n->indexname = $6;
+ n->params = $3;
+ $$ = (Node *) n;
+ }
+
+ | REPACK
+ {
+ RepackStmt *n = makeNode(RepackStmt);
+
+ n->relation = NULL;
+ n->indexname = NULL;
+ n->params = NIL;
+ $$ = (Node *) n;
+ }
+
+ | REPACK '(' utility_option_list ')'
+ {
+ RepackStmt *n = makeNode(RepackStmt);
+
+ n->relation = NULL;
+ n->indexname = NULL;
+ n->params = $3;
+ $$ = (Node *) n;
+ }
+ ;
+
+repack_index_specification:
+ USING INDEX name { $$ = $3; }
+ | /*EMPTY*/ { $$ = NULL; }
+ ;
+
/*****************************************************************************
*
@@ -17937,6 +17992,7 @@ unreserved_keyword:
| RELATIVE_P
| RELEASE
| RENAME
+ | REPACK
| REPEATABLE
| REPLACE
| REPLICA
@@ -18568,6 +18624,7 @@ bare_label_keyword:
| RELATIVE_P
| RELEASE
| RENAME
+ | REPACK
| REPEATABLE
| REPLACE
| REPLICA
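For reference, the four RepackStmt productions added above accept statements of the following shapes (illustrative only; the table and index names are hypothetical, and VERBOSE is the sole option the repack() option loop currently recognizes):

```sql
REPACK;                                     -- all eligible tables
REPACK (VERBOSE);                           -- ditto, with progress messages
REPACK my_table;                            -- one table, physical order
REPACK (VERBOSE) my_table
    USING INDEX my_table_pkey;              -- one table, index order
```

Since repack() only looks up indisclustered when cmd is CLUSTER, the third form rewrites in physical order rather than falling back to a previously clustered index.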
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 25fe3d58016..bf3ba3c2ae7 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -280,6 +280,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_ClusterStmt:
case T_ReindexStmt:
case T_VacuumStmt:
+ case T_RepackStmt:
{
/*
* These commands write WAL, so they're not strictly
@@ -862,6 +863,10 @@ standard_ProcessUtility(PlannedStmt *pstmt,
ExecVacuum(pstate, (VacuumStmt *) parsetree, isTopLevel);
break;
+ case T_RepackStmt:
+ repack(pstate, (RepackStmt *) parsetree, isTopLevel);
+ break;
+
case T_ExplainStmt:
ExplainQuery(pstate, (ExplainStmt *) parsetree, params, dest);
break;
@@ -2869,6 +2874,10 @@ CreateCommandTag(Node *parsetree)
tag = CMDTAG_ANALYZE;
break;
+ case T_RepackStmt:
+ tag = CMDTAG_REPACK;
+ break;
+
case T_ExplainStmt:
tag = CMDTAG_EXPLAIN;
break;
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 97af7c6554f..ddec4914ea5 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -268,6 +268,8 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
cmdtype = PROGRESS_COMMAND_ANALYZE;
else if (pg_strcasecmp(cmd, "CLUSTER") == 0)
cmdtype = PROGRESS_COMMAND_CLUSTER;
+ else if (pg_strcasecmp(cmd, "REPACK") == 0)
+ cmdtype = PROGRESS_COMMAND_REPACK;
else if (pg_strcasecmp(cmd, "CREATE INDEX") == 0)
cmdtype = PROGRESS_COMMAND_CREATE_INDEX;
else if (pg_strcasecmp(cmd, "BASEBACKUP") == 0)
diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index 98951aef82c..31271786f21 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -1223,7 +1223,7 @@ static const char *const sql_commands[] = {
"DELETE FROM", "DISCARD", "DO", "DROP", "END", "EXECUTE", "EXPLAIN",
"FETCH", "GRANT", "IMPORT FOREIGN SCHEMA", "INSERT INTO", "LISTEN", "LOAD", "LOCK",
"MERGE INTO", "MOVE", "NOTIFY", "PREPARE",
- "REASSIGN", "REFRESH MATERIALIZED VIEW", "REINDEX", "RELEASE",
+ "REASSIGN", "REFRESH MATERIALIZED VIEW", "REINDEX", "RELEASE", "REPACK",
"RESET", "REVOKE", "ROLLBACK",
"SAVEPOINT", "SECURITY LABEL", "SELECT", "SET", "SHOW", "START",
"TABLE", "TRUNCATE", "UNLISTEN", "UPDATE", "VACUUM", "VALUES", "WITH",
@@ -4913,6 +4913,35 @@ match_previous_words(int pattern_id,
COMPLETE_WITH_QUERY(Query_for_list_of_tablespaces);
}
+/* REPACK */
+ else if (Matches("REPACK"))
+ COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_clusterables);
+ else if (Matches("REPACK", "(*)"))
+ COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_clusterables);
+ /* If we have REPACK <sth>, then add "USING INDEX" */
+ else if (Matches("REPACK", MatchAnyExcept("(")))
+ COMPLETE_WITH("USING INDEX");
+ /* If we have REPACK (*) <sth>, then add "USING INDEX" */
+ else if (Matches("REPACK", "(*)", MatchAny))
+ COMPLETE_WITH("USING INDEX");
+ /* If we have REPACK <sth> USING, then add the index as well */
+ else if (Matches("REPACK", MatchAny, "USING", "INDEX"))
+ {
+ set_completion_reference(prev3_wd);
+ COMPLETE_WITH_SCHEMA_QUERY(Query_for_index_of_table);
+ }
+ else if (HeadMatches("REPACK", "(*") &&
+ !HeadMatches("REPACK", "(*)"))
+ {
+ /*
+ * This fires if we're in an unfinished parenthesized option list.
+ * get_previous_words treats a completed parenthesized option list as
+ * one word, so the above test is correct.
+ */
+ if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
+ COMPLETE_WITH("VERBOSE");
+ }
+
/* SECURITY LABEL */
else if (Matches("SECURITY"))
COMPLETE_WITH("LABEL");
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 60088a64cbb..c2976905e4d 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -31,10 +31,27 @@ typedef struct ClusterParams
bits32 options; /* bitmask of CLUOPT_* */
} ClusterParams;
+/*
+ * cluster.c currently implements three nearly identical commands: CLUSTER,
+ * VACUUM FULL and REPACK. Where needed, use this enumeration to distinguish
+ * which of these commands is being executed.
+ *
+ * Remove this stuff when removing the (now deprecated) CLUSTER and VACUUM
+ * FULL commands.
+ */
+typedef enum ClusterCommand
+{
+ CLUSTER_COMMAND_CLUSTER,
+ CLUSTER_COMMAND_REPACK,
+ CLUSTER_COMMAND_VACUUM
+} ClusterCommand;
+
extern void cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel);
-extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params);
+extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
+ ClusterCommand cmd);
extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid,
- LOCKMODE lockmode);
+ LOCKMODE lockmode,
+ ClusterCommand cmd);
extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
extern Oid make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
@@ -48,4 +65,5 @@ extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
MultiXactId cutoffMulti,
char newrelpersistence);
+extern void repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel);
#endif /* CLUSTER_H */
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 7c736e7b03b..7644267e14f 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -56,24 +56,48 @@
#define PROGRESS_ANALYZE_PHASE_COMPUTE_EXT_STATS 4
#define PROGRESS_ANALYZE_PHASE_FINALIZE_ANALYZE 5
-/* Progress parameters for cluster */
-#define PROGRESS_CLUSTER_COMMAND 0
-#define PROGRESS_CLUSTER_PHASE 1
-#define PROGRESS_CLUSTER_INDEX_RELID 2
-#define PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED 3
-#define PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN 4
-#define PROGRESS_CLUSTER_TOTAL_HEAP_BLKS 5
-#define PROGRESS_CLUSTER_HEAP_BLKS_SCANNED 6
-#define PROGRESS_CLUSTER_INDEX_REBUILD_COUNT 7
+/*
+ * Progress parameters for REPACK.
+ *
+ * Note: Since REPACK shares some code with CLUSTER, (some of) these values
+ * are also used by CLUSTER. (CLUSTER is now deprecated, so it makes no sense
+ * to introduce a separate set of constants.)
+ */
+#define PROGRESS_REPACK_COMMAND 0
+#define PROGRESS_REPACK_PHASE 1
+#define PROGRESS_REPACK_INDEX_RELID 2
+#define PROGRESS_REPACK_HEAP_TUPLES_SCANNED 3
+#define PROGRESS_REPACK_HEAP_TUPLES_WRITTEN 4
+#define PROGRESS_REPACK_TOTAL_HEAP_BLKS 5
+#define PROGRESS_REPACK_HEAP_BLKS_SCANNED 6
+#define PROGRESS_REPACK_INDEX_REBUILD_COUNT 7
-/* Phases of cluster (as advertised via PROGRESS_CLUSTER_PHASE) */
-#define PROGRESS_CLUSTER_PHASE_SEQ_SCAN_HEAP 1
-#define PROGRESS_CLUSTER_PHASE_INDEX_SCAN_HEAP 2
-#define PROGRESS_CLUSTER_PHASE_SORT_TUPLES 3
-#define PROGRESS_CLUSTER_PHASE_WRITE_NEW_HEAP 4
-#define PROGRESS_CLUSTER_PHASE_SWAP_REL_FILES 5
-#define PROGRESS_CLUSTER_PHASE_REBUILD_INDEX 6
-#define PROGRESS_CLUSTER_PHASE_FINAL_CLEANUP 7
+/*
+ * Phases of repack (as advertised via PROGRESS_REPACK_PHASE).
+ *
+ * Note: Since REPACK shares some code with CLUSTER, (some of) these values
+ * are also used by CLUSTER. (CLUSTER is now deprecated, so it makes no sense
+ * to introduce a separate set of constants.)
+ */
+#define PROGRESS_REPACK_PHASE_SEQ_SCAN_HEAP 1
+#define PROGRESS_REPACK_PHASE_INDEX_SCAN_HEAP 2
+#define PROGRESS_REPACK_PHASE_SORT_TUPLES 3
+#define PROGRESS_REPACK_PHASE_WRITE_NEW_HEAP 4
+#define PROGRESS_REPACK_PHASE_SWAP_REL_FILES 5
+#define PROGRESS_REPACK_PHASE_REBUILD_INDEX 6
+#define PROGRESS_REPACK_PHASE_FINAL_CLEANUP 7
+
+/* Commands of PROGRESS_REPACK */
+#define PROGRESS_REPACK_COMMAND_REPACK 1
+
+/*
+ * Progress parameters for cluster.
+ *
+ * Although we need to report REPACK and CLUSTER in separate views, the
+ * parameters and phases of CLUSTER are a subset of those of REPACK. Therefore
+ * we just use the appropriate values defined for REPACK above instead of
+ * defining a separate set of constants here.
+ */
/* Commands of PROGRESS_CLUSTER */
#define PROGRESS_CLUSTER_COMMAND_CLUSTER 1
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 23c9e3c5abf..d32a4d9f2db 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3916,6 +3916,19 @@ typedef struct ClusterStmt
List *params; /* list of DefElem nodes */
} ClusterStmt;
+/* ----------------------
+ * Repack Statement
+ * ----------------------
+ */
+typedef struct RepackStmt
+{
+ NodeTag type;
+ RangeVar *relation; /* relation being repacked */
+ char *indexname; /* order tuples by this index */
+ List *params; /* list of DefElem nodes */
+} RepackStmt;
+
+
/* ----------------------
* Vacuum and Analyze Statements
*
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 40cf090ce61..0932d6fce5b 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -373,6 +373,7 @@ PG_KEYWORD("reindex", REINDEX, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("relative", RELATIVE_P, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("release", RELEASE, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("rename", RENAME, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("repack", REPACK, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("repeatable", REPEATABLE, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("replace", REPLACE, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("replica", REPLICA, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..cceb312f2b3 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -196,6 +196,7 @@ PG_CMDTAG(CMDTAG_REASSIGN_OWNED, "REASSIGN OWNED", false, false, false)
PG_CMDTAG(CMDTAG_REFRESH_MATERIALIZED_VIEW, "REFRESH MATERIALIZED VIEW", true, false, false)
PG_CMDTAG(CMDTAG_REINDEX, "REINDEX", true, false, false)
PG_CMDTAG(CMDTAG_RELEASE, "RELEASE", false, false, false)
+PG_CMDTAG(CMDTAG_REPACK, "REPACK", false, false, false)
PG_CMDTAG(CMDTAG_RESET, "RESET", false, false, false)
PG_CMDTAG(CMDTAG_REVOKE, "REVOKE", true, false, false)
PG_CMDTAG(CMDTAG_REVOKE_ROLE, "REVOKE ROLE", false, false, false)
diff --git a/src/include/utils/backend_progress.h b/src/include/utils/backend_progress.h
index dda813ab407..da3d14bb97b 100644
--- a/src/include/utils/backend_progress.h
+++ b/src/include/utils/backend_progress.h
@@ -25,6 +25,7 @@ typedef enum ProgressCommandType
PROGRESS_COMMAND_VACUUM,
PROGRESS_COMMAND_ANALYZE,
PROGRESS_COMMAND_CLUSTER,
+ PROGRESS_COMMAND_REPACK,
PROGRESS_COMMAND_CREATE_INDEX,
PROGRESS_COMMAND_BASEBACKUP,
PROGRESS_COMMAND_COPY,
diff --git a/src/test/regress/expected/cluster.out b/src/test/regress/expected/cluster.out
index 4d40a6809ab..ed7df29b8e5 100644
--- a/src/test/regress/expected/cluster.out
+++ b/src/test/regress/expected/cluster.out
@@ -254,6 +254,120 @@ ORDER BY 1;
clstr_tst_pkey
(3 rows)
+-- REPACK handles individual tables identically to CLUSTER, but it's worth
+-- checking if it handles table hierarchies identically as well.
+REPACK clstr_tst USING INDEX clstr_tst_c;
+-- Verify that inheritance link still works
+INSERT INTO clstr_tst_inh VALUES (0, 100, 'in child table 2');
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst;
+ a | b | c | substring | length
+----+-----+------------------+--------------------------------+--------
+ 10 | 14 | catorce | |
+ 18 | 5 | cinco | |
+ 9 | 4 | cuatro | |
+ 26 | 19 | diecinueve | |
+ 12 | 18 | dieciocho | |
+ 30 | 16 | dieciseis | |
+ 24 | 17 | diecisiete | |
+ 2 | 10 | diez | |
+ 23 | 12 | doce | |
+ 11 | 2 | dos | |
+ 25 | 9 | nueve | |
+ 31 | 8 | ocho | |
+ 1 | 11 | once | |
+ 28 | 15 | quince | |
+ 32 | 6 | seis | xyzzyxyzzyxyzzyxyzzyxyzzyxyzzy | 500000
+ 29 | 7 | siete | |
+ 15 | 13 | trece | |
+ 22 | 30 | treinta | |
+ 17 | 32 | treinta y dos | |
+ 3 | 31 | treinta y uno | |
+ 5 | 3 | tres | |
+ 20 | 1 | uno | |
+ 6 | 20 | veinte | |
+ 14 | 25 | veinticinco | |
+ 21 | 24 | veinticuatro | |
+ 4 | 22 | veintidos | |
+ 19 | 29 | veintinueve | |
+ 16 | 28 | veintiocho | |
+ 27 | 26 | veintiseis | |
+ 13 | 27 | veintisiete | |
+ 7 | 23 | veintitres | |
+ 8 | 21 | veintiuno | |
+ 0 | 100 | in child table | |
+ 0 | 100 | in child table 2 | |
+(34 rows)
+
+-- Verify that foreign key link still works
+INSERT INTO clstr_tst (b, c) VALUES (1111, 'this should fail');
+ERROR: insert or update on table "clstr_tst" violates foreign key constraint "clstr_tst_con"
+DETAIL: Key (b)=(1111) is not present in table "clstr_tst_s".
+SELECT conname FROM pg_constraint WHERE conrelid = 'clstr_tst'::regclass
+ORDER BY 1;
+ conname
+----------------------
+ clstr_tst_a_not_null
+ clstr_tst_con
+ clstr_tst_pkey
+(3 rows)
+
+-- Exercise REPACK USING INDEX again, now that more data has been inserted.
+REPACK clstr_tst USING INDEX clstr_tst_c;
+-- Verify that inheritance link still works
+INSERT INTO clstr_tst_inh VALUES (0, 100, 'in child table 3');
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst;
+ a | b | c | substring | length
+----+-----+------------------+--------------------------------+--------
+ 10 | 14 | catorce | |
+ 18 | 5 | cinco | |
+ 9 | 4 | cuatro | |
+ 26 | 19 | diecinueve | |
+ 12 | 18 | dieciocho | |
+ 30 | 16 | dieciseis | |
+ 24 | 17 | diecisiete | |
+ 2 | 10 | diez | |
+ 23 | 12 | doce | |
+ 11 | 2 | dos | |
+ 25 | 9 | nueve | |
+ 31 | 8 | ocho | |
+ 1 | 11 | once | |
+ 28 | 15 | quince | |
+ 32 | 6 | seis | xyzzyxyzzyxyzzyxyzzyxyzzyxyzzy | 500000
+ 29 | 7 | siete | |
+ 15 | 13 | trece | |
+ 22 | 30 | treinta | |
+ 17 | 32 | treinta y dos | |
+ 3 | 31 | treinta y uno | |
+ 5 | 3 | tres | |
+ 20 | 1 | uno | |
+ 6 | 20 | veinte | |
+ 14 | 25 | veinticinco | |
+ 21 | 24 | veinticuatro | |
+ 4 | 22 | veintidos | |
+ 19 | 29 | veintinueve | |
+ 16 | 28 | veintiocho | |
+ 27 | 26 | veintiseis | |
+ 13 | 27 | veintisiete | |
+ 7 | 23 | veintitres | |
+ 8 | 21 | veintiuno | |
+ 0 | 100 | in child table | |
+ 0 | 100 | in child table 2 | |
+ 0 | 100 | in child table 3 | |
+(35 rows)
+
+-- Verify that foreign key link still works
+INSERT INTO clstr_tst (b, c) VALUES (1111, 'this should fail');
+ERROR: insert or update on table "clstr_tst" violates foreign key constraint "clstr_tst_con"
+DETAIL: Key (b)=(1111) is not present in table "clstr_tst_s".
+SELECT conname FROM pg_constraint WHERE conrelid = 'clstr_tst'::regclass
+ORDER BY 1;
+ conname
+----------------------
+ clstr_tst_a_not_null
+ clstr_tst_con
+ clstr_tst_pkey
+(3 rows)
+
SELECT relname, relkind,
EXISTS(SELECT 1 FROM pg_class WHERE oid = c.reltoastrelid) AS hastoast
FROM pg_class c WHERE relname LIKE 'clstr_tst%' ORDER BY relname;
@@ -381,6 +495,35 @@ SELECT * FROM clstr_1;
2
(2 rows)
+-- REPACK w/o argument performs no ordering, so we can only check which tables
+-- have the relfilenode changed.
+RESET SESSION AUTHORIZATION;
+CREATE TEMP TABLE relnodes_old AS
+(SELECT relname, relfilenode
+FROM pg_class
+WHERE relname IN ('clstr_1', 'clstr_2', 'clstr_3'));
+SET SESSION AUTHORIZATION regress_clstr_user;
+SET client_min_messages = ERROR; -- order of "skipping" warnings may vary
+REPACK;
+RESET client_min_messages;
+RESET SESSION AUTHORIZATION;
+CREATE TEMP TABLE relnodes_new AS
+(SELECT relname, relfilenode
+FROM pg_class
+WHERE relname IN ('clstr_1', 'clstr_2', 'clstr_3'));
+-- Do the actual comparison. Unlike CLUSTER, REPACK should also have
+-- processed clstr_3, since it does not require a clustering index.
+SELECT o.relname FROM relnodes_old o
+JOIN relnodes_new n ON o.relname = n.relname
+WHERE o.relfilenode <> n.relfilenode
+ORDER BY o.relname;
+ relname
+---------
+ clstr_1
+ clstr_3
+(2 rows)
+
+SET SESSION AUTHORIZATION regress_clstr_user;
-- Test MVCC-safety of cluster. There isn't much we can do to verify the
-- results with a single backend...
CREATE TABLE clustertest (key int PRIMARY KEY);
@@ -495,6 +638,43 @@ ALTER TABLE clstrpart SET WITHOUT CLUSTER;
ERROR: cannot mark index clustered in partitioned table
ALTER TABLE clstrpart CLUSTER ON clstrpart_idx;
ERROR: cannot mark index clustered in partitioned table
+-- Check that REPACK sets new relfilenodes: it should process exactly the same
+-- tables as CLUSTER did.
+DROP TABLE old_cluster_info;
+DROP TABLE new_cluster_info;
+CREATE TEMP TABLE old_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+REPACK clstrpart USING INDEX clstrpart_idx;
+CREATE TEMP TABLE new_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM old_cluster_info AS old JOIN new_cluster_info AS new USING (relname) ORDER BY relname COLLATE "C";
+ relname | level | relkind | ?column?
+-------------+-------+---------+----------
+ clstrpart | 0 | p | t
+ clstrpart1 | 1 | p | t
+ clstrpart11 | 2 | r | f
+ clstrpart12 | 2 | p | t
+ clstrpart2 | 1 | r | f
+ clstrpart3 | 1 | p | t
+ clstrpart33 | 2 | r | f
+(7 rows)
+
+-- And finally the same for REPACK w/o index.
+DROP TABLE old_cluster_info;
+DROP TABLE new_cluster_info;
+CREATE TEMP TABLE old_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+REPACK clstrpart;
+CREATE TEMP TABLE new_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM old_cluster_info AS old JOIN new_cluster_info AS new USING (relname) ORDER BY relname COLLATE "C";
+ relname | level | relkind | ?column?
+-------------+-------+---------+----------
+ clstrpart | 0 | p | t
+ clstrpart1 | 1 | p | t
+ clstrpart11 | 2 | r | f
+ clstrpart12 | 2 | p | t
+ clstrpart2 | 1 | r | f
+ clstrpart3 | 1 | p | t
+ clstrpart33 | 2 | r | f
+(7 rows)
+
DROP TABLE clstrpart;
-- Ownership of partitions is checked
CREATE TABLE ptnowner(i int unique) PARTITION BY LIST (i);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 47478969135..84ca2dc3778 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2041,6 +2041,33 @@ pg_stat_progress_create_index| SELECT s.pid,
s.param15 AS partitions_done
FROM (pg_stat_get_progress_info('CREATE INDEX'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
+pg_stat_progress_repack| SELECT s.pid,
+ s.datid,
+ d.datname,
+ s.relid,
+ CASE s.param1
+ WHEN 1 THEN 'REPACK'::text
+ ELSE NULL::text
+ END AS command,
+ CASE s.param2
+ WHEN 0 THEN 'initializing'::text
+ WHEN 1 THEN 'seq scanning heap'::text
+ WHEN 2 THEN 'index scanning heap'::text
+ WHEN 3 THEN 'sorting tuples'::text
+ WHEN 4 THEN 'writing new heap'::text
+ WHEN 5 THEN 'swapping relation files'::text
+ WHEN 6 THEN 'rebuilding index'::text
+ WHEN 7 THEN 'performing final cleanup'::text
+ ELSE NULL::text
+ END AS phase,
+ (s.param3)::oid AS repack_index_relid,
+ s.param4 AS heap_tuples_scanned,
+ s.param5 AS heap_tuples_written,
+ s.param6 AS heap_blks_total,
+ s.param7 AS heap_blks_scanned,
+ s.param8 AS index_rebuild_count
+ FROM (pg_stat_get_progress_info('REPACK'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
+ LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_progress_vacuum| SELECT s.pid,
s.datid,
d.datname,
diff --git a/src/test/regress/sql/cluster.sql b/src/test/regress/sql/cluster.sql
index b7115f86104..e348e26fbfa 100644
--- a/src/test/regress/sql/cluster.sql
+++ b/src/test/regress/sql/cluster.sql
@@ -76,6 +76,33 @@ INSERT INTO clstr_tst (b, c) VALUES (1111, 'this should fail');
SELECT conname FROM pg_constraint WHERE conrelid = 'clstr_tst'::regclass
ORDER BY 1;
+-- REPACK handles individual tables identically to CLUSTER, but it's worth
+-- checking if it handles table hierarchies identically as well.
+REPACK clstr_tst USING INDEX clstr_tst_c;
+
+-- Verify that inheritance link still works
+INSERT INTO clstr_tst_inh VALUES (0, 100, 'in child table 2');
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst;
+
+-- Verify that foreign key link still works
+INSERT INTO clstr_tst (b, c) VALUES (1111, 'this should fail');
+
+SELECT conname FROM pg_constraint WHERE conrelid = 'clstr_tst'::regclass
+ORDER BY 1;
+
+-- Exercise REPACK USING INDEX again, now that more data has been inserted.
+REPACK clstr_tst USING INDEX clstr_tst_c;
+
+-- Verify that inheritance link still works
+INSERT INTO clstr_tst_inh VALUES (0, 100, 'in child table 3');
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst;
+
+-- Verify that foreign key link still works
+INSERT INTO clstr_tst (b, c) VALUES (1111, 'this should fail');
+
+SELECT conname FROM pg_constraint WHERE conrelid = 'clstr_tst'::regclass
+ORDER BY 1;
+
SELECT relname, relkind,
EXISTS(SELECT 1 FROM pg_class WHERE oid = c.reltoastrelid) AS hastoast
@@ -159,6 +186,34 @@ INSERT INTO clstr_1 VALUES (1);
CLUSTER clstr_1;
SELECT * FROM clstr_1;
+-- REPACK w/o argument performs no ordering, so we can only check which tables
+-- have the relfilenode changed.
+RESET SESSION AUTHORIZATION;
+CREATE TEMP TABLE relnodes_old AS
+(SELECT relname, relfilenode
+FROM pg_class
+WHERE relname IN ('clstr_1', 'clstr_2', 'clstr_3'));
+
+SET SESSION AUTHORIZATION regress_clstr_user;
+SET client_min_messages = ERROR; -- order of "skipping" warnings may vary
+REPACK;
+RESET client_min_messages;
+
+RESET SESSION AUTHORIZATION;
+CREATE TEMP TABLE relnodes_new AS
+(SELECT relname, relfilenode
+FROM pg_class
+WHERE relname IN ('clstr_1', 'clstr_2', 'clstr_3'));
+
+-- Do the actual comparison. Unlike CLUSTER, REPACK should also have
+-- processed clstr_3, since it does not require a clustering index.
+SELECT o.relname FROM relnodes_old o
+JOIN relnodes_new n ON o.relname = n.relname
+WHERE o.relfilenode <> n.relfilenode
+ORDER BY o.relname;
+
+SET SESSION AUTHORIZATION regress_clstr_user;
+
-- Test MVCC-safety of cluster. There isn't much we can do to verify the
-- results with a single backend...
@@ -229,6 +284,24 @@ SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM o
CLUSTER clstrpart;
ALTER TABLE clstrpart SET WITHOUT CLUSTER;
ALTER TABLE clstrpart CLUSTER ON clstrpart_idx;
+
+-- Check that REPACK sets new relfilenodes: it should process exactly the same
+-- tables as CLUSTER did.
+DROP TABLE old_cluster_info;
+DROP TABLE new_cluster_info;
+CREATE TEMP TABLE old_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+REPACK clstrpart USING INDEX clstrpart_idx;
+CREATE TEMP TABLE new_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM old_cluster_info AS old JOIN new_cluster_info AS new USING (relname) ORDER BY relname COLLATE "C";
+
+-- And finally the same for REPACK w/o index.
+DROP TABLE old_cluster_info;
+DROP TABLE new_cluster_info;
+CREATE TEMP TABLE old_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+REPACK clstrpart;
+CREATE TEMP TABLE new_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM old_cluster_info AS old JOIN new_cluster_info AS new USING (relname) ORDER BY relname COLLATE "C";
+
DROP TABLE clstrpart;
-- Ownership of partitions is checked
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 3fbf5a4c212..2ff996746af 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -415,6 +415,7 @@ ClientCertName
ClientConnectionInfo
ClientData
ClientSocket
+ClusterCommand
ClonePtrType
ClosePortalStmt
ClosePtrType
@@ -2500,6 +2501,7 @@ ReorderBufferTupleCidKey
ReorderBufferUpdateProgressTxnCB
ReorderTuple
RepOriginId
+RepackStmt
ReparameterizeForeignPathByChild_function
ReplaceVarsFromTargetList_context
ReplaceVarsNoMatchOption
--
2.39.5
Attachment: v10-0002-Move-progress-related-fields-from-PgBackendStatu.patch (text/x-diff; charset=utf-8)
From 0eaf44ada190af640b0306ccac388857e314e1ad Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=C3=81lvaro=20Herrera?= <alvherre@alvh.no-ip.org>
Date: Thu, 20 Mar 2025 15:46:14 +0100
Subject: [PATCH v10 2/9] Move progress-related fields from PgBackendStatus to
PgBackendProgress
---
src/backend/access/heap/vacuumlazy.c | 4 +--
src/backend/commands/analyze.c | 2 +-
src/backend/utils/activity/backend_progress.c | 33 ++++++++++---------
src/backend/utils/activity/backend_status.c | 9 ++---
src/backend/utils/adt/pgstatfuncs.c | 6 ++--
src/include/utils/backend_progress.h | 15 ++++++++-
src/include/utils/backend_status.h | 14 ++------
src/tools/pgindent/typedefs.list | 1 +
8 files changed, 45 insertions(+), 39 deletions(-)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 2cbcf5e5db2..76c8ec15dde 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -1107,10 +1107,10 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* We bypass the changecount mechanism because this value is
* only updated by the calling process. We also rely on the
* above call to pgstat_progress_end_command() to not clear
- * the st_progress_param array.
+ * the st_progress.p_param array.
*/
appendStringInfo(&buf, _("delay time: %.3f ms\n"),
- (double) MyBEEntry->st_progress_param[PROGRESS_VACUUM_DELAY_TIME] / 1000000.0);
+ (double) MyBEEntry->st_progress.p_param[PROGRESS_VACUUM_DELAY_TIME] / 1000000.0);
}
if (track_io_timing)
{
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 2b5fbdcbd82..8d88b665f18 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -815,7 +815,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
* only updated by the calling process.
*/
appendStringInfo(&buf, _("delay time: %.3f ms\n"),
- (double) MyBEEntry->st_progress_param[PROGRESS_ANALYZE_DELAY_TIME] / 1000000.0);
+ (double) MyBEEntry->st_progress.p_param[PROGRESS_ANALYZE_DELAY_TIME] / 1000000.0);
}
if (track_io_timing)
{
diff --git a/src/backend/utils/activity/backend_progress.c b/src/backend/utils/activity/backend_progress.c
index 99a8c73bf04..17b5d87446b 100644
--- a/src/backend/utils/activity/backend_progress.c
+++ b/src/backend/utils/activity/backend_progress.c
@@ -19,8 +19,8 @@
/*-----------
* pgstat_progress_start_command() -
*
- * Set st_progress_command (and st_progress_command_target) in own backend
- * entry. Also, zero-initialize st_progress_param array.
+ * Set st_progress.p_command (and st_progress.p_command_target) in own backend
+ * entry. Also, zero-initialize st_progress.p_param array.
*-----------
*/
void
@@ -32,16 +32,17 @@ pgstat_progress_start_command(ProgressCommandType cmdtype, Oid relid)
return;
PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
- beentry->st_progress_command = cmdtype;
- beentry->st_progress_command_target = relid;
- MemSet(&beentry->st_progress_param, 0, sizeof(beentry->st_progress_param));
+ beentry->st_progress.p_command = cmdtype;
+ beentry->st_progress.p_command_target = relid;
+ MemSet(&beentry->st_progress.p_param, 0,
+ sizeof(beentry->st_progress.p_param));
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
/*-----------
* pgstat_progress_update_param() -
*
- * Update index'th member in st_progress_param[] of own backend entry.
+ * Update index'th member in st_progress.p_param[] of own backend entry.
*-----------
*/
void
@@ -55,14 +56,14 @@ pgstat_progress_update_param(int index, int64 val)
return;
PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
- beentry->st_progress_param[index] = val;
+ beentry->st_progress.p_param[index] = val;
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
/*-----------
* pgstat_progress_incr_param() -
*
- * Increment index'th member in st_progress_param[] of own backend entry.
+ * Increment index'th member in st_progress.p_param[] of own backend entry.
*-----------
*/
void
@@ -76,7 +77,7 @@ pgstat_progress_incr_param(int index, int64 incr)
return;
PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
- beentry->st_progress_param[index] += incr;
+ beentry->st_progress.p_param[index] += incr;
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
@@ -113,7 +114,7 @@ pgstat_progress_parallel_incr_param(int index, int64 incr)
/*-----------
* pgstat_progress_update_multi_param() -
*
- * Update multiple members in st_progress_param[] of own backend entry.
+ * Update multiple members in st_progress.p_param[] of own backend entry.
* This is atomic; readers won't see intermediate states.
*-----------
*/
@@ -133,7 +134,7 @@ pgstat_progress_update_multi_param(int nparam, const int *index,
{
Assert(index[i] >= 0 && index[i] < PGSTAT_NUM_PROGRESS_PARAM);
- beentry->st_progress_param[index[i]] = val[i];
+ beentry->st_progress.p_param[index[i]] = val[i];
}
PGSTAT_END_WRITE_ACTIVITY(beentry);
@@ -142,8 +143,8 @@ pgstat_progress_update_multi_param(int nparam, const int *index,
/*-----------
* pgstat_progress_end_command() -
*
- * Reset st_progress_command (and st_progress_command_target) in own backend
- * entry. This signals the end of the command.
+ * Reset st_progress.p_command (and st_progress.p_command_target) in own
+ * backend entry. This signals the end of the command.
*-----------
*/
void
@@ -154,11 +155,11 @@ pgstat_progress_end_command(void)
if (!beentry || !pgstat_track_activities)
return;
- if (beentry->st_progress_command == PROGRESS_COMMAND_INVALID)
+ if (beentry->st_progress.p_command == PROGRESS_COMMAND_INVALID)
return;
PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
- beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
- beentry->st_progress_command_target = InvalidOid;
+ beentry->st_progress.p_command = PROGRESS_COMMAND_INVALID;
+ beentry->st_progress.p_command_target = InvalidOid;
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index e1576e64b6d..41d951267b7 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -318,14 +318,15 @@ pgstat_bestart_initial(void)
lbeentry.st_gss = false;
lbeentry.st_state = STATE_STARTING;
- lbeentry.st_progress_command = PROGRESS_COMMAND_INVALID;
- lbeentry.st_progress_command_target = InvalidOid;
lbeentry.st_query_id = UINT64CONST(0);
lbeentry.st_plan_id = UINT64CONST(0);
+ lbeentry.st_progress.p_command = PROGRESS_COMMAND_INVALID;
+ lbeentry.st_progress.p_command_target = InvalidOid;
+
/*
- * we don't zero st_progress_param here to save cycles; nobody should
- * examine it until st_progress_command has been set to something other
+ * we don't zero st_progress.p_param here to save cycles; nobody should
+ * examine it until st_progress.p_command has been set to something other
* than PROGRESS_COMMAND_INVALID
*/
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index ddec4914ea5..eb34704d230 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -299,7 +299,7 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
* Report values for only those backends which are running the given
* command.
*/
- if (beentry->st_progress_command != cmdtype)
+ if (beentry->st_progress.p_command != cmdtype)
continue;
/* Value available to all callers */
@@ -309,9 +309,9 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
/* show rest of the values including relid only to role members */
if (HAS_PGSTAT_PERMISSIONS(beentry->st_userid))
{
- values[2] = ObjectIdGetDatum(beentry->st_progress_command_target);
+ values[2] = ObjectIdGetDatum(beentry->st_progress.p_command_target);
for (i = 0; i < PGSTAT_NUM_PROGRESS_PARAM; i++)
- values[i + 3] = Int64GetDatum(beentry->st_progress_param[i]);
+ values[i + 3] = Int64GetDatum(beentry->st_progress.p_param[i]);
}
else
{
diff --git a/src/include/utils/backend_progress.h b/src/include/utils/backend_progress.h
index da3d14bb97b..10aaec9b15c 100644
--- a/src/include/utils/backend_progress.h
+++ b/src/include/utils/backend_progress.h
@@ -31,8 +31,21 @@ typedef enum ProgressCommandType
PROGRESS_COMMAND_COPY,
} ProgressCommandType;
-#define PGSTAT_NUM_PROGRESS_PARAM 20
+/*
+ * Any command which wishes can advertise that it is running by setting
+ * p_command, p_command_target and p_param[]. p_command_target should be the
+ * OID of the relation which the command targets (we assume there's just one,
+ * as this is meant for utility commands), but the meaning of each element in
+ * the p_param array is command-specific.
+ */
+#define PGSTAT_NUM_PROGRESS_PARAM 20
+typedef struct PgBackendProgress
+{
+ ProgressCommandType p_command;
+ Oid p_command_target;
+ int64 p_param[PGSTAT_NUM_PROGRESS_PARAM];
+} PgBackendProgress;
extern void pgstat_progress_start_command(ProgressCommandType cmdtype,
Oid relid);
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 430ccd7d78e..ffe3804e07e 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -156,18 +156,8 @@ typedef struct PgBackendStatus
*/
char *st_activity_raw;
- /*
- * Command progress reporting. Any command which wishes can advertise
- * that it is running by setting st_progress_command,
- * st_progress_command_target, and st_progress_param[].
- * st_progress_command_target should be the OID of the relation which the
- * command targets (we assume there's just one, as this is meant for
- * utility commands), but the meaning of each element in the
- * st_progress_param array is command-specific.
- */
- ProgressCommandType st_progress_command;
- Oid st_progress_command_target;
- int64 st_progress_param[PGSTAT_NUM_PROGRESS_PARAM];
+ /* Command progress reporting. */
+ PgBackendProgress st_progress;
/* query identifier, optionally computed using post_parse_analyze_hook */
uint64 st_query_id;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 2ff996746af..01246732456 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2155,6 +2155,7 @@ PgAioTargetInfo
PgAioWaitRef
PgArchData
PgBackendGSSStatus
+PgBackendProgress
PgBackendSSLStatus
PgBackendStatus
PgBenchExpr
--
2.39.5
Attachment: v10-0003-Move-conversion-of-a-historic-to-MVCC-snapshot-t.patch (text/x-diff)
From 9ce579893454e9729fd968bed650da8d82f4766e Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=C3=81lvaro=20Herrera?= <alvherre@alvh.no-ip.org>
Date: Mon, 24 Mar 2025 20:08:08 +0100
Subject: [PATCH v10 3/9] Move conversion of a "historic" to MVCC snapshot to a
separate function.
The conversion is now handled by SnapBuildMVCCFromHistoric(). REPACK
CONCURRENTLY will also need it.
---
src/backend/replication/logical/snapbuild.c | 51 +++++++++++++++++----
src/backend/utils/time/snapmgr.c | 3 +-
src/include/replication/snapbuild.h | 1 +
src/include/utils/snapmgr.h | 1 +
4 files changed, 45 insertions(+), 11 deletions(-)
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index b64e53de017..e5d2a583ce6 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -440,10 +440,7 @@ Snapshot
SnapBuildInitialSnapshot(SnapBuild *builder)
{
Snapshot snap;
- TransactionId xid;
TransactionId safeXid;
- TransactionId *newxip;
- int newxcnt = 0;
Assert(XactIsoLevel == XACT_REPEATABLE_READ);
Assert(builder->building_full_snapshot);
@@ -485,6 +482,31 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
MyProc->xmin = snap->xmin;
+ /* Convert the historic snapshot to MVCC snapshot. */
+ return SnapBuildMVCCFromHistoric(snap, true);
+}
+
+/*
+ * Turn a historic MVCC snapshot into an ordinary MVCC snapshot.
+ *
+ * Unlike a regular (non-historic) MVCC snapshot, the xip array of this
+ * snapshot contains not only running main transactions, but also their
+ * subtransactions. This difference has no impact on XidInMVCCSnapshot().
+ *
+ * Pass true for 'in_place' if you don't care about modifying the source
+ * snapshot. If you need a new instance, and one that was allocated as a
+ * single chunk of memory, pass false.
+ */
+Snapshot
+SnapBuildMVCCFromHistoric(Snapshot snapshot, bool in_place)
+{
+ TransactionId xid;
+ TransactionId *oldxip = snapshot->xip;
+ uint32 oldxcnt = snapshot->xcnt;
+ TransactionId *newxip;
+ int newxcnt = 0;
+ Snapshot result;
+
/* allocate in transaction context */
newxip = (TransactionId *)
palloc(sizeof(TransactionId) * GetMaxSnapshotXidCount());
@@ -495,7 +517,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
* classical snapshot by marking all non-committed transactions as
* in-progress. This can be expensive.
*/
- for (xid = snap->xmin; NormalTransactionIdPrecedes(xid, snap->xmax);)
+ for (xid = snapshot->xmin; NormalTransactionIdPrecedes(xid, snapshot->xmax);)
{
void *test;
@@ -503,7 +525,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
* Check whether transaction committed using the decoding snapshot
* meaning of ->xip.
*/
- test = bsearch(&xid, snap->xip, snap->xcnt,
+ test = bsearch(&xid, snapshot->xip, snapshot->xcnt,
sizeof(TransactionId), xidComparator);
if (test == NULL)
@@ -520,11 +542,22 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
}
/* adjust remaining snapshot fields as needed */
- snap->snapshot_type = SNAPSHOT_MVCC;
- snap->xcnt = newxcnt;
- snap->xip = newxip;
+ snapshot->xcnt = newxcnt;
+ snapshot->xip = newxip;
- return snap;
+ if (in_place)
+ result = snapshot;
+ else
+ {
+ result = CopySnapshot(snapshot);
+
+ /* Restore the original values so the source is intact. */
+ snapshot->xip = oldxip;
+ snapshot->xcnt = oldxcnt;
+ }
+ result->snapshot_type = SNAPSHOT_MVCC;
+
+ return result;
}
/*
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index ea35f30f494..70a6b8902d1 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -212,7 +212,6 @@ typedef struct ExportedSnapshot
static List *exportedSnapshots = NIL;
/* Prototypes for local functions */
-static Snapshot CopySnapshot(Snapshot snapshot);
static void UnregisterSnapshotNoOwner(Snapshot snapshot);
static void FreeSnapshot(Snapshot snapshot);
static void SnapshotResetXmin(void);
@@ -591,7 +590,7 @@ SetTransactionSnapshot(Snapshot sourcesnap, VirtualTransactionId *sourcevxid,
* The copy is palloc'd in TopTransactionContext and has initial refcounts set
* to 0. The returned snapshot has the copied flag set.
*/
-static Snapshot
+Snapshot
CopySnapshot(Snapshot snapshot)
{
Snapshot newsnap;
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 44031dcf6e3..6d4d2d1814c 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -73,6 +73,7 @@ extern void FreeSnapshotBuilder(SnapBuild *builder);
extern void SnapBuildSnapDecRefcount(Snapshot snap);
extern Snapshot SnapBuildInitialSnapshot(SnapBuild *builder);
+extern Snapshot SnapBuildMVCCFromHistoric(Snapshot snapshot, bool in_place);
extern const char *SnapBuildExportSnapshot(SnapBuild *builder);
extern void SnapBuildClearExportedSnapshot(void);
extern void SnapBuildResetExportedSnapshotState(void);
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index d346be71642..147b190210a 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -60,6 +60,7 @@ extern Snapshot GetTransactionSnapshot(void);
extern Snapshot GetLatestSnapshot(void);
extern void SnapshotSetCommandId(CommandId curcid);
+extern Snapshot CopySnapshot(Snapshot snapshot);
extern Snapshot GetCatalogSnapshot(Oid relid);
extern Snapshot GetNonHistoricCatalogSnapshot(Oid relid);
extern void InvalidateCatalogSnapshot(void);
--
2.39.5
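As an aside for readers following patch 0003: SnapBuildMVCCFromHistoric() effectively inverts the meaning of the xip array. In a historic (decoding) snapshot, xip holds committed XIDs; in an ordinary MVCC snapshot, it holds in-progress ones. A toy Python model of that conversion loop (ignoring subtransactions, XID wraparound, and the in_place/CopySnapshot handling) might look like this — the function name and list representation are illustrative, not actual PostgreSQL API:

```python
def mvcc_xip_from_historic(xmin, xmax, historic_xip):
    """Model of the loop in SnapBuildMVCCFromHistoric():
    every XID in [xmin, xmax) that is NOT found in the historic
    (committed) xip array is treated as still in progress and
    goes into the new MVCC-style xip array."""
    committed = set(historic_xip)  # stands in for the bsearch over snap->xip
    return [xid for xid in range(xmin, xmax) if xid not in committed]

# XIDs 101 and 103 committed; 100, 102 and 104 still appear in progress.
print(mvcc_xip_from_historic(100, 105, [101, 103]))
```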
Attachment: v10-0004-Add-CONCURRENTLY-option-to-REPACK-command.patch (text/x-diff)
From 61676a7bee36888da61e381d78a7b7b79e8b4cef Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=C3=81lvaro=20Herrera?= <alvherre@alvh.no-ip.org>
Date: Mon, 24 Mar 2025 20:08:42 +0100
Subject: [PATCH v10 4/9] Add CONCURRENTLY option to REPACK command.
The REPACK command copies the relation data into a new file, creates new
indexes and eventually swaps the files. To make sure that the old file does
not change during the copying, the relation is locked in an exclusive mode,
which prevents applications from both reading and writing. (To keep the data
consistent, we'd only need to prevent the applications from writing, but even
reading needs to be blocked before we can swap the files - otherwise some
applications could continue using the old file. Since we cannot get a stronger
lock without releasing the weaker one first, we acquire the exclusive lock at
the beginning and keep it until the end of the processing.)
This patch introduces an alternative workflow, which only requires the
exclusive lock when the relation (and index) files are being swapped.
(Supposedly, the swapping should be pretty fast.) On the other hand, when we
copy the data to the new file, we allow applications to read from the relation
and even write into it.
First, we scan the relation using a "historic snapshot", and insert all the
tuples satisfying this snapshot into the new file. Note that, before creating
that snapshot, we need to make sure that all the other backends treat the
relation as a system catalog: in particular, they must log information on new
command IDs (CIDs). We achieve that by adding the relation ID into a shared
hash table and waiting until all the transactions currently writing into the
table (i.e. transactions possibly not aware of the new entry) have finished.
Second, logical decoding is used to capture the data changes done by
applications during the copying (i.e. changes that do not satisfy the historic
snapshot mentioned above), and those are applied to the new file before we
acquire the exclusive lock we need to swap the files. (Of course, more data
changes can take place while we are waiting for the lock - these will be
applied to the new file after we have acquired the lock, before we swap the
files.)
While copying the data into the new file, we hold a lock that prevents
applications from changing the relation tuple descriptor (tuples inserted into
the old file must fit into the new file). However, as we have to release that
lock before getting the exclusive one, it's possible that someone adds or
drops a column, or changes the data type of an existing one. Therefore we have
to check the tuple descriptor before we swap the files. If we find out that
the tuple descriptor changed, ERROR is raised and all the changes are rolled
back. Since a lot of effort can be wasted in such a case, the ALTER TABLE
command also tries to check if REPACK CONCURRENTLY is running on the same
relation, and raises an ERROR if it is.
Like the existing implementation of REPACK, the variant with the CONCURRENTLY
option also requires an extra space for the new relation and index files
(which coexist with the old files for some time). In addition, the
CONCURRENTLY option might introduce a lag in releasing WAL segments for
archiving / recycling. This is due to the decoding of the data changes done by
applications concurrently. However, this lag should not be more than a single
WAL segment.
---
doc/src/sgml/monitoring.sgml | 65 +-
doc/src/sgml/ref/repack.sgml | 116 +-
src/Makefile | 1 +
src/backend/access/heap/heapam.c | 8 +-
src/backend/access/heap/heapam_handler.c | 145 +-
src/backend/access/heap/heapam_visibility.c | 30 +-
src/backend/catalog/index.c | 43 +-
src/backend/catalog/system_views.sql | 30 +-
src/backend/commands/cluster.c | 2602 ++++++++++++++++-
src/backend/commands/matview.c | 2 +-
src/backend/commands/tablecmds.c | 11 +
src/backend/commands/vacuum.c | 12 +-
src/backend/meson.build | 1 +
src/backend/parser/gram.y | 17 +-
src/backend/replication/logical/decode.c | 24 +
src/backend/replication/logical/snapbuild.c | 20 +
.../replication/pgoutput_repack/Makefile | 32 +
.../replication/pgoutput_repack/meson.build | 18 +
.../pgoutput_repack/pgoutput_repack.c | 288 ++
src/backend/storage/ipc/ipci.c | 3 +
src/backend/tcop/utility.c | 10 +
src/backend/utils/activity/backend_progress.c | 16 +
.../utils/activity/wait_event_names.txt | 1 +
src/backend/utils/cache/inval.c | 21 +
src/backend/utils/cache/relcache.c | 5 +
src/backend/utils/time/snapmgr.c | 3 +-
src/bin/psql/tab-complete.in.c | 25 +-
src/include/access/heapam.h | 4 +
src/include/access/tableam.h | 10 +
src/include/catalog/index.h | 3 +
src/include/commands/cluster.h | 93 +-
src/include/commands/progress.h | 17 +-
src/include/nodes/parsenodes.h | 1 +
src/include/replication/snapbuild.h | 1 +
src/include/storage/lockdefs.h | 4 +-
src/include/storage/lwlocklist.h | 1 +
src/include/utils/backend_progress.h | 1 +
src/include/utils/inval.h | 2 +
src/include/utils/rel.h | 7 +-
src/include/utils/snapmgr.h | 2 +
src/test/regress/expected/rules.out | 29 +-
src/tools/pgindent/typedefs.list | 7 +
42 files changed, 3527 insertions(+), 204 deletions(-)
create mode 100644 src/backend/replication/pgoutput_repack/Makefile
create mode 100644 src/backend/replication/pgoutput_repack/meson.build
create mode 100644 src/backend/replication/pgoutput_repack/pgoutput_repack.c
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 8776f51844b..d850b69c82a 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5833,14 +5833,35 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>heap_tuples_written</structfield> <type>bigint</type>
+ <structfield>heap_tuples_inserted</structfield> <type>bigint</type>
</para>
<para>
- Number of heap tuples written.
+ Number of heap tuples inserted.
This counter only advances when the phase is
<literal>seq scanning heap</literal>,
- <literal>index scanning heap</literal>
- or <literal>writing new heap</literal>.
+ <literal>index scanning heap</literal>,
+ <literal>writing new heap</literal>
+ or <literal>catch-up</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_updated</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples updated.
+ This counter only advances when the phase is <literal>catch-up</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_deleted</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples deleted.
+ This counter only advances when the phase is <literal>catch-up</literal>.
</para></entry>
</row>
@@ -6056,14 +6077,35 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>heap_tuples_written</structfield> <type>bigint</type>
+ <structfield>heap_tuples_inserted</structfield> <type>bigint</type>
</para>
<para>
- Number of heap tuples written.
+ Number of heap tuples inserted.
This counter only advances when the phase is
<literal>seq scanning heap</literal>,
- <literal>index scanning heap</literal>
- or <literal>writing new heap</literal>.
+ <literal>index scanning heap</literal>,
+ <literal>writing new heap</literal>
+ or <literal>catch-up</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_updated</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples updated.
+ This counter only advances when the phase is <literal>catch-up</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_deleted</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples deleted.
+ This counter only advances when the phase is <literal>catch-up</literal>.
</para></entry>
</row>
@@ -6144,6 +6186,13 @@ FROM pg_stat_get_backend_idset() AS backendid;
<command>REPACK</command> is currently writing the new heap.
</entry>
</row>
+ <row>
+ <entry><literal>catch-up</literal></entry>
+ <entry>
+ <command>REPACK</command> is currently processing the DML commands that
+ other transactions executed during any of the preceding phases.
+ </entry>
+ </row>
<row>
<entry><literal>swapping relation files</literal></entry>
<entry>
diff --git a/doc/src/sgml/ref/repack.sgml b/doc/src/sgml/ref/repack.sgml
index 84f3c3e3f2b..9ee640e3517 100644
--- a/doc/src/sgml/ref/repack.sgml
+++ b/doc/src/sgml/ref/repack.sgml
@@ -22,6 +22,7 @@ PostgreSQL documentation
<refsynopsisdiv>
<synopsis>
REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <replaceable class="parameter">table_name</replaceable> [ USING INDEX <replaceable class="parameter">index_name</replaceable> ] ]
+REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] CONCURRENTLY <replaceable class="parameter">table_name</replaceable> [ USING INDEX <replaceable class="parameter">index_name</replaceable> ]
<phrase>where <replaceable class="parameter">option</replaceable> can be one of:</phrase>
@@ -48,7 +49,8 @@ REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
processes every table and materialized view in the current database that
the current user has the <literal>MAINTAIN</literal> privilege on. This
form of <command>REPACK</command> cannot be executed inside a transaction
- block.
+ block. Also, this form is not allowed if
+ the <literal>CONCURRENTLY</literal> option is used.
</para>
<para>
@@ -61,7 +63,8 @@ REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
When a table is being repacked, an <literal>ACCESS EXCLUSIVE</literal> lock
is acquired on it. This prevents any other database operations (both reads
and writes) from operating on the table until the <command>REPACK</command>
- is finished.
+ is finished. If you want to keep the table accessible during the repacking,
+ consider using the <literal>CONCURRENTLY</literal> option.
</para>
<refsect2 id="sql-repack-notes-on-clustering" xreflabel="Notes on Clustering">
@@ -160,6 +163,115 @@ REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><literal>CONCURRENTLY</literal></term>
+ <listitem>
+ <para>
+ Allow other transactions to use the table while it is being repacked.
+ </para>
+
+ <para>
+ Internally, <command>REPACK</command> copies the contents of the table
+ (ignoring dead tuples) into a new file, sorted by the specified index,
+ and also creates a new file for each index. Then it swaps the old and
+ new files for the table and all the indexes, and deletes the old
+ files. The <literal>ACCESS EXCLUSIVE</literal> lock is needed to make
+ sure that the old files do not change during the processing because the
+ changes would get lost due to the swap.
+ </para>
+
+ <para>
+ With the <literal>CONCURRENTLY</literal> option, the <literal>ACCESS
+ EXCLUSIVE</literal> lock is only acquired to swap the table and index
+ files. The data changes that took place during the creation of the new
+ table and index files are captured using logical decoding
+ (<xref linkend="logicaldecoding"/>) and applied before
+ the <literal>ACCESS EXCLUSIVE</literal> lock is requested. Thus the lock
+ is typically held only for the time needed to swap the files, which
+ should be pretty short.
+ </para>
+
+ <para>
+ Note that <command>REPACK</command> with
+ the <literal>CONCURRENTLY</literal> option does not try to order the
+ rows inserted into the table after the repacking started. Also
+ note <command>REPACK</command> might fail to complete due to DDL
+ commands executed on the table by other transactions during the
+ repacking.
+ </para>
+
+ <note>
+ <para>
+ In addition to the temporary space requirements explained in
+ <xref linkend="sql-repack-notes-on-resources"/>,
+ the <literal>CONCURRENTLY</literal> option can increase the usage of
+ temporary space a bit more. The reason is that other transactions can
+ perform DML operations which cannot be applied to the new file until
+ <command>REPACK</command> has copied all the tuples from the old
+ file. Thus the tuples inserted into the old file during the copying are
+ also stored separately in a temporary file, so they can eventually
+ be applied to the new file.
+ </para>
+
+ <para>
+ Furthermore, the data changes performed during the copying are
+ extracted from <link linkend="wal">write-ahead log</link> (WAL), and
+ this extraction (decoding) only takes place when certain amount of WAL
+ has been written. Therefore, WAL removal can be delayed by this
+ threshold. Currently the threshold is equal to the value of
+ the <link linkend="guc-wal-segment-size"><varname>wal_segment_size</varname></link>
+ configuration parameter.
+ </para>
+ </note>
+
+ <para>
+ The <literal>CONCURRENTLY</literal> option cannot be used in the
+ following cases:
+
+ <itemizedlist>
+ <listitem>
+ <para>
+ The table is <literal>UNLOGGED</literal>.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The table is partitioned.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The table is a system catalog or a <acronym>TOAST</acronym> table.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <command>REPACK</command> is executed inside a transaction block.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The <link linkend="guc-wal-level"><varname>wal_level</varname></link>
+ configuration parameter is less than <literal>logical</literal>.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The <link linkend="guc-max-replication-slots"><varname>max_replication_slots</varname></link>
+ configuration parameter does not allow for creation of an additional
+ replication slot.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><literal>VERBOSE</literal></term>
<listitem>
diff --git a/src/Makefile b/src/Makefile
index 2f31a2f20a7..b18c9a14ffa 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -23,6 +23,7 @@ SUBDIRS = \
interfaces \
backend/replication/libpqwalreceiver \
backend/replication/pgoutput \
+ backend/replication/pgoutput_repack \
fe_utils \
bin \
pl \
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b12b583c4d9..1be1ef22d1e 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2174,8 +2174,14 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
/*
* If this is a catalog, we need to transmit combo CIDs to properly
* decode, so log that as well.
+ *
+ * For the main heap (as opposed to TOAST), we only receive
+ * HEAP_INSERT_NO_LOGICAL when doing REPACK CONCURRENTLY, in which
+ * case the visibility information does not change. Therefore, there's
+ * no need to update the decoding snapshot.
*/
- if (RelationIsAccessibleInLogicalDecoding(relation))
+ if ((options & HEAP_INSERT_NO_LOGICAL) == 0 &&
+ RelationIsAccessibleInLogicalDecoding(relation))
log_heap_new_cid(relation, heaptup);
/*
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 18e349c3466..371afa6ad59 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -33,6 +33,7 @@
#include "catalog/index.h"
#include "catalog/storage.h"
#include "catalog/storage_xlog.h"
+#include "commands/cluster.h"
#include "commands/progress.h"
#include "executor/executor.h"
#include "miscadmin.h"
@@ -53,6 +54,9 @@ static void reform_and_rewrite_tuple(HeapTuple tuple,
static bool SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
HeapTuple tuple,
OffsetNumber tupoffset);
+static HeapTuple accept_tuple_for_concurrent_copy(HeapTuple tuple,
+ Snapshot snapshot,
+ Buffer buffer);
static BlockNumber heapam_scan_get_blocks_done(HeapScanDesc hscan);
@@ -685,6 +689,8 @@ static void
heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
Relation OldIndex, bool use_sort,
TransactionId OldestXmin,
+ Snapshot snapshot,
+ LogicalDecodingContext *decoding_ctx,
TransactionId *xid_cutoff,
MultiXactId *multi_cutoff,
double *num_tuples,
@@ -705,6 +711,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
bool *isnull;
BufferHeapTupleTableSlot *hslot;
BlockNumber prev_cblock = InvalidBlockNumber;
+ bool concurrent = snapshot != NULL;
+ XLogRecPtr end_of_wal_prev = GetFlushRecPtr(NULL);
/* Remember if it's a system catalog */
is_system_catalog = IsSystemRelation(OldHeap);
@@ -783,8 +791,10 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
for (;;)
{
HeapTuple tuple;
+ bool tuple_copied = false;
Buffer buf;
bool isdead;
+ HTSV_Result vis;
CHECK_FOR_INTERRUPTS();
@@ -839,7 +849,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
LockBuffer(buf, BUFFER_LOCK_SHARE);
- switch (HeapTupleSatisfiesVacuum(tuple, OldestXmin, buf))
+ switch ((vis = HeapTupleSatisfiesVacuum(tuple, OldestXmin, buf)))
{
case HEAPTUPLE_DEAD:
/* Definitely dead */
@@ -855,14 +865,15 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
case HEAPTUPLE_INSERT_IN_PROGRESS:
/*
- * Since we hold exclusive lock on the relation, normally the
- * only way to see this is if it was inserted earlier in our
- * own transaction. However, it can happen in system
+ * As long as we hold exclusive lock on the relation, normally
+ * the only way to see this is if it was inserted earlier in
+ * our own transaction. However, it can happen in system
* catalogs, since we tend to release write lock before commit
- * there. Give a warning if neither case applies; but in any
- * case we had better copy it.
+ * there. Also, there's no exclusive lock during concurrent
+ * processing. Give a warning if neither case applies; but in
+ * any case we had better copy it.
*/
- if (!is_system_catalog &&
+ if (!is_system_catalog && !concurrent &&
!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(tuple->t_data)))
elog(WARNING, "concurrent insert in progress within table \"%s\"",
RelationGetRelationName(OldHeap));
@@ -874,7 +885,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
/*
* Similar situation to INSERT_IN_PROGRESS case.
*/
- if (!is_system_catalog &&
+ if (!is_system_catalog && !concurrent &&
!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetUpdateXid(tuple->t_data)))
elog(WARNING, "concurrent delete in progress within table \"%s\"",
RelationGetRelationName(OldHeap));
@@ -888,8 +899,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
break;
}
- LockBuffer(buf, BUFFER_LOCK_UNLOCK);
-
if (isdead)
{
*tups_vacuumed += 1;
@@ -900,9 +909,47 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
*tups_vacuumed += 1;
*tups_recently_dead -= 1;
}
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
continue;
}
+ if (concurrent)
+ {
+ /*
+ * Ignore concurrent changes now, they'll be processed later via
+ * logical decoding.
+ *
+ * INSERT_IN_PROGRESS is rejected right away because our snapshot
+ * should represent a point in time which should precede (or be
+ * equal to) the state of transactions as it was when the
+ * "SatisfiesVacuum" test was performed. Thus
+ * accept_tuple_for_concurrent_copy() should not consider the
+ * tuple inserted.
+ */
+ if (vis == HEAPTUPLE_INSERT_IN_PROGRESS)
+ tuple = NULL;
+ else
+ tuple = accept_tuple_for_concurrent_copy(tuple, snapshot,
+ buf);
+ /* Tuple not suitable for the new heap? */
+ if (tuple == NULL)
+ {
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ continue;
+ }
+
+ /* Remember that we have to free the tuple eventually. */
+ tuple_copied = true;
+ }
+
+ /*
+ * In the concurrent case, we have a copy of the tuple, so we don't
+ * worry whether the source tuple will be deleted / updated after we
+ * release the lock.
+ */
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
*num_tuples += 1;
if (tuplesort != NULL)
{
@@ -919,7 +966,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
{
const int ct_index[] = {
PROGRESS_REPACK_HEAP_TUPLES_SCANNED,
- PROGRESS_REPACK_HEAP_TUPLES_WRITTEN
+ PROGRESS_REPACK_HEAP_TUPLES_INSERTED
};
int64 ct_val[2];
@@ -934,6 +981,33 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
ct_val[1] = *num_tuples;
pgstat_progress_update_multi_param(2, ct_index, ct_val);
}
+ if (tuple_copied)
+ heap_freetuple(tuple);
+
+ /*
+ * Process the WAL produced by the load, as well as by other
+ * transactions, so that the replication slot can advance and WAL does
+ * not pile up. Use wal_segment_size as a threshold so that we do not
+ * introduce the decoding overhead too often.
+ *
+ * Of course, we must not apply the changes until the initial load has
+ * completed.
+ *
+ * Note that our insertions into the new table should not be decoded
+ * as we (intentionally) do not write the logical decoding specific
+ * information to WAL.
+ */
+ if (concurrent)
+ {
+ XLogRecPtr end_of_wal;
+
+ end_of_wal = GetFlushRecPtr(NULL);
+ if ((end_of_wal - end_of_wal_prev) > wal_segment_size)
+ {
+ repack_decode_concurrent_changes(decoding_ctx, end_of_wal);
+ end_of_wal_prev = end_of_wal;
+ }
+ }
}
if (indexScan != NULL)
@@ -977,7 +1051,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
values, isnull,
rwstate);
/* Report n_tuples */
- pgstat_progress_update_param(PROGRESS_REPACK_HEAP_TUPLES_WRITTEN,
+ pgstat_progress_update_param(PROGRESS_REPACK_HEAP_TUPLES_INSERTED,
n_tuples);
}
@@ -2023,6 +2097,53 @@ heapam_scan_get_blocks_done(HeapScanDesc hscan)
return blocks_done;
}
+/*
+ * Return copy of 'tuple' if it has been inserted according to 'snapshot', or
+ * NULL if the insertion took place in the future. If the tuple is already
+ * marked as deleted or updated by a transaction that 'snapshot' still
+ * considers running, clear the deletion / update XID in the header of the
+ * copied tuple. This way the returned tuple is suitable for insertion into
+ * the new heap.
+ */
+static HeapTuple
+accept_tuple_for_concurrent_copy(HeapTuple tuple, Snapshot snapshot,
+ Buffer buffer)
+{
+ HeapTuple result;
+
+ Assert(snapshot->snapshot_type == SNAPSHOT_MVCC);
+
+ /*
+ * First, check if the tuple insertion is visible by our snapshot.
+ */
+ if (!HeapTupleMVCCInserted(tuple, snapshot, buffer))
+ return NULL;
+
+ result = heap_copytuple(tuple);
+
+ /*
+ * If the tuple was deleted / updated but our snapshot still sees it, we
+ * need to keep it. In that case, clear the information that indicates the
+ * deletion / update. Otherwise the tuple chain would stay incomplete (as
+ * we will reject the new tuple above), and the delete / update would fail
+ * if executed later during logical decoding.
+ */
+ if (TransactionIdIsNormal(HeapTupleHeaderGetRawXmax(result->t_data)) &&
+ HeapTupleMVCCNotDeleted(result, snapshot, buffer))
+ {
+ /* TODO More work needed here? */
+ result->t_data->t_infomask |= HEAP_XMAX_INVALID;
+ HeapTupleHeaderSetXmax(result->t_data, 0);
+ }
+
+ /*
+ * Accept the tuple even if our snapshot considers it deleted - older
+ * snapshots can still see the tuple, while the decoded transactions
+ * should not try to update / delete it again.
+ */
+ return result;
+}
+
/* ------------------------------------------------------------------------
* Miscellaneous callbacks for the heap AM
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index 05f6946fe60..a46e1812b21 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -955,13 +955,14 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
* did TransactionIdIsInProgress in each call --- to no avail, as long as the
* inserting/deleting transaction was still running --- which was more cycles
* and more contention on ProcArrayLock.
+ *
+ * The checks are split into two functions, HeapTupleMVCCInserted() and
+ * HeapTupleMVCCNotDeleted(), because they are also useful separately.
*/
static bool
HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
Buffer buffer)
{
- HeapTupleHeader tuple = htup->t_data;
-
/*
* Assert that the caller has registered the snapshot. This function
* doesn't care about the registration as such, but in general you
@@ -974,6 +975,20 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
Assert(ItemPointerIsValid(&htup->t_self));
Assert(htup->t_tableOid != InvalidOid);
+ return HeapTupleMVCCInserted(htup, snapshot, buffer) &&
+ HeapTupleMVCCNotDeleted(htup, snapshot, buffer);
+}
+
+/*
+ * HeapTupleMVCCInserted
+ * True iff heap tuple was successfully inserted for the given MVCC
+ * snapshot.
+ */
+bool
+HeapTupleMVCCInserted(HeapTuple htup, Snapshot snapshot, Buffer buffer)
+{
+ HeapTupleHeader tuple = htup->t_data;
+
if (!HeapTupleHeaderXminCommitted(tuple))
{
if (HeapTupleHeaderXminInvalid(tuple))
@@ -1082,6 +1097,17 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
}
/* by here, the inserting transaction has committed */
+ return true;
+}
+
+/*
+ * HeapTupleMVCCNotDeleted
+ * True iff heap tuple was not deleted for the given MVCC snapshot.
+ */
+bool
+HeapTupleMVCCNotDeleted(HeapTuple htup, Snapshot snapshot, Buffer buffer)
+{
+ HeapTupleHeader tuple = htup->t_data;
if (tuple->t_infomask & HEAP_XMAX_INVALID) /* xid invalid or aborted */
return true;
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 466cf0fdef6..c70521d1d54 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1418,22 +1418,7 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
opclassOptions[i] = get_attoptions(oldIndexId, i + 1);
- /* Extract statistic targets for each attribute */
- stattargets = palloc0_array(NullableDatum, newInfo->ii_NumIndexAttrs);
- for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
- {
- HeapTuple tp;
- Datum dat;
-
- tp = SearchSysCache2(ATTNUM, ObjectIdGetDatum(oldIndexId), Int16GetDatum(i + 1));
- if (!HeapTupleIsValid(tp))
- elog(ERROR, "cache lookup failed for attribute %d of relation %u",
- i + 1, oldIndexId);
- dat = SysCacheGetAttr(ATTNUM, tp, Anum_pg_attribute_attstattarget, &isnull);
- ReleaseSysCache(tp);
- stattargets[i].value = dat;
- stattargets[i].isnull = isnull;
- }
+ stattargets = get_index_stattargets(oldIndexId, newInfo);
/*
* Now create the new index.
@@ -1472,6 +1457,32 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
return newIndexId;
}
+NullableDatum *
+get_index_stattargets(Oid indexid, IndexInfo *indInfo)
+{
+ NullableDatum *stattargets;
+
+ /* Extract statistic targets for each attribute */
+ stattargets = palloc0_array(NullableDatum, indInfo->ii_NumIndexAttrs);
+ for (int i = 0; i < indInfo->ii_NumIndexAttrs; i++)
+ {
+ HeapTuple tp;
+ Datum dat;
+ bool isnull;
+
+ tp = SearchSysCache2(ATTNUM, ObjectIdGetDatum(indexid), Int16GetDatum(i + 1));
+ if (!HeapTupleIsValid(tp))
+ elog(ERROR, "cache lookup failed for attribute %d of relation %u",
+ i + 1, indexid);
+ dat = SysCacheGetAttr(ATTNUM, tp, Anum_pg_attribute_attstattarget, &isnull);
+ ReleaseSysCache(tp);
+ stattargets[i].value = dat;
+ stattargets[i].isnull = isnull;
+ }
+
+ return stattargets;
+}
+
/*
* index_concurrently_build
*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 5de46bcac52..70265e5e701 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1249,16 +1249,17 @@ CREATE VIEW pg_stat_progress_cluster AS
WHEN 2 THEN 'index scanning heap'
WHEN 3 THEN 'sorting tuples'
WHEN 4 THEN 'writing new heap'
- WHEN 5 THEN 'swapping relation files'
- WHEN 6 THEN 'rebuilding index'
- WHEN 7 THEN 'performing final cleanup'
+ -- 5 is 'catch-up', but that should not appear here.
+ WHEN 6 THEN 'swapping relation files'
+ WHEN 7 THEN 'rebuilding index'
+ WHEN 8 THEN 'performing final cleanup'
END AS phase,
CAST(S.param3 AS oid) AS cluster_index_relid,
S.param4 AS heap_tuples_scanned,
S.param5 AS heap_tuples_written,
- S.param6 AS heap_blks_total,
- S.param7 AS heap_blks_scanned,
- S.param8 AS index_rebuild_count
+ S.param8 AS heap_blks_total,
+ S.param9 AS heap_blks_scanned,
+ S.param10 AS index_rebuild_count
FROM pg_stat_get_progress_info('CLUSTER') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
@@ -1275,16 +1276,19 @@ CREATE VIEW pg_stat_progress_repack AS
WHEN 2 THEN 'index scanning heap'
WHEN 3 THEN 'sorting tuples'
WHEN 4 THEN 'writing new heap'
- WHEN 5 THEN 'swapping relation files'
- WHEN 6 THEN 'rebuilding index'
- WHEN 7 THEN 'performing final cleanup'
+ WHEN 5 THEN 'catch-up'
+ WHEN 6 THEN 'swapping relation files'
+ WHEN 7 THEN 'rebuilding index'
+ WHEN 8 THEN 'performing final cleanup'
END AS phase,
CAST(S.param3 AS oid) AS repack_index_relid,
S.param4 AS heap_tuples_scanned,
- S.param5 AS heap_tuples_written,
- S.param6 AS heap_blks_total,
- S.param7 AS heap_blks_scanned,
- S.param8 AS index_rebuild_count
+ S.param5 AS heap_tuples_inserted,
+ S.param6 AS heap_tuples_updated,
+ S.param7 AS heap_tuples_deleted,
+ S.param8 AS heap_blks_total,
+ S.param9 AS heap_blks_scanned,
+ S.param10 AS index_rebuild_count
FROM pg_stat_get_progress_info('REPACK') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
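The renumbering in the two views reserves phase 5 for 'catch-up', which only REPACK CONCURRENTLY reports; the CLUSTER view keeps the gap so the numbers stay in sync. A hypothetical lookup table (not part of the patch) makes the shared numbering explicit:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mirror of the phase numbers used by the progress views.
 * Phase 5 exists only in concurrent mode, so the CLUSTER view maps it
 * to NULL, matching the "should not appear here" comment in the SQL. */
static const char *
repack_phase_name(int phase, int is_cluster_view)
{
	static const char *names[] = {
		"initializing",				/* 0 */
		"seq scanning heap",		/* 1 */
		"index scanning heap",		/* 2 */
		"sorting tuples",			/* 3 */
		"writing new heap",			/* 4 */
		"catch-up",					/* 5: REPACK CONCURRENTLY only */
		"swapping relation files",	/* 6 */
		"rebuilding index",			/* 7 */
		"performing final cleanup"	/* 8 */
	};

	if (phase < 0 || phase >= (int) (sizeof(names) / sizeof(names[0])))
		return NULL;
	if (is_cluster_view && phase == 5)
		return NULL;			/* catch-up never appears for CLUSTER */
	return names[phase];
}
```

Keeping one numbering for both views means a backend only updates `PROGRESS_REPACK_PHASE` once, regardless of which view the user queries.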
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 9ae3d87e412..90e43f12417 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -25,6 +25,10 @@
#include "access/toast_internals.h"
#include "access/transam.h"
#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xloginsert.h"
+#include "access/xlogutils.h"
#include "catalog/catalog.h"
#include "catalog/dependency.h"
#include "catalog/heap.h"
@@ -32,6 +36,7 @@
#include "catalog/namespace.h"
#include "catalog/objectaccess.h"
#include "catalog/pg_am.h"
+#include "catalog/pg_control.h"
#include "catalog/pg_inherits.h"
#include "catalog/toasting.h"
#include "commands/cluster.h"
@@ -39,10 +44,15 @@
#include "commands/progress.h"
#include "commands/tablecmds.h"
#include "commands/vacuum.h"
+#include "executor/executor.h"
#include "miscadmin.h"
#include "optimizer/optimizer.h"
#include "pgstat.h"
+#include "replication/decode.h"
+#include "replication/logical.h"
+#include "replication/snapbuild.h"
#include "storage/bufmgr.h"
+#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
#include "utils/acl.h"
@@ -76,14 +86,97 @@ typedef struct
((cmd) == CLUSTER_COMMAND_REPACK ? \
"repack" : "vacuum"))
+/*
+ * The following definitions are used for concurrent processing.
+ */
+
+/*
+ * OID of the table being repacked by this backend.
+ */
+static Oid repacked_rel = InvalidOid;
+
+/* The same for its TOAST relation. */
+static Oid repacked_rel_toast = InvalidOid;
+
+/*
+ * The locators are used to avoid logical decoding of data that we do not need
+ * for our table.
+ */
+RelFileLocator repacked_rel_locator = {.relNumber = InvalidOid};
+RelFileLocator repacked_rel_toast_locator = {.relNumber = InvalidOid};
+
+#define REPACK_CONCURRENT_IN_PROGRESS_MSG \
+ "relation \"%s\" is already being processed by REPACK CONCURRENTLY"
+
+/*
+ * Everything we need to call ExecInsertIndexTuples().
+ */
+typedef struct IndexInsertState
+{
+ ResultRelInfo *rri;
+ EState *estate;
+ ExprContext *econtext;
+
+ Relation ident_index;
+} IndexInsertState;
+
+/*
+ * Catalog information to check if another backend changed the relation in
+ * such a way that makes REPACK CONCURRENTLY unable to continue. Such changes
+ * are possible because cluster_rel() has to release its lock on the relation
+ * in order to acquire AccessExclusiveLock that it needs to swap the relation
+ * files.
+ *
+ * The most obvious problem is that the tuple descriptor has changed, since
+ * then the tuples we try to insert into the new storage are not guaranteed to
+ * fit into the storage.
+ *
+ * Another problem is the relfilenode being changed by another backend. It's
+ * not necessarily a correctness issue (e.g. when the other backend ran
+ * cluster_rel()), but it's safer for us to terminate the table processing in
+ * such cases. However, this information also needs to be checked during
+ * logical decoding, so we store it in the global variables repacked_rel_locator
+ * and repacked_rel_toast_locator above.
+ *
+ * Where possible, commands which might change the relation in an incompatible
+ * way should check if REPACK CONCURRENTLY is running, before they start to do
+ * the actual changes (see is_concurrent_repack_in_progress()). Anything else
+ * must be caught by check_catalog_changes(), which uses this structure.
+ */
+typedef struct CatalogState
+{
+ /* Tuple descriptor of the relation. */
+ TupleDesc tupdesc;
+
+ /* The number of indexes tracked. */
+ int ninds;
+ /* The index OIDs. */
+ Oid *ind_oids;
+ /* The index tuple descriptors. */
+ TupleDesc *ind_tupdescs;
+
+ /* The following are copies of the corresponding fields of pg_class. */
+ char relpersistence;
+ char replident;
+
+ /* rd_replidindex */
+ Oid replidindex;
+} CatalogState;
+
+/* The WAL segment being decoded. */
+static XLogSegNo repack_current_segment = 0;
+
static void cluster_multiple_rels(List *rtcs, ClusterParams *params,
- ClusterCommand cmd);
+ ClusterCommand cmd, LOCKMODE lockmode,
+ bool isTopLevel);
static void rebuild_relation(Relation OldHeap, Relation index, bool verbose,
- ClusterCommand cmd);
+ ClusterCommand cmd, bool concurrent);
static void copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
+ Snapshot snapshot, LogicalDecodingContext *decoding_ctx,
bool verbose, ClusterCommand cmd,
bool *pSwapToastByContent,
- TransactionId *pFreezeXid, MultiXactId *pCutoffMulti);
+ TransactionId *pFreezeXid,
+ MultiXactId *pCutoffMulti);
static List *get_tables_to_cluster(MemoryContext cluster_context);
static List *get_tables_to_repack(MemoryContext repack_context);
static List *get_tables_to_cluster_partitioned(MemoryContext cluster_context,
@@ -91,8 +184,91 @@ static List *get_tables_to_cluster_partitioned(MemoryContext cluster_context,
ClusterCommand cmd);
static bool cluster_is_permitted_for_relation(Oid relid, Oid userid,
ClusterCommand cmd);
+static void begin_concurrent_repack(Relation *rel_p, Relation *index_p,
+ bool *entered_p);
+static void end_concurrent_repack(bool error);
+static void cluster_before_shmem_exit_callback(int code, Datum arg);
+static CatalogState *get_catalog_state(Relation rel);
+static void free_catalog_state(CatalogState *state);
+static void check_catalog_changes(Relation rel, CatalogState *cat_state);
+static LogicalDecodingContext *setup_logical_decoding(Oid relid,
+ const char *slotname,
+ TupleDesc tupdesc);
+static HeapTuple get_changed_tuple(char *change);
+static void apply_concurrent_changes(RepackDecodingState *dstate,
+ Relation rel, ScanKey key, int nkeys,
+ IndexInsertState *iistate);
+static void apply_concurrent_insert(Relation rel, ConcurrentChange *change,
+ HeapTuple tup, IndexInsertState *iistate,
+ TupleTableSlot *index_slot);
+static void apply_concurrent_update(Relation rel, HeapTuple tup,
+ HeapTuple tup_target,
+ ConcurrentChange *change,
+ IndexInsertState *iistate,
+ TupleTableSlot *index_slot);
+static void apply_concurrent_delete(Relation rel, HeapTuple tup_target,
+ ConcurrentChange *change);
+static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
+ HeapTuple tup_key,
+ IndexInsertState *iistate,
+ TupleTableSlot *ident_slot,
+ IndexScanDesc *scan_p);
+static void process_concurrent_changes(LogicalDecodingContext *ctx,
+ XLogRecPtr end_of_wal,
+ Relation rel_dst,
+ Relation rel_src,
+ ScanKey ident_key,
+ int ident_key_nentries,
+ IndexInsertState *iistate);
+static IndexInsertState *get_index_insert_state(Relation relation,
+ Oid ident_index_id);
+static ScanKey build_identity_key(Oid ident_idx_oid, Relation rel_src,
+ int *nentries);
+static void free_index_insert_state(IndexInsertState *iistate);
+static void cleanup_logical_decoding(LogicalDecodingContext *ctx);
+static void rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
+ Relation cl_index,
+ CatalogState *cat_state,
+ LogicalDecodingContext *ctx,
+ bool swap_toast_by_content,
+ TransactionId frozenXid,
+ MultiXactId cutoffMulti);
+static List *build_new_indexes(Relation NewHeap, Relation OldHeap, List *OldIndexes);
+
+/*
+ * Use this API when a relation needs to be unlocked, closed and re-opened. If
+ * the relation got dropped while unlocked, an ERROR is raised that mentions
+ * the relation name rather than its OID.
+ */
+typedef struct RelReopenInfo
+{
+ /*
+ * The relation to be closed. A pointer to the value is stored here so
+ * that the caller's reference is updated automatically on re-opening.
+ *
+ * When calling unlock_and_close_relations(), 'relid' can be passed
+ * instead of 'rel_p' when the caller only needs to gather information for
+ * subsequent opening.
+ */
+ Relation *rel_p;
+ Oid relid;
+
+ char relkind;
+ LOCKMODE lockmode_orig; /* The existing lock mode */
+ LOCKMODE lockmode_new; /* The lock mode after the relation is
+ * re-opened */
+
+ char *relname; /* Relation name, initialized automatically. */
+} RelReopenInfo;
+
+static void init_rel_reopen_info(RelReopenInfo *rri, Relation *rel_p,
+ Oid relid, LOCKMODE lockmode_orig,
+ LOCKMODE lockmode_new);
+static void unlock_and_close_relations(RelReopenInfo *rels, int nrel);
+static void reopen_relations(RelReopenInfo *rels, int nrel);
static Relation process_single_relation(RangeVar *relation, char *indexname,
- ClusterCommand cmd,
+ ClusterCommand cmd, LOCKMODE lockmode,
+ bool isTopLevel,
ClusterParams *params,
Oid *indexOid_p);
@@ -151,8 +327,9 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
if (stmt->relation != NULL)
{
rel = process_single_relation(stmt->relation, stmt->indexname,
- CLUSTER_COMMAND_CLUSTER, ¶ms,
- &indexOid);
+ CLUSTER_COMMAND_CLUSTER,
+ AccessExclusiveLock, isTopLevel,
+ ¶ms, &indexOid);
if (rel == NULL)
return;
}
@@ -202,7 +379,8 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
}
/* Do the job. */
- cluster_multiple_rels(rtcs, ¶ms, CLUSTER_COMMAND_CLUSTER);
+ cluster_multiple_rels(rtcs, ¶ms, CLUSTER_COMMAND_CLUSTER,
+ AccessExclusiveLock, isTopLevel);
/* Start a new transaction for the cleanup work. */
StartTransactionCommand();
@@ -219,8 +397,8 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
* return.
*/
static void
-cluster_multiple_rels(List *rtcs, ClusterParams *params,
- ClusterCommand cmd)
+cluster_multiple_rels(List *rtcs, ClusterParams *params, ClusterCommand cmd,
+ LOCKMODE lockmode, bool isTopLevel)
{
ListCell *lc;
@@ -240,10 +418,10 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params,
/* functions in indexes may want a snapshot set */
PushActiveSnapshot(GetTransactionSnapshot());
- rel = table_open(rtc->tableOid, AccessExclusiveLock);
+ rel = table_open(rtc->tableOid, lockmode);
/* Process this table */
- cluster_rel(rel, rtc->indexOid, params, cmd);
+ cluster_rel(rel, rtc->indexOid, params, cmd, isTopLevel);
/* cluster_rel closes the relation, but keeps lock */
PopActiveSnapshot();
@@ -267,12 +445,18 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params,
* instead of index order. This is the new implementation of VACUUM FULL,
* and error messages should refer to the operation as VACUUM not CLUSTER.
*
+ * Note that, in the concurrent case, the function releases the lock at some
+ * point, in order to get AccessExclusiveLock for the final steps (i.e. to
+ * swap the relation files). To make things simpler, the caller should expect
+ * OldHeap to be closed on return, regardless of CLUOPT_CONCURRENT. (The
+ * AccessExclusiveLock is kept till the end of the transaction.)
+ *
* 'cmd' indicates which command is being executed. REPACK should be the only
* caller of this function in the future.
*/
void
cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
- ClusterCommand cmd)
+ ClusterCommand cmd, bool isTopLevel)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid save_userid;
@@ -282,8 +466,54 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
bool recheck = ((params->options & CLUOPT_RECHECK) != 0);
Relation index;
const char *cmd_str = CLUSTER_COMMAND_STR(cmd);
+ bool concurrent = ((params->options & CLUOPT_CONCURRENT) != 0);
+ LOCKMODE lmode;
+ bool entered,
+ success;
- Assert(CheckRelationLockedByMe(OldHeap, AccessExclusiveLock, false));
+ /*
+ * Check that the correct lock is held. The lock mode is
+ * AccessExclusiveLock for normal processing and ShareUpdateExclusiveLock
+ * for concurrent processing (so that SELECT, INSERT, UPDATE and DELETE
+ * commands work, but cluster_rel() cannot be called concurrently for the
+ * same relation).
+ */
+ lmode = !concurrent ? AccessExclusiveLock : ShareUpdateExclusiveLock;
+
+ /*
+ * Skip the relation if it's being processed concurrently. In such a case,
+ * we cannot rely on a lock because the other backend needs to release it
+ * temporarily at some point.
+ *
+ * This check should not take place until we have a lock that prevents
+ * another backend from starting REPACK CONCURRENTLY after our check.
+ */
+ Assert(CheckRelationLockedByMe(OldHeap, lmode, false));
+ if (is_concurrent_repack_in_progress(tableOid))
+ {
+ ereport(NOTICE,
+ (errmsg(REPACK_CONCURRENT_IN_PROGRESS_MSG,
+ RelationGetRelationName(OldHeap))));
+ table_close(OldHeap, lmode);
+ return;
+ }
+
+ /* There are specific requirements on concurrent processing. */
+ if (concurrent)
+ {
+ /*
+ * Make sure we have no XID assigned, otherwise the call of
+ * setup_logical_decoding() can cause a deadlock.
+ *
+ * The existence of a transaction block does not actually imply that an
+ * XID was already assigned, but it very likely was. We might want to
+ * check the result of GetCurrentTransactionIdIfAny() instead, but that
+ * would be less clear from the user's perspective.
+ */
+ PreventInTransactionBlock(isTopLevel, "REPACK CONCURRENTLY");
+
+ can_repack_concurrently(OldHeap);
+ }
/* Check for user-requested abort. */
CHECK_FOR_INTERRUPTS();
@@ -333,7 +563,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
/* Check that the user still has privileges for the relation */
if (!cluster_is_permitted_for_relation(tableOid, save_userid, cmd))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
goto out;
}
@@ -348,7 +578,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
*/
if (RELATION_IS_OTHER_TEMP(OldHeap))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
goto out;
}
@@ -359,7 +589,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
*/
if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(indexOid)))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
goto out;
}
@@ -370,7 +600,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
if ((params->options & CLUOPT_RECHECK_ISCLUSTERED) != 0 &&
!get_index_isclustered(indexOid))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
goto out;
}
}
@@ -391,6 +621,12 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("cannot %s a shared catalog", cmd_str)));
+ /*
+ * The CONCURRENTLY case should have been rejected earlier because system
+ * catalogs are not supported.
+ */
+ Assert(!(OldHeap->rd_rel->relisshared && concurrent));
+
/*
* Don't process temp tables of other backends ... their local buffer
* manager is not going to cope.
@@ -411,8 +647,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
if (OidIsValid(indexOid))
{
/* verify the index is good and lock it */
- check_index_is_clusterable(OldHeap, indexOid, AccessExclusiveLock,
- cmd);
+ check_index_is_clusterable(OldHeap, indexOid, lmode, cmd);
/* also open it */
index = index_open(indexOid, NoLock);
}
@@ -429,7 +664,8 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
if (OldHeap->rd_rel->relkind == RELKIND_MATVIEW &&
!RelationIsPopulated(OldHeap))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ index_close(index, lmode);
+ relation_close(OldHeap, lmode);
goto out;
}
@@ -442,11 +678,42 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
* invalid, because we move tuples around. Promote them to relation
* locks. Predicate locks on indexes will be promoted when they are
* reindexed.
+ *
+ * During concurrent processing, the heap as well as its indexes stay in
+ * operation, so we postpone this step until they are locked using
+ * AccessExclusiveLock near the end of the processing.
*/
- TransferPredicateLocksToHeapRelation(OldHeap);
+ if (!concurrent)
+ TransferPredicateLocksToHeapRelation(OldHeap);
/* rebuild_relation does all the dirty work */
- rebuild_relation(OldHeap, index, verbose, cmd);
+ entered = false;
+ success = false;
+ PG_TRY();
+ {
+ /*
+ * For concurrent processing, make sure other transactions treat this
+ * table as if it was a system / user catalog, and WAL the relevant
+ * additional information. ERROR is raised if another backend is
+ * processing the same table.
+ */
+ if (concurrent)
+ {
+ Relation *index_p = index ? &index : NULL;
+
+ begin_concurrent_repack(&OldHeap, index_p, &entered);
+ }
+
+ rebuild_relation(OldHeap, index, verbose, cmd, concurrent);
+ success = true;
+ }
+ PG_FINALLY();
+ {
+ if (concurrent && entered)
+ end_concurrent_repack(!success);
+ }
+ PG_END_TRY();
+
/* rebuild_relation closes OldHeap, and index if valid */
out:
@@ -595,19 +862,87 @@ mark_index_clustered(Relation rel, Oid indexOid, bool is_internal)
table_close(pg_index, RowExclusiveLock);
}
+/*
+ * Check if the CONCURRENTLY option is legal for the relation.
+ */
+void
+can_repack_concurrently(Relation rel)
+{
+ char relpersistence,
+ replident;
+ Oid ident_idx;
+
+ /* Data changes in system relations are not logically decoded. */
+ if (IsCatalogRelation(rel))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot repack relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("REPACK CONCURRENTLY is not supported for catalog relations.")));
+
+ /*
+ * reorderbuffer.c does not seem to handle processing of a TOAST relation
+ * alone.
+ */
+ if (IsToastRelation(rel))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot repack relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("REPACK (CONCURRENTLY) is not supported for TOAST relations, unless the main relation is repacked too.")));
+
+ relpersistence = rel->rd_rel->relpersistence;
+ if (relpersistence != RELPERSISTENCE_PERMANENT)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot repack relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("REPACK (CONCURRENTLY) is only allowed for permanent relations.")));
+
+ /* With NOTHING, WAL does not contain the old tuple. */
+ replident = rel->rd_rel->relreplident;
+ if (replident == REPLICA_IDENTITY_NOTHING)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot repack relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("Relation \"%s\" has insufficient replication identity.",
+ RelationGetRelationName(rel))));
+
+ /*
+ * Identity index is not set if the replica identity is FULL, but PK might
+ * exist in such a case.
+ */
+ ident_idx = RelationGetReplicaIndex(rel);
+ if (!OidIsValid(ident_idx) && OidIsValid(rel->rd_pkindex))
+ ident_idx = rel->rd_pkindex;
+ if (!OidIsValid(ident_idx))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot repack relation \"%s\"",
+ RelationGetRelationName(rel)),
+ (errhint("Relation \"%s\" has no identity index.",
+ RelationGetRelationName(rel)))));
+}
+
/*
* rebuild_relation: rebuild an existing relation in index or physical order
*
- * OldHeap: table to rebuild.
+ * OldHeap: table to rebuild. See cluster_rel() for comments on the required
+ * lock strength.
+ *
* index: index to cluster by, or NULL to rewrite in physical order.
*
- * On entry, heap and index (if one is given) must be open, and
- * AccessExclusiveLock held on them.
- * On exit, they are closed, but locks on them are not released.
+ * On entry, heap and index (if one is given) must be open, and the
+ * appropriate lock held on them (AccessExclusiveLock for exclusive processing
+ * and ShareUpdateExclusiveLock for concurrent processing).
+ *
+ * On exit, they are closed, but still locked with AccessExclusiveLock. (The
+ * function handles the lock upgrade if 'concurrent' is true.)
*/
static void
rebuild_relation(Relation OldHeap, Relation index, bool verbose,
- ClusterCommand cmd)
+ ClusterCommand cmd, bool concurrent)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid accessMethod = OldHeap->rd_rel->relam;
@@ -615,13 +950,83 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose,
Oid OIDNewHeap;
Relation NewHeap;
char relpersistence;
- bool is_system_catalog;
bool swap_toast_by_content;
TransactionId frozenXid;
MultiXactId cutoffMulti;
+ NameData slotname;
+ LogicalDecodingContext *ctx = NULL;
+ Snapshot snapshot = NULL;
+ CatalogState *cat_state = NULL;
+#if USE_ASSERT_CHECKING
+ LOCKMODE lmode;
- Assert(CheckRelationLockedByMe(OldHeap, AccessExclusiveLock, false) &&
- (index == NULL || CheckRelationLockedByMe(index, AccessExclusiveLock, false)));
+ lmode = !concurrent ? AccessExclusiveLock : ShareUpdateExclusiveLock;
+
+ Assert(CheckRelationLockedByMe(OldHeap, lmode, false) &&
+ (index == NULL || CheckRelationLockedByMe(index, lmode, false)));
+#endif
+
+ if (concurrent)
+ {
+ TupleDesc tupdesc;
+ RelReopenInfo rri[2];
+ int nrel;
+
+ /*
+ * REPACK CONCURRENTLY is not allowed in a transaction block, so this
+ * should never fire.
+ */
+ Assert(GetTopTransactionIdIfAny() == InvalidTransactionId);
+
+ /*
+ * A single backend should not execute multiple REPACK commands at a
+ * time, so use PID to make the slot unique.
+ */
+ snprintf(NameStr(slotname), NAMEDATALEN, "repack_%d", MyProcPid);
+
+ /*
+ * Gather catalog information so that we can check later if the old
+ * relation has not changed while unlocked.
+ *
+ * Since this function also checks if the relation can be processed,
+ * it's important to call it before we spend a notable amount of time
+ * setting up the logical decoding. It is not clear whether it needs to
+ * be done even earlier.
+ */
+ cat_state = get_catalog_state(OldHeap);
+
+ tupdesc = CreateTupleDescCopy(RelationGetDescr(OldHeap));
+
+ /*
+ * Unlock the relation (and possibly the clustering index) to avoid
+ * deadlock because setup_logical_decoding() will wait for all the
+ * running transactions (with XID assigned) to finish. Some of those
+ * transactions might be waiting for a lock on our relation.
+ */
+ nrel = 0;
+ init_rel_reopen_info(&rri[nrel++], &OldHeap, InvalidOid,
+ ShareUpdateExclusiveLock,
+ ShareUpdateExclusiveLock);
+ if (index)
+ init_rel_reopen_info(&rri[nrel++], &index, InvalidOid,
+ ShareUpdateExclusiveLock,
+ ShareUpdateExclusiveLock);
+ unlock_and_close_relations(rri, nrel);
+
+ /* Prepare to capture the concurrent data changes. */
+ ctx = setup_logical_decoding(tableOid, NameStr(slotname), tupdesc);
+
+ /* Lock the table (and index) again. */
+ reopen_relations(rri, nrel);
+
+ /*
+ * Check whether 'tupdesc' changed while the relation was
+ * unlocked.
+ */
+ check_catalog_changes(OldHeap, cat_state);
+
+ snapshot = SnapBuildInitialSnapshotForRepack(ctx->snapshot_builder);
+ }
if (index)
/* Mark the correct index as clustered */
@@ -629,7 +1034,6 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose,
/* Remember info about rel before closing OldHeap */
relpersistence = OldHeap->rd_rel->relpersistence;
- is_system_catalog = IsSystemRelation(OldHeap);
/*
* Create the transient table that will receive the re-ordered data.
@@ -645,30 +1049,51 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose,
NewHeap = table_open(OIDNewHeap, NoLock);
/* Copy the heap data into the new table in the desired order */
- copy_table_data(NewHeap, OldHeap, index, verbose, cmd,
- &swap_toast_by_content, &frozenXid, &cutoffMulti);
+ copy_table_data(NewHeap, OldHeap, index, snapshot, ctx, verbose,
+ cmd, &swap_toast_by_content, &frozenXid, &cutoffMulti);
+ if (concurrent)
+ {
+ rebuild_relation_finish_concurrent(NewHeap, OldHeap, index,
+ cat_state, ctx,
+ swap_toast_by_content,
+ frozenXid, cutoffMulti);
- /* Close relcache entries, but keep lock until transaction commit */
- table_close(OldHeap, NoLock);
- if (index)
- index_close(index, NoLock);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_FINAL_CLEANUP);
- /*
- * Close the new relation so it can be dropped as soon as the storage is
- * swapped. The relation is not visible to others, so no need to unlock it
- * explicitly.
- */
- table_close(NewHeap, NoLock);
+ /* Done with decoding. */
+ FreeSnapshot(snapshot);
+ free_catalog_state(cat_state);
+ cleanup_logical_decoding(ctx);
+ ReplicationSlotRelease();
+ ReplicationSlotDrop(NameStr(slotname), false);
+ }
+ else
+ {
+ bool is_system_catalog = IsSystemRelation(OldHeap);
- /*
- * Swap the physical files of the target and transient tables, then
- * rebuild the target's indexes and throw away the transient table.
- */
- finish_heap_swap(tableOid, OIDNewHeap, is_system_catalog,
- swap_toast_by_content, false, true,
- frozenXid, cutoffMulti,
- relpersistence);
+ /* Close relcache entries, but keep lock until transaction commit */
+ table_close(OldHeap, NoLock);
+ if (index)
+ index_close(index, NoLock);
+
+ /*
+ * Close the new relation so it can be dropped as soon as the storage
+ * is swapped. The relation is not visible to others, so no need to
+ * unlock it explicitly.
+ */
+ table_close(NewHeap, NoLock);
+
+ /*
+ * Swap the physical files of the target and transient tables, then
+ * rebuild the target's indexes and throw away the transient table.
+ */
+ finish_heap_swap(tableOid, OIDNewHeap, is_system_catalog,
+ swap_toast_by_content, false, true, true,
+ frozenXid, cutoffMulti,
+ relpersistence);
+ }
}
@@ -803,14 +1228,18 @@ make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
/*
* Do the physical copying of table data.
*
+ * 'snapshot' and 'decoding_ctx': see table_relation_copy_for_cluster(). Pass
+ * them iff concurrent processing is required.
+ *
* There are three output parameters:
* *pSwapToastByContent is set true if toast tables must be swapped by content.
* *pFreezeXid receives the TransactionId used as freeze cutoff point.
* *pCutoffMulti receives the MultiXactId used as a cutoff point.
*/
static void
-copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verbose,
- ClusterCommand cmd, bool *pSwapToastByContent,
+copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
+ Snapshot snapshot, LogicalDecodingContext *decoding_ctx,
+ bool verbose, ClusterCommand cmd, bool *pSwapToastByContent,
TransactionId *pFreezeXid, MultiXactId *pCutoffMulti)
{
Relation relRelation;
@@ -829,6 +1258,7 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
const char *cmd_str = CLUSTER_COMMAND_STR(cmd);
PGRUsage ru0;
char *nspname;
+ bool concurrent = snapshot != NULL;
pg_rusage_init(&ru0);
@@ -855,8 +1285,12 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
*
* We don't need to open the toast relation here, just lock it. The lock
* will be held till end of transaction.
+ *
+ * In the REPACK CONCURRENTLY case, the lock does not help because we need
+ * to release it temporarily at some point. Instead, we expect VACUUM /
+ * CLUSTER to skip tables which are present in RepackedRelsHash.
*/
- if (OldHeap->rd_rel->reltoastrelid)
+ if (OldHeap->rd_rel->reltoastrelid && !concurrent)
LockRelationOid(OldHeap->rd_rel->reltoastrelid, AccessExclusiveLock);
/*
@@ -932,8 +1366,48 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
* provided, else plain seqscan.
*/
if (OldIndex != NULL && OldIndex->rd_rel->relam == BTREE_AM_OID)
+ {
+ ResourceOwner oldowner = NULL;
+ ResourceOwner resowner = NULL;
+
+ /*
+ * In the CONCURRENT case, use a dedicated resource owner so we don't
+ * leave any additional locks behind us that we cannot release easily.
+ */
+ if (concurrent)
+ {
+ Assert(CheckRelationLockedByMe(OldHeap, ShareUpdateExclusiveLock,
+ false));
+ Assert(CheckRelationLockedByMe(OldIndex, ShareUpdateExclusiveLock,
+ false));
+
+ resowner = ResourceOwnerCreate(CurrentResourceOwner,
+ "plan_cluster_use_sort");
+ oldowner = CurrentResourceOwner;
+ CurrentResourceOwner = resowner;
+ }
+
use_sort = plan_cluster_use_sort(RelationGetRelid(OldHeap),
RelationGetRelid(OldIndex));
+
+ if (concurrent)
+ {
+ CurrentResourceOwner = oldowner;
+
+ /*
+ * We are primarily concerned about locks, but if the planner
+ * happened to allocate any other resources, we should release
+ * them too because we're going to delete the whole resowner.
+ */
+ ResourceOwnerRelease(resowner, RESOURCE_RELEASE_BEFORE_LOCKS,
+ false, false);
+ ResourceOwnerRelease(resowner, RESOURCE_RELEASE_LOCKS,
+ false, false);
+ ResourceOwnerRelease(resowner, RESOURCE_RELEASE_AFTER_LOCKS,
+ false, false);
+ ResourceOwnerDelete(resowner);
+ }
+ }
else
use_sort = false;
@@ -965,7 +1439,9 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
* values (e.g. because the AM doesn't use freezing).
*/
table_relation_copy_for_cluster(OldHeap, NewHeap, OldIndex, use_sort,
- cutoffs.OldestXmin, &cutoffs.FreezeLimit,
+ cutoffs.OldestXmin, snapshot,
+ decoding_ctx,
+ &cutoffs.FreezeLimit,
&cutoffs.MultiXactCutoff,
&num_tuples, &tups_vacuumed,
&tups_recently_dead);
@@ -974,7 +1450,11 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
*pFreezeXid = cutoffs.FreezeLimit;
*pCutoffMulti = cutoffs.MultiXactCutoff;
- /* Reset rd_toastoid just to be tidy --- it shouldn't be looked at again */
+ /*
+ * Reset rd_toastoid just to be tidy --- it shouldn't be looked at again.
+ * In the CONCURRENTLY case, we need to set it again before applying the
+ * concurrent changes.
+ */
NewHeap->rd_toastoid = InvalidOid;
num_pages = RelationGetNumberOfBlocks(NewHeap);
@@ -1432,14 +1912,13 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
bool swap_toast_by_content,
bool check_constraints,
bool is_internal,
+ bool reindex,
TransactionId frozenXid,
MultiXactId cutoffMulti,
char newrelpersistence)
{
ObjectAddress object;
Oid mapped_tables[4];
- int reindex_flags;
- ReindexParams reindex_params = {0};
int i;
/* Report that we are now swapping relation files */
@@ -1465,39 +1944,47 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
if (is_system_catalog)
CacheInvalidateCatalog(OIDOldHeap);
- /*
- * Rebuild each index on the relation (but not the toast table, which is
- * all-new at this point). It is important to do this before the DROP
- * step because if we are processing a system catalog that will be used
- * during DROP, we want to have its indexes available. There is no
- * advantage to the other order anyway because this is all transactional,
- * so no chance to reclaim disk space before commit. We do not need a
- * final CommandCounterIncrement() because reindex_relation does it.
- *
- * Note: because index_build is called via reindex_relation, it will never
- * set indcheckxmin true for the indexes. This is OK even though in some
- * sense we are building new indexes rather than rebuilding existing ones,
- * because the new heap won't contain any HOT chains at all, let alone
- * broken ones, so it can't be necessary to set indcheckxmin.
- */
- reindex_flags = REINDEX_REL_SUPPRESS_INDEX_USE;
- if (check_constraints)
- reindex_flags |= REINDEX_REL_CHECK_CONSTRAINTS;
+ if (reindex)
+ {
+ int reindex_flags;
+ ReindexParams reindex_params = {0};
- /*
- * Ensure that the indexes have the same persistence as the parent
- * relation.
- */
- if (newrelpersistence == RELPERSISTENCE_UNLOGGED)
- reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
- else if (newrelpersistence == RELPERSISTENCE_PERMANENT)
- reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+ /*
+ * Rebuild each index on the relation (but not the toast table, which
+ * is all-new at this point). It is important to do this before the
+ * DROP step because if we are processing a system catalog that will
+ * be used during DROP, we want to have its indexes available. There
+ * is no advantage to the other order anyway because this is all
+ * transactional, so no chance to reclaim disk space before commit. We
+ * do not need a final CommandCounterIncrement() because
+ * reindex_relation does it.
+ *
+ * Note: because index_build is called via reindex_relation, it will
+ * never set indcheckxmin true for the indexes. This is OK even
+ * though in some sense we are building new indexes rather than
+ * rebuilding existing ones, because the new heap won't contain any
+ * HOT chains at all, let alone broken ones, so it can't be necessary
+ * to set indcheckxmin.
+ */
+ reindex_flags = REINDEX_REL_SUPPRESS_INDEX_USE;
+ if (check_constraints)
+ reindex_flags |= REINDEX_REL_CHECK_CONSTRAINTS;
- /* Report that we are now reindexing relations */
- pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
- PROGRESS_REPACK_PHASE_REBUILD_INDEX);
+ /*
+ * Ensure that the indexes have the same persistence as the parent
+ * relation.
+ */
+ if (newrelpersistence == RELPERSISTENCE_UNLOGGED)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
+ else if (newrelpersistence == RELPERSISTENCE_PERMANENT)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
- reindex_relation(NULL, OIDOldHeap, reindex_flags, &reindex_params);
+ /* Report that we are now reindexing relations */
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_REBUILD_INDEX);
+
+ reindex_relation(NULL, OIDOldHeap, reindex_flags, &reindex_params);
+ }
/* Report that we are now doing clean up */
pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
@@ -1809,6 +2296,1878 @@ cluster_is_permitted_for_relation(Oid relid, Oid userid, ClusterCommand cmd)
return false;
}
+#define REPL_PLUGIN_NAME "pgoutput_repack"
+
+/*
+ * Each relation being processed by REPACK CONCURRENTLY must be in the
+ * repackedRels hashtable.
+ */
+typedef struct RepackedRel
+{
+ Oid relid;
+ Oid dbid;
+} RepackedRel;
+
+static HTAB *RepackedRelsHash = NULL;
+
+/* Maximum number of entries in the hashtable. */
+static int maxRepackedRels = 0;
+
+Size
+RepackShmemSize(void)
+{
+ /*
+ * A replication slot is needed for the processing, so use this GUC to
+ * allocate memory for the hashtable.
+ */
+ maxRepackedRels = max_replication_slots;
+
+ return hash_estimate_size(maxRepackedRels, sizeof(RepackedRel));
+}
+
+void
+RepackShmemInit(void)
+{
+ HASHCTL info;
+
+ info.keysize = sizeof(RepackedRel);
+ info.entrysize = info.keysize;
+
+ RepackedRelsHash = ShmemInitHash("Repacked Relations",
+ maxRepackedRels,
+ maxRepackedRels,
+ &info,
+ HASH_ELEM | HASH_BLOBS);
+}
+
+/*
+ * Call this function before REPACK CONCURRENTLY starts to setup logical
+ * decoding. It makes sure that other users of the table put enough
+ * information into WAL.
+ *
+ * The point is that in various places we expect that the table we're
+ * processing is treated like a system catalog. For example, we need to be
+ * able to scan it using a "historic snapshot" anytime during the processing
+ * (as opposed to scanning only at the start point of the decoding, as
+ * logical replication does during initial table synchronization), in order
+ * to apply concurrent UPDATE / DELETE commands.
+ *
+ * Since we need to close and reopen the relation here, the 'rel_p' and
+ * 'index_p' arguments are in/out.
+ *
+ * 'entered_p' receives a bool value telling whether the relation OID was
+ * entered into the hashtable or not.
+ */
+static void
+begin_concurrent_repack(Relation *rel_p, Relation *index_p,
+ bool *entered_p)
+{
+ Relation rel = *rel_p;
+ Oid relid,
+ toastrelid;
+ RepackedRel key,
+ *entry;
+ bool found;
+ RelReopenInfo rri[2];
+ int nrel;
+ static bool before_shmem_exit_callback_setup = false;
+
+ relid = RelationGetRelid(rel);
+
+ /*
+ * Make sure that we do not leave an entry in RepackedRelsHash if exiting
+ * due to FATAL.
+ */
+ if (!before_shmem_exit_callback_setup)
+ {
+ before_shmem_exit(cluster_before_shmem_exit_callback, 0);
+ before_shmem_exit_callback_setup = true;
+ }
+
+ memset(&key, 0, sizeof(key));
+ key.relid = relid;
+ key.dbid = MyDatabaseId;
+
+ *entered_p = false;
+ LWLockAcquire(RepackedRelsLock, LW_EXCLUSIVE);
+ entry = (RepackedRel *)
+ hash_search(RepackedRelsHash, &key, HASH_ENTER_NULL, &found);
+ if (found)
+ {
+ /*
+ * Since REPACK CONCURRENTLY takes ShareRowExclusiveLock, a conflict
+ * should occur much earlier. However, that lock may be released
+ * temporarily; see below. In any case, we should complain whatever the
+ * reason for the conflict might be.
+ */
+ ereport(ERROR,
+ (errmsg(REPACK_CONCURRENT_IN_PROGRESS_MSG,
+ RelationGetRelationName(rel))));
+ }
+ if (entry == NULL)
+ ereport(ERROR,
+ (errmsg("too many requests for REPACK CONCURRENTLY at a time")),
+ (errhint("Consider increasing the \"max_replication_slots\" configuration parameter.")));
+
+ /*
+ * Even if the insertion of TOAST relid should fail below, the caller has
+ * to do cleanup.
+ */
+ *entered_p = true;
+
+ /*
+ * Enable the callback to remove the entry in case of exit. We should not
+ * do this earlier, otherwise an attempt to insert already existing entry
+ * could make us remove that entry (inserted by another backend) during
+ * ERROR handling.
+ */
+ Assert(!OidIsValid(repacked_rel));
+ repacked_rel = relid;
+
+ /*
+ * TOAST relation is not accessed using historic snapshot, but we enter it
+ * here to protect it from being VACUUMed by another backend. (A lock does
+ * not help in the CONCURRENTLY case because we cannot hold it continuously
+ * until the end of the transaction.) See the comments on locking the TOAST
+ * relation in copy_table_data().
+ */
+ toastrelid = rel->rd_rel->reltoastrelid;
+ if (OidIsValid(toastrelid))
+ {
+ key.relid = toastrelid;
+ entry = (RepackedRel *)
+ hash_search(RepackedRelsHash, &key, HASH_ENTER_NULL, &found);
+ if (found)
+
+ /*
+ * If we could enter the main relation, entering the TOAST relation
+ * should have succeeded too. Nevertheless, check.
+ */
+ ereport(ERROR,
+ (errmsg("TOAST relation of \"%s\" is already being processed by REPACK CONCURRENTLY",
+ RelationGetRelationName(rel))));
+ if (entry == NULL)
+ ereport(ERROR,
+ (errmsg("too many requests for REPACK CONCURRENTLY at a time")),
+ (errhint("Consider increasing the \"max_replication_slots\" configuration parameter.")));
+
+ Assert(!OidIsValid(repacked_rel_toast));
+ repacked_rel_toast = toastrelid;
+ }
+ LWLockRelease(RepackedRelsLock);
+
+ /*
+ * Make sure that other backends are aware of the new hash entry.
+ *
+ * Besides sending the invalidation message, we need to force re-opening
+ * of the relation, which includes the actual invalidation (and thus
+ * checking of our hashtable on the next access).
+ */
+ CacheInvalidateRelcacheImmediate(rel);
+
+ /*
+ * Since the hashtable only needs to be checked by write transactions,
+ * lock the relation in a mode that conflicts with any DML command. (The
+ * reading transactions are supposed to close the relation before opening
+ * it with higher lock.) Once we have the relation (and its index) locked,
+ * we unlock it immediately and then re-lock using the original mode.
+ */
+ nrel = 0;
+ init_rel_reopen_info(&rri[nrel++], rel_p, InvalidOid,
+ ShareUpdateExclusiveLock, ShareLock);
+ if (index_p)
+ {
+ /*
+ * Another transaction might want to open both the relation and the
+ * index. If it already has the relation lock and is waiting for the
+ * index lock, we should release the index lock, otherwise our request
+ * for ShareLock on the relation can end up in a deadlock.
+ */
+ init_rel_reopen_info(&rri[nrel++], index_p, InvalidOid,
+ ShareUpdateExclusiveLock, ShareLock);
+ }
+ unlock_and_close_relations(rri, nrel);
+
+ /*
+ * XXX It's not strictly necessary to lock the index here, but it's
+ * probably not worth teaching the "reopen API" about this special case.
+ */
+ reopen_relations(rri, nrel);
+
+ /* Switch back to the original lock. */
+ nrel = 0;
+ init_rel_reopen_info(&rri[nrel++], rel_p, InvalidOid,
+ ShareLock, ShareUpdateExclusiveLock);
+ if (index_p)
+ init_rel_reopen_info(&rri[nrel++], index_p, InvalidOid,
+ ShareLock, ShareUpdateExclusiveLock);
+ unlock_and_close_relations(rri, nrel);
+ reopen_relations(rri, nrel);
+ /* Make sure the reopened relcache entry is used, not the old one. */
+ rel = *rel_p;
+
+ /* Avoid logical decoding of other relations by this backend. */
+ repacked_rel_locator = rel->rd_locator;
+ if (OidIsValid(toastrelid))
+ {
+ Relation toastrel;
+
+ /* Avoid logical decoding of other TOAST relations. */
+ toastrel = table_open(toastrelid, AccessShareLock);
+ repacked_rel_toast_locator = toastrel->rd_locator;
+ table_close(toastrel, AccessShareLock);
+ }
+}
+
+/*
+ * Call this when done with REPACK CONCURRENTLY.
+ *
+ * 'error' tells whether the function is being called in order to handle
+ * error.
+ */
+static void
+end_concurrent_repack(bool error)
+{
+ RepackedRel key;
+ RepackedRel *entry = NULL,
+ *entry_toast = NULL;
+ Oid relid = repacked_rel;
+ Oid toastrelid = repacked_rel_toast;
+
+ /* Remove the relation from the hash if we managed to insert one. */
+ if (OidIsValid(repacked_rel))
+ {
+ memset(&key, 0, sizeof(key));
+ key.relid = repacked_rel;
+ key.dbid = MyDatabaseId;
+ LWLockAcquire(RepackedRelsLock, LW_EXCLUSIVE);
+ entry = hash_search(RepackedRelsHash, &key, HASH_REMOVE, NULL);
+
+ /*
+ * By clearing this variable we also disable
+ * cluster_before_shmem_exit_callback().
+ */
+ repacked_rel = InvalidOid;
+ }
+
+ /* Remove the TOAST relation if there is one. */
+ if (OidIsValid(repacked_rel_toast))
+ {
+ key.relid = repacked_rel_toast;
+ entry_toast = hash_search(RepackedRelsHash, &key, HASH_REMOVE,
+ NULL);
+
+ repacked_rel_toast = InvalidOid;
+ }
+ LWLockRelease(RepackedRelsLock);
+
+ /* Restore normal function of logical decoding. */
+ repacked_rel_locator.relNumber = InvalidOid;
+ repacked_rel_toast_locator.relNumber = InvalidOid;
+
+ /*
+ * On normal completion (!error), we should not really fail to remove the
+ * entry. But if it wasn't there for any reason, raise ERROR to make sure
+ * the transaction is aborted: if other transactions, while changing the
+ * contents of the relation, didn't know that REPACK CONCURRENTLY was in
+ * progress, they could have failed to write enough information to WAL,
+ * and thus we could have produced inconsistent table contents.
+ *
+ * On the other hand, if we are already handling an error, there's no
+ * reason to worry about inconsistent contents of the new storage because
+ * the transaction is going to be rolled back anyway. Furthermore, by
+ * raising ERROR here we'd shadow the original error.
+ */
+ if (!error)
+ {
+ char *relname;
+
+ if (OidIsValid(relid) && entry == NULL)
+ {
+ relname = get_rel_name(relid);
+ if (!relname)
+ ereport(ERROR,
+ (errmsg("cache lookup failed for relation %u",
+ relid)));
+
+ ereport(ERROR,
+ (errmsg("relation \"%s\" not found among repacked relations",
+ relname)));
+ }
+
+ /*
+ * Likewise, the TOAST relation should not have disappeared.
+ */
+ if (OidIsValid(toastrelid) && entry_toast == NULL)
+ {
+ relname = get_rel_name(key.relid);
+ if (!relname)
+ ereport(ERROR,
+ (errmsg("cache lookup failed for relation %u",
+ key.relid)));
+
+ ereport(ERROR,
+ (errmsg("relation \"%s\" not found among repacked relations",
+ relname)));
+ }
+ }
+
+ /*
+ * Note: unlike begin_concurrent_repack(), here we do not lock/unlock the
+ * relation: 1) On normal completion, the caller is already holding
+ * AccessExclusiveLock (till the end of the transaction), 2) on ERROR /
+ * FATAL, we try to do the cleanup asap, but the worst case is that other
+ * backends will write unnecessary information to WAL until they close the
+ * relation.
+ */
+}
+
+/*
+ * A wrapper to call end_concurrent_repack() as a before_shmem_exit callback.
+ */
+static void
+cluster_before_shmem_exit_callback(int code, Datum arg)
+{
+ if (OidIsValid(repacked_rel) || OidIsValid(repacked_rel_toast))
+ end_concurrent_repack(true);
+}
+
+/*
+ * Check if relation is currently being processed by REPACK CONCURRENTLY.
+ */
+bool
+is_concurrent_repack_in_progress(Oid relid)
+{
+ RepackedRel key,
+ *entry;
+
+ memset(&key, 0, sizeof(key));
+ key.relid = relid;
+ key.dbid = MyDatabaseId;
+
+ LWLockAcquire(RepackedRelsLock, LW_SHARED);
+ entry = (RepackedRel *)
+ hash_search(RepackedRelsHash, &key, HASH_FIND, NULL);
+ LWLockRelease(RepackedRelsLock);
+
+ return entry != NULL;
+}
+
+/*
+ * Check if REPACK CONCURRENTLY is already running for given relation, and if
+ * so, raise ERROR. The problem is that cluster_rel() needs to release its
+ * lock on the relation temporarily at some point, so our lock alone does not
+ * help. Commands that might break what cluster_rel() is doing should call
+ * this function first.
+ *
+ * Return without checking if 'lockmode' allows for race conditions that would
+ * make the result meaningless. In that case, cluster_rel() itself should
+ * throw ERROR if the relation was changed by us in an incompatible
+ * way. However, if it managed to do most of its work by then, a lot of CPU
+ * time might be wasted.
+ */
+void
+check_for_concurrent_repack(Oid relid, LOCKMODE lockmode)
+{
+ /*
+ * If the caller does not have a lock that conflicts with
+ * ShareUpdateExclusiveLock, the check makes little sense because REPACK
+ * CONCURRENTLY can start anytime after the check.
+ */
+ if (lockmode < ShareUpdateExclusiveLock)
+ return;
+
+ /*
+ * The caller has a lock which conflicts with REPACK CONCURRENTLY, so if
+ * that's not running now, it cannot start until the caller's transaction
+ * has completed.
+ */
+ if (is_concurrent_repack_in_progress(relid))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg(REPACK_CONCURRENT_IN_PROGRESS_MSG,
+ get_rel_name(relid))));
+
+}
+
+/*
+ * Check if relation is eligible for REPACK CONCURRENTLY and retrieve the
+ * catalog state to be passed later to check_catalog_changes.
+ *
+ * Caller is supposed to hold (at least) ShareUpdateExclusiveLock on the
+ * relation.
+ */
+static CatalogState *
+get_catalog_state(Relation rel)
+{
+ CatalogState *result = palloc_object(CatalogState);
+ List *ind_oids;
+ ListCell *lc;
+ int ninds,
+ i;
+ char relpersistence = rel->rd_rel->relpersistence;
+ char replident = rel->rd_rel->relreplident;
+ Oid ident_idx = RelationGetReplicaIndex(rel);
+ TupleDesc td_src = RelationGetDescr(rel);
+
+ /*
+ * While gathering the catalog information, check if there is a reason not
+ * to proceed.
+ *
+ * This function was already called, but the relation was unlocked since
+ * (see begin_concurrent_repack()). check_catalog_changes() should catch
+ * any "disruptive" changes in the future.
+ */
+ can_repack_concurrently(rel);
+
+ /* No index should be dropped while we are checking it. */
+ Assert(CheckRelationLockedByMe(rel, ShareUpdateExclusiveLock, true));
+
+ ind_oids = RelationGetIndexList(rel);
+ result->ninds = ninds = list_length(ind_oids);
+ result->ind_oids = palloc_array(Oid, ninds);
+ result->ind_tupdescs = palloc_array(TupleDesc, ninds);
+ i = 0;
+ foreach(lc, ind_oids)
+ {
+ Oid ind_oid = lfirst_oid(lc);
+ Relation index;
+ TupleDesc td_ind_src,
+ td_ind_dst;
+
+ /*
+ * A weaker lock should be OK for the index, but this one should not
+ * break anything either.
+ */
+ index = index_open(ind_oid, ShareUpdateExclusiveLock);
+
+ result->ind_oids[i] = RelationGetRelid(index);
+ td_ind_src = RelationGetDescr(index);
+ td_ind_dst = palloc(TupleDescSize(td_ind_src));
+ TupleDescCopy(td_ind_dst, td_ind_src);
+ result->ind_tupdescs[i] = td_ind_dst;
+ i++;
+
+ index_close(index, ShareUpdateExclusiveLock);
+ }
+
+ /* Fill-in the relation info. */
+ result->tupdesc = palloc(TupleDescSize(td_src));
+ TupleDescCopy(result->tupdesc, td_src);
+ result->relpersistence = relpersistence;
+ result->replident = replident;
+ result->replidindex = ident_idx;
+
+ return result;
+}
+
+static void
+free_catalog_state(CatalogState *state)
+{
+ /* We are only interested in indexes. */
+ if (state->ninds == 0)
+ return;
+
+ for (int i = 0; i < state->ninds; i++)
+ FreeTupleDesc(state->ind_tupdescs[i]);
+
+ FreeTupleDesc(state->tupdesc);
+ pfree(state->ind_oids);
+ pfree(state->ind_tupdescs);
+ pfree(state);
+}
+
+/*
+ * Raise ERROR if 'rel' changed in a way that does not allow further
+ * processing of REPACK CONCURRENTLY.
+ *
+ * Besides the relation's tuple descriptor, it's important to check indexes:
+ * concurrent change of index definition (can it happen in other way than
+ * dropping and re-creating the index, accidentally with the same OID?) can be
+ * a problem because we may already have the new index built. If an index was
+ * created or dropped concurrently, we'd fail to swap the index storage. In
+ * any case, we prefer to check the indexes early to get an explicit error
+ * message about the mismatch. Furthermore, the earlier we detect the change,
+ * the fewer CPU cycles we waste.
+ *
+ * Note that we do not check constraints because the transaction which changed
+ * them must have ensured that the existing tuples satisfy the new
+ * constraints. If any DML commands were necessary for that, we will simply
+ * decode them from WAL and apply them to the new storage.
+ *
+ * Caller is supposed to hold (at least) ShareUpdateExclusiveLock on the
+ * relation.
+ */
+static void
+check_catalog_changes(Relation rel, CatalogState *cat_state)
+{
+ Oid reltoastrelid = rel->rd_rel->reltoastrelid;
+ List *ind_oids;
+ ListCell *lc;
+ LOCKMODE lockmode;
+ Oid ident_idx;
+ TupleDesc td,
+ td_cp;
+
+ /* First, check the relation info. */
+
+ /* TOAST is not easy to change, but check. */
+ if (reltoastrelid != repacked_rel_toast)
+ ereport(ERROR,
+ errmsg("TOAST relation of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+
+ /*
+ * Likewise, check_for_concurrent_repack() should prevent others from
+ * changing the relation file concurrently, but it's our responsibility to
+ * avoid data loss. (The original locators are stored outside cat_state,
+ * but the check belongs to this function.)
+ */
+ if (!RelFileLocatorEquals(rel->rd_locator, repacked_rel_locator))
+ ereport(ERROR,
+ (errmsg("file of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel))));
+ if (OidIsValid(reltoastrelid))
+ {
+ Relation toastrel;
+
+ toastrel = table_open(reltoastrelid, AccessShareLock);
+ if (!RelFileLocatorEquals(toastrel->rd_locator,
+ repacked_rel_toast_locator))
+ ereport(ERROR,
+ (errmsg("file of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(toastrel))));
+ table_close(toastrel, AccessShareLock);
+ }
+
+ if (rel->rd_rel->relpersistence != cat_state->relpersistence)
+ ereport(ERROR,
+ errmsg("persistence of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+
+ if (cat_state->replident != rel->rd_rel->relreplident)
+ ereport(ERROR,
+ errmsg("replica identity of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+
+ ident_idx = RelationGetReplicaIndex(rel);
+ if (ident_idx == InvalidOid && rel->rd_pkindex != InvalidOid)
+ ident_idx = rel->rd_pkindex;
+ if (cat_state->replidindex != ident_idx)
+ ereport(ERROR,
+ errmsg("identity index of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+
+ /*
+ * As cat_state contains a copy (which has the constraint info cleared),
+ * create a temporary copy for the comparison.
+ */
+ td = RelationGetDescr(rel);
+ td_cp = palloc(TupleDescSize(td));
+ TupleDescCopy(td_cp, td);
+ if (!equalTupleDescs(cat_state->tupdesc, td_cp))
+ ereport(ERROR,
+ errmsg("definition of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel)));
+ FreeTupleDesc(td_cp);
+
+ /* Now we are only interested in indexes. */
+ if (cat_state->ninds == 0)
+ return;
+
+ /* No index should be dropped while we are checking the relation. */
+ lockmode = ShareUpdateExclusiveLock;
+ Assert(CheckRelationLockedByMe(rel, lockmode, true));
+
+ ind_oids = RelationGetIndexList(rel);
+ if (list_length(ind_oids) != cat_state->ninds)
+ goto failed_index;
+
+ foreach(lc, ind_oids)
+ {
+ Oid ind_oid = lfirst_oid(lc);
+ int i;
+ TupleDesc tupdesc;
+ Relation index;
+
+ /* Find the index in cat_state. */
+ for (i = 0; i < cat_state->ninds; i++)
+ {
+ if (cat_state->ind_oids[i] == ind_oid)
+ break;
+ }
+
+ /*
+ * OID not found, i.e. the index was replaced by another one. XXX
+ * Should we still try to find out whether an index with the desired
+ * tuple descriptor exists? Or should we always look for the tuple
+ * descriptor and not use OIDs at all?
+ */
+ if (i == cat_state->ninds)
+ goto failed_index;
+
+ /* Check the tuple descriptor. */
+ index = try_index_open(ind_oid, lockmode);
+ if (index == NULL)
+ goto failed_index;
+ tupdesc = RelationGetDescr(index);
+ if (!equalTupleDescs(cat_state->ind_tupdescs[i], tupdesc))
+ goto failed_index;
+ index_close(index, lockmode);
+ }
+
+ return;
+
+failed_index:
+ ereport(ERROR,
+ (errmsg("index(es) of relation \"%s\" changed by another transaction",
+ RelationGetRelationName(rel))));
+}
+
+/*
+ * This function is much like pg_create_logical_replication_slot() except that
+ * the new slot is neither released (if anyone else could read changes from
+ * our slot, we could miss changes other backends do while we copy the
+ * existing data into the temporary table), nor persisted (it's easier to
+ * handle a crash by restarting all the work from scratch).
+ */
+static LogicalDecodingContext *
+setup_logical_decoding(Oid relid, const char *slotname, TupleDesc tupdesc)
+{
+ LogicalDecodingContext *ctx;
+ RepackDecodingState *dstate;
+
+ /*
+ * Check if we can use logical decoding.
+ */
+ CheckSlotPermissions();
+ CheckLogicalDecodingRequirements();
+
+ /* RS_TEMPORARY so that the slot gets cleaned up on ERROR. */
+ ReplicationSlotCreate(slotname, true, RS_TEMPORARY, false, false, false);
+
+ /*
+ * Neither prepare_write nor do_write callback nor update_progress is
+ * useful for us.
+ *
+ * Regarding the value of need_full_snapshot, we pass false because the
+ * table we are processing is present in RepackedRelsHash and therefore,
+ * regarding logical decoding, treated like a catalog.
+ */
+ ctx = CreateInitDecodingContext(REPL_PLUGIN_NAME,
+ NIL,
+ false,
+ InvalidXLogRecPtr,
+ XL_ROUTINE(.page_read = read_local_xlog_page,
+ .segment_open = wal_segment_open,
+ .segment_close = wal_segment_close),
+ NULL, NULL, NULL);
+
+ /*
+ * We don't have control over setting fast_forward, so at least check it.
+ */
+ Assert(!ctx->fast_forward);
+
+ DecodingContextFindStartpoint(ctx);
+
+ /* Some WAL records should have been read. */
+ Assert(ctx->reader->EndRecPtr != InvalidXLogRecPtr);
+
+ XLByteToSeg(ctx->reader->EndRecPtr, repack_current_segment,
+ wal_segment_size);
+
+ /*
+ * Setup structures to store decoded changes.
+ */
+ dstate = palloc0(sizeof(RepackDecodingState));
+ dstate->relid = relid;
+ dstate->tstore = tuplestore_begin_heap(false, false,
+ maintenance_work_mem);
+ dstate->tupdesc = tupdesc;
+
+ /* Initialize the descriptor to store the changes ... */
+ dstate->tupdesc_change = CreateTemplateTupleDesc(1);
+
+ TupleDescInitEntry(dstate->tupdesc_change, 1, NULL, BYTEAOID, -1, 0);
+ /* ... as well as the corresponding slot. */
+ dstate->tsslot = MakeSingleTupleTableSlot(dstate->tupdesc_change,
+ &TTSOpsMinimalTuple);
+
+ dstate->resowner = ResourceOwnerCreate(CurrentResourceOwner,
+ "logical decoding");
+
+ ctx->output_writer_private = dstate;
+ return ctx;
+}
+
+/*
+ * Retrieve tuple from ConcurrentChange structure.
+ *
+ * The input data starts with the structure but it might not be appropriately
+ * aligned.
+ */
+static HeapTuple
+get_changed_tuple(char *change)
+{
+ HeapTupleData tup_data;
+ HeapTuple result;
+ char *src;
+
+ /*
+ * Ensure alignment before accessing the fields. (This is why we can't use
+ * heap_copytuple() instead of this function.)
+ */
+ src = change + offsetof(ConcurrentChange, tup_data);
+ memcpy(&tup_data, src, sizeof(HeapTupleData));
+
+ result = (HeapTuple) palloc(HEAPTUPLESIZE + tup_data.t_len);
+ memcpy(result, &tup_data, sizeof(HeapTupleData));
+ result->t_data = (HeapTupleHeader) ((char *) result + HEAPTUPLESIZE);
+ src = change + SizeOfConcurrentChange;
+ memcpy(result->t_data, src, result->t_len);
+
+ return result;
+}
+
+/*
+ * Decode logical changes from the WAL sequence up to end_of_wal.
+ */
+void
+repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
+ XLogRecPtr end_of_wal)
+{
+ RepackDecodingState *dstate;
+ ResourceOwner resowner_old;
+ PgBackendProgress progress;
+
+ /*
+ * Invalidate the "present" cache before moving to "(recent) history".
+ */
+ InvalidateSystemCaches();
+
+ dstate = (RepackDecodingState *) ctx->output_writer_private;
+ resowner_old = CurrentResourceOwner;
+ CurrentResourceOwner = dstate->resowner;
+
+ /*
+ * reorderbuffer.c uses an internal subtransaction, whose abort ends the
+ * command progress reporting. Save the status here so we can restore it
+ * when done with the decoding.
+ */
+ memcpy(&progress, &MyBEEntry->st_progress, sizeof(PgBackendProgress));
+
+ PG_TRY();
+ {
+ while (ctx->reader->EndRecPtr < end_of_wal)
+ {
+ XLogRecord *record;
+ XLogSegNo segno_new;
+ char *errm = NULL;
+ XLogRecPtr end_lsn;
+
+ record = XLogReadRecord(ctx->reader, &errm);
+ if (errm)
+ elog(ERROR, "%s", errm);
+
+ if (record != NULL)
+ LogicalDecodingProcessRecord(ctx, ctx->reader);
+
+ /*
+ * If the WAL segment boundary has been crossed, inform the decoding
+ * system that the catalog_xmin can advance. (We can confirm more
+ * often, but filling a single WAL segment should not take much
+ * time.)
+ */
+ end_lsn = ctx->reader->EndRecPtr;
+ XLByteToSeg(end_lsn, segno_new, wal_segment_size);
+ if (segno_new != repack_current_segment)
+ {
+ LogicalConfirmReceivedLocation(end_lsn);
+ elog(DEBUG1, "REPACK: confirmed receive location %X/%X",
+ (uint32) (end_lsn >> 32), (uint32) end_lsn);
+ repack_current_segment = segno_new;
+ }
+
+ CHECK_FOR_INTERRUPTS();
+ }
+ InvalidateSystemCaches();
+ CurrentResourceOwner = resowner_old;
+ }
+ PG_CATCH();
+ {
+ InvalidateSystemCaches();
+ CurrentResourceOwner = resowner_old;
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+
+ /* Restore the progress reporting status. */
+ pgstat_progress_restore_state(&progress);
+}
+
+/*
+ * Apply changes that happened during the initial load.
+ *
+ * Scan key is passed by caller, so it does not have to be constructed
+ * multiple times. Key entries have all fields initialized, except for
+ * sk_argument.
+ */
+static void
+apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
+ ScanKey key, int nkeys, IndexInsertState *iistate)
+{
+ TupleTableSlot *index_slot,
+ *ident_slot;
+ HeapTuple tup_old = NULL;
+
+ if (dstate->nchanges == 0)
+ return;
+
+ /* TupleTableSlot is needed to pass the tuple to ExecInsertIndexTuples(). */
+ index_slot = MakeSingleTupleTableSlot(dstate->tupdesc, &TTSOpsHeapTuple);
+ iistate->econtext->ecxt_scantuple = index_slot;
+
+ /* A slot to fetch tuples from identity index. */
+ ident_slot = table_slot_create(rel, NULL);
+
+ while (tuplestore_gettupleslot(dstate->tstore, true, false,
+ dstate->tsslot))
+ {
+ bool shouldFree;
+ HeapTuple tup_change,
+ tup,
+ tup_exist;
+ char *change_raw,
+ *src;
+ ConcurrentChange change;
+ bool isnull[1];
+ Datum values[1];
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Get the change from the single-column tuple. */
+ tup_change = ExecFetchSlotHeapTuple(dstate->tsslot, false, &shouldFree);
+ heap_deform_tuple(tup_change, dstate->tupdesc_change, values, isnull);
+ Assert(!isnull[0]);
+
+ /* Make sure we access aligned data. */
+ change_raw = (char *) DatumGetByteaP(values[0]);
+ src = (char *) VARDATA(change_raw);
+ memcpy(&change, src, SizeOfConcurrentChange);
+
+ /* TRUNCATE change contains no tuple, so process it separately. */
+ if (change.kind == CHANGE_TRUNCATE)
+ {
+ /*
+ * All the things that ExecuteTruncateGuts() does (such as firing
+ * triggers or handling the DROP_CASCADE behavior) should have
+ * taken place on the source relation. Thus we only do the actual
+ * truncation of the new relation (and its indexes).
+ */
+ heap_truncate_one_rel(rel);
+
+ pfree(tup_change);
+ continue;
+ }
+
+ /*
+ * Extract the tuple from the change. The tuple is copied here because
+ * it might be assigned to 'tup_old', in which case it needs to
+ * survive into the next iteration.
+ */
+ tup = get_changed_tuple(src);
+
+ if (change.kind == CHANGE_UPDATE_OLD)
+ {
+ Assert(tup_old == NULL);
+ tup_old = tup;
+ }
+ else if (change.kind == CHANGE_INSERT)
+ {
+ Assert(tup_old == NULL);
+
+ apply_concurrent_insert(rel, &change, tup, iistate, index_slot);
+
+ pfree(tup);
+ }
+ else if (change.kind == CHANGE_UPDATE_NEW ||
+ change.kind == CHANGE_DELETE)
+ {
+ IndexScanDesc ind_scan = NULL;
+ HeapTuple tup_key;
+
+ if (change.kind == CHANGE_UPDATE_NEW)
+ {
+ tup_key = tup_old != NULL ? tup_old : tup;
+ }
+ else
+ {
+ Assert(tup_old == NULL);
+ tup_key = tup;
+ }
+
+ /*
+ * Find the tuple to be updated or deleted.
+ */
+ tup_exist = find_target_tuple(rel, key, nkeys, tup_key,
+ iistate, ident_slot, &ind_scan);
+ if (tup_exist == NULL)
+ elog(ERROR, "failed to find target tuple");
+
+ if (change.kind == CHANGE_UPDATE_NEW)
+ apply_concurrent_update(rel, tup, tup_exist, &change, iistate,
+ index_slot);
+ else
+ apply_concurrent_delete(rel, tup_exist, &change);
+
+ if (tup_old != NULL)
+ {
+ pfree(tup_old);
+ tup_old = NULL;
+ }
+
+ pfree(tup);
+ index_endscan(ind_scan);
+ }
+ else
+ elog(ERROR, "unrecognized kind of change: %d", change.kind);
+
+ /* If there's any change, make it visible to the next iteration. */
+ if (change.kind != CHANGE_UPDATE_OLD)
+ {
+ CommandCounterIncrement();
+ UpdateActiveSnapshotCommandId();
+ }
+
+ /* TTSOpsMinimalTuple has .get_heap_tuple==NULL. */
+ Assert(shouldFree);
+ pfree(tup_change);
+ }
+
+ tuplestore_clear(dstate->tstore);
+ dstate->nchanges = 0;
+
+ /* Cleanup. */
+ ExecDropSingleTupleTableSlot(index_slot);
+ ExecDropSingleTupleTableSlot(ident_slot);
+}
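The apply loop above dispatches on the change kind, pairing each CHANGE_UPDATE_OLD (which only remembers the identity key) with the CHANGE_UPDATE_NEW that follows it. A minimal, self-contained sketch of that dispatch against a toy in-memory table (all names here are hypothetical illustrations, not part of the patch):

```c
#include <assert.h>

/* Hypothetical change kinds mirroring the patch's ConcurrentChange.kind */
typedef enum
{
    CK_INSERT,
    CK_UPDATE_OLD,              /* carries the key of the row to update */
    CK_UPDATE_NEW,              /* carries the new version of the row */
    CK_DELETE,
    CK_TRUNCATE
} ChangeKind;

typedef struct
{
    ChangeKind  kind;
    int         key;
    int         value;
} Change;

typedef struct
{
    int         keys[16];
    int         values[16];
    int         nrows;
} MiniTable;

static int
mini_find(MiniTable *t, int key)
{
    for (int i = 0; i < t->nrows; i++)
        if (t->keys[i] == key)
            return i;
    return -1;
}

/*
 * Apply a stream of decoded changes.  As in apply_concurrent_changes(),
 * a CK_UPDATE_OLD entry only stashes the identity key; the following
 * CK_UPDATE_NEW performs the actual update using that key.
 */
static void
mini_apply(MiniTable *t, const Change *changes, int n)
{
    int         pending_key = 0;
    int         have_pending = 0;

    for (int i = 0; i < n; i++)
    {
        const Change *c = &changes[i];

        switch (c->kind)
        {
            case CK_TRUNCATE:
                t->nrows = 0;   /* no tuple payload, handled separately */
                break;
            case CK_INSERT:
                t->keys[t->nrows] = c->key;
                t->values[t->nrows] = c->value;
                t->nrows++;
                break;
            case CK_UPDATE_OLD:
                pending_key = c->key;
                have_pending = 1;
                break;
            case CK_UPDATE_NEW:
                {
                    int     key = have_pending ? pending_key : c->key;
                    int     idx = mini_find(t, key);

                    assert(idx >= 0);   /* "failed to find target tuple" */
                    t->keys[idx] = c->key;
                    t->values[idx] = c->value;
                    have_pending = 0;
                    break;
                }
            case CK_DELETE:
                {
                    int     idx = mini_find(t, c->key);

                    assert(idx >= 0);
                    /* compact: move the last row into the freed slot */
                    t->keys[idx] = t->keys[t->nrows - 1];
                    t->values[idx] = t->values[t->nrows - 1];
                    t->nrows--;
                    break;
                }
        }
    }
}
```

The real code additionally maintains the command counter and snapshot so that each applied change is visible to the lookups performed for the next one.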
+
+static void
+apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
+ IndexInsertState *iistate, TupleTableSlot *index_slot)
+{
+ List *recheck;
+
+ heap_insert(rel, tup, GetCurrentCommandId(true), HEAP_INSERT_NO_LOGICAL, NULL);
+
+ /*
+ * Update indexes.
+ *
+ * (Functions evaluated for the index might need the active snapshot,
+ * which the caller is expected to have set.)
+ */
+ ExecStoreHeapTuple(tup, index_slot, false);
+ recheck = ExecInsertIndexTuples(iistate->rri,
+ index_slot,
+ iistate->estate,
+ false, /* update */
+ false, /* noDupErr */
+ NULL, /* specConflict */
+ NIL, /* arbiterIndexes */
+ false /* onlySummarizing */
+ );
+
+ /*
+ * If recheck is required, it must have been performed on the source
+ * relation by now. (All the logical changes we process here are already
+ * committed.)
+ */
+ list_free(recheck);
+
+ pgstat_progress_incr_param(PROGRESS_REPACK_HEAP_TUPLES_INSERTED, 1);
+}
+
+static void
+apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
+ ConcurrentChange *change, IndexInsertState *iistate,
+ TupleTableSlot *index_slot)
+{
+ List *recheck;
+ TU_UpdateIndexes update_indexes;
+
+ /*
+ * Write the new tuple into the new heap. ('tup' gets the TID assigned
+ * here.)
+ */
+ simple_heap_update(rel, &tup_target->t_self, tup, &update_indexes);
+
+ ExecStoreHeapTuple(tup, index_slot, false);
+
+ if (update_indexes != TU_None)
+ {
+ recheck = ExecInsertIndexTuples(iistate->rri,
+ index_slot,
+ iistate->estate,
+ true, /* update */
+ false, /* noDupErr */
+ NULL, /* specConflict */
+ NIL, /* arbiterIndexes */
+ /* onlySummarizing */
+ update_indexes == TU_Summarizing);
+ list_free(recheck);
+ }
+
+ pgstat_progress_incr_param(PROGRESS_REPACK_HEAP_TUPLES_UPDATED, 1);
+}
+
+static void
+apply_concurrent_delete(Relation rel, HeapTuple tup_target,
+ ConcurrentChange *change)
+{
+ simple_heap_delete(rel, &tup_target->t_self);
+
+ pgstat_progress_incr_param(PROGRESS_REPACK_HEAP_TUPLES_DELETED, 1);
+}
+
+/*
+ * Find the tuple to be updated or deleted.
+ *
+ * 'key' is a pre-initialized scan key, into which the function will put the
+ * key values.
+ *
+ * 'tup_key' is a tuple containing the key values for the scan.
+ *
+ * On exit, '*scan_p' contains the scan descriptor used. The caller must
+ * close it once the returned tuple is no longer needed.
+ */
+static HeapTuple
+find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
+ IndexInsertState *iistate,
+ TupleTableSlot *ident_slot, IndexScanDesc *scan_p)
+{
+ IndexScanDesc scan;
+ Form_pg_index ident_form;
+ int2vector *ident_indkey;
+ HeapTuple result = NULL;
+
+ /* XXX no instrumentation for now */
+ scan = index_beginscan(rel, iistate->ident_index, GetActiveSnapshot(),
+ NULL, nkeys, 0);
+ *scan_p = scan;
+ index_rescan(scan, key, nkeys, NULL, 0);
+
+ /* Info needed to retrieve key values from heap tuple. */
+ ident_form = iistate->ident_index->rd_index;
+ ident_indkey = &ident_form->indkey;
+
+ /* Use the incoming tuple to finalize the scan key. */
+ for (int i = 0; i < scan->numberOfKeys; i++)
+ {
+ ScanKey entry;
+ bool isnull;
+ int16 attno_heap;
+
+ entry = &scan->keyData[i];
+ attno_heap = ident_indkey->values[i];
+ entry->sk_argument = heap_getattr(tup_key,
+ attno_heap,
+ rel->rd_att,
+ &isnull);
+ Assert(!isnull);
+ }
+ if (index_getnext_slot(scan, ForwardScanDirection, ident_slot))
+ {
+ bool shouldFree;
+
+ result = ExecFetchSlotHeapTuple(ident_slot, false, &shouldFree);
+ /* TTSOpsBufferHeapTuple has .get_heap_tuple != NULL. */
+ Assert(!shouldFree);
+ }
+
+ return result;
+}
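find_target_tuple() binds the identity-key columns of the incoming tuple into a pre-built scan key and probes the identity index. The same idea, reduced to a toy linear scan over plain structs (hypothetical names; the real code uses ScanKey entries and index_getnext_slot()):

```c
#include <assert.h>
#include <stddef.h>

#define NCOLS 3

/* A toy row with NCOLS integer columns. */
typedef struct
{
    int         cols[NCOLS];
} ToyTuple;

/* Which columns form the identity key, like indkey in the patch. */
typedef struct
{
    int         keycols[2];
    int         nkeycols;
} ToyIdentIndex;

/*
 * Find the row whose identity-key columns match those of 'key_tup'.
 * Mirrors find_target_tuple(): only the key columns are compared,
 * and NULL means the caller should raise an error.
 */
static const ToyTuple *
toy_find_target(const ToyTuple *rows, int nrows,
                const ToyIdentIndex *ident, const ToyTuple *key_tup)
{
    for (int r = 0; r < nrows; r++)
    {
        int         match = 1;

        for (int k = 0; k < ident->nkeycols; k++)
        {
            int         col = ident->keycols[k];

            if (rows[r].cols[col] != key_tup->cols[col])
            {
                match = 0;
                break;
            }
        }
        if (match)
            return &rows[r];
    }
    return NULL;                /* "failed to find target tuple" */
}
```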
+
+/*
+ * Decode and apply concurrent changes.
+ *
+ * Pass rel_src iff its reltoastrelid is needed.
+ */
+static void
+process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
+ Relation rel_dst, Relation rel_src, ScanKey ident_key,
+ int ident_key_nentries, IndexInsertState *iistate)
+{
+ RepackDecodingState *dstate;
+
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_CATCH_UP);
+
+ dstate = (RepackDecodingState *) ctx->output_writer_private;
+
+ repack_decode_concurrent_changes(ctx, end_of_wal);
+
+ if (dstate->nchanges == 0)
+ return;
+
+ PG_TRY();
+ {
+ /*
+ * Make sure that TOAST values can eventually be accessed via the old
+ * relation - see comment in copy_table_data().
+ */
+ if (rel_src)
+ rel_dst->rd_toastoid = rel_src->rd_rel->reltoastrelid;
+
+ apply_concurrent_changes(dstate, rel_dst, ident_key,
+ ident_key_nentries, iistate);
+ }
+ PG_FINALLY();
+ {
+ if (rel_src)
+ rel_dst->rd_toastoid = InvalidOid;
+ }
+ PG_END_TRY();
+}
+
+static IndexInsertState *
+get_index_insert_state(Relation relation, Oid ident_index_id)
+{
+ EState *estate;
+ int i;
+ IndexInsertState *result;
+
+ result = (IndexInsertState *) palloc0(sizeof(IndexInsertState));
+ estate = CreateExecutorState();
+ result->econtext = GetPerTupleExprContext(estate);
+
+ result->rri = (ResultRelInfo *) palloc(sizeof(ResultRelInfo));
+ InitResultRelInfo(result->rri, relation, 0, 0, 0);
+ ExecOpenIndices(result->rri, false);
+
+ /*
+ * Find the relcache entry of the identity index so that we spend no extra
+ * effort to open / close it.
+ */
+ for (i = 0; i < result->rri->ri_NumIndices; i++)
+ {
+ Relation ind_rel;
+
+ ind_rel = result->rri->ri_IndexRelationDescs[i];
+ if (ind_rel->rd_id == ident_index_id)
+ result->ident_index = ind_rel;
+ }
+ if (result->ident_index == NULL)
+ elog(ERROR, "failed to open identity index");
+
+ /* Only initialize fields needed by ExecInsertIndexTuples(). */
+ result->estate = estate;
+
+ return result;
+}
+
+/*
+ * Build scan key to process logical changes.
+ */
+static ScanKey
+build_identity_key(Oid ident_idx_oid, Relation rel_src, int *nentries)
+{
+ Relation ident_idx_rel;
+ Form_pg_index ident_idx;
+ int n,
+ i;
+ ScanKey result;
+
+ Assert(OidIsValid(ident_idx_oid));
+ ident_idx_rel = index_open(ident_idx_oid, AccessShareLock);
+ ident_idx = ident_idx_rel->rd_index;
+ n = ident_idx->indnatts;
+ result = (ScanKey) palloc(sizeof(ScanKeyData) * n);
+ for (i = 0; i < n; i++)
+ {
+ ScanKey entry;
+ int16 relattno;
+ Form_pg_attribute att;
+ Oid opfamily,
+ opcintype,
+ opno,
+ opcode;
+
+ entry = &result[i];
+ relattno = ident_idx->indkey.values[i];
+ if (relattno >= 1)
+ {
+ TupleDesc desc;
+
+ desc = rel_src->rd_att;
+ att = TupleDescAttr(desc, relattno - 1);
+ }
+ else
+ elog(ERROR, "unexpected attribute number %d in index", relattno);
+
+ opfamily = ident_idx_rel->rd_opfamily[i];
+ opcintype = ident_idx_rel->rd_opcintype[i];
+ opno = get_opfamily_member(opfamily, opcintype, opcintype,
+ BTEqualStrategyNumber);
+
+ if (!OidIsValid(opno))
+ elog(ERROR, "failed to find = operator for type %u", opcintype);
+
+ opcode = get_opcode(opno);
+ if (!OidIsValid(opcode))
+ elog(ERROR, "failed to find function for operator %u", opno);
+
+ /* Initialize everything but argument. */
+ ScanKeyInit(entry,
+ i + 1,
+ BTEqualStrategyNumber, opcode,
+ (Datum) NULL);
+ entry->sk_collation = att->attcollation;
+ }
+ index_close(ident_idx_rel, AccessShareLock);
+
+ *nentries = n;
+ return result;
+}
+
+static void
+free_index_insert_state(IndexInsertState *iistate)
+{
+ ExecCloseIndices(iistate->rri);
+ FreeExecutorState(iistate->estate);
+ pfree(iistate->rri);
+ pfree(iistate);
+}
+
+static void
+cleanup_logical_decoding(LogicalDecodingContext *ctx)
+{
+ RepackDecodingState *dstate;
+
+ dstate = (RepackDecodingState *) ctx->output_writer_private;
+
+ ExecDropSingleTupleTableSlot(dstate->tsslot);
+ FreeTupleDesc(dstate->tupdesc_change);
+ FreeTupleDesc(dstate->tupdesc);
+ tuplestore_end(dstate->tstore);
+
+ FreeDecodingContext(ctx);
+}
+
+/*
+ * The final steps of rebuild_relation() for concurrent processing.
+ *
+ * On entry, NewHeap is locked in AccessExclusiveLock mode. OldHeap and its
+ * clustering index (if one is passed) are still locked in a mode that allows
+ * concurrent data changes. On exit, both tables and their indexes are closed,
+ * but locked in AccessExclusiveLock mode.
+ */
+static void
+rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
+ Relation cl_index,
+ CatalogState *cat_state,
+ LogicalDecodingContext *ctx,
+ bool swap_toast_by_content,
+ TransactionId frozenXid,
+ MultiXactId cutoffMulti)
+{
+ LOCKMODE lockmode_old PG_USED_FOR_ASSERTS_ONLY;
+ List *ind_oids_new;
+ Oid old_table_oid = RelationGetRelid(OldHeap);
+ Oid new_table_oid = RelationGetRelid(NewHeap);
+ List *ind_oids_old = RelationGetIndexList(OldHeap);
+ ListCell *lc,
+ *lc2;
+ char relpersistence;
+ bool is_system_catalog;
+ Oid ident_idx_old,
+ ident_idx_new;
+ IndexInsertState *iistate;
+ ScanKey ident_key;
+ int ident_key_nentries;
+ XLogRecPtr wal_insert_ptr,
+ end_of_wal;
+ char dummy_rec_data = '\0';
+ RelReopenInfo *rri = NULL;
+ int nrel;
+ Relation *ind_refs_all,
+ *ind_refs_p;
+
+ /* Like in cluster_rel(). */
+ lockmode_old = ShareUpdateExclusiveLock;
+ Assert(CheckRelationLockedByMe(OldHeap, lockmode_old, false));
+ Assert(cl_index == NULL ||
+ CheckRelationLockedByMe(cl_index, lockmode_old, false));
+ /* This is expected from the caller. */
+ Assert(CheckRelationLockedByMe(NewHeap, AccessExclusiveLock, false));
+
+ ident_idx_old = RelationGetReplicaIndex(OldHeap);
+
+ /*
+ * Unlike the exclusive case, we build new indexes for the new relation
+ * rather than swapping the storage and reindexing the old relation. The
+ * point is that the index build can take some time, so we do it before we
+ * get AccessExclusiveLock on the old heap and therefore we cannot swap
+ * the heap storage yet.
+ *
+ * index_create() will lock the new indexes in AccessExclusiveLock mode
+ * during creation - no need to change that.
+ */
+ ind_oids_new = build_new_indexes(NewHeap, OldHeap, ind_oids_old);
+
+ /*
+ * Processing shouldn't start without a valid identity index.
+ */
+ Assert(OidIsValid(ident_idx_old));
+
+ /* Find "identity index" on the new relation. */
+ ident_idx_new = InvalidOid;
+ forboth(lc, ind_oids_old, lc2, ind_oids_new)
+ {
+ Oid ind_old = lfirst_oid(lc);
+ Oid ind_new = lfirst_oid(lc2);
+
+ if (ident_idx_old == ind_old)
+ {
+ ident_idx_new = ind_new;
+ break;
+ }
+ }
+ if (!OidIsValid(ident_idx_new))
+
+ /*
+ * Should not happen, given our lock on the old relation.
+ */
+ ereport(ERROR,
+ (errmsg("identity index missing on the new relation")));
+
+ /* Executor state to update indexes. */
+ iistate = get_index_insert_state(NewHeap, ident_idx_new);
+
+ /*
+ * Build scan key that we'll use to look for rows to be updated / deleted
+ * during logical decoding.
+ */
+ ident_key = build_identity_key(ident_idx_new, OldHeap, &ident_key_nentries);
+
+ /*
+ * Flush all WAL records inserted so far (possibly except for the last
+ * incomplete page, see GetInsertRecPtr), to minimize the amount of data
+ * we need to flush while holding exclusive lock on the source table.
+ */
+ wal_insert_ptr = GetInsertRecPtr();
+ XLogFlush(wal_insert_ptr);
+ end_of_wal = GetFlushRecPtr(NULL);
+
+ /*
+ * Apply concurrent changes first time, to minimize the time we need to
+ * hold AccessExclusiveLock. (A significant amount of WAL could have been
+ * written during the data copying and index creation.)
+ */
+ process_concurrent_changes(ctx, end_of_wal, NewHeap,
+ swap_toast_by_content ? OldHeap : NULL,
+ ident_key, ident_key_nentries, iistate);
+
+ /*
+ * Release the locks that allowed concurrent data changes, in order to
+ * acquire the AccessExclusiveLock.
+ */
+ nrel = 0;
+
+ /*
+ * We unlock the old relation (and its clustering index), but then we will
+ * lock the relation and *all* its indexes because we want to swap their
+ * storage.
+ *
+ * (NewHeap is already locked, as well as its indexes.)
+ */
+ rri = palloc_array(RelReopenInfo, 1 + list_length(ind_oids_old));
+ init_rel_reopen_info(&rri[nrel++], &OldHeap, InvalidOid,
+ ShareUpdateExclusiveLock, AccessExclusiveLock);
+ /* References to the re-opened indexes will be stored in this array. */
+ ind_refs_all = palloc_array(Relation, list_length(ind_oids_old));
+ ind_refs_p = ind_refs_all;
+ /* The clustering index is a special case. */
+ if (cl_index)
+ {
+ *ind_refs_p = cl_index;
+ init_rel_reopen_info(&rri[nrel], ind_refs_p, InvalidOid,
+ ShareUpdateExclusiveLock, AccessExclusiveLock);
+ nrel++;
+ ind_refs_p++;
+ }
+
+ /*
+ * Initialize also the entries for the other indexes (currently unlocked)
+ * because we will have to lock them.
+ */
+ foreach(lc, ind_oids_old)
+ {
+ Oid ind_oid;
+
+ ind_oid = lfirst_oid(lc);
+ /* Clustering index is already in the array, or there is none. */
+ if (cl_index && RelationGetRelid(cl_index) == ind_oid)
+ continue;
+
+ Assert(nrel < (1 + list_length(ind_oids_old)));
+
+ *ind_refs_p = NULL;
+ init_rel_reopen_info(&rri[nrel],
+
+ /*
+ * In this special case we do not have the relcache reference, use OID
+ * instead.
+ */
+ ind_refs_p,
+ ind_oid,
+ NoLock, /* Nothing to unlock. */
+ AccessExclusiveLock);
+
+ nrel++;
+ ind_refs_p++;
+ }
+ /* Perform the actual unlocking and re-locking. */
+ unlock_and_close_relations(rri, nrel);
+ reopen_relations(rri, nrel);
+
+ /*
+ * In addition, lock the OldHeap's TOAST relation that we skipped for the
+ * CONCURRENTLY option in copy_table_data(). This lock will be needed to
+ * swap the relation files.
+ */
+ if (OidIsValid(OldHeap->rd_rel->reltoastrelid))
+ LockRelationOid(OldHeap->rd_rel->reltoastrelid, AccessExclusiveLock);
+
+ /*
+ * Check if the new indexes match the old ones, i.e. no changes occurred
+ * while OldHeap was unlocked.
+ *
+ * XXX It's probably not necessary to check the relation tuple descriptor
+ * here because the logical decoding was already active when we released
+ * the lock, and thus the corresponding data changes won't be lost.
+ * However processing of those changes might take a lot of time.
+ */
+ check_catalog_changes(OldHeap, cat_state);
+
+ /*
+ * Tuples and pages of the old heap will be gone, but the heap will stay.
+ */
+ TransferPredicateLocksToHeapRelation(OldHeap);
+ /* The same for indexes. */
+ for (int i = 0; i < (nrel - 1); i++)
+ {
+ Relation index = ind_refs_all[i];
+
+ TransferPredicateLocksToHeapRelation(index);
+
+ /*
+ * References to indexes on the old relation are not needed anymore,
+ * however locks stay till the end of the transaction.
+ */
+ index_close(index, NoLock);
+ }
+ pfree(ind_refs_all);
+
+ /*
+ * Flush anything we see in WAL, to make sure that all changes committed
+ * while we were waiting for the exclusive lock are available for
+ * decoding. This should not be necessary if all backends had
+ * synchronous_commit set, but we can't rely on this setting.
+ *
+ * Unfortunately, GetInsertRecPtr() may lag behind the actual insert
+ * position, and GetLastImportantRecPtr() points at the start of the last
+ * record rather than at the end. Thus the simplest way to determine the
+ * insert position is to insert a dummy record and use its LSN.
+ *
+ * XXX Consider using GetLastImportantRecPtr() and adding the size of the
+ * last record (plus the total size of all the page headers the record
+ * spans)?
+ */
+ XLogBeginInsert();
+ XLogRegisterData(&dummy_rec_data, 1);
+ wal_insert_ptr = XLogInsert(RM_XLOG_ID, XLOG_NOOP);
+ XLogFlush(wal_insert_ptr);
+ end_of_wal = GetFlushRecPtr(NULL);
+
+ /* Apply the concurrent changes again. */
+ process_concurrent_changes(ctx, end_of_wal, NewHeap,
+ swap_toast_by_content ? OldHeap : NULL,
+ ident_key, ident_key_nentries, iistate);
+
+ /* Remember info about rel before closing OldHeap */
+ relpersistence = OldHeap->rd_rel->relpersistence;
+ is_system_catalog = IsSystemRelation(OldHeap);
+
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_SWAP_REL_FILES);
+
+ forboth(lc, ind_oids_old, lc2, ind_oids_new)
+ {
+ Oid ind_old = lfirst_oid(lc);
+ Oid ind_new = lfirst_oid(lc2);
+ Oid mapped_tables[4];
+
+ /* Zero out possible results from swap_relation_files */
+ memset(mapped_tables, 0, sizeof(mapped_tables));
+
+ swap_relation_files(ind_old, ind_new,
+ (old_table_oid == RelationRelationId),
+ swap_toast_by_content,
+ true,
+ InvalidTransactionId,
+ InvalidMultiXactId,
+ mapped_tables);
+
+#ifdef USE_ASSERT_CHECKING
+
+ /*
+ * Concurrent processing is not supported for system relations, so
+ * there should be no mapped tables.
+ */
+ for (int i = 0; i < 4; i++)
+ Assert(mapped_tables[i] == 0);
+#endif
+ }
+
+ /* The new indexes must be visible for deletion. */
+ CommandCounterIncrement();
+
+ /* Close the old heap but keep lock until transaction commit. */
+ table_close(OldHeap, NoLock);
+ /* Close the new heap. (We didn't have to open its indexes). */
+ table_close(NewHeap, NoLock);
+
+ /* Cleanup what we don't need anymore. (And close the identity index.) */
+ pfree(ident_key);
+ free_index_insert_state(iistate);
+
+ /*
+ * Swap the relations and their TOAST relations and TOAST indexes. This
+ * also drops the new relation and its indexes.
+ *
+ * (System catalogs are currently not supported.)
+ */
+ Assert(!is_system_catalog);
+ finish_heap_swap(old_table_oid, new_table_oid,
+ is_system_catalog,
+ swap_toast_by_content,
+ false, true, false,
+ frozenXid, cutoffMulti,
+ relpersistence);
+
+ pfree(rri);
+}
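The overall shape of rebuild_relation_finish_concurrent() is: bulk-copy under a weak lock while writes continue, drain the decoded backlog once, upgrade to AccessExclusiveLock, drain the (now small) remainder, then swap the files. A toy simulation of the invariant this ordering protects, namely that the new heap holds every committed row before the swap (hypothetical names, no real locking involved):

```c
#include <assert.h>

/*
 * Toy model of the concurrent catch-up: 'source_rows' counts rows
 * committed to the source table, 'copied_rows' counts rows present in
 * the new heap.
 */
typedef struct
{
    int         source_rows;
    int         copied_rows;
} RepackSim;

/* Stand-in for process_concurrent_changes(): replay the backlog. */
static void
drain_backlog(RepackSim *s)
{
    s->copied_rows = s->source_rows;
}

/*
 * Returns 1 iff the new heap ends up complete, which is the condition
 * for the file swap to be valid.
 */
static int
simulate_concurrent_repack(RepackSim *s, int writes_during_copy,
                           int writes_during_catchup)
{
    /* Phase 1: bulk copy; concurrent writes are still allowed. */
    s->copied_rows = s->source_rows;
    s->source_rows += writes_during_copy;

    /* Phase 2: first catch-up, still under the weak lock. */
    drain_backlog(s);
    s->source_rows += writes_during_catchup;

    /* Phase 3: "AccessExclusiveLock" held; no new writes can arrive. */
    drain_backlog(s);

    return s->copied_rows == s->source_rows;
}
```

The first catch-up exists purely to shrink the backlog, so the exclusive-lock window in phase 3 stays short.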
+
+/*
+ * Build indexes on NewHeap according to those on OldHeap.
+ *
+ * OldIndexes is the list of index OIDs on OldHeap.
+ *
+ * A list of OIDs of the corresponding indexes created on NewHeap is
+ * returned. Its order matches that of OldIndexes, so the two lists can be
+ * used to swap the index storage.
+ */
+static List *
+build_new_indexes(Relation NewHeap, Relation OldHeap, List *OldIndexes)
+{
+ StringInfo ind_name;
+ ListCell *lc;
+ List *result = NIL;
+
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_REBUILD_INDEX);
+
+ ind_name = makeStringInfo();
+
+ foreach(lc, OldIndexes)
+ {
+ Oid ind_oid,
+ ind_oid_new,
+ tbsp_oid;
+ Relation ind;
+ IndexInfo *ind_info;
+ int i,
+ heap_col_id;
+ List *colnames;
+ int16 indnatts;
+ Oid *collations,
+ *opclasses;
+ HeapTuple tup;
+ bool isnull;
+ Datum d;
+ oidvector *oidvec;
+ int2vector *int2vec;
+ size_t oid_arr_size;
+ size_t int2_arr_size;
+ int16 *indoptions;
+ text *reloptions = NULL;
+ bits16 flags;
+ Datum *opclassOptions;
+ NullableDatum *stattargets;
+
+ ind_oid = lfirst_oid(lc);
+ ind = index_open(ind_oid, AccessShareLock);
+ ind_info = BuildIndexInfo(ind);
+
+ tbsp_oid = ind->rd_rel->reltablespace;
+
+ /*
+ * Index names don't really matter; we'll eventually use only their
+ * storage. Just make them unique within the table.
+ */
+ resetStringInfo(ind_name);
+ appendStringInfo(ind_name, "ind_%d",
+ list_cell_number(OldIndexes, lc));
+
+ flags = 0;
+ if (ind->rd_index->indisprimary)
+ flags |= INDEX_CREATE_IS_PRIMARY;
+
+ colnames = NIL;
+ indnatts = ind->rd_index->indnatts;
+ oid_arr_size = sizeof(Oid) * indnatts;
+ int2_arr_size = sizeof(int16) * indnatts;
+
+ collations = (Oid *) palloc(oid_arr_size);
+ for (i = 0; i < indnatts; i++)
+ {
+ char *colname;
+
+ heap_col_id = ind->rd_index->indkey.values[i];
+ if (heap_col_id > 0)
+ {
+ Form_pg_attribute att;
+
+ /* Normal attribute. */
+ att = TupleDescAttr(OldHeap->rd_att, heap_col_id - 1);
+ colname = pstrdup(NameStr(att->attname));
+ collations[i] = att->attcollation;
+ }
+ else if (heap_col_id == 0)
+ {
+ HeapTuple tuple;
+ Form_pg_attribute att;
+
+ /*
+ * Expression column is not present in relcache. What we need
+ * here is an attribute of the *index* relation.
+ */
+ tuple = SearchSysCache2(ATTNUM,
+ ObjectIdGetDatum(ind_oid),
+ Int16GetDatum(i + 1));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR,
+ "cache lookup failed for attribute %d of relation %u",
+ i + 1, ind_oid);
+ att = (Form_pg_attribute) GETSTRUCT(tuple);
+ colname = pstrdup(NameStr(att->attname));
+ collations[i] = att->attcollation;
+ ReleaseSysCache(tuple);
+ }
+ else
+ elog(ERROR, "unexpected column number: %d",
+ heap_col_id);
+
+ colnames = lappend(colnames, colname);
+ }
+
+ /*
+ * Special effort needed for variable length attributes of
+ * Form_pg_index.
+ */
+ tup = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(ind_oid));
+ if (!HeapTupleIsValid(tup))
+ elog(ERROR, "cache lookup failed for index %u", ind_oid);
+ d = SysCacheGetAttr(INDEXRELID, tup, Anum_pg_index_indclass, &isnull);
+ Assert(!isnull);
+ oidvec = (oidvector *) DatumGetPointer(d);
+ opclasses = (Oid *) palloc(oid_arr_size);
+ memcpy(opclasses, oidvec->values, oid_arr_size);
+
+ d = SysCacheGetAttr(INDEXRELID, tup, Anum_pg_index_indoption,
+ &isnull);
+ Assert(!isnull);
+ int2vec = (int2vector *) DatumGetPointer(d);
+ indoptions = (int16 *) palloc(int2_arr_size);
+ memcpy(indoptions, int2vec->values, int2_arr_size);
+ ReleaseSysCache(tup);
+
+ tup = SearchSysCache1(RELOID, ObjectIdGetDatum(ind_oid));
+ if (!HeapTupleIsValid(tup))
+ elog(ERROR, "cache lookup failed for index relation %u", ind_oid);
+ d = SysCacheGetAttr(RELOID, tup, Anum_pg_class_reloptions, &isnull);
+ reloptions = !isnull ? DatumGetTextPCopy(d) : NULL;
+ ReleaseSysCache(tup);
+
+ opclassOptions = palloc0(sizeof(Datum) * ind_info->ii_NumIndexAttrs);
+ for (i = 0; i < ind_info->ii_NumIndexAttrs; i++)
+ opclassOptions[i] = get_attoptions(ind_oid, i + 1);
+
+ stattargets = get_index_stattargets(ind_oid, ind_info);
+
+ /*
+ * Neither parentIndexRelid nor parentConstraintId needs to be passed
+ * since the new catalog entries (pg_constraint, pg_inherits) would
+ * eventually be dropped. Therefore there's no need to record valid
+ * dependency on parents.
+ */
+ ind_oid_new = index_create(NewHeap,
+ ind_name->data,
+ InvalidOid,
+ InvalidOid, /* parentIndexRelid */
+ InvalidOid, /* parentConstraintId */
+ InvalidOid,
+ ind_info,
+ colnames,
+ ind->rd_rel->relam,
+ tbsp_oid,
+ collations,
+ opclasses,
+ opclassOptions,
+ indoptions,
+ stattargets,
+ PointerGetDatum(reloptions),
+ flags, /* flags */
+ 0, /* constr_flags */
+ false, /* allow_system_table_mods */
+ false, /* is_internal */
+ NULL /* constraintId */
+ );
+ result = lappend_oid(result, ind_oid_new);
+
+ index_close(ind, AccessShareLock);
+ list_free_deep(colnames);
+ pfree(collations);
+ pfree(opclasses);
+ pfree(indoptions);
+ if (reloptions)
+ pfree(reloptions);
+ }
+
+ return result;
+}
+
+static void
+init_rel_reopen_info(RelReopenInfo *rri, Relation *rel_p, Oid relid,
+ LOCKMODE lockmode_orig, LOCKMODE lockmode_new)
+{
+ rri->rel_p = rel_p;
+ rri->relid = relid;
+ rri->lockmode_orig = lockmode_orig;
+ rri->lockmode_new = lockmode_new;
+}
+
+/*
+ * Unlock and close relations specified by items of the 'rels' array. 'nrel'
+ * is the number of items.
+ *
+ * Information needed to (re)open the relations (or to issue meaningful ERROR)
+ * is added to the array items.
+ */
+static void
+unlock_and_close_relations(RelReopenInfo *rels, int nrel)
+{
+ int i;
+ RelReopenInfo *rri;
+
+ /*
+ * First, retrieve the information that we will need for re-opening.
+ *
+ * We could close (and unlock) each relation as soon as we have gathered
+ * the related information, but then we would have to be careful not to
+ * unlock the table until we have the info on all its indexes. (Once we
+ * unlock the table, any index can be dropped, and thus we can fail to get
+ * the name we want to report if re-opening fails.) It seems simpler to
+ * separate the work into two iterations.
+ */
+ for (i = 0; i < nrel; i++)
+ {
+ Relation rel;
+
+ rri = &rels[i];
+ rel = *rri->rel_p;
+
+ if (rel)
+ {
+ Assert(CheckRelationLockedByMe(rel, rri->lockmode_orig, false));
+ Assert(!OidIsValid(rri->relid));
+
+ rri->relid = RelationGetRelid(rel);
+ rri->relkind = rel->rd_rel->relkind;
+ rri->relname = pstrdup(RelationGetRelationName(rel));
+ }
+ else
+ {
+ Assert(OidIsValid(rri->relid));
+
+ rri->relname = get_rel_name(rri->relid);
+ rri->relkind = get_rel_relkind(rri->relid);
+ }
+ }
+
+ /* Second, close the relations. */
+ for (i = 0; i < nrel; i++)
+ {
+ Relation rel;
+
+ rri = &rels[i];
+ rel = *rri->rel_p;
+
+ /* Close the relation if the caller passed one. */
+ if (rel)
+ {
+ if (rri->relkind == RELKIND_RELATION)
+ table_close(rel, rri->lockmode_orig);
+ else
+ {
+ Assert(rri->relkind == RELKIND_INDEX);
+
+ index_close(rel, rri->lockmode_orig);
+ }
+ }
+ }
+}
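The two-pass structure commented above, gather diagnostic info for every relation first and only then close them, can be illustrated with a small standalone sketch (hypothetical types, not the patch's RelReopenInfo):

```c
#include <assert.h>
#include <string.h>

#define MAXNAME 32

/* A toy "relcache entry" whose name is only readable while open. */
typedef struct
{
    char        live_name[MAXNAME];
    int         open;
} ToyRelHandle;

typedef struct
{
    ToyRelHandle *rel;
    char        saved_name[MAXNAME];    /* survives the close */
} ToyReopenInfo;

/*
 * Mirror of unlock_and_close_relations(): pass 1 records the names
 * while everything is still open, pass 2 closes.  Interleaving the two
 * steps would risk reading a name after its entry was already gone.
 */
static void
toy_unlock_and_close(ToyReopenInfo *infos, int n)
{
    for (int i = 0; i < n; i++)
        strncpy(infos[i].saved_name, infos[i].rel->live_name, MAXNAME);

    for (int i = 0; i < n; i++)
    {
        infos[i].rel->open = 0;
        infos[i].rel->live_name[0] = '\0';  /* no longer readable */
    }
}
```

The saved names are what lets reopen_relations() produce a meaningful error if another session drops a table or index in the unlocked window.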
+
+/*
+ * Re-open the relations closed previously by unlock_and_close_relations().
+ */
+static void
+reopen_relations(RelReopenInfo *rels, int nrel)
+{
+ for (int i = 0; i < nrel; i++)
+ {
+ RelReopenInfo *rri = &rels[i];
+ Relation rel;
+
+ if (rri->relkind == RELKIND_RELATION)
+ {
+ rel = try_table_open(rri->relid, rri->lockmode_new);
+ }
+ else
+ {
+ Assert(rri->relkind == RELKIND_INDEX);
+
+ rel = try_index_open(rri->relid, rri->lockmode_new);
+ }
+
+ if (rel == NULL)
+ {
+ const char *kind_str;
+
+ kind_str = (rri->relkind == RELKIND_RELATION) ? "table" : "index";
+ ereport(ERROR,
+ (errmsg("could not open %s \"%s\"", kind_str,
+ rri->relname),
+ errhint("The %s could have been dropped by another transaction.",
+ kind_str)));
+ }
+ *rri->rel_p = rel;
+
+ pfree(rri->relname);
+ }
+}
+
/*
* REPACK is intended to be a replacement of both CLUSTER and VACUUM FULL.
*/
@@ -1822,6 +4181,7 @@ repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
Oid indexOid = InvalidOid;
MemoryContext repack_context;
List *rtcs;
+ LOCKMODE lockmode;
/* Parse option list */
foreach(lc, stmt->params)
@@ -1838,22 +4198,55 @@ repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
parser_errposition(pstate, opt->location)));
}
- params.options = (verbose ? CLUOPT_VERBOSE : 0);
+ params.options =
+ (verbose ? CLUOPT_VERBOSE : 0) |
+ (stmt->concurrent ? CLUOPT_CONCURRENT : 0);
+
+ /*
+ * Determine the lock mode expected by cluster_rel().
+ *
+ * In the exclusive case, we obtain AccessExclusiveLock right away to
+ * avoid lock-upgrade hazard in the single-transaction case. In the
+ * CONCURRENTLY case, the AccessExclusiveLock will only be used at the end
+ * of processing, supposedly for a very short time. Until then, we'll have
+ * to unlock the relation temporarily, so there's no lock-upgrade hazard.
+ */
+ lockmode = (params.options & CLUOPT_CONCURRENT) == 0 ?
+ AccessExclusiveLock : ShareUpdateExclusiveLock;
if (stmt->relation != NULL)
{
rel = process_single_relation(stmt->relation, stmt->indexname,
- CLUSTER_COMMAND_REPACK, ¶ms,
- &indexOid);
+ CLUSTER_COMMAND_REPACK, lockmode,
+ isTopLevel, ¶ms, &indexOid);
if (rel == NULL)
return;
}
/*
- * By here, we know we are in a multi-table situation. In order to avoid
- * holding locks for too long, we want to process each table in its own
- * transaction. This forces us to disallow running inside a user
- * transaction block.
+ * By here, we know we are in a multi-table situation.
+ *
+ * Concurrent processing is currently considered rather special (e.g. in
+ * terms of resources consumed) so it is not performed in bulk.
+ */
+ if (params.options & CLUOPT_CONCURRENT)
+ {
+ if (rel != NULL)
+ {
+ Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
+ ereport(ERROR,
+ (errmsg("REPACK CONCURRENTLY not supported for partitioned tables"),
+ errhint("Consider running the command for individual partitions.")));
+ }
+ else
+ ereport(ERROR,
+ (errmsg("REPACK CONCURRENTLY requires an explicit table name")));
+ }
+
+ /*
+ * In order to avoid holding locks for too long, we want to process each
+ * table in its own transaction. This forces us to disallow running
+ * inside a user transaction block.
*/
PreventInTransactionBlock(isTopLevel, "REPACK");
@@ -1869,6 +4262,8 @@ repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
bool rel_is_index;
Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
+ /* See the ereport() above. */
+ Assert((params.options & CLUOPT_CONCURRENT) == 0);
if (OidIsValid(indexOid))
{
@@ -1885,13 +4280,15 @@ repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
CLUSTER_COMMAND_REPACK);
/* close relation, releasing lock on parent table */
- table_close(rel, AccessExclusiveLock);
+ table_close(rel, lockmode);
}
else
rtcs = get_tables_to_repack(repack_context);
/* Do the job. */
- cluster_multiple_rels(rtcs, ¶ms, CLUSTER_COMMAND_REPACK);
+ cluster_multiple_rels(rtcs, ¶ms, CLUSTER_COMMAND_REPACK, lockmode,
+ isTopLevel);
+
/* Start a new transaction for the cleanup work. */
StartTransactionCommand();
@@ -1909,7 +4306,8 @@ repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
*/
static Relation
process_single_relation(RangeVar *relation, char *indexname,
- ClusterCommand cmd, ClusterParams *params,
+ ClusterCommand cmd, LOCKMODE lockmode,
+ bool isTopLevel, ClusterParams *params,
Oid *indexOid_p)
{
Relation rel;
@@ -1919,12 +4317,10 @@ process_single_relation(RangeVar *relation, char *indexname,
Oid tableOid;
/*
- * Find, lock, and check permissions on the table. We obtain
- * AccessExclusiveLock right away to avoid lock-upgrade hazard in the
- * single-transaction case.
+ * Find, lock, and check permissions on the table.
*/
tableOid = RangeVarGetRelidExtended(relation,
- AccessExclusiveLock,
+ lockmode,
0,
RangeVarCallbackMaintainsTable,
NULL);
@@ -1978,7 +4374,7 @@ process_single_relation(RangeVar *relation, char *indexname,
/* For non-partitioned tables, do what we came here to do. */
if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
{
- cluster_rel(rel, indexOid, params, cmd);
+ cluster_rel(rel, indexOid, params, cmd, isTopLevel);
/* cluster_rel closes the relation, but keeps lock */
return NULL;
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index e7854add178..df879c2a18d 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -904,7 +904,7 @@ refresh_by_match_merge(Oid matviewOid, Oid tempOid, Oid relowner,
static void
refresh_by_heap_swap(Oid matviewOid, Oid OIDNewHeap, char relpersistence)
{
- finish_heap_swap(matviewOid, OIDNewHeap, false, false, true, true,
+ finish_heap_swap(matviewOid, OIDNewHeap, false, false, true, true, true,
RecentXmin, ReadNextMultiXactId(), relpersistence);
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index e59ea0468c2..6f63ee85163 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4530,6 +4530,16 @@ AlterTableInternal(Oid relid, List *cmds, bool recurse)
rel = relation_open(relid, lockmode);
+ /*
+ * If lockmode allows, check if REPACK CONCURRENTLY is in progress. If
+ * lockmode is too weak, cluster_rel() should detect incompatible DDLs
+ * executed by us.
+ *
+ * XXX We might skip the changes for DDLs which do not change the tuple
+ * descriptor.
+ */
+ check_for_concurrent_repack(relid, lockmode);
+
EventTriggerAlterTableRelid(relid);
ATController(NULL, rel, cmds, recurse, lockmode, NULL);
@@ -5963,6 +5973,7 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
finish_heap_swap(tab->relid, OIDNewHeap,
false, false, true,
!OidIsValid(tab->newTableSpace),
+ true,
RecentXmin,
ReadNextMultiXactId(),
persistence);
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 61018482089..6e914a7020a 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -124,7 +124,7 @@ static void vac_truncate_clog(TransactionId frozenXID,
TransactionId lastSaneFrozenXid,
MultiXactId lastSaneMinMulti);
static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
- BufferAccessStrategy bstrategy);
+ BufferAccessStrategy bstrategy, bool isTopLevel);
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
@@ -634,7 +634,8 @@ vacuum(List *relations, VacuumParams *params, BufferAccessStrategy bstrategy,
if (params->options & VACOPT_VACUUM)
{
- if (!vacuum_rel(vrel->oid, vrel->relation, params, bstrategy))
+ if (!vacuum_rel(vrel->oid, vrel->relation, params, bstrategy,
+ isTopLevel))
continue;
}
@@ -1996,7 +1997,7 @@ vac_truncate_clog(TransactionId frozenXID,
*/
static bool
vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
- BufferAccessStrategy bstrategy)
+ BufferAccessStrategy bstrategy, bool isTopLevel)
{
LOCKMODE lmode;
Relation rel;
@@ -2264,7 +2265,7 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
/* VACUUM FULL is now a variant of CLUSTER; see cluster.c */
cluster_rel(rel, InvalidOid, &cluster_params,
- CLUSTER_COMMAND_VACUUM);
+ CLUSTER_COMMAND_VACUUM, isTopLevel);
/* cluster_rel closes the relation, but keeps lock */
rel = NULL;
@@ -2310,7 +2311,8 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
toast_vacuum_params.options |= VACOPT_PROCESS_MAIN;
toast_vacuum_params.toast_parent = relid;
- vacuum_rel(toast_relid, NULL, &toast_vacuum_params, bstrategy);
+ vacuum_rel(toast_relid, NULL, &toast_vacuum_params, bstrategy,
+ isTopLevel);
}
/*
diff --git a/src/backend/meson.build b/src/backend/meson.build
index 2b0db214804..50aa385a581 100644
--- a/src/backend/meson.build
+++ b/src/backend/meson.build
@@ -194,5 +194,6 @@ pg_test_mod_args = pg_mod_args + {
subdir('jit/llvm')
subdir('replication/libpqwalreceiver')
subdir('replication/pgoutput')
+subdir('replication/pgoutput_repack')
subdir('snowball')
subdir('utils/mb/conversion_procs')
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index d53808a406e..ea7ad798450 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -11902,27 +11902,30 @@ cluster_index_specification:
*
* QUERY:
* REPACK [ (options) ] [ <qualified_name> [ USING INDEX <index_name> ] ]
+ * REPACK [ (options) ] CONCURRENTLY <qualified_name> [ USING INDEX <index_name> ]
*
*****************************************************************************/
RepackStmt:
- REPACK qualified_name repack_index_specification
+ REPACK opt_concurrently qualified_name repack_index_specification
{
RepackStmt *n = makeNode(RepackStmt);
- n->relation = $2;
- n->indexname = $3;
+ n->concurrent = $2;
+ n->relation = $3;
+ n->indexname = $4;
n->params = NIL;
$$ = (Node *) n;
}
- | REPACK '(' utility_option_list ')' qualified_name repack_index_specification
+ | REPACK '(' utility_option_list ')' opt_concurrently qualified_name repack_index_specification
{
RepackStmt *n = makeNode(RepackStmt);
- n->relation = $5;
- n->indexname = $6;
n->params = $3;
+ n->concurrent = $5;
+ n->relation = $6;
+ n->indexname = $7;
$$ = (Node *) n;
}
@@ -11933,6 +11936,7 @@ RepackStmt:
n->relation = NULL;
n->indexname = NULL;
n->params = NIL;
+ n->concurrent = false;
$$ = (Node *) n;
}
@@ -11943,6 +11947,7 @@ RepackStmt:
n->relation = NULL;
n->indexname = NULL;
n->params = $3;
+ n->concurrent = false;
$$ = (Node *) n;
}
;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 78f9a0a11c4..00f7bbc5f59 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -33,6 +33,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecord.h"
#include "catalog/pg_control.h"
+#include "commands/cluster.h"
#include "replication/decode.h"
#include "replication/logical.h"
#include "replication/message.h"
@@ -467,6 +468,29 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
TransactionId xid = XLogRecGetXid(buf->record);
SnapBuild *builder = ctx->snapshot_builder;
+ /*
+ * Check if REPACK CONCURRENTLY is being performed by this backend. If so,
+ * only decode data changes of the table that it is processing, and the
+ * changes of its TOAST relation.
+ *
+ * (The TOAST locator should not be set unless the main one is.)
+ */
+ Assert(!OidIsValid(repacked_rel_toast_locator.relNumber) ||
+ OidIsValid(repacked_rel_locator.relNumber));
+
+ if (OidIsValid(repacked_rel_locator.relNumber))
+ {
+ XLogReaderState *r = buf->record;
+ RelFileLocator locator;
+
+ /* Not all records contain the block. */
+ if (XLogRecGetBlockTagExtended(r, 0, &locator, NULL, NULL, NULL) &&
+ !RelFileLocatorEquals(locator, repacked_rel_locator) &&
+ (!OidIsValid(repacked_rel_toast_locator.relNumber) ||
+ !RelFileLocatorEquals(locator, repacked_rel_toast_locator)))
+ return;
+ }
+
ReorderBufferProcessXid(ctx->reorder, xid, buf->origptr);
/*
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index e5d2a583ce6..c32e459411b 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -486,6 +486,26 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
return SnapBuildMVCCFromHistoric(snap, true);
}
+/*
+ * Build an MVCC snapshot for the initial data load performed by REPACK
+ * CONCURRENTLY command.
+ *
+ * The snapshot will only be used to scan one particular relation, which is
+ * treated like a catalog (therefore ->building_full_snapshot is not
+ * important), and the caller should already have a replication slot set up
+ * (so we do not set MyProc->xmin). XXX Do we still need to add some
+ * restrictions?
+ */
+Snapshot
+SnapBuildInitialSnapshotForRepack(SnapBuild *builder)
+{
+ Snapshot snap;
+
+ Assert(builder->state == SNAPBUILD_CONSISTENT);
+
+ snap = SnapBuildBuildSnapshot(builder);
+ return SnapBuildMVCCFromHistoric(snap, false);
+}
+
/*
* Turn a historic MVCC snapshot into an ordinary MVCC snapshot.
*
diff --git a/src/backend/replication/pgoutput_repack/Makefile b/src/backend/replication/pgoutput_repack/Makefile
new file mode 100644
index 00000000000..4efeb713b70
--- /dev/null
+++ b/src/backend/replication/pgoutput_repack/Makefile
@@ -0,0 +1,32 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+# Makefile for src/backend/replication/pgoutput_repack
+#
+# IDENTIFICATION
+# src/backend/replication/pgoutput_repack/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/replication/pgoutput_repack
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+ $(WIN32RES) \
+ pgoutput_repack.o
+PGFILEDESC = "pgoutput_repack - logical replication output plugin for REPACK command"
+NAME = pgoutput_repack
+
+all: all-shared-lib
+
+include $(top_srcdir)/src/Makefile.shlib
+
+install: all installdirs install-lib
+
+installdirs: installdirs-lib
+
+uninstall: uninstall-lib
+
+clean distclean: clean-lib
+ rm -f $(OBJS)
diff --git a/src/backend/replication/pgoutput_repack/meson.build b/src/backend/replication/pgoutput_repack/meson.build
new file mode 100644
index 00000000000..133e865a4a0
--- /dev/null
+++ b/src/backend/replication/pgoutput_repack/meson.build
@@ -0,0 +1,18 @@
+# Copyright (c) 2022-2024, PostgreSQL Global Development Group
+
+pgoutput_repack_sources = files(
+ 'pgoutput_repack.c',
+)
+
+if host_system == 'windows'
+ pgoutput_repack_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pgoutput_repack',
+ '--FILEDESC', 'pgoutput_repack - logical replication output plugin for REPACK command',])
+endif
+
+pgoutput_repack = shared_module('pgoutput_repack',
+ pgoutput_repack_sources,
+ kwargs: pg_mod_args,
+)
+
+backend_targets += pgoutput_repack
diff --git a/src/backend/replication/pgoutput_repack/pgoutput_repack.c b/src/backend/replication/pgoutput_repack/pgoutput_repack.c
new file mode 100644
index 00000000000..687fbbc59bb
--- /dev/null
+++ b/src/backend/replication/pgoutput_repack/pgoutput_repack.c
@@ -0,0 +1,288 @@
+/*-------------------------------------------------------------------------
+ *
+ * pgoutput_repack.c
+ * Logical Replication output plugin for REPACK command
+ *
+ * Copyright (c) 2012-2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/replication/pgoutput_repack/pgoutput_repack.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heaptoast.h"
+#include "commands/cluster.h"
+#include "replication/snapbuild.h"
+
+PG_MODULE_MAGIC;
+
+static void plugin_startup(LogicalDecodingContext *ctx,
+ OutputPluginOptions *opt, bool is_init);
+static void plugin_shutdown(LogicalDecodingContext *ctx);
+static void plugin_begin_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn);
+static void plugin_commit_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ Relation rel, ReorderBufferChange *change);
+static void plugin_truncate(struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, int nrelations,
+ Relation relations[],
+ ReorderBufferChange *change);
+static void store_change(LogicalDecodingContext *ctx,
+ ConcurrentChangeKind kind, HeapTuple tuple);
+
+void
+_PG_output_plugin_init(OutputPluginCallbacks *cb)
+{
+ AssertVariableIsOfType(&_PG_output_plugin_init, LogicalOutputPluginInit);
+
+ cb->startup_cb = plugin_startup;
+ cb->begin_cb = plugin_begin_txn;
+ cb->change_cb = plugin_change;
+ cb->truncate_cb = plugin_truncate;
+ cb->commit_cb = plugin_commit_txn;
+ cb->shutdown_cb = plugin_shutdown;
+}
+
+
+/* initialize this plugin */
+static void
+plugin_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
+ bool is_init)
+{
+ ctx->output_plugin_private = NULL;
+
+ /* Probably unnecessary, as we don't use the SQL interface ... */
+ opt->output_type = OUTPUT_PLUGIN_BINARY_OUTPUT;
+
+ if (ctx->output_plugin_options != NIL)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("this plugin does not expect any options")));
+ }
+}
+
+static void
+plugin_shutdown(LogicalDecodingContext *ctx)
+{
+}
+
+/*
+ * As we don't release the slot while processing a particular table, there is
+ * no room for an SQL interface, even for debugging purposes. Therefore we need
+ * neither OutputPluginPrepareWrite() nor OutputPluginWrite() in the plugin
+ * callbacks. (Although we might want to write custom callbacks, this API
+ * seems to be unnecessarily generic for our purposes.)
+ */
+
+/* BEGIN callback */
+static void
+plugin_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+}
+
+/* COMMIT callback */
+static void
+plugin_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+}
+
+/*
+ * Callback for individual changed tuples
+ */
+static void
+plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ Relation relation, ReorderBufferChange *change)
+{
+ RepackDecodingState *dstate;
+
+ dstate = (RepackDecodingState *) ctx->output_writer_private;
+
+ /* Only interested in one particular relation. */
+ if (relation->rd_id != dstate->relid)
+ return;
+
+ /* Decode entry depending on its type */
+ switch (change->action)
+ {
+ case REORDER_BUFFER_CHANGE_INSERT:
+ {
+ HeapTuple newtuple;
+
+ newtuple = change->data.tp.newtuple;
+
+ /*
+ * Identity checks in the main function should have made this
+ * impossible.
+ */
+ if (newtuple == NULL)
+ elog(ERROR, "incomplete insert info");
+
+ store_change(ctx, CHANGE_INSERT, newtuple);
+ }
+ break;
+ case REORDER_BUFFER_CHANGE_UPDATE:
+ {
+ HeapTuple oldtuple,
+ newtuple;
+
+ oldtuple = change->data.tp.oldtuple;
+ newtuple = change->data.tp.newtuple;
+
+ if (newtuple == NULL)
+ elog(ERROR, "incomplete update info");
+
+ if (oldtuple != NULL)
+ store_change(ctx, CHANGE_UPDATE_OLD, oldtuple);
+
+ store_change(ctx, CHANGE_UPDATE_NEW, newtuple);
+ }
+ break;
+ case REORDER_BUFFER_CHANGE_DELETE:
+ {
+ HeapTuple oldtuple;
+
+ oldtuple = change->data.tp.oldtuple;
+
+ if (oldtuple == NULL)
+ elog(ERROR, "incomplete delete info");
+
+ store_change(ctx, CHANGE_DELETE, oldtuple);
+ }
+ break;
+ default:
+ /* Should not come here */
+ Assert(false);
+ break;
+ }
+}
+
+static void
+plugin_truncate(struct LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ int nrelations, Relation relations[],
+ ReorderBufferChange *change)
+{
+ RepackDecodingState *dstate;
+ int i;
+ Relation relation = NULL;
+
+ dstate = (RepackDecodingState *) ctx->output_writer_private;
+
+ /* Find the relation we are processing. */
+ for (i = 0; i < nrelations; i++)
+ {
+ relation = relations[i];
+
+ if (RelationGetRelid(relation) == dstate->relid)
+ break;
+ }
+
+ /* Is this a truncation of some other relation? */
+ if (i == nrelations)
+ return;
+
+ store_change(ctx, CHANGE_TRUNCATE, NULL);
+}
+
+/* Store concurrent data change. */
+static void
+store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
+ HeapTuple tuple)
+{
+ RepackDecodingState *dstate;
+ char *change_raw;
+ ConcurrentChange change;
+ bool flattened = false;
+ Size size;
+ Datum values[1];
+ bool isnull[1];
+ char *dst,
+ *dst_start;
+
+ dstate = (RepackDecodingState *) ctx->output_writer_private;
+
+ size = MAXALIGN(VARHDRSZ) + SizeOfConcurrentChange;
+
+ if (tuple)
+ {
+ /*
+ * ReorderBufferCommit() stores the TOAST chunks in its private memory
+ * context and frees them after having called apply_change().
+ * Therefore we need flat copy (including TOAST) that we eventually
+ * copy into the memory context which is available to
+ * decode_concurrent_changes().
+ */
+ if (HeapTupleHasExternal(tuple))
+ {
+ /*
+ * toast_flatten_tuple_to_datum() might be more convenient but we
+ * don't want the decompression it does.
+ */
+ tuple = toast_flatten_tuple(tuple, dstate->tupdesc);
+ flattened = true;
+ }
+
+ size += tuple->t_len;
+ }
+
+ /* XXX Isn't there any function / macro to do this? */
+ if (size >= 0x3FFFFFFF)
+ elog(ERROR, "change is too big");
+
+ /* Construct the change. */
+ change_raw = (char *) palloc0(size);
+ SET_VARSIZE(change_raw, size);
+
+ /*
+ * Since the varlena alignment might not be sufficient for the structure,
+ * set the fields in a local instance and remember where it should
+ * eventually be copied.
+ */
+ change.kind = kind;
+ dst_start = (char *) VARDATA(change_raw);
+
+ /* No other information is needed for TRUNCATE. */
+ if (change.kind == CHANGE_TRUNCATE)
+ goto store;
+
+ /*
+ * Copy the tuple.
+ *
+ * CAUTION: change->tup_data.t_data must be fixed on retrieval!
+ */
+ memcpy(&change.tup_data, tuple, sizeof(HeapTupleData));
+ dst = dst_start + SizeOfConcurrentChange;
+ memcpy(dst, tuple->t_data, tuple->t_len);
+
+ /* The data has been copied. */
+ if (flattened)
+ pfree(tuple);
+
+store:
+ /* Copy the structure so it can be stored. */
+ memcpy(dst_start, &change, SizeOfConcurrentChange);
+
+ /* Store as tuple of 1 bytea column. */
+ values[0] = PointerGetDatum(change_raw);
+ isnull[0] = false;
+ tuplestore_putvalues(dstate->tstore, dstate->tupdesc_change,
+ values, isnull);
+
+ /* Accounting. */
+ dstate->nchanges++;
+
+ /* Cleanup. */
+ pfree(change_raw);
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..e24e1795aa9 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
#include "access/xlogprefetcher.h"
#include "access/xlogrecovery.h"
#include "commands/async.h"
+#include "commands/cluster.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/autovacuum.h"
@@ -150,6 +151,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, InjectionPointShmemSize());
size = add_size(size, SlotSyncShmemSize());
size = add_size(size, AioShmemSize());
+ size = add_size(size, RepackShmemSize());
/* include additional requested shmem from preload libraries */
size = add_size(size, total_addin_request);
@@ -343,6 +345,7 @@ CreateOrAttachShmemStructs(void)
WaitEventCustomShmemInit();
InjectionPointShmemInit();
AioShmemInit();
+ RepackShmemInit();
}
/*
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index bf3ba3c2ae7..4ee4c474874 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -1307,6 +1307,16 @@ ProcessUtilitySlow(ParseState *pstate,
lockmode = AlterTableGetLockLevel(atstmt->cmds);
relid = AlterTableLookupRelation(atstmt, lockmode);
+ /*
+ * If lockmode allows, check if REPACK CONCURRENTLY is in
+ * progress. If lockmode is too weak, cluster_rel() should
+ * detect incompatible DDLs executed by us.
+ *
+ * XXX We might skip the changes for DDLs which do not
+ * change the tuple descriptor.
+ */
+ check_for_concurrent_repack(relid, lockmode);
+
if (OidIsValid(relid))
{
AlterTableUtilityContext atcontext;
diff --git a/src/backend/utils/activity/backend_progress.c b/src/backend/utils/activity/backend_progress.c
index 17b5d87446b..fcd5d396b21 100644
--- a/src/backend/utils/activity/backend_progress.c
+++ b/src/backend/utils/activity/backend_progress.c
@@ -163,3 +163,19 @@ pgstat_progress_end_command(void)
beentry->st_progress.p_command_target = InvalidOid;
PGSTAT_END_WRITE_ACTIVITY(beentry);
}
+
+void
+pgstat_progress_restore_state(PgBackendProgress *backup)
+{
+ volatile PgBackendStatus *beentry = MyBEEntry;
+
+ if (!beentry || !pgstat_track_activities)
+ return;
+
+ PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
+ beentry->st_progress.p_command = backup->p_command;
+ beentry->st_progress.p_command_target = backup->p_command_target;
+ memcpy(MyBEEntry->st_progress.p_param, backup->p_param,
+ sizeof(MyBEEntry->st_progress.p_param));
+ PGSTAT_END_WRITE_ACTIVITY(beentry);
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 9fa12a555e8..ef04ed32cab 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -349,6 +349,7 @@ DSMRegistry "Waiting to read or update the dynamic shared memory registry."
InjectionPoint "Waiting to read or update information related to injection points."
SerialControl "Waiting to read or update shared <filename>pg_serial</filename> state."
AioWorkerSubmissionQueue "Waiting to access AIO worker submission queue."
+RepackedRels "Waiting to read or update information on tables being repacked concurrently."
#
# END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 4eb67720737..2f25cd86fe0 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -1633,6 +1633,27 @@ CacheInvalidateRelcache(Relation relation)
databaseId, relationId);
}
+/*
+ * CacheInvalidateRelcacheImmediate
+ * Send invalidation message for the specified relation's relcache entry.
+ *
+ * Currently this is used in REPACK CONCURRENTLY, to make sure that other
+ * backends are aware that the command is being executed for the relation.
+ */
+void
+CacheInvalidateRelcacheImmediate(Relation relation)
+{
+ SharedInvalidationMessage msg;
+
+ msg.rc.id = SHAREDINVALRELCACHE_ID;
+ msg.rc.dbId = MyDatabaseId;
+ msg.rc.relId = RelationGetRelid(relation);
+ /* check AddCatcacheInvalidationMessage() for an explanation */
+ VALGRIND_MAKE_MEM_DEFINED(&msg, sizeof(msg));
+
+ SendSharedInvalidMessages(&msg, 1);
+}
+
/*
* CacheInvalidateRelcacheAll
* Register invalidation of the whole relcache at the end of command.
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 9f54a9e72b7..679cc6be1d1 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -64,6 +64,7 @@
#include "catalog/pg_type.h"
#include "catalog/schemapg.h"
#include "catalog/storage.h"
+#include "commands/cluster.h"
#include "commands/policy.h"
#include "commands/publicationcmds.h"
#include "commands/trigger.h"
@@ -1252,6 +1253,10 @@ retry:
/* make sure relation is marked as having no open file yet */
relation->rd_smgr = NULL;
+ /* Is REPACK CONCURRENTLY in progress? */
+ relation->rd_repack_concurrent =
+ is_concurrent_repack_in_progress(targetRelId);
+
/*
* now we can free the memory allocated for pg_class_tuple
*/
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 70a6b8902d1..7f1c220e00b 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -213,7 +213,6 @@ static List *exportedSnapshots = NIL;
/* Prototypes for local functions */
static void UnregisterSnapshotNoOwner(Snapshot snapshot);
-static void FreeSnapshot(Snapshot snapshot);
static void SnapshotResetXmin(void);
/* ResourceOwner callbacks to track snapshot references */
@@ -646,7 +645,7 @@ CopySnapshot(Snapshot snapshot)
* FreeSnapshot
* Free the memory associated with a snapshot.
*/
-static void
+void
FreeSnapshot(Snapshot snapshot)
{
Assert(snapshot->regd_count == 0);
diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index 31271786f21..a22e6cb6ccc 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -4914,18 +4914,27 @@ match_previous_words(int pattern_id,
}
/* REPACK */
- else if (Matches("REPACK"))
+ else if (Matches("REPACK") || Matches("REPACK", "(*)"))
+ COMPLETE_WITH_SCHEMA_QUERY_PLUS(Query_for_list_of_clusterables,
+ "CONCURRENTLY");
+ else if (Matches("REPACK", "CONCURRENTLY"))
COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_clusterables);
- else if (Matches("REPACK", "(*)"))
+ else if (Matches("REPACK", "(*)", "CONCURRENTLY"))
COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_clusterables);
- /* If we have REPACK <sth>, then add "USING INDEX" */
- else if (Matches("REPACK", MatchAnyExcept("(")))
+ /* If we have REPACK [ CONCURRENTLY ] <sth>, then add "USING INDEX" */
+ else if (Matches("REPACK", MatchAnyExcept("(|CONCURRENTLY")) ||
+ Matches("REPACK", "CONCURRENTLY", MatchAnyExcept("(")))
COMPLETE_WITH("USING INDEX");
- /* If we have REPACK (*) <sth>, then add "USING INDEX" */
- else if (Matches("REPACK", "(*)", MatchAny))
+ /* If we have REPACK (*) [ CONCURRENTLY ] <sth>, then add "USING INDEX" */
+ else if (Matches("REPACK", "(*)", MatchAnyExcept("CONCURRENTLY")) ||
+ Matches("REPACK", "(*)", "CONCURRENTLY", MatchAnyExcept("(")))
COMPLETE_WITH("USING INDEX");
- /* If we have REPACK <sth> USING, then add the index as well */
- else if (Matches("REPACK", MatchAny, "USING", "INDEX"))
+
+ /*
+ * Complete ... [ (*) ] [ CONCURRENTLY ] <sth> USING INDEX, with a list of
+ * indexes for <sth>.
+ */
+ else if (TailMatches(MatchAnyExcept("(|CONCURRENTLY"), "USING", "INDEX"))
{
set_completion_reference(prev3_wd);
COMPLETE_WITH_SCHEMA_QUERY(Query_for_index_of_table);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 1640d9c32f7..bdeb2f83540 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -421,6 +421,10 @@ extern HTSV_Result HeapTupleSatisfiesVacuumHorizon(HeapTuple htup, Buffer buffer
TransactionId *dead_after);
extern void HeapTupleSetHintBits(HeapTupleHeader tuple, Buffer buffer,
uint16 infomask, TransactionId xid);
+extern bool HeapTupleMVCCInserted(HeapTuple htup, Snapshot snapshot,
+ Buffer buffer);
+extern bool HeapTupleMVCCNotDeleted(HeapTuple htup, Snapshot snapshot,
+ Buffer buffer);
extern bool HeapTupleHeaderIsOnlyLocked(HeapTupleHeader tuple);
extern bool HeapTupleIsSurelyDead(HeapTuple htup,
struct GlobalVisState *vistest);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index b8cb1e744ad..b1ca73d6ea5 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,6 +21,7 @@
#include "access/sdir.h"
#include "access/xact.h"
#include "executor/tuptable.h"
+#include "replication/logical.h"
#include "storage/read_stream.h"
#include "utils/rel.h"
#include "utils/snapshot.h"
@@ -630,6 +631,8 @@ typedef struct TableAmRoutine
Relation OldIndex,
bool use_sort,
TransactionId OldestXmin,
+ Snapshot snapshot,
+ LogicalDecodingContext *decoding_ctx,
TransactionId *xid_cutoff,
MultiXactId *multi_cutoff,
double *num_tuples,
@@ -1637,6 +1640,10 @@ table_relation_copy_data(Relation rel, const RelFileLocator *newrlocator)
* not needed for the relation's AM
* - *xid_cutoff - ditto
* - *multi_cutoff - ditto
+ * - snapshot - if != NULL, ignore data changes done by transactions that this
+ * (MVCC) snapshot considers still in-progress or in the future.
+ * - decoding_ctx - logical decoding context, to capture concurrent data
+ * changes.
*
* Output parameters:
* - *xid_cutoff - rel's new relfrozenxid value, may be invalid
@@ -1649,6 +1656,8 @@ table_relation_copy_for_cluster(Relation OldTable, Relation NewTable,
Relation OldIndex,
bool use_sort,
TransactionId OldestXmin,
+ Snapshot snapshot,
+ LogicalDecodingContext *decoding_ctx,
TransactionId *xid_cutoff,
MultiXactId *multi_cutoff,
double *num_tuples,
@@ -1657,6 +1666,7 @@ table_relation_copy_for_cluster(Relation OldTable, Relation NewTable,
{
OldTable->rd_tableam->relation_copy_for_cluster(OldTable, NewTable, OldIndex,
use_sort, OldestXmin,
+ snapshot, decoding_ctx,
xid_cutoff, multi_cutoff,
num_tuples, tups_vacuumed,
tups_recently_dead);
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 4daa8bef5ee..66431cc19e5 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -100,6 +100,9 @@ extern Oid index_concurrently_create_copy(Relation heapRelation,
Oid tablespaceOid,
const char *newName);
+extern NullableDatum *get_index_stattargets(Oid indexid,
+ IndexInfo *indInfo);
+
extern void index_concurrently_build(Oid heapRelationId,
Oid indexRelationId);
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index c2976905e4d..a2589d60d6e 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -13,10 +13,15 @@
#ifndef CLUSTER_H
#define CLUSTER_H
+#include "nodes/execnodes.h"
#include "nodes/parsenodes.h"
#include "parser/parse_node.h"
+#include "replication/logical.h"
#include "storage/lock.h"
+#include "storage/relfilelocator.h"
#include "utils/relcache.h"
+#include "utils/resowner.h"
+#include "utils/tuplestore.h"
/* flag bits for ClusterParams->options */
@@ -24,6 +29,7 @@
#define CLUOPT_RECHECK 0x02 /* recheck relation state */
#define CLUOPT_RECHECK_ISCLUSTERED 0x04 /* recheck relation state for
* indisclustered */
+#define CLUOPT_CONCURRENT 0x08 /* allow concurrent data changes */
/* options for CLUSTER */
typedef struct ClusterParams
@@ -46,14 +52,91 @@ typedef enum ClusterCommand
CLUSTER_COMMAND_VACUUM
} ClusterCommand;
+/*
+ * The following definitions are used by REPACK CONCURRENTLY.
+ */
+
+extern RelFileLocator repacked_rel_locator;
+extern RelFileLocator repacked_rel_toast_locator;
+
+typedef enum
+{
+ CHANGE_INSERT,
+ CHANGE_UPDATE_OLD,
+ CHANGE_UPDATE_NEW,
+ CHANGE_DELETE,
+ CHANGE_TRUNCATE
+} ConcurrentChangeKind;
+
+typedef struct ConcurrentChange
+{
+ /* See the enum above. */
+ ConcurrentChangeKind kind;
+
+ /*
+ * The actual tuple.
+ *
+ * The tuple data follows the ConcurrentChange structure. Before use make
+ * sure the tuple is correctly aligned (ConcurrentChange can be stored as
+ * bytea) and that tuple->t_data is fixed.
+ */
+ HeapTupleData tup_data;
+} ConcurrentChange;
+
+#define SizeOfConcurrentChange (offsetof(ConcurrentChange, tup_data) + \
+ sizeof(HeapTupleData))
+
+/*
+ * Logical decoding state.
+ *
+ * Here we store the data changes that we decode from WAL while the table
+ * contents are being copied to the new storage. Also the necessary metadata
+ * needed to apply these changes to the table is stored here.
+ */
+typedef struct RepackDecodingState
+{
+ /* The relation whose changes we're decoding. */
+ Oid relid;
+
+ /*
+ * Decoded changes are stored here. Although we try to avoid excessively
+ * large batches, the changes may need to be spilled to disk. The
+ * tuplestore does this transparently.
+ */
+ Tuplestorestate *tstore;
+
+ /* The current number of changes in tstore. */
+ double nchanges;
+
+ /*
+ * Descriptor to store the ConcurrentChange structure serialized (bytea).
+ * We can't store the tuple directly because tuplestore only supports
+ * minimum tuple and we may need to transfer OID system column from the
+ * output plugin. Also we need to transfer the change kind, so it's better
+ * to put everything in the structure than to use 2 tuplestores "in
+ * parallel".
+ */
+ TupleDesc tupdesc_change;
+
+ /* Tuple descriptor needed to update indexes. */
+ TupleDesc tupdesc;
+
+ /* Slot to retrieve data from tstore. */
+ TupleTableSlot *tsslot;
+
+ ResourceOwner resowner;
+} RepackDecodingState;
+
extern void cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel);
extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
- ClusterCommand cmd);
+ ClusterCommand cmd, bool isTopLevel);
extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid,
LOCKMODE lockmode,
ClusterCommand cmd);
extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
-
+extern void can_repack_concurrently(Relation rel);
+extern void repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
+ XLogRecPtr end_of_wal);
extern Oid make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
char relpersistence, LOCKMODE lockmode);
extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
@@ -61,9 +144,15 @@ extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
bool swap_toast_by_content,
bool check_constraints,
bool is_internal,
+ bool reindex,
TransactionId frozenXid,
MultiXactId cutoffMulti,
char newrelpersistence);
+extern Size RepackShmemSize(void);
+extern void RepackShmemInit(void);
+extern bool is_concurrent_repack_in_progress(Oid relid);
+extern void check_for_concurrent_repack(Oid relid, LOCKMODE lockmode);
+
extern void repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel);
#endif /* CLUSTER_H */
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 7644267e14f..6b1b1a4c1a7 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -67,10 +67,12 @@
#define PROGRESS_REPACK_PHASE 1
#define PROGRESS_REPACK_INDEX_RELID 2
#define PROGRESS_REPACK_HEAP_TUPLES_SCANNED 3
-#define PROGRESS_REPACK_HEAP_TUPLES_WRITTEN 4
-#define PROGRESS_REPACK_TOTAL_HEAP_BLKS 5
-#define PROGRESS_REPACK_HEAP_BLKS_SCANNED 6
-#define PROGRESS_REPACK_INDEX_REBUILD_COUNT 7
+#define PROGRESS_REPACK_HEAP_TUPLES_INSERTED 4
+#define PROGRESS_REPACK_HEAP_TUPLES_UPDATED 5
+#define PROGRESS_REPACK_HEAP_TUPLES_DELETED 6
+#define PROGRESS_REPACK_TOTAL_HEAP_BLKS 7
+#define PROGRESS_REPACK_HEAP_BLKS_SCANNED 8
+#define PROGRESS_REPACK_INDEX_REBUILD_COUNT 9
/*
* Phases of repack (as advertised via PROGRESS_REPACK_PHASE).
@@ -83,9 +85,10 @@
#define PROGRESS_REPACK_PHASE_INDEX_SCAN_HEAP 2
#define PROGRESS_REPACK_PHASE_SORT_TUPLES 3
#define PROGRESS_REPACK_PHASE_WRITE_NEW_HEAP 4
-#define PROGRESS_REPACK_PHASE_SWAP_REL_FILES 5
-#define PROGRESS_REPACK_PHASE_REBUILD_INDEX 6
-#define PROGRESS_REPACK_PHASE_FINAL_CLEANUP 7
+#define PROGRESS_REPACK_PHASE_CATCH_UP 5
+#define PROGRESS_REPACK_PHASE_SWAP_REL_FILES 6
+#define PROGRESS_REPACK_PHASE_REBUILD_INDEX 7
+#define PROGRESS_REPACK_PHASE_FINAL_CLEANUP 8
/* Commands of PROGRESS_REPACK */
#define PROGRESS_REPACK_COMMAND_REPACK 1
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index d32a4d9f2db..e36a32b83b2 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3926,6 +3926,7 @@ typedef struct RepackStmt
RangeVar *relation; /* relation being repacked */
char *indexname; /* order tuples by this index */
List *params; /* list of DefElem nodes */
+ bool concurrent; /* allow concurrent access? */
} RepackStmt;
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 6d4d2d1814c..802fc4b0823 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -73,6 +73,7 @@ extern void FreeSnapshotBuilder(SnapBuild *builder);
extern void SnapBuildSnapDecRefcount(Snapshot snap);
extern Snapshot SnapBuildInitialSnapshot(SnapBuild *builder);
+extern Snapshot SnapBuildInitialSnapshotForRepack(SnapBuild *builder);
extern Snapshot SnapBuildMVCCFromHistoric(Snapshot snapshot, bool in_place);
extern const char *SnapBuildExportSnapshot(SnapBuild *builder);
extern void SnapBuildClearExportedSnapshot(void);
diff --git a/src/include/storage/lockdefs.h b/src/include/storage/lockdefs.h
index 7f3ba0352f6..2739327b0da 100644
--- a/src/include/storage/lockdefs.h
+++ b/src/include/storage/lockdefs.h
@@ -36,8 +36,8 @@ typedef int LOCKMODE;
#define AccessShareLock 1 /* SELECT */
#define RowShareLock 2 /* SELECT FOR UPDATE/FOR SHARE */
#define RowExclusiveLock 3 /* INSERT, UPDATE, DELETE */
-#define ShareUpdateExclusiveLock 4 /* VACUUM (non-FULL), ANALYZE, CREATE
- * INDEX CONCURRENTLY */
+#define ShareUpdateExclusiveLock 4 /* VACUUM (non-exclusive), ANALYZE, CREATE
+ * INDEX CONCURRENTLY, REPACK CONCURRENTLY */
#define ShareLock 5 /* CREATE INDEX (WITHOUT CONCURRENTLY) */
#define ShareRowExclusiveLock 6 /* like EXCLUSIVE MODE, but allows ROW
* SHARE */
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 932024b1b0b..fe9d85e5f95 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -84,3 +84,4 @@ PG_LWLOCK(50, DSMRegistry)
PG_LWLOCK(51, InjectionPoint)
PG_LWLOCK(52, SerialControl)
PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, RepackedRels)
diff --git a/src/include/utils/backend_progress.h b/src/include/utils/backend_progress.h
index 10aaec9b15c..47ff1aa0f3f 100644
--- a/src/include/utils/backend_progress.h
+++ b/src/include/utils/backend_progress.h
@@ -55,6 +55,7 @@ extern void pgstat_progress_parallel_incr_param(int index, int64 incr);
extern void pgstat_progress_update_multi_param(int nparam, const int *index,
const int64 *val);
extern void pgstat_progress_end_command(void);
+extern void pgstat_progress_restore_state(PgBackendProgress *backup);
#endif /* BACKEND_PROGRESS_H */
diff --git a/src/include/utils/inval.h b/src/include/utils/inval.h
index 9b871caef62..3409f942098 100644
--- a/src/include/utils/inval.h
+++ b/src/include/utils/inval.h
@@ -50,6 +50,8 @@ extern void CacheInvalidateCatalog(Oid catalogId);
extern void CacheInvalidateRelcache(Relation relation);
+extern void CacheInvalidateRelcacheImmediate(Relation relation);
+
extern void CacheInvalidateRelcacheAll(void);
extern void CacheInvalidateRelcacheByTuple(HeapTuple classTuple);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index d94fddd7cef..372065fc570 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -253,6 +253,9 @@ typedef struct RelationData
bool pgstat_enabled; /* should relation stats be counted */
/* use "struct" here to avoid needing to include pgstat.h: */
struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+ /* Is REPACK CONCURRENTLY being performed on this relation? */
+ bool rd_repack_concurrent;
} RelationData;
@@ -692,7 +695,9 @@ RelationCloseSmgr(Relation relation)
#define RelationIsAccessibleInLogicalDecoding(relation) \
(XLogLogicalInfoActive() && \
RelationNeedsWAL(relation) && \
- (IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation)))
+ (IsCatalogRelation(relation) || \
+ RelationIsUsedAsCatalogTable(relation) || \
+ (relation)->rd_repack_concurrent))
/*
* RelationIsLogicallyLogged
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index 147b190210a..5eeabdc6c4f 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -61,6 +61,8 @@ extern Snapshot GetLatestSnapshot(void);
extern void SnapshotSetCommandId(CommandId curcid);
extern Snapshot CopySnapshot(Snapshot snapshot);
+extern void FreeSnapshot(Snapshot snapshot);
+
extern Snapshot GetCatalogSnapshot(Oid relid);
extern Snapshot GetNonHistoricCatalogSnapshot(Oid relid);
extern void InvalidateCatalogSnapshot(void);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 84ca2dc3778..086c61f4ef4 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1969,17 +1969,17 @@ pg_stat_progress_cluster| SELECT s.pid,
WHEN 2 THEN 'index scanning heap'::text
WHEN 3 THEN 'sorting tuples'::text
WHEN 4 THEN 'writing new heap'::text
- WHEN 5 THEN 'swapping relation files'::text
- WHEN 6 THEN 'rebuilding index'::text
- WHEN 7 THEN 'performing final cleanup'::text
+ WHEN 6 THEN 'swapping relation files'::text
+ WHEN 7 THEN 'rebuilding index'::text
+ WHEN 8 THEN 'performing final cleanup'::text
ELSE NULL::text
END AS phase,
(s.param3)::oid AS cluster_index_relid,
s.param4 AS heap_tuples_scanned,
s.param5 AS heap_tuples_written,
- s.param6 AS heap_blks_total,
- s.param7 AS heap_blks_scanned,
- s.param8 AS index_rebuild_count
+ s.param8 AS heap_blks_total,
+ s.param9 AS heap_blks_scanned,
+ s.param10 AS index_rebuild_count
FROM (pg_stat_get_progress_info('CLUSTER'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_progress_copy| SELECT s.pid,
@@ -2055,17 +2055,20 @@ pg_stat_progress_repack| SELECT s.pid,
WHEN 2 THEN 'index scanning heap'::text
WHEN 3 THEN 'sorting tuples'::text
WHEN 4 THEN 'writing new heap'::text
- WHEN 5 THEN 'swapping relation files'::text
- WHEN 6 THEN 'rebuilding index'::text
- WHEN 7 THEN 'performing final cleanup'::text
+ WHEN 5 THEN 'catch-up'::text
+ WHEN 6 THEN 'swapping relation files'::text
+ WHEN 7 THEN 'rebuilding index'::text
+ WHEN 8 THEN 'performing final cleanup'::text
ELSE NULL::text
END AS phase,
(s.param3)::oid AS repack_index_relid,
s.param4 AS heap_tuples_scanned,
- s.param5 AS heap_tuples_written,
- s.param6 AS heap_blks_total,
- s.param7 AS heap_blks_scanned,
- s.param8 AS index_rebuild_count
+ s.param5 AS heap_tuples_inserted,
+ s.param6 AS heap_tuples_updated,
+ s.param7 AS heap_tuples_deleted,
+ s.param8 AS heap_blks_total,
+ s.param9 AS heap_blks_scanned,
+ s.param10 AS index_rebuild_count
FROM (pg_stat_get_progress_info('REPACK'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_progress_vacuum| SELECT s.pid,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 01246732456..ac52b8b0336 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -400,6 +400,7 @@ CatCacheHeader
CatalogId
CatalogIdMapEntry
CatalogIndexState
+CatalogState
ChangeVarNodes_context
CheckPoint
CheckPointStmt
@@ -476,6 +477,8 @@ CompressFileHandle
CompressionLocation
CompressorState
ComputeXidHorizonsResult
+ConcurrentChange
+ConcurrentChangeKind
ConditionVariable
ConditionVariableMinimallyPadded
ConditionalStack
@@ -1235,6 +1238,7 @@ IndexElem
IndexFetchHeapData
IndexFetchTableData
IndexInfo
+IndexInsertState
IndexList
IndexOnlyScan
IndexOnlyScanState
@@ -2451,6 +2455,7 @@ RelMapping
RelOptInfo
RelOptKind
RelPathStr
+RelReopenInfo
RelStatsInfo
RelToCheck
RelToCluster
@@ -2502,6 +2507,8 @@ ReorderBufferTupleCidKey
ReorderBufferUpdateProgressTxnCB
ReorderTuple
RepOriginId
+RepackedRel
+RepackDecodingState
RepackStmt
ReparameterizeForeignPathByChild_function
ReplaceVarsFromTargetList_context
--
2.39.5
v10-0005-Preserve-visibility-information-of-the-concurren.patch
From d1a9474dda635733a8494f479c6f6eda898de217 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=C3=81lvaro=20Herrera?= <alvherre@alvh.no-ip.org>
Date: Mon, 24 Mar 2025 20:12:41 +0100
Subject: [PATCH v10 5/9] Preserve visibility information of the concurrent
data changes.
As explained in the commit message of the preceding patch in the series, the
data changes that applications make while REPACK CONCURRENTLY is copying the
table contents to a new file are decoded from WAL and eventually applied to
the new file as well. To reduce complexity a little, the preceding patch uses
the current transaction (i.e. the transaction opened by the REPACK command)
to execute those INSERT, UPDATE and DELETE commands.

However, REPACK is not expected to change the visibility of tuples.
Therefore, this patch fixes the handling of the "concurrent data changes":
the tuples written into the new table storage now have the same XID and
command ID (CID) as they had in the old storage.

A related change is that the data changes (INSERT, UPDATE, DELETE) we
"replay" on the new storage are not themselves logically decoded. First, the
logical decoding subsystem does not expect an already-committed transaction
to be decoded again. Second, repeated decoding would just be wasted effort.
---
src/backend/access/common/toast_internals.c | 3 +-
src/backend/access/heap/heapam.c | 73 ++++++++----
src/backend/access/heap/heapam_handler.c | 14 ++-
src/backend/access/transam/xact.c | 52 ++++++++
src/backend/commands/cluster.c | 111 ++++++++++++++++--
src/backend/replication/logical/decode.c | 77 ++++++++++--
src/backend/replication/logical/snapbuild.c | 22 ++--
.../pgoutput_repack/pgoutput_repack.c | 68 +++++++++--
src/include/access/heapam.h | 15 ++-
src/include/access/heapam_xlog.h | 2 +
src/include/access/xact.h | 2 +
src/include/commands/cluster.h | 18 +++
src/include/utils/snapshot.h | 3 +
13 files changed, 390 insertions(+), 70 deletions(-)
diff --git a/src/backend/access/common/toast_internals.c b/src/backend/access/common/toast_internals.c
index 7d8be8346ce..75d889ec72c 100644
--- a/src/backend/access/common/toast_internals.c
+++ b/src/backend/access/common/toast_internals.c
@@ -320,7 +320,8 @@ toast_save_datum(Relation rel, Datum value,
memcpy(VARDATA(&chunk_data), data_p, chunk_size);
toasttup = heap_form_tuple(toasttupDesc, t_values, t_isnull);
- heap_insert(toastrel, toasttup, mycid, options, NULL);
+ heap_insert(toastrel, toasttup, GetCurrentTransactionId(), mycid,
+ options, NULL);
/*
* Create the index entry. We cheat a little here by not using
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 1be1ef22d1e..bf211426682 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -60,7 +60,8 @@ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
HeapTuple newtup, HeapTuple old_key_tuple,
- bool all_visible_cleared, bool new_all_visible_cleared);
+ bool all_visible_cleared, bool new_all_visible_cleared,
+ bool wal_logical);
#ifdef USE_ASSERT_CHECKING
static void check_lock_if_inplace_updateable_rel(Relation relation,
ItemPointer otid,
@@ -2070,7 +2071,7 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
/*
* heap_insert - insert tuple into a heap
*
- * The new tuple is stamped with current transaction ID and the specified
+ * The new tuple is stamped with the specified transaction ID and the specified
* command ID.
*
* See table_tuple_insert for comments about most of the input flags, except
@@ -2086,15 +2087,16 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
* reflected into *tup.
*/
void
-heap_insert(Relation relation, HeapTuple tup, CommandId cid,
- int options, BulkInsertState bistate)
+heap_insert(Relation relation, HeapTuple tup, TransactionId xid,
+ CommandId cid, int options, BulkInsertState bistate)
{
- TransactionId xid = GetCurrentTransactionId();
HeapTuple heaptup;
Buffer buffer;
Buffer vmbuffer = InvalidBuffer;
bool all_visible_cleared = false;
+ Assert(TransactionIdIsValid(xid));
+
/* Cheap, simplistic check that the tuple matches the rel's rowtype. */
Assert(HeapTupleHeaderGetNatts(tup->t_data) <=
RelationGetNumberOfAttributes(relation));
@@ -2725,7 +2727,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
void
simple_heap_insert(Relation relation, HeapTuple tup)
{
- heap_insert(relation, tup, GetCurrentCommandId(true), 0, NULL);
+ heap_insert(relation, tup, GetCurrentTransactionId(),
+ GetCurrentCommandId(true), 0, NULL);
}
/*
@@ -2782,11 +2785,11 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
*/
TM_Result
heap_delete(Relation relation, ItemPointer tid,
- CommandId cid, Snapshot crosscheck, bool wait,
- TM_FailureData *tmfd, bool changingPart)
+ TransactionId xid, CommandId cid, Snapshot crosscheck, bool wait,
+ TM_FailureData *tmfd, bool changingPart,
+ bool wal_logical)
{
TM_Result result;
- TransactionId xid = GetCurrentTransactionId();
ItemId lp;
HeapTupleData tp;
Page page;
@@ -2803,6 +2806,7 @@ heap_delete(Relation relation, ItemPointer tid,
bool old_key_copied = false;
Assert(ItemPointerIsValid(tid));
+ Assert(TransactionIdIsValid(xid));
/*
* Forbid this during a parallel operation, lest it allocate a combo CID.
@@ -3028,7 +3032,8 @@ l1:
* Compute replica identity tuple before entering the critical section so
* we don't PANIC upon a memory allocation failure.
*/
- old_key_tuple = ExtractReplicaIdentity(relation, &tp, true, &old_key_copied);
+ old_key_tuple = wal_logical ?
+ ExtractReplicaIdentity(relation, &tp, true, &old_key_copied) : NULL;
/*
* If this is the first possibly-multixact-able operation in the current
@@ -3096,8 +3101,12 @@ l1:
/*
* For logical decode we need combo CIDs to properly decode the
* catalog
+ *
+ * Like in heap_insert(), visibility is unchanged when called from
+ * VACUUM FULL / CLUSTER.
*/
- if (RelationIsAccessibleInLogicalDecoding(relation))
+ if (wal_logical &&
+ RelationIsAccessibleInLogicalDecoding(relation))
log_heap_new_cid(relation, &tp);
xlrec.flags = 0;
@@ -3118,6 +3127,15 @@ l1:
xlrec.flags |= XLH_DELETE_CONTAINS_OLD_KEY;
}
+ /*
+ * Unlike UPDATE, DELETE is decoded even if there is no old key, so it
+ * does not help to clear both XLH_DELETE_CONTAINS_OLD_TUPLE and
+ * XLH_DELETE_CONTAINS_OLD_KEY. Thus we need an extra flag. TODO
+ * Consider not decoding tuples w/o the old tuple/key instead.
+ */
+ if (!wal_logical)
+ xlrec.flags |= XLH_DELETE_NO_LOGICAL;
+
XLogBeginInsert();
XLogRegisterData(&xlrec, SizeOfHeapDelete);
@@ -3207,10 +3225,11 @@ simple_heap_delete(Relation relation, ItemPointer tid)
TM_Result result;
TM_FailureData tmfd;
- result = heap_delete(relation, tid,
+ result = heap_delete(relation, tid, GetCurrentTransactionId(),
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &tmfd, false /* changingPart */ );
+ &tmfd, false, /* changingPart */
+ true /* wal_logical */ );
switch (result)
{
case TM_SelfModified:
@@ -3249,12 +3268,11 @@ simple_heap_delete(Relation relation, ItemPointer tid)
*/
TM_Result
heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
- CommandId cid, Snapshot crosscheck, bool wait,
- TM_FailureData *tmfd, LockTupleMode *lockmode,
- TU_UpdateIndexes *update_indexes)
+ TransactionId xid, CommandId cid, Snapshot crosscheck,
+ bool wait, TM_FailureData *tmfd, LockTupleMode *lockmode,
+ TU_UpdateIndexes *update_indexes, bool wal_logical)
{
TM_Result result;
- TransactionId xid = GetCurrentTransactionId();
Bitmapset *hot_attrs;
Bitmapset *sum_attrs;
Bitmapset *key_attrs;
@@ -3294,6 +3312,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
infomask2_new_tuple;
Assert(ItemPointerIsValid(otid));
+ Assert(TransactionIdIsValid(xid));
/* Cheap, simplistic check that the tuple matches the rel's rowtype. */
Assert(HeapTupleHeaderGetNatts(newtup->t_data) <=
@@ -4131,8 +4150,12 @@ l2:
/*
* For logical decoding we need combo CIDs to properly decode the
* catalog.
+ *
+ * Like in heap_insert(), visibility is unchanged when called from
+ * VACUUM FULL / CLUSTER.
*/
- if (RelationIsAccessibleInLogicalDecoding(relation))
+ if (wal_logical &&
+ RelationIsAccessibleInLogicalDecoding(relation))
{
log_heap_new_cid(relation, &oldtup);
log_heap_new_cid(relation, heaptup);
@@ -4142,7 +4165,8 @@ l2:
newbuf, &oldtup, heaptup,
old_key_tuple,
all_visible_cleared,
- all_visible_cleared_new);
+ all_visible_cleared_new,
+ wal_logical);
if (newbuf != buffer)
{
PageSetLSN(BufferGetPage(newbuf), recptr);
@@ -4497,10 +4521,10 @@ simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup,
TM_FailureData tmfd;
LockTupleMode lockmode;
- result = heap_update(relation, otid, tup,
+ result = heap_update(relation, otid, tup, GetCurrentTransactionId(),
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &tmfd, &lockmode, update_indexes);
+ &tmfd, &lockmode, update_indexes, true);
switch (result)
{
case TM_SelfModified:
@@ -8833,7 +8857,8 @@ static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup, HeapTuple newtup,
HeapTuple old_key_tuple,
- bool all_visible_cleared, bool new_all_visible_cleared)
+ bool all_visible_cleared, bool new_all_visible_cleared,
+ bool wal_logical)
{
xl_heap_update xlrec;
xl_heap_header xlhdr;
@@ -8844,10 +8869,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
suffixlen = 0;
XLogRecPtr recptr;
Page page = BufferGetPage(newbuf);
- bool need_tuple_data = RelationIsLogicallyLogged(reln);
+ bool need_tuple_data;
bool init;
int bufflags;
+ need_tuple_data = RelationIsLogicallyLogged(reln) && wal_logical;
+
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 371afa6ad59..ea1d6f299b3 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -256,7 +256,8 @@ heapam_tuple_insert(Relation relation, TupleTableSlot *slot, CommandId cid,
tuple->t_tableOid = slot->tts_tableOid;
/* Perform the insertion, and copy the resulting ItemPointer */
- heap_insert(relation, tuple, cid, options, bistate);
+ heap_insert(relation, tuple, GetCurrentTransactionId(), cid, options,
+ bistate);
ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
if (shouldFree)
@@ -279,7 +280,8 @@ heapam_tuple_insert_speculative(Relation relation, TupleTableSlot *slot,
options |= HEAP_INSERT_SPECULATIVE;
/* Perform the insertion, and copy the resulting ItemPointer */
- heap_insert(relation, tuple, cid, options, bistate);
+ heap_insert(relation, tuple, GetCurrentTransactionId(), cid, options,
+ bistate);
ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
if (shouldFree)
@@ -313,7 +315,8 @@ heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
* the storage itself is cleaning the dead tuples by itself, it is the
* time to call the index tuple deletion also.
*/
- return heap_delete(relation, tid, cid, crosscheck, wait, tmfd, changingPart);
+ return heap_delete(relation, tid, GetCurrentTransactionId(), cid,
+ crosscheck, wait, tmfd, changingPart, true);
}
@@ -331,8 +334,9 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
slot->tts_tableOid = RelationGetRelid(relation);
tuple->t_tableOid = slot->tts_tableOid;
- result = heap_update(relation, otid, tuple, cid, crosscheck, wait,
- tmfd, lockmode, update_indexes);
+ result = heap_update(relation, otid, tuple, GetCurrentTransactionId(),
+ cid, crosscheck, wait,
+ tmfd, lockmode, update_indexes, true);
ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
/*
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b885513f765..aebad612ce8 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -126,6 +126,18 @@ static FullTransactionId XactTopFullTransactionId = {InvalidTransactionId};
static int nParallelCurrentXids = 0;
static TransactionId *ParallelCurrentXids;
+/*
+ * Another case that requires TransactionIdIsCurrentTransactionId() to behave
+ * specially is when REPACK CONCURRENTLY is processing data changes made in
+ * the old storage of a table by other transactions. When applying the changes
+ * to the new storage, the backend executing the REPACK command needs to act
+ * on behalf of those other transactions. The transactions responsible for the
+ * changes in the old storage are stored in this array, sorted by
+ * xidComparator.
+ */
+static int nRepackCurrentXids = 0;
+static TransactionId *RepackCurrentXids = NULL;
+
/*
* Miscellaneous flag bits to record events which occur on the top level
* transaction. These flags are only persisted in MyXactFlags and are intended
@@ -972,6 +984,8 @@ TransactionIdIsCurrentTransactionId(TransactionId xid)
int low,
high;
+ Assert(nRepackCurrentXids == 0);
+
low = 0;
high = nParallelCurrentXids - 1;
while (low <= high)
@@ -991,6 +1005,21 @@ TransactionIdIsCurrentTransactionId(TransactionId xid)
return false;
}
+ /*
+ * When executing REPACK CONCURRENTLY, the array of current transactions
+ * is given explicitly.
+ */
+ if (nRepackCurrentXids > 0)
+ {
+ Assert(nParallelCurrentXids == 0);
+
+ return bsearch(&xid,
+ RepackCurrentXids,
+ nRepackCurrentXids,
+ sizeof(TransactionId),
+ xidComparator) != NULL;
+ }
+
/*
* We will return true for the Xid of the current subtransaction, any of
* its subcommitted children, any of its parents, or any of their
@@ -5640,6 +5669,29 @@ EndParallelWorkerTransaction(void)
CurrentTransactionState->blockState = TBLOCK_DEFAULT;
}
+/*
+ * SetRepackCurrentXids
+ * Set the XID array that TransactionIdIsCurrentTransactionId() should
+ * use.
+ */
+void
+SetRepackCurrentXids(TransactionId *xip, int xcnt)
+{
+ RepackCurrentXids = xip;
+ nRepackCurrentXids = xcnt;
+}
+
+/*
+ * ResetRepackCurrentXids
+ * Undo the effect of SetRepackCurrentXids().
+ */
+void
+ResetRepackCurrentXids(void)
+{
+ RepackCurrentXids = NULL;
+ nRepackCurrentXids = 0;
+}
+
/*
* ShowTransactionState
* Debug support
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 90e43f12417..4c15e3e3133 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -210,6 +210,7 @@ static void apply_concurrent_delete(Relation rel, HeapTuple tup_target,
ConcurrentChange *change);
static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
HeapTuple tup_key,
+ Snapshot snapshot,
IndexInsertState *iistate,
TupleTableSlot *ident_slot,
IndexScanDesc *scan_p);
@@ -2983,6 +2984,9 @@ setup_logical_decoding(Oid relid, const char *slotname, TupleDesc tupdesc)
dstate->relid = relid;
dstate->tstore = tuplestore_begin_heap(false, false,
maintenance_work_mem);
+#ifdef USE_ASSERT_CHECKING
+ dstate->last_change_xid = InvalidTransactionId;
+#endif
dstate->tupdesc = tupdesc;
/* Initialize the descriptor to store the changes ... */
@@ -3140,6 +3144,7 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
char *change_raw,
*src;
ConcurrentChange change;
+ Snapshot snapshot;
bool isnull[1];
Datum values[1];
@@ -3208,8 +3213,30 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
/*
* Find the tuple to be updated or deleted.
+ *
+ * As the table being repacked concurrently is considered a
+ * "user catalog", new CID is WAL-logged and decoded. And since we
+ * use the same XID that the original DMLs did, the snapshot used
+ * for the logical decoding (by now converted to a non-historic
+ * MVCC snapshot) should see the tuples inserted previously into
+ * the new heap and/or updated there.
*/
- tup_exist = find_target_tuple(rel, key, nkeys, tup_key,
+ snapshot = change.snapshot;
+
+ /*
+ * Set what should be considered current transaction (and
+ * subtransactions) during visibility check.
+ *
+ * Note that this snapshot was created from a historic snapshot
+ * using SnapBuildMVCCFromHistoric(), which does not touch
+ * 'subxip'. Thus, unlike in a regular MVCC snapshot, the array
+ * only contains the transactions whose data changes we are
+ * applying, and its subtransactions. That's exactly what we need
+ * to check whether a particular xact is a "current transaction".
+ */
+ SetRepackCurrentXids(snapshot->subxip, snapshot->subxcnt);
+
+ tup_exist = find_target_tuple(rel, key, nkeys, tup_key, snapshot,
iistate, ident_slot, &ind_scan);
if (tup_exist == NULL)
elog(ERROR, "Failed to find target tuple");
@@ -3220,6 +3247,8 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
else
apply_concurrent_delete(rel, tup_exist, &change);
+ ResetRepackCurrentXids();
+
if (tup_old != NULL)
{
pfree(tup_old);
@@ -3232,11 +3261,14 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
else
elog(ERROR, "Unrecognized kind of change: %d", change.kind);
- /* If there's any change, make it visible to the next iteration. */
- if (change.kind != CHANGE_UPDATE_OLD)
+ /* Free the snapshot if this is the last change that needed it. */
+ Assert(change.snapshot->active_count > 0);
+ change.snapshot->active_count--;
+ if (change.snapshot->active_count == 0)
{
- CommandCounterIncrement();
- UpdateActiveSnapshotCommandId();
+ if (change.snapshot == dstate->snapshot)
+ dstate->snapshot = NULL;
+ FreeSnapshot(change.snapshot);
}
/* TTSOpsMinimalTuple has .get_heap_tuple==NULL. */
@@ -3256,10 +3288,30 @@ static void
apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
IndexInsertState *iistate, TupleTableSlot *index_slot)
{
+ Snapshot snapshot = change->snapshot;
List *recheck;
+ /*
+ * For INSERT, the visibility information is not important, but we use the
+ * snapshot to get CID. Index functions might need the whole snapshot
+ * anyway.
+ */
+ SetRepackCurrentXids(snapshot->subxip, snapshot->subxcnt);
- heap_insert(rel, tup, GetCurrentCommandId(true), HEAP_INSERT_NO_LOGICAL, NULL);
+ /*
+ * Write the tuple into the new heap.
+ *
+ * The snapshot is the one we used to decode the insert (though converted
+ * to "non-historic" MVCC snapshot), i.e. the snapshot's curcid is the
+ * tuple CID incremented by one (due to the "new CID" WAL record that got
+ * written along with the INSERT record). Thus if we want to use the
+ * original CID, we need to subtract 1 from curcid.
+ */
+ Assert(snapshot->curcid != InvalidCommandId &&
+ snapshot->curcid > FirstCommandId);
+
+ heap_insert(rel, tup, change->xid, snapshot->curcid - 1,
+ HEAP_INSERT_NO_LOGICAL, NULL);
/*
* Update indexes.
@@ -3267,6 +3319,7 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
* In case functions in the index need the active snapshot and caller
* hasn't set one.
*/
+ PushActiveSnapshot(snapshot);
ExecStoreHeapTuple(tup, index_slot, false);
recheck = ExecInsertIndexTuples(iistate->rri,
index_slot,
@@ -3277,6 +3330,8 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
NIL, /* arbiterIndexes */
false /* onlySummarizing */
);
+ PopActiveSnapshot();
+ ResetRepackCurrentXids();
/*
* If recheck is required, it must have been performed on the source
@@ -3294,18 +3349,36 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
TupleTableSlot *index_slot)
{
List *recheck;
+ LockTupleMode lockmode;
TU_UpdateIndexes update_indexes;
+ TM_Result res;
+ Snapshot snapshot = change->snapshot;
+ TM_FailureData tmfd;
/*
* Write the new tuple into the new heap. ('tup' gets the TID assigned
* here.)
+ *
+ * Regarding CID, see the comment in apply_concurrent_insert().
*/
- simple_heap_update(rel, &tup_target->t_self, tup, &update_indexes);
+ Assert(snapshot->curcid != InvalidCommandId &&
+ snapshot->curcid > FirstCommandId);
+
+ res = heap_update(rel, &tup_target->t_self, tup,
+ change->xid, snapshot->curcid - 1,
+ InvalidSnapshot,
+ false, /* no wait - only we are doing changes */
+ &tmfd, &lockmode, &update_indexes,
+ /* wal_logical */
+ false);
+ if (res != TM_Ok)
+ ereport(ERROR, (errmsg("failed to apply concurrent UPDATE")));
ExecStoreHeapTuple(tup, index_slot, false);
if (update_indexes != TU_None)
{
+ PushActiveSnapshot(snapshot);
recheck = ExecInsertIndexTuples(iistate->rri,
index_slot,
iistate->estate,
@@ -3315,6 +3388,7 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
NIL, /* arbiterIndexes */
/* onlySummarizing */
update_indexes == TU_Summarizing);
+ PopActiveSnapshot();
list_free(recheck);
}
@@ -3325,7 +3399,22 @@ static void
apply_concurrent_delete(Relation rel, HeapTuple tup_target,
ConcurrentChange *change)
{
- simple_heap_delete(rel, &tup_target->t_self);
+ TM_Result res;
+ TM_FailureData tmfd;
+ Snapshot snapshot = change->snapshot;
+
+ /* Regarding CID, see the comment in apply_concurrent_insert(). */
+ Assert(snapshot->curcid != InvalidCommandId &&
+ snapshot->curcid > FirstCommandId);
+
+ res = heap_delete(rel, &tup_target->t_self, change->xid,
+ snapshot->curcid - 1, InvalidSnapshot, false,
+ &tmfd, false,
+ /* wal_logical */
+ false);
+
+ if (res != TM_Ok)
+ ereport(ERROR, (errmsg("failed to apply concurrent DELETE")));
pgstat_progress_incr_param(PROGRESS_REPACK_HEAP_TUPLES_DELETED, 1);
}
@@ -3343,7 +3432,7 @@ apply_concurrent_delete(Relation rel, HeapTuple tup_target,
*/
static HeapTuple
find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
- IndexInsertState *iistate,
+ Snapshot snapshot, IndexInsertState *iistate,
TupleTableSlot *ident_slot, IndexScanDesc *scan_p)
{
IndexScanDesc scan;
@@ -3352,7 +3441,7 @@ find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
HeapTuple result = NULL;
/* XXX no instrumentation for now */
- scan = index_beginscan(rel, iistate->ident_index, GetActiveSnapshot(),
+ scan = index_beginscan(rel, iistate->ident_index, snapshot,
NULL, nkeys, 0);
*scan_p = scan;
index_rescan(scan, key, nkeys, NULL, 0);
@@ -3424,6 +3513,8 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
}
PG_FINALLY();
{
+ ResetRepackCurrentXids();
+
if (rel_src)
rel_dst->rd_toastoid = InvalidOid;
}
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 00f7bbc5f59..0b1603cd577 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -469,9 +469,18 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
SnapBuild *builder = ctx->snapshot_builder;
/*
- * Check if REPACK CONCURRENTLY is being performed by this backend. If so,
- * only decode data changes of the table that it is processing, and the
- * changes of its TOAST relation.
+ * If the change is not intended for logical decoding, do not even
+ * establish a transaction for it. This is particularly important if the
+ * record was generated by REPACK CONCURRENTLY, because this command uses
+ * the original XID when doing changes in the new storage. The decoding
+ * subsystem probably does not expect to see the same transaction multiple
+ * times.
+ */
+
+ /*
+ * First, check if REPACK CONCURRENTLY is being performed by this backend.
+ * If so, only decode data changes of the table that it is processing, and
+ * the changes of its TOAST relation.
*
* (TOAST locator should not be set unless the main is.)
*/
@@ -491,6 +500,61 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
return;
}
+ /*
+ * Second, skip records which do not contain sufficient information for
+ * the decoding.
+ *
+ * The backend executing REPACK CONCURRENTLY should not return here,
+ * because the records that passed the checks above should be eligible
+ * for decoding. However, REPACK CONCURRENTLY generates WAL when
+ * writing data into the new table, and that WAL should not be decoded
+ * by the other backends. This is where those backends skip it.
+ */
+ switch (info)
+ {
+ case XLOG_HEAP_INSERT:
+ {
+ xl_heap_insert *rec;
+
+ rec = (xl_heap_insert *) XLogRecGetData(buf->record);
+
+ /*
+ * (Besides insertions into the new heap by REPACK
+ * CONCURRENTLY, this also happens when raw_heap_insert marks
+ * the TOAST record as HEAP_INSERT_NO_LOGICAL.)
+ */
+ if ((rec->flags & XLH_INSERT_CONTAINS_NEW_TUPLE) == 0)
+ return;
+
+ break;
+ }
+
+ case XLOG_HEAP_HOT_UPDATE:
+ case XLOG_HEAP_UPDATE:
+ {
+ xl_heap_update *rec;
+
+ rec = (xl_heap_update *) XLogRecGetData(buf->record);
+ if ((rec->flags &
+ (XLH_UPDATE_CONTAINS_NEW_TUPLE |
+ XLH_UPDATE_CONTAINS_OLD_TUPLE |
+ XLH_UPDATE_CONTAINS_OLD_KEY)) == 0)
+ return;
+
+ break;
+ }
+
+ case XLOG_HEAP_DELETE:
+ {
+ xl_heap_delete *rec;
+
+ rec = (xl_heap_delete *) XLogRecGetData(buf->record);
+ if (rec->flags & XLH_DELETE_NO_LOGICAL)
+ return;
+ break;
+ }
+ }
+
ReorderBufferProcessXid(ctx->reorder, xid, buf->origptr);
/*
@@ -923,13 +987,6 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
xlrec = (xl_heap_insert *) XLogRecGetData(r);
- /*
- * Ignore insert records without new tuples (this does happen when
- * raw_heap_insert marks the TOAST record as HEAP_INSERT_NO_LOGICAL).
- */
- if (!(xlrec->flags & XLH_INSERT_CONTAINS_NEW_TUPLE))
- return;
-
/* only interested in our database */
XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
if (target_locator.dbOid != ctx->slot->data.database)
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index c32e459411b..fde4955c328 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -155,7 +155,7 @@ static bool ExportInProgress = false;
static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
/* snapshot building/manipulation/distribution functions */
-static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
+static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder, XLogRecPtr lsn);
static void SnapBuildFreeSnapshot(Snapshot snap);
@@ -352,12 +352,17 @@ SnapBuildSnapDecRefcount(Snapshot snap)
* Build a new snapshot, based on currently committed catalog-modifying
* transactions.
*
+ * 'lsn' is the location of the commit record (of a catalog-changing
+ * transaction) that triggered creation of the snapshot. Pass
+ * InvalidXLogRecPtr for the transaction base snapshot, or if the user of
+ * the snapshot does not need the LSN.
+ *
* In-progress transactions with catalog access are *not* allowed to modify
* these snapshots; they have to copy them and fill in appropriate ->curcid
* and ->subxip/subxcnt values.
*/
static Snapshot
-SnapBuildBuildSnapshot(SnapBuild *builder)
+SnapBuildBuildSnapshot(SnapBuild *builder, XLogRecPtr lsn)
{
Snapshot snapshot;
Size ssize;
@@ -425,6 +430,7 @@ SnapBuildBuildSnapshot(SnapBuild *builder)
snapshot->active_count = 0;
snapshot->regd_count = 0;
snapshot->snapXactCompletionCount = 0;
+ snapshot->lsn = lsn;
return snapshot;
}
@@ -461,7 +467,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
if (TransactionIdIsValid(MyProc->xmin))
elog(ERROR, "cannot build an initial slot snapshot when MyProc->xmin already is valid");
- snap = SnapBuildBuildSnapshot(builder);
+ snap = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
/*
* We know that snap->xmin is alive, enforced by the logical xmin
@@ -502,7 +508,7 @@ SnapBuildInitialSnapshotForRepack(SnapBuild *builder)
Assert(builder->state == SNAPBUILD_CONSISTENT);
- snap = SnapBuildBuildSnapshot(builder);
+ snap = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
return SnapBuildMVCCFromHistoric(snap, false);
}
@@ -636,7 +642,7 @@ SnapBuildGetOrBuildSnapshot(SnapBuild *builder)
/* only build a new snapshot if we don't have a prebuilt one */
if (builder->snapshot == NULL)
{
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
/* increase refcount for the snapshot builder */
SnapBuildSnapIncRefcount(builder->snapshot);
}
@@ -716,7 +722,7 @@ SnapBuildProcessChange(SnapBuild *builder, TransactionId xid, XLogRecPtr lsn)
/* only build a new snapshot if we don't have a prebuilt one */
if (builder->snapshot == NULL)
{
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, lsn);
/* increase refcount for the snapshot builder */
SnapBuildSnapIncRefcount(builder->snapshot);
}
@@ -1085,7 +1091,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
if (builder->snapshot)
SnapBuildSnapDecRefcount(builder->snapshot);
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, lsn);
/* we might need to execute invalidations, add snapshot */
if (!ReorderBufferXidHasBaseSnapshot(builder->reorder, xid))
@@ -1910,7 +1916,7 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
{
SnapBuildSnapDecRefcount(builder->snapshot);
}
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
SnapBuildSnapIncRefcount(builder->snapshot);
ReorderBufferSetRestartPoint(builder->reorder, lsn);
diff --git a/src/backend/replication/pgoutput_repack/pgoutput_repack.c b/src/backend/replication/pgoutput_repack/pgoutput_repack.c
index 687fbbc59bb..28bd16f9cc7 100644
--- a/src/backend/replication/pgoutput_repack/pgoutput_repack.c
+++ b/src/backend/replication/pgoutput_repack/pgoutput_repack.c
@@ -32,7 +32,8 @@ static void plugin_truncate(struct LogicalDecodingContext *ctx,
Relation relations[],
ReorderBufferChange *change);
static void store_change(LogicalDecodingContext *ctx,
- ConcurrentChangeKind kind, HeapTuple tuple);
+ ConcurrentChangeKind kind, HeapTuple tuple,
+ TransactionId xid);
void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
@@ -100,6 +101,7 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change)
{
RepackDecodingState *dstate;
+ Snapshot snapshot;
dstate = (RepackDecodingState *) ctx->output_writer_private;
@@ -107,6 +109,48 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (relation->rd_id != dstate->relid)
return;
+ /*
+ * Catalog snapshot is fine because the table we are processing is
+ * temporarily considered a user catalog table.
+ */
+ snapshot = GetCatalogSnapshot(InvalidOid);
+ Assert(snapshot->snapshot_type == SNAPSHOT_HISTORIC_MVCC);
+ Assert(!snapshot->suboverflowed);
+
+ /*
+ * This should not happen, but if we don't have enough information to
+ * apply a new snapshot, the consequences would be bad. Thus prefer ERROR
+ * to Assert().
+ */
+ if (XLogRecPtrIsInvalid(snapshot->lsn))
+ ereport(ERROR, (errmsg("snapshot has invalid LSN")));
+
+ /*
+ * reorderbuffer.c changes the catalog snapshot as soon as it sees a new
+ * CID or a commit record of a catalog-changing transaction.
+ */
+ if (dstate->snapshot == NULL || snapshot->lsn != dstate->snapshot_lsn ||
+ snapshot->curcid != dstate->snapshot->curcid)
+ {
+ /* CID should not go backwards. */
+ Assert(dstate->snapshot == NULL ||
+ snapshot->curcid >= dstate->snapshot->curcid ||
+ change->txn->xid != dstate->last_change_xid);
+
+ /*
+ * XXX Is it a problem that the copy is created in
+ * TopTransactionContext?
+ *
+ * XXX Wouldn't it be o.k. for SnapBuildMVCCFromHistoric() to set xcnt
+ * to 0 instead of converting xip in this case? The point is that
+ * transactions which are still in progress from the perspective of
+ * reorderbuffer.c could not be replayed yet, so we do not need to
+ * examine their XIDs.
+ */
+ dstate->snapshot = SnapBuildMVCCFromHistoric(snapshot, false);
+ dstate->snapshot_lsn = snapshot->lsn;
+ }
+
/* Decode entry depending on its type */
switch (change->action)
{
@@ -124,7 +168,7 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (newtuple == NULL)
elog(ERROR, "Incomplete insert info.");
- store_change(ctx, CHANGE_INSERT, newtuple);
+ store_change(ctx, CHANGE_INSERT, newtuple, change->txn->xid);
}
break;
case REORDER_BUFFER_CHANGE_UPDATE:
@@ -141,9 +185,11 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
elog(ERROR, "Incomplete update info.");
if (oldtuple != NULL)
- store_change(ctx, CHANGE_UPDATE_OLD, oldtuple);
+ store_change(ctx, CHANGE_UPDATE_OLD, oldtuple,
+ change->txn->xid);
- store_change(ctx, CHANGE_UPDATE_NEW, newtuple);
+ store_change(ctx, CHANGE_UPDATE_NEW, newtuple,
+ change->txn->xid);
}
break;
case REORDER_BUFFER_CHANGE_DELETE:
@@ -156,7 +202,7 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (oldtuple == NULL)
elog(ERROR, "Incomplete delete info.");
- store_change(ctx, CHANGE_DELETE, oldtuple);
+ store_change(ctx, CHANGE_DELETE, oldtuple, change->txn->xid);
}
break;
default:
@@ -190,13 +236,13 @@ plugin_truncate(struct LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (i == nrelations)
return;
- store_change(ctx, CHANGE_TRUNCATE, NULL);
+ store_change(ctx, CHANGE_TRUNCATE, NULL, InvalidTransactionId);
}
/* Store concurrent data change. */
static void
store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
- HeapTuple tuple)
+ HeapTuple tuple, TransactionId xid)
{
RepackDecodingState *dstate;
char *change_raw;
@@ -266,6 +312,11 @@ store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
dst = dst_start + SizeOfConcurrentChange;
memcpy(dst, tuple->t_data, tuple->t_len);
+ /* Initialize the other fields. */
+ change.xid = xid;
+ change.snapshot = dstate->snapshot;
+ dstate->snapshot->active_count++;
+
/* The data has been copied. */
if (flattened)
pfree(tuple);
@@ -279,6 +330,9 @@ store:
isnull[0] = false;
tuplestore_putvalues(dstate->tstore, dstate->tupdesc_change,
values, isnull);
+#ifdef USE_ASSERT_CHECKING
+ dstate->last_change_xid = xid;
+#endif
/* Accounting. */
dstate->nchanges++;
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index bdeb2f83540..b0c6f1d916f 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -325,21 +325,24 @@ extern BulkInsertState GetBulkInsertState(void);
extern void FreeBulkInsertState(BulkInsertState);
extern void ReleaseBulkInsertStatePin(BulkInsertState bistate);
-extern void heap_insert(Relation relation, HeapTuple tup, CommandId cid,
- int options, BulkInsertState bistate);
+extern void heap_insert(Relation relation, HeapTuple tup, TransactionId xid,
+ CommandId cid, int options, BulkInsertState bistate);
extern void heap_multi_insert(Relation relation, struct TupleTableSlot **slots,
int ntuples, CommandId cid, int options,
BulkInsertState bistate);
extern TM_Result heap_delete(Relation relation, ItemPointer tid,
- CommandId cid, Snapshot crosscheck, bool wait,
- struct TM_FailureData *tmfd, bool changingPart);
+ TransactionId xid, CommandId cid,
+ Snapshot crosscheck, bool wait,
+ struct TM_FailureData *tmfd, bool changingPart,
+ bool wal_logical);
extern void heap_finish_speculative(Relation relation, ItemPointer tid);
extern void heap_abort_speculative(Relation relation, ItemPointer tid);
extern TM_Result heap_update(Relation relation, ItemPointer otid,
- HeapTuple newtup,
+ HeapTuple newtup, TransactionId xid,
CommandId cid, Snapshot crosscheck, bool wait,
struct TM_FailureData *tmfd, LockTupleMode *lockmode,
- TU_UpdateIndexes *update_indexes);
+ TU_UpdateIndexes *update_indexes,
+ bool wal_logical);
extern TM_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
CommandId cid, LockTupleMode mode, LockWaitPolicy wait_policy,
bool follow_updates,
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 277df6b3cf0..8d4af07f840 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -104,6 +104,8 @@
#define XLH_DELETE_CONTAINS_OLD_KEY (1<<2)
#define XLH_DELETE_IS_SUPER (1<<3)
#define XLH_DELETE_IS_PARTITION_MOVE (1<<4)
+/* See heap_delete() */
+#define XLH_DELETE_NO_LOGICAL (1<<5)
/* convenience macro for checking whether any form of old tuple was logged */
#define XLH_DELETE_CONTAINS_OLD \
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index b2bc10ee041..fbb66d559b6 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -482,6 +482,8 @@ extern Size EstimateTransactionStateSpace(void);
extern void SerializeTransactionState(Size maxsize, char *start_address);
extern void StartParallelWorkerTransaction(char *tstatespace);
extern void EndParallelWorkerTransaction(void);
+extern void SetRepackCurrentXids(TransactionId *xip, int xcnt);
+extern void ResetRepackCurrentXids(void);
extern bool IsTransactionBlock(void);
extern bool IsTransactionOrTransactionBlock(void);
extern char TransactionBlockStatusCode(void);
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index a2589d60d6e..cad10a02bd0 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -73,6 +73,14 @@ typedef struct ConcurrentChange
/* See the enum above. */
ConcurrentChangeKind kind;
+ /* Transaction that changes the data. */
+ TransactionId xid;
+
+ /*
+ * Historic catalog snapshot that was used to decode this change.
+ */
+ Snapshot snapshot;
+
/*
* The actual tuple.
*
@@ -104,6 +112,8 @@ typedef struct RepackDecodingState
* tuplestore does this transparently.
*/
Tuplestorestate *tstore;
+ /* XID of the last change added to tstore. */
+ TransactionId last_change_xid PG_USED_FOR_ASSERTS_ONLY;
/* The current number of changes in tstore. */
double nchanges;
@@ -124,6 +134,14 @@ typedef struct RepackDecodingState
/* Slot to retrieve data from tstore. */
TupleTableSlot *tsslot;
+ /*
+ * Historic catalog snapshot that was used to decode the most recent
+ * change.
+ */
+ Snapshot snapshot;
+ /* LSN of the record */
+ XLogRecPtr snapshot_lsn;
+
ResourceOwner resowner;
} RepackDecodingState;
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 0e546ec1497..014f27db7d7 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -13,6 +13,7 @@
#ifndef SNAPSHOT_H
#define SNAPSHOT_H
+#include "access/xlogdefs.h"
#include "lib/pairingheap.h"
@@ -201,6 +202,8 @@ typedef struct SnapshotData
uint32 regd_count; /* refcount on RegisteredSnapshots */
pairingheap_node ph_node; /* link in the RegisteredSnapshots heap */
+ XLogRecPtr lsn; /* position in the WAL stream when taken */
+
/*
* The transaction completion count at the time GetSnapshotData() built
* this snapshot. Allows to avoid re-computing static snapshots when no
--
2.39.5
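The snapshot caching that plugin_change() performs in this patch can be modeled in a few lines. This is an illustrative Python sketch, not the patch's C API: the dict-based `dstate`, the field names, and the stand-in for SnapBuildMVCCFromHistoric() are all assumptions made for clarity.

```python
def maybe_refresh_snapshot(dstate, snapshot):
    # Rebuild the cached MVCC snapshot only when the decoded change was
    # produced under a different historic catalog snapshot, identified
    # by the (lsn, curcid) pair, than the one currently cached.
    if (dstate.get("snapshot") is None
            or snapshot["lsn"] != dstate.get("snapshot_lsn")
            or snapshot["curcid"] != dstate["snapshot"]["curcid"]):
        # Stands in for SnapBuildMVCCFromHistoric() in the patch.
        dstate["snapshot"] = dict(snapshot)
        dstate["snapshot_lsn"] = snapshot["lsn"]
        return True
    return False
```

The point of keying on (lsn, curcid) is that reorderbuffer.c swaps the catalog snapshot whenever it sees a new CID or the commit of a catalog-changing transaction, so a change decoded under an unchanged pair can reuse the cached snapshot.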
Attachment: v10-0006-Add-regression-tests.patch (text/x-diff; charset=utf-8)
From 1ecb7f4220a89672cd944f186e9010f9c2b1aca4 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=C3=81lvaro=20Herrera?= <alvherre@alvh.no-ip.org>
Date: Mon, 24 Mar 2025 20:14:58 +0100
Subject: [PATCH v10 6/9] Add regression tests.
As this patch series adds the CONCURRENTLY option to the REPACK command, it's
appropriate to test that the "concurrent data changes" (i.e. changes done by
the application while we are copying the table contents to the new storage)
are processed correctly.
Injection points are used to stop the data copying at some point. While the
backend in charge of the copying is waiting on the injection point, another
backend runs some INSERT, UPDATE and DELETE commands on the table. Then we
wake up the first backend and let the REPACK CONCURRENTLY command
finish. Finally we check that all the "concurrent data changes" are present in
the table and that they contain the correct visibility information.
---
src/backend/commands/cluster.c | 7 +
src/test/modules/injection_points/Makefile | 3 +-
.../injection_points/expected/repack.out | 113 ++++++++++++++
.../modules/injection_points/logical.conf | 1 +
src/test/modules/injection_points/meson.build | 4 +
.../injection_points/specs/repack.spec | 140 ++++++++++++++++++
6 files changed, 267 insertions(+), 1 deletion(-)
create mode 100644 src/test/modules/injection_points/expected/repack.out
create mode 100644 src/test/modules/injection_points/logical.conf
create mode 100644 src/test/modules/injection_points/specs/repack.spec
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 4c15e3e3133..fa455e99d4a 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -59,6 +59,7 @@
#include "utils/formatting.h"
#include "utils/fmgroids.h"
#include "utils/guc.h"
+#include "utils/injection_point.h"
#include "utils/inval.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
@@ -3741,6 +3742,12 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
*/
ident_key = build_identity_key(ident_idx_new, OldHeap, &ident_key_nentries);
+ /*
+ * During testing, wait for another backend to perform concurrent data
+ * changes which we will process below.
+ */
+ INJECTION_POINT("repack-concurrently-before-lock");
+
/*
* Flush all WAL records inserted so far (possibly except for the last
* incomplete page, see GetInsertRecPtr), to minimize the amount of data
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index e680991f8d4..405d0811b4f 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -14,7 +14,8 @@ PGFILEDESC = "injection_points - facility for injection points"
REGRESS = injection_points hashagg reindex_conc
REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
-ISOLATION = basic inplace syscache-update-pruned
+ISOLATION = basic inplace syscache-update-pruned repack
+ISOLATION_OPTS = --temp-config $(top_srcdir)/src/test/modules/injection_points/logical.conf
TAP_TESTS = 1
diff --git a/src/test/modules/injection_points/expected/repack.out b/src/test/modules/injection_points/expected/repack.out
new file mode 100644
index 00000000000..49a736ed617
--- /dev/null
+++ b/src/test/modules/injection_points/expected/repack.out
@@ -0,0 +1,113 @@
+Parsed test spec with 2 sessions
+
+starting permutation: wait_before_lock change_existing change_new change_subxact1 change_subxact2 check2 wakeup_before_lock check1
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step wait_before_lock:
+ REPACK CONCURRENTLY repack_test USING INDEX repack_test_pkey;
+ <waiting ...>
+step change_existing:
+ UPDATE repack_test SET i=10 where i=1;
+ UPDATE repack_test SET j=20 where i=2;
+ UPDATE repack_test SET i=30 where i=3;
+ UPDATE repack_test SET i=40 where i=30;
+ DELETE FROM repack_test WHERE i=4;
+
+step change_new:
+ INSERT INTO repack_test(i, j) VALUES (5, 5), (6, 6), (7, 7), (8, 8);
+ UPDATE repack_test SET i=50 where i=5;
+ UPDATE repack_test SET j=60 where i=6;
+ DELETE FROM repack_test WHERE i=7;
+
+step change_subxact1:
+ BEGIN;
+ INSERT INTO repack_test(i, j) VALUES (100, 100);
+ SAVEPOINT s1;
+ UPDATE repack_test SET i=101 where i=100;
+ SAVEPOINT s2;
+ UPDATE repack_test SET i=102 where i=101;
+ COMMIT;
+
+step change_subxact2:
+ BEGIN;
+ SAVEPOINT s1;
+ INSERT INTO repack_test(i, j) VALUES (110, 110);
+ ROLLBACK TO SAVEPOINT s1;
+ INSERT INTO repack_test(i, j) VALUES (110, 111);
+ COMMIT;
+
+step check2:
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='repack_test';
+
+ SELECT i, j FROM repack_test ORDER BY i, j;
+
+ INSERT INTO data_s2(_xmin, _cmin, i, j)
+ SELECT xmin, cmin, i, j FROM repack_test;
+
+ i| j
+---+---
+ 2| 20
+ 6| 60
+ 8| 8
+ 10| 1
+ 40| 3
+ 50| 5
+102|100
+110|111
+(8 rows)
+
+step wakeup_before_lock:
+ SELECT injection_points_wakeup('repack-concurrently-before-lock');
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step wait_before_lock: <... completed>
+step check1:
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='repack_test';
+
+ SELECT count(DISTINCT node) FROM relfilenodes;
+
+ SELECT i, j FROM repack_test ORDER BY i, j;
+
+ INSERT INTO data_s1(_xmin, _cmin, i, j)
+ SELECT xmin, cmin, i, j FROM repack_test;
+
+ SELECT count(*)
+ FROM data_s1 d1 FULL JOIN data_s2 d2 USING (_xmin, _cmin, i, j)
+ WHERE d1.i ISNULL OR d2.i ISNULL;
+
+count
+-----
+ 2
+(1 row)
+
+ i| j
+---+---
+ 2| 20
+ 6| 60
+ 8| 8
+ 10| 1
+ 40| 3
+ 50| 5
+102|100
+110|111
+(8 rows)
+
+count
+-----
+ 0
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
diff --git a/src/test/modules/injection_points/logical.conf b/src/test/modules/injection_points/logical.conf
new file mode 100644
index 00000000000..c8f264bc6cb
--- /dev/null
+++ b/src/test/modules/injection_points/logical.conf
@@ -0,0 +1 @@
+wal_level = logical
\ No newline at end of file
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index d61149712fd..0e3c47ba999 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -46,9 +46,13 @@ tests += {
'specs': [
'basic',
'inplace',
+ 'repack',
'syscache-update-pruned',
],
'runningcheck': false, # see syscache-update-pruned
+ # 'repack' requires wal_level = 'logical'.
+ 'regress_args': ['--temp-config', files('logical.conf')],
+
},
'tap': {
'env': {
diff --git a/src/test/modules/injection_points/specs/repack.spec b/src/test/modules/injection_points/specs/repack.spec
new file mode 100644
index 00000000000..5aa8983f98d
--- /dev/null
+++ b/src/test/modules/injection_points/specs/repack.spec
@@ -0,0 +1,140 @@
+# Prefix the system columns with underscore as they are not allowed as column
+# names.
+setup
+{
+ CREATE EXTENSION injection_points;
+
+ CREATE TABLE repack_test(i int PRIMARY KEY, j int);
+ INSERT INTO repack_test(i, j) VALUES (1, 1), (2, 2), (3, 3), (4, 4);
+
+ CREATE TABLE relfilenodes(node oid);
+
+ CREATE TABLE data_s1(_xmin xid, _cmin cid, i int, j int);
+ CREATE TABLE data_s2(_xmin xid, _cmin cid, i int, j int);
+}
+
+teardown
+{
+ DROP TABLE repack_test;
+ DROP EXTENSION injection_points;
+
+ DROP TABLE relfilenodes;
+ DROP TABLE data_s1;
+ DROP TABLE data_s2;
+}
+
+session s1
+setup
+{
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('repack-concurrently-before-lock', 'wait');
+}
+# Perform the initial load and wait for s2 to do some data changes.
+step wait_before_lock
+{
+ REPACK CONCURRENTLY repack_test USING INDEX repack_test_pkey;
+}
+# Check the table from the perspective of s1.
+#
+# Besides the contents, we also check that relfilenode has changed.
+#
+# xmin and cmin columns are used to check that we do not change tuple
+# visibility information. Since we do not expect xmin to stay unchanged across
+# test runs, it cannot appear in the output text. Instead, have each session
+# write the contents into a table and use FULL JOIN to check if the outputs
+# are identical.
+step check1
+{
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='repack_test';
+
+ SELECT count(DISTINCT node) FROM relfilenodes;
+
+ SELECT i, j FROM repack_test ORDER BY i, j;
+
+ INSERT INTO data_s1(_xmin, _cmin, i, j)
+ SELECT xmin, cmin, i, j FROM repack_test;
+
+ SELECT count(*)
+ FROM data_s1 d1 FULL JOIN data_s2 d2 USING (_xmin, _cmin, i, j)
+ WHERE d1.i ISNULL OR d2.i ISNULL;
+}
+teardown
+{
+ SELECT injection_points_detach('repack-concurrently-before-lock');
+}
+
+session s2
+# Change the existing data. UPDATE changes both key and non-key columns. Also
+# update one row twice to test whether tuple version generated by this session
+# can be found.
+step change_existing
+{
+ UPDATE repack_test SET i=10 where i=1;
+ UPDATE repack_test SET j=20 where i=2;
+ UPDATE repack_test SET i=30 where i=3;
+ UPDATE repack_test SET i=40 where i=30;
+ DELETE FROM repack_test WHERE i=4;
+}
+# Insert new rows and UPDATE / DELETE some of them. Again, update both key and
+# non-key column.
+step change_new
+{
+ INSERT INTO repack_test(i, j) VALUES (5, 5), (6, 6), (7, 7), (8, 8);
+ UPDATE repack_test SET i=50 where i=5;
+ UPDATE repack_test SET j=60 where i=6;
+ DELETE FROM repack_test WHERE i=7;
+}
+
+# When applying concurrent data changes, we should see the effects of an
+# in-progress subtransaction.
+step change_subxact1
+{
+ BEGIN;
+ INSERT INTO repack_test(i, j) VALUES (100, 100);
+ SAVEPOINT s1;
+ UPDATE repack_test SET i=101 where i=100;
+ SAVEPOINT s2;
+ UPDATE repack_test SET i=102 where i=101;
+ COMMIT;
+}
+
+# When applying concurrent data changes, we should not see the effects of a
+# rolled back subtransaction.
+step change_subxact2
+{
+ BEGIN;
+ SAVEPOINT s1;
+ INSERT INTO repack_test(i, j) VALUES (110, 110);
+ ROLLBACK TO SAVEPOINT s1;
+ INSERT INTO repack_test(i, j) VALUES (110, 111);
+ COMMIT;
+}
+
+# Check the table from the perspective of s2.
+step check2
+{
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='repack_test';
+
+ SELECT i, j FROM repack_test ORDER BY i, j;
+
+ INSERT INTO data_s2(_xmin, _cmin, i, j)
+ SELECT xmin, cmin, i, j FROM repack_test;
+}
+step wakeup_before_lock
+{
+ SELECT injection_points_wakeup('repack-concurrently-before-lock');
+}
+
+# Test if data changes introduced while one session is performing REPACK
+# CONCURRENTLY find their way into the table.
+permutation
+ wait_before_lock
+ change_existing
+ change_new
+ change_subxact1
+ change_subxact2
+ check2
+ wakeup_before_lock
+ check1
--
2.39.5
Attachment: v10-0007-Introduce-repack_max_xlock_time-configuration-va.patch (text/x-diff; charset=utf-8)
From 462720a8860aa6c27676b42586dad4e79d84a987 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=C3=81lvaro=20Herrera?= <alvherre@alvh.no-ip.org>
Date: Tue, 25 Mar 2025 13:42:59 +0100
Subject: [PATCH v10 7/9] Introduce repack_max_xlock_time configuration
variable.
When executing REPACK CONCURRENTLY, we need the AccessExclusiveLock to swap
the relation files, and that should only take a short time. However, on a
busy system, other backends might change a non-negligible amount of data in
the table while we are waiting for the lock. Since these changes must be
applied to the new storage before the swap, the time we eventually hold the
lock might become non-negligible too.
Users worried about this situation can set repack_max_xlock_time to the
maximum time for which the exclusive lock may be held. If that amount of
time is not sufficient to complete the REPACK CONCURRENTLY command, an
ERROR is raised and the command is canceled.
---
doc/src/sgml/config.sgml | 31 ++++
doc/src/sgml/ref/repack.sgml | 9 +-
src/backend/access/heap/heapam_handler.c | 3 +-
src/backend/commands/cluster.c | 135 +++++++++++++++---
src/backend/utils/misc/guc_tables.c | 15 +-
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/commands/cluster.h | 5 +-
.../injection_points/expected/repack.out | 74 +++++++++-
.../injection_points/specs/repack.spec | 42 ++++++
9 files changed, 294 insertions(+), 21 deletions(-)
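Assuming this patch is applied, usage might look like the sketch below. This is illustrative only: `my_table` and its index are made-up names, and the unit handling is an assumption (the GUC is documented as an integer, presumably interpreted in milliseconds).

```sql
-- Allow REPACK CONCURRENTLY to hold the AccessExclusiveLock for at most
-- 100 units (presumably ms) while applying the remaining concurrent
-- changes; 0 (the default) means no limit.
SET repack_max_xlock_time = 100;
REPACK CONCURRENTLY my_table USING INDEX my_table_pkey;
```

If the limit is exceeded during the final lock-holding phase, the command raises an ERROR and rolls back rather than blocking other sessions further.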
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 69fc93dffc4..4ab14939387 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -11205,6 +11205,37 @@ dynamic_library_path = '/usr/local/lib/postgresql:$libdir'
</listitem>
</varlistentry>
+ <varlistentry id="guc-repack-max-xclock-time" xreflabel="repack_max_xlock_time">
+ <term><varname>repack_max_xlock_time</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>repack_max_xlock_time</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ This is the maximum amount of time to hold an exclusive lock on a
+ table by <command>REPACK</command> with
+ the <literal>CONCURRENTLY</literal> option. Typically, these commands
+ should not need the lock for longer time
+ than <command>TRUNCATE</command> does. However, additional time might
+ be needed if the system is too busy. (See <xref linkend="sql-repack"/>
+ for explanation how the <literal>CONCURRENTLY</literal> option works.)
+ </para>
+
+ <para>
+ If you want to restrict the lock time, set this variable to the
+ highest acceptable value. If it appears during the processing that
+ additional time is needed to release the lock, the command will be
+ cancelled.
+ </para>
+
+ <para>
+ The default value is 0, which means that the lock is not released
+ until the concurrent data changes are processed.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect1>
diff --git a/doc/src/sgml/ref/repack.sgml b/doc/src/sgml/ref/repack.sgml
index 9ee640e3517..0c250689d13 100644
--- a/doc/src/sgml/ref/repack.sgml
+++ b/doc/src/sgml/ref/repack.sgml
@@ -188,7 +188,14 @@ REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] CONCU
(<xref linkend="logicaldecoding"/>) and applied before
the <literal>ACCESS EXCLUSIVE</literal> lock is requested. Thus the lock
is typically held only for the time needed to swap the files, which
- should be pretty short.
- should be pretty short.
+ should be pretty short. However, the time might still be noticeable if
+ many data changes were made to the table while
+ <command>REPACK</command> was waiting for the lock: those changes must
+ be processed just before the files are swapped, while the
+ <literal>ACCESS EXCLUSIVE</literal> lock is being held. If you are
+ worried about this situation, set
+ the <link linkend="guc-repack-max-xclock-time"><varname>repack_max_xlock_time</varname></link>
+ configuration parameter to a value that your applications can tolerate.
</para>
<para>
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index ea1d6f299b3..850708c7830 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1008,7 +1008,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
end_of_wal = GetFlushRecPtr(NULL);
if ((end_of_wal - end_of_wal_prev) > wal_segment_size)
{
- repack_decode_concurrent_changes(decoding_ctx, end_of_wal);
+ repack_decode_concurrent_changes(decoding_ctx, end_of_wal,
+ NULL);
end_of_wal_prev = end_of_wal;
}
}
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index fa455e99d4a..c272ed03cb9 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -17,6 +17,8 @@
*/
#include "postgres.h"
+#include <sys/time.h>
+
#include "access/amapi.h"
#include "access/heapam.h"
#include "access/multixact.h"
@@ -109,6 +111,15 @@ RelFileLocator repacked_rel_toast_locator = {.relNumber = InvalidOid};
#define REPACK_CONCURRENT_IN_PROGRESS_MSG \
"relation \"%s\" is already being processed by REPACK CONCURRENTLY"
+/*
+ * The maximum time to hold AccessExclusiveLock during the final
+ * processing. Note that only the execution time of
+ * process_concurrent_changes() is included here. The very last steps like
+ * swap_relation_files() shouldn't get blocked and it'd be wrong to consider
+ * them a reason to abort otherwise completed processing.
+ */
+int repack_max_xlock_time = 0;
+
/*
* Everything we need to call ExecInsertIndexTuples().
*/
@@ -198,7 +209,8 @@ static LogicalDecodingContext *setup_logical_decoding(Oid relid,
static HeapTuple get_changed_tuple(char *change);
static void apply_concurrent_changes(RepackDecodingState *dstate,
Relation rel, ScanKey key, int nkeys,
- IndexInsertState *iistate);
+ IndexInsertState *iistate,
+ struct timeval *must_complete);
static void apply_concurrent_insert(Relation rel, ConcurrentChange *change,
HeapTuple tup, IndexInsertState *iistate,
TupleTableSlot *index_slot);
@@ -215,13 +227,15 @@ static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
IndexInsertState *iistate,
TupleTableSlot *ident_slot,
IndexScanDesc *scan_p);
-static void process_concurrent_changes(LogicalDecodingContext *ctx,
+static bool process_concurrent_changes(LogicalDecodingContext *ctx,
XLogRecPtr end_of_wal,
Relation rel_dst,
Relation rel_src,
ScanKey ident_key,
int ident_key_nentries,
- IndexInsertState *iistate);
+ IndexInsertState *iistate,
+ struct timeval *must_complete);
+static bool processing_time_elapsed(struct timeval *must_complete);
static IndexInsertState *get_index_insert_state(Relation relation,
Oid ident_index_id);
static ScanKey build_identity_key(Oid ident_idx_oid, Relation rel_src,
@@ -3039,7 +3053,8 @@ get_changed_tuple(char *change)
*/
void
repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
- XLogRecPtr end_of_wal)
+ XLogRecPtr end_of_wal,
+ struct timeval *must_complete)
{
RepackDecodingState *dstate;
ResourceOwner resowner_old;
@@ -3077,6 +3092,9 @@ repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
if (record != NULL)
LogicalDecodingProcessRecord(ctx, ctx->reader);
+ if (processing_time_elapsed(must_complete))
+ break;
+
/*
* If WAL segment boundary has been crossed, inform the decoding
* system that the catalog_xmin can advance. (We can confirm more
@@ -3119,7 +3137,8 @@ repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
*/
static void
apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
- ScanKey key, int nkeys, IndexInsertState *iistate)
+ ScanKey key, int nkeys, IndexInsertState *iistate,
+ struct timeval *must_complete)
{
TupleTableSlot *index_slot,
*ident_slot;
@@ -3151,6 +3170,9 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
CHECK_FOR_INTERRUPTS();
+ Assert(dstate->nchanges > 0);
+ dstate->nchanges--;
+
/* Get the change from the single-column tuple. */
tup_change = ExecFetchSlotHeapTuple(dstate->tsslot, false, &shouldFree);
heap_deform_tuple(tup_change, dstate->tupdesc_change, values, isnull);
@@ -3275,10 +3297,22 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
/* TTSOpsMinimalTuple has .get_heap_tuple==NULL. */
Assert(shouldFree);
pfree(tup_change);
+
+ /*
+ * If there is a limit on the time of completion, check it now.
+	 * However, make sure the loop does not break if tup_old was set in
+	 * the previous iteration; in that case we would not be able to resume
+	 * the processing in the next call.
+ */
+ if (must_complete && tup_old == NULL &&
+ processing_time_elapsed(must_complete))
+ /* The next call will process the remaining changes. */
+ break;
}
- tuplestore_clear(dstate->tstore);
- dstate->nchanges = 0;
+	/* If we could not apply all the changes, the next call will do so. */
+ if (dstate->nchanges == 0)
+ tuplestore_clear(dstate->tstore);
/* Cleanup. */
ExecDropSingleTupleTableSlot(index_slot);
@@ -3482,11 +3516,15 @@ find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
* Decode and apply concurrent changes.
*
* Pass rel_src iff its reltoastrelid is needed.
+ *
+ * Returns true if must_complete is NULL or if we managed to complete by the
+ * time *must_complete indicates.
*/
-static void
+static bool
process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
Relation rel_dst, Relation rel_src, ScanKey ident_key,
- int ident_key_nentries, IndexInsertState *iistate)
+ int ident_key_nentries, IndexInsertState *iistate,
+ struct timeval *must_complete)
{
RepackDecodingState *dstate;
@@ -3495,10 +3533,19 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
dstate = (RepackDecodingState *) ctx->output_writer_private;
- repack_decode_concurrent_changes(ctx, end_of_wal);
+ repack_decode_concurrent_changes(ctx, end_of_wal, must_complete);
+ if (processing_time_elapsed(must_complete))
+ /* Caller is responsible for applying the changes. */
+ return false;
+
+ /*
+ * *must_complete not reached, so there are really no changes. (It's
+ * possible to see no changes just because not enough time was left for
+ * the decoding.)
+ */
if (dstate->nchanges == 0)
- return;
+ return true;
PG_TRY();
{
@@ -3510,7 +3557,7 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
rel_dst->rd_toastoid = rel_src->rd_rel->reltoastrelid;
apply_concurrent_changes(dstate, rel_dst, ident_key,
- ident_key_nentries, iistate);
+ ident_key_nentries, iistate, must_complete);
}
PG_FINALLY();
{
@@ -3520,6 +3567,28 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
rel_dst->rd_toastoid = InvalidOid;
}
PG_END_TRY();
+
+ /*
+ * apply_concurrent_changes() does check the processing time, so if some
+ * changes are left, we ran out of time.
+ */
+ return dstate->nchanges == 0;
+}
+
+/*
+ * Check if the current time is beyond *must_complete.
+ */
+static bool
+processing_time_elapsed(struct timeval *must_complete)
+{
+ struct timeval now;
+
+ if (must_complete == NULL)
+ return false;
+
+ gettimeofday(&now, NULL);
+
+ return timercmp(&now, must_complete, >);
}
static IndexInsertState *
@@ -3684,6 +3753,8 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
int nrel;
Relation *ind_refs_all,
*ind_refs_p;
+ struct timeval t_end;
+ struct timeval *t_end_ptr = NULL;
/* Like in cluster_rel(). */
lockmode_old = ShareUpdateExclusiveLock;
@@ -3764,7 +3835,8 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
*/
process_concurrent_changes(ctx, end_of_wal, NewHeap,
swap_toast_by_content ? OldHeap : NULL,
- ident_key, ident_key_nentries, iistate);
+ ident_key, ident_key_nentries, iistate,
+ NULL);
/*
* Release the locks that allowed concurrent data changes, in order to
@@ -3889,9 +3961,40 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
end_of_wal = GetFlushRecPtr(NULL);
/* Apply the concurrent changes again. */
- process_concurrent_changes(ctx, end_of_wal, NewHeap,
- swap_toast_by_content ? OldHeap : NULL,
- ident_key, ident_key_nentries, iistate);
+
+ /*
+ * This time we have the exclusive lock on the table, so make sure that
+ * repack_max_xlock_time is not exceeded.
+ */
+ if (repack_max_xlock_time > 0)
+ {
+ int64 usec;
+ struct timeval t_start;
+
+ gettimeofday(&t_start, NULL);
+ /* Add the whole seconds. */
+ t_end.tv_sec = t_start.tv_sec + repack_max_xlock_time / 1000;
+ /* Add the rest, expressed in microseconds. */
+ usec = t_start.tv_usec + 1000 * (repack_max_xlock_time % 1000);
+		/* The number of microseconds could have overflowed. */
+ t_end.tv_sec += usec / USECS_PER_SEC;
+ t_end.tv_usec = usec % USECS_PER_SEC;
+ t_end_ptr = &t_end;
+ }
+
+ /*
+ * During testing, stop here to simulate excessive processing time.
+ */
+ INJECTION_POINT("repack-concurrently-after-lock");
+
+ if (!process_concurrent_changes(ctx, end_of_wal, NewHeap,
+ swap_toast_by_content ? OldHeap : NULL,
+ ident_key, ident_key_nentries, iistate,
+ t_end_ptr))
+ ereport(ERROR,
+ (errmsg("could not process concurrent data changes in time"),
+ errhint("Please consider adjusting \"repack_max_xlock_time\".")));
+
/* Remember info about rel before closing OldHeap */
relpersistence = OldHeap->rd_rel->relpersistence;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 989825d3a9c..4e695246fc5 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -39,8 +39,9 @@
#include "catalog/namespace.h"
#include "catalog/storage.h"
#include "commands/async.h"
-#include "commands/extension.h"
+#include "commands/cluster.h"
#include "commands/event_trigger.h"
+#include "commands/extension.h"
#include "commands/tablespace.h"
#include "commands/trigger.h"
#include "commands/user.h"
@@ -2837,6 +2838,18 @@ struct config_int ConfigureNamesInt[] =
1600000000, 0, 2100000000,
NULL, NULL, NULL
},
+ {
+ {"repack_max_xlock_time", PGC_USERSET, LOCK_MANAGEMENT,
+ gettext_noop("Maximum time for REPACK CONCURRENTLY to keep table locked."),
+ gettext_noop("The table is locked in exclusive mode during the final stage of processing. "
+ "If the lock time exceeds this value, error is raised and the lock is "
+ "released. Set to zero if you don't care how long the lock can be held."),
+ GUC_UNIT_MS
+ },
+ &repack_max_xlock_time,
+ 0, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
/*
* See also CheckRequiredParameterValues() if this parameter changes
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 0b9e3066bde..bb256414142 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -763,6 +763,7 @@ autovacuum_worker_slots = 16 # autovacuum worker slots to allocate
#lock_timeout = 0 # in milliseconds, 0 is disabled
#idle_in_transaction_session_timeout = 0 # in milliseconds, 0 is disabled
#idle_session_timeout = 0 # in milliseconds, 0 is disabled
+#repack_max_xlock_time = 0		# in milliseconds, 0 is disabled
#bytea_output = 'hex' # hex, escape
#xmlbinary = 'base64'
#xmloption = 'content'
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index cad10a02bd0..268c3098512 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -59,6 +59,8 @@ typedef enum ClusterCommand
extern RelFileLocator repacked_rel_locator;
extern RelFileLocator repacked_rel_toast_locator;
+extern PGDLLIMPORT int repack_max_xlock_time;
+
typedef enum
{
CHANGE_INSERT,
@@ -154,7 +156,8 @@ extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid,
extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
extern void can_repack_concurrently(Relation rel);
extern void repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
- XLogRecPtr end_of_wal);
+ XLogRecPtr end_of_wal,
+ struct timeval *must_complete);
extern Oid make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
char relpersistence, LOCKMODE lockmode);
extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
diff --git a/src/test/modules/injection_points/expected/repack.out b/src/test/modules/injection_points/expected/repack.out
index 49a736ed617..f2728d94222 100644
--- a/src/test/modules/injection_points/expected/repack.out
+++ b/src/test/modules/injection_points/expected/repack.out
@@ -1,4 +1,4 @@
-Parsed test spec with 2 sessions
+Parsed test spec with 4 sessions
starting permutation: wait_before_lock change_existing change_new change_subxact1 change_subxact2 check2 wakeup_before_lock check1
injection_points_attach
@@ -111,3 +111,75 @@ injection_points_detach
(1 row)
+injection_points_detach
+-----------------------
+
+(1 row)
+
+
+starting permutation: wait_after_lock wakeup_after_lock
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step wait_after_lock:
+ REPACK CONCURRENTLY repack_test USING INDEX repack_test_pkey;
+ <waiting ...>
+step wakeup_after_lock:
+ SELECT injection_points_wakeup('repack-concurrently-after-lock');
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step wait_after_lock: <... completed>
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+
+starting permutation: wait_after_lock after_lock_delay wakeup_after_lock
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step wait_after_lock:
+ REPACK CONCURRENTLY repack_test USING INDEX repack_test_pkey;
+ <waiting ...>
+step after_lock_delay:
+ SELECT pg_sleep(1.5);
+
+pg_sleep
+--------
+
+(1 row)
+
+step wakeup_after_lock:
+ SELECT injection_points_wakeup('repack-concurrently-after-lock');
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step wait_after_lock: <... completed>
+ERROR: could not process concurrent data changes in time
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
diff --git a/src/test/modules/injection_points/specs/repack.spec b/src/test/modules/injection_points/specs/repack.spec
index 5aa8983f98d..0f45f9d2544 100644
--- a/src/test/modules/injection_points/specs/repack.spec
+++ b/src/test/modules/injection_points/specs/repack.spec
@@ -127,6 +127,34 @@ step wakeup_before_lock
SELECT injection_points_wakeup('repack-concurrently-before-lock');
}
+session s3
+setup
+{
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('repack-concurrently-after-lock', 'wait');
+ SET repack_max_xlock_time TO '1s';
+}
+# Perform the initial load, lock the table in exclusive mode and wait. s4 will
+# cancel the waiting.
+step wait_after_lock
+{
+ REPACK CONCURRENTLY repack_test USING INDEX repack_test_pkey;
+}
+teardown
+{
+ SELECT injection_points_detach('repack-concurrently-after-lock');
+}
+
+session s4
+step wakeup_after_lock
+{
+ SELECT injection_points_wakeup('repack-concurrently-after-lock');
+}
+step after_lock_delay
+{
+ SELECT pg_sleep(1.5);
+}
+
# Test if data changes introduced while one session is performing REPACK
# CONCURRENTLY find their way into the table.
permutation
@@ -138,3 +166,17 @@ permutation
check2
wakeup_before_lock
check1
+
+# Test the repack_max_xlock_time configuration variable.
+#
+# First, cancel waiting on the injection point immediately. That way, REPACK
+# should complete.
+permutation
+ wait_after_lock
+ wakeup_after_lock
+# Second, cancel the waiting with a delay that violates
+# repack_max_xlock_time.
+permutation
+ wait_after_lock
+ after_lock_delay
+ wakeup_after_lock
--
2.39.5
Attachment: v10-0008-Enable-logical-decoding-transiently-only-for-REP.patch (text/x-diff)
From 1946221703c271d8fdae0d5cd43874d59abb5ef2 Mon Sep 17 00:00:00 2001
From: Álvaro Herrera <alvherre@alvh.no-ip.org>
Date: Mon, 24 Mar 2025 20:17:45 +0100
Subject: [PATCH v10 8/9] Enable logical decoding transiently, only for REPACK
CONCURRENTLY.
As REPACK CONCURRENTLY uses logical decoding, it requires wal_level to be set
to 'logical', while 'replica' is the default value. If logical replication is
not used, users will probably be reluctant to set the GUC to 'logical' because
it can affect server performance (by writing additional information to WAL)
and because it cannot be changed to 'logical' only for the time REPACK
CONCURRENTLY is running: a change of this GUC requires a server restart to
take effect.
This patch teaches the postgres backend to recognize whether it should
consider wal_level='logical' "locally" for a particular transaction, even if
the wal_level GUC is actually set to 'replica'. It also ensures that the
logical-decoding-specific information is added to WAL only for the tables
which are currently being processed by REPACK CONCURRENTLY.
If logical decoding is enabled this way, only temporary replication slots
should be created. The problem with a permanent slot is that it is restored
during server restart, and the restore fails if wal_level is not "globally"
'logical'.
There is independent work in progress to enable logical decoding transiently
[1]. ISTM that this is too "heavyweight" a solution for our problem. And I
think that these two approaches are not mutually exclusive: once [1] is
committed, we only need to adjust the XLogLogicalInfoActive() macro.
[1] https://www.postgresql.org/message-id/CAD21AoCVLeLYq09pQPaWs%2BJwdni5FuJ8v2jgq-u9_uFbcp6UbA%40mail.gmail.com
---
src/backend/access/transam/parallel.c | 8 ++
src/backend/access/transam/xact.c | 106 ++++++++++++++---
src/backend/access/transam/xlog.c | 1 +
src/backend/commands/cluster.c | 107 ++++++++++++++----
src/backend/replication/logical/logical.c | 9 +-
src/backend/storage/ipc/standby.c | 4 +-
src/include/access/xlog.h | 15 ++-
src/include/commands/cluster.h | 1 +
src/include/utils/rel.h | 6 +-
src/test/modules/injection_points/Makefile | 1 -
.../modules/injection_points/logical.conf | 1 -
src/test/modules/injection_points/meson.build | 3 -
src/tools/pgindent/typedefs.list | 1 +
13 files changed, 217 insertions(+), 46 deletions(-)
delete mode 100644 src/test/modules/injection_points/logical.conf
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 94db1ec3012..a33318ea7bd 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -98,6 +98,7 @@ typedef struct FixedParallelState
TimestampTz xact_ts;
TimestampTz stmt_ts;
SerializableXactHandle serializable_xact_handle;
+ int wal_level_transient;
/* Mutex protects remaining fields. */
slock_t mutex;
@@ -355,6 +356,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
fps->xact_ts = GetCurrentTransactionStartTimestamp();
fps->stmt_ts = GetCurrentStatementStartTimestamp();
fps->serializable_xact_handle = ShareSerializableXact();
+ fps->wal_level_transient = wal_level_transient;
SpinLockInit(&fps->mutex);
fps->last_xlog_end = 0;
shm_toc_insert(pcxt->toc, PARALLEL_KEY_FIXED, fps);
@@ -1550,6 +1552,12 @@ ParallelWorkerMain(Datum main_arg)
/* Attach to the leader's serializable transaction, if SERIALIZABLE. */
AttachSerializableXact(fps->serializable_xact_handle);
+ /*
+	 * Restore the information about whether this worker should behave as if
+	 * wal_level were WAL_LEVEL_LOGICAL.
+ */
+ wal_level_transient = fps->wal_level_transient;
+
/*
* We've initialized all of our state now; nothing should change
* hereafter.
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index aebad612ce8..1b02ef0bacb 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -36,6 +36,7 @@
#include "catalog/pg_enum.h"
#include "catalog/storage.h"
#include "commands/async.h"
+#include "commands/cluster.h"
#include "commands/tablecmds.h"
#include "commands/trigger.h"
#include "common/pg_prng.h"
@@ -138,6 +139,12 @@ static TransactionId *ParallelCurrentXids;
static int nRepackCurrentXids = 0;
static TransactionId *RepackCurrentXids = NULL;
+/*
+ * Have we determined the value of wal_level_transient for the current
+ * transaction?
+ */
+static bool wal_level_transient_checked = false;
+
/*
* Miscellaneous flag bits to record events which occur on the top level
* transaction. These flags are only persisted in MyXactFlags and are intended
@@ -649,6 +656,7 @@ AssignTransactionId(TransactionState s)
bool isSubXact = (s->parent != NULL);
ResourceOwner currentOwner;
bool log_unknown_top = false;
+ bool set_wal_level_transient = false;
/* Assert that caller didn't screw up */
Assert(!FullTransactionIdIsValid(s->fullTransactionId));
@@ -663,6 +671,32 @@ AssignTransactionId(TransactionState s)
(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
errmsg("cannot assign transaction IDs during a parallel operation")));
+ /*
+ * The first call (i.e. the first write) in the transaction tree
+ * determines whether the whole transaction assumes logical decoding or
+ * not.
+ */
+ if (!wal_level_transient_checked)
+ {
+ Assert(wal_level_transient == WAL_LEVEL_MINIMAL);
+
+ /*
+ * Do not repeat the check when calling this function for parent
+ * transactions.
+ */
+ wal_level_transient_checked = true;
+
+ /*
+ * Remember that the actual check is needed. We cannot do it until the
+ * top-level transaction has its XID assigned, see comments below.
+ *
+ * There is no use case for overriding MINIMAL, and LOGICAL cannot be
+ * overridden as such.
+ */
+ if (wal_level == WAL_LEVEL_REPLICA)
+ set_wal_level_transient = true;
+ }
+
/*
* Ensure parent(s) have XIDs, so that a child always has an XID later
* than its parent. Mustn't recurse here, or we might get a stack
@@ -692,20 +726,6 @@ AssignTransactionId(TransactionState s)
pfree(parents);
}
- /*
- * When wal_level=logical, guarantee that a subtransaction's xid can only
- * be seen in the WAL stream if its toplevel xid has been logged before.
- * If necessary we log an xact_assignment record with fewer than
- * PGPROC_MAX_CACHED_SUBXIDS. Note that it is fine if didLogXid isn't set
- * for a transaction even though it appears in a WAL record, we just might
- * superfluously log something. That can happen when an xid is included
- * somewhere inside a wal record, but not in XLogRecord->xl_xid, like in
- * xl_standby_locks.
- */
- if (isSubXact && XLogLogicalInfoActive() &&
- !TopTransactionStateData.didLogXid)
- log_unknown_top = true;
-
/*
* Generate a new FullTransactionId and record its xid in PGPROC and
* pg_subtrans.
@@ -730,6 +750,54 @@ AssignTransactionId(TransactionState s)
if (!isSubXact)
RegisterPredicateLockingXid(XidFromFullTransactionId(s->fullTransactionId));
+ /*
+ * Check if this transaction should consider wal_level=logical.
+ *
+ * Sometimes we need to turn on the logical decoding transiently although
+ * wal_level=WAL_LEVEL_REPLICA. Currently we do so when at least one table
+ * is being clustered concurrently, i.e. when we should assume that
+ * changes done by this transaction will be decoded. In such a case we
+ * adjust the value of XLogLogicalInfoActive() by setting
+ * wal_level_transient to LOGICAL.
+ *
+ * It's important not to do this check until the XID of the top-level
+ * transaction is in ProcGlobal: if the decoding becomes mandatory right
+ * after the check, our transaction will fail to write the necessary
+ * information to WAL. However, if the top-level transaction is already in
+ * ProcGlobal, its XID is guaranteed to appear in the xl_running_xacts
+ * record and therefore the snapshot builder will not try to decode the
+ * transaction (because it assumes it could have missed the initial part
+ * of the transaction).
+ *
+	 * On the other hand, if the decoding became mandatory between the actual
+	 * XID assignment and now, the transaction will write the decoding-specific
+	 * information to WAL unnecessarily. Let's assume that such race conditions
+	 * do not happen too often.
+ */
+ if (set_wal_level_transient)
+ {
+ /*
+ * Check for the operation that enables the logical decoding
+ * transiently.
+ */
+ if (is_concurrent_repack_in_progress(InvalidOid))
+ wal_level_transient = WAL_LEVEL_LOGICAL;
+ }
+
+ /*
+ * When wal_level=logical, guarantee that a subtransaction's xid can only
+ * be seen in the WAL stream if its toplevel xid has been logged before.
+ * If necessary we log an xact_assignment record with fewer than
+ * PGPROC_MAX_CACHED_SUBXIDS. Note that it is fine if didLogXid isn't set
+ * for a transaction even though it appears in a WAL record, we just might
+ * superfluously log something. That can happen when an xid is included
+ * somewhere inside a wal record, but not in XLogRecord->xl_xid, like in
+ * xl_standby_locks.
+ */
+ if (isSubXact && XLogLogicalInfoActive() &&
+ !TopTransactionStateData.didLogXid)
+ log_unknown_top = true;
+
/*
* Acquire lock on the transaction XID. (We assume this cannot block.) We
* have to ensure that the lock is assigned to the transaction's own
@@ -2244,6 +2312,16 @@ StartTransaction(void)
if (TransactionTimeout > 0)
enable_timeout_after(TRANSACTION_TIMEOUT, TransactionTimeout);
+ /*
+ * wal_level_transient can override wal_level for individual transactions,
+ * which effectively enables logical decoding for them. At the moment we
+ * don't know if this transaction will write any data changes to be
+	 * decoded. Should it do so, AssignTransactionId() will check if the decoding
+ * needs to be considered.
+ */
+ wal_level_transient = WAL_LEVEL_MINIMAL;
+ wal_level_transient_checked = false;
+
ShowTransactionState("StartTransaction");
}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4b6c694a3f7..1b131e1436f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -129,6 +129,7 @@ bool wal_recycle = true;
bool log_checkpoints = true;
int wal_sync_method = DEFAULT_WAL_SYNC_METHOD;
int wal_level = WAL_LEVEL_REPLICA;
+int wal_level_transient = WAL_LEVEL_MINIMAL;
int CommitDelay = 0; /* precommit delay in microseconds */
int CommitSiblings = 5; /* # concurrent xacts needed to sleep */
int wal_retrieve_retry_interval = 5000;
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index c272ed03cb9..2ab1756fa19 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -1304,7 +1304,7 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
*
* In the REPACK CONCURRENTLY case, the lock does not help because we need
* to release it temporarily at some point. Instead, we expect VACUUM /
- * CLUSTER to skip tables which are present in RepackedRelsHash.
+ * CLUSTER to skip tables which are present in repackedRels->hashtable.
*/
if (OldHeap->rd_rel->reltoastrelid && !concurrent)
LockRelationOid(OldHeap->rd_rel->reltoastrelid, AccessExclusiveLock);
@@ -2324,7 +2324,16 @@ typedef struct RepackedRel
Oid dbid;
} RepackedRel;
-static HTAB *RepackedRelsHash = NULL;
+typedef struct RepackedRels
+{
+ /* Hashtable of RepackedRel elements. */
+ HTAB *hashtable;
+
+	/* The number of elements in the hashtable. */
+ pg_atomic_uint32 nrels;
+} RepackedRels;
+
+static RepackedRels *repackedRels = NULL;
/* Maximum number of entries in the hashtable. */
static int maxRepackedRels = 0;
@@ -2332,28 +2341,44 @@ static int maxRepackedRels = 0;
Size
RepackShmemSize(void)
{
+ Size result;
+
+ result = sizeof(RepackedRels);
+
/*
* A replication slot is needed for the processing, so use this GUC to
* allocate memory for the hashtable.
*/
maxRepackedRels = max_replication_slots;
- return hash_estimate_size(maxRepackedRels, sizeof(RepackedRel));
+ result += hash_estimate_size(maxRepackedRels, sizeof(RepackedRel));
+ return result;
}
void
RepackShmemInit(void)
{
+ bool found;
HASHCTL info;
+ repackedRels = ShmemInitStruct("Repacked Relations",
+ sizeof(RepackedRels),
+ &found);
+ if (!IsUnderPostmaster)
+ {
+ Assert(!found);
+ pg_atomic_init_u32(&repackedRels->nrels, 0);
+ }
+ else
+ Assert(found);
+
info.keysize = sizeof(RepackedRel);
info.entrysize = info.keysize;
-
- RepackedRelsHash = ShmemInitHash("Repacked Relations",
- maxRepackedRels,
- maxRepackedRels,
- &info,
- HASH_ELEM | HASH_BLOBS);
+ repackedRels->hashtable = ShmemInitHash("Repacked Relations Hash",
+ maxRepackedRels,
+ maxRepackedRels,
+ &info,
+ HASH_ELEM | HASH_BLOBS);
}
/*
@@ -2387,12 +2412,13 @@ begin_concurrent_repack(Relation *rel_p, Relation *index_p,
RelReopenInfo rri[2];
int nrel;
static bool before_shmem_exit_callback_setup = false;
+ uint32 nrels PG_USED_FOR_ASSERTS_ONLY;
relid = RelationGetRelid(rel);
/*
- * Make sure that we do not leave an entry in RepackedRelsHash if exiting
- * due to FATAL.
+	 * Make sure that we do not leave an entry in repackedRels->hashtable if
+	 * exiting due to FATAL.
*/
if (!before_shmem_exit_callback_setup)
{
@@ -2407,7 +2433,7 @@ begin_concurrent_repack(Relation *rel_p, Relation *index_p,
*entered_p = false;
LWLockAcquire(RepackedRelsLock, LW_EXCLUSIVE);
entry = (RepackedRel *)
- hash_search(RepackedRelsHash, &key, HASH_ENTER_NULL, &found);
+ hash_search(repackedRels->hashtable, &key, HASH_ENTER_NULL, &found);
if (found)
{
/*
@@ -2425,6 +2451,10 @@ begin_concurrent_repack(Relation *rel_p, Relation *index_p,
(errmsg("too many requests for REPACK CONCURRENTLY at a time")),
(errhint("Please consider increasing the \"max_replication_slots\" configuration parameter.")));
+ /* Increment the number of relations. */
+ nrels = pg_atomic_fetch_add_u32(&repackedRels->nrels, 1);
+ Assert(nrels < maxRepackedRels);
+
/*
* Even if the insertion of TOAST relid should fail below, the caller has
* to do cleanup.
@@ -2452,7 +2482,8 @@ begin_concurrent_repack(Relation *rel_p, Relation *index_p,
{
key.relid = toastrelid;
entry = (RepackedRel *)
- hash_search(RepackedRelsHash, &key, HASH_ENTER_NULL, &found);
+ hash_search(repackedRels->hashtable, &key, HASH_ENTER_NULL,
+ &found);
if (found)
/*
@@ -2467,6 +2498,10 @@ begin_concurrent_repack(Relation *rel_p, Relation *index_p,
(errmsg("too many requests for REPACK CONCURRENTLY at a time")),
(errhint("Please consider increasing the \"max_replication_slots\" configuration parameter.")));
+ /* Increment the number of relations. */
+ nrels = pg_atomic_fetch_add_u32(&repackedRels->nrels, 1);
+ Assert(nrels < maxRepackedRels);
+
Assert(!OidIsValid(repacked_rel_toast));
repacked_rel_toast = toastrelid;
}
@@ -2549,6 +2584,7 @@ end_concurrent_repack(bool error)
*entry_toast = NULL;
Oid relid = repacked_rel;
Oid toastrelid = repacked_rel_toast;
+ uint32 nrels PG_USED_FOR_ASSERTS_ONLY;
/* Remove the relation from the hash if we managed to insert one. */
if (OidIsValid(repacked_rel))
@@ -2557,23 +2593,32 @@ end_concurrent_repack(bool error)
key.relid = repacked_rel;
key.dbid = MyDatabaseId;
LWLockAcquire(RepackedRelsLock, LW_EXCLUSIVE);
- entry = hash_search(RepackedRelsHash, &key, HASH_REMOVE, NULL);
+ entry = hash_search(repackedRels->hashtable, &key, HASH_REMOVE,
+ NULL);
/*
* By clearing this variable we also disable
* cluster_before_shmem_exit_callback().
*/
repacked_rel = InvalidOid;
+
+ /* Decrement the number of relations. */
+ nrels = pg_atomic_fetch_sub_u32(&repackedRels->nrels, 1);
+ Assert(nrels > 0);
}
/* Remove the TOAST relation if there is one. */
if (OidIsValid(repacked_rel_toast))
{
key.relid = repacked_rel_toast;
- entry_toast = hash_search(RepackedRelsHash, &key, HASH_REMOVE,
+ entry_toast = hash_search(repackedRels->hashtable, &key, HASH_REMOVE,
NULL);
repacked_rel_toast = InvalidOid;
+
+ /* Decrement the number of relations. */
+ nrels = pg_atomic_fetch_sub_u32(&repackedRels->nrels, 1);
+ Assert(nrels > 0);
}
LWLockRelease(RepackedRelsLock);
@@ -2639,7 +2684,7 @@ end_concurrent_repack(bool error)
}
/*
- * A wrapper to call end_concurrent_repack() as a before_shmem_exit callback.
+ * A wrapper to call end_concurrent_cluster() as a before_shmem_exit callback.
*/
static void
cluster_before_shmem_exit_callback(int code, Datum arg)
@@ -2650,6 +2695,8 @@ cluster_before_shmem_exit_callback(int code, Datum arg)
/*
* Check if relation is currently being processed by REPACK CONCURRENTLY.
+ *
+ * If relid is InvalidOid, check if any relation is being processed.
*/
bool
is_concurrent_repack_in_progress(Oid relid)
@@ -2657,18 +2704,40 @@ is_concurrent_repack_in_progress(Oid relid)
RepackedRel key,
*entry;
+ /*
+	 * If the caller is interested in whether any relation is being repacked,
+ * just use the counter.
+ */
+ if (!OidIsValid(relid))
+ {
+ if (pg_atomic_read_u32(&repackedRels->nrels) > 0)
+ return true;
+ else
+ return false;
+ }
+
+ /* For particular relation we need to search in the hashtable. */
memset(&key, 0, sizeof(key));
key.relid = relid;
key.dbid = MyDatabaseId;
LWLockAcquire(RepackedRelsLock, LW_SHARED);
entry = (RepackedRel *)
- hash_search(RepackedRelsHash, &key, HASH_FIND, NULL);
+ hash_search(repackedRels->hashtable, &key, HASH_FIND, NULL);
LWLockRelease(RepackedRelsLock);
return entry != NULL;
}
+/*
+ * Is this backend performing REPACK CONCURRENTLY?
+ */
+bool
+is_concurrent_repack_run_by_me(void)
+{
+ return OidIsValid(repacked_rel);
+}
+
/*
* Check if REPACK CONCURRENTLY is already running for given relation, and if
* so, raise ERROR. The problem is that cluster_rel() needs to release its
@@ -2967,8 +3036,8 @@ setup_logical_decoding(Oid relid, const char *slotname, TupleDesc tupdesc)
* useful for us.
*
* Regarding the value of need_full_snapshot, we pass false because the
- * table we are processing is present in RepackedRelsHash and therefore,
- * regarding logical decoding, treated like a catalog.
+ * table we are processing is present in repackedRels->hashtable and
+ * therefore, regarding logical decoding, treated like a catalog.
*/
ctx = CreateInitDecodingContext(REPL_PLUGIN_NAME,
NIL,
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 8ea846bfc3b..e5790d3fe84 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -30,6 +30,7 @@
#include "access/xact.h"
#include "access/xlogutils.h"
+#include "commands/cluster.h"
#include "fmgr.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -112,10 +113,12 @@ CheckLogicalDecodingRequirements(void)
/*
* NB: Adding a new requirement likely means that RestoreSlotFromDisk()
- * needs the same check.
+ * needs the same check. (Except that only temporary slots should be
+ * created for REPACK CONCURRENTLY, which effectively raises wal_level to
+ * LOGICAL.)
*/
-
- if (wal_level < WAL_LEVEL_LOGICAL)
+ if ((wal_level < WAL_LEVEL_LOGICAL && !is_concurrent_repack_run_by_me())
+ || wal_level < WAL_LEVEL_REPLICA)
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("logical decoding requires \"wal_level\" >= \"logical\"")));
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 5acb4508f85..413bcc1addb 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -1313,13 +1313,13 @@ LogStandbySnapshot(void)
* record. Fortunately this routine isn't executed frequently, and it's
* only a shared lock.
*/
- if (wal_level < WAL_LEVEL_LOGICAL)
+ if (!XLogLogicalInfoActive())
LWLockRelease(ProcArrayLock);
recptr = LogCurrentRunningXacts(running);
/* Release lock if we kept it longer ... */
- if (wal_level >= WAL_LEVEL_LOGICAL)
+ if (XLogLogicalInfoActive())
LWLockRelease(ProcArrayLock);
/* GetRunningTransactionData() acquired XidGenLock, we must release it */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index d313099c027..a325bb1d16b 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -95,6 +95,12 @@ typedef enum RecoveryState
extern PGDLLIMPORT int wal_level;
+/*
+ * wal_level_transient overrides wal_level if logical decoding needs to be
+ * enabled transiently.
+ */
+extern PGDLLIMPORT int wal_level_transient;
+
/* Is WAL archiving enabled (always or only while server is running normally)? */
#define XLogArchivingActive() \
(AssertMacro(XLogArchiveMode == ARCHIVE_MODE_OFF || wal_level >= WAL_LEVEL_REPLICA), XLogArchiveMode > ARCHIVE_MODE_OFF)
@@ -122,8 +128,13 @@ extern PGDLLIMPORT int wal_level;
/* Do we need to WAL-log information required only for Hot Standby and logical replication? */
#define XLogStandbyInfoActive() (wal_level >= WAL_LEVEL_REPLICA)
-/* Do we need to WAL-log information required only for logical replication? */
-#define XLogLogicalInfoActive() (wal_level >= WAL_LEVEL_LOGICAL)
+/*
+ * Do we need to WAL-log information required only for logical replication?
+ *
+ * wal_level_transient overrides wal_level if logical decoding needs to be
+ * active transiently.
+ */
+#define XLogLogicalInfoActive() (Max(wal_level, wal_level_transient) == WAL_LEVEL_LOGICAL)
#ifdef WAL_DEBUG
extern PGDLLIMPORT bool XLOG_DEBUG;
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 268c3098512..6f5566210a8 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -173,6 +173,7 @@ extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
extern Size RepackShmemSize(void);
extern void RepackShmemInit(void);
extern bool is_concurrent_repack_in_progress(Oid relid);
+extern bool is_concurrent_repack_run_by_me(void);
extern void check_for_concurrent_repack(Oid relid, LOCKMODE lockmode);
extern void repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 372065fc570..fcbad5c1720 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -710,12 +710,16 @@ RelationCloseSmgr(Relation relation)
* it would complicate decoding slightly for little gain). Note that we *do*
* log information for user defined catalog tables since they presumably are
* interesting to the user...
+ *
+ * If particular relations require it, logical decoding can be active even
+ * if wal_level is REPLICA. Do not log other relations in that case.
*/
#define RelationIsLogicallyLogged(relation) \
(XLogLogicalInfoActive() && \
RelationNeedsWAL(relation) && \
(relation)->rd_rel->relkind != RELKIND_FOREIGN_TABLE && \
- !IsCatalogRelation(relation))
+ !IsCatalogRelation(relation) && \
+ (wal_level == WAL_LEVEL_LOGICAL || (relation)->rd_repack_concurrent))
/* routines in utils/cache/relcache.c */
extern void RelationIncrementReferenceCount(Relation rel);
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 405d0811b4f..4f6c0ca3a8a 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -15,7 +15,6 @@ REGRESS = injection_points hashagg reindex_conc
REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
ISOLATION = basic inplace syscache-update-pruned repack
-ISOLATION_OPTS = --temp-config $(top_srcdir)/src/test/modules/injection_points/logical.conf
TAP_TESTS = 1
diff --git a/src/test/modules/injection_points/logical.conf b/src/test/modules/injection_points/logical.conf
deleted file mode 100644
index c8f264bc6cb..00000000000
--- a/src/test/modules/injection_points/logical.conf
+++ /dev/null
@@ -1 +0,0 @@
-wal_level = logical
\ No newline at end of file
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 0e3c47ba999..716e5619aa7 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -50,9 +50,6 @@ tests += {
'syscache-update-pruned',
],
'runningcheck': false, # see syscache-update-pruned
- # 'repack' requires wal_level = 'logical'.
- 'regress_args': ['--temp-config', files('logical.conf')],
-
},
'tap': {
'env': {
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index ac52b8b0336..de1a178a4b9 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2508,6 +2508,7 @@ ReorderBufferUpdateProgressTxnCB
ReorderTuple
RepOriginId
RepackedRel
+RepackedRels
RepackDecodingState
RepackStmt
ReparameterizeForeignPathByChild_function
--
2.39.5
v10-0009-Call-logical_rewrite_heap_tuple-when-applying-co.patch
From 87af20748146ceb255289c5f540711c8c28269f1 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=C3=81lvaro=20Herrera?= <alvherre@alvh.no-ip.org>
Date: Mon, 24 Mar 2025 20:18:39 +0100
Subject: [PATCH v10 9/9] Call logical_rewrite_heap_tuple() when applying
concurrent data changes.
This was implemented for the sake of completeness, but I think it's currently
not needed. Possible use cases could be:
1. REPACK CONCURRENTLY can process system catalogs.
System catalogs are scanned using a historic snapshot during logical decoding,
and the "combo CIDs" information is needed for that. Since "combo CID" is
associated with the "file locator" and that locator is changed by REPACK, this
command must record the information on individual tuples being moved from the
old file to the new one. This is what logical_rewrite_heap_tuple() does.
However, the logical decoding subsystem currently does not support decoding of
data changes in the system catalog. Therefore, the CONCURRENTLY option cannot
be used for system catalogs.
2. REPACK CONCURRENTLY is processing a relation, but once it has released all
the locks (in order to get the exclusive lock), another backend runs REPACK
CONCURRENTLY on the same table. Since the relation is treated as a system
catalog while these commands are processing it (so it can be scanned using a
historic snapshot during the "initial load"), it is important that the 2nd
backend does not break decoding of the "combo CIDs" performed by the 1st
backend.
However, it's not practical to let multiple backends run REPACK CONCURRENTLY
on the same relation, so we forbid that.
---
src/backend/access/heap/heapam_handler.c | 2 +-
src/backend/access/heap/rewriteheap.c | 59 +++++----
src/backend/commands/cluster.c | 113 +++++++++++++++---
src/backend/replication/logical/decode.c | 42 ++++++-
.../pgoutput_repack/pgoutput_repack.c | 21 ++--
src/include/access/rewriteheap.h | 5 +-
src/include/commands/cluster.h | 3 +
src/include/replication/reorderbuffer.h | 7 ++
8 files changed, 195 insertions(+), 57 deletions(-)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 850708c7830..d7b0edc3bf8 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -734,7 +734,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
/* Initialize the rewrite operation */
rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, *xid_cutoff,
- *multi_cutoff);
+ *multi_cutoff, true);
/* Set up sorting if wanted */
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index e6d2b5fced1..b54ecc2f3bf 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -214,10 +214,8 @@ static void raw_heap_insert(RewriteState state, HeapTuple tup);
/* internal logical remapping prototypes */
static void logical_begin_heap_rewrite(RewriteState state);
-static void logical_rewrite_heap_tuple(RewriteState state, ItemPointerData old_tid, HeapTuple new_tuple);
static void logical_end_heap_rewrite(RewriteState state);
-
/*
* Begin a rewrite of a table
*
@@ -226,18 +224,19 @@ static void logical_end_heap_rewrite(RewriteState state);
* oldest_xmin xid used by the caller to determine which tuples are dead
* freeze_xid xid before which tuples will be frozen
* cutoff_multi multixact before which multis will be removed
+ * tid_chains need to maintain TID chains?
*
* Returns an opaque RewriteState, allocated in current memory context,
* to be used in subsequent calls to the other functions.
*/
RewriteState
begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xmin,
- TransactionId freeze_xid, MultiXactId cutoff_multi)
+ TransactionId freeze_xid, MultiXactId cutoff_multi,
+ bool tid_chains)
{
RewriteState state;
MemoryContext rw_cxt;
MemoryContext old_cxt;
- HASHCTL hash_ctl;
/*
* To ease cleanup, make a separate context that will contain the
@@ -262,29 +261,34 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
state->rs_cxt = rw_cxt;
state->rs_bulkstate = smgr_bulk_start_rel(new_heap, MAIN_FORKNUM);
- /* Initialize hash tables used to track update chains */
- hash_ctl.keysize = sizeof(TidHashKey);
- hash_ctl.entrysize = sizeof(UnresolvedTupData);
- hash_ctl.hcxt = state->rs_cxt;
+ if (tid_chains)
+ {
+ HASHCTL hash_ctl;
- state->rs_unresolved_tups =
- hash_create("Rewrite / Unresolved ctids",
- 128, /* arbitrary initial size */
- &hash_ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ /* Initialize hash tables used to track update chains */
+ hash_ctl.keysize = sizeof(TidHashKey);
+ hash_ctl.entrysize = sizeof(UnresolvedTupData);
+ hash_ctl.hcxt = state->rs_cxt;
- hash_ctl.entrysize = sizeof(OldToNewMappingData);
+ state->rs_unresolved_tups =
+ hash_create("Rewrite / Unresolved ctids",
+ 128, /* arbitrary initial size */
+ &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
- state->rs_old_new_tid_map =
- hash_create("Rewrite / Old to new tid map",
- 128, /* arbitrary initial size */
- &hash_ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ hash_ctl.entrysize = sizeof(OldToNewMappingData);
- MemoryContextSwitchTo(old_cxt);
+ state->rs_old_new_tid_map =
+ hash_create("Rewrite / Old to new tid map",
+ 128, /* arbitrary initial size */
+ &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ }
logical_begin_heap_rewrite(state);
+ MemoryContextSwitchTo(old_cxt);
+
return state;
}
@@ -303,12 +307,15 @@ end_heap_rewrite(RewriteState state)
* Write any remaining tuples in the UnresolvedTups table. If we have any
* left, they should in fact be dead, but let's err on the safe side.
*/
- hash_seq_init(&seq_status, state->rs_unresolved_tups);
-
- while ((unresolved = hash_seq_search(&seq_status)) != NULL)
+ if (state->rs_unresolved_tups)
{
- ItemPointerSetInvalid(&unresolved->tuple->t_data->t_ctid);
- raw_heap_insert(state, unresolved->tuple);
+ hash_seq_init(&seq_status, state->rs_unresolved_tups);
+
+ while ((unresolved = hash_seq_search(&seq_status)) != NULL)
+ {
+ ItemPointerSetInvalid(&unresolved->tuple->t_data->t_ctid);
+ raw_heap_insert(state, unresolved->tuple);
+ }
}
/* Write the last page, if any */
@@ -995,7 +1002,7 @@ logical_rewrite_log_mapping(RewriteState state, TransactionId xid,
* Perform logical remapping for a tuple that's mapped from old_tid to
* new_tuple->t_self by rewrite_heap_tuple() if necessary for the tuple.
*/
-static void
+void
logical_rewrite_heap_tuple(RewriteState state, ItemPointerData old_tid,
HeapTuple new_tuple)
{
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 2ab1756fa19..8b72e4bf9ba 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -23,6 +23,7 @@
#include "access/heapam.h"
#include "access/multixact.h"
#include "access/relscan.h"
+#include "access/rewriteheap.h"
#include "access/tableam.h"
#include "access/toast_internals.h"
#include "access/transam.h"
@@ -210,17 +211,21 @@ static HeapTuple get_changed_tuple(char *change);
static void apply_concurrent_changes(RepackDecodingState *dstate,
Relation rel, ScanKey key, int nkeys,
IndexInsertState *iistate,
- struct timeval *must_complete);
+ struct timeval *must_complete,
+ RewriteState rwstate);
static void apply_concurrent_insert(Relation rel, ConcurrentChange *change,
HeapTuple tup, IndexInsertState *iistate,
- TupleTableSlot *index_slot);
+ TupleTableSlot *index_slot,
+ RewriteState rwstate);
static void apply_concurrent_update(Relation rel, HeapTuple tup,
HeapTuple tup_target,
ConcurrentChange *change,
IndexInsertState *iistate,
- TupleTableSlot *index_slot);
+ TupleTableSlot *index_slot,
+ RewriteState rwstate);
static void apply_concurrent_delete(Relation rel, HeapTuple tup_target,
- ConcurrentChange *change);
+ ConcurrentChange *change,
+ RewriteState rwstate);
static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
HeapTuple tup_key,
Snapshot snapshot,
@@ -234,7 +239,8 @@ static bool process_concurrent_changes(LogicalDecodingContext *ctx,
ScanKey ident_key,
int ident_key_nentries,
IndexInsertState *iistate,
- struct timeval *must_complete);
+ struct timeval *must_complete,
+ RewriteState rwstate);
static bool processing_time_elapsed(struct timeval *must_complete);
static IndexInsertState *get_index_insert_state(Relation relation,
Oid ident_index_id);
@@ -3207,7 +3213,7 @@ repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
static void
apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
ScanKey key, int nkeys, IndexInsertState *iistate,
- struct timeval *must_complete)
+ struct timeval *must_complete, RewriteState rwstate)
{
TupleTableSlot *index_slot,
*ident_slot;
@@ -3283,7 +3289,8 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
{
Assert(tup_old == NULL);
- apply_concurrent_insert(rel, &change, tup, iistate, index_slot);
+ apply_concurrent_insert(rel, &change, tup, iistate, index_slot,
+ rwstate);
pfree(tup);
}
@@ -3291,7 +3298,8 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
change.kind == CHANGE_DELETE)
{
IndexScanDesc ind_scan = NULL;
- HeapTuple tup_key;
+ HeapTuple tup_key,
+ tup_exist_cp;
if (change.kind == CHANGE_UPDATE_NEW)
{
@@ -3333,11 +3341,23 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
if (tup_exist == NULL)
elog(ERROR, "Failed to find target tuple");
+ /*
+ * Update the mapping for xmax of the old version.
+ *
+ * Use a copy ('tup_exist' can point to shared buffer) with xmin
+ * invalid because mapping of that should have been written on
+ * insertion.
+ */
+ tup_exist_cp = heap_copytuple(tup_exist);
+ HeapTupleHeaderSetXmin(tup_exist_cp->t_data, InvalidTransactionId);
+ logical_rewrite_heap_tuple(rwstate, change.old_tid, tup_exist_cp);
+ pfree(tup_exist_cp);
+
if (change.kind == CHANGE_UPDATE_NEW)
apply_concurrent_update(rel, tup, tup_exist, &change, iistate,
- index_slot);
+ index_slot, rwstate);
else
- apply_concurrent_delete(rel, tup_exist, &change);
+ apply_concurrent_delete(rel, tup_exist, &change, rwstate);
ResetRepackCurrentXids();
@@ -3390,9 +3410,12 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
static void
apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
- IndexInsertState *iistate, TupleTableSlot *index_slot)
+ IndexInsertState *iistate, TupleTableSlot *index_slot,
+ RewriteState rwstate)
{
+ HeapTupleHeader tup_hdr = tup->t_data;
Snapshot snapshot = change->snapshot;
+ ItemPointerData old_tid;
List *recheck;
/*
@@ -3402,6 +3425,9 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
*/
SetRepackCurrentXids(snapshot->subxip, snapshot->subxcnt);
+ /* Remember location in the old heap. */
+ ItemPointerCopy(&tup_hdr->t_ctid, &old_tid);
+
/*
* Write the tuple into the new heap.
*
@@ -3417,6 +3443,14 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
heap_insert(rel, tup, change->xid, snapshot->curcid - 1,
HEAP_INSERT_NO_LOGICAL, NULL);
+ /*
+ * Update the mapping for xmin (xmax should be invalid). This is needed
+ * because, during the processing, the table is considered a "user
+ * catalog".
+ */
+ Assert(!TransactionIdIsValid(HeapTupleHeaderGetRawXmax(tup->t_data)));
+ logical_rewrite_heap_tuple(rwstate, old_tid, tup);
+
/*
* Update indexes.
*
@@ -3450,15 +3484,23 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
static void
apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
ConcurrentChange *change, IndexInsertState *iistate,
- TupleTableSlot *index_slot)
+ TupleTableSlot *index_slot, RewriteState rwstate)
{
List *recheck;
LockTupleMode lockmode;
TU_UpdateIndexes update_indexes;
+ ItemPointerData tid_new_old_heap,
+ tid_old_new_heap;
TM_Result res;
Snapshot snapshot = change->snapshot;
TM_FailureData tmfd;
+ /* Location of the new tuple in the old heap. */
+ ItemPointerCopy(&tup->t_data->t_ctid, &tid_new_old_heap);
+
+ /* Location of the existing tuple in the new heap. */
+ ItemPointerCopy(&tup_target->t_self, &tid_old_new_heap);
+
/*
* Write the new tuple into the new heap. ('tup' gets the TID assigned
* here.)
@@ -3468,7 +3510,7 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
Assert(snapshot->curcid != InvalidCommandId &&
snapshot->curcid > FirstCommandId);
- res = heap_update(rel, &tup_target->t_self, tup,
+ res = heap_update(rel, &tid_old_new_heap, tup,
change->xid, snapshot->curcid - 1,
InvalidSnapshot,
false, /* no wait - only we are doing changes */
@@ -3478,6 +3520,10 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
if (res != TM_Ok)
ereport(ERROR, (errmsg("failed to apply concurrent UPDATE")));
+ /* Update the mapping for xmin of the new version. */
+ Assert(!TransactionIdIsValid(HeapTupleHeaderGetRawXmax(tup->t_data)));
+ logical_rewrite_heap_tuple(rwstate, tid_new_old_heap, tup);
+
ExecStoreHeapTuple(tup, index_slot, false);
if (update_indexes != TU_None)
@@ -3501,8 +3547,9 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
static void
apply_concurrent_delete(Relation rel, HeapTuple tup_target,
- ConcurrentChange *change)
+ ConcurrentChange *change, RewriteState rwstate)
{
+ ItemPointerData tid_old_new_heap;
TM_Result res;
TM_FailureData tmfd;
Snapshot snapshot = change->snapshot;
@@ -3511,7 +3558,10 @@ apply_concurrent_delete(Relation rel, HeapTuple tup_target,
Assert(snapshot->curcid != InvalidCommandId &&
snapshot->curcid > FirstCommandId);
- res = heap_delete(rel, &tup_target->t_self, change->xid,
+ /* Location of the existing tuple in the new heap. */
+ ItemPointerCopy(&tup_target->t_self, &tid_old_new_heap);
+
+ res = heap_delete(rel, &tid_old_new_heap, change->xid,
snapshot->curcid - 1, InvalidSnapshot, false,
&tmfd, false,
/* wal_logical */
@@ -3593,7 +3643,8 @@ static bool
process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
Relation rel_dst, Relation rel_src, ScanKey ident_key,
int ident_key_nentries, IndexInsertState *iistate,
- struct timeval *must_complete)
+ struct timeval *must_complete,
+ RewriteState rwstate)
{
RepackDecodingState *dstate;
@@ -3626,7 +3677,8 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
rel_dst->rd_toastoid = rel_src->rd_rel->reltoastrelid;
apply_concurrent_changes(dstate, rel_dst, ident_key,
- ident_key_nentries, iistate, must_complete);
+ ident_key_nentries, iistate, must_complete,
+ rwstate);
}
PG_FINALLY();
{
@@ -3813,6 +3865,7 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
Oid ident_idx_old,
ident_idx_new;
IndexInsertState *iistate;
+ RewriteState rwstate;
ScanKey ident_key;
int ident_key_nentries;
XLogRecPtr wal_insert_ptr,
@@ -3901,11 +3954,27 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
* Apply concurrent changes first time, to minimize the time we need to
* hold AccessExclusiveLock. (Quite some amount of WAL could have been
* written during the data copying and index creation.)
+ *
+ * Now we are processing individual tuples, so pass false for
+ * 'tid_chains'. Since rwstate is now only needed for
+ * logical_begin_heap_rewrite(), none of the transaction IDs needs to be
+ * valid.
*/
+ rwstate = begin_heap_rewrite(OldHeap, NewHeap,
+ InvalidTransactionId,
+ InvalidTransactionId,
+ InvalidTransactionId,
+ false);
process_concurrent_changes(ctx, end_of_wal, NewHeap,
swap_toast_by_content ? OldHeap : NULL,
ident_key, ident_key_nentries, iistate,
- NULL);
+ NULL, rwstate);
+
+ /*
+ * OldHeap will be closed, so we need to initialize rwstate again for the
+ * next call of process_concurrent_changes().
+ */
+ end_heap_rewrite(rwstate);
/*
* Release the locks that allowed concurrent data changes, in order to
@@ -4030,6 +4099,11 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
end_of_wal = GetFlushRecPtr(NULL);
/* Apply the concurrent changes again. */
+ rwstate = begin_heap_rewrite(OldHeap, NewHeap,
+ InvalidTransactionId,
+ InvalidTransactionId,
+ InvalidTransactionId,
+ false);
/*
* This time we have the exclusive lock on the table, so make sure that
@@ -4059,11 +4133,12 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
if (!process_concurrent_changes(ctx, end_of_wal, NewHeap,
swap_toast_by_content ? OldHeap : NULL,
ident_key, ident_key_nentries, iistate,
- t_end_ptr))
+ t_end_ptr, rwstate))
ereport(ERROR,
(errmsg("could not process concurrent data changes in time"),
errhint("Please consider adjusting \"repack_max_xlock_time\".")));
+ end_heap_rewrite(rwstate);
/* Remember info about rel before closing OldHeap */
relpersistence = OldHeap->rd_rel->relpersistence;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 0b1603cd577..1f1c3f6b59c 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -984,11 +984,13 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
xl_heap_insert *xlrec;
ReorderBufferChange *change;
RelFileLocator target_locator;
+ BlockNumber blknum;
+ HeapTupleHeader tuphdr;
xlrec = (xl_heap_insert *) XLogRecGetData(r);
/* only interested in our database */
- XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
+ XLogRecGetBlockTag(r, 0, &target_locator, NULL, &blknum);
if (target_locator.dbOid != ctx->slot->data.database)
return;
@@ -1013,6 +1015,13 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
DecodeXLogTuple(tupledata, datalen, change->data.tp.newtuple);
+ /*
+ * CTID is needed for logical_rewrite_heap_tuple(), when doing REPACK
+ * CONCURRENTLY.
+ */
+ tuphdr = change->data.tp.newtuple->t_data;
+ ItemPointerSet(&tuphdr->t_ctid, blknum, xlrec->offnum);
+
change->data.tp.clear_toast_afterwards = true;
ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
@@ -1034,11 +1043,15 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
ReorderBufferChange *change;
char *data;
RelFileLocator target_locator;
+ BlockNumber old_blknum,
+ new_blknum;
xlrec = (xl_heap_update *) XLogRecGetData(r);
+ /* Retrieve blknum, so that we can compose CTID below. */
+ XLogRecGetBlockTag(r, 0, &target_locator, NULL, &new_blknum);
+
/* only interested in our database */
- XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
if (target_locator.dbOid != ctx->slot->data.database)
return;
@@ -1055,6 +1068,7 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
{
Size datalen;
Size tuplelen;
+ HeapTupleHeader tuphdr;
data = XLogRecGetBlockData(r, 0, &datalen);
@@ -1064,6 +1078,13 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
ReorderBufferAllocTupleBuf(ctx->reorder, tuplelen);
DecodeXLogTuple(data, datalen, change->data.tp.newtuple);
+
+ /*
+ * CTID is needed for logical_rewrite_heap_tuple(), when doing REPACK
+ * CONCURRENTLY.
+ */
+ tuphdr = change->data.tp.newtuple->t_data;
+ ItemPointerSet(&tuphdr->t_ctid, new_blknum, xlrec->new_offnum);
}
if (xlrec->flags & XLH_UPDATE_CONTAINS_OLD)
@@ -1082,6 +1103,14 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
DecodeXLogTuple(data, datalen, change->data.tp.oldtuple);
}
+ /*
+ * Remember the old tuple CTID, for the sake of
+ * logical_rewrite_heap_tuple().
+ */
+ if (!XLogRecGetBlockTagExtended(r, 1, NULL, NULL, &old_blknum, NULL))
+ old_blknum = new_blknum;
+ ItemPointerSet(&change->data.tp.old_tid, old_blknum, xlrec->old_offnum);
+
change->data.tp.clear_toast_afterwards = true;
ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
@@ -1100,11 +1129,12 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
xl_heap_delete *xlrec;
ReorderBufferChange *change;
RelFileLocator target_locator;
+ BlockNumber blknum;
xlrec = (xl_heap_delete *) XLogRecGetData(r);
/* only interested in our database */
- XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
+ XLogRecGetBlockTag(r, 0, &target_locator, NULL, &blknum);
if (target_locator.dbOid != ctx->slot->data.database)
return;
@@ -1136,6 +1166,12 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
DecodeXLogTuple((char *) xlrec + SizeOfHeapDelete,
datalen, change->data.tp.oldtuple);
+
+ /*
+ * CTID is needed for logical_rewrite_heap_tuple(), when doing REPACK
+ * CONCURRENTLY.
+ */
+ ItemPointerSet(&change->data.tp.old_tid, blknum, xlrec->offnum);
}
change->data.tp.clear_toast_afterwards = true;
diff --git a/src/backend/replication/pgoutput_repack/pgoutput_repack.c b/src/backend/replication/pgoutput_repack/pgoutput_repack.c
index 28bd16f9cc7..24d9c9c4884 100644
--- a/src/backend/replication/pgoutput_repack/pgoutput_repack.c
+++ b/src/backend/replication/pgoutput_repack/pgoutput_repack.c
@@ -33,7 +33,7 @@ static void plugin_truncate(struct LogicalDecodingContext *ctx,
ReorderBufferChange *change);
static void store_change(LogicalDecodingContext *ctx,
ConcurrentChangeKind kind, HeapTuple tuple,
- TransactionId xid);
+ TransactionId xid, ItemPointer old_tid);
void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
@@ -168,7 +168,8 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (newtuple == NULL)
elog(ERROR, "Incomplete insert info.");
- store_change(ctx, CHANGE_INSERT, newtuple, change->txn->xid);
+ store_change(ctx, CHANGE_INSERT, newtuple, change->txn->xid,
+ NULL);
}
break;
case REORDER_BUFFER_CHANGE_UPDATE:
@@ -186,10 +187,10 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (oldtuple != NULL)
store_change(ctx, CHANGE_UPDATE_OLD, oldtuple,
- change->txn->xid);
+ change->txn->xid, NULL);
store_change(ctx, CHANGE_UPDATE_NEW, newtuple,
- change->txn->xid);
+ change->txn->xid, &change->data.tp.old_tid);
}
break;
case REORDER_BUFFER_CHANGE_DELETE:
@@ -202,7 +203,8 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (oldtuple == NULL)
elog(ERROR, "Incomplete delete info.");
- store_change(ctx, CHANGE_DELETE, oldtuple, change->txn->xid);
+ store_change(ctx, CHANGE_DELETE, oldtuple, change->txn->xid,
+ &change->data.tp.old_tid);
}
break;
default:
@@ -236,13 +238,13 @@ plugin_truncate(struct LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (i == nrelations)
return;
- store_change(ctx, CHANGE_TRUNCATE, NULL, InvalidTransactionId);
+ store_change(ctx, CHANGE_TRUNCATE, NULL, InvalidTransactionId, NULL);
}
/* Store concurrent data change. */
static void
store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
- HeapTuple tuple, TransactionId xid)
+ HeapTuple tuple, TransactionId xid, ItemPointer old_tid)
{
RepackDecodingState *dstate;
char *change_raw;
@@ -317,6 +319,11 @@ store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
change.snapshot = dstate->snapshot;
dstate->snapshot->active_count++;
+ if (old_tid)
+ ItemPointerCopy(old_tid, &change.old_tid);
+ else
+ ItemPointerSetInvalid(&change.old_tid);
+
/* The data has been copied. */
if (flattened)
pfree(tuple);
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 99c3f362adc..eebda35c7cb 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -23,11 +23,14 @@ typedef struct RewriteStateData *RewriteState;
extern RewriteState begin_heap_rewrite(Relation old_heap, Relation new_heap,
TransactionId oldest_xmin, TransactionId freeze_xid,
- MultiXactId cutoff_multi);
+ MultiXactId cutoff_multi, bool tid_chains);
extern void end_heap_rewrite(RewriteState state);
extern void rewrite_heap_tuple(RewriteState state, HeapTuple old_tuple,
HeapTuple new_tuple);
extern bool rewrite_heap_dead_tuple(RewriteState state, HeapTuple old_tuple);
+extern void logical_rewrite_heap_tuple(RewriteState state,
+ ItemPointerData old_tid,
+ HeapTuple new_tuple);
/*
* On-Disk data format for an individual logical rewrite mapping.
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 6f5566210a8..0ab46f265a2 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -78,6 +78,9 @@ typedef struct ConcurrentChange
/* Transaction that changes the data. */
TransactionId xid;
+ /* For UPDATE / DELETE, the location of the old tuple version. */
+ ItemPointerData old_tid;
+
/*
* Historic catalog snapshot that was used to decode this change.
*/
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 3be0cbd7ebe..c2731947b22 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -104,6 +104,13 @@ typedef struct ReorderBufferChange
HeapTuple oldtuple;
/* valid for INSERT || UPDATE */
HeapTuple newtuple;
+
+ /*
+ * REPACK CONCURRENTLY needs the old TID, even if the old tuple
+ * itself is not WAL-logged (i.e. when the identity key does not
+ * change).
+ */
+ ItemPointerData old_tid;
} tp;
/*
--
2.39.5
On Sat, Mar 22, 2025 at 5:43 AM Antonin Houska <ah@cybertec.at> wrote:
Can you please give me an example? I don't recall seeing a lock upgrade in the
tree. That's the reason I tried rather hard to avoid that.
VACUUM has to upgrade the lock in order to truncate away pages at the
end of the table.
Or just:
BEGIN;
SELECT * FROM sometable;
VACUUM FULL sometable;
COMMIT;
I don't think we should commit something that handles locking the way
this patch does. I mean, it would be one thing if you had a strategy
for avoiding erroring out when a deadlock would otherwise occur by
doing something clever. But it seems like you would just need to
detect the same problem in a different way. Doing something
non-standard doesn't make sense unless we get a clear benefit from it.
(Even then it might be unsafe, of course, but at least then you have a
motivation to take the risk.)
- On what basis do you make the statement in the last paragraph that
the decoding-related lag should not exceed one WAL segment? I guess
logical decoding probably keeps up pretty well most of the time but
this seems like a very strong guarantee for something I didn't know we
had any kind of guarantee about.

The patch itself does guarantee that by checking the amount of unprocessed WAL
regularly when it's copying the data into the new table. If too much WAL
appears to be unprocessed, it enforces the decoding before the copying is
resumed.
Hmm. If the source table is not locked against writes, it seems like
we could always get into a situation where this doesn't converge --
you just need to modify the table faster than those changes can be
decoded and applied. Maybe that's different from what we're talking
about here, though.
- What happens if we crash?
The replication slot we create is RS_TEMPORARY, so it disappears after
restart. Everything else is as if the current implementation of CLUSTER ends
due to crash.
Cool.
--
Robert Haas
EDB: http://www.enterprisedb.com
Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
On 2025-Mar-22, Antonin Houska wrote:
Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
I rebased this patch series; here's v09. No substantive changes from v08.
I made sure the tree still compiles after each commit.

I rebased again, fixing a compiler warning reported by CI and applying
pgindent to each individual patch. I'm slowly starting to become more
familiar with the whole of this new code.
Thanks.
I did look at 0002 again [...], and ended up wondering why do we need that
change in the first place. According to the comment where the progress
restore function is called, it's because reorderbuffer.c uses a
subtransaction internally. But I went to look at reorderbuffer.c and
noticed that the subtransaction is only used "when using the SQL
function interface, because that creates a transaction already". So
maybe we should look into making REPACK use reorderbuffer without having
to open a transaction block.Which part of reorderbuffer.c do you mean? ISTM that the use of subransactions
is more extensive. At least ReorderBufferImmediateInvalidation() appears to
rely on it, which in turn is called by xact_decode().Ah, right, I was not looking hard enough. Something to keep in mind --
though I'm still not convinced that it's best to achieve this by
introducing a mechanism to restore progress state. Maybe allowing a
transaction to abort without clobbering the progress state somehow (not
trivial to implement at present though, because of layers of functions
you need to traverse with such a flag; maybe have a global in xact.c
that you set by calling a function? Not sure -- might be worse.)
The problem seems to be specific to the use of BeginInternalSubTransaction():
if it is not called at given code paths, it means that there is no top-level
transaction. However REPACK CONCURRENTLY should always run in a transaction,
so it should always run BeginInternalSubTransaction(). Thus I think we can use
this function to set the flag.
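The flag idea could look roughly like the following. This is a minimal standalone model, not the actual xact.c change; the names (preserve_progress, progress_param, and the lowercase function names) are hypothetical stand-ins for BeginInternalSubTransaction(), AbortSubTransaction(), and the backend progress state:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model: a process-local flag, set when an internal
 * subtransaction is started, tells the subtransaction abort path to
 * leave the command progress state alone.
 */
static bool preserve_progress = false;
static int	progress_param = 0;

static void
begin_internal_subtransaction(void)
{
	/*
	 * REPACK CONCURRENTLY always runs in a transaction, so this is the
	 * one entry point that needs to set the flag.
	 */
	preserve_progress = true;
}

static void
abort_subtransaction(void)
{
	if (!preserve_progress)
		progress_param = 0;		/* normal path: clear progress state */
	preserve_progress = false;	/* flag applies to one abort only */
}
```

A global set this way avoids threading a flag through the layers of abort functions, at the cost of another piece of process-global state in xact.c.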
I also noticed that CI is complaining of a problem in Windows, which is
easily reproducible in non-Windows by defining EXEC_BACKEND. The
backtrace is this:#0 0x000055d4fc24fe96 in hash_search (hashp=0x5606dc2a8c88, keyPtr=0x7ffeab341928, action=HASH_FIND, foundPtr=0x0)
at ../../../../../../../../../pgsql/source/master/src/backend/utils/hash/dynahash.c:960
960 return hash_search_with_hash_value(hashp,
(gdb) bt
#0 0x000055d4fc24fe96 in hash_search (hashp=0x5606dc2a8c88, keyPtr=0x7ffeab341928, action=HASH_FIND, foundPtr=0x0)
at ../../../../../../../../../pgsql/source/master/src/backend/utils/hash/dynahash.c:960
#1 0x000055d4fbea0a46 in is_concurrent_repack_in_progress (relid=21973)
at ../../../../../../../../pgsql/source/master/src/backend/commands/cluster.c:2729
#2 is_concurrent_repack_in_progress (relid=relid@entry=2964)
at ../../../../../../../../pgsql/source/master/src/backend/commands/cluster.c:2706
#3 0x000055d4fc237a87 in RelationBuildDesc (targetRelId=2964, insertIt=insertIt@entry=true)
at ../../../../../../../../../pgsql/source/master/src/backend/utils/cache/relcache.c:1257
#4 0x000055d4fc239456 in RelationIdGetRelation (relationId=<optimized out>, relationId@entry=2964)
at ../../../../../../../../../pgsql/source/master/src/backend/utils/cache/relcache.c:2105

So apparently we're trying to dereference a hash table which isn't
properly set up in the child process.
I'll fix that.
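One defensive shape such a fix could take (the actual fix may instead initialize the table in the child) is to have the lookup treat a not-yet-attached hash table as "no repack in progress". A standalone sketch, with the hash left opaque and the names hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical sketch: before probing the shared hash table, check that
 * this process has actually attached to it.  In an EXEC_BACKEND child
 * the pointer may not be set up yet, in which case no concurrent repack
 * can be visible to this backend anyway.
 */
struct repack_hash;				/* opaque; stands in for the dynahash table */

static struct repack_hash *RepackedRelsHash = NULL;

static bool
is_concurrent_repack_in_progress(unsigned relid)
{
	if (RepackedRelsHash == NULL)
		return false;			/* hash table not attached in this process */

	/* ... real code would do hash_search(RepackedRelsHash, &relid, ...) ... */
	(void) relid;
	return false;
}
```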
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
On 2025-Mar-22, Antonin Houska wrote:
Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
I rebased this patch series; here's v09. No substantive changes from v08.
I made sure the tree still compiles after each commit.

I rebased again, fixing a compiler warning reported by CI and applying
pgindent to each individual patch. I'm slowly starting to become more
familiar with the whole of this new code.
I'm trying to reflect Robert's suggestions about locking [1]. The next version
should be a bit simpler, so maybe wait for it before you continue studying the
code.
[1]: /messages/by-id/CA+TgmobUZ0g==SZv-OSFCQTGFPis5Qz1UsiMn18HGOWzsiyOLQ@mail.gmail.com
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Antonin Houska <ah@cybertec.at> wrote:
Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
On 2025-Mar-22, Antonin Houska wrote:
Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
I rebased this patch series; here's v09. No substantive changes from v08.
I made sure the tree still compiles after each commit.

I rebased again, fixing a compiler warning reported by CI and applying
pgindent to each individual patch. I'm slowly starting to become more
familiar with the whole of this new code.

I'm trying to reflect Robert's suggestions about locking [1]. The next version
should be a bit simpler, so maybe wait for it before you continue studying the
code.
This is it. A few notes:
* Since there's no unlocking during the processing now, the code to check for
catalog changes was removed.
* Some code moved from 0004 to 0005, where it fits better. (Also the commit
messages of these two parts elaborated a bit more.)
* Regarding the progress monitoring, I do in 0004 what I proposed in [1]:
BeginInternalSubTransaction() now sets a flag that avoids clearing of the
progress state in AbortSubTransaction()
* Although I consider 0005 (Preserve visibility information of the concurrent
data changes.) important, it occurs to me now that it might introduce quite
some overhead.
* 0003 is new in the series. I thought it would be needed, then realized it's
not. It might be useful as refactoring though. Please let me know if I
should maintain it or drop it.
[1]: /messages/by-id/80297.1742989179@localhost
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Attachments:
v11-0001-Add-REPACK-command.patchtext/x-diffDownload
From 505320bb41c90081c4212c190fb6afcc991e8caf Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Mon, 31 Mar 2025 15:47:08 +0200
Subject: [PATCH 1/9] Add REPACK command.
The existing CLUSTER command as well as VACUUM with the FULL option both
reclaim unused space by rewriting the table. Now that we want to enhance this
functionality (in particular, by adding a new option CONCURRENTLY), we should
enhance both commands because they are both implemented by the same function
(cluster.c:cluster_rel). However, adding the same option to two different
commands is not very user-friendly. Therefore it was decided to create a new
command and to declare both CLUSTER command and the FULL option of VACUUM
deprecated. Future enhancements to this rewriting code will only affect the
new command.
Like CLUSTER, the REPACK command reorders the table according to the specified
index. Unlike CLUSTER, REPACK does not require an index: if only a table is
specified, the command acts as VACUUM FULL. As we don't want to remove CLUSTER
and VACUUM FULL yet, there are three callers of the cluster_rel() function
now: REPACK, CLUSTER and VACUUM FULL. When we need to distinguish who is
calling this function (mostly for logging, but also for progress reporting),
we can no longer use the OID of the clustering index: both REPACK and VACUUM
FULL can pass InvalidOid. Therefore, this patch introduces a new enumeration
type ClusterCommand, and adds an argument of this type to the cluster_rel()
function and to all the functions that need to distinguish the caller.
Like CLUSTER and VACUUM FULL, the REPACK command without arguments processes
all the tables on which the current user has the MAINTAIN privilege.
A new pg_stat_progress_repack view is added to monitor the progress of
REPACK. Currently it displays the same information as pg_stat_progress_cluster
(except that column names might differ), but it'll also display the status of
the REPACK CONCURRENTLY command in the future, so the view definitions will
eventually diverge.
Regarding user documentation, the patch moves the information on clustering
from cluster.sgml to the new file repack.sgml. cluster.sgml now contains a
link that points to the related section of repack.sgml. A note on deprecation
and a link to repack.sgml are added to both cluster.sgml and vacuum.sgml.
---
doc/src/sgml/monitoring.sgml | 230 +++++++++++
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/cluster.sgml | 79 +---
doc/src/sgml/ref/repack.sgml | 254 ++++++++++++
doc/src/sgml/ref/vacuum.sgml | 8 +
doc/src/sgml/reference.sgml | 1 +
src/backend/access/heap/heapam_handler.c | 32 +-
src/backend/catalog/index.c | 2 +-
src/backend/catalog/system_views.sql | 27 ++
src/backend/commands/cluster.c | 496 +++++++++++++++++------
src/backend/commands/tablecmds.c | 3 +-
src/backend/commands/vacuum.c | 3 +-
src/backend/parser/gram.y | 63 ++-
src/backend/tcop/utility.c | 9 +
src/backend/utils/adt/pgstatfuncs.c | 2 +
src/bin/psql/tab-complete.in.c | 31 +-
src/include/commands/cluster.h | 22 +-
src/include/commands/progress.h | 60 ++-
src/include/nodes/parsenodes.h | 13 +
src/include/parser/kwlist.h | 1 +
src/include/tcop/cmdtaglist.h | 1 +
src/include/utils/backend_progress.h | 1 +
src/test/regress/expected/cluster.out | 180 ++++++++
src/test/regress/expected/rules.out | 27 ++
src/test/regress/sql/cluster.sql | 73 ++++
src/tools/pgindent/typedefs.list | 2 +
26 files changed, 1385 insertions(+), 236 deletions(-)
create mode 100644 doc/src/sgml/ref/repack.sgml
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index a6d67d2fbaa..0a6229c391a 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -400,6 +400,14 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_progress_repack</structname><indexterm><primary>pg_stat_progress_repack</primary></indexterm></entry>
+ <entry>One row for each backend running
+ <command>REPACK</command>, showing current progress. See
+ <xref linkend="repack-progress-reporting"/>.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_progress_basebackup</structname><indexterm><primary>pg_stat_progress_basebackup</primary></indexterm></entry>
<entry>One row for each WAL sender process streaming a base backup,
@@ -5940,6 +5948,228 @@ FROM pg_stat_get_backend_idset() AS backendid;
</table>
</sect2>
+ <sect2 id="repack-progress-reporting">
+ <title>REPACK Progress Reporting</title>
+
+ <indexterm>
+ <primary>pg_stat_progress_repack</primary>
+ </indexterm>
+
+ <para>
+ Whenever <command>REPACK</command> is running,
+ the <structname>pg_stat_progress_repack</structname> view will contain a
+ row for each backend that is currently running the command. The tables
+ below describe the information that will be reported and provide
+ information about how to interpret it.
+ </para>
+
+ <table id="pg-stat-progress-repack-view" xreflabel="pg_stat_progress_repack">
+ <title><structname>pg_stat_progress_repack</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>pid</structfield> <type>integer</type>
+ </para>
+ <para>
+ Process ID of backend.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>datid</structfield> <type>oid</type>
+ </para>
+ <para>
+ OID of the database to which this backend is connected.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>datname</structfield> <type>name</type>
+ </para>
+ <para>
+ Name of the database to which this backend is connected.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>relid</structfield> <type>oid</type>
+ </para>
+ <para>
+ OID of the table being repacked.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>command</structfield> <type>text</type>
+ </para>
+ <para>
+ The command that is running. Currently, the only value
+ is <literal>REPACK</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>phase</structfield> <type>text</type>
+ </para>
+ <para>
+ Current processing phase. See <xref linkend="repack-phases"/>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>repack_index_relid</structfield> <type>oid</type>
+ </para>
+ <para>
+ If the table is being scanned using an index, this is the OID of the
+ index being used; otherwise, it is zero.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_scanned</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples scanned.
+ This counter only advances when the phase is
+ <literal>seq scanning heap</literal>,
+ <literal>index scanning heap</literal>
+ or <literal>writing new heap</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_written</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples written.
+ This counter only advances when the phase is
+ <literal>seq scanning heap</literal>,
+ <literal>index scanning heap</literal>
+ or <literal>writing new heap</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_blks_total</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Total number of heap blocks in the table. This number is reported
+ as of the beginning of <literal>seq scanning heap</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_blks_scanned</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap blocks scanned. This counter only advances when the
+ phase is <literal>seq scanning heap</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>index_rebuild_count</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of indexes rebuilt. This counter only advances when the phase
+ is <literal>rebuilding index</literal>.
+ </para></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <table id="repack-phases">
+ <title>REPACK Phases</title>
+ <tgroup cols="2">
+ <colspec colname="col1" colwidth="1*"/>
+ <colspec colname="col2" colwidth="2*"/>
+ <thead>
+ <row>
+ <entry>Phase</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry><literal>initializing</literal></entry>
+ <entry>
+ The command is preparing to begin scanning the heap. This phase is
+ expected to be very brief.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>seq scanning heap</literal></entry>
+ <entry>
+ The command is currently scanning the table using a sequential scan.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>index scanning heap</literal></entry>
+ <entry>
+ <command>REPACK</command> is currently scanning the table using an index scan.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>sorting tuples</literal></entry>
+ <entry>
+ <command>REPACK</command> is currently sorting tuples.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>writing new heap</literal></entry>
+ <entry>
+ <command>REPACK</command> is currently writing the new heap.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>swapping relation files</literal></entry>
+ <entry>
+ The command is currently swapping newly-built files into place.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>rebuilding index</literal></entry>
+ <entry>
+ The command is currently rebuilding an index.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>performing final cleanup</literal></entry>
+ <entry>
+ The command is performing final cleanup. When this phase is
+ completed, <command>REPACK</command> will end.
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+ </sect2>
+
<sect2 id="copy-progress-reporting">
<title>COPY Progress Reporting</title>
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..c0ef654fcb4 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -167,6 +167,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY refreshMaterializedView SYSTEM "refresh_materialized_view.sgml">
<!ENTITY reindex SYSTEM "reindex.sgml">
<!ENTITY releaseSavepoint SYSTEM "release_savepoint.sgml">
+<!ENTITY repack SYSTEM "repack.sgml">
<!ENTITY reset SYSTEM "reset.sgml">
<!ENTITY revoke SYSTEM "revoke.sgml">
<!ENTITY rollback SYSTEM "rollback.sgml">
diff --git a/doc/src/sgml/ref/cluster.sgml b/doc/src/sgml/ref/cluster.sgml
index 8811f169ea0..54bb2362c84 100644
--- a/doc/src/sgml/ref/cluster.sgml
+++ b/doc/src/sgml/ref/cluster.sgml
@@ -42,17 +42,23 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
<replaceable class="parameter">table_name</replaceable>.
</para>
- <para>
- When a table is clustered, it is physically reordered
- based on the index information. Clustering is a one-time operation:
- when the table is subsequently updated, the changes are
- not clustered. That is, no attempt is made to store new or
- updated rows according to their index order. (If one wishes, one can
- periodically recluster by issuing the command again. Also, setting
- the table's <literal>fillfactor</literal> storage parameter to less than
- 100% can aid in preserving cluster ordering during updates, since updated
- rows are kept on the same page if enough space is available there.)
- </para>
+ <warning>
+ <para>
+ The <command>CLUSTER</command> command is deprecated in favor of
+ <xref linkend="sql-repack"/>.
+ </para>
+ </warning>
+
+ <note>
+ <para>
+ <xref linkend="sql-repack-notes-on-clustering"/> explains how clustering
+ works, whether it is initiated by <command>CLUSTER</command> or
+ by <command>REPACK</command>. The notable difference between the two is
+ that <command>REPACK</command> does not remember the index used last
+ time. Thus if you don't specify an index, <command>REPACK</command>
+ rewrites the table but does not try to cluster it.
+ </para>
+ </note>
<para>
When a table is clustered, <productname>PostgreSQL</productname>
@@ -136,63 +142,12 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
on the table.
</para>
- <para>
- In cases where you are accessing single rows randomly
- within a table, the actual order of the data in the
- table is unimportant. However, if you tend to access some
- data more than others, and there is an index that groups
- them together, you will benefit from using <command>CLUSTER</command>.
- If you are requesting a range of indexed values from a table, or a
- single indexed value that has multiple rows that match,
- <command>CLUSTER</command> will help because once the index identifies the
- table page for the first row that matches, all other rows
- that match are probably already on the same table page,
- and so you save disk accesses and speed up the query.
- </para>
-
- <para>
- <command>CLUSTER</command> can re-sort the table using either an index scan
- on the specified index, or (if the index is a b-tree) a sequential
- scan followed by sorting. It will attempt to choose the method that
- will be faster, based on planner cost parameters and available statistical
- information.
- </para>
-
<para>
While <command>CLUSTER</command> is running, the <xref
linkend="guc-search-path"/> is temporarily changed to <literal>pg_catalog,
pg_temp</literal>.
</para>
- <para>
- When an index scan is used, a temporary copy of the table is created that
- contains the table data in the index order. Temporary copies of each
- index on the table are created as well. Therefore, you need free space on
- disk at least equal to the sum of the table size and the index sizes.
- </para>
-
- <para>
- When a sequential scan and sort is used, a temporary sort file is
- also created, so that the peak temporary space requirement is as much
- as double the table size, plus the index sizes. This method is often
- faster than the index scan method, but if the disk space requirement is
- intolerable, you can disable this choice by temporarily setting <xref
- linkend="guc-enable-sort"/> to <literal>off</literal>.
- </para>
-
- <para>
- It is advisable to set <xref linkend="guc-maintenance-work-mem"/> to
- a reasonably large value (but not more than the amount of RAM you can
- dedicate to the <command>CLUSTER</command> operation) before clustering.
- </para>
-
- <para>
- Because the planner records statistics about the ordering of
- tables, it is advisable to run <link linkend="sql-analyze"><command>ANALYZE</command></link>
- on the newly clustered table.
- Otherwise, the planner might make poor choices of query plans.
- </para>
-
<para>
Because <command>CLUSTER</command> remembers which indexes are clustered,
one can cluster the tables one wants clustered manually the first time,
diff --git a/doc/src/sgml/ref/repack.sgml b/doc/src/sgml/ref/repack.sgml
new file mode 100644
index 00000000000..84f3c3e3f2b
--- /dev/null
+++ b/doc/src/sgml/ref/repack.sgml
@@ -0,0 +1,254 @@
+<!--
+doc/src/sgml/ref/repack.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-repack">
+ <indexterm zone="sql-repack">
+ <primary>REPACK</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle>REPACK</refentrytitle>
+ <manvolnum>7</manvolnum>
+ <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>REPACK</refname>
+ <refpurpose>cluster a table according to an index</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <replaceable class="parameter">table_name</replaceable> [ USING INDEX <replaceable class="parameter">index_name</replaceable> ] ]
+
+<phrase>where <replaceable class="parameter">option</replaceable> can be one of:</phrase>
+
+ VERBOSE [ <replaceable class="parameter">boolean</replaceable> ]
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+
+ <para>
+ <command>REPACK</command> reclaims storage occupied by dead
+ tuples. Unlike <command>VACUUM</command>, it does so by rewriting the
+ entire contents of the table specified
+ by <replaceable class="parameter">table_name</replaceable> into a new disk
+ file with no extra space (except for the space guaranteed by
+ the <literal>fillfactor</literal> storage parameter), allowing unused space
+ to be returned to the operating system.
+ </para>
+
+ <para>
+ Without
+ a <replaceable class="parameter">table_name</replaceable>, <command>REPACK</command>
+ processes every table and materialized view in the current database that
+ the current user has the <literal>MAINTAIN</literal> privilege on. This
+ form of <command>REPACK</command> cannot be executed inside a transaction
+ block.
+ </para>
+
+ <para>
+ If <replaceable class="parameter">index_name</replaceable> is specified,
+ the table is clustered by this index. Please see the notes on clustering
+ below.
+ </para>
+
+ <para>
+ When a table is being repacked, an <literal>ACCESS EXCLUSIVE</literal> lock
+ is acquired on it. This prevents any other database operations (both reads
+ and writes) from operating on the table until the <command>REPACK</command>
+ is finished.
+ </para>
+
+ <refsect2 id="sql-repack-notes-on-clustering" xreflabel="Notes on Clustering">
+ <title>Notes on Clustering</title>
+
+ <para>
+ When a table is clustered, it is physically reordered based on the index
+ information. Clustering is a one-time operation: when the table is
+ subsequently updated, the changes are not clustered. That is, no attempt
+ is made to store new or updated rows according to their index order. (If
+ one wishes, one can periodically recluster by issuing the command again.
+ Also, setting the table's <literal>fillfactor</literal> storage parameter
+ to less than 100% can aid in preserving cluster ordering during updates,
+ since updated rows are kept on the same page if enough space is available
+ there.)
+ </para>
+
+ <para>
+ In cases where you are accessing single rows randomly within a table, the
+ actual order of the data in the table is unimportant. However, if you tend
+ to access some data more than others, and there is an index that groups
+ them together, you will benefit from using <command>REPACK</command>. If
+ you are requesting a range of indexed values from a table, or a single
+ indexed value that has multiple rows that match,
+ <command>REPACK</command> will help because once the index identifies the
+ table page for the first row that matches, all other rows that match are
+ probably already on the same table page, and so you save disk accesses and
+ speed up the query.
+ </para>
+
+ <para>
+ <command>REPACK</command> can re-sort the table using either an index scan
+ on the specified index, or (if the index is a b-tree) a sequential scan
+ followed by sorting. It will attempt to choose the method that will be
+ faster, based on planner cost parameters and available statistical
+ information.
+ </para>
+
+ <para>
+ Because the planner records statistics about the ordering of tables, it is
+ advisable to
+ run <link linkend="sql-analyze"><command>ANALYZE</command></link> on the
+ newly repacked table. Otherwise, the planner might make poor choices of
+ query plans.
+ </para>
+ </refsect2>
+
+ <refsect2 id="sql-repack-notes-on-resources" xreflabel="Notes on Resources">
+ <title>Notes on Resources</title>
+
+ <para>
+ When an index scan or a sequential scan without sort is used, a temporary
+ copy of the table is created that contains the table data in the index
+ order. Temporary copies of each index on the table are created as well.
+ Therefore, you need free space on disk at least equal to the sum of the
+ table size and the index sizes.
+ </para>
+
+ <para>
+ When a sequential scan and sort is used, a temporary sort file is also
+ created, so that the peak temporary space requirement is as much as double
+ the table size, plus the index sizes. This method is often faster than
+ the index scan method, but if the disk space requirement is intolerable,
+ you can disable this choice by temporarily setting
+ <xref linkend="guc-enable-sort"/> to <literal>off</literal>.
+ </para>
+
+ <para>
+ It is advisable to set <xref linkend="guc-maintenance-work-mem"/> to a
+ reasonably large value (but not more than the amount of RAM you can
+ dedicate to the <command>REPACK</command> operation) before repacking.
+ </para>
+ </refsect2>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Parameters</title>
+
+ <variablelist>
+ <varlistentry>
+ <term><replaceable class="parameter">table_name</replaceable></term>
+ <listitem>
+ <para>
+ The name (possibly schema-qualified) of a table.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><replaceable class="parameter">index_name</replaceable></term>
+ <listitem>
+ <para>
+ The name of an index.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>VERBOSE</literal></term>
+ <listitem>
+ <para>
+ Prints a progress report as each table is clustered
+ at <literal>INFO</literal> level.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><replaceable class="parameter">boolean</replaceable></term>
+ <listitem>
+ <para>
+ Specifies whether the selected option should be turned on or off.
+ You can write <literal>TRUE</literal>, <literal>ON</literal>, or
+ <literal>1</literal> to enable the option, and <literal>FALSE</literal>,
+ <literal>OFF</literal>, or <literal>0</literal> to disable it. The
+ <replaceable class="parameter">boolean</replaceable> value can also
+ be omitted, in which case <literal>TRUE</literal> is assumed.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </refsect1>
+
+ <refsect1>
+ <title>Notes</title>
+
+ <para>
+ To repack a table, one must have the <literal>MAINTAIN</literal> privilege
+ on the table.
+ </para>
+
+ <para>
+ While <command>REPACK</command> is running, the <xref
+ linkend="guc-search-path"/> is temporarily changed to <literal>pg_catalog,
+ pg_temp</literal>.
+ </para>
+
+ <para>
+ Each backend running <command>REPACK</command> will report its progress
+ in the <structname>pg_stat_progress_repack</structname> view. See
+ <xref linkend="repack-progress-reporting"/> for details.
+ </para>
+
+ <para>
+ Repacking a partitioned table repacks each of its partitions. If an index
+ is specified, each partition is clustered using the partition of that
+ index. <command>REPACK</command> on a partitioned table cannot be executed
+ inside a transaction block.
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Examples</title>
+
+ <para>
+ Repack the table <literal>employees</literal>:
+<programlisting>
+REPACK employees;
+</programlisting>
+ </para>
+
+
+ <para>
+ Cluster the table <literal>employees</literal> on the basis of its
+ index <literal>employees_ind</literal>:
+<programlisting>
+REPACK employees USING INDEX employees_ind;
+</programlisting>
+ </para>
+
+ <para>
+ Repack all tables in the database on which you have
+ the <literal>MAINTAIN</literal> privilege:
+<programlisting>
+REPACK;
+</programlisting></para>
+ </refsect1>
+
+ <refsect1>
+ <title>Compatibility</title>
+
+ <para>
+ There is no <command>REPACK</command> statement in the SQL standard.
+ </para>
+
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index bd5dcaf86a5..735a2a7703a 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -98,6 +98,14 @@ VACUUM [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
<varlistentry>
<term><literal>FULL</literal></term>
<listitem>
+
+ <warning>
+ <para>
+ The <command>FULL</command> parameter is deprecated in favor of
+ <xref linkend="sql-repack"/>.
+ </para>
+ </warning>
+
<para>
Selects <quote>full</quote> vacuum, which can reclaim more
space, but takes much longer and exclusively locks the table.
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..229912d35b7 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -195,6 +195,7 @@
&refreshMaterializedView;
&reindex;
&releaseSavepoint;
+ &repack;
&reset;
&revoke;
&rollback;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 24d3765aa20..18e349c3466 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -741,13 +741,13 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
if (OldIndex != NULL && !use_sort)
{
const int ci_index[] = {
- PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_INDEX_RELID
+ PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_INDEX_RELID
};
int64 ci_val[2];
/* Set phase and OIDOldIndex to columns */
- ci_val[0] = PROGRESS_CLUSTER_PHASE_INDEX_SCAN_HEAP;
+ ci_val[0] = PROGRESS_REPACK_PHASE_INDEX_SCAN_HEAP;
ci_val[1] = RelationGetRelid(OldIndex);
pgstat_progress_update_multi_param(2, ci_index, ci_val);
@@ -759,15 +759,15 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
else
{
/* In scan-and-sort mode and also VACUUM FULL, set phase */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_SEQ_SCAN_HEAP);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_SEQ_SCAN_HEAP);
tableScan = table_beginscan(OldHeap, SnapshotAny, 0, (ScanKey) NULL);
heapScan = (HeapScanDesc) tableScan;
indexScan = NULL;
/* Set total heap blocks */
- pgstat_progress_update_param(PROGRESS_CLUSTER_TOTAL_HEAP_BLKS,
+ pgstat_progress_update_param(PROGRESS_REPACK_TOTAL_HEAP_BLKS,
heapScan->rs_nblocks);
}
@@ -809,7 +809,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
* is manually updated to the correct value when the table
* scan finishes.
*/
- pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_BLKS_SCANNED,
+ pgstat_progress_update_param(PROGRESS_REPACK_HEAP_BLKS_SCANNED,
heapScan->rs_nblocks);
break;
}
@@ -825,7 +825,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
*/
if (prev_cblock != heapScan->rs_cblock)
{
- pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_BLKS_SCANNED,
+ pgstat_progress_update_param(PROGRESS_REPACK_HEAP_BLKS_SCANNED,
(heapScan->rs_cblock +
heapScan->rs_nblocks -
heapScan->rs_startblock
@@ -912,14 +912,14 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
* In scan-and-sort mode, report increase in number of tuples
* scanned
*/
- pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED,
+ pgstat_progress_update_param(PROGRESS_REPACK_HEAP_TUPLES_SCANNED,
*num_tuples);
}
else
{
const int ct_index[] = {
- PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED,
- PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN
+ PROGRESS_REPACK_HEAP_TUPLES_SCANNED,
+ PROGRESS_REPACK_HEAP_TUPLES_WRITTEN
};
int64 ct_val[2];
@@ -952,14 +952,14 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
double n_tuples = 0;
/* Report that we are now sorting tuples */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_SORT_TUPLES);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_SORT_TUPLES);
tuplesort_performsort(tuplesort);
/* Report that we are now writing new heap */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_WRITE_NEW_HEAP);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_WRITE_NEW_HEAP);
for (;;)
{
@@ -977,7 +977,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
values, isnull,
rwstate);
/* Report n_tuples */
- pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN,
+ pgstat_progress_update_param(PROGRESS_REPACK_HEAP_TUPLES_WRITTEN,
n_tuples);
}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 739a92bdcc1..466cf0fdef6 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -4079,7 +4079,7 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
Assert(!ReindexIsProcessingIndex(indexOid));
/* Set index rebuild count */
- pgstat_progress_update_param(PROGRESS_CLUSTER_INDEX_REBUILD_COUNT,
+ pgstat_progress_update_param(PROGRESS_REPACK_INDEX_REBUILD_COUNT,
i);
i++;
}
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 31d269b7ee0..5de46bcac52 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1262,6 +1262,33 @@ CREATE VIEW pg_stat_progress_cluster AS
FROM pg_stat_get_progress_info('CLUSTER') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
+CREATE VIEW pg_stat_progress_repack AS
+ SELECT
+ S.pid AS pid,
+ S.datid AS datid,
+ D.datname AS datname,
+ S.relid AS relid,
+ CASE S.param1 WHEN 1 THEN 'REPACK'
+ END AS command,
+ CASE S.param2 WHEN 0 THEN 'initializing'
+ WHEN 1 THEN 'seq scanning heap'
+ WHEN 2 THEN 'index scanning heap'
+ WHEN 3 THEN 'sorting tuples'
+ WHEN 4 THEN 'writing new heap'
+ WHEN 5 THEN 'swapping relation files'
+ WHEN 6 THEN 'rebuilding index'
+ WHEN 7 THEN 'performing final cleanup'
+ END AS phase,
+ CAST(S.param3 AS oid) AS repack_index_relid,
+ S.param4 AS heap_tuples_scanned,
+ S.param5 AS heap_tuples_written,
+ S.param6 AS heap_blks_total,
+ S.param7 AS heap_blks_scanned,
+ S.param8 AS index_rebuild_count
+ FROM pg_stat_get_progress_info('REPACK') AS S
+ LEFT JOIN pg_database D ON S.datid = D.oid;
+
+
CREATE VIEW pg_stat_progress_create_index AS
SELECT
S.pid AS pid, S.datid AS datid, D.datname AS datname,
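As a usage sketch (not part of the patch): with the pg_stat_progress_repack view above in place, a REPACK running in another session could be monitored roughly as follows. The column names follow the view definition; the concurrently running REPACK session is an assumption.

```sql
-- Hypothetical monitoring query against the new view; assumes another
-- session is currently executing REPACK on some table.
SELECT pid,
       datname,
       relid::regclass AS table_name,
       command,
       phase,
       heap_blks_scanned,
       heap_blks_total,
       round(100.0 * heap_blks_scanned / NULLIF(heap_blks_total, 0), 1)
           AS pct_scanned
FROM pg_stat_progress_repack;
```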
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 54a08e4102e..9ae3d87e412 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -46,6 +46,7 @@
#include "storage/lmgr.h"
#include "storage/predicate.h"
#include "utils/acl.h"
+#include "utils/formatting.h"
#include "utils/fmgroids.h"
#include "utils/guc.h"
#include "utils/inval.h"
@@ -67,17 +68,33 @@ typedef struct
Oid indexOid;
} RelToCluster;
-
-static void cluster_multiple_rels(List *rtcs, ClusterParams *params);
-static void rebuild_relation(Relation OldHeap, Relation index, bool verbose);
+/*
+ * Map the value of ClusterCommand to string.
+ */
+#define CLUSTER_COMMAND_STR(cmd) ((cmd) == CLUSTER_COMMAND_CLUSTER ? \
+ "cluster" : \
+ ((cmd) == CLUSTER_COMMAND_REPACK ? \
+ "repack" : "vacuum"))
+
+static void cluster_multiple_rels(List *rtcs, ClusterParams *params,
+ ClusterCommand cmd);
+static void rebuild_relation(Relation OldHeap, Relation index, bool verbose,
+ ClusterCommand cmd);
static void copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
- bool verbose, bool *pSwapToastByContent,
+ bool verbose, ClusterCommand cmd,
+ bool *pSwapToastByContent,
TransactionId *pFreezeXid, MultiXactId *pCutoffMulti);
static List *get_tables_to_cluster(MemoryContext cluster_context);
+static List *get_tables_to_repack(MemoryContext repack_context);
static List *get_tables_to_cluster_partitioned(MemoryContext cluster_context,
- Oid indexOid);
-static bool cluster_is_permitted_for_relation(Oid relid, Oid userid);
-
+ Oid relid, bool rel_is_index,
+ ClusterCommand cmd);
+static bool cluster_is_permitted_for_relation(Oid relid, Oid userid,
+ ClusterCommand cmd);
+static Relation process_single_relation(RangeVar *relation, char *indexname,
+ ClusterCommand cmd,
+ ClusterParams *params,
+ Oid *indexOid_p);
/*---------------------------------------------------------------------------
* This cluster code allows for clustering multiple tables at once. Because
@@ -133,72 +150,11 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
if (stmt->relation != NULL)
{
- /* This is the single-relation case. */
- Oid tableOid;
-
- /*
- * Find, lock, and check permissions on the table. We obtain
- * AccessExclusiveLock right away to avoid lock-upgrade hazard in the
- * single-transaction case.
- */
- tableOid = RangeVarGetRelidExtended(stmt->relation,
- AccessExclusiveLock,
- 0,
- RangeVarCallbackMaintainsTable,
- NULL);
- rel = table_open(tableOid, NoLock);
-
- /*
- * Reject clustering a remote temp table ... their local buffer
- * manager is not going to cope.
- */
- if (RELATION_IS_OTHER_TEMP(rel))
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot cluster temporary tables of other sessions")));
-
- if (stmt->indexname == NULL)
- {
- ListCell *index;
-
- /* We need to find the index that has indisclustered set. */
- foreach(index, RelationGetIndexList(rel))
- {
- indexOid = lfirst_oid(index);
- if (get_index_isclustered(indexOid))
- break;
- indexOid = InvalidOid;
- }
-
- if (!OidIsValid(indexOid))
- ereport(ERROR,
- (errcode(ERRCODE_UNDEFINED_OBJECT),
- errmsg("there is no previously clustered index for table \"%s\"",
- stmt->relation->relname)));
- }
- else
- {
- /*
- * The index is expected to be in the same namespace as the
- * relation.
- */
- indexOid = get_relname_relid(stmt->indexname,
- rel->rd_rel->relnamespace);
- if (!OidIsValid(indexOid))
- ereport(ERROR,
- (errcode(ERRCODE_UNDEFINED_OBJECT),
- errmsg("index \"%s\" for table \"%s\" does not exist",
- stmt->indexname, stmt->relation->relname)));
- }
-
- /* For non-partitioned tables, do what we came here to do. */
- if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
- {
- cluster_rel(rel, indexOid, ¶ms);
- /* cluster_rel closes the relation, but keeps lock */
-
+ rel = process_single_relation(stmt->relation, stmt->indexname,
+ CLUSTER_COMMAND_CLUSTER, ¶ms,
+ &indexOid);
+ if (rel == NULL)
return;
- }
}
/*
@@ -230,8 +186,11 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
if (rel != NULL)
{
Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
- check_index_is_clusterable(rel, indexOid, AccessShareLock);
- rtcs = get_tables_to_cluster_partitioned(cluster_context, indexOid);
+ check_index_is_clusterable(rel, indexOid, AccessShareLock,
+ CLUSTER_COMMAND_CLUSTER);
+ rtcs = get_tables_to_cluster_partitioned(cluster_context, indexOid,
+ true,
+ CLUSTER_COMMAND_CLUSTER);
/* close relation, releasing lock on parent table */
table_close(rel, AccessExclusiveLock);
@@ -243,7 +202,7 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
}
/* Do the job. */
- cluster_multiple_rels(rtcs, ¶ms);
+ cluster_multiple_rels(rtcs, ¶ms, CLUSTER_COMMAND_CLUSTER);
/* Start a new transaction for the cleanup work. */
StartTransactionCommand();
@@ -260,7 +219,8 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
* return.
*/
static void
-cluster_multiple_rels(List *rtcs, ClusterParams *params)
+cluster_multiple_rels(List *rtcs, ClusterParams *params,
+ ClusterCommand cmd)
{
ListCell *lc;
@@ -283,7 +243,7 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
rel = table_open(rtc->tableOid, AccessExclusiveLock);
/* Process this table */
- cluster_rel(rel, rtc->indexOid, params);
+ cluster_rel(rel, rtc->indexOid, params, cmd);
/* cluster_rel closes the relation, but keeps lock */
PopActiveSnapshot();
@@ -306,9 +266,13 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
* If indexOid is InvalidOid, the table will be rewritten in physical order
* instead of index order. This is the new implementation of VACUUM FULL,
* and error messages should refer to the operation as VACUUM not CLUSTER.
+ *
+ * 'cmd' indicates which command is being executed. REPACK should be the only
+ * caller of this function in the future.
*/
void
-cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
+cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
+ ClusterCommand cmd)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid save_userid;
@@ -317,19 +281,33 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
bool verbose = ((params->options & CLUOPT_VERBOSE) != 0);
bool recheck = ((params->options & CLUOPT_RECHECK) != 0);
Relation index;
+ const char *cmd_str = CLUSTER_COMMAND_STR(cmd);
Assert(CheckRelationLockedByMe(OldHeap, AccessExclusiveLock, false));
/* Check for user-requested abort. */
CHECK_FOR_INTERRUPTS();
- pgstat_progress_start_command(PROGRESS_COMMAND_CLUSTER, tableOid);
- if (OidIsValid(indexOid))
- pgstat_progress_update_param(PROGRESS_CLUSTER_COMMAND,
+ if (cmd == CLUSTER_COMMAND_REPACK)
+ pgstat_progress_start_command(PROGRESS_COMMAND_REPACK, tableOid);
+ else
+ pgstat_progress_start_command(PROGRESS_COMMAND_CLUSTER, tableOid);
+
+ if (cmd == CLUSTER_COMMAND_REPACK)
+ pgstat_progress_update_param(PROGRESS_REPACK_COMMAND,
+ PROGRESS_REPACK_COMMAND_REPACK);
+ else if (OidIsValid(indexOid))
+ {
+ Assert(cmd == CLUSTER_COMMAND_CLUSTER);
+ pgstat_progress_update_param(PROGRESS_REPACK_COMMAND,
PROGRESS_CLUSTER_COMMAND_CLUSTER);
+ }
else
- pgstat_progress_update_param(PROGRESS_CLUSTER_COMMAND,
+ {
+ Assert(cmd == CLUSTER_COMMAND_VACUUM);
+ pgstat_progress_update_param(PROGRESS_REPACK_COMMAND,
PROGRESS_CLUSTER_COMMAND_VACUUM_FULL);
+ }
/*
* Switch to the table owner's userid, so that any index functions are run
@@ -353,7 +331,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
if (recheck)
{
/* Check that the user still has privileges for the relation */
- if (!cluster_is_permitted_for_relation(tableOid, save_userid))
+ if (!cluster_is_permitted_for_relation(tableOid, save_userid, cmd))
{
relation_close(OldHeap, AccessExclusiveLock);
goto out;
@@ -403,39 +381,38 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
* would work in most respects, but the index would only get marked as
* indisclustered in the current database, leading to unexpected behavior
* if CLUSTER were later invoked in another database.
+ *
+ * REPACK does not set indisclustered. XXX Not sure I understand the
+ * comment above: how can an attribute be set "only in the current
+ * database"?
*/
- if (OidIsValid(indexOid) && OldHeap->rd_rel->relisshared)
+ if (cmd == CLUSTER_COMMAND_CLUSTER && OldHeap->rd_rel->relisshared)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot cluster a shared catalog")));
+ errmsg("cannot %s a shared catalog", cmd_str)));
/*
* Don't process temp tables of other backends ... their local buffer
* manager is not going to cope.
*/
if (RELATION_IS_OTHER_TEMP(OldHeap))
- {
- if (OidIsValid(indexOid))
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot cluster temporary tables of other sessions")));
- else
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot vacuum temporary tables of other sessions")));
- }
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot %s temporary tables of other sessions",
+ cmd_str)));
/*
* Also check for active uses of the relation in the current transaction,
* including open scans and pending AFTER trigger events.
*/
- CheckTableNotInUse(OldHeap, OidIsValid(indexOid) ? "CLUSTER" : "VACUUM");
+ CheckTableNotInUse(OldHeap, asc_toupper(cmd_str, strlen(cmd_str)));
/* Check heap and index are valid to cluster on */
if (OidIsValid(indexOid))
{
/* verify the index is good and lock it */
- check_index_is_clusterable(OldHeap, indexOid, AccessExclusiveLock);
+ check_index_is_clusterable(OldHeap, indexOid, AccessExclusiveLock,
+ cmd);
/* also open it */
index = index_open(indexOid, NoLock);
}
@@ -469,7 +446,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
TransferPredicateLocksToHeapRelation(OldHeap);
/* rebuild_relation does all the dirty work */
- rebuild_relation(OldHeap, index, verbose);
+ rebuild_relation(OldHeap, index, verbose, cmd);
/* rebuild_relation closes OldHeap, and index if valid */
out:
@@ -491,9 +468,11 @@ out:
* protection here.
*/
void
-check_index_is_clusterable(Relation OldHeap, Oid indexOid, LOCKMODE lockmode)
+check_index_is_clusterable(Relation OldHeap, Oid indexOid, LOCKMODE lockmode,
+ ClusterCommand cmd)
{
Relation OldIndex;
+ const char *cmd_str = CLUSTER_COMMAND_STR(cmd);
OldIndex = index_open(indexOid, lockmode);
@@ -512,8 +491,8 @@ check_index_is_clusterable(Relation OldHeap, Oid indexOid, LOCKMODE lockmode)
if (!OldIndex->rd_indam->amclusterable)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot cluster on index \"%s\" because access method does not support clustering",
- RelationGetRelationName(OldIndex))));
+ errmsg("cannot %s on index \"%s\" because access method does not support clustering",
+ cmd_str, RelationGetRelationName(OldIndex))));
/*
* Disallow clustering on incomplete indexes (those that might not index
@@ -524,7 +503,8 @@ check_index_is_clusterable(Relation OldHeap, Oid indexOid, LOCKMODE lockmode)
if (!heap_attisnull(OldIndex->rd_indextuple, Anum_pg_index_indpred, NULL))
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot cluster on partial index \"%s\"",
+ errmsg("cannot %s on partial index \"%s\"",
+ cmd_str,
RelationGetRelationName(OldIndex))));
/*
@@ -538,8 +518,8 @@ check_index_is_clusterable(Relation OldHeap, Oid indexOid, LOCKMODE lockmode)
if (!OldIndex->rd_index->indisvalid)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot cluster on invalid index \"%s\"",
- RelationGetRelationName(OldIndex))));
+ errmsg("cannot %s on invalid index \"%s\"",
+ cmd_str, RelationGetRelationName(OldIndex))));
/* Drop relcache refcnt on OldIndex, but keep lock */
index_close(OldIndex, NoLock);
@@ -626,7 +606,8 @@ mark_index_clustered(Relation rel, Oid indexOid, bool is_internal)
* On exit, they are closed, but locks on them are not released.
*/
static void
-rebuild_relation(Relation OldHeap, Relation index, bool verbose)
+rebuild_relation(Relation OldHeap, Relation index, bool verbose,
+ ClusterCommand cmd)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid accessMethod = OldHeap->rd_rel->relam;
@@ -664,7 +645,7 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose)
NewHeap = table_open(OIDNewHeap, NoLock);
/* Copy the heap data into the new table in the desired order */
- copy_table_data(NewHeap, OldHeap, index, verbose,
+ copy_table_data(NewHeap, OldHeap, index, verbose, cmd,
&swap_toast_by_content, &frozenXid, &cutoffMulti);
@@ -829,8 +810,8 @@ make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
*/
static void
copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verbose,
- bool *pSwapToastByContent, TransactionId *pFreezeXid,
- MultiXactId *pCutoffMulti)
+ ClusterCommand cmd, bool *pSwapToastByContent,
+ TransactionId *pFreezeXid, MultiXactId *pCutoffMulti)
{
Relation relRelation;
HeapTuple reltup;
@@ -845,6 +826,7 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
tups_recently_dead = 0;
BlockNumber num_pages;
int elevel = verbose ? INFO : DEBUG2;
+ const char *cmd_str = CLUSTER_COMMAND_STR(cmd);
PGRUsage ru0;
char *nspname;
@@ -958,18 +940,21 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
/* Log what we're doing */
if (OldIndex != NULL && !use_sort)
ereport(elevel,
- (errmsg("clustering \"%s.%s\" using index scan on \"%s\"",
+ (errmsg("%sing \"%s.%s\" using index scan on \"%s\"",
+ cmd_str,
nspname,
RelationGetRelationName(OldHeap),
RelationGetRelationName(OldIndex))));
else if (use_sort)
ereport(elevel,
- (errmsg("clustering \"%s.%s\" using sequential scan and sort",
+ (errmsg("%sing \"%s.%s\" using sequential scan and sort",
+ cmd_str,
nspname,
RelationGetRelationName(OldHeap))));
else
ereport(elevel,
- (errmsg("vacuuming \"%s.%s\"",
+ (errmsg("%sing \"%s.%s\"",
+ cmd_str,
nspname,
RelationGetRelationName(OldHeap))));
@@ -1458,8 +1443,8 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
int i;
/* Report that we are now swapping relation files */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_SWAP_REL_FILES);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_SWAP_REL_FILES);
/* Zero out possible results from swapped_relation_files */
memset(mapped_tables, 0, sizeof(mapped_tables));
@@ -1509,14 +1494,14 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
/* Report that we are now reindexing relations */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_REBUILD_INDEX);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_REBUILD_INDEX);
reindex_relation(NULL, OIDOldHeap, reindex_flags, &reindex_params);
/* Report that we are now doing clean up */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_FINAL_CLEANUP);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_FINAL_CLEANUP);
/*
* If the relation being rebuilt is pg_class, swap_relation_files()
@@ -1666,7 +1651,8 @@ get_tables_to_cluster(MemoryContext cluster_context)
index = (Form_pg_index) GETSTRUCT(indexTuple);
- if (!cluster_is_permitted_for_relation(index->indrelid, GetUserId()))
+ if (!cluster_is_permitted_for_relation(index->indrelid, GetUserId(),
+ CLUSTER_COMMAND_CLUSTER))
continue;
/* Use a permanent memory context for the result list */
@@ -1687,14 +1673,67 @@ get_tables_to_cluster(MemoryContext cluster_context)
}
/*
- * Given an index on a partitioned table, return a list of RelToCluster for
+ * Like get_tables_to_cluster(), but do not care about indexes.
+ */
+static List *
+get_tables_to_repack(MemoryContext repack_context)
+{
+ Relation relrelation;
+ TableScanDesc scan;
+ HeapTuple tuple;
+ MemoryContext old_context;
+ List *rtcs = NIL;
+
+ /*
+ * Get all plain relations that the current user has the appropriate
+ * privileges for.
+ */
+ relrelation = table_open(RelationRelationId, AccessShareLock);
+ scan = table_beginscan_catalog(relrelation, 0, NULL);
+ while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+ {
+ RelToCluster *rtc;
+ Form_pg_class classform = (Form_pg_class) GETSTRUCT(tuple);
+ Oid relid = classform->oid;
+
+ /* Only interested in relations. */
+ if (get_rel_relkind(relid) != RELKIND_RELATION)
+ continue;
+
+ if (!cluster_is_permitted_for_relation(relid, GetUserId(),
+ CLUSTER_COMMAND_REPACK))
+ continue;
+
+ /* Use a permanent memory context for the result list */
+ old_context = MemoryContextSwitchTo(repack_context);
+
+ rtc = (RelToCluster *) palloc(sizeof(RelToCluster));
+ rtc->tableOid = relid;
+ rtc->indexOid = InvalidOid;
+ rtcs = lappend(rtcs, rtc);
+
+ MemoryContextSwitchTo(old_context);
+ }
+ table_endscan(scan);
+
+ relation_close(relrelation, AccessShareLock);
+
+ return rtcs;
+}
+
+/*
+ * Given a partitioned table or its index, return a list of RelToCluster for
* all the children leaves tables/indexes.
*
* Like expand_vacuum_rel, but here caller must hold AccessExclusiveLock
* on the table containing the index.
+ *
+ * 'rel_is_index' tells whether 'relid' is that of an index (true) or of the
+ * owning relation.
*/
static List *
-get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
+get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid relid,
+ bool rel_is_index, ClusterCommand cmd)
{
List *inhoids;
ListCell *lc;
@@ -1702,17 +1741,33 @@ get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
MemoryContext old_context;
/* Do not lock the children until they're processed */
- inhoids = find_all_inheritors(indexOid, NoLock, NULL);
+ inhoids = find_all_inheritors(relid, NoLock, NULL);
foreach(lc, inhoids)
{
- Oid indexrelid = lfirst_oid(lc);
- Oid relid = IndexGetRelation(indexrelid, false);
+ Oid inhoid = lfirst_oid(lc);
+ Oid inhrelid,
+ inhindid;
RelToCluster *rtc;
- /* consider only leaf indexes */
- if (get_rel_relkind(indexrelid) != RELKIND_INDEX)
- continue;
+ if (rel_is_index)
+ {
+ /* consider only leaf indexes */
+ if (get_rel_relkind(inhoid) != RELKIND_INDEX)
+ continue;
+
+ inhrelid = IndexGetRelation(inhoid, false);
+ inhindid = inhoid;
+ }
+ else
+ {
+ /* consider only leaf relations */
+ if (get_rel_relkind(inhoid) != RELKIND_RELATION)
+ continue;
+
+ inhrelid = inhoid;
+ inhindid = InvalidOid;
+ }
/*
* It's possible that the user does not have privileges to CLUSTER the
@@ -1720,15 +1775,15 @@ get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
* table. We skip any partitions which the user is not permitted to
* CLUSTER.
*/
- if (!cluster_is_permitted_for_relation(relid, GetUserId()))
+ if (!cluster_is_permitted_for_relation(inhrelid, GetUserId(), cmd))
continue;
/* Use a permanent memory context for the result list */
old_context = MemoryContextSwitchTo(cluster_context);
rtc = (RelToCluster *) palloc(sizeof(RelToCluster));
- rtc->tableOid = relid;
- rtc->indexOid = indexrelid;
+ rtc->tableOid = inhrelid;
+ rtc->indexOid = inhindid;
rtcs = lappend(rtcs, rtc);
MemoryContextSwitchTo(old_context);
@@ -1742,13 +1797,192 @@ get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
* function emits a WARNING.
*/
static bool
-cluster_is_permitted_for_relation(Oid relid, Oid userid)
+cluster_is_permitted_for_relation(Oid relid, Oid userid, ClusterCommand cmd)
{
if (pg_class_aclcheck(relid, userid, ACL_MAINTAIN) == ACLCHECK_OK)
return true;
ereport(WARNING,
- (errmsg("permission denied to cluster \"%s\", skipping it",
+ (errmsg("permission denied to %s \"%s\", skipping it",
+ CLUSTER_COMMAND_STR(cmd),
get_rel_name(relid))));
return false;
}
+
+/*
+ * REPACK is intended to be a replacement of both CLUSTER and VACUUM FULL.
+ */
+void
+repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
+{
+ ListCell *lc;
+ ClusterParams params = {0};
+ bool verbose = false;
+ Relation rel = NULL;
+ Oid indexOid = InvalidOid;
+ MemoryContext repack_context;
+ List *rtcs;
+
+ /* Parse option list */
+ foreach(lc, stmt->params)
+ {
+ DefElem *opt = (DefElem *) lfirst(lc);
+
+ if (strcmp(opt->defname, "verbose") == 0)
+ verbose = defGetBoolean(opt);
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("unrecognized REPACK option \"%s\"",
+ opt->defname),
+ parser_errposition(pstate, opt->location)));
+ }
+
+ params.options = (verbose ? CLUOPT_VERBOSE : 0);
+
+ if (stmt->relation != NULL)
+ {
+ rel = process_single_relation(stmt->relation, stmt->indexname,
+ CLUSTER_COMMAND_REPACK, ¶ms,
+ &indexOid);
+ if (rel == NULL)
+ return;
+ }
+
+ /*
+ * By here, we know we are in a multi-table situation. In order to avoid
+ * holding locks for too long, we want to process each table in its own
+ * transaction. This forces us to disallow running inside a user
+ * transaction block.
+ */
+ PreventInTransactionBlock(isTopLevel, "REPACK");
+
+ /* Also, we need a memory context to hold our list of relations */
+ repack_context = AllocSetContextCreate(PortalContext,
+ "Repack",
+ ALLOCSET_DEFAULT_SIZES);
+
+ params.options |= CLUOPT_RECHECK;
+ if (rel != NULL)
+ {
+ Oid relid;
+ bool rel_is_index;
+
+ Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
+
+ if (OidIsValid(indexOid))
+ {
+ relid = indexOid;
+ rel_is_index = true;
+ }
+ else
+ {
+ relid = RelationGetRelid(rel);
+ rel_is_index = false;
+ }
+ rtcs = get_tables_to_cluster_partitioned(repack_context, relid,
+ rel_is_index,
+ CLUSTER_COMMAND_REPACK);
+
+ /* close relation, releasing lock on parent table */
+ table_close(rel, AccessExclusiveLock);
+ }
+ else
+ rtcs = get_tables_to_repack(repack_context);
+
+ /* Do the job. */
+ cluster_multiple_rels(rtcs, ¶ms, CLUSTER_COMMAND_REPACK);
+
+ /* Start a new transaction for the cleanup work. */
+ StartTransactionCommand();
+
+ /* Clean up working storage */
+ MemoryContextDelete(repack_context);
+
+}
+
+/*
+ * Process a single relation on behalf of CLUSTER or REPACK.
+ *
+ * Return NULL if done, relation reference if the caller needs to process it
+ * (because the relation is partitioned).
+ */
+static Relation
+process_single_relation(RangeVar *relation, char *indexname,
+ ClusterCommand cmd, ClusterParams *params,
+ Oid *indexOid_p)
+{
+ Relation rel;
+ Oid indexOid = InvalidOid;
+
+ /* This is the single-relation case. */
+ Oid tableOid;
+
+ /*
+ * Find, lock, and check permissions on the table. We obtain
+ * AccessExclusiveLock right away to avoid lock-upgrade hazard in the
+ * single-transaction case.
+ */
+ tableOid = RangeVarGetRelidExtended(relation,
+ AccessExclusiveLock,
+ 0,
+ RangeVarCallbackMaintainsTable,
+ NULL);
+ rel = table_open(tableOid, NoLock);
+
+ /*
+ * Reject clustering a remote temp table ... their local buffer manager is
+ * not going to cope.
+ */
+ if (RELATION_IS_OTHER_TEMP(rel))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot %s temporary tables of other sessions",
+ CLUSTER_COMMAND_STR(cmd))));
+
+ if (indexname == NULL && cmd == CLUSTER_COMMAND_CLUSTER)
+ {
+ ListCell *index;
+
+ /* We need to find the index that has indisclustered set. */
+ foreach(index, RelationGetIndexList(rel))
+ {
+ indexOid = lfirst_oid(index);
+ if (get_index_isclustered(indexOid))
+ break;
+ indexOid = InvalidOid;
+ }
+
+ if (!OidIsValid(indexOid))
+ ereport(ERROR,
+ (errcode(ERRCODE_UNDEFINED_OBJECT),
+ errmsg("there is no previously clustered index for table \"%s\"",
+ relation->relname)));
+ }
+ else if (indexname != NULL)
+ {
+ /*
+ * The index is expected to be in the same namespace as the relation.
+ */
+ indexOid = get_relname_relid(indexname,
+ rel->rd_rel->relnamespace);
+ if (!OidIsValid(indexOid))
+ ereport(ERROR,
+ (errcode(ERRCODE_UNDEFINED_OBJECT),
+ errmsg("index \"%s\" for table \"%s\" does not exist",
+ indexname, relation->relname)));
+ }
+
+ *indexOid_p = indexOid;
+
+ /* For non-partitioned tables, do what we came here to do. */
+ if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+ {
+ cluster_rel(rel, indexOid, params, cmd);
+ /* cluster_rel closes the relation, but keeps lock */
+
+ return NULL;
+ }
+
+ return rel;
+}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 10624353b0a..b7a74f25785 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -15844,7 +15844,8 @@ ATExecClusterOn(Relation rel, const char *indexName, LOCKMODE lockmode)
indexName, RelationGetRelationName(rel))));
/* Check index is valid to cluster on */
- check_index_is_clusterable(rel, indexOid, lockmode);
+ check_index_is_clusterable(rel, indexOid, lockmode,
+ CLUSTER_COMMAND_CLUSTER);
/* And do the work */
mark_index_clustered(rel, indexOid, false);
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index db5da3ce826..a4ad23448f8 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -2263,7 +2263,8 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
cluster_params.options |= CLUOPT_VERBOSE;
/* VACUUM FULL is now a variant of CLUSTER; see cluster.c */
- cluster_rel(rel, InvalidOid, &cluster_params);
+ cluster_rel(rel, InvalidOid, &cluster_params,
+ CLUSTER_COMMAND_VACUUM);
/* cluster_rel closes the relation, but keeps lock */
rel = NULL;
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 0fc502a3a40..9c79265a438 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -298,7 +298,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
GrantStmt GrantRoleStmt ImportForeignSchemaStmt IndexStmt InsertStmt
ListenStmt LoadStmt LockStmt MergeStmt NotifyStmt ExplainableStmt PreparableStmt
CreateFunctionStmt AlterFunctionStmt ReindexStmt RemoveAggrStmt
- RemoveFuncStmt RemoveOperStmt RenameStmt ReturnStmt RevokeStmt RevokeRoleStmt
+ RemoveFuncStmt RemoveOperStmt RenameStmt RepackStmt ReturnStmt RevokeStmt RevokeRoleStmt
RuleActionStmt RuleActionStmtOrEmpty RuleStmt
SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
UnlistenStmt UpdateStmt VacuumStmt
@@ -381,7 +381,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
%type <str> copy_file_name
access_method_clause attr_name
table_access_method_clause name cursor_name file_name
- cluster_index_specification
+ cluster_index_specification repack_index_specification
%type <list> func_name handler_name qual_Op qual_all_Op subquery_Op
opt_inline_handler opt_validator validator_clause
@@ -764,7 +764,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
QUOTE QUOTES
RANGE READ REAL REASSIGN RECURSIVE REF_P REFERENCES REFERENCING
- REFRESH REINDEX RELATIVE_P RELEASE RENAME REPEATABLE REPLACE REPLICA
+ REFRESH REINDEX RELATIVE_P RELEASE RENAME REPACK REPEATABLE REPLACE REPLICA
RESET RESTART RESTRICT RETURN RETURNING RETURNS REVOKE RIGHT ROLE ROLLBACK ROLLUP
ROUTINE ROUTINES ROW ROWS RULE
@@ -1100,6 +1100,7 @@ stmt:
| RemoveFuncStmt
| RemoveOperStmt
| RenameStmt
+ | RepackStmt
| RevokeStmt
| RevokeRoleStmt
| RuleStmt
@@ -11887,6 +11888,60 @@ cluster_index_specification:
| /*EMPTY*/ { $$ = NULL; }
;
+/*****************************************************************************
+ *
+ * QUERY:
+ * REPACK [ (options) ] [ <qualified_name> [ USING INDEX <index_name> ] ]
+ *
+ *****************************************************************************/
+
+RepackStmt:
+ REPACK qualified_name repack_index_specification
+ {
+ RepackStmt *n = makeNode(RepackStmt);
+
+ n->relation = $2;
+ n->indexname = $3;
+ n->params = NIL;
+ $$ = (Node *) n;
+ }
+
+ | REPACK '(' utility_option_list ')' qualified_name repack_index_specification
+ {
+ RepackStmt *n = makeNode(RepackStmt);
+
+ n->relation = $5;
+ n->indexname = $6;
+ n->params = $3;
+ $$ = (Node *) n;
+ }
+
+ | REPACK
+ {
+ RepackStmt *n = makeNode(RepackStmt);
+
+ n->relation = NULL;
+ n->indexname = NULL;
+ n->params = NIL;
+ $$ = (Node *) n;
+ }
+
+ | REPACK '(' utility_option_list ')'
+ {
+ RepackStmt *n = makeNode(RepackStmt);
+
+ n->relation = NULL;
+ n->indexname = NULL;
+ n->params = $3;
+ $$ = (Node *) n;
+ }
+ ;
+
+repack_index_specification:
+ USING INDEX name { $$ = $3; }
+ | /*EMPTY*/ { $$ = NULL; }
+ ;
+
/*****************************************************************************
*
@@ -17927,6 +17982,7 @@ unreserved_keyword:
| RELATIVE_P
| RELEASE
| RENAME
+ | REPACK
| REPEATABLE
| REPLACE
| REPLICA
@@ -18558,6 +18614,7 @@ bare_label_keyword:
| RELATIVE_P
| RELEASE
| RENAME
+ | REPACK
| REPEATABLE
| REPLACE
| REPLICA
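To make the grammar additions concrete, these are the statement forms the RepackStmt productions accept (illustrative only; they assume a server built with this patch, and the table and index names are hypothetical):

```sql
-- Forms accepted by the RepackStmt productions above (illustrative):
REPACK;                                   -- all eligible tables
REPACK (VERBOSE);                         -- with options, all tables
REPACK pgbench_accounts;                  -- one table, physical order
REPACK pgbench_accounts USING INDEX pgbench_accounts_pkey;
REPACK (VERBOSE) pgbench_accounts USING INDEX pgbench_accounts_pkey;
```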
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 25fe3d58016..bf3ba3c2ae7 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -280,6 +280,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_ClusterStmt:
case T_ReindexStmt:
case T_VacuumStmt:
+ case T_RepackStmt:
{
/*
* These commands write WAL, so they're not strictly
@@ -862,6 +863,10 @@ standard_ProcessUtility(PlannedStmt *pstmt,
ExecVacuum(pstate, (VacuumStmt *) parsetree, isTopLevel);
break;
+ case T_RepackStmt:
+ repack(pstate, (RepackStmt *) parsetree, isTopLevel);
+ break;
+
case T_ExplainStmt:
ExplainQuery(pstate, (ExplainStmt *) parsetree, params, dest);
break;
@@ -2869,6 +2874,10 @@ CreateCommandTag(Node *parsetree)
tag = CMDTAG_ANALYZE;
break;
+ case T_RepackStmt:
+ tag = CMDTAG_REPACK;
+ break;
+
case T_ExplainStmt:
tag = CMDTAG_EXPLAIN;
break;
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 97af7c6554f..ddec4914ea5 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -268,6 +268,8 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
cmdtype = PROGRESS_COMMAND_ANALYZE;
else if (pg_strcasecmp(cmd, "CLUSTER") == 0)
cmdtype = PROGRESS_COMMAND_CLUSTER;
+ else if (pg_strcasecmp(cmd, "REPACK") == 0)
+ cmdtype = PROGRESS_COMMAND_REPACK;
else if (pg_strcasecmp(cmd, "CREATE INDEX") == 0)
cmdtype = PROGRESS_COMMAND_CREATE_INDEX;
else if (pg_strcasecmp(cmd, "BASEBACKUP") == 0)
diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index 98951aef82c..31271786f21 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -1223,7 +1223,7 @@ static const char *const sql_commands[] = {
"DELETE FROM", "DISCARD", "DO", "DROP", "END", "EXECUTE", "EXPLAIN",
"FETCH", "GRANT", "IMPORT FOREIGN SCHEMA", "INSERT INTO", "LISTEN", "LOAD", "LOCK",
"MERGE INTO", "MOVE", "NOTIFY", "PREPARE",
- "REASSIGN", "REFRESH MATERIALIZED VIEW", "REINDEX", "RELEASE",
+ "REASSIGN", "REFRESH MATERIALIZED VIEW", "REINDEX", "RELEASE", "REPACK",
"RESET", "REVOKE", "ROLLBACK",
"SAVEPOINT", "SECURITY LABEL", "SELECT", "SET", "SHOW", "START",
"TABLE", "TRUNCATE", "UNLISTEN", "UPDATE", "VACUUM", "VALUES", "WITH",
@@ -4913,6 +4913,35 @@ match_previous_words(int pattern_id,
COMPLETE_WITH_QUERY(Query_for_list_of_tablespaces);
}
+/* REPACK */
+ else if (Matches("REPACK"))
+ COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_clusterables);
+ else if (Matches("REPACK", "(*)"))
+ COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_clusterables);
+ /* If we have REPACK <sth>, then add "USING INDEX" */
+ else if (Matches("REPACK", MatchAnyExcept("(")))
+ COMPLETE_WITH("USING INDEX");
+ /* If we have REPACK (*) <sth>, then add "USING INDEX" */
+ else if (Matches("REPACK", "(*)", MatchAny))
+ COMPLETE_WITH("USING INDEX");
+ /* If we have REPACK <sth> USING INDEX, then complete with index names */
+ else if (Matches("REPACK", MatchAny, "USING", "INDEX"))
+ {
+ set_completion_reference(prev3_wd);
+ COMPLETE_WITH_SCHEMA_QUERY(Query_for_index_of_table);
+ }
+ else if (HeadMatches("REPACK", "(*") &&
+ !HeadMatches("REPACK", "(*)"))
+ {
+ /*
+ * This fires if we're in an unfinished parenthesized option list.
+ * get_previous_words treats a completed parenthesized option list as
+ * one word, so the above test is correct.
+ */
+ if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
+ COMPLETE_WITH("VERBOSE");
+ }
+
/* SECURITY LABEL */
else if (Matches("SECURITY"))
COMPLETE_WITH("LABEL");
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 60088a64cbb..c2976905e4d 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -31,10 +31,27 @@ typedef struct ClusterParams
bits32 options; /* bitmask of CLUOPT_* */
} ClusterParams;
+/*
+ * cluster.c currently implements three nearly identical commands: CLUSTER,
+ * VACUUM FULL and REPACK. Where needed, use this enumeration to distinguish
+ * which of these commands is being executed.
+ *
+ * Remove this stuff when removing the (now deprecated) CLUSTER and VACUUM
+ * FULL commands.
+ */
+typedef enum ClusterCommand
+{
+ CLUSTER_COMMAND_CLUSTER,
+ CLUSTER_COMMAND_REPACK,
+ CLUSTER_COMMAND_VACUUM
+} ClusterCommand;
+
extern void cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel);
-extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params);
+extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
+ ClusterCommand cmd);
extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid,
- LOCKMODE lockmode);
+ LOCKMODE lockmode,
+ ClusterCommand cmd);
extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
extern Oid make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
@@ -48,4 +65,5 @@ extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
MultiXactId cutoffMulti,
char newrelpersistence);
+extern void repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel);
#endif /* CLUSTER_H */
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 7c736e7b03b..7644267e14f 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -56,24 +56,48 @@
#define PROGRESS_ANALYZE_PHASE_COMPUTE_EXT_STATS 4
#define PROGRESS_ANALYZE_PHASE_FINALIZE_ANALYZE 5
-/* Progress parameters for cluster */
-#define PROGRESS_CLUSTER_COMMAND 0
-#define PROGRESS_CLUSTER_PHASE 1
-#define PROGRESS_CLUSTER_INDEX_RELID 2
-#define PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED 3
-#define PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN 4
-#define PROGRESS_CLUSTER_TOTAL_HEAP_BLKS 5
-#define PROGRESS_CLUSTER_HEAP_BLKS_SCANNED 6
-#define PROGRESS_CLUSTER_INDEX_REBUILD_COUNT 7
-
-/* Phases of cluster (as advertised via PROGRESS_CLUSTER_PHASE) */
-#define PROGRESS_CLUSTER_PHASE_SEQ_SCAN_HEAP 1
-#define PROGRESS_CLUSTER_PHASE_INDEX_SCAN_HEAP 2
-#define PROGRESS_CLUSTER_PHASE_SORT_TUPLES 3
-#define PROGRESS_CLUSTER_PHASE_WRITE_NEW_HEAP 4
-#define PROGRESS_CLUSTER_PHASE_SWAP_REL_FILES 5
-#define PROGRESS_CLUSTER_PHASE_REBUILD_INDEX 6
-#define PROGRESS_CLUSTER_PHASE_FINAL_CLEANUP 7
+/*
+ * Progress parameters for REPACK.
+ *
+ * Note: Since REPACK shares some code with CLUSTER, (some of) these values
+ * are also used by CLUSTER. (CLUSTER is now deprecated, so it makes no sense
+ * to introduce a separate set of constants.)
+ */
+#define PROGRESS_REPACK_COMMAND 0
+#define PROGRESS_REPACK_PHASE 1
+#define PROGRESS_REPACK_INDEX_RELID 2
+#define PROGRESS_REPACK_HEAP_TUPLES_SCANNED 3
+#define PROGRESS_REPACK_HEAP_TUPLES_WRITTEN 4
+#define PROGRESS_REPACK_TOTAL_HEAP_BLKS 5
+#define PROGRESS_REPACK_HEAP_BLKS_SCANNED 6
+#define PROGRESS_REPACK_INDEX_REBUILD_COUNT 7
+
+/*
+ * Phases of repack (as advertised via PROGRESS_REPACK_PHASE).
+ *
+ * Note: Since REPACK shares some code with CLUSTER, (some of) these values
+ * are also used by CLUSTER. (CLUSTER is now deprecated, so it makes no sense
+ * to introduce a separate set of constants.)
+ */
+#define PROGRESS_REPACK_PHASE_SEQ_SCAN_HEAP 1
+#define PROGRESS_REPACK_PHASE_INDEX_SCAN_HEAP 2
+#define PROGRESS_REPACK_PHASE_SORT_TUPLES 3
+#define PROGRESS_REPACK_PHASE_WRITE_NEW_HEAP 4
+#define PROGRESS_REPACK_PHASE_SWAP_REL_FILES 5
+#define PROGRESS_REPACK_PHASE_REBUILD_INDEX 6
+#define PROGRESS_REPACK_PHASE_FINAL_CLEANUP 7
+
+/* Commands of PROGRESS_REPACK */
+#define PROGRESS_REPACK_COMMAND_REPACK 1
+
+/*
+ * Progress parameters for cluster.
+ *
+ * Although we need to report REPACK and CLUSTER in separate views, the
+ * parameters and phases of CLUSTER are a subset of those of REPACK. Therefore
+ * we just use the appropriate values defined for REPACK above instead of
+ * defining a separate set of constants here.
+ */
/* Commands of PROGRESS_CLUSTER */
#define PROGRESS_CLUSTER_COMMAND_CLUSTER 1
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index df331b1c0d9..4ef76c852f5 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3921,6 +3921,19 @@ typedef struct ClusterStmt
List *params; /* list of DefElem nodes */
} ClusterStmt;
+/* ----------------------
+ * Repack Statement
+ * ----------------------
+ */
+typedef struct RepackStmt
+{
+ NodeTag type;
+ RangeVar *relation; /* relation being repacked */
+ char *indexname; /* order tuples by this index */
+ List *params; /* list of DefElem nodes */
+} RepackStmt;
+
+
/* ----------------------
* Vacuum and Analyze Statements
*
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 40cf090ce61..0932d6fce5b 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -373,6 +373,7 @@ PG_KEYWORD("reindex", REINDEX, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("relative", RELATIVE_P, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("release", RELEASE, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("rename", RENAME, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("repack", REPACK, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("repeatable", REPEATABLE, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("replace", REPLACE, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("replica", REPLICA, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..cceb312f2b3 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -196,6 +196,7 @@ PG_CMDTAG(CMDTAG_REASSIGN_OWNED, "REASSIGN OWNED", false, false, false)
PG_CMDTAG(CMDTAG_REFRESH_MATERIALIZED_VIEW, "REFRESH MATERIALIZED VIEW", true, false, false)
PG_CMDTAG(CMDTAG_REINDEX, "REINDEX", true, false, false)
PG_CMDTAG(CMDTAG_RELEASE, "RELEASE", false, false, false)
+PG_CMDTAG(CMDTAG_REPACK, "REPACK", false, false, false)
PG_CMDTAG(CMDTAG_RESET, "RESET", false, false, false)
PG_CMDTAG(CMDTAG_REVOKE, "REVOKE", true, false, false)
PG_CMDTAG(CMDTAG_REVOKE_ROLE, "REVOKE ROLE", false, false, false)
diff --git a/src/include/utils/backend_progress.h b/src/include/utils/backend_progress.h
index dda813ab407..da3d14bb97b 100644
--- a/src/include/utils/backend_progress.h
+++ b/src/include/utils/backend_progress.h
@@ -25,6 +25,7 @@ typedef enum ProgressCommandType
PROGRESS_COMMAND_VACUUM,
PROGRESS_COMMAND_ANALYZE,
PROGRESS_COMMAND_CLUSTER,
+ PROGRESS_COMMAND_REPACK,
PROGRESS_COMMAND_CREATE_INDEX,
PROGRESS_COMMAND_BASEBACKUP,
PROGRESS_COMMAND_COPY,
diff --git a/src/test/regress/expected/cluster.out b/src/test/regress/expected/cluster.out
index 4d40a6809ab..ed7df29b8e5 100644
--- a/src/test/regress/expected/cluster.out
+++ b/src/test/regress/expected/cluster.out
@@ -254,6 +254,120 @@ ORDER BY 1;
clstr_tst_pkey
(3 rows)
+-- REPACK handles individual tables identically to CLUSTER, but it's worth
+-- checking if it handles table hierarchies identically as well.
+REPACK clstr_tst USING INDEX clstr_tst_c;
+-- Verify that inheritance link still works
+INSERT INTO clstr_tst_inh VALUES (0, 100, 'in child table 2');
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst;
+ a | b | c | substring | length
+----+-----+------------------+--------------------------------+--------
+ 10 | 14 | catorce | |
+ 18 | 5 | cinco | |
+ 9 | 4 | cuatro | |
+ 26 | 19 | diecinueve | |
+ 12 | 18 | dieciocho | |
+ 30 | 16 | dieciseis | |
+ 24 | 17 | diecisiete | |
+ 2 | 10 | diez | |
+ 23 | 12 | doce | |
+ 11 | 2 | dos | |
+ 25 | 9 | nueve | |
+ 31 | 8 | ocho | |
+ 1 | 11 | once | |
+ 28 | 15 | quince | |
+ 32 | 6 | seis | xyzzyxyzzyxyzzyxyzzyxyzzyxyzzy | 500000
+ 29 | 7 | siete | |
+ 15 | 13 | trece | |
+ 22 | 30 | treinta | |
+ 17 | 32 | treinta y dos | |
+ 3 | 31 | treinta y uno | |
+ 5 | 3 | tres | |
+ 20 | 1 | uno | |
+ 6 | 20 | veinte | |
+ 14 | 25 | veinticinco | |
+ 21 | 24 | veinticuatro | |
+ 4 | 22 | veintidos | |
+ 19 | 29 | veintinueve | |
+ 16 | 28 | veintiocho | |
+ 27 | 26 | veintiseis | |
+ 13 | 27 | veintisiete | |
+ 7 | 23 | veintitres | |
+ 8 | 21 | veintiuno | |
+ 0 | 100 | in child table | |
+ 0 | 100 | in child table 2 | |
+(34 rows)
+
+-- Verify that foreign key link still works
+INSERT INTO clstr_tst (b, c) VALUES (1111, 'this should fail');
+ERROR: insert or update on table "clstr_tst" violates foreign key constraint "clstr_tst_con"
+DETAIL: Key (b)=(1111) is not present in table "clstr_tst_s".
+SELECT conname FROM pg_constraint WHERE conrelid = 'clstr_tst'::regclass
+ORDER BY 1;
+ conname
+----------------------
+ clstr_tst_a_not_null
+ clstr_tst_con
+ clstr_tst_pkey
+(3 rows)
+
+-- Yet another code path: REPACK w/o index.
+REPACK clstr_tst;
+-- Verify that inheritance link still works
+INSERT INTO clstr_tst_inh VALUES (0, 100, 'in child table 3');
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst;
+ a | b | c | substring | length
+----+-----+------------------+--------------------------------+--------
+ 10 | 14 | catorce | |
+ 18 | 5 | cinco | |
+ 9 | 4 | cuatro | |
+ 26 | 19 | diecinueve | |
+ 12 | 18 | dieciocho | |
+ 30 | 16 | dieciseis | |
+ 24 | 17 | diecisiete | |
+ 2 | 10 | diez | |
+ 23 | 12 | doce | |
+ 11 | 2 | dos | |
+ 25 | 9 | nueve | |
+ 31 | 8 | ocho | |
+ 1 | 11 | once | |
+ 28 | 15 | quince | |
+ 32 | 6 | seis | xyzzyxyzzyxyzzyxyzzyxyzzyxyzzy | 500000
+ 29 | 7 | siete | |
+ 15 | 13 | trece | |
+ 22 | 30 | treinta | |
+ 17 | 32 | treinta y dos | |
+ 3 | 31 | treinta y uno | |
+ 5 | 3 | tres | |
+ 20 | 1 | uno | |
+ 6 | 20 | veinte | |
+ 14 | 25 | veinticinco | |
+ 21 | 24 | veinticuatro | |
+ 4 | 22 | veintidos | |
+ 19 | 29 | veintinueve | |
+ 16 | 28 | veintiocho | |
+ 27 | 26 | veintiseis | |
+ 13 | 27 | veintisiete | |
+ 7 | 23 | veintitres | |
+ 8 | 21 | veintiuno | |
+ 0 | 100 | in child table | |
+ 0 | 100 | in child table 2 | |
+ 0 | 100 | in child table 3 | |
+(35 rows)
+
+-- Verify that foreign key link still works
+INSERT INTO clstr_tst (b, c) VALUES (1111, 'this should fail');
+ERROR: insert or update on table "clstr_tst" violates foreign key constraint "clstr_tst_con"
+DETAIL: Key (b)=(1111) is not present in table "clstr_tst_s".
+SELECT conname FROM pg_constraint WHERE conrelid = 'clstr_tst'::regclass
+ORDER BY 1;
+ conname
+----------------------
+ clstr_tst_a_not_null
+ clstr_tst_con
+ clstr_tst_pkey
+(3 rows)
+
SELECT relname, relkind,
EXISTS(SELECT 1 FROM pg_class WHERE oid = c.reltoastrelid) AS hastoast
FROM pg_class c WHERE relname LIKE 'clstr_tst%' ORDER BY relname;
@@ -381,6 +495,35 @@ SELECT * FROM clstr_1;
2
(2 rows)
+-- REPACK w/o argument performs no ordering, so we can only check which tables
+-- have had their relfilenodes changed.
+RESET SESSION AUTHORIZATION;
+CREATE TEMP TABLE relnodes_old AS
+(SELECT relname, relfilenode
+FROM pg_class
+WHERE relname IN ('clstr_1', 'clstr_2', 'clstr_3'));
+SET SESSION AUTHORIZATION regress_clstr_user;
+SET client_min_messages = ERROR; -- order of "skipping" warnings may vary
+REPACK;
+RESET client_min_messages;
+RESET SESSION AUTHORIZATION;
+CREATE TEMP TABLE relnodes_new AS
+(SELECT relname, relfilenode
+FROM pg_class
+WHERE relname IN ('clstr_1', 'clstr_2', 'clstr_3'));
+-- Do the actual comparison. Unlike CLUSTER, clstr_3 should have been
+-- processed because no clustering index is involved here.
+SELECT o.relname FROM relnodes_old o
+JOIN relnodes_new n ON o.relname = n.relname
+WHERE o.relfilenode <> n.relfilenode
+ORDER BY o.relname;
+ relname
+---------
+ clstr_1
+ clstr_3
+(2 rows)
+
+SET SESSION AUTHORIZATION regress_clstr_user;
-- Test MVCC-safety of cluster. There isn't much we can do to verify the
-- results with a single backend...
CREATE TABLE clustertest (key int PRIMARY KEY);
@@ -495,6 +638,43 @@ ALTER TABLE clstrpart SET WITHOUT CLUSTER;
ERROR: cannot mark index clustered in partitioned table
ALTER TABLE clstrpart CLUSTER ON clstrpart_idx;
ERROR: cannot mark index clustered in partitioned table
+-- Check that REPACK sets new relfilenodes: it should process exactly the same
+-- tables as CLUSTER did.
+DROP TABLE old_cluster_info;
+DROP TABLE new_cluster_info;
+CREATE TEMP TABLE old_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+REPACK clstrpart USING INDEX clstrpart_idx;
+CREATE TEMP TABLE new_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM old_cluster_info AS old JOIN new_cluster_info AS new USING (relname) ORDER BY relname COLLATE "C";
+ relname | level | relkind | ?column?
+-------------+-------+---------+----------
+ clstrpart | 0 | p | t
+ clstrpart1 | 1 | p | t
+ clstrpart11 | 2 | r | f
+ clstrpart12 | 2 | p | t
+ clstrpart2 | 1 | r | f
+ clstrpart3 | 1 | p | t
+ clstrpart33 | 2 | r | f
+(7 rows)
+
+-- And finally the same for REPACK w/o index.
+DROP TABLE old_cluster_info;
+DROP TABLE new_cluster_info;
+CREATE TEMP TABLE old_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+REPACK clstrpart;
+CREATE TEMP TABLE new_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM old_cluster_info AS old JOIN new_cluster_info AS new USING (relname) ORDER BY relname COLLATE "C";
+ relname | level | relkind | ?column?
+-------------+-------+---------+----------
+ clstrpart | 0 | p | t
+ clstrpart1 | 1 | p | t
+ clstrpart11 | 2 | r | f
+ clstrpart12 | 2 | p | t
+ clstrpart2 | 1 | r | f
+ clstrpart3 | 1 | p | t
+ clstrpart33 | 2 | r | f
+(7 rows)
+
DROP TABLE clstrpart;
-- Ownership of partitions is checked
CREATE TABLE ptnowner(i int unique) PARTITION BY LIST (i);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 47478969135..84ca2dc3778 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2041,6 +2041,33 @@ pg_stat_progress_create_index| SELECT s.pid,
s.param15 AS partitions_done
FROM (pg_stat_get_progress_info('CREATE INDEX'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
+pg_stat_progress_repack| SELECT s.pid,
+ s.datid,
+ d.datname,
+ s.relid,
+ CASE s.param1
+ WHEN 1 THEN 'REPACK'::text
+ ELSE NULL::text
+ END AS command,
+ CASE s.param2
+ WHEN 0 THEN 'initializing'::text
+ WHEN 1 THEN 'seq scanning heap'::text
+ WHEN 2 THEN 'index scanning heap'::text
+ WHEN 3 THEN 'sorting tuples'::text
+ WHEN 4 THEN 'writing new heap'::text
+ WHEN 5 THEN 'swapping relation files'::text
+ WHEN 6 THEN 'rebuilding index'::text
+ WHEN 7 THEN 'performing final cleanup'::text
+ ELSE NULL::text
+ END AS phase,
+ (s.param3)::oid AS repack_index_relid,
+ s.param4 AS heap_tuples_scanned,
+ s.param5 AS heap_tuples_written,
+ s.param6 AS heap_blks_total,
+ s.param7 AS heap_blks_scanned,
+ s.param8 AS index_rebuild_count
+ FROM (pg_stat_get_progress_info('REPACK'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
+ LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_progress_vacuum| SELECT s.pid,
s.datid,
d.datname,
diff --git a/src/test/regress/sql/cluster.sql b/src/test/regress/sql/cluster.sql
index b7115f86104..e348e26fbfa 100644
--- a/src/test/regress/sql/cluster.sql
+++ b/src/test/regress/sql/cluster.sql
@@ -76,6 +76,33 @@ INSERT INTO clstr_tst (b, c) VALUES (1111, 'this should fail');
SELECT conname FROM pg_constraint WHERE conrelid = 'clstr_tst'::regclass
ORDER BY 1;
+-- REPACK handles individual tables identically to CLUSTER, but it's worth
+-- checking if it handles table hierarchies identically as well.
+REPACK clstr_tst USING INDEX clstr_tst_c;
+
+-- Verify that inheritance link still works
+INSERT INTO clstr_tst_inh VALUES (0, 100, 'in child table 2');
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst;
+
+-- Verify that foreign key link still works
+INSERT INTO clstr_tst (b, c) VALUES (1111, 'this should fail');
+
+SELECT conname FROM pg_constraint WHERE conrelid = 'clstr_tst'::regclass
+ORDER BY 1;
+
+-- Yet another code path: REPACK w/o index.
+REPACK clstr_tst;
+
+-- Verify that inheritance link still works
+INSERT INTO clstr_tst_inh VALUES (0, 100, 'in child table 3');
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst;
+
+-- Verify that foreign key link still works
+INSERT INTO clstr_tst (b, c) VALUES (1111, 'this should fail');
+
+SELECT conname FROM pg_constraint WHERE conrelid = 'clstr_tst'::regclass
+ORDER BY 1;
+
SELECT relname, relkind,
EXISTS(SELECT 1 FROM pg_class WHERE oid = c.reltoastrelid) AS hastoast
@@ -159,6 +186,34 @@ INSERT INTO clstr_1 VALUES (1);
CLUSTER clstr_1;
SELECT * FROM clstr_1;
+-- REPACK w/o argument performs no ordering, so we can only check which tables
+-- have had their relfilenodes changed.
+RESET SESSION AUTHORIZATION;
+CREATE TEMP TABLE relnodes_old AS
+(SELECT relname, relfilenode
+FROM pg_class
+WHERE relname IN ('clstr_1', 'clstr_2', 'clstr_3'));
+
+SET SESSION AUTHORIZATION regress_clstr_user;
+SET client_min_messages = ERROR; -- order of "skipping" warnings may vary
+REPACK;
+RESET client_min_messages;
+
+RESET SESSION AUTHORIZATION;
+CREATE TEMP TABLE relnodes_new AS
+(SELECT relname, relfilenode
+FROM pg_class
+WHERE relname IN ('clstr_1', 'clstr_2', 'clstr_3'));
+
+-- Do the actual comparison. Unlike CLUSTER, clstr_3 should have been
+-- processed because no clustering index is involved here.
+SELECT o.relname FROM relnodes_old o
+JOIN relnodes_new n ON o.relname = n.relname
+WHERE o.relfilenode <> n.relfilenode
+ORDER BY o.relname;
+
+SET SESSION AUTHORIZATION regress_clstr_user;
+
-- Test MVCC-safety of cluster. There isn't much we can do to verify the
-- results with a single backend...
@@ -229,6 +284,24 @@ SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM o
CLUSTER clstrpart;
ALTER TABLE clstrpart SET WITHOUT CLUSTER;
ALTER TABLE clstrpart CLUSTER ON clstrpart_idx;
+
+-- Check that REPACK sets new relfilenodes: it should process exactly the same
+-- tables as CLUSTER did.
+DROP TABLE old_cluster_info;
+DROP TABLE new_cluster_info;
+CREATE TEMP TABLE old_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+REPACK clstrpart USING INDEX clstrpart_idx;
+CREATE TEMP TABLE new_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM old_cluster_info AS old JOIN new_cluster_info AS new USING (relname) ORDER BY relname COLLATE "C";
+
+-- And finally the same for REPACK w/o index.
+DROP TABLE old_cluster_info;
+DROP TABLE new_cluster_info;
+CREATE TEMP TABLE old_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+REPACK clstrpart;
+CREATE TEMP TABLE new_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM old_cluster_info AS old JOIN new_cluster_info AS new USING (relname) ORDER BY relname COLLATE "C";
+
DROP TABLE clstrpart;
-- Ownership of partitions is checked
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index b66cecd8799..c7ea8fb93ca 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -416,6 +416,7 @@ ClientCertName
ClientConnectionInfo
ClientData
ClientSocket
+ClusterCommand
ClonePtrType
ClosePortalStmt
ClosePtrType
@@ -2506,6 +2507,7 @@ ReorderBufferTupleCidKey
ReorderBufferUpdateProgressTxnCB
ReorderTuple
RepOriginId
+RepackStmt
ReparameterizeForeignPathByChild_function
ReplaceVarsFromTargetList_context
ReplaceVarsNoMatchOption
--
2.43.5
Attachment: v11-0002-Move-conversion-of-a-historic-to-MVCC-snapshot-to-a-.patch (text/x-diff)
From 4a10c124f295d73663185949daaafb1dd2cfed1e Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Mon, 31 Mar 2025 15:47:08 +0200
Subject: [PATCH 2/9] Move conversion of a "historic" snapshot to an MVCC
 snapshot to a separate function.
The conversion is now handled by SnapBuildMVCCFromHistoric(). REPACK
CONCURRENTLY will also need it.
---
src/backend/replication/logical/snapbuild.c | 51 +++++++++++++++++----
src/backend/utils/time/snapmgr.c | 3 +-
src/include/replication/snapbuild.h | 1 +
src/include/utils/snapmgr.h | 1 +
4 files changed, 45 insertions(+), 11 deletions(-)
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index b64e53de017..e5d2a583ce6 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -440,10 +440,7 @@ Snapshot
SnapBuildInitialSnapshot(SnapBuild *builder)
{
Snapshot snap;
- TransactionId xid;
TransactionId safeXid;
- TransactionId *newxip;
- int newxcnt = 0;
Assert(XactIsoLevel == XACT_REPEATABLE_READ);
Assert(builder->building_full_snapshot);
@@ -485,6 +482,31 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
MyProc->xmin = snap->xmin;
+ /* Convert the historic snapshot to MVCC snapshot. */
+ return SnapBuildMVCCFromHistoric(snap, true);
+}
+
+/*
+ * Turn a historic MVCC snapshot into an ordinary MVCC snapshot.
+ *
+ * Unlike a regular (non-historic) MVCC snapshot, the xip array of this
+ * snapshot contains not only running main transactions, but also their
+ * subtransactions. This difference has no impact on XidInMVCCSnapshot().
+ *
+ * Pass true for 'in_place' if you don't care about modifying the source
+ * snapshot. If you need a new instance, and one that was allocated as a
+ * single chunk of memory, pass false.
+ */
+Snapshot
+SnapBuildMVCCFromHistoric(Snapshot snapshot, bool in_place)
+{
+ TransactionId xid;
+ TransactionId *oldxip = snapshot->xip;
+ uint32 oldxcnt = snapshot->xcnt;
+ TransactionId *newxip;
+ int newxcnt = 0;
+ Snapshot result;
+
/* allocate in transaction context */
newxip = (TransactionId *)
palloc(sizeof(TransactionId) * GetMaxSnapshotXidCount());
@@ -495,7 +517,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
* classical snapshot by marking all non-committed transactions as
* in-progress. This can be expensive.
*/
- for (xid = snap->xmin; NormalTransactionIdPrecedes(xid, snap->xmax);)
+ for (xid = snapshot->xmin; NormalTransactionIdPrecedes(xid, snapshot->xmax);)
{
void *test;
@@ -503,7 +525,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
* Check whether transaction committed using the decoding snapshot
* meaning of ->xip.
*/
- test = bsearch(&xid, snap->xip, snap->xcnt,
+ test = bsearch(&xid, snapshot->xip, snapshot->xcnt,
sizeof(TransactionId), xidComparator);
if (test == NULL)
@@ -520,11 +542,22 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
}
/* adjust remaining snapshot fields as needed */
- snap->snapshot_type = SNAPSHOT_MVCC;
- snap->xcnt = newxcnt;
- snap->xip = newxip;
+ snapshot->xcnt = newxcnt;
+ snapshot->xip = newxip;
+
+ if (in_place)
+ result = snapshot;
+ else
+ {
+ result = CopySnapshot(snapshot);
+
+ /* Restore the original values so the source is intact. */
+ snapshot->xip = oldxip;
+ snapshot->xcnt = oldxcnt;
+ }
+ result->snapshot_type = SNAPSHOT_MVCC;
- return snap;
+ return result;
}
/*
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index ea35f30f494..70a6b8902d1 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -212,7 +212,6 @@ typedef struct ExportedSnapshot
static List *exportedSnapshots = NIL;
/* Prototypes for local functions */
-static Snapshot CopySnapshot(Snapshot snapshot);
static void UnregisterSnapshotNoOwner(Snapshot snapshot);
static void FreeSnapshot(Snapshot snapshot);
static void SnapshotResetXmin(void);
@@ -591,7 +590,7 @@ SetTransactionSnapshot(Snapshot sourcesnap, VirtualTransactionId *sourcevxid,
* The copy is palloc'd in TopTransactionContext and has initial refcounts set
* to 0. The returned snapshot has the copied flag set.
*/
-static Snapshot
+Snapshot
CopySnapshot(Snapshot snapshot)
{
Snapshot newsnap;
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 44031dcf6e3..6d4d2d1814c 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -73,6 +73,7 @@ extern void FreeSnapshotBuilder(SnapBuild *builder);
extern void SnapBuildSnapDecRefcount(Snapshot snap);
extern Snapshot SnapBuildInitialSnapshot(SnapBuild *builder);
+extern Snapshot SnapBuildMVCCFromHistoric(Snapshot snapshot, bool in_place);
extern const char *SnapBuildExportSnapshot(SnapBuild *builder);
extern void SnapBuildClearExportedSnapshot(void);
extern void SnapBuildResetExportedSnapshotState(void);
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index d346be71642..147b190210a 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -60,6 +60,7 @@ extern Snapshot GetTransactionSnapshot(void);
extern Snapshot GetLatestSnapshot(void);
extern void SnapshotSetCommandId(CommandId curcid);
+extern Snapshot CopySnapshot(Snapshot snapshot);
extern Snapshot GetCatalogSnapshot(Oid relid);
extern Snapshot GetNonHistoricCatalogSnapshot(Oid relid);
extern void InvalidateCatalogSnapshot(void);
--
2.43.5
Attachment: v11-0003-Move-the-recheck-branch-to-a-separate-function.patch (text/x-diff)
From fa9ef82751fc3a1dda6a09bc149ac26adb16b995 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Mon, 31 Mar 2025 15:47:08 +0200
Subject: [PATCH 3/9] Move the "recheck" branch to a separate function.
At some point I thought that the relation must be unlocked during the call of
setup_logical_decoding(), to avoid a deadlock. In that case we'd need to
recheck afterwards whether the table still meets the requirements of
cluster_rel(). Eventually I concluded that the risk of that deadlock is low
enough that the table can stay locked during the call of
setup_logical_decoding(), so the rechecking code is only executed once per
table. Even so, this patch should improve code readability.
---
src/backend/commands/cluster.c | 106 +++++++++++++++++++--------------
1 file changed, 61 insertions(+), 45 deletions(-)
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 9ae3d87e412..67625d52f12 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -78,6 +78,8 @@ typedef struct
static void cluster_multiple_rels(List *rtcs, ClusterParams *params,
ClusterCommand cmd);
+static bool cluster_rel_recheck(Relation OldHeap, Oid indexOid, Oid userid,
+ ClusterCommand cmd, int options);
static void rebuild_relation(Relation OldHeap, Relation index, bool verbose,
ClusterCommand cmd);
static void copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
@@ -329,52 +331,9 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
* to cluster a not-previously-clustered index.
*/
if (recheck)
- {
- /* Check that the user still has privileges for the relation */
- if (!cluster_is_permitted_for_relation(tableOid, save_userid, cmd))
- {
- relation_close(OldHeap, AccessExclusiveLock);
+ if (!cluster_rel_recheck(OldHeap, indexOid, save_userid, cmd,
+ params->options))
goto out;
- }
-
- /*
- * Silently skip a temp table for a remote session. Only doing this
- * check in the "recheck" case is appropriate (which currently means
- * somebody is executing a database-wide CLUSTER or on a partitioned
- * table), because there is another check in cluster() which will stop
- * any attempt to cluster remote temp tables by name. There is
- * another check in cluster_rel which is redundant, but we leave it
- * for extra safety.
- */
- if (RELATION_IS_OTHER_TEMP(OldHeap))
- {
- relation_close(OldHeap, AccessExclusiveLock);
- goto out;
- }
-
- if (OidIsValid(indexOid))
- {
- /*
- * Check that the index still exists
- */
- if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(indexOid)))
- {
- relation_close(OldHeap, AccessExclusiveLock);
- goto out;
- }
-
- /*
- * Check that the index is still the one with indisclustered set,
- * if needed.
- */
- if ((params->options & CLUOPT_RECHECK_ISCLUSTERED) != 0 &&
- !get_index_isclustered(indexOid))
- {
- relation_close(OldHeap, AccessExclusiveLock);
- goto out;
- }
- }
- }
/*
* We allow VACUUM FULL, but not CLUSTER, on shared catalogs. CLUSTER
@@ -459,6 +418,63 @@ out:
pgstat_progress_end_command();
}
+/*
+ * Check if the table (and its index) still meets the requirements of
+ * cluster_rel().
+ */
+static bool
+cluster_rel_recheck(Relation OldHeap, Oid indexOid, Oid userid,
+ ClusterCommand cmd, int options)
+{
+ Oid tableOid = RelationGetRelid(OldHeap);
+
+ /* Check that the user still has privileges for the relation */
+ if (!cluster_is_permitted_for_relation(tableOid, userid, cmd))
+ {
+ relation_close(OldHeap, AccessExclusiveLock);
+ return false;
+ }
+
+ /*
+ * Silently skip a temp table for a remote session. Only doing this check
+ * in the "recheck" case is appropriate (which currently means somebody is
+ * executing a database-wide CLUSTER or on a partitioned table), because
+ * there is another check in cluster() which will stop any attempt to
+ * cluster remote temp tables by name. There is another check in
+ * cluster_rel which is redundant, but we leave it for extra safety.
+ */
+ if (RELATION_IS_OTHER_TEMP(OldHeap))
+ {
+ relation_close(OldHeap, AccessExclusiveLock);
+ return false;
+ }
+
+ if (OidIsValid(indexOid))
+ {
+ /*
+ * Check that the index still exists
+ */
+ if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(indexOid)))
+ {
+ relation_close(OldHeap, AccessExclusiveLock);
+ return false;
+ }
+
+ /*
+ * Check that the index is still the one with indisclustered set, if
+ * needed.
+ */
+ if ((options & CLUOPT_RECHECK_ISCLUSTERED) != 0 &&
+ !get_index_isclustered(indexOid))
+ {
+ relation_close(OldHeap, AccessExclusiveLock);
+ return false;
+ }
+ }
+
+ return true;
+}
+
/*
* Verify that the specified heap and index are valid to cluster on
*
--
2.43.5
v11-0004-Add-CONCURRENTLY-option-to-REPACK-command.patch
From 36b3606e1bd5a3e8c65ea69802800efe5d170a8f Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Mon, 31 Mar 2025 15:47:08 +0200
Subject: [PATCH 4/9] Add CONCURRENTLY option to REPACK command.
The REPACK command copies the relation data into a new file, creates new
indexes and eventually swaps the files. To make sure that the old file does
not change during the copying, the relation is locked in an exclusive mode,
which prevents applications from both reading and writing. (To keep the data
consistent, we'd only need to prevent the applications from writing, but even
reading needs to be blocked before we can swap the files - otherwise some
applications could continue using the old file. Since we cannot get a stronger
lock without releasing the weaker one first, we acquire the exclusive lock at
the beginning and keep it until the end of the processing.)
This patch introduces an alternative workflow, which only requires the
exclusive lock when the relation (and index) files are being swapped.
(Supposedly, the swapping should be pretty fast.) On the other hand, when we
copy the data to the new file, we allow applications to read from the relation
and even write into it.
First, we scan the relation using a "historic snapshot", and insert all the
tuples satisfying this snapshot into the new file.
Second, logical decoding is used to capture the data changes done by
applications during the copying (i.e. changes that do not satisfy the historic
snapshot mentioned above), and those are applied to the new file before we
acquire the exclusive lock we need to swap the files. (Of course, more data
changes can take place while we are waiting for the lock - these will be
applied to the new file after we have acquired the lock, before we swap the
files.)
Since the logical decoding system, during its startup, waits until all the
transactions which already have XID assigned have finished, there is a risk of
deadlock if a transaction that already changed anything in the database tries
to acquire a conflicting lock on the table REPACK CONCURRENTLY is working
on. As an example, consider transaction running CREATE INDEX command on the
table that is being REPACKed CONCURRENTLY. On the other hand, DML commands
(INSERT, UPDATE, DELETE) are not a problem as their lock does not conflict
with REPACK CONCURRENTLY.
The current approach is that we accept the risk. If we tried to avoid it, it'd
be necessary to unlock the table before the logical decoding is setup and lock
it again afterwards. Such temporary unlocking would imply re-checking if the
table still meets all the requirements for REPACK CONCURRENTLY.
Like the existing implementation of REPACK, the variant with the CONCURRENTLY
option also requires an extra space for the new relation and index files
(which coexist with the old files for some time). In addition, the
CONCURRENTLY option might introduce a lag in releasing WAL segments for
archiving / recycling. This is due to the decoding of the data changes done by
applications concurrently. When copying the table contents into the new file,
we check the lag periodically. If it exceeds the size of a WAL segment, we
decode all the available WAL before resuming the copying. (Of course, the
changes are not applied until the whole table contents has been copied.) A
background worker might be a better approach for the decoding - let's consider
implementing it in the future.
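The workflow described above (historic-snapshot copy, catch-up via decoded changes, then a brief exclusive lock for the file swap) can be sketched with plain data structures. This is an illustrative model only, not PostgreSQL code: the dict stands in for the heap file, and the change log stands in for the stream produced by logical decoding.

```python
def repack_concurrently(old_table, change_log):
    """Model of the REPACK CONCURRENTLY data flow.

    old_table:  {key: value} tuples visible at the start (the old file).
    change_log: DML captured while the copy was running, as
                ("insert"|"update"|"delete", key, value) records.
    """
    # Phase 1: copy everything satisfying the "historic snapshot"
    # into the new file.
    snapshot = dict(old_table)
    new_table = dict(snapshot)

    # Phase 2: catch-up - replay the concurrent changes decoded from
    # WAL onto the new file, before requesting the exclusive lock.
    for op, key, value in change_log:
        if op in ("insert", "update"):
            new_table[key] = value
        elif op == "delete":
            new_table.pop(key, None)

    # Phase 3: under the (short) exclusive lock, the new file simply
    # replaces the old one.
    return new_table
```

For example, rows inserted, updated, or deleted by other transactions during the copy end up reflected in the swapped-in table: `repack_concurrently({1: "a", 2: "b"}, [("insert", 3, "c"), ("update", 1, "a2"), ("delete", 2, None)])` yields `{1: "a2", 3: "c"}`.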
---
doc/src/sgml/monitoring.sgml | 65 +-
doc/src/sgml/ref/repack.sgml | 116 +-
src/Makefile | 1 +
src/backend/access/heap/heapam_handler.c | 145 +-
src/backend/access/heap/heapam_visibility.c | 30 +-
src/backend/access/heap/rewriteheap.c | 6 +-
src/backend/access/transam/xact.c | 11 +-
src/backend/catalog/index.c | 43 +-
src/backend/catalog/system_views.sql | 30 +-
src/backend/commands/cluster.c | 1817 +++++++++++++++--
src/backend/commands/matview.c | 2 +-
src/backend/commands/tablecmds.c | 1 +
src/backend/commands/vacuum.c | 12 +-
src/backend/meson.build | 1 +
src/backend/parser/gram.y | 17 +-
src/backend/replication/logical/decode.c | 24 +
src/backend/replication/logical/snapbuild.c | 20 +
.../replication/pgoutput_repack/Makefile | 32 +
.../replication/pgoutput_repack/meson.build | 18 +
.../pgoutput_repack/pgoutput_repack.c | 288 +++
src/backend/storage/ipc/ipci.c | 1 +
.../utils/activity/wait_event_names.txt | 1 +
src/backend/utils/cache/relcache.c | 1 +
src/backend/utils/time/snapmgr.c | 3 +-
src/bin/psql/tab-complete.in.c | 25 +-
src/include/access/heapam.h | 4 +
src/include/access/tableam.h | 10 +
src/include/catalog/index.h | 3 +
src/include/commands/cluster.h | 87 +-
src/include/commands/progress.h | 17 +-
src/include/nodes/parsenodes.h | 1 +
src/include/replication/snapbuild.h | 1 +
src/include/storage/lockdefs.h | 4 +-
src/include/storage/lwlocklist.h | 1 +
src/include/utils/snapmgr.h | 2 +
src/test/regress/expected/rules.out | 29 +-
src/tools/pgindent/typedefs.list | 4 +
37 files changed, 2610 insertions(+), 263 deletions(-)
create mode 100644 src/backend/replication/pgoutput_repack/Makefile
create mode 100644 src/backend/replication/pgoutput_repack/meson.build
create mode 100644 src/backend/replication/pgoutput_repack/pgoutput_repack.c
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 0a6229c391a..e385a55272b 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5835,14 +5835,35 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>heap_tuples_written</structfield> <type>bigint</type>
+ <structfield>heap_tuples_inserted</structfield> <type>bigint</type>
</para>
<para>
- Number of heap tuples written.
+ Number of heap tuples inserted.
This counter only advances when the phase is
<literal>seq scanning heap</literal>,
- <literal>index scanning heap</literal>
- or <literal>writing new heap</literal>.
+ <literal>index scanning heap</literal>,
+ <literal>writing new heap</literal>
+ or <literal>catch-up</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_updated</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples updated.
+ This counter only advances when the phase is <literal>catch-up</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_deleted</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples deleted.
+ This counter only advances when the phase is <literal>catch-up</literal>.
</para></entry>
</row>
@@ -6058,14 +6079,35 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>heap_tuples_written</structfield> <type>bigint</type>
+ <structfield>heap_tuples_inserted</structfield> <type>bigint</type>
</para>
<para>
- Number of heap tuples written.
+ Number of heap tuples inserted.
This counter only advances when the phase is
<literal>seq scanning heap</literal>,
- <literal>index scanning heap</literal>
- or <literal>writing new heap</literal>.
+ <literal>index scanning heap</literal>,
+ <literal>writing new heap</literal>
+ or <literal>catch-up</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_updated</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples updated.
+ This counter only advances when the phase is <literal>catch-up</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_deleted</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples deleted.
+ This counter only advances when the phase is <literal>catch-up</literal>.
</para></entry>
</row>
@@ -6146,6 +6188,13 @@ FROM pg_stat_get_backend_idset() AS backendid;
<command>REPACK</command> is currently writing the new heap.
</entry>
</row>
+ <row>
+ <entry><literal>catch-up</literal></entry>
+ <entry>
+ <command>REPACK</command> is currently processing the DML commands that
+ other transactions executed during any of the preceding phases.
+ </entry>
+ </row>
<row>
<entry><literal>swapping relation files</literal></entry>
<entry>
diff --git a/doc/src/sgml/ref/repack.sgml b/doc/src/sgml/ref/repack.sgml
index 84f3c3e3f2b..9ee640e3517 100644
--- a/doc/src/sgml/ref/repack.sgml
+++ b/doc/src/sgml/ref/repack.sgml
@@ -22,6 +22,7 @@ PostgreSQL documentation
<refsynopsisdiv>
<synopsis>
REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <replaceable class="parameter">table_name</replaceable> [ USING INDEX <replaceable class="parameter">index_name</replaceable> ] ]
+REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] CONCURRENTLY <replaceable class="parameter">table_name</replaceable> [ USING INDEX <replaceable class="parameter">index_name</replaceable> ]
<phrase>where <replaceable class="parameter">option</replaceable> can be one of:</phrase>
@@ -48,7 +49,8 @@ REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
processes every table and materialized view in the current database that
the current user has the <literal>MAINTAIN</literal> privilege on. This
form of <command>REPACK</command> cannot be executed inside a transaction
- block.
+ block. Also, this form is not allowed if
+ the <literal>CONCURRENTLY</literal> option is used.
</para>
<para>
@@ -61,7 +63,8 @@ REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
When a table is being repacked, an <literal>ACCESS EXCLUSIVE</literal> lock
is acquired on it. This prevents any other database operations (both reads
and writes) from operating on the table until the <command>REPACK</command>
- is finished.
+ is finished. If you want to keep the table accessible during the repacking,
+ consider using the <literal>CONCURRENTLY</literal> option.
</para>
<refsect2 id="sql-repack-notes-on-clustering" xreflabel="Notes on Clustering">
@@ -160,6 +163,115 @@ REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><literal>CONCURRENTLY</literal></term>
+ <listitem>
+ <para>
+ Allow other transactions to use the table while it is being repacked.
+ </para>
+
+ <para>
+ Internally, <command>REPACK</command> copies the contents of the table
+ (ignoring dead tuples) into a new file, sorted by the specified index,
+ and also creates a new file for each index. Then it swaps the old and
+ new files for the table and all the indexes, and deletes the old
+ files. The <literal>ACCESS EXCLUSIVE</literal> lock is needed to make
+ sure that the old files do not change during the processing because the
+ changes would get lost due to the swap.
+ </para>
+
+ <para>
+ With the <literal>CONCURRENTLY</literal> option, the <literal>ACCESS
+ EXCLUSIVE</literal> lock is only acquired to swap the table and index
+ files. The data changes that took place during the creation of the new
+ table and index files are captured using logical decoding
+ (<xref linkend="logicaldecoding"/>) and applied before
+ the <literal>ACCESS EXCLUSIVE</literal> lock is requested. Thus the lock
+ is typically held only for the time needed to swap the files, which
+ should be pretty short.
+ </para>
+
+ <para>
+ Note that <command>REPACK</command> with
+ the <literal>CONCURRENTLY</literal> option does not try to order the
+ rows inserted into the table after the repacking started. Also
+ note <command>REPACK</command> might fail to complete due to DDL
+ commands executed on the table by other transactions during the
+ repacking.
+ </para>
+
+ <note>
+ <para>
+ In addition to the temporary space requirements explained in
+ <xref linkend="sql-repack-notes-on-resources"/>,
+ the <literal>CONCURRENTLY</literal> option can add to the usage of
+ temporary space a bit more. The reason is that other transactions can
+ perform DML operations which cannot be applied to the new file until
+ <command>REPACK</command> has copied all the tuples from the old
+ file. Thus the tuples inserted into the old file during the copying are
+ also stored separately in a temporary file, so they can eventually
+ be applied to the new file.
+ </para>
+
+ <para>
+ Furthermore, the data changes performed during the copying are
+ extracted from the <link linkend="wal">write-ahead log</link> (WAL), and
+ this extraction (decoding) only takes place when a certain amount of WAL
+ has been written. Therefore, WAL removal can be delayed by this
+ threshold. Currently the threshold is equal to the value of
+ the <link linkend="guc-wal-segment-size"><varname>wal_segment_size</varname></link>
+ configuration parameter.
+ </para>
+ </note>
+
+ <para>
+ The <literal>CONCURRENTLY</literal> option cannot be used in the
+ following cases:
+
+ <itemizedlist>
+ <listitem>
+ <para>
+ The table is <literal>UNLOGGED</literal>.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The table is partitioned.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The table is a system catalog or a <acronym>TOAST</acronym> table.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <command>REPACK</command> is executed inside a transaction block.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The <link linkend="guc-wal-level"><varname>wal_level</varname></link>
+ configuration parameter is less than <literal>logical</literal>.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The <link linkend="guc-max-replication-slots"><varname>max_replication_slots</varname></link>
+ configuration parameter does not allow for creation of an additional
+ replication slot.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><literal>VERBOSE</literal></term>
<listitem>
diff --git a/src/Makefile b/src/Makefile
index 2f31a2f20a7..b18c9a14ffa 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -23,6 +23,7 @@ SUBDIRS = \
interfaces \
backend/replication/libpqwalreceiver \
backend/replication/pgoutput \
+ backend/replication/pgoutput_repack \
fe_utils \
bin \
pl \
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 18e349c3466..371afa6ad59 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -33,6 +33,7 @@
#include "catalog/index.h"
#include "catalog/storage.h"
#include "catalog/storage_xlog.h"
+#include "commands/cluster.h"
#include "commands/progress.h"
#include "executor/executor.h"
#include "miscadmin.h"
@@ -53,6 +54,9 @@ static void reform_and_rewrite_tuple(HeapTuple tuple,
static bool SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
HeapTuple tuple,
OffsetNumber tupoffset);
+static HeapTuple accept_tuple_for_concurrent_copy(HeapTuple tuple,
+ Snapshot snapshot,
+ Buffer buffer);
static BlockNumber heapam_scan_get_blocks_done(HeapScanDesc hscan);
@@ -685,6 +689,8 @@ static void
heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
Relation OldIndex, bool use_sort,
TransactionId OldestXmin,
+ Snapshot snapshot,
+ LogicalDecodingContext *decoding_ctx,
TransactionId *xid_cutoff,
MultiXactId *multi_cutoff,
double *num_tuples,
@@ -705,6 +711,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
bool *isnull;
BufferHeapTupleTableSlot *hslot;
BlockNumber prev_cblock = InvalidBlockNumber;
+ bool concurrent = snapshot != NULL;
+ XLogRecPtr end_of_wal_prev = GetFlushRecPtr(NULL);
/* Remember if it's a system catalog */
is_system_catalog = IsSystemRelation(OldHeap);
@@ -783,8 +791,10 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
for (;;)
{
HeapTuple tuple;
+ bool tuple_copied = false;
Buffer buf;
bool isdead;
+ HTSV_Result vis;
CHECK_FOR_INTERRUPTS();
@@ -839,7 +849,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
LockBuffer(buf, BUFFER_LOCK_SHARE);
- switch (HeapTupleSatisfiesVacuum(tuple, OldestXmin, buf))
+ switch ((vis = HeapTupleSatisfiesVacuum(tuple, OldestXmin, buf)))
{
case HEAPTUPLE_DEAD:
/* Definitely dead */
@@ -855,14 +865,15 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
case HEAPTUPLE_INSERT_IN_PROGRESS:
/*
- * Since we hold exclusive lock on the relation, normally the
- * only way to see this is if it was inserted earlier in our
- * own transaction. However, it can happen in system
+ * As long as we hold exclusive lock on the relation, normally
+ * the only way to see this is if it was inserted earlier in
+ * our own transaction. However, it can happen in system
* catalogs, since we tend to release write lock before commit
- * there. Give a warning if neither case applies; but in any
- * case we had better copy it.
+ * there. Also, there's no exclusive lock during concurrent
+ * processing. Give a warning if neither case applies; but in
+ * any case we had better copy it.
*/
- if (!is_system_catalog &&
+ if (!is_system_catalog && !concurrent &&
!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(tuple->t_data)))
elog(WARNING, "concurrent insert in progress within table \"%s\"",
RelationGetRelationName(OldHeap));
@@ -874,7 +885,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
/*
* Similar situation to INSERT_IN_PROGRESS case.
*/
- if (!is_system_catalog &&
+ if (!is_system_catalog && !concurrent &&
!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetUpdateXid(tuple->t_data)))
elog(WARNING, "concurrent delete in progress within table \"%s\"",
RelationGetRelationName(OldHeap));
@@ -888,8 +899,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
break;
}
- LockBuffer(buf, BUFFER_LOCK_UNLOCK);
-
if (isdead)
{
*tups_vacuumed += 1;
@@ -900,9 +909,47 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
*tups_vacuumed += 1;
*tups_recently_dead -= 1;
}
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
continue;
}
+ if (concurrent)
+ {
+ /*
+ * Ignore concurrent changes now, they'll be processed later via
+ * logical decoding.
+ *
+ * INSERT_IN_PROGRESS is rejected right away because our snapshot
+ * represents a point in time that should precede (or be
+ * equal to) the state of transactions as it was when the
+ * "SatisfiesVacuum" test was performed. Thus
+ * accept_tuple_for_concurrent_copy() should not consider the
+ * tuple inserted.
+ */
+ if (vis == HEAPTUPLE_INSERT_IN_PROGRESS)
+ tuple = NULL;
+ else
+ tuple = accept_tuple_for_concurrent_copy(tuple, snapshot,
+ buf);
+ /* Tuple not suitable for the new heap? */
+ if (tuple == NULL)
+ {
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ continue;
+ }
+
+ /* Remember that we have to free the tuple eventually. */
+ tuple_copied = true;
+ }
+
+ /*
+ * In the concurrent case, we have a copy of the tuple, so we don't
+ * worry whether the source tuple will be deleted / updated after we
+ * release the lock.
+ */
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
*num_tuples += 1;
if (tuplesort != NULL)
{
@@ -919,7 +966,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
{
const int ct_index[] = {
PROGRESS_REPACK_HEAP_TUPLES_SCANNED,
- PROGRESS_REPACK_HEAP_TUPLES_WRITTEN
+ PROGRESS_REPACK_HEAP_TUPLES_INSERTED
};
int64 ct_val[2];
@@ -934,6 +981,33 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
ct_val[1] = *num_tuples;
pgstat_progress_update_multi_param(2, ct_index, ct_val);
}
+ if (tuple_copied)
+ heap_freetuple(tuple);
+
+ /*
+ * Process the WAL produced by the load, as well as by other
+ * transactions, so that the replication slot can advance and WAL does
+ * not pile up. Use wal_segment_size as a threshold so that we do not
+ * introduce the decoding overhead too often.
+ *
+ * Of course, we must not apply the changes until the initial load has
+ * completed.
+ *
+ * Note that our insertions into the new table should not be decoded
+ * as we (intentionally) do not write the logical decoding specific
+ * information to WAL.
+ */
+ if (concurrent)
+ {
+ XLogRecPtr end_of_wal;
+
+ end_of_wal = GetFlushRecPtr(NULL);
+ if ((end_of_wal - end_of_wal_prev) > wal_segment_size)
+ {
+ repack_decode_concurrent_changes(decoding_ctx, end_of_wal);
+ end_of_wal_prev = end_of_wal;
+ }
+ }
}
if (indexScan != NULL)
@@ -977,7 +1051,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
values, isnull,
rwstate);
/* Report n_tuples */
- pgstat_progress_update_param(PROGRESS_REPACK_HEAP_TUPLES_WRITTEN,
+ pgstat_progress_update_param(PROGRESS_REPACK_HEAP_TUPLES_INSERTED,
n_tuples);
}
@@ -2023,6 +2097,53 @@ heapam_scan_get_blocks_done(HeapScanDesc hscan)
return blocks_done;
}
+/*
+ * Return copy of 'tuple' if it has been inserted according to 'snapshot', or
+ * NULL if the insertion took place in the future. If the tuple is already
+ * marked as deleted or updated by a transaction that 'snapshot' still
+ * considers running, clear the deletion / update XID in the header of the
+ * copied tuple. This way the returned tuple is suitable for insertion into
+ * the new heap.
+ */
+static HeapTuple
+accept_tuple_for_concurrent_copy(HeapTuple tuple, Snapshot snapshot,
+ Buffer buffer)
+{
+ HeapTuple result;
+
+ Assert(snapshot->snapshot_type == SNAPSHOT_MVCC);
+
+ /*
+ * First, check if the tuple insertion is visible by our snapshot.
+ */
+ if (!HeapTupleMVCCInserted(tuple, snapshot, buffer))
+ return NULL;
+
+ result = heap_copytuple(tuple);
+
+ /*
+ * If the tuple was deleted / updated but our snapshot still sees it, we
+ * need to keep it. In that case, clear the information that indicates the
+ * deletion / update. Otherwise the tuple chain would stay incomplete (as
+ * we will reject the new tuple above), and the delete / update would fail
+ * if executed later during logical decoding.
+ */
+ if (TransactionIdIsNormal(HeapTupleHeaderGetRawXmax(result->t_data)) &&
+ HeapTupleMVCCNotDeleted(result, snapshot, buffer))
+ {
+ /* TODO More work needed here? */
+ result->t_data->t_infomask |= HEAP_XMAX_INVALID;
+ HeapTupleHeaderSetXmax(result->t_data, 0);
+ }
+
+ /*
+ * Accept the tuple even if our snapshot considers it deleted - older
+ * snapshots can still see the tuple, while the decoded transactions
+ * should not try to update / delete it again.
+ */
+ return result;
+}
+
/* ------------------------------------------------------------------------
* Miscellaneous callbacks for the heap AM
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index 05f6946fe60..a46e1812b21 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -955,13 +955,14 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
* did TransactionIdIsInProgress in each call --- to no avail, as long as the
* inserting/deleting transaction was still running --- which was more cycles
* and more contention on ProcArrayLock.
+ *
+ * The checks are split into two functions, HeapTupleMVCCInserted() and
+ * HeapTupleMVCCNotDeleted(), because they are also useful separately.
*/
static bool
HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
Buffer buffer)
{
- HeapTupleHeader tuple = htup->t_data;
-
/*
* Assert that the caller has registered the snapshot. This function
* doesn't care about the registration as such, but in general you
@@ -974,6 +975,20 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
Assert(ItemPointerIsValid(&htup->t_self));
Assert(htup->t_tableOid != InvalidOid);
+ return HeapTupleMVCCInserted(htup, snapshot, buffer) &&
+ HeapTupleMVCCNotDeleted(htup, snapshot, buffer);
+}
+
+/*
+ * HeapTupleMVCCInserted
+ * True iff heap tuple was successfully inserted for the given MVCC
+ * snapshot.
+ */
+bool
+HeapTupleMVCCInserted(HeapTuple htup, Snapshot snapshot, Buffer buffer)
+{
+ HeapTupleHeader tuple = htup->t_data;
+
if (!HeapTupleHeaderXminCommitted(tuple))
{
if (HeapTupleHeaderXminInvalid(tuple))
@@ -1082,6 +1097,17 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
}
/* by here, the inserting transaction has committed */
+ return true;
+}
+
+/*
+ * HeapTupleMVCCNotDeleted
+ * True iff heap tuple was not deleted for the given MVCC snapshot.
+ */
+bool
+HeapTupleMVCCNotDeleted(HeapTuple htup, Snapshot snapshot, Buffer buffer)
+{
+ HeapTupleHeader tuple = htup->t_data;
if (tuple->t_infomask & HEAP_XMAX_INVALID) /* xid invalid or aborted */
return true;
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index e6d2b5fced1..6aa2ed214f2 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -617,9 +617,9 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
int options = HEAP_INSERT_SKIP_FSM;
/*
- * While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
- * for the TOAST table are not logically decoded. The main heap is
- * WAL-logged as XLOG FPI records, which are not logically decoded.
+ * While rewriting the heap for REPACK, make sure data for the TOAST
+ * table are not logically decoded. The main heap is WAL-logged as
+ * XLOG FPI records, which are not logically decoded.
*/
options |= HEAP_INSERT_NO_LOGICAL;
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b885513f765..23f2de587a1 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -215,6 +215,7 @@ typedef struct TransactionStateData
bool parallelChildXact; /* is any parent transaction parallel? */
bool chain; /* start a new block after this one */
bool topXidLogged; /* for a subxact: is top-level XID logged? */
+ bool internal; /* for a subxact: launched internally? */
struct TransactionStateData *parent; /* back link to parent */
} TransactionStateData;
@@ -4723,6 +4724,7 @@ BeginInternalSubTransaction(const char *name)
/* Normal subtransaction start */
PushTransaction();
s = CurrentTransactionState; /* changed by push */
+ s->internal = true;
/*
* Savepoint names, like the TransactionState block itself, live
@@ -5239,7 +5241,13 @@ AbortSubTransaction(void)
LWLockReleaseAll();
pgstat_report_wait_end();
- pgstat_progress_end_command();
+
+ /*
+ * An internal subtransaction might be used by a user command, in which case
+ * the command outlives the subtransaction.
+ */
+ if (!s->internal)
+ pgstat_progress_end_command();
pgaio_error_cleanup();
@@ -5456,6 +5464,7 @@ PushTransaction(void)
s->parallelModeLevel = 0;
s->parallelChildXact = (p->parallelModeLevel != 0 || p->parallelChildXact);
s->topXidLogged = false;
+ s->internal = false;
CurrentTransactionState = s;
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 466cf0fdef6..c70521d1d54 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1418,22 +1418,7 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
opclassOptions[i] = get_attoptions(oldIndexId, i + 1);
- /* Extract statistic targets for each attribute */
- stattargets = palloc0_array(NullableDatum, newInfo->ii_NumIndexAttrs);
- for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
- {
- HeapTuple tp;
- Datum dat;
-
- tp = SearchSysCache2(ATTNUM, ObjectIdGetDatum(oldIndexId), Int16GetDatum(i + 1));
- if (!HeapTupleIsValid(tp))
- elog(ERROR, "cache lookup failed for attribute %d of relation %u",
- i + 1, oldIndexId);
- dat = SysCacheGetAttr(ATTNUM, tp, Anum_pg_attribute_attstattarget, &isnull);
- ReleaseSysCache(tp);
- stattargets[i].value = dat;
- stattargets[i].isnull = isnull;
- }
+ stattargets = get_index_stattargets(oldIndexId, newInfo);
/*
* Now create the new index.
@@ -1472,6 +1457,32 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
return newIndexId;
}
+NullableDatum *
+get_index_stattargets(Oid indexid, IndexInfo *indInfo)
+{
+ NullableDatum *stattargets;
+
+ /* Extract statistic targets for each attribute */
+ stattargets = palloc0_array(NullableDatum, indInfo->ii_NumIndexAttrs);
+ for (int i = 0; i < indInfo->ii_NumIndexAttrs; i++)
+ {
+ HeapTuple tp;
+ Datum dat;
+ bool isnull;
+
+ tp = SearchSysCache2(ATTNUM, ObjectIdGetDatum(indexid), Int16GetDatum(i + 1));
+ if (!HeapTupleIsValid(tp))
+ elog(ERROR, "cache lookup failed for attribute %d of relation %u",
+ i + 1, indexid);
+ dat = SysCacheGetAttr(ATTNUM, tp, Anum_pg_attribute_attstattarget, &isnull);
+ ReleaseSysCache(tp);
+ stattargets[i].value = dat;
+ stattargets[i].isnull = isnull;
+ }
+
+ return stattargets;
+}
+
/*
* index_concurrently_build
*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 5de46bcac52..70265e5e701 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1249,16 +1249,17 @@ CREATE VIEW pg_stat_progress_cluster AS
WHEN 2 THEN 'index scanning heap'
WHEN 3 THEN 'sorting tuples'
WHEN 4 THEN 'writing new heap'
- WHEN 5 THEN 'swapping relation files'
- WHEN 6 THEN 'rebuilding index'
- WHEN 7 THEN 'performing final cleanup'
+ -- 5 is 'catch-up', but that should not appear here.
+ WHEN 6 THEN 'swapping relation files'
+ WHEN 7 THEN 'rebuilding index'
+ WHEN 8 THEN 'performing final cleanup'
END AS phase,
CAST(S.param3 AS oid) AS cluster_index_relid,
S.param4 AS heap_tuples_scanned,
S.param5 AS heap_tuples_written,
- S.param6 AS heap_blks_total,
- S.param7 AS heap_blks_scanned,
- S.param8 AS index_rebuild_count
+ S.param8 AS heap_blks_total,
+ S.param9 AS heap_blks_scanned,
+ S.param10 AS index_rebuild_count
FROM pg_stat_get_progress_info('CLUSTER') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
@@ -1275,16 +1276,19 @@ CREATE VIEW pg_stat_progress_repack AS
WHEN 2 THEN 'index scanning heap'
WHEN 3 THEN 'sorting tuples'
WHEN 4 THEN 'writing new heap'
- WHEN 5 THEN 'swapping relation files'
- WHEN 6 THEN 'rebuilding index'
- WHEN 7 THEN 'performing final cleanup'
+ WHEN 5 THEN 'catch-up'
+ WHEN 6 THEN 'swapping relation files'
+ WHEN 7 THEN 'rebuilding index'
+ WHEN 8 THEN 'performing final cleanup'
END AS phase,
CAST(S.param3 AS oid) AS repack_index_relid,
S.param4 AS heap_tuples_scanned,
- S.param5 AS heap_tuples_written,
- S.param6 AS heap_blks_total,
- S.param7 AS heap_blks_scanned,
- S.param8 AS index_rebuild_count
+ S.param5 AS heap_tuples_inserted,
+ S.param6 AS heap_tuples_updated,
+ S.param7 AS heap_tuples_deleted,
+ S.param8 AS heap_blks_total,
+ S.param9 AS heap_blks_scanned,
+ S.param10 AS index_rebuild_count
FROM pg_stat_get_progress_info('REPACK') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
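The renumbering in the views above (a new 'catch-up' phase takes number 5, shifting the later phases up by one) can be sketched as a plain C mapping. The numbers mirror the CASE expression; the function itself is hypothetical, not part of the patch:

```c
#include <assert.h>
#include <string.h>

/*
 * Map a progress phase number to its display name, mirroring the CASE
 * expression in pg_stat_progress_repack.  Phase 5, 'catch-up', is the
 * phase added for REPACK CONCURRENTLY.
 */
static const char *
repack_phase_name(int phase)
{
	switch (phase)
	{
		case 2: return "index scanning heap";
		case 3: return "sorting tuples";
		case 4: return "writing new heap";
		case 5: return "catch-up";
		case 6: return "swapping relation files";
		case 7: return "rebuilding index";
		case 8: return "performing final cleanup";
		default: return "";
	}
}
```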
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 67625d52f12..b1aa1e8d820 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -25,6 +25,10 @@
#include "access/toast_internals.h"
#include "access/transam.h"
#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xloginsert.h"
+#include "access/xlogutils.h"
#include "catalog/catalog.h"
#include "catalog/dependency.h"
#include "catalog/heap.h"
@@ -32,6 +36,7 @@
#include "catalog/namespace.h"
#include "catalog/objectaccess.h"
#include "catalog/pg_am.h"
+#include "catalog/pg_control.h"
#include "catalog/pg_inherits.h"
#include "catalog/toasting.h"
#include "commands/cluster.h"
@@ -39,10 +44,15 @@
#include "commands/progress.h"
#include "commands/tablecmds.h"
#include "commands/vacuum.h"
+#include "executor/executor.h"
#include "miscadmin.h"
#include "optimizer/optimizer.h"
#include "pgstat.h"
+#include "replication/decode.h"
+#include "replication/logical.h"
+#include "replication/snapbuild.h"
#include "storage/bufmgr.h"
+#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
#include "utils/acl.h"
@@ -76,16 +86,46 @@ typedef struct
((cmd) == CLUSTER_COMMAND_REPACK ? \
"repack" : "vacuum"))
+/*
+ * The following definitions are used for concurrent processing.
+ */
+
+/*
+ * The locators are used to avoid logical decoding of data that does not
+ * belong to the table being repacked.
+ */
+RelFileLocator repacked_rel_locator = {.relNumber = InvalidOid};
+RelFileLocator repacked_rel_toast_locator = {.relNumber = InvalidOid};
+
+/*
+ * Everything we need to call ExecInsertIndexTuples().
+ */
+typedef struct IndexInsertState
+{
+ ResultRelInfo *rri;
+ EState *estate;
+
+ Relation ident_index;
+} IndexInsertState;
+
+/* The WAL segment being decoded. */
+static XLogSegNo repack_current_segment = 0;
+
static void cluster_multiple_rels(List *rtcs, ClusterParams *params,
- ClusterCommand cmd);
+ ClusterCommand cmd, LOCKMODE lockmode,
+ bool isTopLevel);
static bool cluster_rel_recheck(Relation OldHeap, Oid indexOid, Oid userid,
- ClusterCommand cmd, int options);
+ ClusterCommand cmd, LOCKMODE lmode,
+ int options);
+static void check_repack_concurrently_requirements(Relation rel);
static void rebuild_relation(Relation OldHeap, Relation index, bool verbose,
- ClusterCommand cmd);
+ ClusterCommand cmd, bool concurrent, Oid userid);
static void copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
+ Snapshot snapshot, LogicalDecodingContext *decoding_ctx,
bool verbose, ClusterCommand cmd,
bool *pSwapToastByContent,
- TransactionId *pFreezeXid, MultiXactId *pCutoffMulti);
+ TransactionId *pFreezeXid,
+ MultiXactId *pCutoffMulti);
static List *get_tables_to_cluster(MemoryContext cluster_context);
static List *get_tables_to_repack(MemoryContext repack_context);
static List *get_tables_to_cluster_partitioned(MemoryContext cluster_context,
@@ -93,8 +133,53 @@ static List *get_tables_to_cluster_partitioned(MemoryContext cluster_context,
ClusterCommand cmd);
static bool cluster_is_permitted_for_relation(Oid relid, Oid userid,
ClusterCommand cmd);
+static void begin_concurrent_repack(Relation rel);
+static void end_concurrent_repack(void);
+static LogicalDecodingContext *setup_logical_decoding(Oid relid,
+ const char *slotname,
+ TupleDesc tupdesc);
+static HeapTuple get_changed_tuple(char *change);
+static void apply_concurrent_changes(RepackDecodingState *dstate,
+ Relation rel, ScanKey key, int nkeys,
+ IndexInsertState *iistate);
+static void apply_concurrent_insert(Relation rel, ConcurrentChange *change,
+ HeapTuple tup, IndexInsertState *iistate,
+ TupleTableSlot *index_slot);
+static void apply_concurrent_update(Relation rel, HeapTuple tup,
+ HeapTuple tup_target,
+ ConcurrentChange *change,
+ IndexInsertState *iistate,
+ TupleTableSlot *index_slot);
+static void apply_concurrent_delete(Relation rel, HeapTuple tup_target,
+ ConcurrentChange *change);
+static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
+ HeapTuple tup_key,
+ IndexInsertState *iistate,
+ TupleTableSlot *ident_slot,
+ IndexScanDesc *scan_p);
+static void process_concurrent_changes(LogicalDecodingContext *ctx,
+ XLogRecPtr end_of_wal,
+ Relation rel_dst,
+ Relation rel_src,
+ ScanKey ident_key,
+ int ident_key_nentries,
+ IndexInsertState *iistate);
+static IndexInsertState *get_index_insert_state(Relation relation,
+ Oid ident_index_id);
+static ScanKey build_identity_key(Oid ident_idx_oid, Relation rel_src,
+ int *nentries);
+static void free_index_insert_state(IndexInsertState *iistate);
+static void cleanup_logical_decoding(LogicalDecodingContext *ctx);
+static void rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
+ Relation cl_index,
+ LogicalDecodingContext *ctx,
+ bool swap_toast_by_content,
+ TransactionId frozenXid,
+ MultiXactId cutoffMulti);
+static List *build_new_indexes(Relation NewHeap, Relation OldHeap, List *OldIndexes);
static Relation process_single_relation(RangeVar *relation, char *indexname,
- ClusterCommand cmd,
+ ClusterCommand cmd, LOCKMODE lockmode,
+ bool isTopLevel,
ClusterParams *params,
Oid *indexOid_p);
@@ -153,8 +238,9 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
if (stmt->relation != NULL)
{
rel = process_single_relation(stmt->relation, stmt->indexname,
- CLUSTER_COMMAND_CLUSTER, &params,
- &indexOid);
+ CLUSTER_COMMAND_CLUSTER,
+ AccessExclusiveLock, isTopLevel,
+ &params, &indexOid);
if (rel == NULL)
return;
}
@@ -204,7 +290,8 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
}
/* Do the job. */
- cluster_multiple_rels(rtcs, &params, CLUSTER_COMMAND_CLUSTER);
+ cluster_multiple_rels(rtcs, &params, CLUSTER_COMMAND_CLUSTER,
+ AccessExclusiveLock, isTopLevel);
/* Start a new transaction for the cleanup work. */
StartTransactionCommand();
@@ -221,8 +308,8 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
* return.
*/
static void
-cluster_multiple_rels(List *rtcs, ClusterParams *params,
- ClusterCommand cmd)
+cluster_multiple_rels(List *rtcs, ClusterParams *params, ClusterCommand cmd,
+ LOCKMODE lockmode, bool isTopLevel)
{
ListCell *lc;
@@ -242,10 +329,10 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params,
/* functions in indexes may want a snapshot set */
PushActiveSnapshot(GetTransactionSnapshot());
- rel = table_open(rtc->tableOid, AccessExclusiveLock);
+ rel = table_open(rtc->tableOid, lockmode);
/* Process this table */
- cluster_rel(rel, rtc->indexOid, params, cmd);
+ cluster_rel(rel, rtc->indexOid, params, cmd, isTopLevel);
/* cluster_rel closes the relation, but keeps lock */
PopActiveSnapshot();
@@ -269,12 +356,18 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params,
* instead of index order. This is the new implementation of VACUUM FULL,
* and error messages should refer to the operation as VACUUM not CLUSTER.
*
+ * Note that, in the concurrent case, the function releases the lock at some
+ * point, in order to get AccessExclusiveLock for the final steps (i.e. to
+ * swap the relation files). To make things simpler, the caller should expect
+ * OldHeap to be closed on return, regardless of CLUOPT_CONCURRENT. (The
+ * AccessExclusiveLock is kept till the end of the transaction.)
+ *
* 'cmd' indicates which command is being executed. REPACK should be the only
* caller of this function in the future.
*/
void
cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
- ClusterCommand cmd)
+ ClusterCommand cmd, bool isTopLevel)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid save_userid;
@@ -284,8 +377,34 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
bool recheck = ((params->options & CLUOPT_RECHECK) != 0);
Relation index;
const char *cmd_str = CLUSTER_COMMAND_STR(cmd);
+ bool concurrent = ((params->options & CLUOPT_CONCURRENT) != 0);
+ LOCKMODE lmode;
+
+ /*
+ * Check that the correct lock is held. The lock mode is
+ * AccessExclusiveLock for normal processing and ShareUpdateExclusiveLock
+ * for concurrent processing (so that SELECT, INSERT, UPDATE and DELETE
+ * commands work, but cluster_rel() cannot be called concurrently for the
+ * same relation).
+ */
+ lmode = !concurrent ? AccessExclusiveLock : ShareUpdateExclusiveLock;
- Assert(CheckRelationLockedByMe(OldHeap, AccessExclusiveLock, false));
+ /* There are specific requirements on concurrent processing. */
+ if (concurrent)
+ {
+ /*
+ * Make sure we have no XID assigned, otherwise the call of
+ * setup_logical_decoding() can cause a deadlock.
+ *
+ * The existence of a transaction block does not actually imply that an
+ * XID was already assigned, but it very likely was. We might want to
+ * check the result of GetCurrentTransactionIdIfAny() instead, but that
+ * would be less clear from the user's perspective.
+ */
+ PreventInTransactionBlock(isTopLevel, "REPACK CONCURRENTLY");
+
+ check_repack_concurrently_requirements(OldHeap);
+ }
/* Check for user-requested abort. */
CHECK_FOR_INTERRUPTS();
@@ -331,7 +450,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
* to cluster a not-previously-clustered index.
*/
if (recheck)
- if (!cluster_rel_recheck(OldHeap, indexOid, save_userid, cmd,
+ if (!cluster_rel_recheck(OldHeap, indexOid, save_userid, cmd, lmode,
params->options))
goto out;
@@ -350,6 +469,12 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("cannot %s a shared catalog", cmd_str)));
+ /*
+ * The CONCURRENTLY case should have been rejected earlier because it does
+ * not support system catalogs.
+ */
+ Assert(!(OldHeap->rd_rel->relisshared && concurrent));
+
/*
* Don't process temp tables of other backends ... their local buffer
* manager is not going to cope.
@@ -370,8 +495,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
if (OidIsValid(indexOid))
{
/* verify the index is good and lock it */
- check_index_is_clusterable(OldHeap, indexOid, AccessExclusiveLock,
- cmd);
+ check_index_is_clusterable(OldHeap, indexOid, lmode, cmd);
/* also open it */
index = index_open(indexOid, NoLock);
}
@@ -388,7 +512,8 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
if (OldHeap->rd_rel->relkind == RELKIND_MATVIEW &&
!RelationIsPopulated(OldHeap))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ index_close(index, lmode);
+ relation_close(OldHeap, lmode);
goto out;
}
@@ -401,11 +526,35 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
* invalid, because we move tuples around. Promote them to relation
* locks. Predicate locks on indexes will be promoted when they are
* reindexed.
+ *
+ * During concurrent processing, the heap as well as its indexes stay in
+ * operation, so we postpone this step until they are locked using
+ * AccessExclusiveLock near the end of the processing.
*/
- TransferPredicateLocksToHeapRelation(OldHeap);
+ if (!concurrent)
+ TransferPredicateLocksToHeapRelation(OldHeap);
/* rebuild_relation does all the dirty work */
- rebuild_relation(OldHeap, index, verbose, cmd);
+ PG_TRY();
+ {
+ /*
+ * For concurrent processing, make sure that our logical decoding
+ * ignores data changes of other tables than the one we are
+ * processing.
+ */
+ if (concurrent)
+ begin_concurrent_repack(OldHeap);
+
+ rebuild_relation(OldHeap, index, verbose, cmd, concurrent,
+ save_userid);
+ }
+ PG_FINALLY();
+ {
+ if (concurrent)
+ end_concurrent_repack();
+ }
+ PG_END_TRY();
+
/* rebuild_relation closes OldHeap, and index if valid */
out:
@@ -424,14 +573,14 @@ out:
*/
static bool
cluster_rel_recheck(Relation OldHeap, Oid indexOid, Oid userid,
- ClusterCommand cmd, int options)
+ ClusterCommand cmd, LOCKMODE lmode, int options)
{
Oid tableOid = RelationGetRelid(OldHeap);
/* Check that the user still has privileges for the relation */
if (!cluster_is_permitted_for_relation(tableOid, userid, cmd))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
return false;
}
@@ -445,7 +594,7 @@ cluster_rel_recheck(Relation OldHeap, Oid indexOid, Oid userid,
*/
if (RELATION_IS_OTHER_TEMP(OldHeap))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
return false;
}
@@ -456,7 +605,7 @@ cluster_rel_recheck(Relation OldHeap, Oid indexOid, Oid userid,
*/
if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(indexOid)))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
return false;
}
@@ -467,7 +616,7 @@ cluster_rel_recheck(Relation OldHeap, Oid indexOid, Oid userid,
if ((options & CLUOPT_RECHECK_ISCLUSTERED) != 0 &&
!get_index_isclustered(indexOid))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
return false;
}
}
@@ -611,19 +760,87 @@ mark_index_clustered(Relation rel, Oid indexOid, bool is_internal)
table_close(pg_index, RowExclusiveLock);
}
+/*
+ * Check if the CONCURRENTLY option is legal for the relation.
+ */
+static void
+check_repack_concurrently_requirements(Relation rel)
+{
+ char relpersistence,
+ replident;
+ Oid ident_idx;
+
+ /* Data changes in system relations are not logically decoded. */
+ if (IsCatalogRelation(rel))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot repack relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("REPACK CONCURRENTLY is not supported for catalog relations.")));
+
+ /*
+ * reorderbuffer.c does not seem to handle processing of TOAST relation
+ * alone.
+ */
+ if (IsToastRelation(rel))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot repack relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("REPACK CONCURRENTLY is not supported for TOAST relations, unless the main relation is repacked too.")));
+
+ relpersistence = rel->rd_rel->relpersistence;
+ if (relpersistence != RELPERSISTENCE_PERMANENT)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot repack relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("REPACK CONCURRENTLY is only allowed for permanent relations.")));
+
+ /* With NOTHING, WAL does not contain the old tuple. */
+ replident = rel->rd_rel->relreplident;
+ if (replident == REPLICA_IDENTITY_NOTHING)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot repack relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("Relation \"%s\" has insufficient replication identity.",
+ RelationGetRelationName(rel))));
+
+ /*
+ * Identity index is not set if the replica identity is FULL, but PK might
+ * exist in such a case.
+ */
+ ident_idx = RelationGetReplicaIndex(rel);
+ if (!OidIsValid(ident_idx) && OidIsValid(rel->rd_pkindex))
+ ident_idx = rel->rd_pkindex;
+ if (!OidIsValid(ident_idx))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot repack relation \"%s\"",
+ RelationGetRelationName(rel)),
+ (errhint("Relation \"%s\" has no identity index.",
+ RelationGetRelationName(rel)))));
+}
+
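The identity-index lookup above falls back to the primary key when no replica identity index is set (as with REPLICA IDENTITY FULL) but a PK exists, and only errors out when neither is available. A minimal sketch of that fallback order, with plain OIDs standing in for RelationGetReplicaIndex() and rd_pkindex:

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int Oid;
#define InvalidOidSketch ((Oid) 0)

/*
 * Return the index usable as identity: the replica identity index if set,
 * else the primary key, else InvalidOidSketch (which the caller turns
 * into an error).
 */
static Oid
choose_identity_index(Oid replica_idx, Oid pk_idx)
{
	Oid			ident_idx = replica_idx;

	if (ident_idx == InvalidOidSketch && pk_idx != InvalidOidSketch)
		ident_idx = pk_idx;
	return ident_idx;
}
```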
/*
* rebuild_relation: rebuild an existing relation in index or physical order
*
- * OldHeap: table to rebuild.
+ * OldHeap: table to rebuild. See cluster_rel() for comments on the required
+ * lock strength.
+ *
* index: index to cluster by, or NULL to rewrite in physical order.
*
- * On entry, heap and index (if one is given) must be open, and
- * AccessExclusiveLock held on them.
- * On exit, they are closed, but locks on them are not released.
+ * On entry, heap and index (if one is given) must be open, and the
+ * appropriate lock held on them (AccessExclusiveLock for exclusive processing
+ * and ShareUpdateExclusiveLock for concurrent processing).
+ *
+ * On exit, they are closed, but still locked with AccessExclusiveLock. (The
+ * function handles the lock upgrade if 'concurrent' is true.)
*/
static void
rebuild_relation(Relation OldHeap, Relation index, bool verbose,
- ClusterCommand cmd)
+ ClusterCommand cmd, bool concurrent, Oid userid)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid accessMethod = OldHeap->rd_rel->relam;
@@ -631,21 +848,61 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose,
Oid OIDNewHeap;
Relation NewHeap;
char relpersistence;
- bool is_system_catalog;
bool swap_toast_by_content;
TransactionId frozenXid;
MultiXactId cutoffMulti;
+ NameData slotname;
+ LogicalDecodingContext *ctx = NULL;
+ Snapshot snapshot = NULL;
+#if USE_ASSERT_CHECKING
+ LOCKMODE lmode;
+
+ lmode = !concurrent ? AccessExclusiveLock : ShareUpdateExclusiveLock;
+
+ Assert(CheckRelationLockedByMe(OldHeap, lmode, false) &&
+ (index == NULL || CheckRelationLockedByMe(index, lmode, false)));
+#endif
+
+ if (concurrent)
+ {
+ TupleDesc tupdesc;
+
+ /*
+ * REPACK CONCURRENTLY is not allowed in a transaction block, so this
+ * should never fire.
+ */
+ Assert(GetTopTransactionIdIfAny() == InvalidTransactionId);
+
+ /*
+ * A single backend should not execute multiple REPACK commands at a
+ * time, so use PID to make the slot unique.
+ */
+ snprintf(NameStr(slotname), NAMEDATALEN, "repack_%d", MyProcPid);
+
+ tupdesc = CreateTupleDescCopy(RelationGetDescr(OldHeap));
+
+ /*
+ * Prepare to capture the concurrent data changes.
+ *
+ * Note that this call waits for all transactions with XID already
+ * assigned to finish. If one of those transactions is waiting for a
+ * lock conflicting with ShareUpdateExclusiveLock on our table (e.g.
+ * it runs CREATE INDEX), we can end up in a deadlock. It is not clear
+ * whether this risk justifies unlocking/relocking the table (and its
+ * clustering index) and checking again whether it is still eligible
+ * for REPACK CONCURRENTLY.
+ */
+ ctx = setup_logical_decoding(tableOid, NameStr(slotname), tupdesc);
- Assert(CheckRelationLockedByMe(OldHeap, AccessExclusiveLock, false) &&
- (index == NULL || CheckRelationLockedByMe(index, AccessExclusiveLock, false)));
+ snapshot = SnapBuildInitialSnapshotForRepack(ctx->snapshot_builder);
+ }
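The slot name built above is made unique by embedding the backend PID, since a single backend runs at most one REPACK at a time. A self-contained sketch of that naming scheme (the 64-byte buffer size assumes the default NAMEDATALEN of a stock build):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_NAMEDATALEN 64	/* default NAMEDATALEN in PostgreSQL */

/* Build a per-backend replication slot name, e.g. "repack_12345". */
static void
build_repack_slot_name(char *buf, int pid)
{
	snprintf(buf, SKETCH_NAMEDATALEN, "repack_%d", pid);
}
```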
- if (index)
+ if (index && cmd == CLUSTER_COMMAND_CLUSTER)
/* Mark the correct index as clustered */
mark_index_clustered(OldHeap, RelationGetRelid(index), true);
/* Remember info about rel before closing OldHeap */
relpersistence = OldHeap->rd_rel->relpersistence;
- is_system_catalog = IsSystemRelation(OldHeap);
/*
* Create the transient table that will receive the re-ordered data.
@@ -661,30 +918,49 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose,
NewHeap = table_open(OIDNewHeap, NoLock);
/* Copy the heap data into the new table in the desired order */
- copy_table_data(NewHeap, OldHeap, index, verbose, cmd,
- &swap_toast_by_content, &frozenXid, &cutoffMulti);
+ copy_table_data(NewHeap, OldHeap, index, snapshot, ctx, verbose,
+ cmd, &swap_toast_by_content, &frozenXid, &cutoffMulti);
+ if (concurrent)
+ {
+ rebuild_relation_finish_concurrent(NewHeap, OldHeap, index,
+ ctx, swap_toast_by_content,
+ frozenXid, cutoffMulti);
+
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_FINAL_CLEANUP);
+
+ /* Done with decoding. */
+ FreeSnapshot(snapshot);
+ cleanup_logical_decoding(ctx);
+ ReplicationSlotRelease();
+ ReplicationSlotDrop(NameStr(slotname), false);
+ }
+ else
+ {
+ bool is_system_catalog = IsSystemRelation(OldHeap);
- /* Close relcache entries, but keep lock until transaction commit */
- table_close(OldHeap, NoLock);
- if (index)
- index_close(index, NoLock);
+ /* Close relcache entries, but keep lock until transaction commit */
+ table_close(OldHeap, NoLock);
+ if (index)
+ index_close(index, NoLock);
- /*
- * Close the new relation so it can be dropped as soon as the storage is
- * swapped. The relation is not visible to others, so no need to unlock it
- * explicitly.
- */
- table_close(NewHeap, NoLock);
+ /*
+ * Close the new relation so it can be dropped as soon as the storage
+ * is swapped. The relation is not visible to others, so no need to
+ * unlock it explicitly.
+ */
+ table_close(NewHeap, NoLock);
- /*
- * Swap the physical files of the target and transient tables, then
- * rebuild the target's indexes and throw away the transient table.
- */
- finish_heap_swap(tableOid, OIDNewHeap, is_system_catalog,
- swap_toast_by_content, false, true,
- frozenXid, cutoffMulti,
- relpersistence);
+ /*
+ * Swap the physical files of the target and transient tables, then
+ * rebuild the target's indexes and throw away the transient table.
+ */
+ finish_heap_swap(tableOid, OIDNewHeap, is_system_catalog,
+ swap_toast_by_content, false, true, true,
+ frozenXid, cutoffMulti,
+ relpersistence);
+ }
}
@@ -819,14 +1095,18 @@ make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
/*
* Do the physical copying of table data.
*
+ * 'snapshot' and 'decoding_ctx': see table_relation_copy_for_cluster(). Pass
+ * them iff concurrent processing is required.
+ *
* There are three output parameters:
* *pSwapToastByContent is set true if toast tables must be swapped by content.
* *pFreezeXid receives the TransactionId used as freeze cutoff point.
* *pCutoffMulti receives the MultiXactId used as a cutoff point.
*/
static void
-copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verbose,
- ClusterCommand cmd, bool *pSwapToastByContent,
+copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
+ Snapshot snapshot, LogicalDecodingContext *decoding_ctx,
+ bool verbose, ClusterCommand cmd, bool *pSwapToastByContent,
TransactionId *pFreezeXid, MultiXactId *pCutoffMulti)
{
Relation relRelation;
@@ -845,6 +1125,7 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
const char *cmd_str = CLUSTER_COMMAND_STR(cmd);
PGRUsage ru0;
char *nspname;
+ bool concurrent = snapshot != NULL;
pg_rusage_init(&ru0);
@@ -948,8 +1229,48 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
* provided, else plain seqscan.
*/
if (OldIndex != NULL && OldIndex->rd_rel->relam == BTREE_AM_OID)
+ {
+ ResourceOwner oldowner = NULL;
+ ResourceOwner resowner = NULL;
+
+ /*
+ * In the CONCURRENT case, use a dedicated resource owner so we don't
+ * leave any additional locks behind us that we cannot release easily.
+ */
+ if (concurrent)
+ {
+ Assert(CheckRelationLockedByMe(OldHeap, ShareUpdateExclusiveLock,
+ false));
+ Assert(CheckRelationLockedByMe(OldIndex, ShareUpdateExclusiveLock,
+ false));
+
+ resowner = ResourceOwnerCreate(CurrentResourceOwner,
+ "plan_cluster_use_sort");
+ oldowner = CurrentResourceOwner;
+ CurrentResourceOwner = resowner;
+ }
+
use_sort = plan_cluster_use_sort(RelationGetRelid(OldHeap),
RelationGetRelid(OldIndex));
+
+ if (concurrent)
+ {
+ CurrentResourceOwner = oldowner;
+
+ /*
+ * We are primarily concerned about locks, but if the planner
+ * happened to allocate any other resources, we should release
+ * them too because we're going to delete the whole resowner.
+ */
+ ResourceOwnerRelease(resowner, RESOURCE_RELEASE_BEFORE_LOCKS,
+ false, false);
+ ResourceOwnerRelease(resowner, RESOURCE_RELEASE_LOCKS,
+ false, false);
+ ResourceOwnerRelease(resowner, RESOURCE_RELEASE_AFTER_LOCKS,
+ false, false);
+ ResourceOwnerDelete(resowner);
+ }
+ }
else
use_sort = false;
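The resource-owner dance above follows a common save/swap/restore pattern: install a scratch owner, run code that may acquire resources, restore the old owner, then release everything the scratch owner accumulated. A generic, self-contained sketch of that pattern (a toy "owner" counting resources instead of tracking real locks):

```c
#include <assert.h>
#include <stddef.h>

/* Toy resource owner: just counts acquired resources. */
typedef struct ToyOwner
{
	int			nresources;
} ToyOwner;

static ToyOwner *CurrentToyOwner = NULL;

static void
toy_acquire(void)
{
	CurrentToyOwner->nresources++;
}

static void
toy_release_all(ToyOwner *owner)
{
	owner->nresources = 0;
}

/* Example workload: acquires two resources under the current owner. */
static void
sample_work(void)
{
	toy_acquire();
	toy_acquire();
}

/*
 * Run 'fn' under a scratch owner so anything it acquires can be released
 * wholesale afterwards, mirroring the ResourceOwnerCreate / Release /
 * Delete sequence in the patch.  Returns how many resources 'fn' took.
 */
static int
run_with_scratch_owner(void (*fn) (void))
{
	ToyOwner	scratch = {0};
	ToyOwner   *saved = CurrentToyOwner;
	int			acquired;

	CurrentToyOwner = &scratch;
	fn();
	CurrentToyOwner = saved;

	acquired = scratch.nresources;
	toy_release_all(&scratch);
	return acquired;
}
```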
@@ -981,7 +1302,9 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
* values (e.g. because the AM doesn't use freezing).
*/
table_relation_copy_for_cluster(OldHeap, NewHeap, OldIndex, use_sort,
- cutoffs.OldestXmin, &cutoffs.FreezeLimit,
+ cutoffs.OldestXmin, snapshot,
+ decoding_ctx,
+ &cutoffs.FreezeLimit,
&cutoffs.MultiXactCutoff,
&num_tuples, &tups_vacuumed,
&tups_recently_dead);
@@ -990,7 +1313,11 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
*pFreezeXid = cutoffs.FreezeLimit;
*pCutoffMulti = cutoffs.MultiXactCutoff;
- /* Reset rd_toastoid just to be tidy --- it shouldn't be looked at again */
+ /*
+ * Reset rd_toastoid just to be tidy --- it shouldn't be looked at again.
+ * In the CONCURRENTLY case, we need to set it again before applying the
+ * concurrent changes.
+ */
NewHeap->rd_toastoid = InvalidOid;
num_pages = RelationGetNumberOfBlocks(NewHeap);
@@ -1448,14 +1775,13 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
bool swap_toast_by_content,
bool check_constraints,
bool is_internal,
+ bool reindex,
TransactionId frozenXid,
MultiXactId cutoffMulti,
char newrelpersistence)
{
ObjectAddress object;
Oid mapped_tables[4];
- int reindex_flags;
- ReindexParams reindex_params = {0};
int i;
/* Report that we are now swapping relation files */
@@ -1481,39 +1807,47 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
if (is_system_catalog)
CacheInvalidateCatalog(OIDOldHeap);
- /*
- * Rebuild each index on the relation (but not the toast table, which is
- * all-new at this point). It is important to do this before the DROP
- * step because if we are processing a system catalog that will be used
- * during DROP, we want to have its indexes available. There is no
- * advantage to the other order anyway because this is all transactional,
- * so no chance to reclaim disk space before commit. We do not need a
- * final CommandCounterIncrement() because reindex_relation does it.
- *
- * Note: because index_build is called via reindex_relation, it will never
- * set indcheckxmin true for the indexes. This is OK even though in some
- * sense we are building new indexes rather than rebuilding existing ones,
- * because the new heap won't contain any HOT chains at all, let alone
- * broken ones, so it can't be necessary to set indcheckxmin.
- */
- reindex_flags = REINDEX_REL_SUPPRESS_INDEX_USE;
- if (check_constraints)
- reindex_flags |= REINDEX_REL_CHECK_CONSTRAINTS;
+ if (reindex)
+ {
+ int reindex_flags;
+ ReindexParams reindex_params = {0};
- /*
- * Ensure that the indexes have the same persistence as the parent
- * relation.
- */
- if (newrelpersistence == RELPERSISTENCE_UNLOGGED)
- reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
- else if (newrelpersistence == RELPERSISTENCE_PERMANENT)
- reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+ /*
+ * Rebuild each index on the relation (but not the toast table, which
+ * is all-new at this point). It is important to do this before the
+ * DROP step because if we are processing a system catalog that will
+ * be used during DROP, we want to have its indexes available. There
+ * is no advantage to the other order anyway because this is all
+ * transactional, so no chance to reclaim disk space before commit. We
+ * do not need a final CommandCounterIncrement() because
+ * reindex_relation does it.
+ *
+ * Note: because index_build is called via reindex_relation, it will
+ * never set indcheckxmin true for the indexes. This is OK even
+ * though in some sense we are building new indexes rather than
+ * rebuilding existing ones, because the new heap won't contain any
+ * HOT chains at all, let alone broken ones, so it can't be necessary
+ * to set indcheckxmin.
+ */
+ reindex_flags = REINDEX_REL_SUPPRESS_INDEX_USE;
+ if (check_constraints)
+ reindex_flags |= REINDEX_REL_CHECK_CONSTRAINTS;
- /* Report that we are now reindexing relations */
- pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
- PROGRESS_REPACK_PHASE_REBUILD_INDEX);
+ /*
+ * Ensure that the indexes have the same persistence as the parent
+ * relation.
+ */
+ if (newrelpersistence == RELPERSISTENCE_UNLOGGED)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
+ else if (newrelpersistence == RELPERSISTENCE_PERMANENT)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
- reindex_relation(NULL, OIDOldHeap, reindex_flags, &reindex_params);
+ /* Report that we are now reindexing relations */
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_REBUILD_INDEX);
+
+ reindex_relation(NULL, OIDOldHeap, reindex_flags, &reindex_params);
+ }
/* Report that we are now doing clean up */
pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
@@ -1825,89 +2159,1253 @@ cluster_is_permitted_for_relation(Oid relid, Oid userid, ClusterCommand cmd)
return false;
}
+#define REPL_PLUGIN_NAME "pgoutput_repack"
+
/*
- * REPACK is intended to be a replacement of both CLUSTER and VACUUM FULL.
+ * Call this function before REPACK CONCURRENTLY starts to set up logical
+ * decoding. It makes sure that other users of the table put enough
+ * information into WAL.
+ *
+ * The point is that at various places we expect that the table we're
+ * processing is treated like a system catalog. For example, we need to be
+ * able to scan it using a "historic snapshot" anytime during the processing
+ * (as opposed to scanning only at the start point of the decoding, as logical
+ * replication does during initial table synchronization), in order to apply
+ * concurrent UPDATE / DELETE commands.
+ *
+ * Note that the TOAST table needs no attention here, as it is not scanned
+ * using a historic snapshot.
*/
-void
-repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
+static void
+begin_concurrent_repack(Relation rel)
{
- ListCell *lc;
- ClusterParams params = {0};
- bool verbose = false;
- Relation rel = NULL;
- Oid indexOid = InvalidOid;
- MemoryContext repack_context;
- List *rtcs;
+ Oid toastrelid;
- /* Parse option list */
- foreach(lc, stmt->params)
+ /* Avoid logical decoding of other relations by this backend. */
+ repacked_rel_locator = rel->rd_locator;
+ toastrelid = rel->rd_rel->reltoastrelid;
+ if (OidIsValid(toastrelid))
{
- DefElem *opt = (DefElem *) lfirst(lc);
+ Relation toastrel;
- if (strcmp(opt->defname, "verbose") == 0)
- verbose = defGetBoolean(opt);
- else
- ereport(ERROR,
- (errcode(ERRCODE_SYNTAX_ERROR),
- errmsg("unrecognized REPACK option \"%s\"",
- opt->defname),
- parser_errposition(pstate, opt->location)));
+ /* Avoid logical decoding of other TOAST relations. */
+ toastrel = table_open(toastrelid, AccessShareLock);
+ repacked_rel_toast_locator = toastrel->rd_locator;
+ table_close(toastrel, AccessShareLock);
}
+}
- params.options = (verbose ? CLUOPT_VERBOSE : 0);
+/*
+ * Call this when done with REPACK CONCURRENTLY.
+ */
+static void
+end_concurrent_repack(void)
+{
+ /*
+ * Restore normal function of (future) logical decoding for this backend.
+ */
+ repacked_rel_locator.relNumber = InvalidOid;
+ repacked_rel_toast_locator.relNumber = InvalidOid;
+}
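begin_concurrent_repack() and end_concurrent_repack() above bracket the rebuild with a sentinel pair: set the global locators so this backend's decoding skips unrelated relations, then unconditionally reset them (the patch relies on PG_FINALLY for that). A toy sketch of the sentinel pattern, with a plain struct standing in for RelFileLocator:

```c
#include <assert.h>

typedef struct ToyLocator
{
	unsigned int relNumber;
} ToyLocator;

#define INVALID_REL 0u

static ToyLocator current_target = {INVALID_REL};

static void
begin_target(unsigned int relnum)
{
	current_target.relNumber = relnum;
}

static void
end_target(void)
{
	current_target.relNumber = INVALID_REL;
}

/* Decoding filter: decode only the relation currently being repacked. */
static int
should_decode(unsigned int relnum)
{
	return current_target.relNumber != INVALID_REL &&
		current_target.relNumber == relnum;
}
```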
- if (stmt->relation != NULL)
- {
- rel = process_single_relation(stmt->relation, stmt->indexname,
- CLUSTER_COMMAND_REPACK, &params,
- &indexOid);
- if (rel == NULL)
- return;
- }
+/*
+ * This function is much like pg_create_logical_replication_slot() except that
+ * the new slot is neither released (if anyone else could read changes from
+ * our slot, we could miss changes other backends do while we copy the
+ * existing data into temporary table), nor persisted (it's easier to handle
+ * crash by restarting all the work from scratch).
+ */
+static LogicalDecodingContext *
+setup_logical_decoding(Oid relid, const char *slotname, TupleDesc tupdesc)
+{
+ LogicalDecodingContext *ctx;
+ RepackDecodingState *dstate;
/*
- * By here, we know we are in a multi-table situation. In order to avoid
- * holding locks for too long, we want to process each table in its own
- * transaction. This forces us to disallow running inside a user
- * transaction block.
+ * Check if we can use logical decoding.
*/
- PreventInTransactionBlock(isTopLevel, "REPACK");
+ CheckSlotPermissions();
+ CheckLogicalDecodingRequirements();
- /* Also, we need a memory context to hold our list of relations */
- repack_context = AllocSetContextCreate(PortalContext,
- "Repack",
- ALLOCSET_DEFAULT_SIZES);
+ /* RS_TEMPORARY so that the slot gets cleaned up on ERROR. */
+ ReplicationSlotCreate(slotname, true, RS_TEMPORARY, false, false, false);
- params.options |= CLUOPT_RECHECK;
- if (rel != NULL)
- {
- Oid relid;
- bool rel_is_index;
+ /*
+ * Neither the prepare_write / do_write callbacks nor update_progress is
+ * useful for us.
+ *
+ * Regarding the value of need_full_snapshot, we pass false because the
+ * table we are processing is present in RepackedRelsHash and therefore,
+ * regarding logical decoding, treated like a catalog.
+ */
+ ctx = CreateInitDecodingContext(REPL_PLUGIN_NAME,
+ NIL,
+ false,
+ InvalidXLogRecPtr,
+ XL_ROUTINE(.page_read = read_local_xlog_page,
+ .segment_open = wal_segment_open,
+ .segment_close = wal_segment_close),
+ NULL, NULL, NULL);
- Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
+ /*
+ * We don't have control over setting fast_forward, so at least check it.
+ Assert(!ctx->fast_forward);
- if (OidIsValid(indexOid))
- {
- relid = indexOid;
- rel_is_index = true;
- }
- else
+ DecodingContextFindStartpoint(ctx);
+
+ /* Some WAL records should have been read. */
+ Assert(ctx->reader->EndRecPtr != InvalidXLogRecPtr);
+
+ XLByteToSeg(ctx->reader->EndRecPtr, repack_current_segment,
+ wal_segment_size);
+
+ /*
+ * Setup structures to store decoded changes.
+ */
+ dstate = palloc0(sizeof(RepackDecodingState));
+ dstate->relid = relid;
+ dstate->tstore = tuplestore_begin_heap(false, false,
+ maintenance_work_mem);
+
+ dstate->tupdesc = tupdesc;
+
+ /* Initialize the descriptor to store the changes ... */
+ dstate->tupdesc_change = CreateTemplateTupleDesc(1);
+
+ TupleDescInitEntry(dstate->tupdesc_change, 1, NULL, BYTEAOID, -1, 0);
+ /* ... as well as the corresponding slot. */
+ dstate->tsslot = MakeSingleTupleTableSlot(dstate->tupdesc_change,
+ &TTSOpsMinimalTuple);
+
+ dstate->resowner = ResourceOwnerCreate(CurrentResourceOwner,
+ "logical decoding");
+
+ ctx->output_writer_private = dstate;
+ return ctx;
+}
+
+/*
+ * Retrieve tuple from ConcurrentChange structure.
+ *
+ * The input data starts with the structure but it might not be appropriately
+ * aligned.
+ */
+static HeapTuple
+get_changed_tuple(char *change)
+{
+ HeapTupleData tup_data;
+ HeapTuple result;
+ char *src;
+
+ /*
+ * Ensure alignment before accessing the fields. (This is why we can't use
+ * heap_copytuple() instead of this function.)
+ */
+ src = change + offsetof(ConcurrentChange, tup_data);
+ memcpy(&tup_data, src, sizeof(HeapTupleData));
+
+ result = (HeapTuple) palloc(HEAPTUPLESIZE + tup_data.t_len);
+ memcpy(result, &tup_data, sizeof(HeapTupleData));
+ result->t_data = (HeapTupleHeader) ((char *) result + HEAPTUPLESIZE);
+ src = change + SizeOfConcurrentChange;
+ memcpy(result->t_data, src, result->t_len);
+
+ return result;
+}
+
+/*
+ * Decode logical changes from the WAL sequence up to end_of_wal.
+ */
+void
+repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
+ XLogRecPtr end_of_wal)
+{
+ RepackDecodingState *dstate;
+ ResourceOwner resowner_old;
+
+ /*
+ * Invalidate the "present" cache before moving to "(recent) history".
+ */
+ InvalidateSystemCaches();
+
+ dstate = (RepackDecodingState *) ctx->output_writer_private;
+ resowner_old = CurrentResourceOwner;
+ CurrentResourceOwner = dstate->resowner;
+
+ PG_TRY();
+ {
+ while (ctx->reader->EndRecPtr < end_of_wal)
{
- relid = RelationGetRelid(rel);
- rel_is_index = false;
- }
- rtcs = get_tables_to_cluster_partitioned(repack_context, relid,
- rel_is_index,
- CLUSTER_COMMAND_REPACK);
+ XLogRecord *record;
+ XLogSegNo segno_new;
+ char *errm = NULL;
+ XLogRecPtr end_lsn;
- /* close relation, releasing lock on parent table */
- table_close(rel, AccessExclusiveLock);
+ record = XLogReadRecord(ctx->reader, &errm);
+ if (errm)
+ elog(ERROR, "%s", errm);
+
+ if (record != NULL)
+ LogicalDecodingProcessRecord(ctx, ctx->reader);
+
+ /*
+ * If WAL segment boundary has been crossed, inform the decoding
+ * system that the catalog_xmin can advance. (We can confirm more
+ * often, but a filling a single WAL segment should not take much
+ * time.)
+ */
+ end_lsn = ctx->reader->EndRecPtr;
+ XLByteToSeg(end_lsn, segno_new, wal_segment_size);
+ if (segno_new != repack_current_segment)
+ {
+ LogicalConfirmReceivedLocation(end_lsn);
+ elog(DEBUG1, "REPACK: confirmed receive location %X/%X",
+ (uint32) (end_lsn >> 32), (uint32) end_lsn);
+ repack_current_segment = segno_new;
+ }
+
+ CHECK_FOR_INTERRUPTS();
+ }
+ InvalidateSystemCaches();
+ CurrentResourceOwner = resowner_old;
}
- else
- rtcs = get_tables_to_repack(repack_context);
+ PG_CATCH();
+ {
+ /* clear all timetravel entries */
+ InvalidateSystemCaches();
+ CurrentResourceOwner = resowner_old;
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+}
+
+/*
+ * Apply changes that happened during the initial load.
+ *
+ * Scan key is passed by caller, so it does not have to be constructed
+ * multiple times. Key entries have all fields initialized, except for
+ * sk_argument.
+ */
+static void
+apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
+ ScanKey key, int nkeys, IndexInsertState *iistate)
+{
+ TupleTableSlot *index_slot,
+ *ident_slot;
+ HeapTuple tup_old = NULL;
+
+ if (dstate->nchanges == 0)
+ return;
+
+ /* TupleTableSlot is needed to pass the tuple to ExecInsertIndexTuples(). */
+ index_slot = MakeSingleTupleTableSlot(dstate->tupdesc, &TTSOpsHeapTuple);
+
+ /* A slot to fetch tuples from identity index. */
+ ident_slot = table_slot_create(rel, NULL);
+
+ while (tuplestore_gettupleslot(dstate->tstore, true, false,
+ dstate->tsslot))
+ {
+ bool shouldFree;
+ HeapTuple tup_change,
+ tup,
+ tup_exist;
+ char *change_raw,
+ *src;
+ ConcurrentChange change;
+ bool isnull[1];
+ Datum values[1];
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Get the change from the single-column tuple. */
+ tup_change = ExecFetchSlotHeapTuple(dstate->tsslot, false, &shouldFree);
+ heap_deform_tuple(tup_change, dstate->tupdesc_change, values, isnull);
+ Assert(!isnull[0]);
+
+ /* Make sure we access aligned data. */
+ change_raw = (char *) DatumGetByteaP(values[0]);
+ src = (char *) VARDATA(change_raw);
+ memcpy(&change, src, SizeOfConcurrentChange);
+
+ /* TRUNCATE change contains no tuple, so process it separately. */
+ if (change.kind == CHANGE_TRUNCATE)
+ {
+ /*
+ * All the things that ExecuteTruncateGuts() does (such as firing
+ * triggers or handling the DROP_CASCADE behavior) should have
+ * taken place on the source relation. Thus we only do the actual
+ * truncation of the new relation (and its indexes).
+ */
+ heap_truncate_one_rel(rel);
+
+ pfree(tup_change);
+ continue;
+ }
+
+ /*
+ * Extract the tuple from the change. The tuple is copied here because
+ * it might be assigned to 'tup_old', in which case it needs to
+ * survive into the next iteration.
+ */
+ tup = get_changed_tuple(src);
+
+ if (change.kind == CHANGE_UPDATE_OLD)
+ {
+ Assert(tup_old == NULL);
+ tup_old = tup;
+ }
+ else if (change.kind == CHANGE_INSERT)
+ {
+ Assert(tup_old == NULL);
+
+ apply_concurrent_insert(rel, &change, tup, iistate, index_slot);
+
+ pfree(tup);
+ }
+ else if (change.kind == CHANGE_UPDATE_NEW ||
+ change.kind == CHANGE_DELETE)
+ {
+ IndexScanDesc ind_scan = NULL;
+ HeapTuple tup_key;
+
+ if (change.kind == CHANGE_UPDATE_NEW)
+ {
+ tup_key = tup_old != NULL ? tup_old : tup;
+ }
+ else
+ {
+ Assert(tup_old == NULL);
+ tup_key = tup;
+ }
+
+ /*
+ * Find the tuple to be updated or deleted.
+ */
+ tup_exist = find_target_tuple(rel, key, nkeys, tup_key,
+ iistate, ident_slot, &ind_scan);
+ if (tup_exist == NULL)
+ elog(ERROR, "failed to find target tuple");
+
+ if (change.kind == CHANGE_UPDATE_NEW)
+ apply_concurrent_update(rel, tup, tup_exist, &change, iistate,
+ index_slot);
+ else
+ apply_concurrent_delete(rel, tup_exist, &change);
+
+ if (tup_old != NULL)
+ {
+ pfree(tup_old);
+ tup_old = NULL;
+ }
+
+ pfree(tup);
+ index_endscan(ind_scan);
+ }
+ else
+ elog(ERROR, "unrecognized kind of change: %d", change.kind);
+
+ /* If there's any change, make it visible to the next iteration. */
+ if (change.kind != CHANGE_UPDATE_OLD)
+ {
+ CommandCounterIncrement();
+ UpdateActiveSnapshotCommandId();
+ }
+
+ /* TTSOpsMinimalTuple has .get_heap_tuple==NULL. */
+ Assert(shouldFree);
+ pfree(tup_change);
+ }
+
+ tuplestore_clear(dstate->tstore);
+ dstate->nchanges = 0;
+
+ /* Cleanup. */
+ ExecDropSingleTupleTableSlot(index_slot);
+ ExecDropSingleTupleTableSlot(ident_slot);
+}
+
+static void
+apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
+ IndexInsertState *iistate, TupleTableSlot *index_slot)
+{
+ List *recheck;
+
+ simple_heap_insert(rel, tup);
+
+ /*
+ * Update indexes. (Functions used by the indexes may need an active
+ * snapshot, which the caller should have set.)
+ */
+ ExecStoreHeapTuple(tup, index_slot, false);
+ recheck = ExecInsertIndexTuples(iistate->rri,
+ index_slot,
+ iistate->estate,
+ false, /* update */
+ false, /* noDupErr */
+ NULL, /* specConflict */
+ NIL, /* arbiterIndexes */
+ false /* onlySummarizing */
+ );
+
+ /*
+ * If recheck is required, it must have been performed on the source
+ * relation by now. (All the logical changes we process here are already
+ * committed.)
+ */
+ list_free(recheck);
+
+ pgstat_progress_incr_param(PROGRESS_REPACK_HEAP_TUPLES_INSERTED, 1);
+}
+
+static void
+apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
+ ConcurrentChange *change, IndexInsertState *iistate,
+ TupleTableSlot *index_slot)
+{
+ List *recheck;
+ TU_UpdateIndexes update_indexes;
+
+ /*
+ * Write the new tuple into the new heap. ('tup' gets the TID assigned
+ * here.)
+ */
+ simple_heap_update(rel, &tup_target->t_self, tup, &update_indexes);
+
+ ExecStoreHeapTuple(tup, index_slot, false);
+
+ if (update_indexes != TU_None)
+ {
+ recheck = ExecInsertIndexTuples(iistate->rri,
+ index_slot,
+ iistate->estate,
+ true, /* update */
+ false, /* noDupErr */
+ NULL, /* specConflict */
+ NIL, /* arbiterIndexes */
+ /* onlySummarizing */
+ update_indexes == TU_Summarizing);
+ list_free(recheck);
+ }
+
+ pgstat_progress_incr_param(PROGRESS_REPACK_HEAP_TUPLES_UPDATED, 1);
+}
+
+static void
+apply_concurrent_delete(Relation rel, HeapTuple tup_target,
+ ConcurrentChange *change)
+{
+ simple_heap_delete(rel, &tup_target->t_self);
+
+ pgstat_progress_incr_param(PROGRESS_REPACK_HEAP_TUPLES_DELETED, 1);
+}
+
+/*
+ * Find the tuple to be updated or deleted.
+ *
+ * 'key' is a pre-initialized scan key, into which the function will put the
+ * key values.
+ *
+ * 'tup_key' is a tuple containing the key values for the scan.
+ *
+ * On exit, '*scan_p' contains the scan descriptor used. The caller must
+ * close it when the returned tuple is no longer needed.
+ */
+static HeapTuple
+find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
+ IndexInsertState *iistate,
+ TupleTableSlot *ident_slot, IndexScanDesc *scan_p)
+{
+ IndexScanDesc scan;
+ Form_pg_index ident_form;
+ int2vector *ident_indkey;
+ HeapTuple result = NULL;
+
+ /* XXX no instrumentation for now */
+ scan = index_beginscan(rel, iistate->ident_index, GetActiveSnapshot(),
+ NULL, nkeys, 0);
+ *scan_p = scan;
+ index_rescan(scan, key, nkeys, NULL, 0);
+
+ /* Info needed to retrieve key values from heap tuple. */
+ ident_form = iistate->ident_index->rd_index;
+ ident_indkey = &ident_form->indkey;
+
+ /* Use the incoming tuple to finalize the scan key. */
+ for (int i = 0; i < scan->numberOfKeys; i++)
+ {
+ ScanKey entry;
+ bool isnull;
+ int16 attno_heap;
+
+ entry = &scan->keyData[i];
+ attno_heap = ident_indkey->values[i];
+ entry->sk_argument = heap_getattr(tup_key,
+ attno_heap,
+ rel->rd_att,
+ &isnull);
+ Assert(!isnull);
+ }
+ if (index_getnext_slot(scan, ForwardScanDirection, ident_slot))
+ {
+ bool shouldFree;
+
+ result = ExecFetchSlotHeapTuple(ident_slot, false, &shouldFree);
+ /* TTSOpsBufferHeapTuple has .get_heap_tuple != NULL. */
+ Assert(!shouldFree);
+ }
+
+ return result;
+}
+
+/*
+ * Decode and apply concurrent changes.
+ *
+ * Pass rel_src iff its reltoastrelid is needed.
+ */
+static void
+process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
+ Relation rel_dst, Relation rel_src, ScanKey ident_key,
+ int ident_key_nentries, IndexInsertState *iistate)
+{
+ RepackDecodingState *dstate;
+
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_CATCH_UP);
+
+ dstate = (RepackDecodingState *) ctx->output_writer_private;
+
+ repack_decode_concurrent_changes(ctx, end_of_wal);
+
+ if (dstate->nchanges == 0)
+ return;
+
+ PG_TRY();
+ {
+ /*
+ * Make sure that TOAST values can eventually be accessed via the old
+ * relation - see comment in copy_table_data().
+ */
+ if (rel_src)
+ rel_dst->rd_toastoid = rel_src->rd_rel->reltoastrelid;
+
+ apply_concurrent_changes(dstate, rel_dst, ident_key,
+ ident_key_nentries, iistate);
+ }
+ PG_FINALLY();
+ {
+ if (rel_src)
+ rel_dst->rd_toastoid = InvalidOid;
+ }
+ PG_END_TRY();
+}
+
+static IndexInsertState *
+get_index_insert_state(Relation relation, Oid ident_index_id)
+{
+ EState *estate;
+ int i;
+ IndexInsertState *result;
+
+ result = (IndexInsertState *) palloc0(sizeof(IndexInsertState));
+ estate = CreateExecutorState();
+
+ result->rri = (ResultRelInfo *) palloc(sizeof(ResultRelInfo));
+ InitResultRelInfo(result->rri, relation, 0, 0, 0);
+ ExecOpenIndices(result->rri, false);
+
+ /*
+ * Find the relcache entry of the identity index so that we spend no extra
+ * effort to open / close it.
+ */
+ for (i = 0; i < result->rri->ri_NumIndices; i++)
+ {
+ Relation ind_rel;
+
+ ind_rel = result->rri->ri_IndexRelationDescs[i];
+ if (ind_rel->rd_id == ident_index_id)
+ result->ident_index = ind_rel;
+ }
+ if (result->ident_index == NULL)
+ elog(ERROR, "failed to open identity index");
+
+ /* Only initialize fields needed by ExecInsertIndexTuples(). */
+ result->estate = estate;
+
+ return result;
+}
+
+/*
+ * Build scan key to process logical changes.
+ */
+static ScanKey
+build_identity_key(Oid ident_idx_oid, Relation rel_src, int *nentries)
+{
+ Relation ident_idx_rel;
+ Form_pg_index ident_idx;
+ int n,
+ i;
+ ScanKey result;
+
+ Assert(OidIsValid(ident_idx_oid));
+ ident_idx_rel = index_open(ident_idx_oid, AccessShareLock);
+ ident_idx = ident_idx_rel->rd_index;
+ n = ident_idx->indnatts;
+ result = (ScanKey) palloc(sizeof(ScanKeyData) * n);
+ for (i = 0; i < n; i++)
+ {
+ ScanKey entry;
+ int16 relattno;
+ Form_pg_attribute att;
+ Oid opfamily,
+ opcintype,
+ opno,
+ opcode;
+
+ entry = &result[i];
+ relattno = ident_idx->indkey.values[i];
+ if (relattno >= 1)
+ {
+ TupleDesc desc;
+
+ desc = rel_src->rd_att;
+ att = TupleDescAttr(desc, relattno - 1);
+ }
+ else
+ elog(ERROR, "unexpected attribute number %d in index", relattno);
+
+ opfamily = ident_idx_rel->rd_opfamily[i];
+ opcintype = ident_idx_rel->rd_opcintype[i];
+ opno = get_opfamily_member(opfamily, opcintype, opcintype,
+ BTEqualStrategyNumber);
+
+ if (!OidIsValid(opno))
+ elog(ERROR, "failed to find = operator for type %u", opcintype);
+
+ opcode = get_opcode(opno);
+ if (!OidIsValid(opcode))
+ elog(ERROR, "failed to find function for operator %u", opno);
+
+ /* Initialize everything but argument. */
+ ScanKeyInit(entry,
+ i + 1,
+ BTEqualStrategyNumber, opcode,
+ (Datum) NULL);
+ entry->sk_collation = att->attcollation;
+ }
+ index_close(ident_idx_rel, AccessShareLock);
+
+ *nentries = n;
+ return result;
+}
+
+static void
+free_index_insert_state(IndexInsertState *iistate)
+{
+ ExecCloseIndices(iistate->rri);
+ FreeExecutorState(iistate->estate);
+ pfree(iistate->rri);
+ pfree(iistate);
+}
+
+static void
+cleanup_logical_decoding(LogicalDecodingContext *ctx)
+{
+ RepackDecodingState *dstate;
+
+ dstate = (RepackDecodingState *) ctx->output_writer_private;
+
+ ExecDropSingleTupleTableSlot(dstate->tsslot);
+ FreeTupleDesc(dstate->tupdesc_change);
+ FreeTupleDesc(dstate->tupdesc);
+ tuplestore_end(dstate->tstore);
+
+ FreeDecodingContext(ctx);
+}
+
+/*
+ * The final steps of rebuild_relation() for concurrent processing.
+ *
+ * On entry, NewHeap is locked in AccessExclusiveLock mode. OldHeap and its
+ * clustering index (if one is passed) are still locked in a mode that allows
+ * concurrent data changes. On exit, both tables and their indexes are closed,
+ * but locked in AccessExclusiveLock mode.
+ */
+static void
+rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
+ Relation cl_index,
+ LogicalDecodingContext *ctx,
+ bool swap_toast_by_content,
+ TransactionId frozenXid,
+ MultiXactId cutoffMulti)
+{
+ LOCKMODE lockmode_old PG_USED_FOR_ASSERTS_ONLY;
+ List *ind_oids_new;
+ Oid old_table_oid = RelationGetRelid(OldHeap);
+ Oid new_table_oid = RelationGetRelid(NewHeap);
+ List *ind_oids_old = RelationGetIndexList(OldHeap);
+ ListCell *lc,
+ *lc2;
+ char relpersistence;
+ bool is_system_catalog;
+ Oid ident_idx_old,
+ ident_idx_new;
+ IndexInsertState *iistate;
+ ScanKey ident_key;
+ int ident_key_nentries;
+ XLogRecPtr wal_insert_ptr,
+ end_of_wal;
+ char dummy_rec_data = '\0';
+ Relation *ind_refs,
+ *ind_refs_p;
+ int nind;
+
+ /* Like in cluster_rel(). */
+ lockmode_old = ShareUpdateExclusiveLock;
+ Assert(CheckRelationLockedByMe(OldHeap, lockmode_old, false));
+ Assert(cl_index == NULL ||
+ CheckRelationLockedByMe(cl_index, lockmode_old, false));
+ /* This is expected from the caller. */
+ Assert(CheckRelationLockedByMe(NewHeap, AccessExclusiveLock, false));
+
+ ident_idx_old = RelationGetReplicaIndex(OldHeap);
+
+ /*
+ * Unlike the exclusive case, we build new indexes for the new relation
+ * rather than swapping the storage and reindexing the old relation. The
+ * point is that the index build can take some time, so we do it before we
+ * get AccessExclusiveLock on the old heap and therefore we cannot swap
+ * the heap storage yet.
+ *
+ * index_create() will lock the new indexes using AccessExclusiveLock - no
+ * need to change that.
+ */
+ ind_oids_new = build_new_indexes(NewHeap, OldHeap, ind_oids_old);
+
+ /*
+ * Processing shouldn't start without a valid identity index.
+ */
+ Assert(OidIsValid(ident_idx_old));
+
+ /* Find "identity index" on the new relation. */
+ ident_idx_new = InvalidOid;
+ forboth(lc, ind_oids_old, lc2, ind_oids_new)
+ {
+ Oid ind_old = lfirst_oid(lc);
+ Oid ind_new = lfirst_oid(lc2);
+
+ if (ident_idx_old == ind_old)
+ {
+ ident_idx_new = ind_new;
+ break;
+ }
+ }
+ if (!OidIsValid(ident_idx_new))
+
+ /*
+ * Should not happen, given our lock on the old relation.
+ */
+ ereport(ERROR,
+ (errmsg("identity index missing on the new relation")));
+
+ /* Executor state to update indexes. */
+ iistate = get_index_insert_state(NewHeap, ident_idx_new);
+
+ /*
+ * Build scan key that we'll use to look for rows to be updated / deleted
+ * during logical decoding.
+ */
+ ident_key = build_identity_key(ident_idx_new, OldHeap, &ident_key_nentries);
+
+ /*
+ * Flush all WAL records inserted so far (possibly except for the last
+ * incomplete page, see GetInsertRecPtr), to minimize the amount of data
+ * we need to flush while holding exclusive lock on the source table.
+ */
+ wal_insert_ptr = GetInsertRecPtr();
+ XLogFlush(wal_insert_ptr);
+ end_of_wal = GetFlushRecPtr(NULL);
+
+ /*
+ * Apply concurrent changes first time, to minimize the time we need to
+ * hold AccessExclusiveLock. (Quite some amount of WAL could have been
+ * written during the data copying and index creation.)
+ */
+ process_concurrent_changes(ctx, end_of_wal, NewHeap,
+ swap_toast_by_content ? OldHeap : NULL,
+ ident_key, ident_key_nentries, iistate);
+
+ /*
+ * Acquire AccessExclusiveLock on the table, its TOAST relation (if there
+ * is one), all its indexes, so that we can swap the files.
+ *
+ * Before that, unlock the index temporarily to avoid deadlock in case
+ * another transaction is trying to lock it while holding the lock on the
+ * table.
+ */
+ if (cl_index)
+ {
+ index_close(cl_index, ShareUpdateExclusiveLock);
+ cl_index = NULL;
+ }
+ /* Lock the TOAST relation (if any) before the table itself. */
+ if (OldHeap->rd_rel->reltoastrelid)
+ LockRelationOid(OldHeap->rd_rel->reltoastrelid, AccessExclusiveLock);
+ /* Finally lock the table */
+ LockRelationOid(old_table_oid, AccessExclusiveLock);
+
+ /*
+ * Lock all indexes now, not only the clustering one: all indexes need to
+ * have their files swapped. While doing that, store their relation
+ * references in an array, to handle predicate locks below.
+ */
+ ind_refs_p = ind_refs = palloc_array(Relation, list_length(ind_oids_old));
+ nind = 0;
+ foreach(lc, ind_oids_old)
+ {
+ Oid ind_oid;
+ Relation index;
+
+ ind_oid = lfirst_oid(lc);
+ index = index_open(ind_oid, AccessExclusiveLock);
+ *ind_refs_p = index;
+ ind_refs_p++;
+ nind++;
+ }
+
+ /*
+ * In addition, lock the OldHeap's TOAST relation exclusively - again, the
+ * lock is needed to swap the files.
+ */
+ if (OidIsValid(OldHeap->rd_rel->reltoastrelid))
+ LockRelationOid(OldHeap->rd_rel->reltoastrelid, AccessExclusiveLock);
+
+ /*
+ * Tuples and pages of the old heap will be gone, but the heap will stay.
+ */
+ TransferPredicateLocksToHeapRelation(OldHeap);
+ /* The same for indexes. */
+ for (int i = 0; i < nind; i++)
+ {
+ Relation index = ind_refs[i];
+
+ TransferPredicateLocksToHeapRelation(index);
+
+ /*
+ * References to indexes on the old relation are not needed anymore,
+ * however locks stay till the end of the transaction.
+ */
+ index_close(index, NoLock);
+ }
+ pfree(ind_refs);
+
+ /*
+ * Flush anything we see in WAL, to make sure that all changes committed
+ * while we were waiting for the exclusive lock are available for
+ * decoding. This should not be necessary if all backends had
+ * synchronous_commit set, but we can't rely on this setting.
+ *
+ * Unfortunately, GetInsertRecPtr() may lag behind the actual insert
+ * position, and GetLastImportantRecPtr() points at the start of the last
+ * record rather than at the end. Thus the simplest way to determine the
+ * insert position is to insert a dummy record and use its LSN.
+ *
+ * XXX Consider using GetLastImportantRecPtr() and adding the size of the
+ * last record (plus the total size of all the page headers the record
+ * spans)?
+ */
+ XLogBeginInsert();
+ XLogRegisterData(&dummy_rec_data, 1);
+ wal_insert_ptr = XLogInsert(RM_XLOG_ID, XLOG_NOOP);
+ XLogFlush(wal_insert_ptr);
+ end_of_wal = GetFlushRecPtr(NULL);
+
+ /* Apply the concurrent changes again. */
+ process_concurrent_changes(ctx, end_of_wal, NewHeap,
+ swap_toast_by_content ? OldHeap : NULL,
+ ident_key, ident_key_nentries, iistate);
+
+ /* Remember info about rel before closing OldHeap */
+ relpersistence = OldHeap->rd_rel->relpersistence;
+ is_system_catalog = IsSystemRelation(OldHeap);
+
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_SWAP_REL_FILES);
+
+ /*
+ * Even ShareUpdateExclusiveLock should have prevented others from
+ * creating / dropping indexes (even using the CONCURRENTLY option), so we
+ * do not need to check whether the lists match.
+ */
+ forboth(lc, ind_oids_old, lc2, ind_oids_new)
+ {
+ Oid ind_old = lfirst_oid(lc);
+ Oid ind_new = lfirst_oid(lc2);
+ Oid mapped_tables[4];
+
+ /* Zero out possible results from swap_relation_files() */
+ memset(mapped_tables, 0, sizeof(mapped_tables));
+
+ swap_relation_files(ind_old, ind_new,
+ (old_table_oid == RelationRelationId),
+ swap_toast_by_content,
+ true,
+ InvalidTransactionId,
+ InvalidMultiXactId,
+ mapped_tables);
+
+#ifdef USE_ASSERT_CHECKING
+
+ /*
+ * Concurrent processing is not supported for system relations, so
+ * there should be no mapped tables.
+ */
+ for (int i = 0; i < 4; i++)
+ Assert(mapped_tables[i] == 0);
+#endif
+ }
+
+ /* The new indexes must be visible for deletion. */
+ CommandCounterIncrement();
+
+ /* Close the old heap but keep lock until transaction commit. */
+ table_close(OldHeap, NoLock);
+ /* Close the new heap. (We didn't have to open its indexes). */
+ table_close(NewHeap, NoLock);
+
+ /* Cleanup what we don't need anymore. (And close the identity index.) */
+ pfree(ident_key);
+ free_index_insert_state(iistate);
+
+ /*
+ * Swap the relations and their TOAST relations and TOAST indexes. This
+ * also drops the new relation and its indexes.
+ *
+ * (System catalogs are currently not supported.)
+ */
+ Assert(!is_system_catalog);
+ finish_heap_swap(old_table_oid, new_table_oid,
+ is_system_catalog,
+ swap_toast_by_content,
+ false, true, false,
+ frozenXid, cutoffMulti,
+ relpersistence);
+}
+
+/*
+ * Build indexes on NewHeap according to those on OldHeap.
+ *
+ * OldIndexes is the list of index OIDs on OldHeap.
+ *
+ * A list of OIDs of the corresponding indexes created on NewHeap is
+ * returned. The order of items matches that of OldIndexes, so the two lists
+ * can be used to swap the index storage.
+ */
+static List *
+build_new_indexes(Relation NewHeap, Relation OldHeap, List *OldIndexes)
+{
+ StringInfo ind_name;
+ ListCell *lc;
+ List *result = NIL;
+
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_REBUILD_INDEX);
+
+ ind_name = makeStringInfo();
+
+ foreach(lc, OldIndexes)
+ {
+ Oid ind_oid,
+ ind_oid_new,
+ tbsp_oid;
+ Relation ind;
+ IndexInfo *ind_info;
+ int i,
+ heap_col_id;
+ List *colnames;
+ int16 indnatts;
+ Oid *collations,
+ *opclasses;
+ HeapTuple tup;
+ bool isnull;
+ Datum d;
+ oidvector *oidvec;
+ int2vector *int2vec;
+ size_t oid_arr_size;
+ size_t int2_arr_size;
+ int16 *indoptions;
+ text *reloptions = NULL;
+ bits16 flags;
+ Datum *opclassOptions;
+ NullableDatum *stattargets;
+
+ ind_oid = lfirst_oid(lc);
+ ind = index_open(ind_oid, AccessShareLock);
+ ind_info = BuildIndexInfo(ind);
+
+ tbsp_oid = ind->rd_rel->reltablespace;
+
+ /*
+ * Index names really don't matter here, as we'll eventually use only
+ * their storage. Just make them unique within the table.
+ */
+ resetStringInfo(ind_name);
+ appendStringInfo(ind_name, "ind_%d",
+ list_cell_number(OldIndexes, lc));
+
+ flags = 0;
+ if (ind->rd_index->indisprimary)
+ flags |= INDEX_CREATE_IS_PRIMARY;
+
+ colnames = NIL;
+ indnatts = ind->rd_index->indnatts;
+ oid_arr_size = sizeof(Oid) * indnatts;
+ int2_arr_size = sizeof(int16) * indnatts;
+
+ collations = (Oid *) palloc(oid_arr_size);
+ for (i = 0; i < indnatts; i++)
+ {
+ char *colname;
+
+ heap_col_id = ind->rd_index->indkey.values[i];
+ if (heap_col_id > 0)
+ {
+ Form_pg_attribute att;
+
+ /* Normal attribute. */
+ att = TupleDescAttr(OldHeap->rd_att, heap_col_id - 1);
+ colname = pstrdup(NameStr(att->attname));
+ collations[i] = att->attcollation;
+ }
+ else if (heap_col_id == 0)
+ {
+ HeapTuple tuple;
+ Form_pg_attribute att;
+
+ /*
+ * Expression column is not present in relcache. What we need
+ * here is an attribute of the *index* relation.
+ */
+ tuple = SearchSysCache2(ATTNUM,
+ ObjectIdGetDatum(ind_oid),
+ Int16GetDatum(i + 1));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR,
+ "cache lookup failed for attribute %d of relation %u",
+ i + 1, ind_oid);
+ att = (Form_pg_attribute) GETSTRUCT(tuple);
+ colname = pstrdup(NameStr(att->attname));
+ collations[i] = att->attcollation;
+ ReleaseSysCache(tuple);
+ }
+ else
+ elog(ERROR, "unexpected column number: %d",
+ heap_col_id);
+
+ colnames = lappend(colnames, colname);
+ }
+
+ /*
+ * Special effort needed for variable length attributes of
+ * Form_pg_index.
+ */
+ tup = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(ind_oid));
+ if (!HeapTupleIsValid(tup))
+ elog(ERROR, "cache lookup failed for index %u", ind_oid);
+ d = SysCacheGetAttr(INDEXRELID, tup, Anum_pg_index_indclass, &isnull);
+ Assert(!isnull);
+ oidvec = (oidvector *) DatumGetPointer(d);
+ opclasses = (Oid *) palloc(oid_arr_size);
+ memcpy(opclasses, oidvec->values, oid_arr_size);
+
+ d = SysCacheGetAttr(INDEXRELID, tup, Anum_pg_index_indoption,
+ &isnull);
+ Assert(!isnull);
+ int2vec = (int2vector *) DatumGetPointer(d);
+ indoptions = (int16 *) palloc(int2_arr_size);
+ memcpy(indoptions, int2vec->values, int2_arr_size);
+ ReleaseSysCache(tup);
+
+ tup = SearchSysCache1(RELOID, ObjectIdGetDatum(ind_oid));
+ if (!HeapTupleIsValid(tup))
+ elog(ERROR, "cache lookup failed for index relation %u", ind_oid);
+ d = SysCacheGetAttr(RELOID, tup, Anum_pg_class_reloptions, &isnull);
+ reloptions = !isnull ? DatumGetTextPCopy(d) : NULL;
+ ReleaseSysCache(tup);
+
+ opclassOptions = palloc0(sizeof(Datum) * ind_info->ii_NumIndexAttrs);
+ for (i = 0; i < ind_info->ii_NumIndexAttrs; i++)
+ opclassOptions[i] = get_attoptions(ind_oid, i + 1);
+
+ stattargets = get_index_stattargets(ind_oid, ind_info);
+
+ /*
+ * Neither parentIndexRelid nor parentConstraintId needs to be passed
+ * since the new catalog entries (pg_constraint, pg_inherits) would
+ * eventually be dropped. Therefore there's no need to record valid
+ * dependency on parents.
+ */
+ ind_oid_new = index_create(NewHeap,
+ ind_name->data,
+ InvalidOid,
+ InvalidOid, /* parentIndexRelid */
+ InvalidOid, /* parentConstraintId */
+ InvalidOid,
+ ind_info,
+ colnames,
+ ind->rd_rel->relam,
+ tbsp_oid,
+ collations,
+ opclasses,
+ opclassOptions,
+ indoptions,
+ stattargets,
+ PointerGetDatum(reloptions),
+ flags, /* flags */
+ 0, /* constr_flags */
+ false, /* allow_system_table_mods */
+ false, /* is_internal */
+ NULL /* constraintId */
+ );
+ result = lappend_oid(result, ind_oid_new);
+
+ index_close(ind, AccessShareLock);
+ list_free_deep(colnames);
+ pfree(collations);
+ pfree(opclasses);
+ pfree(indoptions);
+ if (reloptions)
+ pfree(reloptions);
+ }
+
+ return result;
+}
+
+/*
+ * REPACK is intended to be a replacement of both CLUSTER and VACUUM FULL.
+ */
+void
+repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
+{
+ ListCell *lc;
+ ClusterParams params = {0};
+ bool verbose = false;
+ Relation rel = NULL;
+ Oid indexOid = InvalidOid;
+ MemoryContext repack_context;
+ List *rtcs;
+ LOCKMODE lockmode;
+
+ /* Parse option list */
+ foreach(lc, stmt->params)
+ {
+ DefElem *opt = (DefElem *) lfirst(lc);
+
+ if (strcmp(opt->defname, "verbose") == 0)
+ verbose = defGetBoolean(opt);
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("unrecognized REPACK option \"%s\"",
+ opt->defname),
+ parser_errposition(pstate, opt->location)));
+ }
+
+ params.options =
+ (verbose ? CLUOPT_VERBOSE : 0) |
+ (stmt->concurrent ? CLUOPT_CONCURRENT : 0);
+
+ /*
+ * Determine the lock mode expected by cluster_rel().
+ *
+ * In the exclusive case, we obtain AccessExclusiveLock right away to
+ * avoid lock-upgrade hazard in the single-transaction case. In the
+ * CONCURRENTLY case, the AccessExclusiveLock will only be used at the end
+ * of processing, supposedly for very short time. Until then, we'll have
+ * to unlock the relation temporarily, so there's no lock-upgrade hazard.
+ */
+ lockmode = (params.options & CLUOPT_CONCURRENT) == 0 ?
+ AccessExclusiveLock : ShareUpdateExclusiveLock;
+
+ if (stmt->relation != NULL)
+ {
+ rel = process_single_relation(stmt->relation, stmt->indexname,
+ CLUSTER_COMMAND_REPACK, lockmode,
+ isTopLevel, ¶ms, &indexOid);
+ if (rel == NULL)
+ return;
+ }
+
+ /*
+ * By here, we know we are in a multi-table situation.
+ *
+ * Concurrent processing is currently considered rather special (e.g. in
+ * terms of resources consumed), so it is not performed in bulk.
+ */
+ if (params.options & CLUOPT_CONCURRENT)
+ {
+ if (rel != NULL)
+ {
+ Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
+ ereport(ERROR,
+ (errmsg("REPACK CONCURRENTLY not supported for partitioned tables"),
+ errhint("Consider running the command for individual partitions.")));
+ }
+ else
+ ereport(ERROR,
+ (errmsg("REPACK CONCURRENTLY requires explicit table name")));
+ }
+
+ /*
+ * In order to avoid holding locks for too long, we want to process each
+ * table in its own transaction. This forces us to disallow running
+ * inside a user transaction block.
+ */
+ PreventInTransactionBlock(isTopLevel, "REPACK");
+
+ /* Also, we need a memory context to hold our list of relations */
+ repack_context = AllocSetContextCreate(PortalContext,
+ "Repack",
+ ALLOCSET_DEFAULT_SIZES);
+
+ params.options |= CLUOPT_RECHECK;
+ if (rel != NULL)
+ {
+ Oid relid;
+ bool rel_is_index;
+
+ Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
+ /* See the ereport() above. */
+ Assert((params.options & CLUOPT_CONCURRENT) == 0);
+
+ if (OidIsValid(indexOid))
+ {
+ relid = indexOid;
+ rel_is_index = true;
+ }
+ else
+ {
+ relid = RelationGetRelid(rel);
+ rel_is_index = false;
+ }
+ rtcs = get_tables_to_cluster_partitioned(repack_context, relid,
+ rel_is_index,
+ CLUSTER_COMMAND_REPACK);
+
+ /* close relation, releasing lock on parent table */
+ table_close(rel, lockmode);
+ }
+ else
+ rtcs = get_tables_to_repack(repack_context);
+
+ /* Do the job. */
+ cluster_multiple_rels(rtcs, &params, CLUSTER_COMMAND_REPACK, lockmode,
+ isTopLevel);
- /* Do the job. */
- cluster_multiple_rels(rtcs, &params, CLUSTER_COMMAND_REPACK);
/* Start a new transaction for the cleanup work. */
StartTransactionCommand();
@@ -1925,7 +3423,8 @@ repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
*/
static Relation
process_single_relation(RangeVar *relation, char *indexname,
- ClusterCommand cmd, ClusterParams *params,
+ ClusterCommand cmd, LOCKMODE lockmode,
+ bool isTopLevel, ClusterParams *params,
Oid *indexOid_p)
{
Relation rel;
@@ -1935,12 +3434,10 @@ process_single_relation(RangeVar *relation, char *indexname,
Oid tableOid;
/*
- * Find, lock, and check permissions on the table. We obtain
- * AccessExclusiveLock right away to avoid lock-upgrade hazard in the
- * single-transaction case.
+ * Find, lock, and check permissions on the table.
*/
tableOid = RangeVarGetRelidExtended(relation,
- AccessExclusiveLock,
+ lockmode,
0,
RangeVarCallbackMaintainsTable,
NULL);
@@ -1994,7 +3491,7 @@ process_single_relation(RangeVar *relation, char *indexname,
/* For non-partitioned tables, do what we came here to do. */
if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
{
- cluster_rel(rel, indexOid, params, cmd);
+ cluster_rel(rel, indexOid, params, cmd, isTopLevel);
/* cluster_rel closes the relation, but keeps lock */
return NULL;
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index e7854add178..df879c2a18d 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -904,7 +904,7 @@ refresh_by_match_merge(Oid matviewOid, Oid tempOid, Oid relowner,
static void
refresh_by_heap_swap(Oid matviewOid, Oid OIDNewHeap, char relpersistence)
{
- finish_heap_swap(matviewOid, OIDNewHeap, false, false, true, true,
+ finish_heap_swap(matviewOid, OIDNewHeap, false, false, true, true, true,
RecentXmin, ReadNextMultiXactId(), relpersistence);
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index b7a74f25785..2b15e5b1505 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -5970,6 +5970,7 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
finish_heap_swap(tab->relid, OIDNewHeap,
false, false, true,
!OidIsValid(tab->newTableSpace),
+ true,
RecentXmin,
ReadNextMultiXactId(),
persistence);
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index a4ad23448f8..f9f8f5ebb58 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -124,7 +124,7 @@ static void vac_truncate_clog(TransactionId frozenXID,
TransactionId lastSaneFrozenXid,
MultiXactId lastSaneMinMulti);
static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
- BufferAccessStrategy bstrategy);
+ BufferAccessStrategy bstrategy, bool isTopLevel);
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
@@ -634,7 +634,8 @@ vacuum(List *relations, VacuumParams *params, BufferAccessStrategy bstrategy,
if (params->options & VACOPT_VACUUM)
{
- if (!vacuum_rel(vrel->oid, vrel->relation, params, bstrategy))
+ if (!vacuum_rel(vrel->oid, vrel->relation, params, bstrategy,
+ isTopLevel))
continue;
}
@@ -1996,7 +1997,7 @@ vac_truncate_clog(TransactionId frozenXID,
*/
static bool
vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
- BufferAccessStrategy bstrategy)
+ BufferAccessStrategy bstrategy, bool isTopLevel)
{
LOCKMODE lmode;
Relation rel;
@@ -2264,7 +2265,7 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
/* VACUUM FULL is now a variant of CLUSTER; see cluster.c */
cluster_rel(rel, InvalidOid, &cluster_params,
- CLUSTER_COMMAND_VACUUM);
+ CLUSTER_COMMAND_VACUUM, isTopLevel);
/* cluster_rel closes the relation, but keeps lock */
rel = NULL;
@@ -2310,7 +2311,8 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
toast_vacuum_params.options |= VACOPT_PROCESS_MAIN;
toast_vacuum_params.toast_parent = relid;
- vacuum_rel(toast_relid, NULL, &toast_vacuum_params, bstrategy);
+ vacuum_rel(toast_relid, NULL, &toast_vacuum_params, bstrategy,
+ isTopLevel);
}
/*
diff --git a/src/backend/meson.build b/src/backend/meson.build
index 2b0db214804..50aa385a581 100644
--- a/src/backend/meson.build
+++ b/src/backend/meson.build
@@ -194,5 +194,6 @@ pg_test_mod_args = pg_mod_args + {
subdir('jit/llvm')
subdir('replication/libpqwalreceiver')
subdir('replication/pgoutput')
+subdir('replication/pgoutput_repack')
subdir('snowball')
subdir('utils/mb/conversion_procs')
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 9c79265a438..634d0768851 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -11892,27 +11892,30 @@ cluster_index_specification:
*
* QUERY:
* REPACK [ (options) ] [ <qualified_name> [ USING INDEX <index_name> ] ]
+ * REPACK [ (options) ] CONCURRENTLY <qualified_name> [ USING INDEX <index_name> ]
*
*****************************************************************************/
RepackStmt:
- REPACK qualified_name repack_index_specification
+ REPACK opt_concurrently qualified_name repack_index_specification
{
RepackStmt *n = makeNode(RepackStmt);
- n->relation = $2;
- n->indexname = $3;
+ n->concurrent = $2;
+ n->relation = $3;
+ n->indexname = $4;
n->params = NIL;
$$ = (Node *) n;
}
- | REPACK '(' utility_option_list ')' qualified_name repack_index_specification
+ | REPACK '(' utility_option_list ')' opt_concurrently qualified_name repack_index_specification
{
RepackStmt *n = makeNode(RepackStmt);
- n->relation = $5;
- n->indexname = $6;
n->params = $3;
+ n->concurrent = $5;
+ n->relation = $6;
+ n->indexname = $7;
$$ = (Node *) n;
}
@@ -11923,6 +11926,7 @@ RepackStmt:
n->relation = NULL;
n->indexname = NULL;
n->params = NIL;
+ n->concurrent = false;
$$ = (Node *) n;
}
@@ -11933,6 +11937,7 @@ RepackStmt:
n->relation = NULL;
n->indexname = NULL;
n->params = $3;
+ n->concurrent = false;
$$ = (Node *) n;
}
;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 78f9a0a11c4..00f7bbc5f59 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -33,6 +33,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecord.h"
#include "catalog/pg_control.h"
+#include "commands/cluster.h"
#include "replication/decode.h"
#include "replication/logical.h"
#include "replication/message.h"
@@ -467,6 +468,29 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
TransactionId xid = XLogRecGetXid(buf->record);
SnapBuild *builder = ctx->snapshot_builder;
+ /*
+ * Check if REPACK CONCURRENTLY is being performed by this backend. If so,
+ * only decode data changes of the table that it is processing, and the
+ * changes of its TOAST relation.
+ *
+ * (The TOAST locator should not be set unless the main one is.)
+ */
+ Assert(!OidIsValid(repacked_rel_toast_locator.relNumber) ||
+ OidIsValid(repacked_rel_locator.relNumber));
+
+ if (OidIsValid(repacked_rel_locator.relNumber))
+ {
+ XLogReaderState *r = buf->record;
+ RelFileLocator locator;
+
+ /* Not all records contain the block. */
+ if (XLogRecGetBlockTagExtended(r, 0, &locator, NULL, NULL, NULL) &&
+ !RelFileLocatorEquals(locator, repacked_rel_locator) &&
+ (!OidIsValid(repacked_rel_toast_locator.relNumber) ||
+ !RelFileLocatorEquals(locator, repacked_rel_toast_locator)))
+ return;
+ }
+
ReorderBufferProcessXid(ctx->reorder, xid, buf->origptr);
/*
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index e5d2a583ce6..c32e459411b 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -486,6 +486,26 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
return SnapBuildMVCCFromHistoric(snap, true);
}
+/*
+ * Build an MVCC snapshot for the initial data load performed by REPACK
+ * CONCURRENTLY command.
+ *
+ * The snapshot will only be used to scan one particular relation, which is
+ * treated like a catalog (therefore ->building_full_snapshot is not
+ * important), and the caller should already have a replication slot set up
+ * (so we do not set MyProc->xmin). XXX Do we need to add any other restrictions?
+ */
+Snapshot
+SnapBuildInitialSnapshotForRepack(SnapBuild *builder)
+{
+ Snapshot snap;
+
+ Assert(builder->state == SNAPBUILD_CONSISTENT);
+
+ snap = SnapBuildBuildSnapshot(builder);
+ return SnapBuildMVCCFromHistoric(snap, false);
+}
+
/*
* Turn a historic MVCC snapshot into an ordinary MVCC snapshot.
*
diff --git a/src/backend/replication/pgoutput_repack/Makefile b/src/backend/replication/pgoutput_repack/Makefile
new file mode 100644
index 00000000000..4efeb713b70
--- /dev/null
+++ b/src/backend/replication/pgoutput_repack/Makefile
@@ -0,0 +1,32 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+# Makefile for src/backend/replication/pgoutput_repack
+#
+# IDENTIFICATION
+# src/backend/replication/pgoutput_repack/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/replication/pgoutput_repack
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+ $(WIN32RES) \
+ pgoutput_repack.o
+PGFILEDESC = "pgoutput_repack - logical replication output plugin for REPACK command"
+NAME = pgoutput_repack
+
+all: all-shared-lib
+
+include $(top_srcdir)/src/Makefile.shlib
+
+install: all installdirs install-lib
+
+installdirs: installdirs-lib
+
+uninstall: uninstall-lib
+
+clean distclean: clean-lib
+ rm -f $(OBJS)
diff --git a/src/backend/replication/pgoutput_repack/meson.build b/src/backend/replication/pgoutput_repack/meson.build
new file mode 100644
index 00000000000..133e865a4a0
--- /dev/null
+++ b/src/backend/replication/pgoutput_repack/meson.build
@@ -0,0 +1,18 @@
+# Copyright (c) 2022-2024, PostgreSQL Global Development Group
+
+pgoutput_repack_sources = files(
+ 'pgoutput_repack.c',
+)
+
+if host_system == 'windows'
+ pgoutput_repack_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pgoutput_repack',
+ '--FILEDESC', 'pgoutput_repack - logical replication output plugin for REPACK command',])
+endif
+
+pgoutput_repack = shared_module('pgoutput_repack',
+ pgoutput_repack_sources,
+ kwargs: pg_mod_args,
+)
+
+backend_targets += pgoutput_repack
diff --git a/src/backend/replication/pgoutput_repack/pgoutput_repack.c b/src/backend/replication/pgoutput_repack/pgoutput_repack.c
new file mode 100644
index 00000000000..687fbbc59bb
--- /dev/null
+++ b/src/backend/replication/pgoutput_repack/pgoutput_repack.c
@@ -0,0 +1,288 @@
+/*-------------------------------------------------------------------------
+ *
+ * pgoutput_repack.c
+ * Logical Replication output plugin for REPACK command
+ *
+ * Copyright (c) 2012-2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/replication/pgoutput_repack/pgoutput_repack.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heaptoast.h"
+#include "commands/cluster.h"
+#include "replication/snapbuild.h"
+
+PG_MODULE_MAGIC;
+
+static void plugin_startup(LogicalDecodingContext *ctx,
+ OutputPluginOptions *opt, bool is_init);
+static void plugin_shutdown(LogicalDecodingContext *ctx);
+static void plugin_begin_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn);
+static void plugin_commit_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ Relation rel, ReorderBufferChange *change);
+static void plugin_truncate(struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, int nrelations,
+ Relation relations[],
+ ReorderBufferChange *change);
+static void store_change(LogicalDecodingContext *ctx,
+ ConcurrentChangeKind kind, HeapTuple tuple);
+
+void
+_PG_output_plugin_init(OutputPluginCallbacks *cb)
+{
+ AssertVariableIsOfType(&_PG_output_plugin_init, LogicalOutputPluginInit);
+
+ cb->startup_cb = plugin_startup;
+ cb->begin_cb = plugin_begin_txn;
+ cb->change_cb = plugin_change;
+ cb->truncate_cb = plugin_truncate;
+ cb->commit_cb = plugin_commit_txn;
+ cb->shutdown_cb = plugin_shutdown;
+}
+
+
+/* initialize this plugin */
+static void
+plugin_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
+ bool is_init)
+{
+ ctx->output_plugin_private = NULL;
+
+ /* Probably unnecessary, as we don't use the SQL interface ... */
+ opt->output_type = OUTPUT_PLUGIN_BINARY_OUTPUT;
+
+ if (ctx->output_plugin_options != NIL)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("this plugin does not expect any options")));
+ }
+}
+
+static void
+plugin_shutdown(LogicalDecodingContext *ctx)
+{
+}
+
+/*
+ * As we don't release the slot during the processing of a particular table,
+ * there's no room for an SQL interface, even for debugging. Therefore we need
+ * neither OutputPluginPrepareWrite() nor OutputPluginWrite() in the plugin
+ * callbacks. (Although we might want to write custom callbacks, this API
+ * seems to be unnecessarily generic for our purposes.)
+ */
+
+/* BEGIN callback */
+static void
+plugin_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+}
+
+/* COMMIT callback */
+static void
+plugin_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+}
+
+/*
+ * Callback for individual changed tuples
+ */
+static void
+plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ Relation relation, ReorderBufferChange *change)
+{
+ RepackDecodingState *dstate;
+
+ dstate = (RepackDecodingState *) ctx->output_writer_private;
+
+ /* Only interested in one particular relation. */
+ if (relation->rd_id != dstate->relid)
+ return;
+
+ /* Decode entry depending on its type */
+ switch (change->action)
+ {
+ case REORDER_BUFFER_CHANGE_INSERT:
+ {
+ HeapTuple newtuple;
+
+ newtuple = change->data.tp.newtuple;
+
+ /*
+ * Identity checks in the main function should have made this
+ * impossible.
+ */
+ if (newtuple == NULL)
+ elog(ERROR, "incomplete insert info");
+
+ store_change(ctx, CHANGE_INSERT, newtuple);
+ }
+ break;
+ case REORDER_BUFFER_CHANGE_UPDATE:
+ {
+ HeapTuple oldtuple,
+ newtuple;
+
+ oldtuple = change->data.tp.oldtuple;
+ newtuple = change->data.tp.newtuple;
+
+ if (newtuple == NULL)
+ elog(ERROR, "incomplete update info");
+
+ if (oldtuple != NULL)
+ store_change(ctx, CHANGE_UPDATE_OLD, oldtuple);
+
+ store_change(ctx, CHANGE_UPDATE_NEW, newtuple);
+ }
+ break;
+ case REORDER_BUFFER_CHANGE_DELETE:
+ {
+ HeapTuple oldtuple;
+
+ oldtuple = change->data.tp.oldtuple;
+
+ if (oldtuple == NULL)
+ elog(ERROR, "incomplete delete info");
+
+ store_change(ctx, CHANGE_DELETE, oldtuple);
+ }
+ break;
+ default:
+ /* Should not come here */
+ Assert(false);
+ break;
+ }
+}
+
+static void
+plugin_truncate(struct LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ int nrelations, Relation relations[],
+ ReorderBufferChange *change)
+{
+ RepackDecodingState *dstate;
+ int i;
+ Relation relation = NULL;
+
+ dstate = (RepackDecodingState *) ctx->output_writer_private;
+
+ /* Find the relation we are processing. */
+ for (i = 0; i < nrelations; i++)
+ {
+ relation = relations[i];
+
+ if (RelationGetRelid(relation) == dstate->relid)
+ break;
+ }
+
+ /* Is this truncation of another relation? */
+ if (i == nrelations)
+ return;
+
+ store_change(ctx, CHANGE_TRUNCATE, NULL);
+}
+
+/* Store concurrent data change. */
+static void
+store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
+ HeapTuple tuple)
+{
+ RepackDecodingState *dstate;
+ char *change_raw;
+ ConcurrentChange change;
+ bool flattened = false;
+ Size size;
+ Datum values[1];
+ bool isnull[1];
+ char *dst,
+ *dst_start;
+
+ dstate = (RepackDecodingState *) ctx->output_writer_private;
+
+ size = MAXALIGN(VARHDRSZ) + SizeOfConcurrentChange;
+
+ if (tuple)
+ {
+ /*
+ * ReorderBufferCommit() stores the TOAST chunks in its private memory
+ * context and frees them after having called apply_change().
+ * Therefore we need flat copy (including TOAST) that we eventually
+ * copy into the memory context which is available to
+ * decode_concurrent_changes().
+ */
+ if (HeapTupleHasExternal(tuple))
+ {
+ /*
+ * toast_flatten_tuple_to_datum() might be more convenient but we
+ * don't want the decompression it does.
+ */
+ tuple = toast_flatten_tuple(tuple, dstate->tupdesc);
+ flattened = true;
+ }
+
+ size += tuple->t_len;
+ }
+
+ /* XXX Isn't there any function / macro to do this? */
+ if (size >= 0x3FFFFFFF)
+ elog(ERROR, "change is too big");
+
+ /* Construct the change. */
+ change_raw = (char *) palloc0(size);
+ SET_VARSIZE(change_raw, size);
+
+ /*
+ * Since the varlena alignment might not be sufficient for the structure,
+ * set the fields in a local instance and remember where it should
+ * eventually be copied.
+ */
+ change.kind = kind;
+ dst_start = (char *) VARDATA(change_raw);
+
+ /* No other information is needed for TRUNCATE. */
+ if (change.kind == CHANGE_TRUNCATE)
+ {
+ memcpy(dst_start, &change, SizeOfConcurrentChange);
+ goto store;
+ }
+
+ /*
+ * Copy the tuple.
+ *
+ * CAUTION: change->tup_data.t_data must be fixed on retrieval!
+ */
+ memcpy(&change.tup_data, tuple, sizeof(HeapTupleData));
+ dst = dst_start + SizeOfConcurrentChange;
+ memcpy(dst, tuple->t_data, tuple->t_len);
+
+ /* The data has been copied. */
+ if (flattened)
+ pfree(tuple);
+
+store:
+ /* Copy the structure so it can be stored. */
+ memcpy(dst_start, &change, SizeOfConcurrentChange);
+
+ /* Store as tuple of 1 bytea column. */
+ values[0] = PointerGetDatum(change_raw);
+ isnull[0] = false;
+ tuplestore_putvalues(dstate->tstore, dstate->tupdesc_change,
+ values, isnull);
+
+ /* Accounting. */
+ dstate->nchanges++;
+
+ /* Cleanup. */
+ pfree(change_raw);
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..e9ddf39500c 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
#include "access/xlogprefetcher.h"
#include "access/xlogrecovery.h"
#include "commands/async.h"
+#include "commands/cluster.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/autovacuum.h"
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 4f44648aca8..1ee069c34ee 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -351,6 +351,7 @@ DSMRegistry "Waiting to read or update the dynamic shared memory registry."
InjectionPoint "Waiting to read or update information related to injection points."
SerialControl "Waiting to read or update shared <filename>pg_serial</filename> state."
AioWorkerSubmissionQueue "Waiting to access AIO worker submission queue."
+RepackedRels "Waiting to read or update information on tables being repacked concurrently."
#
# END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 9f54a9e72b7..a495f22876d 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -64,6 +64,7 @@
#include "catalog/pg_type.h"
#include "catalog/schemapg.h"
#include "catalog/storage.h"
+#include "commands/cluster.h"
#include "commands/policy.h"
#include "commands/publicationcmds.h"
#include "commands/trigger.h"
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 70a6b8902d1..7f1c220e00b 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -213,7 +213,6 @@ static List *exportedSnapshots = NIL;
/* Prototypes for local functions */
static void UnregisterSnapshotNoOwner(Snapshot snapshot);
-static void FreeSnapshot(Snapshot snapshot);
static void SnapshotResetXmin(void);
/* ResourceOwner callbacks to track snapshot references */
@@ -646,7 +645,7 @@ CopySnapshot(Snapshot snapshot)
* FreeSnapshot
* Free the memory associated with a snapshot.
*/
-static void
+void
FreeSnapshot(Snapshot snapshot)
{
Assert(snapshot->regd_count == 0);
diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index 31271786f21..a22e6cb6ccc 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -4914,18 +4914,27 @@ match_previous_words(int pattern_id,
}
/* REPACK */
- else if (Matches("REPACK"))
+ else if (Matches("REPACK") || Matches("REPACK", "(*)"))
+ COMPLETE_WITH_SCHEMA_QUERY_PLUS(Query_for_list_of_clusterables,
+ "CONCURRENTLY");
+ else if (Matches("REPACK", "CONCURRENTLY"))
COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_clusterables);
- else if (Matches("REPACK", "(*)"))
+ else if (Matches("REPACK", "(*)", "CONCURRENTLY"))
COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_clusterables);
- /* If we have REPACK <sth>, then add "USING INDEX" */
- else if (Matches("REPACK", MatchAnyExcept("(")))
+ /* If we have REPACK [ CONCURRENTLY ] <sth>, then add "USING INDEX" */
+ else if (Matches("REPACK", MatchAnyExcept("(|CONCURRENTLY")) ||
+ Matches("REPACK", "CONCURRENTLY", MatchAnyExcept("(")))
COMPLETE_WITH("USING INDEX");
- /* If we have REPACK (*) <sth>, then add "USING INDEX" */
- else if (Matches("REPACK", "(*)", MatchAny))
+ /* If we have REPACK (*) [ CONCURRENTLY ] <sth>, then add "USING INDEX" */
+ else if (Matches("REPACK", "(*)", MatchAnyExcept("CONCURRENTLY")) ||
+ Matches("REPACK", "(*)", "CONCURRENTLY", MatchAnyExcept("(")))
COMPLETE_WITH("USING INDEX");
- /* If we have REPACK <sth> USING, then add the index as well */
- else if (Matches("REPACK", MatchAny, "USING", "INDEX"))
+
+ /*
+ * Complete ... [ (*) ] [ CONCURRENTLY ] <sth> USING INDEX, with a list of
+ * indexes for <sth>.
+ */
+ else if (TailMatches(MatchAnyExcept("(|CONCURRENTLY"), "USING", "INDEX"))
{
set_completion_reference(prev3_wd);
COMPLETE_WITH_SCHEMA_QUERY(Query_for_index_of_table);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 1640d9c32f7..bdeb2f83540 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -421,6 +421,10 @@ extern HTSV_Result HeapTupleSatisfiesVacuumHorizon(HeapTuple htup, Buffer buffer
TransactionId *dead_after);
extern void HeapTupleSetHintBits(HeapTupleHeader tuple, Buffer buffer,
uint16 infomask, TransactionId xid);
+extern bool HeapTupleMVCCInserted(HeapTuple htup, Snapshot snapshot,
+ Buffer buffer);
+extern bool HeapTupleMVCCNotDeleted(HeapTuple htup, Snapshot snapshot,
+ Buffer buffer);
extern bool HeapTupleHeaderIsOnlyLocked(HeapTupleHeader tuple);
extern bool HeapTupleIsSurelyDead(HeapTuple htup,
struct GlobalVisState *vistest);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index b8cb1e744ad..b1ca73d6ea5 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,6 +21,7 @@
#include "access/sdir.h"
#include "access/xact.h"
#include "executor/tuptable.h"
+#include "replication/logical.h"
#include "storage/read_stream.h"
#include "utils/rel.h"
#include "utils/snapshot.h"
@@ -630,6 +631,8 @@ typedef struct TableAmRoutine
Relation OldIndex,
bool use_sort,
TransactionId OldestXmin,
+ Snapshot snapshot,
+ LogicalDecodingContext *decoding_ctx,
TransactionId *xid_cutoff,
MultiXactId *multi_cutoff,
double *num_tuples,
@@ -1637,6 +1640,10 @@ table_relation_copy_data(Relation rel, const RelFileLocator *newrlocator)
* not needed for the relation's AM
* - *xid_cutoff - ditto
* - *multi_cutoff - ditto
+ * - snapshot - if != NULL, ignore data changes done by transactions that this
+ * (MVCC) snapshot considers still in-progress or in the future.
+ * - decoding_ctx - logical decoding context, to capture concurrent data
+ * changes.
*
* Output parameters:
* - *xid_cutoff - rel's new relfrozenxid value, may be invalid
@@ -1649,6 +1656,8 @@ table_relation_copy_for_cluster(Relation OldTable, Relation NewTable,
Relation OldIndex,
bool use_sort,
TransactionId OldestXmin,
+ Snapshot snapshot,
+ LogicalDecodingContext *decoding_ctx,
TransactionId *xid_cutoff,
MultiXactId *multi_cutoff,
double *num_tuples,
@@ -1657,6 +1666,7 @@ table_relation_copy_for_cluster(Relation OldTable, Relation NewTable,
{
OldTable->rd_tableam->relation_copy_for_cluster(OldTable, NewTable, OldIndex,
use_sort, OldestXmin,
+ snapshot, decoding_ctx,
xid_cutoff, multi_cutoff,
num_tuples, tups_vacuumed,
tups_recently_dead);
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 4daa8bef5ee..66431cc19e5 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -100,6 +100,9 @@ extern Oid index_concurrently_create_copy(Relation heapRelation,
Oid tablespaceOid,
const char *newName);
+extern NullableDatum *get_index_stattargets(Oid indexid,
+ IndexInfo *indInfo);
+
extern void index_concurrently_build(Oid heapRelationId,
Oid indexRelationId);
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index c2976905e4d..569cc2184b3 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -13,10 +13,15 @@
#ifndef CLUSTER_H
#define CLUSTER_H
+#include "nodes/execnodes.h"
#include "nodes/parsenodes.h"
#include "parser/parse_node.h"
+#include "replication/logical.h"
#include "storage/lock.h"
+#include "storage/relfilelocator.h"
#include "utils/relcache.h"
+#include "utils/resowner.h"
+#include "utils/tuplestore.h"
/* flag bits for ClusterParams->options */
@@ -24,6 +29,7 @@
#define CLUOPT_RECHECK 0x02 /* recheck relation state */
#define CLUOPT_RECHECK_ISCLUSTERED 0x04 /* recheck relation state for
* indisclustered */
+#define CLUOPT_CONCURRENT 0x08 /* allow concurrent data changes */
/* options for CLUSTER */
typedef struct ClusterParams
@@ -46,14 +52,90 @@ typedef enum ClusterCommand
CLUSTER_COMMAND_VACUUM
} ClusterCommand;
+/*
+ * The following definitions are used by REPACK CONCURRENTLY.
+ */
+
+extern RelFileLocator repacked_rel_locator;
+extern RelFileLocator repacked_rel_toast_locator;
+
+typedef enum
+{
+ CHANGE_INSERT,
+ CHANGE_UPDATE_OLD,
+ CHANGE_UPDATE_NEW,
+ CHANGE_DELETE,
+ CHANGE_TRUNCATE
+} ConcurrentChangeKind;
+
+typedef struct ConcurrentChange
+{
+ /* See the enum above. */
+ ConcurrentChangeKind kind;
+
+ /*
+ * The actual tuple.
+ *
+ * The tuple data follows the ConcurrentChange structure. Before use, make
+ * sure the tuple is correctly aligned (ConcurrentChange can be stored as
+ * bytea) and that tuple->t_data is fixed.
+ */
+ HeapTupleData tup_data;
+} ConcurrentChange;
+
+#define SizeOfConcurrentChange (offsetof(ConcurrentChange, tup_data) + \
+ sizeof(HeapTupleData))
+
+/*
+ * Logical decoding state.
+ *
+ * Here we store the data changes decoded from WAL while the table contents
+ * are being copied to the new storage, along with the metadata needed to
+ * apply those changes to the table.
+ */
+typedef struct RepackDecodingState
+{
+ /* The relation whose changes we're decoding. */
+ Oid relid;
+
+ /*
+ * Decoded changes are stored here. Although we try to avoid excessively
+ * large batches, it can happen that the changes need to be spilled to
+ * disk. The tuplestore does this transparently.
+ */
+ Tuplestorestate *tstore;
+
+ /* The current number of changes in tstore. */
+ double nchanges;
+
+ /*
+ * Descriptor to store the ConcurrentChange structure serialized (bytea).
+ * We can't store the tuple directly because tuplestore only supports
+ * minimal tuples and we may need to transfer the OID system column from
+ * the output plugin. We also need to transfer the change kind, so it's
+ * better to put everything in the structure than to use two tuplestores
+ * "in parallel".
+ */
+ TupleDesc tupdesc_change;
+
+ /* Tuple descriptor needed to update indexes. */
+ TupleDesc tupdesc;
+
+ /* Slot to retrieve data from tstore. */
+ TupleTableSlot *tsslot;
+
+ ResourceOwner resowner;
+} RepackDecodingState;
+
extern void cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel);
extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
- ClusterCommand cmd);
+ ClusterCommand cmd, bool isTopLevel);
extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid,
LOCKMODE lockmode,
ClusterCommand cmd);
extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
-
+extern void repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
+ XLogRecPtr end_of_wal);
extern Oid make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
char relpersistence, LOCKMODE lockmode);
extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
@@ -61,6 +143,7 @@ extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
bool swap_toast_by_content,
bool check_constraints,
bool is_internal,
+ bool reindex,
TransactionId frozenXid,
MultiXactId cutoffMulti,
char newrelpersistence);
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 7644267e14f..6b1b1a4c1a7 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -67,10 +67,12 @@
#define PROGRESS_REPACK_PHASE 1
#define PROGRESS_REPACK_INDEX_RELID 2
#define PROGRESS_REPACK_HEAP_TUPLES_SCANNED 3
-#define PROGRESS_REPACK_HEAP_TUPLES_WRITTEN 4
-#define PROGRESS_REPACK_TOTAL_HEAP_BLKS 5
-#define PROGRESS_REPACK_HEAP_BLKS_SCANNED 6
-#define PROGRESS_REPACK_INDEX_REBUILD_COUNT 7
+#define PROGRESS_REPACK_HEAP_TUPLES_INSERTED 4
+#define PROGRESS_REPACK_HEAP_TUPLES_UPDATED 5
+#define PROGRESS_REPACK_HEAP_TUPLES_DELETED 6
+#define PROGRESS_REPACK_TOTAL_HEAP_BLKS 7
+#define PROGRESS_REPACK_HEAP_BLKS_SCANNED 8
+#define PROGRESS_REPACK_INDEX_REBUILD_COUNT 9
/*
* Phases of repack (as advertised via PROGRESS_REPACK_PHASE).
@@ -83,9 +85,10 @@
#define PROGRESS_REPACK_PHASE_INDEX_SCAN_HEAP 2
#define PROGRESS_REPACK_PHASE_SORT_TUPLES 3
#define PROGRESS_REPACK_PHASE_WRITE_NEW_HEAP 4
-#define PROGRESS_REPACK_PHASE_SWAP_REL_FILES 5
-#define PROGRESS_REPACK_PHASE_REBUILD_INDEX 6
-#define PROGRESS_REPACK_PHASE_FINAL_CLEANUP 7
+#define PROGRESS_REPACK_PHASE_CATCH_UP 5
+#define PROGRESS_REPACK_PHASE_SWAP_REL_FILES 6
+#define PROGRESS_REPACK_PHASE_REBUILD_INDEX 7
+#define PROGRESS_REPACK_PHASE_FINAL_CLEANUP 8
/* Commands of PROGRESS_REPACK */
#define PROGRESS_REPACK_COMMAND_REPACK 1
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 4ef76c852f5..de091ceb04a 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3931,6 +3931,7 @@ typedef struct RepackStmt
RangeVar *relation; /* relation being repacked */
char *indexname; /* order tuples by this index */
List *params; /* list of DefElem nodes */
+ bool concurrent; /* allow concurrent access? */
} RepackStmt;
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 6d4d2d1814c..802fc4b0823 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -73,6 +73,7 @@ extern void FreeSnapshotBuilder(SnapBuild *builder);
extern void SnapBuildSnapDecRefcount(Snapshot snap);
extern Snapshot SnapBuildInitialSnapshot(SnapBuild *builder);
+extern Snapshot SnapBuildInitialSnapshotForRepack(SnapBuild *builder);
extern Snapshot SnapBuildMVCCFromHistoric(Snapshot snapshot, bool in_place);
extern const char *SnapBuildExportSnapshot(SnapBuild *builder);
extern void SnapBuildClearExportedSnapshot(void);
diff --git a/src/include/storage/lockdefs.h b/src/include/storage/lockdefs.h
index 7f3ba0352f6..2739327b0da 100644
--- a/src/include/storage/lockdefs.h
+++ b/src/include/storage/lockdefs.h
@@ -36,8 +36,8 @@ typedef int LOCKMODE;
#define AccessShareLock 1 /* SELECT */
#define RowShareLock 2 /* SELECT FOR UPDATE/FOR SHARE */
#define RowExclusiveLock 3 /* INSERT, UPDATE, DELETE */
-#define ShareUpdateExclusiveLock 4 /* VACUUM (non-FULL), ANALYZE, CREATE
- * INDEX CONCURRENTLY */
+#define ShareUpdateExclusiveLock 4 /* VACUUM (non-exclusive), ANALYZE, CREATE
+ * INDEX CONCURRENTLY, REPACK CONCURRENTLY */
#define ShareLock 5 /* CREATE INDEX (WITHOUT CONCURRENTLY) */
#define ShareRowExclusiveLock 6 /* like EXCLUSIVE MODE, but allows ROW
* SHARE */
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 932024b1b0b..fe9d85e5f95 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -84,3 +84,4 @@ PG_LWLOCK(50, DSMRegistry)
PG_LWLOCK(51, InjectionPoint)
PG_LWLOCK(52, SerialControl)
PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, RepackedRels)
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index 147b190210a..5eeabdc6c4f 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -61,6 +61,8 @@ extern Snapshot GetLatestSnapshot(void);
extern void SnapshotSetCommandId(CommandId curcid);
extern Snapshot CopySnapshot(Snapshot snapshot);
+extern void FreeSnapshot(Snapshot snapshot);
+
extern Snapshot GetCatalogSnapshot(Oid relid);
extern Snapshot GetNonHistoricCatalogSnapshot(Oid relid);
extern void InvalidateCatalogSnapshot(void);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 84ca2dc3778..086c61f4ef4 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1969,17 +1969,17 @@ pg_stat_progress_cluster| SELECT s.pid,
WHEN 2 THEN 'index scanning heap'::text
WHEN 3 THEN 'sorting tuples'::text
WHEN 4 THEN 'writing new heap'::text
- WHEN 5 THEN 'swapping relation files'::text
- WHEN 6 THEN 'rebuilding index'::text
- WHEN 7 THEN 'performing final cleanup'::text
+ WHEN 6 THEN 'swapping relation files'::text
+ WHEN 7 THEN 'rebuilding index'::text
+ WHEN 8 THEN 'performing final cleanup'::text
ELSE NULL::text
END AS phase,
(s.param3)::oid AS cluster_index_relid,
s.param4 AS heap_tuples_scanned,
s.param5 AS heap_tuples_written,
- s.param6 AS heap_blks_total,
- s.param7 AS heap_blks_scanned,
- s.param8 AS index_rebuild_count
+ s.param8 AS heap_blks_total,
+ s.param9 AS heap_blks_scanned,
+ s.param10 AS index_rebuild_count
FROM (pg_stat_get_progress_info('CLUSTER'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_progress_copy| SELECT s.pid,
@@ -2055,17 +2055,20 @@ pg_stat_progress_repack| SELECT s.pid,
WHEN 2 THEN 'index scanning heap'::text
WHEN 3 THEN 'sorting tuples'::text
WHEN 4 THEN 'writing new heap'::text
- WHEN 5 THEN 'swapping relation files'::text
- WHEN 6 THEN 'rebuilding index'::text
- WHEN 7 THEN 'performing final cleanup'::text
+ WHEN 5 THEN 'catch-up'::text
+ WHEN 6 THEN 'swapping relation files'::text
+ WHEN 7 THEN 'rebuilding index'::text
+ WHEN 8 THEN 'performing final cleanup'::text
ELSE NULL::text
END AS phase,
(s.param3)::oid AS repack_index_relid,
s.param4 AS heap_tuples_scanned,
- s.param5 AS heap_tuples_written,
- s.param6 AS heap_blks_total,
- s.param7 AS heap_blks_scanned,
- s.param8 AS index_rebuild_count
+ s.param5 AS heap_tuples_inserted,
+ s.param6 AS heap_tuples_updated,
+ s.param7 AS heap_tuples_deleted,
+ s.param8 AS heap_blks_total,
+ s.param9 AS heap_blks_scanned,
+ s.param10 AS index_rebuild_count
FROM (pg_stat_get_progress_info('REPACK'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_progress_vacuum| SELECT s.pid,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index c7ea8fb93ca..e89db0a2ee7 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -477,6 +477,8 @@ CompressFileHandle
CompressionLocation
CompressorState
ComputeXidHorizonsResult
+ConcurrentChange
+ConcurrentChangeKind
ConditionVariable
ConditionVariableMinimallyPadded
ConditionalStack
@@ -1239,6 +1241,7 @@ IndexElem
IndexFetchHeapData
IndexFetchTableData
IndexInfo
+IndexInsertState
IndexList
IndexOnlyScan
IndexOnlyScanState
@@ -2507,6 +2510,7 @@ ReorderBufferTupleCidKey
ReorderBufferUpdateProgressTxnCB
ReorderTuple
RepOriginId
+RepackDecodingState
RepackStmt
ReparameterizeForeignPathByChild_function
ReplaceVarsFromTargetList_context
--
2.43.5
v11-0005-Preserve-visibility-information-of-the-concurrent-da.patch (text/x-diff)
From a624052d0db98cc04bc07f55200c51facb6f3191 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Mon, 31 Mar 2025 15:47:08 +0200
Subject: [PATCH 5/9] Preserve visibility information of the concurrent data
changes.
As explained in the commit message of the preceding patch of the series, the
data changes done by applications while REPACK CONCURRENTLY is copying the
table contents to a new file are decoded from WAL and eventually also applied
to the new file. To reduce the complexity a little bit, the preceding patch
uses the current transaction (i.e. the transaction opened by the REPACK command)
to execute those INSERT, UPDATE and DELETE commands.
However, REPACK is not expected to change visibility of tuples. Therefore,
this patch fixes the handling of the "concurrent data changes". It ensures
that tuples written into the new table have the same XID and command ID (CID)
as they had in the old table.
To "replay" an UPDATE or DELETE command on the new table, we need the
appropriate snapshot to find the previous tuple version in the new table. The
(historic) snapshot we used to decode the UPDATE / DELETE should (by
definition) see the state of the catalog prior to that UPDATE / DELETE. Thus
we can use the same snapshot to find the "old tuple" for UPDATE / DELETE in
the new table if:
1) REPACK CONCURRENTLY preserves visibility information of all tuples - that's
the purpose of this part of the patch series.
2) The table being REPACKed is treated as a system catalog by all transactions
that modify its data. This ensures that reorderbuffer.c generates a new
snapshot for each data change in the table.
We ensure 2) by maintaining a shared hashtable of tables being REPACKed
CONCURRENTLY and by adjusting the RelationIsAccessibleInLogicalDecoding()
macro so it checks this hashtable. (The corresponding flag is also added to
the relation cache, so that the shared hashtable does not have to be accessed
too often.) It's essential that after adding an entry to the hashtable we wait
for completion of all the transactions that might have started to modify our
table before our entry was added. We achieve that by upgrading our lock on
the table to ShareLock temporarily: as soon as we acquire it, no DML command
should be running on the table. (This lock upgrade shouldn't cause any
deadlock because we care to not hold a lock on other objects at the same
time.)
As long as we preserve the tuple visibility information (which includes XID),
it's important to avoid logical decoding of the WAL generated by DMLs on the
new table: the logical decoding subsystem probably does not expect that the
incoming WAL records contain XIDs of already decoded transactions. (And of
course, repeated decoding would be wasted effort.)
---
src/backend/access/common/toast_internals.c | 3 +-
src/backend/access/heap/heapam.c | 82 ++--
src/backend/access/heap/heapam_handler.c | 14 +-
src/backend/access/transam/xact.c | 52 +++
src/backend/commands/cluster.c | 406 +++++++++++++++++-
src/backend/replication/logical/decode.c | 77 +++-
src/backend/replication/logical/snapbuild.c | 22 +-
.../pgoutput_repack/pgoutput_repack.c | 68 ++-
src/backend/storage/ipc/ipci.c | 2 +
src/backend/utils/cache/inval.c | 21 +
src/backend/utils/cache/relcache.c | 4 +
src/include/access/heapam.h | 15 +-
src/include/access/heapam_xlog.h | 2 +
src/include/access/xact.h | 2 +
src/include/commands/cluster.h | 22 +
src/include/utils/inval.h | 2 +
src/include/utils/rel.h | 7 +-
src/include/utils/snapshot.h | 3 +
src/tools/pgindent/typedefs.list | 1 +
19 files changed, 722 insertions(+), 83 deletions(-)
diff --git a/src/backend/access/common/toast_internals.c b/src/backend/access/common/toast_internals.c
index 7d8be8346ce..75d889ec72c 100644
--- a/src/backend/access/common/toast_internals.c
+++ b/src/backend/access/common/toast_internals.c
@@ -320,7 +320,8 @@ toast_save_datum(Relation rel, Datum value,
memcpy(VARDATA(&chunk_data), data_p, chunk_size);
toasttup = heap_form_tuple(toasttupDesc, t_values, t_isnull);
- heap_insert(toastrel, toasttup, mycid, options, NULL);
+ heap_insert(toastrel, toasttup, GetCurrentTransactionId(), mycid,
+ options, NULL);
/*
* Create the index entry. We cheat a little here by not using
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 6e433db039e..c5baff18bc2 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -60,7 +60,8 @@ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
HeapTuple newtup, HeapTuple old_key_tuple,
- bool all_visible_cleared, bool new_all_visible_cleared);
+ bool all_visible_cleared, bool new_all_visible_cleared,
+ bool wal_logical);
#ifdef USE_ASSERT_CHECKING
static void check_lock_if_inplace_updateable_rel(Relation relation,
ItemPointer otid,
@@ -2084,7 +2085,7 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
/*
* heap_insert - insert tuple into a heap
*
- * The new tuple is stamped with current transaction ID and the specified
+ * The new tuple is stamped with specified transaction ID and the specified
* command ID.
*
* See table_tuple_insert for comments about most of the input flags, except
@@ -2100,15 +2101,16 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
* reflected into *tup.
*/
void
-heap_insert(Relation relation, HeapTuple tup, CommandId cid,
- int options, BulkInsertState bistate)
+heap_insert(Relation relation, HeapTuple tup, TransactionId xid,
+ CommandId cid, int options, BulkInsertState bistate)
{
- TransactionId xid = GetCurrentTransactionId();
HeapTuple heaptup;
Buffer buffer;
Buffer vmbuffer = InvalidBuffer;
bool all_visible_cleared = false;
+ Assert(TransactionIdIsValid(xid));
+
/* Cheap, simplistic check that the tuple matches the rel's rowtype. */
Assert(HeapTupleHeaderGetNatts(tup->t_data) <=
RelationGetNumberOfAttributes(relation));
@@ -2188,8 +2190,15 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
/*
* If this is a catalog, we need to transmit combo CIDs to properly
* decode, so log that as well.
+ *
+ * HEAP_INSERT_NO_LOGICAL should be set when applying data changes
+ * done by other transactions during REPACK CONCURRENTLY. In such a
+ * case, the insertion should not be decoded at all - see
+ * heap_decode(). (It's also set by raw_heap_insert() for TOAST, but
+ * TOAST does not pass this test anyway.)
*/
- if (RelationIsAccessibleInLogicalDecoding(relation))
+ if ((options & HEAP_INSERT_NO_LOGICAL) == 0 &&
+ RelationIsAccessibleInLogicalDecoding(relation))
log_heap_new_cid(relation, heaptup);
/*
@@ -2733,7 +2742,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
void
simple_heap_insert(Relation relation, HeapTuple tup)
{
- heap_insert(relation, tup, GetCurrentCommandId(true), 0, NULL);
+ heap_insert(relation, tup, GetCurrentTransactionId(),
+ GetCurrentCommandId(true), 0, NULL);
}
/*
@@ -2790,11 +2800,11 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
*/
TM_Result
heap_delete(Relation relation, ItemPointer tid,
- CommandId cid, Snapshot crosscheck, bool wait,
- TM_FailureData *tmfd, bool changingPart)
+ TransactionId xid, CommandId cid, Snapshot crosscheck, bool wait,
+ TM_FailureData *tmfd, bool changingPart,
+ bool wal_logical)
{
TM_Result result;
- TransactionId xid = GetCurrentTransactionId();
ItemId lp;
HeapTupleData tp;
Page page;
@@ -2811,6 +2821,7 @@ heap_delete(Relation relation, ItemPointer tid,
bool old_key_copied = false;
Assert(ItemPointerIsValid(tid));
+ Assert(TransactionIdIsValid(xid));
/*
* Forbid this during a parallel operation, lest it allocate a combo CID.
@@ -3036,7 +3047,8 @@ l1:
* Compute replica identity tuple before entering the critical section so
* we don't PANIC upon a memory allocation failure.
*/
- old_key_tuple = ExtractReplicaIdentity(relation, &tp, true, &old_key_copied);
+ old_key_tuple = wal_logical ?
+ ExtractReplicaIdentity(relation, &tp, true, &old_key_copied) : NULL;
/*
* If this is the first possibly-multixact-able operation in the current
@@ -3104,8 +3116,12 @@ l1:
/*
* For logical decode we need combo CIDs to properly decode the
* catalog
+ *
+ * Like in heap_insert(), visibility is unchanged when called from
+ * VACUUM FULL / CLUSTER.
*/
- if (RelationIsAccessibleInLogicalDecoding(relation))
+ if (wal_logical &&
+ RelationIsAccessibleInLogicalDecoding(relation))
log_heap_new_cid(relation, &tp);
xlrec.flags = 0;
@@ -3126,6 +3142,15 @@ l1:
xlrec.flags |= XLH_DELETE_CONTAINS_OLD_KEY;
}
+ /*
+ * Unlike UPDATE, DELETE is decoded even if there is no old key, so it
+ * does not help to clear both XLH_DELETE_CONTAINS_OLD_TUPLE and
+ * XLH_DELETE_CONTAINS_OLD_KEY. Thus we need an extra flag. TODO
+ * Consider not decoding tuples w/o the old tuple/key instead.
+ */
+ if (!wal_logical)
+ xlrec.flags |= XLH_DELETE_NO_LOGICAL;
+
XLogBeginInsert();
XLogRegisterData(&xlrec, SizeOfHeapDelete);
@@ -3215,10 +3240,11 @@ simple_heap_delete(Relation relation, ItemPointer tid)
TM_Result result;
TM_FailureData tmfd;
- result = heap_delete(relation, tid,
+ result = heap_delete(relation, tid, GetCurrentTransactionId(),
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &tmfd, false /* changingPart */ );
+						 &tmfd, false /* changingPart */ ,
+ true /* wal_logical */ );
switch (result)
{
case TM_SelfModified:
@@ -3257,12 +3283,11 @@ simple_heap_delete(Relation relation, ItemPointer tid)
*/
TM_Result
heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
- CommandId cid, Snapshot crosscheck, bool wait,
- TM_FailureData *tmfd, LockTupleMode *lockmode,
- TU_UpdateIndexes *update_indexes)
+ TransactionId xid, CommandId cid, Snapshot crosscheck,
+ bool wait, TM_FailureData *tmfd, LockTupleMode *lockmode,
+ TU_UpdateIndexes *update_indexes, bool wal_logical)
{
TM_Result result;
- TransactionId xid = GetCurrentTransactionId();
Bitmapset *hot_attrs;
Bitmapset *sum_attrs;
Bitmapset *key_attrs;
@@ -3302,6 +3327,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
infomask2_new_tuple;
Assert(ItemPointerIsValid(otid));
+ Assert(TransactionIdIsValid(xid));
/* Cheap, simplistic check that the tuple matches the rel's rowtype. */
Assert(HeapTupleHeaderGetNatts(newtup->t_data) <=
@@ -4139,8 +4165,12 @@ l2:
/*
* For logical decoding we need combo CIDs to properly decode the
* catalog.
+ *
+ * Like in heap_insert(), visibility is unchanged when called from
+ * VACUUM FULL / CLUSTER.
*/
- if (RelationIsAccessibleInLogicalDecoding(relation))
+ if (wal_logical &&
+ RelationIsAccessibleInLogicalDecoding(relation))
{
log_heap_new_cid(relation, &oldtup);
log_heap_new_cid(relation, heaptup);
@@ -4150,7 +4180,8 @@ l2:
newbuf, &oldtup, heaptup,
old_key_tuple,
all_visible_cleared,
- all_visible_cleared_new);
+ all_visible_cleared_new,
+ wal_logical);
if (newbuf != buffer)
{
PageSetLSN(BufferGetPage(newbuf), recptr);
@@ -4505,10 +4536,10 @@ simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup,
TM_FailureData tmfd;
LockTupleMode lockmode;
- result = heap_update(relation, otid, tup,
+ result = heap_update(relation, otid, tup, GetCurrentTransactionId(),
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &tmfd, &lockmode, update_indexes);
+ &tmfd, &lockmode, update_indexes, true);
switch (result)
{
case TM_SelfModified:
@@ -8841,7 +8872,8 @@ static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup, HeapTuple newtup,
HeapTuple old_key_tuple,
- bool all_visible_cleared, bool new_all_visible_cleared)
+ bool all_visible_cleared, bool new_all_visible_cleared,
+ bool wal_logical)
{
xl_heap_update xlrec;
xl_heap_header xlhdr;
@@ -8852,10 +8884,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
suffixlen = 0;
XLogRecPtr recptr;
Page page = BufferGetPage(newbuf);
- bool need_tuple_data = RelationIsLogicallyLogged(reln);
+ bool need_tuple_data;
bool init;
int bufflags;
+ need_tuple_data = RelationIsLogicallyLogged(reln) && wal_logical;
+
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 371afa6ad59..ea1d6f299b3 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -256,7 +256,8 @@ heapam_tuple_insert(Relation relation, TupleTableSlot *slot, CommandId cid,
tuple->t_tableOid = slot->tts_tableOid;
/* Perform the insertion, and copy the resulting ItemPointer */
- heap_insert(relation, tuple, cid, options, bistate);
+ heap_insert(relation, tuple, GetCurrentTransactionId(), cid, options,
+ bistate);
ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
if (shouldFree)
@@ -279,7 +280,8 @@ heapam_tuple_insert_speculative(Relation relation, TupleTableSlot *slot,
options |= HEAP_INSERT_SPECULATIVE;
/* Perform the insertion, and copy the resulting ItemPointer */
- heap_insert(relation, tuple, cid, options, bistate);
+ heap_insert(relation, tuple, GetCurrentTransactionId(), cid, options,
+ bistate);
ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
if (shouldFree)
@@ -313,7 +315,8 @@ heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
* the storage itself is cleaning the dead tuples by itself, it is the
* time to call the index tuple deletion also.
*/
- return heap_delete(relation, tid, cid, crosscheck, wait, tmfd, changingPart);
+ return heap_delete(relation, tid, GetCurrentTransactionId(), cid,
+ crosscheck, wait, tmfd, changingPart, true);
}
@@ -331,8 +334,9 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
slot->tts_tableOid = RelationGetRelid(relation);
tuple->t_tableOid = slot->tts_tableOid;
- result = heap_update(relation, otid, tuple, cid, crosscheck, wait,
- tmfd, lockmode, update_indexes);
+ result = heap_update(relation, otid, tuple, GetCurrentTransactionId(),
+ cid, crosscheck, wait,
+ tmfd, lockmode, update_indexes, true);
ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
/*
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 23f2de587a1..3db4cac030e 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -126,6 +126,18 @@ static FullTransactionId XactTopFullTransactionId = {InvalidTransactionId};
static int nParallelCurrentXids = 0;
static TransactionId *ParallelCurrentXids;
+/*
+ * Another case that requires TransactionIdIsCurrentTransactionId() to behave
+ * specially is when REPACK CONCURRENTLY is processing data changes made in
+ * the old storage of a table by other transactions. When applying the changes
+ * to the new storage, the backend executing the REPACK command needs to act
+ * on behalf of those other transactions. The transactions responsible for the
+ * changes in the old storage are stored in this array, sorted by
+ * xidComparator.
+ */
+static int nRepackCurrentXids = 0;
+static TransactionId *RepackCurrentXids = NULL;
+
/*
* Miscellaneous flag bits to record events which occur on the top level
* transaction. These flags are only persisted in MyXactFlags and are intended
@@ -973,6 +985,8 @@ TransactionIdIsCurrentTransactionId(TransactionId xid)
int low,
high;
+ Assert(nRepackCurrentXids == 0);
+
low = 0;
high = nParallelCurrentXids - 1;
while (low <= high)
@@ -992,6 +1006,21 @@ TransactionIdIsCurrentTransactionId(TransactionId xid)
return false;
}
+ /*
+ * When executing REPACK CONCURRENTLY, the array of current transactions
+ * is given.
+ */
+ if (nRepackCurrentXids > 0)
+ {
+ Assert(nParallelCurrentXids == 0);
+
+ return bsearch(&xid,
+ RepackCurrentXids,
+ nRepackCurrentXids,
+ sizeof(TransactionId),
+ xidComparator) != NULL;
+ }
+
/*
* We will return true for the Xid of the current subtransaction, any of
* its subcommitted children, any of its parents, or any of their
@@ -5649,6 +5678,29 @@ EndParallelWorkerTransaction(void)
CurrentTransactionState->blockState = TBLOCK_DEFAULT;
}
+/*
+ * SetRepackCurrentXids
+ * Set the XID array that TransactionIdIsCurrentTransactionId() should
+ * use.
+ */
+void
+SetRepackCurrentXids(TransactionId *xip, int xcnt)
+{
+ RepackCurrentXids = xip;
+ nRepackCurrentXids = xcnt;
+}
+
+/*
+ * ResetRepackCurrentXids
+ * Undo the effect of SetRepackCurrentXids().
+ */
+void
+ResetRepackCurrentXids(void)
+{
+ RepackCurrentXids = NULL;
+ nRepackCurrentXids = 0;
+}
+
/*
* ShowTransactionState
* Debug support
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index b1aa1e8d820..78380c882c0 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -90,6 +90,11 @@ typedef struct
* The following definitions are used for concurrent processing.
*/
+/*
+ * OID of the table being repacked by this backend.
+ */
+static Oid repacked_rel = InvalidOid;
+
/*
* The locators are used to avoid logical decoding of data that we do not need
* for our table.
@@ -133,8 +138,10 @@ static List *get_tables_to_cluster_partitioned(MemoryContext cluster_context,
ClusterCommand cmd);
static bool cluster_is_permitted_for_relation(Oid relid, Oid userid,
ClusterCommand cmd);
-static void begin_concurrent_repack(Relation rel);
-static void end_concurrent_repack(void);
+static void begin_concurrent_repack(Relation rel, Relation *index_p,
+ bool *entered_p);
+static void end_concurrent_repack(bool error);
+static void cluster_before_shmem_exit_callback(int code, Datum arg);
static LogicalDecodingContext *setup_logical_decoding(Oid relid,
const char *slotname,
TupleDesc tupdesc);
@@ -154,6 +161,7 @@ static void apply_concurrent_delete(Relation rel, HeapTuple tup_target,
ConcurrentChange *change);
static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
HeapTuple tup_key,
+ Snapshot snapshot,
IndexInsertState *iistate,
TupleTableSlot *ident_slot,
IndexScanDesc *scan_p);
@@ -379,6 +387,8 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
const char *cmd_str = CLUSTER_COMMAND_STR(cmd);
bool concurrent = ((params->options & CLUOPT_CONCURRENT) != 0);
LOCKMODE lmode;
+ bool entered,
+ success;
/*
* Check that the correct lock is held. The lock mode is
@@ -535,23 +545,30 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
TransferPredicateLocksToHeapRelation(OldHeap);
/* rebuild_relation does all the dirty work */
+ entered = false;
+ success = false;
PG_TRY();
{
/*
- * For concurrent processing, make sure that our logical decoding
- * ignores data changes of other tables than the one we are
- * processing.
+ * For concurrent processing, make sure that
+ *
+ * 1) our logical decoding ignores data changes of other tables than
+ * the one we are processing.
+ *
+ * 2) other transactions treat this table as if it was a system / user
+ * catalog, and WAL the relevant additional information.
*/
if (concurrent)
- begin_concurrent_repack(OldHeap);
+ begin_concurrent_repack(OldHeap, &index, &entered);
rebuild_relation(OldHeap, index, verbose, cmd, concurrent,
save_userid);
+ success = true;
}
PG_FINALLY();
{
- if (concurrent)
- end_concurrent_repack();
+ if (concurrent && entered)
+ end_concurrent_repack(!success);
}
PG_END_TRY();
@@ -2161,6 +2178,47 @@ cluster_is_permitted_for_relation(Oid relid, Oid userid, ClusterCommand cmd)
#define REPL_PLUGIN_NAME "pgoutput_repack"
+/*
+ * Each relation being processed by REPACK CONCURRENTLY must be in the
+ * repackedRels hashtable.
+ */
+typedef struct RepackedRel
+{
+ Oid relid;
+ Oid dbid;
+} RepackedRel;
+
+static HTAB *RepackedRelsHash = NULL;
+
+/*
+ * Maximum number of entries in the hashtable.
+ *
+ * A replication slot is needed for the processing, so use this GUC to
+ * allocate memory for the hashtable.
+ */
+#define MAX_REPACKED_RELS (max_replication_slots)
+
+Size
+RepackShmemSize(void)
+{
+ return hash_estimate_size(MAX_REPACKED_RELS, sizeof(RepackedRel));
+}
+
+void
+RepackShmemInit(void)
+{
+ HASHCTL info;
+
+ info.keysize = sizeof(RepackedRel);
+ info.entrysize = info.keysize;
+
+ RepackedRelsHash = ShmemInitHash("Repacked Relations",
+ MAX_REPACKED_RELS,
+ MAX_REPACKED_RELS,
+ &info,
+ HASH_ELEM | HASH_BLOBS);
+}
+
/*
* Call this function before REPACK CONCURRENTLY starts to setup logical
* decoding. It makes sure that other users of the table put enough
@@ -2175,11 +2233,120 @@ cluster_is_permitted_for_relation(Oid relid, Oid userid, ClusterCommand cmd)
*
* Note that TOAST table needs no attention here as it's not scanned using
* historic snapshot.
+ *
+ * 'index_p' is in/out argument because the function unlocks the index
+ * temporarily.
+ *
+ * 'enter_p' receives a bool value telling whether relation OID was entered
+ * into RepackedRelsHash or not.
*/
static void
-begin_concurrent_repack(Relation rel)
+begin_concurrent_repack(Relation rel, Relation *index_p, bool *entered_p)
{
- Oid toastrelid;
+ Oid relid,
+ toastrelid;
+ Relation index = NULL;
+ Oid indexid = InvalidOid;
+ RepackedRel key,
+ *entry;
+ bool found;
+ static bool before_shmem_exit_callback_setup = false;
+
+ relid = RelationGetRelid(rel);
+ index = index_p ? *index_p : NULL;
+
+ /*
+ * Make sure that we do not leave an entry in RepackedRelsHash if exiting
+ * due to FATAL.
+ */
+ if (!before_shmem_exit_callback_setup)
+ {
+ before_shmem_exit(cluster_before_shmem_exit_callback, 0);
+ before_shmem_exit_callback_setup = true;
+ }
+
+ memset(&key, 0, sizeof(key));
+ key.relid = relid;
+ key.dbid = MyDatabaseId;
+
+ *entered_p = false;
+ LWLockAcquire(RepackedRelsLock, LW_EXCLUSIVE);
+ entry = (RepackedRel *)
+ hash_search(RepackedRelsHash, &key, HASH_ENTER_NULL, &found);
+ if (found)
+ {
+ /*
+ * Since REPACK CONCURRENTLY takes ShareRowExclusiveLock, a conflict
+ * should occur much earlier. However that lock may be released
+ * temporarily, see below. Anyway, we should complain whatever the
+ * reason of the conflict might be.
+ */
+ ereport(ERROR,
+ (errmsg("relation \"%s\" is already being processed by REPACK CONCURRENTLY",
+ RelationGetRelationName(rel))));
+ }
+ if (entry == NULL)
+ ereport(ERROR,
+ (errmsg("too many requests for REPACK CONCURRENTLY at a time")),
+ (errhint("Please consider increasing the \"max_replication_slots\" configuration parameter.")));
+
+ /*
+ * Even if anything fails below, the caller has to do cleanup in the
+ * shared memory.
+ */
+ *entered_p = true;
+
+ /*
+ * Enable the callback to remove the entry in case of exit. We should not
+ * do this earlier, otherwise an attempt to insert an already existing entry
+ * could make us remove that entry (inserted by another backend) during
+ * ERROR handling.
+ */
+ Assert(!OidIsValid(repacked_rel));
+ repacked_rel = relid;
+
+ LWLockRelease(RepackedRelsLock);
+
+ /*
+ * Make sure that other backends are aware of the new hash entry as soon
+ * as they open our table.
+ */
+ CacheInvalidateRelcacheImmediate(relid);
+
+ /*
+ * Also make sure that the existing users of the table update their
+ * relcache entry as soon as they try to run DML commands on it.
+ *
+ * ShareLock is the weakest lock that conflicts with DMLs. If any backend
+ * has a lower lock, we assume it'll accept our invalidation message when
+ * it changes the lock mode.
+ *
+ * Before upgrading the lock on the relation, close the index temporarily
+ * to avoid a deadlock if another backend running DML already has its lock
+ * (ShareLock) on the table and waits for the lock on the index.
+ */
+ if (index)
+ {
+ index_close(index, ShareUpdateExclusiveLock);
+ indexid = RelationGetRelid(index);
+ }
+ LockRelationOid(relid, ShareLock);
+ UnlockRelationOid(relid, ShareLock);
+ if (OidIsValid(indexid))
+ {
+ /*
+ * Re-open the index and check that it hasn't changed while unlocked.
+ */
+ check_index_is_clusterable(rel, indexid, ShareUpdateExclusiveLock,
+ CLUSTER_COMMAND_REPACK);
+
+ /*
+ * Return the new relcache entry to the caller. (It's been locked by
+ * the call above.)
+ */
+ index = index_open(indexid, NoLock);
+ *index_p = index;
+ }
/* Avoid logical decoding of other relations by this backend. */
repacked_rel_locator = rel->rd_locator;
@@ -2197,15 +2364,122 @@ begin_concurrent_repack(Relation rel)
/*
* Call this when done with REPACK CONCURRENTLY.
+ *
+ * 'error' tells whether the function is being called in order to handle
+ * error.
*/
static void
-end_concurrent_repack(void)
+end_concurrent_repack(bool error)
{
+ RepackedRel key;
+ RepackedRel *entry = NULL;
+ Oid relid = repacked_rel;
+
+ /* Remove the relation from the hash if we managed to insert one. */
+ if (OidIsValid(repacked_rel))
+ {
+ memset(&key, 0, sizeof(key));
+ key.relid = repacked_rel;
+ key.dbid = MyDatabaseId;
+ LWLockAcquire(RepackedRelsLock, LW_EXCLUSIVE);
+ entry = hash_search(RepackedRelsHash, &key, HASH_REMOVE, NULL);
+ LWLockRelease(RepackedRelsLock);
+
+ /*
+ * Make others refresh their information whether they should still
+ * treat the table as catalog from the perspective of writing WAL.
+ *
+ * XXX Unlike entering the entry into the hashtable, we do not bother
+ * with locking and unlocking the table here:
+ *
+ * 1) On normal completion (and sometimes even on ERROR), the caller
+ * is already holding AccessExclusiveLock on the table, so there
+ * should be no relcache reference unaware of this change.
+ *
+ * 2) In the other cases, the worst scenario is that the other
+ * backends will write unnecessary information to WAL until they close
+ * the relation.
+ *
+ * Should we use ShareLock mode to fix 2) at least for the non-FATAL
+ * errors? (Our before_shmem_exit callback is in charge of FATAL, and
+ * that probably should not try to acquire any lock.)
+ */
+ CacheInvalidateRelcacheImmediate(repacked_rel);
+
+ /*
+ * By clearing this variable we also disable
+ * cluster_before_shmem_exit_callback().
+ */
+ repacked_rel = InvalidOid;
+ }
+
/*
* Restore normal function of (future) logical decoding for this backend.
*/
repacked_rel_locator.relNumber = InvalidOid;
repacked_rel_toast_locator.relNumber = InvalidOid;
+
+ /*
+ * On normal completion (!error), we should not really fail to remove the
+ * entry. But if it wasn't there for any reason, raise ERROR to make sure
+ * the transaction is aborted: if other transactions, while changing the
+ * contents of the relation, didn't know that REPACK CONCURRENTLY was in
+ * progress, they could have failed to WAL enough information, and thus we
+ * could have produced inconsistent table contents.
+ *
+ * On the other hand, if we are already handling an error, there's no
+ * reason to worry about inconsistent contents of the new storage because
+ * the transaction is going to be rolled back anyway. Furthermore, by
+ * raising ERROR here we'd shadow the original error.
+ */
+ if (!error)
+ {
+ char *relname;
+
+ if (OidIsValid(relid) && entry == NULL)
+ {
+ relname = get_rel_name(relid);
+ if (!relname)
+ ereport(ERROR,
+ (errmsg("cache lookup failed for relation %u",
+ relid)));
+
+ ereport(ERROR,
+ (errmsg("relation \"%s\" not found among repacked relations",
+ relname)));
+ }
+ }
+}
+
+/*
+ * A wrapper to call end_concurrent_repack() as a before_shmem_exit callback.
+ */
+static void
+cluster_before_shmem_exit_callback(int code, Datum arg)
+{
+ if (OidIsValid(repacked_rel))
+ end_concurrent_repack(true);
+}
+
+/*
+ * Check if relation is currently being processed by REPACK CONCURRENTLY.
+ */
+bool
+is_concurrent_repack_in_progress(Oid relid)
+{
+ RepackedRel key,
+ *entry;
+
+ memset(&key, 0, sizeof(key));
+ key.relid = relid;
+ key.dbid = MyDatabaseId;
+
+ LWLockAcquire(RepackedRelsLock, LW_SHARED);
+ entry = (RepackedRel *)
+ hash_search(RepackedRelsHash, &key, HASH_FIND, NULL);
+ LWLockRelease(RepackedRelsLock);
+
+ return entry != NULL;
}
/*
@@ -2267,6 +2541,9 @@ setup_logical_decoding(Oid relid, const char *slotname, TupleDesc tupdesc)
dstate->relid = relid;
dstate->tstore = tuplestore_begin_heap(false, false,
maintenance_work_mem);
+#ifdef USE_ASSERT_CHECKING
+ dstate->last_change_xid = InvalidTransactionId;
+#endif
dstate->tupdesc = tupdesc;
@@ -2414,6 +2691,7 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
char *change_raw,
*src;
ConcurrentChange change;
+ Snapshot snapshot;
bool isnull[1];
Datum values[1];
@@ -2482,8 +2760,30 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
/*
* Find the tuple to be updated or deleted.
+ *
+ * As the table being REPACKed concurrently is treated like a
+ * catalog, new CID is WAL-logged and decoded. And since we use
+ * the same XID that the original DMLs did, the snapshot used for
+ * the logical decoding (by now converted to a non-historic MVCC
+ * snapshot) should see the tuples inserted previously into the
+ * new heap and/or updated there.
+ */
+ snapshot = change.snapshot;
+
+ /*
+ * Set what should be considered current transaction (and
+ * subtransactions) during visibility check.
+ *
+ * Note that this snapshot was created from a historic snapshot
+ * using SnapBuildMVCCFromHistoric(), which does not touch
+ * 'subxip'. Thus, unlike in a regular MVCC snapshot, the array
+ * only contains the transactions whose data changes we are
+ * applying, and its subtransactions. That's exactly what we need
+ * to check whether a particular xact is a "current transaction".
*/
- tup_exist = find_target_tuple(rel, key, nkeys, tup_key,
+ SetRepackCurrentXids(snapshot->subxip, snapshot->subxcnt);
+
+ tup_exist = find_target_tuple(rel, key, nkeys, tup_key, snapshot,
iistate, ident_slot, &ind_scan);
if (tup_exist == NULL)
elog(ERROR, "Failed to find target tuple");
@@ -2494,6 +2794,8 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
else
apply_concurrent_delete(rel, tup_exist, &change);
+ ResetRepackCurrentXids();
+
if (tup_old != NULL)
{
pfree(tup_old);
@@ -2506,11 +2808,14 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
else
elog(ERROR, "Unrecognized kind of change: %d", change.kind);
- /* If there's any change, make it visible to the next iteration. */
- if (change.kind != CHANGE_UPDATE_OLD)
+ /* Free the snapshot if this is the last change that needed it. */
+ Assert(change.snapshot->active_count > 0);
+ change.snapshot->active_count--;
+ if (change.snapshot->active_count == 0)
{
- CommandCounterIncrement();
- UpdateActiveSnapshotCommandId();
+ if (change.snapshot == dstate->snapshot)
+ dstate->snapshot = NULL;
+ FreeSnapshot(change.snapshot);
}
/* TTSOpsMinimalTuple has .get_heap_tuple==NULL. */
@@ -2530,10 +2835,30 @@ static void
apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
IndexInsertState *iistate, TupleTableSlot *index_slot)
{
+ Snapshot snapshot = change->snapshot;
List *recheck;
+ /*
+ * For INSERT, the visibility information is not important, but we use the
+ * snapshot to get CID. Index functions might need the whole snapshot
+ * anyway.
+ */
+ SetRepackCurrentXids(snapshot->subxip, snapshot->subxcnt);
+
+ /*
+ * Write the tuple into the new heap.
+ *
+ * The snapshot is the one we used to decode the insert (though converted
+ * to "non-historic" MVCC snapshot), i.e. the snapshot's curcid is the
+ * tuple CID incremented by one (due to the "new CID" WAL record that got
+ * written along with the INSERT record). Thus if we want to use the
+ * original CID, we need to subtract 1 from curcid.
+ */
+ Assert(snapshot->curcid != InvalidCommandId &&
+ snapshot->curcid > FirstCommandId);
- simple_heap_insert(rel, tup);
+ heap_insert(rel, tup, change->xid, snapshot->curcid - 1,
+ HEAP_INSERT_NO_LOGICAL, NULL);
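The off-by-one relationship between the decoded snapshot's curcid and the CID the original DML used can be checked in isolation. `FirstCommandId` (0) and `InvalidCommandId` mirror the PostgreSQL definitions; `original_command_id()` is a hypothetical helper, not an existing function:

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t CommandId;
#define InvalidCommandId ((CommandId) 0xFFFFFFFF) /* mirrors c.h */
#define FirstCommandId   ((CommandId) 0)          /* mirrors c.h */

/* The "new CID" WAL record bumps the snapshot's curcid one past the
 * CID the DML ran with, so subtracting one recovers the original. */
static CommandId
original_command_id(CommandId curcid)
{
    assert(curcid != InvalidCommandId && curcid > FirstCommandId);
    return curcid - 1;
}
```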
/*
* Update indexes.
@@ -2541,6 +2866,7 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
* In case functions in the index need the active snapshot and caller
* hasn't set one.
*/
+ PushActiveSnapshot(snapshot);
ExecStoreHeapTuple(tup, index_slot, false);
recheck = ExecInsertIndexTuples(iistate->rri,
index_slot,
@@ -2551,6 +2877,8 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
NIL, /* arbiterIndexes */
false /* onlySummarizing */
);
+ PopActiveSnapshot();
+ ResetRepackCurrentXids();
/*
* If recheck is required, it must have been performed on the source
@@ -2568,18 +2896,36 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
TupleTableSlot *index_slot)
{
List *recheck;
+ LockTupleMode lockmode;
TU_UpdateIndexes update_indexes;
+ TM_Result res;
+ Snapshot snapshot = change->snapshot;
+ TM_FailureData tmfd;
/*
* Write the new tuple into the new heap. ('tup' gets the TID assigned
* here.)
+ *
+ * Regarding CID, see the comment in apply_concurrent_insert().
*/
- simple_heap_update(rel, &tup_target->t_self, tup, &update_indexes);
+ Assert(snapshot->curcid != InvalidCommandId &&
+ snapshot->curcid > FirstCommandId);
+
+ res = heap_update(rel, &tup_target->t_self, tup,
+ change->xid, snapshot->curcid - 1,
+ InvalidSnapshot,
+ false, /* no wait - only we are doing changes */
+ &tmfd, &lockmode, &update_indexes,
+ /* wal_logical */
+ false);
+ if (res != TM_Ok)
+ ereport(ERROR, (errmsg("failed to apply concurrent UPDATE")));
ExecStoreHeapTuple(tup, index_slot, false);
if (update_indexes != TU_None)
{
+ PushActiveSnapshot(snapshot);
recheck = ExecInsertIndexTuples(iistate->rri,
index_slot,
iistate->estate,
@@ -2589,6 +2935,7 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
NIL, /* arbiterIndexes */
/* onlySummarizing */
update_indexes == TU_Summarizing);
+ PopActiveSnapshot();
list_free(recheck);
}
@@ -2599,7 +2946,22 @@ static void
apply_concurrent_delete(Relation rel, HeapTuple tup_target,
ConcurrentChange *change)
{
- simple_heap_delete(rel, &tup_target->t_self);
+ TM_Result res;
+ TM_FailureData tmfd;
+ Snapshot snapshot = change->snapshot;
+
+ /* Regarding CID, see the comment in apply_concurrent_insert(). */
+ Assert(snapshot->curcid != InvalidCommandId &&
+ snapshot->curcid > FirstCommandId);
+
+ res = heap_delete(rel, &tup_target->t_self, change->xid,
+ snapshot->curcid - 1, InvalidSnapshot, false,
+ &tmfd, false,
+ /* wal_logical */
+ false);
+
+ if (res != TM_Ok)
+ ereport(ERROR, (errmsg("failed to apply concurrent DELETE")));
pgstat_progress_incr_param(PROGRESS_REPACK_HEAP_TUPLES_DELETED, 1);
}
@@ -2617,7 +2979,7 @@ apply_concurrent_delete(Relation rel, HeapTuple tup_target,
*/
static HeapTuple
find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
- IndexInsertState *iistate,
+ Snapshot snapshot, IndexInsertState *iistate,
TupleTableSlot *ident_slot, IndexScanDesc *scan_p)
{
IndexScanDesc scan;
@@ -2626,7 +2988,7 @@ find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
HeapTuple result = NULL;
/* XXX no instrumentation for now */
- scan = index_beginscan(rel, iistate->ident_index, GetActiveSnapshot(),
+ scan = index_beginscan(rel, iistate->ident_index, snapshot,
NULL, nkeys, 0);
*scan_p = scan;
index_rescan(scan, key, nkeys, NULL, 0);
@@ -2698,6 +3060,8 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
}
PG_FINALLY();
{
+ ResetRepackCurrentXids();
+
if (rel_src)
rel_dst->rd_toastoid = InvalidOid;
}
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 00f7bbc5f59..25bb92b33f2 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -469,9 +469,18 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
SnapBuild *builder = ctx->snapshot_builder;
/*
- * Check if REPACK CONCURRENTLY is being performed by this backend. If so,
- * only decode data changes of the table that it is processing, and the
- * changes of its TOAST relation.
+ * If the change is not intended for logical decoding, do not even
+ * establish a transaction for it. This is particularly important if the
+ * record was generated by REPACK CONCURRENTLY because this command uses
+ * the original XID when doing changes in the new storage. The decoding
+ * system probably does not expect to see the same transaction multiple
+ * times.
+ */
+
+ /*
+ * First, check if REPACK CONCURRENTLY is being performed by this backend.
+ * If so, only decode data changes of the table that it is processing, and
+ * the changes of its TOAST relation.
*
* (TOAST locator should not be set unless the main is.)
*/
@@ -491,6 +500,61 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
return;
}
+ /*
+ * Second, skip records which do not contain sufficient information for
+ * the decoding.
+ *
+ * One particular problem we solve here is that REPACK CONCURRENTLY
+ * generates WAL when doing changes in the new table. Those changes should
+ * not be decoded because reorderbuffer.c considers their XID already
+ * committed. (REPACK CONCURRENTLY deliberately generates WAL records in
+ * such a way that they are skipped here.)
+ */
+ switch (info)
+ {
+ case XLOG_HEAP_INSERT:
+ {
+ xl_heap_insert *rec;
+
+ rec = (xl_heap_insert *) XLogRecGetData(buf->record);
+
+ /*
+ * This does happen when 1) raw_heap_insert marks the TOAST
+ * record as HEAP_INSERT_NO_LOGICAL, or 2) REPACK CONCURRENTLY
+ * replays inserts performed by other backends.
+ */
+ if ((rec->flags & XLH_INSERT_CONTAINS_NEW_TUPLE) == 0)
+ return;
+
+ break;
+ }
+
+ case XLOG_HEAP_HOT_UPDATE:
+ case XLOG_HEAP_UPDATE:
+ {
+ xl_heap_update *rec;
+
+ rec = (xl_heap_update *) XLogRecGetData(buf->record);
+ if ((rec->flags &
+ (XLH_UPDATE_CONTAINS_NEW_TUPLE |
+ XLH_UPDATE_CONTAINS_OLD_TUPLE |
+ XLH_UPDATE_CONTAINS_OLD_KEY)) == 0)
+ return;
+
+ break;
+ }
+
+ case XLOG_HEAP_DELETE:
+ {
+ xl_heap_delete *rec;
+
+ rec = (xl_heap_delete *) XLogRecGetData(buf->record);
+ if (rec->flags & XLH_DELETE_NO_LOGICAL)
+ return;
+ break;
+ }
+ }
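The filtering done by the switch above amounts to a per-record-type check of the flag bits, which can be sketched standalone. The bit positions below are simplified stand-ins for the real heapam_xlog.h values (only `XLH_DELETE_NO_LOGICAL` matches the `1<<5` defined by this patch); `should_decode` is an illustrative name:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for the xl_heap_* flag bits. */
#define XLH_INSERT_CONTAINS_NEW_TUPLE (1 << 0)
#define XLH_UPDATE_CONTAINS_NEW_TUPLE (1 << 0)
#define XLH_UPDATE_CONTAINS_OLD_TUPLE (1 << 1)
#define XLH_UPDATE_CONTAINS_OLD_KEY   (1 << 2)
#define XLH_DELETE_NO_LOGICAL         (1 << 5)

enum RecKind { REC_INSERT, REC_UPDATE, REC_DELETE };

/* Returns true when the record carries enough information to be
 * decoded; records written by REPACK CONCURRENTLY are built so that
 * these checks skip them. */
static bool
should_decode(enum RecKind kind, uint8_t flags)
{
    switch (kind)
    {
        case REC_INSERT:
            return (flags & XLH_INSERT_CONTAINS_NEW_TUPLE) != 0;
        case REC_UPDATE:
            return (flags & (XLH_UPDATE_CONTAINS_NEW_TUPLE |
                             XLH_UPDATE_CONTAINS_OLD_TUPLE |
                             XLH_UPDATE_CONTAINS_OLD_KEY)) != 0;
        case REC_DELETE:
            return (flags & XLH_DELETE_NO_LOGICAL) == 0;
    }
    return true;  /* not reached */
}
```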
+
ReorderBufferProcessXid(ctx->reorder, xid, buf->origptr);
/*
@@ -923,13 +987,6 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
xlrec = (xl_heap_insert *) XLogRecGetData(r);
- /*
- * Ignore insert records without new tuples (this does happen when
- * raw_heap_insert marks the TOAST record as HEAP_INSERT_NO_LOGICAL).
- */
- if (!(xlrec->flags & XLH_INSERT_CONTAINS_NEW_TUPLE))
- return;
-
/* only interested in our database */
XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
if (target_locator.dbOid != ctx->slot->data.database)
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index c32e459411b..fde4955c328 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -155,7 +155,7 @@ static bool ExportInProgress = false;
static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
/* snapshot building/manipulation/distribution functions */
-static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
+static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder, XLogRecPtr lsn);
static void SnapBuildFreeSnapshot(Snapshot snap);
@@ -352,12 +352,17 @@ SnapBuildSnapDecRefcount(Snapshot snap)
* Build a new snapshot, based on currently committed catalog-modifying
* transactions.
*
+ * 'lsn' is the location of the commit record (of a catalog-changing
+ * transaction) that triggered creation of the snapshot. Pass
+ * InvalidXLogRecPtr for the transaction base snapshot, or if the user of
+ * the snapshot does not need the LSN.
+ *
* In-progress transactions with catalog access are *not* allowed to modify
* these snapshots; they have to copy them and fill in appropriate ->curcid
* and ->subxip/subxcnt values.
*/
static Snapshot
-SnapBuildBuildSnapshot(SnapBuild *builder)
+SnapBuildBuildSnapshot(SnapBuild *builder, XLogRecPtr lsn)
{
Snapshot snapshot;
Size ssize;
@@ -425,6 +430,7 @@ SnapBuildBuildSnapshot(SnapBuild *builder)
snapshot->active_count = 0;
snapshot->regd_count = 0;
snapshot->snapXactCompletionCount = 0;
+ snapshot->lsn = lsn;
return snapshot;
}
@@ -461,7 +467,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
if (TransactionIdIsValid(MyProc->xmin))
elog(ERROR, "cannot build an initial slot snapshot when MyProc->xmin already is valid");
- snap = SnapBuildBuildSnapshot(builder);
+ snap = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
/*
* We know that snap->xmin is alive, enforced by the logical xmin
@@ -502,7 +508,7 @@ SnapBuildInitialSnapshotForRepack(SnapBuild *builder)
Assert(builder->state == SNAPBUILD_CONSISTENT);
- snap = SnapBuildBuildSnapshot(builder);
+ snap = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
return SnapBuildMVCCFromHistoric(snap, false);
}
@@ -636,7 +642,7 @@ SnapBuildGetOrBuildSnapshot(SnapBuild *builder)
/* only build a new snapshot if we don't have a prebuilt one */
if (builder->snapshot == NULL)
{
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
/* increase refcount for the snapshot builder */
SnapBuildSnapIncRefcount(builder->snapshot);
}
@@ -716,7 +722,7 @@ SnapBuildProcessChange(SnapBuild *builder, TransactionId xid, XLogRecPtr lsn)
/* only build a new snapshot if we don't have a prebuilt one */
if (builder->snapshot == NULL)
{
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, lsn);
/* increase refcount for the snapshot builder */
SnapBuildSnapIncRefcount(builder->snapshot);
}
@@ -1085,7 +1091,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
if (builder->snapshot)
SnapBuildSnapDecRefcount(builder->snapshot);
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, lsn);
/* we might need to execute invalidations, add snapshot */
if (!ReorderBufferXidHasBaseSnapshot(builder->reorder, xid))
@@ -1910,7 +1916,7 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
{
SnapBuildSnapDecRefcount(builder->snapshot);
}
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
SnapBuildSnapIncRefcount(builder->snapshot);
ReorderBufferSetRestartPoint(builder->reorder, lsn);
diff --git a/src/backend/replication/pgoutput_repack/pgoutput_repack.c b/src/backend/replication/pgoutput_repack/pgoutput_repack.c
index 687fbbc59bb..28bd16f9cc7 100644
--- a/src/backend/replication/pgoutput_repack/pgoutput_repack.c
+++ b/src/backend/replication/pgoutput_repack/pgoutput_repack.c
@@ -32,7 +32,8 @@ static void plugin_truncate(struct LogicalDecodingContext *ctx,
Relation relations[],
ReorderBufferChange *change);
static void store_change(LogicalDecodingContext *ctx,
- ConcurrentChangeKind kind, HeapTuple tuple);
+ ConcurrentChangeKind kind, HeapTuple tuple,
+ TransactionId xid);
void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
@@ -100,6 +101,7 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change)
{
RepackDecodingState *dstate;
+ Snapshot snapshot;
dstate = (RepackDecodingState *) ctx->output_writer_private;
@@ -107,6 +109,48 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (relation->rd_id != dstate->relid)
return;
+ /*
+ * Catalog snapshot is fine because the table we are processing is
+ * temporarily considered a user catalog table.
+ */
+ snapshot = GetCatalogSnapshot(InvalidOid);
+ Assert(snapshot->snapshot_type == SNAPSHOT_HISTORIC_MVCC);
+ Assert(!snapshot->suboverflowed);
+
+ /*
+ * This should not happen, but if we don't have enough information to
+ * apply a new snapshot, the consequences would be bad. Thus prefer ERROR
+ * to Assert().
+ */
+ if (XLogRecPtrIsInvalid(snapshot->lsn))
+ ereport(ERROR, (errmsg("snapshot has invalid LSN")));
+
+ /*
+ * reorderbuffer.c changes the catalog snapshot as soon as it sees a new
+ * CID or a commit record of a catalog-changing transaction.
+ */
+ if (dstate->snapshot == NULL || snapshot->lsn != dstate->snapshot_lsn ||
+ snapshot->curcid != dstate->snapshot->curcid)
+ {
+ /* CID should not go backwards. */
+ Assert(dstate->snapshot == NULL ||
+ snapshot->curcid >= dstate->snapshot->curcid ||
+ change->txn->xid != dstate->last_change_xid);
+
+ /*
+ * XXX Is it a problem that the copy is created in
+ * TopTransactionContext?
+ *
+ * XXX Wouldn't it be o.k. for SnapBuildMVCCFromHistoric() to set xcnt
+ * to 0 instead of converting xip in this case? The point is that
+ * transactions which are still in progress from the perspective of
+ * reorderbuffer.c could not be replayed yet, so we do not need to
+ * examine their XIDs.
+ */
+ dstate->snapshot = SnapBuildMVCCFromHistoric(snapshot, false);
+ dstate->snapshot_lsn = snapshot->lsn;
+ }
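The caching logic above rebuilds the MVCC copy only when the historic snapshot has moved, i.e. when either its LSN or its curcid differs from the cached one. A minimal sketch of that invalidation rule, with illustrative names (`CacheState`, `maybe_rebuild`) standing in for the decoding state and `SnapBuildMVCCFromHistoric()`:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
typedef uint32_t CommandId;

/* Stand-in for the (snapshot, snapshot_lsn) cache in RepackDecodingState. */
typedef struct CacheState
{
    XLogRecPtr cached_lsn;
    CommandId  cached_cid;
    int        rebuilds;  /* counts the expensive conversions */
} CacheState;

/* Rebuild the cached snapshot copy only on a cache miss: no snapshot
 * yet, a new LSN, or a new command id within the same transaction. */
static void
maybe_rebuild(CacheState *st, XLogRecPtr lsn, CommandId cid)
{
    if (st->rebuilds == 0 || lsn != st->cached_lsn || cid != st->cached_cid)
    {
        st->cached_lsn = lsn;
        st->cached_cid = cid;
        st->rebuilds++;  /* stands in for SnapBuildMVCCFromHistoric() */
    }
}
```

Repeated changes decoded under the same (LSN, CID) pair thus reuse one converted snapshot instead of copying it per change.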
+
/* Decode entry depending on its type */
switch (change->action)
{
@@ -124,7 +168,7 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (newtuple == NULL)
elog(ERROR, "Incomplete insert info.");
- store_change(ctx, CHANGE_INSERT, newtuple);
+ store_change(ctx, CHANGE_INSERT, newtuple, change->txn->xid);
}
break;
case REORDER_BUFFER_CHANGE_UPDATE:
@@ -141,9 +185,11 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
elog(ERROR, "Incomplete update info.");
if (oldtuple != NULL)
- store_change(ctx, CHANGE_UPDATE_OLD, oldtuple);
+ store_change(ctx, CHANGE_UPDATE_OLD, oldtuple,
+ change->txn->xid);
- store_change(ctx, CHANGE_UPDATE_NEW, newtuple);
+ store_change(ctx, CHANGE_UPDATE_NEW, newtuple,
+ change->txn->xid);
}
break;
case REORDER_BUFFER_CHANGE_DELETE:
@@ -156,7 +202,7 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (oldtuple == NULL)
elog(ERROR, "Incomplete delete info.");
- store_change(ctx, CHANGE_DELETE, oldtuple);
+ store_change(ctx, CHANGE_DELETE, oldtuple, change->txn->xid);
}
break;
default:
@@ -190,13 +236,13 @@ plugin_truncate(struct LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (i == nrelations)
return;
- store_change(ctx, CHANGE_TRUNCATE, NULL);
+ store_change(ctx, CHANGE_TRUNCATE, NULL, InvalidTransactionId);
}
/* Store concurrent data change. */
static void
store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
- HeapTuple tuple)
+ HeapTuple tuple, TransactionId xid)
{
RepackDecodingState *dstate;
char *change_raw;
@@ -266,6 +312,11 @@ store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
dst = dst_start + SizeOfConcurrentChange;
memcpy(dst, tuple->t_data, tuple->t_len);
+ /* Initialize the other fields. */
+ change.xid = xid;
+ change.snapshot = dstate->snapshot;
+ dstate->snapshot->active_count++;
+
/* The data has been copied. */
if (flattened)
pfree(tuple);
@@ -279,6 +330,9 @@ store:
isnull[0] = false;
tuplestore_putvalues(dstate->tstore, dstate->tupdesc_change,
values, isnull);
+#ifdef USE_ASSERT_CHECKING
+ dstate->last_change_xid = xid;
+#endif
/* Accounting. */
dstate->nchanges++;
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index e9ddf39500c..e24e1795aa9 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -151,6 +151,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, InjectionPointShmemSize());
size = add_size(size, SlotSyncShmemSize());
size = add_size(size, AioShmemSize());
+ size = add_size(size, RepackShmemSize());
/* include additional requested shmem from preload libraries */
size = add_size(size, total_addin_request);
@@ -344,6 +345,7 @@ CreateOrAttachShmemStructs(void)
WaitEventCustomShmemInit();
InjectionPointShmemInit();
AioShmemInit();
+ RepackShmemInit();
}
/*
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 4eb67720737..14eda1c24ee 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -1633,6 +1633,27 @@ CacheInvalidateRelcache(Relation relation)
databaseId, relationId);
}
+/*
+ * CacheInvalidateRelcacheImmediate
+ * Send invalidation message for the specified relation's relcache entry.
+ *
+ * Currently this is used in REPACK CONCURRENTLY, to make sure that other
+ * backends are aware that the command is being executed for the relation.
+ */
+void
+CacheInvalidateRelcacheImmediate(Oid relid)
+{
+ SharedInvalidationMessage msg;
+
+ msg.rc.id = SHAREDINVALRELCACHE_ID;
+ msg.rc.dbId = MyDatabaseId;
+ msg.rc.relId = relid;
+ /* check AddCatcacheInvalidationMessage() for an explanation */
+ VALGRIND_MAKE_MEM_DEFINED(&msg, sizeof(msg));
+
+ SendSharedInvalidMessages(&msg, 1);
+}
+
/*
* CacheInvalidateRelcacheAll
* Register invalidation of the whole relcache at the end of command.
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index a495f22876d..679cc6be1d1 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -1253,6 +1253,10 @@ retry:
/* make sure relation is marked as having no open file yet */
relation->rd_smgr = NULL;
+ /* Is REPACK CONCURRENTLY in progress? */
+ relation->rd_repack_concurrent =
+ is_concurrent_repack_in_progress(targetRelId);
+
/*
* now we can free the memory allocated for pg_class_tuple
*/
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index bdeb2f83540..b0c6f1d916f 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -325,21 +325,24 @@ extern BulkInsertState GetBulkInsertState(void);
extern void FreeBulkInsertState(BulkInsertState);
extern void ReleaseBulkInsertStatePin(BulkInsertState bistate);
-extern void heap_insert(Relation relation, HeapTuple tup, CommandId cid,
- int options, BulkInsertState bistate);
+extern void heap_insert(Relation relation, HeapTuple tup, TransactionId xid,
+ CommandId cid, int options, BulkInsertState bistate);
extern void heap_multi_insert(Relation relation, struct TupleTableSlot **slots,
int ntuples, CommandId cid, int options,
BulkInsertState bistate);
extern TM_Result heap_delete(Relation relation, ItemPointer tid,
- CommandId cid, Snapshot crosscheck, bool wait,
- struct TM_FailureData *tmfd, bool changingPart);
+ TransactionId xid, CommandId cid,
+ Snapshot crosscheck, bool wait,
+ struct TM_FailureData *tmfd, bool changingPart,
+ bool wal_logical);
extern void heap_finish_speculative(Relation relation, ItemPointer tid);
extern void heap_abort_speculative(Relation relation, ItemPointer tid);
extern TM_Result heap_update(Relation relation, ItemPointer otid,
- HeapTuple newtup,
+ HeapTuple newtup, TransactionId xid,
CommandId cid, Snapshot crosscheck, bool wait,
struct TM_FailureData *tmfd, LockTupleMode *lockmode,
- TU_UpdateIndexes *update_indexes);
+ TU_UpdateIndexes *update_indexes,
+ bool wal_logical);
extern TM_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
CommandId cid, LockTupleMode mode, LockWaitPolicy wait_policy,
bool follow_updates,
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 277df6b3cf0..8d4af07f840 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -104,6 +104,8 @@
#define XLH_DELETE_CONTAINS_OLD_KEY (1<<2)
#define XLH_DELETE_IS_SUPER (1<<3)
#define XLH_DELETE_IS_PARTITION_MOVE (1<<4)
+/* See heap_delete() */
+#define XLH_DELETE_NO_LOGICAL (1<<5)
/* convenience macro for checking whether any form of old tuple was logged */
#define XLH_DELETE_CONTAINS_OLD \
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index b2bc10ee041..fbb66d559b6 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -482,6 +482,8 @@ extern Size EstimateTransactionStateSpace(void);
extern void SerializeTransactionState(Size maxsize, char *start_address);
extern void StartParallelWorkerTransaction(char *tstatespace);
extern void EndParallelWorkerTransaction(void);
+extern void SetRepackCurrentXids(TransactionId *xip, int xcnt);
+extern void ResetRepackCurrentXids(void);
extern bool IsTransactionBlock(void);
extern bool IsTransactionOrTransactionBlock(void);
extern char TransactionBlockStatusCode(void);
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 569cc2184b3..ab1d9fc25dc 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -73,6 +73,14 @@ typedef struct ConcurrentChange
/* See the enum above. */
ConcurrentChangeKind kind;
+ /* Transaction that changes the data. */
+ TransactionId xid;
+
+ /*
+ * Historic catalog snapshot that was used to decode this change.
+ */
+ Snapshot snapshot;
+
/*
* The actual tuple.
*
@@ -104,6 +112,8 @@ typedef struct RepackDecodingState
* tuplestore does this transparently.
*/
Tuplestorestate *tstore;
+ /* XID of the last change added to tstore. */
+ TransactionId last_change_xid PG_USED_FOR_ASSERTS_ONLY;
/* The current number of changes in tstore. */
double nchanges;
@@ -124,6 +134,14 @@ typedef struct RepackDecodingState
/* Slot to retrieve data from tstore. */
TupleTableSlot *tsslot;
+ /*
+ * Historic catalog snapshot that was used to decode the most recent
+ * change.
+ */
+ Snapshot snapshot;
+ /* LSN of the record */
+ XLogRecPtr snapshot_lsn;
+
ResourceOwner resowner;
} RepackDecodingState;
@@ -148,5 +166,9 @@ extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
MultiXactId cutoffMulti,
char newrelpersistence);
+extern Size RepackShmemSize(void);
+extern void RepackShmemInit(void);
+extern bool is_concurrent_repack_in_progress(Oid relid);
+
extern void repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel);
#endif /* CLUSTER_H */
diff --git a/src/include/utils/inval.h b/src/include/utils/inval.h
index 9b871caef62..ae9dee394dc 100644
--- a/src/include/utils/inval.h
+++ b/src/include/utils/inval.h
@@ -50,6 +50,8 @@ extern void CacheInvalidateCatalog(Oid catalogId);
extern void CacheInvalidateRelcache(Relation relation);
+extern void CacheInvalidateRelcacheImmediate(Oid relid);
+
extern void CacheInvalidateRelcacheAll(void);
extern void CacheInvalidateRelcacheByTuple(HeapTuple classTuple);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index d94fddd7cef..372065fc570 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -253,6 +253,9 @@ typedef struct RelationData
bool pgstat_enabled; /* should relation stats be counted */
/* use "struct" here to avoid needing to include pgstat.h: */
struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+ /* Is REPACK CONCURRENTLY being performed on this relation? */
+ bool rd_repack_concurrent;
} RelationData;
@@ -692,7 +695,9 @@ RelationCloseSmgr(Relation relation)
#define RelationIsAccessibleInLogicalDecoding(relation) \
(XLogLogicalInfoActive() && \
RelationNeedsWAL(relation) && \
- (IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation)))
+ (IsCatalogRelation(relation) || \
+ RelationIsUsedAsCatalogTable(relation) || \
+ (relation)->rd_repack_concurrent))
/*
* RelationIsLogicallyLogged
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 0e546ec1497..014f27db7d7 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -13,6 +13,7 @@
#ifndef SNAPSHOT_H
#define SNAPSHOT_H
+#include "access/xlogdefs.h"
#include "lib/pairingheap.h"
@@ -201,6 +202,8 @@ typedef struct SnapshotData
uint32 regd_count; /* refcount on RegisteredSnapshots */
pairingheap_node ph_node; /* link in the RegisteredSnapshots heap */
+ XLogRecPtr lsn; /* position in the WAL stream when taken */
+
/*
* The transaction completion count at the time GetSnapshotData() built
* this snapshot. Allows to avoid re-computing static snapshots when no
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e89db0a2ee7..e1e3e619c4b 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2510,6 +2510,7 @@ ReorderBufferTupleCidKey
ReorderBufferUpdateProgressTxnCB
ReorderTuple
RepOriginId
+RepackedRel
RepackDecodingState
RepackStmt
ReparameterizeForeignPathByChild_function
--
2.43.5
v11-0006-Add-regression-tests.patch
From 3b73055968b88e03b547407df3febd29c19a89fd Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Mon, 31 Mar 2025 15:47:08 +0200
Subject: [PATCH 6/9] Add regression tests.
As this patch series adds the CONCURRENTLY option to the REPACK command, it's
appropriate to test that the "concurrent data changes" (i.e. changes done by
the application while we are copying the table contents to the new storage) are
processed correctly.
Injection points are used to stop the data copying at some point. While the
backend in charge of the copying is waiting on the injection point, another
backend runs some INSERT, UPDATE and DELETE commands on the table. Then we
wake up the first backend and let the REPACK CONCURRENTLY command
finish. Finally we check that all the "concurrent data changes" are present in
the table and that they contain the correct visibility information.
---
src/backend/commands/cluster.c | 7 +
src/test/modules/injection_points/Makefile | 3 +-
.../injection_points/expected/repack.out | 113 ++++++++++++++
.../modules/injection_points/logical.conf | 1 +
src/test/modules/injection_points/meson.build | 4 +
.../injection_points/specs/repack.spec | 140 ++++++++++++++++++
6 files changed, 267 insertions(+), 1 deletion(-)
create mode 100644 src/test/modules/injection_points/expected/repack.out
create mode 100644 src/test/modules/injection_points/logical.conf
create mode 100644 src/test/modules/injection_points/specs/repack.spec
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 78380c882c0..a48e25deb5f 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -59,6 +59,7 @@
#include "utils/formatting.h"
#include "utils/fmgroids.h"
#include "utils/guc.h"
+#include "utils/injection_point.h"
#include "utils/inval.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
@@ -3285,6 +3286,12 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
*/
ident_key = build_identity_key(ident_idx_new, OldHeap, &ident_key_nentries);
+ /*
+ * During testing, wait for another backend to perform concurrent data
+ * changes which we will process below.
+ */
+ INJECTION_POINT("repack-concurrently-before-lock");
+
/*
* Flush all WAL records inserted so far (possibly except for the last
* incomplete page, see GetInsertRecPtr), to minimize the amount of data
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index e680991f8d4..405d0811b4f 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -14,7 +14,8 @@ PGFILEDESC = "injection_points - facility for injection points"
REGRESS = injection_points hashagg reindex_conc
REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
-ISOLATION = basic inplace syscache-update-pruned
+ISOLATION = basic inplace syscache-update-pruned repack
+ISOLATION_OPTS = --temp-config $(top_srcdir)/src/test/modules/injection_points/logical.conf
TAP_TESTS = 1
diff --git a/src/test/modules/injection_points/expected/repack.out b/src/test/modules/injection_points/expected/repack.out
new file mode 100644
index 00000000000..49a736ed617
--- /dev/null
+++ b/src/test/modules/injection_points/expected/repack.out
@@ -0,0 +1,113 @@
+Parsed test spec with 2 sessions
+
+starting permutation: wait_before_lock change_existing change_new change_subxact1 change_subxact2 check2 wakeup_before_lock check1
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step wait_before_lock:
+ REPACK CONCURRENTLY repack_test USING INDEX repack_test_pkey;
+ <waiting ...>
+step change_existing:
+ UPDATE repack_test SET i=10 where i=1;
+ UPDATE repack_test SET j=20 where i=2;
+ UPDATE repack_test SET i=30 where i=3;
+ UPDATE repack_test SET i=40 where i=30;
+ DELETE FROM repack_test WHERE i=4;
+
+step change_new:
+ INSERT INTO repack_test(i, j) VALUES (5, 5), (6, 6), (7, 7), (8, 8);
+ UPDATE repack_test SET i=50 where i=5;
+ UPDATE repack_test SET j=60 where i=6;
+ DELETE FROM repack_test WHERE i=7;
+
+step change_subxact1:
+ BEGIN;
+ INSERT INTO repack_test(i, j) VALUES (100, 100);
+ SAVEPOINT s1;
+ UPDATE repack_test SET i=101 where i=100;
+ SAVEPOINT s2;
+ UPDATE repack_test SET i=102 where i=101;
+ COMMIT;
+
+step change_subxact2:
+ BEGIN;
+ SAVEPOINT s1;
+ INSERT INTO repack_test(i, j) VALUES (110, 110);
+ ROLLBACK TO SAVEPOINT s1;
+ INSERT INTO repack_test(i, j) VALUES (110, 111);
+ COMMIT;
+
+step check2:
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='repack_test';
+
+ SELECT i, j FROM repack_test ORDER BY i, j;
+
+ INSERT INTO data_s2(_xmin, _cmin, i, j)
+ SELECT xmin, cmin, i, j FROM repack_test;
+
+ i| j
+---+---
+ 2| 20
+ 6| 60
+ 8| 8
+ 10| 1
+ 40| 3
+ 50| 5
+102|100
+110|111
+(8 rows)
+
+step wakeup_before_lock:
+ SELECT injection_points_wakeup('repack-concurrently-before-lock');
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step wait_before_lock: <... completed>
+step check1:
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='repack_test';
+
+ SELECT count(DISTINCT node) FROM relfilenodes;
+
+ SELECT i, j FROM repack_test ORDER BY i, j;
+
+ INSERT INTO data_s1(_xmin, _cmin, i, j)
+ SELECT xmin, cmin, i, j FROM repack_test;
+
+ SELECT count(*)
+ FROM data_s1 d1 FULL JOIN data_s2 d2 USING (_xmin, _cmin, i, j)
+ WHERE d1.i ISNULL OR d2.i ISNULL;
+
+count
+-----
+ 2
+(1 row)
+
+ i| j
+---+---
+ 2| 20
+ 6| 60
+ 8| 8
+ 10| 1
+ 40| 3
+ 50| 5
+102|100
+110|111
+(8 rows)
+
+count
+-----
+ 0
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
diff --git a/src/test/modules/injection_points/logical.conf b/src/test/modules/injection_points/logical.conf
new file mode 100644
index 00000000000..c8f264bc6cb
--- /dev/null
+++ b/src/test/modules/injection_points/logical.conf
@@ -0,0 +1 @@
+wal_level = logical
\ No newline at end of file
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index d61149712fd..0e3c47ba999 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -46,9 +46,13 @@ tests += {
'specs': [
'basic',
'inplace',
+ 'repack',
'syscache-update-pruned',
],
'runningcheck': false, # see syscache-update-pruned
+ # 'repack' requires wal_level = 'logical'.
+ 'regress_args': ['--temp-config', files('logical.conf')],
+
},
'tap': {
'env': {
diff --git a/src/test/modules/injection_points/specs/repack.spec b/src/test/modules/injection_points/specs/repack.spec
new file mode 100644
index 00000000000..5aa8983f98d
--- /dev/null
+++ b/src/test/modules/injection_points/specs/repack.spec
@@ -0,0 +1,140 @@
+# Prefix the system columns with underscore as they are not allowed as column
+# names.
+setup
+{
+ CREATE EXTENSION injection_points;
+
+ CREATE TABLE repack_test(i int PRIMARY KEY, j int);
+ INSERT INTO repack_test(i, j) VALUES (1, 1), (2, 2), (3, 3), (4, 4);
+
+ CREATE TABLE relfilenodes(node oid);
+
+ CREATE TABLE data_s1(_xmin xid, _cmin cid, i int, j int);
+ CREATE TABLE data_s2(_xmin xid, _cmin cid, i int, j int);
+}
+
+teardown
+{
+ DROP TABLE repack_test;
+ DROP EXTENSION injection_points;
+
+ DROP TABLE relfilenodes;
+ DROP TABLE data_s1;
+ DROP TABLE data_s2;
+}
+
+session s1
+setup
+{
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('repack-concurrently-before-lock', 'wait');
+}
+# Perform the initial load and wait for s2 to do some data changes.
+step wait_before_lock
+{
+ REPACK CONCURRENTLY repack_test USING INDEX repack_test_pkey;
+}
+# Check the table from the perspective of s1.
+#
+# Besides the contents, we also check that relfilenode has changed.
+#
+# xmin and cmin columns are used to check that we do not change tuple
+# visibility information. Since we do not expect xmin to stay unchanged across
+# test runs, it cannot appear in the output text. Instead, have each session
+# write the contents into a table and use FULL JOIN to check if the outputs
+# are identical.
+step check1
+{
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='repack_test';
+
+ SELECT count(DISTINCT node) FROM relfilenodes;
+
+ SELECT i, j FROM repack_test ORDER BY i, j;
+
+ INSERT INTO data_s1(_xmin, _cmin, i, j)
+ SELECT xmin, cmin, i, j FROM repack_test;
+
+ SELECT count(*)
+ FROM data_s1 d1 FULL JOIN data_s2 d2 USING (_xmin, _cmin, i, j)
+ WHERE d1.i ISNULL OR d2.i ISNULL;
+}
+teardown
+{
+ SELECT injection_points_detach('repack-concurrently-before-lock');
+}
+
+session s2
+# Change the existing data. UPDATE changes both key and non-key columns. Also
+# update one row twice to test whether tuple version generated by this session
+# can be found.
+step change_existing
+{
+ UPDATE repack_test SET i=10 where i=1;
+ UPDATE repack_test SET j=20 where i=2;
+ UPDATE repack_test SET i=30 where i=3;
+ UPDATE repack_test SET i=40 where i=30;
+ DELETE FROM repack_test WHERE i=4;
+}
+# Insert new rows and UPDATE / DELETE some of them. Again, update both key and
+# non-key column.
+step change_new
+{
+ INSERT INTO repack_test(i, j) VALUES (5, 5), (6, 6), (7, 7), (8, 8);
+ UPDATE repack_test SET i=50 where i=5;
+ UPDATE repack_test SET j=60 where i=6;
+ DELETE FROM repack_test WHERE i=7;
+}
+
+# When applying concurrent data changes, we should see the effects of an
+# in-progress subtransaction.
+step change_subxact1
+{
+ BEGIN;
+ INSERT INTO repack_test(i, j) VALUES (100, 100);
+ SAVEPOINT s1;
+ UPDATE repack_test SET i=101 where i=100;
+ SAVEPOINT s2;
+ UPDATE repack_test SET i=102 where i=101;
+ COMMIT;
+}
+
+# When applying concurrent data changes, we should not see the effects of a
+# rolled back subtransaction.
+step change_subxact2
+{
+ BEGIN;
+ SAVEPOINT s1;
+ INSERT INTO repack_test(i, j) VALUES (110, 110);
+ ROLLBACK TO SAVEPOINT s1;
+ INSERT INTO repack_test(i, j) VALUES (110, 111);
+ COMMIT;
+}
+
+# Check the table from the perspective of s2.
+step check2
+{
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='repack_test';
+
+ SELECT i, j FROM repack_test ORDER BY i, j;
+
+ INSERT INTO data_s2(_xmin, _cmin, i, j)
+ SELECT xmin, cmin, i, j FROM repack_test;
+}
+step wakeup_before_lock
+{
+ SELECT injection_points_wakeup('repack-concurrently-before-lock');
+}
+
+# Test if data changes introduced while one session is performing REPACK
+# CONCURRENTLY find their way into the table.
+permutation
+ wait_before_lock
+ change_existing
+ change_new
+ change_subxact1
+ change_subxact2
+ check2
+ wakeup_before_lock
+ check1
--
2.43.5
Attachment: v11-0007-Introduce-repack_max_xlock_time-configuration-variab.patch (text/x-diff)
From cc800558cf8a952afa86d202e6dd571a5bb992c4 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Mon, 31 Mar 2025 15:47:08 +0200
Subject: [PATCH 7/9] Introduce repack_max_xlock_time configuration variable.
When executing REPACK CONCURRENTLY, we need an AccessExclusiveLock to swap
the relation files, which should only take a short time. However, on a busy
system, other backends might change a non-negligible amount of data in the
table while we are waiting for the lock. Since these changes must be applied
to the new storage before the swap, the time we eventually hold the lock
might become non-negligible too.
If the user is worried about this situation, they can set
repack_max_xlock_time to the maximum time for which the exclusive lock may be
held. If this amount of time is not sufficient to complete the REPACK
CONCURRENTLY command, an ERROR is raised and the command is canceled.
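For illustration, a session-level usage sketch of the variable described
above (the table and index names are hypothetical; the REPACK syntax is the
one introduced earlier in this patch series):

```sql
-- Cap the ACCESS EXCLUSIVE phase at 100 ms; if the remaining concurrent
-- changes cannot be applied within that time, the command errors out and
-- the lock is released.
SET repack_max_xlock_time = '100ms';
REPACK CONCURRENTLY my_table USING INDEX my_table_pkey;

-- The default of 0 means no limit: the lock is held until all concurrent
-- data changes have been processed.
RESET repack_max_xlock_time;
```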
---
doc/src/sgml/config.sgml | 31 ++++
doc/src/sgml/ref/repack.sgml | 9 +-
src/backend/access/heap/heapam_handler.c | 3 +-
src/backend/commands/cluster.c | 135 +++++++++++++++---
src/backend/utils/misc/guc_tables.c | 15 +-
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/commands/cluster.h | 5 +-
.../injection_points/expected/repack.out | 74 +++++++++-
.../injection_points/specs/repack.spec | 42 ++++++
9 files changed, 294 insertions(+), 21 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 0d02e21a1ab..f596eef7bcf 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -11183,6 +11183,37 @@ dynamic_library_path = '/usr/local/lib/postgresql:$libdir'
</listitem>
</varlistentry>
+ <varlistentry id="guc-repack-max-xclock-time" xreflabel="repack_max_xlock_time">
+ <term><varname>repack_max_xlock_time</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>repack_max_xlock_time</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+        This is the maximum amount of time for which
+        <command>REPACK</command> with the <literal>CONCURRENTLY</literal>
+        option may hold an exclusive lock on a table. Typically, the
+        command should not need to hold the lock for longer than
+        <command>TRUNCATE</command> does. However, additional time might
+        be needed if the system is very busy. (See
+        <xref linkend="sql-repack"/> for an explanation of how the
+        <literal>CONCURRENTLY</literal> option works.)
+
+ <para>
+        If you want to restrict the lock time, set this variable to the
+        highest acceptable value. If it turns out during processing that
+        the lock cannot be released within this time, the command is
+        cancelled.
+ </para>
+
+ <para>
+ The default value is 0, which means that the lock is not released
+ until the concurrent data changes are processed.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect1>
diff --git a/doc/src/sgml/ref/repack.sgml b/doc/src/sgml/ref/repack.sgml
index 9ee640e3517..0c250689d13 100644
--- a/doc/src/sgml/ref/repack.sgml
+++ b/doc/src/sgml/ref/repack.sgml
@@ -188,7 +188,14 @@ REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] CONCU
(<xref linkend="logicaldecoding"/>) and applied before
the <literal>ACCESS EXCLUSIVE</literal> lock is requested. Thus the lock
is typically held only for the time needed to swap the files, which
- should be pretty short.
+ should be pretty short. However, the time might still be noticeable if
+ too many data changes have been done to the table while
+ <command>REPACK</command> was waiting for the lock: those changes must
+ be processed just before the files are swapped, while the
+ <literal>ACCESS EXCLUSIVE</literal> lock is being held. If you are
+ worried about this situation, set
+ the <link linkend="guc-repack-max-xclock-time"><varname>repack_max_xlock_time</varname></link>
+ configuration parameter to a value that your applications can tolerate.
</para>
<para>
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index ea1d6f299b3..850708c7830 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1008,7 +1008,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
end_of_wal = GetFlushRecPtr(NULL);
if ((end_of_wal - end_of_wal_prev) > wal_segment_size)
{
- repack_decode_concurrent_changes(decoding_ctx, end_of_wal);
+ repack_decode_concurrent_changes(decoding_ctx, end_of_wal,
+ NULL);
end_of_wal_prev = end_of_wal;
}
}
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index a48e25deb5f..3f07b42d615 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -17,6 +17,8 @@
*/
#include "postgres.h"
+#include <sys/time.h>
+
#include "access/amapi.h"
#include "access/heapam.h"
#include "access/multixact.h"
@@ -103,6 +105,15 @@ static Oid repacked_rel = InvalidOid;
RelFileLocator repacked_rel_locator = {.relNumber = InvalidOid};
RelFileLocator repacked_rel_toast_locator = {.relNumber = InvalidOid};
+/*
+ * The maximum time to hold AccessExclusiveLock during the final
+ * processing. Note that only the execution time of
+ * process_concurrent_changes() is included here. The very last steps like
+ * swap_relation_files() shouldn't get blocked and it'd be wrong to consider
+ * them a reason to abort otherwise completed processing.
+ */
+int repack_max_xlock_time = 0;
+
/*
* Everything we need to call ExecInsertIndexTuples().
*/
@@ -149,7 +160,8 @@ static LogicalDecodingContext *setup_logical_decoding(Oid relid,
static HeapTuple get_changed_tuple(char *change);
static void apply_concurrent_changes(RepackDecodingState *dstate,
Relation rel, ScanKey key, int nkeys,
- IndexInsertState *iistate);
+ IndexInsertState *iistate,
+ struct timeval *must_complete);
static void apply_concurrent_insert(Relation rel, ConcurrentChange *change,
HeapTuple tup, IndexInsertState *iistate,
TupleTableSlot *index_slot);
@@ -166,13 +178,15 @@ static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
IndexInsertState *iistate,
TupleTableSlot *ident_slot,
IndexScanDesc *scan_p);
-static void process_concurrent_changes(LogicalDecodingContext *ctx,
+static bool process_concurrent_changes(LogicalDecodingContext *ctx,
XLogRecPtr end_of_wal,
Relation rel_dst,
Relation rel_src,
ScanKey ident_key,
int ident_key_nentries,
- IndexInsertState *iistate);
+ IndexInsertState *iistate,
+ struct timeval *must_complete);
+static bool processing_time_elapsed(struct timeval *must_complete);
static IndexInsertState *get_index_insert_state(Relation relation,
Oid ident_index_id);
static ScanKey build_identity_key(Oid ident_idx_oid, Relation rel_src,
@@ -2597,7 +2611,8 @@ get_changed_tuple(char *change)
*/
void
repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
- XLogRecPtr end_of_wal)
+ XLogRecPtr end_of_wal,
+ struct timeval *must_complete)
{
RepackDecodingState *dstate;
ResourceOwner resowner_old;
@@ -2627,6 +2642,9 @@ repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
if (record != NULL)
LogicalDecodingProcessRecord(ctx, ctx->reader);
+ if (processing_time_elapsed(must_complete))
+ break;
+
/*
* If WAL segment boundary has been crossed, inform the decoding
* system that the catalog_xmin can advance. (We can confirm more
@@ -2667,7 +2685,8 @@ repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
*/
static void
apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
- ScanKey key, int nkeys, IndexInsertState *iistate)
+ ScanKey key, int nkeys, IndexInsertState *iistate,
+ struct timeval *must_complete)
{
TupleTableSlot *index_slot,
*ident_slot;
@@ -2698,6 +2717,9 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
CHECK_FOR_INTERRUPTS();
+ Assert(dstate->nchanges > 0);
+ dstate->nchanges--;
+
/* Get the change from the single-column tuple. */
tup_change = ExecFetchSlotHeapTuple(dstate->tsslot, false, &shouldFree);
heap_deform_tuple(tup_change, dstate->tupdesc_change, values, isnull);
@@ -2822,10 +2844,22 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
/* TTSOpsMinimalTuple has .get_heap_tuple==NULL. */
Assert(shouldFree);
pfree(tup_change);
+
+ /*
+ * If there is a limit on the time of completion, check it now.
+ * However, make sure the loop does not break if tup_old was set in
+ * the previous iteration. In such a case we could not resume the
+ * processing in the next call.
+ */
+ if (must_complete && tup_old == NULL &&
+ processing_time_elapsed(must_complete))
+ /* The next call will process the remaining changes. */
+ break;
}
- tuplestore_clear(dstate->tstore);
- dstate->nchanges = 0;
+ /* If we could not apply all the changes, the next call will do. */
+ if (dstate->nchanges == 0)
+ tuplestore_clear(dstate->tstore);
/* Cleanup. */
ExecDropSingleTupleTableSlot(index_slot);
@@ -3029,11 +3063,15 @@ find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
* Decode and apply concurrent changes.
*
* Pass rel_src iff its reltoastrelid is needed.
+ *
+ * Returns true if must_complete is NULL or if managed to complete by the time
+ * *must_complete indicates.
*/
-static void
+static bool
process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
Relation rel_dst, Relation rel_src, ScanKey ident_key,
- int ident_key_nentries, IndexInsertState *iistate)
+ int ident_key_nentries, IndexInsertState *iistate,
+ struct timeval *must_complete)
{
RepackDecodingState *dstate;
@@ -3042,10 +3080,19 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
dstate = (RepackDecodingState *) ctx->output_writer_private;
- repack_decode_concurrent_changes(ctx, end_of_wal);
+ repack_decode_concurrent_changes(ctx, end_of_wal, must_complete);
+ if (processing_time_elapsed(must_complete))
+ /* Caller is responsible for applying the changes. */
+ return false;
+
+ /*
+ * *must_complete not reached, so there are really no changes. (It's
+ * possible to see no changes just because not enough time was left for
+ * the decoding.)
+ */
if (dstate->nchanges == 0)
- return;
+ return true;
PG_TRY();
{
@@ -3057,7 +3104,7 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
rel_dst->rd_toastoid = rel_src->rd_rel->reltoastrelid;
apply_concurrent_changes(dstate, rel_dst, ident_key,
- ident_key_nentries, iistate);
+ ident_key_nentries, iistate, must_complete);
}
PG_FINALLY();
{
@@ -3067,6 +3114,28 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
rel_dst->rd_toastoid = InvalidOid;
}
PG_END_TRY();
+
+ /*
+ * apply_concurrent_changes() does check the processing time, so if some
+ * changes are left, we ran out of time.
+ */
+ return dstate->nchanges == 0;
+}
+
+/*
+ * Check if the current time is beyond *must_complete.
+ */
+static bool
+processing_time_elapsed(struct timeval *must_complete)
+{
+ struct timeval now;
+
+ if (must_complete == NULL)
+ return false;
+
+ gettimeofday(&now, NULL);
+
+ return timercmp(&now, must_complete, >);
}
static IndexInsertState *
@@ -3228,6 +3297,8 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
Relation *ind_refs,
*ind_refs_p;
int nind;
+ struct timeval t_end;
+ struct timeval *t_end_ptr = NULL;
/* Like in cluster_rel(). */
lockmode_old = ShareUpdateExclusiveLock;
@@ -3308,7 +3379,8 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
*/
process_concurrent_changes(ctx, end_of_wal, NewHeap,
swap_toast_by_content ? OldHeap : NULL,
- ident_key, ident_key_nentries, iistate);
+ ident_key, ident_key_nentries, iistate,
+ NULL);
/*
* Acquire AccessExclusiveLock on the table, its TOAST relation (if there
@@ -3396,9 +3468,40 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
end_of_wal = GetFlushRecPtr(NULL);
/* Apply the concurrent changes again. */
- process_concurrent_changes(ctx, end_of_wal, NewHeap,
- swap_toast_by_content ? OldHeap : NULL,
- ident_key, ident_key_nentries, iistate);
+
+ /*
+ * This time we have the exclusive lock on the table, so make sure that
+ * repack_max_xlock_time is not exceeded.
+ */
+ if (repack_max_xlock_time > 0)
+ {
+ int64 usec;
+ struct timeval t_start;
+
+ gettimeofday(&t_start, NULL);
+ /* Add the whole seconds. */
+ t_end.tv_sec = t_start.tv_sec + repack_max_xlock_time / 1000;
+ /* Add the rest, expressed in microseconds. */
+ usec = t_start.tv_usec + 1000 * (repack_max_xlock_time % 1000);
+ /* The number of microseconds could have overflown. */
+ t_end.tv_sec += usec / USECS_PER_SEC;
+ t_end.tv_usec = usec % USECS_PER_SEC;
+ t_end_ptr = &t_end;
+ }
+
+ /*
+ * During testing, stop here to simulate excessive processing time.
+ */
+ INJECTION_POINT("repack-concurrently-after-lock");
+
+ if (!process_concurrent_changes(ctx, end_of_wal, NewHeap,
+ swap_toast_by_content ? OldHeap : NULL,
+ ident_key, ident_key_nentries, iistate,
+ t_end_ptr))
+ ereport(ERROR,
+ (errmsg("could not process concurrent data changes in time"),
+ errhint("Please consider adjusting \"repack_max_xlock_time\".")));
+
/* Remember info about rel before closing OldHeap */
relpersistence = OldHeap->rd_rel->relpersistence;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 4eaeca89f2c..533fd40d383 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -39,8 +39,9 @@
#include "catalog/namespace.h"
#include "catalog/storage.h"
#include "commands/async.h"
-#include "commands/extension.h"
+#include "commands/cluster.h"
#include "commands/event_trigger.h"
+#include "commands/extension.h"
#include "commands/tablespace.h"
#include "commands/trigger.h"
#include "commands/user.h"
@@ -2827,6 +2828,18 @@ struct config_int ConfigureNamesInt[] =
1600000000, 0, 2100000000,
NULL, NULL, NULL
},
+ {
+ {"repack_max_xlock_time", PGC_USERSET, LOCK_MANAGEMENT,
+ gettext_noop("Maximum time for REPACK CONCURRENTLY to keep table locked."),
+ gettext_noop("The table is locked in exclusive mode during the final stage of processing. "
+ "If the lock time exceeds this value, an error is raised and the lock is "
+ "released. Set to zero if you don't care how long the lock can be held."),
+ GUC_UNIT_MS
+ },
+ &repack_max_xlock_time,
+ 0, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
/*
* See also CheckRequiredParameterValues() if this parameter changes
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index ff56a1f0732..bc0217161ec 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -763,6 +763,7 @@ autovacuum_worker_slots = 16 # autovacuum worker slots to allocate
#lock_timeout = 0 # in milliseconds, 0 is disabled
#idle_in_transaction_session_timeout = 0 # in milliseconds, 0 is disabled
#idle_session_timeout = 0 # in milliseconds, 0 is disabled
+#repack_max_xlock_time = 0
#bytea_output = 'hex' # hex, escape
#xmlbinary = 'base64'
#xmloption = 'content'
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index ab1d9fc25dc..be283c70fce 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -59,6 +59,8 @@ typedef enum ClusterCommand
extern RelFileLocator repacked_rel_locator;
extern RelFileLocator repacked_rel_toast_locator;
+extern PGDLLIMPORT int repack_max_xlock_time;
+
typedef enum
{
CHANGE_INSERT,
@@ -153,7 +155,8 @@ extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid,
ClusterCommand cmd);
extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
extern void repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
- XLogRecPtr end_of_wal);
+ XLogRecPtr end_of_wal,
+ struct timeval *must_complete);
extern Oid make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
char relpersistence, LOCKMODE lockmode);
extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
diff --git a/src/test/modules/injection_points/expected/repack.out b/src/test/modules/injection_points/expected/repack.out
index 49a736ed617..f2728d94222 100644
--- a/src/test/modules/injection_points/expected/repack.out
+++ b/src/test/modules/injection_points/expected/repack.out
@@ -1,4 +1,4 @@
-Parsed test spec with 2 sessions
+Parsed test spec with 4 sessions
starting permutation: wait_before_lock change_existing change_new change_subxact1 change_subxact2 check2 wakeup_before_lock check1
injection_points_attach
@@ -111,3 +111,75 @@ injection_points_detach
(1 row)
+injection_points_detach
+-----------------------
+
+(1 row)
+
+
+starting permutation: wait_after_lock wakeup_after_lock
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step wait_after_lock:
+ REPACK CONCURRENTLY repack_test USING INDEX repack_test_pkey;
+ <waiting ...>
+step wakeup_after_lock:
+ SELECT injection_points_wakeup('repack-concurrently-after-lock');
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step wait_after_lock: <... completed>
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+
+starting permutation: wait_after_lock after_lock_delay wakeup_after_lock
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step wait_after_lock:
+ REPACK CONCURRENTLY repack_test USING INDEX repack_test_pkey;
+ <waiting ...>
+step after_lock_delay:
+ SELECT pg_sleep(1.5);
+
+pg_sleep
+--------
+
+(1 row)
+
+step wakeup_after_lock:
+ SELECT injection_points_wakeup('repack-concurrently-after-lock');
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step wait_after_lock: <... completed>
+ERROR: could not process concurrent data changes in time
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
diff --git a/src/test/modules/injection_points/specs/repack.spec b/src/test/modules/injection_points/specs/repack.spec
index 5aa8983f98d..0f45f9d2544 100644
--- a/src/test/modules/injection_points/specs/repack.spec
+++ b/src/test/modules/injection_points/specs/repack.spec
@@ -127,6 +127,34 @@ step wakeup_before_lock
SELECT injection_points_wakeup('repack-concurrently-before-lock');
}
+session s3
+setup
+{
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('repack-concurrently-after-lock', 'wait');
+ SET repack_max_xlock_time TO '1s';
+}
+# Perform the initial load, lock the table in exclusive mode and wait. s4 will
+# cancel the waiting.
+step wait_after_lock
+{
+ REPACK CONCURRENTLY repack_test USING INDEX repack_test_pkey;
+}
+teardown
+{
+ SELECT injection_points_detach('repack-concurrently-after-lock');
+}
+
+session s4
+step wakeup_after_lock
+{
+ SELECT injection_points_wakeup('repack-concurrently-after-lock');
+}
+step after_lock_delay
+{
+ SELECT pg_sleep(1.5);
+}
+
# Test if data changes introduced while one session is performing REPACK
# CONCURRENTLY find their way into the table.
permutation
@@ -138,3 +166,17 @@ permutation
check2
wakeup_before_lock
check1
+
+# Test the repack_max_xlock_time configuration variable.
+#
+# First, cancel waiting on the injection point immediately. That way, REPACK
+# should complete.
+permutation
+ wait_after_lock
+ wakeup_after_lock
+# Second, cancel the waiting with a delay that violates
+# repack_max_xlock_time.
+permutation
+ wait_after_lock
+ after_lock_delay
+ wakeup_after_lock
--
2.43.5
Attachment: v11-0008-Enable-logical-decoding-transiently-only-for-REPACK-.patch (text/x-diff)
From eee37add3ec98f67e8713bb79eba44e75e3d4df1 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Mon, 31 Mar 2025 15:47:08 +0200
Subject: [PATCH 8/9] Enable logical decoding transiently, only for REPACK
CONCURRENTLY.
As REPACK CONCURRENTLY uses logical decoding, it requires wal_level to be set
to 'logical', while 'replica' is the default value. If logical replication is
not used, users will probably be reluctant to set the GUC to 'logical' because
it can affect server performance (by writing additional information to WAL)
and because it cannot be changed to 'logical' only for the time REPACK
CONCURRENTLY is running: change of this GUC requires server restart to take
effect.
This patch teaches the postgres backend to recognize whether it should
consider wal_level='logical' "locally" for a particular transaction, even if
the wal_level GUC is actually set to 'replica'. It also ensures that the
logical-decoding-specific information is added to WAL only for the tables
which are currently being processed by REPACK CONCURRENTLY.
If logical decoding is enabled this way, only temporary replication slots
should be created. The problem with a permanent slot is that it is restored
during server restart, and the restore fails if wal_level is not "globally"
'logical'.
There is independent work in progress to enable logical decoding transiently
[1]. ISTM that this is too "heavyweight" a solution for our problem. And I
think that these two approaches are not mutually exclusive: once [1] is
committed, we only need to adjust the XLogLogicalInfoActive() macro.
[1] https://www.postgresql.org/message-id/CAD21AoCVLeLYq09pQPaWs%2BJwdni5FuJ8v2jgq-u9_uFbcp6UbA%40mail.gmail.com
---
src/backend/access/transam/parallel.c | 8 ++
src/backend/access/transam/xact.c | 106 +++++++++++++++---
src/backend/access/transam/xlog.c | 1 +
src/backend/commands/cluster.c | 94 +++++++++++++---
src/backend/replication/logical/logical.c | 9 +-
src/backend/storage/ipc/standby.c | 4 +-
src/include/access/xlog.h | 15 ++-
src/include/commands/cluster.h | 1 +
src/include/utils/rel.h | 6 +-
src/test/modules/injection_points/Makefile | 1 -
.../modules/injection_points/logical.conf | 1 -
src/test/modules/injection_points/meson.build | 3 -
src/tools/pgindent/typedefs.list | 1 +
13 files changed, 206 insertions(+), 44 deletions(-)
delete mode 100644 src/test/modules/injection_points/logical.conf
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 94db1ec3012..a33318ea7bd 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -98,6 +98,7 @@ typedef struct FixedParallelState
TimestampTz xact_ts;
TimestampTz stmt_ts;
SerializableXactHandle serializable_xact_handle;
+ int wal_level_transient;
/* Mutex protects remaining fields. */
slock_t mutex;
@@ -355,6 +356,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
fps->xact_ts = GetCurrentTransactionStartTimestamp();
fps->stmt_ts = GetCurrentStatementStartTimestamp();
fps->serializable_xact_handle = ShareSerializableXact();
+ fps->wal_level_transient = wal_level_transient;
SpinLockInit(&fps->mutex);
fps->last_xlog_end = 0;
shm_toc_insert(pcxt->toc, PARALLEL_KEY_FIXED, fps);
@@ -1550,6 +1552,12 @@ ParallelWorkerMain(Datum main_arg)
/* Attach to the leader's serializable transaction, if SERIALIZABLE. */
AttachSerializableXact(fps->serializable_xact_handle);
+ /*
+ * Restore the information about whether this worker should behave as if
+ * wal_level were WAL_LEVEL_LOGICAL.
+ */
+ wal_level_transient = fps->wal_level_transient;
+
/*
* We've initialized all of our state now; nothing should change
* hereafter.
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 3db4cac030e..608dc5c79bb 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -36,6 +36,7 @@
#include "catalog/pg_enum.h"
#include "catalog/storage.h"
#include "commands/async.h"
+#include "commands/cluster.h"
#include "commands/tablecmds.h"
#include "commands/trigger.h"
#include "common/pg_prng.h"
@@ -138,6 +139,12 @@ static TransactionId *ParallelCurrentXids;
static int nRepackCurrentXids = 0;
static TransactionId *RepackCurrentXids = NULL;
+/*
+ * Have we determined the value of wal_level_transient for the current
+ * transaction?
+ */
+static bool wal_level_transient_checked = false;
+
/*
* Miscellaneous flag bits to record events which occur on the top level
* transaction. These flags are only persisted in MyXactFlags and are intended
@@ -650,6 +657,7 @@ AssignTransactionId(TransactionState s)
bool isSubXact = (s->parent != NULL);
ResourceOwner currentOwner;
bool log_unknown_top = false;
+ bool set_wal_level_transient = false;
/* Assert that caller didn't screw up */
Assert(!FullTransactionIdIsValid(s->fullTransactionId));
@@ -664,6 +672,32 @@ AssignTransactionId(TransactionState s)
(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
errmsg("cannot assign transaction IDs during a parallel operation")));
+ /*
+ * The first call (i.e. the first write) in the transaction tree
+ * determines whether the whole transaction assumes logical decoding or
+ * not.
+ */
+ if (!wal_level_transient_checked)
+ {
+ Assert(wal_level_transient == WAL_LEVEL_MINIMAL);
+
+ /*
+ * Do not repeat the check when calling this function for parent
+ * transactions.
+ */
+ wal_level_transient_checked = true;
+
+ /*
+ * Remember that the actual check is needed. We cannot do it until the
+ * top-level transaction has its XID assigned, see comments below.
+ *
+ * There is no use case for overriding MINIMAL, and LOGICAL cannot be
+ * overridden as such.
+ */
+ if (wal_level == WAL_LEVEL_REPLICA)
+ set_wal_level_transient = true;
+ }
+
/*
* Ensure parent(s) have XIDs, so that a child always has an XID later
* than its parent. Mustn't recurse here, or we might get a stack
@@ -693,20 +727,6 @@ AssignTransactionId(TransactionState s)
pfree(parents);
}
- /*
- * When wal_level=logical, guarantee that a subtransaction's xid can only
- * be seen in the WAL stream if its toplevel xid has been logged before.
- * If necessary we log an xact_assignment record with fewer than
- * PGPROC_MAX_CACHED_SUBXIDS. Note that it is fine if didLogXid isn't set
- * for a transaction even though it appears in a WAL record, we just might
- * superfluously log something. That can happen when an xid is included
- * somewhere inside a wal record, but not in XLogRecord->xl_xid, like in
- * xl_standby_locks.
- */
- if (isSubXact && XLogLogicalInfoActive() &&
- !TopTransactionStateData.didLogXid)
- log_unknown_top = true;
-
/*
* Generate a new FullTransactionId and record its xid in PGPROC and
* pg_subtrans.
@@ -731,6 +751,54 @@ AssignTransactionId(TransactionState s)
if (!isSubXact)
RegisterPredicateLockingXid(XidFromFullTransactionId(s->fullTransactionId));
+ /*
+ * Check if this transaction should consider wal_level=logical.
+ *
+ * Sometimes we need to turn on logical decoding transiently even though
+ * wal_level=WAL_LEVEL_REPLICA. Currently we do so when at least one table
+ * is being clustered concurrently, i.e. when we should assume that
+ * changes done by this transaction will be decoded. In such a case we
+ * adjust the value of XLogLogicalInfoActive() by setting
+ * wal_level_transient to LOGICAL.
+ *
+ * It's important not to do this check until the XID of the top-level
+ * transaction is in ProcGlobal: if the decoding becomes mandatory right
+ * after the check, our transaction will fail to write the necessary
+ * information to WAL. However, if the top-level transaction is already in
+ * ProcGlobal, its XID is guaranteed to appear in the xl_running_xacts
+ * record and therefore the snapshot builder will not try to decode the
+ * transaction (because it assumes it could have missed the initial part
+ * of the transaction).
+ *
+ * On the other hand, if the decoding became mandatory between the actual
+ * XID assignment and now, the transaction will write the decoding-specific
+ * information to WAL unnecessarily. Let's assume that such race conditions do
+ * not happen too often.
+ */
+ if (set_wal_level_transient)
+ {
+ /*
+ * Check for the operation that enables the logical decoding
+ * transiently.
+ */
+ if (is_concurrent_repack_in_progress(InvalidOid))
+ wal_level_transient = WAL_LEVEL_LOGICAL;
+ }
+
+ /*
+ * When wal_level=logical, guarantee that a subtransaction's xid can only
+ * be seen in the WAL stream if its toplevel xid has been logged before.
+ * If necessary we log an xact_assignment record with fewer than
+ * PGPROC_MAX_CACHED_SUBXIDS. Note that it is fine if didLogXid isn't set
+ * for a transaction even though it appears in a WAL record, we just might
+ * superfluously log something. That can happen when an xid is included
+ * somewhere inside a wal record, but not in XLogRecord->xl_xid, like in
+ * xl_standby_locks.
+ */
+ if (isSubXact && XLogLogicalInfoActive() &&
+ !TopTransactionStateData.didLogXid)
+ log_unknown_top = true;
+
/*
* Acquire lock on the transaction XID. (We assume this cannot block.) We
* have to ensure that the lock is assigned to the transaction's own
@@ -2245,6 +2313,16 @@ StartTransaction(void)
if (TransactionTimeout > 0)
enable_timeout_after(TRANSACTION_TIMEOUT, TransactionTimeout);
+ /*
+ * wal_level_transient can override wal_level for individual transactions,
+ * which effectively enables logical decoding for them. At the moment we
+ * don't know if this transaction will write any data changes to be
+ * decoded. If it does, AssignTransactionId() will check whether the decoding
+ * needs to be considered.
+ */
+ wal_level_transient = WAL_LEVEL_MINIMAL;
+ wal_level_transient_checked = false;
+
ShowTransactionState("StartTransaction");
}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fc30a52d496..ba758deefb4 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -129,6 +129,7 @@ bool wal_recycle = true;
bool log_checkpoints = true;
int wal_sync_method = DEFAULT_WAL_SYNC_METHOD;
int wal_level = WAL_LEVEL_REPLICA;
+int wal_level_transient = WAL_LEVEL_MINIMAL;
int CommitDelay = 0; /* precommit delay in microseconds */
int CommitSiblings = 5; /* # concurrent xacts needed to sleep */
int wal_retrieve_retry_interval = 5000;
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 3f07b42d615..734e47eaba3 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -2203,7 +2203,16 @@ typedef struct RepackedRel
Oid dbid;
} RepackedRel;
-static HTAB *RepackedRelsHash = NULL;
+typedef struct RepackedRels
+{
+ /* Hashtable of RepackedRel elements. */
+ HTAB *hashtable;
+
+ /* The number of elements in the hashtable. */
+ pg_atomic_uint32 nrels;
+} RepackedRels;
+
+static RepackedRels *repackedRels = NULL;
/*
* Maximum number of entries in the hashtable.
@@ -2216,22 +2225,38 @@ static HTAB *RepackedRelsHash = NULL;
Size
RepackShmemSize(void)
{
- return hash_estimate_size(MAX_REPACKED_RELS, sizeof(RepackedRel));
+ Size result;
+
+ result = sizeof(RepackedRels);
+
+ result += hash_estimate_size(MAX_REPACKED_RELS, sizeof(RepackedRel));
+ return result;
}
void
RepackShmemInit(void)
{
+ bool found;
HASHCTL info;
+ repackedRels = ShmemInitStruct("Repacked Relations",
+ sizeof(RepackedRels),
+ &found);
+ if (!IsUnderPostmaster)
+ {
+ Assert(!found);
+ pg_atomic_init_u32(&repackedRels->nrels, 0);
+ }
+ else
+ Assert(found);
+
info.keysize = sizeof(RepackedRel);
info.entrysize = info.keysize;
-
- RepackedRelsHash = ShmemInitHash("Repacked Relations",
- MAX_REPACKED_RELS,
- MAX_REPACKED_RELS,
- &info,
- HASH_ELEM | HASH_BLOBS);
+ repackedRels->hashtable = ShmemInitHash("Repacked Relations Hash",
+ MAX_REPACKED_RELS,
+ MAX_REPACKED_RELS,
+ &info,
+ HASH_ELEM | HASH_BLOBS);
}
/*
@@ -2266,13 +2291,14 @@ begin_concurrent_repack(Relation rel, Relation *index_p, bool *entered_p)
*entry;
bool found;
static bool before_shmem_exit_callback_setup = false;
+ uint32 nrels PG_USED_FOR_ASSERTS_ONLY;
relid = RelationGetRelid(rel);
index = index_p ? *index_p : NULL;
/*
- * Make sure that we do not leave an entry in RepackedRelsHash if exiting
- * due to FATAL.
+ * Make sure that we do not leave an entry in repackedRels->hashtable if
+ * exiting due to FATAL.
*/
if (!before_shmem_exit_callback_setup)
{
@@ -2287,7 +2313,7 @@ begin_concurrent_repack(Relation rel, Relation *index_p, bool *entered_p)
*entered_p = false;
LWLockAcquire(RepackedRelsLock, LW_EXCLUSIVE);
entry = (RepackedRel *)
- hash_search(RepackedRelsHash, &key, HASH_ENTER_NULL, &found);
+ hash_search(repackedRels->hashtable, &key, HASH_ENTER_NULL, &found);
if (found)
{
/*
@@ -2305,9 +2331,13 @@ begin_concurrent_repack(Relation rel, Relation *index_p, bool *entered_p)
(errmsg("too many requests for REPACK CONCURRENTLY at a time")),
(errhint("Please consider increasing the \"max_replication_slots\" configuration parameter.")));
+ /* Increment the number of relations. */
+ nrels = pg_atomic_fetch_add_u32(&repackedRels->nrels, 1);
+ Assert(nrels < MAX_REPACKED_RELS);
+
/*
- * Even if anything fails below, the caller has to do cleanup in the
- * shared memory.
+ * Even if the insertion of the TOAST relid fails below, the caller still
+ * has to do the cleanup.
*/
*entered_p = true;
@@ -2389,6 +2419,7 @@ end_concurrent_repack(bool error)
RepackedRel key;
RepackedRel *entry = NULL;
Oid relid = repacked_rel;
+ uint32 nrels PG_USED_FOR_ASSERTS_ONLY;
/* Remove the relation from the hash if we managed to insert one. */
if (OidIsValid(repacked_rel))
@@ -2397,7 +2428,8 @@ end_concurrent_repack(bool error)
key.relid = repacked_rel;
key.dbid = MyDatabaseId;
LWLockAcquire(RepackedRelsLock, LW_EXCLUSIVE);
- entry = hash_search(RepackedRelsHash, &key, HASH_REMOVE, NULL);
+ entry = hash_search(repackedRels->hashtable, &key, HASH_REMOVE,
+ NULL);
LWLockRelease(RepackedRelsLock);
/*
@@ -2426,6 +2458,10 @@ end_concurrent_repack(bool error)
* cluster_before_shmem_exit_callback().
*/
repacked_rel = InvalidOid;
+
+ /* Decrement the number of relations. */
+ nrels = pg_atomic_fetch_sub_u32(&repackedRels->nrels, 1);
+ Assert(nrels > 0);
}
/*
@@ -2478,6 +2514,8 @@ cluster_before_shmem_exit_callback(int code, Datum arg)
/*
* Check if relation is currently being processed by REPACK CONCURRENTLY.
+ *
+ * If relid is InvalidOid, check if any relation is being processed.
*/
bool
is_concurrent_repack_in_progress(Oid relid)
@@ -2485,18 +2523,40 @@ is_concurrent_repack_in_progress(Oid relid)
RepackedRel key,
*entry;
+ /*
+ * If the caller is interested in whether any relation is being repacked,
+ * just use the counter.
+ */
+ if (!OidIsValid(relid))
+ {
+ if (pg_atomic_read_u32(&repackedRels->nrels) > 0)
+ return true;
+ else
+ return false;
+ }
+
+ /* For particular relation we need to search in the hashtable. */
memset(&key, 0, sizeof(key));
key.relid = relid;
key.dbid = MyDatabaseId;
LWLockAcquire(RepackedRelsLock, LW_SHARED);
entry = (RepackedRel *)
- hash_search(RepackedRelsHash, &key, HASH_FIND, NULL);
+ hash_search(repackedRels->hashtable, &key, HASH_FIND, NULL);
LWLockRelease(RepackedRelsLock);
return entry != NULL;
}
+/*
+ * Is this backend performing REPACK CONCURRENTLY?
+ */
+bool
+is_concurrent_repack_run_by_me(void)
+{
+ return OidIsValid(repacked_rel);
+}
+
/*
* This function is much like pg_create_logical_replication_slot() except that
* the new slot is neither released (if anyone else could read changes from
@@ -2524,8 +2584,8 @@ setup_logical_decoding(Oid relid, const char *slotname, TupleDesc tupdesc)
* useful for us.
*
* Regarding the value of need_full_snapshot, we pass false because the
- * table we are processing is present in RepackedRelsHash and therefore,
- * regarding logical decoding, treated like a catalog.
+ * table we are processing is present in repackedRels->hashtable and
+ * therefore, regarding logical decoding, treated like a catalog.
*/
ctx = CreateInitDecodingContext(REPL_PLUGIN_NAME,
NIL,
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index a8d2e024d34..4909432d585 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -30,6 +30,7 @@
#include "access/xact.h"
#include "access/xlogutils.h"
+#include "commands/cluster.h"
#include "fmgr.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -112,10 +113,12 @@ CheckLogicalDecodingRequirements(void)
/*
* NB: Adding a new requirement likely means that RestoreSlotFromDisk()
- * needs the same check.
+ * needs the same check. (Except that only temporary slots should be
+ * created for REPACK CONCURRENTLY, which effectively raises wal_level to
+ * LOGICAL.)
*/
-
- if (wal_level < WAL_LEVEL_LOGICAL)
+ if ((wal_level < WAL_LEVEL_LOGICAL && !is_concurrent_repack_run_by_me())
+ || wal_level < WAL_LEVEL_REPLICA)
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("logical decoding requires \"wal_level\" >= \"logical\"")));
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 5acb4508f85..413bcc1addb 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -1313,13 +1313,13 @@ LogStandbySnapshot(void)
* record. Fortunately this routine isn't executed frequently, and it's
* only a shared lock.
*/
- if (wal_level < WAL_LEVEL_LOGICAL)
+ if (!XLogLogicalInfoActive())
LWLockRelease(ProcArrayLock);
recptr = LogCurrentRunningXacts(running);
/* Release lock if we kept it longer ... */
- if (wal_level >= WAL_LEVEL_LOGICAL)
+ if (XLogLogicalInfoActive())
LWLockRelease(ProcArrayLock);
/* GetRunningTransactionData() acquired XidGenLock, we must release it */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index d313099c027..a325bb1d16b 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -95,6 +95,12 @@ typedef enum RecoveryState
extern PGDLLIMPORT int wal_level;
+/*
+ * wal_level_transient overrides wal_level if logical decoding needs to be
+ * enabled transiently.
+ */
+extern PGDLLIMPORT int wal_level_transient;
+
/* Is WAL archiving enabled (always or only while server is running normally)? */
#define XLogArchivingActive() \
(AssertMacro(XLogArchiveMode == ARCHIVE_MODE_OFF || wal_level >= WAL_LEVEL_REPLICA), XLogArchiveMode > ARCHIVE_MODE_OFF)
@@ -122,8 +128,13 @@ extern PGDLLIMPORT int wal_level;
/* Do we need to WAL-log information required only for Hot Standby and logical replication? */
#define XLogStandbyInfoActive() (wal_level >= WAL_LEVEL_REPLICA)
-/* Do we need to WAL-log information required only for logical replication? */
-#define XLogLogicalInfoActive() (wal_level >= WAL_LEVEL_LOGICAL)
+/*
+ * Do we need to WAL-log information required only for logical replication?
+ *
+ * wal_level_transient overrides wal_level if logical decoding needs to be
+ * active transiently.
+ */
+#define XLogLogicalInfoActive() (Max(wal_level, wal_level_transient) == WAL_LEVEL_LOGICAL)
#ifdef WAL_DEBUG
extern PGDLLIMPORT bool XLOG_DEBUG;
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index be283c70fce..0267357a261 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -172,6 +172,7 @@ extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
extern Size RepackShmemSize(void);
extern void RepackShmemInit(void);
extern bool is_concurrent_repack_in_progress(Oid relid);
+extern bool is_concurrent_repack_run_by_me(void);
extern void repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel);
#endif /* CLUSTER_H */
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 372065fc570..fcbad5c1720 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -710,12 +710,16 @@ RelationCloseSmgr(Relation relation)
* it would complicate decoding slightly for little gain). Note that we *do*
* log information for user defined catalog tables since they presumably are
* interesting to the user...
+ *
+ * If particular relations require it, logical decoding can be active even
+ * if wal_level is REPLICA. In that case, do not log the other relations.
*/
#define RelationIsLogicallyLogged(relation) \
(XLogLogicalInfoActive() && \
RelationNeedsWAL(relation) && \
(relation)->rd_rel->relkind != RELKIND_FOREIGN_TABLE && \
- !IsCatalogRelation(relation))
+ !IsCatalogRelation(relation) && \
+ (wal_level == WAL_LEVEL_LOGICAL || (relation)->rd_repack_concurrent))
/* routines in utils/cache/relcache.c */
extern void RelationIncrementReferenceCount(Relation rel);
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 405d0811b4f..4f6c0ca3a8a 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -15,7 +15,6 @@ REGRESS = injection_points hashagg reindex_conc
REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
ISOLATION = basic inplace syscache-update-pruned repack
-ISOLATION_OPTS = --temp-config $(top_srcdir)/src/test/modules/injection_points/logical.conf
TAP_TESTS = 1
diff --git a/src/test/modules/injection_points/logical.conf b/src/test/modules/injection_points/logical.conf
deleted file mode 100644
index c8f264bc6cb..00000000000
--- a/src/test/modules/injection_points/logical.conf
+++ /dev/null
@@ -1 +0,0 @@
-wal_level = logical
\ No newline at end of file
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 0e3c47ba999..716e5619aa7 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -50,9 +50,6 @@ tests += {
'syscache-update-pruned',
],
'runningcheck': false, # see syscache-update-pruned
- # 'repack' requires wal_level = 'logical'.
- 'regress_args': ['--temp-config', files('logical.conf')],
-
},
'tap': {
'env': {
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e1e3e619c4b..b3be8572132 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2511,6 +2511,7 @@ ReorderBufferUpdateProgressTxnCB
ReorderTuple
RepOriginId
RepackedRel
+RepackedRels
RepackDecodingState
RepackStmt
ReparameterizeForeignPathByChild_function
--
2.43.5
Attachment: v11-0009-Call-logical_rewrite_heap_tuple-when-applying-concur.patch (text/x-diff)
From 7835af4736e6260aee9b2ba0e5e1c4975fe7d8f2 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Mon, 31 Mar 2025 15:47:08 +0200
Subject: [PATCH 9/9] Call logical_rewrite_heap_tuple() when applying
concurrent data changes.
This was implemented for the sake of completeness, but I think it's currently
not needed. Possible use cases could be:
1. REPACK CONCURRENTLY can process system catalogs.
System catalogs are scanned using a historic snapshot during logical decoding,
and the "combo CIDs" information is needed for that. Since "combo CID" is
associated with the "file locator" and that locator is changed by REPACK, this
command must record the information on individual tuples being moved from the
old file to the new one. This is what logical_rewrite_heap_tuple() does.
However, the logical decoding subsystem currently does not support decoding of
data changes in the system catalog. Therefore, the CONCURRENTLY option cannot
be used for system catalogs.
2. REPACK CONCURRENTLY is processing a relation, but once it has released all
the locks (in order to get the exclusive lock), another backend runs REPACK
CONCURRENTLY on the same table. Since the relation is treated as a system
catalog while these commands are processing it (so it can be scanned using a
historic snapshot during the "initial load"), it is important that the 2nd
backend does not break decoding of the "combo CIDs" performed by the 1st
backend.
However, it's not practical to let multiple backends run REPACK CONCURRENTLY
on the same relation, so we forbid that.
---
src/backend/access/heap/heapam_handler.c | 2 +-
src/backend/access/heap/rewriteheap.c | 65 +++++-----
src/backend/commands/cluster.c | 113 +++++++++++++++---
src/backend/replication/logical/decode.c | 42 ++++++-
.../pgoutput_repack/pgoutput_repack.c | 21 ++--
src/include/access/rewriteheap.h | 5 +-
src/include/commands/cluster.h | 3 +
src/include/replication/reorderbuffer.h | 7 ++
8 files changed, 198 insertions(+), 60 deletions(-)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 850708c7830..d7b0edc3bf8 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -734,7 +734,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
/* Initialize the rewrite operation */
rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, *xid_cutoff,
- *multi_cutoff);
+ *multi_cutoff, true);
/* Set up sorting if wanted */
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 6aa2ed214f2..83076b582d7 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -214,10 +214,8 @@ static void raw_heap_insert(RewriteState state, HeapTuple tup);
/* internal logical remapping prototypes */
static void logical_begin_heap_rewrite(RewriteState state);
-static void logical_rewrite_heap_tuple(RewriteState state, ItemPointerData old_tid, HeapTuple new_tuple);
static void logical_end_heap_rewrite(RewriteState state);
-
/*
* Begin a rewrite of a table
*
@@ -226,18 +224,19 @@ static void logical_end_heap_rewrite(RewriteState state);
* oldest_xmin xid used by the caller to determine which tuples are dead
* freeze_xid xid before which tuples will be frozen
* cutoff_multi multixact before which multis will be removed
+ * tid_chains need to maintain TID chains?
*
* Returns an opaque RewriteState, allocated in current memory context,
* to be used in subsequent calls to the other functions.
*/
RewriteState
begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xmin,
- TransactionId freeze_xid, MultiXactId cutoff_multi)
+ TransactionId freeze_xid, MultiXactId cutoff_multi,
+ bool tid_chains)
{
RewriteState state;
MemoryContext rw_cxt;
MemoryContext old_cxt;
- HASHCTL hash_ctl;
/*
* To ease cleanup, make a separate context that will contain the
@@ -262,29 +261,34 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
state->rs_cxt = rw_cxt;
state->rs_bulkstate = smgr_bulk_start_rel(new_heap, MAIN_FORKNUM);
- /* Initialize hash tables used to track update chains */
- hash_ctl.keysize = sizeof(TidHashKey);
- hash_ctl.entrysize = sizeof(UnresolvedTupData);
- hash_ctl.hcxt = state->rs_cxt;
-
- state->rs_unresolved_tups =
- hash_create("Rewrite / Unresolved ctids",
- 128, /* arbitrary initial size */
- &hash_ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
- hash_ctl.entrysize = sizeof(OldToNewMappingData);
+ if (tid_chains)
+ {
+ HASHCTL hash_ctl;
+
+ /* Initialize hash tables used to track update chains */
+ hash_ctl.keysize = sizeof(TidHashKey);
+ hash_ctl.entrysize = sizeof(UnresolvedTupData);
+ hash_ctl.hcxt = state->rs_cxt;
+
+ state->rs_unresolved_tups =
+ hash_create("Rewrite / Unresolved ctids",
+ 128, /* arbitrary initial size */
+ &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+ hash_ctl.entrysize = sizeof(OldToNewMappingData);
+
+ state->rs_old_new_tid_map =
+ hash_create("Rewrite / Old to new tid map",
+ 128, /* arbitrary initial size */
+ &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ }
- state->rs_old_new_tid_map =
- hash_create("Rewrite / Old to new tid map",
- 128, /* arbitrary initial size */
- &hash_ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ logical_begin_heap_rewrite(state);
MemoryContextSwitchTo(old_cxt);
- logical_begin_heap_rewrite(state);
-
return state;
}
@@ -303,12 +307,15 @@ end_heap_rewrite(RewriteState state)
* Write any remaining tuples in the UnresolvedTups table. If we have any
* left, they should in fact be dead, but let's err on the safe side.
*/
- hash_seq_init(&seq_status, state->rs_unresolved_tups);
-
- while ((unresolved = hash_seq_search(&seq_status)) != NULL)
+ if (state->rs_unresolved_tups)
{
- ItemPointerSetInvalid(&unresolved->tuple->t_data->t_ctid);
- raw_heap_insert(state, unresolved->tuple);
+ hash_seq_init(&seq_status, state->rs_unresolved_tups);
+
+ while ((unresolved = hash_seq_search(&seq_status)) != NULL)
+ {
+ ItemPointerSetInvalid(&unresolved->tuple->t_data->t_ctid);
+ raw_heap_insert(state, unresolved->tuple);
+ }
}
/* Write the last page, if any */
@@ -995,7 +1002,7 @@ logical_rewrite_log_mapping(RewriteState state, TransactionId xid,
* Perform logical remapping for a tuple that's mapped from old_tid to
* new_tuple->t_self by rewrite_heap_tuple() if necessary for the tuple.
*/
-static void
+void
logical_rewrite_heap_tuple(RewriteState state, ItemPointerData old_tid,
HeapTuple new_tuple)
{
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 734e47eaba3..e4b35b10884 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -23,6 +23,7 @@
#include "access/heapam.h"
#include "access/multixact.h"
#include "access/relscan.h"
+#include "access/rewriteheap.h"
#include "access/tableam.h"
#include "access/toast_internals.h"
#include "access/transam.h"
@@ -161,17 +162,21 @@ static HeapTuple get_changed_tuple(char *change);
static void apply_concurrent_changes(RepackDecodingState *dstate,
Relation rel, ScanKey key, int nkeys,
IndexInsertState *iistate,
- struct timeval *must_complete);
+ struct timeval *must_complete,
+ RewriteState rwstate);
static void apply_concurrent_insert(Relation rel, ConcurrentChange *change,
HeapTuple tup, IndexInsertState *iistate,
- TupleTableSlot *index_slot);
+ TupleTableSlot *index_slot,
+ RewriteState rwstate);
static void apply_concurrent_update(Relation rel, HeapTuple tup,
HeapTuple tup_target,
ConcurrentChange *change,
IndexInsertState *iistate,
- TupleTableSlot *index_slot);
+ TupleTableSlot *index_slot,
+ RewriteState rwstate);
static void apply_concurrent_delete(Relation rel, HeapTuple tup_target,
- ConcurrentChange *change);
+ ConcurrentChange *change,
+ RewriteState rwstate);
static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
HeapTuple tup_key,
Snapshot snapshot,
@@ -185,7 +190,8 @@ static bool process_concurrent_changes(LogicalDecodingContext *ctx,
ScanKey ident_key,
int ident_key_nentries,
IndexInsertState *iistate,
- struct timeval *must_complete);
+ struct timeval *must_complete,
+ RewriteState rwstate);
static bool processing_time_elapsed(struct timeval *must_complete);
static IndexInsertState *get_index_insert_state(Relation relation,
Oid ident_index_id);
@@ -2746,7 +2752,7 @@ repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
static void
apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
ScanKey key, int nkeys, IndexInsertState *iistate,
- struct timeval *must_complete)
+ struct timeval *must_complete, RewriteState rwstate)
{
TupleTableSlot *index_slot,
*ident_slot;
@@ -2821,7 +2827,8 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
{
Assert(tup_old == NULL);
- apply_concurrent_insert(rel, &change, tup, iistate, index_slot);
+ apply_concurrent_insert(rel, &change, tup, iistate, index_slot,
+ rwstate);
pfree(tup);
}
@@ -2829,7 +2836,8 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
change.kind == CHANGE_DELETE)
{
IndexScanDesc ind_scan = NULL;
- HeapTuple tup_key;
+ HeapTuple tup_key,
+ tup_exist_cp;
if (change.kind == CHANGE_UPDATE_NEW)
{
@@ -2871,11 +2879,23 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
if (tup_exist == NULL)
elog(ERROR, "Failed to find target tuple");
+ /*
+ * Update the mapping for xmax of the old version.
+ *
+ * Use a copy ('tup_exist' can point to shared buffer) with xmin
+ * invalid because mapping of that should have been written on
+ * insertion.
+ */
+ tup_exist_cp = heap_copytuple(tup_exist);
+ HeapTupleHeaderSetXmin(tup_exist_cp->t_data, InvalidTransactionId);
+ logical_rewrite_heap_tuple(rwstate, change.old_tid, tup_exist_cp);
+ pfree(tup_exist_cp);
+
if (change.kind == CHANGE_UPDATE_NEW)
apply_concurrent_update(rel, tup, tup_exist, &change, iistate,
- index_slot);
+ index_slot, rwstate);
else
- apply_concurrent_delete(rel, tup_exist, &change);
+ apply_concurrent_delete(rel, tup_exist, &change, rwstate);
ResetRepackCurrentXids();
@@ -2928,9 +2948,12 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
static void
apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
- IndexInsertState *iistate, TupleTableSlot *index_slot)
+ IndexInsertState *iistate, TupleTableSlot *index_slot,
+ RewriteState rwstate)
{
+ HeapTupleHeader tup_hdr = tup->t_data;
Snapshot snapshot = change->snapshot;
+ ItemPointerData old_tid;
List *recheck;
/*
@@ -2940,6 +2963,9 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
*/
SetRepackCurrentXids(snapshot->subxip, snapshot->subxcnt);
+ /* Remember location in the old heap. */
+ ItemPointerCopy(&tup_hdr->t_ctid, &old_tid);
+
/*
* Write the tuple into the new heap.
*
@@ -2955,6 +2981,14 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
heap_insert(rel, tup, change->xid, snapshot->curcid - 1,
HEAP_INSERT_NO_LOGICAL, NULL);
+ /*
+ * Update the mapping for xmin (xmax should be invalid). This is needed
+ * because, during the processing, the table is considered a "user
+ * catalog".
+ */
+ Assert(!TransactionIdIsValid(HeapTupleHeaderGetRawXmax(tup->t_data)));
+ logical_rewrite_heap_tuple(rwstate, old_tid, tup);
+
/*
* Update indexes.
*
@@ -2988,15 +3022,23 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
static void
apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
ConcurrentChange *change, IndexInsertState *iistate,
- TupleTableSlot *index_slot)
+ TupleTableSlot *index_slot, RewriteState rwstate)
{
List *recheck;
LockTupleMode lockmode;
TU_UpdateIndexes update_indexes;
+ ItemPointerData tid_new_old_heap,
+ tid_old_new_heap;
TM_Result res;
Snapshot snapshot = change->snapshot;
TM_FailureData tmfd;
+ /* Location of the new tuple in the old heap. */
+ ItemPointerCopy(&tup->t_data->t_ctid, &tid_new_old_heap);
+
+ /* Location of the existing tuple in the new heap. */
+ ItemPointerCopy(&tup_target->t_self, &tid_old_new_heap);
+
/*
* Write the new tuple into the new heap. ('tup' gets the TID assigned
* here.)
@@ -3006,7 +3048,7 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
Assert(snapshot->curcid != InvalidCommandId &&
snapshot->curcid > FirstCommandId);
- res = heap_update(rel, &tup_target->t_self, tup,
+ res = heap_update(rel, &tid_old_new_heap, tup,
change->xid, snapshot->curcid - 1,
InvalidSnapshot,
false, /* no wait - only we are doing changes */
@@ -3016,6 +3058,10 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
if (res != TM_Ok)
ereport(ERROR, (errmsg("failed to apply concurrent UPDATE")));
+ /* Update the mapping for xmin of the new version. */
+ Assert(!TransactionIdIsValid(HeapTupleHeaderGetRawXmax(tup->t_data)));
+ logical_rewrite_heap_tuple(rwstate, tid_new_old_heap, tup);
+
ExecStoreHeapTuple(tup, index_slot, false);
if (update_indexes != TU_None)
@@ -3039,8 +3085,9 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
static void
apply_concurrent_delete(Relation rel, HeapTuple tup_target,
- ConcurrentChange *change)
+ ConcurrentChange *change, RewriteState rwstate)
{
+ ItemPointerData tid_old_new_heap;
TM_Result res;
TM_FailureData tmfd;
Snapshot snapshot = change->snapshot;
@@ -3049,7 +3096,10 @@ apply_concurrent_delete(Relation rel, HeapTuple tup_target,
Assert(snapshot->curcid != InvalidCommandId &&
snapshot->curcid > FirstCommandId);
- res = heap_delete(rel, &tup_target->t_self, change->xid,
+ /* Location of the existing tuple in the new heap. */
+ ItemPointerCopy(&tup_target->t_self, &tid_old_new_heap);
+
+ res = heap_delete(rel, &tid_old_new_heap, change->xid,
snapshot->curcid - 1, InvalidSnapshot, false,
&tmfd, false,
/* wal_logical */
@@ -3131,7 +3181,8 @@ static bool
process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
Relation rel_dst, Relation rel_src, ScanKey ident_key,
int ident_key_nentries, IndexInsertState *iistate,
- struct timeval *must_complete)
+ struct timeval *must_complete,
+ RewriteState rwstate)
{
RepackDecodingState *dstate;
@@ -3164,7 +3215,8 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
rel_dst->rd_toastoid = rel_src->rd_rel->reltoastrelid;
apply_concurrent_changes(dstate, rel_dst, ident_key,
- ident_key_nentries, iistate, must_complete);
+ ident_key_nentries, iistate, must_complete,
+ rwstate);
}
PG_FINALLY();
{
@@ -3349,6 +3401,7 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
Oid ident_idx_old,
ident_idx_new;
IndexInsertState *iistate;
+ RewriteState rwstate;
ScanKey ident_key;
int ident_key_nentries;
XLogRecPtr wal_insert_ptr,
@@ -3436,11 +3489,27 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
* Apply concurrent changes first time, to minimize the time we need to
* hold AccessExclusiveLock. (Quite some amount of WAL could have been
* written during the data copying and index creation.)
+ *
+ * Now we are processing individual tuples, so pass false for
+ * 'tid_chains'. Since rwstate is now only needed for
+ * logical_begin_heap_rewrite(), none of the transaction IDs needs to be
+ * valid.
*/
+ rwstate = begin_heap_rewrite(OldHeap, NewHeap,
+ InvalidTransactionId,
+ InvalidTransactionId,
+ InvalidTransactionId,
+ false);
process_concurrent_changes(ctx, end_of_wal, NewHeap,
swap_toast_by_content ? OldHeap : NULL,
ident_key, ident_key_nentries, iistate,
- NULL);
+ NULL, rwstate);
+
+ /*
+ * OldHeap will be closed, so we need to initialize rwstate again for the
+ * next call of process_concurrent_changes().
+ */
+ end_heap_rewrite(rwstate);
/*
* Acquire AccessExclusiveLock on the table, its TOAST relation (if there
@@ -3528,6 +3597,11 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
end_of_wal = GetFlushRecPtr(NULL);
/* Apply the concurrent changes again. */
+ rwstate = begin_heap_rewrite(OldHeap, NewHeap,
+ InvalidTransactionId,
+ InvalidTransactionId,
+ InvalidTransactionId,
+ false);
/*
* This time we have the exclusive lock on the table, so make sure that
@@ -3557,11 +3631,12 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
if (!process_concurrent_changes(ctx, end_of_wal, NewHeap,
swap_toast_by_content ? OldHeap : NULL,
ident_key, ident_key_nentries, iistate,
- t_end_ptr))
+ t_end_ptr, rwstate))
ereport(ERROR,
(errmsg("could not process concurrent data changes in time"),
errhint("Please consider adjusting \"repack_max_xlock_time\".")));
+ end_heap_rewrite(rwstate);
/* Remember info about rel before closing OldHeap */
relpersistence = OldHeap->rd_rel->relpersistence;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 25bb92b33f2..6f4a5f5b95b 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -984,11 +984,13 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
xl_heap_insert *xlrec;
ReorderBufferChange *change;
RelFileLocator target_locator;
+ BlockNumber blknum;
+ HeapTupleHeader tuphdr;
xlrec = (xl_heap_insert *) XLogRecGetData(r);
/* only interested in our database */
- XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
+ XLogRecGetBlockTag(r, 0, &target_locator, NULL, &blknum);
if (target_locator.dbOid != ctx->slot->data.database)
return;
@@ -1013,6 +1015,13 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
DecodeXLogTuple(tupledata, datalen, change->data.tp.newtuple);
+ /*
+ * CTID is needed for logical_rewrite_heap_tuple(), when doing REPACK
+ * CONCURRENTLY.
+ */
+ tuphdr = change->data.tp.newtuple->t_data;
+ ItemPointerSet(&tuphdr->t_ctid, blknum, xlrec->offnum);
+
change->data.tp.clear_toast_afterwards = true;
ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
@@ -1034,11 +1043,15 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
ReorderBufferChange *change;
char *data;
RelFileLocator target_locator;
+ BlockNumber old_blknum,
+ new_blknum;
xlrec = (xl_heap_update *) XLogRecGetData(r);
+ /* Retrieve blknum, so that we can compose CTID below. */
+ XLogRecGetBlockTag(r, 0, &target_locator, NULL, &new_blknum);
+
/* only interested in our database */
- XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
if (target_locator.dbOid != ctx->slot->data.database)
return;
@@ -1055,6 +1068,7 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
{
Size datalen;
Size tuplelen;
+ HeapTupleHeader tuphdr;
data = XLogRecGetBlockData(r, 0, &datalen);
@@ -1064,6 +1078,13 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
ReorderBufferAllocTupleBuf(ctx->reorder, tuplelen);
DecodeXLogTuple(data, datalen, change->data.tp.newtuple);
+
+ /*
+ * CTID is needed for logical_rewrite_heap_tuple(), when doing REPACK
+ * CONCURRENTLY.
+ */
+ tuphdr = change->data.tp.newtuple->t_data;
+ ItemPointerSet(&tuphdr->t_ctid, new_blknum, xlrec->new_offnum);
}
if (xlrec->flags & XLH_UPDATE_CONTAINS_OLD)
@@ -1082,6 +1103,14 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
DecodeXLogTuple(data, datalen, change->data.tp.oldtuple);
}
+ /*
+ * Remember the old tuple CTID, for the sake of
+ * logical_rewrite_heap_tuple().
+ */
+ if (!XLogRecGetBlockTagExtended(r, 1, NULL, NULL, &old_blknum, NULL))
+ old_blknum = new_blknum;
+ ItemPointerSet(&change->data.tp.old_tid, old_blknum, xlrec->old_offnum);
+
change->data.tp.clear_toast_afterwards = true;
ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
@@ -1100,11 +1129,12 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
xl_heap_delete *xlrec;
ReorderBufferChange *change;
RelFileLocator target_locator;
+ BlockNumber blknum;
xlrec = (xl_heap_delete *) XLogRecGetData(r);
/* only interested in our database */
- XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
+ XLogRecGetBlockTag(r, 0, &target_locator, NULL, &blknum);
if (target_locator.dbOid != ctx->slot->data.database)
return;
@@ -1136,6 +1166,12 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
DecodeXLogTuple((char *) xlrec + SizeOfHeapDelete,
datalen, change->data.tp.oldtuple);
+
+ /*
+ * CTID is needed for logical_rewrite_heap_tuple(), when doing REPACK
+ * CONCURRENTLY.
+ */
+ ItemPointerSet(&change->data.tp.old_tid, blknum, xlrec->offnum);
}
change->data.tp.clear_toast_afterwards = true;
diff --git a/src/backend/replication/pgoutput_repack/pgoutput_repack.c b/src/backend/replication/pgoutput_repack/pgoutput_repack.c
index 28bd16f9cc7..24d9c9c4884 100644
--- a/src/backend/replication/pgoutput_repack/pgoutput_repack.c
+++ b/src/backend/replication/pgoutput_repack/pgoutput_repack.c
@@ -33,7 +33,7 @@ static void plugin_truncate(struct LogicalDecodingContext *ctx,
ReorderBufferChange *change);
static void store_change(LogicalDecodingContext *ctx,
ConcurrentChangeKind kind, HeapTuple tuple,
- TransactionId xid);
+ TransactionId xid, ItemPointer old_tid);
void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
@@ -168,7 +168,8 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (newtuple == NULL)
elog(ERROR, "Incomplete insert info.");
- store_change(ctx, CHANGE_INSERT, newtuple, change->txn->xid);
+ store_change(ctx, CHANGE_INSERT, newtuple, change->txn->xid,
+ NULL);
}
break;
case REORDER_BUFFER_CHANGE_UPDATE:
@@ -186,10 +187,10 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (oldtuple != NULL)
store_change(ctx, CHANGE_UPDATE_OLD, oldtuple,
- change->txn->xid);
+ change->txn->xid, NULL);
store_change(ctx, CHANGE_UPDATE_NEW, newtuple,
- change->txn->xid);
+ change->txn->xid, &change->data.tp.old_tid);
}
break;
case REORDER_BUFFER_CHANGE_DELETE:
@@ -202,7 +203,8 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (oldtuple == NULL)
elog(ERROR, "Incomplete delete info.");
- store_change(ctx, CHANGE_DELETE, oldtuple, change->txn->xid);
+ store_change(ctx, CHANGE_DELETE, oldtuple, change->txn->xid,
+ &change->data.tp.old_tid);
}
break;
default:
@@ -236,13 +238,13 @@ plugin_truncate(struct LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (i == nrelations)
return;
- store_change(ctx, CHANGE_TRUNCATE, NULL, InvalidTransactionId);
+ store_change(ctx, CHANGE_TRUNCATE, NULL, InvalidTransactionId, NULL);
}
/* Store concurrent data change. */
static void
store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
- HeapTuple tuple, TransactionId xid)
+ HeapTuple tuple, TransactionId xid, ItemPointer old_tid)
{
RepackDecodingState *dstate;
char *change_raw;
@@ -317,6 +319,11 @@ store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
change.snapshot = dstate->snapshot;
dstate->snapshot->active_count++;
+ if (old_tid)
+ ItemPointerCopy(old_tid, &change.old_tid);
+ else
+ ItemPointerSetInvalid(&change.old_tid);
+
/* The data has been copied. */
if (flattened)
pfree(tuple);
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 99c3f362adc..eebda35c7cb 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -23,11 +23,14 @@ typedef struct RewriteStateData *RewriteState;
extern RewriteState begin_heap_rewrite(Relation old_heap, Relation new_heap,
TransactionId oldest_xmin, TransactionId freeze_xid,
- MultiXactId cutoff_multi);
+ MultiXactId cutoff_multi, bool tid_chains);
extern void end_heap_rewrite(RewriteState state);
extern void rewrite_heap_tuple(RewriteState state, HeapTuple old_tuple,
HeapTuple new_tuple);
extern bool rewrite_heap_dead_tuple(RewriteState state, HeapTuple old_tuple);
+extern void logical_rewrite_heap_tuple(RewriteState state,
+ ItemPointerData old_tid,
+ HeapTuple new_tuple);
/*
* On-Disk data format for an individual logical rewrite mapping.
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 0267357a261..45cd3fe4276 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -78,6 +78,9 @@ typedef struct ConcurrentChange
/* Transaction that changes the data. */
TransactionId xid;
+ /* For UPDATE / DELETE, the location of the old tuple version. */
+ ItemPointerData old_tid;
+
/*
* Historic catalog snapshot that was used to decode this change.
*/
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 3be0cbd7ebe..c2731947b22 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -104,6 +104,13 @@ typedef struct ReorderBufferChange
HeapTuple oldtuple;
/* valid for INSERT || UPDATE */
HeapTuple newtuple;
+
+ /*
+ * REPACK CONCURRENTLY needs the old TID, even if the old tuple
+ * itself is not WAL-logged (i.e. when the identity key does not
+ * change).
+ */
+ ItemPointerData old_tid;
} tp;
/*
--
2.43.5
Antonin Houska <ah@cybertec.at> wrote:
Antonin Houska <ah@cybertec.at> wrote:
Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
On 2025-Mar-22, Antonin Houska wrote:
Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
I rebased this patch series; here's v09. No substantive changes from v08.
I made sure the tree still compiles after each commit.

I rebased again, fixing a compiler warning reported by CI and applying
pgindent to each individual patch. I'm slowly starting to become more
familiar with the whole of this new code.

I'm trying to reflect Robert's suggestions about locking [1]. The next version
should be a bit simpler, so maybe wait for it before you continue studying the
code.

This is it.
One more version, hopefully to make cfbot happy (I missed the bug because I
did not set the RELCACHE_FORCE_RELEASE macro in my environment.)
Besides that, it occurred to me that 0005 ("Preserve visibility information of
the concurrent data changes.") will probably introduce significant
overhead. The problem is that the table we're repacking is treated like a
catalog table, so that reorderbuffer.c can generate the snapshots we need to
replay UPDATE / DELETE commands on the new table.
contrib/test_decoding can be used to demonstrate the difference between
ordinary and catalog tables:
CREATE TABLE t(i int);
SELECT pg_create_logical_replication_slot('test_slot', 'test_decoding');
INSERT INTO t SELECT n FROM generate_series(1, 1000000) g(n);
DELETE FROM t;
EXPLAIN ANALYZE SELECT * FROM pg_logical_slot_get_binary_changes('test_slot', null, null);
...
Execution Time: 3521.190 ms
ALTER TABLE t SET (user_catalog_table = true);
INSERT INTO t SELECT n FROM generate_series(1, 1000000) g(n);
DELETE FROM t;
EXPLAIN ANALYZE SELECT * FROM pg_logical_slot_get_binary_changes('test_slot', null, null);
...
Execution Time: 6561.634 ms
I wanted to avoid the "MVCC unsafety" [1], so that both REPACK and REPACK
CONCURRENTLY work "cleanly". We can try to optimize the logical decoding
for REPACK CONCURRENTLY, or implement 0005 in a different way, but I'm not sure
how much effort that would require. Or should we implement REPACK CONCURRENTLY
as MVCC-unsafe for now? (The pg_squeeze extension also is not MVCC-safe.)
[1]: https://www.postgresql.org/docs/17/mvcc-caveats.html
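For reference, the MVCC-safety caveat under discussion is the one already documented for table-rewriting DDL: a transaction whose snapshot predates the rewrite sees the table as empty afterwards. A sketch with TRUNCATE, which the caveats page lists as not MVCC-safe (session markers are comments, not SQL; table names are made up):

```sql
-- Session 1: freeze a snapshot before the rewrite.
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT count(*) FROM other_tab;   -- any query fixes the snapshot

-- Session 2: rewrite the table.
TRUNCATE t;

-- Session 1: the old snapshot "should" see the pre-TRUNCATE rows,
-- but the table appears empty because the rewrite is not MVCC-safe.
SELECT count(*) FROM t;
COMMIT;
```

Making REPACK CONCURRENTLY MVCC-unsafe would mean accepting the same kind of anomaly for repacked tables.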
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Attachments:
v12-0001-Add-REPACK-command.patch (text/x-diff)
From 2dbf5df9f3a2c7839459865f2e2fc83a1859ce23 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Tue, 1 Apr 2025 13:48:57 +0200
Subject: [PATCH 1/9] Add REPACK command.
The existing CLUSTER command as well as VACUUM with the FULL option both
reclaim unused space by rewriting the table. Now that we want to enhance this
functionality (in particular, by adding a new option CONCURRENTLY), we should
enhance both commands because they are both implemented by the same function
(cluster.c:cluster_rel). However, adding the same option to two different
commands is not very user-friendly. Therefore it was decided to create a new
command and to declare both the CLUSTER command and the FULL option of VACUUM
deprecated. Future enhancements to this rewriting code will only affect the
new command.
Like CLUSTER, the REPACK command reorders the table according to the specified
index. Unlike CLUSTER, REPACK does not require the index: if only the table is
specified, the command acts as VACUUM FULL. As we don't want to remove CLUSTER
and VACUUM FULL yet, there are three callers of the cluster_rel() function
now: REPACK, CLUSTER and VACUUM FULL. When we need to distinguish who is
calling this function (mostly for logging, but also for progress reporting),
we can no longer use the OID of the clustering index: both REPACK and VACUUM
FULL can pass InvalidOid. Therefore, this patch introduces a new enumeration
type ClusterCommand, and adds an argument of this type to the cluster_rel()
function and to all the functions that need to distinguish the caller.
Like CLUSTER and VACUUM FULL, the REPACK command without arguments processes
all the tables on which the current user has the MAINTAIN privilege.
A new view, pg_stat_progress_repack, is added to monitor the progress of
REPACK. Currently it displays the same information as pg_stat_progress_cluster
(except that column names might differ), but it'll also display the status of
the REPACK CONCURRENTLY command in the future, so the view definitions will
eventually diverge.
Regarding user documentation, the patch moves the information on clustering
from cluster.sgml to the new file repack.sgml. cluster.sgml now contains a
link that points to the related section of repack.sgml. A note on deprecation
and a link to repack.sgml are added to both cluster.sgml and vacuum.sgml.
---
doc/src/sgml/monitoring.sgml | 230 +++++++++++
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/cluster.sgml | 79 +---
doc/src/sgml/ref/repack.sgml | 254 ++++++++++++
doc/src/sgml/ref/vacuum.sgml | 8 +
doc/src/sgml/reference.sgml | 1 +
src/backend/access/heap/heapam_handler.c | 32 +-
src/backend/catalog/index.c | 2 +-
src/backend/catalog/system_views.sql | 27 ++
src/backend/commands/cluster.c | 496 +++++++++++++++++------
src/backend/commands/tablecmds.c | 3 +-
src/backend/commands/vacuum.c | 3 +-
src/backend/parser/gram.y | 63 ++-
src/backend/tcop/utility.c | 9 +
src/backend/utils/adt/pgstatfuncs.c | 2 +
src/bin/psql/tab-complete.in.c | 31 +-
src/include/commands/cluster.h | 22 +-
src/include/commands/progress.h | 60 ++-
src/include/nodes/parsenodes.h | 13 +
src/include/parser/kwlist.h | 1 +
src/include/tcop/cmdtaglist.h | 1 +
src/include/utils/backend_progress.h | 1 +
src/test/regress/expected/cluster.out | 180 ++++++++
src/test/regress/expected/rules.out | 27 ++
src/test/regress/sql/cluster.sql | 73 ++++
src/tools/pgindent/typedefs.list | 2 +
26 files changed, 1385 insertions(+), 236 deletions(-)
create mode 100644 doc/src/sgml/ref/repack.sgml
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index a6d67d2fbaa..0a6229c391a 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -400,6 +400,14 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_progress_repack</structname><indexterm><primary>pg_stat_progress_repack</primary></indexterm></entry>
+ <entry>One row for each backend running
+ <command>REPACK</command>, showing current progress. See
+ <xref linkend="repack-progress-reporting"/>.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_progress_basebackup</structname><indexterm><primary>pg_stat_progress_basebackup</primary></indexterm></entry>
<entry>One row for each WAL sender process streaming a base backup,
@@ -5940,6 +5948,228 @@ FROM pg_stat_get_backend_idset() AS backendid;
</table>
</sect2>
+ <sect2 id="repack-progress-reporting">
+ <title>REPACK Progress Reporting</title>
+
+ <indexterm>
+ <primary>pg_stat_progress_repack</primary>
+ </indexterm>
+
+ <para>
+ Whenever <command>REPACK</command> is running,
+ the <structname>pg_stat_progress_repack</structname> view will contain a
+ row for each backend that is currently running the command. The tables
+ below describe the information that will be reported and provide
+ information about how to interpret it.
+ </para>
+
+ <table id="pg-stat-progress-repack-view" xreflabel="pg_stat_progress_repack">
+ <title><structname>pg_stat_progress_repack</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>pid</structfield> <type>integer</type>
+ </para>
+ <para>
+ Process ID of backend.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>datid</structfield> <type>oid</type>
+ </para>
+ <para>
+ OID of the database to which this backend is connected.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>datname</structfield> <type>name</type>
+ </para>
+ <para>
+ Name of the database to which this backend is connected.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>relid</structfield> <type>oid</type>
+ </para>
+ <para>
+ OID of the table being repacked.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>command</structfield> <type>text</type>
+ </para>
+ <para>
+ The command that is running. Currently, the only value
+ is <literal>REPACK</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>phase</structfield> <type>text</type>
+ </para>
+ <para>
+ Current processing phase. See <xref linkend="repack-phases"/>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>repack_index_relid</structfield> <type>oid</type>
+ </para>
+ <para>
+ If the table is being scanned using an index, this is the OID of the
+ index being used; otherwise, it is zero.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_scanned</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples scanned.
+ This counter only advances when the phase is
+ <literal>seq scanning heap</literal>,
+ <literal>index scanning heap</literal>
+ or <literal>writing new heap</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_written</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples written.
+ This counter only advances when the phase is
+ <literal>seq scanning heap</literal>,
+ <literal>index scanning heap</literal>
+ or <literal>writing new heap</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_blks_total</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Total number of heap blocks in the table. This number is reported
+ as of the beginning of <literal>seq scanning heap</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_blks_scanned</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap blocks scanned. This counter only advances when the
+ phase is <literal>seq scanning heap</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>index_rebuild_count</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of indexes rebuilt. This counter only advances when the phase
+ is <literal>rebuilding index</literal>.
+ </para></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <table id="repack-phases">
+ <title>REPACK Phases</title>
+ <tgroup cols="2">
+ <colspec colname="col1" colwidth="1*"/>
+ <colspec colname="col2" colwidth="2*"/>
+ <thead>
+ <row>
+ <entry>Phase</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry><literal>initializing</literal></entry>
+ <entry>
+ The command is preparing to begin scanning the heap. This phase is
+ expected to be very brief.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>seq scanning heap</literal></entry>
+ <entry>
+ The command is currently scanning the table using a sequential scan.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>index scanning heap</literal></entry>
+ <entry>
+ <command>REPACK</command> is currently scanning the table using an index scan.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>sorting tuples</literal></entry>
+ <entry>
+ <command>REPACK</command> is currently sorting tuples.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>writing new heap</literal></entry>
+ <entry>
+ <command>REPACK</command> is currently writing the new heap.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>swapping relation files</literal></entry>
+ <entry>
+ The command is currently swapping newly-built files into place.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>rebuilding index</literal></entry>
+ <entry>
+ The command is currently rebuilding an index.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>performing final cleanup</literal></entry>
+ <entry>
+ The command is performing final cleanup. When this phase is
+ completed, <command>REPACK</command> will end.
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+ </sect2>
+
<sect2 id="copy-progress-reporting">
<title>COPY Progress Reporting</title>
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..c0ef654fcb4 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -167,6 +167,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY refreshMaterializedView SYSTEM "refresh_materialized_view.sgml">
<!ENTITY reindex SYSTEM "reindex.sgml">
<!ENTITY releaseSavepoint SYSTEM "release_savepoint.sgml">
+<!ENTITY repack SYSTEM "repack.sgml">
<!ENTITY reset SYSTEM "reset.sgml">
<!ENTITY revoke SYSTEM "revoke.sgml">
<!ENTITY rollback SYSTEM "rollback.sgml">
diff --git a/doc/src/sgml/ref/cluster.sgml b/doc/src/sgml/ref/cluster.sgml
index 8811f169ea0..54bb2362c84 100644
--- a/doc/src/sgml/ref/cluster.sgml
+++ b/doc/src/sgml/ref/cluster.sgml
@@ -42,17 +42,23 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
<replaceable class="parameter">table_name</replaceable>.
</para>
- <para>
- When a table is clustered, it is physically reordered
- based on the index information. Clustering is a one-time operation:
- when the table is subsequently updated, the changes are
- not clustered. That is, no attempt is made to store new or
- updated rows according to their index order. (If one wishes, one can
- periodically recluster by issuing the command again. Also, setting
- the table's <literal>fillfactor</literal> storage parameter to less than
- 100% can aid in preserving cluster ordering during updates, since updated
- rows are kept on the same page if enough space is available there.)
- </para>
+ <warning>
+ <para>
+ The <command>CLUSTER</command> command is deprecated in favor of
+ <xref linkend="sql-repack"/>.
+ </para>
+ </warning>
+
+ <note>
+ <para>
+ <xref linkend="sql-repack-notes-on-clustering"/> explain how clustering
+ works, whether it is initiated by <command>CLUSTER</command> or
+ by <command>REPACK</command>. The notable difference between the two is
+ that <command>REPACK</command> does not remember the index used last
+ time. Thus if you don't specify an index, <command>REPACK</command>
+ rewrites the table but does not try to cluster it.
+ </para>
+ </note>
<para>
When a table is clustered, <productname>PostgreSQL</productname>
@@ -136,63 +142,12 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
on the table.
</para>
- <para>
- In cases where you are accessing single rows randomly
- within a table, the actual order of the data in the
- table is unimportant. However, if you tend to access some
- data more than others, and there is an index that groups
- them together, you will benefit from using <command>CLUSTER</command>.
- If you are requesting a range of indexed values from a table, or a
- single indexed value that has multiple rows that match,
- <command>CLUSTER</command> will help because once the index identifies the
- table page for the first row that matches, all other rows
- that match are probably already on the same table page,
- and so you save disk accesses and speed up the query.
- </para>
-
- <para>
- <command>CLUSTER</command> can re-sort the table using either an index scan
- on the specified index, or (if the index is a b-tree) a sequential
- scan followed by sorting. It will attempt to choose the method that
- will be faster, based on planner cost parameters and available statistical
- information.
- </para>
-
<para>
While <command>CLUSTER</command> is running, the <xref
linkend="guc-search-path"/> is temporarily changed to <literal>pg_catalog,
pg_temp</literal>.
</para>
- <para>
- When an index scan is used, a temporary copy of the table is created that
- contains the table data in the index order. Temporary copies of each
- index on the table are created as well. Therefore, you need free space on
- disk at least equal to the sum of the table size and the index sizes.
- </para>
-
- <para>
- When a sequential scan and sort is used, a temporary sort file is
- also created, so that the peak temporary space requirement is as much
- as double the table size, plus the index sizes. This method is often
- faster than the index scan method, but if the disk space requirement is
- intolerable, you can disable this choice by temporarily setting <xref
- linkend="guc-enable-sort"/> to <literal>off</literal>.
- </para>
-
- <para>
- It is advisable to set <xref linkend="guc-maintenance-work-mem"/> to
- a reasonably large value (but not more than the amount of RAM you can
- dedicate to the <command>CLUSTER</command> operation) before clustering.
- </para>
-
- <para>
- Because the planner records statistics about the ordering of
- tables, it is advisable to run <link linkend="sql-analyze"><command>ANALYZE</command></link>
- on the newly clustered table.
- Otherwise, the planner might make poor choices of query plans.
- </para>
-
<para>
Because <command>CLUSTER</command> remembers which indexes are clustered,
one can cluster the tables one wants clustered manually the first time,
diff --git a/doc/src/sgml/ref/repack.sgml b/doc/src/sgml/ref/repack.sgml
new file mode 100644
index 00000000000..84f3c3e3f2b
--- /dev/null
+++ b/doc/src/sgml/ref/repack.sgml
@@ -0,0 +1,254 @@
+<!--
+doc/src/sgml/ref/repack.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-repack">
+ <indexterm zone="sql-repack">
+ <primary>REPACK</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle>REPACK</refentrytitle>
+ <manvolnum>7</manvolnum>
+ <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>REPACK</refname>
+ <refpurpose>cluster a table according to an index</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <replaceable class="parameter">table_name</replaceable> [ USING INDEX <replaceable class="parameter">index_name</replaceable> ] ]
+
+<phrase>where <replaceable class="parameter">option</replaceable> can be one of:</phrase>
+
+ VERBOSE [ <replaceable class="parameter">boolean</replaceable> ]
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+
+ <para>
+ <command>REPACK</command> reclaims storage occupied by dead
+ tuples. Unlike <command>VACUUM</command>, it does so by rewriting the
+ entire contents of the table specified
+ by <replaceable class="parameter">table_name</replaceable> into a new disk
+ file with no extra space (except for the space guaranteed by
+ the <literal>fillfactor</literal> storage parameter), allowing unused space
+ to be returned to the operating system.
+ </para>
+
+ <para>
+ Without
+ a <replaceable class="parameter">table_name</replaceable>, <command>REPACK</command>
+ processes every table and materialized view in the current database that
+ the current user has the <literal>MAINTAIN</literal> privilege on. This
+ form of <command>REPACK</command> cannot be executed inside a transaction
+ block.
+ </para>
+
+ <para>
+ If <replaceable class="parameter">index_name</replaceable> is specified,
+ the table is clustered by this index. Please see the notes on clustering
+ below.
+ </para>
+
+ <para>
+ When a table is being repacked, an <literal>ACCESS EXCLUSIVE</literal> lock
+ is acquired on it. This prevents any other database operations (both reads
+ and writes) from operating on the table until the <command>REPACK</command>
+ is finished.
+ </para>
+
+ <refsect2 id="sql-repack-notes-on-clustering" xreflabel="Notes on Clustering">
+ <title>Notes on Clustering</title>
+
+ <para>
+ When a table is clustered, it is physically reordered based on the index
+ information. Clustering is a one-time operation: when the table is
+ subsequently updated, the changes are not clustered. That is, no attempt
+ is made to store new or updated rows according to their index order. (If
+ one wishes, one can periodically recluster by issuing the command again.
+ Also, setting the table's <literal>fillfactor</literal> storage parameter
+ to less than 100% can aid in preserving cluster ordering during updates,
+ since updated rows are kept on the same page if enough space is available
+ there.)
+ </para>
+
+ <para>
+ In cases where you are accessing single rows randomly within a table, the
+ actual order of the data in the table is unimportant. However, if you tend
+ to access some data more than others, and there is an index that groups
+ them together, you will benefit from using <command>REPACK</command>. If
+ you are requesting a range of indexed values from a table, or a single
+ indexed value that has multiple rows that match,
+ <command>REPACK</command> will help because once the index identifies the
+ table page for the first row that matches, all other rows that match are
+ probably already on the same table page, and so you save disk accesses and
+ speed up the query.
+ </para>
+
+ <para>
+ <command>REPACK</command> can re-sort the table using either an index scan
+ on the specified index, or (if the index is a b-tree) a sequential scan
+ followed by sorting. It will attempt to choose the method that will be
+ faster, based on planner cost parameters and available statistical
+ information.
+ </para>
+
+ <para>
+ Because the planner records statistics about the ordering of tables, it is
+ advisable to
+ run <link linkend="sql-analyze"><command>ANALYZE</command></link> on the
+ newly repacked table. Otherwise, the planner might make poor choices of
+ query plans.
+ </para>
+ </refsect2>
+
+ <refsect2 id="sql-repack-notes-on-resources" xreflabel="Notes on Resources">
+ <title>Notes on Resources</title>
+
+ <para>
+ When an index scan or a sequential scan without sort is used, a temporary
+ copy of the table is created that contains the table data in the index
+ order. Temporary copies of each index on the table are created as well.
+ Therefore, you need free space on disk at least equal to the sum of the
+ table size and the index sizes.
+ </para>
+
+ <para>
+ When a sequential scan and sort is used, a temporary sort file is also
+ created, so that the peak temporary space requirement is as much as double
+ the table size, plus the index sizes. This method is often faster than
+ the index scan method, but if the disk space requirement is intolerable,
+ you can disable this choice by temporarily setting
+ <xref linkend="guc-enable-sort"/> to <literal>off</literal>.
+ </para>
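+
+ <para>
+ For example, to steer a single run toward the index scan method (the
+ names are illustrative):
+<programlisting>
+SET enable_sort = off;
+REPACK employees USING INDEX employees_ind;
+RESET enable_sort;
+</programlisting>
+ </para>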
+
+ <para>
+ It is advisable to set <xref linkend="guc-maintenance-work-mem"/> to a
+ reasonably large value (but not more than the amount of RAM you can
+ dedicate to the <command>REPACK</command> operation) before repacking.
+ </para>
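+
+ <para>
+ A session-local setting is usually sufficient, for example (the value is
+ illustrative):
+<programlisting>
+SET maintenance_work_mem = '1GB';
+REPACK employees;
+RESET maintenance_work_mem;
+</programlisting>
+ </para>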
+ </refsect2>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Parameters</title>
+
+ <variablelist>
+ <varlistentry>
+ <term><replaceable class="parameter">table_name</replaceable></term>
+ <listitem>
+ <para>
+ The name (possibly schema-qualified) of a table.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><replaceable class="parameter">index_name</replaceable></term>
+ <listitem>
+ <para>
+ The name of an index.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>VERBOSE</literal></term>
+ <listitem>
+ <para>
+ Prints a progress report at <literal>INFO</literal> level as each
+ table is repacked.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><replaceable class="parameter">boolean</replaceable></term>
+ <listitem>
+ <para>
+ Specifies whether the selected option should be turned on or off.
+ You can write <literal>TRUE</literal>, <literal>ON</literal>, or
+ <literal>1</literal> to enable the option, and <literal>FALSE</literal>,
+ <literal>OFF</literal>, or <literal>0</literal> to disable it. The
+ <replaceable class="parameter">boolean</replaceable> value can also
+ be omitted, in which case <literal>TRUE</literal> is assumed.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </refsect1>
+
+ <refsect1>
+ <title>Notes</title>
+
+ <para>
+ To repack a table, one must have the <literal>MAINTAIN</literal> privilege
+ on the table.
+ </para>
+
+ <para>
+ While <command>REPACK</command> is running, the <xref
+ linkend="guc-search-path"/> is temporarily changed to <literal>pg_catalog,
+ pg_temp</literal>.
+ </para>
+
+ <para>
+ Each backend running <command>REPACK</command> will report its progress
+ in the <structname>pg_stat_progress_repack</structname> view. See
+ <xref linkend="repack-progress-reporting"/> for details.
+ </para>
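+
+ <para>
+ For example, the progress of a running <command>REPACK</command> can be
+ watched from another session:
+<programlisting>
+SELECT pid, relid::regclass, phase,
+       heap_blks_scanned, heap_blks_total
+FROM pg_stat_progress_repack;
+</programlisting>
+ </para>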
+
+ <para>
+ Repacking a partitioned table repacks each of its partitions. If an index
+ is specified, each partition is clustered using the corresponding
+ partition of that index. <command>REPACK</command> on a partitioned
+ table cannot be executed
+ inside a transaction block.
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Examples</title>
+
+ <para>
+ Repack the table <literal>employees</literal>:
+<programlisting>
+REPACK employees;
+</programlisting>
+ </para>
+
+
+ <para>
+ Cluster the table <literal>employees</literal> on the basis of its
+ index <literal>employees_ind</literal>:
+<programlisting>
+REPACK employees USING INDEX employees_ind;
+</programlisting>
+ </para>
+
+ <para>
+ Repack all tables in the database on which you have
+ the <literal>MAINTAIN</literal> privilege:
+<programlisting>
+REPACK;
+</programlisting></para>
+ </refsect1>
+
+ <refsect1>
+ <title>Compatibility</title>
+
+ <para>
+ There is no <command>REPACK</command> statement in the SQL standard.
+ </para>
+
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index bd5dcaf86a5..735a2a7703a 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -98,6 +98,14 @@ VACUUM [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
<varlistentry>
<term><literal>FULL</literal></term>
<listitem>
+
+ <warning>
+ <para>
+ The <command>FULL</command> parameter is deprecated in favor of
+ <xref linkend="sql-repack"/>.
+ </para>
+ </warning>
+
<para>
Selects <quote>full</quote> vacuum, which can reclaim more
space, but takes much longer and exclusively locks the table.
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..229912d35b7 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -195,6 +195,7 @@
&refreshMaterializedView;
&reindex;
&releaseSavepoint;
+ &repack;
&reset;
&revoke;
&rollback;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 24d3765aa20..18e349c3466 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -741,13 +741,13 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
if (OldIndex != NULL && !use_sort)
{
const int ci_index[] = {
- PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_INDEX_RELID
+ PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_INDEX_RELID
};
int64 ci_val[2];
/* Set phase and OIDOldIndex to columns */
- ci_val[0] = PROGRESS_CLUSTER_PHASE_INDEX_SCAN_HEAP;
+ ci_val[0] = PROGRESS_REPACK_PHASE_INDEX_SCAN_HEAP;
ci_val[1] = RelationGetRelid(OldIndex);
pgstat_progress_update_multi_param(2, ci_index, ci_val);
@@ -759,15 +759,15 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
else
{
/* In scan-and-sort mode and also VACUUM FULL, set phase */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_SEQ_SCAN_HEAP);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_SEQ_SCAN_HEAP);
tableScan = table_beginscan(OldHeap, SnapshotAny, 0, (ScanKey) NULL);
heapScan = (HeapScanDesc) tableScan;
indexScan = NULL;
/* Set total heap blocks */
- pgstat_progress_update_param(PROGRESS_CLUSTER_TOTAL_HEAP_BLKS,
+ pgstat_progress_update_param(PROGRESS_REPACK_TOTAL_HEAP_BLKS,
heapScan->rs_nblocks);
}
@@ -809,7 +809,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
* is manually updated to the correct value when the table
* scan finishes.
*/
- pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_BLKS_SCANNED,
+ pgstat_progress_update_param(PROGRESS_REPACK_HEAP_BLKS_SCANNED,
heapScan->rs_nblocks);
break;
}
@@ -825,7 +825,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
*/
if (prev_cblock != heapScan->rs_cblock)
{
- pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_BLKS_SCANNED,
+ pgstat_progress_update_param(PROGRESS_REPACK_HEAP_BLKS_SCANNED,
(heapScan->rs_cblock +
heapScan->rs_nblocks -
heapScan->rs_startblock
@@ -912,14 +912,14 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
* In scan-and-sort mode, report increase in number of tuples
* scanned
*/
- pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED,
+ pgstat_progress_update_param(PROGRESS_REPACK_HEAP_TUPLES_SCANNED,
*num_tuples);
}
else
{
const int ct_index[] = {
- PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED,
- PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN
+ PROGRESS_REPACK_HEAP_TUPLES_SCANNED,
+ PROGRESS_REPACK_HEAP_TUPLES_WRITTEN
};
int64 ct_val[2];
@@ -952,14 +952,14 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
double n_tuples = 0;
/* Report that we are now sorting tuples */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_SORT_TUPLES);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_SORT_TUPLES);
tuplesort_performsort(tuplesort);
/* Report that we are now writing new heap */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_WRITE_NEW_HEAP);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_WRITE_NEW_HEAP);
for (;;)
{
@@ -977,7 +977,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
values, isnull,
rwstate);
/* Report n_tuples */
- pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN,
+ pgstat_progress_update_param(PROGRESS_REPACK_HEAP_TUPLES_WRITTEN,
n_tuples);
}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 739a92bdcc1..466cf0fdef6 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -4079,7 +4079,7 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
Assert(!ReindexIsProcessingIndex(indexOid));
/* Set index rebuild count */
- pgstat_progress_update_param(PROGRESS_CLUSTER_INDEX_REBUILD_COUNT,
+ pgstat_progress_update_param(PROGRESS_REPACK_INDEX_REBUILD_COUNT,
i);
i++;
}
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 31d269b7ee0..5de46bcac52 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1262,6 +1262,33 @@ CREATE VIEW pg_stat_progress_cluster AS
FROM pg_stat_get_progress_info('CLUSTER') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
+CREATE VIEW pg_stat_progress_repack AS
+ SELECT
+ S.pid AS pid,
+ S.datid AS datid,
+ D.datname AS datname,
+ S.relid AS relid,
+ CASE S.param1 WHEN 1 THEN 'REPACK'
+ END AS command,
+ CASE S.param2 WHEN 0 THEN 'initializing'
+ WHEN 1 THEN 'seq scanning heap'
+ WHEN 2 THEN 'index scanning heap'
+ WHEN 3 THEN 'sorting tuples'
+ WHEN 4 THEN 'writing new heap'
+ WHEN 5 THEN 'swapping relation files'
+ WHEN 6 THEN 'rebuilding index'
+ WHEN 7 THEN 'performing final cleanup'
+ END AS phase,
+ CAST(S.param3 AS oid) AS repack_index_relid,
+ S.param4 AS heap_tuples_scanned,
+ S.param5 AS heap_tuples_written,
+ S.param6 AS heap_blks_total,
+ S.param7 AS heap_blks_scanned,
+ S.param8 AS index_rebuild_count
+ FROM pg_stat_get_progress_info('REPACK') AS S
+ LEFT JOIN pg_database D ON S.datid = D.oid;
+
+
CREATE VIEW pg_stat_progress_create_index AS
SELECT
S.pid AS pid, S.datid AS datid, D.datname AS datname,
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 54a08e4102e..9ae3d87e412 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -46,6 +46,7 @@
#include "storage/lmgr.h"
#include "storage/predicate.h"
#include "utils/acl.h"
 #include "utils/fmgroids.h"
+#include "utils/formatting.h"
#include "utils/guc.h"
#include "utils/inval.h"
@@ -67,17 +68,33 @@ typedef struct
Oid indexOid;
} RelToCluster;
-
-static void cluster_multiple_rels(List *rtcs, ClusterParams *params);
-static void rebuild_relation(Relation OldHeap, Relation index, bool verbose);
+/*
+ * Map the value of ClusterCommand to string.
+ */
+#define CLUSTER_COMMAND_STR(cmd) ((cmd) == CLUSTER_COMMAND_CLUSTER ? \
+ "cluster" : \
+ ((cmd) == CLUSTER_COMMAND_REPACK ? \
+ "repack" : "vacuum"))
+
+static void cluster_multiple_rels(List *rtcs, ClusterParams *params,
+ ClusterCommand cmd);
+static void rebuild_relation(Relation OldHeap, Relation index, bool verbose,
+ ClusterCommand cmd);
static void copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
- bool verbose, bool *pSwapToastByContent,
+ bool verbose, ClusterCommand cmd,
+ bool *pSwapToastByContent,
TransactionId *pFreezeXid, MultiXactId *pCutoffMulti);
static List *get_tables_to_cluster(MemoryContext cluster_context);
+static List *get_tables_to_repack(MemoryContext repack_context);
static List *get_tables_to_cluster_partitioned(MemoryContext cluster_context,
- Oid indexOid);
-static bool cluster_is_permitted_for_relation(Oid relid, Oid userid);
-
+ Oid relid, bool rel_is_index,
+ ClusterCommand cmd);
+static bool cluster_is_permitted_for_relation(Oid relid, Oid userid,
+ ClusterCommand cmd);
+static Relation process_single_relation(RangeVar *relation, char *indexname,
+ ClusterCommand cmd,
+ ClusterParams *params,
+ Oid *indexOid_p);
/*---------------------------------------------------------------------------
* This cluster code allows for clustering multiple tables at once. Because
@@ -133,72 +150,11 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
if (stmt->relation != NULL)
{
- /* This is the single-relation case. */
- Oid tableOid;
-
- /*
- * Find, lock, and check permissions on the table. We obtain
- * AccessExclusiveLock right away to avoid lock-upgrade hazard in the
- * single-transaction case.
- */
- tableOid = RangeVarGetRelidExtended(stmt->relation,
- AccessExclusiveLock,
- 0,
- RangeVarCallbackMaintainsTable,
- NULL);
- rel = table_open(tableOid, NoLock);
-
- /*
- * Reject clustering a remote temp table ... their local buffer
- * manager is not going to cope.
- */
- if (RELATION_IS_OTHER_TEMP(rel))
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot cluster temporary tables of other sessions")));
-
- if (stmt->indexname == NULL)
- {
- ListCell *index;
-
- /* We need to find the index that has indisclustered set. */
- foreach(index, RelationGetIndexList(rel))
- {
- indexOid = lfirst_oid(index);
- if (get_index_isclustered(indexOid))
- break;
- indexOid = InvalidOid;
- }
-
- if (!OidIsValid(indexOid))
- ereport(ERROR,
- (errcode(ERRCODE_UNDEFINED_OBJECT),
- errmsg("there is no previously clustered index for table \"%s\"",
- stmt->relation->relname)));
- }
- else
- {
- /*
- * The index is expected to be in the same namespace as the
- * relation.
- */
- indexOid = get_relname_relid(stmt->indexname,
- rel->rd_rel->relnamespace);
- if (!OidIsValid(indexOid))
- ereport(ERROR,
- (errcode(ERRCODE_UNDEFINED_OBJECT),
- errmsg("index \"%s\" for table \"%s\" does not exist",
- stmt->indexname, stmt->relation->relname)));
- }
-
- /* For non-partitioned tables, do what we came here to do. */
- if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
- {
- cluster_rel(rel, indexOid, ¶ms);
- /* cluster_rel closes the relation, but keeps lock */
-
+ rel = process_single_relation(stmt->relation, stmt->indexname,
+ CLUSTER_COMMAND_CLUSTER, ¶ms,
+ &indexOid);
+ if (rel == NULL)
return;
- }
}
/*
@@ -230,8 +186,11 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
if (rel != NULL)
{
Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
- check_index_is_clusterable(rel, indexOid, AccessShareLock);
- rtcs = get_tables_to_cluster_partitioned(cluster_context, indexOid);
+ check_index_is_clusterable(rel, indexOid, AccessShareLock,
+ CLUSTER_COMMAND_CLUSTER);
+ rtcs = get_tables_to_cluster_partitioned(cluster_context, indexOid,
+ true,
+ CLUSTER_COMMAND_CLUSTER);
/* close relation, releasing lock on parent table */
table_close(rel, AccessExclusiveLock);
@@ -243,7 +202,7 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
}
/* Do the job. */
- cluster_multiple_rels(rtcs, ¶ms);
+ cluster_multiple_rels(rtcs, ¶ms, CLUSTER_COMMAND_CLUSTER);
/* Start a new transaction for the cleanup work. */
StartTransactionCommand();
@@ -260,7 +219,8 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
* return.
*/
static void
-cluster_multiple_rels(List *rtcs, ClusterParams *params)
+cluster_multiple_rels(List *rtcs, ClusterParams *params,
+ ClusterCommand cmd)
{
ListCell *lc;
@@ -283,7 +243,7 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
rel = table_open(rtc->tableOid, AccessExclusiveLock);
/* Process this table */
- cluster_rel(rel, rtc->indexOid, params);
+ cluster_rel(rel, rtc->indexOid, params, cmd);
/* cluster_rel closes the relation, but keeps lock */
PopActiveSnapshot();
@@ -306,9 +266,13 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
* If indexOid is InvalidOid, the table will be rewritten in physical order
* instead of index order. This is the new implementation of VACUUM FULL,
* and error messages should refer to the operation as VACUUM not CLUSTER.
+ *
+ * 'cmd' indicates which command is being executed. REPACK should be the only
+ * caller of this function in the future.
*/
void
-cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
+cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
+ ClusterCommand cmd)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid save_userid;
@@ -317,19 +281,33 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
bool verbose = ((params->options & CLUOPT_VERBOSE) != 0);
bool recheck = ((params->options & CLUOPT_RECHECK) != 0);
Relation index;
+ const char *cmd_str = CLUSTER_COMMAND_STR(cmd);
Assert(CheckRelationLockedByMe(OldHeap, AccessExclusiveLock, false));
/* Check for user-requested abort. */
CHECK_FOR_INTERRUPTS();
- pgstat_progress_start_command(PROGRESS_COMMAND_CLUSTER, tableOid);
- if (OidIsValid(indexOid))
- pgstat_progress_update_param(PROGRESS_CLUSTER_COMMAND,
+ if (cmd == CLUSTER_COMMAND_REPACK)
+ pgstat_progress_start_command(PROGRESS_COMMAND_REPACK, tableOid);
+ else
+ pgstat_progress_start_command(PROGRESS_COMMAND_CLUSTER, tableOid);
+
+ if (cmd == CLUSTER_COMMAND_REPACK)
+ pgstat_progress_update_param(PROGRESS_REPACK_COMMAND,
+ PROGRESS_REPACK_COMMAND_REPACK);
+ else if (OidIsValid(indexOid))
+ {
+ Assert(cmd == CLUSTER_COMMAND_CLUSTER);
+ pgstat_progress_update_param(PROGRESS_REPACK_COMMAND,
PROGRESS_CLUSTER_COMMAND_CLUSTER);
+ }
else
- pgstat_progress_update_param(PROGRESS_CLUSTER_COMMAND,
+ {
+ Assert(cmd == CLUSTER_COMMAND_VACUUM);
+ pgstat_progress_update_param(PROGRESS_REPACK_COMMAND,
PROGRESS_CLUSTER_COMMAND_VACUUM_FULL);
+ }
/*
* Switch to the table owner's userid, so that any index functions are run
@@ -353,7 +331,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
if (recheck)
{
/* Check that the user still has privileges for the relation */
- if (!cluster_is_permitted_for_relation(tableOid, save_userid))
+ if (!cluster_is_permitted_for_relation(tableOid, save_userid, cmd))
{
relation_close(OldHeap, AccessExclusiveLock);
goto out;
@@ -403,39 +381,38 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
* would work in most respects, but the index would only get marked as
* indisclustered in the current database, leading to unexpected behavior
* if CLUSTER were later invoked in another database.
+ *
+ * REPACK does not set indisclustered. XXX Not sure I understand the
+ * comment above: how can an attribute be set "only in the current
+ * database"?
*/
- if (OidIsValid(indexOid) && OldHeap->rd_rel->relisshared)
+ if (cmd == CLUSTER_COMMAND_CLUSTER && OldHeap->rd_rel->relisshared)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot cluster a shared catalog")));
+ errmsg("cannot %s a shared catalog", cmd_str)));
/*
* Don't process temp tables of other backends ... their local buffer
* manager is not going to cope.
*/
if (RELATION_IS_OTHER_TEMP(OldHeap))
- {
- if (OidIsValid(indexOid))
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot cluster temporary tables of other sessions")));
- else
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot vacuum temporary tables of other sessions")));
- }
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot %s temporary tables of other sessions",
+ cmd_str)));
/*
* Also check for active uses of the relation in the current transaction,
* including open scans and pending AFTER trigger events.
*/
- CheckTableNotInUse(OldHeap, OidIsValid(indexOid) ? "CLUSTER" : "VACUUM");
+ CheckTableNotInUse(OldHeap, asc_toupper(cmd_str, strlen(cmd_str)));
/* Check heap and index are valid to cluster on */
if (OidIsValid(indexOid))
{
/* verify the index is good and lock it */
- check_index_is_clusterable(OldHeap, indexOid, AccessExclusiveLock);
+ check_index_is_clusterable(OldHeap, indexOid, AccessExclusiveLock,
+ cmd);
/* also open it */
index = index_open(indexOid, NoLock);
}
@@ -469,7 +446,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
TransferPredicateLocksToHeapRelation(OldHeap);
/* rebuild_relation does all the dirty work */
- rebuild_relation(OldHeap, index, verbose);
+ rebuild_relation(OldHeap, index, verbose, cmd);
/* rebuild_relation closes OldHeap, and index if valid */
out:
@@ -491,9 +468,11 @@ out:
* protection here.
*/
void
-check_index_is_clusterable(Relation OldHeap, Oid indexOid, LOCKMODE lockmode)
+check_index_is_clusterable(Relation OldHeap, Oid indexOid, LOCKMODE lockmode,
+ ClusterCommand cmd)
{
Relation OldIndex;
+ const char *cmd_str = CLUSTER_COMMAND_STR(cmd);
OldIndex = index_open(indexOid, lockmode);
@@ -512,8 +491,8 @@ check_index_is_clusterable(Relation OldHeap, Oid indexOid, LOCKMODE lockmode)
if (!OldIndex->rd_indam->amclusterable)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot cluster on index \"%s\" because access method does not support clustering",
- RelationGetRelationName(OldIndex))));
+ errmsg("cannot %s on index \"%s\" because access method does not support clustering",
+ cmd_str, RelationGetRelationName(OldIndex))));
/*
* Disallow clustering on incomplete indexes (those that might not index
@@ -524,7 +503,8 @@ check_index_is_clusterable(Relation OldHeap, Oid indexOid, LOCKMODE lockmode)
if (!heap_attisnull(OldIndex->rd_indextuple, Anum_pg_index_indpred, NULL))
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot cluster on partial index \"%s\"",
+ errmsg("cannot %s on partial index \"%s\"",
+ cmd_str,
RelationGetRelationName(OldIndex))));
/*
@@ -538,8 +518,8 @@ check_index_is_clusterable(Relation OldHeap, Oid indexOid, LOCKMODE lockmode)
if (!OldIndex->rd_index->indisvalid)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot cluster on invalid index \"%s\"",
- RelationGetRelationName(OldIndex))));
+ errmsg("cannot %s on invalid index \"%s\"",
+ cmd_str, RelationGetRelationName(OldIndex))));
/* Drop relcache refcnt on OldIndex, but keep lock */
index_close(OldIndex, NoLock);
@@ -626,7 +606,8 @@ mark_index_clustered(Relation rel, Oid indexOid, bool is_internal)
* On exit, they are closed, but locks on them are not released.
*/
static void
-rebuild_relation(Relation OldHeap, Relation index, bool verbose)
+rebuild_relation(Relation OldHeap, Relation index, bool verbose,
+ ClusterCommand cmd)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid accessMethod = OldHeap->rd_rel->relam;
@@ -664,7 +645,7 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose)
NewHeap = table_open(OIDNewHeap, NoLock);
/* Copy the heap data into the new table in the desired order */
- copy_table_data(NewHeap, OldHeap, index, verbose,
+ copy_table_data(NewHeap, OldHeap, index, verbose, cmd,
&swap_toast_by_content, &frozenXid, &cutoffMulti);
@@ -829,8 +810,8 @@ make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
*/
static void
copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verbose,
- bool *pSwapToastByContent, TransactionId *pFreezeXid,
- MultiXactId *pCutoffMulti)
+ ClusterCommand cmd, bool *pSwapToastByContent,
+ TransactionId *pFreezeXid, MultiXactId *pCutoffMulti)
{
Relation relRelation;
HeapTuple reltup;
@@ -845,6 +826,7 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
tups_recently_dead = 0;
BlockNumber num_pages;
int elevel = verbose ? INFO : DEBUG2;
+ const char *cmd_str = CLUSTER_COMMAND_STR(cmd);
PGRUsage ru0;
char *nspname;
@@ -958,18 +940,21 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
/* Log what we're doing */
if (OldIndex != NULL && !use_sort)
ereport(elevel,
- (errmsg("clustering \"%s.%s\" using index scan on \"%s\"",
+ (errmsg("%sing \"%s.%s\" using index scan on \"%s\"",
+ cmd_str,
nspname,
RelationGetRelationName(OldHeap),
RelationGetRelationName(OldIndex))));
else if (use_sort)
ereport(elevel,
- (errmsg("clustering \"%s.%s\" using sequential scan and sort",
+ (errmsg("%sing \"%s.%s\" using sequential scan and sort",
+ cmd_str,
nspname,
RelationGetRelationName(OldHeap))));
else
ereport(elevel,
- (errmsg("vacuuming \"%s.%s\"",
+ (errmsg("%sing \"%s.%s\"",
+ cmd_str,
nspname,
RelationGetRelationName(OldHeap))));
@@ -1458,8 +1443,8 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
int i;
/* Report that we are now swapping relation files */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_SWAP_REL_FILES);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_SWAP_REL_FILES);
/* Zero out possible results from swapped_relation_files */
memset(mapped_tables, 0, sizeof(mapped_tables));
@@ -1509,14 +1494,14 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
/* Report that we are now reindexing relations */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_REBUILD_INDEX);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_REBUILD_INDEX);
reindex_relation(NULL, OIDOldHeap, reindex_flags, &reindex_params);
/* Report that we are now doing clean up */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_FINAL_CLEANUP);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_FINAL_CLEANUP);
/*
* If the relation being rebuilt is pg_class, swap_relation_files()
@@ -1666,7 +1651,8 @@ get_tables_to_cluster(MemoryContext cluster_context)
index = (Form_pg_index) GETSTRUCT(indexTuple);
- if (!cluster_is_permitted_for_relation(index->indrelid, GetUserId()))
+ if (!cluster_is_permitted_for_relation(index->indrelid, GetUserId(),
+ CLUSTER_COMMAND_CLUSTER))
continue;
/* Use a permanent memory context for the result list */
@@ -1687,14 +1673,67 @@ get_tables_to_cluster(MemoryContext cluster_context)
}
/*
- * Given an index on a partitioned table, return a list of RelToCluster for
+ * Like get_tables_to_cluster(), but ignore indexes entirely.
+ */
+static List *
+get_tables_to_repack(MemoryContext repack_context)
+{
+ Relation relrelation;
+ TableScanDesc scan;
+ HeapTuple tuple;
+ MemoryContext old_context;
+ List *rtcs = NIL;
+
+ /*
+ * Get all plain relations that the current user has the appropriate
+ * privileges for.
+ */
+ relrelation = table_open(RelationRelationId, AccessShareLock);
+ scan = table_beginscan_catalog(relrelation, 0, NULL);
+ while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+ {
+ RelToCluster *rtc;
+ Form_pg_class classform = (Form_pg_class) GETSTRUCT(tuple);
+ Oid relid = classform->oid;
+
+ /* Only interested in relations. */
+ if (get_rel_relkind(relid) != RELKIND_RELATION)
+ continue;
+
+ if (!cluster_is_permitted_for_relation(relid, GetUserId(),
+ CLUSTER_COMMAND_REPACK))
+ continue;
+
+ /* Use a permanent memory context for the result list */
+ old_context = MemoryContextSwitchTo(repack_context);
+
+ rtc = (RelToCluster *) palloc(sizeof(RelToCluster));
+ rtc->tableOid = relid;
+ rtc->indexOid = InvalidOid;
+ rtcs = lappend(rtcs, rtc);
+
+ MemoryContextSwitchTo(old_context);
+ }
+ table_endscan(scan);
+
+ relation_close(relrelation, AccessShareLock);
+
+ return rtcs;
+}
+
+/*
+ * Given a partitioned table or its index, return a list of RelToCluster for
* all the children leaves tables/indexes.
*
* Like expand_vacuum_rel, but here caller must hold AccessExclusiveLock
* on the table containing the index.
+ *
+ * 'rel_is_index' tells whether 'relid' is that of an index (true) or of the
+ * owning relation.
*/
static List *
-get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
+get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid relid,
+ bool rel_is_index, ClusterCommand cmd)
{
List *inhoids;
ListCell *lc;
@@ -1702,17 +1741,33 @@ get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
MemoryContext old_context;
/* Do not lock the children until they're processed */
- inhoids = find_all_inheritors(indexOid, NoLock, NULL);
+ inhoids = find_all_inheritors(relid, NoLock, NULL);
foreach(lc, inhoids)
{
- Oid indexrelid = lfirst_oid(lc);
- Oid relid = IndexGetRelation(indexrelid, false);
+ Oid inhoid = lfirst_oid(lc);
+ Oid inhrelid,
+ inhindid;
RelToCluster *rtc;
- /* consider only leaf indexes */
- if (get_rel_relkind(indexrelid) != RELKIND_INDEX)
- continue;
+ if (rel_is_index)
+ {
+ /* consider only leaf indexes */
+ if (get_rel_relkind(inhoid) != RELKIND_INDEX)
+ continue;
+
+ inhrelid = IndexGetRelation(inhoid, false);
+ inhindid = inhoid;
+ }
+ else
+ {
+ /* consider only leaf relations */
+ if (get_rel_relkind(inhoid) != RELKIND_RELATION)
+ continue;
+
+ inhrelid = inhoid;
+ inhindid = InvalidOid;
+ }
/*
* It's possible that the user does not have privileges to CLUSTER the
@@ -1720,15 +1775,15 @@ get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
* table. We skip any partitions which the user is not permitted to
* CLUSTER.
*/
- if (!cluster_is_permitted_for_relation(relid, GetUserId()))
+ if (!cluster_is_permitted_for_relation(inhrelid, GetUserId(), cmd))
continue;
/* Use a permanent memory context for the result list */
old_context = MemoryContextSwitchTo(cluster_context);
rtc = (RelToCluster *) palloc(sizeof(RelToCluster));
- rtc->tableOid = relid;
- rtc->indexOid = indexrelid;
+ rtc->tableOid = inhrelid;
+ rtc->indexOid = inhindid;
rtcs = lappend(rtcs, rtc);
MemoryContextSwitchTo(old_context);
@@ -1742,13 +1797,192 @@ get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
* function emits a WARNING.
*/
static bool
-cluster_is_permitted_for_relation(Oid relid, Oid userid)
+cluster_is_permitted_for_relation(Oid relid, Oid userid, ClusterCommand cmd)
{
if (pg_class_aclcheck(relid, userid, ACL_MAINTAIN) == ACLCHECK_OK)
return true;
ereport(WARNING,
- (errmsg("permission denied to cluster \"%s\", skipping it",
+ (errmsg("permission denied to %s \"%s\", skipping it",
+ CLUSTER_COMMAND_STR(cmd),
get_rel_name(relid))));
return false;
}
+
+/*
+ * REPACK is intended to be a replacement of both CLUSTER and VACUUM FULL.
+ */
+void
+repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
+{
+ ListCell *lc;
+ ClusterParams params = {0};
+ bool verbose = false;
+ Relation rel = NULL;
+ Oid indexOid = InvalidOid;
+ MemoryContext repack_context;
+ List *rtcs;
+
+ /* Parse option list */
+ foreach(lc, stmt->params)
+ {
+ DefElem *opt = (DefElem *) lfirst(lc);
+
+ if (strcmp(opt->defname, "verbose") == 0)
+ verbose = defGetBoolean(opt);
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("unrecognized REPACK option \"%s\"",
+ opt->defname),
+ parser_errposition(pstate, opt->location)));
+ }
+
+ params.options = (verbose ? CLUOPT_VERBOSE : 0);
+
+ if (stmt->relation != NULL)
+ {
+ rel = process_single_relation(stmt->relation, stmt->indexname,
+ CLUSTER_COMMAND_REPACK, ¶ms,
+ &indexOid);
+ if (rel == NULL)
+ return;
+ }
+
+ /*
+ * By here, we know we are in a multi-table situation. In order to avoid
+ * holding locks for too long, we want to process each table in its own
+ * transaction. This forces us to disallow running inside a user
+ * transaction block.
+ */
+ PreventInTransactionBlock(isTopLevel, "REPACK");
+
+ /* Also, we need a memory context to hold our list of relations */
+ repack_context = AllocSetContextCreate(PortalContext,
+ "Repack",
+ ALLOCSET_DEFAULT_SIZES);
+
+ params.options |= CLUOPT_RECHECK;
+ if (rel != NULL)
+ {
+ Oid relid;
+ bool rel_is_index;
+
+ Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
+
+ if (OidIsValid(indexOid))
+ {
+ relid = indexOid;
+ rel_is_index = true;
+ }
+ else
+ {
+ relid = RelationGetRelid(rel);
+ rel_is_index = false;
+ }
+ rtcs = get_tables_to_cluster_partitioned(repack_context, relid,
+ rel_is_index,
+ CLUSTER_COMMAND_REPACK);
+
+ /* close relation, releasing lock on parent table */
+ table_close(rel, AccessExclusiveLock);
+ }
+ else
+ rtcs = get_tables_to_repack(repack_context);
+
+ /* Do the job. */
+ cluster_multiple_rels(rtcs, &params, CLUSTER_COMMAND_REPACK);
+
+ /* Start a new transaction for the cleanup work. */
+ StartTransactionCommand();
+
+ /* Clean up working storage */
+ MemoryContextDelete(repack_context);
+}
+
+/*
+ * REPACK a single relation.
+ *
+ * Return NULL if done, or a reference to the relation if the caller still
+ * needs to process it (because the relation is partitioned).
+ */
+static Relation
+process_single_relation(RangeVar *relation, char *indexname,
+ ClusterCommand cmd, ClusterParams *params,
+ Oid *indexOid_p)
+{
+ Relation rel;
+ Oid indexOid = InvalidOid;
+
+ /* This is the single-relation case. */
+ Oid tableOid;
+
+ /*
+ * Find, lock, and check permissions on the table. We obtain
+ * AccessExclusiveLock right away to avoid lock-upgrade hazard in the
+ * single-transaction case.
+ */
+ tableOid = RangeVarGetRelidExtended(relation,
+ AccessExclusiveLock,
+ 0,
+ RangeVarCallbackMaintainsTable,
+ NULL);
+ rel = table_open(tableOid, NoLock);
+
+ /*
+ * Reject clustering a remote temp table ... their local buffer manager is
+ * not going to cope.
+ */
+ if (RELATION_IS_OTHER_TEMP(rel))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot %s temporary tables of other sessions",
+ CLUSTER_COMMAND_STR(cmd))));
+
+ if (indexname == NULL && cmd == CLUSTER_COMMAND_CLUSTER)
+ {
+ ListCell *index;
+
+ /* We need to find the index that has indisclustered set. */
+ foreach(index, RelationGetIndexList(rel))
+ {
+ indexOid = lfirst_oid(index);
+ if (get_index_isclustered(indexOid))
+ break;
+ indexOid = InvalidOid;
+ }
+
+ if (!OidIsValid(indexOid))
+ ereport(ERROR,
+ (errcode(ERRCODE_UNDEFINED_OBJECT),
+ errmsg("there is no previously clustered index for table \"%s\"",
+ relation->relname)));
+ }
+ else if (indexname != NULL)
+ {
+ /*
+ * The index is expected to be in the same namespace as the relation.
+ */
+ indexOid = get_relname_relid(indexname,
+ rel->rd_rel->relnamespace);
+ if (!OidIsValid(indexOid))
+ ereport(ERROR,
+ (errcode(ERRCODE_UNDEFINED_OBJECT),
+ errmsg("index \"%s\" for table \"%s\" does not exist",
+ indexname, relation->relname)));
+ }
+
+ *indexOid_p = indexOid;
+
+ /* For non-partitioned tables, do what we came here to do. */
+ if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+ {
+ cluster_rel(rel, indexOid, params, cmd);
+ /* cluster_rel closes the relation, but keeps lock */
+
+ return NULL;
+ }
+
+ return rel;
+}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 10624353b0a..b7a74f25785 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -15844,7 +15844,8 @@ ATExecClusterOn(Relation rel, const char *indexName, LOCKMODE lockmode)
indexName, RelationGetRelationName(rel))));
/* Check index is valid to cluster on */
- check_index_is_clusterable(rel, indexOid, lockmode);
+ check_index_is_clusterable(rel, indexOid, lockmode,
+ CLUSTER_COMMAND_CLUSTER);
/* And do the work */
mark_index_clustered(rel, indexOid, false);
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index db5da3ce826..a4ad23448f8 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -2263,7 +2263,8 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
cluster_params.options |= CLUOPT_VERBOSE;
/* VACUUM FULL is now a variant of CLUSTER; see cluster.c */
- cluster_rel(rel, InvalidOid, &cluster_params);
+ cluster_rel(rel, InvalidOid, &cluster_params,
+ CLUSTER_COMMAND_VACUUM);
/* cluster_rel closes the relation, but keeps lock */
rel = NULL;
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 0fc502a3a40..9c79265a438 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -298,7 +298,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
GrantStmt GrantRoleStmt ImportForeignSchemaStmt IndexStmt InsertStmt
ListenStmt LoadStmt LockStmt MergeStmt NotifyStmt ExplainableStmt PreparableStmt
CreateFunctionStmt AlterFunctionStmt ReindexStmt RemoveAggrStmt
- RemoveFuncStmt RemoveOperStmt RenameStmt ReturnStmt RevokeStmt RevokeRoleStmt
+ RemoveFuncStmt RemoveOperStmt RenameStmt RepackStmt ReturnStmt RevokeStmt RevokeRoleStmt
RuleActionStmt RuleActionStmtOrEmpty RuleStmt
SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
UnlistenStmt UpdateStmt VacuumStmt
@@ -381,7 +381,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
%type <str> copy_file_name
access_method_clause attr_name
table_access_method_clause name cursor_name file_name
- cluster_index_specification
+ cluster_index_specification repack_index_specification
%type <list> func_name handler_name qual_Op qual_all_Op subquery_Op
opt_inline_handler opt_validator validator_clause
@@ -764,7 +764,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
QUOTE QUOTES
RANGE READ REAL REASSIGN RECURSIVE REF_P REFERENCES REFERENCING
- REFRESH REINDEX RELATIVE_P RELEASE RENAME REPEATABLE REPLACE REPLICA
+ REFRESH REINDEX RELATIVE_P RELEASE RENAME REPACK REPEATABLE REPLACE REPLICA
RESET RESTART RESTRICT RETURN RETURNING RETURNS REVOKE RIGHT ROLE ROLLBACK ROLLUP
ROUTINE ROUTINES ROW ROWS RULE
@@ -1100,6 +1100,7 @@ stmt:
| RemoveFuncStmt
| RemoveOperStmt
| RenameStmt
+ | RepackStmt
| RevokeStmt
| RevokeRoleStmt
| RuleStmt
@@ -11887,6 +11888,60 @@ cluster_index_specification:
| /*EMPTY*/ { $$ = NULL; }
;
+/*****************************************************************************
+ *
+ * QUERY:
+ * REPACK [ (options) ] [ <qualified_name> [ USING INDEX <index_name> ] ]
+ *
+ *****************************************************************************/
+
+RepackStmt:
+ REPACK qualified_name repack_index_specification
+ {
+ RepackStmt *n = makeNode(RepackStmt);
+
+ n->relation = $2;
+ n->indexname = $3;
+ n->params = NIL;
+ $$ = (Node *) n;
+ }
+
+ | REPACK '(' utility_option_list ')' qualified_name repack_index_specification
+ {
+ RepackStmt *n = makeNode(RepackStmt);
+
+ n->relation = $5;
+ n->indexname = $6;
+ n->params = $3;
+ $$ = (Node *) n;
+ }
+
+ | REPACK
+ {
+ RepackStmt *n = makeNode(RepackStmt);
+
+ n->relation = NULL;
+ n->indexname = NULL;
+ n->params = NIL;
+ $$ = (Node *) n;
+ }
+
+ | REPACK '(' utility_option_list ')'
+ {
+ RepackStmt *n = makeNode(RepackStmt);
+
+ n->relation = NULL;
+ n->indexname = NULL;
+ n->params = $3;
+ $$ = (Node *) n;
+ }
+ ;
+
+repack_index_specification:
+ USING INDEX name { $$ = $3; }
+ | /*EMPTY*/ { $$ = NULL; }
+ ;
+
/*****************************************************************************
*
@@ -17927,6 +17982,7 @@ unreserved_keyword:
| RELATIVE_P
| RELEASE
| RENAME
+ | REPACK
| REPEATABLE
| REPLACE
| REPLICA
@@ -18558,6 +18614,7 @@ bare_label_keyword:
| RELATIVE_P
| RELEASE
| RENAME
+ | REPACK
| REPEATABLE
| REPLACE
| REPLICA
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 25fe3d58016..bf3ba3c2ae7 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -280,6 +280,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_ClusterStmt:
case T_ReindexStmt:
case T_VacuumStmt:
+ case T_RepackStmt:
{
/*
* These commands write WAL, so they're not strictly
@@ -862,6 +863,10 @@ standard_ProcessUtility(PlannedStmt *pstmt,
ExecVacuum(pstate, (VacuumStmt *) parsetree, isTopLevel);
break;
+ case T_RepackStmt:
+ repack(pstate, (RepackStmt *) parsetree, isTopLevel);
+ break;
+
case T_ExplainStmt:
ExplainQuery(pstate, (ExplainStmt *) parsetree, params, dest);
break;
@@ -2869,6 +2874,10 @@ CreateCommandTag(Node *parsetree)
tag = CMDTAG_ANALYZE;
break;
+ case T_RepackStmt:
+ tag = CMDTAG_REPACK;
+ break;
+
case T_ExplainStmt:
tag = CMDTAG_EXPLAIN;
break;
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 97af7c6554f..ddec4914ea5 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -268,6 +268,8 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
cmdtype = PROGRESS_COMMAND_ANALYZE;
else if (pg_strcasecmp(cmd, "CLUSTER") == 0)
cmdtype = PROGRESS_COMMAND_CLUSTER;
+ else if (pg_strcasecmp(cmd, "REPACK") == 0)
+ cmdtype = PROGRESS_COMMAND_REPACK;
else if (pg_strcasecmp(cmd, "CREATE INDEX") == 0)
cmdtype = PROGRESS_COMMAND_CREATE_INDEX;
else if (pg_strcasecmp(cmd, "BASEBACKUP") == 0)
diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index 98951aef82c..31271786f21 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -1223,7 +1223,7 @@ static const char *const sql_commands[] = {
"DELETE FROM", "DISCARD", "DO", "DROP", "END", "EXECUTE", "EXPLAIN",
"FETCH", "GRANT", "IMPORT FOREIGN SCHEMA", "INSERT INTO", "LISTEN", "LOAD", "LOCK",
"MERGE INTO", "MOVE", "NOTIFY", "PREPARE",
- "REASSIGN", "REFRESH MATERIALIZED VIEW", "REINDEX", "RELEASE",
+ "REASSIGN", "REFRESH MATERIALIZED VIEW", "REINDEX", "RELEASE", "REPACK",
"RESET", "REVOKE", "ROLLBACK",
"SAVEPOINT", "SECURITY LABEL", "SELECT", "SET", "SHOW", "START",
"TABLE", "TRUNCATE", "UNLISTEN", "UPDATE", "VACUUM", "VALUES", "WITH",
@@ -4913,6 +4913,35 @@ match_previous_words(int pattern_id,
COMPLETE_WITH_QUERY(Query_for_list_of_tablespaces);
}
+/* REPACK */
+ else if (Matches("REPACK"))
+ COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_clusterables);
+ else if (Matches("REPACK", "(*)"))
+ COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_clusterables);
+ /* If we have REPACK <sth>, then add "USING INDEX" */
+ else if (Matches("REPACK", MatchAnyExcept("(")))
+ COMPLETE_WITH("USING INDEX");
+ /* If we have REPACK (*) <sth>, then add "USING INDEX" */
+ else if (Matches("REPACK", "(*)", MatchAny))
+ COMPLETE_WITH("USING INDEX");
+ /* If we have REPACK <sth> USING, then add the index as well */
+ else if (Matches("REPACK", MatchAny, "USING", "INDEX"))
+ {
+ set_completion_reference(prev3_wd);
+ COMPLETE_WITH_SCHEMA_QUERY(Query_for_index_of_table);
+ }
+ else if (HeadMatches("REPACK", "(*") &&
+ !HeadMatches("REPACK", "(*)"))
+ {
+ /*
+ * This fires if we're in an unfinished parenthesized option list.
+ * get_previous_words treats a completed parenthesized option list as
+ * one word, so the above test is correct.
+ */
+ if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
+ COMPLETE_WITH("VERBOSE");
+ }
+
/* SECURITY LABEL */
else if (Matches("SECURITY"))
COMPLETE_WITH("LABEL");
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 60088a64cbb..c2976905e4d 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -31,10 +31,27 @@ typedef struct ClusterParams
bits32 options; /* bitmask of CLUOPT_* */
} ClusterParams;
+/*
+ * cluster.c currently implements three nearly identical commands: CLUSTER,
+ * VACUUM FULL and REPACK. Where needed, use this enumeration to distinguish
+ * which of these commands is being executed.
+ *
+ * Remove this stuff when removing the (now deprecated) CLUSTER and VACUUM
+ * FULL commands.
+ */
+typedef enum ClusterCommand
+{
+ CLUSTER_COMMAND_CLUSTER,
+ CLUSTER_COMMAND_REPACK,
+ CLUSTER_COMMAND_VACUUM
+} ClusterCommand;
+
extern void cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel);
-extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params);
+extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
+ ClusterCommand cmd);
extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid,
- LOCKMODE lockmode);
+ LOCKMODE lockmode,
+ ClusterCommand cmd);
extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
extern Oid make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
@@ -48,4 +65,5 @@ extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
MultiXactId cutoffMulti,
char newrelpersistence);
+extern void repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel);
#endif /* CLUSTER_H */
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 7c736e7b03b..7644267e14f 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -56,24 +56,48 @@
#define PROGRESS_ANALYZE_PHASE_COMPUTE_EXT_STATS 4
#define PROGRESS_ANALYZE_PHASE_FINALIZE_ANALYZE 5
-/* Progress parameters for cluster */
-#define PROGRESS_CLUSTER_COMMAND 0
-#define PROGRESS_CLUSTER_PHASE 1
-#define PROGRESS_CLUSTER_INDEX_RELID 2
-#define PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED 3
-#define PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN 4
-#define PROGRESS_CLUSTER_TOTAL_HEAP_BLKS 5
-#define PROGRESS_CLUSTER_HEAP_BLKS_SCANNED 6
-#define PROGRESS_CLUSTER_INDEX_REBUILD_COUNT 7
-
-/* Phases of cluster (as advertised via PROGRESS_CLUSTER_PHASE) */
-#define PROGRESS_CLUSTER_PHASE_SEQ_SCAN_HEAP 1
-#define PROGRESS_CLUSTER_PHASE_INDEX_SCAN_HEAP 2
-#define PROGRESS_CLUSTER_PHASE_SORT_TUPLES 3
-#define PROGRESS_CLUSTER_PHASE_WRITE_NEW_HEAP 4
-#define PROGRESS_CLUSTER_PHASE_SWAP_REL_FILES 5
-#define PROGRESS_CLUSTER_PHASE_REBUILD_INDEX 6
-#define PROGRESS_CLUSTER_PHASE_FINAL_CLEANUP 7
+/*
+ * Progress parameters for REPACK.
+ *
+ * Note: Since REPACK shares some code with CLUSTER, (some of) these values
+ * are also used by CLUSTER. (CLUSTER is now deprecated, so it makes no sense
+ * to introduce a separate set of constants.)
+ */
+#define PROGRESS_REPACK_COMMAND 0
+#define PROGRESS_REPACK_PHASE 1
+#define PROGRESS_REPACK_INDEX_RELID 2
+#define PROGRESS_REPACK_HEAP_TUPLES_SCANNED 3
+#define PROGRESS_REPACK_HEAP_TUPLES_WRITTEN 4
+#define PROGRESS_REPACK_TOTAL_HEAP_BLKS 5
+#define PROGRESS_REPACK_HEAP_BLKS_SCANNED 6
+#define PROGRESS_REPACK_INDEX_REBUILD_COUNT 7
+
+/*
+ * Phases of repack (as advertised via PROGRESS_REPACK_PHASE).
+ *
+ * Note: Since REPACK shares some code with CLUSTER, (some of) these values
+ * are also used by CLUSTER. (CLUSTER is now deprecated, so it makes no sense
+ * to introduce a separate set of constants.)
+ */
+#define PROGRESS_REPACK_PHASE_SEQ_SCAN_HEAP 1
+#define PROGRESS_REPACK_PHASE_INDEX_SCAN_HEAP 2
+#define PROGRESS_REPACK_PHASE_SORT_TUPLES 3
+#define PROGRESS_REPACK_PHASE_WRITE_NEW_HEAP 4
+#define PROGRESS_REPACK_PHASE_SWAP_REL_FILES 5
+#define PROGRESS_REPACK_PHASE_REBUILD_INDEX 6
+#define PROGRESS_REPACK_PHASE_FINAL_CLEANUP 7
+
+/* Commands of PROGRESS_REPACK */
+#define PROGRESS_REPACK_COMMAND_REPACK 1
+
+/*
+ * Progress parameters for cluster.
+ *
+ * Although we need to report REPACK and CLUSTER in separate views, the
+ * parameters and phases of CLUSTER are a subset of those of REPACK. Therefore
+ * we just use the appropriate values defined for REPACK above instead of
+ * defining a separate set of constants here.
+ */
/* Commands of PROGRESS_CLUSTER */
#define PROGRESS_CLUSTER_COMMAND_CLUSTER 1
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index df331b1c0d9..4ef76c852f5 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3921,6 +3921,19 @@ typedef struct ClusterStmt
List *params; /* list of DefElem nodes */
} ClusterStmt;
+/* ----------------------
+ * Repack Statement
+ * ----------------------
+ */
+typedef struct RepackStmt
+{
+ NodeTag type;
+ RangeVar *relation; /* relation being repacked */
+ char *indexname; /* order tuples by this index */
+ List *params; /* list of DefElem nodes */
+} RepackStmt;
+
+
/* ----------------------
* Vacuum and Analyze Statements
*
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 40cf090ce61..0932d6fce5b 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -373,6 +373,7 @@ PG_KEYWORD("reindex", REINDEX, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("relative", RELATIVE_P, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("release", RELEASE, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("rename", RENAME, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("repack", REPACK, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("repeatable", REPEATABLE, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("replace", REPLACE, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("replica", REPLICA, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..cceb312f2b3 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -196,6 +196,7 @@ PG_CMDTAG(CMDTAG_REASSIGN_OWNED, "REASSIGN OWNED", false, false, false)
PG_CMDTAG(CMDTAG_REFRESH_MATERIALIZED_VIEW, "REFRESH MATERIALIZED VIEW", true, false, false)
PG_CMDTAG(CMDTAG_REINDEX, "REINDEX", true, false, false)
PG_CMDTAG(CMDTAG_RELEASE, "RELEASE", false, false, false)
+PG_CMDTAG(CMDTAG_REPACK, "REPACK", false, false, false)
PG_CMDTAG(CMDTAG_RESET, "RESET", false, false, false)
PG_CMDTAG(CMDTAG_REVOKE, "REVOKE", true, false, false)
PG_CMDTAG(CMDTAG_REVOKE_ROLE, "REVOKE ROLE", false, false, false)
diff --git a/src/include/utils/backend_progress.h b/src/include/utils/backend_progress.h
index dda813ab407..da3d14bb97b 100644
--- a/src/include/utils/backend_progress.h
+++ b/src/include/utils/backend_progress.h
@@ -25,6 +25,7 @@ typedef enum ProgressCommandType
PROGRESS_COMMAND_VACUUM,
PROGRESS_COMMAND_ANALYZE,
PROGRESS_COMMAND_CLUSTER,
+ PROGRESS_COMMAND_REPACK,
PROGRESS_COMMAND_CREATE_INDEX,
PROGRESS_COMMAND_BASEBACKUP,
PROGRESS_COMMAND_COPY,
diff --git a/src/test/regress/expected/cluster.out b/src/test/regress/expected/cluster.out
index 4d40a6809ab..ed7df29b8e5 100644
--- a/src/test/regress/expected/cluster.out
+++ b/src/test/regress/expected/cluster.out
@@ -254,6 +254,120 @@ ORDER BY 1;
clstr_tst_pkey
(3 rows)
+-- REPACK handles individual tables identically to CLUSTER, but it's worth
+-- checking if it handles table hierarchies identically as well.
+REPACK clstr_tst USING INDEX clstr_tst_c;
+-- Verify that inheritance link still works
+INSERT INTO clstr_tst_inh VALUES (0, 100, 'in child table 2');
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst;
+ a | b | c | substring | length
+----+-----+------------------+--------------------------------+--------
+ 10 | 14 | catorce | |
+ 18 | 5 | cinco | |
+ 9 | 4 | cuatro | |
+ 26 | 19 | diecinueve | |
+ 12 | 18 | dieciocho | |
+ 30 | 16 | dieciseis | |
+ 24 | 17 | diecisiete | |
+ 2 | 10 | diez | |
+ 23 | 12 | doce | |
+ 11 | 2 | dos | |
+ 25 | 9 | nueve | |
+ 31 | 8 | ocho | |
+ 1 | 11 | once | |
+ 28 | 15 | quince | |
+ 32 | 6 | seis | xyzzyxyzzyxyzzyxyzzyxyzzyxyzzy | 500000
+ 29 | 7 | siete | |
+ 15 | 13 | trece | |
+ 22 | 30 | treinta | |
+ 17 | 32 | treinta y dos | |
+ 3 | 31 | treinta y uno | |
+ 5 | 3 | tres | |
+ 20 | 1 | uno | |
+ 6 | 20 | veinte | |
+ 14 | 25 | veinticinco | |
+ 21 | 24 | veinticuatro | |
+ 4 | 22 | veintidos | |
+ 19 | 29 | veintinueve | |
+ 16 | 28 | veintiocho | |
+ 27 | 26 | veintiseis | |
+ 13 | 27 | veintisiete | |
+ 7 | 23 | veintitres | |
+ 8 | 21 | veintiuno | |
+ 0 | 100 | in child table | |
+ 0 | 100 | in child table 2 | |
+(34 rows)
+
+-- Verify that foreign key link still works
+INSERT INTO clstr_tst (b, c) VALUES (1111, 'this should fail');
+ERROR: insert or update on table "clstr_tst" violates foreign key constraint "clstr_tst_con"
+DETAIL: Key (b)=(1111) is not present in table "clstr_tst_s".
+SELECT conname FROM pg_constraint WHERE conrelid = 'clstr_tst'::regclass
+ORDER BY 1;
+ conname
+----------------------
+ clstr_tst_a_not_null
+ clstr_tst_con
+ clstr_tst_pkey
+(3 rows)
+
+-- Yet another code path: REPACK w/o index.
+REPACK clstr_tst;
+-- Verify that inheritance link still works
+INSERT INTO clstr_tst_inh VALUES (0, 100, 'in child table 3');
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst;
+ a | b | c | substring | length
+----+-----+------------------+--------------------------------+--------
+ 10 | 14 | catorce | |
+ 18 | 5 | cinco | |
+ 9 | 4 | cuatro | |
+ 26 | 19 | diecinueve | |
+ 12 | 18 | dieciocho | |
+ 30 | 16 | dieciseis | |
+ 24 | 17 | diecisiete | |
+ 2 | 10 | diez | |
+ 23 | 12 | doce | |
+ 11 | 2 | dos | |
+ 25 | 9 | nueve | |
+ 31 | 8 | ocho | |
+ 1 | 11 | once | |
+ 28 | 15 | quince | |
+ 32 | 6 | seis | xyzzyxyzzyxyzzyxyzzyxyzzyxyzzy | 500000
+ 29 | 7 | siete | |
+ 15 | 13 | trece | |
+ 22 | 30 | treinta | |
+ 17 | 32 | treinta y dos | |
+ 3 | 31 | treinta y uno | |
+ 5 | 3 | tres | |
+ 20 | 1 | uno | |
+ 6 | 20 | veinte | |
+ 14 | 25 | veinticinco | |
+ 21 | 24 | veinticuatro | |
+ 4 | 22 | veintidos | |
+ 19 | 29 | veintinueve | |
+ 16 | 28 | veintiocho | |
+ 27 | 26 | veintiseis | |
+ 13 | 27 | veintisiete | |
+ 7 | 23 | veintitres | |
+ 8 | 21 | veintiuno | |
+ 0 | 100 | in child table | |
+ 0 | 100 | in child table 2 | |
+ 0 | 100 | in child table 3 | |
+(35 rows)
+
+-- Verify that foreign key link still works
+INSERT INTO clstr_tst (b, c) VALUES (1111, 'this should fail');
+ERROR: insert or update on table "clstr_tst" violates foreign key constraint "clstr_tst_con"
+DETAIL: Key (b)=(1111) is not present in table "clstr_tst_s".
+SELECT conname FROM pg_constraint WHERE conrelid = 'clstr_tst'::regclass
+ORDER BY 1;
+ conname
+----------------------
+ clstr_tst_a_not_null
+ clstr_tst_con
+ clstr_tst_pkey
+(3 rows)
+
SELECT relname, relkind,
EXISTS(SELECT 1 FROM pg_class WHERE oid = c.reltoastrelid) AS hastoast
FROM pg_class c WHERE relname LIKE 'clstr_tst%' ORDER BY relname;
@@ -381,6 +495,35 @@ SELECT * FROM clstr_1;
2
(2 rows)
+-- REPACK w/o argument performs no ordering, so we can only check which tables
+-- have the relfilenode changed.
+RESET SESSION AUTHORIZATION;
+CREATE TEMP TABLE relnodes_old AS
+(SELECT relname, relfilenode
+FROM pg_class
+WHERE relname IN ('clstr_1', 'clstr_2', 'clstr_3'));
+SET SESSION AUTHORIZATION regress_clstr_user;
+SET client_min_messages = ERROR; -- order of "skipping" warnings may vary
+REPACK;
+RESET client_min_messages;
+RESET SESSION AUTHORIZATION;
+CREATE TEMP TABLE relnodes_new AS
+(SELECT relname, relfilenode
+FROM pg_class
+WHERE relname IN ('clstr_1', 'clstr_2', 'clstr_3'));
+-- Do the actual comparison. Unlike with CLUSTER, clstr_3 should have been
+-- processed because REPACK does not depend on a clustering index.
+SELECT o.relname FROM relnodes_old o
+JOIN relnodes_new n ON o.relname = n.relname
+WHERE o.relfilenode <> n.relfilenode
+ORDER BY o.relname;
+ relname
+---------
+ clstr_1
+ clstr_3
+(2 rows)
+
+SET SESSION AUTHORIZATION regress_clstr_user;
-- Test MVCC-safety of cluster. There isn't much we can do to verify the
-- results with a single backend...
CREATE TABLE clustertest (key int PRIMARY KEY);
@@ -495,6 +638,43 @@ ALTER TABLE clstrpart SET WITHOUT CLUSTER;
ERROR: cannot mark index clustered in partitioned table
ALTER TABLE clstrpart CLUSTER ON clstrpart_idx;
ERROR: cannot mark index clustered in partitioned table
+-- Check that REPACK sets new relfilenodes: it should process exactly the same
+-- tables as CLUSTER did.
+DROP TABLE old_cluster_info;
+DROP TABLE new_cluster_info;
+CREATE TEMP TABLE old_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+REPACK clstrpart USING INDEX clstrpart_idx;
+CREATE TEMP TABLE new_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM old_cluster_info AS old JOIN new_cluster_info AS new USING (relname) ORDER BY relname COLLATE "C";
+ relname | level | relkind | ?column?
+-------------+-------+---------+----------
+ clstrpart | 0 | p | t
+ clstrpart1 | 1 | p | t
+ clstrpart11 | 2 | r | f
+ clstrpart12 | 2 | p | t
+ clstrpart2 | 1 | r | f
+ clstrpart3 | 1 | p | t
+ clstrpart33 | 2 | r | f
+(7 rows)
+
+-- And finally the same for REPACK w/o index.
+DROP TABLE old_cluster_info;
+DROP TABLE new_cluster_info;
+CREATE TEMP TABLE old_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+REPACK clstrpart;
+CREATE TEMP TABLE new_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM old_cluster_info AS old JOIN new_cluster_info AS new USING (relname) ORDER BY relname COLLATE "C";
+ relname | level | relkind | ?column?
+-------------+-------+---------+----------
+ clstrpart | 0 | p | t
+ clstrpart1 | 1 | p | t
+ clstrpart11 | 2 | r | f
+ clstrpart12 | 2 | p | t
+ clstrpart2 | 1 | r | f
+ clstrpart3 | 1 | p | t
+ clstrpart33 | 2 | r | f
+(7 rows)
+
DROP TABLE clstrpart;
-- Ownership of partitions is checked
CREATE TABLE ptnowner(i int unique) PARTITION BY LIST (i);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 47478969135..84ca2dc3778 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2041,6 +2041,33 @@ pg_stat_progress_create_index| SELECT s.pid,
s.param15 AS partitions_done
FROM (pg_stat_get_progress_info('CREATE INDEX'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
+pg_stat_progress_repack| SELECT s.pid,
+ s.datid,
+ d.datname,
+ s.relid,
+ CASE s.param1
+ WHEN 1 THEN 'REPACK'::text
+ ELSE NULL::text
+ END AS command,
+ CASE s.param2
+ WHEN 0 THEN 'initializing'::text
+ WHEN 1 THEN 'seq scanning heap'::text
+ WHEN 2 THEN 'index scanning heap'::text
+ WHEN 3 THEN 'sorting tuples'::text
+ WHEN 4 THEN 'writing new heap'::text
+ WHEN 5 THEN 'swapping relation files'::text
+ WHEN 6 THEN 'rebuilding index'::text
+ WHEN 7 THEN 'performing final cleanup'::text
+ ELSE NULL::text
+ END AS phase,
+ (s.param3)::oid AS repack_index_relid,
+ s.param4 AS heap_tuples_scanned,
+ s.param5 AS heap_tuples_written,
+ s.param6 AS heap_blks_total,
+ s.param7 AS heap_blks_scanned,
+ s.param8 AS index_rebuild_count
+ FROM (pg_stat_get_progress_info('REPACK'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
+ LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_progress_vacuum| SELECT s.pid,
s.datid,
d.datname,
diff --git a/src/test/regress/sql/cluster.sql b/src/test/regress/sql/cluster.sql
index b7115f86104..e348e26fbfa 100644
--- a/src/test/regress/sql/cluster.sql
+++ b/src/test/regress/sql/cluster.sql
@@ -76,6 +76,33 @@ INSERT INTO clstr_tst (b, c) VALUES (1111, 'this should fail');
SELECT conname FROM pg_constraint WHERE conrelid = 'clstr_tst'::regclass
ORDER BY 1;
+-- REPACK handles individual tables identically to CLUSTER, but it's worth
+-- checking if it handles table hierarchies identically as well.
+REPACK clstr_tst USING INDEX clstr_tst_c;
+
+-- Verify that inheritance link still works
+INSERT INTO clstr_tst_inh VALUES (0, 100, 'in child table 2');
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst;
+
+-- Verify that foreign key link still works
+INSERT INTO clstr_tst (b, c) VALUES (1111, 'this should fail');
+
+SELECT conname FROM pg_constraint WHERE conrelid = 'clstr_tst'::regclass
+ORDER BY 1;
+
+-- Yet another code path: REPACK w/o index.
+REPACK clstr_tst;
+
+-- Verify that inheritance link still works
+INSERT INTO clstr_tst_inh VALUES (0, 100, 'in child table 3');
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst;
+
+-- Verify that foreign key link still works
+INSERT INTO clstr_tst (b, c) VALUES (1111, 'this should fail');
+
+SELECT conname FROM pg_constraint WHERE conrelid = 'clstr_tst'::regclass
+ORDER BY 1;
+
SELECT relname, relkind,
EXISTS(SELECT 1 FROM pg_class WHERE oid = c.reltoastrelid) AS hastoast
@@ -159,6 +186,34 @@ INSERT INTO clstr_1 VALUES (1);
CLUSTER clstr_1;
SELECT * FROM clstr_1;
+-- REPACK w/o argument performs no ordering, so we can only check which tables
+-- have the relfilenode changed.
+RESET SESSION AUTHORIZATION;
+CREATE TEMP TABLE relnodes_old AS
+(SELECT relname, relfilenode
+FROM pg_class
+WHERE relname IN ('clstr_1', 'clstr_2', 'clstr_3'));
+
+SET SESSION AUTHORIZATION regress_clstr_user;
+SET client_min_messages = ERROR; -- order of "skipping" warnings may vary
+REPACK;
+RESET client_min_messages;
+
+RESET SESSION AUTHORIZATION;
+CREATE TEMP TABLE relnodes_new AS
+(SELECT relname, relfilenode
+FROM pg_class
+WHERE relname IN ('clstr_1', 'clstr_2', 'clstr_3'));
+
+-- Do the actual comparison. Unlike with CLUSTER, clstr_3 should have been
+-- processed because REPACK does not depend on a clustering index.
+SELECT o.relname FROM relnodes_old o
+JOIN relnodes_new n ON o.relname = n.relname
+WHERE o.relfilenode <> n.relfilenode
+ORDER BY o.relname;
+
+SET SESSION AUTHORIZATION regress_clstr_user;
+
-- Test MVCC-safety of cluster. There isn't much we can do to verify the
-- results with a single backend...
@@ -229,6 +284,24 @@ SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM o
CLUSTER clstrpart;
ALTER TABLE clstrpart SET WITHOUT CLUSTER;
ALTER TABLE clstrpart CLUSTER ON clstrpart_idx;
+
+-- Check that REPACK sets new relfilenodes: it should process exactly the same
+-- tables as CLUSTER did.
+DROP TABLE old_cluster_info;
+DROP TABLE new_cluster_info;
+CREATE TEMP TABLE old_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+REPACK clstrpart USING INDEX clstrpart_idx;
+CREATE TEMP TABLE new_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM old_cluster_info AS old JOIN new_cluster_info AS new USING (relname) ORDER BY relname COLLATE "C";
+
+-- And finally the same for REPACK w/o index.
+DROP TABLE old_cluster_info;
+DROP TABLE new_cluster_info;
+CREATE TEMP TABLE old_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+REPACK clstrpart;
+CREATE TEMP TABLE new_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM old_cluster_info AS old JOIN new_cluster_info AS new USING (relname) ORDER BY relname COLLATE "C";
+
DROP TABLE clstrpart;
-- Ownership of partitions is checked
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index b66cecd8799..c7ea8fb93ca 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -416,6 +416,7 @@ ClientCertName
ClientConnectionInfo
ClientData
ClientSocket
+ClusterCommand
ClonePtrType
ClosePortalStmt
ClosePtrType
@@ -2506,6 +2507,7 @@ ReorderBufferTupleCidKey
ReorderBufferUpdateProgressTxnCB
ReorderTuple
RepOriginId
+RepackStmt
ReparameterizeForeignPathByChild_function
ReplaceVarsFromTargetList_context
ReplaceVarsNoMatchOption
--
2.43.5
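For anyone skimming the grammar hunk above, these are the statement forms the new `RepackStmt` productions accept (a sketch only; REPACK exists solely with this patch applied, and `clstr_tst`/`clstr_tst_c` are the regression-test table and index):

```sql
REPACK;                                    -- every table the user can maintain
REPACK (VERBOSE);                          -- VERBOSE is the only option so far
REPACK clstr_tst;                          -- single table, no tuple ordering
REPACK clstr_tst USING INDEX clstr_tst_c;  -- order tuples by the given index
REPACK (VERBOSE) clstr_tst USING INDEX clstr_tst_c;
```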
v12-0002-Move-conversion-of-a-historic-to-MVCC-snapshot-to-a-.patch
From 6afa32d3ea9d92d4eccd7a9befb2d4991418e649 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Tue, 1 Apr 2025 13:48:57 +0200
Subject: [PATCH 2/9] Move conversion of a "historic" to MVCC snapshot to a
separate function.
The conversion is now handled by SnapBuildMVCCFromHistoric(). REPACK
CONCURRENTLY will also need it.
---
src/backend/replication/logical/snapbuild.c | 51 +++++++++++++++++----
src/backend/utils/time/snapmgr.c | 3 +-
src/include/replication/snapbuild.h | 1 +
src/include/utils/snapmgr.h | 1 +
4 files changed, 45 insertions(+), 11 deletions(-)
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index b64e53de017..e5d2a583ce6 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -440,10 +440,7 @@ Snapshot
SnapBuildInitialSnapshot(SnapBuild *builder)
{
Snapshot snap;
- TransactionId xid;
TransactionId safeXid;
- TransactionId *newxip;
- int newxcnt = 0;
Assert(XactIsoLevel == XACT_REPEATABLE_READ);
Assert(builder->building_full_snapshot);
@@ -485,6 +482,31 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
MyProc->xmin = snap->xmin;
+ /* Convert the historic snapshot to MVCC snapshot. */
+ return SnapBuildMVCCFromHistoric(snap, true);
+}
+
+/*
+ * Turn a historic MVCC snapshot into an ordinary MVCC snapshot.
+ *
+ * Unlike a regular (non-historic) MVCC snapshot, the xip array of this
+ * snapshot contains not only running main transactions, but also their
+ * subtransactions. This difference has no impact on XidInMVCCSnapshot().
+ *
+ * Pass true for 'in_place' if you don't care about modifying the source
+ * snapshot. If you need a new instance, and one that was allocated as a
+ * single chunk of memory, pass false.
+ */
+Snapshot
+SnapBuildMVCCFromHistoric(Snapshot snapshot, bool in_place)
+{
+ TransactionId xid;
+ TransactionId *oldxip = snapshot->xip;
+ uint32 oldxcnt = snapshot->xcnt;
+ TransactionId *newxip;
+ int newxcnt = 0;
+ Snapshot result;
+
/* allocate in transaction context */
newxip = (TransactionId *)
palloc(sizeof(TransactionId) * GetMaxSnapshotXidCount());
@@ -495,7 +517,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
* classical snapshot by marking all non-committed transactions as
* in-progress. This can be expensive.
*/
- for (xid = snap->xmin; NormalTransactionIdPrecedes(xid, snap->xmax);)
+ for (xid = snapshot->xmin; NormalTransactionIdPrecedes(xid, snapshot->xmax);)
{
void *test;
@@ -503,7 +525,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
* Check whether transaction committed using the decoding snapshot
* meaning of ->xip.
*/
- test = bsearch(&xid, snap->xip, snap->xcnt,
+ test = bsearch(&xid, snapshot->xip, snapshot->xcnt,
sizeof(TransactionId), xidComparator);
if (test == NULL)
@@ -520,11 +542,22 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
}
/* adjust remaining snapshot fields as needed */
- snap->snapshot_type = SNAPSHOT_MVCC;
- snap->xcnt = newxcnt;
- snap->xip = newxip;
+ snapshot->xcnt = newxcnt;
+ snapshot->xip = newxip;
+
+ if (in_place)
+ result = snapshot;
+ else
+ {
+ result = CopySnapshot(snapshot);
+
+ /* Restore the original values so the source is intact. */
+ snapshot->xip = oldxip;
+ snapshot->xcnt = oldxcnt;
+ }
+ result->snapshot_type = SNAPSHOT_MVCC;
- return snap;
+ return result;
}
/*
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index ea35f30f494..70a6b8902d1 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -212,7 +212,6 @@ typedef struct ExportedSnapshot
static List *exportedSnapshots = NIL;
/* Prototypes for local functions */
-static Snapshot CopySnapshot(Snapshot snapshot);
static void UnregisterSnapshotNoOwner(Snapshot snapshot);
static void FreeSnapshot(Snapshot snapshot);
static void SnapshotResetXmin(void);
@@ -591,7 +590,7 @@ SetTransactionSnapshot(Snapshot sourcesnap, VirtualTransactionId *sourcevxid,
* The copy is palloc'd in TopTransactionContext and has initial refcounts set
* to 0. The returned snapshot has the copied flag set.
*/
-static Snapshot
+Snapshot
CopySnapshot(Snapshot snapshot)
{
Snapshot newsnap;
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 44031dcf6e3..6d4d2d1814c 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -73,6 +73,7 @@ extern void FreeSnapshotBuilder(SnapBuild *builder);
extern void SnapBuildSnapDecRefcount(Snapshot snap);
extern Snapshot SnapBuildInitialSnapshot(SnapBuild *builder);
+extern Snapshot SnapBuildMVCCFromHistoric(Snapshot snapshot, bool in_place);
extern const char *SnapBuildExportSnapshot(SnapBuild *builder);
extern void SnapBuildClearExportedSnapshot(void);
extern void SnapBuildResetExportedSnapshotState(void);
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index d346be71642..147b190210a 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -60,6 +60,7 @@ extern Snapshot GetTransactionSnapshot(void);
extern Snapshot GetLatestSnapshot(void);
extern void SnapshotSetCommandId(CommandId curcid);
+extern Snapshot CopySnapshot(Snapshot snapshot);
extern Snapshot GetCatalogSnapshot(Oid relid);
extern Snapshot GetNonHistoricCatalogSnapshot(Oid relid);
extern void InvalidateCatalogSnapshot(void);
--
2.43.5
v12-0003-Move-the-recheck-branch-to-a-separate-function.patch
From 8584b47b926b8ba4f917db9af5337486228c81a1 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Tue, 1 Apr 2025 13:48:57 +0200
Subject: [PATCH 3/9] Move the "recheck" branch to a separate function.
At some point I thought that the relation must be unlocked during the call of
setup_logical_decoding(), to avoid a deadlock. In that case we'd need to
recheck afterwards if the table still meets the requirements of cluster_rel().
Eventually I concluded that the risk of that deadlock is not that high, so the
table stays locked during the call of setup_logical_decoding(). Therefore the
rechecking code is only executed once per table. Anyway, this patch might be useful in terms of code readability.
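Abstracted away from the cluster.c internals, the shape of this refactoring is an "early return with cleanup" helper replacing a nested block of `goto out` branches. A minimal, hypothetical C sketch (the struct and names are illustrative, not the real code):

```c
#include <stdbool.h>

/* illustrative stand-ins for Relation and relation_close() */
typedef struct
{
	bool		has_privilege;
	bool		is_other_temp;
	bool		closed;
} DemoRel;

static void
demo_close(DemoRel *rel)
{
	rel->closed = true;
}

/*
 * Each failed check releases the relation itself and returns false,
 * so the caller collapses to a single "if (!recheck(...)) goto out;".
 */
static bool
demo_recheck(DemoRel *rel)
{
	if (!rel->has_privilege)
	{
		demo_close(rel);
		return false;
	}
	if (rel->is_other_temp)
	{
		demo_close(rel);
		return false;
	}
	return true;
}
```

The design point is that the helper owns the cleanup on every failure path, which is what lets the extracted cluster_rel_recheck() keep the relation_close() calls local instead of scattering them through cluster_rel().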
---
src/backend/commands/cluster.c | 106 +++++++++++++++++++--------------
1 file changed, 61 insertions(+), 45 deletions(-)
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 9ae3d87e412..67625d52f12 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -78,6 +78,8 @@ typedef struct
static void cluster_multiple_rels(List *rtcs, ClusterParams *params,
ClusterCommand cmd);
+static bool cluster_rel_recheck(Relation OldHeap, Oid indexOid, Oid userid,
+ ClusterCommand cmd, int options);
static void rebuild_relation(Relation OldHeap, Relation index, bool verbose,
ClusterCommand cmd);
static void copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
@@ -329,52 +331,9 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
* to cluster a not-previously-clustered index.
*/
if (recheck)
- {
- /* Check that the user still has privileges for the relation */
- if (!cluster_is_permitted_for_relation(tableOid, save_userid, cmd))
- {
- relation_close(OldHeap, AccessExclusiveLock);
+ if (!cluster_rel_recheck(OldHeap, indexOid, save_userid, cmd,
+ params->options))
goto out;
- }
-
- /*
- * Silently skip a temp table for a remote session. Only doing this
- * check in the "recheck" case is appropriate (which currently means
- * somebody is executing a database-wide CLUSTER or on a partitioned
- * table), because there is another check in cluster() which will stop
- * any attempt to cluster remote temp tables by name. There is
- * another check in cluster_rel which is redundant, but we leave it
- * for extra safety.
- */
- if (RELATION_IS_OTHER_TEMP(OldHeap))
- {
- relation_close(OldHeap, AccessExclusiveLock);
- goto out;
- }
-
- if (OidIsValid(indexOid))
- {
- /*
- * Check that the index still exists
- */
- if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(indexOid)))
- {
- relation_close(OldHeap, AccessExclusiveLock);
- goto out;
- }
-
- /*
- * Check that the index is still the one with indisclustered set,
- * if needed.
- */
- if ((params->options & CLUOPT_RECHECK_ISCLUSTERED) != 0 &&
- !get_index_isclustered(indexOid))
- {
- relation_close(OldHeap, AccessExclusiveLock);
- goto out;
- }
- }
- }
/*
* We allow VACUUM FULL, but not CLUSTER, on shared catalogs. CLUSTER
@@ -459,6 +418,63 @@ out:
pgstat_progress_end_command();
}
+/*
+ * Check if the table (and its index) still meets the requirements of
+ * cluster_rel().
+ */
+static bool
+cluster_rel_recheck(Relation OldHeap, Oid indexOid, Oid userid,
+ ClusterCommand cmd, int options)
+{
+ Oid tableOid = RelationGetRelid(OldHeap);
+
+ /* Check that the user still has privileges for the relation */
+ if (!cluster_is_permitted_for_relation(tableOid, userid, cmd))
+ {
+ relation_close(OldHeap, AccessExclusiveLock);
+ return false;
+ }
+
+ /*
+ * Silently skip a temp table for a remote session. Only doing this check
+ * in the "recheck" case is appropriate (which currently means somebody is
+ * executing a database-wide CLUSTER or on a partitioned table), because
+ * there is another check in cluster() which will stop any attempt to
+ * cluster remote temp tables by name. There is another check in
+ * cluster_rel which is redundant, but we leave it for extra safety.
+ */
+ if (RELATION_IS_OTHER_TEMP(OldHeap))
+ {
+ relation_close(OldHeap, AccessExclusiveLock);
+ return false;
+ }
+
+ if (OidIsValid(indexOid))
+ {
+ /*
+ * Check that the index still exists
+ */
+ if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(indexOid)))
+ {
+ relation_close(OldHeap, AccessExclusiveLock);
+ return false;
+ }
+
+ /*
+ * Check that the index is still the one with indisclustered set, if
+ * needed.
+ */
+ if ((options & CLUOPT_RECHECK_ISCLUSTERED) != 0 &&
+ !get_index_isclustered(indexOid))
+ {
+ relation_close(OldHeap, AccessExclusiveLock);
+ return false;
+ }
+ }
+
+ return true;
+}
+
/*
* Verify that the specified heap and index are valid to cluster on
*
--
2.43.5
v12-0004-Add-CONCURRENTLY-option-to-REPACK-command.patch
From 8fc1631deb223682d60eaf18c87b0abc42db5214 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Tue, 1 Apr 2025 13:48:57 +0200
Subject: [PATCH 4/9] Add CONCURRENTLY option to REPACK command.
The REPACK command copies the relation data into a new file, creates new
indexes and eventually swaps the files. To make sure that the old file does
not change during the copying, the relation is locked in an exclusive mode,
which prevents applications from both reading and writing. (To keep the data
consistent, we'd only need to prevent the applications from writing, but even
reading needs to be blocked before we can swap the files - otherwise some
applications could continue using the old file. Since we cannot get stronger
lock without releasing the weaker one first, we acquire the exclusive lock in
the beginning and keep it till the end of the processing.)
This patch introduces an alternative workflow, which only requires the
exclusive lock when the relation (and index) files are being swapped.
(Supposedly, the swapping should be pretty fast.) On the other hand, when we
copy the data to the new file, we allow applications to read from the relation
and even write into it.
First, we scan the relation using a "historic snapshot", and insert all the
tuples satisfying this snapshot into the new file.
Second, logical decoding is used to capture the data changes done by
applications during the copying (i.e. changes that do not satisfy the historic
snapshot mentioned above), and those are applied to the new file before we
acquire the exclusive lock we need to swap the files. (Of course, more data
changes can take place while we are waiting for the lock - these will be
applied to the new file after we have acquired the lock, before we swap the
files.)
Since the logical decoding system, during its startup, waits until all the
transactions which already have XID assigned have finished, there is a risk of
deadlock if a transaction that already changed anything in the database tries
to acquire a conflicting lock on the table REPACK CONCURRENTLY is working
on. As an example, consider a transaction running a CREATE INDEX command on the
table that is being REPACKed CONCURRENTLY. On the other hand, DML commands
(INSERT, UPDATE, DELETE) are not a problem as their lock does not conflict
with REPACK CONCURRENTLY.
The current approach is that we accept the risk. If we tried to avoid it, it'd
be necessary to unlock the table before the logical decoding is setup and lock
it again afterwards. Such temporary unlocking would imply re-checking if the
table still meets all the requirements for REPACK CONCURRENTLY.
Like the existing implementation of REPACK, the variant with the CONCURRENTLY
option also requires an extra space for the new relation and index files
(which coexist with the old files for some time). In addition, the
CONCURRENTLY option might introduce a lag in releasing WAL segments for
archiving / recycling. This is due to the decoding of the data changes done by
applications concurrently. When copying the table contents into the new file,
we check the lag periodically. If it exceeds the size of WAL segment, we
decode all the available WAL before resuming the copying. (Of course, the
changes are not applied until the whole table contents has been copied.) A
background worker might be a better approach for the decoding - let's consider
implementing it in the future.
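The periodic lag check described in the last paragraph amounts to comparing the current WAL flush pointer against the position recorded at the previous decode, and firing once the delta exceeds one WAL segment. A hypothetical standalone sketch, with XLogRecPtr modeled as a plain 64-bit byte offset (not the actual cluster.c code):

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/*
 * Return true when more than wal_segment_size bytes of WAL have
 * accumulated since *decoded_upto.  The caller is then expected to
 * decode everything up to end_of_wal, which this helper records as
 * the new baseline for the next check.
 */
static bool
wal_decode_due(XLogRecPtr end_of_wal, XLogRecPtr *decoded_upto,
			   uint64_t wal_segment_size)
{
	if (end_of_wal - *decoded_upto > wal_segment_size)
	{
		*decoded_upto = end_of_wal;
		return true;
	}
	return false;
}
```

In the patch this corresponds to the GetFlushRecPtr() / end_of_wal_prev comparison in heapam_relation_copy_for_cluster(), with repack_decode_concurrent_changes() playing the role of the decode step.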
---
doc/src/sgml/monitoring.sgml | 65 +-
doc/src/sgml/ref/repack.sgml | 116 +-
src/Makefile | 1 +
src/backend/access/heap/heapam_handler.c | 145 +-
src/backend/access/heap/heapam_visibility.c | 30 +-
src/backend/access/heap/rewriteheap.c | 6 +-
src/backend/access/transam/xact.c | 11 +-
src/backend/catalog/index.c | 43 +-
src/backend/catalog/system_views.sql | 30 +-
src/backend/commands/cluster.c | 1818 +++++++++++++++--
src/backend/commands/matview.c | 2 +-
src/backend/commands/tablecmds.c | 1 +
src/backend/commands/vacuum.c | 12 +-
src/backend/meson.build | 1 +
src/backend/parser/gram.y | 17 +-
src/backend/replication/logical/decode.c | 24 +
src/backend/replication/logical/snapbuild.c | 20 +
.../replication/pgoutput_repack/Makefile | 32 +
.../replication/pgoutput_repack/meson.build | 18 +
.../pgoutput_repack/pgoutput_repack.c | 288 +++
src/backend/storage/ipc/ipci.c | 1 +
.../utils/activity/wait_event_names.txt | 1 +
src/backend/utils/cache/relcache.c | 1 +
src/backend/utils/time/snapmgr.c | 3 +-
src/bin/psql/tab-complete.in.c | 25 +-
src/include/access/heapam.h | 4 +
src/include/access/tableam.h | 10 +
src/include/catalog/index.h | 3 +
src/include/commands/cluster.h | 87 +-
src/include/commands/progress.h | 17 +-
src/include/nodes/parsenodes.h | 1 +
src/include/replication/snapbuild.h | 1 +
src/include/storage/lockdefs.h | 4 +-
src/include/storage/lwlocklist.h | 1 +
src/include/utils/snapmgr.h | 2 +
src/test/regress/expected/rules.out | 29 +-
src/tools/pgindent/typedefs.list | 4 +
37 files changed, 2611 insertions(+), 263 deletions(-)
create mode 100644 src/backend/replication/pgoutput_repack/Makefile
create mode 100644 src/backend/replication/pgoutput_repack/meson.build
create mode 100644 src/backend/replication/pgoutput_repack/pgoutput_repack.c
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 0a6229c391a..e385a55272b 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5835,14 +5835,35 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>heap_tuples_written</structfield> <type>bigint</type>
+ <structfield>heap_tuples_inserted</structfield> <type>bigint</type>
</para>
<para>
- Number of heap tuples written.
+ Number of heap tuples inserted.
This counter only advances when the phase is
<literal>seq scanning heap</literal>,
- <literal>index scanning heap</literal>
- or <literal>writing new heap</literal>.
+ <literal>index scanning heap</literal>,
+ <literal>writing new heap</literal>
+ or <literal>catch-up</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_updated</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples updated.
+ This counter only advances when the phase is <literal>catch-up</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_deleted</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples deleted.
+ This counter only advances when the phase is <literal>catch-up</literal>.
</para></entry>
</row>
@@ -6058,14 +6079,35 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>heap_tuples_written</structfield> <type>bigint</type>
+ <structfield>heap_tuples_inserted</structfield> <type>bigint</type>
</para>
<para>
- Number of heap tuples written.
+ Number of heap tuples inserted.
This counter only advances when the phase is
<literal>seq scanning heap</literal>,
- <literal>index scanning heap</literal>
- or <literal>writing new heap</literal>.
+ <literal>index scanning heap</literal>,
+ <literal>writing new heap</literal>
+ or <literal>catch-up</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_updated</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples updated.
+ This counter only advances when the phase is <literal>catch-up</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_deleted</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples deleted.
+ This counter only advances when the phase is <literal>catch-up</literal>.
</para></entry>
</row>
@@ -6146,6 +6188,13 @@ FROM pg_stat_get_backend_idset() AS backendid;
<command>REPACK</command> is currently writing the new heap.
</entry>
</row>
+ <row>
+ <entry><literal>catch-up</literal></entry>
+ <entry>
<command>REPACK</command> is currently processing the DML commands that
other transactions executed during any of the preceding phases.
+ </entry>
+ </row>
<row>
<entry><literal>swapping relation files</literal></entry>
<entry>
diff --git a/doc/src/sgml/ref/repack.sgml b/doc/src/sgml/ref/repack.sgml
index 84f3c3e3f2b..9ee640e3517 100644
--- a/doc/src/sgml/ref/repack.sgml
+++ b/doc/src/sgml/ref/repack.sgml
@@ -22,6 +22,7 @@ PostgreSQL documentation
<refsynopsisdiv>
<synopsis>
REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <replaceable class="parameter">table_name</replaceable> [ USING INDEX <replaceable class="parameter">index_name</replaceable> ] ]
+REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] CONCURRENTLY <replaceable class="parameter">table_name</replaceable> [ USING INDEX <replaceable class="parameter">index_name</replaceable> ]
<phrase>where <replaceable class="parameter">option</replaceable> can be one of:</phrase>
@@ -48,7 +49,8 @@ REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
processes every table and materialized view in the current database that
the current user has the <literal>MAINTAIN</literal> privilege on. This
form of <command>REPACK</command> cannot be executed inside a transaction
- block.
+ block. Also, this form is not allowed if
+ the <literal>CONCURRENTLY</literal> option is used.
</para>
<para>
@@ -61,7 +63,8 @@ REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
When a table is being repacked, an <literal>ACCESS EXCLUSIVE</literal> lock
is acquired on it. This prevents any other database operations (both reads
and writes) from operating on the table until the <command>REPACK</command>
- is finished.
+ is finished. If you want to keep the table accessible during the repacking,
+ consider using the <literal>CONCURRENTLY</literal> option.
</para>
<refsect2 id="sql-repack-notes-on-clustering" xreflabel="Notes on Clustering">
@@ -160,6 +163,115 @@ REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><literal>CONCURRENTLY</literal></term>
+ <listitem>
+ <para>
+ Allow other transactions to use the table while it is being repacked.
+ </para>
+
+ <para>
+ Internally, <command>REPACK</command> copies the contents of the table
+ (ignoring dead tuples) into a new file, sorted by the specified index,
+ and also creates a new file for each index. Then it swaps the old and
+ new files for the table and all the indexes, and deletes the old
+ files. The <literal>ACCESS EXCLUSIVE</literal> lock is needed to make
+ sure that the old files do not change during the processing because the
+ changes would get lost due to the swap.
+ </para>
+
+ <para>
+ With the <literal>CONCURRENTLY</literal> option, the <literal>ACCESS
+ EXCLUSIVE</literal> lock is only acquired to swap the table and index
+ files. The data changes that took place during the creation of the new
+ table and index files are captured using logical decoding
+ (<xref linkend="logicaldecoding"/>) and applied before
+ the <literal>ACCESS EXCLUSIVE</literal> lock is requested. Thus the lock
+ is typically held only for the time needed to swap the files, which
+ should be pretty short.
+ </para>
+
+ <para>
+ Note that <command>REPACK</command> with
+ the <literal>CONCURRENTLY</literal> option does not try to order the
+ rows inserted into the table after the repacking started. Also note
+ that <command>REPACK</command> might fail to complete due to DDL
+ commands executed on the table by other transactions during the
+ repacking.
+ </para>
+
+ <note>
+ <para>
+ In addition to the temporary space requirements explained in
+ <xref linkend="sql-repack-notes-on-resources"/>,
+ the <literal>CONCURRENTLY</literal> option can add to the usage of
+ temporary space a bit more. The reason is that other transactions can
+ perform DML operations which cannot be applied to the new file until
+ <command>REPACK</command> has copied all the tuples from the old
+ file. Thus the tuples inserted into the old file during the copying are
+ also stored separately in a temporary file, so they can eventually
+ be applied to the new file.
+ </para>
+
+ <para>
+ Furthermore, the data changes performed during the copying are
+ extracted from <link linkend="wal">write-ahead log</link> (WAL), and
+ this extraction (decoding) only takes place when a certain amount of WAL
+ has been written. Therefore, WAL removal can be delayed by this
+ threshold. Currently the threshold is equal to the value of
+ the <link linkend="guc-wal-segment-size"><varname>wal_segment_size</varname></link>
+ configuration parameter.
+ </para>
+ </note>
+
+ <para>
+ The <literal>CONCURRENTLY</literal> option cannot be used in the
+ following cases:
+
+ <itemizedlist>
+ <listitem>
+ <para>
+ The table is <literal>UNLOGGED</literal>.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The table is partitioned.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The table is a system catalog or a <acronym>TOAST</acronym> table.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <command>REPACK</command> is executed inside a transaction block.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The <link linkend="guc-wal-level"><varname>wal_level</varname></link>
+ configuration parameter is less than <literal>logical</literal>.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The <link linkend="guc-max-replication-slots"><varname>max_replication_slots</varname></link>
+ configuration parameter does not allow for creation of an additional
+ replication slot.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><literal>VERBOSE</literal></term>
<listitem>
diff --git a/src/Makefile b/src/Makefile
index 2f31a2f20a7..b18c9a14ffa 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -23,6 +23,7 @@ SUBDIRS = \
interfaces \
backend/replication/libpqwalreceiver \
backend/replication/pgoutput \
+ backend/replication/pgoutput_repack \
fe_utils \
bin \
pl \
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 18e349c3466..371afa6ad59 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -33,6 +33,7 @@
#include "catalog/index.h"
#include "catalog/storage.h"
#include "catalog/storage_xlog.h"
+#include "commands/cluster.h"
#include "commands/progress.h"
#include "executor/executor.h"
#include "miscadmin.h"
@@ -53,6 +54,9 @@ static void reform_and_rewrite_tuple(HeapTuple tuple,
static bool SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
HeapTuple tuple,
OffsetNumber tupoffset);
+static HeapTuple accept_tuple_for_concurrent_copy(HeapTuple tuple,
+ Snapshot snapshot,
+ Buffer buffer);
static BlockNumber heapam_scan_get_blocks_done(HeapScanDesc hscan);
@@ -685,6 +689,8 @@ static void
heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
Relation OldIndex, bool use_sort,
TransactionId OldestXmin,
+ Snapshot snapshot,
+ LogicalDecodingContext *decoding_ctx,
TransactionId *xid_cutoff,
MultiXactId *multi_cutoff,
double *num_tuples,
@@ -705,6 +711,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
bool *isnull;
BufferHeapTupleTableSlot *hslot;
BlockNumber prev_cblock = InvalidBlockNumber;
+ bool concurrent = snapshot != NULL;
+ XLogRecPtr end_of_wal_prev = GetFlushRecPtr(NULL);
/* Remember if it's a system catalog */
is_system_catalog = IsSystemRelation(OldHeap);
@@ -783,8 +791,10 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
for (;;)
{
HeapTuple tuple;
+ bool tuple_copied = false;
Buffer buf;
bool isdead;
+ HTSV_Result vis;
CHECK_FOR_INTERRUPTS();
@@ -839,7 +849,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
LockBuffer(buf, BUFFER_LOCK_SHARE);
- switch (HeapTupleSatisfiesVacuum(tuple, OldestXmin, buf))
+ switch ((vis = HeapTupleSatisfiesVacuum(tuple, OldestXmin, buf)))
{
case HEAPTUPLE_DEAD:
/* Definitely dead */
@@ -855,14 +865,15 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
case HEAPTUPLE_INSERT_IN_PROGRESS:
/*
- * Since we hold exclusive lock on the relation, normally the
- * only way to see this is if it was inserted earlier in our
- * own transaction. However, it can happen in system
+ * As long as we hold exclusive lock on the relation, normally
+ * the only way to see this is if it was inserted earlier in
+ * our own transaction. However, it can happen in system
* catalogs, since we tend to release write lock before commit
- * there. Give a warning if neither case applies; but in any
- * case we had better copy it.
+ * there. Also, there's no exclusive lock during concurrent
+ * processing. Give a warning if neither case applies; but in
+ * any case we had better copy it.
*/
- if (!is_system_catalog &&
+ if (!is_system_catalog && !concurrent &&
!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(tuple->t_data)))
elog(WARNING, "concurrent insert in progress within table \"%s\"",
RelationGetRelationName(OldHeap));
@@ -874,7 +885,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
/*
* Similar situation to INSERT_IN_PROGRESS case.
*/
- if (!is_system_catalog &&
+ if (!is_system_catalog && !concurrent &&
!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetUpdateXid(tuple->t_data)))
elog(WARNING, "concurrent delete in progress within table \"%s\"",
RelationGetRelationName(OldHeap));
@@ -888,8 +899,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
break;
}
- LockBuffer(buf, BUFFER_LOCK_UNLOCK);
-
if (isdead)
{
*tups_vacuumed += 1;
@@ -900,9 +909,47 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
*tups_vacuumed += 1;
*tups_recently_dead -= 1;
}
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
continue;
}
+ if (concurrent)
+ {
+ /*
+ * Ignore concurrent changes now, they'll be processed later via
+ * logical decoding.
+ *
+ * INSERT_IN_PROGRESS is rejected right away because our snapshot
+ * should represent a point in time which should precede (or be
+ * equal to) the state of transactions as it was when the
+ * "SatisfiesVacuum" test was performed. Thus
+ * accept_tuple_for_concurrent_copy() should not consider the
+ * tuple inserted.
+ */
+ if (vis == HEAPTUPLE_INSERT_IN_PROGRESS)
+ tuple = NULL;
+ else
+ tuple = accept_tuple_for_concurrent_copy(tuple, snapshot,
+ buf);
+ /* Tuple not suitable for the new heap? */
+ if (tuple == NULL)
+ {
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ continue;
+ }
+
+ /* Remember that we have to free the tuple eventually. */
+ tuple_copied = true;
+ }
+
+ /*
+ * In the concurrent case, we have a copy of the tuple, so we don't
+ * worry whether the source tuple will be deleted / updated after we
+ * release the lock.
+ */
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
*num_tuples += 1;
if (tuplesort != NULL)
{
@@ -919,7 +966,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
{
const int ct_index[] = {
PROGRESS_REPACK_HEAP_TUPLES_SCANNED,
- PROGRESS_REPACK_HEAP_TUPLES_WRITTEN
+ PROGRESS_REPACK_HEAP_TUPLES_INSERTED
};
int64 ct_val[2];
@@ -934,6 +981,33 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
ct_val[1] = *num_tuples;
pgstat_progress_update_multi_param(2, ct_index, ct_val);
}
+ if (tuple_copied)
+ heap_freetuple(tuple);
+
+ /*
+ * Process the WAL produced by the load, as well as by other
+ * transactions, so that the replication slot can advance and WAL does
+ * not pile up. Use wal_segment_size as a threshold so that we do not
+ * introduce the decoding overhead too often.
+ *
+ * Of course, we must not apply the changes until the initial load has
+ * completed.
+ *
+ * Note that our insertions into the new table should not be decoded
+ * as we (intentionally) do not write the logical decoding specific
+ * information to WAL.
+ */
+ if (concurrent)
+ {
+ XLogRecPtr end_of_wal;
+
+ end_of_wal = GetFlushRecPtr(NULL);
+ if ((end_of_wal - end_of_wal_prev) > wal_segment_size)
+ {
+ repack_decode_concurrent_changes(decoding_ctx, end_of_wal);
+ end_of_wal_prev = end_of_wal;
+ }
+ }
}
if (indexScan != NULL)
@@ -977,7 +1051,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
values, isnull,
rwstate);
/* Report n_tuples */
- pgstat_progress_update_param(PROGRESS_REPACK_HEAP_TUPLES_WRITTEN,
+ pgstat_progress_update_param(PROGRESS_REPACK_HEAP_TUPLES_INSERTED,
n_tuples);
}
@@ -2023,6 +2097,53 @@ heapam_scan_get_blocks_done(HeapScanDesc hscan)
return blocks_done;
}
+/*
+ * Return a copy of 'tuple' if it has been inserted according to 'snapshot', or
+ * NULL if the insertion took place in the future. If the tuple is already
+ * marked as deleted or updated by a transaction that 'snapshot' still
+ * considers running, clear the deletion / update XID in the header of the
+ * copied tuple. This way the returned tuple is suitable for insertion into
+ * the new heap.
+ */
+static HeapTuple
+accept_tuple_for_concurrent_copy(HeapTuple tuple, Snapshot snapshot,
+ Buffer buffer)
+{
+ HeapTuple result;
+
+ Assert(snapshot->snapshot_type == SNAPSHOT_MVCC);
+
+ /*
+ * First, check if the tuple insertion is visible by our snapshot.
+ */
+ if (!HeapTupleMVCCInserted(tuple, snapshot, buffer))
+ return NULL;
+
+ result = heap_copytuple(tuple);
+
+ /*
+ * If the tuple was deleted / updated but our snapshot still sees it, we
+ * need to keep it. In that case, clear the information that indicates the
+ * deletion / update. Otherwise the tuple chain would stay incomplete (as
+ * we will reject the new tuple above), and the delete / update would fail
+ * if executed later during logical decoding.
+ */
+ if (TransactionIdIsNormal(HeapTupleHeaderGetRawXmax(result->t_data)) &&
+ HeapTupleMVCCNotDeleted(result, snapshot, buffer))
+ {
+ /* TODO More work needed here? */
+ result->t_data->t_infomask |= HEAP_XMAX_INVALID;
+ HeapTupleHeaderSetXmax(result->t_data, 0);
+ }
+
+ /*
+ * Accept the tuple even if our snapshot considers it deleted - older
+ * snapshots can still see the tuple, while the decoded transactions
+ * should not try to update / delete it again.
+ */
+ return result;
+}
+
/* ------------------------------------------------------------------------
* Miscellaneous callbacks for the heap AM
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index 05f6946fe60..a46e1812b21 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -955,13 +955,14 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
* did TransactionIdIsInProgress in each call --- to no avail, as long as the
* inserting/deleting transaction was still running --- which was more cycles
* and more contention on ProcArrayLock.
+ *
+ * The checks are split into two functions, HeapTupleMVCCInserted() and
+ * HeapTupleMVCCNotDeleted(), because they are also useful separately.
*/
static bool
HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
Buffer buffer)
{
- HeapTupleHeader tuple = htup->t_data;
-
/*
* Assert that the caller has registered the snapshot. This function
* doesn't care about the registration as such, but in general you
@@ -974,6 +975,20 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
Assert(ItemPointerIsValid(&htup->t_self));
Assert(htup->t_tableOid != InvalidOid);
+ return HeapTupleMVCCInserted(htup, snapshot, buffer) &&
+ HeapTupleMVCCNotDeleted(htup, snapshot, buffer);
+}
+
+/*
+ * HeapTupleMVCCInserted
+ * True iff heap tuple was successfully inserted for the given MVCC
+ * snapshot.
+ */
+bool
+HeapTupleMVCCInserted(HeapTuple htup, Snapshot snapshot, Buffer buffer)
+{
+ HeapTupleHeader tuple = htup->t_data;
+
if (!HeapTupleHeaderXminCommitted(tuple))
{
if (HeapTupleHeaderXminInvalid(tuple))
@@ -1082,6 +1097,17 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
}
/* by here, the inserting transaction has committed */
+ return true;
+}
+
+/*
+ * HeapTupleMVCCNotDeleted
+ * True iff heap tuple was not deleted for the given MVCC snapshot.
+ */
+bool
+HeapTupleMVCCNotDeleted(HeapTuple htup, Snapshot snapshot, Buffer buffer)
+{
+ HeapTupleHeader tuple = htup->t_data;
if (tuple->t_infomask & HEAP_XMAX_INVALID) /* xid invalid or aborted */
return true;
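The split of HeapTupleSatisfiesMVCC() into HeapTupleMVCCInserted() and HeapTupleMVCCNotDeleted() can be illustrated with a toy model. This is only a sketch: the names and the simplified XID arithmetic are illustrative, and the real implementation also consults hint bits, commit status, and subtransactions.

```python
# Toy MVCC visibility: a tuple is visible iff its insertion is visible
# (inserted check) AND its deletion, if any, is not (not-deleted check).
# These correspond to HeapTupleMVCCInserted / HeapTupleMVCCNotDeleted,
# which the patch also uses separately in accept_tuple_for_concurrent_copy().

def inserted_for_snapshot(xmin, snap_xmax, running_xids):
    # Visible if the inserting XID is in the snapshot's past and was not
    # still running when the snapshot was taken.
    return xmin < snap_xmax and xmin not in running_xids

def not_deleted_for_snapshot(xmax, snap_xmax, running_xids):
    # No deleter at all, or the deleting XID is in the snapshot's future
    # or was still in progress -- either way the deletion is invisible.
    if xmax is None:
        return True
    return xmax >= snap_xmax or xmax in running_xids

def satisfies_mvcc(xmin, xmax, snap_xmax, running_xids):
    # The original combined check is simply the conjunction of the two.
    return (inserted_for_snapshot(xmin, snap_xmax, running_xids)
            and not_deleted_for_snapshot(xmax, snap_xmax, running_xids))
```

The concurrent-copy path needs exactly this decomposition: it accepts tuples whose insertion is visible, then separately asks whether the deletion is visible to decide whether to clear xmax in the copy.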
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index e6d2b5fced1..6aa2ed214f2 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -617,9 +617,9 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
int options = HEAP_INSERT_SKIP_FSM;
/*
- * While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
- * for the TOAST table are not logically decoded. The main heap is
- * WAL-logged as XLOG FPI records, which are not logically decoded.
+ * While rewriting the heap for REPACK, make sure data for the TOAST
+ * table are not logically decoded. The main heap is WAL-logged as
+ * XLOG FPI records, which are not logically decoded.
*/
options |= HEAP_INSERT_NO_LOGICAL;
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b885513f765..23f2de587a1 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -215,6 +215,7 @@ typedef struct TransactionStateData
bool parallelChildXact; /* is any parent transaction parallel? */
bool chain; /* start a new block after this one */
bool topXidLogged; /* for a subxact: is top-level XID logged? */
+ bool internal; /* for a subxact: launched internally? */
struct TransactionStateData *parent; /* back link to parent */
} TransactionStateData;
@@ -4723,6 +4724,7 @@ BeginInternalSubTransaction(const char *name)
/* Normal subtransaction start */
PushTransaction();
s = CurrentTransactionState; /* changed by push */
+ s->internal = true;
/*
* Savepoint names, like the TransactionState block itself, live
@@ -5239,7 +5241,13 @@ AbortSubTransaction(void)
LWLockReleaseAll();
pgstat_report_wait_end();
- pgstat_progress_end_command();
+
+ /*
+ * An internal subtransaction might be used by a user command, in which
+ * case the command outlives the subtransaction.
+ */
+ if (!s->internal)
+ pgstat_progress_end_command();
pgaio_error_cleanup();
@@ -5456,6 +5464,7 @@ PushTransaction(void)
s->parallelModeLevel = 0;
s->parallelChildXact = (p->parallelModeLevel != 0 || p->parallelChildXact);
s->topXidLogged = false;
+ s->internal = false;
CurrentTransactionState = s;
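The xact.c change above can be summarized with a toy model: progress reporting is torn down on subtransaction abort only when the subtransaction was not started internally, so a command that wraps its work in internal subtransactions keeps its progress row alive. The names below are illustrative, not the PostgreSQL API.

```python
# Sketch of the AbortSubTransaction() behavior change: only a non-internal
# (user-visible) subtransaction abort ends the command's progress reporting.

class SubXact:
    def __init__(self, internal=False):
        # Mirrors TransactionStateData.internal, set by
        # BeginInternalSubTransaction() and cleared by PushTransaction().
        self.internal = internal

class ProgressReporter:
    def __init__(self):
        self.active = True

    def abort_subtransaction(self, s):
        # Mirrors: if (!s->internal) pgstat_progress_end_command();
        if not s.internal:
            self.active = False
```

This is why REPACK CONCURRENTLY, which aborts internal subtransactions as part of its normal operation, does not lose its pg_stat_progress_repack entry mid-command.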
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 466cf0fdef6..c70521d1d54 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1418,22 +1418,7 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
opclassOptions[i] = get_attoptions(oldIndexId, i + 1);
- /* Extract statistic targets for each attribute */
- stattargets = palloc0_array(NullableDatum, newInfo->ii_NumIndexAttrs);
- for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
- {
- HeapTuple tp;
- Datum dat;
-
- tp = SearchSysCache2(ATTNUM, ObjectIdGetDatum(oldIndexId), Int16GetDatum(i + 1));
- if (!HeapTupleIsValid(tp))
- elog(ERROR, "cache lookup failed for attribute %d of relation %u",
- i + 1, oldIndexId);
- dat = SysCacheGetAttr(ATTNUM, tp, Anum_pg_attribute_attstattarget, &isnull);
- ReleaseSysCache(tp);
- stattargets[i].value = dat;
- stattargets[i].isnull = isnull;
- }
+ stattargets = get_index_stattargets(oldIndexId, newInfo);
/*
* Now create the new index.
@@ -1472,6 +1457,32 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
return newIndexId;
}
+NullableDatum *
+get_index_stattargets(Oid indexid, IndexInfo *indInfo)
+{
+ NullableDatum *stattargets;
+
+ /* Extract statistic targets for each attribute */
+ stattargets = palloc0_array(NullableDatum, indInfo->ii_NumIndexAttrs);
+ for (int i = 0; i < indInfo->ii_NumIndexAttrs; i++)
+ {
+ HeapTuple tp;
+ Datum dat;
+ bool isnull;
+
+ tp = SearchSysCache2(ATTNUM, ObjectIdGetDatum(indexid), Int16GetDatum(i + 1));
+ if (!HeapTupleIsValid(tp))
+ elog(ERROR, "cache lookup failed for attribute %d of relation %u",
+ i + 1, indexid);
+ dat = SysCacheGetAttr(ATTNUM, tp, Anum_pg_attribute_attstattarget, &isnull);
+ ReleaseSysCache(tp);
+ stattargets[i].value = dat;
+ stattargets[i].isnull = isnull;
+ }
+
+ return stattargets;
+}
+
/*
* index_concurrently_build
*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 5de46bcac52..70265e5e701 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1249,16 +1249,17 @@ CREATE VIEW pg_stat_progress_cluster AS
WHEN 2 THEN 'index scanning heap'
WHEN 3 THEN 'sorting tuples'
WHEN 4 THEN 'writing new heap'
- WHEN 5 THEN 'swapping relation files'
- WHEN 6 THEN 'rebuilding index'
- WHEN 7 THEN 'performing final cleanup'
+ -- 5 is 'catch-up', but that should not appear here.
+ WHEN 6 THEN 'swapping relation files'
+ WHEN 7 THEN 'rebuilding index'
+ WHEN 8 THEN 'performing final cleanup'
END AS phase,
CAST(S.param3 AS oid) AS cluster_index_relid,
S.param4 AS heap_tuples_scanned,
S.param5 AS heap_tuples_written,
- S.param6 AS heap_blks_total,
- S.param7 AS heap_blks_scanned,
- S.param8 AS index_rebuild_count
+ S.param8 AS heap_blks_total,
+ S.param9 AS heap_blks_scanned,
+ S.param10 AS index_rebuild_count
FROM pg_stat_get_progress_info('CLUSTER') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
@@ -1275,16 +1276,19 @@ CREATE VIEW pg_stat_progress_repack AS
WHEN 2 THEN 'index scanning heap'
WHEN 3 THEN 'sorting tuples'
WHEN 4 THEN 'writing new heap'
- WHEN 5 THEN 'swapping relation files'
- WHEN 6 THEN 'rebuilding index'
- WHEN 7 THEN 'performing final cleanup'
+ WHEN 5 THEN 'catch-up'
+ WHEN 6 THEN 'swapping relation files'
+ WHEN 7 THEN 'rebuilding index'
+ WHEN 8 THEN 'performing final cleanup'
END AS phase,
CAST(S.param3 AS oid) AS repack_index_relid,
S.param4 AS heap_tuples_scanned,
- S.param5 AS heap_tuples_written,
- S.param6 AS heap_blks_total,
- S.param7 AS heap_blks_scanned,
- S.param8 AS index_rebuild_count
+ S.param5 AS heap_tuples_inserted,
+ S.param6 AS heap_tuples_updated,
+ S.param7 AS heap_tuples_deleted,
+ S.param8 AS heap_blks_total,
+ S.param9 AS heap_blks_scanned,
+ S.param10 AS index_rebuild_count
FROM pg_stat_get_progress_info('REPACK') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
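For reference, the phase renumbering performed by the two view hunks above can be summarized in a small sketch. The numeric values mirror the CASE branches in the patched views (only the phases visible in these hunks are listed); phase 5, 'catch-up', exists only for REPACK.

```python
# Progress phase numbers as exposed through pg_stat_get_progress_info(),
# per the updated pg_stat_progress_repack view definition.
REPACK_PHASES = {
    2: "index scanning heap",
    3: "sorting tuples",
    4: "writing new heap",
    5: "catch-up",                 # new phase, REPACK CONCURRENTLY only
    6: "swapping relation files",  # shifted from 5
    7: "rebuilding index",         # shifted from 6
    8: "performing final cleanup", # shifted from 7
}

# pg_stat_progress_cluster shares the numbering but never reports phase 5,
# hence the "-- 5 is 'catch-up', but that should not appear here" comment.
CLUSTER_PHASES = {n: name for n, name in REPACK_PHASES.items() if n != 5}
```

Keeping the numbering shared between the two views is what lets CLUSTER and REPACK reuse the same PROGRESS_REPACK_PHASE_* constants while only REPACK ever enters the catch-up phase.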
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 67625d52f12..4d08a28ff7e 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -25,6 +25,10 @@
#include "access/toast_internals.h"
#include "access/transam.h"
#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xloginsert.h"
+#include "access/xlogutils.h"
#include "catalog/catalog.h"
#include "catalog/dependency.h"
#include "catalog/heap.h"
@@ -32,6 +36,7 @@
#include "catalog/namespace.h"
#include "catalog/objectaccess.h"
#include "catalog/pg_am.h"
+#include "catalog/pg_control.h"
#include "catalog/pg_inherits.h"
#include "catalog/toasting.h"
#include "commands/cluster.h"
@@ -39,10 +44,15 @@
#include "commands/progress.h"
#include "commands/tablecmds.h"
#include "commands/vacuum.h"
+#include "executor/executor.h"
#include "miscadmin.h"
#include "optimizer/optimizer.h"
#include "pgstat.h"
+#include "replication/decode.h"
+#include "replication/logical.h"
+#include "replication/snapbuild.h"
#include "storage/bufmgr.h"
+#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
#include "utils/acl.h"
@@ -76,16 +86,46 @@ typedef struct
((cmd) == CLUSTER_COMMAND_REPACK ? \
"repack" : "vacuum"))
+/*
+ * The following definitions are used for concurrent processing.
+ */
+
+/*
+ * The locators are used to avoid logical decoding of data that we do not need
+ * for our table.
+ */
+RelFileLocator repacked_rel_locator = {.relNumber = InvalidOid};
+RelFileLocator repacked_rel_toast_locator = {.relNumber = InvalidOid};
+
+/*
+ * Everything we need to call ExecInsertIndexTuples().
+ */
+typedef struct IndexInsertState
+{
+ ResultRelInfo *rri;
+ EState *estate;
+
+ Relation ident_index;
+} IndexInsertState;
+
+/* The WAL segment being decoded. */
+static XLogSegNo repack_current_segment = 0;
+
static void cluster_multiple_rels(List *rtcs, ClusterParams *params,
- ClusterCommand cmd);
+ ClusterCommand cmd, LOCKMODE lockmode,
+ bool isTopLevel);
static bool cluster_rel_recheck(Relation OldHeap, Oid indexOid, Oid userid,
- ClusterCommand cmd, int options);
+ ClusterCommand cmd, LOCKMODE lmode,
+ int options);
+static void check_repack_concurrently_requirements(Relation rel);
static void rebuild_relation(Relation OldHeap, Relation index, bool verbose,
- ClusterCommand cmd);
+ ClusterCommand cmd, bool concurrent, Oid userid);
static void copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
+ Snapshot snapshot, LogicalDecodingContext *decoding_ctx,
bool verbose, ClusterCommand cmd,
bool *pSwapToastByContent,
- TransactionId *pFreezeXid, MultiXactId *pCutoffMulti);
+ TransactionId *pFreezeXid,
+ MultiXactId *pCutoffMulti);
static List *get_tables_to_cluster(MemoryContext cluster_context);
static List *get_tables_to_repack(MemoryContext repack_context);
static List *get_tables_to_cluster_partitioned(MemoryContext cluster_context,
@@ -93,8 +133,53 @@ static List *get_tables_to_cluster_partitioned(MemoryContext cluster_context,
ClusterCommand cmd);
static bool cluster_is_permitted_for_relation(Oid relid, Oid userid,
ClusterCommand cmd);
+static void begin_concurrent_repack(Relation rel);
+static void end_concurrent_repack(void);
+static LogicalDecodingContext *setup_logical_decoding(Oid relid,
+ const char *slotname,
+ TupleDesc tupdesc);
+static HeapTuple get_changed_tuple(char *change);
+static void apply_concurrent_changes(RepackDecodingState *dstate,
+ Relation rel, ScanKey key, int nkeys,
+ IndexInsertState *iistate);
+static void apply_concurrent_insert(Relation rel, ConcurrentChange *change,
+ HeapTuple tup, IndexInsertState *iistate,
+ TupleTableSlot *index_slot);
+static void apply_concurrent_update(Relation rel, HeapTuple tup,
+ HeapTuple tup_target,
+ ConcurrentChange *change,
+ IndexInsertState *iistate,
+ TupleTableSlot *index_slot);
+static void apply_concurrent_delete(Relation rel, HeapTuple tup_target,
+ ConcurrentChange *change);
+static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
+ HeapTuple tup_key,
+ IndexInsertState *iistate,
+ TupleTableSlot *ident_slot,
+ IndexScanDesc *scan_p);
+static void process_concurrent_changes(LogicalDecodingContext *ctx,
+ XLogRecPtr end_of_wal,
+ Relation rel_dst,
+ Relation rel_src,
+ ScanKey ident_key,
+ int ident_key_nentries,
+ IndexInsertState *iistate);
+static IndexInsertState *get_index_insert_state(Relation relation,
+ Oid ident_index_id);
+static ScanKey build_identity_key(Oid ident_idx_oid, Relation rel_src,
+ int *nentries);
+static void free_index_insert_state(IndexInsertState *iistate);
+static void cleanup_logical_decoding(LogicalDecodingContext *ctx);
+static void rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
+ Relation cl_index,
+ LogicalDecodingContext *ctx,
+ bool swap_toast_by_content,
+ TransactionId frozenXid,
+ MultiXactId cutoffMulti);
+static List *build_new_indexes(Relation NewHeap, Relation OldHeap, List *OldIndexes);
static Relation process_single_relation(RangeVar *relation, char *indexname,
- ClusterCommand cmd,
+ ClusterCommand cmd, LOCKMODE lockmode,
+ bool isTopLevel,
ClusterParams *params,
Oid *indexOid_p);
@@ -153,8 +238,9 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
if (stmt->relation != NULL)
{
rel = process_single_relation(stmt->relation, stmt->indexname,
- CLUSTER_COMMAND_CLUSTER, ¶ms,
- &indexOid);
+ CLUSTER_COMMAND_CLUSTER,
+ AccessExclusiveLock, isTopLevel,
+ ¶ms, &indexOid);
if (rel == NULL)
return;
}
@@ -204,7 +290,8 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
}
/* Do the job. */
- cluster_multiple_rels(rtcs, ¶ms, CLUSTER_COMMAND_CLUSTER);
+ cluster_multiple_rels(rtcs, ¶ms, CLUSTER_COMMAND_CLUSTER,
+ AccessExclusiveLock, isTopLevel);
/* Start a new transaction for the cleanup work. */
StartTransactionCommand();
@@ -221,8 +308,8 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
* return.
*/
static void
-cluster_multiple_rels(List *rtcs, ClusterParams *params,
- ClusterCommand cmd)
+cluster_multiple_rels(List *rtcs, ClusterParams *params, ClusterCommand cmd,
+ LOCKMODE lockmode, bool isTopLevel)
{
ListCell *lc;
@@ -242,10 +329,10 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params,
/* functions in indexes may want a snapshot set */
PushActiveSnapshot(GetTransactionSnapshot());
- rel = table_open(rtc->tableOid, AccessExclusiveLock);
+ rel = table_open(rtc->tableOid, lockmode);
/* Process this table */
- cluster_rel(rel, rtc->indexOid, params, cmd);
+ cluster_rel(rel, rtc->indexOid, params, cmd, isTopLevel);
/* cluster_rel closes the relation, but keeps lock */
PopActiveSnapshot();
@@ -269,12 +356,18 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params,
* instead of index order. This is the new implementation of VACUUM FULL,
* and error messages should refer to the operation as VACUUM not CLUSTER.
*
+ * Note that, in the concurrent case, the function releases the lock at some
+ * point, in order to get AccessExclusiveLock for the final steps (i.e. to
+ * swap the relation files). To make things simpler, the caller should expect
+ * OldHeap to be closed on return, regardless of CLUOPT_CONCURRENT. (The
+ * AccessExclusiveLock is kept till the end of the transaction.)
+ *
* 'cmd' indicates which commands is being executed. REPACK should be the only
* caller of this function in the future.
*/
void
cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
- ClusterCommand cmd)
+ ClusterCommand cmd, bool isTopLevel)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid save_userid;
@@ -284,8 +377,34 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
bool recheck = ((params->options & CLUOPT_RECHECK) != 0);
Relation index;
const char *cmd_str = CLUSTER_COMMAND_STR(cmd);
+ bool concurrent = ((params->options & CLUOPT_CONCURRENT) != 0);
+ LOCKMODE lmode;
+
+ /*
+ * Check that the correct lock is held. The lock mode is
+ * AccessExclusiveLock for normal processing and ShareUpdateExclusiveLock
+ * for concurrent processing (so that SELECT, INSERT, UPDATE and DELETE
+ * commands work, but cluster_rel() cannot be called concurrently for the
+ * same relation).
+ */
+ lmode = !concurrent ? AccessExclusiveLock : ShareUpdateExclusiveLock;
- Assert(CheckRelationLockedByMe(OldHeap, AccessExclusiveLock, false));
+ /* There are specific requirements on concurrent processing. */
+ if (concurrent)
+ {
+ /*
+ * Make sure we have no XID assigned, otherwise a call to
+ * setup_logical_decoding() can cause a deadlock.
+ *
+ * The existence of a transaction block does not actually imply that an
+ * XID was already assigned, but it very likely was. We might want to
+ * check the result of GetCurrentTransactionIdIfAny() instead, but that
+ * would be less clear from the user's perspective.
+ */
+ PreventInTransactionBlock(isTopLevel, "REPACK CONCURRENTLY");
+
+ check_repack_concurrently_requirements(OldHeap);
+ }
/* Check for user-requested abort. */
CHECK_FOR_INTERRUPTS();
@@ -331,7 +450,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
* to cluster a not-previously-clustered index.
*/
if (recheck)
- if (!cluster_rel_recheck(OldHeap, indexOid, save_userid, cmd,
+ if (!cluster_rel_recheck(OldHeap, indexOid, save_userid, cmd, lmode,
params->options))
goto out;
@@ -350,6 +469,12 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("cannot %s a shared catalog", cmd_str)));
+ /*
+ * The CONCURRENTLY case should have been rejected earlier because it does
+ * not support system catalogs.
+ */
+ Assert(!(OldHeap->rd_rel->relisshared && concurrent));
+
/*
* Don't process temp tables of other backends ... their local buffer
* manager is not going to cope.
@@ -370,8 +495,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
if (OidIsValid(indexOid))
{
/* verify the index is good and lock it */
- check_index_is_clusterable(OldHeap, indexOid, AccessExclusiveLock,
- cmd);
+ check_index_is_clusterable(OldHeap, indexOid, lmode, cmd);
/* also open it */
index = index_open(indexOid, NoLock);
}
@@ -388,7 +512,9 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
if (OldHeap->rd_rel->relkind == RELKIND_MATVIEW &&
!RelationIsPopulated(OldHeap))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ if (index)
+ index_close(index, lmode);
+ relation_close(OldHeap, lmode);
goto out;
}
@@ -401,11 +527,35 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
* invalid, because we move tuples around. Promote them to relation
* locks. Predicate locks on indexes will be promoted when they are
* reindexed.
+ *
+ * During concurrent processing, the heap and its indexes stay in
+ * operation, so we postpone this step until they are locked using
+ * AccessExclusiveLock near the end of the processing.
*/
- TransferPredicateLocksToHeapRelation(OldHeap);
+ if (!concurrent)
+ TransferPredicateLocksToHeapRelation(OldHeap);
/* rebuild_relation does all the dirty work */
- rebuild_relation(OldHeap, index, verbose, cmd);
+ PG_TRY();
+ {
+ /*
+ * For concurrent processing, make sure that our logical decoding
+ * ignores data changes of other tables than the one we are
+ * processing.
+ */
+ if (concurrent)
+ begin_concurrent_repack(OldHeap);
+
+ rebuild_relation(OldHeap, index, verbose, cmd, concurrent,
+ save_userid);
+ }
+ PG_FINALLY();
+ {
+ if (concurrent)
+ end_concurrent_repack();
+ }
+ PG_END_TRY();
+
/* rebuild_relation closes OldHeap, and index if valid */
out:
@@ -424,14 +574,14 @@ out:
*/
static bool
cluster_rel_recheck(Relation OldHeap, Oid indexOid, Oid userid,
- ClusterCommand cmd, int options)
+ ClusterCommand cmd, LOCKMODE lmode, int options)
{
Oid tableOid = RelationGetRelid(OldHeap);
/* Check that the user still has privileges for the relation */
if (!cluster_is_permitted_for_relation(tableOid, userid, cmd))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
return false;
}
@@ -445,7 +595,7 @@ cluster_rel_recheck(Relation OldHeap, Oid indexOid, Oid userid,
*/
if (RELATION_IS_OTHER_TEMP(OldHeap))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
return false;
}
@@ -456,7 +606,7 @@ cluster_rel_recheck(Relation OldHeap, Oid indexOid, Oid userid,
*/
if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(indexOid)))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
return false;
}
@@ -467,7 +617,7 @@ cluster_rel_recheck(Relation OldHeap, Oid indexOid, Oid userid,
if ((options & CLUOPT_RECHECK_ISCLUSTERED) != 0 &&
!get_index_isclustered(indexOid))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
return false;
}
}
@@ -611,19 +761,87 @@ mark_index_clustered(Relation rel, Oid indexOid, bool is_internal)
table_close(pg_index, RowExclusiveLock);
}
+/*
+ * Check if the CONCURRENTLY option is legal for the relation.
+ */
+static void
+check_repack_concurrently_requirements(Relation rel)
+{
+ char relpersistence,
+ replident;
+ Oid ident_idx;
+
+ /* Data changes in system relations are not logically decoded. */
+ if (IsCatalogRelation(rel))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot repack relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("REPACK CONCURRENTLY is not supported for catalog relations.")));
+
+ /*
+ * reorderbuffer.c does not seem to handle processing of a TOAST relation
+ * alone.
+ */
+ if (IsToastRelation(rel))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot repack relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("REPACK CONCURRENTLY is not supported for TOAST relations, unless the main relation is repacked too.")));
+
+ relpersistence = rel->rd_rel->relpersistence;
+ if (relpersistence != RELPERSISTENCE_PERMANENT)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot repack relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("REPACK CONCURRENTLY is only allowed for permanent relations.")));
+
+ /* With NOTHING, WAL does not contain the old tuple. */
+ replident = rel->rd_rel->relreplident;
+ if (replident == REPLICA_IDENTITY_NOTHING)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot repack relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("Relation \"%s\" has insufficient replication identity.",
+ RelationGetRelationName(rel))));
+
+ /*
+ * Identity index is not set if the replica identity is FULL, but PK might
+ * exist in such a case.
+ */
+ ident_idx = RelationGetReplicaIndex(rel);
+ if (!OidIsValid(ident_idx) && OidIsValid(rel->rd_pkindex))
+ ident_idx = rel->rd_pkindex;
+ if (!OidIsValid(ident_idx))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot repack relation \"%s\"",
+ RelationGetRelationName(rel)),
+ (errhint("Relation \"%s\" has no identity index.",
+ RelationGetRelationName(rel)))));
+}
+
/*
* rebuild_relation: rebuild an existing relation in index or physical order
*
- * OldHeap: table to rebuild.
+ * OldHeap: table to rebuild. See cluster_rel() for comments on the required
+ * lock strength.
+ *
* index: index to cluster by, or NULL to rewrite in physical order.
*
- * On entry, heap and index (if one is given) must be open, and
- * AccessExclusiveLock held on them.
- * On exit, they are closed, but locks on them are not released.
+ * On entry, heap and index (if one is given) must be open, and the
+ * appropriate lock held on them (AccessExclusiveLock for exclusive processing
+ * and ShareUpdateExclusiveLock for concurrent processing).
+ *
+ * On exit, they are closed, but still locked with AccessExclusiveLock. (The
+ * function handles the lock upgrade if 'concurrent' is true.)
*/
static void
rebuild_relation(Relation OldHeap, Relation index, bool verbose,
- ClusterCommand cmd)
+ ClusterCommand cmd, bool concurrent, Oid userid)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid accessMethod = OldHeap->rd_rel->relam;
@@ -631,21 +849,61 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose,
Oid OIDNewHeap;
Relation NewHeap;
char relpersistence;
- bool is_system_catalog;
bool swap_toast_by_content;
TransactionId frozenXid;
MultiXactId cutoffMulti;
+ NameData slotname;
+ LogicalDecodingContext *ctx = NULL;
+ Snapshot snapshot = NULL;
+#if USE_ASSERT_CHECKING
+ LOCKMODE lmode;
+
+ lmode = !concurrent ? AccessExclusiveLock : ShareUpdateExclusiveLock;
+
+ Assert(CheckRelationLockedByMe(OldHeap, lmode, false) &&
+ (index == NULL || CheckRelationLockedByMe(index, lmode, false)));
+#endif
+
+ if (concurrent)
+ {
+ TupleDesc tupdesc;
+
+ /*
+ * REPACK CONCURRENTLY is not allowed in a transaction block, so this
+ * should never fire.
+ */
+ Assert(GetTopTransactionIdIfAny() == InvalidTransactionId);
+
+ /*
+ * A single backend should not execute multiple REPACK commands at a
+ * time, so use PID to make the slot unique.
+ */
+ snprintf(NameStr(slotname), NAMEDATALEN, "repack_%d", MyProcPid);
+
+ tupdesc = CreateTupleDescCopy(RelationGetDescr(OldHeap));
+
+ /*
+ * Prepare to capture the concurrent data changes.
+ *
+ * Note that this call waits for all transactions with an XID already
+ * assigned to finish. If one of those transactions is waiting for a
+ * lock conflicting with ShareUpdateExclusiveLock on our table (e.g.
+ * it runs CREATE INDEX), we can end up in a deadlock. It is not clear
+ * whether this risk is worth unlocking/locking the table (and its
+ * clustering index) and checking again whether it is still eligible
+ * for REPACK CONCURRENTLY.
+ */
+ ctx = setup_logical_decoding(tableOid, NameStr(slotname), tupdesc);
- Assert(CheckRelationLockedByMe(OldHeap, AccessExclusiveLock, false) &&
- (index == NULL || CheckRelationLockedByMe(index, AccessExclusiveLock, false)));
+ snapshot = SnapBuildInitialSnapshotForRepack(ctx->snapshot_builder);
+ }
- if (index)
+ if (index && cmd == CLUSTER_COMMAND_CLUSTER)
/* Mark the correct index as clustered */
mark_index_clustered(OldHeap, RelationGetRelid(index), true);
/* Remember info about rel before closing OldHeap */
relpersistence = OldHeap->rd_rel->relpersistence;
- is_system_catalog = IsSystemRelation(OldHeap);
/*
* Create the transient table that will receive the re-ordered data.
@@ -661,30 +919,49 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose,
NewHeap = table_open(OIDNewHeap, NoLock);
/* Copy the heap data into the new table in the desired order */
- copy_table_data(NewHeap, OldHeap, index, verbose, cmd,
- &swap_toast_by_content, &frozenXid, &cutoffMulti);
+ copy_table_data(NewHeap, OldHeap, index, snapshot, ctx, verbose,
+ cmd, &swap_toast_by_content, &frozenXid, &cutoffMulti);
+ if (concurrent)
+ {
+ rebuild_relation_finish_concurrent(NewHeap, OldHeap, index,
+ ctx, swap_toast_by_content,
+ frozenXid, cutoffMulti);
+
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_FINAL_CLEANUP);
+
+ /* Done with decoding. */
+ FreeSnapshot(snapshot);
+ cleanup_logical_decoding(ctx);
+ ReplicationSlotRelease();
+ ReplicationSlotDrop(NameStr(slotname), false);
+ }
+ else
+ {
+ bool is_system_catalog = IsSystemRelation(OldHeap);
- /* Close relcache entries, but keep lock until transaction commit */
- table_close(OldHeap, NoLock);
- if (index)
- index_close(index, NoLock);
+ /* Close relcache entries, but keep lock until transaction commit */
+ table_close(OldHeap, NoLock);
+ if (index)
+ index_close(index, NoLock);
- /*
- * Close the new relation so it can be dropped as soon as the storage is
- * swapped. The relation is not visible to others, so no need to unlock it
- * explicitly.
- */
- table_close(NewHeap, NoLock);
+ /*
+ * Close the new relation so it can be dropped as soon as the storage
+ * is swapped. The relation is not visible to others, so no need to
+ * unlock it explicitly.
+ */
+ table_close(NewHeap, NoLock);
- /*
- * Swap the physical files of the target and transient tables, then
- * rebuild the target's indexes and throw away the transient table.
- */
- finish_heap_swap(tableOid, OIDNewHeap, is_system_catalog,
- swap_toast_by_content, false, true,
- frozenXid, cutoffMulti,
- relpersistence);
+ /*
+ * Swap the physical files of the target and transient tables, then
+ * rebuild the target's indexes and throw away the transient table.
+ */
+ finish_heap_swap(tableOid, OIDNewHeap, is_system_catalog,
+ swap_toast_by_content, false, true, true,
+ frozenXid, cutoffMulti,
+ relpersistence);
+ }
}
@@ -819,14 +1096,18 @@ make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
/*
* Do the physical copying of table data.
*
+ * 'snapshot' and 'decoding_ctx': see table_relation_copy_for_cluster(). Pass
+ * iff concurrent processing is required.
+ *
* There are three output parameters:
* *pSwapToastByContent is set true if toast tables must be swapped by content.
* *pFreezeXid receives the TransactionId used as freeze cutoff point.
* *pCutoffMulti receives the MultiXactId used as a cutoff point.
*/
static void
-copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verbose,
- ClusterCommand cmd, bool *pSwapToastByContent,
+copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
+ Snapshot snapshot, LogicalDecodingContext *decoding_ctx,
+ bool verbose, ClusterCommand cmd, bool *pSwapToastByContent,
TransactionId *pFreezeXid, MultiXactId *pCutoffMulti)
{
Relation relRelation;
@@ -845,6 +1126,7 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
const char *cmd_str = CLUSTER_COMMAND_STR(cmd);
PGRUsage ru0;
char *nspname;
+ bool concurrent = snapshot != NULL;
pg_rusage_init(&ru0);
@@ -948,8 +1230,48 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
* provided, else plain seqscan.
*/
if (OldIndex != NULL && OldIndex->rd_rel->relam == BTREE_AM_OID)
+ {
+ ResourceOwner oldowner = NULL;
+ ResourceOwner resowner = NULL;
+
+ /*
+ * In the CONCURRENT case, use a dedicated resource owner so we don't
+ * leave any additional locks behind us that we cannot release easily.
+ */
+ if (concurrent)
+ {
+ Assert(CheckRelationLockedByMe(OldHeap, ShareUpdateExclusiveLock,
+ false));
+ Assert(CheckRelationLockedByMe(OldIndex, ShareUpdateExclusiveLock,
+ false));
+
+ resowner = ResourceOwnerCreate(CurrentResourceOwner,
+ "plan_cluster_use_sort");
+ oldowner = CurrentResourceOwner;
+ CurrentResourceOwner = resowner;
+ }
+
use_sort = plan_cluster_use_sort(RelationGetRelid(OldHeap),
RelationGetRelid(OldIndex));
+
+ if (concurrent)
+ {
+ CurrentResourceOwner = oldowner;
+
+ /*
+ * We are primarily concerned about locks, but if the planner
+ * happened to allocate any other resources, we should release
+ * them too because we're going to delete the whole resowner.
+ */
+ ResourceOwnerRelease(resowner, RESOURCE_RELEASE_BEFORE_LOCKS,
+ false, false);
+ ResourceOwnerRelease(resowner, RESOURCE_RELEASE_LOCKS,
+ false, false);
+ ResourceOwnerRelease(resowner, RESOURCE_RELEASE_AFTER_LOCKS,
+ false, false);
+ ResourceOwnerDelete(resowner);
+ }
+ }
else
use_sort = false;
@@ -981,7 +1303,9 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
* values (e.g. because the AM doesn't use freezing).
*/
table_relation_copy_for_cluster(OldHeap, NewHeap, OldIndex, use_sort,
- cutoffs.OldestXmin, &cutoffs.FreezeLimit,
+ cutoffs.OldestXmin, snapshot,
+ decoding_ctx,
+ &cutoffs.FreezeLimit,
&cutoffs.MultiXactCutoff,
&num_tuples, &tups_vacuumed,
&tups_recently_dead);
@@ -990,7 +1314,11 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
*pFreezeXid = cutoffs.FreezeLimit;
*pCutoffMulti = cutoffs.MultiXactCutoff;
- /* Reset rd_toastoid just to be tidy --- it shouldn't be looked at again */
+ /*
+ * Reset rd_toastoid just to be tidy --- it shouldn't be looked at again.
+ * In the CONCURRENTLY case, we need to set it again before applying the
+ * concurrent changes.
+ */
NewHeap->rd_toastoid = InvalidOid;
num_pages = RelationGetNumberOfBlocks(NewHeap);
@@ -1448,14 +1776,13 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
bool swap_toast_by_content,
bool check_constraints,
bool is_internal,
+ bool reindex,
TransactionId frozenXid,
MultiXactId cutoffMulti,
char newrelpersistence)
{
ObjectAddress object;
Oid mapped_tables[4];
- int reindex_flags;
- ReindexParams reindex_params = {0};
int i;
/* Report that we are now swapping relation files */
@@ -1481,39 +1808,47 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
if (is_system_catalog)
CacheInvalidateCatalog(OIDOldHeap);
- /*
- * Rebuild each index on the relation (but not the toast table, which is
- * all-new at this point). It is important to do this before the DROP
- * step because if we are processing a system catalog that will be used
- * during DROP, we want to have its indexes available. There is no
- * advantage to the other order anyway because this is all transactional,
- * so no chance to reclaim disk space before commit. We do not need a
- * final CommandCounterIncrement() because reindex_relation does it.
- *
- * Note: because index_build is called via reindex_relation, it will never
- * set indcheckxmin true for the indexes. This is OK even though in some
- * sense we are building new indexes rather than rebuilding existing ones,
- * because the new heap won't contain any HOT chains at all, let alone
- * broken ones, so it can't be necessary to set indcheckxmin.
- */
- reindex_flags = REINDEX_REL_SUPPRESS_INDEX_USE;
- if (check_constraints)
- reindex_flags |= REINDEX_REL_CHECK_CONSTRAINTS;
+ if (reindex)
+ {
+ int reindex_flags;
+ ReindexParams reindex_params = {0};
- /*
- * Ensure that the indexes have the same persistence as the parent
- * relation.
- */
- if (newrelpersistence == RELPERSISTENCE_UNLOGGED)
- reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
- else if (newrelpersistence == RELPERSISTENCE_PERMANENT)
- reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+ /*
+ * Rebuild each index on the relation (but not the toast table, which
+ * is all-new at this point). It is important to do this before the
+ * DROP step because if we are processing a system catalog that will
+ * be used during DROP, we want to have its indexes available. There
+ * is no advantage to the other order anyway because this is all
+ * transactional, so no chance to reclaim disk space before commit. We
+ * do not need a final CommandCounterIncrement() because
+ * reindex_relation does it.
+ *
+ * Note: because index_build is called via reindex_relation, it will
+ * never set indcheckxmin true for the indexes. This is OK even
+ * though in some sense we are building new indexes rather than
+ * rebuilding existing ones, because the new heap won't contain any
+ * HOT chains at all, let alone broken ones, so it can't be necessary
+ * to set indcheckxmin.
+ */
+ reindex_flags = REINDEX_REL_SUPPRESS_INDEX_USE;
+ if (check_constraints)
+ reindex_flags |= REINDEX_REL_CHECK_CONSTRAINTS;
- /* Report that we are now reindexing relations */
- pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
- PROGRESS_REPACK_PHASE_REBUILD_INDEX);
+ /*
+ * Ensure that the indexes have the same persistence as the parent
+ * relation.
+ */
+ if (newrelpersistence == RELPERSISTENCE_UNLOGGED)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
+ else if (newrelpersistence == RELPERSISTENCE_PERMANENT)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
- reindex_relation(NULL, OIDOldHeap, reindex_flags, &reindex_params);
+ /* Report that we are now reindexing relations */
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_REBUILD_INDEX);
+
+ reindex_relation(NULL, OIDOldHeap, reindex_flags, &reindex_params);
+ }
/* Report that we are now doing clean up */
pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
@@ -1825,89 +2160,1253 @@ cluster_is_permitted_for_relation(Oid relid, Oid userid, ClusterCommand cmd)
return false;
}
+#define REPL_PLUGIN_NAME "pgoutput_repack"
+
/*
- * REPACK is intended to be a replacement of both CLUSTER and VACUUM FULL.
+ * Call this function before REPACK CONCURRENTLY starts, to set up logical
+ * decoding. It makes sure that other users of the table put enough
+ * information into WAL.
+ *
+ * The point is that at various places we expect that the table we're
+ * processing is treated like a system catalog. For example, we need to be
+ * able to scan it using a "historic snapshot" anytime during the processing
+ * (as opposed to scanning only at the start point of the decoding, as logical
+ * replication does during initial table synchronization), in order to apply
+ * concurrent UPDATE / DELETE commands.
+ *
+ * Note that the TOAST table needs no attention here, as it is not scanned
+ * using a historic snapshot.
*/
-void
-repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
+static void
+begin_concurrent_repack(Relation rel)
{
- ListCell *lc;
- ClusterParams params = {0};
- bool verbose = false;
- Relation rel = NULL;
- Oid indexOid = InvalidOid;
- MemoryContext repack_context;
- List *rtcs;
+ Oid toastrelid;
- /* Parse option list */
- foreach(lc, stmt->params)
+ /* Avoid logical decoding of other relations by this backend. */
+ repacked_rel_locator = rel->rd_locator;
+ toastrelid = rel->rd_rel->reltoastrelid;
+ if (OidIsValid(toastrelid))
{
- DefElem *opt = (DefElem *) lfirst(lc);
+ Relation toastrel;
- if (strcmp(opt->defname, "verbose") == 0)
- verbose = defGetBoolean(opt);
- else
- ereport(ERROR,
- (errcode(ERRCODE_SYNTAX_ERROR),
- errmsg("unrecognized REPACK option \"%s\"",
- opt->defname),
- parser_errposition(pstate, opt->location)));
+ /* Avoid logical decoding of other TOAST relations. */
+ toastrel = table_open(toastrelid, AccessShareLock);
+ repacked_rel_toast_locator = toastrel->rd_locator;
+ table_close(toastrel, AccessShareLock);
}
+}
- params.options = (verbose ? CLUOPT_VERBOSE : 0);
+/*
+ * Call this when done with REPACK CONCURRENTLY.
+ */
+static void
+end_concurrent_repack(void)
+{
+ /*
+ * Restore normal function of (future) logical decoding for this backend.
+ */
+ repacked_rel_locator.relNumber = InvalidOid;
+ repacked_rel_toast_locator.relNumber = InvalidOid;
+}
- if (stmt->relation != NULL)
- {
- rel = process_single_relation(stmt->relation, stmt->indexname,
- CLUSTER_COMMAND_REPACK, ¶ms,
- &indexOid);
- if (rel == NULL)
- return;
- }
+/*
+ * This function is much like pg_create_logical_replication_slot() except that
+ * the new slot is neither released (if anyone else could read changes from
+ * our slot, we could miss changes other backends make while we copy the
+ * existing data into the temporary table), nor persisted (it's easier to
+ * handle a crash by restarting all the work from scratch).
+ */
+static LogicalDecodingContext *
+setup_logical_decoding(Oid relid, const char *slotname, TupleDesc tupdesc)
+{
+ LogicalDecodingContext *ctx;
+ RepackDecodingState *dstate;
/*
- * By here, we know we are in a multi-table situation. In order to avoid
- * holding locks for too long, we want to process each table in its own
- * transaction. This forces us to disallow running inside a user
- * transaction block.
+ * Check if we can use logical decoding.
*/
- PreventInTransactionBlock(isTopLevel, "REPACK");
+ CheckSlotPermissions();
+ CheckLogicalDecodingRequirements();
- /* Also, we need a memory context to hold our list of relations */
- repack_context = AllocSetContextCreate(PortalContext,
- "Repack",
- ALLOCSET_DEFAULT_SIZES);
+ /* RS_TEMPORARY so that the slot gets cleaned up on ERROR. */
+ ReplicationSlotCreate(slotname, true, RS_TEMPORARY, false, false, false);
- params.options |= CLUOPT_RECHECK;
- if (rel != NULL)
- {
- Oid relid;
- bool rel_is_index;
+ /*
+ * None of the prepare_write, do_write and update_progress callbacks is
+ * useful for us.
+ *
+ * Regarding the value of need_full_snapshot, we pass false because the
+ * table we are processing is present in RepackedRelsHash and therefore,
+ * regarding logical decoding, treated like a catalog.
+ */
+ ctx = CreateInitDecodingContext(REPL_PLUGIN_NAME,
+ NIL,
+ false,
+ InvalidXLogRecPtr,
+ XL_ROUTINE(.page_read = read_local_xlog_page,
+ .segment_open = wal_segment_open,
+ .segment_close = wal_segment_close),
+ NULL, NULL, NULL);
- Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
+ /*
+ * We have no control over the fast_forward setting, so at least check it.
+ */
+ Assert(!ctx->fast_forward);
- if (OidIsValid(indexOid))
- {
- relid = indexOid;
- rel_is_index = true;
- }
- else
+ DecodingContextFindStartpoint(ctx);
+
+ /* Some WAL records should have been read. */
+ Assert(ctx->reader->EndRecPtr != InvalidXLogRecPtr);
+
+ XLByteToSeg(ctx->reader->EndRecPtr, repack_current_segment,
+ wal_segment_size);
+
+ /*
+ * Setup structures to store decoded changes.
+ */
+ dstate = palloc0(sizeof(RepackDecodingState));
+ dstate->relid = relid;
+ dstate->tstore = tuplestore_begin_heap(false, false,
+ maintenance_work_mem);
+
+ dstate->tupdesc = tupdesc;
+
+ /* Initialize the descriptor to store the changes ... */
+ dstate->tupdesc_change = CreateTemplateTupleDesc(1);
+
+ TupleDescInitEntry(dstate->tupdesc_change, 1, NULL, BYTEAOID, -1, 0);
+ /* ... as well as the corresponding slot. */
+ dstate->tsslot = MakeSingleTupleTableSlot(dstate->tupdesc_change,
+ &TTSOpsMinimalTuple);
+
+ dstate->resowner = ResourceOwnerCreate(CurrentResourceOwner,
+ "logical decoding");
+
+ ctx->output_writer_private = dstate;
+ return ctx;
+}
+
+/*
+ * Retrieve tuple from ConcurrentChange structure.
+ *
+ * The input data starts with the structure but it might not be appropriately
+ * aligned.
+ */
+static HeapTuple
+get_changed_tuple(char *change)
+{
+ HeapTupleData tup_data;
+ HeapTuple result;
+ char *src;
+
+ /*
+ * Ensure alignment before accessing the fields. (This is why we can't use
+ * heap_copytuple() instead of this function.)
+ */
+ src = change + offsetof(ConcurrentChange, tup_data);
+ memcpy(&tup_data, src, sizeof(HeapTupleData));
+
+ result = (HeapTuple) palloc(HEAPTUPLESIZE + tup_data.t_len);
+ memcpy(result, &tup_data, sizeof(HeapTupleData));
+ result->t_data = (HeapTupleHeader) ((char *) result + HEAPTUPLESIZE);
+ src = change + SizeOfConcurrentChange;
+ memcpy(result->t_data, src, result->t_len);
+
+ return result;
+}
+
+/*
+ * Decode logical changes from the WAL sequence up to end_of_wal.
+ */
+void
+repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
+ XLogRecPtr end_of_wal)
+{
+ RepackDecodingState *dstate;
+ ResourceOwner resowner_old;
+
+ /*
+ * Invalidate the "present" cache before moving to "(recent) history".
+ */
+ InvalidateSystemCaches();
+
+ dstate = (RepackDecodingState *) ctx->output_writer_private;
+ resowner_old = CurrentResourceOwner;
+ CurrentResourceOwner = dstate->resowner;
+
+ PG_TRY();
+ {
+ while (ctx->reader->EndRecPtr < end_of_wal)
{
- relid = RelationGetRelid(rel);
- rel_is_index = false;
- }
- rtcs = get_tables_to_cluster_partitioned(repack_context, relid,
- rel_is_index,
- CLUSTER_COMMAND_REPACK);
+ XLogRecord *record;
+ XLogSegNo segno_new;
+ char *errm = NULL;
+ XLogRecPtr end_lsn;
- /* close relation, releasing lock on parent table */
- table_close(rel, AccessExclusiveLock);
+ record = XLogReadRecord(ctx->reader, &errm);
+ if (errm)
+ elog(ERROR, "%s", errm);
+
+ if (record != NULL)
+ LogicalDecodingProcessRecord(ctx, ctx->reader);
+
+ /*
+ * If a WAL segment boundary has been crossed, inform the decoding
+ * system that the catalog_xmin can advance. (We could confirm more
+ * often, but filling a single WAL segment should not take much
+ * time.)
+ */
+ end_lsn = ctx->reader->EndRecPtr;
+ XLByteToSeg(end_lsn, segno_new, wal_segment_size);
+ if (segno_new != repack_current_segment)
+ {
+ LogicalConfirmReceivedLocation(end_lsn);
+ elog(DEBUG1, "REPACK: confirmed receive location %X/%X",
+ (uint32) (end_lsn >> 32), (uint32) end_lsn);
+ repack_current_segment = segno_new;
+ }
+
+ CHECK_FOR_INTERRUPTS();
+ }
+ InvalidateSystemCaches();
+ CurrentResourceOwner = resowner_old;
}
- else
- rtcs = get_tables_to_repack(repack_context);
+ PG_CATCH();
+ {
+ /* clear all timetravel entries */
+ InvalidateSystemCaches();
+ CurrentResourceOwner = resowner_old;
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+}
+
+/*
+ * Apply changes that happened during the initial load.
+ *
+ * Scan key is passed by caller, so it does not have to be constructed
+ * multiple times. Key entries have all fields initialized, except for
+ * sk_argument.
+ */
+static void
+apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
+ ScanKey key, int nkeys, IndexInsertState *iistate)
+{
+ TupleTableSlot *index_slot,
+ *ident_slot;
+ HeapTuple tup_old = NULL;
+
+ if (dstate->nchanges == 0)
+ return;
+
+ /* TupleTableSlot is needed to pass the tuple to ExecInsertIndexTuples(). */
+ index_slot = MakeSingleTupleTableSlot(dstate->tupdesc, &TTSOpsHeapTuple);
+
+ /* A slot to fetch tuples from identity index. */
+ ident_slot = table_slot_create(rel, NULL);
+
+ while (tuplestore_gettupleslot(dstate->tstore, true, false,
+ dstate->tsslot))
+ {
+ bool shouldFree;
+ HeapTuple tup_change,
+ tup,
+ tup_exist;
+ char *change_raw,
+ *src;
+ ConcurrentChange change;
+ bool isnull[1];
+ Datum values[1];
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Get the change from the single-column tuple. */
+ tup_change = ExecFetchSlotHeapTuple(dstate->tsslot, false, &shouldFree);
+ heap_deform_tuple(tup_change, dstate->tupdesc_change, values, isnull);
+ Assert(!isnull[0]);
+
+ /* Make sure we access aligned data. */
+ change_raw = (char *) DatumGetByteaP(values[0]);
+ src = (char *) VARDATA(change_raw);
+ memcpy(&change, src, SizeOfConcurrentChange);
+
+ /* TRUNCATE change contains no tuple, so process it separately. */
+ if (change.kind == CHANGE_TRUNCATE)
+ {
+ /*
+ * All the things that ExecuteTruncateGuts() does (such as firing
+ * triggers or handling the DROP_CASCADE behavior) should have
+ * taken place on the source relation. Thus we only do the actual
+ * truncation of the new relation (and its indexes).
+ */
+ heap_truncate_one_rel(rel);
+
+ pfree(tup_change);
+ continue;
+ }
+
+ /*
+ * Extract the tuple from the change. The tuple is copied here because
+ * it might be assigned to 'tup_old', in which case it needs to
+ * survive into the next iteration.
+ */
+ tup = get_changed_tuple(src);
+
+ if (change.kind == CHANGE_UPDATE_OLD)
+ {
+ Assert(tup_old == NULL);
+ tup_old = tup;
+ }
+ else if (change.kind == CHANGE_INSERT)
+ {
+ Assert(tup_old == NULL);
+
+ apply_concurrent_insert(rel, &change, tup, iistate, index_slot);
+
+ pfree(tup);
+ }
+ else if (change.kind == CHANGE_UPDATE_NEW ||
+ change.kind == CHANGE_DELETE)
+ {
+ IndexScanDesc ind_scan = NULL;
+ HeapTuple tup_key;
+
+ if (change.kind == CHANGE_UPDATE_NEW)
+ {
+ tup_key = tup_old != NULL ? tup_old : tup;
+ }
+ else
+ {
+ Assert(tup_old == NULL);
+ tup_key = tup;
+ }
+
+ /*
+ * Find the tuple to be updated or deleted.
+ */
+ tup_exist = find_target_tuple(rel, key, nkeys, tup_key,
+ iistate, ident_slot, &ind_scan);
+ if (tup_exist == NULL)
+ elog(ERROR, "failed to find target tuple");
+
+ if (change.kind == CHANGE_UPDATE_NEW)
+ apply_concurrent_update(rel, tup, tup_exist, &change, iistate,
+ index_slot);
+ else
+ apply_concurrent_delete(rel, tup_exist, &change);
+
+ if (tup_old != NULL)
+ {
+ pfree(tup_old);
+ tup_old = NULL;
+ }
+
+ pfree(tup);
+ index_endscan(ind_scan);
+ }
+ else
+ elog(ERROR, "unrecognized change kind: %d", change.kind);
+
+ /* If there's any change, make it visible to the next iteration. */
+ if (change.kind != CHANGE_UPDATE_OLD)
+ {
+ CommandCounterIncrement();
+ UpdateActiveSnapshotCommandId();
+ }
+
+ /* TTSOpsMinimalTuple has .get_heap_tuple==NULL. */
+ Assert(shouldFree);
+ pfree(tup_change);
+ }
+
+ tuplestore_clear(dstate->tstore);
+ dstate->nchanges = 0;
+
+ /* Cleanup. */
+ ExecDropSingleTupleTableSlot(index_slot);
+ ExecDropSingleTupleTableSlot(ident_slot);
+}
+
+static void
+apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
+ IndexInsertState *iistate, TupleTableSlot *index_slot)
+{
+ List *recheck;
+
+ simple_heap_insert(rel, tup);
+
+ /*
+ * Update indexes. (The caller is expected to have set an active
+ * snapshot, in case functions in the index expressions need one.)
+ */
+ ExecStoreHeapTuple(tup, index_slot, false);
+ recheck = ExecInsertIndexTuples(iistate->rri,
+ index_slot,
+ iistate->estate,
+ false, /* update */
+ false, /* noDupErr */
+ NULL, /* specConflict */
+ NIL, /* arbiterIndexes */
+ false /* onlySummarizing */
+ );
+
+ /*
+ * If recheck is required, it must have been performed on the source
+ * relation by now. (All the logical changes we process here are already
+ * committed.)
+ */
+ list_free(recheck);
+
+ pgstat_progress_incr_param(PROGRESS_REPACK_HEAP_TUPLES_INSERTED, 1);
+}
+
+static void
+apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
+ ConcurrentChange *change, IndexInsertState *iistate,
+ TupleTableSlot *index_slot)
+{
+ List *recheck;
+ TU_UpdateIndexes update_indexes;
+
+ /*
+ * Write the new tuple into the new heap. ('tup' gets the TID assigned
+ * here.)
+ */
+ simple_heap_update(rel, &tup_target->t_self, tup, &update_indexes);
+
+ ExecStoreHeapTuple(tup, index_slot, false);
+
+ if (update_indexes != TU_None)
+ {
+ recheck = ExecInsertIndexTuples(iistate->rri,
+ index_slot,
+ iistate->estate,
+ true, /* update */
+ false, /* noDupErr */
+ NULL, /* specConflict */
+ NIL, /* arbiterIndexes */
+ /* onlySummarizing */
+ update_indexes == TU_Summarizing);
+ list_free(recheck);
+ }
+
+ pgstat_progress_incr_param(PROGRESS_REPACK_HEAP_TUPLES_UPDATED, 1);
+}
+
+static void
+apply_concurrent_delete(Relation rel, HeapTuple tup_target,
+ ConcurrentChange *change)
+{
+ simple_heap_delete(rel, &tup_target->t_self);
+
+ pgstat_progress_incr_param(PROGRESS_REPACK_HEAP_TUPLES_DELETED, 1);
+}
+
+/*
+ * Find the tuple to be updated or deleted.
+ *
+ * 'key' is a pre-initialized scan key, into which the function will put the
+ * key values.
+ *
+ * 'tup_key' is a tuple containing the key values for the scan.
+ *
+ * On exit, '*scan_p' contains the scan descriptor used. The caller must
+ * close it once the returned tuple is no longer needed.
+ */
+static HeapTuple
+find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
+ IndexInsertState *iistate,
+ TupleTableSlot *ident_slot, IndexScanDesc *scan_p)
+{
+ IndexScanDesc scan;
+ Form_pg_index ident_form;
+ int2vector *ident_indkey;
+ HeapTuple result = NULL;
+
+ /* XXX no instrumentation for now */
+ scan = index_beginscan(rel, iistate->ident_index, GetActiveSnapshot(),
+ NULL, nkeys, 0);
+ *scan_p = scan;
+ index_rescan(scan, key, nkeys, NULL, 0);
+
+ /* Info needed to retrieve key values from heap tuple. */
+ ident_form = iistate->ident_index->rd_index;
+ ident_indkey = &ident_form->indkey;
+
+ /* Use the incoming tuple to finalize the scan key. */
+ for (int i = 0; i < scan->numberOfKeys; i++)
+ {
+ ScanKey entry;
+ bool isnull;
+ int16 attno_heap;
+
+ entry = &scan->keyData[i];
+ attno_heap = ident_indkey->values[i];
+ entry->sk_argument = heap_getattr(tup_key,
+ attno_heap,
+ rel->rd_att,
+ &isnull);
+ Assert(!isnull);
+ }
+ if (index_getnext_slot(scan, ForwardScanDirection, ident_slot))
+ {
+ bool shouldFree;
+
+ result = ExecFetchSlotHeapTuple(ident_slot, false, &shouldFree);
+ /* TTSOpsBufferHeapTuple has .get_heap_tuple != NULL. */
+ Assert(!shouldFree);
+ }
+
+ return result;
+}
+
+/*
+ * Decode and apply concurrent changes.
+ *
+ * Pass rel_src iff its reltoastrelid is needed.
+ */
+static void
+process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
+ Relation rel_dst, Relation rel_src, ScanKey ident_key,
+ int ident_key_nentries, IndexInsertState *iistate)
+{
+ RepackDecodingState *dstate;
+
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_CATCH_UP);
+
+ dstate = (RepackDecodingState *) ctx->output_writer_private;
+
+ repack_decode_concurrent_changes(ctx, end_of_wal);
+
+ if (dstate->nchanges == 0)
+ return;
+
+ PG_TRY();
+ {
+ /*
+ * Make sure that TOAST values can eventually be accessed via the old
+ * relation - see comment in copy_table_data().
+ */
+ if (rel_src)
+ rel_dst->rd_toastoid = rel_src->rd_rel->reltoastrelid;
+
+ apply_concurrent_changes(dstate, rel_dst, ident_key,
+ ident_key_nentries, iistate);
+ }
+ PG_FINALLY();
+ {
+ if (rel_src)
+ rel_dst->rd_toastoid = InvalidOid;
+ }
+ PG_END_TRY();
+}
+
+static IndexInsertState *
+get_index_insert_state(Relation relation, Oid ident_index_id)
+{
+ EState *estate;
+ int i;
+ IndexInsertState *result;
+
+ result = (IndexInsertState *) palloc0(sizeof(IndexInsertState));
+ estate = CreateExecutorState();
+
+ result->rri = (ResultRelInfo *) palloc(sizeof(ResultRelInfo));
+ InitResultRelInfo(result->rri, relation, 0, 0, 0);
+ ExecOpenIndices(result->rri, false);
+
+ /*
+ * Find the relcache entry of the identity index so that we spend no extra
+ * effort to open / close it.
+ */
+ for (i = 0; i < result->rri->ri_NumIndices; i++)
+ {
+ Relation ind_rel;
+
+ ind_rel = result->rri->ri_IndexRelationDescs[i];
+ if (ind_rel->rd_id == ident_index_id)
+ result->ident_index = ind_rel;
+ }
+ if (result->ident_index == NULL)
+ elog(ERROR, "failed to open identity index");
+
+ /* Only initialize fields needed by ExecInsertIndexTuples(). */
+ result->estate = estate;
+
+ return result;
+}
+
+/*
+ * Build scan key to process logical changes.
+ */
+static ScanKey
+build_identity_key(Oid ident_idx_oid, Relation rel_src, int *nentries)
+{
+ Relation ident_idx_rel;
+ Form_pg_index ident_idx;
+ int n,
+ i;
+ ScanKey result;
+
+ Assert(OidIsValid(ident_idx_oid));
+ ident_idx_rel = index_open(ident_idx_oid, AccessShareLock);
+ ident_idx = ident_idx_rel->rd_index;
+ n = ident_idx->indnatts;
+ result = (ScanKey) palloc(sizeof(ScanKeyData) * n);
+ for (i = 0; i < n; i++)
+ {
+ ScanKey entry;
+ int16 relattno;
+ Form_pg_attribute att;
+ Oid opfamily,
+ opcintype,
+ opno,
+ opcode;
+
+ entry = &result[i];
+ relattno = ident_idx->indkey.values[i];
+ if (relattno >= 1)
+ {
+ TupleDesc desc;
+
+ desc = rel_src->rd_att;
+ att = TupleDescAttr(desc, relattno - 1);
+ }
+ else
+ elog(ERROR, "unexpected attribute number %d in index", relattno);
+
+ opfamily = ident_idx_rel->rd_opfamily[i];
+ opcintype = ident_idx_rel->rd_opcintype[i];
+ opno = get_opfamily_member(opfamily, opcintype, opcintype,
+ BTEqualStrategyNumber);
+
+ if (!OidIsValid(opno))
+ elog(ERROR, "failed to find = operator for type %u", opcintype);
+
+ opcode = get_opcode(opno);
+ if (!OidIsValid(opcode))
+ elog(ERROR, "failed to find function for operator %u", opno);
+
+ /* Initialize everything but argument. */
+ ScanKeyInit(entry,
+ i + 1,
+ BTEqualStrategyNumber, opcode,
+ (Datum) NULL);
+ entry->sk_collation = att->attcollation;
+ }
+ index_close(ident_idx_rel, AccessShareLock);
+
+ *nentries = n;
+ return result;
+}
+
+static void
+free_index_insert_state(IndexInsertState *iistate)
+{
+ ExecCloseIndices(iistate->rri);
+ FreeExecutorState(iistate->estate);
+ pfree(iistate->rri);
+ pfree(iistate);
+}
+
+static void
+cleanup_logical_decoding(LogicalDecodingContext *ctx)
+{
+ RepackDecodingState *dstate;
+
+ dstate = (RepackDecodingState *) ctx->output_writer_private;
+
+ ExecDropSingleTupleTableSlot(dstate->tsslot);
+ FreeTupleDesc(dstate->tupdesc_change);
+ FreeTupleDesc(dstate->tupdesc);
+ tuplestore_end(dstate->tstore);
+
+ FreeDecodingContext(ctx);
+}
+
+/*
+ * The final steps of rebuild_relation() for concurrent processing.
+ *
+ * On entry, NewHeap is locked in AccessExclusiveLock mode. OldHeap and its
+ * clustering index (if one is passed) are still locked in a mode that allows
+ * concurrent data changes. On exit, both tables and their indexes are closed,
+ * but locked in AccessExclusiveLock mode.
+ */
+static void
+rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
+ Relation cl_index,
+ LogicalDecodingContext *ctx,
+ bool swap_toast_by_content,
+ TransactionId frozenXid,
+ MultiXactId cutoffMulti)
+{
+ LOCKMODE lockmode_old PG_USED_FOR_ASSERTS_ONLY;
+ List *ind_oids_new;
+ Oid old_table_oid = RelationGetRelid(OldHeap);
+ Oid new_table_oid = RelationGetRelid(NewHeap);
+ List *ind_oids_old = RelationGetIndexList(OldHeap);
+ ListCell *lc,
+ *lc2;
+ char relpersistence;
+ bool is_system_catalog;
+ Oid ident_idx_old,
+ ident_idx_new;
+ IndexInsertState *iistate;
+ ScanKey ident_key;
+ int ident_key_nentries;
+ XLogRecPtr wal_insert_ptr,
+ end_of_wal;
+ char dummy_rec_data = '\0';
+ Relation *ind_refs,
+ *ind_refs_p;
+ int nind;
+
+ /* Like in cluster_rel(). */
+ lockmode_old = ShareUpdateExclusiveLock;
+ Assert(CheckRelationLockedByMe(OldHeap, lockmode_old, false));
+ Assert(cl_index == NULL ||
+ CheckRelationLockedByMe(cl_index, lockmode_old, false));
+ /* This is expected from the caller. */
+ Assert(CheckRelationLockedByMe(NewHeap, AccessExclusiveLock, false));
+
+ ident_idx_old = RelationGetReplicaIndex(OldHeap);
+
+ /*
+ * Unlike the exclusive case, we build new indexes for the new relation
+ * rather than swapping the storage and reindexing the old relation. The
+ * point is that the index build can take some time, so we do it before we
+ * get AccessExclusiveLock on the old heap and therefore we cannot swap
+ * the heap storage yet.
+ *
+ * index_create() will lock the new indexes using AccessExclusiveLock - no
+ * need to change that.
+ */
+ ind_oids_new = build_new_indexes(NewHeap, OldHeap, ind_oids_old);
+
+ /*
+ * Processing shouldn't have started without a valid identity index.
+ */
+ Assert(OidIsValid(ident_idx_old));
+
+ /* Find "identity index" on the new relation. */
+ ident_idx_new = InvalidOid;
+ forboth(lc, ind_oids_old, lc2, ind_oids_new)
+ {
+ Oid ind_old = lfirst_oid(lc);
+ Oid ind_new = lfirst_oid(lc2);
+
+ if (ident_idx_old == ind_old)
+ {
+ ident_idx_new = ind_new;
+ break;
+ }
+ }
+ if (!OidIsValid(ident_idx_new))
+ {
+ /* Should not happen, given our lock on the old relation. */
+ ereport(ERROR,
+ (errmsg("identity index missing on the new relation")));
+ }
+
+ /* Executor state to update indexes. */
+ iistate = get_index_insert_state(NewHeap, ident_idx_new);
+
+ /*
+ * Build scan key that we'll use to look for rows to be updated / deleted
+ * during logical decoding.
+ */
+ ident_key = build_identity_key(ident_idx_new, OldHeap, &ident_key_nentries);
+
+ /*
+ * Flush all WAL records inserted so far (possibly except for the last
+ * incomplete page, see GetInsertRecPtr), to minimize the amount of data
+ * we need to flush while holding exclusive lock on the source table.
+ */
+ wal_insert_ptr = GetInsertRecPtr();
+ XLogFlush(wal_insert_ptr);
+ end_of_wal = GetFlushRecPtr(NULL);
+
+ /*
+ * Apply concurrent changes first time, to minimize the time we need to
+ * hold AccessExclusiveLock. (Quite some amount of WAL could have been
+ * written during the data copying and index creation.)
+ */
+ process_concurrent_changes(ctx, end_of_wal, NewHeap,
+ swap_toast_by_content ? OldHeap : NULL,
+ ident_key, ident_key_nentries, iistate);
+
+ /*
+ * Acquire AccessExclusiveLock on the table, its TOAST relation (if there
+ * is one) and all its indexes, so that we can swap the files.
+ *
+ * Before that, unlock the index temporarily to avoid deadlock in case
+ * another transaction is trying to lock it while holding the lock on the
+ * table.
+ */
+ if (cl_index)
+ {
+ index_close(cl_index, ShareUpdateExclusiveLock);
+ cl_index = NULL;
+ }
+ /* Lock the TOAST relation too; its file needs to be swapped as well. */
+ if (OldHeap->rd_rel->reltoastrelid)
+ LockRelationOid(OldHeap->rd_rel->reltoastrelid, AccessExclusiveLock);
+ /* Finally, lock the table. */
+ LockRelationOid(old_table_oid, AccessExclusiveLock);
+
+ /*
+ * Lock all indexes now, not only the clustering one: all indexes need to
+ * have their files swapped. While doing that, store their relation
+ * references in an array, to handle predicate locks below.
+ */
+ ind_refs_p = ind_refs = palloc_array(Relation, list_length(ind_oids_old));
+ nind = 0;
+ foreach(lc, ind_oids_old)
+ {
+ Oid ind_oid;
+ Relation index;
+
+ ind_oid = lfirst_oid(lc);
+ index = index_open(ind_oid, AccessExclusiveLock);
+ *ind_refs_p = index;
+ ind_refs_p++;
+ nind++;
+ }
+
+ /*
+ * In addition, lock the OldHeap's TOAST relation exclusively - again, the
+ * lock is needed to swap the files.
+ */
+ if (OidIsValid(OldHeap->rd_rel->reltoastrelid))
+ LockRelationOid(OldHeap->rd_rel->reltoastrelid, AccessExclusiveLock);
+
+ /*
+ * Tuples and pages of the old heap will be gone, but the heap will stay.
+ */
+ TransferPredicateLocksToHeapRelation(OldHeap);
+ /* The same for indexes. */
+ for (int i = 0; i < nind; i++)
+ {
+ Relation index = ind_refs[i];
+
+ TransferPredicateLocksToHeapRelation(index);
+
+ /*
+ * References to indexes on the old relation are not needed anymore,
+ * however locks stay till the end of the transaction.
+ */
+ index_close(index, NoLock);
+ }
+ pfree(ind_refs);
+
+ /*
+ * Flush anything we see in WAL, to make sure that all changes committed
+ * while we were waiting for the exclusive lock are available for
+ * decoding. This should not be necessary if all backends had
+ * synchronous_commit set, but we can't rely on this setting.
+ *
+ * Unfortunately, GetInsertRecPtr() may lag behind the actual insert
+ * position, and GetLastImportantRecPtr() points at the start of the last
+ * record rather than at the end. Thus the simplest way to determine the
+ * insert position is to insert a dummy record and use its LSN.
+ *
+ * XXX Consider using GetLastImportantRecPtr() and adding the size of the
+ * last record (plus the total size of all the page headers the record
+ * spans)?
+ */
+ XLogBeginInsert();
+ XLogRegisterData(&dummy_rec_data, 1);
+ wal_insert_ptr = XLogInsert(RM_XLOG_ID, XLOG_NOOP);
+ XLogFlush(wal_insert_ptr);
+ end_of_wal = GetFlushRecPtr(NULL);
+
+ /* Apply the concurrent changes again. */
+ process_concurrent_changes(ctx, end_of_wal, NewHeap,
+ swap_toast_by_content ? OldHeap : NULL,
+ ident_key, ident_key_nentries, iistate);
+
+ /* Remember info about rel before closing OldHeap */
+ relpersistence = OldHeap->rd_rel->relpersistence;
+ is_system_catalog = IsSystemRelation(OldHeap);
+
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_SWAP_REL_FILES);
+
+ /*
+ * Even ShareUpdateExclusiveLock should have prevented others from
+ * creating / dropping indexes (even using the CONCURRENTLY option), so we
+ * do not need to check whether the lists match.
+ */
+ forboth(lc, ind_oids_old, lc2, ind_oids_new)
+ {
+ Oid ind_old = lfirst_oid(lc);
+ Oid ind_new = lfirst_oid(lc2);
+ Oid mapped_tables[4];
+
+ /* Zero out possible results from swap_relation_files(). */
+ memset(mapped_tables, 0, sizeof(mapped_tables));
+
+ swap_relation_files(ind_old, ind_new,
+ (old_table_oid == RelationRelationId),
+ swap_toast_by_content,
+ true,
+ InvalidTransactionId,
+ InvalidMultiXactId,
+ mapped_tables);
+
+#ifdef USE_ASSERT_CHECKING
+
+ /*
+ * Concurrent processing is not supported for system relations, so
+ * there should be no mapped tables.
+ */
+ for (int i = 0; i < 4; i++)
+ Assert(mapped_tables[i] == 0);
+#endif
+ }
+
+ /* The new indexes must be visible for deletion. */
+ CommandCounterIncrement();
+
+ /* Close the old heap but keep lock until transaction commit. */
+ table_close(OldHeap, NoLock);
+ /* Close the new heap. (We didn't have to open its indexes.) */
+ table_close(NewHeap, NoLock);
+
+ /* Cleanup what we don't need anymore. (And close the identity index.) */
+ pfree(ident_key);
+ free_index_insert_state(iistate);
+
+ /*
+ * Swap the relations and their TOAST relations and TOAST indexes. This
+ * also drops the new relation and its indexes.
+ *
+ * (System catalogs are currently not supported.)
+ */
+ Assert(!is_system_catalog);
+ finish_heap_swap(old_table_oid, new_table_oid,
+ is_system_catalog,
+ swap_toast_by_content,
+ false, true, false,
+ frozenXid, cutoffMulti,
+ relpersistence);
+}
+
+/*
+ * Build indexes on NewHeap according to those on OldHeap.
+ *
+ * OldIndexes is the list of index OIDs on OldHeap.
+ *
+ * A list of OIDs of the corresponding indexes created on NewHeap is
+ * returned. Its order matches that of OldIndexes, so the two lists can be
+ * paired positionally when swapping index storage.
+ */
+static List *
+build_new_indexes(Relation NewHeap, Relation OldHeap, List *OldIndexes)
+{
+ StringInfo ind_name;
+ ListCell *lc;
+ List *result = NIL;
+
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_REBUILD_INDEX);
+
+ ind_name = makeStringInfo();
+
+ foreach(lc, OldIndexes)
+ {
+ Oid ind_oid,
+ ind_oid_new,
+ tbsp_oid;
+ Relation ind;
+ IndexInfo *ind_info;
+ int i,
+ heap_col_id;
+ List *colnames;
+ int16 indnatts;
+ Oid *collations,
+ *opclasses;
+ HeapTuple tup;
+ bool isnull;
+ Datum d;
+ oidvector *oidvec;
+ int2vector *int2vec;
+ size_t oid_arr_size;
+ size_t int2_arr_size;
+ int16 *indoptions;
+ text *reloptions = NULL;
+ bits16 flags;
+ Datum *opclassOptions;
+ NullableDatum *stattargets;
+
+ ind_oid = lfirst_oid(lc);
+ ind = index_open(ind_oid, AccessShareLock);
+ ind_info = BuildIndexInfo(ind);
+
+ tbsp_oid = ind->rd_rel->reltablespace;
+
+ /*
+ * Index names don't really matter here, since we'll eventually use only
+ * their storage. Just make them unique within the table.
+ */
+ resetStringInfo(ind_name);
+ appendStringInfo(ind_name, "ind_%d",
+ list_cell_number(OldIndexes, lc));
+
+ flags = 0;
+ if (ind->rd_index->indisprimary)
+ flags |= INDEX_CREATE_IS_PRIMARY;
+
+ colnames = NIL;
+ indnatts = ind->rd_index->indnatts;
+ oid_arr_size = sizeof(Oid) * indnatts;
+ int2_arr_size = sizeof(int16) * indnatts;
+
+ collations = (Oid *) palloc(oid_arr_size);
+ for (i = 0; i < indnatts; i++)
+ {
+ char *colname;
+
+ heap_col_id = ind->rd_index->indkey.values[i];
+ if (heap_col_id > 0)
+ {
+ Form_pg_attribute att;
+
+ /* Normal attribute. */
+ att = TupleDescAttr(OldHeap->rd_att, heap_col_id - 1);
+ colname = pstrdup(NameStr(att->attname));
+ collations[i] = att->attcollation;
+ }
+ else if (heap_col_id == 0)
+ {
+ HeapTuple tuple;
+ Form_pg_attribute att;
+
+ /*
+ * An expression column is not present in the relcache. What we need
+ * here is an attribute of the *index* relation.
+ */
+ tuple = SearchSysCache2(ATTNUM,
+ ObjectIdGetDatum(ind_oid),
+ Int16GetDatum(i + 1));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR,
+ "cache lookup failed for attribute %d of relation %u",
+ i + 1, ind_oid);
+ att = (Form_pg_attribute) GETSTRUCT(tuple);
+ colname = pstrdup(NameStr(att->attname));
+ collations[i] = att->attcollation;
+ ReleaseSysCache(tuple);
+ }
+ else
+ elog(ERROR, "unexpected column number: %d",
+ heap_col_id);
+
+ colnames = lappend(colnames, colname);
+ }
+
+ /*
+ * Special effort needed for variable length attributes of
+ * Form_pg_index.
+ */
+ tup = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(ind_oid));
+ if (!HeapTupleIsValid(tup))
+ elog(ERROR, "cache lookup failed for index %u", ind_oid);
+ d = SysCacheGetAttr(INDEXRELID, tup, Anum_pg_index_indclass, &isnull);
+ Assert(!isnull);
+ oidvec = (oidvector *) DatumGetPointer(d);
+ opclasses = (Oid *) palloc(oid_arr_size);
+ memcpy(opclasses, oidvec->values, oid_arr_size);
+
+ d = SysCacheGetAttr(INDEXRELID, tup, Anum_pg_index_indoption,
+ &isnull);
+ Assert(!isnull);
+ int2vec = (int2vector *) DatumGetPointer(d);
+ indoptions = (int16 *) palloc(int2_arr_size);
+ memcpy(indoptions, int2vec->values, int2_arr_size);
+ ReleaseSysCache(tup);
+
+ tup = SearchSysCache1(RELOID, ObjectIdGetDatum(ind_oid));
+ if (!HeapTupleIsValid(tup))
+ elog(ERROR, "cache lookup failed for index relation %u", ind_oid);
+ d = SysCacheGetAttr(RELOID, tup, Anum_pg_class_reloptions, &isnull);
+ reloptions = !isnull ? DatumGetTextPCopy(d) : NULL;
+ ReleaseSysCache(tup);
+
+ opclassOptions = palloc0(sizeof(Datum) * ind_info->ii_NumIndexAttrs);
+ for (i = 0; i < ind_info->ii_NumIndexAttrs; i++)
+ opclassOptions[i] = get_attoptions(ind_oid, i + 1);
+
+ stattargets = get_index_stattargets(ind_oid, ind_info);
+
+ /*
+ * Neither parentIndexRelid nor parentConstraintId needs to be passed
+ * since the new catalog entries (pg_constraint, pg_inherits) would
+ * eventually be dropped. Therefore there's no need to record a valid
+ * dependency on the parents.
+ */
+ ind_oid_new = index_create(NewHeap,
+ ind_name->data,
+ InvalidOid,
+ InvalidOid, /* parentIndexRelid */
+ InvalidOid, /* parentConstraintId */
+ InvalidOid,
+ ind_info,
+ colnames,
+ ind->rd_rel->relam,
+ tbsp_oid,
+ collations,
+ opclasses,
+ opclassOptions,
+ indoptions,
+ stattargets,
+ PointerGetDatum(reloptions),
+ flags, /* flags */
+ 0, /* constr_flags */
+ false, /* allow_system_table_mods */
+ false, /* is_internal */
+ NULL /* constraintId */
+ );
+ result = lappend_oid(result, ind_oid_new);
+
+ index_close(ind, AccessShareLock);
+ list_free_deep(colnames);
+ pfree(collations);
+ pfree(opclasses);
+ pfree(indoptions);
+ if (reloptions)
+ pfree(reloptions);
+ }
+
+ return result;
+}
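The caller relies on the returned list lining up positionally with OldIndexes (the forboth() loop in the swap phase walks both lists in lockstep). A minimal standalone sketch of that pairing invariant, using hypothetical names and plain arrays rather than PostgreSQL's List API:

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int Oid;

#define MAX_SWAPS 16

/* Recorded (old, new) pairs, in the order the swap was requested. */
static Oid swap_log[MAX_SWAPS][2];
static size_t nswaps;

/* Hypothetical stand-in for swap_relation_files(): just record the pair. */
static void
record_swap(Oid ind_old, Oid ind_new)
{
	assert(nswaps < MAX_SWAPS);
	swap_log[nswaps][0] = ind_old;
	swap_log[nswaps][1] = ind_new;
	nswaps++;
}

/*
 * Walk both arrays in lockstep, as forboth() does with the two OID lists:
 * the i-th new index was built from the i-th old index, so positional
 * pairing is what attaches each old index to the right new storage.
 */
static void
swap_storage_pairs(const Oid *old_inds, const Oid *new_inds, size_t n)
{
	for (size_t i = 0; i < n; i++)
		record_swap(old_inds[i], new_inds[i]);
}
```

This is only an illustration of why build_new_indexes() must preserve order; the real code swaps relfilenodes via swap_relation_files().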
+
+/*
+ * REPACK is intended to be a replacement for both CLUSTER and VACUUM FULL.
+ */
+void
+repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
+{
+ ListCell *lc;
+ ClusterParams params = {0};
+ bool verbose = false;
+ Relation rel = NULL;
+ Oid indexOid = InvalidOid;
+ MemoryContext repack_context;
+ List *rtcs;
+ LOCKMODE lockmode;
+
+ /* Parse option list */
+ foreach(lc, stmt->params)
+ {
+ DefElem *opt = (DefElem *) lfirst(lc);
+
+ if (strcmp(opt->defname, "verbose") == 0)
+ verbose = defGetBoolean(opt);
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("unrecognized REPACK option \"%s\"",
+ opt->defname),
+ parser_errposition(pstate, opt->location)));
+ }
+
+ params.options =
+ (verbose ? CLUOPT_VERBOSE : 0) |
+ (stmt->concurrent ? CLUOPT_CONCURRENT : 0);
+
+ /*
+ * Determine the lock mode expected by cluster_rel().
+ *
+ * In the exclusive case, we obtain AccessExclusiveLock right away to
+ * avoid a lock-upgrade hazard in the single-transaction case. In the
+ * CONCURRENTLY case, the AccessExclusiveLock will only be used at the end
+ * of processing, supposedly for a very short time. Until then, we'll have
+ * to unlock the relation temporarily, so there's no lock-upgrade hazard.
+ */
+ lockmode = (params.options & CLUOPT_CONCURRENT) == 0 ?
+ AccessExclusiveLock : ShareUpdateExclusiveLock;
+
+ if (stmt->relation != NULL)
+ {
+ rel = process_single_relation(stmt->relation, stmt->indexname,
+ CLUSTER_COMMAND_REPACK, lockmode,
+ isTopLevel, &params, &indexOid);
+ if (rel == NULL)
+ return;
+ }
+
+ /*
+ * By here, we know we are in a multi-table situation.
+ *
+ * Concurrent processing is currently considered rather special (e.g. in
+ * terms of resources consumed), so it is not performed in bulk.
+ */
+ if (params.options & CLUOPT_CONCURRENT)
+ {
+ if (rel != NULL)
+ {
+ Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
+ ereport(ERROR,
+ (errmsg("REPACK CONCURRENTLY not supported for partitioned tables"),
+ errhint("Consider running the command for individual partitions.")));
+ }
+ else
+ ereport(ERROR,
+ (errmsg("REPACK CONCURRENTLY requires explicit table name")));
+ }
+
+ /*
+ * In order to avoid holding locks for too long, we want to process each
+ * table in its own transaction. This forces us to disallow running
+ * inside a user transaction block.
+ */
+ PreventInTransactionBlock(isTopLevel, "REPACK");
+
+ /* Also, we need a memory context to hold our list of relations */
+ repack_context = AllocSetContextCreate(PortalContext,
+ "Repack",
+ ALLOCSET_DEFAULT_SIZES);
+
+ params.options |= CLUOPT_RECHECK;
+ if (rel != NULL)
+ {
+ Oid relid;
+ bool rel_is_index;
+
+ Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
+ /* See the ereport() above. */
+ Assert((params.options & CLUOPT_CONCURRENT) == 0);
+
+ if (OidIsValid(indexOid))
+ {
+ relid = indexOid;
+ rel_is_index = true;
+ }
+ else
+ {
+ relid = RelationGetRelid(rel);
+ rel_is_index = false;
+ }
+ rtcs = get_tables_to_cluster_partitioned(repack_context, relid,
+ rel_is_index,
+ CLUSTER_COMMAND_REPACK);
+
+ /* close relation, releasing lock on parent table */
+ table_close(rel, lockmode);
+ }
+ else
+ rtcs = get_tables_to_repack(repack_context);
+
+ /* Do the job. */
+ cluster_multiple_rels(rtcs, &params, CLUSTER_COMMAND_REPACK, lockmode,
+ isTopLevel);
- /* Do the job. */
- cluster_multiple_rels(rtcs, &params, CLUSTER_COMMAND_REPACK);
/* Start a new transaction for the cleanup work. */
StartTransactionCommand();
@@ -1925,7 +3424,8 @@ repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
*/
static Relation
process_single_relation(RangeVar *relation, char *indexname,
- ClusterCommand cmd, ClusterParams *params,
+ ClusterCommand cmd, LOCKMODE lockmode,
+ bool isTopLevel, ClusterParams *params,
Oid *indexOid_p)
{
Relation rel;
@@ -1935,12 +3435,10 @@ process_single_relation(RangeVar *relation, char *indexname,
Oid tableOid;
/*
- * Find, lock, and check permissions on the table. We obtain
- * AccessExclusiveLock right away to avoid lock-upgrade hazard in the
- * single-transaction case.
+ * Find, lock, and check permissions on the table.
*/
tableOid = RangeVarGetRelidExtended(relation,
- AccessExclusiveLock,
+ lockmode,
0,
RangeVarCallbackMaintainsTable,
NULL);
@@ -1994,7 +3492,7 @@ process_single_relation(RangeVar *relation, char *indexname,
/* For non-partitioned tables, do what we came here to do. */
if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
{
- cluster_rel(rel, indexOid, params, cmd);
+ cluster_rel(rel, indexOid, params, cmd, isTopLevel);
/* cluster_rel closes the relation, but keeps lock */
return NULL;
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index e7854add178..df879c2a18d 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -904,7 +904,7 @@ refresh_by_match_merge(Oid matviewOid, Oid tempOid, Oid relowner,
static void
refresh_by_heap_swap(Oid matviewOid, Oid OIDNewHeap, char relpersistence)
{
- finish_heap_swap(matviewOid, OIDNewHeap, false, false, true, true,
+ finish_heap_swap(matviewOid, OIDNewHeap, false, false, true, true, true,
RecentXmin, ReadNextMultiXactId(), relpersistence);
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index b7a74f25785..2b15e5b1505 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -5970,6 +5970,7 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
finish_heap_swap(tab->relid, OIDNewHeap,
false, false, true,
!OidIsValid(tab->newTableSpace),
+ true,
RecentXmin,
ReadNextMultiXactId(),
persistence);
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index a4ad23448f8..f9f8f5ebb58 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -124,7 +124,7 @@ static void vac_truncate_clog(TransactionId frozenXID,
TransactionId lastSaneFrozenXid,
MultiXactId lastSaneMinMulti);
static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
- BufferAccessStrategy bstrategy);
+ BufferAccessStrategy bstrategy, bool isTopLevel);
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
@@ -634,7 +634,8 @@ vacuum(List *relations, VacuumParams *params, BufferAccessStrategy bstrategy,
if (params->options & VACOPT_VACUUM)
{
- if (!vacuum_rel(vrel->oid, vrel->relation, params, bstrategy))
+ if (!vacuum_rel(vrel->oid, vrel->relation, params, bstrategy,
+ isTopLevel))
continue;
}
@@ -1996,7 +1997,7 @@ vac_truncate_clog(TransactionId frozenXID,
*/
static bool
vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
- BufferAccessStrategy bstrategy)
+ BufferAccessStrategy bstrategy, bool isTopLevel)
{
LOCKMODE lmode;
Relation rel;
@@ -2264,7 +2265,7 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
/* VACUUM FULL is now a variant of CLUSTER; see cluster.c */
cluster_rel(rel, InvalidOid, &cluster_params,
- CLUSTER_COMMAND_VACUUM);
+ CLUSTER_COMMAND_VACUUM, isTopLevel);
/* cluster_rel closes the relation, but keeps lock */
rel = NULL;
@@ -2310,7 +2311,8 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
toast_vacuum_params.options |= VACOPT_PROCESS_MAIN;
toast_vacuum_params.toast_parent = relid;
- vacuum_rel(toast_relid, NULL, &toast_vacuum_params, bstrategy);
+ vacuum_rel(toast_relid, NULL, &toast_vacuum_params, bstrategy,
+ isTopLevel);
}
/*
diff --git a/src/backend/meson.build b/src/backend/meson.build
index 2b0db214804..50aa385a581 100644
--- a/src/backend/meson.build
+++ b/src/backend/meson.build
@@ -194,5 +194,6 @@ pg_test_mod_args = pg_mod_args + {
subdir('jit/llvm')
subdir('replication/libpqwalreceiver')
subdir('replication/pgoutput')
+subdir('replication/pgoutput_repack')
subdir('snowball')
subdir('utils/mb/conversion_procs')
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 9c79265a438..634d0768851 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -11892,27 +11892,30 @@ cluster_index_specification:
*
* QUERY:
* REPACK [ (options) ] [ <qualified_name> [ USING INDEX <index_name> ] ]
+ * REPACK [ (options) ] CONCURRENTLY <qualified_name> [ USING INDEX <index_name> ]
*
*****************************************************************************/
RepackStmt:
- REPACK qualified_name repack_index_specification
+ REPACK opt_concurrently qualified_name repack_index_specification
{
RepackStmt *n = makeNode(RepackStmt);
- n->relation = $2;
- n->indexname = $3;
+ n->concurrent = $2;
+ n->relation = $3;
+ n->indexname = $4;
n->params = NIL;
$$ = (Node *) n;
}
- | REPACK '(' utility_option_list ')' qualified_name repack_index_specification
+ | REPACK '(' utility_option_list ')' opt_concurrently qualified_name repack_index_specification
{
RepackStmt *n = makeNode(RepackStmt);
- n->relation = $5;
- n->indexname = $6;
n->params = $3;
+ n->concurrent = $5;
+ n->relation = $6;
+ n->indexname = $7;
$$ = (Node *) n;
}
@@ -11923,6 +11926,7 @@ RepackStmt:
n->relation = NULL;
n->indexname = NULL;
n->params = NIL;
+ n->concurrent = false;
$$ = (Node *) n;
}
@@ -11933,6 +11937,7 @@ RepackStmt:
n->relation = NULL;
n->indexname = NULL;
n->params = $3;
+ n->concurrent = false;
$$ = (Node *) n;
}
;
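The four grammar productions above differ only in which fields of the resulting node get set. The mapping can be summarized in a small standalone sketch (the struct and function names here are hypothetical mirrors, not the real RepackStmt from parsenodes.h):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Simplified mirror of the fields the RepackStmt productions fill in. */
typedef struct RepackStmtDemo
{
	const char *relation;	/* NULL: repack suitable tables in the database */
	const char *indexname;	/* NULL: no USING INDEX clause */
	bool		concurrent; /* CONCURRENTLY given? */
} RepackStmtDemo;

/*
 * Mimic the productions: CONCURRENTLY is only accepted in the forms that
 * name a table, while the bare "REPACK [ (options) ]" forms always leave
 * concurrent = false.
 */
static RepackStmtDemo
parse_demo(const char *relation, const char *indexname, bool concurrent)
{
	RepackStmtDemo n;

	n.relation = relation;
	n.indexname = indexname;
	n.concurrent = (relation != NULL) ? concurrent : false;
	return n;
}
```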
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 78f9a0a11c4..00f7bbc5f59 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -33,6 +33,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecord.h"
#include "catalog/pg_control.h"
+#include "commands/cluster.h"
#include "replication/decode.h"
#include "replication/logical.h"
#include "replication/message.h"
@@ -467,6 +468,29 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
TransactionId xid = XLogRecGetXid(buf->record);
SnapBuild *builder = ctx->snapshot_builder;
+ /*
+ * Check if REPACK CONCURRENTLY is being performed by this backend. If so,
+ * only decode data changes of the table that it is processing, and the
+ * changes of its TOAST relation.
+ *
+ * (The TOAST locator should not be set unless the main one is.)
+ */
+ Assert(!OidIsValid(repacked_rel_toast_locator.relNumber) ||
+ OidIsValid(repacked_rel_locator.relNumber));
+
+ if (OidIsValid(repacked_rel_locator.relNumber))
+ {
+ XLogReaderState *r = buf->record;
+ RelFileLocator locator;
+
+ /* Not all records contain the block. */
+ if (XLogRecGetBlockTagExtended(r, 0, &locator, NULL, NULL, NULL) &&
+ !RelFileLocatorEquals(locator, repacked_rel_locator) &&
+ (!OidIsValid(repacked_rel_toast_locator.relNumber) ||
+ !RelFileLocatorEquals(locator, repacked_rel_toast_locator)))
+ return;
+ }
+
ReorderBufferProcessXid(ctx->reorder, xid, buf->origptr);
/*
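The early-return filter added to heap_decode() can be expressed as a pure predicate. A standalone sketch under simplified assumptions (a locator reduced to its relNumber, with 0 standing in for InvalidOid; these names are illustrative, not PostgreSQL's):

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int Oid;

/* Simplified stand-in for RelFileLocator. */
typedef struct DemoLocator
{
	Oid			relNumber;	/* 0 means "not set" */
} DemoLocator;

static bool
locator_equals(DemoLocator a, DemoLocator b)
{
	return a.relNumber == b.relNumber;
}

/*
 * Mirror of the condition above: while REPACK CONCURRENTLY is in progress,
 * a heap record is decoded only if it touches the repacked table or, when
 * the TOAST locator is set, its TOAST relation.
 */
static bool
should_decode(DemoLocator record, DemoLocator repacked, DemoLocator toast)
{
	if (repacked.relNumber == 0)
		return true;			/* no repack in progress: decode everything */
	if (locator_equals(record, repacked))
		return true;
	if (toast.relNumber != 0 && locator_equals(record, toast))
		return true;
	return false;
}
```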
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index e5d2a583ce6..c32e459411b 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -486,6 +486,26 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
return SnapBuildMVCCFromHistoric(snap, true);
}
+/*
+ * Build an MVCC snapshot for the initial data load performed by REPACK
+ * CONCURRENTLY command.
+ *
+ * The snapshot will only be used to scan one particular relation, which is
+ * treated like a catalog (therefore ->building_full_snapshot is not
+ * important), and the caller should already have a replication slot setup (so
+ * we do not set MyProc->xmin). XXX Do we yet need to add some restrictions?
+ */
+Snapshot
+SnapBuildInitialSnapshotForRepack(SnapBuild *builder)
+{
+ Snapshot snap;
+
+ Assert(builder->state == SNAPBUILD_CONSISTENT);
+
+ snap = SnapBuildBuildSnapshot(builder);
+ return SnapBuildMVCCFromHistoric(snap, false);
+}
+
/*
* Turn a historic MVCC snapshot into an ordinary MVCC snapshot.
*
diff --git a/src/backend/replication/pgoutput_repack/Makefile b/src/backend/replication/pgoutput_repack/Makefile
new file mode 100644
index 00000000000..4efeb713b70
--- /dev/null
+++ b/src/backend/replication/pgoutput_repack/Makefile
@@ -0,0 +1,32 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+# Makefile for src/backend/replication/pgoutput_repack
+#
+# IDENTIFICATION
+# src/backend/replication/pgoutput_repack
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/replication/pgoutput_repack
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+ $(WIN32RES) \
+ pgoutput_repack.o
+PGFILEDESC = "pgoutput_repack - logical replication output plugin for REPACK command"
+NAME = pgoutput_repack
+
+all: all-shared-lib
+
+include $(top_srcdir)/src/Makefile.shlib
+
+install: all installdirs install-lib
+
+installdirs: installdirs-lib
+
+uninstall: uninstall-lib
+
+clean distclean: clean-lib
+ rm -f $(OBJS)
diff --git a/src/backend/replication/pgoutput_repack/meson.build b/src/backend/replication/pgoutput_repack/meson.build
new file mode 100644
index 00000000000..133e865a4a0
--- /dev/null
+++ b/src/backend/replication/pgoutput_repack/meson.build
@@ -0,0 +1,18 @@
+# Copyright (c) 2022-2024, PostgreSQL Global Development Group
+
+pgoutput_repack_sources = files(
+ 'pgoutput_repack.c',
+)
+
+if host_system == 'windows'
+ pgoutput_repack_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pgoutput_repack',
+ '--FILEDESC', 'pgoutput_repack - logical replication output plugin for REPACK command',])
+endif
+
+pgoutput_repack = shared_module('pgoutput_repack',
+ pgoutput_repack_sources,
+ kwargs: pg_mod_args,
+)
+
+backend_targets += pgoutput_repack
diff --git a/src/backend/replication/pgoutput_repack/pgoutput_repack.c b/src/backend/replication/pgoutput_repack/pgoutput_repack.c
new file mode 100644
index 00000000000..687fbbc59bb
--- /dev/null
+++ b/src/backend/replication/pgoutput_repack/pgoutput_repack.c
@@ -0,0 +1,288 @@
+/*-------------------------------------------------------------------------
+ *
+ * pgoutput_repack.c
+ * Logical Replication output plugin for REPACK command
+ *
+ * Copyright (c) 2012-2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/replication/pgoutput_repack/pgoutput_repack.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heaptoast.h"
+#include "commands/cluster.h"
+#include "replication/snapbuild.h"
+
+PG_MODULE_MAGIC;
+
+static void plugin_startup(LogicalDecodingContext *ctx,
+ OutputPluginOptions *opt, bool is_init);
+static void plugin_shutdown(LogicalDecodingContext *ctx);
+static void plugin_begin_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn);
+static void plugin_commit_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ Relation rel, ReorderBufferChange *change);
+static void plugin_truncate(struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, int nrelations,
+ Relation relations[],
+ ReorderBufferChange *change);
+static void store_change(LogicalDecodingContext *ctx,
+ ConcurrentChangeKind kind, HeapTuple tuple);
+
+void
+_PG_output_plugin_init(OutputPluginCallbacks *cb)
+{
+ AssertVariableIsOfType(&_PG_output_plugin_init, LogicalOutputPluginInit);
+
+ cb->startup_cb = plugin_startup;
+ cb->begin_cb = plugin_begin_txn;
+ cb->change_cb = plugin_change;
+ cb->truncate_cb = plugin_truncate;
+ cb->commit_cb = plugin_commit_txn;
+ cb->shutdown_cb = plugin_shutdown;
+}
+
+
+/* initialize this plugin */
+static void
+plugin_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
+ bool is_init)
+{
+ ctx->output_plugin_private = NULL;
+
+ /* Probably unnecessary, as we don't use the SQL interface ... */
+ opt->output_type = OUTPUT_PLUGIN_BINARY_OUTPUT;
+
+ if (ctx->output_plugin_options != NIL)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("this plugin does not expect any options")));
+ }
+}
+
+static void
+plugin_shutdown(LogicalDecodingContext *ctx)
+{
+}
+
+/*
+ * As we don't release the slot while processing a particular table, there is
+ * no room for an SQL interface, even for debugging purposes. Therefore we need
+ * neither OutputPluginPrepareWrite() nor OutputPluginWrite() in the plugin
+ * callbacks. (Although we might want to write custom callbacks, this API
+ * seems to be unnecessarily generic for our purposes.)
+ */
+
+/* BEGIN callback */
+static void
+plugin_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+}
+
+/* COMMIT callback */
+static void
+plugin_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+}
+
+/*
+ * Callback for individual changed tuples
+ */
+static void
+plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ Relation relation, ReorderBufferChange *change)
+{
+ RepackDecodingState *dstate;
+
+ dstate = (RepackDecodingState *) ctx->output_writer_private;
+
+ /* Only interested in one particular relation. */
+ if (relation->rd_id != dstate->relid)
+ return;
+
+ /* Decode entry depending on its type */
+ switch (change->action)
+ {
+ case REORDER_BUFFER_CHANGE_INSERT:
+ {
+ HeapTuple newtuple;
+
+ newtuple = change->data.tp.newtuple;
+
+ /*
+ * Identity checks in the main function should have made this
+ * impossible.
+ */
+ if (newtuple == NULL)
+ elog(ERROR, "incomplete insert info");
+
+ store_change(ctx, CHANGE_INSERT, newtuple);
+ }
+ break;
+ case REORDER_BUFFER_CHANGE_UPDATE:
+ {
+ HeapTuple oldtuple,
+ newtuple;
+
+ oldtuple = change->data.tp.oldtuple;
+ newtuple = change->data.tp.newtuple;
+
+ if (newtuple == NULL)
+ elog(ERROR, "incomplete update info");
+
+ if (oldtuple != NULL)
+ store_change(ctx, CHANGE_UPDATE_OLD, oldtuple);
+
+ store_change(ctx, CHANGE_UPDATE_NEW, newtuple);
+ }
+ break;
+ case REORDER_BUFFER_CHANGE_DELETE:
+ {
+ HeapTuple oldtuple;
+
+ oldtuple = change->data.tp.oldtuple;
+
+ if (oldtuple == NULL)
+ elog(ERROR, "incomplete delete info");
+
+ store_change(ctx, CHANGE_DELETE, oldtuple);
+ }
+ break;
+ default:
+ /* Should not come here */
+ Assert(false);
+ break;
+ }
+}
+
+static void
+plugin_truncate(struct LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ int nrelations, Relation relations[],
+ ReorderBufferChange *change)
+{
+ RepackDecodingState *dstate;
+ int i;
+ Relation relation = NULL;
+
+ dstate = (RepackDecodingState *) ctx->output_writer_private;
+
+ /* Find the relation we are processing. */
+ for (i = 0; i < nrelations; i++)
+ {
+ relation = relations[i];
+
+ if (RelationGetRelid(relation) == dstate->relid)
+ break;
+ }
+
+ /* Is this truncation of another relation? */
+ if (i == nrelations)
+ return;
+
+ store_change(ctx, CHANGE_TRUNCATE, NULL);
+}
+
+/* Store concurrent data change. */
+static void
+store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
+ HeapTuple tuple)
+{
+ RepackDecodingState *dstate;
+ char *change_raw;
+ ConcurrentChange change;
+ bool flattened = false;
+ Size size;
+ Datum values[1];
+ bool isnull[1];
+ char *dst,
+ *dst_start;
+
+ dstate = (RepackDecodingState *) ctx->output_writer_private;
+
+ size = MAXALIGN(VARHDRSZ) + SizeOfConcurrentChange;
+
+ if (tuple)
+ {
+ /*
+ * ReorderBufferCommit() stores the TOAST chunks in its private memory
+ * context and frees them after having called apply_change().
+ * Therefore we need a flat copy (including TOAST) that we eventually
+ * copy into the memory context which is available to
+ * decode_concurrent_changes().
+ */
+ if (HeapTupleHasExternal(tuple))
+ {
+ /*
+ * toast_flatten_tuple_to_datum() might be more convenient but we
+ * don't want the decompression it does.
+ */
+ tuple = toast_flatten_tuple(tuple, dstate->tupdesc);
+ flattened = true;
+ }
+
+ size += tuple->t_len;
+ }
+
+ /* XXX Isn't there any function / macro to do this? */
+ if (size >= 0x3FFFFFFF)
+ elog(ERROR, "change is too big");
+
+ /* Construct the change. */
+ change_raw = (char *) palloc0(size);
+ SET_VARSIZE(change_raw, size);
+
+ /*
+ * Since the varlena alignment might not be sufficient for the structure,
+ * set the fields in a local instance and remember where it should
+ * eventually be copied.
+ */
+ change.kind = kind;
+ dst_start = (char *) VARDATA(change_raw);
+
+ /* No other information is needed for TRUNCATE. */
+ if (change.kind == CHANGE_TRUNCATE)
+ {
+ memcpy(dst_start, &change, SizeOfConcurrentChange);
+ goto store;
+ }
+
+ /*
+ * Copy the tuple.
+ *
+ * CAUTION: change->tup_data.t_data must be fixed on retrieval!
+ */
+ memcpy(&change.tup_data, tuple, sizeof(HeapTupleData));
+ dst = dst_start + SizeOfConcurrentChange;
+ memcpy(dst, tuple->t_data, tuple->t_len);
+
+ /* The data has been copied. */
+ if (flattened)
+ pfree(tuple);
+
+store:
+ /* Copy the structure so it can be stored. */
+ memcpy(dst_start, &change, SizeOfConcurrentChange);
+
+ /* Store as tuple of 1 bytea column. */
+ values[0] = PointerGetDatum(change_raw);
+ isnull[0] = false;
+ tuplestore_putvalues(dstate->tstore, dstate->tupdesc_change,
+ values, isnull);
+
+ /* Accounting. */
+ dstate->nchanges++;
+
+ /* Cleanup. */
+ pfree(change_raw);
+}
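The serialization pattern used by store_change() — a varlena-style buffer whose payload starts with a fixed struct followed by the raw tuple bytes, with the struct filled in a properly aligned local instance before being memcpy'd — can be sketched standalone. All names here (DemoChange, serialize_change, read_change) are hypothetical simplifications, and the MAXALIGN/VARHDRSZ stand-ins only approximate the PostgreSQL definitions:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-ins; the real macros live in PostgreSQL headers. */
#define MAXIMUM_ALIGNOF 8
#define MAXALIGN(LEN) \
	(((uintptr_t) (LEN) + (MAXIMUM_ALIGNOF - 1)) & ~((uintptr_t) (MAXIMUM_ALIGNOF - 1)))
#define VARHDRSZ ((int32_t) sizeof(int32_t))

typedef struct DemoChange
{
	int			kind;	/* CHANGE_INSERT, CHANGE_DELETE, ... */
	uint32_t	t_len;	/* length of the tuple payload */
} DemoChange;

/*
 * Fill a local, properly aligned struct and memcpy it into the buffer: the
 * offset just past the varlena header need not satisfy the struct's
 * alignment requirement, so the buffer is never cast to the struct type.
 */
static char *
serialize_change(int kind, const char *tuple, uint32_t t_len, size_t *size_p)
{
	DemoChange	change;
	size_t		size = MAXALIGN(VARHDRSZ) + sizeof(DemoChange) + t_len;
	char	   *buf = calloc(1, size);
	char	   *dst = buf + MAXALIGN(VARHDRSZ);

	change.kind = kind;
	change.t_len = t_len;
	memcpy(dst, &change, sizeof(DemoChange));
	memcpy(dst + sizeof(DemoChange), tuple, t_len);

	*size_p = size;
	return buf;
}

/* Retrieval uses the same local-copy trick in the opposite direction. */
static DemoChange
read_change(const char *buf, const char **tuple_p)
{
	DemoChange	change;
	const char *src = buf + MAXALIGN(VARHDRSZ);

	memcpy(&change, src, sizeof(DemoChange));
	*tuple_p = src + sizeof(DemoChange);
	return change;
}
```

The real code additionally wraps the buffer in a single bytea column stored in a tuplestore; the sketch only covers the layout and alignment aspect.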
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..e9ddf39500c 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
#include "access/xlogprefetcher.h"
#include "access/xlogrecovery.h"
#include "commands/async.h"
+#include "commands/cluster.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/autovacuum.h"
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 4f44648aca8..1ee069c34ee 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -351,6 +351,7 @@ DSMRegistry "Waiting to read or update the dynamic shared memory registry."
InjectionPoint "Waiting to read or update information related to injection points."
SerialControl "Waiting to read or update shared <filename>pg_serial</filename> state."
AioWorkerSubmissionQueue "Waiting to access AIO worker submission queue."
+RepackedRels "Waiting to read or update information on tables being repacked concurrently."
#
# END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 9f54a9e72b7..a495f22876d 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -64,6 +64,7 @@
#include "catalog/pg_type.h"
#include "catalog/schemapg.h"
#include "catalog/storage.h"
+#include "commands/cluster.h"
#include "commands/policy.h"
#include "commands/publicationcmds.h"
#include "commands/trigger.h"
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 70a6b8902d1..7f1c220e00b 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -213,7 +213,6 @@ static List *exportedSnapshots = NIL;
/* Prototypes for local functions */
static void UnregisterSnapshotNoOwner(Snapshot snapshot);
-static void FreeSnapshot(Snapshot snapshot);
static void SnapshotResetXmin(void);
/* ResourceOwner callbacks to track snapshot references */
@@ -646,7 +645,7 @@ CopySnapshot(Snapshot snapshot)
* FreeSnapshot
* Free the memory associated with a snapshot.
*/
-static void
+void
FreeSnapshot(Snapshot snapshot)
{
Assert(snapshot->regd_count == 0);
diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index 31271786f21..a22e6cb6ccc 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -4914,18 +4914,27 @@ match_previous_words(int pattern_id,
}
/* REPACK */
- else if (Matches("REPACK"))
+ else if (Matches("REPACK") || Matches("REPACK", "(*)"))
+ COMPLETE_WITH_SCHEMA_QUERY_PLUS(Query_for_list_of_clusterables,
+ "CONCURRENTLY");
+ else if (Matches("REPACK", "CONCURRENTLY"))
COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_clusterables);
- else if (Matches("REPACK", "(*)"))
+ else if (Matches("REPACK", "(*)", "CONCURRENTLY"))
COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_clusterables);
- /* If we have REPACK <sth>, then add "USING INDEX" */
- else if (Matches("REPACK", MatchAnyExcept("(")))
+ /* If we have REPACK [ CONCURRENTLY ] <sth>, then add "USING INDEX" */
+ else if (Matches("REPACK", MatchAnyExcept("(|CONCURRENTLY")) ||
+ Matches("REPACK", "CONCURRENTLY", MatchAnyExcept("(")))
COMPLETE_WITH("USING INDEX");
- /* If we have REPACK (*) <sth>, then add "USING INDEX" */
- else if (Matches("REPACK", "(*)", MatchAny))
+ /* If we have REPACK (*) [ CONCURRENTLY ] <sth>, then add "USING INDEX" */
+ else if (Matches("REPACK", "(*)", MatchAnyExcept("CONCURRENTLY")) ||
+ Matches("REPACK", "(*)", "CONCURRENTLY", MatchAnyExcept("(")))
COMPLETE_WITH("USING INDEX");
- /* If we have REPACK <sth> USING, then add the index as well */
- else if (Matches("REPACK", MatchAny, "USING", "INDEX"))
+
+ /*
+ * Complete ... [ (*) ] [ CONCURRENTLY ] <sth> USING INDEX, with a list of
+ * indexes for <sth>.
+ */
+ else if (TailMatches(MatchAnyExcept("(|CONCURRENTLY"), "USING", "INDEX"))
{
set_completion_reference(prev3_wd);
COMPLETE_WITH_SCHEMA_QUERY(Query_for_index_of_table);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 1640d9c32f7..bdeb2f83540 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -421,6 +421,10 @@ extern HTSV_Result HeapTupleSatisfiesVacuumHorizon(HeapTuple htup, Buffer buffer
TransactionId *dead_after);
extern void HeapTupleSetHintBits(HeapTupleHeader tuple, Buffer buffer,
uint16 infomask, TransactionId xid);
+extern bool HeapTupleMVCCInserted(HeapTuple htup, Snapshot snapshot,
+ Buffer buffer);
+extern bool HeapTupleMVCCNotDeleted(HeapTuple htup, Snapshot snapshot,
+ Buffer buffer);
extern bool HeapTupleHeaderIsOnlyLocked(HeapTupleHeader tuple);
extern bool HeapTupleIsSurelyDead(HeapTuple htup,
struct GlobalVisState *vistest);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index b8cb1e744ad..b1ca73d6ea5 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,6 +21,7 @@
#include "access/sdir.h"
#include "access/xact.h"
#include "executor/tuptable.h"
+#include "replication/logical.h"
#include "storage/read_stream.h"
#include "utils/rel.h"
#include "utils/snapshot.h"
@@ -630,6 +631,8 @@ typedef struct TableAmRoutine
Relation OldIndex,
bool use_sort,
TransactionId OldestXmin,
+ Snapshot snapshot,
+ LogicalDecodingContext *decoding_ctx,
TransactionId *xid_cutoff,
MultiXactId *multi_cutoff,
double *num_tuples,
@@ -1637,6 +1640,10 @@ table_relation_copy_data(Relation rel, const RelFileLocator *newrlocator)
* not needed for the relation's AM
* - *xid_cutoff - ditto
* - *multi_cutoff - ditto
+ * - snapshot - if != NULL, ignore data changes done by transactions that this
+ * (MVCC) snapshot considers still in-progress or in the future.
+ * - decoding_ctx - logical decoding context, to capture concurrent data
+ * changes.
*
* Output parameters:
* - *xid_cutoff - rel's new relfrozenxid value, may be invalid
@@ -1649,6 +1656,8 @@ table_relation_copy_for_cluster(Relation OldTable, Relation NewTable,
Relation OldIndex,
bool use_sort,
TransactionId OldestXmin,
+ Snapshot snapshot,
+ LogicalDecodingContext *decoding_ctx,
TransactionId *xid_cutoff,
MultiXactId *multi_cutoff,
double *num_tuples,
@@ -1657,6 +1666,7 @@ table_relation_copy_for_cluster(Relation OldTable, Relation NewTable,
{
OldTable->rd_tableam->relation_copy_for_cluster(OldTable, NewTable, OldIndex,
use_sort, OldestXmin,
+ snapshot, decoding_ctx,
xid_cutoff, multi_cutoff,
num_tuples, tups_vacuumed,
tups_recently_dead);
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 4daa8bef5ee..66431cc19e5 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -100,6 +100,9 @@ extern Oid index_concurrently_create_copy(Relation heapRelation,
Oid tablespaceOid,
const char *newName);
+extern NullableDatum *get_index_stattargets(Oid indexid,
+ IndexInfo *indInfo);
+
extern void index_concurrently_build(Oid heapRelationId,
Oid indexRelationId);
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index c2976905e4d..569cc2184b3 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -13,10 +13,15 @@
#ifndef CLUSTER_H
#define CLUSTER_H
+#include "nodes/execnodes.h"
#include "nodes/parsenodes.h"
#include "parser/parse_node.h"
+#include "replication/logical.h"
#include "storage/lock.h"
+#include "storage/relfilelocator.h"
#include "utils/relcache.h"
+#include "utils/resowner.h"
+#include "utils/tuplestore.h"
/* flag bits for ClusterParams->options */
@@ -24,6 +29,7 @@
#define CLUOPT_RECHECK 0x02 /* recheck relation state */
#define CLUOPT_RECHECK_ISCLUSTERED 0x04 /* recheck relation state for
* indisclustered */
+#define CLUOPT_CONCURRENT 0x08 /* allow concurrent data changes */
/* options for CLUSTER */
typedef struct ClusterParams
@@ -46,14 +52,90 @@ typedef enum ClusterCommand
CLUSTER_COMMAND_VACUUM
} ClusterCommand;
+/*
+ * The following definitions are used by REPACK CONCURRENTLY.
+ */
+
+extern RelFileLocator repacked_rel_locator;
+extern RelFileLocator repacked_rel_toast_locator;
+
+typedef enum
+{
+ CHANGE_INSERT,
+ CHANGE_UPDATE_OLD,
+ CHANGE_UPDATE_NEW,
+ CHANGE_DELETE,
+ CHANGE_TRUNCATE
+} ConcurrentChangeKind;
+
+typedef struct ConcurrentChange
+{
+ /* See the enum above. */
+ ConcurrentChangeKind kind;
+
+ /*
+ * The actual tuple.
+ *
+ * The tuple data follows the ConcurrentChange structure. Before use make
+ * sure the tuple is correctly aligned (ConcurrentChange can be stored as
+ * bytea) and that tuple->t_data is fixed.
+ */
+ HeapTupleData tup_data;
+} ConcurrentChange;
+
+#define SizeOfConcurrentChange (offsetof(ConcurrentChange, tup_data) + \
+ sizeof(HeapTupleData))
+
+/*
+ * Logical decoding state.
+ *
+ * Here we store the data changes that we decode from WAL while the table
+ * contents are being copied to new storage, along with the metadata
+ * needed to apply these changes to the table.
+ */
+typedef struct RepackDecodingState
+{
+ /* The relation whose changes we're decoding. */
+ Oid relid;
+
+ /*
+ * Decoded changes are stored here. Although we try to avoid excessive
+ * batches, it can happen that the changes need to be stored to disk. The
+ * tuplestore does this transparently.
+ */
+ Tuplestorestate *tstore;
+
+ /* The current number of changes in tstore. */
+ double nchanges;
+
+ /*
+ * Descriptor to store the ConcurrentChange structure serialized (bytea).
+ * We can't store the tuple directly because tuplestore only supports
+ * minimal tuples and we may need to transfer the OID system column from
+ * the output plugin. Also we need to transfer the change kind, so it's better
+ * to put everything in the structure than to use 2 tuplestores "in
+ * parallel".
+ */
+ TupleDesc tupdesc_change;
+
+ /* Tuple descriptor needed to update indexes. */
+ TupleDesc tupdesc;
+
+ /* Slot to retrieve data from tstore. */
+ TupleTableSlot *tsslot;
+
+ ResourceOwner resowner;
+} RepackDecodingState;
+
extern void cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel);
extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
- ClusterCommand cmd);
+ ClusterCommand cmd, bool isTopLevel);
extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid,
LOCKMODE lockmode,
ClusterCommand cmd);
extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
-
+extern void repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
+ XLogRecPtr end_of_wal);
extern Oid make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
char relpersistence, LOCKMODE lockmode);
extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
@@ -61,6 +143,7 @@ extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
bool swap_toast_by_content,
bool check_constraints,
bool is_internal,
+ bool reindex,
TransactionId frozenXid,
MultiXactId cutoffMulti,
char newrelpersistence);
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 7644267e14f..6b1b1a4c1a7 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -67,10 +67,12 @@
#define PROGRESS_REPACK_PHASE 1
#define PROGRESS_REPACK_INDEX_RELID 2
#define PROGRESS_REPACK_HEAP_TUPLES_SCANNED 3
-#define PROGRESS_REPACK_HEAP_TUPLES_WRITTEN 4
-#define PROGRESS_REPACK_TOTAL_HEAP_BLKS 5
-#define PROGRESS_REPACK_HEAP_BLKS_SCANNED 6
-#define PROGRESS_REPACK_INDEX_REBUILD_COUNT 7
+#define PROGRESS_REPACK_HEAP_TUPLES_INSERTED 4
+#define PROGRESS_REPACK_HEAP_TUPLES_UPDATED 5
+#define PROGRESS_REPACK_HEAP_TUPLES_DELETED 6
+#define PROGRESS_REPACK_TOTAL_HEAP_BLKS 7
+#define PROGRESS_REPACK_HEAP_BLKS_SCANNED 8
+#define PROGRESS_REPACK_INDEX_REBUILD_COUNT 9
/*
* Phases of repack (as advertised via PROGRESS_REPACK_PHASE).
@@ -83,9 +85,10 @@
#define PROGRESS_REPACK_PHASE_INDEX_SCAN_HEAP 2
#define PROGRESS_REPACK_PHASE_SORT_TUPLES 3
#define PROGRESS_REPACK_PHASE_WRITE_NEW_HEAP 4
-#define PROGRESS_REPACK_PHASE_SWAP_REL_FILES 5
-#define PROGRESS_REPACK_PHASE_REBUILD_INDEX 6
-#define PROGRESS_REPACK_PHASE_FINAL_CLEANUP 7
+#define PROGRESS_REPACK_PHASE_CATCH_UP 5
+#define PROGRESS_REPACK_PHASE_SWAP_REL_FILES 6
+#define PROGRESS_REPACK_PHASE_REBUILD_INDEX 7
+#define PROGRESS_REPACK_PHASE_FINAL_CLEANUP 8
/* Commands of PROGRESS_REPACK */
#define PROGRESS_REPACK_COMMAND_REPACK 1
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 4ef76c852f5..de091ceb04a 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3931,6 +3931,7 @@ typedef struct RepackStmt
RangeVar *relation; /* relation being repacked */
char *indexname; /* order tuples by this index */
List *params; /* list of DefElem nodes */
+ bool concurrent; /* allow concurrent access? */
} RepackStmt;
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 6d4d2d1814c..802fc4b0823 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -73,6 +73,7 @@ extern void FreeSnapshotBuilder(SnapBuild *builder);
extern void SnapBuildSnapDecRefcount(Snapshot snap);
extern Snapshot SnapBuildInitialSnapshot(SnapBuild *builder);
+extern Snapshot SnapBuildInitialSnapshotForRepack(SnapBuild *builder);
extern Snapshot SnapBuildMVCCFromHistoric(Snapshot snapshot, bool in_place);
extern const char *SnapBuildExportSnapshot(SnapBuild *builder);
extern void SnapBuildClearExportedSnapshot(void);
diff --git a/src/include/storage/lockdefs.h b/src/include/storage/lockdefs.h
index 7f3ba0352f6..2739327b0da 100644
--- a/src/include/storage/lockdefs.h
+++ b/src/include/storage/lockdefs.h
@@ -36,8 +36,8 @@ typedef int LOCKMODE;
#define AccessShareLock 1 /* SELECT */
#define RowShareLock 2 /* SELECT FOR UPDATE/FOR SHARE */
#define RowExclusiveLock 3 /* INSERT, UPDATE, DELETE */
-#define ShareUpdateExclusiveLock 4 /* VACUUM (non-FULL), ANALYZE, CREATE
- * INDEX CONCURRENTLY */
+#define ShareUpdateExclusiveLock 4 /* VACUUM (non-exclusive), ANALYZE, CREATE
+ * INDEX CONCURRENTLY, REPACK CONCURRENTLY */
#define ShareLock 5 /* CREATE INDEX (WITHOUT CONCURRENTLY) */
#define ShareRowExclusiveLock 6 /* like EXCLUSIVE MODE, but allows ROW
* SHARE */
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 932024b1b0b..fe9d85e5f95 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -84,3 +84,4 @@ PG_LWLOCK(50, DSMRegistry)
PG_LWLOCK(51, InjectionPoint)
PG_LWLOCK(52, SerialControl)
PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, RepackedRels)
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index 147b190210a..5eeabdc6c4f 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -61,6 +61,8 @@ extern Snapshot GetLatestSnapshot(void);
extern void SnapshotSetCommandId(CommandId curcid);
extern Snapshot CopySnapshot(Snapshot snapshot);
+extern void FreeSnapshot(Snapshot snapshot);
+
extern Snapshot GetCatalogSnapshot(Oid relid);
extern Snapshot GetNonHistoricCatalogSnapshot(Oid relid);
extern void InvalidateCatalogSnapshot(void);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 84ca2dc3778..086c61f4ef4 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1969,17 +1969,17 @@ pg_stat_progress_cluster| SELECT s.pid,
WHEN 2 THEN 'index scanning heap'::text
WHEN 3 THEN 'sorting tuples'::text
WHEN 4 THEN 'writing new heap'::text
- WHEN 5 THEN 'swapping relation files'::text
- WHEN 6 THEN 'rebuilding index'::text
- WHEN 7 THEN 'performing final cleanup'::text
+ WHEN 6 THEN 'swapping relation files'::text
+ WHEN 7 THEN 'rebuilding index'::text
+ WHEN 8 THEN 'performing final cleanup'::text
ELSE NULL::text
END AS phase,
(s.param3)::oid AS cluster_index_relid,
s.param4 AS heap_tuples_scanned,
s.param5 AS heap_tuples_written,
- s.param6 AS heap_blks_total,
- s.param7 AS heap_blks_scanned,
- s.param8 AS index_rebuild_count
+ s.param8 AS heap_blks_total,
+ s.param9 AS heap_blks_scanned,
+ s.param10 AS index_rebuild_count
FROM (pg_stat_get_progress_info('CLUSTER'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_progress_copy| SELECT s.pid,
@@ -2055,17 +2055,20 @@ pg_stat_progress_repack| SELECT s.pid,
WHEN 2 THEN 'index scanning heap'::text
WHEN 3 THEN 'sorting tuples'::text
WHEN 4 THEN 'writing new heap'::text
- WHEN 5 THEN 'swapping relation files'::text
- WHEN 6 THEN 'rebuilding index'::text
- WHEN 7 THEN 'performing final cleanup'::text
+ WHEN 5 THEN 'catch-up'::text
+ WHEN 6 THEN 'swapping relation files'::text
+ WHEN 7 THEN 'rebuilding index'::text
+ WHEN 8 THEN 'performing final cleanup'::text
ELSE NULL::text
END AS phase,
(s.param3)::oid AS repack_index_relid,
s.param4 AS heap_tuples_scanned,
- s.param5 AS heap_tuples_written,
- s.param6 AS heap_blks_total,
- s.param7 AS heap_blks_scanned,
- s.param8 AS index_rebuild_count
+ s.param5 AS heap_tuples_inserted,
+ s.param6 AS heap_tuples_updated,
+ s.param7 AS heap_tuples_deleted,
+ s.param8 AS heap_blks_total,
+ s.param9 AS heap_blks_scanned,
+ s.param10 AS index_rebuild_count
FROM (pg_stat_get_progress_info('REPACK'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_progress_vacuum| SELECT s.pid,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index c7ea8fb93ca..e89db0a2ee7 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -477,6 +477,8 @@ CompressFileHandle
CompressionLocation
CompressorState
ComputeXidHorizonsResult
+ConcurrentChange
+ConcurrentChangeKind
ConditionVariable
ConditionVariableMinimallyPadded
ConditionalStack
@@ -1239,6 +1241,7 @@ IndexElem
IndexFetchHeapData
IndexFetchTableData
IndexInfo
+IndexInsertState
IndexList
IndexOnlyScan
IndexOnlyScanState
@@ -2507,6 +2510,7 @@ ReorderBufferTupleCidKey
ReorderBufferUpdateProgressTxnCB
ReorderTuple
RepOriginId
+RepackDecodingState
RepackStmt
ReparameterizeForeignPathByChild_function
ReplaceVarsFromTargetList_context
--
2.43.5
Attachment: v12-0005-Preserve-visibility-information-of-the-concurrent-da.patch (text/x-diff)
From 16133cdbdcfc338d8a30ae058bdf7930a72ae888 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Tue, 1 Apr 2025 13:48:57 +0200
Subject: [PATCH 5/9] Preserve visibility information of the concurrent data
changes.
As explained in the commit message of the preceding patch of the series, the
data changes done by applications while REPACK CONCURRENTLY is copying the
table contents to a new file are decoded from WAL and eventually also applied
to the new file. To reduce the complexity a little bit, the preceding patch
uses the current transaction (i.e. the transaction opened by the REPACK command)
to execute those INSERT, UPDATE and DELETE commands.
However, REPACK is not expected to change visibility of tuples. Therefore,
this patch fixes the handling of the "concurrent data changes". It ensures
that tuples written into the new table have the same XID and command ID (CID)
as they had in the old table.
To "replay" an UPDATE or DELETE command on the new table, we need the
appropriate snapshot to find the previous tuple version in the new table. The
(historic) snapshot we used to decode the UPDATE / DELETE should (by
definition) see the state of the catalog prior to that UPDATE / DELETE. Thus
we can use the same snapshot to find the "old tuple" for UPDATE / DELETE in
the new table if:
1) REPACK CONCURRENTLY preserves visibility information of all tuples - that's
the purpose of this part of the patch series.
2) The table being REPACKed is treated as a system catalog by all transactions
that modify its data. This ensures that reorderbuffer.c generates a new
snapshot for each data change in the table.
We ensure 2) by maintaining a shared hashtable of tables being REPACKed
CONCURRENTLY and by adjusting the RelationIsAccessibleInLogicalDecoding()
macro so it checks this hashtable. (The corresponding flag is also added to
the relation cache, so that the shared hashtable does not have to be accessed
too often.) It's essential that after adding an entry to the hashtable we wait
for completion of all the transactions that might have started to modify our
table before our entry has was added. We achieve that by upgrading our lock on
the table to ShareLock temporarily: as soon as we acquire it, no DML command
should be running on the table. (This lock upgrade shouldn't cause any
deadlock because we take care not to hold a lock on other objects at the same
time.)
As long as we preserve the tuple visibility information (which includes XID),
it's important to avoid logical decoding of the WAL generated by DMLs on the
new table: the logical decoding subsystem probably does not expect that the
incoming WAL records contain XIDs of already decoded transactions. (And of
course, repeated decoding would be wasted effort.)
---
src/backend/access/common/toast_internals.c | 3 +-
src/backend/access/heap/heapam.c | 82 ++--
src/backend/access/heap/heapam_handler.c | 14 +-
src/backend/access/transam/xact.c | 52 +++
src/backend/commands/cluster.c | 406 +++++++++++++++++-
src/backend/replication/logical/decode.c | 77 +++-
src/backend/replication/logical/snapbuild.c | 22 +-
.../pgoutput_repack/pgoutput_repack.c | 68 ++-
src/backend/storage/ipc/ipci.c | 2 +
src/backend/utils/cache/inval.c | 21 +
src/backend/utils/cache/relcache.c | 4 +
src/include/access/heapam.h | 15 +-
src/include/access/heapam_xlog.h | 2 +
src/include/access/xact.h | 2 +
src/include/commands/cluster.h | 22 +
src/include/utils/inval.h | 2 +
src/include/utils/rel.h | 7 +-
src/include/utils/snapshot.h | 3 +
src/tools/pgindent/typedefs.list | 1 +
19 files changed, 722 insertions(+), 83 deletions(-)
diff --git a/src/backend/access/common/toast_internals.c b/src/backend/access/common/toast_internals.c
index 7d8be8346ce..75d889ec72c 100644
--- a/src/backend/access/common/toast_internals.c
+++ b/src/backend/access/common/toast_internals.c
@@ -320,7 +320,8 @@ toast_save_datum(Relation rel, Datum value,
memcpy(VARDATA(&chunk_data), data_p, chunk_size);
toasttup = heap_form_tuple(toasttupDesc, t_values, t_isnull);
- heap_insert(toastrel, toasttup, mycid, options, NULL);
+ heap_insert(toastrel, toasttup, GetCurrentTransactionId(), mycid,
+ options, NULL);
/*
* Create the index entry. We cheat a little here by not using
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index cedaa195cb6..8299f3b3ded 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -60,7 +60,8 @@ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
HeapTuple newtup, HeapTuple old_key_tuple,
- bool all_visible_cleared, bool new_all_visible_cleared);
+ bool all_visible_cleared, bool new_all_visible_cleared,
+ bool wal_logical);
#ifdef USE_ASSERT_CHECKING
static void check_lock_if_inplace_updateable_rel(Relation relation,
ItemPointer otid,
@@ -2087,7 +2088,7 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
/*
* heap_insert - insert tuple into a heap
*
- * The new tuple is stamped with current transaction ID and the specified
+ * The new tuple is stamped with the specified transaction ID and the specified
* command ID.
*
* See table_tuple_insert for comments about most of the input flags, except
@@ -2103,15 +2104,16 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
* reflected into *tup.
*/
void
-heap_insert(Relation relation, HeapTuple tup, CommandId cid,
- int options, BulkInsertState bistate)
+heap_insert(Relation relation, HeapTuple tup, TransactionId xid,
+ CommandId cid, int options, BulkInsertState bistate)
{
- TransactionId xid = GetCurrentTransactionId();
HeapTuple heaptup;
Buffer buffer;
Buffer vmbuffer = InvalidBuffer;
bool all_visible_cleared = false;
+ Assert(TransactionIdIsValid(xid));
+
/* Cheap, simplistic check that the tuple matches the rel's rowtype. */
Assert(HeapTupleHeaderGetNatts(tup->t_data) <=
RelationGetNumberOfAttributes(relation));
@@ -2191,8 +2193,15 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
/*
* If this is a catalog, we need to transmit combo CIDs to properly
* decode, so log that as well.
+ *
+ * HEAP_INSERT_NO_LOGICAL should be set when applying data changes
+ * done by other transactions during REPACK CONCURRENTLY. In such a
+ * case, the insertion should not be decoded at all - see
+ * heap_decode(). (It's also set by raw_heap_insert() for TOAST, but
+ * TOAST does not pass this test anyway.)
*/
- if (RelationIsAccessibleInLogicalDecoding(relation))
+ if ((options & HEAP_INSERT_NO_LOGICAL) == 0 &&
+ RelationIsAccessibleInLogicalDecoding(relation))
log_heap_new_cid(relation, heaptup);
/*
@@ -2736,7 +2745,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
void
simple_heap_insert(Relation relation, HeapTuple tup)
{
- heap_insert(relation, tup, GetCurrentCommandId(true), 0, NULL);
+ heap_insert(relation, tup, GetCurrentTransactionId(),
+ GetCurrentCommandId(true), 0, NULL);
}
/*
@@ -2793,11 +2803,11 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
*/
TM_Result
heap_delete(Relation relation, ItemPointer tid,
- CommandId cid, Snapshot crosscheck, bool wait,
- TM_FailureData *tmfd, bool changingPart)
+ TransactionId xid, CommandId cid, Snapshot crosscheck, bool wait,
+ TM_FailureData *tmfd, bool changingPart,
+ bool wal_logical)
{
TM_Result result;
- TransactionId xid = GetCurrentTransactionId();
ItemId lp;
HeapTupleData tp;
Page page;
@@ -2814,6 +2824,7 @@ heap_delete(Relation relation, ItemPointer tid,
bool old_key_copied = false;
Assert(ItemPointerIsValid(tid));
+ Assert(TransactionIdIsValid(xid));
/*
* Forbid this during a parallel operation, lest it allocate a combo CID.
@@ -3039,7 +3050,8 @@ l1:
* Compute replica identity tuple before entering the critical section so
* we don't PANIC upon a memory allocation failure.
*/
- old_key_tuple = ExtractReplicaIdentity(relation, &tp, true, &old_key_copied);
+ old_key_tuple = wal_logical ?
+ ExtractReplicaIdentity(relation, &tp, true, &old_key_copied) : NULL;
/*
* If this is the first possibly-multixact-able operation in the current
@@ -3107,8 +3119,12 @@ l1:
/*
* For logical decode we need combo CIDs to properly decode the
* catalog
+ *
+ * Like in heap_insert(), visibility is unchanged when called from
+ * VACUUM FULL / CLUSTER.
*/
- if (RelationIsAccessibleInLogicalDecoding(relation))
+ if (wal_logical &&
+ RelationIsAccessibleInLogicalDecoding(relation))
log_heap_new_cid(relation, &tp);
xlrec.flags = 0;
@@ -3129,6 +3145,15 @@ l1:
xlrec.flags |= XLH_DELETE_CONTAINS_OLD_KEY;
}
+ /*
+ * Unlike UPDATE, DELETE is decoded even if there is no old key, so it
+ * does not help to clear both XLH_DELETE_CONTAINS_OLD_TUPLE and
+ * XLH_DELETE_CONTAINS_OLD_KEY. Thus we need an extra flag. TODO
+ * Consider not decoding tuples w/o the old tuple/key instead.
+ */
+ if (!wal_logical)
+ xlrec.flags |= XLH_DELETE_NO_LOGICAL;
+
XLogBeginInsert();
XLogRegisterData(&xlrec, SizeOfHeapDelete);
@@ -3218,10 +3243,11 @@ simple_heap_delete(Relation relation, ItemPointer tid)
TM_Result result;
TM_FailureData tmfd;
- result = heap_delete(relation, tid,
+ result = heap_delete(relation, tid, GetCurrentTransactionId(),
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &tmfd, false /* changingPart */ );
+ &tmfd, false, /* changingPart */
+ true /* wal_logical */ );
switch (result)
{
case TM_SelfModified:
@@ -3260,12 +3286,11 @@ simple_heap_delete(Relation relation, ItemPointer tid)
*/
TM_Result
heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
- CommandId cid, Snapshot crosscheck, bool wait,
- TM_FailureData *tmfd, LockTupleMode *lockmode,
- TU_UpdateIndexes *update_indexes)
+ TransactionId xid, CommandId cid, Snapshot crosscheck,
+ bool wait, TM_FailureData *tmfd, LockTupleMode *lockmode,
+ TU_UpdateIndexes *update_indexes, bool wal_logical)
{
TM_Result result;
- TransactionId xid = GetCurrentTransactionId();
Bitmapset *hot_attrs;
Bitmapset *sum_attrs;
Bitmapset *key_attrs;
@@ -3305,6 +3330,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
infomask2_new_tuple;
Assert(ItemPointerIsValid(otid));
+ Assert(TransactionIdIsValid(xid));
/* Cheap, simplistic check that the tuple matches the rel's rowtype. */
Assert(HeapTupleHeaderGetNatts(newtup->t_data) <=
@@ -4142,8 +4168,12 @@ l2:
/*
* For logical decoding we need combo CIDs to properly decode the
* catalog.
+ *
+ * Like in heap_insert(), visibility is unchanged when called from
+ * VACUUM FULL / CLUSTER.
*/
- if (RelationIsAccessibleInLogicalDecoding(relation))
+ if (wal_logical &&
+ RelationIsAccessibleInLogicalDecoding(relation))
{
log_heap_new_cid(relation, &oldtup);
log_heap_new_cid(relation, heaptup);
@@ -4153,7 +4183,8 @@ l2:
newbuf, &oldtup, heaptup,
old_key_tuple,
all_visible_cleared,
- all_visible_cleared_new);
+ all_visible_cleared_new,
+ wal_logical);
if (newbuf != buffer)
{
PageSetLSN(BufferGetPage(newbuf), recptr);
@@ -4508,10 +4539,10 @@ simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup,
TM_FailureData tmfd;
LockTupleMode lockmode;
- result = heap_update(relation, otid, tup,
+ result = heap_update(relation, otid, tup, GetCurrentTransactionId(),
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &tmfd, &lockmode, update_indexes);
+ &tmfd, &lockmode, update_indexes, true);
switch (result)
{
case TM_SelfModified:
@@ -8844,7 +8875,8 @@ static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup, HeapTuple newtup,
HeapTuple old_key_tuple,
- bool all_visible_cleared, bool new_all_visible_cleared)
+ bool all_visible_cleared, bool new_all_visible_cleared,
+ bool wal_logical)
{
xl_heap_update xlrec;
xl_heap_header xlhdr;
@@ -8855,10 +8887,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
suffixlen = 0;
XLogRecPtr recptr;
Page page = BufferGetPage(newbuf);
- bool need_tuple_data = RelationIsLogicallyLogged(reln);
+ bool need_tuple_data;
bool init;
int bufflags;
+ need_tuple_data = RelationIsLogicallyLogged(reln) && wal_logical;
+
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 371afa6ad59..ea1d6f299b3 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -256,7 +256,8 @@ heapam_tuple_insert(Relation relation, TupleTableSlot *slot, CommandId cid,
tuple->t_tableOid = slot->tts_tableOid;
/* Perform the insertion, and copy the resulting ItemPointer */
- heap_insert(relation, tuple, cid, options, bistate);
+ heap_insert(relation, tuple, GetCurrentTransactionId(), cid, options,
+ bistate);
ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
if (shouldFree)
@@ -279,7 +280,8 @@ heapam_tuple_insert_speculative(Relation relation, TupleTableSlot *slot,
options |= HEAP_INSERT_SPECULATIVE;
/* Perform the insertion, and copy the resulting ItemPointer */
- heap_insert(relation, tuple, cid, options, bistate);
+ heap_insert(relation, tuple, GetCurrentTransactionId(), cid, options,
+ bistate);
ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
if (shouldFree)
@@ -313,7 +315,8 @@ heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
* the storage itself is cleaning the dead tuples by itself, it is the
* time to call the index tuple deletion also.
*/
- return heap_delete(relation, tid, cid, crosscheck, wait, tmfd, changingPart);
+ return heap_delete(relation, tid, GetCurrentTransactionId(), cid,
+ crosscheck, wait, tmfd, changingPart, true);
}
@@ -331,8 +334,9 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
slot->tts_tableOid = RelationGetRelid(relation);
tuple->t_tableOid = slot->tts_tableOid;
- result = heap_update(relation, otid, tuple, cid, crosscheck, wait,
- tmfd, lockmode, update_indexes);
+ result = heap_update(relation, otid, tuple, GetCurrentTransactionId(),
+ cid, crosscheck, wait,
+ tmfd, lockmode, update_indexes, true);
ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
/*
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 23f2de587a1..3db4cac030e 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -126,6 +126,18 @@ static FullTransactionId XactTopFullTransactionId = {InvalidTransactionId};
static int nParallelCurrentXids = 0;
static TransactionId *ParallelCurrentXids;
+/*
+ * Another case that requires TransactionIdIsCurrentTransactionId() to behave
+ * specially is when REPACK CONCURRENTLY is processing data changes made in
+ * the old storage of a table by other transactions. When applying the changes
+ * to the new storage, the backend executing the CLUSTER command needs to act
+ * on behalf on those other transactions. The transactions responsible for the
+ * changes in the old storage are stored in this array, sorted by
+ * xidComparator.
+ */
+static int nRepackCurrentXids = 0;
+static TransactionId *RepackCurrentXids = NULL;
+
/*
* Miscellaneous flag bits to record events which occur on the top level
* transaction. These flags are only persisted in MyXactFlags and are intended
@@ -973,6 +985,8 @@ TransactionIdIsCurrentTransactionId(TransactionId xid)
int low,
high;
+ Assert(nRepackCurrentXids == 0);
+
low = 0;
high = nParallelCurrentXids - 1;
while (low <= high)
@@ -992,6 +1006,21 @@ TransactionIdIsCurrentTransactionId(TransactionId xid)
return false;
}
+ /*
+ * When executing REPACK CONCURRENTLY, the array of current transactions
+ * is given.
+ */
+ if (nRepackCurrentXids > 0)
+ {
+ Assert(nParallelCurrentXids == 0);
+
+ return bsearch(&xid,
+ RepackCurrentXids,
+ nRepackCurrentXids,
+ sizeof(TransactionId),
+ xidComparator) != NULL;
+ }
+
/*
* We will return true for the Xid of the current subtransaction, any of
* its subcommitted children, any of its parents, or any of their
@@ -5649,6 +5678,29 @@ EndParallelWorkerTransaction(void)
CurrentTransactionState->blockState = TBLOCK_DEFAULT;
}
+/*
+ * SetRepackCurrentXids
+ * Set the XID array that TransactionIdIsCurrentTransactionId() should
+ * use.
+ */
+void
+SetRepackCurrentXids(TransactionId *xip, int xcnt)
+{
+ RepackCurrentXids = xip;
+ nRepackCurrentXids = xcnt;
+}
+
+/*
+ * ResetRepackCurrentXids
+ * Undo the effect of SetRepackCurrentXids().
+ */
+void
+ResetRepackCurrentXids(void)
+{
+ RepackCurrentXids = NULL;
+ nRepackCurrentXids = 0;
+}
+
/*
* ShowTransactionState
* Debug support
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 4d08a28ff7e..e30c1a78e9c 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -90,6 +90,11 @@ typedef struct
* The following definitions are used for concurrent processing.
*/
+/*
+ * OID of the table being repacked by this backend.
+ */
+static Oid repacked_rel = InvalidOid;
+
/*
* The locators are used to avoid logical decoding of data that we do not need
* for our table.
@@ -133,8 +138,10 @@ static List *get_tables_to_cluster_partitioned(MemoryContext cluster_context,
ClusterCommand cmd);
static bool cluster_is_permitted_for_relation(Oid relid, Oid userid,
ClusterCommand cmd);
-static void begin_concurrent_repack(Relation rel);
-static void end_concurrent_repack(void);
+static void begin_concurrent_repack(Relation rel, Relation *index_p,
+ bool *entered_p);
+static void end_concurrent_repack(bool error);
+static void cluster_before_shmem_exit_callback(int code, Datum arg);
static LogicalDecodingContext *setup_logical_decoding(Oid relid,
const char *slotname,
TupleDesc tupdesc);
@@ -154,6 +161,7 @@ static void apply_concurrent_delete(Relation rel, HeapTuple tup_target,
ConcurrentChange *change);
static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
HeapTuple tup_key,
+ Snapshot snapshot,
IndexInsertState *iistate,
TupleTableSlot *ident_slot,
IndexScanDesc *scan_p);
@@ -379,6 +387,8 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
const char *cmd_str = CLUSTER_COMMAND_STR(cmd);
bool concurrent = ((params->options & CLUOPT_CONCURRENT) != 0);
LOCKMODE lmode;
+ bool entered,
+ success;
/*
* Check that the correct lock is held. The lock mode is
@@ -536,23 +546,30 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
TransferPredicateLocksToHeapRelation(OldHeap);
/* rebuild_relation does all the dirty work */
+ entered = false;
+ success = false;
PG_TRY();
{
/*
- * For concurrent processing, make sure that our logical decoding
- * ignores data changes of other tables than the one we are
- * processing.
+ * For concurrent processing, make sure that
+ *
+ * 1) our logical decoding ignores data changes of other tables than
+ * the one we are processing.
+ *
+ * 2) other transactions treat this table as if it was a system / user
+ * catalog, and WAL the relevant additional information.
*/
if (concurrent)
- begin_concurrent_repack(OldHeap);
+ begin_concurrent_repack(OldHeap, &index, &entered);
rebuild_relation(OldHeap, index, verbose, cmd, concurrent,
save_userid);
+ success = true;
}
PG_FINALLY();
{
- if (concurrent)
- end_concurrent_repack();
+ if (concurrent && entered)
+ end_concurrent_repack(!success);
}
PG_END_TRY();
@@ -2162,6 +2179,47 @@ cluster_is_permitted_for_relation(Oid relid, Oid userid, ClusterCommand cmd)
#define REPL_PLUGIN_NAME "pgoutput_repack"
+/*
+ * Each relation being processed by REPACK CONCURRENTLY must be in the
+ * repackedRels hashtable.
+ */
+typedef struct RepackedRel
+{
+ Oid relid;
+ Oid dbid;
+} RepackedRel;
+
+static HTAB *RepackedRelsHash = NULL;
+
+/*
+ * Maximum number of entries in the hashtable.
+ *
+ * A replication slot is needed for each relation being processed, so use
+ * this GUC to size the hashtable.
+ */
+#define MAX_REPACKED_RELS (max_replication_slots)
+
+Size
+RepackShmemSize(void)
+{
+ return hash_estimate_size(MAX_REPACKED_RELS, sizeof(RepackedRel));
+}
+
+void
+RepackShmemInit(void)
+{
+ HASHCTL info;
+
+ info.keysize = sizeof(RepackedRel);
+ info.entrysize = info.keysize;
+
+ RepackedRelsHash = ShmemInitHash("Repacked Relations",
+ MAX_REPACKED_RELS,
+ MAX_REPACKED_RELS,
+ &info,
+ HASH_ELEM | HASH_BLOBS);
+}
+
/*
* Call this function before REPACK CONCURRENTLY starts to setup logical
* decoding. It makes sure that other users of the table put enough
@@ -2176,11 +2234,120 @@ cluster_is_permitted_for_relation(Oid relid, Oid userid, ClusterCommand cmd)
*
* Note that TOAST table needs no attention here as it's not scanned using
* historic snapshot.
+ *
+ * 'index_p' is an in/out argument because the function unlocks the index
+ * temporarily.
+ *
+ * 'enter_p' receives a bool value telling whether relation OID was entered
+ * into RepackedRelsHash or not.
*/
static void
-begin_concurrent_repack(Relation rel)
+begin_concurrent_repack(Relation rel, Relation *index_p, bool *entered_p)
{
- Oid toastrelid;
+ Oid relid,
+ toastrelid;
+ Relation index = NULL;
+ Oid indexid = InvalidOid;
+ RepackedRel key,
+ *entry;
+ bool found;
+ static bool before_shmem_exit_callback_setup = false;
+
+ relid = RelationGetRelid(rel);
+ index = index_p ? *index_p : NULL;
+
+ /*
+ * Make sure that we do not leave an entry in RepackedRelsHash if exiting
+ * due to FATAL.
+ */
+ if (!before_shmem_exit_callback_setup)
+ {
+ before_shmem_exit(cluster_before_shmem_exit_callback, 0);
+ before_shmem_exit_callback_setup = true;
+ }
+
+ memset(&key, 0, sizeof(key));
+ key.relid = relid;
+ key.dbid = MyDatabaseId;
+
+ *entered_p = false;
+ LWLockAcquire(RepackedRelsLock, LW_EXCLUSIVE);
+ entry = (RepackedRel *)
+ hash_search(RepackedRelsHash, &key, HASH_ENTER_NULL, &found);
+ if (found)
+ {
+ /*
+ * Since REPACK CONCURRENTLY takes ShareRowExclusiveLock, a conflict
+ * should occur much earlier. However, that lock may be released
+ * temporarily, see below. In any case, we should complain whatever the
+ * reason for the conflict might be.
+ */
+ ereport(ERROR,
+ (errmsg("relation \"%s\" is already being processed by REPACK CONCURRENTLY",
+ RelationGetRelationName(rel))));
+ }
+ if (entry == NULL)
+ ereport(ERROR,
+ (errmsg("too many requests for REPACK CONCURRENTLY at a time")),
+ (errhint("Please consider increasing the \"max_replication_slots\" configuration parameter.")));
+
+ /*
+ * Even if anything fails below, the caller has to do cleanup in the
+ * shared memory.
+ */
+ *entered_p = true;
+
+ /*
+ * Enable the callback to remove the entry in case of exit. We should not
+ * do this earlier, otherwise an attempt to insert an already existing entry
+ * could make us remove that entry (inserted by another backend) during
+ * ERROR handling.
+ */
+ Assert(!OidIsValid(repacked_rel));
+ repacked_rel = relid;
+
+ LWLockRelease(RepackedRelsLock);
+
+ /*
+ * Make sure that other backends are aware of the new hash entry as soon
+ * as they open our table.
+ */
+ CacheInvalidateRelcacheImmediate(relid);
+
+ /*
+ * Also make sure that the existing users of the table update their
+ * relcache entry as soon as they try to run DML commands on it.
+ *
+ * ShareLock is the weakest lock that conflicts with DMLs. If any backend
+ * has a weaker lock, we assume it'll accept our invalidation message when
+ * it changes the lock mode.
+ *
+ * Before upgrading the lock on the relation, close the index temporarily
+ * to avoid a deadlock if another backend running DML already has its lock
+ * (ShareLock) on the table and waits for the lock on the index.
+ */
+ if (index)
+ {
+ indexid = RelationGetRelid(index);
+ index_close(index, ShareUpdateExclusiveLock);
+ }
+ LockRelationOid(relid, ShareLock);
+ UnlockRelationOid(relid, ShareLock);
+ if (OidIsValid(indexid))
+ {
+ /*
+ * Re-open the index and check that it hasn't changed while unlocked.
+ */
+ check_index_is_clusterable(rel, indexid, ShareUpdateExclusiveLock,
+ CLUSTER_COMMAND_REPACK);
+
+ /*
+ * Return the new relcache entry to the caller. (It's been locked by
+ * the call above.)
+ */
+ index = index_open(indexid, NoLock);
+ *index_p = index;
+ }
/* Avoid logical decoding of other relations by this backend. */
repacked_rel_locator = rel->rd_locator;
@@ -2198,15 +2365,122 @@ begin_concurrent_repack(Relation rel)
/*
* Call this when done with REPACK CONCURRENTLY.
+ *
+ * 'error' tells whether the function is being called in order to handle an
+ * error.
*/
static void
-end_concurrent_repack(void)
+end_concurrent_repack(bool error)
{
+ RepackedRel key;
+ RepackedRel *entry = NULL;
+ Oid relid = repacked_rel;
+
+ /* Remove the relation from the hash if we managed to insert one. */
+ if (OidIsValid(repacked_rel))
+ {
+ memset(&key, 0, sizeof(key));
+ key.relid = repacked_rel;
+ key.dbid = MyDatabaseId;
+ LWLockAcquire(RepackedRelsLock, LW_EXCLUSIVE);
+ entry = hash_search(RepackedRelsHash, &key, HASH_REMOVE, NULL);
+ LWLockRelease(RepackedRelsLock);
+
+ /*
+ * Make others refresh their information whether they should still
+ * treat the table as catalog from the perspective of writing WAL.
+ *
+ * XXX Unlike entering the entry into the hashtable, we do not bother
+ * with locking and unlocking the table here:
+ *
+ * 1) On normal completion (and sometimes even on ERROR), the caller
+ * is already holding AccessExclusiveLock on the table, so there
+ * should be no relcache reference unaware of this change.
+ *
+ * 2) In the other cases, the worst scenario is that the other
+ * backends will write unnecessary information to WAL until they close
+ * the relation.
+ *
+ * Should we use ShareLock mode to fix 2) at least for the non-FATAL
+ * errors? (Our before_shmem_exit callback is in charge of FATAL, and
+ * that probably should not try to acquire any lock.)
+ */
+ CacheInvalidateRelcacheImmediate(repacked_rel);
+
+ /*
+ * By clearing this variable we also disable
+ * cluster_before_shmem_exit_callback().
+ */
+ repacked_rel = InvalidOid;
+ }
+
/*
* Restore normal function of (future) logical decoding for this backend.
*/
repacked_rel_locator.relNumber = InvalidOid;
repacked_rel_toast_locator.relNumber = InvalidOid;
+
+ /*
+ * On normal completion (!error), we should not really fail to remove the
+ * entry. But if it wasn't there for any reason, raise ERROR to make sure
+ * the transaction is aborted: if other transactions, while changing the
+ * contents of the relation, didn't know that REPACK CONCURRENTLY was in
+ * progress, they might have failed to WAL-log enough information, and thus
+ * we could have produced inconsistent table contents.
+ *
+ * On the other hand, if we are already handling an error, there's no
+ * reason to worry about inconsistent contents of the new storage because
+ * the transaction is going to be rolled back anyway. Furthermore, by
+ * raising ERROR here we'd shadow the original error.
+ */
+ if (!error)
+ {
+ char *relname;
+
+ if (OidIsValid(relid) && entry == NULL)
+ {
+ relname = get_rel_name(relid);
+ if (!relname)
+ ereport(ERROR,
+ (errmsg("cache lookup failed for relation %u",
+ relid)));
+
+ ereport(ERROR,
+ (errmsg("relation \"%s\" not found among repacked relations",
+ relname)));
+ }
+ }
+}
+
+/*
+ * A wrapper to call end_concurrent_repack() as a before_shmem_exit callback.
+ */
+static void
+cluster_before_shmem_exit_callback(int code, Datum arg)
+{
+ if (OidIsValid(repacked_rel))
+ end_concurrent_repack(true);
+}
+
+/*
+ * Check if relation is currently being processed by REPACK CONCURRENTLY.
+ */
+bool
+is_concurrent_repack_in_progress(Oid relid)
+{
+ RepackedRel key,
+ *entry;
+
+ memset(&key, 0, sizeof(key));
+ key.relid = relid;
+ key.dbid = MyDatabaseId;
+
+ LWLockAcquire(RepackedRelsLock, LW_SHARED);
+ entry = (RepackedRel *)
+ hash_search(RepackedRelsHash, &key, HASH_FIND, NULL);
+ LWLockRelease(RepackedRelsLock);
+
+ return entry != NULL;
}
/*
@@ -2268,6 +2542,9 @@ setup_logical_decoding(Oid relid, const char *slotname, TupleDesc tupdesc)
dstate->relid = relid;
dstate->tstore = tuplestore_begin_heap(false, false,
maintenance_work_mem);
+#ifdef USE_ASSERT_CHECKING
+ dstate->last_change_xid = InvalidTransactionId;
+#endif
dstate->tupdesc = tupdesc;
@@ -2415,6 +2692,7 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
char *change_raw,
*src;
ConcurrentChange change;
+ Snapshot snapshot;
bool isnull[1];
Datum values[1];
@@ -2483,8 +2761,30 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
/*
* Find the tuple to be updated or deleted.
+ *
+ * As the table being REPACKed concurrently is treated like a
+ * catalog, new CID is WAL-logged and decoded. And since we use
+ * the same XID that the original DMLs did, the snapshot used for
+ * the logical decoding (by now converted to a non-historic MVCC
+ * snapshot) should see the tuples inserted previously into the
+ * new heap and/or updated there.
+ */
+ snapshot = change.snapshot;
+
+ /*
+ * Set what should be considered current transaction (and
+ * subtransactions) during visibility check.
+ *
+ * Note that this snapshot was created from a historic snapshot
+ * using SnapBuildMVCCFromHistoric(), which does not touch
+ * 'subxip'. Thus, unlike in a regular MVCC snapshot, the array
+ * only contains the transactions whose data changes we are
+ * applying, and their subtransactions. That's exactly what we need
+ * to check whether a particular xact is a "current transaction".
*/
- tup_exist = find_target_tuple(rel, key, nkeys, tup_key,
+ SetRepackCurrentXids(snapshot->subxip, snapshot->subxcnt);
+
+ tup_exist = find_target_tuple(rel, key, nkeys, tup_key, snapshot,
iistate, ident_slot, &ind_scan);
if (tup_exist == NULL)
elog(ERROR, "Failed to find target tuple");
@@ -2495,6 +2795,8 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
else
apply_concurrent_delete(rel, tup_exist, &change);
+ ResetRepackCurrentXids();
+
if (tup_old != NULL)
{
pfree(tup_old);
@@ -2507,11 +2809,14 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
else
elog(ERROR, "Unrecognized kind of change: %d", change.kind);
- /* If there's any change, make it visible to the next iteration. */
- if (change.kind != CHANGE_UPDATE_OLD)
+ /* Free the snapshot if this is the last change that needed it. */
+ Assert(change.snapshot->active_count > 0);
+ change.snapshot->active_count--;
+ if (change.snapshot->active_count == 0)
{
- CommandCounterIncrement();
- UpdateActiveSnapshotCommandId();
+ if (change.snapshot == dstate->snapshot)
+ dstate->snapshot = NULL;
+ FreeSnapshot(change.snapshot);
}
/* TTSOpsMinimalTuple has .get_heap_tuple==NULL. */
@@ -2531,10 +2836,30 @@ static void
apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
IndexInsertState *iistate, TupleTableSlot *index_slot)
{
+ Snapshot snapshot = change->snapshot;
List *recheck;
+ /*
+ * For INSERT, the visibility information is not important, but we use the
+ * snapshot to get CID. Index functions might need the whole snapshot
+ * anyway.
+ */
+ SetRepackCurrentXids(snapshot->subxip, snapshot->subxcnt);
+
+ /*
+ * Write the tuple into the new heap.
+ *
+ * The snapshot is the one we used to decode the insert (though converted
+ * to "non-historic" MVCC snapshot), i.e. the snapshot's curcid is the
+ * tuple CID incremented by one (due to the "new CID" WAL record that got
+ * written along with the INSERT record). Thus if we want to use the
+ * original CID, we need to subtract 1 from curcid.
+ */
+ Assert(snapshot->curcid != InvalidCommandId &&
+ snapshot->curcid > FirstCommandId);
- simple_heap_insert(rel, tup);
+ heap_insert(rel, tup, change->xid, snapshot->curcid - 1,
+ HEAP_INSERT_NO_LOGICAL, NULL);
/*
* Update indexes.
@@ -2542,6 +2867,7 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
* In case functions in the index need the active snapshot and caller
* hasn't set one.
*/
+ PushActiveSnapshot(snapshot);
ExecStoreHeapTuple(tup, index_slot, false);
recheck = ExecInsertIndexTuples(iistate->rri,
index_slot,
@@ -2552,6 +2878,8 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
NIL, /* arbiterIndexes */
false /* onlySummarizing */
);
+ PopActiveSnapshot();
+ ResetRepackCurrentXids();
/*
* If recheck is required, it must have been performed on the source
@@ -2569,18 +2897,36 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
TupleTableSlot *index_slot)
{
List *recheck;
+ LockTupleMode lockmode;
TU_UpdateIndexes update_indexes;
+ TM_Result res;
+ Snapshot snapshot = change->snapshot;
+ TM_FailureData tmfd;
/*
* Write the new tuple into the new heap. ('tup' gets the TID assigned
* here.)
+ *
+ * Regarding CID, see the comment in apply_concurrent_insert().
*/
- simple_heap_update(rel, &tup_target->t_self, tup, &update_indexes);
+ Assert(snapshot->curcid != InvalidCommandId &&
+ snapshot->curcid > FirstCommandId);
+
+ res = heap_update(rel, &tup_target->t_self, tup,
+ change->xid, snapshot->curcid - 1,
+ InvalidSnapshot,
+ false, /* no wait - only we are doing changes */
+ &tmfd, &lockmode, &update_indexes,
+ /* wal_logical */
+ false);
+ if (res != TM_Ok)
+ ereport(ERROR, (errmsg("failed to apply concurrent UPDATE")));
ExecStoreHeapTuple(tup, index_slot, false);
if (update_indexes != TU_None)
{
+ PushActiveSnapshot(snapshot);
recheck = ExecInsertIndexTuples(iistate->rri,
index_slot,
iistate->estate,
@@ -2590,6 +2936,7 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
NIL, /* arbiterIndexes */
/* onlySummarizing */
update_indexes == TU_Summarizing);
+ PopActiveSnapshot();
list_free(recheck);
}
@@ -2600,7 +2947,22 @@ static void
apply_concurrent_delete(Relation rel, HeapTuple tup_target,
ConcurrentChange *change)
{
- simple_heap_delete(rel, &tup_target->t_self);
+ TM_Result res;
+ TM_FailureData tmfd;
+ Snapshot snapshot = change->snapshot;
+
+ /* Regarding CID, see the comment in apply_concurrent_insert(). */
+ Assert(snapshot->curcid != InvalidCommandId &&
+ snapshot->curcid > FirstCommandId);
+
+ res = heap_delete(rel, &tup_target->t_self, change->xid,
+ snapshot->curcid - 1, InvalidSnapshot, false,
+ &tmfd, false,
+ /* wal_logical */
+ false);
+
+ if (res != TM_Ok)
+ ereport(ERROR, (errmsg("failed to apply concurrent DELETE")));
pgstat_progress_incr_param(PROGRESS_REPACK_HEAP_TUPLES_DELETED, 1);
}
@@ -2618,7 +2980,7 @@ apply_concurrent_delete(Relation rel, HeapTuple tup_target,
*/
static HeapTuple
find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
- IndexInsertState *iistate,
+ Snapshot snapshot, IndexInsertState *iistate,
TupleTableSlot *ident_slot, IndexScanDesc *scan_p)
{
IndexScanDesc scan;
@@ -2627,7 +2989,7 @@ find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
HeapTuple result = NULL;
/* XXX no instrumentation for now */
- scan = index_beginscan(rel, iistate->ident_index, GetActiveSnapshot(),
+ scan = index_beginscan(rel, iistate->ident_index, snapshot,
NULL, nkeys, 0);
*scan_p = scan;
index_rescan(scan, key, nkeys, NULL, 0);
@@ -2699,6 +3061,8 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
}
PG_FINALLY();
{
+ ResetRepackCurrentXids();
+
if (rel_src)
rel_dst->rd_toastoid = InvalidOid;
}
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 00f7bbc5f59..25bb92b33f2 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -469,9 +469,18 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
SnapBuild *builder = ctx->snapshot_builder;
/*
- * Check if REPACK CONCURRENTLY is being performed by this backend. If so,
- * only decode data changes of the table that it is processing, and the
- * changes of its TOAST relation.
+ * If the change is not intended for logical decoding, do not even
+ * establish a transaction for it. This is particularly important if the
+ * record was generated by REPACK CONCURRENTLY because this command uses
+ * the original XID when doing changes in the new storage. The decoding
+ * system probably does not expect to see the same transaction multiple
+ * times.
+ */
+
+ /*
+ * First, check if REPACK CONCURRENTLY is being performed by this backend.
+ * If so, only decode data changes of the table that it is processing, and
+ * the changes of its TOAST relation.
*
* (TOAST locator should not be set unless the main one is.)
*/
@@ -491,6 +500,61 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
return;
}
+ /*
+ * Second, skip records which do not contain sufficient information for
+ * the decoding.
+ *
+ * One particular problem we solve here is that REPACK CONCURRENTLY
+ * generates WAL when doing changes in the new table. Those changes should
+ * not be decoded because reorderbuffer.c considers their XID already
+ * committed. (REPACK CONCURRENTLY deliberately generates WAL records in
+ * such a way that they are skipped here.)
+ */
+ switch (info)
+ {
+ case XLOG_HEAP_INSERT:
+ {
+ xl_heap_insert *rec;
+
+ rec = (xl_heap_insert *) XLogRecGetData(buf->record);
+
+ /*
+ * This does happen when 1) raw_heap_insert marks the TOAST
+ * record as HEAP_INSERT_NO_LOGICAL, or 2) REPACK CONCURRENTLY
+ * replays inserts performed by other backends.
+ */
+ if ((rec->flags & XLH_INSERT_CONTAINS_NEW_TUPLE) == 0)
+ return;
+
+ break;
+ }
+
+ case XLOG_HEAP_HOT_UPDATE:
+ case XLOG_HEAP_UPDATE:
+ {
+ xl_heap_update *rec;
+
+ rec = (xl_heap_update *) XLogRecGetData(buf->record);
+ if ((rec->flags &
+ (XLH_UPDATE_CONTAINS_NEW_TUPLE |
+ XLH_UPDATE_CONTAINS_OLD_TUPLE |
+ XLH_UPDATE_CONTAINS_OLD_KEY)) == 0)
+ return;
+
+ break;
+ }
+
+ case XLOG_HEAP_DELETE:
+ {
+ xl_heap_delete *rec;
+
+ rec = (xl_heap_delete *) XLogRecGetData(buf->record);
+ if (rec->flags & XLH_DELETE_NO_LOGICAL)
+ return;
+ break;
+ }
+ }
+
ReorderBufferProcessXid(ctx->reorder, xid, buf->origptr);
/*
@@ -923,13 +987,6 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
xlrec = (xl_heap_insert *) XLogRecGetData(r);
- /*
- * Ignore insert records without new tuples (this does happen when
- * raw_heap_insert marks the TOAST record as HEAP_INSERT_NO_LOGICAL).
- */
- if (!(xlrec->flags & XLH_INSERT_CONTAINS_NEW_TUPLE))
- return;
-
/* only interested in our database */
XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
if (target_locator.dbOid != ctx->slot->data.database)
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index c32e459411b..fde4955c328 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -155,7 +155,7 @@ static bool ExportInProgress = false;
static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
/* snapshot building/manipulation/distribution functions */
-static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
+static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder, XLogRecPtr lsn);
static void SnapBuildFreeSnapshot(Snapshot snap);
@@ -352,12 +352,17 @@ SnapBuildSnapDecRefcount(Snapshot snap)
* Build a new snapshot, based on currently committed catalog-modifying
* transactions.
*
+ * 'lsn' is the location of the commit record (of a catalog-changing
+ * transaction) that triggered creation of the snapshot. Pass
+ * InvalidXLogRecPtr for the transaction base snapshot or if the user of
+ * the snapshot should not need the LSN.
+ *
* In-progress transactions with catalog access are *not* allowed to modify
* these snapshots; they have to copy them and fill in appropriate ->curcid
* and ->subxip/subxcnt values.
*/
static Snapshot
-SnapBuildBuildSnapshot(SnapBuild *builder)
+SnapBuildBuildSnapshot(SnapBuild *builder, XLogRecPtr lsn)
{
Snapshot snapshot;
Size ssize;
@@ -425,6 +430,7 @@ SnapBuildBuildSnapshot(SnapBuild *builder)
snapshot->active_count = 0;
snapshot->regd_count = 0;
snapshot->snapXactCompletionCount = 0;
+ snapshot->lsn = lsn;
return snapshot;
}
@@ -461,7 +467,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
if (TransactionIdIsValid(MyProc->xmin))
elog(ERROR, "cannot build an initial slot snapshot when MyProc->xmin already is valid");
- snap = SnapBuildBuildSnapshot(builder);
+ snap = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
/*
* We know that snap->xmin is alive, enforced by the logical xmin
@@ -502,7 +508,7 @@ SnapBuildInitialSnapshotForRepack(SnapBuild *builder)
Assert(builder->state == SNAPBUILD_CONSISTENT);
- snap = SnapBuildBuildSnapshot(builder);
+ snap = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
return SnapBuildMVCCFromHistoric(snap, false);
}
@@ -636,7 +642,7 @@ SnapBuildGetOrBuildSnapshot(SnapBuild *builder)
/* only build a new snapshot if we don't have a prebuilt one */
if (builder->snapshot == NULL)
{
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
/* increase refcount for the snapshot builder */
SnapBuildSnapIncRefcount(builder->snapshot);
}
@@ -716,7 +722,7 @@ SnapBuildProcessChange(SnapBuild *builder, TransactionId xid, XLogRecPtr lsn)
/* only build a new snapshot if we don't have a prebuilt one */
if (builder->snapshot == NULL)
{
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, lsn);
/* increase refcount for the snapshot builder */
SnapBuildSnapIncRefcount(builder->snapshot);
}
@@ -1085,7 +1091,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
if (builder->snapshot)
SnapBuildSnapDecRefcount(builder->snapshot);
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, lsn);
/* we might need to execute invalidations, add snapshot */
if (!ReorderBufferXidHasBaseSnapshot(builder->reorder, xid))
@@ -1910,7 +1916,7 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
{
SnapBuildSnapDecRefcount(builder->snapshot);
}
- builder->snapshot = SnapBuildBuildSnapshot(builder);
+ builder->snapshot = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
SnapBuildSnapIncRefcount(builder->snapshot);
ReorderBufferSetRestartPoint(builder->reorder, lsn);
diff --git a/src/backend/replication/pgoutput_repack/pgoutput_repack.c b/src/backend/replication/pgoutput_repack/pgoutput_repack.c
index 687fbbc59bb..28bd16f9cc7 100644
--- a/src/backend/replication/pgoutput_repack/pgoutput_repack.c
+++ b/src/backend/replication/pgoutput_repack/pgoutput_repack.c
@@ -32,7 +32,8 @@ static void plugin_truncate(struct LogicalDecodingContext *ctx,
Relation relations[],
ReorderBufferChange *change);
static void store_change(LogicalDecodingContext *ctx,
- ConcurrentChangeKind kind, HeapTuple tuple);
+ ConcurrentChangeKind kind, HeapTuple tuple,
+ TransactionId xid);
void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
@@ -100,6 +101,7 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change)
{
RepackDecodingState *dstate;
+ Snapshot snapshot;
dstate = (RepackDecodingState *) ctx->output_writer_private;
@@ -107,6 +109,48 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (relation->rd_id != dstate->relid)
return;
+ /*
+ * Catalog snapshot is fine because the table we are processing is
+ * temporarily considered a user catalog table.
+ */
+ snapshot = GetCatalogSnapshot(InvalidOid);
+ Assert(snapshot->snapshot_type == SNAPSHOT_HISTORIC_MVCC);
+ Assert(!snapshot->suboverflowed);
+
+ /*
+ * This should not happen, but if we don't have enough information to
+ * apply a new snapshot, the consequences would be bad. Thus prefer ERROR
+ * to Assert().
+ */
+ if (XLogRecPtrIsInvalid(snapshot->lsn))
+ ereport(ERROR, (errmsg("snapshot has invalid LSN")));
+
+ /*
+ * reorderbuffer.c changes the catalog snapshot as soon as it sees a new
+ * CID or a commit record of a catalog-changing transaction.
+ */
+ if (dstate->snapshot == NULL || snapshot->lsn != dstate->snapshot_lsn ||
+ snapshot->curcid != dstate->snapshot->curcid)
+ {
+ /* CID should not go backwards. */
+ Assert(dstate->snapshot == NULL ||
+ snapshot->curcid >= dstate->snapshot->curcid ||
+ change->txn->xid != dstate->last_change_xid);
+
+ /*
+ * XXX Is it a problem that the copy is created in
+ * TopTransactionContext?
+ *
+ * XXX Wouldn't it be o.k. for SnapBuildMVCCFromHistoric() to set xcnt
+ * to 0 instead of converting xip in this case? The point is that
+ * transactions which are still in progress from the perspective of
+ * reorderbuffer.c could not be replayed yet, so we do not need to
+ * examine their XIDs.
+ */
+ dstate->snapshot = SnapBuildMVCCFromHistoric(snapshot, false);
+ dstate->snapshot_lsn = snapshot->lsn;
+ }
+
/* Decode entry depending on its type */
switch (change->action)
{
@@ -124,7 +168,7 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (newtuple == NULL)
elog(ERROR, "Incomplete insert info.");
- store_change(ctx, CHANGE_INSERT, newtuple);
+ store_change(ctx, CHANGE_INSERT, newtuple, change->txn->xid);
}
break;
case REORDER_BUFFER_CHANGE_UPDATE:
@@ -141,9 +185,11 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
elog(ERROR, "Incomplete update info.");
if (oldtuple != NULL)
- store_change(ctx, CHANGE_UPDATE_OLD, oldtuple);
+ store_change(ctx, CHANGE_UPDATE_OLD, oldtuple,
+ change->txn->xid);
- store_change(ctx, CHANGE_UPDATE_NEW, newtuple);
+ store_change(ctx, CHANGE_UPDATE_NEW, newtuple,
+ change->txn->xid);
}
break;
case REORDER_BUFFER_CHANGE_DELETE:
@@ -156,7 +202,7 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (oldtuple == NULL)
elog(ERROR, "Incomplete delete info.");
- store_change(ctx, CHANGE_DELETE, oldtuple);
+ store_change(ctx, CHANGE_DELETE, oldtuple, change->txn->xid);
}
break;
default:
@@ -190,13 +236,13 @@ plugin_truncate(struct LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (i == nrelations)
return;
- store_change(ctx, CHANGE_TRUNCATE, NULL);
+ store_change(ctx, CHANGE_TRUNCATE, NULL, InvalidTransactionId);
}
/* Store concurrent data change. */
static void
store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
- HeapTuple tuple)
+ HeapTuple tuple, TransactionId xid)
{
RepackDecodingState *dstate;
char *change_raw;
@@ -266,6 +312,11 @@ store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
dst = dst_start + SizeOfConcurrentChange;
memcpy(dst, tuple->t_data, tuple->t_len);
+ /* Initialize the other fields. */
+ change.xid = xid;
+ change.snapshot = dstate->snapshot;
+ dstate->snapshot->active_count++;
+
/* The data has been copied. */
if (flattened)
pfree(tuple);
@@ -279,6 +330,9 @@ store:
isnull[0] = false;
tuplestore_putvalues(dstate->tstore, dstate->tupdesc_change,
values, isnull);
+#ifdef USE_ASSERT_CHECKING
+ dstate->last_change_xid = xid;
+#endif
/* Accounting. */
dstate->nchanges++;
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index e9ddf39500c..e24e1795aa9 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -151,6 +151,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, InjectionPointShmemSize());
size = add_size(size, SlotSyncShmemSize());
size = add_size(size, AioShmemSize());
+ size = add_size(size, RepackShmemSize());
/* include additional requested shmem from preload libraries */
size = add_size(size, total_addin_request);
@@ -344,6 +345,7 @@ CreateOrAttachShmemStructs(void)
WaitEventCustomShmemInit();
InjectionPointShmemInit();
AioShmemInit();
+ RepackShmemInit();
}
/*
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 4eb67720737..14eda1c24ee 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -1633,6 +1633,27 @@ CacheInvalidateRelcache(Relation relation)
databaseId, relationId);
}
+/*
+ * CacheInvalidateRelcacheImmediate
+ * Send invalidation message for the specified relation's relcache entry.
+ *
+ * Currently this is used in REPACK CONCURRENTLY, to make sure that other
+ * backends are aware that the command is being executed for the relation.
+ */
+void
+CacheInvalidateRelcacheImmediate(Oid relid)
+{
+ SharedInvalidationMessage msg;
+
+ msg.rc.id = SHAREDINVALRELCACHE_ID;
+ msg.rc.dbId = MyDatabaseId;
+ msg.rc.relId = relid;
+ /* check AddCatcacheInvalidationMessage() for an explanation */
+ VALGRIND_MAKE_MEM_DEFINED(&msg, sizeof(msg));
+
+ SendSharedInvalidMessages(&msg, 1);
+}
+
/*
* CacheInvalidateRelcacheAll
* Register invalidation of the whole relcache at the end of command.
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index a495f22876d..679cc6be1d1 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -1253,6 +1253,10 @@ retry:
/* make sure relation is marked as having no open file yet */
relation->rd_smgr = NULL;
+ /* Is REPACK CONCURRENTLY in progress? */
+ relation->rd_repack_concurrent =
+ is_concurrent_repack_in_progress(targetRelId);
+
/*
* now we can free the memory allocated for pg_class_tuple
*/
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index bdeb2f83540..b0c6f1d916f 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -325,21 +325,24 @@ extern BulkInsertState GetBulkInsertState(void);
extern void FreeBulkInsertState(BulkInsertState);
extern void ReleaseBulkInsertStatePin(BulkInsertState bistate);
-extern void heap_insert(Relation relation, HeapTuple tup, CommandId cid,
- int options, BulkInsertState bistate);
+extern void heap_insert(Relation relation, HeapTuple tup, TransactionId xid,
+ CommandId cid, int options, BulkInsertState bistate);
extern void heap_multi_insert(Relation relation, struct TupleTableSlot **slots,
int ntuples, CommandId cid, int options,
BulkInsertState bistate);
extern TM_Result heap_delete(Relation relation, ItemPointer tid,
- CommandId cid, Snapshot crosscheck, bool wait,
- struct TM_FailureData *tmfd, bool changingPart);
+ TransactionId xid, CommandId cid,
+ Snapshot crosscheck, bool wait,
+ struct TM_FailureData *tmfd, bool changingPart,
+ bool wal_logical);
extern void heap_finish_speculative(Relation relation, ItemPointer tid);
extern void heap_abort_speculative(Relation relation, ItemPointer tid);
extern TM_Result heap_update(Relation relation, ItemPointer otid,
- HeapTuple newtup,
+ HeapTuple newtup, TransactionId xid,
CommandId cid, Snapshot crosscheck, bool wait,
struct TM_FailureData *tmfd, LockTupleMode *lockmode,
- TU_UpdateIndexes *update_indexes);
+ TU_UpdateIndexes *update_indexes,
+ bool wal_logical);
extern TM_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
CommandId cid, LockTupleMode mode, LockWaitPolicy wait_policy,
bool follow_updates,
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 277df6b3cf0..8d4af07f840 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -104,6 +104,8 @@
#define XLH_DELETE_CONTAINS_OLD_KEY (1<<2)
#define XLH_DELETE_IS_SUPER (1<<3)
#define XLH_DELETE_IS_PARTITION_MOVE (1<<4)
+/* See heap_delete() */
+#define XLH_DELETE_NO_LOGICAL (1<<5)
/* convenience macro for checking whether any form of old tuple was logged */
#define XLH_DELETE_CONTAINS_OLD \
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index b2bc10ee041..fbb66d559b6 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -482,6 +482,8 @@ extern Size EstimateTransactionStateSpace(void);
extern void SerializeTransactionState(Size maxsize, char *start_address);
extern void StartParallelWorkerTransaction(char *tstatespace);
extern void EndParallelWorkerTransaction(void);
+extern void SetRepackCurrentXids(TransactionId *xip, int xcnt);
+extern void ResetRepackCurrentXids(void);
extern bool IsTransactionBlock(void);
extern bool IsTransactionOrTransactionBlock(void);
extern char TransactionBlockStatusCode(void);
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 569cc2184b3..ab1d9fc25dc 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -73,6 +73,14 @@ typedef struct ConcurrentChange
/* See the enum above. */
ConcurrentChangeKind kind;
+ /* Transaction that changes the data. */
+ TransactionId xid;
+
+ /*
+ * Historic catalog snapshot that was used to decode this change.
+ */
+ Snapshot snapshot;
+
/*
* The actual tuple.
*
@@ -104,6 +112,8 @@ typedef struct RepackDecodingState
* tuplestore does this transparently.
*/
Tuplestorestate *tstore;
+ /* XID of the last change added to tstore. */
+ TransactionId last_change_xid PG_USED_FOR_ASSERTS_ONLY;
/* The current number of changes in tstore. */
double nchanges;
@@ -124,6 +134,14 @@ typedef struct RepackDecodingState
/* Slot to retrieve data from tstore. */
TupleTableSlot *tsslot;
+ /*
+ * Historic catalog snapshot that was used to decode the most recent
+ * change.
+ */
+ Snapshot snapshot;
+ /* LSN of the record */
+ XLogRecPtr snapshot_lsn;
+
ResourceOwner resowner;
} RepackDecodingState;
@@ -148,5 +166,9 @@ extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
MultiXactId cutoffMulti,
char newrelpersistence);
+extern Size RepackShmemSize(void);
+extern void RepackShmemInit(void);
+extern bool is_concurrent_repack_in_progress(Oid relid);
+
extern void repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel);
#endif /* CLUSTER_H */
diff --git a/src/include/utils/inval.h b/src/include/utils/inval.h
index 9b871caef62..ae9dee394dc 100644
--- a/src/include/utils/inval.h
+++ b/src/include/utils/inval.h
@@ -50,6 +50,8 @@ extern void CacheInvalidateCatalog(Oid catalogId);
extern void CacheInvalidateRelcache(Relation relation);
+extern void CacheInvalidateRelcacheImmediate(Oid relid);
+
extern void CacheInvalidateRelcacheAll(void);
extern void CacheInvalidateRelcacheByTuple(HeapTuple classTuple);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index d94fddd7cef..372065fc570 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -253,6 +253,9 @@ typedef struct RelationData
bool pgstat_enabled; /* should relation stats be counted */
/* use "struct" here to avoid needing to include pgstat.h: */
struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+ /* Is REPACK CONCURRENTLY being performed on this relation? */
+ bool rd_repack_concurrent;
} RelationData;
@@ -692,7 +695,9 @@ RelationCloseSmgr(Relation relation)
#define RelationIsAccessibleInLogicalDecoding(relation) \
(XLogLogicalInfoActive() && \
RelationNeedsWAL(relation) && \
- (IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation)))
+ (IsCatalogRelation(relation) || \
+ RelationIsUsedAsCatalogTable(relation) || \
+ (relation)->rd_repack_concurrent))
/*
* RelationIsLogicallyLogged
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 0e546ec1497..014f27db7d7 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -13,6 +13,7 @@
#ifndef SNAPSHOT_H
#define SNAPSHOT_H
+#include "access/xlogdefs.h"
#include "lib/pairingheap.h"
@@ -201,6 +202,8 @@ typedef struct SnapshotData
uint32 regd_count; /* refcount on RegisteredSnapshots */
pairingheap_node ph_node; /* link in the RegisteredSnapshots heap */
+ XLogRecPtr lsn; /* position in the WAL stream when taken */
+
/*
* The transaction completion count at the time GetSnapshotData() built
* this snapshot. Allows to avoid re-computing static snapshots when no
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e89db0a2ee7..e1e3e619c4b 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2510,6 +2510,7 @@ ReorderBufferTupleCidKey
ReorderBufferUpdateProgressTxnCB
ReorderTuple
RepOriginId
+RepackedRel
RepackDecodingState
RepackStmt
ReparameterizeForeignPathByChild_function
--
2.43.5
v12-0006-Add-regression-tests.patch (text/x-diff)
From 5f61e2c87a5a70f100d4f12ff45e8c3fe629ba3e Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Tue, 1 Apr 2025 13:48:57 +0200
Subject: [PATCH 6/9] Add regression tests.
As this patch series adds the CONCURRENTLY option to the REPACK command, it's
appropriate to test that the "concurrent data changes" (i.e. changes made by the
application while we are copying the table contents to the new storage) are
processed correctly.
Injection points are used to stop the data copying at some point. While the
backend in charge of the copying is waiting on the injection point, another
backend runs some INSERT, UPDATE and DELETE commands on the table. Then we
wake up the first backend and let the REPACK CONCURRENTLY command
finish. Finally, we check that all the "concurrent data changes" are present in
the table and that they contain the correct visibility information.
---
src/backend/commands/cluster.c | 7 +
src/test/modules/injection_points/Makefile | 3 +-
.../injection_points/expected/repack.out | 113 ++++++++++++++
.../modules/injection_points/logical.conf | 1 +
src/test/modules/injection_points/meson.build | 4 +
.../injection_points/specs/repack.spec | 140 ++++++++++++++++++
6 files changed, 267 insertions(+), 1 deletion(-)
create mode 100644 src/test/modules/injection_points/expected/repack.out
create mode 100644 src/test/modules/injection_points/logical.conf
create mode 100644 src/test/modules/injection_points/specs/repack.spec
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index e30c1a78e9c..3895013ca9d 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -59,6 +59,7 @@
#include "utils/formatting.h"
#include "utils/fmgroids.h"
#include "utils/guc.h"
+#include "utils/injection_point.h"
#include "utils/inval.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
@@ -3286,6 +3287,12 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
*/
ident_key = build_identity_key(ident_idx_new, OldHeap, &ident_key_nentries);
+ /*
+ * During testing, wait for another backend to perform concurrent data
+ * changes which we will process below.
+ */
+ INJECTION_POINT("repack-concurrently-before-lock");
+
/*
* Flush all WAL records inserted so far (possibly except for the last
* incomplete page, see GetInsertRecPtr), to minimize the amount of data
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index e680991f8d4..405d0811b4f 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -14,7 +14,8 @@ PGFILEDESC = "injection_points - facility for injection points"
REGRESS = injection_points hashagg reindex_conc
REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
-ISOLATION = basic inplace syscache-update-pruned
+ISOLATION = basic inplace syscache-update-pruned repack
+ISOLATION_OPTS = --temp-config $(top_srcdir)/src/test/modules/injection_points/logical.conf
TAP_TESTS = 1
diff --git a/src/test/modules/injection_points/expected/repack.out b/src/test/modules/injection_points/expected/repack.out
new file mode 100644
index 00000000000..49a736ed617
--- /dev/null
+++ b/src/test/modules/injection_points/expected/repack.out
@@ -0,0 +1,113 @@
+Parsed test spec with 2 sessions
+
+starting permutation: wait_before_lock change_existing change_new change_subxact1 change_subxact2 check2 wakeup_before_lock check1
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step wait_before_lock:
+ REPACK CONCURRENTLY repack_test USING INDEX repack_test_pkey;
+ <waiting ...>
+step change_existing:
+ UPDATE repack_test SET i=10 where i=1;
+ UPDATE repack_test SET j=20 where i=2;
+ UPDATE repack_test SET i=30 where i=3;
+ UPDATE repack_test SET i=40 where i=30;
+ DELETE FROM repack_test WHERE i=4;
+
+step change_new:
+ INSERT INTO repack_test(i, j) VALUES (5, 5), (6, 6), (7, 7), (8, 8);
+ UPDATE repack_test SET i=50 where i=5;
+ UPDATE repack_test SET j=60 where i=6;
+ DELETE FROM repack_test WHERE i=7;
+
+step change_subxact1:
+ BEGIN;
+ INSERT INTO repack_test(i, j) VALUES (100, 100);
+ SAVEPOINT s1;
+ UPDATE repack_test SET i=101 where i=100;
+ SAVEPOINT s2;
+ UPDATE repack_test SET i=102 where i=101;
+ COMMIT;
+
+step change_subxact2:
+ BEGIN;
+ SAVEPOINT s1;
+ INSERT INTO repack_test(i, j) VALUES (110, 110);
+ ROLLBACK TO SAVEPOINT s1;
+ INSERT INTO repack_test(i, j) VALUES (110, 111);
+ COMMIT;
+
+step check2:
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='repack_test';
+
+ SELECT i, j FROM repack_test ORDER BY i, j;
+
+ INSERT INTO data_s2(_xmin, _cmin, i, j)
+ SELECT xmin, cmin, i, j FROM repack_test;
+
+ i| j
+---+---
+ 2| 20
+ 6| 60
+ 8| 8
+ 10| 1
+ 40| 3
+ 50| 5
+102|100
+110|111
+(8 rows)
+
+step wakeup_before_lock:
+ SELECT injection_points_wakeup('repack-concurrently-before-lock');
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step wait_before_lock: <... completed>
+step check1:
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='repack_test';
+
+ SELECT count(DISTINCT node) FROM relfilenodes;
+
+ SELECT i, j FROM repack_test ORDER BY i, j;
+
+ INSERT INTO data_s1(_xmin, _cmin, i, j)
+ SELECT xmin, cmin, i, j FROM repack_test;
+
+ SELECT count(*)
+ FROM data_s1 d1 FULL JOIN data_s2 d2 USING (_xmin, _cmin, i, j)
+ WHERE d1.i ISNULL OR d2.i ISNULL;
+
+count
+-----
+ 2
+(1 row)
+
+ i| j
+---+---
+ 2| 20
+ 6| 60
+ 8| 8
+ 10| 1
+ 40| 3
+ 50| 5
+102|100
+110|111
+(8 rows)
+
+count
+-----
+ 0
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
diff --git a/src/test/modules/injection_points/logical.conf b/src/test/modules/injection_points/logical.conf
new file mode 100644
index 00000000000..c8f264bc6cb
--- /dev/null
+++ b/src/test/modules/injection_points/logical.conf
@@ -0,0 +1 @@
+wal_level = logical
\ No newline at end of file
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index d61149712fd..0e3c47ba999 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -46,9 +46,13 @@ tests += {
'specs': [
'basic',
'inplace',
+ 'repack',
'syscache-update-pruned',
],
'runningcheck': false, # see syscache-update-pruned
+ # 'repack' requires wal_level = 'logical'.
+ 'regress_args': ['--temp-config', files('logical.conf')],
+
},
'tap': {
'env': {
diff --git a/src/test/modules/injection_points/specs/repack.spec b/src/test/modules/injection_points/specs/repack.spec
new file mode 100644
index 00000000000..5aa8983f98d
--- /dev/null
+++ b/src/test/modules/injection_points/specs/repack.spec
@@ -0,0 +1,140 @@
+# Prefix the system columns with underscore as they are not allowed as column
+# names.
+setup
+{
+ CREATE EXTENSION injection_points;
+
+ CREATE TABLE repack_test(i int PRIMARY KEY, j int);
+ INSERT INTO repack_test(i, j) VALUES (1, 1), (2, 2), (3, 3), (4, 4);
+
+ CREATE TABLE relfilenodes(node oid);
+
+ CREATE TABLE data_s1(_xmin xid, _cmin cid, i int, j int);
+ CREATE TABLE data_s2(_xmin xid, _cmin cid, i int, j int);
+}
+
+teardown
+{
+ DROP TABLE repack_test;
+ DROP EXTENSION injection_points;
+
+ DROP TABLE relfilenodes;
+ DROP TABLE data_s1;
+ DROP TABLE data_s2;
+}
+
+session s1
+setup
+{
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('repack-concurrently-before-lock', 'wait');
+}
+# Perform the initial load and wait for s2 to do some data changes.
+step wait_before_lock
+{
+ REPACK CONCURRENTLY repack_test USING INDEX repack_test_pkey;
+}
+# Check the table from the perspective of s1.
+#
+# Besides the contents, we also check that relfilenode has changed.
+#
+# xmin and cmin columns are used to check that we do not change tuple
+# visibility information. Since we do not expect xmin to stay unchanged across
+# test runs, it cannot appear in the output text. Instead, have each session
+# write the contents into a table and use FULL JOIN to check if the outputs
+# are identical.
+step check1
+{
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='repack_test';
+
+ SELECT count(DISTINCT node) FROM relfilenodes;
+
+ SELECT i, j FROM repack_test ORDER BY i, j;
+
+ INSERT INTO data_s1(_xmin, _cmin, i, j)
+ SELECT xmin, cmin, i, j FROM repack_test;
+
+ SELECT count(*)
+ FROM data_s1 d1 FULL JOIN data_s2 d2 USING (_xmin, _cmin, i, j)
+ WHERE d1.i ISNULL OR d2.i ISNULL;
+}
+teardown
+{
+ SELECT injection_points_detach('repack-concurrently-before-lock');
+}
+
+session s2
+# Change the existing data. UPDATE changes both key and non-key columns. Also
+# update one row twice to test whether tuple version generated by this session
+# can be found.
+step change_existing
+{
+ UPDATE repack_test SET i=10 where i=1;
+ UPDATE repack_test SET j=20 where i=2;
+ UPDATE repack_test SET i=30 where i=3;
+ UPDATE repack_test SET i=40 where i=30;
+ DELETE FROM repack_test WHERE i=4;
+}
+# Insert new rows and UPDATE / DELETE some of them. Again, update both key and
+# non-key column.
+step change_new
+{
+ INSERT INTO repack_test(i, j) VALUES (5, 5), (6, 6), (7, 7), (8, 8);
+ UPDATE repack_test SET i=50 where i=5;
+ UPDATE repack_test SET j=60 where i=6;
+ DELETE FROM repack_test WHERE i=7;
+}
+
+# When applying concurrent data changes, we should see the effects of an
+# in-progress subtransaction.
+step change_subxact1
+{
+ BEGIN;
+ INSERT INTO repack_test(i, j) VALUES (100, 100);
+ SAVEPOINT s1;
+ UPDATE repack_test SET i=101 where i=100;
+ SAVEPOINT s2;
+ UPDATE repack_test SET i=102 where i=101;
+ COMMIT;
+}
+
+# When applying concurrent data changes, we should not see the effects of a
+# rolled back subtransaction.
+step change_subxact2
+{
+ BEGIN;
+ SAVEPOINT s1;
+ INSERT INTO repack_test(i, j) VALUES (110, 110);
+ ROLLBACK TO SAVEPOINT s1;
+ INSERT INTO repack_test(i, j) VALUES (110, 111);
+ COMMIT;
+}
+
+# Check the table from the perspective of s2.
+step check2
+{
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='repack_test';
+
+ SELECT i, j FROM repack_test ORDER BY i, j;
+
+ INSERT INTO data_s2(_xmin, _cmin, i, j)
+ SELECT xmin, cmin, i, j FROM repack_test;
+}
+step wakeup_before_lock
+{
+ SELECT injection_points_wakeup('repack-concurrently-before-lock');
+}
+
+# Test if data changes introduced while one session is performing REPACK
+# CONCURRENTLY find their way into the table.
+permutation
+ wait_before_lock
+ change_existing
+ change_new
+ change_subxact1
+ change_subxact2
+ check2
+ wakeup_before_lock
+ check1
--
2.43.5
v12-0007-Introduce-repack_max_xlock_time-configuration-variab.patch (text/x-diff)
From 15c63e21fa5fc312c0e9a40211f781555445c06f Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Tue, 1 Apr 2025 13:48:57 +0200
Subject: [PATCH 7/9] Introduce repack_max_xlock_time configuration variable.
When executing REPACK CONCURRENTLY, we need an AccessExclusiveLock to swap
the relation files, which should only take a short time. However, on a
busy system, other backends might change a non-negligible amount of data in the
table while we are waiting for the lock. Since these changes must be applied
to the new storage before the swap, the time we eventually hold the lock might
become non-negligible too.
If the user is worried about this situation, they can set repack_max_xlock_time
to the maximum time for which the exclusive lock may be held. If this amount
of time is not sufficient to complete the REPACK CONCURRENTLY command, an ERROR
is raised and the command is canceled.
---
doc/src/sgml/config.sgml | 31 ++++
doc/src/sgml/ref/repack.sgml | 9 +-
src/backend/access/heap/heapam_handler.c | 3 +-
src/backend/commands/cluster.c | 135 +++++++++++++++---
src/backend/utils/misc/guc_tables.c | 15 +-
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/commands/cluster.h | 5 +-
.../injection_points/expected/repack.out | 74 +++++++++-
.../injection_points/specs/repack.spec | 42 ++++++
9 files changed, 294 insertions(+), 21 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index fea683cb49c..d1fc39b98be 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -11190,6 +11190,37 @@ dynamic_library_path = '/usr/local/lib/postgresql:$libdir'
</listitem>
</varlistentry>
+ <varlistentry id="guc-repack-max-xclock-time" xreflabel="repack_max_xlock_time">
+ <term><varname>repack_max_xlock_time</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>repack_max_xlock_time</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ This is the maximum amount of time to hold an exclusive lock on a
+ table by <command>REPACK</command> with
+ the <literal>CONCURRENTLY</literal> option. Typically, these commands
+ should not need the lock for a longer time
+ than <command>TRUNCATE</command> does. However, additional time might
+ be needed if the system is very busy. (See <xref linkend="sql-repack"/>
+ for an explanation of how the <literal>CONCURRENTLY</literal> option works.)
+ </para>
+
+ <para>
+ If you want to limit the lock time, set this variable to the
+ highest acceptable value. If it turns out during processing that
+ the lock cannot be released within this time, the command will be
+ cancelled.
+ </para>
+
+ <para>
+ The default value is 0, which means that the lock is not released
+ until the concurrent data changes are processed.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect1>
diff --git a/doc/src/sgml/ref/repack.sgml b/doc/src/sgml/ref/repack.sgml
index 9ee640e3517..0c250689d13 100644
--- a/doc/src/sgml/ref/repack.sgml
+++ b/doc/src/sgml/ref/repack.sgml
@@ -188,7 +188,14 @@ REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] CONCU
(<xref linkend="logicaldecoding"/>) and applied before
the <literal>ACCESS EXCLUSIVE</literal> lock is requested. Thus the lock
is typically held only for the time needed to swap the files, which
- should be pretty short.
+ should be pretty short. However, the time might still be noticeable if
too many data changes were made to the table while
+ <command>REPACK</command> was waiting for the lock: those changes must
+ be processed just before the files are swapped, while the
+ <literal>ACCESS EXCLUSIVE</literal> lock is being held. If you are
+ worried about this situation, set
+ the <link linkend="guc-repack-max-xclock-time"><varname>repack_max_xlock_time</varname></link>
+ configuration parameter to a value that your applications can tolerate.
</para>
<para>
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index ea1d6f299b3..850708c7830 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1008,7 +1008,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
end_of_wal = GetFlushRecPtr(NULL);
if ((end_of_wal - end_of_wal_prev) > wal_segment_size)
{
- repack_decode_concurrent_changes(decoding_ctx, end_of_wal);
+ repack_decode_concurrent_changes(decoding_ctx, end_of_wal,
+ NULL);
end_of_wal_prev = end_of_wal;
}
}
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 3895013ca9d..085649716ac 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -17,6 +17,8 @@
*/
#include "postgres.h"
+#include <sys/time.h>
+
#include "access/amapi.h"
#include "access/heapam.h"
#include "access/multixact.h"
@@ -103,6 +105,15 @@ static Oid repacked_rel = InvalidOid;
RelFileLocator repacked_rel_locator = {.relNumber = InvalidOid};
RelFileLocator repacked_rel_toast_locator = {.relNumber = InvalidOid};
+/*
+ * The maximum time to hold AccessExclusiveLock during the final
+ * processing. Note that only the execution time of
+ * process_concurrent_changes() is included here. The very last steps like
+ * swap_relation_files() shouldn't get blocked and it'd be wrong to consider
+ * them a reason to abort otherwise completed processing.
+ */
+int repack_max_xlock_time = 0;
+
/*
* Everything we need to call ExecInsertIndexTuples().
*/
@@ -149,7 +160,8 @@ static LogicalDecodingContext *setup_logical_decoding(Oid relid,
static HeapTuple get_changed_tuple(char *change);
static void apply_concurrent_changes(RepackDecodingState *dstate,
Relation rel, ScanKey key, int nkeys,
- IndexInsertState *iistate);
+ IndexInsertState *iistate,
+ struct timeval *must_complete);
static void apply_concurrent_insert(Relation rel, ConcurrentChange *change,
HeapTuple tup, IndexInsertState *iistate,
TupleTableSlot *index_slot);
@@ -166,13 +178,15 @@ static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
IndexInsertState *iistate,
TupleTableSlot *ident_slot,
IndexScanDesc *scan_p);
-static void process_concurrent_changes(LogicalDecodingContext *ctx,
+static bool process_concurrent_changes(LogicalDecodingContext *ctx,
XLogRecPtr end_of_wal,
Relation rel_dst,
Relation rel_src,
ScanKey ident_key,
int ident_key_nentries,
- IndexInsertState *iistate);
+ IndexInsertState *iistate,
+ struct timeval *must_complete);
+static bool processing_time_elapsed(struct timeval *must_complete);
static IndexInsertState *get_index_insert_state(Relation relation,
Oid ident_index_id);
static ScanKey build_identity_key(Oid ident_idx_oid, Relation rel_src,
@@ -2598,7 +2612,8 @@ get_changed_tuple(char *change)
*/
void
repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
- XLogRecPtr end_of_wal)
+ XLogRecPtr end_of_wal,
+ struct timeval *must_complete)
{
RepackDecodingState *dstate;
ResourceOwner resowner_old;
@@ -2628,6 +2643,9 @@ repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
if (record != NULL)
LogicalDecodingProcessRecord(ctx, ctx->reader);
+ if (processing_time_elapsed(must_complete))
+ break;
+
/*
* If WAL segment boundary has been crossed, inform the decoding
* system that the catalog_xmin can advance. (We can confirm more
@@ -2668,7 +2686,8 @@ repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
*/
static void
apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
- ScanKey key, int nkeys, IndexInsertState *iistate)
+ ScanKey key, int nkeys, IndexInsertState *iistate,
+ struct timeval *must_complete)
{
TupleTableSlot *index_slot,
*ident_slot;
@@ -2699,6 +2718,9 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
CHECK_FOR_INTERRUPTS();
+ Assert(dstate->nchanges > 0);
+ dstate->nchanges--;
+
/* Get the change from the single-column tuple. */
tup_change = ExecFetchSlotHeapTuple(dstate->tsslot, false, &shouldFree);
heap_deform_tuple(tup_change, dstate->tupdesc_change, values, isnull);
@@ -2823,10 +2845,22 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
/* TTSOpsMinimalTuple has .get_heap_tuple==NULL. */
Assert(shouldFree);
pfree(tup_change);
+
+ /*
+ * If there is a limit on the time of completion, check it now.
+ * However, make sure the loop does not break if tup_old was set in
+ * the previous iteration. In such a case we could not resume the
+ * processing in the next call.
+ */
+ if (must_complete && tup_old == NULL &&
+ processing_time_elapsed(must_complete))
+ /* The next call will process the remaining changes. */
+ break;
}
- tuplestore_clear(dstate->tstore);
- dstate->nchanges = 0;
+ /* If we could not apply all the changes, the next call will do. */
+ if (dstate->nchanges == 0)
+ tuplestore_clear(dstate->tstore);
/* Cleanup. */
ExecDropSingleTupleTableSlot(index_slot);
@@ -3030,11 +3064,15 @@ find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
* Decode and apply concurrent changes.
*
* Pass rel_src iff its reltoastrelid is needed.
+ *
+ * Returns true if must_complete is NULL or if managed to complete by the time
+ * *must_complete indicates.
*/
-static void
+static bool
process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
Relation rel_dst, Relation rel_src, ScanKey ident_key,
- int ident_key_nentries, IndexInsertState *iistate)
+ int ident_key_nentries, IndexInsertState *iistate,
+ struct timeval *must_complete)
{
RepackDecodingState *dstate;
@@ -3043,10 +3081,19 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
dstate = (RepackDecodingState *) ctx->output_writer_private;
- repack_decode_concurrent_changes(ctx, end_of_wal);
+ repack_decode_concurrent_changes(ctx, end_of_wal, must_complete);
+ if (processing_time_elapsed(must_complete))
+ /* Caller is responsible for applying the changes. */
+ return false;
+
+ /*
+ * The deadline was not reached, so if no changes were decoded, there
+ * really are none. (Otherwise, seeing no changes could merely mean that
+ * not enough time was left for the decoding.)
+ */
if (dstate->nchanges == 0)
- return;
+ return true;
PG_TRY();
{
@@ -3058,7 +3105,7 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
rel_dst->rd_toastoid = rel_src->rd_rel->reltoastrelid;
apply_concurrent_changes(dstate, rel_dst, ident_key,
- ident_key_nentries, iistate);
+ ident_key_nentries, iistate, must_complete);
}
PG_FINALLY();
{
@@ -3068,6 +3115,28 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
rel_dst->rd_toastoid = InvalidOid;
}
PG_END_TRY();
+
+ /*
+ * apply_concurrent_changes() does check the processing time, so if some
+ * changes are left, we ran out of time.
+ */
+ return dstate->nchanges == 0;
+}
+
+/*
+ * Check if the current time is beyond *must_complete.
+ */
+static bool
+processing_time_elapsed(struct timeval *must_complete)
+{
+ struct timeval now;
+
+ if (must_complete == NULL)
+ return false;
+
+ gettimeofday(&now, NULL);
+
+ return timercmp(&now, must_complete, >);
}
static IndexInsertState *
@@ -3229,6 +3298,8 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
Relation *ind_refs,
*ind_refs_p;
int nind;
+ struct timeval t_end;
+ struct timeval *t_end_ptr = NULL;
/* Like in cluster_rel(). */
lockmode_old = ShareUpdateExclusiveLock;
@@ -3309,7 +3380,8 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
*/
process_concurrent_changes(ctx, end_of_wal, NewHeap,
swap_toast_by_content ? OldHeap : NULL,
- ident_key, ident_key_nentries, iistate);
+ ident_key, ident_key_nentries, iistate,
+ NULL);
/*
* Acquire AccessExclusiveLock on the table, its TOAST relation (if there
@@ -3397,9 +3469,40 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
end_of_wal = GetFlushRecPtr(NULL);
/* Apply the concurrent changes again. */
- process_concurrent_changes(ctx, end_of_wal, NewHeap,
- swap_toast_by_content ? OldHeap : NULL,
- ident_key, ident_key_nentries, iistate);
+
+ /*
+ * This time we have the exclusive lock on the table, so make sure that
+ * repack_max_xlock_time is not exceeded.
+ */
+ if (repack_max_xlock_time > 0)
+ {
+ int64 usec;
+ struct timeval t_start;
+
+ gettimeofday(&t_start, NULL);
+ /* Add the whole seconds. */
+ t_end.tv_sec = t_start.tv_sec + repack_max_xlock_time / 1000;
+ /* Add the rest, expressed in microseconds. */
+ usec = t_start.tv_usec + 1000 * (repack_max_xlock_time % 1000);
+ /* The number of microseconds could have overflown. */
+ t_end.tv_sec += usec / USECS_PER_SEC;
+ t_end.tv_usec = usec % USECS_PER_SEC;
+ t_end_ptr = &t_end;
+ }
+
+ /*
+ * During testing, stop here to simulate excessive processing time.
+ */
+ INJECTION_POINT("repack-concurrently-after-lock");
+
+ if (!process_concurrent_changes(ctx, end_of_wal, NewHeap,
+ swap_toast_by_content ? OldHeap : NULL,
+ ident_key, ident_key_nentries, iistate,
+ t_end_ptr))
+ ereport(ERROR,
+ (errmsg("could not process concurrent data changes in time"),
+ errhint("Consider increasing \"repack_max_xlock_time\".")));
+
/* Remember info about rel before closing OldHeap */
relpersistence = OldHeap->rd_rel->relpersistence;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 4eaeca89f2c..533fd40d383 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -39,8 +39,9 @@
#include "catalog/namespace.h"
#include "catalog/storage.h"
#include "commands/async.h"
-#include "commands/extension.h"
+#include "commands/cluster.h"
#include "commands/event_trigger.h"
+#include "commands/extension.h"
#include "commands/tablespace.h"
#include "commands/trigger.h"
#include "commands/user.h"
@@ -2827,6 +2828,18 @@ struct config_int ConfigureNamesInt[] =
1600000000, 0, 2100000000,
NULL, NULL, NULL
},
+ {
+ {"repack_max_xlock_time", PGC_USERSET, LOCK_MANAGEMENT,
+ gettext_noop("Maximum time for REPACK CONCURRENTLY to keep the table locked."),
+ gettext_noop("The table is locked in exclusive mode during the final stage of processing. "
+ "If the lock time exceeds this value, an error is raised and the lock is "
+ "released. Set to zero if you don't care how long the lock can be held."),
+ GUC_UNIT_MS
+ },
+ &repack_max_xlock_time,
+ 0, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
/*
* See also CheckRequiredParameterValues() if this parameter changes
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index ff56a1f0732..bc0217161ec 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -763,6 +763,7 @@ autovacuum_worker_slots = 16 # autovacuum worker slots to allocate
#lock_timeout = 0 # in milliseconds, 0 is disabled
#idle_in_transaction_session_timeout = 0 # in milliseconds, 0 is disabled
#idle_session_timeout = 0 # in milliseconds, 0 is disabled
+#repack_max_xlock_time = 0 # in milliseconds, 0 is disabled
#bytea_output = 'hex' # hex, escape
#xmlbinary = 'base64'
#xmloption = 'content'
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index ab1d9fc25dc..be283c70fce 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -59,6 +59,8 @@ typedef enum ClusterCommand
extern RelFileLocator repacked_rel_locator;
extern RelFileLocator repacked_rel_toast_locator;
+extern PGDLLIMPORT int repack_max_xlock_time;
+
typedef enum
{
CHANGE_INSERT,
@@ -153,7 +155,8 @@ extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid,
ClusterCommand cmd);
extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
extern void repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
- XLogRecPtr end_of_wal);
+ XLogRecPtr end_of_wal,
+ struct timeval *must_complete);
extern Oid make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
char relpersistence, LOCKMODE lockmode);
extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
diff --git a/src/test/modules/injection_points/expected/repack.out b/src/test/modules/injection_points/expected/repack.out
index 49a736ed617..f2728d94222 100644
--- a/src/test/modules/injection_points/expected/repack.out
+++ b/src/test/modules/injection_points/expected/repack.out
@@ -1,4 +1,4 @@
-Parsed test spec with 2 sessions
+Parsed test spec with 4 sessions
starting permutation: wait_before_lock change_existing change_new change_subxact1 change_subxact2 check2 wakeup_before_lock check1
injection_points_attach
@@ -111,3 +111,75 @@ injection_points_detach
(1 row)
+injection_points_detach
+-----------------------
+
+(1 row)
+
+
+starting permutation: wait_after_lock wakeup_after_lock
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step wait_after_lock:
+ REPACK CONCURRENTLY repack_test USING INDEX repack_test_pkey;
+ <waiting ...>
+step wakeup_after_lock:
+ SELECT injection_points_wakeup('repack-concurrently-after-lock');
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step wait_after_lock: <... completed>
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+
+starting permutation: wait_after_lock after_lock_delay wakeup_after_lock
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step wait_after_lock:
+ REPACK CONCURRENTLY repack_test USING INDEX repack_test_pkey;
+ <waiting ...>
+step after_lock_delay:
+ SELECT pg_sleep(1.5);
+
+pg_sleep
+--------
+
+(1 row)
+
+step wakeup_after_lock:
+ SELECT injection_points_wakeup('repack-concurrently-after-lock');
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step wait_after_lock: <... completed>
+ERROR: could not process concurrent data changes in time
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
diff --git a/src/test/modules/injection_points/specs/repack.spec b/src/test/modules/injection_points/specs/repack.spec
index 5aa8983f98d..0f45f9d2544 100644
--- a/src/test/modules/injection_points/specs/repack.spec
+++ b/src/test/modules/injection_points/specs/repack.spec
@@ -127,6 +127,34 @@ step wakeup_before_lock
SELECT injection_points_wakeup('repack-concurrently-before-lock');
}
+session s3
+setup
+{
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('repack-concurrently-after-lock', 'wait');
+ SET repack_max_xlock_time TO '1s';
+}
+# Perform the initial load, lock the table in exclusive mode and wait. s4 will
+# cancel the waiting.
+step wait_after_lock
+{
+ REPACK CONCURRENTLY repack_test USING INDEX repack_test_pkey;
+}
+teardown
+{
+ SELECT injection_points_detach('repack-concurrently-after-lock');
+}
+
+session s4
+step wakeup_after_lock
+{
+ SELECT injection_points_wakeup('repack-concurrently-after-lock');
+}
+step after_lock_delay
+{
+ SELECT pg_sleep(1.5);
+}
+
# Test if data changes introduced while one session is performing REPACK
# CONCURRENTLY find their way into the table.
permutation
@@ -138,3 +166,17 @@ permutation
check2
wakeup_before_lock
check1
+
+# Test the repack_max_xlock_time configuration variable.
+#
+# First, cancel waiting on the injection point immediately. That way, REPACK
+# should complete.
+permutation
+ wait_after_lock
+ wakeup_after_lock
+# Second, cancel the waiting with a delay that violates
+# repack_max_xlock_time.
+permutation
+ wait_after_lock
+ after_lock_delay
+ wakeup_after_lock
--
2.43.5
v12-0008-Enable-logical-decoding-transiently-only-for-REPACK-.patch
From 6e2f23e1f8c22b03bbfb19fbc1ae510286cb9801 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Tue, 1 Apr 2025 13:48:57 +0200
Subject: [PATCH 8/9] Enable logical decoding transiently, only for REPACK
CONCURRENTLY.
As REPACK CONCURRENTLY uses logical decoding, it requires wal_level to be set
to 'logical', while 'replica' is the default value. If logical replication is
not used, users will probably be reluctant to set the GUC to 'logical' because
it can affect server performance (by writing additional information to WAL)
and because it cannot be changed to 'logical' only for the time REPACK
CONCURRENTLY is running: a change of this GUC requires a server restart to
take effect.
This patch teaches the postgres backend to recognize whether it should
consider wal_level='logical' "locally" for a particular transaction, even if
the wal_level GUC is actually set to 'replica'. It also ensures that the
logical decoding specific information is added to WAL only for the tables that
are currently being processed by REPACK CONCURRENTLY.
If logical decoding is enabled this way, only temporary replication slots
should be created. The problem with a permanent slot is that it is restored
during server restart, and the restore fails if wal_level is not "globally"
'logical'.
There is independent work in progress to enable logical decoding transiently
[1]. ISTM that this is too "heavyweight" a solution for our problem. And I
think that these two approaches are not mutually exclusive: once [1] is
committed, we only need to adjust the XLogLogicalInfoActive() macro.
[1] https://www.postgresql.org/message-id/CAD21AoCVLeLYq09pQPaWs%2BJwdni5FuJ8v2jgq-u9_uFbcp6UbA%40mail.gmail.com
---
src/backend/access/transam/parallel.c | 8 ++
src/backend/access/transam/xact.c | 106 +++++++++++++++---
src/backend/access/transam/xlog.c | 1 +
src/backend/commands/cluster.c | 94 +++++++++++++---
src/backend/replication/logical/logical.c | 9 +-
src/backend/storage/ipc/standby.c | 4 +-
src/include/access/xlog.h | 15 ++-
src/include/commands/cluster.h | 1 +
src/include/utils/rel.h | 6 +-
src/test/modules/injection_points/Makefile | 1 -
.../modules/injection_points/logical.conf | 1 -
src/test/modules/injection_points/meson.build | 3 -
src/tools/pgindent/typedefs.list | 1 +
13 files changed, 206 insertions(+), 44 deletions(-)
delete mode 100644 src/test/modules/injection_points/logical.conf
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 94db1ec3012..a33318ea7bd 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -98,6 +98,7 @@ typedef struct FixedParallelState
TimestampTz xact_ts;
TimestampTz stmt_ts;
SerializableXactHandle serializable_xact_handle;
+ int wal_level_transient;
/* Mutex protects remaining fields. */
slock_t mutex;
@@ -355,6 +356,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
fps->xact_ts = GetCurrentTransactionStartTimestamp();
fps->stmt_ts = GetCurrentStatementStartTimestamp();
fps->serializable_xact_handle = ShareSerializableXact();
+ fps->wal_level_transient = wal_level_transient;
SpinLockInit(&fps->mutex);
fps->last_xlog_end = 0;
shm_toc_insert(pcxt->toc, PARALLEL_KEY_FIXED, fps);
@@ -1550,6 +1552,12 @@ ParallelWorkerMain(Datum main_arg)
/* Attach to the leader's serializable transaction, if SERIALIZABLE. */
AttachSerializableXact(fps->serializable_xact_handle);
+ /*
+ * Restore the information about whether this worker should behave as if
+ * wal_level was WAL_LEVEL_LOGICAL.
+ */
+ wal_level_transient = fps->wal_level_transient;
+
/*
* We've initialized all of our state now; nothing should change
* hereafter.
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 3db4cac030e..608dc5c79bb 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -36,6 +36,7 @@
#include "catalog/pg_enum.h"
#include "catalog/storage.h"
#include "commands/async.h"
+#include "commands/cluster.h"
#include "commands/tablecmds.h"
#include "commands/trigger.h"
#include "common/pg_prng.h"
@@ -138,6 +139,12 @@ static TransactionId *ParallelCurrentXids;
static int nRepackCurrentXids = 0;
static TransactionId *RepackCurrentXids = NULL;
+/*
+ * Have we determined the value of wal_level_transient for the current
+ * transaction?
+ */
+static bool wal_level_transient_checked = false;
+
/*
* Miscellaneous flag bits to record events which occur on the top level
* transaction. These flags are only persisted in MyXactFlags and are intended
@@ -650,6 +657,7 @@ AssignTransactionId(TransactionState s)
bool isSubXact = (s->parent != NULL);
ResourceOwner currentOwner;
bool log_unknown_top = false;
+ bool set_wal_level_transient = false;
/* Assert that caller didn't screw up */
Assert(!FullTransactionIdIsValid(s->fullTransactionId));
@@ -664,6 +672,32 @@ AssignTransactionId(TransactionState s)
(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
errmsg("cannot assign transaction IDs during a parallel operation")));
+ /*
+ * The first call (i.e. the first write) in the transaction tree
+ * determines whether the whole transaction assumes logical decoding or
+ * not.
+ */
+ if (!wal_level_transient_checked)
+ {
+ Assert(wal_level_transient == WAL_LEVEL_MINIMAL);
+
+ /*
+ * Do not repeat the check when calling this function for parent
+ * transactions.
+ */
+ wal_level_transient_checked = true;
+
+ /*
+ * Remember that the actual check is needed. We cannot do it until the
+ * top-level transaction has its XID assigned, see comments below.
+ *
+ * There is no use case for overriding MINIMAL, and LOGICAL cannot be
+ * overridden as such.
+ */
+ if (wal_level == WAL_LEVEL_REPLICA)
+ set_wal_level_transient = true;
+ }
+
/*
* Ensure parent(s) have XIDs, so that a child always has an XID later
* than its parent. Mustn't recurse here, or we might get a stack
@@ -693,20 +727,6 @@ AssignTransactionId(TransactionState s)
pfree(parents);
}
- /*
- * When wal_level=logical, guarantee that a subtransaction's xid can only
- * be seen in the WAL stream if its toplevel xid has been logged before.
- * If necessary we log an xact_assignment record with fewer than
- * PGPROC_MAX_CACHED_SUBXIDS. Note that it is fine if didLogXid isn't set
- * for a transaction even though it appears in a WAL record, we just might
- * superfluously log something. That can happen when an xid is included
- * somewhere inside a wal record, but not in XLogRecord->xl_xid, like in
- * xl_standby_locks.
- */
- if (isSubXact && XLogLogicalInfoActive() &&
- !TopTransactionStateData.didLogXid)
- log_unknown_top = true;
-
/*
* Generate a new FullTransactionId and record its xid in PGPROC and
* pg_subtrans.
@@ -731,6 +751,54 @@ AssignTransactionId(TransactionState s)
if (!isSubXact)
RegisterPredicateLockingXid(XidFromFullTransactionId(s->fullTransactionId));
+ /*
+ * Check if this transaction should consider wal_level=logical.
+ *
+ * Sometimes we need to turn on the logical decoding transiently although
+ * wal_level=WAL_LEVEL_REPLICA. Currently we do so when at least one table
+ * is being clustered concurrently, i.e. when we should assume that
+ * changes done by this transaction will be decoded. In such a case we
+ * adjust the value of XLogLogicalInfoActive() by setting
+ * wal_level_transient to LOGICAL.
+ *
+ * It's important not to do this check until the XID of the top-level
+ * transaction is in ProcGlobal: if the decoding becomes mandatory right
+ * after the check, our transaction will fail to write the necessary
+ * information to WAL. However, if the top-level transaction is already in
+ * ProcGlobal, its XID is guaranteed to appear in the xl_running_xacts
+ * record and therefore the snapshot builder will not try to decode the
+ * transaction (because it assumes it could have missed the initial part
+ * of the transaction).
+ *
+ * On the other hand, if the decoding became mandatory between the actual
+ * XID assignment and now, the transaction will WAL-log the decoding-specific
+ * information unnecessarily. Let's assume that such race conditions do
+ * not happen too often.
+ */
+ if (set_wal_level_transient)
+ {
+ /*
+ * Check for the operation that enables the logical decoding
+ * transiently.
+ */
+ if (is_concurrent_repack_in_progress(InvalidOid))
+ wal_level_transient = WAL_LEVEL_LOGICAL;
+ }
+
+ /*
+ * When wal_level=logical, guarantee that a subtransaction's xid can only
+ * be seen in the WAL stream if its toplevel xid has been logged before.
+ * If necessary we log an xact_assignment record with fewer than
+ * PGPROC_MAX_CACHED_SUBXIDS. Note that it is fine if didLogXid isn't set
+ * for a transaction even though it appears in a WAL record, we just might
+ * superfluously log something. That can happen when an xid is included
+ * somewhere inside a wal record, but not in XLogRecord->xl_xid, like in
+ * xl_standby_locks.
+ */
+ if (isSubXact && XLogLogicalInfoActive() &&
+ !TopTransactionStateData.didLogXid)
+ log_unknown_top = true;
+
/*
* Acquire lock on the transaction XID. (We assume this cannot block.) We
* have to ensure that the lock is assigned to the transaction's own
@@ -2245,6 +2313,16 @@ StartTransaction(void)
if (TransactionTimeout > 0)
enable_timeout_after(TRANSACTION_TIMEOUT, TransactionTimeout);
+ /*
+ * wal_level_transient can override wal_level for individual transactions,
+ * which effectively enables logical decoding for them. At the moment we
+ * don't know if this transaction will write any data changes to be
+ * decoded. Should it do, AssignTransactionId() will check if the decoding
+ * needs to be considered.
+ */
+ wal_level_transient = WAL_LEVEL_MINIMAL;
+ wal_level_transient_checked = false;
+
ShowTransactionState("StartTransaction");
}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fc30a52d496..ba758deefb4 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -129,6 +129,7 @@ bool wal_recycle = true;
bool log_checkpoints = true;
int wal_sync_method = DEFAULT_WAL_SYNC_METHOD;
int wal_level = WAL_LEVEL_REPLICA;
+int wal_level_transient = WAL_LEVEL_MINIMAL;
int CommitDelay = 0; /* precommit delay in microseconds */
int CommitSiblings = 5; /* # concurrent xacts needed to sleep */
int wal_retrieve_retry_interval = 5000;
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 085649716ac..c2201b046bc 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -2204,7 +2204,16 @@ typedef struct RepackedRel
Oid dbid;
} RepackedRel;
-static HTAB *RepackedRelsHash = NULL;
+typedef struct RepackedRels
+{
+ /* Hashtable of RepackedRel elements. */
+ HTAB *hashtable;
+
+ /* The number of elements in the hashtable. */
+ pg_atomic_uint32 nrels;
+} RepackedRels;
+
+static RepackedRels *repackedRels = NULL;
/*
* Maximum number of entries in the hashtable.
@@ -2217,22 +2226,38 @@ static HTAB *RepackedRelsHash = NULL;
Size
RepackShmemSize(void)
{
- return hash_estimate_size(MAX_REPACKED_RELS, sizeof(RepackedRel));
+ Size result;
+
+ result = sizeof(RepackedRels);
+
+ result += hash_estimate_size(MAX_REPACKED_RELS, sizeof(RepackedRel));
+ return result;
}
void
RepackShmemInit(void)
{
+ bool found;
HASHCTL info;
+ repackedRels = ShmemInitStruct("Repacked Relations",
+ sizeof(RepackedRels),
+ &found);
+ if (!IsUnderPostmaster)
+ {
+ Assert(!found);
+ pg_atomic_init_u32(&repackedRels->nrels, 0);
+ }
+ else
+ Assert(found);
+
info.keysize = sizeof(RepackedRel);
info.entrysize = info.keysize;
-
- RepackedRelsHash = ShmemInitHash("Repacked Relations",
- MAX_REPACKED_RELS,
- MAX_REPACKED_RELS,
- &info,
- HASH_ELEM | HASH_BLOBS);
+ repackedRels->hashtable = ShmemInitHash("Repacked Relations Hash",
+ MAX_REPACKED_RELS,
+ MAX_REPACKED_RELS,
+ &info,
+ HASH_ELEM | HASH_BLOBS);
}
/*
@@ -2267,13 +2292,14 @@ begin_concurrent_repack(Relation rel, Relation *index_p, bool *entered_p)
*entry;
bool found;
static bool before_shmem_exit_callback_setup = false;
+ uint32 nrels PG_USED_FOR_ASSERTS_ONLY;
relid = RelationGetRelid(rel);
index = index_p ? *index_p : NULL;
/*
- * Make sure that we do not leave an entry in RepackedRelsHash if exiting
- * due to FATAL.
+ * Make sure that we do not leave an entry in repackedRels->hashtable if
+ * exiting due to FATAL.
*/
if (!before_shmem_exit_callback_setup)
{
@@ -2288,7 +2314,7 @@ begin_concurrent_repack(Relation rel, Relation *index_p, bool *entered_p)
*entered_p = false;
LWLockAcquire(RepackedRelsLock, LW_EXCLUSIVE);
entry = (RepackedRel *)
- hash_search(RepackedRelsHash, &key, HASH_ENTER_NULL, &found);
+ hash_search(repackedRels->hashtable, &key, HASH_ENTER_NULL, &found);
if (found)
{
/*
@@ -2306,9 +2332,13 @@ begin_concurrent_repack(Relation rel, Relation *index_p, bool *entered_p)
(errmsg("too many requests for REPACK CONCURRENTLY at a time")),
(errhint("Please consider increasing the \"max_replication_slots\" configuration parameter.")));
+ /* Increment the number of relations. */
+ nrels = pg_atomic_fetch_add_u32(&repackedRels->nrels, 1);
+ Assert(nrels < MAX_REPACKED_RELS);
+
/*
- * Even if anything fails below, the caller has to do cleanup in the
- * shared memory.
+ * Even if the insertion of the TOAST relid fails below, the caller still
+ * has to do cleanup.
*/
*entered_p = true;
@@ -2390,6 +2420,7 @@ end_concurrent_repack(bool error)
RepackedRel key;
RepackedRel *entry = NULL;
Oid relid = repacked_rel;
+ uint32 nrels PG_USED_FOR_ASSERTS_ONLY;
/* Remove the relation from the hash if we managed to insert one. */
if (OidIsValid(repacked_rel))
@@ -2398,7 +2429,8 @@ end_concurrent_repack(bool error)
key.relid = repacked_rel;
key.dbid = MyDatabaseId;
LWLockAcquire(RepackedRelsLock, LW_EXCLUSIVE);
- entry = hash_search(RepackedRelsHash, &key, HASH_REMOVE, NULL);
+ entry = hash_search(repackedRels->hashtable, &key, HASH_REMOVE,
+ NULL);
LWLockRelease(RepackedRelsLock);
/*
@@ -2427,6 +2459,10 @@ end_concurrent_repack(bool error)
* cluster_before_shmem_exit_callback().
*/
repacked_rel = InvalidOid;
+
+ /* Decrement the number of relations. */
+ nrels = pg_atomic_fetch_sub_u32(&repackedRels->nrels, 1);
+ Assert(nrels > 0);
}
/*
@@ -2479,6 +2515,8 @@ cluster_before_shmem_exit_callback(int code, Datum arg)
/*
* Check if relation is currently being processed by REPACK CONCURRENTLY.
+ *
+ * If relid is InvalidOid, check if any relation is being processed.
*/
bool
is_concurrent_repack_in_progress(Oid relid)
@@ -2486,18 +2524,40 @@ is_concurrent_repack_in_progress(Oid relid)
RepackedRel key,
*entry;
+ /*
+ * If the caller is interested in whether any relation is being repacked,
+ * just use the counter.
+ */
+ if (!OidIsValid(relid))
+ {
+ if (pg_atomic_read_u32(&repackedRels->nrels) > 0)
+ return true;
+ else
+ return false;
+ }
+
+ /* For a particular relation we need to search the hashtable. */
memset(&key, 0, sizeof(key));
key.relid = relid;
key.dbid = MyDatabaseId;
LWLockAcquire(RepackedRelsLock, LW_SHARED);
entry = (RepackedRel *)
- hash_search(RepackedRelsHash, &key, HASH_FIND, NULL);
+ hash_search(repackedRels->hashtable, &key, HASH_FIND, NULL);
LWLockRelease(RepackedRelsLock);
return entry != NULL;
}
+/*
+ * Is this backend performing REPACK CONCURRENTLY?
+ */
+bool
+is_concurrent_repack_run_by_me(void)
+{
+ return OidIsValid(repacked_rel);
+}
+
/*
* This function is much like pg_create_logical_replication_slot() except that
* the new slot is neither released (if anyone else could read changes from
@@ -2525,8 +2585,8 @@ setup_logical_decoding(Oid relid, const char *slotname, TupleDesc tupdesc)
* useful for us.
*
* Regarding the value of need_full_snapshot, we pass false because the
- * table we are processing is present in RepackedRelsHash and therefore,
- * regarding logical decoding, treated like a catalog.
+ * table we are processing is present in repackedRels->hashtable and
+ * therefore, regarding logical decoding, treated like a catalog.
*/
ctx = CreateInitDecodingContext(REPL_PLUGIN_NAME,
NIL,
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index a8d2e024d34..4909432d585 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -30,6 +30,7 @@
#include "access/xact.h"
#include "access/xlogutils.h"
+#include "commands/cluster.h"
#include "fmgr.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -112,10 +113,12 @@ CheckLogicalDecodingRequirements(void)
/*
* NB: Adding a new requirement likely means that RestoreSlotFromDisk()
- * needs the same check.
+ * needs the same check. (Except that only temporary slots should be
+ * created for REPACK CONCURRENTLY, which effectively raises wal_level to
+ * LOGICAL.)
*/
-
- if (wal_level < WAL_LEVEL_LOGICAL)
+ if ((wal_level < WAL_LEVEL_LOGICAL && !is_concurrent_repack_run_by_me())
+ || wal_level < WAL_LEVEL_REPLICA)
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("logical decoding requires \"wal_level\" >= \"logical\"")));
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 5acb4508f85..413bcc1addb 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -1313,13 +1313,13 @@ LogStandbySnapshot(void)
* record. Fortunately this routine isn't executed frequently, and it's
* only a shared lock.
*/
- if (wal_level < WAL_LEVEL_LOGICAL)
+ if (!XLogLogicalInfoActive())
LWLockRelease(ProcArrayLock);
recptr = LogCurrentRunningXacts(running);
/* Release lock if we kept it longer ... */
- if (wal_level >= WAL_LEVEL_LOGICAL)
+ if (XLogLogicalInfoActive())
LWLockRelease(ProcArrayLock);
/* GetRunningTransactionData() acquired XidGenLock, we must release it */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index d313099c027..a325bb1d16b 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -95,6 +95,12 @@ typedef enum RecoveryState
extern PGDLLIMPORT int wal_level;
+/*
+ * wal_level_transient overrides wal_level if logical decoding needs to be
+ * enabled transiently.
+ */
+extern PGDLLIMPORT int wal_level_transient;
+
/* Is WAL archiving enabled (always or only while server is running normally)? */
#define XLogArchivingActive() \
(AssertMacro(XLogArchiveMode == ARCHIVE_MODE_OFF || wal_level >= WAL_LEVEL_REPLICA), XLogArchiveMode > ARCHIVE_MODE_OFF)
@@ -122,8 +128,13 @@ extern PGDLLIMPORT int wal_level;
/* Do we need to WAL-log information required only for Hot Standby and logical replication? */
#define XLogStandbyInfoActive() (wal_level >= WAL_LEVEL_REPLICA)
-/* Do we need to WAL-log information required only for logical replication? */
-#define XLogLogicalInfoActive() (wal_level >= WAL_LEVEL_LOGICAL)
+/*
+ * Do we need to WAL-log information required only for logical replication?
+ *
+ * wal_level_transient overrides wal_level if logical decoding needs to be
+ * active transiently.
+ */
+#define XLogLogicalInfoActive() (Max(wal_level, wal_level_transient) == WAL_LEVEL_LOGICAL)
#ifdef WAL_DEBUG
extern PGDLLIMPORT bool XLOG_DEBUG;
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index be283c70fce..0267357a261 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -172,6 +172,7 @@ extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
extern Size RepackShmemSize(void);
extern void RepackShmemInit(void);
extern bool is_concurrent_repack_in_progress(Oid relid);
+extern bool is_concurrent_repack_run_by_me(void);
extern void repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel);
#endif /* CLUSTER_H */
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 372065fc570..fcbad5c1720 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -710,12 +710,16 @@ RelationCloseSmgr(Relation relation)
* it would complicate decoding slightly for little gain). Note that we *do*
* log information for user defined catalog tables since they presumably are
* interesting to the user...
+ *
+ * Logical decoding can be active for particular relations even if wal_level
+ * is REPLICA. Do not log other relations in that case.
*/
#define RelationIsLogicallyLogged(relation) \
(XLogLogicalInfoActive() && \
RelationNeedsWAL(relation) && \
(relation)->rd_rel->relkind != RELKIND_FOREIGN_TABLE && \
- !IsCatalogRelation(relation))
+ !IsCatalogRelation(relation) && \
+ (wal_level == WAL_LEVEL_LOGICAL || (relation)->rd_repack_concurrent))
/* routines in utils/cache/relcache.c */
extern void RelationIncrementReferenceCount(Relation rel);
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 405d0811b4f..4f6c0ca3a8a 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -15,7 +15,6 @@ REGRESS = injection_points hashagg reindex_conc
REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
ISOLATION = basic inplace syscache-update-pruned repack
-ISOLATION_OPTS = --temp-config $(top_srcdir)/src/test/modules/injection_points/logical.conf
TAP_TESTS = 1
diff --git a/src/test/modules/injection_points/logical.conf b/src/test/modules/injection_points/logical.conf
deleted file mode 100644
index c8f264bc6cb..00000000000
--- a/src/test/modules/injection_points/logical.conf
+++ /dev/null
@@ -1 +0,0 @@
-wal_level = logical
\ No newline at end of file
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 0e3c47ba999..716e5619aa7 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -50,9 +50,6 @@ tests += {
'syscache-update-pruned',
],
'runningcheck': false, # see syscache-update-pruned
- # 'repack' requires wal_level = 'logical'.
- 'regress_args': ['--temp-config', files('logical.conf')],
-
},
'tap': {
'env': {
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e1e3e619c4b..b3be8572132 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2511,6 +2511,7 @@ ReorderBufferUpdateProgressTxnCB
ReorderTuple
RepOriginId
RepackedRel
+RepackedRels
RepackDecodingState
RepackStmt
ReparameterizeForeignPathByChild_function
--
2.43.5
v12-0009-Call-logical_rewrite_heap_tuple-when-applying-concur.patch
From 30837045beccac12ed48d2b60ed3fef35367fb3c Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Tue, 1 Apr 2025 13:48:57 +0200
Subject: [PATCH 9/9] Call logical_rewrite_heap_tuple() when applying
concurrent data changes.
This was implemented for the sake of completeness, but I think it's currently
not needed. Possible use cases could be:
1. REPACK CONCURRENTLY can process system catalogs.
System catalogs are scanned using a historic snapshot during logical decoding,
and the "combo CIDs" information is needed for that. Since "combo CID" is
associated with the "file locator" and that locator is changed by REPACK, this
command must record the information on individual tuples being moved from the
old file to the new one. This is what logical_rewrite_heap_tuple() does.
However, the logical decoding subsystem currently does not support decoding of
data changes in the system catalog. Therefore, the CONCURRENTLY option cannot
be used for system catalogs.
2. REPACK CONCURRENTLY is processing a relation, but once it has released all
the locks (in order to get the exclusive lock), another backend runs REPACK
CONCURRENTLY on the same table. Since the relation is treated as a system
catalog while these commands are processing it (so it can be scanned using a
historic snapshot during the "initial load"), it is important that the 2nd
backend does not break decoding of the "combo CIDs" performed by the 1st
backend.
However, it's not practical to let multiple backends run REPACK CONCURRENTLY
on the same relation, so we forbid that.
---
src/backend/access/heap/heapam_handler.c | 2 +-
src/backend/access/heap/rewriteheap.c | 65 +++++-----
src/backend/commands/cluster.c | 113 +++++++++++++++---
src/backend/replication/logical/decode.c | 42 ++++++-
.../pgoutput_repack/pgoutput_repack.c | 21 ++--
src/include/access/rewriteheap.h | 5 +-
src/include/commands/cluster.h | 3 +
src/include/replication/reorderbuffer.h | 7 ++
8 files changed, 198 insertions(+), 60 deletions(-)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 850708c7830..d7b0edc3bf8 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -734,7 +734,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
/* Initialize the rewrite operation */
rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, *xid_cutoff,
- *multi_cutoff);
+ *multi_cutoff, true);
/* Set up sorting if wanted */
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 6aa2ed214f2..83076b582d7 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -214,10 +214,8 @@ static void raw_heap_insert(RewriteState state, HeapTuple tup);
/* internal logical remapping prototypes */
static void logical_begin_heap_rewrite(RewriteState state);
-static void logical_rewrite_heap_tuple(RewriteState state, ItemPointerData old_tid, HeapTuple new_tuple);
static void logical_end_heap_rewrite(RewriteState state);
-
/*
* Begin a rewrite of a table
*
@@ -226,18 +224,19 @@ static void logical_end_heap_rewrite(RewriteState state);
* oldest_xmin xid used by the caller to determine which tuples are dead
* freeze_xid xid before which tuples will be frozen
* cutoff_multi multixact before which multis will be removed
+ * tid_chains need to maintain TID chains?
*
* Returns an opaque RewriteState, allocated in current memory context,
* to be used in subsequent calls to the other functions.
*/
RewriteState
begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xmin,
- TransactionId freeze_xid, MultiXactId cutoff_multi)
+ TransactionId freeze_xid, MultiXactId cutoff_multi,
+ bool tid_chains)
{
RewriteState state;
MemoryContext rw_cxt;
MemoryContext old_cxt;
- HASHCTL hash_ctl;
/*
* To ease cleanup, make a separate context that will contain the
@@ -262,29 +261,34 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
state->rs_cxt = rw_cxt;
state->rs_bulkstate = smgr_bulk_start_rel(new_heap, MAIN_FORKNUM);
- /* Initialize hash tables used to track update chains */
- hash_ctl.keysize = sizeof(TidHashKey);
- hash_ctl.entrysize = sizeof(UnresolvedTupData);
- hash_ctl.hcxt = state->rs_cxt;
-
- state->rs_unresolved_tups =
- hash_create("Rewrite / Unresolved ctids",
- 128, /* arbitrary initial size */
- &hash_ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
- hash_ctl.entrysize = sizeof(OldToNewMappingData);
+ if (tid_chains)
+ {
+ HASHCTL hash_ctl;
+
+ /* Initialize hash tables used to track update chains */
+ hash_ctl.keysize = sizeof(TidHashKey);
+ hash_ctl.entrysize = sizeof(UnresolvedTupData);
+ hash_ctl.hcxt = state->rs_cxt;
+
+ state->rs_unresolved_tups =
+ hash_create("Rewrite / Unresolved ctids",
+ 128, /* arbitrary initial size */
+ &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+ hash_ctl.entrysize = sizeof(OldToNewMappingData);
+
+ state->rs_old_new_tid_map =
+ hash_create("Rewrite / Old to new tid map",
+ 128, /* arbitrary initial size */
+ &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ }
- state->rs_old_new_tid_map =
- hash_create("Rewrite / Old to new tid map",
- 128, /* arbitrary initial size */
- &hash_ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ logical_begin_heap_rewrite(state);
MemoryContextSwitchTo(old_cxt);
- logical_begin_heap_rewrite(state);
-
return state;
}
@@ -303,12 +307,15 @@ end_heap_rewrite(RewriteState state)
* Write any remaining tuples in the UnresolvedTups table. If we have any
* left, they should in fact be dead, but let's err on the safe side.
*/
- hash_seq_init(&seq_status, state->rs_unresolved_tups);
-
- while ((unresolved = hash_seq_search(&seq_status)) != NULL)
+ if (state->rs_unresolved_tups)
{
- ItemPointerSetInvalid(&unresolved->tuple->t_data->t_ctid);
- raw_heap_insert(state, unresolved->tuple);
+ hash_seq_init(&seq_status, state->rs_unresolved_tups);
+
+ while ((unresolved = hash_seq_search(&seq_status)) != NULL)
+ {
+ ItemPointerSetInvalid(&unresolved->tuple->t_data->t_ctid);
+ raw_heap_insert(state, unresolved->tuple);
+ }
}
/* Write the last page, if any */
@@ -995,7 +1002,7 @@ logical_rewrite_log_mapping(RewriteState state, TransactionId xid,
* Perform logical remapping for a tuple that's mapped from old_tid to
* new_tuple->t_self by rewrite_heap_tuple() if necessary for the tuple.
*/
-static void
+void
logical_rewrite_heap_tuple(RewriteState state, ItemPointerData old_tid,
HeapTuple new_tuple)
{
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index c2201b046bc..8472f6064ae 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -23,6 +23,7 @@
#include "access/heapam.h"
#include "access/multixact.h"
#include "access/relscan.h"
+#include "access/rewriteheap.h"
#include "access/tableam.h"
#include "access/toast_internals.h"
#include "access/transam.h"
@@ -161,17 +162,21 @@ static HeapTuple get_changed_tuple(char *change);
static void apply_concurrent_changes(RepackDecodingState *dstate,
Relation rel, ScanKey key, int nkeys,
IndexInsertState *iistate,
- struct timeval *must_complete);
+ struct timeval *must_complete,
+ RewriteState rwstate);
static void apply_concurrent_insert(Relation rel, ConcurrentChange *change,
HeapTuple tup, IndexInsertState *iistate,
- TupleTableSlot *index_slot);
+ TupleTableSlot *index_slot,
+ RewriteState rwstate);
static void apply_concurrent_update(Relation rel, HeapTuple tup,
HeapTuple tup_target,
ConcurrentChange *change,
IndexInsertState *iistate,
- TupleTableSlot *index_slot);
+ TupleTableSlot *index_slot,
+ RewriteState rwstate);
static void apply_concurrent_delete(Relation rel, HeapTuple tup_target,
- ConcurrentChange *change);
+ ConcurrentChange *change,
+ RewriteState rwstate);
static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
HeapTuple tup_key,
Snapshot snapshot,
@@ -185,7 +190,8 @@ static bool process_concurrent_changes(LogicalDecodingContext *ctx,
ScanKey ident_key,
int ident_key_nentries,
IndexInsertState *iistate,
- struct timeval *must_complete);
+ struct timeval *must_complete,
+ RewriteState rwstate);
static bool processing_time_elapsed(struct timeval *must_complete);
static IndexInsertState *get_index_insert_state(Relation relation,
Oid ident_index_id);
@@ -2747,7 +2753,7 @@ repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
static void
apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
ScanKey key, int nkeys, IndexInsertState *iistate,
- struct timeval *must_complete)
+ struct timeval *must_complete, RewriteState rwstate)
{
TupleTableSlot *index_slot,
*ident_slot;
@@ -2822,7 +2828,8 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
{
Assert(tup_old == NULL);
- apply_concurrent_insert(rel, &change, tup, iistate, index_slot);
+ apply_concurrent_insert(rel, &change, tup, iistate, index_slot,
+ rwstate);
pfree(tup);
}
@@ -2830,7 +2837,8 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
change.kind == CHANGE_DELETE)
{
IndexScanDesc ind_scan = NULL;
- HeapTuple tup_key;
+ HeapTuple tup_key,
+ tup_exist_cp;
if (change.kind == CHANGE_UPDATE_NEW)
{
@@ -2872,11 +2880,23 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
if (tup_exist == NULL)
elog(ERROR, "Failed to find target tuple");
+ /*
+ * Update the mapping for xmax of the old version.
+ *
+ * Use a copy ('tup_exist' can point to shared buffer) with xmin
+ * invalid because mapping of that should have been written on
+ * insertion.
+ */
+ tup_exist_cp = heap_copytuple(tup_exist);
+ HeapTupleHeaderSetXmin(tup_exist_cp->t_data, InvalidTransactionId);
+ logical_rewrite_heap_tuple(rwstate, change.old_tid, tup_exist_cp);
+ pfree(tup_exist_cp);
+
if (change.kind == CHANGE_UPDATE_NEW)
apply_concurrent_update(rel, tup, tup_exist, &change, iistate,
- index_slot);
+ index_slot, rwstate);
else
- apply_concurrent_delete(rel, tup_exist, &change);
+ apply_concurrent_delete(rel, tup_exist, &change, rwstate);
ResetRepackCurrentXids();
@@ -2929,9 +2949,12 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
static void
apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
- IndexInsertState *iistate, TupleTableSlot *index_slot)
+ IndexInsertState *iistate, TupleTableSlot *index_slot,
+ RewriteState rwstate)
{
+ HeapTupleHeader tup_hdr = tup->t_data;
Snapshot snapshot = change->snapshot;
+ ItemPointerData old_tid;
List *recheck;
/*
@@ -2941,6 +2964,9 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
*/
SetRepackCurrentXids(snapshot->subxip, snapshot->subxcnt);
+ /* Remember location in the old heap. */
+ ItemPointerCopy(&tup_hdr->t_ctid, &old_tid);
+
/*
* Write the tuple into the new heap.
*
@@ -2956,6 +2982,14 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
heap_insert(rel, tup, change->xid, snapshot->curcid - 1,
HEAP_INSERT_NO_LOGICAL, NULL);
+ /*
+ * Update the mapping for xmin. (xmax should be invalid). This is needed
+ * because, during the processing, the table is considered an "user
+ * catalog".
+ */
+ Assert(!TransactionIdIsValid(HeapTupleHeaderGetRawXmax(tup->t_data)));
+ logical_rewrite_heap_tuple(rwstate, old_tid, tup);
+
/*
* Update indexes.
*
@@ -2989,15 +3023,23 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
static void
apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
ConcurrentChange *change, IndexInsertState *iistate,
- TupleTableSlot *index_slot)
+ TupleTableSlot *index_slot, RewriteState rwstate)
{
List *recheck;
LockTupleMode lockmode;
TU_UpdateIndexes update_indexes;
+ ItemPointerData tid_new_old_heap,
+ tid_old_new_heap;
TM_Result res;
Snapshot snapshot = change->snapshot;
TM_FailureData tmfd;
+ /* Location of the new tuple in the old heap. */
+ ItemPointerCopy(&tup->t_data->t_ctid, &tid_new_old_heap);
+
+ /* Location of the existing tuple in the new heap. */
+ ItemPointerCopy(&tup_target->t_self, &tid_old_new_heap);
+
/*
* Write the new tuple into the new heap. ('tup' gets the TID assigned
* here.)
@@ -3007,7 +3049,7 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
Assert(snapshot->curcid != InvalidCommandId &&
snapshot->curcid > FirstCommandId);
- res = heap_update(rel, &tup_target->t_self, tup,
+ res = heap_update(rel, &tid_old_new_heap, tup,
change->xid, snapshot->curcid - 1,
InvalidSnapshot,
false, /* no wait - only we are doing changes */
@@ -3017,6 +3059,10 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
if (res != TM_Ok)
ereport(ERROR, (errmsg("failed to apply concurrent UPDATE")));
+ /* Update the mapping for xmin of the new version. */
+ Assert(!TransactionIdIsValid(HeapTupleHeaderGetRawXmax(tup->t_data)));
+ logical_rewrite_heap_tuple(rwstate, tid_new_old_heap, tup);
+
ExecStoreHeapTuple(tup, index_slot, false);
if (update_indexes != TU_None)
@@ -3040,8 +3086,9 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
static void
apply_concurrent_delete(Relation rel, HeapTuple tup_target,
- ConcurrentChange *change)
+ ConcurrentChange *change, RewriteState rwstate)
{
+ ItemPointerData tid_old_new_heap;
TM_Result res;
TM_FailureData tmfd;
Snapshot snapshot = change->snapshot;
@@ -3050,7 +3097,10 @@ apply_concurrent_delete(Relation rel, HeapTuple tup_target,
Assert(snapshot->curcid != InvalidCommandId &&
snapshot->curcid > FirstCommandId);
- res = heap_delete(rel, &tup_target->t_self, change->xid,
+ /* Location of the existing tuple in the new heap. */
+ ItemPointerCopy(&tup_target->t_self, &tid_old_new_heap);
+
+ res = heap_delete(rel, &tid_old_new_heap, change->xid,
snapshot->curcid - 1, InvalidSnapshot, false,
&tmfd, false,
/* wal_logical */
@@ -3132,7 +3182,8 @@ static bool
process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
Relation rel_dst, Relation rel_src, ScanKey ident_key,
int ident_key_nentries, IndexInsertState *iistate,
- struct timeval *must_complete)
+ struct timeval *must_complete,
+ RewriteState rwstate)
{
RepackDecodingState *dstate;
@@ -3165,7 +3216,8 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
rel_dst->rd_toastoid = rel_src->rd_rel->reltoastrelid;
apply_concurrent_changes(dstate, rel_dst, ident_key,
- ident_key_nentries, iistate, must_complete);
+ ident_key_nentries, iistate, must_complete,
+ rwstate);
}
PG_FINALLY();
{
@@ -3350,6 +3402,7 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
Oid ident_idx_old,
ident_idx_new;
IndexInsertState *iistate;
+ RewriteState rwstate;
ScanKey ident_key;
int ident_key_nentries;
XLogRecPtr wal_insert_ptr,
@@ -3437,11 +3490,27 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
* Apply concurrent changes first time, to minimize the time we need to
* hold AccessExclusiveLock. (Quite some amount of WAL could have been
* written during the data copying and index creation.)
+ *
+ * Now we are processing individual tuples, so pass false for
+ * 'tid_chains'. Since rwstate is now only needed for
+ * logical_begin_heap_rewrite(), none of the transaction IDs needs to be
+ * valid.
*/
+ rwstate = begin_heap_rewrite(OldHeap, NewHeap,
+ InvalidTransactionId,
+ InvalidTransactionId,
+ InvalidTransactionId,
+ false);
process_concurrent_changes(ctx, end_of_wal, NewHeap,
swap_toast_by_content ? OldHeap : NULL,
ident_key, ident_key_nentries, iistate,
- NULL);
+ NULL, rwstate);
+
+ /*
+ * OldHeap will be closed, so we need to initialize rwstate again for the
+ * next call of process_concurrent_changes().
+ */
+ end_heap_rewrite(rwstate);
/*
* Acquire AccessExclusiveLock on the table, its TOAST relation (if there
@@ -3529,6 +3598,11 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
end_of_wal = GetFlushRecPtr(NULL);
/* Apply the concurrent changes again. */
+ rwstate = begin_heap_rewrite(OldHeap, NewHeap,
+ InvalidTransactionId,
+ InvalidTransactionId,
+ InvalidTransactionId,
+ false);
/*
* This time we have the exclusive lock on the table, so make sure that
@@ -3558,11 +3632,12 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
if (!process_concurrent_changes(ctx, end_of_wal, NewHeap,
swap_toast_by_content ? OldHeap : NULL,
ident_key, ident_key_nentries, iistate,
- t_end_ptr))
+ t_end_ptr, rwstate))
ereport(ERROR,
(errmsg("could not process concurrent data changes in time"),
errhint("Please consider adjusting \"repack_max_xlock_time\".")));
+ end_heap_rewrite(rwstate);
/* Remember info about rel before closing OldHeap */
relpersistence = OldHeap->rd_rel->relpersistence;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 25bb92b33f2..6f4a5f5b95b 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -984,11 +984,13 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
xl_heap_insert *xlrec;
ReorderBufferChange *change;
RelFileLocator target_locator;
+ BlockNumber blknum;
+ HeapTupleHeader tuphdr;
xlrec = (xl_heap_insert *) XLogRecGetData(r);
/* only interested in our database */
- XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
+ XLogRecGetBlockTag(r, 0, &target_locator, NULL, &blknum);
if (target_locator.dbOid != ctx->slot->data.database)
return;
@@ -1013,6 +1015,13 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
DecodeXLogTuple(tupledata, datalen, change->data.tp.newtuple);
+ /*
+ * CTID is needed for logical_rewrite_heap_tuple(), when doing REPACK
+ * CONCURRENTLY.
+ */
+ tuphdr = change->data.tp.newtuple->t_data;
+ ItemPointerSet(&tuphdr->t_ctid, blknum, xlrec->offnum);
+
change->data.tp.clear_toast_afterwards = true;
ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
@@ -1034,11 +1043,15 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
ReorderBufferChange *change;
char *data;
RelFileLocator target_locator;
+ BlockNumber old_blknum,
+ new_blknum;
xlrec = (xl_heap_update *) XLogRecGetData(r);
+ /* Retrieve blknum, so that we can compose CTID below. */
+ XLogRecGetBlockTag(r, 0, &target_locator, NULL, &new_blknum);
+
/* only interested in our database */
- XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
if (target_locator.dbOid != ctx->slot->data.database)
return;
@@ -1055,6 +1068,7 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
{
Size datalen;
Size tuplelen;
+ HeapTupleHeader tuphdr;
data = XLogRecGetBlockData(r, 0, &datalen);
@@ -1064,6 +1078,13 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
ReorderBufferAllocTupleBuf(ctx->reorder, tuplelen);
DecodeXLogTuple(data, datalen, change->data.tp.newtuple);
+
+ /*
+ * CTID is needed for logical_rewrite_heap_tuple(), when doing REPACK
+ * CONCURRENTLY.
+ */
+ tuphdr = change->data.tp.newtuple->t_data;
+ ItemPointerSet(&tuphdr->t_ctid, new_blknum, xlrec->new_offnum);
}
if (xlrec->flags & XLH_UPDATE_CONTAINS_OLD)
@@ -1082,6 +1103,14 @@ DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
DecodeXLogTuple(data, datalen, change->data.tp.oldtuple);
}
+ /*
+ * Remember the old tuple CTID, for the sake of
+ * logical_rewrite_heap_tuple().
+ */
+ if (!XLogRecGetBlockTagExtended(r, 1, NULL, NULL, &old_blknum, NULL))
+ old_blknum = new_blknum;
+ ItemPointerSet(&change->data.tp.old_tid, old_blknum, xlrec->old_offnum);
+
change->data.tp.clear_toast_afterwards = true;
ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
@@ -1100,11 +1129,12 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
xl_heap_delete *xlrec;
ReorderBufferChange *change;
RelFileLocator target_locator;
+ BlockNumber blknum;
xlrec = (xl_heap_delete *) XLogRecGetData(r);
/* only interested in our database */
- XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
+ XLogRecGetBlockTag(r, 0, &target_locator, NULL, &blknum);
if (target_locator.dbOid != ctx->slot->data.database)
return;
@@ -1136,6 +1166,12 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
DecodeXLogTuple((char *) xlrec + SizeOfHeapDelete,
datalen, change->data.tp.oldtuple);
+
+ /*
+ * CTID is needed for logical_rewrite_heap_tuple(), when doing REPACK
+ * CONCURRENTLY.
+ */
+ ItemPointerSet(&change->data.tp.old_tid, blknum, xlrec->offnum);
}
change->data.tp.clear_toast_afterwards = true;
diff --git a/src/backend/replication/pgoutput_repack/pgoutput_repack.c b/src/backend/replication/pgoutput_repack/pgoutput_repack.c
index 28bd16f9cc7..24d9c9c4884 100644
--- a/src/backend/replication/pgoutput_repack/pgoutput_repack.c
+++ b/src/backend/replication/pgoutput_repack/pgoutput_repack.c
@@ -33,7 +33,7 @@ static void plugin_truncate(struct LogicalDecodingContext *ctx,
ReorderBufferChange *change);
static void store_change(LogicalDecodingContext *ctx,
ConcurrentChangeKind kind, HeapTuple tuple,
- TransactionId xid);
+ TransactionId xid, ItemPointer old_tid);
void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
@@ -168,7 +168,8 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (newtuple == NULL)
elog(ERROR, "Incomplete insert info.");
- store_change(ctx, CHANGE_INSERT, newtuple, change->txn->xid);
+ store_change(ctx, CHANGE_INSERT, newtuple, change->txn->xid,
+ NULL);
}
break;
case REORDER_BUFFER_CHANGE_UPDATE:
@@ -186,10 +187,10 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (oldtuple != NULL)
store_change(ctx, CHANGE_UPDATE_OLD, oldtuple,
- change->txn->xid);
+ change->txn->xid, NULL);
store_change(ctx, CHANGE_UPDATE_NEW, newtuple,
- change->txn->xid);
+ change->txn->xid, &change->data.tp.old_tid);
}
break;
case REORDER_BUFFER_CHANGE_DELETE:
@@ -202,7 +203,8 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (oldtuple == NULL)
elog(ERROR, "Incomplete delete info.");
- store_change(ctx, CHANGE_DELETE, oldtuple, change->txn->xid);
+ store_change(ctx, CHANGE_DELETE, oldtuple, change->txn->xid,
+ &change->data.tp.old_tid);
}
break;
default:
@@ -236,13 +238,13 @@ plugin_truncate(struct LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (i == nrelations)
return;
- store_change(ctx, CHANGE_TRUNCATE, NULL, InvalidTransactionId);
+ store_change(ctx, CHANGE_TRUNCATE, NULL, InvalidTransactionId, NULL);
}
/* Store concurrent data change. */
static void
store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
- HeapTuple tuple, TransactionId xid)
+ HeapTuple tuple, TransactionId xid, ItemPointer old_tid)
{
RepackDecodingState *dstate;
char *change_raw;
@@ -317,6 +319,11 @@ store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
change.snapshot = dstate->snapshot;
dstate->snapshot->active_count++;
+ if (old_tid)
+ ItemPointerCopy(old_tid, &change.old_tid);
+ else
+ ItemPointerSetInvalid(&change.old_tid);
+
/* The data has been copied. */
if (flattened)
pfree(tuple);
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 99c3f362adc..eebda35c7cb 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -23,11 +23,14 @@ typedef struct RewriteStateData *RewriteState;
extern RewriteState begin_heap_rewrite(Relation old_heap, Relation new_heap,
TransactionId oldest_xmin, TransactionId freeze_xid,
- MultiXactId cutoff_multi);
+ MultiXactId cutoff_multi, bool tid_chains);
extern void end_heap_rewrite(RewriteState state);
extern void rewrite_heap_tuple(RewriteState state, HeapTuple old_tuple,
HeapTuple new_tuple);
extern bool rewrite_heap_dead_tuple(RewriteState state, HeapTuple old_tuple);
+extern void logical_rewrite_heap_tuple(RewriteState state,
+ ItemPointerData old_tid,
+ HeapTuple new_tuple);
/*
* On-Disk data format for an individual logical rewrite mapping.
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 0267357a261..45cd3fe4276 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -78,6 +78,9 @@ typedef struct ConcurrentChange
/* Transaction that changes the data. */
TransactionId xid;
+ /* For UPDATE / DELETE, the location of the old tuple version. */
+ ItemPointerData old_tid;
+
/*
* Historic catalog snapshot that was used to decode this change.
*/
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 3be0cbd7ebe..c2731947b22 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -104,6 +104,13 @@ typedef struct ReorderBufferChange
HeapTuple oldtuple;
/* valid for INSERT || UPDATE */
HeapTuple newtuple;
+
+ /*
+ * REPACK CONCURRENTLY needs the old TID, even if the old tuple
+ * itself is not WAL-logged (i.e. when the identity key does not
+ * change).
+ */
+ ItemPointerData old_tid;
} tp;
/*
--
2.43.5
On Tue, Apr 1, 2025, at 10:31 AM, Antonin Houska wrote:
One more version, hopefully to make cfbot happy (I missed the bug because I
did not set the RELCACHE_FORCE_RELEASE macro in my environment.)
I started reviewing this patch. It was on my radar to review it but I didn't
have spare time until now.
The syntax is fine for me. It is really close to CLUSTER syntax but it adds one
detail: INDEX keyword. It is a good approach because USING clause can be
expanded in the future if required.
+ <refnamediv>
+ <refname>REPACK</refname>
+ <refpurpose>cluster a table according to an index</refpurpose>
+ </refnamediv>
This description is not accurate because the index is optional. It means if the
index is not specified, it is a "rewrite" instead of a "cluster". One
suggestion is to use "rewrite" because it says nothing about the tuple order. A
"rewrite a table" or "rewrite a table to reclaim space" are good candidates. On
the other hand, the command is called "repack" and it should be a strong
candidate for the verb in this description. (I'm surprised that repack is not a
recent term [1]). It seems a natural choice.
+<synopsis>
+REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <replaceable class="parameter">table_name</replaceable> [ USING INDEX<replaceable class="parameter">index_name</replaceable> ] ]
You missed a space after INDEX.
+ <para>
+ Because the planner records statistics about the ordering of tables, it is
+ advisable to
+ run <link linkend="sql-analyze"><command>ANALYZE</command></link> on the
+ newly repacked table. Otherwise, the planner might make poor choices of
+ query plans.
+ </para>
If we decide for another term (other than "repacked") then it should reflect
here and in some other parts ("repacking" is also used) too.
+ <para>
+ When an index scan or a sequential scan without sort is used, a temporary
+ copy of the table is created that contains the table data in the index
+ order.
That's true for CLUSTER but not for REPACK. Index is optional.
+ Prints a progress report as each table is clustered
+ at <literal>INFO</literal> level.
s/clustered/repacked/?
+ Repacking a partitioned table repacks each of its partitions. If an index
+ is specified, each partition is clustered using the partition of that
+ index. <command>REPACK</command> on a partitioned table cannot be executed
+ inside a transaction block.
Ditto.
+ <para>
+ Cluster the table <literal>employees</literal> on the basis of its
+ index <literal>employees_ind</literal>:
+<programlisting>
+REPACK employees USING INDEX employees_ind;
+</programlisting>
+ </para>
It sounds strange to use "Repack" in the other examples but this one it says
"Cluster". Let's use the same terminology.
+
+ <warning>
+ <para>
+ The <command>FULL</command> parameter is deprecated in favor of
+ <xref linkend="sql-repack"/>.
+ </para>
+ </warning>
+
The warnings, notes, and tips are usually placed *after* the description.
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -741,13 +741,13 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
Is it worth renaming table_relation_copy_for_cluster() and
heapam_relation_copy_for_cluster() to replace "cluster" with "repack"?
+ SELECT
+ S.pid AS pid,
+ S.datid AS datid,
+ D.datname AS datname,
+ S.relid AS relid,
+ CASE S.param1 WHEN 1 THEN 'REPACK'
+ END AS command,
Do you really need command? IIUC REPACK is the only command that will be used by
this view. There is no need to differentiate commands here.
+ *
+ * 'cmd' indicates which commands is being executed. REPACK should be the only
+ * caller of this function in the future.
command.
+ *
+ * REPACK does not set indisclustered. XXX Not sure I understand the
+ * comment above: how can an attribute be set "only in the current
+ * database"?
*/
pg_index is a local catalog. To be consistent while clustering a shared
catalog, it should set indisclustered in all existent databases because in each
pg_index table there is a tuple for the referred index. As the comment says it
is not possible.
euler=# select relname, relkind, pg_relation_filepath(oid) from pg_class where relname = 'pg_index';
relname | relkind | pg_relation_filepath
----------+---------+----------------------
pg_index | r | base/16424/2610
(1 row)
euler=# select indexrelid::regclass, indexrelid::regclass, indisclustered from pg_index where indrelid = 'pg_database'::regclass;
indexrelid | indexrelid | indisclustered
---------------------------+---------------------------+----------------
pg_database_datname_index | pg_database_datname_index | f
pg_database_oid_index | pg_database_oid_index | f
(2 rows)
euler=# \c postgres
You are now connected to database "postgres" as user "euler".
postgres=# select relname, relkind, pg_relation_filepath(oid) from pg_class where relname = 'pg_index';
relname | relkind | pg_relation_filepath
----------+---------+----------------------
pg_index | r | base/5/2610
(1 row)
postgres=# select indexrelid::regclass, indexrelid::regclass, indisclustered from pg_index where indrelid = 'pg_database'::regclass;
indexrelid | indexrelid | indisclustered
---------------------------+---------------------------+----------------
pg_database_datname_index | pg_database_datname_index | f
pg_database_oid_index | pg_database_oid_index | f
(2 rows)
- if (OidIsValid(indexOid) && OldHeap->rd_rel->relisshared)
+ if (cmd == CLUSTER_COMMAND_CLUSTER && OldHeap->rd_rel->relisshared)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot cluster a shared catalog")));
+ errmsg("cannot %s a shared catalog", cmd_str)));
I'm confused about this change. Why is it required?
If it prints this message only for CLUSTER command, you don't need to have a
generic message. This kind of message is not good for translation. If you need
multiple verbs here, I advise you to break it into multiple messages.
- {
- if (OidIsValid(indexOid))
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot cluster temporary tables of other sessions")));
- else
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot vacuum temporary tables of other sessions")));
- }
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot %s temporary tables of other sessions",
+ cmd_str)));
Ditto.
- CheckTableNotInUse(OldHeap, OidIsValid(indexOid) ? "CLUSTER" : "VACUUM");
+ CheckTableNotInUse(OldHeap, asc_toupper(cmd_str, strlen(cmd_str)));
If the idea is to remove CLUSTER and VACUUM from this routine in the future, I
wouldn't include formatting.h just for asc_toupper(). Instead, I would use an
if condition. I think it will be easy to remove this code path when the time
comes.
- errmsg("cannot cluster on index \"%s\" because access method does not support clustering",
- RelationGetRelationName(OldIndex))));
+ errmsg("cannot %s on index \"%s\" because access method does not support clustering",
+ cmd_str, RelationGetRelationName(OldIndex))));
Ditto. I don't think check_index_is_clusterable() should be changed. The action
is "cluster" independently of the command. You can keep "cluster" until we
completely remove CLUSTER command and then we can replace this term with
"repack". It also applies to cluster_is_permitted_for_relation().
- errmsg("cannot cluster on partial index \"%s\"",
+ errmsg("cannot %s on partial index \"%s\"",
+ cmd_str,
RelationGetRelationName(OldIndex))));
Ditto.
- errmsg("cannot cluster on invalid index \"%s\"",
- RelationGetRelationName(OldIndex))));
+ errmsg("cannot %s on invalid index \"%s\"",
+ cmd_str, RelationGetRelationName(OldIndex))));
Ditto.
- (errmsg("clustering \"%s.%s\" using index scan on \"%s\"",
+ (errmsg("%sing \"%s.%s\" using index scan on \"%s\"",
+ cmd_str,
nspname,
RelationGetRelationName(OldHeap),
RelationGetRelationName(OldIndex))));
This is bad for translation. Use complete sentences.
- (errmsg("clustering \"%s.%s\" using sequential scan and sort",
+ (errmsg("%sing \"%s.%s\" using sequential scan and sort",
+ cmd_str,
nspname,
RelationGetRelationName(OldHeap))));
Ditto.
- (errmsg("vacuuming \"%s.%s\"",
+ (errmsg("%sing \"%s.%s\"",
+ cmd_str,
nspname,
RelationGetRelationName(OldHeap))));
Ditto.
/*
- * Given an index on a partitioned table, return a list of RelToCluster for
+ * Like get_tables_to_cluster(), but do not care about indexes.
+ */
Since the goal is to remove CLUSTER in the future, provide a comment that
doesn't mention routines that will certainly be removed. Hence, there is no
need to fix them in the future.
+ /*
+ * Get all indexes that have indisclustered set and that the current user
+ * has the appropriate privileges for.
+ */
This comment is not true.
ereport(WARNING,
- (errmsg("permission denied to cluster \"%s\", skipping it",
+ (errmsg("permission denied to %s \"%s\", skipping it",
+ CLUSTER_COMMAND_STR(cmd),
get_rel_name(relid))));
Fix for translation.
+ if (stmt->relation != NULL)
+ {
+ rel = process_single_relation(stmt->relation, stmt->indexname,
+ CLUSTER_COMMAND_REPACK, ¶ms,
+ &indexOid);
+ if (rel == NULL)
+ return;
+ }
This code path is confusing. It took me some time (after reading
process_single_relation(), which could have a better name) to understand it. I
don't have a good suggestion but it should have at least one comment explaining
what the purpose is.
+/*
+ * REPACK a single relation.
+ *
+ * Return NULL if done, relation reference if the caller needs to process it
+ * (because the relation is partitioned).
+ */
This comment should be expanded. As I said in the previous hunk, there isn't
sufficient information to understand how process_single_relation() works.
+ | REPACK
+ {
+ RepackStmt *n = makeNode(RepackStmt);
+
+ n->relation = NULL;
+ n->indexname = NULL;
+ n->params = NIL;
+ $$ = (Node *) n;
+ }
+
+ | REPACK '(' utility_option_list ')'
+ {
+ RepackStmt *n = makeNode(RepackStmt);
+
+ n->relation = NULL;
+ n->indexname = NULL;
+ n->params = $3;
+ $$ = (Node *) n;
+ }
I'm wondering if there is an easy way to avoid these rules.
PROGRESS_COMMAND_VACUUM,
PROGRESS_COMMAND_ANALYZE,
PROGRESS_COMMAND_CLUSTER,
+ PROGRESS_COMMAND_REPACK,
PROGRESS_COMMAND_CREATE_INDEX,
PROGRESS_COMMAND_BASEBACKUP,
PROGRESS_COMMAND_COPY,
It is just a matter of style but I have the habit to include new stuff at the
end.
+-- Yet another code path: REPACK w/o index.
+REPACK clstr_tst USING INDEX clstr_tst_c;
+-- Verify that inheritance link still works
You forgot to remove the USING INDEX here.
I'm still reviewing the other patches (which are basically the implementation of
CONCURRENTLY) and to avoid a long review, I'm sending the 0001 review. Anyway,
0001 is independent of the other patches and should be applied separately.
[1]: https://www.merriam-webster.com/dictionary/repack
--
Euler Taveira
EDB https://www.enterprisedb.com/
Hi,
On Tue, Apr 1, 2025 at 10:31 AM Antonin Houska <ah@cybertec.at> wrote:
One more version, hopefully to make cfbot happy (I missed the bug because I
did not set the RELCACHE_FORCE_RELEASE macro in my environment.)
Thanks for the new version! I'm starting to study this patch series and
I just want to share some points about the documentation on v12-0004:
+ Thus the tuples inserted into the old file during the copying are
+ also stored in separately in a temporary file, so they can eventually
Maybe "stored separately in a temporary file"?
+ <row>
+ <entry><literal>catch-up</literal></entry>
+ <entry>
+ <command>REPACK</command> is currently processing the DML commands that
+ other transactions executed during any of the preceding phase.
+ </entry>
+ </row>
This catch-up phase only happens when CONCURRENTLY is used, right? Maybe
it would be good to mention this?
The commit message say:
"Of course, more data changes can take place while we are waiting for
the lock - these will be applied to the new file after we have acquired
the lock, before we swap the files."
But the documentation say:
+ <para>
+ With the <literal>CONCURRENTLY</literal> option, the <literal>ACCESS
+ EXCLUSIVE</literal> lock is only acquired to swap the table and index
+ files. The data changes that took place during the creation of the new
+ table and index files are captured using logical decoding
+ (<xref linkend="logicaldecoding"/>) and applied before
+ the <literal>ACCESS EXCLUSIVE</literal> lock is requested. Thus the lock
+ is typically held only for the time needed to swap the files, which
+ should be pretty short.
+ </para>
I've noticed that you've included in 0007 that the ACCESS EXCLUSIVE may
be used to apply changes that occurred while waiting for the lock, but
IIUC this behaviour is implemented on 0004 right? If that's the case I
think that it would be good to move this part of the documentation to
0004 instead of 0007, what do you think?
--
Matheus Alcantara
On 2025-Apr-01, Antonin Houska wrote:
Besides that, it occurred to me that 0005 ("Preserve visibility
information of the concurrent data changes.") will probably introduce
significant overhead. The problem is that the table we're repacking is
treated like a catalog, for reorderbuffer.c to generate snapshots that
we need to replay UPDATE / DELETE commands on the new table.

contrib/test_decoding can be used to demonstrate the difference
between ordinary and catalog tables:

[.. ordinary ..]
Execution Time: 3521.190 ms
[.. catalog ..]
Execution Time: 6561.634 ms
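A comparison along these lines can be reproduced roughly as follows (an illustrative sketch, not the exact script used above; table and slot names are invented, and it assumes wal_level = logical with contrib/test_decoding installed):

```sql
CREATE TABLE t_plain (id int, payload text);
-- user_catalog_table makes logical decoding treat the table like a
-- catalog, which is how REPACK CONCURRENTLY would have to treat the
-- table being repacked:
CREATE TABLE t_catalog (id int, payload text)
  WITH (user_catalog_table = true);

SELECT pg_create_logical_replication_slot('repack_test', 'test_decoding');

INSERT INTO t_plain SELECT g, repeat('x', 100)
  FROM generate_series(1, 1000000) g;

-- Time the decoding pass, then repeat the INSERT and decoding with
-- t_catalog to see the extra snapshot-tracking overhead:
EXPLAIN (ANALYZE)
  SELECT count(*)
  FROM pg_logical_slot_get_changes('repack_test', NULL, NULL);
```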
Significant indeed. Thinking about the scenarios in which I envision
people using REPACK CONCURRENTLY (mostly, cases where very large tables
have accumulated considerable amounts of bloat) and considering the size
of the patch, I think the case for treating it as concurrent-safe is not
credible, at least not at this stage -- not only because of this
performance impact, but also because of the additional code complexity,
which I'm really doubtful we can address at this stage. I would suggest
to put that patch aside for now, maybe with a doc warning that
"repacking a table would cause visibility information to be lost"; and
then address that aspect later on, after this feature has gone through
some battle-hardening.
--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"Las navajas y los monos deben estar siempre distantes" (Germán Poo)
Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
On 2025-Apr-01, Antonin Houska wrote:
Besides that, it occurred to me that 0005 ("Preserve visibility
information of the concurrent data changes.") will probably introduce
significant overhead. The problem is that the table we're repacking is
treated like a catalog, for reorderbuffer.c to generate snapshots that
we need to replay UPDATE / DELETE commands on the new table.

contrib/test_decoding can be used to demonstrate the difference
between ordinary and catalog tables:

[.. ordinary ..]
Execution Time: 3521.190 ms
[.. catalog ..]
Execution Time: 6561.634 ms

Significant indeed. Thinking about the scenarios in which I envision
people using REPACK CONCURRENTLY (mostly, cases where very large tables
have accumulated considerable amounts of bloat) and considering the size
of the patch, I think the case for treating it as concurrent-safe is not
credible, at least not at this stage -- not only because of this
performance impact, but also because of the additional code complexity,
which I'm really doubtful we can address at this stage. I would suggest
to put that patch aside for now, maybe with a doc warning that
"repacking a table would cause visibility information to be lost"; and
then address that aspect later on, after this feature has gone through
some battle-hardening.
ok, I'll adjust the patch set.
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Euler Taveira <euler@eulerto.com> wrote:
I started reviewing this patch. It was in my radar to review it but I didn't
have spare time until now.
Thanks!
+ <refnamediv>
+ <refname>REPACK</refname>
+ <refpurpose>cluster a table according to an index</refpurpose>
+ </refnamediv>

This description is not accurate because the index is optional. It means if the
index is not specified, it is a "rewrite" instead of a "cluster". One
suggestion is to use "rewrite" because it says nothing about the tuple order. A
"rewrite a table" or "rewrite a table to reclaim space" are good candidates. On
the other hand, the command is called "repack" and it should be a strong
candidate for the verb in this description. (I'm surprised that repack is not a
recent term [1]). It seems a natural choice.
This reveals that I used the documentation of CLUSTER for "inspiration" :-) I
prefer "rewrite" because it can help users who don't know what REPACK means
in this context.
+<synopsis>
+REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <replaceable class="parameter">table_name</replaceable> [ USING INDEX<replaceable class="parameter">index_name</replaceable> ] ]

You missed a space after INDEX.

ok, fixed
+ <para>
+ Because the planner records statistics about the ordering of tables, it is
+ advisable to
+ run <link linkend="sql-analyze"><command>ANALYZE</command></link> on the
+ newly repacked table. Otherwise, the planner might make poor choices of
+ query plans.
+ </para>

If we decide for another term (other than "repacked") then it should reflect
here and in some other parts ("repacking" is also used) too.
I think it's ok to say "repack" as long as we explained it above. The
documentation of CLUSTER command also uses the verb "cluster".
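Assuming the syntax from this patch, the documented advice amounts to something like the following (hypothetical table and index names):

```sql
-- Rewrite the table in index order, then refresh the planner's
-- ordering statistics (e.g. pg_stats.correlation):
REPACK orders USING INDEX orders_created_at_idx;
ANALYZE orders;
```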
+ <para>
+ When an index scan or a sequential scan without sort is used, a temporary
+ copy of the table is created that contains the table data in the index
+ order.

That's true for CLUSTER but not for REPACK. Index is optional.
ok, removed the mention of index.
+ Prints a progress report as each table is clustered
+ at <literal>INFO</literal> level.

s/clustered/repacked/?
right
+ Repacking a partitioned table repacks each of its partitions. If an index
+ is specified, each partition is clustered using the partition of that
+ index. <command>REPACK</command> on a partitioned table cannot be executed
+ inside a transaction block.

Ditto.
fixed
+ <para>
+ Cluster the table <literal>employees</literal> on the basis of its
+ index <literal>employees_ind</literal>:
+<programlisting>
+REPACK employees USING INDEX employees_ind;
+</programlisting>
+ </para>

It sounds strange to use "Repack" in the other examples while this one says
"Cluster". Let's use the same terminology.
It's explained above on the page that if index is specified, it's
clustering. I changed it to "Repack", but added a note that this is
effectively clustering.
+
+ <warning>
+ <para>
+ The <command>FULL</command> parameter is deprecated in favor of
+ <xref linkend="sql-repack"/>.
+ </para>
+ </warning>
+

The warnings, notes, and tips are usually placed *after* the description.
You probably mean the subsections "Notes on Clustering" and "Notes on
Resources". I moved them into the "Notes" section.
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -741,13 +741,13 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,

Is it worth to rename table_relation_copy_for_cluster() and
heapam_relation_copy_for_cluster() to replace cluster with repack?
I had thought about it and concluded that it'd make the patch too
invasive. Note that the CLUSTER still uses these functions. We can do the
renaming when removing the CLUSTER command someday.
+ SELECT
+ S.pid AS pid,
+ S.datid AS datid,
+ D.datname AS datname,
+ S.relid AS relid,
+ CASE S.param1 WHEN 1 THEN 'REPACK'
+ END AS command,

Do you really need command? IIUC REPACK is the only command that will be used
by this view. There is no need to differentiate commands here.
REPACK is a regular command, so why shouldn't it have its own view? Just like
CLUSTER has one (pg_stat_progress_cluster).
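Assuming the view definition proposed in this patch, monitoring a running REPACK would look much like querying pg_stat_progress_cluster today, e.g.:

```sql
-- Sketch: observe REPACK progress from another session.
SELECT pid, datname, relid::regclass AS repacked_table, command, phase,
       heap_blks_scanned, heap_blks_total, index_rebuild_count
FROM pg_stat_progress_repack;
```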
+ *
+ * 'cmd' indicates which commands is being executed. REPACK should be the only
+ * caller of this function in the future.

command.
Not sure I understand this comment.
+ *
+ * REPACK does not set indisclustered. XXX Not sure I understand the
+ * comment above: how can an attribute be set "only in the current
+ * database"? */

pg_index is a local catalog. To be consistent while clustering a shared
catalog, it should set indisclustered in all existent databases because in each
pg_index table there is a tuple for the referred index. As the comment says it
is not possible.
Yes, Alvaro already explained this to me [1] :-)
- if (OidIsValid(indexOid) && OldHeap->rd_rel->relisshared)
+ if (cmd == CLUSTER_COMMAND_CLUSTER && OldHeap->rd_rel->relisshared)
 ereport(ERROR,
 (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot cluster a shared catalog")));
+ errmsg("cannot %s a shared catalog", cmd_str)));

I'm confused about this change. Why is it required?
If it prints this message only for CLUSTER command, you don't need to have a
generic message. This kind of message is not good for translation. If you need
multiple verbs here, I advise you to break it into multiple messages.
Good point, I didn't think about translation. Fixed.
- {
- if (OidIsValid(indexOid))
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot cluster temporary tables of other sessions")));
- else
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot vacuum temporary tables of other sessions")));
- }
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot %s temporary tables of other sessions",
+ cmd_str)));

Ditto.
Fixed.
- CheckTableNotInUse(OldHeap, OidIsValid(indexOid) ? "CLUSTER" : "VACUUM");
+ CheckTableNotInUse(OldHeap, asc_toupper(cmd_str, strlen(cmd_str)));

If the idea is to remove CLUSTER and VACUUM from this routine in the future, I
wouldn't include formatting.h just for asc_toupper(). Instead, I would use an
if condition. I think it will be easy to remove this code path when the time
comes.
Fixed.
- errmsg("cannot cluster on index \"%s\" because access method does not support clustering",
- RelationGetRelationName(OldIndex))));
+ errmsg("cannot %s on index \"%s\" because access method does not support clustering",
+ cmd_str, RelationGetRelationName(OldIndex))));

Ditto. I don't think check_index_is_clusterable() should be changed. The action
is "cluster" independently of the command. You can keep "cluster" until we
completely remove CLUSTER command and then we can replace this term with
"repack". It also applies to cluster_is_permitted_for_relation().

- errmsg("cannot cluster on partial index \"%s\"",
+ errmsg("cannot %s on partial index \"%s\"",
+ cmd_str, RelationGetRelationName(OldIndex))));

Ditto.

- errmsg("cannot cluster on invalid index \"%s\"",
- RelationGetRelationName(OldIndex))));
+ errmsg("cannot %s on invalid index \"%s\"",
+ cmd_str, RelationGetRelationName(OldIndex))));

Ditto.
- (errmsg("clustering \"%s.%s\" using index scan on \"%s\"",
+ (errmsg("%sing \"%s.%s\" using index scan on \"%s\"",
+ cmd_str, nspname, RelationGetRelationName(OldHeap),
 RelationGetRelationName(OldIndex))));

This is bad for translation. Use complete sentences.

- (errmsg("clustering \"%s.%s\" using sequential scan and sort",
+ (errmsg("%sing \"%s.%s\" using sequential scan and sort",
+ cmd_str, nspname, RelationGetRelationName(OldHeap))));

Ditto.

- (errmsg("vacuuming \"%s.%s\"",
+ (errmsg("%sing \"%s.%s\"",
+ cmd_str, nspname, RelationGetRelationName(OldHeap))));

Ditto.
fixed
/*
- * Given an index on a partitioned table, return a list of RelToCluster for
+ * Like get_tables_to_cluster(), but do not care about indexes.
+ */
Since the goal is to remove CLUSTER in the future, provide a comment that
doesn't mention routines that will certainly be removed. Hence, there is no
need to fix them in the future.
It'd be almost a duplicate of the header comment of get_tables_to_cluster() and
I don't like duplication. Let's do that at removal time.
+ /*
+ * Get all indexes that have indisclustered set and that the current user
+ * has the appropriate privileges for.
+ */

This comment is not true.
Fixed.
 ereport(WARNING,
- (errmsg("permission denied to cluster \"%s\", skipping it",
+ (errmsg("permission denied to %s \"%s\", skipping it",
+ CLUSTER_COMMAND_STR(cmd),
 get_rel_name(relid))));

Fix for translation.
Fixed.
+ if (stmt->relation != NULL)
+ {
+ rel = process_single_relation(stmt->relation, stmt->indexname,
+ CLUSTER_COMMAND_REPACK, &params,
+ &indexOid);
+ if (rel == NULL)
+ return;
+ }

This code path is confusing. It took me some time (after reading
process_single_relation() that could have a better name) to understand it. I
don't have a good suggestion but it should have at least one comment explaining
what the purpose is.
ok, added the comment that I lost when moving the code from cluster() to
process_single_relation().
+/*
+ * REPACK a single relation.
+ *
+ * Return NULL if done, relation reference if the caller needs to process it
+ * (because the relation is partitioned).
+ */

This comment should be expanded. As I said in the previous hunk, there isn't
sufficient information to understand how process_single_relation() works.
This function only contains code that I moved from cluster_rel(). The header
comment is additional information. I tried to rephrase it a bit anyway.
+ | REPACK
+ {
+ RepackStmt *n = makeNode(RepackStmt);
+
+ n->relation = NULL;
+ n->indexname = NULL;
+ n->params = NIL;
+ $$ = (Node *) n;
+ }
+
+ | REPACK '(' utility_option_list ')'
+ {
+ RepackStmt *n = makeNode(RepackStmt);
+
+ n->relation = NULL;
+ n->indexname = NULL;
+ n->params = $3;
+ $$ = (Node *) n;
+ }

I'm wondering if there is an easy way to avoid these rules.

Maybe, will think about it.
PROGRESS_COMMAND_VACUUM,
PROGRESS_COMMAND_ANALYZE,
PROGRESS_COMMAND_CLUSTER,
+ PROGRESS_COMMAND_REPACK,
PROGRESS_COMMAND_CREATE_INDEX,
PROGRESS_COMMAND_BASEBACKUP,
 PROGRESS_COMMAND_COPY,

It is just a matter of style but I have the habit to include new stuff at the
end.
Yes, it seems so. Fixed.
+-- Yet another code path: REPACK w/o index.
+REPACK clstr_tst USING INDEX clstr_tst_c;
+-- Verify that inheritance link still works

You forgot to remove the USING INDEX here.
Good catch. I removed the test because w/o index the output order can be
unstable. (Whether index is used or not should not affect the catalog changes
related to inheritance or FKs anyway.)
I'm still reviewing the other patches (which are basically the implementation of
CONCURRENTLY) and to avoid a long review, I'm sending the 0001 review. Anyway,
0001 is independent of the other patches and should be applied separately.
Attached is a new version of 0001.
As for the other patches, please skip the parts > 0004 - most of this code
will be removed [2]. I'll try to post the next version of the patch set next
week.
(Regarding next reviews, please try to keep hunk headers in the text.)
[1]: /messages/by-id/202503031807.dnacvpgnjkz7@alvherre.pgsql
[2]: /messages/by-id/13028.1743762516@localhost
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Attachments:
0001-Add-REPACK-command.patchtext/x-diffDownload
From af419c5d5f56429581a263f12f70d12144a5a0e9 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Fri, 4 Apr 2025 18:30:42 +0200
Subject: [PATCH] Add REPACK command.
The existing CLUSTER command as well as VACUUM with the FULL option both
reclaim unused space by rewriting the table. Now that we want to enhance this
functionality (in particular, by adding a new option CONCURRENTLY), we should
enhance both commands because they are both implemented by the same function
(cluster.c:cluster_rel). However, adding the same option to two different
commands is not very user-friendly. Therefore it was decided to create a new
command and to declare both CLUSTER command and the FULL option of VACUUM
deprecated. Future enhancements to this rewriting code will only affect the
new command.
Like CLUSTER, the REPACK command reorders the table according to the specified
index. Unlike CLUSTER, REPACK does not require the index: if only table is
specified, the command acts as VACUUM FULL. As we don't want to remove CLUSTER
and VACUUM FULL yet, there are three callers of the cluster_rel() function
now: REPACK, CLUSTER and VACUUM FULL. When we need to distinguish who is
calling this function (mostly for logging, but also for progress reporting),
we can no longer use the OID of the clustering index: both REPACK and VACUUM
FULL can pass InvalidOid. Therefore, this patch introduces a new enumeration
type ClusterCommand, and adds an argument of this type to the cluster_rel()
function and to all the functions that need to distinguish the caller.
Like CLUSTER and VACUUM FULL, the REPACK command without arguments processes
all the tables on which the current user has the MAINTAIN privilege.
A new view, pg_stat_progress_repack, is added to monitor the progress of
REPACK. Currently it displays the same information as pg_stat_progress_cluster
(except that column names might differ), but it'll also display the status of
the REPACK CONCURRENTLY command in the future, so the view definitions will
eventually diverge.
Regarding user documentation, the patch moves the information on clustering
from cluster.sgml to the new file repack.sgml. cluster.sgml now contains a
link that points to the related section of repack.sgml. A note on deprecation
and a link to repack.sgml are added to both cluster.sgml and vacuum.sgml.
---
doc/src/sgml/monitoring.sgml | 230 +++++++++++++
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/cluster.sgml | 79 +----
doc/src/sgml/ref/repack.sgml | 256 ++++++++++++++
doc/src/sgml/ref/vacuum.sgml | 8 +
doc/src/sgml/reference.sgml | 1 +
src/backend/access/heap/heapam_handler.c | 32 +-
src/backend/catalog/index.c | 2 +-
src/backend/catalog/system_views.sql | 27 ++
src/backend/commands/cluster.c | 416 +++++++++++++++++------
src/backend/commands/vacuum.c | 3 +-
src/backend/parser/gram.y | 63 +++-
src/backend/tcop/utility.c | 9 +
src/backend/utils/adt/pgstatfuncs.c | 2 +
src/bin/psql/tab-complete.in.c | 31 +-
src/include/commands/cluster.h | 19 +-
src/include/commands/progress.h | 60 +++-
src/include/nodes/parsenodes.h | 13 +
src/include/parser/kwlist.h | 1 +
src/include/tcop/cmdtaglist.h | 1 +
src/include/utils/backend_progress.h | 1 +
src/test/regress/expected/cluster.out | 123 +++++++
src/test/regress/expected/rules.out | 27 ++
src/test/regress/sql/cluster.sql | 59 ++++
src/tools/pgindent/typedefs.list | 2 +
25 files changed, 1259 insertions(+), 207 deletions(-)
create mode 100644 doc/src/sgml/ref/repack.sgml
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index a6d67d2fbaa..0a6229c391a 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -400,6 +400,14 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_progress_repack</structname><indexterm><primary>pg_stat_progress_repack</primary></indexterm></entry>
+ <entry>One row for each backend running
+ <command>REPACK</command>, showing current progress. See
+ <xref linkend="repack-progress-reporting"/>.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_progress_basebackup</structname><indexterm><primary>pg_stat_progress_basebackup</primary></indexterm></entry>
<entry>One row for each WAL sender process streaming a base backup,
@@ -5940,6 +5948,228 @@ FROM pg_stat_get_backend_idset() AS backendid;
</table>
</sect2>
+ <sect2 id="repack-progress-reporting">
+ <title>REPACK Progress Reporting</title>
+
+ <indexterm>
+ <primary>pg_stat_progress_repack</primary>
+ </indexterm>
+
+ <para>
+ Whenever <command>REPACK</command> is running,
+ the <structname>pg_stat_progress_repack</structname> view will contain a
+ row for each backend that is currently running the command. The tables
+ below describe the information that will be reported and provide
+ information about how to interpret it.
+ </para>
+
+ <table id="pg-stat-progress-repack-view" xreflabel="pg_stat_progress_repack">
+ <title><structname>pg_stat_progress_repack</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>pid</structfield> <type>integer</type>
+ </para>
+ <para>
+ Process ID of backend.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>datid</structfield> <type>oid</type>
+ </para>
+ <para>
+ OID of the database to which this backend is connected.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>datname</structfield> <type>name</type>
+ </para>
+ <para>
+ Name of the database to which this backend is connected.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>relid</structfield> <type>oid</type>
+ </para>
+ <para>
+ OID of the table being repacked.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>command</structfield> <type>text</type>
+ </para>
+ <para>
+ The command that is running. Currently, the only value
+ is <literal>REPACK</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>phase</structfield> <type>text</type>
+ </para>
+ <para>
+ Current processing phase. See <xref linkend="repack-phases"/>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>repack_index_relid</structfield> <type>oid</type>
+ </para>
+ <para>
+ If the table is being scanned using an index, this is the OID of the
+ index being used; otherwise, it is zero.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_scanned</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples scanned.
+ This counter only advances when the phase is
+ <literal>seq scanning heap</literal>,
+ <literal>index scanning heap</literal>
+ or <literal>writing new heap</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_written</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples written.
+ This counter only advances when the phase is
+ <literal>seq scanning heap</literal>,
+ <literal>index scanning heap</literal>
+ or <literal>writing new heap</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_blks_total</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Total number of heap blocks in the table. This number is reported
+ as of the beginning of <literal>seq scanning heap</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_blks_scanned</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap blocks scanned. This counter only advances when the
+ phase is <literal>seq scanning heap</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>index_rebuild_count</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of indexes rebuilt. This counter only advances when the phase
+ is <literal>rebuilding index</literal>.
+ </para></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <table id="repack-phases">
+ <title>REPACK Phases</title>
+ <tgroup cols="2">
+ <colspec colname="col1" colwidth="1*"/>
+ <colspec colname="col2" colwidth="2*"/>
+ <thead>
+ <row>
+ <entry>Phase</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry><literal>initializing</literal></entry>
+ <entry>
+ The command is preparing to begin scanning the heap. This phase is
+ expected to be very brief.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>seq scanning heap</literal></entry>
+ <entry>
+ The command is currently scanning the table using a sequential scan.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>index scanning heap</literal></entry>
+ <entry>
+ <command>REPACK</command> is currently scanning the table using an index scan.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>sorting tuples</literal></entry>
+ <entry>
+ <command>REPACK</command> is currently sorting tuples.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>writing new heap</literal></entry>
+ <entry>
+ <command>REPACK</command> is currently writing the new heap.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>swapping relation files</literal></entry>
+ <entry>
+ The command is currently swapping newly-built files into place.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>rebuilding index</literal></entry>
+ <entry>
+ The command is currently rebuilding an index.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>performing final cleanup</literal></entry>
+ <entry>
+ The command is performing final cleanup. When this phase is
+ completed, <command>REPACK</command> will end.
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+ </sect2>
+
<sect2 id="copy-progress-reporting">
<title>COPY Progress Reporting</title>
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..c0ef654fcb4 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -167,6 +167,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY refreshMaterializedView SYSTEM "refresh_materialized_view.sgml">
<!ENTITY reindex SYSTEM "reindex.sgml">
<!ENTITY releaseSavepoint SYSTEM "release_savepoint.sgml">
+<!ENTITY repack SYSTEM "repack.sgml">
<!ENTITY reset SYSTEM "reset.sgml">
<!ENTITY revoke SYSTEM "revoke.sgml">
<!ENTITY rollback SYSTEM "rollback.sgml">
diff --git a/doc/src/sgml/ref/cluster.sgml b/doc/src/sgml/ref/cluster.sgml
index 8811f169ea0..54bb2362c84 100644
--- a/doc/src/sgml/ref/cluster.sgml
+++ b/doc/src/sgml/ref/cluster.sgml
@@ -42,17 +42,23 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
<replaceable class="parameter">table_name</replaceable>.
</para>
- <para>
- When a table is clustered, it is physically reordered
- based on the index information. Clustering is a one-time operation:
- when the table is subsequently updated, the changes are
- not clustered. That is, no attempt is made to store new or
- updated rows according to their index order. (If one wishes, one can
- periodically recluster by issuing the command again. Also, setting
- the table's <literal>fillfactor</literal> storage parameter to less than
- 100% can aid in preserving cluster ordering during updates, since updated
- rows are kept on the same page if enough space is available there.)
- </para>
+ <warning>
+ <para>
+ The <command>CLUSTER</command> command is deprecated in favor of
+ <xref linkend="sql-repack"/>.
+ </para>
+ </warning>
+
+ <note>
+ <para>
+ <xref linkend="sql-repack-notes-on-clustering"/> explain how clustering
+ works, whether it is initiated by <command>CLUSTER</command> or
+ by <command>REPACK</command>. The notable difference between the two is
+ that <command>REPACK</command> does not remember the index used last
+ time. Thus if you don't specify an index, <command>REPACK</command>
+ rewrites the table but does not try to cluster it.
+ </para>
+ </note>
<para>
When a table is clustered, <productname>PostgreSQL</productname>
@@ -136,63 +142,12 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
on the table.
</para>
- <para>
- In cases where you are accessing single rows randomly
- within a table, the actual order of the data in the
- table is unimportant. However, if you tend to access some
- data more than others, and there is an index that groups
- them together, you will benefit from using <command>CLUSTER</command>.
- If you are requesting a range of indexed values from a table, or a
- single indexed value that has multiple rows that match,
- <command>CLUSTER</command> will help because once the index identifies the
- table page for the first row that matches, all other rows
- that match are probably already on the same table page,
- and so you save disk accesses and speed up the query.
- </para>
-
- <para>
- <command>CLUSTER</command> can re-sort the table using either an index scan
- on the specified index, or (if the index is a b-tree) a sequential
- scan followed by sorting. It will attempt to choose the method that
- will be faster, based on planner cost parameters and available statistical
- information.
- </para>
-
<para>
While <command>CLUSTER</command> is running, the <xref
linkend="guc-search-path"/> is temporarily changed to <literal>pg_catalog,
pg_temp</literal>.
</para>
- <para>
- When an index scan is used, a temporary copy of the table is created that
- contains the table data in the index order. Temporary copies of each
- index on the table are created as well. Therefore, you need free space on
- disk at least equal to the sum of the table size and the index sizes.
- </para>
-
- <para>
- When a sequential scan and sort is used, a temporary sort file is
- also created, so that the peak temporary space requirement is as much
- as double the table size, plus the index sizes. This method is often
- faster than the index scan method, but if the disk space requirement is
- intolerable, you can disable this choice by temporarily setting <xref
- linkend="guc-enable-sort"/> to <literal>off</literal>.
- </para>
-
- <para>
- It is advisable to set <xref linkend="guc-maintenance-work-mem"/> to
- a reasonably large value (but not more than the amount of RAM you can
- dedicate to the <command>CLUSTER</command> operation) before clustering.
- </para>
-
- <para>
- Because the planner records statistics about the ordering of
- tables, it is advisable to run <link linkend="sql-analyze"><command>ANALYZE</command></link>
- on the newly clustered table.
- Otherwise, the planner might make poor choices of query plans.
- </para>
-
<para>
Because <command>CLUSTER</command> remembers which indexes are clustered,
one can cluster the tables one wants clustered manually the first time,
diff --git a/doc/src/sgml/ref/repack.sgml b/doc/src/sgml/ref/repack.sgml
new file mode 100644
index 00000000000..2fcbd75106f
--- /dev/null
+++ b/doc/src/sgml/ref/repack.sgml
@@ -0,0 +1,256 @@
+<!--
+doc/src/sgml/ref/repack.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-repack">
+ <indexterm zone="sql-repack">
+ <primary>REPACK</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle>REPACK</refentrytitle>
+ <manvolnum>7</manvolnum>
+ <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>REPACK</refname>
+ <refpurpose>rewrite a table to reclaim disk space</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
+[ <replaceable class="parameter">table_name</replaceable> [ USING INDEX
+<replaceable class="parameter">index_name</replaceable> ] ]
+
+<phrase>where <replaceable class="parameter">option</replaceable> can be one of:</phrase>
+
+ VERBOSE [ <replaceable class="parameter">boolean</replaceable> ]
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+
+ <para>
+ <command>REPACK</command> reclaims storage occupied by dead
+ tuples. Unlike <command>VACUUM</command>, it does so by rewriting the
+ entire contents of the table specified
+ by <replaceable class="parameter">table_name</replaceable> into a new disk
+ file with no extra space (except for the space guaranteed by
+ the <literal>fillfactor</literal> storage parameter), allowing unused space
+ to be returned to the operating system.
+ </para>
+
+ <para>
+ Without
+ a <replaceable class="parameter">table_name</replaceable>, <command>REPACK</command>
+ processes every table and materialized view in the current database that
+ the current user has the <literal>MAINTAIN</literal> privilege on. This
+ form of <command>REPACK</command> cannot be executed inside a transaction
+ block.
+ </para>
+
+ <para>
+ If <replaceable class="parameter">index_name</replaceable> is specified,
+ the table is clustered by this index. Please see the notes on clustering
+ below.
+ </para>
+
+ <para>
+ When a table is being repacked, an <literal>ACCESS EXCLUSIVE</literal> lock
+ is acquired on it. This prevents any other database operations (both reads
+ and writes) from operating on the table until the <command>REPACK</command>
+ is finished.
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Parameters</title>
+
+ <variablelist>
+ <varlistentry>
+ <term><replaceable class="parameter">table_name</replaceable></term>
+ <listitem>
+ <para>
+ The name (possibly schema-qualified) of a table.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><replaceable class="parameter">index_name</replaceable></term>
+ <listitem>
+ <para>
+ The name of an index.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>VERBOSE</literal></term>
+ <listitem>
+ <para>
+ Prints a progress report as each table is repacked
+ at <literal>INFO</literal> level.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><replaceable class="parameter">boolean</replaceable></term>
+ <listitem>
+ <para>
+ Specifies whether the selected option should be turned on or off.
+ You can write <literal>TRUE</literal>, <literal>ON</literal>, or
+ <literal>1</literal> to enable the option, and <literal>FALSE</literal>,
+ <literal>OFF</literal>, or <literal>0</literal> to disable it. The
+ <replaceable class="parameter">boolean</replaceable> value can also
+ be omitted, in which case <literal>TRUE</literal> is assumed.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </refsect1>
+
+ <refsect1>
+ <title>Notes</title>
+
+ <para>
+ To repack a table, one must have the <literal>MAINTAIN</literal> privilege
+ on the table.
+ </para>
+
+ <para>
+ While <command>REPACK</command> is running, the <xref
+ linkend="guc-search-path"/> is temporarily changed to <literal>pg_catalog,
+ pg_temp</literal>.
+ </para>
+
+ <para>
+ Each backend running <command>REPACK</command> will report its progress
+ in the <structname>pg_stat_progress_repack</structname> view. See
+ <xref linkend="repack-progress-reporting"/> for details.
+ </para>
+
+ <para>
+ Repacking a partitioned table repacks each of its partitions. If an index
+   is specified, each partition is repacked using the corresponding
+   partition of that index. <command>REPACK</command> on a partitioned
+   table cannot be executed
+ inside a transaction block.
+ </para>
+
+ <refsect2 id="sql-repack-notes-on-clustering" xreflabel="Notes on Clustering">
+ <title>Notes on Clustering</title>
+
+ <para>
+ When a table is clustered, it is physically reordered based on the index
+ information. Clustering is a one-time operation: when the table is
+ subsequently updated, the changes are not clustered. That is, no attempt
+ is made to store new or updated rows according to their index order. (If
+ one wishes, one can periodically recluster by issuing the command again.
+ Also, setting the table's <literal>fillfactor</literal> storage parameter
+ to less than 100% can aid in preserving cluster ordering during updates,
+ since updated rows are kept on the same page if enough space is available
+ there.)
+ </para>
+
+ <para>
+ In cases where you are accessing single rows randomly within a table, the
+ actual order of the data in the table is unimportant. However, if you tend
+ to access some data more than others, and there is an index that groups
+ them together, you will benefit from using <command>REPACK</command>. If
+ you are requesting a range of indexed values from a table, or a single
+ indexed value that has multiple rows that match,
+ <command>REPACK</command> will help because once the index identifies the
+ table page for the first row that matches, all other rows that match are
+ probably already on the same table page, and so you save disk accesses and
+ speed up the query.
+ </para>
+
+ <para>
+ <command>REPACK</command> can re-sort the table using either an index scan
+ on the specified index (if the index is a b-tree), or a sequential scan
+ followed by sorting. It will attempt to choose the method that will be
+ faster, based on planner cost parameters and available statistical
+ information.
+ </para>
+
+ <para>
+ Because the planner records statistics about the ordering of tables, it is
+ advisable to
+ run <link linkend="sql-analyze"><command>ANALYZE</command></link> on the
+ newly repacked table. Otherwise, the planner might make poor choices of
+ query plans.
+ </para>
+ </refsect2>
+
+ <refsect2 id="sql-repack-notes-on-resources" xreflabel="Notes on Resources">
+ <title>Notes on Resources</title>
+
+ <para>
+ When an index scan or a sequential scan without sort is used, a temporary
+ copy of the table is created that contains the table data. Temporary
+ copies of each index on the table are created as well. Therefore, you need
+ free space on disk at least equal to the sum of the table size and the
+ index sizes.
+ </para>
+
+ <para>
+ When a sequential scan and sort is used, a temporary sort file is also
+ created, so that the peak temporary space requirement is as much as double
+ the table size, plus the index sizes. This method is often faster than
+ the index scan method, but if the disk space requirement is intolerable,
+ you can disable this choice by temporarily setting
+ <xref linkend="guc-enable-sort"/> to <literal>off</literal>.
+ </para>
+
+ <para>
+ It is advisable to set <xref linkend="guc-maintenance-work-mem"/> to a
+ reasonably large value (but not more than the amount of RAM you can
+ dedicate to the <command>REPACK</command> operation) before repacking.
+ </para>
+ </refsect2>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Examples</title>
+
+ <para>
+ Repack the table <literal>employees</literal>:
+<programlisting>
+REPACK employees;
+</programlisting>
+ </para>
+
+ <para>
+   Repack the table <literal>employees</literal> based on its
+   index <literal>employees_ind</literal> (since an index is specified,
+   this is effectively clustering):
+<programlisting>
+REPACK employees USING INDEX employees_ind;
+</programlisting>
+ </para>
+
+ <para>
+ Repack all tables in the database on which you have
+ the <literal>MAINTAIN</literal> privilege:
+<programlisting>
+REPACK;
+</programlisting></para>
+ </refsect1>
+
+ <refsect1>
+ <title>Compatibility</title>
+
+ <para>
+ There is no <command>REPACK</command> statement in the SQL standard.
+ </para>
+
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index bd5dcaf86a5..735a2a7703a 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -98,6 +98,14 @@ VACUUM [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
<varlistentry>
<term><literal>FULL</literal></term>
<listitem>
+
+ <warning>
+ <para>
+ The <command>FULL</command> parameter is deprecated in favor of
+ <xref linkend="sql-repack"/>.
+ </para>
+ </warning>
+
<para>
Selects <quote>full</quote> vacuum, which can reclaim more
space, but takes much longer and exclusively locks the table.
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..229912d35b7 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -195,6 +195,7 @@
&refreshMaterializedView;
&reindex;
&releaseSavepoint;
+ &repack;
&reset;
&revoke;
&rollback;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index ac082fefa77..d91e66241fb 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -741,13 +741,13 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
if (OldIndex != NULL && !use_sort)
{
const int ci_index[] = {
- PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_INDEX_RELID
+ PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_INDEX_RELID
};
int64 ci_val[2];
/* Set phase and OIDOldIndex to columns */
- ci_val[0] = PROGRESS_CLUSTER_PHASE_INDEX_SCAN_HEAP;
+ ci_val[0] = PROGRESS_REPACK_PHASE_INDEX_SCAN_HEAP;
ci_val[1] = RelationGetRelid(OldIndex);
pgstat_progress_update_multi_param(2, ci_index, ci_val);
@@ -759,15 +759,15 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
else
{
/* In scan-and-sort mode and also VACUUM FULL, set phase */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_SEQ_SCAN_HEAP);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_SEQ_SCAN_HEAP);
tableScan = table_beginscan(OldHeap, SnapshotAny, 0, (ScanKey) NULL);
heapScan = (HeapScanDesc) tableScan;
indexScan = NULL;
/* Set total heap blocks */
- pgstat_progress_update_param(PROGRESS_CLUSTER_TOTAL_HEAP_BLKS,
+ pgstat_progress_update_param(PROGRESS_REPACK_TOTAL_HEAP_BLKS,
heapScan->rs_nblocks);
}
@@ -809,7 +809,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
* is manually updated to the correct value when the table
* scan finishes.
*/
- pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_BLKS_SCANNED,
+ pgstat_progress_update_param(PROGRESS_REPACK_HEAP_BLKS_SCANNED,
heapScan->rs_nblocks);
break;
}
@@ -825,7 +825,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
*/
if (prev_cblock != heapScan->rs_cblock)
{
- pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_BLKS_SCANNED,
+ pgstat_progress_update_param(PROGRESS_REPACK_HEAP_BLKS_SCANNED,
(heapScan->rs_cblock +
heapScan->rs_nblocks -
heapScan->rs_startblock
@@ -912,14 +912,14 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
* In scan-and-sort mode, report increase in number of tuples
* scanned
*/
- pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED,
+ pgstat_progress_update_param(PROGRESS_REPACK_HEAP_TUPLES_SCANNED,
*num_tuples);
}
else
{
const int ct_index[] = {
- PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED,
- PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN
+ PROGRESS_REPACK_HEAP_TUPLES_SCANNED,
+ PROGRESS_REPACK_HEAP_TUPLES_WRITTEN
};
int64 ct_val[2];
@@ -952,14 +952,14 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
double n_tuples = 0;
/* Report that we are now sorting tuples */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_SORT_TUPLES);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_SORT_TUPLES);
tuplesort_performsort(tuplesort);
/* Report that we are now writing new heap */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_WRITE_NEW_HEAP);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_WRITE_NEW_HEAP);
for (;;)
{
@@ -977,7 +977,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
values, isnull,
rwstate);
/* Report n_tuples */
- pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN,
+ pgstat_progress_update_param(PROGRESS_REPACK_HEAP_TUPLES_WRITTEN,
n_tuples);
}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 739a92bdcc1..466cf0fdef6 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -4079,7 +4079,7 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
Assert(!ReindexIsProcessingIndex(indexOid));
/* Set index rebuild count */
- pgstat_progress_update_param(PROGRESS_CLUSTER_INDEX_REBUILD_COUNT,
+ pgstat_progress_update_param(PROGRESS_REPACK_INDEX_REBUILD_COUNT,
i);
i++;
}
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 273008db37f..1d2ea145fe7 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1263,6 +1263,33 @@ CREATE VIEW pg_stat_progress_cluster AS
FROM pg_stat_get_progress_info('CLUSTER') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
+CREATE VIEW pg_stat_progress_repack AS
+ SELECT
+ S.pid AS pid,
+ S.datid AS datid,
+ D.datname AS datname,
+ S.relid AS relid,
+ CASE S.param1 WHEN 1 THEN 'REPACK'
+ END AS command,
+ CASE S.param2 WHEN 0 THEN 'initializing'
+ WHEN 1 THEN 'seq scanning heap'
+ WHEN 2 THEN 'index scanning heap'
+ WHEN 3 THEN 'sorting tuples'
+ WHEN 4 THEN 'writing new heap'
+ WHEN 5 THEN 'swapping relation files'
+ WHEN 6 THEN 'rebuilding index'
+ WHEN 7 THEN 'performing final cleanup'
+ END AS phase,
+ CAST(S.param3 AS oid) AS repack_index_relid,
+ S.param4 AS heap_tuples_scanned,
+ S.param5 AS heap_tuples_written,
+ S.param6 AS heap_blks_total,
+ S.param7 AS heap_blks_scanned,
+ S.param8 AS index_rebuild_count
+ FROM pg_stat_get_progress_info('REPACK') AS S
+ LEFT JOIN pg_database D ON S.datid = D.oid;
+
+
CREATE VIEW pg_stat_progress_create_index AS
SELECT
S.pid AS pid, S.datid AS datid, D.datname AS datname,
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 54a08e4102e..dab6499127e 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -67,17 +67,21 @@ typedef struct
Oid indexOid;
} RelToCluster;
-
-static void cluster_multiple_rels(List *rtcs, ClusterParams *params);
+static void cluster_multiple_rels(List *rtcs, ClusterParams *params,
+ ClusterCommand cmd);
static void rebuild_relation(Relation OldHeap, Relation index, bool verbose);
static void copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
bool verbose, bool *pSwapToastByContent,
TransactionId *pFreezeXid, MultiXactId *pCutoffMulti);
static List *get_tables_to_cluster(MemoryContext cluster_context);
+static List *get_tables_to_repack(MemoryContext repack_context);
static List *get_tables_to_cluster_partitioned(MemoryContext cluster_context,
- Oid indexOid);
+ Oid relid, bool rel_is_index);
static bool cluster_is_permitted_for_relation(Oid relid, Oid userid);
-
+static Relation process_single_relation(RangeVar *relation, char *indexname,
+ ClusterParams *params,
+ ClusterCommand cmd,
+ Oid *indexOid_p);
/*---------------------------------------------------------------------------
* This cluster code allows for clustering multiple tables at once. Because
@@ -134,71 +138,11 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
if (stmt->relation != NULL)
{
/* This is the single-relation case. */
- Oid tableOid;
-
- /*
- * Find, lock, and check permissions on the table. We obtain
- * AccessExclusiveLock right away to avoid lock-upgrade hazard in the
- * single-transaction case.
- */
- tableOid = RangeVarGetRelidExtended(stmt->relation,
- AccessExclusiveLock,
- 0,
- RangeVarCallbackMaintainsTable,
- NULL);
- rel = table_open(tableOid, NoLock);
-
- /*
- * Reject clustering a remote temp table ... their local buffer
- * manager is not going to cope.
- */
- if (RELATION_IS_OTHER_TEMP(rel))
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot cluster temporary tables of other sessions")));
-
- if (stmt->indexname == NULL)
- {
- ListCell *index;
-
- /* We need to find the index that has indisclustered set. */
- foreach(index, RelationGetIndexList(rel))
- {
- indexOid = lfirst_oid(index);
- if (get_index_isclustered(indexOid))
- break;
- indexOid = InvalidOid;
- }
-
- if (!OidIsValid(indexOid))
- ereport(ERROR,
- (errcode(ERRCODE_UNDEFINED_OBJECT),
- errmsg("there is no previously clustered index for table \"%s\"",
- stmt->relation->relname)));
- }
- else
- {
- /*
- * The index is expected to be in the same namespace as the
- * relation.
- */
- indexOid = get_relname_relid(stmt->indexname,
- rel->rd_rel->relnamespace);
- if (!OidIsValid(indexOid))
- ereport(ERROR,
- (errcode(ERRCODE_UNDEFINED_OBJECT),
- errmsg("index \"%s\" for table \"%s\" does not exist",
- stmt->indexname, stmt->relation->relname)));
- }
-
- /* For non-partitioned tables, do what we came here to do. */
- if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
- {
- cluster_rel(rel, indexOid, ¶ms);
- /* cluster_rel closes the relation, but keeps lock */
-
+ rel = process_single_relation(stmt->relation, stmt->indexname,
+ ¶ms, CLUSTER_COMMAND_CLUSTER,
+ &indexOid);
+ if (rel == NULL)
return;
- }
}
/*
@@ -231,7 +175,8 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
{
Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
check_index_is_clusterable(rel, indexOid, AccessShareLock);
- rtcs = get_tables_to_cluster_partitioned(cluster_context, indexOid);
+ rtcs = get_tables_to_cluster_partitioned(cluster_context, indexOid,
+ true);
/* close relation, releasing lock on parent table */
table_close(rel, AccessExclusiveLock);
@@ -243,7 +188,7 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
}
/* Do the job. */
- cluster_multiple_rels(rtcs, ¶ms);
+ cluster_multiple_rels(rtcs, ¶ms, CLUSTER_COMMAND_CLUSTER);
/* Start a new transaction for the cleanup work. */
StartTransactionCommand();
@@ -260,7 +205,7 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
* return.
*/
static void
-cluster_multiple_rels(List *rtcs, ClusterParams *params)
+cluster_multiple_rels(List *rtcs, ClusterParams *params, ClusterCommand cmd)
{
ListCell *lc;
@@ -283,7 +228,7 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
rel = table_open(rtc->tableOid, AccessExclusiveLock);
/* Process this table */
- cluster_rel(rel, rtc->indexOid, params);
+ cluster_rel(rel, rtc->indexOid, params, cmd);
/* cluster_rel closes the relation, but keeps lock */
PopActiveSnapshot();
@@ -306,9 +251,13 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
* If indexOid is InvalidOid, the table will be rewritten in physical order
* instead of index order. This is the new implementation of VACUUM FULL,
* and error messages should refer to the operation as VACUUM not CLUSTER.
+ *
+ * 'cmd' indicates which command is being executed.  REPACK should be the only
+ * caller of this function in the future.
*/
void
-cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
+cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
+ ClusterCommand cmd)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid save_userid;
@@ -323,13 +272,26 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
/* Check for user-requested abort. */
CHECK_FOR_INTERRUPTS();
- pgstat_progress_start_command(PROGRESS_COMMAND_CLUSTER, tableOid);
- if (OidIsValid(indexOid))
- pgstat_progress_update_param(PROGRESS_CLUSTER_COMMAND,
+ if (cmd == CLUSTER_COMMAND_REPACK)
+ pgstat_progress_start_command(PROGRESS_COMMAND_REPACK, tableOid);
+ else
+ pgstat_progress_start_command(PROGRESS_COMMAND_CLUSTER, tableOid);
+
+ if (cmd == CLUSTER_COMMAND_REPACK)
+ pgstat_progress_update_param(PROGRESS_REPACK_COMMAND,
+ PROGRESS_REPACK_COMMAND_REPACK);
+ else if (OidIsValid(indexOid))
+ {
+ Assert(cmd == CLUSTER_COMMAND_CLUSTER);
+ pgstat_progress_update_param(PROGRESS_REPACK_COMMAND,
PROGRESS_CLUSTER_COMMAND_CLUSTER);
+ }
else
- pgstat_progress_update_param(PROGRESS_CLUSTER_COMMAND,
+ {
+ Assert(cmd == CLUSTER_COMMAND_VACUUM);
+ pgstat_progress_update_param(PROGRESS_REPACK_COMMAND,
PROGRESS_CLUSTER_COMMAND_VACUUM_FULL);
+ }
/*
* Switch to the table owner's userid, so that any index functions are run
@@ -403,8 +365,12 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
* would work in most respects, but the index would only get marked as
* indisclustered in the current database, leading to unexpected behavior
* if CLUSTER were later invoked in another database.
+ *
+ * REPACK does not set indisclustered. XXX Not sure I understand the
+ * comment above: how can an attribute be set "only in the current
+ * database"?
*/
- if (OidIsValid(indexOid) && OldHeap->rd_rel->relisshared)
+ if (cmd == CLUSTER_COMMAND_CLUSTER && OldHeap->rd_rel->relisshared)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("cannot cluster a shared catalog")));
@@ -415,21 +381,19 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
*/
if (RELATION_IS_OTHER_TEMP(OldHeap))
{
- if (OidIsValid(indexOid))
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot cluster temporary tables of other sessions")));
- else
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot vacuum temporary tables of other sessions")));
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot cluster temporary tables of other sessions")));
}
/*
* Also check for active uses of the relation in the current transaction,
* including open scans and pending AFTER trigger events.
*/
- CheckTableNotInUse(OldHeap, OidIsValid(indexOid) ? "CLUSTER" : "VACUUM");
+ CheckTableNotInUse(OldHeap,
+ (cmd == CLUSTER_COMMAND_CLUSTER ?
+ "CLUSTER" : (cmd == CLUSTER_COMMAND_REPACK ?
+ "REPACK" : "VACUUM")));
/* Check heap and index are valid to cluster on */
if (OidIsValid(indexOid))
@@ -1458,8 +1422,8 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
int i;
/* Report that we are now swapping relation files */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_SWAP_REL_FILES);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_SWAP_REL_FILES);
/* Zero out possible results from swapped_relation_files */
memset(mapped_tables, 0, sizeof(mapped_tables));
@@ -1509,14 +1473,14 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
/* Report that we are now reindexing relations */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_REBUILD_INDEX);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_REBUILD_INDEX);
reindex_relation(NULL, OIDOldHeap, reindex_flags, &reindex_params);
/* Report that we are now doing clean up */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_FINAL_CLEANUP);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_FINAL_CLEANUP);
/*
* If the relation being rebuilt is pg_class, swap_relation_files()
@@ -1687,14 +1651,66 @@ get_tables_to_cluster(MemoryContext cluster_context)
}
/*
- * Given an index on a partitioned table, return a list of RelToCluster for
+ * Like get_tables_to_cluster(), but do not care about indexes.
+ */
+static List *
+get_tables_to_repack(MemoryContext repack_context)
+{
+ Relation relrelation;
+ TableScanDesc scan;
+ HeapTuple tuple;
+ MemoryContext old_context;
+ List *rtcs = NIL;
+
+ /*
+ * Get all relations that the current user has the appropriate privileges
+ * for.
+ */
+ relrelation = table_open(RelationRelationId, AccessShareLock);
+ scan = table_beginscan_catalog(relrelation, 0, NULL);
+ while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+ {
+ RelToCluster *rtc;
+ Form_pg_class relrelation = (Form_pg_class) GETSTRUCT(tuple);
+ Oid relid = relrelation->oid;
+
+ /* Only interested in plain relations. */
+ if (get_rel_relkind(relid) != RELKIND_RELATION)
+ continue;
+
+ if (!cluster_is_permitted_for_relation(relid, GetUserId()))
+ continue;
+
+ /* Use a permanent memory context for the result list */
+ old_context = MemoryContextSwitchTo(repack_context);
+
+ rtc = (RelToCluster *) palloc(sizeof(RelToCluster));
+ rtc->tableOid = relid;
+ rtc->indexOid = InvalidOid;
+ rtcs = lappend(rtcs, rtc);
+
+ MemoryContextSwitchTo(old_context);
+ }
+ table_endscan(scan);
+
+ relation_close(relrelation, AccessShareLock);
+
+ return rtcs;
+}
+
+/*
+ * Given a partitioned table or its index, return a list of RelToCluster for
all the leaf tables/indexes.
*
* Like expand_vacuum_rel, but here caller must hold AccessExclusiveLock
* on the table containing the index.
+ *
+ * 'rel_is_index' tells whether 'relid' is that of an index (true) or of the
+ * owning relation.
*/
static List *
-get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
+get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid relid,
+ bool rel_is_index)
{
List *inhoids;
ListCell *lc;
@@ -1702,17 +1718,33 @@ get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
MemoryContext old_context;
/* Do not lock the children until they're processed */
- inhoids = find_all_inheritors(indexOid, NoLock, NULL);
+ inhoids = find_all_inheritors(relid, NoLock, NULL);
foreach(lc, inhoids)
{
- Oid indexrelid = lfirst_oid(lc);
- Oid relid = IndexGetRelation(indexrelid, false);
+ Oid inhoid = lfirst_oid(lc);
+ Oid inhrelid,
+ inhindid;
RelToCluster *rtc;
- /* consider only leaf indexes */
- if (get_rel_relkind(indexrelid) != RELKIND_INDEX)
- continue;
+ if (rel_is_index)
+ {
+ /* consider only leaf indexes */
+ if (get_rel_relkind(inhoid) != RELKIND_INDEX)
+ continue;
+
+ inhrelid = IndexGetRelation(inhoid, false);
+ inhindid = inhoid;
+ }
+ else
+ {
+ /* consider only leaf relations */
+ if (get_rel_relkind(inhoid) != RELKIND_RELATION)
+ continue;
+
+ inhrelid = inhoid;
+ inhindid = InvalidOid;
+ }
/*
* It's possible that the user does not have privileges to CLUSTER the
@@ -1720,15 +1752,15 @@ get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
* table. We skip any partitions which the user is not permitted to
* CLUSTER.
*/
- if (!cluster_is_permitted_for_relation(relid, GetUserId()))
+ if (!cluster_is_permitted_for_relation(inhrelid, GetUserId()))
continue;
/* Use a permanent memory context for the result list */
old_context = MemoryContextSwitchTo(cluster_context);
rtc = (RelToCluster *) palloc(sizeof(RelToCluster));
- rtc->tableOid = relid;
- rtc->indexOid = indexrelid;
+ rtc->tableOid = inhrelid;
+ rtc->indexOid = inhindid;
rtcs = lappend(rtcs, rtc);
MemoryContextSwitchTo(old_context);
@@ -1752,3 +1784,179 @@ cluster_is_permitted_for_relation(Oid relid, Oid userid)
get_rel_name(relid))));
return false;
}
+
+/*
+ * REPACK is intended to be a replacement of both CLUSTER and VACUUM FULL.
+ */
+void
+repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
+{
+ ListCell *lc;
+ ClusterParams params = {0};
+ bool verbose = false;
+ Relation rel = NULL;
+ Oid indexOid = InvalidOid;
+ MemoryContext repack_context;
+ List *rtcs;
+
+ /* Parse option list */
+ foreach(lc, stmt->params)
+ {
+ DefElem *opt = (DefElem *) lfirst(lc);
+
+ if (strcmp(opt->defname, "verbose") == 0)
+ verbose = defGetBoolean(opt);
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("unrecognized REPACK option \"%s\"",
+ opt->defname),
+ parser_errposition(pstate, opt->location)));
+ }
+
+ params.options = (verbose ? CLUOPT_VERBOSE : 0);
+
+ if (stmt->relation != NULL)
+ {
+ /* This is the single-relation case. */
+ rel = process_single_relation(stmt->relation, stmt->indexname,
+ ¶ms, CLUSTER_COMMAND_REPACK,
+ &indexOid);
+ if (rel == NULL)
+ return;
+ }
+
+ /*
+ * By here, we know we are in a multi-table situation. In order to avoid
+ * holding locks for too long, we want to process each table in its own
+ * transaction. This forces us to disallow running inside a user
+ * transaction block.
+ */
+ PreventInTransactionBlock(isTopLevel, "REPACK");
+
+ /* Also, we need a memory context to hold our list of relations */
+ repack_context = AllocSetContextCreate(PortalContext,
+ "Repack",
+ ALLOCSET_DEFAULT_SIZES);
+
+ params.options |= CLUOPT_RECHECK;
+ if (rel != NULL)
+ {
+ Oid relid;
+ bool rel_is_index;
+
+ Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
+
+ if (OidIsValid(indexOid))
+ {
+ relid = indexOid;
+ rel_is_index = true;
+ }
+ else
+ {
+ relid = RelationGetRelid(rel);
+ rel_is_index = false;
+ }
+ rtcs = get_tables_to_cluster_partitioned(repack_context, relid,
+ rel_is_index);
+
+ /* close relation, releasing lock on parent table */
+ table_close(rel, AccessExclusiveLock);
+ }
+ else
+ rtcs = get_tables_to_repack(repack_context);
+
+ /* Do the job. */
+ cluster_multiple_rels(rtcs, ¶ms, CLUSTER_COMMAND_REPACK);
+
+ /* Start a new transaction for the cleanup work. */
+ StartTransactionCommand();
+
+ /* Clean up working storage */
+ MemoryContextDelete(repack_context);
+
+}
+
+/*
+ * REPACK a single relation if it's a non-partitioned table or a leaf
+ * partition and return NULL. Return the relation's relcache entry if the
+ * caller needs to process it (because the relation is partitioned).
+ */
+static Relation
+process_single_relation(RangeVar *relation, char *indexname,
+ ClusterParams *params, ClusterCommand cmd,
+ Oid *indexOid_p)
+{
+ Relation rel;
+ Oid indexOid = InvalidOid;
+
+ /* This is the single-relation case. */
+ Oid tableOid;
+
+ /*
+ * Find, lock, and check permissions on the table. We obtain
+ * AccessExclusiveLock right away to avoid lock-upgrade hazard in the
+ * single-transaction case.
+ */
+ tableOid = RangeVarGetRelidExtended(relation,
+ AccessExclusiveLock,
+ 0,
+ RangeVarCallbackMaintainsTable,
+ NULL);
+ rel = table_open(tableOid, NoLock);
+
+ /*
+ * Reject clustering a remote temp table ... their local buffer manager is
+ * not going to cope.
+ */
+ if (RELATION_IS_OTHER_TEMP(rel))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot cluster temporary tables of other sessions")));
+
+ if (indexname == NULL && cmd == CLUSTER_COMMAND_CLUSTER)
+ {
+ ListCell *index;
+
+ /* We need to find the index that has indisclustered set. */
+ foreach(index, RelationGetIndexList(rel))
+ {
+ indexOid = lfirst_oid(index);
+ if (get_index_isclustered(indexOid))
+ break;
+ indexOid = InvalidOid;
+ }
+
+ if (!OidIsValid(indexOid))
+ ereport(ERROR,
+ (errcode(ERRCODE_UNDEFINED_OBJECT),
+ errmsg("there is no previously clustered index for table \"%s\"",
+ relation->relname)));
+ }
+ else if (indexname != NULL)
+ {
+ /*
+ * The index is expected to be in the same namespace as the relation.
+ */
+ indexOid = get_relname_relid(indexname,
+ rel->rd_rel->relnamespace);
+ if (!OidIsValid(indexOid))
+ ereport(ERROR,
+ (errcode(ERRCODE_UNDEFINED_OBJECT),
+ errmsg("index \"%s\" for table \"%s\" does not exist",
+ indexname, relation->relname)));
+ }
+
+ *indexOid_p = indexOid;
+
+ /* For non-partitioned tables, do what we came here to do. */
+ if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+ {
+ cluster_rel(rel, indexOid, params, cmd);
+ /* cluster_rel closes the relation, but keeps lock */
+
+ return NULL;
+ }
+
+ return rel;
+}
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index db5da3ce826..a4ad23448f8 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -2263,7 +2263,8 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
cluster_params.options |= CLUOPT_VERBOSE;
/* VACUUM FULL is now a variant of CLUSTER; see cluster.c */
- cluster_rel(rel, InvalidOid, &cluster_params);
+ cluster_rel(rel, InvalidOid, &cluster_params,
+ CLUSTER_COMMAND_VACUUM);
/* cluster_rel closes the relation, but keeps lock */
rel = NULL;
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index f1156e2fca3..ccf630edbb9 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -298,7 +298,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
GrantStmt GrantRoleStmt ImportForeignSchemaStmt IndexStmt InsertStmt
ListenStmt LoadStmt LockStmt MergeStmt NotifyStmt ExplainableStmt PreparableStmt
CreateFunctionStmt AlterFunctionStmt ReindexStmt RemoveAggrStmt
- RemoveFuncStmt RemoveOperStmt RenameStmt ReturnStmt RevokeStmt RevokeRoleStmt
+ RemoveFuncStmt RemoveOperStmt RenameStmt RepackStmt ReturnStmt RevokeStmt RevokeRoleStmt
RuleActionStmt RuleActionStmtOrEmpty RuleStmt
SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
UnlistenStmt UpdateStmt VacuumStmt
@@ -381,7 +381,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
%type <str> copy_file_name
access_method_clause attr_name
table_access_method_clause name cursor_name file_name
- cluster_index_specification
+ cluster_index_specification repack_index_specification
%type <list> func_name handler_name qual_Op qual_all_Op subquery_Op
opt_inline_handler opt_validator validator_clause
@@ -764,7 +764,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
QUOTE QUOTES
RANGE READ REAL REASSIGN RECURSIVE REF_P REFERENCES REFERENCING
- REFRESH REINDEX RELATIVE_P RELEASE RENAME REPEATABLE REPLACE REPLICA
+ REFRESH REINDEX RELATIVE_P RELEASE RENAME REPACK REPEATABLE REPLACE REPLICA
RESET RESTART RESTRICT RETURN RETURNING RETURNS REVOKE RIGHT ROLE ROLLBACK ROLLUP
ROUTINE ROUTINES ROW ROWS RULE
@@ -1100,6 +1100,7 @@ stmt:
| RemoveFuncStmt
| RemoveOperStmt
| RenameStmt
+ | RepackStmt
| RevokeStmt
| RevokeRoleStmt
| RuleStmt
@@ -11893,6 +11894,60 @@ cluster_index_specification:
| /*EMPTY*/ { $$ = NULL; }
;
+/*****************************************************************************
+ *
+ * QUERY:
+ * REPACK [ (options) ] [ <qualified_name> [ USING INDEX <index_name> ] ]
+ *
+ *****************************************************************************/
+
+RepackStmt:
+ REPACK qualified_name repack_index_specification
+ {
+ RepackStmt *n = makeNode(RepackStmt);
+
+ n->relation = $2;
+ n->indexname = $3;
+ n->params = NIL;
+ $$ = (Node *) n;
+ }
+
+ | REPACK '(' utility_option_list ')' qualified_name repack_index_specification
+ {
+ RepackStmt *n = makeNode(RepackStmt);
+
+ n->relation = $5;
+ n->indexname = $6;
+ n->params = $3;
+ $$ = (Node *) n;
+ }
+
+ | REPACK
+ {
+ RepackStmt *n = makeNode(RepackStmt);
+
+ n->relation = NULL;
+ n->indexname = NULL;
+ n->params = NIL;
+ $$ = (Node *) n;
+ }
+
+ | REPACK '(' utility_option_list ')'
+ {
+ RepackStmt *n = makeNode(RepackStmt);
+
+ n->relation = NULL;
+ n->indexname = NULL;
+ n->params = $3;
+ $$ = (Node *) n;
+ }
+ ;
+
+repack_index_specification:
+ USING INDEX name { $$ = $3; }
+ | /*EMPTY*/ { $$ = NULL; }
+ ;
+
/*****************************************************************************
*
@@ -17934,6 +17989,7 @@ unreserved_keyword:
| RELATIVE_P
| RELEASE
| RENAME
+ | REPACK
| REPEATABLE
| REPLACE
| REPLICA
@@ -18566,6 +18622,7 @@ bare_label_keyword:
| RELATIVE_P
| RELEASE
| RENAME
+ | REPACK
| REPEATABLE
| REPLACE
| REPLICA
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 25fe3d58016..bf3ba3c2ae7 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -280,6 +280,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_ClusterStmt:
case T_ReindexStmt:
case T_VacuumStmt:
+ case T_RepackStmt:
{
/*
* These commands write WAL, so they're not strictly
@@ -862,6 +863,10 @@ standard_ProcessUtility(PlannedStmt *pstmt,
ExecVacuum(pstate, (VacuumStmt *) parsetree, isTopLevel);
break;
+ case T_RepackStmt:
+ repack(pstate, (RepackStmt *) parsetree, isTopLevel);
+ break;
+
case T_ExplainStmt:
ExplainQuery(pstate, (ExplainStmt *) parsetree, params, dest);
break;
@@ -2869,6 +2874,10 @@ CreateCommandTag(Node *parsetree)
tag = CMDTAG_ANALYZE;
break;
+ case T_RepackStmt:
+ tag = CMDTAG_REPACK;
+ break;
+
case T_ExplainStmt:
tag = CMDTAG_EXPLAIN;
break;
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 97af7c6554f..ddec4914ea5 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -268,6 +268,8 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
cmdtype = PROGRESS_COMMAND_ANALYZE;
else if (pg_strcasecmp(cmd, "CLUSTER") == 0)
cmdtype = PROGRESS_COMMAND_CLUSTER;
+ else if (pg_strcasecmp(cmd, "REPACK") == 0)
+ cmdtype = PROGRESS_COMMAND_REPACK;
else if (pg_strcasecmp(cmd, "CREATE INDEX") == 0)
cmdtype = PROGRESS_COMMAND_CREATE_INDEX;
else if (pg_strcasecmp(cmd, "BASEBACKUP") == 0)
diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index c916b9299a8..8512e099b03 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -1223,7 +1223,7 @@ static const char *const sql_commands[] = {
"DELETE FROM", "DISCARD", "DO", "DROP", "END", "EXECUTE", "EXPLAIN",
"FETCH", "GRANT", "IMPORT FOREIGN SCHEMA", "INSERT INTO", "LISTEN", "LOAD", "LOCK",
"MERGE INTO", "MOVE", "NOTIFY", "PREPARE",
- "REASSIGN", "REFRESH MATERIALIZED VIEW", "REINDEX", "RELEASE",
+ "REASSIGN", "REFRESH MATERIALIZED VIEW", "REINDEX", "RELEASE", "REPACK",
"RESET", "REVOKE", "ROLLBACK",
"SAVEPOINT", "SECURITY LABEL", "SELECT", "SET", "SHOW", "START",
"TABLE", "TRUNCATE", "UNLISTEN", "UPDATE", "VACUUM", "VALUES", "WITH",
@@ -4913,6 +4913,35 @@ match_previous_words(int pattern_id,
COMPLETE_WITH_QUERY(Query_for_list_of_tablespaces);
}
+/* REPACK */
+ else if (Matches("REPACK"))
+ COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_clusterables);
+ else if (Matches("REPACK", "(*)"))
+ COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_clusterables);
+ /* If we have REPACK <sth>, then add "USING INDEX" */
+ else if (Matches("REPACK", MatchAnyExcept("(")))
+ COMPLETE_WITH("USING INDEX");
+ /* If we have REPACK (*) <sth>, then add "USING INDEX" */
+ else if (Matches("REPACK", "(*)", MatchAny))
+ COMPLETE_WITH("USING INDEX");
+ /* If we have REPACK <sth> USING, then add the index as well */
+ else if (Matches("REPACK", MatchAny, "USING", "INDEX"))
+ {
+ set_completion_reference(prev3_wd);
+ COMPLETE_WITH_SCHEMA_QUERY(Query_for_index_of_table);
+ }
+ else if (HeadMatches("REPACK", "(*") &&
+ !HeadMatches("REPACK", "(*)"))
+ {
+ /*
+ * This fires if we're in an unfinished parenthesized option list.
+ * get_previous_words treats a completed parenthesized option list as
+ * one word, so the above test is correct.
+ */
+ if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
+ COMPLETE_WITH("VERBOSE");
+ }
+
/* SECURITY LABEL */
else if (Matches("SECURITY"))
COMPLETE_WITH("LABEL");
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 60088a64cbb..3be57c97b3f 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -31,8 +31,24 @@ typedef struct ClusterParams
bits32 options; /* bitmask of CLUOPT_* */
} ClusterParams;
+/*
+ * cluster.c currently implements three nearly identical commands: CLUSTER,
+ * VACUUM FULL and REPACK. Where needed, use this enumeration to distinguish
+ * which of these commands is being executed.
+ *
+ * Remove this stuff when removing the (now deprecated) CLUSTER and VACUUM
+ * FULL commands.
+ */
+typedef enum ClusterCommand
+{
+ CLUSTER_COMMAND_CLUSTER,
+ CLUSTER_COMMAND_REPACK,
+ CLUSTER_COMMAND_VACUUM
+} ClusterCommand;
+
extern void cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel);
-extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params);
+extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
+ ClusterCommand cmd);
extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid,
LOCKMODE lockmode);
extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
@@ -48,4 +64,5 @@ extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
MultiXactId cutoffMulti,
char newrelpersistence);
+extern void repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel);
#endif /* CLUSTER_H */
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 7c736e7b03b..7644267e14f 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -56,24 +56,48 @@
#define PROGRESS_ANALYZE_PHASE_COMPUTE_EXT_STATS 4
#define PROGRESS_ANALYZE_PHASE_FINALIZE_ANALYZE 5
-/* Progress parameters for cluster */
-#define PROGRESS_CLUSTER_COMMAND 0
-#define PROGRESS_CLUSTER_PHASE 1
-#define PROGRESS_CLUSTER_INDEX_RELID 2
-#define PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED 3
-#define PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN 4
-#define PROGRESS_CLUSTER_TOTAL_HEAP_BLKS 5
-#define PROGRESS_CLUSTER_HEAP_BLKS_SCANNED 6
-#define PROGRESS_CLUSTER_INDEX_REBUILD_COUNT 7
-
-/* Phases of cluster (as advertised via PROGRESS_CLUSTER_PHASE) */
-#define PROGRESS_CLUSTER_PHASE_SEQ_SCAN_HEAP 1
-#define PROGRESS_CLUSTER_PHASE_INDEX_SCAN_HEAP 2
-#define PROGRESS_CLUSTER_PHASE_SORT_TUPLES 3
-#define PROGRESS_CLUSTER_PHASE_WRITE_NEW_HEAP 4
-#define PROGRESS_CLUSTER_PHASE_SWAP_REL_FILES 5
-#define PROGRESS_CLUSTER_PHASE_REBUILD_INDEX 6
-#define PROGRESS_CLUSTER_PHASE_FINAL_CLEANUP 7
+/*
+ * Progress parameters for REPACK.
+ *
+ * Note: Since REPACK shares some code with CLUSTER, (some of) these values
+ * are also used by CLUSTER. (CLUSTER is now deprecated, so it makes no sense
+ * to introduce a separate set of constants.)
+ */
+#define PROGRESS_REPACK_COMMAND 0
+#define PROGRESS_REPACK_PHASE 1
+#define PROGRESS_REPACK_INDEX_RELID 2
+#define PROGRESS_REPACK_HEAP_TUPLES_SCANNED 3
+#define PROGRESS_REPACK_HEAP_TUPLES_WRITTEN 4
+#define PROGRESS_REPACK_TOTAL_HEAP_BLKS 5
+#define PROGRESS_REPACK_HEAP_BLKS_SCANNED 6
+#define PROGRESS_REPACK_INDEX_REBUILD_COUNT 7
+
+/*
+ * Phases of repack (as advertised via PROGRESS_REPACK_PHASE).
+ *
+ * Note: Since REPACK shares some code with CLUSTER, (some of) these values
+ * are also used by CLUSTER. (CLUSTER is now deprecated, so it makes no sense
+ * to introduce a separate set of constants.)
+ */
+#define PROGRESS_REPACK_PHASE_SEQ_SCAN_HEAP 1
+#define PROGRESS_REPACK_PHASE_INDEX_SCAN_HEAP 2
+#define PROGRESS_REPACK_PHASE_SORT_TUPLES 3
+#define PROGRESS_REPACK_PHASE_WRITE_NEW_HEAP 4
+#define PROGRESS_REPACK_PHASE_SWAP_REL_FILES 5
+#define PROGRESS_REPACK_PHASE_REBUILD_INDEX 6
+#define PROGRESS_REPACK_PHASE_FINAL_CLEANUP 7
+
+/* Commands of PROGRESS_REPACK */
+#define PROGRESS_REPACK_COMMAND_REPACK 1
+
+/*
+ * Progress parameters for cluster.
+ *
+ * Although we need to report REPACK and CLUSTER in separate views, the
+ * parameters and phases of CLUSTER are a subset of those of REPACK. Therefore
+ * we just use the appropriate values defined for REPACK above instead of
+ * defining a separate set of constants here.
+ */
/* Commands of PROGRESS_CLUSTER */
#define PROGRESS_CLUSTER_COMMAND_CLUSTER 1
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 4610fc61293..648484205cb 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3923,6 +3923,19 @@ typedef struct ClusterStmt
List *params; /* list of DefElem nodes */
} ClusterStmt;
+/* ----------------------
+ * Repack Statement
+ * ----------------------
+ */
+typedef struct RepackStmt
+{
+ NodeTag type;
+ RangeVar *relation; /* relation being repacked */
+ char *indexname; /* order tuples by this index */
+ List *params; /* list of DefElem nodes */
+} RepackStmt;
+
+
/* ----------------------
* Vacuum and Analyze Statements
*
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index a4af3f717a1..22559369e2c 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -374,6 +374,7 @@ PG_KEYWORD("reindex", REINDEX, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("relative", RELATIVE_P, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("release", RELEASE, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("rename", RENAME, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("repack", REPACK, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("repeatable", REPEATABLE, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("replace", REPLACE, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("replica", REPLICA, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..cceb312f2b3 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -196,6 +196,7 @@ PG_CMDTAG(CMDTAG_REASSIGN_OWNED, "REASSIGN OWNED", false, false, false)
PG_CMDTAG(CMDTAG_REFRESH_MATERIALIZED_VIEW, "REFRESH MATERIALIZED VIEW", true, false, false)
PG_CMDTAG(CMDTAG_REINDEX, "REINDEX", true, false, false)
PG_CMDTAG(CMDTAG_RELEASE, "RELEASE", false, false, false)
+PG_CMDTAG(CMDTAG_REPACK, "REPACK", false, false, false)
PG_CMDTAG(CMDTAG_RESET, "RESET", false, false, false)
PG_CMDTAG(CMDTAG_REVOKE, "REVOKE", true, false, false)
PG_CMDTAG(CMDTAG_REVOKE_ROLE, "REVOKE ROLE", false, false, false)
diff --git a/src/include/utils/backend_progress.h b/src/include/utils/backend_progress.h
index dda813ab407..e69e366dcdc 100644
--- a/src/include/utils/backend_progress.h
+++ b/src/include/utils/backend_progress.h
@@ -28,6 +28,7 @@ typedef enum ProgressCommandType
PROGRESS_COMMAND_CREATE_INDEX,
PROGRESS_COMMAND_BASEBACKUP,
PROGRESS_COMMAND_COPY,
+ PROGRESS_COMMAND_REPACK,
} ProgressCommandType;
#define PGSTAT_NUM_PROGRESS_PARAM 20
diff --git a/src/test/regress/expected/cluster.out b/src/test/regress/expected/cluster.out
index 4d40a6809ab..e9fd7512710 100644
--- a/src/test/regress/expected/cluster.out
+++ b/src/test/regress/expected/cluster.out
@@ -254,6 +254,63 @@ ORDER BY 1;
clstr_tst_pkey
(3 rows)
+-- REPACK handles individual tables identically to CLUSTER, but it's worth
+-- checking if it handles table hierarchies identically as well.
+REPACK clstr_tst USING INDEX clstr_tst_c;
+-- Verify that inheritance link still works
+INSERT INTO clstr_tst_inh VALUES (0, 100, 'in child table 2');
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst;
+ a | b | c | substring | length
+----+-----+------------------+--------------------------------+--------
+ 10 | 14 | catorce | |
+ 18 | 5 | cinco | |
+ 9 | 4 | cuatro | |
+ 26 | 19 | diecinueve | |
+ 12 | 18 | dieciocho | |
+ 30 | 16 | dieciseis | |
+ 24 | 17 | diecisiete | |
+ 2 | 10 | diez | |
+ 23 | 12 | doce | |
+ 11 | 2 | dos | |
+ 25 | 9 | nueve | |
+ 31 | 8 | ocho | |
+ 1 | 11 | once | |
+ 28 | 15 | quince | |
+ 32 | 6 | seis | xyzzyxyzzyxyzzyxyzzyxyzzyxyzzy | 500000
+ 29 | 7 | siete | |
+ 15 | 13 | trece | |
+ 22 | 30 | treinta | |
+ 17 | 32 | treinta y dos | |
+ 3 | 31 | treinta y uno | |
+ 5 | 3 | tres | |
+ 20 | 1 | uno | |
+ 6 | 20 | veinte | |
+ 14 | 25 | veinticinco | |
+ 21 | 24 | veinticuatro | |
+ 4 | 22 | veintidos | |
+ 19 | 29 | veintinueve | |
+ 16 | 28 | veintiocho | |
+ 27 | 26 | veintiseis | |
+ 13 | 27 | veintisiete | |
+ 7 | 23 | veintitres | |
+ 8 | 21 | veintiuno | |
+ 0 | 100 | in child table | |
+ 0 | 100 | in child table 2 | |
+(34 rows)
+
+-- Verify that foreign key link still works
+INSERT INTO clstr_tst (b, c) VALUES (1111, 'this should fail');
+ERROR: insert or update on table "clstr_tst" violates foreign key constraint "clstr_tst_con"
+DETAIL: Key (b)=(1111) is not present in table "clstr_tst_s".
+SELECT conname FROM pg_constraint WHERE conrelid = 'clstr_tst'::regclass
+ORDER BY 1;
+ conname
+----------------------
+ clstr_tst_a_not_null
+ clstr_tst_con
+ clstr_tst_pkey
+(3 rows)
+
SELECT relname, relkind,
EXISTS(SELECT 1 FROM pg_class WHERE oid = c.reltoastrelid) AS hastoast
FROM pg_class c WHERE relname LIKE 'clstr_tst%' ORDER BY relname;
@@ -381,6 +438,35 @@ SELECT * FROM clstr_1;
2
(2 rows)
+-- REPACK w/o argument performs no ordering, so we can only check which tables
+-- have the relfilenode changed.
+RESET SESSION AUTHORIZATION;
+CREATE TEMP TABLE relnodes_old AS
+(SELECT relname, relfilenode
+FROM pg_class
+WHERE relname IN ('clstr_1', 'clstr_2', 'clstr_3'));
+SET SESSION AUTHORIZATION regress_clstr_user;
+SET client_min_messages = ERROR; -- order of "skipping" warnings may vary
+REPACK;
+RESET client_min_messages;
+RESET SESSION AUTHORIZATION;
+CREATE TEMP TABLE relnodes_new AS
+(SELECT relname, relfilenode
+FROM pg_class
+WHERE relname IN ('clstr_1', 'clstr_2', 'clstr_3'));
+-- Do the actual comparison. Unlike CLUSTER, clstr_3 should have been
+-- processed because there is nothing like clustering index here.
+SELECT o.relname FROM relnodes_old o
+JOIN relnodes_new n ON o.relname = n.relname
+WHERE o.relfilenode <> n.relfilenode
+ORDER BY o.relname;
+ relname
+---------
+ clstr_1
+ clstr_3
+(2 rows)
+
+SET SESSION AUTHORIZATION regress_clstr_user;
-- Test MVCC-safety of cluster. There isn't much we can do to verify the
-- results with a single backend...
CREATE TABLE clustertest (key int PRIMARY KEY);
@@ -495,6 +581,43 @@ ALTER TABLE clstrpart SET WITHOUT CLUSTER;
ERROR: cannot mark index clustered in partitioned table
ALTER TABLE clstrpart CLUSTER ON clstrpart_idx;
ERROR: cannot mark index clustered in partitioned table
+-- Check that REPACK sets new relfilenodes: it should process exactly the same
+-- tables as CLUSTER did.
+DROP TABLE old_cluster_info;
+DROP TABLE new_cluster_info;
+CREATE TEMP TABLE old_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+REPACK clstrpart USING INDEX clstrpart_idx;
+CREATE TEMP TABLE new_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM old_cluster_info AS old JOIN new_cluster_info AS new USING (relname) ORDER BY relname COLLATE "C";
+ relname | level | relkind | ?column?
+-------------+-------+---------+----------
+ clstrpart | 0 | p | t
+ clstrpart1 | 1 | p | t
+ clstrpart11 | 2 | r | f
+ clstrpart12 | 2 | p | t
+ clstrpart2 | 1 | r | f
+ clstrpart3 | 1 | p | t
+ clstrpart33 | 2 | r | f
+(7 rows)
+
+-- And finally the same for REPACK w/o index.
+DROP TABLE old_cluster_info;
+DROP TABLE new_cluster_info;
+CREATE TEMP TABLE old_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+REPACK clstrpart;
+CREATE TEMP TABLE new_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM old_cluster_info AS old JOIN new_cluster_info AS new USING (relname) ORDER BY relname COLLATE "C";
+ relname | level | relkind | ?column?
+-------------+-------+---------+----------
+ clstrpart | 0 | p | t
+ clstrpart1 | 1 | p | t
+ clstrpart11 | 2 | r | f
+ clstrpart12 | 2 | p | t
+ clstrpart2 | 1 | r | f
+ clstrpart3 | 1 | p | t
+ clstrpart33 | 2 | r | f
+(7 rows)
+
DROP TABLE clstrpart;
-- Ownership of partitions is checked
CREATE TABLE ptnowner(i int unique) PARTITION BY LIST (i);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 673c63b8d1b..e7513e64fd2 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2058,6 +2058,33 @@ pg_stat_progress_create_index| SELECT s.pid,
s.param15 AS partitions_done
FROM (pg_stat_get_progress_info('CREATE INDEX'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
+pg_stat_progress_repack| SELECT s.pid,
+ s.datid,
+ d.datname,
+ s.relid,
+ CASE s.param1
+ WHEN 1 THEN 'REPACK'::text
+ ELSE NULL::text
+ END AS command,
+ CASE s.param2
+ WHEN 0 THEN 'initializing'::text
+ WHEN 1 THEN 'seq scanning heap'::text
+ WHEN 2 THEN 'index scanning heap'::text
+ WHEN 3 THEN 'sorting tuples'::text
+ WHEN 4 THEN 'writing new heap'::text
+ WHEN 5 THEN 'swapping relation files'::text
+ WHEN 6 THEN 'rebuilding index'::text
+ WHEN 7 THEN 'performing final cleanup'::text
+ ELSE NULL::text
+ END AS phase,
+ (s.param3)::oid AS repack_index_relid,
+ s.param4 AS heap_tuples_scanned,
+ s.param5 AS heap_tuples_written,
+ s.param6 AS heap_blks_total,
+ s.param7 AS heap_blks_scanned,
+ s.param8 AS index_rebuild_count
+ FROM (pg_stat_get_progress_info('REPACK'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
+ LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_progress_vacuum| SELECT s.pid,
s.datid,
d.datname,
diff --git a/src/test/regress/sql/cluster.sql b/src/test/regress/sql/cluster.sql
index b7115f86104..cfcc3dc9761 100644
--- a/src/test/regress/sql/cluster.sql
+++ b/src/test/regress/sql/cluster.sql
@@ -76,6 +76,19 @@ INSERT INTO clstr_tst (b, c) VALUES (1111, 'this should fail');
SELECT conname FROM pg_constraint WHERE conrelid = 'clstr_tst'::regclass
ORDER BY 1;
+-- REPACK handles individual tables identically to CLUSTER, but it's worth
+-- checking if it handles table hierarchies identically as well.
+REPACK clstr_tst USING INDEX clstr_tst_c;
+
+-- Verify that inheritance link still works
+INSERT INTO clstr_tst_inh VALUES (0, 100, 'in child table 2');
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst;
+
+-- Verify that foreign key link still works
+INSERT INTO clstr_tst (b, c) VALUES (1111, 'this should fail');
+
+SELECT conname FROM pg_constraint WHERE conrelid = 'clstr_tst'::regclass
+ORDER BY 1;
SELECT relname, relkind,
EXISTS(SELECT 1 FROM pg_class WHERE oid = c.reltoastrelid) AS hastoast
@@ -159,6 +172,34 @@ INSERT INTO clstr_1 VALUES (1);
CLUSTER clstr_1;
SELECT * FROM clstr_1;
+-- REPACK w/o argument performs no ordering, so we can only check which tables
+-- have the relfilenode changed.
+RESET SESSION AUTHORIZATION;
+CREATE TEMP TABLE relnodes_old AS
+(SELECT relname, relfilenode
+FROM pg_class
+WHERE relname IN ('clstr_1', 'clstr_2', 'clstr_3'));
+
+SET SESSION AUTHORIZATION regress_clstr_user;
+SET client_min_messages = ERROR; -- order of "skipping" warnings may vary
+REPACK;
+RESET client_min_messages;
+
+RESET SESSION AUTHORIZATION;
+CREATE TEMP TABLE relnodes_new AS
+(SELECT relname, relfilenode
+FROM pg_class
+WHERE relname IN ('clstr_1', 'clstr_2', 'clstr_3'));
+
+-- Do the actual comparison. Unlike CLUSTER, clstr_3 should have been
+-- processed because there is nothing like clustering index here.
+SELECT o.relname FROM relnodes_old o
+JOIN relnodes_new n ON o.relname = n.relname
+WHERE o.relfilenode <> n.relfilenode
+ORDER BY o.relname;
+
+SET SESSION AUTHORIZATION regress_clstr_user;
+
-- Test MVCC-safety of cluster. There isn't much we can do to verify the
-- results with a single backend...
@@ -229,6 +270,24 @@ SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM o
CLUSTER clstrpart;
ALTER TABLE clstrpart SET WITHOUT CLUSTER;
ALTER TABLE clstrpart CLUSTER ON clstrpart_idx;
+
+-- Check that REPACK sets new relfilenodes: it should process exactly the same
+-- tables as CLUSTER did.
+DROP TABLE old_cluster_info;
+DROP TABLE new_cluster_info;
+CREATE TEMP TABLE old_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+REPACK clstrpart USING INDEX clstrpart_idx;
+CREATE TEMP TABLE new_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM old_cluster_info AS old JOIN new_cluster_info AS new USING (relname) ORDER BY relname COLLATE "C";
+
+-- And finally the same for REPACK w/o index.
+DROP TABLE old_cluster_info;
+DROP TABLE new_cluster_info;
+CREATE TEMP TABLE old_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+REPACK clstrpart;
+CREATE TEMP TABLE new_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM old_cluster_info AS old JOIN new_cluster_info AS new USING (relname) ORDER BY relname COLLATE "C";
+
DROP TABLE clstrpart;
-- Ownership of partitions is checked
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index c3f05796a7c..fbfd875b5c2 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -422,6 +422,7 @@ ClientCertName
ClientConnectionInfo
ClientData
ClientSocket
+ClusterCommand
ClonePtrType
ClosePortalStmt
ClosePtrType
@@ -2515,6 +2516,7 @@ ReorderBufferTupleCidKey
ReorderBufferUpdateProgressTxnCB
ReorderTuple
RepOriginId
+RepackStmt
ReparameterizeForeignPathByChild_function
ReplaceVarsFromTargetList_context
ReplaceVarsNoMatchOption
--
2.43.5
On Fri, Apr 4, 2025, at 1:38 PM, Antonin Houska wrote:
Euler Taveira <euler@eulerto.com> wrote:
+
+     <warning>
+      <para>
+       The <command>FULL</command> parameter is deprecated in favor of
+       <xref linkend="sql-repack"/>.
+      </para>
+     </warning>
+

The warnings, notes, and tips are usually placed *after* the description.
You probably mean the subsections "Notes on Clustering" and "Notes on
Resources". I moved them into the "Notes" section.
No. I said that it should be put after the <para> not before.
@@ -98,6 +98,14 @@ VACUUM [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
<varlistentry>
<term><literal>FULL</literal></term>
<listitem>
+
+ <warning>
+ <para>
+ The <command>FULL</command> parameter is deprecated in favor of
+ <xref linkend="sql-repack"/>.
+ </para>
+ </warning>
+
<para>
Selects <quote>full</quote> vacuum, which can reclaim more
space, but takes much longer and exclusively locks the table.
+ SELECT
+     S.pid AS pid,
+     S.datid AS datid,
+     D.datname AS datname,
+     S.relid AS relid,
+     CASE S.param1 WHEN 1 THEN 'REPACK'
+     END AS command,

Do you really need command? IIUC REPACK is the only command that will be used by
this view. There is no need to differentiate commands here.

REPACK is a regular command, so why shouldn't it have its view? Just like
CLUSTER has one (pg_stat_progress_cluster).
You missed my point. IIRC the command is relevant in
pg_stat_progress_cluster because there are multiple commands (CLUSTER, VACUUM
FULL). However, in this new view there will be only one command, so it is not
necessary to report it.
+ *
+ * 'cmd' indicates which commands is being executed. REPACK should be the only
+ * caller of this function in the future.

command.
Not sure I understand this comment.
Singular form. ... which command is ...
--
Euler Taveira
EDB https://www.enterprisedb.com/
This is the next version. It addresses these remaining concerns and also gets
rid of the unnecessary rules in gram.y (which you complained about earlier).
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Attachments:
v13-0001-Add-REPACK-command.patch (text/x-diff)
From 7ab300c4eabae3ec9adae66234a4f9949d3c7918 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Fri, 11 Apr 2025 11:13:13 +0200
Subject: [PATCH 1/7] Add REPACK command.
The existing CLUSTER command as well as VACUUM with the FULL option both
reclaim unused space by rewriting the table. Now that we want to enhance this
functionality (in particular, by adding a new option CONCURRENTLY), we should
enhance both commands because they are both implemented by the same function
(cluster.c:cluster_rel). However, adding the same option to two different
commands is not very user-friendly. Therefore it was decided to create a new
command and to declare both the CLUSTER command and the FULL option of VACUUM
deprecated. Future enhancements to this rewriting code will only affect the
new command.
Like CLUSTER, the REPACK command reorders the table according to the specified
index. Unlike CLUSTER, REPACK does not require an index: if only a table is
specified, the command acts as VACUUM FULL. As we don't want to remove CLUSTER
and VACUUM FULL yet, there are three callers of the cluster_rel() function
now: REPACK, CLUSTER and VACUUM FULL. When we need to distinguish who is
calling this function (mostly for logging, but also for progress reporting),
we can no longer use the OID of the clustering index: both REPACK and VACUUM
FULL can pass InvalidOid. Therefore, this patch introduces a new enumeration
type ClusterCommand, and adds an argument of this type to the cluster_rel()
function and to all the functions that need to distinguish the caller.
Like CLUSTER and VACUUM FULL, the REPACK command without arguments processes
all the tables on which the current user has the MAINTAIN privilege.
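The grammar added by this patch thus accepts four forms. A sketch of the resulting syntax (table and index names below are placeholders, and VERBOSE is the option currently offered by tab completion):

```sql
-- Process all tables the current user may maintain (like VACUUM FULL):
REPACK;
REPACK (VERBOSE);

-- Rewrite a single table without any particular tuple ordering:
REPACK mytable;

-- Rewrite a table and order its tuples by an index, as CLUSTER does:
REPACK mytable USING INDEX mytable_idx;
```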
A new view, pg_stat_progress_repack, is added to monitor the progress of
REPACK. Currently it displays the same information as pg_stat_progress_cluster
(except that column names might differ), but it'll also display the status of
the REPACK CONCURRENTLY command in the future, so the view definitions will
eventually diverge.
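Given the view definition added in this patch (see the rules.out hunk), progress of a running REPACK could be watched roughly like this; the column names are taken from the view, and the regclass cast is merely for readability:

```sql
SELECT pid, relid::regclass AS relation, command, phase,
       heap_blks_scanned, heap_blks_total,
       heap_tuples_written, index_rebuild_count
FROM pg_stat_progress_repack;
```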
Regarding user documentation, the patch moves the information on clustering
from cluster.sgml to the new file repack.sgml. cluster.sgml now contains a
link that points to the related section of repack.sgml. A note on deprecation
and a link to repack.sgml are added to both cluster.sgml and vacuum.sgml.
---
doc/src/sgml/monitoring.sgml | 220 +++++++++++
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/cluster.sgml | 82 +---
doc/src/sgml/ref/repack.sgml | 256 +++++++++++++
doc/src/sgml/ref/vacuum.sgml | 9 +
doc/src/sgml/reference.sgml | 1 +
src/backend/access/heap/heapam_handler.c | 32 +-
src/backend/catalog/index.c | 2 +-
src/backend/catalog/system_views.sql | 26 ++
src/backend/commands/cluster.c | 468 +++++++++++++++++------
src/backend/commands/vacuum.c | 3 +-
src/backend/parser/gram.y | 53 ++-
src/backend/tcop/utility.c | 9 +
src/backend/utils/adt/pgstatfuncs.c | 2 +
src/bin/psql/tab-complete.in.c | 31 +-
src/include/commands/cluster.h | 19 +-
src/include/commands/progress.h | 67 +++-
src/include/nodes/parsenodes.h | 13 +
src/include/parser/kwlist.h | 1 +
src/include/tcop/cmdtaglist.h | 1 +
src/include/utils/backend_progress.h | 1 +
src/test/regress/expected/cluster.out | 123 ++++++
src/test/regress/expected/rules.out | 23 ++
src/test/regress/sql/cluster.sql | 59 +++
src/tools/pgindent/typedefs.list | 2 +
25 files changed, 1291 insertions(+), 213 deletions(-)
create mode 100644 doc/src/sgml/ref/repack.sgml
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index c421d89edff..9f1432c1ae6 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -400,6 +400,14 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_progress_repack</structname><indexterm><primary>pg_stat_progress_repack</primary></indexterm></entry>
+ <entry>One row for each backend running
+ <command>REPACK</command>, showing current progress. See
+ <xref linkend="repack-progress-reporting"/>.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_progress_basebackup</structname><indexterm><primary>pg_stat_progress_basebackup</primary></indexterm></entry>
<entry>One row for each WAL sender process streaming a base backup,
@@ -5943,6 +5951,218 @@ FROM pg_stat_get_backend_idset() AS backendid;
</table>
</sect2>
+ <sect2 id="repack-progress-reporting">
+ <title>REPACK Progress Reporting</title>
+
+ <indexterm>
+ <primary>pg_stat_progress_repack</primary>
+ </indexterm>
+
+ <para>
+ Whenever <command>REPACK</command> is running,
+ the <structname>pg_stat_progress_repack</structname> view will contain a
+ row for each backend that is currently running the command. The tables
+ below describe the information that will be reported and provide
+ information about how to interpret it.
+ </para>
+
+ <table id="pg-stat-progress-repack-view" xreflabel="pg_stat_progress_repack">
+ <title><structname>pg_stat_progress_repack</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>pid</structfield> <type>integer</type>
+ </para>
+ <para>
+ Process ID of backend.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>datid</structfield> <type>oid</type>
+ </para>
+ <para>
+ OID of the database to which this backend is connected.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>datname</structfield> <type>name</type>
+ </para>
+ <para>
+ Name of the database to which this backend is connected.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>relid</structfield> <type>oid</type>
+ </para>
+ <para>
+ OID of the table being repacked.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>phase</structfield> <type>text</type>
+ </para>
+ <para>
+ Current processing phase. See <xref linkend="repack-phases"/>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>repack_index_relid</structfield> <type>oid</type>
+ </para>
+ <para>
+ If the table is being scanned using an index, this is the OID of the
+ index being used; otherwise, it is zero.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_scanned</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples scanned.
+ This counter only advances when the phase is
+ <literal>seq scanning heap</literal>,
+ <literal>index scanning heap</literal>
+ or <literal>writing new heap</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_written</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples written.
+ This counter only advances when the phase is
+ <literal>seq scanning heap</literal>,
+ <literal>index scanning heap</literal>
+ or <literal>writing new heap</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_blks_total</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Total number of heap blocks in the table. This number is reported
+ as of the beginning of <literal>seq scanning heap</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_blks_scanned</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap blocks scanned. This counter only advances when the
+ phase is <literal>seq scanning heap</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>index_rebuild_count</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of indexes rebuilt. This counter only advances when the phase
+ is <literal>rebuilding index</literal>.
+ </para></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <table id="repack-phases">
+ <title>REPACK Phases</title>
+ <tgroup cols="2">
+ <colspec colname="col1" colwidth="1*"/>
+ <colspec colname="col2" colwidth="2*"/>
+ <thead>
+ <row>
+ <entry>Phase</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry><literal>initializing</literal></entry>
+ <entry>
+ The command is preparing to begin scanning the heap. This phase is
+ expected to be very brief.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>seq scanning heap</literal></entry>
+ <entry>
+ The command is currently scanning the table using a sequential scan.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>index scanning heap</literal></entry>
+ <entry>
+ <command>REPACK</command> is currently scanning the table using an index scan.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>sorting tuples</literal></entry>
+ <entry>
+ <command>REPACK</command> is currently sorting tuples.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>writing new heap</literal></entry>
+ <entry>
+ <command>REPACK</command> is currently writing the new heap.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>swapping relation files</literal></entry>
+ <entry>
+ The command is currently swapping newly-built files into place.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>rebuilding index</literal></entry>
+ <entry>
+ The command is currently rebuilding an index.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>performing final cleanup</literal></entry>
+ <entry>
+ The command is performing final cleanup. When this phase is
+ completed, <command>REPACK</command> will end.
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
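+
+  <para>
+   For example, the progress of a running <command>REPACK</command> can be
+   watched with a query such as the following (the column selection is
+   illustrative):
+<programlisting>
+SELECT pid, relid::regclass, phase,
+       heap_blks_scanned, heap_blks_total
+FROM pg_stat_progress_repack;
+</programlisting>
+  </para>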
+ </sect2>
+
<sect2 id="copy-progress-reporting">
<title>COPY Progress Reporting</title>
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..c0ef654fcb4 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -167,6 +167,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY refreshMaterializedView SYSTEM "refresh_materialized_view.sgml">
<!ENTITY reindex SYSTEM "reindex.sgml">
<!ENTITY releaseSavepoint SYSTEM "release_savepoint.sgml">
+<!ENTITY repack SYSTEM "repack.sgml">
<!ENTITY reset SYSTEM "reset.sgml">
<!ENTITY revoke SYSTEM "revoke.sgml">
<!ENTITY rollback SYSTEM "rollback.sgml">
diff --git a/doc/src/sgml/ref/cluster.sgml b/doc/src/sgml/ref/cluster.sgml
index 8811f169ea0..ee4fd965928 100644
--- a/doc/src/sgml/ref/cluster.sgml
+++ b/doc/src/sgml/ref/cluster.sgml
@@ -42,18 +42,6 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
<replaceable class="parameter">table_name</replaceable>.
</para>
- <para>
- When a table is clustered, it is physically reordered
- based on the index information. Clustering is a one-time operation:
- when the table is subsequently updated, the changes are
- not clustered. That is, no attempt is made to store new or
- updated rows according to their index order. (If one wishes, one can
- periodically recluster by issuing the command again. Also, setting
- the table's <literal>fillfactor</literal> storage parameter to less than
- 100% can aid in preserving cluster ordering during updates, since updated
- rows are kept on the same page if enough space is available there.)
- </para>
-
<para>
When a table is clustered, <productname>PostgreSQL</productname>
remembers which index it was clustered by. The form
@@ -78,6 +66,25 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
database operations (both reads and writes) from operating on the
table until the <command>CLUSTER</command> is finished.
</para>
+
+ <warning>
+ <para>
+ The <command>CLUSTER</command> command is deprecated in favor of
+ <xref linkend="sql-repack"/>.
+ </para>
+ </warning>
+
+ <note>
+ <para>
+    <xref linkend="sql-repack-notes-on-clustering"/> explains how clustering
+ works, whether it is initiated by <command>CLUSTER</command> or
+ by <command>REPACK</command>. The notable difference between the two is
+ that <command>REPACK</command> does not remember the index used last
+ time. Thus if you don't specify an index, <command>REPACK</command>
+ rewrites the table but does not try to cluster it.
+ </para>
+ </note>
+
</refsect1>
<refsect1>
@@ -136,63 +143,12 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
on the table.
</para>
- <para>
- In cases where you are accessing single rows randomly
- within a table, the actual order of the data in the
- table is unimportant. However, if you tend to access some
- data more than others, and there is an index that groups
- them together, you will benefit from using <command>CLUSTER</command>.
- If you are requesting a range of indexed values from a table, or a
- single indexed value that has multiple rows that match,
- <command>CLUSTER</command> will help because once the index identifies the
- table page for the first row that matches, all other rows
- that match are probably already on the same table page,
- and so you save disk accesses and speed up the query.
- </para>
-
- <para>
- <command>CLUSTER</command> can re-sort the table using either an index scan
- on the specified index, or (if the index is a b-tree) a sequential
- scan followed by sorting. It will attempt to choose the method that
- will be faster, based on planner cost parameters and available statistical
- information.
- </para>
-
<para>
While <command>CLUSTER</command> is running, the <xref
linkend="guc-search-path"/> is temporarily changed to <literal>pg_catalog,
pg_temp</literal>.
</para>
- <para>
- When an index scan is used, a temporary copy of the table is created that
- contains the table data in the index order. Temporary copies of each
- index on the table are created as well. Therefore, you need free space on
- disk at least equal to the sum of the table size and the index sizes.
- </para>
-
- <para>
- When a sequential scan and sort is used, a temporary sort file is
- also created, so that the peak temporary space requirement is as much
- as double the table size, plus the index sizes. This method is often
- faster than the index scan method, but if the disk space requirement is
- intolerable, you can disable this choice by temporarily setting <xref
- linkend="guc-enable-sort"/> to <literal>off</literal>.
- </para>
-
- <para>
- It is advisable to set <xref linkend="guc-maintenance-work-mem"/> to
- a reasonably large value (but not more than the amount of RAM you can
- dedicate to the <command>CLUSTER</command> operation) before clustering.
- </para>
-
- <para>
- Because the planner records statistics about the ordering of
- tables, it is advisable to run <link linkend="sql-analyze"><command>ANALYZE</command></link>
- on the newly clustered table.
- Otherwise, the planner might make poor choices of query plans.
- </para>
-
<para>
Because <command>CLUSTER</command> remembers which indexes are clustered,
one can cluster the tables one wants clustered manually the first time,
diff --git a/doc/src/sgml/ref/repack.sgml b/doc/src/sgml/ref/repack.sgml
new file mode 100644
index 00000000000..c74a5023a54
--- /dev/null
+++ b/doc/src/sgml/ref/repack.sgml
@@ -0,0 +1,256 @@
+<!--
+doc/src/sgml/ref/repack.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-repack">
+ <indexterm zone="sql-repack">
+ <primary>REPACK</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle>REPACK</refentrytitle>
+ <manvolnum>7</manvolnum>
+ <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>REPACK</refname>
+ <refpurpose>rewrite a table to reclaim disk space</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <replaceable class="parameter">table_name</replaceable> [ USING INDEX <replaceable class="parameter">index_name</replaceable> ] ]
+
+<phrase>where <replaceable class="parameter">option</replaceable> can be one of:</phrase>
+
+ VERBOSE [ <replaceable class="parameter">boolean</replaceable> ]
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+
+ <para>
+ <command>REPACK</command> reclaims storage occupied by dead
+ tuples. Unlike <command>VACUUM</command>, it does so by rewriting the
+ entire contents of the table specified
+ by <replaceable class="parameter">table_name</replaceable> into a new disk
+ file with no extra space (except for the space guaranteed by
+ the <literal>fillfactor</literal> storage parameter), allowing unused space
+ to be returned to the operating system.
+ </para>
+
+ <para>
+ Without
+ a <replaceable class="parameter">table_name</replaceable>, <command>REPACK</command>
+ processes every table and materialized view in the current database that
+ the current user has the <literal>MAINTAIN</literal> privilege on. This
+ form of <command>REPACK</command> cannot be executed inside a transaction
+ block.
+ </para>
+
+ <para>
+ If <replaceable class="parameter">index_name</replaceable> is specified,
+    the table is clustered by this index.  See the notes on clustering
+    below.
+ </para>
+
+ <para>
+ When a table is being repacked, an <literal>ACCESS EXCLUSIVE</literal> lock
+ is acquired on it. This prevents any other database operations (both reads
+ and writes) from operating on the table until the <command>REPACK</command>
+ is finished.
+ </para>
+
+ <refsect2 id="sql-repack-notes-on-clustering" xreflabel="Notes on Clustering">
+ <title>Notes on Clustering</title>
+
+ <para>
+ When a table is clustered, it is physically reordered based on the index
+ information. Clustering is a one-time operation: when the table is
+ subsequently updated, the changes are not clustered. That is, no attempt
+ is made to store new or updated rows according to their index order. (If
+ one wishes, one can periodically recluster by issuing the command again.
+ Also, setting the table's <literal>fillfactor</literal> storage parameter
+ to less than 100% can aid in preserving cluster ordering during updates,
+ since updated rows are kept on the same page if enough space is available
+ there.)
+ </para>
+
+ <para>
+ In cases where you are accessing single rows randomly within a table, the
+ actual order of the data in the table is unimportant. However, if you tend
+ to access some data more than others, and there is an index that groups
+ them together, you will benefit from using <command>REPACK</command>. If
+ you are requesting a range of indexed values from a table, or a single
+ indexed value that has multiple rows that match,
+ <command>REPACK</command> will help because once the index identifies the
+ table page for the first row that matches, all other rows that match are
+ probably already on the same table page, and so you save disk accesses and
+ speed up the query.
+ </para>
+
+ <para>
+    <command>REPACK</command> can re-sort the table using either an index scan
+    on the specified index, or (if the index is a b-tree) a sequential scan
+    followed by sorting.  It will attempt to choose the method that will be
+    faster, based on planner cost parameters and available statistical
+    information.
+ </para>
+
+ <para>
+ Because the planner records statistics about the ordering of tables, it is
+ advisable to
+ run <link linkend="sql-analyze"><command>ANALYZE</command></link> on the
+ newly repacked table. Otherwise, the planner might make poor choices of
+ query plans.
+ </para>
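+
+   <para>
+    For example, a table can be clustered and then analyzed in the same
+    session (the table and index names are illustrative):
+<programlisting>
+REPACK employees USING INDEX employees_ind;
+ANALYZE employees;
+</programlisting>
+   </para>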
+ </refsect2>
+
+ <refsect2 id="sql-repack-notes-on-resources" xreflabel="Notes on Resources">
+ <title>Notes on Resources</title>
+
+ <para>
+    When an index scan or a sequential scan without sort is used, a temporary
+    copy of the table is created.  With an index scan, the copy contains the
+    table data in the index order.  Temporary copies of each index on the
+    table are created as well.  Therefore, you need free space on disk at
+    least equal to the sum of the table size and the index sizes.
+ </para>
+
+ <para>
+ When a sequential scan and sort is used, a temporary sort file is also
+ created, so that the peak temporary space requirement is as much as double
+ the table size, plus the index sizes. This method is often faster than
+ the index scan method, but if the disk space requirement is intolerable,
+ you can disable this choice by temporarily setting
+ <xref linkend="guc-enable-sort"/> to <literal>off</literal>.
+ </para>
+
+ <para>
+ It is advisable to set <xref linkend="guc-maintenance-work-mem"/> to a
+ reasonably large value (but not more than the amount of RAM you can
+ dedicate to the <command>REPACK</command> operation) before repacking.
+ </para>
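+
+   <para>
+    For example, to give the operation more sort memory for the duration of
+    the session (the value shown is only an illustration):
+<programlisting>
+SET maintenance_work_mem = '1GB';
+REPACK employees;
+</programlisting>
+   </para>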
+ </refsect2>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Parameters</title>
+
+ <variablelist>
+ <varlistentry>
+ <term><replaceable class="parameter">table_name</replaceable></term>
+ <listitem>
+ <para>
+ The name (possibly schema-qualified) of a table.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><replaceable class="parameter">index_name</replaceable></term>
+ <listitem>
+ <para>
+ The name of an index.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>VERBOSE</literal></term>
+ <listitem>
+ <para>
+      Prints a progress report at <literal>INFO</literal> level as each
+      table is repacked.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><replaceable class="parameter">boolean</replaceable></term>
+ <listitem>
+ <para>
+ Specifies whether the selected option should be turned on or off.
+ You can write <literal>TRUE</literal>, <literal>ON</literal>, or
+ <literal>1</literal> to enable the option, and <literal>FALSE</literal>,
+ <literal>OFF</literal>, or <literal>0</literal> to disable it. The
+ <replaceable class="parameter">boolean</replaceable> value can also
+ be omitted, in which case <literal>TRUE</literal> is assumed.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </refsect1>
+
+ <refsect1>
+ <title>Notes</title>
+
+ <para>
+ To repack a table, one must have the <literal>MAINTAIN</literal> privilege
+ on the table.
+ </para>
+
+ <para>
+ While <command>REPACK</command> is running, the <xref
+ linkend="guc-search-path"/> is temporarily changed to <literal>pg_catalog,
+ pg_temp</literal>.
+ </para>
+
+ <para>
+ Each backend running <command>REPACK</command> will report its progress
+ in the <structname>pg_stat_progress_repack</structname> view. See
+ <xref linkend="repack-progress-reporting"/> for details.
+ </para>
+
+ <para>
+    Repacking a partitioned table repacks each of its partitions.  If an
+    index is specified, each partition is repacked using the corresponding
+    partition of that index.  <command>REPACK</command> on a partitioned
+    table cannot be executed inside a transaction block.
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Examples</title>
+
+ <para>
+ Repack the table <literal>employees</literal>:
+<programlisting>
+REPACK employees;
+</programlisting>
+ </para>
+
+ <para>
+   Repack the table <literal>employees</literal> on the basis of its
+   index <literal>employees_ind</literal> (since an index is used here, this
+   is effectively clustering):
+<programlisting>
+REPACK employees USING INDEX employees_ind;
+</programlisting>
+ </para>
+
+ <para>
+ Repack all tables in the database on which you have
+ the <literal>MAINTAIN</literal> privilege:
+<programlisting>
+REPACK;
+</programlisting></para>
+ </refsect1>
+
+ <refsect1>
+ <title>Compatibility</title>
+
+ <para>
+ There is no <command>REPACK</command> statement in the SQL standard.
+ </para>
+
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index bd5dcaf86a5..cee1cf3926c 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -98,6 +98,7 @@ VACUUM [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
<varlistentry>
<term><literal>FULL</literal></term>
<listitem>
+
<para>
Selects <quote>full</quote> vacuum, which can reclaim more
space, but takes much longer and exclusively locks the table.
@@ -106,6 +107,14 @@ VACUUM [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
the operation is complete. Usually this should only be used when a
significant amount of space needs to be reclaimed from within the table.
</para>
+
+ <warning>
+ <para>
+ The <option>FULL</option> parameter is deprecated in favor of
+ <xref linkend="sql-repack"/>.
+ </para>
+ </warning>
+
</listitem>
</varlistentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..229912d35b7 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -195,6 +195,7 @@
&refreshMaterializedView;
&reindex;
&releaseSavepoint;
+ &repack;
&reset;
&revoke;
&rollback;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index ac082fefa77..d91e66241fb 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -741,13 +741,13 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
if (OldIndex != NULL && !use_sort)
{
const int ci_index[] = {
- PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_INDEX_RELID
+ PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_INDEX_RELID
};
int64 ci_val[2];
/* Set phase and OIDOldIndex to columns */
- ci_val[0] = PROGRESS_CLUSTER_PHASE_INDEX_SCAN_HEAP;
+ ci_val[0] = PROGRESS_REPACK_PHASE_INDEX_SCAN_HEAP;
ci_val[1] = RelationGetRelid(OldIndex);
pgstat_progress_update_multi_param(2, ci_index, ci_val);
@@ -759,15 +759,15 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
else
{
/* In scan-and-sort mode and also VACUUM FULL, set phase */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_SEQ_SCAN_HEAP);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_SEQ_SCAN_HEAP);
tableScan = table_beginscan(OldHeap, SnapshotAny, 0, (ScanKey) NULL);
heapScan = (HeapScanDesc) tableScan;
indexScan = NULL;
/* Set total heap blocks */
- pgstat_progress_update_param(PROGRESS_CLUSTER_TOTAL_HEAP_BLKS,
+ pgstat_progress_update_param(PROGRESS_REPACK_TOTAL_HEAP_BLKS,
heapScan->rs_nblocks);
}
@@ -809,7 +809,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
* is manually updated to the correct value when the table
* scan finishes.
*/
- pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_BLKS_SCANNED,
+ pgstat_progress_update_param(PROGRESS_REPACK_HEAP_BLKS_SCANNED,
heapScan->rs_nblocks);
break;
}
@@ -825,7 +825,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
*/
if (prev_cblock != heapScan->rs_cblock)
{
- pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_BLKS_SCANNED,
+ pgstat_progress_update_param(PROGRESS_REPACK_HEAP_BLKS_SCANNED,
(heapScan->rs_cblock +
heapScan->rs_nblocks -
heapScan->rs_startblock
@@ -912,14 +912,14 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
* In scan-and-sort mode, report increase in number of tuples
* scanned
*/
- pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED,
+ pgstat_progress_update_param(PROGRESS_REPACK_HEAP_TUPLES_SCANNED,
*num_tuples);
}
else
{
const int ct_index[] = {
- PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED,
- PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN
+ PROGRESS_REPACK_HEAP_TUPLES_SCANNED,
+ PROGRESS_REPACK_HEAP_TUPLES_WRITTEN
};
int64 ct_val[2];
@@ -952,14 +952,14 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
double n_tuples = 0;
/* Report that we are now sorting tuples */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_SORT_TUPLES);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_SORT_TUPLES);
tuplesort_performsort(tuplesort);
/* Report that we are now writing new heap */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_WRITE_NEW_HEAP);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_WRITE_NEW_HEAP);
for (;;)
{
@@ -977,7 +977,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
values, isnull,
rwstate);
/* Report n_tuples */
- pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN,
+ pgstat_progress_update_param(PROGRESS_REPACK_HEAP_TUPLES_WRITTEN,
n_tuples);
}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 739a92bdcc1..466cf0fdef6 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -4079,7 +4079,7 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
Assert(!ReindexIsProcessingIndex(indexOid));
/* Set index rebuild count */
- pgstat_progress_update_param(PROGRESS_CLUSTER_INDEX_REBUILD_COUNT,
+ pgstat_progress_update_param(PROGRESS_REPACK_INDEX_REBUILD_COUNT,
i);
i++;
}
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 15efb02badb..2ff3322580f 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1276,6 +1276,32 @@ CREATE VIEW pg_stat_progress_cluster AS
FROM pg_stat_get_progress_info('CLUSTER') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
+CREATE VIEW pg_stat_progress_repack AS
+ SELECT
+ S.pid AS pid,
+ S.datid AS datid,
+ D.datname AS datname,
+ S.relid AS relid,
+ -- param1 is currently unused
+ CASE S.param2 WHEN 0 THEN 'initializing'
+ WHEN 1 THEN 'seq scanning heap'
+ WHEN 2 THEN 'index scanning heap'
+ WHEN 3 THEN 'sorting tuples'
+ WHEN 4 THEN 'writing new heap'
+ WHEN 5 THEN 'swapping relation files'
+ WHEN 6 THEN 'rebuilding index'
+ WHEN 7 THEN 'performing final cleanup'
+ END AS phase,
+ CAST(S.param3 AS oid) AS repack_index_relid,
+ S.param4 AS heap_tuples_scanned,
+ S.param5 AS heap_tuples_written,
+ S.param6 AS heap_blks_total,
+ S.param7 AS heap_blks_scanned,
+ S.param8 AS index_rebuild_count
+ FROM pg_stat_get_progress_info('REPACK') AS S
+ LEFT JOIN pg_database D ON S.datid = D.oid;
+
+
CREATE VIEW pg_stat_progress_create_index AS
SELECT
S.pid AS pid, S.datid AS datid, D.datname AS datname,
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 54a08e4102e..c6f2a3ace5a 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -67,17 +67,24 @@ typedef struct
Oid indexOid;
} RelToCluster;
-
-static void cluster_multiple_rels(List *rtcs, ClusterParams *params);
-static void rebuild_relation(Relation OldHeap, Relation index, bool verbose);
+static void cluster_multiple_rels(List *rtcs, ClusterParams *params,
+ ClusterCommand cmd);
+static void rebuild_relation(Relation OldHeap, Relation index, bool verbose,
+ ClusterCommand cmd);
static void copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
bool verbose, bool *pSwapToastByContent,
TransactionId *pFreezeXid, MultiXactId *pCutoffMulti);
static List *get_tables_to_cluster(MemoryContext cluster_context);
+static List *get_tables_to_repack(MemoryContext repack_context);
static List *get_tables_to_cluster_partitioned(MemoryContext cluster_context,
- Oid indexOid);
-static bool cluster_is_permitted_for_relation(Oid relid, Oid userid);
-
+ Oid relid, bool rel_is_index,
+ ClusterCommand cmd);
+static bool cluster_is_permitted_for_relation(Oid relid, Oid userid,
+ ClusterCommand cmd);
+static Relation process_single_relation(RangeVar *relation, char *indexname,
+ ClusterParams *params,
+ ClusterCommand cmd,
+ Oid *indexOid_p);
/*---------------------------------------------------------------------------
* This cluster code allows for clustering multiple tables at once. Because
@@ -134,71 +141,11 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
if (stmt->relation != NULL)
{
/* This is the single-relation case. */
- Oid tableOid;
-
- /*
- * Find, lock, and check permissions on the table. We obtain
- * AccessExclusiveLock right away to avoid lock-upgrade hazard in the
- * single-transaction case.
- */
- tableOid = RangeVarGetRelidExtended(stmt->relation,
- AccessExclusiveLock,
- 0,
- RangeVarCallbackMaintainsTable,
- NULL);
- rel = table_open(tableOid, NoLock);
-
- /*
- * Reject clustering a remote temp table ... their local buffer
- * manager is not going to cope.
- */
- if (RELATION_IS_OTHER_TEMP(rel))
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot cluster temporary tables of other sessions")));
-
- if (stmt->indexname == NULL)
- {
- ListCell *index;
-
- /* We need to find the index that has indisclustered set. */
- foreach(index, RelationGetIndexList(rel))
- {
- indexOid = lfirst_oid(index);
- if (get_index_isclustered(indexOid))
- break;
- indexOid = InvalidOid;
- }
-
- if (!OidIsValid(indexOid))
- ereport(ERROR,
- (errcode(ERRCODE_UNDEFINED_OBJECT),
- errmsg("there is no previously clustered index for table \"%s\"",
- stmt->relation->relname)));
- }
- else
- {
- /*
- * The index is expected to be in the same namespace as the
- * relation.
- */
- indexOid = get_relname_relid(stmt->indexname,
- rel->rd_rel->relnamespace);
- if (!OidIsValid(indexOid))
- ereport(ERROR,
- (errcode(ERRCODE_UNDEFINED_OBJECT),
- errmsg("index \"%s\" for table \"%s\" does not exist",
- stmt->indexname, stmt->relation->relname)));
- }
-
- /* For non-partitioned tables, do what we came here to do. */
- if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
- {
- cluster_rel(rel, indexOid, ¶ms);
- /* cluster_rel closes the relation, but keeps lock */
-
+ rel = process_single_relation(stmt->relation, stmt->indexname,
+ ¶ms, CLUSTER_COMMAND_CLUSTER,
+ &indexOid);
+ if (rel == NULL)
return;
- }
}
/*
@@ -231,7 +178,9 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
{
Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
check_index_is_clusterable(rel, indexOid, AccessShareLock);
- rtcs = get_tables_to_cluster_partitioned(cluster_context, indexOid);
+ rtcs = get_tables_to_cluster_partitioned(cluster_context, indexOid,
+ true,
+ CLUSTER_COMMAND_CLUSTER);
/* close relation, releasing lock on parent table */
table_close(rel, AccessExclusiveLock);
@@ -243,7 +192,7 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
}
/* Do the job. */
- cluster_multiple_rels(rtcs, ¶ms);
+ cluster_multiple_rels(rtcs, ¶ms, CLUSTER_COMMAND_CLUSTER);
/* Start a new transaction for the cleanup work. */
StartTransactionCommand();
@@ -260,7 +209,7 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
* return.
*/
static void
-cluster_multiple_rels(List *rtcs, ClusterParams *params)
+cluster_multiple_rels(List *rtcs, ClusterParams *params, ClusterCommand cmd)
{
ListCell *lc;
@@ -283,7 +232,7 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
rel = table_open(rtc->tableOid, AccessExclusiveLock);
/* Process this table */
- cluster_rel(rel, rtc->indexOid, params);
+ cluster_rel(rel, rtc->indexOid, params, cmd);
/* cluster_rel closes the relation, but keeps lock */
PopActiveSnapshot();
@@ -306,9 +255,13 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
* If indexOid is InvalidOid, the table will be rewritten in physical order
* instead of index order. This is the new implementation of VACUUM FULL,
* and error messages should refer to the operation as VACUUM not CLUSTER.
+ *
+ * 'cmd' indicates which command is being executed. REPACK should be the only
+ * caller of this function in the future.
*/
void
-cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
+cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
+ ClusterCommand cmd)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid save_userid;
@@ -323,13 +276,26 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
/* Check for user-requested abort. */
CHECK_FOR_INTERRUPTS();
- pgstat_progress_start_command(PROGRESS_COMMAND_CLUSTER, tableOid);
- if (OidIsValid(indexOid))
- pgstat_progress_update_param(PROGRESS_CLUSTER_COMMAND,
+ if (cmd == CLUSTER_COMMAND_REPACK)
+ pgstat_progress_start_command(PROGRESS_COMMAND_REPACK, tableOid);
+ else
+ pgstat_progress_start_command(PROGRESS_COMMAND_CLUSTER, tableOid);
+
+ if (cmd == CLUSTER_COMMAND_REPACK)
+ pgstat_progress_update_param(PROGRESS_REPACK_COMMAND,
+ PROGRESS_REPACK_COMMAND_REPACK);
+ else if (OidIsValid(indexOid))
+ {
+ Assert(cmd == CLUSTER_COMMAND_CLUSTER);
+ pgstat_progress_update_param(PROGRESS_REPACK_COMMAND,
PROGRESS_CLUSTER_COMMAND_CLUSTER);
+ }
else
- pgstat_progress_update_param(PROGRESS_CLUSTER_COMMAND,
+ {
+ Assert(cmd == CLUSTER_COMMAND_VACUUM);
+ pgstat_progress_update_param(PROGRESS_REPACK_COMMAND,
PROGRESS_CLUSTER_COMMAND_VACUUM_FULL);
+ }
/*
* Switch to the table owner's userid, so that any index functions are run
@@ -353,7 +319,8 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
if (recheck)
{
/* Check that the user still has privileges for the relation */
- if (!cluster_is_permitted_for_relation(tableOid, save_userid))
+ if (!cluster_is_permitted_for_relation(tableOid, save_userid,
+ CLUSTER_COMMAND_CLUSTER))
{
relation_close(OldHeap, AccessExclusiveLock);
goto out;
@@ -403,8 +370,12 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
* would work in most respects, but the index would only get marked as
* indisclustered in the current database, leading to unexpected behavior
* if CLUSTER were later invoked in another database.
+ *
+ * REPACK does not set indisclustered. XXX Not sure I understand the
+ * comment above: how can an attribute be set "only in the current
+ * database"?
*/
- if (OidIsValid(indexOid) && OldHeap->rd_rel->relisshared)
+ if (cmd == CLUSTER_COMMAND_CLUSTER && OldHeap->rd_rel->relisshared)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("cannot cluster a shared catalog")));
@@ -415,21 +386,33 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
*/
if (RELATION_IS_OTHER_TEMP(OldHeap))
{
- if (OidIsValid(indexOid))
+ if (cmd == CLUSTER_COMMAND_CLUSTER)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("cannot cluster temporary tables of other sessions")));
+ else if (cmd == CLUSTER_COMMAND_REPACK)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot repack temporary tables of other sessions")));
+ }
else
+ {
+			Assert(cmd == CLUSTER_COMMAND_VACUUM);
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("cannot vacuum temporary tables of other sessions")));
+ }
}
/*
* Also check for active uses of the relation in the current transaction,
* including open scans and pending AFTER trigger events.
*/
- CheckTableNotInUse(OldHeap, OidIsValid(indexOid) ? "CLUSTER" : "VACUUM");
+ CheckTableNotInUse(OldHeap,
+ cmd == CLUSTER_COMMAND_CLUSTER ? "CLUSTER" :
+ cmd == CLUSTER_COMMAND_REPACK ? "REPACK" : "VACUUM");
/* Check heap and index are valid to cluster on */
if (OidIsValid(indexOid))
@@ -469,7 +452,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
TransferPredicateLocksToHeapRelation(OldHeap);
/* rebuild_relation does all the dirty work */
- rebuild_relation(OldHeap, index, verbose);
+ rebuild_relation(OldHeap, index, verbose, cmd);
/* rebuild_relation closes OldHeap, and index if valid */
out:
@@ -626,7 +609,8 @@ mark_index_clustered(Relation rel, Oid indexOid, bool is_internal)
* On exit, they are closed, but locks on them are not released.
*/
static void
-rebuild_relation(Relation OldHeap, Relation index, bool verbose)
+rebuild_relation(Relation OldHeap, Relation index, bool verbose,
+ ClusterCommand cmd)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid accessMethod = OldHeap->rd_rel->relam;
@@ -642,7 +626,7 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose)
Assert(CheckRelationLockedByMe(OldHeap, AccessExclusiveLock, false) &&
(index == NULL || CheckRelationLockedByMe(index, AccessExclusiveLock, false)));
- if (index)
+ if (index && cmd == CLUSTER_COMMAND_CLUSTER)
/* Mark the correct index as clustered */
mark_index_clustered(OldHeap, RelationGetRelid(index), true);
@@ -1458,8 +1442,8 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
int i;
/* Report that we are now swapping relation files */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_SWAP_REL_FILES);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_SWAP_REL_FILES);
/* Zero out possible results from swapped_relation_files */
memset(mapped_tables, 0, sizeof(mapped_tables));
@@ -1509,14 +1493,14 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
/* Report that we are now reindexing relations */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_REBUILD_INDEX);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_REBUILD_INDEX);
reindex_relation(NULL, OIDOldHeap, reindex_flags, &reindex_params);
/* Report that we are now doing clean up */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_FINAL_CLEANUP);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_FINAL_CLEANUP);
/*
* If the relation being rebuilt is pg_class, swap_relation_files()
@@ -1666,7 +1650,8 @@ get_tables_to_cluster(MemoryContext cluster_context)
index = (Form_pg_index) GETSTRUCT(indexTuple);
- if (!cluster_is_permitted_for_relation(index->indrelid, GetUserId()))
+ if (!cluster_is_permitted_for_relation(index->indrelid, GetUserId(),
+ CLUSTER_COMMAND_CLUSTER))
continue;
/* Use a permanent memory context for the result list */
@@ -1687,14 +1672,67 @@ get_tables_to_cluster(MemoryContext cluster_context)
}
/*
- * Given an index on a partitioned table, return a list of RelToCluster for
+ * Like get_tables_to_cluster(), but ignore clustering indexes.
+ */
+static List *
+get_tables_to_repack(MemoryContext repack_context)
+{
+ Relation relrelation;
+ TableScanDesc scan;
+ HeapTuple tuple;
+ MemoryContext old_context;
+ List *rtcs = NIL;
+
+ /*
+ * Get all relations that the current user has the appropriate privileges
+ * for.
+ */
+ relrelation = table_open(RelationRelationId, AccessShareLock);
+ scan = table_beginscan_catalog(relrelation, 0, NULL);
+ while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+ {
+ RelToCluster *rtc;
+ Form_pg_class classForm = (Form_pg_class) GETSTRUCT(tuple);
+ Oid relid = classForm->oid;
+
+ /* Only plain tables are of interest. */
+ if (classForm->relkind != RELKIND_RELATION)
+ continue;
+
+ if (!cluster_is_permitted_for_relation(relid, GetUserId(),
+ CLUSTER_COMMAND_REPACK))
+ continue;
+
+ /* Use a permanent memory context for the result list */
+ old_context = MemoryContextSwitchTo(repack_context);
+
+ rtc = (RelToCluster *) palloc(sizeof(RelToCluster));
+ rtc->tableOid = relid;
+ rtc->indexOid = InvalidOid;
+ rtcs = lappend(rtcs, rtc);
+
+ MemoryContextSwitchTo(old_context);
+ }
+ table_endscan(scan);
+
+ relation_close(relrelation, AccessShareLock);
+
+ return rtcs;
+}
+
+/*
+ * Given a partitioned table or its index, return a list of RelToCluster for
* all the children leaves tables/indexes.
*
* Like expand_vacuum_rel, but here caller must hold AccessExclusiveLock
* on the table containing the index.
+ *
+ * 'rel_is_index' tells whether 'relid' is that of an index (true) or of the
+ * owning relation.
*/
static List *
-get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
+get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid relid,
+ bool rel_is_index, ClusterCommand cmd)
{
List *inhoids;
ListCell *lc;
@@ -1702,17 +1740,33 @@ get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
MemoryContext old_context;
/* Do not lock the children until they're processed */
- inhoids = find_all_inheritors(indexOid, NoLock, NULL);
+ inhoids = find_all_inheritors(relid, NoLock, NULL);
foreach(lc, inhoids)
{
- Oid indexrelid = lfirst_oid(lc);
- Oid relid = IndexGetRelation(indexrelid, false);
+ Oid inhoid = lfirst_oid(lc);
+ Oid inhrelid,
+ inhindid;
RelToCluster *rtc;
- /* consider only leaf indexes */
- if (get_rel_relkind(indexrelid) != RELKIND_INDEX)
- continue;
+ if (rel_is_index)
+ {
+ /* consider only leaf indexes */
+ if (get_rel_relkind(inhoid) != RELKIND_INDEX)
+ continue;
+
+ inhrelid = IndexGetRelation(inhoid, false);
+ inhindid = inhoid;
+ }
+ else
+ {
+ /* consider only leaf relations */
+ if (get_rel_relkind(inhoid) != RELKIND_RELATION)
+ continue;
+
+ inhrelid = inhoid;
+ inhindid = InvalidOid;
+ }
/*
* It's possible that the user does not have privileges to CLUSTER the
@@ -1720,15 +1774,15 @@ get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
* table. We skip any partitions which the user is not permitted to
* CLUSTER.
*/
- if (!cluster_is_permitted_for_relation(relid, GetUserId()))
+ if (!cluster_is_permitted_for_relation(inhrelid, GetUserId(), cmd))
continue;
/* Use a permanent memory context for the result list */
old_context = MemoryContextSwitchTo(cluster_context);
rtc = (RelToCluster *) palloc(sizeof(RelToCluster));
- rtc->tableOid = relid;
- rtc->indexOid = indexrelid;
+ rtc->tableOid = inhrelid;
+ rtc->indexOid = inhindid;
rtcs = lappend(rtcs, rtc);
MemoryContextSwitchTo(old_context);
@@ -1742,13 +1796,211 @@ get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
* function emits a WARNING.
*/
static bool
-cluster_is_permitted_for_relation(Oid relid, Oid userid)
+cluster_is_permitted_for_relation(Oid relid, Oid userid, ClusterCommand cmd)
{
if (pg_class_aclcheck(relid, userid, ACL_MAINTAIN) == ACLCHECK_OK)
return true;
- ereport(WARNING,
- (errmsg("permission denied to cluster \"%s\", skipping it",
- get_rel_name(relid))));
+ if (cmd == CLUSTER_COMMAND_CLUSTER)
+ ereport(WARNING,
+ (errmsg("permission denied to cluster \"%s\", skipping it",
+ get_rel_name(relid))));
+ else
+ {
+ Assert(cmd == CLUSTER_COMMAND_REPACK);
+
+ ereport(WARNING,
+ (errmsg("permission denied to repack \"%s\", skipping it",
+ get_rel_name(relid))));
+ }
+
return false;
}
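As a hedged illustration of the warning path above (role name is hypothetical, and this assumes a server with the patch applied), a database-wide REPACK issued by an unprivileged role would skip tables it lacks MAINTAIN privilege on rather than fail:

```sql
-- Hypothetical session; names are illustrative only.
CREATE ROLE repack_user LOGIN;
SET SESSION AUTHORIZATION repack_user;
REPACK;
-- Each table the role may not maintain is skipped with a warning, e.g.:
-- WARNING:  permission denied to repack "some_table", skipping it
```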
+
+/*
+ * REPACK is intended to be a replacement of both CLUSTER and VACUUM FULL.
+ */
+void
+repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
+{
+ ListCell *lc;
+ ClusterParams params = {0};
+ bool verbose = false;
+ Relation rel = NULL;
+ Oid indexOid = InvalidOid;
+ MemoryContext repack_context;
+ List *rtcs;
+
+ /* Parse option list */
+ foreach(lc, stmt->params)
+ {
+ DefElem *opt = (DefElem *) lfirst(lc);
+
+ if (strcmp(opt->defname, "verbose") == 0)
+ verbose = defGetBoolean(opt);
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("unrecognized REPACK option \"%s\"",
+ opt->defname),
+ parser_errposition(pstate, opt->location)));
+ }
+
+ params.options = (verbose ? CLUOPT_VERBOSE : 0);
+
+ if (stmt->relation != NULL)
+ {
+ /* This is the single-relation case. */
+ rel = process_single_relation(stmt->relation, stmt->indexname,
+ &params, CLUSTER_COMMAND_REPACK,
+ &indexOid);
+ if (rel == NULL)
+ return;
+ }
+
+ /*
+ * By here, we know we are in a multi-table situation. In order to avoid
+ * holding locks for too long, we want to process each table in its own
+ * transaction. This forces us to disallow running inside a user
+ * transaction block.
+ */
+ PreventInTransactionBlock(isTopLevel, "REPACK");
+
+ /* Also, we need a memory context to hold our list of relations */
+ repack_context = AllocSetContextCreate(PortalContext,
+ "Repack",
+ ALLOCSET_DEFAULT_SIZES);
+
+ params.options |= CLUOPT_RECHECK;
+ if (rel != NULL)
+ {
+ Oid relid;
+ bool rel_is_index;
+
+ Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
+
+ if (OidIsValid(indexOid))
+ {
+ relid = indexOid;
+ rel_is_index = true;
+ }
+ else
+ {
+ relid = RelationGetRelid(rel);
+ rel_is_index = false;
+ }
+ rtcs = get_tables_to_cluster_partitioned(repack_context, relid,
+ rel_is_index,
+ CLUSTER_COMMAND_REPACK);
+
+ /* close relation, releasing lock on parent table */
+ table_close(rel, AccessExclusiveLock);
+ }
+ else
+ rtcs = get_tables_to_repack(repack_context);
+
+ /* Do the job. */
+ cluster_multiple_rels(rtcs, &params, CLUSTER_COMMAND_REPACK);
+
+ /* Start a new transaction for the cleanup work. */
+ StartTransactionCommand();
+
+ /* Clean up working storage */
+ MemoryContextDelete(repack_context);
+}
+
+/*
+ * REPACK a single relation if it's a non-partitioned table or a leaf
+ * partition and return NULL. Return the relation's relcache entry if the
+ * caller needs to process it (because the relation is partitioned).
+ */
+static Relation
+process_single_relation(RangeVar *relation, char *indexname,
+ ClusterParams *params, ClusterCommand cmd,
+ Oid *indexOid_p)
+{
+ Relation rel;
+ Oid indexOid = InvalidOid;
+
+ Oid tableOid;
+
+ /*
+ * Find, lock, and check permissions on the table. We obtain
+ * AccessExclusiveLock right away to avoid lock-upgrade hazard in the
+ * single-transaction case.
+ */
+ tableOid = RangeVarGetRelidExtended(relation,
+ AccessExclusiveLock,
+ 0,
+ RangeVarCallbackMaintainsTable,
+ NULL);
+ rel = table_open(tableOid, NoLock);
+
+ /*
+ * Reject clustering a remote temp table ... their local buffer manager is
+ * not going to cope.
+ */
+ if (RELATION_IS_OTHER_TEMP(rel))
+ {
+ if (cmd == CLUSTER_COMMAND_CLUSTER)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot cluster temporary tables of other sessions")));
+ else
+ {
+ Assert(cmd == CLUSTER_COMMAND_REPACK);
+
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot repack temporary tables of other sessions")));
+ }
+ }
+
+ if (indexname == NULL && cmd == CLUSTER_COMMAND_CLUSTER)
+ {
+ ListCell *index;
+
+ /* We need to find the index that has indisclustered set. */
+ foreach(index, RelationGetIndexList(rel))
+ {
+ indexOid = lfirst_oid(index);
+ if (get_index_isclustered(indexOid))
+ break;
+ indexOid = InvalidOid;
+ }
+
+ if (!OidIsValid(indexOid))
+ ereport(ERROR,
+ (errcode(ERRCODE_UNDEFINED_OBJECT),
+ errmsg("there is no previously clustered index for table \"%s\"",
+ relation->relname)));
+ }
+ else if (indexname != NULL)
+ {
+ /*
+ * The index is expected to be in the same namespace as the relation.
+ */
+ indexOid = get_relname_relid(indexname,
+ rel->rd_rel->relnamespace);
+ if (!OidIsValid(indexOid))
+ ereport(ERROR,
+ (errcode(ERRCODE_UNDEFINED_OBJECT),
+ errmsg("index \"%s\" for table \"%s\" does not exist",
+ indexname, relation->relname)));
+ }
+
+ *indexOid_p = indexOid;
+
+ /* For non-partitioned tables, do what we came here to do. */
+ if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+ {
+ cluster_rel(rel, indexOid, params, cmd);
+ /* cluster_rel closes the relation, but keeps lock */
+
+ return NULL;
+ }
+
+ return rel;
+}
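Putting the executor pieces above together with the grammar below, the command forms this patch implements could be exercised as follows (hypothetical session; table and index names are illustrative, and this assumes the patch is applied):

```sql
-- Repack every table the current user is allowed to maintain,
-- one transaction per table:
REPACK;

-- Repack a single table without any tuple ordering
-- (the VACUUM FULL-like case):
REPACK pgbench_accounts;

-- Repack and order tuples by an index, like CLUSTER, with verbose output:
REPACK (VERBOSE) pgbench_accounts USING INDEX pgbench_accounts_pkey;
```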
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index db5da3ce826..a4ad23448f8 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -2263,7 +2263,8 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
cluster_params.options |= CLUOPT_VERBOSE;
/* VACUUM FULL is now a variant of CLUSTER; see cluster.c */
- cluster_rel(rel, InvalidOid, &cluster_params);
+ cluster_rel(rel, InvalidOid, &cluster_params,
+ CLUSTER_COMMAND_VACUUM);
/* cluster_rel closes the relation, but keeps lock */
rel = NULL;
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 3c4268b271a..00813f88b47 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -298,7 +298,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
GrantStmt GrantRoleStmt ImportForeignSchemaStmt IndexStmt InsertStmt
ListenStmt LoadStmt LockStmt MergeStmt NotifyStmt ExplainableStmt PreparableStmt
CreateFunctionStmt AlterFunctionStmt ReindexStmt RemoveAggrStmt
- RemoveFuncStmt RemoveOperStmt RenameStmt ReturnStmt RevokeStmt RevokeRoleStmt
+ RemoveFuncStmt RemoveOperStmt RenameStmt RepackStmt ReturnStmt RevokeStmt RevokeRoleStmt
RuleActionStmt RuleActionStmtOrEmpty RuleStmt
SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
UnlistenStmt UpdateStmt VacuumStmt
@@ -381,11 +381,11 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
%type <str> copy_file_name
access_method_clause attr_name
table_access_method_clause name cursor_name file_name
- cluster_index_specification
+ cluster_index_specification repack_index_specification
%type <list> func_name handler_name qual_Op qual_all_Op subquery_Op
opt_inline_handler opt_validator validator_clause
- opt_collate
+ opt_collate opt_repack_args
%type <range> qualified_name insert_target OptConstrFromTable
@@ -764,7 +764,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
QUOTE QUOTES
RANGE READ REAL REASSIGN RECURSIVE REF_P REFERENCES REFERENCING
- REFRESH REINDEX RELATIVE_P RELEASE RENAME REPEATABLE REPLACE REPLICA
+ REFRESH REINDEX RELATIVE_P RELEASE RENAME REPACK REPEATABLE REPLACE REPLICA
RESET RESTART RESTRICT RETURN RETURNING RETURNS REVOKE RIGHT ROLE ROLLBACK ROLLUP
ROUTINE ROUTINES ROW ROWS RULE
@@ -1100,6 +1100,7 @@ stmt:
| RemoveFuncStmt
| RemoveOperStmt
| RenameStmt
+ | RepackStmt
| RevokeStmt
| RevokeRoleStmt
| RuleStmt
@@ -11892,6 +11893,48 @@ cluster_index_specification:
| /*EMPTY*/ { $$ = NULL; }
;
+/*****************************************************************************
+ *
+ * QUERY:
+ * REPACK [ (options) ] [ <qualified_name> [ USING INDEX <index_name> ] ]
+ *
+ *****************************************************************************/
+
+RepackStmt:
+ REPACK opt_repack_args
+ {
+ RepackStmt *n = makeNode(RepackStmt);
+
+ n->relation = $2 ? (RangeVar *) linitial($2) : NULL;
+ n->indexname = $2 ? (char *) lsecond($2) : NULL;
+ n->params = NIL;
+ $$ = (Node *) n;
+ }
+
+ | REPACK '(' utility_option_list ')' opt_repack_args
+ {
+ RepackStmt *n = makeNode(RepackStmt);
+
+ n->relation = $5 ? (RangeVar *) linitial($5) : NULL;
+ n->indexname = $5 ? (char *) lsecond($5) : NULL;
+ n->params = $3;
+ $$ = (Node *) n;
+ }
+ ;
+
+opt_repack_args:
+ qualified_name repack_index_specification
+ {
+ $$ = list_make2($1, $2);
+ }
+ | /*EMPTY*/ { $$ = NIL; }
+ ;
+
+repack_index_specification:
+ USING INDEX name { $$ = $3; }
+ | /*EMPTY*/ { $$ = NULL; }
+ ;
+
/*****************************************************************************
*
@@ -17933,6 +17976,7 @@ unreserved_keyword:
| RELATIVE_P
| RELEASE
| RENAME
+ | REPACK
| REPEATABLE
| REPLACE
| REPLICA
@@ -18565,6 +18609,7 @@ bare_label_keyword:
| RELATIVE_P
| RELEASE
| RENAME
+ | REPACK
| REPEATABLE
| REPLACE
| REPLICA
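Because the multi-table path calls PreventInTransactionBlock(), the transaction-block behavior should match VACUUM's. A sketch (the exact error wording follows PreventInTransactionBlock's convention; assumes the patch is applied):

```sql
BEGIN;
REPACK;  -- expected to fail in a transaction block:
-- ERROR:  REPACK cannot run inside a transaction block
ROLLBACK;
```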
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 25fe3d58016..bf3ba3c2ae7 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -280,6 +280,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_ClusterStmt:
case T_ReindexStmt:
case T_VacuumStmt:
+ case T_RepackStmt:
{
/*
* These commands write WAL, so they're not strictly
@@ -862,6 +863,10 @@ standard_ProcessUtility(PlannedStmt *pstmt,
ExecVacuum(pstate, (VacuumStmt *) parsetree, isTopLevel);
break;
+ case T_RepackStmt:
+ repack(pstate, (RepackStmt *) parsetree, isTopLevel);
+ break;
+
case T_ExplainStmt:
ExplainQuery(pstate, (ExplainStmt *) parsetree, params, dest);
break;
@@ -2869,6 +2874,10 @@ CreateCommandTag(Node *parsetree)
tag = CMDTAG_ANALYZE;
break;
+ case T_RepackStmt:
+ tag = CMDTAG_REPACK;
+ break;
+
case T_ExplainStmt:
tag = CMDTAG_EXPLAIN;
break;
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 97af7c6554f..ddec4914ea5 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -268,6 +268,8 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
cmdtype = PROGRESS_COMMAND_ANALYZE;
else if (pg_strcasecmp(cmd, "CLUSTER") == 0)
cmdtype = PROGRESS_COMMAND_CLUSTER;
+ else if (pg_strcasecmp(cmd, "REPACK") == 0)
+ cmdtype = PROGRESS_COMMAND_REPACK;
else if (pg_strcasecmp(cmd, "CREATE INDEX") == 0)
cmdtype = PROGRESS_COMMAND_CREATE_INDEX;
else if (pg_strcasecmp(cmd, "BASEBACKUP") == 0)
diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index c916b9299a8..8512e099b03 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -1223,7 +1223,7 @@ static const char *const sql_commands[] = {
"DELETE FROM", "DISCARD", "DO", "DROP", "END", "EXECUTE", "EXPLAIN",
"FETCH", "GRANT", "IMPORT FOREIGN SCHEMA", "INSERT INTO", "LISTEN", "LOAD", "LOCK",
"MERGE INTO", "MOVE", "NOTIFY", "PREPARE",
- "REASSIGN", "REFRESH MATERIALIZED VIEW", "REINDEX", "RELEASE",
+ "REASSIGN", "REFRESH MATERIALIZED VIEW", "REINDEX", "RELEASE", "REPACK",
"RESET", "REVOKE", "ROLLBACK",
"SAVEPOINT", "SECURITY LABEL", "SELECT", "SET", "SHOW", "START",
"TABLE", "TRUNCATE", "UNLISTEN", "UPDATE", "VACUUM", "VALUES", "WITH",
@@ -4913,6 +4913,35 @@ match_previous_words(int pattern_id,
COMPLETE_WITH_QUERY(Query_for_list_of_tablespaces);
}
+/* REPACK */
+ else if (Matches("REPACK"))
+ COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_clusterables);
+ else if (Matches("REPACK", "(*)"))
+ COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_clusterables);
+ /* If we have REPACK <sth>, then add "USING INDEX" */
+ else if (Matches("REPACK", MatchAnyExcept("(")))
+ COMPLETE_WITH("USING INDEX");
+ /* If we have REPACK (*) <sth>, then add "USING INDEX" */
+ else if (Matches("REPACK", "(*)", MatchAny))
+ COMPLETE_WITH("USING INDEX");
+ /* If we have REPACK <sth> USING, then add the index as well */
+ else if (Matches("REPACK", MatchAny, "USING", "INDEX"))
+ {
+ set_completion_reference(prev3_wd);
+ COMPLETE_WITH_SCHEMA_QUERY(Query_for_index_of_table);
+ }
+ else if (HeadMatches("REPACK", "(*") &&
+ !HeadMatches("REPACK", "(*)"))
+ {
+ /*
+ * This fires if we're in an unfinished parenthesized option list.
+ * get_previous_words treats a completed parenthesized option list as
+ * one word, so the above test is correct.
+ */
+ if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
+ COMPLETE_WITH("VERBOSE");
+ }
+
/* SECURITY LABEL */
else if (Matches("SECURITY"))
COMPLETE_WITH("LABEL");
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 60088a64cbb..3be57c97b3f 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -31,8 +31,24 @@ typedef struct ClusterParams
bits32 options; /* bitmask of CLUOPT_* */
} ClusterParams;
+/*
+ * cluster.c currently implements three nearly identical commands: CLUSTER,
+ * VACUUM FULL and REPACK. Where needed, use this enumeration to distinguish
+ * which of these commands is being executed.
+ *
+ * Remove this stuff when removing the (now deprecated) CLUSTER and VACUUM
+ * FULL commands.
+ */
+typedef enum ClusterCommand
+{
+ CLUSTER_COMMAND_CLUSTER,
+ CLUSTER_COMMAND_REPACK,
+ CLUSTER_COMMAND_VACUUM
+} ClusterCommand;
+
extern void cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel);
-extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params);
+extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
+ ClusterCommand cmd);
extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid,
LOCKMODE lockmode);
extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
@@ -48,4 +64,5 @@ extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
MultiXactId cutoffMulti,
char newrelpersistence);
+extern void repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel);
#endif /* CLUSTER_H */
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 7c736e7b03b..f92ff524031 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -56,24 +56,55 @@
#define PROGRESS_ANALYZE_PHASE_COMPUTE_EXT_STATS 4
#define PROGRESS_ANALYZE_PHASE_FINALIZE_ANALYZE 5
-/* Progress parameters for cluster */
-#define PROGRESS_CLUSTER_COMMAND 0
-#define PROGRESS_CLUSTER_PHASE 1
-#define PROGRESS_CLUSTER_INDEX_RELID 2
-#define PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED 3
-#define PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN 4
-#define PROGRESS_CLUSTER_TOTAL_HEAP_BLKS 5
-#define PROGRESS_CLUSTER_HEAP_BLKS_SCANNED 6
-#define PROGRESS_CLUSTER_INDEX_REBUILD_COUNT 7
-
-/* Phases of cluster (as advertised via PROGRESS_CLUSTER_PHASE) */
-#define PROGRESS_CLUSTER_PHASE_SEQ_SCAN_HEAP 1
-#define PROGRESS_CLUSTER_PHASE_INDEX_SCAN_HEAP 2
-#define PROGRESS_CLUSTER_PHASE_SORT_TUPLES 3
-#define PROGRESS_CLUSTER_PHASE_WRITE_NEW_HEAP 4
-#define PROGRESS_CLUSTER_PHASE_SWAP_REL_FILES 5
-#define PROGRESS_CLUSTER_PHASE_REBUILD_INDEX 6
-#define PROGRESS_CLUSTER_PHASE_FINAL_CLEANUP 7
+/*
+ * Progress parameters for REPACK.
+ *
+ * Note: Since REPACK shares some code with CLUSTER, these values are also
+ * used by CLUSTER. (CLUSTER is now deprecated, so it makes little sense to
+ * introduce a separate set of constants.)
+ */
+#define PROGRESS_REPACK_COMMAND 0
+#define PROGRESS_REPACK_PHASE 1
+#define PROGRESS_REPACK_INDEX_RELID 2
+#define PROGRESS_REPACK_HEAP_TUPLES_SCANNED 3
+#define PROGRESS_REPACK_HEAP_TUPLES_WRITTEN 4
+#define PROGRESS_REPACK_TOTAL_HEAP_BLKS 5
+#define PROGRESS_REPACK_HEAP_BLKS_SCANNED 6
+#define PROGRESS_REPACK_INDEX_REBUILD_COUNT 7
+
+/*
+ * Phases of repack (as advertised via PROGRESS_REPACK_PHASE).
+ *
+ * Note: Since REPACK shares some code with CLUSTER, (some of) these values
+ * are also used by CLUSTER. (CLUSTER is now deprecated, so it makes little
+ * sense to introduce a separate set of constants.)
+ */
+#define PROGRESS_REPACK_PHASE_SEQ_SCAN_HEAP 1
+#define PROGRESS_REPACK_PHASE_INDEX_SCAN_HEAP 2
+#define PROGRESS_REPACK_PHASE_SORT_TUPLES 3
+#define PROGRESS_REPACK_PHASE_WRITE_NEW_HEAP 4
+#define PROGRESS_REPACK_PHASE_SWAP_REL_FILES 5
+#define PROGRESS_REPACK_PHASE_REBUILD_INDEX 6
+#define PROGRESS_REPACK_PHASE_FINAL_CLEANUP 7
+
+/*
+ * Commands of PROGRESS_REPACK
+ *
+ * Currently we only have one command, so the PROGRESS_REPACK_COMMAND
+ * parameter is not necessary. However, it makes cluster.c simpler if we have
+ * the same set of parameters for CLUSTER and REPACK - see the note on REPACK
+ * parameters above.
+ */
+#define PROGRESS_REPACK_COMMAND_REPACK 1
+
+/*
+ * Progress parameters for cluster.
+ *
+ * Although we need to report REPACK and CLUSTER in separate views, the
+ * parameters and phases of CLUSTER are a subset of those of REPACK. Therefore
+ * we just use the appropriate values defined for REPACK above instead of
+ * defining a separate set of constants here.
+ */
/* Commands of PROGRESS_CLUSTER */
#define PROGRESS_CLUSTER_COMMAND_CLUSTER 1
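These constants feed the pg_stat_progress_repack view added elsewhere in the patch; a monitoring query run from a second session while a REPACK is in flight might look like this (column names taken from the view definition; assumes the patch is applied):

```sql
SELECT pid, relid::regclass AS table_name, phase,
       heap_blks_scanned, heap_blks_total,
       heap_tuples_written, index_rebuild_count
FROM pg_stat_progress_repack;
```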
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 4610fc61293..648484205cb 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3923,6 +3923,19 @@ typedef struct ClusterStmt
List *params; /* list of DefElem nodes */
} ClusterStmt;
+/* ----------------------
+ * Repack Statement
+ * ----------------------
+ */
+typedef struct RepackStmt
+{
+ NodeTag type;
+ RangeVar *relation; /* relation being repacked */
+ char *indexname; /* order tuples by this index */
+ List *params; /* list of DefElem nodes */
+} RepackStmt;
+
/* ----------------------
* Vacuum and Analyze Statements
*
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index a4af3f717a1..22559369e2c 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -374,6 +374,7 @@ PG_KEYWORD("reindex", REINDEX, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("relative", RELATIVE_P, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("release", RELEASE, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("rename", RENAME, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("repack", REPACK, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("repeatable", REPEATABLE, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("replace", REPLACE, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("replica", REPLICA, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..cceb312f2b3 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -196,6 +196,7 @@ PG_CMDTAG(CMDTAG_REASSIGN_OWNED, "REASSIGN OWNED", false, false, false)
PG_CMDTAG(CMDTAG_REFRESH_MATERIALIZED_VIEW, "REFRESH MATERIALIZED VIEW", true, false, false)
PG_CMDTAG(CMDTAG_REINDEX, "REINDEX", true, false, false)
PG_CMDTAG(CMDTAG_RELEASE, "RELEASE", false, false, false)
+PG_CMDTAG(CMDTAG_REPACK, "REPACK", false, false, false)
PG_CMDTAG(CMDTAG_RESET, "RESET", false, false, false)
PG_CMDTAG(CMDTAG_REVOKE, "REVOKE", true, false, false)
PG_CMDTAG(CMDTAG_REVOKE_ROLE, "REVOKE ROLE", false, false, false)
diff --git a/src/include/utils/backend_progress.h b/src/include/utils/backend_progress.h
index dda813ab407..e69e366dcdc 100644
--- a/src/include/utils/backend_progress.h
+++ b/src/include/utils/backend_progress.h
@@ -28,6 +28,7 @@ typedef enum ProgressCommandType
PROGRESS_COMMAND_CREATE_INDEX,
PROGRESS_COMMAND_BASEBACKUP,
PROGRESS_COMMAND_COPY,
+ PROGRESS_COMMAND_REPACK,
} ProgressCommandType;
#define PGSTAT_NUM_PROGRESS_PARAM 20
diff --git a/src/test/regress/expected/cluster.out b/src/test/regress/expected/cluster.out
index 4d40a6809ab..e9fd7512710 100644
--- a/src/test/regress/expected/cluster.out
+++ b/src/test/regress/expected/cluster.out
@@ -254,6 +254,63 @@ ORDER BY 1;
clstr_tst_pkey
(3 rows)
+-- REPACK handles individual tables identically to CLUSTER, but it's worth
+-- checking whether it handles table hierarchies identically as well.
+REPACK clstr_tst USING INDEX clstr_tst_c;
+-- Verify that inheritance link still works
+INSERT INTO clstr_tst_inh VALUES (0, 100, 'in child table 2');
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst;
+ a | b | c | substring | length
+----+-----+------------------+--------------------------------+--------
+ 10 | 14 | catorce | |
+ 18 | 5 | cinco | |
+ 9 | 4 | cuatro | |
+ 26 | 19 | diecinueve | |
+ 12 | 18 | dieciocho | |
+ 30 | 16 | dieciseis | |
+ 24 | 17 | diecisiete | |
+ 2 | 10 | diez | |
+ 23 | 12 | doce | |
+ 11 | 2 | dos | |
+ 25 | 9 | nueve | |
+ 31 | 8 | ocho | |
+ 1 | 11 | once | |
+ 28 | 15 | quince | |
+ 32 | 6 | seis | xyzzyxyzzyxyzzyxyzzyxyzzyxyzzy | 500000
+ 29 | 7 | siete | |
+ 15 | 13 | trece | |
+ 22 | 30 | treinta | |
+ 17 | 32 | treinta y dos | |
+ 3 | 31 | treinta y uno | |
+ 5 | 3 | tres | |
+ 20 | 1 | uno | |
+ 6 | 20 | veinte | |
+ 14 | 25 | veinticinco | |
+ 21 | 24 | veinticuatro | |
+ 4 | 22 | veintidos | |
+ 19 | 29 | veintinueve | |
+ 16 | 28 | veintiocho | |
+ 27 | 26 | veintiseis | |
+ 13 | 27 | veintisiete | |
+ 7 | 23 | veintitres | |
+ 8 | 21 | veintiuno | |
+ 0 | 100 | in child table | |
+ 0 | 100 | in child table 2 | |
+(34 rows)
+
+-- Verify that foreign key link still works
+INSERT INTO clstr_tst (b, c) VALUES (1111, 'this should fail');
+ERROR: insert or update on table "clstr_tst" violates foreign key constraint "clstr_tst_con"
+DETAIL: Key (b)=(1111) is not present in table "clstr_tst_s".
+SELECT conname FROM pg_constraint WHERE conrelid = 'clstr_tst'::regclass
+ORDER BY 1;
+ conname
+----------------------
+ clstr_tst_a_not_null
+ clstr_tst_con
+ clstr_tst_pkey
+(3 rows)
+
SELECT relname, relkind,
EXISTS(SELECT 1 FROM pg_class WHERE oid = c.reltoastrelid) AS hastoast
FROM pg_class c WHERE relname LIKE 'clstr_tst%' ORDER BY relname;
@@ -381,6 +438,35 @@ SELECT * FROM clstr_1;
2
(2 rows)
+-- REPACK without an argument performs no ordering, so we can only check
+-- which tables have had their relfilenode changed.
+RESET SESSION AUTHORIZATION;
+CREATE TEMP TABLE relnodes_old AS
+(SELECT relname, relfilenode
+FROM pg_class
+WHERE relname IN ('clstr_1', 'clstr_2', 'clstr_3'));
+SET SESSION AUTHORIZATION regress_clstr_user;
+SET client_min_messages = ERROR; -- order of "skipping" warnings may vary
+REPACK;
+RESET client_min_messages;
+RESET SESSION AUTHORIZATION;
+CREATE TEMP TABLE relnodes_new AS
+(SELECT relname, relfilenode
+FROM pg_class
+WHERE relname IN ('clstr_1', 'clstr_2', 'clstr_3'));
+-- Do the actual comparison. Unlike with CLUSTER, clstr_3 should have been
+-- processed because no clustering index is involved here.
+SELECT o.relname FROM relnodes_old o
+JOIN relnodes_new n ON o.relname = n.relname
+WHERE o.relfilenode <> n.relfilenode
+ORDER BY o.relname;
+ relname
+---------
+ clstr_1
+ clstr_3
+(2 rows)
+
+SET SESSION AUTHORIZATION regress_clstr_user;
-- Test MVCC-safety of cluster. There isn't much we can do to verify the
-- results with a single backend...
CREATE TABLE clustertest (key int PRIMARY KEY);
@@ -495,6 +581,43 @@ ALTER TABLE clstrpart SET WITHOUT CLUSTER;
ERROR: cannot mark index clustered in partitioned table
ALTER TABLE clstrpart CLUSTER ON clstrpart_idx;
ERROR: cannot mark index clustered in partitioned table
+-- Check that REPACK sets new relfilenodes: it should process exactly the same
+-- tables as CLUSTER did.
+DROP TABLE old_cluster_info;
+DROP TABLE new_cluster_info;
+CREATE TEMP TABLE old_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+REPACK clstrpart USING INDEX clstrpart_idx;
+CREATE TEMP TABLE new_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM old_cluster_info AS old JOIN new_cluster_info AS new USING (relname) ORDER BY relname COLLATE "C";
+ relname | level | relkind | ?column?
+-------------+-------+---------+----------
+ clstrpart | 0 | p | t
+ clstrpart1 | 1 | p | t
+ clstrpart11 | 2 | r | f
+ clstrpart12 | 2 | p | t
+ clstrpart2 | 1 | r | f
+ clstrpart3 | 1 | p | t
+ clstrpart33 | 2 | r | f
+(7 rows)
+
+-- And finally the same for REPACK without an index.
+DROP TABLE old_cluster_info;
+DROP TABLE new_cluster_info;
+CREATE TEMP TABLE old_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+REPACK clstrpart;
+CREATE TEMP TABLE new_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM old_cluster_info AS old JOIN new_cluster_info AS new USING (relname) ORDER BY relname COLLATE "C";
+ relname | level | relkind | ?column?
+-------------+-------+---------+----------
+ clstrpart | 0 | p | t
+ clstrpart1 | 1 | p | t
+ clstrpart11 | 2 | r | f
+ clstrpart12 | 2 | p | t
+ clstrpart2 | 1 | r | f
+ clstrpart3 | 1 | p | t
+ clstrpart33 | 2 | r | f
+(7 rows)
+
DROP TABLE clstrpart;
-- Ownership of partitions is checked
CREATE TABLE ptnowner(i int unique) PARTITION BY LIST (i);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 6cf828ca8d0..328235044d9 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2062,6 +2062,29 @@ pg_stat_progress_create_index| SELECT s.pid,
s.param15 AS partitions_done
FROM (pg_stat_get_progress_info('CREATE INDEX'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
+pg_stat_progress_repack| SELECT s.pid,
+ s.datid,
+ d.datname,
+ s.relid,
+ CASE s.param2
+ WHEN 0 THEN 'initializing'::text
+ WHEN 1 THEN 'seq scanning heap'::text
+ WHEN 2 THEN 'index scanning heap'::text
+ WHEN 3 THEN 'sorting tuples'::text
+ WHEN 4 THEN 'writing new heap'::text
+ WHEN 5 THEN 'swapping relation files'::text
+ WHEN 6 THEN 'rebuilding index'::text
+ WHEN 7 THEN 'performing final cleanup'::text
+ ELSE NULL::text
+ END AS phase,
+ (s.param3)::oid AS repack_index_relid,
+ s.param4 AS heap_tuples_scanned,
+ s.param5 AS heap_tuples_written,
+ s.param6 AS heap_blks_total,
+ s.param7 AS heap_blks_scanned,
+ s.param8 AS index_rebuild_count
+ FROM (pg_stat_get_progress_info('REPACK'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
+ LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_progress_vacuum| SELECT s.pid,
s.datid,
d.datname,
diff --git a/src/test/regress/sql/cluster.sql b/src/test/regress/sql/cluster.sql
index b7115f86104..cfcc3dc9761 100644
--- a/src/test/regress/sql/cluster.sql
+++ b/src/test/regress/sql/cluster.sql
@@ -76,6 +76,19 @@ INSERT INTO clstr_tst (b, c) VALUES (1111, 'this should fail');
SELECT conname FROM pg_constraint WHERE conrelid = 'clstr_tst'::regclass
ORDER BY 1;
+-- REPACK handles individual tables identically to CLUSTER, but it's worth
+-- checking whether it handles table hierarchies identically as well.
+REPACK clstr_tst USING INDEX clstr_tst_c;
+
+-- Verify that inheritance link still works
+INSERT INTO clstr_tst_inh VALUES (0, 100, 'in child table 2');
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst;
+
+-- Verify that foreign key link still works
+INSERT INTO clstr_tst (b, c) VALUES (1111, 'this should fail');
+
+SELECT conname FROM pg_constraint WHERE conrelid = 'clstr_tst'::regclass
+ORDER BY 1;
SELECT relname, relkind,
EXISTS(SELECT 1 FROM pg_class WHERE oid = c.reltoastrelid) AS hastoast
@@ -159,6 +172,34 @@ INSERT INTO clstr_1 VALUES (1);
CLUSTER clstr_1;
SELECT * FROM clstr_1;
+-- REPACK w/o argument performs no ordering, so we can only check which tables
+-- have the relfilenode changed.
+RESET SESSION AUTHORIZATION;
+CREATE TEMP TABLE relnodes_old AS
+(SELECT relname, relfilenode
+FROM pg_class
+WHERE relname IN ('clstr_1', 'clstr_2', 'clstr_3'));
+
+SET SESSION AUTHORIZATION regress_clstr_user;
+SET client_min_messages = ERROR; -- order of "skipping" warnings may vary
+REPACK;
+RESET client_min_messages;
+
+RESET SESSION AUTHORIZATION;
+CREATE TEMP TABLE relnodes_new AS
+(SELECT relname, relfilenode
+FROM pg_class
+WHERE relname IN ('clstr_1', 'clstr_2', 'clstr_3'));
+
+-- Do the actual comparison. Unlike CLUSTER, clstr_3 should have been
+-- processed because no clustering index is involved here.
+SELECT o.relname FROM relnodes_old o
+JOIN relnodes_new n ON o.relname = n.relname
+WHERE o.relfilenode <> n.relfilenode
+ORDER BY o.relname;
+
+SET SESSION AUTHORIZATION regress_clstr_user;
+
-- Test MVCC-safety of cluster. There isn't much we can do to verify the
-- results with a single backend...
@@ -229,6 +270,24 @@ SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM o
CLUSTER clstrpart;
ALTER TABLE clstrpart SET WITHOUT CLUSTER;
ALTER TABLE clstrpart CLUSTER ON clstrpart_idx;
+
+-- Check that REPACK sets new relfilenodes: it should process exactly the same
+-- tables as CLUSTER did.
+DROP TABLE old_cluster_info;
+DROP TABLE new_cluster_info;
+CREATE TEMP TABLE old_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+REPACK clstrpart USING INDEX clstrpart_idx;
+CREATE TEMP TABLE new_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM old_cluster_info AS old JOIN new_cluster_info AS new USING (relname) ORDER BY relname COLLATE "C";
+
+-- And finally the same for REPACK w/o index.
+DROP TABLE old_cluster_info;
+DROP TABLE new_cluster_info;
+CREATE TEMP TABLE old_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+REPACK clstrpart;
+CREATE TEMP TABLE new_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM old_cluster_info AS old JOIN new_cluster_info AS new USING (relname) ORDER BY relname COLLATE "C";
+
DROP TABLE clstrpart;
-- Ownership of partitions is checked
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index d16bc208654..bc2176b62ec 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -425,6 +425,7 @@ ClientCertName
ClientConnectionInfo
ClientData
ClientSocket
+ClusterCommand
ClonePtrType
ClosePortalStmt
ClosePtrType
@@ -2525,6 +2526,7 @@ ReorderBufferTupleCidKey
ReorderBufferUpdateProgressTxnCB
ReorderTuple
RepOriginId
+RepackStmt
ReparameterizeForeignPathByChild_function
ReplaceVarsFromTargetList_context
ReplaceVarsNoMatchOption
--
2.43.5
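As an aside for reviewers: the rules.out hunk above only shows the view definition, so here is a sketch of how one might watch a long-running REPACK through it. This is illustrative SQL, not part of the patch; it assumes the pg_stat_progress_repack view and column names exactly as defined in this patch series.

```sql
-- Rough per-backend progress of a running REPACK (sketch; assumes the
-- pg_stat_progress_repack view added by this patch series).
SELECT pid,
       relid::regclass AS "table",
       phase,
       round(100.0 * heap_blks_scanned / NULLIF(heap_blks_total, 0), 1)
           AS heap_pct_scanned,
       index_rebuild_count
FROM pg_stat_progress_repack;
```

heap_blks_total is only populated during the heap-scan phases, hence the NULLIF guard against division by zero.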
Attachment: v13-0002-Move-conversion-of-a-historic-to-MVCC-snapshot-to-a-.patch (text/x-diff)
From 259862aa3f0f4dd0744937a2208d543b713b64a7 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Fri, 11 Apr 2025 11:13:13 +0200
Subject: [PATCH 2/7] Move conversion of a "historic" to MVCC snapshot to a
separate function.
The conversion is now handled by SnapBuildMVCCFromHistoric(). REPACK
CONCURRENTLY will also need it.
---
src/backend/replication/logical/snapbuild.c | 51 +++++++++++++++++----
src/backend/utils/time/snapmgr.c | 3 +-
src/include/replication/snapbuild.h | 1 +
src/include/utils/snapmgr.h | 1 +
4 files changed, 45 insertions(+), 11 deletions(-)
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 0d7bddbe4ed..feaa3ac5ad4 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -440,10 +440,7 @@ Snapshot
SnapBuildInitialSnapshot(SnapBuild *builder)
{
Snapshot snap;
- TransactionId xid;
TransactionId safeXid;
- TransactionId *newxip;
- int newxcnt = 0;
Assert(XactIsoLevel == XACT_REPEATABLE_READ);
Assert(builder->building_full_snapshot);
@@ -485,6 +482,31 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
MyProc->xmin = snap->xmin;
+ /* Convert the historic snapshot to MVCC snapshot. */
+ return SnapBuildMVCCFromHistoric(snap, true);
+}
+
+/*
+ * Turn a historic MVCC snapshot into an ordinary MVCC snapshot.
+ *
+ * Unlike a regular (non-historic) MVCC snapshot, the xip array of this
+ * snapshot contains not only running main transactions, but also their
+ * subtransactions. This difference has no impact on XidInMVCCSnapshot().
+ *
+ * Pass true for 'in_place' if you don't care about modifying the source
+ * snapshot. If you need a new instance, and one that was allocated as a
+ * single chunk of memory, pass false.
+ */
+Snapshot
+SnapBuildMVCCFromHistoric(Snapshot snapshot, bool in_place)
+{
+ TransactionId xid;
+ TransactionId *oldxip = snapshot->xip;
+ uint32 oldxcnt = snapshot->xcnt;
+ TransactionId *newxip;
+ int newxcnt = 0;
+ Snapshot result;
+
/* allocate in transaction context */
newxip = (TransactionId *)
palloc(sizeof(TransactionId) * GetMaxSnapshotXidCount());
@@ -495,7 +517,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
* classical snapshot by marking all non-committed transactions as
* in-progress. This can be expensive.
*/
- for (xid = snap->xmin; NormalTransactionIdPrecedes(xid, snap->xmax);)
+ for (xid = snapshot->xmin; NormalTransactionIdPrecedes(xid, snapshot->xmax);)
{
void *test;
@@ -503,7 +525,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
* Check whether transaction committed using the decoding snapshot
* meaning of ->xip.
*/
- test = bsearch(&xid, snap->xip, snap->xcnt,
+ test = bsearch(&xid, snapshot->xip, snapshot->xcnt,
sizeof(TransactionId), xidComparator);
if (test == NULL)
@@ -520,11 +542,22 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
}
/* adjust remaining snapshot fields as needed */
- snap->snapshot_type = SNAPSHOT_MVCC;
- snap->xcnt = newxcnt;
- snap->xip = newxip;
+ snapshot->xcnt = newxcnt;
+ snapshot->xip = newxip;
+
+ if (in_place)
+ result = snapshot;
+ else
+ {
+ result = CopySnapshot(snapshot);
+
+ /* Restore the original values so the source is intact. */
+ snapshot->xip = oldxip;
+ snapshot->xcnt = oldxcnt;
+ }
+ result->snapshot_type = SNAPSHOT_MVCC;
- return snap;
+ return result;
}
/*
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index ea35f30f494..70a6b8902d1 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -212,7 +212,6 @@ typedef struct ExportedSnapshot
static List *exportedSnapshots = NIL;
/* Prototypes for local functions */
-static Snapshot CopySnapshot(Snapshot snapshot);
static void UnregisterSnapshotNoOwner(Snapshot snapshot);
static void FreeSnapshot(Snapshot snapshot);
static void SnapshotResetXmin(void);
@@ -591,7 +590,7 @@ SetTransactionSnapshot(Snapshot sourcesnap, VirtualTransactionId *sourcevxid,
* The copy is palloc'd in TopTransactionContext and has initial refcounts set
* to 0. The returned snapshot has the copied flag set.
*/
-static Snapshot
+Snapshot
CopySnapshot(Snapshot snapshot)
{
Snapshot newsnap;
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 44031dcf6e3..6d4d2d1814c 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -73,6 +73,7 @@ extern void FreeSnapshotBuilder(SnapBuild *builder);
extern void SnapBuildSnapDecRefcount(Snapshot snap);
extern Snapshot SnapBuildInitialSnapshot(SnapBuild *builder);
+extern Snapshot SnapBuildMVCCFromHistoric(Snapshot snapshot, bool in_place);
extern const char *SnapBuildExportSnapshot(SnapBuild *builder);
extern void SnapBuildClearExportedSnapshot(void);
extern void SnapBuildResetExportedSnapshotState(void);
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index d346be71642..147b190210a 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -60,6 +60,7 @@ extern Snapshot GetTransactionSnapshot(void);
extern Snapshot GetLatestSnapshot(void);
extern void SnapshotSetCommandId(CommandId curcid);
+extern Snapshot CopySnapshot(Snapshot snapshot);
extern Snapshot GetCatalogSnapshot(Oid relid);
extern Snapshot GetNonHistoricCatalogSnapshot(Oid relid);
extern void InvalidateCatalogSnapshot(void);
--
2.43.5
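To make the refactoring above easier to follow: the conversion that SnapBuildMVCCFromHistoric() performs amounts to inverting a set. The historic snapshot's xip array lists committed XIDs, whereas an ordinary MVCC snapshot's xip array must list the in-progress ones, i.e. every XID in [xmin, xmax) not recorded as committed. A toy SQL illustration (plain SQL over generate_series, not the actual C code; the XID values are made up):

```sql
-- Historic snapshot: xmin = 100, xmax = 105, xip = {101, 103} (committed).
-- The MVCC snapshot's xip must instead hold the XIDs in [xmin, xmax)
-- that are NOT known committed, i.e. still treated as in progress:
SELECT xid AS in_progress_xid
FROM generate_series(100, 104) AS xid
WHERE xid NOT IN (101, 103);
-- => 100, 102, 104
```

As the function's comment notes, the resulting xip also carries subtransaction XIDs, which is harmless for XidInMVCCSnapshot().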
Attachment: v13-0003-Move-the-recheck-branch-to-a-separate-function.patch (text/x-diff)
From d83e7870d0b33fb0af0df01989d9019919fd3fff Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Fri, 11 Apr 2025 11:13:13 +0200
Subject: [PATCH 3/7] Move the "recheck" branch to a separate function.
At some point I thought that the relation must be unlocked during the call of
setup_logical_decoding(), to avoid a deadlock. In that case we'd need to
recheck afterwards if the table still meets the requirements of cluster_rel().
Eventually I concluded that the risk of that deadlock is not that high, so the
table stays locked during the call of setup_logical_decoding(). Therefore the
rechecking code is only executed once per table. Anyway, this patch might be
useful in terms of code readability.
---
src/backend/commands/cluster.c | 108 +++++++++++++++++++--------------
1 file changed, 62 insertions(+), 46 deletions(-)
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index c6f2a3ace5a..c463cd58672 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -69,6 +69,8 @@ typedef struct
static void cluster_multiple_rels(List *rtcs, ClusterParams *params,
ClusterCommand cmd);
+static bool cluster_rel_recheck(Relation OldHeap, Oid indexOid, Oid userid,
+ int options);
static void rebuild_relation(Relation OldHeap, Relation index, bool verbose,
ClusterCommand cmd);
static void copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
@@ -317,53 +319,9 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
* to cluster a not-previously-clustered index.
*/
if (recheck)
- {
- /* Check that the user still has privileges for the relation */
- if (!cluster_is_permitted_for_relation(tableOid, save_userid,
- CLUSTER_COMMAND_CLUSTER))
- {
- relation_close(OldHeap, AccessExclusiveLock);
+ if (!cluster_rel_recheck(OldHeap, indexOid, save_userid,
+ params->options))
goto out;
- }
-
- /*
- * Silently skip a temp table for a remote session. Only doing this
- * check in the "recheck" case is appropriate (which currently means
- * somebody is executing a database-wide CLUSTER or on a partitioned
- * table), because there is another check in cluster() which will stop
- * any attempt to cluster remote temp tables by name. There is
- * another check in cluster_rel which is redundant, but we leave it
- * for extra safety.
- */
- if (RELATION_IS_OTHER_TEMP(OldHeap))
- {
- relation_close(OldHeap, AccessExclusiveLock);
- goto out;
- }
-
- if (OidIsValid(indexOid))
- {
- /*
- * Check that the index still exists
- */
- if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(indexOid)))
- {
- relation_close(OldHeap, AccessExclusiveLock);
- goto out;
- }
-
- /*
- * Check that the index is still the one with indisclustered set,
- * if needed.
- */
- if ((params->options & CLUOPT_RECHECK_ISCLUSTERED) != 0 &&
- !get_index_isclustered(indexOid))
- {
- relation_close(OldHeap, AccessExclusiveLock);
- goto out;
- }
- }
- }
/*
* We allow VACUUM FULL, but not CLUSTER, on shared catalogs. CLUSTER
@@ -465,6 +423,64 @@ out:
pgstat_progress_end_command();
}
+/*
+ * Check if the table (and its index) still meets the requirements of
+ * cluster_rel().
+ */
+static bool
+cluster_rel_recheck(Relation OldHeap, Oid indexOid, Oid userid,
+ int options)
+{
+ Oid tableOid = RelationGetRelid(OldHeap);
+
+ /* Check that the user still has privileges for the relation */
+ if (!cluster_is_permitted_for_relation(tableOid, userid,
+ CLUSTER_COMMAND_CLUSTER))
+ {
+ relation_close(OldHeap, AccessExclusiveLock);
+ return false;
+ }
+
+ /*
+ * Silently skip a temp table for a remote session. Only doing this check
+ * in the "recheck" case is appropriate (which currently means somebody is
+ * executing a database-wide CLUSTER or on a partitioned table), because
+ * there is another check in cluster() which will stop any attempt to
+ * cluster remote temp tables by name. There is another check in
+ * cluster_rel which is redundant, but we leave it for extra safety.
+ */
+ if (RELATION_IS_OTHER_TEMP(OldHeap))
+ {
+ relation_close(OldHeap, AccessExclusiveLock);
+ return false;
+ }
+
+ if (OidIsValid(indexOid))
+ {
+ /*
+ * Check that the index still exists
+ */
+ if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(indexOid)))
+ {
+ relation_close(OldHeap, AccessExclusiveLock);
+ return false;
+ }
+
+ /*
+ * Check that the index is still the one with indisclustered set, if
+ * needed.
+ */
+ if ((options & CLUOPT_RECHECK_ISCLUSTERED) != 0 &&
+ !get_index_isclustered(indexOid))
+ {
+ relation_close(OldHeap, AccessExclusiveLock);
+ return false;
+ }
+ }
+
+ return true;
+}
+
/*
* Verify that the specified heap and index are valid to cluster on
*
--
2.43.5
Attachment: v13-0004-Add-CONCURRENTLY-option-to-REPACK-command.patch (text/plain)
From 39d899a8ce9a38bca6439ac36158e52bbdc088ab Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Fri, 11 Apr 2025 11:13:14 +0200
Subject: [PATCH 4/7] Add CONCURRENTLY option to REPACK command.
The REPACK command copies the relation data into a new file, creates new
indexes and eventually swaps the files. To make sure that the old file does
not change during the copying, the relation is locked in an exclusive mode,
which prevents applications from both reading and writing. (To keep the data
consistent, we'd only need to prevent the applications from writing, but even
reading needs to be blocked before we can swap the files - otherwise some
applications could continue using the old file. Since we should not request a
stronger lock without releasing the weaker one first, we acquire the exclusive
lock in the beginning and keep it till the end of the processing.)
This patch introduces an alternative workflow, which only requires the
exclusive lock when the relation (and index) files are being swapped.
(Supposedly, the swapping should be pretty fast.) On the other hand, when we
copy the data to the new file, we allow applications to read from the relation
and even to write to it.
First, we scan the relation using a "historic snapshot", and insert all the
tuples satisfying this snapshot into the new file.
Second, logical decoding is used to capture the data changes done by
applications during the copying (i.e. changes that do not satisfy the historic
snapshot mentioned above), and those are applied to the new file before we
acquire the exclusive lock that we need to swap the files. (Of course, more
data changes can take place while we are waiting for the lock - these will be
applied to the new file after we have acquired the lock, before we swap the
files.)
Since the logical decoding system, during its startup, waits until all the
transactions which already have XID assigned have finished, there is a risk of
deadlock if a transaction that already changed anything in the database tries
to acquire a conflicting lock on the table REPACK CONCURRENTLY is working
on. As an example, consider transaction running CREATE INDEX command on the
table that is being REPACKed CONCURRENTLY. On the other hand, DML commands
(INSERT, UPDATE, DELETE) are not a problem as their lock does not conflict
with REPACK CONCURRENTLY.
The current approach is that we accept the risk. If we tried to avoid it, it'd
be necessary to unlock the table before the logical decoding is setup and lock
it again afterwards. Such temporary unlocking would imply re-checking if the
table still meets all the requirements for REPACK CONCURRENTLY.
Like the existing implementation of REPACK, the variant with the CONCURRENTLY
option also requires an extra space for the new relation and index files
(which coexist with the old files for some time). In addition, the
CONCURRENTLY option might introduce a lag in releasing WAL segments for
archiving / recycling. This is due to the decoding of the data changes done by
applications concurrently. When copying the table contents into the new file,
we check the lag periodically. If it exceeds the size of a WAL segment, we
decode all the available WAL before resuming the copying. (Of course, the
changes are not applied until the whole table contents is copied.) A
background worker might be a better approach for the decoding - let's consider
implementing it in the future.
The WAL records produced by running DML commands on the new relation do not
contain enough information to be processed by the logical decoding system. All
we need from the new relation is the file (relfilenode), while the actual
relation is eventually dropped. Thus there is no point in replaying the DMLs
anywhere.
---
doc/src/sgml/monitoring.sgml | 37 +-
doc/src/sgml/mvcc.sgml | 12 +-
doc/src/sgml/ref/repack.sgml | 134 +-
src/Makefile | 1 +
src/backend/access/heap/heapam.c | 34 +-
src/backend/access/heap/heapam_handler.c | 215 +-
src/backend/access/heap/rewriteheap.c | 6 +-
src/backend/access/transam/xact.c | 11 +-
src/backend/catalog/index.c | 43 +-
src/backend/catalog/system_views.sql | 30 +-
src/backend/commands/cluster.c | 1895 +++++++++++++++--
src/backend/commands/matview.c | 2 +-
src/backend/commands/tablecmds.c | 1 +
src/backend/commands/vacuum.c | 12 +-
src/backend/meson.build | 1 +
src/backend/parser/gram.y | 15 +-
src/backend/replication/logical/decode.c | 83 +
src/backend/replication/logical/snapbuild.c | 20 +
.../replication/pgoutput_repack/Makefile | 32 +
.../replication/pgoutput_repack/meson.build | 18 +
.../pgoutput_repack/pgoutput_repack.c | 288 +++
src/backend/storage/ipc/ipci.c | 1 +
.../utils/activity/wait_event_names.txt | 1 +
src/backend/utils/cache/relcache.c | 1 +
src/backend/utils/time/snapmgr.c | 3 +-
src/bin/psql/tab-complete.in.c | 25 +-
src/include/access/heapam.h | 9 +-
src/include/access/heapam_xlog.h | 2 +
src/include/access/tableam.h | 10 +
src/include/catalog/index.h | 3 +
src/include/commands/cluster.h | 87 +-
src/include/commands/progress.h | 23 +-
src/include/nodes/parsenodes.h | 1 +
src/include/replication/snapbuild.h | 1 +
src/include/storage/lockdefs.h | 4 +-
src/include/storage/lwlocklist.h | 1 +
src/include/utils/snapmgr.h | 2 +
src/test/regress/expected/rules.out | 29 +-
src/tools/pgindent/typedefs.list | 4 +
39 files changed, 2767 insertions(+), 330 deletions(-)
create mode 100644 src/backend/replication/pgoutput_repack/Makefile
create mode 100644 src/backend/replication/pgoutput_repack/meson.build
create mode 100644 src/backend/replication/pgoutput_repack/pgoutput_repack.c
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 9f1432c1ae6..8a9d16dcc5b 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6051,14 +6051,35 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>heap_tuples_written</structfield> <type>bigint</type>
+ <structfield>heap_tuples_inserted</structfield> <type>bigint</type>
</para>
<para>
- Number of heap tuples written.
+ Number of heap tuples inserted.
This counter only advances when the phase is
<literal>seq scanning heap</literal>,
- <literal>index scanning heap</literal>
- or <literal>writing new heap</literal>.
+ <literal>index scanning heap</literal>,
+ <literal>writing new heap</literal>
+ or <literal>catch-up</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_updated</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples updated.
+ This counter only advances when the phase is <literal>catch-up</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_deleted</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples deleted.
+ This counter only advances when the phase is <literal>catch-up</literal>.
</para></entry>
</row>
@@ -6139,6 +6160,14 @@ FROM pg_stat_get_backend_idset() AS backendid;
<command>REPACK</command> is currently writing the new heap.
</entry>
</row>
+ <row>
+ <entry><literal>catch-up</literal></entry>
+ <entry>
+ <command>REPACK CONCURRENTLY</command> is currently processing the DML
+ commands that other transactions executed during any of the preceding
+ phases.
+ </entry>
+ </row>
<row>
<entry><literal>swapping relation files</literal></entry>
<entry>
diff --git a/doc/src/sgml/mvcc.sgml b/doc/src/sgml/mvcc.sgml
index 049ee75a4ba..0f5c34af542 100644
--- a/doc/src/sgml/mvcc.sgml
+++ b/doc/src/sgml/mvcc.sgml
@@ -1833,15 +1833,17 @@ SELECT pg_advisory_lock(q.id) FROM
<title>Caveats</title>
<para>
- Some DDL commands, currently only <link linkend="sql-truncate"><command>TRUNCATE</command></link> and the
- table-rewriting forms of <link linkend="sql-altertable"><command>ALTER TABLE</command></link>, are not
+ Some commands, currently only <link linkend="sql-truncate"><command>TRUNCATE</command></link>, the
+ table-rewriting forms of <link linkend="sql-altertable"><command>ALTER
+ TABLE</command></link> and <command>REPACK</command> with
+ the <literal>CONCURRENTLY</literal> option, are not
MVCC-safe. This means that after the truncation or rewrite commits, the
table will appear empty to concurrent transactions, if they are using a
- snapshot taken before the DDL command committed. This will only be an
+ snapshot taken before the command committed. This will only be an
issue for a transaction that did not access the table in question
- before the DDL command started — any transaction that has done so
+ before the command started — any transaction that has done so
would hold at least an <literal>ACCESS SHARE</literal> table lock,
- which would block the DDL command until that transaction completes.
+ which would block the truncating or rewriting command until that transaction completes.
So these commands will not cause any apparent inconsistency in the
table contents for successive queries on the target table, but they
could cause visible inconsistency between the contents of the target
diff --git a/doc/src/sgml/ref/repack.sgml b/doc/src/sgml/ref/repack.sgml
index c74a5023a54..c837e4614f3 100644
--- a/doc/src/sgml/ref/repack.sgml
+++ b/doc/src/sgml/ref/repack.sgml
@@ -22,8 +22,10 @@ PostgreSQL documentation
<refsynopsisdiv>
<synopsis>
REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
-[ <replaceable class="parameter">table_name</replaceable> [ USING INDEX
-<replaceable class="parameter">index_name</replaceable> ] ]
+[ <replaceable class="parameter">table_name</replaceable> [ USING INDEX <replaceable class="parameter">index_name</replaceable> ] ]
+REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
+CONCURRENTLY <replaceable class="parameter">table_name</replaceable> [ USING
+INDEX <replaceable class="parameter">index_name</replaceable> ]
<phrase>where <replaceable class="parameter">option</replaceable> can be one of:</phrase>
@@ -50,7 +52,8 @@ REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
processes every table and materialized view in the current database that
the current user has the <literal>MAINTAIN</literal> privilege on. This
form of <command>REPACK</command> cannot be executed inside a transaction
- block.
+ block. Also, this form is not allowed if
+ the <literal>CONCURRENTLY</literal> option is used.
</para>
<para>
@@ -63,7 +66,8 @@ REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
When a table is being repacked, an <literal>ACCESS EXCLUSIVE</literal> lock
is acquired on it. This prevents any other database operations (both reads
and writes) from operating on the table until the <command>REPACK</command>
- is finished.
+ is finished. If you want to keep the table accessible during the repacking,
+ consider using the <literal>CONCURRENTLY</literal> option.
</para>
<refsect2 id="sql-repack-notes-on-clustering" xreflabel="Notes on Clustering">
@@ -162,6 +166,128 @@ REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><literal>CONCURRENTLY</literal></term>
+ <listitem>
+ <para>
+ Allow other transactions to use the table while it is being repacked.
+ </para>
+
+ <para>
+ Internally, <command>REPACK</command> copies the contents of the table
+ (ignoring dead tuples) into a new file, sorted by the specified index,
+ and also creates a new file for each index. Then it swaps the old and
+ new files for the table and all the indexes, and deletes the old
+ files. The <literal>ACCESS EXCLUSIVE</literal> lock is needed to make
+ sure that the old files do not change during the processing because the
+ changes would get lost due to the swap.
+ </para>
+
+ <para>
+ With the <literal>CONCURRENTLY</literal> option, the <literal>ACCESS
+ EXCLUSIVE</literal> lock is only acquired to swap the table and index
+ files. The data changes that took place during the creation of the new
+ table and index files are captured using logical decoding
+ (<xref linkend="logicaldecoding"/>) and applied before
+ the <literal>ACCESS EXCLUSIVE</literal> lock is requested. Thus the lock
+ is typically held only for the time needed to swap the files, which
+ should be pretty short. However, the time might still be noticeable if
+ too many data changes have been done to the table while
+ <command>REPACK</command> was waiting for the lock: those changes must
+ be processed just before the files are swapped, while the
+ <literal>ACCESS EXCLUSIVE</literal> lock is being held.
+ </para>
+
+ <para>
+ Note that <command>REPACK</command> with
+ the <literal>CONCURRENTLY</literal> option does not try to order the
+ rows inserted into the table after the repacking started. Also
+ note <command>REPACK</command> might fail to complete due to DDL
+ commands executed on the table by other transactions during the
+ repacking.
+ </para>
+
+ <note>
+ <para>
+ In addition to the temporary space requirements explained in
+ <xref linkend="sql-repack-notes-on-resources"/>,
+ the <literal>CONCURRENTLY</literal> option can add a bit more to the
+ usage of temporary space. The reason is that other transactions can
+ perform DML operations which cannot be applied to the new file until
+ <command>REPACK</command> has copied all the tuples from the old
+ file. Thus the tuples inserted into the old file during the copying are
+ also stored separately in a temporary file, so they can eventually be
+ applied to the new file.
+ </para>
+
+ <para>
+ Furthermore, the data changes performed during the copying are
+ extracted from the <link linkend="wal">write-ahead log</link> (WAL), and
+ this extraction (decoding) only takes place when a certain amount of WAL
+ has been written. Therefore, WAL removal can be delayed by this
+ threshold. Currently the threshold is equal to the value of
+ the <link linkend="guc-wal-segment-size"><varname>wal_segment_size</varname></link>
+ configuration parameter.
+ </para>
+ </note>
+
+ <para>
+ The <literal>CONCURRENTLY</literal> option cannot be used in the
+ following cases:
+
+ <itemizedlist>
+ <listitem>
+ <para>
+ The table is <literal>UNLOGGED</literal>.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The table is partitioned.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The table is a system catalog or a <acronym>TOAST</acronym> table.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <command>REPACK</command> is executed inside a transaction block.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The <link linkend="guc-wal-level"><varname>wal_level</varname></link>
+ configuration parameter is less than <literal>logical</literal>.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The <link linkend="guc-max-replication-slots"><varname>max_replication_slots</varname></link>
+ configuration parameter does not allow for creation of an additional
+ replication slot.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+
+ <warning>
+ <para>
+ <command>REPACK</command> with the <literal>CONCURRENTLY</literal>
+ option is not MVCC-safe, see <xref linkend="mvcc-caveats"/> for
+ details.
+ </para>
+ </warning>
+
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><literal>VERBOSE</literal></term>
<listitem>
diff --git a/src/Makefile b/src/Makefile
index 2f31a2f20a7..b18c9a14ffa 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -23,6 +23,7 @@ SUBDIRS = \
interfaces \
backend/replication/libpqwalreceiver \
backend/replication/pgoutput \
+ backend/replication/pgoutput_repack \
fe_utils \
bin \
pl \
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index ed2e3021799..da98aadf39f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -60,7 +60,8 @@ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
HeapTuple newtup, HeapTuple old_key_tuple,
- bool all_visible_cleared, bool new_all_visible_cleared);
+ bool all_visible_cleared, bool new_all_visible_cleared,
+ bool wal_logical);
#ifdef USE_ASSERT_CHECKING
static void check_lock_if_inplace_updateable_rel(Relation relation,
ItemPointer otid,
@@ -2744,7 +2745,7 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
TM_Result
heap_delete(Relation relation, ItemPointer tid,
CommandId cid, Snapshot crosscheck, bool wait,
- TM_FailureData *tmfd, bool changingPart)
+ TM_FailureData *tmfd, bool changingPart, bool wal_logical)
{
TM_Result result;
TransactionId xid = GetCurrentTransactionId();
@@ -2989,7 +2990,8 @@ l1:
* Compute replica identity tuple before entering the critical section so
* we don't PANIC upon a memory allocation failure.
*/
- old_key_tuple = ExtractReplicaIdentity(relation, &tp, true, &old_key_copied);
+ old_key_tuple = wal_logical ?
+ ExtractReplicaIdentity(relation, &tp, true, &old_key_copied) : NULL;
/*
* If this is the first possibly-multixact-able operation in the current
@@ -3079,6 +3081,15 @@ l1:
xlrec.flags |= XLH_DELETE_CONTAINS_OLD_KEY;
}
+ /*
+ * Unlike UPDATE, DELETE is decoded even if there is no old key, so
+ * clearing both XLH_DELETE_CONTAINS_OLD_TUPLE and
+ * XLH_DELETE_CONTAINS_OLD_KEY does not help. Thus we need an extra
+ * flag. TODO: Consider not decoding tuples w/o the old tuple/key instead.
+ */
+ if (!wal_logical)
+ xlrec.flags |= XLH_DELETE_NO_LOGICAL;
+
XLogBeginInsert();
XLogRegisterData(&xlrec, SizeOfHeapDelete);
@@ -3171,7 +3182,8 @@ simple_heap_delete(Relation relation, ItemPointer tid)
result = heap_delete(relation, tid,
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &tmfd, false /* changingPart */ );
+ &tmfd, false /* changingPart */,
+ true /* wal_logical */);
switch (result)
{
case TM_SelfModified:
@@ -3212,7 +3224,7 @@ TM_Result
heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
CommandId cid, Snapshot crosscheck, bool wait,
TM_FailureData *tmfd, LockTupleMode *lockmode,
- TU_UpdateIndexes *update_indexes)
+ TU_UpdateIndexes *update_indexes, bool wal_logical)
{
TM_Result result;
TransactionId xid = GetCurrentTransactionId();
@@ -4103,7 +4115,8 @@ l2:
newbuf, &oldtup, heaptup,
old_key_tuple,
all_visible_cleared,
- all_visible_cleared_new);
+ all_visible_cleared_new,
+ wal_logical);
if (newbuf != buffer)
{
PageSetLSN(BufferGetPage(newbuf), recptr);
@@ -4461,7 +4474,8 @@ simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup,
result = heap_update(relation, otid, tup,
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &tmfd, &lockmode, update_indexes);
+ &tmfd, &lockmode, update_indexes,
+ true /* wal_logical */);
switch (result)
{
case TM_SelfModified:
@@ -8794,7 +8808,8 @@ static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup, HeapTuple newtup,
HeapTuple old_key_tuple,
- bool all_visible_cleared, bool new_all_visible_cleared)
+ bool all_visible_cleared, bool new_all_visible_cleared,
+ bool wal_logical)
{
xl_heap_update xlrec;
xl_heap_header xlhdr;
@@ -8805,7 +8820,8 @@ log_heap_update(Relation reln, Buffer oldbuf,
suffixlen = 0;
XLogRecPtr recptr;
Page page = BufferGetPage(newbuf);
- bool need_tuple_data = RelationIsLogicallyLogged(reln);
+ bool need_tuple_data = RelationIsLogicallyLogged(reln) &&
+ wal_logical;
bool init;
int bufflags;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index d91e66241fb..9d55004305f 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -33,6 +33,7 @@
#include "catalog/index.h"
#include "catalog/storage.h"
#include "catalog/storage_xlog.h"
+#include "commands/cluster.h"
#include "commands/progress.h"
#include "executor/executor.h"
#include "miscadmin.h"
@@ -309,7 +310,8 @@ heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
* the storage itself is cleaning the dead tuples by itself, it is the
* time to call the index tuple deletion also.
*/
- return heap_delete(relation, tid, cid, crosscheck, wait, tmfd, changingPart);
+ return heap_delete(relation, tid, cid, crosscheck, wait, tmfd, changingPart,
+ true);
}
@@ -328,7 +330,7 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
tuple->t_tableOid = slot->tts_tableOid;
result = heap_update(relation, otid, tuple, cid, crosscheck, wait,
- tmfd, lockmode, update_indexes);
+ tmfd, lockmode, update_indexes, true);
ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
/*
@@ -685,13 +687,15 @@ static void
heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
Relation OldIndex, bool use_sort,
TransactionId OldestXmin,
+ Snapshot snapshot,
+ LogicalDecodingContext *decoding_ctx,
TransactionId *xid_cutoff,
MultiXactId *multi_cutoff,
double *num_tuples,
double *tups_vacuumed,
double *tups_recently_dead)
{
- RewriteState rwstate;
+ RewriteState rwstate = NULL;
IndexScanDesc indexScan;
TableScanDesc tableScan;
HeapScanDesc heapScan;
@@ -705,6 +709,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
bool *isnull;
BufferHeapTupleTableSlot *hslot;
BlockNumber prev_cblock = InvalidBlockNumber;
+ bool concurrent = snapshot != NULL;
+ XLogRecPtr end_of_wal_prev = GetFlushRecPtr(NULL);
/* Remember if it's a system catalog */
is_system_catalog = IsSystemRelation(OldHeap);
@@ -720,9 +726,12 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
values = (Datum *) palloc(natts * sizeof(Datum));
isnull = (bool *) palloc(natts * sizeof(bool));
- /* Initialize the rewrite operation */
- rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, *xid_cutoff,
- *multi_cutoff);
+ /*
+ * Initialize the rewrite operation.
+ */
+ if (!concurrent)
+ rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin,
+ *xid_cutoff, *multi_cutoff);
/* Set up sorting if wanted */
@@ -737,6 +746,9 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
* Prepare to scan the OldHeap. To ensure we see recently-dead tuples
* that still need to be copied, we scan with SnapshotAny and use
* HeapTupleSatisfiesVacuum for the visibility test.
+ *
+ * In the CONCURRENTLY case, we do regular MVCC visibility tests, using
+ * the snapshot passed by the caller.
*/
if (OldIndex != NULL && !use_sort)
{
@@ -753,7 +765,9 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
tableScan = NULL;
heapScan = NULL;
- indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, NULL, 0, 0);
+ indexScan = index_beginscan(OldHeap, OldIndex,
+ snapshot ? snapshot : SnapshotAny,
+ NULL, 0, 0);
index_rescan(indexScan, NULL, 0, NULL, 0);
}
else
@@ -762,7 +776,9 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
PROGRESS_REPACK_PHASE_SEQ_SCAN_HEAP);
- tableScan = table_beginscan(OldHeap, SnapshotAny, 0, (ScanKey) NULL);
+ tableScan = table_beginscan(OldHeap,
+ snapshot ? snapshot : SnapshotAny,
+ 0, (ScanKey) NULL);
heapScan = (HeapScanDesc) tableScan;
indexScan = NULL;
@@ -785,6 +801,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
HeapTuple tuple;
Buffer buf;
bool isdead;
+ HTSV_Result vis;
CHECK_FOR_INTERRUPTS();
@@ -837,70 +854,84 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
tuple = ExecFetchSlotHeapTuple(slot, false, NULL);
buf = hslot->buffer;
- LockBuffer(buf, BUFFER_LOCK_SHARE);
-
- switch (HeapTupleSatisfiesVacuum(tuple, OldestXmin, buf))
+ /*
+ * Regarding CONCURRENTLY, see the comments on MVCC snapshot above.
+ */
+ if (!concurrent)
{
- case HEAPTUPLE_DEAD:
- /* Definitely dead */
- isdead = true;
- break;
- case HEAPTUPLE_RECENTLY_DEAD:
- *tups_recently_dead += 1;
- /* fall through */
- case HEAPTUPLE_LIVE:
- /* Live or recently dead, must copy it */
- isdead = false;
- break;
- case HEAPTUPLE_INSERT_IN_PROGRESS:
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
- /*
- * Since we hold exclusive lock on the relation, normally the
- * only way to see this is if it was inserted earlier in our
- * own transaction. However, it can happen in system
- * catalogs, since we tend to release write lock before commit
- * there. Give a warning if neither case applies; but in any
- * case we had better copy it.
- */
- if (!is_system_catalog &&
- !TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(tuple->t_data)))
- elog(WARNING, "concurrent insert in progress within table \"%s\"",
- RelationGetRelationName(OldHeap));
- /* treat as live */
- isdead = false;
- break;
- case HEAPTUPLE_DELETE_IN_PROGRESS:
+ switch ((vis = HeapTupleSatisfiesVacuum(tuple, OldestXmin, buf)))
+ {
+ case HEAPTUPLE_DEAD:
+ /* Definitely dead */
+ isdead = true;
+ break;
+ case HEAPTUPLE_RECENTLY_DEAD:
+ *tups_recently_dead += 1;
+ /* fall through */
+ case HEAPTUPLE_LIVE:
+ /* Live or recently dead, must copy it */
+ isdead = false;
+ break;
+ case HEAPTUPLE_INSERT_IN_PROGRESS:
/*
- * Similar situation to INSERT_IN_PROGRESS case.
+ * As long as we hold exclusive lock on the relation, normally
+ * the only way to see this is if it was inserted earlier in
+ * our own transaction. However, it can happen in system
+ * catalogs, since we tend to release write lock before commit
+ * there. Also, there's no exclusive lock during concurrent
+ * processing. Give a warning if neither case applies; but in
+ * any case we had better copy it.
*/
- if (!is_system_catalog &&
- !TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetUpdateXid(tuple->t_data)))
- elog(WARNING, "concurrent delete in progress within table \"%s\"",
- RelationGetRelationName(OldHeap));
- /* treat as recently dead */
- *tups_recently_dead += 1;
- isdead = false;
- break;
- default:
- elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result");
- isdead = false; /* keep compiler quiet */
- break;
- }
+ if (!is_system_catalog && !concurrent &&
+ !TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(tuple->t_data)))
+ elog(WARNING, "concurrent insert in progress within table \"%s\"",
+ RelationGetRelationName(OldHeap));
+ /* treat as live */
+ isdead = false;
+ break;
+ case HEAPTUPLE_DELETE_IN_PROGRESS:
- LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ /*
+ * Similar situation to INSERT_IN_PROGRESS case.
+ */
+ if (!is_system_catalog && !concurrent &&
+ !TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetUpdateXid(tuple->t_data)))
+ elog(WARNING, "concurrent delete in progress within table \"%s\"",
+ RelationGetRelationName(OldHeap));
+ /* treat as recently dead */
+ *tups_recently_dead += 1;
+ isdead = false;
+ break;
+ default:
+ elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result");
+ isdead = false; /* keep compiler quiet */
+ break;
+ }
- if (isdead)
- {
- *tups_vacuumed += 1;
- /* heap rewrite module still needs to see it... */
- if (rewrite_heap_dead_tuple(rwstate, tuple))
+ if (isdead)
{
- /* A previous recently-dead tuple is now known dead */
*tups_vacuumed += 1;
- *tups_recently_dead -= 1;
+ /* heap rewrite module still needs to see it... */
+ if (rewrite_heap_dead_tuple(rwstate, tuple))
+ {
+ /* A previous recently-dead tuple is now known dead */
+ *tups_vacuumed += 1;
+ *tups_recently_dead -= 1;
+ }
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ continue;
}
- continue;
+
+ /*
+ * In the concurrent case, we have a copy of the tuple, so we
+ * don't worry whether the source tuple will be deleted/updated
+ * after we release the lock.
+ */
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
}
*num_tuples += 1;
@@ -919,7 +950,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
{
const int ct_index[] = {
PROGRESS_REPACK_HEAP_TUPLES_SCANNED,
- PROGRESS_REPACK_HEAP_TUPLES_WRITTEN
+ PROGRESS_REPACK_HEAP_TUPLES_INSERTED
};
int64 ct_val[2];
@@ -934,6 +965,31 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
ct_val[1] = *num_tuples;
pgstat_progress_update_multi_param(2, ct_index, ct_val);
}
+
+ /*
+ * Process the WAL produced by the load, as well as by other
+ * transactions, so that the replication slot can advance and WAL does
+ * not pile up. Use wal_segment_size as a threshold so that we do not
+ * introduce the decoding overhead too often.
+ *
+ * Of course, we must not apply the changes until the initial load has
+ * completed.
+ *
+ * Note that our insertions into the new table should not be decoded
+ * as we (intentionally) do not write the logical decoding specific
+ * information to WAL.
+ */
+ if (concurrent)
+ {
+ XLogRecPtr end_of_wal;
+
+ end_of_wal = GetFlushRecPtr(NULL);
+ if ((end_of_wal - end_of_wal_prev) > wal_segment_size)
+ {
+ repack_decode_concurrent_changes(decoding_ctx, end_of_wal);
+ end_of_wal_prev = end_of_wal;
+ }
+ }
}
if (indexScan != NULL)
@@ -977,7 +1033,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
values, isnull,
rwstate);
/* Report n_tuples */
- pgstat_progress_update_param(PROGRESS_REPACK_HEAP_TUPLES_WRITTEN,
+ pgstat_progress_update_param(PROGRESS_REPACK_HEAP_TUPLES_INSERTED,
n_tuples);
}
@@ -985,7 +1041,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
}
/* Write out any remaining tuples, and fsync if needed */
- end_heap_rewrite(rwstate);
+ if (rwstate)
+ end_heap_rewrite(rwstate);
/* Clean up */
pfree(values);
@@ -2376,6 +2433,10 @@ heapam_scan_sample_next_tuple(TableScanDesc scan, SampleScanState *scanstate,
* SET WITHOUT OIDS.
*
* So, we must reconstruct the tuple from component Datums.
+ *
+ * If rwstate is NULL, use simple_heap_insert() instead of rewriting; in
+ * that case we still need to deform/form the tuple. TODO: Shouldn't we
+ * rename the function, as it might not do any rewrite?
*/
static void
reform_and_rewrite_tuple(HeapTuple tuple,
@@ -2398,8 +2459,28 @@ reform_and_rewrite_tuple(HeapTuple tuple,
copiedTuple = heap_form_tuple(newTupDesc, values, isnull);
- /* The heap rewrite module does the rest */
- rewrite_heap_tuple(rwstate, tuple, copiedTuple);
+ if (rwstate)
+ /* The heap rewrite module does the rest */
+ rewrite_heap_tuple(rwstate, tuple, copiedTuple);
+ else
+ {
+ /*
+ * Insert tuple when processing REPACK CONCURRENTLY.
+ *
+ * rewriteheap.c is not used in the CONCURRENTLY case because it'd be
+ * difficult to do the same in the catch-up phase (as the logical
+ * decoding does not provide us with sufficient visibility
+ * information). Thus we must use heap_insert() both during the
+ * catch-up and here.
+ *
+ * The following is like simple_heap_insert() except that we pass the
+ * flag to skip logical decoding: as soon as REPACK CONCURRENTLY swaps
+ * the relation files, it drops this relation, so no logical
+ * replication subscription should need the data.
+ */
+ heap_insert(NewHeap, copiedTuple, GetCurrentCommandId(true),
+ HEAP_INSERT_NO_LOGICAL, NULL);
+ }
heap_freetuple(copiedTuple);
}
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index e6d2b5fced1..6aa2ed214f2 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -617,9 +617,9 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
int options = HEAP_INSERT_SKIP_FSM;
/*
- * While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
- * for the TOAST table are not logically decoded. The main heap is
- * WAL-logged as XLOG FPI records, which are not logically decoded.
+ * While rewriting the heap for REPACK, make sure data for the TOAST
+ * table are not logically decoded. The main heap is WAL-logged as
+ * XLOG FPI records, which are not logically decoded.
*/
options |= HEAP_INSERT_NO_LOGICAL;
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b885513f765..23f2de587a1 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -215,6 +215,7 @@ typedef struct TransactionStateData
bool parallelChildXact; /* is any parent transaction parallel? */
bool chain; /* start a new block after this one */
bool topXidLogged; /* for a subxact: is top-level XID logged? */
+ bool internal; /* for a subxact: launched internally? */
struct TransactionStateData *parent; /* back link to parent */
} TransactionStateData;
@@ -4723,6 +4724,7 @@ BeginInternalSubTransaction(const char *name)
/* Normal subtransaction start */
PushTransaction();
s = CurrentTransactionState; /* changed by push */
+ s->internal = true;
/*
* Savepoint names, like the TransactionState block itself, live
@@ -5239,7 +5241,13 @@ AbortSubTransaction(void)
LWLockReleaseAll();
pgstat_report_wait_end();
- pgstat_progress_end_command();
+
+ /*
+ * An internal subtransaction might be used by a user command, in which
+ * case the command outlives the subtransaction.
+ */
+ if (!s->internal)
+ pgstat_progress_end_command();
pgaio_error_cleanup();
@@ -5456,6 +5464,7 @@ PushTransaction(void)
s->parallelModeLevel = 0;
s->parallelChildXact = (p->parallelModeLevel != 0 || p->parallelChildXact);
s->topXidLogged = false;
+ s->internal = false;
CurrentTransactionState = s;
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 466cf0fdef6..c70521d1d54 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1418,22 +1418,7 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
opclassOptions[i] = get_attoptions(oldIndexId, i + 1);
- /* Extract statistic targets for each attribute */
- stattargets = palloc0_array(NullableDatum, newInfo->ii_NumIndexAttrs);
- for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
- {
- HeapTuple tp;
- Datum dat;
-
- tp = SearchSysCache2(ATTNUM, ObjectIdGetDatum(oldIndexId), Int16GetDatum(i + 1));
- if (!HeapTupleIsValid(tp))
- elog(ERROR, "cache lookup failed for attribute %d of relation %u",
- i + 1, oldIndexId);
- dat = SysCacheGetAttr(ATTNUM, tp, Anum_pg_attribute_attstattarget, &isnull);
- ReleaseSysCache(tp);
- stattargets[i].value = dat;
- stattargets[i].isnull = isnull;
- }
+ stattargets = get_index_stattargets(oldIndexId, newInfo);
/*
* Now create the new index.
@@ -1472,6 +1457,32 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
return newIndexId;
}
+NullableDatum *
+get_index_stattargets(Oid indexid, IndexInfo *indInfo)
+{
+ NullableDatum *stattargets;
+
+ /* Extract statistic targets for each attribute */
+ stattargets = palloc0_array(NullableDatum, indInfo->ii_NumIndexAttrs);
+ for (int i = 0; i < indInfo->ii_NumIndexAttrs; i++)
+ {
+ HeapTuple tp;
+ Datum dat;
+ bool isnull;
+
+ tp = SearchSysCache2(ATTNUM, ObjectIdGetDatum(indexid), Int16GetDatum(i + 1));
+ if (!HeapTupleIsValid(tp))
+ elog(ERROR, "cache lookup failed for attribute %d of relation %u",
+ i + 1, indexid);
+ dat = SysCacheGetAttr(ATTNUM, tp, Anum_pg_attribute_attstattarget, &isnull);
+ ReleaseSysCache(tp);
+ stattargets[i].value = dat;
+ stattargets[i].isnull = isnull;
+ }
+
+ return stattargets;
+}
+
/*
* index_concurrently_build
*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 2ff3322580f..d19770451d0 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1263,16 +1263,17 @@ CREATE VIEW pg_stat_progress_cluster AS
WHEN 2 THEN 'index scanning heap'
WHEN 3 THEN 'sorting tuples'
WHEN 4 THEN 'writing new heap'
- WHEN 5 THEN 'swapping relation files'
- WHEN 6 THEN 'rebuilding index'
- WHEN 7 THEN 'performing final cleanup'
+ -- 5 is 'catch-up', but that should not appear here.
+ WHEN 6 THEN 'swapping relation files'
+ WHEN 7 THEN 'rebuilding index'
+ WHEN 8 THEN 'performing final cleanup'
END AS phase,
CAST(S.param3 AS oid) AS cluster_index_relid,
S.param4 AS heap_tuples_scanned,
S.param5 AS heap_tuples_written,
- S.param6 AS heap_blks_total,
- S.param7 AS heap_blks_scanned,
- S.param8 AS index_rebuild_count
+ S.param8 AS heap_blks_total,
+ S.param9 AS heap_blks_scanned,
+ S.param10 AS index_rebuild_count
FROM pg_stat_get_progress_info('CLUSTER') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
@@ -1288,16 +1289,19 @@ CREATE VIEW pg_stat_progress_repack AS
WHEN 2 THEN 'index scanning heap'
WHEN 3 THEN 'sorting tuples'
WHEN 4 THEN 'writing new heap'
- WHEN 5 THEN 'swapping relation files'
- WHEN 6 THEN 'rebuilding index'
- WHEN 7 THEN 'performing final cleanup'
+ WHEN 5 THEN 'catch-up'
+ WHEN 6 THEN 'swapping relation files'
+ WHEN 7 THEN 'rebuilding index'
+ WHEN 8 THEN 'performing final cleanup'
END AS phase,
CAST(S.param3 AS oid) AS repack_index_relid,
S.param4 AS heap_tuples_scanned,
- S.param5 AS heap_tuples_written,
- S.param6 AS heap_blks_total,
- S.param7 AS heap_blks_scanned,
- S.param8 AS index_rebuild_count
+ S.param5 AS heap_tuples_inserted,
+ S.param6 AS heap_tuples_updated,
+ S.param7 AS heap_tuples_deleted,
+ S.param8 AS heap_blks_total,
+ S.param9 AS heap_blks_scanned,
+ S.param10 AS index_rebuild_count
FROM pg_stat_get_progress_info('REPACK') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index c463cd58672..592909f453f 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -25,6 +25,10 @@
#include "access/toast_internals.h"
#include "access/transam.h"
#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xloginsert.h"
+#include "access/xlogutils.h"
#include "catalog/catalog.h"
#include "catalog/dependency.h"
#include "catalog/heap.h"
@@ -32,6 +36,7 @@
#include "catalog/namespace.h"
#include "catalog/objectaccess.h"
#include "catalog/pg_am.h"
+#include "catalog/pg_control.h"
#include "catalog/pg_inherits.h"
#include "catalog/toasting.h"
#include "commands/cluster.h"
@@ -39,10 +44,15 @@
#include "commands/progress.h"
#include "commands/tablecmds.h"
#include "commands/vacuum.h"
+#include "executor/executor.h"
#include "miscadmin.h"
#include "optimizer/optimizer.h"
#include "pgstat.h"
+#include "replication/decode.h"
+#include "replication/logical.h"
+#include "replication/snapbuild.h"
#include "storage/bufmgr.h"
+#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
#include "utils/acl.h"
@@ -67,15 +77,45 @@ typedef struct
Oid indexOid;
} RelToCluster;
+/*
+ * The following definitions are used for concurrent processing.
+ */
+
+/*
+ * The locators are used to avoid logical decoding of data that we do not need
+ * for our table.
+ */
+RelFileLocator repacked_rel_locator = {.relNumber = InvalidOid};
+RelFileLocator repacked_rel_toast_locator = {.relNumber = InvalidOid};
+
+/*
+ * Everything we need to call ExecInsertIndexTuples().
+ */
+typedef struct IndexInsertState
+{
+ ResultRelInfo *rri;
+ EState *estate;
+
+ Relation ident_index;
+} IndexInsertState;
+
+/* The WAL segment being decoded. */
+static XLogSegNo repack_current_segment = 0;
+
static void cluster_multiple_rels(List *rtcs, ClusterParams *params,
- ClusterCommand cmd);
+ ClusterCommand cmd, LOCKMODE lockmode,
+ bool isTopLevel);
static bool cluster_rel_recheck(Relation OldHeap, Oid indexOid, Oid userid,
- int options);
+ LOCKMODE lmode, int options);
+static void check_repack_concurrently_requirements(Relation rel);
static void rebuild_relation(Relation OldHeap, Relation index, bool verbose,
- ClusterCommand cmd);
+ bool concurrent, Oid userid, ClusterCommand cmd);
static void copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
- bool verbose, bool *pSwapToastByContent,
- TransactionId *pFreezeXid, MultiXactId *pCutoffMulti);
+ Snapshot snapshot, LogicalDecodingContext *decoding_ctx,
+ bool verbose,
+ bool *pSwapToastByContent,
+ TransactionId *pFreezeXid,
+ MultiXactId *pCutoffMulti);
static List *get_tables_to_cluster(MemoryContext cluster_context);
static List *get_tables_to_repack(MemoryContext repack_context);
static List *get_tables_to_cluster_partitioned(MemoryContext cluster_context,
@@ -83,7 +123,53 @@ static List *get_tables_to_cluster_partitioned(MemoryContext cluster_context,
ClusterCommand cmd);
static bool cluster_is_permitted_for_relation(Oid relid, Oid userid,
ClusterCommand cmd);
+static void begin_concurrent_repack(Relation rel);
+static void end_concurrent_repack(void);
+static LogicalDecodingContext *setup_logical_decoding(Oid relid,
+ const char *slotname,
+ TupleDesc tupdesc);
+static HeapTuple get_changed_tuple(char *change);
+static void apply_concurrent_changes(RepackDecodingState *dstate,
+ Relation rel, ScanKey key, int nkeys,
+ IndexInsertState *iistate);
+static void apply_concurrent_insert(Relation rel, ConcurrentChange *change,
+ HeapTuple tup, IndexInsertState *iistate,
+ TupleTableSlot *index_slot);
+static void apply_concurrent_update(Relation rel, HeapTuple tup,
+ HeapTuple tup_target,
+ ConcurrentChange *change,
+ IndexInsertState *iistate,
+ TupleTableSlot *index_slot);
+static void apply_concurrent_delete(Relation rel, HeapTuple tup_target,
+ ConcurrentChange *change);
+static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
+ HeapTuple tup_key,
+ IndexInsertState *iistate,
+ TupleTableSlot *ident_slot,
+ IndexScanDesc *scan_p);
+static void process_concurrent_changes(LogicalDecodingContext *ctx,
+ XLogRecPtr end_of_wal,
+ Relation rel_dst,
+ Relation rel_src,
+ ScanKey ident_key,
+ int ident_key_nentries,
+ IndexInsertState *iistate);
+static IndexInsertState *get_index_insert_state(Relation relation,
+ Oid ident_index_id);
+static ScanKey build_identity_key(Oid ident_idx_oid, Relation rel_src,
+ int *nentries);
+static void free_index_insert_state(IndexInsertState *iistate);
+static void cleanup_logical_decoding(LogicalDecodingContext *ctx);
+static void rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
+ Relation cl_index,
+ LogicalDecodingContext *ctx,
+ bool swap_toast_by_content,
+ TransactionId frozenXid,
+ MultiXactId cutoffMulti);
+static List *build_new_indexes(Relation NewHeap, Relation OldHeap, List *OldIndexes);
static Relation process_single_relation(RangeVar *relation, char *indexname,
+ LOCKMODE lockmode,
+ bool isTopLevel,
ClusterParams *params,
ClusterCommand cmd,
Oid *indexOid_p);
@@ -142,8 +228,8 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
if (stmt->relation != NULL)
{
- /* This is the single-relation case. */
rel = process_single_relation(stmt->relation, stmt->indexname,
+ AccessExclusiveLock, isTopLevel,
¶ms, CLUSTER_COMMAND_CLUSTER,
&indexOid);
if (rel == NULL)
@@ -194,7 +280,8 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
}
/* Do the job. */
- cluster_multiple_rels(rtcs, ¶ms, CLUSTER_COMMAND_CLUSTER);
+ cluster_multiple_rels(rtcs, ¶ms, CLUSTER_COMMAND_CLUSTER,
+ AccessExclusiveLock, isTopLevel);
/* Start a new transaction for the cleanup work. */
StartTransactionCommand();
@@ -211,7 +298,8 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
* return.
*/
static void
-cluster_multiple_rels(List *rtcs, ClusterParams *params, ClusterCommand cmd)
+cluster_multiple_rels(List *rtcs, ClusterParams *params, ClusterCommand cmd,
+ LOCKMODE lockmode, bool isTopLevel)
{
ListCell *lc;
@@ -231,10 +319,10 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params, ClusterCommand cmd)
/* functions in indexes may want a snapshot set */
PushActiveSnapshot(GetTransactionSnapshot());
- rel = table_open(rtc->tableOid, AccessExclusiveLock);
+ rel = table_open(rtc->tableOid, lockmode);
/* Process this table */
- cluster_rel(rel, rtc->indexOid, params, cmd);
+ cluster_rel(rel, rtc->indexOid, params, cmd, isTopLevel);
/* cluster_rel closes the relation, but keeps lock */
PopActiveSnapshot();
@@ -258,12 +346,18 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params, ClusterCommand cmd)
* instead of index order. This is the new implementation of VACUUM FULL,
* and error messages should refer to the operation as VACUUM not CLUSTER.
*
+ * Note that, in the concurrent case, the function releases the lock at some
+ * point, in order to get AccessExclusiveLock for the final steps (i.e. to
+ * swap the relation files). To make things simpler, the caller should expect
+ * OldHeap to be closed on return, regardless of CLUOPT_CONCURRENT. (The
+ * AccessExclusiveLock is kept till the end of the transaction.)
+ *
* 'cmd' indicates which command is being executed. REPACK should be the only
* caller of this function in the future.
*/
void
cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
- ClusterCommand cmd)
+ ClusterCommand cmd, bool isTopLevel)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid save_userid;
@@ -272,8 +366,34 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
bool verbose = ((params->options & CLUOPT_VERBOSE) != 0);
bool recheck = ((params->options & CLUOPT_RECHECK) != 0);
Relation index;
+ bool concurrent = ((params->options & CLUOPT_CONCURRENT) != 0);
+ LOCKMODE lmode;
+
+ /*
+ * Check that the correct lock is held. The lock mode is
+ * AccessExclusiveLock for normal processing and ShareUpdateExclusiveLock
+ * for concurrent processing (so that SELECT, INSERT, UPDATE and DELETE
+ * commands work, but cluster_rel() cannot be called concurrently for the
+ * same relation).
+ */
+ lmode = !concurrent ? AccessExclusiveLock : ShareUpdateExclusiveLock;
+
+ /* There are specific requirements on concurrent processing. */
+ if (concurrent)
+ {
+ /*
+ * Make sure we have no XID assigned, otherwise the call of
+ * setup_logical_decoding() can cause a deadlock.
+ *
+ * The existence of a transaction block does not actually imply that an
+ * XID was already assigned, but it very likely was. We might want to check
+ * the result of GetCurrentTransactionIdIfAny() instead, but that
+ * would be less clear from user's perspective.
+ */
+ PreventInTransactionBlock(isTopLevel, "REPACK CONCURRENTLY");
- Assert(CheckRelationLockedByMe(OldHeap, AccessExclusiveLock, false));
+ check_repack_concurrently_requirements(OldHeap);
+ }
/* Check for user-requested abort. */
CHECK_FOR_INTERRUPTS();
@@ -319,7 +439,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
* to cluster a not-previously-clustered index.
*/
if (recheck)
- if (!cluster_rel_recheck(OldHeap, indexOid, save_userid,
+ if (!cluster_rel_recheck(OldHeap, indexOid, save_userid, lmode,
params->options))
goto out;
@@ -338,6 +458,12 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("cannot cluster a shared catalog")));
+ /*
+ * The CONCURRENTLY case should have been rejected earlier because it does
+ * not support system catalogs.
+ */
+ Assert(!(OldHeap->rd_rel->relisshared && concurrent));
+
/*
* Don't process temp tables of other backends ... their local buffer
* manager is not going to cope.
@@ -376,7 +502,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
if (OidIsValid(indexOid))
{
/* verify the index is good and lock it */
- check_index_is_clusterable(OldHeap, indexOid, AccessExclusiveLock);
+ check_index_is_clusterable(OldHeap, indexOid, lmode);
/* also open it */
index = index_open(indexOid, NoLock);
}
@@ -393,7 +519,9 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
if (OldHeap->rd_rel->relkind == RELKIND_MATVIEW &&
!RelationIsPopulated(OldHeap))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ if (index)
+ index_close(index, lmode);
+ relation_close(OldHeap, lmode);
goto out;
}
@@ -406,11 +534,35 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
* invalid, because we move tuples around. Promote them to relation
* locks. Predicate locks on indexes will be promoted when they are
* reindexed.
+ *
+ * During concurrent processing, the heap as well as its indexes stay in
+ * operation, so we postpone this step until they are locked using
+ * AccessExclusiveLock near the end of the processing.
*/
- TransferPredicateLocksToHeapRelation(OldHeap);
+ if (!concurrent)
+ TransferPredicateLocksToHeapRelation(OldHeap);
/* rebuild_relation does all the dirty work */
- rebuild_relation(OldHeap, index, verbose, cmd);
+ PG_TRY();
+ {
+ /*
+ * For concurrent processing, make sure that our logical decoding
+ * ignores data changes of other tables than the one we are
+ * processing.
+ */
+ if (concurrent)
+ begin_concurrent_repack(OldHeap);
+
+ rebuild_relation(OldHeap, index, verbose, concurrent, save_userid,
+ cmd);
+ }
+ PG_FINALLY();
+ {
+ if (concurrent)
+ end_concurrent_repack();
+ }
+ PG_END_TRY();
+
/* rebuild_relation closes OldHeap, and index if valid */
out:
@@ -429,7 +581,7 @@ out:
*/
static bool
cluster_rel_recheck(Relation OldHeap, Oid indexOid, Oid userid,
- int options)
+ LOCKMODE lmode, int options)
{
Oid tableOid = RelationGetRelid(OldHeap);
@@ -437,7 +589,7 @@ cluster_rel_recheck(Relation OldHeap, Oid indexOid, Oid userid,
if (!cluster_is_permitted_for_relation(tableOid, userid,
CLUSTER_COMMAND_CLUSTER))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
return false;
}
@@ -451,7 +603,7 @@ cluster_rel_recheck(Relation OldHeap, Oid indexOid, Oid userid,
*/
if (RELATION_IS_OTHER_TEMP(OldHeap))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
return false;
}
@@ -462,7 +614,7 @@ cluster_rel_recheck(Relation OldHeap, Oid indexOid, Oid userid,
*/
if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(indexOid)))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
return false;
}
@@ -473,7 +625,7 @@ cluster_rel_recheck(Relation OldHeap, Oid indexOid, Oid userid,
if ((options & CLUOPT_RECHECK_ISCLUSTERED) != 0 &&
!get_index_isclustered(indexOid))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
return false;
}
}
@@ -614,19 +766,87 @@ mark_index_clustered(Relation rel, Oid indexOid, bool is_internal)
table_close(pg_index, RowExclusiveLock);
}
+/*
+ * Check if the CONCURRENTLY option is legal for the relation.
+ */
+static void
+check_repack_concurrently_requirements(Relation rel)
+{
+ char relpersistence,
+ replident;
+ Oid ident_idx;
+
+ /* Data changes in system relations are not logically decoded. */
+ if (IsCatalogRelation(rel))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot repack relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("REPACK CONCURRENTLY is not supported for catalog relations.")));
+
+ /*
+ * reorderbuffer.c does not seem to handle processing of a TOAST
+ * relation on its own.
+ */
+ if (IsToastRelation(rel))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot repack relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("REPACK CONCURRENTLY is not supported for TOAST relations, unless the main relation is repacked too.")));
+
+ relpersistence = rel->rd_rel->relpersistence;
+ if (relpersistence != RELPERSISTENCE_PERMANENT)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot repack relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("REPACK CONCURRENTLY is only allowed for permanent relations.")));
+
+ /* With NOTHING, WAL does not contain the old tuple. */
+ replident = rel->rd_rel->relreplident;
+ if (replident == REPLICA_IDENTITY_NOTHING)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot repack relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("Relation \"%s\" has insufficient replication identity.",
+ RelationGetRelationName(rel))));
+
+ /*
+ * Identity index is not set if the replica identity is FULL, but PK might
+ * exist in such a case.
+ */
+ ident_idx = RelationGetReplicaIndex(rel);
+ if (!OidIsValid(ident_idx) && OidIsValid(rel->rd_pkindex))
+ ident_idx = rel->rd_pkindex;
+ if (!OidIsValid(ident_idx))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot repack relation \"%s\"",
+ RelationGetRelationName(rel)),
+ (errhint("Relation \"%s\" has no identity index.",
+ RelationGetRelationName(rel)))));
+}
+
/*
* rebuild_relation: rebuild an existing relation in index or physical order
*
- * OldHeap: table to rebuild.
+ * OldHeap: table to rebuild. See cluster_rel() for comments on the required
+ * lock strength.
+ *
* index: index to cluster by, or NULL to rewrite in physical order.
*
- * On entry, heap and index (if one is given) must be open, and
- * AccessExclusiveLock held on them.
- * On exit, they are closed, but locks on them are not released.
+ * On entry, heap and index (if one is given) must be open, and the
+ * appropriate lock held on them (AccessExclusiveLock for exclusive processing
+ * and ShareUpdateExclusiveLock for concurrent processing).
+ *
+ * On exit, they are closed, but still locked with AccessExclusiveLock. (The
+ * function handles the lock upgrade if 'concurrent' is true.)
*/
static void
rebuild_relation(Relation OldHeap, Relation index, bool verbose,
- ClusterCommand cmd)
+ bool concurrent, Oid userid, ClusterCommand cmd)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid accessMethod = OldHeap->rd_rel->relam;
@@ -634,13 +854,55 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose,
Oid OIDNewHeap;
Relation NewHeap;
char relpersistence;
- bool is_system_catalog;
bool swap_toast_by_content;
TransactionId frozenXid;
MultiXactId cutoffMulti;
+ NameData slotname;
+ LogicalDecodingContext *ctx = NULL;
+ Snapshot snapshot = NULL;
+#if USE_ASSERT_CHECKING
+ LOCKMODE lmode;
+
+ lmode = !concurrent ? AccessExclusiveLock : ShareUpdateExclusiveLock;
+
+ Assert(CheckRelationLockedByMe(OldHeap, lmode, false) &&
+ (index == NULL || CheckRelationLockedByMe(index, lmode, false)));
+#endif
+
+ if (concurrent)
+ {
+ TupleDesc tupdesc;
- Assert(CheckRelationLockedByMe(OldHeap, AccessExclusiveLock, false) &&
- (index == NULL || CheckRelationLockedByMe(index, AccessExclusiveLock, false)));
+ /*
+ * REPACK CONCURRENTLY is not allowed in a transaction block, so this
+ * should never fire.
+ */
+ Assert(GetTopTransactionIdIfAny() == InvalidTransactionId);
+
+ /*
+ * A single backend should not execute multiple REPACK commands at a
+ * time, so use PID to make the slot unique.
+ */
+ snprintf(NameStr(slotname), NAMEDATALEN, "repack_%d", MyProcPid);
+
+ tupdesc = CreateTupleDescCopy(RelationGetDescr(OldHeap));
+
+ /*
+ * Prepare to capture the concurrent data changes.
+ *
+ * Note that this call waits for all transactions with an XID already
+ * assigned to finish. If one of those transactions is waiting for a
+ * lock that conflicts with ShareUpdateExclusiveLock on our table
+ * (e.g. it runs CREATE INDEX), we can end up in a deadlock. It is not
+ * clear whether this risk is worth unlocking/locking the table (and
+ * its clustering index) and checking again whether it is still
+ * eligible for REPACK CONCURRENTLY.
+ */
+ ctx = setup_logical_decoding(tableOid, NameStr(slotname), tupdesc);
+
+ snapshot = SnapBuildInitialSnapshotForRepack(ctx->snapshot_builder);
+ PushActiveSnapshot(snapshot);
+ }
if (index && cmd == CLUSTER_COMMAND_CLUSTER)
/* Mark the correct index as clustered */
@@ -648,7 +910,6 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose,
/* Remember info about rel before closing OldHeap */
relpersistence = OldHeap->rd_rel->relpersistence;
- is_system_catalog = IsSystemRelation(OldHeap);
/*
* Create the transient table that will receive the re-ordered data.
@@ -664,30 +925,67 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose,
NewHeap = table_open(OIDNewHeap, NoLock);
/* Copy the heap data into the new table in the desired order */
- copy_table_data(NewHeap, OldHeap, index, verbose,
+ copy_table_data(NewHeap, OldHeap, index, snapshot, ctx, verbose,
&swap_toast_by_content, &frozenXid, &cutoffMulti);
+ /* The historic snapshot won't be needed anymore. */
+ if (snapshot)
+ PopActiveSnapshot();
- /* Close relcache entries, but keep lock until transaction commit */
- table_close(OldHeap, NoLock);
- if (index)
- index_close(index, NoLock);
+ if (concurrent)
+ {
+ /*
+ * Push a snapshot that we will use to find old versions of rows when
+ * processing concurrent UPDATE and DELETE commands. (That snapshot
+ * should also be used by index expressions.)
+ */
+ PushActiveSnapshot(GetTransactionSnapshot());
- /*
- * Close the new relation so it can be dropped as soon as the storage is
- * swapped. The relation is not visible to others, so no need to unlock it
- * explicitly.
- */
- table_close(NewHeap, NoLock);
+ /*
+ * Make sure we can find the tuples just inserted when applying DML
+ * commands on top of those.
+ */
+ CommandCounterIncrement();
+ UpdateActiveSnapshotCommandId();
- /*
- * Swap the physical files of the target and transient tables, then
- * rebuild the target's indexes and throw away the transient table.
- */
- finish_heap_swap(tableOid, OIDNewHeap, is_system_catalog,
- swap_toast_by_content, false, true,
- frozenXid, cutoffMulti,
- relpersistence);
+ rebuild_relation_finish_concurrent(NewHeap, OldHeap, index,
+ ctx, swap_toast_by_content,
+ frozenXid, cutoffMulti);
+ PopActiveSnapshot();
+
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_FINAL_CLEANUP);
+
+ /* Done with decoding. */
+ cleanup_logical_decoding(ctx);
+ ReplicationSlotRelease();
+ ReplicationSlotDrop(NameStr(slotname), false);
+ }
+ else
+ {
+ bool is_system_catalog = IsSystemRelation(OldHeap);
+
+ /* Close relcache entries, but keep lock until transaction commit */
+ table_close(OldHeap, NoLock);
+ if (index)
+ index_close(index, NoLock);
+
+ /*
+ * Close the new relation so it can be dropped as soon as the storage
+ * is swapped. The relation is not visible to others, so no need to
+ * unlock it explicitly.
+ */
+ table_close(NewHeap, NoLock);
+
+ /*
+ * Swap the physical files of the target and transient tables, then
+ * rebuild the target's indexes and throw away the transient table.
+ */
+ finish_heap_swap(tableOid, OIDNewHeap, is_system_catalog,
+ swap_toast_by_content, false, true, true,
+ frozenXid, cutoffMulti,
+ relpersistence);
+ }
}
@@ -822,15 +1120,19 @@ make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
/*
* Do the physical copying of table data.
*
+ * 'snapshot' and 'decoding_ctx': see table_relation_copy_for_cluster().
+ * Pass them iff concurrent processing is required.
+ *
* There are three output parameters:
* *pSwapToastByContent is set true if toast tables must be swapped by content.
* *pFreezeXid receives the TransactionId used as freeze cutoff point.
* *pCutoffMulti receives the MultiXactId used as a cutoff point.
*/
static void
-copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verbose,
- bool *pSwapToastByContent, TransactionId *pFreezeXid,
- MultiXactId *pCutoffMulti)
+copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
+ Snapshot snapshot, LogicalDecodingContext *decoding_ctx,
+ bool verbose, bool *pSwapToastByContent,
+ TransactionId *pFreezeXid, MultiXactId *pCutoffMulti)
{
Relation relRelation;
HeapTuple reltup;
@@ -848,6 +1150,8 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
PGRUsage ru0;
char *nspname;
+ bool concurrent = snapshot != NULL;
+
pg_rusage_init(&ru0);
/* Store a copy of the namespace name for logging purposes */
@@ -950,8 +1254,48 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
* provided, else plain seqscan.
*/
if (OldIndex != NULL && OldIndex->rd_rel->relam == BTREE_AM_OID)
+ {
+ ResourceOwner oldowner = NULL;
+ ResourceOwner resowner = NULL;
+
+ /*
+ * In the CONCURRENT case, use a dedicated resource owner so we don't
+ * leave any additional locks behind us that we cannot release easily.
+ */
+ if (concurrent)
+ {
+ Assert(CheckRelationLockedByMe(OldHeap, ShareUpdateExclusiveLock,
+ false));
+ Assert(CheckRelationLockedByMe(OldIndex, ShareUpdateExclusiveLock,
+ false));
+
+ resowner = ResourceOwnerCreate(CurrentResourceOwner,
+ "plan_cluster_use_sort");
+ oldowner = CurrentResourceOwner;
+ CurrentResourceOwner = resowner;
+ }
+
use_sort = plan_cluster_use_sort(RelationGetRelid(OldHeap),
RelationGetRelid(OldIndex));
+
+ if (concurrent)
+ {
+ CurrentResourceOwner = oldowner;
+
+ /*
+ * We are primarily concerned about locks, but if the planner
+ * happened to allocate any other resources, we should release
+ * them too because we're going to delete the whole resowner.
+ */
+ ResourceOwnerRelease(resowner, RESOURCE_RELEASE_BEFORE_LOCKS,
+ false, false);
+ ResourceOwnerRelease(resowner, RESOURCE_RELEASE_LOCKS,
+ false, false);
+ ResourceOwnerRelease(resowner, RESOURCE_RELEASE_AFTER_LOCKS,
+ false, false);
+ ResourceOwnerDelete(resowner);
+ }
+ }
else
use_sort = false;
@@ -980,7 +1324,9 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
* values (e.g. because the AM doesn't use freezing).
*/
table_relation_copy_for_cluster(OldHeap, NewHeap, OldIndex, use_sort,
- cutoffs.OldestXmin, &cutoffs.FreezeLimit,
+ cutoffs.OldestXmin, snapshot,
+ decoding_ctx,
+ &cutoffs.FreezeLimit,
&cutoffs.MultiXactCutoff,
&num_tuples, &tups_vacuumed,
&tups_recently_dead);
@@ -989,7 +1335,11 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
*pFreezeXid = cutoffs.FreezeLimit;
*pCutoffMulti = cutoffs.MultiXactCutoff;
- /* Reset rd_toastoid just to be tidy --- it shouldn't be looked at again */
+ /*
+ * Reset rd_toastoid just to be tidy --- it shouldn't be looked at again.
+ * In the CONCURRENTLY case, we need to set it again before applying the
+ * concurrent changes.
+ */
NewHeap->rd_toastoid = InvalidOid;
num_pages = RelationGetNumberOfBlocks(NewHeap);
@@ -1447,14 +1797,13 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
bool swap_toast_by_content,
bool check_constraints,
bool is_internal,
+ bool reindex,
TransactionId frozenXid,
MultiXactId cutoffMulti,
char newrelpersistence)
{
ObjectAddress object;
Oid mapped_tables[4];
- int reindex_flags;
- ReindexParams reindex_params = {0};
int i;
/* Report that we are now swapping relation files */
@@ -1480,39 +1829,47 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
if (is_system_catalog)
CacheInvalidateCatalog(OIDOldHeap);
- /*
- * Rebuild each index on the relation (but not the toast table, which is
- * all-new at this point). It is important to do this before the DROP
- * step because if we are processing a system catalog that will be used
- * during DROP, we want to have its indexes available. There is no
- * advantage to the other order anyway because this is all transactional,
- * so no chance to reclaim disk space before commit. We do not need a
- * final CommandCounterIncrement() because reindex_relation does it.
- *
- * Note: because index_build is called via reindex_relation, it will never
- * set indcheckxmin true for the indexes. This is OK even though in some
- * sense we are building new indexes rather than rebuilding existing ones,
- * because the new heap won't contain any HOT chains at all, let alone
- * broken ones, so it can't be necessary to set indcheckxmin.
- */
- reindex_flags = REINDEX_REL_SUPPRESS_INDEX_USE;
- if (check_constraints)
- reindex_flags |= REINDEX_REL_CHECK_CONSTRAINTS;
+ if (reindex)
+ {
+ int reindex_flags;
+ ReindexParams reindex_params = {0};
- /*
- * Ensure that the indexes have the same persistence as the parent
- * relation.
- */
- if (newrelpersistence == RELPERSISTENCE_UNLOGGED)
- reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
- else if (newrelpersistence == RELPERSISTENCE_PERMANENT)
- reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+ /*
+ * Rebuild each index on the relation (but not the toast table, which
+ * is all-new at this point). It is important to do this before the
+ * DROP step because if we are processing a system catalog that will
+ * be used during DROP, we want to have its indexes available. There
+ * is no advantage to the other order anyway because this is all
+ * transactional, so no chance to reclaim disk space before commit. We
+ * do not need a final CommandCounterIncrement() because
+ * reindex_relation does it.
+ *
+ * Note: because index_build is called via reindex_relation, it will
+ * never set indcheckxmin true for the indexes. This is OK even
+ * though in some sense we are building new indexes rather than
+ * rebuilding existing ones, because the new heap won't contain any
+ * HOT chains at all, let alone broken ones, so it can't be necessary
+ * to set indcheckxmin.
+ */
+ reindex_flags = REINDEX_REL_SUPPRESS_INDEX_USE;
+ if (check_constraints)
+ reindex_flags |= REINDEX_REL_CHECK_CONSTRAINTS;
- /* Report that we are now reindexing relations */
- pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
- PROGRESS_REPACK_PHASE_REBUILD_INDEX);
+ /*
+ * Ensure that the indexes have the same persistence as the parent
+ * relation.
+ */
+ if (newrelpersistence == RELPERSISTENCE_UNLOGGED)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
+ else if (newrelpersistence == RELPERSISTENCE_PERMANENT)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
- reindex_relation(NULL, OIDOldHeap, reindex_flags, &reindex_params);
+ /* Report that we are now reindexing relations */
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_REBUILD_INDEX);
+
+ reindex_relation(NULL, OIDOldHeap, reindex_flags, &reindex_params);
+ }
/* Report that we are now doing clean up */
pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
@@ -1833,90 +2190,1315 @@ cluster_is_permitted_for_relation(Oid relid, Oid userid, ClusterCommand cmd)
return false;
}
+#define REPL_PLUGIN_NAME "pgoutput_repack"
+
/*
- * REPACK is intended to be a replacement of both CLUSTER and VACUUM FULL.
+ * Call this function before REPACK CONCURRENTLY starts, to set up logical
+ * decoding. It makes sure that other users of the table put enough
+ * information into WAL.
+ *
+ * The point is that at various places we expect that the table we're
+ * processing is treated like a system catalog. For example, we need to be
+ * able to scan it using a "historic snapshot" anytime during the processing
+ * (as opposed to scanning only at the start point of the decoding, as logical
+ * replication does during initial table synchronization), in order to apply
+ * concurrent UPDATE / DELETE commands.
+ *
+ * Note that the TOAST table needs no attention here, as it is not
+ * scanned using a historic snapshot.
*/
-void
-repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
+static void
+begin_concurrent_repack(Relation rel)
{
- ListCell *lc;
- ClusterParams params = {0};
- bool verbose = false;
- Relation rel = NULL;
- Oid indexOid = InvalidOid;
- MemoryContext repack_context;
- List *rtcs;
+ Oid toastrelid;
- /* Parse option list */
- foreach(lc, stmt->params)
+ /* Avoid logical decoding of other relations by this backend. */
+ repacked_rel_locator = rel->rd_locator;
+ toastrelid = rel->rd_rel->reltoastrelid;
+ if (OidIsValid(toastrelid))
{
- DefElem *opt = (DefElem *) lfirst(lc);
+ Relation toastrel;
- if (strcmp(opt->defname, "verbose") == 0)
- verbose = defGetBoolean(opt);
- else
- ereport(ERROR,
- (errcode(ERRCODE_SYNTAX_ERROR),
- errmsg("unrecognized REPACK option \"%s\"",
- opt->defname),
- parser_errposition(pstate, opt->location)));
+ /* Avoid logical decoding of other TOAST relations. */
+ toastrel = table_open(toastrelid, AccessShareLock);
+ repacked_rel_toast_locator = toastrel->rd_locator;
+ table_close(toastrel, AccessShareLock);
}
+}
- params.options = (verbose ? CLUOPT_VERBOSE : 0);
+/*
+ * Call this when done with REPACK CONCURRENTLY.
+ */
+static void
+end_concurrent_repack(void)
+{
+ /*
+ * Restore normal function of (future) logical decoding for this backend.
+ */
+ repacked_rel_locator.relNumber = InvalidOid;
+ repacked_rel_toast_locator.relNumber = InvalidOid;
+}
- if (stmt->relation != NULL)
- {
- /* This is the single-relation case. */
- rel = process_single_relation(stmt->relation, stmt->indexname,
- ¶ms, CLUSTER_COMMAND_REPACK,
- &indexOid);
- if (rel == NULL)
- return;
- }
+/*
+ * This function is much like pg_create_logical_replication_slot() except that
+ * the new slot is neither released (if anyone else could read changes from
+ * our slot, we could miss changes other backends make while we copy the
+ * existing data into the temporary table), nor persisted (it's easier to
+ * handle a crash by restarting all the work from scratch).
+ */
+static LogicalDecodingContext *
+setup_logical_decoding(Oid relid, const char *slotname, TupleDesc tupdesc)
+{
+ LogicalDecodingContext *ctx;
+ RepackDecodingState *dstate;
/*
- * By here, we know we are in a multi-table situation. In order to avoid
- * holding locks for too long, we want to process each table in its own
- * transaction. This forces us to disallow running inside a user
- * transaction block.
+ * Check if we can use logical decoding.
*/
- PreventInTransactionBlock(isTopLevel, "REPACK");
+ CheckSlotPermissions();
+ CheckLogicalDecodingRequirements();
- /* Also, we need a memory context to hold our list of relations */
- repack_context = AllocSetContextCreate(PortalContext,
- "Repack",
- ALLOCSET_DEFAULT_SIZES);
+ /* RS_TEMPORARY so that the slot gets cleaned up on ERROR. */
+ ReplicationSlotCreate(slotname, true, RS_TEMPORARY, false, false, false);
- params.options |= CLUOPT_RECHECK;
- if (rel != NULL)
- {
- Oid relid;
- bool rel_is_index;
+ /*
+ * Neither prepare_write nor do_write callback nor update_progress is
+ * useful for us.
+ *
+ * Regarding the value of need_full_snapshot, we pass false because the
+ * table we are processing is present in RepackedRelsHash and therefore,
+ * regarding logical decoding, treated like a catalog.
+ */
+ ctx = CreateInitDecodingContext(REPL_PLUGIN_NAME,
+ NIL,
+ false,
+ InvalidXLogRecPtr,
+ XL_ROUTINE(.page_read = read_local_xlog_page,
+ .segment_open = wal_segment_open,
+ .segment_close = wal_segment_close),
+ NULL, NULL, NULL);
- Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
+ /*
+ * We don't have control over setting fast_forward, so at least check it.
+ */
+ Assert(!ctx->fast_forward);
- if (OidIsValid(indexOid))
- {
- relid = indexOid;
- rel_is_index = true;
- }
- else
+ DecodingContextFindStartpoint(ctx);
+
+ /* Some WAL records should have been read. */
+ Assert(ctx->reader->EndRecPtr != InvalidXLogRecPtr);
+
+ XLByteToSeg(ctx->reader->EndRecPtr, repack_current_segment,
+ wal_segment_size);
+
+ /*
+ * Setup structures to store decoded changes.
+ */
+ dstate = palloc0(sizeof(RepackDecodingState));
+ dstate->relid = relid;
+ dstate->tstore = tuplestore_begin_heap(false, false,
+ maintenance_work_mem);
+
+ dstate->tupdesc = tupdesc;
+
+ /* Initialize the descriptor to store the changes ... */
+ dstate->tupdesc_change = CreateTemplateTupleDesc(1);
+
+ TupleDescInitEntry(dstate->tupdesc_change, 1, NULL, BYTEAOID, -1, 0);
+ /* ... as well as the corresponding slot. */
+ dstate->tsslot = MakeSingleTupleTableSlot(dstate->tupdesc_change,
+ &TTSOpsMinimalTuple);
+
+ dstate->resowner = ResourceOwnerCreate(CurrentResourceOwner,
+ "logical decoding");
+
+ ctx->output_writer_private = dstate;
+ return ctx;
+}
+
+/*
+ * Retrieve tuple from ConcurrentChange structure.
+ *
+ * The input data starts with the structure but it might not be appropriately
+ * aligned.
+ */
+static HeapTuple
+get_changed_tuple(char *change)
+{
+ HeapTupleData tup_data;
+ HeapTuple result;
+ char *src;
+
+ /*
+ * Ensure alignment before accessing the fields. (This is why we can't use
+ * heap_copytuple() instead of this function.)
+ */
+ src = change + offsetof(ConcurrentChange, tup_data);
+ memcpy(&tup_data, src, sizeof(HeapTupleData));
+
+ result = (HeapTuple) palloc(HEAPTUPLESIZE + tup_data.t_len);
+ memcpy(result, &tup_data, sizeof(HeapTupleData));
+ result->t_data = (HeapTupleHeader) ((char *) result + HEAPTUPLESIZE);
+ src = change + SizeOfConcurrentChange;
+ memcpy(result->t_data, src, result->t_len);
+
+ return result;
+}
+
+/*
+ * Decode logical changes from the WAL sequence up to end_of_wal.
+ */
+void
+repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
+ XLogRecPtr end_of_wal)
+{
+ RepackDecodingState *dstate;
+ ResourceOwner resowner_old;
+
+ /*
+ * Invalidate the "present" cache before moving to "(recent) history".
+ */
+ InvalidateSystemCaches();
+
+ dstate = (RepackDecodingState *) ctx->output_writer_private;
+ resowner_old = CurrentResourceOwner;
+ CurrentResourceOwner = dstate->resowner;
+
+ PG_TRY();
+ {
+ while (ctx->reader->EndRecPtr < end_of_wal)
{
- relid = RelationGetRelid(rel);
- rel_is_index = false;
- }
- rtcs = get_tables_to_cluster_partitioned(repack_context, relid,
- rel_is_index,
- CLUSTER_COMMAND_REPACK);
+ XLogRecord *record;
+ XLogSegNo segno_new;
+ char *errm = NULL;
+ XLogRecPtr end_lsn;
- /* close relation, releasing lock on parent table */
- table_close(rel, AccessExclusiveLock);
- }
- else
- rtcs = get_tables_to_repack(repack_context);
+ record = XLogReadRecord(ctx->reader, &errm);
+ if (errm)
+ elog(ERROR, "%s", errm);
+
+ if (record != NULL)
+ LogicalDecodingProcessRecord(ctx, ctx->reader);
+
+ /*
+ * If a WAL segment boundary has been crossed, inform the decoding
+ * system that the catalog_xmin can advance. (We could confirm more
+ * often, but filling a single WAL segment should not take much
+ * time.)
+ */
+ end_lsn = ctx->reader->EndRecPtr;
+ XLByteToSeg(end_lsn, segno_new, wal_segment_size);
+ if (segno_new != repack_current_segment)
+ {
+ LogicalConfirmReceivedLocation(end_lsn);
+ elog(DEBUG1, "REPACK: confirmed receive location %X/%X",
+ (uint32) (end_lsn >> 32), (uint32) end_lsn);
+ repack_current_segment = segno_new;
+ }
+
+ CHECK_FOR_INTERRUPTS();
+ }
+ InvalidateSystemCaches();
+ CurrentResourceOwner = resowner_old;
+ }
+ PG_CATCH();
+ {
+ /* clear all timetravel entries */
+ InvalidateSystemCaches();
+ CurrentResourceOwner = resowner_old;
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+}
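The segment-boundary bookkeeping in the loop above can be sketched standalone. Assuming a fixed 16MB segment size (PostgreSQL's default wal_segment_size), the hypothetical helpers below mirror the XLByteToSeg comparison that gates LogicalConfirmReceivedLocation():

```c
#include <assert.h>
#include <stdint.h>

/* Assumed segment size; 16MB is PostgreSQL's default wal_segment_size. */
#define SEG_SIZE	(16u * 1024 * 1024)

/* Analogous to XLByteToSeg: map an LSN to its segment number. */
static uint64_t
lsn_to_segno(uint64_t lsn)
{
	return lsn / SEG_SIZE;
}

/*
 * Return 1 iff end_lsn lies in a different segment than *current_segno,
 * updating *current_segno -- the condition under which the patch
 * confirms the receive location so catalog_xmin can advance.
 */
static int
crossed_segment_boundary(uint64_t *current_segno, uint64_t end_lsn)
{
	uint64_t	segno_new = lsn_to_segno(end_lsn);

	if (segno_new != *current_segno)
	{
		*current_segno = segno_new;
		return 1;
	}
	return 0;
}
```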
+
+/*
+ * Apply changes that happened during the initial load.
+ *
+ * Scan key is passed by caller, so it does not have to be constructed
+ * multiple times. Key entries have all fields initialized, except for
+ * sk_argument.
+ */
+static void
+apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
+ ScanKey key, int nkeys, IndexInsertState *iistate)
+{
+ TupleTableSlot *index_slot,
+ *ident_slot;
+ HeapTuple tup_old = NULL;
+
+ if (dstate->nchanges == 0)
+ return;
+
+ /* TupleTableSlot is needed to pass the tuple to ExecInsertIndexTuples(). */
+ index_slot = MakeSingleTupleTableSlot(dstate->tupdesc, &TTSOpsHeapTuple);
+
+ /* A slot to fetch tuples from identity index. */
+ ident_slot = table_slot_create(rel, NULL);
+
+ while (tuplestore_gettupleslot(dstate->tstore, true, false,
+ dstate->tsslot))
+ {
+ bool shouldFree;
+ HeapTuple tup_change,
+ tup,
+ tup_exist;
+ char *change_raw,
+ *src;
+ ConcurrentChange change;
+ bool isnull[1];
+ Datum values[1];
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Get the change from the single-column tuple. */
+ tup_change = ExecFetchSlotHeapTuple(dstate->tsslot, false, &shouldFree);
+ heap_deform_tuple(tup_change, dstate->tupdesc_change, values, isnull);
+ Assert(!isnull[0]);
+
+ /* Make sure we access aligned data. */
+ change_raw = (char *) DatumGetByteaP(values[0]);
+ src = (char *) VARDATA(change_raw);
+ memcpy(&change, src, SizeOfConcurrentChange);
+
+ /* TRUNCATE change contains no tuple, so process it separately. */
+ if (change.kind == CHANGE_TRUNCATE)
+ {
+ /*
+ * All the things that ExecuteTruncateGuts() does (such as firing
+ * triggers or handling the DROP_CASCADE behavior) should have
+ * taken place on the source relation. Thus we only do the actual
+ * truncation of the new relation (and its indexes).
+ */
+ heap_truncate_one_rel(rel);
+
+ pfree(tup_change);
+ continue;
+ }
+
+ /*
+ * Extract the tuple from the change. The tuple is copied here because
+ * it might be assigned to 'tup_old', in which case it needs to
+ * survive into the next iteration.
+ */
+ tup = get_changed_tuple(src);
+
+ if (change.kind == CHANGE_UPDATE_OLD)
+ {
+ Assert(tup_old == NULL);
+ tup_old = tup;
+ }
+ else if (change.kind == CHANGE_INSERT)
+ {
+ Assert(tup_old == NULL);
+
+ apply_concurrent_insert(rel, &change, tup, iistate, index_slot);
+
+ pfree(tup);
+ }
+ else if (change.kind == CHANGE_UPDATE_NEW ||
+ change.kind == CHANGE_DELETE)
+ {
+ IndexScanDesc ind_scan = NULL;
+ HeapTuple tup_key;
+
+ if (change.kind == CHANGE_UPDATE_NEW)
+ {
+ tup_key = tup_old != NULL ? tup_old : tup;
+ }
+ else
+ {
+ Assert(tup_old == NULL);
+ tup_key = tup;
+ }
+
+ /*
+ * Find the tuple to be updated or deleted.
+ */
+ tup_exist = find_target_tuple(rel, key, nkeys, tup_key,
+ iistate, ident_slot, &ind_scan);
+ if (tup_exist == NULL)
+ elog(ERROR, "failed to find target tuple");
+
+ if (change.kind == CHANGE_UPDATE_NEW)
+ apply_concurrent_update(rel, tup, tup_exist, &change, iistate,
+ index_slot);
+ else
+ apply_concurrent_delete(rel, tup_exist, &change);
+
+ if (tup_old != NULL)
+ {
+ pfree(tup_old);
+ tup_old = NULL;
+ }
+
+ pfree(tup);
+ index_endscan(ind_scan);
+ }
+ else
+ elog(ERROR, "unrecognized kind of change: %d", change.kind);
+
+ /*
+ * If a change was applied now, increment CID for next writes and
+ * update the snapshot so it sees the changes we've applied so far.
+ */
+ if (change.kind != CHANGE_UPDATE_OLD)
+ {
+ CommandCounterIncrement();
+ UpdateActiveSnapshotCommandId();
+ }
+
+ /* TTSOpsMinimalTuple has .get_heap_tuple==NULL. */
+ Assert(shouldFree);
+ pfree(tup_change);
+ }
+
+ tuplestore_clear(dstate->tstore);
+ dstate->nchanges = 0;
+
+ /* Cleanup. */
+ ExecDropSingleTupleTableSlot(index_slot);
+ ExecDropSingleTupleTableSlot(ident_slot);
+}
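The dispatch over change kinds above can be illustrated with a toy model. Everything below (the enum values, the Change struct, and apply_changes()) is hypothetical and only mimics the control flow: an UPDATE arrives as an OLD/NEW pair, where the OLD member merely supplies the identity used to locate the target row.

```c
#include <assert.h>

/* Change kinds, loosely mirroring the patch's ConcurrentChange.kind. */
enum ChangeKind
{
	CH_INSERT, CH_UPDATE_OLD, CH_UPDATE_NEW, CH_DELETE
};

typedef struct
{
	enum ChangeKind kind;
	int			key;		/* stands in for the replica-identity key */
	int			val;
} Change;

/*
 * Apply a decoded change stream to a toy "heap": vals[key], with -1
 * meaning the row is absent.  CH_UPDATE_OLD only records the identity
 * of the row that the following CH_UPDATE_NEW replaces.
 */
static void
apply_changes(int *vals, const Change *changes, int n)
{
	int			old_key = -1;	/* pending CH_UPDATE_OLD identity */

	for (int i = 0; i < n; i++)
	{
		const Change *c = &changes[i];

		switch (c->kind)
		{
			case CH_INSERT:
				vals[c->key] = c->val;
				break;
			case CH_UPDATE_OLD:
				old_key = c->key;	/* just remember the identity */
				break;
			case CH_UPDATE_NEW:
				{
					int			target = (old_key >= 0) ? old_key : c->key;

					vals[target] = -1;	/* drop the old row version */
					vals[c->key] = c->val;	/* write the new one */
					old_key = -1;
					break;
				}
			case CH_DELETE:
				vals[c->key] = -1;
				break;
		}
	}
}
```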
+
+static void
+apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
+ IndexInsertState *iistate, TupleTableSlot *index_slot)
+{
+ List *recheck;
+
+ /*
+ * Like simple_heap_insert(), but make sure that the INSERT is not
+ * logically decoded - see reform_and_rewrite_tuple() for more
+ * information.
+ */
+ heap_insert(rel, tup, GetCurrentCommandId(true), HEAP_INSERT_NO_LOGICAL,
+ NULL);
+
+ /*
+ * Update indexes.
+ *
+ * The caller is expected to have set an active snapshot, in case
+ * functions in the index expressions need one.
+ */
+ ExecStoreHeapTuple(tup, index_slot, false);
+ recheck = ExecInsertIndexTuples(iistate->rri,
+ index_slot,
+ iistate->estate,
+ false, /* update */
+ false, /* noDupErr */
+ NULL, /* specConflict */
+ NIL, /* arbiterIndexes */
+ false /* onlySummarizing */
+ );
+
+ /*
+ * If recheck is required, it must have been performed on the source
+ * relation by now. (All the logical changes we process here are already
+ * committed.)
+ */
+ list_free(recheck);
+
+ pgstat_progress_incr_param(PROGRESS_REPACK_HEAP_TUPLES_INSERTED, 1);
+}
+
+static void
+apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
+ ConcurrentChange *change, IndexInsertState *iistate,
+ TupleTableSlot *index_slot)
+{
+ LockTupleMode lockmode;
+ TM_FailureData tmfd;
+ TU_UpdateIndexes update_indexes;
+ TM_Result res;
+ List *recheck;
+
+ /*
+ * Write the new tuple into the new heap. ('tup' gets the TID assigned
+ * here.)
+ *
+ * Do it like in simple_heap_update(), except for 'wal_logical' (and
+ * except for 'wait').
+ */
+ res = heap_update(rel, &tup_target->t_self, tup,
+ GetCurrentCommandId(true),
+ InvalidSnapshot,
+ false, /* no wait - only we are doing changes */
+ &tmfd, &lockmode, &update_indexes,
+ false /* wal_logical */);
+ if (res != TM_Ok)
+ ereport(ERROR, (errmsg("failed to apply concurrent UPDATE")));
+
+ ExecStoreHeapTuple(tup, index_slot, false);
+
+ if (update_indexes != TU_None)
+ {
+ recheck = ExecInsertIndexTuples(iistate->rri,
+ index_slot,
+ iistate->estate,
+ true, /* update */
+ false, /* noDupErr */
+ NULL, /* specConflict */
+ NIL, /* arbiterIndexes */
+ /* onlySummarizing */
+ update_indexes == TU_Summarizing);
+ list_free(recheck);
+ }
+
+ pgstat_progress_incr_param(PROGRESS_REPACK_HEAP_TUPLES_UPDATED, 1);
+}
+
+static void
+apply_concurrent_delete(Relation rel, HeapTuple tup_target,
+ ConcurrentChange *change)
+{
+ TM_Result res;
+ TM_FailureData tmfd;
+
+ /*
+ * Delete tuple from the new heap.
+ *
+ * Do it like in simple_heap_delete(), except for 'wal_logical' (and
+ * except for 'wait').
+ */
+ res = heap_delete(rel, &tup_target->t_self, GetCurrentCommandId(true),
+ InvalidSnapshot, false,
+ &tmfd,
+ false, /* no wait - only we are doing changes */
+ false /* wal_logical */);
+
+ if (res != TM_Ok)
+ ereport(ERROR, (errmsg("failed to apply concurrent DELETE")));
+
+ pgstat_progress_incr_param(PROGRESS_REPACK_HEAP_TUPLES_DELETED, 1);
+}
+
+/*
+ * Find the tuple to be updated or deleted.
+ *
+ * 'key' is a pre-initialized scan key, into which the function will put the
+ * key values.
+ *
+ * 'tup_key' is a tuple containing the key values for the scan.
+ *
+ * On exit, '*scan_p' contains the scan descriptor used. The caller must
+ * close it when the returned tuple is no longer needed.
+ */
+static HeapTuple
+find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
+ IndexInsertState *iistate,
+ TupleTableSlot *ident_slot, IndexScanDesc *scan_p)
+{
+ IndexScanDesc scan;
+ Form_pg_index ident_form;
+ int2vector *ident_indkey;
+ HeapTuple result = NULL;
+
+ /* XXX no instrumentation for now */
+ scan = index_beginscan(rel, iistate->ident_index, GetActiveSnapshot(),
+ NULL, nkeys, 0);
+ *scan_p = scan;
+ index_rescan(scan, key, nkeys, NULL, 0);
+
+ /* Info needed to retrieve key values from heap tuple. */
+ ident_form = iistate->ident_index->rd_index;
+ ident_indkey = &ident_form->indkey;
+
+ /* Use the incoming tuple to finalize the scan key. */
+ for (int i = 0; i < scan->numberOfKeys; i++)
+ {
+ ScanKey entry;
+ bool isnull;
+ int16 attno_heap;
+
+ entry = &scan->keyData[i];
+ attno_heap = ident_indkey->values[i];
+ entry->sk_argument = heap_getattr(tup_key,
+ attno_heap,
+ rel->rd_att,
+ &isnull);
+ Assert(!isnull);
+ }
+ if (index_getnext_slot(scan, ForwardScanDirection, ident_slot))
+ {
+ bool shouldFree;
+
+ result = ExecFetchSlotHeapTuple(ident_slot, false, &shouldFree);
+ /* TTSOpsBufferHeapTuple has .get_heap_tuple != NULL. */
+ Assert(!shouldFree);
+ }
+
+ return result;
+}
+
+/*
+ * Decode and apply concurrent changes.
+ *
+ * Pass rel_src only if its reltoastrelid is needed.
+ */
+static void
+process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
+ Relation rel_dst, Relation rel_src, ScanKey ident_key,
+ int ident_key_nentries, IndexInsertState *iistate)
+{
+ RepackDecodingState *dstate;
+
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_CATCH_UP);
+
+ dstate = (RepackDecodingState *) ctx->output_writer_private;
+
+ repack_decode_concurrent_changes(ctx, end_of_wal);
+
+ if (dstate->nchanges == 0)
+ return;
+
+ PG_TRY();
+ {
+ /*
+ * Make sure that TOAST values can eventually be accessed via the old
+ * relation - see comment in copy_table_data().
+ */
+ if (rel_src)
+ rel_dst->rd_toastoid = rel_src->rd_rel->reltoastrelid;
+
+ apply_concurrent_changes(dstate, rel_dst, ident_key,
+ ident_key_nentries, iistate);
+ }
+ PG_FINALLY();
+ {
+ if (rel_src)
+ rel_dst->rd_toastoid = InvalidOid;
+ }
+ PG_END_TRY();
+}
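The PG_TRY/PG_FINALLY block above exists solely to guarantee that rd_toastoid is reset even when applying the changes errors out. A minimal toy model of that pattern (plain C with setjmp standing in for PostgreSQL's error longjmp; all names here are hypothetical, not the real macros):

```c
#include <assert.h>
#include <setjmp.h>

static jmp_buf err_ctx;
static int	toastoid;			/* stands in for rel_dst->rd_toastoid */

static void
apply_changes(int fail)
{
	if (fail)
		longjmp(err_ctx, 1);	/* simulated elog(ERROR) */
}

static void
process_changes(int fail)
{
	toastoid = 42;				/* like rd_toastoid = reltoastrelid */
	if (setjmp(err_ctx) == 0)
		apply_changes(fail);
	toastoid = 0;				/* the FINALLY part: always reset */
}
```

Whether the apply step succeeds or "throws", the override is undone before control leaves the function, which is exactly what the real code needs so that later TOAST lookups do not use a stale OID.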
+
+static IndexInsertState *
+get_index_insert_state(Relation relation, Oid ident_index_id)
+{
+ EState *estate;
+ int i;
+ IndexInsertState *result;
+
+ result = (IndexInsertState *) palloc0(sizeof(IndexInsertState));
+ estate = CreateExecutorState();
+
+ result->rri = (ResultRelInfo *) palloc(sizeof(ResultRelInfo));
+ InitResultRelInfo(result->rri, relation, 0, 0, 0);
+ ExecOpenIndices(result->rri, false);
+
+ /*
+ * Find the relcache entry of the identity index so that we do not spend
+ * extra effort opening / closing it.
+ */
+ for (i = 0; i < result->rri->ri_NumIndices; i++)
+ {
+ Relation ind_rel;
+
+ ind_rel = result->rri->ri_IndexRelationDescs[i];
+ if (ind_rel->rd_id == ident_index_id)
+ result->ident_index = ind_rel;
+ }
+ if (result->ident_index == NULL)
+ elog(ERROR, "failed to open identity index");
+
+ /* Only initialize fields needed by ExecInsertIndexTuples(). */
+ result->estate = estate;
+
+ return result;
+}
+
+/*
+ * Build scan key to process logical changes.
+ */
+static ScanKey
+build_identity_key(Oid ident_idx_oid, Relation rel_src, int *nentries)
+{
+ Relation ident_idx_rel;
+ Form_pg_index ident_idx;
+ int n,
+ i;
+ ScanKey result;
+
+ Assert(OidIsValid(ident_idx_oid));
+ ident_idx_rel = index_open(ident_idx_oid, AccessShareLock);
+ ident_idx = ident_idx_rel->rd_index;
+ n = ident_idx->indnatts;
+ result = (ScanKey) palloc(sizeof(ScanKeyData) * n);
+ for (i = 0; i < n; i++)
+ {
+ ScanKey entry;
+ int16 relattno;
+ Form_pg_attribute att;
+ Oid opfamily,
+ opcintype,
+ opno,
+ opcode;
+
+ entry = &result[i];
+ relattno = ident_idx->indkey.values[i];
+ if (relattno >= 1)
+ {
+ TupleDesc desc;
+
+ desc = rel_src->rd_att;
+ att = TupleDescAttr(desc, relattno - 1);
+ }
+ else
+ elog(ERROR, "unexpected attribute number %d in index", relattno);
+
+ opfamily = ident_idx_rel->rd_opfamily[i];
+ opcintype = ident_idx_rel->rd_opcintype[i];
+ opno = get_opfamily_member(opfamily, opcintype, opcintype,
+ BTEqualStrategyNumber);
+
+ if (!OidIsValid(opno))
+ elog(ERROR, "failed to find = operator for type %u", opcintype);
+
+ opcode = get_opcode(opno);
+ if (!OidIsValid(opcode))
+ elog(ERROR, "failed to find function for operator %u", opno);
+
+ /* Initialize everything but argument. */
+ ScanKeyInit(entry,
+ i + 1,
+ BTEqualStrategyNumber, opcode,
+ (Datum) NULL);
+ entry->sk_collation = att->attcollation;
+ }
+ index_close(ident_idx_rel, AccessShareLock);
+
+ *nentries = n;
+ return result;
+}
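To illustrate the relationship between the identity index's indkey and the scan key that find_target_tuple() finalizes per incoming tuple, here is a self-contained sketch (the struct and function names are hypothetical simplifications; the real code uses ScanKeyData and heap_getattr()):

```c
#include <assert.h>

/* Hypothetical, simplified stand-in for ScanKeyData. */
typedef struct MiniScanKey
{
	int		sk_attno;		/* index column number (1-based) */
	long	sk_argument;	/* key value to compare against */
} MiniScanKey;

/*
 * Sketch of what build_identity_key() and find_target_tuple() do together:
 * indkey maps each index column to a heap attribute number, and the key
 * values are pulled from the incoming tuple before each index scan.
 */
static void
fill_identity_key(MiniScanKey *keys, int nkeys,
				  const int *indkey,		/* heap attno per index column */
				  const long *heap_values)	/* values indexed by attno - 1 */
{
	for (int i = 0; i < nkeys; i++)
	{
		keys[i].sk_attno = i + 1;
		keys[i].sk_argument = heap_values[indkey[i] - 1];
	}
}
```

The point of the split in the real code is that the operator and collation parts of the key are computed once, while only sk_argument changes per decoded tuple.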
+
+static void
+free_index_insert_state(IndexInsertState *iistate)
+{
+ ExecCloseIndices(iistate->rri);
+ FreeExecutorState(iistate->estate);
+ pfree(iistate->rri);
+ pfree(iistate);
+}
+
+static void
+cleanup_logical_decoding(LogicalDecodingContext *ctx)
+{
+ RepackDecodingState *dstate;
+
+ dstate = (RepackDecodingState *) ctx->output_writer_private;
+
+ ExecDropSingleTupleTableSlot(dstate->tsslot);
+ FreeTupleDesc(dstate->tupdesc_change);
+ FreeTupleDesc(dstate->tupdesc);
+ tuplestore_end(dstate->tstore);
+
+ FreeDecodingContext(ctx);
+}
+
+/*
+ * The final steps of rebuild_relation() for concurrent processing.
+ *
+ * On entry, NewHeap is locked in AccessExclusiveLock mode. OldHeap and its
+ * clustering index (if one is passed) are still locked in a mode that allows
+ * concurrent data changes. On exit, both tables and their indexes are closed,
+ * but locked in AccessExclusiveLock mode.
+ */
+static void
+rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
+ Relation cl_index,
+ LogicalDecodingContext *ctx,
+ bool swap_toast_by_content,
+ TransactionId frozenXid,
+ MultiXactId cutoffMulti)
+{
+ LOCKMODE lockmode_old PG_USED_FOR_ASSERTS_ONLY;
+ List *ind_oids_new;
+ Oid old_table_oid = RelationGetRelid(OldHeap);
+ Oid new_table_oid = RelationGetRelid(NewHeap);
+ List *ind_oids_old = RelationGetIndexList(OldHeap);
+ ListCell *lc,
+ *lc2;
+ char relpersistence;
+ bool is_system_catalog;
+ Oid ident_idx_old,
+ ident_idx_new;
+ IndexInsertState *iistate;
+ ScanKey ident_key;
+ int ident_key_nentries;
+ XLogRecPtr wal_insert_ptr,
+ end_of_wal;
+ char dummy_rec_data = '\0';
+ Relation *ind_refs,
+ *ind_refs_p;
+ int nind;
+
+ /* Like in cluster_rel(). */
+ lockmode_old = ShareUpdateExclusiveLock;
+ Assert(CheckRelationLockedByMe(OldHeap, lockmode_old, false));
+ Assert(cl_index == NULL ||
+ CheckRelationLockedByMe(cl_index, lockmode_old, false));
+ /* This is expected from the caller. */
+ Assert(CheckRelationLockedByMe(NewHeap, AccessExclusiveLock, false));
+
+ ident_idx_old = RelationGetReplicaIndex(OldHeap);
+
+ /*
+ * Unlike the exclusive case, we build new indexes for the new relation
+ * rather than swapping the storage and reindexing the old relation. The
+ * point is that the index build can take some time, so we do it before
+ * we acquire AccessExclusiveLock on the old heap; therefore we cannot
+ * swap the heap storage yet.
+ *
+ * index_create() will lock the new indexes using AccessExclusiveLock - no
+ * need to change that.
+ *
+ * We assume that ShareUpdateExclusiveLock on the table prevents anyone
+ * from dropping the existing indexes or adding new ones, so the lists of
+ * old and new indexes should match at the swap time. On the other hand we
+ * do not block ALTER INDEX commands that do not require a table lock
+ * (e.g. ALTER INDEX ... SET ...).
+ *
+ * XXX Should we check at the end of our work if another transaction
+ * executed such a command and issue a NOTICE that we might have discarded
+ * its effects? (For example, if someone changes a storage parameter after we
+ * have created the new index, the new value of that parameter is lost.)
+ * Alternatively, we can lock all the indexes now in a mode that blocks
+ * all the ALTER INDEX commands (ShareUpdateExclusiveLock ?), and keep
+ * them locked till the end of the transactions. That might increase the
+ * risk of deadlock during the lock upgrade below, however SELECT / DML
+ * queries should not be involved in such a deadlock.
+ */
+ ind_oids_new = build_new_indexes(NewHeap, OldHeap, ind_oids_old);
+
+ /*
+ * Processing shouldn't start without a valid identity index.
+ */
+ Assert(OidIsValid(ident_idx_old));
+
+ /* Find "identity index" on the new relation. */
+ ident_idx_new = InvalidOid;
+ forboth(lc, ind_oids_old, lc2, ind_oids_new)
+ {
+ Oid ind_old = lfirst_oid(lc);
+ Oid ind_new = lfirst_oid(lc2);
+
+ if (ident_idx_old == ind_old)
+ {
+ ident_idx_new = ind_new;
+ break;
+ }
+ }
+ if (!OidIsValid(ident_idx_new))
+ /*
+ * Should not happen, given our lock on the old relation.
+ */
+ ereport(ERROR,
+ (errmsg("identity index missing on the new relation")));
+
+ /* Executor state to update indexes. */
+ iistate = get_index_insert_state(NewHeap, ident_idx_new);
+
+ /*
+ * Build scan key that we'll use to look for rows to be updated / deleted
+ * during logical decoding.
+ */
+ ident_key = build_identity_key(ident_idx_new, OldHeap, &ident_key_nentries);
+
+ /*
+ * Flush all WAL records inserted so far (possibly except for the last
+ * incomplete page, see GetInsertRecPtr), to minimize the amount of data
+ * we need to flush while holding exclusive lock on the source table.
+ */
+ wal_insert_ptr = GetInsertRecPtr();
+ XLogFlush(wal_insert_ptr);
+ end_of_wal = GetFlushRecPtr(NULL);
+
+ /*
+ * Apply the concurrent changes for the first time, to minimize the time
+ * we need to hold AccessExclusiveLock. (Quite some WAL may have been
+ * written during the data copying and index creation.)
+ */
+ process_concurrent_changes(ctx, end_of_wal, NewHeap,
+ swap_toast_by_content ? OldHeap : NULL,
+ ident_key, ident_key_nentries, iistate);
+
+ /*
+ * Acquire AccessExclusiveLock on the table, its TOAST relation (if there
+ * is one), all its indexes, so that we can swap the files.
+ *
+ * Before that, unlock the index temporarily to avoid deadlock in case
+ * another transaction is trying to lock it while holding the lock on the
+ * table.
+ */
+ if (cl_index)
+ {
+ index_close(cl_index, ShareUpdateExclusiveLock);
+ cl_index = NULL;
+ }
+ /* Lock the TOAST relation (if there is one) before the table. */
+ if (OldHeap->rd_rel->reltoastrelid)
+ LockRelationOid(OldHeap->rd_rel->reltoastrelid, AccessExclusiveLock);
+ /* Finally lock the table */
+ LockRelationOid(old_table_oid, AccessExclusiveLock);
+
+ /*
+ * Lock all indexes now, not only the clustering one: all indexes need to
+ * have their files swapped. While doing that, store their relation
+ * references in an array, to handle predicate locks below.
+ */
+ ind_refs_p = ind_refs = palloc_array(Relation, list_length(ind_oids_old));
+ nind = 0;
+ foreach(lc, ind_oids_old)
+ {
+ Oid ind_oid;
+ Relation index;
+
+ ind_oid = lfirst_oid(lc);
+ index = index_open(ind_oid, AccessExclusiveLock);
+ /*
+ * TODO 1) Do we need to check if ALTER INDEX was executed since the
+ * new index was created in build_new_indexes()? 2) Specifically for
+ * the clustering index, should check_index_is_clusterable() be called
+ * here? (Not sure about the latter: ShareUpdateExclusiveLock on the
+ * table probably blocks all commands that affect the result of
+ * check_index_is_clusterable().)
+ */
+ *ind_refs_p = index;
+ ind_refs_p++;
+ nind++;
+ }
+
+ /*
+ * In addition, lock the OldHeap's TOAST relation exclusively - again, the
+ * lock is needed to swap the files.
+ */
+ if (OidIsValid(OldHeap->rd_rel->reltoastrelid))
+ LockRelationOid(OldHeap->rd_rel->reltoastrelid, AccessExclusiveLock);
+
+ /*
+ * Tuples and pages of the old heap will be gone, but the heap will stay.
+ */
+ TransferPredicateLocksToHeapRelation(OldHeap);
+ /* The same for indexes. */
+ for (int i = 0; i < nind; i++)
+ {
+ Relation index = ind_refs[i];
+
+ TransferPredicateLocksToHeapRelation(index);
+
+ /*
+ * References to indexes on the old relation are not needed anymore,
+ * however locks stay till the end of the transaction.
+ */
+ index_close(index, NoLock);
+ }
+ pfree(ind_refs);
+
+ /*
+ * Flush anything we see in WAL, to make sure that all changes committed
+ * while we were waiting for the exclusive lock are available for
+ * decoding. This should not be necessary if all backends had
+ * synchronous_commit set, but we can't rely on this setting.
+ *
+ * Unfortunately, GetInsertRecPtr() may lag behind the actual insert
+ * position, and GetLastImportantRecPtr() points at the start of the last
+ * record rather than at the end. Thus the simplest way to determine the
+ * insert position is to insert a dummy record and use its LSN.
+ *
+ * XXX Consider using GetLastImportantRecPtr() and adding the size of the
+ * last record (plus the total size of all the page headers the record
+ * spans)?
+ */
+ XLogBeginInsert();
+ XLogRegisterData(&dummy_rec_data, 1);
+ wal_insert_ptr = XLogInsert(RM_XLOG_ID, XLOG_NOOP);
+ XLogFlush(wal_insert_ptr);
+ end_of_wal = GetFlushRecPtr(NULL);
+
+ /* Apply the concurrent changes again. */
+ process_concurrent_changes(ctx, end_of_wal, NewHeap,
+ swap_toast_by_content ? OldHeap : NULL,
+ ident_key, ident_key_nentries, iistate);
+
+ /* Remember info about rel before closing OldHeap */
+ relpersistence = OldHeap->rd_rel->relpersistence;
+ is_system_catalog = IsSystemRelation(OldHeap);
+
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_SWAP_REL_FILES);
+
+ /*
+ * Even ShareUpdateExclusiveLock should have prevented others from
+ * creating / dropping indexes (even using the CONCURRENTLY option), so we
+ * do not need to check whether the lists match.
+ */
+ forboth(lc, ind_oids_old, lc2, ind_oids_new)
+ {
+ Oid ind_old = lfirst_oid(lc);
+ Oid ind_new = lfirst_oid(lc2);
+ Oid mapped_tables[4];
+
+ /* Zero out possible results from swap_relation_files */
+ memset(mapped_tables, 0, sizeof(mapped_tables));
+
+ swap_relation_files(ind_old, ind_new,
+ (old_table_oid == RelationRelationId),
+ swap_toast_by_content,
+ true,
+ InvalidTransactionId,
+ InvalidMultiXactId,
+ mapped_tables);
+
+#ifdef USE_ASSERT_CHECKING
+
+ /*
+ * Concurrent processing is not supported for system relations, so
+ * there should be no mapped tables.
+ */
+ for (int i = 0; i < 4; i++)
+ Assert(mapped_tables[i] == 0);
+#endif
+ }
+
+ /* The new indexes must be visible for deletion. */
+ CommandCounterIncrement();
+
+ /* Close the old heap but keep lock until transaction commit. */
+ table_close(OldHeap, NoLock);
+ /* Close the new heap. (We didn't have to open its indexes). */
+ table_close(NewHeap, NoLock);
+
+ /* Cleanup what we don't need anymore. (And close the identity index.) */
+ pfree(ident_key);
+ free_index_insert_state(iistate);
+
+ /*
+ * Swap the relations and their TOAST relations and TOAST indexes. This
+ * also drops the new relation and its indexes.
+ *
+ * (System catalogs are currently not supported.)
+ */
+ Assert(!is_system_catalog);
+ finish_heap_swap(old_table_oid, new_table_oid,
+ is_system_catalog,
+ swap_toast_by_content,
+ false, true, false,
+ frozenXid, cutoffMulti,
+ relpersistence);
+}
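The shape of rebuild_relation_finish_concurrent() is a two-phase catch-up: the bulk of the decoded changes is applied while only ShareUpdateExclusiveLock is held, so AccessExclusiveLock has to cover just the few changes that arrive during the lock upgrade. A toy model of that accounting (not PostgreSQL code; all names invented for illustration):

```c
#include <assert.h>

/* Toy model of the two-phase catch-up in the concurrent rebuild. */
typedef struct CatchUp
{
	int		pending;			/* decoded but not yet applied changes */
	int		applied_unlocked;	/* applied under the weaker lock */
	int		applied_locked;		/* applied under AccessExclusiveLock */
} CatchUp;

/* First pass: drain the backlog before taking the exclusive lock. */
static void
apply_before_lock(CatchUp *c)
{
	c->applied_unlocked += c->pending;
	c->pending = 0;
}

/* Final pass: only the changes that slipped in during the upgrade. */
static void
apply_under_lock(CatchUp *c)
{
	c->applied_locked += c->pending;
	c->pending = 0;
}
```

The benefit is that the exclusive-lock window scales with the write rate during the upgrade, not with the total volume of changes made while the table was being copied.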
+
+/*
+ * Build indexes on NewHeap according to those on OldHeap.
+ *
+ * OldIndexes is the list of index OIDs on OldHeap.
+ *
+ * A list of OIDs of the corresponding indexes created on NewHeap is
+ * returned. The order of items matches, so these lists can be used to
+ * swap the index storage.
+ */
+static List *
+build_new_indexes(Relation NewHeap, Relation OldHeap, List *OldIndexes)
+{
+ StringInfo ind_name;
+ ListCell *lc;
+ List *result = NIL;
+
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_REBUILD_INDEX);
+
+ ind_name = makeStringInfo();
+
+ foreach(lc, OldIndexes)
+ {
+ Oid ind_oid,
+ ind_oid_new,
+ tbsp_oid;
+ Relation ind;
+ IndexInfo *ind_info;
+ int i,
+ heap_col_id;
+ List *colnames;
+ int16 indnatts;
+ Oid *collations,
+ *opclasses;
+ HeapTuple tup;
+ bool isnull;
+ Datum d;
+ oidvector *oidvec;
+ int2vector *int2vec;
+ size_t oid_arr_size;
+ size_t int2_arr_size;
+ int16 *indoptions;
+ text *reloptions = NULL;
+ bits16 flags;
+ Datum *opclassOptions;
+ NullableDatum *stattargets;
+
+ ind_oid = lfirst_oid(lc);
+ ind = index_open(ind_oid, AccessShareLock);
+ ind_info = BuildIndexInfo(ind);
+
+ tbsp_oid = ind->rd_rel->reltablespace;
+
+ /*
+ * Index names really don't matter, as we'll eventually use only their
+ * storage. Just make them unique within the table.
+ */
+ resetStringInfo(ind_name);
+ appendStringInfo(ind_name, "ind_%d",
+ list_cell_number(OldIndexes, lc));
+
+ flags = 0;
+ if (ind->rd_index->indisprimary)
+ flags |= INDEX_CREATE_IS_PRIMARY;
+
+ colnames = NIL;
+ indnatts = ind->rd_index->indnatts;
+ oid_arr_size = sizeof(Oid) * indnatts;
+ int2_arr_size = sizeof(int16) * indnatts;
+
+ collations = (Oid *) palloc(oid_arr_size);
+ for (i = 0; i < indnatts; i++)
+ {
+ char *colname;
+
+ heap_col_id = ind->rd_index->indkey.values[i];
+ if (heap_col_id > 0)
+ {
+ Form_pg_attribute att;
+
+ /* Normal attribute. */
+ att = TupleDescAttr(OldHeap->rd_att, heap_col_id - 1);
+ colname = pstrdup(NameStr(att->attname));
+ collations[i] = att->attcollation;
+ }
+ else if (heap_col_id == 0)
+ {
+ HeapTuple tuple;
+ Form_pg_attribute att;
+
+ /*
+ * An expression column is not present in the relcache. What we need
+ * here is an attribute of the *index* relation.
+ */
+ tuple = SearchSysCache2(ATTNUM,
+ ObjectIdGetDatum(ind_oid),
+ Int16GetDatum(i + 1));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR,
+ "cache lookup failed for attribute %d of relation %u",
+ i + 1, ind_oid);
+ att = (Form_pg_attribute) GETSTRUCT(tuple);
+ colname = pstrdup(NameStr(att->attname));
+ collations[i] = att->attcollation;
+ ReleaseSysCache(tuple);
+ }
+ else
+ elog(ERROR, "unexpected column number: %d",
+ heap_col_id);
+
+ colnames = lappend(colnames, colname);
+ }
+
+ /*
+ * Special effort needed for variable length attributes of
+ * Form_pg_index.
+ */
+ tup = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(ind_oid));
+ if (!HeapTupleIsValid(tup))
+ elog(ERROR, "cache lookup failed for index %u", ind_oid);
+ d = SysCacheGetAttr(INDEXRELID, tup, Anum_pg_index_indclass, &isnull);
+ Assert(!isnull);
+ oidvec = (oidvector *) DatumGetPointer(d);
+ opclasses = (Oid *) palloc(oid_arr_size);
+ memcpy(opclasses, oidvec->values, oid_arr_size);
+
+ d = SysCacheGetAttr(INDEXRELID, tup, Anum_pg_index_indoption,
+ &isnull);
+ Assert(!isnull);
+ int2vec = (int2vector *) DatumGetPointer(d);
+ indoptions = (int16 *) palloc(int2_arr_size);
+ memcpy(indoptions, int2vec->values, int2_arr_size);
+ ReleaseSysCache(tup);
+
+ tup = SearchSysCache1(RELOID, ObjectIdGetDatum(ind_oid));
+ if (!HeapTupleIsValid(tup))
+ elog(ERROR, "cache lookup failed for index relation %u", ind_oid);
+ d = SysCacheGetAttr(RELOID, tup, Anum_pg_class_reloptions, &isnull);
+ reloptions = !isnull ? DatumGetTextPCopy(d) : NULL;
+ ReleaseSysCache(tup);
+
+ opclassOptions = palloc0(sizeof(Datum) * ind_info->ii_NumIndexAttrs);
+ for (i = 0; i < ind_info->ii_NumIndexAttrs; i++)
+ opclassOptions[i] = get_attoptions(ind_oid, i + 1);
+
+ stattargets = get_index_stattargets(ind_oid, ind_info);
+
+ /*
+ * Neither parentIndexRelid nor parentConstraintId needs to be passed
+ * since the new catalog entries (pg_constraint, pg_inherits) would
+ * eventually be dropped. Therefore there's no need to record valid
+ * dependency on parents.
+ */
+ ind_oid_new = index_create(NewHeap,
+ ind_name->data,
+ InvalidOid,
+ InvalidOid, /* parentIndexRelid */
+ InvalidOid, /* parentConstraintId */
+ InvalidOid,
+ ind_info,
+ colnames,
+ ind->rd_rel->relam,
+ tbsp_oid,
+ collations,
+ opclasses,
+ opclassOptions,
+ indoptions,
+ stattargets,
+ PointerGetDatum(reloptions),
+ flags, /* flags */
+ 0, /* constr_flags */
+ false, /* allow_system_table_mods */
+ false, /* is_internal */
+ NULL /* constraintId */
+ );
+ result = lappend_oid(result, ind_oid_new);
+
+ index_close(ind, AccessShareLock);
+ list_free_deep(colnames);
+ pfree(collations);
+ pfree(opclasses);
+ pfree(indoptions);
+ if (reloptions)
+ pfree(reloptions);
+ }
+
+ return result;
+}
+
+/*
+ * REPACK is intended to be a replacement of both CLUSTER and VACUUM FULL.
+ */
+void
+repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
+{
+ ListCell *lc;
+ ClusterParams params = {0};
+ bool verbose = false;
+ Relation rel = NULL;
+ Oid indexOid = InvalidOid;
+ MemoryContext repack_context;
+ List *rtcs;
+ LOCKMODE lockmode;
+
+ /* Parse option list */
+ foreach(lc, stmt->params)
+ {
+ DefElem *opt = (DefElem *) lfirst(lc);
+
+ if (strcmp(opt->defname, "verbose") == 0)
+ verbose = defGetBoolean(opt);
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("unrecognized REPACK option \"%s\"",
+ opt->defname),
+ parser_errposition(pstate, opt->location)));
+ }
+
+ params.options =
+ (verbose ? CLUOPT_VERBOSE : 0) |
+ (stmt->concurrent ? CLUOPT_CONCURRENT : 0);
+
+ /*
+ * Determine the lock mode expected by cluster_rel().
+ *
+ * In the exclusive case, we obtain AccessExclusiveLock right away to
+ * avoid lock-upgrade hazard in the single-transaction case. In the
+ * CONCURRENTLY case, the AccessExclusiveLock will only be used at the end
+ * of processing, supposedly for very short time. Until then, we'll have
+ * to unlock the relation temporarily, so there's no lock-upgrade hazard.
+ */
+ lockmode = (params.options & CLUOPT_CONCURRENT) == 0 ?
+ AccessExclusiveLock : ShareUpdateExclusiveLock;
+
+ if (stmt->relation != NULL)
+ {
+ /* This is the single-relation case. */
+ rel = process_single_relation(stmt->relation, stmt->indexname,
+ lockmode, isTopLevel, &params,
+ CLUSTER_COMMAND_REPACK, &indexOid);
+ if (rel == NULL)
+ return;
+ }
+
+ /*
+ * By here, we know we are in a multi-table situation.
+ *
+ * Concurrent processing is currently considered rather special (e.g. in
+ * terms of resources consumed) so it is not performed in bulk.
+ */
+ if (params.options & CLUOPT_CONCURRENT)
+ {
+ if (rel != NULL)
+ {
+ Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
+ ereport(ERROR,
+ (errmsg("REPACK CONCURRENTLY not supported for partitioned tables"),
+ errhint("Consider running the command for individual partitions.")));
+ }
+ else
+ ereport(ERROR,
+ (errmsg("REPACK CONCURRENTLY requires an explicit table name")));
+ }
+
+ /*
+ * In order to avoid holding locks for too long, we want to process each
+ * table in its own transaction. This forces us to disallow running
+ * inside a user transaction block.
+ */
+ PreventInTransactionBlock(isTopLevel, "REPACK");
+
+ /* Also, we need a memory context to hold our list of relations */
+ repack_context = AllocSetContextCreate(PortalContext,
+ "Repack",
+ ALLOCSET_DEFAULT_SIZES);
+
+ params.options |= CLUOPT_RECHECK;
+ if (rel != NULL)
+ {
+ Oid relid;
+ bool rel_is_index;
+
+ Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
+ /* See the ereport() above. */
+ Assert((params.options & CLUOPT_CONCURRENT) == 0);
+
+ if (OidIsValid(indexOid))
+ {
+ relid = indexOid;
+ rel_is_index = true;
+ }
+ else
+ {
+ relid = RelationGetRelid(rel);
+ rel_is_index = false;
+ }
+ rtcs = get_tables_to_cluster_partitioned(repack_context, relid,
+ rel_is_index,
+ CLUSTER_COMMAND_REPACK);
+
+ /* close relation, releasing lock on parent table */
+ table_close(rel, lockmode);
+ }
+ else
+ rtcs = get_tables_to_repack(repack_context);
+
+ /* Do the job. */
+ cluster_multiple_rels(rtcs, &params, CLUSTER_COMMAND_REPACK, lockmode,
+ isTopLevel);
- /* Do the job. */
- cluster_multiple_rels(rtcs, &params, CLUSTER_COMMAND_REPACK);
/* Start a new transaction for the cleanup work. */
StartTransactionCommand();
@@ -1933,6 +3515,7 @@ repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
*/
static Relation
process_single_relation(RangeVar *relation, char *indexname,
+ LOCKMODE lockmode, bool isTopLevel,
ClusterParams *params, ClusterCommand cmd,
Oid *indexOid_p)
{
@@ -1943,12 +3526,10 @@ process_single_relation(RangeVar *relation, char *indexname,
Oid tableOid;
/*
- * Find, lock, and check permissions on the table. We obtain
- * AccessExclusiveLock right away to avoid lock-upgrade hazard in the
- * single-transaction case.
+ * Find, lock, and check permissions on the table.
*/
tableOid = RangeVarGetRelidExtended(relation,
- AccessExclusiveLock,
+ lockmode,
0,
RangeVarCallbackMaintainsTable,
NULL);
@@ -2012,7 +3593,7 @@ process_single_relation(RangeVar *relation, char *indexname,
/* For non-partitioned tables, do what we came here to do. */
if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
{
- cluster_rel(rel, indexOid, params, cmd);
+ cluster_rel(rel, indexOid, params, cmd, isTopLevel);
/* cluster_rel closes the relation, but keeps lock */
return NULL;
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index e7854add178..df879c2a18d 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -904,7 +904,7 @@ refresh_by_match_merge(Oid matviewOid, Oid tempOid, Oid relowner,
static void
refresh_by_heap_swap(Oid matviewOid, Oid OIDNewHeap, char relpersistence)
{
- finish_heap_swap(matviewOid, OIDNewHeap, false, false, true, true,
+ finish_heap_swap(matviewOid, OIDNewHeap, false, false, true, true, true,
RecentXmin, ReadNextMultiXactId(), relpersistence);
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 686f1850cab..f4e1ed0ec5f 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -5989,6 +5989,7 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
finish_heap_swap(tab->relid, OIDNewHeap,
false, false, true,
!OidIsValid(tab->newTableSpace),
+ true,
RecentXmin,
ReadNextMultiXactId(),
persistence);
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index a4ad23448f8..f9f8f5ebb58 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -124,7 +124,7 @@ static void vac_truncate_clog(TransactionId frozenXID,
TransactionId lastSaneFrozenXid,
MultiXactId lastSaneMinMulti);
static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
- BufferAccessStrategy bstrategy);
+ BufferAccessStrategy bstrategy, bool isTopLevel);
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
@@ -634,7 +634,8 @@ vacuum(List *relations, VacuumParams *params, BufferAccessStrategy bstrategy,
if (params->options & VACOPT_VACUUM)
{
- if (!vacuum_rel(vrel->oid, vrel->relation, params, bstrategy))
+ if (!vacuum_rel(vrel->oid, vrel->relation, params, bstrategy,
+ isTopLevel))
continue;
}
@@ -1996,7 +1997,7 @@ vac_truncate_clog(TransactionId frozenXID,
*/
static bool
vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
- BufferAccessStrategy bstrategy)
+ BufferAccessStrategy bstrategy, bool isTopLevel)
{
LOCKMODE lmode;
Relation rel;
@@ -2264,7 +2265,7 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
/* VACUUM FULL is now a variant of CLUSTER; see cluster.c */
cluster_rel(rel, InvalidOid, &cluster_params,
- CLUSTER_COMMAND_VACUUM);
+ CLUSTER_COMMAND_VACUUM, isTopLevel);
/* cluster_rel closes the relation, but keeps lock */
rel = NULL;
@@ -2310,7 +2311,8 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
toast_vacuum_params.options |= VACOPT_PROCESS_MAIN;
toast_vacuum_params.toast_parent = relid;
- vacuum_rel(toast_relid, NULL, &toast_vacuum_params, bstrategy);
+ vacuum_rel(toast_relid, NULL, &toast_vacuum_params, bstrategy,
+ isTopLevel);
}
/*
diff --git a/src/backend/meson.build b/src/backend/meson.build
index 2b0db214804..50aa385a581 100644
--- a/src/backend/meson.build
+++ b/src/backend/meson.build
@@ -194,5 +194,6 @@ pg_test_mod_args = pg_mod_args + {
subdir('jit/llvm')
subdir('replication/libpqwalreceiver')
subdir('replication/pgoutput')
+subdir('replication/pgoutput_repack')
subdir('snowball')
subdir('utils/mb/conversion_procs')
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 00813f88b47..72f75f25fcc 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -11897,27 +11897,30 @@ cluster_index_specification:
*
* QUERY:
* REPACK [ (options) ] [ <qualified_name> [ USING INDEX <index_name> ] ]
+ * REPACK [ (options) ] CONCURRENTLY <qualified_name> [ USING INDEX <index_name> ]
*
*****************************************************************************/
RepackStmt:
- REPACK opt_repack_args
+ REPACK opt_concurrently opt_repack_args
{
RepackStmt *n = makeNode(RepackStmt);
- n->relation = $2 ? (RangeVar *) linitial($2) : NULL;
- n->indexname = $2 ? (char *) lsecond($2) : NULL;
+ n->relation = $3 ? (RangeVar *) linitial($3) : NULL;
+ n->indexname = $3 ? (char *) lsecond($3) : NULL;
n->params = NIL;
+ n->concurrent = $2;
$$ = (Node *) n;
}
- | REPACK '(' utility_option_list ')' opt_repack_args
+ | REPACK '(' utility_option_list ')' opt_concurrently opt_repack_args
{
RepackStmt *n = makeNode(RepackStmt);
- n->relation = $5 ? (RangeVar *) linitial($5) : NULL;
- n->indexname = $5 ? (char *) lsecond($5) : NULL;
+ n->relation = $6 ? (RangeVar *) linitial($6) : NULL;
+ n->indexname = $6 ? (char *) lsecond($6) : NULL;
n->params = $3;
+ n->concurrent = $5;
$$ = (Node *) n;
}
;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 78f9a0a11c4..bc0e4397f35 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -33,6 +33,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecord.h"
#include "catalog/pg_control.h"
+#include "commands/cluster.h"
#include "replication/decode.h"
#include "replication/logical.h"
#include "replication/message.h"
@@ -467,6 +468,88 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
TransactionId xid = XLogRecGetXid(buf->record);
SnapBuild *builder = ctx->snapshot_builder;
+ /*
+ * If the change is not intended for logical decoding, do not even
+ * establish transaction for it - REPACK CONCURRENTLY is the typical use
+ * case.
+ *
+ * First, check if REPACK CONCURRENTLY is being performed by this backend.
+ * If so, only decode data changes of the table that it is processing, and
+ * the changes of its TOAST relation.
+ *
+ * (The TOAST locator should not be set unless the main one is.)
+ */
+ Assert(!OidIsValid(repacked_rel_toast_locator.relNumber) ||
+ OidIsValid(repacked_rel_locator.relNumber));
+
+ if (OidIsValid(repacked_rel_locator.relNumber))
+ {
+ XLogReaderState *r = buf->record;
+ RelFileLocator locator;
+
+ /* Not all records contain the block. */
+ if (XLogRecGetBlockTagExtended(r, 0, &locator, NULL, NULL, NULL) &&
+ !RelFileLocatorEquals(locator, repacked_rel_locator) &&
+ (!OidIsValid(repacked_rel_toast_locator.relNumber) ||
+ !RelFileLocatorEquals(locator, repacked_rel_toast_locator)))
+ return;
+ }
+
+ /*
+ * Second, skip records which do not contain sufficient information for
+ * the decoding.
+ *
+ * The problem we solve here is that REPACK CONCURRENTLY generates WAL
+ * when making changes in the new table. Those changes are of no use to
+ * any other consumer (such as a logical replication subscription)
+ * because the new table will eventually be dropped (after REPACK
+ * CONCURRENTLY has assigned its file to the "old table").
+ */
+ switch (info)
+ {
+ case XLOG_HEAP_INSERT:
+ {
+ xl_heap_insert *rec;
+
+ rec = (xl_heap_insert *) XLogRecGetData(buf->record);
+
+ /*
+ * This does happen when 1) raw_heap_insert marks the TOAST
+ * record as HEAP_INSERT_NO_LOGICAL, 2) REPACK CONCURRENTLY
+ * replays inserts performed by other backends.
+ */
+ if ((rec->flags & XLH_INSERT_CONTAINS_NEW_TUPLE) == 0)
+ return;
+
+ break;
+ }
+
+ case XLOG_HEAP_HOT_UPDATE:
+ case XLOG_HEAP_UPDATE:
+ {
+ xl_heap_update *rec;
+
+ rec = (xl_heap_update *) XLogRecGetData(buf->record);
+ if ((rec->flags &
+ (XLH_UPDATE_CONTAINS_NEW_TUPLE |
+ XLH_UPDATE_CONTAINS_OLD_TUPLE |
+ XLH_UPDATE_CONTAINS_OLD_KEY)) == 0)
+ return;
+
+ break;
+ }
+
+ case XLOG_HEAP_DELETE:
+ {
+ xl_heap_delete *rec;
+
+ rec = (xl_heap_delete *) XLogRecGetData(buf->record);
+ if (rec->flags & XLH_DELETE_NO_LOGICAL)
+ return;
+ break;
+ }
+ }
+
ReorderBufferProcessXid(ctx->reorder, xid, buf->origptr);
/*
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index feaa3ac5ad4..5d552f9ce74 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -486,6 +486,26 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
return SnapBuildMVCCFromHistoric(snap, true);
}
+/*
+ * Build an MVCC snapshot for the initial data load performed by the REPACK
+ * CONCURRENTLY command.
+ *
+ * The snapshot will only be used to scan one particular relation, which is
+ * treated like a catalog (therefore ->building_full_snapshot is not
+ * important), and the caller should already have a replication slot setup (so
+ * we do not set MyProc->xmin). XXX Do we still need to add any restrictions?
+ */
+Snapshot
+SnapBuildInitialSnapshotForRepack(SnapBuild *builder)
+{
+ Snapshot snap;
+
+ Assert(builder->state == SNAPBUILD_CONSISTENT);
+
+ snap = SnapBuildBuildSnapshot(builder);
+ return SnapBuildMVCCFromHistoric(snap, false);
+}
+
/*
* Turn a historic MVCC snapshot into an ordinary MVCC snapshot.
*
diff --git a/src/backend/replication/pgoutput_repack/Makefile b/src/backend/replication/pgoutput_repack/Makefile
new file mode 100644
index 00000000000..4efeb713b70
--- /dev/null
+++ b/src/backend/replication/pgoutput_repack/Makefile
@@ -0,0 +1,32 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+# Makefile for src/backend/replication/pgoutput_repack
+#
+# IDENTIFICATION
+ # src/backend/replication/pgoutput_repack/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/replication/pgoutput_repack
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+ $(WIN32RES) \
+ pgoutput_repack.o
+PGFILEDESC = "pgoutput_repack - logical replication output plugin for REPACK command"
+NAME = pgoutput_repack
+
+all: all-shared-lib
+
+include $(top_srcdir)/src/Makefile.shlib
+
+install: all installdirs install-lib
+
+installdirs: installdirs-lib
+
+uninstall: uninstall-lib
+
+clean distclean: clean-lib
+ rm -f $(OBJS)
diff --git a/src/backend/replication/pgoutput_repack/meson.build b/src/backend/replication/pgoutput_repack/meson.build
new file mode 100644
index 00000000000..133e865a4a0
--- /dev/null
+++ b/src/backend/replication/pgoutput_repack/meson.build
@@ -0,0 +1,18 @@
+# Copyright (c) 2022-2024, PostgreSQL Global Development Group
+
+pgoutput_repack_sources = files(
+ 'pgoutput_repack.c',
+)
+
+if host_system == 'windows'
+ pgoutput_repack_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pgoutput_repack',
+ '--FILEDESC', 'pgoutput_repack - logical replication output plugin for REPACK command',])
+endif
+
+pgoutput_repack = shared_module('pgoutput_repack',
+ pgoutput_repack_sources,
+ kwargs: pg_mod_args,
+)
+
+backend_targets += pgoutput_repack
diff --git a/src/backend/replication/pgoutput_repack/pgoutput_repack.c b/src/backend/replication/pgoutput_repack/pgoutput_repack.c
new file mode 100644
index 00000000000..687fbbc59bb
--- /dev/null
+++ b/src/backend/replication/pgoutput_repack/pgoutput_repack.c
@@ -0,0 +1,288 @@
+/*-------------------------------------------------------------------------
+ *
+ * pgoutput_repack.c
+ * Logical Replication output plugin for REPACK command
+ *
+ * Copyright (c) 2012-2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/replication/pgoutput_repack/pgoutput_repack.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heaptoast.h"
+#include "commands/cluster.h"
+#include "replication/snapbuild.h"
+
+PG_MODULE_MAGIC;
+
+static void plugin_startup(LogicalDecodingContext *ctx,
+ OutputPluginOptions *opt, bool is_init);
+static void plugin_shutdown(LogicalDecodingContext *ctx);
+static void plugin_begin_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn);
+static void plugin_commit_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ Relation rel, ReorderBufferChange *change);
+static void plugin_truncate(struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, int nrelations,
+ Relation relations[],
+ ReorderBufferChange *change);
+static void store_change(LogicalDecodingContext *ctx,
+ ConcurrentChangeKind kind, HeapTuple tuple);
+
+void
+_PG_output_plugin_init(OutputPluginCallbacks *cb)
+{
+ AssertVariableIsOfType(&_PG_output_plugin_init, LogicalOutputPluginInit);
+
+ cb->startup_cb = plugin_startup;
+ cb->begin_cb = plugin_begin_txn;
+ cb->change_cb = plugin_change;
+ cb->truncate_cb = plugin_truncate;
+ cb->commit_cb = plugin_commit_txn;
+ cb->shutdown_cb = plugin_shutdown;
+}
+
+
+/* initialize this plugin */
+static void
+plugin_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
+ bool is_init)
+{
+ ctx->output_plugin_private = NULL;
+
+ /* Probably unnecessary, as we don't use the SQL interface ... */
+ opt->output_type = OUTPUT_PLUGIN_BINARY_OUTPUT;
+
+ if (ctx->output_plugin_options != NIL)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("this plugin does not accept any options")));
+ }
+}
+
+static void
+plugin_shutdown(LogicalDecodingContext *ctx)
+{
+}
+
+/*
+ * As we don't release the slot while processing a particular table, there's
+ * no room for an SQL interface, even for debugging purposes. Therefore we need
+ * neither OutputPluginPrepareWrite() nor OutputPluginWrite() in the plugin
+ * callbacks. (Although we might want to write custom callbacks, this API
+ * seems to be unnecessarily generic for our purposes.)
+ */
+
+/* BEGIN callback */
+static void
+plugin_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+}
+
+/* COMMIT callback */
+static void
+plugin_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+}
+
+/*
+ * Callback for individual changed tuples
+ */
+static void
+plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ Relation relation, ReorderBufferChange *change)
+{
+ RepackDecodingState *dstate;
+
+ dstate = (RepackDecodingState *) ctx->output_writer_private;
+
+ /* Only interested in one particular relation. */
+ if (relation->rd_id != dstate->relid)
+ return;
+
+ /* Decode entry depending on its type */
+ switch (change->action)
+ {
+ case REORDER_BUFFER_CHANGE_INSERT:
+ {
+ HeapTuple newtuple;
+
+ newtuple = change->data.tp.newtuple != NULL ?
+ change->data.tp.newtuple : NULL;
+
+ /*
+ * Identity checks in the main function should have made this
+ * impossible.
+ */
+ if (newtuple == NULL)
+ elog(ERROR, "incomplete insert info");
+
+ store_change(ctx, CHANGE_INSERT, newtuple);
+ }
+ break;
+ case REORDER_BUFFER_CHANGE_UPDATE:
+ {
+ HeapTuple oldtuple,
+ newtuple;
+
+ oldtuple = change->data.tp.oldtuple != NULL ?
+ change->data.tp.oldtuple : NULL;
+ newtuple = change->data.tp.newtuple != NULL ?
+ change->data.tp.newtuple : NULL;
+
+ if (newtuple == NULL)
+ elog(ERROR, "incomplete update info");
+
+ if (oldtuple != NULL)
+ store_change(ctx, CHANGE_UPDATE_OLD, oldtuple);
+
+ store_change(ctx, CHANGE_UPDATE_NEW, newtuple);
+ }
+ break;
+ case REORDER_BUFFER_CHANGE_DELETE:
+ {
+ HeapTuple oldtuple;
+
+ oldtuple = change->data.tp.oldtuple ?
+ change->data.tp.oldtuple : NULL;
+
+ if (oldtuple == NULL)
+ elog(ERROR, "incomplete delete info");
+
+ store_change(ctx, CHANGE_DELETE, oldtuple);
+ }
+ break;
+ default:
+ /* Should not come here */
+ Assert(false);
+ break;
+ }
+}
+
+static void
+plugin_truncate(struct LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ int nrelations, Relation relations[],
+ ReorderBufferChange *change)
+{
+ RepackDecodingState *dstate;
+ int i;
+ Relation relation = NULL;
+
+ dstate = (RepackDecodingState *) ctx->output_writer_private;
+
+ /* Find the relation we are processing. */
+ for (i = 0; i < nrelations; i++)
+ {
+ relation = relations[i];
+
+ if (RelationGetRelid(relation) == dstate->relid)
+ break;
+ }
+
+ /* Is this truncation of another relation? */
+ if (i == nrelations)
+ return;
+
+ store_change(ctx, CHANGE_TRUNCATE, NULL);
+}
+
+/* Store concurrent data change. */
+static void
+store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
+ HeapTuple tuple)
+{
+ RepackDecodingState *dstate;
+ char *change_raw;
+ ConcurrentChange change;
+ bool flattened = false;
+ Size size;
+ Datum values[1];
+ bool isnull[1];
+ char *dst,
+ *dst_start;
+
+ dstate = (RepackDecodingState *) ctx->output_writer_private;
+
+ size = MAXALIGN(VARHDRSZ) + SizeOfConcurrentChange;
+
+ if (tuple)
+ {
+ /*
+ * ReorderBufferCommit() stores the TOAST chunks in its private memory
+ * context and frees them after having called apply_change().
+ * Therefore we need flat copy (including TOAST) that we eventually
+ * copy into the memory context which is available to
+ * decode_concurrent_changes().
+ */
+ if (HeapTupleHasExternal(tuple))
+ {
+ /*
+ * toast_flatten_tuple_to_datum() might be more convenient but we
+ * don't want the decompression it does.
+ */
+ tuple = toast_flatten_tuple(tuple, dstate->tupdesc);
+ flattened = true;
+ }
+
+ size += tuple->t_len;
+ }
+
+ /* XXX Isn't there any function / macro to do this? */
+ if (size >= 0x3FFFFFFF)
+ elog(ERROR, "change is too big");
+
+ /* Construct the change. */
+ change_raw = (char *) palloc0(size);
+ SET_VARSIZE(change_raw, size);
+
+ /*
+ * Since the varlena alignment might not be sufficient for the structure,
+ * set the fields in a local instance and remember where it should
+ * eventually be copied.
+ */
+ change.kind = kind;
+ dst_start = (char *) VARDATA(change_raw);
+
+ /* No other information is needed for TRUNCATE. */
+ if (change.kind == CHANGE_TRUNCATE)
+ {
+ memcpy(dst_start, &change, SizeOfConcurrentChange);
+ goto store;
+ }
+
+ /*
+ * Copy the tuple.
+ *
+ * CAUTION: change->tup_data.t_data must be fixed on retrieval!
+ */
+ memcpy(&change.tup_data, tuple, sizeof(HeapTupleData));
+ dst = dst_start + SizeOfConcurrentChange;
+ memcpy(dst, tuple->t_data, tuple->t_len);
+
+ /* The data has been copied. */
+ if (flattened)
+ pfree(tuple);
+
+store:
+ /* Copy the structure so it can be stored. */
+ memcpy(dst_start, &change, SizeOfConcurrentChange);
+
+ /* Store as tuple of 1 bytea column. */
+ values[0] = PointerGetDatum(change_raw);
+ isnull[0] = false;
+ tuplestore_putvalues(dstate->tstore, dstate->tupdesc_change,
+ values, isnull);
+
+ /* Accounting. */
+ dstate->nchanges++;
+
+ /* Cleanup. */
+ pfree(change_raw);
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 00c76d05356..f247e4e7c16 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
#include "access/xlogprefetcher.h"
#include "access/xlogrecovery.h"
#include "commands/async.h"
+#include "commands/cluster.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/autovacuum.h"
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 930321905f1..4155faf6c76 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -353,6 +353,7 @@ DSMRegistry "Waiting to read or update the dynamic shared memory registry."
InjectionPoint "Waiting to read or update information related to injection points."
SerialControl "Waiting to read or update shared <filename>pg_serial</filename> state."
AioWorkerSubmissionQueue "Waiting to access AIO worker submission queue."
+RepackedRels "Waiting to read or update information on tables being repacked concurrently."
#
# END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 2905ae86a20..75434e32198 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -64,6 +64,7 @@
#include "catalog/pg_type.h"
#include "catalog/schemapg.h"
#include "catalog/storage.h"
+#include "commands/cluster.h"
#include "commands/policy.h"
#include "commands/publicationcmds.h"
#include "commands/trigger.h"
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 70a6b8902d1..7f1c220e00b 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -213,7 +213,6 @@ static List *exportedSnapshots = NIL;
/* Prototypes for local functions */
static void UnregisterSnapshotNoOwner(Snapshot snapshot);
-static void FreeSnapshot(Snapshot snapshot);
static void SnapshotResetXmin(void);
/* ResourceOwner callbacks to track snapshot references */
@@ -646,7 +645,7 @@ CopySnapshot(Snapshot snapshot)
* FreeSnapshot
* Free the memory associated with a snapshot.
*/
-static void
+void
FreeSnapshot(Snapshot snapshot)
{
Assert(snapshot->regd_count == 0);
diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index 8512e099b03..24016228522 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -4914,18 +4914,27 @@ match_previous_words(int pattern_id,
}
/* REPACK */
- else if (Matches("REPACK"))
+ else if (Matches("REPACK") || Matches("REPACK", "(*)"))
+ COMPLETE_WITH_SCHEMA_QUERY_PLUS(Query_for_list_of_clusterables,
+ "CONCURRENTLY");
+ else if (Matches("REPACK", "CONCURRENTLY"))
COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_clusterables);
- else if (Matches("REPACK", "(*)"))
+ else if (Matches("REPACK", "(*)", "CONCURRENTLY"))
COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_clusterables);
- /* If we have REPACK <sth>, then add "USING INDEX" */
- else if (Matches("REPACK", MatchAnyExcept("(")))
+ /* If we have REPACK [ CONCURRENTLY ] <sth>, then add "USING INDEX" */
+ else if (Matches("REPACK", MatchAnyExcept("(|CONCURRENTLY")) ||
+ Matches("REPACK", "CONCURRENTLY", MatchAnyExcept("(")))
COMPLETE_WITH("USING INDEX");
- /* If we have REPACK (*) <sth>, then add "USING INDEX" */
- else if (Matches("REPACK", "(*)", MatchAny))
+ /* If we have REPACK (*) [ CONCURRENTLY ] <sth>, then add "USING INDEX" */
+ else if (Matches("REPACK", "(*)", MatchAnyExcept("CONCURRENTLY")) ||
+ Matches("REPACK", "(*)", "CONCURRENTLY", MatchAnyExcept("(")))
COMPLETE_WITH("USING INDEX");
- /* If we have REPACK <sth> USING, then add the index as well */
- else if (Matches("REPACK", MatchAny, "USING", "INDEX"))
+
+ /*
+ * Complete ... [ (*) ] [ CONCURRENTLY ] <sth> USING INDEX, with a list of
+ * indexes for <sth>.
+ */
+ else if (TailMatches(MatchAnyExcept("(|CONCURRENTLY"), "USING", "INDEX"))
{
set_completion_reference(prev3_wd);
COMPLETE_WITH_SCHEMA_QUERY(Query_for_index_of_table);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index e48fe434cd3..be36bb51d0e 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -322,14 +322,15 @@ extern void heap_multi_insert(Relation relation, struct TupleTableSlot **slots,
BulkInsertState bistate);
extern TM_Result heap_delete(Relation relation, ItemPointer tid,
CommandId cid, Snapshot crosscheck, bool wait,
- struct TM_FailureData *tmfd, bool changingPart);
+ struct TM_FailureData *tmfd, bool changingPart,
+ bool wal_logical);
extern void heap_finish_speculative(Relation relation, ItemPointer tid);
extern void heap_abort_speculative(Relation relation, ItemPointer tid);
extern TM_Result heap_update(Relation relation, ItemPointer otid,
HeapTuple newtup,
CommandId cid, Snapshot crosscheck, bool wait,
struct TM_FailureData *tmfd, LockTupleMode *lockmode,
- TU_UpdateIndexes *update_indexes);
+ TU_UpdateIndexes *update_indexes, bool wal_logical);
extern TM_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
CommandId cid, LockTupleMode mode, LockWaitPolicy wait_policy,
bool follow_updates,
@@ -411,6 +412,10 @@ extern HTSV_Result HeapTupleSatisfiesVacuumHorizon(HeapTuple htup, Buffer buffer
TransactionId *dead_after);
extern void HeapTupleSetHintBits(HeapTupleHeader tuple, Buffer buffer,
uint16 infomask, TransactionId xid);
+extern bool HeapTupleMVCCInserted(HeapTuple htup, Snapshot snapshot,
+ Buffer buffer);
+extern bool HeapTupleMVCCNotDeleted(HeapTuple htup, Snapshot snapshot,
+ Buffer buffer);
extern bool HeapTupleHeaderIsOnlyLocked(HeapTupleHeader tuple);
extern bool HeapTupleIsSurelyDead(HeapTuple htup,
struct GlobalVisState *vistest);
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 277df6b3cf0..8d4af07f840 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -104,6 +104,8 @@
#define XLH_DELETE_CONTAINS_OLD_KEY (1<<2)
#define XLH_DELETE_IS_SUPER (1<<3)
#define XLH_DELETE_IS_PARTITION_MOVE (1<<4)
+/* See heap_delete() */
+#define XLH_DELETE_NO_LOGICAL (1<<5)
/* convenience macro for checking whether any form of old tuple was logged */
#define XLH_DELETE_CONTAINS_OLD \
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8713e12cbfb..58356392895 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,6 +21,7 @@
#include "access/sdir.h"
#include "access/xact.h"
#include "executor/tuptable.h"
+#include "replication/logical.h"
#include "storage/read_stream.h"
#include "utils/rel.h"
#include "utils/snapshot.h"
@@ -623,6 +624,8 @@ typedef struct TableAmRoutine
Relation OldIndex,
bool use_sort,
TransactionId OldestXmin,
+ Snapshot snapshot,
+ LogicalDecodingContext *decoding_ctx,
TransactionId *xid_cutoff,
MultiXactId *multi_cutoff,
double *num_tuples,
@@ -1627,6 +1630,10 @@ table_relation_copy_data(Relation rel, const RelFileLocator *newrlocator)
* not needed for the relation's AM
* - *xid_cutoff - ditto
* - *multi_cutoff - ditto
+ * - snapshot - if != NULL, ignore data changes done by transactions that this
+ * (MVCC) snapshot considers still in-progress or in the future.
+ * - decoding_ctx - logical decoding context, to capture concurrent data
+ * changes.
*
* Output parameters:
* - *xid_cutoff - rel's new relfrozenxid value, may be invalid
@@ -1639,6 +1646,8 @@ table_relation_copy_for_cluster(Relation OldTable, Relation NewTable,
Relation OldIndex,
bool use_sort,
TransactionId OldestXmin,
+ Snapshot snapshot,
+ LogicalDecodingContext *decoding_ctx,
TransactionId *xid_cutoff,
MultiXactId *multi_cutoff,
double *num_tuples,
@@ -1647,6 +1656,7 @@ table_relation_copy_for_cluster(Relation OldTable, Relation NewTable,
{
OldTable->rd_tableam->relation_copy_for_cluster(OldTable, NewTable, OldIndex,
use_sort, OldestXmin,
+ snapshot, decoding_ctx,
xid_cutoff, multi_cutoff,
num_tuples, tups_vacuumed,
tups_recently_dead);
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 4daa8bef5ee..66431cc19e5 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -100,6 +100,9 @@ extern Oid index_concurrently_create_copy(Relation heapRelation,
Oid tablespaceOid,
const char *newName);
+extern NullableDatum *get_index_stattargets(Oid indexid,
+ IndexInfo *indInfo);
+
extern void index_concurrently_build(Oid heapRelationId,
Oid indexRelationId);
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 3be57c97b3f..0a7e72bc74a 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -13,10 +13,15 @@
#ifndef CLUSTER_H
#define CLUSTER_H
+#include "nodes/execnodes.h"
#include "nodes/parsenodes.h"
#include "parser/parse_node.h"
+#include "replication/logical.h"
#include "storage/lock.h"
+#include "storage/relfilelocator.h"
#include "utils/relcache.h"
+#include "utils/resowner.h"
+#include "utils/tuplestore.h"
/* flag bits for ClusterParams->options */
@@ -24,6 +29,7 @@
#define CLUOPT_RECHECK 0x02 /* recheck relation state */
#define CLUOPT_RECHECK_ISCLUSTERED 0x04 /* recheck relation state for
* indisclustered */
+#define CLUOPT_CONCURRENT 0x08 /* allow concurrent data changes */
/* options for CLUSTER */
typedef struct ClusterParams
@@ -46,13 +52,89 @@ typedef enum ClusterCommand
CLUSTER_COMMAND_VACUUM
} ClusterCommand;
+/*
+ * The following definitions are used by REPACK CONCURRENTLY.
+ */
+
+extern RelFileLocator repacked_rel_locator;
+extern RelFileLocator repacked_rel_toast_locator;
+
+typedef enum
+{
+ CHANGE_INSERT,
+ CHANGE_UPDATE_OLD,
+ CHANGE_UPDATE_NEW,
+ CHANGE_DELETE,
+ CHANGE_TRUNCATE
+} ConcurrentChangeKind;
+
+typedef struct ConcurrentChange
+{
+ /* See the enum above. */
+ ConcurrentChangeKind kind;
+
+ /*
+ * The actual tuple.
+ *
+ * The tuple data follows the ConcurrentChange structure. Before use make
+ * sure the tuple is correctly aligned (ConcurrentChange can be stored as
+ * bytea) and that tuple->t_data is fixed.
+ */
+ HeapTupleData tup_data;
+} ConcurrentChange;
+
+#define SizeOfConcurrentChange (offsetof(ConcurrentChange, tup_data) + \
+ sizeof(HeapTupleData))
+
+/*
+ * Logical decoding state.
+ *
+ * Here we store the data changes that we decode from WAL while the table
+ * contents are being copied to the new storage. The metadata needed to
+ * apply these changes to the table is also stored here.
+ */
+typedef struct RepackDecodingState
+{
+ /* The relation whose changes we're decoding. */
+ Oid relid;
+
+ /*
+ * Decoded changes are stored here. Although we try to avoid excessive
+ * batches, it can happen that the changes need to be stored to disk. The
+ * tuplestore does this transparently.
+ */
+ Tuplestorestate *tstore;
+
+ /* The current number of changes in tstore. */
+ double nchanges;
+
+ /*
+ * Descriptor to store the ConcurrentChange structure serialized (bytea).
+ * We can't store the tuple directly because tuplestore only supports
+ * minimal tuples and we may need to transfer the OID system column from
+ * the output plugin. Also we need to transfer the change kind, so it's better
+ * to put everything in the structure than to use 2 tuplestores "in
+ * parallel".
+ */
+ TupleDesc tupdesc_change;
+
+ /* Tuple descriptor needed to update indexes. */
+ TupleDesc tupdesc;
+
+ /* Slot to retrieve data from tstore. */
+ TupleTableSlot *tsslot;
+
+ ResourceOwner resowner;
+} RepackDecodingState;
+
extern void cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel);
extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
- ClusterCommand cmd);
+ ClusterCommand cmd, bool isTopLevel);
extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid,
LOCKMODE lockmode);
extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
-
+extern void repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
+ XLogRecPtr end_of_wal);
extern Oid make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
char relpersistence, LOCKMODE lockmode);
extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
@@ -60,6 +142,7 @@ extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
bool swap_toast_by_content,
bool check_constraints,
bool is_internal,
+ bool reindex,
TransactionId frozenXid,
MultiXactId cutoffMulti,
char newrelpersistence);
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index f92ff524031..4cbf4d16529 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -59,18 +59,20 @@
/*
* Progress parameters for REPACK.
*
- * Note: Since REPACK shares some code with CLUSTER, these values are also
- * used by CLUSTER. (CLUSTER is now deprecated, so it makes little sense to
- * introduce a separate set of constants.)
+ * Note: Since REPACK shares some code with CLUSTER, (some of) these values
+ * are also used by CLUSTER. (CLUSTER is now deprecated, so it makes little
+ * sense to introduce a separate set of constants.)
*/
#define PROGRESS_REPACK_COMMAND 0
#define PROGRESS_REPACK_PHASE 1
#define PROGRESS_REPACK_INDEX_RELID 2
#define PROGRESS_REPACK_HEAP_TUPLES_SCANNED 3
-#define PROGRESS_REPACK_HEAP_TUPLES_WRITTEN 4
-#define PROGRESS_REPACK_TOTAL_HEAP_BLKS 5
-#define PROGRESS_REPACK_HEAP_BLKS_SCANNED 6
-#define PROGRESS_REPACK_INDEX_REBUILD_COUNT 7
+#define PROGRESS_REPACK_HEAP_TUPLES_INSERTED 4
+#define PROGRESS_REPACK_HEAP_TUPLES_UPDATED 5
+#define PROGRESS_REPACK_HEAP_TUPLES_DELETED 6
+#define PROGRESS_REPACK_TOTAL_HEAP_BLKS 7
+#define PROGRESS_REPACK_HEAP_BLKS_SCANNED 8
+#define PROGRESS_REPACK_INDEX_REBUILD_COUNT 9
/*
* Phases of repack (as advertised via PROGRESS_REPACK_PHASE).
@@ -83,9 +85,10 @@
#define PROGRESS_REPACK_PHASE_INDEX_SCAN_HEAP 2
#define PROGRESS_REPACK_PHASE_SORT_TUPLES 3
#define PROGRESS_REPACK_PHASE_WRITE_NEW_HEAP 4
-#define PROGRESS_REPACK_PHASE_SWAP_REL_FILES 5
-#define PROGRESS_REPACK_PHASE_REBUILD_INDEX 6
-#define PROGRESS_REPACK_PHASE_FINAL_CLEANUP 7
+#define PROGRESS_REPACK_PHASE_CATCH_UP 5
+#define PROGRESS_REPACK_PHASE_SWAP_REL_FILES 6
+#define PROGRESS_REPACK_PHASE_REBUILD_INDEX 7
+#define PROGRESS_REPACK_PHASE_FINAL_CLEANUP 8
/*
* Commands of PROGRESS_REPACK
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 648484205cb..d12827f4b5e 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3933,6 +3933,7 @@ typedef struct RepackStmt
RangeVar *relation; /* relation being repacked */
char *indexname; /* order tuples by this index */
List *params; /* list of DefElem nodes */
+ bool concurrent; /* allow concurrent access? */
} RepackStmt;
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 6d4d2d1814c..802fc4b0823 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -73,6 +73,7 @@ extern void FreeSnapshotBuilder(SnapBuild *builder);
extern void SnapBuildSnapDecRefcount(Snapshot snap);
extern Snapshot SnapBuildInitialSnapshot(SnapBuild *builder);
+extern Snapshot SnapBuildInitialSnapshotForRepack(SnapBuild *builder);
extern Snapshot SnapBuildMVCCFromHistoric(Snapshot snapshot, bool in_place);
extern const char *SnapBuildExportSnapshot(SnapBuild *builder);
extern void SnapBuildClearExportedSnapshot(void);
diff --git a/src/include/storage/lockdefs.h b/src/include/storage/lockdefs.h
index 7f3ba0352f6..2739327b0da 100644
--- a/src/include/storage/lockdefs.h
+++ b/src/include/storage/lockdefs.h
@@ -36,8 +36,8 @@ typedef int LOCKMODE;
#define AccessShareLock 1 /* SELECT */
#define RowShareLock 2 /* SELECT FOR UPDATE/FOR SHARE */
#define RowExclusiveLock 3 /* INSERT, UPDATE, DELETE */
-#define ShareUpdateExclusiveLock 4 /* VACUUM (non-FULL), ANALYZE, CREATE
- * INDEX CONCURRENTLY */
+#define ShareUpdateExclusiveLock 4 /* VACUUM (non-exclusive), ANALYZE, CREATE
+ * INDEX CONCURRENTLY, REPACK CONCURRENTLY */
#define ShareLock 5 /* CREATE INDEX (WITHOUT CONCURRENTLY) */
#define ShareRowExclusiveLock 6 /* like EXCLUSIVE MODE, but allows ROW
* SHARE */
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index a9681738146..9bb2f7ae1a8 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -84,3 +84,4 @@ PG_LWLOCK(50, DSMRegistry)
PG_LWLOCK(51, InjectionPoint)
PG_LWLOCK(52, SerialControl)
PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, RepackedRels)
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index 147b190210a..5eeabdc6c4f 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -61,6 +61,8 @@ extern Snapshot GetLatestSnapshot(void);
extern void SnapshotSetCommandId(CommandId curcid);
extern Snapshot CopySnapshot(Snapshot snapshot);
+extern void FreeSnapshot(Snapshot snapshot);
+
extern Snapshot GetCatalogSnapshot(Oid relid);
extern Snapshot GetNonHistoricCatalogSnapshot(Oid relid);
extern void InvalidateCatalogSnapshot(void);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 328235044d9..ebaf8fdd268 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1990,17 +1990,17 @@ pg_stat_progress_cluster| SELECT s.pid,
WHEN 2 THEN 'index scanning heap'::text
WHEN 3 THEN 'sorting tuples'::text
WHEN 4 THEN 'writing new heap'::text
- WHEN 5 THEN 'swapping relation files'::text
- WHEN 6 THEN 'rebuilding index'::text
- WHEN 7 THEN 'performing final cleanup'::text
+ WHEN 6 THEN 'swapping relation files'::text
+ WHEN 7 THEN 'rebuilding index'::text
+ WHEN 8 THEN 'performing final cleanup'::text
ELSE NULL::text
END AS phase,
(s.param3)::oid AS cluster_index_relid,
s.param4 AS heap_tuples_scanned,
s.param5 AS heap_tuples_written,
- s.param6 AS heap_blks_total,
- s.param7 AS heap_blks_scanned,
- s.param8 AS index_rebuild_count
+ s.param8 AS heap_blks_total,
+ s.param9 AS heap_blks_scanned,
+ s.param10 AS index_rebuild_count
FROM (pg_stat_get_progress_info('CLUSTER'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_progress_copy| SELECT s.pid,
@@ -2072,17 +2072,20 @@ pg_stat_progress_repack| SELECT s.pid,
WHEN 2 THEN 'index scanning heap'::text
WHEN 3 THEN 'sorting tuples'::text
WHEN 4 THEN 'writing new heap'::text
- WHEN 5 THEN 'swapping relation files'::text
- WHEN 6 THEN 'rebuilding index'::text
- WHEN 7 THEN 'performing final cleanup'::text
+ WHEN 5 THEN 'catch-up'::text
+ WHEN 6 THEN 'swapping relation files'::text
+ WHEN 7 THEN 'rebuilding index'::text
+ WHEN 8 THEN 'performing final cleanup'::text
ELSE NULL::text
END AS phase,
(s.param3)::oid AS repack_index_relid,
s.param4 AS heap_tuples_scanned,
- s.param5 AS heap_tuples_written,
- s.param6 AS heap_blks_total,
- s.param7 AS heap_blks_scanned,
- s.param8 AS index_rebuild_count
+ s.param5 AS heap_tuples_inserted,
+ s.param6 AS heap_tuples_updated,
+ s.param7 AS heap_tuples_deleted,
+ s.param8 AS heap_blks_total,
+ s.param9 AS heap_blks_scanned,
+ s.param10 AS index_rebuild_count
FROM (pg_stat_get_progress_info('REPACK'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_progress_vacuum| SELECT s.pid,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index bc2176b62ec..6bbc8b419f8 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -486,6 +486,8 @@ CompressFileHandle
CompressionLocation
CompressorState
ComputeXidHorizonsResult
+ConcurrentChange
+ConcurrentChangeKind
ConditionVariable
ConditionVariableMinimallyPadded
ConditionalStack
@@ -1252,6 +1254,7 @@ IndexElem
IndexFetchHeapData
IndexFetchTableData
IndexInfo
+IndexInsertState
IndexList
IndexOnlyScan
IndexOnlyScanState
@@ -2526,6 +2529,7 @@ ReorderBufferTupleCidKey
ReorderBufferUpdateProgressTxnCB
ReorderTuple
RepOriginId
+RepackDecodingState
RepackStmt
ReparameterizeForeignPathByChild_function
ReplaceVarsFromTargetList_context
--
2.43.5
v13-0005-Add-regression-tests.patch
From 8d09e1f40499c72c0758596904eff2455fc8e6fb Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Fri, 11 Apr 2025 11:13:14 +0200
Subject: [PATCH 5/7] Add regression tests.
As this patch series adds the CONCURRENTLY option to the REPACK command, it's
appropriate to test that the "concurrent data changes" (i.e. changes done by
the application while we are copying the table contents to the new storage) are
processed correctly.
Injection points are used to stop the data copying at some point. While the
backend in charge of the copying is waiting on the injection point, another
backend runs some INSERT, UPDATE and DELETE commands on the table. Then we
wake up the first backend and let the REPACK CONCURRENTLY command
finish. Finally we check that all the "concurrent data changes" are present in
the table and that they contain the correct visibility information.
---
src/backend/commands/cluster.c | 7 +
src/test/modules/injection_points/Makefile | 3 +-
.../injection_points/expected/repack.out | 113 ++++++++++++++
.../modules/injection_points/logical.conf | 1 +
src/test/modules/injection_points/meson.build | 4 +
.../injection_points/specs/repack.spec | 143 ++++++++++++++++++
6 files changed, 270 insertions(+), 1 deletion(-)
create mode 100644 src/test/modules/injection_points/expected/repack.out
create mode 100644 src/test/modules/injection_points/logical.conf
create mode 100644 src/test/modules/injection_points/specs/repack.spec
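The choreography this commit message describes — pause the copier at an injection point, run concurrent DML in another session, wake the copier up, then verify both views of the table agree — can be modeled outside PostgreSQL. The following Python sketch is only an analogy of that test flow; `ToyRepack`, the events, and the change log are invented for illustration and correspond to nothing in the patch itself:

```python
import threading

# Toy model of the test choreography: a "copier" copies the table, pauses at
# an injection point, a "writer" performs concurrent DML that is recorded in
# a change log, and after wakeup the copier replays the log into the new heap.
class ToyRepack:
    def __init__(self, table):
        self.old_heap = dict(table)       # i -> j
        self.new_heap = {}
        self.change_log = []              # (kind, key, value) tuples
        self.paused = threading.Event()   # copier reached the injection point
        self.wakeup = threading.Event()   # test resumes the copier

    def copier(self):
        self.new_heap = dict(self.old_heap)  # initial bulk copy
        self.paused.set()                    # "repack-concurrently-before-lock"
        self.wakeup.wait()
        for kind, key, value in self.change_log:  # replay concurrent changes
            if kind == 'delete':
                self.new_heap.pop(key, None)
            else:                                 # insert or update
                self.new_heap[key] = value

    def dml(self, kind, key, value=None):
        # Writer path: apply to the "old" heap and log for the copier.
        if kind == 'delete':
            self.old_heap.pop(key, None)
        else:
            self.old_heap[key] = value
        self.change_log.append((kind, key, value))

r = ToyRepack({1: 1, 2: 2, 3: 3, 4: 4})
t = threading.Thread(target=r.copier)
t.start()
r.paused.wait()                  # like the isolation tester's <waiting ...>
r.dml('update', 2, 20)           # concurrent changes while the copier waits
r.dml('insert', 5, 5)
r.dml('delete', 4)
r.wakeup.set()                   # injection_points_wakeup()
t.join()
assert r.new_heap == r.old_heap  # both "sessions" see identical contents
```

The real test does the equivalence check in SQL with a FULL JOIN of the two sessions' snapshots, which is what `check1` below verifies.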
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 592909f453f..058b750a0ed 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -58,6 +58,7 @@
#include "utils/acl.h"
#include "utils/fmgroids.h"
#include "utils/guc.h"
+#include "utils/injection_point.h"
#include "utils/inval.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
@@ -3005,6 +3006,12 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
*/
ident_key = build_identity_key(ident_idx_new, OldHeap, &ident_key_nentries);
+ /*
+ * During testing, wait for another backend to perform concurrent data
+ * changes which we will process below.
+ */
+ INJECTION_POINT("repack-concurrently-before-lock");
+
/*
* Flush all WAL records inserted so far (possibly except for the last
* incomplete page, see GetInsertRecPtr), to minimize the amount of data
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index e680991f8d4..405d0811b4f 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -14,7 +14,8 @@ PGFILEDESC = "injection_points - facility for injection points"
REGRESS = injection_points hashagg reindex_conc
REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
-ISOLATION = basic inplace syscache-update-pruned
+ISOLATION = basic inplace syscache-update-pruned repack
+ISOLATION_OPTS = --temp-config $(top_srcdir)/src/test/modules/injection_points/logical.conf
TAP_TESTS = 1
diff --git a/src/test/modules/injection_points/expected/repack.out b/src/test/modules/injection_points/expected/repack.out
new file mode 100644
index 00000000000..f919087ca5b
--- /dev/null
+++ b/src/test/modules/injection_points/expected/repack.out
@@ -0,0 +1,113 @@
+Parsed test spec with 2 sessions
+
+starting permutation: wait_before_lock change_existing change_new change_subxact1 change_subxact2 check2 wakeup_before_lock check1
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step wait_before_lock:
+ REPACK CONCURRENTLY repack_test USING INDEX repack_test_pkey;
+ <waiting ...>
+step change_existing:
+ UPDATE repack_test SET i=10 where i=1;
+ UPDATE repack_test SET j=20 where i=2;
+ UPDATE repack_test SET i=30 where i=3;
+ UPDATE repack_test SET i=40 where i=30;
+ DELETE FROM repack_test WHERE i=4;
+
+step change_new:
+ INSERT INTO repack_test(i, j) VALUES (5, 5), (6, 6), (7, 7), (8, 8);
+ UPDATE repack_test SET i=50 where i=5;
+ UPDATE repack_test SET j=60 where i=6;
+ DELETE FROM repack_test WHERE i=7;
+
+step change_subxact1:
+ BEGIN;
+ INSERT INTO repack_test(i, j) VALUES (100, 100);
+ SAVEPOINT s1;
+ UPDATE repack_test SET i=101 where i=100;
+ SAVEPOINT s2;
+ UPDATE repack_test SET i=102 where i=101;
+ COMMIT;
+
+step change_subxact2:
+ BEGIN;
+ SAVEPOINT s1;
+ INSERT INTO repack_test(i, j) VALUES (110, 110);
+ ROLLBACK TO SAVEPOINT s1;
+ INSERT INTO repack_test(i, j) VALUES (110, 111);
+ COMMIT;
+
+step check2:
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='repack_test';
+
+ SELECT i, j FROM repack_test ORDER BY i, j;
+
+ INSERT INTO data_s2(i, j)
+ SELECT i, j FROM repack_test;
+
+ i| j
+---+---
+ 2| 20
+ 6| 60
+ 8| 8
+ 10| 1
+ 40| 3
+ 50| 5
+102|100
+110|111
+(8 rows)
+
+step wakeup_before_lock:
+ SELECT injection_points_wakeup('repack-concurrently-before-lock');
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step wait_before_lock: <... completed>
+step check1:
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='repack_test';
+
+ SELECT count(DISTINCT node) FROM relfilenodes;
+
+ SELECT i, j FROM repack_test ORDER BY i, j;
+
+ INSERT INTO data_s1(i, j)
+ SELECT i, j FROM repack_test;
+
+ SELECT count(*)
+ FROM data_s1 d1 FULL JOIN data_s2 d2 USING (i, j)
+ WHERE d1.i ISNULL OR d2.i ISNULL;
+
+count
+-----
+ 2
+(1 row)
+
+ i| j
+---+---
+ 2| 20
+ 6| 60
+ 8| 8
+ 10| 1
+ 40| 3
+ 50| 5
+102|100
+110|111
+(8 rows)
+
+count
+-----
+ 0
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
diff --git a/src/test/modules/injection_points/logical.conf b/src/test/modules/injection_points/logical.conf
new file mode 100644
index 00000000000..c8f264bc6cb
--- /dev/null
+++ b/src/test/modules/injection_points/logical.conf
@@ -0,0 +1 @@
+wal_level = logical
\ No newline at end of file
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index d61149712fd..0e3c47ba999 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -46,9 +46,13 @@ tests += {
'specs': [
'basic',
'inplace',
+ 'repack',
'syscache-update-pruned',
],
'runningcheck': false, # see syscache-update-pruned
+ # 'repack' requires wal_level = 'logical'.
+ 'regress_args': ['--temp-config', files('logical.conf')],
+
},
'tap': {
'env': {
diff --git a/src/test/modules/injection_points/specs/repack.spec b/src/test/modules/injection_points/specs/repack.spec
new file mode 100644
index 00000000000..a17064462ce
--- /dev/null
+++ b/src/test/modules/injection_points/specs/repack.spec
@@ -0,0 +1,143 @@
+# Prefix the system columns with underscore as they are not allowed as column
+# names.
+setup
+{
+ CREATE EXTENSION injection_points;
+
+ CREATE TABLE repack_test(i int PRIMARY KEY, j int);
+ INSERT INTO repack_test(i, j) VALUES (1, 1), (2, 2), (3, 3), (4, 4);
+
+ CREATE TABLE relfilenodes(node oid);
+
+ CREATE TABLE data_s1(i int, j int);
+ CREATE TABLE data_s2(i int, j int);
+}
+
+teardown
+{
+ DROP TABLE repack_test;
+ DROP EXTENSION injection_points;
+
+ DROP TABLE relfilenodes;
+ DROP TABLE data_s1;
+ DROP TABLE data_s2;
+}
+
+session s1
+setup
+{
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('repack-concurrently-before-lock', 'wait');
+}
+# Perform the initial load and wait for s2 to do some data changes.
+step wait_before_lock
+{
+ REPACK CONCURRENTLY repack_test USING INDEX repack_test_pkey;
+}
+# Check the table from the perspective of s1.
+#
+# Besides the contents, we also check that relfilenode has changed.
+
+# Have each session write the contents into a table and use FULL JOIN to check
+# if the outputs are identical.
+step check1
+{
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='repack_test';
+
+ SELECT count(DISTINCT node) FROM relfilenodes;
+
+ SELECT i, j FROM repack_test ORDER BY i, j;
+
+ INSERT INTO data_s1(i, j)
+ SELECT i, j FROM repack_test;
+
+ SELECT count(*)
+ FROM data_s1 d1 FULL JOIN data_s2 d2 USING (i, j)
+ WHERE d1.i ISNULL OR d2.i ISNULL;
+}
+teardown
+{
+ SELECT injection_points_detach('repack-concurrently-before-lock');
+}
+
+session s2
+# Change the existing data. UPDATE changes both key and non-key columns. Also
+# update one row twice to test whether tuple version generated by this session
+# can be found.
+step change_existing
+{
+ UPDATE repack_test SET i=10 where i=1;
+ UPDATE repack_test SET j=20 where i=2;
+ UPDATE repack_test SET i=30 where i=3;
+ UPDATE repack_test SET i=40 where i=30;
+ DELETE FROM repack_test WHERE i=4;
+}
+# Insert new rows and UPDATE / DELETE some of them. Again, update both key and
+# non-key column.
+step change_new
+{
+ INSERT INTO repack_test(i, j) VALUES (5, 5), (6, 6), (7, 7), (8, 8);
+ UPDATE repack_test SET i=50 where i=5;
+ UPDATE repack_test SET j=60 where i=6;
+ DELETE FROM repack_test WHERE i=7;
+}
+
+# When applying concurrent data changes, we should see the effects of an
+# in-progress subtransaction.
+#
+# XXX Not sure this test is useful now - it was designed for the patch that
+# preserves tuple visibility and which therefore modifies
+# TransactionIdIsCurrentTransactionId().
+step change_subxact1
+{
+ BEGIN;
+ INSERT INTO repack_test(i, j) VALUES (100, 100);
+ SAVEPOINT s1;
+ UPDATE repack_test SET i=101 where i=100;
+ SAVEPOINT s2;
+ UPDATE repack_test SET i=102 where i=101;
+ COMMIT;
+}
+
+# When applying concurrent data changes, we should not see the effects of a
+# rolled back subtransaction.
+#
+# XXX Is this test useful? See above.
+step change_subxact2
+{
+ BEGIN;
+ SAVEPOINT s1;
+ INSERT INTO repack_test(i, j) VALUES (110, 110);
+ ROLLBACK TO SAVEPOINT s1;
+ INSERT INTO repack_test(i, j) VALUES (110, 111);
+ COMMIT;
+}
+
+# Check the table from the perspective of s2.
+step check2
+{
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='repack_test';
+
+ SELECT i, j FROM repack_test ORDER BY i, j;
+
+ INSERT INTO data_s2(i, j)
+ SELECT i, j FROM repack_test;
+}
+step wakeup_before_lock
+{
+ SELECT injection_points_wakeup('repack-concurrently-before-lock');
+}
+
+# Test if data changes introduced while one session is performing REPACK
+# CONCURRENTLY find their way into the table.
+permutation
+ wait_before_lock
+ change_existing
+ change_new
+ change_subxact1
+ change_subxact2
+ check2
+ wakeup_before_lock
+ check1
--
2.43.5
v13-0006-Introduce-repack_max_xlock_time-configuration-variab.patch
From aa65471a23e0f6312232a17b5caabb78da3fc35a Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Fri, 11 Apr 2025 11:13:14 +0200
Subject: [PATCH 6/7] Introduce repack_max_xlock_time configuration variable.
When executing REPACK CONCURRENTLY, we need an AccessExclusiveLock to swap
the relation files, and that should only require a short time. However, on a
busy system, other backends might change a non-negligible amount of data in
the table while we are waiting for the lock. Since these changes must be
applied to the new storage before the swap, the time we eventually hold the
lock might become non-negligible too.
If the user is worried about this situation, they can set
repack_max_xlock_time to the maximum time for which the exclusive lock may be
held. If this amount of time is not sufficient to complete the REPACK
CONCURRENTLY command, an ERROR is raised and the command is canceled.
---
doc/src/sgml/config.sgml | 31 ++++
doc/src/sgml/ref/repack.sgml | 5 +-
src/backend/access/heap/heapam_handler.c | 3 +-
src/backend/commands/cluster.c | 135 +++++++++++++++---
src/backend/utils/misc/guc_tables.c | 15 +-
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/commands/cluster.h | 5 +-
.../injection_points/expected/repack.out | 74 +++++++++-
.../injection_points/specs/repack.spec | 42 ++++++
9 files changed, 290 insertions(+), 21 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index c1674c22cb2..c8119c8fd81 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -11234,6 +11234,37 @@ dynamic_library_path = '/usr/local/lib/postgresql:$libdir'
</listitem>
</varlistentry>
+ <varlistentry id="guc-repack-max-xclock-time" xreflabel="repack_max_xlock_time">
+ <term><varname>repack_max_xlock_time</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>repack_max_xlock_time</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ This is the maximum amount of time to hold an exclusive lock on a
+ table by <command>REPACK</command> with
+ the <literal>CONCURRENTLY</literal> option. Typically, these commands
+ should not need the lock for a longer time
+ than <command>TRUNCATE</command> does. However, additional time might
+ be needed if the system is too busy. (See <xref linkend="sql-repack"/>
+ for an explanation of how the <literal>CONCURRENTLY</literal> option works.)
+ </para>
+
+ <para>
+ If you want to restrict the lock time, set this variable to the
+ highest acceptable value. If it turns out during processing that the
+ lock cannot be released within this time, the command will be
+ cancelled.
+ </para>
+
+ <para>
+ The default value is 0, which means that the lock is not released
+ until the concurrent data changes are processed.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect1>
diff --git a/doc/src/sgml/ref/repack.sgml b/doc/src/sgml/ref/repack.sgml
index c837e4614f3..7e44fa636ac 100644
--- a/doc/src/sgml/ref/repack.sgml
+++ b/doc/src/sgml/ref/repack.sgml
@@ -195,7 +195,10 @@ INDEX <replaceable class="parameter">index_name</replaceable> ]
too many data changes have been done to the table while
<command>REPACK</command> was waiting for the lock: those changes must
be processed just before the files are swapped, while the
- <literal>ACCESS EXCLUSIVE</literal> lock is being held.
+ <literal>ACCESS EXCLUSIVE</literal> lock is being held. If you are
+ worried about this situation, set
+ the <link linkend="guc-repack-max-xclock-time"><varname>repack_max_xlock_time</varname></link>
+ configuration parameter to a value that your applications can tolerate.
</para>
<para>
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 9d55004305f..19d2b33f88f 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -986,7 +986,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
end_of_wal = GetFlushRecPtr(NULL);
if ((end_of_wal - end_of_wal_prev) > wal_segment_size)
{
- repack_decode_concurrent_changes(decoding_ctx, end_of_wal);
+ repack_decode_concurrent_changes(decoding_ctx, end_of_wal,
+ NULL);
end_of_wal_prev = end_of_wal;
}
}
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 058b750a0ed..cc765b88d52 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -17,6 +17,8 @@
*/
#include "postgres.h"
+#include <sys/time.h>
+
#include "access/amapi.h"
#include "access/heapam.h"
#include "access/multixact.h"
@@ -89,6 +91,15 @@ typedef struct
RelFileLocator repacked_rel_locator = {.relNumber = InvalidOid};
RelFileLocator repacked_rel_toast_locator = {.relNumber = InvalidOid};
+/*
+ * The maximum time to hold AccessExclusiveLock during the final
+ * processing. Note that only the execution time of
+ * process_concurrent_changes() is included here. The very last steps like
+ * swap_relation_files() shouldn't get blocked and it'd be wrong to consider
+ * them a reason to abort otherwise completed processing.
+ */
+int repack_max_xlock_time = 0;
+
/*
* Everything we need to call ExecInsertIndexTuples().
*/
@@ -132,7 +143,8 @@ static LogicalDecodingContext *setup_logical_decoding(Oid relid,
static HeapTuple get_changed_tuple(char *change);
static void apply_concurrent_changes(RepackDecodingState *dstate,
Relation rel, ScanKey key, int nkeys,
- IndexInsertState *iistate);
+ IndexInsertState *iistate,
+ struct timeval *must_complete);
static void apply_concurrent_insert(Relation rel, ConcurrentChange *change,
HeapTuple tup, IndexInsertState *iistate,
TupleTableSlot *index_slot);
@@ -148,13 +160,15 @@ static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
IndexInsertState *iistate,
TupleTableSlot *ident_slot,
IndexScanDesc *scan_p);
-static void process_concurrent_changes(LogicalDecodingContext *ctx,
+static bool process_concurrent_changes(LogicalDecodingContext *ctx,
XLogRecPtr end_of_wal,
Relation rel_dst,
Relation rel_src,
ScanKey ident_key,
int ident_key_nentries,
- IndexInsertState *iistate);
+ IndexInsertState *iistate,
+ struct timeval *must_complete);
+static bool processing_time_elapsed(struct timeval *must_complete);
static IndexInsertState *get_index_insert_state(Relation relation,
Oid ident_index_id);
static ScanKey build_identity_key(Oid ident_idx_oid, Relation rel_src,
@@ -2351,7 +2365,8 @@ get_changed_tuple(char *change)
*/
void
repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
- XLogRecPtr end_of_wal)
+ XLogRecPtr end_of_wal,
+ struct timeval *must_complete)
{
RepackDecodingState *dstate;
ResourceOwner resowner_old;
@@ -2381,6 +2396,9 @@ repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
if (record != NULL)
LogicalDecodingProcessRecord(ctx, ctx->reader);
+ if (processing_time_elapsed(must_complete))
+ break;
+
/*
* If WAL segment boundary has been crossed, inform the decoding
* system that the catalog_xmin can advance. (We can confirm more
@@ -2421,7 +2439,8 @@ repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
*/
static void
apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
- ScanKey key, int nkeys, IndexInsertState *iistate)
+ ScanKey key, int nkeys, IndexInsertState *iistate,
+ struct timeval *must_complete)
{
TupleTableSlot *index_slot,
*ident_slot;
@@ -2451,6 +2470,9 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
CHECK_FOR_INTERRUPTS();
+ Assert(dstate->nchanges > 0);
+ dstate->nchanges--;
+
/* Get the change from the single-column tuple. */
tup_change = ExecFetchSlotHeapTuple(dstate->tsslot, false, &shouldFree);
heap_deform_tuple(tup_change, dstate->tupdesc_change, values, isnull);
@@ -2551,10 +2573,22 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
/* TTSOpsMinimalTuple has .get_heap_tuple==NULL. */
Assert(shouldFree);
pfree(tup_change);
+
+ /*
+ * If there is a limit on the time of completion, check it now.
+ * However, make sure the loop does not break if tup_old was set in
+ * the previous iteration. In such a case we could not resume the
+ * processing in the next call.
+ */
+ if (must_complete && tup_old == NULL &&
+ processing_time_elapsed(must_complete))
+ /* The next call will process the remaining changes. */
+ break;
}
- tuplestore_clear(dstate->tstore);
- dstate->nchanges = 0;
+ /* If we could not apply all the changes, the next call will finish them. */
+ if (dstate->nchanges == 0)
+ tuplestore_clear(dstate->tstore);
/* Cleanup. */
ExecDropSingleTupleTableSlot(index_slot);
@@ -2736,11 +2770,15 @@ find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
* Decode and apply concurrent changes.
*
* Pass rel_src iff its reltoastrelid is needed.
+ *
+ * Returns true if must_complete is NULL or if we managed to complete by the
+ * time *must_complete indicates.
*/
-static void
+static bool
process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
Relation rel_dst, Relation rel_src, ScanKey ident_key,
- int ident_key_nentries, IndexInsertState *iistate)
+ int ident_key_nentries, IndexInsertState *iistate,
+ struct timeval *must_complete)
{
RepackDecodingState *dstate;
@@ -2749,10 +2787,19 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
dstate = (RepackDecodingState *) ctx->output_writer_private;
- repack_decode_concurrent_changes(ctx, end_of_wal);
+ repack_decode_concurrent_changes(ctx, end_of_wal, must_complete);
+ if (processing_time_elapsed(must_complete))
+ /* Caller is responsible for applying the changes. */
+ return false;
+
+ /*
+ * *must_complete not reached, so there are really no changes. (It's
+ * possible to see no changes just because not enough time was left for
+ * the decoding.)
+ */
if (dstate->nchanges == 0)
- return;
+ return true;
PG_TRY();
{
@@ -2764,7 +2811,7 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
rel_dst->rd_toastoid = rel_src->rd_rel->reltoastrelid;
apply_concurrent_changes(dstate, rel_dst, ident_key,
- ident_key_nentries, iistate);
+ ident_key_nentries, iistate, must_complete);
}
PG_FINALLY();
{
@@ -2772,6 +2819,28 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
rel_dst->rd_toastoid = InvalidOid;
}
PG_END_TRY();
+
+ /*
+ * apply_concurrent_changes() does check the processing time, so if some
+ * changes are left, we ran out of time.
+ */
+ return dstate->nchanges == 0;
+}
+
+/*
+ * Check if the current time is beyond *must_complete.
+ */
+static bool
+processing_time_elapsed(struct timeval *must_complete)
+{
+ struct timeval now;
+
+ if (must_complete == NULL)
+ return false;
+
+ gettimeofday(&now, NULL);
+
+ return timercmp(&now, must_complete, >);
}
static IndexInsertState *
@@ -2933,6 +3002,8 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
Relation *ind_refs,
*ind_refs_p;
int nind;
+ struct timeval t_end;
+ struct timeval *t_end_ptr = NULL;
/* Like in cluster_rel(). */
lockmode_old = ShareUpdateExclusiveLock;
@@ -3028,7 +3099,8 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
*/
process_concurrent_changes(ctx, end_of_wal, NewHeap,
swap_toast_by_content ? OldHeap : NULL,
- ident_key, ident_key_nentries, iistate);
+ ident_key, ident_key_nentries, iistate,
+ NULL);
/*
* Acquire AccessExclusiveLock on the table, its TOAST relation (if there
@@ -3124,9 +3196,40 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
end_of_wal = GetFlushRecPtr(NULL);
/* Apply the concurrent changes again. */
- process_concurrent_changes(ctx, end_of_wal, NewHeap,
- swap_toast_by_content ? OldHeap : NULL,
- ident_key, ident_key_nentries, iistate);
+
+ /*
+ * This time we have the exclusive lock on the table, so make sure that
+ * repack_max_xlock_time is not exceeded.
+ */
+ if (repack_max_xlock_time > 0)
+ {
+ int64 usec;
+ struct timeval t_start;
+
+ gettimeofday(&t_start, NULL);
+ /* Add the whole seconds. */
+ t_end.tv_sec = t_start.tv_sec + repack_max_xlock_time / 1000;
+ /* Add the rest, expressed in microseconds. */
+ usec = t_start.tv_usec + 1000 * (repack_max_xlock_time % 1000);
+ /* The number of microseconds could have overflowed. */
+ t_end.tv_sec += usec / USECS_PER_SEC;
+ t_end.tv_usec = usec % USECS_PER_SEC;
+ t_end_ptr = &t_end;
+ }
+
+ /*
+ * During testing, stop here to simulate excessive processing time.
+ */
+ INJECTION_POINT("repack-concurrently-after-lock");
+
+ if (!process_concurrent_changes(ctx, end_of_wal, NewHeap,
+ swap_toast_by_content ? OldHeap : NULL,
+ ident_key, ident_key_nentries, iistate,
+ t_end_ptr))
+ ereport(ERROR,
+ (errmsg("could not process concurrent data changes in time"),
+ errhint("Please consider adjusting \"repack_max_xlock_time\".")));
+
/* Remember info about rel before closing OldHeap */
relpersistence = OldHeap->rd_rel->relpersistence;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 60b12446a1c..fb5f1b5e11e 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -42,8 +42,9 @@
#include "catalog/namespace.h"
#include "catalog/storage.h"
#include "commands/async.h"
-#include "commands/extension.h"
+#include "commands/cluster.h"
#include "commands/event_trigger.h"
+#include "commands/extension.h"
#include "commands/tablespace.h"
#include "commands/trigger.h"
#include "commands/user.h"
@@ -2839,6 +2840,18 @@ struct config_int ConfigureNamesInt[] =
1600000000, 0, 2100000000,
NULL, NULL, NULL
},
+ {
+ {"repack_max_xlock_time", PGC_USERSET, LOCK_MANAGEMENT,
+ gettext_noop("Maximum time for REPACK CONCURRENTLY to keep table locked."),
+ gettext_noop("The table is locked in exclusive mode during the final stage of processing. "
+ "If the lock time exceeds this value, an error is raised and the lock is "
+ "released. Set to zero if you don't care how long the lock can be held."),
+ GUC_UNIT_MS
+ },
+ &repack_max_xlock_time,
+ 0, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
/*
* See also CheckRequiredParameterValues() if this parameter changes
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 34826d01380..b239c58cfc1 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -769,6 +769,7 @@ autovacuum_worker_slots = 16 # autovacuum worker slots to allocate
#lock_timeout = 0 # in milliseconds, 0 is disabled
#idle_in_transaction_session_timeout = 0 # in milliseconds, 0 is disabled
#idle_session_timeout = 0 # in milliseconds, 0 is disabled
+#repack_max_xlock_time = 0 # in milliseconds, 0 is disabled
#bytea_output = 'hex' # hex, escape
#xmlbinary = 'base64'
#xmloption = 'content'
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 0a7e72bc74a..4914f217267 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -59,6 +59,8 @@ typedef enum ClusterCommand
extern RelFileLocator repacked_rel_locator;
extern RelFileLocator repacked_rel_toast_locator;
+extern PGDLLIMPORT int repack_max_xlock_time;
+
typedef enum
{
CHANGE_INSERT,
@@ -134,7 +136,8 @@ extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid,
LOCKMODE lockmode);
extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
extern void repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
- XLogRecPtr end_of_wal);
+ XLogRecPtr end_of_wal,
+ struct timeval *must_complete);
extern Oid make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
char relpersistence, LOCKMODE lockmode);
extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
diff --git a/src/test/modules/injection_points/expected/repack.out b/src/test/modules/injection_points/expected/repack.out
index f919087ca5b..02967ed9d48 100644
--- a/src/test/modules/injection_points/expected/repack.out
+++ b/src/test/modules/injection_points/expected/repack.out
@@ -1,4 +1,4 @@
-Parsed test spec with 2 sessions
+Parsed test spec with 4 sessions
starting permutation: wait_before_lock change_existing change_new change_subxact1 change_subxact2 check2 wakeup_before_lock check1
injection_points_attach
@@ -111,3 +111,75 @@ injection_points_detach
(1 row)
+injection_points_detach
+-----------------------
+
+(1 row)
+
+
+starting permutation: wait_after_lock wakeup_after_lock
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step wait_after_lock:
+ REPACK CONCURRENTLY repack_test USING INDEX repack_test_pkey;
+ <waiting ...>
+step wakeup_after_lock:
+ SELECT injection_points_wakeup('repack-concurrently-after-lock');
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step wait_after_lock: <... completed>
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+
+starting permutation: wait_after_lock after_lock_delay wakeup_after_lock
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step wait_after_lock:
+ REPACK CONCURRENTLY repack_test USING INDEX repack_test_pkey;
+ <waiting ...>
+step after_lock_delay:
+ SELECT pg_sleep(1.5);
+
+pg_sleep
+--------
+
+(1 row)
+
+step wakeup_after_lock:
+ SELECT injection_points_wakeup('repack-concurrently-after-lock');
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step wait_after_lock: <... completed>
+ERROR: could not process concurrent data changes in time
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
diff --git a/src/test/modules/injection_points/specs/repack.spec b/src/test/modules/injection_points/specs/repack.spec
index a17064462ce..d0fa38dd8cd 100644
--- a/src/test/modules/injection_points/specs/repack.spec
+++ b/src/test/modules/injection_points/specs/repack.spec
@@ -130,6 +130,34 @@ step wakeup_before_lock
SELECT injection_points_wakeup('repack-concurrently-before-lock');
}
+session s3
+setup
+{
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('repack-concurrently-after-lock', 'wait');
+ SET repack_max_xlock_time TO '1s';
+}
+# Perform the initial load, lock the table in exclusive mode and wait. s4 will
+# cancel the waiting.
+step wait_after_lock
+{
+ REPACK CONCURRENTLY repack_test USING INDEX repack_test_pkey;
+}
+teardown
+{
+ SELECT injection_points_detach('repack-concurrently-after-lock');
+}
+
+session s4
+step wakeup_after_lock
+{
+ SELECT injection_points_wakeup('repack-concurrently-after-lock');
+}
+step after_lock_delay
+{
+ SELECT pg_sleep(1.5);
+}
+
# Test if data changes introduced while one session is performing REPACK
# CONCURRENTLY find their way into the table.
permutation
@@ -141,3 +169,17 @@ permutation
check2
wakeup_before_lock
check1
+
+# Test the repack_max_xlock_time configuration variable.
+#
+# First, cancel waiting on the injection point immediately. That way, REPACK
+# should complete.
+permutation
+ wait_after_lock
+ wakeup_after_lock
+# Second, cancel the waiting with a delay that violates
+# repack_max_xlock_time.
+permutation
+ wait_after_lock
+ after_lock_delay
+ wakeup_after_lock
--
2.43.5
v13-0007-Enable-logical-decoding-transiently-only-for-REPACK-.patch
From 3c0463b3037cc283a445024f45a7aa9103d7746e Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Fri, 11 Apr 2025 11:13:14 +0200
Subject: [PATCH 7/7] Enable logical decoding transiently, only for REPACK
CONCURRENTLY.
As REPACK CONCURRENTLY uses logical decoding, it requires wal_level to be set
to 'logical', while 'replica' is the default value. If logical replication is
not used, users will probably be reluctant to set the GUC to 'logical' because
it can affect server performance (by writing additional information to WAL)
and because it cannot be changed to 'logical' only for the time REPACK
CONCURRENTLY is running: a change of this GUC requires a server restart to take
effect.
This patch teaches the postgres backend to recognize whether it should
consider wal_level='logical' "locally" for a particular transaction, even if
the wal_level GUC is actually set to 'replica'. It also ensures that the
logical decoding specific information is added to WAL only for the tables
that are currently being processed by REPACK CONCURRENTLY.
If logical decoding is enabled this way, only temporary replication slots
should be created. The problem with a permanent slot is that it is restored
during server restart, and the restore fails if wal_level is not "globally"
'logical'.
There is independent work in progress to enable logical decoding transiently
[1]. ISTM that it is too "heavyweight" a solution for our problem. And I think
that these two approaches are not mutually exclusive: once [1] is committed,
we only need to adjust the XLogLogicalInfoActive() macro.
[1] https://www.postgresql.org/message-id/CAD21AoCVLeLYq09pQPaWs%2BJwdni5FuJ8v2jgq-u9_uFbcp6UbA%40mail.gmail.com
---
doc/src/sgml/ref/repack.sgml | 7 -
src/backend/access/transam/parallel.c | 8 +
src/backend/access/transam/xact.c | 106 ++++-
src/backend/access/transam/xlog.c | 1 +
src/backend/commands/cluster.c | 387 +++++++++++++++++-
src/backend/replication/logical/logical.c | 9 +-
src/backend/storage/ipc/ipci.c | 2 +
src/backend/storage/ipc/standby.c | 4 +-
src/backend/utils/cache/inval.c | 21 +
src/backend/utils/cache/relcache.c | 4 +
src/include/access/xlog.h | 15 +-
src/include/commands/cluster.h | 5 +
src/include/utils/inval.h | 2 +
src/include/utils/rel.h | 9 +-
src/test/modules/injection_points/Makefile | 1 -
.../modules/injection_points/logical.conf | 1 -
src/test/modules/injection_points/meson.build | 3 -
src/tools/pgindent/typedefs.list | 1 +
18 files changed, 540 insertions(+), 46 deletions(-)
delete mode 100644 src/test/modules/injection_points/logical.conf
diff --git a/doc/src/sgml/ref/repack.sgml b/doc/src/sgml/ref/repack.sgml
index 7e44fa636ac..28adb21738a 100644
--- a/doc/src/sgml/ref/repack.sgml
+++ b/doc/src/sgml/ref/repack.sgml
@@ -263,13 +263,6 @@ INDEX <replaceable class="parameter">index_name</replaceable> ]
</para>
</listitem>
- <listitem>
- <para>
- The <link linkend="guc-wal-level"><varname>wal_level</varname></link>
- configuration parameter is less than <literal>logical</literal>.
- </para>
- </listitem>
-
<listitem>
<para>
The <link linkend="guc-max-replication-slots"><varname>max_replication_slots</varname></link>
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 94db1ec3012..a33318ea7bd 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -98,6 +98,7 @@ typedef struct FixedParallelState
TimestampTz xact_ts;
TimestampTz stmt_ts;
SerializableXactHandle serializable_xact_handle;
+ int wal_level_transient;
/* Mutex protects remaining fields. */
slock_t mutex;
@@ -355,6 +356,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
fps->xact_ts = GetCurrentTransactionStartTimestamp();
fps->stmt_ts = GetCurrentStatementStartTimestamp();
fps->serializable_xact_handle = ShareSerializableXact();
+ fps->wal_level_transient = wal_level_transient;
SpinLockInit(&fps->mutex);
fps->last_xlog_end = 0;
shm_toc_insert(pcxt->toc, PARALLEL_KEY_FIXED, fps);
@@ -1550,6 +1552,12 @@ ParallelWorkerMain(Datum main_arg)
/* Attach to the leader's serializable transaction, if SERIALIZABLE. */
AttachSerializableXact(fps->serializable_xact_handle);
+ /*
+ * Restore the information whether this worker should behave as if
+ * wal_level was WAL_LEVEL_LOGICAL.
+ */
+ wal_level_transient = fps->wal_level_transient;
+
/*
* We've initialized all of our state now; nothing should change
* hereafter.
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 23f2de587a1..be568f70961 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -36,6 +36,7 @@
#include "catalog/pg_enum.h"
#include "catalog/storage.h"
#include "commands/async.h"
+#include "commands/cluster.h"
#include "commands/tablecmds.h"
#include "commands/trigger.h"
#include "common/pg_prng.h"
@@ -126,6 +127,12 @@ static FullTransactionId XactTopFullTransactionId = {InvalidTransactionId};
static int nParallelCurrentXids = 0;
static TransactionId *ParallelCurrentXids;
+/*
+ * Have we determined the value of wal_level_transient for the current
+ * transaction?
+ */
+static bool wal_level_transient_checked = false;
+
/*
* Miscellaneous flag bits to record events which occur on the top level
* transaction. These flags are only persisted in MyXactFlags and are intended
@@ -638,6 +645,7 @@ AssignTransactionId(TransactionState s)
bool isSubXact = (s->parent != NULL);
ResourceOwner currentOwner;
bool log_unknown_top = false;
+ bool set_wal_level_transient = false;
/* Assert that caller didn't screw up */
Assert(!FullTransactionIdIsValid(s->fullTransactionId));
@@ -652,6 +660,32 @@ AssignTransactionId(TransactionState s)
(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
errmsg("cannot assign transaction IDs during a parallel operation")));
+ /*
+ * The first call (i.e. the first write) in the transaction tree
+ * determines whether the whole transaction assumes logical decoding or
+ * not.
+ */
+ if (!wal_level_transient_checked)
+ {
+ Assert(wal_level_transient == WAL_LEVEL_MINIMAL);
+
+ /*
+ * Do not repeat the check when calling this function for parent
+ * transactions.
+ */
+ wal_level_transient_checked = true;
+
+ /*
+ * Remember that the actual check is needed. We cannot do it until the
+ * top-level transaction has its XID assigned, see comments below.
+ *
+ * There is no use case for overriding MINIMAL, and LOGICAL cannot be
+ * overridden as such.
+ */
+ if (wal_level == WAL_LEVEL_REPLICA)
+ set_wal_level_transient = true;
+ }
+
/*
* Ensure parent(s) have XIDs, so that a child always has an XID later
* than its parent. Mustn't recurse here, or we might get a stack
@@ -681,20 +715,6 @@ AssignTransactionId(TransactionState s)
pfree(parents);
}
- /*
- * When wal_level=logical, guarantee that a subtransaction's xid can only
- * be seen in the WAL stream if its toplevel xid has been logged before.
- * If necessary we log an xact_assignment record with fewer than
- * PGPROC_MAX_CACHED_SUBXIDS. Note that it is fine if didLogXid isn't set
- * for a transaction even though it appears in a WAL record, we just might
- * superfluously log something. That can happen when an xid is included
- * somewhere inside a wal record, but not in XLogRecord->xl_xid, like in
- * xl_standby_locks.
- */
- if (isSubXact && XLogLogicalInfoActive() &&
- !TopTransactionStateData.didLogXid)
- log_unknown_top = true;
-
/*
* Generate a new FullTransactionId and record its xid in PGPROC and
* pg_subtrans.
@@ -719,6 +739,54 @@ AssignTransactionId(TransactionState s)
if (!isSubXact)
RegisterPredicateLockingXid(XidFromFullTransactionId(s->fullTransactionId));
+ /*
+ * Check if this transaction should consider wal_level=logical.
+ *
+ * Sometimes we need to turn on the logical decoding transiently although
+ * wal_level=WAL_LEVEL_REPLICA. Currently we do so when at least one table
+ * is being clustered concurrently, i.e. when we should assume that
+ * changes done by this transaction will be decoded. In such a case we
+ * adjust the value of XLogLogicalInfoActive() by setting
+ * wal_level_transient to LOGICAL.
+ *
+ * It's important not to do this check until the XID of the top-level
+ * transaction is in ProcGlobal: if the decoding becomes mandatory right
+ * after the check, our transaction will fail to write the necessary
+ * information to WAL. However, if the top-level transaction is already in
+ * ProcGlobal, its XID is guaranteed to appear in the xl_running_xacts
+ * record and therefore the snapshot builder will not try to decode the
+ * transaction (because it assumes it could have missed the initial part
+ * of the transaction).
+ *
+ * On the other hand, if the decoding became mandatory between the actual
+ * XID assignment and now, the transaction will write the decoding-specific
+ * information to WAL unnecessarily. Let's assume that such race conditions do
+ * not happen too often.
+ */
+ if (set_wal_level_transient)
+ {
+ /*
+ * Check for the operation that enables the logical decoding
+ * transiently.
+ */
+ if (is_concurrent_repack_in_progress(InvalidOid))
+ wal_level_transient = WAL_LEVEL_LOGICAL;
+ }
+
+ /*
+ * When wal_level=logical, guarantee that a subtransaction's xid can only
+ * be seen in the WAL stream if its toplevel xid has been logged before.
+ * If necessary we log an xact_assignment record with fewer than
+ * PGPROC_MAX_CACHED_SUBXIDS. Note that it is fine if didLogXid isn't set
+ * for a transaction even though it appears in a WAL record, we just might
+ * superfluously log something. That can happen when an xid is included
+ * somewhere inside a wal record, but not in XLogRecord->xl_xid, like in
+ * xl_standby_locks.
+ */
+ if (isSubXact && XLogLogicalInfoActive() &&
+ !TopTransactionStateData.didLogXid)
+ log_unknown_top = true;
+
/*
* Acquire lock on the transaction XID. (We assume this cannot block.) We
* have to ensure that the lock is assigned to the transaction's own
@@ -2216,6 +2284,16 @@ StartTransaction(void)
if (TransactionTimeout > 0)
enable_timeout_after(TRANSACTION_TIMEOUT, TransactionTimeout);
+ /*
+ * wal_level_transient can override wal_level for individual transactions,
+ * which effectively enables logical decoding for them. At the moment we
+ * don't know if this transaction will write any data changes to be
+ * decoded. Should it do, AssignTransactionId() will check if the decoding
+ * needs to be considered.
+ */
+ wal_level_transient = WAL_LEVEL_MINIMAL;
+ wal_level_transient_checked = false;
+
ShowTransactionState("StartTransaction");
}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ec40c0b7c42..b4e07104083 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -129,6 +129,7 @@ bool wal_recycle = true;
bool log_checkpoints = true;
int wal_sync_method = DEFAULT_WAL_SYNC_METHOD;
int wal_level = WAL_LEVEL_REPLICA;
+int wal_level_transient = WAL_LEVEL_MINIMAL;
int CommitDelay = 0; /* precommit delay in microseconds */
int CommitSiblings = 5; /* # concurrent xacts needed to sleep */
int wal_retrieve_retry_interval = 5000;
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index cc765b88d52..6e1cdc7bca6 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -84,6 +84,14 @@ typedef struct
* The following definitions are used for concurrent processing.
*/
+/*
+ * OID of the table being repacked by this backend.
+ */
+static Oid repacked_rel = InvalidOid;
+
+/* The same for its TOAST relation. */
+static Oid repacked_rel_toast = InvalidOid;
+
/*
* The locators are used to avoid logical decoding of data that we do not need
* for our table.
@@ -135,8 +143,10 @@ static List *get_tables_to_cluster_partitioned(MemoryContext cluster_context,
ClusterCommand cmd);
static bool cluster_is_permitted_for_relation(Oid relid, Oid userid,
ClusterCommand cmd);
-static void begin_concurrent_repack(Relation rel);
-static void end_concurrent_repack(void);
+static void begin_concurrent_repack(Relation rel, Relation *index_p,
+ bool *entered_p);
+static void end_concurrent_repack(bool error);
+static void cluster_before_shmem_exit_callback(int code, Datum arg);
static LogicalDecodingContext *setup_logical_decoding(Oid relid,
const char *slotname,
TupleDesc tupdesc);
@@ -383,6 +393,8 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
Relation index;
bool concurrent = ((params->options & CLUOPT_CONCURRENT) != 0);
LOCKMODE lmode;
+ bool entered,
+ success;
/*
* Check that the correct lock is held. The lock mode is
@@ -558,23 +570,31 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
TransferPredicateLocksToHeapRelation(OldHeap);
/* rebuild_relation does all the dirty work */
+ entered = false;
+ success = false;
PG_TRY();
{
/*
- * For concurrent processing, make sure that our logical decoding
- * ignores data changes of other tables than the one we are
- * processing.
+ * For concurrent processing, make sure that
+ *
+ * 1) our logical decoding ignores data changes of other tables than
+ * the one we are processing.
+ *
+ * 2) other transactions know that REPACK CONCURRENTLY is in progress
+ * for our table, so they write sufficient information to WAL even if
+ * wal_level is < LOGICAL.
*/
if (concurrent)
- begin_concurrent_repack(OldHeap);
+ begin_concurrent_repack(OldHeap, &index, &entered);
rebuild_relation(OldHeap, index, verbose, concurrent, save_userid,
cmd);
+ success = true;
}
PG_FINALLY();
{
- if (concurrent)
- end_concurrent_repack();
+ if (concurrent && entered)
+ end_concurrent_repack(!success);
}
PG_END_TRY();
@@ -2207,6 +2227,49 @@ cluster_is_permitted_for_relation(Oid relid, Oid userid, ClusterCommand cmd)
#define REPL_PLUGIN_NAME "pgoutput_repack"
+/*
+ * Each relation being processed by REPACK CONCURRENTLY must be in the
+ * repackedRelsHash hashtable.
+ */
+typedef struct RepackedRel
+{
+ Oid relid;
+ Oid dbid;
+} RepackedRel;
+
+/* Hashtable of RepackedRel elements. */
+static HTAB *repackedRelsHash = NULL;
+
+/*
+ * Maximum number of entries in the hashtable.
+ *
+ * A replication slot is needed for the processing, so use this GUC to
+ * allocate memory for the hashtable. Multiply by two because TOAST relations
+ * also need to be added to the hashtable.
+ */
+#define MAX_REPACKED_RELS (max_replication_slots * 2)
+
+Size
+RepackShmemSize(void)
+{
+ return hash_estimate_size(MAX_REPACKED_RELS, sizeof(RepackedRel));
+}
+
+void
+RepackShmemInit(void)
+{
+ HASHCTL info;
+
+ info.keysize = sizeof(RepackedRel);
+ info.entrysize = info.keysize;
+ repackedRelsHash = ShmemInitHash("Repacked Relations Hash",
+ MAX_REPACKED_RELS,
+ MAX_REPACKED_RELS,
+ &info,
+ HASH_ELEM | HASH_BLOBS |
+ HASH_FIXED_SIZE);
+}
+
/*
* Call this function before REPACK CONCURRENTLY starts to setup logical
* decoding. It makes sure that other users of the table put enough
@@ -2221,11 +2284,150 @@ cluster_is_permitted_for_relation(Oid relid, Oid userid, ClusterCommand cmd)
*
* Note that TOAST table needs no attention here as it's not scanned using
* historic snapshot.
+ *
+ * 'index_p' is in/out argument because the function unlocks the index
+ * temporarily.
+ *
+ * 'entered_p' receives a bool value telling whether the relation OID was entered
+ * into repackedRelsHash or not.
*/
static void
-begin_concurrent_repack(Relation rel)
+begin_concurrent_repack(Relation rel, Relation *index_p, bool *entered_p)
{
- Oid toastrelid;
+ Oid relid,
+ toastrelid;
+ Relation index = NULL;
+ Oid indexid = InvalidOid;
+ RepackedRel key,
+ *entry;
+ bool found;
+ static bool before_shmem_exit_callback_setup = false;
+
+ relid = RelationGetRelid(rel);
+ index = index_p ? *index_p : NULL;
+
+ /*
+ * Make sure that we do not leave an entry in repackedRelsHash if exiting
+ * due to FATAL.
+ */
+ if (!before_shmem_exit_callback_setup)
+ {
+ before_shmem_exit(cluster_before_shmem_exit_callback, 0);
+ before_shmem_exit_callback_setup = true;
+ }
+
+ memset(&key, 0, sizeof(key));
+ key.relid = relid;
+ key.dbid = MyDatabaseId;
+
+ *entered_p = false;
+ LWLockAcquire(RepackedRelsLock, LW_EXCLUSIVE);
+ entry = (RepackedRel *)
+ hash_search(repackedRelsHash, &key, HASH_ENTER_NULL, &found);
+ if (found)
+ {
+ /*
+ * Since REPACK CONCURRENTLY takes ShareRowExclusiveLock, a conflict
+ * should occur much earlier. However that lock may be released
+ * temporarily, see below. Anyway, we should complain whatever the
+ * reason of the conflict might be.
+ */
+ ereport(ERROR,
+ (errmsg("relation \"%s\" is already being processed by REPACK CONCURRENTLY",
+ RelationGetRelationName(rel))));
+ }
+ if (entry == NULL)
+ ereport(ERROR,
+ (errmsg("too many requests for REPACK CONCURRENTLY at a time")),
+ (errhint("Please consider increasing the \"max_replication_slots\" configuration parameter.")));
+
+ /*
+ * Even if the insertion of TOAST relid should fail below, the caller has
+ * to do cleanup.
+ */
+ *entered_p = true;
+
+ /*
+ * Enable the callback to remove the entry in case of exit. We should not
+ * do this earlier, otherwise an attempt to insert already existing entry
+ * could make us remove that entry (inserted by another backend) during
+ * ERROR handling.
+ */
+ Assert(!OidIsValid(repacked_rel));
+ repacked_rel = relid;
+
+ /*
+ * TOAST relation is not accessed using historic snapshot, but we enter it
+ * here to protect it from being VACUUMed by another backend. (Lock does
+ * not help in the CONCURRENTLY case because cannot hold it continuously
+ * till the end of the transaction.) See the comments on locking TOAST
+ * relation in copy_table_data().
+ */
+ toastrelid = rel->rd_rel->reltoastrelid;
+ if (OidIsValid(toastrelid))
+ {
+ key.relid = toastrelid;
+ entry = (RepackedRel *)
+ hash_search(repackedRelsHash, &key, HASH_ENTER_NULL, &found);
+ if (found)
+
+ /*
+ * If we could enter the main relation, entering the TOAST relation
+ * Nevertheless, check.
+ */
+ ereport(ERROR,
+ (errmsg("TOAST relation of \"%s\" is already being processed by REPACK CONCURRENTLY",
+ RelationGetRelationName(rel))));
+ if (entry == NULL)
+ ereport(ERROR,
+ (errmsg("too many requests for REPACK CONCURRENTLY at a time")),
+ (errhint("Please consider increasing the \"max_replication_slots\" configuration parameter.")));
+
+ Assert(!OidIsValid(repacked_rel_toast));
+ repacked_rel_toast = toastrelid;
+ }
+
+ LWLockRelease(RepackedRelsLock);
+
+ /*
+ * Make sure that other backends are aware of the new hash entry as soon
+ * as they open our table.
+ */
+ CacheInvalidateRelcacheImmediate(relid);
+
+ /*
+ * Also make sure that the existing users of the table update their
+ * relcache entry as soon as they try to run DML commands on it.
+ *
+ * ShareLock is the weakest lock that conflicts with DMLs. If any backend
+ * has a lower lock, we assume it'll accept our invalidation message when
+ * it changes the lock mode.
+ *
+ * Before upgrading the lock on the relation, close the index temporarily
+ * to avoid a deadlock if another backend running DML already has its lock
+ * (ShareLock) on the table and waits for the lock on the index.
+ */
+ if (index)
+ {
+ indexid = RelationGetRelid(index);
+ index_close(index, ShareUpdateExclusiveLock);
+ }
+ LockRelationOid(relid, ShareLock);
+ UnlockRelationOid(relid, ShareLock);
+ if (OidIsValid(indexid))
+ {
+ /*
+ * Re-open the index and check that it hasn't changed while unlocked.
+ */
+ check_index_is_clusterable(rel, indexid, ShareUpdateExclusiveLock);
+
+ /*
+ * Return the new relcache entry to the caller. (It's been locked by
+ * the call above.)
+ */
+ index = index_open(indexid, NoLock);
+ *index_p = index;
+ }
/* Avoid logical decoding of other relations by this backend. */
repacked_rel_locator = rel->rd_locator;
@@ -2243,15 +2445,176 @@ begin_concurrent_repack(Relation rel)
/*
* Call this when done with REPACK CONCURRENTLY.
+ *
+ * 'error' tells whether the function is being called in order to handle
+ * error.
*/
static void
-end_concurrent_repack(void)
+end_concurrent_repack(bool error)
{
+ RepackedRel key;
+ RepackedRel *entry = NULL;
+ RepackedRel *entry_toast = NULL;
+ Oid relid = repacked_rel;
+ Oid toastrelid = repacked_rel_toast;
+
+ /* Remove the relation from the hash if we managed to insert one. */
+ if (OidIsValid(repacked_rel))
+ {
+ LWLockAcquire(RepackedRelsLock, LW_EXCLUSIVE);
+
+ memset(&key, 0, sizeof(key));
+ key.relid = repacked_rel;
+ key.dbid = MyDatabaseId;
+
+ entry = hash_search(repackedRelsHash, &key, HASH_REMOVE, NULL);
+
+ /* Remove the TOAST relation if there is one. */
+ if (OidIsValid(repacked_rel_toast))
+ {
+ key.relid = repacked_rel_toast;
+ entry_toast = hash_search(repackedRelsHash, &key, HASH_REMOVE,
+ NULL);
+ }
+
+ LWLockRelease(RepackedRelsLock);
+
+ /*
+ * Make others refresh their information whether they should still
+ * treat the table as catalog from the perspective of writing WAL.
+ *
+ * XXX Unlike entering the entry into the hashtable, we do not bother
+ * with locking and unlocking the table here:
+ *
+ * 1) On normal completion (and sometimes even on ERROR), the caller
+ * is already holding AccessExclusiveLock on the table, so there
+ * should be no relcache reference unaware of this change.
+ *
+ * 2) In the other cases, the worst scenario is that the other
+ * backends will write unnecessary information to WAL until they close
+ * the relation.
+ *
+ * Should we use ShareLock mode to fix 2) at least for the non-FATAL
+ * errors? (Our before_shmem_exit callback is in charge of FATAL, and
+ * that probably should not try to acquire any lock.)
+ */
+ CacheInvalidateRelcacheImmediate(repacked_rel);
+
+ /*
+ * By clearing repacked_rel we also disable
+ * cluster_before_shmem_exit_callback().
+ */
+ repacked_rel = InvalidOid;
+ repacked_rel_toast = InvalidOid;
+ }
+
/*
* Restore normal function of (future) logical decoding for this backend.
*/
repacked_rel_locator.relNumber = InvalidOid;
repacked_rel_toast_locator.relNumber = InvalidOid;
+
+ /*
+ * On normal completion (!error), we should not really fail to remove the
+ * entry. But if it wasn't there for any reason, raise ERROR to make sure
+ * the transaction is aborted: if other transactions, while changing the
+ * contents of the relation, didn't know that REPACK CONCURRENTLY was in
+ * progress, they might have failed to write enough information to WAL, and
+ * thus we could have produced inconsistent table contents.
+ *
+ * On the other hand, if we are already handling an error, there's no
+ * reason to worry about inconsistent contents of the new storage because
+ * the transaction is going to be rolled back anyway. Furthermore, by
+ * raising ERROR here we'd shadow the original error.
+ */
+ if (!error)
+ {
+ char *relname;
+
+ if (OidIsValid(relid) && entry == NULL)
+ {
+ relname = get_rel_name(relid);
+ if (!relname)
+ ereport(ERROR,
+ (errmsg("cache lookup failed for relation %u",
+ relid)));
+
+ ereport(ERROR,
+ (errmsg("relation \"%s\" not found among repacked relations",
+ relname)));
+ }
+
+ /*
+ * Likewise, the TOAST relation should not have disappeared.
+ */
+ if (OidIsValid(toastrelid) && entry_toast == NULL)
+ {
+ relname = get_rel_name(key.relid);
+ if (!relname)
+ ereport(ERROR,
+ (errmsg("cache lookup failed for relation %u",
+ key.relid)));
+
+ ereport(ERROR,
+ (errmsg("relation \"%s\" not found among repacked relations",
+ relname)));
+ }
+
+ }
+}
+
+/*
+ * A wrapper to call end_concurrent_repack() as a before_shmem_exit callback.
+ */
+static void
+cluster_before_shmem_exit_callback(int code, Datum arg)
+{
+ if (OidIsValid(repacked_rel))
+ end_concurrent_repack(true);
+}
+
+/*
+ * Check if relation is currently being processed by REPACK CONCURRENTLY.
+ *
+ * If relid is InvalidOid, check if any relation is being processed.
+ */
+bool
+is_concurrent_repack_in_progress(Oid relid)
+{
+ RepackedRel key,
+ *entry;
+
+ /* For a particular relation, we need to search the hashtable. */
+ memset(&key, 0, sizeof(key));
+ key.relid = relid;
+ key.dbid = MyDatabaseId;
+
+ LWLockAcquire(RepackedRelsLock, LW_SHARED);
+ /*
+ * If the caller is interested whether any relation is being repacked,
+ * just check the number of entries.
+ */
+ if (!OidIsValid(relid))
+ {
+ long n = hash_get_num_entries(repackedRelsHash);
+
+ LWLockRelease(RepackedRelsLock);
+ return n > 0;
+ }
+ entry = (RepackedRel *)
+ hash_search(repackedRelsHash, &key, HASH_FIND, NULL);
+ LWLockRelease(RepackedRelsLock);
+
+ return entry != NULL;
+}
+
+/*
+ * Is this backend performing REPACK CONCURRENTLY?
+ */
+bool
+is_concurrent_repack_run_by_me(void)
+{
+ return OidIsValid(repacked_rel);
}
/*
@@ -2281,7 +2644,7 @@ setup_logical_decoding(Oid relid, const char *slotname, TupleDesc tupdesc)
* useful for us.
*
* Regarding the value of need_full_snapshot, we pass false because the
- * table we are processing is present in RepackedRelsHash and therefore,
+ * table we are processing is present in repackedRelsHash and therefore,
* regarding logical decoding, treated like a catalog.
*/
ctx = CreateInitDecodingContext(REPL_PLUGIN_NAME,
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index a8d2e024d34..4909432d585 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -30,6 +30,7 @@
#include "access/xact.h"
#include "access/xlogutils.h"
+#include "commands/cluster.h"
#include "fmgr.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -112,10 +113,12 @@ CheckLogicalDecodingRequirements(void)
/*
* NB: Adding a new requirement likely means that RestoreSlotFromDisk()
- * needs the same check.
+ * needs the same check. (Except that only temporary slots should be
+ * created for REPACK CONCURRENTLY, which effectively raises wal_level to
+ * LOGICAL.)
*/
-
- if (wal_level < WAL_LEVEL_LOGICAL)
+ if ((wal_level < WAL_LEVEL_LOGICAL && !is_concurrent_repack_run_by_me())
+ || wal_level < WAL_LEVEL_REPLICA)
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("logical decoding requires \"wal_level\" >= \"logical\"")));
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index f247e4e7c16..7b27068c338 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -153,6 +153,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, SlotSyncShmemSize());
size = add_size(size, AioShmemSize());
size = add_size(size, MemoryContextReportingShmemSize());
+ size = add_size(size, RepackShmemSize());
/* include additional requested shmem from preload libraries */
size = add_size(size, total_addin_request);
@@ -347,6 +348,7 @@ CreateOrAttachShmemStructs(void)
InjectionPointShmemInit();
AioShmemInit();
MemoryContextReportingShmemInit();
+ RepackShmemInit();
}
/*
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 7fa8d9247e0..ab30d448d42 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -1325,13 +1325,13 @@ LogStandbySnapshot(void)
* record. Fortunately this routine isn't executed frequently, and it's
* only a shared lock.
*/
- if (wal_level < WAL_LEVEL_LOGICAL)
+ if (!XLogLogicalInfoActive())
LWLockRelease(ProcArrayLock);
recptr = LogCurrentRunningXacts(running);
/* Release lock if we kept it longer ... */
- if (wal_level >= WAL_LEVEL_LOGICAL)
+ if (XLogLogicalInfoActive())
LWLockRelease(ProcArrayLock);
/* GetRunningTransactionData() acquired XidGenLock, we must release it */
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 4eb67720737..14eda1c24ee 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -1633,6 +1633,27 @@ CacheInvalidateRelcache(Relation relation)
databaseId, relationId);
}
+/*
+ * CacheInvalidateRelcacheImmediate
+ * Send invalidation message for the specified relation's relcache entry.
+ *
+ * Currently this is used in REPACK CONCURRENTLY, to make sure that other
+ * backends are aware that the command is being executed for the relation.
+ */
+void
+CacheInvalidateRelcacheImmediate(Oid relid)
+{
+ SharedInvalidationMessage msg;
+
+ msg.rc.id = SHAREDINVALRELCACHE_ID;
+ msg.rc.dbId = MyDatabaseId;
+ msg.rc.relId = relid;
+ /* check AddCatcacheInvalidationMessage() for an explanation */
+ VALGRIND_MAKE_MEM_DEFINED(&msg, sizeof(msg));
+
+ SendSharedInvalidMessages(&msg, 1);
+}
+
/*
* CacheInvalidateRelcacheAll
* Register invalidation of the whole relcache at the end of command.
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 75434e32198..058c95bc847 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -1279,6 +1279,10 @@ retry:
/* make sure relation is marked as having no open file yet */
relation->rd_smgr = NULL;
+ /* Is REPACK CONCURRENTLY in progress? */
+ relation->rd_repack_concurrent =
+ is_concurrent_repack_in_progress(targetRelId);
+
/*
* now we can free the memory allocated for pg_class_tuple
*/
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index d313099c027..a325bb1d16b 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -95,6 +95,12 @@ typedef enum RecoveryState
extern PGDLLIMPORT int wal_level;
+/*
+ * wal_level_transient overrides wal_level if logical decoding needs to be
+ * enabled transiently.
+ */
+extern PGDLLIMPORT int wal_level_transient;
+
/* Is WAL archiving enabled (always or only while server is running normally)? */
#define XLogArchivingActive() \
(AssertMacro(XLogArchiveMode == ARCHIVE_MODE_OFF || wal_level >= WAL_LEVEL_REPLICA), XLogArchiveMode > ARCHIVE_MODE_OFF)
@@ -122,8 +128,13 @@ extern PGDLLIMPORT int wal_level;
/* Do we need to WAL-log information required only for Hot Standby and logical replication? */
#define XLogStandbyInfoActive() (wal_level >= WAL_LEVEL_REPLICA)
-/* Do we need to WAL-log information required only for logical replication? */
-#define XLogLogicalInfoActive() (wal_level >= WAL_LEVEL_LOGICAL)
+/*
+ * Do we need to WAL-log information required only for logical replication?
+ *
+ * wal_level_transient overrides wal_level if logical decoding needs to be
+ * active transiently.
+ */
+#define XLogLogicalInfoActive() (Max(wal_level, wal_level_transient) == WAL_LEVEL_LOGICAL)
#ifdef WAL_DEBUG
extern PGDLLIMPORT bool XLOG_DEBUG;
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 4914f217267..9d5a30d0689 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -150,5 +150,10 @@ extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
MultiXactId cutoffMulti,
char newrelpersistence);
+extern Size RepackShmemSize(void);
+extern void RepackShmemInit(void);
+extern bool is_concurrent_repack_in_progress(Oid relid);
+extern bool is_concurrent_repack_run_by_me(void);
+
extern void repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel);
#endif /* CLUSTER_H */
diff --git a/src/include/utils/inval.h b/src/include/utils/inval.h
index 9b871caef62..ae9dee394dc 100644
--- a/src/include/utils/inval.h
+++ b/src/include/utils/inval.h
@@ -50,6 +50,8 @@ extern void CacheInvalidateCatalog(Oid catalogId);
extern void CacheInvalidateRelcache(Relation relation);
+extern void CacheInvalidateRelcacheImmediate(Oid relid);
+
extern void CacheInvalidateRelcacheAll(void);
extern void CacheInvalidateRelcacheByTuple(HeapTuple classTuple);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index b552359915f..cc84592eb1f 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -253,6 +253,9 @@ typedef struct RelationData
bool pgstat_enabled; /* should relation stats be counted */
/* use "struct" here to avoid needing to include pgstat.h: */
struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+ /* Is REPACK CONCURRENTLY being performed on this relation? */
+ bool rd_repack_concurrent;
} RelationData;
@@ -708,12 +711,16 @@ RelationCloseSmgr(Relation relation)
* it would complicate decoding slightly for little gain). Note that we *do*
* log information for user defined catalog tables since they presumably are
* interesting to the user...
+ *
+ * If particular relations require it, logical decoding can be active
+ * even if wal_level is REPLICA. Do not log other relations in that case.
*/
#define RelationIsLogicallyLogged(relation) \
(XLogLogicalInfoActive() && \
RelationNeedsWAL(relation) && \
(relation)->rd_rel->relkind != RELKIND_FOREIGN_TABLE && \
- !IsCatalogRelation(relation))
+ !IsCatalogRelation(relation) && \
+ (wal_level == WAL_LEVEL_LOGICAL || (relation)->rd_repack_concurrent))
/* routines in utils/cache/relcache.c */
extern void RelationIncrementReferenceCount(Relation rel);
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 405d0811b4f..4f6c0ca3a8a 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -15,7 +15,6 @@ REGRESS = injection_points hashagg reindex_conc
REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
ISOLATION = basic inplace syscache-update-pruned repack
-ISOLATION_OPTS = --temp-config $(top_srcdir)/src/test/modules/injection_points/logical.conf
TAP_TESTS = 1
diff --git a/src/test/modules/injection_points/logical.conf b/src/test/modules/injection_points/logical.conf
deleted file mode 100644
index c8f264bc6cb..00000000000
--- a/src/test/modules/injection_points/logical.conf
+++ /dev/null
@@ -1 +0,0 @@
-wal_level = logical
\ No newline at end of file
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 0e3c47ba999..716e5619aa7 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -50,9 +50,6 @@ tests += {
'syscache-update-pruned',
],
'runningcheck': false, # see syscache-update-pruned
- # 'repack' requires wal_level = 'logical'.
- 'regress_args': ['--temp-config', files('logical.conf')],
-
},
'tap': {
'env': {
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 6bbc8b419f8..f885eb6d28b 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2529,6 +2529,7 @@ ReorderBufferTupleCidKey
ReorderBufferUpdateProgressTxnCB
ReorderTuple
RepOriginId
+RepackedRel
RepackDecodingState
RepackStmt
ReparameterizeForeignPathByChild_function
--
2.43.5
Matheus Alcantara <matheusssilv97@gmail.com> wrote:
Hi,
On Tue, Apr 1, 2025 at 10:31 AM Antonin Houska <ah@cybertec.at> wrote:
One more version, hopefully to make cfbot happy (I missed the bug because I
did not set the RELCACHE_FORCE_RELEASE macro in my environment.)
Thanks for the new version! I'm starting to study this patch series and
I just want to share some points about the documentation on v12-0004:
Please check the next version [1]. Thanks for your input.
[1]: /messages/by-id/97795.1744363522@localhost
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
On Fri, Apr 11, 2025 at 5:28 PM Antonin Houska <ah@cybertec.at> wrote:
Please check the next version [1]. Thanks for your input.
Hi, I’ve briefly experimented with v13-0001.
EXPLAIN tab completion:
explain (verbose O
OFF ON
Since the patch already touches tab completion for REPACK, we could complete
its options similarly; see src/bin/psql/tab-complete.in.c line 4288.
------------------------------
Currently the REPACK synopsis section looks like the attached screenshot.
Making it one line:
REPACK [ ( option [, ...] ) ] [ table_name [ USING INDEX index_name ] ]
would look more intuitive, IMHO.
------------------------------
+repack_index_specification:
+ USING INDEX name { $$ = $3; }
+ | /*EMPTY*/ { $$ = NULL; }
+ ;
In gram.y, line 4685, we have
ExistingIndex: USING INDEX name { $$ = $3; }
;
so here, we can change it to
repack_index_specification:
ExistingIndex
| /*EMPTY*/ { $$ = NULL; }
-------------------------------------------------
+static List *
+get_tables_to_repack(MemoryContext repack_context)
+{
+ relrelation = table_open(RelationRelationId, AccessShareLock);
+ scan = table_beginscan_catalog(relrelation, 0, NULL);
+ while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+ {
+ RelToCluster *rtc;
+ Form_pg_class relrelation = (Form_pg_class) GETSTRUCT(tuple);
+ Oid relid = relrelation->oid;
+
+ /* Only interested in relations. */
+ if (get_rel_relkind(relid) != RELKIND_RELATION)
+ continue;
The doc says "Without a table_name, REPACK processes every table and
materialized view...", but it seems that a plain ``REPACK (verbose);``
will not process materialized views?
Hi,
some more minor comments about v13-0001.
Does GetCommandLogLevel also need to return a LogStmtLevel for T_RepackStmt?
/*
* (CLUSTER might change the order of
* rows on disk, which could affect the ordering of pg_dump
* output, but that's not semantically significant.)
*/
Do we need to adjust this comment in ClassifyUtilityCommandAsReadOnly
for the REPACK statement?
<para>
<productname>PostgreSQL</productname> has the ability to report the
progress of
certain commands during command execution. Currently, the only commands
which support progress reporting are <command>ANALYZE</command>,
<command>CLUSTER</command>,
<command>CREATE INDEX</command>, <command>VACUUM</command>,
<command>COPY</command>,
and <xref linkend="protocol-replication-base-backup"/> (i.e., replication
command that <xref linkend="app-pgbasebackup"/> issues to take
a base backup).
This may be expanded in the future.
</para>
Does this also need to mention <command>REPACK</command>?
Given that "the CLUSTER command is deprecated", do we need to say
something in doc/src/sgml/ref/clusterdb.sgml?
jian he <jian.universality@gmail.com> wrote:
On Fri, Apr 11, 2025 at 5:28 PM Antonin Houska <ah@cybertec.at> wrote:
Please check the next version [1]. Thanks for your input.
Hi, I’ve briefly experimented with v13-0001.
Thanks! v14 addresses your comments.
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Attachments:
v14-0001-Add-REPACK-command.patchtext/x-diffDownload
From c2db801134ebcdca3408d324fa533f24e386c8ae Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Mon, 9 Jun 2025 12:00:42 +0200
Subject: [PATCH 1/7] Add REPACK command.
The existing CLUSTER command as well as VACUUM with the FULL option both
reclaim unused space by rewriting the table. Now that we want to enhance this
functionality (in particular, by adding a new option CONCURRENTLY), we should
enhance both commands because they are both implemented by the same function
(cluster.c:cluster_rel). However, adding the same option to two different
commands is not very user-friendly. Therefore it was decided to create a new
command and to declare both CLUSTER command and the FULL option of VACUUM
deprecated. Future enhancements to this rewriting code will only affect the
new command.
Like CLUSTER, the REPACK command reorders the table according to the specified
index. Unlike CLUSTER, REPACK does not require an index: if only a table is
specified, the command acts as VACUUM FULL. As we don't want to remove CLUSTER
and VACUUM FULL yet, there are three callers of the cluster_rel() function
now: REPACK, CLUSTER and VACUUM FULL. When we need to distinguish who is
calling this function (mostly for logging, but also for progress reporting),
we can no longer use the OID of the clustering index: both REPACK and VACUUM
FULL can pass InvalidOid. Therefore, this patch introduces a new enumeration
type ClusterCommand, and adds an argument of this type to the cluster_rel()
function and to all the functions that need to distinguish the caller.
Like CLUSTER and VACUUM FULL, the REPACK command without arguments processes
all the tables on which the current user has the MAINTAIN privilege.
A new view pg_stat_progress_repack is added to monitor the progress of
REPACK. Currently it displays the same information as pg_stat_progress_cluster
(except that column names might differ), but it'll also display the status of
the REPACK CONCURRENTLY command in the future, so the view definitions will
eventually diverge.
Regarding user documentation, the patch moves the information on clustering
from cluster.sgml to the new file repack.sgml. cluster.sgml now contains a
link that points to the related section of repack.sgml. A note on deprecation
and a link to repack.sgml are added to both cluster.sgml and vacuum.sgml.
---
doc/src/sgml/monitoring.sgml | 223 ++++++++++-
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/cluster.sgml | 82 +---
doc/src/sgml/ref/repack.sgml | 254 ++++++++++++
doc/src/sgml/ref/vacuum.sgml | 9 +
doc/src/sgml/reference.sgml | 1 +
src/backend/access/heap/heapam_handler.c | 32 +-
src/backend/catalog/index.c | 2 +-
src/backend/catalog/system_views.sql | 26 ++
src/backend/commands/cluster.c | 469 +++++++++++++++++------
src/backend/commands/vacuum.c | 3 +-
src/backend/parser/gram.y | 53 ++-
src/backend/tcop/utility.c | 13 +
src/backend/utils/adt/pgstatfuncs.c | 2 +
src/bin/psql/tab-complete.in.c | 33 +-
src/include/commands/cluster.h | 19 +-
src/include/commands/progress.h | 67 +++-
src/include/nodes/parsenodes.h | 13 +
src/include/parser/kwlist.h | 1 +
src/include/tcop/cmdtaglist.h | 1 +
src/include/utils/backend_progress.h | 1 +
src/test/regress/expected/cluster.out | 123 ++++++
src/test/regress/expected/rules.out | 23 ++
src/test/regress/sql/cluster.sql | 59 +++
src/tools/pgindent/typedefs.list | 2 +
25 files changed, 1298 insertions(+), 214 deletions(-)
create mode 100644 doc/src/sgml/ref/repack.sgml
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 4265a22d4de..da883bb22f1 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -405,6 +405,14 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_progress_repack</structname><indexterm><primary>pg_stat_progress_repack</primary></indexterm></entry>
+ <entry>One row for each backend running
+ <command>REPACK</command>, showing current progress. See
+ <xref linkend="repack-progress-reporting"/>.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_progress_basebackup</structname><indexterm><primary>pg_stat_progress_basebackup</primary></indexterm></entry>
<entry>One row for each WAL sender process streaming a base backup,
@@ -5493,7 +5501,8 @@ FROM pg_stat_get_backend_idset() AS backendid;
certain commands during command execution. Currently, the only commands
which support progress reporting are <command>ANALYZE</command>,
<command>CLUSTER</command>,
- <command>CREATE INDEX</command>, <command>VACUUM</command>,
+ <command>CREATE INDEX</command>, <command>REPACK</command>,
+ <command>VACUUM</command>,
<command>COPY</command>,
and <xref linkend="protocol-replication-base-backup"/> (i.e., replication
command that <xref linkend="app-pgbasebackup"/> issues to take
@@ -5952,6 +5961,218 @@ FROM pg_stat_get_backend_idset() AS backendid;
</table>
</sect2>
+ <sect2 id="repack-progress-reporting">
+ <title>REPACK Progress Reporting</title>
+
+ <indexterm>
+ <primary>pg_stat_progress_repack</primary>
+ </indexterm>
+
+ <para>
+ Whenever <command>REPACK</command> is running,
+ the <structname>pg_stat_progress_repack</structname> view will contain a
+ row for each backend that is currently running the command. The tables
+ below describe the information that will be reported and provide
+ information about how to interpret it.
+ </para>
+
+ <table id="pg-stat-progress-repack-view" xreflabel="pg_stat_progress_repack">
+ <title><structname>pg_stat_progress_repack</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>pid</structfield> <type>integer</type>
+ </para>
+ <para>
+ Process ID of backend.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>datid</structfield> <type>oid</type>
+ </para>
+ <para>
+ OID of the database to which this backend is connected.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>datname</structfield> <type>name</type>
+ </para>
+ <para>
+ Name of the database to which this backend is connected.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>relid</structfield> <type>oid</type>
+ </para>
+ <para>
+ OID of the table being repacked.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>phase</structfield> <type>text</type>
+ </para>
+ <para>
+ Current processing phase. See <xref linkend="repack-phases"/>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>repack_index_relid</structfield> <type>oid</type>
+ </para>
+ <para>
+ If the table is being scanned using an index, this is the OID of the
+ index being used; otherwise, it is zero.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_scanned</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples scanned.
+ This counter only advances when the phase is
+ <literal>seq scanning heap</literal>,
+ <literal>index scanning heap</literal>
+ or <literal>writing new heap</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_written</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples written.
+ This counter only advances when the phase is
+ <literal>seq scanning heap</literal>,
+ <literal>index scanning heap</literal>
+ or <literal>writing new heap</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_blks_total</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Total number of heap blocks in the table. This number is reported
+ as of the beginning of <literal>seq scanning heap</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_blks_scanned</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap blocks scanned. This counter only advances when the
+ phase is <literal>seq scanning heap</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>index_rebuild_count</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of indexes rebuilt. This counter only advances when the phase
+ is <literal>rebuilding index</literal>.
+ </para></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <table id="repack-phases">
+ <title>REPACK Phases</title>
+ <tgroup cols="2">
+ <colspec colname="col1" colwidth="1*"/>
+ <colspec colname="col2" colwidth="2*"/>
+ <thead>
+ <row>
+ <entry>Phase</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry><literal>initializing</literal></entry>
+ <entry>
+ The command is preparing to begin scanning the heap. This phase is
+ expected to be very brief.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>seq scanning heap</literal></entry>
+ <entry>
+ The command is currently scanning the table using a sequential scan.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>index scanning heap</literal></entry>
+ <entry>
+ <command>REPACK</command> is currently scanning the table using an index scan.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>sorting tuples</literal></entry>
+ <entry>
+ <command>REPACK</command> is currently sorting tuples.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>writing new heap</literal></entry>
+ <entry>
+ <command>REPACK</command> is currently writing the new heap.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>swapping relation files</literal></entry>
+ <entry>
+ The command is currently swapping newly-built files into place.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>rebuilding index</literal></entry>
+ <entry>
+ The command is currently rebuilding an index.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>performing final cleanup</literal></entry>
+ <entry>
+ The command is performing final cleanup. When this phase is
+ completed, <command>REPACK</command> will end.
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+ </sect2>
+
<sect2 id="copy-progress-reporting">
<title>COPY Progress Reporting</title>
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..c0ef654fcb4 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -167,6 +167,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY refreshMaterializedView SYSTEM "refresh_materialized_view.sgml">
<!ENTITY reindex SYSTEM "reindex.sgml">
<!ENTITY releaseSavepoint SYSTEM "release_savepoint.sgml">
+<!ENTITY repack SYSTEM "repack.sgml">
<!ENTITY reset SYSTEM "reset.sgml">
<!ENTITY revoke SYSTEM "revoke.sgml">
<!ENTITY rollback SYSTEM "rollback.sgml">
diff --git a/doc/src/sgml/ref/cluster.sgml b/doc/src/sgml/ref/cluster.sgml
index 8811f169ea0..ee4fd965928 100644
--- a/doc/src/sgml/ref/cluster.sgml
+++ b/doc/src/sgml/ref/cluster.sgml
@@ -42,18 +42,6 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
<replaceable class="parameter">table_name</replaceable>.
</para>
- <para>
- When a table is clustered, it is physically reordered
- based on the index information. Clustering is a one-time operation:
- when the table is subsequently updated, the changes are
- not clustered. That is, no attempt is made to store new or
- updated rows according to their index order. (If one wishes, one can
- periodically recluster by issuing the command again. Also, setting
- the table's <literal>fillfactor</literal> storage parameter to less than
- 100% can aid in preserving cluster ordering during updates, since updated
- rows are kept on the same page if enough space is available there.)
- </para>
-
<para>
When a table is clustered, <productname>PostgreSQL</productname>
remembers which index it was clustered by. The form
@@ -78,6 +66,25 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
database operations (both reads and writes) from operating on the
table until the <command>CLUSTER</command> is finished.
</para>
+
+ <warning>
+ <para>
+ The <command>CLUSTER</command> command is deprecated in favor of
+ <xref linkend="sql-repack"/>.
+ </para>
+ </warning>
+
+ <note>
+ <para>
+ <xref linkend="sql-repack-notes-on-clustering"/> explains how clustering
+ works, whether it is initiated by <command>CLUSTER</command> or
+ by <command>REPACK</command>. The notable difference between the two is
+ that <command>REPACK</command> does not remember the index used last
+ time. Thus if you don't specify an index, <command>REPACK</command>
+ rewrites the table but does not try to cluster it.
+ </para>
+ </note>
+
</refsect1>
<refsect1>
@@ -136,63 +143,12 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
on the table.
</para>
- <para>
- In cases where you are accessing single rows randomly
- within a table, the actual order of the data in the
- table is unimportant. However, if you tend to access some
- data more than others, and there is an index that groups
- them together, you will benefit from using <command>CLUSTER</command>.
- If you are requesting a range of indexed values from a table, or a
- single indexed value that has multiple rows that match,
- <command>CLUSTER</command> will help because once the index identifies the
- table page for the first row that matches, all other rows
- that match are probably already on the same table page,
- and so you save disk accesses and speed up the query.
- </para>
-
- <para>
- <command>CLUSTER</command> can re-sort the table using either an index scan
- on the specified index, or (if the index is a b-tree) a sequential
- scan followed by sorting. It will attempt to choose the method that
- will be faster, based on planner cost parameters and available statistical
- information.
- </para>
-
<para>
While <command>CLUSTER</command> is running, the <xref
linkend="guc-search-path"/> is temporarily changed to <literal>pg_catalog,
pg_temp</literal>.
</para>
- <para>
- When an index scan is used, a temporary copy of the table is created that
- contains the table data in the index order. Temporary copies of each
- index on the table are created as well. Therefore, you need free space on
- disk at least equal to the sum of the table size and the index sizes.
- </para>
-
- <para>
- When a sequential scan and sort is used, a temporary sort file is
- also created, so that the peak temporary space requirement is as much
- as double the table size, plus the index sizes. This method is often
- faster than the index scan method, but if the disk space requirement is
- intolerable, you can disable this choice by temporarily setting <xref
- linkend="guc-enable-sort"/> to <literal>off</literal>.
- </para>
-
- <para>
- It is advisable to set <xref linkend="guc-maintenance-work-mem"/> to
- a reasonably large value (but not more than the amount of RAM you can
- dedicate to the <command>CLUSTER</command> operation) before clustering.
- </para>
-
- <para>
- Because the planner records statistics about the ordering of
- tables, it is advisable to run <link linkend="sql-analyze"><command>ANALYZE</command></link>
- on the newly clustered table.
- Otherwise, the planner might make poor choices of query plans.
- </para>
-
<para>
Because <command>CLUSTER</command> remembers which indexes are clustered,
one can cluster the tables one wants clustered manually the first time,
diff --git a/doc/src/sgml/ref/repack.sgml b/doc/src/sgml/ref/repack.sgml
new file mode 100644
index 00000000000..a612c72d971
--- /dev/null
+++ b/doc/src/sgml/ref/repack.sgml
@@ -0,0 +1,254 @@
+<!--
+doc/src/sgml/ref/repack.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-repack">
+ <indexterm zone="sql-repack">
+ <primary>REPACK</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle>REPACK</refentrytitle>
+ <manvolnum>7</manvolnum>
+ <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>REPACK</refname>
+ <refpurpose>rewrite a table to reclaim disk space</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <replaceable class="parameter">table_name</replaceable> [ USING INDEX <replaceable class="parameter">index_name</replaceable> ] ]
+
+<phrase>where <replaceable class="parameter">option</replaceable> can be one of:</phrase>
+
+ VERBOSE [ <replaceable class="parameter">boolean</replaceable> ]
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+
+ <para>
+ <command>REPACK</command> reclaims storage occupied by dead
+ tuples. Unlike <command>VACUUM</command>, it does so by rewriting the
+ entire contents of the table specified
+ by <replaceable class="parameter">table_name</replaceable> into a new disk
+ file with no extra space (except for the space guaranteed by
+ the <literal>fillfactor</literal> storage parameter), allowing unused space
+ to be returned to the operating system.
+ </para>
+
+ <para>
+ Without
+ a <replaceable class="parameter">table_name</replaceable>, <command>REPACK</command>
+ processes every table and materialized view in the current database that
+ the current user has the <literal>MAINTAIN</literal> privilege on. This
+ form of <command>REPACK</command> cannot be executed inside a transaction
+ block.
+ </para>
+
+ <para>
+ If <replaceable class="parameter">index_name</replaceable> is specified,
+ the table is clustered by this index. Please see the notes on clustering
+ below.
+ </para>
+
+ <para>
+ When a table is being repacked, an <literal>ACCESS EXCLUSIVE</literal> lock
+ is acquired on it. This prevents any other database operations (both reads
+ and writes) from operating on the table until the <command>REPACK</command>
+ is finished.
+ </para>
+
+ <refsect2 id="sql-repack-notes-on-clustering" xreflabel="Notes on Clustering">
+ <title>Notes on Clustering</title>
+
+ <para>
+ When a table is clustered, it is physically reordered based on the index
+ information. Clustering is a one-time operation: when the table is
+ subsequently updated, the changes are not clustered. That is, no attempt
+ is made to store new or updated rows according to their index order. (If
+ one wishes, one can periodically recluster by issuing the command again.
+ Also, setting the table's <literal>fillfactor</literal> storage parameter
+ to less than 100% can aid in preserving cluster ordering during updates,
+ since updated rows are kept on the same page if enough space is available
+ there.)
+ </para>
+
+ <para>
+ In cases where you are accessing single rows randomly within a table, the
+ actual order of the data in the table is unimportant. However, if you tend
+ to access some data more than others, and there is an index that groups
+ them together, you will benefit from using <command>REPACK</command>. If
+ you are requesting a range of indexed values from a table, or a single
+ indexed value that has multiple rows that match,
+ <command>REPACK</command> will help because once the index identifies the
+ table page for the first row that matches, all other rows that match are
+ probably already on the same table page, and so you save disk accesses and
+ speed up the query.
+ </para>
+
+ <para>
+ <command>REPACK</command> can re-sort the table using either an index scan
+ on the specified index, or (if the index is a b-tree) a sequential scan
+ followed by sorting. It will attempt to choose the method that will be
+ faster, based on planner cost parameters and available statistical
+ information.
+ </para>
+
+ <para>
+ Because the planner records statistics about the ordering of tables, it is
+ advisable to
+ run <link linkend="sql-analyze"><command>ANALYZE</command></link> on the
+ newly repacked table. Otherwise, the planner might make poor choices of
+ query plans.
+ </para>
+ </refsect2>
+
+ <refsect2 id="sql-repack-notes-on-resources" xreflabel="Notes on Resources">
+ <title>Notes on Resources</title>
+
+ <para>
+ When an index scan or a sequential scan without sort is used, a temporary
+ copy of the table is created that contains the table data in the index
+ order. Temporary copies of each index on the table are created as well.
+ Therefore, you need free space on disk at least equal to the sum of the
+ table size and the index sizes.
+ </para>
+
+ <para>
+ When a sequential scan and sort is used, a temporary sort file is also
+ created, so that the peak temporary space requirement is as much as double
+ the table size, plus the index sizes. This method is often faster than
+ the index scan method, but if the disk space requirement is intolerable,
+ you can disable this choice by temporarily setting
+ <xref linkend="guc-enable-sort"/> to <literal>off</literal>.
+ </para>
+
+ <para>
+ It is advisable to set <xref linkend="guc-maintenance-work-mem"/> to a
+ reasonably large value (but not more than the amount of RAM you can
+ dedicate to the <command>REPACK</command> operation) before repacking.
+ </para>
+ </refsect2>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Parameters</title>
+
+ <variablelist>
+ <varlistentry>
+ <term><replaceable class="parameter">table_name</replaceable></term>
+ <listitem>
+ <para>
+ The name (possibly schema-qualified) of a table.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><replaceable class="parameter">index_name</replaceable></term>
+ <listitem>
+ <para>
+ The name of an index.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>VERBOSE</literal></term>
+ <listitem>
+ <para>
+ Prints a progress report as each table is repacked
+ at <literal>INFO</literal> level.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><replaceable class="parameter">boolean</replaceable></term>
+ <listitem>
+ <para>
+ Specifies whether the selected option should be turned on or off.
+ You can write <literal>TRUE</literal>, <literal>ON</literal>, or
+ <literal>1</literal> to enable the option, and <literal>FALSE</literal>,
+ <literal>OFF</literal>, or <literal>0</literal> to disable it. The
+ <replaceable class="parameter">boolean</replaceable> value can also
+ be omitted, in which case <literal>TRUE</literal> is assumed.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </refsect1>
+
+ <refsect1>
+ <title>Notes</title>
+
+ <para>
+ To repack a table, one must have the <literal>MAINTAIN</literal> privilege
+ on the table.
+ </para>
+
+ <para>
+ While <command>REPACK</command> is running, the <xref
+ linkend="guc-search-path"/> is temporarily changed to <literal>pg_catalog,
+ pg_temp</literal>.
+ </para>
+
+ <para>
+ Each backend running <command>REPACK</command> will report its progress
+ in the <structname>pg_stat_progress_repack</structname> view. See
+ <xref linkend="repack-progress-reporting"/> for details.
+ </para>
+
+ <para>
+ Repacking a partitioned table repacks each of its partitions. If an index
+ is specified, each partition is repacked using the partition of that
+ index. <command>REPACK</command> on a partitioned table cannot be executed
+ inside a transaction block.
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Examples</title>
+
+ <para>
+ Repack the table <literal>employees</literal>:
+<programlisting>
+REPACK employees;
+</programlisting>
+ </para>
+
+ <para>
+ Repack the table <literal>employees</literal> on the basis of its
+ index <literal>employees_ind</literal> (since an index is used here, this
+ effectively clusters the table):
+<programlisting>
+REPACK employees USING INDEX employees_ind;
+</programlisting>
+ </para>
+
+ <para>
+ Repack all tables in the database on which you have
+ the <literal>MAINTAIN</literal> privilege:
+<programlisting>
+REPACK;
+</programlisting></para>
+ </refsect1>
+
+ <refsect1>
+ <title>Compatibility</title>
+
+ <para>
+ There is no <command>REPACK</command> statement in the SQL standard.
+ </para>
+
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index bd5dcaf86a5..cee1cf3926c 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -98,6 +98,7 @@ VACUUM [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
<varlistentry>
<term><literal>FULL</literal></term>
<listitem>
+
<para>
Selects <quote>full</quote> vacuum, which can reclaim more
space, but takes much longer and exclusively locks the table.
@@ -106,6 +107,14 @@ VACUUM [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
the operation is complete. Usually this should only be used when a
significant amount of space needs to be reclaimed from within the table.
</para>
+
+ <warning>
+ <para>
+ The <option>FULL</option> parameter is deprecated in favor of
+ <xref linkend="sql-repack"/>.
+ </para>
+ </warning>
+
</listitem>
</varlistentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..229912d35b7 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -195,6 +195,7 @@
&refreshMaterializedView;
&reindex;
&releaseSavepoint;
+ &repack;
&reset;
&revoke;
&rollback;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index cb4bc35c93e..0b03070d394 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -741,13 +741,13 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
if (OldIndex != NULL && !use_sort)
{
const int ci_index[] = {
- PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_INDEX_RELID
+ PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_INDEX_RELID
};
int64 ci_val[2];
/* Set phase and OIDOldIndex to columns */
- ci_val[0] = PROGRESS_CLUSTER_PHASE_INDEX_SCAN_HEAP;
+ ci_val[0] = PROGRESS_REPACK_PHASE_INDEX_SCAN_HEAP;
ci_val[1] = RelationGetRelid(OldIndex);
pgstat_progress_update_multi_param(2, ci_index, ci_val);
@@ -759,15 +759,15 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
else
{
/* In scan-and-sort mode and also VACUUM FULL, set phase */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_SEQ_SCAN_HEAP);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_SEQ_SCAN_HEAP);
tableScan = table_beginscan(OldHeap, SnapshotAny, 0, (ScanKey) NULL);
heapScan = (HeapScanDesc) tableScan;
indexScan = NULL;
/* Set total heap blocks */
- pgstat_progress_update_param(PROGRESS_CLUSTER_TOTAL_HEAP_BLKS,
+ pgstat_progress_update_param(PROGRESS_REPACK_TOTAL_HEAP_BLKS,
heapScan->rs_nblocks);
}
@@ -809,7 +809,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
* is manually updated to the correct value when the table
* scan finishes.
*/
- pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_BLKS_SCANNED,
+ pgstat_progress_update_param(PROGRESS_REPACK_HEAP_BLKS_SCANNED,
heapScan->rs_nblocks);
break;
}
@@ -825,7 +825,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
*/
if (prev_cblock != heapScan->rs_cblock)
{
- pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_BLKS_SCANNED,
+ pgstat_progress_update_param(PROGRESS_REPACK_HEAP_BLKS_SCANNED,
(heapScan->rs_cblock +
heapScan->rs_nblocks -
heapScan->rs_startblock
@@ -912,14 +912,14 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
* In scan-and-sort mode, report increase in number of tuples
* scanned
*/
- pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED,
+ pgstat_progress_update_param(PROGRESS_REPACK_HEAP_TUPLES_SCANNED,
*num_tuples);
}
else
{
const int ct_index[] = {
- PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED,
- PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN
+ PROGRESS_REPACK_HEAP_TUPLES_SCANNED,
+ PROGRESS_REPACK_HEAP_TUPLES_WRITTEN
};
int64 ct_val[2];
@@ -952,14 +952,14 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
double n_tuples = 0;
/* Report that we are now sorting tuples */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_SORT_TUPLES);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_SORT_TUPLES);
tuplesort_performsort(tuplesort);
/* Report that we are now writing new heap */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_WRITE_NEW_HEAP);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_WRITE_NEW_HEAP);
for (;;)
{
@@ -977,7 +977,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
values, isnull,
rwstate);
/* Report n_tuples */
- pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN,
+ pgstat_progress_update_param(PROGRESS_REPACK_HEAP_TUPLES_WRITTEN,
n_tuples);
}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 739a92bdcc1..466cf0fdef6 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -4079,7 +4079,7 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
Assert(!ReindexIsProcessingIndex(indexOid));
/* Set index rebuild count */
- pgstat_progress_update_param(PROGRESS_CLUSTER_INDEX_REBUILD_COUNT,
+ pgstat_progress_update_param(PROGRESS_REPACK_INDEX_REBUILD_COUNT,
i);
i++;
}
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 08f780a2e63..7380b6e3d7b 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1271,6 +1271,32 @@ CREATE VIEW pg_stat_progress_cluster AS
FROM pg_stat_get_progress_info('CLUSTER') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
+CREATE VIEW pg_stat_progress_repack AS
+ SELECT
+ S.pid AS pid,
+ S.datid AS datid,
+ D.datname AS datname,
+ S.relid AS relid,
+ -- param1 is currently unused
+ CASE S.param2 WHEN 0 THEN 'initializing'
+ WHEN 1 THEN 'seq scanning heap'
+ WHEN 2 THEN 'index scanning heap'
+ WHEN 3 THEN 'sorting tuples'
+ WHEN 4 THEN 'writing new heap'
+ WHEN 5 THEN 'swapping relation files'
+ WHEN 6 THEN 'rebuilding index'
+ WHEN 7 THEN 'performing final cleanup'
+ END AS phase,
+ CAST(S.param3 AS oid) AS repack_index_relid,
+ S.param4 AS heap_tuples_scanned,
+ S.param5 AS heap_tuples_written,
+ S.param6 AS heap_blks_total,
+ S.param7 AS heap_blks_scanned,
+ S.param8 AS index_rebuild_count
+ FROM pg_stat_get_progress_info('REPACK') AS S
+ LEFT JOIN pg_database D ON S.datid = D.oid;
+
CREATE VIEW pg_stat_progress_create_index AS
SELECT
S.pid AS pid, S.datid AS datid, D.datname AS datname,
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 54a08e4102e..5fe8e6cb354 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -67,17 +67,24 @@ typedef struct
Oid indexOid;
} RelToCluster;
-
-static void cluster_multiple_rels(List *rtcs, ClusterParams *params);
-static void rebuild_relation(Relation OldHeap, Relation index, bool verbose);
+static void cluster_multiple_rels(List *rtcs, ClusterParams *params,
+ ClusterCommand cmd);
+static void rebuild_relation(Relation OldHeap, Relation index, bool verbose,
+ ClusterCommand cmd);
static void copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
bool verbose, bool *pSwapToastByContent,
TransactionId *pFreezeXid, MultiXactId *pCutoffMulti);
static List *get_tables_to_cluster(MemoryContext cluster_context);
+static List *get_tables_to_repack(MemoryContext repack_context);
static List *get_tables_to_cluster_partitioned(MemoryContext cluster_context,
- Oid indexOid);
-static bool cluster_is_permitted_for_relation(Oid relid, Oid userid);
-
+ Oid relid, bool rel_is_index,
+ ClusterCommand cmd);
+static bool cluster_is_permitted_for_relation(Oid relid, Oid userid,
+ ClusterCommand cmd);
+static Relation process_single_relation(RangeVar *relation, char *indexname,
+ ClusterParams *params,
+ ClusterCommand cmd,
+ Oid *indexOid_p);
/*---------------------------------------------------------------------------
* This cluster code allows for clustering multiple tables at once. Because
@@ -134,71 +141,11 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
if (stmt->relation != NULL)
{
/* This is the single-relation case. */
- Oid tableOid;
-
- /*
- * Find, lock, and check permissions on the table. We obtain
- * AccessExclusiveLock right away to avoid lock-upgrade hazard in the
- * single-transaction case.
- */
- tableOid = RangeVarGetRelidExtended(stmt->relation,
- AccessExclusiveLock,
- 0,
- RangeVarCallbackMaintainsTable,
- NULL);
- rel = table_open(tableOid, NoLock);
-
- /*
- * Reject clustering a remote temp table ... their local buffer
- * manager is not going to cope.
- */
- if (RELATION_IS_OTHER_TEMP(rel))
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot cluster temporary tables of other sessions")));
-
- if (stmt->indexname == NULL)
- {
- ListCell *index;
-
- /* We need to find the index that has indisclustered set. */
- foreach(index, RelationGetIndexList(rel))
- {
- indexOid = lfirst_oid(index);
- if (get_index_isclustered(indexOid))
- break;
- indexOid = InvalidOid;
- }
-
- if (!OidIsValid(indexOid))
- ereport(ERROR,
- (errcode(ERRCODE_UNDEFINED_OBJECT),
- errmsg("there is no previously clustered index for table \"%s\"",
- stmt->relation->relname)));
- }
- else
- {
- /*
- * The index is expected to be in the same namespace as the
- * relation.
- */
- indexOid = get_relname_relid(stmt->indexname,
- rel->rd_rel->relnamespace);
- if (!OidIsValid(indexOid))
- ereport(ERROR,
- (errcode(ERRCODE_UNDEFINED_OBJECT),
- errmsg("index \"%s\" for table \"%s\" does not exist",
- stmt->indexname, stmt->relation->relname)));
- }
-
- /* For non-partitioned tables, do what we came here to do. */
- if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
- {
- cluster_rel(rel, indexOid, ¶ms);
- /* cluster_rel closes the relation, but keeps lock */
-
+ rel = process_single_relation(stmt->relation, stmt->indexname,
+ ¶ms, CLUSTER_COMMAND_CLUSTER,
+ &indexOid);
+ if (rel == NULL)
return;
- }
}
/*
@@ -231,7 +178,9 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
{
Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
check_index_is_clusterable(rel, indexOid, AccessShareLock);
- rtcs = get_tables_to_cluster_partitioned(cluster_context, indexOid);
+ rtcs = get_tables_to_cluster_partitioned(cluster_context, indexOid,
+ true,
+ CLUSTER_COMMAND_CLUSTER);
/* close relation, releasing lock on parent table */
table_close(rel, AccessExclusiveLock);
@@ -243,7 +192,7 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
}
/* Do the job. */
- cluster_multiple_rels(rtcs, ¶ms);
+ cluster_multiple_rels(rtcs, ¶ms, CLUSTER_COMMAND_CLUSTER);
/* Start a new transaction for the cleanup work. */
StartTransactionCommand();
@@ -260,7 +209,7 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
* return.
*/
static void
-cluster_multiple_rels(List *rtcs, ClusterParams *params)
+cluster_multiple_rels(List *rtcs, ClusterParams *params, ClusterCommand cmd)
{
ListCell *lc;
@@ -283,7 +232,7 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
rel = table_open(rtc->tableOid, AccessExclusiveLock);
/* Process this table */
- cluster_rel(rel, rtc->indexOid, params);
+ cluster_rel(rel, rtc->indexOid, params, cmd);
/* cluster_rel closes the relation, but keeps lock */
PopActiveSnapshot();
@@ -306,9 +255,13 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
* If indexOid is InvalidOid, the table will be rewritten in physical order
* instead of index order. This is the new implementation of VACUUM FULL,
* and error messages should refer to the operation as VACUUM not CLUSTER.
+ *
+ * 'cmd' indicates which command is being executed. REPACK should be the only
+ * caller of this function in the future.
*/
void
-cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
+cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
+ ClusterCommand cmd)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid save_userid;
@@ -323,13 +276,26 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
/* Check for user-requested abort. */
CHECK_FOR_INTERRUPTS();
- pgstat_progress_start_command(PROGRESS_COMMAND_CLUSTER, tableOid);
- if (OidIsValid(indexOid))
- pgstat_progress_update_param(PROGRESS_CLUSTER_COMMAND,
+ if (cmd == CLUSTER_COMMAND_REPACK)
+ pgstat_progress_start_command(PROGRESS_COMMAND_REPACK, tableOid);
+ else
+ pgstat_progress_start_command(PROGRESS_COMMAND_CLUSTER, tableOid);
+
+ if (cmd == CLUSTER_COMMAND_REPACK)
+ pgstat_progress_update_param(PROGRESS_REPACK_COMMAND,
+ PROGRESS_REPACK_COMMAND_REPACK);
+ else if (OidIsValid(indexOid))
+ {
+ Assert(cmd == CLUSTER_COMMAND_CLUSTER);
+ pgstat_progress_update_param(PROGRESS_REPACK_COMMAND,
PROGRESS_CLUSTER_COMMAND_CLUSTER);
+ }
else
- pgstat_progress_update_param(PROGRESS_CLUSTER_COMMAND,
+ {
+ Assert(cmd == CLUSTER_COMMAND_VACUUM);
+ pgstat_progress_update_param(PROGRESS_REPACK_COMMAND,
PROGRESS_CLUSTER_COMMAND_VACUUM_FULL);
+ }
/*
* Switch to the table owner's userid, so that any index functions are run
@@ -353,7 +319,8 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
if (recheck)
{
/* Check that the user still has privileges for the relation */
- if (!cluster_is_permitted_for_relation(tableOid, save_userid))
+ if (!cluster_is_permitted_for_relation(tableOid, save_userid,
+ CLUSTER_COMMAND_CLUSTER))
{
relation_close(OldHeap, AccessExclusiveLock);
goto out;
@@ -403,8 +370,12 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
* would work in most respects, but the index would only get marked as
* indisclustered in the current database, leading to unexpected behavior
* if CLUSTER were later invoked in another database.
+ *
+ * REPACK does not set indisclustered. XXX Not sure I understand the
+ * comment above: how can an attribute be set "only in the current
+ * database"?
*/
- if (OidIsValid(indexOid) && OldHeap->rd_rel->relisshared)
+ if (cmd == CLUSTER_COMMAND_CLUSTER && OldHeap->rd_rel->relisshared)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("cannot cluster a shared catalog")));
@@ -415,21 +386,33 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
*/
if (RELATION_IS_OTHER_TEMP(OldHeap))
{
- if (OidIsValid(indexOid))
+ if (cmd == CLUSTER_COMMAND_CLUSTER)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("cannot cluster temporary tables of other sessions")));
+ else if (cmd == CLUSTER_COMMAND_REPACK)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot repack temporary tables of other sessions")));
+ }
else
+ {
+ Assert(cmd == CLUSTER_COMMAND_VACUUM);
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("cannot vacuum temporary tables of other sessions")));
+ }
}
/*
* Also check for active uses of the relation in the current transaction,
* including open scans and pending AFTER trigger events.
*/
- CheckTableNotInUse(OldHeap, OidIsValid(indexOid) ? "CLUSTER" : "VACUUM");
+ CheckTableNotInUse(OldHeap,
+ (cmd == CLUSTER_COMMAND_CLUSTER ?
+ "CLUSTER" : (cmd == CLUSTER_COMMAND_REPACK ?
+ "REPACK" : "VACUUM")));
/* Check heap and index are valid to cluster on */
if (OidIsValid(indexOid))
@@ -469,7 +452,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
TransferPredicateLocksToHeapRelation(OldHeap);
/* rebuild_relation does all the dirty work */
- rebuild_relation(OldHeap, index, verbose);
+ rebuild_relation(OldHeap, index, verbose, cmd);
/* rebuild_relation closes OldHeap, and index if valid */
out:
@@ -626,7 +609,8 @@ mark_index_clustered(Relation rel, Oid indexOid, bool is_internal)
* On exit, they are closed, but locks on them are not released.
*/
static void
-rebuild_relation(Relation OldHeap, Relation index, bool verbose)
+rebuild_relation(Relation OldHeap, Relation index, bool verbose,
+ ClusterCommand cmd)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid accessMethod = OldHeap->rd_rel->relam;
@@ -642,7 +626,7 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose)
Assert(CheckRelationLockedByMe(OldHeap, AccessExclusiveLock, false) &&
(index == NULL || CheckRelationLockedByMe(index, AccessExclusiveLock, false)));
- if (index)
+ if (index && cmd == CLUSTER_COMMAND_CLUSTER)
/* Mark the correct index as clustered */
mark_index_clustered(OldHeap, RelationGetRelid(index), true);
@@ -1458,8 +1442,8 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
int i;
/* Report that we are now swapping relation files */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_SWAP_REL_FILES);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_SWAP_REL_FILES);
/* Zero out possible results from swapped_relation_files */
memset(mapped_tables, 0, sizeof(mapped_tables));
@@ -1509,14 +1493,14 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
/* Report that we are now reindexing relations */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_REBUILD_INDEX);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_REBUILD_INDEX);
reindex_relation(NULL, OIDOldHeap, reindex_flags, &reindex_params);
/* Report that we are now doing clean up */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_FINAL_CLEANUP);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_FINAL_CLEANUP);
/*
* If the relation being rebuilt is pg_class, swap_relation_files()
@@ -1666,7 +1650,8 @@ get_tables_to_cluster(MemoryContext cluster_context)
index = (Form_pg_index) GETSTRUCT(indexTuple);
- if (!cluster_is_permitted_for_relation(index->indrelid, GetUserId()))
+ if (!cluster_is_permitted_for_relation(index->indrelid, GetUserId(),
+ CLUSTER_COMMAND_CLUSTER))
continue;
/* Use a permanent memory context for the result list */
@@ -1687,14 +1672,68 @@ get_tables_to_cluster(MemoryContext cluster_context)
}
/*
- * Given an index on a partitioned table, return a list of RelToCluster for
+ * Like get_tables_to_cluster(), but does not care about indexes.
+ */
+static List *
+get_tables_to_repack(MemoryContext repack_context)
+{
+ Relation relrelation;
+ TableScanDesc scan;
+ HeapTuple tuple;
+ MemoryContext old_context;
+ List *rtcs = NIL;
+
+ /*
+ * Get all relations that the current user has the appropriate privileges
+ * for.
+ */
+ relrelation = table_open(RelationRelationId, AccessShareLock);
+ scan = table_beginscan_catalog(relrelation, 0, NULL);
+ while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+ {
+ RelToCluster *rtc;
+ Form_pg_class relrelation = (Form_pg_class) GETSTRUCT(tuple);
+ Oid relid = relrelation->oid;
+ char relkind = get_rel_relkind(relid);
+
+ /* Only interested in relations. */
+ if (relkind != RELKIND_RELATION && relkind != RELKIND_MATVIEW)
+ continue;
+
+ if (!cluster_is_permitted_for_relation(relid, GetUserId(),
+ CLUSTER_COMMAND_REPACK))
+ continue;
+
+ /* Use a permanent memory context for the result list */
+ old_context = MemoryContextSwitchTo(repack_context);
+
+ rtc = (RelToCluster *) palloc(sizeof(RelToCluster));
+ rtc->tableOid = relid;
+ rtc->indexOid = InvalidOid;
+ rtcs = lappend(rtcs, rtc);
+
+ MemoryContextSwitchTo(old_context);
+ }
+ table_endscan(scan);
+
+ relation_close(relrelation, AccessShareLock);
+
+ return rtcs;
+}
+
+/*
+ * Given a partitioned table or its index, return a list of RelToCluster for
* all the children leaves tables/indexes.
*
* Like expand_vacuum_rel, but here caller must hold AccessExclusiveLock
* on the table containing the index.
+ *
+ * 'rel_is_index' tells whether 'relid' is that of an index (true) or of the
+ * owning relation.
*/
static List *
-get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
+get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid relid,
+ bool rel_is_index, ClusterCommand cmd)
{
List *inhoids;
ListCell *lc;
@@ -1702,17 +1741,33 @@ get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
MemoryContext old_context;
/* Do not lock the children until they're processed */
- inhoids = find_all_inheritors(indexOid, NoLock, NULL);
+ inhoids = find_all_inheritors(relid, NoLock, NULL);
foreach(lc, inhoids)
{
- Oid indexrelid = lfirst_oid(lc);
- Oid relid = IndexGetRelation(indexrelid, false);
+ Oid inhoid = lfirst_oid(lc);
+ Oid inhrelid,
+ inhindid;
RelToCluster *rtc;
- /* consider only leaf indexes */
- if (get_rel_relkind(indexrelid) != RELKIND_INDEX)
- continue;
+ if (rel_is_index)
+ {
+ /* consider only leaf indexes */
+ if (get_rel_relkind(inhoid) != RELKIND_INDEX)
+ continue;
+
+ inhrelid = IndexGetRelation(inhoid, false);
+ inhindid = inhoid;
+ }
+ else
+ {
+ /* consider only leaf relations */
+ if (get_rel_relkind(inhoid) != RELKIND_RELATION)
+ continue;
+
+ inhrelid = inhoid;
+ inhindid = InvalidOid;
+ }
/*
* It's possible that the user does not have privileges to CLUSTER the
@@ -1720,15 +1775,15 @@ get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
* table. We skip any partitions which the user is not permitted to
* CLUSTER.
*/
- if (!cluster_is_permitted_for_relation(relid, GetUserId()))
+ if (!cluster_is_permitted_for_relation(inhrelid, GetUserId(), cmd))
continue;
/* Use a permanent memory context for the result list */
old_context = MemoryContextSwitchTo(cluster_context);
rtc = (RelToCluster *) palloc(sizeof(RelToCluster));
- rtc->tableOid = relid;
- rtc->indexOid = indexrelid;
+ rtc->tableOid = inhrelid;
+ rtc->indexOid = inhindid;
rtcs = lappend(rtcs, rtc);
MemoryContextSwitchTo(old_context);
@@ -1742,13 +1797,211 @@ get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
* function emits a WARNING.
*/
static bool
-cluster_is_permitted_for_relation(Oid relid, Oid userid)
+cluster_is_permitted_for_relation(Oid relid, Oid userid, ClusterCommand cmd)
{
if (pg_class_aclcheck(relid, userid, ACL_MAINTAIN) == ACLCHECK_OK)
return true;
- ereport(WARNING,
- (errmsg("permission denied to cluster \"%s\", skipping it",
- get_rel_name(relid))));
+ if (cmd == CLUSTER_COMMAND_CLUSTER)
+ ereport(WARNING,
+ (errmsg("permission denied to cluster \"%s\", skipping it",
+ get_rel_name(relid))));
+ else
+ {
+ Assert(cmd == CLUSTER_COMMAND_REPACK);
+
+ ereport(WARNING,
+ (errmsg("permission denied to repack \"%s\", skipping it",
+ get_rel_name(relid))));
+ }
+
return false;
}
+
+/*
+ * REPACK is intended to be a replacement for both CLUSTER and VACUUM FULL.
+ */
+void
+repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
+{
+ ListCell *lc;
+ ClusterParams params = {0};
+ bool verbose = false;
+ Relation rel = NULL;
+ Oid indexOid = InvalidOid;
+ MemoryContext repack_context;
+ List *rtcs;
+
+ /* Parse option list */
+ foreach(lc, stmt->params)
+ {
+ DefElem *opt = (DefElem *) lfirst(lc);
+
+ if (strcmp(opt->defname, "verbose") == 0)
+ verbose = defGetBoolean(opt);
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("unrecognized REPACK option \"%s\"",
+ opt->defname),
+ parser_errposition(pstate, opt->location)));
+ }
+
+ params.options = (verbose ? CLUOPT_VERBOSE : 0);
+
+ if (stmt->relation != NULL)
+ {
+ /* This is the single-relation case. */
+ rel = process_single_relation(stmt->relation, stmt->indexname,
+ ¶ms, CLUSTER_COMMAND_REPACK,
+ &indexOid);
+ if (rel == NULL)
+ return;
+ }
+
+ /*
+ * By here, we know we are in a multi-table situation. In order to avoid
+ * holding locks for too long, we want to process each table in its own
+ * transaction. This forces us to disallow running inside a user
+ * transaction block.
+ */
+ PreventInTransactionBlock(isTopLevel, "REPACK");
+
+ /* Also, we need a memory context to hold our list of relations */
+ repack_context = AllocSetContextCreate(PortalContext,
+ "Repack",
+ ALLOCSET_DEFAULT_SIZES);
+
+ params.options |= CLUOPT_RECHECK;
+ if (rel != NULL)
+ {
+ Oid relid;
+ bool rel_is_index;
+
+ Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
+
+ if (OidIsValid(indexOid))
+ {
+ relid = indexOid;
+ rel_is_index = true;
+ }
+ else
+ {
+ relid = RelationGetRelid(rel);
+ rel_is_index = false;
+ }
+ rtcs = get_tables_to_cluster_partitioned(repack_context, relid,
+ rel_is_index,
+ CLUSTER_COMMAND_REPACK);
+
+ /* close relation, releasing lock on parent table */
+ table_close(rel, AccessExclusiveLock);
+ }
+ else
+ rtcs = get_tables_to_repack(repack_context);
+
+ /* Do the job. */
+ cluster_multiple_rels(rtcs, ¶ms, CLUSTER_COMMAND_REPACK);
+
+ /* Start a new transaction for the cleanup work. */
+ StartTransactionCommand();
+
+ /* Clean up working storage */
+ MemoryContextDelete(repack_context);
+}
+
+/*
+ * Process a single relation if it is a non-partitioned table or a leaf
+ * partition, and return NULL. Return the relation's relcache entry if the
+ * caller still needs to process it (because the relation is partitioned).
+ */
+static Relation
+process_single_relation(RangeVar *relation, char *indexname,
+ ClusterParams *params, ClusterCommand cmd,
+ Oid *indexOid_p)
+{
+ Relation rel;
+ Oid indexOid = InvalidOid;
+
+ /* This is the single-relation case. */
+ Oid tableOid;
+
+ /*
+ * Find, lock, and check permissions on the table. We obtain
+ * AccessExclusiveLock right away to avoid lock-upgrade hazard in the
+ * single-transaction case.
+ */
+ tableOid = RangeVarGetRelidExtended(relation,
+ AccessExclusiveLock,
+ 0,
+ RangeVarCallbackMaintainsTable,
+ NULL);
+ rel = table_open(tableOid, NoLock);
+
+ /*
+ * Reject clustering a remote temp table ... their local buffer manager is
+ * not going to cope.
+ */
+ if (RELATION_IS_OTHER_TEMP(rel))
+ {
+ if (cmd == CLUSTER_COMMAND_CLUSTER)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot cluster temporary tables of other sessions")));
+ else
+ {
+ Assert(cmd == CLUSTER_COMMAND_REPACK);
+
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot repack temporary tables of other sessions")));
+ }
+ }
+
+ if (indexname == NULL && cmd == CLUSTER_COMMAND_CLUSTER)
+ {
+ ListCell *index;
+
+ /* We need to find the index that has indisclustered set. */
+ foreach(index, RelationGetIndexList(rel))
+ {
+ indexOid = lfirst_oid(index);
+ if (get_index_isclustered(indexOid))
+ break;
+ indexOid = InvalidOid;
+ }
+
+ if (!OidIsValid(indexOid))
+ ereport(ERROR,
+ (errcode(ERRCODE_UNDEFINED_OBJECT),
+ errmsg("there is no previously clustered index for table \"%s\"",
+ relation->relname)));
+ }
+ else if (indexname != NULL)
+ {
+ /*
+ * The index is expected to be in the same namespace as the relation.
+ */
+ indexOid = get_relname_relid(indexname,
+ rel->rd_rel->relnamespace);
+ if (!OidIsValid(indexOid))
+ ereport(ERROR,
+ (errcode(ERRCODE_UNDEFINED_OBJECT),
+ errmsg("index \"%s\" for table \"%s\" does not exist",
+ indexname, relation->relname)));
+ }
+
+ *indexOid_p = indexOid;
+
+ /* For non-partitioned tables, do what we came here to do. */
+ if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+ {
+ cluster_rel(rel, indexOid, params, cmd);
+ /* cluster_rel closes the relation, but keeps lock */
+
+ return NULL;
+ }
+
+ return rel;
+}
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 33a33bf6b1c..0bdfcd90878 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -2265,7 +2265,8 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
cluster_params.options |= CLUOPT_VERBOSE;
/* VACUUM FULL is now a variant of CLUSTER; see cluster.c */
- cluster_rel(rel, InvalidOid, &cluster_params);
+ cluster_rel(rel, InvalidOid, &cluster_params,
+ CLUSTER_COMMAND_VACUUM);
/* cluster_rel closes the relation, but keeps lock */
rel = NULL;
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 0b5652071d1..5c41f866cd9 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -298,7 +298,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
GrantStmt GrantRoleStmt ImportForeignSchemaStmt IndexStmt InsertStmt
ListenStmt LoadStmt LockStmt MergeStmt NotifyStmt ExplainableStmt PreparableStmt
CreateFunctionStmt AlterFunctionStmt ReindexStmt RemoveAggrStmt
- RemoveFuncStmt RemoveOperStmt RenameStmt ReturnStmt RevokeStmt RevokeRoleStmt
+ RemoveFuncStmt RemoveOperStmt RenameStmt RepackStmt ReturnStmt RevokeStmt RevokeRoleStmt
RuleActionStmt RuleActionStmtOrEmpty RuleStmt
SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
UnlistenStmt UpdateStmt VacuumStmt
@@ -381,11 +381,11 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
%type <str> copy_file_name
access_method_clause attr_name
table_access_method_clause name cursor_name file_name
- cluster_index_specification
+ cluster_index_specification repack_index_specification
%type <list> func_name handler_name qual_Op qual_all_Op subquery_Op
opt_inline_handler opt_validator validator_clause
- opt_collate
+ opt_collate opt_repack_args
%type <range> qualified_name insert_target OptConstrFromTable
@@ -764,7 +764,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
QUOTE QUOTES
RANGE READ REAL REASSIGN RECURSIVE REF_P REFERENCES REFERENCING
- REFRESH REINDEX RELATIVE_P RELEASE RENAME REPEATABLE REPLACE REPLICA
+ REFRESH REINDEX RELATIVE_P RELEASE RENAME REPACK REPEATABLE REPLACE REPLICA
RESET RESTART RESTRICT RETURN RETURNING RETURNS REVOKE RIGHT ROLE ROLLBACK ROLLUP
ROUTINE ROUTINES ROW ROWS RULE
@@ -1100,6 +1100,7 @@ stmt:
| RemoveFuncStmt
| RemoveOperStmt
| RenameStmt
+ | RepackStmt
| RevokeStmt
| RevokeRoleStmt
| RuleStmt
@@ -11892,6 +11893,48 @@ cluster_index_specification:
| /*EMPTY*/ { $$ = NULL; }
;
+/*****************************************************************************
+ *
+ * QUERY:
+ * REPACK [ (options) ] [ <qualified_name> [ USING INDEX <index_name> ] ]
+ *
+ *****************************************************************************/
+
+RepackStmt:
+ REPACK opt_repack_args
+ {
+ RepackStmt *n = makeNode(RepackStmt);
+
+ n->relation = $2 ? (RangeVar *) linitial($2) : NULL;
+ n->indexname = $2 ? (char *) lsecond($2) : NULL;
+ n->params = NIL;
+ $$ = (Node *) n;
+ }
+
+ | REPACK '(' utility_option_list ')' opt_repack_args
+ {
+ RepackStmt *n = makeNode(RepackStmt);
+
+ n->relation = $5 ? (RangeVar *) linitial($5) : NULL;
+ n->indexname = $5 ? (char *) lsecond($5) : NULL;
+ n->params = $3;
+ $$ = (Node *) n;
+ }
+ ;
+
+opt_repack_args:
+ qualified_name repack_index_specification
+ {
+ $$ = list_make2($1, $2);
+ }
+ | /*EMPTY*/ { $$ = NIL; }
+ ;
+
+repack_index_specification:
+ ExistingIndex
+ | /*EMPTY*/ { $$ = NULL; }
+ ;
+
/*****************************************************************************
*
@@ -17940,6 +17983,7 @@ unreserved_keyword:
| RELATIVE_P
| RELEASE
| RENAME
+ | REPACK
| REPEATABLE
| REPLACE
| REPLICA
@@ -18572,6 +18616,7 @@ bare_label_keyword:
| RELATIVE_P
| RELEASE
| RENAME
+ | REPACK
| REPEATABLE
| REPLACE
| REPLICA
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 25fe3d58016..6acdff4606f 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -280,6 +280,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_ClusterStmt:
case T_ReindexStmt:
case T_VacuumStmt:
+ case T_RepackStmt:
{
/*
* These commands write WAL, so they're not strictly
@@ -862,6 +863,10 @@ standard_ProcessUtility(PlannedStmt *pstmt,
ExecVacuum(pstate, (VacuumStmt *) parsetree, isTopLevel);
break;
+ case T_RepackStmt:
+ repack(pstate, (RepackStmt *) parsetree, isTopLevel);
+ break;
+
case T_ExplainStmt:
ExplainQuery(pstate, (ExplainStmt *) parsetree, params, dest);
break;
@@ -2869,6 +2874,10 @@ CreateCommandTag(Node *parsetree)
tag = CMDTAG_ANALYZE;
break;
+ case T_RepackStmt:
+ tag = CMDTAG_REPACK;
+ break;
+
case T_ExplainStmt:
tag = CMDTAG_EXPLAIN;
break;
@@ -3510,6 +3519,10 @@ GetCommandLogLevel(Node *parsetree)
lev = LOGSTMT_DDL;
break;
+ case T_RepackStmt:
+ lev = LOGSTMT_DDL;
+ break;
+
case T_VacuumStmt:
lev = LOGSTMT_ALL;
break;
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 1c12ddbae49..b2ad8ba45cd 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -268,6 +268,8 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
cmdtype = PROGRESS_COMMAND_ANALYZE;
else if (pg_strcasecmp(cmd, "CLUSTER") == 0)
cmdtype = PROGRESS_COMMAND_CLUSTER;
+ else if (pg_strcasecmp(cmd, "REPACK") == 0)
+ cmdtype = PROGRESS_COMMAND_REPACK;
else if (pg_strcasecmp(cmd, "CREATE INDEX") == 0)
cmdtype = PROGRESS_COMMAND_CREATE_INDEX;
else if (pg_strcasecmp(cmd, "BASEBACKUP") == 0)
diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index ec65ab79fec..03c7be47855 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -1223,7 +1223,7 @@ static const char *const sql_commands[] = {
"DELETE FROM", "DISCARD", "DO", "DROP", "END", "EXECUTE", "EXPLAIN",
"FETCH", "GRANT", "IMPORT FOREIGN SCHEMA", "INSERT INTO", "LISTEN", "LOAD", "LOCK",
"MERGE INTO", "MOVE", "NOTIFY", "PREPARE",
- "REASSIGN", "REFRESH MATERIALIZED VIEW", "REINDEX", "RELEASE",
+ "REASSIGN", "REFRESH MATERIALIZED VIEW", "REINDEX", "RELEASE", "REPACK",
"RESET", "REVOKE", "ROLLBACK",
"SAVEPOINT", "SECURITY LABEL", "SELECT", "SET", "SHOW", "START",
"TABLE", "TRUNCATE", "UNLISTEN", "UPDATE", "VACUUM", "VALUES", "WITH",
@@ -4919,6 +4919,37 @@ match_previous_words(int pattern_id,
COMPLETE_WITH_QUERY(Query_for_list_of_tablespaces);
}
+/* REPACK */
+ else if (Matches("REPACK"))
+ COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_clusterables);
+ else if (Matches("REPACK", "(*)"))
+ COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_clusterables);
+ /* If we have REPACK <sth>, then add "USING INDEX" */
+ else if (Matches("REPACK", MatchAnyExcept("(")))
+ COMPLETE_WITH("USING INDEX");
+ /* If we have REPACK (*) <sth>, then add "USING INDEX" */
+ else if (Matches("REPACK", "(*)", MatchAny))
+ COMPLETE_WITH("USING INDEX");
+ /* If we have REPACK <sth> USING, then add the index as well */
+ else if (Matches("REPACK", MatchAny, "USING", "INDEX"))
+ {
+ set_completion_reference(prev3_wd);
+ COMPLETE_WITH_SCHEMA_QUERY(Query_for_index_of_table);
+ }
+ else if (HeadMatches("REPACK", "(*") &&
+ !HeadMatches("REPACK", "(*)"))
+ {
+ /*
+ * This fires if we're in an unfinished parenthesized option list.
+ * get_previous_words treats a completed parenthesized option list as
+ * one word, so the above test is correct.
+ */
+ if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
+ COMPLETE_WITH("VERBOSE");
+ else if (TailMatches("VERBOSE"))
+ COMPLETE_WITH("ON", "OFF");
+ }
+
/* SECURITY LABEL */
else if (Matches("SECURITY"))
COMPLETE_WITH("LABEL");
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 60088a64cbb..3be57c97b3f 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -31,8 +31,24 @@ typedef struct ClusterParams
bits32 options; /* bitmask of CLUOPT_* */
} ClusterParams;
+/*
+ * cluster.c currently implements three nearly identical commands: CLUSTER,
+ * VACUUM FULL and REPACK. Where needed, use this enumeration to distinguish
+ * which of these commands is being executed.
+ *
+ * Remove this stuff when removing the (now deprecated) CLUSTER and VACUUM
+ * FULL commands.
+ */
+typedef enum ClusterCommand
+{
+ CLUSTER_COMMAND_CLUSTER,
+ CLUSTER_COMMAND_REPACK,
+ CLUSTER_COMMAND_VACUUM
+} ClusterCommand;
+
extern void cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel);
-extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params);
+extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
+ ClusterCommand cmd);
extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid,
LOCKMODE lockmode);
extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
@@ -48,4 +64,5 @@ extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
MultiXactId cutoffMulti,
char newrelpersistence);
+extern void repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel);
#endif /* CLUSTER_H */
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 7c736e7b03b..f92ff524031 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -56,24 +56,55 @@
#define PROGRESS_ANALYZE_PHASE_COMPUTE_EXT_STATS 4
#define PROGRESS_ANALYZE_PHASE_FINALIZE_ANALYZE 5
-/* Progress parameters for cluster */
-#define PROGRESS_CLUSTER_COMMAND 0
-#define PROGRESS_CLUSTER_PHASE 1
-#define PROGRESS_CLUSTER_INDEX_RELID 2
-#define PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED 3
-#define PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN 4
-#define PROGRESS_CLUSTER_TOTAL_HEAP_BLKS 5
-#define PROGRESS_CLUSTER_HEAP_BLKS_SCANNED 6
-#define PROGRESS_CLUSTER_INDEX_REBUILD_COUNT 7
-
-/* Phases of cluster (as advertised via PROGRESS_CLUSTER_PHASE) */
-#define PROGRESS_CLUSTER_PHASE_SEQ_SCAN_HEAP 1
-#define PROGRESS_CLUSTER_PHASE_INDEX_SCAN_HEAP 2
-#define PROGRESS_CLUSTER_PHASE_SORT_TUPLES 3
-#define PROGRESS_CLUSTER_PHASE_WRITE_NEW_HEAP 4
-#define PROGRESS_CLUSTER_PHASE_SWAP_REL_FILES 5
-#define PROGRESS_CLUSTER_PHASE_REBUILD_INDEX 6
-#define PROGRESS_CLUSTER_PHASE_FINAL_CLEANUP 7
+/*
+ * Progress parameters for REPACK.
+ *
+ * Note: Since REPACK shares some code with CLUSTER, these values are also
+ * used by CLUSTER. (CLUSTER is now deprecated, so it makes little sense to
+ * introduce a separate set of constants.)
+ */
+#define PROGRESS_REPACK_COMMAND 0
+#define PROGRESS_REPACK_PHASE 1
+#define PROGRESS_REPACK_INDEX_RELID 2
+#define PROGRESS_REPACK_HEAP_TUPLES_SCANNED 3
+#define PROGRESS_REPACK_HEAP_TUPLES_WRITTEN 4
+#define PROGRESS_REPACK_TOTAL_HEAP_BLKS 5
+#define PROGRESS_REPACK_HEAP_BLKS_SCANNED 6
+#define PROGRESS_REPACK_INDEX_REBUILD_COUNT 7
+
+/*
+ * Phases of repack (as advertised via PROGRESS_REPACK_PHASE).
+ *
+ * Note: Since REPACK shares some code with CLUSTER, (some of) these values
+ * are also used by CLUSTER. (CLUSTER is now deprecated, so it makes no sense
+ * to introduce separate set of constants.)
+ */
+#define PROGRESS_REPACK_PHASE_SEQ_SCAN_HEAP 1
+#define PROGRESS_REPACK_PHASE_INDEX_SCAN_HEAP 2
+#define PROGRESS_REPACK_PHASE_SORT_TUPLES 3
+#define PROGRESS_REPACK_PHASE_WRITE_NEW_HEAP 4
+#define PROGRESS_REPACK_PHASE_SWAP_REL_FILES 5
+#define PROGRESS_REPACK_PHASE_REBUILD_INDEX 6
+#define PROGRESS_REPACK_PHASE_FINAL_CLEANUP 7
+
+/*
+ * Commands of PROGRESS_REPACK
+ *
+ * Currently we only have one command, so the PROGRESS_REPACK_COMMAND
+ * parameter is not necessary. However, it makes cluster.c simpler if we have
+ * the same set of parameters for CLUSTER and REPACK - see the note on REPACK
+ * parameters above.
+ */
+#define PROGRESS_REPACK_COMMAND_REPACK 1
+
+/*
+ * Progress parameters for cluster.
+ *
+ * Although we need to report REPACK and CLUSTER in separate views, the
+ * parameters and phases of CLUSTER are a subset of those of REPACK. Therefore
+ * we just use the appropriate values defined for REPACK above instead of
+ * defining a separate set of constants here.
+ */
/* Commands of PROGRESS_CLUSTER */
#define PROGRESS_CLUSTER_COMMAND_CLUSTER 1
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index dd00ab420b8..ecc31a107cd 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3928,6 +3928,19 @@ typedef struct ClusterStmt
List *params; /* list of DefElem nodes */
} ClusterStmt;
+/* ----------------------
+ * Repack Statement
+ * ----------------------
+ */
+typedef struct RepackStmt
+{
+ NodeTag type;
+ RangeVar *relation; /* relation being repacked */
+ char *indexname; /* order tuples by this index */
+ List *params; /* list of DefElem nodes */
+} RepackStmt;
+
+
/* ----------------------
* Vacuum and Analyze Statements
*
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index a4af3f717a1..22559369e2c 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -374,6 +374,7 @@ PG_KEYWORD("reindex", REINDEX, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("relative", RELATIVE_P, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("release", RELEASE, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("rename", RENAME, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("repack", REPACK, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("repeatable", REPEATABLE, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("replace", REPLACE, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("replica", REPLICA, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..cceb312f2b3 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -196,6 +196,7 @@ PG_CMDTAG(CMDTAG_REASSIGN_OWNED, "REASSIGN OWNED", false, false, false)
PG_CMDTAG(CMDTAG_REFRESH_MATERIALIZED_VIEW, "REFRESH MATERIALIZED VIEW", true, false, false)
PG_CMDTAG(CMDTAG_REINDEX, "REINDEX", true, false, false)
PG_CMDTAG(CMDTAG_RELEASE, "RELEASE", false, false, false)
+PG_CMDTAG(CMDTAG_REPACK, "REPACK", false, false, false)
PG_CMDTAG(CMDTAG_RESET, "RESET", false, false, false)
PG_CMDTAG(CMDTAG_REVOKE, "REVOKE", true, false, false)
PG_CMDTAG(CMDTAG_REVOKE_ROLE, "REVOKE ROLE", false, false, false)
diff --git a/src/include/utils/backend_progress.h b/src/include/utils/backend_progress.h
index dda813ab407..e69e366dcdc 100644
--- a/src/include/utils/backend_progress.h
+++ b/src/include/utils/backend_progress.h
@@ -28,6 +28,7 @@ typedef enum ProgressCommandType
PROGRESS_COMMAND_CREATE_INDEX,
PROGRESS_COMMAND_BASEBACKUP,
PROGRESS_COMMAND_COPY,
+ PROGRESS_COMMAND_REPACK,
} ProgressCommandType;
#define PGSTAT_NUM_PROGRESS_PARAM 20
diff --git a/src/test/regress/expected/cluster.out b/src/test/regress/expected/cluster.out
index 4d40a6809ab..e9fd7512710 100644
--- a/src/test/regress/expected/cluster.out
+++ b/src/test/regress/expected/cluster.out
@@ -254,6 +254,63 @@ ORDER BY 1;
clstr_tst_pkey
(3 rows)
+-- REPACK handles individual tables identically to CLUSTER, but it's worth
+-- checking if it handles table hierarchies identically as well.
+REPACK clstr_tst USING INDEX clstr_tst_c;
+-- Verify that inheritance link still works
+INSERT INTO clstr_tst_inh VALUES (0, 100, 'in child table 2');
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst;
+ a | b | c | substring | length
+----+-----+------------------+--------------------------------+--------
+ 10 | 14 | catorce | |
+ 18 | 5 | cinco | |
+ 9 | 4 | cuatro | |
+ 26 | 19 | diecinueve | |
+ 12 | 18 | dieciocho | |
+ 30 | 16 | dieciseis | |
+ 24 | 17 | diecisiete | |
+ 2 | 10 | diez | |
+ 23 | 12 | doce | |
+ 11 | 2 | dos | |
+ 25 | 9 | nueve | |
+ 31 | 8 | ocho | |
+ 1 | 11 | once | |
+ 28 | 15 | quince | |
+ 32 | 6 | seis | xyzzyxyzzyxyzzyxyzzyxyzzyxyzzy | 500000
+ 29 | 7 | siete | |
+ 15 | 13 | trece | |
+ 22 | 30 | treinta | |
+ 17 | 32 | treinta y dos | |
+ 3 | 31 | treinta y uno | |
+ 5 | 3 | tres | |
+ 20 | 1 | uno | |
+ 6 | 20 | veinte | |
+ 14 | 25 | veinticinco | |
+ 21 | 24 | veinticuatro | |
+ 4 | 22 | veintidos | |
+ 19 | 29 | veintinueve | |
+ 16 | 28 | veintiocho | |
+ 27 | 26 | veintiseis | |
+ 13 | 27 | veintisiete | |
+ 7 | 23 | veintitres | |
+ 8 | 21 | veintiuno | |
+ 0 | 100 | in child table | |
+ 0 | 100 | in child table 2 | |
+(34 rows)
+
+-- Verify that foreign key link still works
+INSERT INTO clstr_tst (b, c) VALUES (1111, 'this should fail');
+ERROR: insert or update on table "clstr_tst" violates foreign key constraint "clstr_tst_con"
+DETAIL: Key (b)=(1111) is not present in table "clstr_tst_s".
+SELECT conname FROM pg_constraint WHERE conrelid = 'clstr_tst'::regclass
+ORDER BY 1;
+ conname
+----------------------
+ clstr_tst_a_not_null
+ clstr_tst_con
+ clstr_tst_pkey
+(3 rows)
+
SELECT relname, relkind,
EXISTS(SELECT 1 FROM pg_class WHERE oid = c.reltoastrelid) AS hastoast
FROM pg_class c WHERE relname LIKE 'clstr_tst%' ORDER BY relname;
@@ -381,6 +438,35 @@ SELECT * FROM clstr_1;
2
(2 rows)
+-- REPACK w/o argument performs no ordering, so we can only check which tables
+-- have the relfilenode changed.
+RESET SESSION AUTHORIZATION;
+CREATE TEMP TABLE relnodes_old AS
+(SELECT relname, relfilenode
+FROM pg_class
+WHERE relname IN ('clstr_1', 'clstr_2', 'clstr_3'));
+SET SESSION AUTHORIZATION regress_clstr_user;
+SET client_min_messages = ERROR; -- order of "skipping" warnings may vary
+REPACK;
+RESET client_min_messages;
+RESET SESSION AUTHORIZATION;
+CREATE TEMP TABLE relnodes_new AS
+(SELECT relname, relfilenode
+FROM pg_class
+WHERE relname IN ('clstr_1', 'clstr_2', 'clstr_3'));
+-- Do the actual comparison. Unlike CLUSTER, clstr_3 should have been
+-- processed because there is nothing like a clustering index here.
+SELECT o.relname FROM relnodes_old o
+JOIN relnodes_new n ON o.relname = n.relname
+WHERE o.relfilenode <> n.relfilenode
+ORDER BY o.relname;
+ relname
+---------
+ clstr_1
+ clstr_3
+(2 rows)
+
+SET SESSION AUTHORIZATION regress_clstr_user;
-- Test MVCC-safety of cluster. There isn't much we can do to verify the
-- results with a single backend...
CREATE TABLE clustertest (key int PRIMARY KEY);
@@ -495,6 +581,43 @@ ALTER TABLE clstrpart SET WITHOUT CLUSTER;
ERROR: cannot mark index clustered in partitioned table
ALTER TABLE clstrpart CLUSTER ON clstrpart_idx;
ERROR: cannot mark index clustered in partitioned table
+-- Check that REPACK sets new relfilenodes: it should process exactly the same
+-- tables as CLUSTER did.
+DROP TABLE old_cluster_info;
+DROP TABLE new_cluster_info;
+CREATE TEMP TABLE old_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+REPACK clstrpart USING INDEX clstrpart_idx;
+CREATE TEMP TABLE new_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM old_cluster_info AS old JOIN new_cluster_info AS new USING (relname) ORDER BY relname COLLATE "C";
+ relname | level | relkind | ?column?
+-------------+-------+---------+----------
+ clstrpart | 0 | p | t
+ clstrpart1 | 1 | p | t
+ clstrpart11 | 2 | r | f
+ clstrpart12 | 2 | p | t
+ clstrpart2 | 1 | r | f
+ clstrpart3 | 1 | p | t
+ clstrpart33 | 2 | r | f
+(7 rows)
+
+-- And finally the same for REPACK w/o index.
+DROP TABLE old_cluster_info;
+DROP TABLE new_cluster_info;
+CREATE TEMP TABLE old_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+REPACK clstrpart;
+CREATE TEMP TABLE new_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM old_cluster_info AS old JOIN new_cluster_info AS new USING (relname) ORDER BY relname COLLATE "C";
+ relname | level | relkind | ?column?
+-------------+-------+---------+----------
+ clstrpart | 0 | p | t
+ clstrpart1 | 1 | p | t
+ clstrpart11 | 2 | r | f
+ clstrpart12 | 2 | p | t
+ clstrpart2 | 1 | r | f
+ clstrpart3 | 1 | p | t
+ clstrpart33 | 2 | r | f
+(7 rows)
+
DROP TABLE clstrpart;
-- Ownership of partitions is checked
CREATE TABLE ptnowner(i int unique) PARTITION BY LIST (i);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 6cf828ca8d0..328235044d9 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2062,6 +2062,29 @@ pg_stat_progress_create_index| SELECT s.pid,
s.param15 AS partitions_done
FROM (pg_stat_get_progress_info('CREATE INDEX'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
+pg_stat_progress_repack| SELECT s.pid,
+ s.datid,
+ d.datname,
+ s.relid,
+ CASE s.param2
+ WHEN 0 THEN 'initializing'::text
+ WHEN 1 THEN 'seq scanning heap'::text
+ WHEN 2 THEN 'index scanning heap'::text
+ WHEN 3 THEN 'sorting tuples'::text
+ WHEN 4 THEN 'writing new heap'::text
+ WHEN 5 THEN 'swapping relation files'::text
+ WHEN 6 THEN 'rebuilding index'::text
+ WHEN 7 THEN 'performing final cleanup'::text
+ ELSE NULL::text
+ END AS phase,
+ (s.param3)::oid AS repack_index_relid,
+ s.param4 AS heap_tuples_scanned,
+ s.param5 AS heap_tuples_written,
+ s.param6 AS heap_blks_total,
+ s.param7 AS heap_blks_scanned,
+ s.param8 AS index_rebuild_count
+ FROM (pg_stat_get_progress_info('REPACK'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
+ LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_progress_vacuum| SELECT s.pid,
s.datid,
d.datname,
diff --git a/src/test/regress/sql/cluster.sql b/src/test/regress/sql/cluster.sql
index b7115f86104..cfcc3dc9761 100644
--- a/src/test/regress/sql/cluster.sql
+++ b/src/test/regress/sql/cluster.sql
@@ -76,6 +76,19 @@ INSERT INTO clstr_tst (b, c) VALUES (1111, 'this should fail');
SELECT conname FROM pg_constraint WHERE conrelid = 'clstr_tst'::regclass
ORDER BY 1;
+-- REPACK handles individual tables identically to CLUSTER, but it's worth
+-- checking if it handles table hierarchies identically as well.
+REPACK clstr_tst USING INDEX clstr_tst_c;
+
+-- Verify that inheritance link still works
+INSERT INTO clstr_tst_inh VALUES (0, 100, 'in child table 2');
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst;
+
+-- Verify that foreign key link still works
+INSERT INTO clstr_tst (b, c) VALUES (1111, 'this should fail');
+
+SELECT conname FROM pg_constraint WHERE conrelid = 'clstr_tst'::regclass
+ORDER BY 1;
SELECT relname, relkind,
EXISTS(SELECT 1 FROM pg_class WHERE oid = c.reltoastrelid) AS hastoast
@@ -159,6 +172,34 @@ INSERT INTO clstr_1 VALUES (1);
CLUSTER clstr_1;
SELECT * FROM clstr_1;
+-- REPACK w/o argument performs no ordering, so we can only check which tables
+-- have the relfilenode changed.
+RESET SESSION AUTHORIZATION;
+CREATE TEMP TABLE relnodes_old AS
+(SELECT relname, relfilenode
+FROM pg_class
+WHERE relname IN ('clstr_1', 'clstr_2', 'clstr_3'));
+
+SET SESSION AUTHORIZATION regress_clstr_user;
+SET client_min_messages = ERROR; -- order of "skipping" warnings may vary
+REPACK;
+RESET client_min_messages;
+
+RESET SESSION AUTHORIZATION;
+CREATE TEMP TABLE relnodes_new AS
+(SELECT relname, relfilenode
+FROM pg_class
+WHERE relname IN ('clstr_1', 'clstr_2', 'clstr_3'));
+
+-- Do the actual comparison. Unlike CLUSTER, clstr_3 should have been
+-- processed because there is nothing like a clustering index here.
+SELECT o.relname FROM relnodes_old o
+JOIN relnodes_new n ON o.relname = n.relname
+WHERE o.relfilenode <> n.relfilenode
+ORDER BY o.relname;
+
+SET SESSION AUTHORIZATION regress_clstr_user;
+
-- Test MVCC-safety of cluster. There isn't much we can do to verify the
-- results with a single backend...
@@ -229,6 +270,24 @@ SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM o
CLUSTER clstrpart;
ALTER TABLE clstrpart SET WITHOUT CLUSTER;
ALTER TABLE clstrpart CLUSTER ON clstrpart_idx;
+
+-- Check that REPACK sets new relfilenodes: it should process exactly the same
+-- tables as CLUSTER did.
+DROP TABLE old_cluster_info;
+DROP TABLE new_cluster_info;
+CREATE TEMP TABLE old_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+REPACK clstrpart USING INDEX clstrpart_idx;
+CREATE TEMP TABLE new_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM old_cluster_info AS old JOIN new_cluster_info AS new USING (relname) ORDER BY relname COLLATE "C";
+
+-- And finally the same for REPACK w/o index.
+DROP TABLE old_cluster_info;
+DROP TABLE new_cluster_info;
+CREATE TEMP TABLE old_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+REPACK clstrpart;
+CREATE TEMP TABLE new_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM old_cluster_info AS old JOIN new_cluster_info AS new USING (relname) ORDER BY relname COLLATE "C";
+
DROP TABLE clstrpart;
-- Ownership of partitions is checked
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a8346cda633..f6c77dc9c69 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -426,6 +426,7 @@ ClientCertName
ClientConnectionInfo
ClientData
ClientSocket
+ClusterCommand
ClonePtrType
ClosePortalStmt
ClosePtrType
@@ -2523,6 +2524,7 @@ ReorderBufferTupleCidKey
ReorderBufferUpdateProgressTxnCB
ReorderTuple
RepOriginId
+RepackStmt
ReparameterizeForeignPathByChild_function
ReplaceVarsFromTargetList_context
ReplaceVarsNoMatchOption
--
2.43.5
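As a quick orientation for reviewers, the grammar, tab completion and regression tests added by the patch above imply invocations like the following. This is a hedged sketch reconstructed from the tests (the table and index names come from the regression suite; the option list shown is only what the tab-completion hunk advertises):

```sql
-- Rewrite one table, ordering tuples by an index (CLUSTER-like):
REPACK clstr_tst USING INDEX clstr_tst_c;

-- Rewrite one table without any ordering (VACUUM FULL-like):
REPACK clstr_tst;

-- Rewrite all tables the current user is permitted to process:
REPACK;

-- With a parenthesized option list, as offered by psql completion:
REPACK (VERBOSE ON) clstr_tst USING INDEX clstr_tst_c;
```

Note that, per the tests, the no-index form processes tables such as clstr_3 that CLUSTER would skip for lack of a clustering index.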
Attachment: v14-0002-Move-conversion-of-a-historic-to-MVCC-snapshot-to-a-.patch (text/x-diff)
From 795bc620a0cd2c909fa5a2f678f2f998e3b15418 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Mon, 9 Jun 2025 12:00:42 +0200
Subject: [PATCH 2/7] Move conversion of a "historic" to MVCC snapshot to a
separate function.
The conversion is now handled by SnapBuildMVCCFromHistoric(). REPACK
CONCURRENTLY will also need it.
---
src/backend/replication/logical/snapbuild.c | 51 +++++++++++++++++----
src/backend/utils/time/snapmgr.c | 3 +-
src/include/replication/snapbuild.h | 1 +
src/include/utils/snapmgr.h | 1 +
4 files changed, 45 insertions(+), 11 deletions(-)
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 0d7bddbe4ed..feaa3ac5ad4 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -440,10 +440,7 @@ Snapshot
SnapBuildInitialSnapshot(SnapBuild *builder)
{
Snapshot snap;
- TransactionId xid;
TransactionId safeXid;
- TransactionId *newxip;
- int newxcnt = 0;
Assert(XactIsoLevel == XACT_REPEATABLE_READ);
Assert(builder->building_full_snapshot);
@@ -485,6 +482,31 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
MyProc->xmin = snap->xmin;
+ /* Convert the historic snapshot to MVCC snapshot. */
+ return SnapBuildMVCCFromHistoric(snap, true);
+}
+
+/*
+ * Turn a historic MVCC snapshot into an ordinary MVCC snapshot.
+ *
+ * Unlike a regular (non-historic) MVCC snapshot, the xip array of this
+ * snapshot contains not only running main transactions, but also their
+ * subtransactions. This difference has no impact on XidInMVCCSnapshot().
+ *
+ * Pass true for 'in_place' if you don't care about modifying the source
+ * snapshot. If you need a new instance, and one that was allocated as a
+ * single chunk of memory, pass false.
+ */
+Snapshot
+SnapBuildMVCCFromHistoric(Snapshot snapshot, bool in_place)
+{
+ TransactionId xid;
+ TransactionId *oldxip = snapshot->xip;
+ uint32 oldxcnt = snapshot->xcnt;
+ TransactionId *newxip;
+ int newxcnt = 0;
+ Snapshot result;
+
/* allocate in transaction context */
newxip = (TransactionId *)
palloc(sizeof(TransactionId) * GetMaxSnapshotXidCount());
@@ -495,7 +517,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
* classical snapshot by marking all non-committed transactions as
* in-progress. This can be expensive.
*/
- for (xid = snap->xmin; NormalTransactionIdPrecedes(xid, snap->xmax);)
+ for (xid = snapshot->xmin; NormalTransactionIdPrecedes(xid, snapshot->xmax);)
{
void *test;
@@ -503,7 +525,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
* Check whether transaction committed using the decoding snapshot
* meaning of ->xip.
*/
- test = bsearch(&xid, snap->xip, snap->xcnt,
+ test = bsearch(&xid, snapshot->xip, snapshot->xcnt,
sizeof(TransactionId), xidComparator);
if (test == NULL)
@@ -520,11 +542,22 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
}
/* adjust remaining snapshot fields as needed */
- snap->snapshot_type = SNAPSHOT_MVCC;
- snap->xcnt = newxcnt;
- snap->xip = newxip;
+ snapshot->xcnt = newxcnt;
+ snapshot->xip = newxip;
+
+ if (in_place)
+ result = snapshot;
+ else
+ {
+ result = CopySnapshot(snapshot);
+
+ /* Restore the original values so the source is intact. */
+ snapshot->xip = oldxip;
+ snapshot->xcnt = oldxcnt;
+ }
+ result->snapshot_type = SNAPSHOT_MVCC;
- return snap;
+ return result;
}
/*
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index ea35f30f494..70a6b8902d1 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -212,7 +212,6 @@ typedef struct ExportedSnapshot
static List *exportedSnapshots = NIL;
/* Prototypes for local functions */
-static Snapshot CopySnapshot(Snapshot snapshot);
static void UnregisterSnapshotNoOwner(Snapshot snapshot);
static void FreeSnapshot(Snapshot snapshot);
static void SnapshotResetXmin(void);
@@ -591,7 +590,7 @@ SetTransactionSnapshot(Snapshot sourcesnap, VirtualTransactionId *sourcevxid,
* The copy is palloc'd in TopTransactionContext and has initial refcounts set
* to 0. The returned snapshot has the copied flag set.
*/
-static Snapshot
+Snapshot
CopySnapshot(Snapshot snapshot)
{
Snapshot newsnap;
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 44031dcf6e3..6d4d2d1814c 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -73,6 +73,7 @@ extern void FreeSnapshotBuilder(SnapBuild *builder);
extern void SnapBuildSnapDecRefcount(Snapshot snap);
extern Snapshot SnapBuildInitialSnapshot(SnapBuild *builder);
+extern Snapshot SnapBuildMVCCFromHistoric(Snapshot snapshot, bool in_place);
extern const char *SnapBuildExportSnapshot(SnapBuild *builder);
extern void SnapBuildClearExportedSnapshot(void);
extern void SnapBuildResetExportedSnapshotState(void);
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index d346be71642..147b190210a 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -60,6 +60,7 @@ extern Snapshot GetTransactionSnapshot(void);
extern Snapshot GetLatestSnapshot(void);
extern void SnapshotSetCommandId(CommandId curcid);
+extern Snapshot CopySnapshot(Snapshot snapshot);
extern Snapshot GetCatalogSnapshot(Oid relid);
extern Snapshot GetNonHistoricCatalogSnapshot(Oid relid);
extern void InvalidateCatalogSnapshot(void);
--
2.43.5
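The conversion that SnapBuildMVCCFromHistoric() factors out can be pictured with a toy model in plain C, independent of PostgreSQL internals (all names here are illustrative, not the patch's actual symbols): a historic snapshot's xip array lists *committed* XIDs, whereas an ordinary MVCC snapshot's xip lists *in-progress* XIDs, so the routine walks the range [xmin, xmax) and keeps every XID that bsearch() does not find in the committed list.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t TransactionId;

/* comparator in the spirit of PostgreSQL's xidComparator (normal XIDs only) */
static int
xid_cmp(const void *a, const void *b)
{
	TransactionId xa = *(const TransactionId *) a;
	TransactionId xb = *(const TransactionId *) b;

	return (xa < xb) ? -1 : (xa > xb) ? 1 : 0;
}

/*
 * Toy model of the historic-to-MVCC conversion: 'committed' is the sorted
 * xip of a historic snapshot; 'out' receives the xip of the equivalent
 * ordinary MVCC snapshot, i.e. every XID in [xmin, xmax) that did not
 * commit and is therefore treated as in-progress.  Returns the new xcnt.
 */
static int
historic_to_mvcc(TransactionId xmin, TransactionId xmax,
				 const TransactionId *committed, int ncommitted,
				 TransactionId *out)
{
	int			n = 0;

	for (TransactionId xid = xmin; xid < xmax; xid++)
	{
		if (bsearch(&xid, committed, ncommitted,
					sizeof(TransactionId), xid_cmp) == NULL)
			out[n++] = xid;		/* not committed => in-progress */
	}
	return n;
}
```

With xmin = 100, xmax = 105 and committed XIDs {101, 103}, the resulting MVCC xip is {100, 102, 104}: the meaning of the array flips from "committed" to "in-progress", which is exactly why the real function must rebuild it rather than reuse it.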
Attachment: v14-0003-Move-the-recheck-branch-to-a-separate-function.patch (text/x-diff)
From 587b0029a51053808dcdc4d72f70e15b31fc15c9 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Mon, 9 Jun 2025 12:00:42 +0200
Subject: [PATCH 3/7] Move the "recheck" branch to a separate function.
At some point I thought that the relation must be unlocked during the call of
setup_logical_decoding(), to avoid a deadlock. In that case we'd need to
recheck afterwards if the table still meets the requirements of cluster_rel().
Eventually I concluded that the risk of that deadlock is not that high, so the
table stays locked during the call of setup_logical_decoding(). Therefore the
rechecking code is only executed once per table. Even so, this refactoring
should help code readability.
---
src/backend/commands/cluster.c | 108 +++++++++++++++++++--------------
1 file changed, 62 insertions(+), 46 deletions(-)
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 5fe8e6cb354..64e7291a6a1 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -69,6 +69,8 @@ typedef struct
static void cluster_multiple_rels(List *rtcs, ClusterParams *params,
ClusterCommand cmd);
+static bool cluster_rel_recheck(Relation OldHeap, Oid indexOid, Oid userid,
+ int options);
static void rebuild_relation(Relation OldHeap, Relation index, bool verbose,
ClusterCommand cmd);
static void copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
@@ -317,53 +319,9 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
* to cluster a not-previously-clustered index.
*/
if (recheck)
- {
- /* Check that the user still has privileges for the relation */
- if (!cluster_is_permitted_for_relation(tableOid, save_userid,
- CLUSTER_COMMAND_CLUSTER))
- {
- relation_close(OldHeap, AccessExclusiveLock);
+ if (!cluster_rel_recheck(OldHeap, indexOid, save_userid,
+ params->options))
goto out;
- }
-
- /*
- * Silently skip a temp table for a remote session. Only doing this
- * check in the "recheck" case is appropriate (which currently means
- * somebody is executing a database-wide CLUSTER or on a partitioned
- * table), because there is another check in cluster() which will stop
- * any attempt to cluster remote temp tables by name. There is
- * another check in cluster_rel which is redundant, but we leave it
- * for extra safety.
- */
- if (RELATION_IS_OTHER_TEMP(OldHeap))
- {
- relation_close(OldHeap, AccessExclusiveLock);
- goto out;
- }
-
- if (OidIsValid(indexOid))
- {
- /*
- * Check that the index still exists
- */
- if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(indexOid)))
- {
- relation_close(OldHeap, AccessExclusiveLock);
- goto out;
- }
-
- /*
- * Check that the index is still the one with indisclustered set,
- * if needed.
- */
- if ((params->options & CLUOPT_RECHECK_ISCLUSTERED) != 0 &&
- !get_index_isclustered(indexOid))
- {
- relation_close(OldHeap, AccessExclusiveLock);
- goto out;
- }
- }
- }
/*
* We allow VACUUM FULL, but not CLUSTER, on shared catalogs. CLUSTER
@@ -465,6 +423,64 @@ out:
pgstat_progress_end_command();
}
+/*
+ * Check if the table (and its index) still meets the requirements of
+ * cluster_rel().
+ */
+static bool
+cluster_rel_recheck(Relation OldHeap, Oid indexOid, Oid userid,
+ int options)
+{
+ Oid tableOid = RelationGetRelid(OldHeap);
+
+ /* Check that the user still has privileges for the relation */
+ if (!cluster_is_permitted_for_relation(tableOid, userid,
+ CLUSTER_COMMAND_CLUSTER))
+ {
+ relation_close(OldHeap, AccessExclusiveLock);
+ return false;
+ }
+
+ /*
+ * Silently skip a temp table for a remote session. Only doing this check
+ * in the "recheck" case is appropriate (which currently means somebody is
+ * executing a database-wide CLUSTER or on a partitioned table), because
+ * there is another check in cluster() which will stop any attempt to
+ * cluster remote temp tables by name. There is another check in
+ * cluster_rel which is redundant, but we leave it for extra safety.
+ */
+ if (RELATION_IS_OTHER_TEMP(OldHeap))
+ {
+ relation_close(OldHeap, AccessExclusiveLock);
+ return false;
+ }
+
+ if (OidIsValid(indexOid))
+ {
+ /*
+ * Check that the index still exists
+ */
+ if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(indexOid)))
+ {
+ relation_close(OldHeap, AccessExclusiveLock);
+ return false;
+ }
+
+ /*
+ * Check that the index is still the one with indisclustered set, if
+ * needed.
+ */
+ if ((options & CLUOPT_RECHECK_ISCLUSTERED) != 0 &&
+ !get_index_isclustered(indexOid))
+ {
+ relation_close(OldHeap, AccessExclusiveLock);
+ return false;
+ }
+ }
+
+ return true;
+}
+
/*
* Verify that the specified heap and index are valid to cluster on
*
--
2.43.5
Attachment: v14-0004-Add-CONCURRENTLY-option-to-REPACK-command.patch (text/plain)
From 4687f9a8221d7ec99a5ec5c4849ab8416c0a9426 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Mon, 9 Jun 2025 12:00:42 +0200
Subject: [PATCH 4/7] Add CONCURRENTLY option to REPACK command.
The REPACK command copies the relation data into a new file, creates new
indexes and eventually swaps the files. To make sure that the old file does
not change during the copying, the relation is locked in an exclusive mode,
which prevents applications from both reading and writing. (To keep the data
consistent, we'd only need to prevent the applications from writing, but even
reading needs to be blocked before we can swap the files - otherwise some
applications could continue using the old file. Since we should not request a
stronger lock without releasing the weaker one first, we acquire the exclusive
lock in the beginning and keep it till the end of the processing.)
This patch introduces an alternative workflow, which only requires the
exclusive lock when the relation (and index) files are being swapped.
(Supposedly, the swapping should be pretty fast.) On the other hand, when we
copy the data to the new file, we allow applications to read from the relation
and even to write to it.
First, we scan the relation using a "historic snapshot", and insert all the
tuples satisfying this snapshot into the new file.
Second, logical decoding is used to capture the data changes done by
applications during the copying (i.e. changes that do not satisfy the historic
snapshot mentioned above), and those are applied to the new file before we
acquire the exclusive lock that we need to swap the files. (Of course, more
data changes can take place while we are waiting for the lock - these will be
applied to the new file after we have acquired the lock, before we swap the
files.)
Since the logical decoding system, during its startup, waits until all the
transactions which already have XID assigned have finished, there is a risk of
deadlock if a transaction that already changed anything in the database tries
to acquire a conflicting lock on the table REPACK CONCURRENTLY is working
on. As an example, consider a transaction running a CREATE INDEX command on the
table that is being REPACKed CONCURRENTLY. On the other hand, DML commands
(INSERT, UPDATE, DELETE) are not a problem as their lock does not conflict
with REPACK CONCURRENTLY.
The current approach is that we accept the risk. If we tried to avoid it, it'd
be necessary to unlock the table before the logical decoding is setup and lock
it again afterwards. Such temporary unlocking would imply re-checking if the
table still meets all the requirements for REPACK CONCURRENTLY.
Like the existing implementation of REPACK, the variant with the CONCURRENTLY
option also requires an extra space for the new relation and index files
(which coexist with the old files for some time). In addition, the
CONCURRENTLY option might introduce a lag in releasing WAL segments for
archiving / recycling. This is due to the decoding of the data changes done by
applications concurrently. When copying the table contents into the new file,
we check the lag periodically. If it exceeds the size of a WAL segment, we
decode all the available WAL before resuming the copying. (Of course, the
changes are not applied until the whole table contents has been copied.) A
background worker might be a better approach for the decoding - let's consider
implementing it in the future.
The WAL records produced by running DML commands on the new relation do not
contain enough information to be processed by the logical decoding system. All
we need from the new relation is the file (relfilenode), while the actual
relation is eventually dropped. Thus there is no point in replaying the DMLs
anywhere.
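The workflow and the hazard described above can be sketched as two concurrent sessions. This is a hypothetical layout for illustration only; in particular, the exact spelling of the option (`CONCURRENTLY` as a bare keyword vs. a parenthesized option) is assumed from the patch subject, not quoted from its grammar:

```sql
-- Session 1: rewrite the table while applications keep using it; the
-- exclusive lock is only needed for the final file swap.
REPACK CONCURRENTLY accounts USING INDEX accounts_pkey;

-- Session 2, meanwhile: plain DML is fine, since its lock does not
-- conflict with REPACK CONCURRENTLY; the change is captured by logical
-- decoding and applied to the new file before the swap.
UPDATE accounts SET balance = balance + 1 WHERE id = 1;

-- Session 2, however, risks the deadlock described above if, having
-- already been assigned an XID (e.g. by a prior write in the same
-- transaction), it requests a conflicting lock:
BEGIN;
INSERT INTO audit_log VALUES ('before index build');
CREATE INDEX ON accounts (balance);  -- may deadlock against REPACK
```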
---
doc/src/sgml/monitoring.sgml | 37 +-
doc/src/sgml/mvcc.sgml | 12 +-
doc/src/sgml/ref/repack.sgml | 129 +-
src/Makefile | 1 +
src/backend/access/heap/heapam.c | 34 +-
src/backend/access/heap/heapam_handler.c | 215 +-
src/backend/access/heap/rewriteheap.c | 6 +-
src/backend/access/transam/xact.c | 11 +-
src/backend/catalog/index.c | 43 +-
src/backend/catalog/system_views.sql | 30 +-
src/backend/commands/cluster.c | 1895 +++++++++++++++--
src/backend/commands/matview.c | 2 +-
src/backend/commands/tablecmds.c | 1 +
src/backend/commands/vacuum.c | 12 +-
src/backend/meson.build | 1 +
src/backend/parser/gram.y | 15 +-
src/backend/replication/logical/decode.c | 83 +
src/backend/replication/logical/snapbuild.c | 20 +
.../replication/pgoutput_repack/Makefile | 32 +
.../replication/pgoutput_repack/meson.build | 18 +
.../pgoutput_repack/pgoutput_repack.c | 288 +++
src/backend/storage/ipc/ipci.c | 1 +
.../utils/activity/wait_event_names.txt | 1 +
src/backend/utils/cache/relcache.c | 1 +
src/backend/utils/time/snapmgr.c | 3 +-
src/bin/psql/tab-complete.in.c | 25 +-
src/include/access/heapam.h | 9 +-
src/include/access/heapam_xlog.h | 2 +
src/include/access/tableam.h | 10 +
src/include/catalog/index.h | 3 +
src/include/commands/cluster.h | 87 +-
src/include/commands/progress.h | 23 +-
src/include/nodes/parsenodes.h | 1 +
src/include/replication/snapbuild.h | 1 +
src/include/storage/lockdefs.h | 4 +-
src/include/storage/lwlocklist.h | 1 +
src/include/utils/snapmgr.h | 2 +
src/test/regress/expected/rules.out | 29 +-
src/tools/pgindent/typedefs.list | 4 +
39 files changed, 2764 insertions(+), 328 deletions(-)
create mode 100644 src/backend/replication/pgoutput_repack/Makefile
create mode 100644 src/backend/replication/pgoutput_repack/meson.build
create mode 100644 src/backend/replication/pgoutput_repack/pgoutput_repack.c
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index da883bb22f1..cae24f15624 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6061,14 +6061,35 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>heap_tuples_written</structfield> <type>bigint</type>
+ <structfield>heap_tuples_inserted</structfield> <type>bigint</type>
</para>
<para>
- Number of heap tuples written.
+ Number of heap tuples inserted.
This counter only advances when the phase is
<literal>seq scanning heap</literal>,
- <literal>index scanning heap</literal>
- or <literal>writing new heap</literal>.
+ <literal>index scanning heap</literal>,
+ <literal>writing new heap</literal>
+ or <literal>catch-up</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_updated</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples updated.
+ This counter only advances when the phase is <literal>catch-up</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_deleted</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples deleted.
+ This counter only advances when the phase is <literal>catch-up</literal>.
</para></entry>
</row>
@@ -6149,6 +6170,14 @@ FROM pg_stat_get_backend_idset() AS backendid;
<command>REPACK</command> is currently writing the new heap.
</entry>
</row>
+ <row>
+ <entry><literal>catch-up</literal></entry>
+ <entry>
+ <command>REPACK CONCURRENTLY</command> is currently processing the DML
+ commands that other transactions executed during any of the preceding
+ phases.
+ </entry>
+ </row>
<row>
<entry><literal>swapping relation files</literal></entry>
<entry>
diff --git a/doc/src/sgml/mvcc.sgml b/doc/src/sgml/mvcc.sgml
index 049ee75a4ba..0f5c34af542 100644
--- a/doc/src/sgml/mvcc.sgml
+++ b/doc/src/sgml/mvcc.sgml
@@ -1833,15 +1833,17 @@ SELECT pg_advisory_lock(q.id) FROM
<title>Caveats</title>
<para>
- Some DDL commands, currently only <link linkend="sql-truncate"><command>TRUNCATE</command></link> and the
- table-rewriting forms of <link linkend="sql-altertable"><command>ALTER TABLE</command></link>, are not
+ Some commands, currently only <link linkend="sql-truncate"><command>TRUNCATE</command></link>, the
+ table-rewriting forms of <link linkend="sql-altertable"><command>ALTER
+ TABLE</command></link> and <command>REPACK</command> with
+ the <literal>CONCURRENTLY</literal> option, are not
MVCC-safe. This means that after the truncation or rewrite commits, the
table will appear empty to concurrent transactions, if they are using a
- snapshot taken before the DDL command committed. This will only be an
+ snapshot taken before the command committed. This will only be an
issue for a transaction that did not access the table in question
- before the DDL command started — any transaction that has done so
+ before the command started — any transaction that has done so
would hold at least an <literal>ACCESS SHARE</literal> table lock,
- which would block the DDL command until that transaction completes.
+ which would block the truncating or rewriting command until that transaction completes.
So these commands will not cause any apparent inconsistency in the
table contents for successive queries on the target table, but they
could cause visible inconsistency between the contents of the target
diff --git a/doc/src/sgml/ref/repack.sgml b/doc/src/sgml/ref/repack.sgml
index a612c72d971..9c089a6b3d7 100644
--- a/doc/src/sgml/ref/repack.sgml
+++ b/doc/src/sgml/ref/repack.sgml
@@ -22,6 +22,7 @@ PostgreSQL documentation
<refsynopsisdiv>
<synopsis>
REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <replaceable class="parameter">table_name</replaceable> [ USING INDEX <replaceable class="parameter">index_name</replaceable> ] ]
+REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] CONCURRENTLY <replaceable class="parameter">table_name</replaceable> [ USING INDEX <replaceable class="parameter">index_name</replaceable> ]
<phrase>where <replaceable class="parameter">option</replaceable> can be one of:</phrase>
@@ -48,7 +49,8 @@ REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
processes every table and materialized view in the current database that
the current user has the <literal>MAINTAIN</literal> privilege on. This
form of <command>REPACK</command> cannot be executed inside a transaction
- block.
+ block. Also, this form is not allowed if
+ the <literal>CONCURRENTLY</literal> option is used.
</para>
<para>
@@ -61,7 +63,8 @@ REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
When a table is being repacked, an <literal>ACCESS EXCLUSIVE</literal> lock
is acquired on it. This prevents any other database operations (both reads
and writes) from operating on the table until the <command>REPACK</command>
- is finished.
+ is finished. If you want to keep the table accessible during the repacking,
+ consider using the <literal>CONCURRENTLY</literal> option.
</para>
<refsect2 id="sql-repack-notes-on-clustering" xreflabel="Notes on Clustering">
@@ -160,6 +163,128 @@ REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><literal>CONCURRENTLY</literal></term>
+ <listitem>
+ <para>
+ Allow other transactions to use the table while it is being repacked.
+ </para>
+
+ <para>
+ Internally, <command>REPACK</command> copies the contents of the table
+ (ignoring dead tuples) into a new file, sorted by the specified index,
+ and also creates a new file for each index. Then it swaps the old and
+ new files for the table and all the indexes, and deletes the old
+ files. The <literal>ACCESS EXCLUSIVE</literal> lock is needed to make
+ sure that the old files do not change during the processing because the
+ changes would get lost due to the swap.
+ </para>
+
+ <para>
+ With the <literal>CONCURRENTLY</literal> option, the <literal>ACCESS
+ EXCLUSIVE</literal> lock is only acquired to swap the table and index
+ files. The data changes that took place during the creation of the new
+ table and index files are captured using logical decoding
+ (<xref linkend="logicaldecoding"/>) and applied before
+ the <literal>ACCESS EXCLUSIVE</literal> lock is requested. Thus the lock
+ is typically held only for the time needed to swap the files, which
+ should be pretty short. However, the time might still be noticeable if
+ too many data changes have been done to the table while
+ <command>REPACK</command> was waiting for the lock: those changes must
+ be processed just before the files are swapped, while the
+ <literal>ACCESS EXCLUSIVE</literal> lock is being held.
+ </para>
+
+ <para>
+ Note that <command>REPACK</command> with
+ the <literal>CONCURRENTLY</literal> option does not try to order the
+ rows inserted into the table after the repacking started. Also
+ note that <command>REPACK</command> might fail to complete due to DDL
+ commands executed on the table by other transactions during the
+ repacking.
+ </para>
+
+ <note>
+ <para>
+ In addition to the temporary space requirements explained in
+ <xref linkend="sql-repack-notes-on-resources"/>,
+ the <literal>CONCURRENTLY</literal> option can increase the usage of
+ temporary space a bit more. The reason is that other transactions can
+ perform DML operations which cannot be applied to the new file until
+ <command>REPACK</command> has copied all the tuples from the old
+ file. Thus the tuples inserted into the old file during the copying are
+ also stored separately in a temporary file, so they can eventually be
+ applied to the new file.
+ </para>
+
+ <para>
+ Furthermore, the data changes performed during the copying are
+ extracted from the <link linkend="wal">write-ahead log</link> (WAL), and
+ this extraction (decoding) only takes place when a certain amount of WAL
+ has been written. Therefore, WAL removal can be delayed by this
+ threshold. Currently the threshold is equal to the value of
+ the <link linkend="guc-wal-segment-size"><varname>wal_segment_size</varname></link>
+ configuration parameter.
+ </para>
+ </note>
+
+ <para>
+ The <literal>CONCURRENTLY</literal> option cannot be used in the
+ following cases:
+
+ <itemizedlist>
+ <listitem>
+ <para>
+ The table is <literal>UNLOGGED</literal>.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The table is partitioned.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The table is a system catalog or a <acronym>TOAST</acronym> table.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <command>REPACK</command> is executed inside a transaction block.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The <link linkend="guc-wal-level"><varname>wal_level</varname></link>
+ configuration parameter is less than <literal>logical</literal>.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The <link linkend="guc-max-replication-slots"><varname>max_replication_slots</varname></link>
+ configuration parameter does not allow for creation of an additional
+ replication slot.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+
+ <warning>
+ <para>
+ <command>REPACK</command> with the <literal>CONCURRENTLY</literal>
+ option is not MVCC-safe, see <xref linkend="mvcc-caveats"/> for
+ details.
+ </para>
+ </warning>
+
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><literal>VERBOSE</literal></term>
<listitem>
diff --git a/src/Makefile b/src/Makefile
index 2f31a2f20a7..b18c9a14ffa 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -23,6 +23,7 @@ SUBDIRS = \
interfaces \
backend/replication/libpqwalreceiver \
backend/replication/pgoutput \
+ backend/replication/pgoutput_repack \
fe_utils \
bin \
pl \
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0dcd6ee817e..4fdb3e880e4 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -60,7 +60,8 @@ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
HeapTuple newtup, HeapTuple old_key_tuple,
- bool all_visible_cleared, bool new_all_visible_cleared);
+ bool all_visible_cleared, bool new_all_visible_cleared,
+ bool wal_logical);
#ifdef USE_ASSERT_CHECKING
static void check_lock_if_inplace_updateable_rel(Relation relation,
ItemPointer otid,
@@ -2769,7 +2770,7 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
TM_Result
heap_delete(Relation relation, ItemPointer tid,
CommandId cid, Snapshot crosscheck, bool wait,
- TM_FailureData *tmfd, bool changingPart)
+ TM_FailureData *tmfd, bool changingPart, bool wal_logical)
{
TM_Result result;
TransactionId xid = GetCurrentTransactionId();
@@ -3016,7 +3017,8 @@ l1:
* Compute replica identity tuple before entering the critical section so
* we don't PANIC upon a memory allocation failure.
*/
- old_key_tuple = ExtractReplicaIdentity(relation, &tp, true, &old_key_copied);
+ old_key_tuple = wal_logical ?
+ ExtractReplicaIdentity(relation, &tp, true, &old_key_copied) : NULL;
/*
* If this is the first possibly-multixact-able operation in the current
@@ -3106,6 +3108,15 @@ l1:
xlrec.flags |= XLH_DELETE_CONTAINS_OLD_KEY;
}
+ /*
+ * Unlike UPDATE, DELETE is decoded even if there is no old key, so it
+ * does not help to clear both XLH_DELETE_CONTAINS_OLD_TUPLE and
+ * XLH_DELETE_CONTAINS_OLD_KEY. Thus we need an extra flag. TODO
+ * Consider not decoding tuples w/o the old tuple/key instead.
+ */
+ if (!wal_logical)
+ xlrec.flags |= XLH_DELETE_NO_LOGICAL;
+
XLogBeginInsert();
XLogRegisterData(&xlrec, SizeOfHeapDelete);
@@ -3198,7 +3209,8 @@ simple_heap_delete(Relation relation, ItemPointer tid)
result = heap_delete(relation, tid,
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &tmfd, false /* changingPart */ );
+ &tmfd, false, /* changingPart */
+ true /* wal_logical */);
switch (result)
{
case TM_SelfModified:
@@ -3239,7 +3251,7 @@ TM_Result
heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
CommandId cid, Snapshot crosscheck, bool wait,
TM_FailureData *tmfd, LockTupleMode *lockmode,
- TU_UpdateIndexes *update_indexes)
+ TU_UpdateIndexes *update_indexes, bool wal_logical)
{
TM_Result result;
TransactionId xid = GetCurrentTransactionId();
@@ -4132,7 +4144,8 @@ l2:
newbuf, &oldtup, heaptup,
old_key_tuple,
all_visible_cleared,
- all_visible_cleared_new);
+ all_visible_cleared_new,
+ wal_logical);
if (newbuf != buffer)
{
PageSetLSN(BufferGetPage(newbuf), recptr);
@@ -4490,7 +4503,8 @@ simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup,
result = heap_update(relation, otid, tup,
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &tmfd, &lockmode, update_indexes);
+ &tmfd, &lockmode, update_indexes,
+ true /* wal_logical */);
switch (result)
{
case TM_SelfModified:
@@ -8831,7 +8845,8 @@ static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup, HeapTuple newtup,
HeapTuple old_key_tuple,
- bool all_visible_cleared, bool new_all_visible_cleared)
+ bool all_visible_cleared, bool new_all_visible_cleared,
+ bool wal_logical)
{
xl_heap_update xlrec;
xl_heap_header xlhdr;
@@ -8842,7 +8857,8 @@ log_heap_update(Relation reln, Buffer oldbuf,
suffixlen = 0;
XLogRecPtr recptr;
Page page = BufferGetPage(newbuf);
- bool need_tuple_data = RelationIsLogicallyLogged(reln);
+ bool need_tuple_data = RelationIsLogicallyLogged(reln) &&
+ wal_logical;
bool init;
int bufflags;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 0b03070d394..c829c06f769 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -33,6 +33,7 @@
#include "catalog/index.h"
#include "catalog/storage.h"
#include "catalog/storage_xlog.h"
+#include "commands/cluster.h"
#include "commands/progress.h"
#include "executor/executor.h"
#include "miscadmin.h"
@@ -309,7 +310,8 @@ heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
* the storage itself is cleaning the dead tuples by itself, it is the
* time to call the index tuple deletion also.
*/
- return heap_delete(relation, tid, cid, crosscheck, wait, tmfd, changingPart);
+ return heap_delete(relation, tid, cid, crosscheck, wait, tmfd, changingPart,
+ true);
}
@@ -328,7 +330,7 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
tuple->t_tableOid = slot->tts_tableOid;
result = heap_update(relation, otid, tuple, cid, crosscheck, wait,
- tmfd, lockmode, update_indexes);
+ tmfd, lockmode, update_indexes, true);
ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
/*
@@ -685,13 +687,15 @@ static void
heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
Relation OldIndex, bool use_sort,
TransactionId OldestXmin,
+ Snapshot snapshot,
+ LogicalDecodingContext *decoding_ctx,
TransactionId *xid_cutoff,
MultiXactId *multi_cutoff,
double *num_tuples,
double *tups_vacuumed,
double *tups_recently_dead)
{
- RewriteState rwstate;
+ RewriteState rwstate = NULL;
IndexScanDesc indexScan;
TableScanDesc tableScan;
HeapScanDesc heapScan;
@@ -705,6 +709,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
bool *isnull;
BufferHeapTupleTableSlot *hslot;
BlockNumber prev_cblock = InvalidBlockNumber;
+ bool concurrent = snapshot != NULL;
+ XLogRecPtr end_of_wal_prev = GetFlushRecPtr(NULL);
/* Remember if it's a system catalog */
is_system_catalog = IsSystemRelation(OldHeap);
@@ -720,9 +726,12 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
values = (Datum *) palloc(natts * sizeof(Datum));
isnull = (bool *) palloc(natts * sizeof(bool));
- /* Initialize the rewrite operation */
- rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, *xid_cutoff,
- *multi_cutoff);
+ /*
+ * Initialize the rewrite operation.
+ */
+ if (!concurrent)
+ rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin,
+ *xid_cutoff, *multi_cutoff);
/* Set up sorting if wanted */
@@ -737,6 +746,9 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
* Prepare to scan the OldHeap. To ensure we see recently-dead tuples
* that still need to be copied, we scan with SnapshotAny and use
* HeapTupleSatisfiesVacuum for the visibility test.
+ *
+ * In the CONCURRENTLY case, we do regular MVCC visibility tests, using
+ * the snapshot passed by the caller.
*/
if (OldIndex != NULL && !use_sort)
{
@@ -753,7 +765,9 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
tableScan = NULL;
heapScan = NULL;
- indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, NULL, 0, 0);
+ indexScan = index_beginscan(OldHeap, OldIndex,
+ snapshot ? snapshot : SnapshotAny,
+ NULL, 0, 0);
index_rescan(indexScan, NULL, 0, NULL, 0);
}
else
@@ -762,7 +776,9 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
PROGRESS_REPACK_PHASE_SEQ_SCAN_HEAP);
- tableScan = table_beginscan(OldHeap, SnapshotAny, 0, (ScanKey) NULL);
+ tableScan = table_beginscan(OldHeap,
+ snapshot ? snapshot : SnapshotAny,
+ 0, (ScanKey) NULL);
heapScan = (HeapScanDesc) tableScan;
indexScan = NULL;
@@ -785,6 +801,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
HeapTuple tuple;
Buffer buf;
bool isdead;
+ HTSV_Result vis;
CHECK_FOR_INTERRUPTS();
@@ -837,70 +854,84 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
tuple = ExecFetchSlotHeapTuple(slot, false, NULL);
buf = hslot->buffer;
- LockBuffer(buf, BUFFER_LOCK_SHARE);
-
- switch (HeapTupleSatisfiesVacuum(tuple, OldestXmin, buf))
+ /*
+ * Regarding CONCURRENTLY, see the comments on MVCC snapshot above.
+ */
+ if (!concurrent)
{
- case HEAPTUPLE_DEAD:
- /* Definitely dead */
- isdead = true;
- break;
- case HEAPTUPLE_RECENTLY_DEAD:
- *tups_recently_dead += 1;
- /* fall through */
- case HEAPTUPLE_LIVE:
- /* Live or recently dead, must copy it */
- isdead = false;
- break;
- case HEAPTUPLE_INSERT_IN_PROGRESS:
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
- /*
- * Since we hold exclusive lock on the relation, normally the
- * only way to see this is if it was inserted earlier in our
- * own transaction. However, it can happen in system
- * catalogs, since we tend to release write lock before commit
- * there. Give a warning if neither case applies; but in any
- * case we had better copy it.
- */
- if (!is_system_catalog &&
- !TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(tuple->t_data)))
- elog(WARNING, "concurrent insert in progress within table \"%s\"",
- RelationGetRelationName(OldHeap));
- /* treat as live */
- isdead = false;
- break;
- case HEAPTUPLE_DELETE_IN_PROGRESS:
+ switch ((vis = HeapTupleSatisfiesVacuum(tuple, OldestXmin, buf)))
+ {
+ case HEAPTUPLE_DEAD:
+ /* Definitely dead */
+ isdead = true;
+ break;
+ case HEAPTUPLE_RECENTLY_DEAD:
+ *tups_recently_dead += 1;
+ /* fall through */
+ case HEAPTUPLE_LIVE:
+ /* Live or recently dead, must copy it */
+ isdead = false;
+ break;
+ case HEAPTUPLE_INSERT_IN_PROGRESS:
/*
- * Similar situation to INSERT_IN_PROGRESS case.
+ * As long as we hold exclusive lock on the relation, normally
+ * the only way to see this is if it was inserted earlier in
+ * our own transaction. However, it can happen in system
+ * catalogs, since we tend to release write lock before commit
+ * there. Also, there's no exclusive lock during concurrent
+ * processing. Give a warning if neither case applies; but in
+ * any case we had better copy it.
*/
- if (!is_system_catalog &&
- !TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetUpdateXid(tuple->t_data)))
- elog(WARNING, "concurrent delete in progress within table \"%s\"",
- RelationGetRelationName(OldHeap));
- /* treat as recently dead */
- *tups_recently_dead += 1;
- isdead = false;
- break;
- default:
- elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result");
- isdead = false; /* keep compiler quiet */
- break;
- }
+ if (!is_system_catalog && !concurrent &&
+ !TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(tuple->t_data)))
+ elog(WARNING, "concurrent insert in progress within table \"%s\"",
+ RelationGetRelationName(OldHeap));
+ /* treat as live */
+ isdead = false;
+ break;
+ case HEAPTUPLE_DELETE_IN_PROGRESS:
- LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ /*
+ * Similar situation to INSERT_IN_PROGRESS case.
+ */
+ if (!is_system_catalog && !concurrent &&
+ !TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetUpdateXid(tuple->t_data)))
+ elog(WARNING, "concurrent delete in progress within table \"%s\"",
+ RelationGetRelationName(OldHeap));
+ /* treat as recently dead */
+ *tups_recently_dead += 1;
+ isdead = false;
+ break;
+ default:
+ elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result");
+ isdead = false; /* keep compiler quiet */
+ break;
+ }
- if (isdead)
- {
- *tups_vacuumed += 1;
- /* heap rewrite module still needs to see it... */
- if (rewrite_heap_dead_tuple(rwstate, tuple))
+ if (isdead)
{
- /* A previous recently-dead tuple is now known dead */
*tups_vacuumed += 1;
- *tups_recently_dead -= 1;
+ /* heap rewrite module still needs to see it... */
+ if (rewrite_heap_dead_tuple(rwstate, tuple))
+ {
+ /* A previous recently-dead tuple is now known dead */
+ *tups_vacuumed += 1;
+ *tups_recently_dead -= 1;
+ }
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ continue;
}
- continue;
+
+ /*
+ * In the concurrent case, we have a copy of the tuple, so we
+ * don't need to worry whether the source tuple will be deleted / updated
+ * after we release the lock.
+ */
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
}
*num_tuples += 1;
@@ -919,7 +950,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
{
const int ct_index[] = {
PROGRESS_REPACK_HEAP_TUPLES_SCANNED,
- PROGRESS_REPACK_HEAP_TUPLES_WRITTEN
+ PROGRESS_REPACK_HEAP_TUPLES_INSERTED
};
int64 ct_val[2];
@@ -934,6 +965,31 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
ct_val[1] = *num_tuples;
pgstat_progress_update_multi_param(2, ct_index, ct_val);
}
+
+ /*
+ * Process the WAL produced by the load, as well as by other
+ * transactions, so that the replication slot can advance and WAL does
+ * not pile up. Use wal_segment_size as a threshold so that we do not
+ * introduce the decoding overhead too often.
+ *
+ * Of course, we must not apply the changes until the initial load has
+ * completed.
+ *
+ * Note that our insertions into the new table should not be decoded
+ * as we (intentionally) do not write the logical decoding specific
+ * information to WAL.
+ */
+ if (concurrent)
+ {
+ XLogRecPtr end_of_wal;
+
+ end_of_wal = GetFlushRecPtr(NULL);
+ if ((end_of_wal - end_of_wal_prev) > wal_segment_size)
+ {
+ repack_decode_concurrent_changes(decoding_ctx, end_of_wal);
+ end_of_wal_prev = end_of_wal;
+ }
+ }
}
if (indexScan != NULL)
@@ -977,7 +1033,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
values, isnull,
rwstate);
/* Report n_tuples */
- pgstat_progress_update_param(PROGRESS_REPACK_HEAP_TUPLES_WRITTEN,
+ pgstat_progress_update_param(PROGRESS_REPACK_HEAP_TUPLES_INSERTED,
n_tuples);
}
@@ -985,7 +1041,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
}
/* Write out any remaining tuples, and fsync if needed */
- end_heap_rewrite(rwstate);
+ if (rwstate)
+ end_heap_rewrite(rwstate);
/* Clean up */
pfree(values);
@@ -2376,6 +2433,10 @@ heapam_scan_sample_next_tuple(TableScanDesc scan, SampleScanState *scanstate,
* SET WITHOUT OIDS.
*
* So, we must reconstruct the tuple from component Datums.
+ *
+ * If rwstate=NULL, use simple_heap_insert() instead of rewriting - in that
+ * case we still need to deform/form the tuple. TODO Shouldn't we rename the
+ * function, as it might not do any rewrite?
*/
static void
reform_and_rewrite_tuple(HeapTuple tuple,
@@ -2398,8 +2459,28 @@ reform_and_rewrite_tuple(HeapTuple tuple,
copiedTuple = heap_form_tuple(newTupDesc, values, isnull);
- /* The heap rewrite module does the rest */
- rewrite_heap_tuple(rwstate, tuple, copiedTuple);
+ if (rwstate)
+ /* The heap rewrite module does the rest */
+ rewrite_heap_tuple(rwstate, tuple, copiedTuple);
+ else
+ {
+ /*
+ * Insert tuple when processing REPACK CONCURRENTLY.
+ *
+ * rewriteheap.c is not used in the CONCURRENTLY case because it'd be
+ * difficult to do the same in the catch-up phase (as the logical
+ * decoding does not provide us with sufficient visibility
+ * information). Thus we must use heap_insert() both during the
+ * catch-up and here.
+ *
+ * The following is like simple_heap_insert() except that we pass the
+ * flag to skip logical decoding: as soon as REPACK CONCURRENTLY swaps
+ * the relation files, it drops this relation, so no logical
+ * replication subscription should need the data.
+ */
+ heap_insert(NewHeap, copiedTuple, GetCurrentCommandId(true),
+ HEAP_INSERT_NO_LOGICAL, NULL);
+ }
heap_freetuple(copiedTuple);
}
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index e6d2b5fced1..6aa2ed214f2 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -617,9 +617,9 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
int options = HEAP_INSERT_SKIP_FSM;
/*
- * While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
- * for the TOAST table are not logically decoded. The main heap is
- * WAL-logged as XLOG FPI records, which are not logically decoded.
+ * While rewriting the heap for REPACK, make sure data for the TOAST
+ * table are not logically decoded. The main heap is WAL-logged as
+ * XLOG FPI records, which are not logically decoded.
*/
options |= HEAP_INSERT_NO_LOGICAL;
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b885513f765..23f2de587a1 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -215,6 +215,7 @@ typedef struct TransactionStateData
bool parallelChildXact; /* is any parent transaction parallel? */
bool chain; /* start a new block after this one */
bool topXidLogged; /* for a subxact: is top-level XID logged? */
+ bool internal; /* for a subxact: launched internally? */
struct TransactionStateData *parent; /* back link to parent */
} TransactionStateData;
@@ -4723,6 +4724,7 @@ BeginInternalSubTransaction(const char *name)
/* Normal subtransaction start */
PushTransaction();
s = CurrentTransactionState; /* changed by push */
+ s->internal = true;
/*
* Savepoint names, like the TransactionState block itself, live
@@ -5239,7 +5241,13 @@ AbortSubTransaction(void)
LWLockReleaseAll();
pgstat_report_wait_end();
- pgstat_progress_end_command();
+
+ /*
+ * An internal subtransaction might be used by a user command, in which case
+ * the command outlives the subtransaction.
+ */
+ if (!s->internal)
+ pgstat_progress_end_command();
pgaio_error_cleanup();
@@ -5456,6 +5464,7 @@ PushTransaction(void)
s->parallelModeLevel = 0;
s->parallelChildXact = (p->parallelModeLevel != 0 || p->parallelChildXact);
s->topXidLogged = false;
+ s->internal = false;
CurrentTransactionState = s;
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 466cf0fdef6..c70521d1d54 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1418,22 +1418,7 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
opclassOptions[i] = get_attoptions(oldIndexId, i + 1);
- /* Extract statistic targets for each attribute */
- stattargets = palloc0_array(NullableDatum, newInfo->ii_NumIndexAttrs);
- for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
- {
- HeapTuple tp;
- Datum dat;
-
- tp = SearchSysCache2(ATTNUM, ObjectIdGetDatum(oldIndexId), Int16GetDatum(i + 1));
- if (!HeapTupleIsValid(tp))
- elog(ERROR, "cache lookup failed for attribute %d of relation %u",
- i + 1, oldIndexId);
- dat = SysCacheGetAttr(ATTNUM, tp, Anum_pg_attribute_attstattarget, &isnull);
- ReleaseSysCache(tp);
- stattargets[i].value = dat;
- stattargets[i].isnull = isnull;
- }
+ stattargets = get_index_stattargets(oldIndexId, newInfo);
/*
* Now create the new index.
@@ -1472,6 +1457,32 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
return newIndexId;
}
+NullableDatum *
+get_index_stattargets(Oid indexid, IndexInfo *indInfo)
+{
+ NullableDatum *stattargets;
+
+ /* Extract statistic targets for each attribute */
+ stattargets = palloc0_array(NullableDatum, indInfo->ii_NumIndexAttrs);
+ for (int i = 0; i < indInfo->ii_NumIndexAttrs; i++)
+ {
+ HeapTuple tp;
+ Datum dat;
+ bool isnull;
+
+ tp = SearchSysCache2(ATTNUM, ObjectIdGetDatum(indexid), Int16GetDatum(i + 1));
+ if (!HeapTupleIsValid(tp))
+ elog(ERROR, "cache lookup failed for attribute %d of relation %u",
+ i + 1, indexid);
+ dat = SysCacheGetAttr(ATTNUM, tp, Anum_pg_attribute_attstattarget, &isnull);
+ ReleaseSysCache(tp);
+ stattargets[i].value = dat;
+ stattargets[i].isnull = isnull;
+ }
+
+ return stattargets;
+}
+
/*
* index_concurrently_build
*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 7380b6e3d7b..a9d3a4b5787 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1258,16 +1258,17 @@ CREATE VIEW pg_stat_progress_cluster AS
WHEN 2 THEN 'index scanning heap'
WHEN 3 THEN 'sorting tuples'
WHEN 4 THEN 'writing new heap'
- WHEN 5 THEN 'swapping relation files'
- WHEN 6 THEN 'rebuilding index'
- WHEN 7 THEN 'performing final cleanup'
+ -- 5 is 'catch-up', but that should not appear here.
+ WHEN 6 THEN 'swapping relation files'
+ WHEN 7 THEN 'rebuilding index'
+ WHEN 8 THEN 'performing final cleanup'
END AS phase,
CAST(S.param3 AS oid) AS cluster_index_relid,
S.param4 AS heap_tuples_scanned,
S.param5 AS heap_tuples_written,
- S.param6 AS heap_blks_total,
- S.param7 AS heap_blks_scanned,
- S.param8 AS index_rebuild_count
+ S.param8 AS heap_blks_total,
+ S.param9 AS heap_blks_scanned,
+ S.param10 AS index_rebuild_count
FROM pg_stat_get_progress_info('CLUSTER') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
@@ -1283,16 +1284,19 @@ CREATE VIEW pg_stat_progress_repack AS
WHEN 2 THEN 'index scanning heap'
WHEN 3 THEN 'sorting tuples'
WHEN 4 THEN 'writing new heap'
- WHEN 5 THEN 'swapping relation files'
- WHEN 6 THEN 'rebuilding index'
- WHEN 7 THEN 'performing final cleanup'
+ WHEN 5 THEN 'catch-up'
+ WHEN 6 THEN 'swapping relation files'
+ WHEN 7 THEN 'rebuilding index'
+ WHEN 8 THEN 'performing final cleanup'
END AS phase,
CAST(S.param3 AS oid) AS repack_index_relid,
S.param4 AS heap_tuples_scanned,
- S.param5 AS heap_tuples_written,
- S.param6 AS heap_blks_total,
- S.param7 AS heap_blks_scanned,
- S.param8 AS index_rebuild_count
+ S.param5 AS heap_tuples_inserted,
+ S.param6 AS heap_tuples_updated,
+ S.param7 AS heap_tuples_deleted,
+ S.param8 AS heap_blks_total,
+ S.param9 AS heap_blks_scanned,
+ S.param10 AS index_rebuild_count
FROM pg_stat_get_progress_info('REPACK') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 64e7291a6a1..432fc510ee6 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -25,6 +25,10 @@
#include "access/toast_internals.h"
#include "access/transam.h"
#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xloginsert.h"
+#include "access/xlogutils.h"
#include "catalog/catalog.h"
#include "catalog/dependency.h"
#include "catalog/heap.h"
@@ -32,6 +36,7 @@
#include "catalog/namespace.h"
#include "catalog/objectaccess.h"
#include "catalog/pg_am.h"
+#include "catalog/pg_control.h"
#include "catalog/pg_inherits.h"
#include "catalog/toasting.h"
#include "commands/cluster.h"
@@ -39,10 +44,15 @@
#include "commands/progress.h"
#include "commands/tablecmds.h"
#include "commands/vacuum.h"
+#include "executor/executor.h"
#include "miscadmin.h"
#include "optimizer/optimizer.h"
#include "pgstat.h"
+#include "replication/decode.h"
+#include "replication/logical.h"
+#include "replication/snapbuild.h"
#include "storage/bufmgr.h"
+#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
#include "utils/acl.h"
@@ -67,15 +77,45 @@ typedef struct
Oid indexOid;
} RelToCluster;
+/*
+ * The following definitions are used for concurrent processing.
+ */
+
+/*
+ * The locators are used to avoid logical decoding of data that we do not need
+ * for our table.
+ */
+RelFileLocator repacked_rel_locator = {.relNumber = InvalidOid};
+RelFileLocator repacked_rel_toast_locator = {.relNumber = InvalidOid};
+
+/*
+ * Everything we need to call ExecInsertIndexTuples().
+ */
+typedef struct IndexInsertState
+{
+ ResultRelInfo *rri;
+ EState *estate;
+
+ Relation ident_index;
+} IndexInsertState;
+
+/* The WAL segment being decoded. */
+static XLogSegNo repack_current_segment = 0;
+
static void cluster_multiple_rels(List *rtcs, ClusterParams *params,
- ClusterCommand cmd);
+ ClusterCommand cmd, LOCKMODE lockmode,
+ bool isTopLevel);
static bool cluster_rel_recheck(Relation OldHeap, Oid indexOid, Oid userid,
- int options);
+ LOCKMODE lmode, int options);
+static void check_repack_concurrently_requirements(Relation rel);
static void rebuild_relation(Relation OldHeap, Relation index, bool verbose,
- ClusterCommand cmd);
+ bool concurrent, Oid userid, ClusterCommand cmd);
static void copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
- bool verbose, bool *pSwapToastByContent,
- TransactionId *pFreezeXid, MultiXactId *pCutoffMulti);
+ Snapshot snapshot, LogicalDecodingContext *decoding_ctx,
+ bool verbose,
+ bool *pSwapToastByContent,
+ TransactionId *pFreezeXid,
+ MultiXactId *pCutoffMulti);
static List *get_tables_to_cluster(MemoryContext cluster_context);
static List *get_tables_to_repack(MemoryContext repack_context);
static List *get_tables_to_cluster_partitioned(MemoryContext cluster_context,
@@ -83,7 +123,53 @@ static List *get_tables_to_cluster_partitioned(MemoryContext cluster_context,
ClusterCommand cmd);
static bool cluster_is_permitted_for_relation(Oid relid, Oid userid,
ClusterCommand cmd);
+static void begin_concurrent_repack(Relation rel);
+static void end_concurrent_repack(void);
+static LogicalDecodingContext *setup_logical_decoding(Oid relid,
+ const char *slotname,
+ TupleDesc tupdesc);
+static HeapTuple get_changed_tuple(char *change);
+static void apply_concurrent_changes(RepackDecodingState *dstate,
+ Relation rel, ScanKey key, int nkeys,
+ IndexInsertState *iistate);
+static void apply_concurrent_insert(Relation rel, ConcurrentChange *change,
+ HeapTuple tup, IndexInsertState *iistate,
+ TupleTableSlot *index_slot);
+static void apply_concurrent_update(Relation rel, HeapTuple tup,
+ HeapTuple tup_target,
+ ConcurrentChange *change,
+ IndexInsertState *iistate,
+ TupleTableSlot *index_slot);
+static void apply_concurrent_delete(Relation rel, HeapTuple tup_target,
+ ConcurrentChange *change);
+static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
+ HeapTuple tup_key,
+ IndexInsertState *iistate,
+ TupleTableSlot *ident_slot,
+ IndexScanDesc *scan_p);
+static void process_concurrent_changes(LogicalDecodingContext *ctx,
+ XLogRecPtr end_of_wal,
+ Relation rel_dst,
+ Relation rel_src,
+ ScanKey ident_key,
+ int ident_key_nentries,
+ IndexInsertState *iistate);
+static IndexInsertState *get_index_insert_state(Relation relation,
+ Oid ident_index_id);
+static ScanKey build_identity_key(Oid ident_idx_oid, Relation rel_src,
+ int *nentries);
+static void free_index_insert_state(IndexInsertState *iistate);
+static void cleanup_logical_decoding(LogicalDecodingContext *ctx);
+static void rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
+ Relation cl_index,
+ LogicalDecodingContext *ctx,
+ bool swap_toast_by_content,
+ TransactionId frozenXid,
+ MultiXactId cutoffMulti);
+static List *build_new_indexes(Relation NewHeap, Relation OldHeap, List *OldIndexes);
static Relation process_single_relation(RangeVar *relation, char *indexname,
+ LOCKMODE lockmode,
+ bool isTopLevel,
ClusterParams *params,
ClusterCommand cmd,
Oid *indexOid_p);
@@ -142,8 +228,8 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
if (stmt->relation != NULL)
{
- /* This is the single-relation case. */
rel = process_single_relation(stmt->relation, stmt->indexname,
+ AccessExclusiveLock, isTopLevel,
¶ms, CLUSTER_COMMAND_CLUSTER,
&indexOid);
if (rel == NULL)
@@ -194,7 +280,8 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
}
/* Do the job. */
- cluster_multiple_rels(rtcs, ¶ms, CLUSTER_COMMAND_CLUSTER);
+ cluster_multiple_rels(rtcs, ¶ms, CLUSTER_COMMAND_CLUSTER,
+ AccessExclusiveLock, isTopLevel);
/* Start a new transaction for the cleanup work. */
StartTransactionCommand();
@@ -211,7 +298,8 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
* return.
*/
static void
-cluster_multiple_rels(List *rtcs, ClusterParams *params, ClusterCommand cmd)
+cluster_multiple_rels(List *rtcs, ClusterParams *params, ClusterCommand cmd,
+ LOCKMODE lockmode, bool isTopLevel)
{
ListCell *lc;
@@ -231,10 +319,10 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params, ClusterCommand cmd)
/* functions in indexes may want a snapshot set */
PushActiveSnapshot(GetTransactionSnapshot());
- rel = table_open(rtc->tableOid, AccessExclusiveLock);
+ rel = table_open(rtc->tableOid, lockmode);
/* Process this table */
- cluster_rel(rel, rtc->indexOid, params, cmd);
+ cluster_rel(rel, rtc->indexOid, params, cmd, isTopLevel);
/* cluster_rel closes the relation, but keeps lock */
PopActiveSnapshot();
@@ -258,12 +346,18 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params, ClusterCommand cmd)
* instead of index order. This is the new implementation of VACUUM FULL,
* and error messages should refer to the operation as VACUUM not CLUSTER.
*
+ * Note that, in the concurrent case, the function releases the lock at some
+ * point, in order to get AccessExclusiveLock for the final steps (i.e. to
+ * swap the relation files). To make things simpler, the caller should expect
+ * OldHeap to be closed on return, regardless of CLUOPT_CONCURRENT. (The
+ * AccessExclusiveLock is kept until the end of the transaction.)
+ *
* 'cmd' indicates which command is being executed. REPACK should be the only
* caller of this function in the future.
*/
void
cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
- ClusterCommand cmd)
+ ClusterCommand cmd, bool isTopLevel)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid save_userid;
@@ -272,8 +366,34 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
bool verbose = ((params->options & CLUOPT_VERBOSE) != 0);
bool recheck = ((params->options & CLUOPT_RECHECK) != 0);
Relation index;
+ bool concurrent = ((params->options & CLUOPT_CONCURRENT) != 0);
+ LOCKMODE lmode;
+
+ /*
+ * Check that the correct lock is held. The lock mode is
+ * AccessExclusiveLock for normal processing and ShareUpdateExclusiveLock
+ * for concurrent processing (so that SELECT, INSERT, UPDATE and DELETE
+ * commands work, but cluster_rel() cannot be called concurrently for the
+ * same relation).
+ */
+ lmode = !concurrent ? AccessExclusiveLock : ShareUpdateExclusiveLock;
+
+ /* There are specific requirements on concurrent processing. */
+ if (concurrent)
+ {
+ /*
+ * Make sure we have no XID assigned, otherwise call of
+ * setup_logical_decoding() can cause a deadlock.
+ *
+ * The existence of a transaction block does not actually imply that an XID
+ * was already assigned, but it very likely was. We might want to check
+ * the result of GetCurrentTransactionIdIfAny() instead, but that
+ * would be less clear from the user's perspective.
+ */
+ PreventInTransactionBlock(isTopLevel, "REPACK CONCURRENTLY");
- Assert(CheckRelationLockedByMe(OldHeap, AccessExclusiveLock, false));
+ check_repack_concurrently_requirements(OldHeap);
+ }
/* Check for user-requested abort. */
CHECK_FOR_INTERRUPTS();
@@ -319,7 +439,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
* to cluster a not-previously-clustered index.
*/
if (recheck)
- if (!cluster_rel_recheck(OldHeap, indexOid, save_userid,
+ if (!cluster_rel_recheck(OldHeap, indexOid, save_userid, lmode,
params->options))
goto out;
@@ -338,6 +458,12 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("cannot cluster a shared catalog")));
+ /*
+ * The CONCURRENTLY case should have been rejected earlier because it does
+ * not support system catalogs.
+ */
+ Assert(!(OldHeap->rd_rel->relisshared && concurrent));
+
/*
* Don't process temp tables of other backends ... their local buffer
* manager is not going to cope.
@@ -376,7 +502,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
if (OidIsValid(indexOid))
{
/* verify the index is good and lock it */
- check_index_is_clusterable(OldHeap, indexOid, AccessExclusiveLock);
+ check_index_is_clusterable(OldHeap, indexOid, lmode);
/* also open it */
index = index_open(indexOid, NoLock);
}
@@ -393,7 +519,9 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
if (OldHeap->rd_rel->relkind == RELKIND_MATVIEW &&
!RelationIsPopulated(OldHeap))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ if (index)
+ index_close(index, lmode);
+ relation_close(OldHeap, lmode);
goto out;
}
@@ -406,11 +534,35 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
* invalid, because we move tuples around. Promote them to relation
* locks. Predicate locks on indexes will be promoted when they are
* reindexed.
+ *
+ * During concurrent processing, the heap as well as its indexes stay in
+ * operation, so we postpone this step until they are locked using
+ * AccessExclusiveLock near the end of the processing.
*/
- TransferPredicateLocksToHeapRelation(OldHeap);
+ if (!concurrent)
+ TransferPredicateLocksToHeapRelation(OldHeap);
/* rebuild_relation does all the dirty work */
- rebuild_relation(OldHeap, index, verbose, cmd);
+ PG_TRY();
+ {
+ /*
+ * For concurrent processing, make sure that our logical decoding
+ * ignores data changes of other tables than the one we are
+ * processing.
+ */
+ if (concurrent)
+ begin_concurrent_repack(OldHeap);
+
+ rebuild_relation(OldHeap, index, verbose, concurrent, save_userid,
+ cmd);
+ }
+ PG_FINALLY();
+ {
+ if (concurrent)
+ end_concurrent_repack();
+ }
+ PG_END_TRY();
+
/* rebuild_relation closes OldHeap, and index if valid */
out:
@@ -429,7 +581,7 @@ out:
*/
static bool
cluster_rel_recheck(Relation OldHeap, Oid indexOid, Oid userid,
- int options)
+ LOCKMODE lmode, int options)
{
Oid tableOid = RelationGetRelid(OldHeap);
@@ -437,7 +589,7 @@ cluster_rel_recheck(Relation OldHeap, Oid indexOid, Oid userid,
if (!cluster_is_permitted_for_relation(tableOid, userid,
CLUSTER_COMMAND_CLUSTER))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
return false;
}
@@ -451,7 +603,7 @@ cluster_rel_recheck(Relation OldHeap, Oid indexOid, Oid userid,
*/
if (RELATION_IS_OTHER_TEMP(OldHeap))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
return false;
}
@@ -462,7 +614,7 @@ cluster_rel_recheck(Relation OldHeap, Oid indexOid, Oid userid,
*/
if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(indexOid)))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
return false;
}
@@ -473,7 +625,7 @@ cluster_rel_recheck(Relation OldHeap, Oid indexOid, Oid userid,
if ((options & CLUOPT_RECHECK_ISCLUSTERED) != 0 &&
!get_index_isclustered(indexOid))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
return false;
}
}
@@ -614,19 +766,87 @@ mark_index_clustered(Relation rel, Oid indexOid, bool is_internal)
table_close(pg_index, RowExclusiveLock);
}
+/*
+ * Check if the CONCURRENTLY option is legal for the relation.
+ */
+static void
+check_repack_concurrently_requirements(Relation rel)
+{
+ char relpersistence,
+ replident;
+ Oid ident_idx;
+
+ /* Data changes in system relations are not logically decoded. */
+ if (IsCatalogRelation(rel))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot repack relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("REPACK CONCURRENTLY is not supported for catalog relations.")));
+
+ /*
+ * reorderbuffer.c does not seem to handle processing of a TOAST relation
+ * on its own.
+ */
+ if (IsToastRelation(rel))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot repack relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("REPACK CONCURRENTLY is not supported for TOAST relations, unless the main relation is repacked too.")));
+
+ relpersistence = rel->rd_rel->relpersistence;
+ if (relpersistence != RELPERSISTENCE_PERMANENT)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot repack relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("REPACK CONCURRENTLY is only allowed for permanent relations.")));
+
+ /* With NOTHING, WAL does not contain the old tuple. */
+ replident = rel->rd_rel->relreplident;
+ if (replident == REPLICA_IDENTITY_NOTHING)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot repack relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("Relation \"%s\" has insufficient replication identity.",
+ RelationGetRelationName(rel))));
+
+ /*
+ * The identity index is not set if the replica identity is FULL, but a PK
+ * might still exist in that case.
+ */
+ ident_idx = RelationGetReplicaIndex(rel);
+ if (!OidIsValid(ident_idx) && OidIsValid(rel->rd_pkindex))
+ ident_idx = rel->rd_pkindex;
+ if (!OidIsValid(ident_idx))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot repack relation \"%s\"",
+ RelationGetRelationName(rel)),
+ (errhint("Relation \"%s\" has no identity index.",
+ RelationGetRelationName(rel)))));
+}
+
/*
* rebuild_relation: rebuild an existing relation in index or physical order
*
- * OldHeap: table to rebuild.
+ * OldHeap: table to rebuild. See cluster_rel() for comments on the required
+ * lock strength.
+ *
* index: index to cluster by, or NULL to rewrite in physical order.
*
- * On entry, heap and index (if one is given) must be open, and
- * AccessExclusiveLock held on them.
- * On exit, they are closed, but locks on them are not released.
+ * On entry, heap and index (if one is given) must be open, and the
+ * appropriate lock held on them (AccessExclusiveLock for exclusive processing
+ * and ShareUpdateExclusiveLock for concurrent processing).
+ *
+ * On exit, they are closed, but still locked with AccessExclusiveLock. (The
+ * function handles the lock upgrade if 'concurrent' is true.)
*/
static void
rebuild_relation(Relation OldHeap, Relation index, bool verbose,
- ClusterCommand cmd)
+ bool concurrent, Oid userid, ClusterCommand cmd)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid accessMethod = OldHeap->rd_rel->relam;
@@ -634,13 +854,55 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose,
Oid OIDNewHeap;
Relation NewHeap;
char relpersistence;
- bool is_system_catalog;
bool swap_toast_by_content;
TransactionId frozenXid;
MultiXactId cutoffMulti;
+ NameData slotname;
+ LogicalDecodingContext *ctx = NULL;
+ Snapshot snapshot = NULL;
+#if USE_ASSERT_CHECKING
+ LOCKMODE lmode;
+
+ lmode = !concurrent ? AccessExclusiveLock : ShareUpdateExclusiveLock;
+
+ Assert(CheckRelationLockedByMe(OldHeap, lmode, false) &&
+ (index == NULL || CheckRelationLockedByMe(index, lmode, false)));
+#endif
+
+ if (concurrent)
+ {
+ TupleDesc tupdesc;
- Assert(CheckRelationLockedByMe(OldHeap, AccessExclusiveLock, false) &&
- (index == NULL || CheckRelationLockedByMe(index, AccessExclusiveLock, false)));
+ /*
+ * REPACK CONCURRENTLY is not allowed in a transaction block, so this
+ * should never fire.
+ */
+ Assert(GetTopTransactionIdIfAny() == InvalidTransactionId);
+
+ /*
+ * A single backend should not execute multiple REPACK commands at a
+ * time, so the PID is sufficient to make the slot name unique.
+ */
+ snprintf(NameStr(slotname), NAMEDATALEN, "repack_%d", MyProcPid);
+
+ tupdesc = CreateTupleDescCopy(RelationGetDescr(OldHeap));
+
+ /*
+ * Prepare to capture the concurrent data changes.
+ *
+ * Note that this call waits for all transactions with an XID already
+ * assigned to finish. If one of those transactions is waiting for a
+ * lock conflicting with ShareUpdateExclusiveLock on our table (e.g.
+ * it runs CREATE INDEX), we can end up in a deadlock. Not sure this
+ * risk is worth unlocking/locking the table (and its clustering
+ * index) and checking again if it's still eligible for REPACK
+ * CONCURRENTLY.
+ */
+ ctx = setup_logical_decoding(tableOid, NameStr(slotname), tupdesc);
+
+ snapshot = SnapBuildInitialSnapshotForRepack(ctx->snapshot_builder);
+ PushActiveSnapshot(snapshot);
+ }
if (index && cmd == CLUSTER_COMMAND_CLUSTER)
/* Mark the correct index as clustered */
@@ -648,7 +910,6 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose,
/* Remember info about rel before closing OldHeap */
relpersistence = OldHeap->rd_rel->relpersistence;
- is_system_catalog = IsSystemRelation(OldHeap);
/*
* Create the transient table that will receive the re-ordered data.
@@ -664,30 +925,67 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose,
NewHeap = table_open(OIDNewHeap, NoLock);
/* Copy the heap data into the new table in the desired order */
- copy_table_data(NewHeap, OldHeap, index, verbose,
+ copy_table_data(NewHeap, OldHeap, index, snapshot, ctx, verbose,
&swap_toast_by_content, &frozenXid, &cutoffMulti);
+ /* The historic snapshot won't be needed anymore. */
+ if (snapshot)
+ PopActiveSnapshot();
- /* Close relcache entries, but keep lock until transaction commit */
- table_close(OldHeap, NoLock);
- if (index)
- index_close(index, NoLock);
+ if (concurrent)
+ {
+ /*
+ * Push a snapshot that we will use to find old versions of rows when
+ * processing concurrent UPDATE and DELETE commands. (That snapshot
+ * should also be used by index expressions.)
+ */
+ PushActiveSnapshot(GetTransactionSnapshot());
- /*
- * Close the new relation so it can be dropped as soon as the storage is
- * swapped. The relation is not visible to others, so no need to unlock it
- * explicitly.
- */
- table_close(NewHeap, NoLock);
+ /*
+ * Make sure we can find the tuples just inserted when applying DML
+ * commands on top of those.
+ */
+ CommandCounterIncrement();
+ UpdateActiveSnapshotCommandId();
- /*
- * Swap the physical files of the target and transient tables, then
- * rebuild the target's indexes and throw away the transient table.
- */
- finish_heap_swap(tableOid, OIDNewHeap, is_system_catalog,
- swap_toast_by_content, false, true,
- frozenXid, cutoffMulti,
- relpersistence);
+ rebuild_relation_finish_concurrent(NewHeap, OldHeap, index,
+ ctx, swap_toast_by_content,
+ frozenXid, cutoffMulti);
+ PopActiveSnapshot();
+
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_FINAL_CLEANUP);
+
+ /* Done with decoding. */
+ cleanup_logical_decoding(ctx);
+ ReplicationSlotRelease();
+ ReplicationSlotDrop(NameStr(slotname), false);
+ }
+ else
+ {
+ bool is_system_catalog = IsSystemRelation(OldHeap);
+
+ /* Close relcache entries, but keep lock until transaction commit */
+ table_close(OldHeap, NoLock);
+ if (index)
+ index_close(index, NoLock);
+
+ /*
+ * Close the new relation so it can be dropped as soon as the storage
+ * is swapped. The relation is not visible to others, so no need to
+ * unlock it explicitly.
+ */
+ table_close(NewHeap, NoLock);
+
+ /*
+ * Swap the physical files of the target and transient tables, then
+ * rebuild the target's indexes and throw away the transient table.
+ */
+ finish_heap_swap(tableOid, OIDNewHeap, is_system_catalog,
+ swap_toast_by_content, false, true, true,
+ frozenXid, cutoffMulti,
+ relpersistence);
+ }
}
@@ -822,15 +1120,19 @@ make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
/*
* Do the physical copying of table data.
*
+ * 'snapshot' and 'decoding_ctx': see table_relation_copy_for_cluster(). Pass
+ * them iff concurrent processing is required.
+ *
* There are three output parameters:
* *pSwapToastByContent is set true if toast tables must be swapped by content.
* *pFreezeXid receives the TransactionId used as freeze cutoff point.
* *pCutoffMulti receives the MultiXactId used as a cutoff point.
*/
static void
-copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verbose,
- bool *pSwapToastByContent, TransactionId *pFreezeXid,
- MultiXactId *pCutoffMulti)
+copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
+ Snapshot snapshot, LogicalDecodingContext *decoding_ctx,
+ bool verbose, bool *pSwapToastByContent,
+ TransactionId *pFreezeXid, MultiXactId *pCutoffMulti)
{
Relation relRelation;
HeapTuple reltup;
@@ -848,6 +1150,8 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
PGRUsage ru0;
char *nspname;
+ bool concurrent = snapshot != NULL;
+
pg_rusage_init(&ru0);
/* Store a copy of the namespace name for logging purposes */
@@ -950,8 +1254,48 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
* provided, else plain seqscan.
*/
if (OldIndex != NULL && OldIndex->rd_rel->relam == BTREE_AM_OID)
+ {
+ ResourceOwner oldowner = NULL;
+ ResourceOwner resowner = NULL;
+
+ /*
+ * In the CONCURRENT case, use a dedicated resource owner so we don't
+ * leave any additional locks behind us that we cannot release easily.
+ */
+ if (concurrent)
+ {
+ Assert(CheckRelationLockedByMe(OldHeap, ShareUpdateExclusiveLock,
+ false));
+ Assert(CheckRelationLockedByMe(OldIndex, ShareUpdateExclusiveLock,
+ false));
+
+ resowner = ResourceOwnerCreate(CurrentResourceOwner,
+ "plan_cluster_use_sort");
+ oldowner = CurrentResourceOwner;
+ CurrentResourceOwner = resowner;
+ }
+
use_sort = plan_cluster_use_sort(RelationGetRelid(OldHeap),
RelationGetRelid(OldIndex));
+
+ if (concurrent)
+ {
+ CurrentResourceOwner = oldowner;
+
+ /*
+ * We are primarily concerned about locks, but if the planner
+ * happened to allocate any other resources, we should release
+ * them too because we're going to delete the whole resowner.
+ */
+ ResourceOwnerRelease(resowner, RESOURCE_RELEASE_BEFORE_LOCKS,
+ false, false);
+ ResourceOwnerRelease(resowner, RESOURCE_RELEASE_LOCKS,
+ false, false);
+ ResourceOwnerRelease(resowner, RESOURCE_RELEASE_AFTER_LOCKS,
+ false, false);
+ ResourceOwnerDelete(resowner);
+ }
+ }
else
use_sort = false;
@@ -980,7 +1324,9 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
* values (e.g. because the AM doesn't use freezing).
*/
table_relation_copy_for_cluster(OldHeap, NewHeap, OldIndex, use_sort,
- cutoffs.OldestXmin, &cutoffs.FreezeLimit,
+ cutoffs.OldestXmin, snapshot,
+ decoding_ctx,
+ &cutoffs.FreezeLimit,
&cutoffs.MultiXactCutoff,
&num_tuples, &tups_vacuumed,
&tups_recently_dead);
@@ -989,7 +1335,11 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
*pFreezeXid = cutoffs.FreezeLimit;
*pCutoffMulti = cutoffs.MultiXactCutoff;
- /* Reset rd_toastoid just to be tidy --- it shouldn't be looked at again */
+ /*
+ * Reset rd_toastoid just to be tidy --- it shouldn't be looked at again.
+ * In the CONCURRENTLY case, we need to set it again before applying the
+ * concurrent changes.
+ */
NewHeap->rd_toastoid = InvalidOid;
num_pages = RelationGetNumberOfBlocks(NewHeap);
@@ -1447,14 +1797,13 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
bool swap_toast_by_content,
bool check_constraints,
bool is_internal,
+ bool reindex,
TransactionId frozenXid,
MultiXactId cutoffMulti,
char newrelpersistence)
{
ObjectAddress object;
Oid mapped_tables[4];
- int reindex_flags;
- ReindexParams reindex_params = {0};
int i;
/* Report that we are now swapping relation files */
@@ -1480,39 +1829,47 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
if (is_system_catalog)
CacheInvalidateCatalog(OIDOldHeap);
- /*
- * Rebuild each index on the relation (but not the toast table, which is
- * all-new at this point). It is important to do this before the DROP
- * step because if we are processing a system catalog that will be used
- * during DROP, we want to have its indexes available. There is no
- * advantage to the other order anyway because this is all transactional,
- * so no chance to reclaim disk space before commit. We do not need a
- * final CommandCounterIncrement() because reindex_relation does it.
- *
- * Note: because index_build is called via reindex_relation, it will never
- * set indcheckxmin true for the indexes. This is OK even though in some
- * sense we are building new indexes rather than rebuilding existing ones,
- * because the new heap won't contain any HOT chains at all, let alone
- * broken ones, so it can't be necessary to set indcheckxmin.
- */
- reindex_flags = REINDEX_REL_SUPPRESS_INDEX_USE;
- if (check_constraints)
- reindex_flags |= REINDEX_REL_CHECK_CONSTRAINTS;
+ if (reindex)
+ {
+ int reindex_flags;
+ ReindexParams reindex_params = {0};
- /*
- * Ensure that the indexes have the same persistence as the parent
- * relation.
- */
- if (newrelpersistence == RELPERSISTENCE_UNLOGGED)
- reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
- else if (newrelpersistence == RELPERSISTENCE_PERMANENT)
- reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+ /*
+ * Rebuild each index on the relation (but not the toast table, which
+ * is all-new at this point). It is important to do this before the
+ * DROP step because if we are processing a system catalog that will
+ * be used during DROP, we want to have its indexes available. There
+ * is no advantage to the other order anyway because this is all
+ * transactional, so no chance to reclaim disk space before commit. We
+ * do not need a final CommandCounterIncrement() because
+ * reindex_relation does it.
+ *
+ * Note: because index_build is called via reindex_relation, it will
+ * never set indcheckxmin true for the indexes. This is OK even
+ * though in some sense we are building new indexes rather than
+ * rebuilding existing ones, because the new heap won't contain any
+ * HOT chains at all, let alone broken ones, so it can't be necessary
+ * to set indcheckxmin.
+ */
+ reindex_flags = REINDEX_REL_SUPPRESS_INDEX_USE;
+ if (check_constraints)
+ reindex_flags |= REINDEX_REL_CHECK_CONSTRAINTS;
- /* Report that we are now reindexing relations */
- pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
- PROGRESS_REPACK_PHASE_REBUILD_INDEX);
+ /*
+ * Ensure that the indexes have the same persistence as the parent
+ * relation.
+ */
+ if (newrelpersistence == RELPERSISTENCE_UNLOGGED)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
+ else if (newrelpersistence == RELPERSISTENCE_PERMANENT)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
- reindex_relation(NULL, OIDOldHeap, reindex_flags, &reindex_params);
+ /* Report that we are now reindexing relations */
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_REBUILD_INDEX);
+
+ reindex_relation(NULL, OIDOldHeap, reindex_flags, &reindex_params);
+ }
/* Report that we are now doing clean up */
pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
@@ -1834,90 +2191,1315 @@ cluster_is_permitted_for_relation(Oid relid, Oid userid, ClusterCommand cmd)
return false;
}
+#define REPL_PLUGIN_NAME "pgoutput_repack"
+
/*
- * REPACK is intended to be a replacement of both CLUSTER and VACUUM FULL.
+ * Call this function before REPACK CONCURRENTLY starts to set up logical
+ * decoding. It makes sure that other users of the table put enough
+ * information into WAL.
+ *
+ * The point is that at various places we expect that the table we're
+ * processing is treated like a system catalog. For example, we need to be
+ * able to scan it using a "historic snapshot" anytime during the processing
+ * (as opposed to scanning only at the start point of the decoding, as logical
+ * replication does during initial table synchronization), in order to apply
+ * concurrent UPDATE / DELETE commands.
+ *
+ * Note that the TOAST table needs no attention here, as it is not scanned
+ * using a historic snapshot.
*/
-void
-repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
+static void
+begin_concurrent_repack(Relation rel)
{
- ListCell *lc;
- ClusterParams params = {0};
- bool verbose = false;
- Relation rel = NULL;
- Oid indexOid = InvalidOid;
- MemoryContext repack_context;
- List *rtcs;
+ Oid toastrelid;
- /* Parse option list */
- foreach(lc, stmt->params)
+ /* Avoid logical decoding of other relations by this backend. */
+ repacked_rel_locator = rel->rd_locator;
+ toastrelid = rel->rd_rel->reltoastrelid;
+ if (OidIsValid(toastrelid))
{
- DefElem *opt = (DefElem *) lfirst(lc);
+ Relation toastrel;
- if (strcmp(opt->defname, "verbose") == 0)
- verbose = defGetBoolean(opt);
- else
- ereport(ERROR,
- (errcode(ERRCODE_SYNTAX_ERROR),
- errmsg("unrecognized REPACK option \"%s\"",
- opt->defname),
- parser_errposition(pstate, opt->location)));
+ /* Avoid logical decoding of other TOAST relations. */
+ toastrel = table_open(toastrelid, AccessShareLock);
+ repacked_rel_toast_locator = toastrel->rd_locator;
+ table_close(toastrel, AccessShareLock);
}
+}
- params.options = (verbose ? CLUOPT_VERBOSE : 0);
+/*
+ * Call this when done with REPACK CONCURRENTLY.
+ */
+static void
+end_concurrent_repack(void)
+{
+ /*
+ * Restore normal function of (future) logical decoding for this backend.
+ */
+ repacked_rel_locator.relNumber = InvalidOid;
+ repacked_rel_toast_locator.relNumber = InvalidOid;
+}
- if (stmt->relation != NULL)
- {
- /* This is the single-relation case. */
- rel = process_single_relation(stmt->relation, stmt->indexname,
- ¶ms, CLUSTER_COMMAND_REPACK,
- &indexOid);
- if (rel == NULL)
- return;
- }
+/*
+ * This function is much like pg_create_logical_replication_slot() except that
+ * the new slot is neither released (if anyone else could read changes from
+ * our slot, we could miss changes other backends do while we copy the
+ * existing data into temporary table), nor persisted (it's easier to handle
+ * crash by restarting all the work from scratch).
+ */
+static LogicalDecodingContext *
+setup_logical_decoding(Oid relid, const char *slotname, TupleDesc tupdesc)
+{
+ LogicalDecodingContext *ctx;
+ RepackDecodingState *dstate;
/*
- * By here, we know we are in a multi-table situation. In order to avoid
- * holding locks for too long, we want to process each table in its own
- * transaction. This forces us to disallow running inside a user
- * transaction block.
+ * Check if we can use logical decoding.
*/
- PreventInTransactionBlock(isTopLevel, "REPACK");
+ CheckSlotPermissions();
+ CheckLogicalDecodingRequirements();
- /* Also, we need a memory context to hold our list of relations */
- repack_context = AllocSetContextCreate(PortalContext,
- "Repack",
- ALLOCSET_DEFAULT_SIZES);
+ /* RS_TEMPORARY so that the slot gets cleaned up on ERROR. */
+ ReplicationSlotCreate(slotname, true, RS_TEMPORARY, false, false, false);
- params.options |= CLUOPT_RECHECK;
- if (rel != NULL)
- {
- Oid relid;
- bool rel_is_index;
+ /*
+ * None of the prepare_write, do_write and update_progress callbacks
+ * are useful for us.
+ *
+ * Regarding the value of need_full_snapshot, we pass false because the
+ * table we are processing is present in RepackedRelsHash and therefore,
+ * regarding logical decoding, treated like a catalog.
+ */
+ ctx = CreateInitDecodingContext(REPL_PLUGIN_NAME,
+ NIL,
+ false,
+ InvalidXLogRecPtr,
+ XL_ROUTINE(.page_read = read_local_xlog_page,
+ .segment_open = wal_segment_open,
+ .segment_close = wal_segment_close),
+ NULL, NULL, NULL);
- Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
+ /*
+ * We don't have control over setting fast_forward, so at least check it.
+ */
+ Assert(!ctx->fast_forward);
- if (OidIsValid(indexOid))
- {
- relid = indexOid;
- rel_is_index = true;
- }
- else
+ DecodingContextFindStartpoint(ctx);
+
+ /* Some WAL records should have been read. */
+ Assert(ctx->reader->EndRecPtr != InvalidXLogRecPtr);
+
+ XLByteToSeg(ctx->reader->EndRecPtr, repack_current_segment,
+ wal_segment_size);
+
+ /*
+ * Setup structures to store decoded changes.
+ */
+ dstate = palloc0(sizeof(RepackDecodingState));
+ dstate->relid = relid;
+ dstate->tstore = tuplestore_begin_heap(false, false,
+ maintenance_work_mem);
+
+ dstate->tupdesc = tupdesc;
+
+ /* Initialize the descriptor to store the changes ... */
+ dstate->tupdesc_change = CreateTemplateTupleDesc(1);
+
+ TupleDescInitEntry(dstate->tupdesc_change, 1, NULL, BYTEAOID, -1, 0);
+ /* ... as well as the corresponding slot. */
+ dstate->tsslot = MakeSingleTupleTableSlot(dstate->tupdesc_change,
+ &TTSOpsMinimalTuple);
+
+ dstate->resowner = ResourceOwnerCreate(CurrentResourceOwner,
+ "logical decoding");
+
+ ctx->output_writer_private = dstate;
+ return ctx;
+}
+
+/*
+ * Retrieve tuple from ConcurrentChange structure.
+ *
+ * The input data starts with the structure but it might not be appropriately
+ * aligned.
+ */
+static HeapTuple
+get_changed_tuple(char *change)
+{
+ HeapTupleData tup_data;
+ HeapTuple result;
+ char *src;
+
+ /*
+ * Ensure alignment before accessing the fields. (This is why we can't use
+ * heap_copytuple() instead of this function.)
+ */
+ src = change + offsetof(ConcurrentChange, tup_data);
+ memcpy(&tup_data, src, sizeof(HeapTupleData));
+
+ result = (HeapTuple) palloc(HEAPTUPLESIZE + tup_data.t_len);
+ memcpy(result, &tup_data, sizeof(HeapTupleData));
+ result->t_data = (HeapTupleHeader) ((char *) result + HEAPTUPLESIZE);
+ src = change + SizeOfConcurrentChange;
+ memcpy(result->t_data, src, result->t_len);
+
+ return result;
+}
+
+/*
+ * Decode logical changes from the WAL sequence up to end_of_wal.
+ */
+void
+repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
+ XLogRecPtr end_of_wal)
+{
+ RepackDecodingState *dstate;
+ ResourceOwner resowner_old;
+
+ /*
+ * Invalidate the "present" cache before moving to "(recent) history".
+ */
+ InvalidateSystemCaches();
+
+ dstate = (RepackDecodingState *) ctx->output_writer_private;
+ resowner_old = CurrentResourceOwner;
+ CurrentResourceOwner = dstate->resowner;
+
+ PG_TRY();
+ {
+ while (ctx->reader->EndRecPtr < end_of_wal)
{
- relid = RelationGetRelid(rel);
- rel_is_index = false;
- }
- rtcs = get_tables_to_cluster_partitioned(repack_context, relid,
- rel_is_index,
- CLUSTER_COMMAND_REPACK);
+ XLogRecord *record;
+ XLogSegNo segno_new;
+ char *errm = NULL;
+ XLogRecPtr end_lsn;
- /* close relation, releasing lock on parent table */
- table_close(rel, AccessExclusiveLock);
- }
- else
- rtcs = get_tables_to_repack(repack_context);
+ record = XLogReadRecord(ctx->reader, &errm);
+ if (errm)
+ elog(ERROR, "%s", errm);
+
+ if (record != NULL)
+ LogicalDecodingProcessRecord(ctx, ctx->reader);
+
+ /*
+ * If a WAL segment boundary has been crossed, inform the decoding
+ * system that the catalog_xmin can advance. (We could confirm more
+ * often, but filling a single WAL segment should not take much
+ * time.)
+ */
+ end_lsn = ctx->reader->EndRecPtr;
+ XLByteToSeg(end_lsn, segno_new, wal_segment_size);
+ if (segno_new != repack_current_segment)
+ {
+ LogicalConfirmReceivedLocation(end_lsn);
+ elog(DEBUG1, "REPACK: confirmed receive location %X/%X",
+ (uint32) (end_lsn >> 32), (uint32) end_lsn);
+ repack_current_segment = segno_new;
+ }
+
+ CHECK_FOR_INTERRUPTS();
+ }
+ InvalidateSystemCaches();
+ CurrentResourceOwner = resowner_old;
+ }
+ PG_CATCH();
+ {
+ /* clear all timetravel entries */
+ InvalidateSystemCaches();
+ CurrentResourceOwner = resowner_old;
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+}
+
+/*
+ * Apply changes that happened during the initial load.
+ *
+ * The scan key is passed by the caller so that it does not have to be
+ * constructed multiple times. Key entries have all fields initialized,
+ * except for sk_argument.
+ */
+static void
+apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
+ ScanKey key, int nkeys, IndexInsertState *iistate)
+{
+ TupleTableSlot *index_slot,
+ *ident_slot;
+ HeapTuple tup_old = NULL;
+
+ if (dstate->nchanges == 0)
+ return;
+
+ /* TupleTableSlot is needed to pass the tuple to ExecInsertIndexTuples(). */
+ index_slot = MakeSingleTupleTableSlot(dstate->tupdesc, &TTSOpsHeapTuple);
+
+ /* A slot to fetch tuples from identity index. */
+ ident_slot = table_slot_create(rel, NULL);
+
+ while (tuplestore_gettupleslot(dstate->tstore, true, false,
+ dstate->tsslot))
+ {
+ bool shouldFree;
+ HeapTuple tup_change,
+ tup,
+ tup_exist;
+ char *change_raw,
+ *src;
+ ConcurrentChange change;
+ bool isnull[1];
+ Datum values[1];
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Get the change from the single-column tuple. */
+ tup_change = ExecFetchSlotHeapTuple(dstate->tsslot, false, &shouldFree);
+ heap_deform_tuple(tup_change, dstate->tupdesc_change, values, isnull);
+ Assert(!isnull[0]);
+
+ /* Make sure we access aligned data. */
+ change_raw = (char *) DatumGetByteaP(values[0]);
+ src = (char *) VARDATA(change_raw);
+ memcpy(&change, src, SizeOfConcurrentChange);
+
+ /* TRUNCATE change contains no tuple, so process it separately. */
+ if (change.kind == CHANGE_TRUNCATE)
+ {
+ /*
+ * All the things that ExecuteTruncateGuts() does (such as firing
+ * triggers or handling the DROP_CASCADE behavior) should have
+ * taken place on the source relation. Thus we only do the actual
+ * truncation of the new relation (and its indexes).
+ */
+ heap_truncate_one_rel(rel);
+
+ pfree(tup_change);
+ continue;
+ }
+
+ /*
+ * Extract the tuple from the change. The tuple is copied here because
+ * it might be assigned to 'tup_old', in which case it needs to
+ * survive into the next iteration.
+ */
+ tup = get_changed_tuple(src);
+
+ if (change.kind == CHANGE_UPDATE_OLD)
+ {
+ Assert(tup_old == NULL);
+ tup_old = tup;
+ }
+ else if (change.kind == CHANGE_INSERT)
+ {
+ Assert(tup_old == NULL);
+
+ apply_concurrent_insert(rel, &change, tup, iistate, index_slot);
+
+ pfree(tup);
+ }
+ else if (change.kind == CHANGE_UPDATE_NEW ||
+ change.kind == CHANGE_DELETE)
+ {
+ IndexScanDesc ind_scan = NULL;
+ HeapTuple tup_key;
+
+ if (change.kind == CHANGE_UPDATE_NEW)
+ {
+ tup_key = tup_old != NULL ? tup_old : tup;
+ }
+ else
+ {
+ Assert(tup_old == NULL);
+ tup_key = tup;
+ }
+
+ /*
+ * Find the tuple to be updated or deleted.
+ */
+ tup_exist = find_target_tuple(rel, key, nkeys, tup_key,
+ iistate, ident_slot, &ind_scan);
+ if (tup_exist == NULL)
+ elog(ERROR, "failed to find target tuple");
+
+ if (change.kind == CHANGE_UPDATE_NEW)
+ apply_concurrent_update(rel, tup, tup_exist, &change, iistate,
+ index_slot);
+ else
+ apply_concurrent_delete(rel, tup_exist, &change);
+
+ if (tup_old != NULL)
+ {
+ pfree(tup_old);
+ tup_old = NULL;
+ }
+
+ pfree(tup);
+ index_endscan(ind_scan);
+ }
+ else
+ elog(ERROR, "unrecognized kind of change: %d", change.kind);
+
+ /*
+ * If a change was applied now, increment the command ID for subsequent
+ * writes and update the snapshot so it sees the changes applied so far.
+ */
+ if (change.kind != CHANGE_UPDATE_OLD)
+ {
+ CommandCounterIncrement();
+ UpdateActiveSnapshotCommandId();
+ }
+
+ /* TTSOpsMinimalTuple has .get_heap_tuple==NULL. */
+ Assert(shouldFree);
+ pfree(tup_change);
+ }
+
+ tuplestore_clear(dstate->tstore);
+ dstate->nchanges = 0;
+
+ /* Cleanup. */
+ ExecDropSingleTupleTableSlot(index_slot);
+ ExecDropSingleTupleTableSlot(ident_slot);
+}
+
+static void
+apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
+ IndexInsertState *iistate, TupleTableSlot *index_slot)
+{
+ List *recheck;
+
+ /*
+ * Like simple_heap_insert(), but make sure that the INSERT is not
+ * logically decoded - see reform_and_rewrite_tuple() for more
+ * information.
+ */
+ heap_insert(rel, tup, GetCurrentCommandId(true), HEAP_INSERT_NO_LOGICAL,
+ NULL);
+
+ /*
+ * Update indexes.
+ *
+ * This is needed in case functions in the index need the active
+ * snapshot and the caller hasn't set one.
+ */
+ ExecStoreHeapTuple(tup, index_slot, false);
+ recheck = ExecInsertIndexTuples(iistate->rri,
+ index_slot,
+ iistate->estate,
+ false, /* update */
+ false, /* noDupErr */
+ NULL, /* specConflict */
+ NIL, /* arbiterIndexes */
+ false /* onlySummarizing */
+ );
+
+ /*
+ * If recheck is required, it must have been performed on the source
+ * relation by now. (All the logical changes we process here are already
+ * committed.)
+ */
+ list_free(recheck);
+
+ pgstat_progress_incr_param(PROGRESS_REPACK_HEAP_TUPLES_INSERTED, 1);
+}
+
+static void
+apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
+ ConcurrentChange *change, IndexInsertState *iistate,
+ TupleTableSlot *index_slot)
+{
+ LockTupleMode lockmode;
+ TM_FailureData tmfd;
+ TU_UpdateIndexes update_indexes;
+ TM_Result res;
+ List *recheck;
+
+ /*
+ * Write the new tuple into the new heap. ('tup' gets the TID assigned
+ * here.)
+ *
+ * Do it like in simple_heap_update(), except for 'wal_logical' (and
+ * except for 'wait').
+ */
+ res = heap_update(rel, &tup_target->t_self, tup,
+ GetCurrentCommandId(true),
+ InvalidSnapshot,
+ false, /* no wait - only we are doing changes */
+ &tmfd, &lockmode, &update_indexes,
+ false /* wal_logical */);
+ if (res != TM_Ok)
+ ereport(ERROR, (errmsg("failed to apply concurrent UPDATE")));
+
+ ExecStoreHeapTuple(tup, index_slot, false);
+
+ if (update_indexes != TU_None)
+ {
+ recheck = ExecInsertIndexTuples(iistate->rri,
+ index_slot,
+ iistate->estate,
+ true, /* update */
+ false, /* noDupErr */
+ NULL, /* specConflict */
+ NIL, /* arbiterIndexes */
+ /* onlySummarizing */
+ update_indexes == TU_Summarizing);
+ list_free(recheck);
+ }
+
+ pgstat_progress_incr_param(PROGRESS_REPACK_HEAP_TUPLES_UPDATED, 1);
+}
+
+static void
+apply_concurrent_delete(Relation rel, HeapTuple tup_target,
+ ConcurrentChange *change)
+{
+ TM_Result res;
+ TM_FailureData tmfd;
+
+ /*
+ * Delete tuple from the new heap.
+ *
+ * Do it like in simple_heap_delete(), except for 'wal_logical' (and
+ * except for 'wait').
+ */
+ res = heap_delete(rel, &tup_target->t_self, GetCurrentCommandId(true),
+ InvalidSnapshot, false,
+ &tmfd,
+ false, /* no wait - only we are doing changes */
+ false /* wal_logical */);
+
+ if (res != TM_Ok)
+ ereport(ERROR, (errmsg("failed to apply concurrent DELETE")));
+
+ pgstat_progress_incr_param(PROGRESS_REPACK_HEAP_TUPLES_DELETED, 1);
+}
+
+/*
+ * Find the tuple to be updated or deleted.
+ *
+ * 'key' is a pre-initialized scan key, into which the function will put the
+ * key values.
+ *
+ * 'tup_key' is a tuple containing the key values for the scan.
+ *
+ * On exit, '*scan_p' contains the scan descriptor used. The caller must
+ * close it when the returned tuple is no longer needed.
+ */
+static HeapTuple
+find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
+ IndexInsertState *iistate,
+ TupleTableSlot *ident_slot, IndexScanDesc *scan_p)
+{
+ IndexScanDesc scan;
+ Form_pg_index ident_form;
+ int2vector *ident_indkey;
+ HeapTuple result = NULL;
+
+ /* XXX no instrumentation for now */
+ scan = index_beginscan(rel, iistate->ident_index, GetActiveSnapshot(),
+ NULL, nkeys, 0);
+ *scan_p = scan;
+ index_rescan(scan, key, nkeys, NULL, 0);
+
+ /* Info needed to retrieve key values from heap tuple. */
+ ident_form = iistate->ident_index->rd_index;
+ ident_indkey = &ident_form->indkey;
+
+ /* Use the incoming tuple to finalize the scan key. */
+ for (int i = 0; i < scan->numberOfKeys; i++)
+ {
+ ScanKey entry;
+ bool isnull;
+ int16 attno_heap;
+
+ entry = &scan->keyData[i];
+ attno_heap = ident_indkey->values[i];
+ entry->sk_argument = heap_getattr(tup_key,
+ attno_heap,
+ rel->rd_att,
+ &isnull);
+ Assert(!isnull);
+ }
+ if (index_getnext_slot(scan, ForwardScanDirection, ident_slot))
+ {
+ bool shouldFree;
+
+ result = ExecFetchSlotHeapTuple(ident_slot, false, &shouldFree);
+ /* TTSOpsBufferHeapTuple has .get_heap_tuple != NULL. */
+ Assert(!shouldFree);
+ }
+
+ return result;
+}
+
+/*
+ * Decode and apply concurrent changes.
+ *
+ * Pass rel_src iff its reltoastrelid is needed.
+ */
+static void
+process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
+ Relation rel_dst, Relation rel_src, ScanKey ident_key,
+ int ident_key_nentries, IndexInsertState *iistate)
+{
+ RepackDecodingState *dstate;
+
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_CATCH_UP);
+
+ dstate = (RepackDecodingState *) ctx->output_writer_private;
+
+ repack_decode_concurrent_changes(ctx, end_of_wal);
+
+ if (dstate->nchanges == 0)
+ return;
+
+ PG_TRY();
+ {
+ /*
+ * Make sure that TOAST values can eventually be accessed via the old
+ * relation - see comment in copy_table_data().
+ */
+ if (rel_src)
+ rel_dst->rd_toastoid = rel_src->rd_rel->reltoastrelid;
+
+ apply_concurrent_changes(dstate, rel_dst, ident_key,
+ ident_key_nentries, iistate);
+ }
+ PG_FINALLY();
+ {
+ if (rel_src)
+ rel_dst->rd_toastoid = InvalidOid;
+ }
+ PG_END_TRY();
+}
+
+static IndexInsertState *
+get_index_insert_state(Relation relation, Oid ident_index_id)
+{
+ EState *estate;
+ int i;
+ IndexInsertState *result;
+
+ result = (IndexInsertState *) palloc0(sizeof(IndexInsertState));
+ estate = CreateExecutorState();
+
+ result->rri = (ResultRelInfo *) palloc(sizeof(ResultRelInfo));
+ InitResultRelInfo(result->rri, relation, 0, 0, 0);
+ ExecOpenIndices(result->rri, false);
+
+ /*
+ * Find the relcache entry of the identity index so that we spend no extra
+ * effort to open / close it.
+ */
+ for (i = 0; i < result->rri->ri_NumIndices; i++)
+ {
+ Relation ind_rel;
+
+ ind_rel = result->rri->ri_IndexRelationDescs[i];
+ if (ind_rel->rd_id == ident_index_id)
+ result->ident_index = ind_rel;
+ }
+ if (result->ident_index == NULL)
+ elog(ERROR, "failed to open identity index");
+
+ /* Only initialize fields needed by ExecInsertIndexTuples(). */
+ result->estate = estate;
+
+ return result;
+}
+
+/*
+ * Build scan key to process logical changes.
+ */
+static ScanKey
+build_identity_key(Oid ident_idx_oid, Relation rel_src, int *nentries)
+{
+ Relation ident_idx_rel;
+ Form_pg_index ident_idx;
+ int n,
+ i;
+ ScanKey result;
+
+ Assert(OidIsValid(ident_idx_oid));
+ ident_idx_rel = index_open(ident_idx_oid, AccessShareLock);
+ ident_idx = ident_idx_rel->rd_index;
+ n = ident_idx->indnatts;
+ result = (ScanKey) palloc(sizeof(ScanKeyData) * n);
+ for (i = 0; i < n; i++)
+ {
+ ScanKey entry;
+ int16 relattno;
+ Form_pg_attribute att;
+ Oid opfamily,
+ opcintype,
+ opno,
+ opcode;
+
+ entry = &result[i];
+ relattno = ident_idx->indkey.values[i];
+ if (relattno >= 1)
+ {
+ TupleDesc desc;
+
+ desc = rel_src->rd_att;
+ att = TupleDescAttr(desc, relattno - 1);
+ }
+ else
+ elog(ERROR, "unexpected attribute number %d in index", relattno);
+
+ opfamily = ident_idx_rel->rd_opfamily[i];
+ opcintype = ident_idx_rel->rd_opcintype[i];
+ opno = get_opfamily_member(opfamily, opcintype, opcintype,
+ BTEqualStrategyNumber);
+
+ if (!OidIsValid(opno))
+ elog(ERROR, "failed to find = operator for type %u", opcintype);
+
+ opcode = get_opcode(opno);
+ if (!OidIsValid(opcode))
+ elog(ERROR, "failed to find function for operator %u", opno);
+
+ /* Initialize everything but argument. */
+ ScanKeyInit(entry,
+ i + 1,
+ BTEqualStrategyNumber, opcode,
+ (Datum) NULL);
+ entry->sk_collation = att->attcollation;
+ }
+ index_close(ident_idx_rel, AccessShareLock);
+
+ *nentries = n;
+ return result;
+}
+
+static void
+free_index_insert_state(IndexInsertState *iistate)
+{
+ ExecCloseIndices(iistate->rri);
+ FreeExecutorState(iistate->estate);
+ pfree(iistate->rri);
+ pfree(iistate);
+}
+
+static void
+cleanup_logical_decoding(LogicalDecodingContext *ctx)
+{
+ RepackDecodingState *dstate;
+
+ dstate = (RepackDecodingState *) ctx->output_writer_private;
+
+ ExecDropSingleTupleTableSlot(dstate->tsslot);
+ FreeTupleDesc(dstate->tupdesc_change);
+ FreeTupleDesc(dstate->tupdesc);
+ tuplestore_end(dstate->tstore);
+
+ FreeDecodingContext(ctx);
+}
+
+/*
+ * The final steps of rebuild_relation() for concurrent processing.
+ *
+ * On entry, NewHeap is locked in AccessExclusiveLock mode. OldHeap and its
+ * clustering index (if one is passed) are still locked in a mode that allows
+ * concurrent data changes. On exit, both tables and their indexes are closed,
+ * but locked in AccessExclusiveLock mode.
+ */
+static void
+rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
+ Relation cl_index,
+ LogicalDecodingContext *ctx,
+ bool swap_toast_by_content,
+ TransactionId frozenXid,
+ MultiXactId cutoffMulti)
+{
+ LOCKMODE lockmode_old PG_USED_FOR_ASSERTS_ONLY;
+ List *ind_oids_new;
+ Oid old_table_oid = RelationGetRelid(OldHeap);
+ Oid new_table_oid = RelationGetRelid(NewHeap);
+ List *ind_oids_old = RelationGetIndexList(OldHeap);
+ ListCell *lc,
+ *lc2;
+ char relpersistence;
+ bool is_system_catalog;
+ Oid ident_idx_old,
+ ident_idx_new;
+ IndexInsertState *iistate;
+ ScanKey ident_key;
+ int ident_key_nentries;
+ XLogRecPtr wal_insert_ptr,
+ end_of_wal;
+ char dummy_rec_data = '\0';
+ Relation *ind_refs,
+ *ind_refs_p;
+ int nind;
+
+ /* Like in cluster_rel(). */
+ lockmode_old = ShareUpdateExclusiveLock;
+ Assert(CheckRelationLockedByMe(OldHeap, lockmode_old, false));
+ Assert(cl_index == NULL ||
+ CheckRelationLockedByMe(cl_index, lockmode_old, false));
+ /* This is expected from the caller. */
+ Assert(CheckRelationLockedByMe(NewHeap, AccessExclusiveLock, false));
+
+ ident_idx_old = RelationGetReplicaIndex(OldHeap);
+
+ /*
+ * Unlike the exclusive case, we build new indexes for the new relation
+ * rather than swapping the storage and reindexing the old relation. The
+ * point is that the index build can take some time, so we do it before
+ * we get AccessExclusiveLock on the old heap; since we do not hold that
+ * lock yet, we cannot swap the heap storage at this stage.
+ *
+ * index_create() will lock the new indexes using AccessExclusiveLock - no
+ * need to change that.
+ *
+ * We assume that ShareUpdateExclusiveLock on the table prevents anyone
+ * from dropping the existing indexes or adding new ones, so the lists of
+ * old and new indexes should match at swap time. On the other hand, we
+ * do not block ALTER INDEX commands that do not require a table lock
+ * (e.g. ALTER INDEX ... SET ...).
+ *
+ * XXX Should we check at the end of our work whether another transaction
+ * executed such a command, and issue a NOTICE that we might have discarded
+ * its effects? (For example, if someone changes a storage parameter after
+ * we have created the new index, the new value of that parameter is lost.)
+ * Alternatively, we can lock all the indexes now in a mode that blocks
+ * all the ALTER INDEX commands (ShareUpdateExclusiveLock?), and keep
+ * them locked till the end of the transaction. That might increase the
+ * risk of deadlock during the lock upgrade below, however SELECT / DML
+ * queries should not be involved in such a deadlock.
+ */
+ ind_oids_new = build_new_indexes(NewHeap, OldHeap, ind_oids_old);
+
+ /*
+ * Processing shouldn't start without a valid identity index.
+ */
+ Assert(OidIsValid(ident_idx_old));
+
+ /* Find "identity index" on the new relation. */
+ ident_idx_new = InvalidOid;
+ forboth(lc, ind_oids_old, lc2, ind_oids_new)
+ {
+ Oid ind_old = lfirst_oid(lc);
+ Oid ind_new = lfirst_oid(lc2);
+
+ if (ident_idx_old == ind_old)
+ {
+ ident_idx_new = ind_new;
+ break;
+ }
+ }
+ if (!OidIsValid(ident_idx_new))
+ /*
+ * Should not happen, given our lock on the old relation.
+ */
+ ereport(ERROR,
+ (errmsg("identity index missing on the new relation")));
+
+ /* Executor state to update indexes. */
+ iistate = get_index_insert_state(NewHeap, ident_idx_new);
+
+ /*
+ * Build scan key that we'll use to look for rows to be updated / deleted
+ * during logical decoding.
+ */
+ ident_key = build_identity_key(ident_idx_new, OldHeap, &ident_key_nentries);
+
+ /*
+ * Flush all WAL records inserted so far (possibly except for the last
+ * incomplete page, see GetInsertRecPtr), to minimize the amount of data
+ * we need to flush while holding exclusive lock on the source table.
+ */
+ wal_insert_ptr = GetInsertRecPtr();
+ XLogFlush(wal_insert_ptr);
+ end_of_wal = GetFlushRecPtr(NULL);
+
+ /*
+ * Apply concurrent changes first time, to minimize the time we need to
+ * hold AccessExclusiveLock. (Quite some amount of WAL could have been
+ * written during the data copying and index creation.)
+ */
+ process_concurrent_changes(ctx, end_of_wal, NewHeap,
+ swap_toast_by_content ? OldHeap : NULL,
+ ident_key, ident_key_nentries, iistate);
+
+ /*
+ * Acquire AccessExclusiveLock on the table, its TOAST relation (if there
+ * is one) and all its indexes, so that we can swap the files.
+ *
+ * Before that, unlock the index temporarily to avoid deadlock in case
+ * another transaction is trying to lock it while holding the lock on the
+ * table.
+ */
+ if (cl_index)
+ {
+ index_close(cl_index, ShareUpdateExclusiveLock);
+ cl_index = NULL;
+ }
+ /* Also lock the TOAST relation; its files need to be swapped too. */
+ if (OldHeap->rd_rel->reltoastrelid)
+ LockRelationOid(OldHeap->rd_rel->reltoastrelid, AccessExclusiveLock);
+ /* Finally lock the table */
+ LockRelationOid(old_table_oid, AccessExclusiveLock);
+
+ /*
+ * Lock all indexes now, not only the clustering one: all indexes need to
+ * have their files swapped. While doing that, store their relation
+ * references in an array, to handle predicate locks below.
+ */
+ ind_refs_p = ind_refs = palloc_array(Relation, list_length(ind_oids_old));
+ nind = 0;
+ foreach(lc, ind_oids_old)
+ {
+ Oid ind_oid;
+ Relation index;
+
+ ind_oid = lfirst_oid(lc);
+ index = index_open(ind_oid, AccessExclusiveLock);
+ /*
+ * TODO 1) Do we need to check if ALTER INDEX was executed since the
+ * new index was created in build_new_indexes()? 2) Specifically for
+ * the clustering index, should check_index_is_clusterable() be called
+ * here? (Not sure about the latter: ShareUpdateExclusiveLock on the
+ * table probably blocks all commands that affect the result of
+ * check_index_is_clusterable().)
+ */
+ *ind_refs_p = index;
+ ind_refs_p++;
+ nind++;
+ }
+
+ /*
+ * In addition, lock the OldHeap's TOAST relation exclusively - again, the
+ * lock is needed to swap the files.
+ */
+ if (OidIsValid(OldHeap->rd_rel->reltoastrelid))
+ LockRelationOid(OldHeap->rd_rel->reltoastrelid, AccessExclusiveLock);
+
+ /*
+ * Tuples and pages of the old heap will be gone, but the heap will stay.
+ */
+ TransferPredicateLocksToHeapRelation(OldHeap);
+ /* The same for indexes. */
+ for (int i = 0; i < nind; i++)
+ {
+ Relation index = ind_refs[i];
+
+ TransferPredicateLocksToHeapRelation(index);
+
+ /*
+ * References to indexes on the old relation are not needed anymore,
+ * however locks stay till the end of the transaction.
+ */
+ index_close(index, NoLock);
+ }
+ pfree(ind_refs);
+
+ /*
+ * Flush anything we see in WAL, to make sure that all changes committed
+ * while we were waiting for the exclusive lock are available for
+ * decoding. This should not be necessary if all backends had
+ * synchronous_commit set, but we can't rely on this setting.
+ *
+ * Unfortunately, GetInsertRecPtr() may lag behind the actual insert
+ * position, and GetLastImportantRecPtr() points at the start of the last
+ * record rather than at the end. Thus the simplest way to determine the
+ * insert position is to insert a dummy record and use its LSN.
+ *
+ * XXX Consider using GetLastImportantRecPtr() and adding the size of the
+ * last record (plus the total size of all the page headers the record
+ * spans)?
+ */
+ XLogBeginInsert();
+ XLogRegisterData(&dummy_rec_data, 1);
+ wal_insert_ptr = XLogInsert(RM_XLOG_ID, XLOG_NOOP);
+ XLogFlush(wal_insert_ptr);
+ end_of_wal = GetFlushRecPtr(NULL);
+
+ /* Apply the concurrent changes again. */
+ process_concurrent_changes(ctx, end_of_wal, NewHeap,
+ swap_toast_by_content ? OldHeap : NULL,
+ ident_key, ident_key_nentries, iistate);
+
+ /* Remember info about rel before closing OldHeap */
+ relpersistence = OldHeap->rd_rel->relpersistence;
+ is_system_catalog = IsSystemRelation(OldHeap);
+
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_SWAP_REL_FILES);
+
+ /*
+ * Even ShareUpdateExclusiveLock should have prevented others from
+ * creating / dropping indexes (even using the CONCURRENTLY option), so we
+ * do not need to check whether the lists match.
+ */
+ forboth(lc, ind_oids_old, lc2, ind_oids_new)
+ {
+ Oid ind_old = lfirst_oid(lc);
+ Oid ind_new = lfirst_oid(lc2);
+ Oid mapped_tables[4];
+
+ /* Zero out possible results from swap_relation_files() */
+ memset(mapped_tables, 0, sizeof(mapped_tables));
+
+ swap_relation_files(ind_old, ind_new,
+ (old_table_oid == RelationRelationId),
+ swap_toast_by_content,
+ true,
+ InvalidTransactionId,
+ InvalidMultiXactId,
+ mapped_tables);
+
+#ifdef USE_ASSERT_CHECKING
+
+ /*
+ * Concurrent processing is not supported for system relations, so
+ * there should be no mapped tables.
+ */
+ for (int i = 0; i < 4; i++)
+ Assert(mapped_tables[i] == 0);
+#endif
+ }
+
+ /* The new indexes must be visible for deletion. */
+ CommandCounterIncrement();
+
+ /* Close the old heap but keep lock until transaction commit. */
+ table_close(OldHeap, NoLock);
+ /* Close the new heap. (We didn't have to open its indexes.) */
+ table_close(NewHeap, NoLock);
+
+ /* Cleanup what we don't need anymore. (And close the identity index.) */
+ pfree(ident_key);
+ free_index_insert_state(iistate);
+
+ /*
+ * Swap the relations and their TOAST relations and TOAST indexes. This
+ * also drops the new relation and its indexes.
+ *
+ * (System catalogs are currently not supported.)
+ */
+ Assert(!is_system_catalog);
+ finish_heap_swap(old_table_oid, new_table_oid,
+ is_system_catalog,
+ swap_toast_by_content,
+ false, true, false,
+ frozenXid, cutoffMulti,
+ relpersistence);
+}
+
+/*
+ * Build indexes on NewHeap according to those on OldHeap.
+ *
+ * OldIndexes is the list of index OIDs on OldHeap.
+ *
+ * A list of OIDs of the corresponding indexes created on NewHeap is
+ * returned. The order of items matches that of OldIndexes, so the two
+ * lists can be used to swap index storage.
+ */
+static List *
+build_new_indexes(Relation NewHeap, Relation OldHeap, List *OldIndexes)
+{
+ StringInfo ind_name;
+ ListCell *lc;
+ List *result = NIL;
+
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_REBUILD_INDEX);
+
+ ind_name = makeStringInfo();
+
+ foreach(lc, OldIndexes)
+ {
+ Oid ind_oid,
+ ind_oid_new,
+ tbsp_oid;
+ Relation ind;
+ IndexInfo *ind_info;
+ int i,
+ heap_col_id;
+ List *colnames;
+ int16 indnatts;
+ Oid *collations,
+ *opclasses;
+ HeapTuple tup;
+ bool isnull;
+ Datum d;
+ oidvector *oidvec;
+ int2vector *int2vec;
+ size_t oid_arr_size;
+ size_t int2_arr_size;
+ int16 *indoptions;
+ text *reloptions = NULL;
+ bits16 flags;
+ Datum *opclassOptions;
+ NullableDatum *stattargets;
+
+ ind_oid = lfirst_oid(lc);
+ ind = index_open(ind_oid, AccessShareLock);
+ ind_info = BuildIndexInfo(ind);
+
+ tbsp_oid = ind->rd_rel->reltablespace;
+
+ /*
+ * Index names really don't matter, we'll eventually use only the index
+ * storage. Just make them unique within the table.
+ */
+ resetStringInfo(ind_name);
+ appendStringInfo(ind_name, "ind_%d",
+ list_cell_number(OldIndexes, lc));
+
+ flags = 0;
+ if (ind->rd_index->indisprimary)
+ flags |= INDEX_CREATE_IS_PRIMARY;
+
+ colnames = NIL;
+ indnatts = ind->rd_index->indnatts;
+ oid_arr_size = sizeof(Oid) * indnatts;
+ int2_arr_size = sizeof(int16) * indnatts;
+
+ collations = (Oid *) palloc(oid_arr_size);
+ for (i = 0; i < indnatts; i++)
+ {
+ char *colname;
+
+ heap_col_id = ind->rd_index->indkey.values[i];
+ if (heap_col_id > 0)
+ {
+ Form_pg_attribute att;
+
+ /* Normal attribute. */
+ att = TupleDescAttr(OldHeap->rd_att, heap_col_id - 1);
+ colname = pstrdup(NameStr(att->attname));
+ collations[i] = att->attcollation;
+ }
+ else if (heap_col_id == 0)
+ {
+ HeapTuple tuple;
+ Form_pg_attribute att;
+
+ /*
+ * Expression column is not present in relcache. What we need
+ * here is an attribute of the *index* relation.
+ */
+ tuple = SearchSysCache2(ATTNUM,
+ ObjectIdGetDatum(ind_oid),
+ Int16GetDatum(i + 1));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR,
+ "cache lookup failed for attribute %d of relation %u",
+ i + 1, ind_oid);
+ att = (Form_pg_attribute) GETSTRUCT(tuple);
+ colname = pstrdup(NameStr(att->attname));
+ collations[i] = att->attcollation;
+ ReleaseSysCache(tuple);
+ }
+ else
+ elog(ERROR, "unexpected column number: %d",
+ heap_col_id);
+
+ colnames = lappend(colnames, colname);
+ }
+
+ /*
+ * Special effort needed for variable length attributes of
+ * Form_pg_index.
+ */
+ tup = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(ind_oid));
+ if (!HeapTupleIsValid(tup))
+ elog(ERROR, "cache lookup failed for index %u", ind_oid);
+ d = SysCacheGetAttr(INDEXRELID, tup, Anum_pg_index_indclass, &isnull);
+ Assert(!isnull);
+ oidvec = (oidvector *) DatumGetPointer(d);
+ opclasses = (Oid *) palloc(oid_arr_size);
+ memcpy(opclasses, oidvec->values, oid_arr_size);
+
+ d = SysCacheGetAttr(INDEXRELID, tup, Anum_pg_index_indoption,
+ &isnull);
+ Assert(!isnull);
+ int2vec = (int2vector *) DatumGetPointer(d);
+ indoptions = (int16 *) palloc(int2_arr_size);
+ memcpy(indoptions, int2vec->values, int2_arr_size);
+ ReleaseSysCache(tup);
+
+ tup = SearchSysCache1(RELOID, ObjectIdGetDatum(ind_oid));
+ if (!HeapTupleIsValid(tup))
+ elog(ERROR, "cache lookup failed for index relation %u", ind_oid);
+ d = SysCacheGetAttr(RELOID, tup, Anum_pg_class_reloptions, &isnull);
+ reloptions = !isnull ? DatumGetTextPCopy(d) : NULL;
+ ReleaseSysCache(tup);
+
+ opclassOptions = palloc0(sizeof(Datum) * ind_info->ii_NumIndexAttrs);
+ for (i = 0; i < ind_info->ii_NumIndexAttrs; i++)
+ opclassOptions[i] = get_attoptions(ind_oid, i + 1);
+
+ stattargets = get_index_stattargets(ind_oid, ind_info);
+
+ /*
+ * Neither parentIndexRelid nor parentConstraintId needs to be passed
+ * since the new catalog entries (pg_constraint, pg_inherits) would
+ * eventually be dropped. Therefore there's no need to record valid
+ * dependency on parents.
+ */
+ ind_oid_new = index_create(NewHeap,
+ ind_name->data,
+ InvalidOid,
+ InvalidOid, /* parentIndexRelid */
+ InvalidOid, /* parentConstraintId */
+ InvalidOid,
+ ind_info,
+ colnames,
+ ind->rd_rel->relam,
+ tbsp_oid,
+ collations,
+ opclasses,
+ opclassOptions,
+ indoptions,
+ stattargets,
+ PointerGetDatum(reloptions),
+ flags, /* flags */
+ 0, /* constr_flags */
+ false, /* allow_system_table_mods */
+ false, /* is_internal */
+ NULL /* constraintId */
+ );
+ result = lappend_oid(result, ind_oid_new);
+
+ index_close(ind, AccessShareLock);
+ list_free_deep(colnames);
+ pfree(collations);
+ pfree(opclasses);
+ pfree(indoptions);
+ if (reloptions)
+ pfree(reloptions);
+ }
+
+ return result;
+}
+
+/*
+ * REPACK is intended to be a replacement for both CLUSTER and VACUUM FULL.
+ */
+void
+repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
+{
+ ListCell *lc;
+ ClusterParams params = {0};
+ bool verbose = false;
+ Relation rel = NULL;
+ Oid indexOid = InvalidOid;
+ MemoryContext repack_context;
+ List *rtcs;
+ LOCKMODE lockmode;
+
+ /* Parse option list */
+ foreach(lc, stmt->params)
+ {
+ DefElem *opt = (DefElem *) lfirst(lc);
+
+ if (strcmp(opt->defname, "verbose") == 0)
+ verbose = defGetBoolean(opt);
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("unrecognized REPACK option \"%s\"",
+ opt->defname),
+ parser_errposition(pstate, opt->location)));
+ }
+
+ params.options =
+ (verbose ? CLUOPT_VERBOSE : 0) |
+ (stmt->concurrent ? CLUOPT_CONCURRENT : 0);
+
+ /*
+ * Determine the lock mode expected by cluster_rel().
+ *
+ * In the exclusive case, we obtain AccessExclusiveLock right away to
+ * avoid lock-upgrade hazard in the single-transaction case. In the
+ * CONCURRENTLY case, the AccessExclusiveLock will only be used at the end
+ * of processing, presumably for a very short time. Until then, we'll have
+ * to unlock the relation temporarily, so there's no lock-upgrade hazard.
+ */
+ lockmode = (params.options & CLUOPT_CONCURRENT) == 0 ?
+ AccessExclusiveLock : ShareUpdateExclusiveLock;
+
+ if (stmt->relation != NULL)
+ {
+ /* This is the single-relation case. */
+ rel = process_single_relation(stmt->relation, stmt->indexname,
+ lockmode, isTopLevel, &params,
+ CLUSTER_COMMAND_REPACK, &indexOid);
+ if (rel == NULL)
+ return;
+ }
+
+ /*
+ * By here, we know we are in a multi-table situation.
+ *
+ * Concurrent processing is currently considered rather special (e.g. in
+ * terms of resources consumed), so it is not performed in bulk.
+ */
+ if (params.options & CLUOPT_CONCURRENT)
+ {
+ if (rel != NULL)
+ {
+ Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
+ ereport(ERROR,
+ (errmsg("REPACK CONCURRENTLY not supported for partitioned tables"),
+ errhint("Consider running the command for individual partitions.")));
+ }
+ else
+ ereport(ERROR,
+ (errmsg("REPACK CONCURRENTLY requires an explicit table name")));
+ }
+
+ /*
+ * In order to avoid holding locks for too long, we want to process each
+ * table in its own transaction. This forces us to disallow running
+ * inside a user transaction block.
+ */
+ PreventInTransactionBlock(isTopLevel, "REPACK");
+
+ /* Also, we need a memory context to hold our list of relations */
+ repack_context = AllocSetContextCreate(PortalContext,
+ "Repack",
+ ALLOCSET_DEFAULT_SIZES);
+
+ params.options |= CLUOPT_RECHECK;
+ if (rel != NULL)
+ {
+ Oid relid;
+ bool rel_is_index;
+
+ Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
+ /* See the ereport() above. */
+ Assert((params.options & CLUOPT_CONCURRENT) == 0);
+
+ if (OidIsValid(indexOid))
+ {
+ relid = indexOid;
+ rel_is_index = true;
+ }
+ else
+ {
+ relid = RelationGetRelid(rel);
+ rel_is_index = false;
+ }
+ rtcs = get_tables_to_cluster_partitioned(repack_context, relid,
+ rel_is_index,
+ CLUSTER_COMMAND_REPACK);
+
+ /* close relation, releasing lock on parent table */
+ table_close(rel, lockmode);
+ }
+ else
+ rtcs = get_tables_to_repack(repack_context);
+
+ /* Do the job. */
+ cluster_multiple_rels(rtcs, &params, CLUSTER_COMMAND_REPACK, lockmode,
+ isTopLevel);
- /* Do the job. */
- cluster_multiple_rels(rtcs, &params, CLUSTER_COMMAND_REPACK);
/* Start a new transaction for the cleanup work. */
StartTransactionCommand();
@@ -1934,6 +3516,7 @@ repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
*/
static Relation
process_single_relation(RangeVar *relation, char *indexname,
+ LOCKMODE lockmode, bool isTopLevel,
ClusterParams *params, ClusterCommand cmd,
Oid *indexOid_p)
{
@@ -1944,12 +3527,10 @@ process_single_relation(RangeVar *relation, char *indexname,
Oid tableOid;
/*
- * Find, lock, and check permissions on the table. We obtain
- * AccessExclusiveLock right away to avoid lock-upgrade hazard in the
- * single-transaction case.
+ * Find, lock, and check permissions on the table.
*/
tableOid = RangeVarGetRelidExtended(relation,
- AccessExclusiveLock,
+ lockmode,
0,
RangeVarCallbackMaintainsTable,
NULL);
@@ -2013,7 +3594,7 @@ process_single_relation(RangeVar *relation, char *indexname,
/* For non-partitioned tables, do what we came here to do. */
if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
{
- cluster_rel(rel, indexOid, params, cmd);
+ cluster_rel(rel, indexOid, params, cmd, isTopLevel);
/* cluster_rel closes the relation, but keeps lock */
return NULL;
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 27c2cb26ef5..c6004b4242a 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -903,7 +903,7 @@ refresh_by_match_merge(Oid matviewOid, Oid tempOid, Oid relowner,
static void
refresh_by_heap_swap(Oid matviewOid, Oid OIDNewHeap, char relpersistence)
{
- finish_heap_swap(matviewOid, OIDNewHeap, false, false, true, true,
+ finish_heap_swap(matviewOid, OIDNewHeap, false, false, true, true, true,
RecentXmin, ReadNextMultiXactId(), relpersistence);
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index ea96947d813..0b9357809c4 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -5990,6 +5990,7 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
finish_heap_swap(tab->relid, OIDNewHeap,
false, false, true,
!OidIsValid(tab->newTableSpace),
+ true,
RecentXmin,
ReadNextMultiXactId(),
persistence);
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 0bdfcd90878..553cbfaa378 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -124,7 +124,7 @@ static void vac_truncate_clog(TransactionId frozenXID,
TransactionId lastSaneFrozenXid,
MultiXactId lastSaneMinMulti);
static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
- BufferAccessStrategy bstrategy);
+ BufferAccessStrategy bstrategy, bool isTopLevel);
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
@@ -634,7 +634,8 @@ vacuum(List *relations, VacuumParams *params, BufferAccessStrategy bstrategy,
if (params->options & VACOPT_VACUUM)
{
- if (!vacuum_rel(vrel->oid, vrel->relation, params, bstrategy))
+ if (!vacuum_rel(vrel->oid, vrel->relation, params, bstrategy,
+ isTopLevel))
continue;
}
@@ -1998,7 +1999,7 @@ vac_truncate_clog(TransactionId frozenXID,
*/
static bool
vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
- BufferAccessStrategy bstrategy)
+ BufferAccessStrategy bstrategy, bool isTopLevel)
{
LOCKMODE lmode;
Relation rel;
@@ -2266,7 +2267,7 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
/* VACUUM FULL is now a variant of CLUSTER; see cluster.c */
cluster_rel(rel, InvalidOid, &cluster_params,
- CLUSTER_COMMAND_VACUUM);
+ CLUSTER_COMMAND_VACUUM, isTopLevel);
/* cluster_rel closes the relation, but keeps lock */
rel = NULL;
@@ -2312,7 +2313,8 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
toast_vacuum_params.options |= VACOPT_PROCESS_MAIN;
toast_vacuum_params.toast_parent = relid;
- vacuum_rel(toast_relid, NULL, &toast_vacuum_params, bstrategy);
+ vacuum_rel(toast_relid, NULL, &toast_vacuum_params, bstrategy,
+ isTopLevel);
}
/*
diff --git a/src/backend/meson.build b/src/backend/meson.build
index 2b0db214804..50aa385a581 100644
--- a/src/backend/meson.build
+++ b/src/backend/meson.build
@@ -194,5 +194,6 @@ pg_test_mod_args = pg_mod_args + {
subdir('jit/llvm')
subdir('replication/libpqwalreceiver')
subdir('replication/pgoutput')
+subdir('replication/pgoutput_repack')
subdir('snowball')
subdir('utils/mb/conversion_procs')
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 5c41f866cd9..7f644b4bdab 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -11897,27 +11897,30 @@ cluster_index_specification:
*
* QUERY:
* REPACK [ (options) ] [ <qualified_name> [ USING INDEX <index_name> ] ]
+ * REPACK [ (options) ] CONCURRENTLY <qualified_name> [ USING INDEX <index_name> ]
*
*****************************************************************************/
RepackStmt:
- REPACK opt_repack_args
+ REPACK opt_concurrently opt_repack_args
{
RepackStmt *n = makeNode(RepackStmt);
- n->relation = $2 ? (RangeVar *) linitial($2) : NULL;
- n->indexname = $2 ? (char *) lsecond($2) : NULL;
+ n->relation = $3 ? (RangeVar *) linitial($3) : NULL;
+ n->indexname = $3 ? (char *) lsecond($3) : NULL;
n->params = NIL;
+ n->concurrent = $2;
$$ = (Node *) n;
}
- | REPACK '(' utility_option_list ')' opt_repack_args
+ | REPACK '(' utility_option_list ')' opt_concurrently opt_repack_args
{
RepackStmt *n = makeNode(RepackStmt);
- n->relation = $5 ? (RangeVar *) linitial($5) : NULL;
- n->indexname = $5 ? (char *) lsecond($5) : NULL;
+ n->relation = $6 ? (RangeVar *) linitial($6) : NULL;
+ n->indexname = $6 ? (char *) lsecond($6) : NULL;
n->params = $3;
+ n->concurrent = $5;
$$ = (Node *) n;
}
;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index cc03f0706e9..5dc4ae58ffe 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -33,6 +33,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecord.h"
#include "catalog/pg_control.h"
+#include "commands/cluster.h"
#include "replication/decode.h"
#include "replication/logical.h"
#include "replication/message.h"
@@ -472,6 +473,88 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
TransactionId xid = XLogRecGetXid(buf->record);
SnapBuild *builder = ctx->snapshot_builder;
+ /*
+ * If the change is not intended for logical decoding, do not even
+ * establish a transaction for it - REPACK CONCURRENTLY is the typical use
+ * case.
+ *
+ * First, check if REPACK CONCURRENTLY is being performed by this backend.
+ * If so, only decode data changes of the table that it is processing, and
+ * the changes of its TOAST relation.
+ *
+ * (The TOAST locator should not be set unless the main relation's is.)
+ */
+ Assert(!OidIsValid(repacked_rel_toast_locator.relNumber) ||
+ OidIsValid(repacked_rel_locator.relNumber));
+
+ if (OidIsValid(repacked_rel_locator.relNumber))
+ {
+ XLogReaderState *r = buf->record;
+ RelFileLocator locator;
+
+ /* Not all records contain the block. */
+ if (XLogRecGetBlockTagExtended(r, 0, &locator, NULL, NULL, NULL) &&
+ !RelFileLocatorEquals(locator, repacked_rel_locator) &&
+ (!OidIsValid(repacked_rel_toast_locator.relNumber) ||
+ !RelFileLocatorEquals(locator, repacked_rel_toast_locator)))
+ return;
+ }
+
+ /*
+ * Second, skip records which do not contain sufficient information for
+ * the decoding.
+ *
+ * The problem we solve here is that REPACK CONCURRENTLY generates WAL
+ * when doing changes in the new table. Those changes are of no use to
+ * any other consumer (such as a logical replication subscription) because
+ * the new table will eventually be dropped (after REPACK CONCURRENTLY has
+ * assigned its file to the "old table").
+ */
+ switch (info)
+ {
+ case XLOG_HEAP_INSERT:
+ {
+ xl_heap_insert *rec;
+
+ rec = (xl_heap_insert *) XLogRecGetData(buf->record);
+
+ /*
+ * This does happen when 1) raw_heap_insert marks the TOAST
+ * record as HEAP_INSERT_NO_LOGICAL, or 2) REPACK CONCURRENTLY
+ * replays inserts performed by other backends.
+ */
+ if ((rec->flags & XLH_INSERT_CONTAINS_NEW_TUPLE) == 0)
+ return;
+
+ break;
+ }
+
+ case XLOG_HEAP_HOT_UPDATE:
+ case XLOG_HEAP_UPDATE:
+ {
+ xl_heap_update *rec;
+
+ rec = (xl_heap_update *) XLogRecGetData(buf->record);
+ if ((rec->flags &
+ (XLH_UPDATE_CONTAINS_NEW_TUPLE |
+ XLH_UPDATE_CONTAINS_OLD_TUPLE |
+ XLH_UPDATE_CONTAINS_OLD_KEY)) == 0)
+ return;
+
+ break;
+ }
+
+ case XLOG_HEAP_DELETE:
+ {
+ xl_heap_delete *rec;
+
+ rec = (xl_heap_delete *) XLogRecGetData(buf->record);
+ if (rec->flags & XLH_DELETE_NO_LOGICAL)
+ return;
+ break;
+ }
+ }
+
ReorderBufferProcessXid(ctx->reorder, xid, buf->origptr);
/*
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index feaa3ac5ad4..5d552f9ce74 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -486,6 +486,26 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
return SnapBuildMVCCFromHistoric(snap, true);
}
+/*
+ * Build an MVCC snapshot for the initial data load performed by REPACK
+ * CONCURRENTLY command.
+ *
+ * The snapshot will only be used to scan one particular relation, which is
+ * treated like a catalog (therefore ->building_full_snapshot is not
+ * important), and the caller should already have a replication slot setup (so
+ * we do not set MyProc->xmin). XXX Do we still need to add some restrictions?
+ */
+Snapshot
+SnapBuildInitialSnapshotForRepack(SnapBuild *builder)
+{
+ Snapshot snap;
+
+ Assert(builder->state == SNAPBUILD_CONSISTENT);
+
+ snap = SnapBuildBuildSnapshot(builder);
+ return SnapBuildMVCCFromHistoric(snap, false);
+}
+
/*
* Turn a historic MVCC snapshot into an ordinary MVCC snapshot.
*
diff --git a/src/backend/replication/pgoutput_repack/Makefile b/src/backend/replication/pgoutput_repack/Makefile
new file mode 100644
index 00000000000..4efeb713b70
--- /dev/null
+++ b/src/backend/replication/pgoutput_repack/Makefile
@@ -0,0 +1,32 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+# Makefile for src/backend/replication/pgoutput_repack
+#
+# IDENTIFICATION
+# src/backend/replication/pgoutput_repack/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/replication/pgoutput_repack
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+ $(WIN32RES) \
+ pgoutput_repack.o
+PGFILEDESC = "pgoutput_repack - logical replication output plugin for REPACK command"
+NAME = pgoutput_repack
+
+all: all-shared-lib
+
+include $(top_srcdir)/src/Makefile.shlib
+
+install: all installdirs install-lib
+
+installdirs: installdirs-lib
+
+uninstall: uninstall-lib
+
+clean distclean: clean-lib
+ rm -f $(OBJS)
diff --git a/src/backend/replication/pgoutput_repack/meson.build b/src/backend/replication/pgoutput_repack/meson.build
new file mode 100644
index 00000000000..133e865a4a0
--- /dev/null
+++ b/src/backend/replication/pgoutput_repack/meson.build
@@ -0,0 +1,18 @@
+# Copyright (c) 2022-2024, PostgreSQL Global Development Group
+
+pgoutput_repack_sources = files(
+ 'pgoutput_repack.c',
+)
+
+if host_system == 'windows'
+ pgoutput_repack_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pgoutput_repack',
+ '--FILEDESC', 'pgoutput_repack - logical replication output plugin for REPACK command',])
+endif
+
+pgoutput_repack = shared_module('pgoutput_repack',
+ pgoutput_repack_sources,
+ kwargs: pg_mod_args,
+)
+
+backend_targets += pgoutput_repack
diff --git a/src/backend/replication/pgoutput_repack/pgoutput_repack.c b/src/backend/replication/pgoutput_repack/pgoutput_repack.c
new file mode 100644
index 00000000000..687fbbc59bb
--- /dev/null
+++ b/src/backend/replication/pgoutput_repack/pgoutput_repack.c
@@ -0,0 +1,288 @@
+/*-------------------------------------------------------------------------
+ *
+ * pgoutput_repack.c
+ * Logical Replication output plugin for REPACK command
+ *
+ * Copyright (c) 2012-2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/replication/pgoutput_repack/pgoutput_repack.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heaptoast.h"
+#include "commands/cluster.h"
+#include "replication/snapbuild.h"
+
+PG_MODULE_MAGIC;
+
+static void plugin_startup(LogicalDecodingContext *ctx,
+ OutputPluginOptions *opt, bool is_init);
+static void plugin_shutdown(LogicalDecodingContext *ctx);
+static void plugin_begin_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn);
+static void plugin_commit_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ Relation rel, ReorderBufferChange *change);
+static void plugin_truncate(struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, int nrelations,
+ Relation relations[],
+ ReorderBufferChange *change);
+static void store_change(LogicalDecodingContext *ctx,
+ ConcurrentChangeKind kind, HeapTuple tuple);
+
+void
+_PG_output_plugin_init(OutputPluginCallbacks *cb)
+{
+ AssertVariableIsOfType(&_PG_output_plugin_init, LogicalOutputPluginInit);
+
+ cb->startup_cb = plugin_startup;
+ cb->begin_cb = plugin_begin_txn;
+ cb->change_cb = plugin_change;
+ cb->truncate_cb = plugin_truncate;
+ cb->commit_cb = plugin_commit_txn;
+ cb->shutdown_cb = plugin_shutdown;
+}
+
+
+/* initialize this plugin */
+static void
+plugin_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
+ bool is_init)
+{
+ ctx->output_plugin_private = NULL;
+
+ /* Probably unnecessary, as we don't use the SQL interface ... */
+ opt->output_type = OUTPUT_PLUGIN_BINARY_OUTPUT;
+
+ if (ctx->output_plugin_options != NIL)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("this plugin does not expect any options")));
+ }
+}
+
+static void
+plugin_shutdown(LogicalDecodingContext *ctx)
+{
+}
+
+/*
+ * As we don't release the slot during the processing of a particular table,
+ * there's no room for an SQL interface, even for debugging purposes. Therefore
+ * we need neither OutputPluginPrepareWrite() nor OutputPluginWrite() in the plugin
+ * callbacks. (Although we might want to write custom callbacks, this API
+ * seems to be unnecessarily generic for our purposes.)
+ */
+
+/* BEGIN callback */
+static void
+plugin_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+}
+
+/* COMMIT callback */
+static void
+plugin_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+}
+
+/*
+ * Callback for individual changed tuples
+ */
+static void
+plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ Relation relation, ReorderBufferChange *change)
+{
+ RepackDecodingState *dstate;
+
+ dstate = (RepackDecodingState *) ctx->output_writer_private;
+
+ /* Only interested in one particular relation. */
+ if (relation->rd_id != dstate->relid)
+ return;
+
+ /* Decode entry depending on its type */
+ switch (change->action)
+ {
+ case REORDER_BUFFER_CHANGE_INSERT:
+ {
+ HeapTuple newtuple;
+
+ newtuple = change->data.tp.newtuple;
+
+ /*
+ * Identity checks in the main function should have made this
+ * impossible.
+ */
+ if (newtuple == NULL)
+ elog(ERROR, "incomplete insert info");
+
+ store_change(ctx, CHANGE_INSERT, newtuple);
+ }
+ break;
+ case REORDER_BUFFER_CHANGE_UPDATE:
+ {
+ HeapTuple oldtuple,
+ newtuple;
+
+ oldtuple = change->data.tp.oldtuple;
+ newtuple = change->data.tp.newtuple;
+
+ if (newtuple == NULL)
+ elog(ERROR, "incomplete update info");
+
+ if (oldtuple != NULL)
+ store_change(ctx, CHANGE_UPDATE_OLD, oldtuple);
+
+ store_change(ctx, CHANGE_UPDATE_NEW, newtuple);
+ }
+ break;
+ case REORDER_BUFFER_CHANGE_DELETE:
+ {
+ HeapTuple oldtuple;
+
+ oldtuple = change->data.tp.oldtuple;
+
+ if (oldtuple == NULL)
+ elog(ERROR, "incomplete delete info");
+
+ store_change(ctx, CHANGE_DELETE, oldtuple);
+ }
+ break;
+ default:
+ /* Should not come here */
+ Assert(false);
+ break;
+ }
+}
+
+static void
+plugin_truncate(struct LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ int nrelations, Relation relations[],
+ ReorderBufferChange *change)
+{
+ RepackDecodingState *dstate;
+ int i;
+ Relation relation = NULL;
+
+ dstate = (RepackDecodingState *) ctx->output_writer_private;
+
+ /* Find the relation we are processing. */
+ for (i = 0; i < nrelations; i++)
+ {
+ relation = relations[i];
+
+ if (RelationGetRelid(relation) == dstate->relid)
+ break;
+ }
+
+ /* Is this a truncation of another relation? */
+ if (i == nrelations)
+ return;
+
+ store_change(ctx, CHANGE_TRUNCATE, NULL);
+}
+
+/* Store concurrent data change. */
+static void
+store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
+ HeapTuple tuple)
+{
+ RepackDecodingState *dstate;
+ char *change_raw;
+ ConcurrentChange change;
+ bool flattened = false;
+ Size size;
+ Datum values[1];
+ bool isnull[1];
+ char *dst,
+ *dst_start;
+
+ dstate = (RepackDecodingState *) ctx->output_writer_private;
+
+ size = MAXALIGN(VARHDRSZ) + SizeOfConcurrentChange;
+
+ if (tuple)
+ {
+ /*
+ * ReorderBufferCommit() stores the TOAST chunks in its private memory
+ * context and frees them after having called apply_change().
+ * Therefore we need flat copy (including TOAST) that we eventually
+ * copy into the memory context which is available to
+ * decode_concurrent_changes().
+ */
+ if (HeapTupleHasExternal(tuple))
+ {
+ /*
+ * toast_flatten_tuple_to_datum() might be more convenient but we
+ * don't want the decompression it does.
+ */
+ tuple = toast_flatten_tuple(tuple, dstate->tupdesc);
+ flattened = true;
+ }
+
+ size += tuple->t_len;
+ }
+
+ /* XXX Isn't there any function / macro to do this? */
+ if (size >= 0x3FFFFFFF)
+ elog(ERROR, "change is too big");
+
+ /* Construct the change. */
+ change_raw = (char *) palloc0(size);
+ SET_VARSIZE(change_raw, size);
+
+ /*
+ * Since the varlena alignment might not be sufficient for the structure,
+ * set the fields in a local instance and remember where it should
+ * eventually be copied.
+ */
+ change.kind = kind;
+ dst_start = (char *) VARDATA(change_raw);
+
+ /* No other information is needed for TRUNCATE. */
+ if (change.kind == CHANGE_TRUNCATE)
+ {
+ memcpy(dst_start, &change, SizeOfConcurrentChange);
+ goto store;
+ }
+
+ /*
+ * Copy the tuple.
+ *
+ * CAUTION: change->tup_data.t_data must be fixed on retrieval!
+ */
+ memcpy(&change.tup_data, tuple, sizeof(HeapTupleData));
+ dst = dst_start + SizeOfConcurrentChange;
+ memcpy(dst, tuple->t_data, tuple->t_len);
+
+ /* The data has been copied. */
+ if (flattened)
+ pfree(tuple);
+
+store:
+ /* Copy the structure so it can be stored. */
+ memcpy(dst_start, &change, SizeOfConcurrentChange);
+
+ /* Store as tuple of 1 bytea column. */
+ values[0] = PointerGetDatum(change_raw);
+ isnull[0] = false;
+ tuplestore_putvalues(dstate->tstore, dstate->tupdesc_change,
+ values, isnull);
+
+ /* Accounting. */
+ dstate->nchanges++;
+
+ /* Cleanup. */
+ pfree(change_raw);
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..e9ddf39500c 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
#include "access/xlogprefetcher.h"
#include "access/xlogrecovery.h"
#include "commands/async.h"
+#include "commands/cluster.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/autovacuum.h"
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 4da68312b5f..eb576cdebe5 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -352,6 +352,7 @@ DSMRegistry "Waiting to read or update the dynamic shared memory registry."
InjectionPoint "Waiting to read or update information related to injection points."
SerialControl "Waiting to read or update shared <filename>pg_serial</filename> state."
AioWorkerSubmissionQueue "Waiting to access AIO worker submission queue."
+RepackedRels "Waiting to read or update information on tables being repacked concurrently."
#
# END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 559ba9cdb2c..4911642fb3c 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -64,6 +64,7 @@
#include "catalog/pg_type.h"
#include "catalog/schemapg.h"
#include "catalog/storage.h"
+#include "commands/cluster.h"
#include "commands/policy.h"
#include "commands/publicationcmds.h"
#include "commands/trigger.h"
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 70a6b8902d1..7f1c220e00b 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -213,7 +213,6 @@ static List *exportedSnapshots = NIL;
/* Prototypes for local functions */
static void UnregisterSnapshotNoOwner(Snapshot snapshot);
-static void FreeSnapshot(Snapshot snapshot);
static void SnapshotResetXmin(void);
/* ResourceOwner callbacks to track snapshot references */
@@ -646,7 +645,7 @@ CopySnapshot(Snapshot snapshot)
* FreeSnapshot
* Free the memory associated with a snapshot.
*/
-static void
+void
FreeSnapshot(Snapshot snapshot)
{
Assert(snapshot->regd_count == 0);
diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index 03c7be47855..1223ccd1911 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -4920,18 +4920,27 @@ match_previous_words(int pattern_id,
}
/* REPACK */
- else if (Matches("REPACK"))
+ else if (Matches("REPACK") || Matches("REPACK", "(*)"))
+ COMPLETE_WITH_SCHEMA_QUERY_PLUS(Query_for_list_of_clusterables,
+ "CONCURRENTLY");
+ else if (Matches("REPACK", "CONCURRENTLY"))
COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_clusterables);
- else if (Matches("REPACK", "(*)"))
+ else if (Matches("REPACK", "(*)", "CONCURRENTLY"))
COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_clusterables);
- /* If we have REPACK <sth>, then add "USING INDEX" */
- else if (Matches("REPACK", MatchAnyExcept("(")))
+ /* If we have REPACK [ CONCURRENTLY ] <sth>, then add "USING INDEX" */
+ else if (Matches("REPACK", MatchAnyExcept("(|CONCURRENTLY")) ||
+ Matches("REPACK", "CONCURRENTLY", MatchAnyExcept("(")))
COMPLETE_WITH("USING INDEX");
- /* If we have REPACK (*) <sth>, then add "USING INDEX" */
- else if (Matches("REPACK", "(*)", MatchAny))
+ /* If we have REPACK (*) [ CONCURRENTLY ] <sth>, then add "USING INDEX" */
+ else if (Matches("REPACK", "(*)", MatchAnyExcept("CONCURRENTLY")) ||
+ Matches("REPACK", "(*)", "CONCURRENTLY", MatchAnyExcept("(")))
COMPLETE_WITH("USING INDEX");
- /* If we have REPACK <sth> USING, then add the index as well */
- else if (Matches("REPACK", MatchAny, "USING", "INDEX"))
+
+ /*
+ * Complete ... [ (*) ] [ CONCURRENTLY ] <sth> USING INDEX, with a list of
+ * indexes for <sth>.
+ */
+ else if (TailMatches(MatchAnyExcept("(|CONCURRENTLY"), "USING", "INDEX"))
{
set_completion_reference(prev3_wd);
COMPLETE_WITH_SCHEMA_QUERY(Query_for_index_of_table);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index e48fe434cd3..be36bb51d0e 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -322,14 +322,15 @@ extern void heap_multi_insert(Relation relation, struct TupleTableSlot **slots,
BulkInsertState bistate);
extern TM_Result heap_delete(Relation relation, ItemPointer tid,
CommandId cid, Snapshot crosscheck, bool wait,
- struct TM_FailureData *tmfd, bool changingPart);
+ struct TM_FailureData *tmfd, bool changingPart,
+ bool wal_logical);
extern void heap_finish_speculative(Relation relation, ItemPointer tid);
extern void heap_abort_speculative(Relation relation, ItemPointer tid);
extern TM_Result heap_update(Relation relation, ItemPointer otid,
HeapTuple newtup,
CommandId cid, Snapshot crosscheck, bool wait,
struct TM_FailureData *tmfd, LockTupleMode *lockmode,
- TU_UpdateIndexes *update_indexes);
+ TU_UpdateIndexes *update_indexes, bool wal_logical);
extern TM_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
CommandId cid, LockTupleMode mode, LockWaitPolicy wait_policy,
bool follow_updates,
@@ -411,6 +412,10 @@ extern HTSV_Result HeapTupleSatisfiesVacuumHorizon(HeapTuple htup, Buffer buffer
TransactionId *dead_after);
extern void HeapTupleSetHintBits(HeapTupleHeader tuple, Buffer buffer,
uint16 infomask, TransactionId xid);
+extern bool HeapTupleMVCCInserted(HeapTuple htup, Snapshot snapshot,
+ Buffer buffer);
+extern bool HeapTupleMVCCNotDeleted(HeapTuple htup, Snapshot snapshot,
+ Buffer buffer);
extern bool HeapTupleHeaderIsOnlyLocked(HeapTupleHeader tuple);
extern bool HeapTupleIsSurelyDead(HeapTuple htup,
struct GlobalVisState *vistest);
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 277df6b3cf0..8d4af07f840 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -104,6 +104,8 @@
#define XLH_DELETE_CONTAINS_OLD_KEY (1<<2)
#define XLH_DELETE_IS_SUPER (1<<3)
#define XLH_DELETE_IS_PARTITION_MOVE (1<<4)
+/* See heap_delete() */
+#define XLH_DELETE_NO_LOGICAL (1<<5)
/* convenience macro for checking whether any form of old tuple was logged */
#define XLH_DELETE_CONTAINS_OLD \
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8713e12cbfb..58356392895 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,6 +21,7 @@
#include "access/sdir.h"
#include "access/xact.h"
#include "executor/tuptable.h"
+#include "replication/logical.h"
#include "storage/read_stream.h"
#include "utils/rel.h"
#include "utils/snapshot.h"
@@ -623,6 +624,8 @@ typedef struct TableAmRoutine
Relation OldIndex,
bool use_sort,
TransactionId OldestXmin,
+ Snapshot snapshot,
+ LogicalDecodingContext *decoding_ctx,
TransactionId *xid_cutoff,
MultiXactId *multi_cutoff,
double *num_tuples,
@@ -1627,6 +1630,10 @@ table_relation_copy_data(Relation rel, const RelFileLocator *newrlocator)
* not needed for the relation's AM
* - *xid_cutoff - ditto
* - *multi_cutoff - ditto
+ * - snapshot - if != NULL, ignore data changes done by transactions that this
+ * (MVCC) snapshot considers still in-progress or in the future.
+ * - decoding_ctx - logical decoding context, to capture concurrent data
+ * changes.
*
* Output parameters:
* - *xid_cutoff - rel's new relfrozenxid value, may be invalid
@@ -1639,6 +1646,8 @@ table_relation_copy_for_cluster(Relation OldTable, Relation NewTable,
Relation OldIndex,
bool use_sort,
TransactionId OldestXmin,
+ Snapshot snapshot,
+ LogicalDecodingContext *decoding_ctx,
TransactionId *xid_cutoff,
MultiXactId *multi_cutoff,
double *num_tuples,
@@ -1647,6 +1656,7 @@ table_relation_copy_for_cluster(Relation OldTable, Relation NewTable,
{
OldTable->rd_tableam->relation_copy_for_cluster(OldTable, NewTable, OldIndex,
use_sort, OldestXmin,
+ snapshot, decoding_ctx,
xid_cutoff, multi_cutoff,
num_tuples, tups_vacuumed,
tups_recently_dead);
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 4daa8bef5ee..66431cc19e5 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -100,6 +100,9 @@ extern Oid index_concurrently_create_copy(Relation heapRelation,
Oid tablespaceOid,
const char *newName);
+extern NullableDatum *get_index_stattargets(Oid indexid,
+ IndexInfo *indInfo);
+
extern void index_concurrently_build(Oid heapRelationId,
Oid indexRelationId);
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 3be57c97b3f..0a7e72bc74a 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -13,10 +13,15 @@
#ifndef CLUSTER_H
#define CLUSTER_H
+#include "nodes/execnodes.h"
#include "nodes/parsenodes.h"
#include "parser/parse_node.h"
+#include "replication/logical.h"
#include "storage/lock.h"
+#include "storage/relfilelocator.h"
#include "utils/relcache.h"
+#include "utils/resowner.h"
+#include "utils/tuplestore.h"
/* flag bits for ClusterParams->options */
@@ -24,6 +29,7 @@
#define CLUOPT_RECHECK 0x02 /* recheck relation state */
#define CLUOPT_RECHECK_ISCLUSTERED 0x04 /* recheck relation state for
* indisclustered */
+#define CLUOPT_CONCURRENT 0x08 /* allow concurrent data changes */
/* options for CLUSTER */
typedef struct ClusterParams
@@ -46,13 +52,89 @@ typedef enum ClusterCommand
CLUSTER_COMMAND_VACUUM
} ClusterCommand;
+/*
+ * The following definitions are used by REPACK CONCURRENTLY.
+ */
+
+extern RelFileLocator repacked_rel_locator;
+extern RelFileLocator repacked_rel_toast_locator;
+
+typedef enum
+{
+ CHANGE_INSERT,
+ CHANGE_UPDATE_OLD,
+ CHANGE_UPDATE_NEW,
+ CHANGE_DELETE,
+ CHANGE_TRUNCATE
+} ConcurrentChangeKind;
+
+typedef struct ConcurrentChange
+{
+ /* See the enum above. */
+ ConcurrentChangeKind kind;
+
+ /*
+ * The actual tuple.
+ *
+ * The tuple data follows the ConcurrentChange structure. Before use, make
+ * sure the tuple is correctly aligned (ConcurrentChange can be stored as
+ * bytea) and that tuple->t_data is fixed.
+ */
+ HeapTupleData tup_data;
+} ConcurrentChange;
+
+#define SizeOfConcurrentChange (offsetof(ConcurrentChange, tup_data) + \
+ sizeof(HeapTupleData))
+
+/*
+ * Logical decoding state.
+ *
+ * Here we store the data changes that we decode from WAL while the table
+ * contents are being copied to the new storage. The metadata needed to
+ * apply these changes to the table is stored here as well.
+ */
+typedef struct RepackDecodingState
+{
+ /* The relation whose changes we're decoding. */
+ Oid relid;
+
+ /*
+ * Decoded changes are stored here. Although we try to avoid excessively
+ * large batches, the changes may still need to be spilled to disk. The
+ * tuplestore does this transparently.
+ */
+ Tuplestorestate *tstore;
+
+ /* The current number of changes in tstore. */
+ double nchanges;
+
+ /*
+ * Descriptor to store the ConcurrentChange structure serialized (bytea).
+ * We can't store the tuple directly because tuplestore only supports
+ * minimal tuples and we may need to transfer the OID system column from the
+ * output plugin. Also we need to transfer the change kind, so it's better
+ * to put everything in the structure than to use 2 tuplestores "in
+ * parallel".
+ */
+ TupleDesc tupdesc_change;
+
+ /* Tuple descriptor needed to update indexes. */
+ TupleDesc tupdesc;
+
+ /* Slot to retrieve data from tstore. */
+ TupleTableSlot *tsslot;
+
+ ResourceOwner resowner;
+} RepackDecodingState;
+
extern void cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel);
extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
- ClusterCommand cmd);
+ ClusterCommand cmd, bool isTopLevel);
extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid,
LOCKMODE lockmode);
extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
-
+extern void repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
+ XLogRecPtr end_of_wal);
extern Oid make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
char relpersistence, LOCKMODE lockmode);
extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
@@ -60,6 +142,7 @@ extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
bool swap_toast_by_content,
bool check_constraints,
bool is_internal,
+ bool reindex,
TransactionId frozenXid,
MultiXactId cutoffMulti,
char newrelpersistence);
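The ConcurrentChange layout declared above stores the change kind and the tuple in one flat buffer, which is why its comment warns about alignment when reading the structure back out of a bytea. The pattern can be sketched stand-alone like this; the types and helper names here are hypothetical stand-ins, not the server's HeapTupleData machinery:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the server types; this only illustrates the
 * "header followed by payload in one flat buffer" idea. */
typedef enum { CHG_INSERT, CHG_UPDATE_NEW, CHG_DELETE } ChangeKind;

typedef struct
{
	ChangeKind	kind;
	size_t		len;			/* length of the payload that follows */
} ChangeHeader;

/* Serialize: the header, then the payload bytes immediately after it. */
static char *
change_serialize(ChangeKind kind, const char *payload, size_t len,
				 size_t *size_out)
{
	ChangeHeader hdr = {kind, len};
	char	   *buf;

	*size_out = sizeof(ChangeHeader) + len;
	buf = malloc(*size_out);
	memcpy(buf, &hdr, sizeof(hdr));
	memcpy(buf + sizeof(hdr), payload, len);
	return buf;
}

/* Deserialize: copy the header out via memcpy first, because a flat
 * buffer stored as bytea may not be suitably aligned for direct access. */
static ChangeKind
change_deserialize(const char *buf, char *payload_out, size_t payload_max)
{
	ChangeHeader hdr;

	memcpy(&hdr, buf, sizeof(hdr));
	assert(hdr.len <= payload_max);
	memcpy(payload_out, buf + sizeof(hdr), hdr.len);
	return hdr.kind;
}
```

A round trip recovers both the kind and the payload; the real code additionally has to fix up tup_data->t_data to point just past the header, as the comment in the struct notes.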
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index f92ff524031..4cbf4d16529 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -59,18 +59,20 @@
/*
* Progress parameters for REPACK.
*
- * Note: Since REPACK shares some code with CLUSTER, these values are also
- * used by CLUSTER. (CLUSTER is now deprecated, so it makes little sense to
- * introduce a separate set of constants.)
+ * Note: Since REPACK shares some code with CLUSTER, (some of) these values
+ * are also used by CLUSTER. (CLUSTER is now deprecated, so it makes little
+ * sense to introduce a separate set of constants.)
*/
#define PROGRESS_REPACK_COMMAND 0
#define PROGRESS_REPACK_PHASE 1
#define PROGRESS_REPACK_INDEX_RELID 2
#define PROGRESS_REPACK_HEAP_TUPLES_SCANNED 3
-#define PROGRESS_REPACK_HEAP_TUPLES_WRITTEN 4
-#define PROGRESS_REPACK_TOTAL_HEAP_BLKS 5
-#define PROGRESS_REPACK_HEAP_BLKS_SCANNED 6
-#define PROGRESS_REPACK_INDEX_REBUILD_COUNT 7
+#define PROGRESS_REPACK_HEAP_TUPLES_INSERTED 4
+#define PROGRESS_REPACK_HEAP_TUPLES_UPDATED 5
+#define PROGRESS_REPACK_HEAP_TUPLES_DELETED 6
+#define PROGRESS_REPACK_TOTAL_HEAP_BLKS 7
+#define PROGRESS_REPACK_HEAP_BLKS_SCANNED 8
+#define PROGRESS_REPACK_INDEX_REBUILD_COUNT 9
/*
* Phases of repack (as advertised via PROGRESS_REPACK_PHASE).
@@ -83,9 +85,10 @@
#define PROGRESS_REPACK_PHASE_INDEX_SCAN_HEAP 2
#define PROGRESS_REPACK_PHASE_SORT_TUPLES 3
#define PROGRESS_REPACK_PHASE_WRITE_NEW_HEAP 4
-#define PROGRESS_REPACK_PHASE_SWAP_REL_FILES 5
-#define PROGRESS_REPACK_PHASE_REBUILD_INDEX 6
-#define PROGRESS_REPACK_PHASE_FINAL_CLEANUP 7
+#define PROGRESS_REPACK_PHASE_CATCH_UP 5
+#define PROGRESS_REPACK_PHASE_SWAP_REL_FILES 6
+#define PROGRESS_REPACK_PHASE_REBUILD_INDEX 7
+#define PROGRESS_REPACK_PHASE_FINAL_CLEANUP 8
/*
* Commands of PROGRESS_REPACK
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index ecc31a107cd..e3112bab3ee 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3938,6 +3938,7 @@ typedef struct RepackStmt
RangeVar *relation; /* relation being repacked */
char *indexname; /* order tuples by this index */
List *params; /* list of DefElem nodes */
+ bool concurrent; /* allow concurrent access? */
} RepackStmt;
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 6d4d2d1814c..802fc4b0823 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -73,6 +73,7 @@ extern void FreeSnapshotBuilder(SnapBuild *builder);
extern void SnapBuildSnapDecRefcount(Snapshot snap);
extern Snapshot SnapBuildInitialSnapshot(SnapBuild *builder);
+extern Snapshot SnapBuildInitialSnapshotForRepack(SnapBuild *builder);
extern Snapshot SnapBuildMVCCFromHistoric(Snapshot snapshot, bool in_place);
extern const char *SnapBuildExportSnapshot(SnapBuild *builder);
extern void SnapBuildClearExportedSnapshot(void);
diff --git a/src/include/storage/lockdefs.h b/src/include/storage/lockdefs.h
index 7f3ba0352f6..2739327b0da 100644
--- a/src/include/storage/lockdefs.h
+++ b/src/include/storage/lockdefs.h
@@ -36,8 +36,8 @@ typedef int LOCKMODE;
#define AccessShareLock 1 /* SELECT */
#define RowShareLock 2 /* SELECT FOR UPDATE/FOR SHARE */
#define RowExclusiveLock 3 /* INSERT, UPDATE, DELETE */
-#define ShareUpdateExclusiveLock 4 /* VACUUM (non-FULL), ANALYZE, CREATE
- * INDEX CONCURRENTLY */
+#define ShareUpdateExclusiveLock 4 /* VACUUM (non-exclusive), ANALYZE, CREATE
+ * INDEX CONCURRENTLY, REPACK CONCURRENTLY */
#define ShareLock 5 /* CREATE INDEX (WITHOUT CONCURRENTLY) */
#define ShareRowExclusiveLock 6 /* like EXCLUSIVE MODE, but allows ROW
* SHARE */
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index a9681738146..9bb2f7ae1a8 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -84,3 +84,4 @@ PG_LWLOCK(50, DSMRegistry)
PG_LWLOCK(51, InjectionPoint)
PG_LWLOCK(52, SerialControl)
PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, RepackedRels)
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index 147b190210a..5eeabdc6c4f 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -61,6 +61,8 @@ extern Snapshot GetLatestSnapshot(void);
extern void SnapshotSetCommandId(CommandId curcid);
extern Snapshot CopySnapshot(Snapshot snapshot);
+extern void FreeSnapshot(Snapshot snapshot);
+
extern Snapshot GetCatalogSnapshot(Oid relid);
extern Snapshot GetNonHistoricCatalogSnapshot(Oid relid);
extern void InvalidateCatalogSnapshot(void);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 328235044d9..ebaf8fdd268 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1990,17 +1990,17 @@ pg_stat_progress_cluster| SELECT s.pid,
WHEN 2 THEN 'index scanning heap'::text
WHEN 3 THEN 'sorting tuples'::text
WHEN 4 THEN 'writing new heap'::text
- WHEN 5 THEN 'swapping relation files'::text
- WHEN 6 THEN 'rebuilding index'::text
- WHEN 7 THEN 'performing final cleanup'::text
+ WHEN 6 THEN 'swapping relation files'::text
+ WHEN 7 THEN 'rebuilding index'::text
+ WHEN 8 THEN 'performing final cleanup'::text
ELSE NULL::text
END AS phase,
(s.param3)::oid AS cluster_index_relid,
s.param4 AS heap_tuples_scanned,
s.param5 AS heap_tuples_written,
- s.param6 AS heap_blks_total,
- s.param7 AS heap_blks_scanned,
- s.param8 AS index_rebuild_count
+ s.param8 AS heap_blks_total,
+ s.param9 AS heap_blks_scanned,
+ s.param10 AS index_rebuild_count
FROM (pg_stat_get_progress_info('CLUSTER'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_progress_copy| SELECT s.pid,
@@ -2072,17 +2072,20 @@ pg_stat_progress_repack| SELECT s.pid,
WHEN 2 THEN 'index scanning heap'::text
WHEN 3 THEN 'sorting tuples'::text
WHEN 4 THEN 'writing new heap'::text
- WHEN 5 THEN 'swapping relation files'::text
- WHEN 6 THEN 'rebuilding index'::text
- WHEN 7 THEN 'performing final cleanup'::text
+ WHEN 5 THEN 'catch-up'::text
+ WHEN 6 THEN 'swapping relation files'::text
+ WHEN 7 THEN 'rebuilding index'::text
+ WHEN 8 THEN 'performing final cleanup'::text
ELSE NULL::text
END AS phase,
(s.param3)::oid AS repack_index_relid,
s.param4 AS heap_tuples_scanned,
- s.param5 AS heap_tuples_written,
- s.param6 AS heap_blks_total,
- s.param7 AS heap_blks_scanned,
- s.param8 AS index_rebuild_count
+ s.param5 AS heap_tuples_inserted,
+ s.param6 AS heap_tuples_updated,
+ s.param7 AS heap_tuples_deleted,
+ s.param8 AS heap_blks_total,
+ s.param9 AS heap_blks_scanned,
+ s.param10 AS index_rebuild_count
FROM (pg_stat_get_progress_info('REPACK'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_progress_vacuum| SELECT s.pid,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index f6c77dc9c69..814a0ba7b69 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -487,6 +487,8 @@ CompressFileHandle
CompressionLocation
CompressorState
ComputeXidHorizonsResult
+ConcurrentChange
+ConcurrentChangeKind
ConditionVariable
ConditionVariableMinimallyPadded
ConditionalStack
@@ -1254,6 +1256,7 @@ IndexElem
IndexFetchHeapData
IndexFetchTableData
IndexInfo
+IndexInsertState
IndexList
IndexOnlyScan
IndexOnlyScanState
@@ -2524,6 +2527,7 @@ ReorderBufferTupleCidKey
ReorderBufferUpdateProgressTxnCB
ReorderTuple
RepOriginId
+RepackDecodingState
RepackStmt
ReparameterizeForeignPathByChild_function
ReplaceVarsFromTargetList_context
--
2.43.5
v14-0005-Add-regression-tests.patch (text/x-diff)
From 6bdcecc7065bd69c81ce6d1c2f00a960a80013d5 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Mon, 9 Jun 2025 12:00:42 +0200
Subject: [PATCH 5/7] Add regression tests.
As this patch series adds the CONCURRENTLY option to the REPACK command, it's
appropriate to test that the "concurrent data changes" (i.e. changes done by
the application while we are copying the table contents to the new storage) are
processed correctly.
Injection points are used to stop the data copying at some point. While the
backend in charge of the copying is waiting on the injection point, another
backend runs some INSERT, UPDATE and DELETE commands on the table. Then we
wake up the first backend and let the REPACK CONCURRENTLY command
finish. Finally we check that all the "concurrent data changes" are present in
the table and that they contain the correct visibility information.
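The scenario the test exercises — copy a snapshot of the table, buffer the changes that arrive in the meantime, then replay them onto the copy — can be sketched in miniature. Plain C arrays stand in for the heaps and the decoded-change queue here; none of this is the server's actual data structure:

```c
#include <stdbool.h>

#define MAX_ROWS 16

typedef struct { int key; int val; bool live; } Row;

typedef enum { CH_INSERT, CH_UPDATE, CH_DELETE } ChKind;
typedef struct { ChKind kind; int key; int val; } Change;

/* Apply one buffered change to the copy, identifying rows by key
 * (the role the identity index plays in the real code). */
static void
apply_change(Row *rows, int *n, const Change *ch)
{
	for (int i = 0; i < *n; i++)
	{
		if (rows[i].live && rows[i].key == ch->key)
		{
			if (ch->kind == CH_UPDATE)
				rows[i].val = ch->val;
			else if (ch->kind == CH_DELETE)
				rows[i].live = false;
			return;
		}
	}
	if (ch->kind == CH_INSERT && *n < MAX_ROWS)
		rows[(*n)++] = (Row) {ch->key, ch->val, true};
}
```

Starting from a copied snapshot of two rows and replaying an INSERT, an UPDATE, and a DELETE leaves the copy identical to what the live table would show — which is exactly the equality the spec's FULL JOIN check asserts.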
---
src/backend/commands/cluster.c | 7 +
src/test/modules/injection_points/Makefile | 3 +-
.../injection_points/expected/repack.out | 113 ++++++++++++++
.../modules/injection_points/logical.conf | 1 +
src/test/modules/injection_points/meson.build | 4 +
.../injection_points/specs/repack.spec | 143 ++++++++++++++++++
6 files changed, 270 insertions(+), 1 deletion(-)
create mode 100644 src/test/modules/injection_points/expected/repack.out
create mode 100644 src/test/modules/injection_points/logical.conf
create mode 100644 src/test/modules/injection_points/specs/repack.spec
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 432fc510ee6..04dcfc900de 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -58,6 +58,7 @@
#include "utils/acl.h"
#include "utils/fmgroids.h"
#include "utils/guc.h"
+#include "utils/injection_point.h"
#include "utils/inval.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
@@ -3006,6 +3007,12 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
*/
ident_key = build_identity_key(ident_idx_new, OldHeap, &ident_key_nentries);
+ /*
+ * During testing, wait for another backend to perform concurrent data
+ * changes which we will process below.
+ */
+ INJECTION_POINT("repack-concurrently-before-lock", NULL);
+
/*
* Flush all WAL records inserted so far (possibly except for the last
* incomplete page, see GetInsertRecPtr), to minimize the amount of data
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index e680991f8d4..405d0811b4f 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -14,7 +14,8 @@ PGFILEDESC = "injection_points - facility for injection points"
REGRESS = injection_points hashagg reindex_conc
REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
-ISOLATION = basic inplace syscache-update-pruned
+ISOLATION = basic inplace syscache-update-pruned repack
+ISOLATION_OPTS = --temp-config $(top_srcdir)/src/test/modules/injection_points/logical.conf
TAP_TESTS = 1
diff --git a/src/test/modules/injection_points/expected/repack.out b/src/test/modules/injection_points/expected/repack.out
new file mode 100644
index 00000000000..f919087ca5b
--- /dev/null
+++ b/src/test/modules/injection_points/expected/repack.out
@@ -0,0 +1,113 @@
+Parsed test spec with 2 sessions
+
+starting permutation: wait_before_lock change_existing change_new change_subxact1 change_subxact2 check2 wakeup_before_lock check1
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step wait_before_lock:
+ REPACK CONCURRENTLY repack_test USING INDEX repack_test_pkey;
+ <waiting ...>
+step change_existing:
+ UPDATE repack_test SET i=10 where i=1;
+ UPDATE repack_test SET j=20 where i=2;
+ UPDATE repack_test SET i=30 where i=3;
+ UPDATE repack_test SET i=40 where i=30;
+ DELETE FROM repack_test WHERE i=4;
+
+step change_new:
+ INSERT INTO repack_test(i, j) VALUES (5, 5), (6, 6), (7, 7), (8, 8);
+ UPDATE repack_test SET i=50 where i=5;
+ UPDATE repack_test SET j=60 where i=6;
+ DELETE FROM repack_test WHERE i=7;
+
+step change_subxact1:
+ BEGIN;
+ INSERT INTO repack_test(i, j) VALUES (100, 100);
+ SAVEPOINT s1;
+ UPDATE repack_test SET i=101 where i=100;
+ SAVEPOINT s2;
+ UPDATE repack_test SET i=102 where i=101;
+ COMMIT;
+
+step change_subxact2:
+ BEGIN;
+ SAVEPOINT s1;
+ INSERT INTO repack_test(i, j) VALUES (110, 110);
+ ROLLBACK TO SAVEPOINT s1;
+ INSERT INTO repack_test(i, j) VALUES (110, 111);
+ COMMIT;
+
+step check2:
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='repack_test';
+
+ SELECT i, j FROM repack_test ORDER BY i, j;
+
+ INSERT INTO data_s2(i, j)
+ SELECT i, j FROM repack_test;
+
+ i| j
+---+---
+ 2| 20
+ 6| 60
+ 8| 8
+ 10| 1
+ 40| 3
+ 50| 5
+102|100
+110|111
+(8 rows)
+
+step wakeup_before_lock:
+ SELECT injection_points_wakeup('repack-concurrently-before-lock');
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step wait_before_lock: <... completed>
+step check1:
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='repack_test';
+
+ SELECT count(DISTINCT node) FROM relfilenodes;
+
+ SELECT i, j FROM repack_test ORDER BY i, j;
+
+ INSERT INTO data_s1(i, j)
+ SELECT i, j FROM repack_test;
+
+ SELECT count(*)
+ FROM data_s1 d1 FULL JOIN data_s2 d2 USING (i, j)
+ WHERE d1.i ISNULL OR d2.i ISNULL;
+
+count
+-----
+ 2
+(1 row)
+
+ i| j
+---+---
+ 2| 20
+ 6| 60
+ 8| 8
+ 10| 1
+ 40| 3
+ 50| 5
+102|100
+110|111
+(8 rows)
+
+count
+-----
+ 0
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
diff --git a/src/test/modules/injection_points/logical.conf b/src/test/modules/injection_points/logical.conf
new file mode 100644
index 00000000000..c8f264bc6cb
--- /dev/null
+++ b/src/test/modules/injection_points/logical.conf
@@ -0,0 +1 @@
+wal_level = logical
\ No newline at end of file
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index d61149712fd..0e3c47ba999 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -46,9 +46,13 @@ tests += {
'specs': [
'basic',
'inplace',
+ 'repack',
'syscache-update-pruned',
],
'runningcheck': false, # see syscache-update-pruned
+ # 'repack' requires wal_level = 'logical'.
+ 'regress_args': ['--temp-config', files('logical.conf')],
+
},
'tap': {
'env': {
diff --git a/src/test/modules/injection_points/specs/repack.spec b/src/test/modules/injection_points/specs/repack.spec
new file mode 100644
index 00000000000..a17064462ce
--- /dev/null
+++ b/src/test/modules/injection_points/specs/repack.spec
@@ -0,0 +1,143 @@
+# Prefix the system columns with underscore as they are not allowed as column
+# names.
+setup
+{
+ CREATE EXTENSION injection_points;
+
+ CREATE TABLE repack_test(i int PRIMARY KEY, j int);
+ INSERT INTO repack_test(i, j) VALUES (1, 1), (2, 2), (3, 3), (4, 4);
+
+ CREATE TABLE relfilenodes(node oid);
+
+ CREATE TABLE data_s1(i int, j int);
+ CREATE TABLE data_s2(i int, j int);
+}
+
+teardown
+{
+ DROP TABLE repack_test;
+ DROP EXTENSION injection_points;
+
+ DROP TABLE relfilenodes;
+ DROP TABLE data_s1;
+ DROP TABLE data_s2;
+}
+
+session s1
+setup
+{
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('repack-concurrently-before-lock', 'wait');
+}
+# Perform the initial load and wait for s2 to do some data changes.
+step wait_before_lock
+{
+ REPACK CONCURRENTLY repack_test USING INDEX repack_test_pkey;
+}
+# Check the table from the perspective of s1.
+#
+# Besides the contents, we also check that relfilenode has changed.
+
+# Have each session write the contents into a table and use FULL JOIN to check
+# if the outputs are identical.
+step check1
+{
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='repack_test';
+
+ SELECT count(DISTINCT node) FROM relfilenodes;
+
+ SELECT i, j FROM repack_test ORDER BY i, j;
+
+ INSERT INTO data_s1(i, j)
+ SELECT i, j FROM repack_test;
+
+ SELECT count(*)
+ FROM data_s1 d1 FULL JOIN data_s2 d2 USING (i, j)
+ WHERE d1.i ISNULL OR d2.i ISNULL;
+}
+teardown
+{
+ SELECT injection_points_detach('repack-concurrently-before-lock');
+}
+
+session s2
+# Change the existing data. UPDATE changes both key and non-key columns. Also
+# update one row twice to test whether a tuple version generated by this
+# session can be found.
+step change_existing
+{
+ UPDATE repack_test SET i=10 where i=1;
+ UPDATE repack_test SET j=20 where i=2;
+ UPDATE repack_test SET i=30 where i=3;
+ UPDATE repack_test SET i=40 where i=30;
+ DELETE FROM repack_test WHERE i=4;
+}
+# Insert new rows and UPDATE / DELETE some of them. Again, update both key and
+# non-key columns.
+step change_new
+{
+ INSERT INTO repack_test(i, j) VALUES (5, 5), (6, 6), (7, 7), (8, 8);
+ UPDATE repack_test SET i=50 where i=5;
+ UPDATE repack_test SET j=60 where i=6;
+ DELETE FROM repack_test WHERE i=7;
+}
+
+# When applying concurrent data changes, we should see the effects of an
+# in-progress subtransaction.
+#
+# XXX Not sure this test is useful now - it was designed for the patch that
+# preserves tuple visibility and which therefore modifies
+# TransactionIdIsCurrentTransactionId().
+step change_subxact1
+{
+ BEGIN;
+ INSERT INTO repack_test(i, j) VALUES (100, 100);
+ SAVEPOINT s1;
+ UPDATE repack_test SET i=101 where i=100;
+ SAVEPOINT s2;
+ UPDATE repack_test SET i=102 where i=101;
+ COMMIT;
+}
+
+# When applying concurrent data changes, we should not see the effects of a
+# rolled back subtransaction.
+#
+# XXX Is this test useful? See above.
+step change_subxact2
+{
+ BEGIN;
+ SAVEPOINT s1;
+ INSERT INTO repack_test(i, j) VALUES (110, 110);
+ ROLLBACK TO SAVEPOINT s1;
+ INSERT INTO repack_test(i, j) VALUES (110, 111);
+ COMMIT;
+}
+
+# Check the table from the perspective of s2.
+step check2
+{
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='repack_test';
+
+ SELECT i, j FROM repack_test ORDER BY i, j;
+
+ INSERT INTO data_s2(i, j)
+ SELECT i, j FROM repack_test;
+}
+step wakeup_before_lock
+{
+ SELECT injection_points_wakeup('repack-concurrently-before-lock');
+}
+
+# Test if data changes introduced while one session is performing REPACK
+# CONCURRENTLY find their way into the table.
+permutation
+ wait_before_lock
+ change_existing
+ change_new
+ change_subxact1
+ change_subxact2
+ check2
+ wakeup_before_lock
+ check1
--
2.43.5
v14-0006-Introduce-repack_max_xlock_time-configuration-variab.patch (text/x-diff)
From 62db4ecf901980e2cb90d75ca913e2e3dc291590 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Mon, 9 Jun 2025 12:00:42 +0200
Subject: [PATCH 6/7] Introduce repack_max_xlock_time configuration variable.
When executing REPACK CONCURRENTLY, we need AccessExclusiveLock to swap the
relation files, and holding it should normally take only a short time.
However, on a busy system, other backends might change a non-negligible
amount of data in the table while we are waiting for the lock. Since these
changes must be applied to the new storage before the swap, the time we
eventually hold the lock might become non-negligible too.
If users are worried about this situation, they can set repack_max_xlock_time
to the maximum time for which the exclusive lock may be held. If this amount
of time is not sufficient to complete the REPACK CONCURRENTLY command, an
ERROR is raised and the command is canceled.
---
doc/src/sgml/config.sgml | 31 ++++
doc/src/sgml/ref/repack.sgml | 5 +-
src/backend/access/heap/heapam_handler.c | 3 +-
src/backend/commands/cluster.c | 135 +++++++++++++++---
src/backend/utils/misc/guc_tables.c | 15 +-
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/commands/cluster.h | 5 +-
.../injection_points/expected/repack.out | 74 +++++++++-
.../injection_points/specs/repack.spec | 42 ++++++
9 files changed, 290 insertions(+), 21 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 021153b2a5f..fc8df5b6f3d 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -11234,6 +11234,37 @@ dynamic_library_path = '/usr/local/lib/postgresql:$libdir'
</listitem>
</varlistentry>
+ <varlistentry id="guc-repack-max-xclock-time" xreflabel="repack_max_xlock_time">
+ <term><varname>repack_max_xlock_time</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>repack_max_xlock_time</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ The maximum amount of time for which <command>REPACK</command> with
+ the <literal>CONCURRENTLY</literal> option may hold an exclusive lock
+ on a table. Typically, the command should not need the lock for longer
+ than <command>TRUNCATE</command> does. However, additional time might
+ be needed if the system is very busy. (See <xref linkend="sql-repack"/>
+ for an explanation of how the <literal>CONCURRENTLY</literal> option works.)
+ </para>
+
+ <para>
+ If you want to restrict the lock time, set this variable to the
+ highest acceptable value. If it turns out during processing that the
+ lock would have to be held longer than this limit, the command is
+ cancelled.
+ </para>
+
+ <para>
+ The default value is 0, which means that the lock is not released
+ until the concurrent data changes are processed.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect1>
diff --git a/doc/src/sgml/ref/repack.sgml b/doc/src/sgml/ref/repack.sgml
index 9c089a6b3d7..e1313f40599 100644
--- a/doc/src/sgml/ref/repack.sgml
+++ b/doc/src/sgml/ref/repack.sgml
@@ -192,7 +192,10 @@ REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] CONCU
too many data changes have been done to the table while
<command>REPACK</command> was waiting for the lock: those changes must
be processed just before the files are swapped, while the
- <literal>ACCESS EXCLUSIVE</literal> lock is being held.
+ <literal>ACCESS EXCLUSIVE</literal> lock is being held. If you are
+ worried about this situation, set
+ the <link linkend="guc-repack-max-xclock-time"><varname>repack_max_xlock_time</varname></link>
+ configuration parameter to a value that your applications can tolerate.
</para>
<para>
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index c829c06f769..03e722347a1 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -986,7 +986,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
end_of_wal = GetFlushRecPtr(NULL);
if ((end_of_wal - end_of_wal_prev) > wal_segment_size)
{
- repack_decode_concurrent_changes(decoding_ctx, end_of_wal);
+ repack_decode_concurrent_changes(decoding_ctx, end_of_wal,
+ NULL);
end_of_wal_prev = end_of_wal;
}
}
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 04dcfc900de..0ebc7eacad9 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -17,6 +17,8 @@
*/
#include "postgres.h"
+#include <sys/time.h>
+
#include "access/amapi.h"
#include "access/heapam.h"
#include "access/multixact.h"
@@ -89,6 +91,15 @@ typedef struct
RelFileLocator repacked_rel_locator = {.relNumber = InvalidOid};
RelFileLocator repacked_rel_toast_locator = {.relNumber = InvalidOid};
+/*
+ * The maximum time to hold AccessExclusiveLock during the final
+ * processing. Note that only the execution time of
+ * process_concurrent_changes() is included here. The very last steps like
+ * swap_relation_files() shouldn't get blocked and it'd be wrong to consider
+ * them a reason to abort otherwise completed processing.
+ */
+int repack_max_xlock_time = 0;
+
/*
* Everything we need to call ExecInsertIndexTuples().
*/
@@ -132,7 +143,8 @@ static LogicalDecodingContext *setup_logical_decoding(Oid relid,
static HeapTuple get_changed_tuple(char *change);
static void apply_concurrent_changes(RepackDecodingState *dstate,
Relation rel, ScanKey key, int nkeys,
- IndexInsertState *iistate);
+ IndexInsertState *iistate,
+ struct timeval *must_complete);
static void apply_concurrent_insert(Relation rel, ConcurrentChange *change,
HeapTuple tup, IndexInsertState *iistate,
TupleTableSlot *index_slot);
@@ -148,13 +160,15 @@ static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
IndexInsertState *iistate,
TupleTableSlot *ident_slot,
IndexScanDesc *scan_p);
-static void process_concurrent_changes(LogicalDecodingContext *ctx,
+static bool process_concurrent_changes(LogicalDecodingContext *ctx,
XLogRecPtr end_of_wal,
Relation rel_dst,
Relation rel_src,
ScanKey ident_key,
int ident_key_nentries,
- IndexInsertState *iistate);
+ IndexInsertState *iistate,
+ struct timeval *must_complete);
+static bool processing_time_elapsed(struct timeval *must_complete);
static IndexInsertState *get_index_insert_state(Relation relation,
Oid ident_index_id);
static ScanKey build_identity_key(Oid ident_idx_oid, Relation rel_src,
@@ -2352,7 +2366,8 @@ get_changed_tuple(char *change)
*/
void
repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
- XLogRecPtr end_of_wal)
+ XLogRecPtr end_of_wal,
+ struct timeval *must_complete)
{
RepackDecodingState *dstate;
ResourceOwner resowner_old;
@@ -2382,6 +2397,9 @@ repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
if (record != NULL)
LogicalDecodingProcessRecord(ctx, ctx->reader);
+ if (processing_time_elapsed(must_complete))
+ break;
+
/*
* If WAL segment boundary has been crossed, inform the decoding
* system that the catalog_xmin can advance. (We can confirm more
@@ -2422,7 +2440,8 @@ repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
*/
static void
apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
- ScanKey key, int nkeys, IndexInsertState *iistate)
+ ScanKey key, int nkeys, IndexInsertState *iistate,
+ struct timeval *must_complete)
{
TupleTableSlot *index_slot,
*ident_slot;
@@ -2452,6 +2471,9 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
CHECK_FOR_INTERRUPTS();
+ Assert(dstate->nchanges > 0);
+ dstate->nchanges--;
+
/* Get the change from the single-column tuple. */
tup_change = ExecFetchSlotHeapTuple(dstate->tsslot, false, &shouldFree);
heap_deform_tuple(tup_change, dstate->tupdesc_change, values, isnull);
@@ -2552,10 +2574,22 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
/* TTSOpsMinimalTuple has .get_heap_tuple==NULL. */
Assert(shouldFree);
pfree(tup_change);
+
+ /*
+ * If there is a limit on the time of completion, check it now.
+ * However, make sure the loop does not break if tup_old was set in
+ * the previous iteration. In such a case we could not resume the
+ * processing in the next call.
+ */
+ if (must_complete && tup_old == NULL &&
+ processing_time_elapsed(must_complete))
+ /* The next call will process the remaining changes. */
+ break;
}
- tuplestore_clear(dstate->tstore);
- dstate->nchanges = 0;
+ /* If we could not apply all the changes, the next call will do. */
+ if (dstate->nchanges == 0)
+ tuplestore_clear(dstate->tstore);
/* Cleanup. */
ExecDropSingleTupleTableSlot(index_slot);
@@ -2737,11 +2771,15 @@ find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
* Decode and apply concurrent changes.
*
* Pass rel_src iff its reltoastrelid is needed.
+ *
+ * Returns true if must_complete is NULL or if managed to complete by the time
+ * *must_complete indicates.
*/
-static void
+static bool
process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
Relation rel_dst, Relation rel_src, ScanKey ident_key,
- int ident_key_nentries, IndexInsertState *iistate)
+ int ident_key_nentries, IndexInsertState *iistate,
+ struct timeval *must_complete)
{
RepackDecodingState *dstate;
@@ -2750,10 +2788,19 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
dstate = (RepackDecodingState *) ctx->output_writer_private;
- repack_decode_concurrent_changes(ctx, end_of_wal);
+ repack_decode_concurrent_changes(ctx, end_of_wal, must_complete);
+ if (processing_time_elapsed(must_complete))
+ /* Caller is responsible for applying the changes. */
+ return false;
+
+ /*
+ * *must_complete not reached, so there are really no changes. (It's
+ * possible to see no changes just because not enough time was left for
+ * the decoding.)
+ */
if (dstate->nchanges == 0)
- return;
+ return true;
PG_TRY();
{
@@ -2765,7 +2812,7 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
rel_dst->rd_toastoid = rel_src->rd_rel->reltoastrelid;
apply_concurrent_changes(dstate, rel_dst, ident_key,
- ident_key_nentries, iistate);
+ ident_key_nentries, iistate, must_complete);
}
PG_FINALLY();
{
@@ -2773,6 +2820,28 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
rel_dst->rd_toastoid = InvalidOid;
}
PG_END_TRY();
+
+ /*
+ * apply_concurrent_changes() does check the processing time, so if some
+ * changes are left, we ran out of time.
+ */
+ return dstate->nchanges == 0;
+}
+
+/*
+ * Check if the current time is beyond *must_complete.
+ */
+static bool
+processing_time_elapsed(struct timeval *must_complete)
+{
+ struct timeval now;
+
+ if (must_complete == NULL)
+ return false;
+
+ gettimeofday(&now, NULL);
+
+ return timercmp(&now, must_complete, >);
}
static IndexInsertState *
@@ -2934,6 +3003,8 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
Relation *ind_refs,
*ind_refs_p;
int nind;
+ struct timeval t_end;
+ struct timeval *t_end_ptr = NULL;
/* Like in cluster_rel(). */
lockmode_old = ShareUpdateExclusiveLock;
@@ -3029,7 +3100,8 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
*/
process_concurrent_changes(ctx, end_of_wal, NewHeap,
swap_toast_by_content ? OldHeap : NULL,
- ident_key, ident_key_nentries, iistate);
+ ident_key, ident_key_nentries, iistate,
+ NULL);
/*
* Acquire AccessExclusiveLock on the table, its TOAST relation (if there
@@ -3125,9 +3197,40 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
end_of_wal = GetFlushRecPtr(NULL);
/* Apply the concurrent changes again. */
- process_concurrent_changes(ctx, end_of_wal, NewHeap,
- swap_toast_by_content ? OldHeap : NULL,
- ident_key, ident_key_nentries, iistate);
+
+ /*
+ * This time we have the exclusive lock on the table, so make sure that
+ * repack_max_xlock_time is not exceeded.
+ */
+ if (repack_max_xlock_time > 0)
+ {
+ int64 usec;
+ struct timeval t_start;
+
+ gettimeofday(&t_start, NULL);
+ /* Add the whole seconds. */
+ t_end.tv_sec = t_start.tv_sec + repack_max_xlock_time / 1000;
+ /* Add the rest, expressed in microseconds. */
+ usec = t_start.tv_usec + 1000 * (repack_max_xlock_time % 1000);
+ /* The number of microseconds could have overflown. */
+ t_end.tv_sec += usec / USECS_PER_SEC;
+ t_end.tv_usec = usec % USECS_PER_SEC;
+ t_end_ptr = &t_end;
+ }
+
+ /*
+ * During testing, stop here to simulate excessive processing time.
+ */
+ INJECTION_POINT("repack-concurrently-after-lock", NULL);
+
+ if (!process_concurrent_changes(ctx, end_of_wal, NewHeap,
+ swap_toast_by_content ? OldHeap : NULL,
+ ident_key, ident_key_nentries, iistate,
+ t_end_ptr))
+ ereport(ERROR,
+ (errmsg("could not process concurrent data changes in time"),
+ errhint("Please consider adjusting \"repack_max_xlock_time\".")));
+
/* Remember info about rel before closing OldHeap */
relpersistence = OldHeap->rd_rel->relpersistence;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index f04bfedb2fd..dac58787f30 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -42,8 +42,9 @@
#include "catalog/namespace.h"
#include "catalog/storage.h"
#include "commands/async.h"
-#include "commands/extension.h"
+#include "commands/cluster.h"
#include "commands/event_trigger.h"
+#include "commands/extension.h"
#include "commands/tablespace.h"
#include "commands/trigger.h"
#include "commands/user.h"
@@ -2839,6 +2840,18 @@ struct config_int ConfigureNamesInt[] =
1600000000, 0, 2100000000,
NULL, NULL, NULL
},
+ {
+ {"repack_max_xlock_time", PGC_USERSET, LOCK_MANAGEMENT,
gettext_noop("Maximum time for REPACK CONCURRENTLY to keep the table locked."),
gettext_noop("The table is locked in exclusive mode during the final stage of processing. "
"If the lock time exceeds this value, an error is raised and the lock is "
"released. Set to zero if you don't care how long the lock can be held."),
+ GUC_UNIT_MS
+ },
+ &repack_max_xlock_time,
+ 0, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
/*
* See also CheckRequiredParameterValues() if this parameter changes
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 341f88adc87..42d32a2c198 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -765,6 +765,7 @@ autovacuum_worker_slots = 16 # autovacuum worker slots to allocate
#lock_timeout = 0 # in milliseconds, 0 is disabled
#idle_in_transaction_session_timeout = 0 # in milliseconds, 0 is disabled
#idle_session_timeout = 0 # in milliseconds, 0 is disabled
+#repack_max_xlock_time = 0 # in milliseconds, 0 is disabled
#bytea_output = 'hex' # hex, escape
#xmlbinary = 'base64'
#xmloption = 'content'
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 0a7e72bc74a..4914f217267 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -59,6 +59,8 @@ typedef enum ClusterCommand
extern RelFileLocator repacked_rel_locator;
extern RelFileLocator repacked_rel_toast_locator;
+extern PGDLLIMPORT int repack_max_xlock_time;
+
typedef enum
{
CHANGE_INSERT,
@@ -134,7 +136,8 @@ extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid,
LOCKMODE lockmode);
extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
extern void repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
- XLogRecPtr end_of_wal);
+ XLogRecPtr end_of_wal,
+ struct timeval *must_complete);
extern Oid make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
char relpersistence, LOCKMODE lockmode);
extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
diff --git a/src/test/modules/injection_points/expected/repack.out b/src/test/modules/injection_points/expected/repack.out
index f919087ca5b..02967ed9d48 100644
--- a/src/test/modules/injection_points/expected/repack.out
+++ b/src/test/modules/injection_points/expected/repack.out
@@ -1,4 +1,4 @@
-Parsed test spec with 2 sessions
+Parsed test spec with 4 sessions
starting permutation: wait_before_lock change_existing change_new change_subxact1 change_subxact2 check2 wakeup_before_lock check1
injection_points_attach
@@ -111,3 +111,75 @@ injection_points_detach
(1 row)
+injection_points_detach
+-----------------------
+
+(1 row)
+
+
+starting permutation: wait_after_lock wakeup_after_lock
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step wait_after_lock:
+ REPACK CONCURRENTLY repack_test USING INDEX repack_test_pkey;
+ <waiting ...>
+step wakeup_after_lock:
+ SELECT injection_points_wakeup('repack-concurrently-after-lock');
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step wait_after_lock: <... completed>
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+
+starting permutation: wait_after_lock after_lock_delay wakeup_after_lock
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step wait_after_lock:
+ REPACK CONCURRENTLY repack_test USING INDEX repack_test_pkey;
+ <waiting ...>
+step after_lock_delay:
+ SELECT pg_sleep(1.5);
+
+pg_sleep
+--------
+
+(1 row)
+
+step wakeup_after_lock:
+ SELECT injection_points_wakeup('repack-concurrently-after-lock');
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step wait_after_lock: <... completed>
+ERROR: could not process concurrent data changes in time
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
diff --git a/src/test/modules/injection_points/specs/repack.spec b/src/test/modules/injection_points/specs/repack.spec
index a17064462ce..d0fa38dd8cd 100644
--- a/src/test/modules/injection_points/specs/repack.spec
+++ b/src/test/modules/injection_points/specs/repack.spec
@@ -130,6 +130,34 @@ step wakeup_before_lock
SELECT injection_points_wakeup('repack-concurrently-before-lock');
}
+session s3
+setup
+{
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('repack-concurrently-after-lock', 'wait');
+ SET repack_max_xlock_time TO '1s';
+}
+# Perform the initial load, lock the table in exclusive mode and wait. s4 will
+# cancel the waiting.
+step wait_after_lock
+{
+ REPACK CONCURRENTLY repack_test USING INDEX repack_test_pkey;
+}
+teardown
+{
+ SELECT injection_points_detach('repack-concurrently-after-lock');
+}
+
+session s4
+step wakeup_after_lock
+{
+ SELECT injection_points_wakeup('repack-concurrently-after-lock');
+}
+step after_lock_delay
+{
+ SELECT pg_sleep(1.5);
+}
+
# Test if data changes introduced while one session is performing REPACK
# CONCURRENTLY find their way into the table.
permutation
@@ -141,3 +169,17 @@ permutation
check2
wakeup_before_lock
check1
+
+# Test the repack_max_xlock_time configuration variable.
+#
+# First, cancel waiting on the injection point immediately. That way, REPACK
+# should complete.
+permutation
+ wait_after_lock
+ wakeup_after_lock
+# Second, cancel the waiting with a delay that violates
+# repack_max_xlock_time.
+permutation
+ wait_after_lock
+ after_lock_delay
+ wakeup_after_lock
--
2.43.5
Attachment: v14-0007-Enable-logical-decoding-transiently-only-for-REPACK-.patch
From f79a8790125ed63c21e9b099508660a09d6b552e Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Mon, 9 Jun 2025 12:00:43 +0200
Subject: [PATCH 7/7] Enable logical decoding transiently, only for REPACK
CONCURRENTLY.
As REPACK CONCURRENTLY uses logical decoding, it requires wal_level to be set
to 'logical', while 'replica' is the default value. If logical replication is
not used, users will probably be reluctant to set the GUC to 'logical' because
it can affect server performance (by writing additional information to WAL)
and because it cannot be changed to 'logical' only for the time REPACK
CONCURRENTLY is running: a change of this GUC requires a server restart to
take effect.
This patch teaches the postgres backend to recognize whether it should
consider wal_level='logical' "locally" for a particular transaction, even if
the wal_level GUC is actually set to 'replica'. It also ensures that the
logical decoding specific information is added to WAL only for the tables
currently being processed by REPACK CONCURRENTLY.
If logical decoding is enabled this way, only temporary replication slots
should be created. The problem with a permanent slot is that it is restored
during server restart, and the restore fails if wal_level is not "globally"
'logical'.
There is independent work in progress to enable logical decoding transiently
[1]. ISTM that this is too "heavyweight" a solution for our problem. And I
think that these two approaches are not mutually exclusive: once [1] is
committed, we only need to adjust the XLogLogicalInfoActive() macro.
[1] https://www.postgresql.org/message-id/CAD21AoCVLeLYq09pQPaWs%2BJwdni5FuJ8v2jgq-u9_uFbcp6UbA%40mail.gmail.com
---
doc/src/sgml/ref/repack.sgml | 7 -
src/backend/access/transam/parallel.c | 8 +
src/backend/access/transam/xact.c | 106 ++++-
src/backend/access/transam/xlog.c | 1 +
src/backend/commands/cluster.c | 387 +++++++++++++++++-
src/backend/replication/logical/logical.c | 9 +-
src/backend/storage/ipc/ipci.c | 2 +
src/backend/storage/ipc/standby.c | 4 +-
src/backend/utils/cache/inval.c | 21 +
src/backend/utils/cache/relcache.c | 4 +
src/include/access/xlog.h | 15 +-
src/include/commands/cluster.h | 5 +
src/include/utils/inval.h | 2 +
src/include/utils/rel.h | 9 +-
src/test/modules/injection_points/Makefile | 1 -
.../modules/injection_points/logical.conf | 1 -
src/test/modules/injection_points/meson.build | 3 -
src/tools/pgindent/typedefs.list | 1 +
18 files changed, 540 insertions(+), 46 deletions(-)
delete mode 100644 src/test/modules/injection_points/logical.conf
diff --git a/doc/src/sgml/ref/repack.sgml b/doc/src/sgml/ref/repack.sgml
index e1313f40599..0fd767eef98 100644
--- a/doc/src/sgml/ref/repack.sgml
+++ b/doc/src/sgml/ref/repack.sgml
@@ -260,13 +260,6 @@ REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] CONCU
</para>
</listitem>
- <listitem>
- <para>
- The <link linkend="guc-wal-level"><varname>wal_level</varname></link>
- configuration parameter is less than <literal>logical</literal>.
- </para>
- </listitem>
-
<listitem>
<para>
The <link linkend="guc-max-replication-slots"><varname>max_replication_slots</varname></link>
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 94db1ec3012..a33318ea7bd 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -98,6 +98,7 @@ typedef struct FixedParallelState
TimestampTz xact_ts;
TimestampTz stmt_ts;
SerializableXactHandle serializable_xact_handle;
+ int wal_level_transient;
/* Mutex protects remaining fields. */
slock_t mutex;
@@ -355,6 +356,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
fps->xact_ts = GetCurrentTransactionStartTimestamp();
fps->stmt_ts = GetCurrentStatementStartTimestamp();
fps->serializable_xact_handle = ShareSerializableXact();
+ fps->wal_level_transient = wal_level_transient;
SpinLockInit(&fps->mutex);
fps->last_xlog_end = 0;
shm_toc_insert(pcxt->toc, PARALLEL_KEY_FIXED, fps);
@@ -1550,6 +1552,12 @@ ParallelWorkerMain(Datum main_arg)
/* Attach to the leader's serializable transaction, if SERIALIZABLE. */
AttachSerializableXact(fps->serializable_xact_handle);
+ /*
+ * Restore the information whether this worker should behave as if
+ * wal_level was WAL_LEVEL_LOGICAL.
+ */
+ wal_level_transient = fps->wal_level_transient;
+
/*
* We've initialized all of our state now; nothing should change
* hereafter.
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 23f2de587a1..be568f70961 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -36,6 +36,7 @@
#include "catalog/pg_enum.h"
#include "catalog/storage.h"
#include "commands/async.h"
+#include "commands/cluster.h"
#include "commands/tablecmds.h"
#include "commands/trigger.h"
#include "common/pg_prng.h"
@@ -126,6 +127,12 @@ static FullTransactionId XactTopFullTransactionId = {InvalidTransactionId};
static int nParallelCurrentXids = 0;
static TransactionId *ParallelCurrentXids;
+/*
+ * Have we determined the value of wal_level_transient for the current
+ * transaction?
+ */
+static bool wal_level_transient_checked = false;
+
/*
* Miscellaneous flag bits to record events which occur on the top level
* transaction. These flags are only persisted in MyXactFlags and are intended
@@ -638,6 +645,7 @@ AssignTransactionId(TransactionState s)
bool isSubXact = (s->parent != NULL);
ResourceOwner currentOwner;
bool log_unknown_top = false;
+ bool set_wal_level_transient = false;
/* Assert that caller didn't screw up */
Assert(!FullTransactionIdIsValid(s->fullTransactionId));
@@ -652,6 +660,32 @@ AssignTransactionId(TransactionState s)
(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
errmsg("cannot assign transaction IDs during a parallel operation")));
+ /*
+ * The first call (i.e. the first write) in the transaction tree
+ * determines whether the whole transaction assumes logical decoding or
+ * not.
+ */
+ if (!wal_level_transient_checked)
+ {
+ Assert(wal_level_transient == WAL_LEVEL_MINIMAL);
+
+ /*
+ * Do not repeat the check when calling this function for parent
+ * transactions.
+ */
+ wal_level_transient_checked = true;
+
+ /*
+ * Remember that the actual check is needed. We cannot do it until the
+ * top-level transaction has its XID assigned, see comments below.
+ *
+ * There is no use case for overriding MINIMAL, and LOGICAL cannot be
+ * overridden as such.
+ */
+ if (wal_level == WAL_LEVEL_REPLICA)
+ set_wal_level_transient = true;
+ }
+
/*
* Ensure parent(s) have XIDs, so that a child always has an XID later
* than its parent. Mustn't recurse here, or we might get a stack
@@ -681,20 +715,6 @@ AssignTransactionId(TransactionState s)
pfree(parents);
}
- /*
- * When wal_level=logical, guarantee that a subtransaction's xid can only
- * be seen in the WAL stream if its toplevel xid has been logged before.
- * If necessary we log an xact_assignment record with fewer than
- * PGPROC_MAX_CACHED_SUBXIDS. Note that it is fine if didLogXid isn't set
- * for a transaction even though it appears in a WAL record, we just might
- * superfluously log something. That can happen when an xid is included
- * somewhere inside a wal record, but not in XLogRecord->xl_xid, like in
- * xl_standby_locks.
- */
- if (isSubXact && XLogLogicalInfoActive() &&
- !TopTransactionStateData.didLogXid)
- log_unknown_top = true;
-
/*
* Generate a new FullTransactionId and record its xid in PGPROC and
* pg_subtrans.
@@ -719,6 +739,54 @@ AssignTransactionId(TransactionState s)
if (!isSubXact)
RegisterPredicateLockingXid(XidFromFullTransactionId(s->fullTransactionId));
+ /*
+ * Check if this transaction should consider wal_level=logical.
+ *
+ * Sometimes we need to turn on the logical decoding transiently although
+ * wal_level=WAL_LEVEL_REPLICA. Currently we do so when at least one table
+ * is being clustered concurrently, i.e. when we should assume that
+ * changes done by this transaction will be decoded. In such a case we
+ * adjust the value of XLogLogicalInfoActive() by setting
+ * wal_level_transient to LOGICAL.
+ *
+ * It's important not to do this check until the XID of the top-level
+ * transaction is in ProcGlobal: if the decoding becomes mandatory right
+ * after the check, our transaction will fail to write the necessary
+ * information to WAL. However, if the top-level transaction is already in
+ * ProcGlobal, its XID is guaranteed to appear in the xl_running_xacts
+ * record and therefore the snapshot builder will not try to decode the
+ * transaction (because it assumes it could have missed the initial part
+ * of the transaction).
+ *
+ * On the other hand, if the decoding became mandatory between the actual
+ * XID assignment and now, the transaction will write the decoding-specific
+ * information to WAL unnecessarily. Let's assume that such race conditions do
+ * not happen too often.
+ */
+ if (set_wal_level_transient)
+ {
+ /*
+ * Check for the operation that enables the logical decoding
+ * transiently.
+ */
+ if (is_concurrent_repack_in_progress(InvalidOid))
+ wal_level_transient = WAL_LEVEL_LOGICAL;
+ }
+
+ /*
+ * When wal_level=logical, guarantee that a subtransaction's xid can only
+ * be seen in the WAL stream if its toplevel xid has been logged before.
+ * If necessary we log an xact_assignment record with fewer than
+ * PGPROC_MAX_CACHED_SUBXIDS. Note that it is fine if didLogXid isn't set
+ * for a transaction even though it appears in a WAL record, we just might
+ * superfluously log something. That can happen when an xid is included
+ * somewhere inside a wal record, but not in XLogRecord->xl_xid, like in
+ * xl_standby_locks.
+ */
+ if (isSubXact && XLogLogicalInfoActive() &&
+ !TopTransactionStateData.didLogXid)
+ log_unknown_top = true;
+
/*
* Acquire lock on the transaction XID. (We assume this cannot block.) We
* have to ensure that the lock is assigned to the transaction's own
@@ -2216,6 +2284,16 @@ StartTransaction(void)
if (TransactionTimeout > 0)
enable_timeout_after(TRANSACTION_TIMEOUT, TransactionTimeout);
+ /*
+ * wal_level_transient can override wal_level for individual transactions,
+ * which effectively enables logical decoding for them. At the moment we
+ * don't know if this transaction will write any data changes to be
+ * decoded. Should it do, AssignTransactionId() will check if the decoding
+ * needs to be considered.
+ */
+ wal_level_transient = WAL_LEVEL_MINIMAL;
+ wal_level_transient_checked = false;
+
ShowTransactionState("StartTransaction");
}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1914859b2ee..f9c0e947ba4 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -129,6 +129,7 @@ bool wal_recycle = true;
bool log_checkpoints = true;
int wal_sync_method = DEFAULT_WAL_SYNC_METHOD;
int wal_level = WAL_LEVEL_REPLICA;
+int wal_level_transient = WAL_LEVEL_MINIMAL;
int CommitDelay = 0; /* precommit delay in microseconds */
int CommitSiblings = 5; /* # concurrent xacts needed to sleep */
int wal_retrieve_retry_interval = 5000;
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 0ebc7eacad9..be383a27712 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -84,6 +84,14 @@ typedef struct
* The following definitions are used for concurrent processing.
*/
+/*
+ * OID of the table being repacked by this backend.
+ */
+static Oid repacked_rel = InvalidOid;
+
+/* The same for its TOAST relation. */
+static Oid repacked_rel_toast = InvalidOid;
+
/*
* The locators are used to avoid logical decoding of data that we do not need
* for our table.
@@ -135,8 +143,10 @@ static List *get_tables_to_cluster_partitioned(MemoryContext cluster_context,
ClusterCommand cmd);
static bool cluster_is_permitted_for_relation(Oid relid, Oid userid,
ClusterCommand cmd);
-static void begin_concurrent_repack(Relation rel);
-static void end_concurrent_repack(void);
+static void begin_concurrent_repack(Relation rel, Relation *index_p,
+ bool *entered_p);
+static void end_concurrent_repack(bool error);
+static void cluster_before_shmem_exit_callback(int code, Datum arg);
static LogicalDecodingContext *setup_logical_decoding(Oid relid,
const char *slotname,
TupleDesc tupdesc);
@@ -383,6 +393,8 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
Relation index;
bool concurrent = ((params->options & CLUOPT_CONCURRENT) != 0);
LOCKMODE lmode;
+ bool entered,
+ success;
/*
* Check that the correct lock is held. The lock mode is
@@ -558,23 +570,31 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
TransferPredicateLocksToHeapRelation(OldHeap);
/* rebuild_relation does all the dirty work */
+ entered = false;
+ success = false;
PG_TRY();
{
/*
- * For concurrent processing, make sure that our logical decoding
- * ignores data changes of other tables than the one we are
- * processing.
+ * For concurrent processing, make sure that
+ *
+ * 1) our logical decoding ignores data changes of other tables than
+ * the one we are processing.
+ *
+ * 2) other transactions know that REPACK CONCURRENTLY is in progress
+ * for our table, so they write sufficient information to WAL even if
+ * wal_level is < LOGICAL.
*/
if (concurrent)
- begin_concurrent_repack(OldHeap);
+ begin_concurrent_repack(OldHeap, &index, &entered);
rebuild_relation(OldHeap, index, verbose, concurrent, save_userid,
cmd);
+ success = true;
}
PG_FINALLY();
{
- if (concurrent)
- end_concurrent_repack();
+ if (concurrent && entered)
+ end_concurrent_repack(!success);
}
PG_END_TRY();
@@ -2208,6 +2228,49 @@ cluster_is_permitted_for_relation(Oid relid, Oid userid, ClusterCommand cmd)
#define REPL_PLUGIN_NAME "pgoutput_repack"
+/*
+ * Each relation being processed by REPACK CONCURRENTLY must be in the
+ * repackedRelsHash hashtable.
+ */
+typedef struct RepackedRel
+{
+ Oid relid;
+ Oid dbid;
+} RepackedRel;
+
+/* Hashtable of RepackedRel elements. */
+static HTAB *repackedRelsHash = NULL;
+
+/*
+ * Maximum number of entries in the hashtable.
+ *
+ * A replication slot is needed for the processing, so use this GUC to
+ * allocate memory for the hashtable. Multiply by two because TOAST relations
+ * also need to be added to the hashtable.
+ */
+#define MAX_REPACKED_RELS (max_replication_slots * 2)
+
+Size
+RepackShmemSize(void)
+{
+ return hash_estimate_size(MAX_REPACKED_RELS, sizeof(RepackedRel));
+}
+
+void
+RepackShmemInit(void)
+{
+ HASHCTL info;
+
+ info.keysize = sizeof(RepackedRel);
+ info.entrysize = info.keysize;
+ repackedRelsHash = ShmemInitHash("Repacked Relations Hash",
+ MAX_REPACKED_RELS,
+ MAX_REPACKED_RELS,
+ &info,
+ HASH_ELEM | HASH_BLOBS |
+ HASH_FIXED_SIZE);
+}
+
/*
* Call this function before REPACK CONCURRENTLY starts to setup logical
* decoding. It makes sure that other users of the table put enough
@@ -2222,11 +2285,150 @@ cluster_is_permitted_for_relation(Oid relid, Oid userid, ClusterCommand cmd)
*
* Note that TOAST table needs no attention here as it's not scanned using
* historic snapshot.
+ *
+ * 'index_p' is in/out argument because the function unlocks the index
+ * temporarily.
+ *
+ * 'entered_p' receives a bool value telling whether the relation OID was
+ * entered into repackedRelsHash or not.
+ * into repackedRelsHash or not.
*/
static void
-begin_concurrent_repack(Relation rel)
+begin_concurrent_repack(Relation rel, Relation *index_p, bool *entered_p)
{
- Oid toastrelid;
+ Oid relid,
+ toastrelid;
+ Relation index = NULL;
+ Oid indexid = InvalidOid;
+ RepackedRel key,
+ *entry;
+ bool found;
+ static bool before_shmem_exit_callback_setup = false;
+
+ relid = RelationGetRelid(rel);
+ index = index_p ? *index_p : NULL;
+
+ /*
+ * Make sure that we do not leave an entry in repackedRelsHash if exiting
+ * due to FATAL.
+ */
+ if (!before_shmem_exit_callback_setup)
+ {
+ before_shmem_exit(cluster_before_shmem_exit_callback, 0);
+ before_shmem_exit_callback_setup = true;
+ }
+
+ memset(&key, 0, sizeof(key));
+ key.relid = relid;
+ key.dbid = MyDatabaseId;
+
+ *entered_p = false;
+ LWLockAcquire(RepackedRelsLock, LW_EXCLUSIVE);
+ entry = (RepackedRel *)
+ hash_search(repackedRelsHash, &key, HASH_ENTER_NULL, &found);
+ if (found)
+ {
+ /*
+ * Since REPACK CONCURRENTLY takes ShareRowExclusiveLock, a conflict
+ * should occur much earlier. However that lock may be released
+ * temporarily, see below. Anyway, we should complain whatever the
+ * reason of the conflict might be.
+ */
+ ereport(ERROR,
+ (errmsg("relation \"%s\" is already being processed by REPACK CONCURRENTLY",
+ RelationGetRelationName(rel))));
+ }
+ if (entry == NULL)
+ ereport(ERROR,
+ (errmsg("too many requests for REPACK CONCURRENTLY at a time")),
+ (errhint("Please consider increasing the \"max_replication_slots\" configuration parameter.")));
+
+ /*
+ * Even if the insertion of TOAST relid should fail below, the caller has
+ * to do cleanup.
+ */
+ *entered_p = true;
+
+ /*
+ * Enable the callback to remove the entry in case of exit. We should not
+ * do this earlier, otherwise an attempt to insert already existing entry
+ * could make us remove that entry (inserted by another backend) during
+ * ERROR handling.
+ */
+ Assert(!OidIsValid(repacked_rel));
+ repacked_rel = relid;
+
+ /*
+ * TOAST relation is not accessed using historic snapshot, but we enter it
+ * here to protect it from being VACUUMed by another backend. (Lock does
+ * not help in the CONCURRENTLY case because we cannot hold it continuously
+ * till the end of the transaction.) See the comments on locking TOAST
+ * relation in copy_table_data().
+ */
+ toastrelid = rel->rd_rel->reltoastrelid;
+ if (OidIsValid(toastrelid))
+ {
+ key.relid = toastrelid;
+ entry = (RepackedRel *)
+ hash_search(repackedRelsHash, &key, HASH_ENTER_NULL, &found);
+ if (found)
+
+ /*
+ * If we could enter the main relation, the TOAST should succeed too.
+ * Nevertheless, check.
+ */
+ ereport(ERROR,
+ (errmsg("TOAST relation of \"%s\" is already being processed by REPACK CONCURRENTLY",
+ RelationGetRelationName(rel))));
+ if (entry == NULL)
+ ereport(ERROR,
+ (errmsg("too many requests for REPACK CONCURRENTLY at a time")),
+ (errhint("Please consider increasing the \"max_replication_slots\" configuration parameter.")));
+
+ Assert(!OidIsValid(repacked_rel_toast));
+ repacked_rel_toast = toastrelid;
+ }
+
+ LWLockRelease(RepackedRelsLock);
+
+ /*
+ * Make sure that other backends are aware of the new hash entry as soon
+ * as they open our table.
+ */
+ CacheInvalidateRelcacheImmediate(relid);
+
+ /*
+ * Also make sure that the existing users of the table update their
+ * relcache entry as soon as they try to run DML commands on it.
+ *
+ * ShareLock is the weakest lock that conflicts with DMLs. If any backend
+ * has a lower lock, we assume it'll accept our invalidation message when
+ * it changes the lock mode.
+ *
+ * Before upgrading the lock on the relation, close the index temporarily
+ * to avoid a deadlock if another backend running DML already has its lock
+ * (ShareLock) on the table and waits for the lock on the index.
+ */
+ if (index)
+ {
+ indexid = RelationGetRelid(index);
+ index_close(index, ShareUpdateExclusiveLock);
+ }
+ LockRelationOid(relid, ShareLock);
+ UnlockRelationOid(relid, ShareLock);
+ if (OidIsValid(indexid))
+ {
+ /*
+ * Re-open the index and check that it hasn't changed while unlocked.
+ */
+ check_index_is_clusterable(rel, indexid, ShareUpdateExclusiveLock);
+
+ /*
+ * Return the new relcache entry to the caller. (It's been locked by
+ * the call above.)
+ */
+ index = index_open(indexid, NoLock);
+ *index_p = index;
+ }
/* Avoid logical decoding of other relations by this backend. */
repacked_rel_locator = rel->rd_locator;
@@ -2244,15 +2446,176 @@ begin_concurrent_repack(Relation rel)
/*
* Call this when done with REPACK CONCURRENTLY.
+ *
+ * 'error' tells whether the function is being called in order to handle
+ * error.
*/
static void
-end_concurrent_repack(void)
+end_concurrent_repack(bool error)
{
+ RepackedRel key;
+ RepackedRel *entry = NULL;
+ RepackedRel *entry_toast = NULL;
+ Oid relid = repacked_rel;
+ Oid toastrelid = repacked_rel_toast;
+
+ /* Remove the relation from the hash if we managed to insert one. */
+ if (OidIsValid(repacked_rel))
+ {
+ LWLockAcquire(RepackedRelsLock, LW_EXCLUSIVE);
+
+ memset(&key, 0, sizeof(key));
+ key.relid = repacked_rel;
+ key.dbid = MyDatabaseId;
+
+ entry = hash_search(repackedRelsHash, &key, HASH_REMOVE, NULL);
+
+ /* Remove the TOAST relation if there is one. */
+ if (OidIsValid(repacked_rel_toast))
+ {
+ key.relid = repacked_rel_toast;
+ entry_toast = hash_search(repackedRelsHash, &key, HASH_REMOVE,
+ NULL);
+ }
+
+ LWLockRelease(RepackedRelsLock);
+
+ /*
+ * Make others refresh their information whether they should still
+ * treat the table as catalog from the perspective of writing WAL.
+ *
+ * XXX Unlike entering the entry into the hashtable, we do not bother
+ * with locking and unlocking the table here:
+ *
+ * 1) On normal completion (and sometimes even on ERROR), the caller
+ * is already holding AccessExclusiveLock on the table, so there
+ * should be no relcache reference unaware of this change.
+ *
+ * 2) In the other cases, the worst scenario is that the other
+ * backends will write unnecessary information to WAL until they close
+ * the relation.
+ *
+ * Should we use ShareLock mode to fix 2) at least for the non-FATAL
+ * errors? (Our before_shmem_exit callback is in charge of FATAL, and
+ * that probably should not try to acquire any lock.)
+ */
+ CacheInvalidateRelcacheImmediate(repacked_rel);
+
+ /*
+ * By clearing repacked_rel we also disable
+ * cluster_before_shmem_exit_callback().
+ */
+ repacked_rel = InvalidOid;
+ repacked_rel_toast = InvalidOid;
+ }
+
/*
* Restore normal function of (future) logical decoding for this backend.
*/
repacked_rel_locator.relNumber = InvalidOid;
repacked_rel_toast_locator.relNumber = InvalidOid;
+
+ /*
+ * On normal completion (!error), we should not really fail to remove the
+ * entry. But if it wasn't there for any reason, raise ERROR to make sure
+ * the transaction is aborted: if other transactions, while changing the
+ * contents of the relation, didn't know that REPACK CONCURRENTLY was in
+ * progress, they could have failed to write enough information to WAL, and
+ * thus we could have produced inconsistent table contents.
+ *
+ * On the other hand, if we are already handling an error, there's no
+ * reason to worry about inconsistent contents of the new storage because
+ * the transaction is going to be rolled back anyway. Furthermore, by
+ * raising ERROR here we'd shadow the original error.
+ */
+ if (!error)
+ {
+ char *relname;
+
+ if (OidIsValid(relid) && entry == NULL)
+ {
+ relname = get_rel_name(relid);
+ if (!relname)
+ ereport(ERROR,
+ (errmsg("cache lookup failed for relation %u",
+ relid)));
+
+ ereport(ERROR,
+ (errmsg("relation \"%s\" not found among repacked relations",
+ relname)));
+ }
+
+ /*
+ * Likewise, the TOAST relation should not have disappeared.
+ */
+ if (OidIsValid(toastrelid) && entry_toast == NULL)
+ {
+ relname = get_rel_name(key.relid);
+ if (!relname)
+ ereport(ERROR,
+ (errmsg("cache lookup failed for relation %u",
+ key.relid)));
+
+ ereport(ERROR,
+ (errmsg("relation \"%s\" not found among repacked relations",
+ relname)));
+ }
+
+ }
+}
+
+/*
+ * A wrapper to call end_concurrent_repack() as a before_shmem_exit callback.
+ */
+static void
+cluster_before_shmem_exit_callback(int code, Datum arg)
+{
+ if (OidIsValid(repacked_rel))
+ end_concurrent_repack(true);
+}
+
+/*
+ * Check if relation is currently being processed by REPACK CONCURRENTLY.
+ *
+ * If relid is InvalidOid, check if any relation is being processed.
+ */
+bool
+is_concurrent_repack_in_progress(Oid relid)
+{
+ RepackedRel key,
+ *entry;
+
+ /* For particular relation we need to search in the hashtable. */
+ memset(&key, 0, sizeof(key));
+ key.relid = relid;
+ key.dbid = MyDatabaseId;
+
+ LWLockAcquire(RepackedRelsLock, LW_SHARED);
+ /*
+ * If the caller is interested in whether any relation is being repacked,
+ * just check the number of entries.
+ */
+ if (!OidIsValid(relid))
+ {
+ long n = hash_get_num_entries(repackedRelsHash);
+
+ LWLockRelease(RepackedRelsLock);
+ return n > 0;
+ }
+ entry = (RepackedRel *)
+ hash_search(repackedRelsHash, &key, HASH_FIND, NULL);
+ LWLockRelease(RepackedRelsLock);
+
+ return entry != NULL;
+}
+
+/*
+ * Is this backend performing REPACK CONCURRENTLY?
+ */
+bool
+is_concurrent_repack_run_by_me(void)
+{
+ return OidIsValid(repacked_rel);
}
/*
@@ -2282,7 +2645,7 @@ setup_logical_decoding(Oid relid, const char *slotname, TupleDesc tupdesc)
* useful for us.
*
* Regarding the value of need_full_snapshot, we pass false because the
- * table we are processing is present in RepackedRelsHash and therefore,
+ * table we are processing is present in repackedRelsHash and therefore,
* regarding logical decoding, treated like a catalog.
*/
ctx = CreateInitDecodingContext(REPL_PLUGIN_NAME,
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 1d56d0c4ef3..2cde79c9ed4 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -30,6 +30,7 @@
#include "access/xact.h"
#include "access/xlogutils.h"
+#include "commands/cluster.h"
#include "fmgr.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -112,10 +113,12 @@ CheckLogicalDecodingRequirements(void)
/*
* NB: Adding a new requirement likely means that RestoreSlotFromDisk()
- * needs the same check.
+ * needs the same check. (Except that only temporary slots should be
+ * created for REPACK CONCURRENTLY, which effectively raises wal_level to
+ * LOGICAL.)
*/
-
- if (wal_level < WAL_LEVEL_LOGICAL)
+ if ((wal_level < WAL_LEVEL_LOGICAL && !is_concurrent_repack_run_by_me())
+ || wal_level < WAL_LEVEL_REPLICA)
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("logical decoding requires \"wal_level\" >= \"logical\"")));
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index e9ddf39500c..e24e1795aa9 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -151,6 +151,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, InjectionPointShmemSize());
size = add_size(size, SlotSyncShmemSize());
size = add_size(size, AioShmemSize());
+ size = add_size(size, RepackShmemSize());
/* include additional requested shmem from preload libraries */
size = add_size(size, total_addin_request);
@@ -344,6 +345,7 @@ CreateOrAttachShmemStructs(void)
WaitEventCustomShmemInit();
InjectionPointShmemInit();
AioShmemInit();
+ RepackShmemInit();
}
/*
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 7fa8d9247e0..ab30d448d42 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -1325,13 +1325,13 @@ LogStandbySnapshot(void)
* record. Fortunately this routine isn't executed frequently, and it's
* only a shared lock.
*/
- if (wal_level < WAL_LEVEL_LOGICAL)
+ if (!XLogLogicalInfoActive())
LWLockRelease(ProcArrayLock);
recptr = LogCurrentRunningXacts(running);
/* Release lock if we kept it longer ... */
- if (wal_level >= WAL_LEVEL_LOGICAL)
+ if (XLogLogicalInfoActive())
LWLockRelease(ProcArrayLock);
/* GetRunningTransactionData() acquired XidGenLock, we must release it */
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 02505c88b8e..ecaa2283c2a 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -1643,6 +1643,27 @@ CacheInvalidateRelcache(Relation relation)
databaseId, relationId);
}
+/*
+ * CacheInvalidateRelcacheImmediate
+ * Send invalidation message for the specified relation's relcache entry.
+ *
+ * Currently this is used in REPACK CONCURRENTLY, to make sure that other
+ * backends are aware that the command is being executed for the relation.
+ */
+void
+CacheInvalidateRelcacheImmediate(Oid relid)
+{
+ SharedInvalidationMessage msg;
+
+ msg.rc.id = SHAREDINVALRELCACHE_ID;
+ msg.rc.dbId = MyDatabaseId;
+ msg.rc.relId = relid;
+ /* check AddCatcacheInvalidationMessage() for an explanation */
+ VALGRIND_MAKE_MEM_DEFINED(&msg, sizeof(msg));
+
+ SendSharedInvalidMessages(&msg, 1);
+}
+
/*
* CacheInvalidateRelcacheAll
* Register invalidation of the whole relcache at the end of command.
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 4911642fb3c..504cb8e56a8 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -1279,6 +1279,10 @@ retry:
/* make sure relation is marked as having no open file yet */
relation->rd_smgr = NULL;
+ /* Is REPACK CONCURRENTLY in progress? */
+ relation->rd_repack_concurrent =
+ is_concurrent_repack_in_progress(targetRelId);
+
/*
* now we can free the memory allocated for pg_class_tuple
*/
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index d313099c027..a325bb1d16b 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -95,6 +95,12 @@ typedef enum RecoveryState
extern PGDLLIMPORT int wal_level;
+/*
+ * wal_level_transient overrides wal_level if logical decoding needs to be
+ * enabled transiently.
+ */
+extern PGDLLIMPORT int wal_level_transient;
+
/* Is WAL archiving enabled (always or only while server is running normally)? */
#define XLogArchivingActive() \
(AssertMacro(XLogArchiveMode == ARCHIVE_MODE_OFF || wal_level >= WAL_LEVEL_REPLICA), XLogArchiveMode > ARCHIVE_MODE_OFF)
@@ -122,8 +128,13 @@ extern PGDLLIMPORT int wal_level;
/* Do we need to WAL-log information required only for Hot Standby and logical replication? */
#define XLogStandbyInfoActive() (wal_level >= WAL_LEVEL_REPLICA)
-/* Do we need to WAL-log information required only for logical replication? */
-#define XLogLogicalInfoActive() (wal_level >= WAL_LEVEL_LOGICAL)
+/*
+ * Do we need to WAL-log information required only for logical replication?
+ *
+ * wal_level_transient overrides wal_level if logical decoding needs to be
+ * active transiently.
+ */
+#define XLogLogicalInfoActive() (Max(wal_level, wal_level_transient) == WAL_LEVEL_LOGICAL)
#ifdef WAL_DEBUG
extern PGDLLIMPORT bool XLOG_DEBUG;
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 4914f217267..9d5a30d0689 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -150,5 +150,10 @@ extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
MultiXactId cutoffMulti,
char newrelpersistence);
+extern Size RepackShmemSize(void);
+extern void RepackShmemInit(void);
+extern bool is_concurrent_repack_in_progress(Oid relid);
+extern bool is_concurrent_repack_run_by_me(void);
+
extern void repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel);
#endif /* CLUSTER_H */
diff --git a/src/include/utils/inval.h b/src/include/utils/inval.h
index 9b871caef62..ae9dee394dc 100644
--- a/src/include/utils/inval.h
+++ b/src/include/utils/inval.h
@@ -50,6 +50,8 @@ extern void CacheInvalidateCatalog(Oid catalogId);
extern void CacheInvalidateRelcache(Relation relation);
+extern void CacheInvalidateRelcacheImmediate(Oid relid);
+
extern void CacheInvalidateRelcacheAll(void);
extern void CacheInvalidateRelcacheByTuple(HeapTuple classTuple);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index b552359915f..cc84592eb1f 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -253,6 +253,9 @@ typedef struct RelationData
bool pgstat_enabled; /* should relation stats be counted */
/* use "struct" here to avoid needing to include pgstat.h: */
struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+ /* Is REPACK CONCURRENTLY being performed on this relation? */
+ bool rd_repack_concurrent;
} RelationData;
@@ -708,12 +711,16 @@ RelationCloseSmgr(Relation relation)
* it would complicate decoding slightly for little gain). Note that we *do*
* log information for user defined catalog tables since they presumably are
* interesting to the user...
+ *
+ * If particular relations require it, logical decoding can be active
+ * even if wal_level is REPLICA. Do not log other relations in that case.
*/
#define RelationIsLogicallyLogged(relation) \
(XLogLogicalInfoActive() && \
RelationNeedsWAL(relation) && \
(relation)->rd_rel->relkind != RELKIND_FOREIGN_TABLE && \
- !IsCatalogRelation(relation))
+ !IsCatalogRelation(relation) && \
+ (wal_level == WAL_LEVEL_LOGICAL || (relation)->rd_repack_concurrent))
/* routines in utils/cache/relcache.c */
extern void RelationIncrementReferenceCount(Relation rel);
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 405d0811b4f..4f6c0ca3a8a 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -15,7 +15,6 @@ REGRESS = injection_points hashagg reindex_conc
REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
ISOLATION = basic inplace syscache-update-pruned repack
-ISOLATION_OPTS = --temp-config $(top_srcdir)/src/test/modules/injection_points/logical.conf
TAP_TESTS = 1
diff --git a/src/test/modules/injection_points/logical.conf b/src/test/modules/injection_points/logical.conf
deleted file mode 100644
index c8f264bc6cb..00000000000
--- a/src/test/modules/injection_points/logical.conf
+++ /dev/null
@@ -1 +0,0 @@
-wal_level = logical
\ No newline at end of file
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 0e3c47ba999..716e5619aa7 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -50,9 +50,6 @@ tests += {
'syscache-update-pruned',
],
'runningcheck': false, # see syscache-update-pruned
- # 'repack' requires wal_level = 'logical'.
- 'regress_args': ['--temp-config', files('logical.conf')],
-
},
'tap': {
'env': {
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 814a0ba7b69..e3be1f42ccf 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2527,6 +2527,7 @@ ReorderBufferTupleCidKey
ReorderBufferUpdateProgressTxnCB
ReorderTuple
RepOriginId
+RepackedRel
RepackDecodingState
RepackStmt
ReparameterizeForeignPathByChild_function
--
2.43.5
jian he <jian.universality@gmail.com> wrote:
hi.
some more minor comments about v13-0001.
GetCommandLogLevel also needs to specify LogStmtLevel for T_RepackStmt?
Fixed in [1].
/*
* (CLUSTER might change the order of
* rows on disk, which could affect the ordering of pg_dump
* output, but that's not semantically significant.)
*/
do we need to adjust this comment in ClassifyUtilityCommandAsReadOnly
for the REPACK statement?
Not sure. The current version does not mention VACUUM. (Note that VACUUM FULL
does almost the same thing as CLUSTER.) We can adjust the comment during the removal
of CLUSTER sometime in the future.
<para>
<productname>PostgreSQL</productname> has the ability to report the
progress of
certain commands during command execution. Currently, the only commands
which support progress reporting are <command>ANALYZE</command>,
<command>CLUSTER</command>,
<command>CREATE INDEX</command>, <command>VACUUM</command>,
<command>COPY</command>,
and <xref linkend="protocol-replication-base-backup"/> (i.e., replication
command that <xref linkend="app-pgbasebackup"/> issues to take
a base backup).
This may be expanded in the future.
</para>
also need to mention <command>REPACK</command>?
Fixed in [1].
"The CLUSTER command is deprecated",
then do we need to say something
in doc/src/sgml/ref/clusterdb.sgml?
I'm not convinced at the moment. The page contains a link to the documentation
of CLUSTER, which does contain the deprecation note.
It's not even clear to me whether this utility must be removed. We can adjust
it so it calls REPACK instead of CLUSTER. And while doing that, we may or may
not rename it.
[1]: /messages/by-id/117560.1749464355@localhost
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Antonin Houska <ah@cybertec.at> wrote:
jian he <jian.universality@gmail.com> wrote:
On Fri, Apr 11, 2025 at 5:28 PM Antonin Houska <ah@cybertec.at> wrote:
Please check the next version [1]. Thanks for your input.
Hi, I’ve briefly experimented with v13-0001.
Thanks! v14 addresses your comments.
v15 is attached. Just rebased so it applies to the current HEAD.
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Attachments:
v15-0001-Add-REPACK-command.patchtext/x-diffDownload
From 317f5da1e37d7e01715da1ba2067c7d501897cf7 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Mon, 30 Jun 2025 19:41:42 +0200
Subject: [PATCH 1/7] Add REPACK command.
The existing CLUSTER command as well as VACUUM with the FULL option both
reclaim unused space by rewriting the table. Now that we want to enhance this
functionality (in particular, by adding a new option CONCURRENTLY), we should
enhance both commands because they are both implemented by the same function
(cluster.c:cluster_rel). However, adding the same option to two different
commands is not very user-friendly. Therefore it was decided to create a new
command and to declare both the CLUSTER command and the FULL option of VACUUM
deprecated. Future enhancements to this rewriting code will only affect the
new command.
Like CLUSTER, the REPACK command reorders the table according to the specified
index. Unlike CLUSTER, REPACK does not require an index: if only the table is
specified, the command acts as VACUUM FULL. As we don't want to remove CLUSTER
and VACUUM FULL yet, there are three callers of the cluster_rel() function
now: REPACK, CLUSTER and VACUUM FULL. When we need to distinguish who is
calling this function (mostly for logging, but also for progress reporting),
we can no longer use the OID of the clustering index: both REPACK and VACUUM
FULL can pass InvalidOid. Therefore, this patch introduces a new enumeration
type ClusterCommand, and adds an argument of this type to the cluster_rel()
function and to all the functions that need to distinguish the caller.
Like CLUSTER and VACUUM FULL, the REPACK command without arguments processes
all the tables on which the current user has the MAINTAIN privilege.
A new pg_stat_progress_repack view is added to monitor the progress of
REPACK. Currently it displays the same information as pg_stat_progress_cluster
(except that column names might differ), but it'll also display the status of
the REPACK CONCURRENTLY command in the future, so the view definitions will
eventually diverge.
Regarding user documentation, the patch moves the information on clustering
from cluster.sgml to the new file repack.sgml. cluster.sgml now contains a
link that points to the related section of repack.sgml. A note on deprecation
and a link to repack.sgml are added to both cluster.sgml and vacuum.sgml.
---
doc/src/sgml/monitoring.sgml | 223 ++++++++++-
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/cluster.sgml | 82 +---
doc/src/sgml/ref/repack.sgml | 254 ++++++++++++
doc/src/sgml/ref/vacuum.sgml | 9 +
doc/src/sgml/reference.sgml | 1 +
src/backend/access/heap/heapam_handler.c | 32 +-
src/backend/catalog/index.c | 2 +-
src/backend/catalog/system_views.sql | 26 ++
src/backend/commands/cluster.c | 469 +++++++++++++++++------
src/backend/commands/vacuum.c | 3 +-
src/backend/parser/gram.y | 53 ++-
src/backend/tcop/utility.c | 13 +
src/backend/utils/adt/pgstatfuncs.c | 2 +
src/bin/psql/tab-complete.in.c | 33 +-
src/include/commands/cluster.h | 19 +-
src/include/commands/progress.h | 67 +++-
src/include/nodes/parsenodes.h | 13 +
src/include/parser/kwlist.h | 1 +
src/include/tcop/cmdtaglist.h | 1 +
src/include/utils/backend_progress.h | 1 +
src/test/regress/expected/cluster.out | 123 ++++++
src/test/regress/expected/rules.out | 23 ++
src/test/regress/sql/cluster.sql | 59 +++
src/tools/pgindent/typedefs.list | 2 +
25 files changed, 1298 insertions(+), 214 deletions(-)
create mode 100644 doc/src/sgml/ref/repack.sgml
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 4265a22d4de..da883bb22f1 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -405,6 +405,14 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_progress_repack</structname><indexterm><primary>pg_stat_progress_repack</primary></indexterm></entry>
+ <entry>One row for each backend running
+ <command>REPACK</command>, showing current progress. See
+ <xref linkend="repack-progress-reporting"/>.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_progress_basebackup</structname><indexterm><primary>pg_stat_progress_basebackup</primary></indexterm></entry>
<entry>One row for each WAL sender process streaming a base backup,
@@ -5493,7 +5501,8 @@ FROM pg_stat_get_backend_idset() AS backendid;
certain commands during command execution. Currently, the only commands
which support progress reporting are <command>ANALYZE</command>,
<command>CLUSTER</command>,
- <command>CREATE INDEX</command>, <command>VACUUM</command>,
+ <command>CREATE INDEX</command>, <command>REPACK</command>,
+ <command>VACUUM</command>,
<command>COPY</command>,
and <xref linkend="protocol-replication-base-backup"/> (i.e., replication
command that <xref linkend="app-pgbasebackup"/> issues to take
@@ -5952,6 +5961,218 @@ FROM pg_stat_get_backend_idset() AS backendid;
</table>
</sect2>
+ <sect2 id="repack-progress-reporting">
+ <title>REPACK Progress Reporting</title>
+
+ <indexterm>
+ <primary>pg_stat_progress_repack</primary>
+ </indexterm>
+
+ <para>
+ Whenever <command>REPACK</command> is running,
+ the <structname>pg_stat_progress_repack</structname> view will contain a
+ row for each backend that is currently running the command. The tables
+ below describe the information that will be reported and provide
+ information about how to interpret it.
+ </para>
+
+ <table id="pg-stat-progress-repack-view" xreflabel="pg_stat_progress_repack">
+ <title><structname>pg_stat_progress_repack</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>pid</structfield> <type>integer</type>
+ </para>
+ <para>
+ Process ID of backend.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>datid</structfield> <type>oid</type>
+ </para>
+ <para>
+ OID of the database to which this backend is connected.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>datname</structfield> <type>name</type>
+ </para>
+ <para>
+ Name of the database to which this backend is connected.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>relid</structfield> <type>oid</type>
+ </para>
+ <para>
+ OID of the table being repacked.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>phase</structfield> <type>text</type>
+ </para>
+ <para>
+ Current processing phase. See <xref linkend="repack-phases"/>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>repack_index_relid</structfield> <type>oid</type>
+ </para>
+ <para>
+ If the table is being scanned using an index, this is the OID of the
+ index being used; otherwise, it is zero.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_scanned</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples scanned.
+ This counter only advances when the phase is
+ <literal>seq scanning heap</literal>,
+ <literal>index scanning heap</literal>
+ or <literal>writing new heap</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_written</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples written.
+ This counter only advances when the phase is
+ <literal>seq scanning heap</literal>,
+ <literal>index scanning heap</literal>
+ or <literal>writing new heap</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_blks_total</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Total number of heap blocks in the table. This number is reported
+ as of the beginning of <literal>seq scanning heap</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_blks_scanned</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap blocks scanned. This counter only advances when the
+ phase is <literal>seq scanning heap</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>index_rebuild_count</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of indexes rebuilt. This counter only advances when the phase
+ is <literal>rebuilding index</literal>.
+ </para></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <table id="repack-phases">
+ <title>REPACK Phases</title>
+ <tgroup cols="2">
+ <colspec colname="col1" colwidth="1*"/>
+ <colspec colname="col2" colwidth="2*"/>
+ <thead>
+ <row>
+ <entry>Phase</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry><literal>initializing</literal></entry>
+ <entry>
+ The command is preparing to begin scanning the heap. This phase is
+ expected to be very brief.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>seq scanning heap</literal></entry>
+ <entry>
+ The command is currently scanning the table using a sequential scan.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>index scanning heap</literal></entry>
+ <entry>
+ <command>REPACK</command> is currently scanning the table using an index scan.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>sorting tuples</literal></entry>
+ <entry>
+ <command>REPACK</command> is currently sorting tuples.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>writing new heap</literal></entry>
+ <entry>
+ <command>REPACK</command> is currently writing the new heap.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>swapping relation files</literal></entry>
+ <entry>
+ The command is currently swapping newly-built files into place.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>rebuilding index</literal></entry>
+ <entry>
+ The command is currently rebuilding an index.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>performing final cleanup</literal></entry>
+ <entry>
+ The command is performing final cleanup. When this phase is
+ completed, <command>REPACK</command> will end.
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+ </sect2>
+
<sect2 id="copy-progress-reporting">
<title>COPY Progress Reporting</title>
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..c0ef654fcb4 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -167,6 +167,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY refreshMaterializedView SYSTEM "refresh_materialized_view.sgml">
<!ENTITY reindex SYSTEM "reindex.sgml">
<!ENTITY releaseSavepoint SYSTEM "release_savepoint.sgml">
+<!ENTITY repack SYSTEM "repack.sgml">
<!ENTITY reset SYSTEM "reset.sgml">
<!ENTITY revoke SYSTEM "revoke.sgml">
<!ENTITY rollback SYSTEM "rollback.sgml">
diff --git a/doc/src/sgml/ref/cluster.sgml b/doc/src/sgml/ref/cluster.sgml
index 8811f169ea0..ee4fd965928 100644
--- a/doc/src/sgml/ref/cluster.sgml
+++ b/doc/src/sgml/ref/cluster.sgml
@@ -42,18 +42,6 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
<replaceable class="parameter">table_name</replaceable>.
</para>
- <para>
- When a table is clustered, it is physically reordered
- based on the index information. Clustering is a one-time operation:
- when the table is subsequently updated, the changes are
- not clustered. That is, no attempt is made to store new or
- updated rows according to their index order. (If one wishes, one can
- periodically recluster by issuing the command again. Also, setting
- the table's <literal>fillfactor</literal> storage parameter to less than
- 100% can aid in preserving cluster ordering during updates, since updated
- rows are kept on the same page if enough space is available there.)
- </para>
-
<para>
When a table is clustered, <productname>PostgreSQL</productname>
remembers which index it was clustered by. The form
@@ -78,6 +66,25 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
database operations (both reads and writes) from operating on the
table until the <command>CLUSTER</command> is finished.
</para>
+
+ <warning>
+ <para>
+ The <command>CLUSTER</command> command is deprecated in favor of
+ <xref linkend="sql-repack"/>.
+ </para>
+ </warning>
+
+ <note>
+ <para>
+ <xref linkend="sql-repack-notes-on-clustering"/> explain how clustering
+ works, whether it is initiated by <command>CLUSTER</command> or
+ by <command>REPACK</command>. The notable difference between the two is
+ that <command>REPACK</command> does not remember the index used last
+ time. Thus if you don't specify an index, <command>REPACK</command>
+ rewrites the table but does not try to cluster it.
+ </para>
+ </note>
+
</refsect1>
<refsect1>
@@ -136,63 +143,12 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
on the table.
</para>
- <para>
- In cases where you are accessing single rows randomly
- within a table, the actual order of the data in the
- table is unimportant. However, if you tend to access some
- data more than others, and there is an index that groups
- them together, you will benefit from using <command>CLUSTER</command>.
- If you are requesting a range of indexed values from a table, or a
- single indexed value that has multiple rows that match,
- <command>CLUSTER</command> will help because once the index identifies the
- table page for the first row that matches, all other rows
- that match are probably already on the same table page,
- and so you save disk accesses and speed up the query.
- </para>
-
- <para>
- <command>CLUSTER</command> can re-sort the table using either an index scan
- on the specified index, or (if the index is a b-tree) a sequential
- scan followed by sorting. It will attempt to choose the method that
- will be faster, based on planner cost parameters and available statistical
- information.
- </para>
-
<para>
While <command>CLUSTER</command> is running, the <xref
linkend="guc-search-path"/> is temporarily changed to <literal>pg_catalog,
pg_temp</literal>.
</para>
- <para>
- When an index scan is used, a temporary copy of the table is created that
- contains the table data in the index order. Temporary copies of each
- index on the table are created as well. Therefore, you need free space on
- disk at least equal to the sum of the table size and the index sizes.
- </para>
-
- <para>
- When a sequential scan and sort is used, a temporary sort file is
- also created, so that the peak temporary space requirement is as much
- as double the table size, plus the index sizes. This method is often
- faster than the index scan method, but if the disk space requirement is
- intolerable, you can disable this choice by temporarily setting <xref
- linkend="guc-enable-sort"/> to <literal>off</literal>.
- </para>
-
- <para>
- It is advisable to set <xref linkend="guc-maintenance-work-mem"/> to
- a reasonably large value (but not more than the amount of RAM you can
- dedicate to the <command>CLUSTER</command> operation) before clustering.
- </para>
-
- <para>
- Because the planner records statistics about the ordering of
- tables, it is advisable to run <link linkend="sql-analyze"><command>ANALYZE</command></link>
- on the newly clustered table.
- Otherwise, the planner might make poor choices of query plans.
- </para>
-
<para>
Because <command>CLUSTER</command> remembers which indexes are clustered,
one can cluster the tables one wants clustered manually the first time,
diff --git a/doc/src/sgml/ref/repack.sgml b/doc/src/sgml/ref/repack.sgml
new file mode 100644
index 00000000000..a612c72d971
--- /dev/null
+++ b/doc/src/sgml/ref/repack.sgml
@@ -0,0 +1,254 @@
+<!--
+doc/src/sgml/ref/repack.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-repack">
+ <indexterm zone="sql-repack">
+ <primary>REPACK</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle>REPACK</refentrytitle>
+ <manvolnum>7</manvolnum>
+ <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>REPACK</refname>
+ <refpurpose>rewrite a table to reclaim disk space</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <replaceable class="parameter">table_name</replaceable> [ USING INDEX <replaceable class="parameter">index_name</replaceable> ] ]
+
+<phrase>where <replaceable class="parameter">option</replaceable> can be one of:</phrase>
+
+ VERBOSE [ <replaceable class="parameter">boolean</replaceable> ]
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+
+ <para>
+ <command>REPACK</command> reclaims storage occupied by dead
+ tuples. Unlike <command>VACUUM</command>, it does so by rewriting the
+ entire contents of the table specified
+ by <replaceable class="parameter">table_name</replaceable> into a new disk
+ file with no extra space (except for the space guaranteed by
+ the <literal>fillfactor</literal> storage parameter), allowing unused space
+ to be returned to the operating system.
+ </para>
+
+ <para>
+ Without
+ a <replaceable class="parameter">table_name</replaceable>, <command>REPACK</command>
+ processes every table and materialized view in the current database that
+ the current user has the <literal>MAINTAIN</literal> privilege on. This
+ form of <command>REPACK</command> cannot be executed inside a transaction
+ block.
+ </para>
+
+ <para>
+ If <replaceable class="parameter">index_name</replaceable> is specified,
+ the table is clustered by this index. Please see the notes on clustering
+ below.
+ </para>
+
+ <para>
+ When a table is being repacked, an <literal>ACCESS EXCLUSIVE</literal> lock
+ is acquired on it. This prevents any other database operations (both reads
+ and writes) from operating on the table until the <command>REPACK</command>
+ is finished.
+ </para>
+
+ <refsect2 id="sql-repack-notes-on-clustering" xreflabel="Notes on Clustering">
+ <title>Notes on Clustering</title>
+
+ <para>
+ When a table is clustered, it is physically reordered based on the index
+ information. Clustering is a one-time operation: when the table is
+ subsequently updated, the changes are not clustered. That is, no attempt
+ is made to store new or updated rows according to their index order. (If
+ one wishes, one can periodically recluster by issuing the command again.
+ Also, setting the table's <literal>fillfactor</literal> storage parameter
+ to less than 100% can aid in preserving cluster ordering during updates,
+ since updated rows are kept on the same page if enough space is available
+ there.)
+ </para>
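+  <para>
+   For example, to help preserve the clustered order across future updates,
+   one might lower the table's <literal>fillfactor</literal> before
+   repacking (the table and index names here are only illustrative):
+<programlisting>
+ALTER TABLE employees SET (fillfactor = 90);
+REPACK employees USING INDEX employees_ind;
+</programlisting>
+  </para>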
+
+ <para>
+ In cases where you are accessing single rows randomly within a table, the
+ actual order of the data in the table is unimportant. However, if you tend
+ to access some data more than others, and there is an index that groups
+ them together, you will benefit from using <command>REPACK</command>. If
+ you are requesting a range of indexed values from a table, or a single
+ indexed value that has multiple rows that match,
+ <command>REPACK</command> will help because once the index identifies the
+ table page for the first row that matches, all other rows that match are
+ probably already on the same table page, and so you save disk accesses and
+ speed up the query.
+ </para>
+
+ <para>
+ <command>REPACK</command> can re-sort the table using either an index scan
+ on the specified index (if the index is a b-tree), or a sequential scan
+ followed by sorting. It will attempt to choose the method that will be
+ faster, based on planner cost parameters and available statistical
+ information.
+ </para>
+
+ <para>
+ Because the planner records statistics about the ordering of tables, it is
+ advisable to
+ run <link linkend="sql-analyze"><command>ANALYZE</command></link> on the
+ newly repacked table. Otherwise, the planner might make poor choices of
+ query plans.
+ </para>
+ </refsect2>
+
+ <refsect2 id="sql-repack-notes-on-resources" xreflabel="Notes on Resources">
+ <title>Notes on Resources</title>
+
+ <para>
+ When an index scan or a sequential scan without sort is used, a temporary
+ copy of the table is created that contains the table data in the index
+ order. Temporary copies of each index on the table are created as well.
+ Therefore, you need free space on disk at least equal to the sum of the
+ table size and the index sizes.
+ </para>
+
+ <para>
+ When a sequential scan and sort is used, a temporary sort file is also
+ created, so that the peak temporary space requirement is as much as double
+ the table size, plus the index sizes. This method is often faster than
+ the index scan method, but if the disk space requirement is intolerable,
+ you can disable this choice by temporarily setting
+ <xref linkend="guc-enable-sort"/> to <literal>off</literal>.
+ </para>
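+  <para>
+   For example, the sort method can be disabled for the current session
+   only, as sketched here:
+<programlisting>
+SET enable_sort = off;
+REPACK employees USING INDEX employees_ind;
+RESET enable_sort;
+</programlisting>
+  </para>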
+
+ <para>
+ It is advisable to set <xref linkend="guc-maintenance-work-mem"/> to a
+ reasonably large value (but not more than the amount of RAM you can
+ dedicate to the <command>REPACK</command> operation) before repacking.
+ </para>
+ </refsect2>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Parameters</title>
+
+ <variablelist>
+ <varlistentry>
+ <term><replaceable class="parameter">table_name</replaceable></term>
+ <listitem>
+ <para>
+       The name (possibly schema-qualified) of a table to be repacked.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><replaceable class="parameter">index_name</replaceable></term>
+ <listitem>
+ <para>
+       The name of an index of the table; the table will be clustered
+       using this index.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>VERBOSE</literal></term>
+ <listitem>
+ <para>
+       Prints a progress report at <literal>INFO</literal> level as each
+       table is repacked.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><replaceable class="parameter">boolean</replaceable></term>
+ <listitem>
+ <para>
+ Specifies whether the selected option should be turned on or off.
+ You can write <literal>TRUE</literal>, <literal>ON</literal>, or
+ <literal>1</literal> to enable the option, and <literal>FALSE</literal>,
+ <literal>OFF</literal>, or <literal>0</literal> to disable it. The
+ <replaceable class="parameter">boolean</replaceable> value can also
+ be omitted, in which case <literal>TRUE</literal> is assumed.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </refsect1>
+
+ <refsect1>
+ <title>Notes</title>
+
+ <para>
+ To repack a table, one must have the <literal>MAINTAIN</literal> privilege
+ on the table.
+ </para>
+
+ <para>
+ While <command>REPACK</command> is running, the <xref
+ linkend="guc-search-path"/> is temporarily changed to <literal>pg_catalog,
+ pg_temp</literal>.
+ </para>
+
+ <para>
+ Each backend running <command>REPACK</command> will report its progress
+ in the <structname>pg_stat_progress_repack</structname> view. See
+ <xref linkend="repack-progress-reporting"/> for details.
+ </para>
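+  <para>
+   For instance, the progress of a running <command>REPACK</command> can be
+   watched from another session with a query such as the following (the
+   column set is described in <xref linkend="repack-progress-reporting"/>):
+<programlisting>
+SELECT pid, relid::regclass AS relation, phase,
+       heap_blks_scanned, heap_blks_total
+FROM pg_stat_progress_repack;
+</programlisting>
+  </para>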
+
+ <para>
+ Repacking a partitioned table repacks each of its partitions. If an index
+ is specified, each partition is repacked using the partition of that
+ index. <command>REPACK</command> on a partitioned table cannot be executed
+ inside a transaction block.
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Examples</title>
+
+ <para>
+ Repack the table <literal>employees</literal>:
+<programlisting>
+REPACK employees;
+</programlisting>
+ </para>
+
+ <para>
+   Repack the table <literal>employees</literal> using its
+   index <literal>employees_ind</literal> (since an index is used, this
+   effectively clusters the table):
+<programlisting>
+REPACK employees USING INDEX employees_ind;
+</programlisting>
+ </para>
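+  <para>
+   Repack the table <literal>employees</literal> with a progress report:
+<programlisting>
+REPACK (VERBOSE) employees;
+</programlisting>
+  </para>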
+
+ <para>
+ Repack all tables in the database on which you have
+ the <literal>MAINTAIN</literal> privilege:
+<programlisting>
+REPACK;
+</programlisting></para>
+ </refsect1>
+
+ <refsect1>
+ <title>Compatibility</title>
+
+ <para>
+ There is no <command>REPACK</command> statement in the SQL standard.
+ </para>
+
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index bd5dcaf86a5..cee1cf3926c 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -98,6 +98,7 @@ VACUUM [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
<varlistentry>
<term><literal>FULL</literal></term>
<listitem>
+
<para>
Selects <quote>full</quote> vacuum, which can reclaim more
space, but takes much longer and exclusively locks the table.
@@ -106,6 +107,14 @@ VACUUM [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
the operation is complete. Usually this should only be used when a
significant amount of space needs to be reclaimed from within the table.
</para>
+
+ <warning>
+ <para>
+ The <option>FULL</option> parameter is deprecated in favor of
+ <xref linkend="sql-repack"/>.
+ </para>
+ </warning>
+
</listitem>
</varlistentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..229912d35b7 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -195,6 +195,7 @@
&refreshMaterializedView;
&reindex;
&releaseSavepoint;
+ &repack;
&reset;
&revoke;
&rollback;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index cb4bc35c93e..0b03070d394 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -741,13 +741,13 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
if (OldIndex != NULL && !use_sort)
{
const int ci_index[] = {
- PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_INDEX_RELID
+ PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_INDEX_RELID
};
int64 ci_val[2];
/* Set phase and OIDOldIndex to columns */
- ci_val[0] = PROGRESS_CLUSTER_PHASE_INDEX_SCAN_HEAP;
+ ci_val[0] = PROGRESS_REPACK_PHASE_INDEX_SCAN_HEAP;
ci_val[1] = RelationGetRelid(OldIndex);
pgstat_progress_update_multi_param(2, ci_index, ci_val);
@@ -759,15 +759,15 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
else
{
/* In scan-and-sort mode and also VACUUM FULL, set phase */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_SEQ_SCAN_HEAP);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_SEQ_SCAN_HEAP);
tableScan = table_beginscan(OldHeap, SnapshotAny, 0, (ScanKey) NULL);
heapScan = (HeapScanDesc) tableScan;
indexScan = NULL;
/* Set total heap blocks */
- pgstat_progress_update_param(PROGRESS_CLUSTER_TOTAL_HEAP_BLKS,
+ pgstat_progress_update_param(PROGRESS_REPACK_TOTAL_HEAP_BLKS,
heapScan->rs_nblocks);
}
@@ -809,7 +809,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
* is manually updated to the correct value when the table
* scan finishes.
*/
- pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_BLKS_SCANNED,
+ pgstat_progress_update_param(PROGRESS_REPACK_HEAP_BLKS_SCANNED,
heapScan->rs_nblocks);
break;
}
@@ -825,7 +825,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
*/
if (prev_cblock != heapScan->rs_cblock)
{
- pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_BLKS_SCANNED,
+ pgstat_progress_update_param(PROGRESS_REPACK_HEAP_BLKS_SCANNED,
(heapScan->rs_cblock +
heapScan->rs_nblocks -
heapScan->rs_startblock
@@ -912,14 +912,14 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
* In scan-and-sort mode, report increase in number of tuples
* scanned
*/
- pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED,
+ pgstat_progress_update_param(PROGRESS_REPACK_HEAP_TUPLES_SCANNED,
*num_tuples);
}
else
{
const int ct_index[] = {
- PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED,
- PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN
+ PROGRESS_REPACK_HEAP_TUPLES_SCANNED,
+ PROGRESS_REPACK_HEAP_TUPLES_WRITTEN
};
int64 ct_val[2];
@@ -952,14 +952,14 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
double n_tuples = 0;
/* Report that we are now sorting tuples */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_SORT_TUPLES);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_SORT_TUPLES);
tuplesort_performsort(tuplesort);
/* Report that we are now writing new heap */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_WRITE_NEW_HEAP);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_WRITE_NEW_HEAP);
for (;;)
{
@@ -977,7 +977,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
values, isnull,
rwstate);
/* Report n_tuples */
- pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN,
+ pgstat_progress_update_param(PROGRESS_REPACK_HEAP_TUPLES_WRITTEN,
n_tuples);
}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index aa216683b74..96357d1170c 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -4079,7 +4079,7 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
Assert(!ReindexIsProcessingIndex(indexOid));
/* Set index rebuild count */
- pgstat_progress_update_param(PROGRESS_CLUSTER_INDEX_REBUILD_COUNT,
+ pgstat_progress_update_param(PROGRESS_REPACK_INDEX_REBUILD_COUNT,
i);
i++;
}
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 08f780a2e63..7380b6e3d7b 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1271,6 +1271,32 @@ CREATE VIEW pg_stat_progress_cluster AS
FROM pg_stat_get_progress_info('CLUSTER') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
+CREATE VIEW pg_stat_progress_repack AS
+ SELECT
+ S.pid AS pid,
+ S.datid AS datid,
+ D.datname AS datname,
+ S.relid AS relid,
+ -- param1 is currently unused
+ CASE S.param2 WHEN 0 THEN 'initializing'
+ WHEN 1 THEN 'seq scanning heap'
+ WHEN 2 THEN 'index scanning heap'
+ WHEN 3 THEN 'sorting tuples'
+ WHEN 4 THEN 'writing new heap'
+ WHEN 5 THEN 'swapping relation files'
+ WHEN 6 THEN 'rebuilding index'
+ WHEN 7 THEN 'performing final cleanup'
+ END AS phase,
+ CAST(S.param3 AS oid) AS repack_index_relid,
+ S.param4 AS heap_tuples_scanned,
+ S.param5 AS heap_tuples_written,
+ S.param6 AS heap_blks_total,
+ S.param7 AS heap_blks_scanned,
+ S.param8 AS index_rebuild_count
+ FROM pg_stat_get_progress_info('REPACK') AS S
+ LEFT JOIN pg_database D ON S.datid = D.oid;
+
+
CREATE VIEW pg_stat_progress_create_index AS
SELECT
S.pid AS pid, S.datid AS datid, D.datname AS datname,
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index b55221d44cd..5e94b570431 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -67,17 +67,24 @@ typedef struct
Oid indexOid;
} RelToCluster;
-
-static void cluster_multiple_rels(List *rtcs, ClusterParams *params);
-static void rebuild_relation(Relation OldHeap, Relation index, bool verbose);
+static void cluster_multiple_rels(List *rtcs, ClusterParams *params,
+ ClusterCommand cmd);
+static void rebuild_relation(Relation OldHeap, Relation index, bool verbose,
+ ClusterCommand cmd);
static void copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
bool verbose, bool *pSwapToastByContent,
TransactionId *pFreezeXid, MultiXactId *pCutoffMulti);
static List *get_tables_to_cluster(MemoryContext cluster_context);
+static List *get_tables_to_repack(MemoryContext repack_context);
static List *get_tables_to_cluster_partitioned(MemoryContext cluster_context,
- Oid indexOid);
-static bool cluster_is_permitted_for_relation(Oid relid, Oid userid);
-
+ Oid relid, bool rel_is_index,
+ ClusterCommand cmd);
+static bool cluster_is_permitted_for_relation(Oid relid, Oid userid,
+ ClusterCommand cmd);
+static Relation process_single_relation(RangeVar *relation, char *indexname,
+ ClusterParams *params,
+ ClusterCommand cmd,
+ Oid *indexOid_p);
/*---------------------------------------------------------------------------
* This cluster code allows for clustering multiple tables at once. Because
@@ -134,71 +141,11 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
if (stmt->relation != NULL)
{
/* This is the single-relation case. */
- Oid tableOid;
-
- /*
- * Find, lock, and check permissions on the table. We obtain
- * AccessExclusiveLock right away to avoid lock-upgrade hazard in the
- * single-transaction case.
- */
- tableOid = RangeVarGetRelidExtended(stmt->relation,
- AccessExclusiveLock,
- 0,
- RangeVarCallbackMaintainsTable,
- NULL);
- rel = table_open(tableOid, NoLock);
-
- /*
- * Reject clustering a remote temp table ... their local buffer
- * manager is not going to cope.
- */
- if (RELATION_IS_OTHER_TEMP(rel))
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot cluster temporary tables of other sessions")));
-
- if (stmt->indexname == NULL)
- {
- ListCell *index;
-
- /* We need to find the index that has indisclustered set. */
- foreach(index, RelationGetIndexList(rel))
- {
- indexOid = lfirst_oid(index);
- if (get_index_isclustered(indexOid))
- break;
- indexOid = InvalidOid;
- }
-
- if (!OidIsValid(indexOid))
- ereport(ERROR,
- (errcode(ERRCODE_UNDEFINED_OBJECT),
- errmsg("there is no previously clustered index for table \"%s\"",
- stmt->relation->relname)));
- }
- else
- {
- /*
- * The index is expected to be in the same namespace as the
- * relation.
- */
- indexOid = get_relname_relid(stmt->indexname,
- rel->rd_rel->relnamespace);
- if (!OidIsValid(indexOid))
- ereport(ERROR,
- (errcode(ERRCODE_UNDEFINED_OBJECT),
- errmsg("index \"%s\" for table \"%s\" does not exist",
- stmt->indexname, stmt->relation->relname)));
- }
-
- /* For non-partitioned tables, do what we came here to do. */
- if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
- {
- cluster_rel(rel, indexOid, ¶ms);
- /* cluster_rel closes the relation, but keeps lock */
-
+ rel = process_single_relation(stmt->relation, stmt->indexname,
+ ¶ms, CLUSTER_COMMAND_CLUSTER,
+ &indexOid);
+ if (rel == NULL)
return;
- }
}
/*
@@ -231,7 +178,9 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
{
Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
check_index_is_clusterable(rel, indexOid, AccessShareLock);
- rtcs = get_tables_to_cluster_partitioned(cluster_context, indexOid);
+ rtcs = get_tables_to_cluster_partitioned(cluster_context, indexOid,
+ true,
+ CLUSTER_COMMAND_CLUSTER);
/* close relation, releasing lock on parent table */
table_close(rel, AccessExclusiveLock);
@@ -243,7 +192,7 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
}
/* Do the job. */
- cluster_multiple_rels(rtcs, ¶ms);
+ cluster_multiple_rels(rtcs, ¶ms, CLUSTER_COMMAND_CLUSTER);
/* Start a new transaction for the cleanup work. */
StartTransactionCommand();
@@ -260,7 +209,7 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
* return.
*/
static void
-cluster_multiple_rels(List *rtcs, ClusterParams *params)
+cluster_multiple_rels(List *rtcs, ClusterParams *params, ClusterCommand cmd)
{
ListCell *lc;
@@ -283,7 +232,7 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
rel = table_open(rtc->tableOid, AccessExclusiveLock);
/* Process this table */
- cluster_rel(rel, rtc->indexOid, params);
+ cluster_rel(rel, rtc->indexOid, params, cmd);
/* cluster_rel closes the relation, but keeps lock */
PopActiveSnapshot();
@@ -306,9 +255,13 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
* If indexOid is InvalidOid, the table will be rewritten in physical order
* instead of index order. This is the new implementation of VACUUM FULL,
* and error messages should refer to the operation as VACUUM not CLUSTER.
+ *
+ * 'cmd' indicates which command is being executed. REPACK should be the only
+ * caller of this function in the future.
*/
void
-cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
+cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
+ ClusterCommand cmd)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid save_userid;
@@ -323,13 +276,26 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
/* Check for user-requested abort. */
CHECK_FOR_INTERRUPTS();
- pgstat_progress_start_command(PROGRESS_COMMAND_CLUSTER, tableOid);
- if (OidIsValid(indexOid))
- pgstat_progress_update_param(PROGRESS_CLUSTER_COMMAND,
+ if (cmd == CLUSTER_COMMAND_REPACK)
+ pgstat_progress_start_command(PROGRESS_COMMAND_REPACK, tableOid);
+ else
+ pgstat_progress_start_command(PROGRESS_COMMAND_CLUSTER, tableOid);
+
+ if (cmd == CLUSTER_COMMAND_REPACK)
+ pgstat_progress_update_param(PROGRESS_REPACK_COMMAND,
+ PROGRESS_REPACK_COMMAND_REPACK);
+ else if (OidIsValid(indexOid))
+ {
+ Assert(cmd == CLUSTER_COMMAND_CLUSTER);
+ pgstat_progress_update_param(PROGRESS_REPACK_COMMAND,
PROGRESS_CLUSTER_COMMAND_CLUSTER);
+ }
else
- pgstat_progress_update_param(PROGRESS_CLUSTER_COMMAND,
+ {
+ Assert(cmd == CLUSTER_COMMAND_VACUUM);
+ pgstat_progress_update_param(PROGRESS_REPACK_COMMAND,
PROGRESS_CLUSTER_COMMAND_VACUUM_FULL);
+ }
/*
* Switch to the table owner's userid, so that any index functions are run
@@ -353,7 +319,8 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
if (recheck)
{
/* Check that the user still has privileges for the relation */
- if (!cluster_is_permitted_for_relation(tableOid, save_userid))
+ if (!cluster_is_permitted_for_relation(tableOid, save_userid,
+ CLUSTER_COMMAND_CLUSTER))
{
relation_close(OldHeap, AccessExclusiveLock);
goto out;
@@ -403,8 +370,12 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
* would work in most respects, but the index would only get marked as
* indisclustered in the current database, leading to unexpected behavior
* if CLUSTER were later invoked in another database.
+ *
+ * REPACK does not set indisclustered. XXX Not sure I understand the
+ * comment above: how can an attribute be set "only in the current
+ * database"?
*/
- if (OidIsValid(indexOid) && OldHeap->rd_rel->relisshared)
+ if (cmd == CLUSTER_COMMAND_CLUSTER && OldHeap->rd_rel->relisshared)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("cannot cluster a shared catalog")));
@@ -415,21 +386,33 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
*/
if (RELATION_IS_OTHER_TEMP(OldHeap))
{
- if (OidIsValid(indexOid))
+ if (cmd == CLUSTER_COMMAND_CLUSTER)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("cannot cluster temporary tables of other sessions")));
+ else if (cmd == CLUSTER_COMMAND_REPACK)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot repack temporary tables of other sessions")));
+ }
else
+ {
+			Assert(cmd == CLUSTER_COMMAND_VACUUM);
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("cannot vacuum temporary tables of other sessions")));
+ }
}
/*
* Also check for active uses of the relation in the current transaction,
* including open scans and pending AFTER trigger events.
*/
- CheckTableNotInUse(OldHeap, OidIsValid(indexOid) ? "CLUSTER" : "VACUUM");
+ CheckTableNotInUse(OldHeap,
+ (cmd == CLUSTER_COMMAND_CLUSTER ?
+ "CLUSTER" : (cmd == CLUSTER_COMMAND_REPACK ?
+ "REPACK" : "VACUUM")));
/* Check heap and index are valid to cluster on */
if (OidIsValid(indexOid))
@@ -469,7 +452,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
TransferPredicateLocksToHeapRelation(OldHeap);
/* rebuild_relation does all the dirty work */
- rebuild_relation(OldHeap, index, verbose);
+ rebuild_relation(OldHeap, index, verbose, cmd);
/* rebuild_relation closes OldHeap, and index if valid */
out:
@@ -626,7 +609,8 @@ mark_index_clustered(Relation rel, Oid indexOid, bool is_internal)
* On exit, they are closed, but locks on them are not released.
*/
static void
-rebuild_relation(Relation OldHeap, Relation index, bool verbose)
+rebuild_relation(Relation OldHeap, Relation index, bool verbose,
+ ClusterCommand cmd)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid accessMethod = OldHeap->rd_rel->relam;
@@ -642,7 +626,7 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose)
Assert(CheckRelationLockedByMe(OldHeap, AccessExclusiveLock, false) &&
(index == NULL || CheckRelationLockedByMe(index, AccessExclusiveLock, false)));
- if (index)
+ if (index && cmd == CLUSTER_COMMAND_CLUSTER)
/* Mark the correct index as clustered */
mark_index_clustered(OldHeap, RelationGetRelid(index), true);
@@ -1458,8 +1442,8 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
int i;
/* Report that we are now swapping relation files */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_SWAP_REL_FILES);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_SWAP_REL_FILES);
/* Zero out possible results from swapped_relation_files */
memset(mapped_tables, 0, sizeof(mapped_tables));
@@ -1509,14 +1493,14 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
/* Report that we are now reindexing relations */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_REBUILD_INDEX);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_REBUILD_INDEX);
reindex_relation(NULL, OIDOldHeap, reindex_flags, &reindex_params);
/* Report that we are now doing clean up */
- pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
- PROGRESS_CLUSTER_PHASE_FINAL_CLEANUP);
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_FINAL_CLEANUP);
/*
* If the relation being rebuilt is pg_class, swap_relation_files()
@@ -1666,7 +1650,8 @@ get_tables_to_cluster(MemoryContext cluster_context)
index = (Form_pg_index) GETSTRUCT(indexTuple);
- if (!cluster_is_permitted_for_relation(index->indrelid, GetUserId()))
+ if (!cluster_is_permitted_for_relation(index->indrelid, GetUserId(),
+ CLUSTER_COMMAND_CLUSTER))
continue;
/* Use a permanent memory context for the result list */
@@ -1687,14 +1672,68 @@ get_tables_to_cluster(MemoryContext cluster_context)
}
/*
- * Given an index on a partitioned table, return a list of RelToCluster for
+ * Like get_tables_to_cluster(), but does not consider indexes.
+ */
+static List *
+get_tables_to_repack(MemoryContext repack_context)
+{
+ Relation relrelation;
+ TableScanDesc scan;
+ HeapTuple tuple;
+ MemoryContext old_context;
+ List *rtcs = NIL;
+
+ /*
+ * Get all relations that the current user has the appropriate privileges
+ * for.
+ */
+ relrelation = table_open(RelationRelationId, AccessShareLock);
+ scan = table_beginscan_catalog(relrelation, 0, NULL);
+ while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+ {
+ RelToCluster *rtc;
+		Form_pg_class classform = (Form_pg_class) GETSTRUCT(tuple);
+		Oid			relid = classform->oid;
+ char relkind = get_rel_relkind(relid);
+
+		/* Only interested in plain tables and materialized views. */
+ if (relkind != RELKIND_RELATION && relkind != RELKIND_MATVIEW)
+ continue;
+
+ if (!cluster_is_permitted_for_relation(relid, GetUserId(),
+ CLUSTER_COMMAND_REPACK))
+ continue;
+
+ /* Use a permanent memory context for the result list */
+ old_context = MemoryContextSwitchTo(repack_context);
+
+ rtc = (RelToCluster *) palloc(sizeof(RelToCluster));
+ rtc->tableOid = relid;
+ rtc->indexOid = InvalidOid;
+ rtcs = lappend(rtcs, rtc);
+
+ MemoryContextSwitchTo(old_context);
+ }
+ table_endscan(scan);
+
+ relation_close(relrelation, AccessShareLock);
+
+ return rtcs;
+}
+
+/*
+ * Given a partitioned table or its index, return a list of RelToCluster for
* all the children leaves tables/indexes.
*
* Like expand_vacuum_rel, but here caller must hold AccessExclusiveLock
* on the table containing the index.
+ *
+ * 'rel_is_index' tells whether 'relid' is that of an index (true) or of the
+ * owning relation.
*/
static List *
-get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
+get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid relid,
+ bool rel_is_index, ClusterCommand cmd)
{
List *inhoids;
ListCell *lc;
@@ -1702,17 +1741,33 @@ get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
MemoryContext old_context;
/* Do not lock the children until they're processed */
- inhoids = find_all_inheritors(indexOid, NoLock, NULL);
+ inhoids = find_all_inheritors(relid, NoLock, NULL);
foreach(lc, inhoids)
{
- Oid indexrelid = lfirst_oid(lc);
- Oid relid = IndexGetRelation(indexrelid, false);
+ Oid inhoid = lfirst_oid(lc);
+ Oid inhrelid,
+ inhindid;
RelToCluster *rtc;
- /* consider only leaf indexes */
- if (get_rel_relkind(indexrelid) != RELKIND_INDEX)
- continue;
+ if (rel_is_index)
+ {
+ /* consider only leaf indexes */
+ if (get_rel_relkind(inhoid) != RELKIND_INDEX)
+ continue;
+
+ inhrelid = IndexGetRelation(inhoid, false);
+ inhindid = inhoid;
+ }
+ else
+ {
+ /* consider only leaf relations */
+ if (get_rel_relkind(inhoid) != RELKIND_RELATION)
+ continue;
+
+ inhrelid = inhoid;
+ inhindid = InvalidOid;
+ }
/*
* It's possible that the user does not have privileges to CLUSTER the
@@ -1720,15 +1775,15 @@ get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
* table. We skip any partitions which the user is not permitted to
* CLUSTER.
*/
- if (!cluster_is_permitted_for_relation(relid, GetUserId()))
+ if (!cluster_is_permitted_for_relation(inhrelid, GetUserId(), cmd))
continue;
/* Use a permanent memory context for the result list */
old_context = MemoryContextSwitchTo(cluster_context);
rtc = (RelToCluster *) palloc(sizeof(RelToCluster));
- rtc->tableOid = relid;
- rtc->indexOid = indexrelid;
+ rtc->tableOid = inhrelid;
+ rtc->indexOid = inhindid;
rtcs = lappend(rtcs, rtc);
MemoryContextSwitchTo(old_context);
@@ -1742,13 +1797,211 @@ get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
* function emits a WARNING.
*/
static bool
-cluster_is_permitted_for_relation(Oid relid, Oid userid)
+cluster_is_permitted_for_relation(Oid relid, Oid userid, ClusterCommand cmd)
{
if (pg_class_aclcheck(relid, userid, ACL_MAINTAIN) == ACLCHECK_OK)
return true;
- ereport(WARNING,
- (errmsg("permission denied to cluster \"%s\", skipping it",
- get_rel_name(relid))));
+ if (cmd == CLUSTER_COMMAND_CLUSTER)
+ ereport(WARNING,
+ (errmsg("permission denied to cluster \"%s\", skipping it",
+ get_rel_name(relid))));
+ else
+ {
+ Assert(cmd == CLUSTER_COMMAND_REPACK);
+
+ ereport(WARNING,
+ (errmsg("permission denied to repack \"%s\", skipping it",
+ get_rel_name(relid))));
+ }
+
return false;
}
+
+/*
+ * REPACK is intended to be a replacement for both CLUSTER and VACUUM FULL.
+ */
+void
+repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
+{
+ ListCell *lc;
+ ClusterParams params = {0};
+ bool verbose = false;
+ Relation rel = NULL;
+ Oid indexOid = InvalidOid;
+ MemoryContext repack_context;
+ List *rtcs;
+
+ /* Parse option list */
+ foreach(lc, stmt->params)
+ {
+ DefElem *opt = (DefElem *) lfirst(lc);
+
+ if (strcmp(opt->defname, "verbose") == 0)
+ verbose = defGetBoolean(opt);
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("unrecognized REPACK option \"%s\"",
+ opt->defname),
+ parser_errposition(pstate, opt->location)));
+ }
+
+ params.options = (verbose ? CLUOPT_VERBOSE : 0);
+
+ if (stmt->relation != NULL)
+ {
+ /* This is the single-relation case. */
+ rel = process_single_relation(stmt->relation, stmt->indexname,
+ ¶ms, CLUSTER_COMMAND_REPACK,
+ &indexOid);
+ if (rel == NULL)
+ return;
+ }
+
+ /*
+ * By here, we know we are in a multi-table situation. In order to avoid
+ * holding locks for too long, we want to process each table in its own
+ * transaction. This forces us to disallow running inside a user
+ * transaction block.
+ */
+ PreventInTransactionBlock(isTopLevel, "REPACK");
+
+ /* Also, we need a memory context to hold our list of relations */
+ repack_context = AllocSetContextCreate(PortalContext,
+ "Repack",
+ ALLOCSET_DEFAULT_SIZES);
+
+ params.options |= CLUOPT_RECHECK;
+ if (rel != NULL)
+ {
+ Oid relid;
+ bool rel_is_index;
+
+ Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
+
+ if (OidIsValid(indexOid))
+ {
+ relid = indexOid;
+ rel_is_index = true;
+ }
+ else
+ {
+ relid = RelationGetRelid(rel);
+ rel_is_index = false;
+ }
+ rtcs = get_tables_to_cluster_partitioned(repack_context, relid,
+ rel_is_index,
+ CLUSTER_COMMAND_REPACK);
+
+ /* close relation, releasing lock on parent table */
+ table_close(rel, AccessExclusiveLock);
+ }
+ else
+ rtcs = get_tables_to_repack(repack_context);
+
+ /* Do the job. */
+ cluster_multiple_rels(rtcs, ¶ms, CLUSTER_COMMAND_REPACK);
+
+ /* Start a new transaction for the cleanup work. */
+ StartTransactionCommand();
+
+ /* Clean up working storage */
+ MemoryContextDelete(repack_context);
+
+}
+
+/*
+ * REPACK a single relation if it's a non-partitioned table or a leaf
+ * partition and return NULL. Return the relation's relcache entry if the
+ * caller needs to process it (because the relation is partitioned).
+ */
+static Relation
+process_single_relation(RangeVar *relation, char *indexname,
+ ClusterParams *params, ClusterCommand cmd,
+ Oid *indexOid_p)
+{
+ Relation rel;
+ Oid indexOid = InvalidOid;
+
+ /* This is the single-relation case. */
+ Oid tableOid;
+
+ /*
+ * Find, lock, and check permissions on the table. We obtain
+ * AccessExclusiveLock right away to avoid lock-upgrade hazard in the
+ * single-transaction case.
+ */
+ tableOid = RangeVarGetRelidExtended(relation,
+ AccessExclusiveLock,
+ 0,
+ RangeVarCallbackMaintainsTable,
+ NULL);
+ rel = table_open(tableOid, NoLock);
+
+ /*
+ * Reject clustering a remote temp table ... their local buffer manager is
+ * not going to cope.
+ */
+ if (RELATION_IS_OTHER_TEMP(rel))
+ {
+ if (cmd == CLUSTER_COMMAND_CLUSTER)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot cluster temporary tables of other sessions")));
+ else
+ {
+ Assert(cmd == CLUSTER_COMMAND_REPACK);
+
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot repack temporary tables of other sessions")));
+ }
+ }
+
+ if (indexname == NULL && cmd == CLUSTER_COMMAND_CLUSTER)
+ {
+ ListCell *index;
+
+ /* We need to find the index that has indisclustered set. */
+ foreach(index, RelationGetIndexList(rel))
+ {
+ indexOid = lfirst_oid(index);
+ if (get_index_isclustered(indexOid))
+ break;
+ indexOid = InvalidOid;
+ }
+
+ if (!OidIsValid(indexOid))
+ ereport(ERROR,
+ (errcode(ERRCODE_UNDEFINED_OBJECT),
+ errmsg("there is no previously clustered index for table \"%s\"",
+ relation->relname)));
+ }
+ else if (indexname != NULL)
+ {
+ /*
+ * The index is expected to be in the same namespace as the relation.
+ */
+ indexOid = get_relname_relid(indexname,
+ rel->rd_rel->relnamespace);
+ if (!OidIsValid(indexOid))
+ ereport(ERROR,
+ (errcode(ERRCODE_UNDEFINED_OBJECT),
+ errmsg("index \"%s\" for table \"%s\" does not exist",
+ indexname, relation->relname)));
+ }
+
+ *indexOid_p = indexOid;
+
+ /* For non-partitioned tables, do what we came here to do. */
+ if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+ {
+ cluster_rel(rel, indexOid, params, cmd);
+ /* cluster_rel closes the relation, but keeps lock */
+
+ return NULL;
+ }
+
+ return rel;
+}
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 733ef40ae7c..8685942505c 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -2287,7 +2287,8 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams params,
cluster_params.options |= CLUOPT_VERBOSE;
/* VACUUM FULL is now a variant of CLUSTER; see cluster.c */
- cluster_rel(rel, InvalidOid, &cluster_params);
+ cluster_rel(rel, InvalidOid, &cluster_params,
+ CLUSTER_COMMAND_VACUUM);
/* cluster_rel closes the relation, but keeps lock */
rel = NULL;
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 50f53159d58..15b2b5e93ce 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -297,7 +297,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
GrantStmt GrantRoleStmt ImportForeignSchemaStmt IndexStmt InsertStmt
ListenStmt LoadStmt LockStmt MergeStmt NotifyStmt ExplainableStmt PreparableStmt
CreateFunctionStmt AlterFunctionStmt ReindexStmt RemoveAggrStmt
- RemoveFuncStmt RemoveOperStmt RenameStmt ReturnStmt RevokeStmt RevokeRoleStmt
+ RemoveFuncStmt RemoveOperStmt RenameStmt RepackStmt ReturnStmt RevokeStmt RevokeRoleStmt
RuleActionStmt RuleActionStmtOrEmpty RuleStmt
SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
UnlistenStmt UpdateStmt VacuumStmt
@@ -380,11 +380,11 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
%type <str> copy_file_name
access_method_clause attr_name
table_access_method_clause name cursor_name file_name
- cluster_index_specification
+ cluster_index_specification repack_index_specification
%type <list> func_name handler_name qual_Op qual_all_Op subquery_Op
opt_inline_handler opt_validator validator_clause
- opt_collate
+ opt_collate opt_repack_args
%type <range> qualified_name insert_target OptConstrFromTable
@@ -763,7 +763,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
QUOTE QUOTES
RANGE READ REAL REASSIGN RECURSIVE REF_P REFERENCES REFERENCING
- REFRESH REINDEX RELATIVE_P RELEASE RENAME REPEATABLE REPLACE REPLICA
+ REFRESH REINDEX RELATIVE_P RELEASE RENAME REPACK REPEATABLE REPLACE REPLICA
RESET RESTART RESTRICT RETURN RETURNING RETURNS REVOKE RIGHT ROLE ROLLBACK ROLLUP
ROUTINE ROUTINES ROW ROWS RULE
@@ -1099,6 +1099,7 @@ stmt:
| RemoveFuncStmt
| RemoveOperStmt
| RenameStmt
+ | RepackStmt
| RevokeStmt
| RevokeRoleStmt
| RuleStmt
@@ -11890,6 +11891,48 @@ cluster_index_specification:
| /*EMPTY*/ { $$ = NULL; }
;
+/*****************************************************************************
+ *
+ * QUERY:
+ * REPACK [ (options) ] [ <qualified_name> [ USING INDEX <index_name> ] ]
+ *
+ *****************************************************************************/
+
+RepackStmt:
+ REPACK opt_repack_args
+ {
+ RepackStmt *n = makeNode(RepackStmt);
+
+ n->relation = $2 ? (RangeVar *) linitial($2) : NULL;
+ n->indexname = $2 ? (char *) lsecond($2) : NULL;
+ n->params = NIL;
+ $$ = (Node *) n;
+ }
+
+ | REPACK '(' utility_option_list ')' opt_repack_args
+ {
+ RepackStmt *n = makeNode(RepackStmt);
+
+ n->relation = $5 ? (RangeVar *) linitial($5) : NULL;
+ n->indexname = $5 ? (char *) lsecond($5) : NULL;
+ n->params = $3;
+ $$ = (Node *) n;
+ }
+ ;
+
+opt_repack_args:
+ qualified_name repack_index_specification
+ {
+ $$ = list_make2($1, $2);
+ }
+ | /*EMPTY*/ { $$ = NIL; }
+ ;
+
+repack_index_specification:
+ ExistingIndex
+ | /*EMPTY*/ { $$ = NULL; }
+ ;
+
/*****************************************************************************
*
@@ -17907,6 +17950,7 @@ unreserved_keyword:
| RELATIVE_P
| RELEASE
| RENAME
+ | REPACK
| REPEATABLE
| REPLACE
| REPLICA
@@ -18539,6 +18583,7 @@ bare_label_keyword:
| RELATIVE_P
| RELEASE
| RENAME
+ | REPACK
| REPEATABLE
| REPLACE
| REPLICA
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 25fe3d58016..6acdff4606f 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -280,6 +280,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_ClusterStmt:
case T_ReindexStmt:
case T_VacuumStmt:
+ case T_RepackStmt:
{
/*
* These commands write WAL, so they're not strictly
@@ -862,6 +863,10 @@ standard_ProcessUtility(PlannedStmt *pstmt,
ExecVacuum(pstate, (VacuumStmt *) parsetree, isTopLevel);
break;
+ case T_RepackStmt:
+ repack(pstate, (RepackStmt *) parsetree, isTopLevel);
+ break;
+
case T_ExplainStmt:
ExplainQuery(pstate, (ExplainStmt *) parsetree, params, dest);
break;
@@ -2869,6 +2874,10 @@ CreateCommandTag(Node *parsetree)
tag = CMDTAG_ANALYZE;
break;
+ case T_RepackStmt:
+ tag = CMDTAG_REPACK;
+ break;
+
case T_ExplainStmt:
tag = CMDTAG_EXPLAIN;
break;
@@ -3510,6 +3519,10 @@ GetCommandLogLevel(Node *parsetree)
lev = LOGSTMT_DDL;
break;
+ case T_RepackStmt:
+ lev = LOGSTMT_DDL;
+ break;
+
case T_VacuumStmt:
lev = LOGSTMT_ALL;
break;
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 1c12ddbae49..b2ad8ba45cd 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -268,6 +268,8 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
cmdtype = PROGRESS_COMMAND_ANALYZE;
else if (pg_strcasecmp(cmd, "CLUSTER") == 0)
cmdtype = PROGRESS_COMMAND_CLUSTER;
+ else if (pg_strcasecmp(cmd, "REPACK") == 0)
+ cmdtype = PROGRESS_COMMAND_REPACK;
else if (pg_strcasecmp(cmd, "CREATE INDEX") == 0)
cmdtype = PROGRESS_COMMAND_CREATE_INDEX;
else if (pg_strcasecmp(cmd, "BASEBACKUP") == 0)
diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index 8c2ea0b9587..2eee34cbfa3 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -1231,7 +1231,7 @@ static const char *const sql_commands[] = {
"DELETE FROM", "DISCARD", "DO", "DROP", "END", "EXECUTE", "EXPLAIN",
"FETCH", "GRANT", "IMPORT FOREIGN SCHEMA", "INSERT INTO", "LISTEN", "LOAD", "LOCK",
"MERGE INTO", "MOVE", "NOTIFY", "PREPARE",
- "REASSIGN", "REFRESH MATERIALIZED VIEW", "REINDEX", "RELEASE",
+ "REASSIGN", "REFRESH MATERIALIZED VIEW", "REINDEX", "RELEASE", "REPACK",
"RESET", "REVOKE", "ROLLBACK",
"SAVEPOINT", "SECURITY LABEL", "SELECT", "SET", "SHOW", "START",
"TABLE", "TRUNCATE", "UNLISTEN", "UPDATE", "VACUUM", "VALUES", "WITH",
@@ -4928,6 +4928,37 @@ match_previous_words(int pattern_id,
COMPLETE_WITH_QUERY(Query_for_list_of_tablespaces);
}
+/* REPACK */
+ else if (Matches("REPACK"))
+ COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_clusterables);
+ else if (Matches("REPACK", "(*)"))
+ COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_clusterables);
+ /* If we have REPACK <sth>, then add "USING INDEX" */
+ else if (Matches("REPACK", MatchAnyExcept("(")))
+ COMPLETE_WITH("USING INDEX");
+ /* If we have REPACK (*) <sth>, then add "USING INDEX" */
+ else if (Matches("REPACK", "(*)", MatchAny))
+ COMPLETE_WITH("USING INDEX");
+ /* If we have REPACK <sth> USING, then add the index as well */
+ else if (Matches("REPACK", MatchAny, "USING", "INDEX"))
+ {
+ set_completion_reference(prev3_wd);
+ COMPLETE_WITH_SCHEMA_QUERY(Query_for_index_of_table);
+ }
+ else if (HeadMatches("REPACK", "(*") &&
+ !HeadMatches("REPACK", "(*)"))
+ {
+ /*
+ * This fires if we're in an unfinished parenthesized option list.
+ * get_previous_words treats a completed parenthesized option list as
+ * one word, so the above test is correct.
+ */
+ if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
+ COMPLETE_WITH("VERBOSE");
+ else if (TailMatches("VERBOSE"))
+ COMPLETE_WITH("ON", "OFF");
+ }
+
/* SECURITY LABEL */
else if (Matches("SECURITY"))
COMPLETE_WITH("LABEL");
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 60088a64cbb..3be57c97b3f 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -31,8 +31,24 @@ typedef struct ClusterParams
bits32 options; /* bitmask of CLUOPT_* */
} ClusterParams;
+/*
+ * cluster.c currently implements three nearly identical commands: CLUSTER,
+ * VACUUM FULL and REPACK. Where needed, use this enumeration to distinguish
+ * which of these commands is being executed.
+ *
+ * Remove this stuff when removing the (now deprecated) CLUSTER and VACUUM
+ * FULL commands.
+ */
+typedef enum ClusterCommand
+{
+ CLUSTER_COMMAND_CLUSTER,
+ CLUSTER_COMMAND_REPACK,
+ CLUSTER_COMMAND_VACUUM
+} ClusterCommand;
+
extern void cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel);
-extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params);
+extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
+ ClusterCommand cmd);
extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid,
LOCKMODE lockmode);
extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
@@ -48,4 +64,5 @@ extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
MultiXactId cutoffMulti,
char newrelpersistence);
+extern void repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel);
#endif /* CLUSTER_H */
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 7c736e7b03b..f92ff524031 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -56,24 +56,55 @@
#define PROGRESS_ANALYZE_PHASE_COMPUTE_EXT_STATS 4
#define PROGRESS_ANALYZE_PHASE_FINALIZE_ANALYZE 5
-/* Progress parameters for cluster */
-#define PROGRESS_CLUSTER_COMMAND 0
-#define PROGRESS_CLUSTER_PHASE 1
-#define PROGRESS_CLUSTER_INDEX_RELID 2
-#define PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED 3
-#define PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN 4
-#define PROGRESS_CLUSTER_TOTAL_HEAP_BLKS 5
-#define PROGRESS_CLUSTER_HEAP_BLKS_SCANNED 6
-#define PROGRESS_CLUSTER_INDEX_REBUILD_COUNT 7
-
-/* Phases of cluster (as advertised via PROGRESS_CLUSTER_PHASE) */
-#define PROGRESS_CLUSTER_PHASE_SEQ_SCAN_HEAP 1
-#define PROGRESS_CLUSTER_PHASE_INDEX_SCAN_HEAP 2
-#define PROGRESS_CLUSTER_PHASE_SORT_TUPLES 3
-#define PROGRESS_CLUSTER_PHASE_WRITE_NEW_HEAP 4
-#define PROGRESS_CLUSTER_PHASE_SWAP_REL_FILES 5
-#define PROGRESS_CLUSTER_PHASE_REBUILD_INDEX 6
-#define PROGRESS_CLUSTER_PHASE_FINAL_CLEANUP 7
+/*
+ * Progress parameters for REPACK.
+ *
+ * Note: Since REPACK shares some code with CLUSTER, these values are also
+ * used by CLUSTER. (CLUSTER is now deprecated, so it makes little sense to
+ * introduce a separate set of constants.)
+ */
+#define PROGRESS_REPACK_COMMAND 0
+#define PROGRESS_REPACK_PHASE 1
+#define PROGRESS_REPACK_INDEX_RELID 2
+#define PROGRESS_REPACK_HEAP_TUPLES_SCANNED 3
+#define PROGRESS_REPACK_HEAP_TUPLES_WRITTEN 4
+#define PROGRESS_REPACK_TOTAL_HEAP_BLKS 5
+#define PROGRESS_REPACK_HEAP_BLKS_SCANNED 6
+#define PROGRESS_REPACK_INDEX_REBUILD_COUNT 7
+
+/*
+ * Phases of repack (as advertised via PROGRESS_REPACK_PHASE).
+ *
+ * Note: Since REPACK shares some code with CLUSTER, (some of) these values
+ * are also used by CLUSTER. (CLUSTER is now deprecated, so it makes no sense
+ * to introduce a separate set of constants.)
+ */
+#define PROGRESS_REPACK_PHASE_SEQ_SCAN_HEAP 1
+#define PROGRESS_REPACK_PHASE_INDEX_SCAN_HEAP 2
+#define PROGRESS_REPACK_PHASE_SORT_TUPLES 3
+#define PROGRESS_REPACK_PHASE_WRITE_NEW_HEAP 4
+#define PROGRESS_REPACK_PHASE_SWAP_REL_FILES 5
+#define PROGRESS_REPACK_PHASE_REBUILD_INDEX 6
+#define PROGRESS_REPACK_PHASE_FINAL_CLEANUP 7
+
+/*
+ * Commands of PROGRESS_REPACK
+ *
+ * Currently we only have one command, so the PROGRESS_REPACK_COMMAND
+ * parameter is not necessary. However it makes cluster.c simpler if we have
+ * the same set of parameters for CLUSTER and REPACK - see the note on REPACK
+ * parameters above.
+ */
+#define PROGRESS_REPACK_COMMAND_REPACK 1
+
+/*
+ * Progress parameters for cluster.
+ *
+ * Although we need to report REPACK and CLUSTER in separate views, the
+ * parameters and phases of CLUSTER are a subset of those of REPACK. Therefore
+ * we just use the appropriate values defined for REPACK above instead of
+ * defining a separate set of constants here.
+ */
/* Commands of PROGRESS_CLUSTER */
#define PROGRESS_CLUSTER_COMMAND_CLUSTER 1
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index ba12678d1cb..52584bd8dbf 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3928,6 +3928,19 @@ typedef struct ClusterStmt
List *params; /* list of DefElem nodes */
} ClusterStmt;
+/* ----------------------
+ * Repack Statement
+ * ----------------------
+ */
+typedef struct RepackStmt
+{
+ NodeTag type;
+ RangeVar *relation; /* relation being repacked */
+ char *indexname; /* order tuples by this index */
+ List *params; /* list of DefElem nodes */
+} RepackStmt;
+
+
/* ----------------------
* Vacuum and Analyze Statements
*
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index a4af3f717a1..22559369e2c 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -374,6 +374,7 @@ PG_KEYWORD("reindex", REINDEX, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("relative", RELATIVE_P, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("release", RELEASE, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("rename", RENAME, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("repack", REPACK, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("repeatable", REPEATABLE, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("replace", REPLACE, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("replica", REPLICA, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..cceb312f2b3 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -196,6 +196,7 @@ PG_CMDTAG(CMDTAG_REASSIGN_OWNED, "REASSIGN OWNED", false, false, false)
PG_CMDTAG(CMDTAG_REFRESH_MATERIALIZED_VIEW, "REFRESH MATERIALIZED VIEW", true, false, false)
PG_CMDTAG(CMDTAG_REINDEX, "REINDEX", true, false, false)
PG_CMDTAG(CMDTAG_RELEASE, "RELEASE", false, false, false)
+PG_CMDTAG(CMDTAG_REPACK, "REPACK", false, false, false)
PG_CMDTAG(CMDTAG_RESET, "RESET", false, false, false)
PG_CMDTAG(CMDTAG_REVOKE, "REVOKE", true, false, false)
PG_CMDTAG(CMDTAG_REVOKE_ROLE, "REVOKE ROLE", false, false, false)
diff --git a/src/include/utils/backend_progress.h b/src/include/utils/backend_progress.h
index dda813ab407..e69e366dcdc 100644
--- a/src/include/utils/backend_progress.h
+++ b/src/include/utils/backend_progress.h
@@ -28,6 +28,7 @@ typedef enum ProgressCommandType
PROGRESS_COMMAND_CREATE_INDEX,
PROGRESS_COMMAND_BASEBACKUP,
PROGRESS_COMMAND_COPY,
+ PROGRESS_COMMAND_REPACK,
} ProgressCommandType;
#define PGSTAT_NUM_PROGRESS_PARAM 20
diff --git a/src/test/regress/expected/cluster.out b/src/test/regress/expected/cluster.out
index 4d40a6809ab..e9fd7512710 100644
--- a/src/test/regress/expected/cluster.out
+++ b/src/test/regress/expected/cluster.out
@@ -254,6 +254,63 @@ ORDER BY 1;
clstr_tst_pkey
(3 rows)
+-- REPACK handles individual tables identically to CLUSTER, but it's worth
+-- checking if it handles table hierarchies identically as well.
+REPACK clstr_tst USING INDEX clstr_tst_c;
+-- Verify that inheritance link still works
+INSERT INTO clstr_tst_inh VALUES (0, 100, 'in child table 2');
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst;
+ a | b | c | substring | length
+----+-----+------------------+--------------------------------+--------
+ 10 | 14 | catorce | |
+ 18 | 5 | cinco | |
+ 9 | 4 | cuatro | |
+ 26 | 19 | diecinueve | |
+ 12 | 18 | dieciocho | |
+ 30 | 16 | dieciseis | |
+ 24 | 17 | diecisiete | |
+ 2 | 10 | diez | |
+ 23 | 12 | doce | |
+ 11 | 2 | dos | |
+ 25 | 9 | nueve | |
+ 31 | 8 | ocho | |
+ 1 | 11 | once | |
+ 28 | 15 | quince | |
+ 32 | 6 | seis | xyzzyxyzzyxyzzyxyzzyxyzzyxyzzy | 500000
+ 29 | 7 | siete | |
+ 15 | 13 | trece | |
+ 22 | 30 | treinta | |
+ 17 | 32 | treinta y dos | |
+ 3 | 31 | treinta y uno | |
+ 5 | 3 | tres | |
+ 20 | 1 | uno | |
+ 6 | 20 | veinte | |
+ 14 | 25 | veinticinco | |
+ 21 | 24 | veinticuatro | |
+ 4 | 22 | veintidos | |
+ 19 | 29 | veintinueve | |
+ 16 | 28 | veintiocho | |
+ 27 | 26 | veintiseis | |
+ 13 | 27 | veintisiete | |
+ 7 | 23 | veintitres | |
+ 8 | 21 | veintiuno | |
+ 0 | 100 | in child table | |
+ 0 | 100 | in child table 2 | |
+(34 rows)
+
+-- Verify that foreign key link still works
+INSERT INTO clstr_tst (b, c) VALUES (1111, 'this should fail');
+ERROR: insert or update on table "clstr_tst" violates foreign key constraint "clstr_tst_con"
+DETAIL: Key (b)=(1111) is not present in table "clstr_tst_s".
+SELECT conname FROM pg_constraint WHERE conrelid = 'clstr_tst'::regclass
+ORDER BY 1;
+ conname
+----------------------
+ clstr_tst_a_not_null
+ clstr_tst_con
+ clstr_tst_pkey
+(3 rows)
+
SELECT relname, relkind,
EXISTS(SELECT 1 FROM pg_class WHERE oid = c.reltoastrelid) AS hastoast
FROM pg_class c WHERE relname LIKE 'clstr_tst%' ORDER BY relname;
@@ -381,6 +438,35 @@ SELECT * FROM clstr_1;
2
(2 rows)
+-- REPACK w/o argument performs no ordering, so we can only check which tables
+-- have the relfilenode changed.
+RESET SESSION AUTHORIZATION;
+CREATE TEMP TABLE relnodes_old AS
+(SELECT relname, relfilenode
+FROM pg_class
+WHERE relname IN ('clstr_1', 'clstr_2', 'clstr_3'));
+SET SESSION AUTHORIZATION regress_clstr_user;
+SET client_min_messages = ERROR; -- order of "skipping" warnings may vary
+REPACK;
+RESET client_min_messages;
+RESET SESSION AUTHORIZATION;
+CREATE TEMP TABLE relnodes_new AS
+(SELECT relname, relfilenode
+FROM pg_class
+WHERE relname IN ('clstr_1', 'clstr_2', 'clstr_3'));
+-- Do the actual comparison. Unlike CLUSTER, clstr_3 should have been
+-- processed because there is nothing like clustering index here.
+SELECT o.relname FROM relnodes_old o
+JOIN relnodes_new n ON o.relname = n.relname
+WHERE o.relfilenode <> n.relfilenode
+ORDER BY o.relname;
+ relname
+---------
+ clstr_1
+ clstr_3
+(2 rows)
+
+SET SESSION AUTHORIZATION regress_clstr_user;
-- Test MVCC-safety of cluster. There isn't much we can do to verify the
-- results with a single backend...
CREATE TABLE clustertest (key int PRIMARY KEY);
@@ -495,6 +581,43 @@ ALTER TABLE clstrpart SET WITHOUT CLUSTER;
ERROR: cannot mark index clustered in partitioned table
ALTER TABLE clstrpart CLUSTER ON clstrpart_idx;
ERROR: cannot mark index clustered in partitioned table
+-- Check that REPACK sets new relfilenodes: it should process exactly the same
+-- tables as CLUSTER did.
+DROP TABLE old_cluster_info;
+DROP TABLE new_cluster_info;
+CREATE TEMP TABLE old_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+REPACK clstrpart USING INDEX clstrpart_idx;
+CREATE TEMP TABLE new_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM old_cluster_info AS old JOIN new_cluster_info AS new USING (relname) ORDER BY relname COLLATE "C";
+ relname | level | relkind | ?column?
+-------------+-------+---------+----------
+ clstrpart | 0 | p | t
+ clstrpart1 | 1 | p | t
+ clstrpart11 | 2 | r | f
+ clstrpart12 | 2 | p | t
+ clstrpart2 | 1 | r | f
+ clstrpart3 | 1 | p | t
+ clstrpart33 | 2 | r | f
+(7 rows)
+
+-- And finally the same for REPACK w/o index.
+DROP TABLE old_cluster_info;
+DROP TABLE new_cluster_info;
+CREATE TEMP TABLE old_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+REPACK clstrpart;
+CREATE TEMP TABLE new_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM old_cluster_info AS old JOIN new_cluster_info AS new USING (relname) ORDER BY relname COLLATE "C";
+ relname | level | relkind | ?column?
+-------------+-------+---------+----------
+ clstrpart | 0 | p | t
+ clstrpart1 | 1 | p | t
+ clstrpart11 | 2 | r | f
+ clstrpart12 | 2 | p | t
+ clstrpart2 | 1 | r | f
+ clstrpart3 | 1 | p | t
+ clstrpart33 | 2 | r | f
+(7 rows)
+
DROP TABLE clstrpart;
-- Ownership of partitions is checked
CREATE TABLE ptnowner(i int unique) PARTITION BY LIST (i);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 6cf828ca8d0..328235044d9 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2062,6 +2062,29 @@ pg_stat_progress_create_index| SELECT s.pid,
s.param15 AS partitions_done
FROM (pg_stat_get_progress_info('CREATE INDEX'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
+pg_stat_progress_repack| SELECT s.pid,
+ s.datid,
+ d.datname,
+ s.relid,
+ CASE s.param2
+ WHEN 0 THEN 'initializing'::text
+ WHEN 1 THEN 'seq scanning heap'::text
+ WHEN 2 THEN 'index scanning heap'::text
+ WHEN 3 THEN 'sorting tuples'::text
+ WHEN 4 THEN 'writing new heap'::text
+ WHEN 5 THEN 'swapping relation files'::text
+ WHEN 6 THEN 'rebuilding index'::text
+ WHEN 7 THEN 'performing final cleanup'::text
+ ELSE NULL::text
+ END AS phase,
+ (s.param3)::oid AS repack_index_relid,
+ s.param4 AS heap_tuples_scanned,
+ s.param5 AS heap_tuples_written,
+ s.param6 AS heap_blks_total,
+ s.param7 AS heap_blks_scanned,
+ s.param8 AS index_rebuild_count
+ FROM (pg_stat_get_progress_info('REPACK'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
+ LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_progress_vacuum| SELECT s.pid,
s.datid,
d.datname,
diff --git a/src/test/regress/sql/cluster.sql b/src/test/regress/sql/cluster.sql
index b7115f86104..cfcc3dc9761 100644
--- a/src/test/regress/sql/cluster.sql
+++ b/src/test/regress/sql/cluster.sql
@@ -76,6 +76,19 @@ INSERT INTO clstr_tst (b, c) VALUES (1111, 'this should fail');
SELECT conname FROM pg_constraint WHERE conrelid = 'clstr_tst'::regclass
ORDER BY 1;
+-- REPACK handles individual tables identically to CLUSTER, but it's worth
+-- checking if it handles table hierarchies identically as well.
+REPACK clstr_tst USING INDEX clstr_tst_c;
+
+-- Verify that inheritance link still works
+INSERT INTO clstr_tst_inh VALUES (0, 100, 'in child table 2');
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst;
+
+-- Verify that foreign key link still works
+INSERT INTO clstr_tst (b, c) VALUES (1111, 'this should fail');
+
+SELECT conname FROM pg_constraint WHERE conrelid = 'clstr_tst'::regclass
+ORDER BY 1;
SELECT relname, relkind,
EXISTS(SELECT 1 FROM pg_class WHERE oid = c.reltoastrelid) AS hastoast
@@ -159,6 +172,34 @@ INSERT INTO clstr_1 VALUES (1);
CLUSTER clstr_1;
SELECT * FROM clstr_1;
+-- REPACK w/o argument performs no ordering, so we can only check which tables
+-- have the relfilenode changed.
+RESET SESSION AUTHORIZATION;
+CREATE TEMP TABLE relnodes_old AS
+(SELECT relname, relfilenode
+FROM pg_class
+WHERE relname IN ('clstr_1', 'clstr_2', 'clstr_3'));
+
+SET SESSION AUTHORIZATION regress_clstr_user;
+SET client_min_messages = ERROR; -- order of "skipping" warnings may vary
+REPACK;
+RESET client_min_messages;
+
+RESET SESSION AUTHORIZATION;
+CREATE TEMP TABLE relnodes_new AS
+(SELECT relname, relfilenode
+FROM pg_class
+WHERE relname IN ('clstr_1', 'clstr_2', 'clstr_3'));
+
+-- Do the actual comparison. Unlike CLUSTER, clstr_3 should have been
+-- processed because there is nothing like clustering index here.
+SELECT o.relname FROM relnodes_old o
+JOIN relnodes_new n ON o.relname = n.relname
+WHERE o.relfilenode <> n.relfilenode
+ORDER BY o.relname;
+
+SET SESSION AUTHORIZATION regress_clstr_user;
+
-- Test MVCC-safety of cluster. There isn't much we can do to verify the
-- results with a single backend...
@@ -229,6 +270,24 @@ SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM o
CLUSTER clstrpart;
ALTER TABLE clstrpart SET WITHOUT CLUSTER;
ALTER TABLE clstrpart CLUSTER ON clstrpart_idx;
+
+-- Check that REPACK sets new relfilenodes: it should process exactly the same
+-- tables as CLUSTER did.
+DROP TABLE old_cluster_info;
+DROP TABLE new_cluster_info;
+CREATE TEMP TABLE old_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+REPACK clstrpart USING INDEX clstrpart_idx;
+CREATE TEMP TABLE new_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM old_cluster_info AS old JOIN new_cluster_info AS new USING (relname) ORDER BY relname COLLATE "C";
+
+-- And finally the same for REPACK w/o index.
+DROP TABLE old_cluster_info;
+DROP TABLE new_cluster_info;
+CREATE TEMP TABLE old_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+REPACK clstrpart;
+CREATE TEMP TABLE new_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM old_cluster_info AS old JOIN new_cluster_info AS new USING (relname) ORDER BY relname COLLATE "C";
+
DROP TABLE clstrpart;
-- Ownership of partitions is checked
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 32d6e718adc..255d0e76520 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -427,6 +427,7 @@ ClientCertName
ClientConnectionInfo
ClientData
ClientSocket
+ClusterCommand
ClonePtrType
ClosePortalStmt
ClosePtrType
@@ -2528,6 +2529,7 @@ ReorderBufferTupleCidKey
ReorderBufferUpdateProgressTxnCB
ReorderTuple
RepOriginId
+RepackStmt
ReparameterizeForeignPathByChild_function
ReplaceVarsFromTargetList_context
ReplaceVarsNoMatchOption
--
2.47.1
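For anyone trying the patch, the new grammar added to gram.y accepts invocations like the following (the table and index names are illustrative, not from the patch):

```sql
-- Repack every table the current user is allowed to maintain
REPACK;

-- Repack one table without imposing any ordering
REPACK accounts;

-- Repack a table, ordering tuples by an index (CLUSTER-style)
REPACK accounts USING INDEX accounts_pkey;

-- With a parenthesized option list, as accepted by utility_option_list
REPACK (VERBOSE) accounts USING INDEX accounts_pkey;
```

While one of these runs, progress should be visible in the new pg_stat_progress_repack view defined in the rules.out changes above.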
Attachment: v15-0002-Move-conversion-of-a-historic-to-MVCC-snapshot-to-a-.patch (text/x-diff)
From ee22df2bcf21e585dc8f4c37da2ddf2de6059741 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Mon, 30 Jun 2025 19:41:42 +0200
Subject: [PATCH 2/7] Move conversion of a "historic" to MVCC snapshot to a
separate function.
The conversion is now handled by SnapBuildMVCCFromHistoric(). REPACK
CONCURRENTLY will also need it.
---
src/backend/replication/logical/snapbuild.c | 51 +++++++++++++++++----
src/backend/utils/time/snapmgr.c | 3 +-
src/include/replication/snapbuild.h | 1 +
src/include/utils/snapmgr.h | 1 +
4 files changed, 45 insertions(+), 11 deletions(-)
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index adf18c397db..270f37ecadb 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -440,10 +440,7 @@ Snapshot
SnapBuildInitialSnapshot(SnapBuild *builder)
{
Snapshot snap;
- TransactionId xid;
TransactionId safeXid;
- TransactionId *newxip;
- int newxcnt = 0;
Assert(XactIsoLevel == XACT_REPEATABLE_READ);
Assert(builder->building_full_snapshot);
@@ -485,6 +482,31 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
MyProc->xmin = snap->xmin;
+ /* Convert the historic snapshot to MVCC snapshot. */
+ return SnapBuildMVCCFromHistoric(snap, true);
+}
+
+/*
+ * Turn a historic MVCC snapshot into an ordinary MVCC snapshot.
+ *
+ * Unlike a regular (non-historic) MVCC snapshot, the xip array of this
+ * snapshot contains not only running main transactions, but also their
+ * subtransactions. This difference has no impact on XidInMVCCSnapshot().
+ *
+ * Pass true for 'in_place' if you don't care about modifying the source
+ * snapshot. If you need a new instance, and one that was allocated as a
+ * single chunk of memory, pass false.
+ */
+Snapshot
+SnapBuildMVCCFromHistoric(Snapshot snapshot, bool in_place)
+{
+ TransactionId xid;
+ TransactionId *oldxip = snapshot->xip;
+ uint32 oldxcnt = snapshot->xcnt;
+ TransactionId *newxip;
+ int newxcnt = 0;
+ Snapshot result;
+
/* allocate in transaction context */
newxip = (TransactionId *)
palloc(sizeof(TransactionId) * GetMaxSnapshotXidCount());
@@ -495,7 +517,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
* classical snapshot by marking all non-committed transactions as
* in-progress. This can be expensive.
*/
- for (xid = snap->xmin; NormalTransactionIdPrecedes(xid, snap->xmax);)
+ for (xid = snapshot->xmin; NormalTransactionIdPrecedes(xid, snapshot->xmax);)
{
void *test;
@@ -503,7 +525,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
* Check whether transaction committed using the decoding snapshot
* meaning of ->xip.
*/
- test = bsearch(&xid, snap->xip, snap->xcnt,
+ test = bsearch(&xid, snapshot->xip, snapshot->xcnt,
sizeof(TransactionId), xidComparator);
if (test == NULL)
@@ -520,11 +542,22 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
}
/* adjust remaining snapshot fields as needed */
- snap->snapshot_type = SNAPSHOT_MVCC;
- snap->xcnt = newxcnt;
- snap->xip = newxip;
+ snapshot->xcnt = newxcnt;
+ snapshot->xip = newxip;
+
+ if (in_place)
+ result = snapshot;
+ else
+ {
+ result = CopySnapshot(snapshot);
+
+ /* Restore the original values so the source is intact. */
+ snapshot->xip = oldxip;
+ snapshot->xcnt = oldxcnt;
+ }
+ result->snapshot_type = SNAPSHOT_MVCC;
- return snap;
+ return result;
}
/*
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index ea35f30f494..70a6b8902d1 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -212,7 +212,6 @@ typedef struct ExportedSnapshot
static List *exportedSnapshots = NIL;
/* Prototypes for local functions */
-static Snapshot CopySnapshot(Snapshot snapshot);
static void UnregisterSnapshotNoOwner(Snapshot snapshot);
static void FreeSnapshot(Snapshot snapshot);
static void SnapshotResetXmin(void);
@@ -591,7 +590,7 @@ SetTransactionSnapshot(Snapshot sourcesnap, VirtualTransactionId *sourcevxid,
* The copy is palloc'd in TopTransactionContext and has initial refcounts set
* to 0. The returned snapshot has the copied flag set.
*/
-static Snapshot
+Snapshot
CopySnapshot(Snapshot snapshot)
{
Snapshot newsnap;
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 44031dcf6e3..6d4d2d1814c 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -73,6 +73,7 @@ extern void FreeSnapshotBuilder(SnapBuild *builder);
extern void SnapBuildSnapDecRefcount(Snapshot snap);
extern Snapshot SnapBuildInitialSnapshot(SnapBuild *builder);
+extern Snapshot SnapBuildMVCCFromHistoric(Snapshot snapshot, bool in_place);
extern const char *SnapBuildExportSnapshot(SnapBuild *builder);
extern void SnapBuildClearExportedSnapshot(void);
extern void SnapBuildResetExportedSnapshotState(void);
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index d346be71642..147b190210a 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -60,6 +60,7 @@ extern Snapshot GetTransactionSnapshot(void);
extern Snapshot GetLatestSnapshot(void);
extern void SnapshotSetCommandId(CommandId curcid);
+extern Snapshot CopySnapshot(Snapshot snapshot);
extern Snapshot GetCatalogSnapshot(Oid relid);
extern Snapshot GetNonHistoricCatalogSnapshot(Oid relid);
extern void InvalidateCatalogSnapshot(void);
--
2.47.1
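The hunk above converts a historic snapshot to an MVCC one either in place or via `CopySnapshot()`, restoring the source's `xip`/`xcnt` when a copy is requested. A minimal standalone sketch of that in_place-vs-copy contract (note `MiniSnap`, `mini_copy` and `mini_to_mvcc` are invented illustrative names, not PostgreSQL APIs):

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for the Snapshot fields touched by the patch. */
typedef struct MiniSnap
{
	bool	 is_mvcc;	/* stands in for snapshot_type */
	int		 xcnt;
	int		*xip;
} MiniSnap;

/* Deep copy, analogous to CopySnapshot(). */
static MiniSnap *
mini_copy(const MiniSnap *src)
{
	MiniSnap   *dst = malloc(sizeof(MiniSnap));

	dst->is_mvcc = src->is_mvcc;
	dst->xcnt = src->xcnt;
	dst->xip = malloc(sizeof(int) * src->xcnt);
	memcpy(dst->xip, src->xip, sizeof(int) * src->xcnt);
	return dst;
}

/*
 * Convert the snapshot by installing a new xip array.  With in_place the
 * source itself is modified and returned; otherwise a converted copy is
 * returned and the source keeps its original fields, mirroring how the
 * patch restores oldxip/oldxcnt before returning.
 */
static MiniSnap *
mini_to_mvcc(MiniSnap *snap, int *newxip, int newxcnt, bool in_place)
{
	int		   *oldxip = snap->xip;
	int			oldxcnt = snap->xcnt;
	MiniSnap   *result;

	snap->xip = newxip;
	snap->xcnt = newxcnt;

	if (in_place)
		result = snap;
	else
	{
		result = mini_copy(snap);

		/* Restore the original values so the source is intact. */
		snap->xip = oldxip;
		snap->xcnt = oldxcnt;
	}
	result->is_mvcc = true;
	return result;
}
```

Either way, exactly one snapshot ends up flagged MVCC; the copy path is what lets the caller keep using the historic snapshot afterwards.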
Attachment: v15-0003-Move-the-recheck-branch-to-a-separate-function.patch (text/x-diff)
From c21c88114f9740d8f5db864e87a29d27bf436f3b Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Mon, 30 Jun 2025 19:41:43 +0200
Subject: [PATCH 3/7] Move the "recheck" branch to a separate function.
At some point I thought that the relation must be unlocked during the call of
setup_logical_decoding(), to avoid a deadlock. In that case we'd need to
recheck afterwards if the table still meets the requirements of cluster_rel().
Eventually I concluded that the risk of that deadlock is not that high, so the
table stays locked during the call of setup_logical_decoding(). Therefore the
rechecking code is only executed once per table. Even so, this patch should
improve code readability.
---
src/backend/commands/cluster.c | 108 +++++++++++++++++++--------------
1 file changed, 62 insertions(+), 46 deletions(-)
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 5e94b570431..57ae5d561fd 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -69,6 +69,8 @@ typedef struct
static void cluster_multiple_rels(List *rtcs, ClusterParams *params,
ClusterCommand cmd);
+static bool cluster_rel_recheck(Relation OldHeap, Oid indexOid, Oid userid,
+ int options);
static void rebuild_relation(Relation OldHeap, Relation index, bool verbose,
ClusterCommand cmd);
static void copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
@@ -317,53 +319,9 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
* to cluster a not-previously-clustered index.
*/
if (recheck)
- {
- /* Check that the user still has privileges for the relation */
- if (!cluster_is_permitted_for_relation(tableOid, save_userid,
- CLUSTER_COMMAND_CLUSTER))
- {
- relation_close(OldHeap, AccessExclusiveLock);
+ if (!cluster_rel_recheck(OldHeap, indexOid, save_userid,
+ params->options))
goto out;
- }
-
- /*
- * Silently skip a temp table for a remote session. Only doing this
- * check in the "recheck" case is appropriate (which currently means
- * somebody is executing a database-wide CLUSTER or on a partitioned
- * table), because there is another check in cluster() which will stop
- * any attempt to cluster remote temp tables by name. There is
- * another check in cluster_rel which is redundant, but we leave it
- * for extra safety.
- */
- if (RELATION_IS_OTHER_TEMP(OldHeap))
- {
- relation_close(OldHeap, AccessExclusiveLock);
- goto out;
- }
-
- if (OidIsValid(indexOid))
- {
- /*
- * Check that the index still exists
- */
- if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(indexOid)))
- {
- relation_close(OldHeap, AccessExclusiveLock);
- goto out;
- }
-
- /*
- * Check that the index is still the one with indisclustered set,
- * if needed.
- */
- if ((params->options & CLUOPT_RECHECK_ISCLUSTERED) != 0 &&
- !get_index_isclustered(indexOid))
- {
- relation_close(OldHeap, AccessExclusiveLock);
- goto out;
- }
- }
- }
/*
* We allow VACUUM FULL, but not CLUSTER, on shared catalogs. CLUSTER
@@ -465,6 +423,64 @@ out:
pgstat_progress_end_command();
}
+/*
+ * Check if the table (and its index) still meets the requirements of
+ * cluster_rel().
+ */
+static bool
+cluster_rel_recheck(Relation OldHeap, Oid indexOid, Oid userid,
+ int options)
+{
+ Oid tableOid = RelationGetRelid(OldHeap);
+
+ /* Check that the user still has privileges for the relation */
+ if (!cluster_is_permitted_for_relation(tableOid, userid,
+ CLUSTER_COMMAND_CLUSTER))
+ {
+ relation_close(OldHeap, AccessExclusiveLock);
+ return false;
+ }
+
+ /*
+ * Silently skip a temp table for a remote session. Only doing this check
+ * in the "recheck" case is appropriate (which currently means somebody is
+ * executing a database-wide CLUSTER or on a partitioned table), because
+ * there is another check in cluster() which will stop any attempt to
+ * cluster remote temp tables by name. There is another check in
+ * cluster_rel which is redundant, but we leave it for extra safety.
+ */
+ if (RELATION_IS_OTHER_TEMP(OldHeap))
+ {
+ relation_close(OldHeap, AccessExclusiveLock);
+ return false;
+ }
+
+ if (OidIsValid(indexOid))
+ {
+ /*
+ * Check that the index still exists
+ */
+ if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(indexOid)))
+ {
+ relation_close(OldHeap, AccessExclusiveLock);
+ return false;
+ }
+
+ /*
+ * Check that the index is still the one with indisclustered set, if
+ * needed.
+ */
+ if ((options & CLUOPT_RECHECK_ISCLUSTERED) != 0 &&
+ !get_index_isclustered(indexOid))
+ {
+ relation_close(OldHeap, AccessExclusiveLock);
+ return false;
+ }
+ }
+
+ return true;
+}
+
/*
* Verify that the specified heap and index are valid to cluster on
*
--
2.47.1
Attachment: v15-0004-Add-CONCURRENTLY-option-to-REPACK-command.patch (text/plain)
From a6ad4211b927fdcaeb83a244d595d9dda4579a9b Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Mon, 30 Jun 2025 19:41:43 +0200
Subject: [PATCH 4/7] Add CONCURRENTLY option to REPACK command.
The REPACK command copies the relation data into a new file, creates new
indexes and eventually swaps the files. To make sure that the old file does
not change during the copying, the relation is locked in an exclusive mode,
which prevents applications from both reading and writing. (To keep the data
consistent, we'd only need to prevent the applications from writing, but even
reading needs to be blocked before we can swap the files - otherwise some
applications could continue using the old file. Since we should not request a
stronger lock without releasing the weaker one first, we acquire the exclusive
lock in the beginning and keep it till the end of the processing.)
This patch introduces an alternative workflow, which only requires the
exclusive lock when the relation (and index) files are being swapped.
(Supposedly, the swapping should be pretty fast.) On the other hand, when we
copy the data to the new file, we allow applications to read from the relation
and even to write to it.
First, we scan the relation using a "historic snapshot", and insert all the
tuples satisfying this snapshot into the new file.
Second, logical decoding is used to capture the data changes done by
applications during the copying (i.e. changes that do not satisfy the historic
snapshot mentioned above), and those are applied to the new file before we
acquire the exclusive lock that we need to swap the files. (Of course, more
data changes can take place while we are waiting for the lock - these will be
applied to the new file after we have acquired the lock, before we swap the
files.)
Since the logical decoding system, during its startup, waits until all the
transactions which already have XID assigned have finished, there is a risk of
deadlock if a transaction that already changed anything in the database tries
to acquire a conflicting lock on the table REPACK CONCURRENTLY is working
on. As an example, consider a transaction running a CREATE INDEX command on the
table that is being REPACKed CONCURRENTLY. On the other hand, DML commands
(INSERT, UPDATE, DELETE) are not a problem as their lock does not conflict
with REPACK CONCURRENTLY.
The current approach is that we accept the risk. If we tried to avoid it, it'd
be necessary to unlock the table before the logical decoding is setup and lock
it again afterwards. Such temporary unlocking would imply re-checking if the
table still meets all the requirements for REPACK CONCURRENTLY.
Like the existing implementation of REPACK, the variant with the CONCURRENTLY
option also requires an extra space for the new relation and index files
(which coexist with the old files for some time). In addition, the
CONCURRENTLY option might introduce a lag in releasing WAL segments for
archiving / recycling. This is due to the decoding of the data changes done by
applications concurrently. When copying the table contents into the new file,
we check the lag periodically. If it exceeds the size of a WAL segment, we
decode all the available WAL before resuming the copying. (Of course, the
changes are not applied until the whole table contents has been copied.) A
background worker might be a better approach for the decoding - let's consider
implementing it in the future.
The WAL records produced by running DML commands on the new relation do not
contain enough information to be processed by the logical decoding system. All
we need from the new relation is the file (relfilenode), while the actual
relation is eventually dropped. Thus there is no point in replaying the DMLs
anywhere.
---
doc/src/sgml/monitoring.sgml | 37 +-
doc/src/sgml/mvcc.sgml | 12 +-
doc/src/sgml/ref/repack.sgml | 129 +-
src/Makefile | 1 +
src/backend/access/heap/heapam.c | 34 +-
src/backend/access/heap/heapam_handler.c | 215 +-
src/backend/access/heap/rewriteheap.c | 6 +-
src/backend/access/transam/xact.c | 11 +-
src/backend/catalog/index.c | 43 +-
src/backend/catalog/system_views.sql | 30 +-
src/backend/commands/cluster.c | 1895 +++++++++++++++--
src/backend/commands/matview.c | 2 +-
src/backend/commands/tablecmds.c | 1 +
src/backend/commands/vacuum.c | 12 +-
src/backend/meson.build | 1 +
src/backend/parser/gram.y | 15 +-
src/backend/replication/logical/decode.c | 83 +
src/backend/replication/logical/snapbuild.c | 20 +
.../replication/pgoutput_repack/Makefile | 32 +
.../replication/pgoutput_repack/meson.build | 18 +
.../pgoutput_repack/pgoutput_repack.c | 288 +++
src/backend/storage/ipc/ipci.c | 1 +
.../utils/activity/wait_event_names.txt | 1 +
src/backend/utils/cache/relcache.c | 1 +
src/backend/utils/time/snapmgr.c | 3 +-
src/bin/psql/tab-complete.in.c | 25 +-
src/include/access/heapam.h | 9 +-
src/include/access/heapam_xlog.h | 2 +
src/include/access/tableam.h | 10 +
src/include/catalog/index.h | 3 +
src/include/commands/cluster.h | 87 +-
src/include/commands/progress.h | 23 +-
src/include/nodes/parsenodes.h | 1 +
src/include/replication/snapbuild.h | 1 +
src/include/storage/lockdefs.h | 4 +-
src/include/storage/lwlocklist.h | 1 +
src/include/utils/snapmgr.h | 2 +
src/test/regress/expected/rules.out | 29 +-
src/tools/pgindent/typedefs.list | 4 +
39 files changed, 2764 insertions(+), 328 deletions(-)
create mode 100644 src/backend/replication/pgoutput_repack/Makefile
create mode 100644 src/backend/replication/pgoutput_repack/meson.build
create mode 100644 src/backend/replication/pgoutput_repack/pgoutput_repack.c
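The workflow described in the commit message reduces to a fixed phase sequence: initial copy under a historic snapshot with periodic decoding once the WAL lag reaches wal_segment_size, a catch-up pass, and a short ACCESS EXCLUSIVE window for the final catch-up and the file swap. A standalone sketch of that control flow (all helper names here are invented stand-ins, not PostgreSQL functions, and the 6 MB per-batch WAL volume is an arbitrary assumption):

```c
enum { MAXLOG = 32 };
static const char *log_steps[MAXLOG];
static int	nsteps;

static void
step(const char *s)
{
	log_steps[nsteps++] = s;
}

/* Copy one batch of tuples; pretend each batch produces 6 MB of WAL. */
static int
copy_one_batch(int batch, long *wal_lag)
{
	step("copy batch");
	*wal_lag += 6L * 1024 * 1024;
	return batch + 1;
}

/* Decode all available WAL so the replication slot can advance. */
static void
decode_available_wal(long *wal_lag)
{
	step("decode WAL");
	*wal_lag = 0;
}

/*
 * Phase order of REPACK CONCURRENTLY: copy with periodic decoding,
 * catch-up, then a short exclusive-lock window for the final catch-up
 * and the swap of relation and index files.
 */
static void
repack_concurrently(int nbatches, long segment_size)
{
	long		wal_lag = 0;

	nsteps = 0;
	for (int batch = 0; batch < nbatches;)
	{
		batch = copy_one_batch(batch, &wal_lag);
		if (wal_lag >= segment_size)
			decode_available_wal(&wal_lag);
	}

	step("apply concurrent changes");	/* catch-up phase */
	step("acquire ACCESS EXCLUSIVE lock");
	step("final catch-up");				/* changes made while waiting */
	step("swap relation and index files");
	step("release lock");
}
```

The point of the threshold is visible in the trace: decoding only runs once enough WAL has accumulated, so the overhead is amortized while the lag never exceeds roughly one segment during the copy.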
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index da883bb22f1..cae24f15624 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6061,14 +6061,35 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>heap_tuples_written</structfield> <type>bigint</type>
+ <structfield>heap_tuples_inserted</structfield> <type>bigint</type>
</para>
<para>
- Number of heap tuples written.
+ Number of heap tuples inserted.
This counter only advances when the phase is
<literal>seq scanning heap</literal>,
- <literal>index scanning heap</literal>
- or <literal>writing new heap</literal>.
+ <literal>index scanning heap</literal>,
+ <literal>writing new heap</literal>
+ or <literal>catch-up</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_updated</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples updated.
+ This counter only advances when the phase is <literal>catch-up</literal>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>heap_tuples_deleted</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of heap tuples deleted.
+ This counter only advances when the phase is <literal>catch-up</literal>.
</para></entry>
</row>
@@ -6149,6 +6170,14 @@ FROM pg_stat_get_backend_idset() AS backendid;
<command>REPACK</command> is currently writing the new heap.
</entry>
</row>
+ <row>
+ <entry><literal>catch-up</literal></entry>
+ <entry>
+ <command>REPACK CONCURRENTLY</command> is currently processing the DML
+ commands that other transactions executed during any of the preceding
+ phases.
+ </entry>
+ </row>
<row>
<entry><literal>swapping relation files</literal></entry>
<entry>
diff --git a/doc/src/sgml/mvcc.sgml b/doc/src/sgml/mvcc.sgml
index 049ee75a4ba..0f5c34af542 100644
--- a/doc/src/sgml/mvcc.sgml
+++ b/doc/src/sgml/mvcc.sgml
@@ -1833,15 +1833,17 @@ SELECT pg_advisory_lock(q.id) FROM
<title>Caveats</title>
<para>
- Some DDL commands, currently only <link linkend="sql-truncate"><command>TRUNCATE</command></link> and the
- table-rewriting forms of <link linkend="sql-altertable"><command>ALTER TABLE</command></link>, are not
+ Some commands, currently only <link linkend="sql-truncate"><command>TRUNCATE</command></link>, the
+ table-rewriting forms of <link linkend="sql-altertable"><command>ALTER
+ TABLE</command></link> and <command>REPACK</command> with
+ the <literal>CONCURRENTLY</literal> option, are not
MVCC-safe. This means that after the truncation or rewrite commits, the
table will appear empty to concurrent transactions, if they are using a
- snapshot taken before the DDL command committed. This will only be an
+ snapshot taken before the command committed. This will only be an
issue for a transaction that did not access the table in question
- before the DDL command started — any transaction that has done so
+ before the command started — any transaction that has done so
would hold at least an <literal>ACCESS SHARE</literal> table lock,
- which would block the DDL command until that transaction completes.
+ which would block the truncating or rewriting command until that transaction completes.
So these commands will not cause any apparent inconsistency in the
table contents for successive queries on the target table, but they
could cause visible inconsistency between the contents of the target
diff --git a/doc/src/sgml/ref/repack.sgml b/doc/src/sgml/ref/repack.sgml
index a612c72d971..9c089a6b3d7 100644
--- a/doc/src/sgml/ref/repack.sgml
+++ b/doc/src/sgml/ref/repack.sgml
@@ -22,6 +22,7 @@ PostgreSQL documentation
<refsynopsisdiv>
<synopsis>
REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <replaceable class="parameter">table_name</replaceable> [ USING INDEX <replaceable class="parameter">index_name</replaceable> ] ]
+REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] CONCURRENTLY <replaceable class="parameter">table_name</replaceable> [ USING INDEX <replaceable class="parameter">index_name</replaceable> ]
<phrase>where <replaceable class="parameter">option</replaceable> can be one of:</phrase>
@@ -48,7 +49,8 @@ REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
processes every table and materialized view in the current database that
the current user has the <literal>MAINTAIN</literal> privilege on. This
form of <command>REPACK</command> cannot be executed inside a transaction
- block.
+ block. Also, this form is not allowed if
+ the <literal>CONCURRENTLY</literal> option is used.
</para>
<para>
@@ -61,7 +63,8 @@ REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
When a table is being repacked, an <literal>ACCESS EXCLUSIVE</literal> lock
is acquired on it. This prevents any other database operations (both reads
and writes) from operating on the table until the <command>REPACK</command>
- is finished.
+ is finished. If you want to keep the table accessible during the repacking,
+ consider using the <literal>CONCURRENTLY</literal> option.
</para>
<refsect2 id="sql-repack-notes-on-clustering" xreflabel="Notes on Clustering">
@@ -160,6 +163,128 @@ REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><literal>CONCURRENTLY</literal></term>
+ <listitem>
+ <para>
+ Allow other transactions to use the table while it is being repacked.
+ </para>
+
+ <para>
+ Internally, <command>REPACK</command> copies the contents of the table
+ (ignoring dead tuples) into a new file, sorted by the specified index,
+ and also creates a new file for each index. Then it swaps the old and
+ new files for the table and all the indexes, and deletes the old
+ files. The <literal>ACCESS EXCLUSIVE</literal> lock is needed to make
+ sure that the old files do not change during the processing because the
+ changes would get lost due to the swap.
+ </para>
+
+ <para>
+ With the <literal>CONCURRENTLY</literal> option, the <literal>ACCESS
+ EXCLUSIVE</literal> lock is only acquired to swap the table and index
+ files. The data changes that took place during the creation of the new
+ table and index files are captured using logical decoding
+ (<xref linkend="logicaldecoding"/>) and applied before
+ the <literal>ACCESS EXCLUSIVE</literal> lock is requested. Thus the lock
+ is typically held only for the time needed to swap the files, which
+ should be pretty short. However, the time might still be noticeable if
+ too many data changes have been done to the table while
+ <command>REPACK</command> was waiting for the lock: those changes must
+ be processed just before the files are swapped, while the
+ <literal>ACCESS EXCLUSIVE</literal> lock is being held.
+ </para>
+
+ <para>
+ Note that <command>REPACK</command> with
+ the <literal>CONCURRENTLY</literal> option does not try to order the
+ rows inserted into the table after the repacking started. Also
+ note that <command>REPACK</command> might fail to complete due to DDL
+ commands executed on the table by other transactions during the
+ repacking.
+ </para>
+
+ <note>
+ <para>
+ In addition to the temporary space requirements explained in
+ <xref linkend="sql-repack-notes-on-resources"/>,
+ the <literal>CONCURRENTLY</literal> option can increase the use of
+ temporary space somewhat. The reason is that other transactions can
+ perform DML operations which cannot be applied to the new file until
+ <command>REPACK</command> has copied all the tuples from the old
+ file. Thus the tuples inserted into the old file during the copying are
+ also stored separately in a temporary file, so they can eventually be
+ applied to the new file.
+ </para>
+
+ <para>
+ Furthermore, the data changes performed during the copying are
+ extracted from the <link linkend="wal">write-ahead log</link> (WAL), and
+ this extraction (decoding) only takes place when a certain amount of WAL
+ has been written. Therefore, WAL removal can be delayed by this
+ threshold. Currently the threshold is equal to the value of
+ the <link linkend="guc-wal-segment-size"><varname>wal_segment_size</varname></link>
+ configuration parameter.
+ </para>
+ </note>
+
+ <para>
+ The <literal>CONCURRENTLY</literal> option cannot be used in the
+ following cases:
+
+ <itemizedlist>
+ <listitem>
+ <para>
+ The table is <literal>UNLOGGED</literal>.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The table is partitioned.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The table is a system catalog or a <acronym>TOAST</acronym> table.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <command>REPACK</command> is executed inside a transaction block.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The <link linkend="guc-wal-level"><varname>wal_level</varname></link>
+ configuration parameter is less than <literal>logical</literal>.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The <link linkend="guc-max-replication-slots"><varname>max_replication_slots</varname></link>
+ configuration parameter does not allow for creation of an additional
+ replication slot.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+
+ <warning>
+ <para>
+ <command>REPACK</command> with the <literal>CONCURRENTLY</literal>
+ option is not MVCC-safe, see <xref linkend="mvcc-caveats"/> for
+ details.
+ </para>
+ </warning>
+
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><literal>VERBOSE</literal></term>
<listitem>
diff --git a/src/Makefile b/src/Makefile
index 2f31a2f20a7..b18c9a14ffa 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -23,6 +23,7 @@ SUBDIRS = \
interfaces \
backend/replication/libpqwalreceiver \
backend/replication/pgoutput \
+ backend/replication/pgoutput_repack \
fe_utils \
bin \
pl \
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0dcd6ee817e..4fdb3e880e4 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -60,7 +60,8 @@ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
HeapTuple newtup, HeapTuple old_key_tuple,
- bool all_visible_cleared, bool new_all_visible_cleared);
+ bool all_visible_cleared, bool new_all_visible_cleared,
+ bool wal_logical);
#ifdef USE_ASSERT_CHECKING
static void check_lock_if_inplace_updateable_rel(Relation relation,
ItemPointer otid,
@@ -2769,7 +2770,7 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
TM_Result
heap_delete(Relation relation, ItemPointer tid,
CommandId cid, Snapshot crosscheck, bool wait,
- TM_FailureData *tmfd, bool changingPart)
+ TM_FailureData *tmfd, bool changingPart, bool wal_logical)
{
TM_Result result;
TransactionId xid = GetCurrentTransactionId();
@@ -3016,7 +3017,8 @@ l1:
* Compute replica identity tuple before entering the critical section so
* we don't PANIC upon a memory allocation failure.
*/
- old_key_tuple = ExtractReplicaIdentity(relation, &tp, true, &old_key_copied);
+ old_key_tuple = wal_logical ?
+ ExtractReplicaIdentity(relation, &tp, true, &old_key_copied) : NULL;
/*
* If this is the first possibly-multixact-able operation in the current
@@ -3106,6 +3108,15 @@ l1:
xlrec.flags |= XLH_DELETE_CONTAINS_OLD_KEY;
}
+ /*
+ * Unlike UPDATE, DELETE is decoded even if there is no old key, so it
+ * does not help to clear both XLH_DELETE_CONTAINS_OLD_TUPLE and
+ * XLH_DELETE_CONTAINS_OLD_KEY. Thus we need an extra flag. TODO
+ * Consider not decoding tuples w/o the old tuple/key instead.
+ */
+ if (!wal_logical)
+ xlrec.flags |= XLH_DELETE_NO_LOGICAL;
+
XLogBeginInsert();
XLogRegisterData(&xlrec, SizeOfHeapDelete);
@@ -3198,7 +3209,8 @@ simple_heap_delete(Relation relation, ItemPointer tid)
result = heap_delete(relation, tid,
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &tmfd, false /* changingPart */ );
+ &tmfd, false, /* changingPart */
+ true /* wal_logical */);
switch (result)
{
case TM_SelfModified:
@@ -3239,7 +3251,7 @@ TM_Result
heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
CommandId cid, Snapshot crosscheck, bool wait,
TM_FailureData *tmfd, LockTupleMode *lockmode,
- TU_UpdateIndexes *update_indexes)
+ TU_UpdateIndexes *update_indexes, bool wal_logical)
{
TM_Result result;
TransactionId xid = GetCurrentTransactionId();
@@ -4132,7 +4144,8 @@ l2:
newbuf, &oldtup, heaptup,
old_key_tuple,
all_visible_cleared,
- all_visible_cleared_new);
+ all_visible_cleared_new,
+ wal_logical);
if (newbuf != buffer)
{
PageSetLSN(BufferGetPage(newbuf), recptr);
@@ -4490,7 +4503,8 @@ simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup,
result = heap_update(relation, otid, tup,
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &tmfd, &lockmode, update_indexes);
+ &tmfd, &lockmode, update_indexes,
+ true /* wal_logical */);
switch (result)
{
case TM_SelfModified:
@@ -8831,7 +8845,8 @@ static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup, HeapTuple newtup,
HeapTuple old_key_tuple,
- bool all_visible_cleared, bool new_all_visible_cleared)
+ bool all_visible_cleared, bool new_all_visible_cleared,
+ bool wal_logical)
{
xl_heap_update xlrec;
xl_heap_header xlhdr;
@@ -8842,7 +8857,8 @@ log_heap_update(Relation reln, Buffer oldbuf,
suffixlen = 0;
XLogRecPtr recptr;
Page page = BufferGetPage(newbuf);
- bool need_tuple_data = RelationIsLogicallyLogged(reln);
+ bool need_tuple_data = RelationIsLogicallyLogged(reln) &&
+ wal_logical;
bool init;
int bufflags;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 0b03070d394..c829c06f769 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -33,6 +33,7 @@
#include "catalog/index.h"
#include "catalog/storage.h"
#include "catalog/storage_xlog.h"
+#include "commands/cluster.h"
#include "commands/progress.h"
#include "executor/executor.h"
#include "miscadmin.h"
@@ -309,7 +310,8 @@ heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
* the storage itself is cleaning the dead tuples by itself, it is the
* time to call the index tuple deletion also.
*/
- return heap_delete(relation, tid, cid, crosscheck, wait, tmfd, changingPart);
+ return heap_delete(relation, tid, cid, crosscheck, wait, tmfd, changingPart,
+ true);
}
@@ -328,7 +330,7 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
tuple->t_tableOid = slot->tts_tableOid;
result = heap_update(relation, otid, tuple, cid, crosscheck, wait,
- tmfd, lockmode, update_indexes);
+ tmfd, lockmode, update_indexes, true);
ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
/*
@@ -685,13 +687,15 @@ static void
heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
Relation OldIndex, bool use_sort,
TransactionId OldestXmin,
+ Snapshot snapshot,
+ LogicalDecodingContext *decoding_ctx,
TransactionId *xid_cutoff,
MultiXactId *multi_cutoff,
double *num_tuples,
double *tups_vacuumed,
double *tups_recently_dead)
{
- RewriteState rwstate;
+ RewriteState rwstate = NULL;
IndexScanDesc indexScan;
TableScanDesc tableScan;
HeapScanDesc heapScan;
@@ -705,6 +709,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
bool *isnull;
BufferHeapTupleTableSlot *hslot;
BlockNumber prev_cblock = InvalidBlockNumber;
+ bool concurrent = snapshot != NULL;
+ XLogRecPtr end_of_wal_prev = GetFlushRecPtr(NULL);
/* Remember if it's a system catalog */
is_system_catalog = IsSystemRelation(OldHeap);
@@ -720,9 +726,12 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
values = (Datum *) palloc(natts * sizeof(Datum));
isnull = (bool *) palloc(natts * sizeof(bool));
- /* Initialize the rewrite operation */
- rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, *xid_cutoff,
- *multi_cutoff);
+ /*
+ * Initialize the rewrite operation.
+ */
+ if (!concurrent)
+ rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin,
+ *xid_cutoff, *multi_cutoff);
/* Set up sorting if wanted */
@@ -737,6 +746,9 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
* Prepare to scan the OldHeap. To ensure we see recently-dead tuples
* that still need to be copied, we scan with SnapshotAny and use
* HeapTupleSatisfiesVacuum for the visibility test.
+ *
+ * In the CONCURRENTLY case, we do regular MVCC visibility tests, using
+ * the snapshot passed by the caller.
*/
if (OldIndex != NULL && !use_sort)
{
@@ -753,7 +765,9 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
tableScan = NULL;
heapScan = NULL;
- indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, NULL, 0, 0);
+ indexScan = index_beginscan(OldHeap, OldIndex,
+ snapshot ? snapshot :SnapshotAny,
+ NULL, 0, 0);
index_rescan(indexScan, NULL, 0, NULL, 0);
}
else
@@ -762,7 +776,9 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
PROGRESS_REPACK_PHASE_SEQ_SCAN_HEAP);
- tableScan = table_beginscan(OldHeap, SnapshotAny, 0, (ScanKey) NULL);
+ tableScan = table_beginscan(OldHeap,
+ snapshot ? snapshot :SnapshotAny,
+ 0, (ScanKey) NULL);
heapScan = (HeapScanDesc) tableScan;
indexScan = NULL;
@@ -785,6 +801,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
HeapTuple tuple;
Buffer buf;
bool isdead;
+ HTSV_Result vis;
CHECK_FOR_INTERRUPTS();
@@ -837,70 +854,84 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
tuple = ExecFetchSlotHeapTuple(slot, false, NULL);
buf = hslot->buffer;
- LockBuffer(buf, BUFFER_LOCK_SHARE);
-
- switch (HeapTupleSatisfiesVacuum(tuple, OldestXmin, buf))
+ /*
+ * Regarding CONCURRENTLY, see the comments on MVCC snapshot above.
+ */
+ if (!concurrent)
{
- case HEAPTUPLE_DEAD:
- /* Definitely dead */
- isdead = true;
- break;
- case HEAPTUPLE_RECENTLY_DEAD:
- *tups_recently_dead += 1;
- /* fall through */
- case HEAPTUPLE_LIVE:
- /* Live or recently dead, must copy it */
- isdead = false;
- break;
- case HEAPTUPLE_INSERT_IN_PROGRESS:
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
- /*
- * Since we hold exclusive lock on the relation, normally the
- * only way to see this is if it was inserted earlier in our
- * own transaction. However, it can happen in system
- * catalogs, since we tend to release write lock before commit
- * there. Give a warning if neither case applies; but in any
- * case we had better copy it.
- */
- if (!is_system_catalog &&
- !TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(tuple->t_data)))
- elog(WARNING, "concurrent insert in progress within table \"%s\"",
- RelationGetRelationName(OldHeap));
- /* treat as live */
- isdead = false;
- break;
- case HEAPTUPLE_DELETE_IN_PROGRESS:
+ switch ((vis = HeapTupleSatisfiesVacuum(tuple, OldestXmin, buf)))
+ {
+ case HEAPTUPLE_DEAD:
+ /* Definitely dead */
+ isdead = true;
+ break;
+ case HEAPTUPLE_RECENTLY_DEAD:
+ *tups_recently_dead += 1;
+ /* fall through */
+ case HEAPTUPLE_LIVE:
+ /* Live or recently dead, must copy it */
+ isdead = false;
+ break;
+ case HEAPTUPLE_INSERT_IN_PROGRESS:
/*
- * Similar situation to INSERT_IN_PROGRESS case.
+ * As long as we hold exclusive lock on the relation, normally
+ * the only way to see this is if it was inserted earlier in
+ * our own transaction. However, it can happen in system
+ * catalogs, since we tend to release write lock before commit
+ * there. Also, there's no exclusive lock during concurrent
+ * processing. Give a warning if neither case applies; but in
+ * any case we had better copy it.
*/
- if (!is_system_catalog &&
- !TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetUpdateXid(tuple->t_data)))
- elog(WARNING, "concurrent delete in progress within table \"%s\"",
- RelationGetRelationName(OldHeap));
- /* treat as recently dead */
- *tups_recently_dead += 1;
- isdead = false;
- break;
- default:
- elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result");
- isdead = false; /* keep compiler quiet */
- break;
- }
+ if (!is_system_catalog && !concurrent &&
+ !TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(tuple->t_data)))
+ elog(WARNING, "concurrent insert in progress within table \"%s\"",
+ RelationGetRelationName(OldHeap));
+ /* treat as live */
+ isdead = false;
+ break;
+ case HEAPTUPLE_DELETE_IN_PROGRESS:
- LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ /*
+ * Similar situation to INSERT_IN_PROGRESS case.
+ */
+ if (!is_system_catalog && !concurrent &&
+ !TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetUpdateXid(tuple->t_data)))
+ elog(WARNING, "concurrent delete in progress within table \"%s\"",
+ RelationGetRelationName(OldHeap));
+ /* treat as recently dead */
+ *tups_recently_dead += 1;
+ isdead = false;
+ break;
+ default:
+ elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result");
+ isdead = false; /* keep compiler quiet */
+ break;
+ }
- if (isdead)
- {
- *tups_vacuumed += 1;
- /* heap rewrite module still needs to see it... */
- if (rewrite_heap_dead_tuple(rwstate, tuple))
+ if (isdead)
{
- /* A previous recently-dead tuple is now known dead */
*tups_vacuumed += 1;
- *tups_recently_dead -= 1;
+ /* heap rewrite module still needs to see it... */
+ if (rewrite_heap_dead_tuple(rwstate, tuple))
+ {
+ /* A previous recently-dead tuple is now known dead */
+ *tups_vacuumed += 1;
+ *tups_recently_dead -= 1;
+ }
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ continue;
}
- continue;
+
+ /*
+ * In the concurrent case, we have a copy of the tuple, so we
+ * don't worry whether the source tuple will be deleted / updated
+ * after we release the lock.
+ */
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
}
*num_tuples += 1;
@@ -919,7 +950,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
{
const int ct_index[] = {
PROGRESS_REPACK_HEAP_TUPLES_SCANNED,
- PROGRESS_REPACK_HEAP_TUPLES_WRITTEN
+ PROGRESS_REPACK_HEAP_TUPLES_INSERTED
};
int64 ct_val[2];
@@ -934,6 +965,31 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
ct_val[1] = *num_tuples;
pgstat_progress_update_multi_param(2, ct_index, ct_val);
}
+
+ /*
+ * Process the WAL produced by the load, as well as by other
+ * transactions, so that the replication slot can advance and WAL does
+ * not pile up. Use wal_segment_size as a threshold so that we do not
+ * introduce the decoding overhead too often.
+ *
+ * Of course, we must not apply the changes until the initial load has
+ * completed.
+ *
+ * Note that our insertions into the new table should not be decoded
+ * as we (intentionally) do not write the logical decoding specific
+ * information to WAL.
+ */
+ if (concurrent)
+ {
+ XLogRecPtr end_of_wal;
+
+ end_of_wal = GetFlushRecPtr(NULL);
+ if ((end_of_wal - end_of_wal_prev) > wal_segment_size)
+ {
+ repack_decode_concurrent_changes(decoding_ctx, end_of_wal);
+ end_of_wal_prev = end_of_wal;
+ }
+ }
}
if (indexScan != NULL)
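The decoding threshold described in the comment above can be sketched in isolation. The type and GUC stand-ins below are illustrative only, not the real PostgreSQL definitions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the patch's types and GUC; 16 MB is the default value. */
typedef uint64_t XLogRecPtr;
static const uint64_t wal_segment_size = 16 * 1024 * 1024;

/*
 * Decide whether enough WAL has accumulated since the previous decoding
 * round to justify another pass over the concurrent changes.  Decoding
 * only after a full segment's worth of WAL keeps the overhead low while
 * still letting the replication slot advance during a long initial load.
 */
static bool
decoding_threshold_reached(XLogRecPtr end_of_wal, XLogRecPtr end_of_wal_prev)
{
    return (end_of_wal - end_of_wal_prev) > wal_segment_size;
}
```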
@@ -977,7 +1033,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
values, isnull,
rwstate);
/* Report n_tuples */
- pgstat_progress_update_param(PROGRESS_REPACK_HEAP_TUPLES_WRITTEN,
+ pgstat_progress_update_param(PROGRESS_REPACK_HEAP_TUPLES_INSERTED,
n_tuples);
}
@@ -985,7 +1041,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
}
/* Write out any remaining tuples, and fsync if needed */
- end_heap_rewrite(rwstate);
+ if (rwstate)
+ end_heap_rewrite(rwstate);
/* Clean up */
pfree(values);
@@ -2376,6 +2433,10 @@ heapam_scan_sample_next_tuple(TableScanDesc scan, SampleScanState *scanstate,
* SET WITHOUT OIDS.
*
* So, we must reconstruct the tuple from component Datums.
+ *
+ * If rwstate=NULL, use simple_heap_insert() instead of rewriting - in that
+ * case we still need to deform/form the tuple. TODO Shouldn't we rename the
+ * function, as it might not do any rewrite?
*/
static void
reform_and_rewrite_tuple(HeapTuple tuple,
@@ -2398,8 +2459,28 @@ reform_and_rewrite_tuple(HeapTuple tuple,
copiedTuple = heap_form_tuple(newTupDesc, values, isnull);
- /* The heap rewrite module does the rest */
- rewrite_heap_tuple(rwstate, tuple, copiedTuple);
+ if (rwstate)
+ /* The heap rewrite module does the rest */
+ rewrite_heap_tuple(rwstate, tuple, copiedTuple);
+ else
+ {
+ /*
+ * Insert tuple when processing REPACK CONCURRENTLY.
+ *
+ * rewriteheap.c is not used in the CONCURRENTLY case because it'd be
+ * difficult to do the same in the catch-up phase (as the logical
+ * decoding does not provide us with sufficient visibility
+ * information). Thus we must use heap_insert() both during the
+ * catch-up and here.
+ *
+ * The following is like simple_heap_insert() except that we pass the
+ * flag to skip logical decoding: as soon as REPACK CONCURRENTLY swaps
+ * the relation files, it drops this relation, so no logical
+ * replication subscription should need the data.
+ */
+ heap_insert(NewHeap, copiedTuple, GetCurrentCommandId(true),
+ HEAP_INSERT_NO_LOGICAL, NULL);
+ }
heap_freetuple(copiedTuple);
}
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index e6d2b5fced1..6aa2ed214f2 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -617,9 +617,9 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
int options = HEAP_INSERT_SKIP_FSM;
/*
- * While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
- * for the TOAST table are not logically decoded. The main heap is
- * WAL-logged as XLOG FPI records, which are not logically decoded.
+ * While rewriting the heap for REPACK, make sure data for the TOAST
+ * table are not logically decoded. The main heap is WAL-logged as
+ * XLOG FPI records, which are not logically decoded.
*/
options |= HEAP_INSERT_NO_LOGICAL;
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b885513f765..23f2de587a1 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -215,6 +215,7 @@ typedef struct TransactionStateData
bool parallelChildXact; /* is any parent transaction parallel? */
bool chain; /* start a new block after this one */
bool topXidLogged; /* for a subxact: is top-level XID logged? */
+ bool internal; /* for a subxact: launched internally? */
struct TransactionStateData *parent; /* back link to parent */
} TransactionStateData;
@@ -4723,6 +4724,7 @@ BeginInternalSubTransaction(const char *name)
/* Normal subtransaction start */
PushTransaction();
s = CurrentTransactionState; /* changed by push */
+ s->internal = true;
/*
* Savepoint names, like the TransactionState block itself, live
@@ -5239,7 +5241,13 @@ AbortSubTransaction(void)
LWLockReleaseAll();
pgstat_report_wait_end();
- pgstat_progress_end_command();
+
+ /*
+ * An internal subtransaction might be used by a user command, in which case
+ * the command outlives the subtransaction.
+ */
+ if (!s->internal)
+ pgstat_progress_end_command();
pgaio_error_cleanup();
@@ -5456,6 +5464,7 @@ PushTransaction(void)
s->parallelModeLevel = 0;
s->parallelChildXact = (p->parallelModeLevel != 0 || p->parallelChildXact);
s->topXidLogged = false;
+ s->internal = false;
CurrentTransactionState = s;
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 96357d1170c..29428b5d857 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1418,22 +1418,7 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
opclassOptions[i] = get_attoptions(oldIndexId, i + 1);
- /* Extract statistic targets for each attribute */
- stattargets = palloc0_array(NullableDatum, newInfo->ii_NumIndexAttrs);
- for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
- {
- HeapTuple tp;
- Datum dat;
-
- tp = SearchSysCache2(ATTNUM, ObjectIdGetDatum(oldIndexId), Int16GetDatum(i + 1));
- if (!HeapTupleIsValid(tp))
- elog(ERROR, "cache lookup failed for attribute %d of relation %u",
- i + 1, oldIndexId);
- dat = SysCacheGetAttr(ATTNUM, tp, Anum_pg_attribute_attstattarget, &isnull);
- ReleaseSysCache(tp);
- stattargets[i].value = dat;
- stattargets[i].isnull = isnull;
- }
+ stattargets = get_index_stattargets(oldIndexId, newInfo);
/*
* Now create the new index.
@@ -1472,6 +1457,32 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
return newIndexId;
}
+NullableDatum *
+get_index_stattargets(Oid indexid, IndexInfo *indInfo)
+{
+ NullableDatum *stattargets;
+
+ /* Extract statistic targets for each attribute */
+ stattargets = palloc0_array(NullableDatum, indInfo->ii_NumIndexAttrs);
+ for (int i = 0; i < indInfo->ii_NumIndexAttrs; i++)
+ {
+ HeapTuple tp;
+ Datum dat;
+ bool isnull;
+
+ tp = SearchSysCache2(ATTNUM, ObjectIdGetDatum(indexid), Int16GetDatum(i + 1));
+ if (!HeapTupleIsValid(tp))
+ elog(ERROR, "cache lookup failed for attribute %d of relation %u",
+ i + 1, indexid);
+ dat = SysCacheGetAttr(ATTNUM, tp, Anum_pg_attribute_attstattarget, &isnull);
+ ReleaseSysCache(tp);
+ stattargets[i].value = dat;
+ stattargets[i].isnull = isnull;
+ }
+
+ return stattargets;
+}
+
/*
* index_concurrently_build
*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 7380b6e3d7b..a9d3a4b5787 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1258,16 +1258,17 @@ CREATE VIEW pg_stat_progress_cluster AS
WHEN 2 THEN 'index scanning heap'
WHEN 3 THEN 'sorting tuples'
WHEN 4 THEN 'writing new heap'
- WHEN 5 THEN 'swapping relation files'
- WHEN 6 THEN 'rebuilding index'
- WHEN 7 THEN 'performing final cleanup'
+ -- 5 is 'catch-up', but that should not appear here.
+ WHEN 6 THEN 'swapping relation files'
+ WHEN 7 THEN 'rebuilding index'
+ WHEN 8 THEN 'performing final cleanup'
END AS phase,
CAST(S.param3 AS oid) AS cluster_index_relid,
S.param4 AS heap_tuples_scanned,
S.param5 AS heap_tuples_written,
- S.param6 AS heap_blks_total,
- S.param7 AS heap_blks_scanned,
- S.param8 AS index_rebuild_count
+ S.param8 AS heap_blks_total,
+ S.param9 AS heap_blks_scanned,
+ S.param10 AS index_rebuild_count
FROM pg_stat_get_progress_info('CLUSTER') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
@@ -1283,16 +1284,19 @@ CREATE VIEW pg_stat_progress_repack AS
WHEN 2 THEN 'index scanning heap'
WHEN 3 THEN 'sorting tuples'
WHEN 4 THEN 'writing new heap'
- WHEN 5 THEN 'swapping relation files'
- WHEN 6 THEN 'rebuilding index'
- WHEN 7 THEN 'performing final cleanup'
+ WHEN 5 THEN 'catch-up'
+ WHEN 6 THEN 'swapping relation files'
+ WHEN 7 THEN 'rebuilding index'
+ WHEN 8 THEN 'performing final cleanup'
END AS phase,
CAST(S.param3 AS oid) AS repack_index_relid,
S.param4 AS heap_tuples_scanned,
- S.param5 AS heap_tuples_written,
- S.param6 AS heap_blks_total,
- S.param7 AS heap_blks_scanned,
- S.param8 AS index_rebuild_count
+ S.param5 AS heap_tuples_inserted,
+ S.param6 AS heap_tuples_updated,
+ S.param7 AS heap_tuples_deleted,
+ S.param8 AS heap_blks_total,
+ S.param9 AS heap_blks_scanned,
+ S.param10 AS index_rebuild_count
FROM pg_stat_get_progress_info('REPACK') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
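The view change above shifts the later phase numbers by one because 'catch-up' was inserted as phase 5. A minimal sketch of the resulting mapping (the real constants live in src/include/commands/progress.h; the function name here is hypothetical):

```c
#include <assert.h>
#include <string.h>

/*
 * Phase numbering taken from the CASE expression in the
 * pg_stat_progress_repack view: 'catch-up' is phase 5, so the phases
 * that used to be 5..7 are now 6..8.
 */
static const char *
repack_phase_name(int phase)
{
    switch (phase)
    {
        case 1: return "seq scanning heap";
        case 2: return "index scanning heap";
        case 3: return "sorting tuples";
        case 4: return "writing new heap";
        case 5: return "catch-up";
        case 6: return "swapping relation files";
        case 7: return "rebuilding index";
        case 8: return "performing final cleanup";
        default: return "initializing";
    }
}
```

Note that pg_stat_progress_cluster deliberately has no WHEN 5 branch, since the catch-up phase can only be reported by REPACK CONCURRENTLY.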
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 57ae5d561fd..408bdbdff3b 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -25,6 +25,10 @@
#include "access/toast_internals.h"
#include "access/transam.h"
#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xloginsert.h"
+#include "access/xlogutils.h"
#include "catalog/catalog.h"
#include "catalog/dependency.h"
#include "catalog/heap.h"
@@ -32,6 +36,7 @@
#include "catalog/namespace.h"
#include "catalog/objectaccess.h"
#include "catalog/pg_am.h"
+#include "catalog/pg_control.h"
#include "catalog/pg_inherits.h"
#include "catalog/toasting.h"
#include "commands/cluster.h"
@@ -39,10 +44,15 @@
#include "commands/progress.h"
#include "commands/tablecmds.h"
#include "commands/vacuum.h"
+#include "executor/executor.h"
#include "miscadmin.h"
#include "optimizer/optimizer.h"
#include "pgstat.h"
+#include "replication/decode.h"
+#include "replication/logical.h"
+#include "replication/snapbuild.h"
#include "storage/bufmgr.h"
+#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
#include "utils/acl.h"
@@ -67,15 +77,45 @@ typedef struct
Oid indexOid;
} RelToCluster;
+/*
+ * The following definitions are used for concurrent processing.
+ */
+
+/*
+ * The locators are used to avoid logical decoding of data that we do not need
+ * for our table.
+ */
+RelFileLocator repacked_rel_locator = {.relNumber = InvalidOid};
+RelFileLocator repacked_rel_toast_locator = {.relNumber = InvalidOid};
+
+/*
+ * Everything we need to call ExecInsertIndexTuples().
+ */
+typedef struct IndexInsertState
+{
+ ResultRelInfo *rri;
+ EState *estate;
+
+ Relation ident_index;
+} IndexInsertState;
+
+/* The WAL segment being decoded. */
+static XLogSegNo repack_current_segment = 0;
+
static void cluster_multiple_rels(List *rtcs, ClusterParams *params,
- ClusterCommand cmd);
+ ClusterCommand cmd, LOCKMODE lockmode,
+ bool isTopLevel);
static bool cluster_rel_recheck(Relation OldHeap, Oid indexOid, Oid userid,
- int options);
+ LOCKMODE lmode, int options);
+static void check_repack_concurrently_requirements(Relation rel);
static void rebuild_relation(Relation OldHeap, Relation index, bool verbose,
- ClusterCommand cmd);
+ bool concurrent, Oid userid, ClusterCommand cmd);
static void copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
- bool verbose, bool *pSwapToastByContent,
- TransactionId *pFreezeXid, MultiXactId *pCutoffMulti);
+ Snapshot snapshot, LogicalDecodingContext *decoding_ctx,
+ bool verbose,
+ bool *pSwapToastByContent,
+ TransactionId *pFreezeXid,
+ MultiXactId *pCutoffMulti);
static List *get_tables_to_cluster(MemoryContext cluster_context);
static List *get_tables_to_repack(MemoryContext repack_context);
static List *get_tables_to_cluster_partitioned(MemoryContext cluster_context,
@@ -83,7 +123,53 @@ static List *get_tables_to_cluster_partitioned(MemoryContext cluster_context,
ClusterCommand cmd);
static bool cluster_is_permitted_for_relation(Oid relid, Oid userid,
ClusterCommand cmd);
+static void begin_concurrent_repack(Relation rel);
+static void end_concurrent_repack(void);
+static LogicalDecodingContext *setup_logical_decoding(Oid relid,
+ const char *slotname,
+ TupleDesc tupdesc);
+static HeapTuple get_changed_tuple(char *change);
+static void apply_concurrent_changes(RepackDecodingState *dstate,
+ Relation rel, ScanKey key, int nkeys,
+ IndexInsertState *iistate);
+static void apply_concurrent_insert(Relation rel, ConcurrentChange *change,
+ HeapTuple tup, IndexInsertState *iistate,
+ TupleTableSlot *index_slot);
+static void apply_concurrent_update(Relation rel, HeapTuple tup,
+ HeapTuple tup_target,
+ ConcurrentChange *change,
+ IndexInsertState *iistate,
+ TupleTableSlot *index_slot);
+static void apply_concurrent_delete(Relation rel, HeapTuple tup_target,
+ ConcurrentChange *change);
+static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
+ HeapTuple tup_key,
+ IndexInsertState *iistate,
+ TupleTableSlot *ident_slot,
+ IndexScanDesc *scan_p);
+static void process_concurrent_changes(LogicalDecodingContext *ctx,
+ XLogRecPtr end_of_wal,
+ Relation rel_dst,
+ Relation rel_src,
+ ScanKey ident_key,
+ int ident_key_nentries,
+ IndexInsertState *iistate);
+static IndexInsertState *get_index_insert_state(Relation relation,
+ Oid ident_index_id);
+static ScanKey build_identity_key(Oid ident_idx_oid, Relation rel_src,
+ int *nentries);
+static void free_index_insert_state(IndexInsertState *iistate);
+static void cleanup_logical_decoding(LogicalDecodingContext *ctx);
+static void rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
+ Relation cl_index,
+ LogicalDecodingContext *ctx,
+ bool swap_toast_by_content,
+ TransactionId frozenXid,
+ MultiXactId cutoffMulti);
+static List *build_new_indexes(Relation NewHeap, Relation OldHeap, List *OldIndexes);
static Relation process_single_relation(RangeVar *relation, char *indexname,
+ LOCKMODE lockmode,
+ bool isTopLevel,
ClusterParams *params,
ClusterCommand cmd,
Oid *indexOid_p);
@@ -142,8 +228,8 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
if (stmt->relation != NULL)
{
- /* This is the single-relation case. */
rel = process_single_relation(stmt->relation, stmt->indexname,
+ AccessExclusiveLock, isTopLevel,
&params, CLUSTER_COMMAND_CLUSTER,
&indexOid);
if (rel == NULL)
@@ -194,7 +280,8 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
}
/* Do the job. */
- cluster_multiple_rels(rtcs, ¶ms, CLUSTER_COMMAND_CLUSTER);
+ cluster_multiple_rels(rtcs, &params, CLUSTER_COMMAND_CLUSTER,
+ AccessExclusiveLock, isTopLevel);
/* Start a new transaction for the cleanup work. */
StartTransactionCommand();
@@ -211,7 +298,8 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
* return.
*/
static void
-cluster_multiple_rels(List *rtcs, ClusterParams *params, ClusterCommand cmd)
+cluster_multiple_rels(List *rtcs, ClusterParams *params, ClusterCommand cmd,
+ LOCKMODE lockmode, bool isTopLevel)
{
ListCell *lc;
@@ -231,10 +319,10 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params, ClusterCommand cmd)
/* functions in indexes may want a snapshot set */
PushActiveSnapshot(GetTransactionSnapshot());
- rel = table_open(rtc->tableOid, AccessExclusiveLock);
+ rel = table_open(rtc->tableOid, lockmode);
/* Process this table */
- cluster_rel(rel, rtc->indexOid, params, cmd);
+ cluster_rel(rel, rtc->indexOid, params, cmd, isTopLevel);
/* cluster_rel closes the relation, but keeps lock */
PopActiveSnapshot();
@@ -258,12 +346,18 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params, ClusterCommand cmd)
* instead of index order. This is the new implementation of VACUUM FULL,
* and error messages should refer to the operation as VACUUM not CLUSTER.
*
+ * Note that, in the concurrent case, the function releases the lock at some
+ * point, in order to get AccessExclusiveLock for the final steps (i.e. to
+ * swap the relation files). To make things simpler, the caller should expect
+ * OldHeap to be closed on return, regardless of CLUOPT_CONCURRENT. (The
+ * AccessExclusiveLock is kept till the end of the transaction.)
+ *
* 'cmd' indicates which command is being executed. REPACK should be the only
* caller of this function in the future.
*/
void
cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
- ClusterCommand cmd)
+ ClusterCommand cmd, bool isTopLevel)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid save_userid;
@@ -272,8 +366,34 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
bool verbose = ((params->options & CLUOPT_VERBOSE) != 0);
bool recheck = ((params->options & CLUOPT_RECHECK) != 0);
Relation index;
+ bool concurrent = ((params->options & CLUOPT_CONCURRENT) != 0);
+ LOCKMODE lmode;
+
+ /*
+ * Check that the correct lock is held. The lock mode is
+ * AccessExclusiveLock for normal processing and ShareUpdateExclusiveLock
+ * for concurrent processing (so that SELECT, INSERT, UPDATE and DELETE
+ * commands work, but cluster_rel() cannot be called concurrently for the
+ * same relation).
+ */
+ lmode = !concurrent ? AccessExclusiveLock : ShareUpdateExclusiveLock;
+
+ /* There are specific requirements on concurrent processing. */
+ if (concurrent)
+ {
+ /*
+ * Make sure we have no XID assigned, otherwise a call to
+ * setup_logical_decoding() can cause a deadlock.
+ *
+ * The existence of a transaction block does not actually imply that an XID
+ * was already assigned, but it very likely is. We might want to check
+ * the result of GetCurrentTransactionIdIfAny() instead, but that
+ * would be less clear from the user's perspective.
+ */
+ PreventInTransactionBlock(isTopLevel, "REPACK CONCURRENTLY");
- Assert(CheckRelationLockedByMe(OldHeap, AccessExclusiveLock, false));
+ check_repack_concurrently_requirements(OldHeap);
+ }
/* Check for user-requested abort. */
CHECK_FOR_INTERRUPTS();
@@ -319,7 +439,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
* to cluster a not-previously-clustered index.
*/
if (recheck)
- if (!cluster_rel_recheck(OldHeap, indexOid, save_userid,
+ if (!cluster_rel_recheck(OldHeap, indexOid, save_userid, lmode,
params->options))
goto out;
@@ -338,6 +458,12 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("cannot cluster a shared catalog")));
+ /*
+ * The CONCURRENTLY case should have been rejected earlier because it does
+ * not support system catalogs.
+ */
+ Assert(!(OldHeap->rd_rel->relisshared && concurrent));
+
/*
* Don't process temp tables of other backends ... their local buffer
* manager is not going to cope.
@@ -376,7 +502,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
if (OidIsValid(indexOid))
{
/* verify the index is good and lock it */
- check_index_is_clusterable(OldHeap, indexOid, AccessExclusiveLock);
+ check_index_is_clusterable(OldHeap, indexOid, lmode);
/* also open it */
index = index_open(indexOid, NoLock);
}
@@ -393,7 +519,9 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
if (OldHeap->rd_rel->relkind == RELKIND_MATVIEW &&
!RelationIsPopulated(OldHeap))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ if (index)
+ index_close(index, lmode);
+ relation_close(OldHeap, lmode);
goto out;
}
@@ -406,11 +534,35 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
* invalid, because we move tuples around. Promote them to relation
* locks. Predicate locks on indexes will be promoted when they are
* reindexed.
+ *
+ * During concurrent processing, the heap as well as its indexes stay in
+ * operation, so we postpone this step until they are locked using
+ * AccessExclusiveLock near the end of the processing.
*/
- TransferPredicateLocksToHeapRelation(OldHeap);
+ if (!concurrent)
+ TransferPredicateLocksToHeapRelation(OldHeap);
/* rebuild_relation does all the dirty work */
- rebuild_relation(OldHeap, index, verbose, cmd);
+ PG_TRY();
+ {
+ /*
+ * For concurrent processing, make sure that our logical decoding
+ * ignores data changes of other tables than the one we are
+ * processing.
+ */
+ if (concurrent)
+ begin_concurrent_repack(OldHeap);
+
+ rebuild_relation(OldHeap, index, verbose, concurrent, save_userid,
+ cmd);
+ }
+ PG_FINALLY();
+ {
+ if (concurrent)
+ end_concurrent_repack();
+ }
+ PG_END_TRY();
+
/* rebuild_relation closes OldHeap, and index if valid */
out:
@@ -429,7 +581,7 @@ out:
*/
static bool
cluster_rel_recheck(Relation OldHeap, Oid indexOid, Oid userid,
- int options)
+ LOCKMODE lmode, int options)
{
Oid tableOid = RelationGetRelid(OldHeap);
@@ -437,7 +589,7 @@ cluster_rel_recheck(Relation OldHeap, Oid indexOid, Oid userid,
if (!cluster_is_permitted_for_relation(tableOid, userid,
CLUSTER_COMMAND_CLUSTER))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
return false;
}
@@ -451,7 +603,7 @@ cluster_rel_recheck(Relation OldHeap, Oid indexOid, Oid userid,
*/
if (RELATION_IS_OTHER_TEMP(OldHeap))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
return false;
}
@@ -462,7 +614,7 @@ cluster_rel_recheck(Relation OldHeap, Oid indexOid, Oid userid,
*/
if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(indexOid)))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
return false;
}
@@ -473,7 +625,7 @@ cluster_rel_recheck(Relation OldHeap, Oid indexOid, Oid userid,
if ((options & CLUOPT_RECHECK_ISCLUSTERED) != 0 &&
!get_index_isclustered(indexOid))
{
- relation_close(OldHeap, AccessExclusiveLock);
+ relation_close(OldHeap, lmode);
return false;
}
}
@@ -614,19 +766,87 @@ mark_index_clustered(Relation rel, Oid indexOid, bool is_internal)
table_close(pg_index, RowExclusiveLock);
}
+/*
+ * Check if the CONCURRENTLY option is legal for the relation.
+ */
+static void
+check_repack_concurrently_requirements(Relation rel)
+{
+ char relpersistence,
+ replident;
+ Oid ident_idx;
+
+ /* Data changes in system relations are not logically decoded. */
+ if (IsCatalogRelation(rel))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot repack relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("REPACK CONCURRENTLY is not supported for catalog relations.")));
+
+ /*
+ * reorderbuffer.c does not seem to handle processing of a TOAST relation
+ * alone.
+ */
+ if (IsToastRelation(rel))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot repack relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("REPACK CONCURRENTLY is not supported for TOAST relations, unless the main relation is repacked too.")));
+
+ relpersistence = rel->rd_rel->relpersistence;
+ if (relpersistence != RELPERSISTENCE_PERMANENT)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot repack relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("REPACK CONCURRENTLY is only allowed for permanent relations.")));
+
+ /* With NOTHING, WAL does not contain the old tuple. */
+ replident = rel->rd_rel->relreplident;
+ if (replident == REPLICA_IDENTITY_NOTHING)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot repack relation \"%s\"",
+ RelationGetRelationName(rel)),
+ errhint("Relation \"%s\" has insufficient replication identity.",
+ RelationGetRelationName(rel))));
+
+ /*
+ * Identity index is not set if the replica identity is FULL, but PK might
+ * exist in such a case.
+ */
+ ident_idx = RelationGetReplicaIndex(rel);
+ if (!OidIsValid(ident_idx) && OidIsValid(rel->rd_pkindex))
+ ident_idx = rel->rd_pkindex;
+ if (!OidIsValid(ident_idx))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot process relation \"%s\"",
+ RelationGetRelationName(rel)),
+ (errhint("Relation \"%s\" has no identity index.",
+ RelationGetRelationName(rel)))));
+}
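The identity-index fallback in check_repack_concurrently_requirements() above can be modeled on its own. The Oid typedef and macros are stand-ins for the PostgreSQL ones, and the function name is hypothetical:

```c
#include <assert.h>

/* Stand-ins for the PostgreSQL definitions. */
typedef unsigned int Oid;
#define InvalidOid ((Oid) 0)
#define OidIsValid(oid) ((oid) != InvalidOid)

/*
 * Prefer the replica identity index.  With REPLICA IDENTITY FULL no
 * identity index is set, but a primary key may still exist and serve
 * to locate old row versions.  Returning InvalidOid corresponds to the
 * "has no identity index" ERROR in the patch.
 */
static Oid
choose_identity_index(Oid replica_idx, Oid pk_idx)
{
    Oid ident_idx = replica_idx;

    if (!OidIsValid(ident_idx) && OidIsValid(pk_idx))
        ident_idx = pk_idx;
    return ident_idx;
}
```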
+
/*
* rebuild_relation: rebuild an existing relation in index or physical order
*
- * OldHeap: table to rebuild.
+ * OldHeap: table to rebuild. See cluster_rel() for comments on the required
+ * lock strength.
+ *
* index: index to cluster by, or NULL to rewrite in physical order.
*
- * On entry, heap and index (if one is given) must be open, and
- * AccessExclusiveLock held on them.
- * On exit, they are closed, but locks on them are not released.
+ * On entry, heap and index (if one is given) must be open, and the
+ * appropriate lock held on them (AccessExclusiveLock for exclusive processing
+ * and ShareUpdateExclusiveLock for concurrent processing).
+ *
+ * On exit, they are closed, but still locked with AccessExclusiveLock. (The
+ * function handles the lock upgrade if 'concurrent' is true.)
*/
static void
rebuild_relation(Relation OldHeap, Relation index, bool verbose,
- ClusterCommand cmd)
+ bool concurrent, Oid userid, ClusterCommand cmd)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid accessMethod = OldHeap->rd_rel->relam;
@@ -634,13 +854,55 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose,
Oid OIDNewHeap;
Relation NewHeap;
char relpersistence;
- bool is_system_catalog;
bool swap_toast_by_content;
TransactionId frozenXid;
MultiXactId cutoffMulti;
+ NameData slotname;
+ LogicalDecodingContext *ctx = NULL;
+ Snapshot snapshot = NULL;
+#if USE_ASSERT_CHECKING
+ LOCKMODE lmode;
+
+ lmode = !concurrent ? AccessExclusiveLock : ShareUpdateExclusiveLock;
+
+ Assert(CheckRelationLockedByMe(OldHeap, lmode, false) &&
+ (index == NULL || CheckRelationLockedByMe(index, lmode, false)));
+#endif
+
+ if (concurrent)
+ {
+ TupleDesc tupdesc;
- Assert(CheckRelationLockedByMe(OldHeap, AccessExclusiveLock, false) &&
- (index == NULL || CheckRelationLockedByMe(index, AccessExclusiveLock, false)));
+ /*
+ * REPACK CONCURRENTLY is not allowed in a transaction block, so this
+ * should never fire.
+ */
+ Assert(GetTopTransactionIdIfAny() == InvalidTransactionId);
+
+ /*
+ * A single backend should not execute multiple REPACK commands at a
+ * time, so use PID to make the slot unique.
+ */
+ snprintf(NameStr(slotname), NAMEDATALEN, "repack_%d", MyProcPid);
+
+ tupdesc = CreateTupleDescCopy(RelationGetDescr(OldHeap));
+
+ /*
+ * Prepare to capture the concurrent data changes.
+ *
+ * Note that this call waits for all transactions with XID already
+ * assigned to finish. If any of those transactions is waiting for a
+ * lock conflicting with ShareUpdateExclusiveLock on our table (e.g.
+ * it runs CREATE INDEX), we can end up in a deadlock. Not sure this
+ * risk is worth unlocking/locking the table (and its clustering
+ * index) and checking again if it's still eligible for REPACK
+ * CONCURRENTLY.
+ */
+ ctx = setup_logical_decoding(tableOid, NameStr(slotname), tupdesc);
+
+ snapshot = SnapBuildInitialSnapshotForRepack(ctx->snapshot_builder);
+ PushActiveSnapshot(snapshot);
+ }
if (index && cmd == CLUSTER_COMMAND_CLUSTER)
/* Mark the correct index as clustered */
@@ -648,7 +910,6 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose,
/* Remember info about rel before closing OldHeap */
relpersistence = OldHeap->rd_rel->relpersistence;
- is_system_catalog = IsSystemRelation(OldHeap);
/*
* Create the transient table that will receive the re-ordered data.
@@ -664,30 +925,67 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose,
NewHeap = table_open(OIDNewHeap, NoLock);
/* Copy the heap data into the new table in the desired order */
- copy_table_data(NewHeap, OldHeap, index, verbose,
+ copy_table_data(NewHeap, OldHeap, index, snapshot, ctx, verbose,
&swap_toast_by_content, &frozenXid, &cutoffMulti);
+ /* The historic snapshot won't be needed anymore. */
+ if (snapshot)
+ PopActiveSnapshot();
- /* Close relcache entries, but keep lock until transaction commit */
- table_close(OldHeap, NoLock);
- if (index)
- index_close(index, NoLock);
+ if (concurrent)
+ {
+ /*
+ * Push a snapshot that we will use to find old versions of rows when
+ * processing concurrent UPDATE and DELETE commands. (That snapshot
+ * should also be used by index expressions.)
+ */
+ PushActiveSnapshot(GetTransactionSnapshot());
- /*
- * Close the new relation so it can be dropped as soon as the storage is
- * swapped. The relation is not visible to others, so no need to unlock it
- * explicitly.
- */
- table_close(NewHeap, NoLock);
+ /*
+ * Make sure we can find the tuples just inserted when applying DML
+ * commands on top of those.
+ */
+ CommandCounterIncrement();
+ UpdateActiveSnapshotCommandId();
- /*
- * Swap the physical files of the target and transient tables, then
- * rebuild the target's indexes and throw away the transient table.
- */
- finish_heap_swap(tableOid, OIDNewHeap, is_system_catalog,
- swap_toast_by_content, false, true,
- frozenXid, cutoffMulti,
- relpersistence);
+ rebuild_relation_finish_concurrent(NewHeap, OldHeap, index,
+ ctx, swap_toast_by_content,
+ frozenXid, cutoffMulti);
+ PopActiveSnapshot();
+
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_FINAL_CLEANUP);
+
+ /* Done with decoding. */
+ cleanup_logical_decoding(ctx);
+ ReplicationSlotRelease();
+ ReplicationSlotDrop(NameStr(slotname), false);
+ }
+ else
+ {
+ bool is_system_catalog = IsSystemRelation(OldHeap);
+
+ /* Close relcache entries, but keep lock until transaction commit */
+ table_close(OldHeap, NoLock);
+ if (index)
+ index_close(index, NoLock);
+
+ /*
+ * Close the new relation so it can be dropped as soon as the storage
+ * is swapped. The relation is not visible to others, so no need to
+ * unlock it explicitly.
+ */
+ table_close(NewHeap, NoLock);
+
+ /*
+ * Swap the physical files of the target and transient tables, then
+ * rebuild the target's indexes and throw away the transient table.
+ */
+ finish_heap_swap(tableOid, OIDNewHeap, is_system_catalog,
+ swap_toast_by_content, false, true, true,
+ frozenXid, cutoffMulti,
+ relpersistence);
+ }
}
@@ -822,15 +1120,19 @@ make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
/*
* Do the physical copying of table data.
*
+ * 'snapshot' and 'decoding_ctx': see table_relation_copy_for_cluster(). Pass
+ * them iff concurrent processing is required.
+ *
* There are three output parameters:
* *pSwapToastByContent is set true if toast tables must be swapped by content.
* *pFreezeXid receives the TransactionId used as freeze cutoff point.
* *pCutoffMulti receives the MultiXactId used as a cutoff point.
*/
static void
-copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verbose,
- bool *pSwapToastByContent, TransactionId *pFreezeXid,
- MultiXactId *pCutoffMulti)
+copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
+ Snapshot snapshot, LogicalDecodingContext *decoding_ctx,
+ bool verbose, bool *pSwapToastByContent,
+ TransactionId *pFreezeXid, MultiXactId *pCutoffMulti)
{
Relation relRelation;
HeapTuple reltup;
@@ -848,6 +1150,8 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
PGRUsage ru0;
char *nspname;
+ bool concurrent = snapshot != NULL;
+
pg_rusage_init(&ru0);
/* Store a copy of the namespace name for logging purposes */
@@ -950,8 +1254,48 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
* provided, else plain seqscan.
*/
if (OldIndex != NULL && OldIndex->rd_rel->relam == BTREE_AM_OID)
+ {
+ ResourceOwner oldowner = NULL;
+ ResourceOwner resowner = NULL;
+
+ /*
+ * In the CONCURRENT case, use a dedicated resource owner so we don't
+ * leave any additional locks behind us that we cannot release easily.
+ */
+ if (concurrent)
+ {
+ Assert(CheckRelationLockedByMe(OldHeap, ShareUpdateExclusiveLock,
+ false));
+ Assert(CheckRelationLockedByMe(OldIndex, ShareUpdateExclusiveLock,
+ false));
+
+ resowner = ResourceOwnerCreate(CurrentResourceOwner,
+ "plan_cluster_use_sort");
+ oldowner = CurrentResourceOwner;
+ CurrentResourceOwner = resowner;
+ }
+
use_sort = plan_cluster_use_sort(RelationGetRelid(OldHeap),
RelationGetRelid(OldIndex));
+
+ if (concurrent)
+ {
+ CurrentResourceOwner = oldowner;
+
+ /*
+ * We are primarily concerned about locks, but if the planner
+ * happened to allocate any other resources, we should release
+ * them too because we're going to delete the whole resowner.
+ */
+ ResourceOwnerRelease(resowner, RESOURCE_RELEASE_BEFORE_LOCKS,
+ false, false);
+ ResourceOwnerRelease(resowner, RESOURCE_RELEASE_LOCKS,
+ false, false);
+ ResourceOwnerRelease(resowner, RESOURCE_RELEASE_AFTER_LOCKS,
+ false, false);
+ ResourceOwnerDelete(resowner);
+ }
+ }
else
use_sort = false;
@@ -980,7 +1324,9 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
* values (e.g. because the AM doesn't use freezing).
*/
table_relation_copy_for_cluster(OldHeap, NewHeap, OldIndex, use_sort,
- cutoffs.OldestXmin, &cutoffs.FreezeLimit,
+ cutoffs.OldestXmin, snapshot,
+ decoding_ctx,
+ &cutoffs.FreezeLimit,
&cutoffs.MultiXactCutoff,
&num_tuples, &tups_vacuumed,
&tups_recently_dead);
@@ -989,7 +1335,11 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
*pFreezeXid = cutoffs.FreezeLimit;
*pCutoffMulti = cutoffs.MultiXactCutoff;
- /* Reset rd_toastoid just to be tidy --- it shouldn't be looked at again */
+ /*
+ * Reset rd_toastoid just to be tidy --- it shouldn't be looked at again.
+ * In the CONCURRENTLY case, we need to set it again before applying the
+ * concurrent changes.
+ */
NewHeap->rd_toastoid = InvalidOid;
num_pages = RelationGetNumberOfBlocks(NewHeap);
@@ -1447,14 +1797,13 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
bool swap_toast_by_content,
bool check_constraints,
bool is_internal,
+ bool reindex,
TransactionId frozenXid,
MultiXactId cutoffMulti,
char newrelpersistence)
{
ObjectAddress object;
Oid mapped_tables[4];
- int reindex_flags;
- ReindexParams reindex_params = {0};
int i;
/* Report that we are now swapping relation files */
@@ -1480,39 +1829,47 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
if (is_system_catalog)
CacheInvalidateCatalog(OIDOldHeap);
- /*
- * Rebuild each index on the relation (but not the toast table, which is
- * all-new at this point). It is important to do this before the DROP
- * step because if we are processing a system catalog that will be used
- * during DROP, we want to have its indexes available. There is no
- * advantage to the other order anyway because this is all transactional,
- * so no chance to reclaim disk space before commit. We do not need a
- * final CommandCounterIncrement() because reindex_relation does it.
- *
- * Note: because index_build is called via reindex_relation, it will never
- * set indcheckxmin true for the indexes. This is OK even though in some
- * sense we are building new indexes rather than rebuilding existing ones,
- * because the new heap won't contain any HOT chains at all, let alone
- * broken ones, so it can't be necessary to set indcheckxmin.
- */
- reindex_flags = REINDEX_REL_SUPPRESS_INDEX_USE;
- if (check_constraints)
- reindex_flags |= REINDEX_REL_CHECK_CONSTRAINTS;
+ if (reindex)
+ {
+ int reindex_flags;
+ ReindexParams reindex_params = {0};
- /*
- * Ensure that the indexes have the same persistence as the parent
- * relation.
- */
- if (newrelpersistence == RELPERSISTENCE_UNLOGGED)
- reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
- else if (newrelpersistence == RELPERSISTENCE_PERMANENT)
- reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+ /*
+ * Rebuild each index on the relation (but not the toast table, which
+ * is all-new at this point). It is important to do this before the
+ * DROP step because if we are processing a system catalog that will
+ * be used during DROP, we want to have its indexes available. There
+ * is no advantage to the other order anyway because this is all
+ * transactional, so no chance to reclaim disk space before commit. We
+ * do not need a final CommandCounterIncrement() because
+ * reindex_relation does it.
+ *
+ * Note: because index_build is called via reindex_relation, it will
+ * never set indcheckxmin true for the indexes. This is OK even
+ * though in some sense we are building new indexes rather than
+ * rebuilding existing ones, because the new heap won't contain any
+ * HOT chains at all, let alone broken ones, so it can't be necessary
+ * to set indcheckxmin.
+ */
+ reindex_flags = REINDEX_REL_SUPPRESS_INDEX_USE;
+ if (check_constraints)
+ reindex_flags |= REINDEX_REL_CHECK_CONSTRAINTS;
- /* Report that we are now reindexing relations */
- pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
- PROGRESS_REPACK_PHASE_REBUILD_INDEX);
+ /*
+ * Ensure that the indexes have the same persistence as the parent
+ * relation.
+ */
+ if (newrelpersistence == RELPERSISTENCE_UNLOGGED)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
+ else if (newrelpersistence == RELPERSISTENCE_PERMANENT)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
- reindex_relation(NULL, OIDOldHeap, reindex_flags, &reindex_params);
+ /* Report that we are now reindexing relations */
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_REBUILD_INDEX);
+
+ reindex_relation(NULL, OIDOldHeap, reindex_flags, &reindex_params);
+ }
/* Report that we are now doing clean up */
pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
@@ -1834,90 +2191,1315 @@ cluster_is_permitted_for_relation(Oid relid, Oid userid, ClusterCommand cmd)
return false;
}
+#define REPL_PLUGIN_NAME "pgoutput_repack"
+
/*
- * REPACK is intended to be a replacement of both CLUSTER and VACUUM FULL.
+ * Call this function before REPACK CONCURRENTLY starts to set up logical
+ * decoding. It makes sure that other users of the table put enough
+ * information into WAL.
+ *
+ * The point is that at various places we expect that the table we're
+ * processing is treated like a system catalog. For example, we need to be
+ * able to scan it using a "historic snapshot" anytime during the processing
+ * (as opposed to scanning only at the start point of the decoding, as logical
+ * replication does during initial table synchronization), in order to apply
+ * concurrent UPDATE / DELETE commands.
+ *
+ * Note that the TOAST table needs no attention here, as it's not scanned
+ * using a historic snapshot.
*/
-void
-repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
+static void
+begin_concurrent_repack(Relation rel)
{
- ListCell *lc;
- ClusterParams params = {0};
- bool verbose = false;
- Relation rel = NULL;
- Oid indexOid = InvalidOid;
- MemoryContext repack_context;
- List *rtcs;
+ Oid toastrelid;
- /* Parse option list */
- foreach(lc, stmt->params)
+ /* Avoid logical decoding of other relations by this backend. */
+ repacked_rel_locator = rel->rd_locator;
+ toastrelid = rel->rd_rel->reltoastrelid;
+ if (OidIsValid(toastrelid))
{
- DefElem *opt = (DefElem *) lfirst(lc);
+ Relation toastrel;
- if (strcmp(opt->defname, "verbose") == 0)
- verbose = defGetBoolean(opt);
- else
- ereport(ERROR,
- (errcode(ERRCODE_SYNTAX_ERROR),
- errmsg("unrecognized REPACK option \"%s\"",
- opt->defname),
- parser_errposition(pstate, opt->location)));
+ /* Avoid logical decoding of other TOAST relations. */
+ toastrel = table_open(toastrelid, AccessShareLock);
+ repacked_rel_toast_locator = toastrel->rd_locator;
+ table_close(toastrel, AccessShareLock);
}
+}
- params.options = (verbose ? CLUOPT_VERBOSE : 0);
+/*
+ * Call this when done with REPACK CONCURRENTLY.
+ */
+static void
+end_concurrent_repack(void)
+{
+ /*
+ * Restore normal function of (future) logical decoding for this backend.
+ */
+ repacked_rel_locator.relNumber = InvalidOid;
+ repacked_rel_toast_locator.relNumber = InvalidOid;
+}
- if (stmt->relation != NULL)
- {
- /* This is the single-relation case. */
- rel = process_single_relation(stmt->relation, stmt->indexname,
- ¶ms, CLUSTER_COMMAND_REPACK,
- &indexOid);
- if (rel == NULL)
- return;
- }
+/*
+ * This function is much like pg_create_logical_replication_slot() except that
+ * the new slot is neither released (if anyone else could read changes from
+ * our slot, we could miss changes other backends do while we copy the
+ * existing data into temporary table), nor persisted (it's easier to handle
+ * crash by restarting all the work from scratch).
+ */
+static LogicalDecodingContext *
+setup_logical_decoding(Oid relid, const char *slotname, TupleDesc tupdesc)
+{
+ LogicalDecodingContext *ctx;
+ RepackDecodingState *dstate;
/*
- * By here, we know we are in a multi-table situation. In order to avoid
- * holding locks for too long, we want to process each table in its own
- * transaction. This forces us to disallow running inside a user
- * transaction block.
+ * Check if we can use logical decoding.
*/
- PreventInTransactionBlock(isTopLevel, "REPACK");
+ CheckSlotPermissions();
+ CheckLogicalDecodingRequirements();
- /* Also, we need a memory context to hold our list of relations */
- repack_context = AllocSetContextCreate(PortalContext,
- "Repack",
- ALLOCSET_DEFAULT_SIZES);
+ /* RS_TEMPORARY so that the slot gets cleaned up on ERROR. */
+ ReplicationSlotCreate(slotname, true, RS_TEMPORARY, false, false, false);
- params.options |= CLUOPT_RECHECK;
- if (rel != NULL)
- {
- Oid relid;
- bool rel_is_index;
+ /*
+ * None of the prepare_write, do_write and update_progress callbacks is
+ * useful for us.
+ *
+ * Regarding the value of need_full_snapshot, we pass false because the
+ * table we are processing is present in RepackedRelsHash and therefore,
+ * regarding logical decoding, treated like a catalog.
+ */
+ ctx = CreateInitDecodingContext(REPL_PLUGIN_NAME,
+ NIL,
+ false,
+ InvalidXLogRecPtr,
+ XL_ROUTINE(.page_read = read_local_xlog_page,
+ .segment_open = wal_segment_open,
+ .segment_close = wal_segment_close),
+ NULL, NULL, NULL);
- Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
+ /*
+ * We don't have control on setting fast_forward, so at least check it.
+ */
+ Assert(!ctx->fast_forward);
- if (OidIsValid(indexOid))
- {
- relid = indexOid;
- rel_is_index = true;
- }
- else
+ DecodingContextFindStartpoint(ctx);
+
+ /* Some WAL records should have been read. */
+ Assert(ctx->reader->EndRecPtr != InvalidXLogRecPtr);
+
+ XLByteToSeg(ctx->reader->EndRecPtr, repack_current_segment,
+ wal_segment_size);
+
+ /*
+ * Setup structures to store decoded changes.
+ */
+ dstate = palloc0(sizeof(RepackDecodingState));
+ dstate->relid = relid;
+ dstate->tstore = tuplestore_begin_heap(false, false,
+ maintenance_work_mem);
+
+ dstate->tupdesc = tupdesc;
+
+ /* Initialize the descriptor to store the changes ... */
+ dstate->tupdesc_change = CreateTemplateTupleDesc(1);
+
+ TupleDescInitEntry(dstate->tupdesc_change, 1, NULL, BYTEAOID, -1, 0);
+ /* ... as well as the corresponding slot. */
+ dstate->tsslot = MakeSingleTupleTableSlot(dstate->tupdesc_change,
+ &TTSOpsMinimalTuple);
+
+ dstate->resowner = ResourceOwnerCreate(CurrentResourceOwner,
+ "logical decoding");
+
+ ctx->output_writer_private = dstate;
+ return ctx;
+}
+
+/*
+ * Retrieve tuple from ConcurrentChange structure.
+ *
+ * The input data starts with the structure but it might not be appropriately
+ * aligned.
+ */
+static HeapTuple
+get_changed_tuple(char *change)
+{
+ HeapTupleData tup_data;
+ HeapTuple result;
+ char *src;
+
+ /*
+ * Ensure alignment before accessing the fields. (This is why we can't use
+ * heap_copytuple() instead of this function.)
+ */
+ src = change + offsetof(ConcurrentChange, tup_data);
+ memcpy(&tup_data, src, sizeof(HeapTupleData));
+
+ result = (HeapTuple) palloc(HEAPTUPLESIZE + tup_data.t_len);
+ memcpy(result, &tup_data, sizeof(HeapTupleData));
+ result->t_data = (HeapTupleHeader) ((char *) result + HEAPTUPLESIZE);
+ src = change + SizeOfConcurrentChange;
+ memcpy(result->t_data, src, result->t_len);
+
+ return result;
+}
+
+/*
+ * Decode logical changes from the WAL sequence up to end_of_wal.
+ */
+void
+repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
+ XLogRecPtr end_of_wal)
+{
+ RepackDecodingState *dstate;
+ ResourceOwner resowner_old;
+
+ /*
+ * Invalidate the "present" cache before moving to "(recent) history".
+ */
+ InvalidateSystemCaches();
+
+ dstate = (RepackDecodingState *) ctx->output_writer_private;
+ resowner_old = CurrentResourceOwner;
+ CurrentResourceOwner = dstate->resowner;
+
+ PG_TRY();
+ {
+ while (ctx->reader->EndRecPtr < end_of_wal)
{
- relid = RelationGetRelid(rel);
- rel_is_index = false;
- }
- rtcs = get_tables_to_cluster_partitioned(repack_context, relid,
- rel_is_index,
- CLUSTER_COMMAND_REPACK);
+ XLogRecord *record;
+ XLogSegNo segno_new;
+ char *errm = NULL;
+ XLogRecPtr end_lsn;
- /* close relation, releasing lock on parent table */
- table_close(rel, AccessExclusiveLock);
- }
- else
- rtcs = get_tables_to_repack(repack_context);
+ record = XLogReadRecord(ctx->reader, &errm);
+ if (errm)
+ elog(ERROR, "%s", errm);
+
+ if (record != NULL)
+ LogicalDecodingProcessRecord(ctx, ctx->reader);
+
+ /*
+ * If WAL segment boundary has been crossed, inform the decoding
+ * system that the catalog_xmin can advance. (We can confirm more
+ * often, but filling a single WAL segment should not take much
+ * time.)
+ */
+ end_lsn = ctx->reader->EndRecPtr;
+ XLByteToSeg(end_lsn, segno_new, wal_segment_size);
+ if (segno_new != repack_current_segment)
+ {
+ LogicalConfirmReceivedLocation(end_lsn);
+ elog(DEBUG1, "REPACK: confirmed receive location %X/%X",
+ (uint32) (end_lsn >> 32), (uint32) end_lsn);
+ repack_current_segment = segno_new;
+ }
+
+ CHECK_FOR_INTERRUPTS();
+ }
+ InvalidateSystemCaches();
+ CurrentResourceOwner = resowner_old;
+ }
+ PG_CATCH();
+ {
+ /* clear all timetravel entries */
+ InvalidateSystemCaches();
+ CurrentResourceOwner = resowner_old;
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+}
+
+/*
+ * Apply changes that happened during the initial load.
+ *
+ * Scan key is passed by caller, so it does not have to be constructed
+ * multiple times. Key entries have all fields initialized, except for
+ * sk_argument.
+ */
+static void
+apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
+ ScanKey key, int nkeys, IndexInsertState *iistate)
+{
+ TupleTableSlot *index_slot,
+ *ident_slot;
+ HeapTuple tup_old = NULL;
+
+ if (dstate->nchanges == 0)
+ return;
+
+ /* TupleTableSlot is needed to pass the tuple to ExecInsertIndexTuples(). */
+ index_slot = MakeSingleTupleTableSlot(dstate->tupdesc, &TTSOpsHeapTuple);
+
+ /* A slot to fetch tuples from identity index. */
+ ident_slot = table_slot_create(rel, NULL);
+
+ while (tuplestore_gettupleslot(dstate->tstore, true, false,
+ dstate->tsslot))
+ {
+ bool shouldFree;
+ HeapTuple tup_change,
+ tup,
+ tup_exist;
+ char *change_raw,
+ *src;
+ ConcurrentChange change;
+ bool isnull[1];
+ Datum values[1];
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Get the change from the single-column tuple. */
+ tup_change = ExecFetchSlotHeapTuple(dstate->tsslot, false, &shouldFree);
+ heap_deform_tuple(tup_change, dstate->tupdesc_change, values, isnull);
+ Assert(!isnull[0]);
+
+ /* Make sure we access aligned data. */
+ change_raw = (char *) DatumGetByteaP(values[0]);
+ src = (char *) VARDATA(change_raw);
+ memcpy(&change, src, SizeOfConcurrentChange);
+
+ /* TRUNCATE change contains no tuple, so process it separately. */
+ if (change.kind == CHANGE_TRUNCATE)
+ {
+ /*
+ * All the things that ExecuteTruncateGuts() does (such as firing
+ * triggers or handling the DROP_CASCADE behavior) should have
+ * taken place on the source relation. Thus we only do the actual
+ * truncation of the new relation (and its indexes).
+ */
+ heap_truncate_one_rel(rel);
+
+ pfree(tup_change);
+ continue;
+ }
+
+ /*
+ * Extract the tuple from the change. The tuple is copied here because
+ * it might be assigned to 'tup_old', in which case it needs to
+ * survive into the next iteration.
+ */
+ tup = get_changed_tuple(src);
+
+ if (change.kind == CHANGE_UPDATE_OLD)
+ {
+ Assert(tup_old == NULL);
+ tup_old = tup;
+ }
+ else if (change.kind == CHANGE_INSERT)
+ {
+ Assert(tup_old == NULL);
+
+ apply_concurrent_insert(rel, &change, tup, iistate, index_slot);
+
+ pfree(tup);
+ }
+ else if (change.kind == CHANGE_UPDATE_NEW ||
+ change.kind == CHANGE_DELETE)
+ {
+ IndexScanDesc ind_scan = NULL;
+ HeapTuple tup_key;
+
+ if (change.kind == CHANGE_UPDATE_NEW)
+ {
+ tup_key = tup_old != NULL ? tup_old : tup;
+ }
+ else
+ {
+ Assert(tup_old == NULL);
+ tup_key = tup;
+ }
+
+ /*
+ * Find the tuple to be updated or deleted.
+ */
+ tup_exist = find_target_tuple(rel, key, nkeys, tup_key,
+ iistate, ident_slot, &ind_scan);
+ if (tup_exist == NULL)
+ elog(ERROR, "Failed to find target tuple");
+
+ if (change.kind == CHANGE_UPDATE_NEW)
+ apply_concurrent_update(rel, tup, tup_exist, &change, iistate,
+ index_slot);
+ else
+ apply_concurrent_delete(rel, tup_exist, &change);
+
+ if (tup_old != NULL)
+ {
+ pfree(tup_old);
+ tup_old = NULL;
+ }
+
+ pfree(tup);
+ index_endscan(ind_scan);
+ }
+ else
+ elog(ERROR, "Unrecognized kind of change: %d", change.kind);
+
+ /*
+ * If a change was applied now, increment CID for next writes and
+ * update the snapshot so it sees the changes we've applied so far.
+ */
+ if (change.kind != CHANGE_UPDATE_OLD)
+ {
+ CommandCounterIncrement();
+ UpdateActiveSnapshotCommandId();
+ }
+
+ /* TTSOpsMinimalTuple has .get_heap_tuple==NULL. */
+ Assert(shouldFree);
+ pfree(tup_change);
+ }
+
+ tuplestore_clear(dstate->tstore);
+ dstate->nchanges = 0;
+
+ /* Cleanup. */
+ ExecDropSingleTupleTableSlot(index_slot);
+ ExecDropSingleTupleTableSlot(ident_slot);
+}
+
+static void
+apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
+ IndexInsertState *iistate, TupleTableSlot *index_slot)
+{
+ List *recheck;
+
+ /*
+ * Like simple_heap_insert(), but make sure that the INSERT is not
+ * logically decoded - see reform_and_rewrite_tuple() for more
+ * information.
+ */
+ heap_insert(rel, tup, GetCurrentCommandId(true), HEAP_INSERT_NO_LOGICAL,
+ NULL);
+
+ /*
+ * Update indexes.
+ *
+ * In case functions in the index need the active snapshot and caller
+ * hasn't set one.
+ */
+ ExecStoreHeapTuple(tup, index_slot, false);
+ recheck = ExecInsertIndexTuples(iistate->rri,
+ index_slot,
+ iistate->estate,
+ false, /* update */
+ false, /* noDupErr */
+ NULL, /* specConflict */
+ NIL, /* arbiterIndexes */
+ false /* onlySummarizing */
+ );
+
+ /*
+ * If recheck is required, it must have been performed on the source
+ * relation by now. (All the logical changes we process here are already
+ * committed.)
+ */
+ list_free(recheck);
+
+ pgstat_progress_incr_param(PROGRESS_REPACK_HEAP_TUPLES_INSERTED, 1);
+}
+
+static void
+apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
+ ConcurrentChange *change, IndexInsertState *iistate,
+ TupleTableSlot *index_slot)
+{
+ LockTupleMode lockmode;
+ TM_FailureData tmfd;
+ TU_UpdateIndexes update_indexes;
+ TM_Result res;
+ List *recheck;
+
+ /*
+ * Write the new tuple into the new heap. ('tup' gets the TID assigned
+ * here.)
+ *
+ * Do it like in simple_heap_update(), except for 'wal_logical' (and
+ * except for 'wait').
+ */
+ res = heap_update(rel, &tup_target->t_self, tup,
+ GetCurrentCommandId(true),
+ InvalidSnapshot,
+ false, /* no wait - only we are doing changes */
+ &tmfd, &lockmode, &update_indexes,
+ false /* wal_logical */);
+ if (res != TM_Ok)
+ ereport(ERROR, (errmsg("failed to apply concurrent UPDATE")));
+
+ ExecStoreHeapTuple(tup, index_slot, false);
+
+ if (update_indexes != TU_None)
+ {
+ recheck = ExecInsertIndexTuples(iistate->rri,
+ index_slot,
+ iistate->estate,
+ true, /* update */
+ false, /* noDupErr */
+ NULL, /* specConflict */
+ NIL, /* arbiterIndexes */
+ /* onlySummarizing */
+ update_indexes == TU_Summarizing);
+ list_free(recheck);
+ }
+
+ pgstat_progress_incr_param(PROGRESS_REPACK_HEAP_TUPLES_UPDATED, 1);
+}
+
+static void
+apply_concurrent_delete(Relation rel, HeapTuple tup_target,
+ ConcurrentChange *change)
+{
+ TM_Result res;
+ TM_FailureData tmfd;
+
+ /*
+ * Delete tuple from the new heap.
+ *
+ * Do it like in simple_heap_delete(), except for 'wal_logical' (and
+ * except for 'wait').
+ */
+ res = heap_delete(rel, &tup_target->t_self, GetCurrentCommandId(true),
+ InvalidSnapshot, false,
+ &tmfd,
+ false, /* no wait - only we are doing changes */
+ false /* wal_logical */);
+
+ if (res != TM_Ok)
+ ereport(ERROR, (errmsg("failed to apply concurrent DELETE")));
+
+ pgstat_progress_incr_param(PROGRESS_REPACK_HEAP_TUPLES_DELETED, 1);
+}
+
+/*
+ * Find the tuple to be updated or deleted.
+ *
+ * 'key' is a pre-initialized scan key, into which the function will put the
+ * key values.
+ *
+ * 'tup_key' is a tuple containing the key values for the scan.
+ *
+ * On exit, '*scan_p' contains the scan descriptor used. The caller must close
+ * it when the returned tuple is no longer needed.
+ */
+static HeapTuple
+find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
+ IndexInsertState *iistate,
+ TupleTableSlot *ident_slot, IndexScanDesc *scan_p)
+{
+ IndexScanDesc scan;
+ Form_pg_index ident_form;
+ int2vector *ident_indkey;
+ HeapTuple result = NULL;
+
+ /* XXX no instrumentation for now */
+ scan = index_beginscan(rel, iistate->ident_index, GetActiveSnapshot(),
+ NULL, nkeys, 0);
+ *scan_p = scan;
+ index_rescan(scan, key, nkeys, NULL, 0);
+
+ /* Info needed to retrieve key values from heap tuple. */
+ ident_form = iistate->ident_index->rd_index;
+ ident_indkey = &ident_form->indkey;
+
+ /* Use the incoming tuple to finalize the scan key. */
+ for (int i = 0; i < scan->numberOfKeys; i++)
+ {
+ ScanKey entry;
+ bool isnull;
+ int16 attno_heap;
+
+ entry = &scan->keyData[i];
+ attno_heap = ident_indkey->values[i];
+ entry->sk_argument = heap_getattr(tup_key,
+ attno_heap,
+ rel->rd_att,
+ &isnull);
+ Assert(!isnull);
+ }
+ if (index_getnext_slot(scan, ForwardScanDirection, ident_slot))
+ {
+ bool shouldFree;
+
+ result = ExecFetchSlotHeapTuple(ident_slot, false, &shouldFree);
+ /* TTSOpsBufferHeapTuple has .get_heap_tuple != NULL. */
+ Assert(!shouldFree);
+ }
+
+ return result;
+}
+
+/*
+ * Decode and apply concurrent changes.
+ *
+ * Pass rel_src iff its reltoastrelid is needed.
+ */
+static void
+process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
+ Relation rel_dst, Relation rel_src, ScanKey ident_key,
+ int ident_key_nentries, IndexInsertState *iistate)
+{
+ RepackDecodingState *dstate;
+
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_CATCH_UP);
+
+ dstate = (RepackDecodingState *) ctx->output_writer_private;
+
+ repack_decode_concurrent_changes(ctx, end_of_wal);
+
+ if (dstate->nchanges == 0)
+ return;
+
+ PG_TRY();
+ {
+ /*
+ * Make sure that TOAST values can eventually be accessed via the old
+ * relation - see comment in copy_table_data().
+ */
+ if (rel_src)
+ rel_dst->rd_toastoid = rel_src->rd_rel->reltoastrelid;
+
+ apply_concurrent_changes(dstate, rel_dst, ident_key,
+ ident_key_nentries, iistate);
+ }
+ PG_FINALLY();
+ {
+ if (rel_src)
+ rel_dst->rd_toastoid = InvalidOid;
+ }
+ PG_END_TRY();
+}
+
+static IndexInsertState *
+get_index_insert_state(Relation relation, Oid ident_index_id)
+{
+ EState *estate;
+ int i;
+ IndexInsertState *result;
+
+ result = (IndexInsertState *) palloc0(sizeof(IndexInsertState));
+ estate = CreateExecutorState();
+
+ result->rri = (ResultRelInfo *) palloc(sizeof(ResultRelInfo));
+ InitResultRelInfo(result->rri, relation, 0, 0, 0);
+ ExecOpenIndices(result->rri, false);
+
+ /*
+ * Find the relcache entry of the identity index so that we spend no extra
+ * effort to open / close it.
+ */
+ for (i = 0; i < result->rri->ri_NumIndices; i++)
+ {
+ Relation ind_rel;
+
+ ind_rel = result->rri->ri_IndexRelationDescs[i];
+ if (ind_rel->rd_id == ident_index_id)
+ result->ident_index = ind_rel;
+ }
+ if (result->ident_index == NULL)
+ elog(ERROR, "Failed to open identity index");
+
+ /* Only initialize fields needed by ExecInsertIndexTuples(). */
+ result->estate = estate;
+
+ return result;
+}
+
+/*
+ * Build scan key to process logical changes.
+ */
+static ScanKey
+build_identity_key(Oid ident_idx_oid, Relation rel_src, int *nentries)
+{
+ Relation ident_idx_rel;
+ Form_pg_index ident_idx;
+ int n,
+ i;
+ ScanKey result;
+
+ Assert(OidIsValid(ident_idx_oid));
+ ident_idx_rel = index_open(ident_idx_oid, AccessShareLock);
+ ident_idx = ident_idx_rel->rd_index;
+ n = ident_idx->indnatts;
+ result = (ScanKey) palloc(sizeof(ScanKeyData) * n);
+ for (i = 0; i < n; i++)
+ {
+ ScanKey entry;
+ int16 relattno;
+ Form_pg_attribute att;
+ Oid opfamily,
+ opcintype,
+ opno,
+ opcode;
+
+ entry = &result[i];
+ relattno = ident_idx->indkey.values[i];
+ if (relattno >= 1)
+ {
+ TupleDesc desc;
+
+ desc = rel_src->rd_att;
+ att = TupleDescAttr(desc, relattno - 1);
+ }
+ else
+ elog(ERROR, "Unexpected attribute number %d in index", relattno);
+
+ opfamily = ident_idx_rel->rd_opfamily[i];
+ opcintype = ident_idx_rel->rd_opcintype[i];
+ opno = get_opfamily_member(opfamily, opcintype, opcintype,
+ BTEqualStrategyNumber);
+
+ if (!OidIsValid(opno))
+ elog(ERROR, "Failed to find = operator for type %u", opcintype);
+
+ opcode = get_opcode(opno);
+ if (!OidIsValid(opcode))
+ elog(ERROR, "Failed to find = operator for operator %u", opno);
+
+ /* Initialize everything but argument. */
+ ScanKeyInit(entry,
+ i + 1,
+ BTEqualStrategyNumber, opcode,
+ (Datum) NULL);
+ entry->sk_collation = att->attcollation;
+ }
+ index_close(ident_idx_rel, AccessShareLock);
+
+ *nentries = n;
+ return result;
+}
+
+static void
+free_index_insert_state(IndexInsertState *iistate)
+{
+ ExecCloseIndices(iistate->rri);
+ FreeExecutorState(iistate->estate);
+ pfree(iistate->rri);
+ pfree(iistate);
+}
+
+static void
+cleanup_logical_decoding(LogicalDecodingContext *ctx)
+{
+ RepackDecodingState *dstate;
+
+ dstate = (RepackDecodingState *) ctx->output_writer_private;
+
+ ExecDropSingleTupleTableSlot(dstate->tsslot);
+ FreeTupleDesc(dstate->tupdesc_change);
+ FreeTupleDesc(dstate->tupdesc);
+ tuplestore_end(dstate->tstore);
+
+ FreeDecodingContext(ctx);
+}
+
+/*
+ * The final steps of rebuild_relation() for concurrent processing.
+ *
+ * On entry, NewHeap is locked in AccessExclusiveLock mode. OldHeap and its
+ * clustering index (if one is passed) are still locked in a mode that allows
+ * concurrent data changes. On exit, both tables and their indexes are closed,
+ * but locked in AccessExclusiveLock mode.
+ */
+static void
+rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
+ Relation cl_index,
+ LogicalDecodingContext *ctx,
+ bool swap_toast_by_content,
+ TransactionId frozenXid,
+ MultiXactId cutoffMulti)
+{
+ LOCKMODE lockmode_old PG_USED_FOR_ASSERTS_ONLY;
+ List *ind_oids_new;
+ Oid old_table_oid = RelationGetRelid(OldHeap);
+ Oid new_table_oid = RelationGetRelid(NewHeap);
+ List *ind_oids_old = RelationGetIndexList(OldHeap);
+ ListCell *lc,
+ *lc2;
+ char relpersistence;
+ bool is_system_catalog;
+ Oid ident_idx_old,
+ ident_idx_new;
+ IndexInsertState *iistate;
+ ScanKey ident_key;
+ int ident_key_nentries;
+ XLogRecPtr wal_insert_ptr,
+ end_of_wal;
+ char dummy_rec_data = '\0';
+ Relation *ind_refs,
+ *ind_refs_p;
+ int nind;
+
+ /* Like in cluster_rel(). */
+ lockmode_old = ShareUpdateExclusiveLock;
+ Assert(CheckRelationLockedByMe(OldHeap, lockmode_old, false));
+ Assert(cl_index == NULL ||
+ CheckRelationLockedByMe(cl_index, lockmode_old, false));
+ /* This is expected from the caller. */
+ Assert(CheckRelationLockedByMe(NewHeap, AccessExclusiveLock, false));
+
+ ident_idx_old = RelationGetReplicaIndex(OldHeap);
+
+ /*
+ * Unlike the exclusive case, we build new indexes for the new relation
+ * rather than swapping the storage and reindexing the old relation. The
+ * point is that the index build can take some time, so we do it before we
+ * get AccessExclusiveLock on the old heap and therefore we cannot swap
+ * the heap storage yet.
+ *
+ * index_create() will lock the new indexes using AccessExclusiveLock - no
+ * need to change that.
+ *
+ * We assume that ShareUpdateExclusiveLock on the table prevents anyone
+ * from dropping the existing indexes or adding new ones, so the lists of
+ * old and new indexes should match at swap time. On the other hand, we
+ * do not block ALTER INDEX commands that do not require table lock
+ * (e.g. ALTER INDEX ... SET ...).
+ *
+ * XXX Should we check at the end of our work if another transaction
+ * executed such a command and issue a NOTICE that we might have discarded
+ * its effects? (For example, if someone changes a storage parameter after
+ * we have created the new index, the new value of that parameter is lost.)
+ * Alternatively, we can lock all the indexes now in a mode that blocks
+ * all the ALTER INDEX commands (ShareUpdateExclusiveLock ?), and keep
+ * them locked till the end of the transactions. That might increase the
+ * risk of deadlock during the lock upgrade below, however SELECT / DML
+ * queries should not be involved in such a deadlock.
+ */
+ ind_oids_new = build_new_indexes(NewHeap, OldHeap, ind_oids_old);
+
+ /*
+ * Processing shouldn't start without a valid identity index.
+ */
+ Assert(OidIsValid(ident_idx_old));
+
+ /* Find "identity index" on the new relation. */
+ ident_idx_new = InvalidOid;
+ forboth(lc, ind_oids_old, lc2, ind_oids_new)
+ {
+ Oid ind_old = lfirst_oid(lc);
+ Oid ind_new = lfirst_oid(lc2);
+
+ if (ident_idx_old == ind_old)
+ {
+ ident_idx_new = ind_new;
+ break;
+ }
+ }
+ if (!OidIsValid(ident_idx_new))
+ /*
+ * Should not happen, given our lock on the old relation.
+ */
+ ereport(ERROR,
+ (errmsg("Identity index missing on the new relation")));
+
+ /* Executor state to update indexes. */
+ iistate = get_index_insert_state(NewHeap, ident_idx_new);
+
+ /*
+ * Build scan key that we'll use to look for rows to be updated / deleted
+ * during logical decoding.
+ */
+ ident_key = build_identity_key(ident_idx_new, OldHeap, &ident_key_nentries);
+
+ /*
+ * Flush all WAL records inserted so far (possibly except for the last
+ * incomplete page, see GetInsertRecPtr), to minimize the amount of data
+ * we need to flush while holding exclusive lock on the source table.
+ */
+ wal_insert_ptr = GetInsertRecPtr();
+ XLogFlush(wal_insert_ptr);
+ end_of_wal = GetFlushRecPtr(NULL);
+
+ /*
+ * Apply concurrent changes first time, to minimize the time we need to
+ * hold AccessExclusiveLock. (Quite some amount of WAL could have been
+ * written during the data copying and index creation.)
+ */
+ process_concurrent_changes(ctx, end_of_wal, NewHeap,
+ swap_toast_by_content ? OldHeap : NULL,
+ ident_key, ident_key_nentries, iistate);
+
+ /*
+ * Acquire AccessExclusiveLock on the table, its TOAST relation (if there
+ * is one) and all its indexes, so that we can swap the files.
+ *
+ * Before that, unlock the index temporarily to avoid deadlock in case
+ * another transaction is trying to lock it while holding the lock on the
+ * table.
+ */
+ if (cl_index)
+ {
+ index_close(cl_index, ShareUpdateExclusiveLock);
+ cl_index = NULL;
+ }
+ /* For the same reason, unlock the TOAST relation temporarily. */
+ if (OidIsValid(OldHeap->rd_rel->reltoastrelid))
+ UnlockRelationOid(OldHeap->rd_rel->reltoastrelid, AccessShareLock);
+ /* Finally lock the table */
+ LockRelationOid(old_table_oid, AccessExclusiveLock);
+
+ /*
+ * Lock all indexes now, not only the clustering one: all indexes need to
+ * have their files swapped. While doing that, store their relation
+ * references in an array, to handle predicate locks below.
+ */
+ ind_refs_p = ind_refs = palloc_array(Relation, list_length(ind_oids_old));
+ nind = 0;
+ foreach(lc, ind_oids_old)
+ {
+ Oid ind_oid;
+ Relation index;
+
+ ind_oid = lfirst_oid(lc);
+ index = index_open(ind_oid, AccessExclusiveLock);
+ /*
+ * TODO 1) Do we need to check if ALTER INDEX was executed since the
+ * new index was created in build_new_indexes()? 2) Specifically for
+ * the clustering index, should check_index_is_clusterable() be called
+ * here? (Not sure about the latter: ShareUpdateExclusiveLock on the
+ * table probably blocks all commands that affect the result of
+ * check_index_is_clusterable().)
+ */
+ *ind_refs_p = index;
+ ind_refs_p++;
+ nind++;
+ }
+
+ /*
+ * In addition, lock the OldHeap's TOAST relation exclusively - again, the
+ * lock is needed to swap the files.
+ */
+ if (OidIsValid(OldHeap->rd_rel->reltoastrelid))
+ LockRelationOid(OldHeap->rd_rel->reltoastrelid, AccessExclusiveLock);
+
+ /*
+ * Tuples and pages of the old heap will be gone, but the heap will stay.
+ */
+ TransferPredicateLocksToHeapRelation(OldHeap);
+ /* The same for indexes. */
+ for (int i = 0; i < nind; i++)
+ {
+ Relation index = ind_refs[i];
+
+ TransferPredicateLocksToHeapRelation(index);
+
+ /*
+ * References to indexes on the old relation are not needed anymore,
+ * however locks stay till the end of the transaction.
+ */
+ index_close(index, NoLock);
+ }
+ pfree(ind_refs);
+
+ /*
+ * Flush anything we see in WAL, to make sure that all changes committed
+ * while we were waiting for the exclusive lock are available for
+ * decoding. This should not be necessary if all backends had
+ * synchronous_commit set, but we can't rely on this setting.
+ *
+ * Unfortunately, GetInsertRecPtr() may lag behind the actual insert
+ * position, and GetLastImportantRecPtr() points at the start of the last
+ * record rather than at the end. Thus the simplest way to determine the
+ * insert position is to insert a dummy record and use its LSN.
+ *
+ * XXX Consider using GetLastImportantRecPtr() and adding the size of the
+ * last record (plus the total size of all the page headers the record
+ * spans)?
+ */
+ XLogBeginInsert();
+ XLogRegisterData(&dummy_rec_data, 1);
+ wal_insert_ptr = XLogInsert(RM_XLOG_ID, XLOG_NOOP);
+ XLogFlush(wal_insert_ptr);
+ end_of_wal = GetFlushRecPtr(NULL);
+
+ /* Apply the concurrent changes again. */
+ process_concurrent_changes(ctx, end_of_wal, NewHeap,
+ swap_toast_by_content ? OldHeap : NULL,
+ ident_key, ident_key_nentries, iistate);
+
+ /* Remember info about rel before closing OldHeap */
+ relpersistence = OldHeap->rd_rel->relpersistence;
+ is_system_catalog = IsSystemRelation(OldHeap);
+
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_SWAP_REL_FILES);
+
+ /*
+ * Even ShareUpdateExclusiveLock should have prevented others from
+ * creating / dropping indexes (even using the CONCURRENTLY option), so we
+ * do not need to check whether the lists match.
+ */
+ forboth(lc, ind_oids_old, lc2, ind_oids_new)
+ {
+ Oid ind_old = lfirst_oid(lc);
+ Oid ind_new = lfirst_oid(lc2);
+ Oid mapped_tables[4];
+
+ /* Zero out possible results from swap_relation_files() */
+ memset(mapped_tables, 0, sizeof(mapped_tables));
+
+ swap_relation_files(ind_old, ind_new,
+ (old_table_oid == RelationRelationId),
+ swap_toast_by_content,
+ true,
+ InvalidTransactionId,
+ InvalidMultiXactId,
+ mapped_tables);
+
+#ifdef USE_ASSERT_CHECKING
+
+ /*
+ * Concurrent processing is not supported for system relations, so
+ * there should be no mapped tables.
+ */
+ for (int i = 0; i < 4; i++)
+ Assert(mapped_tables[i] == 0);
+#endif
+ }
+
+ /* The new indexes must be visible for deletion. */
+ CommandCounterIncrement();
+
+ /* Close the old heap but keep lock until transaction commit. */
+ table_close(OldHeap, NoLock);
+ /* Close the new heap. (We didn't have to open its indexes.) */
+ table_close(NewHeap, NoLock);
+
+ /* Cleanup what we don't need anymore. (And close the identity index.) */
+ pfree(ident_key);
+ free_index_insert_state(iistate);
+
+ /*
+ * Swap the relations and their TOAST relations and TOAST indexes. This
+ * also drops the new relation and its indexes.
+ *
+ * (System catalogs are currently not supported.)
+ */
+ Assert(!is_system_catalog);
+ finish_heap_swap(old_table_oid, new_table_oid,
+ is_system_catalog,
+ swap_toast_by_content,
+ false, true, false,
+ frozenXid, cutoffMulti,
+ relpersistence);
+}
+
+/*
+ * Build indexes on NewHeap according to those on OldHeap.
+ *
+ * OldIndexes is the list of index OIDs on OldHeap.
+ *
+ * A list of OIDs of the corresponding indexes created on NewHeap is
+ * returned. The order of the returned items matches that of OldIndexes, so
+ * the two lists can be used to swap index storage.
+ */
+static List *
+build_new_indexes(Relation NewHeap, Relation OldHeap, List *OldIndexes)
+{
+ StringInfo ind_name;
+ ListCell *lc;
+ List *result = NIL;
+
+ pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+ PROGRESS_REPACK_PHASE_REBUILD_INDEX);
+
+ ind_name = makeStringInfo();
+
+ foreach(lc, OldIndexes)
+ {
+ Oid ind_oid,
+ ind_oid_new,
+ tbsp_oid;
+ Relation ind;
+ IndexInfo *ind_info;
+ int i,
+ heap_col_id;
+ List *colnames;
+ int16 indnatts;
+ Oid *collations,
+ *opclasses;
+ HeapTuple tup;
+ bool isnull;
+ Datum d;
+ oidvector *oidvec;
+ int2vector *int2vec;
+ size_t oid_arr_size;
+ size_t int2_arr_size;
+ int16 *indoptions;
+ text *reloptions = NULL;
+ bits16 flags;
+ Datum *opclassOptions;
+ NullableDatum *stattargets;
+
+ ind_oid = lfirst_oid(lc);
+ ind = index_open(ind_oid, AccessShareLock);
+ ind_info = BuildIndexInfo(ind);
+
+ tbsp_oid = ind->rd_rel->reltablespace;
+
+ /*
+ * Index names don't really matter here, since we'll eventually use only
+ * their storage. Just make them unique within the table.
+ */
+ resetStringInfo(ind_name);
+ appendStringInfo(ind_name, "ind_%d",
+ list_cell_number(OldIndexes, lc));
+
+ flags = 0;
+ if (ind->rd_index->indisprimary)
+ flags |= INDEX_CREATE_IS_PRIMARY;
+
+ colnames = NIL;
+ indnatts = ind->rd_index->indnatts;
+ oid_arr_size = sizeof(Oid) * indnatts;
+ int2_arr_size = sizeof(int16) * indnatts;
+
+ collations = (Oid *) palloc(oid_arr_size);
+ for (i = 0; i < indnatts; i++)
+ {
+ char *colname;
+
+ heap_col_id = ind->rd_index->indkey.values[i];
+ if (heap_col_id > 0)
+ {
+ Form_pg_attribute att;
+
+ /* Normal attribute. */
+ att = TupleDescAttr(OldHeap->rd_att, heap_col_id - 1);
+ colname = pstrdup(NameStr(att->attname));
+ collations[i] = att->attcollation;
+ }
+ else if (heap_col_id == 0)
+ {
+ HeapTuple tuple;
+ Form_pg_attribute att;
+
+ /*
+ * An expression column is not present in the relcache. What we
+ * need here is an attribute of the *index* relation.
+ */
+ tuple = SearchSysCache2(ATTNUM,
+ ObjectIdGetDatum(ind_oid),
+ Int16GetDatum(i + 1));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR,
+ "cache lookup failed for attribute %d of relation %u",
+ i + 1, ind_oid);
+ att = (Form_pg_attribute) GETSTRUCT(tuple);
+ colname = pstrdup(NameStr(att->attname));
+ collations[i] = att->attcollation;
+ ReleaseSysCache(tuple);
+ }
+ else
+ elog(ERROR, "unexpected column number: %d",
+ heap_col_id);
+
+ colnames = lappend(colnames, colname);
+ }
+
+ /*
+ * Special effort is needed for the variable-length attributes of
+ * Form_pg_index.
+ */
+ tup = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(ind_oid));
+ if (!HeapTupleIsValid(tup))
+ elog(ERROR, "cache lookup failed for index %u", ind_oid);
+ d = SysCacheGetAttr(INDEXRELID, tup, Anum_pg_index_indclass, &isnull);
+ Assert(!isnull);
+ oidvec = (oidvector *) DatumGetPointer(d);
+ opclasses = (Oid *) palloc(oid_arr_size);
+ memcpy(opclasses, oidvec->values, oid_arr_size);
+
+ d = SysCacheGetAttr(INDEXRELID, tup, Anum_pg_index_indoption,
+ &isnull);
+ Assert(!isnull);
+ int2vec = (int2vector *) DatumGetPointer(d);
+ indoptions = (int16 *) palloc(int2_arr_size);
+ memcpy(indoptions, int2vec->values, int2_arr_size);
+ ReleaseSysCache(tup);
+
+ tup = SearchSysCache1(RELOID, ObjectIdGetDatum(ind_oid));
+ if (!HeapTupleIsValid(tup))
+ elog(ERROR, "cache lookup failed for index relation %u", ind_oid);
+ d = SysCacheGetAttr(RELOID, tup, Anum_pg_class_reloptions, &isnull);
+ reloptions = !isnull ? DatumGetTextPCopy(d) : NULL;
+ ReleaseSysCache(tup);
+
+ opclassOptions = palloc0(sizeof(Datum) * ind_info->ii_NumIndexAttrs);
+ for (i = 0; i < ind_info->ii_NumIndexAttrs; i++)
+ opclassOptions[i] = get_attoptions(ind_oid, i + 1);
+
+ stattargets = get_index_stattargets(ind_oid, ind_info);
+
+ /*
+ * Neither parentIndexRelid nor parentConstraintId needs to be passed
+ * since the new catalog entries (pg_constraint, pg_inherits) would
+ * eventually be dropped. Therefore there's no need to record a valid
+ * dependency on the parents.
+ */
+ ind_oid_new = index_create(NewHeap,
+ ind_name->data,
+ InvalidOid,
+ InvalidOid, /* parentIndexRelid */
+ InvalidOid, /* parentConstraintId */
+ InvalidOid,
+ ind_info,
+ colnames,
+ ind->rd_rel->relam,
+ tbsp_oid,
+ collations,
+ opclasses,
+ opclassOptions,
+ indoptions,
+ stattargets,
+ PointerGetDatum(reloptions),
+ flags, /* flags */
+ 0, /* constr_flags */
+ false, /* allow_system_table_mods */
+ false, /* is_internal */
+ NULL /* constraintId */
+ );
+ result = lappend_oid(result, ind_oid_new);
+
+ index_close(ind, AccessShareLock);
+ list_free_deep(colnames);
+ pfree(collations);
+ pfree(opclasses);
+ pfree(indoptions);
+ if (reloptions)
+ pfree(reloptions);
+ }
+
+ return result;
+}
+
+/*
+ * REPACK is intended to be a replacement for both CLUSTER and VACUUM FULL.
+ */
+void
+repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
+{
+ ListCell *lc;
+ ClusterParams params = {0};
+ bool verbose = false;
+ Relation rel = NULL;
+ Oid indexOid = InvalidOid;
+ MemoryContext repack_context;
+ List *rtcs;
+ LOCKMODE lockmode;
+
+ /* Parse option list */
+ foreach(lc, stmt->params)
+ {
+ DefElem *opt = (DefElem *) lfirst(lc);
+
+ if (strcmp(opt->defname, "verbose") == 0)
+ verbose = defGetBoolean(opt);
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("unrecognized REPACK option \"%s\"",
+ opt->defname),
+ parser_errposition(pstate, opt->location)));
+ }
+
+ params.options =
+ (verbose ? CLUOPT_VERBOSE : 0) |
+ (stmt->concurrent ? CLUOPT_CONCURRENT : 0);
+
+ /*
+ * Determine the lock mode expected by cluster_rel().
+ *
+ * In the exclusive case, we obtain AccessExclusiveLock right away to
+ * avoid a lock-upgrade hazard in the single-transaction case. In the
+ * CONCURRENTLY case, the AccessExclusiveLock will only be used at the end
+ * of processing, presumably for a very short time. Until then, we'll have
+ * to unlock the relation temporarily, so there's no lock-upgrade hazard.
+ */
+ lockmode = (params.options & CLUOPT_CONCURRENT) == 0 ?
+ AccessExclusiveLock : ShareUpdateExclusiveLock;
+
+ if (stmt->relation != NULL)
+ {
+ /* This is the single-relation case. */
+ rel = process_single_relation(stmt->relation, stmt->indexname,
+ lockmode, isTopLevel, ¶ms,
+ CLUSTER_COMMAND_REPACK, &indexOid);
+ if (rel == NULL)
+ return;
+ }
+
+ /*
+ * By here, we know we are in a multi-table situation.
+ *
+ * Concurrent processing is currently considered rather special (e.g. in
+ * terms of resources consumed), so it is not performed in bulk.
+ */
+ if (params.options & CLUOPT_CONCURRENT)
+ {
+ if (rel != NULL)
+ {
+ Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
+ ereport(ERROR,
+ (errmsg("REPACK CONCURRENTLY not supported for partitioned tables"),
+ errhint("Consider running the command for individual partitions.")));
+ }
+ else
+ ereport(ERROR,
+ (errmsg("REPACK CONCURRENTLY requires explicit table name")));
+ }
+
+ /*
+ * In order to avoid holding locks for too long, we want to process each
+ * table in its own transaction. This forces us to disallow running
+ * inside a user transaction block.
+ */
+ PreventInTransactionBlock(isTopLevel, "REPACK");
+
+ /* Also, we need a memory context to hold our list of relations */
+ repack_context = AllocSetContextCreate(PortalContext,
+ "Repack",
+ ALLOCSET_DEFAULT_SIZES);
+
+ params.options |= CLUOPT_RECHECK;
+ if (rel != NULL)
+ {
+ Oid relid;
+ bool rel_is_index;
+
+ Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
+ /* See the ereport() above. */
+ Assert((params.options & CLUOPT_CONCURRENT) == 0);
+
+ if (OidIsValid(indexOid))
+ {
+ relid = indexOid;
+ rel_is_index = true;
+ }
+ else
+ {
+ relid = RelationGetRelid(rel);
+ rel_is_index = false;
+ }
+ rtcs = get_tables_to_cluster_partitioned(repack_context, relid,
+ rel_is_index,
+ CLUSTER_COMMAND_REPACK);
+
+ /* close relation, releasing lock on parent table */
+ table_close(rel, lockmode);
+ }
+ else
+ rtcs = get_tables_to_repack(repack_context);
+
+ /* Do the job. */
+ cluster_multiple_rels(rtcs, ¶ms, CLUSTER_COMMAND_REPACK, lockmode,
+ isTopLevel);
- /* Do the job. */
- cluster_multiple_rels(rtcs, ¶ms, CLUSTER_COMMAND_REPACK);
/* Start a new transaction for the cleanup work. */
StartTransactionCommand();
@@ -1934,6 +3516,7 @@ repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
*/
static Relation
process_single_relation(RangeVar *relation, char *indexname,
+ LOCKMODE lockmode, bool isTopLevel,
ClusterParams *params, ClusterCommand cmd,
Oid *indexOid_p)
{
@@ -1944,12 +3527,10 @@ process_single_relation(RangeVar *relation, char *indexname,
Oid tableOid;
/*
- * Find, lock, and check permissions on the table. We obtain
- * AccessExclusiveLock right away to avoid lock-upgrade hazard in the
- * single-transaction case.
+ * Find, lock, and check permissions on the table.
*/
tableOid = RangeVarGetRelidExtended(relation,
- AccessExclusiveLock,
+ lockmode,
0,
RangeVarCallbackMaintainsTable,
NULL);
@@ -2013,7 +3594,7 @@ process_single_relation(RangeVar *relation, char *indexname,
/* For non-partitioned tables, do what we came here to do. */
if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
{
- cluster_rel(rel, indexOid, params, cmd);
+ cluster_rel(rel, indexOid, params, cmd, isTopLevel);
/* cluster_rel closes the relation, but keeps lock */
return NULL;
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 188e26f0e6e..71b73c21ebf 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -904,7 +904,7 @@ refresh_by_match_merge(Oid matviewOid, Oid tempOid, Oid relowner,
static void
refresh_by_heap_swap(Oid matviewOid, Oid OIDNewHeap, char relpersistence)
{
- finish_heap_swap(matviewOid, OIDNewHeap, false, false, true, true,
+ finish_heap_swap(matviewOid, OIDNewHeap, false, false, true, true, true,
RecentXmin, ReadNextMultiXactId(), relpersistence);
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index b8837f26cb4..3c0ec0fc01c 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -5990,6 +5990,7 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
finish_heap_swap(tab->relid, OIDNewHeap,
false, false, true,
!OidIsValid(tab->newTableSpace),
+ true,
RecentXmin,
ReadNextMultiXactId(),
persistence);
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 8685942505c..8afbe4cfb84 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -125,7 +125,7 @@ static void vac_truncate_clog(TransactionId frozenXID,
TransactionId lastSaneFrozenXid,
MultiXactId lastSaneMinMulti);
static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams params,
- BufferAccessStrategy bstrategy);
+ BufferAccessStrategy bstrategy, bool isTopLevel);
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
@@ -633,7 +633,8 @@ vacuum(List *relations, const VacuumParams params, BufferAccessStrategy bstrateg
if (params.options & VACOPT_VACUUM)
{
- if (!vacuum_rel(vrel->oid, vrel->relation, params, bstrategy))
+ if (!vacuum_rel(vrel->oid, vrel->relation, params, bstrategy,
+ isTopLevel))
continue;
}
@@ -1997,7 +1998,7 @@ vac_truncate_clog(TransactionId frozenXID,
*/
static bool
vacuum_rel(Oid relid, RangeVar *relation, VacuumParams params,
- BufferAccessStrategy bstrategy)
+ BufferAccessStrategy bstrategy, bool isTopLevel)
{
LOCKMODE lmode;
Relation rel;
@@ -2288,7 +2289,7 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams params,
/* VACUUM FULL is now a variant of CLUSTER; see cluster.c */
cluster_rel(rel, InvalidOid, &cluster_params,
- CLUSTER_COMMAND_VACUUM);
+ CLUSTER_COMMAND_VACUUM, isTopLevel);
/* cluster_rel closes the relation, but keeps lock */
rel = NULL;
@@ -2331,7 +2332,8 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams params,
toast_vacuum_params.options |= VACOPT_PROCESS_MAIN;
toast_vacuum_params.toast_parent = relid;
- vacuum_rel(toast_relid, NULL, toast_vacuum_params, bstrategy);
+ vacuum_rel(toast_relid, NULL, toast_vacuum_params, bstrategy,
+ isTopLevel);
}
/*
diff --git a/src/backend/meson.build b/src/backend/meson.build
index 2b0db214804..50aa385a581 100644
--- a/src/backend/meson.build
+++ b/src/backend/meson.build
@@ -194,5 +194,6 @@ pg_test_mod_args = pg_mod_args + {
subdir('jit/llvm')
subdir('replication/libpqwalreceiver')
subdir('replication/pgoutput')
+subdir('replication/pgoutput_repack')
subdir('snowball')
subdir('utils/mb/conversion_procs')
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 15b2b5e93ce..a6c8c59dfc5 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -11895,27 +11895,30 @@ cluster_index_specification:
*
* QUERY:
* REPACK [ (options) ] [ <qualified_name> [ USING INDEX <index_name> ] ]
+ * REPACK [ (options) ] CONCURRENTLY <qualified_name> [ USING INDEX <index_name> ]
*
*****************************************************************************/
RepackStmt:
- REPACK opt_repack_args
+ REPACK opt_concurrently opt_repack_args
{
RepackStmt *n = makeNode(RepackStmt);
- n->relation = $2 ? (RangeVar *) linitial($2) : NULL;
- n->indexname = $2 ? (char *) lsecond($2) : NULL;
+ n->relation = $3 ? (RangeVar *) linitial($3) : NULL;
+ n->indexname = $3 ? (char *) lsecond($3) : NULL;
n->params = NIL;
+ n->concurrent = $2;
$$ = (Node *) n;
}
- | REPACK '(' utility_option_list ')' opt_repack_args
+ | REPACK '(' utility_option_list ')' opt_concurrently opt_repack_args
{
RepackStmt *n = makeNode(RepackStmt);
- n->relation = $5 ? (RangeVar *) linitial($5) : NULL;
- n->indexname = $5 ? (char *) lsecond($5) : NULL;
+ n->relation = $6 ? (RangeVar *) linitial($6) : NULL;
+ n->indexname = $6 ? (char *) lsecond($6) : NULL;
n->params = $3;
+ n->concurrent = $5;
$$ = (Node *) n;
}
;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index cc03f0706e9..5dc4ae58ffe 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -33,6 +33,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecord.h"
#include "catalog/pg_control.h"
+#include "commands/cluster.h"
#include "replication/decode.h"
#include "replication/logical.h"
#include "replication/message.h"
@@ -472,6 +473,88 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
TransactionId xid = XLogRecGetXid(buf->record);
SnapBuild *builder = ctx->snapshot_builder;
+ /*
+ * If the change is not intended for logical decoding, do not even
+ * establish a transaction for it. REPACK CONCURRENTLY is the typical
+ * use case.
+ *
+ * First, check if REPACK CONCURRENTLY is being performed by this backend.
+ * If so, only decode data changes of the table that it is processing, and
+ * the changes of its TOAST relation.
+ *
+ * (The TOAST locator should not be set unless the main one is.)
+ */
+ Assert(!OidIsValid(repacked_rel_toast_locator.relNumber) ||
+ OidIsValid(repacked_rel_locator.relNumber));
+
+ if (OidIsValid(repacked_rel_locator.relNumber))
+ {
+ XLogReaderState *r = buf->record;
+ RelFileLocator locator;
+
+ /* Not all records contain the block. */
+ if (XLogRecGetBlockTagExtended(r, 0, &locator, NULL, NULL, NULL) &&
+ !RelFileLocatorEquals(locator, repacked_rel_locator) &&
+ (!OidIsValid(repacked_rel_toast_locator.relNumber) ||
+ !RelFileLocatorEquals(locator, repacked_rel_toast_locator)))
+ return;
+ }
+
+ /*
+ * Second, skip records which do not contain sufficient information for
+ * the decoding.
+ *
+ * The problem we solve here is that REPACK CONCURRENTLY generates WAL
+ * when making changes to the new table. Those changes are not useful to
+ * any other consumer (such as a logical replication subscription) because
+ * the new table will eventually be dropped (after REPACK CONCURRENTLY has
+ * assigned its file to the "old table").
+ */
+ switch (info)
+ {
+ case XLOG_HEAP_INSERT:
+ {
+ xl_heap_insert *rec;
+
+ rec = (xl_heap_insert *) XLogRecGetData(buf->record);
+
+ /*
+ * This does happen when 1) raw_heap_insert marks the TOAST
+ * record with HEAP_INSERT_NO_LOGICAL, or 2) REPACK CONCURRENTLY
+ * replays inserts performed by other backends.
+ */
+ if ((rec->flags & XLH_INSERT_CONTAINS_NEW_TUPLE) == 0)
+ return;
+
+ break;
+ }
+
+ case XLOG_HEAP_HOT_UPDATE:
+ case XLOG_HEAP_UPDATE:
+ {
+ xl_heap_update *rec;
+
+ rec = (xl_heap_update *) XLogRecGetData(buf->record);
+ if ((rec->flags &
+ (XLH_UPDATE_CONTAINS_NEW_TUPLE |
+ XLH_UPDATE_CONTAINS_OLD_TUPLE |
+ XLH_UPDATE_CONTAINS_OLD_KEY)) == 0)
+ return;
+
+ break;
+ }
+
+ case XLOG_HEAP_DELETE:
+ {
+ xl_heap_delete *rec;
+
+ rec = (xl_heap_delete *) XLogRecGetData(buf->record);
+ if (rec->flags & XLH_DELETE_NO_LOGICAL)
+ return;
+ break;
+ }
+ }
+
ReorderBufferProcessXid(ctx->reorder, xid, buf->origptr);
/*
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 270f37ecadb..90a806e73fb 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -486,6 +486,26 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
return SnapBuildMVCCFromHistoric(snap, true);
}
+/*
+ * Build an MVCC snapshot for the initial data load performed by REPACK
+ * CONCURRENTLY command.
+ *
+ * The snapshot will only be used to scan one particular relation, which is
+ * treated like a catalog (therefore ->building_full_snapshot is not
+ * important), and the caller should already have a replication slot set up
+ * (so we do not set MyProc->xmin). XXX Do we still need to add any
+ * restrictions?
+ */
+Snapshot
+SnapBuildInitialSnapshotForRepack(SnapBuild *builder)
+{
+ Snapshot snap;
+
+ Assert(builder->state == SNAPBUILD_CONSISTENT);
+
+ snap = SnapBuildBuildSnapshot(builder);
+ return SnapBuildMVCCFromHistoric(snap, false);
+}
+
/*
* Turn a historic MVCC snapshot into an ordinary MVCC snapshot.
*
diff --git a/src/backend/replication/pgoutput_repack/Makefile b/src/backend/replication/pgoutput_repack/Makefile
new file mode 100644
index 00000000000..4efeb713b70
--- /dev/null
+++ b/src/backend/replication/pgoutput_repack/Makefile
@@ -0,0 +1,32 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+# Makefile for src/backend/replication/pgoutput_repack
+#
+# IDENTIFICATION
+# src/backend/replication/pgoutput_repack
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/replication/pgoutput_repack
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+ $(WIN32RES) \
+ pgoutput_repack.o
+PGFILEDESC = "pgoutput_repack - logical replication output plugin for REPACK command"
+NAME = pgoutput_repack
+
+all: all-shared-lib
+
+include $(top_srcdir)/src/Makefile.shlib
+
+install: all installdirs install-lib
+
+installdirs: installdirs-lib
+
+uninstall: uninstall-lib
+
+clean distclean: clean-lib
+ rm -f $(OBJS)
diff --git a/src/backend/replication/pgoutput_repack/meson.build b/src/backend/replication/pgoutput_repack/meson.build
new file mode 100644
index 00000000000..133e865a4a0
--- /dev/null
+++ b/src/backend/replication/pgoutput_repack/meson.build
@@ -0,0 +1,18 @@
+# Copyright (c) 2022-2024, PostgreSQL Global Development Group
+
+pgoutput_repack_sources = files(
+ 'pgoutput_repack.c',
+)
+
+if host_system == 'windows'
+ pgoutput_repack_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pgoutput_repack',
+ '--FILEDESC', 'pgoutput_repack - logical replication output plugin for REPACK command',])
+endif
+
+pgoutput_repack = shared_module('pgoutput_repack',
+ pgoutput_repack_sources,
+ kwargs: pg_mod_args,
+)
+
+backend_targets += pgoutput_repack
diff --git a/src/backend/replication/pgoutput_repack/pgoutput_repack.c b/src/backend/replication/pgoutput_repack/pgoutput_repack.c
new file mode 100644
index 00000000000..687fbbc59bb
--- /dev/null
+++ b/src/backend/replication/pgoutput_repack/pgoutput_repack.c
@@ -0,0 +1,288 @@
+/*-------------------------------------------------------------------------
+ *
+ * pgoutput_repack.c
+ * Logical Replication output plugin for REPACK command
+ *
+ * Copyright (c) 2012-2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/replication/pgoutput_repack/pgoutput_repack.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heaptoast.h"
+#include "commands/cluster.h"
+#include "replication/snapbuild.h"
+
+PG_MODULE_MAGIC;
+
+static void plugin_startup(LogicalDecodingContext *ctx,
+ OutputPluginOptions *opt, bool is_init);
+static void plugin_shutdown(LogicalDecodingContext *ctx);
+static void plugin_begin_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn);
+static void plugin_commit_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ Relation rel, ReorderBufferChange *change);
+static void plugin_truncate(struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, int nrelations,
+ Relation relations[],
+ ReorderBufferChange *change);
+static void store_change(LogicalDecodingContext *ctx,
+ ConcurrentChangeKind kind, HeapTuple tuple);
+
+void
+_PG_output_plugin_init(OutputPluginCallbacks *cb)
+{
+ AssertVariableIsOfType(&_PG_output_plugin_init, LogicalOutputPluginInit);
+
+ cb->startup_cb = plugin_startup;
+ cb->begin_cb = plugin_begin_txn;
+ cb->change_cb = plugin_change;
+ cb->truncate_cb = plugin_truncate;
+ cb->commit_cb = plugin_commit_txn;
+ cb->shutdown_cb = plugin_shutdown;
+}
+
+
+/* initialize this plugin */
+static void
+plugin_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
+ bool is_init)
+{
+ ctx->output_plugin_private = NULL;
+
+ /* Probably unnecessary, as we don't use the SQL interface ... */
+ opt->output_type = OUTPUT_PLUGIN_BINARY_OUTPUT;
+
+ if (ctx->output_plugin_options != NIL)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("this plugin does not expect any options")));
+ }
+}
+
+static void
+plugin_shutdown(LogicalDecodingContext *ctx)
+{
+}
+
+/*
+ * As we don't release the slot while processing a particular table, there's
+ * no room for the SQL interface, even for debugging purposes. Therefore we
+ * need neither OutputPluginPrepareWrite() nor OutputPluginWrite() in the
+ * plugin callbacks. (Although we might want to write custom callbacks, this
+ * API seems unnecessarily generic for our purposes.)
+ */
+
+/* BEGIN callback */
+static void
+plugin_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+}
+
+/* COMMIT callback */
+static void
+plugin_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+}
+
+/*
+ * Callback for individual changed tuples
+ */
+static void
+plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ Relation relation, ReorderBufferChange *change)
+{
+ RepackDecodingState *dstate;
+
+ dstate = (RepackDecodingState *) ctx->output_writer_private;
+
+ /* Only interested in one particular relation. */
+ if (relation->rd_id != dstate->relid)
+ return;
+
+ /* Decode entry depending on its type */
+ switch (change->action)
+ {
+ case REORDER_BUFFER_CHANGE_INSERT:
+ {
+ HeapTuple newtuple;
+
+ newtuple = change->data.tp.newtuple;
+
+ /*
+ * Identity checks in the main function should have made this
+ * impossible.
+ */
+ if (newtuple == NULL)
+ elog(ERROR, "incomplete insert info");
+
+ store_change(ctx, CHANGE_INSERT, newtuple);
+ }
+ break;
+ case REORDER_BUFFER_CHANGE_UPDATE:
+ {
+ HeapTuple oldtuple,
+ newtuple;
+
+ oldtuple = change->data.tp.oldtuple;
+ newtuple = change->data.tp.newtuple;
+
+ if (newtuple == NULL)
+ elog(ERROR, "incomplete update info");
+
+ if (oldtuple != NULL)
+ store_change(ctx, CHANGE_UPDATE_OLD, oldtuple);
+
+ store_change(ctx, CHANGE_UPDATE_NEW, newtuple);
+ }
+ break;
+ case REORDER_BUFFER_CHANGE_DELETE:
+ {
+ HeapTuple oldtuple;
+
+ oldtuple = change->data.tp.oldtuple;
+
+ if (oldtuple == NULL)
+ elog(ERROR, "incomplete delete info");
+
+ store_change(ctx, CHANGE_DELETE, oldtuple);
+ }
+ break;
+ default:
+ /* Should not come here */
+ Assert(false);
+ break;
+ }
+}
+
+static void
+plugin_truncate(struct LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ int nrelations, Relation relations[],
+ ReorderBufferChange *change)
+{
+ RepackDecodingState *dstate;
+ int i;
+ Relation relation = NULL;
+
+ dstate = (RepackDecodingState *) ctx->output_writer_private;
+
+ /* Find the relation we are processing. */
+ for (i = 0; i < nrelations; i++)
+ {
+ relation = relations[i];
+
+ if (RelationGetRelid(relation) == dstate->relid)
+ break;
+ }
+
+ /* Was this the truncation of some other relation? */
+ if (i == nrelations)
+ return;
+
+ store_change(ctx, CHANGE_TRUNCATE, NULL);
+}
+
+/* Store concurrent data change. */
+static void
+store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
+ HeapTuple tuple)
+{
+ RepackDecodingState *dstate;
+ char *change_raw;
+ ConcurrentChange change;
+ bool flattened = false;
+ Size size;
+ Datum values[1];
+ bool isnull[1];
+ char *dst,
+ *dst_start;
+
+ dstate = (RepackDecodingState *) ctx->output_writer_private;
+
+ size = MAXALIGN(VARHDRSZ) + SizeOfConcurrentChange;
+
+ if (tuple)
+ {
+ /*
+ * ReorderBufferCommit() stores the TOAST chunks in its private memory
+ * context and frees them after having called apply_change().
+ * Therefore we need a flat copy (including TOAST) that we eventually
+ * copy into the memory context which is available to
+ * decode_concurrent_changes().
+ */
+ if (HeapTupleHasExternal(tuple))
+ {
+ /*
+ * toast_flatten_tuple_to_datum() might be more convenient but we
+ * don't want the decompression it does.
+ */
+ tuple = toast_flatten_tuple(tuple, dstate->tupdesc);
+ flattened = true;
+ }
+
+ size += tuple->t_len;
+ }
+
+ /* XXX Isn't there any function / macro to do this? */
+ if (size >= 0x3FFFFFFF)
+ elog(ERROR, "change is too big");
+
+ /* Construct the change. */
+ change_raw = (char *) palloc0(size);
+ SET_VARSIZE(change_raw, size);
+
+ /*
+ * Since the varlena alignment might not be sufficient for the structure,
+ * set the fields in a local instance and remember where it should
+ * eventually be copied.
+ */
+ change.kind = kind;
+ dst_start = (char *) VARDATA(change_raw);
+
+ /* No other information is needed for TRUNCATE. */
+	if (change.kind == CHANGE_TRUNCATE)
+		goto store;
+
+ /*
+ * Copy the tuple.
+ *
+ * CAUTION: change->tup_data.t_data must be fixed on retrieval!
+ */
+ memcpy(&change.tup_data, tuple, sizeof(HeapTupleData));
+ dst = dst_start + SizeOfConcurrentChange;
+ memcpy(dst, tuple->t_data, tuple->t_len);
+
+ /* The data has been copied. */
+ if (flattened)
+ pfree(tuple);
+
+store:
+ /* Copy the structure so it can be stored. */
+ memcpy(dst_start, &change, SizeOfConcurrentChange);
+
+ /* Store as tuple of 1 bytea column. */
+ values[0] = PointerGetDatum(change_raw);
+ isnull[0] = false;
+ tuplestore_putvalues(dstate->tstore, dstate->tupdesc_change,
+ values, isnull);
+
+ /* Accounting. */
+ dstate->nchanges++;
+
+ /* Cleanup. */
+ pfree(change_raw);
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..e9ddf39500c 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
#include "access/xlogprefetcher.h"
#include "access/xlogrecovery.h"
#include "commands/async.h"
+#include "commands/cluster.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/autovacuum.h"
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 4da68312b5f..eb576cdebe5 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -352,6 +352,7 @@ DSMRegistry "Waiting to read or update the dynamic shared memory registry."
InjectionPoint "Waiting to read or update information related to injection points."
SerialControl "Waiting to read or update shared <filename>pg_serial</filename> state."
AioWorkerSubmissionQueue "Waiting to access AIO worker submission queue."
+RepackedRels "Waiting to read or update information on tables being repacked concurrently."
#
# END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 559ba9cdb2c..4911642fb3c 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -64,6 +64,7 @@
#include "catalog/pg_type.h"
#include "catalog/schemapg.h"
#include "catalog/storage.h"
+#include "commands/cluster.h"
#include "commands/policy.h"
#include "commands/publicationcmds.h"
#include "commands/trigger.h"
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 70a6b8902d1..7f1c220e00b 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -213,7 +213,6 @@ static List *exportedSnapshots = NIL;
/* Prototypes for local functions */
static void UnregisterSnapshotNoOwner(Snapshot snapshot);
-static void FreeSnapshot(Snapshot snapshot);
static void SnapshotResetXmin(void);
/* ResourceOwner callbacks to track snapshot references */
@@ -646,7 +645,7 @@ CopySnapshot(Snapshot snapshot)
* FreeSnapshot
* Free the memory associated with a snapshot.
*/
-static void
+void
FreeSnapshot(Snapshot snapshot)
{
Assert(snapshot->regd_count == 0);
diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index 2eee34cbfa3..49beb5e906e 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -4929,18 +4929,27 @@ match_previous_words(int pattern_id,
}
/* REPACK */
- else if (Matches("REPACK"))
+ else if (Matches("REPACK") || Matches("REPACK", "(*)"))
+ COMPLETE_WITH_SCHEMA_QUERY_PLUS(Query_for_list_of_clusterables,
+ "CONCURRENTLY");
+ else if (Matches("REPACK", "CONCURRENTLY"))
COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_clusterables);
- else if (Matches("REPACK", "(*)"))
+ else if (Matches("REPACK", "(*)", "CONCURRENTLY"))
COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_clusterables);
- /* If we have REPACK <sth>, then add "USING INDEX" */
- else if (Matches("REPACK", MatchAnyExcept("(")))
+ /* If we have REPACK [ CONCURRENTLY ] <sth>, then add "USING INDEX" */
+ else if (Matches("REPACK", MatchAnyExcept("(|CONCURRENTLY")) ||
+ Matches("REPACK", "CONCURRENTLY", MatchAnyExcept("(")))
COMPLETE_WITH("USING INDEX");
- /* If we have REPACK (*) <sth>, then add "USING INDEX" */
- else if (Matches("REPACK", "(*)", MatchAny))
+ /* If we have REPACK (*) [ CONCURRENTLY ] <sth>, then add "USING INDEX" */
+ else if (Matches("REPACK", "(*)", MatchAnyExcept("CONCURRENTLY")) ||
+ Matches("REPACK", "(*)", "CONCURRENTLY", MatchAnyExcept("(")))
COMPLETE_WITH("USING INDEX");
- /* If we have REPACK <sth> USING, then add the index as well */
- else if (Matches("REPACK", MatchAny, "USING", "INDEX"))
+
+ /*
+ * Complete ... [ (*) ] [ CONCURRENTLY ] <sth> USING INDEX, with a list of
+ * indexes for <sth>.
+ */
+ else if (TailMatches(MatchAnyExcept("(|CONCURRENTLY"), "USING", "INDEX"))
{
set_completion_reference(prev3_wd);
COMPLETE_WITH_SCHEMA_QUERY(Query_for_index_of_table);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index a2bd5a897f8..b82dd17a966 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -323,14 +323,15 @@ extern void heap_multi_insert(Relation relation, struct TupleTableSlot **slots,
BulkInsertState bistate);
extern TM_Result heap_delete(Relation relation, ItemPointer tid,
CommandId cid, Snapshot crosscheck, bool wait,
- struct TM_FailureData *tmfd, bool changingPart);
+ struct TM_FailureData *tmfd, bool changingPart,
+ bool wal_logical);
extern void heap_finish_speculative(Relation relation, ItemPointer tid);
extern void heap_abort_speculative(Relation relation, ItemPointer tid);
extern TM_Result heap_update(Relation relation, ItemPointer otid,
HeapTuple newtup,
CommandId cid, Snapshot crosscheck, bool wait,
struct TM_FailureData *tmfd, LockTupleMode *lockmode,
- TU_UpdateIndexes *update_indexes);
+ TU_UpdateIndexes *update_indexes, bool wal_logical);
extern TM_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
CommandId cid, LockTupleMode mode, LockWaitPolicy wait_policy,
bool follow_updates,
@@ -411,6 +412,10 @@ extern HTSV_Result HeapTupleSatisfiesVacuumHorizon(HeapTuple htup, Buffer buffer
TransactionId *dead_after);
extern void HeapTupleSetHintBits(HeapTupleHeader tuple, Buffer buffer,
uint16 infomask, TransactionId xid);
+extern bool HeapTupleMVCCInserted(HeapTuple htup, Snapshot snapshot,
+ Buffer buffer);
+extern bool HeapTupleMVCCNotDeleted(HeapTuple htup, Snapshot snapshot,
+ Buffer buffer);
extern bool HeapTupleHeaderIsOnlyLocked(HeapTupleHeader tuple);
extern bool HeapTupleIsSurelyDead(HeapTuple htup,
struct GlobalVisState *vistest);
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 277df6b3cf0..8d4af07f840 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -104,6 +104,8 @@
#define XLH_DELETE_CONTAINS_OLD_KEY (1<<2)
#define XLH_DELETE_IS_SUPER (1<<3)
#define XLH_DELETE_IS_PARTITION_MOVE (1<<4)
+/* See heap_delete() */
+#define XLH_DELETE_NO_LOGICAL (1<<5)
/* convenience macro for checking whether any form of old tuple was logged */
#define XLH_DELETE_CONTAINS_OLD \
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 1c9e802a6b1..289b64edfd9 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -22,6 +22,7 @@
#include "access/xact.h"
#include "commands/vacuum.h"
#include "executor/tuptable.h"
+#include "replication/logical.h"
#include "storage/read_stream.h"
#include "utils/rel.h"
#include "utils/snapshot.h"
@@ -623,6 +624,8 @@ typedef struct TableAmRoutine
Relation OldIndex,
bool use_sort,
TransactionId OldestXmin,
+ Snapshot snapshot,
+ LogicalDecodingContext *decoding_ctx,
TransactionId *xid_cutoff,
MultiXactId *multi_cutoff,
double *num_tuples,
@@ -1627,6 +1630,10 @@ table_relation_copy_data(Relation rel, const RelFileLocator *newrlocator)
* not needed for the relation's AM
* - *xid_cutoff - ditto
* - *multi_cutoff - ditto
+ * - snapshot - if != NULL, ignore data changes done by transactions that this
+ * (MVCC) snapshot considers still in-progress or in the future.
+ * - decoding_ctx - logical decoding context, to capture concurrent data
+ * changes.
*
* Output parameters:
* - *xid_cutoff - rel's new relfrozenxid value, may be invalid
@@ -1639,6 +1646,8 @@ table_relation_copy_for_cluster(Relation OldTable, Relation NewTable,
Relation OldIndex,
bool use_sort,
TransactionId OldestXmin,
+ Snapshot snapshot,
+ LogicalDecodingContext *decoding_ctx,
TransactionId *xid_cutoff,
MultiXactId *multi_cutoff,
double *num_tuples,
@@ -1647,6 +1656,7 @@ table_relation_copy_for_cluster(Relation OldTable, Relation NewTable,
{
OldTable->rd_tableam->relation_copy_for_cluster(OldTable, NewTable, OldIndex,
use_sort, OldestXmin,
+ snapshot, decoding_ctx,
xid_cutoff, multi_cutoff,
num_tuples, tups_vacuumed,
tups_recently_dead);
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 4daa8bef5ee..66431cc19e5 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -100,6 +100,9 @@ extern Oid index_concurrently_create_copy(Relation heapRelation,
Oid tablespaceOid,
const char *newName);
+extern NullableDatum *get_index_stattargets(Oid indexid,
+ IndexInfo *indInfo);
+
extern void index_concurrently_build(Oid heapRelationId,
Oid indexRelationId);
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 3be57c97b3f..0a7e72bc74a 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -13,10 +13,15 @@
#ifndef CLUSTER_H
#define CLUSTER_H
+#include "nodes/execnodes.h"
#include "nodes/parsenodes.h"
#include "parser/parse_node.h"
+#include "replication/logical.h"
#include "storage/lock.h"
+#include "storage/relfilelocator.h"
#include "utils/relcache.h"
+#include "utils/resowner.h"
+#include "utils/tuplestore.h"
/* flag bits for ClusterParams->options */
@@ -24,6 +29,7 @@
#define CLUOPT_RECHECK 0x02 /* recheck relation state */
#define CLUOPT_RECHECK_ISCLUSTERED 0x04 /* recheck relation state for
* indisclustered */
+#define CLUOPT_CONCURRENT 0x08 /* allow concurrent data changes */
/* options for CLUSTER */
typedef struct ClusterParams
@@ -46,13 +52,89 @@ typedef enum ClusterCommand
CLUSTER_COMMAND_VACUUM
} ClusterCommand;
+/*
+ * The following definitions are used by REPACK CONCURRENTLY.
+ */
+
+extern RelFileLocator repacked_rel_locator;
+extern RelFileLocator repacked_rel_toast_locator;
+
+typedef enum
+{
+ CHANGE_INSERT,
+ CHANGE_UPDATE_OLD,
+ CHANGE_UPDATE_NEW,
+ CHANGE_DELETE,
+ CHANGE_TRUNCATE
+} ConcurrentChangeKind;
+
+typedef struct ConcurrentChange
+{
+ /* See the enum above. */
+ ConcurrentChangeKind kind;
+
+ /*
+ * The actual tuple.
+ *
+ * The tuple data follows the ConcurrentChange structure. Before use make
+ * sure the tuple is correctly aligned (ConcurrentChange can be stored as
+ * bytea) and that tuple->t_data is fixed.
+ */
+ HeapTupleData tup_data;
+} ConcurrentChange;
+
+#define SizeOfConcurrentChange (offsetof(ConcurrentChange, tup_data) + \
+ sizeof(HeapTupleData))
+
+/*
+ * Logical decoding state.
+ *
+ * Here we store the data changes that we decode from WAL while the table
+ * contents are being copied to the new storage, along with the metadata
+ * needed to apply these changes to the table.
+ */
+typedef struct RepackDecodingState
+{
+ /* The relation whose changes we're decoding. */
+ Oid relid;
+
+ /*
+	 * Decoded changes are stored here. Although we try to avoid excessively
+	 * large batches, the changes may still need to be spilled to disk. The
+	 * tuplestore does this transparently.
+ */
+ Tuplestorestate *tstore;
+
+ /* The current number of changes in tstore. */
+ double nchanges;
+
+ /*
+	 * Descriptor to store the ConcurrentChange structure serialized (bytea).
+	 * We can't store the tuple directly because the tuplestore only supports
+	 * minimal tuples and we may need to transfer the OID system column from
+	 * the output plugin. We also need to transfer the change kind, so it's
+	 * better to put everything in one structure than to use two tuplestores
+	 * "in parallel".
+	 */
+ TupleDesc tupdesc_change;
+
+ /* Tuple descriptor needed to update indexes. */
+ TupleDesc tupdesc;
+
+ /* Slot to retrieve data from tstore. */
+ TupleTableSlot *tsslot;
+
+ ResourceOwner resowner;
+} RepackDecodingState;
+
extern void cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel);
extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
- ClusterCommand cmd);
+ ClusterCommand cmd, bool isTopLevel);
extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid,
LOCKMODE lockmode);
extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
-
+extern void repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
+ XLogRecPtr end_of_wal);
extern Oid make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
char relpersistence, LOCKMODE lockmode);
extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
@@ -60,6 +142,7 @@ extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
bool swap_toast_by_content,
bool check_constraints,
bool is_internal,
+ bool reindex,
TransactionId frozenXid,
MultiXactId cutoffMulti,
char newrelpersistence);
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index f92ff524031..4cbf4d16529 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -59,18 +59,20 @@
/*
* Progress parameters for REPACK.
*
- * Note: Since REPACK shares some code with CLUSTER, these values are also
- * used by CLUSTER. (CLUSTER is now deprecated, so it makes little sense to
- * introduce a separate set of constants.)
+ * Note: Since REPACK shares some code with CLUSTER, (some of) these values
+ * are also used by CLUSTER. (CLUSTER is now deprecated, so it makes little
+ * sense to introduce a separate set of constants.)
*/
#define PROGRESS_REPACK_COMMAND 0
#define PROGRESS_REPACK_PHASE 1
#define PROGRESS_REPACK_INDEX_RELID 2
#define PROGRESS_REPACK_HEAP_TUPLES_SCANNED 3
-#define PROGRESS_REPACK_HEAP_TUPLES_WRITTEN 4
-#define PROGRESS_REPACK_TOTAL_HEAP_BLKS 5
-#define PROGRESS_REPACK_HEAP_BLKS_SCANNED 6
-#define PROGRESS_REPACK_INDEX_REBUILD_COUNT 7
+#define PROGRESS_REPACK_HEAP_TUPLES_INSERTED 4
+#define PROGRESS_REPACK_HEAP_TUPLES_UPDATED 5
+#define PROGRESS_REPACK_HEAP_TUPLES_DELETED 6
+#define PROGRESS_REPACK_TOTAL_HEAP_BLKS 7
+#define PROGRESS_REPACK_HEAP_BLKS_SCANNED 8
+#define PROGRESS_REPACK_INDEX_REBUILD_COUNT 9
/*
* Phases of repack (as advertised via PROGRESS_REPACK_PHASE).
@@ -83,9 +85,10 @@
#define PROGRESS_REPACK_PHASE_INDEX_SCAN_HEAP 2
#define PROGRESS_REPACK_PHASE_SORT_TUPLES 3
#define PROGRESS_REPACK_PHASE_WRITE_NEW_HEAP 4
-#define PROGRESS_REPACK_PHASE_SWAP_REL_FILES 5
-#define PROGRESS_REPACK_PHASE_REBUILD_INDEX 6
-#define PROGRESS_REPACK_PHASE_FINAL_CLEANUP 7
+#define PROGRESS_REPACK_PHASE_CATCH_UP 5
+#define PROGRESS_REPACK_PHASE_SWAP_REL_FILES 6
+#define PROGRESS_REPACK_PHASE_REBUILD_INDEX 7
+#define PROGRESS_REPACK_PHASE_FINAL_CLEANUP 8
/*
* Commands of PROGRESS_REPACK
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 52584bd8dbf..a2986abee97 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3938,6 +3938,7 @@ typedef struct RepackStmt
RangeVar *relation; /* relation being repacked */
char *indexname; /* order tuples by this index */
List *params; /* list of DefElem nodes */
+ bool concurrent; /* allow concurrent access? */
} RepackStmt;
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 6d4d2d1814c..802fc4b0823 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -73,6 +73,7 @@ extern void FreeSnapshotBuilder(SnapBuild *builder);
extern void SnapBuildSnapDecRefcount(Snapshot snap);
extern Snapshot SnapBuildInitialSnapshot(SnapBuild *builder);
+extern Snapshot SnapBuildInitialSnapshotForRepack(SnapBuild *builder);
extern Snapshot SnapBuildMVCCFromHistoric(Snapshot snapshot, bool in_place);
extern const char *SnapBuildExportSnapshot(SnapBuild *builder);
extern void SnapBuildClearExportedSnapshot(void);
diff --git a/src/include/storage/lockdefs.h b/src/include/storage/lockdefs.h
index 7f3ba0352f6..2739327b0da 100644
--- a/src/include/storage/lockdefs.h
+++ b/src/include/storage/lockdefs.h
@@ -36,8 +36,8 @@ typedef int LOCKMODE;
#define AccessShareLock 1 /* SELECT */
#define RowShareLock 2 /* SELECT FOR UPDATE/FOR SHARE */
#define RowExclusiveLock 3 /* INSERT, UPDATE, DELETE */
-#define ShareUpdateExclusiveLock 4 /* VACUUM (non-FULL), ANALYZE, CREATE
- * INDEX CONCURRENTLY */
+#define ShareUpdateExclusiveLock 4 /* VACUUM (non-exclusive), ANALYZE, CREATE
+ * INDEX CONCURRENTLY, REPACK CONCURRENTLY */
#define ShareLock 5 /* CREATE INDEX (WITHOUT CONCURRENTLY) */
#define ShareRowExclusiveLock 6 /* like EXCLUSIVE MODE, but allows ROW
* SHARE */
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index a9681738146..9bb2f7ae1a8 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -84,3 +84,4 @@ PG_LWLOCK(50, DSMRegistry)
PG_LWLOCK(51, InjectionPoint)
PG_LWLOCK(52, SerialControl)
PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, RepackedRels)
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index 147b190210a..5eeabdc6c4f 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -61,6 +61,8 @@ extern Snapshot GetLatestSnapshot(void);
extern void SnapshotSetCommandId(CommandId curcid);
extern Snapshot CopySnapshot(Snapshot snapshot);
+extern void FreeSnapshot(Snapshot snapshot);
+
extern Snapshot GetCatalogSnapshot(Oid relid);
extern Snapshot GetNonHistoricCatalogSnapshot(Oid relid);
extern void InvalidateCatalogSnapshot(void);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 328235044d9..ebaf8fdd268 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1990,17 +1990,17 @@ pg_stat_progress_cluster| SELECT s.pid,
WHEN 2 THEN 'index scanning heap'::text
WHEN 3 THEN 'sorting tuples'::text
WHEN 4 THEN 'writing new heap'::text
- WHEN 5 THEN 'swapping relation files'::text
- WHEN 6 THEN 'rebuilding index'::text
- WHEN 7 THEN 'performing final cleanup'::text
+ WHEN 6 THEN 'swapping relation files'::text
+ WHEN 7 THEN 'rebuilding index'::text
+ WHEN 8 THEN 'performing final cleanup'::text
ELSE NULL::text
END AS phase,
(s.param3)::oid AS cluster_index_relid,
s.param4 AS heap_tuples_scanned,
s.param5 AS heap_tuples_written,
- s.param6 AS heap_blks_total,
- s.param7 AS heap_blks_scanned,
- s.param8 AS index_rebuild_count
+ s.param8 AS heap_blks_total,
+ s.param9 AS heap_blks_scanned,
+ s.param10 AS index_rebuild_count
FROM (pg_stat_get_progress_info('CLUSTER'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_progress_copy| SELECT s.pid,
@@ -2072,17 +2072,20 @@ pg_stat_progress_repack| SELECT s.pid,
WHEN 2 THEN 'index scanning heap'::text
WHEN 3 THEN 'sorting tuples'::text
WHEN 4 THEN 'writing new heap'::text
- WHEN 5 THEN 'swapping relation files'::text
- WHEN 6 THEN 'rebuilding index'::text
- WHEN 7 THEN 'performing final cleanup'::text
+ WHEN 5 THEN 'catch-up'::text
+ WHEN 6 THEN 'swapping relation files'::text
+ WHEN 7 THEN 'rebuilding index'::text
+ WHEN 8 THEN 'performing final cleanup'::text
ELSE NULL::text
END AS phase,
(s.param3)::oid AS repack_index_relid,
s.param4 AS heap_tuples_scanned,
- s.param5 AS heap_tuples_written,
- s.param6 AS heap_blks_total,
- s.param7 AS heap_blks_scanned,
- s.param8 AS index_rebuild_count
+ s.param5 AS heap_tuples_inserted,
+ s.param6 AS heap_tuples_updated,
+ s.param7 AS heap_tuples_deleted,
+ s.param8 AS heap_blks_total,
+ s.param9 AS heap_blks_scanned,
+ s.param10 AS index_rebuild_count
FROM (pg_stat_get_progress_info('REPACK'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_progress_vacuum| SELECT s.pid,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 255d0e76520..879977ea41f 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -488,6 +488,8 @@ CompressFileHandle
CompressionLocation
CompressorState
ComputeXidHorizonsResult
+ConcurrentChange
+ConcurrentChangeKind
ConditionVariable
ConditionVariableMinimallyPadded
ConditionalStack
@@ -1259,6 +1261,7 @@ IndexElem
IndexFetchHeapData
IndexFetchTableData
IndexInfo
+IndexInsertState
IndexList
IndexOnlyScan
IndexOnlyScanState
@@ -2529,6 +2532,7 @@ ReorderBufferTupleCidKey
ReorderBufferUpdateProgressTxnCB
ReorderTuple
RepOriginId
+RepackDecodingState
RepackStmt
ReparameterizeForeignPathByChild_function
ReplaceVarsFromTargetList_context
--
2.47.1
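As an aside, the serialization pattern used by store_change() above — a fixed header struct followed by a raw tuple payload in one flat buffer, with memcpy through a local struct instance to sidestep alignment issues — can be illustrated outside the server. The following stand-alone C sketch uses invented type and function names (ChangeKind, pack_change, unpack_change); it is not the patch's code, only the same header-plus-payload idea in miniature:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins; not the PostgreSQL definitions from the patch. */
typedef enum { CHANGE_INSERT, CHANGE_DELETE, CHANGE_TRUNCATE } ChangeKind;

typedef struct
{
	ChangeKind	kind;
	size_t		data_len;		/* length of payload that follows the header */
} ChangeHeader;

/* Serialize: header plus payload into one flat buffer. */
char *
pack_change(ChangeKind kind, const char *data, size_t len, size_t *out_size)
{
	ChangeHeader hdr;
	char	   *buf;

	hdr.kind = kind;
	hdr.data_len = len;

	*out_size = sizeof(ChangeHeader) + len;
	buf = malloc(*out_size);

	/* memcpy via a local struct: buf itself need not be suitably aligned */
	memcpy(buf, &hdr, sizeof(ChangeHeader));
	if (len > 0)
		memcpy(buf + sizeof(ChangeHeader), data, len);
	return buf;
}

/* Deserialize: copy the header out before touching its fields. */
ChangeKind
unpack_change(const char *buf, const char **data, size_t *len)
{
	ChangeHeader hdr;

	memcpy(&hdr, buf, sizeof(ChangeHeader));
	*data = buf + sizeof(ChangeHeader);
	*len = hdr.data_len;
	return hdr.kind;
}
```

The same copy-out-before-use discipline is why the patch warns that tup_data.t_data must be fixed on retrieval: the pointer stored in the flat buffer is meaningless once the buffer has been moved.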
Attachment: v15-0005-Add-regression-tests.patch (text/x-diff)
From 9345c800de521bc1ed79921c9dec25b40ae01f28 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Mon, 30 Jun 2025 19:41:43 +0200
Subject: [PATCH 5/7] Add regression tests.
As this patch series adds the CONCURRENTLY option to the REPACK command, it's
appropriate to test that the "concurrent data changes" (i.e. changes done by
application while we are copying the table contents to the new storage) are
processed correctly.
Injection points are used to stop the data copying at some point. While the
backend in charge of the copying is waiting on the injection point, another
backend runs some INSERT, UPDATE and DELETE commands on the table. Then we
wake up the first backend and let the REPACK CONCURRENTLY command
finish. Finally we check that all the "concurrent data changes" are present in
the table and that they contain the correct visibility information.
---
src/backend/commands/cluster.c | 7 +
src/test/modules/injection_points/Makefile | 3 +-
.../injection_points/expected/repack.out | 113 ++++++++++++++
.../modules/injection_points/logical.conf | 1 +
src/test/modules/injection_points/meson.build | 4 +
.../injection_points/specs/repack.spec | 143 ++++++++++++++++++
6 files changed, 270 insertions(+), 1 deletion(-)
create mode 100644 src/test/modules/injection_points/expected/repack.out
create mode 100644 src/test/modules/injection_points/logical.conf
create mode 100644 src/test/modules/injection_points/specs/repack.spec
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 408bdbdff3b..abbbfc99036 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -58,6 +58,7 @@
#include "utils/acl.h"
#include "utils/fmgroids.h"
#include "utils/guc.h"
+#include "utils/injection_point.h"
#include "utils/inval.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
@@ -3006,6 +3007,12 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
*/
ident_key = build_identity_key(ident_idx_new, OldHeap, &ident_key_nentries);
+ /*
+ * During testing, wait for another backend to perform concurrent data
+ * changes which we will process below.
+ */
+ INJECTION_POINT("repack-concurrently-before-lock", NULL);
+
/*
* Flush all WAL records inserted so far (possibly except for the last
* incomplete page, see GetInsertRecPtr), to minimize the amount of data
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index fc82cd67f6c..30ffe509239 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -14,7 +14,8 @@ PGFILEDESC = "injection_points - facility for injection points"
REGRESS = injection_points hashagg reindex_conc vacuum
REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
-ISOLATION = basic inplace syscache-update-pruned
+ISOLATION = basic inplace syscache-update-pruned repack
+ISOLATION_OPTS = --temp-config $(top_srcdir)/src/test/modules/injection_points/logical.conf
TAP_TESTS = 1
diff --git a/src/test/modules/injection_points/expected/repack.out b/src/test/modules/injection_points/expected/repack.out
new file mode 100644
index 00000000000..f919087ca5b
--- /dev/null
+++ b/src/test/modules/injection_points/expected/repack.out
@@ -0,0 +1,113 @@
+Parsed test spec with 2 sessions
+
+starting permutation: wait_before_lock change_existing change_new change_subxact1 change_subxact2 check2 wakeup_before_lock check1
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step wait_before_lock:
+ REPACK CONCURRENTLY repack_test USING INDEX repack_test_pkey;
+ <waiting ...>
+step change_existing:
+ UPDATE repack_test SET i=10 where i=1;
+ UPDATE repack_test SET j=20 where i=2;
+ UPDATE repack_test SET i=30 where i=3;
+ UPDATE repack_test SET i=40 where i=30;
+ DELETE FROM repack_test WHERE i=4;
+
+step change_new:
+ INSERT INTO repack_test(i, j) VALUES (5, 5), (6, 6), (7, 7), (8, 8);
+ UPDATE repack_test SET i=50 where i=5;
+ UPDATE repack_test SET j=60 where i=6;
+ DELETE FROM repack_test WHERE i=7;
+
+step change_subxact1:
+ BEGIN;
+ INSERT INTO repack_test(i, j) VALUES (100, 100);
+ SAVEPOINT s1;
+ UPDATE repack_test SET i=101 where i=100;
+ SAVEPOINT s2;
+ UPDATE repack_test SET i=102 where i=101;
+ COMMIT;
+
+step change_subxact2:
+ BEGIN;
+ SAVEPOINT s1;
+ INSERT INTO repack_test(i, j) VALUES (110, 110);
+ ROLLBACK TO SAVEPOINT s1;
+ INSERT INTO repack_test(i, j) VALUES (110, 111);
+ COMMIT;
+
+step check2:
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='repack_test';
+
+ SELECT i, j FROM repack_test ORDER BY i, j;
+
+ INSERT INTO data_s2(i, j)
+ SELECT i, j FROM repack_test;
+
+ i| j
+---+---
+ 2| 20
+ 6| 60
+ 8| 8
+ 10| 1
+ 40| 3
+ 50| 5
+102|100
+110|111
+(8 rows)
+
+step wakeup_before_lock:
+ SELECT injection_points_wakeup('repack-concurrently-before-lock');
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step wait_before_lock: <... completed>
+step check1:
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='repack_test';
+
+ SELECT count(DISTINCT node) FROM relfilenodes;
+
+ SELECT i, j FROM repack_test ORDER BY i, j;
+
+ INSERT INTO data_s1(i, j)
+ SELECT i, j FROM repack_test;
+
+ SELECT count(*)
+ FROM data_s1 d1 FULL JOIN data_s2 d2 USING (i, j)
+ WHERE d1.i ISNULL OR d2.i ISNULL;
+
+count
+-----
+ 2
+(1 row)
+
+ i| j
+---+---
+ 2| 20
+ 6| 60
+ 8| 8
+ 10| 1
+ 40| 3
+ 50| 5
+102|100
+110|111
+(8 rows)
+
+count
+-----
+ 0
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
diff --git a/src/test/modules/injection_points/logical.conf b/src/test/modules/injection_points/logical.conf
new file mode 100644
index 00000000000..c8f264bc6cb
--- /dev/null
+++ b/src/test/modules/injection_points/logical.conf
@@ -0,0 +1 @@
+wal_level = logical
\ No newline at end of file
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index ce778ccf9ac..c7daa669548 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -47,9 +47,13 @@ tests += {
'specs': [
'basic',
'inplace',
+ 'repack',
'syscache-update-pruned',
],
'runningcheck': false, # see syscache-update-pruned
+ # 'repack' requires wal_level = 'logical'.
+ 'regress_args': ['--temp-config', files('logical.conf')],
+
},
'tap': {
'env': {
diff --git a/src/test/modules/injection_points/specs/repack.spec b/src/test/modules/injection_points/specs/repack.spec
new file mode 100644
index 00000000000..a17064462ce
--- /dev/null
+++ b/src/test/modules/injection_points/specs/repack.spec
@@ -0,0 +1,143 @@
+# Prefix the system columns with underscore as they are not allowed as column
+# names.
+setup
+{
+ CREATE EXTENSION injection_points;
+
+ CREATE TABLE repack_test(i int PRIMARY KEY, j int);
+ INSERT INTO repack_test(i, j) VALUES (1, 1), (2, 2), (3, 3), (4, 4);
+
+ CREATE TABLE relfilenodes(node oid);
+
+ CREATE TABLE data_s1(i int, j int);
+ CREATE TABLE data_s2(i int, j int);
+}
+
+teardown
+{
+ DROP TABLE repack_test;
+ DROP EXTENSION injection_points;
+
+ DROP TABLE relfilenodes;
+ DROP TABLE data_s1;
+ DROP TABLE data_s2;
+}
+
+session s1
+setup
+{
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('repack-concurrently-before-lock', 'wait');
+}
+# Perform the initial load and wait for s2 to do some data changes.
+step wait_before_lock
+{
+ REPACK CONCURRENTLY repack_test USING INDEX repack_test_pkey;
+}
+# Check the table from the perspective of s1.
+#
+# Besides the contents, we also check that relfilenode has changed.
+
+# Have each session write the contents into a table and use FULL JOIN to check
+# if the outputs are identical.
+step check1
+{
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='repack_test';
+
+ SELECT count(DISTINCT node) FROM relfilenodes;
+
+ SELECT i, j FROM repack_test ORDER BY i, j;
+
+ INSERT INTO data_s1(i, j)
+ SELECT i, j FROM repack_test;
+
+ SELECT count(*)
+ FROM data_s1 d1 FULL JOIN data_s2 d2 USING (i, j)
+ WHERE d1.i ISNULL OR d2.i ISNULL;
+}
+teardown
+{
+ SELECT injection_points_detach('repack-concurrently-before-lock');
+}
+
+session s2
+# Change the existing data. UPDATE changes both key and non-key columns. Also
+# update one row twice to test whether a tuple version generated by this
+# session can be found.
+step change_existing
+{
+ UPDATE repack_test SET i=10 where i=1;
+ UPDATE repack_test SET j=20 where i=2;
+ UPDATE repack_test SET i=30 where i=3;
+ UPDATE repack_test SET i=40 where i=30;
+ DELETE FROM repack_test WHERE i=4;
+}
+# Insert new rows and UPDATE / DELETE some of them. Again, update both key
+# and non-key columns.
+step change_new
+{
+ INSERT INTO repack_test(i, j) VALUES (5, 5), (6, 6), (7, 7), (8, 8);
+ UPDATE repack_test SET i=50 where i=5;
+ UPDATE repack_test SET j=60 where i=6;
+ DELETE FROM repack_test WHERE i=7;
+}
+
+# When applying concurrent data changes, we should see the effects of an
+# in-progress subtransaction.
+#
+# XXX Not sure this test is useful now - it was designed for the patch that
+# preserves tuple visibility and which therefore modifies
+# TransactionIdIsCurrentTransactionId().
+step change_subxact1
+{
+ BEGIN;
+ INSERT INTO repack_test(i, j) VALUES (100, 100);
+ SAVEPOINT s1;
+ UPDATE repack_test SET i=101 where i=100;
+ SAVEPOINT s2;
+ UPDATE repack_test SET i=102 where i=101;
+ COMMIT;
+}
+
+# When applying concurrent data changes, we should not see the effects of a
+# rolled back subtransaction.
+#
+# XXX Is this test useful? See above.
+step change_subxact2
+{
+ BEGIN;
+ SAVEPOINT s1;
+ INSERT INTO repack_test(i, j) VALUES (110, 110);
+ ROLLBACK TO SAVEPOINT s1;
+ INSERT INTO repack_test(i, j) VALUES (110, 111);
+ COMMIT;
+}
+
+# Check the table from the perspective of s2.
+step check2
+{
+ INSERT INTO relfilenodes(node)
+ SELECT relfilenode FROM pg_class WHERE relname='repack_test';
+
+ SELECT i, j FROM repack_test ORDER BY i, j;
+
+ INSERT INTO data_s2(i, j)
+ SELECT i, j FROM repack_test;
+}
+step wakeup_before_lock
+{
+ SELECT injection_points_wakeup('repack-concurrently-before-lock');
+}
+
+# Test if data changes introduced while one session is performing REPACK
+# CONCURRENTLY find their way into the table.
+permutation
+ wait_before_lock
+ change_existing
+ change_new
+ change_subxact1
+ change_subxact2
+ check2
+ wakeup_before_lock
+ check1
--
2.47.1
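The isolation test above hinges on a wait/wakeup handshake: session s1 blocks on the injection point until s2 has made its changes and calls injection_points_wakeup(). A minimal thread-level analogue of that handshake, with invented names (WaitPoint, wait_point_*) rather than the injection_points extension's API, looks roughly like this:

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical analogue of an injection point's 'wait' action: a named
 * latch that another thread of control can release. */
typedef struct
{
	pthread_mutex_t lock;
	pthread_cond_t	cond;
	int				released;
} WaitPoint;

void
wait_point_init(WaitPoint *wp)
{
	pthread_mutex_init(&wp->lock, NULL);
	pthread_cond_init(&wp->cond, NULL);
	wp->released = 0;
}

/* Called by the side that hits the point (s1 inside REPACK CONCURRENTLY). */
void
wait_point_wait(WaitPoint *wp)
{
	pthread_mutex_lock(&wp->lock);
	while (!wp->released)		/* loop guards against spurious wakeups */
		pthread_cond_wait(&wp->cond, &wp->lock);
	pthread_mutex_unlock(&wp->lock);
}

/* Called by the controlling side (s2's wakeup_before_lock step). */
void
wait_point_wakeup(WaitPoint *wp)
{
	pthread_mutex_lock(&wp->lock);
	wp->released = 1;
	pthread_cond_broadcast(&wp->cond);
	pthread_mutex_unlock(&wp->lock);
}
```

In the real test the two sides are separate backends, so the extension uses shared memory and condition-variable-like waits rather than pthreads, but the release-flag-plus-signal shape is the same.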
Attachment: v15-0006-Introduce-repack_max_xlock_time-configuration-variab.patch (text/x-diff)
From 327fb689e01d1be5e4ee3e4564d9e659a97f5e18 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Mon, 30 Jun 2025 19:41:43 +0200
Subject: [PATCH 6/7] Introduce repack_max_xlock_time configuration variable.
When executing REPACK CONCURRENTLY, we need an AccessExclusiveLock to swap
the relation files, and that should require only a short time. However, on a
busy system, other backends might change a non-negligible amount of data in
the table while we are waiting for the lock. Since these changes must be
applied to the new storage before the swap, the time we eventually hold the
lock might become non-negligible too.
Users worried about this situation can set repack_max_xlock_time to the
maximum time for which the exclusive lock may be held. If this amount of time
is not sufficient to complete the REPACK CONCURRENTLY command, an ERROR is
raised and the command is canceled.
---
doc/src/sgml/config.sgml | 31 ++++
doc/src/sgml/ref/repack.sgml | 5 +-
src/backend/access/heap/heapam_handler.c | 3 +-
src/backend/commands/cluster.c | 135 +++++++++++++++---
src/backend/utils/misc/guc_tables.c | 15 +-
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/commands/cluster.h | 5 +-
.../injection_points/expected/repack.out | 74 +++++++++-
.../injection_points/specs/repack.spec | 42 ++++++
9 files changed, 290 insertions(+), 21 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 59a0874528a..c0529005c78 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -11239,6 +11239,37 @@ dynamic_library_path = '/usr/local/lib/postgresql:$libdir'
</listitem>
</varlistentry>
+ <varlistentry id="guc-repack-max-xclock-time" xreflabel="repack_max_xlock_time">
+ <term><varname>repack_max_xlock_time</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>repack_max_xlock_time</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ This is the maximum amount of time to hold an exclusive lock on a
+ table by <command>REPACK</command> with
+ the <literal>CONCURRENTLY</literal> option. Typically, these commands
+ should not need the lock for a longer time
+ than <command>TRUNCATE</command> does. However, additional time might
+ be needed if the system is too busy. (See <xref linkend="sql-repack"/>
+ for an explanation of how the <literal>CONCURRENTLY</literal> option works.)
+ </para>
+
+ <para>
+ If you want to restrict the lock time, set this variable to the
+ highest acceptable value. If it turns out during processing that
+ more time would be needed, the command is cancelled and an error
+ is raised.
+ </para>
+
+ <para>
+ The default value is 0, which means that the lock is not released
+ until the concurrent data changes are processed.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect1>
diff --git a/doc/src/sgml/ref/repack.sgml b/doc/src/sgml/ref/repack.sgml
index 9c089a6b3d7..e1313f40599 100644
--- a/doc/src/sgml/ref/repack.sgml
+++ b/doc/src/sgml/ref/repack.sgml
@@ -192,7 +192,10 @@ REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] CONCU
too many data changes have been done to the table while
<command>REPACK</command> was waiting for the lock: those changes must
be processed just before the files are swapped, while the
- <literal>ACCESS EXCLUSIVE</literal> lock is being held.
+ <literal>ACCESS EXCLUSIVE</literal> lock is being held. If you are
+ worried about this situation, set
+ the <link linkend="guc-repack-max-xclock-time"><varname>repack_max_xlock_time</varname></link>
+ configuration parameter to a value that your applications can tolerate.
</para>
<para>
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index c829c06f769..03e722347a1 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -986,7 +986,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
end_of_wal = GetFlushRecPtr(NULL);
if ((end_of_wal - end_of_wal_prev) > wal_segment_size)
{
- repack_decode_concurrent_changes(decoding_ctx, end_of_wal);
+ repack_decode_concurrent_changes(decoding_ctx, end_of_wal,
+ NULL);
end_of_wal_prev = end_of_wal;
}
}
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index abbbfc99036..37f69f369eb 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -17,6 +17,8 @@
*/
#include "postgres.h"
+#include <sys/time.h>
+
#include "access/amapi.h"
#include "access/heapam.h"
#include "access/multixact.h"
@@ -89,6 +91,15 @@ typedef struct
RelFileLocator repacked_rel_locator = {.relNumber = InvalidOid};
RelFileLocator repacked_rel_toast_locator = {.relNumber = InvalidOid};
+/*
+ * The maximum time to hold AccessExclusiveLock during the final
+ * processing. Note that only the execution time of
+ * process_concurrent_changes() is included here. The very last steps, like
+ * swap_relation_files(), shouldn't get blocked, and it'd be wrong to
+ * consider them a reason to abort otherwise completed processing.
+ */
+int repack_max_xlock_time = 0;
+
/*
* Everything we need to call ExecInsertIndexTuples().
*/
@@ -132,7 +143,8 @@ static LogicalDecodingContext *setup_logical_decoding(Oid relid,
static HeapTuple get_changed_tuple(char *change);
static void apply_concurrent_changes(RepackDecodingState *dstate,
Relation rel, ScanKey key, int nkeys,
- IndexInsertState *iistate);
+ IndexInsertState *iistate,
+ struct timeval *must_complete);
static void apply_concurrent_insert(Relation rel, ConcurrentChange *change,
HeapTuple tup, IndexInsertState *iistate,
TupleTableSlot *index_slot);
@@ -148,13 +160,15 @@ static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
IndexInsertState *iistate,
TupleTableSlot *ident_slot,
IndexScanDesc *scan_p);
-static void process_concurrent_changes(LogicalDecodingContext *ctx,
+static bool process_concurrent_changes(LogicalDecodingContext *ctx,
XLogRecPtr end_of_wal,
Relation rel_dst,
Relation rel_src,
ScanKey ident_key,
int ident_key_nentries,
- IndexInsertState *iistate);
+ IndexInsertState *iistate,
+ struct timeval *must_complete);
+static bool processing_time_elapsed(struct timeval *must_complete);
static IndexInsertState *get_index_insert_state(Relation relation,
Oid ident_index_id);
static ScanKey build_identity_key(Oid ident_idx_oid, Relation rel_src,
@@ -2352,7 +2366,8 @@ get_changed_tuple(char *change)
*/
void
repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
- XLogRecPtr end_of_wal)
+ XLogRecPtr end_of_wal,
+ struct timeval *must_complete)
{
RepackDecodingState *dstate;
ResourceOwner resowner_old;
@@ -2382,6 +2397,9 @@ repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
if (record != NULL)
LogicalDecodingProcessRecord(ctx, ctx->reader);
+ if (processing_time_elapsed(must_complete))
+ break;
+
/*
* If WAL segment boundary has been crossed, inform the decoding
* system that the catalog_xmin can advance. (We can confirm more
@@ -2422,7 +2440,8 @@ repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
*/
static void
apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
- ScanKey key, int nkeys, IndexInsertState *iistate)
+ ScanKey key, int nkeys, IndexInsertState *iistate,
+ struct timeval *must_complete)
{
TupleTableSlot *index_slot,
*ident_slot;
@@ -2452,6 +2471,9 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
CHECK_FOR_INTERRUPTS();
+ Assert(dstate->nchanges > 0);
+ dstate->nchanges--;
+
/* Get the change from the single-column tuple. */
tup_change = ExecFetchSlotHeapTuple(dstate->tsslot, false, &shouldFree);
heap_deform_tuple(tup_change, dstate->tupdesc_change, values, isnull);
@@ -2552,10 +2574,22 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
/* TTSOpsMinimalTuple has .get_heap_tuple==NULL. */
Assert(shouldFree);
pfree(tup_change);
+
+ /*
+ * If there is a limit on the completion time, check it now. However,
+ * make sure the loop does not break if tup_old was set in the
+ * previous iteration: in that case we could not resume the
+ * processing in the next call.
+ */
+ if (must_complete && tup_old == NULL &&
+ processing_time_elapsed(must_complete))
+ /* The next call will process the remaining changes. */
+ break;
}
- tuplestore_clear(dstate->tstore);
- dstate->nchanges = 0;
+ /* If we could not apply all the changes, the next call will do. */
+ if (dstate->nchanges == 0)
+ tuplestore_clear(dstate->tstore);
/* Cleanup. */
ExecDropSingleTupleTableSlot(index_slot);
@@ -2737,11 +2771,15 @@ find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
* Decode and apply concurrent changes.
*
* Pass rel_src iff its reltoastrelid is needed.
+ *
+ * Returns true if must_complete is NULL or if we managed to complete by the
+ * time *must_complete indicates.
*/
-static void
+static bool
process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
Relation rel_dst, Relation rel_src, ScanKey ident_key,
- int ident_key_nentries, IndexInsertState *iistate)
+ int ident_key_nentries, IndexInsertState *iistate,
+ struct timeval *must_complete)
{
RepackDecodingState *dstate;
@@ -2750,10 +2788,19 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
dstate = (RepackDecodingState *) ctx->output_writer_private;
- repack_decode_concurrent_changes(ctx, end_of_wal);
+ repack_decode_concurrent_changes(ctx, end_of_wal, must_complete);
+ if (processing_time_elapsed(must_complete))
+ /* Caller is responsible for applying the changes. */
+ return false;
+
+ /*
+ * *must_complete not reached, so there are really no changes. (It's
+ * possible to see no changes just because not enough time was left for
+ * the decoding.)
+ */
if (dstate->nchanges == 0)
- return;
+ return true;
PG_TRY();
{
@@ -2765,7 +2812,7 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
rel_dst->rd_toastoid = rel_src->rd_rel->reltoastrelid;
apply_concurrent_changes(dstate, rel_dst, ident_key,
- ident_key_nentries, iistate);
+ ident_key_nentries, iistate, must_complete);
}
PG_FINALLY();
{
@@ -2773,6 +2820,28 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
rel_dst->rd_toastoid = InvalidOid;
}
PG_END_TRY();
+
+ /*
+ * apply_concurrent_changes() does check the processing time, so if some
+ * changes are left, we ran out of time.
+ */
+ return dstate->nchanges == 0;
+}
+
+/*
+ * Check if the current time is beyond *must_complete.
+ */
+static bool
+processing_time_elapsed(struct timeval *must_complete)
+{
+ struct timeval now;
+
+ if (must_complete == NULL)
+ return false;
+
+ gettimeofday(&now, NULL);
+
+ return timercmp(&now, must_complete, >);
}
static IndexInsertState *
@@ -2934,6 +3003,8 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
Relation *ind_refs,
*ind_refs_p;
int nind;
+ struct timeval t_end;
+ struct timeval *t_end_ptr = NULL;
/* Like in cluster_rel(). */
lockmode_old = ShareUpdateExclusiveLock;
@@ -3029,7 +3100,8 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
*/
process_concurrent_changes(ctx, end_of_wal, NewHeap,
swap_toast_by_content ? OldHeap : NULL,
- ident_key, ident_key_nentries, iistate);
+ ident_key, ident_key_nentries, iistate,
+ NULL);
/*
* Acquire AccessExclusiveLock on the table, its TOAST relation (if there
@@ -3125,9 +3197,40 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
end_of_wal = GetFlushRecPtr(NULL);
/* Apply the concurrent changes again. */
- process_concurrent_changes(ctx, end_of_wal, NewHeap,
- swap_toast_by_content ? OldHeap : NULL,
- ident_key, ident_key_nentries, iistate);
+
+ /*
+ * This time we have the exclusive lock on the table, so make sure that
+ * repack_max_xlock_time is not exceeded.
+ */
+ if (repack_max_xlock_time > 0)
+ {
+ int64 usec;
+ struct timeval t_start;
+
+ gettimeofday(&t_start, NULL);
+ /* Add the whole seconds. */
+ t_end.tv_sec = t_start.tv_sec + repack_max_xlock_time / 1000;
+ /* Add the rest, expressed in microseconds. */
+ usec = t_start.tv_usec + 1000 * (repack_max_xlock_time % 1000);
+ /* The number of microseconds could have overflowed. */
+ t_end.tv_sec += usec / USECS_PER_SEC;
+ t_end.tv_usec = usec % USECS_PER_SEC;
+ t_end_ptr = &t_end;
+ }
+
+ /*
+ * During testing, stop here to simulate excessive processing time.
+ */
+ INJECTION_POINT("repack-concurrently-after-lock", NULL);
+
+ if (!process_concurrent_changes(ctx, end_of_wal, NewHeap,
+ swap_toast_by_content ? OldHeap : NULL,
+ ident_key, ident_key_nentries, iistate,
+ t_end_ptr))
+ ereport(ERROR,
+ (errmsg("could not process concurrent data changes in time"),
+ errhint("Please consider adjusting \"repack_max_xlock_time\".")));
+
/* Remember info about rel before closing OldHeap */
relpersistence = OldHeap->rd_rel->relpersistence;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 511dc32d519..6a373a8b65a 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -42,8 +42,9 @@
#include "catalog/namespace.h"
#include "catalog/storage.h"
#include "commands/async.h"
-#include "commands/extension.h"
+#include "commands/cluster.h"
#include "commands/event_trigger.h"
+#include "commands/extension.h"
#include "commands/tablespace.h"
#include "commands/trigger.h"
#include "commands/user.h"
@@ -2839,6 +2840,18 @@ struct config_int ConfigureNamesInt[] =
1600000000, 0, 2100000000,
NULL, NULL, NULL
},
+ {
+ {"repack_max_xlock_time", PGC_USERSET, LOCK_MANAGEMENT,
+ gettext_noop("Maximum time for REPACK CONCURRENTLY to keep the table locked."),
+ gettext_noop("The table is locked in exclusive mode during the final stage of processing. "
+ "If the lock time exceeds this value, an error is raised and the lock is "
+ "released. Set to zero if you don't care how long the lock can be held."),
+ GUC_UNIT_MS
+ },
+ &repack_max_xlock_time,
+ 0, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
/*
* See also CheckRequiredParameterValues() if this parameter changes
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 341f88adc87..42d32a2c198 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -765,6 +765,7 @@ autovacuum_worker_slots = 16 # autovacuum worker slots to allocate
#lock_timeout = 0 # in milliseconds, 0 is disabled
#idle_in_transaction_session_timeout = 0 # in milliseconds, 0 is disabled
#idle_session_timeout = 0 # in milliseconds, 0 is disabled
+#repack_max_xlock_time = 0		# in milliseconds, 0 disables the limit
#bytea_output = 'hex' # hex, escape
#xmlbinary = 'base64'
#xmloption = 'content'
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 0a7e72bc74a..4914f217267 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -59,6 +59,8 @@ typedef enum ClusterCommand
extern RelFileLocator repacked_rel_locator;
extern RelFileLocator repacked_rel_toast_locator;
+extern PGDLLIMPORT int repack_max_xlock_time;
+
typedef enum
{
CHANGE_INSERT,
@@ -134,7 +136,8 @@ extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid,
LOCKMODE lockmode);
extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
extern void repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
- XLogRecPtr end_of_wal);
+ XLogRecPtr end_of_wal,
+ struct timeval *must_complete);
extern Oid make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
char relpersistence, LOCKMODE lockmode);
extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
diff --git a/src/test/modules/injection_points/expected/repack.out b/src/test/modules/injection_points/expected/repack.out
index f919087ca5b..02967ed9d48 100644
--- a/src/test/modules/injection_points/expected/repack.out
+++ b/src/test/modules/injection_points/expected/repack.out
@@ -1,4 +1,4 @@
-Parsed test spec with 2 sessions
+Parsed test spec with 4 sessions
starting permutation: wait_before_lock change_existing change_new change_subxact1 change_subxact2 check2 wakeup_before_lock check1
injection_points_attach
@@ -111,3 +111,75 @@ injection_points_detach
(1 row)
+injection_points_detach
+-----------------------
+
+(1 row)
+
+
+starting permutation: wait_after_lock wakeup_after_lock
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step wait_after_lock:
+ REPACK CONCURRENTLY repack_test USING INDEX repack_test_pkey;
+ <waiting ...>
+step wakeup_after_lock:
+ SELECT injection_points_wakeup('repack-concurrently-after-lock');
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step wait_after_lock: <... completed>
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+
+starting permutation: wait_after_lock after_lock_delay wakeup_after_lock
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step wait_after_lock:
+ REPACK CONCURRENTLY repack_test USING INDEX repack_test_pkey;
+ <waiting ...>
+step after_lock_delay:
+ SELECT pg_sleep(1.5);
+
+pg_sleep
+--------
+
+(1 row)
+
+step wakeup_after_lock:
+ SELECT injection_points_wakeup('repack-concurrently-after-lock');
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step wait_after_lock: <... completed>
+ERROR: could not process concurrent data changes in time
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
diff --git a/src/test/modules/injection_points/specs/repack.spec b/src/test/modules/injection_points/specs/repack.spec
index a17064462ce..d0fa38dd8cd 100644
--- a/src/test/modules/injection_points/specs/repack.spec
+++ b/src/test/modules/injection_points/specs/repack.spec
@@ -130,6 +130,34 @@ step wakeup_before_lock
SELECT injection_points_wakeup('repack-concurrently-before-lock');
}
+session s3
+setup
+{
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('repack-concurrently-after-lock', 'wait');
+ SET repack_max_xlock_time TO '1s';
+}
+# Perform the initial load, lock the table in exclusive mode and wait. s4 will
+# cancel the waiting.
+step wait_after_lock
+{
+ REPACK CONCURRENTLY repack_test USING INDEX repack_test_pkey;
+}
+teardown
+{
+ SELECT injection_points_detach('repack-concurrently-after-lock');
+}
+
+session s4
+step wakeup_after_lock
+{
+ SELECT injection_points_wakeup('repack-concurrently-after-lock');
+}
+step after_lock_delay
+{
+ SELECT pg_sleep(1.5);
+}
+
# Test if data changes introduced while one session is performing REPACK
# CONCURRENTLY find their way into the table.
permutation
@@ -141,3 +169,17 @@ permutation
check2
wakeup_before_lock
check1
+
+# Test the repack_max_xlock_time configuration variable.
+#
+# First, cancel waiting on the injection point immediately. That way, REPACK
+# should complete.
+permutation
+ wait_after_lock
+ wakeup_after_lock
+# Second, cancel the waiting with a delay that violates
+# repack_max_xlock_time.
+permutation
+ wait_after_lock
+ after_lock_delay
+ wakeup_after_lock
--
2.47.1
v15-0007-Enable-logical-decoding-transiently-only-for-REPACK-.patch
From 9677b13211b05c5893a80404aa3a8d383833b9a3 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Mon, 30 Jun 2025 19:41:43 +0200
Subject: [PATCH 7/7] Enable logical decoding transiently, only for REPACK
CONCURRENTLY.
As REPACK CONCURRENTLY uses logical decoding, it requires wal_level to be set
to 'logical', while 'replica' is the default value. If logical replication is
not used, users will probably be reluctant to set the GUC to 'logical' because
it can affect server performance (by writing additional information to WAL)
and because it cannot be changed to 'logical' only for the time REPACK
CONCURRENTLY is running: a change of this GUC requires a server restart to
take effect.
This patch teaches the postgres backend to recognize whether it should
consider wal_level='logical' "locally" for a particular transaction, even if
the wal_level GUC is actually set to 'replica'. It also ensures that the
logical-decoding-specific information is added to WAL only for the tables
currently being processed by REPACK CONCURRENTLY.
If logical decoding is enabled this way, only temporary replication slots
should be created. The problem with a permanent slot is that it is restored
during server restart, and the restore fails if wal_level is not 'logical'
globally.
There is independent work in progress to enable logical decoding transiently
[1]. ISTM that this is too "heavyweight" a solution for our problem. And I
think that these two approaches are not mutually exclusive: once [1] is
committed, we only need to adjust the XLogLogicalInfoActive() macro.
[1] https://www.postgresql.org/message-id/CAD21AoCVLeLYq09pQPaWs%2BJwdni5FuJ8v2jgq-u9_uFbcp6UbA%40mail.gmail.com
---
doc/src/sgml/ref/repack.sgml | 7 -
src/backend/access/transam/parallel.c | 8 +
src/backend/access/transam/xact.c | 106 ++++-
src/backend/access/transam/xlog.c | 1 +
src/backend/commands/cluster.c | 387 +++++++++++++++++-
src/backend/replication/logical/logical.c | 9 +-
src/backend/storage/ipc/ipci.c | 2 +
src/backend/storage/ipc/standby.c | 4 +-
src/backend/utils/cache/inval.c | 21 +
src/backend/utils/cache/relcache.c | 4 +
src/include/access/xlog.h | 15 +-
src/include/commands/cluster.h | 5 +
src/include/utils/inval.h | 2 +
src/include/utils/rel.h | 9 +-
src/test/modules/injection_points/Makefile | 1 -
.../modules/injection_points/logical.conf | 1 -
src/test/modules/injection_points/meson.build | 3 -
src/tools/pgindent/typedefs.list | 1 +
18 files changed, 540 insertions(+), 46 deletions(-)
delete mode 100644 src/test/modules/injection_points/logical.conf
diff --git a/doc/src/sgml/ref/repack.sgml b/doc/src/sgml/ref/repack.sgml
index e1313f40599..0fd767eef98 100644
--- a/doc/src/sgml/ref/repack.sgml
+++ b/doc/src/sgml/ref/repack.sgml
@@ -260,13 +260,6 @@ REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] CONCU
</para>
</listitem>
- <listitem>
- <para>
- The <link linkend="guc-wal-level"><varname>wal_level</varname></link>
- configuration parameter is less than <literal>logical</literal>.
- </para>
- </listitem>
-
<listitem>
<para>
The <link linkend="guc-max-replication-slots"><varname>max_replication_slots</varname></link>
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 94db1ec3012..a33318ea7bd 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -98,6 +98,7 @@ typedef struct FixedParallelState
TimestampTz xact_ts;
TimestampTz stmt_ts;
SerializableXactHandle serializable_xact_handle;
+ int wal_level_transient;
/* Mutex protects remaining fields. */
slock_t mutex;
@@ -355,6 +356,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
fps->xact_ts = GetCurrentTransactionStartTimestamp();
fps->stmt_ts = GetCurrentStatementStartTimestamp();
fps->serializable_xact_handle = ShareSerializableXact();
+ fps->wal_level_transient = wal_level_transient;
SpinLockInit(&fps->mutex);
fps->last_xlog_end = 0;
shm_toc_insert(pcxt->toc, PARALLEL_KEY_FIXED, fps);
@@ -1550,6 +1552,12 @@ ParallelWorkerMain(Datum main_arg)
/* Attach to the leader's serializable transaction, if SERIALIZABLE. */
AttachSerializableXact(fps->serializable_xact_handle);
+ /*
+ * Restore the information about whether this worker should behave as if
+ * wal_level was WAL_LEVEL_LOGICAL.
+ */
+ wal_level_transient = fps->wal_level_transient;
+
/*
* We've initialized all of our state now; nothing should change
* hereafter.
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 23f2de587a1..be568f70961 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -36,6 +36,7 @@
#include "catalog/pg_enum.h"
#include "catalog/storage.h"
#include "commands/async.h"
+#include "commands/cluster.h"
#include "commands/tablecmds.h"
#include "commands/trigger.h"
#include "common/pg_prng.h"
@@ -126,6 +127,12 @@ static FullTransactionId XactTopFullTransactionId = {InvalidTransactionId};
static int nParallelCurrentXids = 0;
static TransactionId *ParallelCurrentXids;
+/*
+ * Have we determined the value of wal_level_transient for the current
+ * transaction?
+ */
+static bool wal_level_transient_checked = false;
+
/*
* Miscellaneous flag bits to record events which occur on the top level
* transaction. These flags are only persisted in MyXactFlags and are intended
@@ -638,6 +645,7 @@ AssignTransactionId(TransactionState s)
bool isSubXact = (s->parent != NULL);
ResourceOwner currentOwner;
bool log_unknown_top = false;
+ bool set_wal_level_transient = false;
/* Assert that caller didn't screw up */
Assert(!FullTransactionIdIsValid(s->fullTransactionId));
@@ -652,6 +660,32 @@ AssignTransactionId(TransactionState s)
(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
errmsg("cannot assign transaction IDs during a parallel operation")));
+ /*
+ * The first call (i.e. the first write) in the transaction tree
+ * determines whether the whole transaction assumes logical decoding or
+ * not.
+ */
+ if (!wal_level_transient_checked)
+ {
+ Assert(wal_level_transient == WAL_LEVEL_MINIMAL);
+
+ /*
+ * Do not repeat the check when calling this function for parent
+ * transactions.
+ */
+ wal_level_transient_checked = true;
+
+ /*
+ * Remember that the actual check is needed. We cannot do it until the
+ * top-level transaction has its XID assigned, see comments below.
+ *
+ * There is no use case for overriding MINIMAL, and LOGICAL cannot be
+ * overridden as such.
+ */
+ if (wal_level == WAL_LEVEL_REPLICA)
+ set_wal_level_transient = true;
+ }
+
/*
* Ensure parent(s) have XIDs, so that a child always has an XID later
* than its parent. Mustn't recurse here, or we might get a stack
@@ -681,20 +715,6 @@ AssignTransactionId(TransactionState s)
pfree(parents);
}
- /*
- * When wal_level=logical, guarantee that a subtransaction's xid can only
- * be seen in the WAL stream if its toplevel xid has been logged before.
- * If necessary we log an xact_assignment record with fewer than
- * PGPROC_MAX_CACHED_SUBXIDS. Note that it is fine if didLogXid isn't set
- * for a transaction even though it appears in a WAL record, we just might
- * superfluously log something. That can happen when an xid is included
- * somewhere inside a wal record, but not in XLogRecord->xl_xid, like in
- * xl_standby_locks.
- */
- if (isSubXact && XLogLogicalInfoActive() &&
- !TopTransactionStateData.didLogXid)
- log_unknown_top = true;
-
/*
* Generate a new FullTransactionId and record its xid in PGPROC and
* pg_subtrans.
@@ -719,6 +739,54 @@ AssignTransactionId(TransactionState s)
if (!isSubXact)
RegisterPredicateLockingXid(XidFromFullTransactionId(s->fullTransactionId));
+ /*
+ * Check if this transaction should consider wal_level=logical.
+ *
+ * Sometimes we need to turn on the logical decoding transiently although
+ * wal_level=WAL_LEVEL_REPLICA. Currently we do so when at least one table
+ * is being clustered concurrently, i.e. when we should assume that
+ * changes done by this transaction will be decoded. In such a case we
+ * adjust the value of XLogLogicalInfoActive() by setting
+ * wal_level_transient to LOGICAL.
+ *
+ * It's important not to do this check until the XID of the top-level
+ * transaction is in ProcGlobal: if the decoding becomes mandatory right
+ * after the check, our transaction will fail to write the necessary
+ * information to WAL. However, if the top-level transaction is already in
+ * ProcGlobal, its XID is guaranteed to appear in the xl_running_xacts
+ * record and therefore the snapshot builder will not try to decode the
+ * transaction (because it assumes it could have missed the initial part
+ * of the transaction).
+ *
+ * On the other hand, if the decoding became mandatory between the actual
+ * XID assignment and now, the transaction will write the decoding-specific
+ * information to WAL unnecessarily. Let's assume that such race conditions
+ * do not happen too often.
+ */
+ if (set_wal_level_transient)
+ {
+ /*
+ * Check for the operation that enables the logical decoding
+ * transiently.
+ */
+ if (is_concurrent_repack_in_progress(InvalidOid))
+ wal_level_transient = WAL_LEVEL_LOGICAL;
+ }
+
+ /*
+ * When wal_level=logical, guarantee that a subtransaction's xid can only
+ * be seen in the WAL stream if its toplevel xid has been logged before.
+ * If necessary we log an xact_assignment record with fewer than
+ * PGPROC_MAX_CACHED_SUBXIDS. Note that it is fine if didLogXid isn't set
+ * for a transaction even though it appears in a WAL record, we just might
+ * superfluously log something. That can happen when an xid is included
+ * somewhere inside a wal record, but not in XLogRecord->xl_xid, like in
+ * xl_standby_locks.
+ */
+ if (isSubXact && XLogLogicalInfoActive() &&
+ !TopTransactionStateData.didLogXid)
+ log_unknown_top = true;
+
/*
* Acquire lock on the transaction XID. (We assume this cannot block.) We
* have to ensure that the lock is assigned to the transaction's own
@@ -2216,6 +2284,16 @@ StartTransaction(void)
if (TransactionTimeout > 0)
enable_timeout_after(TRANSACTION_TIMEOUT, TransactionTimeout);
+ /*
+ * wal_level_transient can override wal_level for individual transactions,
+ * which effectively enables logical decoding for them. At the moment we
+ * don't know if this transaction will write any data changes to be
+ * decoded. If it does, AssignTransactionId() will check whether the
+ * decoding needs to be considered.
+ */
+ wal_level_transient = WAL_LEVEL_MINIMAL;
+ wal_level_transient_checked = false;
+
ShowTransactionState("StartTransaction");
}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 47ffc0a2307..dc222db6a5d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -129,6 +129,7 @@ bool wal_recycle = true;
bool log_checkpoints = true;
int wal_sync_method = DEFAULT_WAL_SYNC_METHOD;
int wal_level = WAL_LEVEL_REPLICA;
+int wal_level_transient = WAL_LEVEL_MINIMAL;
int CommitDelay = 0; /* precommit delay in microseconds */
int CommitSiblings = 5; /* # concurrent xacts needed to sleep */
int wal_retrieve_retry_interval = 5000;
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 37f69f369eb..7ecef2b86fc 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -84,6 +84,14 @@ typedef struct
* The following definitions are used for concurrent processing.
*/
+/*
+ * OID of the table being repacked by this backend.
+ */
+static Oid repacked_rel = InvalidOid;
+
+/* The same for its TOAST relation. */
+static Oid repacked_rel_toast = InvalidOid;
+
/*
* The locators are used to avoid logical decoding of data that we do not need
* for our table.
@@ -135,8 +143,10 @@ static List *get_tables_to_cluster_partitioned(MemoryContext cluster_context,
ClusterCommand cmd);
static bool cluster_is_permitted_for_relation(Oid relid, Oid userid,
ClusterCommand cmd);
-static void begin_concurrent_repack(Relation rel);
-static void end_concurrent_repack(void);
+static void begin_concurrent_repack(Relation rel, Relation *index_p,
+ bool *entered_p);
+static void end_concurrent_repack(bool error);
+static void cluster_before_shmem_exit_callback(int code, Datum arg);
static LogicalDecodingContext *setup_logical_decoding(Oid relid,
const char *slotname,
TupleDesc tupdesc);
@@ -383,6 +393,8 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
Relation index;
bool concurrent = ((params->options & CLUOPT_CONCURRENT) != 0);
LOCKMODE lmode;
+ bool entered,
+ success;
/*
* Check that the correct lock is held. The lock mode is
@@ -558,23 +570,31 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params,
TransferPredicateLocksToHeapRelation(OldHeap);
/* rebuild_relation does all the dirty work */
+ entered = false;
+ success = false;
PG_TRY();
{
/*
- * For concurrent processing, make sure that our logical decoding
- * ignores data changes of other tables than the one we are
- * processing.
+ * For concurrent processing, make sure that
+ *
+ * 1) our logical decoding ignores data changes of other tables than
+ * the one we are processing.
+ *
+ * 2) other transactions know that REPACK CONCURRENTLY is in progress
+ * for our table, so they write sufficient information to WAL even if
+ * wal_level is < LOGICAL.
*/
if (concurrent)
- begin_concurrent_repack(OldHeap);
+ begin_concurrent_repack(OldHeap, &index, &entered);
rebuild_relation(OldHeap, index, verbose, concurrent, save_userid,
cmd);
+ success = true;
}
PG_FINALLY();
{
- if (concurrent)
- end_concurrent_repack();
+ if (concurrent && entered)
+ end_concurrent_repack(!success);
}
PG_END_TRY();
@@ -2208,6 +2228,49 @@ cluster_is_permitted_for_relation(Oid relid, Oid userid, ClusterCommand cmd)
#define REPL_PLUGIN_NAME "pgoutput_repack"
+/*
+ * Each relation being processed by REPACK CONCURRENTLY must be in the
+ * repackedRelsHash hashtable.
+ */
+typedef struct RepackedRel
+{
+ Oid relid;
+ Oid dbid;
+} RepackedRel;
+
+/* Hashtable of RepackedRel elements. */
+static HTAB *repackedRelsHash = NULL;
+
+/*
+ * Maximum number of entries in the hashtable.
+ *
+ * A replication slot is needed for the processing, so use this GUC to
+ * allocate memory for the hashtable. Multiply by two because TOAST relations
+ * also need to be added to the hashtable.
+ */
+#define MAX_REPACKED_RELS (max_replication_slots * 2)
+
+Size
+RepackShmemSize(void)
+{
+ return hash_estimate_size(MAX_REPACKED_RELS, sizeof(RepackedRel));
+}
+
+void
+RepackShmemInit(void)
+{
+ HASHCTL info;
+
+ info.keysize = sizeof(RepackedRel);
+ info.entrysize = info.keysize;
+ repackedRelsHash = ShmemInitHash("Repacked Relations Hash",
+ MAX_REPACKED_RELS,
+ MAX_REPACKED_RELS,
+ &info,
+ HASH_ELEM | HASH_BLOBS |
+ HASH_FIXED_SIZE);
+}
+
/*
* Call this function before REPACK CONCURRENTLY starts to setup logical
* decoding. It makes sure that other users of the table put enough
@@ -2222,11 +2285,150 @@ cluster_is_permitted_for_relation(Oid relid, Oid userid, ClusterCommand cmd)
*
* Note that TOAST table needs no attention here as it's not scanned using
* historic snapshot.
+ *
+ * 'index_p' is in/out argument because the function unlocks the index
+ * temporarily.
+ *
+ * 'entered_p' receives a bool value telling whether the relation OID was
+ * entered into repackedRelsHash or not.
*/
static void
-begin_concurrent_repack(Relation rel)
+begin_concurrent_repack(Relation rel, Relation *index_p, bool *entered_p)
{
- Oid toastrelid;
+ Oid relid,
+ toastrelid;
+ Relation index = NULL;
+ Oid indexid = InvalidOid;
+ RepackedRel key,
+ *entry;
+ bool found;
+ static bool before_shmem_exit_callback_setup = false;
+
+ relid = RelationGetRelid(rel);
+ index = index_p ? *index_p : NULL;
+
+ /*
+ * Make sure that we do not leave an entry in repackedRelsHash if exiting
+ * due to FATAL.
+ */
+ if (!before_shmem_exit_callback_setup)
+ {
+ before_shmem_exit(cluster_before_shmem_exit_callback, 0);
+ before_shmem_exit_callback_setup = true;
+ }
+
+ memset(&key, 0, sizeof(key));
+ key.relid = relid;
+ key.dbid = MyDatabaseId;
+
+ *entered_p = false;
+ LWLockAcquire(RepackedRelsLock, LW_EXCLUSIVE);
+ entry = (RepackedRel *)
+ hash_search(repackedRelsHash, &key, HASH_ENTER_NULL, &found);
+ if (found)
+ {
+ /*
+ * Since REPACK CONCURRENTLY takes ShareRowExclusiveLock, a conflict
+ * should occur much earlier. However, that lock may be released
+ * temporarily (see below). In any case, we should complain whatever
+ * the reason for the conflict might be.
+ */
+ ereport(ERROR,
+ (errmsg("relation \"%s\" is already being processed by REPACK CONCURRENTLY",
+ RelationGetRelationName(rel))));
+ }
+ if (entry == NULL)
+ ereport(ERROR,
+ (errmsg("too many requests for REPACK CONCURRENTLY at a time")),
+ (errhint("Please consider increasing the \"max_replication_slots\" configuration parameter.")));
+
+ /*
+ * Even if the insertion of TOAST relid should fail below, the caller has
+ * to do cleanup.
+ */
+ *entered_p = true;
+
+ /*
+ * Enable the callback to remove the entry in case of exit. We should not
+ * do this earlier, otherwise an attempt to insert already existing entry
+ * could make us remove that entry (inserted by another backend) during
+ * ERROR handling.
+ */
+ Assert(!OidIsValid(repacked_rel));
+ repacked_rel = relid;
+
+ /*
+ * TOAST relation is not accessed using historic snapshot, but we enter it
+ * here to protect it from being VACUUMed by another backend. (A lock does
+ * not help in the CONCURRENTLY case because we cannot hold it continuously
+ * till the end of the transaction.) See the comments on locking the TOAST
+ * relation in copy_table_data().
+ */
+ toastrelid = rel->rd_rel->reltoastrelid;
+ if (OidIsValid(toastrelid))
+ {
+ key.relid = toastrelid;
+ entry = (RepackedRel *)
+ hash_search(repackedRelsHash, &key, HASH_ENTER_NULL, &found);
+ if (found)
+
+ /*
+ * If we could enter the entry for the main relation, the one for its
+ * TOAST relation should succeed too. Nevertheless, check.
+ */
+ ereport(ERROR,
+ (errmsg("TOAST relation of \"%s\" is already being processed by REPACK CONCURRENTLY",
+ RelationGetRelationName(rel))));
+ if (entry == NULL)
+ ereport(ERROR,
+ (errmsg("too many requests for REPACK CONCURRENTLY at a time")),
+ (errhint("Please consider increasing the \"max_replication_slots\" configuration parameter.")));
+
+ Assert(!OidIsValid(repacked_rel_toast));
+ repacked_rel_toast = toastrelid;
+ }
+
+ LWLockRelease(RepackedRelsLock);
+
+ /*
+ * Make sure that other backends are aware of the new hash entry as soon
+ * as they open our table.
+ */
+ CacheInvalidateRelcacheImmediate(relid);
+
+ /*
+ * Also make sure that the existing users of the table update their
+ * relcache entry as soon as they try to run DML commands on it.
+ *
+ * ShareLock is the weakest lock that conflicts with DMLs. If any backend
+ * has a lower lock, we assume it'll accept our invalidation message when
+ * it changes the lock mode.
+ *
+ * Before upgrading the lock on the relation, close the index temporarily
+ * to avoid a deadlock if another backend running DML already has its lock
+ * (ShareLock) on the table and waits for the lock on the index.
+ */
+ if (index)
+ {
+ indexid = RelationGetRelid(index);
+ index_close(index, ShareUpdateExclusiveLock);
+ }
+ LockRelationOid(relid, ShareLock);
+ UnlockRelationOid(relid, ShareLock);
+ if (OidIsValid(indexid))
+ {
+ /*
+ * Re-open the index and check that it hasn't changed while unlocked.
+ */
+ check_index_is_clusterable(rel, indexid, ShareUpdateExclusiveLock);
+
+ /*
+ * Return the new relcache entry to the caller. (It's been locked by
+ * the call above.)
+ */
+ index = index_open(indexid, NoLock);
+ *index_p = index;
+ }
/* Avoid logical decoding of other relations by this backend. */
repacked_rel_locator = rel->rd_locator;
@@ -2244,15 +2446,176 @@ begin_concurrent_repack(Relation rel)
/*
* Call this when done with REPACK CONCURRENTLY.
+ *
+ * 'error' tells whether the function is being called in order to handle
+ * error.
*/
static void
-end_concurrent_repack(void)
+end_concurrent_repack(bool error)
{
+ RepackedRel key;
+ RepackedRel *entry = NULL;
+ RepackedRel *entry_toast = NULL;
+ Oid relid = repacked_rel;
+ Oid toastrelid = repacked_rel_toast;
+
+ /* Remove the relation from the hash if we managed to insert one. */
+ if (OidIsValid(repacked_rel))
+ {
+ LWLockAcquire(RepackedRelsLock, LW_EXCLUSIVE);
+
+ memset(&key, 0, sizeof(key));
+ key.relid = repacked_rel;
+ key.dbid = MyDatabaseId;
+
+ entry = hash_search(repackedRelsHash, &key, HASH_REMOVE, NULL);
+
+ /* Remove the TOAST relation if there is one. */
+ if (OidIsValid(repacked_rel_toast))
+ {
+ key.relid = repacked_rel_toast;
+ entry_toast = hash_search(repackedRelsHash, &key, HASH_REMOVE,
+ NULL);
+ }
+
+ LWLockRelease(RepackedRelsLock);
+
+ /*
+ * Make others refresh their information whether they should still
+ * treat the table as catalog from the perspective of writing WAL.
+ *
+ * XXX Unlike entering the entry into the hashtable, we do not bother
+ * with locking and unlocking the table here:
+ *
+ * 1) On normal completion (and sometimes even on ERROR), the caller
+ * is already holding AccessExclusiveLock on the table, so there
+ * should be no relcache reference unaware of this change.
+ *
+ * 2) In the other cases, the worst scenario is that the other
+ * backends will write unnecessary information to WAL until they close
+ * the relation.
+ *
+ * Should we use ShareLock mode to fix 2) at least for the non-FATAL
+ * errors? (Our before_shmem_exit callback is in charge of FATAL, and
+ * that probably should not try to acquire any lock.)
+ */
+ CacheInvalidateRelcacheImmediate(repacked_rel);
+
+ /*
+ * By clearing repacked_rel we also disable
+ * cluster_before_shmem_exit_callback().
+ */
+ repacked_rel = InvalidOid;
+ repacked_rel_toast = InvalidOid;
+ }
+
/*
* Restore normal function of (future) logical decoding for this backend.
*/
repacked_rel_locator.relNumber = InvalidOid;
repacked_rel_toast_locator.relNumber = InvalidOid;
+
+ /*
+ * On normal completion (!error), we should not really fail to remove the
+ * entry. But if it wasn't there for any reason, raise ERROR to make sure
+ * the transaction is aborted: if other transactions, while changing the
+ * contents of the relation, didn't know that REPACK CONCURRENTLY was in
+ * progress, they might not have written enough information to WAL, and
+ * thus we could have produced inconsistent table contents.
+ *
+ * On the other hand, if we are already handling an error, there's no
+ * reason to worry about inconsistent contents of the new storage because
+ * the transaction is going to be rolled back anyway. Furthermore, by
+ * raising ERROR here we'd shadow the original error.
+ */
+ if (!error)
+ {
+ char *relname;
+
+ if (OidIsValid(relid) && entry == NULL)
+ {
+ relname = get_rel_name(relid);
+ if (!relname)
+ ereport(ERROR,
+ (errmsg("cache lookup failed for relation %u",
+ relid)));
+
+ ereport(ERROR,
+ (errmsg("relation \"%s\" not found among repacked relations",
+ relname)));
+ }
+
+ /*
+ * Likewise, the TOAST relation should not have disappeared.
+ */
+ if (OidIsValid(toastrelid) && entry_toast == NULL)
+ {
+ relname = get_rel_name(toastrelid);
+ if (!relname)
+ ereport(ERROR,
+ (errmsg("cache lookup failed for relation %u",
+ toastrelid)));
+
+ ereport(ERROR,
+ (errmsg("relation \"%s\" not found among repacked relations",
+ relname)));
+ }
+
+ }
+}
+
+/*
+ * A wrapper to call end_concurrent_repack() as a before_shmem_exit callback.
+ */
+static void
+cluster_before_shmem_exit_callback(int code, Datum arg)
+{
+ if (OidIsValid(repacked_rel))
+ end_concurrent_repack(true);
+}
+
+/*
+ * Check if relation is currently being processed by REPACK CONCURRENTLY.
+ *
+ * If relid is InvalidOid, check if any relation is being processed.
+ */
+bool
+is_concurrent_repack_in_progress(Oid relid)
+{
+ RepackedRel key,
+ *entry;
+
+ /* For a particular relation, we need to search the hashtable. */
+ memset(&key, 0, sizeof(key));
+ key.relid = relid;
+ key.dbid = MyDatabaseId;
+
+ LWLockAcquire(RepackedRelsLock, LW_SHARED);
+ /*
+ * If the caller is interested whether any relation is being repacked,
+ * just check the number of entries.
+ */
+ if (!OidIsValid(relid))
+ {
+ long n = hash_get_num_entries(repackedRelsHash);
+
+ LWLockRelease(RepackedRelsLock);
+ return n > 0;
+ }
+ entry = (RepackedRel *)
+ hash_search(repackedRelsHash, &key, HASH_FIND, NULL);
+ LWLockRelease(RepackedRelsLock);
+
+ return entry != NULL;
+}
+
+/*
+ * Is this backend performing REPACK CONCURRENTLY?
+ */
+bool
+is_concurrent_repack_run_by_me(void)
+{
+ return OidIsValid(repacked_rel);
}
/*
@@ -2282,7 +2645,7 @@ setup_logical_decoding(Oid relid, const char *slotname, TupleDesc tupdesc)
* useful for us.
*
* Regarding the value of need_full_snapshot, we pass false because the
- * table we are processing is present in RepackedRelsHash and therefore,
+ * table we are processing is present in repackedRelsHash and therefore,
* regarding logical decoding, treated like a catalog.
*/
ctx = CreateInitDecodingContext(REPL_PLUGIN_NAME,
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index f1eb798f3e9..5e6000db086 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -31,6 +31,7 @@
#include "access/xact.h"
#include "access/xlog_internal.h"
#include "access/xlogutils.h"
+#include "commands/cluster.h"
#include "fmgr.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -114,10 +115,12 @@ CheckLogicalDecodingRequirements(void)
/*
* NB: Adding a new requirement likely means that RestoreSlotFromDisk()
- * needs the same check.
+ * needs the same check. (Except that only temporary slots should be
+ * created for REPACK CONCURRENTLY, which effectively raises wal_level to
+ * LOGICAL.)
*/
-
- if (wal_level < WAL_LEVEL_LOGICAL)
+ if ((wal_level < WAL_LEVEL_LOGICAL && !is_concurrent_repack_run_by_me())
+ || wal_level < WAL_LEVEL_REPLICA)
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("logical decoding requires \"wal_level\" >= \"logical\"")));
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index e9ddf39500c..e24e1795aa9 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -151,6 +151,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, InjectionPointShmemSize());
size = add_size(size, SlotSyncShmemSize());
size = add_size(size, AioShmemSize());
+ size = add_size(size, RepackShmemSize());
/* include additional requested shmem from preload libraries */
size = add_size(size, total_addin_request);
@@ -344,6 +345,7 @@ CreateOrAttachShmemStructs(void)
WaitEventCustomShmemInit();
InjectionPointShmemInit();
AioShmemInit();
+ RepackShmemInit();
}
/*
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 7fa8d9247e0..ab30d448d42 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -1325,13 +1325,13 @@ LogStandbySnapshot(void)
* record. Fortunately this routine isn't executed frequently, and it's
* only a shared lock.
*/
- if (wal_level < WAL_LEVEL_LOGICAL)
+ if (!XLogLogicalInfoActive())
LWLockRelease(ProcArrayLock);
recptr = LogCurrentRunningXacts(running);
/* Release lock if we kept it longer ... */
- if (wal_level >= WAL_LEVEL_LOGICAL)
+ if (XLogLogicalInfoActive())
LWLockRelease(ProcArrayLock);
/* GetRunningTransactionData() acquired XidGenLock, we must release it */
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 02505c88b8e..ecaa2283c2a 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -1643,6 +1643,27 @@ CacheInvalidateRelcache(Relation relation)
databaseId, relationId);
}
+/*
+ * CacheInvalidateRelcacheImmediate
+ * Send invalidation message for the specified relation's relcache entry.
+ *
+ * Currently this is used in REPACK CONCURRENTLY, to make sure that other
+ * backends are aware that the command is being executed for the relation.
+ */
+void
+CacheInvalidateRelcacheImmediate(Oid relid)
+{
+ SharedInvalidationMessage msg;
+
+ msg.rc.id = SHAREDINVALRELCACHE_ID;
+ msg.rc.dbId = MyDatabaseId;
+ msg.rc.relId = relid;
+ /* check AddCatcacheInvalidationMessage() for an explanation */
+ VALGRIND_MAKE_MEM_DEFINED(&msg, sizeof(msg));
+
+ SendSharedInvalidMessages(&msg, 1);
+}
+
/*
* CacheInvalidateRelcacheAll
* Register invalidation of the whole relcache at the end of command.
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 4911642fb3c..504cb8e56a8 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -1279,6 +1279,10 @@ retry:
/* make sure relation is marked as having no open file yet */
relation->rd_smgr = NULL;
+ /* Is REPACK CONCURRENTLY in progress? */
+ relation->rd_repack_concurrent =
+ is_concurrent_repack_in_progress(targetRelId);
+
/*
* now we can free the memory allocated for pg_class_tuple
*/
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index d313099c027..a325bb1d16b 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -95,6 +95,12 @@ typedef enum RecoveryState
extern PGDLLIMPORT int wal_level;
+/*
+ * wal_level_transient overrides wal_level if logical decoding needs to be
+ * enabled transiently.
+ */
+extern PGDLLIMPORT int wal_level_transient;
+
/* Is WAL archiving enabled (always or only while server is running normally)? */
#define XLogArchivingActive() \
(AssertMacro(XLogArchiveMode == ARCHIVE_MODE_OFF || wal_level >= WAL_LEVEL_REPLICA), XLogArchiveMode > ARCHIVE_MODE_OFF)
@@ -122,8 +128,13 @@ extern PGDLLIMPORT int wal_level;
/* Do we need to WAL-log information required only for Hot Standby and logical replication? */
#define XLogStandbyInfoActive() (wal_level >= WAL_LEVEL_REPLICA)
-/* Do we need to WAL-log information required only for logical replication? */
-#define XLogLogicalInfoActive() (wal_level >= WAL_LEVEL_LOGICAL)
+/*
+ * Do we need to WAL-log information required only for logical replication?
+ *
+ * wal_level_transient overrides wal_level if logical decoding needs to be
+ * active transiently.
+ */
+#define XLogLogicalInfoActive() (Max(wal_level, wal_level_transient) == WAL_LEVEL_LOGICAL)
#ifdef WAL_DEBUG
extern PGDLLIMPORT bool XLOG_DEBUG;
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 4914f217267..9d5a30d0689 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -150,5 +150,10 @@ extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
MultiXactId cutoffMulti,
char newrelpersistence);
+extern Size RepackShmemSize(void);
+extern void RepackShmemInit(void);
+extern bool is_concurrent_repack_in_progress(Oid relid);
+extern bool is_concurrent_repack_run_by_me(void);
+
extern void repack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel);
#endif /* CLUSTER_H */
diff --git a/src/include/utils/inval.h b/src/include/utils/inval.h
index 9b871caef62..ae9dee394dc 100644
--- a/src/include/utils/inval.h
+++ b/src/include/utils/inval.h
@@ -50,6 +50,8 @@ extern void CacheInvalidateCatalog(Oid catalogId);
extern void CacheInvalidateRelcache(Relation relation);
+extern void CacheInvalidateRelcacheImmediate(Oid relid);
+
extern void CacheInvalidateRelcacheAll(void);
extern void CacheInvalidateRelcacheByTuple(HeapTuple classTuple);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index b552359915f..cc84592eb1f 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -253,6 +253,9 @@ typedef struct RelationData
bool pgstat_enabled; /* should relation stats be counted */
/* use "struct" here to avoid needing to include pgstat.h: */
struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+ /* Is REPACK CONCURRENTLY being performed on this relation? */
+ bool rd_repack_concurrent;
} RelationData;
@@ -708,12 +711,16 @@ RelationCloseSmgr(Relation relation)
* it would complicate decoding slightly for little gain). Note that we *do*
* log information for user defined catalog tables since they presumably are
* interesting to the user...
+ *
+ * If particular relations require that, the logical decoding can be active
+ * even if wal_level is REPLICA. Do not log other relations in that case.
*/
#define RelationIsLogicallyLogged(relation) \
(XLogLogicalInfoActive() && \
RelationNeedsWAL(relation) && \
(relation)->rd_rel->relkind != RELKIND_FOREIGN_TABLE && \
- !IsCatalogRelation(relation))
+ !IsCatalogRelation(relation) && \
+ (wal_level == WAL_LEVEL_LOGICAL || (relation)->rd_repack_concurrent))
/* routines in utils/cache/relcache.c */
extern void RelationIncrementReferenceCount(Relation rel);
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 30ffe509239..e71b8a19116 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -15,7 +15,6 @@ REGRESS = injection_points hashagg reindex_conc vacuum
REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
ISOLATION = basic inplace syscache-update-pruned repack
-ISOLATION_OPTS = --temp-config $(top_srcdir)/src/test/modules/injection_points/logical.conf
TAP_TESTS = 1
diff --git a/src/test/modules/injection_points/logical.conf b/src/test/modules/injection_points/logical.conf
deleted file mode 100644
index c8f264bc6cb..00000000000
--- a/src/test/modules/injection_points/logical.conf
+++ /dev/null
@@ -1 +0,0 @@
-wal_level = logical
\ No newline at end of file
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index c7daa669548..13c2b627a0b 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -51,9 +51,6 @@ tests += {
'syscache-update-pruned',
],
'runningcheck': false, # see syscache-update-pruned
- # 'repack' requires wal_level = 'logical'.
- 'regress_args': ['--temp-config', files('logical.conf')],
-
},
'tap': {
'env': {
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 879977ea41f..add58883124 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2532,6 +2532,7 @@ ReorderBufferTupleCidKey
ReorderBufferUpdateProgressTxnCB
ReorderTuple
RepOriginId
+RepackedRel
RepackDecodingState
RepackStmt
ReparameterizeForeignPathByChild_function
--
2.47.1
Hello
I started to read through 0001 and my first reaction is that I would
like to make a few more breaking changes. It appears that the current
patch tries to keep things unchanged, or to keep some things with the
CLUSTER name. I'm going to try and get rid of that. For instance, in
the grammar, I'm going to change the productions for CLUSTER so that
they produce a RepackStmt of the appropriate form; ClusterStmt as a
parse node should just be removed as no longer useful. Also, the
cluster() function becomes ExecRepack(), and things flow down from
there.
This leads to a small problem: right now you can say "CLUSTER;" which
processes all tables for which a clustered index has been defined.
In the submitted patch, REPACK doesn't support that mode, and we need a
way to distinguish the mode where it runs VACUUM FULL on all tables from
the mode where it does the all-table CLUSTER. So I propose we allow that
mode with simply
CLUSTER USING INDEX;
which is consistent with the idea of "REPACK table USING INDEX foo"
being "CLUSTER table USING foo".
Implementation-wise, I'm toying with adding a new "command" for REPACK
called REPACK_COMMAND_CLUSTER_ALL, supporting this mode of operation.
That leads to a grammar like
RepackStmt:
	REPACK USING INDEX
		{
			RepackStmt *n = makeNode(RepackStmt);

			n->command = REPACK_COMMAND_CLUSTER_ALL;
			n->relation = NULL;
			n->indexname = NULL;
			n->params = NIL;
			$$ = (Node *) n;
		}
	| REPACK qualified_name opt_using_index
		{
			RepackStmt *n = makeNode(RepackStmt);

			n->command = REPACK_COMMAND_REPACK;
			n->relation = $2;
			n->indexname = $3;
			n->params = NIL;
			$$ = (Node *) n;
		}
	| REPACK '(' utility_option_list ')' qualified_name opt_using_index
		{
			RepackStmt *n = makeNode(RepackStmt);

			n->command = REPACK_COMMAND_REPACK;
			n->relation = $5;
			n->indexname = $6;
			n->params = $3;
			$$ = (Node *) n;
		}
	| CLUSTER '(' utility_option_list ')'
		{
			RepackStmt *n = makeNode(RepackStmt);

			n->command = REPACK_COMMAND_CLUSTER_ALL;
			n->relation = NULL;
			n->indexname = NULL;
			n->params = $3;
			$$ = (Node *) n;
		}
	| CLUSTER '(' utility_option_list ')' qualified_name cluster_index_specification
		{
			RepackStmt *n = makeNode(RepackStmt);

			n->command = REPACK_COMMAND_CLUSTER;
			n->relation = $5;
			n->indexname = $6;
			n->params = $3;
			$$ = (Node *) n;
		}
	/* unparenthesized VERBOSE kept for pre-14 compatibility */
	| CLUSTER opt_verbose qualified_name cluster_index_specification
		{
			RepackStmt *n = makeNode(RepackStmt);

			n->command = REPACK_COMMAND_CLUSTER;
			n->relation = $3;
			n->indexname = $4;
			n->params = NIL;
			if ($2)
				n->params = lappend(n->params, makeDefElem("verbose", NULL, @2));
			$$ = (Node *) n;
		}
	/* unparenthesized VERBOSE kept for pre-17 compatibility */
	| CLUSTER opt_verbose
		{
			RepackStmt *n = makeNode(RepackStmt);

			n->command = REPACK_COMMAND_CLUSTER_ALL;
			n->relation = NULL;
			n->indexname = NULL;
			n->params = NIL;
			if ($2)
				n->params = lappend(n->params, makeDefElem("verbose", NULL, @2));
			$$ = (Node *) n;
		}
	/* kept for pre-8.3 compatibility */
	| CLUSTER opt_verbose name ON qualified_name
		{
			RepackStmt *n = makeNode(RepackStmt);

			n->command = REPACK_COMMAND_CLUSTER;
			n->relation = $5;
			n->indexname = $3;
			n->params = NIL;
			if ($2)
				n->params = lappend(n->params, makeDefElem("verbose", NULL, @2));
			$$ = (Node *) n;
		}
	;

opt_using_index:
	ExistingIndex		{ $$ = $1; }
	| /*EMPTY*/			{ $$ = NULL; }
	;

cluster_index_specification:
	USING name			{ $$ = $2; }
	| /*EMPTY*/			{ $$ = NULL; }
	;
It's a bit weird that CLUSTER uses just "USING" while REPACK uses
"USING INDEX", but of course we cannot change CLUSTER now; and I think
it's better to have the noise word INDEX for REPACK because it allows
the case of not specifying an index name as described above.
In the current patch we don't yet have a way to use REPACK for an
unadorned "VACUUM FULL" (which processes all tables), but I think I'll
add that as well, if only so that we can claim that the REPACK command
handles all possible legacy command modes, even if they are not useful
in practice. That would probably be unadorned "REPACK;". With this, we
can easily state that all legacy CLUSTER and VACUUM FULL commands have
an equivalent REPACK formulation.
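To illustrate, and assuming the grammar sketched above is what finally
lands (the syntax is still under discussion here), the legacy-to-REPACK
mapping would be roughly:

```sql
-- Legacy command                 -- Proposed REPACK equivalent
CLUSTER t USING i;                -- REPACK t USING INDEX i;
CLUSTER;                          -- REPACK USING INDEX;
VACUUM FULL t;                    -- REPACK t;
VACUUM FULL;                      -- REPACK;
```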
We're of course not going to _remove_ support for any of those legacy
commands. At the same time, we're not planning to add support for
CONCURRENTLY in the all-tables modes (patch 0004), but I don't think
that's a concern. Somebody could later implement that if they wanted
to, but I think it's pretty useless so IMO it's a waste of time.
--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"Hay quien adquiere la mala costumbre de ser infeliz" (M. A. Evans)