logical decoding of two-phase transactions
Here is a resubmission of the patch to implement logical decoding of two-phase transactions (instead of treating them
as ordinary transactions at commit time) [1]. I’ve slightly polished things and used the test_decoding output plugin as the client.
The general idea is quite simple:
* Write the GID along with commit/prepare records in the case of 2PC
* Add several routines to decode prepare records in the same way as already happens in logical decoding.
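As a rough illustration of the first point, the sketch below appends a variable-length GID after a fixed-size (xid, gidlen) header and reads it back. It is a standalone toy, not the patch's code: `demo_write_twophase`, `demo_read_twophase` and `DEMO_GIDSIZE` are hypothetical names, and a plain byte buffer stands in for the WAL record.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_GIDSIZE 200    /* assumption: mirrors GIDSIZE from the patch */

/* Fixed part of the toy record: 4-byte xid followed by 1-byte gidlen. */
#define DEMO_HDRLEN (sizeof(uint32_t) + sizeof(uint8_t))

/*
 * Write xid, gidlen and the GID (including its '\0') into buf.
 * Returns the number of bytes written, or 0 if the GID is too long
 * or the buffer is too small.
 */
size_t
demo_write_twophase(char *buf, size_t buflen, uint32_t xid, const char *gid)
{
    size_t  gidlen;
    uint8_t len8;

    assert(buf != NULL && gid != NULL);

    gidlen = strlen(gid) + 1;           /* include '\0', as the patch does */
    if (gidlen > DEMO_GIDSIZE || DEMO_HDRLEN + gidlen > buflen)
        return 0;

    len8 = (uint8_t) gidlen;
    memcpy(buf, &xid, sizeof(xid));
    memcpy(buf + sizeof(xid), &len8, sizeof(len8));
    memcpy(buf + DEMO_HDRLEN, gid, gidlen);
    return DEMO_HDRLEN + gidlen;
}

/*
 * Read back a record produced by demo_write_twophase.
 * Returns the number of bytes consumed, or 0 if the record is truncated
 * or the GID is not '\0'-terminated within gidlen bytes.
 */
size_t
demo_read_twophase(const char *buf, size_t buflen,
                   uint32_t *xid, char gid_out[DEMO_GIDSIZE])
{
    uint8_t len8;

    assert(buf != NULL && xid != NULL && gid_out != NULL);

    if (buflen < DEMO_HDRLEN)
        return 0;
    memcpy(xid, buf, sizeof(*xid));
    memcpy(&len8, buf + sizeof(*xid), sizeof(len8));

    if (len8 == 0 || len8 > DEMO_GIDSIZE || DEMO_HDRLEN + len8 > buflen)
        return 0;
    if (buf[DEMO_HDRLEN + len8 - 1] != '\0')
        return 0;

    memcpy(gid_out, buf + DEMO_HDRLEN, len8);
    return DEMO_HDRLEN + len8;
}
```

The reader checks the stored length against the buffer and the terminator before copying, which is one way to avoid trusting the record contents blindly.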
I’ve also added an explicit LOCK statement to the test_decoding regression suite to check that it doesn’t break anything. If
somebody can create a scenario that blocks decoding because of the existing dummy backend lock, that would be a great
help. Right now all my tests pass (including the TAP tests from the adjacent mail thread that check recovery of
two-phase transactions after failures).
If we agree on the current approach, then I’m ready to add this to the proposed in-core logical replication.
[1]: /messages/by-id/EE7452CA-3C39-4A0E-97EC-17A414972884@postgrespro.ru
Attachments:
logical_twophase.diff
diff --git a/contrib/test_decoding/expected/prepared.out b/contrib/test_decoding/expected/prepared.out
index 46e915d..af81c47 100644
--- a/contrib/test_decoding/expected/prepared.out
+++ b/contrib/test_decoding/expected/prepared.out
@@ -25,6 +25,7 @@ BEGIN;
INSERT INTO test_prepared1 VALUES (5);
ALTER TABLE test_prepared1 ADD COLUMN data text;
INSERT INTO test_prepared1 VALUES (6, 'frakbar');
+LOCK test_prepared1;
PREPARE TRANSACTION 'test_prepared#3';
-- test that we decode correctly while an uncommitted prepared xact
-- with ddl exists.
@@ -44,27 +45,33 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'inc
-------------------------------------------------------------------------
BEGIN
table public.test_prepared1: INSERT: id[integer]:1
- COMMIT
+ PREPARE
+ COMMIT PREPARED
BEGIN
table public.test_prepared1: INSERT: id[integer]:2
COMMIT
BEGIN
- table public.test_prepared1: INSERT: id[integer]:4
- COMMIT
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE
+ ABORT PREPARED
BEGIN
- table public.test_prepared2: INSERT: id[integer]:7
+ table public.test_prepared1: INSERT: id[integer]:4
COMMIT
BEGIN
table public.test_prepared1: INSERT: id[integer]:5
table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
+ PREPARE
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:7
COMMIT
+ COMMIT PREPARED
BEGIN
table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
COMMIT
BEGIN
table public.test_prepared2: INSERT: id[integer]:9
COMMIT
-(22 rows)
+(28 rows)
SELECT pg_drop_replication_slot('regression_slot');
pg_drop_replication_slot
diff --git a/contrib/test_decoding/sql/prepared.sql b/contrib/test_decoding/sql/prepared.sql
index e726397..ac76b8c 100644
--- a/contrib/test_decoding/sql/prepared.sql
+++ b/contrib/test_decoding/sql/prepared.sql
@@ -25,6 +25,7 @@ BEGIN;
INSERT INTO test_prepared1 VALUES (5);
ALTER TABLE test_prepared1 ADD COLUMN data text;
INSERT INTO test_prepared1 VALUES (6, 'frakbar');
+LOCK test_prepared1;
PREPARE TRANSACTION 'test_prepared#3';
-- test that we decode correctly while an uncommitted prepared xact
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 949e9a7..53ced57 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -232,10 +232,25 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
return;
OutputPluginPrepareWrite(ctx, true);
+
+ switch(txn->xact_action)
+ {
+ case XLOG_XACT_COMMIT:
+ appendStringInfoString(ctx->out, "COMMIT");
+ break;
+ case XLOG_XACT_PREPARE:
+ appendStringInfoString(ctx->out, "PREPARE");
+ break;
+ case XLOG_XACT_COMMIT_PREPARED:
+ appendStringInfoString(ctx->out, "COMMIT PREPARED");
+ break;
+ case XLOG_XACT_ABORT_PREPARED:
+ appendStringInfoString(ctx->out, "ABORT PREPARED");
+ break;
+ }
+
if (data->include_xids)
- appendStringInfo(ctx->out, "COMMIT %u", txn->xid);
- else
- appendStringInfoString(ctx->out, "COMMIT");
+ appendStringInfo(ctx->out, " %u", txn->xid);
if (data->include_timestamp)
appendStringInfo(ctx->out, " (at %s)",
diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 91d27d0..679f457 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -98,10 +98,13 @@ ParseCommitRecord(uint8 info, xl_xact_commit *xlrec, xl_xact_parsed_commit *pars
if (parsed->xinfo & XACT_XINFO_HAS_TWOPHASE)
{
xl_xact_twophase *xl_twophase = (xl_xact_twophase *) data;
+ uint8 gidlen = xl_twophase->gidlen;
parsed->twophase_xid = xl_twophase->xid;
+ data += MinSizeOfXactTwophase;
- data += sizeof(xl_xact_twophase);
+ strcpy(parsed->twophase_gid, data);
+ data += gidlen;
}
if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
@@ -139,6 +142,16 @@ ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_abort *parsed)
data += sizeof(xl_xact_xinfo);
}
+ if (parsed->xinfo & XACT_XINFO_HAS_DBINFO)
+ {
+ xl_xact_dbinfo *xl_dbinfo = (xl_xact_dbinfo *) data;
+
+ parsed->dbId = xl_dbinfo->dbId;
+ parsed->tsId = xl_dbinfo->tsId;
+
+ data += sizeof(xl_xact_dbinfo);
+ }
+
if (parsed->xinfo & XACT_XINFO_HAS_SUBXACTS)
{
xl_xact_subxacts *xl_subxacts = (xl_xact_subxacts *) data;
@@ -164,10 +177,13 @@ ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_abort *parsed)
if (parsed->xinfo & XACT_XINFO_HAS_TWOPHASE)
{
xl_xact_twophase *xl_twophase = (xl_xact_twophase *) data;
+ uint8 gidlen = xl_twophase->gidlen;
parsed->twophase_xid = xl_twophase->xid;
+ data += MinSizeOfXactTwophase;
- data += sizeof(xl_xact_twophase);
+ strcpy(parsed->twophase_gid, data);
+ data += gidlen;
}
}
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 5415604..964bcaf 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -130,7 +130,6 @@ int max_prepared_xacts = 0;
* Note that the max value of GIDSIZE must fit in the uint16 gidlen,
* specified in TwoPhaseFileHeader.
*/
-#define GIDSIZE 200
typedef struct GlobalTransactionData
{
@@ -188,12 +187,14 @@ static void RecordTransactionCommitPrepared(TransactionId xid,
RelFileNode *rels,
int ninvalmsgs,
SharedInvalidationMessage *invalmsgs,
- bool initfileinval);
+ bool initfileinval,
+ const char *gid);
static void RecordTransactionAbortPrepared(TransactionId xid,
int nchildren,
TransactionId *children,
int nrels,
- RelFileNode *rels);
+ RelFileNode *rels,
+ const char *gid);
static void ProcessRecords(char *bufptr, TransactionId xid,
const TwoPhaseCallback callbacks[]);
static void RemoveGXact(GlobalTransaction gxact);
@@ -1236,6 +1237,41 @@ ReadTwoPhaseFile(TransactionId xid, bool give_warnings)
return buf;
}
+/*
+ * ParsePrepareRecord
+ */
+void
+ParsePrepareRecord(uint8 info, char *xlrec, xl_xact_parsed_prepare *parsed)
+{
+ TwoPhaseFileHeader *hdr;
+ char *bufptr;
+
+ hdr = (TwoPhaseFileHeader *) xlrec;
+ bufptr = xlrec + MAXALIGN(sizeof(TwoPhaseFileHeader));
+
+ parsed->twophase_xid = hdr->xid;
+ parsed->dbId = hdr->database;
+ parsed->nsubxacts = hdr->nsubxacts;
+ parsed->ncommitrels = hdr->ncommitrels;
+ parsed->nabortrels = hdr->nabortrels;
+ parsed->nmsgs = hdr->ninvalmsgs;
+
+ strncpy(parsed->twophase_gid, bufptr, hdr->gidlen);
+ bufptr += MAXALIGN(hdr->gidlen);
+
+ parsed->subxacts = (TransactionId *) bufptr;
+ bufptr += MAXALIGN(hdr->nsubxacts * sizeof(TransactionId));
+
+ parsed->commitrels = (RelFileNode *) bufptr;
+ bufptr += MAXALIGN(hdr->ncommitrels * sizeof(RelFileNode));
+
+ parsed->abortrels = (RelFileNode *) bufptr;
+ bufptr += MAXALIGN(hdr->nabortrels * sizeof(RelFileNode));
+
+ parsed->msgs = (SharedInvalidationMessage *) bufptr;
+ bufptr += MAXALIGN(hdr->ninvalmsgs * sizeof(SharedInvalidationMessage));
+}
+
/*
* Reads 2PC data from xlog. During checkpoint this data will be moved to
@@ -1389,11 +1425,12 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
hdr->nsubxacts, children,
hdr->ncommitrels, commitrels,
hdr->ninvalmsgs, invalmsgs,
- hdr->initfileinval);
+ hdr->initfileinval, gid);
else
RecordTransactionAbortPrepared(xid,
hdr->nsubxacts, children,
- hdr->nabortrels, abortrels);
+ hdr->nabortrels, abortrels,
+ gid);
ProcArrayRemove(proc, latestXid);
@@ -2038,7 +2075,8 @@ RecordTransactionCommitPrepared(TransactionId xid,
RelFileNode *rels,
int ninvalmsgs,
SharedInvalidationMessage *invalmsgs,
- bool initfileinval)
+ bool initfileinval,
+ const char *gid)
{
XLogRecPtr recptr;
TimestampTz committs = GetCurrentTimestamp();
@@ -2061,7 +2099,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
nchildren, children, nrels, rels,
ninvalmsgs, invalmsgs,
initfileinval, false,
- xid);
+ xid, gid);
if (replorigin)
@@ -2123,7 +2161,8 @@ RecordTransactionAbortPrepared(TransactionId xid,
int nchildren,
TransactionId *children,
int nrels,
- RelFileNode *rels)
+ RelFileNode *rels,
+ const char *gid)
{
XLogRecPtr recptr;
@@ -2141,7 +2180,7 @@ RecordTransactionAbortPrepared(TransactionId xid,
recptr = XactLogAbortRecord(GetCurrentTimestamp(),
nchildren, children,
nrels, rels,
- xid);
+ xid, gid);
/* Always flush, since we're about to remove the 2PC state file */
XLogFlush(recptr);
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e47fd44..1081f8c 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1230,7 +1230,7 @@ RecordTransactionCommit(void)
nchildren, children, nrels, rels,
nmsgs, invalMessages,
RelcacheInitFileInval, forceSyncCommit,
- InvalidTransactionId /* plain commit */ );
+ InvalidTransactionId, NULL /* plain commit */ );
if (replorigin)
/* Move LSNs forward for this replication origin */
@@ -1582,7 +1582,7 @@ RecordTransactionAbort(bool isSubXact)
XactLogAbortRecord(xact_time,
nchildren, children,
nrels, rels,
- InvalidTransactionId);
+ InvalidTransactionId, NULL);
/*
* Report the latest async abort LSN, so that the WAL writer knows to
@@ -3467,7 +3467,7 @@ BeginTransactionBlock(void)
* resource owner, etc while executing inside a Portal.
*/
bool
-PrepareTransactionBlock(char *gid)
+PrepareTransactionBlock(const char *gid)
{
TransactionState s;
bool result;
@@ -5106,7 +5106,7 @@ XactLogCommitRecord(TimestampTz commit_time,
int nrels, RelFileNode *rels,
int nmsgs, SharedInvalidationMessage *msgs,
bool relcacheInval, bool forceSync,
- TransactionId twophase_xid)
+ TransactionId twophase_xid, const char *twophase_gid)
{
xl_xact_commit xlrec;
xl_xact_xinfo xl_xinfo;
@@ -5178,6 +5178,7 @@ XactLogCommitRecord(TimestampTz commit_time,
{
xl_xinfo.xinfo |= XACT_XINFO_HAS_TWOPHASE;
xl_twophase.xid = twophase_xid;
+ xl_twophase.gidlen = strlen(twophase_gid) + 1; /* Include '\0' */
}
/* dump transaction origin information */
@@ -5228,7 +5229,10 @@ XactLogCommitRecord(TimestampTz commit_time,
}
if (xl_xinfo.xinfo & XACT_XINFO_HAS_TWOPHASE)
- XLogRegisterData((char *) (&xl_twophase), sizeof(xl_xact_twophase));
+ {
+ XLogRegisterData((char *) (&xl_twophase), MinSizeOfXactTwophase);
+ XLogRegisterData((char *) twophase_gid, xl_twophase.gidlen);
+ }
if (xl_xinfo.xinfo & XACT_XINFO_HAS_ORIGIN)
XLogRegisterData((char *) (&xl_origin), sizeof(xl_xact_origin));
@@ -5249,13 +5253,14 @@ XLogRecPtr
XactLogAbortRecord(TimestampTz abort_time,
int nsubxacts, TransactionId *subxacts,
int nrels, RelFileNode *rels,
- TransactionId twophase_xid)
+ TransactionId twophase_xid, const char *twophase_gid)
{
xl_xact_abort xlrec;
xl_xact_xinfo xl_xinfo;
xl_xact_subxacts xl_subxacts;
xl_xact_relfilenodes xl_relfilenodes;
xl_xact_twophase xl_twophase;
+ xl_xact_dbinfo xl_dbinfo;
uint8 info;
@@ -5290,6 +5295,14 @@ XactLogAbortRecord(TimestampTz abort_time,
{
xl_xinfo.xinfo |= XACT_XINFO_HAS_TWOPHASE;
xl_twophase.xid = twophase_xid;
+ xl_twophase.gidlen = strlen(twophase_gid) + 1; /* Include '\0' */
+ }
+
+ if (TransactionIdIsValid(twophase_xid) && XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_DBINFO;
+ xl_dbinfo.dbId = MyDatabaseId;
+ xl_dbinfo.tsId = MyDatabaseTableSpace;
}
if (xl_xinfo.xinfo != 0)
@@ -5304,6 +5317,9 @@ XactLogAbortRecord(TimestampTz abort_time,
if (xl_xinfo.xinfo != 0)
XLogRegisterData((char *) (&xl_xinfo), sizeof(xl_xinfo));
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_DBINFO)
+ XLogRegisterData((char *) (&xl_dbinfo), sizeof(xl_dbinfo));
+
if (xl_xinfo.xinfo & XACT_XINFO_HAS_SUBXACTS)
{
XLogRegisterData((char *) (&xl_subxacts),
@@ -5321,7 +5337,13 @@ XactLogAbortRecord(TimestampTz abort_time,
}
if (xl_xinfo.xinfo & XACT_XINFO_HAS_TWOPHASE)
- XLogRegisterData((char *) (&xl_twophase), sizeof(xl_xact_twophase));
+ {
+ XLogRegisterData((char *) (&xl_twophase), MinSizeOfXactTwophase);
+ XLogRegisterData((char *) twophase_gid, xl_twophase.gidlen);
+ }
+
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_DBINFO)
+ XLogRegisterData((char *) (&xl_dbinfo), sizeof(xl_dbinfo));
return XLogInsert(RM_XACT_ID, info);
}
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 46cd5ba..c15c2ed 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -34,6 +34,7 @@
#include "access/xlogutils.h"
#include "access/xlogreader.h"
#include "access/xlogrecord.h"
+#include "access/twophase.h"
#include "catalog/pg_control.h"
@@ -71,7 +72,9 @@ static void DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf
static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_commit *parsed, TransactionId xid);
static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
- xl_xact_parsed_abort *parsed, TransactionId xid);
+ xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare *parsed);
/* common function to decode tuples */
static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -221,6 +224,8 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT)
return;
+ reorder->xact_action = info;
+
switch (info)
{
case XLOG_XACT_COMMIT:
@@ -277,17 +282,13 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
break;
}
case XLOG_XACT_PREPARE:
-
- /*
- * Currently decoding ignores PREPARE TRANSACTION and will just
- * decode the transaction when the COMMIT PREPARED is sent or
- * throw away the transaction's contents when a ROLLBACK PREPARED
- * is received. In the future we could add code to expose prepared
- * transactions in the changestream allowing for a kind of
- * distributed 2PC.
- */
- ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
- break;
+ {
+ xl_xact_parsed_prepare parsed;
+ ParsePrepareRecord(XLogRecGetInfo(buf->record),
+ XLogRecGetData(buf->record), &parsed);
+ DecodePrepare(ctx, buf, &parsed);
+ break;
+ }
default:
elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
}
@@ -607,6 +608,67 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
buf->origptr, buf->endptr);
}
+ if (TransactionIdIsValid(parsed->twophase_xid)) {
+ /*
+ * We are processing COMMIT PREPARED and know that the reorder buffer is
+ * empty, so we can use the shortcut for committing a bare xact.
+ */
+ strcpy(ctx->reorder->gid, parsed->twophase_gid);
+ ReorderBufferCommitBareXact(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn);
+ } else {
+ /* replay actions of all transaction + subtransactions in order */
+ ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn);
+ }
+}
+
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare *parsed)
+{
+ XLogRecPtr origin_lsn = InvalidXLogRecPtr;
+ TimestampTz commit_time = 0;
+ XLogRecPtr origin_id = XLogRecGetOrigin(buf->record);
+ int i;
+ TransactionId xid = parsed->twophase_xid;
+ strcpy(ctx->reorder->gid, parsed->twophase_gid);
+
+ /*
+ * Process invalidation messages, even if we're not interested in the
+ * transaction's contents, since the various caches need to always be
+ * consistent.
+ */
+ if (parsed->nmsgs > 0)
+ {
+ ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
+ parsed->nmsgs, parsed->msgs);
+ ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+ }
+
+ SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
+ parsed->nsubxacts, parsed->subxacts);
+
+ if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+ (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+ FilterByOrigin(ctx, origin_id))
+ {
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr);
+ }
+ ReorderBufferForget(ctx->reorder, xid, buf->origptr);
+
+ return;
+ }
+
+ /* tell the reorderbuffer about the surviving subtransactions */
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+ buf->origptr, buf->endptr);
+ }
+
/* replay actions of all transaction + subtransactions in order */
ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
commit_time, origin_id, origin_lsn);
@@ -621,6 +683,22 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid)
{
int i;
+ XLogRecPtr origin_lsn = InvalidXLogRecPtr;
+ XLogRecPtr commit_time = InvalidXLogRecPtr;
+ XLogRecPtr origin_id = XLogRecGetOrigin(buf->record);
+
+ /*
+ * If this is ROLLBACK PREPARED, then send it to the callbacks.
+ */
+ if (TransactionIdIsValid(parsed->twophase_xid)
+ && (parsed->dbId == ctx->slot->data.database)) {
+
+ strcpy(ctx->reorder->gid, parsed->twophase_gid);
+
+ ReorderBufferCommitBareXact(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn);
+ return;
+ }
SnapBuildAbortTxn(ctx->snapshot_builder, buf->record->EndRecPtr, xid,
parsed->nsubxacts, parsed->subxacts);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index fa84bd8..23176c6 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1373,6 +1373,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
txn->commit_time = commit_time;
txn->origin_id = origin_id;
txn->origin_lsn = origin_lsn;
+ txn->xact_action = rb->xact_action;
+ memcpy(txn->gid, rb->gid, GIDSIZE);
/*
* If this transaction didn't have any real changes in our database, it's
@@ -1708,6 +1710,32 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
PG_END_TRY();
}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ABORT PREPARED.
+ */
+void
+ReorderBufferCommitBareXact(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+ true);
+
+ txn->final_lsn = commit_lsn;
+ txn->end_lsn = end_lsn;
+ txn->commit_time = commit_time;
+ txn->origin_id = origin_id;
+ txn->origin_lsn = origin_lsn;
+ txn->xact_action = rb->xact_action;
+ strcpy(txn->gid, rb->gid);
+
+ rb->commit(rb, txn, commit_lsn);
+}
+
/*
* Abort a transaction that possibly has previous changes. Needs to be first
* called for subtransactions and then for the toplevel xid.
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index b7ce0c6..1b8e7a0 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -15,6 +15,7 @@
#define TWOPHASE_H
#include "access/xlogdefs.h"
+#include "access/xact.h"
#include "datatype/timestamp.h"
#include "storage/lock.h"
@@ -46,6 +47,8 @@ extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
int *nxids_p);
+extern void ParsePrepareRecord(uint8 info, char *xlrec,
+ xl_xact_parsed_prepare *parsed);
extern void StandbyRecoverPreparedTransactions(bool overwriteOK);
extern void RecoverPreparedTransactions(void);
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index a123d2a..eb052f9 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -21,6 +21,10 @@
#include "storage/sinval.h"
#include "utils/datetime.h"
+/*
+ * Maximum size of Global Transaction ID.
+ */
+#define GIDSIZE 200
/*
* Xact isolation levels
@@ -224,7 +228,10 @@ typedef struct xl_xact_invals
typedef struct xl_xact_twophase
{
TransactionId xid;
+ uint8 gidlen;
+ char gid[GIDSIZE];
} xl_xact_twophase;
+#define MinSizeOfXactTwophase offsetof(xl_xact_twophase, gid)
typedef struct xl_xact_origin
{
@@ -283,13 +290,37 @@ typedef struct xl_xact_parsed_commit
SharedInvalidationMessage *msgs;
TransactionId twophase_xid; /* only for 2PC */
+ char twophase_gid[GIDSIZE];
XLogRecPtr origin_lsn;
TimestampTz origin_timestamp;
} xl_xact_parsed_commit;
+typedef struct xl_xact_parsed_prepare
+{
+ Oid dbId; /* MyDatabaseId */
+
+ int nsubxacts;
+ TransactionId *subxacts;
+
+ int ncommitrels;
+ RelFileNode *commitrels;
+
+ int nabortrels;
+ RelFileNode *abortrels;
+
+ int nmsgs;
+ SharedInvalidationMessage *msgs;
+
+ TransactionId twophase_xid;
+ char twophase_gid[GIDSIZE];
+} xl_xact_parsed_prepare;
+
typedef struct xl_xact_parsed_abort
{
+ Oid dbId;
+ Oid tsId;
+
TimestampTz xact_time;
uint32 xinfo;
@@ -300,6 +331,7 @@ typedef struct xl_xact_parsed_abort
RelFileNode *xnodes;
TransactionId twophase_xid; /* only for 2PC */
+ char twophase_gid[GIDSIZE];
} xl_xact_parsed_abort;
@@ -331,7 +363,7 @@ extern void CommitTransactionCommand(void);
extern void AbortCurrentTransaction(void);
extern void BeginTransactionBlock(void);
extern bool EndTransactionBlock(void);
-extern bool PrepareTransactionBlock(char *gid);
+extern bool PrepareTransactionBlock(const char *gid);
extern void UserAbortTransactionBlock(void);
extern void ReleaseSavepoint(List *options);
extern void DefineSavepoint(char *name);
@@ -364,12 +396,12 @@ extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
int nrels, RelFileNode *rels,
int nmsgs, SharedInvalidationMessage *msgs,
bool relcacheInval, bool forceSync,
- TransactionId twophase_xid);
+ TransactionId twophase_xid, const char *twophase_gid);
extern XLogRecPtr XactLogAbortRecord(TimestampTz abort_time,
int nsubxacts, TransactionId *subxacts,
int nrels, RelFileNode *rels,
- TransactionId twophase_xid);
+ TransactionId twophase_xid, const char *twophase_gid);
extern void xact_redo(XLogReaderState *record);
/* xactdesc.c */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 9e209ae..13a2195 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
#define REORDERBUFFER_H
#include "access/htup_details.h"
+#include "access/twophase.h"
#include "lib/ilist.h"
#include "storage/sinval.h"
#include "utils/hsearch.h"
@@ -144,6 +145,14 @@ typedef struct ReorderBufferTXN
*/
TransactionId xid;
+ /*
+ * The commit callback is used for COMMIT/PREPARE/COMMIT PREPARED,
+ * as well as for the aborts ROLLBACK and ROLLBACK PREPARED. The actual
+ * xact action is stored here so that a decoding plugin can distinguish them.
+ */
+ uint8 xact_action;
+ char gid[GIDSIZE];
+
/* did the TX have catalog changes */
bool has_catalog_changes;
@@ -299,6 +308,10 @@ struct ReorderBuffer
*/
HTAB *by_txn;
+ /* For twophase tx support we need to pass XACT action to ReorderBufferTXN */
+ uint8 xact_action;
+ char gid[GIDSIZE];
+
/*
* Transactions that could be a toplevel xact, ordered by LSN of the first
* record bearing that xid.
@@ -375,6 +388,10 @@ void ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot
void ReorderBufferCommit(ReorderBuffer *, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void ReorderBufferCommitBareXact(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn);
void ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
void ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
On 31 December 2016 at 08:36, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
> Here is a resubmission of the patch to implement logical decoding of two-phase transactions (instead of treating them
> as ordinary transactions at commit time) [1]. I’ve slightly polished things and used the test_decoding output plugin as the client.
Sounds good.
> The general idea is quite simple:
> * Write the GID along with commit/prepare records in the case of 2PC
GID is now variable sized. You seem to have added this to every
commit, not just 2PC.
> * Add several routines to decode prepare records in the same way as already happens in logical decoding.
> I’ve also added an explicit LOCK statement to the test_decoding regression suite to check that it doesn’t break anything.
Please explain that in comments in the patch.
> If somebody can create a scenario that blocks decoding because of the existing dummy backend lock, that would be a great
> help. Right now all my tests pass (including the TAP tests from the adjacent mail thread that check recovery of
> two-phase transactions after failures).
> If we agree on the current approach, then I’m ready to add this to the proposed in-core logical replication.
> [1] /messages/by-id/EE7452CA-3C39-4A0E-97EC-17A414972884@postgrespro.ru
We'll need some measurements about additional WAL space or mem usage
from these approaches. Thanks.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 4 January 2017 at 21:20, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 31 December 2016 at 08:36, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
>> Here is a resubmission of the patch to implement logical decoding of two-phase transactions (instead of treating them
>> as ordinary transactions at commit time) [1]. I’ve slightly polished things and used the test_decoding output plugin as the client.
> Sounds good.
>> The general idea is quite simple:
>> * Write the GID along with commit/prepare records in the case of 2PC
> GID is now variable sized. You seem to have added this to every
> commit, not just 2PC.
I've just realised that you're adding GID because it allows you to
uniquely identify the prepared xact. But then the prepared xact will
also have a regular TransactionId, which is also unique. GID exists
for users to specify things, but it is not needed internally and we
don't need to add it here. What we do need is for the commit prepared
message to remember what the xid of the prepare was and then re-find
it using the commit WAL record's twophase_xid field. So we don't need
to add GID to any WAL records, nor to any in-memory structures.
Please re-work the patch to include twophase_xid, which should make
the patch smaller and much faster too.
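One way to read the suggestion above is that the consumer tracks prepared transactions purely by xid. The sketch below is only an illustration under that assumption: `demo_prepared_set`, `demo_on_prepare` and `demo_on_finish_prepared` are hypothetical names, and the fixed-size array is a stand-in for whatever structure a real decoder would use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_PREPARED 8     /* arbitrary capacity for the sketch */

/* Set of xids that have been PREPAREd but not yet finished. */
typedef struct demo_prepared_set
{
    uint32_t xids[DEMO_MAX_PREPARED];
    size_t   count;
} demo_prepared_set;

/* Called when a PREPARE record is decoded: remember its xid. */
bool
demo_on_prepare(demo_prepared_set *set, uint32_t xid)
{
    assert(set != NULL);

    if (set->count == DEMO_MAX_PREPARED)
        return false;           /* table full */
    set->xids[set->count++] = xid;
    return true;
}

/*
 * Called when COMMIT PREPARED or ROLLBACK PREPARED is decoded.
 * The record's twophase_xid is used to re-find the earlier PREPARE;
 * returns false if no such prepared xact is being tracked.
 */
bool
demo_on_finish_prepared(demo_prepared_set *set, uint32_t twophase_xid)
{
    assert(set != NULL);

    for (size_t i = 0; i < set->count; i++)
    {
        if (set->xids[i] == twophase_xid)
        {
            /* Forget the entry by moving the last one into its slot. */
            set->xids[i] = set->xids[--set->count];
            return true;
        }
    }
    return false;
}
```

Nothing here carries a GID, which is the point of the proposal; whether that is enough for clients is a separate question.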
Please add comments to explain how and why patches work. Design
comments allow us to check the design makes sense and if it does
whether all the lines in the patch are needed to follow the design.
Without that patches are much harder to commit and we all want patches
to be easier to commit.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Thank you for looking into this.
On 5 Jan 2017, at 09:43, Simon Riggs <simon@2ndquadrant.com> wrote:
> GID is now variable sized. You seem to have added this to every
> commit, not just 2PC
Hm, I didn’t realise that; I’ll fix it.
> I've just realised that you're adding GID because it allows you to
> uniquely identify the prepared xact. But then the prepared xact will
> also have a regular TransactionId, which is also unique. GID exists
> for users to specify things, but it is not needed internally and we
> don't need to add it here.
I think we can’t avoid pushing the GID down to the client side anyway.
If we push down only the local TransactionId to the remote server, then we lose the mapping
from GID to TransactionId, and there will be no way for the user to identify their transaction on
the second server. Also, Open XA and lots of libraries (e.g. J2EE) assume that the same GID is
used everywhere and that it is the GID issued by the client.
Requirements for two-phase decoding can differ depending on what one wants
to build around it, and I believe that in some situations pushing down the xid is enough. But IMO
dealing with reconnects, failures and client libraries will force the programmer to use
the same GID everywhere.
> What we do need is for the commit prepared
> message to remember what the xid of the prepare was and then re-find
> it using the commit WAL record's twophase_xid field. So we don't need
> to add GID to any WAL records, nor to any in-memory structures.
The other part of the story is how to find the GID while decoding the commit prepared record.
I did that by adding a GID field to the commit WAL record, because by the time of decoding
all the memory structures that held the xid<->gid correspondence have already been cleaned up.
--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 5 January 2017 at 10:21, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
> Thank you for looking into this.
> On 5 Jan 2017, at 09:43, Simon Riggs <simon@2ndquadrant.com> wrote:
>> GID is now variable sized. You seem to have added this to every
>> commit, not just 2PC
> Hm, I didn’t realise that; I’ll fix it.
>> I've just realised that you're adding GID because it allows you to
>> uniquely identify the prepared xact. But then the prepared xact will
>> also have a regular TransactionId, which is also unique. GID exists
>> for users to specify things, but it is not needed internally and we
>> don't need to add it here.
> I think we can’t avoid pushing the GID down to the client side anyway.
> If we push down only the local TransactionId to the remote server, then we lose the mapping
> from GID to TransactionId, and there will be no way for the user to identify their transaction on
> the second server. Also, Open XA and lots of libraries (e.g. J2EE) assume that the same GID is
> used everywhere and that it is the GID issued by the client.
> Requirements for two-phase decoding can differ depending on what one wants
> to build around it, and I believe that in some situations pushing down the xid is enough. But IMO
> dealing with reconnects, failures and client libraries will force the programmer to use
> the same GID everywhere.
Surely in this case the master server is acting as the Transaction
Manager, and it knows the mapping, so we are good?
I guess if you are using >2 nodes then you need to use full 2PC on each node.
But even then, if you adopt the naming convention that all in-progress
xacts will be called RepOriginId-EPOCH-XID, so they have a fully
unique GID on all of the child nodes then we don't need to add the
GID.
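The naming convention suggested above could look roughly like the following. This is a sketch only: the exact text format, the helper names `demo_make_gid` and `demo_parse_gid`, and the integer widths chosen for the three parts (16-bit origin id, 32-bit epoch, 32-bit xid) are assumptions made for the example.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define DEMO_GIDSIZE 200    /* assumption: mirrors GIDSIZE from the patch */

/*
 * Build a GID of the form "<origin>-<epoch>-<xid>".
 * Returns false if the result does not fit into out.
 */
bool
demo_make_gid(char *out, size_t outlen,
              uint16_t origin, uint32_t epoch, uint32_t xid)
{
    int n;

    assert(out != NULL);

    n = snprintf(out, outlen, "%" PRIu16 "-%" PRIu32 "-%" PRIu32,
                 origin, epoch, xid);
    return n > 0 && (size_t) n < outlen;
}

/*
 * Parse a GID produced by demo_make_gid. Returns false for anything that
 * does not consist of exactly three '-'-separated numbers, e.g. a GID
 * chosen freely by a user.
 */
bool
demo_parse_gid(const char *gid,
               uint16_t *origin, uint32_t *epoch, uint32_t *xid)
{
    char trailing;

    assert(gid != NULL && origin != NULL && epoch != NULL && xid != NULL);

    /* The extra %c only matches if junk follows the third number. */
    return sscanf(gid, "%" SCNu16 "-%" SCNu32 "-%" SCNu32 "%c",
                  origin, epoch, xid, &trailing) == 3;
}
```

With such a scheme every node can derive the GID from values it already has, at the cost of not accepting arbitrary user-supplied GIDs.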
Please explain precisely how you expect to use this, to check that GID
is required.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 5 Jan 2017, at 13:49, Simon Riggs <simon@2ndquadrant.com> wrote:
Surely in this case the master server is acting as the Transaction
Manager, and it knows the mapping, so we are good?
I guess if you are using >2 nodes then you need to use full 2PC on each node.
Please explain precisely how you expect to use this, to check that GID
is required.
For example, suppose we are using logical replication just for failover/HA and letting the user
act as the transaction manager. The user prepares a tx on server A, and then server A
crashes. The client may then want to reconnect to server B and commit or abort that tx,
but the user only has the GID that was used during prepare.
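The failover case can be sketched with a toy model. `Node` and `replicate_prepare` are hypothetical names for illustration, not PostgreSQL APIs; the `xid-<n>` fallback name is likewise an assumption about what a replica might invent if no GID is shipped.

```python
class Node:
    """Toy server that stores prepared transactions by identifier."""

    def __init__(self):
        self.prepared = {}  # identifier -> list of changes

    def prepare(self, ident, changes):
        self.prepared[ident] = changes

    def commit_prepared(self, ident):
        # Raises KeyError when the identifier is unknown on this node.
        return self.prepared.pop(ident)


def replicate_prepare(standby, gid, xid, changes, ship_gid):
    """Replay a PREPARE on the standby, with or without the original GID."""
    ident = gid if ship_gid else f"xid-{xid}"
    standby.prepare(ident, changes)
```

If the GID is shipped, a client that only remembers `'order-42'` can finish the transaction on server B. If only the xid is shipped, the same client call fails because B knows the transaction under a different name.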
But even then, if you adopt the naming convention that all in-progress
xacts will be called RepOriginId-EPOCH-XID, so they have a fully
unique GID on all of the child nodes then we don't need to add the
GID.
Yes, that’s also possible, but it seems less flexible since it restricts us to one
specific GID format.
Anyway, I can measure the WAL space overhead introduced by the GIDs inside commit records
to know exactly what the cost of such an approach will be.
--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 5 January 2017 at 20:43, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
Anyway, I can measure WAL space overhead introduced by the GID’s inside commit records
to know exactly what will be the cost of such approach.
Sounds like a good idea, especially if you remove any attempt to work
with GIDs for !2PC commits at the same time.
I don't think I care about having access to the GID for the use case I
have in mind, since we'd actually be wanting to hijack a normal COMMIT
and internally transform it to PREPARE TRANSACTION, <do stuff>, COMMIT
PREPARED. But for the more general case of logical decoding of 2PC I
can see the utility of having the xact identifier.
If we presume we're only interested in logically decoding 2PC xacts
that are not yet COMMIT PREPAREd, can we not avoid the WAL overhead of
writing the GID by looking it up in our shmem state at decoding-time
for PREPARE TRANSACTION? If we can't find the prepared transaction in
TwoPhaseState we know to expect a following ROLLBACK PREPARED or
COMMIT PREPARED, so we shouldn't decode it at the PREPARE TRANSACTION
stage.
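A minimal sketch of the lookup proposed above, assuming the in-memory prepared-transaction state can be modelled as a dictionary from xid to GID. `decode_prepare` and its return values are hypothetical, not an existing PostgreSQL function.

```python
def decode_prepare(xid, two_phase_state):
    """Decide how to handle a PREPARE record seen during decoding.

    two_phase_state models the shared-memory prepared-xact table as a
    dict of xid -> gid (an assumption made for illustration).
    """
    gid = two_phase_state.get(xid)
    if gid is None:
        # Already resolved: a COMMIT PREPARED or ROLLBACK PREPARED record
        # follows in WAL, so decode it there as a normal transaction.
        return ("defer", None)
    # Still prepared: decode now and hand the GID to the output plugin.
    return ("prepare", gid)
```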
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 5 January 2017 at 12:43, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
On 5 Jan 2017, at 13:49, Simon Riggs <simon@2ndquadrant.com> wrote:
Surely in this case the master server is acting as the Transaction
Manager, and it knows the mapping, so we are good?
I guess if you are using >2 nodes then you need to use full 2PC on each node.
Please explain precisely how you expect to use this, to check that GID
is required.
For example if we are using logical replication just for failover/HA and allowing user
to be transaction manager itself. Then suppose that user prepared tx on server A and server A
crashed. After that client may want to reconnect to server B and commit/abort that tx.
But user only have GID that was used during prepare.
I don't think that's the case you're trying to support, and I don't think
that's a common case that we want to pay the price to put into core in
a non-optional way.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 5 January 2017 at 20:43, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
On 5 Jan 2017, at 13:49, Simon Riggs <simon@2ndquadrant.com> wrote:
Surely in this case the master server is acting as the Transaction
Manager, and it knows the mapping, so we are good?
I guess if you are using >2 nodes then you need to use full 2PC on each node.
Please explain precisely how you expect to use this, to check that GID
is required.
For example if we are using logical replication just for failover/HA and allowing user
to be transaction manager itself. Then suppose that user prepared tx on server A and server A
crashed. After that client may want to reconnect to server B and commit/abort that tx.
But user only have GID that was used during prepare.
But even then, if you adopt the naming convention that all in-progress
xacts will be called RepOriginId-EPOCH-XID, so they have a fully
unique GID on all of the child nodes then we don't need to add the
GID.
Yes, that’s also possible but seems to be less flexible restricting us to some
specific GID format.
Anyway, I can measure WAL space overhead introduced by the GID’s inside commit records
to know exactly what will be the cost of such approach.
Stas,
Have you had a chance to look at this further?
I think the approach of storing just the xid and fetching the GID
during logical decoding of the PREPARE TRANSACTION is probably the
best way forward, per my prior mail. That should eliminate Simon's
objection re the cost of tracking GIDs and still let us have access to
them when we want them, which is the best of both worlds really.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Yes, that’s also possible but seems to be less flexible restricting us to some
specific GID format.
Anyway, I can measure WAL space overhead introduced by the GID’s inside commit records
to know exactly what will be the cost of such approach.
Stas,
Have you had a chance to look at this further?
Generally I’m okay with Simon’s approach and will send an updated patch. I still want to
run some tests to estimate how much disk space is actually wasted by the extra WAL records.
I think the approach of storing just the xid and fetching the GID
during logical decoding of the PREPARE TRANSACTION is probably the
best way forward, per my prior mail.
I don’t think that’s possible this way. If we do not put the GID in the commit record, then by the time
logical decoding happens the transaction will already be committed/aborted and there will
be no easy way to get that GID. I thought about several possibilities:
* Tracking an xid/gid map in memory doesn’t help much either: if the server reboots between prepare
and commit we’ll lose that mapping.
* We can provide some hooks on prepared tx recovery during startup, but that approach also fails
if the reboot happens between commit and decoding of that commit.
* Logical messages are WAL-logged, but they don’t have any redo function, so they don’t help much.
So to support user-accessible 2PC over replication based on 2PC decoding we would have to invent
something nastier, like writing them into a table.
That should eliminate Simon's
objection re the cost of tracking GIDs and still let us have access to
them when we want them, which is the best of both worlds really.
Having 2PC decoding in core is a good thing anyway even without GID tracking =)
--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 26 Jan. 2017 18:43, "Stas Kelvich" <s.kelvich@postgrespro.ru> wrote:
Yes, that’s also possible but seems to be less flexible restricting us
to some
specific GID format.
Anyway, I can measure WAL space overhead introduced by the GID’s inside
commit records
to know exactly what will be the cost of such approach.
I think the approach of storing just the xid and fetching the GID
during logical decoding of the PREPARE TRANSACTION is probably the
best way forward, per my prior mail.
I don’t think that’s possible this way. If we do not put the GID in the commit
record, then by the time logical decoding happens the transaction
will already be committed/aborted and there will
be no easy way to get that GID.
My thinking is that if the 2PC xact is by that point COMMIT PREPARED or
ROLLBACK PREPARED we don't care that it was ever 2pc and should just decode
it as a normal xact. Its gid has ceased to be significant and no longer
holds meaning since the xact is resolved.
The point of logical decoding of 2pc is to allow peers to participate in a
decision on whether to commit or not. Rather than only being able to decode
the xact once committed as is currently the case.
If it's already committed there's no point treating it as anything special.
So when we get to the prepare transaction in xlog we look to see if it's
already committed / rolled back. If so we proceed normally like current
decoding does. Only if it's still prepared do we decode it as 2pc and
supply the gid to a new output plugin callback for prepared xacts.
I thought about several possibilities:
* Tracking xid/gid map in memory also doesn’t help much — if server reboots
between prepare
and commit we’ll lose that mapping.
Er what? That's why I suggested using the prepared xacts shmem state. It's
persistent as you know from your work on prepared transaction files. It has
all the required info.
On 26 Jan 2017, at 12:51, Craig Ringer <craig@2ndquadrant.com> wrote:
* Tracking xid/gid map in memory also doesn’t help much — if server reboots between prepare
and commit we’ll lose that mapping.
Er what? That's why I suggested using the prepared xacts shmem state. It's persistent as you know from your work on prepared transaction files. It has all the required info.
Imagine the following scenario:
1. PREPARE happened
2. PREPARE decoded and sent where it should be sent
3. We got all responses from participating nodes and are issuing COMMIT/ABORT
4. COMMIT/ABORT decoded and sent
After step 3 there is no more memory state associated with that prepared tx, so if we fail
between 3 and 4 then we can’t know the GID unless we wrote it to the commit record (or a table).
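The concern in the scenario above can be modelled in a few lines. This is a sketch under stated assumptions: `Server` and `gid_for_commit_record` are hypothetical, and the commit record is assumed to carry only the xid, as described later in the thread.

```python
class Server:
    """Toy model: in-memory prepared-xact state plus a WAL list."""

    def __init__(self):
        self.two_phase_state = {}  # xid -> gid, dropped once resolved
        self.wal = []              # (record_type, xid, gid_or_None)

    def prepare(self, xid, gid):
        self.two_phase_state[xid] = gid
        self.wal.append(("PREPARE", xid, gid))

    def commit_prepared(self, xid):
        # Step 3: the in-memory entry goes away at COMMIT PREPARED.
        del self.two_phase_state[xid]
        self.wal.append(("COMMIT_PREPARED", xid, None))  # xid only


def gid_for_commit_record(server, record):
    """Try to find the GID for a commit record from the record or memory."""
    _, xid, gid = record
    return gid if gid is not None else server.two_phase_state.get(xid)
```

A decoder that reaches the commit record after step 3, and looks only at that record and the server's memory, gets no GID back. It would have to take the GID from the earlier PREPARE record instead.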
--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 26 January 2017 at 19:34, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
Imagine the following scenario:
1. PREPARE happened
2. PREPARE decoded and sent where it should be sent
3. We got all responses from participating nodes and are issuing COMMIT/ABORT
4. COMMIT/ABORT decoded and sent
After step 3 there is no more memory state associated with that prepared tx, so if we fail
between 3 and 4 then we can’t know the GID unless we wrote it to the commit record (or a table).
If the decoding session crashes/disconnects and restarts between 3 and
4, we know the xact is now committed or rolled back and we don't care
about its gid anymore, so we can decode it as a normal committed xact or
skip over it if aborted. If Pg crashes between 3 and 4 the same
applies, since all decoding sessions must restart.
No decoding session can ever start up between 3 and 4 without passing
through 1 and 2, since we always restart decoding at restart_lsn and
restart_lsn cannot be advanced past the assignment (BEGIN) of a given
xid until we pass its commit record and the downstream confirms it has
flushed the results.
The reorder buffer doesn't even really need to keep track of the gid
between 3 and 4, though it should, to save the output plugin and
downstream the hassle of keeping an xid to gid mapping. All it needs
is to know if we sent a given xact's data to the output plugin at
PREPARE time, so we can suppress sending them again at COMMIT time,
and we can store that info on the ReorderBufferTXN. We can store the
gid there too.
We'll need two new output plugin callbacks
prepare_cb
rollback_cb
since an xact can roll back after we decode PREPARE TRANSACTION (or
during it, even) and we have to be able to tell the downstream to
throw the data away.
I don't think the rollback callback should be called
abort_prepared_cb, because we'll later want to add the ability to
decode interleaved xacts' changes as they are made, before commit, and
in that case will also need to know if they abort. We won't care if
they were prepared xacts or not, but we'll know based on the
ReorderBufferTXN anyway.
We don't need a separate commit_prepared_cb, the existing commit_cb is
sufficient. The gid will be accessible on the ReorderBufferTXN.
Now, if it's simpler to just xlog the gid at COMMIT PREPARED time when
wal_level >= logical I don't think that's the end of the world. But
since we already have almost everything we need in memory, why not
just stash the gid on ReorderBufferTXN?
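A rough sketch of the callback flow described above. The names `prepare_cb`, `rollback_cb`, `commit_cb` and `ReorderBufferTXN` come from this mail; the Python classes, the `sent_at_prepare` flag name and the callback signatures are assumptions made for illustration, not the real C API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ReorderBufferTXN:  # fields here are hypothetical
    xid: int
    changes: list = field(default_factory=list)
    gid: Optional[str] = None
    sent_at_prepare: bool = False


class Decoder:
    def __init__(self, plugin):
        self.plugin = plugin
        self.txns = {}  # xid -> ReorderBufferTXN

    def _txn(self, xid):
        return self.txns.setdefault(xid, ReorderBufferTXN(xid))

    def change(self, xid, change):
        self._txn(xid).changes.append(change)

    def prepare(self, xid, gid):
        txn = self._txn(xid)
        txn.gid = gid  # stash the gid on the txn
        txn.sent_at_prepare = True
        self.plugin.prepare_cb(txn, txn.changes)

    def commit(self, xid):
        txn = self.txns.pop(xid)
        # Suppress changes that were already sent at PREPARE time.
        self.plugin.commit_cb(txn, [] if txn.sent_at_prepare else txn.changes)

    def rollback(self, xid):
        txn = self.txns.pop(xid)
        if txn.sent_at_prepare:
            # Downstream saw the data, so tell it to throw the data away.
            self.plugin.rollback_cb(txn)


class RecordingPlugin:
    """Output plugin stand-in that records which callbacks fired."""

    def __init__(self):
        self.events = []

    def prepare_cb(self, txn, changes):
        self.events.append(("prepare", txn.gid, list(changes)))

    def commit_cb(self, txn, changes):
        self.events.append(("commit", txn.gid, list(changes)))

    def rollback_cb(self, txn):
        self.events.append(("rollback", txn.gid))
```

In this model the existing commit callback is reused for prepared transactions; it simply receives an empty change list and can read the GID from the transaction object.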
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Jan 27, 2017 at 8:52 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
Now, if it's simpler to just xlog the gid at COMMIT PREPARED time when
wal_level >= logical I don't think that's the end of the world. But
since we already have almost everything we need in memory, why not
just stash the gid on ReorderBufferTXN?
I have been through this thread... And to be honest, I have a hard
time understanding for which purpose the information of a 2PC
transaction is useful in the case of logical decoding. The prepare and
commit prepared have been received by a node which is at the root of
the cluster tree, a node of the cluster at an upper level, or a
client, being in charge of issuing all the prepare queries, and then
issue the commit prepared to finish the transaction across a cluster.
In short, even if you do logical decoding from the root node, or the
one at a higher level, you would care just about the fact that it has
been committed.
--
Michael
On Tue, Jan 31, 2017 at 3:29 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
On Fri, Jan 27, 2017 at 8:52 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
Now, if it's simpler to just xlog the gid at COMMIT PREPARED time when
wal_level >= logical I don't think that's the end of the world. But
since we already have almost everything we need in memory, why not
just stash the gid on ReorderBufferTXN?
I have been through this thread... And to be honest, I have a hard
time understanding for which purpose the information of a 2PC
transaction is useful in the case of logical decoding. The prepare and
commit prepared have been received by a node which is at the root of
the cluster tree, a node of the cluster at an upper level, or a
client, being in charge of issuing all the prepare queries, and then
issue the commit prepared to finish the transaction across a cluster.
In short, even if you do logical decoding from the root node, or the
one at a higher level, you would care just about the fact that it has
been committed.
By the way, I have moved this patch to next CF, you guys seem to make
the discussion move on.
--
Michael
On 31 Jan. 2017 19:29, "Michael Paquier" <michael.paquier@gmail.com> wrote:
On Fri, Jan 27, 2017 at 8:52 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
Now, if it's simpler to just xlog the gid at COMMIT PREPARED time when
wal_level >= logical I don't think that's the end of the world. But
since we already have almost everything we need in memory, why not
just stash the gid on ReorderBufferTXN?
I have been through this thread... And to be honest, I have a hard
time understanding for which purpose the information of a 2PC
transaction is useful in the case of logical decoding.
TL;DR: this lets us decode the xact after prepare but before commit so
decoding/replay outcomes can affect the commit-or-abort decision.
The prepare and
commit prepared have been received by a node which is at the root of
the cluster tree, a node of the cluster at an upper level, or a
client, being in charge of issuing all the prepare queries, and then
issue the commit prepared to finish the transaction across a cluster.
In short, even if you do logical decoding from the root node, or the
one at a higher level, you would care just about the fact that it has
been committed.
That's where you've misunderstood - it isn't committed yet. The point of
this change is to allow us to do logical decoding at the PREPARE
TRANSACTION point. The xact is not yet committed or rolled back.
This allows the results of logical decoding - or more interestingly results
of replay on another node / to another app / whatever to influence the
commit or rollback decision.
Stas wants this for a conflict-free logical semi-synchronous replication
multi master solution. At PREPARE TRANSACTION time we replay the xact to
other nodes, each of which applies it and PREPARE TRANSACTION, then replies
to confirm it has successfully prepared the xact. When all nodes confirm
the xact is prepared it is safe for the origin node to COMMIT PREPARED. The
other nodes then see that the first node has committed and they commit too.
Alternately if any node replies "could not replay xact" or "could not
prepare xact" the origin node knows to ROLLBACK PREPARED. All the other
nodes see that and rollback too.
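The decision rule described above can be written as a small sketch. `replicate_two_phase` is a hypothetical helper; each peer is modelled as a callable that returns True when it has replayed and prepared the transaction.

```python
def replicate_two_phase(gid, changes, peers):
    """Return the command the origin node should issue for this xact.

    peers: callables taking (gid, changes) and returning True if the
    peer applied the changes and ran PREPARE TRANSACTION successfully.
    """
    votes = [peer(gid, changes) for peer in peers]
    # Commit only if every peer confirmed a successful prepare.
    return "COMMIT PREPARED" if all(votes) else "ROLLBACK PREPARED"
```

For example, `replicate_two_phase("g1", ["INSERT"], [lambda g, c: True, lambda g, c: False])` yields `"ROLLBACK PREPARED"`, because one peer could not prepare.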
This makes it possible to be much more confident that what's replicated is
exactly the same on all nodes, with no after-the-fact MM conflict
resolution that apps must be aware of to function correctly.
To really make it rock solid you also have to send the old and new values
of a row, or have row versions, or send old row hashes. Something I also
want to have, but we can mostly get that already with REPLICA IDENTITY FULL.
It is of interest to me because schema changes in MM logical replication
are more challenging, awkward and restrictive without it. Optimistic
conflict resolution doesn't work well for schema changes and once the
conflicting schema changes are committed on different nodes there is no
going back. So you need your async system to have a global locking model
for schema changes to stop conflicts arising. Or expect the user not to do
anything silly / misunderstand anything and know all the relevant system
limitations and requirements... which we all know works just great in
practice. You also need a way to ensure that schema changes don't render
committed-but-not-yet-replayed row changes from other peers nonsensical.
The safest way is a barrier where all row changes committed on any node
before committing the schema change on the origin node must be fully
replayed on every other node, making an async MM system temporarily sync
single master (and requiring all nodes to be up and reachable). Otherwise
you need a way to figure out how to conflict-resolve incoming rows with
missing columns / added columns / changed types / renamed tables etc which
is no fun and nearly impossible in the general case.
2PC decoding lets us avoid all this mess by sending all nodes the proposed
schema change and waiting until they all confirm successful prepare before
committing it. It can also be used to solve the row compatibility problems
with some more lazy inter-node chat in logical WAL messages.
I think the purpose of having the GID available to the decoding output
plugin at PREPARE TRANSACTION time is that it can co-operate with a global
transaction manager that way. Each node can tell the GTM "I'm ready to
commit [X]". It is IMO not crucial since you can otherwise use a (node-id,
xid) tuple, but it'd be nice for coordinating with external systems,
simplifying inter node chatter, integrating logical decoding into bigger
systems with external transaction coordinators/arbitrators etc. It seems
pretty silly _not_ to have it really.
Personally I don't think lack of access to the GID justifies blocking 2PC
logical decoding. It can be added separately. But it'd be nice to have
especially if it's cheap.
On 31.01.2017 09:29, Michael Paquier wrote:
On Fri, Jan 27, 2017 at 8:52 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
Now, if it's simpler to just xlog the gid at COMMIT PREPARED time when
wal_level >= logical I don't think that's the end of the world. But
since we already have almost everything we need in memory, why not
just stash the gid on ReorderBufferTXN?
I have been through this thread... And to be honest, I have a hard
time understanding for which purpose the information of a 2PC
transaction is useful in the case of logical decoding. The prepare and
commit prepared have been received by a node which is at the root of
the cluster tree, a node of the cluster at an upper level, or a
client, being in charge of issuing all the prepare queries, and then
issue the commit prepared to finish the transaction across a cluster.
In short, even if you do logical decoding from the root node, or the
one at a higher level, you would care just about the fact that it has
been committed.
Sorry, maybe I do not completely understand your arguments.
Actually our multimaster is now based entirely on logical replication
and 2PC (more precisely, we are using 3PC now :)
The state of a transaction (prepared, precommitted, committed) should be
persisted in WAL to make recovery possible.
Recovery can involve transactions in any state. So there are three records
in the WAL: PREPARE, PRECOMMIT, COMMIT_PREPARED, and
recovery can involve either all of them, or
PRECOMMIT+COMMIT_PREPARED, or just COMMIT_PREPARED.
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 31 Jan. 2017 22:43, "Konstantin Knizhnik" <k.knizhnik@postgrespro.ru>
wrote:
On 31.01.2017 09:29, Michael Paquier wrote:
On Fri, Jan 27, 2017 at 8:52 AM, Craig Ringer <craig@2ndquadrant.com>
wrote:
Now, if it's simpler to just xlog the gid at COMMIT PREPARED time when
wal_level >= logical I don't think that's the end of the world. But
since we already have almost everything we need in memory, why not
just stash the gid on ReorderBufferTXN?
I have been through this thread... And to be honest, I have a hard
time understanding for which purpose the information of a 2PC
transaction is useful in the case of logical decoding. The prepare and
commit prepared have been received by a node which is at the root of
the cluster tree, a node of the cluster at an upper level, or a
client, being in charge of issuing all the prepare queries, and then
issue the commit prepared to finish the transaction across a cluster.
In short, even if you do logical decoding from the root node, or the
one at a higher level, you would care just about the fact that it has
been committed.
in any state. So there are three records in the WAL: PREPARE, PRECOMMIT,
COMMIT_PREPARED, and
recovery can involve either all of them, or PRECOMMIT+COMMIT_PREPARED,
or just COMMIT_PREPARED.
That's your modified Pg though.
This 2PC logical decoding patch proposal is for core, and I think it just
confuses things to introduce discussion of unrelated changes made by your
product to the codebase.
On Tue, Jan 31, 2017 at 6:22 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
That's where you've misunderstood - it isn't committed yet. The point of
this change is to allow us to do logical decoding at the PREPARE TRANSACTION
point. The xact is not yet committed or rolled back.
Yes, I got that. I was looking for a why or an actual use-case.
Stas wants this for a conflict-free logical semi-synchronous replication
multi master solution.
This sentence is hard to decrypt, even more so without "multi master", as the
concept basically applies to only one master node.
At PREPARE TRANSACTION time we replay the xact to
other nodes, each of which applies it and PREPARE TRANSACTION, then replies
to confirm it has successfully prepared the xact. When all nodes confirm the
xact is prepared it is safe for the origin node to COMMIT PREPARED. The
other nodes then see that the first node has committed and they commit too.
OK, this is the argument I was looking for. So in your schema the
origin node, the one generating the changes, is itself in charge of
deciding if the 2PC should work or not. There are two channels between
the origin node and the replicas replaying the logical changes, one is
for the logical decoder with a receiver, the second one is used to
communicate the WAL apply status. I thought about something like
postgres_fdw doing this job with a transaction that does writes across
several nodes, that's why I got confused about this feature.
Everything goes through one channel, so the failure handling is just
simplified.
Alternately if any node replies "could not replay xact" or "could not
prepare xact" the origin node knows to ROLLBACK PREPARED. All the other
nodes see that and rollback too.
The origin node could just issue the ROLLBACK or COMMIT and the
logical replicas would just apply this change.
To really make it rock solid you also have to send the old and new values of
a row, or have row versions, or send old row hashes. Something I also want
to have, but we can mostly get that already with REPLICA IDENTITY FULL.
On a primary key (or a unique index), the default replica identity is
enough I think.
It is of interest to me because schema changes in MM logical replication are
more challenging, awkward and restrictive without it. Optimistic conflict
resolution doesn't work well for schema changes and once the conflicting
schema changes are committed on different nodes there is no going back. So
you need your async system to have a global locking model for schema changes
to stop conflicts arising. Or expect the user not to do anything silly /
misunderstand anything and know all the relevant system limitations and
requirements... which we all know works just great in practice. You also
need a way to ensure that schema changes don't render
committed-but-not-yet-replayed row changes from other peers nonsensical. The
safest way is a barrier where all row changes committed on any node before
committing the schema change on the origin node must be fully replayed on
every other node, making an async MM system temporarily sync single master
(and requiring all nodes to be up and reachable). Otherwise you need a way
to figure out how to conflict-resolve incoming rows with missing columns /
added columns / changed types / renamed tables etc which is no fun and
nearly impossible in the general case.
That's one vision of things, FDW-like approaches would be a second,
but those are not able to pass down utility statements natively,
though this stuff can be done with the utility hook.
I think the purpose of having the GID available to the decoding output
plugin at PREPARE TRANSACTION time is that it can co-operate with a global
transaction manager that way. Each node can tell the GTM "I'm ready to
commit [X]". It is IMO not crucial since you can otherwise use a (node-id,
xid) tuple, but it'd be nice for coordinating with external systems,
simplifying inter node chatter, integrating logical decoding into bigger
systems with external transaction coordinators/arbitrators etc. It seems
pretty silly _not_ to have it really.
Well, Postgres-XC/XL save the 2PC GID for this purpose in the GTM,
this way the COMMIT/ABORT PREPARED can be issued from any nodes, and
there is a centralized conflict resolution, the latter being done with
a huge cost, causing much bottleneck in scaling performance.
Personally I don't think lack of access to the GID justifies blocking 2PC
logical decoding. It can be added separately. But it'd be nice to have
especially if it's cheap.
I think it should be added reading this thread.
--
Michael
On Tue, Jan 31, 2017 at 9:05 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
Personally I don't think lack of access to the GID justifies blocking 2PC
logical decoding. It can be added separately. But it'd be nice to have
especially if it's cheap.
I think it should be added reading this thread.
+1. If on the logical replication master the user executes PREPARE
TRANSACTION 'mumble', isn't it sensible to want the logical replica to
prepare the same set of changes with the same GID? To me, that not
only seems like *a* sensible thing to want to do but probably the
*most* sensible thing to want to do. And then, when the eventual
COMMIT PREPARED 'mumble' comes along, you want to have the replica
run the same command. If you don't do that, then the alternative is
that the replica has to make up new names based on the master's XID.
But that kinda sucks, because now if replication stops due to a
conflict or whatever and you have to disentangle things by hand, all
the names on the replica are basically meaningless.
Also, including the GID in the WAL for each COMMIT/ABORT PREPARED
doesn't seem inordinately expensive to me. For that to really add up
to a significant cost, wouldn't you need to be doing LOTS of 2PC
transactions, each with very little work, so that the commit/abort
prepared records weren't swamped by everything else? That seems like
an unlikely scenario, but if it does happen, that's exactly when
you'll be most grateful for the GID tracking. I think.
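The cost argument above is simple arithmetic. The sketch below uses made-up transaction sizes purely for illustration, not measurements; the only PostgreSQL fact assumed is that a GID is shorter than 200 bytes. `gid_overhead_fraction` is a hypothetical helper.

```python
def gid_overhead_fraction(gid_bytes, wal_bytes_per_xact):
    """Share of a transaction's WAL taken by one extra GID copy."""
    return gid_bytes / (wal_bytes_per_xact + gid_bytes)


# Illustrative inputs only: a tiny xact with a maximum-length GID,
# and a larger xact with a short GID.
tiny = gid_overhead_fraction(200, 1000)
large = gid_overhead_fraction(20, 1_000_000)
```

The fraction only becomes noticeable when the rest of the transaction writes very little WAL, which matches the "lots of 2PC transactions, each with very little work" case described above.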
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
Also, including the GID in the WAL for each COMMIT/ABORT PREPARED
doesn't seem inordinately expensive to me.
I'm confused ... isn't it there already? If not, how do we handle
reconstructing 2PC state from WAL at all?
regards, tom lane
On 02/01/2017 10:32 PM, Tom Lane wrote:
Robert Haas <robertmhaas@gmail.com> writes:
Also, including the GID in the WAL for each COMMIT/ABORT PREPARED
doesn't seem inordinately expensive to me.
I'm confused ... isn't it there already? If not, how do we handle
reconstructing 2PC state from WAL at all?
regards, tom lane
Right now logical decoding ignores prepare and takes into account only "commit prepared":
/*
* Currently decoding ignores PREPARE TRANSACTION and will just
* decode the transaction when the COMMIT PREPARED is sent or
* throw away the transaction's contents when a ROLLBACK PREPARED
* is received. In the future we could add code to expose prepared
* transactions in the changestream allowing for a kind of
* distributed 2PC.
*/
For some scenarios this works well, but if we really need the prepared state on the replica (as in the case of multimaster), then it is not enough.
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 2 Feb. 2017 08:32, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
Also, including the GID in the WAL for each COMMIT/ABORT PREPARED
doesn't seem inordinately expensive to me.
I'm confused ... isn't it there already? If not, how do we handle
reconstructing 2PC state from WAL at all?
Right. Per my comments upthread I don't see why we need to add anything
more to WAL here.
Stas was concerned about what happens in logical decoding if we crash
between PREPARE TRANSACTION and COMMIT PREPARED. But we'll always go back
and decode the whole txn again anyway so it doesn't matter.
We can just track it on ReorderBufferTxn when we see it at PREPARE
TRANSACTION time.
On Wed, Feb 1, 2017 at 2:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
Also, including the GID in the WAL for each COMMIT/ABORT PREPARED
doesn't seem inordinately expensive to me.
I'm confused ... isn't it there already? If not, how do we handle
reconstructing 2PC state from WAL at all?
By XID. See xl_xact_twophase, which gets included in xl_xact_commit
or xl_xact_abort. The GID has got to be there in the XLOG_XACT_PREPARE
record, but not when actually committing/rolling back.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Feb 1, 2017 at 4:35 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
Right. Per my comments upthread I don't see why we need to add anything more
to WAL here.
Stas was concerned about what happens in logical decoding if we crash
between PREPARE TRANSACTION and COMMIT PREPARED. But we'll always go back
and decode the whole txn again anyway so it doesn't matter.
We can just track it on ReorderBufferTxn when we see it at PREPARE
TRANSACTION time.
Oh, hmm. I guess if that's how it works then we don't need it in WAL
after all. I'm not sure that re-decoding the already-prepared
transaction is a very good plan, but if that's what we're doing anyway
this patch probably shouldn't change it.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 3 February 2017 at 03:34, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Feb 1, 2017 at 4:35 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
Right. Per my comments upthread I don't see why we need to add anything more
to WAL here.
Stas was concerned about what happens in logical decoding if we crash
between PREPARE TRANSACTION and COMMIT PREPARED. But we'll always go back
and decode the whole txn again anyway so it doesn't matter.
We can just track it on ReorderBufferTxn when we see it at PREPARE
TRANSACTION time.
Oh, hmm. I guess if that's how it works then we don't need it in WAL
after all. I'm not sure that re-decoding the already-prepared
transaction is a very good plan, but if that's what we're doing anyway
this patch probably shouldn't change it.
We don't have much choice at the moment.
Logical decoding must restart from the xl_running_xacts most recently
prior to the xid allocation for the oldest xact the client hasn't
confirmed receipt of decoded data + commit for. That's because reorder
buffers are not persistent; if a decoding session crashes we throw
away accumulated reorder buffers, both those in memory and those
spilled to disk. We have to re-create them by restarting decoding from
the beginning of the oldest xact of interest.
We could make reorder buffers persistent and shared between decoding
sessions but it'd totally change the logical decoding model and create
some other problems. It's certainly not a topic for this patch. So we
can take it as given that we'll always restart decoding from BEGIN
again at a crash.
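As a rough illustration of the restart rule described above, this Python sketch picks the point decoding has to resume from. It is a simplified model under stated assumptions: the `Xact` tuple and `restart_point` function are hypothetical, LSNs are plain integers, and everything else a real slot tracks is left out.

```python
from bisect import bisect_right
from collections import namedtuple

# Hypothetical, simplified view of a transaction in WAL.
# commit_lsn is None while the transaction is still open.
Xact = namedtuple("Xact", "xid start_lsn commit_lsn")


def restart_point(running_xacts_lsns, xacts, confirmed_lsn):
    """Return the LSN decoding must restart from after a crash."""
    # Xacts whose commit the client has not confirmed, or that are still open.
    pending = [x for x in xacts
               if x.commit_lsn is None or x.commit_lsn > confirmed_lsn]
    if pending:
        target = min(x.start_lsn for x in pending)
    else:
        target = confirmed_lsn
    # Latest xl_running_xacts record at or before the target.
    i = bisect_right(running_xacts_lsns, target)
    if i == 0:
        raise ValueError("no xl_running_xacts record early enough")
    return running_xacts_lsns[i - 1]


snapshots = [100, 200, 300, 400]   # sorted LSNs of xl_running_xacts records
xacts = [
    Xact(1, 120, 180),             # committed and confirmed
    Xact(2, 210, 390),             # long xact, commit not yet confirmed
    Xact(3, 320, 350),             # short xact, confirmed
]
```

With `confirmed_lsn=360` the model restarts at 200, before the first record of xid 2, so the whole of that transaction is read again even though much of it was already decoded before the crash.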
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Feb 2, 2017 at 7:14 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
We could make reorder buffers persistent and shared between decoding
sessions but it'd totally change the logical decoding model and create
some other problems. It's certainly not a topic for this patch. So we
can take it as given that we'll always restart decoding from BEGIN
again at a crash.
OK, thanks for the explanation. I have never liked this design very
much, and told Andres so: big transactions are bound to cause
noticeable replication lag. But you're certainly right that it's not
a topic for this patch.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2017-02-03 17:47:50 -0500, Robert Haas wrote:
On Thu, Feb 2, 2017 at 7:14 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
We could make reorder buffers persistent and shared between decoding
sessions but it'd totally change the logical decoding model and create
some other problems. It's certainly not a topic for this patch. So we
can take it as given that we'll always restart decoding from BEGIN
again at a crash.
Sharing them seems unlikely (filtering and such would become a lot more
complicated) and separate from persistency. I'm not sure however how
it'd "totally change the logical decoding model"?
Even if we'd not always restart decoding, we'd still have the option to
add the information necessary to the spill files, so I'm unclear how
persistency plays a role here?
OK, thanks for the explanation. I have never liked this design very
much, and told Andres so: big transactions are bound to cause
noticeable replication lag. But you're certainly right that it's not
a topic for this patch.
Streaming and persistency of spill files are different topics, no?
Either would have initially complicated things beyond the point of
getting things into core - I'm all for adding them at some point.
Persistent spill files (which'd also require spilling of small transactions at
regular intervals) also have the issue that they make the spill format
something that can't be adapted in bugfixes etc, and that we need to
fsync them.
I still haven't seen a credible model for being able to apply a stream
of interleaved transactions that can roll back individually; I think we
really need the ability to have multiple transactions alive in one
backend for that.
Andres
On Fri, Feb 3, 2017 at 6:00 PM, Andres Freund <andres@anarazel.de> wrote:
I still haven't seen a credible model for being able to apply a stream
of interleaved transactions that can roll back individually; I think we
really need the ability to have multiple transactions alive in one
backend for that.
Hmm, yeah, that's a problem. That smells like autonomous transactions.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2017-02-03 18:47:23 -0500, Robert Haas wrote:
On Fri, Feb 3, 2017 at 6:00 PM, Andres Freund <andres@anarazel.de> wrote:
I still haven't seen a credible model for being able to apply a stream
of interleaved transactions that can roll back individually; I think we
really need the ability to have multiple transactions alive in one
backend for that.
Hmm, yeah, that's a problem. That smells like autonomous transactions.
Unfortunately the last few proposals, like spawning backends, to deal
with autonomous xacts aren't really suitable for replication, unless you
only have very large ones. And it really needs to be an implementation
where ATs can freely be switched in between. On the other hand, a good
deal of problems (like locking) shouldn't be an issue, since there's
obviously a possible execution schedule.
I suspect this'd need some low-level implementation close to xact.c that'd
allow switching between transactions.
- Andres
On Fri, Feb 3, 2017 at 7:08 PM, Andres Freund <andres@anarazel.de> wrote:
On 2017-02-03 18:47:23 -0500, Robert Haas wrote:
On Fri, Feb 3, 2017 at 6:00 PM, Andres Freund <andres@anarazel.de> wrote:
I still haven't seen a credible model for being able to apply a stream
of interleaved transactions that can roll back individually; I think we
really need the ability to have multiple transactions alive in one
backend for that.
Hmm, yeah, that's a problem. That smells like autonomous transactions.
Unfortunately the last few proposals, like spawning backends, to deal
with autonomous xacts aren't really suitable for replication, unless you
only have very large ones. And it really needs to be an implementation
where ATs can freely be switched in between. On the other hand, a good
deal of problems (like locking) shouldn't be an issue, since there's
obviously a possible execution schedule.
I suspect this'd need some low-level implementation close to xact.c that'd
allow switching between transactions.
Yeah. Well, I still feel like that's also how autonomous transactions
oughta work, but I realize that's not a unanimous viewpoint. :-)
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2017-02-03 19:09:43 -0500, Robert Haas wrote:
On Fri, Feb 3, 2017 at 7:08 PM, Andres Freund <andres@anarazel.de> wrote:
On 2017-02-03 18:47:23 -0500, Robert Haas wrote:
On Fri, Feb 3, 2017 at 6:00 PM, Andres Freund <andres@anarazel.de> wrote:
I still haven't seen a credible model for being able to apply a stream
of interleaved transactions that can roll back individually; I think we
really need the ability to have multiple transactions alive in one
backend for that.
Hmm, yeah, that's a problem. That smells like autonomous transactions.
Unfortunately the last few proposals, like spawning backends, to deal
with autonomous xacts aren't really suitable for replication, unless you
only have very large ones. And it really needs to be an implementation
where ATs can freely be switched in between. On the other hand, a good
deal of problems (like locking) shouldn't be an issue, since there's
obviously a possible execution schedule.
I suspect this'd need some low-level implementation close to xact.c that'd
allow switching between transactions.
Yeah. Well, I still feel like that's also how autonomous transactions
oughta work, but I realize that's not a unanimous viewpoint. :-)
Same here ;)
On 02/04/2017 03:08 AM, Andres Freund wrote:
On 2017-02-03 18:47:23 -0500, Robert Haas wrote:
On Fri, Feb 3, 2017 at 6:00 PM, Andres Freund <andres@anarazel.de> wrote:
I still haven't seen a credible model for being able to apply a stream
of interleaved transactions that can roll back individually; I think we
really need the ability to have multiple transactions alive in one
backend for that.
Hmm, yeah, that's a problem. That smells like autonomous transactions.
Unfortunately the last few proposals, like spawning backends, to deal
with autonomous xacts aren't really suitable for replication, unless you
only have very large ones. And it really needs to be an implementation
where ATs can freely be switched in between. On the other hand, a good
deal of problems (like locking) shouldn't be an issue, since there's
obviously a possible execution schedule.
I suspect this'd need some low-level implementation close to xact.c that'd
allow switching between transactions.
Let me add my two cents here:
1. We use logical decoding in our multimaster and apply transactions concurrently with a pool of workers. Unlike asynchronous replication, multimaster has to perform voting for each transaction commit, so if transactions were applied by a single worker, performance would be awful and, moreover, there would be a high chance of a "deadlock" in which no worker can complete voting because different nodes are voting on different transactions.
I cannot say that this approach is free of problems; there are definitely a lot of challenges. First of all, we need a special DTM (distributed transaction manager) to apply transactions consistently on different nodes. The second problem is again related to the kind of "deadlock" explained above: even if we apply transactions concurrently, such a deadlock is still possible if we do not have enough workers. This is why we allow extra workers to be launched dynamically (though their number is ultimately limited by the maximum number of configured bgworkers).
But in any case, I think that "parallel apply" is a must-have mode for logical replication.
2. We have implemented autonomous transactions in PgPro EE. Unlike the proposal currently in the commitfest, we execute an autonomous transaction within the same backend, so we just store and restore the transaction context. Unfortunately this is also not a cheap operation. An autonomous transaction should not see any changes made by the parent transaction (because the parent can be rolled back after the autonomous transaction commits). But there are catalog and relation caches inside the backend, so we have to clear these caches before switching to the ATX. That is quite expensive, so a PL/pgSQL function with an autonomous transaction executes several orders of magnitude slower than one without. Autonomous transactions can therefore be used for audits (the primary use of ATX in Oracle PL/SQL applications), but this mechanism is not efficient for concurrent execution of multiple transactions in one backend.
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 31 Jan 2017, at 12:22, Craig Ringer <craig@2ndquadrant.com> wrote:
Personally I don't think lack of access to the GID justifies blocking 2PC logical decoding. It can be added separately. But it'd be nice to have especially if it's cheap.
Agreed.
On 2 Feb 2017, at 00:35, Craig Ringer <craig@2ndquadrant.com> wrote:
Stas was concerned about what happens in logical decoding if we crash between PREPARE TRANSACTION and COMMIT PREPARED. But we'll always go back and decode the whole txn again anyway so it doesn't matter.
Not exactly. It seems that in previous discussions we were not on the same page, probably due to unclear arguments by me.
From my point of view there are no problems (or at least no new problems compared to ordinary 2PC) with preparing transactions on slave servers under something like “#{xid}#{node_id}” instead of the GID, if the issuing node is the coordinator of that transaction. In case of failure, restart, or crash we have the same options for deciding what to do with uncommitted transactions.
My concern is about the situation with an external coordinator. That scenario is quite important for users of Postgres native 2PC, notably J2EE users. Suppose a user (or their framework) issues “prepare transaction 'mytxname';” to servers with ordinary synchronous physical replication. If the master crashes and the replica is promoted, the user can reconnect to it and commit or abort that transaction using their GID. It is unclear to me how to achieve the same behaviour with logical replication of 2PC without the GID in the commit record. If we prepare with “#{xid}#{node_id}” on the acceptor nodes, then if the donor node crashes we lose the mapping between the user's GID and our internal GID. Conversely, we can prepare with the user's GID on the acceptors, but then we will not know that GID on the donor when decoding the commit (by the time decoding happens, all in-memory state is already gone and we can't map our xid to a GID).
I ran some tests to understand the real impact on WAL size. I compared postgres -master with wal_level = logical, after 3M 2PC transactions, against patched postgres where GIDs are stored inside the commit record too, testing with 194-byte and 6-byte GIDs (the maximum GID size is 200 bytes).
-master, 6-byte GID after 3M transactions: pg_current_xlog_location = 0/9572CB28
-patched, 6-byte GID after 3M transactions: pg_current_xlog_location = 0/96C442E0
so with 6-byte GIDs the difference in WAL size is less than 1%
-master, 194-byte GID after 3M transactions: pg_current_xlog_location = 0/B7501578
-patched, 194-byte GID after 3M transactions: pg_current_xlog_location = 0/D8B43E28
and with 194-byte GIDs the difference in WAL size is about 18%
So using big GIDs (as J2EE does) can cause notable WAL bloat, while small GIDs are almost unnoticeable.
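The percentages can be reproduced from the reported locations with a few lines of Python. This is only a sketch of the arithmetic: `lsn_to_int` and `growth_percent` are helper names invented here, and it assumes the reported location approximates the total WAL written in each run (that is, every run started from a comparable position near zero).

```python
def lsn_to_int(lsn):
    """Convert an 'X/Y' LSN string (two hex halves) to a byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def growth_percent(master_lsn, patched_lsn):
    """Relative WAL growth of the patched run over master, in percent."""
    base = lsn_to_int(master_lsn)
    return 100.0 * (lsn_to_int(patched_lsn) - base) / base


small = growth_percent("0/9572CB28", "0/96C442E0")   # 6-byte GIDs
large = growth_percent("0/B7501578", "0/D8B43E28")   # 194-byte GIDs
print(f"6-byte GID:   {small:.2f}%")
print(f"194-byte GID: {large:.2f}%")
```

Under that assumption the results agree with the "less than 1%" and "about 18%" figures quoted above.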
Maybe we can introduce a configuration option track_commit_gid, by analogy with track_commit_timestamp, and make that behaviour optional? Any objections to that?
--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 9 February 2017 at 21:23, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
On 2 Feb 2017, at 00:35, Craig Ringer <craig@2ndquadrant.com> wrote:
Stas was concerned about what happens in logical decoding if we crash between PREPARE TRANSACTION and COMMIT PREPARED. But we'll always go back and decode the whole txn again anyway so it doesn't matter.
Not exactly. It seems that in previous discussions we were not on the same page, probably due to unclear arguments by me.
From my point of view there are no problems (or at least no new problems compared to ordinary 2PC) with preparing transactions on slave servers under something like “#{xid}#{node_id}” instead of the GID, if the issuing node is the coordinator of that transaction. In case of failure, restart, or crash we have the same options for deciding what to do with uncommitted transactions.
But we don't *need* to do that. We have access to the GID of the 2PC
xact from PREPARE TRANSACTION until COMMIT PREPARED, after which we
have no need for it. So we can always use the user-supplied GID.
I ran some tests to understand the real impact on WAL size. I compared postgres -master with wal_level = logical, after 3M 2PC transactions, against patched postgres where GIDs are stored inside the commit record too.
Why do you do this? You don't need to. You can look the GID up from
the 2pc status table in memory unless the master already did COMMIT
PREPARED, in which case you can just decode it as a normal xact as if
it were never 2pc in the first place.
I don't think I've managed to make this point by description, so I'll
try to modify your patch to demonstrate.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 01/03/17 10:24, Craig Ringer wrote:
On 9 February 2017 at 21:23, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
On 2 Feb 2017, at 00:35, Craig Ringer <craig@2ndquadrant.com> wrote:
Stas was concerned about what happens in logical decoding if we crash between PREPARE TRANSACTION and COMMIT PREPARED. But we'll always go back and decode the whole txn again anyway so it doesn't matter.
Not exactly. It seems that in previous discussions we were not on the same page, probably due to unclear arguments by me.
From my point of view there are no problems (or at least no new problems compared to ordinary 2PC) with preparing transactions on slave servers under something like “#{xid}#{node_id}” instead of the GID, if the issuing node is the coordinator of that transaction. In case of failure, restart, or crash we have the same options for deciding what to do with uncommitted transactions.
But we don't *need* to do that. We have access to the GID of the 2PC
xact from PREPARE TRANSACTION until COMMIT PREPARED, after which we
have no need for it. So we can always use the user-supplied GID.
I ran some tests to understand the real impact on WAL size. I compared postgres -master with wal_level = logical, after 3M 2PC transactions, against patched postgres where GIDs are stored inside the commit record too.
Why do you do this? You don't need to. You can look the GID up from
the 2pc status table in memory unless the master already did COMMIT
PREPARED, in which case you can just decode it as a normal xact as if
it were never 2pc in the first place.
I don't think I've managed to make this point by description, so I'll
try to modify your patch to demonstrate.
If I understand you correctly you are saying that if PREPARE is being
decoded, we can load the GID from the 2pc info in memory about the
specific 2pc. The info gets removed on COMMIT PREPARED but at that point
there is no real difference between replicating it as 2pc or 1pc since
the 2pc behavior is for all intents and purposes lost at that point.
Works for me. I guess the hard part is knowing if COMMIT PREPARED
happened at the time PREPARE is decoded, but I think the existence of the needed
info could probably be used for that.
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2 March 2017 at 06:20, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:
If I understand you correctly you are saying that if PREPARE is being
decoded, we can load the GID from the 2pc info in memory about the
specific 2pc. The info gets removed on COMMIT PREPARED but at that point
there is no real difference between replicating it as 2pc or 1pc since
the 2pc behavior is for all intents and purposes lost at that point.
Works for me. I guess the hard part is knowing if COMMIT PREPARED
happened at the time PREPARE is decoded, but I think the existence of the needed
info could probably be used for that.
Right.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2 Mar 2017, at 01:20, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:
The info gets removed on COMMIT PREPARED but at that point
there is no real difference between replicating it as 2pc or 1pc since
the 2pc behavior is for all intents and purposes lost at that point.
If we are doing 2pc and COMMIT PREPARED happens then we should
replicate that without transaction body to the receiving servers since tx
is already prepared on them with some GID. So we need a way to construct
that GID.
It seems that for the last ~10 messages I have been failing to explain some points about this
topic. Or maybe I'm failing to understand some points. Could we set up a
Skype call to discuss this and post a summary here? Craig? Petr?
--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 2 March 2017 at 15:27, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
On 2 Mar 2017, at 01:20, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:
The info gets removed on COMMIT PREPARED but at that point
there is no real difference between replicating it as 2pc or 1pc since
the 2pc behavior is for all intents and purposes lost at that point.
If we are doing 2pc and COMMIT PREPARED happens then we should
replicate that without transaction body to the receiving servers since tx
is already prepared on them with some GID. So we need a way to construct
that GID.
We already have it, because we just decoded the PREPARE TRANSACTION.
I'm preparing a patch revision to demonstrate this.
BTW, I've been reviewing the patch in more detail. Other than a bunch
of copy-and-paste that I'm cleaning up, the main issue I've found is
that in DecodePrepare, you call:
    SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
                       parsed->nsubxacts, parsed->subxacts);
but I am not convinced it is correct to call it at PREPARE TRANSACTION
time, only at COMMIT PREPARED time. We want to see the 2pc prepared
xact's state when decoding it, but there might be later commits that
cannot yet see that state and shouldn't have it visible in their
snapshots. Imagine, say
-- session 1
BEGIN;
ALTER TABLE t ADD COLUMN ...;
INSERT INTO t ...;
PREPARE TRANSACTION 'x';
-- session 2
BEGIN;
INSERT INTO t ...;
COMMIT;
-- session 1 (or any other session)
COMMIT PREPARED 'x';
We want to see the new column when decoding the prepared xact, but
_not_ when decoding the subsequent xact between the prepare and
commit. This particular case cannot occur because the lock held by
ALTER TABLE blocks the INSERT in the other xact, but how sure are you
that there are no other snapshot issues that could arise if we promote
a snapshot to visible early? What about if we ROLLBACK PREPARED after
we made the snapshot visible?
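The concern can be made concrete with a toy Python model. It is a sketch only: `ToySnapshotBuilder` and `replay` are invented names, the real snapshot builder tracks far more state, and snapshots here are just the set of xids treated as committed at the moment another transaction commits.

```python
class ToySnapshotBuilder:
    """Hypothetical stand-in for the snapshot builder: tracks which xids
    later snapshots treat as committed."""

    def __init__(self):
        self.committed = set()

    def commit_txn(self, xid):
        self.committed.add(xid)

    def snapshot(self):
        return frozenset(self.committed)


def replay(events, commit_at_prepare):
    """Return {xid: xids visible as committed} for each plain COMMIT."""
    builder = ToySnapshotBuilder()
    seen = {}
    for kind, xid in events:
        if kind == "prepare" and commit_at_prepare:
            builder.commit_txn(xid)         # promotion at PREPARE time
        elif kind == "commit_prepared" and not commit_at_prepare:
            builder.commit_txn(xid)         # promotion only at COMMIT PREPARED
        elif kind == "commit":
            seen[xid] = builder.snapshot()  # snapshot used to decode this xact
            builder.commit_txn(xid)
        # "rollback_prepared" never removes anything from the builder
    return seen


events = [("prepare", 10), ("commit", 11), ("rollback_prepared", 10)]
early = replay(events, commit_at_prepare=True)
late = replay(events, commit_at_prepare=False)
```

In the `early` run, xid 11 is decoded with a snapshot in which xid 10 counts as committed, even though xid 10 is rolled back afterwards; in the `late` run it is not.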
The tests don't appear to cover logical decoding 2PC sessions that do
DDL at all. I emphasised that that would be one of the main problem
areas when we originally discussed this. I'll look at adding some,
since I think this is one of the areas that's most likely to find
issues.
It seems that for the last ~10 messages I have been failing to explain some points about this
topic. Or maybe I'm failing to understand some points. Could we set up a
Skype call to discuss this and post a summary here? Craig? Petr?
Let me prep an updated patch. Time zones make it rather hard to do
voice; I'm in +0800 Western Australia, Petr is in +0200...
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2 March 2017 at 16:00, Craig Ringer <craig@2ndquadrant.com> wrote:
What about if we ROLLBACK PREPARED after
we made the snapshot visible?
Yeah, I'm pretty sure that's going to be a problem actually.
You're telling the snapshot builder that an xact committed at PREPARE
TRANSACTION time.
If we then ROLLBACK PREPARED, we're in a mess. It looks like it'll
cause issues with catalogs, user-catalog tables, etc.
I suspect we need to construct a temporary snapshot to decode PREPARE
TRANSACTION then discard it. If we later COMMIT PREPARED we should
perform the current steps to merge the snapshot state in.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2 Mar 2017, at 11:00, Craig Ringer <craig@2ndquadrant.com> wrote:
We already have it, because we just decoded the PREPARE TRANSACTION.
I'm preparing a patch revision to demonstrate this.
Yes, we already have it, but if the server reboots between the commit prepared (after which all
prepared state is gone) and the decoding of this commit prepared, then we lose
that mapping, don't we?
BTW, I've been reviewing the patch in more detail. Other than a bunch
of copy-and-paste that I'm cleaning up, the main issue I've found is
that in DecodePrepare, you call:
    SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
                       parsed->nsubxacts, parsed->subxacts);
but I am not convinced it is correct to call it at PREPARE TRANSACTION
time, only at COMMIT PREPARED time. We want to see the 2pc prepared
xact's state when decoding it, but there might be later commits that
cannot yet see that state and shouldn't have it visible in their
snapshots.
Agreed, that is a problem. The call lets us decode this PREPARE, but afterwards
it would be better to mark the transaction as running in the snapshot, or to decode
the prepare with some kind of copied-and-edited snapshot. I'll have a look at this.
--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 2 March 2017 at 16:20, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
On 2 Mar 2017, at 11:00, Craig Ringer <craig@2ndquadrant.com> wrote:
We already have it, because we just decoded the PREPARE TRANSACTION.
I'm preparing a patch revision to demonstrate this.
Yes, we already have it, but if the server reboots between the commit prepared (after which all
prepared state is gone) and the decoding of this commit prepared, then we lose
that mapping, don't we?
I was about to explain how restart_lsn works again, and how that would
mean we'd always re-decode the PREPARE TRANSACTION before any COMMIT
PREPARED or ROLLBACK PREPARED on crash. But...
Actually, the way you've implemented it, that won't be the case. You
treat PREPARE TRANSACTION as a special-case of COMMIT, and the client
will presumably send replay confirmation after it has applied the
PREPARE TRANSACTION. In fact, it has to if we want 2PC to work with
synchronous replication. This will allow restart_lsn to advance to
after the PREPARE TRANSACTION record if there's no other older xact
and we see a suitable xl_running_xacts record. So we wouldn't decode
the PREPARE TRANSACTION again after restart.
Hm.
That's actually a pretty good reason to xlog the gid for 2pc rollback
and commit if we're at wal_level >= logical. Being able to advance
restart_lsn and avoid the re-decoding work is a big win.
Come to think of it, we have to advance the client replication
identifier as part of PREPARE TRANSACTION anyway, otherwise we'd try
to repeat and re-prepare the same xact on crash recovery.
Given that, I withdraw my objection to adding the gid to commit and
rollback xlog records, though it should only be done if they're 2pc
commit/abort, and only if XLogLogicalInfoActive().
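A small Python sketch of why the restart position matters for the GID. Everything here is hypothetical: `decode_from` and the tuple layout are invented for illustration, and the only point modelled is that the xid-to-GID map lives in memory and is rebuilt from the PREPARE records that actually get decoded.

```python
def decode_from(wal, start, gid_in_commit):
    """Decode records with lsn >= start; return (gid, action) pairs.

    wal is a list of (lsn, kind, xid, gid) tuples. For 'commit_prepared'
    the gid field is only consulted when gid_in_commit is True, which
    models a commit record that carries the GID.
    """
    gid_by_xid = {}   # in-memory only, lost on restart
    out = []
    for lsn, kind, xid, gid in wal:
        if lsn < start:
            continue
        if kind == "prepare":
            gid_by_xid[xid] = gid
            out.append((gid, "PREPARE"))
        elif kind == "commit_prepared":
            known = gid if gid_in_commit else gid_by_xid.get(xid)
            out.append((known, "COMMIT PREPARED"))
    return out


wal = [
    (100, "prepare", 7, "mytxname"),
    (200, "commit_prepared", 7, "mytxname"),
]

# Restart before the PREPARE: the mapping is rebuilt, so the GID is known.
print(decode_from(wal, 0, gid_in_commit=False))
# Restart after the PREPARE: the GID is unknown unless the commit carries it.
print(decode_from(wal, 150, gid_in_commit=False))
print(decode_from(wal, 150, gid_in_commit=True))
```

Once the restart position has moved past the PREPARE record, only the variant with the GID in the commit record can still tell the downstream side which prepared transaction to finish.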
BTW, I've been reviewing the patch in more detail. Other than a bunch
of copy-and-paste that I'm cleaning up, the main issue I've found is
that in DecodePrepare, you call:
    SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
                       parsed->nsubxacts, parsed->subxacts);
but I am not convinced it is correct to call it at PREPARE TRANSACTION
time, only at COMMIT PREPARED time. We want to see the 2pc prepared
xact's state when decoding it, but there might be later commits that
cannot yet see that state and shouldn't have it visible in their
snapshots.
Agreed, that is a problem. The call lets us decode this PREPARE, but afterwards
it would be better to mark the transaction as running in the snapshot, or to decode
the prepare with some kind of copied-and-edited snapshot. I'll have a look at this.
Thanks.
It's also worth noting that with your current approach, 2PC xacts will
produce two calls to the output plugin's commit() callback, once for
the PREPARE TRANSACTION and another for the COMMIT PREPARED or
ROLLBACK PREPARED, the latter two with a faked-up state. I'm not a
huge fan of that. It's not entirely backward compatible since it
violates the previously safe assumption that there's a 1:1
relationship between begin and commit callbacks with no interleaving,
for one thing, and I think it's also a bit misleading to send a
PREPARE TRANSACTION to a callback that could previously only receive a
true commit.
I particularly dislike calling a commit callback for an abort. So I'd
like to look further into the interface side of things. I'm inclined
to suggest adding new callbacks for 2pc prepare, commit and rollback,
and if the output plugin doesn't set them fall back to the existing
behaviour. Plugins that aren't interested in 2PC (think ETL) should
probably not have to deal with it, we might as well just send them
only the actually committed xacts, when they commit.
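To make the fallback idea concrete, here is a minimal sketch as a plain Python model, not PostgreSQL's C output plugin API. The names `Plugin`, `prepare_cb`, `commit_prepared_cb`, `rollback_prepared_cb` and `dispatch` are hypothetical and exist only for illustration; row changes are left out entirely.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Plugin:
    # Callbacks every plugin already provides.
    begin_cb: Callable[[int], None]
    commit_cb: Callable[[int], None]
    # Hypothetical optional 2PC callbacks; None means "not interested in 2PC".
    prepare_cb: Optional[Callable[[int, str], None]] = None
    commit_prepared_cb: Optional[Callable[[int, str], None]] = None
    rollback_prepared_cb: Optional[Callable[[int, str], None]] = None

    def wants_2pc(self) -> bool:
        # The plugin opts in only if it sets all three 2PC callbacks.
        return None not in (self.prepare_cb,
                            self.commit_prepared_cb,
                            self.rollback_prepared_cb)


def dispatch(plugin: Plugin, event: str, xid: int, gid: str) -> None:
    """Route one decoded 2PC event to the plugin."""
    if plugin.wants_2pc():
        if event == "PREPARE":
            plugin.begin_cb(xid)
            plugin.prepare_cb(xid, gid)
        elif event == "COMMIT PREPARED":
            plugin.commit_prepared_cb(xid, gid)
        elif event == "ROLLBACK PREPARED":
            plugin.rollback_prepared_cb(xid, gid)
    elif event == "COMMIT PREPARED":
        # Fallback: the plugin sees an ordinary xact, and only once it
        # has really committed. PREPARE and ROLLBACK PREPARED are silent.
        plugin.begin_cb(xid)
        plugin.commit_cb(xid)
```

With this shape the commit callback is never invoked for an abort, and a plugin that ignores 2PC keeps the 1:1 begin/commit pairing it relies on today.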
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 02/03/17 13:23, Craig Ringer wrote:
On 2 March 2017 at 16:20, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
On 2 Mar 2017, at 11:00, Craig Ringer <craig@2ndquadrant.com> wrote:
We already have it, because we just decoded the PREPARE TRANSACTION.
I'm preparing a patch revision to demonstrate this.

Yes, we already have it, but if the server reboots between commit prepared (all
prepared state is gone) and decoding of this commit prepared then we lose
that mapping, don't we?

I was about to explain how restart_lsn works again, and how that would
mean we'd always re-decode the PREPARE TRANSACTION before any COMMIT
PREPARED or ROLLBACK PREPARED on crash. But...

Actually, the way you've implemented it, that won't be the case. You
treat PREPARE TRANSACTION as a special-case of COMMIT, and the client
will presumably send replay confirmation after it has applied the
PREPARE TRANSACTION. In fact, it has to if we want 2PC to work with
synchronous replication. This will allow restart_lsn to advance to
after the PREPARE TRANSACTION record if there's no other older xact
and we see a suitable xl_running_xacts record. So we wouldn't decode
the PREPARE TRANSACTION again after restart.
Unless we just don't let restart_lsn go forward if there is a 2pc that
wasn't decoded yet (twopcs store the prepare lsn), but that's probably
too much of a kludge.
It's also worth noting that with your current approach, 2PC xacts will
produce two calls to the output plugin's commit() callback, once for
the PREPARE TRANSACTION and another for the COMMIT PREPARED or
ROLLBACK PREPARED, the latter two with a faked-up state. I'm not a
huge fan of that. It's not entirely backward compatible since it
violates the previously safe assumption that there's a 1:1
relationship between begin and commit callbacks with no interleaving,
for one thing, and I think it's also a bit misleading to send a
PREPARE TRANSACTION to a callback that could previously only receive a
true commit.

I particularly dislike calling a commit callback for an abort. So I'd
like to look further into the interface side of things. I'm inclined
to suggest adding new callbacks for 2pc prepare, commit and rollback,
and if the output plugin doesn't set them fall back to the existing
behaviour. Plugins that aren't interested in 2PC (think ETL) should
probably not have to deal with it, we might as well just send them
only the actually committed xacts, when they commit.
I think this is a good approach to handle it.
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 3/2/17 11:34 AM, Petr Jelinek wrote:
On 02/03/17 13:23, Craig Ringer wrote:
I particularly dislike calling a commit callback for an abort. So I'd
like to look further into the interface side of things. I'm inclined
to suggest adding new callbacks for 2pc prepare, commit and rollback,
and if the output plugin doesn't set them fall back to the existing
behaviour. Plugins that aren't interested in 2PC (think ETL) should
probably not have to deal with it, we might as well just send them
only the actually committed xacts, when they commit.

I think this is a good approach to handle it.
It's been a while since there was any activity on this thread and a very
long time since the last patch. As far as I can see there are far more
questions than answers in this thread.
If you need more time to produce a patch, please post an explanation for
the delay and a schedule for the new patch. If no patch or explanation
is posted by 2017-03-17 AoE I will mark this submission
"Returned with Feedback".
--
-David
david@pgmasters.net
On 02/03/17 17:34, Petr Jelinek wrote:
On 02/03/17 13:23, Craig Ringer wrote:
On 2 March 2017 at 16:20, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
On 2 Mar 2017, at 11:00, Craig Ringer <craig@2ndquadrant.com> wrote:
We already have it, because we just decoded the PREPARE TRANSACTION.
I'm preparing a patch revision to demonstrate this.

Yes, we already have it, but if the server reboots between commit prepared (all
prepared state is gone) and decoding of this commit prepared then we lose
that mapping, don't we?

I was about to explain how restart_lsn works again, and how that would
mean we'd always re-decode the PREPARE TRANSACTION before any COMMIT
PREPARED or ROLLBACK PREPARED on crash. But...

Actually, the way you've implemented it, that won't be the case. You
treat PREPARE TRANSACTION as a special-case of COMMIT, and the client
will presumably send replay confirmation after it has applied the
PREPARE TRANSACTION. In fact, it has to if we want 2PC to work with
synchronous replication. This will allow restart_lsn to advance to
after the PREPARE TRANSACTION record if there's no other older xact
and we see a suitable xl_running_xacts record. So we wouldn't decode
the PREPARE TRANSACTION again after restart.
Thinking about this some more. Why can't we use the same mechanism
standby uses, ie, use xid to identify the 2PC? If output plugin cares
about doing 2PC in two phases, it can send xid as part of its protocol
(like the PG10 logical replication and pglogical do already) and simply
remember on downstream the remote node + remote xid of the 2PC in
progress. That way there is no need for gids in COMMIT PREPARED and this
patch would be much simpler (as the tracking would be left to actual
replication implementation as opposed to decoding). Or am I missing
something?
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 15 March 2017 at 15:42, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:
Thinking about this some more. Why can't we use the same mechanism
standby uses, ie, use xid to identify the 2PC?
It pushes work onto the downstream, which has to keep an <xid,gid>
mapping in a crash-safe, persistent form. We'll be doing a flush of
some kind anyway so we can report successful prepare to the upstream
so an additional flush of a SLRU might not be so bad for a postgres
downstream. And I guess any other clients will have some kind of
downstream persistent mapping to use.
So I think I have a mild preference for recording the gid on 2pc
commit and abort records in the master's WAL, where it's very cheap
and simple.
But I agree that just sending the xid is a viable option if that falls through.
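For what the downstream side of the xid-only option might look like, here is a rough sketch. It is a toy: a JSON file flushed with fsync stands in for whatever crash-safe store a real downstream would use (an SLRU for a postgres downstream, as mentioned above), and `PreparedXactMap`, `remember`, `resolve` and the gid format are all made-up names.

```python
import json
import os
import tempfile


class PreparedXactMap:
    """Toy crash-safe map: (origin node, remote xid) -> local gid."""

    def __init__(self, path):
        self.path = path
        self.entries = {}
        if os.path.exists(path):
            # Reload whatever survived the last restart.
            with open(path) as f:
                self.entries = json.load(f)

    @staticmethod
    def _key(node, xid):
        return f"{node}:{xid}"

    def _flush(self):
        # Write to a temp file, fsync, then rename over the old file.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.entries, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)

    def remember(self, node, xid, local_gid):
        # Called when the downstream applies PREPARE TRANSACTION.
        self.entries[self._key(node, xid)] = local_gid
        self._flush()

    def resolve(self, node, xid):
        # Called on COMMIT PREPARED / ROLLBACK PREPARED, which carry
        # only the xid in this scheme. Raises KeyError if unknown.
        gid = self.entries.pop(self._key(node, xid))
        self._flush()
        return gid


def roundtrip_demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "prepared_map.json")
        PreparedXactMap(path).remember("node1", 742, "remote_node1_742")
        # Simulated restart: a fresh object reloads the map from disk.
        return PreparedXactMap(path).resolve("node1", 742)
```

The extra flush in `remember` is the cost being weighed here against writing the gid into the 2pc commit and abort records on the master.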
I'm going to try to pick this patch up and amend its interface per our
discussion earlier, see if I can get it committable.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 16 Mar 2017, at 14:44, Craig Ringer <craig@2ndquadrant.com> wrote:
I'm going to try to pick this patch up and amend its interface per our
discussion earlier, see if I can get it committable.
I’m working right now on issue with building snapshots for decoding prepared tx.
I hope I'll send updated patch later today.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2 Mar 2017, at 11:00, Craig Ringer <craig@2ndquadrant.com> wrote:
BTW, I've been reviewing the patch in more detail. Other than a bunch
of copy-and-paste that I'm cleaning up, the main issue I've found is
that in DecodePrepare, you call:

SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
                   parsed->nsubxacts, parsed->subxacts);

but I am not convinced it is correct to call it at PREPARE TRANSACTION
time, only at COMMIT PREPARED time. We want to see the 2pc prepared
xact's state when decoding it, but there might be later commits that
cannot yet see that state and shouldn't have it visible in their
snapshots.

Agree, that is a problem. That allows us to decode this PREPARE, but after that
it is better to mark this transaction as running in the snapshot or perform prepare
decoding with some kind of copied-and-edited snapshot. I’ll have a look at this.

While working on this I’ve spotted quite a nasty corner case with aborted prepared
transactions. I have some not-that-great ideas for how to fix it, but maybe my view is
blurred and I missed something, so I want to ask here first.

Suppose we created a table, then altered it in a 2pc tx, and after that aborted the tx.
So pg_class will have something like this:

xmin | xmax | relname
100  | 200  | mytable
200  | 0    | mytable

After the abort, tuple (100,200,mytable) becomes visible again, and if we alter the table
again then the xmax of the first tuple will be set to the current xid, resulting in the following table:

xmin | xmax | relname
100  | 300  | mytable
200  | 0    | mytable
300  | 0    | mytable

At that moment we’ve lost the information that the first tuple was deleted by our prepared tx.
And from the POV of the historic snapshot that will be constructed to decode the prepare, the first
tuple is visible, but actually the second tuple should be used. Moreover, such a snapshot could
see both tuples, violating oid uniqueness, but heapscan stops after finding the first one.

I see two possible workarounds here:
* First try to scan the catalog filtering out tuples with xmax bigger than snapshot->xmax,
as they were possibly deleted by our tx. Then, if nothing is found, scan in the usual way.
* Do not decode such a transaction at all. If by the time of decoding the prepare record we
already know that it is aborted, then such decoding doesn’t make a lot of sense.

IMO the intended usage of logical 2pc decoding is to decide about commit/abort based
on answers from logical subscribers/replicas. So there will be a barrier between
prepare and commit/abort and such situations shouldn’t happen.
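
The corner case can be reproduced in a toy visibility model. This is a deliberately simplified sketch, not PostgreSQL's actual historic-snapshot visibility code: an xid counts as visible if it is below the snapshot's xmax and in a "committed" set, and the prepared xid 200 is put in that set because the snapshot is built to decode it. `Tup`, `xid_visible`, `visible` and `scan` are illustrative names.

```python
from typing import List, NamedTuple, Set


class Tup(NamedTuple):
    xmin: int
    xmax: int      # 0 means "never deleted"
    relname: str


def xid_visible(xid: int, committed: Set[int], snap_xmax: int) -> bool:
    # Toy rule: visible if it is before the snapshot horizon and committed.
    return xid < snap_xmax and xid in committed


def visible(t: Tup, committed: Set[int], snap_xmax: int) -> bool:
    if not xid_visible(t.xmin, committed, snap_xmax):
        return False                      # inserter not visible
    if t.xmax == 0:
        return True                       # never deleted
    return not xid_visible(t.xmax, committed, snap_xmax)


def scan(heap: List[Tup], committed: Set[int], snap_xmax: int) -> List[Tup]:
    return [t for t in heap if visible(t, committed, snap_xmax)]


# Snapshot used to decode prepared xid 200: it sees 100 and itself.
COMMITTED = {100, 200}
SNAP_XMAX = 201

# pg_class right after the 2pc tx altered the table.
before = [Tup(100, 200, "mytable"), Tup(200, 0, "mytable")]

# pg_class after the 2pc tx aborted and xid 300 altered the table again.
after = [Tup(100, 300, "mytable"), Tup(200, 0, "mytable"),
         Tup(300, 0, "mytable")]
```

Scanning `before` yields only the tuple written by xid 200. Scanning `after` yields both (100, 300) and (200, 0): xid 300 is beyond the snapshot, so the first tuple no longer looks deleted, and a lookup that stops at the first match picks the wrong row.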
--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 17 March 2017 at 08:10, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
While working on this I’ve spotted quite a nasty corner case with aborted prepared
transactions. I have some not-that-great ideas for how to fix it, but maybe my view is
blurred and I missed something, so I want to ask here first.

Suppose we created a table, then altered it in a 2pc tx, and after that aborted the tx.
So pg_class will have something like this:

xmin | xmax | relname
100  | 200  | mytable
200  | 0    | mytable

After the abort, tuple (100,200,mytable) becomes visible again, and if we alter the table
again then the xmax of the first tuple will be set to the current xid, resulting in the following table:

xmin | xmax | relname
100  | 300  | mytable
200  | 0    | mytable
300  | 0    | mytable

At that moment we’ve lost the information that the first tuple was deleted by our prepared tx.

Right. And while the prepared xact has aborted, we don't control when
it aborts and when those overwrites can start happening. We can and
should check if a 2pc xact is aborted before we start decoding it so
we can skip decoding it if it's already aborted, but it could be
aborted *while* we're decoding it, then have data needed for its
snapshot clobbered.
This hasn't mattered in the past because prepared xacts (and
especially aborted 2pc xacts) have never needed snapshots, we've never
needed to do something from the perspective of a prepared xact.
I think we'll probably need to lock the 2PC xact so it cannot be
aborted or committed while we're decoding it, until we finish decoding
it. So we lock it, then check if it's already aborted/already
committed/in progress. If it's aborted, treat it like any normal
aborted xact. If it's committed, treat it like any normal committed
xact. If it's in progress, keep the lock and decode it.
People using logical decoding for 2PC will presumably want to control
2PC via logical decoding, so they're not so likely to mind such a
lock.
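
As a rough illustration of the lock-then-check order described above, here is a single-process Python sketch. It assumes COMMIT PREPARED and ROLLBACK PREPARED would have to take the same lock; `PreparedXact`, `try_finish` and `decode_prepared` are invented names, and a `threading.Lock` stands in for whatever lock the server would really use.

```python
import threading
from enum import Enum


class State(Enum):
    IN_PROGRESS = 1
    COMMITTED = 2
    ABORTED = 3


class PreparedXact:
    def __init__(self, gid):
        self.gid = gid
        self.state = State.IN_PROGRESS
        self.lock = threading.Lock()

    def try_finish(self, new_state):
        """COMMIT/ROLLBACK PREPARED: refused while the decoder holds the lock."""
        if not self.lock.acquire(blocking=False):
            return False
        try:
            self.state = new_state
            return True
        finally:
            self.lock.release()


def decode_prepared(xact, decode_fn):
    # Lock first, then look at the state, so it cannot change under us.
    with xact.lock:
        if xact.state is State.ABORTED:
            return "treat-as-aborted"
        if xact.state is State.COMMITTED:
            return "treat-as-committed"
        decode_fn(xact)          # still prepared: decode while holding the lock
        return "decoded-at-prepare"
```

The lock is held for as long as `decode_fn` runs, so a concurrent rollback is refused (or would have to wait) for the whole duration of decoding.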
* First try to scan the catalog filtering out tuples with xmax bigger than snapshot->xmax,
as they were possibly deleted by our tx. Then, if nothing is found, scan in the usual way.
I don't think that'll be at all viable with the syscache/relcache
machinery. Way too intrusive.
* Do not decode such transaction at all.
Yes, that's what I'd like to do, per above.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 16 March 2017 at 19:52, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
I’m working right now on issue with building snapshots for decoding prepared tx.
I hope I'll send updated patch later today.
Great.
What approach are you taking?
It looks like the snapshot builder actually does most of the work we
need for this already, maintaining a stack of snapshots we can use. It
might be as simple as invalidating the relcache/syscache when we exit
(and enter?) decoding of a prepared 2pc xact, since it violates the
usual assumption of logical decoding that we decode things strictly in
commit-time order.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Mar 16, 2017 at 10:34 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
On 17 March 2017 at 08:10, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
While working on this I’ve spotted quite a nasty corner case with aborted prepared
transactions. I have some not-that-great ideas for how to fix it, but maybe my view is
blurred and I missed something, so I want to ask here first.

Suppose we created a table, then altered it in a 2pc tx, and after that aborted the tx.
So pg_class will have something like this:

xmin | xmax | relname
100  | 200  | mytable
200  | 0    | mytable

After the abort, tuple (100,200,mytable) becomes visible again, and if we alter the table
again then the xmax of the first tuple will be set to the current xid, resulting in the following table:

xmin | xmax | relname
100  | 300  | mytable
200  | 0    | mytable
300  | 0    | mytable

At that moment we’ve lost the information that the first tuple was deleted by our prepared tx.

Right. And while the prepared xact has aborted, we don't control when
it aborts and when those overwrites can start happening. We can and
should check if a 2pc xact is aborted before we start decoding it so
we can skip decoding it if it's already aborted, but it could be
aborted *while* we're decoding it, then have data needed for its
snapshot clobbered.

This hasn't mattered in the past because prepared xacts (and
especially aborted 2pc xacts) have never needed snapshots, we've never
needed to do something from the perspective of a prepared xact.

I think we'll probably need to lock the 2PC xact so it cannot be
aborted or committed while we're decoding it, until we finish decoding
it. So we lock it, then check if it's already aborted/already
committed/in progress. If it's aborted, treat it like any normal
aborted xact. If it's committed, treat it like any normal committed
xact. If it's in progress, keep the lock and decode it.
But that lock could need to be held for an unbounded period of time -
as long as decoding takes to complete - which seems pretty
undesirable. Worse still, the same problem will arise if you
eventually want to start decoding ordinary, non-2PC transactions that
haven't committed yet, which I think is something we definitely want
to do eventually; the current handling of bulk loads or bulk updates
leads to significant latency. You're not going to be able to tell an
active transaction that it isn't allowed to abort until you get done
with it, and I don't really think you should be allowed to lock out
2PC aborts for long periods of time either. That's going to stink for
users.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 17/03/17 03:34, Craig Ringer wrote:
On 17 March 2017 at 08:10, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
While working on this I’ve spotted quite a nasty corner case with aborted prepared
transactions. I have some not-that-great ideas for how to fix it, but maybe my view is
blurred and I missed something, so I want to ask here first.

Suppose we created a table, then altered it in a 2pc tx, and after that aborted the tx.
So pg_class will have something like this:

xmin | xmax | relname
100  | 200  | mytable
200  | 0    | mytable

After the abort, tuple (100,200,mytable) becomes visible again, and if we alter the table
again then the xmax of the first tuple will be set to the current xid, resulting in the following table:

xmin | xmax | relname
100  | 300  | mytable
200  | 0    | mytable
300  | 0    | mytable

At that moment we’ve lost the information that the first tuple was deleted by our prepared tx.

Right. And while the prepared xact has aborted, we don't control when
it aborts and when those overwrites can start happening. We can and
should check if a 2pc xact is aborted before we start decoding it so
we can skip decoding it if it's already aborted, but it could be
aborted *while* we're decoding it, then have data needed for its
snapshot clobbered.

This hasn't mattered in the past because prepared xacts (and
especially aborted 2pc xacts) have never needed snapshots, we've never
needed to do something from the perspective of a prepared xact.

I think we'll probably need to lock the 2PC xact so it cannot be
aborted or committed while we're decoding it, until we finish decoding
it. So we lock it, then check if it's already aborted/already
committed/in progress. If it's aborted, treat it like any normal
aborted xact. If it's committed, treat it like any normal committed
xact. If it's in progress, keep the lock and decode it.

People using logical decoding for 2PC will presumably want to control
2PC via logical decoding, so they're not so likely to mind such a
lock.

* First try to scan the catalog filtering out tuples with xmax bigger than snapshot->xmax,
as they were possibly deleted by our tx. Then, if nothing is found, scan in the usual way.

I don't think that'll be at all viable with the syscache/relcache
machinery. Way too intrusive.

I think only genam would need changes to do two-phase scan for this as
the catalog scans should ultimately go there. It's going to slow down
things but we could limit the impact by doing the two-phase scan only
when historical snapshot is in use and the tx being decoded changed
catalogs (we already have global knowledge of the first one, and it
would be trivial to add the second one as we have local knowledge of
that as well).
What I think is a better strategy than filtering out by xmax would be
filtering "in" by xmin though. Meaning that the first scan would return only
tuples modified by the current tx which are visible in the snapshot, and the second
scan would return the other visible tuples. That way whatever the
decoded tx has seen should always win.
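
The filter-"in"-by-xmin idea can be sketched against a toy model of the pg_class example from earlier in the thread. This is only an illustration under simplified visibility rules, nothing like the real genam code; `Tup`, `visible`, `naive_lookup` and `two_pass_lookup` are made-up names.

```python
from typing import List, NamedTuple, Optional, Set


class Tup(NamedTuple):
    xmin: int
    xmax: int      # 0 means "never deleted"
    relname: str


def visible(t: Tup, committed: Set[int], snap_xmax: int) -> bool:
    # Toy rule: an xid is visible if it is below snap_xmax and committed.
    def seen(xid: int) -> bool:
        return xid < snap_xmax and xid in committed
    return seen(t.xmin) and (t.xmax == 0 or not seen(t.xmax))


def naive_lookup(heap: List[Tup], relname: str,
                 committed: Set[int], snap_xmax: int) -> Optional[Tup]:
    # Ordinary scan: stop at the first visible match.
    for t in heap:
        if t.relname == relname and visible(t, committed, snap_xmax):
            return t
    return None


def two_pass_lookup(heap: List[Tup], relname: str, decoded_xid: int,
                    committed: Set[int], snap_xmax: int) -> Optional[Tup]:
    matches = [t for t in heap
               if t.relname == relname and visible(t, committed, snap_xmax)]
    # Pass 1: visible tuples written by the tx being decoded win.
    for t in matches:
        if t.xmin == decoded_xid:
            return t
    # Pass 2: otherwise fall back to any other visible tuple.
    return matches[0] if matches else None


# State after the prepared xid 200 aborted and xid 300 altered the table.
heap = [Tup(100, 300, "mytable"), Tup(200, 0, "mytable"),
        Tup(300, 0, "mytable")]
```

Here `naive_lookup` returns the stale (100, 300) row, while `two_pass_lookup` with `decoded_xid=200` returns (200, 0), the row the prepared tx itself wrote.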
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 17 March 2017 at 23:59, Robert Haas <robertmhaas@gmail.com> wrote:
But that lock could need to be held for an unbounded period of time -
as long as decoding takes to complete - which seems pretty
undesirable.
Yeah. We could use a recovery-conflict like mechanism to signal the
decoding session that someone wants to abort the xact, but it gets
messy.
Worse still, the same problem will arise if you
eventually want to start decoding ordinary, non-2PC transactions that
haven't committed yet, which I think is something we definitely want
to do eventually; the current handling of bulk loads or bulk updates
leads to significant latency.
Yeah. If it weren't for that, I'd probably still just pursue locking.
But you're right that we'll have to solve this sooner or later. I'll
admit I hoped for later.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 19 March 2017 at 21:26, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:
I think only genam would need changes to do two-phase scan for this as
the catalog scans should ultimately go there. It's going to slow down
things but we could limit the impact by doing the two-phase scan only
when historical snapshot is in use and the tx being decoded changed
catalogs (we already have global knowledge of the first one, and it
would be trivial to add the second one as we have local knowledge of
that as well).
We'll also have to clobber caches after we finish decoding a 2pc xact,
since we don't know those changes are visible to other xacts and can't
guarantee they'll ever be (if it aborts).
That's going to be "interesting" when trying to decode interleaved
transaction streams since we can't afford to clobber caches whenever
we see an xlog record from a different xact. We'll probably have to
switch to linear decoding with reordering when someone makes catalog
changes.
TBH, I have no idea how to approach the genam changes for the proposed
double-scan method. It sounds like Stas has some idea how to proceed
though (right?)
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 20/03/17 09:32, Craig Ringer wrote:
On 19 March 2017 at 21:26, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:
I think only genam would need changes to do two-phase scan for this as
the catalog scans should ultimately go there. It's going to slow down
things but we could limit the impact by doing the two-phase scan only
when historical snapshot is in use and the tx being decoded changed
catalogs (we already have global knowledge of the first one, and it
would be trivial to add the second one as we have local knowledge of
that as well).

We'll also have to clobber caches after we finish decoding a 2pc xact,
since we don't know those changes are visible to other xacts and can't
guarantee they'll ever be (if it aborts).
AFAIK reorder buffer already does that.
That's going to be "interesting" when trying to decode interleaved
transaction streams since we can't afford to clobber caches whenever
we see an xlog record from a different xact. We'll probably have to
switch to linear decoding with reordering when someone makes catalog
changes.
We may need something that allows for representing multiple parallel
transactions in single process and a cheap way of switching between them
(ie, similar things we need for autonomous transactions). But that's not
something current patch has to deal with.
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 20 Mar 2017, at 11:32, Craig Ringer <craig@2ndquadrant.com> wrote:
On 19 March 2017 at 21:26, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:
I think only genam would need changes to do two-phase scan for this as
the catalog scans should ultimately go there. It's going to slow down
things but we could limit the impact by doing the two-phase scan only
when historical snapshot is in use and the tx being decoded changed
catalogs (we already have global knowledge of the first one, and it
would be trivial to add the second one as we have local knowledge of
that as well).

TBH, I have no idea how to approach the genam changes for the proposed
double-scan method. It sounds like Stas has some idea how to proceed
though (right?)
I thought about having a special field (or reusing one of the existing fields)
in the snapshot struct to force filtering on xmax > snap->xmax or xmin = snap->xmin,
as Petr suggested. Then this logic can reside in ReorderBufferCommit().
However, this does not solve the problem with catcache, so I'm looking into it right now.
On 17 Mar 2017, at 05:38, Craig Ringer <craig@2ndquadrant.com> wrote:
On 16 March 2017 at 19:52, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
I’m working right now on issue with building snapshots for decoding prepared tx.
I hope I'll send updated patch later today.

Great.
What approach are you taking?

Just as before I am marking this transaction committed in snapbuilder, but after
decoding I delete this transaction from xip (which holds committed transactions
in case of historic snapshot).
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
I thought about having special field (or reusing one of the existing fields)
in snapshot struct to force filtering xmax > snap->xmax or xmin = snap->xmin
as Petr suggested. Then this logic can reside in ReorderBufferCommit().
However this is not solving problem with catcache, so I'm looking into it right now.
OK, so this is only an issue if we have xacts that change the schema
of tables and also insert/update/delete to their heaps. Right?
So, given that this is CF3 for Pg10, should we take a step back and
impose the limitation that we can decode 2PC with schema changes or
data row changes, but not both?
Applications can record DDL in transactional logical WAL messages for
decoding during 2pc processing. Or apps can do 2pc for DML. They just
can't do both at the same time, in the same xact.
Imperfect, but a lot less invasive. And we can even permit apps to use
the locking-based approach I outlined earlier instead:
All we have to do IMO is add an output plugin callback to filter
whether we want to decode a given 2pc xact at PREPARE TRANSACTION time
or defer until COMMIT PREPARED. It could:
* mark the xact for deferred decoding at commit time (the default if
the callback doesn't exist); or
* Acquire a lock on the 2pc xact and request immediate decoding only
if it gets the lock so concurrent ROLLBACK PREPARED is blocked; or
* inspect the reorder buffer contents for row changes and decide
whether to decode now or later based on that.
It has a few downsides - for example, temp tables will be considered
"catalog changes" for now. But .. eh. We already accept a bunch of
practical limitations for catalog changes and DDL in logical decoding,
most notably regarding practical handling of full table rewrites.
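
A sketch of how such a filter callback could be shaped, written as a Python model rather than a real output plugin interface. `Decision`, `PreparedTxn`, `ddl_xor_dml_filter`, `make_locking_filter` and `decide` are hypothetical names, and the `has_catalog_changes` / `has_row_changes` flags are assumed to be something the reorder buffer could supply.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Decision(Enum):
    DECODE_AT_PREPARE = "decode at PREPARE TRANSACTION"
    DEFER_TO_COMMIT = "defer until COMMIT PREPARED"


@dataclass
class PreparedTxn:
    gid: str
    has_catalog_changes: bool
    has_row_changes: bool


def ddl_xor_dml_filter(txn: PreparedTxn) -> Decision:
    # Inspect the buffered changes: decode early only if the xact does
    # not mix schema changes with row changes.
    if txn.has_catalog_changes and txn.has_row_changes:
        return Decision.DEFER_TO_COMMIT
    return Decision.DECODE_AT_PREPARE


def make_locking_filter(try_lock: Callable[[str], bool]):
    # Decode early only if we get a lock that blocks ROLLBACK PREPARED.
    def filter_cb(txn: PreparedTxn) -> Decision:
        if try_lock(txn.gid):
            return Decision.DECODE_AT_PREPARE
        return Decision.DEFER_TO_COMMIT
    return filter_cb


def decide(filter_cb: Optional[Callable[[PreparedTxn], Decision]],
           txn: PreparedTxn) -> Decision:
    # No callback set: keep today's behaviour and decode at commit time.
    if filter_cb is None:
        return Decision.DEFER_TO_COMMIT
    return filter_cb(txn)
```

The three policies map onto the three bullets above; the decoder itself only ever calls `decide`, so a plugin that dislikes the lock simply does not install the locking filter.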
Just as before I am marking this transaction committed in snapbuilder, but after
decoding I delete this transaction from xip (which holds committed transactions
in case of historic snapshot).
That seems kind of hacky TBH. I didn't much like marking it as
committed then un-committing it.
I think it's mostly an interface issue though. I'd rather say
SnapBuildPushPrepareTransaction and SnapBuildPopPreparedTransaction or
something, to make it clear what we're doing.
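
To show what a paired push/pop interface buys over "commit, then un-commit", here is a toy Python version using a context manager. Neither `SnapBuildPushPrepareTransaction` nor `SnapBuildPopPreparedTransaction` exists; they are only names floated in this thread, and `ToySnapBuilder` and `prepared_visible` are likewise invented for the sketch.

```python
from contextlib import contextmanager


class ToySnapBuilder:
    def __init__(self):
        # Stand-in for the set of xids a historic snapshot treats as committed.
        self.committed = set()

    def commit(self, xid):
        self.committed.add(xid)

    @contextmanager
    def prepared_visible(self, xid):
        """Make a prepared xid visible only while it is being decoded."""
        already = xid in self.committed
        self.committed.add(xid)                 # the "push"
        try:
            yield frozenset(self.committed)     # snapshot used for decoding
        finally:
            if not already:
                self.committed.discard(xid)     # the "pop", even on error
```

A caller would write `with builder.prepared_visible(200) as snap: ...`, and later commits never see xid 200 as committed unless COMMIT PREPARED really arrives.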
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 17 March 2017 at 23:59, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Mar 16, 2017 at 10:34 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
On 17 March 2017 at 08:10, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
While working on this I’ve spotted quite a nasty corner case with aborted prepared
transactions. I have some not-that-great ideas for how to fix it, but maybe my view is
blurred and I missed something, so I want to ask here first.

Suppose we created a table, then altered it in a 2pc tx, and after that aborted the tx.
So pg_class will have something like this:

xmin | xmax | relname
100  | 200  | mytable
200  | 0    | mytable

After the abort, tuple (100,200,mytable) becomes visible again, and if we alter the table
again then the xmax of the first tuple will be set to the current xid, resulting in the following table:

xmin | xmax | relname
100  | 300  | mytable
200  | 0    | mytable
300  | 0    | mytable

At that moment we’ve lost the information that the first tuple was deleted by our prepared tx.

Right. And while the prepared xact has aborted, we don't control when
it aborts and when those overwrites can start happening. We can and
should check if a 2pc xact is aborted before we start decoding it so
we can skip decoding it if it's already aborted, but it could be
aborted *while* we're decoding it, then have data needed for its
snapshot clobbered.

This hasn't mattered in the past because prepared xacts (and
especially aborted 2pc xacts) have never needed snapshots, we've never
needed to do something from the perspective of a prepared xact.

I think we'll probably need to lock the 2PC xact so it cannot be
aborted or committed while we're decoding it, until we finish decoding
it. So we lock it, then check if it's already aborted/already
committed/in progress. If it's aborted, treat it like any normal
aborted xact. If it's committed, treat it like any normal committed
xact. If it's in progress, keep the lock and decode it.

But that lock could need to be held for an unbounded period of time -
as long as decoding takes to complete - which seems pretty
undesirable.
This didn't seem to be too much of a problem when I read it.
Sure, the issue noted by Stas exists, but it requires
Alter-Abort-Alter for it to be a problem. Meaning that normal non-DDL
transactions do not have problems. Neither would a real-time system
that uses the decoded data to decide whether to commit or abort the
transaction; in that case there would never be an abort until after
decoding.
So I suggest we have a pre-prepare callback to ensure that the plugin
can decide whether to decode or not. We can pass information to the
plugin such as whether we have issued DDL in that xact or not. The
plugin can then decide how it wishes to handle it, so if somebody
doesn't like the idea of a lock then don't use one. The plugin is
already responsible for many things, so this is nothing new.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 20 Mar 2017, at 15:17, Craig Ringer <craig@2ndquadrant.com> wrote:
I thought about having special field (or reusing one of the existing fields)
in snapshot struct to force filtering xmax > snap->xmax or xmin = snap->xmin
as Petr suggested. Then this logic can reside in ReorderBufferCommit().
However this is not solving problem with catcache, so I'm looking into it right now.

OK, so this is only an issue if we have xacts that change the schema
of tables and also insert/update/delete to their heaps. Right?

So, given that this is CF3 for Pg10, should we take a step back and
impose the limitation that we can decode 2PC with schema changes or
data row changes, but not both?
Yep, time is tight. I’ll try today/tomorrow to proceed with this two-scan approach.
If I fail to do that in this time, then I’ll just update this patch to decode
only non-DDL 2PC transactions, as you suggested.
Just as before, I am marking this transaction as committed in snapbuilder, but after
decoding I delete this transaction from xip (which holds committed transactions
in case of historic snapshot).
That seems kind of hacky TBH. I didn't much like marking it as
committed then un-committing it.
I think it's mostly an interface issue though. I'd rather say
SnapBuildPushPrepareTransaction and SnapBuildPopPreparedTransaction or
something, to make it clear what we're doing.
Yes, that will be less confusing. However, there isn't any kind of queue, so
SnapBuildStartPrepare / SnapBuildFinishPrepare should work too.
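As a rough picture of what a start/finish pair would do, here is a toy model in which a small array stands in for the historic snapshot's committed-xid list (xip). `ToySnapBuild`, `toy_start_prepare` and `toy_finish_prepare` are hypothetical names used only for illustration; they are not the snapshot builder's real structures, and the real xip array has ordering requirements this sketch ignores.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_XIP 16

typedef uint32_t ToyXid;

/* Hypothetical stand-in for the committed-xid list of a historic snapshot. */
typedef struct ToySnapBuild
{
    ToyXid xip[TOY_MAX_XIP];
    size_t xcnt;
} ToySnapBuild;

/* Does the toy snapshot currently treat xid as committed? */
static bool
toy_xid_visible(const ToySnapBuild *sb, ToyXid xid)
{
    for (size_t i = 0; i < sb->xcnt; i++)
        if (sb->xip[i] == xid)
            return true;
    return false;
}

/* Make the prepared xact's catalog changes visible while it is decoded. */
static void
toy_start_prepare(ToySnapBuild *sb, ToyXid xid)
{
    assert(sb->xcnt < TOY_MAX_XIP);
    sb->xip[sb->xcnt++] = xid;
}

/* Undo toy_start_prepare(): the xact is only prepared, not committed. */
static void
toy_finish_prepare(ToySnapBuild *sb, ToyXid xid)
{
    for (size_t i = 0; i < sb->xcnt; i++)
    {
        if (sb->xip[i] == xid)
        {
            /* Swap-remove; order is not preserved in this toy model. */
            sb->xip[i] = sb->xip[--sb->xcnt];
            return;
        }
    }
}
```

Naming the pair start/finish makes it explicit that the "committed" state is temporary and scoped to decoding the prepare.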
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 20 March 2017 at 20:57, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
On 20 Mar 2017, at 15:17, Craig Ringer <craig@2ndquadrant.com> wrote:
I thought about having special field (or reusing one of the existing fields)
in snapshot struct to force filtering xmax > snap->xmax or xmin = snap->xmin
as Petr suggested. Then this logic can reside in ReorderBufferCommit().
However, this does not solve the problem with catcache, so I'm looking into it right now.
OK, so this is only an issue if we have xacts that change the schema
of tables and also insert/update/delete to their heaps. Right?
So, given that this is CF3 for Pg10, should we take a step back and
impose the limitation that we can decode 2PC with schema changes or
data row changes, but not both?
Yep, time is tight. I’ll try today/tomorrow to proceed with this two-scan approach.
If I fail to do that in this time, then I’ll just update this patch to decode
only non-DDL 2PC transactions, as you suggested.
I wasn't suggesting not decoding them, but giving the plugin the
option of whether to proceed with decoding or not.
As Simon said, have a pre-decode-prepared callback that lets the
plugin get a lock on the 2pc xact if it wants, or say it doesn't want
to decode it until it commits.
That'd be useful anyway, so we can filter and only do decoding at
prepare transaction time of xacts the downstream wants to know about
before they commit.
Just as before, I am marking this transaction as committed in snapbuilder, but after
decoding I delete this transaction from xip (which holds committed transactions
in case of historic snapshot).
That seems kind of hacky TBH. I didn't much like marking it as
committed then un-committing it.
I think it's mostly an interface issue though. I'd rather say
SnapBuildPushPrepareTransaction and SnapBuildPopPreparedTransaction or
something, to make it clear what we're doing.
Yes, that will be less confusing. However, there isn't any kind of queue, so
SnapBuildStartPrepare / SnapBuildFinishPrepare should work too.
Yeah, that's better.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 20 Mar 2017, at 16:39, Craig Ringer <craig@2ndquadrant.com> wrote:
On 20 March 2017 at 20:57, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
On 20 Mar 2017, at 15:17, Craig Ringer <craig@2ndquadrant.com> wrote:
I thought about having special field (or reusing one of the existing fields)
in snapshot struct to force filtering xmax > snap->xmax or xmin = snap->xmin
as Petr suggested. Then this logic can reside in ReorderBufferCommit().
However, this does not solve the problem with catcache, so I'm looking into it right now.
OK, so this is only an issue if we have xacts that change the schema
of tables and also insert/update/delete to their heaps. Right?
So, given that this is CF3 for Pg10, should we take a step back and
impose the limitation that we can decode 2PC with schema changes or
data row changes, but not both?
Yep, time is tight. I’ll try today/tomorrow to proceed with this two-scan approach.
If I fail to do that in this time, then I’ll just update this patch to decode
only non-DDL 2PC transactions, as you suggested.
I wasn't suggesting not decoding them, but giving the plugin the
option of whether to proceed with decoding or not.
As Simon said, have a pre-decode-prepared callback that lets the
plugin get a lock on the 2pc xact if it wants, or say it doesn't want
to decode it until it commits.
That'd be useful anyway, so we can filter and only do decoding at
prepare transaction time of xacts the downstream wants to know about
before they commit.
Ah, got that. Okay.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 20 March 2017 at 21:47, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
On 20 Mar 2017, at 16:39, Craig Ringer <craig@2ndquadrant.com> wrote:
On 20 March 2017 at 20:57, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
On 20 Mar 2017, at 15:17, Craig Ringer <craig@2ndquadrant.com> wrote:
I thought about having special field (or reusing one of the existing fields)
in snapshot struct to force filtering xmax > snap->xmax or xmin = snap->xmin
as Petr suggested. Then this logic can reside in ReorderBufferCommit().
However, this does not solve the problem with catcache, so I'm looking into it right now.
OK, so this is only an issue if we have xacts that change the schema
of tables and also insert/update/delete to their heaps. Right?
So, given that this is CF3 for Pg10, should we take a step back and
impose the limitation that we can decode 2PC with schema changes or
data row changes, but not both?
Yep, time is tight. I’ll try today/tomorrow to proceed with this two-scan approach.
If I fail to do that in this time, then I’ll just update this patch to decode
only non-DDL 2PC transactions, as you suggested.
I wasn't suggesting not decoding them, but giving the plugin the
option of whether to proceed with decoding or not.
As Simon said, have a pre-decode-prepared callback that lets the
plugin get a lock on the 2pc xact if it wants, or say it doesn't want
to decode it until it commits.
That'd be useful anyway, so we can filter and only do decoding at
prepare transaction time of xacts the downstream wants to know about
before they commit.
Ah, got that. Okay.
Any news here?
We're in the last week of the CF. If you have a patch that's nearly
ready or getting there, now would be a good time to post it for help
and input from others.
I would really like to get this in, but we're running out of time.
Even if you just post your snapshot management work, with the cosmetic
changes discussed above, that would be a valuable start.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 27 March 2017 at 09:31, Craig Ringer <craig@2ndquadrant.com> wrote:
We're in the last week of the CF. If you have a patch that's nearly
ready or getting there, now would be a good time to post it for help
and input from others.
I would really like to get this in, but we're running out of time.
Even if you just post your snapshot management work, with the cosmetic
changes discussed above, that would be a valuable start.
I'm going to pick up the last patch and:
* Ensure we only add the GID to xact records for 2pc commits and aborts
* Add separate callbacks for prepare, abort prepared, and commit
prepared (of xacts already processed during prepare), so we aren't
overloading the "commit" callback and don't have to create fake empty
transactions to pass to the commit callback;
* Add another callback to determine whether an xact should be
processed at PREPARE TRANSACTION or COMMIT PREPARED time.
* Rename the snapshot builder faux-commit stuff in the current patch
so it's clearer what's going on.
* Write tests covering DDL, abort-during-decode, etc
Some special care is needed for the callback that decides whether to
process a given xact as 2PC or not. It's called before PREPARE
TRANSACTION to decide whether to decode any given xact at prepare time
or wait until it commits. It's called again at COMMIT PREPARED time if
we crashed after we processed PREPARE TRANSACTION and advanced our
confirmed_flush_lsn such that we won't re-process the PREPARE
TRANSACTION again. Our restart_lsn might've advanced past it so we
never even decode it, so we can't rely on seeing it at all. It has
access to the xid, gid and invalidations, all of which we have at both
prepare and commit time, to make its decision from. It must have the
same result at prepare and commit time for any given xact. We can
probably use a cache in the reorder buffer to avoid the 2nd call on
commit prepared if we haven't crashed/reconnected between the two.
This proposal does not provide a way to safely decode a 2pc xact that
made catalog changes which may be aborted while being decoded. The
plugin must lock such an xact so that it can't be aborted while being
processed, or defer decoding until commit prepared. It can use the
invalidations for the commit to decide.
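To make the "same result at prepare and commit time" requirement concrete, here is a small sketch of a filter that depends only on inputs available at both points, plus a cache that avoids the second call when no crash or reconnect happened in between. Everything here is hypothetical: the `Toy*` names, the `repl_` GID prefix rule and the fixed-size cache are assumptions for illustration, not the proposed reorder buffer code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t ToyXid;

/* Inputs that exist at both PREPARE and COMMIT PREPARED time. */
typedef struct ToyPreparedInfo
{
    ToyXid      xid;
    const char *gid;
    bool        has_invalidations; /* proxy for "made catalog changes" */
} ToyPreparedInfo;

/*
 * Pure function of its inputs, so it gives the same answer whenever it is
 * asked.  Returns true to decode at PREPARE, false to wait for the commit.
 * The "repl_" prefix rule is an invented example policy.
 */
static bool
toy_decode_at_prepare(const ToyPreparedInfo *info)
{
    if (info->has_invalidations)
        return false;
    return strncmp(info->gid, "repl_", 5) == 0;
}

#define TOY_CACHE_SIZE 8

/* Hypothetical per-session cache of earlier decisions, keyed by xid. */
typedef struct ToyDecisionCache
{
    ToyXid xids[TOY_CACHE_SIZE];
    bool   decisions[TOY_CACHE_SIZE];
    int    used;
    int    callback_calls; /* counts real invocations of the filter */
} ToyDecisionCache;

static bool
toy_decide(ToyDecisionCache *cache, const ToyPreparedInfo *info)
{
    for (int i = 0; i < cache->used; i++)
        if (cache->xids[i] == info->xid)
            return cache->decisions[i];

    bool decision = toy_decode_at_prepare(info);

    cache->callback_calls++;
    if (cache->used < TOY_CACHE_SIZE)
    {
        cache->xids[cache->used] = info->xid;
        cache->decisions[cache->used] = decision;
        cache->used++;
    }
    return decision;
}
```

After a restart the cache is empty, so the filter is simply asked again and, being deterministic, returns the same answer it gave at prepare time.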
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 27 Mar 2017, at 12:26, Craig Ringer <craig@2ndquadrant.com> wrote:
On 27 March 2017 at 09:31, Craig Ringer <craig@2ndquadrant.com> wrote:
We're in the last week of the CF. If you have a patch that's nearly
ready or getting there, now would be a good time to post it for help
and input from others.
I would really like to get this in, but we're running out of time.
Even if you just post your snapshot management work, with the cosmetic
changes discussed above, that would be a valuable start.
I'm going to pick up the last patch and:
I heavily underestimated the amount of changes there, but I'm almost finished
and will send an updated patch in several hours.
* Ensure we only add the GID to xact records for 2pc commits and aborts
And only when wal_level >= logical. Done.
The patch also adds origin info to prepares and aborts.
* Add separate callbacks for prepare, abort prepared, and commit
prepared (of xacts already processed during prepare), so we aren't
overloading the "commit" callback and don't have to create fake empty
transactions to pass to the commit callback;
Done.
* Add another callback to determine whether an xact should be
processed at PREPARE TRANSACTION or COMMIT PREPARED time.
Also done.
* Rename the snapshot builder faux-commit stuff in the current patch
so it's clearer what's going on.
Hm. Okay, I’ll leave that part to you.
* Write tests covering DDL, abort-during-decode, etc
I’ve extended the test, but it would be good to have some more.
Some special care is needed for the callback that decides whether to
process a given xact as 2PC or not. It's called before PREPARE
TRANSACTION to decide whether to decode any given xact at prepare time
or wait until it commits. It's called again at COMMIT PREPARED time if
we crashed after we processed PREPARE TRANSACTION and advanced our
confirmed_flush_lsn such that we won't re-process the PREPARE
TRANSACTION again. Our restart_lsn might've advanced past it so we
never even decode it, so we can't rely on seeing it at all. It has
access to the xid, gid and invalidations, all of which we have at both
prepare and commit time, to make its decision from. It must have the
same result at prepare and commit time for any given xact. We can
probably use a cache in the reorder buffer to avoid the 2nd call on
commit prepared if we haven't crashed/reconnected between the two.
Good point. I didn’t think about restart_lsn in the case when we are skipping this
particular prepare (filter_prepare() -> true, in my terms). I think that should
work properly as it uses the same code path as before, but I’ll look at it.
This proposal does not provide a way to safely decode a 2pc xact that
made catalog changes which may be aborted while being decoded. The
plugin must lock such an xact so that it can't be aborted while being
processed, or defer decoding until commit prepared. It can use the
invalidations for the commit to decide.
I played with that two-pass catalog scan and it seems to be
working, but after some time I realised that it is not useful for the main
case, when commit/abort is generated after the receiver side has answered the
prepares. Also, that two-pass scan is a massive change in relcache.c and
genam.c (FWIW there were no problems with the cache, but some problems
with index scans and handling one-to-many queries to the catalog, e.g. a table
with its fields).
Finally I decided to throw it away and switched to the filter_prepare callback,
passing the txn structure there to allow access to the has_catalog_changes field.
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 27 March 2017 at 17:53, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
I heavily underestimated the amount of changes there, but I'm almost finished
and will send an updated patch in several hours.
Oh, brilliant! Please post whatever you have before you knock off for
the day anyway, even if it's just a WIP, so I can pick it up tomorrow
my time and poke at its tests etc.
I'm in Western Australia +0800 time, significantly ahead of you.
Done.
[snip]
Also done.
Great, time is short so that's fantastic.
I’ve extended the test, but it would be good to have some more.
I don't mind writing tests and I've done quite a bit with TAP now, so
happy to help there.
Some special care is needed for the callback that decides whether to
process a given xact as 2PC or not. It's called before PREPARE
TRANSACTION to decide whether to decode any given xact at prepare time
or wait until it commits. It's called again at COMMIT PREPARED time if
we crashed after we processed PREPARE TRANSACTION and advanced our
confirmed_flush_lsn such that we won't re-process the PREPARE
TRANSACTION again. Our restart_lsn might've advanced past it so we
never even decode it, so we can't rely on seeing it at all. It has
access to the xid, gid and invalidations, all of which we have at both
prepare and commit time, to make its decision from. It must have the
same result at prepare and commit time for any given xact. We can
probably use a cache in the reorder buffer to avoid the 2nd call on
commit prepared if we haven't crashed/reconnected between the two.
Good point. I didn’t think about restart_lsn in the case when we are skipping this
particular prepare (filter_prepare() -> true, in my terms). I think that should
work properly as it uses the same code path as before, but I’ll look at it.
I suspect that's going to be fragile in the face of interleaving of
xacts if we crash between prepare and commit prepared. (Apologies if
the below is long or disjointed, it's been a long day but trying to
sort thoughts out).
Consider ("SSU" = "standby status update"):
0/050 xid 1 BEGIN
0/060 xid 1 INSERT ...
0/070 xid 2 BEGIN
0/080 xid 2 INSERT ...
0/090 xid 3 BEGIN
0/095 xid 3 INSERT ...
0/100 xid 3 PREPARE TRANSACTION 'x' => sent to client [y/n]?
SSU: confirmed_flush_lsn = 0/100, restart_lsn 0/050 (if we sent to client)
0/200 xid 2 COMMIT => sent to client
SSU: confirmed_flush_lsn = 0/200, restart_lsn 0/050
0/250 xl_running_xacts logged, xids = [1,3]
[CRASH or disconnect/reconnect]
Restart decoding at 0/050.
skip output of xid 3 PREPARE TRANSACTION @ 0/100: is <= confirmed_flush_lsn
skip output of xid 2 COMMIT @ 0/200: is <= confirmed_flush_lsn
0/300 xid 3 COMMIT PREPARED 'x' => sent to client, since its LSN
is > confirmed_flush_lsn
In the above, our problem is that restart_lsn is held down by some
other xact, so we can't rely on it to tell us if we replayed xid 3 to
the output plugin or not. We can't use confirmed_flush_lsn either,
since it'll advance at xid 2's commit whether or not we replayed xid
3's prepare to the client.
Since xid 3 will still be in xl_running_xacts when prepared, when we
recover SnapBuildProcessChange will return true for its changes and
we'll (re)buffer them, whether or not we landed up sending to the
client at prepare time. Nothing much to be done about that, we'll just
discard them when we process the prepare or the commit prepared,
depending on where we consult our filter callback again.
We MUST ask our filter callback again though, before we test
SnapBuildXactNeedsSkip when processing the PREPARE TRANSACTION again.
Otherwise we'll discard the buffered changes, and if we *didn't* send
them to the client already ... splat.
We can call the filter callback again on xid 3's prepare to find out
"would you have replayed it when we passed it last time". Or we can
call it when we get to the commit instead, to ask "when called last
time at prepare, did you replay or not?" But we have to consult the
callback. By default we'd just skip ReorderBufferCommit processing for
xid 3 entirely, which we'll do via the SnapBuildXactNeedsSkip call in
DecodeCommit when we process the COMMIT PREPARED.
If there was no other running xact when we decoded the PREPARE
TRANSACTION the first time around (i.e. xid 1 and 2 didn't exist in
the above), and if we do send it to the client at prepare time, I
think we can safely advance restart_lsn to the most recent
xl_running_xacts once we get replay confirmation. So we can pretend we
already committed at PREPARE TRANSACTION time for restart purposes if
we output at PREPARE TRANSACTION time, it just doesn't help us with
deciding whether to send the buffer contents at COMMIT PREPARED time
or not.
TL;DR: we can't rely on restart_lsn or confirmed_flush_lsn or
xl_running_xacts, we must ask the filter callback when we (re)decode
the PREPARE TRANSACTION record and/or at COMMIT PREPARED time.
This isn't a big deal. We just have to make sure we consult the filter
callback again when we decode an already-confirmed prepare
transaction, or at commit prepared time if we don't know what its
result was already.
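The example above can be condensed into a small sketch that contrasts the tempting LSN-based guess with asking the filter again. The `Toy*` names are hypothetical and the LSNs are just the numbers from the walkthrough (prepare at 0/100, confirmed_flush_lsn at 0/200); this is not decoding code, only the decision logic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t ToyLsn;

typedef enum ToyCommitAction
{
    TOY_SEND_COMMIT_PREPARED_ONLY,  /* changes already went out at PREPARE */
    TOY_REPLAY_CHANGES_THEN_COMMIT  /* changes were held back until commit */
} ToyCommitAction;

/*
 * The unreliable shortcut: assume the prepare was sent because its LSN is
 * behind confirmed_flush_lsn.  Another xact's commit can advance
 * confirmed_flush_lsn past the prepare whether or not it was sent.
 */
static bool
toy_guess_sent_at_prepare(ToyLsn prepare_lsn, ToyLsn confirmed_flush_lsn)
{
    return prepare_lsn <= confirmed_flush_lsn;
}

/*
 * What the thread argues for: at COMMIT PREPARED, consult the (deterministic)
 * filter result for this xact and act on that instead.
 */
static ToyCommitAction
toy_commit_prepared_action(bool filter_says_decode_at_prepare)
{
    return filter_says_decode_at_prepare
        ? TOY_SEND_COMMIT_PREPARED_ONLY
        : TOY_REPLAY_CHANGES_THEN_COMMIT;
}
```

With the numbers above, the guess claims xid 3 was sent even for a plugin that chose to defer it, which is exactly the case where skipping the buffered changes would lose data.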
This proposal does not provide a way to safely decode a 2pc xact that
made catalog changes which may be aborted while being decoded. The
plugin must lock such an xact so that it can't be aborted while being
processed, or defer decoding until commit prepared. It can use the
invalidations for the commit to decide.
I played with that two-pass catalog scan and it seems to be
working, but after some time I realised that it is not useful for the main
case, when commit/abort is generated after the receiver side has answered the
prepares. Also, that two-pass scan is a massive change in relcache.c and
genam.c (FWIW there were no problems with the cache, but some problems
with index scans and handling one-to-many queries to the catalog, e.g. a table
with its fields).
Yeah, it was the intrusiveness I was concerned about. I don't think we
can even remotely hope to do that for Pg 10.
Finally I decided to throw it away and switched to the filter_prepare callback,
passing the txn structure there to allow access to the has_catalog_changes field.
I think that's how we'll need to go.
Plugins can either defer processing on all 2pc xacts with catalog
changes, or lock the xact. It's not perfect, but it's far from
unreasonable when you consider that plugins would only be locking 2pc
xacts where they expect the result of logical decoding to influence
the commit/abort decision, so we won't be doing a commit/abort until
we finish decoding the prepare anyway.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 27 Mar 2017, at 16:29, Craig Ringer <craig@2ndquadrant.com> wrote:
On 27 March 2017 at 17:53, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
I heavily underestimated the amount of changes there, but I'm almost finished
and will send an updated patch in several hours.
Oh, brilliant! Please post whatever you have before you knock off for
the day anyway, even if it's just a WIP, so I can pick it up tomorrow
my time and poke at its tests etc.
Ok, here it is.
Major differences compared to the previous version:
* GID is stored in commit/abort records only when wal_level >= logical.
* More consistency about storing and parsing origin info. Now it
is stored in prepare and abort records when repsession is active.
* Some cleanup and function renames to get rid of the xact_even/gid fields
in ReorderBuffer, which I used only to copy them to ReorderBufferTXN.
* Changed the output plugin interface to the one that was suggested upthread.
Now prepare/CP/AP are separate callbacks, and if none of them is set
then a 2pc tx will be decoded as 1pc to provide backward compatibility.
* New callback filter_prepare() that can be used to switch between
1pc/2pc style of decoding a 2pc tx.
* test_decoding uses the new API and filters out aborted and running prepared tx.
It is actually easy to move the unlock of 2PCState there to the prepare callback to allow
decoding of a running tx, but since that extension is an example, ISTM it is better not to
hold that lock there during the whole prepare decoding. However, I left
enough information there about this and about the case when those locks are not needed at all
(when we are coordinating this tx).
Talking about locking of a running prepared tx during decode, I think a better solution
would be to use our own custom lock here and register an XACT_EVENT_PRE_ABORT
callback in the extension to conflict with this lock. Decode should hold it in shared mode,
while commit holds it in exclusive mode. That will allow locking granularly and blocking only
the tx that is being decoded.
However, we don’t have XACT_EVENT_PRE_ABORT, but it is only several LOCs to
add it. Should I?
* It actually doesn’t pass one of my regression tests. I’ve added the expected output
as it should be. I’ll try to send a follow-up message with a fix, but right now I'm sending it
as is, as you asked.
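The fallback rule for the new callbacks might look roughly like the sketch below. It is a simplified model, not the patch's code: `ToyCallbacks` and `toy_decode_as_twophase` are hypothetical, the callback signatures are reduced to a GID and a flag, and the sketch assumes that all three of prepare/CP/AP must be set for two-phase decoding (the message only says what happens when none are set).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified callback table for an output plugin. */
typedef struct ToyCallbacks
{
    /* returns true to skip PREPARE-time decoding, i.e. treat the tx as 1pc */
    bool (*filter_prepare_cb)(const char *gid, bool has_catalog_changes);
    void (*prepare_cb)(const char *gid);
    void (*commit_prepared_cb)(const char *gid);
    void (*abort_prepared_cb)(const char *gid);
} ToyCallbacks;

/* Decide whether a prepared tx is decoded in 2pc style or as a plain commit. */
static bool
toy_decode_as_twophase(const ToyCallbacks *cb, const char *gid,
                       bool has_catalog_changes)
{
    /* Assumption: plugins without the new callbacks keep the old 1pc behaviour. */
    if (cb->prepare_cb == NULL ||
        cb->commit_prepared_cb == NULL ||
        cb->abort_prepared_cb == NULL)
        return false;

    /* The filter can still switch an individual tx back to 1pc style. */
    if (cb->filter_prepare_cb != NULL &&
        cb->filter_prepare_cb(gid, has_catalog_changes))
        return false;

    return true;
}

/* Sample callbacks used only to exercise the sketch. */
static void
toy_noop(const char *gid)
{
    (void) gid;
}

static bool
toy_filter_catalog_changes(const char *gid, bool has_catalog_changes)
{
    (void) gid;
    return has_catalog_changes;
}
```

An existing plugin with an all-NULL table keeps seeing prepared transactions as ordinary commits, which is the backward-compatibility property described in the list.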
Attachments:
logical_twophase.diff (application/octet-stream)
diff --git a/contrib/test_decoding/expected/prepared.out b/contrib/test_decoding/expected/prepared.out
index 46e915d..8df7c24 100644
--- a/contrib/test_decoding/expected/prepared.out
+++ b/contrib/test_decoding/expected/prepared.out
@@ -39,7 +39,7 @@ INSERT INTO test_prepared2 VALUES (9);
DROP TABLE test_prepared1;
DROP TABLE test_prepared2;
-- show results
-SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_peek_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
data
-------------------------------------------------------------------------
BEGIN
@@ -66,6 +66,40 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'inc
COMMIT
(22 rows)
+-- same but with twophase decoding
+SELECT data FROM pg_logical_slot_peek_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'twophase-decoding', '1');
+ data
+-------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ PREPARE 'test_prepared#1'
+ COMMIT PREPARED 'test_prepared#1'
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE 'test_prepared#2'
+ ABORT PREPARED 'test_prepared#2'
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4
+ COMMIT
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:5
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
+ PREPARE 'test_prepared#3'
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:7
+ COMMIT
+ COMMIT PREPARED 'test_prepared#3'
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:9
+ COMMIT
+(27 rows)
+
SELECT pg_drop_replication_slot('regression_slot');
pg_drop_replication_slot
--------------------------
diff --git a/contrib/test_decoding/sql/prepared.sql b/contrib/test_decoding/sql/prepared.sql
index e726397..0647efd 100644
--- a/contrib/test_decoding/sql/prepared.sql
+++ b/contrib/test_decoding/sql/prepared.sql
@@ -45,6 +45,9 @@ DROP TABLE test_prepared1;
DROP TABLE test_prepared2;
-- show results
-SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_peek_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
-SELECT pg_drop_replication_slot('regression_slot');
+-- same but with twophase decoding
+SELECT data FROM pg_logical_slot_peek_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'twophase-decoding', '1');
+
+SELECT pg_drop_replication_slot('regression_slot');
\ No newline at end of file
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 21cfd67..6413085 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -24,6 +24,8 @@
#include "replication/message.h"
#include "replication/origin.h"
+#include "storage/procarray.h"
+
#include "utils/builtins.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
@@ -46,6 +48,7 @@ typedef struct
bool skip_empty_xacts;
bool xact_wrote_changes;
bool only_local;
+ bool twophase_decoding;
} TestDecodingData;
static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -68,6 +71,19 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr message_lsn,
bool transactional, const char *prefix,
Size sz, const char *message);
+static bool pg_filter_prepare(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ char *gid);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
void
_PG_init(void)
@@ -85,9 +101,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
cb->begin_cb = pg_decode_begin_txn;
cb->change_cb = pg_decode_change;
cb->commit_cb = pg_decode_commit_txn;
+
cb->filter_by_origin_cb = pg_decode_filter;
cb->shutdown_cb = pg_decode_shutdown;
cb->message_cb = pg_decode_message;
+
+ cb->filter_prepare_cb = pg_filter_prepare;
+ cb->prepare_cb = pg_decode_prepare_txn;
+ cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+ cb->abort_prepared_cb = pg_decode_abort_prepared_txn;
}
@@ -107,6 +129,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
data->include_timestamp = false;
data->skip_empty_xacts = false;
data->only_local = false;
+ data->twophase_decoding = false;
ctx->output_plugin_private = data;
@@ -176,6 +199,17 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
errmsg("could not parse value \"%s\" for parameter \"%s\"",
strVal(elem->arg), elem->defname)));
}
+ else if (strcmp(elem->defname, "twophase-decoding") == 0)
+ {
+
+ if (elem->arg == NULL)
+ data->twophase_decoding = true;
+ else if (!parse_bool(strVal(elem->arg), &data->twophase_decoding))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("could not parse value \"%s\" for parameter \"%s\"",
+ strVal(elem->arg), elem->defname)));
+ }
else
{
ereport(ERROR,
@@ -232,10 +266,142 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
return;
OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfoString(ctx->out, "COMMIT");
+
if (data->include_xids)
- appendStringInfo(ctx->out, "COMMIT %u", txn->xid);
- else
- appendStringInfoString(ctx->out, "COMMIT");
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+
+/* Filter out unnecessary two-phase transactions */
+static bool
+pg_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ char *gid)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ /* treat all transactions as one-phase */
+ if (!data->twophase_decoding)
+ return true;
+
+ /*
+ * Two-phase transactions that accessed catalog require special treatment.
+ *
+ * Right now we don't have a safe way to decode catalog changes made in
+ * prepared transaction that was already aborted by the time of decoding.
+ *
+ * That kind of problem arises only when we are trying to retrospectively
+ * decode aborted transactions. If one wants to code distributed commit
+ * based on prepare decoding then commits/aborts will happen strictly after
+ * decoding is completed, so it is safe to skip any checks/locks here.
+ */
+ if (txn->has_catalog_changes)
+ {
+ LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+
+ if (TransactionIdIsInProgress(txn->xid))
+ {
+ /*
+ * For the sake of simplicity we just ignore in-progress transactions
+ * in this extension, as they may abort during decoding.
+ *
+ * It is possible to move that LWLockRelease() to pg_decode_prepare_txn()
+ * and allow decoding of running prepared tx, but such a lock will prevent
+ * any 2pc transaction commit during decoding time, which can be long
+ * in case of massive changes/inserts in that tx.
+ */
+ LWLockRelease(TwoPhaseStateLock);
+ return true;
+ }
+ else if (TransactionIdDidAbort(txn->xid))
+ {
+ /*
+ * Here we know that it is already aborted and there is not
+ * much sense in doing something with this transaction.
+ * Subsequent ABORT PREPARED will be suppressed.
+ */
+ LWLockRelease(TwoPhaseStateLock);
+ return true;
+ }
+
+ LWLockRelease(TwoPhaseStateLock);
+ }
+
+ return false;
+}
+
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "PREPARE '%s'", txn->gid);
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "COMMIT PREPARED '%s'", txn->gid);
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* ABORT PREPARED callback */
+static void
+pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "ABORT PREPARED '%s'", txn->gid);
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
if (data->include_timestamp)
appendStringInfo(ctx->out, " (at %s)",
diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 735f8c5..ed75503 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -100,8 +100,13 @@ ParseCommitRecord(uint8 info, xl_xact_commit *xlrec, xl_xact_parsed_commit *pars
xl_xact_twophase *xl_twophase = (xl_xact_twophase *) data;
parsed->twophase_xid = xl_twophase->xid;
-
data += sizeof(xl_xact_twophase);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_GID)
+ {
+ strcpy(parsed->twophase_gid, data);
+ data += strlen(parsed->twophase_gid) + 1;
+ }
}
if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
@@ -139,6 +144,16 @@ ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_abort *parsed)
data += sizeof(xl_xact_xinfo);
}
+ if (parsed->xinfo & XACT_XINFO_HAS_DBINFO)
+ {
+ xl_xact_dbinfo *xl_dbinfo = (xl_xact_dbinfo *) data;
+
+ parsed->dbId = xl_dbinfo->dbId;
+ parsed->tsId = xl_dbinfo->tsId;
+
+ data += sizeof(xl_xact_dbinfo);
+ }
+
if (parsed->xinfo & XACT_XINFO_HAS_SUBXACTS)
{
xl_xact_subxacts *xl_subxacts = (xl_xact_subxacts *) data;
@@ -166,8 +181,26 @@ ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_abort *parsed)
xl_xact_twophase *xl_twophase = (xl_xact_twophase *) data;
parsed->twophase_xid = xl_twophase->xid;
-
data += sizeof(xl_xact_twophase);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_GID)
+ {
+ strcpy(parsed->twophase_gid, data);
+ data += strlen(parsed->twophase_gid) + 1;
+ }
+ }
+
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ xl_xact_origin xl_origin;
+
+ /* we're only guaranteed 4 byte alignment, so copy onto stack */
+ memcpy(&xl_origin, data, sizeof(xl_origin));
+
+ parsed->origin_lsn = xl_origin.origin_lsn;
+ parsed->origin_timestamp = xl_origin.origin_timestamp;
+
+ data += sizeof(xl_xact_origin);
}
}
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 83169cc..b58b9a3 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -129,7 +129,6 @@ int max_prepared_xacts = 0;
* Note that the max value of GIDSIZE must fit in the uint16 gidlen,
* specified in TwoPhaseFileHeader.
*/
-#define GIDSIZE 200
typedef struct GlobalTransactionData
{
@@ -187,12 +186,14 @@ static void RecordTransactionCommitPrepared(TransactionId xid,
RelFileNode *rels,
int ninvalmsgs,
SharedInvalidationMessage *invalmsgs,
- bool initfileinval);
+ bool initfileinval,
+ const char *gid);
static void RecordTransactionAbortPrepared(TransactionId xid,
int nchildren,
TransactionId *children,
int nrels,
- RelFileNode *rels);
+ RelFileNode *rels,
+ const char *gid);
static void ProcessRecords(char *bufptr, TransactionId xid,
const TwoPhaseCallback callbacks[]);
static void RemoveGXact(GlobalTransaction gxact);
@@ -854,7 +855,7 @@ TwoPhaseGetDummyProc(TransactionId xid)
/*
* Header for a 2PC state file
*/
-#define TWOPHASE_MAGIC 0x57F94533 /* format identifier */
+#define TWOPHASE_MAGIC 0x57F94534 /* format identifier */
typedef struct TwoPhaseFileHeader
{
@@ -870,6 +871,8 @@ typedef struct TwoPhaseFileHeader
int32 ninvalmsgs; /* number of cache invalidation messages */
bool initfileinval; /* does relcache init file need invalidation? */
uint16 gidlen; /* length of the GID - GID follows the header */
+ XLogRecPtr origin_lsn; /* lsn of this record at origin node */
+ TimestampTz origin_timestamp; /* time of prepare at origin node */
} TwoPhaseFileHeader;
/*
@@ -1021,6 +1024,7 @@ EndPrepare(GlobalTransaction gxact)
{
TwoPhaseFileHeader *hdr;
StateFileChunk *record;
+ bool replorigin;
/* Add the end sentinel to the list of 2PC records */
RegisterTwoPhaseRecord(TWOPHASE_RM_END_ID, 0,
@@ -1031,6 +1035,21 @@ EndPrepare(GlobalTransaction gxact)
Assert(hdr->magic == TWOPHASE_MAGIC);
hdr->total_len = records.total_len + sizeof(pg_crc32c);
+ replorigin = (replorigin_session_origin != InvalidRepOriginId &&
+ replorigin_session_origin != DoNotReplicateId);
+
+ if (replorigin)
+ {
+ Assert(replorigin_session_origin_lsn != InvalidXLogRecPtr);
+ hdr->origin_lsn = replorigin_session_origin_lsn;
+ hdr->origin_timestamp = replorigin_session_origin_timestamp;
+ }
+ else
+ {
+ hdr->origin_lsn = InvalidXLogRecPtr;
+ hdr->origin_timestamp = 0;
+ }
+
/*
* If the data size exceeds MaxAllocSize, we won't be able to read it in
* ReadTwoPhaseFile. Check for that now, rather than fail in the case
@@ -1061,9 +1080,19 @@ EndPrepare(GlobalTransaction gxact)
MyPgXact->delayChkpt = true;
XLogBeginInsert();
+
for (record = records.head; record != NULL; record = record->next)
XLogRegisterData(record->data, record->len);
+
+ XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
+
gxact->prepare_end_lsn = XLogInsert(RM_XACT_ID, XLOG_XACT_PREPARE);
+
+ if (replorigin)
+ /* Move LSNs forward for this replication origin */
+ replorigin_session_advance(replorigin_session_origin_lsn,
+ gxact->prepare_end_lsn);
+
XLogFlush(gxact->prepare_end_lsn);
/* If we crash now, we have prepared: WAL replay will fix things */
@@ -1239,6 +1268,43 @@ ReadTwoPhaseFile(TransactionId xid, bool give_warnings)
return buf;
}
+/*
+ * ParsePrepareRecord
+ */
+void
+ParsePrepareRecord(uint8 info, char *xlrec, xl_xact_parsed_prepare *parsed)
+{
+ TwoPhaseFileHeader *hdr;
+ char *bufptr;
+
+ hdr = (TwoPhaseFileHeader *) xlrec;
+ bufptr = xlrec + MAXALIGN(sizeof(TwoPhaseFileHeader));
+
+ parsed->origin_lsn = hdr->origin_lsn;
+ parsed->origin_timestamp = hdr->origin_timestamp;
+ parsed->twophase_xid = hdr->xid;
+ parsed->dbId = hdr->database;
+ parsed->nsubxacts = hdr->nsubxacts;
+ parsed->ncommitrels = hdr->ncommitrels;
+ parsed->nabortrels = hdr->nabortrels;
+ parsed->nmsgs = hdr->ninvalmsgs;
+
+ strncpy(parsed->twophase_gid, bufptr, hdr->gidlen);
+ bufptr += MAXALIGN(hdr->gidlen);
+
+ parsed->subxacts = (TransactionId *) bufptr;
+ bufptr += MAXALIGN(hdr->nsubxacts * sizeof(TransactionId));
+
+ parsed->commitrels = (RelFileNode *) bufptr;
+ bufptr += MAXALIGN(hdr->ncommitrels * sizeof(RelFileNode));
+
+ parsed->abortrels = (RelFileNode *) bufptr;
+ bufptr += MAXALIGN(hdr->nabortrels * sizeof(RelFileNode));
+
+ parsed->msgs = (SharedInvalidationMessage *) bufptr;
+ bufptr += MAXALIGN(hdr->ninvalmsgs * sizeof(SharedInvalidationMessage));
+}
+
/*
* Reads 2PC data from xlog. During checkpoint this data will be moved to
@@ -1392,11 +1458,12 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
hdr->nsubxacts, children,
hdr->ncommitrels, commitrels,
hdr->ninvalmsgs, invalmsgs,
- hdr->initfileinval);
+ hdr->initfileinval, gid);
else
RecordTransactionAbortPrepared(xid,
hdr->nsubxacts, children,
- hdr->nabortrels, abortrels);
+ hdr->nabortrels, abortrels,
+ gid);
ProcArrayRemove(proc, latestXid);
@@ -2055,7 +2122,8 @@ RecordTransactionCommitPrepared(TransactionId xid,
RelFileNode *rels,
int ninvalmsgs,
SharedInvalidationMessage *invalmsgs,
- bool initfileinval)
+ bool initfileinval,
+ const char *gid)
{
XLogRecPtr recptr;
TimestampTz committs = GetCurrentTimestamp();
@@ -2082,7 +2150,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
ninvalmsgs, invalmsgs,
initfileinval, false,
MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
- xid);
+ xid, gid);
if (replorigin)
@@ -2144,7 +2212,8 @@ RecordTransactionAbortPrepared(TransactionId xid,
int nchildren,
TransactionId *children,
int nrels,
- RelFileNode *rels)
+ RelFileNode *rels,
+ const char *gid)
{
XLogRecPtr recptr;
@@ -2166,7 +2235,7 @@ RecordTransactionAbortPrepared(TransactionId xid,
nchildren, children,
nrels, rels,
MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
- xid);
+ xid, gid);
/* Always flush, since we're about to remove the 2PC state file */
XLogFlush(recptr);
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index c8751c6..9e407d5 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1233,7 +1233,7 @@ RecordTransactionCommit(void)
nmsgs, invalMessages,
RelcacheInitFileInval, forceSyncCommit,
MyXactFlags,
- InvalidTransactionId /* plain commit */ );
+ InvalidTransactionId, NULL /* plain commit */ );
if (replorigin)
/* Move LSNs forward for this replication origin */
@@ -1585,7 +1585,8 @@ RecordTransactionAbort(bool isSubXact)
XactLogAbortRecord(xact_time,
nchildren, children,
nrels, rels,
- MyXactFlags, InvalidTransactionId);
+ MyXactFlags, InvalidTransactionId,
+ NULL);
/*
* Report the latest async abort LSN, so that the WAL writer knows to
@@ -3471,7 +3472,7 @@ BeginTransactionBlock(void)
* resource owner, etc while executing inside a Portal.
*/
bool
-PrepareTransactionBlock(char *gid)
+PrepareTransactionBlock(const char *gid)
{
TransactionState s;
bool result;
@@ -5110,7 +5111,8 @@ XactLogCommitRecord(TimestampTz commit_time,
int nrels, RelFileNode *rels,
int nmsgs, SharedInvalidationMessage *msgs,
bool relcacheInval, bool forceSync,
- int xactflags, TransactionId twophase_xid)
+ int xactflags, TransactionId twophase_xid,
+ const char *twophase_gid)
{
xl_xact_commit xlrec;
xl_xact_xinfo xl_xinfo;
@@ -5122,6 +5124,7 @@ XactLogCommitRecord(TimestampTz commit_time,
xl_xact_origin xl_origin;
uint8 info;
+ int gidlen = 0;
Assert(CritSectionCount > 0);
@@ -5184,6 +5187,13 @@ XactLogCommitRecord(TimestampTz commit_time,
{
xl_xinfo.xinfo |= XACT_XINFO_HAS_TWOPHASE;
xl_twophase.xid = twophase_xid;
+ Assert(twophase_gid != NULL);
+
+ if (XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_GID;
+ gidlen = strlen(twophase_gid) + 1; /* include '\0' */
+ }
}
/* dump transaction origin information */
@@ -5234,8 +5244,13 @@ XactLogCommitRecord(TimestampTz commit_time,
}
if (xl_xinfo.xinfo & XACT_XINFO_HAS_TWOPHASE)
+ {
XLogRegisterData((char *) (&xl_twophase), sizeof(xl_xact_twophase));
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_GID)
+ XLogRegisterData((char *) twophase_gid, gidlen);
+ }
+
if (xl_xinfo.xinfo & XACT_XINFO_HAS_ORIGIN)
XLogRegisterData((char *) (&xl_origin), sizeof(xl_xact_origin));
@@ -5255,15 +5270,19 @@ XLogRecPtr
XactLogAbortRecord(TimestampTz abort_time,
int nsubxacts, TransactionId *subxacts,
int nrels, RelFileNode *rels,
- int xactflags, TransactionId twophase_xid)
+ int xactflags, TransactionId twophase_xid,
+ const char *twophase_gid)
{
xl_xact_abort xlrec;
xl_xact_xinfo xl_xinfo;
xl_xact_subxacts xl_subxacts;
xl_xact_relfilenodes xl_relfilenodes;
xl_xact_twophase xl_twophase;
+ xl_xact_dbinfo xl_dbinfo;
+ xl_xact_origin xl_origin;
uint8 info;
+ int gidlen = 0;
Assert(CritSectionCount > 0);
@@ -5275,7 +5294,6 @@ XactLogAbortRecord(TimestampTz abort_time,
else
info = XLOG_XACT_ABORT_PREPARED;
-
/* First figure out and collect all the information needed */
xlrec.xact_time = abort_time;
@@ -5299,6 +5317,31 @@ XactLogAbortRecord(TimestampTz abort_time,
{
xl_xinfo.xinfo |= XACT_XINFO_HAS_TWOPHASE;
xl_twophase.xid = twophase_xid;
+ Assert(twophase_gid != NULL);
+
+ if (XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_GID;
+ gidlen = strlen(twophase_gid) + 1; /* include '\0' */
+ }
+ }
+
+ if (TransactionIdIsValid(twophase_xid) && XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_DBINFO;
+ xl_dbinfo.dbId = MyDatabaseId;
+ xl_dbinfo.tsId = MyDatabaseTableSpace;
+ }
+
+ /* dump transaction origin information only for abort prepared */
+ if ((replorigin_session_origin != InvalidRepOriginId) &&
+ TransactionIdIsValid(twophase_xid) &&
+ XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_ORIGIN;
+
+ xl_origin.origin_lsn = replorigin_session_origin_lsn;
+ xl_origin.origin_timestamp = replorigin_session_origin_timestamp;
}
if (xl_xinfo.xinfo != 0)
@@ -5313,6 +5356,9 @@ XactLogAbortRecord(TimestampTz abort_time,
if (xl_xinfo.xinfo != 0)
XLogRegisterData((char *) (&xl_xinfo), sizeof(xl_xinfo));
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_DBINFO)
+ XLogRegisterData((char *) (&xl_dbinfo), sizeof(xl_dbinfo));
+
if (xl_xinfo.xinfo & XACT_XINFO_HAS_SUBXACTS)
{
XLogRegisterData((char *) (&xl_subxacts),
@@ -5330,8 +5376,19 @@ XactLogAbortRecord(TimestampTz abort_time,
}
if (xl_xinfo.xinfo & XACT_XINFO_HAS_TWOPHASE)
+ {
XLogRegisterData((char *) (&xl_twophase), sizeof(xl_xact_twophase));
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_GID)
+ XLogRegisterData((char *) twophase_gid, gidlen);
+ }
+
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_ORIGIN)
+ XLogRegisterData((char *) (&xl_origin), sizeof(xl_xact_origin));
+
+ if (TransactionIdIsValid(twophase_xid))
+ XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
+
return XLogInsert(RM_XACT_ID, info);
}
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5c13d26..52e701b 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -34,6 +34,7 @@
#include "access/xlogutils.h"
#include "access/xlogreader.h"
#include "access/xlogrecord.h"
+#include "access/twophase.h"
#include "catalog/pg_control.h"
@@ -71,7 +72,9 @@ static void DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf
static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_commit *parsed, TransactionId xid);
static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
- xl_xact_parsed_abort *parsed, TransactionId xid);
+ xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare *parsed);
/* common function to decode tuples */
static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -277,17 +280,33 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
break;
}
case XLOG_XACT_PREPARE:
+ {
+ xl_xact_parsed_prepare parsed;
- /*
- * Currently decoding ignores PREPARE TRANSACTION and will just
- * decode the transaction when the COMMIT PREPARED is sent or
- * throw away the transaction's contents when a ROLLBACK PREPARED
- * is received. In the future we could add code to expose prepared
- * transactions in the changestream allowing for a kind of
- * distributed 2PC.
- */
- ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
- break;
+ /* check that the output plugin is capable of two-phase decoding */
+ if (!ctx->twophase_hadling)
+ {
+ ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+ break;
+ }
+
+ /* ok, parse it */
+ ParsePrepareRecord(XLogRecGetInfo(buf->record),
+ XLogRecGetData(buf->record), &parsed);
+
+ /* does the output plugin want this particular transaction? */
+ if (ctx->callbacks.filter_prepare_cb &&
+ ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+ parsed.twophase_gid))
+ {
+ ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+ buf->origptr);
+ break;
+ }
+
+ DecodePrepare(ctx, buf, &parsed);
+ break;
+ }
default:
elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
}
@@ -607,9 +626,79 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
buf->origptr, buf->endptr);
}
+ if (TransactionIdIsValid(parsed->twophase_xid) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder, xid))
+ {
+ /*
+ * We are processing COMMIT PREPARED and know that the reorder buffer is
+ * empty, so we can use the shortcut for committing a bare xact.
+ */
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn, parsed->twophase_gid, true);
+ } else {
+ /* replay actions of all transaction + subtransactions in order */
+ ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn);
+ }
+}
+
+
+/*
+ * Decode PREPARE record. Same logic as in COMMIT, but different calls
+ * to SnapshotBuilder, as we need to mark this transaction as committed
+ * instead of running to properly decode it. When the prepared transaction
+ * is decoded we mark it in the snapshot as running again.
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare *parsed)
+{
+ XLogRecPtr origin_lsn = parsed->origin_lsn;
+ TimestampTz commit_time = parsed->origin_timestamp;
+ RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+ int i;
+ TransactionId xid = parsed->twophase_xid;
+
+ /*
+ * Process invalidation messages, even if we're not interested in the
+ * transaction's contents, since the various caches need to always be
+ * consistent.
+ */
+ if (parsed->nmsgs > 0)
+ {
+ ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
+ parsed->nmsgs, parsed->msgs);
+ ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+ }
+
+ SnapBuildPrepareTxnStart(ctx->snapshot_builder, buf->origptr, xid,
+ parsed->nsubxacts, parsed->subxacts);
+
+ if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+ (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+ FilterByOrigin(ctx, origin_id))
+ {
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr);
+ }
+ ReorderBufferForget(ctx->reorder, xid, buf->origptr);
+
+ return;
+ }
+
+ /* tell the reorderbuffer about the surviving subtransactions */
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+ buf->origptr, buf->endptr);
+ }
+
/* replay actions of all transaction + subtransactions in order */
- ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
- commit_time, origin_id, origin_lsn);
+ ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn, parsed->twophase_gid);
+
+ SnapBuildPrepareTxnFinish(ctx->snapshot_builder, xid);
}
/*
@@ -621,6 +710,28 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid)
{
int i;
+ XLogRecPtr origin_lsn = InvalidXLogRecPtr;
+ TimestampTz commit_time = 0;
+ RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ origin_lsn = parsed->origin_lsn;
+ commit_time = parsed->origin_timestamp;
+ }
+
+ /*
+ * If this is ROLLBACK PREPARED, then send it to the callbacks.
+ */
+ if (TransactionIdIsValid(xid) &&
+ parsed->dbId == ctx->slot->data.database &&
+ ReorderBufferTxnIsPrepared(ctx->reorder, xid))
+ {
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid, false);
+ return;
+ }
SnapBuildAbortTxn(ctx->snapshot_builder, buf->record->EndRecPtr, xid,
parsed->nsubxacts, parsed->subxacts);
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 5529ac8..9a66194 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -58,6 +58,14 @@ static void startup_cb_wrapper(LogicalDecodingContext *ctx, OutputPluginOptions
bool is_init);
static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -122,6 +130,7 @@ StartupDecodingContext(List *output_plugin_options,
MemoryContext context,
old_context;
LogicalDecodingContext *ctx;
+ int twophase_callbacks;
/* shorter lines... */
slot = MyReplicationSlot;
@@ -179,8 +188,25 @@ StartupDecodingContext(List *output_plugin_options,
ctx->reorder->begin = begin_cb_wrapper;
ctx->reorder->apply_change = change_cb_wrapper;
ctx->reorder->commit = commit_cb_wrapper;
+ ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+ ctx->reorder->prepare = prepare_cb_wrapper;
+ ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+ ctx->reorder->abort_prepared = abort_prepared_cb_wrapper;
ctx->reorder->message = message_cb_wrapper;
+ /* check that plugin implements all necessary callbacks to perform 2PC */
+ twophase_callbacks = (ctx->callbacks.prepare_cb != NULL) +
+ (ctx->callbacks.commit_prepared_cb != NULL) +
+ (ctx->callbacks.abort_prepared_cb != NULL);
+
+ ctx->twophase_hadling = (twophase_callbacks == 3);
+
+ if (twophase_callbacks != 3 && twophase_callbacks != 0)
+ ereport(WARNING,
+ (errmsg("output plugin registered only %d of 3 two-phase callbacks",
+ twophase_callbacks),
+ errdetail("Two-phase transactions will be decoded as ordinary ones.")));
+
ctx->out = makeStringInfo();
ctx->prepare_write = prepare_write;
ctx->write = do_write;
@@ -650,6 +676,93 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
}
static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "prepare";
+ state.report_location = txn->final_lsn; /* beginning of prepare record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "commit_prepared";
+ state.report_location = txn->final_lsn; /* beginning of commit record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "abort_prepared";
+ state.report_location = txn->final_lsn; /* beginning of abort record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.abort_prepared_cb(ctx, txn, abort_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change)
{
@@ -684,6 +797,34 @@ change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
error_context_stack = errcallback.previous;
}
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn, char *gid)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+ bool ret;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "filter_prepare";
+ state.report_location = InvalidXLogRecPtr;
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = false;
+
+ /* do the actual work: call callback */
+ ret = ctx->callbacks.filter_prepare_cb(ctx, txn, gid);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+ return ret;
+}
+
bool
filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
{
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b437799..2b2027b 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1308,21 +1308,18 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
* the top and subtransactions (using a k-way merge) and replay the changes in
* lsn order.
*/
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+ ReorderBuffer *rb, TransactionId xid,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
TimestampTz commit_time,
RepOriginId origin_id, XLogRecPtr origin_lsn)
{
- ReorderBufferTXN *txn;
volatile Snapshot snapshot_now;
volatile CommandId command_id = FirstCommandId;
bool using_subtxn;
ReorderBufferIterTXNState *volatile iterstate = NULL;
- txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
- false);
-
/* unknown transaction, nothing to replay */
if (txn == NULL)
return;
@@ -1605,8 +1602,11 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
ReorderBufferIterTXNFinish(rb, iterstate);
iterstate = NULL;
- /* call commit callback */
- rb->commit(rb, txn, commit_lsn);
+ /* call commit or prepare callback */
+ if (txn->prepared)
+ rb->prepare(rb, txn, commit_lsn);
+ else
+ rb->commit(rb, txn, commit_lsn);
/* this is just a sanity check against bad output plugin behaviour */
if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -1668,6 +1668,106 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
}
/*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as one-phase later on commit.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+ return rb->filter_prepare(rb, txn, gid);
+}
+
+
+/*
+ * Commit a non-twophase transaction. See the comments on ReorderBufferCommitInternal().
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare twophase transaction. It calls ReorderBufferCommitInternal()
+ * since all transaction changes should be decoded on PREPARE.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ txn->prepared = true;
+ strcpy(txn->gid, gid);
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Check whether this transaction was sent as prepared to receiver.
+ * Called upon commit/abort prepared.
+ */
+bool
+ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ return txn == NULL ? true : txn->prepared;
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ABORT PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+ true);
+
+ txn->final_lsn = commit_lsn;
+ txn->end_lsn = end_lsn;
+ txn->commit_time = commit_time;
+ txn->origin_id = origin_id;
+ txn->origin_lsn = origin_lsn;
+ strcpy(txn->gid, gid);
+
+ if (is_commit)
+ rb->commit_prepared(rb, txn, commit_lsn);
+ else
+ rb->abort_prepared(rb, txn, commit_lsn);
+
+}
+
+/*
* Abort a transaction that possibly has previous changes. Needs to be first
* called for subtransactions and then for the toplevel xid.
*
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 2279604..c1ca998 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -885,7 +885,7 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
/* copy xids that still are interesting to workspace */
for (off = 0; off < builder->committed.xcnt; off++)
{
- if (NormalTransactionIdPrecedes(builder->committed.xip[off],
+ if (TransactionIdPrecedes(builder->committed.xip[off],
builder->xmin))
; /* remove */
else
@@ -1118,6 +1118,52 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
}
}
+/*
+ * Just a wrapper to clarify DecodePrepare().
+ * Right now we can't extract correct historic catalog data that
+ * was produced by an aborted prepared transaction, so it is the job of
+ * the decoding plugin to avoid such a situation; here we just construct
+ * the usual snapshot to be able to decode the prepare.
+ */
+void
+SnapBuildPrepareTxnStart(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
+ int nsubxacts, TransactionId *subxacts)
+{
+ SnapBuildCommitTxn(builder, lsn, xid, nsubxacts, subxacts);
+}
+
+
+/*
+ * When decoding of the prepare is finished we should exclude our xid
+ * from the list of committed xids to have a correct snapshot between prepare
+ * and commit.
+ *
+ * However, this is not strictly needed. A prepared transaction holds locks
+ * between prepare and commit, so nobody can produce a new version of our
+ * catalog tuples. In case of abort we will have this xid in the array of
+ * committed xids, but that also will not cause a problem since the checks of
+ * HeapTupleHeaderXminInvalid() in HeapTupleSatisfiesHistoricMVCC()
+ * have higher priority than the checks for the xip array. Anyway, let's be
+ * consistent about definitions and delete this xid from the xip array.
+ */
+void
+SnapBuildPrepareTxnFinish(SnapBuild *builder, TransactionId xid)
+{
+ TransactionId *search = bsearch(&xid, builder->running.xip,
+ builder->running.xcnt, sizeof(TransactionId), xidComparator);
+
+ if (search == NULL)
+ return;
+
+ /* delete that xid */
+ memmove(search, search + 1,
+ ((builder->running.xip + builder->running.xcnt - 1) - search) * sizeof(TransactionId));
+ builder->running.xcnt--;
+
+ /* update min/max */
+ builder->running.xmin = builder->running.xip[0];
+ builder->running.xmax = builder->running.xip[builder->running.xcnt - 1];
+}
/* -----------------------------------
* Snapshot building functions dealing with xlog records
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index b2b7848..6c0445a 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -15,6 +15,7 @@
#define TWOPHASE_H
#include "access/xlogdefs.h"
+#include "access/xact.h"
#include "datatype/timestamp.h"
#include "storage/lock.h"
@@ -46,6 +47,8 @@ extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
int *nxids_p);
+extern void ParsePrepareRecord(uint8 info, char *xlrec,
+ xl_xact_parsed_prepare *parsed);
extern void StandbyRecoverPreparedTransactions(bool overwriteOK);
extern void RecoverPreparedTransactions(void);
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 5b37c05..e8bf39b 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -21,6 +21,10 @@
#include "storage/sinval.h"
#include "utils/datetime.h"
+/*
+ * Maximum size of Global Transaction ID (including '\0').
+ */
+#define GIDSIZE 200
/*
* Xact isolation levels
@@ -157,6 +161,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
#define XACT_XINFO_HAS_TWOPHASE (1U << 4)
#define XACT_XINFO_HAS_ORIGIN (1U << 5)
#define XACT_XINFO_HAS_AE_LOCKS (1U << 6)
+#define XACT_XINFO_HAS_GID (1U << 7)
/*
* Also stored in xinfo, these indicating a variety of additional actions that
@@ -303,13 +308,40 @@ typedef struct xl_xact_parsed_commit
SharedInvalidationMessage *msgs;
TransactionId twophase_xid; /* only for 2PC */
+ char twophase_gid[GIDSIZE];
XLogRecPtr origin_lsn;
TimestampTz origin_timestamp;
} xl_xact_parsed_commit;
+typedef struct xl_xact_parsed_prepare
+{
+ Oid dbId; /* MyDatabaseId */
+
+ int nsubxacts;
+ TransactionId *subxacts;
+
+ int ncommitrels;
+ RelFileNode *commitrels;
+
+ int nabortrels;
+ RelFileNode *abortrels;
+
+ int nmsgs;
+ SharedInvalidationMessage *msgs;
+
+ TransactionId twophase_xid;
+ char twophase_gid[GIDSIZE];
+
+ XLogRecPtr origin_lsn;
+ TimestampTz origin_timestamp;
+} xl_xact_parsed_prepare;
+
typedef struct xl_xact_parsed_abort
{
+ Oid dbId;
+ Oid tsId;
+
TimestampTz xact_time;
uint32 xinfo;
@@ -320,6 +352,10 @@ typedef struct xl_xact_parsed_abort
RelFileNode *xnodes;
TransactionId twophase_xid; /* only for 2PC */
+ char twophase_gid[GIDSIZE];
+
+ XLogRecPtr origin_lsn;
+ TimestampTz origin_timestamp;
} xl_xact_parsed_abort;
@@ -351,7 +387,7 @@ extern void CommitTransactionCommand(void);
extern void AbortCurrentTransaction(void);
extern void BeginTransactionBlock(void);
extern bool EndTransactionBlock(void);
-extern bool PrepareTransactionBlock(char *gid);
+extern bool PrepareTransactionBlock(const char *gid);
extern void UserAbortTransactionBlock(void);
extern void ReleaseSavepoint(List *options);
extern void DefineSavepoint(char *name);
@@ -385,12 +421,14 @@ extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
int nmsgs, SharedInvalidationMessage *msgs,
bool relcacheInval, bool forceSync,
int xactflags,
- TransactionId twophase_xid);
+ TransactionId twophase_xid, const char *twophase_gid);
extern XLogRecPtr XactLogAbortRecord(TimestampTz abort_time,
int nsubxacts, TransactionId *subxacts,
int nrels, RelFileNode *rels,
- int xactflags, TransactionId twophase_xid);
+ int xactflags, TransactionId twophase_xid,
+ const char *twophase_gid);
+
extern void xact_redo(XLogReaderState *record);
/* xactdesc.c */
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 7d6c88e..7352b07 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -75,6 +75,11 @@ typedef struct LogicalDecodingContext
bool prepared_write;
XLogRecPtr write_location;
TransactionId write_xid;
+
+ /*
+ * Capabilities of decoding plugin used.
+ */
+ bool twophase_hadling;
} LogicalDecodingContext;
@@ -109,5 +114,4 @@ extern void LogicalConfirmReceivedLocation(XLogRecPtr lsn);
extern void LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time);
extern bool filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id);
-
#endif
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 08e962d..be32774 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -68,6 +68,38 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
XLogRecPtr commit_lsn);
/*
+ * Called before decoding of a PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to the prepare
+ * and commit_prepared/abort_prepared callbacks, or should wait until
+ * COMMIT PREPARED and be sent as a usual transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+/*
* Called for the generic logical decoding messages.
*/
typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
@@ -98,6 +130,10 @@ typedef struct OutputPluginCallbacks
LogicalDecodeBeginCB begin_cb;
LogicalDecodeChangeCB change_cb;
LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeFilterPrepareCB filter_prepare_cb;
+ LogicalDecodePrepareCB prepare_cb;
+ LogicalDecodeCommitPreparedCB commit_prepared_cb;
+ LogicalDecodeAbortPreparedCB abort_prepared_cb;
LogicalDecodeMessageCB message_cb;
LogicalDecodeFilterByOriginCB filter_by_origin_cb;
LogicalDecodeShutdownCB shutdown_cb;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 17e47b3..99aa17f 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
#define REORDERBUFFER_H
#include "access/htup_details.h"
+#include "access/twophase.h"
#include "lib/ilist.h"
#include "storage/sinval.h"
#include "utils/hsearch.h"
@@ -144,6 +145,16 @@ typedef struct ReorderBufferTXN
*/
TransactionId xid;
+ /* In case of 2PC we need to pass GID to output plugin */
+ char gid[GIDSIZE];
+
+ /*
+ * By using the filter_prepare() callback we can force decoding to treat
+ * a two-phase transaction as an ordinary one. This flag is set if we
+ * actually called the prepare() callback in the output plugin.
+ */
+ bool prepared;
+
/* did the TX have catalog changes */
bool has_catalog_changes;
@@ -283,6 +294,29 @@ typedef void (*ReorderBufferCommitCB) (
ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+typedef bool (*ReorderBufferFilterPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/* abort prepared callback signature */
+typedef void (*ReorderBufferAbortPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
/* message callback signature */
typedef void (*ReorderBufferMessageCB) (
ReorderBuffer *rb,
@@ -318,6 +352,10 @@ struct ReorderBuffer
ReorderBufferBeginCB begin;
ReorderBufferApplyChangeCB apply_change;
ReorderBufferCommitCB commit;
+ ReorderBufferFilterPrepareCB filter_prepare;
+ ReorderBufferPrepareCB prepare;
+ ReorderBufferCommitPreparedCB commit_prepared;
+ ReorderBufferAbortPreparedCB abort_prepared;
ReorderBufferMessageCB message;
/*
@@ -373,6 +411,11 @@ void ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot
void ReorderBufferCommit(ReorderBuffer *, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit);
void ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
void ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
@@ -396,6 +439,13 @@ void ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
bool ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
bool ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
+bool ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, char *gid);
+bool ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid);
+void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid);
ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
void ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index a8ae631..400ffe1 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -72,6 +72,10 @@ extern bool SnapBuildXactNeedsSkip(SnapBuild *snapstate, XLogRecPtr ptr);
extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
TransactionId xid, int nsubxacts,
TransactionId *subxacts);
+extern void SnapBuildPrepareTxnStart(SnapBuild *builder, XLogRecPtr lsn,
+ TransactionId xid, int nsubxacts,
+ TransactionId *subxacts);
+extern void SnapBuildPrepareTxnFinish(SnapBuild *builder, TransactionId xid);
extern void SnapBuildAbortTxn(SnapBuild *builder, XLogRecPtr lsn,
TransactionId xid, int nsubxacts,
TransactionId *subxacts);
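The removal step in SnapBuildPrepareTxnFinish() above is a binary search over a sorted xid array followed by a memmove() that closes the gap. Below is a minimal standalone sketch of that pattern, not the patch's code: `Xid`, `XidSet`, `xid_cmp` and `xidset_remove` are hypothetical names invented for illustration, `uint32_t` stands in for TransactionId, and a plain numeric comparator is assumed. The sketch also guards the case where the array becomes empty after removal, which is an addition that the hunk above does not contain.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for TransactionId; an assumption made for this sketch only. */
typedef uint32_t Xid;

/* Hypothetical container mirroring the xip/xcnt/xmin/xmax fields. */
typedef struct XidSet
{
	Xid		   *xip;			/* sorted in ascending numeric order */
	size_t		xcnt;			/* number of valid entries in xip */
	Xid			xmin;			/* smallest entry, 0 when empty */
	Xid			xmax;			/* largest entry, 0 when empty */
} XidSet;

/* Plain numeric comparison, suitable for bsearch() on a sorted array. */
static int
xid_cmp(const void *a, const void *b)
{
	Xid			x = *(const Xid *) a;
	Xid			y = *(const Xid *) b;

	return (x > y) - (x < y);
}

/* Remove xid from the sorted set; returns 1 if removed, 0 if absent. */
static int
xidset_remove(XidSet *set, Xid xid)
{
	Xid		   *hit;

	assert(set != NULL);

	if (set->xcnt == 0)
		return 0;

	hit = bsearch(&xid, set->xip, set->xcnt, sizeof(Xid), xid_cmp);
	if (hit == NULL)
		return 0;

	/* shift everything after the hit one slot to the left */
	memmove(hit, hit + 1,
			(size_t) ((set->xip + set->xcnt - 1) - hit) * sizeof(Xid));
	set->xcnt--;

	/* refresh min/max, taking care not to index an empty array */
	if (set->xcnt == 0)
	{
		set->xmin = 0;			/* 0 plays the role of "invalid" here */
		set->xmax = 0;
	}
	else
	{
		set->xmin = set->xip[0];
		set->xmax = set->xip[set->xcnt - 1];
	}
	return 1;
}
```

Calling `xidset_remove()` on a set holding {10, 20, 30} with xid 20 leaves {10, 30}; removing the last remaining entry resets xmin and xmax instead of reading past the array bounds.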
Hi,
On 2017-03-28 00:19:29 +0300, Stas Kelvich wrote:
Ok, here it is.
On a very quick skim, this doesn't seem to solve the issues around
deadlocks of prepared transactions vs. catalog tables. What if the
prepared transaction contains something like LOCK pg_class; (there's a
lot more realistic examples)? Then decoding won't be able to continue,
until that transaction is committed / aborted?
- Andres
On 28 Mar 2017, at 00:19, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
* It actually doesn’t pass one of my regression tests. I’ve added the expected output
as it should be. I’ll try to send a follow-up message with a fix, but right now I’m sending it
as is, as you asked.
Fixed. I forgot to postpone ReorderBufferTxn cleanup in case of prepare.
So it passes the provided regression tests right now.
I’ll give it more testing tomorrow and am going to write a TAP test to check the behaviour
when we lose the info about whether the prepare was sent to the subscriber or not.
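To illustrate why the cleanup has to be postponed, here is a toy model of the intended flow, not the patch's code. If PREPARE is passed on to the plugin, the transaction entry must survive until COMMIT PREPARED or ROLLBACK PREPARED arrives, because only then is it known whether a bare "COMMIT PREPARED" line is enough or the whole transaction still has to be replayed. All names (`ToyTxn`, `ToyFilterPrepare`, `toy_on_prepare`, `toy_on_finish_prepared`, `toy_filter_all`) are hypothetical, and the output strings only imitate the test_decoding format.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define TOY_GIDSIZE 200			/* same value as GIDSIZE in the patch */

/* Hypothetical, heavily reduced stand-in for a reorder buffer entry. */
typedef struct ToyTxn
{
	char		gid[TOY_GIDSIZE];
	bool		prepared;		/* PREPARE was already sent downstream */
	bool		live;			/* entry has not been cleaned up yet */
} ToyTxn;

/* Returns true when the transaction should NOT be decoded as two-phase. */
typedef bool (*ToyFilterPrepare) (const char *gid);

/* Example filter: treat every transaction as one-phase. */
static bool
toy_filter_all(const char *gid)
{
	(void) gid;
	return true;
}

/* PREPARE record seen. The entry stays live in both branches. */
static bool
toy_on_prepare(ToyTxn *txn, const char *gid, ToyFilterPrepare filter,
			   char *out, size_t outlen)
{
	assert(txn->live && outlen > 0);
	out[0] = '\0';

	if (filter != NULL && filter(gid))
		return false;			/* wait for COMMIT PREPARED instead */

	snprintf(txn->gid, sizeof(txn->gid), "%s", gid);
	txn->prepared = true;
	snprintf(out, outlen, "BEGIN; <changes>; PREPARE '%s'", txn->gid);
	return true;
}

/* COMMIT PREPARED / ROLLBACK PREPARED seen. Cleanup happens only here. */
static void
toy_on_finish_prepared(ToyTxn *txn, const char *gid, bool is_commit,
					   char *out, size_t outlen)
{
	assert(txn->live && outlen > 0);
	out[0] = '\0';

	if (txn->prepared)
		snprintf(out, outlen, "%s PREPARED '%s'",
				 is_commit ? "COMMIT" : "ABORT", gid);
	else if (is_commit)
		snprintf(out, outlen, "BEGIN; <changes>; COMMIT");
	/* a filtered transaction that aborts produces no output at all */

	txn->live = false;
}
```

With no filter, a prepare followed by a commit yields the PREPARE line first and a short COMMIT PREPARED line later; with `toy_filter_all`, nothing is emitted at prepare time and the transaction is replayed as an ordinary one at commit.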
Attachments:
logical_twophase.diff (application/octet-stream)
diff --git a/contrib/test_decoding/expected/prepared.out b/contrib/test_decoding/expected/prepared.out
index 46e915d..74f2114 100644
--- a/contrib/test_decoding/expected/prepared.out
+++ b/contrib/test_decoding/expected/prepared.out
@@ -39,7 +39,7 @@ INSERT INTO test_prepared2 VALUES (9);
DROP TABLE test_prepared1;
DROP TABLE test_prepared2;
-- show results
-SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_peek_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
data
-------------------------------------------------------------------------
BEGIN
@@ -66,6 +66,40 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'inc
COMMIT
(22 rows)
+-- same but with twophase decoding
+SELECT data FROM pg_logical_slot_peek_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'twophase-decoding', '1');
+ data
+-------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ PREPARE 'test_prepared#1'
+ COMMIT PREPARED 'test_prepared#1'
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE 'test_prepared#2'
+ ABORT PREPARED 'test_prepared#2'
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4
+ COMMIT
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:5
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
+ PREPARE 'test_prepared#3'
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:7
+ COMMIT
+ COMMIT PREPARED 'test_prepared#3'
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:9
+ COMMIT
+(28 rows)
+
SELECT pg_drop_replication_slot('regression_slot');
pg_drop_replication_slot
--------------------------
diff --git a/contrib/test_decoding/sql/prepared.sql b/contrib/test_decoding/sql/prepared.sql
index e726397..0647efd 100644
--- a/contrib/test_decoding/sql/prepared.sql
+++ b/contrib/test_decoding/sql/prepared.sql
@@ -45,6 +45,9 @@ DROP TABLE test_prepared1;
DROP TABLE test_prepared2;
-- show results
-SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_peek_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
-SELECT pg_drop_replication_slot('regression_slot');
+-- same but with twophase decoding
+SELECT data FROM pg_logical_slot_peek_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'twophase-decoding', '1');
+
+SELECT pg_drop_replication_slot('regression_slot');
\ No newline at end of file
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 21cfd67..6413085 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -24,6 +24,8 @@
#include "replication/message.h"
#include "replication/origin.h"
+#include "storage/procarray.h"
+
#include "utils/builtins.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
@@ -46,6 +48,7 @@ typedef struct
bool skip_empty_xacts;
bool xact_wrote_changes;
bool only_local;
+ bool twophase_decoding;
} TestDecodingData;
static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -68,6 +71,19 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr message_lsn,
bool transactional, const char *prefix,
Size sz, const char *message);
+static bool pg_filter_prepare(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ char *gid);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
void
_PG_init(void)
@@ -85,9 +101,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
cb->begin_cb = pg_decode_begin_txn;
cb->change_cb = pg_decode_change;
cb->commit_cb = pg_decode_commit_txn;
+
cb->filter_by_origin_cb = pg_decode_filter;
cb->shutdown_cb = pg_decode_shutdown;
cb->message_cb = pg_decode_message;
+
+ cb->filter_prepare_cb = pg_filter_prepare;
+ cb->prepare_cb = pg_decode_prepare_txn;
+ cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+ cb->abort_prepared_cb = pg_decode_abort_prepared_txn;
}
@@ -107,6 +129,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
data->include_timestamp = false;
data->skip_empty_xacts = false;
data->only_local = false;
+ data->twophase_decoding = false;
ctx->output_plugin_private = data;
@@ -176,6 +199,17 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
errmsg("could not parse value \"%s\" for parameter \"%s\"",
strVal(elem->arg), elem->defname)));
}
+ else if (strcmp(elem->defname, "twophase-decoding") == 0)
+ {
+
+ if (elem->arg == NULL)
+ data->twophase_decoding = true;
+ else if (!parse_bool(strVal(elem->arg), &data->twophase_decoding))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("could not parse value \"%s\" for parameter \"%s\"",
+ strVal(elem->arg), elem->defname)));
+ }
else
{
ereport(ERROR,
@@ -232,10 +266,142 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
return;
OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfoString(ctx->out, "COMMIT");
+
if (data->include_xids)
- appendStringInfo(ctx->out, "COMMIT %u", txn->xid);
- else
- appendStringInfoString(ctx->out, "COMMIT");
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+
+/* Filter out unnecessary two-phase transactions */
+static bool
+pg_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ char *gid)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ /* treat all transactions as one-phase */
+ if (!data->twophase_decoding)
+ return true;
+
+ /*
+ * Two-phase transactions that accessed catalog require special treatment.
+ *
+ * Right now we don't have a safe way to decode catalog changes made in
+ * a prepared transaction that was already aborted by the time of decoding.
+ *
+ * That kind of problem arises only when we are trying to retrospectively
+ * decode aborted transactions. If one wants to code distributed commit
+ * based on prepare decoding, then commits/aborts will happen strictly after
+ * decoding is completed, so it is safe to skip any checks/locks here.
+ */
+ if (txn->has_catalog_changes)
+ {
+ LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+
+ if (TransactionIdIsInProgress(txn->xid))
+ {
+ /*
+ * For the sake of simplicity we just ignore in-progress transactions
+ * in this extension, as they may abort during decoding.
+ *
+ * It is possible to move that LWLockRelease() to pg_decode_prepare_txn()
+ * and allow decoding of a running prepared tx, but such a lock will prevent
+ * any 2pc transaction commit during decoding time, which can be long
+ * in case of massive changes/inserts in that tx.
+ */
+ LWLockRelease(TwoPhaseStateLock);
+ return true;
+ }
+ else if (TransactionIdDidAbort(txn->xid))
+ {
+ /*
+ * Here we know that it is already aborted and there is not
+ * much sense in doing something with this transaction.
+ * The subsequent ABORT PREPARED will be suppressed.
+ */
+ LWLockRelease(TwoPhaseStateLock);
+ return true;
+ }
+
+ LWLockRelease(TwoPhaseStateLock);
+ }
+
+ return false;
+}
+
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "PREPARE '%s'", txn->gid);
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "COMMIT PREPARED '%s'", txn->gid);
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* ABORT PREPARED callback */
+static void
+pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "ABORT PREPARED '%s'", txn->gid);
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
if (data->include_timestamp)
appendStringInfo(ctx->out, " (at %s)",
diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 735f8c5..ed75503 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -100,8 +100,13 @@ ParseCommitRecord(uint8 info, xl_xact_commit *xlrec, xl_xact_parsed_commit *pars
xl_xact_twophase *xl_twophase = (xl_xact_twophase *) data;
parsed->twophase_xid = xl_twophase->xid;
-
data += sizeof(xl_xact_twophase);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_GID)
+ {
+ strcpy(parsed->twophase_gid, data);
+ data += strlen(parsed->twophase_gid) + 1;
+ }
}
if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
@@ -139,6 +144,16 @@ ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_abort *parsed)
data += sizeof(xl_xact_xinfo);
}
+ if (parsed->xinfo & XACT_XINFO_HAS_DBINFO)
+ {
+ xl_xact_dbinfo *xl_dbinfo = (xl_xact_dbinfo *) data;
+
+ parsed->dbId = xl_dbinfo->dbId;
+ parsed->tsId = xl_dbinfo->tsId;
+
+ data += sizeof(xl_xact_dbinfo);
+ }
+
if (parsed->xinfo & XACT_XINFO_HAS_SUBXACTS)
{
xl_xact_subxacts *xl_subxacts = (xl_xact_subxacts *) data;
@@ -166,8 +181,26 @@ ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_abort *parsed)
xl_xact_twophase *xl_twophase = (xl_xact_twophase *) data;
parsed->twophase_xid = xl_twophase->xid;
-
data += sizeof(xl_xact_twophase);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_GID)
+ {
+ strcpy(parsed->twophase_gid, data);
+ data += strlen(parsed->twophase_gid) + 1;
+ }
+ }
+
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ xl_xact_origin xl_origin;
+
+ /* we're only guaranteed 4 byte alignment, so copy onto stack */
+ memcpy(&xl_origin, data, sizeof(xl_origin));
+
+ parsed->origin_lsn = xl_origin.origin_lsn;
+ parsed->origin_timestamp = xl_origin.origin_timestamp;
+
+ data += sizeof(xl_xact_origin);
}
}
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 83169cc..b58b9a3 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -129,7 +129,6 @@ int max_prepared_xacts = 0;
* Note that the max value of GIDSIZE must fit in the uint16 gidlen,
* specified in TwoPhaseFileHeader.
*/
-#define GIDSIZE 200
typedef struct GlobalTransactionData
{
@@ -187,12 +186,14 @@ static void RecordTransactionCommitPrepared(TransactionId xid,
RelFileNode *rels,
int ninvalmsgs,
SharedInvalidationMessage *invalmsgs,
- bool initfileinval);
+ bool initfileinval,
+ const char *gid);
static void RecordTransactionAbortPrepared(TransactionId xid,
int nchildren,
TransactionId *children,
int nrels,
- RelFileNode *rels);
+ RelFileNode *rels,
+ const char *gid);
static void ProcessRecords(char *bufptr, TransactionId xid,
const TwoPhaseCallback callbacks[]);
static void RemoveGXact(GlobalTransaction gxact);
@@ -854,7 +855,7 @@ TwoPhaseGetDummyProc(TransactionId xid)
/*
* Header for a 2PC state file
*/
-#define TWOPHASE_MAGIC 0x57F94533 /* format identifier */
+#define TWOPHASE_MAGIC 0x57F94534 /* format identifier */
typedef struct TwoPhaseFileHeader
{
@@ -870,6 +871,8 @@ typedef struct TwoPhaseFileHeader
int32 ninvalmsgs; /* number of cache invalidation messages */
bool initfileinval; /* does relcache init file need invalidation? */
uint16 gidlen; /* length of the GID - GID follows the header */
+ XLogRecPtr origin_lsn; /* lsn of this record at origin node */
+ TimestampTz origin_timestamp; /* time of prepare at origin node */
} TwoPhaseFileHeader;
/*
@@ -1021,6 +1024,7 @@ EndPrepare(GlobalTransaction gxact)
{
TwoPhaseFileHeader *hdr;
StateFileChunk *record;
+ bool replorigin;
/* Add the end sentinel to the list of 2PC records */
RegisterTwoPhaseRecord(TWOPHASE_RM_END_ID, 0,
@@ -1031,6 +1035,21 @@ EndPrepare(GlobalTransaction gxact)
Assert(hdr->magic == TWOPHASE_MAGIC);
hdr->total_len = records.total_len + sizeof(pg_crc32c);
+ replorigin = (replorigin_session_origin != InvalidRepOriginId &&
+ replorigin_session_origin != DoNotReplicateId);
+
+ if (replorigin)
+ {
+ Assert(replorigin_session_origin_lsn != InvalidXLogRecPtr);
+ hdr->origin_lsn = replorigin_session_origin_lsn;
+ hdr->origin_timestamp = replorigin_session_origin_timestamp;
+ }
+ else
+ {
+ hdr->origin_lsn = InvalidXLogRecPtr;
+ hdr->origin_timestamp = 0;
+ }
+
/*
* If the data size exceeds MaxAllocSize, we won't be able to read it in
* ReadTwoPhaseFile. Check for that now, rather than fail in the case
@@ -1061,9 +1080,19 @@ EndPrepare(GlobalTransaction gxact)
MyPgXact->delayChkpt = true;
XLogBeginInsert();
+
for (record = records.head; record != NULL; record = record->next)
XLogRegisterData(record->data, record->len);
+
+ XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
+
gxact->prepare_end_lsn = XLogInsert(RM_XACT_ID, XLOG_XACT_PREPARE);
+
+ if (replorigin)
+ /* Move LSNs forward for this replication origin */
+ replorigin_session_advance(replorigin_session_origin_lsn,
+ gxact->prepare_end_lsn);
+
XLogFlush(gxact->prepare_end_lsn);
/* If we crash now, we have prepared: WAL replay will fix things */
@@ -1239,6 +1268,43 @@ ReadTwoPhaseFile(TransactionId xid, bool give_warnings)
return buf;
}
+/*
+ * ParsePrepareRecord
+ */
+void
+ParsePrepareRecord(uint8 info, char *xlrec, xl_xact_parsed_prepare *parsed)
+{
+ TwoPhaseFileHeader *hdr;
+ char *bufptr;
+
+ hdr = (TwoPhaseFileHeader *) xlrec;
+ bufptr = xlrec + MAXALIGN(sizeof(TwoPhaseFileHeader));
+
+ parsed->origin_lsn = hdr->origin_lsn;
+ parsed->origin_timestamp = hdr->origin_timestamp;
+ parsed->twophase_xid = hdr->xid;
+ parsed->dbId = hdr->database;
+ parsed->nsubxacts = hdr->nsubxacts;
+ parsed->ncommitrels = hdr->ncommitrels;
+ parsed->nabortrels = hdr->nabortrels;
+ parsed->nmsgs = hdr->ninvalmsgs;
+
+ strncpy(parsed->twophase_gid, bufptr, hdr->gidlen);
+ bufptr += MAXALIGN(hdr->gidlen);
+
+ parsed->subxacts = (TransactionId *) bufptr;
+ bufptr += MAXALIGN(hdr->nsubxacts * sizeof(TransactionId));
+
+ parsed->commitrels = (RelFileNode *) bufptr;
+ bufptr += MAXALIGN(hdr->ncommitrels * sizeof(RelFileNode));
+
+ parsed->abortrels = (RelFileNode *) bufptr;
+ bufptr += MAXALIGN(hdr->nabortrels * sizeof(RelFileNode));
+
+ parsed->msgs = (SharedInvalidationMessage *) bufptr;
+ bufptr += MAXALIGN(hdr->ninvalmsgs * sizeof(SharedInvalidationMessage));
+}
+
/*
* Reads 2PC data from xlog. During checkpoint this data will be moved to
@@ -1392,11 +1458,12 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
hdr->nsubxacts, children,
hdr->ncommitrels, commitrels,
hdr->ninvalmsgs, invalmsgs,
- hdr->initfileinval);
+ hdr->initfileinval, gid);
else
RecordTransactionAbortPrepared(xid,
hdr->nsubxacts, children,
- hdr->nabortrels, abortrels);
+ hdr->nabortrels, abortrels,
+ gid);
ProcArrayRemove(proc, latestXid);
@@ -2055,7 +2122,8 @@ RecordTransactionCommitPrepared(TransactionId xid,
RelFileNode *rels,
int ninvalmsgs,
SharedInvalidationMessage *invalmsgs,
- bool initfileinval)
+ bool initfileinval,
+ const char *gid)
{
XLogRecPtr recptr;
TimestampTz committs = GetCurrentTimestamp();
@@ -2082,7 +2150,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
ninvalmsgs, invalmsgs,
initfileinval, false,
MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
- xid);
+ xid, gid);
if (replorigin)
@@ -2144,7 +2212,8 @@ RecordTransactionAbortPrepared(TransactionId xid,
int nchildren,
TransactionId *children,
int nrels,
- RelFileNode *rels)
+ RelFileNode *rels,
+ const char *gid)
{
XLogRecPtr recptr;
@@ -2166,7 +2235,7 @@ RecordTransactionAbortPrepared(TransactionId xid,
nchildren, children,
nrels, rels,
MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
- xid);
+ xid, gid);
/* Always flush, since we're about to remove the 2PC state file */
XLogFlush(recptr);
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index c8751c6..9e407d5 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1233,7 +1233,7 @@ RecordTransactionCommit(void)
nmsgs, invalMessages,
RelcacheInitFileInval, forceSyncCommit,
MyXactFlags,
- InvalidTransactionId /* plain commit */ );
+ InvalidTransactionId, NULL /* plain commit */ );
if (replorigin)
/* Move LSNs forward for this replication origin */
@@ -1585,7 +1585,8 @@ RecordTransactionAbort(bool isSubXact)
XactLogAbortRecord(xact_time,
nchildren, children,
nrels, rels,
- MyXactFlags, InvalidTransactionId);
+ MyXactFlags, InvalidTransactionId,
+ NULL);
/*
* Report the latest async abort LSN, so that the WAL writer knows to
@@ -3471,7 +3472,7 @@ BeginTransactionBlock(void)
* resource owner, etc while executing inside a Portal.
*/
bool
-PrepareTransactionBlock(char *gid)
+PrepareTransactionBlock(const char *gid)
{
TransactionState s;
bool result;
@@ -5110,7 +5111,8 @@ XactLogCommitRecord(TimestampTz commit_time,
int nrels, RelFileNode *rels,
int nmsgs, SharedInvalidationMessage *msgs,
bool relcacheInval, bool forceSync,
- int xactflags, TransactionId twophase_xid)
+ int xactflags, TransactionId twophase_xid,
+ const char *twophase_gid)
{
xl_xact_commit xlrec;
xl_xact_xinfo xl_xinfo;
@@ -5122,6 +5124,7 @@ XactLogCommitRecord(TimestampTz commit_time,
xl_xact_origin xl_origin;
uint8 info;
+ int gidlen = 0;
Assert(CritSectionCount > 0);
@@ -5184,6 +5187,13 @@ XactLogCommitRecord(TimestampTz commit_time,
{
xl_xinfo.xinfo |= XACT_XINFO_HAS_TWOPHASE;
xl_twophase.xid = twophase_xid;
+ Assert(twophase_gid != NULL);
+
+ if (XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_GID;
+ gidlen = strlen(twophase_gid) + 1; /* include '\0' */
+ }
}
/* dump transaction origin information */
@@ -5234,8 +5244,13 @@ XactLogCommitRecord(TimestampTz commit_time,
}
if (xl_xinfo.xinfo & XACT_XINFO_HAS_TWOPHASE)
+ {
XLogRegisterData((char *) (&xl_twophase), sizeof(xl_xact_twophase));
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_GID)
+ XLogRegisterData((char *) twophase_gid, gidlen);
+ }
+
if (xl_xinfo.xinfo & XACT_XINFO_HAS_ORIGIN)
XLogRegisterData((char *) (&xl_origin), sizeof(xl_xact_origin));
@@ -5255,15 +5270,19 @@ XLogRecPtr
XactLogAbortRecord(TimestampTz abort_time,
int nsubxacts, TransactionId *subxacts,
int nrels, RelFileNode *rels,
- int xactflags, TransactionId twophase_xid)
+ int xactflags, TransactionId twophase_xid,
+ const char *twophase_gid)
{
xl_xact_abort xlrec;
xl_xact_xinfo xl_xinfo;
xl_xact_subxacts xl_subxacts;
xl_xact_relfilenodes xl_relfilenodes;
xl_xact_twophase xl_twophase;
+ xl_xact_dbinfo xl_dbinfo;
+ xl_xact_origin xl_origin;
uint8 info;
+ int gidlen = 0;
Assert(CritSectionCount > 0);
@@ -5275,7 +5294,6 @@ XactLogAbortRecord(TimestampTz abort_time,
else
info = XLOG_XACT_ABORT_PREPARED;
-
/* First figure out and collect all the information needed */
xlrec.xact_time = abort_time;
@@ -5299,6 +5317,31 @@ XactLogAbortRecord(TimestampTz abort_time,
{
xl_xinfo.xinfo |= XACT_XINFO_HAS_TWOPHASE;
xl_twophase.xid = twophase_xid;
+ Assert(twophase_gid != NULL);
+
+ if (XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_GID;
+ gidlen = strlen(twophase_gid) + 1; /* include '\0' */
+ }
+ }
+
+ if (TransactionIdIsValid(twophase_xid) && XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_DBINFO;
+ xl_dbinfo.dbId = MyDatabaseId;
+ xl_dbinfo.tsId = MyDatabaseTableSpace;
+ }
+
+ /* dump transaction origin information only for abort prepared */
+ if ( (replorigin_session_origin != InvalidRepOriginId) &&
+ TransactionIdIsValid(twophase_xid) &&
+ XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_ORIGIN;
+
+ xl_origin.origin_lsn = replorigin_session_origin_lsn;
+ xl_origin.origin_timestamp = replorigin_session_origin_timestamp;
}
if (xl_xinfo.xinfo != 0)
@@ -5313,6 +5356,9 @@ XactLogAbortRecord(TimestampTz abort_time,
if (xl_xinfo.xinfo != 0)
XLogRegisterData((char *) (&xl_xinfo), sizeof(xl_xinfo));
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_DBINFO)
+ XLogRegisterData((char *) (&xl_dbinfo), sizeof(xl_dbinfo));
+
if (xl_xinfo.xinfo & XACT_XINFO_HAS_SUBXACTS)
{
XLogRegisterData((char *) (&xl_subxacts),
@@ -5330,8 +5376,22 @@ XactLogAbortRecord(TimestampTz abort_time,
}
if (xl_xinfo.xinfo & XACT_XINFO_HAS_TWOPHASE)
+ {
XLogRegisterData((char *) (&xl_twophase), sizeof(xl_xact_twophase));
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_GID)
+ XLogRegisterData((char *) twophase_gid, gidlen);
+ }
+
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_DBINFO)
+ XLogRegisterData((char *) (&xl_dbinfo), sizeof(xl_dbinfo));
+
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_ORIGIN)
+ XLogRegisterData((char *) (&xl_origin), sizeof(xl_xact_origin));
+
+ if (TransactionIdIsValid(twophase_xid))
+ XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
+
return XLogInsert(RM_XACT_ID, info);
}
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5c13d26..b1e39c55 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -34,6 +34,7 @@
#include "access/xlogutils.h"
#include "access/xlogreader.h"
#include "access/xlogrecord.h"
+#include "access/twophase.h"
#include "catalog/pg_control.h"
@@ -71,7 +72,9 @@ static void DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf
static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_commit *parsed, TransactionId xid);
static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
- xl_xact_parsed_abort *parsed, TransactionId xid);
+ xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare *parsed);
/* common function to decode tuples */
static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -277,17 +280,33 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
break;
}
case XLOG_XACT_PREPARE:
+ {
+ xl_xact_parsed_prepare parsed;
- /*
- * Currently decoding ignores PREPARE TRANSACTION and will just
- * decode the transaction when the COMMIT PREPARED is sent or
- * throw away the transaction's contents when a ROLLBACK PREPARED
- * is received. In the future we could add code to expose prepared
- * transactions in the changestream allowing for a kind of
- * distributed 2PC.
- */
- ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
- break;
+ /* check that the output plugin is capable of twophase decoding */
+ if (!ctx->twophase_hadling)
+ {
+ ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+ break;
+ }
+
+ /* ok, parse it */
+ ParsePrepareRecord(XLogRecGetInfo(buf->record),
+ XLogRecGetData(buf->record), &parsed);
+
+ /* does the output plugin want this particular transaction? */
+ if (ctx->callbacks.filter_prepare_cb &&
+ ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+ parsed.twophase_gid))
+ {
+ ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+ buf->origptr);
+ break;
+ }
+
+ DecodePrepare(ctx, buf, &parsed);
+ break;
+ }
default:
elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
}
@@ -551,8 +570,13 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
* Process invalidation messages, even if we're not interested in the
* transaction's contents, since the various caches need to always be
* consistent.
+ *
+ * However, if this transaction was already sent to the prepare callback,
+ * both of these functions were already called during prepare.
*/
- if (parsed->nmsgs > 0)
+ if (parsed->nmsgs > 0 &&
+ !(TransactionIdIsValid(parsed->twophase_xid) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder, xid)))
{
ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
parsed->nmsgs, parsed->msgs);
@@ -607,9 +631,81 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
buf->origptr, buf->endptr);
}
+ if (TransactionIdIsValid(parsed->twophase_xid) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder, xid))
+ {
+ /*
+ * We are processing COMMIT PREPARED and know that the reorder buffer is
+ * empty, so we can use the shortcut for committing a bare xact.
+ */
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn, parsed->twophase_gid, true);
+ }
+ else
+ {
+ /* replay actions of all transaction + subtransactions in order */
+ ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn);
+ }
+}
+
+
+/*
+ * Decode PREPARE record. Same logic as in COMMIT, but with different calls
+ * to SnapshotBuilder, as we need to mark this transaction as committed
+ * instead of running in order to decode it properly. Once the prepared
+ * transaction is decoded we mark it in the snapshot as running again.
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare *parsed)
+{
+ XLogRecPtr origin_lsn = parsed->origin_lsn;
+ TimestampTz commit_time = parsed->origin_timestamp;
+ RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+ int i;
+ TransactionId xid = parsed->twophase_xid;
+
+ /*
+ * Process invalidation messages, even if we're not interested in the
+ * transaction's contents, since the various caches need to always be
+ * consistent.
+ */
+ if (parsed->nmsgs > 0)
+ {
+ ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
+ parsed->nmsgs, parsed->msgs);
+ ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+ }
+
+ SnapBuildPrepareTxnStart(ctx->snapshot_builder, buf->origptr, xid,
+ parsed->nsubxacts, parsed->subxacts);
+
+ if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+ (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+ FilterByOrigin(ctx, origin_id))
+ {
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr);
+ }
+ ReorderBufferForget(ctx->reorder, xid, buf->origptr);
+
+ return;
+ }
+
+ /* tell the reorderbuffer about the surviving subtransactions */
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+ buf->origptr, buf->endptr);
+ }
+
/* replay actions of all transaction + subtransactions in order */
- ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
- commit_time, origin_id, origin_lsn);
+ ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn, parsed->twophase_gid);
+
+ SnapBuildPrepareTxnFinish(ctx->snapshot_builder, xid);
}
/*
@@ -621,6 +717,30 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid)
{
int i;
+ XLogRecPtr origin_lsn = InvalidXLogRecPtr;
+ TimestampTz commit_time = 0;
+ RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ origin_lsn = parsed->origin_lsn;
+ commit_time = parsed->origin_timestamp;
+ }
+
+ /*
+ * If this is ROLLBACK PREPARED, then send it to the callbacks.
+ */
+ if (TransactionIdIsValid(xid) &&
+ !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+ parsed->dbId == ctx->slot->data.database &&
+ !FilterByOrigin(ctx, origin_id) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder, xid))
+ {
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid, false);
+ return;
+ }
SnapBuildAbortTxn(ctx->snapshot_builder, buf->record->EndRecPtr, xid,
parsed->nsubxacts, parsed->subxacts);
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 5529ac8..9a66194 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -58,6 +58,14 @@ static void startup_cb_wrapper(LogicalDecodingContext *ctx, OutputPluginOptions
bool is_init);
static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -122,6 +130,7 @@ StartupDecodingContext(List *output_plugin_options,
MemoryContext context,
old_context;
LogicalDecodingContext *ctx;
+ int twophase_callbacks;
/* shorter lines... */
slot = MyReplicationSlot;
@@ -179,8 +188,25 @@ StartupDecodingContext(List *output_plugin_options,
ctx->reorder->begin = begin_cb_wrapper;
ctx->reorder->apply_change = change_cb_wrapper;
ctx->reorder->commit = commit_cb_wrapper;
+ ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+ ctx->reorder->prepare = prepare_cb_wrapper;
+ ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+ ctx->reorder->abort_prepared = abort_prepared_cb_wrapper;
ctx->reorder->message = message_cb_wrapper;
+ /* check that plugin implements all necessary callbacks to perform 2PC */
+ twophase_callbacks = (ctx->callbacks.prepare_cb != NULL) +
+ (ctx->callbacks.commit_prepared_cb != NULL) +
+ (ctx->callbacks.abort_prepared_cb != NULL);
+
+ ctx->twophase_hadling = (twophase_callbacks == 3);
+
+ if (twophase_callbacks != 3 && twophase_callbacks != 0)
+ ereport(WARNING,
+ (errmsg("Output plugin registered only %d twophase callbacks out of 3. "
+ "Twophase transactions will be decoded as ordinary ones.",
+ twophase_callbacks)));
+
ctx->out = makeStringInfo();
ctx->prepare_write = prepare_write;
ctx->write = do_write;
@@ -650,6 +676,93 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
}
static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "prepare";
+ state.report_location = txn->final_lsn; /* beginning of prepare record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "commit_prepared";
+ state.report_location = txn->final_lsn; /* beginning of commit record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "abort_prepared";
+ state.report_location = txn->final_lsn; /* beginning of abort record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.abort_prepared_cb(ctx, txn, abort_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change)
{
@@ -684,6 +797,34 @@ change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
error_context_stack = errcallback.previous;
}
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn, char *gid)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+ bool ret;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "filter_prepare";
+ state.report_location = InvalidXLogRecPtr;
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = false;
+
+ /* do the actual work: call callback */
+ ret = ctx->callbacks.filter_prepare_cb(ctx, txn, gid);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+ return ret;
+}
+
bool
filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
{
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b437799..f633523 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1308,21 +1308,18 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
* the top and subtransactions (using a k-way merge) and replay the changes in
* lsn order.
*/
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+ ReorderBuffer *rb, TransactionId xid,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
TimestampTz commit_time,
RepOriginId origin_id, XLogRecPtr origin_lsn)
{
- ReorderBufferTXN *txn;
volatile Snapshot snapshot_now;
volatile CommandId command_id = FirstCommandId;
bool using_subtxn;
ReorderBufferIterTXNState *volatile iterstate = NULL;
- txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
- false);
-
/* unknown transaction, nothing to replay */
if (txn == NULL)
return;
@@ -1605,8 +1602,11 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
ReorderBufferIterTXNFinish(rb, iterstate);
iterstate = NULL;
- /* call commit callback */
- rb->commit(rb, txn, commit_lsn);
+ /* call commit or prepare callback */
+ if (txn->prepared)
+ rb->prepare(rb, txn, commit_lsn);
+ else
+ rb->commit(rb, txn, commit_lsn);
/* this is just a sanity check against bad output plugin behaviour */
if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -1633,8 +1633,12 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
if (snapshot_now->copied)
ReorderBufferFreeSnap(rb, snapshot_now);
- /* remove potential on-disk data, and deallocate */
- ReorderBufferCleanupTXN(rb, txn);
+ /*
+ * remove potential on-disk data, and deallocate or postpone that
+ * till the finish of two-phase tx
+ */
+ if (!txn->prepared)
+ ReorderBufferCleanupTXN(rb, txn);
}
PG_CATCH();
{
@@ -1668,6 +1672,111 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
}
/*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as one-phase later on commit.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+ return rb->filter_prepare(rb, txn, gid);
+}
+
+
+/*
+ * Commit non-twophase transaction. See comments to ReorderBufferCommitInternal()
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare twophase transaction. It calls ReorderBufferCommitInternal()
+ * since all transaction changes should be decoded on PREPARE.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ txn->prepared = true;
+ strcpy(txn->gid, gid);
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Check whether this transaction was sent as prepared to receiver.
+ * Called upon commit/abort prepared.
+ */
+bool
+ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /*
+ * If txn == NULL then presumably the subscriber confirmed the prepare
+ * but we have since been restarted.
+ */
+ return txn == NULL ? true : txn->prepared;
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ABORT PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+ true);
+
+ txn->final_lsn = commit_lsn;
+ txn->end_lsn = end_lsn;
+ txn->commit_time = commit_time;
+ txn->origin_id = origin_id;
+ txn->origin_lsn = origin_lsn;
+ strcpy(txn->gid, gid);
+
+ if (is_commit)
+ rb->commit_prepared(rb, txn, commit_lsn);
+ else
+ rb->abort_prepared(rb, txn, commit_lsn);
+
+ ReorderBufferCleanupTXN(rb, txn);
+}
+
+/*
* Abort a transaction that possibly has previous changes. Needs to be first
* called for subtransactions and then for the toplevel xid.
*
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 2279604..c1ca998 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -885,7 +885,7 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
/* copy xids that still are interesting to workspace */
for (off = 0; off < builder->committed.xcnt; off++)
{
- if (NormalTransactionIdPrecedes(builder->committed.xip[off],
+ if (TransactionIdPrecedes(builder->committed.xip[off],
builder->xmin))
; /* remove */
else
@@ -1118,6 +1118,52 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
}
}
+/*
+ * Just a wrapper to clarify DecodePrepare().
+ * Right now we can't extract correct historic catalog data that
+ * was produced by an aborted prepared transaction, so it is the job of the
+ * decoding plugin to avoid such situations; here we just construct a usual
+ * snapshot to be able to decode the prepare.
+ */
+void
+SnapBuildPrepareTxnStart(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
+ int nsubxacts, TransactionId *subxacts)
+{
+ SnapBuildCommitTxn(builder, lsn, xid, nsubxacts, subxacts);
+}
+
+
+/*
+ * When decoding of the prepare is finished we should exclude our xid
+ * from the list of committed xids to have a correct snapshot between prepare
+ * and commit.
+ *
+ * However, this is not strictly needed. A prepared transaction holds locks
+ * between prepare and commit so nobody can produce a new version of our
+ * catalog tuples. In case of abort we will have this xid in the array of
+ * committed xids, but that also will not cause a problem since the checks of
+ * HeapTupleHeaderXminInvalid() in HeapTupleSatisfiesHistoricMVCC()
+ * have higher priority than the checks for the xip array. Anyway, let's be
+ * consistent about definitions and delete this xid from the xip array.
+ */
+void
+SnapBuildPrepareTxnFinish(SnapBuild *builder, TransactionId xid)
+{
+ TransactionId *search = bsearch(&xid, builder->running.xip,
+ builder->running.xcnt, sizeof(TransactionId), xidComparator);
+
+ if (search == NULL)
+ return;
+
+ /* delete that xid */
+ memmove(search, search + 1,
+ ((builder->running.xip + builder->running.xcnt - 1) - search) * sizeof(TransactionId));
+ builder->running.xcnt--;
+
+ /* update min/max */
+ builder->running.xmin = builder->running.xip[0];
+ builder->running.xmax = builder->running.xip[builder->running.xcnt - 1];
+}
/* -----------------------------------
* Snapshot building functions dealing with xlog records
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index b2b7848..6c0445a 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -15,6 +15,7 @@
#define TWOPHASE_H
#include "access/xlogdefs.h"
+#include "access/xact.h"
#include "datatype/timestamp.h"
#include "storage/lock.h"
@@ -46,6 +47,8 @@ extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
int *nxids_p);
+extern void ParsePrepareRecord(uint8 info, char *xlrec,
+ xl_xact_parsed_prepare *parsed);
extern void StandbyRecoverPreparedTransactions(bool overwriteOK);
extern void RecoverPreparedTransactions(void);
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 5b37c05..e8bf39b 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -21,6 +21,10 @@
#include "storage/sinval.h"
#include "utils/datetime.h"
+/*
+ * Maximum size of Global Transaction ID (including '\0').
+ */
+#define GIDSIZE 200
/*
* Xact isolation levels
@@ -157,6 +161,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
#define XACT_XINFO_HAS_TWOPHASE (1U << 4)
#define XACT_XINFO_HAS_ORIGIN (1U << 5)
#define XACT_XINFO_HAS_AE_LOCKS (1U << 6)
+#define XACT_XINFO_HAS_GID (1U << 7)
/*
* Also stored in xinfo, these indicating a variety of additional actions that
@@ -303,13 +308,40 @@ typedef struct xl_xact_parsed_commit
SharedInvalidationMessage *msgs;
TransactionId twophase_xid; /* only for 2PC */
+ char twophase_gid[GIDSIZE];
XLogRecPtr origin_lsn;
TimestampTz origin_timestamp;
} xl_xact_parsed_commit;
+typedef struct xl_xact_parsed_prepare
+{
+ Oid dbId; /* MyDatabaseId */
+
+ int nsubxacts;
+ TransactionId *subxacts;
+
+ int ncommitrels;
+ RelFileNode *commitrels;
+
+ int nabortrels;
+ RelFileNode *abortrels;
+
+ int nmsgs;
+ SharedInvalidationMessage *msgs;
+
+ TransactionId twophase_xid;
+ char twophase_gid[GIDSIZE];
+
+ XLogRecPtr origin_lsn;
+ TimestampTz origin_timestamp;
+} xl_xact_parsed_prepare;
+
typedef struct xl_xact_parsed_abort
{
+ Oid dbId;
+ Oid tsId;
+
TimestampTz xact_time;
uint32 xinfo;
@@ -320,6 +352,10 @@ typedef struct xl_xact_parsed_abort
RelFileNode *xnodes;
TransactionId twophase_xid; /* only for 2PC */
+ char twophase_gid[GIDSIZE];
+
+ XLogRecPtr origin_lsn;
+ TimestampTz origin_timestamp;
} xl_xact_parsed_abort;
@@ -351,7 +387,7 @@ extern void CommitTransactionCommand(void);
extern void AbortCurrentTransaction(void);
extern void BeginTransactionBlock(void);
extern bool EndTransactionBlock(void);
-extern bool PrepareTransactionBlock(char *gid);
+extern bool PrepareTransactionBlock(const char *gid);
extern void UserAbortTransactionBlock(void);
extern void ReleaseSavepoint(List *options);
extern void DefineSavepoint(char *name);
@@ -385,12 +421,14 @@ extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
int nmsgs, SharedInvalidationMessage *msgs,
bool relcacheInval, bool forceSync,
int xactflags,
- TransactionId twophase_xid);
+ TransactionId twophase_xid, const char *twophase_gid);
extern XLogRecPtr XactLogAbortRecord(TimestampTz abort_time,
int nsubxacts, TransactionId *subxacts,
int nrels, RelFileNode *rels,
- int xactflags, TransactionId twophase_xid);
+ int xactflags, TransactionId twophase_xid,
+ const char *twophase_gid);
+
extern void xact_redo(XLogReaderState *record);
/* xactdesc.c */
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 7d6c88e..7352b07 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -75,6 +75,11 @@ typedef struct LogicalDecodingContext
bool prepared_write;
XLogRecPtr write_location;
TransactionId write_xid;
+
+ /*
+ * Capabilities of decoding plugin used.
+ */
+ bool twophase_hadling;
} LogicalDecodingContext;
@@ -109,5 +114,4 @@ extern void LogicalConfirmReceivedLocation(XLogRecPtr lsn);
extern void LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time);
extern bool filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id);
-
#endif
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 08e962d..be32774 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -68,6 +68,38 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
XLogRecPtr commit_lsn);
/*
+ * Called before decoding of a PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to the prepare
+ * and commit_prepared/abort_prepared callbacks, or whether to wait until
+ * COMMIT PREPARED and send it as a usual transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+/*
* Called for the generic logical decoding messages.
*/
typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
@@ -98,6 +130,10 @@ typedef struct OutputPluginCallbacks
LogicalDecodeBeginCB begin_cb;
LogicalDecodeChangeCB change_cb;
LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeFilterPrepareCB filter_prepare_cb;
+ LogicalDecodePrepareCB prepare_cb;
+ LogicalDecodeCommitPreparedCB commit_prepared_cb;
+ LogicalDecodeAbortPreparedCB abort_prepared_cb;
LogicalDecodeMessageCB message_cb;
LogicalDecodeFilterByOriginCB filter_by_origin_cb;
LogicalDecodeShutdownCB shutdown_cb;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 17e47b3..99aa17f 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
#define REORDERBUFFER_H
#include "access/htup_details.h"
+#include "access/twophase.h"
#include "lib/ilist.h"
#include "storage/sinval.h"
#include "utils/hsearch.h"
@@ -144,6 +145,16 @@ typedef struct ReorderBufferTXN
*/
TransactionId xid;
+ /* In case of 2PC we need to pass GID to output plugin */
+ char gid[GIDSIZE];
+
+ /*
+ * By using the filter_prepare() callback we can force decoding to treat a
+ * two-phase transaction as an ordinary one. This flag is set if we have
+ * actually called the prepare() callback in the output plugin.
+ */
+ bool prepared;
+
/* did the TX have catalog changes */
bool has_catalog_changes;
@@ -283,6 +294,29 @@ typedef void (*ReorderBufferCommitCB) (
ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+typedef bool (*ReorderBufferFilterPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/* abort prepared callback signature */
+typedef void (*ReorderBufferAbortPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
/* message callback signature */
typedef void (*ReorderBufferMessageCB) (
ReorderBuffer *rb,
@@ -318,6 +352,10 @@ struct ReorderBuffer
ReorderBufferBeginCB begin;
ReorderBufferApplyChangeCB apply_change;
ReorderBufferCommitCB commit;
+ ReorderBufferFilterPrepareCB filter_prepare;
+ ReorderBufferPrepareCB prepare;
+ ReorderBufferCommitPreparedCB commit_prepared;
+ ReorderBufferAbortPreparedCB abort_prepared;
ReorderBufferMessageCB message;
/*
@@ -373,6 +411,11 @@ void ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot
void ReorderBufferCommit(ReorderBuffer *, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit);
void ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
void ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
@@ -396,6 +439,13 @@ void ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
bool ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
bool ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
+bool ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, char *gid);
+bool ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid);
+void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid);
ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
void ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index a8ae631..400ffe1 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -72,6 +72,10 @@ extern bool SnapBuildXactNeedsSkip(SnapBuild *snapstate, XLogRecPtr ptr);
extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
TransactionId xid, int nsubxacts,
TransactionId *subxacts);
+extern void SnapBuildPrepareTxnStart(SnapBuild *builder, XLogRecPtr lsn,
+ TransactionId xid, int nsubxacts,
+ TransactionId *subxacts);
+extern void SnapBuildPrepareTxnFinish(SnapBuild *builder, TransactionId xid);
extern void SnapBuildAbortTxn(SnapBuild *builder, XLogRecPtr lsn,
TransactionId xid, int nsubxacts,
TransactionId *subxacts);
On 28 March 2017 at 05:25, Andres Freund <andres@anarazel.de> wrote:
On a very quick skim, this doesn't seem to solve the issues around
deadlocks of prepared transactions vs. catalog tables. What if the
prepared transaction contains something like LOCK pg_class; (there's a
lot more realistic examples)? Then decoding won't be able to continue,
until that transaction is committed / aborted?
Yeah, that's a problem and one we discussed in the past, though I lost
track of it in amongst the recent work.
I'm currently writing a few TAP tests intended to check this sort of
thing, mixed DDL/DML, overlapping xacts, interleaved prepared xacts,
etc. If they highlight problems they'll be useful for the next
iteration of this patch anyway.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 28 March 2017 at 08:50, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
On 28 Mar 2017, at 00:19, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
* It actually doesn’t pass one of my regression tests. I’ve added the expected output
as it should be. I’ll try to send a follow-up message with a fix, but right now I’m sending it
as is, as you asked.
Fixed. I forgot to postpone ReorderBufferTxn cleanup in case of prepare.
So it passes the provided regression tests right now.
I’ll give it more testing tomorrow and am going to write a TAP test to check the behaviour
when we lose the info on whether the prepare was sent to the subscriber or not.
Great, thanks. I'll try to have some TAP tests ready.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 28 Mar 2017, at 00:25, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2017-03-28 00:19:29 +0300, Stas Kelvich wrote:
Ok, here it is.
On a very quick skim, this doesn't seem to solve the issues around
deadlocks of prepared transactions vs. catalog tables. What if the
prepared transaction contains something like LOCK pg_class; (there's a
lot more realistic examples)? Then decoding won't be able to continue,
until that transaction is committed / aborted?
But why is that a deadlock? It seems to be just a lock.
With a prepared lock on pg_class, decoding will wait until the transaction is committed and
then continue to decode. The same goes for anything in postgres that accesses pg_class,
including the inability to connect to the database, and bricking the database if you accidentally
disconnect before committing that tx (as you showed me some while ago :-).
IMO it is an issue of being able to prepare such a lock, rather than of decoding.
Are there any other scenarios where catalog readers are blocked, except an explicit lock
on a catalog table? Alters on catalogs seem to be prohibited.
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 2017-03-28 04:12:41 +0300, Stas Kelvich wrote:
On 28 Mar 2017, at 00:25, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2017-03-28 00:19:29 +0300, Stas Kelvich wrote:
Ok, here it is.
On a very quick skim, this doesn't seem to solve the issues around
deadlocks of prepared transactions vs. catalog tables. What if the
prepared transaction contains something like LOCK pg_class; (there's a
lot more realistic examples)? Then decoding won't be able to continue,
until that transaction is committed / aborted?
But why is that a deadlock? It seems to be just a lock.
If you actually need separate decoding of 2PC, then you want to wait for
the PREPARE to be replicated. If that replication has to wait for the
to-be-replicated prepared transaction to commit prepared, and commit
prepare will only happen once replication happened...
Are there any other scenarios where catalog readers are blocked, except an explicit lock
on a catalog table? Alters on catalogs seem to be prohibited.
VACUUM FULL on catalog tables (but that can't happen in xact => 2pc)
CLUSTER on catalog tables (can happen in xact)
ALTER on tables modified in the same transaction (even of non catalog
tables!), because a lot of routines will do a heap_open() to get the
tupledesc etc.
Greetings,
Andres Freund
On 28 March 2017 at 02:25, Andres Freund <andres@anarazel.de> wrote:
On 2017-03-28 04:12:41 +0300, Stas Kelvich wrote:
On 28 Mar 2017, at 00:25, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2017-03-28 00:19:29 +0300, Stas Kelvich wrote:
Ok, here it is.
On a very quick skim, this doesn't seem to solve the issues around
deadlocks of prepared transactions vs. catalog tables. What if the
prepared transaction contains something like LOCK pg_class; (there's a
lot more realistic examples)? Then decoding won't be able to continue,
until that transaction is committed / aborted?
But why is that a deadlock? It seems to be just a lock.
If you actually need separate decoding of 2PC, then you want to wait for
the PREPARE to be replicated. If that replication has to wait for the
to-be-replicated prepared transaction to commit prepared, and commit
prepare will only happen once replication happened...
Surely that's up to the decoding plugin?
If the plugin takes locks it had better make sure it can get the locks
or timeout. But that's true of any resource the plugin needs access to
and can't obtain when needed.
This issue could occur now if the transaction took a session lock on a
catalog table.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2017-03-28 03:30:28 +0100, Simon Riggs wrote:
On 28 March 2017 at 02:25, Andres Freund <andres@anarazel.de> wrote:
On 2017-03-28 04:12:41 +0300, Stas Kelvich wrote:
On 28 Mar 2017, at 00:25, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2017-03-28 00:19:29 +0300, Stas Kelvich wrote:
Ok, here it is.
On a very quick skim, this doesn't seem to solve the issues around
deadlocks of prepared transactions vs. catalog tables. What if the
prepared transaction contains something like LOCK pg_class; (there's a
lot more realistic examples)? Then decoding won't be able to continue,
until that transaction is committed / aborted?
But why is that a deadlock? It seems to be just a lock.
If you actually need separate decoding of 2PC, then you want to wait for
the PREPARE to be replicated. If that replication has to wait for the
to-be-replicated prepared transaction to commit prepared, and commit
prepare will only happen once replication happened...
Surely that's up to the decoding plugin?
It can't do much about it, so not really. A lot of the functions
dealing with datatypes (temporarily) lock relations. Both the actual
user tables, and system catalog tables (cache lookups...).
If the plugin takes locks it had better make sure it can get the locks
or timeout. But that's true of any resource the plugin needs access to
and can't obtain when needed.
This issue could occur now if the transaction took a session lock on a
catalog table.
That's not a self deadlock, and we don't do session locks outside
of operations like CIC?
Greetings,
Andres Freund
On 28 March 2017 at 09:25, Andres Freund <andres@anarazel.de> wrote:
If you actually need separate decoding of 2PC, then you want to wait for
the PREPARE to be replicated. If that replication has to wait for the
to-be-replicated prepared transaction to commit prepared, and commit
prepare will only happen once replication happened...
In other words, the output plugin cannot decode a transaction at
PREPARE TRANSACTION time if that xact holds an AccessExclusiveLock on
a catalog relation we must be able to read in order to decode the
xact.
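The circular wait described above can be sketched as a tiny wait-for graph.
This is only an illustrative model, not PostgreSQL code: the three parties
(DECODER, PREPARED_XACT, COORDINATOR) and the helper names are hypothetical,
and it assumes the coordinator refuses to issue COMMIT PREPARED until the
PREPARE has been decoded and delivered.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical parties in the scenario; not PostgreSQL identifiers. */
enum { DECODER, PREPARED_XACT, COORDINATOR, NPARTIES };

/* Depth-first search: can we get from 'from' to 'to' along wait-for edges? */
static bool
reaches(bool g[NPARTIES][NPARTIES], int from, int to, bool seen[NPARTIES])
{
    if (seen[from])
        return false;
    seen[from] = true;

    for (int next = 0; next < NPARTIES; next++)
    {
        if (!g[from][next])
            continue;
        if (next == to || reaches(g, next, to, seen))
            return true;
    }
    return false;
}

/*
 * g[a][b] is true when party a cannot proceed until party b does.
 * A party is deadlocked when it (transitively) waits for itself.
 */
bool
is_deadlocked(bool g[NPARTIES][NPARTIES], int party)
{
    bool seen[NPARTIES] = {false};

    assert(party >= 0 && party < NPARTIES);
    return reaches(g, party, party, seen);
}
```

With the edges DECODER -> PREPARED_XACT (needs the catalog lock) and
PREPARED_XACT -> COORDINATOR (lock released only at COMMIT/ROLLBACK PREPARED)
this is a plain lock wait. Adding COORDINATOR -> DECODER (commit decision waits
for the decoded PREPARE) closes the cycle.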
Are there any other scenarios where catalog readers are blocked, except an explicit lock
on a catalog table? Alters on catalogs seem to be prohibited.
VACUUM FULL on catalog tables (but that can't happen in xact => 2pc)
CLUSTER on catalog tables (can happen in xact)
ALTER on tables modified in the same transaction (even of non catalog
tables!), because a lot of routines will do a heap_open() to get the
tupledesc etc.
Right, and the latter one is the main issue, since it's by far the
most likely and hard to just work around.
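As a rough illustration of why the ALTER case bites, here is a cut-down model
of the table-level lock conflict rules, restricted to the three modes that
matter here. It is a sketch, not the server's lock manager: the names are made
up, and it assumes the decoding-side heap_open() asks for AccessShareLock
while the prepared xact still holds the AccessExclusiveLock from its ALTER.

```c
#include <assert.h>
#include <stdbool.h>

/* Only the three modes relevant to this discussion; names are illustrative. */
typedef enum
{
    ACCESS_SHARE,       /* taken by plain readers */
    ROW_EXCLUSIVE,      /* taken by INSERT/UPDATE/DELETE */
    ACCESS_EXCLUSIVE,   /* taken by e.g. ALTER TABLE ... ADD COLUMN */
    NMODES
} LockMode;

/* conflicts[requested][held]: true if the request must wait. */
static const bool conflicts[NMODES][NMODES] = {
    /*                      AS     RE     AE  */
    /* ACCESS_SHARE     */ {false, false, true},
    /* ROW_EXCLUSIVE    */ {false, false, true},
    /* ACCESS_EXCLUSIVE */ {true,  true,  true},
};

bool
lock_would_block(LockMode requested, LockMode held)
{
    assert(requested < NMODES && held < NMODES);
    return conflicts[requested][held];
}
```

So a reader is fine alongside the RowExclusiveLock left by the prepared xact's
INSERTs, but blocks behind the AccessExclusiveLock left by its ALTER.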
The tests Stas has in place aren't sufficient to cover this, as they
decode only after everything has committed. I'm expanding the
pg_regress coverage to do decoding between prepare and commit (when we
actually care) first, and will add some tests involving strong locks.
I've found one bug where it doesn't decode a 2pc xact at prepare or
commit time, even without restart or strong lock issues. Pretty sure
it's due to assumptions made about the filter callback.
The current code as used by test_decoding won't work correctly. If
txn->has_catalog_changes and if it's still in-progress, the filter
skips decoding at PREPARE time. But it isn't then decoded at COMMIT
PREPARED time either, if we processed past the PREPARE TRANSACTION.
Bug.
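To make the intended behaviour concrete, here is a standalone model of the
decision made by pg_filter_prepare() in the attached test_decoding changes,
plus what COMMIT PREPARED then has to do. It is a simplified sketch: the
struct, enum and function names are hypothetical, and the booleans stand in
for the real checks (txn->has_catalog_changes, TransactionIdIsInProgress(),
TransactionIdDidAbort()).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flattened view of what the filter callback looks at. */
typedef struct
{
    bool twophase_decoding;           /* option 'twophase-decoding' */
    bool decode_with_catalog_changes; /* 'twophase-decode-with-catalog-changes' */
    bool has_catalog_changes;         /* xact touched the catalogs */
    bool in_progress;                 /* still prepared, not yet finished */
    bool did_abort;                   /* already rolled back */
} FilterInput;

/* Returns true when decoding at PREPARE time should be skipped. */
bool
filter_prepare_model(const FilterInput *in)
{
    assert(in != NULL);

    if (!in->twophase_decoding)
        return true;            /* treat everything as one-phase */

    if (in->has_catalog_changes)
    {
        if (in->in_progress)
            return !in->decode_with_catalog_changes;
        if (in->did_abort)
            return true;
    }
    return false;
}

typedef enum
{
    EMIT_COMMIT_PREPARED_ONLY,  /* changes already went out at PREPARE */
    EMIT_WHOLE_XACT             /* nothing sent yet: replay all changes */
} CommitAction;

/* If PREPARE was skipped, COMMIT PREPARED must decode the whole xact. */
CommitAction
commit_prepared_action(bool skipped_at_prepare)
{
    return skipped_at_prepare ? EMIT_WHOLE_XACT : EMIT_COMMIT_PREPARED_ONLY;
}
```

The bug is the case where the filter returns true at PREPARE time but the
EMIT_WHOLE_XACT path is never taken at COMMIT PREPARED, so the xact is lost.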
Also, by skipping decoding of 2pc xacts with catalog changes in this
test we also hide the locking issues.
However, even once I add an option to force decoding of 2pc xacts with
catalog changes to test_decoding, I cannot reproduce the expected
locking issues so far. See tests in attached updated version, in
contrib/test_decoding/sql/prepare.sql .
Haven't done any TAP tests yet, since the pg_regress tests are so far
sufficient to turn up issues.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
logical_twophase_v4.patch (text/x-patch; charset=US-ASCII)
From e23909a9929b561e011f41891825bfb5b1ecb1b3 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Tue, 28 Mar 2017 10:51:48 +0800
Subject: [PATCH] Logical decoding of two-phase commit transactions at PREPARE
time
---
contrib/test_decoding/expected/prepared.out | 253 ++++++++++++++++++++----
contrib/test_decoding/sql/prepared.sql | 82 +++++++-
contrib/test_decoding/test_decoding.c | 205 ++++++++++++++++++-
src/backend/access/rmgrdesc/xactdesc.c | 37 +++-
src/backend/access/transam/twophase.c | 89 ++++++++-
src/backend/access/transam/xact.c | 72 ++++++-
src/backend/replication/logical/decode.c | 148 ++++++++++++--
src/backend/replication/logical/logical.c | 141 +++++++++++++
src/backend/replication/logical/reorderbuffer.c | 129 +++++++++++-
src/backend/replication/logical/snapbuild.c | 48 ++++-
src/include/access/twophase.h | 3 +
src/include/access/xact.h | 44 ++++-
src/include/replication/logical.h | 6 +-
src/include/replication/output_plugin.h | 36 ++++
src/include/replication/reorderbuffer.h | 50 +++++
src/include/replication/snapbuild.h | 4 +
16 files changed, 1254 insertions(+), 93 deletions(-)
diff --git a/contrib/test_decoding/expected/prepared.out b/contrib/test_decoding/expected/prepared.out
index 46e915d..56c6e72 100644
--- a/contrib/test_decoding/expected/prepared.out
+++ b/contrib/test_decoding/expected/prepared.out
@@ -6,19 +6,84 @@ SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_d
init
(1 row)
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc', 'test_decoding');
+ ?column?
+----------
+ init
+(1 row)
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+-- Reused queries
+\set get_no2pc 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot_2pc'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'');'
+\set get_with2pc 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'', ''twophase-decoding'', ''1'');'
+\set get_with2pc_nofilter 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'', ''twophase-decoding'', ''1'', ''twophase-decode-with-catalog-changes'', ''1'');'
-- test simple successful use of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (1);
PREPARE TRANSACTION 'test_prepared#1';
+:get_with2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ PREPARE TRANSACTION 'test_prepared#1'
+(3 rows)
+
+:get_no2pc
+ data
+------
+(0 rows)
+
COMMIT PREPARED 'test_prepared#1';
+:get_with2pc
+ data
+------
+(0 rows)
+
+:get_no2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ COMMIT
+(3 rows)
+
INSERT INTO test_prepared1 VALUES (2);
-- test abort of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (3);
PREPARE TRANSACTION 'test_prepared#2';
+:get_no2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+(3 rows)
+
+:get_with2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(6 rows)
+
ROLLBACK PREPARED 'test_prepared#2';
+:get_no2pc
+ data
+------
+(0 rows)
+
+:get_with2pc
+ data
+------
+(0 rows)
+
INSERT INTO test_prepared1 VALUES (4);
-- test prepared xact containing ddl
BEGIN;
@@ -26,49 +91,169 @@ INSERT INTO test_prepared1 VALUES (5);
ALTER TABLE test_prepared1 ADD COLUMN data text;
INSERT INTO test_prepared1 VALUES (6, 'frakbar');
PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'test_prepared1'::regclass;
+ relation | locktype | mode
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
+:get_no2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4
+ COMMIT
+(3 rows)
+
+:get_with2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4
+ COMMIT
+(3 rows)
+
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists. Our 2pc filter callback will skip decoding of xacts
+-- with catalog changes at PREPARE time, so we don't decode it now.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
INSERT INTO test_prepared2 VALUES (7);
+:get_with2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:7
+ COMMIT
+(3 rows)
+
+:get_no2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:7
+ COMMIT
+(3 rows)
+
COMMIT PREPARED 'test_prepared#3';
+:get_no2pc
+ data
+-------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:5
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
+ COMMIT
+(4 rows)
+
+:get_with2pc
+ data
+-------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:5
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3';
+ COMMIT PREPARED 'test_prepared#3';
+(5 rows)
+
-- make sure stuff still works
INSERT INTO test_prepared1 VALUES (8);
INSERT INTO test_prepared2 VALUES (9);
+:get_with2pc
+ data
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:9
+ COMMIT
+(6 rows)
+
+:get_no2pc
+ data
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:9
+ COMMIT
+(6 rows)
+
+-- If we do something that takes a strong lock on a catalog relation we need to
+-- read in order to decode a transaction we deadlock; we can't finish decoding
+-- until the lock is released, but we're waiting for decoding to finish so we
+-- can make a commit/abort decision.
+---
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'pg_class'::regclass;
+ relation | locktype | mode
+----------+----------+------
+(0 rows)
+
+-- Shouldn't see anything with 2pc decoding off
+:get_no2pc
+ data
+------
+(0 rows)
+
+-- If we try to decode it now we'll deadlock
+SET statement_timeout = '10s';
+:get_with2pc_nofilter
+-- FIXME we expect a timeout here, but it actually works...
+ERROR: statement timed out
+
+RESET statement_timeout;
+-- we can decode past it by skipping xacts with catalog changes
+-- and let it be decoded after COMMIT PREPARED, though.
+:get_with2pc
+ data
+------
+(0 rows)
+
+COMMIT PREPARED 'test_prepared_lock';
+-- Both will work normally after we commit
+:get_no2pc
+ data
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ COMMIT
+(4 rows)
+
+:get_with2pc
+ data
+------
+(0 rows)
+
-- cleanup
DROP TABLE test_prepared1;
DROP TABLE test_prepared2;
--- show results
-SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
- data
--------------------------------------------------------------------------
- BEGIN
- table public.test_prepared1: INSERT: id[integer]:1
- COMMIT
- BEGIN
- table public.test_prepared1: INSERT: id[integer]:2
- COMMIT
- BEGIN
- table public.test_prepared1: INSERT: id[integer]:4
- COMMIT
- BEGIN
- table public.test_prepared2: INSERT: id[integer]:7
- COMMIT
- BEGIN
- table public.test_prepared1: INSERT: id[integer]:5
- table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
- COMMIT
- BEGIN
- table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
- COMMIT
- BEGIN
- table public.test_prepared2: INSERT: id[integer]:9
- COMMIT
-(22 rows)
-
SELECT pg_drop_replication_slot('regression_slot');
pg_drop_replication_slot
--------------------------
(1 row)
+SELECT pg_drop_replication_slot('regression_slot_2pc');
+ pg_drop_replication_slot
+--------------------------
+
+(1 row)
+
diff --git a/contrib/test_decoding/sql/prepared.sql b/contrib/test_decoding/sql/prepared.sql
index e726397..a94503c 100644
--- a/contrib/test_decoding/sql/prepared.sql
+++ b/contrib/test_decoding/sql/prepared.sql
@@ -1,22 +1,36 @@
-- predictability
SET synchronous_commit = on;
SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc', 'test_decoding');
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+
+-- Reused queries
+\set get_no2pc 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot_2pc'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'');'
+\set get_with2pc 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'', ''twophase-decoding'', ''1'');'
+\set get_with2pc_nofilter 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'', ''twophase-decoding'', ''1'', ''twophase-decode-with-catalog-changes'', ''1'');'
-- test simple successful use of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (1);
PREPARE TRANSACTION 'test_prepared#1';
+:get_with2pc
+:get_no2pc
COMMIT PREPARED 'test_prepared#1';
+:get_with2pc
+:get_no2pc
INSERT INTO test_prepared1 VALUES (2);
-- test abort of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (3);
PREPARE TRANSACTION 'test_prepared#2';
+:get_no2pc
+:get_with2pc
ROLLBACK PREPARED 'test_prepared#2';
+:get_no2pc
+:get_with2pc
INSERT INTO test_prepared1 VALUES (4);
@@ -27,24 +41,74 @@ ALTER TABLE test_prepared1 ADD COLUMN data text;
INSERT INTO test_prepared1 VALUES (6, 'frakbar');
PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'test_prepared1'::regclass;
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
+:get_no2pc
+:get_with2pc
+
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists. Our 2pc filter callback will skip decoding of xacts
+-- with catalog changes at PREPARE time, so we don't decode it now.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
INSERT INTO test_prepared2 VALUES (7);
+:get_with2pc
+:get_no2pc
COMMIT PREPARED 'test_prepared#3';
+:get_no2pc
+:get_with2pc
-- make sure stuff still works
INSERT INTO test_prepared1 VALUES (8);
INSERT INTO test_prepared2 VALUES (9);
+:get_with2pc
+:get_no2pc
+
+-- If we do something that takes a strong lock on a catalog relation we need to
+-- read in order to decode a transaction we deadlock; we can't finish decoding
+-- until the lock is released, but we're waiting for decoding to finish so we
+-- can make a commit/abort decision.
+---
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'pg_class'::regclass;
+
+-- Shouldn't see anything with 2pc decoding off
+:get_no2pc
+
+-- If we try to decode it now we'll deadlock
+SET statement_timeout = '10s';
+:get_with2pc_nofilter
+RESET statement_timeout;
+
+-- we can decode past it by skipping xacts with catalog changes
+-- and let it be decoded after COMMIT PREPARED, though.
+:get_with2pc
+
+COMMIT PREPARED 'test_prepared_lock';
+
+-- Both will work normally after we commit
+:get_no2pc
+:get_with2pc
-- cleanup
DROP TABLE test_prepared1;
DROP TABLE test_prepared2;
--- show results
-SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
-
SELECT pg_drop_replication_slot('regression_slot');
+SELECT pg_drop_replication_slot('regression_slot_2pc');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 21cfd67..0f0bb1b 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -24,6 +24,8 @@
#include "replication/message.h"
#include "replication/origin.h"
+#include "storage/procarray.h"
+
#include "utils/builtins.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
@@ -46,6 +48,8 @@ typedef struct
bool skip_empty_xacts;
bool xact_wrote_changes;
bool only_local;
+ bool twophase_decoding;
+ bool twophase_decode_with_catalog_changes;
} TestDecodingData;
static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -68,6 +72,19 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr message_lsn,
bool transactional, const char *prefix,
Size sz, const char *message);
+static bool pg_filter_prepare(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ char *gid);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
void
_PG_init(void)
@@ -85,9 +102,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
cb->begin_cb = pg_decode_begin_txn;
cb->change_cb = pg_decode_change;
cb->commit_cb = pg_decode_commit_txn;
+
cb->filter_by_origin_cb = pg_decode_filter;
cb->shutdown_cb = pg_decode_shutdown;
cb->message_cb = pg_decode_message;
+
+ cb->filter_prepare_cb = pg_filter_prepare;
+ cb->prepare_cb = pg_decode_prepare_txn;
+ cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+ cb->abort_prepared_cb = pg_decode_abort_prepared_txn;
}
@@ -107,6 +130,8 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
data->include_timestamp = false;
data->skip_empty_xacts = false;
data->only_local = false;
+ data->twophase_decoding = false;
+ data->twophase_decode_with_catalog_changes = false;
ctx->output_plugin_private = data;
@@ -176,6 +201,27 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
errmsg("could not parse value \"%s\" for parameter \"%s\"",
strVal(elem->arg), elem->defname)));
}
+ else if (strcmp(elem->defname, "twophase-decoding") == 0)
+ {
+
+ if (elem->arg == NULL)
+ data->twophase_decoding = true;
+ else if (!parse_bool(strVal(elem->arg), &data->twophase_decoding))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("could not parse value \"%s\" for parameter \"%s\"",
+ strVal(elem->arg), elem->defname)));
+ }
+ else if (strcmp(elem->defname, "twophase-decode-with-catalog-changes") == 0)
+ {
+ if (elem->arg == NULL)
+ data->twophase_decode_with_catalog_changes = true;
+ else if (!parse_bool(strVal(elem->arg), &data->twophase_decode_with_catalog_changes))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("could not parse value \"%s\" for parameter \"%s\"",
+ strVal(elem->arg), elem->defname)));
+ }
else
{
ereport(ERROR,
@@ -232,10 +278,163 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
return;
OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfoString(ctx->out, "COMMIT");
+
if (data->include_xids)
- appendStringInfo(ctx->out, "COMMIT %u", txn->xid);
- else
- appendStringInfoString(ctx->out, "COMMIT");
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+
+/* Filter out unnecessary two-phase transactions */
+static bool
+pg_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ char *gid)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ /* treat all transaction as one-phase */
+ if (!data->twophase_decoding)
+ return true;
+
+ /*
+ * Two-phase transactions that accessed catalog require special
+ * treatment.
+ *
+ * Right now we don't have a safe way to decode catalog changes made in
+ * prepared transaction that was already aborted by the time of
+ * decoding.
+ *
+ * That kind of problem arises only when we are trying to
+ * retrospectively decode aborted transactions with catalog changes -
+ * including if a transaction aborts while we're decoding it. If one
+ * wants to code distributed commit based on prepare decoding then
+ * commits/aborts will happen strictly after decoding will be
+ * completed, so it is possible to skip any checks/locks here.
+ *
+ * We'll also get stuck trying to acquire locks on catalog relations
+ * we need for decoding if the prepared xact holds a strong lock on
+ * one of them and we also need to decode row changes.
+ */
+ if (txn->has_catalog_changes)
+ {
+ LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+
+ if (TransactionIdIsInProgress(txn->xid))
+ {
+ /*
+ * For the sake of simplicity, by default we just
+ * ignore in-progress prepared transactions with catalog
+ * changes in this extension. If they abort during
+ * decoding then tuples we need to decode them may be
+ * overwritten while we're still decoding, causing
+ * wrong catalog lookups.
+ *
+ * It is possible to move that LWLockRelease() to
+ * pg_decode_prepare_txn() and allow decoding of
+ * running prepared tx, but such lock will prevent any
+ * 2pc transaction commit during decoding time. That
+ * can be a long time in case of lots of
+ * changes/inserts in that tx or if the downstream is
+ * slow/unresponsive.
+ *
+ * (Continuing to decode without the lock is unsafe, XXX)
+ */
+ LWLockRelease(TwoPhaseStateLock);
+ return !data->twophase_decode_with_catalog_changes;
+ }
+ else if (TransactionIdDidAbort(txn->xid))
+ {
+ /*
+ * Here we know that it is already aborted and there is
+ * not much sense in doing something with this
+ * transaction. Consequently ABORT PREPARED will be
+ * suppressed.
+ */
+ LWLockRelease(TwoPhaseStateLock);
+ return true;
+ }
+
+ LWLockRelease(TwoPhaseStateLock);
+ }
+
+ return false;
+}
+
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* ABORT PREPARED callback */
+static void
+pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "ABORT PREPARED %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
if (data->include_timestamp)
appendStringInfo(ctx->out, " (at %s)",
diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 735f8c5..ed75503 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -100,8 +100,13 @@ ParseCommitRecord(uint8 info, xl_xact_commit *xlrec, xl_xact_parsed_commit *pars
xl_xact_twophase *xl_twophase = (xl_xact_twophase *) data;
parsed->twophase_xid = xl_twophase->xid;
-
data += sizeof(xl_xact_twophase);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_GID)
+ {
+ strcpy(parsed->twophase_gid, data);
+ data += strlen(parsed->twophase_gid) + 1;
+ }
}
if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
@@ -139,6 +144,16 @@ ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_abort *parsed)
data += sizeof(xl_xact_xinfo);
}
+ if (parsed->xinfo & XACT_XINFO_HAS_DBINFO)
+ {
+ xl_xact_dbinfo *xl_dbinfo = (xl_xact_dbinfo *) data;
+
+ parsed->dbId = xl_dbinfo->dbId;
+ parsed->tsId = xl_dbinfo->tsId;
+
+ data += sizeof(xl_xact_dbinfo);
+ }
+
if (parsed->xinfo & XACT_XINFO_HAS_SUBXACTS)
{
xl_xact_subxacts *xl_subxacts = (xl_xact_subxacts *) data;
@@ -166,8 +181,26 @@ ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_abort *parsed)
xl_xact_twophase *xl_twophase = (xl_xact_twophase *) data;
parsed->twophase_xid = xl_twophase->xid;
-
data += sizeof(xl_xact_twophase);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_GID)
+ {
+ strcpy(parsed->twophase_gid, data);
+ data += strlen(parsed->twophase_gid) + 1;
+ }
+ }
+
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ xl_xact_origin xl_origin;
+
+ /* we're only guaranteed 4 byte alignment, so copy onto stack */
+ memcpy(&xl_origin, data, sizeof(xl_origin));
+
+ parsed->origin_lsn = xl_origin.origin_lsn;
+ parsed->origin_timestamp = xl_origin.origin_timestamp;
+
+ data += sizeof(xl_xact_origin);
}
}
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 83169cc..b58b9a3 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -129,7 +129,6 @@ int max_prepared_xacts = 0;
* Note that the max value of GIDSIZE must fit in the uint16 gidlen,
* specified in TwoPhaseFileHeader.
*/
-#define GIDSIZE 200
typedef struct GlobalTransactionData
{
@@ -187,12 +186,14 @@ static void RecordTransactionCommitPrepared(TransactionId xid,
RelFileNode *rels,
int ninvalmsgs,
SharedInvalidationMessage *invalmsgs,
- bool initfileinval);
+ bool initfileinval,
+ const char *gid);
static void RecordTransactionAbortPrepared(TransactionId xid,
int nchildren,
TransactionId *children,
int nrels,
- RelFileNode *rels);
+ RelFileNode *rels,
+ const char *gid);
static void ProcessRecords(char *bufptr, TransactionId xid,
const TwoPhaseCallback callbacks[]);
static void RemoveGXact(GlobalTransaction gxact);
@@ -854,7 +855,7 @@ TwoPhaseGetDummyProc(TransactionId xid)
/*
* Header for a 2PC state file
*/
-#define TWOPHASE_MAGIC 0x57F94533 /* format identifier */
+#define TWOPHASE_MAGIC 0x57F94534 /* format identifier */
typedef struct TwoPhaseFileHeader
{
@@ -870,6 +871,8 @@ typedef struct TwoPhaseFileHeader
int32 ninvalmsgs; /* number of cache invalidation messages */
bool initfileinval; /* does relcache init file need invalidation? */
uint16 gidlen; /* length of the GID - GID follows the header */
+ XLogRecPtr origin_lsn; /* lsn of this record at origin node */
+ TimestampTz origin_timestamp; /* time of prepare at origin node */
} TwoPhaseFileHeader;
/*
@@ -1021,6 +1024,7 @@ EndPrepare(GlobalTransaction gxact)
{
TwoPhaseFileHeader *hdr;
StateFileChunk *record;
+ bool replorigin;
/* Add the end sentinel to the list of 2PC records */
RegisterTwoPhaseRecord(TWOPHASE_RM_END_ID, 0,
@@ -1031,6 +1035,21 @@ EndPrepare(GlobalTransaction gxact)
Assert(hdr->magic == TWOPHASE_MAGIC);
hdr->total_len = records.total_len + sizeof(pg_crc32c);
+ replorigin = (replorigin_session_origin != InvalidRepOriginId &&
+ replorigin_session_origin != DoNotReplicateId);
+
+ if (replorigin)
+ {
+ Assert(replorigin_session_origin_lsn != InvalidXLogRecPtr);
+ hdr->origin_lsn = replorigin_session_origin_lsn;
+ hdr->origin_timestamp = replorigin_session_origin_timestamp;
+ }
+ else
+ {
+ hdr->origin_lsn = InvalidXLogRecPtr;
+ hdr->origin_timestamp = 0;
+ }
+
/*
* If the data size exceeds MaxAllocSize, we won't be able to read it in
* ReadTwoPhaseFile. Check for that now, rather than fail in the case
@@ -1061,9 +1080,19 @@ EndPrepare(GlobalTransaction gxact)
MyPgXact->delayChkpt = true;
XLogBeginInsert();
+
for (record = records.head; record != NULL; record = record->next)
XLogRegisterData(record->data, record->len);
+
+ XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
+
gxact->prepare_end_lsn = XLogInsert(RM_XACT_ID, XLOG_XACT_PREPARE);
+
+ if (replorigin)
+ /* Move LSNs forward for this replication origin */
+ replorigin_session_advance(replorigin_session_origin_lsn,
+ gxact->prepare_end_lsn);
+
XLogFlush(gxact->prepare_end_lsn);
/* If we crash now, we have prepared: WAL replay will fix things */
@@ -1239,6 +1268,43 @@ ReadTwoPhaseFile(TransactionId xid, bool give_warnings)
return buf;
}
+/*
+ * ParsePrepareRecord
+ */
+void
+ParsePrepareRecord(uint8 info, char *xlrec, xl_xact_parsed_prepare *parsed)
+{
+ TwoPhaseFileHeader *hdr;
+ char *bufptr;
+
+ hdr = (TwoPhaseFileHeader *) xlrec;
+ bufptr = xlrec + MAXALIGN(sizeof(TwoPhaseFileHeader));
+
+ parsed->origin_lsn = hdr->origin_lsn;
+ parsed->origin_timestamp = hdr->origin_timestamp;
+ parsed->twophase_xid = hdr->xid;
+ parsed->dbId = hdr->database;
+ parsed->nsubxacts = hdr->nsubxacts;
+ parsed->ncommitrels = hdr->ncommitrels;
+ parsed->nabortrels = hdr->nabortrels;
+ parsed->nmsgs = hdr->ninvalmsgs;
+
+ strncpy(parsed->twophase_gid, bufptr, hdr->gidlen);
+ bufptr += MAXALIGN(hdr->gidlen);
+
+ parsed->subxacts = (TransactionId *) bufptr;
+ bufptr += MAXALIGN(hdr->nsubxacts * sizeof(TransactionId));
+
+ parsed->commitrels = (RelFileNode *) bufptr;
+ bufptr += MAXALIGN(hdr->ncommitrels * sizeof(RelFileNode));
+
+ parsed->abortrels = (RelFileNode *) bufptr;
+ bufptr += MAXALIGN(hdr->nabortrels * sizeof(RelFileNode));
+
+ parsed->msgs = (SharedInvalidationMessage *) bufptr;
+ bufptr += MAXALIGN(hdr->ninvalmsgs * sizeof(SharedInvalidationMessage));
+}
+
/*
* Reads 2PC data from xlog. During checkpoint this data will be moved to
@@ -1392,11 +1458,12 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
hdr->nsubxacts, children,
hdr->ncommitrels, commitrels,
hdr->ninvalmsgs, invalmsgs,
- hdr->initfileinval);
+ hdr->initfileinval, gid);
else
RecordTransactionAbortPrepared(xid,
hdr->nsubxacts, children,
- hdr->nabortrels, abortrels);
+ hdr->nabortrels, abortrels,
+ gid);
ProcArrayRemove(proc, latestXid);
@@ -2055,7 +2122,8 @@ RecordTransactionCommitPrepared(TransactionId xid,
RelFileNode *rels,
int ninvalmsgs,
SharedInvalidationMessage *invalmsgs,
- bool initfileinval)
+ bool initfileinval,
+ const char *gid)
{
XLogRecPtr recptr;
TimestampTz committs = GetCurrentTimestamp();
@@ -2082,7 +2150,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
ninvalmsgs, invalmsgs,
initfileinval, false,
MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
- xid);
+ xid, gid);
if (replorigin)
@@ -2144,7 +2212,8 @@ RecordTransactionAbortPrepared(TransactionId xid,
int nchildren,
TransactionId *children,
int nrels,
- RelFileNode *rels)
+ RelFileNode *rels,
+ const char *gid)
{
XLogRecPtr recptr;
@@ -2166,7 +2235,7 @@ RecordTransactionAbortPrepared(TransactionId xid,
nchildren, children,
nrels, rels,
MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
- xid);
+ xid, gid);
/* Always flush, since we're about to remove the 2PC state file */
XLogFlush(recptr);
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index c8751c6..9e407d5 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1233,7 +1233,7 @@ RecordTransactionCommit(void)
nmsgs, invalMessages,
RelcacheInitFileInval, forceSyncCommit,
MyXactFlags,
- InvalidTransactionId /* plain commit */ );
+ InvalidTransactionId, NULL /* plain commit */ );
if (replorigin)
/* Move LSNs forward for this replication origin */
@@ -1585,7 +1585,8 @@ RecordTransactionAbort(bool isSubXact)
XactLogAbortRecord(xact_time,
nchildren, children,
nrels, rels,
- MyXactFlags, InvalidTransactionId);
+ MyXactFlags, InvalidTransactionId,
+ NULL);
/*
* Report the latest async abort LSN, so that the WAL writer knows to
@@ -3471,7 +3472,7 @@ BeginTransactionBlock(void)
* resource owner, etc while executing inside a Portal.
*/
bool
-PrepareTransactionBlock(char *gid)
+PrepareTransactionBlock(const char *gid)
{
TransactionState s;
bool result;
@@ -5110,7 +5111,8 @@ XactLogCommitRecord(TimestampTz commit_time,
int nrels, RelFileNode *rels,
int nmsgs, SharedInvalidationMessage *msgs,
bool relcacheInval, bool forceSync,
- int xactflags, TransactionId twophase_xid)
+ int xactflags, TransactionId twophase_xid,
+ const char *twophase_gid)
{
xl_xact_commit xlrec;
xl_xact_xinfo xl_xinfo;
@@ -5122,6 +5124,7 @@ XactLogCommitRecord(TimestampTz commit_time,
xl_xact_origin xl_origin;
uint8 info;
+ int gidlen = 0;
Assert(CritSectionCount > 0);
@@ -5184,6 +5187,13 @@ XactLogCommitRecord(TimestampTz commit_time,
{
xl_xinfo.xinfo |= XACT_XINFO_HAS_TWOPHASE;
xl_twophase.xid = twophase_xid;
+ Assert(twophase_gid != NULL);
+
+ if (XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_GID;
+ gidlen = strlen(twophase_gid) + 1; /* include '\0' */
+ }
}
/* dump transaction origin information */
@@ -5234,8 +5244,13 @@ XactLogCommitRecord(TimestampTz commit_time,
}
if (xl_xinfo.xinfo & XACT_XINFO_HAS_TWOPHASE)
+ {
XLogRegisterData((char *) (&xl_twophase), sizeof(xl_xact_twophase));
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_GID)
+ XLogRegisterData((char *) twophase_gid, gidlen);
+ }
+
if (xl_xinfo.xinfo & XACT_XINFO_HAS_ORIGIN)
XLogRegisterData((char *) (&xl_origin), sizeof(xl_xact_origin));
@@ -5255,15 +5270,19 @@ XLogRecPtr
XactLogAbortRecord(TimestampTz abort_time,
int nsubxacts, TransactionId *subxacts,
int nrels, RelFileNode *rels,
- int xactflags, TransactionId twophase_xid)
+ int xactflags, TransactionId twophase_xid,
+ const char *twophase_gid)
{
xl_xact_abort xlrec;
xl_xact_xinfo xl_xinfo;
xl_xact_subxacts xl_subxacts;
xl_xact_relfilenodes xl_relfilenodes;
xl_xact_twophase xl_twophase;
+ xl_xact_dbinfo xl_dbinfo;
+ xl_xact_origin xl_origin;
uint8 info;
+ int gidlen = 0;
Assert(CritSectionCount > 0);
@@ -5275,7 +5294,6 @@ XactLogAbortRecord(TimestampTz abort_time,
else
info = XLOG_XACT_ABORT_PREPARED;
-
/* First figure out and collect all the information needed */
xlrec.xact_time = abort_time;
@@ -5299,6 +5317,31 @@ XactLogAbortRecord(TimestampTz abort_time,
{
xl_xinfo.xinfo |= XACT_XINFO_HAS_TWOPHASE;
xl_twophase.xid = twophase_xid;
+ Assert(twophase_gid != NULL);
+
+ if (XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_GID;
+ gidlen = strlen(twophase_gid) + 1; /* include '\0' */
+ }
+ }
+
+ if (TransactionIdIsValid(twophase_xid) && XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_DBINFO;
+ xl_dbinfo.dbId = MyDatabaseId;
+ xl_dbinfo.tsId = MyDatabaseTableSpace;
+ }
+
+ /* dump transaction origin information only for abort prepared */
+ if ( (replorigin_session_origin != InvalidRepOriginId) &&
+ TransactionIdIsValid(twophase_xid) &&
+ XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_ORIGIN;
+
+ xl_origin.origin_lsn = replorigin_session_origin_lsn;
+ xl_origin.origin_timestamp = replorigin_session_origin_timestamp;
}
if (xl_xinfo.xinfo != 0)
@@ -5313,6 +5356,9 @@ XactLogAbortRecord(TimestampTz abort_time,
if (xl_xinfo.xinfo != 0)
XLogRegisterData((char *) (&xl_xinfo), sizeof(xl_xinfo));
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_DBINFO)
+ XLogRegisterData((char *) (&xl_dbinfo), sizeof(xl_dbinfo));
+
if (xl_xinfo.xinfo & XACT_XINFO_HAS_SUBXACTS)
{
XLogRegisterData((char *) (&xl_subxacts),
@@ -5330,8 +5376,22 @@ XactLogAbortRecord(TimestampTz abort_time,
}
if (xl_xinfo.xinfo & XACT_XINFO_HAS_TWOPHASE)
+ {
XLogRegisterData((char *) (&xl_twophase), sizeof(xl_xact_twophase));
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_GID)
+ XLogRegisterData((char *) twophase_gid, gidlen);
+ }
+
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_DBINFO)
+ XLogRegisterData((char *) (&xl_dbinfo), sizeof(xl_dbinfo));
+
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_ORIGIN)
+ XLogRegisterData((char *) (&xl_origin), sizeof(xl_xact_origin));
+
+ if (TransactionIdIsValid(twophase_xid))
+ XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
+
return XLogInsert(RM_XACT_ID, info);
}
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5c13d26..b1e39c55 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -34,6 +34,7 @@
#include "access/xlogutils.h"
#include "access/xlogreader.h"
#include "access/xlogrecord.h"
+#include "access/twophase.h"
#include "catalog/pg_control.h"
@@ -71,7 +72,9 @@ static void DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf
static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_commit *parsed, TransactionId xid);
static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
- xl_xact_parsed_abort *parsed, TransactionId xid);
+ xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare *parsed);
/* common function to decode tuples */
static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -277,17 +280,33 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
break;
}
case XLOG_XACT_PREPARE:
+ {
+ xl_xact_parsed_prepare parsed;
- /*
- * Currently decoding ignores PREPARE TRANSACTION and will just
- * decode the transaction when the COMMIT PREPARED is sent or
- * throw away the transaction's contents when a ROLLBACK PREPARED
- * is received. In the future we could add code to expose prepared
- * transactions in the changestream allowing for a kind of
- * distributed 2PC.
- */
- ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
- break;
+ /* check that the output plugin is capable of two-phase decoding */
+ if (!ctx->twophase_hadling)
+ {
+ ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+ break;
+ }
+
+ /* ok, parse it */
+ ParsePrepareRecord(XLogRecGetInfo(buf->record),
+ XLogRecGetData(buf->record), &parsed);
+
+ /* does the output plugin want to skip this particular transaction? */
+ if (ctx->callbacks.filter_prepare_cb &&
+ ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+ parsed.twophase_gid))
+ {
+ ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+ buf->origptr);
+ break;
+ }
+
+ DecodePrepare(ctx, buf, &parsed);
+ break;
+ }
default:
elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
}
@@ -551,8 +570,13 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
* Process invalidation messages, even if we're not interested in the
* transaction's contents, since the various caches need to always be
* consistent.
+ *
+ * Also, if that transaction was already sent to the prepare callback,
+ * then both of the functions below were already called during prepare.
*/
- if (parsed->nmsgs > 0)
+ if (parsed->nmsgs > 0 &&
+ !(TransactionIdIsValid(parsed->twophase_xid) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder, xid)))
{
ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
parsed->nmsgs, parsed->msgs);
@@ -607,9 +631,81 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
buf->origptr, buf->endptr);
}
+ if (TransactionIdIsValid(parsed->twophase_xid) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder, xid))
+ {
+ /*
+ * We are processing COMMIT PREPARED and know that the reorder buffer is
+ * empty, so we can use a shortcut for committing a bare xact.
+ */
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn, parsed->twophase_gid, true);
+ }
+ else
+ {
+ /* replay actions of all transaction + subtransactions in order */
+ ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn);
+ }
+}
+
+
+/*
+ * Decode PREPARE record. Same logic as in COMMIT, but with different calls
+ * to SnapshotBuilder, as we need to mark this transaction as committed
+ * instead of running to decode it properly. Once the prepared transaction
+ * is decoded, we mark it in the snapshot as running again.
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare *parsed)
+{
+ XLogRecPtr origin_lsn = parsed->origin_lsn;
+ TimestampTz commit_time = parsed->origin_timestamp;
+ RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+ int i;
+ TransactionId xid = parsed->twophase_xid;
+
+ /*
+ * Process invalidation messages, even if we're not interested in the
+ * transaction's contents, since the various caches need to always be
+ * consistent.
+ */
+ if (parsed->nmsgs > 0)
+ {
+ ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
+ parsed->nmsgs, parsed->msgs);
+ ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+ }
+
+ SnapBuildPrepareTxnStart(ctx->snapshot_builder, buf->origptr, xid,
+ parsed->nsubxacts, parsed->subxacts);
+
+ if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+ (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+ FilterByOrigin(ctx, origin_id))
+ {
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr);
+ }
+ ReorderBufferForget(ctx->reorder, xid, buf->origptr);
+
+ return;
+ }
+
+ /* tell the reorderbuffer about the surviving subtransactions */
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+ buf->origptr, buf->endptr);
+ }
+
/* replay actions of all transaction + subtransactions in order */
- ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
- commit_time, origin_id, origin_lsn);
+ ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn, parsed->twophase_gid);
+
+ SnapBuildPrepareTxnFinish(ctx->snapshot_builder, xid);
}
/*
@@ -621,6 +717,30 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid)
{
int i;
+ XLogRecPtr origin_lsn = InvalidXLogRecPtr;
+ TimestampTz commit_time = 0;
+ RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ origin_lsn = parsed->origin_lsn;
+ commit_time = parsed->origin_timestamp;
+ }
+
+ /*
+ * If this is ROLLBACK PREPARED, then send it to the callbacks.
+ */
+ if (TransactionIdIsValid(xid) &&
+ !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+ parsed->dbId == ctx->slot->data.database &&
+ !FilterByOrigin(ctx, origin_id) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder, xid))
+ {
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid, false);
+ return;
+ }
SnapBuildAbortTxn(ctx->snapshot_builder, buf->record->EndRecPtr, xid,
parsed->nsubxacts, parsed->subxacts);
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 5529ac8..9a66194 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -58,6 +58,14 @@ static void startup_cb_wrapper(LogicalDecodingContext *ctx, OutputPluginOptions
bool is_init);
static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -122,6 +130,7 @@ StartupDecodingContext(List *output_plugin_options,
MemoryContext context,
old_context;
LogicalDecodingContext *ctx;
+ int twophase_callbacks;
/* shorter lines... */
slot = MyReplicationSlot;
@@ -179,8 +188,25 @@ StartupDecodingContext(List *output_plugin_options,
ctx->reorder->begin = begin_cb_wrapper;
ctx->reorder->apply_change = change_cb_wrapper;
ctx->reorder->commit = commit_cb_wrapper;
+ ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+ ctx->reorder->prepare = prepare_cb_wrapper;
+ ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+ ctx->reorder->abort_prepared = abort_prepared_cb_wrapper;
ctx->reorder->message = message_cb_wrapper;
+ /* check that plugin implements all necessary callbacks to perform 2PC */
+ twophase_callbacks = (ctx->callbacks.prepare_cb != NULL) +
+ (ctx->callbacks.commit_prepared_cb != NULL) +
+ (ctx->callbacks.abort_prepared_cb != NULL);
+
+ ctx->twophase_hadling = (twophase_callbacks == 3);
+
+ if (twophase_callbacks != 3 && twophase_callbacks != 0)
+ ereport(WARNING,
+ (errmsg("Output plugin registered only %d twophase callbacks out of 3. "
+ "Twophase transactions will be decoded as ordinary ones.",
+ twophase_callbacks)));
+
ctx->out = makeStringInfo();
ctx->prepare_write = prepare_write;
ctx->write = do_write;
@@ -650,6 +676,93 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
}
static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "prepare";
+ state.report_location = txn->final_lsn; /* beginning of prepare record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "commit_prepared";
+ state.report_location = txn->final_lsn; /* beginning of commit record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "abort_prepared";
+ state.report_location = txn->final_lsn; /* beginning of abort record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.abort_prepared_cb(ctx, txn, abort_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change)
{
@@ -684,6 +797,34 @@ change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
error_context_stack = errcallback.previous;
}
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn, char *gid)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+ bool ret;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "filter_prepare";
+ state.report_location = InvalidXLogRecPtr;
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = false;
+
+ /* do the actual work: call callback */
+ ret = ctx->callbacks.filter_prepare_cb(ctx, txn, gid);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+ return ret;
+}
+
bool
filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
{
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b437799..f633523 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1308,21 +1308,18 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
* the top and subtransactions (using a k-way merge) and replay the changes in
* lsn order.
*/
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+ ReorderBuffer *rb, TransactionId xid,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
TimestampTz commit_time,
RepOriginId origin_id, XLogRecPtr origin_lsn)
{
- ReorderBufferTXN *txn;
volatile Snapshot snapshot_now;
volatile CommandId command_id = FirstCommandId;
bool using_subtxn;
ReorderBufferIterTXNState *volatile iterstate = NULL;
- txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
- false);
-
/* unknown transaction, nothing to replay */
if (txn == NULL)
return;
@@ -1605,8 +1602,11 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
ReorderBufferIterTXNFinish(rb, iterstate);
iterstate = NULL;
- /* call commit callback */
- rb->commit(rb, txn, commit_lsn);
+ /* call commit or prepare callback */
+ if (txn->prepared)
+ rb->prepare(rb, txn, commit_lsn);
+ else
+ rb->commit(rb, txn, commit_lsn);
/* this is just a sanity check against bad output plugin behaviour */
if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -1633,8 +1633,12 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
if (snapshot_now->copied)
ReorderBufferFreeSnap(rb, snapshot_now);
- /* remove potential on-disk data, and deallocate */
- ReorderBufferCleanupTXN(rb, txn);
+ /*
+ * remove potential on-disk data, and deallocate or postpone that
+ * till the finish of two-phase tx
+ */
+ if (!txn->prepared)
+ ReorderBufferCleanupTXN(rb, txn);
}
PG_CATCH();
{
@@ -1668,6 +1672,111 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
}
/*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as one-phase later on commit.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+ return rb->filter_prepare(rb, txn, gid);
+}
+
+
+/*
+ * Commit non-twophase transaction. See comments to ReorderBufferCommitInternal()
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare twophase transaction. It calls ReorderBufferCommitInternal()
+ * since all transaction changes should be decoded on PREPARE.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ txn->prepared = true;
+ strcpy(txn->gid, gid);
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Check whether this transaction was sent as prepared to receiver.
+ * Called upon commit/abort prepared.
+ */
+bool
+ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /*
+ * If txn == NULL then presumably the subscriber confirmed the prepare
+ * but we have been restarted since.
+ */
+ return txn == NULL ? true : txn->prepared;
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ABORT PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+ true);
+
+ txn->final_lsn = commit_lsn;
+ txn->end_lsn = end_lsn;
+ txn->commit_time = commit_time;
+ txn->origin_id = origin_id;
+ txn->origin_lsn = origin_lsn;
+ strcpy(txn->gid, gid);
+
+ if (is_commit)
+ rb->commit_prepared(rb, txn, commit_lsn);
+ else
+ rb->abort_prepared(rb, txn, commit_lsn);
+
+ ReorderBufferCleanupTXN(rb, txn);
+}
+
+/*
* Abort a transaction that possibly has previous changes. Needs to be first
* called for subtransactions and then for the toplevel xid.
*
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 2279604..c1ca998 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -885,7 +885,7 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
/* copy xids that still are interesting to workspace */
for (off = 0; off < builder->committed.xcnt; off++)
{
- if (NormalTransactionIdPrecedes(builder->committed.xip[off],
+ if (TransactionIdPrecedes(builder->committed.xip[off],
builder->xmin))
; /* remove */
else
@@ -1118,6 +1118,52 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
}
}
+/*
+ * Just a wrapper to clarify DecodePrepare().
+ * Right now we can't extract correct historic catalog data that
+ * was produced by an aborted prepared transaction, so it is the job of the
+ * decoding plugin to avoid such a situation; here we just construct the usual
+ * snapshot to be able to decode the prepare.
+ */
+void
+SnapBuildPrepareTxnStart(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
+ int nsubxacts, TransactionId *subxacts)
+{
+ SnapBuildCommitTxn(builder, lsn, xid, nsubxacts, subxacts);
+}
+
+
+/*
+ * When decoding of a prepare is finished we should exclude our xid
+ * from the list of committed xids to have a correct snapshot between
+ * prepare and commit.
+ *
+ * However, this is not strictly needed. A prepared transaction holds locks
+ * between prepare and commit, so nobody can produce a new version of our
+ * catalog tuples. In case of abort we will have this xid in the array of
+ * committed xids, but that will not cause a problem either, since the checks
+ * of HeapTupleHeaderXminInvalid() in HeapTupleSatisfiesHistoricMVCC()
+ * take priority over the checks of the xip array. Anyway, let's be consistent
+ * about definitions and delete this xid from the xip array.
+ */
+void
+SnapBuildPrepareTxnFinish(SnapBuild *builder, TransactionId xid)
+{
+ TransactionId *search = bsearch(&xid, builder->running.xip,
+ builder->running.xcnt, sizeof(TransactionId), xidComparator);
+
+ if (search == NULL)
+ return;
+
+ /* delete that xid */
+ memmove(search, search + 1,
+ ((builder->running.xip + builder->running.xcnt - 1) - search) * sizeof(TransactionId));
+ builder->running.xcnt--;
+
+ /* update min/max */
+ builder->running.xmin = builder->running.xip[0];
+ builder->running.xmax = builder->running.xip[builder->running.xcnt - 1];
+}
/* -----------------------------------
* Snapshot building functions dealing with xlog records
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index b2b7848..6c0445a 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -15,6 +15,7 @@
#define TWOPHASE_H
#include "access/xlogdefs.h"
+#include "access/xact.h"
#include "datatype/timestamp.h"
#include "storage/lock.h"
@@ -46,6 +47,8 @@ extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
int *nxids_p);
+extern void ParsePrepareRecord(uint8 info, char *xlrec,
+ xl_xact_parsed_prepare *parsed);
extern void StandbyRecoverPreparedTransactions(bool overwriteOK);
extern void RecoverPreparedTransactions(void);
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 5b37c05..e8bf39b 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -21,6 +21,10 @@
#include "storage/sinval.h"
#include "utils/datetime.h"
+/*
+ * Maximum size of Global Transaction ID (including '\0').
+ */
+#define GIDSIZE 200
/*
* Xact isolation levels
@@ -157,6 +161,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
#define XACT_XINFO_HAS_TWOPHASE (1U << 4)
#define XACT_XINFO_HAS_ORIGIN (1U << 5)
#define XACT_XINFO_HAS_AE_LOCKS (1U << 6)
+#define XACT_XINFO_HAS_GID (1U << 7)
/*
* Also stored in xinfo, these indicating a variety of additional actions that
@@ -303,13 +308,40 @@ typedef struct xl_xact_parsed_commit
SharedInvalidationMessage *msgs;
TransactionId twophase_xid; /* only for 2PC */
+ char twophase_gid[GIDSIZE];
XLogRecPtr origin_lsn;
TimestampTz origin_timestamp;
} xl_xact_parsed_commit;
+typedef struct xl_xact_parsed_prepare
+{
+ Oid dbId; /* MyDatabaseId */
+
+ int nsubxacts;
+ TransactionId *subxacts;
+
+ int ncommitrels;
+ RelFileNode *commitrels;
+
+ int nabortrels;
+ RelFileNode *abortrels;
+
+ int nmsgs;
+ SharedInvalidationMessage *msgs;
+
+ TransactionId twophase_xid;
+ char twophase_gid[GIDSIZE];
+
+ XLogRecPtr origin_lsn;
+ TimestampTz origin_timestamp;
+} xl_xact_parsed_prepare;
+
typedef struct xl_xact_parsed_abort
{
+ Oid dbId;
+ Oid tsId;
+
TimestampTz xact_time;
uint32 xinfo;
@@ -320,6 +352,10 @@ typedef struct xl_xact_parsed_abort
RelFileNode *xnodes;
TransactionId twophase_xid; /* only for 2PC */
+ char twophase_gid[GIDSIZE];
+
+ XLogRecPtr origin_lsn;
+ TimestampTz origin_timestamp;
} xl_xact_parsed_abort;
@@ -351,7 +387,7 @@ extern void CommitTransactionCommand(void);
extern void AbortCurrentTransaction(void);
extern void BeginTransactionBlock(void);
extern bool EndTransactionBlock(void);
-extern bool PrepareTransactionBlock(char *gid);
+extern bool PrepareTransactionBlock(const char *gid);
extern void UserAbortTransactionBlock(void);
extern void ReleaseSavepoint(List *options);
extern void DefineSavepoint(char *name);
@@ -385,12 +421,14 @@ extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
int nmsgs, SharedInvalidationMessage *msgs,
bool relcacheInval, bool forceSync,
int xactflags,
- TransactionId twophase_xid);
+ TransactionId twophase_xid, const char *twophase_gid);
extern XLogRecPtr XactLogAbortRecord(TimestampTz abort_time,
int nsubxacts, TransactionId *subxacts,
int nrels, RelFileNode *rels,
- int xactflags, TransactionId twophase_xid);
+ int xactflags, TransactionId twophase_xid,
+ const char *twophase_gid);
+
extern void xact_redo(XLogReaderState *record);
/* xactdesc.c */
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 7d6c88e..7352b07 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -75,6 +75,11 @@ typedef struct LogicalDecodingContext
bool prepared_write;
XLogRecPtr write_location;
TransactionId write_xid;
+
+ /*
+ * Capabilities of decoding plugin used.
+ */
+ bool twophase_hadling;
} LogicalDecodingContext;
@@ -109,5 +114,4 @@ extern void LogicalConfirmReceivedLocation(XLogRecPtr lsn);
extern void LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time);
extern bool filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id);
-
#endif
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 08e962d..be32774 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -68,6 +68,38 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
XLogRecPtr commit_lsn);
/*
+ * Called before decoding of a PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to the prepare
+ * and commit_prepared/abort_prepared callbacks, or should wait until
+ * COMMIT PREPARED and be sent as a usual transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+/*
* Called for the generic logical decoding messages.
*/
typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
@@ -98,6 +130,10 @@ typedef struct OutputPluginCallbacks
LogicalDecodeBeginCB begin_cb;
LogicalDecodeChangeCB change_cb;
LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeFilterPrepareCB filter_prepare_cb;
+ LogicalDecodePrepareCB prepare_cb;
+ LogicalDecodeCommitPreparedCB commit_prepared_cb;
+ LogicalDecodeAbortPreparedCB abort_prepared_cb;
LogicalDecodeMessageCB message_cb;
LogicalDecodeFilterByOriginCB filter_by_origin_cb;
LogicalDecodeShutdownCB shutdown_cb;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 17e47b3..99aa17f 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
#define REORDERBUFFER_H
#include "access/htup_details.h"
+#include "access/twophase.h"
#include "lib/ilist.h"
#include "storage/sinval.h"
#include "utils/hsearch.h"
@@ -144,6 +145,16 @@ typedef struct ReorderBufferTXN
*/
TransactionId xid;
+ /* In case of 2PC we need to pass GID to output plugin */
+ char gid[GIDSIZE];
+
+ /*
+ * By using the filter_prepare() callback we can force decoding to treat
+ * a two-phase transaction as an ordinary one. This flag is set if we
+ * actually called the prepare() callback in the output plugin.
+ */
+ bool prepared;
+
/* did the TX have catalog changes */
bool has_catalog_changes;
@@ -283,6 +294,29 @@ typedef void (*ReorderBufferCommitCB) (
ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+typedef bool (*ReorderBufferFilterPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/* abort prepared callback signature */
+typedef void (*ReorderBufferAbortPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
/* message callback signature */
typedef void (*ReorderBufferMessageCB) (
ReorderBuffer *rb,
@@ -318,6 +352,10 @@ struct ReorderBuffer
ReorderBufferBeginCB begin;
ReorderBufferApplyChangeCB apply_change;
ReorderBufferCommitCB commit;
+ ReorderBufferFilterPrepareCB filter_prepare;
+ ReorderBufferPrepareCB prepare;
+ ReorderBufferCommitPreparedCB commit_prepared;
+ ReorderBufferAbortPreparedCB abort_prepared;
ReorderBufferMessageCB message;
/*
@@ -373,6 +411,11 @@ void ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot
void ReorderBufferCommit(ReorderBuffer *, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit);
void ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
void ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
@@ -396,6 +439,13 @@ void ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
bool ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
bool ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
+bool ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, char *gid);
+bool ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid);
+void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid);
ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
void ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index a8ae631..400ffe1 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -72,6 +72,10 @@ extern bool SnapBuildXactNeedsSkip(SnapBuild *snapstate, XLogRecPtr ptr);
extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
TransactionId xid, int nsubxacts,
TransactionId *subxacts);
+extern void SnapBuildPrepareTxnStart(SnapBuild *builder, XLogRecPtr lsn,
+ TransactionId xid, int nsubxacts,
+ TransactionId *subxacts);
+extern void SnapBuildPrepareTxnFinish(SnapBuild *builder, TransactionId xid);
extern void SnapBuildAbortTxn(SnapBuild *builder, XLogRecPtr lsn,
TransactionId xid, int nsubxacts,
TransactionId *subxacts);
--
2.5.5
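As an aside on SnapBuildPrepareTxnFinish() in the patch above: it removes one xid from a sorted array with bsearch() and memmove(), then refreshes the min/max bounds. The following is a minimal standalone sketch of that technique, not the patch's code. RunningXacts, xid_cmp and running_remove_xid are hypothetical names made up for illustration, the comparator uses plain numeric ordering, and the extra guard for an array that becomes empty is an assumption about the desired behaviour rather than something the patch does.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t TransactionId;	/* stand-in for the PostgreSQL typedef */

/* Hypothetical, simplified stand-in for builder->running in the patch. */
typedef struct RunningXacts
{
	TransactionId *xip;		/* xids, sorted ascending */
	size_t		xcnt;		/* number of valid entries in xip */
	TransactionId xmin;		/* smallest entry, or 0 if empty */
	TransactionId xmax;		/* largest entry, or 0 if empty */
} RunningXacts;

/* Plain numeric ordering; this sketch ignores xid wraparound. */
static int
xid_cmp(const void *a, const void *b)
{
	TransactionId xa = *(const TransactionId *) a;
	TransactionId xb = *(const TransactionId *) b;

	return (xa > xb) - (xa < xb);
}

/* Remove xid from the sorted array. Returns 1 if removed, 0 if not found. */
static int
running_remove_xid(RunningXacts *r, TransactionId xid)
{
	TransactionId *hit;
	size_t		tail;

	assert(r != NULL);
	if (r->xcnt == 0)
		return 0;

	hit = bsearch(&xid, r->xip, r->xcnt, sizeof(TransactionId), xid_cmp);
	if (hit == NULL)
		return 0;

	/* shift the entries after the hit one slot to the left */
	tail = (size_t) ((r->xip + r->xcnt - 1) - hit);
	memmove(hit, hit + 1, tail * sizeof(TransactionId));
	r->xcnt--;

	/* assumption: an empty array gets 0 bounds instead of reading xip[-1] */
	if (r->xcnt == 0)
	{
		r->xmin = r->xmax = 0;
		return 1;
	}
	r->xmin = r->xip[0];
	r->xmax = r->xip[r->xcnt - 1];
	return 1;
}
```

For example, removing 105 from an array holding 100, 105, 110 leaves 100 and 110 with unchanged bounds, and removing the last remaining entry resets both bounds to 0 in this sketch.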
On 28 March 2017 at 10:53, Craig Ringer <craig@2ndquadrant.com> wrote:
However, even once I add an option to force decoding of 2pc xacts with
catalog changes to test_decoding, I cannot reproduce the expected
locking issues so far. See tests in attached updated version, in
contrib/test_decoding/sql/prepare.sql .
I haven't been able to create issues with CLUSTER, any ALTER TABLEs
I've tried, or anything similar.
An explicit AEL on pg_attribute causes the decoding stall, but you
can't do anything much else either, and I don't see how that'd arise
under normal circumstances.
If it's a sufficiently obscure issue I'm willing to document it as
"don't do that" or "use a command filter to prohibit that". But it's
more likely that I'm just not spotting the cases where the issue
arises.
Attempting to CLUSTER a system catalog like pg_class or pg_attribute
causes PREPARE TRANSACTION to fail with
ERROR: cannot PREPARE a transaction that modified relation mapping
and I didn't find any catalogs I could CLUSTER that'd also block decoding.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 28 March 2017 at 03:53, Craig Ringer <craig@2ndquadrant.com> wrote:
On 28 March 2017 at 09:25, Andres Freund <andres@anarazel.de> wrote:
If you actually need separate decoding of 2PC, then you want to wait for
the PREPARE to be replicated. If that replication has to wait for the
to-be-replicated prepared transaction to commit prepared, and commit
prepare will only happen once replication happened...
In other words, the output plugin cannot decode a transaction at
PREPARE TRANSACTION time if that xact holds an AccessExclusiveLock on
a catalog relation we must be able to read in order to decode the
xact.
Yes, I understand.
The decoding plugin can choose to enable lock_timeout, or it can
choose to wait for manual resolution, or it could automatically abort
such a transaction to avoid needing to decode it.
I don't think it's for us to say what the plugin is allowed to do. We
decided on a plugin architecture, so we have to trust that the plugin
author resolves the issues. We can document them so those choices are
clear.
This doesn't differ in any respect from any other resource it might
need yet cannot obtain.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2017-03-28 15:32:49 +0100, Simon Riggs wrote:
On 28 March 2017 at 03:53, Craig Ringer <craig@2ndquadrant.com> wrote:
On 28 March 2017 at 09:25, Andres Freund <andres@anarazel.de> wrote:
If you actually need separate decoding of 2PC, then you want to wait for
the PREPARE to be replicated. If that replication has to wait for the
to-be-replicated prepared transaction to commit prepared, and commit
prepare will only happen once replication happened...
In other words, the output plugin cannot decode a transaction at
PREPARE TRANSACTION time if that xact holds an AccessExclusiveLock on
a catalog relation we must be able to read in order to decode the
xact.
Yes, I understand.
The decoding plugin can choose to enable lock_timeout, or it can
choose to wait for manual resolution, or it could automatically abort
such a transaction to avoid needing to decode it.
That doesn't solve the problem. You're still left with replication that
can't progress. I think that's completely unacceptable. We need a
proper solution to this, not throw our hands up in the air and hope that
it's not going to hurt a whole lot of people.
I don't think it's for us to say what the plugin is allowed to do. We
decided on a plugin architecture, so we have to trust that the plugin
author resolves the issues. We can document them so those choices are
clear.
I don't think this is "plugin architecture" related. The output plugin
can't do right here, this has to be solved at a higher level.
- Andres
On 28 March 2017 at 15:38, Andres Freund <andres@anarazel.de> wrote:
On 2017-03-28 15:32:49 +0100, Simon Riggs wrote:
On 28 March 2017 at 03:53, Craig Ringer <craig@2ndquadrant.com> wrote:
On 28 March 2017 at 09:25, Andres Freund <andres@anarazel.de> wrote:
If you actually need separate decoding of 2PC, then you want to wait for
the PREPARE to be replicated. If that replication has to wait for the
to-be-replicated prepared transaction to commit prepared, and commit
prepare will only happen once replication happened...
In other words, the output plugin cannot decode a transaction at
PREPARE TRANSACTION time if that xact holds an AccessExclusiveLock on
a catalog relation we must be able to read in order to decode the
xact.
Yes, I understand.
The decoding plugin can choose to enable lock_timeout, or it can
choose to wait for manual resolution, or it could automatically abort
such a transaction to avoid needing to decode it.
That doesn't solve the problem. You're still left with replication that
can't progress. I think that's completely unacceptable. We need a
proper solution to this, not throw our hands up in the air and hope that
it's not going to hurt a whole lot of people.
Nobody is throwing their hands in the air, nobody is just hoping. The
concern raised is real and needs to be handled somewhere; the only
point of discussion is where and how.
I don't think it's for us to say what the plugin is allowed to do. We
decided on a plugin architecture, so we have to trust that the plugin
author resolves the issues. We can document them so those choices are
clear.
I don't think this is "plugin architecture" related. The output plugin
can't do right here, this has to be solved at a higher level.
That assertion is obviously false... the plugin can resolve this in
various ways, if we allow it.
You can say that in your opinion you prefer to see this handled in
some higher level way, though it would be good to hear why and how.
Bottom line here is we shouldn't reject this patch on this point,
especially since any resource issue found during decoding could
similarly prevent progress with decoding.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2017-03-28 15:55:15 +0100, Simon Riggs wrote:
On 28 March 2017 at 15:38, Andres Freund <andres@anarazel.de> wrote:
On 2017-03-28 15:32:49 +0100, Simon Riggs wrote:
On 28 March 2017 at 03:53, Craig Ringer <craig@2ndquadrant.com> wrote:
On 28 March 2017 at 09:25, Andres Freund <andres@anarazel.de> wrote:
If you actually need separate decoding of 2PC, then you want to wait for
the PREPARE to be replicated. If that replication has to wait for the
to-be-replicated prepared transaction to commit prepared, and commit
prepare will only happen once replication happened...
In other words, the output plugin cannot decode a transaction at
PREPARE TRANSACTION time if that xact holds an AccessExclusiveLock on
a catalog relation we must be able to read in order to decode the
xact.
Yes, I understand.
The decoding plugin can choose to enable lock_timeout, or it can
choose to wait for manual resolution, or it could automatically abort
such a transaction to avoid needing to decode it.
That doesn't solve the problem. You're still left with replication that
can't progress. I think that's completely unacceptable. We need a
proper solution to this, not throw our hands up in the air and hope that
it's not going to hurt a whole lot of people.
Nobody is throwing their hands in the air, nobody is just hoping. The
concern raised is real and needs to be handled somewhere; the only
point of discussion is where and how.
I don't think it's for us to say what the plugin is allowed to do. We
decided on a plugin architecture, so we have to trust that the plugin
author resolves the issues. We can document them so those choices are
clear.
I don't think this is "plugin architecture" related. The output plugin
can't do right here, this has to be solved at a higher level.
That assertion is obviously false... the plugin can resolve this in
various ways, if we allow it.
Handling it by breaking replication isn't handling it (e.g. timeouts in
decoding etc). Handling it by rolling back *prepared* transactions
(which are supposed to be guaranteed to succeed!), isn't either.
You can say that in your opinion you prefer to see this handled in
some higher level way, though it would be good to hear why and how.
It's pretty obvious why: A bit of DDL by the user shouldn't lead to the
issues mentioned above.
Bottom line here is we shouldn't reject this patch on this point,
I think it definitely has to be rejected because of that. And I didn't
bring this up at the last minute, I repeatedly brought it up before.
Both to Craig and Stas.
One way to fix this would be to allow decoding to acquire such locks
(i.e. locks held by the prepared xact we're decoding) - there
unfortunately are some practical issues with that (e.g. the locking code
doesn't necessarily expect a second non-exclusive locker, when there's
an exclusive one), or we could add an exception to the locking code to
simply not acquire such locks.
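To make the second option concrete, here is a toy model of "an exception to the locking code": a conflict check that ignores locks held by the prepared xact currently being decoded. This is only a sketch under stated assumptions. ToyLockMode, ToyHeldLock and toy_would_block are hypothetical names, only two lock modes are modelled, and nothing here reflects how the real lock manager is structured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;	/* stand-in for the PostgreSQL typedef */

/* Only two modes are modelled; the real lock manager has more. */
typedef enum ToyLockMode
{
	TOY_ACCESS_SHARE,
	TOY_ACCESS_EXCLUSIVE
} ToyLockMode;

/* Hypothetical record of one lock held on one relation. */
typedef struct ToyHeldLock
{
	unsigned	relid;			/* arbitrary relation identifier */
	ToyLockMode mode;
	TransactionId holder_xid;	/* xact holding the lock */
} ToyHeldLock;

/* In this toy model, two modes conflict if either one is exclusive. */
static int
toy_modes_conflict(ToyLockMode a, ToyLockMode b)
{
	return a == TOY_ACCESS_EXCLUSIVE || b == TOY_ACCESS_EXCLUSIVE;
}

/*
 * Returns 1 if a request for (relid, want) would have to wait.
 * decoding_xid is the prepared xact being decoded, or 0 for none;
 * locks held by that xact are skipped (the hypothetical exemption).
 */
static int
toy_would_block(const ToyHeldLock *held, size_t nheld,
				unsigned relid, ToyLockMode want,
				TransactionId decoding_xid)
{
	size_t		i;

	assert(held != NULL || nheld == 0);
	for (i = 0; i < nheld; i++)
	{
		if (held[i].relid != relid)
			continue;
		if (decoding_xid != 0 && held[i].holder_xid == decoding_xid)
			continue;			/* exemption: our own prepared xact */
		if (toy_modes_conflict(held[i].mode, want))
			return 1;
	}
	return 0;
}
```

With one exclusive lock held by xid 700, a shared request blocks when decoding_xid is 0 or 701, and goes through when decoding_xid is 700.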
especially since any resource issue found during decoding could
similarly prevent progress with decoding.
For example?
- Andres
On 28 Mar. 2017 23:08, "Andres Freund" <andres@anarazel.de> wrote:
I don't think it's for us to say what the plugin is allowed to do. We
decided on a plugin architecture, so we have to trust that the plugin
author resolves the issues. We can document them so those choices are
clear.
I don't think this is "plugin architecture" related. The output plugin
can't do right here, this has to be solved at a higher level.
That assertion is obviously false... the plugin can resolve this in
various ways, if we allow it.
Handling it by breaking replication isn't handling it (e.g. timeouts in
decoding etc).
IMO, if it's a rare condition and we can abort decoding then recover
cleanly and succeed on retry, that's OK. Not dissimilar to the deadlock
detector. But right now that's not the case, it's possible (however
artificially) to create prepared xacts for which decoding will stall and
not succeed.
Handling it by rolling back *prepared* transactions
(which are supposed to be guaranteed to succeed!), isn't either.
I agree, we can't rely on anything for which the only way to continue is to
rollback a prepared xact.
You can say that in your opinion you prefer to see this handled in
some higher level way, though it would be good to hear why and how.
It's pretty obvious why: A bit of DDL by the user shouldn't lead to the
issues mentioned above.
I agree that it shouldn't, and in fact DDL is the main part of why I want
2PC decoding.
What's surprised me is that I haven't actually been able to create any
situations where, with test_decoding, we have such a failure. Not unless I
manually LOCK TABLE pg_attribute, anyway.
Notably, we already disallow prepared xacts that make changes to the
relfilenodemap, which covers a lot of the problem cases like CLUSTERing
system tables.
Bottom line here is we shouldn't reject this patch on this point,
I think it definitely has to be rejected because of that. And I didn't
bring this up at the last minute, I repeatedly brought it up before.
Both to Craig and Stas.
Yes, and I lost track of it while focusing on the catalog tuple visibility
issues. I warned Stas of this issue when he first mentioned an interest in
decoding of 2PC actually, but haven't kept a proper eye on it since.
Andres and I even discussed this back in the early BDR days, it's not new
and is part of why I poked Stas to try some DDL tests etc. The tests in the
initial patch didn't have enough coverage to trigger any issues - they
didn't actually test decoding of a 2pc xact while it was still in-progress
at all. But even once I added more tests I've actually been unable to
reproduce this in a realistic real world example.
Frankly I'm confused by that, since I would expect an AEL on some_table to
cause decoding of some_table to get stuck. It does not.
That doesn't mean we should accept failure cases and commit something with
holes in it. But it might inform our choices about how we solve those
issues.
One way to fix this would be to allow decoding to acquire such locks
(i.e. locks held by the prepared xact we're decoding) - there
unfortunately are some practical issues with that (e.g. the locking code
doesn't necessarily expect a second non-exclusive locker, when there's
an exclusive one), or we could add an exception to the locking code to
simply not acquire such locks.
I've been meaning to see if we can use the parallel infrastructure's
session leader infrastructure for this, by making the 2pc fake-proc a
leader and making our decoding session inherit its locks. I haven't dug
into it to see if it's even remotely practical yet, and won't be able to
until early pg11.
We could proceed with the caveat that decoding plugins that use 2pc support
must defer decoding of 2pc xacts containing ddl until commit prepared, or
must take responsibility for ensuring (via a command filter, etc) that
xacts are safe to decode and 2pc lock xacts during decoding. But we're
likely to change the interface for all that when we iterate for pg11 and
I'd rather not carry more BC than we have to. Also, the patch has unsolved
issues with how it keeps track of whether an xact was output at prepare
time or not and suppresses output at commit time.
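The deferral described here, together with the bookkeeping of whether an xact was already output at prepare time, can be sketched in a few lines. This is a simplified model, not the patch's code: ToyTxn, ToyEvent and the toy_* functions are hypothetical, the convention that the filter returns true to mean "skip at PREPARE time" is an assumption, and the GID copy is deliberately bounded by GIDSIZE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define GIDSIZE 200				/* same value as in the patch's xact.h */

/* Hypothetical, trimmed-down stand-in for ReorderBufferTXN. */
typedef struct ToyTxn
{
	char		gid[GIDSIZE];
	bool		has_catalog_changes;
	bool		prepared;		/* set once PREPARE was sent to the plugin */
} ToyTxn;

typedef enum ToyEvent
{
	TOY_NOTHING,				/* nothing is emitted at this point */
	TOY_PREPARE,				/* changes + PREPARE emitted */
	TOY_COMMIT_PREPARED,		/* standalone COMMIT PREPARED emitted */
	TOY_FULL_COMMIT				/* whole xact emitted as an ordinary commit */
} ToyEvent;

/* Assumed convention: true means "do not decode at PREPARE time". */
static bool
toy_filter_prepare(const ToyTxn *txn)
{
	return txn->has_catalog_changes;
}

/* Handle a PREPARE record for txn. */
static ToyEvent
toy_on_prepare(ToyTxn *txn, const char *gid)
{
	assert(txn != NULL && gid != NULL);
	/* bounded copy; snprintf always NUL-terminates within sizeof(gid) */
	snprintf(txn->gid, sizeof(txn->gid), "%s", gid);

	if (toy_filter_prepare(txn))
		return TOY_NOTHING;		/* deferred until COMMIT PREPARED */

	txn->prepared = true;
	return TOY_PREPARE;
}

/* Handle a COMMIT PREPARED record for txn. */
static ToyEvent
toy_on_commit_prepared(const ToyTxn *txn)
{
	return txn->prepared ? TOY_COMMIT_PREPARED : TOY_FULL_COMMIT;
}
```

In this model an xact with DDL produces nothing at PREPARE and is replayed in full at COMMIT PREPARED, while a plain xact is output once at PREPARE and only a standalone event follows at commit. The flag lives only in memory here, so the sketch does not address what happens if decoding restarts between the two records.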
I'm inclined to shelve the patch for Pg 10. We've only got a couple of days
left, the tests are still pretty minimal. We have open issues around
locking, less than totally satisfactory abort handling, and potential to
skip replay of transactions for both prepare and commit prepared. It's not
ready to go. However, it's definitely to the point where with a little more
work it'll be practical to patch into variants of Pg until we can
mainstream it in Pg 11, which is nice.
--
Craig Ringer
On 28 Mar 2017, at 18:08, Andres Freund <andres@anarazel.de> wrote:
On 2017-03-28 15:55:15 +0100, Simon Riggs wrote:
That assertion is obviously false... the plugin can resolve this in
various ways, if we allow it.
Handling it by breaking replication isn't handling it (e.g. timeouts in
decoding etc). Handling it by rolling back *prepared* transactions
(which are supposed to be guaranteed to succeed!), isn't either.
You can say that in your opinion you prefer to see this handled in
some higher level way, though it would be good to hear why and how.
It's pretty obvious why: A bit of DDL by the user shouldn't lead to the
issues mentioned above.
Bottom line here is we shouldn't reject this patch on this point,
I think it definitely has to be rejected because of that. And I didn't
bring this up at the last minute, I repeatedly brought it up before.
Both to Craig and Stas.
Okay. In order to find more realistic cases that block replication,
I’ve created the following setup:
* In the backend: the test_decoding plugin hooks into xact events and utility
statement hooks and transforms each commit into a prepare, then sleeps
on a latch. If the transaction contains DDL, the whole statement is pushed
into WAL as a transactional message. If the DDL cannot be prepared or cannot
be executed in a transaction block, then it goes as a nontransactional logical
message without prepare/decode injection. If the transaction didn’t issue any
DDL and didn’t write anything to WAL, then it skips 2PC too.
* After the prepare is decoded, the output plugin in the walsender unlocks the
backend, allowing it to proceed with commit prepared. So if decoding
tries to access a blocked catalog, everything should stop.
* A small Python script that consumes decoded WAL from the walsender (thanks
Craig and Petr).
After some small acrobatics with those hooks I’ve managed to run the whole
regression suite in parallel mode through such a setup of test_decoding
without any deadlocks. I’ve added two xact_events to postgres and
allowed prepare of transactions that touched temp tables, since
they are heavily used in tests and create a lot of noise in diffs.
So it boils down to 3 failed regression tests out of 177, namely:
* transactions.sql: here the commit of a tx gets stuck obtaining SafeSnapshot().
I didn’t look at what is happening there specifically, but just checked that
the walsender isn’t blocked. I’m going to look more closely at this.
* prepared_xacts.sql: here select prepared_xacts() sees our prepared
tx. It is possible to filter them out, but obviously it works as expected.
* guc.sql: here pendingActions arrives on 'DISCARD ALL', preventing the tx
from being prepared. I didn’t find a way to check for the presence of
pendingActions outside of async.c, so I decided to leave it as is.
It seems that, at least in the regression tests, nothing can block two-phase
logical decoding. Is that a strong enough argument for the hypothesis that the
current approach doesn’t create deadlocks, except for locks on catalogs, which
should be disallowed anyway?
Patches attached. logical_twophase_v5 is a slightly modified version of the
previous patch merged with Craig’s changes. The second file is a set of patches
over the previous one that implements the logic I’ve just described. There is a
runtest.sh script that sets up postgres, runs the Python logical consumer in the
background and starts the regression tests.
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
logical_twophase_v5.diff (application/octet-stream; x-unix-mode=0644)
diff --git a/contrib/test_decoding/expected/prepared.out b/contrib/test_decoding/expected/prepared.out
index 46e915d..56c6e72 100644
--- a/contrib/test_decoding/expected/prepared.out
+++ b/contrib/test_decoding/expected/prepared.out
@@ -6,19 +6,84 @@ SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_d
init
(1 row)
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc', 'test_decoding');
+ ?column?
+----------
+ init
+(1 row)
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+-- Reused queries
+\set get_no2pc 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot_2pc'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'');'
+\set get_with2pc 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'', ''twophase-decoding'', ''1'');'
+\set get_with2pc_nofilter 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'', ''twophase-decoding'', ''1'', ''twophase-decode-with-catalog-changes'', ''1'');'
-- test simple successful use of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (1);
PREPARE TRANSACTION 'test_prepared#1';
+:get_with2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ PREPARE TRANSACTION 'test_prepared#1'
+(3 rows)
+
+:get_no2pc
+ data
+------
+(0 rows)
+
COMMIT PREPARED 'test_prepared#1';
+:get_with2pc
+ data
+------
+(0 rows)
+
+:get_no2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ COMMIT
+(3 rows)
+
INSERT INTO test_prepared1 VALUES (2);
-- test abort of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (3);
PREPARE TRANSACTION 'test_prepared#2';
+:get_no2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+(3 rows)
+
+:get_with2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(6 rows)
+
ROLLBACK PREPARED 'test_prepared#2';
+:get_no2pc
+ data
+------
+(0 rows)
+
+:get_with2pc
+ data
+------
+(0 rows)
+
INSERT INTO test_prepared1 VALUES (4);
-- test prepared xact containing ddl
BEGIN;
@@ -26,49 +91,169 @@ INSERT INTO test_prepared1 VALUES (5);
ALTER TABLE test_prepared1 ADD COLUMN data text;
INSERT INTO test_prepared1 VALUES (6, 'frakbar');
PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
-INSERT INTO test_prepared2 VALUES (7);
-COMMIT PREPARED 'test_prepared#3';
--- make sure stuff still works
-INSERT INTO test_prepared1 VALUES (8);
-INSERT INTO test_prepared2 VALUES (9);
--- cleanup
-DROP TABLE test_prepared1;
-DROP TABLE test_prepared2;
--- show results
-SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
- data
--------------------------------------------------------------------------
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'test_prepared1'::regclass;
+ relation | locktype | mode
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
+:get_no2pc
+ data
+----------------------------------------------------
BEGIN
- table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:4
COMMIT
+(3 rows)
+
+:get_with2pc
+ data
+----------------------------------------------------
BEGIN
- table public.test_prepared1: INSERT: id[integer]:2
+ table public.test_prepared1: INSERT: id[integer]:4
COMMIT
+(3 rows)
+
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists. Our 2pc filter callback will skip decoding of xacts
+-- with catalog changes at PREPARE time, so we don't decode it now.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
+INSERT INTO test_prepared2 VALUES (7);
+:get_with2pc
+ data
+----------------------------------------------------
BEGIN
- table public.test_prepared1: INSERT: id[integer]:4
+ table public.test_prepared2: INSERT: id[integer]:7
COMMIT
+(3 rows)
+
+:get_no2pc
+ data
+----------------------------------------------------
BEGIN
table public.test_prepared2: INSERT: id[integer]:7
COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+:get_no2pc
+ data
+-------------------------------------------------------------------------
BEGIN
table public.test_prepared1: INSERT: id[integer]:5
table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
COMMIT
+(4 rows)
+
+:get_with2pc
+ data
+-------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:5
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3';
+ COMMIT PREPARED 'test_prepared#3';
+(5 rows)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (8);
+INSERT INTO test_prepared2 VALUES (9);
+:get_with2pc
+ data
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:9
+ COMMIT
+(6 rows)
+
+:get_no2pc
+ data
+--------------------------------------------------------------------
BEGIN
table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
COMMIT
BEGIN
table public.test_prepared2: INSERT: id[integer]:9
COMMIT
-(22 rows)
+(6 rows)
+
+-- If we do something that takes a strong lock on a catalog relation we need to
+-- read in order to decode a transaction we deadlock; we can't finish decoding
+-- until the lock is released, but we're waiting for decoding to finish so we
+-- can make a commit/abort decision.
+---
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'pg_class'::regclass;
+ relation | locktype | mode
+----------+----------+------
+(0 rows)
+-- Shouldn't see anything with 2pc decoding off
+:get_no2pc
+ data
+------
+(0 rows)
+
+-- If we try to decode it now we'll deadlock
+SET statement_timeout = '10s';
+:get_with2pc_nofilter
+-- FIXME we expect a timeout here, but it actually works...
+ERROR: statement timed out
+
+RESET statement_timeout;
+-- we can decode past it by skipping xacts with catalog changes
+-- and let it be decoded after COMMIT PREPARED, though.
+:get_with2pc
+ data
+------
+(0 rows)
+
+COMMIT PREPARED 'test_prepared_lock';
+-- Both will work normally after we commit
+:get_no2pc
+ data
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ COMMIT
+(4 rows)
+
+:get_with2pc
+ data
+------
+(0 rows)
+
+-- cleanup
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
SELECT pg_drop_replication_slot('regression_slot');
pg_drop_replication_slot
--------------------------
(1 row)
+SELECT pg_drop_replication_slot('regression_slot_2pc');
+ pg_drop_replication_slot
+--------------------------
+
+(1 row)
+
diff --git a/contrib/test_decoding/sql/prepared.sql b/contrib/test_decoding/sql/prepared.sql
index e726397..a94503c 100644
--- a/contrib/test_decoding/sql/prepared.sql
+++ b/contrib/test_decoding/sql/prepared.sql
@@ -1,22 +1,36 @@
-- predictability
SET synchronous_commit = on;
SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc', 'test_decoding');
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+
+-- Reused queries
+\set get_no2pc 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot_2pc'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'');'
+\set get_with2pc 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'', ''twophase-decoding'', ''1'');'
+\set get_with2pc_nofilter 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'', ''twophase-decoding'', ''1'', ''twophase-decode-with-catalog-changes'', ''1'');'
-- test simple successful use of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (1);
PREPARE TRANSACTION 'test_prepared#1';
+:get_with2pc
+:get_no2pc
COMMIT PREPARED 'test_prepared#1';
+:get_with2pc
+:get_no2pc
INSERT INTO test_prepared1 VALUES (2);
-- test abort of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (3);
PREPARE TRANSACTION 'test_prepared#2';
+:get_no2pc
+:get_with2pc
ROLLBACK PREPARED 'test_prepared#2';
+:get_no2pc
+:get_with2pc
INSERT INTO test_prepared1 VALUES (4);
@@ -27,24 +41,74 @@ ALTER TABLE test_prepared1 ADD COLUMN data text;
INSERT INTO test_prepared1 VALUES (6, 'frakbar');
PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'test_prepared1'::regclass;
+
+:get_no2pc
+:get_with2pc
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists. Our 2pc filter callback will skip decoding of xacts
+-- with catalog changes at PREPARE time, so we don't decode it now.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
INSERT INTO test_prepared2 VALUES (7);
+:get_with2pc
+:get_no2pc
COMMIT PREPARED 'test_prepared#3';
+:get_no2pc
+:get_with2pc
-- make sure stuff still works
INSERT INTO test_prepared1 VALUES (8);
INSERT INTO test_prepared2 VALUES (9);
+:get_with2pc
+:get_no2pc
+
+-- If we do something that takes a strong lock on a catalog relation we need to
+-- read in order to decode a transaction we deadlock; we can't finish decoding
+-- until the lock is released, but we're waiting for decoding to finish so we
+-- can make a commit/abort decision.
+---
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'pg_class'::regclass;
+
+-- Shouldn't see anything with 2pc decoding off
+:get_no2pc
+
+-- If we try to decode it now we'll deadlock
+SET statement_timeout = '10s';
+:get_with2pc_nofilter
+RESET statement_timeout;
+
+-- we can decode past it by skipping xacts with catalog changes
+-- and let it be decoded after COMMIT PREPARED, though.
+:get_with2pc
+
+COMMIT PREPARED 'test_prepared_lock';
+
+-- Both will work normally after we commit
+:get_no2pc
+:get_with2pc
-- cleanup
DROP TABLE test_prepared1;
DROP TABLE test_prepared2;
--- show results
-SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
-
SELECT pg_drop_replication_slot('regression_slot');
+SELECT pg_drop_replication_slot('regression_slot_2pc');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 21cfd67..0f0bb1b 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -24,6 +24,8 @@
#include "replication/message.h"
#include "replication/origin.h"
+#include "storage/procarray.h"
+
#include "utils/builtins.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
@@ -46,6 +48,8 @@ typedef struct
bool skip_empty_xacts;
bool xact_wrote_changes;
bool only_local;
+ bool twophase_decoding;
+ bool twophase_decode_with_catalog_changes;
} TestDecodingData;
static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -68,6 +72,19 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr message_lsn,
bool transactional, const char *prefix,
Size sz, const char *message);
+static bool pg_filter_prepare(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ char *gid);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
void
_PG_init(void)
@@ -85,9 +102,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
cb->begin_cb = pg_decode_begin_txn;
cb->change_cb = pg_decode_change;
cb->commit_cb = pg_decode_commit_txn;
+
cb->filter_by_origin_cb = pg_decode_filter;
cb->shutdown_cb = pg_decode_shutdown;
cb->message_cb = pg_decode_message;
+
+ cb->filter_prepare_cb = pg_filter_prepare;
+ cb->prepare_cb = pg_decode_prepare_txn;
+ cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+ cb->abort_prepared_cb = pg_decode_abort_prepared_txn;
}
@@ -107,6 +130,8 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
data->include_timestamp = false;
data->skip_empty_xacts = false;
data->only_local = false;
+ data->twophase_decoding = false;
+ data->twophase_decode_with_catalog_changes = false;
ctx->output_plugin_private = data;
@@ -176,6 +201,27 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
errmsg("could not parse value \"%s\" for parameter \"%s\"",
strVal(elem->arg), elem->defname)));
}
+ else if (strcmp(elem->defname, "twophase-decoding") == 0)
+ {
+
+ if (elem->arg == NULL)
+ data->twophase_decoding = true;
+ else if (!parse_bool(strVal(elem->arg), &data->twophase_decoding))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("could not parse value \"%s\" for parameter \"%s\"",
+ strVal(elem->arg), elem->defname)));
+ }
+ else if (strcmp(elem->defname, "twophase-decode-with-catalog-changes") == 0)
+ {
+ if (elem->arg == NULL)
+ data->twophase_decode_with_catalog_changes = true;
+ else if (!parse_bool(strVal(elem->arg), &data->twophase_decode_with_catalog_changes))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("could not parse value \"%s\" for parameter \"%s\"",
+ strVal(elem->arg), elem->defname)));
+ }
else
{
ereport(ERROR,
@@ -232,10 +278,163 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
return;
OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfoString(ctx->out, "COMMIT");
+
if (data->include_xids)
- appendStringInfo(ctx->out, "COMMIT %u", txn->xid);
- else
- appendStringInfoString(ctx->out, "COMMIT");
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+
+/* Filter out unnecessary two-phase transactions */
+static bool
+pg_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ char *gid)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ /* treat all transactions as one-phase */
+ if (!data->twophase_decoding)
+ return true;
+
+ /*
+ * Two-phase transactions that accessed the catalog require special
+ * treatment.
+ *
+ * Right now we don't have a safe way to decode catalog changes made in
+ * a prepared transaction that was already aborted by the time of
+ * decoding.
+ *
+ * That kind of problem arises only when we are trying to
+ * retrospectively decode aborted transactions with catalog changes -
+ * including if a transaction aborts while we're decoding it. If one
+ * wants to build distributed commit on top of prepare decoding, then
+ * commits/aborts will happen strictly after decoding has
+ * completed, so it is possible to skip any checks/locks here.
+ *
+ * We'll also get stuck trying to acquire locks on catalog relations
+ * we need for decoding if the prepared xact holds a strong lock on
+ * one of them and we also need to decode row changes.
+ */
+ if (txn->has_catalog_changes)
+ {
+ LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+
+ if (TransactionIdIsInProgress(txn->xid))
+ {
+ /*
+ * For the sake of simplicity, by default we just
+ * ignore in-progress prepared transactions with catalog
+ * changes in this extension. If they abort during
+ * decoding then tuples we need to decode them may be
+ * overwritten while we're still decoding, causing
+ * wrong catalog lookups.
+ *
+ * It is possible to move that LWLockRelease() to
+ * pg_decode_prepare_txn() and allow decoding of
+ * running prepared tx, but such lock will prevent any
+ * 2pc transaction commit during decoding time. That
+ * can be a long time in case of lots of
+ * changes/inserts in that tx or if the downstream is
+ * slow/unresponsive.
+ *
+ * (Continuing to decode without the lock is unsafe, XXX)
+ */
+ LWLockRelease(TwoPhaseStateLock);
+ return !data->twophase_decode_with_catalog_changes;
+ }
+ else if (TransactionIdDidAbort(txn->xid))
+ {
+ /*
+ * Here we know that it is already aborted and there is
+ * not much sense in doing something with this
+ * transaction. Consequently ABORT PREPARED will be
+ * suppressed.
+ */
+ LWLockRelease(TwoPhaseStateLock);
+ return true;
+ }
+
+ LWLockRelease(TwoPhaseStateLock);
+ }
+
+ return false;
+}
+
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* ABORT PREPARED callback */
+static void
+pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "ABORT PREPARED %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
if (data->include_timestamp)
appendStringInfo(ctx->out, " (at %s)",
diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 735f8c5..ed75503 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -100,8 +100,13 @@ ParseCommitRecord(uint8 info, xl_xact_commit *xlrec, xl_xact_parsed_commit *pars
xl_xact_twophase *xl_twophase = (xl_xact_twophase *) data;
parsed->twophase_xid = xl_twophase->xid;
-
data += sizeof(xl_xact_twophase);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_GID)
+ {
+ strcpy(parsed->twophase_gid, data);
+ data += strlen(parsed->twophase_gid) + 1;
+ }
}
if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
@@ -139,6 +144,16 @@ ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_abort *parsed)
data += sizeof(xl_xact_xinfo);
}
+ if (parsed->xinfo & XACT_XINFO_HAS_DBINFO)
+ {
+ xl_xact_dbinfo *xl_dbinfo = (xl_xact_dbinfo *) data;
+
+ parsed->dbId = xl_dbinfo->dbId;
+ parsed->tsId = xl_dbinfo->tsId;
+
+ data += sizeof(xl_xact_dbinfo);
+ }
+
if (parsed->xinfo & XACT_XINFO_HAS_SUBXACTS)
{
xl_xact_subxacts *xl_subxacts = (xl_xact_subxacts *) data;
@@ -166,8 +181,26 @@ ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_abort *parsed)
xl_xact_twophase *xl_twophase = (xl_xact_twophase *) data;
parsed->twophase_xid = xl_twophase->xid;
-
data += sizeof(xl_xact_twophase);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_GID)
+ {
+ strcpy(parsed->twophase_gid, data);
+ data += strlen(parsed->twophase_gid) + 1;
+ }
+ }
+
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ xl_xact_origin xl_origin;
+
+ /* we're only guaranteed 4 byte alignment, so copy onto stack */
+ memcpy(&xl_origin, data, sizeof(xl_origin));
+
+ parsed->origin_lsn = xl_origin.origin_lsn;
+ parsed->origin_timestamp = xl_origin.origin_timestamp;
+
+ data += sizeof(xl_xact_origin);
}
}
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 83169cc..b58b9a3 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -129,7 +129,6 @@ int max_prepared_xacts = 0;
* Note that the max value of GIDSIZE must fit in the uint16 gidlen,
* specified in TwoPhaseFileHeader.
*/
-#define GIDSIZE 200
typedef struct GlobalTransactionData
{
@@ -187,12 +186,14 @@ static void RecordTransactionCommitPrepared(TransactionId xid,
RelFileNode *rels,
int ninvalmsgs,
SharedInvalidationMessage *invalmsgs,
- bool initfileinval);
+ bool initfileinval,
+ const char *gid);
static void RecordTransactionAbortPrepared(TransactionId xid,
int nchildren,
TransactionId *children,
int nrels,
- RelFileNode *rels);
+ RelFileNode *rels,
+ const char *gid);
static void ProcessRecords(char *bufptr, TransactionId xid,
const TwoPhaseCallback callbacks[]);
static void RemoveGXact(GlobalTransaction gxact);
@@ -854,7 +855,7 @@ TwoPhaseGetDummyProc(TransactionId xid)
/*
* Header for a 2PC state file
*/
-#define TWOPHASE_MAGIC 0x57F94533 /* format identifier */
+#define TWOPHASE_MAGIC 0x57F94534 /* format identifier */
typedef struct TwoPhaseFileHeader
{
@@ -870,6 +871,8 @@ typedef struct TwoPhaseFileHeader
int32 ninvalmsgs; /* number of cache invalidation messages */
bool initfileinval; /* does relcache init file need invalidation? */
uint16 gidlen; /* length of the GID - GID follows the header */
+ XLogRecPtr origin_lsn; /* lsn of this record at origin node */
+ TimestampTz origin_timestamp; /* time of prepare at origin node */
} TwoPhaseFileHeader;
/*
@@ -1021,6 +1024,7 @@ EndPrepare(GlobalTransaction gxact)
{
TwoPhaseFileHeader *hdr;
StateFileChunk *record;
+ bool replorigin;
/* Add the end sentinel to the list of 2PC records */
RegisterTwoPhaseRecord(TWOPHASE_RM_END_ID, 0,
@@ -1031,6 +1035,21 @@ EndPrepare(GlobalTransaction gxact)
Assert(hdr->magic == TWOPHASE_MAGIC);
hdr->total_len = records.total_len + sizeof(pg_crc32c);
+ replorigin = (replorigin_session_origin != InvalidRepOriginId &&
+ replorigin_session_origin != DoNotReplicateId);
+
+ if (replorigin)
+ {
+ Assert(replorigin_session_origin_lsn != InvalidXLogRecPtr);
+ hdr->origin_lsn = replorigin_session_origin_lsn;
+ hdr->origin_timestamp = replorigin_session_origin_timestamp;
+ }
+ else
+ {
+ hdr->origin_lsn = InvalidXLogRecPtr;
+ hdr->origin_timestamp = 0;
+ }
+
/*
* If the data size exceeds MaxAllocSize, we won't be able to read it in
* ReadTwoPhaseFile. Check for that now, rather than fail in the case
@@ -1061,9 +1080,19 @@ EndPrepare(GlobalTransaction gxact)
MyPgXact->delayChkpt = true;
XLogBeginInsert();
+
for (record = records.head; record != NULL; record = record->next)
XLogRegisterData(record->data, record->len);
+
+ XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
+
gxact->prepare_end_lsn = XLogInsert(RM_XACT_ID, XLOG_XACT_PREPARE);
+
+ if (replorigin)
+ /* Move LSNs forward for this replication origin */
+ replorigin_session_advance(replorigin_session_origin_lsn,
+ gxact->prepare_end_lsn);
+
XLogFlush(gxact->prepare_end_lsn);
/* If we crash now, we have prepared: WAL replay will fix things */
@@ -1239,6 +1268,43 @@ ReadTwoPhaseFile(TransactionId xid, bool give_warnings)
return buf;
}
+/*
+ * ParsePrepareRecord
+ */
+void
+ParsePrepareRecord(uint8 info, char *xlrec, xl_xact_parsed_prepare *parsed)
+{
+ TwoPhaseFileHeader *hdr;
+ char *bufptr;
+
+ hdr = (TwoPhaseFileHeader *) xlrec;
+ bufptr = xlrec + MAXALIGN(sizeof(TwoPhaseFileHeader));
+
+ parsed->origin_lsn = hdr->origin_lsn;
+ parsed->origin_timestamp = hdr->origin_timestamp;
+ parsed->twophase_xid = hdr->xid;
+ parsed->dbId = hdr->database;
+ parsed->nsubxacts = hdr->nsubxacts;
+ parsed->ncommitrels = hdr->ncommitrels;
+ parsed->nabortrels = hdr->nabortrels;
+ parsed->nmsgs = hdr->ninvalmsgs;
+
+ strncpy(parsed->twophase_gid, bufptr, hdr->gidlen);
+ bufptr += MAXALIGN(hdr->gidlen);
+
+ parsed->subxacts = (TransactionId *) bufptr;
+ bufptr += MAXALIGN(hdr->nsubxacts * sizeof(TransactionId));
+
+ parsed->commitrels = (RelFileNode *) bufptr;
+ bufptr += MAXALIGN(hdr->ncommitrels * sizeof(RelFileNode));
+
+ parsed->abortrels = (RelFileNode *) bufptr;
+ bufptr += MAXALIGN(hdr->nabortrels * sizeof(RelFileNode));
+
+ parsed->msgs = (SharedInvalidationMessage *) bufptr;
+ bufptr += MAXALIGN(hdr->ninvalmsgs * sizeof(SharedInvalidationMessage));
+}
+
/*
* Reads 2PC data from xlog. During checkpoint this data will be moved to
@@ -1392,11 +1458,12 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
hdr->nsubxacts, children,
hdr->ncommitrels, commitrels,
hdr->ninvalmsgs, invalmsgs,
- hdr->initfileinval);
+ hdr->initfileinval, gid);
else
RecordTransactionAbortPrepared(xid,
hdr->nsubxacts, children,
- hdr->nabortrels, abortrels);
+ hdr->nabortrels, abortrels,
+ gid);
ProcArrayRemove(proc, latestXid);
@@ -2055,7 +2122,8 @@ RecordTransactionCommitPrepared(TransactionId xid,
RelFileNode *rels,
int ninvalmsgs,
SharedInvalidationMessage *invalmsgs,
- bool initfileinval)
+ bool initfileinval,
+ const char *gid)
{
XLogRecPtr recptr;
TimestampTz committs = GetCurrentTimestamp();
@@ -2082,7 +2150,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
ninvalmsgs, invalmsgs,
initfileinval, false,
MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
- xid);
+ xid, gid);
if (replorigin)
@@ -2144,7 +2212,8 @@ RecordTransactionAbortPrepared(TransactionId xid,
int nchildren,
TransactionId *children,
int nrels,
- RelFileNode *rels)
+ RelFileNode *rels,
+ const char *gid)
{
XLogRecPtr recptr;
@@ -2166,7 +2235,7 @@ RecordTransactionAbortPrepared(TransactionId xid,
nchildren, children,
nrels, rels,
MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
- xid);
+ xid, gid);
/* Always flush, since we're about to remove the 2PC state file */
XLogFlush(recptr);
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index c8751c6..9e407d5 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1233,7 +1233,7 @@ RecordTransactionCommit(void)
nmsgs, invalMessages,
RelcacheInitFileInval, forceSyncCommit,
MyXactFlags,
- InvalidTransactionId /* plain commit */ );
+ InvalidTransactionId, NULL /* plain commit */ );
if (replorigin)
/* Move LSNs forward for this replication origin */
@@ -1585,7 +1585,8 @@ RecordTransactionAbort(bool isSubXact)
XactLogAbortRecord(xact_time,
nchildren, children,
nrels, rels,
- MyXactFlags, InvalidTransactionId);
+ MyXactFlags, InvalidTransactionId,
+ NULL);
/*
* Report the latest async abort LSN, so that the WAL writer knows to
@@ -3471,7 +3472,7 @@ BeginTransactionBlock(void)
* resource owner, etc while executing inside a Portal.
*/
bool
-PrepareTransactionBlock(char *gid)
+PrepareTransactionBlock(const char *gid)
{
TransactionState s;
bool result;
@@ -5110,7 +5111,8 @@ XactLogCommitRecord(TimestampTz commit_time,
int nrels, RelFileNode *rels,
int nmsgs, SharedInvalidationMessage *msgs,
bool relcacheInval, bool forceSync,
- int xactflags, TransactionId twophase_xid)
+ int xactflags, TransactionId twophase_xid,
+ const char *twophase_gid)
{
xl_xact_commit xlrec;
xl_xact_xinfo xl_xinfo;
@@ -5122,6 +5124,7 @@ XactLogCommitRecord(TimestampTz commit_time,
xl_xact_origin xl_origin;
uint8 info;
+ int gidlen = 0;
Assert(CritSectionCount > 0);
@@ -5184,6 +5187,13 @@ XactLogCommitRecord(TimestampTz commit_time,
{
xl_xinfo.xinfo |= XACT_XINFO_HAS_TWOPHASE;
xl_twophase.xid = twophase_xid;
+ Assert(twophase_gid != NULL);
+
+ if (XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_GID;
+ gidlen = strlen(twophase_gid) + 1; /* include '\0' */
+ }
}
/* dump transaction origin information */
@@ -5234,8 +5244,13 @@ XactLogCommitRecord(TimestampTz commit_time,
}
if (xl_xinfo.xinfo & XACT_XINFO_HAS_TWOPHASE)
+ {
XLogRegisterData((char *) (&xl_twophase), sizeof(xl_xact_twophase));
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_GID)
+ XLogRegisterData((char *) twophase_gid, gidlen);
+ }
+
if (xl_xinfo.xinfo & XACT_XINFO_HAS_ORIGIN)
XLogRegisterData((char *) (&xl_origin), sizeof(xl_xact_origin));
@@ -5255,15 +5270,19 @@ XLogRecPtr
XactLogAbortRecord(TimestampTz abort_time,
int nsubxacts, TransactionId *subxacts,
int nrels, RelFileNode *rels,
- int xactflags, TransactionId twophase_xid)
+ int xactflags, TransactionId twophase_xid,
+ const char *twophase_gid)
{
xl_xact_abort xlrec;
xl_xact_xinfo xl_xinfo;
xl_xact_subxacts xl_subxacts;
xl_xact_relfilenodes xl_relfilenodes;
xl_xact_twophase xl_twophase;
+ xl_xact_dbinfo xl_dbinfo;
+ xl_xact_origin xl_origin;
uint8 info;
+ int gidlen = 0;
Assert(CritSectionCount > 0);
@@ -5275,7 +5294,6 @@ XactLogAbortRecord(TimestampTz abort_time,
else
info = XLOG_XACT_ABORT_PREPARED;
-
/* First figure out and collect all the information needed */
xlrec.xact_time = abort_time;
@@ -5299,6 +5317,31 @@ XactLogAbortRecord(TimestampTz abort_time,
{
xl_xinfo.xinfo |= XACT_XINFO_HAS_TWOPHASE;
xl_twophase.xid = twophase_xid;
+ Assert(twophase_gid != NULL);
+
+ if (XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_GID;
+ gidlen = strlen(twophase_gid) + 1; /* include '\0' */
+ }
+ }
+
+ if (TransactionIdIsValid(twophase_xid) && XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_DBINFO;
+ xl_dbinfo.dbId = MyDatabaseId;
+ xl_dbinfo.tsId = MyDatabaseTableSpace;
+ }
+
+ /* dump transaction origin information only for abort prepared */
+ if ( (replorigin_session_origin != InvalidRepOriginId) &&
+ TransactionIdIsValid(twophase_xid) &&
+ XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_ORIGIN;
+
+ xl_origin.origin_lsn = replorigin_session_origin_lsn;
+ xl_origin.origin_timestamp = replorigin_session_origin_timestamp;
}
if (xl_xinfo.xinfo != 0)
@@ -5313,6 +5356,9 @@ XactLogAbortRecord(TimestampTz abort_time,
if (xl_xinfo.xinfo != 0)
XLogRegisterData((char *) (&xl_xinfo), sizeof(xl_xinfo));
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_DBINFO)
+ XLogRegisterData((char *) (&xl_dbinfo), sizeof(xl_dbinfo));
+
if (xl_xinfo.xinfo & XACT_XINFO_HAS_SUBXACTS)
{
XLogRegisterData((char *) (&xl_subxacts),
@@ -5330,8 +5376,19 @@ XactLogAbortRecord(TimestampTz abort_time,
}
if (xl_xinfo.xinfo & XACT_XINFO_HAS_TWOPHASE)
+ {
XLogRegisterData((char *) (&xl_twophase), sizeof(xl_xact_twophase));
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_GID)
+ XLogRegisterData((char *) twophase_gid, gidlen);
+ }
+
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_ORIGIN)
+ XLogRegisterData((char *) (&xl_origin), sizeof(xl_xact_origin));
+
+ if (TransactionIdIsValid(twophase_xid))
+ XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
+
return XLogInsert(RM_XACT_ID, info);
}
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5c13d26..b1e39c55 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -34,6 +34,7 @@
#include "access/xlogutils.h"
#include "access/xlogreader.h"
#include "access/xlogrecord.h"
+#include "access/twophase.h"
#include "catalog/pg_control.h"
@@ -71,7 +72,9 @@ static void DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf
static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_commit *parsed, TransactionId xid);
static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
- xl_xact_parsed_abort *parsed, TransactionId xid);
+ xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare *parsed);
/* common function to decode tuples */
static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -277,17 +280,33 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
break;
}
case XLOG_XACT_PREPARE:
+ {
+ xl_xact_parsed_prepare parsed;
- /*
- * Currently decoding ignores PREPARE TRANSACTION and will just
- * decode the transaction when the COMMIT PREPARED is sent or
- * throw away the transaction's contents when a ROLLBACK PREPARED
- * is received. In the future we could add code to expose prepared
- * transactions in the changestream allowing for a kind of
- * distributed 2PC.
- */
- ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
- break;
+ /* check that the output plugin is capable of two-phase decoding */
+ if (!ctx->twophase_hadling)
+ {
+ ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+ break;
+ }
+
+ /* ok, parse it */
+ ParsePrepareRecord(XLogRecGetInfo(buf->record),
+ XLogRecGetData(buf->record), &parsed);
+
+ /* does the output plugin want to skip this particular PREPARE? */
+ if (ctx->callbacks.filter_prepare_cb &&
+ ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+ parsed.twophase_gid))
+ {
+ ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+ buf->origptr);
+ break;
+ }
+
+ DecodePrepare(ctx, buf, &parsed);
+ break;
+ }
default:
elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
}
@@ -551,8 +570,13 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
* Process invalidation messages, even if we're not interested in the
* transaction's contents, since the various caches need to always be
* consistent.
+ *
+ * Also, if this transaction was already sent to the prepare callback,
+ * both of the functions below were already called during PREPARE.
*/
- if (parsed->nmsgs > 0)
+ if (parsed->nmsgs > 0 &&
+ !(TransactionIdIsValid(parsed->twophase_xid) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder, xid)))
{
ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
parsed->nmsgs, parsed->msgs);
@@ -607,9 +631,81 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
buf->origptr, buf->endptr);
}
+ if (TransactionIdIsValid(parsed->twophase_xid) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder, xid))
+ {
+ /*
+ * We are processing COMMIT PREPARED and know that the reorder buffer is
+ * empty, so we can use a shortcut for committing a bare xact.
+ */
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn, parsed->twophase_gid, true);
+ }
+ else
+ {
+ /* replay actions of all transaction + subtransactions in order */
+ ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn);
+ }
+}
+
+
+/*
+ * Decode PREPARE record. Same logic as in COMMIT, but different calls
+ * to SnapshotBuilder as we need to mark this transaction as committed
+ * instead of running to properly decode it. When the prepared transaction
+ * is decoded we mark it in the snapshot as running again.
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare *parsed)
+{
+ XLogRecPtr origin_lsn = parsed->origin_lsn;
+ TimestampTz commit_time = parsed->origin_timestamp;
+ RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+ int i;
+ TransactionId xid = parsed->twophase_xid;
+
+ /*
+ * Process invalidation messages, even if we're not interested in the
+ * transaction's contents, since the various caches need to always be
+ * consistent.
+ */
+ if (parsed->nmsgs > 0)
+ {
+ ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
+ parsed->nmsgs, parsed->msgs);
+ ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+ }
+
+ SnapBuildPrepareTxnStart(ctx->snapshot_builder, buf->origptr, xid,
+ parsed->nsubxacts, parsed->subxacts);
+
+ if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+ (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+ FilterByOrigin(ctx, origin_id))
+ {
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr);
+ }
+ ReorderBufferForget(ctx->reorder, xid, buf->origptr);
+
+ return;
+ }
+
+ /* tell the reorderbuffer about the surviving subtransactions */
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+ buf->origptr, buf->endptr);
+ }
+
/* replay actions of all transaction + subtransactions in order */
- ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
- commit_time, origin_id, origin_lsn);
+ ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn, parsed->twophase_gid);
+
+ SnapBuildPrepareTxnFinish(ctx->snapshot_builder, xid);
}
/*
@@ -621,6 +717,30 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid)
{
int i;
+ XLogRecPtr origin_lsn = InvalidXLogRecPtr;
+ TimestampTz commit_time = 0;
+ RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ origin_lsn = parsed->origin_lsn;
+ commit_time = parsed->origin_timestamp;
+ }
+
+ /*
+ * If this is ROLLBACK PREPARED, then send it to the callbacks.
+ */
+ if (TransactionIdIsValid(xid) &&
+ !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+ parsed->dbId == ctx->slot->data.database &&
+ !FilterByOrigin(ctx, origin_id) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder, xid))
+ {
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid, false);
+ return;
+ }
SnapBuildAbortTxn(ctx->snapshot_builder, buf->record->EndRecPtr, xid,
parsed->nsubxacts, parsed->subxacts);
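To make the XLOG_XACT_PREPARE branch above easier to review, here is a minimal standalone sketch of the decision it takes. It is a model only, not code from the patch: `PrepareAction`, `FilterPrepareFn`, `choose_prepare_action()` and `skip_underscore_gids()` are hypothetical names, and the filter is reduced to a function of the GID.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcome labels; these are not PostgreSQL identifiers. */
typedef enum
{
	PREPARE_DEFER_TO_COMMIT,	/* replay everything at COMMIT PREPARED */
	PREPARE_DECODE_NOW			/* replay changes at PREPARE time */
} PrepareAction;

/* Stand-in for the optional filter_prepare callback; true means "skip". */
typedef bool (*FilterPrepareFn) (const char *gid);

/*
 * Mirror of the branch order in the PREPARE case: first the plugin
 * capability check, then the optional filter, and only then decoding.
 */
PrepareAction
choose_prepare_action(bool plugin_has_twophase, FilterPrepareFn filter,
					  const char *gid)
{
	if (!plugin_has_twophase)
		return PREPARE_DEFER_TO_COMMIT;

	if (filter != NULL && filter(gid))
		return PREPARE_DEFER_TO_COMMIT;

	return PREPARE_DECODE_NOW;
}

/* Example filter (arbitrary rule): skip GIDs that start with '_'. */
bool
skip_underscore_gids(const char *gid)
{
	return gid[0] == '_';
}
```

With this model, a plugin without the three two-phase callbacks always gets the old behaviour, and a plugin with them can still opt out per transaction through the filter.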
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 5529ac8..9a66194 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -58,6 +58,14 @@ static void startup_cb_wrapper(LogicalDecodingContext *ctx, OutputPluginOptions
bool is_init);
static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -122,6 +130,7 @@ StartupDecodingContext(List *output_plugin_options,
MemoryContext context,
old_context;
LogicalDecodingContext *ctx;
+ int twophase_callbacks;
/* shorter lines... */
slot = MyReplicationSlot;
@@ -179,8 +188,25 @@ StartupDecodingContext(List *output_plugin_options,
ctx->reorder->begin = begin_cb_wrapper;
ctx->reorder->apply_change = change_cb_wrapper;
ctx->reorder->commit = commit_cb_wrapper;
+ ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+ ctx->reorder->prepare = prepare_cb_wrapper;
+ ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+ ctx->reorder->abort_prepared = abort_prepared_cb_wrapper;
ctx->reorder->message = message_cb_wrapper;
+ /* check that plugin implements all necessary callbacks to perform 2PC */
+ twophase_callbacks = (ctx->callbacks.prepare_cb != NULL) +
+ (ctx->callbacks.commit_prepared_cb != NULL) +
+ (ctx->callbacks.abort_prepared_cb != NULL);
+
+ ctx->twophase_hadling = (twophase_callbacks == 3);
+
+ if (twophase_callbacks != 3 && twophase_callbacks != 0)
+ ereport(WARNING,
+ (errmsg("Output plugin registered only %d twophase callbacks out of 3. "
+ "Twophase transactions will be decoded as ordinary ones.",
+ twophase_callbacks)));
+
ctx->out = makeStringInfo();
ctx->prepare_write = prepare_write;
ctx->write = do_write;
@@ -650,6 +676,93 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
}
static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "prepare";
+ state.report_location = txn->final_lsn; /* beginning of prepare record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "commit_prepared";
+ state.report_location = txn->final_lsn; /* beginning of commit record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "abort_prepared";
+ state.report_location = txn->final_lsn; /* beginning of abort record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.abort_prepared_cb(ctx, txn, abort_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change)
{
@@ -684,6 +797,34 @@ change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
error_context_stack = errcallback.previous;
}
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn, char *gid)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+ bool ret;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "filter_prepare";
+ state.report_location = InvalidXLogRecPtr;
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = false;
+
+ /* do the actual work: call callback */
+ ret = ctx->callbacks.filter_prepare_cb(ctx, txn, gid);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+ return ret;
+}
+
bool
filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
{
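The all-or-nothing check on the two-phase callbacks in StartupDecodingContext() can be illustrated in isolation. The following is a sketch under assumptions: `TwoPhaseCallbacks`, `TwoPhaseSupport`, `classify_twophase_support()` and `noop_callback()` are hypothetical stand-ins, and the real callbacks take arguments that are omitted here.

```c
#include <stddef.h>

/* Simplified callback type; the real callbacks take ctx, txn and an LSN. */
typedef void (*Callback) (void);

/* Hypothetical subset of the plugin callback table. */
typedef struct
{
	Callback	prepare_cb;
	Callback	commit_prepared_cb;
	Callback	abort_prepared_cb;
} TwoPhaseCallbacks;

typedef enum
{
	TWOPHASE_NONE,		/* no callbacks: decode 2PC as ordinary xacts */
	TWOPHASE_PARTIAL,	/* some callbacks: warn, then decode as ordinary */
	TWOPHASE_FULL		/* all three callbacks: two-phase decoding enabled */
} TwoPhaseSupport;

/* Count the registered callbacks the same way the patch does. */
TwoPhaseSupport
classify_twophase_support(const TwoPhaseCallbacks *cb)
{
	int			registered = (cb->prepare_cb != NULL) +
		(cb->commit_prepared_cb != NULL) +
		(cb->abort_prepared_cb != NULL);

	if (registered == 3)
		return TWOPHASE_FULL;
	if (registered == 0)
		return TWOPHASE_NONE;
	return TWOPHASE_PARTIAL;
}

/* Dummy callback used only to populate the table in examples. */
void
noop_callback(void)
{
}
```

Only TWOPHASE_FULL corresponds to the flag being set in the context; the partial case maps to the WARNING path.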
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b437799..0501033 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1308,25 +1308,18 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
* the top and subtransactions (using a k-way merge) and replay the changes in
* lsn order.
*/
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+ ReorderBuffer *rb, TransactionId xid,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
TimestampTz commit_time,
RepOriginId origin_id, XLogRecPtr origin_lsn)
{
- ReorderBufferTXN *txn;
volatile Snapshot snapshot_now;
volatile CommandId command_id = FirstCommandId;
bool using_subtxn;
ReorderBufferIterTXNState *volatile iterstate = NULL;
- txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
- false);
-
- /* unknown transaction, nothing to replay */
- if (txn == NULL)
- return;
-
txn->final_lsn = commit_lsn;
txn->end_lsn = end_lsn;
txn->commit_time = commit_time;
@@ -1605,8 +1598,11 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
ReorderBufferIterTXNFinish(rb, iterstate);
iterstate = NULL;
- /* call commit callback */
- rb->commit(rb, txn, commit_lsn);
+ /* call commit or prepare callback */
+ if (txn->prepared)
+ rb->prepare(rb, txn, commit_lsn);
+ else
+ rb->commit(rb, txn, commit_lsn);
/* this is just a sanity check against bad output plugin behaviour */
if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -1633,8 +1629,12 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
if (snapshot_now->copied)
ReorderBufferFreeSnap(rb, snapshot_now);
- /* remove potential on-disk data, and deallocate */
- ReorderBufferCleanupTXN(rb, txn);
+ /*
+ * remove potential on-disk data, and deallocate or postpone that
+ * till the finish of two-phase tx
+ */
+ if (!txn->prepared)
+ ReorderBufferCleanupTXN(rb, txn);
}
PG_CATCH();
{
@@ -1668,6 +1668,119 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
}
/*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as one-phase later on commit.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+ return rb->filter_prepare(rb, txn, gid);
+}
+
+
+/*
+ * Commit non-twophase transaction. See comments to ReorderBufferCommitInternal()
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /* unknown transaction, nothing to replay */
+ if (txn == NULL)
+ return;
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare twophase transaction. It calls ReorderBufferCommitInternal()
+ * since all transaction changes should be decoded on PREPARE.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /* unknown transaction, nothing to replay */
+ if (txn == NULL)
+ return;
+
+ txn->prepared = true;
+ strcpy(txn->gid, gid);
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Check whether this transaction was sent as prepared to receiver.
+ * Called upon commit/abort prepared.
+ */
+bool
+ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /*
+ * If txn == NULL then presumably the subscriber confirmed the prepare
+ * but we have been restarted since.
+ */
+ return txn == NULL ? true : txn->prepared;
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ABORT PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+ true);
+
+ txn->final_lsn = commit_lsn;
+ txn->end_lsn = end_lsn;
+ txn->commit_time = commit_time;
+ txn->origin_id = origin_id;
+ txn->origin_lsn = origin_lsn;
+ strcpy(txn->gid, gid);
+
+ if (is_commit)
+ rb->commit_prepared(rb, txn, commit_lsn);
+ else
+ rb->abort_prepared(rb, txn, commit_lsn);
+
+ ReorderBufferCleanupTXN(rb, txn);
+}
+
+/*
* Abort a transaction that possibly has previous changes. Needs to be first
* called for subtransactions and then for the toplevel xid.
*
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 2279604..c1ca998 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -885,7 +885,7 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
/* copy xids that still are interesting to workspace */
for (off = 0; off < builder->committed.xcnt; off++)
{
- if (NormalTransactionIdPrecedes(builder->committed.xip[off],
+ if (TransactionIdPrecedes(builder->committed.xip[off],
builder->xmin))
; /* remove */
else
@@ -1118,6 +1118,52 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
}
}
+/*
+ * Just a wrapper to clarify DecodePrepare().
+ * Right now we can't extract correct historic catalog data that
+ * was produced by an aborted prepared transaction, so it is the job of
+ * the decoding plugin to avoid such a situation; here we just construct
+ * the usual snapshot to be able to decode the prepare.
+ */
+void
+SnapBuildPrepareTxnStart(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
+ int nsubxacts, TransactionId *subxacts)
+{
+ SnapBuildCommitTxn(builder, lsn, xid, nsubxacts, subxacts);
+}
+
+
+/*
+ * When decoding of the prepare is finished we should exclude our xid
+ * from the list of committed xids to have a correct snapshot between prepare
+ * and commit.
+ *
+ * However, this is not strictly needed. A prepared transaction holds locks
+ * between prepare and commit so nobody can produce a new version of our
+ * catalog tuples. In case of abort we will have this xid in the array of
+ * committed xids, but that also will not cause a problem since the checks of
+ * HeapTupleHeaderXminInvalid() in HeapTupleSatisfiesHistoricMVCC()
+ * have higher priority than the checks for the xip array. Anyway, let's be
+ * consistent about definitions and delete this xid from the xip array.
+ */
+void
+SnapBuildPrepareTxnFinish(SnapBuild *builder, TransactionId xid)
+{
+ TransactionId *search = bsearch(&xid, builder->running.xip,
+ builder->running.xcnt, sizeof(TransactionId), xidComparator);
+
+ if (search == NULL)
+ return;
+
+ /* delete that xid */
+ memmove(search, search + 1,
+ ((builder->running.xip + builder->running.xcnt - 1) - search) * sizeof(TransactionId));
+ builder->running.xcnt--;
+
+ /* update min/max */
+ builder->running.xmin = builder->running.xip[0];
+ builder->running.xmax = builder->running.xip[builder->running.xcnt - 1];
+}
/* -----------------------------------
* Snapshot building functions dealing with xlog records
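The removal step in SnapBuildPrepareTxnFinish() is a delete from a sorted array followed by a refresh of the cached bounds. A minimal standalone sketch of that technique follows, with one addition that is my assumption rather than part of the patch: it handles the case where the array becomes empty, so that the bounds are not read from xip[0] and xip[xcnt - 1] when xcnt is 0. `Xid`, `XidSet` and `xidset_remove()` are hypothetical names, the comparison is plain numeric order (no wraparound handling), and 0 is used as the "no xid" marker.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t Xid;			/* stand-in for TransactionId */

/* Hypothetical sorted xid array with cached bounds. */
typedef struct
{
	Xid		   *xip;			/* sorted in ascending numeric order */
	size_t		xcnt;			/* number of valid entries in xip */
	Xid			xmin;			/* smallest entry, 0 if empty */
	Xid			xmax;			/* largest entry, 0 if empty */
} XidSet;

/* Plain numeric comparison, suitable for bsearch(). */
static int
compare_xids(const void *a, const void *b)
{
	Xid			x = *(const Xid *) a;
	Xid			y = *(const Xid *) b;

	return (x > y) - (x < y);
}

/* Remove xid from the set; returns false if it was not present. */
bool
xidset_remove(XidSet *set, Xid xid)
{
	Xid		   *hit;

	if (set->xcnt == 0)
		return false;

	hit = bsearch(&xid, set->xip, set->xcnt, sizeof(Xid), compare_xids);
	if (hit == NULL)
		return false;

	/* shift the tail of the array down over the removed entry */
	memmove(hit, hit + 1,
			(size_t) ((set->xip + set->xcnt - 1) - hit) * sizeof(Xid));
	set->xcnt--;

	/* refresh the cached bounds, guarding the empty case */
	if (set->xcnt == 0)
	{
		set->xmin = 0;
		set->xmax = 0;
	}
	else
	{
		set->xmin = set->xip[0];
		set->xmax = set->xip[set->xcnt - 1];
	}
	return true;
}
```

Whether the empty case can actually occur for the builder's array at that point is something I have not verified; the guard is cheap either way.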
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index b2b7848..6c0445a 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -15,6 +15,7 @@
#define TWOPHASE_H
#include "access/xlogdefs.h"
+#include "access/xact.h"
#include "datatype/timestamp.h"
#include "storage/lock.h"
@@ -46,6 +47,8 @@ extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
int *nxids_p);
+extern void ParsePrepareRecord(uint8 info, char *xlrec,
+ xl_xact_parsed_prepare *parsed);
extern void StandbyRecoverPreparedTransactions(bool overwriteOK);
extern void RecoverPreparedTransactions(void);
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 5b37c05..e8bf39b 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -21,6 +21,10 @@
#include "storage/sinval.h"
#include "utils/datetime.h"
+/*
+ * Maximum size of Global Transaction ID (including '\0').
+ */
+#define GIDSIZE 200
/*
* Xact isolation levels
@@ -157,6 +161,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
#define XACT_XINFO_HAS_TWOPHASE (1U << 4)
#define XACT_XINFO_HAS_ORIGIN (1U << 5)
#define XACT_XINFO_HAS_AE_LOCKS (1U << 6)
+#define XACT_XINFO_HAS_GID (1U << 7)
/*
* Also stored in xinfo, these indicating a variety of additional actions that
@@ -303,13 +308,40 @@ typedef struct xl_xact_parsed_commit
SharedInvalidationMessage *msgs;
TransactionId twophase_xid; /* only for 2PC */
+ char twophase_gid[GIDSIZE];
XLogRecPtr origin_lsn;
TimestampTz origin_timestamp;
} xl_xact_parsed_commit;
+typedef struct xl_xact_parsed_prepare
+{
+ Oid dbId; /* MyDatabaseId */
+
+ int nsubxacts;
+ TransactionId *subxacts;
+
+ int ncommitrels;
+ RelFileNode *commitrels;
+
+ int nabortrels;
+ RelFileNode *abortrels;
+
+ int nmsgs;
+ SharedInvalidationMessage *msgs;
+
+ TransactionId twophase_xid;
+ char twophase_gid[GIDSIZE];
+
+ XLogRecPtr origin_lsn;
+ TimestampTz origin_timestamp;
+} xl_xact_parsed_prepare;
+
typedef struct xl_xact_parsed_abort
{
+ Oid dbId;
+ Oid tsId;
+
TimestampTz xact_time;
uint32 xinfo;
@@ -320,6 +352,10 @@ typedef struct xl_xact_parsed_abort
RelFileNode *xnodes;
TransactionId twophase_xid; /* only for 2PC */
+ char twophase_gid[GIDSIZE];
+
+ XLogRecPtr origin_lsn;
+ TimestampTz origin_timestamp;
} xl_xact_parsed_abort;
@@ -351,7 +387,7 @@ extern void CommitTransactionCommand(void);
extern void AbortCurrentTransaction(void);
extern void BeginTransactionBlock(void);
extern bool EndTransactionBlock(void);
-extern bool PrepareTransactionBlock(char *gid);
+extern bool PrepareTransactionBlock(const char *gid);
extern void UserAbortTransactionBlock(void);
extern void ReleaseSavepoint(List *options);
extern void DefineSavepoint(char *name);
@@ -385,12 +421,14 @@ extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
int nmsgs, SharedInvalidationMessage *msgs,
bool relcacheInval, bool forceSync,
int xactflags,
- TransactionId twophase_xid);
+ TransactionId twophase_xid, const char *twophase_gid);
extern XLogRecPtr XactLogAbortRecord(TimestampTz abort_time,
int nsubxacts, TransactionId *subxacts,
int nrels, RelFileNode *rels,
- int xactflags, TransactionId twophase_xid);
+ int xactflags, TransactionId twophase_xid,
+ const char *twophase_gid);
+
extern void xact_redo(XLogReaderState *record);
/* xactdesc.c */
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 7d6c88e..7352b07 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -75,6 +75,11 @@ typedef struct LogicalDecodingContext
bool prepared_write;
XLogRecPtr write_location;
TransactionId write_xid;
+
+ /*
+ * Capabilities of decoding plugin used.
+ */
+ bool twophase_hadling;
} LogicalDecodingContext;
@@ -109,5 +114,4 @@ extern void LogicalConfirmReceivedLocation(XLogRecPtr lsn);
extern void LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time);
extern bool filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id);
-
#endif
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 08e962d..be32774 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -68,6 +68,38 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
XLogRecPtr commit_lsn);
/*
+ * Called before decoding of a PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to the prepare and
+ * commit_prepared/abort_prepared callbacks, or should wait until COMMIT
+ * PREPARED and be sent as an ordinary transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+/*
* Called for the generic logical decoding messages.
*/
typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
@@ -98,6 +130,10 @@ typedef struct OutputPluginCallbacks
LogicalDecodeBeginCB begin_cb;
LogicalDecodeChangeCB change_cb;
LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeFilterPrepareCB filter_prepare_cb;
+ LogicalDecodePrepareCB prepare_cb;
+ LogicalDecodeCommitPreparedCB commit_prepared_cb;
+ LogicalDecodeAbortPreparedCB abort_prepared_cb;
LogicalDecodeMessageCB message_cb;
LogicalDecodeFilterByOriginCB filter_by_origin_cb;
LogicalDecodeShutdownCB shutdown_cb;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 17e47b3..99aa17f 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
#define REORDERBUFFER_H
#include "access/htup_details.h"
+#include "access/twophase.h"
#include "lib/ilist.h"
#include "storage/sinval.h"
#include "utils/hsearch.h"
@@ -144,6 +145,16 @@ typedef struct ReorderBufferTXN
*/
TransactionId xid;
+ /* In case of 2PC we need to pass GID to output plugin */
+ char gid[GIDSIZE];
+
+ /*
+ * By using the filter_prepare() callback we can force decoding to treat a
+ * two-phase transaction as an ordinary one. This flag is set if we have
+ * actually called the prepare() callback in the output plugin.
+ */
+ bool prepared;
+
/* did the TX have catalog changes */
bool has_catalog_changes;
@@ -283,6 +294,29 @@ typedef void (*ReorderBufferCommitCB) (
ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+typedef bool (*ReorderBufferFilterPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/* abort prepared callback signature */
+typedef void (*ReorderBufferAbortPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
/* message callback signature */
typedef void (*ReorderBufferMessageCB) (
ReorderBuffer *rb,
@@ -318,6 +352,10 @@ struct ReorderBuffer
ReorderBufferBeginCB begin;
ReorderBufferApplyChangeCB apply_change;
ReorderBufferCommitCB commit;
+ ReorderBufferFilterPrepareCB filter_prepare;
+ ReorderBufferPrepareCB prepare;
+ ReorderBufferCommitPreparedCB commit_prepared;
+ ReorderBufferAbortPreparedCB abort_prepared;
ReorderBufferMessageCB message;
/*
@@ -373,6 +411,11 @@ void ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot
void ReorderBufferCommit(ReorderBuffer *, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit);
void ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
void ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
@@ -396,6 +439,13 @@ void ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
bool ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
bool ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
+bool ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, char *gid);
+bool ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid);
+void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid);
ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
void ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index a8ae631..400ffe1 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -72,6 +72,10 @@ extern bool SnapBuildXactNeedsSkip(SnapBuild *snapstate, XLogRecPtr ptr);
extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
TransactionId xid, int nsubxacts,
TransactionId *subxacts);
+extern void SnapBuildPrepareTxnStart(SnapBuild *builder, XLogRecPtr lsn,
+ TransactionId xid, int nsubxacts,
+ TransactionId *subxacts);
+extern void SnapBuildPrepareTxnFinish(SnapBuild *builder, TransactionId xid);
extern void SnapBuildAbortTxn(SnapBuild *builder, XLogRecPtr lsn,
TransactionId xid, int nsubxacts,
TransactionId *subxacts);
logical_twophase_regresstest.diff (application/octet-stream; x-unix-mode=0644)
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 0f0bb1b..8e76c55 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -34,6 +34,19 @@
#include "utils/syscache.h"
#include "utils/typcache.h"
+#include "access/xact.h"
+#include "miscadmin.h"
+#include "executor/executor.h"
+#include "nodes/nodes.h"
+#include "postmaster/autovacuum.h"
+#include "replication/walsender.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/ipc.h"
+#include "pgstat.h"
+#include "tcop/utility.h"
+#include "commands/portalcmds.h"
+
PG_MODULE_MAGIC;
/* These must be available to pg_dlsym() */
@@ -85,11 +98,232 @@ static void pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
XLogRecPtr abort_lsn);
+static void test_decoding_xact_callback(XactEvent event, void *arg);
+
+static void test_decoding_process_utility(PlannedStmt *pstmt,
+ const char *queryString, ProcessUtilityContext context,
+ ParamListInfo params, DestReceiver *dest, char *completionTag);
+
+static bool test_decoding_twophase_commit();
+
+static void test_decoding_executor_finish(QueryDesc *queryDesc);
+
+static ProcessUtility_hook_type PreviousProcessUtilityHook;
+
+static ExecutorFinish_hook_type PreviousExecutorFinishHook;
+
+static bool CurrentTxContainsDML;
+static bool CurrentTxContainsDDL;
+static bool CurrentTxNonpreparable;
void
_PG_init(void)
{
- /* other plugins can perform things here */
+ PreviousExecutorFinishHook = ExecutorFinish_hook;
+ ExecutorFinish_hook = test_decoding_executor_finish;
+
+ PreviousProcessUtilityHook = ProcessUtility_hook;
+ ProcessUtility_hook = test_decoding_process_utility;
+
+ if (!IsUnderPostmaster)
+ RegisterXactCallback(test_decoding_xact_callback, NULL);
+}
+
+
+/* ability to hook into single-statement transaction */
+static void
+test_decoding_xact_callback(XactEvent event, void *arg)
+{
+ switch (event)
+ {
+ case XACT_EVENT_START:
+ case XACT_EVENT_ABORT:
+ CurrentTxContainsDML = false;
+ CurrentTxContainsDDL = false;
+ CurrentTxNonpreparable = false;
+ break;
+ case XACT_EVENT_COMMIT_COMMAND:
+ if (!IsTransactionBlock())
+ test_decoding_twophase_commit();
+ break;
+ default:
+ break;
+ }
+}
+
+/* find out whether the transaction has written any data */
+static void
+test_decoding_executor_finish(QueryDesc *queryDesc)
+{
+ CmdType operation = queryDesc->operation;
+ EState *estate = queryDesc->estate;
+ if (estate->es_processed != 0 &&
+ (operation == CMD_INSERT || operation == CMD_UPDATE || operation == CMD_DELETE))
+ {
+ int i;
+ for (i = 0; i < estate->es_num_result_relations; i++)
+ {
+ Relation rel = estate->es_result_relations[i].ri_RelationDesc;
+ if (RelationNeedsWAL(rel)) {
+ CurrentTxContainsDML = true;
+ break;
+ }
+ }
+ }
+
+ if (PreviousExecutorFinishHook != NULL)
+ PreviousExecutorFinishHook(queryDesc);
+ else
+ standard_ExecutorFinish(queryDesc);
+}
+
+
+/*
+ * Several things here:
+ * 1) hook into commit of transaction block
+ * 2) write logical message for DDL (default path)
+ * 3) prevent 2pc hook for tx that can not be prepared and
+ * send them as logical nontransactional message.
+ */
+static void
+test_decoding_process_utility(PlannedStmt *pstmt,
+ const char *queryString, ProcessUtilityContext context,
+ ParamListInfo params, DestReceiver *dest, char *completionTag)
+{
+ Node *parsetree = pstmt->utilityStmt;
+
+ switch (nodeTag(parsetree))
+ {
+ case T_TransactionStmt:
+ {
+ TransactionStmt *stmt = (TransactionStmt *) parsetree;
+ switch (stmt->kind)
+ {
+ case TRANS_STMT_COMMIT:
+ if (test_decoding_twophase_commit())
+ return; /* do not proceed */
+ break;
+ default:
+ break;
+ }
+ }
+ break;
+
+ /* cannot PREPARE a transaction that has executed LISTEN, UNLISTEN, or NOTIFY */
+ case T_NotifyStmt:
+ case T_ListenStmt:
+ case T_UnlistenStmt:
+ CurrentTxNonpreparable = true;
+ break;
+
+ /* create/reindex/drop concurrently can not be executed in prepared tx */
+ case T_ReindexStmt:
+ {
+ ReindexStmt *stmt = (ReindexStmt *) parsetree;
+ switch (stmt->kind)
+ {
+ case REINDEX_OBJECT_SCHEMA:
+ case REINDEX_OBJECT_SYSTEM:
+ case REINDEX_OBJECT_DATABASE:
+ CurrentTxNonpreparable = true;
+ default:
+ break;
+ }
+ }
+ break;
+ case T_IndexStmt:
+ {
+ IndexStmt *indexStmt = (IndexStmt *) parsetree;
+ if (indexStmt->concurrent)
+ CurrentTxNonpreparable = true;
+ }
+ break;
+ case T_DropStmt:
+ {
+ DropStmt *stmt = (DropStmt *) parsetree;
+ if (stmt->removeType == OBJECT_INDEX && stmt->concurrent)
+ CurrentTxNonpreparable = true;
+ }
+ break;
+
+ /* cannot PREPARE a transaction that has created a cursor WITH HOLD */
+ case T_DeclareCursorStmt:
+ {
+ DeclareCursorStmt *stmt = (DeclareCursorStmt *) parsetree;
+ if (stmt->options & CURSOR_OPT_HOLD)
+ CurrentTxNonpreparable = true;
+ }
+ break;
+
+ default:
+ LogLogicalMessage("D", queryString, strlen(queryString) + 1, true);
+ CurrentTxContainsDDL = true;
+ break;
+ }
+
+ /* Send non-transactional message then */
+ if (CurrentTxNonpreparable)
+ LogLogicalMessage("C", queryString, strlen(queryString) + 1, false);
+
+ if (PreviousProcessUtilityHook != NULL)
+ {
+ PreviousProcessUtilityHook(pstmt, queryString, context,
+ params, dest, completionTag);
+ }
+ else
+ {
+ standard_ProcessUtility(pstmt, queryString, context,
+ params, dest, completionTag);
+ }
+}
+
+/*
+ * Change commit to prepare and wait on latch.
+ * WalSender will unlock us after decoding and we can proceed.
+ */
+static bool
+test_decoding_twophase_commit()
+{
+ int result = 0;
+ char gid[20];
+
+ if (IsAutoVacuumLauncherProcess() ||
+ !IsNormalProcessingMode() ||
+ am_walsender ||
+ IsBackgroundWorker ||
+ IsAutoVacuumWorkerProcess() ||
+ IsAbortedTransactionBlockState() ||
+ !(CurrentTxContainsDML || CurrentTxContainsDDL) ||
+ CurrentTxNonpreparable )
+ return false;
+
+ snprintf(gid, sizeof(gid), "test_decoding:%d", MyProc->pgprocno);
+
+ if (!IsTransactionBlock())
+ {
+ BeginTransactionBlock();
+ CommitTransactionCommand();
+ StartTransactionCommand();
+ }
+ if (!PrepareTransactionBlock(gid))
+ {
+ fprintf(stderr, "Can't prepare transaction '%s'\n", gid);
+ }
+ CommitTransactionCommand();
+
+ result = WaitLatch(&MyProc->procLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH, 0,
+ WAIT_EVENT_REPLICATION_SLOT_SYNC);
+
+ if (result & WL_POSTMASTER_DEATH)
+ proc_exit(1);
+
+ if (result & WL_LATCH_SET)
+ ResetLatch(&MyProc->procLatch);
+
+
+ StartTransactionCommand();
+ FinishPreparedTransaction(gid, true);
+ return true;
}
/* specify output plugin callbacks */
@@ -297,74 +531,11 @@ static bool
pg_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
char *gid)
{
- TestDecodingData *data = ctx->output_plugin_private;
-
- /* treat all transaction as one-phase */
- if (!data->twophase_decoding)
+ /* decode only tx that are prepared by our hook */
+ if (strncmp(gid, "test_decoding:", 14) == 0)
+ return false;
+ else
return true;
-
- /*
- * Two-phase transactions that accessed catalog require special
- * treatment.
- *
- * Right now we don't have a safe way to decode catalog changes made in
- * prepared transaction that was already aborted by the time of
- * decoding.
- *
- * That kind of problem arises only when we are trying to
- * retrospectively decode aborted transactions with catalog changes -
- * including if a transaction aborts while we're decoding it. If one
- * wants to code distributed commit based on prepare decoding then
- * commits/aborts will happend strictly after decoding will be
- * completed, so it is possible to skip any checks/locks here.
- *
- * We'll also get stuck trying to acquire locks on catalog relations
- * we need for decoding if the prepared xact holds a strong lock on
- * one of them and we also need to decode row changes.
- */
- if (txn->has_catalog_changes)
- {
- LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
-
- if (TransactionIdIsInProgress(txn->xid))
- {
- /*
- * For the sake of simplicity, by default we just
- * ignore in-progess prepared transactions with catalog
- * changes in this extension. If they abort during
- * decoding then tuples we need to decode them may be
- * overwritten while we're still decoding, causing
- * wrong catalog lookups.
- *
- * It is possible to move that LWLockRelease() to
- * pg_decode_prepare_txn() and allow decoding of
- * running prepared tx, but such lock will prevent any
- * 2pc transaction commit during decoding time. That
- * can be a long time in case of lots of
- * changes/inserts in that tx or if the downstream is
- * slow/unresonsive.
- *
- * (Continuing to decode without the lock is unsafe, XXX)
- */
- LWLockRelease(TwoPhaseStateLock);
- return !data->twophase_decode_with_catalog_changes;
- }
- else if (TransactionIdDidAbort(txn->xid))
- {
- /*
- * Here we know that it is already aborted and there is
- * not much sense in doing something with this
- * transaction. Consequently ABORT PREPARED will be
- * suppressed.
- */
- LWLockRelease(TwoPhaseStateLock);
- return true;
- }
-
- LWLockRelease(TwoPhaseStateLock);
- }
-
- return false;
}
@@ -374,9 +545,10 @@ pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
XLogRecPtr prepare_lsn)
{
TestDecodingData *data = ctx->output_plugin_private;
+ int backend_procno;
- if (data->skip_empty_xacts && !data->xact_wrote_changes)
- return;
+ // if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ // return;
OutputPluginPrepareWrite(ctx, true);
@@ -391,6 +563,10 @@ pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
timestamptz_to_str(txn->commit_time));
OutputPluginWrite(ctx, true);
+
+ /* Unlock backend */
+ sscanf(txn->gid, "test_decoding:%d", &backend_procno);
+ SetLatch(&ProcGlobal->allProcs[backend_procno].procLatch);
}
/* COMMIT PREPARED callback */
@@ -400,8 +576,8 @@ pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn
{
TestDecodingData *data = ctx->output_plugin_private;
- if (data->skip_empty_xacts && !data->xact_wrote_changes)
- return;
+ // if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ // return;
OutputPluginPrepareWrite(ctx, true);
@@ -425,8 +601,8 @@ pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
{
TestDecodingData *data = ctx->output_plugin_private;
- if (data->skip_empty_xacts && !data->xact_wrote_changes)
- return;
+ // if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ // return;
OutputPluginPrepareWrite(ctx, true);
diff --git a/repconsumer.py b/repconsumer.py
new file mode 100644
index 0000000..b31cbb5
--- /dev/null
+++ b/repconsumer.py
@@ -0,0 +1,17 @@
+import psycopg2
+from psycopg2.extras import LogicalReplicationConnection
+
+conn = psycopg2.connect("dbname=regression", connection_factory=LogicalReplicationConnection)
+cur = conn.cursor()
+
+cur.create_replication_slot("slotpy",
+ slot_type=psycopg2.extras.REPLICATION_LOGICAL,
+ output_plugin='test_decoding')
+
+cur.start_replication("slotpy")
+
+def consumer(msg):
+ print(msg.payload)
+
+cur.consume_stream(consumer)
+
diff --git a/runtest.sh b/runtest.sh
new file mode 100644
index 0000000..1c8b594
--- /dev/null
+++ b/runtest.sh
@@ -0,0 +1,34 @@
+#!/bin/sh
+
+# this script assumes that postgres and test_decoding are installed in
+# (srcdir)/tmp_install
+
+rm -rf tmp_install/data1
+./tmp_install/bin/initdb -D ./tmp_install/data1
+./tmp_install/bin/pg_ctl -w -D ./tmp_install/data1 -l logfile start
+./tmp_install/bin/createdb regression
+
+cat >> ./tmp_install/data1/postgresql.conf <<-CONF
+ wal_level=logical
+ max_replication_slots=4
+ max_prepared_transactions=20
+ shared_preload_libraries='test_decoding'
+ wal_sender_timeout=600000
+CONF
+./tmp_install/bin/pg_ctl -w -D ./tmp_install/data1 -l logfile restart
+
+python3 repconsumer.py > xlog_decoded &
+REPCONSUMER_PID=$!
+
+sleep 3
+
+cd src/test/regress
+
+./pg_regress --inputdir=. --bindir='../../../tmp_install/bin' --dlpath=. --schedule=./parallel_schedule --use-existing
+
+# ./pg_regress --inputdir=. --bindir='../../../tmp_install/bin' --dlpath=. --schedule=./serial_schedule --use-existing
+
+cd ../../..
+
+kill $REPCONSUMER_PID
+./tmp_install/bin/pg_ctl -D ./tmp_install/data1 -l logfile stop
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 9e407d5..322da32 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1928,6 +1928,7 @@ StartTransaction(void)
*/
s->state = TRANS_INPROGRESS;
+ CallXactCallbacks(XACT_EVENT_START);
ShowTransactionState("StartTransaction");
}
@@ -2264,9 +2265,12 @@ PrepareTransaction(void)
* transaction. That seems to require much more bookkeeping though.
*/
if ((MyXactFlags & XACT_FLAGS_ACCESSEDTEMPREL))
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot PREPARE a transaction that has operated on temporary tables")));
+ {
+ if (strncmp(prepareGID, "test_decoding:", 14) != 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot PREPARE a transaction that has operated on temporary tables")));
+ }
/*
* Likewise, don't allow PREPARE after pg_export_snapshot. This could be
@@ -2749,6 +2753,8 @@ CommitTransactionCommand(void)
{
TransactionState s = CurrentTransactionState;
+ CallXactCallbacks(XACT_EVENT_COMMIT_COMMAND);
+
switch (s->blockState)
{
/*
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index e8bf39b..e884138 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -102,6 +102,7 @@ extern int MyXactFlags;
*/
typedef enum
{
+ XACT_EVENT_START,
XACT_EVENT_COMMIT,
XACT_EVENT_PARALLEL_COMMIT,
XACT_EVENT_ABORT,
@@ -109,7 +110,8 @@ typedef enum
XACT_EVENT_PREPARE,
XACT_EVENT_PRE_COMMIT,
XACT_EVENT_PARALLEL_PRE_COMMIT,
- XACT_EVENT_PRE_PREPARE
+ XACT_EVENT_PRE_PREPARE,
+ XACT_EVENT_COMMIT_COMMAND
} XactEvent;
typedef void (*XactCallback) (XactEvent event, void *arg);
diff --git a/src/test/regress/sql/transactions.sql b/src/test/regress/sql/transactions.sql
index bf9cb05..de440e9 100644
--- a/src/test/regress/sql/transactions.sql
+++ b/src/test/regress/sql/transactions.sql
@@ -39,11 +39,11 @@ SELECT * FROM aggtest;
CREATE TABLE writetest (a int);
CREATE TEMPORARY TABLE temptest (a int);
-BEGIN;
-SET TRANSACTION ISOLATION LEVEL SERIALIZABLE, READ ONLY, DEFERRABLE; -- ok
-SELECT * FROM writetest; -- ok
-SET TRANSACTION READ WRITE; --fail
-COMMIT;
+-- BEGIN;
+-- SET TRANSACTION ISOLATION LEVEL SERIALIZABLE, READ ONLY, DEFERRABLE; -- ok
+-- SELECT * FROM writetest; -- ok
+-- SET TRANSACTION READ WRITE; --fail
+-- COMMIT;
BEGIN;
SET TRANSACTION READ ONLY; -- ok
On Thu, Mar 30, 2017 at 12:55 AM, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
On 28 Mar 2017, at 18:08, Andres Freund <andres@anarazel.de> wrote:
On 2017-03-28 15:55:15 +0100, Simon Riggs wrote:
That assertion is obviously false... the plugin can resolve this in
various ways, if we allow it.
Handling it by breaking replication isn't handling it (e.g. timeouts in
decoding etc). Handling it by rolling back *prepared* transactions
(which are supposed to be guaranteed to succeed!), isn't either.
You can say that in your opinion you prefer to see this handled in
some higher level way, though it would be good to hear why and how.
It's pretty obvious why: A bit of DDL by the user shouldn't lead to the
issues mentioned above.
Bottom line here is we shouldn't reject this patch on this point,
I think it definitely has to be rejected because of that. And I didn't
bring this up at the last minute, I repeatedly brought it up before.
Both to Craig and Stas.
Okay. In order to find more realistic cases that block replication
I’ve created the following setup:
* in backend: the test_decoding plugin hooks into xact events and utility
statements and transforms each commit into a prepare, then sleeps
on a latch. If the transaction contains DDL, the whole statement is pushed
into WAL as a transactional message. If the DDL cannot be prepared or disallows
execution in a transaction block, then it goes as a nontransactional logical
message without prepare/decode injection. If the transaction didn’t issue any
DDL and didn’t write anything to WAL, then it skips 2pc too.
* after the prepare is decoded, the output plugin in walsender unlocks the
backend, allowing it to proceed with commit prepared. So in case decoding
tries to access a blocked catalog, everything should stop.
* a small python script that consumes decoded WAL from walsender (thanks
Craig and Petr)
After small acrobatics with those hooks I’ve managed to run the whole
regression suite in parallel mode through such a setup of test_decoding
without any deadlocks. I’ve added two xact_events to postgres and
allowed prepare of transactions that touched temp tables, since
they are heavily used in tests and create a lot of noise in diffs.
So it boils down to 3 failed regression tests out of 177, namely:
* transactions.sql: here commit of a tx gets stuck obtaining SafeSnapshot().
I didn’t look at what is happening there specifically, but just checked that
walsender isn’t blocked. I’m going to look more closely at this.
* prepared_xacts.sql: here select prepared_xacts() sees our prepared
tx. It is possible to filter them out, but obviously it works as expected.
* guc.sql: here pendingActions arrives on 'DISCARD ALL', preventing the tx
from being prepared. I didn’t find a way to check for the presence of
pendingActions outside of async.c, so decided to leave it as is.
It seems that at least in regression tests nothing can block twophase
logical decoding. Is that a strong enough argument for the hypothesis that
the current approach doesn’t create deadlocks, except for locks on catalog
which should be disallowed anyway?
Patches attached. logical_twophase_v5 is a slightly modified version of the
previous patch merged with Craig’s changes. The second file is a set of patches
over the previous one that implements the logic I’ve just described. There is a
runtest.sh script that sets up postgres, runs the python logical consumer in
the background and starts the regression test.
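The commit-path choice described in the setup above can be sketched in isolation. This is a standalone illustration, not code from the patch: the `TxFlags` struct, the `CommitPath` enum and `choose_commit_path()` are hypothetical names that only mirror the `CurrentTxContainsDML`, `CurrentTxContainsDDL` and `CurrentTxNonpreparable` flags of the test hook, and the process-type checks done by the real hook are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-transaction flags, mirroring the CurrentTx* booleans. */
typedef struct TxFlags
{
    bool contains_dml;      /* wrote rows to a WAL-logged relation */
    bool contains_ddl;      /* ran a utility statement logged as a message */
    bool nonpreparable;     /* LISTEN/NOTIFY, CONCURRENTLY, WITH HOLD, ... */
} TxFlags;

/* Hypothetical result type: how the hook would let the commit proceed. */
typedef enum CommitPath
{
    COMMIT_PLAIN,               /* nothing to decode, ordinary commit */
    COMMIT_PLAIN_WITH_MESSAGE,  /* cannot be prepared, statement is sent as a
                                 * nontransactional logical message instead */
    COMMIT_VIA_PREPARE          /* PREPARE, wait for decoding, COMMIT PREPARED */
} CommitPath;

static CommitPath
choose_commit_path(const TxFlags *tx)
{
    assert(tx != NULL);

    /* A non-preparable transaction never takes the 2pc path. */
    if (tx->nonpreparable)
        return COMMIT_PLAIN_WITH_MESSAGE;

    /* A transaction that wrote nothing skips 2pc as well. */
    if (!tx->contains_dml && !tx->contains_ddl)
        return COMMIT_PLAIN;

    return COMMIT_VIA_PREPARE;
}
```

Checking the non-preparable flag first matches the description: such statements bypass the prepare/decode injection regardless of what else the transaction wrote.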
I reviewed this patch but when I tried to build contrib/test_decoding
I got the following error.
$ make
gcc -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
-fwrapv -g -g -fpic -I. -I. -I../../src/include -D_GNU_SOURCE -c -o
test_decoding.o test_decoding.c -MMD -MP -MF .deps/test_decoding.Po
test_decoding.c: In function '_PG_init':
test_decoding.c:126: warning: assignment from incompatible pointer type
test_decoding.c: In function 'test_decoding_process_utility':
test_decoding.c:271: warning: passing argument 5 of
'PreviousProcessUtilityHook' from incompatible pointer type
test_decoding.c:271: note: expected 'struct QueryEnvironment *' but
argument is of type 'struct DestReceiver *'
test_decoding.c:271: warning: passing argument 6 of
'PreviousProcessUtilityHook' from incompatible pointer type
test_decoding.c:271: note: expected 'struct DestReceiver *' but
argument is of type 'char *'
test_decoding.c:271: error: too few arguments to function
'PreviousProcessUtilityHook'
test_decoding.c:276: warning: passing argument 5 of
'standard_ProcessUtility' from incompatible pointer type
../../src/include/tcop/utility.h:38: note: expected 'struct
QueryEnvironment *' but argument is of type 'struct DestReceiver *'
test_decoding.c:276: warning: passing argument 6 of
'standard_ProcessUtility' from incompatible pointer type
../../src/include/tcop/utility.h:38: note: expected 'struct
DestReceiver *' but argument is of type 'char *'
test_decoding.c:276: error: too few arguments to function
'standard_ProcessUtility'
test_decoding.c: At top level:
test_decoding.c:285: warning: 'test_decoding_twophase_commit' was used
with no prototype before its definition
make: *** [test_decoding.o] Error 1
---
After applying both patches, the regression test 'make check' failed. I
think you should update the expected/transactions.out file as well.
$ cat src/test/regress/regression.diffs
*** /home/masahiko/pgsql/source/postgresql/src/test/regress/expected/transactions.out
Mon May 2 09:16:02 2016
--- /home/masahiko/pgsql/source/postgresql/src/test/regress/results/transactions.out
Tue Apr 4 09:52:44 2017
***************
*** 43,58 ****
-- Read-only tests
CREATE TABLE writetest (a int);
CREATE TEMPORARY TABLE temptest (a int);
! BEGIN;
! SET TRANSACTION ISOLATION LEVEL SERIALIZABLE, READ ONLY, DEFERRABLE; -- ok
! SELECT * FROM writetest; -- ok
! a
! ---
! (0 rows)
!
! SET TRANSACTION READ WRITE; --fail
! ERROR: transaction read-write mode must be set before any query
! COMMIT;
BEGIN;
SET TRANSACTION READ ONLY; -- ok
SET TRANSACTION READ WRITE; -- ok
--- 43,53 ----
-- Read-only tests
CREATE TABLE writetest (a int);
CREATE TEMPORARY TABLE temptest (a int);
! -- BEGIN;
! -- SET TRANSACTION ISOLATION LEVEL SERIALIZABLE, READ ONLY, DEFERRABLE; -- ok
! -- SELECT * FROM writetest; -- ok
! -- SET TRANSACTION READ WRITE; --fail
! -- COMMIT;
BEGIN;
SET TRANSACTION READ ONLY; -- ok
SET TRANSACTION READ WRITE; -- ok
======================================================================
There is still some unnecessary code in the v5 patch.
---
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+ int backend_procno;
+
+ // if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ // return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
Could you please update these patches?
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On 4 Apr 2017, at 04:23, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I reviewed this patch but when I tried to build contrib/test_decoding
I got the following error.
Thanks!
Yes, seems that 18ce3a4a changed ProcessUtility_hook signature.
Updated.
There is still some unnecessary code in the v5 patch.
Actually, the second diff isn’t intended to be part of the patch. I've just shared
the way I ran the regression test suite through 2pc decoding, changing
all commits to prepare/commit pairs where the commit happens only after decoding
of the prepare is finished (more details in my previous message in this thread).
That is just an argument against Andres' concern that a prepared transaction
is able to deadlock with the decoding process: at least there are no such cases
in the regression tests.
And that concern is the main thing blocking this patch. Apart from explicit catalog
locks in a prepared tx, nobody has found such cases yet, and it is hard to address
or argue about.
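For readers following the handshake: the test hook names each injected prepare "test_decoding:<pgprocno>", and the decoding side reads the number back to find the latch to set. The sketch below shows that round trip in standalone C with stricter checking than the test patch applies. `build_gid()` and `parse_gid()` are hypothetical helpers, not functions from the patch or from PostgreSQL. One thing the sketch makes visible, on the assumption that the prefix and buffer size are as in the diff: a 20-byte buffer holds the 14-character prefix plus at most five digits.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define GID_PREFIX "test_decoding:"

/* Hypothetical helper: format the GID, reporting truncation as failure. */
static bool
build_gid(char *buf, size_t buflen, int procno)
{
    int n;

    assert(buf != NULL);
    n = snprintf(buf, buflen, GID_PREFIX "%d", procno);
    return n > 0 && (size_t) n < buflen;
}

/*
 * Hypothetical helper: accept only the prefix followed by plain decimal
 * digits that fit in an int; anything else is treated as a foreign GID.
 */
static bool
parse_gid(const char *gid, int *procno)
{
    const char *digits;
    char       *end;
    long        value;

    if (strncmp(gid, GID_PREFIX, strlen(GID_PREFIX)) != 0)
        return false;

    digits = gid + strlen(GID_PREFIX);
    if (!isdigit((unsigned char) digits[0]))
        return false;

    errno = 0;
    value = strtol(digits, &end, 10);
    if (errno != 0 || *end != '\0' || value > INT_MAX)
        return false;

    *procno = (int) value;
    return true;
}
```

A caller on the decoding side would still need to check the parsed number against the size of the proc array before using it as an index; that bound is not modeled here.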
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
logical_twophase_v6.diff (application/octet-stream)
diff --git a/contrib/test_decoding/expected/prepared.out b/contrib/test_decoding/expected/prepared.out
index 46e915d..56c6e72 100644
--- a/contrib/test_decoding/expected/prepared.out
+++ b/contrib/test_decoding/expected/prepared.out
@@ -6,19 +6,84 @@ SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_d
init
(1 row)
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc', 'test_decoding');
+ ?column?
+----------
+ init
+(1 row)
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+-- Reused queries
+\set get_no2pc 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot_2pc'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'');'
+\set get_with2pc 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'', ''twophase-decoding'', ''1'');'
+\set get_with2pc_nofilter 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'', ''twophase-decoding'', ''1'', ''twophase-decode-with-catalog-changes'', ''1'');'
-- test simple successful use of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (1);
PREPARE TRANSACTION 'test_prepared#1';
+:get_with2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ PREPARE TRANSACTION 'test_prepared#1'
+(3 rows)
+
+:get_no2pc
+ data
+------
+(0 rows)
+
COMMIT PREPARED 'test_prepared#1';
+:get_with2pc
+ data
+------
+(0 rows)
+
+:get_no2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ COMMIT
+(3 rows)
+
INSERT INTO test_prepared1 VALUES (2);
-- test abort of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (3);
PREPARE TRANSACTION 'test_prepared#2';
+:get_no2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+(3 rows)
+
+:get_with2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(6 rows)
+
ROLLBACK PREPARED 'test_prepared#2';
+:get_no2pc
+ data
+------
+(0 rows)
+
+:get_with2pc
+ data
+------
+(0 rows)
+
INSERT INTO test_prepared1 VALUES (4);
-- test prepared xact containing ddl
BEGIN;
@@ -26,49 +91,169 @@ INSERT INTO test_prepared1 VALUES (5);
ALTER TABLE test_prepared1 ADD COLUMN data text;
INSERT INTO test_prepared1 VALUES (6, 'frakbar');
PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
-INSERT INTO test_prepared2 VALUES (7);
-COMMIT PREPARED 'test_prepared#3';
--- make sure stuff still works
-INSERT INTO test_prepared1 VALUES (8);
-INSERT INTO test_prepared2 VALUES (9);
--- cleanup
-DROP TABLE test_prepared1;
-DROP TABLE test_prepared2;
--- show results
-SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
- data
--------------------------------------------------------------------------
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'test_prepared1'::regclass;
+ relation | locktype | mode
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
+:get_no2pc
+ data
+----------------------------------------------------
BEGIN
- table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:4
COMMIT
+(3 rows)
+
+:get_with2pc
+ data
+----------------------------------------------------
BEGIN
- table public.test_prepared1: INSERT: id[integer]:2
+ table public.test_prepared1: INSERT: id[integer]:4
COMMIT
+(3 rows)
+
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists. Our 2pc filter callback will skip decoding of xacts
+-- with catalog changes at PREPARE time, so we don't decode it now.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
+INSERT INTO test_prepared2 VALUES (7);
+:get_with2pc
+ data
+----------------------------------------------------
BEGIN
- table public.test_prepared1: INSERT: id[integer]:4
+ table public.test_prepared2: INSERT: id[integer]:7
COMMIT
+(3 rows)
+
+:get_no2pc
+ data
+----------------------------------------------------
BEGIN
table public.test_prepared2: INSERT: id[integer]:7
COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+:get_no2pc
+ data
+-------------------------------------------------------------------------
BEGIN
table public.test_prepared1: INSERT: id[integer]:5
table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
COMMIT
+(4 rows)
+
+:get_with2pc
+ data
+-------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:5
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3';
+ COMMIT PREPARED 'test_prepared#3';
+(5 rows)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (8);
+INSERT INTO test_prepared2 VALUES (9);
+:get_with2pc
+ data
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:9
+ COMMIT
+(6 rows)
+
+:get_no2pc
+ data
+--------------------------------------------------------------------
BEGIN
table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
COMMIT
BEGIN
table public.test_prepared2: INSERT: id[integer]:9
COMMIT
-(22 rows)
+(6 rows)
+
+-- If we do something that takes a strong lock on a catalog relation we need to
+-- read in order to decode a transaction we deadlock; we can't finish decoding
+-- until the lock is released, but we're waiting for decoding to finish so we
+-- can make a commit/abort decision.
+---
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'pg_class'::regclass;
+ relation | locktype | mode
+----------+----------+------
+(0 rows)
+-- Shouldn't see anything with 2pc decoding off
+:get_no2pc
+ data
+------
+(0 rows)
+
+-- If we try to decode it now we'll deadlock
+SET statement_timeout = '10s';
+:get_with2pc_nofilter
+-- FIXME we expect a timeout here, but it actually works...
+ERROR: statement timed out
+
+RESET statement_timeout;
+-- we can decode past it by skipping xacts with catalog changes
+-- and let it be decoded after COMMIT PREPARED, though.
+:get_with2pc
+ data
+------
+(0 rows)
+
+COMMIT PREPARED 'test_prepared_lock';
+-- Both will work normally after we commit
+:get_no2pc
+ data
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ COMMIT
+(4 rows)
+
+:get_with2pc
+ data
+------
+(0 rows)
+
+-- cleanup
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
SELECT pg_drop_replication_slot('regression_slot');
pg_drop_replication_slot
--------------------------
(1 row)
+SELECT pg_drop_replication_slot('regression_slot_2pc');
+ pg_drop_replication_slot
+--------------------------
+
+(1 row)
+
diff --git a/contrib/test_decoding/sql/prepared.sql b/contrib/test_decoding/sql/prepared.sql
index e726397..a94503c 100644
--- a/contrib/test_decoding/sql/prepared.sql
+++ b/contrib/test_decoding/sql/prepared.sql
@@ -1,22 +1,36 @@
-- predictability
SET synchronous_commit = on;
SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc', 'test_decoding');
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+
+-- Reused queries
+\set get_no2pc 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot_2pc'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'');'
+\set get_with2pc 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'', ''twophase-decoding'', ''1'');'
+\set get_with2pc_nofilter 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'', ''twophase-decoding'', ''1'', ''twophase-decode-with-catalog-changes'', ''1'');'
-- test simple successful use of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (1);
PREPARE TRANSACTION 'test_prepared#1';
+:get_with2pc
+:get_no2pc
COMMIT PREPARED 'test_prepared#1';
+:get_with2pc
+:get_no2pc
INSERT INTO test_prepared1 VALUES (2);
-- test abort of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (3);
PREPARE TRANSACTION 'test_prepared#2';
+:get_no2pc
+:get_with2pc
ROLLBACK PREPARED 'test_prepared#2';
+:get_no2pc
+:get_with2pc
INSERT INTO test_prepared1 VALUES (4);
@@ -27,24 +41,74 @@ ALTER TABLE test_prepared1 ADD COLUMN data text;
INSERT INTO test_prepared1 VALUES (6, 'frakbar');
PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'test_prepared1'::regclass;
+
+:get_no2pc
+:get_with2pc
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists. Our 2pc filter callback will skip decoding of xacts
+-- with catalog changes at PREPARE time, so we don't decode it now.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
INSERT INTO test_prepared2 VALUES (7);
+:get_with2pc
+:get_no2pc
COMMIT PREPARED 'test_prepared#3';
+:get_no2pc
+:get_with2pc
-- make sure stuff still works
INSERT INTO test_prepared1 VALUES (8);
INSERT INTO test_prepared2 VALUES (9);
+:get_with2pc
+:get_no2pc
+
+-- If we do something that takes a strong lock on a catalog relation we need to
+-- read in order to decode a transaction we deadlock; we can't finish decoding
+-- until the lock is released, but we're waiting for decoding to finish so we
+-- can make a commit/abort decision.
+---
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'pg_class'::regclass;
+
+-- Shouldn't see anything with 2pc decoding off
+:get_no2pc
+
+-- If we try to decode it now we'll deadlock
+SET statement_timeout = '10s';
+:get_with2pc_nofilter
+RESET statement_timeout;
+
+-- we can decode past it by skipping xacts with catalog changes
+-- and let it be decoded after COMMIT PREPARED, though.
+:get_with2pc
+
+COMMIT PREPARED 'test_prepared_lock';
+
+-- Both will work normally after we commit
+:get_no2pc
+:get_with2pc
-- cleanup
DROP TABLE test_prepared1;
DROP TABLE test_prepared2;
--- show results
-SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
-
SELECT pg_drop_replication_slot('regression_slot');
+SELECT pg_drop_replication_slot('regression_slot_2pc');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 21cfd67..0f0bb1b 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -24,6 +24,8 @@
#include "replication/message.h"
#include "replication/origin.h"
+#include "storage/procarray.h"
+
#include "utils/builtins.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
@@ -46,6 +48,8 @@ typedef struct
bool skip_empty_xacts;
bool xact_wrote_changes;
bool only_local;
+ bool twophase_decoding;
+ bool twophase_decode_with_catalog_changes;
} TestDecodingData;
static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -68,6 +72,19 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr message_lsn,
bool transactional, const char *prefix,
Size sz, const char *message);
+static bool pg_filter_prepare(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ char *gid);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
void
_PG_init(void)
@@ -85,9 +102,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
cb->begin_cb = pg_decode_begin_txn;
cb->change_cb = pg_decode_change;
cb->commit_cb = pg_decode_commit_txn;
+
cb->filter_by_origin_cb = pg_decode_filter;
cb->shutdown_cb = pg_decode_shutdown;
cb->message_cb = pg_decode_message;
+
+ cb->filter_prepare_cb = pg_filter_prepare;
+ cb->prepare_cb = pg_decode_prepare_txn;
+ cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+ cb->abort_prepared_cb = pg_decode_abort_prepared_txn;
}
@@ -107,6 +130,8 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
data->include_timestamp = false;
data->skip_empty_xacts = false;
data->only_local = false;
+ data->twophase_decoding = false;
+ data->twophase_decode_with_catalog_changes = false;
ctx->output_plugin_private = data;
@@ -176,6 +201,27 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
errmsg("could not parse value \"%s\" for parameter \"%s\"",
strVal(elem->arg), elem->defname)));
}
+ else if (strcmp(elem->defname, "twophase-decoding") == 0)
+ {
+
+ if (elem->arg == NULL)
+ data->twophase_decoding = true;
+ else if (!parse_bool(strVal(elem->arg), &data->twophase_decoding))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("could not parse value \"%s\" for parameter \"%s\"",
+ strVal(elem->arg), elem->defname)));
+ }
+ else if (strcmp(elem->defname, "twophase-decode-with-catalog-changes") == 0)
+ {
+ if (elem->arg == NULL)
+ data->twophase_decode_with_catalog_changes = true;
+ else if (!parse_bool(strVal(elem->arg), &data->twophase_decode_with_catalog_changes))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("could not parse value \"%s\" for parameter \"%s\"",
+ strVal(elem->arg), elem->defname)));
+ }
else
{
ereport(ERROR,
@@ -232,10 +278,163 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
return;
OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfoString(ctx->out, "COMMIT");
+
if (data->include_xids)
- appendStringInfo(ctx->out, "COMMIT %u", txn->xid);
- else
- appendStringInfoString(ctx->out, "COMMIT");
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+
+/* Filter out unnecessary two-phase transactions */
+static bool
+pg_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ char *gid)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ /* treat all transactions as one-phase */
+ if (!data->twophase_decoding)
+ return true;
+
+ /*
+ * Two-phase transactions that accessed the catalog require special
+ * treatment.
+ *
+ * Right now we don't have a safe way to decode catalog changes made in
+ * a prepared transaction that was already aborted by the time of
+ * decoding.
+ *
+ * That kind of problem arises only when we are trying to
+ * retrospectively decode aborted transactions with catalog changes,
+ * including when a transaction aborts while we're decoding it. If one
+ * wants to build distributed commit on top of prepare decoding, then
+ * commits/aborts will happen strictly after decoding has
+ * completed, so it is possible to skip any checks/locks here.
+ *
+ * We'll also get stuck trying to acquire locks on catalog relations
+ * we need for decoding if the prepared xact holds a strong lock on
+ * one of them and we also need to decode row changes.
+ */
+ if (txn->has_catalog_changes)
+ {
+ LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+
+ if (TransactionIdIsInProgress(txn->xid))
+ {
+ /*
+ * For the sake of simplicity, by default we just
+ * ignore in-progress prepared transactions with catalog
+ * changes in this extension. If they abort during
+ * decoding then tuples we need to decode them may be
+ * overwritten while we're still decoding, causing
+ * wrong catalog lookups.
+ *
+ * It is possible to move that LWLockRelease() to
+ * pg_decode_prepare_txn() and allow decoding of
+ * running prepared tx, but such a lock will prevent any
+ * 2pc transaction commit during decoding time. That
+ * can be a long time in case of lots of
+ * changes/inserts in that tx or if the downstream is
+ * slow/unresponsive.
+ *
+ * (Continuing to decode without the lock is unsafe, XXX)
+ */
+ LWLockRelease(TwoPhaseStateLock);
+ return !data->twophase_decode_with_catalog_changes;
+ }
+ else if (TransactionIdDidAbort(txn->xid))
+ {
+ /*
+ * Here we know that it is already aborted and there is
+ * not much sense in doing something with this
+ * transaction. Consequently ABORT PREPARED will be
+ * suppressed.
+ */
+ LWLockRelease(TwoPhaseStateLock);
+ return true;
+ }
+
+ LWLockRelease(TwoPhaseStateLock);
+ }
+
+ return false;
+}
+
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* ABORT PREPARED callback */
+static void
+pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "ABORT PREPARED %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
if (data->include_timestamp)
appendStringInfo(ctx->out, " (at %s)",
diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 735f8c5..ed75503 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -100,8 +100,13 @@ ParseCommitRecord(uint8 info, xl_xact_commit *xlrec, xl_xact_parsed_commit *pars
xl_xact_twophase *xl_twophase = (xl_xact_twophase *) data;
parsed->twophase_xid = xl_twophase->xid;
-
data += sizeof(xl_xact_twophase);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_GID)
+ {
+ strcpy(parsed->twophase_gid, data);
+ data += strlen(parsed->twophase_gid) + 1;
+ }
}
if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
@@ -139,6 +144,16 @@ ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_abort *parsed)
data += sizeof(xl_xact_xinfo);
}
+ if (parsed->xinfo & XACT_XINFO_HAS_DBINFO)
+ {
+ xl_xact_dbinfo *xl_dbinfo = (xl_xact_dbinfo *) data;
+
+ parsed->dbId = xl_dbinfo->dbId;
+ parsed->tsId = xl_dbinfo->tsId;
+
+ data += sizeof(xl_xact_dbinfo);
+ }
+
if (parsed->xinfo & XACT_XINFO_HAS_SUBXACTS)
{
xl_xact_subxacts *xl_subxacts = (xl_xact_subxacts *) data;
@@ -166,8 +181,26 @@ ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_abort *parsed)
xl_xact_twophase *xl_twophase = (xl_xact_twophase *) data;
parsed->twophase_xid = xl_twophase->xid;
-
data += sizeof(xl_xact_twophase);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_GID)
+ {
+ strcpy(parsed->twophase_gid, data);
+ data += strlen(parsed->twophase_gid) + 1;
+ }
+ }
+
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ xl_xact_origin xl_origin;
+
+ /* we're only guaranteed 4 byte alignment, so copy onto stack */
+ memcpy(&xl_origin, data, sizeof(xl_origin));
+
+ parsed->origin_lsn = xl_origin.origin_lsn;
+ parsed->origin_timestamp = xl_origin.origin_timestamp;
+
+ data += sizeof(xl_xact_origin);
}
}
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 83169cc..b58b9a3 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -129,7 +129,6 @@ int max_prepared_xacts = 0;
* Note that the max value of GIDSIZE must fit in the uint16 gidlen,
* specified in TwoPhaseFileHeader.
*/
-#define GIDSIZE 200
typedef struct GlobalTransactionData
{
@@ -187,12 +186,14 @@ static void RecordTransactionCommitPrepared(TransactionId xid,
RelFileNode *rels,
int ninvalmsgs,
SharedInvalidationMessage *invalmsgs,
- bool initfileinval);
+ bool initfileinval,
+ const char *gid);
static void RecordTransactionAbortPrepared(TransactionId xid,
int nchildren,
TransactionId *children,
int nrels,
- RelFileNode *rels);
+ RelFileNode *rels,
+ const char *gid);
static void ProcessRecords(char *bufptr, TransactionId xid,
const TwoPhaseCallback callbacks[]);
static void RemoveGXact(GlobalTransaction gxact);
@@ -854,7 +855,7 @@ TwoPhaseGetDummyProc(TransactionId xid)
/*
* Header for a 2PC state file
*/
-#define TWOPHASE_MAGIC 0x57F94533 /* format identifier */
+#define TWOPHASE_MAGIC 0x57F94534 /* format identifier */
typedef struct TwoPhaseFileHeader
{
@@ -870,6 +871,8 @@ typedef struct TwoPhaseFileHeader
int32 ninvalmsgs; /* number of cache invalidation messages */
bool initfileinval; /* does relcache init file need invalidation? */
uint16 gidlen; /* length of the GID - GID follows the header */
+ XLogRecPtr origin_lsn; /* lsn of this record at origin node */
+ TimestampTz origin_timestamp; /* time of prepare at origin node */
} TwoPhaseFileHeader;
/*
@@ -1021,6 +1024,7 @@ EndPrepare(GlobalTransaction gxact)
{
TwoPhaseFileHeader *hdr;
StateFileChunk *record;
+ bool replorigin;
/* Add the end sentinel to the list of 2PC records */
RegisterTwoPhaseRecord(TWOPHASE_RM_END_ID, 0,
@@ -1031,6 +1035,21 @@ EndPrepare(GlobalTransaction gxact)
Assert(hdr->magic == TWOPHASE_MAGIC);
hdr->total_len = records.total_len + sizeof(pg_crc32c);
+ replorigin = (replorigin_session_origin != InvalidRepOriginId &&
+ replorigin_session_origin != DoNotReplicateId);
+
+ if (replorigin)
+ {
+ Assert(replorigin_session_origin_lsn != InvalidXLogRecPtr);
+ hdr->origin_lsn = replorigin_session_origin_lsn;
+ hdr->origin_timestamp = replorigin_session_origin_timestamp;
+ }
+ else
+ {
+ hdr->origin_lsn = InvalidXLogRecPtr;
+ hdr->origin_timestamp = 0;
+ }
+
/*
* If the data size exceeds MaxAllocSize, we won't be able to read it in
* ReadTwoPhaseFile. Check for that now, rather than fail in the case
@@ -1061,9 +1080,19 @@ EndPrepare(GlobalTransaction gxact)
MyPgXact->delayChkpt = true;
XLogBeginInsert();
+
for (record = records.head; record != NULL; record = record->next)
XLogRegisterData(record->data, record->len);
+
+ XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
+
gxact->prepare_end_lsn = XLogInsert(RM_XACT_ID, XLOG_XACT_PREPARE);
+
+ if (replorigin)
+ /* Move LSNs forward for this replication origin */
+ replorigin_session_advance(replorigin_session_origin_lsn,
+ gxact->prepare_end_lsn);
+
XLogFlush(gxact->prepare_end_lsn);
/* If we crash now, we have prepared: WAL replay will fix things */
@@ -1239,6 +1268,43 @@ ReadTwoPhaseFile(TransactionId xid, bool give_warnings)
return buf;
}
+/*
+ * ParsePrepareRecord
+ */
+void
+ParsePrepareRecord(uint8 info, char *xlrec, xl_xact_parsed_prepare *parsed)
+{
+ TwoPhaseFileHeader *hdr;
+ char *bufptr;
+
+ hdr = (TwoPhaseFileHeader *) xlrec;
+ bufptr = xlrec + MAXALIGN(sizeof(TwoPhaseFileHeader));
+
+ parsed->origin_lsn = hdr->origin_lsn;
+ parsed->origin_timestamp = hdr->origin_timestamp;
+ parsed->twophase_xid = hdr->xid;
+ parsed->dbId = hdr->database;
+ parsed->nsubxacts = hdr->nsubxacts;
+ parsed->ncommitrels = hdr->ncommitrels;
+ parsed->nabortrels = hdr->nabortrels;
+ parsed->nmsgs = hdr->ninvalmsgs;
+
+ strncpy(parsed->twophase_gid, bufptr, hdr->gidlen);
+ bufptr += MAXALIGN(hdr->gidlen);
+
+ parsed->subxacts = (TransactionId *) bufptr;
+ bufptr += MAXALIGN(hdr->nsubxacts * sizeof(TransactionId));
+
+ parsed->commitrels = (RelFileNode *) bufptr;
+ bufptr += MAXALIGN(hdr->ncommitrels * sizeof(RelFileNode));
+
+ parsed->abortrels = (RelFileNode *) bufptr;
+ bufptr += MAXALIGN(hdr->nabortrels * sizeof(RelFileNode));
+
+ parsed->msgs = (SharedInvalidationMessage *) bufptr;
+ bufptr += MAXALIGN(hdr->ninvalmsgs * sizeof(SharedInvalidationMessage));
+}
+
/*
* Reads 2PC data from xlog. During checkpoint this data will be moved to
@@ -1392,11 +1458,12 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
hdr->nsubxacts, children,
hdr->ncommitrels, commitrels,
hdr->ninvalmsgs, invalmsgs,
- hdr->initfileinval);
+ hdr->initfileinval, gid);
else
RecordTransactionAbortPrepared(xid,
hdr->nsubxacts, children,
- hdr->nabortrels, abortrels);
+ hdr->nabortrels, abortrels,
+ gid);
ProcArrayRemove(proc, latestXid);
@@ -2055,7 +2122,8 @@ RecordTransactionCommitPrepared(TransactionId xid,
RelFileNode *rels,
int ninvalmsgs,
SharedInvalidationMessage *invalmsgs,
- bool initfileinval)
+ bool initfileinval,
+ const char *gid)
{
XLogRecPtr recptr;
TimestampTz committs = GetCurrentTimestamp();
@@ -2082,7 +2150,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
ninvalmsgs, invalmsgs,
initfileinval, false,
MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
- xid);
+ xid, gid);
if (replorigin)
@@ -2144,7 +2212,8 @@ RecordTransactionAbortPrepared(TransactionId xid,
int nchildren,
TransactionId *children,
int nrels,
- RelFileNode *rels)
+ RelFileNode *rels,
+ const char *gid)
{
XLogRecPtr recptr;
@@ -2166,7 +2235,7 @@ RecordTransactionAbortPrepared(TransactionId xid,
nchildren, children,
nrels, rels,
MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
- xid);
+ xid, gid);
/* Always flush, since we're about to remove the 2PC state file */
XLogFlush(recptr);
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index c8751c6..9e407d5 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1233,7 +1233,7 @@ RecordTransactionCommit(void)
nmsgs, invalMessages,
RelcacheInitFileInval, forceSyncCommit,
MyXactFlags,
- InvalidTransactionId /* plain commit */ );
+ InvalidTransactionId, NULL /* plain commit */ );
if (replorigin)
/* Move LSNs forward for this replication origin */
@@ -1585,7 +1585,8 @@ RecordTransactionAbort(bool isSubXact)
XactLogAbortRecord(xact_time,
nchildren, children,
nrels, rels,
- MyXactFlags, InvalidTransactionId);
+ MyXactFlags, InvalidTransactionId,
+ NULL);
/*
* Report the latest async abort LSN, so that the WAL writer knows to
@@ -3471,7 +3472,7 @@ BeginTransactionBlock(void)
* resource owner, etc while executing inside a Portal.
*/
bool
-PrepareTransactionBlock(char *gid)
+PrepareTransactionBlock(const char *gid)
{
TransactionState s;
bool result;
@@ -5110,7 +5111,8 @@ XactLogCommitRecord(TimestampTz commit_time,
int nrels, RelFileNode *rels,
int nmsgs, SharedInvalidationMessage *msgs,
bool relcacheInval, bool forceSync,
- int xactflags, TransactionId twophase_xid)
+ int xactflags, TransactionId twophase_xid,
+ const char *twophase_gid)
{
xl_xact_commit xlrec;
xl_xact_xinfo xl_xinfo;
@@ -5122,6 +5124,7 @@ XactLogCommitRecord(TimestampTz commit_time,
xl_xact_origin xl_origin;
uint8 info;
+ int gidlen = 0;
Assert(CritSectionCount > 0);
@@ -5184,6 +5187,13 @@ XactLogCommitRecord(TimestampTz commit_time,
{
xl_xinfo.xinfo |= XACT_XINFO_HAS_TWOPHASE;
xl_twophase.xid = twophase_xid;
+ Assert(twophase_gid != NULL);
+
+ if (XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_GID;
+ gidlen = strlen(twophase_gid) + 1; /* include '\0' */
+ }
}
/* dump transaction origin information */
@@ -5234,8 +5244,13 @@ XactLogCommitRecord(TimestampTz commit_time,
}
if (xl_xinfo.xinfo & XACT_XINFO_HAS_TWOPHASE)
+ {
XLogRegisterData((char *) (&xl_twophase), sizeof(xl_xact_twophase));
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_GID)
+ XLogRegisterData((char *) twophase_gid, gidlen);
+ }
+
if (xl_xinfo.xinfo & XACT_XINFO_HAS_ORIGIN)
XLogRegisterData((char *) (&xl_origin), sizeof(xl_xact_origin));
@@ -5255,15 +5270,19 @@ XLogRecPtr
XactLogAbortRecord(TimestampTz abort_time,
int nsubxacts, TransactionId *subxacts,
int nrels, RelFileNode *rels,
- int xactflags, TransactionId twophase_xid)
+ int xactflags, TransactionId twophase_xid,
+ const char *twophase_gid)
{
xl_xact_abort xlrec;
xl_xact_xinfo xl_xinfo;
xl_xact_subxacts xl_subxacts;
xl_xact_relfilenodes xl_relfilenodes;
xl_xact_twophase xl_twophase;
+ xl_xact_dbinfo xl_dbinfo;
+ xl_xact_origin xl_origin;
uint8 info;
+ int gidlen = 0;
Assert(CritSectionCount > 0);
@@ -5275,7 +5294,6 @@ XactLogAbortRecord(TimestampTz abort_time,
else
info = XLOG_XACT_ABORT_PREPARED;
-
/* First figure out and collect all the information needed */
xlrec.xact_time = abort_time;
@@ -5299,6 +5317,31 @@ XactLogAbortRecord(TimestampTz abort_time,
{
xl_xinfo.xinfo |= XACT_XINFO_HAS_TWOPHASE;
xl_twophase.xid = twophase_xid;
+ Assert(twophase_gid != NULL);
+
+ if (XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_GID;
+ gidlen = strlen(twophase_gid) + 1; /* include '\0' */
+ }
+ }
+
+ if (TransactionIdIsValid(twophase_xid) && XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_DBINFO;
+ xl_dbinfo.dbId = MyDatabaseId;
+ xl_dbinfo.tsId = MyDatabaseTableSpace;
+ }
+
+ /* dump transaction origin information only for abort prepared */
+ if ( (replorigin_session_origin != InvalidRepOriginId) &&
+ TransactionIdIsValid(twophase_xid) &&
+ XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_ORIGIN;
+
+ xl_origin.origin_lsn = replorigin_session_origin_lsn;
+ xl_origin.origin_timestamp = replorigin_session_origin_timestamp;
}
if (xl_xinfo.xinfo != 0)
@@ -5313,6 +5356,9 @@ XactLogAbortRecord(TimestampTz abort_time,
if (xl_xinfo.xinfo != 0)
XLogRegisterData((char *) (&xl_xinfo), sizeof(xl_xinfo));
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_DBINFO)
+ XLogRegisterData((char *) (&xl_dbinfo), sizeof(xl_dbinfo));
+
if (xl_xinfo.xinfo & XACT_XINFO_HAS_SUBXACTS)
{
XLogRegisterData((char *) (&xl_subxacts),
@@ -5330,8 +5376,19 @@ XactLogAbortRecord(TimestampTz abort_time,
}
if (xl_xinfo.xinfo & XACT_XINFO_HAS_TWOPHASE)
+ {
XLogRegisterData((char *) (&xl_twophase), sizeof(xl_xact_twophase));
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_GID)
+ XLogRegisterData((char *) twophase_gid, gidlen);
+ }
+
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_ORIGIN)
+ XLogRegisterData((char *) (&xl_origin), sizeof(xl_xact_origin));
+
+ if (TransactionIdIsValid(twophase_xid))
+ XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
+
return XLogInsert(RM_XACT_ID, info);
}
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5c13d26..b1e39c55 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -34,6 +34,7 @@
#include "access/xlogutils.h"
#include "access/xlogreader.h"
#include "access/xlogrecord.h"
+#include "access/twophase.h"
#include "catalog/pg_control.h"
@@ -71,7 +72,9 @@ static void DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf
static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_commit *parsed, TransactionId xid);
static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
- xl_xact_parsed_abort *parsed, TransactionId xid);
+ xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare *parsed);
/* common function to decode tuples */
static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -277,17 +280,33 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
break;
}
case XLOG_XACT_PREPARE:
+ {
+ xl_xact_parsed_prepare parsed;
- /*
- * Currently decoding ignores PREPARE TRANSACTION and will just
- * decode the transaction when the COMMIT PREPARED is sent or
- * throw away the transaction's contents when a ROLLBACK PREPARED
- * is received. In the future we could add code to expose prepared
- * transactions in the changestream allowing for a kind of
- * distributed 2PC.
- */
- ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
- break;
+ /* check that the output plugin is capable of two-phase decoding */
+ if (!ctx->twophase_hadling)
+ {
+ ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+ break;
+ }
+
+ /* ok, parse it */
+ ParsePrepareRecord(XLogRecGetInfo(buf->record),
+ XLogRecGetData(buf->record), &parsed);
+
+ /* does the output plugin want this particular transaction? */
+ if (ctx->callbacks.filter_prepare_cb &&
+ ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+ parsed.twophase_gid))
+ {
+ ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+ buf->origptr);
+ break;
+ }
+
+ DecodePrepare(ctx, buf, &parsed);
+ break;
+ }
default:
elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
}
@@ -551,8 +570,13 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
* Process invalidation messages, even if we're not interested in the
* transaction's contents, since the various caches need to always be
* consistent.
+ *
+ * Also, if this transaction was already sent to the prepare callback,
+ * both of these functions were already called during prepare.
*/
- if (parsed->nmsgs > 0)
+ if (parsed->nmsgs > 0 &&
+ !(TransactionIdIsValid(parsed->twophase_xid) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder, xid)))
{
ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
parsed->nmsgs, parsed->msgs);
@@ -607,9 +631,81 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
buf->origptr, buf->endptr);
}
+ if (TransactionIdIsValid(parsed->twophase_xid) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder, xid))
+ {
+ /*
+ * We are processing COMMIT PREPARED and know that the reorder buffer is
+ * empty, so we can use a shortcut for committing a bare xact.
+ */
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn, parsed->twophase_gid, true);
+ }
+ else
+ {
+ /* replay actions of all transaction + subtransactions in order */
+ ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn);
+ }
+}
+
+
+/*
+ * Decode PREPARE record. Same logic as in COMMIT, but different calls
+ * to SnapshotBuilder, as we need to mark this transaction as committed
+ * instead of running in order to decode it properly. Once the prepared
+ * transaction is decoded, we mark it in the snapshot as running again.
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare *parsed)
+{
+ XLogRecPtr origin_lsn = parsed->origin_lsn;
+ TimestampTz commit_time = parsed->origin_timestamp;
+ XLogRecPtr origin_id = XLogRecGetOrigin(buf->record);
+ int i;
+ TransactionId xid = parsed->twophase_xid;
+
+ /*
+ * Process invalidation messages, even if we're not interested in the
+ * transaction's contents, since the various caches need to always be
+ * consistent.
+ */
+ if (parsed->nmsgs > 0)
+ {
+ ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
+ parsed->nmsgs, parsed->msgs);
+ ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+ }
+
+ SnapBuildPrepareTxnStart(ctx->snapshot_builder, buf->origptr, xid,
+ parsed->nsubxacts, parsed->subxacts);
+
+ if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+ (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+ FilterByOrigin(ctx, origin_id))
+ {
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr);
+ }
+ ReorderBufferForget(ctx->reorder, xid, buf->origptr);
+
+ return;
+ }
+
+ /* tell the reorderbuffer about the surviving subtransactions */
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+ buf->origptr, buf->endptr);
+ }
+
/* replay actions of all transaction + subtransactions in order */
- ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
- commit_time, origin_id, origin_lsn);
+ ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn, parsed->twophase_gid);
+
+ SnapBuildPrepareTxnFinish(ctx->snapshot_builder, xid);
}
/*
@@ -621,6 +717,30 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid)
{
int i;
+ XLogRecPtr origin_lsn = InvalidXLogRecPtr;
+ TimestampTz commit_time = 0;
+ XLogRecPtr origin_id = XLogRecGetOrigin(buf->record);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ origin_lsn = parsed->origin_lsn;
+ commit_time = parsed->origin_timestamp;
+ }
+
+ /*
+ * If this is a ROLLBACK PREPARED, then send it to the callbacks.
+ */
+ if (TransactionIdIsValid(xid) &&
+ !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+ parsed->dbId == ctx->slot->data.database &&
+ !FilterByOrigin(ctx, origin_id) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder, xid))
+ {
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid, false);
+ return;
+ }
SnapBuildAbortTxn(ctx->snapshot_builder, buf->record->EndRecPtr, xid,
parsed->nsubxacts, parsed->subxacts);
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 5529ac8..9a66194 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -58,6 +58,14 @@ static void startup_cb_wrapper(LogicalDecodingContext *ctx, OutputPluginOptions
bool is_init);
static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -122,6 +130,7 @@ StartupDecodingContext(List *output_plugin_options,
MemoryContext context,
old_context;
LogicalDecodingContext *ctx;
+ int twophase_callbacks;
/* shorter lines... */
slot = MyReplicationSlot;
@@ -179,8 +188,25 @@ StartupDecodingContext(List *output_plugin_options,
ctx->reorder->begin = begin_cb_wrapper;
ctx->reorder->apply_change = change_cb_wrapper;
ctx->reorder->commit = commit_cb_wrapper;
+ ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+ ctx->reorder->prepare = prepare_cb_wrapper;
+ ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+ ctx->reorder->abort_prepared = abort_prepared_cb_wrapper;
ctx->reorder->message = message_cb_wrapper;
+ /* check that plugin implements all necessary callbacks to perform 2PC */
+ twophase_callbacks = (ctx->callbacks.prepare_cb != NULL) +
+ (ctx->callbacks.commit_prepared_cb != NULL) +
+ (ctx->callbacks.abort_prepared_cb != NULL);
+
+ ctx->twophase_hadling = (twophase_callbacks == 3);
+
+ if (twophase_callbacks != 3 && twophase_callbacks != 0)
+ ereport(WARNING,
+ (errmsg("Output plugin registered only %d twophase callbacks out of 3. "
+ "Twophase transactions will be decoded as ordinary ones.",
+ twophase_callbacks)));
+
ctx->out = makeStringInfo();
ctx->prepare_write = prepare_write;
ctx->write = do_write;
@@ -650,6 +676,93 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
}
static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "prepare";
+ state.report_location = txn->final_lsn; /* beginning of prepare record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "commit_prepared";
+ state.report_location = txn->final_lsn; /* beginning of commit record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "abort_prepared";
+ state.report_location = txn->final_lsn; /* beginning of abort record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.abort_prepared_cb(ctx, txn, abort_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change)
{
@@ -684,6 +797,34 @@ change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
error_context_stack = errcallback.previous;
}
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn, char *gid)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+ bool ret;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "filter_prepare";
+ state.report_location = InvalidXLogRecPtr;
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = false;
+
+ /* do the actual work: call callback */
+ ret = ctx->callbacks.filter_prepare_cb(ctx, txn, gid);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+ return ret;
+}
+
bool
filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
{
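The capability check at the top of this file's diff is all-or-nothing: two-phase decoding is enabled only when the plugin registers all three callbacks, and a partial set draws a WARNING and falls back to ordinary decoding. A minimal standalone sketch of that rule follows. It assumes the three counted callbacks are prepare_cb, commit_prepared_cb and abort_prepared_cb (the counting code itself is not shown in this hunk); TwoPhaseCallbacks and the helper names are hypothetical and are not PostgreSQL APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two-phase part of OutputPluginCallbacks. */
typedef void (*cb_fn) (void);

typedef struct
{
	cb_fn		prepare_cb;
	cb_fn		commit_prepared_cb;
	cb_fn		abort_prepared_cb;
} TwoPhaseCallbacks;

/* Count how many of the three two-phase callbacks are set. */
static int
count_twophase_callbacks(const TwoPhaseCallbacks *cb)
{
	assert(cb != NULL);
	return (cb->prepare_cb != NULL) +
		(cb->commit_prepared_cb != NULL) +
		(cb->abort_prepared_cb != NULL);
}

/* All-or-nothing: two-phase decoding only when all three are present. */
static bool
twophase_enabled(const TwoPhaseCallbacks *cb)
{
	return count_twophase_callbacks(cb) == 3;
}

/* A partial set (1 or 2) is the case the patch reports with a WARNING. */
static bool
twophase_partial(const TwoPhaseCallbacks *cb)
{
	int			n = count_twophase_callbacks(cb);

	return n != 0 && n != 3;
}

/* Placeholder callback used only to produce a non-NULL pointer. */
static void
dummy_cb(void)
{
}
```

A plugin that sets none of the three stays silent and is decoded as before; only the inconsistent case is reported.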
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 12ebadc..558b302 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1308,25 +1308,18 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
* the top and subtransactions (using a k-way merge) and replay the changes in
* lsn order.
*/
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+ ReorderBuffer *rb, TransactionId xid,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
TimestampTz commit_time,
RepOriginId origin_id, XLogRecPtr origin_lsn)
{
- ReorderBufferTXN *txn;
volatile Snapshot snapshot_now;
volatile CommandId command_id = FirstCommandId;
bool using_subtxn;
ReorderBufferIterTXNState *volatile iterstate = NULL;
- txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
- false);
-
- /* unknown transaction, nothing to replay */
- if (txn == NULL)
- return;
-
txn->final_lsn = commit_lsn;
txn->end_lsn = end_lsn;
txn->commit_time = commit_time;
@@ -1605,8 +1598,11 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
ReorderBufferIterTXNFinish(rb, iterstate);
iterstate = NULL;
- /* call commit callback */
- rb->commit(rb, txn, commit_lsn);
+ /* call commit or prepare callback */
+ if (txn->prepared)
+ rb->prepare(rb, txn, commit_lsn);
+ else
+ rb->commit(rb, txn, commit_lsn);
/* this is just a sanity check against bad output plugin behaviour */
if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -1633,8 +1629,12 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
if (snapshot_now->copied)
ReorderBufferFreeSnap(rb, snapshot_now);
- /* remove potential on-disk data, and deallocate */
- ReorderBufferCleanupTXN(rb, txn);
+ /*
+ * remove potential on-disk data, and deallocate or postpone that
+ * till the finish of two-phase tx
+ */
+ if (!txn->prepared)
+ ReorderBufferCleanupTXN(rb, txn);
}
PG_CATCH();
{
@@ -1668,6 +1668,119 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
}
/*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as one-phase later on commit.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+ return rb->filter_prepare(rb, txn, gid);
+}
+
+
+/*
+ * Commit non-twophase transaction. See comments to ReorderBufferCommitInternal()
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /* unknown transaction, nothing to replay */
+ if (txn == NULL)
+ return;
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare twophase transaction. It calls ReorderBufferCommitInternal()
+ * since all transaction changes should be decoded on PREPARE.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /* unknown transaction, nothing to replay */
+ if (txn == NULL)
+ return;
+
+ txn->prepared = true;
+ strcpy(txn->gid, gid);
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Check whether this transaction was sent as prepared to receiver.
+ * Called upon commit/abort prepared.
+ */
+bool
+ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /*
+ * If txn == NULL then presumably the subscriber confirmed the prepare
+ * but we have been restarted since then.
+ */
+ return txn == NULL ? true : txn->prepared;
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ABORT PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+ true);
+
+ txn->final_lsn = commit_lsn;
+ txn->end_lsn = end_lsn;
+ txn->commit_time = commit_time;
+ txn->origin_id = origin_id;
+ txn->origin_lsn = origin_lsn;
+ strcpy(txn->gid, gid);
+
+ if (is_commit)
+ rb->commit_prepared(rb, txn, commit_lsn);
+ else
+ rb->abort_prepared(rb, txn, commit_lsn);
+
+ ReorderBufferCleanupTXN(rb, txn);
+}
+
+/*
* Abort a transaction that possibly has previous changes. Needs to be first
* called for subtransactions and then for the toplevel xid.
*
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 2279604..c1ca998 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -885,7 +885,7 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
/* copy xids that still are interesting to workspace */
for (off = 0; off < builder->committed.xcnt; off++)
{
- if (NormalTransactionIdPrecedes(builder->committed.xip[off],
+ if (TransactionIdPrecedes(builder->committed.xip[off],
builder->xmin))
; /* remove */
else
@@ -1118,6 +1118,52 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
}
}
+/*
+ * Just a wrapper to clarify DecodePrepare().
+ * Right now we can't extract correct historic catalog data that
+ * was produced by an aborted prepared transaction, so it is the job of
+ * the decoding plugin to avoid that situation; here we just construct a
+ * usual snapshot to be able to decode the prepare.
+ */
+void
+SnapBuildPrepareTxnStart(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
+ int nsubxacts, TransactionId *subxacts)
+{
+ SnapBuildCommitTxn(builder, lsn, xid, nsubxacts, subxacts);
+}
+
+
+/*
+ * When decoding of prepare is finished we should exclude our xid
+ * from the list of committed xids to have a correct snapshot between prepare
+ * and commit.
+ *
+ * However, this is not strictly needed. A prepared transaction holds locks
+ * between prepare and commit, so nobody can produce a new version of our
+ * catalog tuples. In case of abort we will have this xid in the array of
+ * committed xids, but that will not cause a problem either, since the checks of
+ * HeapTupleHeaderXminInvalid() in HeapTupleSatisfiesHistoricMVCC()
+ * have higher priority than the checks for the xip array. Anyway, let's be
+ * consistent about definitions and delete this xid from the xip array.
+ */
+void
+SnapBuildPrepareTxnFinish(SnapBuild *builder, TransactionId xid)
+{
+ TransactionId *search = bsearch(&xid, builder->running.xip,
+ builder->running.xcnt, sizeof(TransactionId), xidComparator);
+
+ if (search == NULL)
+ return;
+
+ /* delete that xid */
+ memmove(search, search + 1,
+ ((builder->running.xip + builder->running.xcnt - 1) - search) * sizeof(TransactionId));
+ builder->running.xcnt--;
+
+ /* update min/max */
+ builder->running.xmin = builder->running.xip[0];
+ builder->running.xmax = builder->running.xip[builder->running.xcnt - 1];
+}
/* -----------------------------------
* Snapshot building functions dealing with xlog records
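SnapBuildPrepareTxnFinish() above removes one xid from a sorted array with bsearch() plus memmove(). The standalone sketch below shows the same technique on a plain uint32_t array, with an explicit guard for the empty case. Note that the patch then reads xip[0] and xip[xcnt - 1] to refresh min/max, which appears to need a similar guard when the last element has just been removed. Xid, xid_cmp and remove_sorted_xid are hypothetical names, and the comparator is a plain numeric compare, which is how I understand xidComparator to behave.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t Xid;

/* Plain numeric comparison, no wraparound logic. */
static int
xid_cmp(const void *a, const void *b)
{
	Xid			x = *(const Xid *) a;
	Xid			y = *(const Xid *) b;

	return (x > y) - (x < y);
}

/*
 * Remove xid from the sorted array xip[0 .. *xcnt - 1].
 * Returns 1 if an element was removed, 0 if xid was not present.
 */
static int
remove_sorted_xid(Xid *xip, size_t *xcnt, Xid xid)
{
	Xid		   *hit;

	assert(xip != NULL && xcnt != NULL);
	if (*xcnt == 0)
		return 0;

	hit = bsearch(&xid, xip, *xcnt, sizeof(Xid), xid_cmp);
	if (hit == NULL)
		return 0;

	/* Shift the tail left by one slot; the count may be zero elements. */
	memmove(hit, hit + 1,
			(size_t) ((xip + *xcnt - 1) - hit) * sizeof(Xid));
	(*xcnt)--;
	return 1;
}
```

After a call that returns 1, a caller should check that *xcnt is non-zero before deriving a new minimum or maximum from the array.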
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index b2b7848..6c0445a 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -15,6 +15,7 @@
#define TWOPHASE_H
#include "access/xlogdefs.h"
+#include "access/xact.h"
#include "datatype/timestamp.h"
#include "storage/lock.h"
@@ -46,6 +47,8 @@ extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
int *nxids_p);
+extern void ParsePrepareRecord(uint8 info, char *xlrec,
+ xl_xact_parsed_prepare *parsed);
extern void StandbyRecoverPreparedTransactions(bool overwriteOK);
extern void RecoverPreparedTransactions(void);
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 5b37c05..e8bf39b 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -21,6 +21,10 @@
#include "storage/sinval.h"
#include "utils/datetime.h"
+/*
+ * Maximum size of Global Transaction ID (including '\0').
+ */
+#define GIDSIZE 200
/*
* Xact isolation levels
@@ -157,6 +161,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
#define XACT_XINFO_HAS_TWOPHASE (1U << 4)
#define XACT_XINFO_HAS_ORIGIN (1U << 5)
#define XACT_XINFO_HAS_AE_LOCKS (1U << 6)
+#define XACT_XINFO_HAS_GID (1U << 7)
/*
* Also stored in xinfo, these indicating a variety of additional actions that
@@ -303,13 +308,40 @@ typedef struct xl_xact_parsed_commit
SharedInvalidationMessage *msgs;
TransactionId twophase_xid; /* only for 2PC */
+ char twophase_gid[GIDSIZE];
XLogRecPtr origin_lsn;
TimestampTz origin_timestamp;
} xl_xact_parsed_commit;
+typedef struct xl_xact_parsed_prepare
+{
+ Oid dbId; /* MyDatabaseId */
+
+ int nsubxacts;
+ TransactionId *subxacts;
+
+ int ncommitrels;
+ RelFileNode *commitrels;
+
+ int nabortrels;
+ RelFileNode *abortrels;
+
+ int nmsgs;
+ SharedInvalidationMessage *msgs;
+
+ TransactionId twophase_xid;
+ char twophase_gid[GIDSIZE];
+
+ XLogRecPtr origin_lsn;
+ TimestampTz origin_timestamp;
+} xl_xact_parsed_prepare;
+
typedef struct xl_xact_parsed_abort
{
+ Oid dbId;
+ Oid tsId;
+
TimestampTz xact_time;
uint32 xinfo;
@@ -320,6 +352,10 @@ typedef struct xl_xact_parsed_abort
RelFileNode *xnodes;
TransactionId twophase_xid; /* only for 2PC */
+ char twophase_gid[GIDSIZE];
+
+ XLogRecPtr origin_lsn;
+ TimestampTz origin_timestamp;
} xl_xact_parsed_abort;
@@ -351,7 +387,7 @@ extern void CommitTransactionCommand(void);
extern void AbortCurrentTransaction(void);
extern void BeginTransactionBlock(void);
extern bool EndTransactionBlock(void);
-extern bool PrepareTransactionBlock(char *gid);
+extern bool PrepareTransactionBlock(const char *gid);
extern void UserAbortTransactionBlock(void);
extern void ReleaseSavepoint(List *options);
extern void DefineSavepoint(char *name);
@@ -385,12 +421,14 @@ extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
int nmsgs, SharedInvalidationMessage *msgs,
bool relcacheInval, bool forceSync,
int xactflags,
- TransactionId twophase_xid);
+ TransactionId twophase_xid, const char *twophase_gid);
extern XLogRecPtr XactLogAbortRecord(TimestampTz abort_time,
int nsubxacts, TransactionId *subxacts,
int nrels, RelFileNode *rels,
- int xactflags, TransactionId twophase_xid);
+ int xactflags, TransactionId twophase_xid,
+ const char *twophase_gid);
+
extern void xact_redo(XLogReaderState *record);
/* xactdesc.c */
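GIDSIZE now bounds three fixed-size buffers (twophase_gid in the parsed commit, prepare and abort records) and ReorderBufferTXN.gid, which the patch fills with strcpy(). The sketch below shows one bounded alternative using only standard C. It is an illustration and not what the patch does; copy_gid is a hypothetical helper, and inside PostgreSQL one would more likely reach for strlcpy().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define GIDSIZE 200				/* same value as in the patch, includes '\0' */

/*
 * Copy gid into dst, which must have room for GIDSIZE bytes.
 * Returns false if gid had to be truncated; dst is always NUL-terminated.
 */
static bool
copy_gid(char *dst, const char *gid)
{
	int			n;

	assert(dst != NULL && gid != NULL);
	n = snprintf(dst, GIDSIZE, "%s", gid);
	return n >= 0 && n < GIDSIZE;
}
```

Since PREPARE TRANSACTION already rejects identifiers that are too long, truncation should not happen for GIDs read back from WAL; the bounded copy only keeps a corrupted or unexpected record from overrunning the buffer.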
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 7d6c88e..7352b07 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -75,6 +75,11 @@ typedef struct LogicalDecodingContext
bool prepared_write;
XLogRecPtr write_location;
TransactionId write_xid;
+
+ /*
+ * Capabilities of decoding plugin used.
+ */
+ bool twophase_hadling;
} LogicalDecodingContext;
@@ -109,5 +114,4 @@ extern void LogicalConfirmReceivedLocation(XLogRecPtr lsn);
extern void LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time);
extern bool filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id);
-
#endif
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 08e962d..be32774 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -68,6 +68,38 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
XLogRecPtr commit_lsn);
/*
+ * Called before decoding of a PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to the prepare
+ * and commit_prepared/abort_prepared callbacks, or should wait till
+ * COMMIT PREPARED and be sent as a usual transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+/*
* Called for the generic logical decoding messages.
*/
typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
@@ -98,6 +130,10 @@ typedef struct OutputPluginCallbacks
LogicalDecodeBeginCB begin_cb;
LogicalDecodeChangeCB change_cb;
LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeFilterPrepareCB filter_prepare_cb;
+ LogicalDecodePrepareCB prepare_cb;
+ LogicalDecodeCommitPreparedCB commit_prepared_cb;
+ LogicalDecodeAbortPreparedCB abort_prepared_cb;
LogicalDecodeMessageCB message_cb;
LogicalDecodeFilterByOriginCB filter_by_origin_cb;
LogicalDecodeShutdownCB shutdown_cb;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 17e47b3..99aa17f 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
#define REORDERBUFFER_H
#include "access/htup_details.h"
+#include "access/twophase.h"
#include "lib/ilist.h"
#include "storage/sinval.h"
#include "utils/hsearch.h"
@@ -144,6 +145,16 @@ typedef struct ReorderBufferTXN
*/
TransactionId xid;
+ /* In case of 2PC we need to pass GID to output plugin */
+ char gid[GIDSIZE];
+
+ /*
+ * By using filter_prepare() callback we can force decoding to treat
+ * two-phase transaction as an ordinary one. This flag is set if we have
+ * actually called the prepare() callback in the output plugin.
+ */
+ bool prepared;
+
/* did the TX have catalog changes */
bool has_catalog_changes;
@@ -283,6 +294,29 @@ typedef void (*ReorderBufferCommitCB) (
ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+typedef bool (*ReorderBufferFilterPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/* abort prepared callback signature */
+typedef void (*ReorderBufferAbortPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
/* message callback signature */
typedef void (*ReorderBufferMessageCB) (
ReorderBuffer *rb,
@@ -318,6 +352,10 @@ struct ReorderBuffer
ReorderBufferBeginCB begin;
ReorderBufferApplyChangeCB apply_change;
ReorderBufferCommitCB commit;
+ ReorderBufferFilterPrepareCB filter_prepare;
+ ReorderBufferPrepareCB prepare;
+ ReorderBufferCommitPreparedCB commit_prepared;
+ ReorderBufferAbortPreparedCB abort_prepared;
ReorderBufferMessageCB message;
/*
@@ -373,6 +411,11 @@ void ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot
void ReorderBufferCommit(ReorderBuffer *, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit);
void ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
void ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
@@ -396,6 +439,13 @@ void ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
bool ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
bool ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
+bool ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, char *gid);
+bool ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid);
+void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid);
ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
void ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index a8ae631..400ffe1 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -72,6 +72,10 @@ extern bool SnapBuildXactNeedsSkip(SnapBuild *snapstate, XLogRecPtr ptr);
extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
TransactionId xid, int nsubxacts,
TransactionId *subxacts);
+extern void SnapBuildPrepareTxnStart(SnapBuild *builder, XLogRecPtr lsn,
+ TransactionId xid, int nsubxacts,
+ TransactionId *subxacts);
+extern void SnapBuildPrepareTxnFinish(SnapBuild *builder, TransactionId xid);
extern void SnapBuildAbortTxn(SnapBuild *builder, XLogRecPtr lsn,
TransactionId xid, int nsubxacts,
TransactionId *subxacts);
Attachment: logical_twophase_regresstest.diff (application/octet-stream)
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 0f0bb1b..aade24b 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -34,6 +34,19 @@
#include "utils/syscache.h"
#include "utils/typcache.h"
+#include "access/xact.h"
+#include "miscadmin.h"
+#include "executor/executor.h"
+#include "nodes/nodes.h"
+#include "postmaster/autovacuum.h"
+#include "replication/walsender.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/ipc.h"
+#include "pgstat.h"
+#include "tcop/utility.h"
+#include "commands/portalcmds.h"
+
PG_MODULE_MAGIC;
/* These must be available to pg_dlsym() */
@@ -85,11 +98,234 @@ static void pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
XLogRecPtr abort_lsn);
+static void test_decoding_xact_callback(XactEvent event, void *arg);
+
+static void test_decoding_process_utility(PlannedStmt *pstmt,
+ const char *queryString, ProcessUtilityContext context,
+ ParamListInfo params, QueryEnvironment *queryEnv,
+ DestReceiver *dest, char *completionTag);
+
+static bool test_decoding_twophase_commit();
+
+static void test_decoding_executor_finish(QueryDesc *queryDesc);
+
+static ProcessUtility_hook_type PreviousProcessUtilityHook;
+
+static ExecutorFinish_hook_type PreviousExecutorFinishHook;
+
+static bool CurrentTxContainsDML;
+static bool CurrentTxContainsDDL;
+static bool CurrentTxNonpreparable;
void
_PG_init(void)
{
- /* other plugins can perform things here */
+ PreviousExecutorFinishHook = ExecutorFinish_hook;
+ ExecutorFinish_hook = test_decoding_executor_finish;
+
+ PreviousProcessUtilityHook = ProcessUtility_hook;
+ ProcessUtility_hook = test_decoding_process_utility;
+
+ if (!IsUnderPostmaster)
+ RegisterXactCallback(test_decoding_xact_callback, NULL);
+}
+
+
+/* ability to hook into single-statement transaction */
+static void
+test_decoding_xact_callback(XactEvent event, void *arg)
+{
+ switch (event)
+ {
+ case XACT_EVENT_START:
+ case XACT_EVENT_ABORT:
+ CurrentTxContainsDML = false;
+ CurrentTxContainsDDL = false;
+ CurrentTxNonpreparable = false;
+ break;
+ case XACT_EVENT_COMMIT_COMMAND:
+ if (!IsTransactionBlock())
+ test_decoding_twophase_commit();
+ break;
+ default:
+ break;
+ }
+}
+
+/* find out whether the transaction has written any data or not */
+static void
+test_decoding_executor_finish(QueryDesc *queryDesc)
+{
+ CmdType operation = queryDesc->operation;
+ EState *estate = queryDesc->estate;
+ if (estate->es_processed != 0 &&
+ (operation == CMD_INSERT || operation == CMD_UPDATE || operation == CMD_DELETE))
+ {
+ int i;
+ for (i = 0; i < estate->es_num_result_relations; i++)
+ {
+ Relation rel = estate->es_result_relations[i].ri_RelationDesc;
+ if (RelationNeedsWAL(rel)) {
+ CurrentTxContainsDML = true;
+ break;
+ }
+ }
+ }
+
+ if (PreviousExecutorFinishHook != NULL)
+ PreviousExecutorFinishHook(queryDesc);
+ else
+ standard_ExecutorFinish(queryDesc);
+}
+
+
+/*
+ * Several things here:
+ * 1) hook into commit of transaction block
+ * 2) write logical message for DDL (default path)
+ * 3) prevent 2pc hook for tx that can not be prepared and
+ * send them as logical nontransactional message.
+ */
+static void
+test_decoding_process_utility(PlannedStmt *pstmt,
+ const char *queryString, ProcessUtilityContext context,
+ ParamListInfo params, QueryEnvironment *queryEnv,
+ DestReceiver *dest, char *completionTag)
+{
+ Node *parsetree = pstmt->utilityStmt;
+
+ switch (nodeTag(parsetree))
+ {
+ case T_TransactionStmt:
+ {
+ TransactionStmt *stmt = (TransactionStmt *) parsetree;
+ switch (stmt->kind)
+ {
+ case TRANS_STMT_COMMIT:
+ if (test_decoding_twophase_commit())
+ return; /* do not proceed */
+ break;
+ default:
+ break;
+ }
+ }
+ break;
+
+ /* cannot PREPARE a transaction that has executed LISTEN, UNLISTEN, or NOTIFY */
+ case T_NotifyStmt:
+ case T_ListenStmt:
+ case T_UnlistenStmt:
+ CurrentTxNonpreparable = true;
+ break;
+
+ /* create/reindex/drop concurrently can not be executed in prepared tx */
+ case T_ReindexStmt:
+ {
+ ReindexStmt *stmt = (ReindexStmt *) parsetree;
+ switch (stmt->kind)
+ {
+ case REINDEX_OBJECT_SCHEMA:
+ case REINDEX_OBJECT_SYSTEM:
+ case REINDEX_OBJECT_DATABASE:
+ CurrentTxNonpreparable = true;
+ default:
+ break;
+ }
+ }
+ break;
+ case T_IndexStmt:
+ {
+ IndexStmt *indexStmt = (IndexStmt *) parsetree;
+ if (indexStmt->concurrent)
+ CurrentTxNonpreparable = true;
+ }
+ break;
+ case T_DropStmt:
+ {
+ DropStmt *stmt = (DropStmt *) parsetree;
+ if (stmt->removeType == OBJECT_INDEX && stmt->concurrent)
+ CurrentTxNonpreparable = true;
+ }
+ break;
+
+ /* cannot PREPARE a transaction that has created a cursor WITH HOLD */
+ case T_DeclareCursorStmt:
+ {
+ DeclareCursorStmt *stmt = (DeclareCursorStmt *) parsetree;
+ if (stmt->options & CURSOR_OPT_HOLD)
+ CurrentTxNonpreparable = true;
+ }
+ break;
+
+ default:
+ LogLogicalMessage("D", queryString, strlen(queryString) + 1, true);
+ CurrentTxContainsDDL = true;
+ break;
+ }
+
+ /* Send non-transactional message then */
+ if (CurrentTxNonpreparable)
+ LogLogicalMessage("C", queryString, strlen(queryString) + 1, false);
+
+ if (PreviousProcessUtilityHook != NULL)
+ {
+ PreviousProcessUtilityHook(pstmt, queryString, context, params, queryEnv,
+ dest, completionTag);
+ }
+ else
+ {
+ standard_ProcessUtility(pstmt, queryString, context, params, queryEnv,
+ dest, completionTag);
+ }
+}
+
+/*
+ * Change commit to prepare and wait on latch.
+ * WalSender will unlock us after decoding and we can proceed.
+ */
+static bool
+test_decoding_twophase_commit()
+{
+ int result = 0;
+ char gid[20];
+
+ if (IsAutoVacuumLauncherProcess() ||
+ !IsNormalProcessingMode() ||
+ am_walsender ||
+ IsBackgroundWorker ||
+ IsAutoVacuumWorkerProcess() ||
+ IsAbortedTransactionBlockState() ||
+ !(CurrentTxContainsDML || CurrentTxContainsDDL) ||
+ CurrentTxNonpreparable )
+ return false;
+
+ snprintf(gid, sizeof(gid), "test_decoding:%d", MyProc->pgprocno);
+
+ if (!IsTransactionBlock())
+ {
+ BeginTransactionBlock();
+ CommitTransactionCommand();
+ StartTransactionCommand();
+ }
+ if (!PrepareTransactionBlock(gid))
+ {
+ fprintf(stderr, "Can't prepare transaction '%s'\n", gid);
+ }
+ CommitTransactionCommand();
+
+ result = WaitLatch(&MyProc->procLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH, 0,
+ WAIT_EVENT_REPLICATION_SLOT_SYNC);
+
+ if (result & WL_POSTMASTER_DEATH)
+ proc_exit(1);
+
+ if (result & WL_LATCH_SET)
+ ResetLatch(&MyProc->procLatch);
+
+
+ StartTransactionCommand();
+ FinishPreparedTransaction(gid, true);
+ return true;
}
/* specify output plugin callbacks */
@@ -297,74 +533,11 @@ static bool
pg_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
char *gid)
{
- TestDecodingData *data = ctx->output_plugin_private;
-
- /* treat all transaction as one-phase */
- if (!data->twophase_decoding)
+ /* decode only tx that are prepared by our hook */
+ if (strncmp(gid, "test_decoding:", 14) == 0)
+ return false;
+ else
return true;
-
- /*
- * Two-phase transactions that accessed catalog require special
- * treatment.
- *
- * Right now we don't have a safe way to decode catalog changes made in
- * prepared transaction that was already aborted by the time of
- * decoding.
- *
- * That kind of problem arises only when we are trying to
- * retrospectively decode aborted transactions with catalog changes -
- * including if a transaction aborts while we're decoding it. If one
- * wants to code distributed commit based on prepare decoding then
- * commits/aborts will happend strictly after decoding will be
- * completed, so it is possible to skip any checks/locks here.
- *
- * We'll also get stuck trying to acquire locks on catalog relations
- * we need for decoding if the prepared xact holds a strong lock on
- * one of them and we also need to decode row changes.
- */
- if (txn->has_catalog_changes)
- {
- LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
-
- if (TransactionIdIsInProgress(txn->xid))
- {
- /*
- * For the sake of simplicity, by default we just
- * ignore in-progess prepared transactions with catalog
- * changes in this extension. If they abort during
- * decoding then tuples we need to decode them may be
- * overwritten while we're still decoding, causing
- * wrong catalog lookups.
- *
- * It is possible to move that LWLockRelease() to
- * pg_decode_prepare_txn() and allow decoding of
- * running prepared tx, but such lock will prevent any
- * 2pc transaction commit during decoding time. That
- * can be a long time in case of lots of
- * changes/inserts in that tx or if the downstream is
- * slow/unresonsive.
- *
- * (Continuing to decode without the lock is unsafe, XXX)
- */
- LWLockRelease(TwoPhaseStateLock);
- return !data->twophase_decode_with_catalog_changes;
- }
- else if (TransactionIdDidAbort(txn->xid))
- {
- /*
- * Here we know that it is already aborted and there is
- * not much sense in doing something with this
- * transaction. Consequently ABORT PREPARED will be
- * suppressed.
- */
- LWLockRelease(TwoPhaseStateLock);
- return true;
- }
-
- LWLockRelease(TwoPhaseStateLock);
- }
-
- return false;
}
@@ -374,9 +547,10 @@ pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
XLogRecPtr prepare_lsn)
{
TestDecodingData *data = ctx->output_plugin_private;
+ int backend_procno;
- if (data->skip_empty_xacts && !data->xact_wrote_changes)
- return;
+ // if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ // return;
OutputPluginPrepareWrite(ctx, true);
@@ -391,6 +565,10 @@ pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
timestamptz_to_str(txn->commit_time));
OutputPluginWrite(ctx, true);
+
+ /* Unlock backend */
+ sscanf(txn->gid, "test_decoding:%d", &backend_procno);
+ SetLatch(&ProcGlobal->allProcs[backend_procno].procLatch);
}
/* COMMIT PREPARED callback */
@@ -400,8 +578,8 @@ pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn
{
TestDecodingData *data = ctx->output_plugin_private;
- if (data->skip_empty_xacts && !data->xact_wrote_changes)
- return;
+ // if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ // return;
OutputPluginPrepareWrite(ctx, true);
@@ -425,8 +603,8 @@ pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
{
TestDecodingData *data = ctx->output_plugin_private;
- if (data->skip_empty_xacts && !data->xact_wrote_changes)
- return;
+ // if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ // return;
OutputPluginPrepareWrite(ctx, true);
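The test hook in this file encodes the backend's pgprocno into the GID ("test_decoding:<procno>"), the filter recognizes its own transactions by prefix, and the prepare callback parses the number back out to set the right latch. The standalone sketch below walks that round trip and, unlike the patch, checks the sscanf() result before trusting the parsed value. GID_PREFIX, make_gid, gid_is_ours and parse_gid are hypothetical helper names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define GID_PREFIX "test_decoding:"

/* Build "test_decoding:<procno>"; false if buf is too small for it. */
static bool
make_gid(char *buf, size_t buflen, int procno)
{
	int			n;

	assert(buf != NULL && buflen > 0);
	n = snprintf(buf, buflen, GID_PREFIX "%d", procno);
	return n >= 0 && (size_t) n < buflen;
}

/* True if the GID was generated by the hook, judged by its prefix. */
static bool
gid_is_ours(const char *gid)
{
	assert(gid != NULL);
	return strncmp(gid, GID_PREFIX, strlen(GID_PREFIX)) == 0;
}

/* Parse the procno back out; false if the GID is malformed. */
static bool
parse_gid(const char *gid, int *procno)
{
	assert(gid != NULL && procno != NULL);
	return sscanf(gid, GID_PREFIX "%d", procno) == 1 && *procno >= 0;
}
```

With the 20-byte buffer used in test_decoding_twophase_commit() the 14-character prefix leaves room for a five-digit procno. A real caller would also want to range-check the parsed value against the size of the proc array before indexing it.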
diff --git a/repconsumer.py b/repconsumer.py
new file mode 100644
index 0000000..b31cbb5
--- /dev/null
+++ b/repconsumer.py
@@ -0,0 +1,17 @@
+import psycopg2
+from psycopg2.extras import LogicalReplicationConnection
+
+conn = psycopg2.connect("dbname=regression", connection_factory=LogicalReplicationConnection)
+cur = conn.cursor()
+
+cur.create_replication_slot("slotpy",
+ slot_type=psycopg2.extras.REPLICATION_LOGICAL,
+ output_plugin='test_decoding')
+
+cur.start_replication("slotpy")
+
+def consumer(msg):
+ print(msg.payload)
+
+cur.consume_stream(consumer)
+
diff --git a/runtest.sh b/runtest.sh
new file mode 100644
index 0000000..1c8b594
--- /dev/null
+++ b/runtest.sh
@@ -0,0 +1,34 @@
+#!/bin/sh
+
+# this script assumes that postgres and test_decoding are installed in
+# (srcdir)/tmp_install
+
+rm -rf tmp_install/data1
+./tmp_install/bin/initdb -D ./tmp_install/data1
+./tmp_install/bin/pg_ctl -w -D ./tmp_install/data1 -l logfile start
+./tmp_install/bin/createdb regression
+
+cat >> ./tmp_install/data1/postgresql.conf <<-CONF
+ wal_level=logical
+ max_replication_slots=4
+ max_prepared_transactions=20
+ shared_preload_libraries='test_decoding'
+ wal_sender_timeout=600000
+CONF
+./tmp_install/bin/pg_ctl -w -D ./tmp_install/data1 -l logfile restart
+
+python3 repconsumer.py > xlog_decoded &
+REPCONSUMER_PID=$!
+
+sleep 3
+
+cd src/test/regress
+
+./pg_regress --inputdir=. --bindir='../../../tmp_install/bin' --dlpath=. --schedule=./parallel_schedule --use-existing
+
+# ./pg_regress --inputdir=. --bindir='../../../tmp_install/bin' --dlpath=. --schedule=./serial_schedule --use-existing
+
+cd ../../..
+
+kill $REPCONSUMER_PID
+./tmp_install/bin/pg_ctl -D ./tmp_install/data1 -l logfile stop
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 9e407d5..322da32 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1928,6 +1928,7 @@ StartTransaction(void)
*/
s->state = TRANS_INPROGRESS;
+ CallXactCallbacks(XACT_EVENT_START);
ShowTransactionState("StartTransaction");
}
@@ -2264,9 +2265,12 @@ PrepareTransaction(void)
* transaction. That seems to require much more bookkeeping though.
*/
if ((MyXactFlags & XACT_FLAGS_ACCESSEDTEMPREL))
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot PREPARE a transaction that has operated on temporary tables")));
+ {
+ if (strncmp(prepareGID, "test_decoding:", 14) != 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot PREPARE a transaction that has operated on temporary tables")));
+ }
/*
* Likewise, don't allow PREPARE after pg_export_snapshot. This could be
@@ -2749,6 +2753,8 @@ CommitTransactionCommand(void)
{
TransactionState s = CurrentTransactionState;
+ CallXactCallbacks(XACT_EVENT_COMMIT_COMMAND);
+
switch (s->blockState)
{
/*
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index e8bf39b..e884138 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -102,6 +102,7 @@ extern int MyXactFlags;
*/
typedef enum
{
+ XACT_EVENT_START,
XACT_EVENT_COMMIT,
XACT_EVENT_PARALLEL_COMMIT,
XACT_EVENT_ABORT,
@@ -109,7 +110,8 @@ typedef enum
XACT_EVENT_PREPARE,
XACT_EVENT_PRE_COMMIT,
XACT_EVENT_PARALLEL_PRE_COMMIT,
- XACT_EVENT_PRE_PREPARE
+ XACT_EVENT_PRE_PREPARE,
+ XACT_EVENT_COMMIT_COMMAND
} XactEvent;
typedef void (*XactCallback) (XactEvent event, void *arg);
diff --git a/src/test/regress/sql/transactions.sql b/src/test/regress/sql/transactions.sql
index bf9cb05..de440e9 100644
--- a/src/test/regress/sql/transactions.sql
+++ b/src/test/regress/sql/transactions.sql
@@ -39,11 +39,11 @@ SELECT * FROM aggtest;
CREATE TABLE writetest (a int);
CREATE TEMPORARY TABLE temptest (a int);
-BEGIN;
-SET TRANSACTION ISOLATION LEVEL SERIALIZABLE, READ ONLY, DEFERRABLE; -- ok
-SELECT * FROM writetest; -- ok
-SET TRANSACTION READ WRITE; --fail
-COMMIT;
+-- BEGIN;
+-- SET TRANSACTION ISOLATION LEVEL SERIALIZABLE, READ ONLY, DEFERRABLE; -- ok
+-- SELECT * FROM writetest; -- ok
+-- SET TRANSACTION READ WRITE; --fail
+-- COMMIT;
BEGIN;
SET TRANSACTION READ ONLY; -- ok
On Tue, Apr 4, 2017 at 7:06 PM, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
On 4 Apr 2017, at 04:23, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I reviewed this patch but when I tried to build contrib/test_decoding
I got the following error.
Thanks!
Yes, seems that 18ce3a4a changed ProcessUtility_hook signature.
Updated.
There is still some unnecessary code in the v5 patch.
Thank you for updating the patch!
Actually second diff isn’t intended to be part of the patch, I've just shared
the way I ran regression test suite through the 2pc decoding changing
all commits to prepare/commits where commits happens only after decoding
of prepare is finished (more details in my previous message in this thread).
Understood. Sorry for the noise.
That is just argument against Andres concern that prepared transaction
is able to deadlock with decoding process — at least no such cases in
regression tests.
And that concern is main thing blocking this patch. Except explicit catalog
locks in prepared tx nobody yet found such cases and it is hard to address
or argue about.
Hmm, I have not found such a deadlock case yet either.
Other than that issue, the current patch still does not pass the 'make check'
test of contrib/test_decoding.
*** 154,167 ****
(4 rows)
:get_with2pc
! data
! -------------------------------------------------------------------------
! BEGIN
! table public.test_prepared1: INSERT: id[integer]:5
! table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
! PREPARE TRANSACTION 'test_prepared#3';
! COMMIT PREPARED 'test_prepared#3';
! (5 rows)
-- make sure stuff still works
INSERT INTO test_prepared1 VALUES (8);
--- 154,162 ----
(4 rows)
:get_with2pc
! data
! ------
! (0 rows)
-- make sure stuff still works
INSERT INTO test_prepared1 VALUES (8);
I guess this part is an unexpected result and should be fixed. Right?
-----
*** 215,222 ****
-- If we try to decode it now we'll deadlock
SET statement_timeout = '10s';
:get_with2pc_nofilter
! -- FIXME we expect a timeout here, but it actually works...
! ERROR: statement timed out
RESET statement_timeout;
-- we can decode past it by skipping xacts with catalog changes
--- 210,222 ----
-- If we try to decode it now we'll deadlock
SET statement_timeout = '10s';
:get_with2pc_nofilter
! data
! ----------------------------------------------------------------------------
! BEGIN
! table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
! table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
! PREPARE TRANSACTION 'test_prepared_lock'
! (4 rows)
RESET statement_timeout;
-- we can decode past it by skipping xacts with catalog changes
Probably we can ignore this part for now.
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2017-04-04 13:06:13 +0300, Stas Kelvich wrote:
That is just argument against Andres concern that prepared transaction
is able to deadlock with decoding process — at least no such cases in
regression tests.
There are few longer / adverse xacts, so that doesn't say much.
And that concern is main thing blocking this patch. Except explicit catalog
locks in prepared tx nobody yet found such cases and it is hard to address
or argue about.
I doubt that's the case. But even if it were so, it's absolutely not
acceptable that a plain user can cause such deadlocks. So I don't think
this argument buys you anything.
- Andres
Hi
On 4 April 2017 at 19:13, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Other than that issue current patch still could not pass 'make check'
test of contrib/test_decoding.
Just a note about this patch. Of course time flies by and it needs a rebase,
but there are also a few failing tests right now:
* one that was already mentioned by Masahiko
* one from `ddl`, where expected is:
```
SELECT slot_name, plugin, slot_type, active,
NOT catalog_xmin IS NULL AS catalog_xmin_set,
xmin IS NULl AS data_xmin_not_set,
pg_wal_lsn_diff(restart_lsn, '0/01000000') > 0 AS some_wal
FROM pg_replication_slots;
slot_name | plugin | slot_type | active | catalog_xmin_set | data_xmin_not_set | some_wal
-----------------+---------------+-----------+--------+------------------+-------------------+----------
regression_slot | test_decoding | logical | f | t | t | t
(1 row)
```
but the result is:
```
SELECT slot_name, plugin, slot_type, active,
NOT catalog_xmin IS NULL AS catalog_xmin_set,
xmin IS NULl AS data_xmin_not_set,
pg_wal_lsn_diff(restart_lsn, '0/01000000') > 0 AS some_wal
FROM pg_replication_slots;
ERROR: function pg_wal_lsn_diff(pg_lsn, unknown) does not exist
LINE 5: pg_wal_lsn_diff(restart_lsn, '0/01000000') > 0 AS some_w...
^
HINT: No function matches the given name and argument types. You might need to add explicit type casts.
```
Dmitry Dolgov <9erthalion6@gmail.com> writes:
Just a note about this patch. Of course time flies by and it needs rebase,
but also there are few failing tests right now:
ERROR: function pg_wal_lsn_diff(pg_lsn, unknown) does not exist
Apparently you are not testing against current HEAD. That's been there
since d10c626de (a whole two days now ;-)).
regards, tom lane
On 13 May 2017 at 22:22, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Apparently you are not testing against current HEAD. That's been there
since d10c626de (a whole two days now ;-))
Indeed, I was working on a more than two-day old antiquity. Unfortunately,
it's even more complicated to apply this patch against the current HEAD,
so I'll wait for a rebased version.
Hi,
FYI all, wanted to mention that I am working on an updated version of
the latest patch that I plan to submit to a later CF.
Regards,
Nikhils
On 14 May 2017 at 04:02, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
On 13 May 2017 at 22:22, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Apparently you are not testing against current HEAD. That's been there
since d10c626de (a whole two days now ;-))
Indeed, I was working on a more than two-day old antiquity. Unfortunately,
it's even more complicated to apply this patch against the current HEAD,
so I'll wait for a rebased version.
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
On 7 Sep 2017, at 18:58, Nikhil Sontakke <nikhils@2ndquadrant.com> wrote:
Hi,
FYI all, wanted to mention that I am working on an updated version of
the latest patch that I plan to submit to a later CF.
Cool!
So what kind of architecture do you have in mind? The same way as it was implemented before?
As far as I remember there were two main issues:
* Decoding of aborted prepared transactions.
If such a transaction modified the catalog, then we can't read reliable info with our historic snapshot:
since clog already has the aborted bit for our tx, it will break the visibility logic. There are some ways to
deal with that: doing a catalog seq scan two times and counting the number of tuples (details
upthread), or hijacking clog values in the historic visibility function. But ISTM it is better not to solve this
issue at all =) In most cases the intended usage of decoding of a 2PC transaction is to do some form
of distributed commit, so naturally decoding will happen only with in-progress transactions, and
commit/abort will happen only after the transaction is decoded, sent, and a response is received. So we can
just have an atomic flag that prevents commit/abort of a tx currently being decoded. And we can filter
interesting prepared transactions based on GID, to avoid holding this lock for ordinary 2PC.
* Possible deadlocks that Andres was talking about.
I spent some time trying to find one, but didn't find any. If locking pg_class in a prepared tx is the only
example, then (imho) it is better to just forbid preparing such transactions. Otherwise, if some realistic
examples that can block decoding actually exist, then we probably need to reconsider the way
a tx is being decoded. Anyway, this part probably needs Andres' blessing.
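To make the GID-based filtering idea concrete, here is a minimal sketch in Python. It assumes the "test_decoding:<backend number>" GID convention that the test patch earlier in this thread parses with sscanf; the function names are hypothetical and nothing here is the patch's actual code.

```python
from typing import Optional

# Prefix borrowed from the test patch in this thread ("test_decoding:<procno>").
GID_PREFIX = "test_decoding:"


def parse_decoding_gid(gid: str) -> Optional[int]:
    """Return the backend number encoded in the GID, or None if the GID
    does not follow the convention (i.e. it is an ordinary 2PC transaction)."""
    if not gid.startswith(GID_PREFIX):
        return None
    suffix = gid[len(GID_PREFIX):]
    # Require plain ASCII digits so int() cannot fail on exotic characters.
    if not (suffix.isascii() and suffix.isdigit()):
        return None
    return int(suffix)


def should_hold_until_decoded(gid: str) -> bool:
    """Hypothetical policy: only matching GIDs take the decode interlock,
    so ordinary prepared transactions are never delayed by decoding."""
    return parse_decoding_gid(gid) is not None
```

With this policy, `should_hold_until_decoded("test_prepared_lock")` is false, so a plain user's prepared transaction would never wait on the decoder.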
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 2017-09-27 14:46, Stas Kelvich wrote:
On 7 Sep 2017, at 18:58, Nikhil Sontakke <nikhils@2ndquadrant.com> wrote:
Hi,
FYI all, wanted to mention that I am working on an updated version of
the latest patch that I plan to submit to a later CF.
Cool!
So what kind of architecture do you have in mind? The same way as it
was implemented before?
As far as I remember there were two main issues:
* Decoding of aborted prepared transactions.
If such a transaction modified the catalog, then we can't read reliable
info with our historic snapshot: since clog already has the aborted bit
for our tx, it will break the visibility logic. There are some ways to
deal with that: doing a catalog seq scan two times and counting the
number of tuples (details upthread), or hijacking clog values in the
historic visibility function. But ISTM it is better not to solve this
issue at all =) In most cases the intended usage of decoding of a 2PC
transaction is to do some form of distributed commit, so naturally
decoding will happen only with in-progress transactions, and
commit/abort will happen only after the transaction is decoded, sent,
and a response is received. So we can just have an atomic flag that
prevents commit/abort of a tx currently being decoded. And we can filter
interesting prepared transactions based on GID, to avoid holding
this lock for ordinary 2PC.
* Possible deadlocks that Andres was talking about.
I spent some time trying to find one, but didn't find any. If locking
pg_class in a prepared tx is the only example, then (imho) it is better
to just forbid preparing such transactions. Otherwise, if some realistic
examples that can block decoding actually exist, then we probably
need to reconsider the way a tx is being decoded. Anyway, this part
probably needs Andres' blessing.
Just rebased patch logical_twophase_v6 to master.
Fixed small issues:
- XactLogAbortRecord wrote DBINFO twice, but ParseAbortRecord decoded it
only once, so the second DBINFO was parsed as ORIGIN.
Fixed by removing the second write of DBINFO.
- SnapBuildPrepareTxnFinish tried to remove the xid from `running` instead
of `committed`, and it removed only the xid, without subxids.
- test_decoding skipped returning "COMMIT PREPARED" and "ABORT
PREPARED".
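The DBINFO problem in the first item is a generic writer/parser mismatch, and a toy model may help to see why the second copy ends up read as ORIGIN. The sketch below is not the real WAL record layout: the flag values, field widths, and function names are made up for illustration.

```python
import struct

HAS_DBINFO = 1 << 0   # toy flag values, not the real xinfo bits
HAS_ORIGIN = 1 << 1

DBINFO = struct.Struct("<II")   # (db_id, tablespace_id)
ORIGIN = struct.Struct("<Qq")   # (origin_lsn, origin_timestamp)


def write_record(db, origin, duplicate_dbinfo=False):
    """Serialize the optional sections in a fixed order after a flags byte."""
    out = bytes([HAS_DBINFO | HAS_ORIGIN])
    out += DBINFO.pack(*db)
    if duplicate_dbinfo:          # the bug: the section is emitted twice
        out += DBINFO.pack(*db)
    out += ORIGIN.pack(*origin)
    return out


def parse_record(buf):
    """Read each flagged section exactly once, in the same fixed order."""
    flags, pos = buf[0], 1
    parsed = {}
    if flags & HAS_DBINFO:
        parsed["db"] = DBINFO.unpack_from(buf, pos)
        pos += DBINFO.size
    if flags & HAS_ORIGIN:
        parsed["origin"] = ORIGIN.unpack_from(buf, pos)
        pos += ORIGIN.size
    return parsed


good = parse_record(write_record((5, 1663), (0x16B3748, 1234)))
bad = parse_record(write_record((5, 1663), (0x16B3748, 1234), duplicate_dbinfo=True))
```

In the `bad` case the parser's ORIGIN slot starts at the duplicated DBINFO bytes, so the "origin LSN" is really the database and tablespace ids packed together, and the real LSN shows up where the timestamp should be.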
The big issue was with decoding two-phase transactions that include DDL:
- prepared.out was misleading. We could not reproduce decoding of the body of
"test_prepared#3" with logical_twophase_v6.diff. It was skipped if
`pg_logical_slot_get_changes` was called without
`twophase-decode-with-catalog-changes` set, and only "COMMIT PREPARED
test_prepared#3" was decoded.
The reason is that "PREPARE TRANSACTION" is passed to `pg_filter_prepare`
twice:
- first on "PREPARE" itself,
- second on "COMMIT PREPARED".
In v6, `pg_filter_prepare` without `with-catalog-changes` answered "true"
the first time (i.e. it should not be decoded), and "false" the second time
(when the transaction became committed, i.e. it should be decoded). But the
second time, in DecodePrepare, `ctx->snapshot_builder->start_decoding_at`
is already in the future compared to `buf->origptr` (because it is
at the "COMMIT PREPARED" lsn). Therefore DecodePrepare just called
ReorderBufferForget.
If `pg_filter_prepare` is called with `with-catalog-changes`, then
it returns "false" both times, so DecodePrepare decodes the transaction
the first time and calls `ReorderBufferForget` the second time.
I didn't find a way to fix it gracefully. I just changed `pg_filter_prepare`
to return the same answer both times: "false" if called with
`with-catalog-changes` (i.e. DecodePrepare needs to be called), and "true"
otherwise. With this change, a catalog-changing two-phase transaction is
decoded as a simple one-phase transaction if `pg_logical_slot_get_changes`
is called without `with-catalog-changes`.
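The three filter behaviours described above can be compared with a small model. This is only a sketch of the description in this message, not of the patch code: `decode_prepare` and `run` are hypothetical names, the LSNs are arbitrary integers, and "forget" stands in for the ReorderBufferForget path.

```python
def decode_prepare(prepare_lsn, start_decoding_at):
    """Toy stand-in for the check described above: a PREPARE that lies before
    the point decoding starts from is forgotten instead of replayed."""
    if prepare_lsn < start_decoding_at:
        return "forget"
    return "replay"


def run(filter_answers, prepare_lsn=100, commit_prepared_lsn=200):
    """filter_answers = (answer at PREPARE, answer at COMMIT PREPARED);
    True means 'skip two-phase decoding for this transaction'."""
    at_prepare, at_commit = filter_answers
    events = []
    # First call: the PREPARE record itself; the decoding position is its LSN.
    if not at_prepare:
        events.append(("PREPARE", decode_prepare(prepare_lsn, prepare_lsn)))
    # Second call: COMMIT PREPARED; the decoding position has moved past PREPARE.
    if not at_commit:
        events.append(("COMMIT PREPARED",
                       decode_prepare(prepare_lsn, commit_prepared_lsn)))
    else:
        events.append(("COMMIT PREPARED", "replay as one-phase commit"))
    return events


v6_without_option = run((True, False))   # body is never emitted
with_option = run((False, False))        # decoded at PREPARE, forgotten later
fixed_without_option = run((True, True)) # consistent answers: one-phase decode
```

The inconsistent pair `(True, False)` is the only combination in which the transaction body is never replayed, which matches the misleading prepared.out result described above.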
--
With regards,
Sokolov Yura
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company
Attachments:
Small improvement compared to v7:
- twophase_gid is written with alignment padding in XactLogCommitRecord
and XactLogAbortRecord.
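For readers unfamiliar with the padding arithmetic, the following sketch shows the usual round-up-to-alignment computation. It assumes an 8-byte maximum alignment and that the terminating NUL is stored with the GID; both are assumptions for illustration, and the helper names are hypothetical rather than taken from the patch.

```python
MAXIMUM_ALIGNOF = 8  # assumption: typical value on 64-bit platforms


def maxalign(length, align=MAXIMUM_ALIGNOF):
    """Round length up to the next multiple of align (align is a power of two)."""
    return (length + align - 1) & ~(align - 1)


def padded_gid(gid):
    """Encode a GID as NUL-terminated bytes, zero-padded to an aligned length."""
    raw = gid.encode("utf-8") + b"\x00"
    return raw.ljust(maxalign(len(raw)), b"\x00")
```

A reader of such a record can then skip `maxalign(len(gid) + 1)` bytes to reach the next section at an aligned offset.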
--
Sokolov Yura
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company
Attachments:
On 28 October 2017 at 03:53, Sokolov Yura <y.sokolov@postgrespro.ru> wrote:
On 2017-10-26 22:01, Sokolov Yura wrote:
Small improvement compared to v7:
- twophase_gid is written with alignment padding in the XactLogCommitRecord
and XactLogAbortRecord.
I think Nikhils has done some significant work on this patch.
Hopefully he'll be able to share it.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi all,
I think Nikhils has done some significant work on this patch.
Hopefully he'll be able to share it.
PFA, latest patch. This builds on top of the last patch submitted by
Sokolov Yura and adds the actual logical replication interfaces to
allow PREPARE or COMMIT/ROLLBACK PREPARED on a logical subscriber.
I tested with latest PG head by setting up PUBLICATION/SUBSCRIPTION
for some tables. I tried DML on these tables via 2PC and it seems to
work with subscribers honoring COMMIT|ROLLBACK PREPARED commands.
Now getting back to the two main issues that we have been discussing:
Logical decoding deadlocking/hanging due to locks on catalog tables
====================================================
When we are decoding, we do not hold long term locks on the table. We
do RelationIdGetRelation() and RelationClose() which
increments/decrements ref counts. Also this ref count is held/released
per ReorderBuffer change record. The call to RelationIdGetRelation()
holds an AccessShareLock on pg_class, pg_attribute etc. while building
the relation descriptor. The plugin itself can access rel/syscache but
none of it holds a lock stronger than AccessShareLock on the catalog
tables.
Even activities like:
ALTER user_table;
CLUSTER user_table;
do not hold locks that would cause decoding to stall.
The only issue could be with locks on the catalog objects themselves in the
prepared transaction.
Now if the 2PC transaction is taking an AccessExclusiveLock on catalog
objects via "LOCK pg_class", for example, then pretty much nothing else
will progress in other sessions in the database until this active session
COMMIT PREPAREs or aborts this 2PC transaction.
Also, in some cases like CLUSTER on catalog objects, the code
explicitly denies preparation of a 2PC transaction.
postgres=# BEGIN;
postgres=# CLUSTER pg_class using pg_class_oid_index ;
postgres=# PREPARE TRANSACTION 'test_prepared_lock';
ERROR: cannot PREPARE a transaction that modified relation mapping
This makes sense because we do not want to get into a state where the
DB is unable to progress meaningfully at all.
Is there any other locking scenario that we need to consider?
Otherwise, are we all ok on this point being a non-issue for 2PC
logical decoding?
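The locking argument above boils down to one row of the table-level lock conflict table: AccessShareLock conflicts only with AccessExclusiveLock. A minimal sketch of that check, with a hypothetical helper name and lock lists loosely modelled on the pg_locks output in the regression test, could look like this:

```python
# Per the PostgreSQL table-level lock conflict table, AccessShareLock
# conflicts with exactly one mode.
CONFLICTS_WITH_ACCESS_SHARE = {"AccessExclusiveLock"}


def catalogs_blocking_decoding(prepared_xact_locks, catalogs_read):
    """prepared_xact_locks: iterable of (relation, mode) held by the prepared xact.
    catalogs_read: catalog relations the decoder opens with AccessShareLock.
    Returns the sorted catalog relations on which decoding would have to wait."""
    return sorted({rel for rel, mode in prepared_xact_locks
                   if rel in catalogs_read and mode in CONFLICTS_WITH_ACCESS_SHARE})


CATALOGS = {"pg_class", "pg_attribute"}

# ALTER TABLE on a user table: strong lock, but not on a catalog.
alter_user_table = [("test_prepared1", "RowExclusiveLock"),
                    ("test_prepared1", "AccessExclusiveLock")]
# Explicit LOCK pg_class inside the prepared transaction.
lock_pg_class = [("pg_class", "AccessExclusiveLock")]
```

Under these assumptions only the explicit `LOCK pg_class` case yields a non-empty result, which is the single scenario identified so far in this thread.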
Now on to the second issue:
2PC Logical decoding with concurrent "ABORT PREPARED" of the same
=========================================================
Before 2PC, we always decoded regular committed transaction records.
Now with prepared transactions, we run the risk of running decoding while
some other backend could come in and COMMIT PREPARED or ROLLBACK PREPARED
simultaneously. If the other backend commits, that's not an issue at all.
The issue is with a concurrent rollback of the prepared transaction.
We need a way to ensure that the 2PC does not abort when we are in the
midst of a change record apply activity.
One way to handle this is to ensure that we interlock the abort
prepared with an ongoing logical decoding operation for a bounded
period of maximum one change record apply cycle.
I am outlining one solution but am all ears for better, elegant solutions.
* We introduce two new booleans in the TwoPhaseState
GlobalTransactionData structure.
bool beingdecoded;
bool abortpending;
1) Before we start iterating through the change records, if it happens
to be a prepared transaction, we check "abortpending" in the corresponding
TwoPhaseState entry. If it's not set, then we set "beingdecoded".
If abortpending is set, we know that this transaction is going to go
away and we treat it like a regular abort and do not do any decoding at all.
2) With "beingdecoded" set, we start with the first change record from
the iteration, decode it and apply it.
3) Before starting decode of the next change record, we re-check if
"abortpending" is set. If "abortpending" is set, we do not decode the
next change record. Thus the abort is delay-bounded to a maximum of one
change record decoding/apply cycle after we signal our intent to abort it.
Then, we need to send ABORT to the subscriber (a regular ABORT, not
ROLLBACK PREPARED, since we have not sent "PREPARE" yet; we cannot send
PREPARE midway because the transaction block as a whole might not be
consistent). We will have to add an ABORT callback in pgoutput for this.
There's only a COMMIT callback as of now. The subscribers will ABORT this
transaction midway due to this. We can then follow this up with a DUMMY
prepared txn, e.g. "BEGIN; PREPARE TRANSACTION 'gid';". The reasoning for
the DUMMY 2PC is mentioned below in (6).
4) Keep decoding change records as long as "abortpending" is not set.
5) At end of the change set, send "PREPARE" to the subscribers and
then remove the "beingdecoded" flag from the TwoPhaseState entry. We
are now free to commit/rollback the prepared transaction anytime.
6) We will still decode the "ROLLBACK PREPARED" wal entry when it
comes to us on the provider. This will call the abort_prepared
callback on the subscriber. I have already added this in my patch.
This abort_prepared callback will abort the dummy PREPARED query from
step (3) above. Instead of doing this, we could actually check if the
'GID' entry exists and then call ROLLBACK PREPARED on the subscriber.
But in that case we can't be sure if the GID does not exist because of
a rollback-during-decode-issue on the provider or due to something
else. If we are ok with not finding GIDs on the subscriber side, then
am fine with removing the DUMMY prepare from step (3).
7) When the above activity is happening if another backend wants to
abort the prepared transaction then it will set "abortpending". If
"beingdecoded" is true, the abort prepared function will wait till it
clears out by releasing the lock and re-checking in a few moments.
When beingdecoded clears out (which will happen before the next change
record apply in walsender when it sees "abortpending" set) , the abort
prepare can go ahead as usual.
Note that we will have to be careful to clear this "beingdecoded" flag
even if the decoding fails or subscription is dropped or any other
issues. Then this can work fine, IMO.
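To check the interlock for holes, it may help to see steps 1 to 5 and 7 as a small model. The sketch below uses a Python lock in place of the TwoPhaseState lock and a test hook to inject an abort request between change records; the field names follow the proposal, everything else (class name, function names, message strings) is hypothetical, and the DUMMY prepare from step 3 is left out.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class TwoPhaseEntry:
    """Toy stand-in for the per-GID shared state; field names follow the proposal."""
    gid: str
    beingdecoded: bool = False
    abortpending: bool = False
    lock: threading.Lock = field(default_factory=threading.Lock)


def request_abort(entry):
    """Step 7: signal intent to abort. Returns True if the abort may proceed now,
    False if the caller must wait and re-check because decoding is in progress."""
    with entry.lock:
        entry.abortpending = True
        return not entry.beingdecoded


def decode_prepared(entry, changes, between_records=lambda i: None):
    """Steps 1-5: decode change records, re-checking abortpending before each one."""
    sent = []
    with entry.lock:
        if entry.abortpending:          # step 1: abort already requested
            return sent
        entry.beingdecoded = True
    try:
        for i, change in enumerate(changes):
            with entry.lock:
                if entry.abortpending:  # step 3: stop before the next record
                    sent.append("ABORT")
                    return sent
            sent.append(change)         # steps 2 and 4: decode and apply one record
            between_records(i)          # test hook: lets an abort arrive mid-stream
        sent.append(f"PREPARE TRANSACTION '{entry.gid}'")  # step 5
        return sent
    finally:
        with entry.lock:
            entry.beingdecoded = False  # always cleared, even if decoding fails
```

The `finally` block corresponds to the note about clearing "beingdecoded" on every exit path; in this model an abort requested during decoding is delayed by at most one change record.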
Thoughts? Holes in the theory? Other issues?
I am attaching my latest and greatest WIP patch, which does not contain
any of the above abort handling yet.
Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Attachments:
2pc_logical_22_11_17.patchapplication/octet-stream; name=2pc_logical_22_11_17.patchDownload
diff --git a/contrib/test_decoding/expected/prepared.out b/contrib/test_decoding/expected/prepared.out
index 46e915d4ff..56c6e7287f 100644
--- a/contrib/test_decoding/expected/prepared.out
+++ b/contrib/test_decoding/expected/prepared.out
@@ -6,19 +6,84 @@ SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_d
init
(1 row)
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc', 'test_decoding');
+ ?column?
+----------
+ init
+(1 row)
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+-- Reused queries
+\set get_no2pc 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot_2pc'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'');'
+\set get_with2pc 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'', ''twophase-decoding'', ''1'');'
+\set get_with2pc_nofilter 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'', ''twophase-decoding'', ''1'', ''twophase-decode-with-catalog-changes'', ''1'');'
-- test simple successful use of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (1);
PREPARE TRANSACTION 'test_prepared#1';
+:get_with2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ PREPARE TRANSACTION 'test_prepared#1'
+(3 rows)
+
+:get_no2pc
+ data
+------
+(0 rows)
+
COMMIT PREPARED 'test_prepared#1';
+:get_with2pc
+ data
+------
+(0 rows)
+
+:get_no2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ COMMIT
+(3 rows)
+
INSERT INTO test_prepared1 VALUES (2);
-- test abort of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (3);
PREPARE TRANSACTION 'test_prepared#2';
+:get_no2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+(3 rows)
+
+:get_with2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(6 rows)
+
ROLLBACK PREPARED 'test_prepared#2';
+:get_no2pc
+ data
+------
+(0 rows)
+
+:get_with2pc
+ data
+------
+(0 rows)
+
INSERT INTO test_prepared1 VALUES (4);
-- test prepared xact containing ddl
BEGIN;
@@ -26,49 +91,169 @@ INSERT INTO test_prepared1 VALUES (5);
ALTER TABLE test_prepared1 ADD COLUMN data text;
INSERT INTO test_prepared1 VALUES (6, 'frakbar');
PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
-INSERT INTO test_prepared2 VALUES (7);
-COMMIT PREPARED 'test_prepared#3';
--- make sure stuff still works
-INSERT INTO test_prepared1 VALUES (8);
-INSERT INTO test_prepared2 VALUES (9);
--- cleanup
-DROP TABLE test_prepared1;
-DROP TABLE test_prepared2;
--- show results
-SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
- data
--------------------------------------------------------------------------
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'test_prepared1'::regclass;
+ relation | locktype | mode
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
+:get_no2pc
+ data
+----------------------------------------------------
BEGIN
- table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:4
COMMIT
+(3 rows)
+
+:get_with2pc
+ data
+----------------------------------------------------
BEGIN
- table public.test_prepared1: INSERT: id[integer]:2
+ table public.test_prepared1: INSERT: id[integer]:4
COMMIT
+(3 rows)
+
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists. Our 2pc filter callback will skip decoding of xacts
+-- with catalog changes at PREPARE time, so we don't decode it now.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
+INSERT INTO test_prepared2 VALUES (7);
+:get_with2pc
+ data
+----------------------------------------------------
BEGIN
- table public.test_prepared1: INSERT: id[integer]:4
+ table public.test_prepared2: INSERT: id[integer]:7
COMMIT
+(3 rows)
+
+:get_no2pc
+ data
+----------------------------------------------------
BEGIN
table public.test_prepared2: INSERT: id[integer]:7
COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+:get_no2pc
+ data
+-------------------------------------------------------------------------
BEGIN
table public.test_prepared1: INSERT: id[integer]:5
table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
COMMIT
+(4 rows)
+
+:get_with2pc
+ data
+-------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:5
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3';
+ COMMIT PREPARED 'test_prepared#3';
+(5 rows)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (8);
+INSERT INTO test_prepared2 VALUES (9);
+:get_with2pc
+ data
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:9
+ COMMIT
+(6 rows)
+
+:get_no2pc
+ data
+--------------------------------------------------------------------
BEGIN
table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
COMMIT
BEGIN
table public.test_prepared2: INSERT: id[integer]:9
COMMIT
-(22 rows)
+(6 rows)
+
+-- If we do something that takes a strong lock on a catalog relation we need to
+-- read in order to decode a transaction we deadlock; we can't finish decoding
+-- until the lock is released, but we're waiting for decoding to finish so we
+-- can make a commit/abort decision.
+---
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'pg_class'::regclass;
+ relation | locktype | mode
+----------+----------+------
+(0 rows)
+-- Shouldn't see anything with 2pc decoding off
+:get_no2pc
+ data
+------
+(0 rows)
+
+-- If we try to decode it now we'll deadlock
+SET statement_timeout = '10s';
+:get_with2pc_nofilter
+-- FIXME we expect a timeout here, but it actually works...
+ERROR: statement timed out
+
+RESET statement_timeout;
+-- we can decode past it by skipping xacts with catalog changes
+-- and let it be decoded after COMMIT PREPARED, though.
+:get_with2pc
+ data
+------
+(0 rows)
+
+COMMIT PREPARED 'test_prepared_lock';
+-- Both will work normally after we commit
+:get_no2pc
+ data
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ COMMIT
+(4 rows)
+
+:get_with2pc
+ data
+------
+(0 rows)
+
+-- cleanup
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
SELECT pg_drop_replication_slot('regression_slot');
pg_drop_replication_slot
--------------------------
(1 row)
+SELECT pg_drop_replication_slot('regression_slot_2pc');
+ pg_drop_replication_slot
+--------------------------
+
+(1 row)
+
diff --git a/contrib/test_decoding/sql/prepared.sql b/contrib/test_decoding/sql/prepared.sql
index e72639767e..a94503c8b8 100644
--- a/contrib/test_decoding/sql/prepared.sql
+++ b/contrib/test_decoding/sql/prepared.sql
@@ -1,22 +1,36 @@
-- predictability
SET synchronous_commit = on;
SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc', 'test_decoding');
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+
+-- Reused queries
+\set get_no2pc 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot_2pc'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'');'
+\set get_with2pc 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'', ''twophase-decoding'', ''1'');'
+\set get_with2pc_nofilter 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'', ''twophase-decoding'', ''1'', ''twophase-decode-with-catalog-changes'', ''1'');'
-- test simple successful use of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (1);
PREPARE TRANSACTION 'test_prepared#1';
+:get_with2pc
+:get_no2pc
COMMIT PREPARED 'test_prepared#1';
+:get_with2pc
+:get_no2pc
INSERT INTO test_prepared1 VALUES (2);
-- test abort of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (3);
PREPARE TRANSACTION 'test_prepared#2';
+:get_no2pc
+:get_with2pc
ROLLBACK PREPARED 'test_prepared#2';
+:get_no2pc
+:get_with2pc
INSERT INTO test_prepared1 VALUES (4);
@@ -27,24 +41,74 @@ ALTER TABLE test_prepared1 ADD COLUMN data text;
INSERT INTO test_prepared1 VALUES (6, 'frakbar');
PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'test_prepared1'::regclass;
+
+:get_no2pc
+:get_with2pc
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists. Our 2pc filter callback will skip decoding of xacts
+-- with catalog changes at PREPARE time, so we don't decode it now.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
INSERT INTO test_prepared2 VALUES (7);
+:get_with2pc
+:get_no2pc
COMMIT PREPARED 'test_prepared#3';
+:get_no2pc
+:get_with2pc
-- make sure stuff still works
INSERT INTO test_prepared1 VALUES (8);
INSERT INTO test_prepared2 VALUES (9);
+:get_with2pc
+:get_no2pc
+
+-- If we do something that takes a strong lock on a catalog relation we need to
+-- read in order to decode a transaction we deadlock; we can't finish decoding
+-- until the lock is released, but we're waiting for decoding to finish so we
+-- can make a commit/abort decision.
+---
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'pg_class'::regclass;
+
+-- Shouldn't see anything with 2pc decoding off
+:get_no2pc
+
+-- If we try to decode it now we'll deadlock
+SET statement_timeout = '10s';
+:get_with2pc_nofilter
+RESET statement_timeout;
+
+-- we can decode past it by skipping xacts with catalog changes
+-- and let it be decoded after COMMIT PREPARED, though.
+:get_with2pc
+
+COMMIT PREPARED 'test_prepared_lock';
+
+-- Both will work normally after we commit
+:get_no2pc
+:get_with2pc
-- cleanup
DROP TABLE test_prepared1;
DROP TABLE test_prepared2;
--- show results
-SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
-
SELECT pg_drop_replication_slot('regression_slot');
+SELECT pg_drop_replication_slot('regression_slot_2pc');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 135b3b7638..fb0deacfb3 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -24,6 +24,8 @@
#include "replication/message.h"
#include "replication/origin.h"
+#include "storage/procarray.h"
+
#include "utils/builtins.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
@@ -46,6 +48,8 @@ typedef struct
bool skip_empty_xacts;
bool xact_wrote_changes;
bool only_local;
+ bool twophase_decoding;
+ bool twophase_decode_with_catalog_changes;
} TestDecodingData;
static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -68,6 +72,19 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr message_lsn,
bool transactional, const char *prefix,
Size sz, const char *message);
+static bool pg_filter_prepare(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ char *gid);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
void
_PG_init(void)
@@ -85,9 +102,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
cb->begin_cb = pg_decode_begin_txn;
cb->change_cb = pg_decode_change;
cb->commit_cb = pg_decode_commit_txn;
+
cb->filter_by_origin_cb = pg_decode_filter;
cb->shutdown_cb = pg_decode_shutdown;
cb->message_cb = pg_decode_message;
+
+ cb->filter_prepare_cb = pg_filter_prepare;
+ cb->prepare_cb = pg_decode_prepare_txn;
+ cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+ cb->abort_prepared_cb = pg_decode_abort_prepared_txn;
}
@@ -107,6 +130,8 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
data->include_timestamp = false;
data->skip_empty_xacts = false;
data->only_local = false;
+ data->twophase_decoding = false;
+ data->twophase_decode_with_catalog_changes = false;
ctx->output_plugin_private = data;
@@ -176,6 +201,27 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
errmsg("could not parse value \"%s\" for parameter \"%s\"",
strVal(elem->arg), elem->defname)));
}
+ else if (strcmp(elem->defname, "twophase-decoding") == 0)
+ {
+
+ if (elem->arg == NULL)
+ data->twophase_decoding = true;
+ else if (!parse_bool(strVal(elem->arg), &data->twophase_decoding))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("could not parse value \"%s\" for parameter \"%s\"",
+ strVal(elem->arg), elem->defname)));
+ }
+ else if (strcmp(elem->defname, "twophase-decode-with-catalog-changes") == 0)
+ {
+ if (elem->arg == NULL)
+ data->twophase_decode_with_catalog_changes = true;
+ else if (!parse_bool(strVal(elem->arg), &data->twophase_decode_with_catalog_changes))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("could not parse value \"%s\" for parameter \"%s\"",
+ strVal(elem->arg), elem->defname)));
+ }
else
{
ereport(ERROR,
@@ -232,10 +278,163 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
return;
OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfoString(ctx->out, "COMMIT");
+
if (data->include_xids)
- appendStringInfo(ctx->out, "COMMIT %u", txn->xid);
- else
- appendStringInfoString(ctx->out, "COMMIT");
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+
+/* Filter out unnecessary two-phase transactions */
+static bool
+pg_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ char *gid)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ /* treat all transactions as one-phase */
+ if (!data->twophase_decoding)
+ return true;
+
+ /*
+ * Two-phase transactions that accessed the catalog require special
+ * treatment.
+ *
+ * Right now we don't have a safe way to decode catalog changes made in
+ * a prepared transaction that was already aborted by the time of
+ * decoding.
+ *
+ * That kind of problem arises only when we are trying to
+ * retrospectively decode aborted transactions with catalog changes,
+ * including when a transaction aborts while we're decoding it. If one
+ * wants to implement distributed commit based on prepare decoding, then
+ * commits/aborts will happen strictly after decoding has
+ * completed, so it is possible to skip any checks/locks here.
+ *
+ * We'll also get stuck trying to acquire locks on catalog relations
+ * we need for decoding if the prepared xact holds a strong lock on
+ * one of them and we also need to decode row changes.
+ */
+ if (txn->has_catalog_changes)
+ {
+ LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+
+ if (TransactionIdIsInProgress(txn->xid))
+ {
+ /*
+ * For the sake of simplicity, by default we just
+ * ignore in-progress prepared transactions with catalog
+ * changes in this extension. If they abort during
+ * decoding then tuples we need to decode them may be
+ * overwritten while we're still decoding, causing
+ * wrong catalog lookups.
+ *
+ * It is possible to move that LWLockRelease() to
+ * pg_decode_prepare_txn() and allow decoding of a
+ * running prepared tx, but such a lock will prevent any
+ * 2pc transaction commit during decoding time. That
+ * can be a long time in case of lots of
+ * changes/inserts in that tx or if the downstream is
+ * slow/unresponsive.
+ *
+ * (Continuing to decode without the lock is unsafe, XXX)
+ */
+ LWLockRelease(TwoPhaseStateLock);
+ return !data->twophase_decode_with_catalog_changes;
+ }
+ else if (TransactionIdDidAbort(txn->xid))
+ {
+ /*
+ * Here we know that it is already aborted and there is
+ * not much sense in doing something with this
+ * transaction. Consequently ABORT PREPARED will be
+ * suppressed.
+ */
+ LWLockRelease(TwoPhaseStateLock);
+ return true;
+ }
+
+ LWLockRelease(TwoPhaseStateLock);
+ }
+
+ return false;
+}
+
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* ABORT PREPARED callback */
+static void
+pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "ABORT PREPARED %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
if (data->include_timestamp)
appendStringInfo(ctx->out, " (at %s)",
diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 3aafa79e52..8756e4ed18 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -100,8 +100,13 @@ ParseCommitRecord(uint8 info, xl_xact_commit *xlrec, xl_xact_parsed_commit *pars
xl_xact_twophase *xl_twophase = (xl_xact_twophase *) data;
parsed->twophase_xid = xl_twophase->xid;
-
data += sizeof(xl_xact_twophase);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_GID)
+ {
+ strcpy(parsed->twophase_gid, data);
+ data += strlen(parsed->twophase_gid) + 1;
+ }
}
if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
@@ -139,6 +144,16 @@ ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_abort *parsed)
data += sizeof(xl_xact_xinfo);
}
+ if (parsed->xinfo & XACT_XINFO_HAS_DBINFO)
+ {
+ xl_xact_dbinfo *xl_dbinfo = (xl_xact_dbinfo *) data;
+
+ parsed->dbId = xl_dbinfo->dbId;
+ parsed->tsId = xl_dbinfo->tsId;
+
+ data += sizeof(xl_xact_dbinfo);
+ }
+
if (parsed->xinfo & XACT_XINFO_HAS_SUBXACTS)
{
xl_xact_subxacts *xl_subxacts = (xl_xact_subxacts *) data;
@@ -166,8 +181,26 @@ ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_abort *parsed)
xl_xact_twophase *xl_twophase = (xl_xact_twophase *) data;
parsed->twophase_xid = xl_twophase->xid;
-
data += sizeof(xl_xact_twophase);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_GID)
+ {
+ strcpy(parsed->twophase_gid, data);
+ data += strlen(parsed->twophase_gid) + 1;
+ }
+ }
+
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ xl_xact_origin xl_origin;
+
+ /* we're only guaranteed 4 byte alignment, so copy onto stack */
+ memcpy(&xl_origin, data, sizeof(xl_origin));
+
+ parsed->origin_lsn = xl_origin.origin_lsn;
+ parsed->origin_timestamp = xl_origin.origin_timestamp;
+
+ data += sizeof(xl_xact_origin);
}
}
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index b715152e8d..c764c6c22b 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -148,7 +148,6 @@ int max_prepared_xacts = 0;
* Note that the max value of GIDSIZE must fit in the uint16 gidlen,
* specified in TwoPhaseFileHeader.
*/
-#define GIDSIZE 200
typedef struct GlobalTransactionData
{
@@ -211,12 +210,14 @@ static void RecordTransactionCommitPrepared(TransactionId xid,
RelFileNode *rels,
int ninvalmsgs,
SharedInvalidationMessage *invalmsgs,
- bool initfileinval);
+ bool initfileinval,
+ const char *gid);
static void RecordTransactionAbortPrepared(TransactionId xid,
int nchildren,
TransactionId *children,
int nrels,
- RelFileNode *rels);
+ RelFileNode *rels,
+ const char *gid);
static void ProcessRecords(char *bufptr, TransactionId xid,
const TwoPhaseCallback callbacks[]);
static void RemoveGXact(GlobalTransaction gxact);
@@ -898,7 +899,7 @@ TwoPhaseGetDummyProc(TransactionId xid)
/*
* Header for a 2PC state file
*/
-#define TWOPHASE_MAGIC 0x57F94533 /* format identifier */
+#define TWOPHASE_MAGIC 0x57F94534 /* format identifier */
typedef struct TwoPhaseFileHeader
{
@@ -914,6 +915,8 @@ typedef struct TwoPhaseFileHeader
int32 ninvalmsgs; /* number of cache invalidation messages */
bool initfileinval; /* does relcache init file need invalidation? */
uint16 gidlen; /* length of the GID - GID follows the header */
+ XLogRecPtr origin_lsn; /* lsn of this record at origin node */
+ TimestampTz origin_timestamp; /* time of prepare at origin node */
} TwoPhaseFileHeader;
/*
@@ -1065,6 +1068,7 @@ EndPrepare(GlobalTransaction gxact)
{
TwoPhaseFileHeader *hdr;
StateFileChunk *record;
+ bool replorigin;
/* Add the end sentinel to the list of 2PC records */
RegisterTwoPhaseRecord(TWOPHASE_RM_END_ID, 0,
@@ -1075,6 +1079,21 @@ EndPrepare(GlobalTransaction gxact)
Assert(hdr->magic == TWOPHASE_MAGIC);
hdr->total_len = records.total_len + sizeof(pg_crc32c);
+ replorigin = (replorigin_session_origin != InvalidRepOriginId &&
+ replorigin_session_origin != DoNotReplicateId);
+
+ if (replorigin)
+ {
+ Assert(replorigin_session_origin_lsn != InvalidXLogRecPtr);
+ hdr->origin_lsn = replorigin_session_origin_lsn;
+ hdr->origin_timestamp = replorigin_session_origin_timestamp;
+ }
+ else
+ {
+ hdr->origin_lsn = InvalidXLogRecPtr;
+ hdr->origin_timestamp = 0;
+ }
+
/*
* If the data size exceeds MaxAllocSize, we won't be able to read it in
* ReadTwoPhaseFile. Check for that now, rather than fail in the case
@@ -1105,9 +1124,19 @@ EndPrepare(GlobalTransaction gxact)
MyPgXact->delayChkpt = true;
XLogBeginInsert();
+
for (record = records.head; record != NULL; record = record->next)
XLogRegisterData(record->data, record->len);
+
+ XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
+
gxact->prepare_end_lsn = XLogInsert(RM_XACT_ID, XLOG_XACT_PREPARE);
+
+ if (replorigin)
+ /* Move LSNs forward for this replication origin */
+ replorigin_session_advance(replorigin_session_origin_lsn,
+ gxact->prepare_end_lsn);
+
XLogFlush(gxact->prepare_end_lsn);
/* If we crash now, we have prepared: WAL replay will fix things */
@@ -1283,6 +1312,43 @@ ReadTwoPhaseFile(TransactionId xid, bool give_warnings)
return buf;
}
+/*
+ * ParsePrepareRecord
+ */
+void
+ParsePrepareRecord(uint8 info, char *xlrec, xl_xact_parsed_prepare *parsed)
+{
+ TwoPhaseFileHeader *hdr;
+ char *bufptr;
+
+ hdr = (TwoPhaseFileHeader *) xlrec;
+ bufptr = xlrec + MAXALIGN(sizeof(TwoPhaseFileHeader));
+
+ parsed->origin_lsn = hdr->origin_lsn;
+ parsed->origin_timestamp = hdr->origin_timestamp;
+ parsed->twophase_xid = hdr->xid;
+ parsed->dbId = hdr->database;
+ parsed->nsubxacts = hdr->nsubxacts;
+ parsed->ncommitrels = hdr->ncommitrels;
+ parsed->nabortrels = hdr->nabortrels;
+ parsed->nmsgs = hdr->ninvalmsgs;
+
+ strncpy(parsed->twophase_gid, bufptr, hdr->gidlen);
+ bufptr += MAXALIGN(hdr->gidlen);
+
+ parsed->subxacts = (TransactionId *) bufptr;
+ bufptr += MAXALIGN(hdr->nsubxacts * sizeof(TransactionId));
+
+ parsed->commitrels = (RelFileNode *) bufptr;
+ bufptr += MAXALIGN(hdr->ncommitrels * sizeof(RelFileNode));
+
+ parsed->abortrels = (RelFileNode *) bufptr;
+ bufptr += MAXALIGN(hdr->nabortrels * sizeof(RelFileNode));
+
+ parsed->msgs = (SharedInvalidationMessage *) bufptr;
+ bufptr += MAXALIGN(hdr->ninvalmsgs * sizeof(SharedInvalidationMessage));
+}
+
/*
* Reads 2PC data from xlog. During checkpoint this data will be moved to
@@ -1435,11 +1501,12 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
hdr->nsubxacts, children,
hdr->ncommitrels, commitrels,
hdr->ninvalmsgs, invalmsgs,
- hdr->initfileinval);
+ hdr->initfileinval, gid);
else
RecordTransactionAbortPrepared(xid,
hdr->nsubxacts, children,
- hdr->nabortrels, abortrels);
+ hdr->nabortrels, abortrels,
+ gid);
ProcArrayRemove(proc, latestXid);
@@ -2165,7 +2232,8 @@ RecordTransactionCommitPrepared(TransactionId xid,
RelFileNode *rels,
int ninvalmsgs,
SharedInvalidationMessage *invalmsgs,
- bool initfileinval)
+ bool initfileinval,
+ const char *gid)
{
XLogRecPtr recptr;
TimestampTz committs = GetCurrentTimestamp();
@@ -2193,7 +2261,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
ninvalmsgs, invalmsgs,
initfileinval, false,
MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
- xid);
+ xid, gid);
if (replorigin)
@@ -2255,7 +2323,8 @@ RecordTransactionAbortPrepared(TransactionId xid,
int nchildren,
TransactionId *children,
int nrels,
- RelFileNode *rels)
+ RelFileNode *rels,
+ const char *gid)
{
XLogRecPtr recptr;
@@ -2278,7 +2347,7 @@ RecordTransactionAbortPrepared(TransactionId xid,
nchildren, children,
nrels, rels,
MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
- xid);
+ xid, gid);
/* Always flush, since we're about to remove the 2PC state file */
XLogFlush(recptr);
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index c06fabca10..e22622bfb2 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1226,7 +1226,7 @@ RecordTransactionCommit(void)
nmsgs, invalMessages,
RelcacheInitFileInval, forceSyncCommit,
MyXactFlags,
- InvalidTransactionId /* plain commit */ );
+ InvalidTransactionId, NULL /* plain commit */ );
if (replorigin)
/* Move LSNs forward for this replication origin */
@@ -1578,7 +1578,8 @@ RecordTransactionAbort(bool isSubXact)
XactLogAbortRecord(xact_time,
nchildren, children,
nrels, rels,
- MyXactFlags, InvalidTransactionId);
+ MyXactFlags, InvalidTransactionId,
+ NULL);
/*
* Report the latest async abort LSN, so that the WAL writer knows to
@@ -5256,7 +5257,8 @@ XactLogCommitRecord(TimestampTz commit_time,
int nrels, RelFileNode *rels,
int nmsgs, SharedInvalidationMessage *msgs,
bool relcacheInval, bool forceSync,
- int xactflags, TransactionId twophase_xid)
+ int xactflags, TransactionId twophase_xid,
+ const char *twophase_gid)
{
xl_xact_commit xlrec;
xl_xact_xinfo xl_xinfo;
@@ -5268,6 +5270,7 @@ XactLogCommitRecord(TimestampTz commit_time,
xl_xact_origin xl_origin;
uint8 info;
+ int gidlen = 0;
Assert(CritSectionCount > 0);
@@ -5330,6 +5333,13 @@ XactLogCommitRecord(TimestampTz commit_time,
{
xl_xinfo.xinfo |= XACT_XINFO_HAS_TWOPHASE;
xl_twophase.xid = twophase_xid;
+ Assert(twophase_gid != NULL);
+
+ if (XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_GID;
+ gidlen = strlen(twophase_gid) + 1; /* include '\0' */
+ }
}
/* dump transaction origin information */
@@ -5380,8 +5390,13 @@ XactLogCommitRecord(TimestampTz commit_time,
}
if (xl_xinfo.xinfo & XACT_XINFO_HAS_TWOPHASE)
+ {
XLogRegisterData((char *) (&xl_twophase), sizeof(xl_xact_twophase));
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_GID)
+ XLogRegisterData((char *) twophase_gid, gidlen);
+ }
+
if (xl_xinfo.xinfo & XACT_XINFO_HAS_ORIGIN)
XLogRegisterData((char *) (&xl_origin), sizeof(xl_xact_origin));
@@ -5401,15 +5416,19 @@ XLogRecPtr
XactLogAbortRecord(TimestampTz abort_time,
int nsubxacts, TransactionId *subxacts,
int nrels, RelFileNode *rels,
- int xactflags, TransactionId twophase_xid)
+ int xactflags, TransactionId twophase_xid,
+ const char *twophase_gid)
{
xl_xact_abort xlrec;
xl_xact_xinfo xl_xinfo;
xl_xact_subxacts xl_subxacts;
xl_xact_relfilenodes xl_relfilenodes;
xl_xact_twophase xl_twophase;
+ xl_xact_dbinfo xl_dbinfo;
+ xl_xact_origin xl_origin;
uint8 info;
+ int gidlen = 0;
Assert(CritSectionCount > 0);
@@ -5421,7 +5440,6 @@ XactLogAbortRecord(TimestampTz abort_time,
else
info = XLOG_XACT_ABORT_PREPARED;
-
/* First figure out and collect all the information needed */
xlrec.xact_time = abort_time;
@@ -5445,6 +5463,31 @@ XactLogAbortRecord(TimestampTz abort_time,
{
xl_xinfo.xinfo |= XACT_XINFO_HAS_TWOPHASE;
xl_twophase.xid = twophase_xid;
+ Assert(twophase_gid != NULL);
+
+ if (XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_GID;
+ gidlen = strlen(twophase_gid) + 1; /* include '\0' */
+ }
+ }
+
+ if (TransactionIdIsValid(twophase_xid) && XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_DBINFO;
+ xl_dbinfo.dbId = MyDatabaseId;
+ xl_dbinfo.tsId = MyDatabaseTableSpace;
+ }
+
+ /* dump transaction origin information only for abort prepared */
+ if ( (replorigin_session_origin != InvalidRepOriginId) &&
+ TransactionIdIsValid(twophase_xid) &&
+ XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_ORIGIN;
+
+ xl_origin.origin_lsn = replorigin_session_origin_lsn;
+ xl_origin.origin_timestamp = replorigin_session_origin_timestamp;
}
if (xl_xinfo.xinfo != 0)
@@ -5459,6 +5502,9 @@ XactLogAbortRecord(TimestampTz abort_time,
if (xl_xinfo.xinfo != 0)
XLogRegisterData((char *) (&xl_xinfo), sizeof(xl_xinfo));
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_DBINFO)
+ XLogRegisterData((char *) (&xl_dbinfo), sizeof(xl_dbinfo));
+
if (xl_xinfo.xinfo & XACT_XINFO_HAS_SUBXACTS)
{
XLogRegisterData((char *) (&xl_subxacts),
@@ -5476,8 +5522,22 @@ XactLogAbortRecord(TimestampTz abort_time,
}
if (xl_xinfo.xinfo & XACT_XINFO_HAS_TWOPHASE)
+ {
XLogRegisterData((char *) (&xl_twophase), sizeof(xl_xact_twophase));
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_GID)
+ XLogRegisterData((char *) twophase_gid, gidlen);
+ }
+
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_DBINFO)
+ XLogRegisterData((char *) (&xl_dbinfo), sizeof(xl_dbinfo));
+
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_ORIGIN)
+ XLogRegisterData((char *) (&xl_origin), sizeof(xl_xact_origin));
+
+ if (TransactionIdIsValid(twophase_xid))
+ XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
+
return XLogInsert(RM_XACT_ID, info);
}
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 486fd0c988..7b2eec2402 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -34,6 +34,7 @@
#include "access/xlogutils.h"
#include "access/xlogreader.h"
#include "access/xlogrecord.h"
+#include "access/twophase.h"
#include "catalog/pg_control.h"
@@ -71,7 +72,9 @@ static void DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf
static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_commit *parsed, TransactionId xid);
static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
- xl_xact_parsed_abort *parsed, TransactionId xid);
+ xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare *parsed);
/* common function to decode tuples */
static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -277,17 +280,33 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
break;
}
case XLOG_XACT_PREPARE:
+ {
+ xl_xact_parsed_prepare parsed;
- /*
- * Currently decoding ignores PREPARE TRANSACTION and will just
- * decode the transaction when the COMMIT PREPARED is sent or
- * throw away the transaction's contents when a ROLLBACK PREPARED
- * is received. In the future we could add code to expose prepared
- * transactions in the changestream allowing for a kind of
- * distributed 2PC.
- */
- ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
- break;
+ /* check that the output plugin is capable of twophase decoding */
+ if (!ctx->twophase_hadling)
+ {
+ ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+ break;
+ }
+
+ /* ok, parse it */
+ ParsePrepareRecord(XLogRecGetInfo(buf->record),
+ XLogRecGetData(buf->record), &parsed);
+
+ /* does the output plugin want this particular transaction? */
+ if (ctx->callbacks.filter_prepare_cb &&
+ ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+ parsed.twophase_gid))
+ {
+ ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+ buf->origptr);
+ break;
+ }
+
+ DecodePrepare(ctx, buf, &parsed);
+ break;
+ }
default:
elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
}
@@ -551,8 +570,13 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
* Process invalidation messages, even if we're not interested in the
* transaction's contents, since the various caches need to always be
* consistent.
+ *
+ * Also, if this transaction was sent to the prepare callback, then both
+ * of these functions were already called during prepare.
*/
- if (parsed->nmsgs > 0)
+ if (parsed->nmsgs > 0 &&
+ !(TransactionIdIsValid(parsed->twophase_xid) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder, xid)))
{
ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
parsed->nmsgs, parsed->msgs);
@@ -607,9 +631,81 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
buf->origptr, buf->endptr);
}
+ if (TransactionIdIsValid(parsed->twophase_xid) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder, xid))
+ {
+ /*
+ * We are processing COMMIT PREPARED and know that the reorder buffer is
+ * empty, so we can use a shortcut for committing a bare xact.
+ */
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn, parsed->twophase_gid, true);
+ }
+ else
+ {
+ /* replay actions of all transaction + subtransactions in order */
+ ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn);
+ }
+}
+
+
+/*
+ * Decode PREPARE record. Same logic as in COMMIT, but different calls
+ * to SnapshotBuilder as we need to mark this transaction as committed
+ * instead of running to properly decode it. When the prepared transaction
+ * is decoded we mark it in the snapshot as running again.
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare *parsed)
+{
+ XLogRecPtr origin_lsn = parsed->origin_lsn;
+ TimestampTz commit_time = parsed->origin_timestamp;
+ XLogRecPtr origin_id = XLogRecGetOrigin(buf->record);
+ int i;
+ TransactionId xid = parsed->twophase_xid;
+
+ /*
+ * Process invalidation messages, even if we're not interested in the
+ * transaction's contents, since the various caches need to always be
+ * consistent.
+ */
+ if (parsed->nmsgs > 0)
+ {
+ ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
+ parsed->nmsgs, parsed->msgs);
+ ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+ }
+
+ SnapBuildPrepareTxnStart(ctx->snapshot_builder, buf->origptr, xid,
+ parsed->nsubxacts, parsed->subxacts);
+
+ if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+ (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+ FilterByOrigin(ctx, origin_id))
+ {
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr);
+ }
+ ReorderBufferForget(ctx->reorder, xid, buf->origptr);
+
+ return;
+ }
+
+ /* tell the reorderbuffer about the surviving subtransactions */
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+ buf->origptr, buf->endptr);
+ }
+
/* replay actions of all transaction + subtransactions in order */
- ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
- commit_time, origin_id, origin_lsn);
+ ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn, parsed->twophase_gid);
+
+ SnapBuildPrepareTxnFinish(ctx->snapshot_builder, xid);
}
/*
@@ -621,6 +717,30 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid)
{
int i;
+ XLogRecPtr origin_lsn = InvalidXLogRecPtr;
+ TimestampTz commit_time = 0;
+ RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ origin_lsn = parsed->origin_lsn;
+ commit_time = parsed->origin_timestamp;
+ }
+
+ /*
+ * If this is ROLLBACK PREPARED, then send it to the callbacks.
+ */
+ if (TransactionIdIsValid(xid) &&
+ !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+ parsed->dbId == ctx->slot->data.database &&
+ !FilterByOrigin(ctx, origin_id) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder, xid))
+ {
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid, false);
+ return;
+ }
for (i = 0; i < parsed->nsubxacts; i++)
{
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index bca585fc27..93ba3fbc5a 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -58,6 +58,14 @@ static void startup_cb_wrapper(LogicalDecodingContext *ctx, OutputPluginOptions
bool is_init);
static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -124,6 +132,7 @@ StartupDecodingContext(List *output_plugin_options,
MemoryContext context,
old_context;
LogicalDecodingContext *ctx;
+ int twophase_callbacks;
/* shorter lines... */
slot = MyReplicationSlot;
@@ -182,8 +191,25 @@ StartupDecodingContext(List *output_plugin_options,
ctx->reorder->begin = begin_cb_wrapper;
ctx->reorder->apply_change = change_cb_wrapper;
ctx->reorder->commit = commit_cb_wrapper;
+ ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+ ctx->reorder->prepare = prepare_cb_wrapper;
+ ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+ ctx->reorder->abort_prepared = abort_prepared_cb_wrapper;
ctx->reorder->message = message_cb_wrapper;
+ /* check that plugin implements all necessary callbacks to perform 2PC */
+ twophase_callbacks = (ctx->callbacks.prepare_cb != NULL) +
+ (ctx->callbacks.commit_prepared_cb != NULL) +
+ (ctx->callbacks.abort_prepared_cb != NULL);
+
+ ctx->twophase_hadling = (twophase_callbacks == 3);
+
+ if (twophase_callbacks != 3 && twophase_callbacks != 0)
+ ereport(WARNING,
+ (errmsg("Output plugin registered only %d twophase callbacks out of 3. "
+ "Twophase transactions will be decoded as ordinary ones.",
+ twophase_callbacks)));
+
ctx->out = makeStringInfo();
ctx->prepare_write = prepare_write;
ctx->write = do_write;
@@ -680,6 +706,93 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
}
static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "prepare";
+ state.report_location = txn->final_lsn; /* beginning of prepare record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "commit_prepared";
+ state.report_location = txn->final_lsn; /* beginning of commit record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "abort_prepared";
+ state.report_location = txn->final_lsn; /* beginning of abort record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.abort_prepared_cb(ctx, txn, abort_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change)
{
@@ -714,6 +827,34 @@ change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
error_context_stack = errcallback.previous;
}
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn, char *gid)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+ bool ret;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "filter_prepare";
+ state.report_location = InvalidXLogRecPtr;
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = false;
+
+ /* do the actual work: call callback */
+ ret = ctx->callbacks.filter_prepare_cb(ctx, txn, gid);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+ return ret;
+}
+
bool
filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
{
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 9b126b2957..6952cbc28d 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -101,6 +101,66 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
commit_data->commit_lsn = pq_getmsgint64(in);
commit_data->end_lsn = pq_getmsgint64(in);
commit_data->committime = pq_getmsgint64(in);
+
+ /* set gid to empty */
+ commit_data->gid[0] = '\0';
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ uint8 flags = 0;
+
+ pq_sendbyte(out, 'P'); /* sending PREPARE protocol */
+
+ if (txn->txn_flags & TXN_COMMIT_PREPARED)
+ flags |= LOGICALREP_IS_COMMIT_PREPARED;
+ else if (txn->txn_flags & TXN_ROLLBACK_PREPARED)
+ flags |= LOGICALREP_IS_ROLLBACK_PREPARED;
+ else if (txn->txn_flags & TXN_PREPARE)
+ flags |= LOGICALREP_IS_PREPARE;
+
+ if (flags == 0)
+ elog(ERROR, "unrecognized flags %u in [commit|rollback] prepare message", flags);
+
+ /* send the flags field */
+ pq_sendbyte(out, flags);
+
+ /* send fields */
+ pq_sendint64(out, prepare_lsn);
+ pq_sendint64(out, txn->end_lsn);
+ pq_sendint64(out, txn->commit_time);
+
+ /* send gid */
+ pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepCommitData *commit_data, uint8 *flags)
+{
+ /* read flags */
+ uint8 prep_flags = pq_getmsgbyte(in);
+
+ if (!(prep_flags & LOGICALREP_PREPARE_MASK))
+ elog(ERROR, "unrecognized flags %u in prepare message", prep_flags);
+
+ /* read fields */
+ commit_data->commit_lsn = pq_getmsgint64(in);
+ commit_data->end_lsn = pq_getmsgint64(in);
+ commit_data->committime = pq_getmsgint64(in);
+
+ /* read gid */
+ strlcpy(commit_data->gid, pq_getmsgstring(in), sizeof(commit_data->gid));
+
+ /* set flags */
+ *flags = prep_flags;
}
/*
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 0f607bab70..3d9598aab8 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1307,25 +1307,18 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
* the top and subtransactions (using a k-way merge) and replay the changes in
* lsn order.
*/
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+ ReorderBuffer *rb, TransactionId xid,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
TimestampTz commit_time,
RepOriginId origin_id, XLogRecPtr origin_lsn)
{
- ReorderBufferTXN *txn;
volatile Snapshot snapshot_now;
volatile CommandId command_id = FirstCommandId;
bool using_subtxn;
ReorderBufferIterTXNState *volatile iterstate = NULL;
- txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
- false);
-
- /* unknown transaction, nothing to replay */
- if (txn == NULL)
- return;
-
txn->final_lsn = commit_lsn;
txn->end_lsn = end_lsn;
txn->commit_time = commit_time;
@@ -1604,8 +1597,11 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
ReorderBufferIterTXNFinish(rb, iterstate);
iterstate = NULL;
- /* call commit callback */
- rb->commit(rb, txn, commit_lsn);
+ /* call commit or prepare callback */
+ if (txn_prepared(txn))
+ rb->prepare(rb, txn, commit_lsn);
+ else
+ rb->commit(rb, txn, commit_lsn);
/* this is just a sanity check against bad output plugin behaviour */
if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -1632,8 +1628,12 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
if (snapshot_now->copied)
ReorderBufferFreeSnap(rb, snapshot_now);
- /* remove potential on-disk data, and deallocate */
- ReorderBufferCleanupTXN(rb, txn);
+ /*
+ * Remove potential on-disk data and deallocate, or postpone that
+ * until the two-phase transaction finishes.
+ */
+ if (!txn_prepared(txn))
+ ReorderBufferCleanupTXN(rb, txn);
}
PG_CATCH();
{
@@ -1667,6 +1667,125 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
}
/*
+ * Ask the output plugin whether we should skip this PREPARE and send
+ * this transaction as a one-phase transaction later, on commit.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+ return rb->filter_prepare(rb, txn, gid);
+}
+
+
+/*
+ * Commit a non-two-phase transaction. See the comments on ReorderBufferCommitInternal().
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /* unknown transaction, nothing to replay */
+ if (txn == NULL)
+ return;
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a two-phase transaction. This calls ReorderBufferCommitInternal(),
+ * since all transaction changes should be decoded on PREPARE.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /* unknown transaction, nothing to replay */
+ if (txn == NULL)
+ return;
+
+ txn->txn_flags |= TXN_PREPARE;
+ strcpy(txn->gid, gid);
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Check whether this transaction was sent as prepared to receiver.
+ * Called upon commit/abort prepared.
+ */
+bool
+ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /*
+ * If txn == NULL, then presumably the subscriber confirmed the prepare
+ * but we have since been restarted.
+ */
+ return txn == NULL ? true : txn_prepared(txn);
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ABORT PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+ true);
+
+ txn->final_lsn = commit_lsn;
+ txn->end_lsn = end_lsn;
+ txn->commit_time = commit_time;
+ txn->origin_id = origin_id;
+ txn->origin_lsn = origin_lsn;
+ strcpy(txn->gid, gid);
+
+ if (is_commit)
+ {
+ txn->txn_flags |= TXN_COMMIT_PREPARED;
+ rb->commit_prepared(rb, txn, commit_lsn);
+ }
+ else
+ {
+ txn->txn_flags |= TXN_ROLLBACK_PREPARED;
+ rb->abort_prepared(rb, txn, commit_lsn);
+ }
+
+ ReorderBufferCleanupTXN(rb, txn);
+}
+
+/*
* Abort a transaction that possibly has previous changes. Needs to be first
* called for subtransactions and then for the toplevel xid.
*
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index ad65b9831d..3ba6841770 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -901,7 +901,7 @@ SnapBuildPurgeCommittedTxn(SnapBuild *builder)
/* copy xids that still are interesting to workspace */
for (off = 0; off < builder->committed.xcnt; off++)
{
- if (NormalTransactionIdPrecedes(builder->committed.xip[off],
+ if (TransactionIdPrecedes(builder->committed.xip[off],
builder->xmin))
; /* remove */
else
@@ -1079,6 +1079,52 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
}
}
+/*
+ * Just a wrapper to clarify DecodePrepare().
+ * Right now we can't extract correct historic catalog data that
+ * was produced by an aborted prepared transaction, so it is the job of the
+ * decoding plugin to avoid such a situation; here we just construct the usual
+ * snapshot to be able to decode the prepare.
+ */
+void
+SnapBuildPrepareTxnStart(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
+ int nsubxacts, TransactionId *subxacts)
+{
+ SnapBuildCommitTxn(builder, lsn, xid, nsubxacts, subxacts);
+}
+
+
+/*
+ * When decoding of the prepare is finished, we should exclude our xid
+ * from the list of committed xids to have a correct snapshot between prepare
+ * and commit.
+ *
+ * However, this is not strictly needed. A prepared transaction holds locks
+ * between prepare and commit, so nobody can produce a new version of our
+ * catalog tuples. In case of abort we will have this xid in the array of
+ * committed xids, but that also will not cause a problem, since the checks of
+ * HeapTupleHeaderXminInvalid() in HeapTupleSatisfiesHistoricMVCC()
+ * have higher priority than the checks for the xip array. Anyway, let's be
+ * consistent about definitions and delete this xid from the xip array.
+ */
+void
+SnapBuildPrepareTxnFinish(SnapBuild *builder, TransactionId xid)
+{
+ TransactionId *search = bsearch(&xid, builder->committed.xip,
+ builder->committed.xcnt, sizeof(TransactionId), xidComparator);
+
+ if (search == NULL)
+ return;
+
+ /* delete that xid */
+ memmove(search, search + 1,
+ ((builder->committed.xip + builder->committed.xcnt - 1) - search) * sizeof(TransactionId));
+ builder->committed.xcnt--;
+
+ /* update min/max */
+ builder->xmin = builder->committed.xip[0];
+ builder->xmax = builder->committed.xip[builder->committed.xcnt - 1];
+}
/* -----------------------------------
* Snapshot building functions dealing with xlog records
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index fa5d9bb120..f1e91efeec 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -487,6 +487,121 @@ apply_handle_commit(StringInfo s)
pgstat_report_activity(STATE_IDLE, NULL);
}
+static void
+apply_handle_prepare_txn(LogicalRepCommitData *commit_data)
+{
+ Assert(commit_data->commit_lsn == remote_final_lsn);
+ /* The synchronization worker runs in single transaction. */
+ if (IsTransactionState() && !am_tablesync_worker())
+ {
+ /* End the earlier transaction and start a new one */
+ BeginTransactionBlock();
+ CommitTransactionCommand();
+ StartTransactionCommand();
+ /*
+ * Update origin state so we can restart streaming from correct
+ * position in case of crash.
+ */
+ replorigin_session_origin_lsn = commit_data->end_lsn;
+ replorigin_session_origin_timestamp = commit_data->committime;
+
+ PrepareTransactionBlock(commit_data->gid);
+ CommitTransactionCommand();
+ pgstat_report_stat(false);
+
+ store_flush_position(commit_data->end_lsn);
+ }
+ else
+ {
+ /* Process any invalidation messages that might have accumulated. */
+ AcceptInvalidationMessages();
+ /* TODO: what to do here for prepared transactions?? */
+ Assert(false);
+ }
+
+ in_remote_transaction = false;
+
+ /* Process any tables that are being synchronized in parallel. */
+ process_syncing_tables(commit_data->end_lsn);
+
+ pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+static void
+apply_handle_commit_prepared_txn(LogicalRepCommitData *commit_data)
+{
+ /* there is no transaction when COMMIT PREPARED is called */
+ ensure_transaction();
+
+ /*
+ * Update origin state so we can restart streaming from correct
+ * position in case of crash.
+ */
+ replorigin_session_origin_lsn = commit_data->end_lsn;
+ replorigin_session_origin_timestamp = commit_data->committime;
+
+ FinishPreparedTransaction(commit_data->gid, true);
+ CommitTransactionCommand();
+ pgstat_report_stat(false);
+
+ store_flush_position(commit_data->end_lsn);
+ in_remote_transaction = false;
+
+ /* Process any tables that are being synchronized in parallel. */
+ process_syncing_tables(commit_data->end_lsn);
+
+ pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+static void
+apply_handle_rollback_prepared_txn(LogicalRepCommitData *commit_data)
+{
+ /* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+ ensure_transaction();
+
+ /*
+ * Update origin state so we can restart streaming from correct
+ * position in case of crash.
+ */
+ replorigin_session_origin_lsn = commit_data->end_lsn;
+ replorigin_session_origin_timestamp = commit_data->committime;
+
+ FinishPreparedTransaction(commit_data->gid, false);
+ CommitTransactionCommand();
+ pgstat_report_stat(false);
+
+ store_flush_position(commit_data->end_lsn);
+ in_remote_transaction = false;
+
+ /* Process any tables that are being synchronized in parallel. */
+ process_syncing_tables(commit_data->end_lsn);
+
+ pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+ LogicalRepCommitData commit_data;
+ uint8 flags = 0;
+
+ logicalrep_read_prepare(s, &commit_data, &flags);
+
+ if (flags & LOGICALREP_IS_PREPARE)
+ apply_handle_prepare_txn(&commit_data);
+ else if (flags & LOGICALREP_IS_COMMIT_PREPARED)
+ apply_handle_commit_prepared_txn(&commit_data);
+ else if (flags & LOGICALREP_IS_ROLLBACK_PREPARED)
+ apply_handle_rollback_prepared_txn(&commit_data);
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg("wrong [commit|rollback] prepare message")));
+}
+
/*
* Handle ORIGIN message.
*
@@ -888,6 +1003,10 @@ apply_dispatch(StringInfo s)
case 'C':
apply_handle_commit(s);
break;
+ /* [COMMIT|ROLLBACK] PREPARE */
+ case 'P':
+ apply_handle_prepare(s);
+ break;
/* INSERT */
case 'I':
apply_handle_insert(s);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index c3126545b4..d55aa5b5a2 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -42,6 +42,14 @@ static void pgoutput_change(LogicalDecodingContext *ctx,
ReorderBufferChange *change);
static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
RepOriginId origin_id);
+static bool pgoutput_filter_prepare(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, char *gid);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_abort_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
static bool publications_valid;
@@ -79,6 +87,12 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
cb->begin_cb = pgoutput_begin_txn;
cb->change_cb = pgoutput_change;
cb->commit_cb = pgoutput_commit_txn;
+
+ cb->filter_prepare_cb = pgoutput_filter_prepare;
+ cb->prepare_cb = pgoutput_prepare_txn;
+ cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+ cb->abort_prepared_cb = pgoutput_abort_prepared_txn;
+
cb->filter_by_origin_cb = pgoutput_origin_filter;
cb->shutdown_cb = pgoutput_shutdown;
}
@@ -254,6 +268,47 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
}
/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ OutputPluginUpdateProgress(ctx);
+
+ OutputPluginPrepareWrite(ctx, true);
+ logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+ OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ OutputPluginUpdateProgress(ctx);
+
+ OutputPluginPrepareWrite(ctx, true);
+ logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+ OutputPluginWrite(ctx, true);
+}
+/*
+ * ABORT PREPARED callback
+ */
+static void
+pgoutput_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ OutputPluginUpdateProgress(ctx);
+
+ OutputPluginPrepareWrite(ctx, true);
+ logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+ OutputPluginWrite(ctx, true);
+}
+
+/*
* Sends the decoded DML over wire.
*/
static void
@@ -364,6 +419,18 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
}
/*
+ * Filter out unnecessary two-phase transactions.
+ *
+ * Currently, we forward all two-phase transactions
+ */
+static bool
+pgoutput_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ char *gid)
+{
+ return false;
+}
+
+/*
* Currently we always forward.
*/
static bool
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 54dec4eeaf..11ff0511fd 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -15,6 +15,7 @@
#define TWOPHASE_H
#include "access/xlogdefs.h"
+#include "access/xact.h"
#include "datatype/timestamp.h"
#include "storage/lock.h"
@@ -57,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
XLogRecPtr end_lsn);
extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
extern void restoreTwoPhaseData(void);
+extern void ParsePrepareRecord(uint8 info, char *xlrec,
+ xl_xact_parsed_prepare *parsed);
#endif /* TWOPHASE_H */
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 118b0a8432..1f093fb7b4 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -21,6 +21,10 @@
#include "storage/sinval.h"
#include "utils/datetime.h"
+/*
+ * Maximum size of Global Transaction ID (including '\0').
+ */
+#define GIDSIZE 200
/*
* Xact isolation levels
@@ -156,6 +160,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
#define XACT_XINFO_HAS_TWOPHASE (1U << 4)
#define XACT_XINFO_HAS_ORIGIN (1U << 5)
#define XACT_XINFO_HAS_AE_LOCKS (1U << 6)
+#define XACT_XINFO_HAS_GID (1U << 7)
/*
* Also stored in xinfo, these indicating a variety of additional actions that
@@ -302,13 +307,40 @@ typedef struct xl_xact_parsed_commit
SharedInvalidationMessage *msgs;
TransactionId twophase_xid; /* only for 2PC */
+ char twophase_gid[GIDSIZE];
XLogRecPtr origin_lsn;
TimestampTz origin_timestamp;
} xl_xact_parsed_commit;
+typedef struct xl_xact_parsed_prepare
+{
+ Oid dbId; /* MyDatabaseId */
+
+ int nsubxacts;
+ TransactionId *subxacts;
+
+ int ncommitrels;
+ RelFileNode *commitrels;
+
+ int nabortrels;
+ RelFileNode *abortrels;
+
+ int nmsgs;
+ SharedInvalidationMessage *msgs;
+
+ TransactionId twophase_xid;
+ char twophase_gid[GIDSIZE];
+
+ XLogRecPtr origin_lsn;
+ TimestampTz origin_timestamp;
+} xl_xact_parsed_prepare;
+
typedef struct xl_xact_parsed_abort
{
+ Oid dbId;
+ Oid tsId;
+
TimestampTz xact_time;
uint32 xinfo;
@@ -319,6 +351,10 @@ typedef struct xl_xact_parsed_abort
RelFileNode *xnodes;
TransactionId twophase_xid; /* only for 2PC */
+ char twophase_gid[GIDSIZE];
+
+ XLogRecPtr origin_lsn;
+ TimestampTz origin_timestamp;
} xl_xact_parsed_abort;
@@ -386,12 +422,14 @@ extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
int nmsgs, SharedInvalidationMessage *msgs,
bool relcacheInval, bool forceSync,
int xactflags,
- TransactionId twophase_xid);
+ TransactionId twophase_xid, const char *twophase_gid);
extern XLogRecPtr XactLogAbortRecord(TimestampTz abort_time,
int nsubxacts, TransactionId *subxacts,
int nrels, RelFileNode *rels,
- int xactflags, TransactionId twophase_xid);
+ int xactflags, TransactionId twophase_xid,
+ const char *twophase_gid);
+
extern void xact_redo(XLogReaderState *record);
/* xactdesc.c */
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 7f0e0fa881..4a1ca4a2b9 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -82,6 +82,11 @@ typedef struct LogicalDecodingContext
bool prepared_write;
XLogRecPtr write_location;
TransactionId write_xid;
+
+ /*
+ * Capabilities of decoding plugin used.
+ */
+ bool twophase_hadling;
} LogicalDecodingContext;
@@ -111,5 +116,4 @@ extern void LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn,
extern void LogicalConfirmReceivedLocation(XLogRecPtr lsn);
extern bool filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id);
-
#endif
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index a9736e1bf6..99f0c50de8 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -69,11 +69,18 @@ typedef struct LogicalRepBeginData
TransactionId xid;
} LogicalRepBeginData;
+#define LOGICALREP_IS_COMMIT 0x01
+#define LOGICALREP_IS_PREPARE 0x02
+#define LOGICALREP_IS_COMMIT_PREPARED 0x04
+#define LOGICALREP_IS_ROLLBACK_PREPARED 0x08
+#define LOGICALREP_PREPARE_MASK (LOGICALREP_IS_PREPARE | LOGICALREP_IS_COMMIT_PREPARED | LOGICALREP_IS_ROLLBACK_PREPARED)
typedef struct LogicalRepCommitData
{
+ uint8 flag;
XLogRecPtr commit_lsn;
XLogRecPtr end_lsn;
TimestampTz committime;
+ char gid[GIDSIZE];
} LogicalRepCommitData;
extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
@@ -81,8 +88,12 @@ extern void logicalrep_read_begin(StringInfo in,
LogicalRepBeginData *begin_data);
extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
extern void logicalrep_read_commit(StringInfo in,
LogicalRepCommitData *commit_data);
+extern void logicalrep_read_prepare(StringInfo in,
+ LogicalRepCommitData *commit_data, uint8 *flags);
extern void logicalrep_write_origin(StringInfo out, const char *origin,
XLogRecPtr origin_lsn);
extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 26ff024882..11a7af7da8 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -68,6 +68,38 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
XLogRecPtr commit_lsn);
/*
+ * Called before decoding of a PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to the prepare
+ * and commit_prepared/abort_prepared callbacks, or should wait until
+ * COMMIT PREPARED and be sent as a usual transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+/*
* Called for the generic logical decoding messages.
*/
typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
@@ -98,6 +130,10 @@ typedef struct OutputPluginCallbacks
LogicalDecodeBeginCB begin_cb;
LogicalDecodeChangeCB change_cb;
LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeFilterPrepareCB filter_prepare_cb;
+ LogicalDecodePrepareCB prepare_cb;
+ LogicalDecodeCommitPreparedCB commit_prepared_cb;
+ LogicalDecodeAbortPreparedCB abort_prepared_cb;
LogicalDecodeMessageCB message_cb;
LogicalDecodeFilterByOriginCB filter_by_origin_cb;
LogicalDecodeShutdownCB shutdown_cb;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 86effe106b..ee18fa346b 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
#define REORDERBUFFER_H
#include "access/htup_details.h"
+#include "access/twophase.h"
#include "lib/ilist.h"
#include "storage/sinval.h"
#include "utils/hsearch.h"
@@ -137,13 +138,28 @@ typedef struct ReorderBufferChange
dlist_node node;
} ReorderBufferChange;
+
+/* TODO: convert existing bools into flags later */
+/* values for txn_flags */
+#define TXN_HAS_CATALOG_CHANGES 0x0001
+#define TXN_IS_SUBXACT 0x0002
+#define TXN_PREPARE 0x0004
+#define TXN_COMMIT_PREPARED 0x0008
+#define TXN_ROLLBACK_PREPARED 0x0010
+#define txn_prepared(txn) (((txn)->txn_flags & TXN_PREPARE) != 0)
+
typedef struct ReorderBufferTXN
{
+ int txn_flags;
+
/*
* The transactions transaction id, can be a toplevel or sub xid.
*/
TransactionId xid;
+ /* In case of 2PC we need to pass GID to output plugin */
+ char gid[GIDSIZE];
+
/* did the TX have catalog changes */
bool has_catalog_changes;
@@ -292,6 +308,29 @@ typedef void (*ReorderBufferCommitCB) (
ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+typedef bool (*ReorderBufferFilterPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/* abort prepared callback signature */
+typedef void (*ReorderBufferAbortPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
/* message callback signature */
typedef void (*ReorderBufferMessageCB) (
ReorderBuffer *rb,
@@ -327,6 +366,10 @@ struct ReorderBuffer
ReorderBufferBeginCB begin;
ReorderBufferApplyChangeCB apply_change;
ReorderBufferCommitCB commit;
+ ReorderBufferFilterPrepareCB filter_prepare;
+ ReorderBufferPrepareCB prepare;
+ ReorderBufferCommitPreparedCB commit_prepared;
+ ReorderBufferAbortPreparedCB abort_prepared;
ReorderBufferMessageCB message;
/*
@@ -382,6 +425,11 @@ void ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot
void ReorderBufferCommit(ReorderBuffer *, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit);
void ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
void ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
@@ -405,6 +453,13 @@ void ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
bool ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
bool ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
+bool ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, char *gid);
+bool ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid);
+void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid);
ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
void ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 7653717f83..7fcd479d8a 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -86,5 +86,9 @@ extern void SnapBuildProcessNewCid(SnapBuild *builder, TransactionId xid,
extern void SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn,
struct xl_running_xacts *running);
extern void SnapBuildSerializationPoint(SnapBuild *builder, XLogRecPtr lsn);
+extern void SnapBuildPrepareTxnStart(SnapBuild *builder, XLogRecPtr lsn,
+ TransactionId xid, int nsubxacts,
+ TransactionId *subxacts);
+extern void SnapBuildPrepareTxnFinish(SnapBuild *builder, TransactionId xid);
#endif /* SNAPBUILD_H */
On 23 November 2017 at 20:27, Nikhil Sontakke <nikhils@2ndquadrant.com>
wrote:
Is there any other locking scenario that we need to consider?
Otherwise, are we all ok on this point being a non-issue for 2PC
logical decoding?
Yeah.
I didn't find any sort of sensible situation where locking would pose
issues. Unless you're taking explicit LOCKs on catalog tables, you should
be fine.
There may be issues with CLUSTER or VACUUM FULL of non-relmapped catalog
relations I guess. Personally I put that in the "don't do that" box, but if
we really have to guard against it we could slightly expand the limits on
which txns you can PREPARE to any txn that has a strong lock on a catalog
relation.
The issue is with a concurrent rollback of the prepared transaction.
We need a way to ensure that
the 2PC does not abort when we are in the midst of a change record
apply activity.
The *reason* we care about this is that tuples created by aborted txns are
not considered "recently dead" by vacuum. They can be marked invalid and
removed immediately due to hint-bit setting and HOT pruning, vacuum runs,
etc.
This could create an inconsistent view of the catalogs if our prepared txn
did any DDL. For example, we might've updated a pg_class row, so we created
a new row and set xmax on the old one. Vacuum may merrily remove our new
row so there's no way we can find the correct data anymore, we'd have to
find the outdated row or no row. By my reading of HeapTupleSatisfiesMVCC
we'll see the old pg_class tuple.
Similar issues apply for pg_attribute etc etc. We might try to decode a
record according to the wrong table structure because relcache lookups
performed by the plugin will report outdated information.
The sanest option here seems to be to stop the txn from aborting while
we're decoding it, hence Nikhil's suggestions.
* We introduce two new booleans in the TwoPhaseState
GlobalTransactionData structure.
bool beingdecoded;
bool abortpending;
I think it's premature to rule out the simple option of doing a LockGXact
when we start decoding. Improve the error "prepared transaction with
identifier \"%s\" is busy" to report the locking pid too. It means you
cannot rollback or commit a prepared xact while it's being decoded, but for
the intended use of this patch, I think that's absolutely fine anyway.
But I like your suggestion much more than the idea of taking a LWLock on
TwoPhaseStateLock while decoding a record. Lots of LWLock churn, and
LWLocks held over arbitrary user plugin code. Not viable.
With your way we just have to take a LWLock once on TwoPhaseState when we
test abortpending and set beingdecoded. After that, during decoding, we can
do unlocked tests of abortpending, since a stale read will do nothing worse
than delay our response a little. The existing 2PC ops already take the
LWLock and can examine beingdecoded then. I expect they'd need to WaitLatch
in a loop until beingdecoded was cleared, re-acquiring the LWLock and
re-checking each time it's woken. We should probably add a field there for
a waiter proc that wants its latch set, so 2pc ops don't usually have to
poll for decoding to finish. (Unless condition variables will help us here?)
However, let me make an entirely alternative suggestion. Should we add a
heavyweight lock class for 2PC xacts instead, and leverage the existing
infrastructure? We already use transaction locks widely after all. That
way, we just take some kind of share lock on the 2PC xact by xid when we
start logical decoding of the 2pc xact. ROLLBACK PREPARED and COMMIT
PREPARED would acquire the same heavyweight lock in an exclusive mode
before grabbing TwoPhaseStateLock and doing their work.
That way we get automatic cleanup when decoding procs exit, we get wakeups
for waiters, etc, all for "free".
How practical is adding a lock class?
(Frankly I've often wished I could add new heavyweight lock classes when
working on complex extensions like BDR, too, and in an ideal world we'd be
able to register lock classes for use by extensions...)
3) Before starting decode of the next change record, we re-check if
"abortpending" is set. If "abortpending"
is set, we do not decode the next change record. Thus the abort is
delay-bounded to a maximum of one change record decoding/apply cycle
after we signal our intent to abort it. Then, we need to send ABORT
(regular, not rollback prepared, since we have not sent "PREPARE" yet).
Just to be explicit, this means "tell the downstream the xact has aborted".
Currently logical decoding does not ever start decoding an xact until it's
committed, so it has never needed an abort callback on the output plugin
interface.
But we'll need one when we start doing speculative logical decoding of big
txns before they commit, and we'll need it for this. It's relatively
trivial.
This abort_prepared callback will abort the dummy PREPARED query from
step (3) above. Instead of doing this, we could actually check if the
'GID' entry exists and then call ROLLBACK PREPARED on the subscriber.
But in that case we can't be sure if the GID does not exist because of
a rollback-during-decode-issue on the provider or due to something
else. If we are ok with not finding GIDs on the subscriber side, then
am fine with removing the DUMMY prepare from step (3).
I prefer the latter approach personally, not doing the dummy 2pc xact.
Instead we can just ignore a ROLLBACK PREPARED for a txn whose gid does not
exist on the downstream. I can easily see situations where we might
manually abort a txn and wouldn't want logical decoding to get perpetually
stuck waiting to abort a gid that doesn't exist, for example.
Ignoring commit prepared for a missing xact would not be great, but I think
it's sensible enough to ignore missing GIDs for rollback prepared.
We'd need a race-free way to do that though, so I think we'd have to
extend FinishPreparedTransaction and LockGXact with some kind of missing_ok
flag. I doubt that'd be controversial.
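A rough sketch of what such a missing_ok flag could mean, written as standalone C rather than against the real twophase.c code. Everything here is hypothetical (`PreparedSet`, `finish_prepared`, the result enum); the only point it illustrates is that the lookup and the "missing" decision happen in one step, and that only the rollback path tolerates an unknown GID.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

typedef enum FinishResult
{
    FINISH_OK,        /* GID found and finished */
    FINISH_SKIPPED,   /* GID missing, tolerated because of missing_ok */
    FINISH_ERROR      /* GID missing and not tolerated */
} FinishResult;

#define MAX_PREPARED 4

/* Hypothetical table of prepared GIDs; NULL marks a free slot. */
typedef struct PreparedSet
{
    const char *gids[MAX_PREPARED];
} PreparedSet;

/*
 * Look up the GID and finish it in a single step. In real code this whole
 * function body would run under one lock, so no other backend could add or
 * remove the GID between the lookup and the decision.
 */
FinishResult
finish_prepared(PreparedSet *set, const char *gid,
                bool is_commit, bool missing_ok)
{
    assert(set != NULL && gid != NULL);

    for (int i = 0; i < MAX_PREPARED; i++)
    {
        if (set->gids[i] != NULL && strcmp(set->gids[i], gid) == 0)
        {
            set->gids[i] = NULL;    /* commit or roll back: entry goes away */
            return FINISH_OK;
        }
    }

    /* Only ROLLBACK PREPARED may ignore a GID that does not exist. */
    if (missing_ok && !is_commit)
        return FINISH_SKIPPED;

    return FINISH_ERROR;
}
```

Refusing missing_ok for commits is a design choice of this sketch, following the remark above that ignoring COMMIT PREPARED for a missing xact "would not be great".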
A couple of other considerations not covered in what you wrote:
- It's really important that the hook that decides whether to decode an
xact at prepare or commit prepared time reports the same answer each and
every time, including if it's called after a prepared txn has committed. It
probably can't look at anything more than the xact's origin replica
identity, xid, and gid. This also means we need to know the gid of prepared
txns when processing their commit record, so we can tell logical decoding
whether we already sent the data to the client at prepare-transaction time,
or if we should send it at commit-prepared time instead.
- You need to flush the syscache when you finish decoding a PREPARE
TRANSACTION of an xact that made catalog changes, unless it's immediately
followed by COMMIT PREPARED of the same xid. Because xacts between the two
cannot see changes the prepared xact made to the catalogs.
- For the same reason we need to ensure that the historic snapshot used to
decode a 2PC xact that made catalog changes isn't then used for subsequent
xacts between the prepare and commit. They'd see the uncommitted catalogs
of the prepared xact.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi Craig,
I didn't find any sort of sensible situation where locking would pose
issues. Unless you're taking explicit LOCKs on catalog tables, you should be
fine.
There may be issues with CLUSTER or VACUUM FULL of non-relmapped catalog
relations I guess. Personally I put that in the "don't do that" box, but if
we really have to guard against it we could slightly expand the limits on
which txns you can PREPARE to any txn that has a strong lock on a catalog
relation.
Well, we don't allow VACUUM FULL of regular tables in transaction blocks.
I tried "CLUSTER user_table USING pkey", it works and also it does not take
strong locks on catalog tables which would halt the decoding process.
ALTER TABLE
works without stalling decoding already as mentioned earlier.
The issue is with a concurrent rollback of the prepared transaction.
We need a way to ensure that the 2PC does not abort when we are in the
midst of a change record apply activity.
The *reason* we care about this is that tuples created by aborted txns are
not considered "recently dead" by vacuum. They can be marked invalid and
removed immediately due to hint-bit setting and HOT pruning, vacuum runs,
etc.
This could create an inconsistent view of the catalogs if our prepared txn
did any DDL. For example, we might've updated a pg_class row, so we created
a new row and set xmax on the old one. Vacuum may merrily remove our new row
so there's no way we can find the correct data anymore, we'd have to find
the outdated row or no row. By my reading of HeapTupleSatisfiesMVCC we'll
see the old pg_class tuple.
Similar issues apply for pg_attribute etc etc. We might try to decode a
record according to the wrong table structure because relcache lookups
performed by the plugin will report outdated information.
We actually do the decoding in a PG_TRY/CATCH block, so if there are
any errors we can clean those up in the CATCH block. If it's a prepared
transaction then we can send an ABORT to the remote side to clean itself up.
The sanest option here seems to be to stop the txn from aborting while we're
decoding it, hence Nikhil's suggestions.
If we do the cleanup above in the CATCH block, then do we really care?
I guess the issue would be in determining why we reached the CATCH
block, whether it was due to a decoding error or due to network issues
or something else.
* We introduce two new booleans in the TwoPhaseState
GlobalTransactionData structure.
bool beingdecoded;
bool abortpending;
I think it's premature to rule out the simple option of doing a LockGXact
when we start decoding. Improve the error "prepared transaction with
identifier \"%s\" is busy" to report the locking pid too. It means you
cannot rollback or commit a prepared xact while it's being decoded, but for
the intended use of this patch, I think that's absolutely fine anyway.
But I like your suggestion much more than the idea of taking a LWLock on
TwoPhaseStateLock while decoding a record. Lots of LWLock churn, and LWLocks
held over arbitrary user plugin code. Not viable.
With your way we just have to take a LWLock once on TwoPhaseState when we
test abortpending and set beingdecoded. After that, during decoding, we can
do unlocked tests of abortpending, since a stale read will do nothing worse
than delay our response a little. The existing 2PC ops already take the
LWLock and can examine beingdecoded then. I expect they'd need to WaitLatch
in a loop until beingdecoded was cleared, re-acquiring the LWLock and
re-checking each time it's woken. We should probably add a field there for a
waiter proc that wants its latch set, so 2pc ops don't usually have to poll
for decoding to finish. (Unless condition variables will help us here?)
Yes, WaitLatch could do the job here.
However, let me make an entirely alternative suggestion. Should we add a
heavyweight lock class for 2PC xacts instead, and leverage the existing
infrastructure? We already use transaction locks widely after all. That way,
we just take some kind of share lock on the 2PC xact by xid when we start
logical decoding of the 2pc xact. ROLLBACK PREPARED and COMMIT PREPARED
would acquire the same heavyweight lock in an exclusive mode before grabbing
TwoPhaseStateLock and doing their work.
That way we get automatic cleanup when decoding procs exit, we get wakeups
for waiters, etc, all for "free".
How practical is adding a lock class?
Am open to suggestions. This looks like it could work decently.
Just to be explicit, this means "tell the downstream the xact has aborted".
Currently logical decoding does not ever start decoding an xact until it's
committed, so it has never needed an abort callback on the output plugin
interface.
But we'll need one when we start doing speculative logical decoding of big
txns before they commit, and we'll need it for this. It's relatively
trivial.
Yes, it will be a standard wrapper call to implement on both send and
apply side.
This abort_prepared callback will abort the dummy PREPARED query from
step (3) above. Instead of doing this, we could actually check if the
'GID' entry exists and then call ROLLBACK PREPARED on the subscriber.
But in that case we can't be sure if the GID does not exist because of
a rollback-during-decode-issue on the provider or due to something
else. If we are ok with not finding GIDs on the subscriber side, then
am fine with removing the DUMMY prepare from step (3).
I prefer the latter approach personally, not doing the dummy 2pc xact.
Instead we can just ignore a ROLLBACK PREPARED for a txn whose gid does not
exist on the downstream. I can easily see situations where we might manually
abort a txn and wouldn't want logical decoding to get perpetually stuck
waiting to abort a gid that doesn't exist, for example.
Ignoring commit prepared for a missing xact would not be great, but I think
it's sensible enough to ignore missing GIDs for rollback prepared.
Yes, that makes sense in case of ROLLBACK. If we find a missing GID
for a COMMIT PREPARED we are in for some trouble.
We'd need a race-free way to do that though, so I think we'd have to extend
FinishPreparedTransaction and LockGXact with some kind of missing_ok flag. I
doubt that'd be controversial.
Sure.
A couple of other considerations not covered in what you wrote:
- It's really important that the hook that decides whether to decode an xact
at prepare or commit prepared time reports the same answer each and every
time, including if it's called after a prepared txn has committed. It
probably can't look at anything more than the xact's origin replica
identity, xid, and gid. This also means we need to know the gid of prepared
txns when processing their commit record, so we can tell logical decoding
whether we already sent the data to the client at prepare-transaction time,
or if we should send it at commit-prepared time instead.
We already have a filter_prepare_cb hook in place for this. TBH, I
don't think this patch needs to worry about the internals of that
hook. Whatever it returns, if it returns the same value every time then
we should be good from the logical decoding perspective.
I think, if we encode the logic in the GID itself, then it will be
good and consistent every time. For example, if the hook sees a GID
with the prefix '_$Logical_', then it knows it has to PREPARE it.
Others can be decoded at commit time.
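As a small illustration of the GID-prefix idea, the decision could be a pure function of the GID string, which makes it trivially stable between PREPARE TRANSACTION and COMMIT PREPARED. This is a sketch only: the helper name `gid_requests_decode_at_prepare` is hypothetical, and how its result maps onto the return value of filter_prepare_cb depends on the polarity the patch finally settles on.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Prefix taken from the example in the mail; any fixed marker would do. */
#define LOGICAL_GID_PREFIX "_$Logical_"

/*
 * Returns true if the GID asks for decoding at PREPARE time, false if the
 * transaction should be decoded at COMMIT PREPARED like a regular xact.
 * The answer depends only on the GID, so it is the same whenever asked.
 */
bool
gid_requests_decode_at_prepare(const char *gid)
{
    assert(gid != NULL);
    return strncmp(gid, LOGICAL_GID_PREFIX,
                   strlen(LOGICAL_GID_PREFIX)) == 0;
}
```

An output plugin's filter callback could call such a helper with the `gid` argument it receives and ignore all other state.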
- You need to flush the syscache when you finish decoding a PREPARE
TRANSACTION of an xact that made catalog changes, unless it's immediately
followed by COMMIT PREPARED of the same xid. Because xacts between the two
cannot see changes the prepared xact made to the catalogs.
- For the same reason we need to ensure that the historic snapshot used to
decode a 2PC xact that made catalog changes isn't then used for subsequent
xacts between the prepare and commit. They'd see the uncommitted catalogs of
the prepared xact.
Yes, we will do TeardownHistoricSnapshot and syscache flush as part of
the cleanup for 2PC transactions.
Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
On 24 November 2017 at 13:44, Nikhil Sontakke <nikhils@2ndquadrant.com>
wrote:
This could create an inconsistent view of the catalogs if our prepared txn
did any DDL. For example, we might've updated a pg_class row, so we created
a new row and set xmax on the old one. Vacuum may merrily remove our new row
so there's no way we can find the correct data anymore, we'd have to find
the outdated row or no row. By my reading of HeapTupleSatisfiesMVCC we'll
see the old pg_class tuple.
Similar issues apply for pg_attribute etc etc. We might try to decode a
record according to the wrong table structure because relcache lookups
performed by the plugin will report outdated information.
We actually do the decoding in a PG_TRY/CATCH block, so if there are
any errors we can clean those up in the CATCH block. If it's a prepared
transaction then we can send an ABORT to the remote side to clean itself up.
Yeah. I suspect it might not always ERROR gracefully though.
How practical is adding a lock class?
Am open to suggestions. This looks like it could work decently.
It looks amazingly simple from here. Which probably means there's more to
it that I haven't seen yet. I could use advice from someone who knows the
locking subsystem better.
Yes, that makes sense in case of ROLLBACK. If we find a missing GID
for a COMMIT PREPARED we are in for some trouble.
I agree. But it's really down to the apply worker / plugin to set policy
there, I think. It's not the 2PC decoding support's problem.
I'd argue that a plugin that wishes to strictly track and match 2PC aborts
with the subsequent ROLLBACK PREPARED could do so by recording the abort
locally. It need not rely on faked-up 2pc xacts from the output plugin.
Though it might choose to create them on the downstream as its method of
tracking aborts.
In other words, we don't need the logical decoding infrastructure's help
here. It doesn't have to fake up 2PC xacts for us. Output plugins/apply
workers that want to can do it themselves, and those that don't can ignore
rollback prepared for non-matched GIDs instead.
We'd need a race-free way to do that though, so I think we'd have to extend
FinishPreparedTransaction and LockGXact with some kind of missing_ok flag. I
doubt that'd be controversial.
Sure.
I reckon that should be in-scope for this patch, and pretty clearly useful.
Also simple.
- It's really important that the hook that decides whether to decode an xact
at prepare or commit prepared time reports the same answer each and every
time, including if it's called after a prepared txn has committed. It
probably can't look at anything more than the xact's origin replica
identity, xid, and gid. This also means we need to know the gid of prepared
txns when processing their commit record, so we can tell logical decoding
whether we already sent the data to the client at prepare-transaction time,
or if we should send it at commit-prepared time instead.
We already have a filter_prepare_cb hook in place for this. TBH, I
don't think this patch needs to worry about the internals of that
hook. Whatever it returns, if it returns the same value every time then
we should be good from the logical decoding perspective.
I agree. I meant that it should try to pass only info that's accessible at
both PREPARE TRANSACTION and COMMIT PREPARED time, and we should document
the importance of returning a consistent result. In particular, it's always
wrong to examine the current twophase state when deciding what to return.
I think, if we encode the logic in the GID itself, then it will be
good and consistent every time. For example, if the hook sees a GID
with the prefix '_$Logical_', then it knows it has to PREPARE it.
Others can be decoded at commit time.
Yep. We can also safely tell the hook:
* the xid
* whether the xact has made catalog changes (since we know that at prepare
and commit time)
but probably not much else.
- You need to flush the syscache when you finish decoding a PREPARE
TRANSACTION of an xact that made catalog changes, unless it's immediately
followed by COMMIT PREPARED of the same xid. Because xacts between the two
cannot see changes the prepared xact made to the catalogs.
- For the same reason we need to ensure that the historic snapshot used to
decode a 2PC xact that made catalog changes isn't then used for subsequent
xacts between the prepare and commit. They'd see the uncommitted catalogs
of the prepared xact.
Yes, we will do TeardownHistoricSnapshot and syscache flush as part of
the cleanup for 2PC transactions.
Great.
Thanks.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Nov 24, 2017 at 3:41 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
It looks amazingly simple from here. Which probably means there's more to it
that I haven't seen yet. I could use advice from someone who knows the
locking subsystem better.
The status of this patch is, I think, not correct. It is marked as
waiting on author but Nikhil has shown up and has written an updated
patch. So I am moving it to the next CF with "needs review".
--
Michael
Hi,
On 24/11/17 07:41, Craig Ringer wrote:
On 24 November 2017 at 13:44, Nikhil Sontakke <nikhils@2ndquadrant.com> wrote:
How practical is adding a lock class?
Am open to suggestions. This looks like it could work decently.
It looks amazingly simple from here. Which probably means there's more
to it that I haven't seen yet. I could use advice from someone who knows
the locking subsystem better.
Hmm, I don't like the interaction that would have with ROLLBACK, meaning
that ROLLBACK has to wait for decoding to finish which may take longer
than the transaction took itself (given potential network calls, it's
practically unbounded time).
I also think that if we'll want to add streaming of transactions in the
future, we'll face similar problem and the locking approach will not
work there as the transaction may still be locked by the owning backend
while we are decoding it.
From my perspective this patch changes the assumption in
HeapTupleSatisfiesVacuum() that changes done by an aborted transaction
can't be seen by anybody else. That's clearly not true here as the
decoding can see them. So perhaps a better approach would be to not return
HEAPTUPLE_DEAD if the transaction id is newer than the OldestXmin (same
logic we use for deleted tuples of committed transactions) in
HeapTupleSatisfiesVacuum() even for aborted transactions. I also briefly
checked HOT pruning and AFAICS the normal HOT pruning (the one not
called by vacuum) also uses the xmin as authoritative even for aborted
txes so nothing needs to be done there probably.
In case we are worried that this affects cleanup of, for example, large
aborted COPY transactions, and we think it's worth worrying about, then we
could limit the new OldestXmin-based logic to catalog tuples only, as
those are the only ones we need available in decoding.
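The proposed rule can be sketched as a standalone C function. This is not the real HeapTupleSatisfiesVacuum() code: the type and function names are hypothetical, the catalog test is reduced to a boolean, and treating an xmin equal to OldestXmin as "not older" is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t Xid;

typedef enum VacuumVerdict
{
    VERDICT_DEAD,            /* may be removed right away */
    VERDICT_RECENTLY_DEAD    /* must be kept for now */
} VacuumVerdict;

/*
 * Wraparound-aware "a is older than b" for normal xids, using modulo-2^32
 * arithmetic. Only meaningful for normal xids (3 and above).
 */
bool
xid_precedes(Xid a, Xid b)
{
    assert(a >= 3 && b >= 3);
    return (int32_t) (a - b) < 0;
}

/*
 * Verdict for a tuple whose inserting transaction (xmin) has aborted.
 * Without the proposal the answer is always VERDICT_DEAD; with it, catalog
 * tuples from aborts that are not older than OldestXmin are retained.
 */
VacuumVerdict
verdict_for_aborted_inserter(Xid xmin, Xid oldest_xmin, bool is_catalog_tuple)
{
    if (is_catalog_tuple && !xid_precedes(xmin, oldest_xmin))
        return VERDICT_RECENTLY_DEAD;
    return VERDICT_DEAD;
}
```

Restricting the rule to catalog tuples keeps the existing immediate cleanup for ordinary user tables, which addresses the concern about large aborted COPY transactions.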
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 30 November 2017 at 07:40, Petr Jelinek <petr.jelinek@2ndquadrant.com>
wrote:
Hi,
On 24/11/17 07:41, Craig Ringer wrote:
On 24 November 2017 at 13:44, Nikhil Sontakke <nikhils@2ndquadrant.com> wrote:
How practical is adding a lock class?
Am open to suggestions. This looks like it could work decently.
It looks amazingly simple from here. Which probably means there's more
to it that I haven't seen yet. I could use advice from someone who knows
the locking subsystem better.
Hmm, I don't like the interaction that would have with ROLLBACK, meaning
that ROLLBACK has to wait for decoding to finish which may take longer
than the transaction took itself (given potential network calls, it's
practically unbounded time).
Yeah. We could check for waiters before we do the network I/O and release +
bail out. But once we enter the network call we're committed and it could
take a long time.
I don't find that particularly troubling for 2PC, but it's an obvious
nonstarter if we want to use the same mechanism for streaming normal xacts
out before commit.
Even for 2PC, if we have >1 downstream then once one reports an ERROR on
PREPARE TRANSACTION, there's probably no point continuing to stream the 2PC
xact out to other peers. So being able to abort the txn while it's being
decoded, causing decoding to bail out, is desirable there too.
I also think that if we'll want to add streaming of transactions in the
future, we'll face similar problem and the locking approach will not
work there as the transaction may still be locked by the owning backend
while we are decoding it.
Agreed. For that reason I agree that we need to look further afield than
locking-based solutions.
From my perspective this patch changes the assumption in
HeapTupleSatisfiesVacuum() that changes done by aborted transaction
can't be seen by anybody else. That's clearly not true here as the
decoding can see it.
Yes, *if* we don't use some locking-like approach to stop abort from
occurring while decoding is processing an xact.
So perhaps better approach would be to not return
HEAPTUPLE_DEAD if the transaction id is newer than the OldestXmin (same
logic we use for deleted tuples of committed transactions) in the
HeapTupleSatisfiesVacuum() even for aborted transactions. I also briefly
checked HOT pruning and AFAICS the normal HOT pruning (the one not
called by vacuum) also uses the xmin as authoritative even for aborted
txes so nothing needs to be done there probably.
In case we are worried that this affects cleanups of for example large
aborted COPY transactions and we think it's worth worrying about then we
could limit the new OldestXmin based logic only to catalog tuples as
those are the only ones we need available in decoding.
Yeah, if it's limited to catalog tuples only then that sounds good. I was
quite concerned about how it'd impact vacuuming otherwise, but if limited
to catalogs about the only impact should be on workloads that create lots
of TEMPORARY tables then ROLLBACK - and not much on those.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi,
So perhaps better approach would be to not return
HEAPTUPLE_DEAD if the transaction id is newer than the OldestXmin (same
logic we use for deleted tuples of committed transactions) in the
HeapTupleSatisfiesVacuum() even for aborted transactions. I also briefly
checked HOT pruning and AFAICS the normal HOT pruning (the one not
called by vacuum) also uses the xmin as authoritative even for aborted
txes so nothing needs to be done there probably.
In case we are worried that this affects cleanups of for example large
aborted COPY transactions and we think it's worth worrying about then we
could limit the new OldestXmin based logic only to catalog tuples as
those are the only ones we need available in decoding.
Yeah, if it's limited to catalog tuples only then that sounds good. I was
quite concerned about how it'd impact vacuuming otherwise, but if limited to
catalogs about the only impact should be on workloads that create lots of
TEMPORARY tables then ROLLBACK - and not much on those.
Based on these discussions, I think there are two separate issues here:
1) Make HeapTupleSatisfiesVacuum() behave differently for recently
aborted catalog tuples.
2) Invent a mechanism to stop a specific logical decoding activity in
the middle. The reason to stop it could be a concurrent abort, maybe a
global transaction manager decides to rollback, or any other reason,
for example.
ISTM that for (2), if (1) is able to leave the recently aborted tuples
around for a little while (we only really need them while the
decode of the current change record is ongoing), then we could
accomplish it via a callback. This callback should be called before
commencing decode and network send of each change record. In case of
in-core logical decoding, the callback for pgoutput could check for
the transaction having aborted (a call to TransactionIdDidAbort() or
a similar function); additional logic can be added as needed for
various scenarios. If it's aborted, we will abandon decoding and send
an ABORT to the subscribers before returning.
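The control flow described above could look roughly like the following standalone C sketch. All names are hypothetical (`replay_changes`, `ContinueDecodingCB`, `DecodeStats`); sending a change and sending ABORT are reduced to counters, and the callback stands in for a check such as TransactionIdDidAbort().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum DecodeOutcome
{
    DECODE_COMPLETED,
    DECODE_ABORTED
} DecodeOutcome;

/* Hypothetical callback: returns true while decoding may go on. */
typedef bool (*ContinueDecodingCB) (void *arg);

typedef struct DecodeStats
{
    size_t changes_sent;   /* change records decoded and sent */
    bool   abort_sent;     /* whether ABORT went to the subscriber */
} DecodeStats;

/* Consult the callback before each change; stop and send ABORT if it says no. */
DecodeOutcome
replay_changes(size_t nchanges, ContinueDecodingCB may_continue, void *arg,
               DecodeStats *stats)
{
    assert(may_continue != NULL && stats != NULL);

    stats->changes_sent = 0;
    stats->abort_sent = false;

    for (size_t i = 0; i < nchanges; i++)
    {
        if (!may_continue(arg))
        {
            stats->abort_sent = true;    /* stands in for sending ABORT */
            return DECODE_ABORTED;
        }
        stats->changes_sent++;           /* stands in for decode + send */
    }
    return DECODE_COMPLETED;
}

/* Example callback: the transaction "aborts" once the budget is used up. */
bool
abort_after_n(void *arg)
{
    int *remaining = arg;

    if (*remaining == 0)
        return false;
    (*remaining)--;
    return true;
}
```

Because the check runs once per change record, a concurrent abort is noticed after at most one further decode and send cycle.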
Regards,
Nikhils
PFA, latest patch for this functionality.
This patch contains the following changes as compared to the earlier patch:
- Fixed a bunch of typos and comments
- Modified HeapTupleSatisfiesVacuum to return HEAPTUPLE_RECENTLY_DEAD
for tuples of aborted transactions if the transaction id is newer than
OldestXmin. This is done only for catalog tables
(htup->t_tableOid < (Oid) FirstNormalObjectId).
- Added a filter callback filter_decode_txn_cb_wrapper() to decide if
it's ok to decode the NEXT change record. This filter as of now checks
if the XID that is involved got aborted. Additional checks can be
added here as needed.
- Added ABORT callback in the decoding process. This was not needed
before because we always used to decode committed transactions. With
2PC transactions, it is possible that while we are decoding one, another
backend issues a concurrent ROLLBACK PREPARED. So when
filter_decode_txn_cb_wrapper() gets called, it will tell us not to
decode the next change record. In that case we need to send an ABORT
to the subscriber (and not ROLLBACK PREPARED, because we are yet to
issue PREPARE to the subscriber).
- Added all functionality to read the abort command and apply it on
the remote subscriber as needed.
- Added functionality in ReorderBufferCommit() to abort midway based
on the feedback from filter_decode_txn_cb_wrapper()
- Modified LockGXact() and FinishPreparedTransaction() to allow a
missing GID in case of "ROLLBACK PREPARED". Currently, this will only
happen in the logical apply code path. We still send it to the
subscriber because it's difficult to identify on the provider if this
transaction was aborted midway in decoding or if it's in PREPARED
state on the subscriber. It will error out as before in all other
cases.
- Totally removed snapshot addition/deletion code while doing the
decoding. That's not needed at all while decoding an ongoing
transaction. The entries in the snapshot are needed for future
transactions to be able to decode older transactions. For 2PC
transactions, we don't need to decode them till COMMIT PREPARED gets
called. This has simplified all that unwanted snapshot push/pop code,
which is nice.
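To illustrate the HeapTupleSatisfiesVacuum change from the list above, here is a small standalone sketch of the intended rule for tuples inserted by an aborted transaction. It is not the patch code: aborted_tuple_verdict(), is_catalog_table() and the VacuumVerdict enum are made-up names, the comparison is a plain integer compare that ignores XID wraparound, and treating xmin equal to OldestXmin as "recent" is my assumption (it mirrors the committed-deleter case). Only the 16384 value of FirstNormalObjectId is taken from PostgreSQL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t Xid;   /* stand-in for TransactionId; wraparound is ignored */

/* Same value as FirstNormalObjectId in PostgreSQL. */
#define FIRST_NORMAL_OBJECT_ID 16384

/* Hypothetical reduced result type, standing in for HTSV_Result. */
typedef enum { TUPLE_DEAD, TUPLE_RECENTLY_DEAD } VacuumVerdict;

/* Tables with OIDs below FirstNormalObjectId are treated as catalogs. */
static bool
is_catalog_table(uint32_t table_oid)
{
    return table_oid < FIRST_NORMAL_OBJECT_ID;
}

/*
 * Verdict for a tuple whose inserting transaction aborted.
 * Catalog tuples from recent aborts are kept around so that an
 * in-flight decode can still look them up; everything else stays
 * immediately removable, as before.
 */
static VacuumVerdict
aborted_tuple_verdict(uint32_t table_oid, Xid xmin, Xid oldest_xmin)
{
    assert(xmin != 0);  /* 0 is the invalid XID */

    if (is_catalog_table(table_oid) && xmin >= oldest_xmin)
        return TUPLE_RECENTLY_DEAD;
    return TUPLE_DEAD;
}
```

For example, a pg_class tuple (OID 1259) with xmin 1000 and OldestXmin 900 would be kept as recently dead, while the same xmin on a user table with OID 20000 would still be reported dead, so large aborted COPYs into user tables are cleaned up as they are today.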
Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Attachments:
2pc_logical_04_12_17.patch (application/octet-stream)
diff --git a/contrib/test_decoding/expected/prepared.out b/contrib/test_decoding/expected/prepared.out
index 46e915d4ff..f9676b2e01 100644
--- a/contrib/test_decoding/expected/prepared.out
+++ b/contrib/test_decoding/expected/prepared.out
@@ -6,19 +6,123 @@ SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_d
init
(1 row)
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc', 'test_decoding');
+ ?column?
+----------
+ init
+(1 row)
+
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc_nofilter', 'test_decoding');
+ ?column?
+----------
+ init
+(1 row)
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+-- Reused queries
+\set get_no2pc 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'');'
+\set get_with2pc 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot_2pc'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'', ''twophase-decoding'', ''1'');'
+\set get_with2pc_nofilter 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot_2pc_nofilter'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'', ''twophase-decoding'', ''1'', ''twophase-decode-with-catalog-changes'', ''1'');'
-- test simple successful use of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (1);
PREPARE TRANSACTION 'test_prepared#1';
+:get_no2pc
+ data
+------
+(0 rows)
+
+:get_with2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ PREPARE TRANSACTION 'test_prepared#1'
+(3 rows)
+
+:get_with2pc_nofilter
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ PREPARE TRANSACTION 'test_prepared#1'
+(3 rows)
+
COMMIT PREPARED 'test_prepared#1';
+:get_no2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ COMMIT
+(3 rows)
+
+:get_with2pc
+ data
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
+:get_with2pc_nofilter
+ data
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
INSERT INTO test_prepared1 VALUES (2);
-- test abort of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (3);
PREPARE TRANSACTION 'test_prepared#2';
+:get_no2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+(3 rows)
+
+:get_with2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(6 rows)
+
+:get_with2pc_nofilter
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(6 rows)
+
ROLLBACK PREPARED 'test_prepared#2';
+:get_no2pc
+ data
+------
+(0 rows)
+
+:get_with2pc
+ data
+----------------------------------
+ ABORT PREPARED 'test_prepared#2'
+(1 row)
+
+:get_with2pc_nofilter
+ data
+----------------------------------
+ ABORT PREPARED 'test_prepared#2'
+(1 row)
+
INSERT INTO test_prepared1 VALUES (4);
-- test prepared xact containing ddl
BEGIN;
@@ -26,45 +130,226 @@ INSERT INTO test_prepared1 VALUES (5);
ALTER TABLE test_prepared1 ADD COLUMN data text;
INSERT INTO test_prepared1 VALUES (6, 'frakbar');
PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
-INSERT INTO test_prepared2 VALUES (7);
-COMMIT PREPARED 'test_prepared#3';
--- make sure stuff still works
-INSERT INTO test_prepared1 VALUES (8);
-INSERT INTO test_prepared2 VALUES (9);
--- cleanup
-DROP TABLE test_prepared1;
-DROP TABLE test_prepared2;
--- show results
-SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'test_prepared1'::regclass;
+ relation | locktype | mode
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
+:get_no2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4
+ COMMIT
+(3 rows)
+
+:get_with2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4
+ COMMIT
+(3 rows)
+
+:get_with2pc_nofilter
data
-------------------------------------------------------------------------
BEGIN
- table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:4
COMMIT
BEGIN
- table public.test_prepared1: INSERT: id[integer]:2
+ table public.test_prepared1: INSERT: id[integer]:5
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(7 rows)
+
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists. Our 2pc filter callback will skip decoding of xacts
+-- with catalog changes at PREPARE time, so we don't decode it now.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
+INSERT INTO test_prepared2 VALUES (7);
+:get_no2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:7
COMMIT
+(3 rows)
+
+:get_with2pc
+ data
+----------------------------------------------------
BEGIN
- table public.test_prepared1: INSERT: id[integer]:4
+ table public.test_prepared2: INSERT: id[integer]:7
COMMIT
+(3 rows)
+
+:get_with2pc_nofilter
+ data
+----------------------------------------------------
BEGIN
table public.test_prepared2: INSERT: id[integer]:7
COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+:get_no2pc
+ data
+-------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:5
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
+ COMMIT
+(4 rows)
+
+:get_with2pc
+ data
+-------------------------------------------------------------------------
BEGIN
table public.test_prepared1: INSERT: id[integer]:5
table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
COMMIT
+(4 rows)
+
+:get_with2pc_nofilter
+ data
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#3'
+(1 row)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (8);
+INSERT INTO test_prepared2 VALUES (9);
+:get_no2pc
+ data
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:9
+ COMMIT
+(6 rows)
+
+:get_with2pc
+ data
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:9
+ COMMIT
+(6 rows)
+
+:get_with2pc_nofilter
+ data
+--------------------------------------------------------------------
BEGIN
table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
COMMIT
BEGIN
table public.test_prepared2: INSERT: id[integer]:9
COMMIT
-(22 rows)
+(6 rows)
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'pg_class'::regclass;
+ relation | locktype | mode
+----------+----------+------
+(0 rows)
+
+-- Shouldn't see anything with 2pc decoding off
+:get_no2pc
+ data
+-----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:12
+ COMMIT
+(3 rows)
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+:get_with2pc
+ data
+-----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:12
+ PREPARE TRANSACTION 'test_prepared_lock2'
+ COMMIT PREPARED 'test_prepared_lock2'
+(4 rows)
+
+:get_with2pc_nofilter
+ data
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:12
+ PREPARE TRANSACTION 'test_prepared_lock2'
+ COMMIT PREPARED 'test_prepared_lock2'
+(8 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- Both will work normally after we commit
+:get_no2pc
+ data
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ COMMIT
+(4 rows)
+
+:get_with2pc
+ data
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ COMMIT
+(4 rows)
+
+:get_with2pc_nofilter
+ data
+--------------------------------------
+ COMMIT PREPARED 'test_prepared_lock'
+(1 row)
+
+-- cleanup
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+------
+(0 rows)
SELECT pg_drop_replication_slot('regression_slot');
pg_drop_replication_slot
@@ -72,3 +357,9 @@ SELECT pg_drop_replication_slot('regression_slot');
(1 row)
+SELECT pg_drop_replication_slot('regression_slot_2pc');
+ pg_drop_replication_slot
+--------------------------
+
+(1 row)
+
diff --git a/contrib/test_decoding/sql/prepared.sql b/contrib/test_decoding/sql/prepared.sql
index e72639767e..132b30e97b 100644
--- a/contrib/test_decoding/sql/prepared.sql
+++ b/contrib/test_decoding/sql/prepared.sql
@@ -1,22 +1,41 @@
-- predictability
SET synchronous_commit = on;
SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc', 'test_decoding');
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc_nofilter', 'test_decoding');
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+
+-- Reused queries
+\set get_no2pc 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'');'
+\set get_with2pc 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot_2pc'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'', ''twophase-decoding'', ''1'');'
+\set get_with2pc_nofilter 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot_2pc_nofilter'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'', ''twophase-decoding'', ''1'', ''twophase-decode-with-catalog-changes'', ''1'');'
-- test simple successful use of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (1);
PREPARE TRANSACTION 'test_prepared#1';
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
COMMIT PREPARED 'test_prepared#1';
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
INSERT INTO test_prepared1 VALUES (2);
-- test abort of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (3);
PREPARE TRANSACTION 'test_prepared#2';
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
ROLLBACK PREPARED 'test_prepared#2';
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
INSERT INTO test_prepared1 VALUES (4);
@@ -27,18 +46,74 @@ ALTER TABLE test_prepared1 ADD COLUMN data text;
INSERT INTO test_prepared1 VALUES (6, 'frakbar');
PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'test_prepared1'::regclass;
+
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists. Our 2pc filter callback will skip decoding of xacts
+-- with catalog changes at PREPARE time, so we don't decode it now.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
INSERT INTO test_prepared2 VALUES (7);
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
COMMIT PREPARED 'test_prepared#3';
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
-- make sure stuff still works
INSERT INTO test_prepared1 VALUES (8);
INSERT INTO test_prepared2 VALUES (9);
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'pg_class'::regclass;
+
+-- Shouldn't see anything with 2pc decoding off
+:get_no2pc
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+:get_with2pc
+:get_with2pc_nofilter
+RESET statement_timeout;
+
+COMMIT PREPARED 'test_prepared_lock';
+
+-- Both will work normally after we commit
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
-- cleanup
DROP TABLE test_prepared1;
@@ -48,3 +123,4 @@ DROP TABLE test_prepared2;
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
SELECT pg_drop_replication_slot('regression_slot');
+SELECT pg_drop_replication_slot('regression_slot_2pc');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 135b3b7638..362683feef 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -24,6 +24,8 @@
#include "replication/message.h"
#include "replication/origin.h"
+#include "storage/procarray.h"
+
#include "utils/builtins.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
@@ -46,6 +48,8 @@ typedef struct
bool skip_empty_xacts;
bool xact_wrote_changes;
bool only_local;
+ bool twophase_decoding;
+ bool twophase_decode_with_catalog_changes;
} TestDecodingData;
static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -68,6 +72,18 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr message_lsn,
bool transactional, const char *prefix,
Size sz, const char *message);
+static bool pg_filter_prepare(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ char *gid);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
void
_PG_init(void)
@@ -88,6 +104,10 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
cb->filter_by_origin_cb = pg_decode_filter;
cb->shutdown_cb = pg_decode_shutdown;
cb->message_cb = pg_decode_message;
+ cb->filter_prepare_cb = pg_filter_prepare;
+ cb->prepare_cb = pg_decode_prepare_txn;
+ cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+ cb->abort_prepared_cb = pg_decode_abort_prepared_txn;
}
@@ -107,6 +127,8 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
data->include_timestamp = false;
data->skip_empty_xacts = false;
data->only_local = false;
+ data->twophase_decoding = false;
+ data->twophase_decode_with_catalog_changes = false;
ctx->output_plugin_private = data;
@@ -176,6 +198,27 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
errmsg("could not parse value \"%s\" for parameter \"%s\"",
strVal(elem->arg), elem->defname)));
}
+ else if (strcmp(elem->defname, "twophase-decoding") == 0)
+ {
+
+ if (elem->arg == NULL)
+ data->twophase_decoding = true;
+ else if (!parse_bool(strVal(elem->arg), &data->twophase_decoding))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("could not parse value \"%s\" for parameter \"%s\"",
+ strVal(elem->arg), elem->defname)));
+ }
+ else if (strcmp(elem->defname, "twophase-decode-with-catalog-changes") == 0)
+ {
+ if (elem->arg == NULL)
+ data->twophase_decode_with_catalog_changes = true;
+ else if (!parse_bool(strVal(elem->arg), &data->twophase_decode_with_catalog_changes))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("could not parse value \"%s\" for parameter \"%s\"",
+ strVal(elem->arg), elem->defname)));
+ }
else
{
ereport(ERROR,
@@ -244,6 +287,164 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
OutputPluginWrite(ctx, true);
}
+/* Filter out unnecessary two-phase transactions */
+static bool
+pg_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ char *gid)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ /* treat all transactions as one-phase */
+ if (!data->twophase_decoding)
+ return true;
+
+ /*
+ * Two-phase transactions that accessed catalog require special
+ * treatment.
+ *
+ * Right now we don't have a safe way to decode catalog changes made in
+ * prepared transaction that was already aborted by the time of
+ * decoding.
+ *
+ * That kind of problem arises only when we are trying to
+ * retrospectively decode aborted transactions with catalog changes -
+ * including if a transaction aborts while we're decoding it. If one
+ * wants to code distributed commit based on prepare decoding then
+ * commits/aborts will happen strictly after decoding is
+ * completed, so it is possible to skip any checks/locks here.
+ *
+ * We'll also get stuck trying to acquire locks on catalog relations
+ * we need for decoding if the prepared xact holds a strong lock on
+ * one of them and we also need to decode row changes.
+ */
+ if (!txn->has_catalog_changes)
+ return false;
+
+ LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+
+ if (TransactionIdIsInProgress(txn->xid))
+ {
+ /*
+ * For the sake of simplicity, by default we just
+ * ignore in-progress prepared transactions with catalog
+ * changes in this extension. If they abort during
+ * decoding then tuples we need to decode them may be
+ * overwritten while we're still decoding, causing
+ * wrong catalog lookups.
+ *
+ * It is possible to move that LWLockRelease() to
+ * pg_decode_prepare_txn() and allow decoding of
+ * running prepared tx, but such lock will prevent any
+ * 2pc transaction commit during decoding time. That
+ * can be a long time in case of lots of
+ * changes/inserts in that tx or if the downstream is
+ * slow/unresponsive.
+ *
+ * (Continuing to decode without the lock is unsafe, XXX)
+ */
+ LWLockRelease(TwoPhaseStateLock);
+ return !data->twophase_decode_with_catalog_changes;
+ }
+ else if (TransactionIdDidAbort(txn->xid))
+ {
+ /*
+ * Here we know that it is already aborted and there is
+ * not much sense in doing something with this
+ * transaction. Consequently ABORT PREPARED will be
+ * suppressed.
+ */
+ LWLockRelease(TwoPhaseStateLock);
+ return true;
+ }
+ LWLockRelease(TwoPhaseStateLock);
+ /*
+ * XXX: Transaction is not in progress, so buf->origptr lags behind
+ * ctx->snapshot_builder.start_decoding_at.
+ * If we did decode it (i.e. didn't filter it above), we should not filter
+ * it out here either, so it will be passed to ReorderBufferForget in
+ * DecodePrepare.
+ * If we had to skip it previously, we have to skip it now too, so
+ * whole transaction will be decoded as simple non-2phase transaction.
+ */
+ return !data->twophase_decode_with_catalog_changes;
+}
+
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes && !data->twophase_decoding)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* ABORT PREPARED callback */
+static void
+pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes && !data->twophase_decoding)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "ABORT PREPARED %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
static bool
pg_decode_filter(LogicalDecodingContext *ctx,
RepOriginId origin_id)
diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 3aafa79e52..1a4487d404 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -102,6 +102,14 @@ ParseCommitRecord(uint8 info, xl_xact_commit *xlrec, xl_xact_parsed_commit *pars
parsed->twophase_xid = xl_twophase->xid;
data += sizeof(xl_xact_twophase);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_GID)
+ {
+ int gidlen;
+ strcpy(parsed->twophase_gid, data);
+ gidlen = strlen(parsed->twophase_gid) + 1;
+ data += MAXALIGN(gidlen);
+ }
}
if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
@@ -139,6 +147,16 @@ ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_abort *parsed)
data += sizeof(xl_xact_xinfo);
}
+ if (parsed->xinfo & XACT_XINFO_HAS_DBINFO)
+ {
+ xl_xact_dbinfo *xl_dbinfo = (xl_xact_dbinfo *) data;
+
+ parsed->dbId = xl_dbinfo->dbId;
+ parsed->tsId = xl_dbinfo->tsId;
+
+ data += sizeof(xl_xact_dbinfo);
+ }
+
if (parsed->xinfo & XACT_XINFO_HAS_SUBXACTS)
{
xl_xact_subxacts *xl_subxacts = (xl_xact_subxacts *) data;
@@ -168,6 +186,27 @@ ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_abort *parsed)
parsed->twophase_xid = xl_twophase->xid;
data += sizeof(xl_xact_twophase);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_GID)
+ {
+ int gidlen;
+ strcpy(parsed->twophase_gid, data);
+ gidlen = strlen(parsed->twophase_gid) + 1;
+ data += MAXALIGN(gidlen);
+ }
+ }
+
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ xl_xact_origin xl_origin;
+
+ /* we're only guaranteed 4 byte alignment, so copy onto stack */
+ memcpy(&xl_origin, data, sizeof(xl_origin));
+
+ parsed->origin_lsn = xl_origin.origin_lsn;
+ parsed->origin_timestamp = xl_origin.origin_timestamp;
+
+ data += sizeof(xl_xact_origin);
}
}
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index b715152e8d..792408c94d 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -144,11 +144,7 @@ int max_prepared_xacts = 0;
*
* typedef struct GlobalTransactionData *GlobalTransaction appears in
* twophase.h
- *
- * Note that the max value of GIDSIZE must fit in the uint16 gidlen,
- * specified in TwoPhaseFileHeader.
*/
-#define GIDSIZE 200
typedef struct GlobalTransactionData
{
@@ -211,12 +207,14 @@ static void RecordTransactionCommitPrepared(TransactionId xid,
RelFileNode *rels,
int ninvalmsgs,
SharedInvalidationMessage *invalmsgs,
- bool initfileinval);
+ bool initfileinval,
+ const char *gid);
static void RecordTransactionAbortPrepared(TransactionId xid,
int nchildren,
TransactionId *children,
int nrels,
- RelFileNode *rels);
+ RelFileNode *rels,
+ const char *gid);
static void ProcessRecords(char *bufptr, TransactionId xid,
const TwoPhaseCallback callbacks[]);
static void RemoveGXact(GlobalTransaction gxact);
@@ -556,7 +554,7 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held)
* Locate the prepared transaction and mark it busy for COMMIT or PREPARE.
*/
static GlobalTransaction
-LockGXact(const char *gid, Oid user)
+LockGXact(const char *gid, Oid user, bool missing_ok)
{
int i;
@@ -616,7 +614,8 @@ LockGXact(const char *gid, Oid user)
LWLockRelease(TwoPhaseStateLock);
- ereport(ERROR,
+ if (!missing_ok)
+ ereport(ERROR,
(errcode(ERRCODE_UNDEFINED_OBJECT),
errmsg("prepared transaction with identifier \"%s\" does not exist",
gid)));
@@ -898,7 +897,7 @@ TwoPhaseGetDummyProc(TransactionId xid)
/*
* Header for a 2PC state file
*/
-#define TWOPHASE_MAGIC 0x57F94533 /* format identifier */
+#define TWOPHASE_MAGIC 0x57F94534 /* format identifier */
typedef struct TwoPhaseFileHeader
{
@@ -914,6 +913,8 @@ typedef struct TwoPhaseFileHeader
int32 ninvalmsgs; /* number of cache invalidation messages */
bool initfileinval; /* does relcache init file need invalidation? */
uint16 gidlen; /* length of the GID - GID follows the header */
+ XLogRecPtr origin_lsn; /* lsn of this record at origin node */
+ TimestampTz origin_timestamp; /* time of prepare at origin node */
} TwoPhaseFileHeader;
/*
@@ -1065,6 +1066,7 @@ EndPrepare(GlobalTransaction gxact)
{
TwoPhaseFileHeader *hdr;
StateFileChunk *record;
+ bool replorigin;
/* Add the end sentinel to the list of 2PC records */
RegisterTwoPhaseRecord(TWOPHASE_RM_END_ID, 0,
@@ -1075,6 +1077,21 @@ EndPrepare(GlobalTransaction gxact)
Assert(hdr->magic == TWOPHASE_MAGIC);
hdr->total_len = records.total_len + sizeof(pg_crc32c);
+ replorigin = (replorigin_session_origin != InvalidRepOriginId &&
+ replorigin_session_origin != DoNotReplicateId);
+
+ if (replorigin)
+ {
+ Assert(replorigin_session_origin_lsn != InvalidXLogRecPtr);
+ hdr->origin_lsn = replorigin_session_origin_lsn;
+ hdr->origin_timestamp = replorigin_session_origin_timestamp;
+ }
+ else
+ {
+ hdr->origin_lsn = InvalidXLogRecPtr;
+ hdr->origin_timestamp = 0;
+ }
+
/*
* If the data size exceeds MaxAllocSize, we won't be able to read it in
* ReadTwoPhaseFile. Check for that now, rather than fail in the case
@@ -1107,7 +1124,16 @@ EndPrepare(GlobalTransaction gxact)
XLogBeginInsert();
for (record = records.head; record != NULL; record = record->next)
XLogRegisterData(record->data, record->len);
+
+ XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
+
gxact->prepare_end_lsn = XLogInsert(RM_XACT_ID, XLOG_XACT_PREPARE);
+
+ if (replorigin)
+ /* Move LSNs forward for this replication origin */
+ replorigin_session_advance(replorigin_session_origin_lsn,
+ gxact->prepare_end_lsn);
+
XLogFlush(gxact->prepare_end_lsn);
/* If we crash now, we have prepared: WAL replay will fix things */
@@ -1283,6 +1309,43 @@ ReadTwoPhaseFile(TransactionId xid, bool give_warnings)
return buf;
}
+/*
+ * ParsePrepareRecord: parse the 2PC state data of a PREPARE WAL record
+ */
+void
+ParsePrepareRecord(uint8 info, char *xlrec, xl_xact_parsed_prepare *parsed)
+{
+ TwoPhaseFileHeader *hdr;
+ char *bufptr;
+
+ hdr = (TwoPhaseFileHeader *) xlrec;
+ bufptr = xlrec + MAXALIGN(sizeof(TwoPhaseFileHeader));
+
+ parsed->origin_lsn = hdr->origin_lsn;
+ parsed->origin_timestamp = hdr->origin_timestamp;
+ parsed->twophase_xid = hdr->xid;
+ parsed->dbId = hdr->database;
+ parsed->nsubxacts = hdr->nsubxacts;
+ parsed->ncommitrels = hdr->ncommitrels;
+ parsed->nabortrels = hdr->nabortrels;
+ parsed->nmsgs = hdr->ninvalmsgs;
+
+ strncpy(parsed->twophase_gid, bufptr, hdr->gidlen);
+ bufptr += MAXALIGN(hdr->gidlen);
+
+ parsed->subxacts = (TransactionId *) bufptr;
+ bufptr += MAXALIGN(hdr->nsubxacts * sizeof(TransactionId));
+
+ parsed->commitrels = (RelFileNode *) bufptr;
+ bufptr += MAXALIGN(hdr->ncommitrels * sizeof(RelFileNode));
+
+ parsed->abortrels = (RelFileNode *) bufptr;
+ bufptr += MAXALIGN(hdr->nabortrels * sizeof(RelFileNode));
+
+ parsed->msgs = (SharedInvalidationMessage *) bufptr;
+ bufptr += MAXALIGN(hdr->ninvalmsgs * sizeof(SharedInvalidationMessage));
+}
+
/*
* Reads 2PC data from xlog. During checkpoint this data will be moved to
@@ -1365,7 +1428,7 @@ StandbyTransactionIdIsPrepared(TransactionId xid)
* FinishPreparedTransaction: execute COMMIT PREPARED or ROLLBACK PREPARED
*/
void
-FinishPreparedTransaction(const char *gid, bool isCommit)
+FinishPreparedTransaction(const char *gid, bool isCommit, bool missing_ok)
{
GlobalTransaction gxact;
PGPROC *proc;
@@ -1386,8 +1449,20 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
/*
* Validate the GID, and lock the GXACT to ensure that two backends do not
* try to commit the same GID at once.
+ *
+ * During logical decoding, on the apply side, a prepared transaction may
+ * get aborted while it is being decoded. In that case we stop decoding
+ * and abort the transaction immediately. However, the ROLLBACK PREPARED
+ * still reaches the subscriber, so in that case it is ok for the gid to
+ * be missing.
*/
- gxact = LockGXact(gid, GetUserId());
+ gxact = LockGXact(gid, GetUserId(), missing_ok);
+ if (gxact == NULL)
+ {
+ Assert(!isCommit && missing_ok);
+ return;
+ }
+
proc = &ProcGlobal->allProcs[gxact->pgprocno];
pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
xid = pgxact->xid;
@@ -1435,11 +1510,12 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
hdr->nsubxacts, children,
hdr->ncommitrels, commitrels,
hdr->ninvalmsgs, invalmsgs,
- hdr->initfileinval);
+ hdr->initfileinval, gid);
else
RecordTransactionAbortPrepared(xid,
hdr->nsubxacts, children,
- hdr->nabortrels, abortrels);
+ hdr->nabortrels, abortrels,
+ gid);
ProcArrayRemove(proc, latestXid);
@@ -2165,7 +2241,8 @@ RecordTransactionCommitPrepared(TransactionId xid,
RelFileNode *rels,
int ninvalmsgs,
SharedInvalidationMessage *invalmsgs,
- bool initfileinval)
+ bool initfileinval,
+ const char *gid)
{
XLogRecPtr recptr;
TimestampTz committs = GetCurrentTimestamp();
@@ -2193,7 +2270,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
ninvalmsgs, invalmsgs,
initfileinval, false,
MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
- xid);
+ xid, gid);
if (replorigin)
@@ -2255,7 +2332,8 @@ RecordTransactionAbortPrepared(TransactionId xid,
int nchildren,
TransactionId *children,
int nrels,
- RelFileNode *rels)
+ RelFileNode *rels,
+ const char *gid)
{
XLogRecPtr recptr;
@@ -2278,7 +2356,7 @@ RecordTransactionAbortPrepared(TransactionId xid,
nchildren, children,
nrels, rels,
MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
- xid);
+ xid, gid);
/* Always flush, since we're about to remove the 2PC state file */
XLogFlush(recptr);
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index c06fabca10..d751267b51 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1226,7 +1226,7 @@ RecordTransactionCommit(void)
nmsgs, invalMessages,
RelcacheInitFileInval, forceSyncCommit,
MyXactFlags,
- InvalidTransactionId /* plain commit */ );
+ InvalidTransactionId, NULL /* plain commit */ );
if (replorigin)
/* Move LSNs forward for this replication origin */
@@ -1578,7 +1578,8 @@ RecordTransactionAbort(bool isSubXact)
XactLogAbortRecord(xact_time,
nchildren, children,
nrels, rels,
- MyXactFlags, InvalidTransactionId);
+ MyXactFlags, InvalidTransactionId,
+ NULL);
/*
* Report the latest async abort LSN, so that the WAL writer knows to
@@ -5243,7 +5244,6 @@ xactGetCommittedChildren(TransactionId **ptr)
* XLOG support routines
*/
-
/*
* Log the commit record for a plain or twophase transaction commit.
*
@@ -5256,7 +5256,8 @@ XactLogCommitRecord(TimestampTz commit_time,
int nrels, RelFileNode *rels,
int nmsgs, SharedInvalidationMessage *msgs,
bool relcacheInval, bool forceSync,
- int xactflags, TransactionId twophase_xid)
+ int xactflags, TransactionId twophase_xid,
+ const char *twophase_gid)
{
xl_xact_commit xlrec;
xl_xact_xinfo xl_xinfo;
@@ -5268,6 +5269,7 @@ XactLogCommitRecord(TimestampTz commit_time,
xl_xact_origin xl_origin;
uint8 info;
+ int gidlen = 0;
Assert(CritSectionCount > 0);
@@ -5330,6 +5332,13 @@ XactLogCommitRecord(TimestampTz commit_time,
{
xl_xinfo.xinfo |= XACT_XINFO_HAS_TWOPHASE;
xl_twophase.xid = twophase_xid;
+ Assert(twophase_gid != NULL);
+
+ if (XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_GID;
+ gidlen = strlen(twophase_gid) + 1; /* include '\0' */
+ }
}
/* dump transaction origin information */
@@ -5380,8 +5389,19 @@ XactLogCommitRecord(TimestampTz commit_time,
}
if (xl_xinfo.xinfo & XACT_XINFO_HAS_TWOPHASE)
+ {
XLogRegisterData((char *) (&xl_twophase), sizeof(xl_xact_twophase));
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_GID)
+ {
+ static const char zeroes[MAXIMUM_ALIGNOF] = { 0 };
+ XLogRegisterData((char*) twophase_gid, gidlen);
+ if (MAXALIGN(gidlen) != gidlen)
+ XLogRegisterData((char*) zeroes, MAXALIGN(gidlen) - gidlen);
+ }
+ }
+
+
if (xl_xinfo.xinfo & XACT_XINFO_HAS_ORIGIN)
XLogRegisterData((char *) (&xl_origin), sizeof(xl_xact_origin));
@@ -5401,15 +5421,19 @@ XLogRecPtr
XactLogAbortRecord(TimestampTz abort_time,
int nsubxacts, TransactionId *subxacts,
int nrels, RelFileNode *rels,
- int xactflags, TransactionId twophase_xid)
+ int xactflags, TransactionId twophase_xid,
+ const char *twophase_gid)
{
xl_xact_abort xlrec;
xl_xact_xinfo xl_xinfo;
xl_xact_subxacts xl_subxacts;
xl_xact_relfilenodes xl_relfilenodes;
xl_xact_twophase xl_twophase;
+ xl_xact_dbinfo xl_dbinfo;
+ xl_xact_origin xl_origin;
uint8 info;
+ int gidlen = 0;
Assert(CritSectionCount > 0);
@@ -5445,6 +5469,31 @@ XactLogAbortRecord(TimestampTz abort_time,
{
xl_xinfo.xinfo |= XACT_XINFO_HAS_TWOPHASE;
xl_twophase.xid = twophase_xid;
+ Assert(twophase_gid != NULL);
+
+ if (XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_GID;
+ gidlen = strlen(twophase_gid) + 1; /* include '\0' */
+ }
+ }
+
+ if (TransactionIdIsValid(twophase_xid) && XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_DBINFO;
+ xl_dbinfo.dbId = MyDatabaseId;
+ xl_dbinfo.tsId = MyDatabaseTableSpace;
+ }
+
+ /* dump transaction origin information only for abort prepared */
+ if ((replorigin_session_origin != InvalidRepOriginId) &&
+ TransactionIdIsValid(twophase_xid) &&
+ XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_ORIGIN;
+
+ xl_origin.origin_lsn = replorigin_session_origin_lsn;
+ xl_origin.origin_timestamp = replorigin_session_origin_timestamp;
}
if (xl_xinfo.xinfo != 0)
@@ -5459,6 +5508,9 @@ XactLogAbortRecord(TimestampTz abort_time,
if (xl_xinfo.xinfo != 0)
XLogRegisterData((char *) (&xl_xinfo), sizeof(xl_xinfo));
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_DBINFO)
+ XLogRegisterData((char *) (&xl_dbinfo), sizeof(xl_dbinfo));
+
if (xl_xinfo.xinfo & XACT_XINFO_HAS_SUBXACTS)
{
XLogRegisterData((char *) (&xl_subxacts),
@@ -5476,7 +5528,23 @@ XactLogAbortRecord(TimestampTz abort_time,
}
if (xl_xinfo.xinfo & XACT_XINFO_HAS_TWOPHASE)
+ {
XLogRegisterData((char *) (&xl_twophase), sizeof(xl_xact_twophase));
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_GID)
+ {
+ static const char zeroes[MAXIMUM_ALIGNOF] = { 0 };
+ XLogRegisterData((char*) twophase_gid, gidlen);
+ if (MAXALIGN(gidlen) != gidlen)
+ XLogRegisterData((char*) zeroes, MAXALIGN(gidlen) - gidlen);
+ }
+ }
+
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_ORIGIN)
+ XLogRegisterData((char *) (&xl_origin), sizeof(xl_xact_origin));
+
+ if (TransactionIdIsValid(twophase_xid))
+ XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
+
return XLogInsert(RM_XACT_ID, info);
}
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 486fd0c988..c9231f4973 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -34,6 +34,7 @@
#include "access/xlogutils.h"
#include "access/xlogreader.h"
#include "access/xlogrecord.h"
+#include "access/twophase.h"
#include "catalog/pg_control.h"
@@ -72,6 +73,8 @@ static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_commit *parsed, TransactionId xid);
static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare *parsed);
/* common function to decode tuples */
static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -277,16 +280,33 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
break;
}
case XLOG_XACT_PREPARE:
+ {
+ xl_xact_parsed_prepare parsed;
- /*
- * Currently decoding ignores PREPARE TRANSACTION and will just
- * decode the transaction when the COMMIT PREPARED is sent or
- * throw away the transaction's contents when a ROLLBACK PREPARED
- * is received. In the future we could add code to expose prepared
- * transactions in the changestream allowing for a kind of
- * distributed 2PC.
- */
- ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+ /* check that output plugin is capable of twophase decoding */
+ if (!ctx->enable_twophase)
+ {
+ ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+ break;
+ }
+
+ /* ok, parse it */
+ ParsePrepareRecord(XLogRecGetInfo(buf->record),
+ XLogRecGetData(buf->record), &parsed);
+
+ /* does output plugin want this particular transaction? */
+ if (ctx->callbacks.filter_prepare_cb &&
+ ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+ parsed.twophase_gid))
+ {
+ ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+ buf->origptr);
+ break;
+ }
+
+ DecodePrepare(ctx, buf, &parsed);
+ break;
+ }
break;
default:
elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
@@ -551,8 +571,14 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
* Process invalidation messages, even if we're not interested in the
* transaction's contents, since the various caches need to always be
* consistent.
+ *
+ * Also, if this transaction was already sent to the prepare callback,
+ * both of these functions have been called at prepare time.
*/
- if (parsed->nmsgs > 0)
+ if (parsed->nmsgs > 0 &&
+ !(TransactionIdIsValid(parsed->twophase_xid) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder, xid,
+ parsed->twophase_gid)))
{
ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
parsed->nmsgs, parsed->msgs);
@@ -607,9 +633,69 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
buf->origptr, buf->endptr);
}
+ if (TransactionIdIsValid(parsed->twophase_xid) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
+ {
+ /* we are processing COMMIT PREPARED */
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn, parsed->twophase_gid, true);
+ }
+ else
+ {
+ /* replay actions of all transaction + subtransactions in order */
+ ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn);
+ }
+}
+
+/*
+ * Decode a PREPARE record. The logic is similar to that of DecodeCommit().
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare *parsed)
+{
+ XLogRecPtr origin_lsn = parsed->origin_lsn;
+ TimestampTz commit_time = parsed->origin_timestamp;
+ RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+ int i;
+ TransactionId xid = parsed->twophase_xid;
+
+ /*
+ * Process invalidation messages, even if we're not interested in the
+ * transaction's contents, since the various caches need to always be
+ * consistent.
+ */
+ if (parsed->nmsgs > 0)
+ {
+ ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
+ parsed->nmsgs, parsed->msgs);
+ ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+ }
+
+ if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+ (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+ FilterByOrigin(ctx, origin_id))
+ {
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr);
+ }
+ ReorderBufferForget(ctx->reorder, xid, buf->origptr);
+
+ return;
+ }
+
+ /* tell the reorderbuffer about the surviving subtransactions */
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+ buf->origptr, buf->endptr);
+ }
+
/* replay actions of all transaction + subtransactions in order */
- ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
- commit_time, origin_id, origin_lsn);
+ ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn, parsed->twophase_gid);
}
/*
@@ -621,6 +707,30 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid)
{
int i;
+ XLogRecPtr origin_lsn = InvalidXLogRecPtr;
+ TimestampTz commit_time = 0;
+ RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ origin_lsn = parsed->origin_lsn;
+ commit_time = parsed->origin_timestamp;
+ }
+
+ /*
+ * If it's ROLLBACK PREPARED then handle it via callbacks.
+ */
+ if (TransactionIdIsValid(xid) &&
+ !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+ parsed->dbId == ctx->slot->data.database &&
+ !FilterByOrigin(ctx, origin_id) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
+ {
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid, false);
+ return;
+ }
for (i = 0; i < parsed->nsubxacts; i++)
{
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index bca585fc27..ff23002cf3 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -60,6 +60,18 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+static void abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+static bool filter_decode_txn_cb_wrapper(ReorderBuffer *cache,
+ ReorderBufferTXN *txn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ const char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change);
static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -124,6 +136,7 @@ StartupDecodingContext(List *output_plugin_options,
MemoryContext context,
old_context;
LogicalDecodingContext *ctx;
+ int twophase_callbacks;
/* shorter lines... */
slot = MyReplicationSlot;
@@ -182,8 +195,27 @@ StartupDecodingContext(List *output_plugin_options,
ctx->reorder->begin = begin_cb_wrapper;
ctx->reorder->apply_change = change_cb_wrapper;
ctx->reorder->commit = commit_cb_wrapper;
+ ctx->reorder->abort = abort_cb_wrapper;
+ ctx->reorder->filter_decode_txn = filter_decode_txn_cb_wrapper;
+ ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+ ctx->reorder->prepare = prepare_cb_wrapper;
+ ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+ ctx->reorder->abort_prepared = abort_prepared_cb_wrapper;
ctx->reorder->message = message_cb_wrapper;
+ /* check that plugin implements all callbacks necessary to perform 2PC */
+ twophase_callbacks = (ctx->callbacks.prepare_cb != NULL) +
+ (ctx->callbacks.commit_prepared_cb != NULL) +
+ (ctx->callbacks.abort_prepared_cb != NULL);
+
+ ctx->enable_twophase = (twophase_callbacks == 3);
+
+ if (twophase_callbacks != 3 && twophase_callbacks != 0)
+ ereport(WARNING,
+ (errmsg("output plugin registered only %d twophase callbacks",
+ twophase_callbacks),
+ errdetail("Two-phase transactions will be decoded at commit time.")));
+
ctx->out = makeStringInfo();
ctx->prepare_write = prepare_write;
ctx->write = do_write;
@@ -680,6 +712,122 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
}
static void
+abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "abort";
+ state.report_location = txn->final_lsn; /* beginning of abort record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.abort_cb(ctx, txn, abort_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "prepare";
+ state.report_location = txn->final_lsn; /* beginning of prepare record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "commit_prepared";
+ state.report_location = txn->final_lsn; /* beginning of commit record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "abort_prepared";
+ state.report_location = txn->final_lsn; /* beginning of abort record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.abort_prepared_cb(ctx, txn, abort_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change)
{
@@ -714,6 +862,61 @@ change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
error_context_stack = errcallback.previous;
}
+static bool
+filter_decode_txn_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+ bool ret;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "filter_decode_txn";
+ state.report_location = InvalidXLogRecPtr;
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = false;
+
+ /* do the actual work: call callback */
+ ret = ctx->callbacks.filter_decode_txn_cb(ctx, txn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+ return ret;
+}
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn, const char *gid)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+ bool ret;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "filter_prepare";
+ state.report_location = InvalidXLogRecPtr;
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = false;
+
+ /* do the actual work: call callback */
+ ret = ctx->callbacks.filter_prepare_cb(ctx, txn, gid);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+ return ret;
+}
+
bool
filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
{
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 9b126b2957..77b9f58ae2 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -72,10 +72,11 @@ void
logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
XLogRecPtr commit_lsn)
{
- uint8 flags = 0;
+ uint8 flags = 0;
pq_sendbyte(out, 'C'); /* sending COMMIT */
+ flags |= LOGICALREP_IS_COMMIT;
/* send the flags field (unused for now) */
pq_sendbyte(out, flags);
@@ -86,21 +87,106 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
}
/*
- * Read transaction COMMIT from the stream.
+ * Write ABORT to the output stream.
+ */
+void
+logicalrep_write_abort(StringInfo out, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ uint8 flags = 0;
+
+ pq_sendbyte(out, 'C'); /* sending ABORT flag below */
+
+ flags |= LOGICALREP_IS_ABORT;
+ /* send the flags field */
+ pq_sendbyte(out, flags);
+
+ /* send fields */
+ pq_sendint64(out, abort_lsn);
+ pq_sendint64(out, txn->end_lsn);
+ pq_sendint64(out, txn->commit_time);
+}
+
+/*
+ * Read transaction COMMIT|ABORT from the stream.
*/
void
-logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
+logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data,
+ uint8 *flags)
{
- /* read flags (unused for now) */
- uint8 flags = pq_getmsgbyte(in);
+ /* read flags */
+ uint8 commit_flags = pq_getmsgbyte(in);
- if (flags != 0)
- elog(ERROR, "unrecognized flags %u in commit message", flags);
+ if (!(commit_flags & LOGICALREP_COMMIT_MASK))
+ elog(ERROR, "unrecognized flags %u in commit|abort message",
+ commit_flags);
/* read fields */
commit_data->commit_lsn = pq_getmsgint64(in);
commit_data->end_lsn = pq_getmsgint64(in);
commit_data->committime = pq_getmsgint64(in);
+
+ /* set gid to empty */
+ commit_data->gid[0] = '\0';
+
+ *flags = commit_flags;
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ uint8 flags = 0;
+
+ pq_sendbyte(out, 'P'); /* sending PREPARE protocol */
+
+ if (txn->txn_flags & TXN_COMMIT_PREPARED)
+ flags |= LOGICALREP_IS_COMMIT_PREPARED;
+ else if (txn->txn_flags & TXN_ROLLBACK_PREPARED)
+ flags |= LOGICALREP_IS_ROLLBACK_PREPARED;
+ else if (txn->txn_flags & TXN_PREPARE)
+ flags |= LOGICALREP_IS_PREPARE;
+
+ if (flags == 0)
+ elog(ERROR, "unrecognized flags %u in [commit|rollback] prepare message", flags);
+
+ /* send the flags field */
+ pq_sendbyte(out, flags);
+
+ /* send fields */
+ pq_sendint64(out, prepare_lsn);
+ pq_sendint64(out, txn->end_lsn);
+ pq_sendint64(out, txn->commit_time);
+
+ /* send gid */
+ pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepCommitData *commit_data, uint8 *flags)
+{
+ /* read flags */
+ uint8 prep_flags = pq_getmsgbyte(in);
+
+ if (!(prep_flags & LOGICALREP_PREPARE_MASK))
+ elog(ERROR, "unrecognized flags %u in prepare message", prep_flags);
+
+ /* read fields */
+ commit_data->commit_lsn = pq_getmsgint64(in);
+ commit_data->end_lsn = pq_getmsgint64(in);
+ commit_data->committime = pq_getmsgint64(in);
+
+ /* read gid */
+ strcpy(commit_data->gid, pq_getmsgstring(in));
+
+ /* set flags */
+ *flags = prep_flags;
}
/*
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index dc0ad5b0e7..91b2a76fa7 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1261,25 +1261,18 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
* the top and subtransactions (using a k-way merge) and replay the changes in
* lsn order.
*/
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+ ReorderBuffer *rb, TransactionId xid,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
TimestampTz commit_time,
RepOriginId origin_id, XLogRecPtr origin_lsn)
{
- ReorderBufferTXN *txn;
volatile Snapshot snapshot_now;
volatile CommandId command_id = FirstCommandId;
bool using_subtxn;
ReorderBufferIterTXNState *volatile iterstate = NULL;
- txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
- false);
-
- /* unknown transaction, nothing to replay */
- if (txn == NULL)
- return;
-
txn->final_lsn = commit_lsn;
txn->end_lsn = end_lsn;
txn->commit_time = commit_time;
@@ -1323,6 +1316,17 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
{
ReorderBufferChange *change;
ReorderBufferChange *specinsert = NULL;
+ bool change_cleanup = false;
+ bool check_txn_status;
+
+ /*
+ * Check once whether this xid has already
+ * committed. If it has not, we need to consult the
+ * filter_decode_txn callback to find out whether it
+ * is still ok for us to continue decoding this xid.
+ */
+ check_txn_status = TransactionIdDidCommit(txn->xid)?
+ false : true;
if (using_subtxn)
BeginInternalSubTransaction("replay");
@@ -1337,6 +1341,20 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
Relation relation = NULL;
Oid reloid;
+ /*
+ * While decoding 2PC or while streaming uncommitted
+ * transactions, check whether this transaction still
+ * needs to be decoded. If the transaction got aborted
+ * or if we were instructed to stop decoding, then
+ * bail out early.
+ */
+ if (check_txn_status && rb->filter_decode_txn(rb, txn))
+ {
+ elog(LOG, "stopping decoding of (%u)", txn->xid);
+ change_cleanup = true;
+ goto change_cleanuptxn;
+ }
+
switch (change->action)
{
case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1543,6 +1561,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
}
}
+change_cleanuptxn:
/*
* There's a speculative insertion remaining, just clean in up, it
* can't have been successful, otherwise we'd gotten a confirmation
@@ -1558,8 +1577,19 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
ReorderBufferIterTXNFinish(rb, iterstate);
iterstate = NULL;
- /* call commit callback */
- rb->commit(rb, txn, commit_lsn);
+ if (change_cleanup)
+ {
+ /* call abort */
+ rb->abort(rb, txn, commit_lsn);
+ }
+ else
+ {
+ /* call commit or prepare callback */
+ if (txn_prepared(txn))
+ rb->prepare(rb, txn, commit_lsn);
+ else
+ rb->commit(rb, txn, commit_lsn);
+ }
/* this is just a sanity check against bad output plugin behaviour */
if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -1586,7 +1616,14 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
if (snapshot_now->copied)
ReorderBufferFreeSnap(rb, snapshot_now);
- /* remove potential on-disk data, and deallocate */
+ /*
+ * remove potential on-disk data, and deallocate.
+ *
+ * We clean up even for 2PC transactions, because
+ * the COMMIT PREPARED might be some time away, and
+ * processing it does not need this data to be
+ * around anyway.
+ */
ReorderBufferCleanupTXN(rb, txn);
}
PG_CATCH();
@@ -1621,6 +1658,137 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
}
/*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+ return rb->filter_prepare(rb, txn, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferCommitInternal()
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /* unknown transaction, nothing to replay */
+ if (txn == NULL)
+ return;
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a twophase transaction. It calls ReorderBufferCommitInternal()
+ * since all prepared transactions need to be decoded at PREPARE time.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /* unknown transaction, nothing to replay */
+ if (txn == NULL)
+ return;
+
+ txn->txn_flags |= TXN_PREPARE;
+ strcpy(txn->gid, gid);
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Check whether this transaction was sent as prepared to subscribers.
+ * Called while handling commit|abort prepared.
+ */
+bool
+ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+ const char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /*
+ * If txn == NULL then employ the callback to see if this txn
+ * was sent at PREPARE time. The callback should return the same
+ * answer for a given GID, every time we call it.
+ */
+ if (txn == NULL)
+ return !(rb->filter_prepare(rb, NULL, gid));
+ else
+ return txn_prepared(txn);
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ABORT PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit)
+{
+ ReorderBufferTXN *txn;
+
+ /*
+ * The transaction may or may not exist (during restarts for
+ * example). Anyways, 2PC transactions do not contain any
+ * reorderbuffers. So allow it to be created below.
+ */
+ txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+ true);
+
+ txn->final_lsn = commit_lsn;
+ txn->end_lsn = end_lsn;
+ txn->commit_time = commit_time;
+ txn->origin_id = origin_id;
+ txn->origin_lsn = origin_lsn;
+ strcpy(txn->gid, gid);
+
+ if (is_commit)
+ {
+ txn->txn_flags |= TXN_COMMIT_PREPARED;
+ rb->commit_prepared(rb, txn, commit_lsn);
+ }
+ else
+ {
+ txn->txn_flags |= TXN_ROLLBACK_PREPARED;
+ rb->abort_prepared(rb, txn, commit_lsn);
+ }
+
+ ReorderBufferCleanupTXN(rb, txn);
+}
+
+/*
* Abort a transaction that possibly has previous changes. Needs to be first
* called for subtransactions and then for the toplevel xid.
*
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index fa5d9bb120..71ff19ded5 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -452,8 +452,9 @@ static void
apply_handle_commit(StringInfo s)
{
LogicalRepCommitData commit_data;
+ uint8 flags = 0;
- logicalrep_read_commit(s, &commit_data);
+ logicalrep_read_commit(s, &commit_data, &flags);
Assert(commit_data.commit_lsn == remote_final_lsn);
@@ -467,7 +468,11 @@ apply_handle_commit(StringInfo s)
replorigin_session_origin_lsn = commit_data.end_lsn;
replorigin_session_origin_timestamp = commit_data.committime;
- CommitTransactionCommand();
+ if (flags & LOGICALREP_IS_COMMIT)
+ CommitTransactionCommand();
+ else if (flags & LOGICALREP_IS_ABORT)
+ AbortCurrentTransaction();
+
pgstat_report_stat(false);
store_flush_position(commit_data.end_lsn);
@@ -487,6 +492,121 @@ apply_handle_commit(StringInfo s)
pgstat_report_activity(STATE_IDLE, NULL);
}
+static void
+apply_handle_prepare_txn(LogicalRepCommitData *commit_data)
+{
+ Assert(commit_data->commit_lsn == remote_final_lsn);
+ /* The synchronization worker runs in single transaction. */
+ if (IsTransactionState() && !am_tablesync_worker())
+ {
+ /* End the earlier transaction and start a new one */
+ BeginTransactionBlock();
+ CommitTransactionCommand();
+ StartTransactionCommand();
+ /*
+ * Update origin state so we can restart streaming from correct
+ * position in case of crash.
+ */
+ replorigin_session_origin_lsn = commit_data->end_lsn;
+ replorigin_session_origin_timestamp = commit_data->committime;
+
+ PrepareTransactionBlock(commit_data->gid);
+ CommitTransactionCommand();
+ pgstat_report_stat(false);
+
+ store_flush_position(commit_data->end_lsn);
+ }
+ else
+ {
+ /* Process any invalidation messages that might have accumulated. */
+ AcceptInvalidationMessages();
+ /* TODO: what to do here for prepared transactions?? */
+ Assert(false);
+ }
+
+ in_remote_transaction = false;
+
+ /* Process any tables that are being synchronized in parallel. */
+ process_syncing_tables(commit_data->end_lsn);
+
+ pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+static void
+apply_handle_commit_prepared_txn(LogicalRepCommitData *commit_data)
+{
+ /* there is no transaction when COMMIT PREPARED is called */
+ ensure_transaction();
+
+ /*
+ * Update origin state so we can restart streaming from correct
+ * position in case of crash.
+ */
+ replorigin_session_origin_lsn = commit_data->end_lsn;
+ replorigin_session_origin_timestamp = commit_data->committime;
+
+ FinishPreparedTransaction(commit_data->gid, true, false);
+ CommitTransactionCommand();
+ pgstat_report_stat(false);
+
+ store_flush_position(commit_data->end_lsn);
+ in_remote_transaction = false;
+
+ /* Process any tables that are being synchronized in parallel. */
+ process_syncing_tables(commit_data->end_lsn);
+
+ pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+static void
+apply_handle_rollback_prepared_txn(LogicalRepCommitData *commit_data)
+{
+ /* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+ ensure_transaction();
+
+ /*
+ * Update origin state so we can restart streaming from correct
+ * position in case of crash.
+ */
+ replorigin_session_origin_lsn = commit_data->end_lsn;
+ replorigin_session_origin_timestamp = commit_data->committime;
+
+ FinishPreparedTransaction(commit_data->gid, false, true);
+ CommitTransactionCommand();
+ pgstat_report_stat(false);
+
+ store_flush_position(commit_data->end_lsn);
+ in_remote_transaction = false;
+
+ /* Process any tables that are being synchronized in parallel. */
+ process_syncing_tables(commit_data->end_lsn);
+
+ pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+ LogicalRepCommitData commit_data;
+ uint8 flags = 0;
+
+ logicalrep_read_prepare(s, &commit_data, &flags);
+
+ if (flags & LOGICALREP_IS_PREPARE)
+ apply_handle_prepare_txn(&commit_data);
+ else if (flags & LOGICALREP_IS_COMMIT_PREPARED)
+ apply_handle_commit_prepared_txn(&commit_data);
+ else if (flags & LOGICALREP_IS_ROLLBACK_PREPARED)
+ apply_handle_rollback_prepared_txn(&commit_data);
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg("wrong [commit|rollback] prepare message")));
+}
+
/*
* Handle ORIGIN message.
*
@@ -884,10 +1004,14 @@ apply_dispatch(StringInfo s)
case 'B':
apply_handle_begin(s);
break;
- /* COMMIT */
+ /* COMMIT|ABORT */
case 'C':
apply_handle_commit(s);
break;
+ /* [COMMIT|ROLLBACK] PREPARE */
+ case 'P':
+ apply_handle_prepare(s);
+ break;
/* INSERT */
case 'I':
apply_handle_insert(s);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index c3126545b4..4bbad5b21d 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -37,11 +37,23 @@ static void pgoutput_begin_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn);
static void pgoutput_commit_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_abort_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr abort_lsn);
static void pgoutput_change(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, Relation rel,
ReorderBufferChange *change);
static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
RepOriginId origin_id);
+static bool pgoutput_filter_prepare(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, const char *gid);
+static bool pgoutput_decode_txn_filter(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_abort_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
static bool publications_valid;
@@ -79,7 +91,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
cb->begin_cb = pgoutput_begin_txn;
cb->change_cb = pgoutput_change;
cb->commit_cb = pgoutput_commit_txn;
+ cb->abort_cb = pgoutput_abort_txn;
+
+ cb->filter_prepare_cb = pgoutput_filter_prepare;
+ cb->prepare_cb = pgoutput_prepare_txn;
+ cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+ cb->abort_prepared_cb = pgoutput_abort_prepared_txn;
+
cb->filter_by_origin_cb = pgoutput_origin_filter;
+ cb->filter_decode_txn_cb = pgoutput_decode_txn_filter;
cb->shutdown_cb = pgoutput_shutdown;
}
@@ -254,6 +274,61 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
}
/*
+ * ABORT callback
+ */
+static void
+pgoutput_abort_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ OutputPluginUpdateProgress(ctx);
+
+ OutputPluginPrepareWrite(ctx, true);
+ logicalrep_write_abort(ctx->out, txn, abort_lsn);
+ OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ OutputPluginUpdateProgress(ctx);
+
+ OutputPluginPrepareWrite(ctx, true);
+ logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+ OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ OutputPluginUpdateProgress(ctx);
+
+ OutputPluginPrepareWrite(ctx, true);
+ logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+ OutputPluginWrite(ctx, true);
+}
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ OutputPluginUpdateProgress(ctx);
+
+ OutputPluginPrepareWrite(ctx, true);
+ logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+ OutputPluginWrite(ctx, true);
+}
+
+/*
* Sends the decoded DML over wire.
*/
static void
@@ -364,6 +439,18 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
}
/*
+ * Filter out unnecessary two-phase transactions.
+ *
+ * Currently, we forward all two-phase transactions
+ */
+static bool
+pgoutput_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ const char *gid)
+{
+ return false;
+}
+
+/*
* Currently we always forward.
*/
static bool
@@ -374,6 +461,37 @@ pgoutput_origin_filter(LogicalDecodingContext *ctx,
}
/*
+ * Check if we should continue to decode this transaction.
+ *
+ * If it has aborted in the meanwhile, then there's no sense
+ * in decoding and sending the rest of the changes, we might
+ * as well ask the subscribers to abort immediately.
+ *
+ * This should be called if we are streaming a transaction
+ * before it's committed or if we are decoding a 2PC
+ * transaction. Otherwise we always decode committed
+ * transactions
+ *
+ * Additional checks can be added here, as needed
+ */
+static bool
+pgoutput_decode_txn_filter(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn)
+{
+ /*
+ * Due to caching, repeated TransactionIdDidAbort calls
+ * shouldn't be that expensive
+ */
+ if (txn != NULL &&
+ TransactionIdIsValid(txn->xid) &&
+ TransactionIdDidAbort(txn->xid))
+ return true;
+
+ /* if txn is NULL, filter it out :-) */
+ return (txn != NULL)? false:true;
+}
+
+/*
* Shutdown the output plugin.
*
* Note, we don't need to clean the data->context as it's child context
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 82a707af7b..792cd9c868 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -454,13 +454,13 @@ standard_ProcessUtility(PlannedStmt *pstmt,
case TRANS_STMT_COMMIT_PREPARED:
PreventTransactionChain(isTopLevel, "COMMIT PREPARED");
PreventCommandDuringRecovery("COMMIT PREPARED");
- FinishPreparedTransaction(stmt->gid, true);
+ FinishPreparedTransaction(stmt->gid, true, false);
break;
case TRANS_STMT_ROLLBACK_PREPARED:
PreventTransactionChain(isTopLevel, "ROLLBACK PREPARED");
PreventCommandDuringRecovery("ROLLBACK PREPARED");
- FinishPreparedTransaction(stmt->gid, false);
+ FinishPreparedTransaction(stmt->gid, false, false);
break;
case TRANS_STMT_ROLLBACK:
diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c
index a821e2eed1..0db65bca1b 100644
--- a/src/backend/utils/time/tqual.c
+++ b/src/backend/utils/time/tqual.c
@@ -1252,6 +1252,19 @@ HeapTupleSatisfiesVacuum(HeapTuple htup, TransactionId OldestXmin,
*/
SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
InvalidTransactionId);
+
+ /*
+ * Transaction aborted, but perhaps it was recent enough
+ * that some open transactions could still see the tuple.
+ * We restrict the scope of this check to activities on
+ * catalog tables only, because logical decoding could be
+ * peeking into such tuples for a short while
+ */
+ if ((htup->t_tableOid < (Oid) FirstNormalObjectId) &&
+ !TransactionIdPrecedes(HeapTupleHeaderGetRawXmin(tuple),
+ OldestXmin))
+ return HEAPTUPLE_RECENTLY_DEAD;
+
return HEAPTUPLE_DEAD;
}
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 54dec4eeaf..c552d38367 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -15,6 +15,7 @@
#define TWOPHASE_H
#include "access/xlogdefs.h"
+#include "access/xact.h"
#include "datatype/timestamp.h"
#include "storage/lock.h"
@@ -46,12 +47,15 @@ extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
int *nxids_p);
+extern void ParsePrepareRecord(uint8 info, char *xlrec,
+ xl_xact_parsed_prepare *parsed);
extern void StandbyRecoverPreparedTransactions(void);
extern void RecoverPreparedTransactions(void);
extern void CheckPointTwoPhase(XLogRecPtr redo_horizon);
-extern void FinishPreparedTransaction(const char *gid, bool isCommit);
+extern void FinishPreparedTransaction(const char *gid, bool isCommit,
+ bool missing_ok);
extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
XLogRecPtr end_lsn);
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 118b0a8432..118156ed78 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -21,6 +21,13 @@
#include "storage/sinval.h"
#include "utils/datetime.h"
+/*
+ * Maximum size of Global Transaction ID (including '\0').
+ *
+ * Note that the max value of GIDSIZE must fit in the uint16 gidlen,
+ * specified in TwoPhaseFileHeader.
+ */
+#define GIDSIZE 200
/*
* Xact isolation levels
@@ -156,6 +163,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
#define XACT_XINFO_HAS_TWOPHASE (1U << 4)
#define XACT_XINFO_HAS_ORIGIN (1U << 5)
#define XACT_XINFO_HAS_AE_LOCKS (1U << 6)
+#define XACT_XINFO_HAS_GID (1U << 7)
/*
* Also stored in xinfo, these indicating a variety of additional actions that
@@ -302,13 +310,40 @@ typedef struct xl_xact_parsed_commit
SharedInvalidationMessage *msgs;
TransactionId twophase_xid; /* only for 2PC */
+ char twophase_gid[GIDSIZE];
XLogRecPtr origin_lsn;
TimestampTz origin_timestamp;
} xl_xact_parsed_commit;
+typedef struct xl_xact_parsed_prepare
+{
+ Oid dbId; /* MyDatabaseId */
+
+ int nsubxacts;
+ TransactionId *subxacts;
+
+ int ncommitrels;
+ RelFileNode *commitrels;
+
+ int nabortrels;
+ RelFileNode *abortrels;
+
+ int nmsgs;
+ SharedInvalidationMessage *msgs;
+
+ TransactionId twophase_xid;
+ char twophase_gid[GIDSIZE];
+
+ XLogRecPtr origin_lsn;
+ TimestampTz origin_timestamp;
+} xl_xact_parsed_prepare;
+
typedef struct xl_xact_parsed_abort
{
+ Oid dbId;
+ Oid tsId;
+
TimestampTz xact_time;
uint32 xinfo;
@@ -319,6 +354,10 @@ typedef struct xl_xact_parsed_abort
RelFileNode *xnodes;
TransactionId twophase_xid; /* only for 2PC */
+ char twophase_gid[GIDSIZE];
+
+ XLogRecPtr origin_lsn;
+ TimestampTz origin_timestamp;
} xl_xact_parsed_abort;
@@ -386,12 +425,13 @@ extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
int nmsgs, SharedInvalidationMessage *msgs,
bool relcacheInval, bool forceSync,
int xactflags,
- TransactionId twophase_xid);
+ TransactionId twophase_xid, const char *twophase_gid);
extern XLogRecPtr XactLogAbortRecord(TimestampTz abort_time,
int nsubxacts, TransactionId *subxacts,
int nrels, RelFileNode *rels,
- int xactflags, TransactionId twophase_xid);
+ int xactflags, TransactionId twophase_xid,
+ const char *twophase_gid);
extern void xact_redo(XLogReaderState *record);
/* xactdesc.c */
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 7f0e0fa881..758de40db9 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -82,6 +82,11 @@ typedef struct LogicalDecodingContext
bool prepared_write;
XLogRecPtr write_location;
TransactionId write_xid;
+
+ /*
+ * Capabilities of the output plugin.
+ */
+ bool enable_twophase;
} LogicalDecodingContext;
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index a9736e1bf6..7f51f75b97 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -69,11 +69,20 @@ typedef struct LogicalRepBeginData
TransactionId xid;
} LogicalRepBeginData;
+#define LOGICALREP_IS_COMMIT 0x01
+#define LOGICALREP_IS_ABORT 0x02
+#define LOGICALREP_IS_PREPARE 0x04
+#define LOGICALREP_IS_COMMIT_PREPARED 0x08
+#define LOGICALREP_IS_ROLLBACK_PREPARED 0x10
+#define LOGICALREP_COMMIT_MASK (LOGICALREP_IS_COMMIT | LOGICALREP_IS_ABORT)
+#define LOGICALREP_PREPARE_MASK (LOGICALREP_IS_PREPARE | LOGICALREP_IS_COMMIT_PREPARED | LOGICALREP_IS_ROLLBACK_PREPARED)
typedef struct LogicalRepCommitData
{
+ uint8 flag;
XLogRecPtr commit_lsn;
XLogRecPtr end_lsn;
TimestampTz committime;
+ char gid[GIDSIZE];
} LogicalRepCommitData;
extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
@@ -81,8 +90,14 @@ extern void logicalrep_read_begin(StringInfo in,
LogicalRepBeginData *begin_data);
extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+extern void logicalrep_write_abort(StringInfo out, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
extern void logicalrep_read_commit(StringInfo in,
- LogicalRepCommitData *commit_data);
+ LogicalRepCommitData *commit_data, uint8 *flags);
+extern void logicalrep_read_prepare(StringInfo in,
+ LogicalRepCommitData *commit_data, uint8 *flags);
extern void logicalrep_write_origin(StringInfo out, const char *origin,
XLogRecPtr origin_lsn);
extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 26ff024882..c92ace3838 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -68,6 +68,45 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
XLogRecPtr commit_lsn);
/*
+ * Called for an implicit ABORT of a transaction.
+ */
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+ /*
+ * Called before decoding of PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to prepare
+ * and commit_prepared/abort_prepared callbacks or wait till COMMIT PREPARED
+ * and sent as usual transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ const char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+/*
* Called for the generic logical decoding messages.
*/
typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
@@ -85,6 +124,12 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
RepOriginId origin_id);
/*
+ * Filter to check if we should continue to decode this transaction
+ */
+typedef bool (*LogicalDecodeFilterDecodeTxnCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn);
+
+/*
* Called to shutdown an output plugin.
*/
typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
@@ -98,8 +143,14 @@ typedef struct OutputPluginCallbacks
LogicalDecodeBeginCB begin_cb;
LogicalDecodeChangeCB change_cb;
LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeAbortCB abort_cb;
LogicalDecodeMessageCB message_cb;
+ LogicalDecodeFilterPrepareCB filter_prepare_cb;
+ LogicalDecodePrepareCB prepare_cb;
+ LogicalDecodeCommitPreparedCB commit_prepared_cb;
+ LogicalDecodeAbortPreparedCB abort_prepared_cb;
LogicalDecodeFilterByOriginCB filter_by_origin_cb;
+ LogicalDecodeFilterDecodeTxnCB filter_decode_txn_cb;
LogicalDecodeShutdownCB shutdown_cb;
} OutputPluginCallbacks;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index b18ce5a9df..933f13a174 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
#define REORDERBUFFER_H
#include "access/htup_details.h"
+#include "access/twophase.h"
#include "lib/ilist.h"
#include "storage/sinval.h"
#include "utils/hsearch.h"
@@ -137,13 +138,28 @@ typedef struct ReorderBufferChange
dlist_node node;
} ReorderBufferChange;
+
+/* TODO: convert existing bools into flags later */
+/* values for txn_flags */
+#define TXN_HAS_CATALOG_CHANGES 0x0001
+#define TXN_IS_SUBXACT 0x0002
+#define TXN_PREPARE 0x0004
+#define TXN_COMMIT_PREPARED 0x0008
+#define TXN_ROLLBACK_PREPARED 0x0010
+#define txn_prepared(txn) (txn->txn_flags & TXN_PREPARE)
+
typedef struct ReorderBufferTXN
{
+ int txn_flags;
+
/*
* The transactions transaction id, can be a toplevel or sub xid.
*/
TransactionId xid;
+ /* In case of 2PC we need to pass GID to output plugin */
+ char gid[GIDSIZE];
+
/* did the TX have catalog changes */
bool has_catalog_changes;
@@ -292,6 +308,39 @@ typedef void (*ReorderBufferCommitCB) (
ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+/* abort callback signature */
+typedef void (*ReorderBufferAbortCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+typedef bool (*ReorderBufferFilterDecodeTxnCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn);
+
+typedef bool (*ReorderBufferFilterPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/* abort prepared callback signature */
+typedef void (*ReorderBufferAbortPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
/* message callback signature */
typedef void (*ReorderBufferMessageCB) (
ReorderBuffer *rb,
@@ -327,6 +376,12 @@ struct ReorderBuffer
ReorderBufferBeginCB begin;
ReorderBufferApplyChangeCB apply_change;
ReorderBufferCommitCB commit;
+ ReorderBufferAbortCB abort;
+ ReorderBufferFilterDecodeTxnCB filter_decode_txn;
+ ReorderBufferFilterPrepareCB filter_prepare;
+ ReorderBufferPrepareCB prepare;
+ ReorderBufferCommitPreparedCB commit_prepared;
+ ReorderBufferAbortPreparedCB abort_prepared;
ReorderBufferMessageCB message;
/*
@@ -369,6 +424,11 @@ void ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot
void ReorderBufferCommit(ReorderBuffer *, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit);
void ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
void ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
@@ -392,6 +452,15 @@ void ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
bool ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
bool ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
+bool ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+ const char *gid);
+bool ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+ const char *gid);
+void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid);
ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
void ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
On 4 December 2017 at 23:15, Nikhil Sontakke <nikhils@2ndquadrant.com>
wrote:
PFA, latest patch for this functionality.
This patch contains the following changes as compared to the earlier patch:
- Fixed a bunch of typos and comments
- Modified HeapTupleSatisfiesVacuum to return HEAPTUPLE_RECENTLY_DEAD
if the transaction id is newer than OldestXmin. Doing this only for
CATALOG tables (htup->t_tableOid < (Oid) FirstNormalObjectId).
Because logical decoding supports user-catalog relations, we need to use
the same sort of logic that GetOldestXmin uses instead of a simple
oid-range check. See RelationIsAccessibleInLogicalDecoding() and the
user_catalog_table reloption.
Otherwise pseudo-catalogs used by logical decoding output plugins could
still suffer issues with needed tuples getting vacuumed, though only if the
txn being decoded made changes to those tables then ROLLBACKed. It's a
pretty tiny corner case for decoding of 2pc but a bigger one when we're
addressing streaming decoding.
Otherwise I'm really, really happy with how this is progressing and want to
find time to play with it.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
- Modified HeapTupleSatisfiesVacuum to return HEAPTUPLE_RECENTLY_DEAD
if the transaction id is newer than OldestXmin. Doing this only for
CATALOG tables (htup->t_tableOid < (Oid) FirstNormalObjectId).
Because logical decoding supports user-catalog relations, we need to use the
same sort of logic that GetOldestXmin uses instead of a simple oid-range
check. See RelationIsAccessibleInLogicalDecoding() and the
user_catalog_table reloption.
Unfortunately, HeapTupleSatisfiesVacuum does not have the Relation
structure handily available to allow for these checks.
Otherwise pseudo-catalogs used by logical decoding output plugins could
still suffer issues with needed tuples getting vacuumed, though only if the
txn being decoded made changes to those tables then ROLLBACKed. It's a
pretty tiny corner case for decoding of 2pc but a bigger one when we're
addressing streaming decoding.
We disallow rewrites on user_catalog_tables, so they cannot change
underneath. Yes, DML can be carried out on them inside a 2PC
transaction which then gets ROLLBACK'ed. But if it's getting aborted,
then we are not interested in that data anyways. Also, now that we
have the "filter_decode_txn_cb_wrapper()" function, we will stop
decoding by the next change record cycle because of the abort.
So, I am not sure if we need to track user_catalog_tables in
HeapTupleSatisfiesVacuum explicitly.
Otherwise I'm really, really happy with how this is progressing and want to
find time to play with it.
Yeah, I will do some more testing and add a few more test cases in the
test_decoding plugin. It might be handy to have a DELAY of a few
seconds after every change record processing, for example. That way,
we can have a TAP test which can do a few WAL activities and then we
introduce a concurrent rollback midways from another session in the
middle of that delayed processing. I have done debugger based testing
of this concurrent rollback functionality as of now.
Another test (actually, functionality) that might come in handy, is to
have a way for DDL to be actually carried out on the subscriber. We
will need something like pglogical.replicate_ddl_command to be added
to the core for this to work. We can add this functionality as a
follow-on separate patch after discussing how we want to implement
that in core.
Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
On 5 December 2017 at 16:00, Nikhil Sontakke <nikhils@2ndquadrant.com>
wrote:
We disallow rewrites on user_catalog_tables, so they cannot change
underneath. Yes, DML can be carried out on them inside a 2PC
transaction which then gets ROLLBACK'ed. But if it's getting aborted,
then we are not interested in that data anyways. Also, now that we
have the "filter_decode_txn_cb_wrapper()" function, we will stop
decoding by the next change record cycle because of the abort.
So, I am not sure if we need to track user_catalog_tables in
HeapTupleSatisfiesVacuum explicitly.
I guess it's down to whether, when we're decoding a txn that just got
concurrently aborted, the output plugin might do anything with its user
catalogs that could cause a crash.
Output plugins are most likely to be using the genam (or even SPI, I
guess?) to read user-catalogs during logical decoding. Logical decoding
itself does not rely on the correctness of user catalogs in any way, it's
only a concern for output plugin callbacks.
It may make sense to kick this one down the road at this point, I can't
conclusively see where it'd cause an actual problem.
Otherwise I'm really, really happy with how this is progressing and want to
find time to play with it.
Yeah, I will do some more testing and add a few more test cases in the
test_decoding plugin. It might be handy to have a DELAY of a few
seconds after every change record processing, for example. That way,
we can have a TAP test which can do a few WAL activities and then we
introduce a concurrent rollback midways from another session in the
middle of that delayed processing. I have done debugger based testing
of this concurrent rollback functionality as of now.
Sounds good.
Another test (actually, functionality) that might come in handy, is to
have a way for DDL to be actually carried out on the subscriber. We
will need something like pglogical.replicate_ddl_command to be added
to the core for this to work. We can add this functionality as a
follow-on separate patch after discussing how we want to implement
that in core.
Yeah, definitely a different patch, but assuredly valuable.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 12/4/17 10:15, Nikhil Sontakke wrote:
PFA, latest patch for this functionality.
This probably needs documentation updates for the logical decoding chapter.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 12/7/17 08:31, Peter Eisentraut wrote:
On 12/4/17 10:15, Nikhil Sontakke wrote:
PFA, latest patch for this functionality.
This probably needs documentation updates for the logical decoding chapter.
You need the attached patch to be able to compile without warnings.
Also, the regression tests crash randomly for me at
frame #4: 0x000000010a6febdb
postgres`heap_prune_record_prunable(prstate=0x00007ffee5578990, xid=0)
at pruneheap.c:625
622 * This should exactly match the PageSetPrunable macro. We can't store
623 * directly into the page header yet, so we update working state.
624 */
-> 625 Assert(TransactionIdIsNormal(xid));
626 if (!TransactionIdIsValid(prstate->new_prune_xid) ||
627 TransactionIdPrecedes(xid, prstate->new_prune_xid))
628 prstate->new_prune_xid = xid;
Did you build with --enable-cassert?
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
0001-fixup-Original-patch.patch (text/plain)
From 0beddc4b47d160d4fcd9e99d23a90c4273b32c41 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter_e@gmx.net>
Date: Thu, 7 Dec 2017 14:42:04 -0500
Subject: [PATCH] fixup! Original patch
---
contrib/test_decoding/test_decoding.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 362683feef..a709e2ff92 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -74,7 +74,7 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
Size sz, const char *message);
static bool pg_filter_prepare(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
- char *gid);
+ const char *gid);
static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
XLogRecPtr prepare_lsn);
@@ -290,7 +290,7 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
/* Filter out unnecessary two-phase transactions */
static bool
pg_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
- char *gid)
+ const char *gid)
{
TestDecodingData *data = ctx->output_plugin_private;
--
2.15.1
Hi,
Thanks for the warning fix; I will also look at the cassert case soon.
I have been adding more test cases to this patch. I added a TAP test
which now allows us to do a concurrent ROLLBACK PREPARED when the
walsender is in the midst of decoding this very prepared transaction.
I have also added a "decode-delay" parameter to test_decoding, which
makes each apply call sleep for a configurable number of seconds,
allowing us to do the rollback deterministically in parallel. This logic
seems to work ok.
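For reference, the option handling for "decode-delay" could look roughly like the standalone sketch below. This is not the code in the attached patch (which uses pg_atoi and ereport); toy_parse_decode_delay() is a hypothetical helper built on plain strtol so that it compiles outside the backend. It mirrors the behaviour described in the patch: a missing value defaults to 2 seconds and non-positive values are rejected.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse the "decode-delay" option value.
 * A NULL argument selects the default of 2 seconds. Returns true and stores
 * the delay in *seconds on success, false if the value is not a positive
 * integer that fits in an int.
 */
static bool
toy_parse_decode_delay(const char *arg, int *seconds)
{
    char *end;
    long  val;

    assert(seconds != NULL);

    if (arg == NULL)
    {
        *seconds = 2;           /* default delay */
        return true;
    }

    errno = 0;
    val = strtol(arg, &end, 10);
    if (errno != 0 || end == arg || *end != '\0')
        return false;           /* overflow, empty, or trailing garbage */
    if (val <= 0 || val > INT_MAX)
        return false;           /* the delay must be a positive int */

    *seconds = (int) val;
    return true;
}
```

In the real plugin a false result would translate into an ERROR with ERRCODE_INVALID_PARAMETER_VALUE, and the stored value would drive a sleep after each change record.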
However, I am battling an issue with invalidations now. Consider the
below test case:
CREATE TABLE test_prepared1(id integer primary key);
-- test prepared xact containing ddl
BEGIN; INSERT INTO test_prepared1 VALUES (5);
ALTER TABLE test_prepared1 ADD COLUMN data text;
INSERT INTO test_prepared1 VALUES (6, 'frakbar');
PREPARE TRANSACTION 'test_prepared#3';
COMMIT PREPARED 'test_prepared#3';
SELECT data FROM pg_logical_slot_get_changes(..) <-- this shows the
2PC being decoded appropriately
-- make sure stuff still works
INSERT INTO test_prepared1 VALUES (8);
SELECT data FROM pg_logical_slot_get_changes(..)
The last pg_logical_slot_get_changes call shows:
table public.test_prepared1: INSERT: id[integer]:8
whereas since the 2PC committed, it should have shown:
table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
This is an issue because of the way we are handling invalidations. We
don't allow ReorderBufferAddInvalidations() at COMMIT PREPARED time
since we assume that handling them at PREPARE time is enough.
Apparently, it's not enough. I am trying to allow invalidations at
COMMIT PREPARED time as well, but maybe calling
ReorderBufferAddInvalidations() blindly again is not a good idea.
Also, if I do that, then I get some restart_lsn inconsistencies
which cause subsequent pg_logical_slot_get_changes() calls to
re-decode older records. I continue to investigate.
I am attaching the latest WIP patch. This contains the additional TAP
test changes.
Regards,
Nikhils
On 8 December 2017 at 01:15, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
On 12/7/17 08:31, Peter Eisentraut wrote:
On 12/4/17 10:15, Nikhil Sontakke wrote:
PFA, latest patch for this functionality.
This probably needs documentation updates for the logical decoding chapter.
You need the attached patch to be able to compile without warnings.
Also, the regression tests crash randomly for me at
frame #4: 0x000000010a6febdb
postgres`heap_prune_record_prunable(prstate=0x00007ffee5578990, xid=0)
at pruneheap.c:625
622 * This should exactly match the PageSetPrunable macro. We can't store
623 * directly into the page header yet, so we update working state.
624 */
-> 625 Assert(TransactionIdIsNormal(xid));
626 if (!TransactionIdIsValid(prstate->new_prune_xid) ||
627 TransactionIdPrecedes(xid, prstate->new_prune_xid))
628 prstate->new_prune_xid = xid;
Did you build with --enable-cassert?
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Attachments:
2pc_logical_12_12_17_wip.patch (application/octet-stream)
diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 6c18189d9d..79b9622600 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -26,7 +26,7 @@ installcheck:;
# installation, allow to do so, but only if requested explicitly.
installcheck-force: regresscheck-install-force isolationcheck-install-force
-check: regresscheck isolationcheck
+check: regresscheck isolationcheck 2pc-check
submake-regress:
$(MAKE) -C $(top_builddir)/src/test/regress all
@@ -66,3 +66,6 @@ isolationcheck-install-force: all | submake-isolation submake-test_decoding temp
isolationcheck isolationcheck-install-force
temp-install: EXTRA_INSTALL=contrib/test_decoding
+
+2pc-check: temp-install
+ $(prove_check)
diff --git a/contrib/test_decoding/expected/prepared.out b/contrib/test_decoding/expected/prepared.out
index 46e915d4ff..f9676b2e01 100644
--- a/contrib/test_decoding/expected/prepared.out
+++ b/contrib/test_decoding/expected/prepared.out
@@ -6,19 +6,123 @@ SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_d
init
(1 row)
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc', 'test_decoding');
+ ?column?
+----------
+ init
+(1 row)
+
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc_nofilter', 'test_decoding');
+ ?column?
+----------
+ init
+(1 row)
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+-- Reused queries
+\set get_no2pc 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'');'
+\set get_with2pc 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot_2pc'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'', ''twophase-decoding'', ''1'');'
+\set get_with2pc_nofilter 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot_2pc_nofilter'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'', ''twophase-decoding'', ''1'', ''twophase-decode-with-catalog-changes'', ''1'');'
-- test simple successful use of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (1);
PREPARE TRANSACTION 'test_prepared#1';
+:get_no2pc
+ data
+------
+(0 rows)
+
+:get_with2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ PREPARE TRANSACTION 'test_prepared#1'
+(3 rows)
+
+:get_with2pc_nofilter
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ PREPARE TRANSACTION 'test_prepared#1'
+(3 rows)
+
COMMIT PREPARED 'test_prepared#1';
+:get_no2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ COMMIT
+(3 rows)
+
+:get_with2pc
+ data
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
+:get_with2pc_nofilter
+ data
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
INSERT INTO test_prepared1 VALUES (2);
-- test abort of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (3);
PREPARE TRANSACTION 'test_prepared#2';
+:get_no2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+(3 rows)
+
+:get_with2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(6 rows)
+
+:get_with2pc_nofilter
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(6 rows)
+
ROLLBACK PREPARED 'test_prepared#2';
+:get_no2pc
+ data
+------
+(0 rows)
+
+:get_with2pc
+ data
+----------------------------------
+ ABORT PREPARED 'test_prepared#2'
+(1 row)
+
+:get_with2pc_nofilter
+ data
+----------------------------------
+ ABORT PREPARED 'test_prepared#2'
+(1 row)
+
INSERT INTO test_prepared1 VALUES (4);
-- test prepared xact containing ddl
BEGIN;
@@ -26,45 +130,226 @@ INSERT INTO test_prepared1 VALUES (5);
ALTER TABLE test_prepared1 ADD COLUMN data text;
INSERT INTO test_prepared1 VALUES (6, 'frakbar');
PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
-INSERT INTO test_prepared2 VALUES (7);
-COMMIT PREPARED 'test_prepared#3';
--- make sure stuff still works
-INSERT INTO test_prepared1 VALUES (8);
-INSERT INTO test_prepared2 VALUES (9);
--- cleanup
-DROP TABLE test_prepared1;
-DROP TABLE test_prepared2;
--- show results
-SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'test_prepared1'::regclass;
+ relation | locktype | mode
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
+:get_no2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4
+ COMMIT
+(3 rows)
+
+:get_with2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4
+ COMMIT
+(3 rows)
+
+:get_with2pc_nofilter
data
-------------------------------------------------------------------------
BEGIN
- table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:4
COMMIT
BEGIN
- table public.test_prepared1: INSERT: id[integer]:2
+ table public.test_prepared1: INSERT: id[integer]:5
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(7 rows)
+
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists. Our 2pc filter callback will skip decoding of xacts
+-- with catalog changes at PREPARE time, so we don't decode it now.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
+INSERT INTO test_prepared2 VALUES (7);
+:get_no2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:7
COMMIT
+(3 rows)
+
+:get_with2pc
+ data
+----------------------------------------------------
BEGIN
- table public.test_prepared1: INSERT: id[integer]:4
+ table public.test_prepared2: INSERT: id[integer]:7
COMMIT
+(3 rows)
+
+:get_with2pc_nofilter
+ data
+----------------------------------------------------
BEGIN
table public.test_prepared2: INSERT: id[integer]:7
COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+:get_no2pc
+ data
+-------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:5
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
+ COMMIT
+(4 rows)
+
+:get_with2pc
+ data
+-------------------------------------------------------------------------
BEGIN
table public.test_prepared1: INSERT: id[integer]:5
table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
COMMIT
+(4 rows)
+
+:get_with2pc_nofilter
+ data
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#3'
+(1 row)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (8);
+INSERT INTO test_prepared2 VALUES (9);
+:get_no2pc
+ data
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:9
+ COMMIT
+(6 rows)
+
+:get_with2pc
+ data
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:9
+ COMMIT
+(6 rows)
+
+:get_with2pc_nofilter
+ data
+--------------------------------------------------------------------
BEGIN
table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
COMMIT
BEGIN
table public.test_prepared2: INSERT: id[integer]:9
COMMIT
-(22 rows)
+(6 rows)
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'pg_class'::regclass;
+ relation | locktype | mode
+----------+----------+------
+(0 rows)
+
+-- Shouldn't see anything with 2pc decoding off
+:get_no2pc
+ data
+-----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:12
+ COMMIT
+(3 rows)
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+:get_with2pc
+ data
+-----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:12
+ PREPARE TRANSACTION 'test_prepared_lock2'
+ COMMIT PREPARED 'test_prepared_lock2'
+(4 rows)
+
+:get_with2pc_nofilter
+ data
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:12
+ PREPARE TRANSACTION 'test_prepared_lock2'
+ COMMIT PREPARED 'test_prepared_lock2'
+(8 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- Both will work normally after we commit
+:get_no2pc
+ data
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ COMMIT
+(4 rows)
+
+:get_with2pc
+ data
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ COMMIT
+(4 rows)
+
+:get_with2pc_nofilter
+ data
+--------------------------------------
+ COMMIT PREPARED 'test_prepared_lock'
+(1 row)
+
+-- cleanup
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+------
+(0 rows)
SELECT pg_drop_replication_slot('regression_slot');
pg_drop_replication_slot
@@ -72,3 +357,9 @@ SELECT pg_drop_replication_slot('regression_slot');
(1 row)
+SELECT pg_drop_replication_slot('regression_slot_2pc');
+ pg_drop_replication_slot
+--------------------------
+
+(1 row)
+
diff --git a/contrib/test_decoding/sql/prepared.sql b/contrib/test_decoding/sql/prepared.sql
index e72639767e..4197766c50 100644
--- a/contrib/test_decoding/sql/prepared.sql
+++ b/contrib/test_decoding/sql/prepared.sql
@@ -1,22 +1,41 @@
-- predictability
SET synchronous_commit = on;
SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc', 'test_decoding');
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc_nofilter', 'test_decoding');
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+
+-- Reused queries
+\set get_no2pc 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'');'
+\set get_with2pc 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot_2pc'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'', ''twophase-decoding'', ''1'');'
+\set get_with2pc_nofilter 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot_2pc_nofilter'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'', ''twophase-decoding'', ''1'', ''twophase-decode-with-catalog-changes'', ''1'');'
-- test simple successful use of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (1);
PREPARE TRANSACTION 'test_prepared#1';
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
COMMIT PREPARED 'test_prepared#1';
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
INSERT INTO test_prepared1 VALUES (2);
-- test abort of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (3);
PREPARE TRANSACTION 'test_prepared#2';
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
ROLLBACK PREPARED 'test_prepared#2';
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
INSERT INTO test_prepared1 VALUES (4);
@@ -27,18 +46,74 @@ ALTER TABLE test_prepared1 ADD COLUMN data text;
INSERT INTO test_prepared1 VALUES (6, 'frakbar');
PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'test_prepared1'::regclass;
+
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists. Our 2pc filter callback will skip decoding of xacts
+-- with catalog changes at PREPARE time, so we don't decode it now.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
INSERT INTO test_prepared2 VALUES (7);
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
COMMIT PREPARED 'test_prepared#3';
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
-- make sure stuff still works
INSERT INTO test_prepared1 VALUES (8);
INSERT INTO test_prepared2 VALUES (9);
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'pg_class'::regclass;
+
+-- Shouldn't see anything with 2pc decoding off
+:get_no2pc
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+:get_with2pc
+:get_with2pc_nofilter
+RESET statement_timeout;
+
+COMMIT PREPARED 'test_prepared_lock';
+
+-- Both will work normally after we commit
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
-- cleanup
DROP TABLE test_prepared1;
@@ -48,3 +123,5 @@ DROP TABLE test_prepared2;
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
SELECT pg_drop_replication_slot('regression_slot');
+SELECT pg_drop_replication_slot('regression_slot_2pc');
+SELECT pg_drop_replication_slot('regression_slot_2pc_nofilter');
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100755
index 0000000000..36814bcbb0
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,66 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 1;
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+ 'postgresql.conf', qq(
+ max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+ "INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+ "SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# This test is specifically for testing concurrent abort while logical decode is
+# ongoing. The decode-delay value will allow for each change decode to sleep for
+# those many seconds. We will fire off a ROLLBACK from another session when this
+# delayed decode is ongoing. That will stop decoding immediately and the next
+# pg_logical_slot_get_changes call should show only a few records decoded from
+# the entire two phase transaction
+
+# consume all changes so far
+#$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'twophase-decoding', '1', 'twophase-decode-with-catalog-changes', '1');");
+
+$node_logical->safe_psql('postgres', "
+ BEGIN;
+ INSERT INTO tab VALUES (11);
+ INSERT INTO tab VALUES (12);
+ PREPARE TRANSACTION 'test_prepared_tab';");
+
+# start decoding the above with decode-delay in the background.
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should only decode 1 INSERT record and should include
+# an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'twophase-decoding', '1', 'twophase-decode-with-catalog-changes', '1', 'decode-delay', '3');\" \&");
+
+# sleep for a little while (shorter than decode-delay)
+$node_logical->safe_psql('postgres', "select pg_sleep(1)");
+
+# rollback the prepared transaction whose first record is being decoded
+# after sleeping for decode-delay time
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# wait for decoding to stop
+$node_logical->psql('postgres', "select pg_sleep(4)");
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'twophase-decoding', '1', 'twophase-decode-with-catalog-changes', '1');");
+
+# check for occurrence of log about stopping decoding
+my $output_file = slurp_file($node_logical->logfile());
+my $abort_str = "stopping decoding of test_prepared_tab ";
+like($output_file, qr/$abort_str/, "ABORT found in server log");
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 135b3b7638..e1c7bb37c9 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -24,6 +24,8 @@
#include "replication/message.h"
#include "replication/origin.h"
+#include "storage/procarray.h"
+
#include "utils/builtins.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
@@ -46,6 +48,9 @@ typedef struct
bool skip_empty_xacts;
bool xact_wrote_changes;
bool only_local;
+ bool twophase_decoding;
+ bool twophase_decode_with_catalog_changes;
+ int decode_delay; /* seconds to sleep after every change record */
} TestDecodingData;
static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -59,6 +64,8 @@ static void pg_output_begin(LogicalDecodingContext *ctx,
bool last_write);
static void pg_decode_commit_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pg_decode_abort_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr abort_lsn);
static void pg_decode_change(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, Relation rel,
ReorderBufferChange *change);
@@ -68,6 +75,20 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr message_lsn,
bool transactional, const char *prefix,
Size sz, const char *message);
+static bool pg_filter_prepare(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ const char *gid);
+static bool pg_filter_decode_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
void
_PG_init(void)
@@ -85,9 +106,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
cb->begin_cb = pg_decode_begin_txn;
cb->change_cb = pg_decode_change;
cb->commit_cb = pg_decode_commit_txn;
+ cb->abort_cb = pg_decode_abort_txn;
cb->filter_by_origin_cb = pg_decode_filter;
cb->shutdown_cb = pg_decode_shutdown;
cb->message_cb = pg_decode_message;
+ cb->filter_prepare_cb = pg_filter_prepare;
+ cb->filter_decode_txn_cb = pg_filter_decode_txn;
+ cb->prepare_cb = pg_decode_prepare_txn;
+ cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+ cb->abort_prepared_cb = pg_decode_abort_prepared_txn;
}
@@ -107,6 +134,9 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
data->include_timestamp = false;
data->skip_empty_xacts = false;
data->only_local = false;
+ data->twophase_decoding = false;
+ data->twophase_decode_with_catalog_changes = false;
+ data->decode_delay = 0;
ctx->output_plugin_private = data;
@@ -176,6 +206,42 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
errmsg("could not parse value \"%s\" for parameter \"%s\"",
strVal(elem->arg), elem->defname)));
}
+ else if (strcmp(elem->defname, "twophase-decoding") == 0)
+ {
+
+ if (elem->arg == NULL)
+ data->twophase_decoding = true;
+ else if (!parse_bool(strVal(elem->arg), &data->twophase_decoding))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("could not parse value \"%s\" for parameter \"%s\"",
+ strVal(elem->arg), elem->defname)));
+ }
+ else if (strcmp(elem->defname, "twophase-decode-with-catalog-changes") == 0)
+ {
+ if (elem->arg == NULL)
+ data->twophase_decode_with_catalog_changes = true;
+ else if (!parse_bool(strVal(elem->arg), &data->twophase_decode_with_catalog_changes))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("could not parse value \"%s\" for parameter \"%s\"",
+ strVal(elem->arg), elem->defname)));
+ }
+ else if (strcmp(elem->defname, "decode-delay") == 0)
+ {
+ if (elem->arg == NULL)
+ data->decode_delay = 2; /* default to 2 seconds */
+ else
+ data->decode_delay = pg_atoi(strVal(elem->arg),
+ sizeof(int), 0);
+
+ if (data->decode_delay <= 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("Specify positive value for parameter \"%s\","
+ " you specified \"%s\"",
+ elem->defname, strVal(elem->arg))));
+ }
else
{
ereport(ERROR,
@@ -244,6 +310,153 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
OutputPluginWrite(ctx, true);
}
+/* ABORT callback */
+static void
+pg_decode_abort_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+ if (data->include_xids)
+ appendStringInfo(ctx->out, "ABORT %u", txn->xid);
+ else
+ appendStringInfoString(ctx->out, "ABORT");
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* Filter out unnecessary two-phase transactions */
+static bool
+pg_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ const char *gid)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ /* treat all transaction as one-phase */
+ if (!data->twophase_decoding)
+ return true;
+
+ if (txn && txn->has_catalog_changes &&
+ !data->twophase_decode_with_catalog_changes)
+ return true;
+
+ return false;
+}
+
+/*
+ * Check if we should continue to decode this transaction.
+ *
+ * If it has aborted in the meanwhile, then there's no sense
+ * in decoding and sending the rest of the changes, we might
+ * as well ask the subscribers to abort immediately.
+ *
+ * This should be called if we are streaming a transaction
+ * before it's committed or if we are decoding a 2PC
+ * transaction. Otherwise we always decode committed
+ * transactions
+ *
+ * Additional checks can be added here, as needed
+ */
+static bool
+pg_filter_decode_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn)
+{
+ /*
+ * Due to caching, repeated TransactionIdDidAbort calls
+ * shouldn't be that expensive
+ */
+ if (txn != NULL &&
+ TransactionIdIsValid(txn->xid) &&
+ TransactionIdDidAbort(txn->xid))
+ return true;
+
+ /* if txn is NULL, filter it out :-) */
+ return (txn == NULL);
+}
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (!data->twophase_decoding)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* ABORT PREPARED callback */
+static void
+pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (!data->twophase_decoding)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
static bool
pg_decode_filter(LogicalDecodingContext *ctx,
RepOriginId origin_id)
@@ -412,6 +625,10 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
class_form = RelationGetForm(relation);
tupdesc = RelationGetDescr(relation);
+ /* if decode_delay is specified, sleep for that many seconds */
+ if (data->decode_delay > 0)
+ pg_usleep(data->decode_delay * 1000000L);
+
/* Avoid leaking memory by using and resetting our own context */
old = MemoryContextSwitchTo(data->context);
diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 3aafa79e52..1a4487d404 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -102,6 +102,14 @@ ParseCommitRecord(uint8 info, xl_xact_commit *xlrec, xl_xact_parsed_commit *pars
parsed->twophase_xid = xl_twophase->xid;
data += sizeof(xl_xact_twophase);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_GID)
+ {
+ int gidlen;
+ strcpy(parsed->twophase_gid, data);
+ gidlen = strlen(parsed->twophase_gid) + 1;
+ data += MAXALIGN(gidlen);
+ }
}
if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
@@ -139,6 +147,16 @@ ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_abort *parsed)
data += sizeof(xl_xact_xinfo);
}
+ if (parsed->xinfo & XACT_XINFO_HAS_DBINFO)
+ {
+ xl_xact_dbinfo *xl_dbinfo = (xl_xact_dbinfo *) data;
+
+ parsed->dbId = xl_dbinfo->dbId;
+ parsed->tsId = xl_dbinfo->tsId;
+
+ data += sizeof(xl_xact_dbinfo);
+ }
+
if (parsed->xinfo & XACT_XINFO_HAS_SUBXACTS)
{
xl_xact_subxacts *xl_subxacts = (xl_xact_subxacts *) data;
@@ -168,6 +186,27 @@ ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_abort *parsed)
parsed->twophase_xid = xl_twophase->xid;
data += sizeof(xl_xact_twophase);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_GID)
+ {
+ int gidlen;
+ strcpy(parsed->twophase_gid, data);
+ gidlen = strlen(parsed->twophase_gid) + 1;
+ data += MAXALIGN(gidlen);
+ }
+ }
+
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ xl_xact_origin xl_origin;
+
+ /* we're only guaranteed 4 byte alignment, so copy onto stack */
+ memcpy(&xl_origin, data, sizeof(xl_origin));
+
+ parsed->origin_lsn = xl_origin.origin_lsn;
+ parsed->origin_timestamp = xl_origin.origin_timestamp;
+
+ data += sizeof(xl_xact_origin);
}
}
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index b715152e8d..792408c94d 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -144,11 +144,7 @@ int max_prepared_xacts = 0;
*
* typedef struct GlobalTransactionData *GlobalTransaction appears in
* twophase.h
- *
- * Note that the max value of GIDSIZE must fit in the uint16 gidlen,
- * specified in TwoPhaseFileHeader.
*/
-#define GIDSIZE 200
typedef struct GlobalTransactionData
{
@@ -211,12 +207,14 @@ static void RecordTransactionCommitPrepared(TransactionId xid,
RelFileNode *rels,
int ninvalmsgs,
SharedInvalidationMessage *invalmsgs,
- bool initfileinval);
+ bool initfileinval,
+ const char *gid);
static void RecordTransactionAbortPrepared(TransactionId xid,
int nchildren,
TransactionId *children,
int nrels,
- RelFileNode *rels);
+ RelFileNode *rels,
+ const char *gid);
static void ProcessRecords(char *bufptr, TransactionId xid,
const TwoPhaseCallback callbacks[]);
static void RemoveGXact(GlobalTransaction gxact);
@@ -556,7 +554,7 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held)
* Locate the prepared transaction and mark it busy for COMMIT or PREPARE.
*/
static GlobalTransaction
-LockGXact(const char *gid, Oid user)
+LockGXact(const char *gid, Oid user, bool missing_ok)
{
int i;
@@ -616,7 +614,8 @@ LockGXact(const char *gid, Oid user)
LWLockRelease(TwoPhaseStateLock);
- ereport(ERROR,
+ if (!missing_ok)
+ ereport(ERROR,
(errcode(ERRCODE_UNDEFINED_OBJECT),
errmsg("prepared transaction with identifier \"%s\" does not exist",
gid)));
@@ -898,7 +897,7 @@ TwoPhaseGetDummyProc(TransactionId xid)
/*
* Header for a 2PC state file
*/
-#define TWOPHASE_MAGIC 0x57F94533 /* format identifier */
+#define TWOPHASE_MAGIC 0x57F94534 /* format identifier */
typedef struct TwoPhaseFileHeader
{
@@ -914,6 +913,8 @@ typedef struct TwoPhaseFileHeader
int32 ninvalmsgs; /* number of cache invalidation messages */
bool initfileinval; /* does relcache init file need invalidation? */
uint16 gidlen; /* length of the GID - GID follows the header */
+ XLogRecPtr origin_lsn; /* lsn of this record at origin node */
+ TimestampTz origin_timestamp; /* time of prepare at origin node */
} TwoPhaseFileHeader;
/*
@@ -1065,6 +1066,7 @@ EndPrepare(GlobalTransaction gxact)
{
TwoPhaseFileHeader *hdr;
StateFileChunk *record;
+ bool replorigin;
/* Add the end sentinel to the list of 2PC records */
RegisterTwoPhaseRecord(TWOPHASE_RM_END_ID, 0,
@@ -1075,6 +1077,21 @@ EndPrepare(GlobalTransaction gxact)
Assert(hdr->magic == TWOPHASE_MAGIC);
hdr->total_len = records.total_len + sizeof(pg_crc32c);
+ replorigin = (replorigin_session_origin != InvalidRepOriginId &&
+ replorigin_session_origin != DoNotReplicateId);
+
+ if (replorigin)
+ {
+ Assert(replorigin_session_origin_lsn != InvalidXLogRecPtr);
+ hdr->origin_lsn = replorigin_session_origin_lsn;
+ hdr->origin_timestamp = replorigin_session_origin_timestamp;
+ }
+ else
+ {
+ hdr->origin_lsn = InvalidXLogRecPtr;
+ hdr->origin_timestamp = 0;
+ }
+
/*
* If the data size exceeds MaxAllocSize, we won't be able to read it in
* ReadTwoPhaseFile. Check for that now, rather than fail in the case
@@ -1107,7 +1124,16 @@ EndPrepare(GlobalTransaction gxact)
XLogBeginInsert();
for (record = records.head; record != NULL; record = record->next)
XLogRegisterData(record->data, record->len);
+
+ XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
+
gxact->prepare_end_lsn = XLogInsert(RM_XACT_ID, XLOG_XACT_PREPARE);
+
+ if (replorigin)
+ /* Move LSNs forward for this replication origin */
+ replorigin_session_advance(replorigin_session_origin_lsn,
+ gxact->prepare_end_lsn);
+
XLogFlush(gxact->prepare_end_lsn);
/* If we crash now, we have prepared: WAL replay will fix things */
@@ -1283,6 +1309,43 @@ ReadTwoPhaseFile(TransactionId xid, bool give_warnings)
return buf;
}
+/*
+ * ParsePrepareRecord
+ */
+void
+ParsePrepareRecord(uint8 info, char *xlrec, xl_xact_parsed_prepare *parsed)
+{
+ TwoPhaseFileHeader *hdr;
+ char *bufptr;
+
+ hdr = (TwoPhaseFileHeader *) xlrec;
+ bufptr = xlrec + MAXALIGN(sizeof(TwoPhaseFileHeader));
+
+ parsed->origin_lsn = hdr->origin_lsn;
+ parsed->origin_timestamp = hdr->origin_timestamp;
+ parsed->twophase_xid = hdr->xid;
+ parsed->dbId = hdr->database;
+ parsed->nsubxacts = hdr->nsubxacts;
+ parsed->ncommitrels = hdr->ncommitrels;
+ parsed->nabortrels = hdr->nabortrels;
+ parsed->nmsgs = hdr->ninvalmsgs;
+
+ strncpy(parsed->twophase_gid, bufptr, hdr->gidlen);
+ bufptr += MAXALIGN(hdr->gidlen);
+
+ parsed->subxacts = (TransactionId *) bufptr;
+ bufptr += MAXALIGN(hdr->nsubxacts * sizeof(TransactionId));
+
+ parsed->commitrels = (RelFileNode *) bufptr;
+ bufptr += MAXALIGN(hdr->ncommitrels * sizeof(RelFileNode));
+
+ parsed->abortrels = (RelFileNode *) bufptr;
+ bufptr += MAXALIGN(hdr->nabortrels * sizeof(RelFileNode));
+
+ parsed->msgs = (SharedInvalidationMessage *) bufptr;
+ bufptr += MAXALIGN(hdr->ninvalmsgs * sizeof(SharedInvalidationMessage));
+}
+
/*
* Reads 2PC data from xlog. During checkpoint this data will be moved to
@@ -1365,7 +1428,7 @@ StandbyTransactionIdIsPrepared(TransactionId xid)
* FinishPreparedTransaction: execute COMMIT PREPARED or ROLLBACK PREPARED
*/
void
-FinishPreparedTransaction(const char *gid, bool isCommit)
+FinishPreparedTransaction(const char *gid, bool isCommit, bool missing_ok)
{
GlobalTransaction gxact;
PGPROC *proc;
@@ -1386,8 +1449,20 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
/*
* Validate the GID, and lock the GXACT to ensure that two backends do not
* try to commit the same GID at once.
+ *
+ * During logical decoding, on the apply side, a prepared transaction may
+ * be aborted while it is still being decoded. In that case we stop
+ * decoding and abort the transaction immediately. However, the ROLLBACK
+ * PREPARED still reaches the subscriber, so in that case it is OK for
+ * the GID to be missing.
*/
- gxact = LockGXact(gid, GetUserId());
+ gxact = LockGXact(gid, GetUserId(), missing_ok);
+ if (gxact == NULL)
+ {
+ Assert(!isCommit && missing_ok);
+ return;
+ }
+
proc = &ProcGlobal->allProcs[gxact->pgprocno];
pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
xid = pgxact->xid;
@@ -1435,11 +1510,12 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
hdr->nsubxacts, children,
hdr->ncommitrels, commitrels,
hdr->ninvalmsgs, invalmsgs,
- hdr->initfileinval);
+ hdr->initfileinval, gid);
else
RecordTransactionAbortPrepared(xid,
hdr->nsubxacts, children,
- hdr->nabortrels, abortrels);
+ hdr->nabortrels, abortrels,
+ gid);
ProcArrayRemove(proc, latestXid);
@@ -2165,7 +2241,8 @@ RecordTransactionCommitPrepared(TransactionId xid,
RelFileNode *rels,
int ninvalmsgs,
SharedInvalidationMessage *invalmsgs,
- bool initfileinval)
+ bool initfileinval,
+ const char *gid)
{
XLogRecPtr recptr;
TimestampTz committs = GetCurrentTimestamp();
@@ -2193,7 +2270,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
ninvalmsgs, invalmsgs,
initfileinval, false,
MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
- xid);
+ xid, gid);
if (replorigin)
@@ -2255,7 +2332,8 @@ RecordTransactionAbortPrepared(TransactionId xid,
int nchildren,
TransactionId *children,
int nrels,
- RelFileNode *rels)
+ RelFileNode *rels,
+ const char *gid)
{
XLogRecPtr recptr;
@@ -2278,7 +2356,7 @@ RecordTransactionAbortPrepared(TransactionId xid,
nchildren, children,
nrels, rels,
MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
- xid);
+ xid, gid);
/* Always flush, since we're about to remove the 2PC state file */
XLogFlush(recptr);
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index c06fabca10..d751267b51 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1226,7 +1226,7 @@ RecordTransactionCommit(void)
nmsgs, invalMessages,
RelcacheInitFileInval, forceSyncCommit,
MyXactFlags,
- InvalidTransactionId /* plain commit */ );
+ InvalidTransactionId, NULL /* plain commit */ );
if (replorigin)
/* Move LSNs forward for this replication origin */
@@ -1578,7 +1578,8 @@ RecordTransactionAbort(bool isSubXact)
XactLogAbortRecord(xact_time,
nchildren, children,
nrels, rels,
- MyXactFlags, InvalidTransactionId);
+ MyXactFlags, InvalidTransactionId,
+ NULL);
/*
* Report the latest async abort LSN, so that the WAL writer knows to
@@ -5243,7 +5244,6 @@ xactGetCommittedChildren(TransactionId **ptr)
* XLOG support routines
*/
-
/*
* Log the commit record for a plain or twophase transaction commit.
*
@@ -5256,7 +5256,8 @@ XactLogCommitRecord(TimestampTz commit_time,
int nrels, RelFileNode *rels,
int nmsgs, SharedInvalidationMessage *msgs,
bool relcacheInval, bool forceSync,
- int xactflags, TransactionId twophase_xid)
+ int xactflags, TransactionId twophase_xid,
+ const char *twophase_gid)
{
xl_xact_commit xlrec;
xl_xact_xinfo xl_xinfo;
@@ -5268,6 +5269,7 @@ XactLogCommitRecord(TimestampTz commit_time,
xl_xact_origin xl_origin;
uint8 info;
+ int gidlen = 0;
Assert(CritSectionCount > 0);
@@ -5330,6 +5332,13 @@ XactLogCommitRecord(TimestampTz commit_time,
{
xl_xinfo.xinfo |= XACT_XINFO_HAS_TWOPHASE;
xl_twophase.xid = twophase_xid;
+ Assert(twophase_gid != NULL);
+
+ if (XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_GID;
+ gidlen = strlen(twophase_gid) + 1; /* include '\0' */
+ }
}
/* dump transaction origin information */
@@ -5380,8 +5389,19 @@ XactLogCommitRecord(TimestampTz commit_time,
}
if (xl_xinfo.xinfo & XACT_XINFO_HAS_TWOPHASE)
+ {
XLogRegisterData((char *) (&xl_twophase), sizeof(xl_xact_twophase));
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_GID)
+ {
+ static const char zeroes[MAXIMUM_ALIGNOF] = { 0 };
+ XLogRegisterData((char*) twophase_gid, gidlen);
+ if (MAXALIGN(gidlen) != gidlen)
+ XLogRegisterData((char*) zeroes, MAXALIGN(gidlen) - gidlen);
+ }
+ }
+
+
if (xl_xinfo.xinfo & XACT_XINFO_HAS_ORIGIN)
XLogRegisterData((char *) (&xl_origin), sizeof(xl_xact_origin));
@@ -5401,15 +5421,19 @@ XLogRecPtr
XactLogAbortRecord(TimestampTz abort_time,
int nsubxacts, TransactionId *subxacts,
int nrels, RelFileNode *rels,
- int xactflags, TransactionId twophase_xid)
+ int xactflags, TransactionId twophase_xid,
+ const char *twophase_gid)
{
xl_xact_abort xlrec;
xl_xact_xinfo xl_xinfo;
xl_xact_subxacts xl_subxacts;
xl_xact_relfilenodes xl_relfilenodes;
xl_xact_twophase xl_twophase;
+ xl_xact_dbinfo xl_dbinfo;
+ xl_xact_origin xl_origin;
uint8 info;
+ int gidlen = 0;
Assert(CritSectionCount > 0);
@@ -5445,6 +5469,31 @@ XactLogAbortRecord(TimestampTz abort_time,
{
xl_xinfo.xinfo |= XACT_XINFO_HAS_TWOPHASE;
xl_twophase.xid = twophase_xid;
+ Assert(twophase_gid != NULL);
+
+ if (XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_GID;
+ gidlen = strlen(twophase_gid) + 1; /* include '\0' */
+ }
+ }
+
+ if (TransactionIdIsValid(twophase_xid) && XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_DBINFO;
+ xl_dbinfo.dbId = MyDatabaseId;
+ xl_dbinfo.tsId = MyDatabaseTableSpace;
+ }
+
+ /* dump transaction origin information only for abort prepared */
+ if ((replorigin_session_origin != InvalidRepOriginId) &&
+ TransactionIdIsValid(twophase_xid) &&
+ XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_ORIGIN;
+
+ xl_origin.origin_lsn = replorigin_session_origin_lsn;
+ xl_origin.origin_timestamp = replorigin_session_origin_timestamp;
}
if (xl_xinfo.xinfo != 0)
@@ -5459,6 +5508,9 @@ XactLogAbortRecord(TimestampTz abort_time,
if (xl_xinfo.xinfo != 0)
XLogRegisterData((char *) (&xl_xinfo), sizeof(xl_xinfo));
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_DBINFO)
+ XLogRegisterData((char *) (&xl_dbinfo), sizeof(xl_dbinfo));
+
if (xl_xinfo.xinfo & XACT_XINFO_HAS_SUBXACTS)
{
XLogRegisterData((char *) (&xl_subxacts),
@@ -5476,7 +5528,23 @@ XactLogAbortRecord(TimestampTz abort_time,
}
if (xl_xinfo.xinfo & XACT_XINFO_HAS_TWOPHASE)
+ {
XLogRegisterData((char *) (&xl_twophase), sizeof(xl_xact_twophase));
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_GID)
+ {
+ static const char zeroes[MAXIMUM_ALIGNOF] = { 0 };
+ XLogRegisterData((char*) twophase_gid, gidlen);
+ if (MAXALIGN(gidlen) != gidlen)
+ XLogRegisterData((char*) zeroes, MAXALIGN(gidlen) - gidlen);
+ }
+ }
+
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_ORIGIN)
+ XLogRegisterData((char *) (&xl_origin), sizeof(xl_xact_origin));
+
+ if (TransactionIdIsValid(twophase_xid))
+ XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
+
return XLogInsert(RM_XACT_ID, info);
}
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 486fd0c988..c9231f4973 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -34,6 +34,7 @@
#include "access/xlogutils.h"
#include "access/xlogreader.h"
#include "access/xlogrecord.h"
+#include "access/twophase.h"
#include "catalog/pg_control.h"
@@ -72,6 +73,8 @@ static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_commit *parsed, TransactionId xid);
static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare *parsed);
/* common function to decode tuples */
static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -277,16 +280,33 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
break;
}
case XLOG_XACT_PREPARE:
+ {
+ xl_xact_parsed_prepare parsed;
- /*
- * Currently decoding ignores PREPARE TRANSACTION and will just
- * decode the transaction when the COMMIT PREPARED is sent or
- * throw away the transaction's contents when a ROLLBACK PREPARED
- * is received. In the future we could add code to expose prepared
- * transactions in the changestream allowing for a kind of
- * distributed 2PC.
- */
- ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+ /* check that output plugin is capable of twophase decoding */
+ if (!ctx->enable_twophase)
+ {
+ ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+ break;
+ }
+
+ /* ok, parse it */
+ ParsePrepareRecord(XLogRecGetInfo(buf->record),
+ XLogRecGetData(buf->record), &parsed);
+
+ /* does output plugin want this particular transaction? */
+ if (ctx->callbacks.filter_prepare_cb &&
+ ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+ parsed.twophase_gid))
+ {
+ ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+ buf->origptr);
+ break;
+ }
+
+ DecodePrepare(ctx, buf, &parsed);
+ break;
+ }
break;
default:
elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
@@ -551,8 +571,14 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
* Process invalidation messages, even if we're not interested in the
* transaction's contents, since the various caches need to always be
* consistent.
+ *
+ * Also, if that transaction was sent to the prepare callback, then both
+ * of these functions have already been called during prepare.
*/
- if (parsed->nmsgs > 0)
+ if (parsed->nmsgs > 0 &&
+ !(TransactionIdIsValid(parsed->twophase_xid) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder, xid,
+ parsed->twophase_gid)))
{
ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
parsed->nmsgs, parsed->msgs);
@@ -607,9 +633,69 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
buf->origptr, buf->endptr);
}
+ if (TransactionIdIsValid(parsed->twophase_xid) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
+ {
+ /* we are processing COMMIT PREPARED */
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn, parsed->twophase_gid, true);
+ }
+ else
+ {
+ /* replay actions of all transaction + subtransactions in order */
+ ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn);
+ }
+}
+
+/*
+ * Decode a PREPARE record. The logic is similar to that of DecodeCommit.
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare *parsed)
+{
+ XLogRecPtr origin_lsn = parsed->origin_lsn;
+ TimestampTz commit_time = parsed->origin_timestamp;
+ XLogRecPtr origin_id = XLogRecGetOrigin(buf->record);
+ int i;
+ TransactionId xid = parsed->twophase_xid;
+
+ /*
+ * Process invalidation messages, even if we're not interested in the
+ * transaction's contents, since the various caches need to always be
+ * consistent.
+ */
+ if (parsed->nmsgs > 0)
+ {
+ ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
+ parsed->nmsgs, parsed->msgs);
+ ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+ }
+
+ if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+ (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+ FilterByOrigin(ctx, origin_id))
+ {
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr);
+ }
+ ReorderBufferForget(ctx->reorder, xid, buf->origptr);
+
+ return;
+ }
+
+ /* tell the reorderbuffer about the surviving subtransactions */
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+ buf->origptr, buf->endptr);
+ }
+
/* replay actions of all transaction + subtransactions in order */
- ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
- commit_time, origin_id, origin_lsn);
+ ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn, parsed->twophase_gid);
}
/*
@@ -621,6 +707,30 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid)
{
int i;
+ XLogRecPtr origin_lsn = InvalidXLogRecPtr;
+ TimestampTz commit_time = 0;
+ XLogRecPtr origin_id = XLogRecGetOrigin(buf->record);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ origin_lsn = parsed->origin_lsn;
+ commit_time = parsed->origin_timestamp;
+ }
+
+ /*
+ * If it's ROLLBACK PREPARED then handle it via callbacks.
+ */
+ if (TransactionIdIsValid(xid) &&
+ !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+ parsed->dbId == ctx->slot->data.database &&
+ !FilterByOrigin(ctx, origin_id) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
+ {
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid, false);
+ return;
+ }
for (i = 0; i < parsed->nsubxacts; i++)
{
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index bca585fc27..0925f17e29 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -60,6 +60,18 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+static void abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+static bool filter_decode_txn_cb_wrapper(ReorderBuffer *cache,
+ ReorderBufferTXN *txn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ const char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change);
static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -124,6 +136,7 @@ StartupDecodingContext(List *output_plugin_options,
MemoryContext context,
old_context;
LogicalDecodingContext *ctx;
+ int twophase_callbacks;
/* shorter lines... */
slot = MyReplicationSlot;
@@ -182,8 +195,27 @@ StartupDecodingContext(List *output_plugin_options,
ctx->reorder->begin = begin_cb_wrapper;
ctx->reorder->apply_change = change_cb_wrapper;
ctx->reorder->commit = commit_cb_wrapper;
+ ctx->reorder->abort = abort_cb_wrapper;
+ ctx->reorder->filter_decode_txn = filter_decode_txn_cb_wrapper;
+ ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+ ctx->reorder->prepare = prepare_cb_wrapper;
+ ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+ ctx->reorder->abort_prepared = abort_prepared_cb_wrapper;
ctx->reorder->message = message_cb_wrapper;
+ /* check that plugin implements all callbacks necessary to perform 2PC */
+ twophase_callbacks = (ctx->callbacks.prepare_cb != NULL) +
+ (ctx->callbacks.commit_prepared_cb != NULL) +
+ (ctx->callbacks.abort_prepared_cb != NULL);
+
+ ctx->enable_twophase = (twophase_callbacks == 3);
+
+ if (twophase_callbacks != 3 && twophase_callbacks != 0)
+ ereport(WARNING,
+ (errmsg("Output plugin registered only %d twophase callbacks. "
+ "Twophase transactions will be decoded at commit time.",
+ twophase_callbacks)));
+
ctx->out = makeStringInfo();
ctx->prepare_write = prepare_write;
ctx->write = do_write;
@@ -680,6 +712,122 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
}
static void
+abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "abort";
+ state.report_location = txn->final_lsn; /* beginning of abort record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.abort_cb(ctx, txn, abort_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "prepare";
+ state.report_location = txn->final_lsn; /* beginning of prepare record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "commit_prepared";
+ state.report_location = txn->final_lsn; /* beginning of commit record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "abort_prepared";
+ state.report_location = txn->final_lsn; /* beginning of abort record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.abort_prepared_cb(ctx, txn, abort_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change)
{
@@ -714,6 +862,61 @@ change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
error_context_stack = errcallback.previous;
}
+static bool
+filter_decode_txn_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+ bool ret;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "filter_decode_txn";
+ state.report_location = InvalidXLogRecPtr;
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = false;
+
+ /* do the actual work: call callback */
+ ret = ctx->callbacks.filter_decode_txn_cb(ctx, txn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+ return ret;
+}
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn, const char *gid)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+ bool ret;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "filter_prepare";
+ state.report_location = InvalidXLogRecPtr;
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = false;
+
+ /* do the actual work: call callback */
+ ret = ctx->callbacks.filter_prepare_cb(ctx, txn, gid);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+ return ret;
+}
+
bool
filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
{
@@ -992,4 +1195,5 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
MyReplicationSlot->data.confirmed_flush = lsn;
SpinLockRelease(&MyReplicationSlot->mutex);
}
+ elog(NOTICE, "restart_lsn : " UINT64_FORMAT, MyReplicationSlot->data.restart_lsn);
}
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index a3ba2b1266..3e77d8160d 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -283,6 +283,7 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
/* invalidate non-timetravel entries */
InvalidateSystemCaches();
+ elog(NOTICE, "restart_lsn : %X/%X",
+ (uint32) (MyReplicationSlot->data.restart_lsn >> 32),
+ (uint32) MyReplicationSlot->data.restart_lsn);
/* Decode until we run out of records */
while ((startptr != InvalidXLogRecPtr && startptr < end_of_wal) ||
(ctx->reader->EndRecPtr != InvalidXLogRecPtr && ctx->reader->EndRecPtr < end_of_wal))
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 9b126b2957..77b9f58ae2 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -72,10 +72,11 @@ void
logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
XLogRecPtr commit_lsn)
{
- uint8 flags = 0;
+ uint8 flags = 0;
pq_sendbyte(out, 'C'); /* sending COMMIT */
+ flags |= LOGICALREP_IS_COMMIT;
/* send the flags field (unused for now) */
pq_sendbyte(out, flags);
@@ -86,21 +87,106 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
}
/*
- * Read transaction COMMIT from the stream.
+ * Write ABORT to the output stream.
+ */
+void
+logicalrep_write_abort(StringInfo out, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ uint8 flags = 0;
+
+ pq_sendbyte(out, 'C'); /* ABORT reuses the 'C' message; see flags below */
+
+ flags |= LOGICALREP_IS_ABORT;
+ /* send the flags field */
+ pq_sendbyte(out, flags);
+
+ /* send fields */
+ pq_sendint64(out, abort_lsn);
+ pq_sendint64(out, txn->end_lsn);
+ pq_sendint64(out, txn->commit_time);
+}
+
+/*
+ * Read transaction COMMIT|ABORT from the stream.
*/
void
-logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
+logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data,
+ uint8 *flags)
{
- /* read flags (unused for now) */
- uint8 flags = pq_getmsgbyte(in);
+ /* read flags */
+ uint8 commit_flags = pq_getmsgbyte(in);
- if (flags != 0)
- elog(ERROR, "unrecognized flags %u in commit message", flags);
+ if (!(commit_flags & LOGICALREP_COMMIT_MASK))
+ elog(ERROR, "unrecognized flags %u in commit|abort message",
+ commit_flags);
/* read fields */
commit_data->commit_lsn = pq_getmsgint64(in);
commit_data->end_lsn = pq_getmsgint64(in);
commit_data->committime = pq_getmsgint64(in);
+
+ /* set gid to empty */
+ commit_data->gid[0] = '\0';
+
+ *flags = commit_flags;
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ uint8 flags = 0;
+
+ pq_sendbyte(out, 'P'); /* sending PREPARE protocol */
+
+ if (txn->txn_flags & TXN_COMMIT_PREPARED)
+ flags |= LOGICALREP_IS_COMMIT_PREPARED;
+ else if (txn->txn_flags & TXN_ROLLBACK_PREPARED)
+ flags |= LOGICALREP_IS_ROLLBACK_PREPARED;
+ else if (txn->txn_flags & TXN_PREPARE)
+ flags |= LOGICALREP_IS_PREPARE;
+
+ if (flags == 0)
+ elog(ERROR, "unrecognized txn_flags %d in [commit|rollback] prepare message",
+ txn->txn_flags);
+
+ /* send the flags field */
+ pq_sendbyte(out, flags);
+
+ /* send fields */
+ pq_sendint64(out, prepare_lsn);
+ pq_sendint64(out, txn->end_lsn);
+ pq_sendint64(out, txn->commit_time);
+
+ /* send gid */
+ pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepCommitData *commit_data, uint8 *flags)
+{
+ /* read flags */
+ uint8 prep_flags = pq_getmsgbyte(in);
+
+ if (!(prep_flags & LOGICALREP_PREPARE_MASK))
+ elog(ERROR, "unrecognized flags %u in prepare message", prep_flags);
+
+ /* read fields */
+ commit_data->commit_lsn = pq_getmsgint64(in);
+ commit_data->end_lsn = pq_getmsgint64(in);
+ commit_data->committime = pq_getmsgint64(in);
+
+ /* read gid */
+ strlcpy(commit_data->gid, pq_getmsgstring(in), sizeof(commit_data->gid));
+
+ /* set flags */
+ *flags = prep_flags;
}
/*
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index dc0ad5b0e7..4c9cf0257d 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1261,25 +1261,18 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
* the top and subtransactions (using a k-way merge) and replay the changes in
* lsn order.
*/
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+ ReorderBuffer *rb, TransactionId xid,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
TimestampTz commit_time,
RepOriginId origin_id, XLogRecPtr origin_lsn)
{
- ReorderBufferTXN *txn;
volatile Snapshot snapshot_now;
volatile CommandId command_id = FirstCommandId;
bool using_subtxn;
ReorderBufferIterTXNState *volatile iterstate = NULL;
- txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
- false);
-
- /* unknown transaction, nothing to replay */
- if (txn == NULL)
- return;
-
txn->final_lsn = commit_lsn;
txn->end_lsn = end_lsn;
txn->commit_time = commit_time;
@@ -1323,6 +1316,18 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
{
ReorderBufferChange *change;
ReorderBufferChange *specinsert = NULL;
+ bool change_cleanup = false;
+ bool check_txn_status;
+ bool is_prepared = txn_prepared(txn);
+
+ /*
+ * Check once whether the xid has already committed. If it
+ * has not, we must consult the decode_txn filter callback
+ * before each change to find out whether it is still OK to
+ * continue decoding this xid.
+ */
+ check_txn_status = !TransactionIdDidCommit(txn->xid);
if (using_subtxn)
BeginInternalSubTransaction("replay");
@@ -1337,6 +1342,22 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
Relation relation = NULL;
Oid reloid;
+ /*
+ * While decoding 2PC or while streaming uncommitted
+ * transactions, check whether this transaction still
+ * needs to be decoded. If the transaction got aborted
+ * or if we were instructed to stop decoding, then
+ * bail out early.
+ */
+ if (check_txn_status && rb->filter_decode_txn(rb, txn))
+ {
+ elog(LOG, "stopping decoding of %s (%u)",
+ is_prepared ? txn->gid : "",
+ txn->xid);
+ change_cleanup = true;
+ goto change_cleanuptxn;
+ }
+
switch (change->action)
{
case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1543,6 +1564,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
}
}
+change_cleanuptxn:
/*
* There's a speculative insertion remaining, just clean in up, it
* can't have been successful, otherwise we'd gotten a confirmation
@@ -1558,8 +1580,19 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
ReorderBufferIterTXNFinish(rb, iterstate);
iterstate = NULL;
- /* call commit callback */
- rb->commit(rb, txn, commit_lsn);
+ if (change_cleanup)
+ {
+ /* call abort */
+ rb->abort(rb, txn, commit_lsn);
+ }
+ else
+ {
+ /* call commit or prepare callback */
+ if (txn_prepared(txn))
+ rb->prepare(rb, txn, commit_lsn);
+ else
+ rb->commit(rb, txn, commit_lsn);
+ }
/* this is just a sanity check against bad output plugin behaviour */
if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -1586,8 +1619,35 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
if (snapshot_now->copied)
ReorderBufferFreeSnap(rb, snapshot_now);
- /* remove potential on-disk data, and deallocate */
- ReorderBufferCleanupTXN(rb, txn);
+ /*
+ * remove potential on-disk data, and deallocate.
+ *
+ * We don't remove it for prepared transactions
+ * though. It might make sense to look at exactly
+ * what fields need to stay around for the COMMIT
+ * PREPARED and clean up the rest, as an
+ * optimization.
+ */
+ if (!txn_prepared(txn))
+ ReorderBufferCleanupTXN(rb, txn);
+ /*else
+ {
+ ReorderBufferTXN *new_txn;
+ char gid[GIDSIZE];
+ bool has_catalog_changes = txn->has_catalog_changes;
+ XLogRecPtr first_lsn = txn->first_lsn;
+ *
+ * clean it up but re-add a dummy txn for lookups by
+ * COMMIT|ROLLBACK PREPARED in the future
+ *
+ strcpy(gid, txn->gid);
+ ReorderBufferCleanupTXN(rb, txn);
+ new_txn = ReorderBufferTXNByXid(rb, xid, true, NULL, first_lsn,
+ true);
+ new_txn->has_catalog_changes = has_catalog_changes;
+ new_txn->txn_flags |= TXN_PREPARE;
+ strcpy(new_txn->gid, gid);
+ }*/
}
PG_CATCH();
{
@@ -1621,6 +1681,147 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
}
/*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+ return rb->filter_prepare(rb, txn, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferCommitInternal()
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /* unknown transaction, nothing to replay */
+ if (txn == NULL)
+ return;
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a twophase transaction. It calls ReorderBufferCommitInternal()
+ * since all prepared transactions need to be decoded at PREPARE time.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /* unknown transaction, nothing to replay */
+ if (txn == NULL)
+ return;
+
+ txn->txn_flags |= TXN_PREPARE;
+ strlcpy(txn->gid, gid, sizeof(txn->gid));
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Check whether this transaction was sent as prepared to subscribers.
+ * Called while handling commit|abort prepared.
+ */
+bool
+ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+ const char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /*
+ * If txn == NULL then employ the callback to see if this txn
+ * was sent at PREPARE time. The callback should return the same
+ * answer for a given GID every time we call it.
+ */
+ if (txn == NULL)
+ return !(rb->filter_prepare(rb, NULL, gid));
+ else
+ return txn_prepared(txn);
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ABORT PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit)
+{
+ ReorderBufferTXN *txn;
+
+ /*
+ * The transaction may or may not exist (during restarts, for
+ * example). In any case, 2PC transactions do not contain any
+ * reorderbuffers, so allow it to be created below.
+ */
+ txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+ true);
+
+ txn->final_lsn = commit_lsn;
+ txn->end_lsn = end_lsn;
+ txn->commit_time = commit_time;
+ txn->origin_id = origin_id;
+ txn->origin_lsn = origin_lsn;
+ strlcpy(txn->gid, gid, sizeof(txn->gid));
+
+ /* build data to be able to lookup the CommandIds of catalog tuples */
+ ReorderBufferBuildTupleCidHash(rb, txn);
+
+ /* setup the initial snapshot */
+ SetupHistoricSnapshot(txn->base_snapshot, txn->tuplecid_hash);
+
+ if (is_commit)
+ {
+ txn->txn_flags |= TXN_COMMIT_PREPARED;
+ rb->commit_prepared(rb, txn, commit_lsn);
+ }
+ else
+ {
+ txn->txn_flags |= TXN_ROLLBACK_PREPARED;
+ rb->abort_prepared(rb, txn, commit_lsn);
+ }
+
+ /* cleanup */
+ TeardownHistoricSnapshot(false);
+ /* make sure there's no cache pollution */
+ ReorderBufferExecuteInvalidations(rb, txn);
+ ReorderBufferCleanupTXN(rb, txn);
+}
+
+/*
* Abort a transaction that possibly has previous changes. Needs to be first
* called for subtransactions and then for the toplevel xid.
*
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index fa5d9bb120..71ff19ded5 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -452,8 +452,9 @@ static void
apply_handle_commit(StringInfo s)
{
LogicalRepCommitData commit_data;
+ uint8 flags = 0;
- logicalrep_read_commit(s, &commit_data);
+ logicalrep_read_commit(s, &commit_data, &flags);
Assert(commit_data.commit_lsn == remote_final_lsn);
@@ -467,7 +468,11 @@ apply_handle_commit(StringInfo s)
replorigin_session_origin_lsn = commit_data.end_lsn;
replorigin_session_origin_timestamp = commit_data.committime;
- CommitTransactionCommand();
+ if (flags & LOGICALREP_IS_COMMIT)
+ CommitTransactionCommand();
+ else if (flags & LOGICALREP_IS_ABORT)
+ AbortCurrentTransaction();
+
pgstat_report_stat(false);
store_flush_position(commit_data.end_lsn);
@@ -487,6 +492,121 @@ apply_handle_commit(StringInfo s)
pgstat_report_activity(STATE_IDLE, NULL);
}
+static void
+apply_handle_prepare_txn(LogicalRepCommitData *commit_data)
+{
+ Assert(commit_data->commit_lsn == remote_final_lsn);
+ /* The synchronization worker runs in a single transaction. */
+ if (IsTransactionState() && !am_tablesync_worker())
+ {
+ /* End the earlier transaction and start a new one */
+ BeginTransactionBlock();
+ CommitTransactionCommand();
+ StartTransactionCommand();
+ /*
+ * Update origin state so we can restart streaming from correct
+ * position in case of crash.
+ */
+ replorigin_session_origin_lsn = commit_data->end_lsn;
+ replorigin_session_origin_timestamp = commit_data->committime;
+
+ PrepareTransactionBlock(commit_data->gid);
+ CommitTransactionCommand();
+ pgstat_report_stat(false);
+
+ store_flush_position(commit_data->end_lsn);
+ }
+ else
+ {
+ /* Process any invalidation messages that might have accumulated. */
+ AcceptInvalidationMessages();
+ /* TODO: what to do here for prepared transactions?? */
+ Assert(false);
+ }
+
+ in_remote_transaction = false;
+
+ /* Process any tables that are being synchronized in parallel. */
+ process_syncing_tables(commit_data->end_lsn);
+
+ pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+static void
+apply_handle_commit_prepared_txn(LogicalRepCommitData *commit_data)
+{
+ /* there is no transaction when COMMIT PREPARED is called */
+ ensure_transaction();
+
+ /*
+ * Update origin state so we can restart streaming from correct
+ * position in case of crash.
+ */
+ replorigin_session_origin_lsn = commit_data->end_lsn;
+ replorigin_session_origin_timestamp = commit_data->committime;
+
+ FinishPreparedTransaction(commit_data->gid, true, false);
+ CommitTransactionCommand();
+ pgstat_report_stat(false);
+
+ store_flush_position(commit_data->end_lsn);
+ in_remote_transaction = false;
+
+ /* Process any tables that are being synchronized in parallel. */
+ process_syncing_tables(commit_data->end_lsn);
+
+ pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+static void
+apply_handle_rollback_prepared_txn(LogicalRepCommitData *commit_data)
+{
+ /* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+ ensure_transaction();
+
+ /*
+ * Update origin state so we can restart streaming from correct
+ * position in case of crash.
+ */
+ replorigin_session_origin_lsn = commit_data->end_lsn;
+ replorigin_session_origin_timestamp = commit_data->committime;
+
+ FinishPreparedTransaction(commit_data->gid, false, true);
+ CommitTransactionCommand();
+ pgstat_report_stat(false);
+
+ store_flush_position(commit_data->end_lsn);
+ in_remote_transaction = false;
+
+ /* Process any tables that are being synchronized in parallel. */
+ process_syncing_tables(commit_data->end_lsn);
+
+ pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+ LogicalRepCommitData commit_data;
+ uint8 flags = 0;
+
+ logicalrep_read_prepare(s, &commit_data, &flags);
+
+ if (flags & LOGICALREP_IS_PREPARE)
+ apply_handle_prepare_txn(&commit_data);
+ else if (flags & LOGICALREP_IS_COMMIT_PREPARED)
+ apply_handle_commit_prepared_txn(&commit_data);
+ else if (flags & LOGICALREP_IS_ROLLBACK_PREPARED)
+ apply_handle_rollback_prepared_txn(&commit_data);
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg("wrong [commit|rollback] prepare message")));
+}
+
/*
* Handle ORIGIN message.
*
@@ -884,10 +1004,14 @@ apply_dispatch(StringInfo s)
case 'B':
apply_handle_begin(s);
break;
- /* COMMIT */
+ /* COMMIT|ABORT */
case 'C':
apply_handle_commit(s);
break;
+ /* [COMMIT|ROLLBACK] PREPARE */
+ case 'P':
+ apply_handle_prepare(s);
+ break;
/* INSERT */
case 'I':
apply_handle_insert(s);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index c3126545b4..4bbad5b21d 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -37,11 +37,23 @@ static void pgoutput_begin_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn);
static void pgoutput_commit_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_abort_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr abort_lsn);
static void pgoutput_change(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, Relation rel,
ReorderBufferChange *change);
static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
RepOriginId origin_id);
+static bool pgoutput_filter_prepare(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, const char *gid);
+static bool pgoutput_decode_txn_filter(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_abort_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
static bool publications_valid;
@@ -79,7 +91,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
cb->begin_cb = pgoutput_begin_txn;
cb->change_cb = pgoutput_change;
cb->commit_cb = pgoutput_commit_txn;
+ cb->abort_cb = pgoutput_abort_txn;
+
+ cb->filter_prepare_cb = pgoutput_filter_prepare;
+ cb->prepare_cb = pgoutput_prepare_txn;
+ cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+ cb->abort_prepared_cb = pgoutput_abort_prepared_txn;
+
cb->filter_by_origin_cb = pgoutput_origin_filter;
+ cb->filter_decode_txn_cb = pgoutput_decode_txn_filter;
cb->shutdown_cb = pgoutput_shutdown;
}
@@ -254,6 +274,61 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
}
/*
+ * ABORT callback
+ */
+static void
+pgoutput_abort_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ OutputPluginUpdateProgress(ctx);
+
+ OutputPluginPrepareWrite(ctx, true);
+ logicalrep_write_abort(ctx->out, txn, abort_lsn);
+ OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ OutputPluginUpdateProgress(ctx);
+
+ OutputPluginPrepareWrite(ctx, true);
+ logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+ OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ OutputPluginUpdateProgress(ctx);
+
+ OutputPluginPrepareWrite(ctx, true);
+ logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+ OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ OutputPluginUpdateProgress(ctx);
+
+ OutputPluginPrepareWrite(ctx, true);
+ logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+ OutputPluginWrite(ctx, true);
+}
+
+/*
* Sends the decoded DML over wire.
*/
static void
@@ -364,6 +439,18 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
}
/*
+ * Filter out unnecessary two-phase transactions.
+ *
+ * Currently, we forward all two-phase transactions, so nothing is filtered.
+ */
+static bool
+pgoutput_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ const char *gid)
+{
+ return false;
+}
+
+/*
* Currently we always forward.
*/
static bool
@@ -374,6 +461,37 @@ pgoutput_origin_filter(LogicalDecodingContext *ctx,
}
/*
+ * Check if we should continue to decode this transaction.
+ *
+ * If it has aborted in the meantime, then there's no sense
+ * in decoding and sending the rest of the changes; we might
+ * as well ask the subscribers to abort immediately.
+ *
+ * This should be called if we are streaming a transaction
+ * before it's committed or if we are decoding a 2PC
+ * transaction. Otherwise we always decode committed
+ * transactions.
+ *
+ * Additional checks can be added here, as needed
+ */
+static bool
+pgoutput_decode_txn_filter(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn)
+{
+ /*
+ * Due to caching, repeated TransactionIdDidAbort calls
+ * shouldn't be that expensive
+ */
+ if (txn != NULL &&
+ TransactionIdIsValid(txn->xid) &&
+ TransactionIdDidAbort(txn->xid))
+ return true;
+
+ /* if txn is NULL, filter it out */
+ return txn == NULL;
+}
+
+/*
* Shutdown the output plugin.
*
* Note, we don't need to clean the data->context as it's child context
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 82a707af7b..792cd9c868 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -454,13 +454,13 @@ standard_ProcessUtility(PlannedStmt *pstmt,
case TRANS_STMT_COMMIT_PREPARED:
PreventTransactionChain(isTopLevel, "COMMIT PREPARED");
PreventCommandDuringRecovery("COMMIT PREPARED");
- FinishPreparedTransaction(stmt->gid, true);
+ FinishPreparedTransaction(stmt->gid, true, false);
break;
case TRANS_STMT_ROLLBACK_PREPARED:
PreventTransactionChain(isTopLevel, "ROLLBACK PREPARED");
PreventCommandDuringRecovery("ROLLBACK PREPARED");
- FinishPreparedTransaction(stmt->gid, false);
+ FinishPreparedTransaction(stmt->gid, false, false);
break;
case TRANS_STMT_ROLLBACK:
diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c
index a821e2eed1..0db65bca1b 100644
--- a/src/backend/utils/time/tqual.c
+++ b/src/backend/utils/time/tqual.c
@@ -1252,6 +1252,19 @@ HeapTupleSatisfiesVacuum(HeapTuple htup, TransactionId OldestXmin,
*/
SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
InvalidTransactionId);
+
+ /*
+ * Transaction aborted, but perhaps it was recent enough
+ * that some open transactions could still see the tuple.
+ * We restrict the scope of this check to activities on
+ * catalog tables only, because logical decoding could be
+ * peeking into such tuples for a short while.
+ */
+ if ((htup->t_tableOid < (Oid) FirstNormalObjectId) &&
+ !TransactionIdPrecedes(HeapTupleHeaderGetRawXmin(tuple),
+ OldestXmin))
+ return HEAPTUPLE_RECENTLY_DEAD;
+
return HEAPTUPLE_DEAD;
}
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 54dec4eeaf..c552d38367 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -15,6 +15,7 @@
#define TWOPHASE_H
#include "access/xlogdefs.h"
+#include "access/xact.h"
#include "datatype/timestamp.h"
#include "storage/lock.h"
@@ -46,12 +47,15 @@ extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
int *nxids_p);
+extern void ParsePrepareRecord(uint8 info, char *xlrec,
+ xl_xact_parsed_prepare *parsed);
extern void StandbyRecoverPreparedTransactions(void);
extern void RecoverPreparedTransactions(void);
extern void CheckPointTwoPhase(XLogRecPtr redo_horizon);
-extern void FinishPreparedTransaction(const char *gid, bool isCommit);
+extern void FinishPreparedTransaction(const char *gid, bool isCommit,
+ bool missing_ok);
extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
XLogRecPtr end_lsn);
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 118b0a8432..118156ed78 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -21,6 +21,13 @@
#include "storage/sinval.h"
#include "utils/datetime.h"
+/*
+ * Maximum size of Global Transaction ID (including '\0').
+ *
+ * Note that the max value of GIDSIZE must fit in the uint16 gidlen,
+ * specified in TwoPhaseFileHeader.
+ */
+#define GIDSIZE 200
/*
* Xact isolation levels
@@ -156,6 +163,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
#define XACT_XINFO_HAS_TWOPHASE (1U << 4)
#define XACT_XINFO_HAS_ORIGIN (1U << 5)
#define XACT_XINFO_HAS_AE_LOCKS (1U << 6)
+#define XACT_XINFO_HAS_GID (1U << 7)
/*
* Also stored in xinfo, these indicating a variety of additional actions that
@@ -302,13 +310,40 @@ typedef struct xl_xact_parsed_commit
SharedInvalidationMessage *msgs;
TransactionId twophase_xid; /* only for 2PC */
+ char twophase_gid[GIDSIZE];
XLogRecPtr origin_lsn;
TimestampTz origin_timestamp;
} xl_xact_parsed_commit;
+typedef struct xl_xact_parsed_prepare
+{
+ Oid dbId; /* MyDatabaseId */
+
+ int nsubxacts;
+ TransactionId *subxacts;
+
+ int ncommitrels;
+ RelFileNode *commitrels;
+
+ int nabortrels;
+ RelFileNode *abortrels;
+
+ int nmsgs;
+ SharedInvalidationMessage *msgs;
+
+ TransactionId twophase_xid;
+ char twophase_gid[GIDSIZE];
+
+ XLogRecPtr origin_lsn;
+ TimestampTz origin_timestamp;
+} xl_xact_parsed_prepare;
+
typedef struct xl_xact_parsed_abort
{
+ Oid dbId;
+ Oid tsId;
+
TimestampTz xact_time;
uint32 xinfo;
@@ -319,6 +354,10 @@ typedef struct xl_xact_parsed_abort
RelFileNode *xnodes;
TransactionId twophase_xid; /* only for 2PC */
+ char twophase_gid[GIDSIZE];
+
+ XLogRecPtr origin_lsn;
+ TimestampTz origin_timestamp;
} xl_xact_parsed_abort;
@@ -386,12 +425,13 @@ extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
int nmsgs, SharedInvalidationMessage *msgs,
bool relcacheInval, bool forceSync,
int xactflags,
- TransactionId twophase_xid);
+ TransactionId twophase_xid, const char *twophase_gid);
extern XLogRecPtr XactLogAbortRecord(TimestampTz abort_time,
int nsubxacts, TransactionId *subxacts,
int nrels, RelFileNode *rels,
- int xactflags, TransactionId twophase_xid);
+ int xactflags, TransactionId twophase_xid,
+ const char *twophase_gid);
extern void xact_redo(XLogReaderState *record);
/* xactdesc.c */
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 7f0e0fa881..758de40db9 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -82,6 +82,11 @@ typedef struct LogicalDecodingContext
bool prepared_write;
XLogRecPtr write_location;
TransactionId write_xid;
+
+ /*
+ * Capabilities of the output plugin.
+ */
+ bool enable_twophase;
} LogicalDecodingContext;
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index a9736e1bf6..7f51f75b97 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -69,11 +69,20 @@ typedef struct LogicalRepBeginData
TransactionId xid;
} LogicalRepBeginData;
+#define LOGICALREP_IS_COMMIT 0x01
+#define LOGICALREP_IS_ABORT 0x02
+#define LOGICALREP_IS_PREPARE 0x04
+#define LOGICALREP_IS_COMMIT_PREPARED 0x08
+#define LOGICALREP_IS_ROLLBACK_PREPARED 0x10
+#define LOGICALREP_COMMIT_MASK (LOGICALREP_IS_COMMIT | LOGICALREP_IS_ABORT)
+#define LOGICALREP_PREPARE_MASK (LOGICALREP_IS_PREPARE | LOGICALREP_IS_COMMIT_PREPARED | LOGICALREP_IS_ROLLBACK_PREPARED)
typedef struct LogicalRepCommitData
{
+ uint8 flag;
XLogRecPtr commit_lsn;
XLogRecPtr end_lsn;
TimestampTz committime;
+ char gid[GIDSIZE];
} LogicalRepCommitData;
extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
@@ -81,8 +90,14 @@ extern void logicalrep_read_begin(StringInfo in,
LogicalRepBeginData *begin_data);
extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+extern void logicalrep_write_abort(StringInfo out, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
extern void logicalrep_read_commit(StringInfo in,
- LogicalRepCommitData *commit_data);
+ LogicalRepCommitData *commit_data, uint8 *flags);
+extern void logicalrep_read_prepare(StringInfo in,
+ LogicalRepCommitData *commit_data, uint8 *flags);
extern void logicalrep_write_origin(StringInfo out, const char *origin,
XLogRecPtr origin_lsn);
extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
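
A note on the flags byte introduced above: the readers in proto.c accept a message when at least one bit of the relevant mask is set. The standalone sketch below is not part of the patch. It copies the flag values from the logicalproto.h hunk and compares that check with a stricter one that requires exactly one known bit. The helper names flags_accepted() and flags_strictly_valid() are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag values copied from the logicalproto.h hunk of this patch. */
#define LOGICALREP_IS_COMMIT            0x01
#define LOGICALREP_IS_ABORT             0x02
#define LOGICALREP_IS_PREPARE           0x04
#define LOGICALREP_IS_COMMIT_PREPARED   0x08
#define LOGICALREP_IS_ROLLBACK_PREPARED 0x10
#define LOGICALREP_COMMIT_MASK \
	(LOGICALREP_IS_COMMIT | LOGICALREP_IS_ABORT)
#define LOGICALREP_PREPARE_MASK \
	(LOGICALREP_IS_PREPARE | LOGICALREP_IS_COMMIT_PREPARED | \
	 LOGICALREP_IS_ROLLBACK_PREPARED)

/*
 * Hypothetical helper mirroring the check done by the readers in this
 * patch: the byte is accepted if any bit of the mask for the given
 * message type ('C' or 'P') is set.
 */
static bool
flags_accepted(char msgtype, uint8_t flags)
{
	assert(msgtype == 'C' || msgtype == 'P');

	if (msgtype == 'C')
		return (flags & LOGICALREP_COMMIT_MASK) != 0;
	return (flags & LOGICALREP_PREPARE_MASK) != 0;
}

/*
 * Hypothetical stricter variant (an assumption, not what the patch does):
 * exactly one bit must be set, and it must belong to the mask.
 */
static bool
flags_strictly_valid(char msgtype, uint8_t flags)
{
	uint8_t		mask;

	assert(msgtype == 'C' || msgtype == 'P');
	mask = (msgtype == 'C') ? LOGICALREP_COMMIT_MASK : LOGICALREP_PREPARE_MASK;

	return flags != 0 &&
		(flags & ~mask) == 0 &&			/* no bits outside the mask */
		(flags & (flags - 1)) == 0;		/* only a single bit set */
}
```

With the mask-only check, a 'C' message carrying 0x05 (COMMIT plus a stray PREPARE bit) would pass. Whether the stricter form is wanted depends on whether the flags byte is meant to carry additional bits later.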
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 26ff024882..c92ace3838 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -68,6 +68,45 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
XLogRecPtr commit_lsn);
/*
+ * Called for an implicit ABORT of a transaction.
+ */
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+/*
+ * Called before decoding of a PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to the prepare
+ * and commit_prepared/abort_prepared callbacks, or should wait until
+ * COMMIT PREPARED and be sent as a regular transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ const char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+/*
* Called for the generic logical decoding messages.
*/
typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
@@ -85,6 +124,12 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
RepOriginId origin_id);
/*
+ * Filter to check if we should continue to decode this transaction
+ */
+typedef bool (*LogicalDecodeFilterDecodeTxnCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn);
+
+/*
* Called to shutdown an output plugin.
*/
typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
@@ -98,8 +143,14 @@ typedef struct OutputPluginCallbacks
LogicalDecodeBeginCB begin_cb;
LogicalDecodeChangeCB change_cb;
LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeAbortCB abort_cb;
LogicalDecodeMessageCB message_cb;
+ LogicalDecodeFilterPrepareCB filter_prepare_cb;
+ LogicalDecodePrepareCB prepare_cb;
+ LogicalDecodeCommitPreparedCB commit_prepared_cb;
+ LogicalDecodeAbortPreparedCB abort_prepared_cb;
LogicalDecodeFilterByOriginCB filter_by_origin_cb;
+ LogicalDecodeFilterDecodeTxnCB filter_decode_txn_cb;
LogicalDecodeShutdownCB shutdown_cb;
} OutputPluginCallbacks;
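
A note on filter_prepare_cb: ReorderBufferTxnIsPrepared() may call the filter again at COMMIT/ROLLBACK PREPARED time with txn == NULL, so the decision has to depend on the GID alone and be the same on every call. The standalone sketch below is only an illustration of such a policy. The function name example_filter_prepare(), the "nodecode_" prefix and the reduced signature (the real callback also receives ctx and txn) are assumptions, not part of the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical filter policy, reduced to the GID argument.
 *
 * Returns true when the PREPARE should be skipped, so that the
 * transaction is decoded later as a regular commit; returns false
 * when it should be decoded at PREPARE time.
 *
 * The answer depends only on the GID, so repeated calls for the
 * same GID always agree.
 */
static bool
example_filter_prepare(const char *gid)
{
	/* assumed naming convention, purely for illustration */
	static const char prefix[] = "nodecode_";

	/* no GID to inspect: do not filter (an arbitrary choice here) */
	if (gid == NULL)
		return false;

	return strncmp(gid, prefix, sizeof(prefix) - 1) == 0;
}
```

A filter that consults mutable state (a GUC changed between PREPARE and COMMIT PREPARED, for instance) could give different answers for the same GID and leave the subscriber waiting for a prepared transaction it never received.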
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index b18ce5a9df..933f13a174 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
#define REORDERBUFFER_H
#include "access/htup_details.h"
+#include "access/twophase.h"
#include "lib/ilist.h"
#include "storage/sinval.h"
#include "utils/hsearch.h"
@@ -137,13 +138,28 @@ typedef struct ReorderBufferChange
dlist_node node;
} ReorderBufferChange;
+
+/* TODO: convert existing bools into flags later */
+/* values for txn_flags */
+#define TXN_HAS_CATALOG_CHANGES 0x0001
+#define TXN_IS_SUBXACT 0x0002
+#define TXN_PREPARE 0x0004
+#define TXN_COMMIT_PREPARED 0x0008
+#define TXN_ROLLBACK_PREPARED 0x0010
+#define txn_prepared(txn) (((txn)->txn_flags & TXN_PREPARE) != 0)
+
typedef struct ReorderBufferTXN
{
+ int txn_flags;
+
/*
* The transactions transaction id, can be a toplevel or sub xid.
*/
TransactionId xid;
+ /* In case of 2PC we need to pass GID to output plugin */
+ char gid[GIDSIZE];
+
/* did the TX have catalog changes */
bool has_catalog_changes;
@@ -292,6 +308,39 @@ typedef void (*ReorderBufferCommitCB) (
ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+/* abort callback signature */
+typedef void (*ReorderBufferAbortCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+typedef bool (*ReorderBufferFilterDecodeTxnCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn);
+
+typedef bool (*ReorderBufferFilterPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/* abort prepared callback signature */
+typedef void (*ReorderBufferAbortPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
/* message callback signature */
typedef void (*ReorderBufferMessageCB) (
ReorderBuffer *rb,
@@ -327,6 +376,12 @@ struct ReorderBuffer
ReorderBufferBeginCB begin;
ReorderBufferApplyChangeCB apply_change;
ReorderBufferCommitCB commit;
+ ReorderBufferAbortCB abort;
+ ReorderBufferFilterDecodeTxnCB filter_decode_txn;
+ ReorderBufferFilterPrepareCB filter_prepare;
+ ReorderBufferPrepareCB prepare;
+ ReorderBufferCommitPreparedCB commit_prepared;
+ ReorderBufferAbortPreparedCB abort_prepared;
ReorderBufferMessageCB message;
/*
@@ -369,6 +424,11 @@ void ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot
void ReorderBufferCommit(ReorderBuffer *, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit);
void ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
void ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
@@ -392,6 +452,15 @@ void ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
bool ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
bool ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
+bool ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+ const char *gid);
+bool ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+ const char *gid);
+void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid);
ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
void ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
diff --git a/src/test/recovery/t/009_twophase.pl b/src/test/recovery/t/009_twophase.pl
old mode 100644
new mode 100755
On 12 December 2017 at 12:04, Nikhil Sontakke <nikhils@2ndquadrant.com> wrote:
This is an issue because of the way we are handling invalidations. We
don't allow ReorderBufferAddInvalidations() at COMMIT PREPARED time
since we assume that handling them at PREPARE time is enough.
Apparently, it's not enough.
Not sure what that means.
I think we would need to fire invalidations at COMMIT PREPARED, yet
logically decode them at PREPARE.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
I think we would need to fire invalidations at COMMIT PREPARED, yet
logically decode them at PREPARE.
Yes, we need invalidations to logically decode at PREPARE and then we need
invalidations to be executed at COMMIT PREPARED time as well.
DecodeCommit() needs to know, when it's processing a COMMIT PREPARED,
whether this transaction was decoded at PREPARE time. The main issue is
that we cannot expect the ReorderBufferTXN structure which was created
at PREPARE time to be around until the COMMIT PREPARED gets called. The
earlier patch did not clean up this structure at PREPARE and added an
is_prepared flag to it so that COMMIT PREPARED knew that it had been
decoded at PREPARE time. This structure may well no longer be around
if, for example, you restart between PREPARE and COMMIT PREPARED.
So now the onus is on the prepare filter callback to always tell us
whether a given transaction was decoded at PREPARE time or not.
We now hand over the ReorderBufferTXN structure (it can be NULL), xid
and gid, and the prepare filter tells us what to do. Always. The
is_prepared flag can be cached in the txn structure to aid in
re-lookups, but if it's not set, the filter could do xid lookup, gid
inspection and other shenanigans to give us the same answer on every
invocation.
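To make the "same answer on every invocation" requirement concrete, here is
a minimal standalone sketch. It is not the code from the attached patch:
SketchTxn, SketchOptions, the gid_prefix option and sketch_filter_prepare()
are all hypothetical stand-ins invented for illustration. It assumes the
decision can be derived purely from plugin options and the gid, so that it
comes out the same whether or not a txn structure is still around.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins; these are NOT the real PostgreSQL definitions. */
typedef uint32_t SketchXid;

typedef struct SketchTxn
{
	SketchXid	xid;
	bool		answer_cached;		/* has the filter result been cached? */
	bool		skip_at_prepare;	/* cached filter result */
} SketchTxn;

typedef struct SketchOptions
{
	bool		twophase_decoding;	/* decode at PREPARE at all? */
	const char *gid_prefix;			/* assumed rule: only these GIDs are 2PC-decoded */
} SketchOptions;

/*
 * Returns true if the transaction should be skipped at PREPARE time (and
 * decoded as an ordinary transaction at commit), false if it should be
 * decoded at PREPARE.
 *
 * txn may be NULL, e.g. after a restart between PREPARE and COMMIT
 * PREPARED. The decision uses only the options and the gid, so it is
 * identical with or without txn; the cache is purely an optimization.
 */
static bool
sketch_filter_prepare(const SketchOptions *opts, SketchTxn *txn,
					  SketchXid xid, const char *gid)
{
	bool		skip;

	assert(opts != NULL && gid != NULL);
	(void) xid;					/* unused here; a real filter could look it up */

	if (txn != NULL && txn->answer_cached)
		return txn->skip_at_prepare;

	if (!opts->twophase_decoding)
		skip = true;
	else
		skip = strncmp(gid, opts->gid_prefix, strlen(opts->gid_prefix)) != 0;

	if (txn != NULL)
	{
		txn->answer_cached = true;
		txn->skip_at_prepare = skip;
	}
	return skip;
}
```

The point of the sketch is only that the NULL-txn path and the cached path
cannot disagree. A filter that consults state living solely in the txn
structure would not have that property across a restart.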
Because of the above, we can very well clean up the ReorderBufferTXN at
PREPARE time and it need not hang around until COMMIT PREPARED gets
called, which is a good win in terms of resource management.
My test cases pass (including the scenario described earlier) with the
above code changes in place.
I have also added crash-testing TAP test cases. They uncovered
a bug in the prepare redo restart code path, which I fixed. I believe
this patch is in a very stable state now. Multiple runs of the crash TAP
test pass without issues. Multiple runs of "make check-world" with
cassert enabled also pass without issues.
Note that this patch does not contain the HeapTupleSatisfiesVacuum
changes. I believe we need changes to HeapTupleSatisfiesVacuum given
that logical decoding changes the assumption that catalog tuples
belonging to a transaction which never committed can be reclaimed
immediately. With 2PC logical decoding or streaming logical decoding,
there can always be a window of time in which the ongoing decode
cycle needs those tuples. The solution is that even for aborted
transactions, we do not return HEAPTUPLE_DEAD if the transaction id is
newer than the OldestXmin (same logic we use for deleted tuples of
committed transactions). We can do this only for catalog table rows
(both system and user defined) to limit the scope of impact. In any
case, this needs to be a separate patch along with a separate
discussion thread.
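The proposed rule can be modelled in isolation as below. This is a
simplified sketch under stated assumptions, not the eventual
HeapTupleSatisfiesVacuum change: the Sketch* names are hypothetical, the
xid comparison ignores special transaction ids, and returning a
"recently dead" state for the retained tuples is an assumption that
mirrors what is done for deleted tuples of committed transactions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; NOT the real PostgreSQL definitions. */
typedef uint32_t SketchXid;

typedef enum SketchVacuumResult
{
	SKETCH_TUPLE_DEAD,			/* can be reclaimed immediately */
	SKETCH_TUPLE_RECENTLY_DEAD	/* must be kept for now */
} SketchVacuumResult;

/*
 * Wraparound-aware "a is older than b". Simplified: special xids are
 * not handled here.
 */
static bool
sketch_xid_precedes(SketchXid a, SketchXid b)
{
	return (int32_t) (a - b) < 0;
}

/*
 * Classify a tuple whose inserting transaction (xmin) aborted.
 *
 * Without the proposed change every such tuple would be DEAD. With it,
 * catalog rows are kept while xmin is not older than oldest_xmin, since
 * an ongoing decode of that transaction may still need them.
 */
static SketchVacuumResult
sketch_classify_aborted_tuple(SketchXid xmin, SketchXid oldest_xmin,
							  bool is_catalog_row)
{
	if (is_catalog_row && !sketch_xid_precedes(xmin, oldest_xmin))
		return SKETCH_TUPLE_RECENTLY_DEAD;

	return SKETCH_TUPLE_DEAD;
}
```

Restricting the check to catalog rows keeps ordinary user-table vacuum
behaviour unchanged, which is the "limit the scope of impact" point made
above.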
Peter, I will submit a follow-on patch with documentation changes
soon. But this patch is complete IMO, with all the required 2PC
logical decoding functionality.
Comments and feedback are most welcome.
Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Attachments:
2pc_logical_19_12_17_without_docs.patch (application/octet-stream)
diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 6c18189..79b9622 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -26,7 +26,7 @@ installcheck:;
# installation, allow to do so, but only if requested explicitly.
installcheck-force: regresscheck-install-force isolationcheck-install-force
-check: regresscheck isolationcheck
+check: regresscheck isolationcheck 2pc-check
submake-regress:
$(MAKE) -C $(top_builddir)/src/test/regress all
@@ -66,3 +66,6 @@ isolationcheck-install-force: all | submake-isolation submake-test_decoding temp
isolationcheck isolationcheck-install-force
temp-install: EXTRA_INSTALL=contrib/test_decoding
+
+2pc-check: temp-install
+ $(prove_check)
diff --git a/contrib/test_decoding/expected/prepared.out b/contrib/test_decoding/expected/prepared.out
index 46e915d..2df0b6c 100644
--- a/contrib/test_decoding/expected/prepared.out
+++ b/contrib/test_decoding/expected/prepared.out
@@ -6,19 +6,123 @@ SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_d
init
(1 row)
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc', 'test_decoding');
+ ?column?
+----------
+ init
+(1 row)
+
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc_nofilter', 'test_decoding');
+ ?column?
+----------
+ init
+(1 row)
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+-- Reused queries
+\set get_no2pc 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'');'
+\set get_with2pc 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot_2pc'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'', ''twophase-decoding'', ''1'');'
+\set get_with2pc_nofilter 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot_2pc_nofilter'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'', ''twophase-decoding'', ''1'', ''twophase-decode-with-catalog-changes'', ''1'');'
-- test simple successful use of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (1);
PREPARE TRANSACTION 'test_prepared#1';
+:get_no2pc
+ data
+------
+(0 rows)
+
+:get_with2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ PREPARE TRANSACTION 'test_prepared#1'
+(3 rows)
+
+:get_with2pc_nofilter
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ PREPARE TRANSACTION 'test_prepared#1'
+(3 rows)
+
COMMIT PREPARED 'test_prepared#1';
+:get_no2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ COMMIT
+(3 rows)
+
+:get_with2pc
+ data
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
+:get_with2pc_nofilter
+ data
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
INSERT INTO test_prepared1 VALUES (2);
-- test abort of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (3);
PREPARE TRANSACTION 'test_prepared#2';
+:get_no2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+(3 rows)
+
+:get_with2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(6 rows)
+
+:get_with2pc_nofilter
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(6 rows)
+
ROLLBACK PREPARED 'test_prepared#2';
+:get_no2pc
+ data
+------
+(0 rows)
+
+:get_with2pc
+ data
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
+:get_with2pc_nofilter
+ data
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
INSERT INTO test_prepared1 VALUES (4);
-- test prepared xact containing ddl
BEGIN;
@@ -26,45 +130,226 @@ INSERT INTO test_prepared1 VALUES (5);
ALTER TABLE test_prepared1 ADD COLUMN data text;
INSERT INTO test_prepared1 VALUES (6, 'frakbar');
PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
-INSERT INTO test_prepared2 VALUES (7);
-COMMIT PREPARED 'test_prepared#3';
--- make sure stuff still works
-INSERT INTO test_prepared1 VALUES (8);
-INSERT INTO test_prepared2 VALUES (9);
--- cleanup
-DROP TABLE test_prepared1;
-DROP TABLE test_prepared2;
--- show results
-SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'test_prepared1'::regclass;
+ relation | locktype | mode
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
+:get_no2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4
+ COMMIT
+(3 rows)
+
+:get_with2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4
+ COMMIT
+(3 rows)
+
+:get_with2pc_nofilter
data
-------------------------------------------------------------------------
BEGIN
- table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:4
COMMIT
BEGIN
- table public.test_prepared1: INSERT: id[integer]:2
+ table public.test_prepared1: INSERT: id[integer]:5
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(7 rows)
+
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists. Our 2pc filter callback will skip decoding of xacts
+-- with catalog changes at PREPARE time, so we don't decode it now.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
+INSERT INTO test_prepared2 VALUES (7);
+:get_no2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:7
COMMIT
+(3 rows)
+
+:get_with2pc
+ data
+----------------------------------------------------
BEGIN
- table public.test_prepared1: INSERT: id[integer]:4
+ table public.test_prepared2: INSERT: id[integer]:7
COMMIT
+(3 rows)
+
+:get_with2pc_nofilter
+ data
+----------------------------------------------------
BEGIN
table public.test_prepared2: INSERT: id[integer]:7
COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+:get_no2pc
+ data
+-------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:5
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
+ COMMIT
+(4 rows)
+
+:get_with2pc
+ data
+-------------------------------------------------------------------------
BEGIN
table public.test_prepared1: INSERT: id[integer]:5
table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
COMMIT
+(4 rows)
+
+:get_with2pc_nofilter
+ data
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#3'
+(1 row)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (8);
+INSERT INTO test_prepared2 VALUES (9);
+:get_no2pc
+ data
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:9
+ COMMIT
+(6 rows)
+
+:get_with2pc
+ data
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:9
+ COMMIT
+(6 rows)
+
+:get_with2pc_nofilter
+ data
+--------------------------------------------------------------------
BEGIN
table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
COMMIT
BEGIN
table public.test_prepared2: INSERT: id[integer]:9
COMMIT
-(22 rows)
+(6 rows)
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'pg_class'::regclass;
+ relation | locktype | mode
+----------+----------+------
+(0 rows)
+
+-- Shouldn't see anything with 2pc decoding off
+:get_no2pc
+ data
+-----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:12
+ COMMIT
+(3 rows)
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+:get_with2pc
+ data
+-----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:12
+ PREPARE TRANSACTION 'test_prepared_lock2'
+ COMMIT PREPARED 'test_prepared_lock2'
+(4 rows)
+
+:get_with2pc_nofilter
+ data
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:12
+ PREPARE TRANSACTION 'test_prepared_lock2'
+ COMMIT PREPARED 'test_prepared_lock2'
+(8 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- Both will work normally after we commit
+:get_no2pc
+ data
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ COMMIT
+(4 rows)
+
+:get_with2pc
+ data
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ COMMIT
+(4 rows)
+
+:get_with2pc_nofilter
+ data
+--------------------------------------
+ COMMIT PREPARED 'test_prepared_lock'
+(1 row)
+
+-- cleanup
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+------
+(0 rows)
SELECT pg_drop_replication_slot('regression_slot');
pg_drop_replication_slot
@@ -72,3 +357,15 @@ SELECT pg_drop_replication_slot('regression_slot');
(1 row)
+SELECT pg_drop_replication_slot('regression_slot_2pc');
+ pg_drop_replication_slot
+--------------------------
+
+(1 row)
+
+SELECT pg_drop_replication_slot('regression_slot_2pc_nofilter');
+ pg_drop_replication_slot
+--------------------------
+
+(1 row)
+
diff --git a/contrib/test_decoding/sql/prepared.sql b/contrib/test_decoding/sql/prepared.sql
index e726397..4197766 100644
--- a/contrib/test_decoding/sql/prepared.sql
+++ b/contrib/test_decoding/sql/prepared.sql
@@ -1,22 +1,41 @@
-- predictability
SET synchronous_commit = on;
SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc', 'test_decoding');
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc_nofilter', 'test_decoding');
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+
+-- Reused queries
+\set get_no2pc 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'');'
+\set get_with2pc 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot_2pc'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'', ''twophase-decoding'', ''1'');'
+\set get_with2pc_nofilter 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot_2pc_nofilter'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'', ''twophase-decoding'', ''1'', ''twophase-decode-with-catalog-changes'', ''1'');'
-- test simple successful use of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (1);
PREPARE TRANSACTION 'test_prepared#1';
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
COMMIT PREPARED 'test_prepared#1';
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
INSERT INTO test_prepared1 VALUES (2);
-- test abort of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (3);
PREPARE TRANSACTION 'test_prepared#2';
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
ROLLBACK PREPARED 'test_prepared#2';
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
INSERT INTO test_prepared1 VALUES (4);
@@ -27,18 +46,74 @@ ALTER TABLE test_prepared1 ADD COLUMN data text;
INSERT INTO test_prepared1 VALUES (6, 'frakbar');
PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'test_prepared1'::regclass;
+
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists. Our 2pc filter callback will skip decoding of xacts
+-- with catalog changes at PREPARE time, so we don't decode it now.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
INSERT INTO test_prepared2 VALUES (7);
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
COMMIT PREPARED 'test_prepared#3';
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
-- make sure stuff still works
INSERT INTO test_prepared1 VALUES (8);
INSERT INTO test_prepared2 VALUES (9);
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'pg_class'::regclass;
+
+-- Shouldn't see anything with 2pc decoding off
+:get_no2pc
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+:get_with2pc
+:get_with2pc_nofilter
+RESET statement_timeout;
+
+COMMIT PREPARED 'test_prepared_lock';
+
+-- Both will work normally after we commit
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
-- cleanup
DROP TABLE test_prepared1;
@@ -48,3 +123,5 @@ DROP TABLE test_prepared2;
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
SELECT pg_drop_replication_slot('regression_slot');
+SELECT pg_drop_replication_slot('regression_slot_2pc');
+SELECT pg_drop_replication_slot('regression_slot_2pc_nofilter');
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000..c0126fc
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,85 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+ 'postgresql.conf', qq(
+ max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+ "INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+ "SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# This test is specifically for testing concurrent abort while logical decode is
+# ongoing. The decode-delay value will allow for each change decode to sleep for
+# those many seconds. We will fire off a ROLLBACK from another session when this
+# delayed decode is ongoing. That will stop decoding immediately and the next
+# pg_logical_slot_get_changes call should show only a few records decoded from
+# the entire two phase transaction
+
+# consume all changes so far
+#$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'twophase-decoding', '1', 'twophase-decode-with-catalog-changes', '1');");
+
+$node_logical->safe_psql('postgres', "
+ BEGIN;
+ INSERT INTO tab VALUES (11);
+ INSERT INTO tab VALUES (12);
+ PREPARE TRANSACTION 'test_prepared_tab';");
+
+# start decoding the above with decode-delay in the background.
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should only decode 1 INSERT record and should include
+# an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'twophase-decoding', '1', 'twophase-decode-with-catalog-changes', '1', 'decode-delay', '3');\" \&");
+
+# sleep for a little while (shorter than decode-delay)
+$node_logical->safe_psql('postgres', "select pg_sleep(1)");
+
+# rollback the prepared transaction whose first record is being decoded
+# after sleeping for decode-delay time
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# wait for decoding to stop
+$node_logical->psql('postgres', "select pg_sleep(4)");
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'twophase-decoding', '1', 'twophase-decode-with-catalog-changes', '1');");
+
+# check for occurrence of log about stopping decoding
+my $output_file = slurp_file($node_logical->logfile());
+my $abort_str = "stopping decoding of test_prepared_tab ";
+like($output_file, qr/$abort_str/, "ABORT found in server log");
+
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+ BEGIN;
+ INSERT INTO tab VALUES (11);
+ INSERT INTO tab VALUES (12);
+ PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'twophase-decoding', '1', 'twophase-decode-with-catalog-changes', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit post the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'twophase-decoding', '1', 'twophase-decode-with-catalog-changes', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 135b3b7..7dc74f5 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -24,6 +24,8 @@
#include "replication/message.h"
#include "replication/origin.h"
+#include "storage/procarray.h"
+
#include "utils/builtins.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
@@ -46,6 +48,9 @@ typedef struct
bool skip_empty_xacts;
bool xact_wrote_changes;
bool only_local;
+ bool twophase_decoding;
+ bool twophase_decode_with_catalog_changes;
+ int decode_delay; /* seconds to sleep after every change record */
} TestDecodingData;
static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -59,6 +64,8 @@ static void pg_output_begin(LogicalDecodingContext *ctx,
bool last_write);
static void pg_decode_commit_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pg_decode_abort_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr abort_lsn);
static void pg_decode_change(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, Relation rel,
ReorderBufferChange *change);
@@ -68,6 +75,20 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr message_lsn,
bool transactional, const char *prefix,
Size sz, const char *message);
+static bool pg_filter_prepare(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid);
+static bool pg_filter_decode_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
void
_PG_init(void)
@@ -85,9 +106,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
cb->begin_cb = pg_decode_begin_txn;
cb->change_cb = pg_decode_change;
cb->commit_cb = pg_decode_commit_txn;
+ cb->abort_cb = pg_decode_abort_txn;
cb->filter_by_origin_cb = pg_decode_filter;
cb->shutdown_cb = pg_decode_shutdown;
cb->message_cb = pg_decode_message;
+ cb->filter_prepare_cb = pg_filter_prepare;
+ cb->filter_decode_txn_cb = pg_filter_decode_txn;
+ cb->prepare_cb = pg_decode_prepare_txn;
+ cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+ cb->abort_prepared_cb = pg_decode_abort_prepared_txn;
}
@@ -107,6 +134,9 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
data->include_timestamp = false;
data->skip_empty_xacts = false;
data->only_local = false;
+ data->twophase_decoding = false;
+ data->twophase_decode_with_catalog_changes = false;
+ data->decode_delay = 0;
ctx->output_plugin_private = data;
@@ -176,6 +206,42 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
errmsg("could not parse value \"%s\" for parameter \"%s\"",
strVal(elem->arg), elem->defname)));
}
+ else if (strcmp(elem->defname, "twophase-decoding") == 0)
+ {
+
+ if (elem->arg == NULL)
+ data->twophase_decoding = true;
+ else if (!parse_bool(strVal(elem->arg), &data->twophase_decoding))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("could not parse value \"%s\" for parameter \"%s\"",
+ strVal(elem->arg), elem->defname)));
+ }
+ else if (strcmp(elem->defname, "twophase-decode-with-catalog-changes") == 0)
+ {
+ if (elem->arg == NULL)
+ data->twophase_decode_with_catalog_changes = true;
+ else if (!parse_bool(strVal(elem->arg), &data->twophase_decode_with_catalog_changes))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("could not parse value \"%s\" for parameter \"%s\"",
+ strVal(elem->arg), elem->defname)));
+ }
+ else if (strcmp(elem->defname, "decode-delay") == 0)
+ {
+ if (elem->arg == NULL)
+ data->decode_delay = 2; /* default to 2 seconds */
+ else
+ data->decode_delay = pg_atoi(strVal(elem->arg),
+ sizeof(int), 0);
+
+ if (data->decode_delay <= 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("Specify positive value for parameter \"%s\","
+ " you specified \"%s\"",
+ elem->defname, strVal(elem->arg))));
+ }
else
{
ereport(ERROR,
@@ -244,6 +310,156 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
OutputPluginWrite(ctx, true);
}
+/* ABORT callback */
+static void
+pg_decode_abort_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+ if (data->include_xids)
+ appendStringInfo(ctx->out, "ABORT %u", txn->xid);
+ else
+ appendStringInfoString(ctx->out, "ABORT");
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* Filter out unnecessary two-phase transactions */
+static bool
+pg_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ /* treat all transactions as one-phase */
+ if (!data->twophase_decoding)
+ return true;
+
+ if (txn && txn->has_catalog_changes &&
+ !data->twophase_decode_with_catalog_changes)
+ return true;
+
+ /*
+ * even if txn is NULL, decode since twophase_decoding is set
+ */
+ return false;
+}
+
+/*
+ * Check if we should continue to decode this transaction.
+ *
+ * If it has aborted in the meantime, there is no point in
+ * decoding and sending the rest of the changes; we might
+ * as well ask the subscribers to abort immediately.
+ *
+ * This should be called if we are streaming a transaction
+ * before it's committed or if we are decoding a 2PC
+ * transaction. Otherwise we always decode committed
+ * transactions.
+ *
+ * Additional checks can be added here, as needed.
+ */
+static bool
+pg_filter_decode_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn)
+{
+ /*
+ * Due to caching, repeated TransactionIdDidAbort calls
+ * shouldn't be that expensive
+ */
+ if (txn != NULL &&
+ TransactionIdIsValid(txn->xid) &&
+ TransactionIdDidAbort(txn->xid))
+ return true;
+
+ /* if txn is NULL, filter it out */
+ return (txn == NULL);
+}
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (!data->twophase_decoding)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* ABORT PREPARED callback */
+static void
+pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (!data->twophase_decoding)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
static bool
pg_decode_filter(LogicalDecodingContext *ctx,
RepOriginId origin_id)
@@ -412,6 +628,10 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
class_form = RelationGetForm(relation);
tupdesc = RelationGetDescr(relation);
+ /* if decode_delay is specified, sleep for that many seconds */
+ if (data->decode_delay > 0)
+ pg_usleep(data->decode_delay * 1000000L);
+
/* Avoid leaking memory by using and resetting our own context */
old = MemoryContextSwitchTo(data->context);
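The filter rule that pg_filter_prepare applies in the test_decoding changes above can be restated as a small pure function. This is only a sketch for illustration: SketchOptions and sketch_filter_prepare are hypothetical names, not part of the patch, and the two booleans stand in for "txn != NULL" and "txn->has_catalog_changes". As in the patch, returning true means "skip the PREPARE and decode the transaction at commit time".

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two plugin options added by the patch. */
typedef struct SketchOptions
{
	bool		twophase_decoding;
	bool		twophase_decode_with_catalog_changes;
} SketchOptions;

/*
 * Mirrors the decision order of pg_filter_prepare:
 *   1. two-phase decoding disabled            -> filter out (true)
 *   2. known txn with catalog changes, and
 *      catalog-change decoding not requested  -> filter out (true)
 *   3. otherwise, including an unknown txn    -> decode at PREPARE (false)
 */
static bool
sketch_filter_prepare(const SketchOptions *opts,
					  bool have_txn, bool has_catalog_changes)
{
	if (!opts->twophase_decoding)
		return true;

	if (have_txn && has_catalog_changes &&
		!opts->twophase_decode_with_catalog_changes)
		return true;

	return false;
}
```

With both options off every transaction is treated as one-phase. Enabling only "twophase-decoding" still defers transactions that changed the catalog until commit.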
diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 3aafa79..1a4487d 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -102,6 +102,14 @@ ParseCommitRecord(uint8 info, xl_xact_commit *xlrec, xl_xact_parsed_commit *pars
parsed->twophase_xid = xl_twophase->xid;
data += sizeof(xl_xact_twophase);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_GID)
+ {
+ int gidlen;
+ strcpy(parsed->twophase_gid, data);
+ gidlen = strlen(parsed->twophase_gid) + 1;
+ data += MAXALIGN(gidlen);
+ }
}
if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
@@ -139,6 +147,16 @@ ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_abort *parsed)
data += sizeof(xl_xact_xinfo);
}
+ if (parsed->xinfo & XACT_XINFO_HAS_DBINFO)
+ {
+ xl_xact_dbinfo *xl_dbinfo = (xl_xact_dbinfo *) data;
+
+ parsed->dbId = xl_dbinfo->dbId;
+ parsed->tsId = xl_dbinfo->tsId;
+
+ data += sizeof(xl_xact_dbinfo);
+ }
+
if (parsed->xinfo & XACT_XINFO_HAS_SUBXACTS)
{
xl_xact_subxacts *xl_subxacts = (xl_xact_subxacts *) data;
@@ -168,6 +186,27 @@ ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_abort *parsed)
parsed->twophase_xid = xl_twophase->xid;
data += sizeof(xl_xact_twophase);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_GID)
+ {
+ int gidlen;
+ strcpy(parsed->twophase_gid, data);
+ gidlen = strlen(parsed->twophase_gid) + 1;
+ data += MAXALIGN(gidlen);
+ }
+ }
+
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ xl_xact_origin xl_origin;
+
+ /* we're only guaranteed 4 byte alignment, so copy onto stack */
+ memcpy(&xl_origin, data, sizeof(xl_origin));
+
+ parsed->origin_lsn = xl_origin.origin_lsn;
+ parsed->origin_timestamp = xl_origin.origin_timestamp;
+
+ data += sizeof(xl_xact_origin);
}
}
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 321da9f..1f60e80 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -144,11 +144,7 @@ int max_prepared_xacts = 0;
*
* typedef struct GlobalTransactionData *GlobalTransaction appears in
* twophase.h
- *
- * Note that the max value of GIDSIZE must fit in the uint16 gidlen,
- * specified in TwoPhaseFileHeader.
*/
-#define GIDSIZE 200
typedef struct GlobalTransactionData
{
@@ -211,12 +207,14 @@ static void RecordTransactionCommitPrepared(TransactionId xid,
RelFileNode *rels,
int ninvalmsgs,
SharedInvalidationMessage *invalmsgs,
- bool initfileinval);
+ bool initfileinval,
+ const char *gid);
static void RecordTransactionAbortPrepared(TransactionId xid,
int nchildren,
TransactionId *children,
int nrels,
- RelFileNode *rels);
+ RelFileNode *rels,
+ const char *gid);
static void ProcessRecords(char *bufptr, TransactionId xid,
const TwoPhaseCallback callbacks[]);
static void RemoveGXact(GlobalTransaction gxact);
@@ -556,7 +554,7 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held)
* Locate the prepared transaction and mark it busy for COMMIT or PREPARE.
*/
static GlobalTransaction
-LockGXact(const char *gid, Oid user)
+LockGXact(const char *gid, Oid user, bool missing_ok)
{
int i;
@@ -616,7 +614,8 @@ LockGXact(const char *gid, Oid user)
LWLockRelease(TwoPhaseStateLock);
- ereport(ERROR,
+ if (!missing_ok)
+ ereport(ERROR,
(errcode(ERRCODE_UNDEFINED_OBJECT),
errmsg("prepared transaction with identifier \"%s\" does not exist",
gid)));
@@ -898,7 +897,7 @@ TwoPhaseGetDummyProc(TransactionId xid)
/*
* Header for a 2PC state file
*/
-#define TWOPHASE_MAGIC 0x57F94533 /* format identifier */
+#define TWOPHASE_MAGIC 0x57F94534 /* format identifier */
typedef struct TwoPhaseFileHeader
{
@@ -914,6 +913,8 @@ typedef struct TwoPhaseFileHeader
int32 ninvalmsgs; /* number of cache invalidation messages */
bool initfileinval; /* does relcache init file need invalidation? */
uint16 gidlen; /* length of the GID - GID follows the header */
+ XLogRecPtr origin_lsn; /* lsn of this record at origin node */
+ TimestampTz origin_timestamp; /* time of prepare at origin node */
} TwoPhaseFileHeader;
/*
@@ -1065,6 +1066,7 @@ EndPrepare(GlobalTransaction gxact)
{
TwoPhaseFileHeader *hdr;
StateFileChunk *record;
+ bool replorigin;
/* Add the end sentinel to the list of 2PC records */
RegisterTwoPhaseRecord(TWOPHASE_RM_END_ID, 0,
@@ -1075,6 +1077,21 @@ EndPrepare(GlobalTransaction gxact)
Assert(hdr->magic == TWOPHASE_MAGIC);
hdr->total_len = records.total_len + sizeof(pg_crc32c);
+ replorigin = (replorigin_session_origin != InvalidRepOriginId &&
+ replorigin_session_origin != DoNotReplicateId);
+
+ if (replorigin)
+ {
+ Assert(replorigin_session_origin_lsn != InvalidXLogRecPtr);
+ hdr->origin_lsn = replorigin_session_origin_lsn;
+ hdr->origin_timestamp = replorigin_session_origin_timestamp;
+ }
+ else
+ {
+ hdr->origin_lsn = InvalidXLogRecPtr;
+ hdr->origin_timestamp = 0;
+ }
+
/*
* If the data size exceeds MaxAllocSize, we won't be able to read it in
* ReadTwoPhaseFile. Check for that now, rather than fail in the case
@@ -1107,7 +1124,16 @@ EndPrepare(GlobalTransaction gxact)
XLogBeginInsert();
for (record = records.head; record != NULL; record = record->next)
XLogRegisterData(record->data, record->len);
+
+ XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
+
gxact->prepare_end_lsn = XLogInsert(RM_XACT_ID, XLOG_XACT_PREPARE);
+
+ if (replorigin)
+ /* Move LSNs forward for this replication origin */
+ replorigin_session_advance(replorigin_session_origin_lsn,
+ gxact->prepare_end_lsn);
+
XLogFlush(gxact->prepare_end_lsn);
/* If we crash now, we have prepared: WAL replay will fix things */
@@ -1283,6 +1309,43 @@ ReadTwoPhaseFile(TransactionId xid, bool give_warnings)
return buf;
}
+/*
+ * ParsePrepareRecord
+ */
+void
+ParsePrepareRecord(uint8 info, char *xlrec, xl_xact_parsed_prepare *parsed)
+{
+ TwoPhaseFileHeader *hdr;
+ char *bufptr;
+
+ hdr = (TwoPhaseFileHeader *) xlrec;
+ bufptr = xlrec + MAXALIGN(sizeof(TwoPhaseFileHeader));
+
+ parsed->origin_lsn = hdr->origin_lsn;
+ parsed->origin_timestamp = hdr->origin_timestamp;
+ parsed->twophase_xid = hdr->xid;
+ parsed->dbId = hdr->database;
+ parsed->nsubxacts = hdr->nsubxacts;
+ parsed->ncommitrels = hdr->ncommitrels;
+ parsed->nabortrels = hdr->nabortrels;
+ parsed->nmsgs = hdr->ninvalmsgs;
+
+ strncpy(parsed->twophase_gid, bufptr, hdr->gidlen);
+ bufptr += MAXALIGN(hdr->gidlen);
+
+ parsed->subxacts = (TransactionId *) bufptr;
+ bufptr += MAXALIGN(hdr->nsubxacts * sizeof(TransactionId));
+
+ parsed->commitrels = (RelFileNode *) bufptr;
+ bufptr += MAXALIGN(hdr->ncommitrels * sizeof(RelFileNode));
+
+ parsed->abortrels = (RelFileNode *) bufptr;
+ bufptr += MAXALIGN(hdr->nabortrels * sizeof(RelFileNode));
+
+ parsed->msgs = (SharedInvalidationMessage *) bufptr;
+ bufptr += MAXALIGN(hdr->ninvalmsgs * sizeof(SharedInvalidationMessage));
+}
+
/*
* Reads 2PC data from xlog. During checkpoint this data will be moved to
@@ -1365,7 +1428,7 @@ StandbyTransactionIdIsPrepared(TransactionId xid)
* FinishPreparedTransaction: execute COMMIT PREPARED or ROLLBACK PREPARED
*/
void
-FinishPreparedTransaction(const char *gid, bool isCommit)
+FinishPreparedTransaction(const char *gid, bool isCommit, bool missing_ok)
{
GlobalTransaction gxact;
PGPROC *proc;
@@ -1386,8 +1449,20 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
/*
* Validate the GID, and lock the GXACT to ensure that two backends do not
* try to commit the same GID at once.
+ *
+ * During logical decoding, on the apply side, it's possible that a prepared
+ * transaction got aborted while decoding. In that case, we stop the
+ * decoding and abort the transaction immediately. However, the ROLLBACK
+ * PREPARED processing still reaches the subscriber, so it's OK for the
+ * gid to be missing in that case.
*/
- gxact = LockGXact(gid, GetUserId());
+ gxact = LockGXact(gid, GetUserId(), missing_ok);
+ if (gxact == NULL)
+ {
+ Assert(missing_ok && !isCommit);
+ return;
+ }
+
proc = &ProcGlobal->allProcs[gxact->pgprocno];
pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
xid = pgxact->xid;
@@ -1435,11 +1510,12 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
hdr->nsubxacts, children,
hdr->ncommitrels, commitrels,
hdr->ninvalmsgs, invalmsgs,
- hdr->initfileinval);
+ hdr->initfileinval, gid);
else
RecordTransactionAbortPrepared(xid,
hdr->nsubxacts, children,
- hdr->nabortrels, abortrels);
+ hdr->nabortrels, abortrels,
+ gid);
ProcArrayRemove(proc, latestXid);
@@ -1752,7 +1828,8 @@ restoreTwoPhaseData(void)
if (buf == NULL)
continue;
- PrepareRedoAdd(buf, InvalidXLogRecPtr, InvalidXLogRecPtr);
+ PrepareRedoAdd(buf, InvalidXLogRecPtr,
+ InvalidXLogRecPtr, InvalidRepOriginId);
}
}
LWLockRelease(TwoPhaseStateLock);
@@ -2165,7 +2242,8 @@ RecordTransactionCommitPrepared(TransactionId xid,
RelFileNode *rels,
int ninvalmsgs,
SharedInvalidationMessage *invalmsgs,
- bool initfileinval)
+ bool initfileinval,
+ const char *gid)
{
XLogRecPtr recptr;
TimestampTz committs = GetCurrentTimestamp();
@@ -2193,7 +2271,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
ninvalmsgs, invalmsgs,
initfileinval, false,
MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
- xid);
+ xid, gid);
if (replorigin)
@@ -2255,7 +2333,8 @@ RecordTransactionAbortPrepared(TransactionId xid,
int nchildren,
TransactionId *children,
int nrels,
- RelFileNode *rels)
+ RelFileNode *rels,
+ const char *gid)
{
XLogRecPtr recptr;
@@ -2278,7 +2357,7 @@ RecordTransactionAbortPrepared(TransactionId xid,
nchildren, children,
nrels, rels,
MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
- xid);
+ xid, gid);
/* Always flush, since we're about to remove the 2PC state file */
XLogFlush(recptr);
@@ -2309,7 +2388,8 @@ RecordTransactionAbortPrepared(TransactionId xid,
* data, the entry is marked as located on disk.
*/
void
-PrepareRedoAdd(char *buf, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn, RepOriginId origin_id)
{
TwoPhaseFileHeader *hdr = (TwoPhaseFileHeader *) buf;
char *bufptr;
@@ -2358,6 +2438,13 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
Assert(TwoPhaseState->numPrepXacts < max_prepared_xacts);
TwoPhaseState->prepXacts[TwoPhaseState->numPrepXacts++] = gxact;
+ if (origin_id != InvalidRepOriginId)
+ {
+ /* recover apply progress */
+ replorigin_advance(origin_id, hdr->origin_lsn, end_lsn,
+ false /* backward */ , false /* WAL */ );
+ }
+
elog(DEBUG2, "added 2PC data in shared memory for transaction %u", gxact->xid);
}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e93d740..b05e0f5 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1227,7 +1227,7 @@ RecordTransactionCommit(void)
nmsgs, invalMessages,
RelcacheInitFileInval, forceSyncCommit,
MyXactFlags,
- InvalidTransactionId /* plain commit */ );
+ InvalidTransactionId, NULL /* plain commit */ );
if (replorigin)
/* Move LSNs forward for this replication origin */
@@ -1579,7 +1579,8 @@ RecordTransactionAbort(bool isSubXact)
XactLogAbortRecord(xact_time,
nchildren, children,
nrels, rels,
- MyXactFlags, InvalidTransactionId);
+ MyXactFlags, InvalidTransactionId,
+ NULL);
/*
* Report the latest async abort LSN, so that the WAL writer knows to
@@ -5244,7 +5245,6 @@ xactGetCommittedChildren(TransactionId **ptr)
* XLOG support routines
*/
-
/*
* Log the commit record for a plain or twophase transaction commit.
*
@@ -5257,7 +5257,8 @@ XactLogCommitRecord(TimestampTz commit_time,
int nrels, RelFileNode *rels,
int nmsgs, SharedInvalidationMessage *msgs,
bool relcacheInval, bool forceSync,
- int xactflags, TransactionId twophase_xid)
+ int xactflags, TransactionId twophase_xid,
+ const char *twophase_gid)
{
xl_xact_commit xlrec;
xl_xact_xinfo xl_xinfo;
@@ -5269,6 +5270,7 @@ XactLogCommitRecord(TimestampTz commit_time,
xl_xact_origin xl_origin;
uint8 info;
+ int gidlen = 0;
Assert(CritSectionCount > 0);
@@ -5331,6 +5333,13 @@ XactLogCommitRecord(TimestampTz commit_time,
{
xl_xinfo.xinfo |= XACT_XINFO_HAS_TWOPHASE;
xl_twophase.xid = twophase_xid;
+ Assert(twophase_gid != NULL);
+
+ if (XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_GID;
+ gidlen = strlen(twophase_gid) + 1; /* include '\0' */
+ }
}
/* dump transaction origin information */
@@ -5381,8 +5390,19 @@ XactLogCommitRecord(TimestampTz commit_time,
}
if (xl_xinfo.xinfo & XACT_XINFO_HAS_TWOPHASE)
+ {
XLogRegisterData((char *) (&xl_twophase), sizeof(xl_xact_twophase));
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_GID)
+ {
+ static const char zeroes[MAXIMUM_ALIGNOF] = { 0 };
+ XLogRegisterData((char *) twophase_gid, gidlen);
+ if (MAXALIGN(gidlen) != gidlen)
+ XLogRegisterData((char *) zeroes, MAXALIGN(gidlen) - gidlen);
+ }
+ }
+
+
if (xl_xinfo.xinfo & XACT_XINFO_HAS_ORIGIN)
XLogRegisterData((char *) (&xl_origin), sizeof(xl_xact_origin));
@@ -5402,15 +5422,19 @@ XLogRecPtr
XactLogAbortRecord(TimestampTz abort_time,
int nsubxacts, TransactionId *subxacts,
int nrels, RelFileNode *rels,
- int xactflags, TransactionId twophase_xid)
+ int xactflags, TransactionId twophase_xid,
+ const char *twophase_gid)
{
xl_xact_abort xlrec;
xl_xact_xinfo xl_xinfo;
xl_xact_subxacts xl_subxacts;
xl_xact_relfilenodes xl_relfilenodes;
xl_xact_twophase xl_twophase;
+ xl_xact_dbinfo xl_dbinfo;
+ xl_xact_origin xl_origin;
uint8 info;
+ int gidlen = 0;
Assert(CritSectionCount > 0);
@@ -5446,6 +5470,31 @@ XactLogAbortRecord(TimestampTz abort_time,
{
xl_xinfo.xinfo |= XACT_XINFO_HAS_TWOPHASE;
xl_twophase.xid = twophase_xid;
+ Assert(twophase_gid != NULL);
+
+ if (XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_GID;
+ gidlen = strlen(twophase_gid) + 1; /* include '\0' */
+ }
+ }
+
+ if (TransactionIdIsValid(twophase_xid) && XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_DBINFO;
+ xl_dbinfo.dbId = MyDatabaseId;
+ xl_dbinfo.tsId = MyDatabaseTableSpace;
+ }
+
+ /* dump transaction origin information only for abort prepared */
+ if ((replorigin_session_origin != InvalidRepOriginId) &&
+ TransactionIdIsValid(twophase_xid) &&
+ XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_ORIGIN;
+
+ xl_origin.origin_lsn = replorigin_session_origin_lsn;
+ xl_origin.origin_timestamp = replorigin_session_origin_timestamp;
}
if (xl_xinfo.xinfo != 0)
@@ -5460,6 +5509,9 @@ XactLogAbortRecord(TimestampTz abort_time,
if (xl_xinfo.xinfo != 0)
XLogRegisterData((char *) (&xl_xinfo), sizeof(xl_xinfo));
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_DBINFO)
+ XLogRegisterData((char *) (&xl_dbinfo), sizeof(xl_dbinfo));
+
if (xl_xinfo.xinfo & XACT_XINFO_HAS_SUBXACTS)
{
XLogRegisterData((char *) (&xl_subxacts),
@@ -5477,7 +5529,23 @@ XactLogAbortRecord(TimestampTz abort_time,
}
if (xl_xinfo.xinfo & XACT_XINFO_HAS_TWOPHASE)
+ {
XLogRegisterData((char *) (&xl_twophase), sizeof(xl_xact_twophase));
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_GID)
+ {
+ static const char zeroes[MAXIMUM_ALIGNOF] = { 0 };
+ XLogRegisterData((char *) twophase_gid, gidlen);
+ if (MAXALIGN(gidlen) != gidlen)
+ XLogRegisterData((char *) zeroes, MAXALIGN(gidlen) - gidlen);
+ }
+ }
+
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_ORIGIN)
+ XLogRegisterData((char *) (&xl_origin), sizeof(xl_xact_origin));
+
+ if (TransactionIdIsValid(twophase_xid))
+ XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
+
return XLogInsert(RM_XACT_ID, info);
}
@@ -5800,7 +5868,8 @@ xact_redo(XLogReaderState *record)
LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
PrepareRedoAdd(XLogRecGetData(record),
record->ReadRecPtr,
- record->EndRecPtr);
+ record->EndRecPtr,
+ XLogRecGetOrigin(record));
LWLockRelease(TwoPhaseStateLock);
}
else if (info == XLOG_XACT_ASSIGNMENT)
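The GID layout used by XactLogCommitRecord/XactLogAbortRecord on the write side and by ParseCommitRecord/ParseAbortRecord on the read side (NUL-terminated string, zero-padded to a MAXALIGN boundary) can be checked in isolation with a sketch like the one below. Everything prefixed with sketch_/SKETCH_ is hypothetical and exists only for illustration. In particular, the 8-byte alignment is an assumption about a typical 64-bit platform, and the 200-byte limit is taken from the GIDSIZE definition that this patch removes from twophase.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumption: maximum alignment of 8 bytes, as on common 64-bit builds. */
#define SKETCH_ALIGNOF	8
#define SKETCH_MAXALIGN(len) \
	(((size_t) (len) + (SKETCH_ALIGNOF - 1)) & ~((size_t) (SKETCH_ALIGNOF - 1)))

/* Assumption: same limit as GIDSIZE, counting the trailing NUL. */
#define SKETCH_GIDSIZE	200

/*
 * Write side: copy the gid with its trailing NUL, then pad with zeroes up
 * to the next aligned offset. Returns the number of bytes written to buf.
 */
static size_t
sketch_write_gid(char *buf, const char *gid)
{
	size_t		gidlen = strlen(gid) + 1;	/* include '\0' */
	size_t		padded = SKETCH_MAXALIGN(gidlen);

	assert(gidlen <= SKETCH_GIDSIZE);
	memcpy(buf, gid, gidlen);
	memset(buf + gidlen, 0, padded - gidlen);
	return padded;
}

/*
 * Read side: copy the gid out and report how far the data pointer has to
 * advance to reach the next field of the record.
 */
static size_t
sketch_read_gid(const char *data, char *gid_out)
{
	size_t		gidlen = strlen(data) + 1;

	assert(gidlen <= SKETCH_GIDSIZE);	/* bounded copy, unlike a bare strcpy */
	memcpy(gid_out, data, gidlen);
	return SKETCH_MAXALIGN(gidlen);
}
```

The reader must advance by exactly the padded length the writer produced. Otherwise the origin information that follows the GID in the record would be read from the wrong offset.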
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 486fd0c..49637e6 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -34,6 +34,7 @@
#include "access/xlogutils.h"
#include "access/xlogreader.h"
#include "access/xlogrecord.h"
+#include "access/twophase.h"
#include "catalog/pg_control.h"
@@ -72,6 +73,8 @@ static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_commit *parsed, TransactionId xid);
static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare *parsed);
/* common function to decode tuples */
static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -277,16 +280,33 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
break;
}
case XLOG_XACT_PREPARE:
+ {
+ xl_xact_parsed_prepare parsed;
- /*
- * Currently decoding ignores PREPARE TRANSACTION and will just
- * decode the transaction when the COMMIT PREPARED is sent or
- * throw away the transaction's contents when a ROLLBACK PREPARED
- * is received. In the future we could add code to expose prepared
- * transactions in the changestream allowing for a kind of
- * distributed 2PC.
- */
- ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+ /* check that output plugin is capable of twophase decoding */
+ if (!ctx->enable_twophase)
+ {
+ ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+ break;
+ }
+
+ /* ok, parse it */
+ ParsePrepareRecord(XLogRecGetInfo(buf->record),
+ XLogRecGetData(buf->record), &parsed);
+
+ /* does output plugin want this particular transaction? */
+ if (ctx->callbacks.filter_prepare_cb &&
+ ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+ parsed.twophase_gid))
+ {
+ ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+ buf->origptr);
+ break;
+ }
+
+ DecodePrepare(ctx, buf, &parsed);
+ break;
+ }
break;
default:
elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
@@ -607,9 +627,71 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
buf->origptr, buf->endptr);
}
+ if (TransactionIdIsValid(parsed->twophase_xid) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder,
+ parsed->twophase_xid, parsed->twophase_gid))
+ {
+ Assert(xid == parsed->twophase_xid);
+ /* we are processing COMMIT PREPARED */
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn, parsed->twophase_gid, true);
+ }
+ else
+ {
+ /* replay actions of all transaction + subtransactions in order */
+ ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn);
+ }
+}
+
+/*
+ * Decode a PREPARE record. The logic is similar to that of DecodeCommit.
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare *parsed)
+{
+ XLogRecPtr origin_lsn = parsed->origin_lsn;
+ TimestampTz commit_time = parsed->origin_timestamp;
+ RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+ int i;
+ TransactionId xid = parsed->twophase_xid;
+
+ /*
+ * Process invalidation messages, even if we're not interested in the
+ * transaction's contents, since the various caches need to always be
+ * consistent.
+ */
+ if (parsed->nmsgs > 0)
+ {
+ ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
+ parsed->nmsgs, parsed->msgs);
+ ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+ }
+
+ if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+ (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+ FilterByOrigin(ctx, origin_id))
+ {
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr);
+ }
+ ReorderBufferForget(ctx->reorder, xid, buf->origptr);
+
+ return;
+ }
+
+ /* tell the reorderbuffer about the surviving subtransactions */
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+ buf->origptr, buf->endptr);
+ }
+
/* replay actions of all transaction + subtransactions in order */
- ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
- commit_time, origin_id, origin_lsn);
+ ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn, parsed->twophase_gid);
}
/*
@@ -621,6 +703,30 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid)
{
int i;
+ XLogRecPtr origin_lsn = InvalidXLogRecPtr;
+ TimestampTz commit_time = 0;
+ RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ origin_lsn = parsed->origin_lsn;
+ commit_time = parsed->origin_timestamp;
+ }
+
+ /*
+ * If it's ROLLBACK PREPARED then handle it via callbacks.
+ */
+ if (TransactionIdIsValid(xid) &&
+ !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+ parsed->dbId == ctx->slot->data.database &&
+ !FilterByOrigin(ctx, origin_id) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
+ {
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid, false);
+ return;
+ }
for (i = 0; i < parsed->nsubxacts; i++)
{
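The handling of XLOG_XACT_PREPARE in DecodeXactOp above comes down to a two-step decision, restated here as a standalone sketch. The names SketchPrepareAction and sketch_prepare_action are hypothetical. The three inputs stand in for ctx->enable_twophase, "ctx->callbacks.filter_prepare_cb != NULL" and the result of ReorderBufferPrepareNeedSkip, whose implementation is not shown in this part of the patch.

```c
#include <stdbool.h>

typedef enum SketchPrepareAction
{
	SKETCH_DEFER_TO_COMMIT,		/* only note the xid; decode at COMMIT PREPARED */
	SKETCH_DECODE_AT_PREPARE	/* hand the transaction to DecodePrepare now */
} SketchPrepareAction;

/*
 * Decide what to do with a PREPARE record:
 *   - plugin without full 2PC support: fall back to the old behaviour;
 *   - plugin has a filter callback and asks to skip this transaction:
 *     same fallback;
 *   - otherwise decode the transaction at PREPARE time.
 */
static SketchPrepareAction
sketch_prepare_action(bool enable_twophase,
					  bool has_filter_cb, bool filter_says_skip)
{
	if (!enable_twophase)
		return SKETCH_DEFER_TO_COMMIT;

	if (has_filter_cb && filter_says_skip)
		return SKETCH_DEFER_TO_COMMIT;

	return SKETCH_DECODE_AT_PREPARE;
}
```

In both fallback cases the decoder only calls ReorderBufferProcessXid, so the transaction is later replayed as an ordinary commit, or discarded on ROLLBACK PREPARED.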
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index bca585f..2a13d2e 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -60,6 +60,18 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+static void abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+static bool filter_decode_txn_cb_wrapper(ReorderBuffer *cache,
+ ReorderBufferTXN *txn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change);
static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -124,6 +136,7 @@ StartupDecodingContext(List *output_plugin_options,
MemoryContext context,
old_context;
LogicalDecodingContext *ctx;
+ int twophase_callbacks;
/* shorter lines... */
slot = MyReplicationSlot;
@@ -182,8 +195,27 @@ StartupDecodingContext(List *output_plugin_options,
ctx->reorder->begin = begin_cb_wrapper;
ctx->reorder->apply_change = change_cb_wrapper;
ctx->reorder->commit = commit_cb_wrapper;
+ ctx->reorder->abort = abort_cb_wrapper;
+ ctx->reorder->filter_decode_txn = filter_decode_txn_cb_wrapper;
+ ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+ ctx->reorder->prepare = prepare_cb_wrapper;
+ ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+ ctx->reorder->abort_prepared = abort_prepared_cb_wrapper;
ctx->reorder->message = message_cb_wrapper;
+ /* check that plugin implements all callbacks necessary to perform 2PC */
+ twophase_callbacks = (ctx->callbacks.prepare_cb != NULL) +
+ (ctx->callbacks.commit_prepared_cb != NULL) +
+ (ctx->callbacks.abort_prepared_cb != NULL);
+
+ ctx->enable_twophase = (twophase_callbacks == 3);
+
+ if (twophase_callbacks != 3 && twophase_callbacks != 0)
+ ereport(WARNING,
+ (errmsg("output plugin registered only %d of 3 two-phase callbacks; "
+ "two-phase transactions will be decoded at commit time",
+ twophase_callbacks)));
+
ctx->out = makeStringInfo();
ctx->prepare_write = prepare_write;
ctx->write = do_write;
@@ -680,6 +712,122 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
}
static void
+abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "abort";
+ state.report_location = txn->final_lsn; /* beginning of abort record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.abort_cb(ctx, txn, abort_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "prepare";
+ state.report_location = txn->final_lsn; /* beginning of prepare record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "commit_prepared";
+ state.report_location = txn->final_lsn; /* beginning of commit record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "abort_prepared";
+ state.report_location = txn->final_lsn; /* beginning of abort record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.abort_prepared_cb(ctx, txn, abort_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change)
{
@@ -714,6 +862,62 @@ change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
error_context_stack = errcallback.previous;
}
+static bool
+filter_decode_txn_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+ bool ret;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "filter_decode_txn";
+ state.report_location = InvalidXLogRecPtr;
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = false;
+
+ /* do the actual work: call callback */
+ ret = ctx->callbacks.filter_decode_txn_cb(ctx, txn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+ return ret;
+}
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+ bool ret;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "filter_prepare";
+ state.report_location = InvalidXLogRecPtr;
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = false;
+
+ /* do the actual work: call callback */
+ ret = ctx->callbacks.filter_prepare_cb(ctx, txn, xid, gid);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+ return ret;
+}
+
bool
filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
{
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 9b126b2..77b9f58 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -72,10 +72,11 @@ void
logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
XLogRecPtr commit_lsn)
{
- uint8 flags = 0;
+ uint8 flags = 0;
pq_sendbyte(out, 'C'); /* sending COMMIT */
+ flags |= LOGICALREP_IS_COMMIT;
/* send the flags field (unused for now) */
pq_sendbyte(out, flags);
@@ -86,21 +87,106 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
}
/*
- * Read transaction COMMIT from the stream.
+ * Write ABORT to the output stream.
+ */
+void
+logicalrep_write_abort(StringInfo out, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ uint8 flags = 0;
+
+ pq_sendbyte(out, 'C'); /* ABORT reuses the 'C' message; see flags below */
+
+ flags |= LOGICALREP_IS_ABORT;
+ /* send the flags field */
+ pq_sendbyte(out, flags);
+
+ /* send fields */
+ pq_sendint64(out, abort_lsn);
+ pq_sendint64(out, txn->end_lsn);
+ pq_sendint64(out, txn->commit_time);
+}
+
+/*
+ * Read transaction COMMIT|ABORT from the stream.
*/
void
-logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
+logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data,
+ uint8 *flags)
{
- /* read flags (unused for now) */
- uint8 flags = pq_getmsgbyte(in);
+ /* read flags */
+ uint8 commit_flags = pq_getmsgbyte(in);
- if (flags != 0)
- elog(ERROR, "unrecognized flags %u in commit message", flags);
+ if (!(commit_flags & LOGICALREP_COMMIT_MASK))
+ elog(ERROR, "unrecognized flags %u in commit|abort message",
+ commit_flags);
/* read fields */
commit_data->commit_lsn = pq_getmsgint64(in);
commit_data->end_lsn = pq_getmsgint64(in);
commit_data->committime = pq_getmsgint64(in);
+
+ /* set gid to empty */
+ commit_data->gid[0] = '\0';
+
+ *flags = commit_flags;
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ uint8 flags = 0;
+
+ pq_sendbyte(out, 'P'); /* sending PREPARE message */
+
+ if (txn->txn_flags & TXN_COMMIT_PREPARED)
+ flags |= LOGICALREP_IS_COMMIT_PREPARED;
+ else if (txn->txn_flags & TXN_ROLLBACK_PREPARED)
+ flags |= LOGICALREP_IS_ROLLBACK_PREPARED;
+ else if (txn->txn_flags & TXN_PREPARE)
+ flags |= LOGICALREP_IS_PREPARE;
+
+ if (flags == 0)
+ elog(ERROR, "no two-phase flag set in txn_flags %d for prepare message", txn->txn_flags);
+
+ /* send the flags field */
+ pq_sendbyte(out, flags);
+
+ /* send fields */
+ pq_sendint64(out, prepare_lsn);
+ pq_sendint64(out, txn->end_lsn);
+ pq_sendint64(out, txn->commit_time);
+
+ /* send gid */
+ pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepCommitData *commit_data, uint8 *flags)
+{
+ /* read flags */
+ uint8 prep_flags = pq_getmsgbyte(in);
+
+ if (!(prep_flags & LOGICALREP_PREPARE_MASK))
+ elog(ERROR, "unrecognized flags %u in prepare message", prep_flags);
+
+ /* read fields */
+ commit_data->commit_lsn = pq_getmsgint64(in);
+ commit_data->end_lsn = pq_getmsgint64(in);
+ commit_data->committime = pq_getmsgint64(in);
+
+ /* read gid */
+ strlcpy(commit_data->gid, pq_getmsgstring(in), sizeof(commit_data->gid));
+
+ /* set flags */
+ *flags = prep_flags;
}
/*
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 5ac391d..4ab9def 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1264,31 +1264,25 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
* the top and subtransactions (using a k-way merge) and replay the changes in
* lsn order.
*/
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+ ReorderBuffer *rb, TransactionId xid,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
TimestampTz commit_time,
RepOriginId origin_id, XLogRecPtr origin_lsn)
{
- ReorderBufferTXN *txn;
volatile Snapshot snapshot_now;
volatile CommandId command_id = FirstCommandId;
bool using_subtxn;
ReorderBufferIterTXNState *volatile iterstate = NULL;
- txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
- false);
-
- /* unknown transaction, nothing to replay */
- if (txn == NULL)
- return;
-
txn->final_lsn = commit_lsn;
txn->end_lsn = end_lsn;
txn->commit_time = commit_time;
txn->origin_id = origin_id;
txn->origin_lsn = origin_lsn;
+
/*
* If this transaction didn't have any real changes in our database, it's
* OK not to have a snapshot. Note that ReorderBufferCommitChild will have
@@ -1326,20 +1320,62 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
{
ReorderBufferChange *change;
ReorderBufferChange *specinsert = NULL;
+ bool change_cleanup = false;
+ bool check_txn_status,
+ apply_started = false;
+ bool is_prepared = txn_prepared(txn);
+
+ /*
+ * Check once whether the xid has already committed. If it
+ * has not, we must consult the filter_decode_txn callback
+ * before each change to learn whether it is still OK to
+ * continue decoding this xid.
+ *
+ * This handles a concurrent abort that happens in parallel
+ * with the decoding activity.
+ */
+ check_txn_status = TransactionIdDidCommit(txn->xid)?
+ false : true;
if (using_subtxn)
BeginInternalSubTransaction("replay");
else
StartTransactionCommand();
- rb->begin(rb, txn);
-
iterstate = ReorderBufferIterTXNInit(rb, txn);
while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
{
Relation relation = NULL;
Oid reloid;
+ /*
+ * While decoding 2PC or while streaming uncommitted
+ * transactions, check whether this transaction still
+ * needs to be decoded. If the transaction was aborted
+ * or we were instructed to stop decoding, bail out
+ * early.
+ */
+ if (check_txn_status && rb->filter_decode_txn(rb, txn))
+ {
+ elog(LOG, "%s decoding of %s (%u)",
+ apply_started? "stopping":"skipping",
+ is_prepared? txn->gid:"",
+ txn->xid);
+ change_cleanup = true;
+ goto change_cleanuptxn;
+ }
+
+ /*
+ * The decode filter above gave the go-ahead to apply
+ * changes, so BEGIN the transaction on the other side
+ * before sending the first change.
+ */
+ if (apply_started == false)
+ {
+ rb->begin(rb, txn);
+ apply_started = true;
+ }
+
switch (change->action)
{
case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1546,6 +1582,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
}
}
+change_cleanuptxn:
/*
* There's a speculative insertion remaining, just clean in up, it
* can't have been successful, otherwise we'd gotten a confirmation
@@ -1561,8 +1598,20 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
ReorderBufferIterTXNFinish(rb, iterstate);
iterstate = NULL;
- /* call commit callback */
- rb->commit(rb, txn, commit_lsn);
+ if (change_cleanup)
+ {
+ /* call abort if we have sent any changes */
+ if (apply_started)
+ rb->abort(rb, txn, commit_lsn);
+ }
+ else
+ {
+ /* call commit or prepare callback */
+ if (txn_prepared(txn))
+ rb->prepare(rb, txn, commit_lsn);
+ else
+ rb->commit(rb, txn, commit_lsn);
+ }
/* this is just a sanity check against bad output plugin behaviour */
if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -1589,7 +1638,13 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
if (snapshot_now->copied)
ReorderBufferFreeSnap(rb, snapshot_now);
- /* remove potential on-disk data, and deallocate */
+ /*
+ * remove potential on-disk data, and deallocate.
+ *
+ * We remove it even for prepared transactions, because
+ * COMMIT PREPARED needs no data once the PREPARE has
+ * been decoded successfully.
+ */
ReorderBufferCleanupTXN(rb, txn);
}
PG_CATCH();
@@ -1624,6 +1679,136 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
}
/*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+ return rb->filter_prepare(rb, txn, xid, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferCommitInternal()
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /* unknown transaction, nothing to replay */
+ if (txn == NULL)
+ return;
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a two-phase transaction. It calls ReorderBufferCommitInternal()
+ * since all prepared transactions need to be decoded at PREPARE time.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /* unknown transaction, nothing to replay */
+ if (txn == NULL)
+ return;
+
+ txn->txn_flags |= TXN_PREPARE;
+ strlcpy(txn->gid, gid, sizeof(txn->gid));
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Check whether this transaction was sent as prepared to subscribers.
+ * Called while handling commit|abort prepared.
+ */
+bool
+ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+ const char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /*
+ * Always call the prepare filter. It's the job of the prepare
+ * filter to give us the *same* response for a given xid
+ * across multiple calls (including ones on restart)
+ */
+ return !(rb->filter_prepare(rb, txn, xid, gid));
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ABORT PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit)
+{
+ ReorderBufferTXN *txn;
+
+ /*
+ * The transaction may or may not exist (after a restart, for
+ * example). Either way, a 2PC transaction has no buffered
+ * changes left at this point, so allow it to be created below.
+ */
+ txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+ true);
+
+ txn->final_lsn = commit_lsn;
+ txn->end_lsn = end_lsn;
+ txn->commit_time = commit_time;
+ txn->origin_id = origin_id;
+ txn->origin_lsn = origin_lsn;
+ strlcpy(txn->gid, gid, sizeof(txn->gid));
+
+ if (is_commit)
+ {
+ txn->txn_flags |= TXN_COMMIT_PREPARED;
+ rb->commit_prepared(rb, txn, commit_lsn);
+ }
+ else
+ {
+ txn->txn_flags |= TXN_ROLLBACK_PREPARED;
+ rb->abort_prepared(rb, txn, commit_lsn);
+ }
+
+ /* cleanup: make sure there's no cache pollution */
+ ReorderBufferExecuteInvalidations(rb, txn);
+ ReorderBufferCleanupTXN(rb, txn);
+}
+
+/*
* Abort a transaction that possibly has previous changes. Needs to be first
* called for subtransactions and then for the toplevel xid.
*
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index fa5d9bb..444b5d5 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -452,8 +452,9 @@ static void
apply_handle_commit(StringInfo s)
{
LogicalRepCommitData commit_data;
+ uint8 flags = 0;
- logicalrep_read_commit(s, &commit_data);
+ logicalrep_read_commit(s, &commit_data, &flags);
Assert(commit_data.commit_lsn == remote_final_lsn);
@@ -467,7 +468,11 @@ apply_handle_commit(StringInfo s)
replorigin_session_origin_lsn = commit_data.end_lsn;
replorigin_session_origin_timestamp = commit_data.committime;
- CommitTransactionCommand();
+ if (flags & LOGICALREP_IS_COMMIT)
+ CommitTransactionCommand();
+ else if (flags & LOGICALREP_IS_ABORT)
+ AbortCurrentTransaction();
+
pgstat_report_stat(false);
store_flush_position(commit_data.end_lsn);
@@ -487,6 +492,120 @@ apply_handle_commit(StringInfo s)
pgstat_report_activity(STATE_IDLE, NULL);
}
+static void
+apply_handle_prepare_txn(LogicalRepCommitData *commit_data)
+{
+ Assert(commit_data->commit_lsn == remote_final_lsn);
+ /* The synchronization worker runs in a single transaction. */
+ if (IsTransactionState() && !am_tablesync_worker())
+ {
+ /* End the earlier transaction and start a new one */
+ BeginTransactionBlock();
+ CommitTransactionCommand();
+ StartTransactionCommand();
+ /*
+ * Update origin state so we can restart streaming from correct
+ * position in case of crash.
+ */
+ replorigin_session_origin_lsn = commit_data->end_lsn;
+ replorigin_session_origin_timestamp = commit_data->committime;
+
+ PrepareTransactionBlock(commit_data->gid);
+ CommitTransactionCommand();
+ pgstat_report_stat(false);
+
+ store_flush_position(commit_data->end_lsn);
+ }
+ else
+ {
+ /* Process any invalidation messages that might have accumulated. */
+ AcceptInvalidationMessages();
+ maybe_reread_subscription();
+ }
+
+ in_remote_transaction = false;
+
+ /* Process any tables that are being synchronized in parallel. */
+ process_syncing_tables(commit_data->end_lsn);
+
+ pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+static void
+apply_handle_commit_prepared_txn(LogicalRepCommitData *commit_data)
+{
+ /* there is no transaction when COMMIT PREPARED is called */
+ ensure_transaction();
+
+ /*
+ * Update origin state so we can restart streaming from correct
+ * position in case of crash.
+ */
+ replorigin_session_origin_lsn = commit_data->end_lsn;
+ replorigin_session_origin_timestamp = commit_data->committime;
+
+ FinishPreparedTransaction(commit_data->gid, true, false);
+ CommitTransactionCommand();
+ pgstat_report_stat(false);
+
+ store_flush_position(commit_data->end_lsn);
+ in_remote_transaction = false;
+
+ /* Process any tables that are being synchronized in parallel. */
+ process_syncing_tables(commit_data->end_lsn);
+
+ pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+static void
+apply_handle_rollback_prepared_txn(LogicalRepCommitData *commit_data)
+{
+ /* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+ ensure_transaction();
+
+ /*
+ * Update origin state so we can restart streaming from correct
+ * position in case of crash.
+ */
+ replorigin_session_origin_lsn = commit_data->end_lsn;
+ replorigin_session_origin_timestamp = commit_data->committime;
+
+ FinishPreparedTransaction(commit_data->gid, false, true);
+ CommitTransactionCommand();
+ pgstat_report_stat(false);
+
+ store_flush_position(commit_data->end_lsn);
+ in_remote_transaction = false;
+
+ /* Process any tables that are being synchronized in parallel. */
+ process_syncing_tables(commit_data->end_lsn);
+
+ pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+ LogicalRepCommitData commit_data;
+ uint8 flags = 0;
+
+ logicalrep_read_prepare(s, &commit_data, &flags);
+
+ if (flags & LOGICALREP_IS_PREPARE)
+ apply_handle_prepare_txn(&commit_data);
+ else if (flags & LOGICALREP_IS_COMMIT_PREPARED)
+ apply_handle_commit_prepared_txn(&commit_data);
+ else if (flags & LOGICALREP_IS_ROLLBACK_PREPARED)
+ apply_handle_rollback_prepared_txn(&commit_data);
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg("wrong [commit|rollback] prepare message")));
+}
+
/*
* Handle ORIGIN message.
*
@@ -884,10 +1003,14 @@ apply_dispatch(StringInfo s)
case 'B':
apply_handle_begin(s);
break;
- /* COMMIT */
+ /* COMMIT|ABORT */
case 'C':
apply_handle_commit(s);
break;
+ /* PREPARE, COMMIT PREPARED or ROLLBACK PREPARED */
+ case 'P':
+ apply_handle_prepare(s);
+ break;
/* INSERT */
case 'I':
apply_handle_insert(s);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 550b156..9628a53 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -37,11 +37,23 @@ static void pgoutput_begin_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn);
static void pgoutput_commit_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_abort_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr abort_lsn);
static void pgoutput_change(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, Relation rel,
ReorderBufferChange *change);
static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
RepOriginId origin_id);
+static bool pgoutput_filter_prepare(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, TransactionId xid, const char *gid);
+static bool pgoutput_decode_txn_filter(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_abort_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
static bool publications_valid;
@@ -79,7 +91,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
cb->begin_cb = pgoutput_begin_txn;
cb->change_cb = pgoutput_change;
cb->commit_cb = pgoutput_commit_txn;
+ cb->abort_cb = pgoutput_abort_txn;
+
+ cb->filter_prepare_cb = pgoutput_filter_prepare;
+ cb->prepare_cb = pgoutput_prepare_txn;
+ cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+ cb->abort_prepared_cb = pgoutput_abort_prepared_txn;
+
cb->filter_by_origin_cb = pgoutput_origin_filter;
+ cb->filter_decode_txn_cb = pgoutput_decode_txn_filter;
cb->shutdown_cb = pgoutput_shutdown;
}
@@ -252,6 +272,61 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
}
/*
+ * ABORT callback
+ */
+static void
+pgoutput_abort_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ OutputPluginUpdateProgress(ctx);
+
+ OutputPluginPrepareWrite(ctx, true);
+ logicalrep_write_abort(ctx->out, txn, abort_lsn);
+ OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ OutputPluginUpdateProgress(ctx);
+
+ OutputPluginPrepareWrite(ctx, true);
+ logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+ OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ OutputPluginUpdateProgress(ctx);
+
+ OutputPluginPrepareWrite(ctx, true);
+ logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+ OutputPluginWrite(ctx, true);
+}
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ OutputPluginUpdateProgress(ctx);
+
+ OutputPluginPrepareWrite(ctx, true);
+ logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+ OutputPluginWrite(ctx, true);
+}
+
+/*
* Sends the decoded DML over wire.
*/
static void
@@ -362,6 +437,18 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
}
/*
+ * Filter out unnecessary two-phase transactions.
+ *
+ * Currently, we forward all two-phase transactions
+ */
+static bool
+pgoutput_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid)
+{
+ return false;
+}
+
+/*
* Currently we always forward.
*/
static bool
@@ -372,6 +459,37 @@ pgoutput_origin_filter(LogicalDecodingContext *ctx,
}
/*
+ * Check if we should continue to decode this transaction.
+ *
+ * If it has aborted in the meanwhile, then there's no sense
+ * in decoding and sending the rest of the changes, we might
+ * as well ask the subscribers to abort immediately.
+ *
+ * This should be called if we are streaming a transaction
+ * before it's committed or if we are decoding a 2PC
+ * transaction. Otherwise we always decode committed
+ * transactions
+ *
+ * Additional checks can be added here, as needed
+ */
+static bool
+pgoutput_decode_txn_filter(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn)
+{
+ /*
+ * Due to caching, repeated TransactionIdDidAbort calls
+ * shouldn't be that expensive
+ */
+ if (txn != NULL &&
+ TransactionIdIsValid(txn->xid) &&
+ TransactionIdDidAbort(txn->xid))
+ return true;
+
+ /* if txn is NULL, filter it out */
+ return txn == NULL;
+}
+
+/*
* Shutdown the output plugin.
*
* Note, we don't need to clean the data->context as it's child context
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 4da1f8f..a19bae1 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -454,13 +454,13 @@ standard_ProcessUtility(PlannedStmt *pstmt,
case TRANS_STMT_COMMIT_PREPARED:
PreventTransactionChain(isTopLevel, "COMMIT PREPARED");
PreventCommandDuringRecovery("COMMIT PREPARED");
- FinishPreparedTransaction(stmt->gid, true);
+ FinishPreparedTransaction(stmt->gid, true, false);
break;
case TRANS_STMT_ROLLBACK_PREPARED:
PreventTransactionChain(isTopLevel, "ROLLBACK PREPARED");
PreventCommandDuringRecovery("ROLLBACK PREPARED");
- FinishPreparedTransaction(stmt->gid, false);
+ FinishPreparedTransaction(stmt->gid, false, false);
break;
case TRANS_STMT_ROLLBACK:
diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c
index 2b218e0..6c37832 100644
--- a/src/backend/utils/time/tqual.c
+++ b/src/backend/utils/time/tqual.c
@@ -1252,6 +1252,7 @@ HeapTupleSatisfiesVacuum(HeapTuple htup, TransactionId OldestXmin,
*/
SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
InvalidTransactionId);
+
return HEAPTUPLE_DEAD;
}
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index f5fbbea..cd946a1 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -15,6 +15,7 @@
#define TWOPHASE_H
#include "access/xlogdefs.h"
+#include "access/xact.h"
#include "datatype/timestamp.h"
#include "storage/lock.h"
@@ -46,15 +47,18 @@ extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
int *nxids_p);
+extern void ParsePrepareRecord(uint8 info, char *xlrec,
+ xl_xact_parsed_prepare *parsed);
extern void StandbyRecoverPreparedTransactions(void);
extern void RecoverPreparedTransactions(void);
extern void CheckPointTwoPhase(XLogRecPtr redo_horizon);
-extern void FinishPreparedTransaction(const char *gid, bool isCommit);
+extern void FinishPreparedTransaction(const char *gid, bool isCommit,
+ bool missing_ok);
extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
- XLogRecPtr end_lsn);
+ XLogRecPtr end_lsn, RepOriginId origin_id);
extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
extern void restoreTwoPhaseData(void);
#endif /* TWOPHASE_H */
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 118b0a8..118156e 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -21,6 +21,13 @@
#include "storage/sinval.h"
#include "utils/datetime.h"
+/*
+ * Maximum size of Global Transaction ID (including '\0').
+ *
+ * Note that the max value of GIDSIZE must fit in the uint16 gidlen,
+ * specified in TwoPhaseFileHeader.
+ */
+#define GIDSIZE 200
/*
* Xact isolation levels
@@ -156,6 +163,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
#define XACT_XINFO_HAS_TWOPHASE (1U << 4)
#define XACT_XINFO_HAS_ORIGIN (1U << 5)
#define XACT_XINFO_HAS_AE_LOCKS (1U << 6)
+#define XACT_XINFO_HAS_GID (1U << 7)
/*
* Also stored in xinfo, these indicating a variety of additional actions that
@@ -302,13 +310,40 @@ typedef struct xl_xact_parsed_commit
SharedInvalidationMessage *msgs;
TransactionId twophase_xid; /* only for 2PC */
+ char twophase_gid[GIDSIZE];
XLogRecPtr origin_lsn;
TimestampTz origin_timestamp;
} xl_xact_parsed_commit;
+typedef struct xl_xact_parsed_prepare
+{
+ Oid dbId; /* MyDatabaseId */
+
+ int nsubxacts;
+ TransactionId *subxacts;
+
+ int ncommitrels;
+ RelFileNode *commitrels;
+
+ int nabortrels;
+ RelFileNode *abortrels;
+
+ int nmsgs;
+ SharedInvalidationMessage *msgs;
+
+ TransactionId twophase_xid;
+ char twophase_gid[GIDSIZE];
+
+ XLogRecPtr origin_lsn;
+ TimestampTz origin_timestamp;
+} xl_xact_parsed_prepare;
+
typedef struct xl_xact_parsed_abort
{
+ Oid dbId;
+ Oid tsId;
+
TimestampTz xact_time;
uint32 xinfo;
@@ -319,6 +354,10 @@ typedef struct xl_xact_parsed_abort
RelFileNode *xnodes;
TransactionId twophase_xid; /* only for 2PC */
+ char twophase_gid[GIDSIZE];
+
+ XLogRecPtr origin_lsn;
+ TimestampTz origin_timestamp;
} xl_xact_parsed_abort;
@@ -386,12 +425,13 @@ extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
int nmsgs, SharedInvalidationMessage *msgs,
bool relcacheInval, bool forceSync,
int xactflags,
- TransactionId twophase_xid);
+ TransactionId twophase_xid, const char *twophase_gid);
extern XLogRecPtr XactLogAbortRecord(TimestampTz abort_time,
int nsubxacts, TransactionId *subxacts,
int nrels, RelFileNode *rels,
- int xactflags, TransactionId twophase_xid);
+ int xactflags, TransactionId twophase_xid,
+ const char *twophase_gid);
extern void xact_redo(XLogReaderState *record);
/* xactdesc.c */
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 7f0e0fa..758de40 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -82,6 +82,11 @@ typedef struct LogicalDecodingContext
bool prepared_write;
XLogRecPtr write_location;
TransactionId write_xid;
+
+ /*
+ * Capabilities of the output plugin.
+ */
+ bool enable_twophase;
} LogicalDecodingContext;
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index a9736e1..7f51f75 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -69,11 +69,20 @@ typedef struct LogicalRepBeginData
TransactionId xid;
} LogicalRepBeginData;
+#define LOGICALREP_IS_COMMIT 0x01
+#define LOGICALREP_IS_ABORT 0x02
+#define LOGICALREP_IS_PREPARE 0x04
+#define LOGICALREP_IS_COMMIT_PREPARED 0x08
+#define LOGICALREP_IS_ROLLBACK_PREPARED 0x10
+#define LOGICALREP_COMMIT_MASK (LOGICALREP_IS_COMMIT | LOGICALREP_IS_ABORT)
+#define LOGICALREP_PREPARE_MASK (LOGICALREP_IS_PREPARE | LOGICALREP_IS_COMMIT_PREPARED | LOGICALREP_IS_ROLLBACK_PREPARED)
typedef struct LogicalRepCommitData
{
+ uint8 flag;
XLogRecPtr commit_lsn;
XLogRecPtr end_lsn;
TimestampTz committime;
+ char gid[GIDSIZE];
} LogicalRepCommitData;
extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
@@ -81,8 +90,14 @@ extern void logicalrep_read_begin(StringInfo in,
LogicalRepBeginData *begin_data);
extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+extern void logicalrep_write_abort(StringInfo out, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
extern void logicalrep_read_commit(StringInfo in,
- LogicalRepCommitData *commit_data);
+ LogicalRepCommitData *commit_data, uint8 *flags);
+extern void logicalrep_read_prepare(StringInfo in,
+ LogicalRepCommitData *commit_data, uint8 *flags);
extern void logicalrep_write_origin(StringInfo out, const char *origin,
XLogRecPtr origin_lsn);
extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 26ff024..5c61f76 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -68,6 +68,46 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
XLogRecPtr commit_lsn);
/*
+ * Called for an implicit ABORT of a transaction.
+ */
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+/*
+ * Called before decoding a PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to the prepare
+ * and commit_prepared/abort_prepared callbacks, or should wait until
+ * COMMIT PREPARED and be sent as a regular transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+/*
* Called for the generic logical decoding messages.
*/
typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
@@ -85,6 +125,12 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
RepOriginId origin_id);
/*
+ * Filter to check if we should continue to decode this transaction
+ */
+typedef bool (*LogicalDecodeFilterDecodeTxnCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn);
+
+/*
* Called to shutdown an output plugin.
*/
typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
@@ -98,8 +144,14 @@ typedef struct OutputPluginCallbacks
LogicalDecodeBeginCB begin_cb;
LogicalDecodeChangeCB change_cb;
LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeAbortCB abort_cb;
LogicalDecodeMessageCB message_cb;
+ LogicalDecodeFilterPrepareCB filter_prepare_cb;
+ LogicalDecodePrepareCB prepare_cb;
+ LogicalDecodeCommitPreparedCB commit_prepared_cb;
+ LogicalDecodeAbortPreparedCB abort_prepared_cb;
LogicalDecodeFilterByOriginCB filter_by_origin_cb;
+ LogicalDecodeFilterDecodeTxnCB filter_decode_txn_cb;
LogicalDecodeShutdownCB shutdown_cb;
} OutputPluginCallbacks;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index b18ce5a..51095e1 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
#define REORDERBUFFER_H
#include "access/htup_details.h"
+#include "access/twophase.h"
#include "lib/ilist.h"
#include "storage/sinval.h"
#include "utils/hsearch.h"
@@ -137,13 +138,28 @@ typedef struct ReorderBufferChange
dlist_node node;
} ReorderBufferChange;
+
+/* TODO: convert existing bools into flags later */
+/* values for txn_flags */
+#define TXN_HAS_CATALOG_CHANGES 0x0001
+#define TXN_IS_SUBXACT 0x0002
+#define TXN_PREPARE 0x0004
+#define TXN_COMMIT_PREPARED 0x0008
+#define TXN_ROLLBACK_PREPARED 0x0010
+#define txn_prepared(txn) (((txn)->txn_flags & TXN_PREPARE) != 0)
+
typedef struct ReorderBufferTXN
{
+ int txn_flags;
+
/*
* The transactions transaction id, can be a toplevel or sub xid.
*/
TransactionId xid;
+ /* In case of 2PC we need to pass GID to output plugin */
+ char gid[GIDSIZE];
+
/* did the TX have catalog changes */
bool has_catalog_changes;
@@ -292,6 +308,40 @@ typedef void (*ReorderBufferCommitCB) (
ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+/* abort callback signature */
+typedef void (*ReorderBufferAbortCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+typedef bool (*ReorderBufferFilterDecodeTxnCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn);
+
+typedef bool (*ReorderBufferFilterPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/* abort prepared callback signature */
+typedef void (*ReorderBufferAbortPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
/* message callback signature */
typedef void (*ReorderBufferMessageCB) (
ReorderBuffer *rb,
@@ -327,6 +377,12 @@ struct ReorderBuffer
ReorderBufferBeginCB begin;
ReorderBufferApplyChangeCB apply_change;
ReorderBufferCommitCB commit;
+ ReorderBufferAbortCB abort;
+ ReorderBufferFilterDecodeTxnCB filter_decode_txn;
+ ReorderBufferFilterPrepareCB filter_prepare;
+ ReorderBufferPrepareCB prepare;
+ ReorderBufferCommitPreparedCB commit_prepared;
+ ReorderBufferAbortPreparedCB abort_prepared;
ReorderBufferMessageCB message;
/*
@@ -369,6 +425,11 @@ void ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot
void ReorderBufferCommit(ReorderBuffer *, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit);
void ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
void ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
@@ -392,6 +453,15 @@ void ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
bool ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
bool ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
+bool ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+ const char *gid);
+bool ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+ const char *gid);
+void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid);
ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
void ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
diff --git a/src/test/subscription/t/009_twophase.pl b/src/test/subscription/t/009_twophase.pl
new file mode 100644
index 0000000..c7f373d
--- /dev/null
+++ b/src/test/subscription/t/009_twophase.pl
@@ -0,0 +1,163 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 12;
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf(
+ 'postgresql.conf', qq(
+ max_prepared_transactions = 10
+ ));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf(
+ 'postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres', "CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO tab_full SELECT generate_series(1,10)");
+$node_publisher->safe_psql('postgres', "CREATE TABLE tab_full2 (x text)");
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO tab_full2 VALUES ('a'), ('b'), ('b')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_full2 (x text)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+"ALTER PUBLICATION tap_pub ADD TABLE tab_full, tab_full2"
+);
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+ or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+ or die "Timed out while waiting for subscriber to synchronize data";
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres',
+ "BEGIN;INSERT INTO tab_full VALUES (11);PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+ or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+ is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+ "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+ or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+ is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+$result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+ is($result, qq(0), 'transaction is committed on subscriber');
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres',
+ "BEGIN;INSERT INTO tab_full VALUES (12);PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+ or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+ is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+ "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+ or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+ is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+ is($result, qq(0), 'transaction is aborted on subscriber');
+
+# Check that commit prepared is decoded properly on crash restart
+$node_publisher->safe_psql('postgres', "
+ BEGIN;
+ INSERT INTO tab_full VALUES (12);
+ INSERT INTO tab_full VALUES (13);
+ PREPARE TRANSACTION 'test_prepared_tab';");
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+ or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+# TODO add test cases involving DDL. This can be added after we add functionality
+# to replicate DDL changes to subscriber.
+
+# check all the cleanup
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+ "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+ "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+ "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+ 'check subscription relation status was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+ "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+ "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
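As a reading aid for the output_plugin.h and reorderbuffer.h hunks in the patch above, here is a minimal standalone sketch of how a plugin's filter_prepare decision could be written against the new txn_flags bits. It does not include any PostgreSQL headers: `DemoTxn`, `DemoOptions`, `DEMO_GIDSIZE` and `demo_filter_prepare` are hypothetical stand-ins, and only the flag names and values are copied from the patch. The policy shown (hold transactions with catalog changes until COMMIT PREPARED unless an option overrides it) and the convention that `true` means "filter out" are assumptions inferred from the test_decoding regression comments, not a statement of what the plugin actually does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag values copied from the reorderbuffer.h hunk of the patch. */
#define TXN_HAS_CATALOG_CHANGES 0x0001
#define TXN_IS_SUBXACT          0x0002
#define TXN_PREPARE             0x0004

/* Hypothetical buffer size for the demo; not taken from PostgreSQL. */
#define DEMO_GIDSIZE 64

/* Hypothetical stand-in for ReorderBufferTXN; not the real struct. */
typedef struct DemoTxn
{
    int  txn_flags;
    char gid[DEMO_GIDSIZE];
} DemoTxn;

/*
 * Hypothetical plugin options, loosely modelled on the 'twophase-decoding'
 * and 'twophase-decode-with-catalog-changes' options used in prepared.sql.
 */
typedef struct DemoOptions
{
    bool twophase_decoding;
    bool decode_with_catalog_changes;
} DemoOptions;

/*
 * Assumed convention: return true to filter the PREPARE out, so the
 * transaction is sent as an ordinary one at COMMIT PREPARED; return false
 * to decode it at PREPARE time.
 */
static bool
demo_filter_prepare(const DemoOptions *opts, const DemoTxn *txn)
{
    assert(opts != NULL && txn != NULL);

    /* Two-phase decoding not requested: always wait for COMMIT PREPARED. */
    if (!opts->twophase_decoding)
        return true;

    /* Transactions that touched the catalog are held back by default. */
    if ((txn->txn_flags & TXN_HAS_CATALOG_CHANGES) != 0 &&
        !opts->decode_with_catalog_changes)
        return true;

    return false;
}
```

Under these assumptions, a transaction such as 'test_prepared#3' (which runs ALTER TABLE) would be filtered for a slot that only sets 'twophase-decoding', but decoded at PREPARE time for a slot that also sets 'twophase-decode-with-catalog-changes'.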
On 12/19/17 03:37, Nikhil Sontakke wrote:
Note that this patch does not contain the HeapTupleSatisfiesVacuum
changes. I believe we need changes to HeapTupleSatisfiesVacuum given
that logical decoding changes the assumption that catalog tuples
belonging to a transaction which never committed can be reclaimed
immediately. With 2PC logical decoding or streaming logical decoding,
we can always have a split time window in which the ongoing decode
cycle needs those tuples. The solution is that even for aborted
transactions, we do not return HEAPTUPLE_DEAD if the transaction id is
newer than the OldestXmin (same logic we use for deleted tuples of
committed transactions). We can do this only for catalog table rows
(both system and user defined) to limit the scope of impact. In any
case, this needs to be a separate patch along with a separate
discussion thread.
Are you working on that as well?
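For readers following along, the rule described in the quoted paragraph can be sketched in isolation. This is only an illustration of the stated decision, not the actual HeapTupleSatisfiesVacuum code: `DemoTupleInfo` and `demo_vacuum_status` are hypothetical, the enum mirrors just two of the HTSV result names, returning the "recently dead" status is an assumption made by analogy with deleted tuples of committed transactions, and a plain integer comparison stands in for wraparound-aware xid comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of the vacuum visibility results. */
typedef enum DemoHTSVResult
{
    DEMO_HEAPTUPLE_DEAD,          /* can be reclaimed immediately */
    DEMO_HEAPTUPLE_RECENTLY_DEAD  /* must be kept for now */
} DemoHTSVResult;

/* Hypothetical summary of the tuple facts the rule depends on. */
typedef struct DemoTupleInfo
{
    uint32_t xmin;            /* inserting transaction id */
    bool     xmin_aborted;    /* inserter never committed */
    bool     is_catalog_row;  /* system or user-defined catalog table */
} DemoTupleInfo;

/*
 * Status of a tuple whose inserting transaction aborted.
 *
 * Without the proposed change, such a tuple is always DEAD. With it, a
 * catalog row whose xmin is not older than oldest_xmin is kept, because an
 * ongoing decode of that transaction may still need to read it.
 */
static DemoHTSVResult
demo_vacuum_status(const DemoTupleInfo *tup, uint32_t oldest_xmin)
{
    assert(tup != NULL && tup->xmin_aborted);

    /* Simplification: real xid comparison must handle wraparound. */
    if (tup->is_catalog_row && tup->xmin >= oldest_xmin)
        return DEMO_HEAPTUPLE_RECENTLY_DEAD;

    return DEMO_HEAPTUPLE_DEAD;
}
```

Restricting the check to catalog rows keeps ordinary tables on the existing fast path, which is the "limit the scope of impact" point made in the quoted text.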
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Note that this patch does not contain the HeapTupleSatisfiesVacuum
changes. I believe we need changes to HeapTupleSatisfiesVacuum given
that logical decoding changes the assumption that catalog tuples
belonging to a transaction which never committed can be reclaimed
immediately. With 2PC logical decoding or streaming logical decoding,
we can always have a split time window in which the ongoing decode
cycle needs those tuples. The solution is that even for aborted
transactions, we do not return HEAPTUPLE_DEAD if the transaction id is
newer than the OldestXmin (same logic we use for deleted tuples of
committed transactions). We can do this only for catalog table rows
(both system and user defined) to limit the scope of impact. In any
case, this needs to be a separate patch along with a separate
discussion thread.
Are you working on that as well?
Sure, I was planning to work on that after getting the documentation
for this patch out of the way.
Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Hi,
Are you working on that as well?
Sure, I was planning to work on that after getting the documentation
for this patch out of the way.
PFA, patch with documentation. Have added requisite entries in the
logical decoding output plugins section. No changes are needed
elsewhere, AFAICS.
I will submit the HeapTupleSatisfiesVacuum patch on a separate
discussion, soon.
Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Attachments:
2pc_logical_22_12_17.patch (application/octet-stream)
diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 6c18189d9d..79b9622600 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -26,7 +26,7 @@ installcheck:;
# installation, allow to do so, but only if requested explicitly.
installcheck-force: regresscheck-install-force isolationcheck-install-force
-check: regresscheck isolationcheck
+check: regresscheck isolationcheck 2pc-check
submake-regress:
$(MAKE) -C $(top_builddir)/src/test/regress all
@@ -66,3 +66,6 @@ isolationcheck-install-force: all | submake-isolation submake-test_decoding temp
isolationcheck isolationcheck-install-force
temp-install: EXTRA_INSTALL=contrib/test_decoding
+
+2pc-check: temp-install
+ $(prove_check)
diff --git a/contrib/test_decoding/expected/prepared.out b/contrib/test_decoding/expected/prepared.out
index 46e915d4ff..2df0b6c198 100644
--- a/contrib/test_decoding/expected/prepared.out
+++ b/contrib/test_decoding/expected/prepared.out
@@ -6,19 +6,123 @@ SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_d
init
(1 row)
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc', 'test_decoding');
+ ?column?
+----------
+ init
+(1 row)
+
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc_nofilter', 'test_decoding');
+ ?column?
+----------
+ init
+(1 row)
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+-- Reused queries
+\set get_no2pc 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'');'
+\set get_with2pc 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot_2pc'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'', ''twophase-decoding'', ''1'');'
+\set get_with2pc_nofilter 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot_2pc_nofilter'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'', ''twophase-decoding'', ''1'', ''twophase-decode-with-catalog-changes'', ''1'');'
-- test simple successful use of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (1);
PREPARE TRANSACTION 'test_prepared#1';
+:get_no2pc
+ data
+------
+(0 rows)
+
+:get_with2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ PREPARE TRANSACTION 'test_prepared#1'
+(3 rows)
+
+:get_with2pc_nofilter
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ PREPARE TRANSACTION 'test_prepared#1'
+(3 rows)
+
COMMIT PREPARED 'test_prepared#1';
+:get_no2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ COMMIT
+(3 rows)
+
+:get_with2pc
+ data
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
+:get_with2pc_nofilter
+ data
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
INSERT INTO test_prepared1 VALUES (2);
-- test abort of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (3);
PREPARE TRANSACTION 'test_prepared#2';
+:get_no2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+(3 rows)
+
+:get_with2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(6 rows)
+
+:get_with2pc_nofilter
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(6 rows)
+
ROLLBACK PREPARED 'test_prepared#2';
+:get_no2pc
+ data
+------
+(0 rows)
+
+:get_with2pc
+ data
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
+:get_with2pc_nofilter
+ data
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
INSERT INTO test_prepared1 VALUES (4);
-- test prepared xact containing ddl
BEGIN;
@@ -26,45 +130,226 @@ INSERT INTO test_prepared1 VALUES (5);
ALTER TABLE test_prepared1 ADD COLUMN data text;
INSERT INTO test_prepared1 VALUES (6, 'frakbar');
PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
-INSERT INTO test_prepared2 VALUES (7);
-COMMIT PREPARED 'test_prepared#3';
--- make sure stuff still works
-INSERT INTO test_prepared1 VALUES (8);
-INSERT INTO test_prepared2 VALUES (9);
--- cleanup
-DROP TABLE test_prepared1;
-DROP TABLE test_prepared2;
--- show results
-SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'test_prepared1'::regclass;
+ relation | locktype | mode
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
+:get_no2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4
+ COMMIT
+(3 rows)
+
+:get_with2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4
+ COMMIT
+(3 rows)
+
+:get_with2pc_nofilter
data
-------------------------------------------------------------------------
BEGIN
- table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:4
COMMIT
BEGIN
- table public.test_prepared1: INSERT: id[integer]:2
+ table public.test_prepared1: INSERT: id[integer]:5
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(7 rows)
+
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists. Our 2pc filter callback will skip decoding of xacts
+-- with catalog changes at PREPARE time, so we don't decode it now.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
+INSERT INTO test_prepared2 VALUES (7);
+:get_no2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:7
COMMIT
+(3 rows)
+
+:get_with2pc
+ data
+----------------------------------------------------
BEGIN
- table public.test_prepared1: INSERT: id[integer]:4
+ table public.test_prepared2: INSERT: id[integer]:7
COMMIT
+(3 rows)
+
+:get_with2pc_nofilter
+ data
+----------------------------------------------------
BEGIN
table public.test_prepared2: INSERT: id[integer]:7
COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+:get_no2pc
+ data
+-------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:5
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
+ COMMIT
+(4 rows)
+
+:get_with2pc
+ data
+-------------------------------------------------------------------------
BEGIN
table public.test_prepared1: INSERT: id[integer]:5
table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
COMMIT
+(4 rows)
+
+:get_with2pc_nofilter
+ data
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#3'
+(1 row)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (8);
+INSERT INTO test_prepared2 VALUES (9);
+:get_no2pc
+ data
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:9
+ COMMIT
+(6 rows)
+
+:get_with2pc
+ data
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:9
+ COMMIT
+(6 rows)
+
+:get_with2pc_nofilter
+ data
+--------------------------------------------------------------------
BEGIN
table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
COMMIT
BEGIN
table public.test_prepared2: INSERT: id[integer]:9
COMMIT
-(22 rows)
+(6 rows)
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'pg_class'::regclass;
+ relation | locktype | mode
+----------+----------+------
+(0 rows)
+
+-- Shouldn't see anything with 2pc decoding off
+:get_no2pc
+ data
+-----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:12
+ COMMIT
+(3 rows)
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+:get_with2pc
+ data
+-----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:12
+ PREPARE TRANSACTION 'test_prepared_lock2'
+ COMMIT PREPARED 'test_prepared_lock2'
+(4 rows)
+
+:get_with2pc_nofilter
+ data
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:12
+ PREPARE TRANSACTION 'test_prepared_lock2'
+ COMMIT PREPARED 'test_prepared_lock2'
+(8 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- Both will work normally after we commit
+:get_no2pc
+ data
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ COMMIT
+(4 rows)
+
+:get_with2pc
+ data
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ COMMIT
+(4 rows)
+
+:get_with2pc_nofilter
+ data
+--------------------------------------
+ COMMIT PREPARED 'test_prepared_lock'
+(1 row)
+
+-- cleanup
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+------
+(0 rows)
SELECT pg_drop_replication_slot('regression_slot');
pg_drop_replication_slot
@@ -72,3 +357,15 @@ SELECT pg_drop_replication_slot('regression_slot');
(1 row)
+SELECT pg_drop_replication_slot('regression_slot_2pc');
+ pg_drop_replication_slot
+--------------------------
+
+(1 row)
+
+SELECT pg_drop_replication_slot('regression_slot_2pc_nofilter');
+ pg_drop_replication_slot
+--------------------------
+
+(1 row)
+
diff --git a/contrib/test_decoding/sql/prepared.sql b/contrib/test_decoding/sql/prepared.sql
index e72639767e..4197766c50 100644
--- a/contrib/test_decoding/sql/prepared.sql
+++ b/contrib/test_decoding/sql/prepared.sql
@@ -1,22 +1,41 @@
-- predictability
SET synchronous_commit = on;
SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc', 'test_decoding');
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc_nofilter', 'test_decoding');
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+
+-- Reused queries
+\set get_no2pc 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'');'
+\set get_with2pc 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot_2pc'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'', ''twophase-decoding'', ''1'');'
+\set get_with2pc_nofilter 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot_2pc_nofilter'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'', ''twophase-decoding'', ''1'', ''twophase-decode-with-catalog-changes'', ''1'');'
-- test simple successful use of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (1);
PREPARE TRANSACTION 'test_prepared#1';
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
COMMIT PREPARED 'test_prepared#1';
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
INSERT INTO test_prepared1 VALUES (2);
-- test abort of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (3);
PREPARE TRANSACTION 'test_prepared#2';
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
ROLLBACK PREPARED 'test_prepared#2';
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
INSERT INTO test_prepared1 VALUES (4);
@@ -27,18 +46,74 @@ ALTER TABLE test_prepared1 ADD COLUMN data text;
INSERT INTO test_prepared1 VALUES (6, 'frakbar');
PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'test_prepared1'::regclass;
+
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists. Our 2pc filter callback will skip decoding of xacts
+-- with catalog changes at PREPARE time, so we don't decode it now.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
INSERT INTO test_prepared2 VALUES (7);
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
COMMIT PREPARED 'test_prepared#3';
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
-- make sure stuff still works
INSERT INTO test_prepared1 VALUES (8);
INSERT INTO test_prepared2 VALUES (9);
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'pg_class'::regclass;
+
+-- Shouldn't see anything with 2pc decoding off
+:get_no2pc
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+:get_with2pc
+:get_with2pc_nofilter
+RESET statement_timeout;
+
+COMMIT PREPARED 'test_prepared_lock';
+
+-- Both will work normally after we commit
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
-- cleanup
DROP TABLE test_prepared1;
@@ -48,3 +123,5 @@ DROP TABLE test_prepared2;
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
SELECT pg_drop_replication_slot('regression_slot');
+SELECT pg_drop_replication_slot('regression_slot_2pc');
+SELECT pg_drop_replication_slot('regression_slot_2pc_nofilter');
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100755
index 0000000000..c0126fca5b
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,85 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+ 'postgresql.conf', qq(
+ max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+ "INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+ "SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# This test exercises a concurrent abort while logical decoding is in
+# progress. The decode-delay option makes the decoding of each change sleep
+# for that many seconds. We fire off a ROLLBACK PREPARED from another session
+# while this delayed decoding is ongoing. That should stop decoding immediately,
+# and the next pg_logical_slot_get_changes call should show only a few records
+# decoded from the two-phase transaction.
+
+# consume all changes so far
+#$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'twophase-decoding', '1', 'twophase-decode-with-catalog-changes', '1');");
+
+$node_logical->safe_psql('postgres', "
+ BEGIN;
+ INSERT INTO tab VALUES (11);
+ INSERT INTO tab VALUES (12);
+ PREPARE TRANSACTION 'test_prepared_tab';");
+
+# start decoding the above with decode-delay in the background.
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should only decode 1 INSERT record and should include
+# an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'twophase-decoding', '1', 'twophase-decode-with-catalog-changes', '1', 'decode-delay', '3');\" \&");
+
+# sleep for a little while (shorter than decode-delay)
+$node_logical->safe_psql('postgres', "select pg_sleep(1)");
+
+# roll back the prepared transaction while the decoder is still sleeping
+# (for decode-delay seconds) on its first change record
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# wait for decoding to stop
+$node_logical->psql('postgres', "select pg_sleep(4)");
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'twophase-decoding', '1', 'twophase-decode-with-catalog-changes', '1');");
+
+# check for occurrence of log about stopping decoding
+my $output_file = slurp_file($node_logical->logfile());
+my $abort_str = "stopping decoding of test_prepared_tab ";
+like($output_file, qr/$abort_str/, "ABORT found in server log");
+
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+ BEGIN;
+ INSERT INTO tab VALUES (11);
+ INSERT INTO tab VALUES (12);
+ PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'twophase-decoding', '1', 'twophase-decode-with-catalog-changes', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit after the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'twophase-decoding', '1', 'twophase-decode-with-catalog-changes', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 135b3b7638..7dc74f5439 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -24,6 +24,8 @@
#include "replication/message.h"
#include "replication/origin.h"
+#include "storage/procarray.h"
+
#include "utils/builtins.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
@@ -46,6 +48,9 @@ typedef struct
bool skip_empty_xacts;
bool xact_wrote_changes;
bool only_local;
+ bool twophase_decoding;
+ bool twophase_decode_with_catalog_changes;
+ int decode_delay; /* seconds to sleep after every change record */
} TestDecodingData;
static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -59,6 +64,8 @@ static void pg_output_begin(LogicalDecodingContext *ctx,
bool last_write);
static void pg_decode_commit_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pg_decode_abort_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr abort_lsn);
static void pg_decode_change(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, Relation rel,
ReorderBufferChange *change);
@@ -68,6 +75,20 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr message_lsn,
bool transactional, const char *prefix,
Size sz, const char *message);
+static bool pg_filter_prepare(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid);
+static bool pg_filter_decode_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
void
_PG_init(void)
@@ -85,9 +106,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
cb->begin_cb = pg_decode_begin_txn;
cb->change_cb = pg_decode_change;
cb->commit_cb = pg_decode_commit_txn;
+ cb->abort_cb = pg_decode_abort_txn;
cb->filter_by_origin_cb = pg_decode_filter;
cb->shutdown_cb = pg_decode_shutdown;
cb->message_cb = pg_decode_message;
+ cb->filter_prepare_cb = pg_filter_prepare;
+ cb->filter_decode_txn_cb = pg_filter_decode_txn;
+ cb->prepare_cb = pg_decode_prepare_txn;
+ cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+ cb->abort_prepared_cb = pg_decode_abort_prepared_txn;
}
@@ -107,6 +134,9 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
data->include_timestamp = false;
data->skip_empty_xacts = false;
data->only_local = false;
+ data->twophase_decoding = false;
+ data->twophase_decode_with_catalog_changes = false;
+ data->decode_delay = 0;
ctx->output_plugin_private = data;
@@ -176,6 +206,42 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
errmsg("could not parse value \"%s\" for parameter \"%s\"",
strVal(elem->arg), elem->defname)));
}
+ else if (strcmp(elem->defname, "twophase-decoding") == 0)
+ {
+
+ if (elem->arg == NULL)
+ data->twophase_decoding = true;
+ else if (!parse_bool(strVal(elem->arg), &data->twophase_decoding))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("could not parse value \"%s\" for parameter \"%s\"",
+ strVal(elem->arg), elem->defname)));
+ }
+ else if (strcmp(elem->defname, "twophase-decode-with-catalog-changes") == 0)
+ {
+ if (elem->arg == NULL)
+ data->twophase_decode_with_catalog_changes = true;
+ else if (!parse_bool(strVal(elem->arg), &data->twophase_decode_with_catalog_changes))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("could not parse value \"%s\" for parameter \"%s\"",
+ strVal(elem->arg), elem->defname)));
+ }
+ else if (strcmp(elem->defname, "decode-delay") == 0)
+ {
+ if (elem->arg == NULL)
+ data->decode_delay = 2; /* default to 2 seconds */
+ else
+ data->decode_delay = pg_atoi(strVal(elem->arg),
+ sizeof(int), 0);
+
+ if (data->decode_delay <= 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("value for parameter \"%s\" must be positive,"
+ " but \"%s\" was specified",
+ elem->defname, strVal(elem->arg))));
+ }
else
{
ereport(ERROR,
@@ -244,6 +310,156 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
OutputPluginWrite(ctx, true);
}
+/* ABORT callback */
+static void
+pg_decode_abort_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+ if (data->include_xids)
+ appendStringInfo(ctx->out, "ABORT %u", txn->xid);
+ else
+ appendStringInfoString(ctx->out, "ABORT");
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* Filter out unnecessary two-phase transactions */
+static bool
+pg_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ /* treat all transactions as one-phase */
+ if (!data->twophase_decoding)
+ return true;
+
+ if (txn && txn->has_catalog_changes &&
+ !data->twophase_decode_with_catalog_changes)
+ return true;
+
+ /*
+ * even if txn is NULL, decode since twophase_decoding is set
+ */
+ return false;
+}
+
+/*
+ * Check if we should continue to decode this transaction.
+ *
+ * If it has aborted in the meanwhile, then there's no sense
+ * in decoding and sending the rest of the changes, we might
+ * as well ask the subscribers to abort immediately.
+ *
+ * This should be called if we are streaming a transaction
+ * before it's committed or if we are decoding a 2PC
+ * transaction. Otherwise we always decode committed
+ * transactions.
+ *
+ * Additional checks can be added here, as needed.
+ */
+static bool
+pg_filter_decode_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn)
+{
+ /*
+ * Due to caching, repeated TransactionIdDidAbort calls
+ * shouldn't be that expensive
+ */
+ if (txn != NULL &&
+ TransactionIdIsValid(txn->xid) &&
+ TransactionIdDidAbort(txn->xid))
+ return true;
+
+ /* if txn is NULL, filter it out */
+ return (txn == NULL);
+}
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (!data->twophase_decoding)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* ABORT PREPARED callback */
+static void
+pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (!data->twophase_decoding)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
static bool
pg_decode_filter(LogicalDecodingContext *ctx,
RepOriginId origin_id)
@@ -412,6 +628,10 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
class_form = RelationGetForm(relation);
tupdesc = RelationGetDescr(relation);
+ /* if decode_delay is specified, sleep for that many seconds */
+ if (data->decode_delay > 0)
+ pg_usleep(data->decode_delay * 1000000L);
+
/* Avoid leaking memory by using and resetting our own context */
old = MemoryContextSwitchTo(data->context);
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 6bab1b9b32..abaa57601f 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -382,8 +382,14 @@ typedef struct OutputPluginCallbacks
LogicalDecodeBeginCB begin_cb;
LogicalDecodeChangeCB change_cb;
LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeAbortCB abort_cb;
LogicalDecodeMessageCB message_cb;
+ LogicalDecodeFilterPrepareCB filter_prepare_cb;
+ LogicalDecodePrepareCB prepare_cb;
+ LogicalDecodeCommitPreparedCB commit_prepared_cb;
+ LogicalDecodeAbortPreparedCB abort_prepared_cb;
LogicalDecodeFilterByOriginCB filter_by_origin_cb;
+ LogicalDecodeFilterDecodeTxnCB filter_decode_txn_cb;
LogicalDecodeShutdownCB shutdown_cb;
} OutputPluginCallbacks;
@@ -452,7 +458,12 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
never get
decoded. Successful savepoints are
folded into the transaction containing them in the order they were
- executed within that transaction.
+ executed within that transaction. A transaction that is prepared for
+ a two-phase commit using <command>PREPARE TRANSACTION</command> will
+ also be decoded if the output plugin callbacks needed for decoding
+ it are provided. The transaction currently being decoded may be
+ aborted concurrently by a <command>ROLLBACK PREPARED</command>
+ command. In that case, logical decoding of that transaction stops midway.
</para>
<note>
@@ -548,6 +559,74 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
</para>
</sect3>
+ <sect3 id="logicaldecoding-output-plugin-prepare">
+ <title>Transaction Prepare Callback</title>
+
+ <para>
+ The optional <function>prepare_cb</function> callback is called whenever
+ a transaction which is prepared for two-phase commit has been
+ decoded. The <function>change_cb</function> callbacks for all modified
+ rows will have been called before this, if there have been any modified
+ rows.
+<programlisting>
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+ <title>Commit Prepared Transaction Callback</title>
+
+ <para>
+ The optional <function>commit_prepared_cb</function> callback is called whenever
+ a <command>COMMIT PREPARED</command> has been decoded. The <parameter>gid</parameter> field,
+ which is part of the <parameter>txn</parameter> parameter, can be used in this
+ callback.
+<programlisting>
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-abort-prepared">
+ <title>Rollback Prepared Transaction Callback</title>
+
+ <para>
+ The optional <function>abort_prepared_cb</function> callback is called whenever
+ a <command>ROLLBACK PREPARED</command> has been decoded. The <parameter>gid</parameter> field,
+ which is part of the <parameter>txn</parameter> parameter, can be used in this
+ callback.
+<programlisting>
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-abort">
+ <title>Transaction Abort Callback</title>
+
+ <para>
+ The required <function>abort_cb</function> callback is called whenever
+ a transaction abort has to be initiated. This can happen if we are
+ decoding a transaction that has been prepared for two-phase commit and
+ a concurrent rollback happens while we are decoding it. In such cases
+ it can also make sense to check, before decoding commences, whether the
+ rollback has already happened, so that decoding of such transactions
+ is avoided completely.
+<programlisting>
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
<sect3 id="logicaldecoding-output-plugin-change">
<title>Change Callback</title>
@@ -612,6 +691,53 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
</para>
</sect3>
+ <sect3 id="logicaldecoding-output-plugin-filter-decode">
+ <title>Decode Filter Callback</title>
+
+ <para>
+ The optional <function>filter_decode_txn_cb</function> callback
+ is called to determine whether data that is part of the current
+ transaction should continue to be decoded.
+<programlisting>
+typedef bool (*LogicalDecodeFilterDecodeTxnCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn);
+</programlisting>
+ The <parameter>ctx</parameter> parameter has the same contents
+ as for the other callbacks. The <parameter>txn</parameter> parameter
+ contains meta information about the transaction, like its XID.
+ Note, however, that it can be NULL in some cases. To signal that the decoding
+ process should terminate, return true; return false otherwise.
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+ <title>Prepare Filter Callback</title>
+
+ <para>
+ The optional <function>filter_prepare_cb</function> callback
+ is called to determine whether data that is part of the current
+ two-phase commit transaction should be decoded at this prepare
+ stage or later, as a regular one-phase transaction, at
+ <command>COMMIT PREPARED</command> time. To signal that decoding
+ at prepare time should be skipped, return true; return false otherwise.
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+</programlisting>
+ The <parameter>ctx</parameter> parameter has the same contents
+ as for the other callbacks. The <parameter>txn</parameter> parameter
+ contains meta information about the transaction. The <parameter>xid</parameter>
+ contains the XID because <parameter>txn</parameter> can be NULL in some cases.
+ The <parameter>gid</parameter> is the identifier that later identifies this
+ transaction for <command>COMMIT PREPARED</command> or <command>ROLLBACK PREPARED</command>.
+ The callback has to provide the same static answer for a given combination of
+ <parameter>xid</parameter> and <parameter>gid</parameter> every time it is
+ called.
+ </para>
+ </sect3>
+
<sect3 id="logicaldecoding-output-plugin-message">
<title>Generic Message Callback</title>
diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 3aafa79e52..1a4487d404 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -102,6 +102,14 @@ ParseCommitRecord(uint8 info, xl_xact_commit *xlrec, xl_xact_parsed_commit *pars
parsed->twophase_xid = xl_twophase->xid;
data += sizeof(xl_xact_twophase);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_GID)
+ {
+ int gidlen;
+ strcpy(parsed->twophase_gid, data);
+ gidlen = strlen(parsed->twophase_gid) + 1;
+ data += MAXALIGN(gidlen);
+ }
}
if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
@@ -139,6 +147,16 @@ ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_abort *parsed)
data += sizeof(xl_xact_xinfo);
}
+ if (parsed->xinfo & XACT_XINFO_HAS_DBINFO)
+ {
+ xl_xact_dbinfo *xl_dbinfo = (xl_xact_dbinfo *) data;
+
+ parsed->dbId = xl_dbinfo->dbId;
+ parsed->tsId = xl_dbinfo->tsId;
+
+ data += sizeof(xl_xact_dbinfo);
+ }
+
if (parsed->xinfo & XACT_XINFO_HAS_SUBXACTS)
{
xl_xact_subxacts *xl_subxacts = (xl_xact_subxacts *) data;
@@ -168,6 +186,27 @@ ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_abort *parsed)
parsed->twophase_xid = xl_twophase->xid;
data += sizeof(xl_xact_twophase);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_GID)
+ {
+ int gidlen;
+ strcpy(parsed->twophase_gid, data);
+ gidlen = strlen(parsed->twophase_gid) + 1;
+ data += MAXALIGN(gidlen);
+ }
+ }
+
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ xl_xact_origin xl_origin;
+
+ /* we're only guaranteed 4 byte alignment, so copy onto stack */
+ memcpy(&xl_origin, data, sizeof(xl_origin));
+
+ parsed->origin_lsn = xl_origin.origin_lsn;
+ parsed->origin_timestamp = xl_origin.origin_timestamp;
+
+ data += sizeof(xl_xact_origin);
}
}
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 321da9f5f6..1f60e80456 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -144,11 +144,7 @@ int max_prepared_xacts = 0;
*
* typedef struct GlobalTransactionData *GlobalTransaction appears in
* twophase.h
- *
- * Note that the max value of GIDSIZE must fit in the uint16 gidlen,
- * specified in TwoPhaseFileHeader.
*/
-#define GIDSIZE 200
typedef struct GlobalTransactionData
{
@@ -211,12 +207,14 @@ static void RecordTransactionCommitPrepared(TransactionId xid,
RelFileNode *rels,
int ninvalmsgs,
SharedInvalidationMessage *invalmsgs,
- bool initfileinval);
+ bool initfileinval,
+ const char *gid);
static void RecordTransactionAbortPrepared(TransactionId xid,
int nchildren,
TransactionId *children,
int nrels,
- RelFileNode *rels);
+ RelFileNode *rels,
+ const char *gid);
static void ProcessRecords(char *bufptr, TransactionId xid,
const TwoPhaseCallback callbacks[]);
static void RemoveGXact(GlobalTransaction gxact);
@@ -556,7 +554,7 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held)
* Locate the prepared transaction and mark it busy for COMMIT or PREPARE.
*/
static GlobalTransaction
-LockGXact(const char *gid, Oid user)
+LockGXact(const char *gid, Oid user, bool missing_ok)
{
int i;
@@ -616,7 +614,8 @@ LockGXact(const char *gid, Oid user)
LWLockRelease(TwoPhaseStateLock);
- ereport(ERROR,
+ if (!missing_ok)
+ ereport(ERROR,
(errcode(ERRCODE_UNDEFINED_OBJECT),
errmsg("prepared transaction with identifier \"%s\" does not exist",
gid)));
@@ -898,7 +897,7 @@ TwoPhaseGetDummyProc(TransactionId xid)
/*
* Header for a 2PC state file
*/
-#define TWOPHASE_MAGIC 0x57F94533 /* format identifier */
+#define TWOPHASE_MAGIC 0x57F94534 /* format identifier */
typedef struct TwoPhaseFileHeader
{
@@ -914,6 +913,8 @@ typedef struct TwoPhaseFileHeader
int32 ninvalmsgs; /* number of cache invalidation messages */
bool initfileinval; /* does relcache init file need invalidation? */
uint16 gidlen; /* length of the GID - GID follows the header */
+ XLogRecPtr origin_lsn; /* lsn of this record at origin node */
+ TimestampTz origin_timestamp; /* time of prepare at origin node */
} TwoPhaseFileHeader;
/*
@@ -1065,6 +1066,7 @@ EndPrepare(GlobalTransaction gxact)
{
TwoPhaseFileHeader *hdr;
StateFileChunk *record;
+ bool replorigin;
/* Add the end sentinel to the list of 2PC records */
RegisterTwoPhaseRecord(TWOPHASE_RM_END_ID, 0,
@@ -1075,6 +1077,21 @@ EndPrepare(GlobalTransaction gxact)
Assert(hdr->magic == TWOPHASE_MAGIC);
hdr->total_len = records.total_len + sizeof(pg_crc32c);
+ replorigin = (replorigin_session_origin != InvalidRepOriginId &&
+ replorigin_session_origin != DoNotReplicateId);
+
+ if (replorigin)
+ {
+ Assert(replorigin_session_origin_lsn != InvalidXLogRecPtr);
+ hdr->origin_lsn = replorigin_session_origin_lsn;
+ hdr->origin_timestamp = replorigin_session_origin_timestamp;
+ }
+ else
+ {
+ hdr->origin_lsn = InvalidXLogRecPtr;
+ hdr->origin_timestamp = 0;
+ }
+
/*
* If the data size exceeds MaxAllocSize, we won't be able to read it in
* ReadTwoPhaseFile. Check for that now, rather than fail in the case
@@ -1107,7 +1124,16 @@ EndPrepare(GlobalTransaction gxact)
XLogBeginInsert();
for (record = records.head; record != NULL; record = record->next)
XLogRegisterData(record->data, record->len);
+
+ XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
+
gxact->prepare_end_lsn = XLogInsert(RM_XACT_ID, XLOG_XACT_PREPARE);
+
+ if (replorigin)
+ /* Move LSNs forward for this replication origin */
+ replorigin_session_advance(replorigin_session_origin_lsn,
+ gxact->prepare_end_lsn);
+
XLogFlush(gxact->prepare_end_lsn);
/* If we crash now, we have prepared: WAL replay will fix things */
@@ -1283,6 +1309,43 @@ ReadTwoPhaseFile(TransactionId xid, bool give_warnings)
return buf;
}
+/*
+ * ParsePrepareRecord
+ */
+void
+ParsePrepareRecord(uint8 info, char *xlrec, xl_xact_parsed_prepare *parsed)
+{
+ TwoPhaseFileHeader *hdr;
+ char *bufptr;
+
+ hdr = (TwoPhaseFileHeader *) xlrec;
+ bufptr = xlrec + MAXALIGN(sizeof(TwoPhaseFileHeader));
+
+ parsed->origin_lsn = hdr->origin_lsn;
+ parsed->origin_timestamp = hdr->origin_timestamp;
+ parsed->twophase_xid = hdr->xid;
+ parsed->dbId = hdr->database;
+ parsed->nsubxacts = hdr->nsubxacts;
+ parsed->ncommitrels = hdr->ncommitrels;
+ parsed->nabortrels = hdr->nabortrels;
+ parsed->nmsgs = hdr->ninvalmsgs;
+
+ strncpy(parsed->twophase_gid, bufptr, hdr->gidlen);
+ bufptr += MAXALIGN(hdr->gidlen);
+
+ parsed->subxacts = (TransactionId *) bufptr;
+ bufptr += MAXALIGN(hdr->nsubxacts * sizeof(TransactionId));
+
+ parsed->commitrels = (RelFileNode *) bufptr;
+ bufptr += MAXALIGN(hdr->ncommitrels * sizeof(RelFileNode));
+
+ parsed->abortrels = (RelFileNode *) bufptr;
+ bufptr += MAXALIGN(hdr->nabortrels * sizeof(RelFileNode));
+
+ parsed->msgs = (SharedInvalidationMessage *) bufptr;
+ bufptr += MAXALIGN(hdr->ninvalmsgs * sizeof(SharedInvalidationMessage));
+}
+
/*
* Reads 2PC data from xlog. During checkpoint this data will be moved to
@@ -1365,7 +1428,7 @@ StandbyTransactionIdIsPrepared(TransactionId xid)
* FinishPreparedTransaction: execute COMMIT PREPARED or ROLLBACK PREPARED
*/
void
-FinishPreparedTransaction(const char *gid, bool isCommit)
+FinishPreparedTransaction(const char *gid, bool isCommit, bool missing_ok)
{
GlobalTransaction gxact;
PGPROC *proc;
@@ -1386,8 +1449,20 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
/*
* Validate the GID, and lock the GXACT to ensure that two backends do not
* try to commit the same GID at once.
+ *
+ * With logical decoding, on the apply side, it's possible that a prepared
+ * transaction was aborted while it was being decoded. In that case, we stop
+ * decoding and abort the transaction immediately. However, the ROLLBACK
+ * PREPARED still reaches the subscriber afterwards, so it's OK for the gid
+ * to be missing there.
*/
- gxact = LockGXact(gid, GetUserId());
+ gxact = LockGXact(gid, GetUserId(), missing_ok);
+ if (gxact == NULL)
+ {
+ Assert(missing_ok && !isCommit);
+ return;
+ }
+
proc = &ProcGlobal->allProcs[gxact->pgprocno];
pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
xid = pgxact->xid;
@@ -1435,11 +1510,12 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
hdr->nsubxacts, children,
hdr->ncommitrels, commitrels,
hdr->ninvalmsgs, invalmsgs,
- hdr->initfileinval);
+ hdr->initfileinval, gid);
else
RecordTransactionAbortPrepared(xid,
hdr->nsubxacts, children,
- hdr->nabortrels, abortrels);
+ hdr->nabortrels, abortrels,
+ gid);
ProcArrayRemove(proc, latestXid);
@@ -1752,7 +1828,8 @@ restoreTwoPhaseData(void)
if (buf == NULL)
continue;
- PrepareRedoAdd(buf, InvalidXLogRecPtr, InvalidXLogRecPtr);
+ PrepareRedoAdd(buf, InvalidXLogRecPtr,
+ InvalidXLogRecPtr, InvalidRepOriginId);
}
}
LWLockRelease(TwoPhaseStateLock);
@@ -2165,7 +2242,8 @@ RecordTransactionCommitPrepared(TransactionId xid,
RelFileNode *rels,
int ninvalmsgs,
SharedInvalidationMessage *invalmsgs,
- bool initfileinval)
+ bool initfileinval,
+ const char *gid)
{
XLogRecPtr recptr;
TimestampTz committs = GetCurrentTimestamp();
@@ -2193,7 +2271,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
ninvalmsgs, invalmsgs,
initfileinval, false,
MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
- xid);
+ xid, gid);
if (replorigin)
@@ -2255,7 +2333,8 @@ RecordTransactionAbortPrepared(TransactionId xid,
int nchildren,
TransactionId *children,
int nrels,
- RelFileNode *rels)
+ RelFileNode *rels,
+ const char *gid)
{
XLogRecPtr recptr;
@@ -2278,7 +2357,7 @@ RecordTransactionAbortPrepared(TransactionId xid,
nchildren, children,
nrels, rels,
MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
- xid);
+ xid, gid);
/* Always flush, since we're about to remove the 2PC state file */
XLogFlush(recptr);
@@ -2309,7 +2388,8 @@ RecordTransactionAbortPrepared(TransactionId xid,
* data, the entry is marked as located on disk.
*/
void
-PrepareRedoAdd(char *buf, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn, RepOriginId origin_id)
{
TwoPhaseFileHeader *hdr = (TwoPhaseFileHeader *) buf;
char *bufptr;
@@ -2358,6 +2438,13 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
Assert(TwoPhaseState->numPrepXacts < max_prepared_xacts);
TwoPhaseState->prepXacts[TwoPhaseState->numPrepXacts++] = gxact;
+ if (origin_id != InvalidRepOriginId)
+ {
+ /* recover apply progress */
+ replorigin_advance(origin_id, hdr->origin_lsn, end_lsn,
+ false /* backward */ , false /* WAL */ );
+ }
+
elog(DEBUG2, "added 2PC data in shared memory for transaction %u", gxact->xid);
}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b37510c24f..fa54463cd0 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1227,7 +1227,7 @@ RecordTransactionCommit(void)
nmsgs, invalMessages,
RelcacheInitFileInval, forceSyncCommit,
MyXactFlags,
- InvalidTransactionId /* plain commit */ );
+ InvalidTransactionId, NULL /* plain commit */ );
if (replorigin)
/* Move LSNs forward for this replication origin */
@@ -1579,7 +1579,8 @@ RecordTransactionAbort(bool isSubXact)
XactLogAbortRecord(xact_time,
nchildren, children,
nrels, rels,
- MyXactFlags, InvalidTransactionId);
+ MyXactFlags, InvalidTransactionId,
+ NULL);
/*
* Report the latest async abort LSN, so that the WAL writer knows to
@@ -5247,7 +5248,6 @@ xactGetCommittedChildren(TransactionId **ptr)
* XLOG support routines
*/
-
/*
* Log the commit record for a plain or twophase transaction commit.
*
@@ -5260,7 +5260,8 @@ XactLogCommitRecord(TimestampTz commit_time,
int nrels, RelFileNode *rels,
int nmsgs, SharedInvalidationMessage *msgs,
bool relcacheInval, bool forceSync,
- int xactflags, TransactionId twophase_xid)
+ int xactflags, TransactionId twophase_xid,
+ const char *twophase_gid)
{
xl_xact_commit xlrec;
xl_xact_xinfo xl_xinfo;
@@ -5272,6 +5273,7 @@ XactLogCommitRecord(TimestampTz commit_time,
xl_xact_origin xl_origin;
uint8 info;
+ int gidlen = 0;
Assert(CritSectionCount > 0);
@@ -5334,6 +5336,13 @@ XactLogCommitRecord(TimestampTz commit_time,
{
xl_xinfo.xinfo |= XACT_XINFO_HAS_TWOPHASE;
xl_twophase.xid = twophase_xid;
+ Assert(twophase_gid != NULL);
+
+ if (XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_GID;
+ gidlen = strlen(twophase_gid) + 1; /* include '\0' */
+ }
}
/* dump transaction origin information */
@@ -5384,8 +5393,19 @@ XactLogCommitRecord(TimestampTz commit_time,
}
if (xl_xinfo.xinfo & XACT_XINFO_HAS_TWOPHASE)
+ {
XLogRegisterData((char *) (&xl_twophase), sizeof(xl_xact_twophase));
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_GID)
+ {
+ static const char zeroes[MAXIMUM_ALIGNOF] = { 0 };
+ XLogRegisterData((char*) twophase_gid, gidlen);
+ if (MAXALIGN(gidlen) != gidlen)
+ XLogRegisterData((char*) zeroes, MAXALIGN(gidlen) - gidlen);
+ }
+ }
+
+
if (xl_xinfo.xinfo & XACT_XINFO_HAS_ORIGIN)
XLogRegisterData((char *) (&xl_origin), sizeof(xl_xact_origin));
@@ -5405,15 +5425,19 @@ XLogRecPtr
XactLogAbortRecord(TimestampTz abort_time,
int nsubxacts, TransactionId *subxacts,
int nrels, RelFileNode *rels,
- int xactflags, TransactionId twophase_xid)
+ int xactflags, TransactionId twophase_xid,
+ const char *twophase_gid)
{
xl_xact_abort xlrec;
xl_xact_xinfo xl_xinfo;
xl_xact_subxacts xl_subxacts;
xl_xact_relfilenodes xl_relfilenodes;
xl_xact_twophase xl_twophase;
+ xl_xact_dbinfo xl_dbinfo;
+ xl_xact_origin xl_origin;
uint8 info;
+ int gidlen = 0;
Assert(CritSectionCount > 0);
@@ -5449,6 +5473,31 @@ XactLogAbortRecord(TimestampTz abort_time,
{
xl_xinfo.xinfo |= XACT_XINFO_HAS_TWOPHASE;
xl_twophase.xid = twophase_xid;
+ Assert(twophase_gid != NULL);
+
+ if (XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_GID;
+ gidlen = strlen(twophase_gid) + 1; /* include '\0' */
+ }
+ }
+
+ if (TransactionIdIsValid(twophase_xid) && XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_DBINFO;
+ xl_dbinfo.dbId = MyDatabaseId;
+ xl_dbinfo.tsId = MyDatabaseTableSpace;
+ }
+
+ /* dump transaction origin information only for abort prepared */
+ if ( (replorigin_session_origin != InvalidRepOriginId) &&
+ TransactionIdIsValid(twophase_xid) &&
+ XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_ORIGIN;
+
+ xl_origin.origin_lsn = replorigin_session_origin_lsn;
+ xl_origin.origin_timestamp = replorigin_session_origin_timestamp;
}
if (xl_xinfo.xinfo != 0)
@@ -5463,6 +5512,9 @@ XactLogAbortRecord(TimestampTz abort_time,
if (xl_xinfo.xinfo != 0)
XLogRegisterData((char *) (&xl_xinfo), sizeof(xl_xinfo));
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_DBINFO)
+ XLogRegisterData((char *) (&xl_dbinfo), sizeof(xl_dbinfo));
+
if (xl_xinfo.xinfo & XACT_XINFO_HAS_SUBXACTS)
{
XLogRegisterData((char *) (&xl_subxacts),
@@ -5480,7 +5532,23 @@ XactLogAbortRecord(TimestampTz abort_time,
}
if (xl_xinfo.xinfo & XACT_XINFO_HAS_TWOPHASE)
+ {
XLogRegisterData((char *) (&xl_twophase), sizeof(xl_xact_twophase));
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_GID)
+ {
+ static const char zeroes[MAXIMUM_ALIGNOF] = { 0 };
+ XLogRegisterData((char*) twophase_gid, gidlen);
+ if (MAXALIGN(gidlen) != gidlen)
+ XLogRegisterData((char*) zeroes, MAXALIGN(gidlen) - gidlen);
+ }
+ }
+
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_ORIGIN)
+ XLogRegisterData((char *) (&xl_origin), sizeof(xl_xact_origin));
+
+ if (TransactionIdIsValid(twophase_xid))
+ XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
+
return XLogInsert(RM_XACT_ID, info);
}
@@ -5803,7 +5871,8 @@ xact_redo(XLogReaderState *record)
LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
PrepareRedoAdd(XLogRecGetData(record),
record->ReadRecPtr,
- record->EndRecPtr);
+ record->EndRecPtr,
+ XLogRecGetOrigin(record));
LWLockRelease(TwoPhaseStateLock);
}
else if (info == XLOG_XACT_ASSIGNMENT)
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 486fd0c988..49637e6312 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -34,6 +34,7 @@
#include "access/xlogutils.h"
#include "access/xlogreader.h"
#include "access/xlogrecord.h"
+#include "access/twophase.h"
#include "catalog/pg_control.h"
@@ -72,6 +73,8 @@ static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_commit *parsed, TransactionId xid);
static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare *parsed);
/* common function to decode tuples */
static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -277,16 +280,33 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
break;
}
case XLOG_XACT_PREPARE:
+ {
+ xl_xact_parsed_prepare parsed;
- /*
- * Currently decoding ignores PREPARE TRANSACTION and will just
- * decode the transaction when the COMMIT PREPARED is sent or
- * throw away the transaction's contents when a ROLLBACK PREPARED
- * is received. In the future we could add code to expose prepared
- * transactions in the changestream allowing for a kind of
- * distributed 2PC.
- */
- ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+ /* check that output plugin is capable of twophase decoding */
+ if (!ctx->enable_twophase)
+ {
+ ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+ break;
+ }
+
+ /* ok, parse it */
+ ParsePrepareRecord(XLogRecGetInfo(buf->record),
+ XLogRecGetData(buf->record), &parsed);
+
+ /* does output plugin want this particular transaction? */
+ if (ctx->callbacks.filter_prepare_cb &&
+ ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+ parsed.twophase_gid))
+ {
+ ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+ buf->origptr);
+ break;
+ }
+
+ DecodePrepare(ctx, buf, &parsed);
+ break;
+ }
break;
default:
elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
@@ -607,9 +627,71 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
buf->origptr, buf->endptr);
}
+ if (TransactionIdIsValid(parsed->twophase_xid) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder,
+ parsed->twophase_xid, parsed->twophase_gid))
+ {
+ Assert(xid == parsed->twophase_xid);
+ /* we are processing COMMIT PREPARED */
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn, parsed->twophase_gid, true);
+ }
+ else
+ {
+ /* replay actions of all transaction + subtransactions in order */
+ ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn);
+ }
+}
+
+/*
+ * Decode a PREPARE record. The logic is similar to DecodeCommit().
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare *parsed)
+{
+ XLogRecPtr origin_lsn = parsed->origin_lsn;
+ TimestampTz commit_time = parsed->origin_timestamp;
+ XLogRecPtr origin_id = XLogRecGetOrigin(buf->record);
+ int i;
+ TransactionId xid = parsed->twophase_xid;
+
+ /*
+ * Process invalidation messages, even if we're not interested in the
+ * transaction's contents, since the various caches need to always be
+ * consistent.
+ */
+ if (parsed->nmsgs > 0)
+ {
+ ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
+ parsed->nmsgs, parsed->msgs);
+ ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+ }
+
+ if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+ (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+ FilterByOrigin(ctx, origin_id))
+ {
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr);
+ }
+ ReorderBufferForget(ctx->reorder, xid, buf->origptr);
+
+ return;
+ }
+
+ /* tell the reorderbuffer about the surviving subtransactions */
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+ buf->origptr, buf->endptr);
+ }
+
/* replay actions of all transaction + subtransactions in order */
- ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
- commit_time, origin_id, origin_lsn);
+ ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn, parsed->twophase_gid);
}
/*
@@ -621,6 +703,30 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid)
{
int i;
+ XLogRecPtr origin_lsn = InvalidXLogRecPtr;
+ TimestampTz commit_time = 0;
+ XLogRecPtr origin_id = XLogRecGetOrigin(buf->record);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ origin_lsn = parsed->origin_lsn;
+ commit_time = parsed->origin_timestamp;
+ }
+
+ /*
+ * If it's ROLLBACK PREPARED then handle it via callbacks.
+ */
+ if (TransactionIdIsValid(xid) &&
+ !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+ parsed->dbId == ctx->slot->data.database &&
+ !FilterByOrigin(ctx, origin_id) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
+ {
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid, false);
+ return;
+ }
for (i = 0; i < parsed->nsubxacts; i++)
{
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index bca585fc27..2a13d2e37a 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -60,6 +60,18 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+static void abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+static bool filter_decode_txn_cb_wrapper(ReorderBuffer *cache,
+ ReorderBufferTXN *txn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change);
static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -124,6 +136,7 @@ StartupDecodingContext(List *output_plugin_options,
MemoryContext context,
old_context;
LogicalDecodingContext *ctx;
+ int twophase_callbacks;
/* shorter lines... */
slot = MyReplicationSlot;
@@ -182,8 +195,27 @@ StartupDecodingContext(List *output_plugin_options,
ctx->reorder->begin = begin_cb_wrapper;
ctx->reorder->apply_change = change_cb_wrapper;
ctx->reorder->commit = commit_cb_wrapper;
+ ctx->reorder->abort = abort_cb_wrapper;
+ ctx->reorder->filter_decode_txn = filter_decode_txn_cb_wrapper;
+ ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+ ctx->reorder->prepare = prepare_cb_wrapper;
+ ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+ ctx->reorder->abort_prepared = abort_prepared_cb_wrapper;
ctx->reorder->message = message_cb_wrapper;
+ /* check that plugin implements all callbacks necessary to perform 2PC */
+ twophase_callbacks = (ctx->callbacks.prepare_cb != NULL) +
+ (ctx->callbacks.commit_prepared_cb != NULL) +
+ (ctx->callbacks.abort_prepared_cb != NULL);
+
+ ctx->enable_twophase = (twophase_callbacks == 3);
+
+ if (twophase_callbacks != 3 && twophase_callbacks != 0)
+ ereport(WARNING,
+ (errmsg("output plugin registered only %d two-phase callbacks; "
+ "two-phase transactions will be decoded at commit time",
+ twophase_callbacks)));
+
ctx->out = makeStringInfo();
ctx->prepare_write = prepare_write;
ctx->write = do_write;
@@ -680,6 +712,122 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
}
static void
+abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "abort";
+ state.report_location = txn->final_lsn; /* beginning of abort record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.abort_cb(ctx, txn, abort_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "prepare";
+ state.report_location = txn->final_lsn; /* beginning of prepare record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "commit_prepared";
+ state.report_location = txn->final_lsn; /* beginning of commit record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "abort_prepared";
+ state.report_location = txn->final_lsn; /* beginning of abort record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.abort_prepared_cb(ctx, txn, abort_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change)
{
@@ -714,6 +862,62 @@ change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
error_context_stack = errcallback.previous;
}
+static bool
+filter_decode_txn_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+ bool ret;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "filter_decode_txn";
+ state.report_location = InvalidXLogRecPtr;
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = false;
+
+ /* do the actual work: call callback */
+ ret = ctx->callbacks.filter_decode_txn_cb(ctx, txn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+ return ret;
+}
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+ bool ret;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "filter_prepare";
+ state.report_location = InvalidXLogRecPtr;
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = false;
+
+ /* do the actual work: call callback */
+ ret = ctx->callbacks.filter_prepare_cb(ctx, txn, xid, gid);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+ return ret;
+}
+
bool
filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
{
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 9b126b2957..77b9f58ae2 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -72,10 +72,11 @@ void
logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
XLogRecPtr commit_lsn)
{
- uint8 flags = 0;
+ uint8 flags = 0;
pq_sendbyte(out, 'C'); /* sending COMMIT */
+ flags |= LOGICALREP_IS_COMMIT;
/* send the flags field (unused for now) */
pq_sendbyte(out, flags);
@@ -86,21 +87,106 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
}
/*
- * Read transaction COMMIT from the stream.
+ * Write ABORT to the output stream.
+ */
+void
+logicalrep_write_abort(StringInfo out, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ uint8 flags = 0;
+
+ pq_sendbyte(out, 'C'); /* sending ABORT flag below */
+
+ flags |= LOGICALREP_IS_ABORT;
+ /* send the flags field */
+ pq_sendbyte(out, flags);
+
+ /* send fields */
+ pq_sendint64(out, abort_lsn);
+ pq_sendint64(out, txn->end_lsn);
+ pq_sendint64(out, txn->commit_time);
+}
+
+/*
+ * Read transaction COMMIT|ABORT from the stream.
*/
void
-logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
+logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data,
+ uint8 *flags)
{
- /* read flags (unused for now) */
- uint8 flags = pq_getmsgbyte(in);
+ /* read flags */
+ uint8 commit_flags = pq_getmsgbyte(in);
- if (flags != 0)
- elog(ERROR, "unrecognized flags %u in commit message", flags);
+ if (!(commit_flags & LOGICALREP_COMMIT_MASK))
+ elog(ERROR, "unrecognized flags %u in commit|abort message",
+ commit_flags);
/* read fields */
commit_data->commit_lsn = pq_getmsgint64(in);
commit_data->end_lsn = pq_getmsgint64(in);
commit_data->committime = pq_getmsgint64(in);
+
+ /* set gid to empty */
+ commit_data->gid[0] = '\0';
+
+ *flags = commit_flags;
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ uint8 flags = 0;
+
+ pq_sendbyte(out, 'P'); /* sending PREPARE protocol */
+
+ if (txn->txn_flags & TXN_COMMIT_PREPARED)
+ flags |= LOGICALREP_IS_COMMIT_PREPARED;
+ else if (txn->txn_flags & TXN_ROLLBACK_PREPARED)
+ flags |= LOGICALREP_IS_ROLLBACK_PREPARED;
+ else if (txn->txn_flags & TXN_PREPARE)
+ flags |= LOGICALREP_IS_PREPARE;
+
+ if (flags == 0)
+ elog(ERROR, "unrecognized flags %u in [commit|rollback] prepare message", flags);
+
+ /* send the flags field */
+ pq_sendbyte(out, flags);
+
+ /* send fields */
+ pq_sendint64(out, prepare_lsn);
+ pq_sendint64(out, txn->end_lsn);
+ pq_sendint64(out, txn->commit_time);
+
+ /* send gid */
+ pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepCommitData *commit_data, uint8 *flags)
+{
+ /* read flags */
+ uint8 prep_flags = pq_getmsgbyte(in);
+
+ if (!(prep_flags & LOGICALREP_PREPARE_MASK))
+ elog(ERROR, "unrecognized flags %u in prepare message", prep_flags);
+
+ /* read fields */
+ commit_data->commit_lsn = pq_getmsgint64(in);
+ commit_data->end_lsn = pq_getmsgint64(in);
+ commit_data->committime = pq_getmsgint64(in);
+
+ /* read gid */
+ strcpy(commit_data->gid, pq_getmsgstring(in));
+
+ /* set flags */
+ *flags = prep_flags;
}
/*
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 5ac391dbda..4ab9defea2 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1264,31 +1264,25 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
* the top and subtransactions (using a k-way merge) and replay the changes in
* lsn order.
*/
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+ ReorderBuffer *rb, TransactionId xid,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
TimestampTz commit_time,
RepOriginId origin_id, XLogRecPtr origin_lsn)
{
- ReorderBufferTXN *txn;
volatile Snapshot snapshot_now;
volatile CommandId command_id = FirstCommandId;
bool using_subtxn;
ReorderBufferIterTXNState *volatile iterstate = NULL;
- txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
- false);
-
- /* unknown transaction, nothing to replay */
- if (txn == NULL)
- return;
-
txn->final_lsn = commit_lsn;
txn->end_lsn = end_lsn;
txn->commit_time = commit_time;
txn->origin_id = origin_id;
txn->origin_lsn = origin_lsn;
+
/*
* If this transaction didn't have any real changes in our database, it's
* OK not to have a snapshot. Note that ReorderBufferCommitChild will have
@@ -1326,20 +1320,62 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
{
ReorderBufferChange *change;
ReorderBufferChange *specinsert = NULL;
+ bool change_cleanup = false;
+ bool check_txn_status,
+ apply_started = false;
+ bool is_prepared = txn_prepared(txn);
+
+ /*
+ * Check once whether the xid has already committed. If it
+ * has not, we must consult the decode_txn filter callback
+ * before each change to find out whether it is still OK
+ * to continue decoding this xid.
+ *
+ * This handles an abort that happens concurrently with
+ * decoding.
+ */
+ check_txn_status = TransactionIdDidCommit(txn->xid)?
+ false : true;
if (using_subtxn)
BeginInternalSubTransaction("replay");
else
StartTransactionCommand();
- rb->begin(rb, txn);
-
iterstate = ReorderBufferIterTXNInit(rb, txn);
while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
{
Relation relation = NULL;
Oid reloid;
+ /*
+ * While decoding 2PC or while streaming uncommitted
+ * transactions, check if this transaction needs to
+ * be still decoded. If the transaction got aborted
+ * or if we were instructed to stop decoding, then
+ * bail out early.
+ */
+ if (check_txn_status && rb->filter_decode_txn(rb, txn))
+ {
+ elog(LOG, "%s decoding of %s (%u)",
+ apply_started? "stopping":"skipping",
+ is_prepared? txn->gid:"",
+ txn->xid);
+ change_cleanup = true;
+ goto change_cleanuptxn;
+ }
+
+ /*
+ * The decode filter above gave us the go-ahead to apply
+ * changes, so send BEGIN for this transaction to the
+ * other side (once, before the first change).
+ */
+ if (apply_started == false)
+ {
+ rb->begin(rb, txn);
+ apply_started = true;
+ }
+
switch (change->action)
{
case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1546,6 +1582,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
}
}
+change_cleanuptxn:
/*
* There's a speculative insertion remaining, just clean in up, it
* can't have been successful, otherwise we'd gotten a confirmation
@@ -1561,8 +1598,20 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
ReorderBufferIterTXNFinish(rb, iterstate);
iterstate = NULL;
- /* call commit callback */
- rb->commit(rb, txn, commit_lsn);
+ if (change_cleanup)
+ {
+ /* call abort if we have sent any changes */
+ if (apply_started)
+ rb->abort(rb, txn, commit_lsn);
+ }
+ else
+ {
+ /* call commit or prepare callback */
+ if (txn_prepared(txn))
+ rb->prepare(rb, txn, commit_lsn);
+ else
+ rb->commit(rb, txn, commit_lsn);
+ }
/* this is just a sanity check against bad output plugin behaviour */
if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -1589,7 +1638,13 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
if (snapshot_now->copied)
ReorderBufferFreeSnap(rb, snapshot_now);
- /* remove potential on-disk data, and deallocate */
+ /*
+ * Remove potential on-disk data, and deallocate.
+ *
+ * We do this even for prepared transactions, because
+ * COMMIT PREPARED needs none of this data once the
+ * PREPARE has been decoded successfully.
+ */
ReorderBufferCleanupTXN(rb, txn);
}
PG_CATCH();
@@ -1624,6 +1679,136 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
}
/*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+ return rb->filter_prepare(rb, txn, xid, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferCommitInternal()
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /* unknown transaction, nothing to replay */
+ if (txn == NULL)
+ return;
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a twophase transaction. It calls ReorderBufferCommitInternal()
+ * since all prepared transactions need to be decoded at PREPARE time.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /* unknown transaction, nothing to replay */
+ if (txn == NULL)
+ return;
+
+ txn->txn_flags |= TXN_PREPARE;
+ strcpy(txn->gid, gid);
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Check whether this transaction was sent as prepared to subscribers.
+ * Called while handling commit|abort prepared.
+ */
+bool
+ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+ const char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /*
+ * Always call the prepare filter. It's the job of the prepare
+ * filter to give us the *same* response for a given xid
+ * across multiple calls (including ones on restart)
+ */
+ return !(rb->filter_prepare(rb, txn, xid, gid));
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ABORT PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit)
+{
+ ReorderBufferTXN *txn;
+
+ /*
+ * The transaction may or may not exist here (after a restart,
+ * for example). A prepared transaction has no buffered changes
+ * left at this point anyway, so allow it to be created below.
+ */
+ txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+ true);
+
+ txn->final_lsn = commit_lsn;
+ txn->end_lsn = end_lsn;
+ txn->commit_time = commit_time;
+ txn->origin_id = origin_id;
+ txn->origin_lsn = origin_lsn;
+ strcpy(txn->gid, gid);
+
+ if (is_commit)
+ {
+ txn->txn_flags |= TXN_COMMIT_PREPARED;
+ rb->commit_prepared(rb, txn, commit_lsn);
+ }
+ else
+ {
+ txn->txn_flags |= TXN_ROLLBACK_PREPARED;
+ rb->abort_prepared(rb, txn, commit_lsn);
+ }
+
+ /* cleanup: make sure there's no cache pollution */
+ ReorderBufferExecuteInvalidations(rb, txn);
+ ReorderBufferCleanupTXN(rb, txn);
+}
+
+/*
* Abort a transaction that possibly has previous changes. Needs to be first
* called for subtransactions and then for the toplevel xid.
*
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index fa5d9bb120..444b5d5db8 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -452,8 +452,9 @@ static void
apply_handle_commit(StringInfo s)
{
LogicalRepCommitData commit_data;
+ uint8 flags = 0;
- logicalrep_read_commit(s, &commit_data);
+ logicalrep_read_commit(s, &commit_data, &flags);
Assert(commit_data.commit_lsn == remote_final_lsn);
@@ -467,7 +468,11 @@ apply_handle_commit(StringInfo s)
replorigin_session_origin_lsn = commit_data.end_lsn;
replorigin_session_origin_timestamp = commit_data.committime;
- CommitTransactionCommand();
+ if (flags & LOGICALREP_IS_COMMIT)
+ CommitTransactionCommand();
+ else if (flags & LOGICALREP_IS_ABORT)
+ AbortCurrentTransaction();
+
pgstat_report_stat(false);
store_flush_position(commit_data.end_lsn);
@@ -487,6 +492,120 @@ apply_handle_commit(StringInfo s)
pgstat_report_activity(STATE_IDLE, NULL);
}
+static void
+apply_handle_prepare_txn(LogicalRepCommitData *commit_data)
+{
+ Assert(commit_data->commit_lsn == remote_final_lsn);
+ /* The synchronization worker runs in single transaction. */
+ if (IsTransactionState() && !am_tablesync_worker())
+ {
+ /* End the earlier transaction and start a new one */
+ BeginTransactionBlock();
+ CommitTransactionCommand();
+ StartTransactionCommand();
+ /*
+ * Update origin state so we can restart streaming from correct
+ * position in case of crash.
+ */
+ replorigin_session_origin_lsn = commit_data->end_lsn;
+ replorigin_session_origin_timestamp = commit_data->committime;
+
+ PrepareTransactionBlock(commit_data->gid);
+ CommitTransactionCommand();
+ pgstat_report_stat(false);
+
+ store_flush_position(commit_data->end_lsn);
+ }
+ else
+ {
+ /* Process any invalidation messages that might have accumulated. */
+ AcceptInvalidationMessages();
+ maybe_reread_subscription();
+ }
+
+ in_remote_transaction = false;
+
+ /* Process any tables that are being synchronized in parallel. */
+ process_syncing_tables(commit_data->end_lsn);
+
+ pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+static void
+apply_handle_commit_prepared_txn(LogicalRepCommitData *commit_data)
+{
+ /* there is no transaction when COMMIT PREPARED is called */
+ ensure_transaction();
+
+ /*
+ * Update origin state so we can restart streaming from correct
+ * position in case of crash.
+ */
+ replorigin_session_origin_lsn = commit_data->end_lsn;
+ replorigin_session_origin_timestamp = commit_data->committime;
+
+ FinishPreparedTransaction(commit_data->gid, true, false);
+ CommitTransactionCommand();
+ pgstat_report_stat(false);
+
+ store_flush_position(commit_data->end_lsn);
+ in_remote_transaction = false;
+
+ /* Process any tables that are being synchronized in parallel. */
+ process_syncing_tables(commit_data->end_lsn);
+
+ pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+static void
+apply_handle_rollback_prepared_txn(LogicalRepCommitData *commit_data)
+{
+ /* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+ ensure_transaction();
+
+ /*
+ * Update origin state so we can restart streaming from correct
+ * position in case of crash.
+ */
+ replorigin_session_origin_lsn = commit_data->end_lsn;
+ replorigin_session_origin_timestamp = commit_data->committime;
+
+ FinishPreparedTransaction(commit_data->gid, false, true);
+ CommitTransactionCommand();
+ pgstat_report_stat(false);
+
+ store_flush_position(commit_data->end_lsn);
+ in_remote_transaction = false;
+
+ /* Process any tables that are being synchronized in parallel. */
+ process_syncing_tables(commit_data->end_lsn);
+
+ pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+ LogicalRepCommitData commit_data;
+ uint8 flags = 0;
+
+ logicalrep_read_prepare(s, &commit_data, &flags);
+
+ if (flags & LOGICALREP_IS_PREPARE)
+ apply_handle_prepare_txn(&commit_data);
+ else if (flags & LOGICALREP_IS_COMMIT_PREPARED)
+ apply_handle_commit_prepared_txn(&commit_data);
+ else if (flags & LOGICALREP_IS_ROLLBACK_PREPARED)
+ apply_handle_rollback_prepared_txn(&commit_data);
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg("wrong [commit|rollback] prepare message")));
+}
+
/*
* Handle ORIGIN message.
*
@@ -884,10 +1003,14 @@ apply_dispatch(StringInfo s)
case 'B':
apply_handle_begin(s);
break;
- /* COMMIT */
+ /* COMMIT|ABORT */
case 'C':
apply_handle_commit(s);
break;
+ /* PREPARE, COMMIT PREPARED or ROLLBACK PREPARED */
+ case 'P':
+ apply_handle_prepare(s);
+ break;
/* INSERT */
case 'I':
apply_handle_insert(s);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 550b156e2d..9628a53584 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -37,11 +37,23 @@ static void pgoutput_begin_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn);
static void pgoutput_commit_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_abort_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr abort_lsn);
static void pgoutput_change(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, Relation rel,
ReorderBufferChange *change);
static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
RepOriginId origin_id);
+static bool pgoutput_filter_prepare(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, TransactionId xid, const char *gid);
+static bool pgoutput_decode_txn_filter(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_abort_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
static bool publications_valid;
@@ -79,7 +91,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
cb->begin_cb = pgoutput_begin_txn;
cb->change_cb = pgoutput_change;
cb->commit_cb = pgoutput_commit_txn;
+ cb->abort_cb = pgoutput_abort_txn;
+
+ cb->filter_prepare_cb = pgoutput_filter_prepare;
+ cb->prepare_cb = pgoutput_prepare_txn;
+ cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+ cb->abort_prepared_cb = pgoutput_abort_prepared_txn;
+
cb->filter_by_origin_cb = pgoutput_origin_filter;
+ cb->filter_decode_txn_cb = pgoutput_decode_txn_filter;
cb->shutdown_cb = pgoutput_shutdown;
}
@@ -252,6 +272,61 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
}
/*
+ * ABORT callback
+ */
+static void
+pgoutput_abort_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ OutputPluginUpdateProgress(ctx);
+
+ OutputPluginPrepareWrite(ctx, true);
+ logicalrep_write_abort(ctx->out, txn, abort_lsn);
+ OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ OutputPluginUpdateProgress(ctx);
+
+ OutputPluginPrepareWrite(ctx, true);
+ logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+ OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ OutputPluginUpdateProgress(ctx);
+
+ OutputPluginPrepareWrite(ctx, true);
+ logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+ OutputPluginWrite(ctx, true);
+}
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ OutputPluginUpdateProgress(ctx);
+
+ OutputPluginPrepareWrite(ctx, true);
+ logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+ OutputPluginWrite(ctx, true);
+}
+
+/*
* Sends the decoded DML over wire.
*/
static void
@@ -362,6 +437,18 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
}
/*
+ * Filter out unnecessary two-phase transactions.
+ *
+ * Currently, we forward all two-phase transactions.
+ */
+static bool
+pgoutput_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid)
+{
+ return false;
+}
+
+/*
* Currently we always forward.
*/
static bool
@@ -372,6 +459,37 @@ pgoutput_origin_filter(LogicalDecodingContext *ctx,
}
/*
+ * Check if we should continue to decode this transaction.
+ * Returns true if the transaction should be filtered out.
+ *
+ * If it has aborted in the meantime, there is no point in decoding
+ * and sending the rest of the changes; we might as well ask the
+ * subscribers to abort immediately.
+ *
+ * This should be called if we are streaming a transaction before
+ * it's committed or if we are decoding a 2PC transaction.
+ * Otherwise we always decode committed transactions.
+ *
+ * Additional checks can be added here, as needed.
+ */
+static bool
+pgoutput_decode_txn_filter(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn)
+{
+ /*
+ * Due to caching, repeated TransactionIdDidAbort calls
+ * shouldn't be that expensive
+ */
+ if (txn != NULL &&
+ TransactionIdIsValid(txn->xid) &&
+ TransactionIdDidAbort(txn->xid))
+ return true;
+
+ /* if txn is NULL, filter it out */
+ return (txn == NULL);
+}
+
+/*
* Shutdown the output plugin.
*
* Note, we don't need to clean the data->context as it's child context
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 4da1f8f643..a19bae187e 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -454,13 +454,13 @@ standard_ProcessUtility(PlannedStmt *pstmt,
case TRANS_STMT_COMMIT_PREPARED:
PreventTransactionChain(isTopLevel, "COMMIT PREPARED");
PreventCommandDuringRecovery("COMMIT PREPARED");
- FinishPreparedTransaction(stmt->gid, true);
+ FinishPreparedTransaction(stmt->gid, true, false);
break;
case TRANS_STMT_ROLLBACK_PREPARED:
PreventTransactionChain(isTopLevel, "ROLLBACK PREPARED");
PreventCommandDuringRecovery("ROLLBACK PREPARED");
- FinishPreparedTransaction(stmt->gid, false);
+ FinishPreparedTransaction(stmt->gid, false, false);
break;
case TRANS_STMT_ROLLBACK:
diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c
index 2b218e07e6..6c37832de5 100644
--- a/src/backend/utils/time/tqual.c
+++ b/src/backend/utils/time/tqual.c
@@ -1252,6 +1252,7 @@ HeapTupleSatisfiesVacuum(HeapTuple htup, TransactionId OldestXmin,
*/
SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
InvalidTransactionId);
+
return HEAPTUPLE_DEAD;
}
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index f5fbbea4b6..cd946a1a2a 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -15,6 +15,7 @@
#define TWOPHASE_H
#include "access/xlogdefs.h"
+#include "access/xact.h"
#include "datatype/timestamp.h"
#include "storage/lock.h"
@@ -46,15 +47,18 @@ extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
int *nxids_p);
+extern void ParsePrepareRecord(uint8 info, char *xlrec,
+ xl_xact_parsed_prepare *parsed);
extern void StandbyRecoverPreparedTransactions(void);
extern void RecoverPreparedTransactions(void);
extern void CheckPointTwoPhase(XLogRecPtr redo_horizon);
-extern void FinishPreparedTransaction(const char *gid, bool isCommit);
+extern void FinishPreparedTransaction(const char *gid, bool isCommit,
+ bool missing_ok);
extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
- XLogRecPtr end_lsn);
+ XLogRecPtr end_lsn, RepOriginId origin_id);
extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
extern void restoreTwoPhaseData(void);
#endif /* TWOPHASE_H */
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 118b0a8432..118156ed78 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -21,6 +21,13 @@
#include "storage/sinval.h"
#include "utils/datetime.h"
+/*
+ * Maximum size of Global Transaction ID (including '\0').
+ *
+ * Note that the max value of GIDSIZE must fit in the uint16 gidlen,
+ * specified in TwoPhaseFileHeader.
+ */
+#define GIDSIZE 200
/*
* Xact isolation levels
@@ -156,6 +163,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
#define XACT_XINFO_HAS_TWOPHASE (1U << 4)
#define XACT_XINFO_HAS_ORIGIN (1U << 5)
#define XACT_XINFO_HAS_AE_LOCKS (1U << 6)
+#define XACT_XINFO_HAS_GID (1U << 7)
/*
* Also stored in xinfo, these indicating a variety of additional actions that
@@ -302,13 +310,40 @@ typedef struct xl_xact_parsed_commit
SharedInvalidationMessage *msgs;
TransactionId twophase_xid; /* only for 2PC */
+ char twophase_gid[GIDSIZE];
XLogRecPtr origin_lsn;
TimestampTz origin_timestamp;
} xl_xact_parsed_commit;
+typedef struct xl_xact_parsed_prepare
+{
+ Oid dbId; /* MyDatabaseId */
+
+ int nsubxacts;
+ TransactionId *subxacts;
+
+ int ncommitrels;
+ RelFileNode *commitrels;
+
+ int nabortrels;
+ RelFileNode *abortrels;
+
+ int nmsgs;
+ SharedInvalidationMessage *msgs;
+
+ TransactionId twophase_xid;
+ char twophase_gid[GIDSIZE];
+
+ XLogRecPtr origin_lsn;
+ TimestampTz origin_timestamp;
+} xl_xact_parsed_prepare;
+
typedef struct xl_xact_parsed_abort
{
+ Oid dbId;
+ Oid tsId;
+
TimestampTz xact_time;
uint32 xinfo;
@@ -319,6 +354,10 @@ typedef struct xl_xact_parsed_abort
RelFileNode *xnodes;
TransactionId twophase_xid; /* only for 2PC */
+ char twophase_gid[GIDSIZE];
+
+ XLogRecPtr origin_lsn;
+ TimestampTz origin_timestamp;
} xl_xact_parsed_abort;
@@ -386,12 +425,13 @@ extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
int nmsgs, SharedInvalidationMessage *msgs,
bool relcacheInval, bool forceSync,
int xactflags,
- TransactionId twophase_xid);
+ TransactionId twophase_xid, const char *twophase_gid);
extern XLogRecPtr XactLogAbortRecord(TimestampTz abort_time,
int nsubxacts, TransactionId *subxacts,
int nrels, RelFileNode *rels,
- int xactflags, TransactionId twophase_xid);
+ int xactflags, TransactionId twophase_xid,
+ const char *twophase_gid);
extern void xact_redo(XLogReaderState *record);
/* xactdesc.c */
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 7f0e0fa881..758de40db9 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -82,6 +82,11 @@ typedef struct LogicalDecodingContext
bool prepared_write;
XLogRecPtr write_location;
TransactionId write_xid;
+
+ /*
+ * Capabilities of the output plugin.
+ */
+ bool enable_twophase;
} LogicalDecodingContext;
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index a9736e1bf6..7f51f75b97 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -69,11 +69,20 @@ typedef struct LogicalRepBeginData
TransactionId xid;
} LogicalRepBeginData;
+#define LOGICALREP_IS_COMMIT 0x01
+#define LOGICALREP_IS_ABORT 0x02
+#define LOGICALREP_IS_PREPARE 0x04
+#define LOGICALREP_IS_COMMIT_PREPARED 0x08
+#define LOGICALREP_IS_ROLLBACK_PREPARED 0x10
+#define LOGICALREP_COMMIT_MASK (LOGICALREP_IS_COMMIT | LOGICALREP_IS_ABORT)
+#define LOGICALREP_PREPARE_MASK (LOGICALREP_IS_PREPARE | LOGICALREP_IS_COMMIT_PREPARED | LOGICALREP_IS_ROLLBACK_PREPARED)
typedef struct LogicalRepCommitData
{
+ uint8 flag;
XLogRecPtr commit_lsn;
XLogRecPtr end_lsn;
TimestampTz committime;
+ char gid[GIDSIZE];
} LogicalRepCommitData;
extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
@@ -81,8 +90,14 @@ extern void logicalrep_read_begin(StringInfo in,
LogicalRepBeginData *begin_data);
extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+extern void logicalrep_write_abort(StringInfo out, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
extern void logicalrep_read_commit(StringInfo in,
- LogicalRepCommitData *commit_data);
+ LogicalRepCommitData *commit_data, uint8 *flags);
+extern void logicalrep_read_prepare(StringInfo in,
+ LogicalRepCommitData *commit_data, uint8 *flags);
extern void logicalrep_write_origin(StringInfo out, const char *origin,
XLogRecPtr origin_lsn);
extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 26ff024882..5c61f76c66 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -68,6 +68,46 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
XLogRecPtr commit_lsn);
/*
+ * Called for an implicit ABORT of a transaction.
+ */
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+/*
+ * Called before decoding of a PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to the prepare
+ * and commit_prepared/abort_prepared callbacks, or should wait until
+ * COMMIT PREPARED and be sent as a regular transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+/*
* Called for the generic logical decoding messages.
*/
typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
@@ -85,6 +125,12 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
RepOriginId origin_id);
/*
+ * Filter to check if we should continue to decode this transaction
+ */
+typedef bool (*LogicalDecodeFilterDecodeTxnCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn);
+
+/*
* Called to shutdown an output plugin.
*/
typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
@@ -98,8 +144,14 @@ typedef struct OutputPluginCallbacks
LogicalDecodeBeginCB begin_cb;
LogicalDecodeChangeCB change_cb;
LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeAbortCB abort_cb;
LogicalDecodeMessageCB message_cb;
+ LogicalDecodeFilterPrepareCB filter_prepare_cb;
+ LogicalDecodePrepareCB prepare_cb;
+ LogicalDecodeCommitPreparedCB commit_prepared_cb;
+ LogicalDecodeAbortPreparedCB abort_prepared_cb;
LogicalDecodeFilterByOriginCB filter_by_origin_cb;
+ LogicalDecodeFilterDecodeTxnCB filter_decode_txn_cb;
LogicalDecodeShutdownCB shutdown_cb;
} OutputPluginCallbacks;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index b18ce5a9df..51095e1da3 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
#define REORDERBUFFER_H
#include "access/htup_details.h"
+#include "access/twophase.h"
#include "lib/ilist.h"
#include "storage/sinval.h"
#include "utils/hsearch.h"
@@ -137,13 +138,28 @@ typedef struct ReorderBufferChange
dlist_node node;
} ReorderBufferChange;
+
+/* TODO: convert existing bools into flags later */
+/* values for txn_flags */
+#define TXN_HAS_CATALOG_CHANGES 0x0001
+#define TXN_IS_SUBXACT 0x0002
+#define TXN_PREPARE 0x0004
+#define TXN_COMMIT_PREPARED 0x0008
+#define TXN_ROLLBACK_PREPARED 0x0010
+#define txn_prepared(txn) ((txn)->txn_flags & TXN_PREPARE)
+
typedef struct ReorderBufferTXN
{
+ int txn_flags;
+
/*
* The transactions transaction id, can be a toplevel or sub xid.
*/
TransactionId xid;
+ /* In case of 2PC we need to pass GID to output plugin */
+ char gid[GIDSIZE];
+
/* did the TX have catalog changes */
bool has_catalog_changes;
@@ -292,6 +308,40 @@ typedef void (*ReorderBufferCommitCB) (
ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+/* abort callback signature */
+typedef void (*ReorderBufferAbortCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+typedef bool (*ReorderBufferFilterDecodeTxnCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn);
+
+typedef bool (*ReorderBufferFilterPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/* abort prepared callback signature */
+typedef void (*ReorderBufferAbortPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
/* message callback signature */
typedef void (*ReorderBufferMessageCB) (
ReorderBuffer *rb,
@@ -327,6 +377,12 @@ struct ReorderBuffer
ReorderBufferBeginCB begin;
ReorderBufferApplyChangeCB apply_change;
ReorderBufferCommitCB commit;
+ ReorderBufferAbortCB abort;
+ ReorderBufferFilterDecodeTxnCB filter_decode_txn;
+ ReorderBufferFilterPrepareCB filter_prepare;
+ ReorderBufferPrepareCB prepare;
+ ReorderBufferCommitPreparedCB commit_prepared;
+ ReorderBufferAbortPreparedCB abort_prepared;
ReorderBufferMessageCB message;
/*
@@ -369,6 +425,11 @@ void ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot
void ReorderBufferCommit(ReorderBuffer *, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit);
void ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
void ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
@@ -392,6 +453,15 @@ void ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
bool ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
bool ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
+bool ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+ const char *gid);
+bool ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+ const char *gid);
+void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid);
ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
void ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
diff --git a/src/test/subscription/t/009_twophase.pl b/src/test/subscription/t/009_twophase.pl
new file mode 100755
index 0000000000..c7f373df93
--- /dev/null
+++ b/src/test/subscription/t/009_twophase.pl
@@ -0,0 +1,163 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 12;
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf(
+ 'postgresql.conf', qq(
+ max_prepared_transactions = 10
+ ));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf(
+ 'postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres', "CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO tab_full SELECT generate_series(1,10)");
+$node_publisher->safe_psql('postgres', "CREATE TABLE tab_full2 (x text)");
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO tab_full2 VALUES ('a'), ('b'), ('b')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_full2 (x text)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+"ALTER PUBLICATION tap_pub ADD TABLE tab_full, tab_full2"
+);
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+ or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+ or die "Timed out while waiting for subscriber to synchronize data";
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres',
+ "BEGIN;INSERT INTO tab_full VALUES (11);PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+ or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+ is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+ "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+ or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+ is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+$result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+ is($result, qq(0), 'transaction is committed on subscriber');
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres',
+ "BEGIN;INSERT INTO tab_full VALUES (12);PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+ or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+ is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+ "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+ or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+ is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+ is($result, qq(0), 'transaction is aborted on subscriber');
+
+# Check that commit prepared is decoded properly on crash restart
+$node_publisher->safe_psql('postgres', "
+ BEGIN;
+ INSERT INTO tab_full VALUES (12);
+ INSERT INTO tab_full VALUES (13);
+ PREPARE TRANSACTION 'test_prepared_tab';");
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+ or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+# TODO add test cases involving DDL. This can be added after we add functionality
+# to replicate DDL changes to subscriber.
+
+# check all the cleanup
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+ "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+ "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+ "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+ 'check subscription relation status was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+ "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+ "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
Hi,
PFA a patch with documentation. I have added the requisite entries in
the logical decoding output plugins section. No changes are needed
elsewhere, AFAICS.
PFA a patch which applies cleanly against the latest git head. I also
removed unwanted newlines and took care of the cleanup TODO about
making the ReorderBufferTXN structure use a txn_flags field instead of
separate booleans for various statuses like has_catalog_changes,
is_subxact, is_serialized, etc. The patch uses this txn_flags field for
the newer prepare-related info as well.
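For readers who want a feel for the txn_flags conversion, here is a minimal standalone sketch. The flag values are the ones from the reorderbuffer.h hunk earlier in the thread; DemoTXN and the demo_txn_* helpers are hypothetical names used only for illustration and are not part of the patch, and the real ReorderBufferTXN has many more fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag values as they appear in the reorderbuffer.h hunk of the patch. */
#define TXN_HAS_CATALOG_CHANGES 0x0001
#define TXN_IS_SUBXACT          0x0002
#define TXN_PREPARE             0x0004
#define TXN_COMMIT_PREPARED     0x0008
#define TXN_ROLLBACK_PREPARED   0x0010

/* Hypothetical stand-in for ReorderBufferTXN; only txn_flags is modelled. */
typedef struct DemoTXN
{
	int			txn_flags;
} DemoTXN;

/* Hypothetical helper: set one status bit, leaving the others untouched. */
static inline void
demo_txn_set(DemoTXN *txn, int flag)
{
	assert(txn != NULL);
	txn->txn_flags |= flag;
}

/* Hypothetical helper: test one status bit. */
static inline bool
demo_txn_has(const DemoTXN *txn, int flag)
{
	assert(txn != NULL);
	return (txn->txn_flags & flag) != 0;
}

/*
 * Hypothetical helper: record how a prepared transaction ended.
 * The PREPARE bit is kept, so both facts stay visible in one field.
 */
static inline void
demo_txn_finish_prepared(DemoTXN *txn, bool is_commit)
{
	demo_txn_set(txn, is_commit ? TXN_COMMIT_PREPARED : TXN_ROLLBACK_PREPARED);
}
```

The point of the single field is that a new status, such as the prepare-related ones, costs one more #define rather than one more bool member in the struct.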
"make check-world" passes ok, including the additional regular and tap
tests that we have added as part of this patch.
Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Attachments:
2pc_logical_29_01_18.patch (application/octet-stream)
diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 6c18189d9d..79b9622600 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -26,7 +26,7 @@ installcheck:;
# installation, allow to do so, but only if requested explicitly.
installcheck-force: regresscheck-install-force isolationcheck-install-force
-check: regresscheck isolationcheck
+check: regresscheck isolationcheck 2pc-check
submake-regress:
$(MAKE) -C $(top_builddir)/src/test/regress all
@@ -66,3 +66,6 @@ isolationcheck-install-force: all | submake-isolation submake-test_decoding temp
isolationcheck isolationcheck-install-force
temp-install: EXTRA_INSTALL=contrib/test_decoding
+
+2pc-check: temp-install
+ $(prove_check)
diff --git a/contrib/test_decoding/expected/prepared.out b/contrib/test_decoding/expected/prepared.out
index 46e915d4ff..2df0b6c198 100644
--- a/contrib/test_decoding/expected/prepared.out
+++ b/contrib/test_decoding/expected/prepared.out
@@ -6,19 +6,123 @@ SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_d
init
(1 row)
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc', 'test_decoding');
+ ?column?
+----------
+ init
+(1 row)
+
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc_nofilter', 'test_decoding');
+ ?column?
+----------
+ init
+(1 row)
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+-- Reused queries
+\set get_no2pc 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'');'
+\set get_with2pc 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot_2pc'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'', ''twophase-decoding'', ''1'');'
+\set get_with2pc_nofilter 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot_2pc_nofilter'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'', ''twophase-decoding'', ''1'', ''twophase-decode-with-catalog-changes'', ''1'');'
-- test simple successful use of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (1);
PREPARE TRANSACTION 'test_prepared#1';
+:get_no2pc
+ data
+------
+(0 rows)
+
+:get_with2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ PREPARE TRANSACTION 'test_prepared#1'
+(3 rows)
+
+:get_with2pc_nofilter
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ PREPARE TRANSACTION 'test_prepared#1'
+(3 rows)
+
COMMIT PREPARED 'test_prepared#1';
+:get_no2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ COMMIT
+(3 rows)
+
+:get_with2pc
+ data
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
+:get_with2pc_nofilter
+ data
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
INSERT INTO test_prepared1 VALUES (2);
-- test abort of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (3);
PREPARE TRANSACTION 'test_prepared#2';
+:get_no2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+(3 rows)
+
+:get_with2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(6 rows)
+
+:get_with2pc_nofilter
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(6 rows)
+
ROLLBACK PREPARED 'test_prepared#2';
+:get_no2pc
+ data
+------
+(0 rows)
+
+:get_with2pc
+ data
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
+:get_with2pc_nofilter
+ data
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
INSERT INTO test_prepared1 VALUES (4);
-- test prepared xact containing ddl
BEGIN;
@@ -26,45 +130,226 @@ INSERT INTO test_prepared1 VALUES (5);
ALTER TABLE test_prepared1 ADD COLUMN data text;
INSERT INTO test_prepared1 VALUES (6, 'frakbar');
PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
-INSERT INTO test_prepared2 VALUES (7);
-COMMIT PREPARED 'test_prepared#3';
--- make sure stuff still works
-INSERT INTO test_prepared1 VALUES (8);
-INSERT INTO test_prepared2 VALUES (9);
--- cleanup
-DROP TABLE test_prepared1;
-DROP TABLE test_prepared2;
--- show results
-SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'test_prepared1'::regclass;
+ relation | locktype | mode
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
+:get_no2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4
+ COMMIT
+(3 rows)
+
+:get_with2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4
+ COMMIT
+(3 rows)
+
+:get_with2pc_nofilter
data
-------------------------------------------------------------------------
BEGIN
- table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:4
COMMIT
BEGIN
- table public.test_prepared1: INSERT: id[integer]:2
+ table public.test_prepared1: INSERT: id[integer]:5
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(7 rows)
+
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists. Our 2pc filter callback will skip decoding of xacts
+-- with catalog changes at PREPARE time, so we don't decode it now.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
+INSERT INTO test_prepared2 VALUES (7);
+:get_no2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:7
COMMIT
+(3 rows)
+
+:get_with2pc
+ data
+----------------------------------------------------
BEGIN
- table public.test_prepared1: INSERT: id[integer]:4
+ table public.test_prepared2: INSERT: id[integer]:7
COMMIT
+(3 rows)
+
+:get_with2pc_nofilter
+ data
+----------------------------------------------------
BEGIN
table public.test_prepared2: INSERT: id[integer]:7
COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+:get_no2pc
+ data
+-------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:5
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
+ COMMIT
+(4 rows)
+
+:get_with2pc
+ data
+-------------------------------------------------------------------------
BEGIN
table public.test_prepared1: INSERT: id[integer]:5
table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
COMMIT
+(4 rows)
+
+:get_with2pc_nofilter
+ data
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#3'
+(1 row)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (8);
+INSERT INTO test_prepared2 VALUES (9);
+:get_no2pc
+ data
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:9
+ COMMIT
+(6 rows)
+
+:get_with2pc
+ data
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:9
+ COMMIT
+(6 rows)
+
+:get_with2pc_nofilter
+ data
+--------------------------------------------------------------------
BEGIN
table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
COMMIT
BEGIN
table public.test_prepared2: INSERT: id[integer]:9
COMMIT
-(22 rows)
+(6 rows)
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'pg_class'::regclass;
+ relation | locktype | mode
+----------+----------+------
+(0 rows)
+
+-- Shouldn't see anything with 2pc decoding off
+:get_no2pc
+ data
+-----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:12
+ COMMIT
+(3 rows)
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+:get_with2pc
+ data
+-----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:12
+ PREPARE TRANSACTION 'test_prepared_lock2'
+ COMMIT PREPARED 'test_prepared_lock2'
+(4 rows)
+
+:get_with2pc_nofilter
+ data
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:12
+ PREPARE TRANSACTION 'test_prepared_lock2'
+ COMMIT PREPARED 'test_prepared_lock2'
+(8 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- Both will work normally after we commit
+:get_no2pc
+ data
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ COMMIT
+(4 rows)
+
+:get_with2pc
+ data
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ COMMIT
+(4 rows)
+
+:get_with2pc_nofilter
+ data
+--------------------------------------
+ COMMIT PREPARED 'test_prepared_lock'
+(1 row)
+
+-- cleanup
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+------
+(0 rows)
SELECT pg_drop_replication_slot('regression_slot');
pg_drop_replication_slot
@@ -72,3 +357,15 @@ SELECT pg_drop_replication_slot('regression_slot');
(1 row)
+SELECT pg_drop_replication_slot('regression_slot_2pc');
+ pg_drop_replication_slot
+--------------------------
+
+(1 row)
+
+SELECT pg_drop_replication_slot('regression_slot_2pc_nofilter');
+ pg_drop_replication_slot
+--------------------------
+
+(1 row)
+
diff --git a/contrib/test_decoding/sql/prepared.sql b/contrib/test_decoding/sql/prepared.sql
index e72639767e..4197766c50 100644
--- a/contrib/test_decoding/sql/prepared.sql
+++ b/contrib/test_decoding/sql/prepared.sql
@@ -1,22 +1,41 @@
-- predictability
SET synchronous_commit = on;
SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc', 'test_decoding');
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc_nofilter', 'test_decoding');
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+
+-- Reused queries
+\set get_no2pc 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'');'
+\set get_with2pc 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot_2pc'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'', ''twophase-decoding'', ''1'');'
+\set get_with2pc_nofilter 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot_2pc_nofilter'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'', ''twophase-decoding'', ''1'', ''twophase-decode-with-catalog-changes'', ''1'');'
-- test simple successful use of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (1);
PREPARE TRANSACTION 'test_prepared#1';
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
COMMIT PREPARED 'test_prepared#1';
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
INSERT INTO test_prepared1 VALUES (2);
-- test abort of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (3);
PREPARE TRANSACTION 'test_prepared#2';
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
ROLLBACK PREPARED 'test_prepared#2';
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
INSERT INTO test_prepared1 VALUES (4);
@@ -27,18 +46,74 @@ ALTER TABLE test_prepared1 ADD COLUMN data text;
INSERT INTO test_prepared1 VALUES (6, 'frakbar');
PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'test_prepared1'::regclass;
+
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists. Our 2pc filter callback will skip decoding of xacts
+-- with catalog changes at PREPARE time, so we don't decode it now.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
INSERT INTO test_prepared2 VALUES (7);
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
COMMIT PREPARED 'test_prepared#3';
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
-- make sure stuff still works
INSERT INTO test_prepared1 VALUES (8);
INSERT INTO test_prepared2 VALUES (9);
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'pg_class'::regclass;
+
+-- Shouldn't see anything with 2pc decoding off
+:get_no2pc
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+:get_with2pc
+:get_with2pc_nofilter
+RESET statement_timeout;
+
+COMMIT PREPARED 'test_prepared_lock';
+
+-- Both will work normally after we commit
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
-- cleanup
DROP TABLE test_prepared1;
@@ -48,3 +123,5 @@ DROP TABLE test_prepared2;
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
SELECT pg_drop_replication_slot('regression_slot');
+SELECT pg_drop_replication_slot('regression_slot_2pc');
+SELECT pg_drop_replication_slot('regression_slot_2pc_nofilter');
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000000..c0126fca5b
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,85 @@
+# Test logical decoding of two-phase commit (2PC) transactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+ 'postgresql.conf', qq(
+ max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+ "INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+ "SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# This test exercises a concurrent abort while logical decoding is in
+# progress. The decode-delay option makes the plugin sleep for that many
+# seconds on each change it decodes. While this delayed decoding is running, we
+# issue a ROLLBACK PREPARED from another session. That should stop decoding
+# immediately, and the next pg_logical_slot_get_changes call should show only
+# a few records decoded from the whole two-phase transaction.
+
+# consume all changes so far
+#$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'twophase-decoding', '1', 'twophase-decode-with-catalog-changes', '1');");
+
+$node_logical->safe_psql('postgres', "
+ BEGIN;
+ INSERT INTO tab VALUES (11);
+ INSERT INTO tab VALUES (12);
+ PREPARE TRANSACTION 'test_prepared_tab';");
+
+# start decoding the above with decode-delay in the background.
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# Decode now; this should decode only one INSERT record and should include
+# an ABORT entry because of the ROLLBACK PREPARED below.
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'twophase-decoding', '1', 'twophase-decode-with-catalog-changes', '1', 'decode-delay', '3');\" \&");
+
+# sleep for a little while (shorter than decode-delay)
+$node_logical->safe_psql('postgres', "select pg_sleep(1)");
+
+# Roll back the prepared transaction while the background session is still
+# sleeping in decode-delay on its first change record.
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# wait for decoding to stop
+$node_logical->psql('postgres', "select pg_sleep(4)");
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'twophase-decoding', '1', 'twophase-decode-with-catalog-changes', '1');");
+
+# check for occurrence of log about stopping decoding
+my $output_file = slurp_file($node_logical->logfile());
+my $abort_str = "stopping decoding of test_prepared_tab ";
+like($output_file, qr/$abort_str/, "ABORT found in server log");
+
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+ BEGIN;
+ INSERT INTO tab VALUES (11);
+ INSERT INTO tab VALUES (12);
+ PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'twophase-decoding', '1', 'twophase-decode-with-catalog-changes', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit after the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'twophase-decoding', '1', 'twophase-decode-with-catalog-changes', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 0f18afa852..b2dee6cfc2 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -24,6 +24,8 @@
#include "replication/message.h"
#include "replication/origin.h"
+#include "storage/procarray.h"
+
#include "utils/builtins.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
@@ -46,6 +48,9 @@ typedef struct
bool skip_empty_xacts;
bool xact_wrote_changes;
bool only_local;
+ bool twophase_decoding;
+ bool twophase_decode_with_catalog_changes;
+ int decode_delay; /* seconds to sleep after every change record */
} TestDecodingData;
static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -59,6 +64,8 @@ static void pg_output_begin(LogicalDecodingContext *ctx,
bool last_write);
static void pg_decode_commit_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pg_decode_abort_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr abort_lsn);
static void pg_decode_change(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, Relation rel,
ReorderBufferChange *change);
@@ -68,6 +75,20 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr message_lsn,
bool transactional, const char *prefix,
Size sz, const char *message);
+static bool pg_filter_prepare(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid);
+static bool pg_filter_decode_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
void
_PG_init(void)
@@ -85,9 +106,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
cb->begin_cb = pg_decode_begin_txn;
cb->change_cb = pg_decode_change;
cb->commit_cb = pg_decode_commit_txn;
+ cb->abort_cb = pg_decode_abort_txn;
cb->filter_by_origin_cb = pg_decode_filter;
cb->shutdown_cb = pg_decode_shutdown;
cb->message_cb = pg_decode_message;
+ cb->filter_prepare_cb = pg_filter_prepare;
+ cb->filter_decode_txn_cb = pg_filter_decode_txn;
+ cb->prepare_cb = pg_decode_prepare_txn;
+ cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+ cb->abort_prepared_cb = pg_decode_abort_prepared_txn;
}
@@ -107,6 +134,9 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
data->include_timestamp = false;
data->skip_empty_xacts = false;
data->only_local = false;
+ data->twophase_decoding = false;
+ data->twophase_decode_with_catalog_changes = false;
+ data->decode_delay = 0;
ctx->output_plugin_private = data;
@@ -156,7 +186,6 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
}
else if (strcmp(elem->defname, "skip-empty-xacts") == 0)
{
-
if (elem->arg == NULL)
data->skip_empty_xacts = true;
else if (!parse_bool(strVal(elem->arg), &data->skip_empty_xacts))
@@ -167,7 +196,6 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
}
else if (strcmp(elem->defname, "only-local") == 0)
{
-
if (elem->arg == NULL)
data->only_local = true;
else if (!parse_bool(strVal(elem->arg), &data->only_local))
@@ -176,6 +204,41 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
errmsg("could not parse value \"%s\" for parameter \"%s\"",
strVal(elem->arg), elem->defname)));
}
+ else if (strcmp(elem->defname, "twophase-decoding") == 0)
+ {
+ if (elem->arg == NULL)
+ data->twophase_decoding = true;
+ else if (!parse_bool(strVal(elem->arg), &data->twophase_decoding))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("could not parse value \"%s\" for parameter \"%s\"",
+ strVal(elem->arg), elem->defname)));
+ }
+ else if (strcmp(elem->defname, "twophase-decode-with-catalog-changes") == 0)
+ {
+ if (elem->arg == NULL)
+ data->twophase_decode_with_catalog_changes = true;
+ else if (!parse_bool(strVal(elem->arg), &data->twophase_decode_with_catalog_changes))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("could not parse value \"%s\" for parameter \"%s\"",
+ strVal(elem->arg), elem->defname)));
+ }
+ else if (strcmp(elem->defname, "decode-delay") == 0)
+ {
+ if (elem->arg == NULL)
+ data->decode_delay = 2; /* default to 2 seconds */
+ else
+ data->decode_delay = pg_atoi(strVal(elem->arg),
+ sizeof(int), 0);
+
+ if (data->decode_delay <= 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("Specify positive value for parameter \"%s\","
+ " you specified \"%s\"",
+ elem->defname, strVal(elem->arg))));
+ }
else
{
ereport(ERROR,
@@ -244,6 +307,156 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
OutputPluginWrite(ctx, true);
}
+/* ABORT callback */
+static void
+pg_decode_abort_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+ if (data->include_xids)
+ appendStringInfo(ctx->out, "ABORT %u", txn->xid);
+ else
+ appendStringInfoString(ctx->out, "ABORT");
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* Filter out unnecessary two-phase transactions */
+static bool
+pg_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ /* treat all transactions as one-phase */
+ if (!data->twophase_decoding)
+ return true;
+
+ if (txn && txn_has_catalog_changes(txn) &&
+ !data->twophase_decode_with_catalog_changes)
+ return true;
+
+ /*
+ * even if txn is NULL, decode since twophase_decoding is set
+ */
+ return false;
+}
+
+/*
+ * Check if we should continue to decode this transaction.
+ *
+ * If it has aborted in the meantime, there is no point
+ * in decoding and sending the rest of the changes; we might
+ * as well ask the subscribers to abort immediately.
+ *
+ * This should be called if we are streaming a transaction
+ * before it's committed or if we are decoding a 2PC
+ * transaction. Otherwise we always decode committed
+ * transactions.
+ *
+ * Additional checks can be added here, as needed.
+ */
+static bool
+pg_filter_decode_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn)
+{
+ /*
+ * Due to caching, repeated TransactionIdDidAbort calls
+ * shouldn't be that expensive
+ */
+ if (txn != NULL &&
+ TransactionIdIsValid(txn->xid) &&
+ TransactionIdDidAbort(txn->xid))
+ return true;
+
+ /* if txn is NULL, filter it out */
+ return txn == NULL;
+}
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (!data->twophase_decoding)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* ABORT PREPARED callback */
+static void
+pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (!data->twophase_decoding)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
static bool
pg_decode_filter(LogicalDecodingContext *ctx,
RepOriginId origin_id)
@@ -412,6 +625,10 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
class_form = RelationGetForm(relation);
tupdesc = RelationGetDescr(relation);
+ /* if decode_delay is specified, sleep for that many seconds before decoding each change */
+ if (data->decode_delay > 0)
+ pg_usleep(data->decode_delay * 1000000L);
+
/* Avoid leaking memory by using and resetting our own context */
old = MemoryContextSwitchTo(data->context);
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index fa101937e5..c66f1f7d20 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -384,8 +384,14 @@ typedef struct OutputPluginCallbacks
LogicalDecodeBeginCB begin_cb;
LogicalDecodeChangeCB change_cb;
LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeAbortCB abort_cb;
LogicalDecodeMessageCB message_cb;
+ LogicalDecodeFilterPrepareCB filter_prepare_cb;
+ LogicalDecodePrepareCB prepare_cb;
+ LogicalDecodeCommitPreparedCB commit_prepared_cb;
+ LogicalDecodeAbortPreparedCB abort_prepared_cb;
LogicalDecodeFilterByOriginCB filter_by_origin_cb;
+ LogicalDecodeFilterDecodeTxnCB filter_decode_txn_cb;
LogicalDecodeShutdownCB shutdown_cb;
} OutputPluginCallbacks;
@@ -454,7 +460,12 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
never get
decoded. Successful savepoints are
folded into the transaction containing them in the order they were
- executed within that transaction.
+ executed within that transaction. A transaction that is prepared for
+ a two-phase commit using <command>PREPARE TRANSACTION</command> will
+ also be decoded if the output plugin callbacks needed for decoding
+ it are provided. The transaction currently being decoded may be
+ aborted concurrently by a <command>ROLLBACK PREPARED</command>
+ command. In that case, logical decoding of this transaction is aborted too.
</para>
<note>
@@ -550,6 +561,74 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
</para>
</sect3>
+ <sect3 id="logicaldecoding-output-plugin-prepare">
+ <title>Transaction Prepare Callback</title>
+
+ <para>
+ The optional <function>prepare_cb</function> callback is called whenever
+ a transaction which is prepared for two-phase commit has been
+ decoded. The <function>change_cb</function> callbacks for all modified
+ rows will have been called before this, if there have been any modified
+ rows.
+<programlisting>
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+ <title>Commit Prepared Transaction Callback</title>
+
+ <para>
+ The optional <function>commit_prepared_cb</function> callback is called whenever
+ a <command>COMMIT PREPARED</command> has been decoded. The <parameter>gid</parameter> field,
+ which is part of the <parameter>txn</parameter> parameter, can be used in this
+ callback.
+<programlisting>
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-abort-prepared">
+ <title>Rollback Prepared Transaction Callback</title>
+
+ <para>
+ The optional <function>abort_prepared_cb</function> callback is called whenever
+ a <command>ROLLBACK PREPARED</command> has been decoded. The <parameter>gid</parameter> field,
+ which is part of the <parameter>txn</parameter> parameter, can be used in this
+ callback.
+<programlisting>
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-abort">
+ <title>Transaction Abort Callback</title>
+
+ <para>
+ The required <function>abort_cb</function> callback is called whenever
+ a transaction abort has to be initiated. This can happen if we are
+ decoding a transaction that has been prepared for two-phase commit and
+ a concurrent rollback happens while we are decoding it. In such cases
+ it may make sense to check whether the rollback has already happened
+ before we start looking at the changes, so that decoding of such
+ transactions can be avoided completely.
+<programlisting>
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
<sect3 id="logicaldecoding-output-plugin-change">
<title>Change Callback</title>
@@ -614,6 +693,53 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
</para>
</sect3>
+ <sect3 id="logicaldecoding-output-plugin-filter-decode">
+ <title>Decode Filter Callback</title>
+
+ <para>
+ The optional <function>filter_decode_txn_cb</function> callback
+ is called to determine whether decoding of data that is part of the current
+ transaction should continue.
+<programlisting>
+typedef bool (*LogicalDecodeFilterDecodeTxnCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn);
+</programlisting>
+ The <parameter>ctx</parameter> parameter has the same contents
+ as for the other callbacks. The <parameter>txn</parameter> parameter
+ contains meta information about the transaction, like its XID.
+ Note, however, that it can be NULL in some cases. To signal that the decoding
+ process should terminate, return true; otherwise return false.
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+ <title>Prepare Filter Callback</title>
+
+ <para>
+ The optional <function>filter_prepare_cb</function> callback
+ is called to determine whether data that is part of the current
+ two-phase commit transaction should be decoded at this prepare
+ stage or later, as a regular one-phase transaction, at
+ <command>COMMIT PREPARED</command> time. To signal that decoding
+ at prepare time should be skipped, return true; otherwise return false.
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+</programlisting>
+ The <parameter>ctx</parameter> parameter has the same contents
+ as for the other callbacks. The <parameter>txn</parameter> parameter
+ contains meta information about the transaction. The <parameter>xid</parameter>
+ contains the XID because <parameter>txn</parameter> can be NULL in some cases.
+ The <parameter>gid</parameter> is the identifier that later identifies this
+ transaction for <command>COMMIT PREPARED</command> or <command>ROLLBACK PREPARED</command>.
+ The callback has to provide the same static answer for a given combination of
+ <parameter>xid</parameter> and <parameter>gid</parameter> every time it is
+ called. To signal that decoding should be skipped, return true; false otherwise.
+ </para>
+ </sect3>
+
<sect3 id="logicaldecoding-output-plugin-message">
<title>Generic Message Callback</title>
diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index e5eef9ea43..b3e2fc3036 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -102,6 +102,14 @@ ParseCommitRecord(uint8 info, xl_xact_commit *xlrec, xl_xact_parsed_commit *pars
parsed->twophase_xid = xl_twophase->xid;
data += sizeof(xl_xact_twophase);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_GID)
+ {
+ int gidlen;
+ strlcpy(parsed->twophase_gid, data, sizeof(parsed->twophase_gid));
+ gidlen = strlen(data) + 1;
+ data += MAXALIGN(gidlen);
+ }
}
if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
@@ -139,6 +147,16 @@ ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_abort *parsed)
data += sizeof(xl_xact_xinfo);
}
+ if (parsed->xinfo & XACT_XINFO_HAS_DBINFO)
+ {
+ xl_xact_dbinfo *xl_dbinfo = (xl_xact_dbinfo *) data;
+
+ parsed->dbId = xl_dbinfo->dbId;
+ parsed->tsId = xl_dbinfo->tsId;
+
+ data += sizeof(xl_xact_dbinfo);
+ }
+
if (parsed->xinfo & XACT_XINFO_HAS_SUBXACTS)
{
xl_xact_subxacts *xl_subxacts = (xl_xact_subxacts *) data;
@@ -168,6 +186,27 @@ ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_abort *parsed)
parsed->twophase_xid = xl_twophase->xid;
data += sizeof(xl_xact_twophase);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_GID)
+ {
+ int gidlen;
+ strlcpy(parsed->twophase_gid, data, sizeof(parsed->twophase_gid));
+ gidlen = strlen(data) + 1;
+ data += MAXALIGN(gidlen);
+ }
+ }
+
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ xl_xact_origin xl_origin;
+
+ /* we're only guaranteed 4 byte alignment, so copy onto stack */
+ memcpy(&xl_origin, data, sizeof(xl_origin));
+
+ parsed->origin_lsn = xl_origin.origin_lsn;
+ parsed->origin_timestamp = xl_origin.origin_timestamp;
+
+ data += sizeof(xl_xact_origin);
}
}
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index c479c4881b..d13e7cc146 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -144,11 +144,7 @@ int max_prepared_xacts = 0;
*
* typedef struct GlobalTransactionData *GlobalTransaction appears in
* twophase.h
- *
- * Note that the max value of GIDSIZE must fit in the uint16 gidlen,
- * specified in TwoPhaseFileHeader.
*/
-#define GIDSIZE 200
typedef struct GlobalTransactionData
{
@@ -211,12 +207,14 @@ static void RecordTransactionCommitPrepared(TransactionId xid,
RelFileNode *rels,
int ninvalmsgs,
SharedInvalidationMessage *invalmsgs,
- bool initfileinval);
+ bool initfileinval,
+ const char *gid);
static void RecordTransactionAbortPrepared(TransactionId xid,
int nchildren,
TransactionId *children,
int nrels,
- RelFileNode *rels);
+ RelFileNode *rels,
+ const char *gid);
static void ProcessRecords(char *bufptr, TransactionId xid,
const TwoPhaseCallback callbacks[]);
static void RemoveGXact(GlobalTransaction gxact);
@@ -556,7 +554,7 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held)
* Locate the prepared transaction and mark it busy for COMMIT or PREPARE.
*/
static GlobalTransaction
-LockGXact(const char *gid, Oid user)
+LockGXact(const char *gid, Oid user, bool missing_ok)
{
int i;
@@ -616,7 +614,8 @@ LockGXact(const char *gid, Oid user)
LWLockRelease(TwoPhaseStateLock);
- ereport(ERROR,
+ if (!missing_ok)
+ ereport(ERROR,
(errcode(ERRCODE_UNDEFINED_OBJECT),
errmsg("prepared transaction with identifier \"%s\" does not exist",
gid)));
@@ -898,7 +897,7 @@ TwoPhaseGetDummyProc(TransactionId xid)
/*
* Header for a 2PC state file
*/
-#define TWOPHASE_MAGIC 0x57F94533 /* format identifier */
+#define TWOPHASE_MAGIC 0x57F94534 /* format identifier */
typedef struct TwoPhaseFileHeader
{
@@ -914,6 +913,8 @@ typedef struct TwoPhaseFileHeader
int32 ninvalmsgs; /* number of cache invalidation messages */
bool initfileinval; /* does relcache init file need invalidation? */
uint16 gidlen; /* length of the GID - GID follows the header */
+ XLogRecPtr origin_lsn; /* lsn of this record at origin node */
+ TimestampTz origin_timestamp; /* time of prepare at origin node */
} TwoPhaseFileHeader;
/*
@@ -1065,6 +1066,7 @@ EndPrepare(GlobalTransaction gxact)
{
TwoPhaseFileHeader *hdr;
StateFileChunk *record;
+ bool replorigin;
/* Add the end sentinel to the list of 2PC records */
RegisterTwoPhaseRecord(TWOPHASE_RM_END_ID, 0,
@@ -1075,6 +1077,21 @@ EndPrepare(GlobalTransaction gxact)
Assert(hdr->magic == TWOPHASE_MAGIC);
hdr->total_len = records.total_len + sizeof(pg_crc32c);
+ replorigin = (replorigin_session_origin != InvalidRepOriginId &&
+ replorigin_session_origin != DoNotReplicateId);
+
+ if (replorigin)
+ {
+ Assert(replorigin_session_origin_lsn != InvalidXLogRecPtr);
+ hdr->origin_lsn = replorigin_session_origin_lsn;
+ hdr->origin_timestamp = replorigin_session_origin_timestamp;
+ }
+ else
+ {
+ hdr->origin_lsn = InvalidXLogRecPtr;
+ hdr->origin_timestamp = 0;
+ }
+
/*
* If the data size exceeds MaxAllocSize, we won't be able to read it in
* ReadTwoPhaseFile. Check for that now, rather than fail in the case
@@ -1107,7 +1124,16 @@ EndPrepare(GlobalTransaction gxact)
XLogBeginInsert();
for (record = records.head; record != NULL; record = record->next)
XLogRegisterData(record->data, record->len);
+
+ XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
+
gxact->prepare_end_lsn = XLogInsert(RM_XACT_ID, XLOG_XACT_PREPARE);
+
+ if (replorigin)
+ /* Move LSNs forward for this replication origin */
+ replorigin_session_advance(replorigin_session_origin_lsn,
+ gxact->prepare_end_lsn);
+
XLogFlush(gxact->prepare_end_lsn);
/* If we crash now, we have prepared: WAL replay will fix things */
@@ -1283,6 +1309,43 @@ ReadTwoPhaseFile(TransactionId xid, bool give_warnings)
return buf;
}
+/*
+ * ParsePrepareRecord: parse the 2PC state data of a PREPARE record into *parsed
+ */
+void
+ParsePrepareRecord(uint8 info, char *xlrec, xl_xact_parsed_prepare *parsed)
+{
+ TwoPhaseFileHeader *hdr;
+ char *bufptr;
+
+ hdr = (TwoPhaseFileHeader *) xlrec;
+ bufptr = xlrec + MAXALIGN(sizeof(TwoPhaseFileHeader));
+
+ parsed->origin_lsn = hdr->origin_lsn;
+ parsed->origin_timestamp = hdr->origin_timestamp;
+ parsed->twophase_xid = hdr->xid;
+ parsed->dbId = hdr->database;
+ parsed->nsubxacts = hdr->nsubxacts;
+ parsed->ncommitrels = hdr->ncommitrels;
+ parsed->nabortrels = hdr->nabortrels;
+ parsed->nmsgs = hdr->ninvalmsgs;
+
+ strncpy(parsed->twophase_gid, bufptr, hdr->gidlen);
+ bufptr += MAXALIGN(hdr->gidlen);
+
+ parsed->subxacts = (TransactionId *) bufptr;
+ bufptr += MAXALIGN(hdr->nsubxacts * sizeof(TransactionId));
+
+ parsed->commitrels = (RelFileNode *) bufptr;
+ bufptr += MAXALIGN(hdr->ncommitrels * sizeof(RelFileNode));
+
+ parsed->abortrels = (RelFileNode *) bufptr;
+ bufptr += MAXALIGN(hdr->nabortrels * sizeof(RelFileNode));
+
+ parsed->msgs = (SharedInvalidationMessage *) bufptr;
+ bufptr += MAXALIGN(hdr->ninvalmsgs * sizeof(SharedInvalidationMessage));
+}
+
/*
* Reads 2PC data from xlog. During checkpoint this data will be moved to
@@ -1365,7 +1428,7 @@ StandbyTransactionIdIsPrepared(TransactionId xid)
* FinishPreparedTransaction: execute COMMIT PREPARED or ROLLBACK PREPARED
*/
void
-FinishPreparedTransaction(const char *gid, bool isCommit)
+FinishPreparedTransaction(const char *gid, bool isCommit, bool missing_ok)
{
GlobalTransaction gxact;
PGPROC *proc;
@@ -1386,8 +1449,20 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
/*
* Validate the GID, and lock the GXACT to ensure that two backends do not
* try to commit the same GID at once.
+ *
+ * On the apply side of logical decoding, a prepared transaction may have
+ * been aborted while it was still being decoded. In that case we stop
+ * decoding and abort the transaction immediately, but the ROLLBACK
+ * PREPARED still reaches the subscriber, so it is OK for the GID to be
+ * missing here.
*/
- gxact = LockGXact(gid, GetUserId());
+ gxact = LockGXact(gid, GetUserId(), missing_ok);
+ if (gxact == NULL)
+ {
+ Assert(missing_ok && !isCommit);
+ return;
+ }
+
proc = &ProcGlobal->allProcs[gxact->pgprocno];
pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
xid = pgxact->xid;
@@ -1435,11 +1510,12 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
hdr->nsubxacts, children,
hdr->ncommitrels, commitrels,
hdr->ninvalmsgs, invalmsgs,
- hdr->initfileinval);
+ hdr->initfileinval, gid);
else
RecordTransactionAbortPrepared(xid,
hdr->nsubxacts, children,
- hdr->nabortrels, abortrels);
+ hdr->nabortrels, abortrels,
+ gid);
ProcArrayRemove(proc, latestXid);
@@ -1752,7 +1828,8 @@ restoreTwoPhaseData(void)
if (buf == NULL)
continue;
- PrepareRedoAdd(buf, InvalidXLogRecPtr, InvalidXLogRecPtr);
+ PrepareRedoAdd(buf, InvalidXLogRecPtr,
+ InvalidXLogRecPtr, InvalidRepOriginId);
}
}
LWLockRelease(TwoPhaseStateLock);
@@ -2165,7 +2242,8 @@ RecordTransactionCommitPrepared(TransactionId xid,
RelFileNode *rels,
int ninvalmsgs,
SharedInvalidationMessage *invalmsgs,
- bool initfileinval)
+ bool initfileinval,
+ const char *gid)
{
XLogRecPtr recptr;
TimestampTz committs = GetCurrentTimestamp();
@@ -2193,7 +2271,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
ninvalmsgs, invalmsgs,
initfileinval, false,
MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
- xid);
+ xid, gid);
if (replorigin)
@@ -2255,7 +2333,8 @@ RecordTransactionAbortPrepared(TransactionId xid,
int nchildren,
TransactionId *children,
int nrels,
- RelFileNode *rels)
+ RelFileNode *rels,
+ const char *gid)
{
XLogRecPtr recptr;
@@ -2278,7 +2357,7 @@ RecordTransactionAbortPrepared(TransactionId xid,
nchildren, children,
nrels, rels,
MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
- xid);
+ xid, gid);
/* Always flush, since we're about to remove the 2PC state file */
XLogFlush(recptr);
@@ -2309,7 +2388,8 @@ RecordTransactionAbortPrepared(TransactionId xid,
* data, the entry is marked as located on disk.
*/
void
-PrepareRedoAdd(char *buf, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn, RepOriginId origin_id)
{
TwoPhaseFileHeader *hdr = (TwoPhaseFileHeader *) buf;
char *bufptr;
@@ -2358,6 +2438,13 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
Assert(TwoPhaseState->numPrepXacts < max_prepared_xacts);
TwoPhaseState->prepXacts[TwoPhaseState->numPrepXacts++] = gxact;
+ if (origin_id != InvalidRepOriginId)
+ {
+ /* recover apply progress */
+ replorigin_advance(origin_id, hdr->origin_lsn, end_lsn,
+ false /* backward */ , false /* WAL */ );
+ }
+
elog(DEBUG2, "added 2PC data in shared memory for transaction %u", gxact->xid);
}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index ea81f4b5de..e19fac4f7b 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1227,7 +1227,7 @@ RecordTransactionCommit(void)
nmsgs, invalMessages,
RelcacheInitFileInval, forceSyncCommit,
MyXactFlags,
- InvalidTransactionId /* plain commit */ );
+ InvalidTransactionId, NULL /* plain commit */ );
if (replorigin)
/* Move LSNs forward for this replication origin */
@@ -1579,7 +1579,8 @@ RecordTransactionAbort(bool isSubXact)
XactLogAbortRecord(xact_time,
nchildren, children,
nrels, rels,
- MyXactFlags, InvalidTransactionId);
+ MyXactFlags, InvalidTransactionId,
+ NULL);
/*
* Report the latest async abort LSN, so that the WAL writer knows to
@@ -5247,7 +5248,6 @@ xactGetCommittedChildren(TransactionId **ptr)
* XLOG support routines
*/
-
/*
* Log the commit record for a plain or twophase transaction commit.
*
@@ -5260,7 +5260,8 @@ XactLogCommitRecord(TimestampTz commit_time,
int nrels, RelFileNode *rels,
int nmsgs, SharedInvalidationMessage *msgs,
bool relcacheInval, bool forceSync,
- int xactflags, TransactionId twophase_xid)
+ int xactflags, TransactionId twophase_xid,
+ const char *twophase_gid)
{
xl_xact_commit xlrec;
xl_xact_xinfo xl_xinfo;
@@ -5272,6 +5273,7 @@ XactLogCommitRecord(TimestampTz commit_time,
xl_xact_origin xl_origin;
uint8 info;
+ int gidlen = 0;
Assert(CritSectionCount > 0);
@@ -5334,6 +5336,13 @@ XactLogCommitRecord(TimestampTz commit_time,
{
xl_xinfo.xinfo |= XACT_XINFO_HAS_TWOPHASE;
xl_twophase.xid = twophase_xid;
+ Assert(twophase_gid != NULL);
+
+ if (XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_GID;
+ gidlen = strlen(twophase_gid) + 1; /* include '\0' */
+ }
}
/* dump transaction origin information */
@@ -5384,8 +5393,19 @@ XactLogCommitRecord(TimestampTz commit_time,
}
if (xl_xinfo.xinfo & XACT_XINFO_HAS_TWOPHASE)
+ {
XLogRegisterData((char *) (&xl_twophase), sizeof(xl_xact_twophase));
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_GID)
+ {
+ static const char zeroes[MAXIMUM_ALIGNOF] = { 0 };
+ XLogRegisterData((char *) twophase_gid, gidlen);
+ if (MAXALIGN(gidlen) != gidlen)
+ XLogRegisterData((char *) zeroes, MAXALIGN(gidlen) - gidlen);
+ }
+ }
+
+
if (xl_xinfo.xinfo & XACT_XINFO_HAS_ORIGIN)
XLogRegisterData((char *) (&xl_origin), sizeof(xl_xact_origin));
@@ -5405,15 +5425,19 @@ XLogRecPtr
XactLogAbortRecord(TimestampTz abort_time,
int nsubxacts, TransactionId *subxacts,
int nrels, RelFileNode *rels,
- int xactflags, TransactionId twophase_xid)
+ int xactflags, TransactionId twophase_xid,
+ const char *twophase_gid)
{
xl_xact_abort xlrec;
xl_xact_xinfo xl_xinfo;
xl_xact_subxacts xl_subxacts;
xl_xact_relfilenodes xl_relfilenodes;
xl_xact_twophase xl_twophase;
+ xl_xact_dbinfo xl_dbinfo;
+ xl_xact_origin xl_origin;
uint8 info;
+ int gidlen = 0;
Assert(CritSectionCount > 0);
@@ -5449,6 +5473,31 @@ XactLogAbortRecord(TimestampTz abort_time,
{
xl_xinfo.xinfo |= XACT_XINFO_HAS_TWOPHASE;
xl_twophase.xid = twophase_xid;
+ Assert(twophase_gid != NULL);
+
+ if (XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_GID;
+ gidlen = strlen(twophase_gid) + 1; /* include '\0' */
+ }
+ }
+
+ if (TransactionIdIsValid(twophase_xid) && XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_DBINFO;
+ xl_dbinfo.dbId = MyDatabaseId;
+ xl_dbinfo.tsId = MyDatabaseTableSpace;
+ }
+
+ /* dump transaction origin information only for abort prepared */
+ if ((replorigin_session_origin != InvalidRepOriginId) &&
+ TransactionIdIsValid(twophase_xid) &&
+ XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_ORIGIN;
+
+ xl_origin.origin_lsn = replorigin_session_origin_lsn;
+ xl_origin.origin_timestamp = replorigin_session_origin_timestamp;
}
if (xl_xinfo.xinfo != 0)
@@ -5463,6 +5512,9 @@ XactLogAbortRecord(TimestampTz abort_time,
if (xl_xinfo.xinfo != 0)
XLogRegisterData((char *) (&xl_xinfo), sizeof(xl_xinfo));
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_DBINFO)
+ XLogRegisterData((char *) (&xl_dbinfo), sizeof(xl_dbinfo));
+
if (xl_xinfo.xinfo & XACT_XINFO_HAS_SUBXACTS)
{
XLogRegisterData((char *) (&xl_subxacts),
@@ -5480,7 +5532,23 @@ XactLogAbortRecord(TimestampTz abort_time,
}
if (xl_xinfo.xinfo & XACT_XINFO_HAS_TWOPHASE)
+ {
XLogRegisterData((char *) (&xl_twophase), sizeof(xl_xact_twophase));
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_GID)
+ {
+ static const char zeroes[MAXIMUM_ALIGNOF] = { 0 };
+ XLogRegisterData((char *) twophase_gid, gidlen);
+ if (MAXALIGN(gidlen) != gidlen)
+ XLogRegisterData((char *) zeroes, MAXALIGN(gidlen) - gidlen);
+ }
+ }
+
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_ORIGIN)
+ XLogRegisterData((char *) (&xl_origin), sizeof(xl_xact_origin));
+
+ if (TransactionIdIsValid(twophase_xid))
+ XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
+
return XLogInsert(RM_XACT_ID, info);
}
@@ -5803,7 +5871,8 @@ xact_redo(XLogReaderState *record)
LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
PrepareRedoAdd(XLogRecGetData(record),
record->ReadRecPtr,
- record->EndRecPtr);
+ record->EndRecPtr,
+ XLogRecGetOrigin(record));
LWLockRelease(TwoPhaseStateLock);
}
else if (info == XLOG_XACT_ASSIGNMENT)
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 6eb0d5527e..b45739d971 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -34,6 +34,7 @@
#include "access/xlogutils.h"
#include "access/xlogreader.h"
#include "access/xlogrecord.h"
+#include "access/twophase.h"
#include "catalog/pg_control.h"
@@ -72,6 +73,8 @@ static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_commit *parsed, TransactionId xid);
static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare *parsed);
/* common function to decode tuples */
static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -280,16 +283,33 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
break;
}
case XLOG_XACT_PREPARE:
+ {
+ xl_xact_parsed_prepare parsed;
- /*
- * Currently decoding ignores PREPARE TRANSACTION and will just
- * decode the transaction when the COMMIT PREPARED is sent or
- * throw away the transaction's contents when a ROLLBACK PREPARED
- * is received. In the future we could add code to expose prepared
- * transactions in the changestream allowing for a kind of
- * distributed 2PC.
- */
- ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+ /* check that output plugin is capable of twophase decoding */
+ if (!ctx->enable_twophase)
+ {
+ ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+ break;
+ }
+
+ /* ok, parse it */
+ ParsePrepareRecord(XLogRecGetInfo(buf->record),
+ XLogRecGetData(buf->record), &parsed);
+
+ /* does output plugin want this particular transaction? */
+ if (ctx->callbacks.filter_prepare_cb &&
+ ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+ parsed.twophase_gid))
+ {
+ ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+ buf->origptr);
+ break;
+ }
+
+ DecodePrepare(ctx, buf, &parsed);
+ break;
+ }
break;
default:
elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
@@ -627,9 +647,71 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
buf->origptr, buf->endptr);
}
+ if (TransactionIdIsValid(parsed->twophase_xid) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder,
+ parsed->twophase_xid, parsed->twophase_gid))
+ {
+ Assert(xid == parsed->twophase_xid);
+ /* we are processing COMMIT PREPARED */
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn, parsed->twophase_gid, true);
+ }
+ else
+ {
+ /* replay actions of all transaction + subtransactions in order */
+ ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn);
+ }
+}
+
+/*
+ * Decode a PREPARE record. The logic is similar to that of DecodeCommit.
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare *parsed)
+{
+ XLogRecPtr origin_lsn = parsed->origin_lsn;
+ TimestampTz commit_time = parsed->origin_timestamp;
+ RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+ int i;
+ TransactionId xid = parsed->twophase_xid;
+
+ /*
+ * Process invalidation messages, even if we're not interested in the
+ * transaction's contents, since the various caches need to always be
+ * consistent.
+ */
+ if (parsed->nmsgs > 0)
+ {
+ ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
+ parsed->nmsgs, parsed->msgs);
+ ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+ }
+
+ if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+ (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+ FilterByOrigin(ctx, origin_id))
+ {
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr);
+ }
+ ReorderBufferForget(ctx->reorder, xid, buf->origptr);
+
+ return;
+ }
+
+ /* tell the reorderbuffer about the surviving subtransactions */
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+ buf->origptr, buf->endptr);
+ }
+
/* replay actions of all transaction + subtransactions in order */
- ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
- commit_time, origin_id, origin_lsn);
+ ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn, parsed->twophase_gid);
}
/*
@@ -641,6 +723,30 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid)
{
int i;
+ XLogRecPtr origin_lsn = InvalidXLogRecPtr;
+ TimestampTz commit_time = 0;
+ RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ origin_lsn = parsed->origin_lsn;
+ commit_time = parsed->origin_timestamp;
+ }
+
+ /*
+ * If it's ROLLBACK PREPARED then handle it via callbacks.
+ */
+ if (TransactionIdIsValid(xid) &&
+ !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+ parsed->dbId == ctx->slot->data.database &&
+ !FilterByOrigin(ctx, origin_id) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
+ {
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid, false);
+ return;
+ }
for (i = 0; i < parsed->nsubxacts; i++)
{
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 7637efc32e..977db8eec1 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -60,6 +60,18 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+static void abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+static bool filter_decode_txn_cb_wrapper(ReorderBuffer *cache,
+ ReorderBufferTXN *txn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change);
static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -125,6 +137,7 @@ StartupDecodingContext(List *output_plugin_options,
MemoryContext context,
old_context;
LogicalDecodingContext *ctx;
+ int twophase_callbacks;
/* shorter lines... */
slot = MyReplicationSlot;
@@ -184,8 +197,27 @@ StartupDecodingContext(List *output_plugin_options,
ctx->reorder->begin = begin_cb_wrapper;
ctx->reorder->apply_change = change_cb_wrapper;
ctx->reorder->commit = commit_cb_wrapper;
+ ctx->reorder->abort = abort_cb_wrapper;
+ ctx->reorder->filter_decode_txn = filter_decode_txn_cb_wrapper;
+ ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+ ctx->reorder->prepare = prepare_cb_wrapper;
+ ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+ ctx->reorder->abort_prepared = abort_prepared_cb_wrapper;
ctx->reorder->message = message_cb_wrapper;
+ /* check that plugin implements all callbacks necessary to perform 2PC */
+ twophase_callbacks = (ctx->callbacks.prepare_cb != NULL) +
+ (ctx->callbacks.commit_prepared_cb != NULL) +
+ (ctx->callbacks.abort_prepared_cb != NULL);
+
+ ctx->enable_twophase = (twophase_callbacks == 3);
+
+ if (twophase_callbacks != 3 && twophase_callbacks != 0)
+ ereport(WARNING,
+ (errmsg("output plugin registered only %d of 3 two-phase callbacks",
+ twophase_callbacks),
+ errdetail("Two-phase transactions will be decoded at commit time.")));
+
ctx->out = makeStringInfo();
ctx->prepare_write = prepare_write;
ctx->write = do_write;
@@ -693,6 +725,122 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
error_context_stack = errcallback.previous;
}
+static void
+abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "abort";
+ state.report_location = txn->final_lsn; /* beginning of abort record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.abort_cb(ctx, txn, abort_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "prepare";
+ state.report_location = txn->final_lsn; /* beginning of prepare record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "commit_prepared";
+ state.report_location = txn->final_lsn; /* beginning of commit record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "abort_prepared";
+ state.report_location = txn->final_lsn; /* beginning of abort record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.abort_prepared_cb(ctx, txn, abort_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
static void
change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change)
@@ -730,6 +878,62 @@ change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
error_context_stack = errcallback.previous;
}
+static bool
+filter_decode_txn_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+ bool ret;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "filter_decode_txn";
+ state.report_location = InvalidXLogRecPtr;
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = false;
+
+ /* do the actual work: call callback */
+ ret = ctx->callbacks.filter_decode_txn_cb(ctx, txn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+ return ret;
+}
+
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+ bool ret;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "filter_prepare";
+ state.report_location = InvalidXLogRecPtr;
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = false;
+
+ /* do the actual work: call callback */
+ ret = ctx->callbacks.filter_prepare_cb(ctx, txn, xid, gid);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+ return ret;
+}
+
bool
filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
{
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 948343e4ae..5d33931223 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -72,10 +72,11 @@ void
logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
XLogRecPtr commit_lsn)
{
- uint8 flags = 0;
+ uint8 flags = 0;
pq_sendbyte(out, 'C'); /* sending COMMIT */
+ flags |= LOGICALREP_IS_COMMIT;
/* send the flags field (unused for now) */
pq_sendbyte(out, flags);
@@ -86,21 +87,106 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
}
/*
- * Read transaction COMMIT from the stream.
+ * Write ABORT to the output stream.
+ */
+void
+logicalrep_write_abort(StringInfo out, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ uint8 flags = 0;
+
+ pq_sendbyte(out, 'C'); /* same message type as COMMIT; the ABORT flag is sent below */
+
+ flags |= LOGICALREP_IS_ABORT;
+ /* send the flags field */
+ pq_sendbyte(out, flags);
+
+ /* send fields */
+ pq_sendint64(out, abort_lsn);
+ pq_sendint64(out, txn->end_lsn);
+ pq_sendint64(out, txn->commit_time);
+}
+
+/*
+ * Read transaction COMMIT|ABORT from the stream.
*/
void
-logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
+logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data,
+ uint8 *flags)
{
- /* read flags (unused for now) */
- uint8 flags = pq_getmsgbyte(in);
+ /* read flags */
+ uint8 commit_flags = pq_getmsgbyte(in);
- if (flags != 0)
- elog(ERROR, "unrecognized flags %u in commit message", flags);
+ if (!(commit_flags & LOGICALREP_COMMIT_MASK))
+ elog(ERROR, "unrecognized flags %u in commit|abort message",
+ commit_flags);
/* read fields */
commit_data->commit_lsn = pq_getmsgint64(in);
commit_data->end_lsn = pq_getmsgint64(in);
commit_data->committime = pq_getmsgint64(in);
+
+ /* set gid to empty */
+ commit_data->gid[0] = '\0';
+
+ *flags = commit_flags;
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ uint8 flags = 0;
+
+ pq_sendbyte(out, 'P'); /* sending PREPARE protocol */
+
+ if (txn->txn_flags & TXN_COMMIT_PREPARED)
+ flags |= LOGICALREP_IS_COMMIT_PREPARED;
+ else if (txn->txn_flags & TXN_ROLLBACK_PREPARED)
+ flags |= LOGICALREP_IS_ROLLBACK_PREPARED;
+ else if (txn->txn_flags & TXN_PREPARE)
+ flags |= LOGICALREP_IS_PREPARE;
+
+ if (flags == 0)
+ elog(ERROR, "unexpected transaction flags %u in prepare message", txn->txn_flags);
+
+ /* send the flags field */
+ pq_sendbyte(out, flags);
+
+ /* send fields */
+ pq_sendint64(out, prepare_lsn);
+ pq_sendint64(out, txn->end_lsn);
+ pq_sendint64(out, txn->commit_time);
+
+ /* send gid */
+ pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepCommitData *commit_data, uint8 *flags)
+{
+ /* read flags */
+ uint8 prep_flags = pq_getmsgbyte(in);
+
+ if (!(prep_flags & LOGICALREP_PREPARE_MASK))
+ elog(ERROR, "unrecognized flags %u in prepare message", prep_flags);
+
+ /* read fields */
+ commit_data->commit_lsn = pq_getmsgint64(in);
+ commit_data->end_lsn = pq_getmsgint64(in);
+ commit_data->committime = pq_getmsgint64(in);
+
+ /* read gid */
+ strlcpy(commit_data->gid, pq_getmsgstring(in), sizeof(commit_data->gid));
+
+ /* set flags */
+ *flags = prep_flags;
}
/*
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index c72a611a39..cf4013af4b 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -623,7 +623,7 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
if (prev_first_lsn != InvalidXLogRecPtr)
Assert(prev_first_lsn < cur_txn->first_lsn);
- Assert(!cur_txn->is_known_as_subxact);
+ Assert(!txn_is_subxact(cur_txn));
prev_first_lsn = cur_txn->first_lsn;
}
#endif
@@ -641,7 +641,7 @@ ReorderBufferGetOldestTXN(ReorderBuffer *rb)
txn = dlist_head_element(ReorderBufferTXN, node, &rb->toplevel_by_lsn);
- Assert(!txn->is_known_as_subxact);
+ Assert(!txn_is_subxact(txn));
Assert(txn->first_lsn != InvalidXLogRecPtr);
return txn;
}
@@ -675,9 +675,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
dlist_push_tail(&txn->subtxns, &subtxn->node);
txn->nsubtxns++;
}
- else if (!subtxn->is_known_as_subxact)
+ else if (!txn_is_subxact(subtxn))
{
- subtxn->is_known_as_subxact = true;
+ subtxn->txn_flags |= TXN_IS_SUBXACT;
Assert(subtxn->nsubtxns == 0);
/* remove from lsn order list of top-level transactions */
@@ -738,9 +738,9 @@ ReorderBufferCommitChild(ReorderBuffer *rb, TransactionId xid,
subtxn->final_lsn = commit_lsn;
subtxn->end_lsn = end_lsn;
- if (!subtxn->is_known_as_subxact)
+ if (!txn_is_subxact(subtxn))
{
- subtxn->is_known_as_subxact = true;
+ subtxn->txn_flags |= TXN_IS_SUBXACT;
Assert(subtxn->nsubtxns == 0);
/* remove from lsn order list of top-level transactions */
@@ -849,7 +849,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
ReorderBufferChange *cur_change;
- if (txn->serialized)
+ if (txn_is_serialized(txn))
{
/* serialize remaining changes */
ReorderBufferSerializeTXN(rb, txn);
@@ -878,7 +878,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
ReorderBufferChange *cur_change;
- if (cur_txn->serialized)
+ if (txn_is_serialized(cur_txn))
{
/* serialize remaining changes */
ReorderBufferSerializeTXN(rb, cur_txn);
@@ -1044,7 +1044,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
* they originally were happening inside another subtxn, so we won't
* ever recurse more than one level deep here.
*/
- Assert(subtxn->is_known_as_subxact);
+ Assert(txn_is_subxact(subtxn));
Assert(subtxn->nsubtxns == 0);
ReorderBufferCleanupTXN(rb, subtxn);
@@ -1083,7 +1083,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
/*
* Remove TXN from its containing list.
*
- * Note: if txn->is_known_as_subxact, we are deleting the TXN from its
+ * Note: if txn_is_subxact(), we are deleting the TXN from its
* parent's list of known subxacts; this leaves the parent's nsubxacts
* count too high, but we don't care. Otherwise, we are deleting the TXN
* from the LSN-ordered list of toplevel TXNs.
@@ -1098,7 +1098,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
Assert(found);
/* remove entries spilled to disk */
- if (txn->serialized)
+ if (txn_is_serialized(txn))
ReorderBufferRestoreCleanup(rb, txn);
/* deallocate */
@@ -1115,7 +1115,7 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
dlist_iter iter;
HASHCTL hash_ctl;
- if (!txn->has_catalog_changes || dlist_is_empty(&txn->tuplecids))
+ if (!txn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
return;
memset(&hash_ctl, 0, sizeof(hash_ctl));
@@ -1264,25 +1264,18 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
* the top and subtransactions (using a k-way merge) and replay the changes in
* lsn order.
*/
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+ ReorderBuffer *rb, TransactionId xid,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
TimestampTz commit_time,
RepOriginId origin_id, XLogRecPtr origin_lsn)
{
- ReorderBufferTXN *txn;
volatile Snapshot snapshot_now;
volatile CommandId command_id = FirstCommandId;
bool using_subtxn;
ReorderBufferIterTXNState *volatile iterstate = NULL;
- txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
- false);
-
- /* unknown transaction, nothing to replay */
- if (txn == NULL)
- return;
-
txn->final_lsn = commit_lsn;
txn->end_lsn = end_lsn;
txn->commit_time = commit_time;
@@ -1326,20 +1319,62 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
{
ReorderBufferChange *change;
ReorderBufferChange *specinsert = NULL;
+ bool change_cleanup = false;
+ bool check_txn_status,
+ apply_started = false;
+ bool is_prepared = txn_prepared(txn);
+
+ /*
+ * Check once whether the xid has already committed. If it has
+ * not, we must consult the filter_decode_txn callback before
+ * each change to find out whether it is still OK to continue
+ * decoding this xid.
+ *
+ * This handles a concurrent abort happening in parallel with
+ * decoding.
+ */
+ check_txn_status =
+ !TransactionIdDidCommit(txn->xid);
if (using_subtxn)
BeginInternalSubTransaction("replay");
else
StartTransactionCommand();
- rb->begin(rb, txn);
-
iterstate = ReorderBufferIterTXNInit(rb, txn);
while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
{
Relation relation = NULL;
Oid reloid;
+ /*
+ * While decoding 2PC or while streaming uncommitted
+ * transactions, check if this transaction needs to
+ * be still decoded. If the transaction got aborted
+ * or if we were instructed to stop decoding, then
+ * bail out early.
+ */
+ if (check_txn_status && rb->filter_decode_txn(rb, txn))
+ {
+ elog(LOG, "%s decoding of %s (%u)",
+ apply_started? "stopping":"skipping",
+ is_prepared? txn->gid:"",
+ txn->xid);
+ change_cleanup = true;
+ goto change_cleanuptxn;
+ }
+
+ /*
+ * The decode filter above gave us the go-ahead to apply
+ * changes, so BEGIN the transaction on the other side
+ * if we have not done so already.
+ */
+ if (!apply_started)
+ {
+ rb->begin(rb, txn);
+ apply_started = true;
+ }
+
switch (change->action)
{
case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1546,6 +1581,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
}
}
+change_cleanuptxn:
/*
* There's a speculative insertion remaining, just clean in up, it
* can't have been successful, otherwise we'd gotten a confirmation
@@ -1561,8 +1597,20 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
ReorderBufferIterTXNFinish(rb, iterstate);
iterstate = NULL;
- /* call commit callback */
- rb->commit(rb, txn, commit_lsn);
+ if (change_cleanup)
+ {
+ /* call abort if we have sent any changes */
+ if (apply_started)
+ rb->abort(rb, txn, commit_lsn);
+ }
+ else
+ {
+ /* call commit or prepare callback */
+ if (txn_prepared(txn))
+ rb->prepare(rb, txn, commit_lsn);
+ else
+ rb->commit(rb, txn, commit_lsn);
+ }
/* this is just a sanity check against bad output plugin behaviour */
if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -1589,7 +1637,13 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
if (snapshot_now->copied)
ReorderBufferFreeSnap(rb, snapshot_now);
- /* remove potential on-disk data, and deallocate */
+ /*
+ * remove potential on-disk data, and deallocate.
+ *
+ * We remove it even for prepared transactions,
+ * because COMMIT PREPARED needs no data after
+ * a successful PREPARE.
+ */
ReorderBufferCleanupTXN(rb, txn);
}
PG_CATCH();
@@ -1623,6 +1677,136 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
PG_END_TRY();
}
+/*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+ return rb->filter_prepare(rb, txn, xid, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferCommitInternal()
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /* unknown transaction, nothing to replay */
+ if (txn == NULL)
+ return;
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a twophase transaction. It calls ReorderBufferCommitInternal()
+ * since all prepared transactions need to be decoded at PREPARE time.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /* unknown transaction, nothing to replay */
+ if (txn == NULL)
+ return;
+
+ txn->txn_flags |= TXN_PREPARE;
+ strlcpy(txn->gid, gid, sizeof(txn->gid));
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Check whether this transaction was sent as prepared to subscribers.
+ * Called while handling commit|abort prepared.
+ */
+bool
+ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+ const char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /*
+ * Always call the prepare filter. It's the job of the prepare
+ * filter to give us the *same* response for a given xid
+ * across multiple calls (including ones on restart)
+ */
+ return !(rb->filter_prepare(rb, txn, xid, gid));
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ABORT PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit)
+{
+ ReorderBufferTXN *txn;
+
+ /*
+ * The transaction may or may not exist (during restarts for
+ * example). Anyway, 2PC transactions do not contain any
+ * reorderbuffers. So allow it to be created below.
+ */
+ txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+ true);
+
+ txn->final_lsn = commit_lsn;
+ txn->end_lsn = end_lsn;
+ txn->commit_time = commit_time;
+ txn->origin_id = origin_id;
+ txn->origin_lsn = origin_lsn;
+ strlcpy(txn->gid, gid, sizeof(txn->gid));
+
+ if (is_commit)
+ {
+ txn->txn_flags |= TXN_COMMIT_PREPARED;
+ rb->commit_prepared(rb, txn, commit_lsn);
+ }
+ else
+ {
+ txn->txn_flags |= TXN_ROLLBACK_PREPARED;
+ rb->abort_prepared(rb, txn, commit_lsn);
+ }
+
+ /* cleanup: make sure there's no cache pollution */
+ ReorderBufferExecuteInvalidations(rb, txn);
+ ReorderBufferCleanupTXN(rb, txn);
+}
+
/*
* Abort a transaction that possibly has previous changes. Needs to be first
* called for subtransactions and then for the toplevel xid.
@@ -1688,7 +1872,7 @@ ReorderBufferAbortOld(ReorderBuffer *rb, TransactionId oldestRunningXid)
* final_lsn to that of their last change; this causes
* ReorderBufferRestoreCleanup to do the right thing.
*/
- if (txn->serialized && txn->final_lsn == 0)
+ if (txn_is_serialized(txn) && txn->final_lsn == 0)
{
ReorderBufferChange *last =
dlist_tail_element(ReorderBufferChange, node, &txn->changes);
@@ -1934,7 +2118,7 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
- txn->has_catalog_changes = true;
+ txn->txn_flags |= TXN_HAS_CATALOG_CHANGES;
}
/*
@@ -1951,7 +2135,7 @@ ReorderBufferXidHasCatalogChanges(ReorderBuffer *rb, TransactionId xid)
if (txn == NULL)
return false;
- return txn->has_catalog_changes;
+ return txn_has_catalog_changes(txn);
}
/*
@@ -2095,7 +2279,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
Assert(spilled == txn->nentries_mem);
Assert(dlist_is_empty(&txn->changes));
txn->nentries_mem = 0;
- txn->serialized = true;
+ txn->txn_flags |= TXN_SERIALIZED;
if (fd != -1)
CloseTransientFile(fd);
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 83c69092ae..15048378d1 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -452,8 +452,9 @@ static void
apply_handle_commit(StringInfo s)
{
LogicalRepCommitData commit_data;
+ uint8 flags = 0;
- logicalrep_read_commit(s, &commit_data);
+ logicalrep_read_commit(s, &commit_data, &flags);
Assert(commit_data.commit_lsn == remote_final_lsn);
@@ -467,7 +468,11 @@ apply_handle_commit(StringInfo s)
replorigin_session_origin_lsn = commit_data.end_lsn;
replorigin_session_origin_timestamp = commit_data.committime;
- CommitTransactionCommand();
+ if (flags & LOGICALREP_IS_COMMIT)
+ CommitTransactionCommand();
+ else if (flags & LOGICALREP_IS_ABORT)
+ AbortCurrentTransaction();
+
pgstat_report_stat(false);
store_flush_position(commit_data.end_lsn);
@@ -487,6 +492,120 @@ apply_handle_commit(StringInfo s)
pgstat_report_activity(STATE_IDLE, NULL);
}
+static void
+apply_handle_prepare_txn(LogicalRepCommitData *commit_data)
+{
+ Assert(commit_data->commit_lsn == remote_final_lsn);
+ /* The synchronization worker runs in single transaction. */
+ if (IsTransactionState() && !am_tablesync_worker())
+ {
+ /* End the earlier transaction and start a new one */
+ BeginTransactionBlock();
+ CommitTransactionCommand();
+ StartTransactionCommand();
+ /*
+ * Update origin state so we can restart streaming from correct
+ * position in case of crash.
+ */
+ replorigin_session_origin_lsn = commit_data->end_lsn;
+ replorigin_session_origin_timestamp = commit_data->committime;
+
+ PrepareTransactionBlock(commit_data->gid);
+ CommitTransactionCommand();
+ pgstat_report_stat(false);
+
+ store_flush_position(commit_data->end_lsn);
+ }
+ else
+ {
+ /* Process any invalidation messages that might have accumulated. */
+ AcceptInvalidationMessages();
+ maybe_reread_subscription();
+ }
+
+ in_remote_transaction = false;
+
+ /* Process any tables that are being synchronized in parallel. */
+ process_syncing_tables(commit_data->end_lsn);
+
+ pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+static void
+apply_handle_commit_prepared_txn(LogicalRepCommitData *commit_data)
+{
+ /* there is no transaction when COMMIT PREPARED is called */
+ ensure_transaction();
+
+ /*
+ * Update origin state so we can restart streaming from correct
+ * position in case of crash.
+ */
+ replorigin_session_origin_lsn = commit_data->end_lsn;
+ replorigin_session_origin_timestamp = commit_data->committime;
+
+ FinishPreparedTransaction(commit_data->gid, true, false);
+ CommitTransactionCommand();
+ pgstat_report_stat(false);
+
+ store_flush_position(commit_data->end_lsn);
+ in_remote_transaction = false;
+
+ /* Process any tables that are being synchronized in parallel. */
+ process_syncing_tables(commit_data->end_lsn);
+
+ pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+static void
+apply_handle_rollback_prepared_txn(LogicalRepCommitData *commit_data)
+{
+ /* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+ ensure_transaction();
+
+ /*
+ * Update origin state so we can restart streaming from correct
+ * position in case of crash.
+ */
+ replorigin_session_origin_lsn = commit_data->end_lsn;
+ replorigin_session_origin_timestamp = commit_data->committime;
+
+ FinishPreparedTransaction(commit_data->gid, false, true);
+ CommitTransactionCommand();
+ pgstat_report_stat(false);
+
+ store_flush_position(commit_data->end_lsn);
+ in_remote_transaction = false;
+
+ /* Process any tables that are being synchronized in parallel. */
+ process_syncing_tables(commit_data->end_lsn);
+
+ pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+ LogicalRepCommitData commit_data;
+ uint8 flags = 0;
+
+ logicalrep_read_prepare(s, &commit_data, &flags);
+
+ if (flags & LOGICALREP_IS_PREPARE)
+ apply_handle_prepare_txn(&commit_data);
+ else if (flags & LOGICALREP_IS_COMMIT_PREPARED)
+ apply_handle_commit_prepared_txn(&commit_data);
+ else if (flags & LOGICALREP_IS_ROLLBACK_PREPARED)
+ apply_handle_rollback_prepared_txn(&commit_data);
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg("wrong [commit|rollback] prepare message")));
+}
+
/*
* Handle ORIGIN message.
*
@@ -884,10 +1003,14 @@ apply_dispatch(StringInfo s)
case 'B':
apply_handle_begin(s);
break;
- /* COMMIT */
+ /* COMMIT|ABORT */
case 'C':
apply_handle_commit(s);
break;
+ /* [COMMIT|ROLLBACK] PREPARE */
+ case 'P':
+ apply_handle_prepare(s);
+ break;
/* INSERT */
case 'I':
apply_handle_insert(s);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 40a1ef3c1d..55bdee9abe 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -37,11 +37,23 @@ static void pgoutput_begin_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn);
static void pgoutput_commit_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_abort_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr abort_lsn);
static void pgoutput_change(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, Relation rel,
ReorderBufferChange *change);
static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
RepOriginId origin_id);
+static bool pgoutput_filter_prepare(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, TransactionId xid, const char *gid);
+static bool pgoutput_decode_txn_filter(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_abort_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
static bool publications_valid;
@@ -79,7 +91,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
cb->begin_cb = pgoutput_begin_txn;
cb->change_cb = pgoutput_change;
cb->commit_cb = pgoutput_commit_txn;
+ cb->abort_cb = pgoutput_abort_txn;
+
+ cb->filter_prepare_cb = pgoutput_filter_prepare;
+ cb->prepare_cb = pgoutput_prepare_txn;
+ cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+ cb->abort_prepared_cb = pgoutput_abort_prepared_txn;
+
cb->filter_by_origin_cb = pgoutput_origin_filter;
+ cb->filter_decode_txn_cb = pgoutput_decode_txn_filter;
cb->shutdown_cb = pgoutput_shutdown;
}
@@ -251,6 +271,61 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
OutputPluginWrite(ctx, true);
}
+/*
+ * ABORT callback
+ */
+static void
+pgoutput_abort_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ OutputPluginUpdateProgress(ctx);
+
+ OutputPluginPrepareWrite(ctx, true);
+ logicalrep_write_abort(ctx->out, txn, abort_lsn);
+ OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ OutputPluginUpdateProgress(ctx);
+
+ OutputPluginPrepareWrite(ctx, true);
+ logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+ OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ OutputPluginUpdateProgress(ctx);
+
+ OutputPluginPrepareWrite(ctx, true);
+ logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+ OutputPluginWrite(ctx, true);
+}
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ OutputPluginUpdateProgress(ctx);
+
+ OutputPluginPrepareWrite(ctx, true);
+ logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+ OutputPluginWrite(ctx, true);
+}
+
/*
* Sends the decoded DML over wire.
*/
@@ -361,6 +436,18 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
MemoryContextReset(data->context);
}
+/*
+ * Filter out unnecessary two-phase transactions.
+ *
+ * Currently, we forward all two-phase transactions
+ */
+static bool
+pgoutput_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid)
+{
+ return false;
+}
+
/*
* Currently we always forward.
*/
@@ -371,6 +458,37 @@ pgoutput_origin_filter(LogicalDecodingContext *ctx,
return false;
}
+/*
+ * Check if we should continue to decode this transaction.
+ *
+ * If it has aborted in the meanwhile, then there's no sense
+ * in decoding and sending the rest of the changes, we might
+ * as well ask the subscribers to abort immediately.
+ *
+ * This should be called if we are streaming a transaction
+ * before it's committed or if we are decoding a 2PC
+ * transaction. Otherwise we always decode committed
+ * transactions
+ *
+ * Additional checks can be added here, as needed
+ */
+static bool
+pgoutput_decode_txn_filter(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn)
+{
+ /*
+ * Due to caching, repeated TransactionIdDidAbort calls
+ * shouldn't be that expensive
+ */
+ if (txn != NULL &&
+ TransactionIdIsValid(txn->xid) &&
+ TransactionIdDidAbort(txn->xid))
+ return true;
+
+ /* if txn is NULL, filter it out */
+ return txn == NULL;
+}
+
/*
* Shutdown the output plugin.
*
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 3abe7d6155..8a6e0a1c2d 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -455,13 +455,13 @@ standard_ProcessUtility(PlannedStmt *pstmt,
case TRANS_STMT_COMMIT_PREPARED:
PreventTransactionChain(isTopLevel, "COMMIT PREPARED");
PreventCommandDuringRecovery("COMMIT PREPARED");
- FinishPreparedTransaction(stmt->gid, true);
+ FinishPreparedTransaction(stmt->gid, true, false);
break;
case TRANS_STMT_ROLLBACK_PREPARED:
PreventTransactionChain(isTopLevel, "ROLLBACK PREPARED");
PreventCommandDuringRecovery("ROLLBACK PREPARED");
- FinishPreparedTransaction(stmt->gid, false);
+ FinishPreparedTransaction(stmt->gid, false, false);
break;
case TRANS_STMT_ROLLBACK:
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 34d9470811..cbc63a18ad 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -15,6 +15,7 @@
#define TWOPHASE_H
#include "access/xlogdefs.h"
+#include "access/xact.h"
#include "datatype/timestamp.h"
#include "storage/lock.h"
@@ -46,15 +47,18 @@ extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
int *nxids_p);
+extern void ParsePrepareRecord(uint8 info, char *xlrec,
+ xl_xact_parsed_prepare *parsed);
extern void StandbyRecoverPreparedTransactions(void);
extern void RecoverPreparedTransactions(void);
extern void CheckPointTwoPhase(XLogRecPtr redo_horizon);
-extern void FinishPreparedTransaction(const char *gid, bool isCommit);
+extern void FinishPreparedTransaction(const char *gid, bool isCommit,
+ bool missing_ok);
extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
- XLogRecPtr end_lsn);
+ XLogRecPtr end_lsn, RepOriginId origin_id);
extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
extern void restoreTwoPhaseData(void);
#endif /* TWOPHASE_H */
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 6445bbc46f..d2e104423d 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -21,6 +21,13 @@
#include "storage/sinval.h"
#include "utils/datetime.h"
+/*
+ * Maximum size of Global Transaction ID (including '\0').
+ *
+ * Note that the max value of GIDSIZE must fit in the uint16 gidlen,
+ * specified in TwoPhaseFileHeader.
+ */
+#define GIDSIZE 200
/*
* Xact isolation levels
@@ -156,6 +163,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
#define XACT_XINFO_HAS_TWOPHASE (1U << 4)
#define XACT_XINFO_HAS_ORIGIN (1U << 5)
#define XACT_XINFO_HAS_AE_LOCKS (1U << 6)
+#define XACT_XINFO_HAS_GID (1U << 7)
/*
* Also stored in xinfo, these indicating a variety of additional actions that
@@ -302,13 +310,40 @@ typedef struct xl_xact_parsed_commit
SharedInvalidationMessage *msgs;
TransactionId twophase_xid; /* only for 2PC */
+ char twophase_gid[GIDSIZE];
XLogRecPtr origin_lsn;
TimestampTz origin_timestamp;
} xl_xact_parsed_commit;
+typedef struct xl_xact_parsed_prepare
+{
+ Oid dbId; /* MyDatabaseId */
+
+ int nsubxacts;
+ TransactionId *subxacts;
+
+ int ncommitrels;
+ RelFileNode *commitrels;
+
+ int nabortrels;
+ RelFileNode *abortrels;
+
+ int nmsgs;
+ SharedInvalidationMessage *msgs;
+
+ TransactionId twophase_xid;
+ char twophase_gid[GIDSIZE];
+
+ XLogRecPtr origin_lsn;
+ TimestampTz origin_timestamp;
+} xl_xact_parsed_prepare;
+
typedef struct xl_xact_parsed_abort
{
+ Oid dbId;
+ Oid tsId;
+
TimestampTz xact_time;
uint32 xinfo;
@@ -319,6 +354,10 @@ typedef struct xl_xact_parsed_abort
RelFileNode *xnodes;
TransactionId twophase_xid; /* only for 2PC */
+ char twophase_gid[GIDSIZE];
+
+ XLogRecPtr origin_lsn;
+ TimestampTz origin_timestamp;
} xl_xact_parsed_abort;
@@ -386,12 +425,13 @@ extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
int nmsgs, SharedInvalidationMessage *msgs,
bool relcacheInval, bool forceSync,
int xactflags,
- TransactionId twophase_xid);
+ TransactionId twophase_xid, const char *twophase_gid);
extern XLogRecPtr XactLogAbortRecord(TimestampTz abort_time,
int nsubxacts, TransactionId *subxacts,
int nrels, RelFileNode *rels,
- int xactflags, TransactionId twophase_xid);
+ int xactflags, TransactionId twophase_xid,
+ const char *twophase_gid);
extern void xact_redo(XLogReaderState *record);
/* xactdesc.c */
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 619c5f4d73..2d93fd9365 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -89,6 +89,11 @@ typedef struct LogicalDecodingContext
bool prepared_write;
XLogRecPtr write_location;
TransactionId write_xid;
+
+ /*
+ * Capabilities of the output plugin.
+ */
+ bool enable_twophase;
} LogicalDecodingContext;
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 0eb21057c5..886025f3aa 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -69,11 +69,20 @@ typedef struct LogicalRepBeginData
TransactionId xid;
} LogicalRepBeginData;
+#define LOGICALREP_IS_COMMIT 0x01
+#define LOGICALREP_IS_ABORT 0x02
+#define LOGICALREP_IS_PREPARE 0x04
+#define LOGICALREP_IS_COMMIT_PREPARED 0x08
+#define LOGICALREP_IS_ROLLBACK_PREPARED 0x10
+#define LOGICALREP_COMMIT_MASK (LOGICALREP_IS_COMMIT | LOGICALREP_IS_ABORT)
+#define LOGICALREP_PREPARE_MASK (LOGICALREP_IS_PREPARE | LOGICALREP_IS_COMMIT_PREPARED | LOGICALREP_IS_ROLLBACK_PREPARED)
typedef struct LogicalRepCommitData
{
+ uint8 flag;
XLogRecPtr commit_lsn;
XLogRecPtr end_lsn;
TimestampTz committime;
+ char gid[GIDSIZE];
} LogicalRepCommitData;
extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
@@ -81,8 +90,14 @@ extern void logicalrep_read_begin(StringInfo in,
LogicalRepBeginData *begin_data);
extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+extern void logicalrep_write_abort(StringInfo out, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
extern void logicalrep_read_commit(StringInfo in,
- LogicalRepCommitData *commit_data);
+ LogicalRepCommitData *commit_data, uint8 *flags);
+extern void logicalrep_read_prepare(StringInfo in,
+ LogicalRepCommitData *commit_data, uint8 *flags);
extern void logicalrep_write_origin(StringInfo out, const char *origin,
XLogRecPtr origin_lsn);
extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 78fd38bb16..61c5019adf 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -67,6 +67,46 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+/*
+ * Called for an implicit ABORT of a transaction.
+ */
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+/*
+ * Called before decoding of a PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to the prepare
+ * and commit_prepared/abort_prepared callbacks, or should wait until
+ * COMMIT PREPARED and be sent as a regular transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
/*
* Called for the generic logical decoding messages.
*/
@@ -84,6 +124,12 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ctx,
RepOriginId origin_id);
+/*
+ * Filter to check if we should continue to decode this transaction
+ */
+typedef bool (*LogicalDecodeFilterDecodeTxnCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn);
+
/*
* Called to shutdown an output plugin.
*/
@@ -98,8 +144,14 @@ typedef struct OutputPluginCallbacks
LogicalDecodeBeginCB begin_cb;
LogicalDecodeChangeCB change_cb;
LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeAbortCB abort_cb;
LogicalDecodeMessageCB message_cb;
+ LogicalDecodeFilterPrepareCB filter_prepare_cb;
+ LogicalDecodePrepareCB prepare_cb;
+ LogicalDecodeCommitPreparedCB commit_prepared_cb;
+ LogicalDecodeAbortPreparedCB abort_prepared_cb;
LogicalDecodeFilterByOriginCB filter_by_origin_cb;
+ LogicalDecodeFilterDecodeTxnCB filter_decode_txn_cb;
LogicalDecodeShutdownCB shutdown_cb;
} OutputPluginCallbacks;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 0970abca52..1a3a12f4e5 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
#define REORDERBUFFER_H
#include "access/htup_details.h"
+#include "access/twophase.h"
#include "lib/ilist.h"
#include "storage/sinval.h"
#include "utils/hsearch.h"
@@ -137,20 +138,40 @@ typedef struct ReorderBufferChange
dlist_node node;
} ReorderBufferChange;
+/* ReorderBufferTXN flags */
+#define TXN_HAS_CATALOG_CHANGES 0x0001
+#define TXN_IS_SUBXACT 0x0002
+#define TXN_SERIALIZED 0x0004
+#define TXN_PREPARE 0x0008
+#define TXN_COMMIT_PREPARED 0x0010
+#define TXN_ROLLBACK_PREPARED 0x0020
+
+/* does the txn have catalog changes */
+#define txn_has_catalog_changes(txn) (((txn)->txn_flags & TXN_HAS_CATALOG_CHANGES) != 0)
+/* is the txn known as a subxact? */
+#define txn_is_subxact(txn) (((txn)->txn_flags & TXN_IS_SUBXACT) != 0)
+/*
+ * Has this transaction been spilled to disk? It's not always possible to
+ * deduce that fact by comparing nentries with nentries_mem, because e.g.
+ * subtransactions of a large transaction might get serialized together
+ * with the parent - if they're restored to memory they'd have
+ * nentries_mem == nentries.
+ */
+#define txn_is_serialized(txn) (((txn)->txn_flags & TXN_SERIALIZED) != 0)
+/* is this txn prepared? */
+#define txn_prepared(txn) (((txn)->txn_flags & TXN_PREPARE) != 0)
+
typedef struct ReorderBufferTXN
{
+ int txn_flags;
+
/*
* The transactions transaction id, can be a toplevel or sub xid.
*/
TransactionId xid;
- /* did the TX have catalog changes */
- bool has_catalog_changes;
-
- /*
- * Do we know this is a subxact?
- */
- bool is_known_as_subxact;
+ /* In case of 2PC we need to pass GID to output plugin */
+ char gid[GIDSIZE];
/*
* LSN of the first data carrying, WAL record with knowledge about this
@@ -214,15 +235,6 @@ typedef struct ReorderBufferTXN
*/
uint64 nentries_mem;
- /*
- * Has this transaction been spilled to disk? It's not always possible to
- * deduce that fact by comparing nentries with nentries_mem, because e.g.
- * subtransactions of a large transaction might get serialized together
- * with the parent - if they're restored to memory they'd have
- * nentries_mem == nentries.
- */
- bool serialized;
-
/*
* List of ReorderBufferChange structs, including new Snapshots and new
* CommandIds
@@ -294,6 +306,40 @@ typedef void (*ReorderBufferCommitCB) (
ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+/* abort callback signature */
+typedef void (*ReorderBufferAbortCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+typedef bool (*ReorderBufferFilterDecodeTxnCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn);
+
+typedef bool (*ReorderBufferFilterPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/* abort prepared callback signature */
+typedef void (*ReorderBufferAbortPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
/* message callback signature */
typedef void (*ReorderBufferMessageCB) (
ReorderBuffer *rb,
@@ -329,6 +375,12 @@ struct ReorderBuffer
ReorderBufferBeginCB begin;
ReorderBufferApplyChangeCB apply_change;
ReorderBufferCommitCB commit;
+ ReorderBufferAbortCB abort;
+ ReorderBufferFilterDecodeTxnCB filter_decode_txn;
+ ReorderBufferFilterPrepareCB filter_prepare;
+ ReorderBufferPrepareCB prepare;
+ ReorderBufferCommitPreparedCB commit_prepared;
+ ReorderBufferAbortPreparedCB abort_prepared;
ReorderBufferMessageCB message;
/*
@@ -371,6 +423,11 @@ void ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot
void ReorderBufferCommit(ReorderBuffer *, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit);
void ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
void ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
@@ -394,6 +451,15 @@ void ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
bool ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
bool ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
+bool ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+ const char *gid);
+bool ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+ const char *gid);
+void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid);
ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
void ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
diff --git a/src/test/subscription/t/009_twophase.pl b/src/test/subscription/t/009_twophase.pl
new file mode 100644
index 0000000000..c7f373df93
--- /dev/null
+++ b/src/test/subscription/t/009_twophase.pl
@@ -0,0 +1,163 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 12;
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf(
+ 'postgresql.conf', qq(
+ max_prepared_transactions = 10
+ ));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf(
+ 'postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres', "CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO tab_full SELECT generate_series(1,10)");
+$node_publisher->safe_psql('postgres', "CREATE TABLE tab_full2 (x text)");
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO tab_full2 VALUES ('a'), ('b'), ('b')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_full2 (x text)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+"ALTER PUBLICATION tap_pub ADD TABLE tab_full, tab_full2"
+);
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+ or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+ or die "Timed out while waiting for subscriber to synchronize data";
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres',
+ "BEGIN;INSERT INTO tab_full VALUES (11);PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+ or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+ is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+ "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+ or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+ is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+$result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+ is($result, qq(0), 'transaction is committed on subscriber');
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres',
+ "BEGIN;INSERT INTO tab_full VALUES (12);PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+ or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+ is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+ "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+ or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+ is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+ is($result, qq(0), 'transaction is aborted on subscriber');
+
+# Check that commit prepared is decoded properly on crash restart
+$node_publisher->safe_psql('postgres', "
+ BEGIN;
+ INSERT INTO tab_full VALUES (12);
+ INSERT INTO tab_full VALUES (13);
+ PREPARE TRANSACTION 'test_prepared_tab';");
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+ or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (11,12);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+# TODO add test cases involving DDL. This can be added after we add functionality
+# to replicate DDL changes to subscriber.
+
+# check all the cleanup
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+ "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+ "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+ "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+ 'check subscription relation status was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+ "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+ "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
Hi all,
PFA a patch that applies cleanly against the latest git head. I also
removed unwanted newlines and took care of the cleanup TODO about
making the ReorderBufferTXN structure use a txn_flags field instead of
separate booleans for various statuses like has_catalog_changes,
is_subxact, is_serialized, etc. The patch uses this txn_flags field for
the newer prepare-related info as well.
"make check-world" passes ok, including the additional regular and TAP
tests that we have added as part of this patch.
PFA, latest version of this patch.
This latest version takes care of the abort-while-decoding issue along
with additional test cases and documentation changes.
We now maintain a list of processes that are decoding a specific
transaction ID and make each of them a decode groupmember of a decode
groupleader process. The decode groupleader process is basically the
PGPROC entry which points to the prepared 2PC transaction or an ongoing
regular transaction.
If the 2PC is rolled back then FinishPreparedTransaction uses the
decode groupleader process to let all the decode groupmember processes
know that it's aborting. Similar logic can be used for the decoding
of uncommitted transactions. The decode groupmember processes are able
to abort sanely in such a case. We also have two new APIs,
"LogicalLockTransaction" and "LogicalUnlockTransaction", that the
decoding backends need to use while accessing system or user catalog
tables. The abort code interlocks with decoding backends that
might be in the process of accessing catalog tables and waits for
those few moments before aborting the transaction.
The implementation uses the LockHashPartitionLockByProc on the decode
groupleader process to control access to these additional fields in
the PGPROC structure among the decode groupleader and the other
decode groupmember processes, and does not need to use the
ProcArrayLock at all. The implementation is inspired by the
*existing* lockGroupLeader solution, which uses a similar technique to
track processes waiting on a leader holding that lock. I believe it's
an optimal solution for this problem of ours.
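The interlock described above can be modeled in a few lines. This is a single-threaded sketch under stated assumptions, not the patch's code: DemoLeader and the demo_* functions are hypothetical names, there is no real locking (the patch guards this state with the leader's lock partition lock), and where the real abort path waits for decoders, the sketch simply reports whether the abort could proceed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the extra state kept on the decode groupleader. */
typedef struct DemoLeader
{
    bool abort_pending;   /* set once the rollback has been announced */
    int  members_locked;  /* decoders currently inside catalog access */
} DemoLeader;

/*
 * Decoder side, called before catalog access. Returns false when the
 * transaction is aborting, in which case the decoder should stop.
 */
bool demo_lock_txn(DemoLeader *leader)
{
    if (leader->abort_pending)
        return false;
    leader->members_locked++;
    return true;
}

/* Decoder side, called after catalog access is finished. */
void demo_unlock_txn(DemoLeader *leader)
{
    assert(leader->members_locked > 0);
    leader->members_locked--;
}

/*
 * Abort side: announce the abort so no new decoder can enter, then
 * report whether all decoders have left their catalog access sections.
 * The real code would wait here instead of returning false.
 */
bool demo_try_abort(DemoLeader *leader)
{
    leader->abort_pending = true;
    return leader->members_locked == 0;
}
```

In this model a decoder that already holds the lock delays the abort until it unlocks, and any decoder arriving after the abort was announced is refused, which mirrors the behaviour the TAP test with decode-delay is meant to exercise.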
I have added TAP tests to test multiple decoding backends working on the
same transaction. I used delays in the test_decoding plugin to introduce
waits after making the LogicalLockTransaction call, and then called
ROLLBACK to ensure that it interlocks with such decoding backends
which are doing catalog access. The tests work as desired. Also, "make
check-world" passes with asserts enabled.
I will post this same explanation about abort handling on the other
thread (http://www.postgresql-archive.org/Logical-Decoding-and-HeapTupleSatisfiesVacuum-assumptions-td5998294.html).
Comments appreciated.
Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Attachments:
2pc_logical_with_abort_handling_06_02_18.patch (application/octet-stream)
diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 6c18189d9d..79b9622600 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -26,7 +26,7 @@ installcheck:;
# installation, allow to do so, but only if requested explicitly.
installcheck-force: regresscheck-install-force isolationcheck-install-force
-check: regresscheck isolationcheck
+check: regresscheck isolationcheck 2pc-check
submake-regress:
$(MAKE) -C $(top_builddir)/src/test/regress all
@@ -66,3 +66,6 @@ isolationcheck-install-force: all | submake-isolation submake-test_decoding temp
isolationcheck isolationcheck-install-force
temp-install: EXTRA_INSTALL=contrib/test_decoding
+
+2pc-check: temp-install
+ $(prove_check)
diff --git a/contrib/test_decoding/expected/prepared.out b/contrib/test_decoding/expected/prepared.out
index 46e915d4ff..2df0b6c198 100644
--- a/contrib/test_decoding/expected/prepared.out
+++ b/contrib/test_decoding/expected/prepared.out
@@ -6,19 +6,123 @@ SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_d
init
(1 row)
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc', 'test_decoding');
+ ?column?
+----------
+ init
+(1 row)
+
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc_nofilter', 'test_decoding');
+ ?column?
+----------
+ init
+(1 row)
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+-- Reused queries
+\set get_no2pc 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'');'
+\set get_with2pc 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot_2pc'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'', ''twophase-decoding'', ''1'');'
+\set get_with2pc_nofilter 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot_2pc_nofilter'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'', ''twophase-decoding'', ''1'', ''twophase-decode-with-catalog-changes'', ''1'');'
-- test simple successful use of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (1);
PREPARE TRANSACTION 'test_prepared#1';
+:get_no2pc
+ data
+------
+(0 rows)
+
+:get_with2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ PREPARE TRANSACTION 'test_prepared#1'
+(3 rows)
+
+:get_with2pc_nofilter
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ PREPARE TRANSACTION 'test_prepared#1'
+(3 rows)
+
COMMIT PREPARED 'test_prepared#1';
+:get_no2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ COMMIT
+(3 rows)
+
+:get_with2pc
+ data
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
+:get_with2pc_nofilter
+ data
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
INSERT INTO test_prepared1 VALUES (2);
-- test abort of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (3);
PREPARE TRANSACTION 'test_prepared#2';
+:get_no2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+(3 rows)
+
+:get_with2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(6 rows)
+
+:get_with2pc_nofilter
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(6 rows)
+
ROLLBACK PREPARED 'test_prepared#2';
+:get_no2pc
+ data
+------
+(0 rows)
+
+:get_with2pc
+ data
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
+:get_with2pc_nofilter
+ data
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
INSERT INTO test_prepared1 VALUES (4);
-- test prepared xact containing ddl
BEGIN;
@@ -26,45 +130,226 @@ INSERT INTO test_prepared1 VALUES (5);
ALTER TABLE test_prepared1 ADD COLUMN data text;
INSERT INTO test_prepared1 VALUES (6, 'frakbar');
PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
-INSERT INTO test_prepared2 VALUES (7);
-COMMIT PREPARED 'test_prepared#3';
--- make sure stuff still works
-INSERT INTO test_prepared1 VALUES (8);
-INSERT INTO test_prepared2 VALUES (9);
--- cleanup
-DROP TABLE test_prepared1;
-DROP TABLE test_prepared2;
--- show results
-SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'test_prepared1'::regclass;
+ relation | locktype | mode
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
+:get_no2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4
+ COMMIT
+(3 rows)
+
+:get_with2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4
+ COMMIT
+(3 rows)
+
+:get_with2pc_nofilter
data
-------------------------------------------------------------------------
BEGIN
- table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:4
COMMIT
BEGIN
- table public.test_prepared1: INSERT: id[integer]:2
+ table public.test_prepared1: INSERT: id[integer]:5
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(7 rows)
+
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists. Our 2pc filter callback will skip decoding of xacts
+-- with catalog changes at PREPARE time, so we don't decode it now.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
+INSERT INTO test_prepared2 VALUES (7);
+:get_no2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:7
COMMIT
+(3 rows)
+
+:get_with2pc
+ data
+----------------------------------------------------
BEGIN
- table public.test_prepared1: INSERT: id[integer]:4
+ table public.test_prepared2: INSERT: id[integer]:7
COMMIT
+(3 rows)
+
+:get_with2pc_nofilter
+ data
+----------------------------------------------------
BEGIN
table public.test_prepared2: INSERT: id[integer]:7
COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+:get_no2pc
+ data
+-------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:5
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
+ COMMIT
+(4 rows)
+
+:get_with2pc
+ data
+-------------------------------------------------------------------------
BEGIN
table public.test_prepared1: INSERT: id[integer]:5
table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
COMMIT
+(4 rows)
+
+:get_with2pc_nofilter
+ data
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#3'
+(1 row)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (8);
+INSERT INTO test_prepared2 VALUES (9);
+:get_no2pc
+ data
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:9
+ COMMIT
+(6 rows)
+
+:get_with2pc
+ data
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:9
+ COMMIT
+(6 rows)
+
+:get_with2pc_nofilter
+ data
+--------------------------------------------------------------------
BEGIN
table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
COMMIT
BEGIN
table public.test_prepared2: INSERT: id[integer]:9
COMMIT
-(22 rows)
+(6 rows)
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'pg_class'::regclass;
+ relation | locktype | mode
+----------+----------+------
+(0 rows)
+
+-- Shouldn't see anything with 2pc decoding off
+:get_no2pc
+ data
+-----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:12
+ COMMIT
+(3 rows)
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+:get_with2pc
+ data
+-----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:12
+ PREPARE TRANSACTION 'test_prepared_lock2'
+ COMMIT PREPARED 'test_prepared_lock2'
+(4 rows)
+
+:get_with2pc_nofilter
+ data
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:12
+ PREPARE TRANSACTION 'test_prepared_lock2'
+ COMMIT PREPARED 'test_prepared_lock2'
+(8 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- Both will work normally after we commit
+:get_no2pc
+ data
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ COMMIT
+(4 rows)
+
+:get_with2pc
+ data
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ COMMIT
+(4 rows)
+
+:get_with2pc_nofilter
+ data
+--------------------------------------
+ COMMIT PREPARED 'test_prepared_lock'
+(1 row)
+
+-- cleanup
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+------
+(0 rows)
SELECT pg_drop_replication_slot('regression_slot');
pg_drop_replication_slot
@@ -72,3 +357,15 @@ SELECT pg_drop_replication_slot('regression_slot');
(1 row)
+SELECT pg_drop_replication_slot('regression_slot_2pc');
+ pg_drop_replication_slot
+--------------------------
+
+(1 row)
+
+SELECT pg_drop_replication_slot('regression_slot_2pc_nofilter');
+ pg_drop_replication_slot
+--------------------------
+
+(1 row)
+
diff --git a/contrib/test_decoding/sql/prepared.sql b/contrib/test_decoding/sql/prepared.sql
index e72639767e..4197766c50 100644
--- a/contrib/test_decoding/sql/prepared.sql
+++ b/contrib/test_decoding/sql/prepared.sql
@@ -1,22 +1,41 @@
-- predictability
SET synchronous_commit = on;
SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc', 'test_decoding');
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc_nofilter', 'test_decoding');
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+
+-- Reused queries
+\set get_no2pc 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'');'
+\set get_with2pc 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot_2pc'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'', ''twophase-decoding'', ''1'');'
+\set get_with2pc_nofilter 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot_2pc_nofilter'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'', ''twophase-decoding'', ''1'', ''twophase-decode-with-catalog-changes'', ''1'');'
-- test simple successful use of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (1);
PREPARE TRANSACTION 'test_prepared#1';
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
COMMIT PREPARED 'test_prepared#1';
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
INSERT INTO test_prepared1 VALUES (2);
-- test abort of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (3);
PREPARE TRANSACTION 'test_prepared#2';
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
ROLLBACK PREPARED 'test_prepared#2';
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
INSERT INTO test_prepared1 VALUES (4);
@@ -27,18 +46,74 @@ ALTER TABLE test_prepared1 ADD COLUMN data text;
INSERT INTO test_prepared1 VALUES (6, 'frakbar');
PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'test_prepared1'::regclass;
+
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists. Our 2pc filter callback will skip decoding of xacts
+-- with catalog changes at PREPARE time, so we don't decode it now.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
INSERT INTO test_prepared2 VALUES (7);
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
COMMIT PREPARED 'test_prepared#3';
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
-- make sure stuff still works
INSERT INTO test_prepared1 VALUES (8);
INSERT INTO test_prepared2 VALUES (9);
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'pg_class'::regclass;
+
+-- Shouldn't see anything with 2pc decoding off
+:get_no2pc
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+:get_with2pc
+:get_with2pc_nofilter
+RESET statement_timeout;
+
+COMMIT PREPARED 'test_prepared_lock';
+
+-- Both will work normally after we commit
+:get_no2pc
+:get_with2pc
+:get_with2pc_nofilter
-- cleanup
DROP TABLE test_prepared1;
@@ -48,3 +123,5 @@ DROP TABLE test_prepared2;
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
SELECT pg_drop_replication_slot('regression_slot');
+SELECT pg_drop_replication_slot('regression_slot_2pc');
+SELECT pg_drop_replication_slot('regression_slot_2pc_nofilter');
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000000..6722317c9f
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,102 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+ 'postgresql.conf', qq(
+ max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+ "INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+ "SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+$node_logical->safe_psql('postgres',
+ "SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot2', 'test_decoding');");
+
+# This test exercises a concurrent abort while logical decoding is in
+# progress. The decode-delay option makes the plugin sleep for that many
+# seconds on each decoded change, holding the LogicalLockTransaction lock
+# while it sleeps. We fire a ROLLBACK PREPARED from another session while
+# this delayed decoding is ongoing. Because the decoding backends hold the
+# lock, the ROLLBACK PREPARED waits for them to call LogicalUnlockTransaction.
+# Decoding stops right after that, so the next pg_logical_slot_get_changes
+# call should show only a few records from the two-phase transaction.
+#
+# We use two slots to test multiple decoding backends here
+
+$node_logical->safe_psql('postgres', "
+ BEGIN;
+ INSERT INTO tab VALUES (11);
+ INSERT INTO tab VALUES (12);
+ ALTER TABLE tab ADD COLUMN b INT;
+ PREPARE TRANSACTION 'test_prepared_tab';");
+
+# start decoding the above with decode-delay in the background.
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should only decode 1 INSERT record and should include
+# an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'twophase-decoding', '1', 'twophase-decode-with-catalog-changes', '1', 'decode-delay', '3');\" \&");
+
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot2', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'twophase-decoding', '1', 'twophase-decode-with-catalog-changes', '1', 'decode-delay', '3');\" \&");
+
+# sleep for a little while (shorter than decode-delay)
+$node_logical->safe_psql('postgres', "select pg_sleep(1)");
+
+# roll back the prepared transaction while its first record is still being
+# decoded (the decoding backends are sleeping for decode-delay seconds)
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# wait for decoding to stop
+$node_logical->psql('postgres', "select pg_sleep(4)");
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'twophase-decoding', '1', 'twophase-decode-with-catalog-changes', '1');");
+
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot2', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'twophase-decoding', '1', 'twophase-decode-with-catalog-changes', '1');");
+
+# check for occurrence of log about waiting backends
+my $output_file = slurp_file($node_logical->logfile());
+my $waiting_str = "Waiting for backends to abort";
+like($output_file, qr/$waiting_str/, "Waiting log found in server log");
+
+# check for occurrence of log about stopping decoding
+my $abort_str = "stopping decoding of test_prepared_tab ";
+like($output_file, qr/$abort_str/, "ABORT found in server log");
+
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+ BEGIN;
+ INSERT INTO tab VALUES (11);
+ INSERT INTO tab VALUES (12);
+ ALTER TABLE tab ADD COLUMN b INT;
+ INSERT INTO tab VALUES (13, 11);
+ PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'twophase-decoding', '1', 'twophase-decode-with-catalog-changes', '1');");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot2', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'twophase-decoding', '1', 'twophase-decode-with-catalog-changes', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit post the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'twophase-decoding', '1', 'twophase-decode-with-catalog-changes', '1');");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot2', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'twophase-decoding', '1', 'twophase-decode-with-catalog-changes', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot2');");
+$node_logical->stop('fast');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 0f18afa852..477c950b8d 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -24,6 +24,8 @@
#include "replication/message.h"
#include "replication/origin.h"
+#include "storage/procarray.h"
+
#include "utils/builtins.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
@@ -46,6 +48,9 @@ typedef struct
bool skip_empty_xacts;
bool xact_wrote_changes;
bool only_local;
+ bool twophase_decoding;
+ bool twophase_decode_with_catalog_changes;
+ int decode_delay; /* seconds to sleep after every change record */
} TestDecodingData;
static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -59,6 +64,8 @@ static void pg_output_begin(LogicalDecodingContext *ctx,
bool last_write);
static void pg_decode_commit_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pg_decode_abort_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr abort_lsn);
static void pg_decode_change(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, Relation rel,
ReorderBufferChange *change);
@@ -68,6 +75,20 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr message_lsn,
bool transactional, const char *prefix,
Size sz, const char *message);
+static bool pg_filter_prepare(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid);
+static bool pg_filter_decode_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
void
_PG_init(void)
@@ -85,9 +106,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
cb->begin_cb = pg_decode_begin_txn;
cb->change_cb = pg_decode_change;
cb->commit_cb = pg_decode_commit_txn;
+ cb->abort_cb = pg_decode_abort_txn;
cb->filter_by_origin_cb = pg_decode_filter;
cb->shutdown_cb = pg_decode_shutdown;
cb->message_cb = pg_decode_message;
+ cb->filter_prepare_cb = pg_filter_prepare;
+ cb->filter_decode_txn_cb = pg_filter_decode_txn;
+ cb->prepare_cb = pg_decode_prepare_txn;
+ cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+ cb->abort_prepared_cb = pg_decode_abort_prepared_txn;
}
@@ -107,6 +134,9 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
data->include_timestamp = false;
data->skip_empty_xacts = false;
data->only_local = false;
+ data->twophase_decoding = false;
+ data->twophase_decode_with_catalog_changes = false;
+ data->decode_delay = 0;
ctx->output_plugin_private = data;
@@ -156,7 +186,6 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
}
else if (strcmp(elem->defname, "skip-empty-xacts") == 0)
{
-
if (elem->arg == NULL)
data->skip_empty_xacts = true;
else if (!parse_bool(strVal(elem->arg), &data->skip_empty_xacts))
@@ -167,7 +196,6 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
}
else if (strcmp(elem->defname, "only-local") == 0)
{
-
if (elem->arg == NULL)
data->only_local = true;
else if (!parse_bool(strVal(elem->arg), &data->only_local))
@@ -176,6 +204,41 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
errmsg("could not parse value \"%s\" for parameter \"%s\"",
strVal(elem->arg), elem->defname)));
}
+ else if (strcmp(elem->defname, "twophase-decoding") == 0)
+ {
+ if (elem->arg == NULL)
+ data->twophase_decoding = true;
+ else if (!parse_bool(strVal(elem->arg), &data->twophase_decoding))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("could not parse value \"%s\" for parameter \"%s\"",
+ strVal(elem->arg), elem->defname)));
+ }
+ else if (strcmp(elem->defname, "twophase-decode-with-catalog-changes") == 0)
+ {
+ if (elem->arg == NULL)
+ data->twophase_decode_with_catalog_changes = true;
+ else if (!parse_bool(strVal(elem->arg), &data->twophase_decode_with_catalog_changes))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("could not parse value \"%s\" for parameter \"%s\"",
+ strVal(elem->arg), elem->defname)));
+ }
+ else if (strcmp(elem->defname, "decode-delay") == 0)
+ {
+ if (elem->arg == NULL)
+ data->decode_delay = 2; /* default to 2 seconds */
+ else
+ data->decode_delay = pg_atoi(strVal(elem->arg),
+ sizeof(int), 0);
+
+ if (data->decode_delay <= 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("Specify positive value for parameter \"%s\","
+ " you specified \"%s\"",
+ elem->defname, strVal(elem->arg))));
+ }
else
{
ereport(ERROR,
@@ -244,6 +307,156 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
OutputPluginWrite(ctx, true);
}
+/* ABORT callback */
+static void
+pg_decode_abort_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+ if (data->include_xids)
+ appendStringInfo(ctx->out, "ABORT %u", txn->xid);
+ else
+ appendStringInfoString(ctx->out, "ABORT");
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* Filter out unnecessary two-phase transactions */
+static bool
+pg_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ /* treat all transactions as one-phase */
+ if (!data->twophase_decoding)
+ return true;
+
+ if (txn && txn_has_catalog_changes(txn) &&
+ !data->twophase_decode_with_catalog_changes)
+ return true;
+
+ /*
+ * even if txn is NULL, decode since twophase_decoding is set
+ */
+ return false;
+}
+
+/*
+ * Check if we should continue to decode this transaction.
+ *
+ * If it has aborted in the meanwhile, then there's no sense
+ * in decoding and sending the rest of the changes, we might
+ * as well ask the subscribers to abort immediately.
+ *
+ * This should be called if we are streaming a transaction
+ * before it's committed or if we are decoding a 2PC
+ * transaction. Otherwise we always decode committed
+ * transactions
+ *
+ * Additional checks can be added here, as needed
+ */
+static bool
+pg_filter_decode_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn)
+{
+ /*
+ * Due to caching, repeated TransactionIdDidAbort calls
+ * shouldn't be that expensive
+ */
+ if (txn != NULL &&
+ TransactionIdIsValid(txn->xid) &&
+ TransactionIdDidAbort(txn->xid))
+ return true;
+
+ /* if txn is NULL, filter it out */
+ return txn == NULL;
+}
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (!data->twophase_decoding)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* ABORT PREPARED callback */
+static void
+pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (!data->twophase_decoding)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
static bool
pg_decode_filter(LogicalDecodingContext *ctx,
RepOriginId origin_id)
@@ -409,8 +622,18 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
}
data->xact_wrote_changes = true;
+ if (!LogicalLockTransaction(txn))
+ return;
+ /* if decode_delay is specified, sleep with above lock held */
+ if (data->decode_delay > 0)
+ {
+ elog(LOG, "sleeping for %d seconds", data->decode_delay);
+ pg_usleep(data->decode_delay * 1000000L);
+ }
class_form = RelationGetForm(relation);
tupdesc = RelationGetDescr(relation);
+ LogicalUnlockTransaction(txn);
+
/* Avoid leaking memory by using and resetting our own context */
old = MemoryContextSwitchTo(data->context);
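The decision made by pg_filter_prepare in the test_decoding.c changes above can be modeled outside the server. The following is only a sketch and not part of the patch: ModelOptions, ModelTxn and model_filter_prepare are hypothetical stand-ins for TestDecodingData, ReorderBufferTXN and the callback, and the has_catalog_changes field stands in for txn_has_catalog_changes(txn).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two plugin options that drive the filter. */
typedef struct ModelOptions
{
	bool		twophase_decoding;
	bool		twophase_decode_with_catalog_changes;
} ModelOptions;

/* Hypothetical stand-in for the part of ReorderBufferTXN the filter reads. */
typedef struct ModelTxn
{
	bool		has_catalog_changes;
} ModelTxn;

/*
 * Returns true when the transaction should NOT be decoded at PREPARE time,
 * i.e. it is treated as a one-phase transaction and decoded at commit.
 */
bool
model_filter_prepare(const ModelOptions *opts, const ModelTxn *txn)
{
	/* two-phase decoding disabled: treat everything as one-phase */
	if (!opts->twophase_decoding)
		return true;

	/* transactions with catalog changes are skipped unless opted in */
	if (txn != NULL && txn->has_catalog_changes &&
		!opts->twophase_decode_with_catalog_changes)
		return true;

	/* a NULL txn is still decoded, since twophase_decoding is set */
	return false;
}
```

Under these assumptions, a transaction containing DDL is decoded at PREPARE time only when both options are enabled, which is the combination the TAP test passes to pg_logical_slot_get_changes.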
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 5501eed108..7edda72e5e 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -384,8 +384,14 @@ typedef struct OutputPluginCallbacks
LogicalDecodeBeginCB begin_cb;
LogicalDecodeChangeCB change_cb;
LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeAbortCB abort_cb;
LogicalDecodeMessageCB message_cb;
+ LogicalDecodeFilterPrepareCB filter_prepare_cb;
+ LogicalDecodePrepareCB prepare_cb;
+ LogicalDecodeCommitPreparedCB commit_prepared_cb;
+ LogicalDecodeAbortPreparedCB abort_prepared_cb;
LogicalDecodeFilterByOriginCB filter_by_origin_cb;
+ LogicalDecodeFilterDecodeTxnCB filter_decode_txn_cb;
LogicalDecodeShutdownCB shutdown_cb;
} OutputPluginCallbacks;
@@ -454,7 +460,12 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
never get
decoded. Successful savepoints are
folded into the transaction containing them in the order they were
- executed within that transaction.
+ executed within that transaction. A transaction that is prepared for
+ a two-phase commit using <command>PREPARE TRANSACTION</command> will
+ also be decoded if the output plugin provides the callbacks needed for
+ decoding it. The transaction being decoded may be aborted concurrently
+ via a <command>ROLLBACK PREPARED</command> command. In that case,
+ logical decoding of that transaction is aborted midway.
</para>
<note>
@@ -550,6 +561,74 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
</para>
</sect3>
+ <sect3 id="logicaldecoding-output-plugin-prepare">
+ <title>Transaction Prepare Callback</title>
+
+ <para>
+ The optional <function>prepare_cb</function> callback is called whenever
+ a transaction which is prepared for two-phase commit has been
+ decoded. The <function>change_cb</function> callbacks for all modified
+ rows will have been called before this, if there have been any modified
+ rows.
+<programlisting>
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+ <title>Commit Prepared Transaction Callback</title>
+
+ <para>
+ The optional <function>commit_prepared_cb</function> callback is called whenever
+ a <command>COMMIT PREPARED</command> of a prepared transaction has been decoded.
+ The <parameter>gid</parameter> field, which is part of the <parameter>txn</parameter>
+ parameter, can be used in this callback.
+<programlisting>
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-abort-prepared">
+ <title>Rollback Prepared Transaction Callback</title>
+
+ <para>
+ The optional <function>abort_prepared_cb</function> callback is called whenever
+ a <command>ROLLBACK PREPARED</command> of a prepared transaction has been decoded.
+ The <parameter>gid</parameter> field, which is part of the <parameter>txn</parameter>
+ parameter, can be used in this callback.
+<programlisting>
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-abort">
+ <title>Transaction Abort Callback</title>
+
+ <para>
+ The required <function>abort_cb</function> callback is called whenever
+ a transaction abort has to be initiated. This can happen if we are
+ decoding a transaction that has been prepared for two-phase commit and
+ a concurrent rollback happens while we are decoding it. In such cases it
+ can also make sense to check whether the rollback has already happened
+ before looking at any changes, so that decoding of the transaction is
+ avoided completely.
+<programlisting>
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
<sect3 id="logicaldecoding-output-plugin-change">
<title>Change Callback</title>
@@ -559,12 +638,30 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
an <command>INSERT</command>, <command>UPDATE</command>,
or <command>DELETE</command>. Even if the original command modified
several rows at once the callback will be called individually for each
- row.
+ row. The <function>change_cb</function> callback may access system or
+ user catalog tables to aid in the process of outputting the row
+ modification details. The callback should call
+ <function>LogicalLockTransaction</function> before any such access to
+ system or user catalog tables. When decoding a prepared (but not yet
+ committed) transaction, or any other uncommitted transaction, this
+ function interlocks the decoding activity with a simultaneous rollback
+ of that same transaction by another backend. The
+ <function>change_cb</function> callback should call
+ <function>LogicalUnlockTransaction</function> immediately after
+ the catalog access.
<programlisting>
typedef void (*LogicalDecodeChangeCB) (struct LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
Relation relation,
ReorderBufferChange *change);
+</programlisting>
+ Here's an example of the use of <function>LogicalLockTransaction</function>
+ and <function>LogicalUnlockTransaction</function> in an output plugin:
+<programlisting>
+ if (!LogicalLockTransaction(txn))
+ return;
+ relation = RelationIdGetRelation(reloid);
+ LogicalUnlockTransaction(txn);
</programlisting>
The <parameter>ctx</parameter> and <parameter>txn</parameter> parameters
have the same contents as for the <function>begin_cb</function>
@@ -614,6 +711,53 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
</para>
</sect3>
+ <sect3 id="logicaldecoding-output-plugin-filter-decode">
+ <title>Decode Filter Callback</title>
+
+ <para>
+ The optional <function>filter_decode_txn_cb</function> callback
+ is called to determine whether decoding of the current
+ transaction should continue.
+<programlisting>
+typedef bool (*LogicalDecodeFilterDecodeTxnCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn);
+</programlisting>
+ The <parameter>ctx</parameter> parameter has the same contents
+ as for the other callbacks. The <parameter>txn</parameter> parameter
+ contains meta information about the transaction, like its XID.
+ Note however that it can be NULL in some cases. To signal that the decoding
+ process should terminate, return true; otherwise return false.
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+ <title>Prepare Filter Callback</title>
+
+ <para>
+ The optional <function>filter_prepare_cb</function> callback
+ is called to determine whether data that is part of the current
+ two-phase commit transaction should be decoded at this prepare
+ stage, or later as a regular one-phase transaction at
+ <command>COMMIT PREPARED</command> time. To signal that
+ decoding at the prepare stage should be skipped, return true; false otherwise.
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+</programlisting>
+ The <parameter>ctx</parameter> parameter has the same contents
+ as for the other callbacks. The <parameter>txn</parameter> parameter
+ contains meta information about the transaction. The <parameter>xid</parameter>
+ contains the XID because <parameter>txn</parameter> can be NULL in some cases.
+ The <parameter>gid</parameter> is the identifier that later identifies this
+ transaction for <command>COMMIT PREPARED</command> or <command>ROLLBACK PREPARED</command>.
+ The callback has to provide the same static answer for a given combination of
+ <parameter>xid</parameter> and <parameter>gid</parameter> every time it is
+ called. To signal that decoding should be skipped, return true; false otherwise.
+ </para>
+ </sect3>
+
<sect3 id="logicaldecoding-output-plugin-message">
<title>Generic Message Callback</title>
diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index e5eef9ea43..b3e2fc3036 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -102,6 +102,14 @@ ParseCommitRecord(uint8 info, xl_xact_commit *xlrec, xl_xact_parsed_commit *pars
parsed->twophase_xid = xl_twophase->xid;
data += sizeof(xl_xact_twophase);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_GID)
+ {
+ int gidlen;
+ strcpy(parsed->twophase_gid, data);
+ gidlen = strlen(parsed->twophase_gid) + 1;
+ data += MAXALIGN(gidlen);
+ }
}
if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
@@ -139,6 +147,16 @@ ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_abort *parsed)
data += sizeof(xl_xact_xinfo);
}
+ if (parsed->xinfo & XACT_XINFO_HAS_DBINFO)
+ {
+ xl_xact_dbinfo *xl_dbinfo = (xl_xact_dbinfo *) data;
+
+ parsed->dbId = xl_dbinfo->dbId;
+ parsed->tsId = xl_dbinfo->tsId;
+
+ data += sizeof(xl_xact_dbinfo);
+ }
+
if (parsed->xinfo & XACT_XINFO_HAS_SUBXACTS)
{
xl_xact_subxacts *xl_subxacts = (xl_xact_subxacts *) data;
@@ -168,6 +186,27 @@ ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_abort *parsed)
parsed->twophase_xid = xl_twophase->xid;
data += sizeof(xl_xact_twophase);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_GID)
+ {
+ int gidlen;
+ strcpy(parsed->twophase_gid, data);
+ gidlen = strlen(parsed->twophase_gid) + 1;
+ data += MAXALIGN(gidlen);
+ }
+ }
+
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ xl_xact_origin xl_origin;
+
+ /* we're only guaranteed 4 byte alignment, so copy onto stack */
+ memcpy(&xl_origin, data, sizeof(xl_origin));
+
+ parsed->origin_lsn = xl_origin.origin_lsn;
+ parsed->origin_timestamp = xl_origin.origin_timestamp;
+
+ data += sizeof(xl_xact_origin);
}
}
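The GID handling added to ParseCommitRecord and ParseAbortRecord above copies a NUL-terminated string out of the record and then advances the data pointer by the MAXALIGN'ed length. The sketch below illustrates the same walk with a bounded copy. It is not part of the patch: MODEL_GIDSIZE, MODEL_ALIGNOF, MODEL_MAXALIGN and model_parse_gid are hypothetical names, the 200-byte buffer mirrors the GIDSIZE value visible in twophase.c, and 8-byte maximum alignment is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_GIDSIZE 200		/* assumed to match GIDSIZE */
#define MODEL_ALIGNOF 8			/* assumed maximum alignment */
#define MODEL_MAXALIGN(len) \
	(((size_t) (len) + (MODEL_ALIGNOF - 1)) & ~((size_t) (MODEL_ALIGNOF - 1)))

/*
 * Copy the NUL-terminated GID found at 'data' into 'gid' without writing
 * past the destination buffer, and return a pointer to the first byte after
 * the padded GID.
 *
 * A GID longer than MODEL_GIDSIZE - 1 bytes would be truncated and the
 * returned pointer would then be wrong; the writer of the record is assumed
 * to enforce the size limit.
 */
const char *
model_parse_gid(const char *data, char gid[MODEL_GIDSIZE])
{
	size_t		gidlen = 0;

	/* bounded length scan, so a corrupt record cannot overrun 'gid' */
	while (gidlen < MODEL_GIDSIZE - 1 && data[gidlen] != '\0')
		gidlen++;

	memcpy(gid, data, gidlen);
	gid[gidlen] = '\0';

	/* the stored GID includes its terminator and is padded to alignment */
	return data + MODEL_MAXALIGN(gidlen + 1);
}
```

The bounded scan is a deliberate difference from the plain strcpy in the hunks above; whether that extra check is wanted in the real parser is a question for review, not something this sketch settles.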
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index c479c4881b..97499707f7 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -144,11 +144,7 @@ int max_prepared_xacts = 0;
*
* typedef struct GlobalTransactionData *GlobalTransaction appears in
* twophase.h
- *
- * Note that the max value of GIDSIZE must fit in the uint16 gidlen,
- * specified in TwoPhaseFileHeader.
*/
-#define GIDSIZE 200
typedef struct GlobalTransactionData
{
@@ -211,12 +207,14 @@ static void RecordTransactionCommitPrepared(TransactionId xid,
RelFileNode *rels,
int ninvalmsgs,
SharedInvalidationMessage *invalmsgs,
- bool initfileinval);
+ bool initfileinval,
+ const char *gid);
static void RecordTransactionAbortPrepared(TransactionId xid,
int nchildren,
TransactionId *children,
int nrels,
- RelFileNode *rels);
+ RelFileNode *rels,
+ const char *gid);
static void ProcessRecords(char *bufptr, TransactionId xid,
const TwoPhaseCallback callbacks[]);
static void RemoveGXact(GlobalTransaction gxact);
@@ -556,7 +554,7 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held)
* Locate the prepared transaction and mark it busy for COMMIT or PREPARE.
*/
static GlobalTransaction
-LockGXact(const char *gid, Oid user)
+LockGXact(const char *gid, Oid user, bool missing_ok)
{
int i;
@@ -616,7 +614,8 @@ LockGXact(const char *gid, Oid user)
LWLockRelease(TwoPhaseStateLock);
- ereport(ERROR,
+ if (!missing_ok)
+ ereport(ERROR,
(errcode(ERRCODE_UNDEFINED_OBJECT),
errmsg("prepared transaction with identifier \"%s\" does not exist",
gid)));
@@ -898,7 +897,7 @@ TwoPhaseGetDummyProc(TransactionId xid)
/*
* Header for a 2PC state file
*/
-#define TWOPHASE_MAGIC 0x57F94533 /* format identifier */
+#define TWOPHASE_MAGIC 0x57F94534 /* format identifier */
typedef struct TwoPhaseFileHeader
{
@@ -914,6 +913,8 @@ typedef struct TwoPhaseFileHeader
int32 ninvalmsgs; /* number of cache invalidation messages */
bool initfileinval; /* does relcache init file need invalidation? */
uint16 gidlen; /* length of the GID - GID follows the header */
+ XLogRecPtr origin_lsn; /* lsn of this record at origin node */
+ TimestampTz origin_timestamp; /* time of prepare at origin node */
} TwoPhaseFileHeader;
/*
@@ -1065,6 +1066,7 @@ EndPrepare(GlobalTransaction gxact)
{
TwoPhaseFileHeader *hdr;
StateFileChunk *record;
+ bool replorigin;
/* Add the end sentinel to the list of 2PC records */
RegisterTwoPhaseRecord(TWOPHASE_RM_END_ID, 0,
@@ -1075,6 +1077,21 @@ EndPrepare(GlobalTransaction gxact)
Assert(hdr->magic == TWOPHASE_MAGIC);
hdr->total_len = records.total_len + sizeof(pg_crc32c);
+ replorigin = (replorigin_session_origin != InvalidRepOriginId &&
+ replorigin_session_origin != DoNotReplicateId);
+
+ if (replorigin)
+ {
+ Assert(replorigin_session_origin_lsn != InvalidXLogRecPtr);
+ hdr->origin_lsn = replorigin_session_origin_lsn;
+ hdr->origin_timestamp = replorigin_session_origin_timestamp;
+ }
+ else
+ {
+ hdr->origin_lsn = InvalidXLogRecPtr;
+ hdr->origin_timestamp = 0;
+ }
+
/*
* If the data size exceeds MaxAllocSize, we won't be able to read it in
* ReadTwoPhaseFile. Check for that now, rather than fail in the case
@@ -1107,7 +1124,16 @@ EndPrepare(GlobalTransaction gxact)
XLogBeginInsert();
for (record = records.head; record != NULL; record = record->next)
XLogRegisterData(record->data, record->len);
+
+ XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
+
gxact->prepare_end_lsn = XLogInsert(RM_XACT_ID, XLOG_XACT_PREPARE);
+
+ if (replorigin)
+ /* Move LSNs forward for this replication origin */
+ replorigin_session_advance(replorigin_session_origin_lsn,
+ gxact->prepare_end_lsn);
+
XLogFlush(gxact->prepare_end_lsn);
/* If we crash now, we have prepared: WAL replay will fix things */
@@ -1283,6 +1309,43 @@ ReadTwoPhaseFile(TransactionId xid, bool give_warnings)
return buf;
}
+/*
+ * ParsePrepareRecord
+ */
+void
+ParsePrepareRecord(uint8 info, char *xlrec, xl_xact_parsed_prepare *parsed)
+{
+ TwoPhaseFileHeader *hdr;
+ char *bufptr;
+
+ hdr = (TwoPhaseFileHeader *) xlrec;
+ bufptr = xlrec + MAXALIGN(sizeof(TwoPhaseFileHeader));
+
+ parsed->origin_lsn = hdr->origin_lsn;
+ parsed->origin_timestamp = hdr->origin_timestamp;
+ parsed->twophase_xid = hdr->xid;
+ parsed->dbId = hdr->database;
+ parsed->nsubxacts = hdr->nsubxacts;
+ parsed->ncommitrels = hdr->ncommitrels;
+ parsed->nabortrels = hdr->nabortrels;
+ parsed->nmsgs = hdr->ninvalmsgs;
+
+ strncpy(parsed->twophase_gid, bufptr, hdr->gidlen);
+ bufptr += MAXALIGN(hdr->gidlen);
+
+ parsed->subxacts = (TransactionId *) bufptr;
+ bufptr += MAXALIGN(hdr->nsubxacts * sizeof(TransactionId));
+
+ parsed->commitrels = (RelFileNode *) bufptr;
+ bufptr += MAXALIGN(hdr->ncommitrels * sizeof(RelFileNode));
+
+ parsed->abortrels = (RelFileNode *) bufptr;
+ bufptr += MAXALIGN(hdr->nabortrels * sizeof(RelFileNode));
+
+ parsed->msgs = (SharedInvalidationMessage *) bufptr;
+ bufptr += MAXALIGN(hdr->ninvalmsgs * sizeof(SharedInvalidationMessage));
+}
+
/*
* Reads 2PC data from xlog. During checkpoint this data will be moved to
@@ -1365,7 +1428,7 @@ StandbyTransactionIdIsPrepared(TransactionId xid)
* FinishPreparedTransaction: execute COMMIT PREPARED or ROLLBACK PREPARED
*/
void
-FinishPreparedTransaction(const char *gid, bool isCommit)
+FinishPreparedTransaction(const char *gid, bool isCommit, bool missing_ok)
{
GlobalTransaction gxact;
PGPROC *proc;
@@ -1386,8 +1449,20 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
/*
* Validate the GID, and lock the GXACT to ensure that two backends do not
* try to commit the same GID at once.
+ *
+ * During logical decoding, on the apply side, it's possible that a prepared
+ * transaction got aborted while decoding. In that case, we stop the
+ * decoding and abort the transaction immediately. However, the ROLLBACK
+ * PREPARED still reaches the subscriber, so it's OK for the gid to be
+ * missing there.
*/
- gxact = LockGXact(gid, GetUserId());
+ gxact = LockGXact(gid, GetUserId(), missing_ok);
+ if (gxact == NULL)
+ {
+ Assert(missing_ok && !isCommit);
+ return;
+ }
+
proc = &ProcGlobal->allProcs[gxact->pgprocno];
pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
xid = pgxact->xid;
@@ -1435,13 +1510,19 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
hdr->nsubxacts, children,
hdr->ncommitrels, commitrels,
hdr->ninvalmsgs, invalmsgs,
- hdr->initfileinval);
+ hdr->initfileinval, gid);
else
RecordTransactionAbortPrepared(xid,
hdr->nsubxacts, children,
- hdr->nabortrels, abortrels);
+ hdr->nabortrels, abortrels,
+ gid);
ProcArrayRemove(proc, latestXid);
+ /*
+ * Tell logical decoding backends interested in this XID
+ * that this is going away
+ */
+ LogicalDecodeRemoveTransaction(proc, isCommit);
/*
* In case we fail while running the callbacks, mark the gxact invalid so
@@ -1752,7 +1833,8 @@ restoreTwoPhaseData(void)
if (buf == NULL)
continue;
- PrepareRedoAdd(buf, InvalidXLogRecPtr, InvalidXLogRecPtr);
+ PrepareRedoAdd(buf, InvalidXLogRecPtr,
+ InvalidXLogRecPtr, InvalidRepOriginId);
}
}
LWLockRelease(TwoPhaseStateLock);
@@ -2165,7 +2247,8 @@ RecordTransactionCommitPrepared(TransactionId xid,
RelFileNode *rels,
int ninvalmsgs,
SharedInvalidationMessage *invalmsgs,
- bool initfileinval)
+ bool initfileinval,
+ const char *gid)
{
XLogRecPtr recptr;
TimestampTz committs = GetCurrentTimestamp();
@@ -2193,7 +2276,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
ninvalmsgs, invalmsgs,
initfileinval, false,
MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
- xid);
+ xid, gid);
if (replorigin)
@@ -2255,7 +2338,8 @@ RecordTransactionAbortPrepared(TransactionId xid,
int nchildren,
TransactionId *children,
int nrels,
- RelFileNode *rels)
+ RelFileNode *rels,
+ const char *gid)
{
XLogRecPtr recptr;
@@ -2278,7 +2362,7 @@ RecordTransactionAbortPrepared(TransactionId xid,
nchildren, children,
nrels, rels,
MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
- xid);
+ xid, gid);
/* Always flush, since we're about to remove the 2PC state file */
XLogFlush(recptr);
@@ -2309,7 +2393,8 @@ RecordTransactionAbortPrepared(TransactionId xid,
* data, the entry is marked as located on disk.
*/
void
-PrepareRedoAdd(char *buf, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn, RepOriginId origin_id)
{
TwoPhaseFileHeader *hdr = (TwoPhaseFileHeader *) buf;
char *bufptr;
@@ -2358,6 +2443,13 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
Assert(TwoPhaseState->numPrepXacts < max_prepared_xacts);
TwoPhaseState->prepXacts[TwoPhaseState->numPrepXacts++] = gxact;
+ if (origin_id != InvalidRepOriginId)
+ {
+ /* recover apply progress */
+ replorigin_advance(origin_id, hdr->origin_lsn, end_lsn,
+ false /* backward */ , false /* WAL */ );
+ }
+
elog(DEBUG2, "added 2PC data in shared memory for transaction %u", gxact->xid);
}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index ea81f4b5de..e19fac4f7b 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1227,7 +1227,7 @@ RecordTransactionCommit(void)
nmsgs, invalMessages,
RelcacheInitFileInval, forceSyncCommit,
MyXactFlags,
- InvalidTransactionId /* plain commit */ );
+ InvalidTransactionId, NULL /* plain commit */ );
if (replorigin)
/* Move LSNs forward for this replication origin */
@@ -1579,7 +1579,8 @@ RecordTransactionAbort(bool isSubXact)
XactLogAbortRecord(xact_time,
nchildren, children,
nrels, rels,
- MyXactFlags, InvalidTransactionId);
+ MyXactFlags, InvalidTransactionId,
+ NULL);
/*
* Report the latest async abort LSN, so that the WAL writer knows to
@@ -5247,7 +5248,6 @@ xactGetCommittedChildren(TransactionId **ptr)
* XLOG support routines
*/
-
/*
* Log the commit record for a plain or twophase transaction commit.
*
@@ -5260,7 +5260,8 @@ XactLogCommitRecord(TimestampTz commit_time,
int nrels, RelFileNode *rels,
int nmsgs, SharedInvalidationMessage *msgs,
bool relcacheInval, bool forceSync,
- int xactflags, TransactionId twophase_xid)
+ int xactflags, TransactionId twophase_xid,
+ const char *twophase_gid)
{
xl_xact_commit xlrec;
xl_xact_xinfo xl_xinfo;
@@ -5272,6 +5273,7 @@ XactLogCommitRecord(TimestampTz commit_time,
xl_xact_origin xl_origin;
uint8 info;
+ int gidlen = 0;
Assert(CritSectionCount > 0);
@@ -5334,6 +5336,13 @@ XactLogCommitRecord(TimestampTz commit_time,
{
xl_xinfo.xinfo |= XACT_XINFO_HAS_TWOPHASE;
xl_twophase.xid = twophase_xid;
+ Assert(twophase_gid != NULL);
+
+ if (XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_GID;
+ gidlen = strlen(twophase_gid) + 1; /* include '\0' */
+ }
}
/* dump transaction origin information */
@@ -5384,8 +5393,19 @@ XactLogCommitRecord(TimestampTz commit_time,
}
if (xl_xinfo.xinfo & XACT_XINFO_HAS_TWOPHASE)
+ {
XLogRegisterData((char *) (&xl_twophase), sizeof(xl_xact_twophase));
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_GID)
+ {
+ static const char zeroes[MAXIMUM_ALIGNOF] = { 0 };
+ XLogRegisterData((char *) twophase_gid, gidlen);
+ if (MAXALIGN(gidlen) != gidlen)
+ XLogRegisterData((char *) zeroes, MAXALIGN(gidlen) - gidlen);
+ }
+ }
+
+
if (xl_xinfo.xinfo & XACT_XINFO_HAS_ORIGIN)
XLogRegisterData((char *) (&xl_origin), sizeof(xl_xact_origin));
@@ -5405,15 +5425,19 @@ XLogRecPtr
XactLogAbortRecord(TimestampTz abort_time,
int nsubxacts, TransactionId *subxacts,
int nrels, RelFileNode *rels,
- int xactflags, TransactionId twophase_xid)
+ int xactflags, TransactionId twophase_xid,
+ const char *twophase_gid)
{
xl_xact_abort xlrec;
xl_xact_xinfo xl_xinfo;
xl_xact_subxacts xl_subxacts;
xl_xact_relfilenodes xl_relfilenodes;
xl_xact_twophase xl_twophase;
+ xl_xact_dbinfo xl_dbinfo;
+ xl_xact_origin xl_origin;
uint8 info;
+ int gidlen = 0;
Assert(CritSectionCount > 0);
@@ -5449,6 +5473,31 @@ XactLogAbortRecord(TimestampTz abort_time,
{
xl_xinfo.xinfo |= XACT_XINFO_HAS_TWOPHASE;
xl_twophase.xid = twophase_xid;
+ Assert(twophase_gid != NULL);
+
+ if (XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_GID;
+ gidlen = strlen(twophase_gid) + 1; /* include '\0' */
+ }
+ }
+
+ if (TransactionIdIsValid(twophase_xid) && XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_DBINFO;
+ xl_dbinfo.dbId = MyDatabaseId;
+ xl_dbinfo.tsId = MyDatabaseTableSpace;
+ }
+
+ /* dump transaction origin information only for abort prepared */
+ if ((replorigin_session_origin != InvalidRepOriginId) &&
+ TransactionIdIsValid(twophase_xid) &&
+ XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_ORIGIN;
+
+ xl_origin.origin_lsn = replorigin_session_origin_lsn;
+ xl_origin.origin_timestamp = replorigin_session_origin_timestamp;
}
if (xl_xinfo.xinfo != 0)
@@ -5463,6 +5512,9 @@ XactLogAbortRecord(TimestampTz abort_time,
if (xl_xinfo.xinfo != 0)
XLogRegisterData((char *) (&xl_xinfo), sizeof(xl_xinfo));
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_DBINFO)
+ XLogRegisterData((char *) (&xl_dbinfo), sizeof(xl_dbinfo));
+
if (xl_xinfo.xinfo & XACT_XINFO_HAS_SUBXACTS)
{
XLogRegisterData((char *) (&xl_subxacts),
@@ -5480,7 +5532,23 @@ XactLogAbortRecord(TimestampTz abort_time,
}
if (xl_xinfo.xinfo & XACT_XINFO_HAS_TWOPHASE)
+ {
XLogRegisterData((char *) (&xl_twophase), sizeof(xl_xact_twophase));
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_GID)
+ {
+ static const char zeroes[MAXIMUM_ALIGNOF] = { 0 };
+ XLogRegisterData((char *) twophase_gid, gidlen);
+ if (MAXALIGN(gidlen) != gidlen)
+ XLogRegisterData((char *) zeroes, MAXALIGN(gidlen) - gidlen);
+ }
+ }
+
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_ORIGIN)
+ XLogRegisterData((char *) (&xl_origin), sizeof(xl_xact_origin));
+
+ if (TransactionIdIsValid(twophase_xid))
+ XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
+
return XLogInsert(RM_XACT_ID, info);
}
@@ -5803,7 +5871,8 @@ xact_redo(XLogReaderState *record)
LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
PrepareRedoAdd(XLogRecGetData(record),
record->ReadRecPtr,
- record->EndRecPtr);
+ record->EndRecPtr,
+ XLogRecGetOrigin(record));
LWLockRelease(TwoPhaseStateLock);
}
else if (info == XLOG_XACT_ASSIGNMENT)
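The GID handling in XactLogCommitRecord() and XactLogAbortRecord() above registers the GID with its terminating '\0' and then zero-pads it up to the next MAXALIGN boundary. The arithmetic can be sketched in isolation as follows. This is only an illustration, not code from the patch: the sketch_* names are hypothetical, and the 8-byte alignment is an assumption standing in for MAXIMUM_ALIGNOF.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumption: 8-byte maximum alignment, standing in for MAXIMUM_ALIGNOF. */
#define SKETCH_ALIGNOF 8

/* Round len up to the next multiple of SKETCH_ALIGNOF (a power of two). */
static size_t
sketch_maxalign(size_t len)
{
	return (len + (SKETCH_ALIGNOF - 1)) & ~((size_t) SKETCH_ALIGNOF - 1);
}

/* Bytes of GID payload registered in the record: the string plus its '\0'. */
static size_t
sketch_gid_len(const char *gid)
{
	return strlen(gid) + 1;
}

/* Number of zero bytes appended so that the next field starts aligned. */
static size_t
sketch_gid_padding(const char *gid)
{
	size_t		gidlen = sketch_gid_len(gid);

	return sketch_maxalign(gidlen) - gidlen;
}
```

Under that assumption, a 15-character GID such as 'test_prepared#3' occupies exactly 16 bytes and needs no padding, while a 3-character GID occupies 4 bytes and is followed by 4 zero bytes.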
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 6eb0d5527e..b45739d971 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -34,6 +34,7 @@
#include "access/xlogutils.h"
#include "access/xlogreader.h"
#include "access/xlogrecord.h"
+#include "access/twophase.h"
#include "catalog/pg_control.h"
@@ -72,6 +73,8 @@ static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_commit *parsed, TransactionId xid);
static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare *parsed);
/* common function to decode tuples */
static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -280,16 +283,33 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
break;
}
case XLOG_XACT_PREPARE:
+ {
+ xl_xact_parsed_prepare parsed;
- /*
- * Currently decoding ignores PREPARE TRANSACTION and will just
- * decode the transaction when the COMMIT PREPARED is sent or
- * throw away the transaction's contents when a ROLLBACK PREPARED
- * is received. In the future we could add code to expose prepared
- * transactions in the changestream allowing for a kind of
- * distributed 2PC.
- */
- ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+ /* check that output plugin is capable of twophase decoding */
+ if (!ctx->enable_twophase)
+ {
+ ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+ break;
+ }
+
+ /* ok, parse it */
+ ParsePrepareRecord(XLogRecGetInfo(buf->record),
+ XLogRecGetData(buf->record), &parsed);
+
+ /* does output plugin want this particular transaction? */
+ if (ctx->callbacks.filter_prepare_cb &&
+ ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+ parsed.twophase_gid))
+ {
+ ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+ buf->origptr);
+ break;
+ }
+
+ DecodePrepare(ctx, buf, &parsed);
+ break;
+ }
break;
default:
elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
@@ -627,9 +647,71 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
buf->origptr, buf->endptr);
}
+ if (TransactionIdIsValid(parsed->twophase_xid) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder,
+ parsed->twophase_xid, parsed->twophase_gid))
+ {
+ Assert(xid == parsed->twophase_xid);
+ /* we are processing COMMIT PREPARED */
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn, parsed->twophase_gid, true);
+ }
+ else
+ {
+ /* replay actions of all transaction + subtransactions in order */
+ ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn);
+ }
+}
+
+/*
+ * Decode a PREPARE record. The logic is similar to that of DecodeCommit().
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare *parsed)
+{
+ XLogRecPtr origin_lsn = parsed->origin_lsn;
+ TimestampTz commit_time = parsed->origin_timestamp;
+ RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+ int i;
+ TransactionId xid = parsed->twophase_xid;
+
+ /*
+ * Process invalidation messages, even if we're not interested in the
+ * transaction's contents, since the various caches need to always be
+ * consistent.
+ */
+ if (parsed->nmsgs > 0)
+ {
+ ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
+ parsed->nmsgs, parsed->msgs);
+ ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+ }
+
+ if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+ (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+ FilterByOrigin(ctx, origin_id))
+ {
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr);
+ }
+ ReorderBufferForget(ctx->reorder, xid, buf->origptr);
+
+ return;
+ }
+
+ /* tell the reorderbuffer about the surviving subtransactions */
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+ buf->origptr, buf->endptr);
+ }
+
/* replay actions of all transaction + subtransactions in order */
- ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
- commit_time, origin_id, origin_lsn);
+ ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn, parsed->twophase_gid);
}
/*
@@ -641,6 +723,30 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid)
{
int i;
+ XLogRecPtr origin_lsn = InvalidXLogRecPtr;
+ TimestampTz commit_time = 0;
+ RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ origin_lsn = parsed->origin_lsn;
+ commit_time = parsed->origin_timestamp;
+ }
+
+ /*
+ * If it's ROLLBACK PREPARED then handle it via callbacks.
+ */
+ if (TransactionIdIsValid(xid) &&
+ !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+ parsed->dbId == ctx->slot->data.database &&
+ !FilterByOrigin(ctx, origin_id) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
+ {
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid, false);
+ return;
+ }
for (i = 0; i < parsed->nsubxacts; i++)
{
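The XLOG_XACT_PREPARE branch of DecodeXactOp() above makes a two-step decision: the transaction is decoded at PREPARE time only if the plugin supports two-phase decoding and the optional filter does not ask to skip it; otherwise it is left for the usual commit-time path. A minimal sketch of that decision, with the plugin state reduced to booleans (all sketch_* and SKETCH_* names are hypothetical and not part of the patch):

```c
#include <assert.h>
#include <stdbool.h>

typedef enum SketchPrepareAction
{
	SKETCH_DEFER_TO_COMMIT,		/* only note the xid; decode at COMMIT PREPARED */
	SKETCH_DECODE_AT_PREPARE	/* replay the changes at PREPARE time */
} SketchPrepareAction;

/*
 * Mirrors the order of checks in the patch: two-phase support first, then
 * the filter, which is only consulted when a filter callback is registered.
 */
static SketchPrepareAction
sketch_prepare_action(bool enable_twophase, bool has_filter_cb,
					  bool filter_wants_skip)
{
	if (!enable_twophase)
		return SKETCH_DEFER_TO_COMMIT;

	if (has_filter_cb && filter_wants_skip)
		return SKETCH_DEFER_TO_COMMIT;

	return SKETCH_DECODE_AT_PREPARE;
}
```

In both deferred cases the patch still calls ReorderBufferProcessXid(), so the xid is tracked even though nothing is sent at PREPARE time.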
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 7637efc32e..50c08acc34 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -60,6 +60,18 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+static void abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+static bool filter_decode_txn_cb_wrapper(ReorderBuffer *cache,
+ ReorderBufferTXN *txn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change);
static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -125,6 +137,7 @@ StartupDecodingContext(List *output_plugin_options,
MemoryContext context,
old_context;
LogicalDecodingContext *ctx;
+ int twophase_callbacks;
/* shorter lines... */
slot = MyReplicationSlot;
@@ -184,8 +197,27 @@ StartupDecodingContext(List *output_plugin_options,
ctx->reorder->begin = begin_cb_wrapper;
ctx->reorder->apply_change = change_cb_wrapper;
ctx->reorder->commit = commit_cb_wrapper;
+ ctx->reorder->abort = abort_cb_wrapper;
+ ctx->reorder->filter_decode_txn = filter_decode_txn_cb_wrapper;
+ ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+ ctx->reorder->prepare = prepare_cb_wrapper;
+ ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+ ctx->reorder->abort_prepared = abort_prepared_cb_wrapper;
ctx->reorder->message = message_cb_wrapper;
+ /* check that plugin implements all callbacks necessary to perform 2PC */
+ twophase_callbacks = (ctx->callbacks.prepare_cb != NULL) +
+ (ctx->callbacks.commit_prepared_cb != NULL) +
+ (ctx->callbacks.abort_prepared_cb != NULL);
+
+ ctx->enable_twophase = (twophase_callbacks == 3);
+
+ if (twophase_callbacks != 3 && twophase_callbacks != 0)
+ ereport(WARNING,
+ (errmsg("Output plugin registered only %d twophase callbacks. "
+ "Twophase transactions will be decoded at commit time.",
+ twophase_callbacks)));
+
ctx->out = makeStringInfo();
ctx->prepare_write = prepare_write;
ctx->write = do_write;
@@ -693,6 +725,122 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
error_context_stack = errcallback.previous;
}
+static void
+abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "abort";
+ state.report_location = txn->final_lsn; /* beginning of abort record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.abort_cb(ctx, txn, abort_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "prepare";
+ state.report_location = txn->final_lsn; /* beginning of prepare record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "commit_prepared";
+ state.report_location = txn->final_lsn; /* beginning of commit record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "abort_prepared";
+ state.report_location = txn->final_lsn; /* beginning of abort record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.abort_prepared_cb(ctx, txn, abort_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
static void
change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change)
@@ -730,6 +878,62 @@ change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
error_context_stack = errcallback.previous;
}
+static bool
+filter_decode_txn_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+ bool ret;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "filter_decode_txn";
+ state.report_location = InvalidXLogRecPtr;
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = false;
+
+ /* do the actual work: call callback */
+ ret = ctx->callbacks.filter_decode_txn_cb(ctx, txn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+ return ret;
+}
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+ bool ret;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "filter_prepare";
+ state.report_location = InvalidXLogRecPtr;
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = false;
+
+ /* do the actual work: call callback */
+ ret = ctx->callbacks.filter_prepare_cb(ctx, txn, xid, gid);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+ return ret;
+}
+
bool
filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
{
@@ -1013,3 +1217,164 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
SpinLockRelease(&MyReplicationSlot->mutex);
}
}
+
+bool
+LogicalLockTransaction(ReorderBufferTXN *txn)
+{
+ bool ok = false;
+
+ /*
+ * Prepared transactions and uncommitted transactions
+ * that have modified catalogs need to interlock with
+ * concurrent rollback to ensure that there are no
+ * issues while decoding
+ */
+
+ if (!txn_has_catalog_changes(txn))
+ return true;
+
+ /*
+ * Is it a prepared txn? Similar checks for uncommitted
+ * transactions when we start supporting them
+ */
+ if (!txn_prepared(txn))
+ return true;
+
+ /* check cached status */
+ if (txn_commit(txn))
+ return true;
+ if (txn_rollback(txn))
+ return false;
+
+ /*
+ * Find the PROC that is handling this XID and add ourself as a
+ * decodeGroupMember
+ */
+ if (MyProc->decodeGroupLeader == NULL)
+ {
+ PGPROC *proc = BecomeDecodeGroupLeader(txn->xid, txn_prepared(txn));
+
+ /*
+ * If no leader PGPROC was found, then the only possibility
+ * is that the transaction completed and went away.
+ */
+ if (proc == NULL)
+ {
+ Assert(!TransactionIdIsInProgress(txn->xid));
+ if (TransactionIdDidCommit(txn->xid))
+ {
+ txn->txn_flags |= TXN_COMMIT;
+ return true;
+ }
+ else
+ {
+ txn->txn_flags |= TXN_ROLLBACK;
+ return false;
+ }
+ }
+
+ /* Add ourself as a decodeGroupMember */
+ if (!BecomeDecodeGroupMember(proc, proc->pid, txn_prepared(txn)))
+ {
+ Assert(!TransactionIdIsInProgress(txn->xid));
+ if (TransactionIdDidCommit(txn->xid))
+ {
+ txn->txn_flags |= TXN_COMMIT;
+ return true;
+ }
+ else
+ {
+ txn->txn_flags |= TXN_ROLLBACK;
+ return false;
+ }
+ }
+ }
+
+ /*
+ * If we were able to add ourselves, then abort processing will
+ * interlock with us. Check if the transaction is still around.
+ */
+ Assert(MyProc->decodeGroupLeader);
+
+ if (MyProc->decodeGroupLeader)
+ {
+ LWLock *leader_lwlock;
+
+ leader_lwlock = LockHashPartitionLockByProc(MyProc->decodeGroupLeader);
+ LWLockAcquire(leader_lwlock, LW_SHARED);
+ if (MyProc->decodeAbortPending)
+ {
+ /*
+ * Remove ourself from the decodeGroupMembership and return
+ * false so that the decoding plugin also initiates abort
+ * processing
+ */
+ LWLockRelease(leader_lwlock);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ RemoveDecodeGroupMemberLocked(MyProc->decodeGroupLeader);
+ /* reset the bool to let the leader know that we are going away */
+ MyProc->decodeAbortPending = false;
+ MyProc->decodeLocked = false;
+ txn->txn_flags |= TXN_ROLLBACK;
+ ok = false;
+ }
+ else
+ {
+ MyProc->decodeLocked = true;
+ ok = true;
+ }
+ LWLockRelease(leader_lwlock);
+ }
+ else
+ return false;
+
+ return ok;
+}
+
+void
+LogicalUnlockTransaction(ReorderBufferTXN *txn)
+{
+ LWLock *leader_lwlock;
+
+ /*
+ * Prepared transactions and uncommitted transactions
+ * that have modified catalogs need to interlock with
+ * concurrent rollback to ensure that there are no
+ * issues while decoding
+ */
+
+ if (!txn_has_catalog_changes(txn))
+ return;
+
+ /*
+ * Is it a prepared txn? Similar checks for uncommitted
+ * transactions when we start supporting them
+ */
+ if (!txn_prepared(txn))
+ return;
+
+ /* check cached status */
+ if (txn_commit(txn))
+ return;
+ if (txn_rollback(txn))
+ return;
+
+ Assert(MyProc->decodeGroupLeader);
+ leader_lwlock = LockHashPartitionLockByProc(MyProc->decodeGroupLeader);
+ LWLockAcquire(leader_lwlock, LW_SHARED);
+ if (MyProc->decodeAbortPending)
+ {
+ /*
+ * Remove ourself from the decodeGroupMembership
+ */
+ LWLockRelease(leader_lwlock);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ RemoveDecodeGroupMemberLocked(MyProc->decodeGroupLeader);
+ /* reset the bool to let the leader know that we are going away */
+ MyProc->decodeAbortPending = false;
+ txn->txn_flags |= TXN_ROLLBACK;
+ }
+ MyProc->decodeLocked = false;
+ LWLockRelease(leader_lwlock);
+ return;
+}
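The check added to StartupDecodingContext() above enables two-phase decoding only when the plugin registers all three of prepare_cb, commit_prepared_cb and abort_prepared_cb, and warns when only some of them are present. The same all-or-nothing rule can be sketched on its own like this. The types and names below are hypothetical stand-ins, not the real OutputPluginCallbacks struct.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an output plugin callback pointer. */
typedef void (*sketch_cb) (void);

typedef enum SketchTwoPhaseState
{
	SKETCH_2PC_OFF,				/* no callbacks: decode at commit time */
	SKETCH_2PC_PARTIAL,			/* some callbacks: warn, decode at commit time */
	SKETCH_2PC_ON				/* all three callbacks: decode at PREPARE */
} SketchTwoPhaseState;

/* Count the registered callbacks and classify the result. */
static SketchTwoPhaseState
sketch_twophase_state(sketch_cb prepare_cb, sketch_cb commit_prepared_cb,
					  sketch_cb abort_prepared_cb)
{
	int			n = (prepare_cb != NULL) +
		(commit_prepared_cb != NULL) +
		(abort_prepared_cb != NULL);

	if (n == 3)
		return SKETCH_2PC_ON;
	if (n == 0)
		return SKETCH_2PC_OFF;
	return SKETCH_2PC_PARTIAL;
}

/* Dummy callback used only to obtain a non-NULL pointer. */
static void
sketch_noop(void)
{
}
```

Only the SKETCH_2PC_PARTIAL case corresponds to the WARNING in the patch; a plugin with none of the callbacks silently keeps the old behaviour.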
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 948343e4ae..5d33931223 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -72,10 +72,11 @@ void
logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
XLogRecPtr commit_lsn)
{
- uint8 flags = 0;
+ uint8 flags = 0;
pq_sendbyte(out, 'C'); /* sending COMMIT */
+ flags |= LOGICALREP_IS_COMMIT;
/* send the flags field (unused for now) */
pq_sendbyte(out, flags);
@@ -86,21 +87,106 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
}
/*
- * Read transaction COMMIT from the stream.
+ * Write ABORT to the output stream.
+ */
+void
+logicalrep_write_abort(StringInfo out, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ uint8 flags = 0;
+
+ pq_sendbyte(out, 'C'); /* sending ABORT flag below */
+
+ flags |= LOGICALREP_IS_ABORT;
+ /* send the flags field */
+ pq_sendbyte(out, flags);
+
+ /* send fields */
+ pq_sendint64(out, abort_lsn);
+ pq_sendint64(out, txn->end_lsn);
+ pq_sendint64(out, txn->commit_time);
+}
+
+/*
+ * Read transaction COMMIT|ABORT from the stream.
*/
void
-logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
+logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data,
+ uint8 *flags)
{
- /* read flags (unused for now) */
- uint8 flags = pq_getmsgbyte(in);
+ /* read flags */
+ uint8 commit_flags = pq_getmsgbyte(in);
- if (flags != 0)
- elog(ERROR, "unrecognized flags %u in commit message", flags);
+ if (!(commit_flags & LOGICALREP_COMMIT_MASK))
+ elog(ERROR, "unrecognized flags %u in commit|abort message",
+ commit_flags);
/* read fields */
commit_data->commit_lsn = pq_getmsgint64(in);
commit_data->end_lsn = pq_getmsgint64(in);
commit_data->committime = pq_getmsgint64(in);
+
+ /* set gid to empty */
+ commit_data->gid[0] = '\0';
+
+ *flags = commit_flags;
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ uint8 flags = 0;
+
+ pq_sendbyte(out, 'P'); /* sending PREPARE protocol */
+
+ if (txn->txn_flags & TXN_COMMIT_PREPARED)
+ flags |= LOGICALREP_IS_COMMIT_PREPARED;
+ else if (txn->txn_flags & TXN_ROLLBACK_PREPARED)
+ flags |= LOGICALREP_IS_ROLLBACK_PREPARED;
+ else if (txn->txn_flags & TXN_PREPARE)
+ flags |= LOGICALREP_IS_PREPARE;
+
+ if (flags == 0)
+ elog(ERROR, "unrecognized flags %u in [commit|rollback] prepare message", flags);
+
+ /* send the flags field */
+ pq_sendbyte(out, flags);
+
+ /* send fields */
+ pq_sendint64(out, prepare_lsn);
+ pq_sendint64(out, txn->end_lsn);
+ pq_sendint64(out, txn->commit_time);
+
+ /* send gid */
+ pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepCommitData *commit_data, uint8 *flags)
+{
+ /* read flags */
+ uint8 prep_flags = pq_getmsgbyte(in);
+
+ if (!(prep_flags & LOGICALREP_PREPARE_MASK))
+ elog(ERROR, "unrecognized flags %u in prepare message", prep_flags);
+
+ /* read fields */
+ commit_data->commit_lsn = pq_getmsgint64(in);
+ commit_data->end_lsn = pq_getmsgint64(in);
+ commit_data->committime = pq_getmsgint64(in);
+
+ /* read gid */
+ strlcpy(commit_data->gid, pq_getmsgstring(in), sizeof(commit_data->gid));
+
+ /* set flags */
+ *flags = prep_flags;
}
/*
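The readers above accept a flags byte as long as it shares at least one bit with LOGICALREP_COMMIT_MASK or LOGICALREP_PREPARE_MASK, so a byte with several bits set, or with unknown extra bits, also passes. If a stricter check were wanted, one option is to require exactly one known bit. The sketch below shows that idea only. The SKETCH_* bit values are assumptions made up for illustration, since the real LOGICALREP_IS_* definitions are not shown in this part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit assignments, for illustration only. */
#define SKETCH_IS_PREPARE			0x01
#define SKETCH_IS_COMMIT_PREPARED	0x02
#define SKETCH_IS_ROLLBACK_PREPARED	0x04
#define SKETCH_PREPARE_MASK \
	(SKETCH_IS_PREPARE | SKETCH_IS_COMMIT_PREPARED | SKETCH_IS_ROLLBACK_PREPARED)

/* Accept the byte only if it is exactly one of the known flag bits. */
static bool
sketch_prepare_flags_valid(uint8_t flags)
{
	/* reject any bit outside the known mask */
	if (flags & ~SKETCH_PREPARE_MASK)
		return false;

	/* exactly one bit set: nonzero and a power of two */
	return flags != 0 && (flags & (flags - 1)) == 0;
}
```

Whether the stricter form is desirable depends on whether the flags byte is meant to carry combinations in future protocol versions.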
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index c72a611a39..0b45f4f20f 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -623,7 +623,7 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
if (prev_first_lsn != InvalidXLogRecPtr)
Assert(prev_first_lsn < cur_txn->first_lsn);
- Assert(!cur_txn->is_known_as_subxact);
+ Assert(!txn_is_subxact(cur_txn));
prev_first_lsn = cur_txn->first_lsn;
}
#endif
@@ -641,7 +641,7 @@ ReorderBufferGetOldestTXN(ReorderBuffer *rb)
txn = dlist_head_element(ReorderBufferTXN, node, &rb->toplevel_by_lsn);
- Assert(!txn->is_known_as_subxact);
+ Assert(!txn_is_subxact(txn));
Assert(txn->first_lsn != InvalidXLogRecPtr);
return txn;
}
@@ -675,9 +675,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
dlist_push_tail(&txn->subtxns, &subtxn->node);
txn->nsubtxns++;
}
- else if (!subtxn->is_known_as_subxact)
+ else if (!txn_is_subxact(subtxn))
{
- subtxn->is_known_as_subxact = true;
+ subtxn->txn_flags |= TXN_IS_SUBXACT;
Assert(subtxn->nsubtxns == 0);
/* remove from lsn order list of top-level transactions */
@@ -738,9 +738,9 @@ ReorderBufferCommitChild(ReorderBuffer *rb, TransactionId xid,
subtxn->final_lsn = commit_lsn;
subtxn->end_lsn = end_lsn;
- if (!subtxn->is_known_as_subxact)
+ if (!txn_is_subxact(subtxn))
{
- subtxn->is_known_as_subxact = true;
+ subtxn->txn_flags |= TXN_IS_SUBXACT;
Assert(subtxn->nsubtxns == 0);
/* remove from lsn order list of top-level transactions */
@@ -849,7 +849,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
ReorderBufferChange *cur_change;
- if (txn->serialized)
+ if (txn_is_serialized(txn))
{
/* serialize remaining changes */
ReorderBufferSerializeTXN(rb, txn);
@@ -878,7 +878,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
ReorderBufferChange *cur_change;
- if (cur_txn->serialized)
+ if (txn_is_serialized(cur_txn))
{
/* serialize remaining changes */
ReorderBufferSerializeTXN(rb, cur_txn);
@@ -1044,7 +1044,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
* they originally were happening inside another subtxn, so we won't
* ever recurse more than one level deep here.
*/
- Assert(subtxn->is_known_as_subxact);
+ Assert(txn_is_subxact(subtxn));
Assert(subtxn->nsubtxns == 0);
ReorderBufferCleanupTXN(rb, subtxn);
@@ -1083,7 +1083,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
/*
* Remove TXN from its containing list.
*
- * Note: if txn->is_known_as_subxact, we are deleting the TXN from its
+ * Note: if txn_is_subxact(), we are deleting the TXN from its
* parent's list of known subxacts; this leaves the parent's nsubxacts
* count too high, but we don't care. Otherwise, we are deleting the TXN
* from the LSN-ordered list of toplevel TXNs.
@@ -1098,7 +1098,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
Assert(found);
/* remove entries spilled to disk */
- if (txn->serialized)
+ if (txn_is_serialized(txn))
ReorderBufferRestoreCleanup(rb, txn);
/* deallocate */
@@ -1115,7 +1115,7 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
dlist_iter iter;
HASHCTL hash_ctl;
- if (!txn->has_catalog_changes || dlist_is_empty(&txn->tuplecids))
+ if (!txn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
return;
memset(&hash_ctl, 0, sizeof(hash_ctl));
@@ -1264,25 +1264,18 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
* the top and subtransactions (using a k-way merge) and replay the changes in
* lsn order.
*/
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+ ReorderBuffer *rb, TransactionId xid,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
TimestampTz commit_time,
RepOriginId origin_id, XLogRecPtr origin_lsn)
{
- ReorderBufferTXN *txn;
volatile Snapshot snapshot_now;
volatile CommandId command_id = FirstCommandId;
bool using_subtxn;
ReorderBufferIterTXNState *volatile iterstate = NULL;
- txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
- false);
-
- /* unknown transaction, nothing to replay */
- if (txn == NULL)
- return;
-
txn->final_lsn = commit_lsn;
txn->end_lsn = end_lsn;
txn->commit_time = commit_time;
@@ -1326,20 +1319,62 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
{
ReorderBufferChange *change;
ReorderBufferChange *specinsert = NULL;
+ bool change_cleanup = false;
+ bool check_txn_status,
+ apply_started = false;
+ bool is_prepared = txn_prepared(txn);
+
+ /*
+ * check for the xid once to see if it's already
+ * committed. Otherwise we need to consult the
+ * decode_txn filter function to enquire if it's
+ * still ok for us to continue to decode this xid
+ *
+ * This is to handle cases of concurrent abort
+ * happening parallel to the decode activity
+ */
+ check_txn_status = TransactionIdDidCommit(txn->xid)?
+ false : true;
if (using_subtxn)
BeginInternalSubTransaction("replay");
else
StartTransactionCommand();
- rb->begin(rb, txn);
-
iterstate = ReorderBufferIterTXNInit(rb, txn);
while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
{
Relation relation = NULL;
Oid reloid;
+ /*
+ * While decoding 2PC or while streaming uncommitted
+ * transactions, check if this transaction needs to
+ * be still decoded. If the transaction got aborted
+ * or if we were instructed to stop decoding, then
+ * bail out early.
+ */
+ if (check_txn_status && rb->filter_decode_txn(rb, txn))
+ {
+ elog(LOG, "%s decoding of %s (%u)",
+ apply_started? "stopping":"skipping",
+ is_prepared? txn->gid:"",
+ txn->xid);
+ change_cleanup = true;
+ goto change_cleanuptxn;
+ }
+
+ /*
+ * We have decided to apply changes based on the go
+ * ahead from the above decode filter, BEGIN the
+ * transaction on the other side
+ */
+ if (apply_started == false)
+ {
+ rb->begin(rb, txn);
+ apply_started = true;
+ }
+
switch (change->action)
{
case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1375,7 +1410,17 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
relpathperm(change->data.tp.relnode,
MAIN_FORKNUM));
+ /* Lock transaction before catalog access */
+ if (!LogicalLockTransaction(txn))
+ {
+ elog(LOG, "stopping decoding of %s (%u)",
+ is_prepared? txn->gid:"",
+ txn->xid);
+ change_cleanup = true;
+ goto change_cleanuptxn;
+ }
relation = RelationIdGetRelation(reloid);
+ LogicalUnlockTransaction(txn);
if (relation == NULL)
elog(ERROR, "could not open relation with OID %u (for filenode \"%s\")",
@@ -1546,6 +1591,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
}
}
+change_cleanuptxn:
/*
* There's a speculative insertion remaining, just clean in up, it
* can't have been successful, otherwise we'd gotten a confirmation
@@ -1561,8 +1607,24 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
ReorderBufferIterTXNFinish(rb, iterstate);
iterstate = NULL;
- /* call commit callback */
- rb->commit(rb, txn, commit_lsn);
+ if (change_cleanup)
+ {
+ /* call abort if we have sent any changes */
+ if (apply_started)
+ rb->abort(rb, txn, commit_lsn);
+ }
+ else
+ {
+ /* call commit or prepare callback */
+ if (txn_prepared(txn))
+ rb->prepare(rb, txn, commit_lsn);
+ else
+ rb->commit(rb, txn, commit_lsn);
+ }
+
+ /* remove ourselves from the leader's decode group */
+ if (MyProc->decodeGroupLeader)
+ RemoveDecodeGroupMember(MyProc->decodeGroupLeader);
/* this is just a sanity check against bad output plugin behaviour */
if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -1589,7 +1651,13 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
if (snapshot_now->copied)
ReorderBufferFreeSnap(rb, snapshot_now);
- /* remove potential on-disk data, and deallocate */
+ /*
+ * remove potential on-disk data, and deallocate.
+ *
+ * We remove it even for prepared transactions.
+ * This is because the COMMIT PREPARED needs
+ * no data post the successful PREPARE
+ */
ReorderBufferCleanupTXN(rb, txn);
}
PG_CATCH();
@@ -1623,6 +1691,136 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
PG_END_TRY();
}
+/*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+ return rb->filter_prepare(rb, txn, xid, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferCommitInternal()
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /* unknown transaction, nothing to replay */
+ if (txn == NULL)
+ return;
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a twophase transaction. It calls ReorderBufferCommitInternal()
+ * since all prepared transactions need to be decoded at PREPARE time.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /* unknown transaction, nothing to replay */
+ if (txn == NULL)
+ return;
+
+ txn->txn_flags |= TXN_PREPARE;
+ strcpy(txn->gid, gid);
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Check whether this transaction was sent as prepared to subscribers.
+ * Called while handling commit|abort prepared.
+ */
+bool
+ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+ const char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /*
+ * Always call the prepare filter. It's the job of the prepare
+ * filter to give us the *same* response for a given xid
+ * across multiple calls (including ones on restart)
+ */
+ return !(rb->filter_prepare(rb, txn, xid, gid));
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ABORT PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit)
+{
+ ReorderBufferTXN *txn;
+
+ /*
+ * The transaction may or may not exist (during restarts for
+ * example). Either way, a 2PC transaction holds no buffered changes
+ * at this point, so allow it to be created below.
+ */
+ txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+ true);
+
+ txn->final_lsn = commit_lsn;
+ txn->end_lsn = end_lsn;
+ txn->commit_time = commit_time;
+ txn->origin_id = origin_id;
+ txn->origin_lsn = origin_lsn;
+ strcpy(txn->gid, gid);
+
+ if (is_commit)
+ {
+ txn->txn_flags |= TXN_COMMIT_PREPARED;
+ rb->commit_prepared(rb, txn, commit_lsn);
+ }
+ else
+ {
+ txn->txn_flags |= TXN_ROLLBACK_PREPARED;
+ rb->abort_prepared(rb, txn, commit_lsn);
+ }
+
+ /* cleanup: make sure there's no cache pollution */
+ ReorderBufferExecuteInvalidations(rb, txn);
+ ReorderBufferCleanupTXN(rb, txn);
+}
+
/*
* Abort a transaction that possibly has previous changes. Needs to be first
* called for subtransactions and then for the toplevel xid.
@@ -1688,7 +1886,7 @@ ReorderBufferAbortOld(ReorderBuffer *rb, TransactionId oldestRunningXid)
* final_lsn to that of their last change; this causes
* ReorderBufferRestoreCleanup to do the right thing.
*/
- if (txn->serialized && txn->final_lsn == 0)
+ if (txn_is_serialized(txn) && txn->final_lsn == 0)
{
ReorderBufferChange *last =
dlist_tail_element(ReorderBufferChange, node, &txn->changes);
@@ -1934,7 +2132,7 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
- txn->has_catalog_changes = true;
+ txn->txn_flags |= TXN_HAS_CATALOG_CHANGES;
}
/*
@@ -1951,7 +2149,7 @@ ReorderBufferXidHasCatalogChanges(ReorderBuffer *rb, TransactionId xid)
if (txn == NULL)
return false;
- return txn->has_catalog_changes;
+ return txn_has_catalog_changes(txn);
}
/*
@@ -2095,7 +2293,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
Assert(spilled == txn->nentries_mem);
Assert(dlist_is_empty(&txn->changes));
txn->nentries_mem = 0;
- txn->serialized = true;
+ txn->txn_flags |= TXN_SERIALIZED;
if (fd != -1)
CloseTransientFile(fd);
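Review note on the flag refactoring in the hunks above: the separate boolean members of ReorderBufferTXN become bits in a single txn_flags word, tested through accessor helpers. The definitions live in reorderbuffer.h, which is not part of this excerpt, so the following is only a sketch. The bit values, the TXN_IS_SUBXACT name and the DemoTXN struct are assumptions for illustration; only txn_flags, TXN_HAS_CATALOG_CHANGES, TXN_SERIALIZED, TXN_PREPARE and the accessor names appear in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit assignments; the real ones are in reorderbuffer.h. */
#define TXN_HAS_CATALOG_CHANGES 0x0001
#define TXN_IS_SUBXACT          0x0002  /* assumed name, not shown in the patch excerpt */
#define TXN_SERIALIZED          0x0004
#define TXN_PREPARE             0x0008

/* Cut-down stand-in for ReorderBufferTXN, holding only the flag word. */
typedef struct DemoTXN
{
    uint32_t txn_flags;
} DemoTXN;

/* Accessors mirroring the helper names used in the diff. */
static inline bool
txn_has_catalog_changes(const DemoTXN *txn)
{
    return (txn->txn_flags & TXN_HAS_CATALOG_CHANGES) != 0;
}

static inline bool
txn_is_subxact(const DemoTXN *txn)
{
    return (txn->txn_flags & TXN_IS_SUBXACT) != 0;
}

static inline bool
txn_is_serialized(const DemoTXN *txn)
{
    return (txn->txn_flags & TXN_SERIALIZED) != 0;
}

static inline bool
txn_prepared(const DemoTXN *txn)
{
    return (txn->txn_flags & TXN_PREPARE) != 0;
}
```

Setting a state is then a plain OR, as in `txn->txn_flags |= TXN_SERIALIZED;`, and several states can be carried at once without adding struct members.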
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 83c69092ae..15048378d1 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -452,8 +452,9 @@ static void
apply_handle_commit(StringInfo s)
{
LogicalRepCommitData commit_data;
+ uint8 flags = 0;
- logicalrep_read_commit(s, &commit_data);
+ logicalrep_read_commit(s, &commit_data, &flags);
Assert(commit_data.commit_lsn == remote_final_lsn);
@@ -467,7 +468,11 @@ apply_handle_commit(StringInfo s)
replorigin_session_origin_lsn = commit_data.end_lsn;
replorigin_session_origin_timestamp = commit_data.committime;
- CommitTransactionCommand();
+ if (flags & LOGICALREP_IS_COMMIT)
+ CommitTransactionCommand();
+ else if (flags & LOGICALREP_IS_ABORT)
+ AbortCurrentTransaction();
+
pgstat_report_stat(false);
store_flush_position(commit_data.end_lsn);
@@ -487,6 +492,120 @@ apply_handle_commit(StringInfo s)
pgstat_report_activity(STATE_IDLE, NULL);
}
+static void
+apply_handle_prepare_txn(LogicalRepCommitData *commit_data)
+{
+ Assert(commit_data->commit_lsn == remote_final_lsn);
+ /* The synchronization worker runs in single transaction. */
+ if (IsTransactionState() && !am_tablesync_worker())
+ {
+ /* End the earlier transaction and start a new one */
+ BeginTransactionBlock();
+ CommitTransactionCommand();
+ StartTransactionCommand();
+ /*
+ * Update origin state so we can restart streaming from correct
+ * position in case of crash.
+ */
+ replorigin_session_origin_lsn = commit_data->end_lsn;
+ replorigin_session_origin_timestamp = commit_data->committime;
+
+ PrepareTransactionBlock(commit_data->gid);
+ CommitTransactionCommand();
+ pgstat_report_stat(false);
+
+ store_flush_position(commit_data->end_lsn);
+ }
+ else
+ {
+ /* Process any invalidation messages that might have accumulated. */
+ AcceptInvalidationMessages();
+ maybe_reread_subscription();
+ }
+
+ in_remote_transaction = false;
+
+ /* Process any tables that are being synchronized in parallel. */
+ process_syncing_tables(commit_data->end_lsn);
+
+ pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+static void
+apply_handle_commit_prepared_txn(LogicalRepCommitData *commit_data)
+{
+ /* there is no transaction when COMMIT PREPARED is called */
+ ensure_transaction();
+
+ /*
+ * Update origin state so we can restart streaming from correct
+ * position in case of crash.
+ */
+ replorigin_session_origin_lsn = commit_data->end_lsn;
+ replorigin_session_origin_timestamp = commit_data->committime;
+
+ FinishPreparedTransaction(commit_data->gid, true, false);
+ CommitTransactionCommand();
+ pgstat_report_stat(false);
+
+ store_flush_position(commit_data->end_lsn);
+ in_remote_transaction = false;
+
+ /* Process any tables that are being synchronized in parallel. */
+ process_syncing_tables(commit_data->end_lsn);
+
+ pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+static void
+apply_handle_rollback_prepared_txn(LogicalRepCommitData *commit_data)
+{
+ /* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+ ensure_transaction();
+
+ /*
+ * Update origin state so we can restart streaming from correct
+ * position in case of crash.
+ */
+ replorigin_session_origin_lsn = commit_data->end_lsn;
+ replorigin_session_origin_timestamp = commit_data->committime;
+
+ FinishPreparedTransaction(commit_data->gid, false, true);
+ CommitTransactionCommand();
+ pgstat_report_stat(false);
+
+ store_flush_position(commit_data->end_lsn);
+ in_remote_transaction = false;
+
+ /* Process any tables that are being synchronized in parallel. */
+ process_syncing_tables(commit_data->end_lsn);
+
+ pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+ LogicalRepCommitData commit_data;
+ uint8 flags = 0;
+
+ logicalrep_read_prepare(s, &commit_data, &flags);
+
+ if (flags & LOGICALREP_IS_PREPARE)
+ apply_handle_prepare_txn(&commit_data);
+ else if (flags & LOGICALREP_IS_COMMIT_PREPARED)
+ apply_handle_commit_prepared_txn(&commit_data);
+ else if (flags & LOGICALREP_IS_ROLLBACK_PREPARED)
+ apply_handle_rollback_prepared_txn(&commit_data);
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg("wrong [commit|rollback] prepare message")));
+}
+
/*
* Handle ORIGIN message.
*
@@ -884,10 +1003,14 @@ apply_dispatch(StringInfo s)
case 'B':
apply_handle_begin(s);
break;
- /* COMMIT */
+ /* COMMIT|ABORT */
case 'C':
apply_handle_commit(s);
break;
+ /* PREPARE, COMMIT PREPARED, ROLLBACK PREPARED */
+ case 'P':
+ apply_handle_prepare(s);
+ break;
/* INSERT */
case 'I':
apply_handle_insert(s);
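Review note on apply_handle_prepare(): the if/else-if chain accepts a 'P' message that has more than one flag bit set and acts on the first match. A stricter classification could look like the sketch below. The flag values are the ones this patch adds to logicalproto.h; DemoPrepareAction and demo_classify_prepare are hypothetical names, and the exactly-one-bit rule is a suggestion, not what the patch does.

```c
#include <assert.h>
#include <stdint.h>

/* Values as defined by this patch in logicalproto.h. */
#define LOGICALREP_IS_PREPARE            0x04
#define LOGICALREP_IS_COMMIT_PREPARED    0x08
#define LOGICALREP_IS_ROLLBACK_PREPARED  0x10
#define LOGICALREP_PREPARE_MASK \
    (LOGICALREP_IS_PREPARE | LOGICALREP_IS_COMMIT_PREPARED | \
     LOGICALREP_IS_ROLLBACK_PREPARED)

/* Hypothetical result type for illustration. */
typedef enum DemoPrepareAction
{
    DEMO_PREPARE,
    DEMO_COMMIT_PREPARED,
    DEMO_ROLLBACK_PREPARED,
    DEMO_INVALID
} DemoPrepareAction;

/* Classify the flag byte of a 'P' message, requiring exactly one known bit. */
static DemoPrepareAction
demo_classify_prepare(uint8_t flags)
{
    /* Any bit outside the prepare family is a protocol violation. */
    if (flags & ~LOGICALREP_PREPARE_MASK)
        return DEMO_INVALID;

    switch (flags)
    {
        case LOGICALREP_IS_PREPARE:
            return DEMO_PREPARE;
        case LOGICALREP_IS_COMMIT_PREPARED:
            return DEMO_COMMIT_PREPARED;
        case LOGICALREP_IS_ROLLBACK_PREPARED:
            return DEMO_ROLLBACK_PREPARED;
        default:
            /* zero bits or a combination of bits */
            return DEMO_INVALID;
    }
}
```

In the worker, the DEMO_INVALID case would map onto the existing ERRCODE_PROTOCOL_VIOLATION report.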
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 40a1ef3c1d..55bdee9abe 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -37,11 +37,23 @@ static void pgoutput_begin_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn);
static void pgoutput_commit_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_abort_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr abort_lsn);
static void pgoutput_change(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, Relation rel,
ReorderBufferChange *change);
static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
RepOriginId origin_id);
+static bool pgoutput_filter_prepare(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, TransactionId xid, const char *gid);
+static bool pgoutput_decode_txn_filter(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_abort_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
static bool publications_valid;
@@ -79,7 +91,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
cb->begin_cb = pgoutput_begin_txn;
cb->change_cb = pgoutput_change;
cb->commit_cb = pgoutput_commit_txn;
+ cb->abort_cb = pgoutput_abort_txn;
+
+ cb->filter_prepare_cb = pgoutput_filter_prepare;
+ cb->prepare_cb = pgoutput_prepare_txn;
+ cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+ cb->abort_prepared_cb = pgoutput_abort_prepared_txn;
+
cb->filter_by_origin_cb = pgoutput_origin_filter;
+ cb->filter_decode_txn_cb = pgoutput_decode_txn_filter;
cb->shutdown_cb = pgoutput_shutdown;
}
@@ -251,6 +271,61 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
OutputPluginWrite(ctx, true);
}
+/*
+ * ABORT callback
+ */
+static void
+pgoutput_abort_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ OutputPluginUpdateProgress(ctx);
+
+ OutputPluginPrepareWrite(ctx, true);
+ logicalrep_write_abort(ctx->out, txn, abort_lsn);
+ OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ OutputPluginUpdateProgress(ctx);
+
+ OutputPluginPrepareWrite(ctx, true);
+ logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+ OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ OutputPluginUpdateProgress(ctx);
+
+ OutputPluginPrepareWrite(ctx, true);
+ logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+ OutputPluginWrite(ctx, true);
+}
+/*
+ * ABORT PREPARED callback
+ */
+static void
+pgoutput_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ OutputPluginUpdateProgress(ctx);
+
+ OutputPluginPrepareWrite(ctx, true);
+ logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+ OutputPluginWrite(ctx, true);
+}
+
/*
* Sends the decoded DML over wire.
*/
@@ -361,6 +436,18 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
MemoryContextReset(data->context);
}
+/*
+ * Filter out unnecessary two-phase transactions.
+ *
+ * Currently, we forward all two-phase transactions
+ */
+static bool
+pgoutput_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid)
+{
+ return false;
+}
+
/*
* Currently we always forward.
*/
@@ -371,6 +458,37 @@ pgoutput_origin_filter(LogicalDecodingContext *ctx,
return false;
}
+/*
+ * Check if we should continue to decode this transaction.
+ *
+ * If it has aborted in the meanwhile, then there's no sense
+ * in decoding and sending the rest of the changes, we might
+ * as well ask the subscribers to abort immediately.
+ *
+ * This should be called if we are streaming a transaction
+ * before it's committed or if we are decoding a 2PC
+ * transaction. Otherwise we always decode committed
+ * transactions
+ *
+ * Additional checks can be added here, as needed
+ */
+static bool
+pgoutput_decode_txn_filter(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn)
+{
+ /*
+ * Due to caching, repeated TransactionIdDidAbort calls
+ * shouldn't be that expensive
+ */
+ if (txn != NULL &&
+ TransactionIdIsValid(txn->xid) &&
+ TransactionIdDidAbort(txn->xid))
+ return true;
+
+ /* if txn is NULL, filter it out */
+ return (txn != NULL)? false:true;
+}
+
/*
* Shutdown the output plugin.
*
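Review note on how the decode-time filter interacts with the replay loop in reorderbuffer.c: begin is now deferred until the first change passes the filter, and abort is sent only if begin was already sent. The control flow can be modelled in isolation as below. This is a simplified sketch, not the patch code: DemoReplay, demo_emit and demo_replay are hypothetical, the filter is reduced to "stop at change index stop_at", and callbacks are recorded as letters in a trace string.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Trace letters: B=begin, c=change, A=abort, C=commit, P=prepare. */
typedef struct DemoReplay
{
    int  stop_at;   /* change index at which the filter says stop; -1 = never */
    char log[64];   /* callback trace */
} DemoReplay;

static void
demo_emit(DemoReplay *r, char ev)
{
    size_t n = strlen(r->log);

    if (n + 1 < sizeof(r->log))
    {
        r->log[n] = ev;
        r->log[n + 1] = '\0';
    }
}

/* Model of the loop: filter first, lazy begin, then the change itself. */
static void
demo_replay(DemoReplay *r, int nchanges, bool check_status, bool prepared)
{
    bool apply_started = false;
    bool cleanup = false;

    for (int i = 0; i < nchanges; i++)
    {
        if (check_status && r->stop_at == i)
        {
            cleanup = true;     /* corresponds to goto change_cleanuptxn */
            break;
        }
        if (!apply_started)
        {
            demo_emit(r, 'B');
            apply_started = true;
        }
        demo_emit(r, 'c');
    }

    if (cleanup)
    {
        if (apply_started)
            demo_emit(r, 'A');  /* abort only if something was sent */
    }
    else
        demo_emit(r, prepared ? 'P' : 'C');
}
```

One thing the model makes visible: with zero changes the loop body never runs, so commit or prepare is emitted without a preceding begin. That mirrors the hunk as posted, and it may be worth checking whether this is intended.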
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 1a00011adc..f6bb4e509f 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -2432,6 +2432,45 @@ BackendXidGetPid(TransactionId xid)
return result;
}
+/*
+ * BackendXidGetProc -- get a backend's PGPROC given its XID
+ *
+ * Note that it is up to the caller to be sure that the question
+ * remains meaningful for long enough for the answer to be used ...
+ *
+ * Only main transaction Ids are considered.
+ *
+ */
+PGPROC *
+BackendXidGetProc(TransactionId xid)
+{
+ PGPROC *result = NULL;
+ ProcArrayStruct *arrayP = procArray;
+ int index;
+
+ if (xid == InvalidTransactionId) /* never match invalid xid */
+ return NULL;
+
+ LWLockAcquire(ProcArrayLock, LW_SHARED);
+
+ for (index = 0; index < arrayP->numProcs; index++)
+ {
+ int pgprocno = arrayP->pgprocnos[index];
+ PGPROC *proc = &allProcs[pgprocno];
+ volatile PGXACT *pgxact = &allPgXact[pgprocno];
+
+ if (pgxact->xid == xid)
+ {
+ result = proc;
+ break;
+ }
+ }
+
+ LWLockRelease(ProcArrayLock);
+
+ return result;
+}
+
/*
* IsBackendPid -- is a given pid a running backend
*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 6f30e082b2..26d35c7807 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -267,6 +267,11 @@ InitProcGlobal(void)
/* Initialize lockGroupMembers list. */
dlist_init(&procs[i].lockGroupMembers);
+
+ /* Initialize decodeGroupMembers list. */
+ dlist_init(&procs[i].decodeGroupMembers);
+ procs[i].decodeAbortPending = false;
+ procs[i].decodeLocked = false;
}
/*
@@ -406,6 +411,12 @@ InitProcess(void)
Assert(MyProc->lockGroupLeader == NULL);
Assert(dlist_is_empty(&MyProc->lockGroupMembers));
+ /* Check that group decode fields are in a proper initial state. */
+ Assert(MyProc->decodeGroupLeader == NULL);
+ Assert(dlist_is_empty(&MyProc->decodeGroupMembers));
+ MyProc->decodeAbortPending = false;
+ MyProc->decodeLocked = false;
+
/* Initialize wait event information. */
MyProc->wait_event_info = 0;
@@ -581,6 +592,12 @@ InitAuxiliaryProcess(void)
Assert(MyProc->lockGroupLeader == NULL);
Assert(dlist_is_empty(&MyProc->lockGroupMembers));
+ /* Check that group decode fields are in a proper initial state. */
+ Assert(MyProc->decodeGroupLeader == NULL);
+ Assert(dlist_is_empty(&MyProc->decodeGroupMembers));
+ MyProc->decodeAbortPending = false;
+ MyProc->decodeLocked = false;
+
/*
* We might be reusing a semaphore that belonged to a failed process. So
* be careful and reinitialize its value here. (This is not strictly
@@ -1887,3 +1904,268 @@ BecomeLockGroupMember(PGPROC *leader, int pid)
return ok;
}
+
+/*
+ * BecomeDecodeGroupLeader - designate process as decode group leader
+ *
+ * Once this function has returned, other processes can join the decode group
+ * by calling BecomeDecodeGroupMember.
+ */
+PGPROC *
+BecomeDecodeGroupLeader(TransactionId xid, bool is_prepared)
+{
+ PGPROC *proc = NULL;
+ int pid;
+ LWLock *leader_lwlock;
+
+ Assert(xid != InvalidTransactionId);
+
+
+ proc = BackendXidGetProc(xid);
+ if (proc)
+ pid = proc->pid;
+
+ /*
+ * This proc will become decodeGroupLeader if it's
+ * not already
+ */
+ if (proc && proc->decodeGroupLeader != proc)
+ {
+ volatile PGXACT *pgxact;
+ /* Create single-member group, containing this proc. */
+ leader_lwlock = LockHashPartitionLockByProc(proc);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ /* recheck we are still the same */
+ pgxact = &ProcGlobal->allPgXact[proc->pgprocno];
+ if (proc->pid == pid && pgxact->xid == xid)
+ {
+ if (is_prepared)
+ Assert(pid == 0);
+ /* recheck if someone else did not already assign us */
+ if (proc->decodeGroupLeader != proc)
+ {
+ /* We had better not be a follower. */
+ Assert(proc->decodeGroupLeader == NULL);
+ proc->decodeGroupLeader = proc;
+ dlist_push_head(&proc->decodeGroupMembers,
+ &proc->decodeGroupLink);
+ }
+ }
+ else
+ {
+ /* proc entry is gone */
+ proc = NULL;
+ }
+ LWLockRelease(leader_lwlock);
+ }
+
+ elog(DEBUG1, "became group leader (%p)", proc);
+ return proc;
+}
+
+/*
+ * BecomeDecodeGroupMember - designate process as decode group member
+ *
+ * This is pretty straightforward except for the possibility that the leader
+ * whose group we're trying to join might exit before we manage to do so;
+ * and the PGPROC might get recycled for an unrelated process. To avoid
+ * that, we require the caller to pass the PID of the intended PGPROC as
+ * an interlock. Returns true if we successfully join the intended decode
+ * group, and false if not.
+ */
+bool
+BecomeDecodeGroupMember(PGPROC *leader, int pid, bool is_prepared)
+{
+ LWLock *leader_lwlock;
+ bool ok = false;
+
+ /* Group leader can't become member of group */
+ Assert(MyProc != leader);
+
+ /* Can't already be a member of a group */
+ Assert(MyProc->decodeGroupLeader == NULL);
+
+ /* PID must be valid OR this is a prepared transaction. */
+ Assert(pid != 0 || is_prepared);
+
+ /*
+ * Get lock protecting the group fields. Note LockHashPartitionLockByProc
+ * accesses leader->pgprocno in a PGPROC that might be free. This is safe
+ * because all PGPROCs' pgprocno fields are set during shared memory
+ * initialization and never change thereafter; so we will acquire the
+ * correct lock even if the leader PGPROC is in process of being recycled.
+ */
+ leader_lwlock = LockHashPartitionLockByProc(leader);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+
+ /* Is this the leader we're looking for? */
+ if (leader->pid == pid && leader->decodeGroupLeader == leader)
+ {
+ if (is_prepared)
+ Assert(pid == 0);
+ /* is the leader going away? */
+ if (leader->decodeAbortPending)
+ ok = false;
+ else
+ {
+ /* OK, join the group */
+ ok = true;
+ MyProc->decodeGroupLeader = leader;
+ dlist_push_tail(&leader->decodeGroupMembers, &MyProc->decodeGroupLink);
+ }
+ }
+ else
+ MyProc->decodeGroupLeader = NULL;
+ LWLockRelease(leader_lwlock);
+
+ elog(DEBUG1, "became group member (%p) to (%p)", MyProc, leader);
+ return ok;
+}
+
+/*
+ * Remove the current process (MyProc) from the decode group of the
+ * given leader.
+ * Acquires the leader's lock partition LWLock for the duration.
+ */
+void
+RemoveDecodeGroupMember(PGPROC *leader)
+{
+ LWLock *leader_lwlock;
+
+ leader_lwlock = LockHashPartitionLockByProc(leader);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ RemoveDecodeGroupMemberLocked(leader);
+ LWLockRelease(leader_lwlock);
+
+ return;
+}
+
+/*
+ * Remove a decodeGroupMember from the decodeGroupMembership of
+ * decodeGroupLeader
+ * Assumes that the caller is holding appropriate lock
+ */
+void
+RemoveDecodeGroupMemberLocked(PGPROC *leader)
+{
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ dlist_delete(&MyProc->decodeGroupLink);
+ /* leader links to itself, so never empty */
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ MyProc->decodeGroupLeader = NULL;
+ elog(DEBUG1, "removed group member (%p) from (%p)", MyProc, leader);
+
+ return;
+}
+
+/*
+ * Indicate to all decodeGroupMembers that this transaction is
+ * going away.
+ *
+ * Wait for all decodeGroupMembers to ack back before returning
+ * from here but only in case of aborts.
+ *
+ * This function should be called *after* the proc has been
+ * removed from the procArray.
+ *
+ * If the transaction is committing, it's ok for the
+ * decoders to continue merrily. When a decoder tries to lock
+ * this proc, it won't find it; it will then check the
+ * transaction status and cache the commit status for future
+ * calls in LogicalLockTransaction.
+ */
+void
+LogicalDecodeRemoveTransaction(PGPROC *leader, bool isCommit)
+{
+ LWLock *leader_lwlock;
+ dlist_mutable_iter change_i;
+ dlist_iter iter;
+ PGPROC *proc;
+ bool do_wait;
+
+ leader_lwlock = LockHashPartitionLockByProc(leader);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ /* mark ourself as aborting */
+ if (!isCommit)
+ leader->decodeAbortPending = true;
+
+ if (leader->decodeGroupLeader == NULL)
+ {
+ Assert(dlist_is_empty(&leader->decodeGroupMembers));
+ LWLockRelease(leader_lwlock);
+ return;
+ }
+
+recheck:
+ do_wait = false;
+ Assert(leader->decodeGroupLeader == leader);
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ if (!isCommit)
+ {
+ dlist_foreach(iter, &leader->decodeGroupMembers)
+ {
+ proc = dlist_container(PGPROC, decodeGroupLink, iter.cur);
+ /* mark the proc to indicate abort is pending */
+ if (proc == leader)
+ continue;
+ if (!proc->decodeAbortPending)
+ {
+ proc->decodeAbortPending = true;
+ elog(DEBUG1, "marking group member (%p) from (%p) for abort",
+ proc, leader);
+ }
+ /* if the proc is currently locked, wait */
+ if (proc->decodeLocked)
+ do_wait = true;
+ }
+
+ if (do_wait)
+ {
+ int rc;
+ LWLockRelease(leader_lwlock);
+
+ elog(LOG, "waiting for backends to abort decoding");
+ /*
+ * Wait on our latch to allow decodeGroupMembers to
+ * go away soon
+ */
+ rc = WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+ 100L,
+ WAIT_EVENT_PG_SLEEP);
+ ResetLatch(MyLatch);
+
+ /* emergency bailout if postmaster has died */
+ if (rc & WL_POSTMASTER_DEATH)
+ proc_exit(1);
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Recheck decodeGroupMembers */
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ goto recheck;
+ }
+ }
+
+ /*
+ * All backends exited cleanly in case of aborts above,
+ * remove decodeGroupMembers now for both commit/abort cases
+ */
+ Assert(leader->decodeGroupLeader == leader);
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ dlist_foreach_modify(change_i, &leader->decodeGroupMembers)
+ {
+ proc = dlist_container(PGPROC, decodeGroupLink, change_i.cur);
+ Assert(!proc->decodeLocked);
+ dlist_delete(&proc->decodeGroupLink);
+ elog(DEBUG1, "deleting group member (%p) from (%p)",
+ proc, leader);
+ proc->decodeGroupLeader = NULL;
+ }
+ Assert(dlist_is_empty(&leader->decodeGroupMembers));
+ leader->decodeGroupLeader = NULL;
+ leader->decodeAbortPending = false;
+ LWLockRelease(leader_lwlock);
+
+ return;
+}
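Review note on the abort handshake in LogicalDecodeRemoveTransaction(): the leader flags every member with decodeAbortPending and sleeps for as long as any member still has decodeLocked set. The member side (LogicalLockTransaction) is not in this excerpt, so the sketch below guesses at its contract, namely that it refuses the lock once an abort is pending. It is a single-threaded model with hypothetical names (DemoMember, demo_mark_abort, demo_lock, demo_unlock) and no real locking.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the two per-PGPROC decode fields added by the patch. */
typedef struct DemoMember
{
    bool abort_pending;     /* models decodeAbortPending */
    bool locked;            /* models decodeLocked */
} DemoMember;

/*
 * One pass of the leader's recheck loop: flag all members and report
 * whether the leader still has to wait for someone to unlock.
 */
static bool
demo_mark_abort(DemoMember *members, int n)
{
    bool do_wait = false;

    for (int i = 0; i < n; i++)
    {
        members[i].abort_pending = true;
        if (members[i].locked)
            do_wait = true;
    }
    return do_wait;
}

/* Assumed member-side contract: no new lock once an abort is pending. */
static bool
demo_lock(DemoMember *m)
{
    if (m->abort_pending)
        return false;
    m->locked = true;
    return true;
}

static void
demo_unlock(DemoMember *m)
{
    m->locked = false;
}
```

Under that assumption the wait is bounded by the time the slowest member spends between lock and unlock, which in the replay loop is a single RelationIdGetRelation() call.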
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 3abe7d6155..8a6e0a1c2d 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -455,13 +455,13 @@ standard_ProcessUtility(PlannedStmt *pstmt,
case TRANS_STMT_COMMIT_PREPARED:
PreventTransactionChain(isTopLevel, "COMMIT PREPARED");
PreventCommandDuringRecovery("COMMIT PREPARED");
- FinishPreparedTransaction(stmt->gid, true);
+ FinishPreparedTransaction(stmt->gid, true, false);
break;
case TRANS_STMT_ROLLBACK_PREPARED:
PreventTransactionChain(isTopLevel, "ROLLBACK PREPARED");
PreventCommandDuringRecovery("ROLLBACK PREPARED");
- FinishPreparedTransaction(stmt->gid, false);
+ FinishPreparedTransaction(stmt->gid, false, false);
break;
case TRANS_STMT_ROLLBACK:
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 34d9470811..cbc63a18ad 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -15,6 +15,7 @@
#define TWOPHASE_H
#include "access/xlogdefs.h"
+#include "access/xact.h"
#include "datatype/timestamp.h"
#include "storage/lock.h"
@@ -46,15 +47,18 @@ extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
int *nxids_p);
+extern void ParsePrepareRecord(uint8 info, char *xlrec,
+ xl_xact_parsed_prepare *parsed);
extern void StandbyRecoverPreparedTransactions(void);
extern void RecoverPreparedTransactions(void);
extern void CheckPointTwoPhase(XLogRecPtr redo_horizon);
-extern void FinishPreparedTransaction(const char *gid, bool isCommit);
+extern void FinishPreparedTransaction(const char *gid, bool isCommit,
+ bool missing_ok);
extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
- XLogRecPtr end_lsn);
+ XLogRecPtr end_lsn, RepOriginId origin_id);
extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
extern void restoreTwoPhaseData(void);
#endif /* TWOPHASE_H */
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 6445bbc46f..d2e104423d 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -21,6 +21,13 @@
#include "storage/sinval.h"
#include "utils/datetime.h"
+/*
+ * Maximum size of Global Transaction ID (including '\0').
+ *
+ * Note that the max value of GIDSIZE must fit in the uint16 gidlen,
+ * specified in TwoPhaseFileHeader.
+ */
+#define GIDSIZE 200
/*
* Xact isolation levels
@@ -156,6 +163,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
#define XACT_XINFO_HAS_TWOPHASE (1U << 4)
#define XACT_XINFO_HAS_ORIGIN (1U << 5)
#define XACT_XINFO_HAS_AE_LOCKS (1U << 6)
+#define XACT_XINFO_HAS_GID (1U << 7)
/*
* Also stored in xinfo, these indicating a variety of additional actions that
@@ -302,13 +310,40 @@ typedef struct xl_xact_parsed_commit
SharedInvalidationMessage *msgs;
TransactionId twophase_xid; /* only for 2PC */
+ char twophase_gid[GIDSIZE];
XLogRecPtr origin_lsn;
TimestampTz origin_timestamp;
} xl_xact_parsed_commit;
+typedef struct xl_xact_parsed_prepare
+{
+ Oid dbId; /* MyDatabaseId */
+
+ int nsubxacts;
+ TransactionId *subxacts;
+
+ int ncommitrels;
+ RelFileNode *commitrels;
+
+ int nabortrels;
+ RelFileNode *abortrels;
+
+ int nmsgs;
+ SharedInvalidationMessage *msgs;
+
+ TransactionId twophase_xid;
+ char twophase_gid[GIDSIZE];
+
+ XLogRecPtr origin_lsn;
+ TimestampTz origin_timestamp;
+} xl_xact_parsed_prepare;
+
typedef struct xl_xact_parsed_abort
{
+ Oid dbId;
+ Oid tsId;
+
TimestampTz xact_time;
uint32 xinfo;
@@ -319,6 +354,10 @@ typedef struct xl_xact_parsed_abort
RelFileNode *xnodes;
TransactionId twophase_xid; /* only for 2PC */
+ char twophase_gid[GIDSIZE];
+
+ XLogRecPtr origin_lsn;
+ TimestampTz origin_timestamp;
} xl_xact_parsed_abort;
@@ -386,12 +425,13 @@ extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
int nmsgs, SharedInvalidationMessage *msgs,
bool relcacheInval, bool forceSync,
int xactflags,
- TransactionId twophase_xid);
+ TransactionId twophase_xid, const char *twophase_gid);
extern XLogRecPtr XactLogAbortRecord(TimestampTz abort_time,
int nsubxacts, TransactionId *subxacts,
int nrels, RelFileNode *rels,
- int xactflags, TransactionId twophase_xid);
+ int xactflags, TransactionId twophase_xid,
+ const char *twophase_gid);
extern void xact_redo(XLogReaderState *record);
/* xactdesc.c */
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 619c5f4d73..9dad4c997f 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -89,6 +89,11 @@ typedef struct LogicalDecodingContext
bool prepared_write;
XLogRecPtr write_location;
TransactionId write_xid;
+
+ /*
+ * Capabilities of the output plugin.
+ */
+ bool enable_twophase;
} LogicalDecodingContext;
@@ -117,6 +122,8 @@ extern void LogicalIncreaseXminForSlot(XLogRecPtr lsn, TransactionId xmin);
extern void LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn,
XLogRecPtr restart_lsn);
extern void LogicalConfirmReceivedLocation(XLogRecPtr lsn);
+extern bool LogicalLockTransaction(ReorderBufferTXN *txn);
+extern void LogicalUnlockTransaction(ReorderBufferTXN *txn);
extern bool filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id);
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 0eb21057c5..886025f3aa 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -69,11 +69,20 @@ typedef struct LogicalRepBeginData
TransactionId xid;
} LogicalRepBeginData;
+#define LOGICALREP_IS_COMMIT 0x01
+#define LOGICALREP_IS_ABORT 0x02
+#define LOGICALREP_IS_PREPARE 0x04
+#define LOGICALREP_IS_COMMIT_PREPARED 0x08
+#define LOGICALREP_IS_ROLLBACK_PREPARED 0x10
+#define LOGICALREP_COMMIT_MASK (LOGICALREP_IS_COMMIT | LOGICALREP_IS_ABORT)
+#define LOGICALREP_PREPARE_MASK (LOGICALREP_IS_PREPARE | LOGICALREP_IS_COMMIT_PREPARED | LOGICALREP_IS_ROLLBACK_PREPARED)
typedef struct LogicalRepCommitData
{
+ uint8 flag;
XLogRecPtr commit_lsn;
XLogRecPtr end_lsn;
TimestampTz committime;
+ char gid[GIDSIZE];
} LogicalRepCommitData;
extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
@@ -81,8 +90,14 @@ extern void logicalrep_read_begin(StringInfo in,
LogicalRepBeginData *begin_data);
extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+extern void logicalrep_write_abort(StringInfo out, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
extern void logicalrep_read_commit(StringInfo in,
- LogicalRepCommitData *commit_data);
+ LogicalRepCommitData *commit_data, uint8 *flags);
+extern void logicalrep_read_prepare(StringInfo in,
+ LogicalRepCommitData *commit_data, uint8 *flags);
extern void logicalrep_write_origin(StringInfo out, const char *origin,
XLogRecPtr origin_lsn);
extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 78fd38bb16..61c5019adf 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -67,6 +67,46 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+/*
+ * Called for an implicit ABORT of a transaction.
+ */
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+/*
+ * Called before decoding of a PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to the prepare and
+ * commit_prepared/abort_prepared callbacks, or whether decoding should wait
+ * until COMMIT PREPARED and send it as a regular transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
/*
* Called for the generic logical decoding messages.
*/
@@ -84,6 +124,12 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ctx,
RepOriginId origin_id);
+/*
+ * Filter to check if we should continue to decode this transaction
+ */
+typedef bool (*LogicalDecodeFilterDecodeTxnCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn);
+
/*
* Called to shutdown an output plugin.
*/
@@ -98,8 +144,14 @@ typedef struct OutputPluginCallbacks
LogicalDecodeBeginCB begin_cb;
LogicalDecodeChangeCB change_cb;
LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeAbortCB abort_cb;
LogicalDecodeMessageCB message_cb;
+ LogicalDecodeFilterPrepareCB filter_prepare_cb;
+ LogicalDecodePrepareCB prepare_cb;
+ LogicalDecodeCommitPreparedCB commit_prepared_cb;
+ LogicalDecodeAbortPreparedCB abort_prepared_cb;
LogicalDecodeFilterByOriginCB filter_by_origin_cb;
+ LogicalDecodeFilterDecodeTxnCB filter_decode_txn_cb;
LogicalDecodeShutdownCB shutdown_cb;
} OutputPluginCallbacks;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 0970abca52..a43e941b25 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
#define REORDERBUFFER_H
#include "access/htup_details.h"
+#include "access/twophase.h"
#include "lib/ilist.h"
#include "storage/sinval.h"
#include "utils/hsearch.h"
@@ -137,20 +138,50 @@ typedef struct ReorderBufferChange
dlist_node node;
} ReorderBufferChange;
+/* ReorderBufferTXN flags */
+#define TXN_HAS_CATALOG_CHANGES 0x0001
+#define TXN_IS_SUBXACT 0x0002
+#define TXN_SERIALIZED 0x0004
+#define TXN_PREPARE 0x0008
+#define TXN_COMMIT_PREPARED 0x0010
+#define TXN_ROLLBACK_PREPARED 0x0020
+#define TXN_COMMIT 0x0040
+#define TXN_ROLLBACK 0x0080
+
+/* does the txn have catalog changes */
+#define txn_has_catalog_changes(txn) (txn->txn_flags & TXN_HAS_CATALOG_CHANGES)
+/* is the txn known as a subxact? */
+#define txn_is_subxact(txn) (txn->txn_flags & TXN_IS_SUBXACT)
+/*
+ * Has this transaction been spilled to disk? It's not always possible to
+ * deduce that fact by comparing nentries with nentries_mem, because e.g.
+ * subtransactions of a large transaction might get serialized together
+ * with the parent - if they're restored to memory they'd have
+ * nentries_mem == nentries.
+ */
+#define txn_is_serialized(txn) (txn->txn_flags & TXN_SERIALIZED)
+/* is this txn prepared? */
+#define txn_prepared(txn) (txn->txn_flags & TXN_PREPARE)
+/* was this prepared txn committed in the meanwhile? */
+#define txn_commit_prepared(txn) (txn->txn_flags & TXN_COMMIT_PREPARED)
+/* was this prepared txn aborted in the meanwhile? */
+#define txn_rollback_prepared(txn) (txn->txn_flags & TXN_ROLLBACK_PREPARED)
+/* was this txn committed in the meanwhile? */
+#define txn_commit(txn) (txn->txn_flags & TXN_COMMIT)
+/* was this txn aborted in the meanwhile? */
+#define txn_rollback(txn) (txn->txn_flags & TXN_ROLLBACK)
+
typedef struct ReorderBufferTXN
{
+ int txn_flags;
+
/*
* The transactions transaction id, can be a toplevel or sub xid.
*/
TransactionId xid;
- /* did the TX have catalog changes */
- bool has_catalog_changes;
-
- /*
- * Do we know this is a subxact?
- */
- bool is_known_as_subxact;
+ /* In case of 2PC we need to pass GID to output plugin */
+ char gid[GIDSIZE];
/*
* LSN of the first data carrying, WAL record with knowledge about this
@@ -214,15 +245,6 @@ typedef struct ReorderBufferTXN
*/
uint64 nentries_mem;
- /*
- * Has this transaction been spilled to disk? It's not always possible to
- * deduce that fact by comparing nentries with nentries_mem, because e.g.
- * subtransactions of a large transaction might get serialized together
- * with the parent - if they're restored to memory they'd have
- * nentries_mem == nentries.
- */
- bool serialized;
-
/*
* List of ReorderBufferChange structs, including new Snapshots and new
* CommandIds
@@ -294,6 +316,40 @@ typedef void (*ReorderBufferCommitCB) (
ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+/* abort callback signature */
+typedef void (*ReorderBufferAbortCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+typedef bool (*ReorderBufferFilterDecodeTxnCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn);
+
+typedef bool (*ReorderBufferFilterPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/* abort prepared callback signature */
+typedef void (*ReorderBufferAbortPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
/* message callback signature */
typedef void (*ReorderBufferMessageCB) (
ReorderBuffer *rb,
@@ -329,6 +385,12 @@ struct ReorderBuffer
ReorderBufferBeginCB begin;
ReorderBufferApplyChangeCB apply_change;
ReorderBufferCommitCB commit;
+ ReorderBufferAbortCB abort;
+ ReorderBufferFilterDecodeTxnCB filter_decode_txn;
+ ReorderBufferFilterPrepareCB filter_prepare;
+ ReorderBufferPrepareCB prepare;
+ ReorderBufferCommitPreparedCB commit_prepared;
+ ReorderBufferAbortPreparedCB abort_prepared;
ReorderBufferMessageCB message;
/*
@@ -371,6 +433,11 @@ void ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot
void ReorderBufferCommit(ReorderBuffer *, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit);
void ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
void ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
@@ -394,6 +461,15 @@ void ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
bool ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
bool ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
+bool ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+ const char *gid);
+bool ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+ const char *gid);
+void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid);
ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
void ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 5c19a61dcf..fdfc582874 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -200,6 +200,26 @@ struct PGPROC
PGPROC *lockGroupLeader; /* lock group leader, if I'm a member */
dlist_head lockGroupMembers; /* list of members, if I'm a leader */
dlist_node lockGroupLink; /* my member link, if I'm a member */
+
+ /*
+ * Support for decoding groups. Use LockHashPartitionLockByProc on the group
+ * leader to get the LWLock protecting these fields.
+ *
+ * For prepared and uncommitted transactions, decoding backends working on
+ * the same XID will link themselves up to the corresponding PGPROC
+ * entry (decodeGroupLeader).
+ *
+ * They will remove themselves when they are done decoding.
+ *
+ * If the prepared or uncommitted transaction decides to abort, then
+ * the decodeGroupLeader will set the decodeAbortPending flag allowing
+ * the decodeGroupMembers to abort their decoding appropriately
+ */
+ PGPROC *decodeGroupLeader; /* decode group leader, if I'm a member */
+ dlist_head decodeGroupMembers; /* list of members, if I'm a leader */
+ dlist_node decodeGroupLink; /* my member link, if I'm a member */
+ bool decodeLocked; /* is it currently locked by this proc? */
+ bool decodeAbortPending; /* is the decode group leader aborting? */
};
/* NOTE: "typedef struct PGPROC PGPROC" appears in storage/lock.h. */
@@ -326,5 +346,10 @@ extern PGPROC *AuxiliaryPidGetProc(int pid);
extern void BecomeLockGroupLeader(void);
extern bool BecomeLockGroupMember(PGPROC *leader, int pid);
+extern PGPROC *BecomeDecodeGroupLeader(TransactionId xid, bool is_prepared);
+extern bool BecomeDecodeGroupMember(PGPROC *leader, int pid, bool is_prepared);
+extern void RemoveDecodeGroupMember(PGPROC *leader);
+extern void RemoveDecodeGroupMemberLocked(PGPROC *leader);
+extern void LogicalDecodeRemoveTransaction(PGPROC *leader, bool isCommit);
#endif /* PROC_H */
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 75bab2985f..68173743ae 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -98,6 +98,7 @@ extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids
extern PGPROC *BackendPidGetProc(int pid);
extern PGPROC *BackendPidGetProcWithLock(int pid);
extern int BackendXidGetPid(TransactionId xid);
+extern PGPROC *BackendXidGetProc(TransactionId xid);
extern bool IsBackendPid(int pid);
extern VirtualTransactionId *GetCurrentVirtualXIDs(TransactionId limitXmin,
diff --git a/src/test/subscription/t/009_twophase.pl b/src/test/subscription/t/009_twophase.pl
new file mode 100644
index 0000000000..c7f373df93
--- /dev/null
+++ b/src/test/subscription/t/009_twophase.pl
@@ -0,0 +1,163 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 12;
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf(
+ 'postgresql.conf', qq(
+ max_prepared_transactions = 10
+ ));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf(
+ 'postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres', "CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO tab_full SELECT generate_series(1,10)");
+$node_publisher->safe_psql('postgres', "CREATE TABLE tab_full2 (x text)");
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO tab_full2 VALUES ('a'), ('b'), ('b')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_full2 (x text)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+"ALTER PUBLICATION tap_pub ADD TABLE tab_full, tab_full2"
+);
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+ or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+ or die "Timed out while waiting for subscriber to synchronize data";
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres',
+ "BEGIN;INSERT INTO tab_full VALUES (11);PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+ or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+ is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+ "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+ or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+ is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+$result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+ is($result, qq(0), 'transaction is committed on subscriber');
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres',
+ "BEGIN;INSERT INTO tab_full VALUES (12);PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+ or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+ is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+ "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+ or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+ is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+ is($result, qq(0), 'transaction is aborted on subscriber');
+
+# Check that commit prepared is decoded properly on crash restart
+$node_publisher->safe_psql('postgres', "
+ BEGIN;
+ INSERT INTO tab_full VALUES (12);
+ INSERT INTO tab_full VALUES (13);
+ PREPARE TRANSACTION 'test_prepared_tab';");
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+ or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (11,12);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+# TODO add test cases involving DDL. This can be added after we add functionality
+# to replicate DDL changes to subscriber.
+
+# check all the cleanup
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+ "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+ "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+ "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+ 'check subscription relation status was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+ "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+ "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
Hi!
Thanks for working on this patch.
Reading through the patch, I noticed that you deleted the call to
SnapBuildCommitTxn() in DecodePrepare(). As you correctly spotted upthread,
there was unnecessary code that marked the transaction as running after
decoding of the prepare. However, the call marking it as committed before
decoding of the prepare is IMHO still needed, as SnapBuildCommitTxn() does
some useful things, like setting the base snapshot for parent transactions
which were skipped because of SnapBuildXactNeedsSkip().
E.g. the current code will hit an assertion failure for the following transaction:
BEGIN;
SAVEPOINT one;
CREATE TABLE test_prepared_savepoints (a int);
PREPARE TRANSACTION 'x';
COMMIT PREPARED 'x';
:get_with2pc_nofilter
:get_with2pc_nofilter <- second call will crash decoder
With the following backtrace:
frame #3: 0x000000010dc47b40 postgres`ExceptionalCondition(conditionName="!(txn->ninvalidations == 0)", errorType="FailedAssertion", fileName="reorderbuffer.c", lineNumber=1944) at assert.c:54
frame #4: 0x000000010d9ff4dc postgres`ReorderBufferForget(rb=0x00007fe1ab832318, xid=816, lsn=35096144) at reorderbuffer.c:1944
frame #5: 0x000000010d9f055c postgres`DecodePrepare(ctx=0x00007fe1ab81b918, buf=0x00007ffee2650408, parsed=0x00007ffee2650088) at decode.c:703
frame #6: 0x000000010d9ef718 postgres`DecodeXactOp(ctx=0x00007fe1ab81b918, buf=0x00007ffee2650408) at decode.c:310
That can be fixed by calling SnapBuildCommitTxn() in DecodePrepare(),
which I believe is safe because during normal operation a prepared transaction
holds its relation locks until commit/abort, and in between nobody can access
the altered relations (or I just don't know of such situations; that was the
reason why I had marked those xids as running in previous versions).
On 6 Feb 2018, at 15:20, Nikhil Sontakke <nikhils@2ndquadrant.com> wrote:
Hi all,
PFA, patch which applies cleanly against latest git head. I also
removed unwanted newlines and took care of the cleanup TODO about
making ReorderBufferTXN structure using a txn_flags field instead of
separate booleans for various statuses like has_catalog_changes,
is_subxact, is_serialized etc. The patch uses this txn_flags field for
the newer prepare related info as well.
"make check-world" passes ok, including the additional regular and tap
tests that we have added as part of this patch.
PFA, latest version of this patch.
This latest version takes care of the abort-while-decoding issue along
with additional test cases and documentation changes.
--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hi Stas,
Reading through patch I’ve noticed that you deleted call to SnapBuildCommitTxn()
in DecodePrepare(). As you correctly spotted upthread there was unnecessary
code that marked transaction as running after decoding of prepare. However call
marking it as committed before decoding of prepare IMHO is still needed as
SnapBuildCommitTxn does some useful thing like setting base snapshot for parent
transactions which were skipped because of SnapBuildXactNeedsSkip().
E.g. current code will crash in assert for following transaction:
BEGIN;
SAVEPOINT one;
CREATE TABLE test_prepared_savepoints (a int);
PREPARE TRANSACTION 'x';
COMMIT PREPARED 'x';
:get_with2pc_nofilter
:get_with2pc_nofilter <- second call will crash decoder
Thanks for taking a look!
The first ":get_with2pc_nofilter" call consumes the data appropriately.
The second ":get_with2pc_nofilter" sees that it has to skip and hence
enters the ReorderBufferForget() function in the skip code path
causing the assert. If we have to skip anyway, why do we need to call
SnapBuildCommitTxn() for such a transaction? I don't see the need for
doing that for skipped transactions.
Will continue to look at this and will add this scenario to the test
cases. Further comments/feedback appreciated.
Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Hi,
First off: This patch has way too many different types of changes as
part of one huge commit. This needs to be split into several
pieces. First the cleanups (e.g. the fields -> flag changes), then the
individual infrastructure pieces (like the twophase.c changes, best
split into several pieces as well, the locking stuff), then the main
feature, then support for it in the output plugin. Each should have an
individual explanation about why the change is necessary and not a bad
idea.
On 2018-02-06 17:50:40 +0530, Nikhil Sontakke wrote:
@@ -46,6 +48,9 @@ typedef struct
     bool skip_empty_xacts;
     bool xact_wrote_changes;
     bool only_local;
+    bool twophase_decoding;
+    bool twophase_decode_with_catalog_changes;
+    int decode_delay; /* seconds to sleep after every change record */
This seems too big a crock to add just for testing. It'll also make the
testing timing dependent...
} TestDecodingData;
void
_PG_init(void)
@@ -85,9 +106,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
     cb->begin_cb = pg_decode_begin_txn;
     cb->change_cb = pg_decode_change;
     cb->commit_cb = pg_decode_commit_txn;
+    cb->abort_cb = pg_decode_abort_txn;
     cb->filter_by_origin_cb = pg_decode_filter;
     cb->shutdown_cb = pg_decode_shutdown;
     cb->message_cb = pg_decode_message;
+    cb->filter_prepare_cb = pg_filter_prepare;
+    cb->filter_decode_txn_cb = pg_filter_decode_txn;
+    cb->prepare_cb = pg_decode_prepare_txn;
+    cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+    cb->abort_prepared_cb = pg_decode_abort_prepared_txn;
 }
Why does this introduce both abort_cb and abort_prepared_cb? That seems
to conflate two separate features.
+/* Filter out unnecessary two-phase transactions */
+static bool
+pg_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+                  TransactionId xid, const char *gid)
+{
+    TestDecodingData *data = ctx->output_plugin_private;
+
+    /* treat all transactions as one-phase */
+    if (!data->twophase_decoding)
+        return true;
+
+    if (txn && txn_has_catalog_changes(txn) &&
+        !data->twophase_decode_with_catalog_changes)
+        return true;
What? I'm INCREDIBLY doubtful this is a sane thing to expose to output
plugins. As in, unless I hear a very very convincing reason I'm strongly
opposed.
+/*
+ * Check if we should continue to decode this transaction.
+ *
+ * If it has aborted in the meanwhile, then there's no sense
+ * in decoding and sending the rest of the changes, we might
+ * as well ask the subscribers to abort immediately.
+ *
+ * This should be called if we are streaming a transaction
+ * before it's committed or if we are decoding a 2PC
+ * transaction. Otherwise we always decode committed
+ * transactions
+ *
+ * Additional checks can be added here, as needed
+ */
+static bool
+pg_filter_decode_txn(LogicalDecodingContext *ctx,
+                     ReorderBufferTXN *txn)
+{
+    /*
+     * Due to caching, repeated TransactionIdDidAbort calls
+     * shouldn't be that expensive
+     */
+    if (txn != NULL &&
+        TransactionIdIsValid(txn->xid) &&
+        TransactionIdDidAbort(txn->xid))
+        return true;
+
+    /* if txn is NULL, filter it out */
Why can this be NULL?
+    return (txn != NULL)? false:true;
+}
This definitely shouldn't be a task for each output plugin. Even if we
want to make this configurable, I'm doubtful that it's a good idea to do
so here - that makes it much less likely to hit edge cases.
static bool
pg_decode_filter(LogicalDecodingContext *ctx,
RepOriginId origin_id)
@@ -409,8 +622,18 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
     }
     data->xact_wrote_changes = true;

+    if (!LogicalLockTransaction(txn))
+        return;
It really really can't be right that this is exposed to output plugins.
+    /* if decode_delay is specified, sleep with above lock held */
+    if (data->decode_delay > 0)
+    {
+        elog(LOG, "sleeping for %d seconds", data->decode_delay);
+        pg_usleep(data->decode_delay * 1000000L);
+    }
Really not on board.
@@ -1075,6 +1077,21 @@ EndPrepare(GlobalTransaction gxact)
     Assert(hdr->magic == TWOPHASE_MAGIC);
     hdr->total_len = records.total_len + sizeof(pg_crc32c);

+    replorigin = (replorigin_session_origin != InvalidRepOriginId &&
+                  replorigin_session_origin != DoNotReplicateId);
+
+    if (replorigin)
+    {
+        Assert(replorigin_session_origin_lsn != InvalidXLogRecPtr);
+        hdr->origin_lsn = replorigin_session_origin_lsn;
+        hdr->origin_timestamp = replorigin_session_origin_timestamp;
+    }
+    else
+    {
+        hdr->origin_lsn = InvalidXLogRecPtr;
+        hdr->origin_timestamp = 0;
+    }
+
     /*
      * If the data size exceeds MaxAllocSize, we won't be able to read it in
      * ReadTwoPhaseFile. Check for that now, rather than fail in the case
@@ -1107,7 +1124,16 @@ EndPrepare(GlobalTransaction gxact)
     XLogBeginInsert();
     for (record = records.head; record != NULL; record = record->next)
         XLogRegisterData(record->data, record->len);
+
+    XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
+
Can we perhaps merge a bit of the code with the plain commit path on
this?
     gxact->prepare_end_lsn = XLogInsert(RM_XACT_ID, XLOG_XACT_PREPARE);
+
+    if (replorigin)
+        /* Move LSNs forward for this replication origin */
+        replorigin_session_advance(replorigin_session_origin_lsn,
+                                   gxact->prepare_end_lsn);
+
Why is it ok to do this at PREPARE time? I guess the theory is that the
origin LSN is going to be from the source's PREPARE too? If so, this
needs to be commented upon here.
+/*
+ * ParsePrepareRecord
+ */
+void
+ParsePrepareRecord(uint8 info, char *xlrec, xl_xact_parsed_prepare *parsed)
+{
+    TwoPhaseFileHeader *hdr;
+    char *bufptr;
+
+    hdr = (TwoPhaseFileHeader *) xlrec;
+    bufptr = xlrec + MAXALIGN(sizeof(TwoPhaseFileHeader));
+
+    parsed->origin_lsn = hdr->origin_lsn;
+    parsed->origin_timestamp = hdr->origin_timestamp;
+    parsed->twophase_xid = hdr->xid;
+    parsed->dbId = hdr->database;
+    parsed->nsubxacts = hdr->nsubxacts;
+    parsed->ncommitrels = hdr->ncommitrels;
+    parsed->nabortrels = hdr->nabortrels;
+    parsed->nmsgs = hdr->ninvalmsgs;
+
+    strncpy(parsed->twophase_gid, bufptr, hdr->gidlen);
+    bufptr += MAXALIGN(hdr->gidlen);
+
+    parsed->subxacts = (TransactionId *) bufptr;
+    bufptr += MAXALIGN(hdr->nsubxacts * sizeof(TransactionId));
+
+    parsed->commitrels = (RelFileNode *) bufptr;
+    bufptr += MAXALIGN(hdr->ncommitrels * sizeof(RelFileNode));
+
+    parsed->abortrels = (RelFileNode *) bufptr;
+    bufptr += MAXALIGN(hdr->nabortrels * sizeof(RelFileNode));
+
+    parsed->msgs = (SharedInvalidationMessage *) bufptr;
+    bufptr += MAXALIGN(hdr->ninvalmsgs * sizeof(SharedInvalidationMessage));
+}
So this is now basically a commit record. I quite dislike duplicating
things this way. Can't we make commit records versatile enough to
represent this without problems?
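To make the record layout being parsed above easier to follow, here is a minimal standalone sketch of the same technique: a fixed header followed by variable-length sections, each padded to the maximum alignment, and walked with a moving buffer pointer. Everything named Toy* or TOY_* is hypothetical and much simpler than TwoPhaseFileHeader; the 8-byte alignment is an assumption, and the bounds checks are an addition of the sketch rather than something the quoted function does.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TOY_GIDSIZE 200
/* Assumption: 8-byte maximum alignment; PostgreSQL derives MAXALIGN per platform. */
#define TOY_MAXALIGN(len) (((size_t) (len) + 7) & ~(size_t) 7)

/* Hypothetical fixed-size header; the real one carries many more fields. */
typedef struct ToyHeader
{
    uint32_t xid;
    uint16_t gidlen;            /* GID length including the trailing '\0' */
    int32_t  nsubxacts;
} ToyHeader;

typedef struct ToyParsed
{
    uint32_t        xid;
    char            gid[TOY_GIDSIZE];
    int             nsubxacts;
    const uint32_t *subxacts;   /* points into the record, not copied */
} ToyParsed;

/* Build a record: header, then GID, then subxact array, each max-aligned. */
static char *
toy_build_record(uint32_t xid, const char *gid,
                 const uint32_t *subxacts, int nsubxacts, size_t *len)
{
    ToyHeader hdr;
    size_t    gidlen, subsz;
    char     *rec, *p;

    assert(gid != NULL && len != NULL);
    gidlen = strlen(gid) + 1;
    if (gidlen > TOY_GIDSIZE || nsubxacts < 0)
        return NULL;
    subsz = (size_t) nsubxacts * sizeof(uint32_t);

    *len = TOY_MAXALIGN(sizeof(ToyHeader)) + TOY_MAXALIGN(gidlen) +
           TOY_MAXALIGN(subsz);
    rec = calloc(1, *len);      /* malloc-family memory is max-aligned */
    if (rec == NULL)
        return NULL;

    memset(&hdr, 0, sizeof(hdr));
    hdr.xid = xid;
    hdr.gidlen = (uint16_t) gidlen;
    hdr.nsubxacts = nsubxacts;

    p = rec;
    memcpy(p, &hdr, sizeof(hdr));
    p += TOY_MAXALIGN(sizeof(ToyHeader));
    memcpy(p, gid, gidlen);
    p += TOY_MAXALIGN(gidlen);
    if (subsz > 0)
        memcpy(p, subxacts, subsz);
    return rec;
}

/* Parse a record built above; returns 0 on success, -1 if it is malformed. */
static int
toy_parse_record(const char *rec, size_t len, ToyParsed *parsed)
{
    ToyHeader   hdr;
    const char *bufptr = rec;
    size_t      need;

    if (len < TOY_MAXALIGN(sizeof(ToyHeader)))
        return -1;
    memcpy(&hdr, bufptr, sizeof(hdr));
    bufptr += TOY_MAXALIGN(sizeof(ToyHeader));

    if (hdr.gidlen == 0 || hdr.gidlen > TOY_GIDSIZE || hdr.nsubxacts < 0)
        return -1;
    need = TOY_MAXALIGN(sizeof(ToyHeader)) + TOY_MAXALIGN(hdr.gidlen) +
           TOY_MAXALIGN((size_t) hdr.nsubxacts * sizeof(uint32_t));
    if (len < need)
        return -1;

    parsed->xid = hdr.xid;
    parsed->nsubxacts = hdr.nsubxacts;

    /* Copy the GID and force termination rather than trusting the input. */
    memcpy(parsed->gid, bufptr, hdr.gidlen);
    parsed->gid[hdr.gidlen - 1] = '\0';
    bufptr += TOY_MAXALIGN(hdr.gidlen);

    /* Aligned because every preceding section was padded to 8 bytes. */
    parsed->subxacts = (const uint32_t *) bufptr;
    return 0;
}
```

The sketch only shows the pointer-walking pattern; whether a prepare record should get its own parsed struct or reuse the commit-record one is the open question raised in the review.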
 /*
  * Reads 2PC data from xlog. During checkpoint this data will be moved to
@@ -1365,7 +1428,7 @@ StandbyTransactionIdIsPrepared(TransactionId xid)
  * FinishPreparedTransaction: execute COMMIT PREPARED or ROLLBACK PREPARED
  */
 void
-FinishPreparedTransaction(const char *gid, bool isCommit)
+FinishPreparedTransaction(const char *gid, bool isCommit, bool missing_ok)
 {
     GlobalTransaction gxact;
     PGPROC *proc;
@@ -1386,8 +1449,20 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
     /*
      * Validate the GID, and lock the GXACT to ensure that two backends do not
      * try to commit the same GID at once.
+     *
+     * During logical decoding, on the apply side, it's possible that a prepared
+     * transaction got aborted while decoding. In that case, we stop the
+     * decoding and abort the transaction immediately. However the ROLLBACK
+     * prepared processing still reaches the subscriber. In that case it's ok
+     * to have a missing gid
      */
-    gxact = LockGXact(gid, GetUserId());
+    gxact = LockGXact(gid, GetUserId(), missing_ok);
+    if (gxact == NULL)
+    {
+        Assert(missing_ok && !isCommit);
+        return;
+    }
I'm very doubtful it is sane to handle this at such a low level.
@@ -2358,6 +2443,13 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
     Assert(TwoPhaseState->numPrepXacts < max_prepared_xacts);
     TwoPhaseState->prepXacts[TwoPhaseState->numPrepXacts++] = gxact;
+    if (origin_id != InvalidRepOriginId)
+    {
+        /* recover apply progress */
+        replorigin_advance(origin_id, hdr->origin_lsn, end_lsn,
+                           false /* backward */ , false /* WAL */ );
+    }
+
It's unclear to me why this is necessary / a good idea?
 case XLOG_XACT_PREPARE:
+    {
+        xl_xact_parsed_prepare parsed;
-        /*
-         * Currently decoding ignores PREPARE TRANSACTION and will just
-         * decode the transaction when the COMMIT PREPARED is sent or
-         * throw away the transaction's contents when a ROLLBACK PREPARED
-         * is received. In the future we could add code to expose prepared
-         * transactions in the changestream allowing for a kind of
-         * distributed 2PC.
-         */
-        ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+        /* check that output plugin is capable of twophase decoding */
+        if (!ctx->enable_twophase)
+        {
+            ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+            break;
+        }
+
+        /* ok, parse it */
+        ParsePrepareRecord(XLogRecGetInfo(buf->record),
+                           XLogRecGetData(buf->record), &parsed);
+
+        /* does output plugin want this particular transaction? */
+        if (ctx->callbacks.filter_prepare_cb &&
+            ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+                                         parsed.twophase_gid))
+        {
+            ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+                                    buf->origptr);
We're calling ReorderBufferProcessXid() on two different xids in
different branches, is that intentional?
+    if (TransactionIdIsValid(parsed->twophase_xid) &&
+        ReorderBufferTxnIsPrepared(ctx->reorder,
+                                   parsed->twophase_xid, parsed->twophase_gid))
+    {
+        Assert(xid == parsed->twophase_xid);
+        /* we are processing COMMIT PREPARED */
+        ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+                                    commit_time, origin_id, origin_lsn, parsed->twophase_gid, true);
+    }
+    else
+    {
+        /* replay actions of all transaction + subtransactions in order */
+        ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+                            commit_time, origin_id, origin_lsn);
+    }
+}
Why do we want this via the same routine?
+bool
+LogicalLockTransaction(ReorderBufferTXN *txn)
+{
+    bool ok = false;
+
+    /*
+     * Prepared transactions and uncommitted transactions
+     * that have modified catalogs need to interlock with
+     * concurrent rollback to ensure that there are no
+     * issues while decoding
+     */
+
+    if (!txn_has_catalog_changes(txn))
+        return true;
+
+    /*
+     * Is it a prepared txn? Similar checks for uncommitted
+     * transactions when we start supporting them
+     */
+    if (!txn_prepared(txn))
+        return true;
+
+    /* check cached status */
+    if (txn_commit(txn))
+        return true;
+    if (txn_rollback(txn))
+        return false;
+
+    /*
+     * Find the PROC that is handling this XID and add ourself as a
+     * decodeGroupMember
+     */
+    if (MyProc->decodeGroupLeader == NULL)
+    {
+        PGPROC *proc = BecomeDecodeGroupLeader(txn->xid, txn_prepared(txn));
+
+        /*
+         * If decodeGroupLeader is NULL, then the only possibility
+         * is that the transaction completed and went away
+         */
+        if (proc == NULL)
+        {
+            Assert(!TransactionIdIsInProgress(txn->xid));
+            if (TransactionIdDidCommit(txn->xid))
+            {
+                txn->txn_flags |= TXN_COMMIT;
+                return true;
+            }
+            else
+            {
+                txn->txn_flags |= TXN_ROLLBACK;
+                return false;
+            }
+        }
+
+        /* Add ourself as a decodeGroupMember */
+        if (!BecomeDecodeGroupMember(proc, proc->pid, txn_prepared(txn)))
+        {
+            Assert(!TransactionIdIsInProgress(txn->xid));
+            if (TransactionIdDidCommit(txn->xid))
+            {
+                txn->txn_flags |= TXN_COMMIT;
+                return true;
+            }
+            else
+            {
+                txn->txn_flags |= TXN_ROLLBACK;
+                return false;
+            }
+        }
+    }
Are we ok with this low-level lock / pgproc stuff happening outside of
procarray / lock related files? Where is the locking scheme documented?
+/* ReorderBufferTXN flags */
+#define TXN_HAS_CATALOG_CHANGES 0x0001
+#define TXN_IS_SUBXACT 0x0002
+#define TXN_SERIALIZED 0x0004
+#define TXN_PREPARE 0x0008
+#define TXN_COMMIT_PREPARED 0x0010
+#define TXN_ROLLBACK_PREPARED 0x0020
+#define TXN_COMMIT 0x0040
+#define TXN_ROLLBACK 0x0080
+
+/* does the txn have catalog changes */
+#define txn_has_catalog_changes(txn) (txn->txn_flags & TXN_HAS_CATALOG_CHANGES)
+/* is the txn known as a subxact? */
+#define txn_is_subxact(txn) (txn->txn_flags & TXN_IS_SUBXACT)
+/*
+ * Has this transaction been spilled to disk? It's not always possible to
+ * deduce that fact by comparing nentries with nentries_mem, because e.g.
+ * subtransactions of a large transaction might get serialized together
+ * with the parent - if they're restored to memory they'd have
+ * nentries_mem == nentries.
+ */
+#define txn_is_serialized(txn) (txn->txn_flags & TXN_SERIALIZED)
+/* is this txn prepared? */
+#define txn_prepared(txn) (txn->txn_flags & TXN_PREPARE)
+/* was this prepared txn committed in the meanwhile? */
+#define txn_commit_prepared(txn) (txn->txn_flags & TXN_COMMIT_PREPARED)
+/* was this prepared txn aborted in the meanwhile? */
+#define txn_rollback_prepared(txn) (txn->txn_flags & TXN_ROLLBACK_PREPARED)
+/* was this txn committed in the meanwhile? */
+#define txn_commit(txn) (txn->txn_flags & TXN_COMMIT)
+/* was this prepared txn aborted in the meanwhile? */
+#define txn_rollback(txn) (txn->txn_flags & TXN_ROLLBACK)
+
These txn_* names seem too generic imo - fairly likely to conflict with
other pieces of code imo.
Greetings,
Andres Freund
Hi Andres,
First off: This patch has way too many different types of changes as
part of one huge commit. This needs to be split into several
pieces. First the cleanups (e.g. the fields -> flag changes), then the
individual infrastructure pieces (like the twophase.c changes, best
split into several pieces as well, the locking stuff), then the main
feature, then support for it in the output plugin. Each should have an
individual explanation about why the change is necessary and not a bad
idea.
Ok, I will break this patch into multiple logical pieces and re-submit.
On 2018-02-06 17:50:40 +0530, Nikhil Sontakke wrote:
@@ -46,6 +48,9 @@ typedef struct
     bool skip_empty_xacts;
     bool xact_wrote_changes;
     bool only_local;
+    bool twophase_decoding;
+    bool twophase_decode_with_catalog_changes;
+    int decode_delay; /* seconds to sleep after every change record */
This seems too big a crock to add just for testing. It'll also make the
testing timing dependent...
The idea *was* to make testing timing dependent. We wanted to simulate
the case when a rollback is issued by another backend while the
decoding is still ongoing. This allows that test case to be tested.
} TestDecodingData;
void _PG_init(void)
@@ -85,9 +106,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
     cb->begin_cb = pg_decode_begin_txn;
     cb->change_cb = pg_decode_change;
     cb->commit_cb = pg_decode_commit_txn;
+    cb->abort_cb = pg_decode_abort_txn;
     cb->filter_by_origin_cb = pg_decode_filter;
     cb->shutdown_cb = pg_decode_shutdown;
     cb->message_cb = pg_decode_message;
+    cb->filter_prepare_cb = pg_filter_prepare;
+    cb->filter_decode_txn_cb = pg_filter_decode_txn;
+    cb->prepare_cb = pg_decode_prepare_txn;
+    cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+    cb->abort_prepared_cb = pg_decode_abort_prepared_txn;
 }
Why does this introduce both abort_cb and abort_prepared_cb? That seems
to conflate two separate features.
Consider the case when we have a bunch of change records to apply for
a transaction. We sent a "BEGIN" and then start decoding each change
record one by one. Now a rollback was encountered while we were
decoding. In that case it doesn't make sense to keep on decoding and
sending the change records. We immediately send a regular ABORT. We
cannot send "ROLLBACK PREPARED" because the transaction was not
prepared on the subscriber and have to send a regular ABORT instead.
And we need the "ROLLBACK PREPARED" callback for the case when a
prepared transaction gets rolled back and is encountered during the
usual WAL processing.
Please take a look at "contrib/test_decoding/t/001_twophase.pl" where
this test case is enacted.
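The distinction described above can be modelled as a small decision on what the downstream has already seen for the transaction. This is only a sketch under stated assumptions: `DemoStreamState`, `DemoAbortMessage` and `demo_choose_abort_message` are hypothetical names, and the model ignores restarts and everything else the real decoding code has to track.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* What the downstream has been sent so far for one transaction (assumed model). */
typedef struct DemoStreamState
{
    bool begin_sent;    /* BEGIN already went downstream */
    bool prepare_sent;  /* PREPARE already went downstream */
} DemoStreamState;

typedef enum DemoAbortMessage
{
    DEMO_SEND_NOTHING,           /* downstream never heard of the transaction */
    DEMO_SEND_ABORT,             /* mid-stream abort: nothing is prepared downstream */
    DEMO_SEND_ROLLBACK_PREPARED  /* downstream holds a prepared transaction */
} DemoAbortMessage;

/* Pick the message that matches the state the downstream is actually in. */
static DemoAbortMessage
demo_choose_abort_message(const DemoStreamState *st)
{
    assert(st != NULL);
    /* in this model PREPARE can only follow BEGIN */
    assert(!st->prepare_sent || st->begin_sent);

    if (st->prepare_sent)
        return DEMO_SEND_ROLLBACK_PREPARED;
    if (st->begin_sent)
        return DEMO_SEND_ABORT;
    return DEMO_SEND_NOTHING;
}
```

A rollback noticed while changes are still being streamed maps to the plain ABORT case, because the subscriber has an open transaction but no prepared one to roll back.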
+/* Filter out unnecessary two-phase transactions */
+static bool
+pg_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+                  TransactionId xid, const char *gid)
+{
+    TestDecodingData *data = ctx->output_plugin_private;
+
+    /* treat all transactions as one-phase */
+    if (!data->twophase_decoding)
+        return true;
+
+    if (txn && txn_has_catalog_changes(txn) &&
+        !data->twophase_decode_with_catalog_changes)
+        return true;
What? I'm INCREDIBLY doubtful this is a sane thing to expose to output
plugins. As in, unless I hear a very very convincing reason I'm strongly
opposed.
These bools are specific to the test_decoding plugin.
Again, these are useful in testing decoding in various scenarios with
twophase decoding enabled/disabled. Testing decoding when catalog
changes are allowed/disallowed etc. Please take a look at
"contrib/test_decoding/sql/prepared.sql" for the various scenarios.
+/*
+ * Check if we should continue to decode this transaction.
+ *
+ * If it has aborted in the meanwhile, then there's no sense
+ * in decoding and sending the rest of the changes, we might
+ * as well ask the subscribers to abort immediately.
+ *
+ * This should be called if we are streaming a transaction
+ * before it's committed or if we are decoding a 2PC
+ * transaction. Otherwise we always decode committed
+ * transactions
+ *
+ * Additional checks can be added here, as needed
+ */
+static bool
+pg_filter_decode_txn(LogicalDecodingContext *ctx,
+                     ReorderBufferTXN *txn)
+{
+    /*
+     * Due to caching, repeated TransactionIdDidAbort calls
+     * shouldn't be that expensive
+     */
+    if (txn != NULL &&
+        TransactionIdIsValid(txn->xid) &&
+        TransactionIdDidAbort(txn->xid))
+        return true;
+
+    /* if txn is NULL, filter it out */
Why can this be NULL?
Depending on parameters passed to the ReorderBufferTXNByXid()
function, the txn might be NULL in some cases, especially during
restarts.
+    return (txn != NULL)? false:true;
+}
This definitely shouldn't be a task for each output plugin. Even if we
want to make this configurable, I'm doubtful that it's a good idea to do
so here - it makes it much less likely to hit edge cases.
Agreed, I will try to add it to the core logical decoding handling.
static bool
pg_decode_filter(LogicalDecodingContext *ctx,
RepOriginId origin_id)
@@ -409,8 +622,18 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
}
data->xact_wrote_changes = true;
+    if (!LogicalLockTransaction(txn))
+        return;
It really really can't be right that this is exposed to output plugins.
This was discussed in the other thread
(http://www.postgresql-archive.org/Logical-Decoding-and-HeapTupleSatisfiesVacuum-assumptions-td5998294i20.html).
Any catalog access in any plugin needs to interlock with concurrent
aborts. This is only a problem if the transaction is a prepared one or
a not yet committed one. In the majority of the remaining cases, this
function will do nothing at all.
+    /* if decode_delay is specified, sleep with above lock held */
+    if (data->decode_delay > 0)
+    {
+        elog(LOG, "sleeping for %d seconds", data->decode_delay);
+        pg_usleep(data->decode_delay * 1000000L);
+    }
Really not on board.
Again, specific to test_decoding plugin. We want to test the
interlocking code for concurrent abort handling which needs to wait
out for plugins in locked state before allowing the rollback to go
ahead. Please take a look at "contrib/test_decoding/t/001_twophase.pl"
and "Waiting for backends to abort" string.
@@ -1075,6 +1077,21 @@ EndPrepare(GlobalTransaction gxact)
     Assert(hdr->magic == TWOPHASE_MAGIC);
     hdr->total_len = records.total_len + sizeof(pg_crc32c);
+    replorigin = (replorigin_session_origin != InvalidRepOriginId &&
+                  replorigin_session_origin != DoNotReplicateId);
+
+    if (replorigin)
+    {
+        Assert(replorigin_session_origin_lsn != InvalidXLogRecPtr);
+        hdr->origin_lsn = replorigin_session_origin_lsn;
+        hdr->origin_timestamp = replorigin_session_origin_timestamp;
+    }
+    else
+    {
+        hdr->origin_lsn = InvalidXLogRecPtr;
+        hdr->origin_timestamp = 0;
+    }
+
     /*
      * If the data size exceeds MaxAllocSize, we won't be able to read it in
      * ReadTwoPhaseFile. Check for that now, rather than fail in the case
@@ -1107,7 +1124,16 @@ EndPrepare(GlobalTransaction gxact)
     XLogBeginInsert();
     for (record = records.head; record != NULL; record = record->next)
         XLogRegisterData(record->data, record->len);
+
+    XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
+
Can we perhaps merge a bit of the code with the plain commit path on
this?
Given that PREPARE ROLLBACK handling is totally separate from the
regular commit code paths, wouldn't it be a little difficult?
     gxact->prepare_end_lsn = XLogInsert(RM_XACT_ID, XLOG_XACT_PREPARE);
+
+    if (replorigin)
+        /* Move LSNs forward for this replication origin */
+        replorigin_session_advance(replorigin_session_origin_lsn,
+                                   gxact->prepare_end_lsn);
+
Why is it ok to do this at PREPARE time? I guess the theory is that the
origin LSN is going to be from the source's PREPARE too? If so, this
needs to be commented upon here.
Ok, will add a comment.
+/*
+ * ParsePrepareRecord
+ */
+void
+ParsePrepareRecord(uint8 info, char *xlrec, xl_xact_parsed_prepare *parsed)
+{
+    TwoPhaseFileHeader *hdr;
+    char *bufptr;
+
+    hdr = (TwoPhaseFileHeader *) xlrec;
+    bufptr = xlrec + MAXALIGN(sizeof(TwoPhaseFileHeader));
+
+    parsed->origin_lsn = hdr->origin_lsn;
+    parsed->origin_timestamp = hdr->origin_timestamp;
+    parsed->twophase_xid = hdr->xid;
+    parsed->dbId = hdr->database;
+    parsed->nsubxacts = hdr->nsubxacts;
+    parsed->ncommitrels = hdr->ncommitrels;
+    parsed->nabortrels = hdr->nabortrels;
+    parsed->nmsgs = hdr->ninvalmsgs;
+
+    strncpy(parsed->twophase_gid, bufptr, hdr->gidlen);
+    bufptr += MAXALIGN(hdr->gidlen);
+
+    parsed->subxacts = (TransactionId *) bufptr;
+    bufptr += MAXALIGN(hdr->nsubxacts * sizeof(TransactionId));
+
+    parsed->commitrels = (RelFileNode *) bufptr;
+    bufptr += MAXALIGN(hdr->ncommitrels * sizeof(RelFileNode));
+
+    parsed->abortrels = (RelFileNode *) bufptr;
+    bufptr += MAXALIGN(hdr->nabortrels * sizeof(RelFileNode));
+
+    parsed->msgs = (SharedInvalidationMessage *) bufptr;
+    bufptr += MAXALIGN(hdr->ninvalmsgs * sizeof(SharedInvalidationMessage));
+}
So this is now basically a commit record. I quite dislike duplicating
things this way. Can't we make commit records versatile enough to
represent this without problems?
Maybe we can. We have already re-used existing records for
XLOG_XACT_COMMIT_PREPARED and XLOG_XACT_ABORT_PREPARED. We can add a
flag to existing commit records to indicate that it's a PREPARE and
not a COMMIT.
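As a rough illustration of the "one record format plus a flag" idea, a single classification routine could tell the three cases apart from a bitmask. The flag names and values below are hypothetical and do not reproduce the real `xinfo` bits of the commit record.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag bits, chosen only for this sketch. */
#define DEMO_XINFO_HAS_TWOPHASE (1U << 0)  /* record carries a gid */
#define DEMO_XINFO_IS_PREPARE   (1U << 1)  /* record is the PREPARE itself */

typedef enum DemoRecordKind
{
    DEMO_PLAIN_COMMIT,
    DEMO_PREPARE,
    DEMO_COMMIT_PREPARED
} DemoRecordKind;

/* Classify a commit-style record purely from its flag word. */
static DemoRecordKind
demo_classify(uint32_t xinfo)
{
    /* a PREPARE flag without the two-phase payload would be malformed */
    assert(!(xinfo & DEMO_XINFO_IS_PREPARE) ||
           (xinfo & DEMO_XINFO_HAS_TWOPHASE));

    if (xinfo & DEMO_XINFO_IS_PREPARE)
        return DEMO_PREPARE;
    if (xinfo & DEMO_XINFO_HAS_TWOPHASE)
        return DEMO_COMMIT_PREPARED;
    return DEMO_PLAIN_COMMIT;
}
```

With such a flag, a single parse routine and a single parsed struct could serve all three record kinds, which is the duplication concern raised in the review.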
 /*
  * Reads 2PC data from xlog. During checkpoint this data will be moved to
@@ -1365,7 +1428,7 @@ StandbyTransactionIdIsPrepared(TransactionId xid)
  * FinishPreparedTransaction: execute COMMIT PREPARED or ROLLBACK PREPARED
  */
 void
-FinishPreparedTransaction(const char *gid, bool isCommit)
+FinishPreparedTransaction(const char *gid, bool isCommit, bool missing_ok)
 {
     GlobalTransaction gxact;
     PGPROC *proc;
@@ -1386,8 +1449,20 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
     /*
      * Validate the GID, and lock the GXACT to ensure that two backends do not
      * try to commit the same GID at once.
+     *
+     * During logical decoding, on the apply side, it's possible that a prepared
+     * transaction got aborted while decoding. In that case, we stop the
+     * decoding and abort the transaction immediately. However the ROLLBACK
+     * prepared processing still reaches the subscriber. In that case it's ok
+     * to have a missing gid
      */
-    gxact = LockGXact(gid, GetUserId());
+    gxact = LockGXact(gid, GetUserId(), missing_ok);
+    if (gxact == NULL)
+    {
+        Assert(missing_ok && !isCommit);
+        return;
+    }
I'm very doubtful it is sane to handle this at such a low level.
FinishPreparedTransaction() is called directly from ProcessUtility. If
not here, where else could we do this?
@@ -2358,6 +2443,13 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
     Assert(TwoPhaseState->numPrepXacts < max_prepared_xacts);
     TwoPhaseState->prepXacts[TwoPhaseState->numPrepXacts++] = gxact;
+    if (origin_id != InvalidRepOriginId)
+    {
+        /* recover apply progress */
+        replorigin_advance(origin_id, hdr->origin_lsn, end_lsn,
+                           false /* backward */ , false /* WAL */ );
+    }
+
It's unclear to me why this is necessary / a good idea?
Keeping PREPARE handling as close to regular COMMIT handling seems
like a good idea, no?
 case XLOG_XACT_PREPARE:
+    {
+        xl_xact_parsed_prepare parsed;
-        /*
-         * Currently decoding ignores PREPARE TRANSACTION and will just
-         * decode the transaction when the COMMIT PREPARED is sent or
-         * throw away the transaction's contents when a ROLLBACK PREPARED
-         * is received. In the future we could add code to expose prepared
-         * transactions in the changestream allowing for a kind of
-         * distributed 2PC.
-         */
-        ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+        /* check that output plugin is capable of twophase decoding */
+        if (!ctx->enable_twophase)
+        {
+            ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+            break;
+        }
+
+        /* ok, parse it */
+        ParsePrepareRecord(XLogRecGetInfo(buf->record),
+                           XLogRecGetData(buf->record), &parsed);
+
+        /* does output plugin want this particular transaction? */
+        if (ctx->callbacks.filter_prepare_cb &&
+            ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+                                         parsed.twophase_gid))
+        {
+            ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+                                    buf->origptr);
We're calling ReorderBufferProcessXid() on two different xids in
different branches, is that intentional?
Don't think that's intentional. Maybe Stas can also provide his views on this?
+    if (TransactionIdIsValid(parsed->twophase_xid) &&
+        ReorderBufferTxnIsPrepared(ctx->reorder,
+                                   parsed->twophase_xid, parsed->twophase_gid))
+    {
+        Assert(xid == parsed->twophase_xid);
+        /* we are processing COMMIT PREPARED */
+        ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+                                    commit_time, origin_id, origin_lsn, parsed->twophase_gid, true);
+    }
+    else
+    {
+        /* replay actions of all transaction + subtransactions in order */
+        ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+                            commit_time, origin_id, origin_lsn);
+    }
+}
Why do we want this via the same routine?
As I mentioned above, xl_xact_parsed_commit handles both regular
commits and also "COMMIT PREPARED". That's why one routine for them
both.
+bool
+LogicalLockTransaction(ReorderBufferTXN *txn)
+{
+    bool ok = false;
+
+    /*
+     * Prepared transactions and uncommitted transactions
+     * that have modified catalogs need to interlock with
+     * concurrent rollback to ensure that there are no
+     * issues while decoding
+     */
+
+    if (!txn_has_catalog_changes(txn))
+        return true;
+
+    /*
+     * Is it a prepared txn? Similar checks for uncommitted
+     * transactions when we start supporting them
+     */
+    if (!txn_prepared(txn))
+        return true;
+
+    /* check cached status */
+    if (txn_commit(txn))
+        return true;
+    if (txn_rollback(txn))
+        return false;
+
+    /*
+     * Find the PROC that is handling this XID and add ourself as a
+     * decodeGroupMember
+     */
+    if (MyProc->decodeGroupLeader == NULL)
+    {
+        PGPROC *proc = BecomeDecodeGroupLeader(txn->xid, txn_prepared(txn));
+
+        /*
+         * If decodeGroupLeader is NULL, then the only possibility
+         * is that the transaction completed and went away
+         */
+        if (proc == NULL)
+        {
+            Assert(!TransactionIdIsInProgress(txn->xid));
+            if (TransactionIdDidCommit(txn->xid))
+            {
+                txn->txn_flags |= TXN_COMMIT;
+                return true;
+            }
+            else
+            {
+                txn->txn_flags |= TXN_ROLLBACK;
+                return false;
+            }
+        }
+
+        /* Add ourself as a decodeGroupMember */
+        if (!BecomeDecodeGroupMember(proc, proc->pid, txn_prepared(txn)))
+        {
+            Assert(!TransactionIdIsInProgress(txn->xid));
+            if (TransactionIdDidCommit(txn->xid))
+            {
+                txn->txn_flags |= TXN_COMMIT;
+                return true;
+            }
+            else
+            {
+                txn->txn_flags |= TXN_ROLLBACK;
+                return false;
+            }
+        }
+    }
Are we ok with this low-level lock / pgproc stuff happening outside of
procarray / lock related files? Where is the locking scheme documented?
Some details are in src/include/storage/proc.h where these fields have
been added.
This implementation is similar to the existing lockGroupLeader
implementation and uses the same locking mechanism using
LockHashPartitionLockByProc.
+/* ReorderBufferTXN flags */
+#define TXN_HAS_CATALOG_CHANGES 0x0001
+#define TXN_IS_SUBXACT 0x0002
+#define TXN_SERIALIZED 0x0004
+#define TXN_PREPARE 0x0008
+#define TXN_COMMIT_PREPARED 0x0010
+#define TXN_ROLLBACK_PREPARED 0x0020
+#define TXN_COMMIT 0x0040
+#define TXN_ROLLBACK 0x0080
+
+/* does the txn have catalog changes */
+#define txn_has_catalog_changes(txn) (txn->txn_flags & TXN_HAS_CATALOG_CHANGES)
+/* is the txn known as a subxact? */
+#define txn_is_subxact(txn) (txn->txn_flags & TXN_IS_SUBXACT)
+/*
+ * Has this transaction been spilled to disk? It's not always possible to
+ * deduce that fact by comparing nentries with nentries_mem, because e.g.
+ * subtransactions of a large transaction might get serialized together
+ * with the parent - if they're restored to memory they'd have
+ * nentries_mem == nentries.
+ */
+#define txn_is_serialized(txn) (txn->txn_flags & TXN_SERIALIZED)
+/* is this txn prepared? */
+#define txn_prepared(txn) (txn->txn_flags & TXN_PREPARE)
+/* was this prepared txn committed in the meanwhile? */
+#define txn_commit_prepared(txn) (txn->txn_flags & TXN_COMMIT_PREPARED)
+/* was this prepared txn aborted in the meanwhile? */
+#define txn_rollback_prepared(txn) (txn->txn_flags & TXN_ROLLBACK_PREPARED)
+/* was this txn committed in the meanwhile? */
+#define txn_commit(txn) (txn->txn_flags & TXN_COMMIT)
+/* was this prepared txn aborted in the meanwhile? */
+#define txn_rollback(txn) (txn->txn_flags & TXN_ROLLBACK)
+
These txn_* names seem too generic imo - fairly likely to conflict with
other pieces of code imo.
Happy to add the RB prefix to all of them for clarity. E.g.
/* ReorderBufferTXN flags */
#define RBTXN_HAS_CATALOG_CHANGES 0x0001
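A slightly fuller sketch of what the prefixed flags and accessors might look like is given below. Only `RBTXN_HAS_CATALOG_CHANGES` and its value come from the message above; `RBTXN_PREPARE`, the lower-case `rbtxn_*` accessor names, `DemoTXN` and `demo_needs_interlock` are assumptions for illustration. The accessors parenthesize their argument and compare against zero so that they yield a clean boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Reduced stand-in for ReorderBufferTXN; only the flag word is modelled. */
typedef struct DemoTXN
{
    uint32_t txn_flags;
} DemoTXN;

/* ReorderBufferTXN flags (second name and value mirror the unprefixed patch) */
#define RBTXN_HAS_CATALOG_CHANGES 0x0001
#define RBTXN_PREPARE             0x0008

/* Accessors: argument parenthesized, result normalized to 0 or 1. */
#define rbtxn_has_catalog_changes(txn) \
    (((txn)->txn_flags & RBTXN_HAS_CATALOG_CHANGES) != 0)
#define rbtxn_prepared(txn) \
    (((txn)->txn_flags & RBTXN_PREPARE) != 0)

/* Hypothetical helper: a prepared txn with catalog changes needs the interlock. */
static bool
demo_needs_interlock(const DemoTXN *txn)
{
    assert(txn != NULL);
    return rbtxn_has_catalog_changes(txn) && rbtxn_prepared(txn);
}
```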
I will submit multiple patches with cleanups where needed as discussed
above soon.
Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Hi,
On 2018-02-12 13:36:16 +0530, Nikhil Sontakke wrote:
Hi Andres,
First off: This patch has way too many different types of changes as
part of one huge commit. This needs to be split into several
pieces. First the cleanups (e.g. the fields -> flag changes), then the
individual infrastructure pieces (like the twophase.c changes, best
split into several pieces as well, the locking stuff), then the main
feature, then support for it in the output plugin. Each should have an
individual explanation about why the change is necessary and not a bad
idea.
Ok, I will break this patch into multiple logical pieces and re-submit.
Thanks.
On 2018-02-06 17:50:40 +0530, Nikhil Sontakke wrote:
@@ -46,6 +48,9 @@ typedef struct
     bool skip_empty_xacts;
     bool xact_wrote_changes;
     bool only_local;
+    bool twophase_decoding;
+    bool twophase_decode_with_catalog_changes;
+    int decode_delay; /* seconds to sleep after every change record */
This seems too big a crock to add just for testing. It'll also make the
testing timing dependent...
The idea *was* to make testing timing dependent. We wanted to simulate
the case when a rollback is issued by another backend while the
decoding is still ongoing. This allows that test case to be tested.
What I mean is that this will be hell on the buildfarm because the
different animals are differently fast.
} TestDecodingData;
void _PG_init(void)
@@ -85,9 +106,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
     cb->begin_cb = pg_decode_begin_txn;
     cb->change_cb = pg_decode_change;
     cb->commit_cb = pg_decode_commit_txn;
+    cb->abort_cb = pg_decode_abort_txn;
     cb->filter_by_origin_cb = pg_decode_filter;
     cb->shutdown_cb = pg_decode_shutdown;
     cb->message_cb = pg_decode_message;
+    cb->filter_prepare_cb = pg_filter_prepare;
+    cb->filter_decode_txn_cb = pg_filter_decode_txn;
+    cb->prepare_cb = pg_decode_prepare_txn;
+    cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+    cb->abort_prepared_cb = pg_decode_abort_prepared_txn;
 }
Why does this introduce both abort_cb and abort_prepared_cb? That seems
to conflate two separate features.
Consider the case when we have a bunch of change records to apply for
a transaction. We sent a "BEGIN" and then start decoding each change
record one by one. Now a rollback was encountered while we were
decoding.
This will be quite the mess once streaming of changes is introduced.
+/* Filter out unnecessary two-phase transactions */
+static bool
+pg_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+                  TransactionId xid, const char *gid)
+{
+    TestDecodingData *data = ctx->output_plugin_private;
+
+    /* treat all transactions as one-phase */
+    if (!data->twophase_decoding)
+        return true;
+
+    if (txn && txn_has_catalog_changes(txn) &&
+        !data->twophase_decode_with_catalog_changes)
+        return true;
What? I'm INCREDIBLY doubtful this is a sane thing to expose to output
plugins. As in, unless I hear a very very convincing reason I'm strongly
opposed.
These bools are specific to the test_decoding plugin.
txn_has_catalog_changes() definitely isn't just exposed to
test_decoding. I think you're making the output plugin interface
massively more complicated in this patch and I think we need to push
back on that.
Again, these are useful in testing decoding in various scenarios with
twophase decoding enabled/disabled. Testing decoding when catalog
changes are allowed/disallowed etc. Please take a look at
"contrib/test_decoding/sql/prepared.sql" for the various scenarios.
I don't see how that addresses my concern in any sort of way.
+/*
+ * Check if we should continue to decode this transaction.
+ *
+ * If it has aborted in the meanwhile, then there's no sense
+ * in decoding and sending the rest of the changes, we might
+ * as well ask the subscribers to abort immediately.
+ *
+ * This should be called if we are streaming a transaction
+ * before it's committed or if we are decoding a 2PC
+ * transaction. Otherwise we always decode committed
+ * transactions
+ *
+ * Additional checks can be added here, as needed
+ */
+static bool
+pg_filter_decode_txn(LogicalDecodingContext *ctx,
+                     ReorderBufferTXN *txn)
+{
+    /*
+     * Due to caching, repeated TransactionIdDidAbort calls
+     * shouldn't be that expensive
+     */
+    if (txn != NULL &&
+        TransactionIdIsValid(txn->xid) &&
+        TransactionIdDidAbort(txn->xid))
+        return true;
+
+    /* if txn is NULL, filter it out */
Why can this be NULL?
Depending on parameters passed to the ReorderBufferTXNByXid()
function, the txn might be NULL in some cases, especially during
restarts.
That a) isn't an explanation of why that's ok, and b) isn't reasoning for
why this ever needs to be exposed to the output plugin.
static bool
pg_decode_filter(LogicalDecodingContext *ctx,
RepOriginId origin_id)
@@ -409,8 +622,18 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
}
data->xact_wrote_changes = true;
+    if (!LogicalLockTransaction(txn))
+        return;
It really really can't be right that this is exposed to output plugins.
This was discussed in the other thread
(http://www.postgresql-archive.org/Logical-Decoding-and-HeapTupleSatisfiesVacuum-assumptions-td5998294i20.html).
Any catalog access in any plugin needs to interlock with concurrent
aborts. This is only a problem if the transaction is a prepared one or
a not yet committed one. In the majority of the remaining cases, this
function will do nothing at all.
That doesn't address at all that it's not ok that the output plugin
needs to handle this. Doing this in output plugins, the majority of
which are external projects, means that a) the work needs to be done
many times. b) we can't simply adjust the relevant code in a minor
release, because every output plugin needs to be changed.
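One way to picture the alternative being argued for is a core-side wrapper that takes and releases the interlock around the plugin's change callback, so the callback itself never touches it. The sketch below is a standalone model: `DemoTXN`, `demo_lock_txn`, `demo_unlock_txn` and `demo_core_apply_change` are hypothetical names, and the "lock" is a plain counter rather than the decode-group mechanism from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct DemoTXN
{
    bool aborted;     /* a concurrent rollback has been observed */
    int  lock_depth;  /* stand-in for the real interlock */
} DemoTXN;

/* Signature of a plugin-provided change callback in this model. */
typedef void (*demo_change_cb) (DemoTXN *txn, int *applied);

/* Refuse the lock once the transaction is known to be aborted. */
static bool
demo_lock_txn(DemoTXN *txn)
{
    if (txn->aborted)
        return false;
    txn->lock_depth++;
    return true;
}

static void
demo_unlock_txn(DemoTXN *txn)
{
    assert(txn->lock_depth > 0);
    txn->lock_depth--;
}

/* Core-side wrapper: lock, invoke the plugin, unlock. */
static bool
demo_core_apply_change(DemoTXN *txn, demo_change_cb cb, int *applied)
{
    assert(txn != NULL && cb != NULL && applied != NULL);

    if (!demo_lock_txn(txn))
        return false;           /* skip the change; plugin is never called */

    cb(txn, applied);
    demo_unlock_txn(txn);
    return true;
}

/* Example plugin callback: it has no knowledge of the interlock. */
static void
demo_plugin_change(DemoTXN *txn, int *applied)
{
    (void) txn;
    (*applied)++;
}
```

Under this arrangement the locking rules could change in a minor release without any external plugin having to be modified, which is point b) above.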
+    /* if decode_delay is specified, sleep with above lock held */
+    if (data->decode_delay > 0)
+    {
+        elog(LOG, "sleeping for %d seconds", data->decode_delay);
+        pg_usleep(data->decode_delay * 1000000L);
+    }
Really not on board.
Again, specific to test_decoding plugin.
Again, this is not a justification. People look at the code to write
output plugins. Also see my above complaint about this going to be hell
to get right on slow buildfarm members - we're going to crank up the
sleep times to make it robust-ish.
+    XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
+
Can we perhaps merge a bit of the code with the plain commit path on
this?
Given that PREPARE ROLLBACK handling is totally separate from the
regular commit code paths, wouldn't it be a little difficult?
Why? A helper function doing so ought to be doable.
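A helper of the kind suggested here might look roughly like the following, so that both the commit path and the prepare path fill the origin fields through one routine. This is a self-contained sketch: `DemoOriginInfo`, `demo_fill_origin` and the `DEMO_*` constants are stand-ins invented for the example, not existing PostgreSQL symbols.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for InvalidRepOriginId / DoNotReplicateId / InvalidXLogRecPtr. */
#define DEMO_INVALID_ORIGIN    0
#define DEMO_DO_NOT_REPLICATE  UINT16_MAX
#define DEMO_INVALID_LSN       0

typedef struct DemoOriginInfo
{
    uint64_t origin_lsn;
    int64_t  origin_timestamp;
} DemoOriginInfo;

/*
 * Fill the origin fields of a record header and report whether a
 * replication origin is in effect. Callers on either path would use the
 * return value to decide whether to advance the origin afterwards.
 */
static bool
demo_fill_origin(uint16_t session_origin, uint64_t session_lsn,
                 int64_t session_ts, DemoOriginInfo *out)
{
    bool replorigin = (session_origin != DEMO_INVALID_ORIGIN &&
                       session_origin != DEMO_DO_NOT_REPLICATE);

    assert(out != NULL);

    if (replorigin)
    {
        assert(session_lsn != DEMO_INVALID_LSN);
        out->origin_lsn = session_lsn;
        out->origin_timestamp = session_ts;
    }
    else
    {
        out->origin_lsn = DEMO_INVALID_LSN;
        out->origin_timestamp = 0;
    }
    return replorigin;
}
```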
@@ -1386,8 +1449,20 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
     /*
      * Validate the GID, and lock the GXACT to ensure that two backends do not
      * try to commit the same GID at once.
+     *
+     * During logical decoding, on the apply side, it's possible that a prepared
+     * transaction got aborted while decoding. In that case, we stop the
+     * decoding and abort the transaction immediately. However the ROLLBACK
+     * prepared processing still reaches the subscriber. In that case it's ok
+     * to have a missing gid
      */
-    gxact = LockGXact(gid, GetUserId());
+    gxact = LockGXact(gid, GetUserId(), missing_ok);
+    if (gxact == NULL)
+    {
+        Assert(missing_ok && !isCommit);
+        return;
+    }
I'm very doubtful it is sane to handle this at such a low level.
FinishPreparedTransaction() is called directly from ProcessUtility. If
not here, where else could we do this?
I don't think this is something that ought to be handled at this layer
at all. You should get an error in that case, the replay logic needs to
handle that, not the low level 2pc code.
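One possible reading of "the replay logic needs to handle that" is that the apply side decides what a missing gid means, while the low-level routine stays strict. The sketch below models that split with a plain array of prepared gids; every name in it (`DemoResult`, `demo_gid_exists`, `demo_apply_finish_prepared`) is hypothetical, and checking for existence up front is just one way to do it, as opposed to catching the error.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef enum DemoResult
{
    DEMO_FINISHED,         /* gid found; the strict low-level routine would run */
    DEMO_SKIPPED_MISSING,  /* rollback of a gid that was never prepared here */
    DEMO_ERROR_MISSING     /* commit of an unknown gid: a genuine error */
} DemoResult;

/* Linear lookup over the locally prepared gids (model only). */
static bool
demo_gid_exists(const char *const *prepared, size_t n, const char *gid)
{
    for (size_t i = 0; i < n; i++)
    {
        if (strcmp(prepared[i], gid) == 0)
            return true;
    }
    return false;
}

/* Apply-side policy: tolerate a missing gid only for ROLLBACK PREPARED. */
static DemoResult
demo_apply_finish_prepared(const char *const *prepared, size_t n,
                           const char *gid, bool is_commit)
{
    assert(gid != NULL);

    if (demo_gid_exists(prepared, n, gid))
        return DEMO_FINISHED;

    return is_commit ? DEMO_ERROR_MISSING : DEMO_SKIPPED_MISSING;
}
```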
@@ -2358,6 +2443,13 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
     Assert(TwoPhaseState->numPrepXacts < max_prepared_xacts);
     TwoPhaseState->prepXacts[TwoPhaseState->numPrepXacts++] = gxact;
+    if (origin_id != InvalidRepOriginId)
+    {
+        /* recover apply progress */
+        replorigin_advance(origin_id, hdr->origin_lsn, end_lsn,
+                           false /* backward */ , false /* WAL */ );
+    }
+
It's unclear to me why this is necessary / a good idea?
Keeping PREPARE handling as close to regular COMMIT handling seems
like a good idea, no?
But this code *means* something? Explain to me why it's a good idea to
advance, or don't do it.
Greetings,
Andres Freund
Hi Andres,
First off: This patch has way too many different types of changes as
part of one huge commit. This needs to be split into several
pieces. First the cleanups (e.g. the fields -> flag changes), then the
individual infrastructure pieces (like the twophase.c changes, best
split into several pieces as well, the locking stuff), then the main
feature, then support for it in the output plugin. Each should have an
individual explanation about why the change is necessary and not a bad
idea.
Ok, I will break this patch into multiple logical pieces and re-submit.
Thanks.
Attached are 5 patches split up from the original patch that I had
submitted earlier.
ReorderBufferTXN_flags_cleanup_1.patch:
cleanup of the ReorderBufferTXN bools and addition of some new flags
that following patches will need.
Logical_lock_unlock_api_2.patch:
Streaming changes of uncommitted transactions and of prepared
transactions runs the risk of aborts (rollback prepared) happening
while we are decoding. It's not a problem for most transactions, but
transactions which make catalog changes need to get a
consistent view of the metadata so that the decoding does not behave
unpredictably when such concurrent aborts occur. We came up with
the concept of a logical locking/unlocking API to safeguard access to
catalog tables. This patch contains the implementation for this
functionality.
2PC_gid_wal_and_2PC_origin_tracking_3.patch:
We now store the 2PC gid in the commit/abort records. This allows us
to send the proper gid to the downstream across restarts. We also want
to avoid receiving the prepared transaction AGAIN from the upstream,
so we use replorigin tracking across prepared transactions.
reorderbuffer_2PC_logic_4.patch:
Add decoding logic to understand PREPARE related wal records and
relevant changes in the reorderbuffer logic to deal with 2PC. This
includes logic to handle concurrent rollbacks while we are going
through the change buffers belonging to a prepared or uncommitted
transaction.
pgoutput_plugin_support_2PC_5.patch:
Logical protocol changes to apply and send changes via the internal
pgoutput output plugin. Includes test case and relevant documentation
changes.
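As a rough illustration of the GID logging described for patch 3 above: the GID is written with its terminating '\0' and zero-padded to an alignment boundary, and the reader advances by the padded length. The sketch below is self-contained and all demo_* / DEMO_* names are hypothetical; in particular DEMO_ALIGNOF = 8 is an assumption standing in for MAXIMUM_ALIGNOF, and a plain buffer stands in for the WAL record.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_GIDSIZE 200	/* mirrors GIDSIZE from the patch */
#define DEMO_ALIGNOF 8		/* assumption: stands in for MAXIMUM_ALIGNOF */
#define DEMO_MAXALIGN(len) \
	(((size_t) (len) + (DEMO_ALIGNOF - 1)) & ~((size_t) (DEMO_ALIGNOF - 1)))

/*
 * Append the GID (including '\0') plus zero padding to dst.
 * Returns the number of bytes written, always a multiple of DEMO_ALIGNOF.
 */
static size_t
demo_write_gid(char *dst, const char *gid)
{
	size_t		gidlen = strlen(gid) + 1;	/* include '\0' */
	size_t		padded = DEMO_MAXALIGN(gidlen);

	assert(gidlen <= DEMO_GIDSIZE);
	memcpy(dst, gid, gidlen);
	memset(dst + gidlen, 0, padded - gidlen);
	return padded;
}

/*
 * Copy the GID out of src into gid_out (at least DEMO_GIDSIZE bytes).
 * Returns the number of bytes consumed, so the caller can advance its
 * data pointer past the padding to the next record section.
 */
static size_t
demo_read_gid(const char *src, char *gid_out)
{
	size_t		gidlen = strlen(src) + 1;

	assert(gidlen <= DEMO_GIDSIZE);
	memcpy(gid_out, src, gidlen);
	return DEMO_MAXALIGN(gidlen);
}
```

The point of returning the padded length from both sides is that writer and reader stay in step without storing an explicit length field for the GID.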
Besides the above, you had feedback around the test_decoding plugin
and the use of sleep() etc. I will submit a follow-on patch for the
test_decoding plugin stuff soon.
More comments inline below.
bool only_local;
+ bool twophase_decoding;
+ bool twophase_decode_with_catalog_changes;
+ int decode_delay; /* seconds to sleep after every change record */
This seems too big a crock to add just for testing. It'll also make the
testing timing dependent...
The idea *was* to make testing timing dependent. We wanted to simulate
the case when a rollback is issued by another backend while the
decoding is still ongoing. This allows that test case to be tested.
What I mean is that this will be hell on the buildfarm because the
different animals are differently fast.
Will handle this in the test_decoding plugin patch soon.
+/* Filter out unnecessary two-phase transactions */
+static bool
+pg_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ /* treat all transactions as one-phase */
+ if (!data->twophase_decoding)
+ return true;
+
+ if (txn && txn_has_catalog_changes(txn) &&
+ !data->twophase_decode_with_catalog_changes)
+ return true;
What? I'm INCREDIBLY doubtful this is a sane thing to expose to output
plugins. As in, unless I hear a very very convincing reason I'm strongly
opposed.
These bools are specific to the test_decoding plugin.
Will handle in the test_decoding plugin patch soon.
txn_has_catalog_changes() definitely isn't just exposed to
test_decoding. I think you're making the output plugin interface
massively more complicated in this patch and I think we need to push
back on that.
Again, these are useful in testing decoding in various scenarios with
twophase decoding enabled/disabled. Testing decoding when catalog
changes are allowed/disallowed etc. Please take a look at
"contrib/test_decoding/sql/prepared.sql" for the various scenarios.
I don't see how that addresses my concern in any sort of way.
Will handle in the test_decoding plugin patch soon.
+/*
+ * Check if we should continue to decode this transaction.
+ *
+ * If it has aborted in the meanwhile, then there's no sense
+ * in decoding and sending the rest of the changes, we might
+ * as well ask the subscribers to abort immediately.
+ *
+ * This should be called if we are streaming a transaction
+ * before it's committed or if we are decoding a 2PC
+ * transaction. Otherwise we always decode committed
+ * transactions
+ *
+ * Additional checks can be added here, as needed
+ */
+static bool
+pg_filter_decode_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn)
+{
+ /*
+ * Due to caching, repeated TransactionIdDidAbort calls
+ * shouldn't be that expensive
+ */
+ if (txn != NULL &&
+ TransactionIdIsValid(txn->xid) &&
+ TransactionIdDidAbort(txn->xid))
+ return true;
+
+ /* if txn is NULL, filter it out */
Why can this be NULL?
Depending on parameters passed to the ReorderBufferTXNByXid()
function, the txn might be NULL in some cases, especially during
restarts.
That a) isn't an explanation why that's ok b) reasoning why this ever
needs to be exposed to the output plugin.
Removing this pg_filter_decode_txn() function. You are right, there's
no need to expose this function to the output plugin and we can make
the decision entirely inside the ReorderBuffer code handling.
static bool
pg_decode_filter(LogicalDecodingContext *ctx,
RepOriginId origin_id)
@@ -409,8 +622,18 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
}
data->xact_wrote_changes = true;
+ if (!LogicalLockTransaction(txn))
+ return;
It really really can't be right that this is exposed to output plugins.
This was discussed in the other thread
(http://www.postgresql-archive.org/Logical-Decoding-and-HeapTupleSatisfiesVacuum-assumptions-td5998294i20.html).
Any catalog access in any plugin needs to interlock with concurrent
aborts. This is only a problem if the transaction is a prepared one or
a yet uncommitted one. In the majority of the remaining cases, this function
will do nothing at all.
That doesn't address at all that it's not ok that the output plugin
needs to handle this. Doing this in output plugins, the majority of
which are external projects, means that a) the work needs to be done
many times. b) we can't simply adjust the relevant code in a minor
release, because every output plugin needs to be changed.
How do we know if the external project is going to access catalog
data? How do we ensure that the data that they access is safe from
concurrent aborts if we are decoding uncommitted or prepared
transactions? We are providing a guideline here and recommending them
to use these APIs if they need to.
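To make the intended usage pattern concrete, here is a minimal single-threaded sketch of a change callback that brackets its catalog access with a lock/unlock pair and bails out when the lock is refused. Everything named demo_* is hypothetical and only models the behaviour described in this thread; it is not the code from Logical_lock_unlock_api_2.patch. In particular, the rule that an abort has to wait until no decoder is inside is an assumption of this sketch about how such an interlock could behave.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a transaction as seen by the decoder */
typedef struct DemoTxn
{
	bool		prepared_or_uncommitted;	/* decoded before final commit? */
	bool		aborted;					/* set by a concurrent rollback */
	int			decoders_inside;			/* holders of the logical lock */
} DemoTxn;

/* Stand-in for LogicalLockTransaction(): false means "stop decoding" */
static bool
demo_logical_lock(DemoTxn *txn)
{
	assert(txn != NULL);
	if (!txn->prepared_or_uncommitted)
		return true;			/* committed txn: nothing to interlock with */
	if (txn->aborted)
		return false;			/* concurrent abort already happened */
	txn->decoders_inside++;
	return true;
}

/* Stand-in for LogicalUnlockTransaction() */
static void
demo_logical_unlock(DemoTxn *txn)
{
	if (txn->prepared_or_uncommitted && txn->decoders_inside > 0)
		txn->decoders_inside--;
}

/* Abort side (assumption): refused while a decoder is inside */
static bool
demo_try_abort(DemoTxn *txn)
{
	if (txn->decoders_inside > 0)
		return false;
	txn->aborted = true;
	return true;
}

/* Change callback in the style of pg_decode_change(); returns changes emitted */
static int
demo_decode_change(DemoTxn *txn)
{
	if (!demo_logical_lock(txn))
		return 0;				/* txn was aborted, skip the change */

	/* ... catalog lookups and output of the change would go here ... */

	demo_logical_unlock(txn);
	return 1;
}
```

For an already committed transaction the lock call is a no-op that returns true, which matches the statement above that in most cases the function does nothing at all.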
+ /* if decode_delay is specified, sleep with above lock held */
+ if (data->decode_delay > 0)
+ {
+ elog(LOG, "sleeping for %d seconds", data->decode_delay);
+ pg_usleep(data->decode_delay * 1000000L);
+ }
Really not on board.
Again, specific to test_decoding plugin.
Again, this is not a justification. People look at the code to write
output plugins. Also see my above complaint about this going to be hell
to get right on slow buildfarm members - we're going to crank up the
sleep times to make it robust-ish.
Sure, as mentioned above, will come up with a different way for the
test_decoding plugin later.
+ XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
+
Can we perhaps merge a bit of the code with the plain commit path on
this?
Given that PREPARE ROLLBACK handling is totally separate from the
regular commit code paths, wouldn't it be a little difficult?
Why? A helper function doing so ought to be doable.
Can you elaborate on what exactly you mean here?
@@ -1386,8 +1449,20 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
/*
* Validate the GID, and lock the GXACT to ensure that two backends do not
* try to commit the same GID at once.
+ *
+ * During logical decoding, on the apply side, it's possible that a prepared
+ * transaction got aborted while decoding. In that case, we stop the
+ * decoding and abort the transaction immediately. However the ROLLBACK
+ * prepared processing still reaches the subscriber. In that case it's ok
+ * to have a missing gid
*/
- gxact = LockGXact(gid, GetUserId());
+ gxact = LockGXact(gid, GetUserId(), missing_ok);
+ if (gxact == NULL)
+ {
+ Assert(missing_ok && !isCommit);
+ return;
+ }
I'm very doubtful it is sane to handle this at such a low level.
FinishPreparedTransaction() is called directly from ProcessUtility. If
not here, where else could we do this?
I don't think this is something that ought to be handled at this layer
at all. You should get an error in that case, the replay logic needs to
handle that, not the low level 2pc code.
Removed the above changes. The replay logic now checks whether the GID
still exists in the ROLLBACK PREPARED codepath. If not, it returns
immediately. In the case of COMMIT PREPARED replay, the GID obviously
has to exist at the downstream.
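The decision described above can be sketched as a small self-contained function. All names here (demo_gid_exists, demo_replay_finish_prepared, the DemoReplayResult values) are hypothetical, and an array of strings stands in for the set of prepared transactions on the downstream; this only illustrates the rule, it is not the apply-side code from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef enum DemoReplayResult
{
	DEMO_REPLAY_DONE,			/* GID found, COMMIT/ROLLBACK PREPARED executed */
	DEMO_REPLAY_SKIPPED,		/* rollback of a GID that no longer exists */
	DEMO_REPLAY_ERROR			/* commit of a GID that does not exist */
} DemoReplayResult;

/* Linear lookup of gid in the (hypothetical) list of prepared transactions */
static bool
demo_gid_exists(const char *const *prepared, size_t nprepared, const char *gid)
{
	for (size_t i = 0; i < nprepared; i++)
	{
		if (strcmp(prepared[i], gid) == 0)
			return true;
	}
	return false;
}

/*
 * Replay-side decision: a missing GID is tolerated only on the rollback
 * path, because decoding may have stopped early and aborted the
 * transaction locally before the ROLLBACK PREPARED arrived.
 */
static DemoReplayResult
demo_replay_finish_prepared(const char *const *prepared, size_t nprepared,
							const char *gid, bool is_commit)
{
	assert(gid != NULL);

	if (demo_gid_exists(prepared, nprepared, gid))
		return DEMO_REPLAY_DONE;

	return is_commit ? DEMO_REPLAY_ERROR : DEMO_REPLAY_SKIPPED;
}
```

Keeping this check in the replay logic leaves FinishPreparedTransaction() free to raise its usual error for an unknown GID.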
@@ -2358,6 +2443,13 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
Assert(TwoPhaseState->numPrepXacts < max_prepared_xacts);
TwoPhaseState->prepXacts[TwoPhaseState->numPrepXacts++] = gxact;
+ if (origin_id != InvalidRepOriginId)
+ {
+ /* recover apply progress */
+ replorigin_advance(origin_id, hdr->origin_lsn, end_lsn,
+ false /* backward */ , false /* WAL */ );
+ }
+
It's unclear to me why this is necessary / a good idea?
Keeping PREPARE handling as close to regular COMMIT handling seems
like a good idea, no?
But this code *means* something? Explain to me why it's a good idea to
advance, or don't do it.
We want to do this to use it as protection against receiving the
prepared tx again.
Other than the above,
*) Changed the flags and added "RB" prefix to all flags and macros.
*) Added a few fields into existing xl_xact_parsed_commit record and avoided
creating an entirely new xl_xact_parsed_prepare record.
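For reference, the shape of the flags change (separate bools folded into one flags word, with accessor macros) can be sketched in isolation as below. DemoTXN and demo_mark_subxact are hypothetical stand-ins; the flag values are the ones from ReorderBufferTXN_flags_cleanup_1.patch. The accessors here additionally parenthesize the macro argument and normalize the result to 0/1, which is a choice of this sketch rather than what the patch does.

```c
#include <assert.h>
#include <stddef.h>

/* Flag bits, same values as in the patch */
#define RBTXN_HAS_CATALOG_CHANGES 0x0001
#define RBTXN_IS_SUBXACT          0x0002
#define RBTXN_SERIALIZED          0x0004
#define RBTXN_PREPARE             0x0008

/* Hypothetical stand-in for ReorderBufferTXN; only the flags word is modelled */
typedef struct DemoTXN
{
	int			txn_flags;
} DemoTXN;

/* Accessors in the style of the patch's rbtxn_* macros */
#define rbtxn_has_catalog_changes(txn) \
	(((txn)->txn_flags & RBTXN_HAS_CATALOG_CHANGES) != 0)
#define rbtxn_is_subxact(txn) \
	(((txn)->txn_flags & RBTXN_IS_SUBXACT) != 0)
#define rbtxn_is_serialized(txn) \
	(((txn)->txn_flags & RBTXN_SERIALIZED) != 0)
#define rbtxn_prepared(txn) \
	(((txn)->txn_flags & RBTXN_PREPARE) != 0)

/* Replacement for the old "txn->is_known_as_subxact = true" assignment */
static void
demo_mark_subxact(DemoTXN *txn)
{
	assert(txn != NULL);
	txn->txn_flags |= RBTXN_IS_SUBXACT;
}
```

New states such as "prepared" then cost one more bit instead of another struct field.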
Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Attachments:
2PC_gid_wal_and_2PC_origin_tracking_3.patch (application/octet-stream)
commit f75e729376ed5215aa2e5aceee35a056029909c3
Author: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Wed Feb 28 10:49:13 2018 +0530
Add support for logging GID in commit/abort WAL records for 2PC
transactions. Also support replica origin tracking for 2PC
Store GID of 2PC in commit/abort WAL records. This allows logical
decoding to send the SAME gid to subscribers across restarts.
We also track origin replica replay progress for 2PC now.
diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index e5eef9ea43..b3e2fc3036 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -102,6 +102,14 @@ ParseCommitRecord(uint8 info, xl_xact_commit *xlrec, xl_xact_parsed_commit *pars
parsed->twophase_xid = xl_twophase->xid;
data += sizeof(xl_xact_twophase);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_GID)
+ {
+ int gidlen;
+ strcpy(parsed->twophase_gid, data);
+ gidlen = strlen(parsed->twophase_gid) + 1;
+ data += MAXALIGN(gidlen);
+ }
}
if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
@@ -139,6 +147,16 @@ ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_abort *parsed)
data += sizeof(xl_xact_xinfo);
}
+ if (parsed->xinfo & XACT_XINFO_HAS_DBINFO)
+ {
+ xl_xact_dbinfo *xl_dbinfo = (xl_xact_dbinfo *) data;
+
+ parsed->dbId = xl_dbinfo->dbId;
+ parsed->tsId = xl_dbinfo->tsId;
+
+ data += sizeof(xl_xact_dbinfo);
+ }
+
if (parsed->xinfo & XACT_XINFO_HAS_SUBXACTS)
{
xl_xact_subxacts *xl_subxacts = (xl_xact_subxacts *) data;
@@ -168,6 +186,27 @@ ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_abort *parsed)
parsed->twophase_xid = xl_twophase->xid;
data += sizeof(xl_xact_twophase);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_GID)
+ {
+ int gidlen;
+ strcpy(parsed->twophase_gid, data);
+ gidlen = strlen(parsed->twophase_gid) + 1;
+ data += MAXALIGN(gidlen);
+ }
+ }
+
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ xl_xact_origin xl_origin;
+
+ /* we're only guaranteed 4 byte alignment, so copy onto stack */
+ memcpy(&xl_origin, data, sizeof(xl_origin));
+
+ parsed->origin_lsn = xl_origin.origin_lsn;
+ parsed->origin_timestamp = xl_origin.origin_timestamp;
+
+ data += sizeof(xl_xact_origin);
}
}
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index c479c4881b..d6e4b7980f 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -144,11 +144,7 @@ int max_prepared_xacts = 0;
*
* typedef struct GlobalTransactionData *GlobalTransaction appears in
* twophase.h
- *
- * Note that the max value of GIDSIZE must fit in the uint16 gidlen,
- * specified in TwoPhaseFileHeader.
*/
-#define GIDSIZE 200
typedef struct GlobalTransactionData
{
@@ -211,12 +207,14 @@ static void RecordTransactionCommitPrepared(TransactionId xid,
RelFileNode *rels,
int ninvalmsgs,
SharedInvalidationMessage *invalmsgs,
- bool initfileinval);
+ bool initfileinval,
+ const char *gid);
static void RecordTransactionAbortPrepared(TransactionId xid,
int nchildren,
TransactionId *children,
int nrels,
- RelFileNode *rels);
+ RelFileNode *rels,
+ const char *gid);
static void ProcessRecords(char *bufptr, TransactionId xid,
const TwoPhaseCallback callbacks[]);
static void RemoveGXact(GlobalTransaction gxact);
@@ -898,7 +896,7 @@ TwoPhaseGetDummyProc(TransactionId xid)
/*
* Header for a 2PC state file
*/
-#define TWOPHASE_MAGIC 0x57F94533 /* format identifier */
+#define TWOPHASE_MAGIC 0x57F94534 /* format identifier */
typedef struct TwoPhaseFileHeader
{
@@ -914,6 +912,8 @@ typedef struct TwoPhaseFileHeader
int32 ninvalmsgs; /* number of cache invalidation messages */
bool initfileinval; /* does relcache init file need invalidation? */
uint16 gidlen; /* length of the GID - GID follows the header */
+ XLogRecPtr origin_lsn; /* lsn of this record at origin node */
+ TimestampTz origin_timestamp; /* time of prepare at origin node */
} TwoPhaseFileHeader;
/*
@@ -1065,6 +1065,7 @@ EndPrepare(GlobalTransaction gxact)
{
TwoPhaseFileHeader *hdr;
StateFileChunk *record;
+ bool replorigin;
/* Add the end sentinel to the list of 2PC records */
RegisterTwoPhaseRecord(TWOPHASE_RM_END_ID, 0,
@@ -1075,6 +1076,21 @@ EndPrepare(GlobalTransaction gxact)
Assert(hdr->magic == TWOPHASE_MAGIC);
hdr->total_len = records.total_len + sizeof(pg_crc32c);
+ replorigin = (replorigin_session_origin != InvalidRepOriginId &&
+ replorigin_session_origin != DoNotReplicateId);
+
+ if (replorigin)
+ {
+ Assert(replorigin_session_origin_lsn != InvalidXLogRecPtr);
+ hdr->origin_lsn = replorigin_session_origin_lsn;
+ hdr->origin_timestamp = replorigin_session_origin_timestamp;
+ }
+ else
+ {
+ hdr->origin_lsn = InvalidXLogRecPtr;
+ hdr->origin_timestamp = 0;
+ }
+
/*
* If the data size exceeds MaxAllocSize, we won't be able to read it in
* ReadTwoPhaseFile. Check for that now, rather than fail in the case
@@ -1107,7 +1123,16 @@ EndPrepare(GlobalTransaction gxact)
XLogBeginInsert();
for (record = records.head; record != NULL; record = record->next)
XLogRegisterData(record->data, record->len);
+
+ XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
+
gxact->prepare_end_lsn = XLogInsert(RM_XACT_ID, XLOG_XACT_PREPARE);
+
+ if (replorigin)
+ /* Move LSNs forward for this replication origin */
+ replorigin_session_advance(replorigin_session_origin_lsn,
+ gxact->prepare_end_lsn);
+
XLogFlush(gxact->prepare_end_lsn);
/* If we crash now, we have prepared: WAL replay will fix things */
@@ -1283,6 +1308,44 @@ ReadTwoPhaseFile(TransactionId xid, bool give_warnings)
return buf;
}
+/*
+ * ParsePrepareRecord
+ */
+void
+ParsePrepareRecord(uint8 info, char *xlrec, xl_xact_parsed_prepare *parsed)
+{
+ TwoPhaseFileHeader *hdr;
+ char *bufptr;
+
+ hdr = (TwoPhaseFileHeader *) xlrec;
+ bufptr = xlrec + MAXALIGN(sizeof(TwoPhaseFileHeader));
+
+ parsed->origin_lsn = hdr->origin_lsn;
+ parsed->origin_timestamp = hdr->origin_timestamp;
+ parsed->twophase_xid = hdr->xid;
+ parsed->dbId = hdr->database;
+ parsed->nsubxacts = hdr->nsubxacts;
+ parsed->nrels = hdr->ncommitrels;
+ parsed->nabortrels = hdr->nabortrels;
+ parsed->nmsgs = hdr->ninvalmsgs;
+
+ strncpy(parsed->twophase_gid, bufptr, hdr->gidlen);
+ bufptr += MAXALIGN(hdr->gidlen);
+
+ parsed->subxacts = (TransactionId *) bufptr;
+ bufptr += MAXALIGN(hdr->nsubxacts * sizeof(TransactionId));
+
+ parsed->xnodes = (RelFileNode *) bufptr;
+ bufptr += MAXALIGN(hdr->ncommitrels * sizeof(RelFileNode));
+
+ parsed->abortnodes = (RelFileNode *) bufptr;
+ bufptr += MAXALIGN(hdr->nabortrels * sizeof(RelFileNode));
+
+ parsed->msgs = (SharedInvalidationMessage *) bufptr;
+ bufptr += MAXALIGN(hdr->ninvalmsgs * sizeof(SharedInvalidationMessage));
+}
+
+
/*
* Reads 2PC data from xlog. During checkpoint this data will be moved to
@@ -1435,11 +1498,12 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
hdr->nsubxacts, children,
hdr->ncommitrels, commitrels,
hdr->ninvalmsgs, invalmsgs,
- hdr->initfileinval);
+ hdr->initfileinval, gid);
else
RecordTransactionAbortPrepared(xid,
hdr->nsubxacts, children,
- hdr->nabortrels, abortrels);
+ hdr->nabortrels, abortrels,
+ gid);
ProcArrayRemove(proc, latestXid);
@@ -1752,7 +1816,8 @@ restoreTwoPhaseData(void)
if (buf == NULL)
continue;
- PrepareRedoAdd(buf, InvalidXLogRecPtr, InvalidXLogRecPtr);
+ PrepareRedoAdd(buf, InvalidXLogRecPtr,
+ InvalidXLogRecPtr, InvalidRepOriginId);
}
}
LWLockRelease(TwoPhaseStateLock);
@@ -2165,7 +2230,8 @@ RecordTransactionCommitPrepared(TransactionId xid,
RelFileNode *rels,
int ninvalmsgs,
SharedInvalidationMessage *invalmsgs,
- bool initfileinval)
+ bool initfileinval,
+ const char *gid)
{
XLogRecPtr recptr;
TimestampTz committs = GetCurrentTimestamp();
@@ -2193,7 +2259,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
ninvalmsgs, invalmsgs,
initfileinval, false,
MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
- xid);
+ xid, gid);
if (replorigin)
@@ -2255,7 +2321,8 @@ RecordTransactionAbortPrepared(TransactionId xid,
int nchildren,
TransactionId *children,
int nrels,
- RelFileNode *rels)
+ RelFileNode *rels,
+ const char *gid)
{
XLogRecPtr recptr;
@@ -2278,7 +2345,7 @@ RecordTransactionAbortPrepared(TransactionId xid,
nchildren, children,
nrels, rels,
MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
- xid);
+ xid, gid);
/* Always flush, since we're about to remove the 2PC state file */
XLogFlush(recptr);
@@ -2309,7 +2376,8 @@ RecordTransactionAbortPrepared(TransactionId xid,
* data, the entry is marked as located on disk.
*/
void
-PrepareRedoAdd(char *buf, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn, RepOriginId origin_id)
{
TwoPhaseFileHeader *hdr = (TwoPhaseFileHeader *) buf;
char *bufptr;
@@ -2358,6 +2426,13 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
Assert(TwoPhaseState->numPrepXacts < max_prepared_xacts);
TwoPhaseState->prepXacts[TwoPhaseState->numPrepXacts++] = gxact;
+ if (origin_id != InvalidRepOriginId)
+ {
+ /* recover apply progress */
+ replorigin_advance(origin_id, hdr->origin_lsn, end_lsn,
+ false /* backward */ , false /* WAL */ );
+ }
+
elog(DEBUG2, "added 2PC data in shared memory for transaction %u", gxact->xid);
}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index dbaaf8e005..93c00e1c0a 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1227,7 +1227,7 @@ RecordTransactionCommit(void)
nmsgs, invalMessages,
RelcacheInitFileInval, forceSyncCommit,
MyXactFlags,
- InvalidTransactionId /* plain commit */ );
+ InvalidTransactionId, NULL /* plain commit */ );
if (replorigin)
/* Move LSNs forward for this replication origin */
@@ -1579,7 +1579,8 @@ RecordTransactionAbort(bool isSubXact)
XactLogAbortRecord(xact_time,
nchildren, children,
nrels, rels,
- MyXactFlags, InvalidTransactionId);
+ MyXactFlags, InvalidTransactionId,
+ NULL);
/*
* Report the latest async abort LSN, so that the WAL writer knows to
@@ -5260,7 +5261,8 @@ XactLogCommitRecord(TimestampTz commit_time,
int nrels, RelFileNode *rels,
int nmsgs, SharedInvalidationMessage *msgs,
bool relcacheInval, bool forceSync,
- int xactflags, TransactionId twophase_xid)
+ int xactflags, TransactionId twophase_xid,
+ const char *twophase_gid)
{
xl_xact_commit xlrec;
xl_xact_xinfo xl_xinfo;
@@ -5272,6 +5274,7 @@ XactLogCommitRecord(TimestampTz commit_time,
xl_xact_origin xl_origin;
uint8 info;
+ int gidlen = 0;
Assert(CritSectionCount > 0);
@@ -5334,6 +5337,13 @@ XactLogCommitRecord(TimestampTz commit_time,
{
xl_xinfo.xinfo |= XACT_XINFO_HAS_TWOPHASE;
xl_twophase.xid = twophase_xid;
+ Assert(twophase_gid != NULL);
+
+ if (XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_GID;
+ gidlen = strlen(twophase_gid) + 1; /* include '\0' */
+ }
}
/* dump transaction origin information */
@@ -5384,7 +5394,16 @@ XactLogCommitRecord(TimestampTz commit_time,
}
if (xl_xinfo.xinfo & XACT_XINFO_HAS_TWOPHASE)
+ {
XLogRegisterData((char *) (&xl_twophase), sizeof(xl_xact_twophase));
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_GID)
+ {
+ static const char zeroes[MAXIMUM_ALIGNOF] = { 0 };
+ XLogRegisterData((char*) twophase_gid, gidlen);
+ if (MAXALIGN(gidlen) != gidlen)
+ XLogRegisterData((char*) zeroes, MAXALIGN(gidlen) - gidlen);
+ }
+ }
if (xl_xinfo.xinfo & XACT_XINFO_HAS_ORIGIN)
XLogRegisterData((char *) (&xl_origin), sizeof(xl_xact_origin));
@@ -5405,15 +5424,19 @@ XLogRecPtr
XactLogAbortRecord(TimestampTz abort_time,
int nsubxacts, TransactionId *subxacts,
int nrels, RelFileNode *rels,
- int xactflags, TransactionId twophase_xid)
+ int xactflags, TransactionId twophase_xid,
+ const char *twophase_gid)
{
xl_xact_abort xlrec;
xl_xact_xinfo xl_xinfo;
xl_xact_subxacts xl_subxacts;
xl_xact_relfilenodes xl_relfilenodes;
xl_xact_twophase xl_twophase;
+ xl_xact_dbinfo xl_dbinfo;
+ xl_xact_origin xl_origin;
uint8 info;
+ int gidlen = 0;
Assert(CritSectionCount > 0);
@@ -5449,6 +5472,31 @@ XactLogAbortRecord(TimestampTz abort_time,
{
xl_xinfo.xinfo |= XACT_XINFO_HAS_TWOPHASE;
xl_twophase.xid = twophase_xid;
+ Assert(twophase_gid != NULL);
+
+ if (XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_GID;
+ gidlen = strlen(twophase_gid) + 1; /* include '\0' */
+ }
+ }
+
+ if (TransactionIdIsValid(twophase_xid) && XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_DBINFO;
+ xl_dbinfo.dbId = MyDatabaseId;
+ xl_dbinfo.tsId = MyDatabaseTableSpace;
+ }
+
+ /* dump transaction origin information only for abort prepared */
+ if ( (replorigin_session_origin != InvalidRepOriginId) &&
+ TransactionIdIsValid(twophase_xid) &&
+ XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_ORIGIN;
+
+ xl_origin.origin_lsn = replorigin_session_origin_lsn;
+ xl_origin.origin_timestamp = replorigin_session_origin_timestamp;
}
if (xl_xinfo.xinfo != 0)
@@ -5463,6 +5511,10 @@ XactLogAbortRecord(TimestampTz abort_time,
if (xl_xinfo.xinfo != 0)
XLogRegisterData((char *) (&xl_xinfo), sizeof(xl_xinfo));
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_DBINFO)
+ XLogRegisterData((char *) (&xl_dbinfo), sizeof(xl_dbinfo));
+
+
if (xl_xinfo.xinfo & XACT_XINFO_HAS_SUBXACTS)
{
XLogRegisterData((char *) (&xl_subxacts),
@@ -5480,7 +5532,22 @@ XactLogAbortRecord(TimestampTz abort_time,
}
if (xl_xinfo.xinfo & XACT_XINFO_HAS_TWOPHASE)
+ {
XLogRegisterData((char *) (&xl_twophase), sizeof(xl_xact_twophase));
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_GID)
+ {
+ static const char zeroes[MAXIMUM_ALIGNOF] = { 0 };
+ XLogRegisterData((char*) twophase_gid, gidlen);
+ if (MAXALIGN(gidlen) != gidlen)
+ XLogRegisterData((char*) zeroes, MAXALIGN(gidlen) - gidlen);
+ }
+ }
+
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_ORIGIN)
+ XLogRegisterData((char *) (&xl_origin), sizeof(xl_xact_origin));
+
+ if (TransactionIdIsValid(twophase_xid))
+ XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
return XLogInsert(RM_XACT_ID, info);
}
@@ -5803,7 +5870,8 @@ xact_redo(XLogReaderState *record)
LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
PrepareRedoAdd(XLogRecGetData(record),
record->ReadRecPtr,
- record->EndRecPtr);
+ record->EndRecPtr,
+ XLogRecGetOrigin(record));
LWLockRelease(TwoPhaseStateLock);
}
else if (info == XLOG_XACT_ASSIGNMENT)
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 34d9470811..f05cde202f 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -15,6 +15,7 @@
#define TWOPHASE_H
#include "access/xlogdefs.h"
+#include "access/xact.h"
#include "datatype/timestamp.h"
#include "storage/lock.h"
@@ -46,6 +47,8 @@ extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
int *nxids_p);
+extern void ParsePrepareRecord(uint8 info, char *xlrec,
+ xl_xact_parsed_prepare *parsed);
extern void StandbyRecoverPreparedTransactions(void);
extern void RecoverPreparedTransactions(void);
@@ -54,7 +57,7 @@ extern void CheckPointTwoPhase(XLogRecPtr redo_horizon);
extern void FinishPreparedTransaction(const char *gid, bool isCommit);
extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
- XLogRecPtr end_lsn);
+ XLogRecPtr end_lsn, RepOriginId origin_id);
extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
extern void restoreTwoPhaseData(void);
#endif /* TWOPHASE_H */
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 6445bbc46f..61c4ae37f3 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -21,6 +21,13 @@
#include "storage/sinval.h"
#include "utils/datetime.h"
+/*
+ * Maximum size of Global Transaction ID (including '\0').
+ *
+ * Note that the max value of GIDSIZE must fit in the uint16 gidlen,
+ * specified in TwoPhaseFileHeader.
+ */
+#define GIDSIZE 200
/*
* Xact isolation levels
@@ -156,6 +163,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
#define XACT_XINFO_HAS_TWOPHASE (1U << 4)
#define XACT_XINFO_HAS_ORIGIN (1U << 5)
#define XACT_XINFO_HAS_AE_LOCKS (1U << 6)
+#define XACT_XINFO_HAS_GID (1U << 7)
/*
* Also stored in xinfo, these indicating a variety of additional actions that
@@ -302,11 +310,16 @@ typedef struct xl_xact_parsed_commit
SharedInvalidationMessage *msgs;
TransactionId twophase_xid; /* only for 2PC */
+ char twophase_gid[GIDSIZE]; /* only for 2PC */
+ int nabortrels; /* only for 2PC */
+ RelFileNode *abortnodes; /* only for 2PC */
XLogRecPtr origin_lsn;
TimestampTz origin_timestamp;
} xl_xact_parsed_commit;
+typedef xl_xact_parsed_commit xl_xact_parsed_prepare;
+
typedef struct xl_xact_parsed_abort
{
TimestampTz xact_time;
@@ -386,12 +399,14 @@ extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
int nmsgs, SharedInvalidationMessage *msgs,
bool relcacheInval, bool forceSync,
int xactflags,
- TransactionId twophase_xid);
+ TransactionId twophase_xid,
+ const char *twophase_gid);
extern XLogRecPtr XactLogAbortRecord(TimestampTz abort_time,
int nsubxacts, TransactionId *subxacts,
int nrels, RelFileNode *rels,
- int xactflags, TransactionId twophase_xid);
+ int xactflags, TransactionId twophase_xid,
+ const char *twophase_gid);
extern void xact_redo(XLogReaderState *record);
/* xactdesc.c */
ReorderBufferTXN_flags_cleanup_1.patch (application/octet-stream)
commit 6d9568f5fedba67a1428097f732f564e73e13d43
Author: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Mon Feb 26 16:22:52 2018 +0530
Cleaning up and addition of new flags in ReorderBufferTXN structure
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index c72a611a39..d22e116aa1 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -623,7 +623,7 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
if (prev_first_lsn != InvalidXLogRecPtr)
Assert(prev_first_lsn < cur_txn->first_lsn);
- Assert(!cur_txn->is_known_as_subxact);
+ Assert(!rbtxn_is_subxact(cur_txn));
prev_first_lsn = cur_txn->first_lsn;
}
#endif
@@ -641,7 +641,7 @@ ReorderBufferGetOldestTXN(ReorderBuffer *rb)
txn = dlist_head_element(ReorderBufferTXN, node, &rb->toplevel_by_lsn);
- Assert(!txn->is_known_as_subxact);
+ Assert(!rbtxn_is_subxact(txn));
Assert(txn->first_lsn != InvalidXLogRecPtr);
return txn;
}
@@ -675,9 +675,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
dlist_push_tail(&txn->subtxns, &subtxn->node);
txn->nsubtxns++;
}
- else if (!subtxn->is_known_as_subxact)
+ else if (!rbtxn_is_subxact(subtxn))
{
- subtxn->is_known_as_subxact = true;
+ subtxn->txn_flags |= RBTXN_IS_SUBXACT;
Assert(subtxn->nsubtxns == 0);
/* remove from lsn order list of top-level transactions */
@@ -738,9 +738,9 @@ ReorderBufferCommitChild(ReorderBuffer *rb, TransactionId xid,
subtxn->final_lsn = commit_lsn;
subtxn->end_lsn = end_lsn;
- if (!subtxn->is_known_as_subxact)
+ if (!rbtxn_is_subxact(subtxn))
{
- subtxn->is_known_as_subxact = true;
+ subtxn->txn_flags |= RBTXN_IS_SUBXACT;
Assert(subtxn->nsubtxns == 0);
/* remove from lsn order list of top-level transactions */
@@ -849,7 +849,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
ReorderBufferChange *cur_change;
- if (txn->serialized)
+ if (rbtxn_is_serialized(txn))
{
/* serialize remaining changes */
ReorderBufferSerializeTXN(rb, txn);
@@ -878,7 +878,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
ReorderBufferChange *cur_change;
- if (cur_txn->serialized)
+ if (rbtxn_is_serialized(cur_txn))
{
/* serialize remaining changes */
ReorderBufferSerializeTXN(rb, cur_txn);
@@ -1044,7 +1044,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
* they originally were happening inside another subtxn, so we won't
* ever recurse more than one level deep here.
*/
- Assert(subtxn->is_known_as_subxact);
+ Assert(rbtxn_is_subxact(subtxn));
Assert(subtxn->nsubtxns == 0);
ReorderBufferCleanupTXN(rb, subtxn);
@@ -1083,7 +1083,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
/*
* Remove TXN from its containing list.
*
- * Note: if txn->is_known_as_subxact, we are deleting the TXN from its
+ * Note: if txn is known as subxact, we are deleting the TXN from its
* parent's list of known subxacts; this leaves the parent's nsubxacts
* count too high, but we don't care. Otherwise, we are deleting the TXN
* from the LSN-ordered list of toplevel TXNs.
@@ -1098,7 +1098,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
Assert(found);
/* remove entries spilled to disk */
- if (txn->serialized)
+ if (rbtxn_is_serialized(txn))
ReorderBufferRestoreCleanup(rb, txn);
/* deallocate */
@@ -1115,7 +1115,7 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
dlist_iter iter;
HASHCTL hash_ctl;
- if (!txn->has_catalog_changes || dlist_is_empty(&txn->tuplecids))
+ if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
return;
memset(&hash_ctl, 0, sizeof(hash_ctl));
@@ -1688,7 +1688,7 @@ ReorderBufferAbortOld(ReorderBuffer *rb, TransactionId oldestRunningXid)
* final_lsn to that of their last change; this causes
* ReorderBufferRestoreCleanup to do the right thing.
*/
- if (txn->serialized && txn->final_lsn == 0)
+ if (rbtxn_is_serialized(txn) && txn->final_lsn == 0)
{
ReorderBufferChange *last =
dlist_tail_element(ReorderBufferChange, node, &txn->changes);
@@ -1934,7 +1934,7 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
- txn->has_catalog_changes = true;
+ txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
}
/*
@@ -1951,7 +1951,7 @@ ReorderBufferXidHasCatalogChanges(ReorderBuffer *rb, TransactionId xid)
if (txn == NULL)
return false;
- return txn->has_catalog_changes;
+ return rbtxn_has_catalog_changes(txn);
}
/*
@@ -2095,7 +2095,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
Assert(spilled == txn->nentries_mem);
Assert(dlist_is_empty(&txn->changes));
txn->nentries_mem = 0;
- txn->serialized = true;
+ txn->txn_flags |= RBTXN_SERIALIZED;
if (fd != -1)
CloseTransientFile(fd);
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 0970abca52..d6b00654c2 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -137,21 +137,48 @@ typedef struct ReorderBufferChange
dlist_node node;
} ReorderBufferChange;
+/* ReorderBufferTXN flags */
+#define RBTXN_HAS_CATALOG_CHANGES 0x0001
+#define RBTXN_IS_SUBXACT 0x0002
+#define RBTXN_SERIALIZED 0x0004
+#define RBTXN_PREPARE 0x0008
+#define RBTXN_COMMIT_PREPARED 0x0010
+#define RBTXN_ROLLBACK_PREPARED 0x0020
+#define RBTXN_COMMIT 0x0040
+#define RBTXN_ROLLBACK 0x0080
+
+/* does the txn have catalog changes */
+#define rbtxn_has_catalog_changes(txn) (txn->txn_flags & RBTXN_HAS_CATALOG_CHANGES)
+/* is the txn known as a subxact? */
+#define rbtxn_is_subxact(txn) (txn->txn_flags & RBTXN_IS_SUBXACT)
+/*
+ * Has this transaction been spilled to disk? It's not always possible to
+ * deduce that fact by comparing nentries with nentries_mem, because e.g.
+ * subtransactions of a large transaction might get serialized together
+ * with the parent - if they're restored to memory they'd have
+ * nentries_mem == nentries.
+ */
+#define rbtxn_is_serialized(txn) (txn->txn_flags & RBTXN_SERIALIZED)
+/* is this txn prepared? */
+#define rbtxn_prepared(txn) (txn->txn_flags & RBTXN_PREPARE)
+/* was this prepared txn committed in the meanwhile? */
+#define rbtxn_commit_prepared(txn) (txn->txn_flags & RBTXN_COMMIT_PREPARED)
+/* was this prepared txn aborted in the meanwhile? */
+#define rbtxn_rollback_prepared(txn) (txn->txn_flags & RBTXN_ROLLBACK_PREPARED)
+/* was this txn committed in the meanwhile? */
+#define rbtxn_commit(txn) (txn->txn_flags & RBTXN_COMMIT)
+/* was this txn aborted in the meanwhile? */
+#define rbtxn_rollback(txn) (txn->txn_flags & RBTXN_ROLLBACK)
+
typedef struct ReorderBufferTXN
{
+ int txn_flags;
+
/*
* The transactions transaction id, can be a toplevel or sub xid.
*/
TransactionId xid;
- /* did the TX have catalog changes */
- bool has_catalog_changes;
-
- /*
- * Do we know this is a subxact?
- */
- bool is_known_as_subxact;
-
/*
* LSN of the first data carrying, WAL record with knowledge about this
* xid. This is allowed to *not* be first record adorned with this xid, if
@@ -214,15 +241,6 @@ typedef struct ReorderBufferTXN
*/
uint64 nentries_mem;
- /*
- * Has this transaction been spilled to disk? It's not always possible to
- * deduce that fact by comparing nentries with nentries_mem, because e.g.
- * subtransactions of a large transaction might get serialized together
- * with the parent - if they're restored to memory they'd have
- * nentries_mem == nentries.
- */
- bool serialized;
-
/*
* List of ReorderBufferChange structs, including new Snapshots and new
* CommandIds
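Reader's note on the reorderbuffer.h hunk above: the patch folds several bool fields of ReorderBufferTXN into a single txn_flags word that is tested through macros. The following is a minimal standalone sketch of that pattern, not part of the patch. DemoTxn, the DEMO_* bits and the demo_* helpers are hypothetical names used only for illustration. Unlike the macros in the hunk, the sketch parenthesizes the macro argument and normalizes the result with != 0; that is one possible hardening, not something the patch does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits, modeled on the RBTXN_* values in the hunk. */
#define DEMO_HAS_CATALOG_CHANGES 0x0001
#define DEMO_IS_SUBXACT          0x0002
#define DEMO_SERIALIZED          0x0004
#define DEMO_PREPARE             0x0008

/* Hypothetical stand-in for ReorderBufferTXN; only the flag word is modeled. */
typedef struct DemoTxn
{
    uint32_t    flags;
} DemoTxn;

/* Test macros: argument parenthesized, result normalized to 0 or 1. */
#define demo_has_catalog_changes(txn) \
    ((((txn)->flags) & DEMO_HAS_CATALOG_CHANGES) != 0)
#define demo_is_subxact(txn) \
    ((((txn)->flags) & DEMO_IS_SUBXACT) != 0)
#define demo_is_serialized(txn) \
    ((((txn)->flags) & DEMO_SERIALIZED) != 0)
#define demo_is_prepared(txn) \
    ((((txn)->flags) & DEMO_PREPARE) != 0)

/* Set one or more flag bits, leaving the others untouched. */
static inline void
demo_set_flag(DemoTxn *txn, uint32_t flag)
{
    assert(txn != NULL);
    txn->flags |= flag;
}

/* Clear one or more flag bits, leaving the others untouched. */
static inline void
demo_clear_flag(DemoTxn *txn, uint32_t flag)
{
    assert(txn != NULL);
    txn->flags &= ~flag;
}
```

With this layout, a new per-transaction state costs one more bit and one more test macro instead of another struct member, which is presumably why the hunk can add the prepare, commit and rollback states without growing the struct.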
Logical_lock_unlock_api_2.patch (application/octet-stream)
commit b472dce9ea645c93f232d6dcd549d4c3ba1f6bf1
Author: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Tue Feb 27 11:40:31 2018 +0530
Introduce LogicalLockTransaction/LogicalUnlockTransaction APIs
Prepared transactions and uncommitted transactions that have modified
catalogs need to interlock with concurrent rollback to ensure that
there are no issues while decoding.
Implementation is via adding support for decoding groups. Use
LockHashPartitionLockByProc on the group leader to get the LWLock
protecting these fields. For prepared and uncommitted transactions,
decoding backends working on the same XID will link themselves up
to the corresponding PGPROC entry (decodeGroupLeader).
They will remove themselves when they are done decoding.
If the prepared or uncommitted transaction decides to abort, then
the decodeGroupLeader will set the decodeAbortPending flag allowing
the decodeGroupMembers to abort their decoding appropriately
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 7637efc32e..c8ccade241 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1013,3 +1013,164 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
SpinLockRelease(&MyReplicationSlot->mutex);
}
}
+
+bool
+LogicalLockTransaction(ReorderBufferTXN *txn)
+{
+ bool ok = false;
+
+ /*
+ * Prepared transactions and uncommitted transactions
+ * that have modified catalogs need to interlock with
+ * concurrent rollback to ensure that there are no
+ * issues while decoding
+ */
+
+ if (!rbtxn_has_catalog_changes(txn))
+ return true;
+
+ /*
+ * Is it a prepared txn? Similar checks for uncommitted
+ * transactions when we start supporting them
+ */
+ if (!rbtxn_prepared(txn))
+ return true;
+
+ /* check cached status */
+ if (rbtxn_commit(txn))
+ return true;
+ if (rbtxn_rollback(txn))
+ return false;
+
+ /*
+ * Find the PROC that is handling this XID and add ourself as a
+ * decodeGroupMember
+ */
+ if (MyProc->decodeGroupLeader == NULL)
+ {
+ PGPROC *proc = BecomeDecodeGroupLeader(txn->xid, rbtxn_prepared(txn));
+
+ /*
+ * If decodeGroupLeader is NULL, then the only possibility
+ * is that the transaction completed and went away
+ */
+ if (proc == NULL)
+ {
+ Assert(!TransactionIdIsInProgress(txn->xid));
+ if (TransactionIdDidCommit(txn->xid))
+ {
+ txn->txn_flags |= RBTXN_COMMIT;
+ return true;
+ }
+ else
+ {
+ txn->txn_flags |= RBTXN_ROLLBACK;
+ return false;
+ }
+ }
+
+ /* Add ourself as a decodeGroupMember */
+ if (!BecomeDecodeGroupMember(proc, proc->pid, rbtxn_prepared(txn)))
+ {
+ Assert(!TransactionIdIsInProgress(txn->xid));
+ if (TransactionIdDidCommit(txn->xid))
+ {
+ txn->txn_flags |= RBTXN_COMMIT;
+ return true;
+ }
+ else
+ {
+ txn->txn_flags |= RBTXN_ROLLBACK;
+ return false;
+ }
+ }
+ }
+
+ /*
+ * If we were able to add ourself, then Abort processing will
+ * interlock with us. Check if the transaction is still around
+ */
+ Assert(MyProc->decodeGroupLeader);
+
+ if (MyProc->decodeGroupLeader)
+ {
+ LWLock *leader_lwlock;
+
+ leader_lwlock = LockHashPartitionLockByProc(MyProc->decodeGroupLeader);
+ LWLockAcquire(leader_lwlock, LW_SHARED);
+ if (MyProc->decodeAbortPending)
+ {
+ /*
+ * Remove ourself from the decodeGroupMembership and return
+ * false so that the decoding plugin also initiates abort
+ * processing
+ */
+ LWLockRelease(leader_lwlock);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ RemoveDecodeGroupMemberLocked(MyProc->decodeGroupLeader);
+ /* reset the bool to let the leader know that we are going away */
+ MyProc->decodeAbortPending = false;
+ MyProc->decodeLocked = false;
+ txn->txn_flags |= RBTXN_ROLLBACK;
+ ok = false;
+ }
+ else
+ {
+ MyProc->decodeLocked = true;
+ ok = true;
+ }
+ LWLockRelease(leader_lwlock);
+ }
+ else
+ return false;
+
+ return ok;
+}
+
+void
+LogicalUnlockTransaction(ReorderBufferTXN *txn)
+{
+ LWLock *leader_lwlock;
+
+ /*
+ * Prepared transactions and uncommitted transactions
+ * that have modified catalogs need to interlock with
+ * concurrent rollback to ensure that there are no
+ * issues while decoding
+ */
+
+ if (!rbtxn_has_catalog_changes(txn))
+ return;
+
+ /*
+ * Is it a prepared txn? Similar checks for uncommitted
+ * transactions when we start supporting them
+ */
+ if (!rbtxn_prepared(txn))
+ return;
+
+ /* check cached status */
+ if (rbtxn_commit(txn))
+ return;
+ if (rbtxn_rollback(txn))
+ return;
+
+ Assert(MyProc->decodeGroupLeader);
+ leader_lwlock = LockHashPartitionLockByProc(MyProc->decodeGroupLeader);
+ LWLockAcquire(leader_lwlock, LW_SHARED);
+ if (MyProc->decodeAbortPending)
+ {
+ /*
+ * Remove ourself from the decodeGroupMembership
+ */
+ LWLockRelease(leader_lwlock);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ RemoveDecodeGroupMemberLocked(MyProc->decodeGroupLeader);
+ /* reset the bool to let the leader know that we are going away */
+ MyProc->decodeAbortPending = false;
+ txn->txn_flags |= RBTXN_ROLLBACK;
+ }
+ MyProc->decodeLocked = false;
+ LWLockRelease(leader_lwlock);
+ return;
+}
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index afe1c03aa3..2be2910207 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -2432,6 +2432,45 @@ BackendXidGetPid(TransactionId xid)
return result;
}
+/*
+ * BackendXidGetProc -- get a backend's PGPROC given its XID
+ *
+ * Note that it is up to the caller to be sure that the question
+ * remains meaningful for long enough for the answer to be used ...
+ *
+ * Only main transaction Ids are considered.
+ *
+ */
+PGPROC *
+BackendXidGetProc(TransactionId xid)
+{
+ PGPROC *result = NULL;
+ ProcArrayStruct *arrayP = procArray;
+ int index;
+
+ if (xid == InvalidTransactionId) /* never match invalid xid */
+ return NULL;
+
+ LWLockAcquire(ProcArrayLock, LW_SHARED);
+
+ for (index = 0; index < arrayP->numProcs; index++)
+ {
+ int pgprocno = arrayP->pgprocnos[index];
+ PGPROC *proc = &allProcs[pgprocno];
+ volatile PGXACT *pgxact = &allPgXact[pgprocno];
+
+ if (pgxact->xid == xid)
+ {
+ result = proc;
+ break;
+ }
+ }
+
+ LWLockRelease(ProcArrayLock);
+
+ return result;
+}
+
/*
* IsBackendPid -- is a given pid a running backend
*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 6f30e082b2..26d35c7807 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -267,6 +267,11 @@ InitProcGlobal(void)
/* Initialize lockGroupMembers list. */
dlist_init(&procs[i].lockGroupMembers);
+
+ /* Initialize decodeGroupMembers list. */
+ dlist_init(&procs[i].decodeGroupMembers);
+ procs[i].decodeAbortPending = false;
+ procs[i].decodeLocked = false;
}
/*
@@ -406,6 +411,12 @@ InitProcess(void)
Assert(MyProc->lockGroupLeader == NULL);
Assert(dlist_is_empty(&MyProc->lockGroupMembers));
+ /* Check that group decode fields are in a proper initial state. */
+ Assert(MyProc->decodeGroupLeader == NULL);
+ Assert(dlist_is_empty(&MyProc->decodeGroupMembers));
+ MyProc->decodeAbortPending = false;
+ MyProc->decodeLocked = false;
+
/* Initialize wait event information. */
MyProc->wait_event_info = 0;
@@ -581,6 +592,12 @@ InitAuxiliaryProcess(void)
Assert(MyProc->lockGroupLeader == NULL);
Assert(dlist_is_empty(&MyProc->lockGroupMembers));
+ /* Check that group decode fields are in a proper initial state. */
+ Assert(MyProc->decodeGroupLeader == NULL);
+ Assert(dlist_is_empty(&MyProc->decodeGroupMembers));
+ MyProc->decodeAbortPending = false;
+ MyProc->decodeLocked = false;
+
/*
* We might be reusing a semaphore that belonged to a failed process. So
* be careful and reinitialize its value here. (This is not strictly
@@ -1887,3 +1904,268 @@ BecomeLockGroupMember(PGPROC *leader, int pid)
return ok;
}
+
+/*
+ * BecomeDecodeGroupLeader - designate process as decode group leader
+ *
+ * Once this function has returned, other processes can join the decode group
+ * by calling BecomeDecodeGroupMember.
+ */
+PGPROC *
+BecomeDecodeGroupLeader(TransactionId xid, bool is_prepared)
+{
+ PGPROC *proc = NULL;
+ int pid;
+ LWLock *leader_lwlock;
+
+ Assert(xid != InvalidTransactionId);
+
+
+ proc = BackendXidGetProc(xid);
+ if (proc)
+ pid = proc->pid;
+
+ /*
+ * This proc will become decodeGroupLeader if it's
+ * not already
+ */
+ if (proc && proc->decodeGroupLeader != proc)
+ {
+ volatile PGXACT *pgxact;
+ /* Create single-member group, containing this proc. */
+ leader_lwlock = LockHashPartitionLockByProc(proc);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ /* recheck we are still the same */
+ pgxact = &ProcGlobal->allPgXact[proc->pgprocno];
+ if (proc->pid == pid && pgxact->xid == xid)
+ {
+ if (is_prepared)
+ Assert(pid == 0);
+ /* recheck if someone else did not already assign us */
+ if (proc->decodeGroupLeader != proc)
+ {
+ /* We had better not be a follower. */
+ Assert(proc->decodeGroupLeader == NULL);
+ proc->decodeGroupLeader = proc;
+ dlist_push_head(&proc->decodeGroupMembers,
+ &proc->decodeGroupLink);
+ }
+ }
+ else
+ {
+ /* proc entry is gone */
+ proc = NULL;
+ }
+ LWLockRelease(leader_lwlock);
+ }
+
+ elog(DEBUG1, "became group leader (%p)", proc);
+ return proc;
+}
+
+/*
+ * BecomeDecodeGroupMember - designate process as decode group member
+ *
+ * This is pretty straightforward except for the possibility that the leader
+ * whose group we're trying to join might exit before we manage to do so;
+ * and the PGPROC might get recycled for an unrelated process. To avoid
+ * that, we require the caller to pass the PID of the intended PGPROC as
+ * an interlock. Returns true if we successfully join the intended decode
+ * group, and false if not.
+ */
+bool
+BecomeDecodeGroupMember(PGPROC *leader, int pid, bool is_prepared)
+{
+ LWLock *leader_lwlock;
+ bool ok = false;
+
+ /* Group leader can't become member of group */
+ Assert(MyProc != leader);
+
+ /* Can't already be a member of a group */
+ Assert(MyProc->decodeGroupLeader == NULL);
+
+ /* PID must be valid OR this is a prepared transaction. */
+ Assert(pid != 0 || is_prepared);
+
+ /*
+ * Get lock protecting the group fields. Note LockHashPartitionLockByProc
+ * accesses leader->pgprocno in a PGPROC that might be free. This is safe
+ * because all PGPROCs' pgprocno fields are set during shared memory
+ * initialization and never change thereafter; so we will acquire the
+ * correct lock even if the leader PGPROC is in process of being recycled.
+ */
+ leader_lwlock = LockHashPartitionLockByProc(leader);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+
+ /* Is this the leader we're looking for? */
+ if (leader->pid == pid && leader->decodeGroupLeader == leader)
+ {
+ if (is_prepared)
+ Assert(pid == 0);
+ /* is the leader going away? */
+ if (leader->decodeAbortPending)
+ ok = false;
+ else
+ {
+ /* OK, join the group */
+ ok = true;
+ MyProc->decodeGroupLeader = leader;
+ dlist_push_tail(&leader->decodeGroupMembers, &MyProc->decodeGroupLink);
+ }
+ }
+ else
+ MyProc->decodeGroupLeader = NULL;
+ LWLockRelease(leader_lwlock);
+
+ elog(DEBUG1, "became group member (%p) to (%p)", MyProc, leader);
+ return ok;
+}
+
+/*
+ * Remove the calling backend from the decodeGroupMembership of
+ * decodeGroupLeader.
+ * Acquires the lock protecting the group fields itself.
+ */
+void
+RemoveDecodeGroupMember(PGPROC *leader)
+{
+ LWLock *leader_lwlock;
+
+ leader_lwlock = LockHashPartitionLockByProc(leader);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ RemoveDecodeGroupMemberLocked(leader);
+ LWLockRelease(leader_lwlock);
+
+ return;
+}
+
+/*
+ * Remove a decodeGroupMember from the decodeGroupMembership of
+ * decodeGroupLeader
+ * Assumes that the caller is holding appropriate lock
+ */
+void
+RemoveDecodeGroupMemberLocked(PGPROC *leader)
+{
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ dlist_delete(&MyProc->decodeGroupLink);
+ /* leader links to itself, so never empty */
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ MyProc->decodeGroupLeader = NULL;
+ elog(DEBUG1, "removed group member (%p) from (%p)", MyProc, leader);
+
+ return;
+}
+
+/*
+ * Indicate to all decodeGroupMembers that this transaction is
+ * going away.
+ *
+ * Wait for all decodeGroupMembers to ack back before returning
+ * from here but only in case of aborts.
+ *
+ * This function should be called *after* the proc has been
+ * removed from the procArray.
+ *
+ * If the transaction is committing, it's ok for the
+ * decoders to continue merrily. When a decoder then tries to lock
+ * this proc, it won't find it, so it checks the transaction status
+ * and caches the commit status for future calls in
+ * LogicalLockTransaction.
+ */
+void
+LogicalDecodeRemoveTransaction(PGPROC *leader, bool isCommit)
+{
+ LWLock *leader_lwlock;
+ dlist_mutable_iter change_i;
+ dlist_iter iter;
+ PGPROC *proc;
+ bool do_wait;
+
+ leader_lwlock = LockHashPartitionLockByProc(leader);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ /* mark ourself as aborting */
+ if (!isCommit)
+ leader->decodeAbortPending = true;
+
+ if (leader->decodeGroupLeader == NULL)
+ {
+ Assert(dlist_is_empty(&leader->decodeGroupMembers));
+ LWLockRelease(leader_lwlock);
+ return;
+ }
+
+recheck:
+ do_wait = false;
+ Assert(leader->decodeGroupLeader == leader);
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ if (!isCommit)
+ {
+ dlist_foreach(iter, &leader->decodeGroupMembers)
+ {
+ proc = dlist_container(PGPROC, decodeGroupLink, iter.cur);
+ /* mark the proc to indicate abort is pending */
+ if (proc == leader)
+ continue;
+ if (!proc->decodeAbortPending)
+ {
+ proc->decodeAbortPending = true;
+ elog(DEBUG1, "marking group member (%p) from (%p) for abort",
+ proc, leader);
+ }
+ /* if the proc is currently locked, wait */
+ if (proc->decodeLocked)
+ do_wait = true;
+ }
+
+ if (do_wait)
+ {
+ int rc;
+ LWLockRelease(leader_lwlock);
+
+ elog(LOG, "Waiting for backends to abort decoding");
+ /*
+ * Wait on our latch to allow decodeGroupMembers to
+ * go away soon
+ */
+ rc = WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+ 100L,
+ WAIT_EVENT_PG_SLEEP);
+ ResetLatch(MyLatch);
+
+ /* emergency bailout if postmaster has died */
+ if (rc & WL_POSTMASTER_DEATH)
+ proc_exit(1);
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Recheck decodeGroupMembers */
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ goto recheck;
+ }
+ }
+
+ /*
+ * All backends exited cleanly in case of aborts above,
+ * remove decodeGroupMembers now for both commit/abort cases
+ */
+ Assert(leader->decodeGroupLeader == leader);
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ dlist_foreach_modify(change_i, &leader->decodeGroupMembers)
+ {
+ proc = dlist_container(PGPROC, decodeGroupLink, change_i.cur);
+ Assert(!proc->decodeLocked);
+ dlist_delete(&proc->decodeGroupLink);
+ elog(DEBUG1, "deleting group member (%p) from (%p)",
+ proc, leader);
+ proc->decodeGroupLeader = NULL;
+ }
+ Assert(dlist_is_empty(&leader->decodeGroupMembers));
+ leader->decodeGroupLeader = NULL;
+ leader->decodeAbortPending = false;
+ LWLockRelease(leader_lwlock);
+
+ return;
+}
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 619c5f4d73..63b14367f0 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -117,6 +117,8 @@ extern void LogicalIncreaseXminForSlot(XLogRecPtr lsn, TransactionId xmin);
extern void LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn,
XLogRecPtr restart_lsn);
extern void LogicalConfirmReceivedLocation(XLogRecPtr lsn);
+extern bool LogicalLockTransaction(ReorderBufferTXN *txn);
+extern void LogicalUnlockTransaction(ReorderBufferTXN *txn);
extern bool filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id);
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 5c19a61dcf..fdfc582874 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -200,6 +200,26 @@ struct PGPROC
PGPROC *lockGroupLeader; /* lock group leader, if I'm a member */
dlist_head lockGroupMembers; /* list of members, if I'm a leader */
dlist_node lockGroupLink; /* my member link, if I'm a member */
+
+ /*
+ * Support for decoding groups. Use LockHashPartitionLockByProc on the group
+ * leader to get the LWLock protecting these fields.
+ *
+ * For prepared and uncommitted transactions, decoding backends working on
+ * the same XID will link themselves up to the corresponding PGPROC
+ * entry (decodeGroupLeader).
+ *
+ * They will remove themselves when they are done decoding.
+ *
+ * If the prepared or uncommitted transaction decides to abort, then
+ * the decodeGroupLeader will set the decodeAbortPending flag allowing
+ * the decodeGroupMembers to abort their decoding appropriately
+ */
+ PGPROC *decodeGroupLeader; /* decode group leader, if I'm a member */
+ dlist_head decodeGroupMembers; /* list of members, if I'm a leader */
+ dlist_node decodeGroupLink; /* my member link, if I'm a member */
+ bool decodeLocked; /* is it currently locked by this proc? */
+ bool decodeAbortPending; /* is the decode group leader aborting? */
};
/* NOTE: "typedef struct PGPROC PGPROC" appears in storage/lock.h. */
@@ -326,5 +346,10 @@ extern PGPROC *AuxiliaryPidGetProc(int pid);
extern void BecomeLockGroupLeader(void);
extern bool BecomeLockGroupMember(PGPROC *leader, int pid);
+extern PGPROC *BecomeDecodeGroupLeader(TransactionId xid, bool is_prepared);
+extern bool BecomeDecodeGroupMember(PGPROC *leader, int pid, bool is_prepared);
+extern void RemoveDecodeGroupMember(PGPROC *leader);
+extern void RemoveDecodeGroupMemberLocked(PGPROC *leader);
+extern void LogicalDecodeRemoveTransaction(PGPROC *leader, bool isCommit);
#endif /* PROC_H */
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 75bab2985f..776de2470e 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -97,6 +97,7 @@ extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids
extern PGPROC *BackendPidGetProc(int pid);
extern PGPROC *BackendPidGetProcWithLock(int pid);
+extern PGPROC *BackendXidGetProc(TransactionId xid);
extern int BackendXidGetPid(TransactionId xid);
extern bool IsBackendPid(int pid);
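Reader's note on the lock/unlock patch above: the interlock between a decoding backend and a concurrent rollback can be hard to follow across logical.c and proc.c. The following is a toy, single-threaded model of the handshake, not part of the patch. DemoProc and the demo_* functions are hypothetical names; the real code protects these fields with an LWLock, locates the leader through the proc array, and waits on a latch, none of which is modeled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the decode-group fields added to PGPROC. */
typedef struct DemoProc
{
    struct DemoProc *leader;    /* group leader, NULL when not in a group */
    bool        locked;         /* inside a catalog-access window? */
    bool        abort_pending;  /* has the leader announced an abort? */
} DemoProc;

/* Join the leader's group; refused once the leader is aborting. */
static bool
demo_join(DemoProc *member, DemoProc *leader)
{
    assert(member != leader);
    if (leader->abort_pending)
        return false;
    member->leader = leader;
    return true;
}

/* Leave the group and reset the per-member state. */
static void
demo_leave(DemoProc *member)
{
    member->leader = NULL;
    member->locked = false;
    member->abort_pending = false;
}

/* Open a catalog-access window; false means "stop decoding this txn". */
static bool
demo_lock(DemoProc *member)
{
    if (member->leader == NULL)
        return false;
    if (member->abort_pending)
    {
        demo_leave(member);
        return false;
    }
    member->locked = true;
    return true;
}

/* Close the window; a member that saw the abort request leaves the group. */
static void
demo_unlock(DemoProc *member)
{
    member->locked = false;
    if (member->abort_pending)
        demo_leave(member);
}

/*
 * Leader side of an abort: flag every member of this group and report
 * whether any of them is still inside a window. In the patch the leader
 * would sleep briefly and recheck while this returns true.
 */
static bool
demo_begin_abort(DemoProc *leader, DemoProc **members, size_t n)
{
    bool        must_wait = false;

    leader->abort_pending = true;
    for (size_t i = 0; i < n; i++)
    {
        if (members[i]->leader != leader)
            continue;
        members[i]->abort_pending = true;
        if (members[i]->locked)
            must_wait = true;
    }
    return must_wait;
}
```

The property the model tries to show is that the aborting side never proceeds while a member is between demo_lock and demo_unlock, and that a member learns about the abort no later than its next lock or unlock call.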
pgoutput_plugin_support_2PC_5.patch (application/octet-stream)
commit 51ed62a7acc749e2facf8db7d11148069abceab6
Author: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Wed Feb 28 17:52:51 2018 +0530
pgoutput output plugin support for logical decoding of 2PC.
Includes documentation and test cases.
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 5501eed108..7edda72e5e 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -384,8 +384,14 @@ typedef struct OutputPluginCallbacks
LogicalDecodeBeginCB begin_cb;
LogicalDecodeChangeCB change_cb;
LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeAbortCB abort_cb;
LogicalDecodeMessageCB message_cb;
+ LogicalDecodeFilterPrepareCB filter_prepare_cb;
+ LogicalDecodePrepareCB prepare_cb;
+ LogicalDecodeCommitPreparedCB commit_prepared_cb;
+ LogicalDecodeAbortPreparedCB abort_prepared_cb;
LogicalDecodeFilterByOriginCB filter_by_origin_cb;
+ LogicalDecodeFilterDecodeTxnCB filter_decode_txn_cb;
LogicalDecodeShutdownCB shutdown_cb;
} OutputPluginCallbacks;
@@ -454,7 +460,12 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
never get
decoded. Successful savepoints are
folded into the transaction containing them in the order they were
- executed within that transaction.
+ executed within that transaction. A transaction that is prepared for
+ a two-phase commit using <command>PREPARE TRANSACTION</command> will
+ also be decoded if the output plugin callbacks needed for decoding
+ it are provided. It is possible that the transaction currently
+ being decoded is aborted concurrently via a <command>ROLLBACK PREPARED</command>
+ command. In that case, logical decoding of this transaction is aborted midway.
</para>
<note>
@@ -550,6 +561,74 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
</para>
</sect3>
+ <sect3 id="logicaldecoding-output-plugin-prepare">
+ <title>Transaction Prepare Callback</title>
+
+ <para>
+ The optional <function>prepare_cb</function> callback is called whenever
+ a transaction which is prepared for two-phase commit has been
+ decoded. The <function>change_cb</function> callbacks for all modified
+ rows will have been called before this, if there have been any modified
+ rows.
+<programlisting>
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+ <title>Commit Prepared Transaction Callback</title>
+
+ <para>
+ The optional <function>commit_prepared_cb</function> callback is called whenever
+ a transaction <command>COMMIT PREPARED</command> has been decoded. The <parameter>gid</parameter> field,
+ which is part of the <parameter>txn</parameter> parameter, can be used in this
+ callback.
+<programlisting>
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-abort-prepared">
+ <title>Rollback Prepared Transaction Callback</title>
+
+ <para>
+ The optional <function>abort_prepared_cb</function> callback is called whenever
+ a transaction <command>ROLLBACK PREPARED</command> has been decoded. The <parameter>gid</parameter> field,
+ which is part of the <parameter>txn</parameter> parameter, can be used in this
+ callback.
+<programlisting>
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-abort">
+ <title>Transaction Abort Callback</title>
+
+ <para>
+ The required <function>abort_cb</function> callback is called whenever
+ a transaction abort has to be initiated. This can happen if we are
+ decoding a transaction that has been prepared for two-phase commit and
+ a concurrent rollback happens while we are decoding it. In such cases
+ it might make sense to check, before commencing decoding, whether the
+ rollback has already happened, so that decoding of such transactions
+ can be avoided completely.
+<programlisting>
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
<sect3 id="logicaldecoding-output-plugin-change">
<title>Change Callback</title>
@@ -559,12 +638,30 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
an <command>INSERT</command>, <command>UPDATE</command>,
or <command>DELETE</command>. Even if the original command modified
several rows at once the callback will be called individually for each
- row.
+ row. The <function>change_cb</function> callback may access system or
+ user catalog tables to aid in the process of outputting the row
+ modification details. The <function>change_cb</function> callback should invoke
+ the <function>LogicalLockTransaction</function> function before any such access to
+ system or user catalog tables. When decoding a prepared (but not yet
+ committed) transaction, or an uncommitted transaction, this
+ function interlocks the decoding activity with a simultaneous rollback of
+ this very same transaction by another backend. The
+ <function>change_cb</function> callback should invoke the
+ <function>LogicalUnlockTransaction</function> function immediately after
+ the catalog table access.
<programlisting>
typedef void (*LogicalDecodeChangeCB) (struct LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
Relation relation,
ReorderBufferChange *change);
+</programlisting>
+ Here's an example of the use of <function>LogicalLockTransaction</function>
+ and <function>LogicalUnlockTransaction</function> in an output plugin:
+<programlisting>
+ if (!LogicalLockTransaction(txn))
+ return;
+ relation = RelationIdGetRelation(reloid);
+ LogicalUnlockTransaction(txn);
</programlisting>
The <parameter>ctx</parameter> and <parameter>txn</parameter> parameters
have the same contents as for the <function>begin_cb</function>
@@ -614,6 +711,53 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
</para>
</sect3>
+ <sect3 id="logicaldecoding-output-plugin-filter-decode">
+ <title>Decode Filter Callback</title>
+
+ <para>
+ The optional <function>filter_decode_txn_cb</function> callback
+ is called to determine whether data that is part of the current
+ transaction should continue to be decoded.
+<programlisting>
+typedef bool (*LogicalDecodeFilterDecodeTxnCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn);
+</programlisting>
+ The <parameter>ctx</parameter> parameter has the same contents
+ as for the other callbacks. The <parameter>txn</parameter> parameter
+ contains meta information about the transaction, like its XID.
+ Note, however, that it can be NULL in some cases. To signal that the decoding
+ process should terminate, return true; otherwise return false.
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+ <title>Prepare Filter Callback</title>
+
+ <para>
+ The optional <function>filter_prepare_cb</function> callback
+ is called to determine whether data that is part of the current
+ two-phase commit transaction should be decoded
+ at this prepare stage or later, as a regular one-phase transaction, at
+ <command>COMMIT PREPARED</command> time. To signal that
+ decoding at prepare time should be skipped, return true; otherwise return false.
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+</programlisting>
+ The <parameter>ctx</parameter> parameter has the same contents
+ as for the other callbacks. The <parameter>txn</parameter> parameter
+ contains meta information about the transaction. The <parameter>xid</parameter>
+ contains the XID because <parameter>txn</parameter> can be NULL in some cases.
+ The <parameter>gid</parameter> is the identifier that later identifies this
+ transaction for <command>COMMIT PREPARED</command> or <command>ROLLBACK PREPARED</command>.
+ The callback has to provide the same static answer for a given combination of
+ <parameter>xid</parameter> and <parameter>gid</parameter> every time it is
+ called. To signal that decoding should be skipped, return true; false otherwise.
+ </para>
+ </sect3>
+
<sect3 id="logicaldecoding-output-plugin-message">
<title>Generic Message Callback</title>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index f3091af385..e2db0ebf77 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -549,6 +549,38 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held)
ProcArrayAdd(&ProcGlobal->allProcs[gxact->pgprocno]);
}
+/*
+ * LookupGXact
+ * Check if the prepared transaction with the given GID is
+ * around
+ */
+bool
+LookupGXact(const char *gid)
+{
+ int i;
+
+ LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+
+ for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+ {
+ GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+ /* Ignore not-yet-valid GIDs */
+ if (!gxact->valid)
+ continue;
+ if (strcmp(gxact->gid, gid) != 0)
+ continue;
+
+ LWLockRelease(TwoPhaseStateLock);
+
+ return true;
+ }
+
+ LWLockRelease(TwoPhaseStateLock);
+
+ return false;
+}
+
/*
* LockGXact
* Locate the prepared transaction and mark it busy for COMMIT or PREPARE.
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index db93d3927b..0e57cda2c6 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -886,6 +886,10 @@ filter_decode_txn_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
ErrorContextCallback errcallback;
bool ret;
+ /* if callback is not present, return false */
+ if (ctx->callbacks.filter_decode_txn_cb == NULL)
+ return false;
+
/* Push callback + info on the error context stack */
state.ctx = ctx;
state.callback_name = "filter_decode_txn";
@@ -915,6 +919,10 @@ filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
ErrorContextCallback errcallback;
bool ret;
+ /* If twophase is not enabled, return true */
+ if (!ctx->enable_twophase)
+ return true;
+
/* Push callback + info on the error context stack */
state.ctx = ctx;
state.callback_name = "filter_prepare";
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 948343e4ae..cae4c72fed 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -72,10 +72,11 @@ void
logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
XLogRecPtr commit_lsn)
{
- uint8 flags = 0;
+ uint8 flags = 0;
pq_sendbyte(out, 'C'); /* sending COMMIT */
+ flags |= LOGICALREP_IS_COMMIT;
/* send the flags field (unused for now) */
pq_sendbyte(out, flags);
@@ -86,21 +87,106 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
}
/*
- * Read transaction COMMIT from the stream.
+ * Write ABORT to the output stream.
+ */
+void
+logicalrep_write_abort(StringInfo out, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ uint8 flags = 0;
+
+ pq_sendbyte(out, 'C'); /* sending ABORT flag below */
+
+ flags |= LOGICALREP_IS_ABORT;
+ /* send the flags field */
+ pq_sendbyte(out, flags);
+
+ /* send fields */
+ pq_sendint64(out, abort_lsn);
+ pq_sendint64(out, txn->end_lsn);
+ pq_sendint64(out, txn->commit_time);
+}
+
+/*
+ * Read transaction COMMIT|ABORT from the stream.
*/
void
-logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
+logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data,
+ uint8 *flags)
{
- /* read flags (unused for now) */
- uint8 flags = pq_getmsgbyte(in);
+ /* read flags */
+ uint8 commit_flags = pq_getmsgbyte(in);
- if (flags != 0)
- elog(ERROR, "unrecognized flags %u in commit message", flags);
+ if (!(commit_flags & LOGICALREP_COMMIT_MASK))
+ elog(ERROR, "unrecognized flags %u in commit|abort message",
+ commit_flags);
/* read fields */
commit_data->commit_lsn = pq_getmsgint64(in);
commit_data->end_lsn = pq_getmsgint64(in);
commit_data->committime = pq_getmsgint64(in);
+
+ /* set gid to empty */
+ commit_data->gid[0] = '\0';
+
+ *flags = commit_flags;
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ uint8 flags = 0;
+
+ pq_sendbyte(out, 'P'); /* sending PREPARE protocol */
+
+ if (rbtxn_commit_prepared(txn))
+ flags |= LOGICALREP_IS_COMMIT_PREPARED;
+ else if (rbtxn_rollback_prepared(txn))
+ flags |= LOGICALREP_IS_ROLLBACK_PREPARED;
+ else if (rbtxn_prepared(txn))
+ flags |= LOGICALREP_IS_PREPARE;
+
+ if (flags == 0)
+ elog(ERROR, "unrecognized flags %u in [commit|rollback] prepare message", flags);
+
+ /* send the flags field */
+ pq_sendbyte(out, flags);
+
+ /* send fields */
+ pq_sendint64(out, prepare_lsn);
+ pq_sendint64(out, txn->end_lsn);
+ pq_sendint64(out, txn->commit_time);
+
+ /* send gid */
+ pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepCommitData *commit_data, uint8 *flags)
+{
+ /* read flags */
+ uint8 prep_flags = pq_getmsgbyte(in);
+
+ if (!(prep_flags & LOGICALREP_PREPARE_MASK))
+ elog(ERROR, "unrecognized flags %u in prepare message", prep_flags);
+
+ /* read fields */
+ commit_data->commit_lsn = pq_getmsgint64(in);
+ commit_data->end_lsn = pq_getmsgint64(in);
+ commit_data->committime = pq_getmsgint64(in);
+
+ /* read gid */
+ strlcpy(commit_data->gid, pq_getmsgstring(in), sizeof(commit_data->gid));
+
+ /* set flags */
+ *flags = prep_flags;
}
/*
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 04985c9f91..5f0b40e760 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -452,8 +452,9 @@ static void
apply_handle_commit(StringInfo s)
{
LogicalRepCommitData commit_data;
+ uint8 flags = 0;
- logicalrep_read_commit(s, &commit_data);
+ logicalrep_read_commit(s, &commit_data, &flags);
Assert(commit_data.commit_lsn == remote_final_lsn);
@@ -467,7 +468,11 @@ apply_handle_commit(StringInfo s)
replorigin_session_origin_lsn = commit_data.end_lsn;
replorigin_session_origin_timestamp = commit_data.committime;
- CommitTransactionCommand();
+ if (flags & LOGICALREP_IS_COMMIT)
+ CommitTransactionCommand();
+ else if (flags & LOGICALREP_IS_ABORT)
+ AbortCurrentTransaction();
+
pgstat_report_stat(false);
store_flush_position(commit_data.end_lsn);
@@ -487,6 +492,132 @@ apply_handle_commit(StringInfo s)
pgstat_report_activity(STATE_IDLE, NULL);
}
+static void
+apply_handle_prepare_txn(LogicalRepCommitData *commit_data)
+{
+ Assert(commit_data->commit_lsn == remote_final_lsn);
+ /* The synchronization worker runs in a single transaction. */
+ if (IsTransactionState() && !am_tablesync_worker())
+ {
+ /* End the earlier transaction and start a new one */
+ BeginTransactionBlock();
+ CommitTransactionCommand();
+ StartTransactionCommand();
+ /*
+ * Update origin state so we can restart streaming from correct
+ * position in case of crash.
+ */
+ replorigin_session_origin_lsn = commit_data->end_lsn;
+ replorigin_session_origin_timestamp = commit_data->committime;
+
+ PrepareTransactionBlock(commit_data->gid);
+ CommitTransactionCommand();
+ pgstat_report_stat(false);
+
+ store_flush_position(commit_data->end_lsn);
+ }
+ else
+ {
+ /* Process any invalidation messages that might have accumulated. */
+ AcceptInvalidationMessages();
+ maybe_reread_subscription();
+ }
+
+ in_remote_transaction = false;
+
+ /* Process any tables that are being synchronized in parallel. */
+ process_syncing_tables(commit_data->end_lsn);
+
+ pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+static void
+apply_handle_commit_prepared_txn(LogicalRepCommitData *commit_data)
+{
+ /* there is no transaction when COMMIT PREPARED is called */
+ ensure_transaction();
+
+ /*
+ * Update origin state so we can restart streaming from correct
+ * position in case of crash.
+ */
+ replorigin_session_origin_lsn = commit_data->end_lsn;
+ replorigin_session_origin_timestamp = commit_data->committime;
+
+ FinishPreparedTransaction(commit_data->gid, true);
+ CommitTransactionCommand();
+ pgstat_report_stat(false);
+
+ store_flush_position(commit_data->end_lsn);
+ in_remote_transaction = false;
+
+ /* Process any tables that are being synchronized in parallel. */
+ process_syncing_tables(commit_data->end_lsn);
+
+ pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+static void
+apply_handle_rollback_prepared_txn(LogicalRepCommitData *commit_data)
+{
+
+
+ /*
+ * Update origin state so we can restart streaming from correct
+ * position in case of crash.
+ */
+ replorigin_session_origin_lsn = commit_data->end_lsn;
+ replorigin_session_origin_timestamp = commit_data->committime;
+
+ /*
+ * A prepared transaction may be aborted on the publisher while it is
+ * still being decoded. In that case decoding stops and the transaction
+ * is aborted immediately, so it may never have been prepared here. The
+ * ROLLBACK PREPARED message still reaches the subscriber, so it is OK
+ * for the gid to be missing.
+ */
+ if (LookupGXact(commit_data->gid))
+ {
+ /* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+ ensure_transaction();
+ FinishPreparedTransaction(commit_data->gid, false);
+ CommitTransactionCommand();
+ }
+
+ pgstat_report_stat(false);
+
+ store_flush_position(commit_data->end_lsn);
+ in_remote_transaction = false;
+
+ /* Process any tables that are being synchronized in parallel. */
+ process_syncing_tables(commit_data->end_lsn);
+
+ pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+ LogicalRepCommitData commit_data;
+ uint8 flags = 0;
+
+ logicalrep_read_prepare(s, &commit_data, &flags);
+
+ if (flags & LOGICALREP_IS_PREPARE)
+ apply_handle_prepare_txn(&commit_data);
+ else if (flags & LOGICALREP_IS_COMMIT_PREPARED)
+ apply_handle_commit_prepared_txn(&commit_data);
+ else if (flags & LOGICALREP_IS_ROLLBACK_PREPARED)
+ apply_handle_rollback_prepared_txn(&commit_data);
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg("wrong [commit|rollback] prepare message")));
+}
+
/*
* Handle ORIGIN message.
*
@@ -884,10 +1015,14 @@ apply_dispatch(StringInfo s)
case 'B':
apply_handle_begin(s);
break;
- /* COMMIT */
+ /* COMMIT|ABORT */
case 'C':
apply_handle_commit(s);
break;
+ /* [COMMIT|ROLLBACK] PREPARE */
+ case 'P':
+ apply_handle_prepare(s);
+ break;
/* INSERT */
case 'I':
apply_handle_insert(s);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index d538f25ede..9ba68ef248 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -37,11 +37,23 @@ static void pgoutput_begin_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn);
static void pgoutput_commit_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_abort_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr abort_lsn);
static void pgoutput_change(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, Relation rel,
ReorderBufferChange *change);
static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
RepOriginId origin_id);
+static bool pgoutput_filter_prepare(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, TransactionId xid, const char *gid);
+static bool pgoutput_decode_txn_filter(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_abort_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
static bool publications_valid;
@@ -79,7 +91,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
cb->begin_cb = pgoutput_begin_txn;
cb->change_cb = pgoutput_change;
cb->commit_cb = pgoutput_commit_txn;
+ cb->abort_cb = pgoutput_abort_txn;
+
+ cb->filter_prepare_cb = pgoutput_filter_prepare;
+ cb->prepare_cb = pgoutput_prepare_txn;
+ cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+ cb->abort_prepared_cb = pgoutput_abort_prepared_txn;
+
cb->filter_by_origin_cb = pgoutput_origin_filter;
+ cb->filter_decode_txn_cb = pgoutput_decode_txn_filter;
cb->shutdown_cb = pgoutput_shutdown;
}
@@ -251,6 +271,61 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
OutputPluginWrite(ctx, true);
}
+/*
+ * ABORT callback
+ */
+static void
+pgoutput_abort_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ OutputPluginUpdateProgress(ctx);
+
+ OutputPluginPrepareWrite(ctx, true);
+ logicalrep_write_abort(ctx->out, txn, abort_lsn);
+ OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ OutputPluginUpdateProgress(ctx);
+
+ OutputPluginPrepareWrite(ctx, true);
+ logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+ OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ OutputPluginUpdateProgress(ctx);
+
+ OutputPluginPrepareWrite(ctx, true);
+ logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+ OutputPluginWrite(ctx, true);
+}
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ OutputPluginUpdateProgress(ctx);
+
+ OutputPluginPrepareWrite(ctx, true);
+ logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+ OutputPluginWrite(ctx, true);
+}
+
/*
* Sends the decoded DML over wire.
*/
@@ -364,6 +439,18 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
MemoryContextReset(data->context);
}
+/*
+ * Filter out unnecessary two-phase transactions.
+ *
+ * Currently, we forward all two-phase transactions
+ */
+static bool
+pgoutput_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid)
+{
+ return false;
+}
+
/*
* Currently we always forward.
*/
@@ -374,6 +461,37 @@ pgoutput_origin_filter(LogicalDecodingContext *ctx,
return false;
}
+/*
+ * Check if we should continue to decode this transaction.
+ *
+ * If it has aborted in the meanwhile, then there's no sense
+ * in decoding and sending the rest of the changes, we might
+ * as well ask the subscribers to abort immediately.
+ *
+ * This should be called if we are streaming a transaction
+ * before it's committed or if we are decoding a 2PC
+ * transaction. Otherwise we always decode committed
+ * transactions
+ *
+ * Additional checks can be added here, as needed
+ */
+static bool
+pgoutput_decode_txn_filter(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn)
+{
+ /*
+ * Due to caching, repeated TransactionIdDidAbort calls
+ * shouldn't be that expensive
+ */
+ if (txn != NULL &&
+ TransactionIdIsValid(txn->xid) &&
+ TransactionIdDidAbort(txn->xid))
+ return true;
+
+ /* if txn is NULL, filter it out */
+ return (txn == NULL);
+}
+
/*
* Shutdown the output plugin.
*
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index f05cde202f..5a4da6efab 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -44,6 +44,7 @@ extern GlobalTransaction MarkAsPreparing(TransactionId xid, const char *gid,
extern void StartPrepare(GlobalTransaction gxact);
extern void EndPrepare(GlobalTransaction gxact);
extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
+extern bool LookupGXact(const char *gid);
extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
int *nxids_p);
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 0eb21057c5..886025f3aa 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -69,11 +69,20 @@ typedef struct LogicalRepBeginData
TransactionId xid;
} LogicalRepBeginData;
+#define LOGICALREP_IS_COMMIT 0x01
+#define LOGICALREP_IS_ABORT 0x02
+#define LOGICALREP_IS_PREPARE 0x04
+#define LOGICALREP_IS_COMMIT_PREPARED 0x08
+#define LOGICALREP_IS_ROLLBACK_PREPARED 0x10
+#define LOGICALREP_COMMIT_MASK (LOGICALREP_IS_COMMIT | LOGICALREP_IS_ABORT)
+#define LOGICALREP_PREPARE_MASK (LOGICALREP_IS_PREPARE | LOGICALREP_IS_COMMIT_PREPARED | LOGICALREP_IS_ROLLBACK_PREPARED)
typedef struct LogicalRepCommitData
{
+ uint8 flag;
XLogRecPtr commit_lsn;
XLogRecPtr end_lsn;
TimestampTz committime;
+ char gid[GIDSIZE];
} LogicalRepCommitData;
extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
@@ -81,8 +90,14 @@ extern void logicalrep_read_begin(StringInfo in,
LogicalRepBeginData *begin_data);
extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+extern void logicalrep_write_abort(StringInfo out, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
extern void logicalrep_read_commit(StringInfo in,
- LogicalRepCommitData *commit_data);
+ LogicalRepCommitData *commit_data, uint8 *flags);
+extern void logicalrep_read_prepare(StringInfo in,
+ LogicalRepCommitData *commit_data, uint8 *flags);
extern void logicalrep_write_origin(StringInfo out, const char *origin,
XLogRecPtr origin_lsn);
extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
diff --git a/src/test/subscription/t/010_twophase.pl b/src/test/subscription/t/010_twophase.pl
new file mode 100644
index 0000000000..c7f373df93
--- /dev/null
+++ b/src/test/subscription/t/010_twophase.pl
@@ -0,0 +1,163 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 12;
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf(
+ 'postgresql.conf', qq(
+ max_prepared_transactions = 10
+ ));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf(
+ 'postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres', "CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO tab_full SELECT generate_series(1,10)");
+$node_publisher->safe_psql('postgres', "CREATE TABLE tab_full2 (x text)");
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO tab_full2 VALUES ('a'), ('b'), ('b')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_full2 (x text)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+"ALTER PUBLICATION tap_pub ADD TABLE tab_full, tab_full2"
+);
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+ or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+ or die "Timed out while waiting for subscriber to synchronize data";
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres',
+ "BEGIN;INSERT INTO tab_full VALUES (11);PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+ or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+ is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+ "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+ or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+ is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+$result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+ is($result, qq(0), 'transaction is committed on subscriber');
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres',
+ "BEGIN;INSERT INTO tab_full VALUES (12);PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+ or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+ is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+ "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+ or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+ is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+ is($result, qq(0), 'transaction is aborted on subscriber');
+
+# Check that commit prepared is decoded properly on crash restart
+$node_publisher->safe_psql('postgres', "
+ BEGIN;
+ INSERT INTO tab_full VALUES (12);
+ INSERT INTO tab_full VALUES (13);
+ PREPARE TRANSACTION 'test_prepared_tab';");
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+ or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+# TODO add test cases involving DDL. This can be added after we add functionality
+# to replicate DDL changes to subscriber.
+
+# check all the cleanup
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+ "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+ "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+ "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+ 'check subscription relation status was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+ "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+ "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
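Note for reviewers: the flags byte introduced in proto.c and dispatched in worker.c can be summarised with a small standalone sketch. The bit values and masks are copied from the logicalproto.h hunk above; the enum and the function `classify_message` are hypothetical names used only for illustration and are not part of the patch. The checks are done in the same order as apply_handle_commit() and apply_handle_prepare().

```c
#include <stdint.h>

/* Flag bits and masks, copied from the patch's logicalproto.h hunk. */
#define LOGICALREP_IS_COMMIT            0x01
#define LOGICALREP_IS_ABORT             0x02
#define LOGICALREP_IS_PREPARE           0x04
#define LOGICALREP_IS_COMMIT_PREPARED   0x08
#define LOGICALREP_IS_ROLLBACK_PREPARED 0x10
#define LOGICALREP_COMMIT_MASK \
	(LOGICALREP_IS_COMMIT | LOGICALREP_IS_ABORT)
#define LOGICALREP_PREPARE_MASK \
	(LOGICALREP_IS_PREPARE | LOGICALREP_IS_COMMIT_PREPARED | \
	 LOGICALREP_IS_ROLLBACK_PREPARED)

/* Hypothetical result type, for illustration only. */
typedef enum
{
	MSG_INVALID,
	MSG_COMMIT,
	MSG_ABORT,
	MSG_PREPARE,
	MSG_COMMIT_PREPARED,
	MSG_ROLLBACK_PREPARED
} MsgKind;

/*
 * Hypothetical helper: map a message type byte ('C' or 'P') plus its
 * flags byte to the action the apply worker would take.
 */
MsgKind
classify_message(char msgtype, uint8_t flags)
{
	if (msgtype == 'C')
	{
		/* logicalrep_read_commit() rejects flags outside the mask */
		if (!(flags & LOGICALREP_COMMIT_MASK))
			return MSG_INVALID;
		return (flags & LOGICALREP_IS_COMMIT) ? MSG_COMMIT : MSG_ABORT;
	}

	if (msgtype == 'P')
	{
		/* logicalrep_read_prepare() rejects flags outside the mask */
		if (!(flags & LOGICALREP_PREPARE_MASK))
			return MSG_INVALID;
		if (flags & LOGICALREP_IS_PREPARE)
			return MSG_PREPARE;
		if (flags & LOGICALREP_IS_COMMIT_PREPARED)
			return MSG_COMMIT_PREPARED;
		return MSG_ROLLBACK_PREPARED;
	}

	return MSG_INVALID;
}
```

One consequence visible in this sketch: a 'C' message with flags == 0, which is what the pre-patch logicalrep_write_commit() sent, is classified as invalid, so the flags byte is no longer optional on the wire.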
reorderbuffer_2PC_logic_4.patch (application/octet-stream)
commit b311db06373d069fac2697122fe54eb2b62963ee
Author: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Wed Feb 28 13:47:40 2018 +0530
Teach ReorderBuffer to deal with 2PC.
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index d6e4b7980f..f3091af385 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1506,6 +1506,11 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
gid);
ProcArrayRemove(proc, latestXid);
+ /*
+ * Tell logical decoding backends interested in this XID
+ * that this is going away
+ */
+ LogicalDecodeRemoveTransaction(proc, isCommit);
/*
* In case we fail while running the callbacks, mark the gxact invalid so
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 6eb0d5527e..b45739d971 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -34,6 +34,7 @@
#include "access/xlogutils.h"
#include "access/xlogreader.h"
#include "access/xlogrecord.h"
+#include "access/twophase.h"
#include "catalog/pg_control.h"
@@ -72,6 +73,8 @@ static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_commit *parsed, TransactionId xid);
static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare *parsed);
/* common function to decode tuples */
static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -280,16 +283,33 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
break;
}
case XLOG_XACT_PREPARE:
+ {
+ xl_xact_parsed_prepare parsed;
- /*
- * Currently decoding ignores PREPARE TRANSACTION and will just
- * decode the transaction when the COMMIT PREPARED is sent or
- * throw away the transaction's contents when a ROLLBACK PREPARED
- * is received. In the future we could add code to expose prepared
- * transactions in the changestream allowing for a kind of
- * distributed 2PC.
- */
- ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+ /* check that output plugin is capable of twophase decoding */
+ if (!ctx->enable_twophase)
+ {
+ ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+ break;
+ }
+
+ /* ok, parse it */
+ ParsePrepareRecord(XLogRecGetInfo(buf->record),
+ XLogRecGetData(buf->record), &parsed);
+
+ /* does output plugin want this particular transaction? */
+ if (ctx->callbacks.filter_prepare_cb &&
+ ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+ parsed.twophase_gid))
+ {
+ ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+ buf->origptr);
+ break;
+ }
+
+ DecodePrepare(ctx, buf, &parsed);
+ break;
+ }
break;
default:
elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
@@ -627,9 +647,71 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
buf->origptr, buf->endptr);
}
+ if (TransactionIdIsValid(parsed->twophase_xid) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder,
+ parsed->twophase_xid, parsed->twophase_gid))
+ {
+ Assert(xid == parsed->twophase_xid);
+ /* we are processing COMMIT PREPARED */
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn, parsed->twophase_gid, true);
+ }
+ else
+ {
+ /* replay actions of all transaction + subtransactions in order */
+ ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn);
+ }
+}
+
+/*
+ * Decode PREPARE record. Similar logic as in COMMIT
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare *parsed)
+{
+ XLogRecPtr origin_lsn = parsed->origin_lsn;
+ TimestampTz commit_time = parsed->origin_timestamp;
+ RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+ int i;
+ TransactionId xid = parsed->twophase_xid;
+
+ /*
+ * Process invalidation messages, even if we're not interested in the
+ * transaction's contents, since the various caches need to always be
+ * consistent.
+ */
+ if (parsed->nmsgs > 0)
+ {
+ ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
+ parsed->nmsgs, parsed->msgs);
+ ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+ }
+
+ if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+ (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+ FilterByOrigin(ctx, origin_id))
+ {
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr);
+ }
+ ReorderBufferForget(ctx->reorder, xid, buf->origptr);
+
+ return;
+ }
+
+ /* tell the reorderbuffer about the surviving subtransactions */
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+ buf->origptr, buf->endptr);
+ }
+
/* replay actions of all transaction + subtransactions in order */
- ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
- commit_time, origin_id, origin_lsn);
+ ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn, parsed->twophase_gid);
}
/*
@@ -641,6 +723,30 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid)
{
int i;
+ XLogRecPtr origin_lsn = InvalidXLogRecPtr;
+ TimestampTz commit_time = 0;
+ RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ origin_lsn = parsed->origin_lsn;
+ commit_time = parsed->origin_timestamp;
+ }
+
+ /*
+ * If it's ROLLBACK PREPARED then handle it via callbacks.
+ */
+ if (TransactionIdIsValid(xid) &&
+ !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+ parsed->dbId == ctx->slot->data.database &&
+ !FilterByOrigin(ctx, origin_id) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
+ {
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid, false);
+ return;
+ }
for (i = 0; i < parsed->nsubxacts; i++)
{
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index c8ccade241..db93d3927b 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -60,6 +60,18 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+static void abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+static bool filter_decode_txn_cb_wrapper(ReorderBuffer *cache,
+ ReorderBufferTXN *txn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change);
static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -125,6 +137,7 @@ StartupDecodingContext(List *output_plugin_options,
MemoryContext context,
old_context;
LogicalDecodingContext *ctx;
+ int twophase_callbacks;
/* shorter lines... */
slot = MyReplicationSlot;
@@ -184,8 +197,27 @@ StartupDecodingContext(List *output_plugin_options,
ctx->reorder->begin = begin_cb_wrapper;
ctx->reorder->apply_change = change_cb_wrapper;
ctx->reorder->commit = commit_cb_wrapper;
+ ctx->reorder->abort = abort_cb_wrapper;
+ ctx->reorder->filter_decode_txn = filter_decode_txn_cb_wrapper;
+ ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+ ctx->reorder->prepare = prepare_cb_wrapper;
+ ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+ ctx->reorder->abort_prepared = abort_prepared_cb_wrapper;
ctx->reorder->message = message_cb_wrapper;
+ /* check that plugin implements all callbacks necessary to perform 2PC */
+ twophase_callbacks = (ctx->callbacks.prepare_cb != NULL) +
+ (ctx->callbacks.commit_prepared_cb != NULL) +
+ (ctx->callbacks.abort_prepared_cb != NULL);
+
+ ctx->enable_twophase = (twophase_callbacks == 3);
+
+ if (twophase_callbacks != 3 && twophase_callbacks != 0)
+ ereport(WARNING,
+ (errmsg("output plugin registered only %d of 3 two-phase callbacks",
+ twophase_callbacks),
+ errdetail("Two-phase transactions will be decoded at commit time.")));
+
ctx->out = makeStringInfo();
ctx->prepare_write = prepare_write;
ctx->write = do_write;
@@ -693,6 +725,122 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
error_context_stack = errcallback.previous;
}
+static void
+abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "abort";
+ state.report_location = txn->final_lsn; /* beginning of abort record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.abort_cb(ctx, txn, abort_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "prepare";
+ state.report_location = txn->final_lsn; /* beginning of prepare record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "commit_prepared";
+ state.report_location = txn->final_lsn; /* beginning of commit record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "abort_prepared";
+ state.report_location = txn->final_lsn; /* beginning of abort record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.abort_prepared_cb(ctx, txn, abort_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
static void
change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change)
@@ -730,6 +878,63 @@ change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
error_context_stack = errcallback.previous;
}
+static bool
+filter_decode_txn_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+ bool ret;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "filter_decode_txn";
+ state.report_location = InvalidXLogRecPtr;
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = false;
+
+ /* do the actual work: call callback */
+ ret = ctx->callbacks.filter_decode_txn_cb(ctx, txn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+ return ret;
+}
+
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+ bool ret;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "filter_prepare";
+ state.report_location = InvalidXLogRecPtr;
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = false;
+
+ /* do the actual work: call callback */
+ ret = ctx->callbacks.filter_prepare_cb(ctx, txn, xid, gid);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+ return ret;
+}
+
bool
filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
{
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index d22e116aa1..9d9ce0438d 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1264,25 +1264,18 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
* the top and subtransactions (using a k-way merge) and replay the changes in
* lsn order.
*/
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+ ReorderBuffer *rb, TransactionId xid,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
TimestampTz commit_time,
RepOriginId origin_id, XLogRecPtr origin_lsn)
{
- ReorderBufferTXN *txn;
volatile Snapshot snapshot_now;
volatile CommandId command_id = FirstCommandId;
bool using_subtxn;
ReorderBufferIterTXNState *volatile iterstate = NULL;
- txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
- false);
-
- /* unknown transaction, nothing to replay */
- if (txn == NULL)
- return;
-
txn->final_lsn = commit_lsn;
txn->end_lsn = end_lsn;
txn->commit_time = commit_time;
@@ -1326,20 +1319,62 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
{
ReorderBufferChange *change;
ReorderBufferChange *specinsert = NULL;
+ bool change_cleanup = false;
+ bool check_txn_status,
+ apply_started = false;
+ bool is_prepared = rbtxn_prepared(txn);
+
+ /*
+ * Check once whether the xid has already committed.
+ * If it has not, consult the filter_decode_txn
+ * callback before each change to learn whether it is
+ * still OK to continue decoding this xid.
+ *
+ * This handles an abort that happens concurrently
+ * with the decoding activity.
+ */
+ check_txn_status = TransactionIdDidCommit(txn->xid)?
+ false : true;
if (using_subtxn)
BeginInternalSubTransaction("replay");
else
StartTransactionCommand();
- rb->begin(rb, txn);
-
iterstate = ReorderBufferIterTXNInit(rb, txn);
while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
{
Relation relation = NULL;
Oid reloid;
+ /*
+ * While decoding 2PC or while streaming uncommitted
+ * transactions, check whether this transaction still
+ * needs to be decoded. If the transaction was aborted
+ * or we were instructed to stop decoding, bail out
+ * early.
+ */
+ if (check_txn_status && rb->filter_decode_txn(rb, txn))
+ {
+ elog(LOG, "%s decoding of %s (%u)",
+ apply_started? "stopping":"skipping",
+ is_prepared? txn->gid:"",
+ txn->xid);
+ change_cleanup = true;
+ goto change_cleanuptxn;
+ }
+
+ /*
+ * The decode filter above gave us the go-ahead to
+ * apply changes, so BEGIN the transaction on the
+ * other side.
+ */
+ if (!apply_started)
+ {
+ rb->begin(rb, txn);
+ apply_started = true;
+ }
+
switch (change->action)
{
case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1375,7 +1410,17 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
relpathperm(change->data.tp.relnode,
MAIN_FORKNUM));
+ /* Lock transaction before catalog access */
+ if (!LogicalLockTransaction(txn))
+ {
+ elog(LOG, "stopping decoding of %s (%u)",
+ is_prepared? txn->gid:"",
+ txn->xid);
+ change_cleanup = true;
+ goto change_cleanuptxn;
+ }
relation = RelationIdGetRelation(reloid);
+ LogicalUnlockTransaction(txn);
if (relation == NULL)
elog(ERROR, "could not open relation with OID %u (for filenode \"%s\")",
@@ -1546,6 +1591,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
}
}
+change_cleanuptxn:
/*
* There's a speculative insertion remaining, just clean in up, it
* can't have been successful, otherwise we'd gotten a confirmation
@@ -1561,8 +1607,24 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
ReorderBufferIterTXNFinish(rb, iterstate);
iterstate = NULL;
- /* call commit callback */
- rb->commit(rb, txn, commit_lsn);
+ if (change_cleanup)
+ {
+ /* call abort if we have sent any changes */
+ if (apply_started)
+ rb->abort(rb, txn, commit_lsn);
+ }
+ else
+ {
+ /* call commit or prepare callback */
+ if (rbtxn_prepared(txn))
+ rb->prepare(rb, txn, commit_lsn);
+ else
+ rb->commit(rb, txn, commit_lsn);
+ }
+
+ /* remove ourselves from the decodeGroupLeader */
+ if (MyProc->decodeGroupLeader)
+ RemoveDecodeGroupMember(MyProc->decodeGroupLeader);
/* this is just a sanity check against bad output plugin behaviour */
if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -1589,7 +1651,13 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
if (snapshot_now->copied)
ReorderBufferFreeSnap(rb, snapshot_now);
- /* remove potential on-disk data, and deallocate */
+ /*
+ * remove potential on-disk data, and deallocate.
+ *
+ * We remove it even for prepared transactions,
+ * because COMMIT PREPARED needs none of this data
+ * once the PREPARE has been decoded successfully.
+ */
ReorderBufferCleanupTXN(rb, txn);
}
PG_CATCH();
@@ -1623,6 +1691,136 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
PG_END_TRY();
}
+/*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+ return rb->filter_prepare(rb, txn, xid, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferCommitInternal()
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /* unknown transaction, nothing to replay */
+ if (txn == NULL)
+ return;
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a twophase transaction. It calls ReorderBufferCommitInternal()
+ * since all prepared transactions need to be decoded at PREPARE time.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /* unknown transaction, nothing to replay */
+ if (txn == NULL)
+ return;
+
+ txn->txn_flags |= RBTXN_PREPARE;
+ strlcpy(txn->gid, gid, sizeof(txn->gid));
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Check whether this transaction was sent as prepared to subscribers.
+ * Called while handling commit|abort prepared.
+ */
+bool
+ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+ const char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /*
+ * Always call the prepare filter. It's the job of the prepare
+ * filter to give us the *same* response for a given xid
+ * across multiple calls (including ones on restart)
+ */
+ return !(rb->filter_prepare(rb, txn, xid, gid));
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ABORT PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit)
+{
+ ReorderBufferTXN *txn;
+
+ /*
+ * The transaction may or may not exist (after a restart, for
+ * example). In any case, a 2PC transaction carries no buffered
+ * changes at this point, so allow it to be created below.
+ */
+ txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+ true);
+
+ txn->final_lsn = commit_lsn;
+ txn->end_lsn = end_lsn;
+ txn->commit_time = commit_time;
+ txn->origin_id = origin_id;
+ txn->origin_lsn = origin_lsn;
+ strlcpy(txn->gid, gid, sizeof(txn->gid));
+
+ if (is_commit)
+ {
+ txn->txn_flags |= RBTXN_COMMIT_PREPARED;
+ rb->commit_prepared(rb, txn, commit_lsn);
+ }
+ else
+ {
+ txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
+ rb->abort_prepared(rb, txn, commit_lsn);
+ }
+
+ /* cleanup: make sure there's no cache pollution */
+ ReorderBufferExecuteInvalidations(rb, txn);
+ ReorderBufferCleanupTXN(rb, txn);
+}
+
/*
* Abort a transaction that possibly has previous changes. Needs to be first
* called for subtransactions and then for the toplevel xid.
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 61c4ae37f3..836ccefee6 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -322,6 +322,9 @@ typedef xl_xact_parsed_commit xl_xact_parsed_prepare;
typedef struct xl_xact_parsed_abort
{
+ Oid dbId;
+ Oid tsId;
+
TimestampTz xact_time;
uint32 xinfo;
@@ -332,6 +335,10 @@ typedef struct xl_xact_parsed_abort
RelFileNode *xnodes;
TransactionId twophase_xid; /* only for 2PC */
+ char twophase_gid[GIDSIZE]; /* only for 2PC */
+
+ XLogRecPtr origin_lsn;
+ TimestampTz origin_timestamp;
} xl_xact_parsed_abort;
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 63b14367f0..9dad4c997f 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -89,6 +89,11 @@ typedef struct LogicalDecodingContext
bool prepared_write;
XLogRecPtr write_location;
TransactionId write_xid;
+
+ /*
+ * Capabilities of the output plugin.
+ */
+ bool enable_twophase;
} LogicalDecodingContext;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 78fd38bb16..61c5019adf 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -67,6 +67,46 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+/*
+ * Called for an implicit ABORT of a transaction.
+ */
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+/*
+ * Called before decoding of a PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to the prepare and
+ * commit_prepared/abort_prepared callbacks, or should wait until COMMIT
+ * PREPARED and be sent as a regular transaction (return true for that).
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
/*
* Called for the generic logical decoding messages.
*/
@@ -84,6 +124,12 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ctx,
RepOriginId origin_id);
+/*
+ * Filter to check whether to keep decoding this transaction; true means stop.
+ */
+typedef bool (*LogicalDecodeFilterDecodeTxnCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn);
+
/*
* Called to shutdown an output plugin.
*/
@@ -98,8 +144,14 @@ typedef struct OutputPluginCallbacks
LogicalDecodeBeginCB begin_cb;
LogicalDecodeChangeCB change_cb;
LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeAbortCB abort_cb;
LogicalDecodeMessageCB message_cb;
+ LogicalDecodeFilterPrepareCB filter_prepare_cb;
+ LogicalDecodePrepareCB prepare_cb;
+ LogicalDecodeCommitPreparedCB commit_prepared_cb;
+ LogicalDecodeAbortPreparedCB abort_prepared_cb;
LogicalDecodeFilterByOriginCB filter_by_origin_cb;
+ LogicalDecodeFilterDecodeTxnCB filter_decode_txn_cb;
LogicalDecodeShutdownCB shutdown_cb;
} OutputPluginCallbacks;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index d6b00654c2..40072de297 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
#define REORDERBUFFER_H
#include "access/htup_details.h"
+#include "access/twophase.h"
#include "lib/ilist.h"
#include "storage/sinval.h"
#include "utils/hsearch.h"
@@ -179,6 +180,9 @@ typedef struct ReorderBufferTXN
*/
TransactionId xid;
+ /* In case of 2PC we need to pass GID to output plugin */
+ char gid[GIDSIZE];
+
/*
* LSN of the first data carrying, WAL record with knowledge about this
* xid. This is allowed to *not* be first record adorned with this xid, if
@@ -312,6 +316,41 @@ typedef void (*ReorderBufferCommitCB) (
ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+/* abort callback signature */
+typedef void (*ReorderBufferAbortCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+typedef bool (*ReorderBufferFilterDecodeTxnCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn);
+
+typedef bool (*ReorderBufferFilterPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/* abort prepared callback signature */
+typedef void (*ReorderBufferAbortPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+
/* message callback signature */
typedef void (*ReorderBufferMessageCB) (
ReorderBuffer *rb,
@@ -347,6 +386,12 @@ struct ReorderBuffer
ReorderBufferBeginCB begin;
ReorderBufferApplyChangeCB apply_change;
ReorderBufferCommitCB commit;
+ ReorderBufferAbortCB abort;
+ ReorderBufferFilterDecodeTxnCB filter_decode_txn;
+ ReorderBufferFilterPrepareCB filter_prepare;
+ ReorderBufferPrepareCB prepare;
+ ReorderBufferCommitPreparedCB commit_prepared;
+ ReorderBufferAbortPreparedCB abort_prepared;
ReorderBufferMessageCB message;
/*
@@ -389,6 +434,11 @@ void ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot
void ReorderBufferCommit(ReorderBuffer *, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit);
void ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
void ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
@@ -412,6 +462,15 @@ void ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
bool ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
bool ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
+bool ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+ const char *gid);
+bool ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+ const char *gid);
+void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid);
ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
void ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
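As an aside on the reorderbuffer.c hunk above: the tail of ReorderBufferCommitInternal() chooses which callback closes the replayed transaction. Below is a minimal standalone sketch of that decision, assuming it depends only on the three flags visible in the diff (change_cleanup, apply_started and rbtxn_prepared(txn)). The enum and the function name are invented for illustration and are not part of the patch.

```c
#include <stdbool.h>

/* Hypothetical labels for the callbacks the patch may invoke at the end. */
typedef enum SketchFinalCallback
{
	SKETCH_CB_NONE,				/* nothing was sent, so nothing to close */
	SKETCH_CB_ABORT,			/* rb->abort() */
	SKETCH_CB_PREPARE,			/* rb->prepare() */
	SKETCH_CB_COMMIT			/* rb->commit() */
} SketchFinalCallback;

/*
 * Mirrors the if/else ladder after the change_cleanuptxn label:
 * - decoding was cut short: abort, but only if BEGIN was already sent;
 * - otherwise: prepare for a 2PC transaction, commit for a regular one.
 */
static SketchFinalCallback
sketch_pick_final_callback(bool change_cleanup, bool apply_started,
						   bool is_prepared)
{
	if (change_cleanup)
		return apply_started ? SKETCH_CB_ABORT : SKETCH_CB_NONE;

	return is_prepared ? SKETCH_CB_PREPARE : SKETCH_CB_COMMIT;
}
```

Writing it as a pure function of the flags makes it easy to check each combination by hand against the diff; the real code of course calls the callbacks directly rather than returning a label.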
Hi,
On 2018-02-28 21:12:42 +0530, Nikhil Sontakke wrote:
Attached are 5 patches split up from the original patch that I had
submitted earlier.
In the future you should number them. Right now they appear to be out of
order in your email. I suggest using git format-patch, that does all
the necessary work for you.
Greetings,
Andres Freund
On 2 March 2018 at 08:53, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2018-02-28 21:12:42 +0530, Nikhil Sontakke wrote:
Attached are 5 patches split up from the original patch that I had
submitted earlier.
In the future you should number them. Right now they appear to be out of
order in your email. I suggest using git format-patch, that does all
the necessary work for you.
Yep, especially git format-patch with a -v argument, so the whole patchset
is visibly versioned and sorts in the correct order.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi Andres and Craig,
In the future you should number them. Right now they appear to be out of
order in your email. I suggest using git format-patch, that does all
the necessary work for you.
Yep, especially git format-patch with a -v argument, so the whole patchset is
visibly versioned and sorts in the correct order.
I did try to use *_Number.patch to convey the sequence, but admittedly
it's pretty lame.
I will re-submit with "git format-patch" soon.
Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Hi Andres,
I will re-submit with "git format-patch" soon.
PFA, patches in "format-patch" format.
This patch set also includes changes in the test_decoding plugin along
with an additional savepoint-related test case that was pointed out
upstream on this thread.
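For readers following along, here is a rough standalone sketch of how an output plugin's filter_prepare callback might act on an option like this. The struct and function names (SketchPluginState, sketch_filter_prepare) and the "_nodecode" gid rule are made up for illustration and are not taken from the patch. The only thing borrowed from it is the convention that returning true means "skip the PREPARE and send the transaction as a regular commit later".

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for a plugin's private state; not from the patch. */
typedef struct SketchPluginState
{
	bool		enable_twophase;	/* mirrors an "enable-twophase" option */
} SketchPluginState;

/*
 * Returns true when the PREPARE should be filtered out, i.e. the
 * transaction is decoded as an ordinary commit at COMMIT PREPARED time.
 *
 * The answer depends only on the option and the gid, so repeated calls
 * for the same transaction (including after a restart) agree.
 */
static bool
sketch_filter_prepare(const SketchPluginState *state, const char *gid)
{
	assert(state != NULL && gid != NULL);

	/* 2PC decoding switched off: always fall back to plain commit. */
	if (!state->enable_twophase)
		return true;

	/* Hypothetical per-transaction opt-out based on the gid text. */
	if (strstr(gid, "_nodecode") != NULL)
		return true;

	return false;
}
```

Keeping the filter free of mutable state is what lets it give the same response every time it is asked about a given xid and gid.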
Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Attachments:
0006-Teach-test_decoding-plugin-to-work-with-2PC.patch (application/octet-stream)
From fbb387e50b6ca0a65d51d875d4878db719319d14 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Mon, 5 Mar 2018 21:55:25 +0530
Subject: [PATCH 6/6] Teach test_decoding plugin to work with 2PC
Includes a new option "enable_twophase". Depending on this option's
value, PREPARE TRANSACTION will either be decoded or treated as
a single-phase commit later.
---
contrib/test_decoding/expected/prepared.out | 262 +++++++++++++++++++++++++---
contrib/test_decoding/sql/prepared.sql | 90 +++++++++-
contrib/test_decoding/test_decoding.c | 145 ++++++++++++++-
3 files changed, 466 insertions(+), 31 deletions(-)
diff --git a/contrib/test_decoding/expected/prepared.out b/contrib/test_decoding/expected/prepared.out
index 46e915d4ff..4086a23f63 100644
--- a/contrib/test_decoding/expected/prepared.out
+++ b/contrib/test_decoding/expected/prepared.out
@@ -6,19 +6,85 @@ SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_d
init
(1 row)
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc', 'test_decoding');
+ ?column?
+----------
+ init
+(1 row)
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+-- Reused queries
+\set get_no2pc 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'');'
+\set get_with2pc 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot_2pc'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'', ''enable-twophase'', ''1'');'
-- test simple successful use of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (1);
PREPARE TRANSACTION 'test_prepared#1';
+:get_no2pc
+ data
+------
+(0 rows)
+
+:get_with2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ PREPARE TRANSACTION 'test_prepared#1'
+(3 rows)
+
COMMIT PREPARED 'test_prepared#1';
+:get_no2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ COMMIT
+(3 rows)
+
+:get_with2pc
+ data
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
INSERT INTO test_prepared1 VALUES (2);
-- test abort of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (3);
PREPARE TRANSACTION 'test_prepared#2';
+:get_no2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+(3 rows)
+
+:get_with2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(6 rows)
+
ROLLBACK PREPARED 'test_prepared#2';
+:get_no2pc
+ data
+------
+(0 rows)
+
+:get_with2pc
+ data
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
INSERT INTO test_prepared1 VALUES (4);
-- test prepared xact containing ddl
BEGIN;
@@ -26,45 +92,193 @@ INSERT INTO test_prepared1 VALUES (5);
ALTER TABLE test_prepared1 ADD COLUMN data text;
INSERT INTO test_prepared1 VALUES (6, 'frakbar');
PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
-INSERT INTO test_prepared2 VALUES (7);
-COMMIT PREPARED 'test_prepared#3';
--- make sure stuff still works
-INSERT INTO test_prepared1 VALUES (8);
-INSERT INTO test_prepared2 VALUES (9);
--- cleanup
-DROP TABLE test_prepared1;
-DROP TABLE test_prepared2;
--- show results
-SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'test_prepared1'::regclass;
+ relation | locktype | mode
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
+:get_no2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4
+ COMMIT
+(3 rows)
+
+:get_with2pc
data
-------------------------------------------------------------------------
BEGIN
- table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:4
COMMIT
BEGIN
- table public.test_prepared1: INSERT: id[integer]:2
- COMMIT
+ table public.test_prepared1: INSERT: id[integer]:5
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(7 rows)
+
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
+INSERT INTO test_prepared2 VALUES (7);
+:get_no2pc
+ data
+----------------------------------------------------
BEGIN
- table public.test_prepared1: INSERT: id[integer]:4
+ table public.test_prepared2: INSERT: id[integer]:7
COMMIT
+(3 rows)
+
+:get_with2pc
+ data
+----------------------------------------------------
BEGIN
table public.test_prepared2: INSERT: id[integer]:7
COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+:get_no2pc
+ data
+-------------------------------------------------------------------------
BEGIN
table public.test_prepared1: INSERT: id[integer]:5
table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
COMMIT
+(4 rows)
+
+:get_with2pc
+ data
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#3'
+(1 row)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (8);
+INSERT INTO test_prepared2 VALUES (9);
+:get_no2pc
+ data
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:9
+ COMMIT
+(6 rows)
+
+:get_with2pc
+ data
+--------------------------------------------------------------------
BEGIN
table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
COMMIT
BEGIN
table public.test_prepared2: INSERT: id[integer]:9
COMMIT
-(22 rows)
+(6 rows)
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'pg_class'::regclass;
+ relation | locktype | mode
+----------+----------+------
+(0 rows)
+
+-- Shouldn't see anything with 2pc decoding off
+:get_no2pc
+ data
+-----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:12
+ COMMIT
+(3 rows)
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+:get_with2pc
+ data
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:12
+ PREPARE TRANSACTION 'test_prepared_lock2'
+ COMMIT PREPARED 'test_prepared_lock2'
+(8 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- Both will work normally after we commit
+:get_no2pc
+ data
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ COMMIT
+(4 rows)
+
+:get_with2pc
+ data
+--------------------------------------
+ COMMIT PREPARED 'test_prepared_lock'
+(1 row)
+
+-- test savepoints
+BEGIN;
+SAVEPOINT test_savepoint;
+CREATE TABLE test_prepared_savepoint (a int);
+PREPARE TRANSACTION 'test_prepared_savepoint';
+COMMIT PREPARED 'test_prepared_savepoint';
+:get_with2pc
+ data
+-------------------------------------------
+ COMMIT PREPARED 'test_prepared_savepoint'
+(1 row)
+
+:get_with2pc
+ data
+------
+(0 rows)
+
+-- cleanup
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+:get_no2pc
+ data
+------
+(0 rows)
+
+:get_with2pc
+ data
+------
+(0 rows)
SELECT pg_drop_replication_slot('regression_slot');
pg_drop_replication_slot
@@ -72,3 +286,9 @@ SELECT pg_drop_replication_slot('regression_slot');
(1 row)
+SELECT pg_drop_replication_slot('regression_slot_2pc');
+ pg_drop_replication_slot
+--------------------------
+
+(1 row)
+
diff --git a/contrib/test_decoding/sql/prepared.sql b/contrib/test_decoding/sql/prepared.sql
index e72639767e..cb32abd740 100644
--- a/contrib/test_decoding/sql/prepared.sql
+++ b/contrib/test_decoding/sql/prepared.sql
@@ -1,22 +1,35 @@
-- predictability
SET synchronous_commit = on;
SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc', 'test_decoding');
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+
+-- Reused queries
+\set get_no2pc 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'');'
+\set get_with2pc 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot_2pc'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'', ''enable-twophase'', ''1'');'
-- test simple successful use of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (1);
PREPARE TRANSACTION 'test_prepared#1';
+:get_no2pc
+:get_with2pc
COMMIT PREPARED 'test_prepared#1';
+:get_no2pc
+:get_with2pc
INSERT INTO test_prepared1 VALUES (2);
-- test abort of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (3);
PREPARE TRANSACTION 'test_prepared#2';
+:get_no2pc
+:get_with2pc
ROLLBACK PREPARED 'test_prepared#2';
+:get_no2pc
+:get_with2pc
INSERT INTO test_prepared1 VALUES (4);
@@ -27,24 +40,85 @@ ALTER TABLE test_prepared1 ADD COLUMN data text;
INSERT INTO test_prepared1 VALUES (6, 'frakbar');
PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'test_prepared1'::regclass;
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
+:get_no2pc
+:get_with2pc
+
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
INSERT INTO test_prepared2 VALUES (7);
+:get_no2pc
+:get_with2pc
COMMIT PREPARED 'test_prepared#3';
+:get_no2pc
+:get_with2pc
-- make sure stuff still works
INSERT INTO test_prepared1 VALUES (8);
INSERT INTO test_prepared2 VALUES (9);
+:get_no2pc
+:get_with2pc
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'pg_class'::regclass;
+
+-- Shouldn't see anything with 2pc decoding off
+:get_no2pc
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+:get_with2pc
+RESET statement_timeout;
+
+COMMIT PREPARED 'test_prepared_lock';
+
+-- Both will work normally after we commit
+:get_no2pc
+:get_with2pc
+
+-- test savepoints
+BEGIN;
+SAVEPOINT test_savepoint;
+CREATE TABLE test_prepared_savepoint (a int);
+PREPARE TRANSACTION 'test_prepared_savepoint';
+COMMIT PREPARED 'test_prepared_savepoint';
+:get_with2pc
+:get_with2pc
-- cleanup
DROP TABLE test_prepared1;
DROP TABLE test_prepared2;
--- show results
-SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+-- show results. There should be nothing to show
+:get_no2pc
+:get_with2pc
+
SELECT pg_drop_replication_slot('regression_slot');
+SELECT pg_drop_replication_slot('regression_slot_2pc');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 0f18afa852..3c22bd6a18 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -24,6 +24,8 @@
#include "replication/message.h"
#include "replication/origin.h"
+#include "storage/procarray.h"
+
#include "utils/builtins.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
@@ -46,6 +48,7 @@ typedef struct
bool skip_empty_xacts;
bool xact_wrote_changes;
bool only_local;
+ bool enable_twophase;
} TestDecodingData;
static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -59,6 +62,8 @@ static void pg_output_begin(LogicalDecodingContext *ctx,
bool last_write);
static void pg_decode_commit_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pg_decode_abort_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr abort_lsn);
static void pg_decode_change(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, Relation rel,
ReorderBufferChange *change);
@@ -68,6 +73,18 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr message_lsn,
bool transactional, const char *prefix,
Size sz, const char *message);
+static bool pg_filter_prepare(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
void
_PG_init(void)
@@ -85,9 +102,14 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
cb->begin_cb = pg_decode_begin_txn;
cb->change_cb = pg_decode_change;
cb->commit_cb = pg_decode_commit_txn;
+ cb->abort_cb = pg_decode_abort_txn;
cb->filter_by_origin_cb = pg_decode_filter;
cb->shutdown_cb = pg_decode_shutdown;
cb->message_cb = pg_decode_message;
+ cb->filter_prepare_cb = pg_filter_prepare;
+ cb->prepare_cb = pg_decode_prepare_txn;
+ cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+ cb->abort_prepared_cb = pg_decode_abort_prepared_txn;
}
@@ -107,6 +129,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
data->include_timestamp = false;
data->skip_empty_xacts = false;
data->only_local = false;
+ data->enable_twophase = false;
ctx->output_plugin_private = data;
@@ -156,7 +179,6 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
}
else if (strcmp(elem->defname, "skip-empty-xacts") == 0)
{
-
if (elem->arg == NULL)
data->skip_empty_xacts = true;
else if (!parse_bool(strVal(elem->arg), &data->skip_empty_xacts))
@@ -167,7 +189,6 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
}
else if (strcmp(elem->defname, "only-local") == 0)
{
-
if (elem->arg == NULL)
data->only_local = true;
else if (!parse_bool(strVal(elem->arg), &data->only_local))
@@ -176,6 +197,16 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
errmsg("could not parse value \"%s\" for parameter \"%s\"",
strVal(elem->arg), elem->defname)));
}
+ else if (strcmp(elem->defname, "enable-twophase") == 0)
+ {
+ if (elem->arg == NULL)
+ data->enable_twophase = true;
+ else if (!parse_bool(strVal(elem->arg), &data->enable_twophase))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("could not parse value \"%s\" for parameter \"%s\"",
+ strVal(elem->arg), elem->defname)));
+ }
else
{
ereport(ERROR,
@@ -244,6 +275,112 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
OutputPluginWrite(ctx, true);
}
+/* ABORT callback */
+static void
+pg_decode_abort_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+ if (data->include_xids)
+ appendStringInfo(ctx->out, "ABORT %u", txn->xid);
+ else
+ appendStringInfoString(ctx->out, "ABORT");
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* Filter out unnecessary two-phase transactions */
+static bool
+pg_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ /* treat all transactions as one-phase */
+ if (!data->enable_twophase)
+ return true;
+
+ return false;
+}
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* ABORT PREPARED callback */
+static void
+pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
static bool
pg_decode_filter(LogicalDecodingContext *ctx,
RepOriginId origin_id)
@@ -409,8 +546,12 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
}
data->xact_wrote_changes = true;
+ if (!LogicalLockTransaction(txn))
+ return;
class_form = RelationGetForm(relation);
tupdesc = RelationGetDescr(relation);
+ LogicalUnlockTransaction(txn);
+
/* Avoid leaking memory by using and resetting our own context */
old = MemoryContextSwitchTo(data->context);
--
2.14.3 (Apple Git-98)
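
The effect of the `enable-twophase` option in the test_decoding changes above can be sketched in isolation. This is only an illustrative model, not code from the patch: `SketchOptions`, `SketchEvent`, `sketch_filter_prepare` and `sketch_marker` are hypothetical names. It assumes, as the regression test suggests, that a filtered transaction is emitted as an ordinary one-phase transaction (ending in `COMMIT`) once `COMMIT PREPARED` is decoded.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the plugin's private option state. */
typedef struct
{
    bool enable_twophase;
} SketchOptions;

/* The two WAL events of interest for a prepared transaction. */
typedef enum
{
    EV_PREPARE,
    EV_COMMIT_PREPARED
} SketchEvent;

/*
 * Mirrors the logic of pg_filter_prepare: returning true means
 * "do not decode at PREPARE time, treat as one-phase".
 */
static bool
sketch_filter_prepare(const SketchOptions *opts)
{
    return !opts->enable_twophase;
}

/*
 * Returns the closing marker a slot would be expected to see for the
 * event, or NULL when nothing is emitted at that point (assumption
 * based on the expected test output, not on the decoder source).
 */
static const char *
sketch_marker(const SketchOptions *opts, SketchEvent ev)
{
    bool filtered = sketch_filter_prepare(opts);

    switch (ev)
    {
        case EV_PREPARE:
            return filtered ? NULL : "PREPARE TRANSACTION";
        case EV_COMMIT_PREPARED:
            return filtered ? "COMMIT" : "COMMIT PREPARED";
    }
    return NULL;
}
```

With the option off, nothing is emitted at prepare time and the changes arrive with a plain `COMMIT` later. With it on, the changes arrive at prepare time and only the `COMMIT PREPARED` marker follows.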
Attachment: 0005-pgoutput-output-plugin-support-for-logical-decoding-.patch (application/octet-stream)
From 849175e102275cc7d0e104149f31e2ba51c3d73a Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Mon, 5 Mar 2018 21:53:13 +0530
Subject: [PATCH 5/6] pgoutput output plugin support for logical decoding of
2PC.
Includes documentation and test cases.
---
doc/src/sgml/logicaldecoding.sgml | 128 ++++++++++++++++++++++++-
src/backend/access/transam/twophase.c | 32 +++++++
src/backend/replication/logical/proto.c | 100 ++++++++++++++++++--
src/backend/replication/logical/worker.c | 141 +++++++++++++++++++++++++++-
src/backend/replication/pgoutput/pgoutput.c | 84 +++++++++++++++++
src/include/access/twophase.h | 1 +
src/include/replication/logicalproto.h | 17 +++-
7 files changed, 490 insertions(+), 13 deletions(-)
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 5501eed108..0c9a51ae49 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -384,7 +384,12 @@ typedef struct OutputPluginCallbacks
LogicalDecodeBeginCB begin_cb;
LogicalDecodeChangeCB change_cb;
LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeAbortCB abort_cb;
LogicalDecodeMessageCB message_cb;
+ LogicalDecodeFilterPrepareCB filter_prepare_cb;
+ LogicalDecodePrepareCB prepare_cb;
+ LogicalDecodeCommitPreparedCB commit_prepared_cb;
+ LogicalDecodeAbortPreparedCB abort_prepared_cb;
LogicalDecodeFilterByOriginCB filter_by_origin_cb;
LogicalDecodeShutdownCB shutdown_cb;
} OutputPluginCallbacks;
@@ -454,7 +459,12 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
never get
decoded. Successful savepoints are
folded into the transaction containing them in the order they were
- executed within that transaction.
+ executed within that transaction. A transaction that is prepared for
+ a two-phase commit using <command>PREPARE TRANSACTION</command> will
+ also be decoded if the output plugin callbacks needed for decoding
+ it are provided. The transaction currently being decoded may be
+ aborted concurrently by a <command>ROLLBACK PREPARED</command>
+ command. In that case, logical decoding of it is aborted midway.
</para>
<note>
@@ -550,6 +560,74 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
</para>
</sect3>
+ <sect3 id="logicaldecoding-output-plugin-prepare">
+ <title>Transaction Prepare Callback</title>
+
+ <para>
+ The optional <function>prepare_cb</function> callback is called whenever
+ a transaction which is prepared for two-phase commit has been
+ decoded. The <function>change_cb</function> callbacks for all modified
+ rows will have been called before this, if there have been any modified
+ rows.
+<programlisting>
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+ <title>Commit Prepared Transaction Callback</title>
+
+ <para>
+ The optional <function>commit_prepared_cb</function> callback is called whenever
+ a commit prepared transaction has been decoded. The <parameter>gid</parameter> field,
+ which is part of the <parameter>txn</parameter> parameter can be used in this
+ callback.
+<programlisting>
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-abort-prepared">
+ <title>Rollback Prepared Transaction Callback</title>
+
+ <para>
+ The optional <function>abort_prepared_cb</function> callback is called whenever
+ a rollback prepared transaction has been decoded. The <parameter>gid</parameter> field,
+ which is part of the <parameter>txn</parameter> parameter can be used in this
+ callback.
+<programlisting>
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-abort">
+ <title>Transaction Abort Callback</title>
+
+ <para>
+ The required <function>abort_cb</function> callback is called whenever
+ a transaction abort has to be initiated. This can happen if we are
+ decoding a transaction that has been prepared for two-phase commit and
+ a concurrent rollback happens while we are decoding it. In such cases
+ it can make sense to check, before decoding commences, whether the
+ rollback has already happened, so that decoding of such transactions
+ can be avoided completely.
+<programlisting>
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
<sect3 id="logicaldecoding-output-plugin-change">
<title>Change Callback</title>
@@ -559,12 +637,30 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
an <command>INSERT</command>, <command>UPDATE</command>,
or <command>DELETE</command>. Even if the original command modified
several rows at once the callback will be called individually for each
- row.
+ row. The <function>change_cb</function> callback may access system or
+ user catalog tables to aid in the process of outputting the row
+ modification details. The <function>change_cb</function> callback should
+ call <function>LogicalLockTransaction</function> before any such access to
+ system or user catalog tables. When decoding a prepared (but not yet
+ committed) transaction, or any other uncommitted transaction, this
+ function interlocks the decoding activity with a simultaneous rollback
+ of that same transaction by another backend. The
+ <function>change_cb</function> callback should call
+ <function>LogicalUnlockTransaction</function> immediately after
+ the catalog access.
<programlisting>
typedef void (*LogicalDecodeChangeCB) (struct LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
Relation relation,
ReorderBufferChange *change);
+</programlisting>
+ Here's an example of the use of <function>LogicalLockTransaction</function>
+ and <function>LogicalUnlockTransaction</function> in an output plugin:
+<programlisting>
+ if (!LogicalLockTransaction(txn))
+ return;
+ relation = RelationIdGetRelation(reloid);
+ LogicalUnlockTransaction(txn);
</programlisting>
The <parameter>ctx</parameter> and <parameter>txn</parameter> parameters
have the same contents as for the <function>begin_cb</function>
@@ -614,6 +710,34 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
</para>
</sect3>
+ <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+ <title>Prepare Filter Callback</title>
+
+ <para>
+ The optional <function>filter_prepare_cb</function> callback
+ is called to determine whether data that is part of the current
+ two-phase commit transaction should be decoded
+ at this prepare stage or as a regular one-phase transaction at
+ <command>COMMIT PREPARED</command> time later. To signal that
+ decoding at prepare time should be skipped, return true; false otherwise.
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+</programlisting>
+ The <parameter>ctx</parameter> parameter has the same contents
+ as for the other callbacks. The <parameter>txn</parameter> parameter
+ contains meta information about the transaction. The <parameter>xid</parameter>
+ contains the XID because <parameter>txn</parameter> can be NULL in some cases.
+ The <parameter>gid</parameter> is the identifier that later identifies this
+ transaction for <command>COMMIT PREPARED</command> or <command>ROLLBACK PREPARED</command>.
+ The callback has to provide the same static answer for a given
+ combination of <parameter>xid</parameter> and
+ <parameter>gid</parameter> every time it is called.
+ </para>
+ </sect3>
+
<sect3 id="logicaldecoding-output-plugin-message">
<title>Generic Message Callback</title>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index f3091af385..e2db0ebf77 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -549,6 +549,38 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held)
ProcArrayAdd(&ProcGlobal->allProcs[gxact->pgprocno]);
}
+/*
+ * LookupGXact
+ * Check whether a valid prepared transaction with the given GID
+ * exists.
+ */
+bool
+LookupGXact(const char *gid)
+{
+ int i;
+
+ LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
+
+ for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+ {
+ GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+ /* Ignore not-yet-valid GIDs */
+ if (!gxact->valid)
+ continue;
+ if (strcmp(gxact->gid, gid) != 0)
+ continue;
+
+ LWLockRelease(TwoPhaseStateLock);
+
+ return true;
+ }
+
+ LWLockRelease(TwoPhaseStateLock);
+
+ return false;
+}
+
/*
* LockGXact
* Locate the prepared transaction and mark it busy for COMMIT or PREPARE.
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 948343e4ae..cae4c72fed 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -72,10 +72,11 @@ void
logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
XLogRecPtr commit_lsn)
{
- uint8 flags = 0;
+ uint8 flags = 0;
pq_sendbyte(out, 'C'); /* sending COMMIT */
+ flags |= LOGICALREP_IS_COMMIT;
/* send the flags field (unused for now) */
pq_sendbyte(out, flags);
@@ -86,21 +87,106 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
}
/*
- * Read transaction COMMIT from the stream.
+ * Write ABORT to the output stream.
+ */
+void
+logicalrep_write_abort(StringInfo out, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ uint8 flags = 0;
+
+ pq_sendbyte(out, 'C'); /* sending ABORT flag below */
+
+ flags |= LOGICALREP_IS_ABORT;
+ /* send the flags field */
+ pq_sendbyte(out, flags);
+
+ /* send fields */
+ pq_sendint64(out, abort_lsn);
+ pq_sendint64(out, txn->end_lsn);
+ pq_sendint64(out, txn->commit_time);
+}
+
+/*
+ * Read transaction COMMIT|ABORT from the stream.
*/
void
-logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
+logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data,
+ uint8 *flags)
{
- /* read flags (unused for now) */
- uint8 flags = pq_getmsgbyte(in);
+ /* read flags */
+ uint8 commit_flags = pq_getmsgbyte(in);
- if (flags != 0)
- elog(ERROR, "unrecognized flags %u in commit message", flags);
+ if (!(commit_flags & LOGICALREP_COMMIT_MASK))
+ elog(ERROR, "unrecognized flags %u in commit|abort message",
+ commit_flags);
/* read fields */
commit_data->commit_lsn = pq_getmsgint64(in);
commit_data->end_lsn = pq_getmsgint64(in);
commit_data->committime = pq_getmsgint64(in);
+
+ /* set gid to empty */
+ commit_data->gid[0] = '\0';
+
+ *flags = commit_flags;
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ uint8 flags = 0;
+
+ pq_sendbyte(out, 'P'); /* sending PREPARE protocol */
+
+ if (rbtxn_commit_prepared(txn))
+ flags |= LOGICALREP_IS_COMMIT_PREPARED;
+ else if (rbtxn_rollback_prepared(txn))
+ flags |= LOGICALREP_IS_ROLLBACK_PREPARED;
+ else if (rbtxn_prepared(txn))
+ flags |= LOGICALREP_IS_PREPARE;
+
+ if (flags == 0)
+ elog(ERROR, "unrecognized flags %u in [commit|rollback] prepare message", flags);
+
+ /* send the flags field */
+ pq_sendbyte(out, flags);
+
+ /* send fields */
+ pq_sendint64(out, prepare_lsn);
+ pq_sendint64(out, txn->end_lsn);
+ pq_sendint64(out, txn->commit_time);
+
+ /* send gid */
+ pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepCommitData *commit_data, uint8 *flags)
+{
+ /* read flags */
+ uint8 prep_flags = pq_getmsgbyte(in);
+
+ if (!(prep_flags & LOGICALREP_PREPARE_MASK))
+ elog(ERROR, "unrecognized flags %u in prepare message", prep_flags);
+
+ /* read fields */
+ commit_data->commit_lsn = pq_getmsgint64(in);
+ commit_data->end_lsn = pq_getmsgint64(in);
+ commit_data->committime = pq_getmsgint64(in);
+
+ /* read gid, truncating rather than overrunning the GIDSIZE buffer */
+ strlcpy(commit_data->gid, pq_getmsgstring(in), sizeof(commit_data->gid));
+
+ /* set flags */
+ *flags = prep_flags;
}
/*
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 04985c9f91..5f0b40e760 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -452,8 +452,9 @@ static void
apply_handle_commit(StringInfo s)
{
LogicalRepCommitData commit_data;
+ uint8 flags = 0;
- logicalrep_read_commit(s, &commit_data);
+ logicalrep_read_commit(s, &commit_data, &flags);
Assert(commit_data.commit_lsn == remote_final_lsn);
@@ -467,7 +468,11 @@ apply_handle_commit(StringInfo s)
replorigin_session_origin_lsn = commit_data.end_lsn;
replorigin_session_origin_timestamp = commit_data.committime;
- CommitTransactionCommand();
+ if (flags & LOGICALREP_IS_COMMIT)
+ CommitTransactionCommand();
+ else if (flags & LOGICALREP_IS_ABORT)
+ AbortCurrentTransaction();
+
pgstat_report_stat(false);
store_flush_position(commit_data.end_lsn);
@@ -487,6 +492,132 @@ apply_handle_commit(StringInfo s)
pgstat_report_activity(STATE_IDLE, NULL);
}
+static void
+apply_handle_prepare_txn(LogicalRepCommitData *commit_data)
+{
+ Assert(commit_data->commit_lsn == remote_final_lsn);
+ /* The synchronization worker runs in a single transaction. */
+ if (IsTransactionState() && !am_tablesync_worker())
+ {
+ /* End the earlier transaction and start a new one */
+ BeginTransactionBlock();
+ CommitTransactionCommand();
+ StartTransactionCommand();
+ /*
+ * Update origin state so we can restart streaming from correct
+ * position in case of crash.
+ */
+ replorigin_session_origin_lsn = commit_data->end_lsn;
+ replorigin_session_origin_timestamp = commit_data->committime;
+
+ PrepareTransactionBlock(commit_data->gid);
+ CommitTransactionCommand();
+ pgstat_report_stat(false);
+
+ store_flush_position(commit_data->end_lsn);
+ }
+ else
+ {
+ /* Process any invalidation messages that might have accumulated. */
+ AcceptInvalidationMessages();
+ maybe_reread_subscription();
+ }
+
+ in_remote_transaction = false;
+
+ /* Process any tables that are being synchronized in parallel. */
+ process_syncing_tables(commit_data->end_lsn);
+
+ pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+static void
+apply_handle_commit_prepared_txn(LogicalRepCommitData *commit_data)
+{
+ /* there is no transaction when COMMIT PREPARED is called */
+ ensure_transaction();
+
+ /*
+ * Update origin state so we can restart streaming from correct
+ * position in case of crash.
+ */
+ replorigin_session_origin_lsn = commit_data->end_lsn;
+ replorigin_session_origin_timestamp = commit_data->committime;
+
+ FinishPreparedTransaction(commit_data->gid, true);
+ CommitTransactionCommand();
+ pgstat_report_stat(false);
+
+ store_flush_position(commit_data->end_lsn);
+ in_remote_transaction = false;
+
+ /* Process any tables that are being synchronized in parallel. */
+ process_syncing_tables(commit_data->end_lsn);
+
+ pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+static void
+apply_handle_rollback_prepared_txn(LogicalRepCommitData *commit_data)
+{
+
+
+ /*
+ * Update origin state so we can restart streaming from correct
+ * position in case of crash.
+ */
+ replorigin_session_origin_lsn = commit_data->end_lsn;
+ replorigin_session_origin_timestamp = commit_data->committime;
+
+ /*
+ * While a prepared transaction is being decoded, it is possible that it
+ * gets aborted concurrently. In that case, we stop the decoding and
+ * abort the transaction immediately. However, the ROLLBACK PREPARED
+ * message still reaches the subscriber, so it is OK for the gid to be
+ * missing here.
+ */
+ if (LookupGXact(commit_data->gid))
+ {
+ /* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+ ensure_transaction();
+ FinishPreparedTransaction(commit_data->gid, false);
+ CommitTransactionCommand();
+ }
+
+ pgstat_report_stat(false);
+
+ store_flush_position(commit_data->end_lsn);
+ in_remote_transaction = false;
+
+ /* Process any tables that are being synchronized in parallel. */
+ process_syncing_tables(commit_data->end_lsn);
+
+ pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+ LogicalRepCommitData commit_data;
+ uint8 flags = 0;
+
+ logicalrep_read_prepare(s, &commit_data, &flags);
+
+ if (flags & LOGICALREP_IS_PREPARE)
+ apply_handle_prepare_txn(&commit_data);
+ else if (flags & LOGICALREP_IS_COMMIT_PREPARED)
+ apply_handle_commit_prepared_txn(&commit_data);
+ else if (flags & LOGICALREP_IS_ROLLBACK_PREPARED)
+ apply_handle_rollback_prepared_txn(&commit_data);
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg("wrong [commit|rollback] prepare message")));
+}
+
/*
* Handle ORIGIN message.
*
@@ -884,10 +1015,14 @@ apply_dispatch(StringInfo s)
case 'B':
apply_handle_begin(s);
break;
- /* COMMIT */
+ /* COMMIT|ABORT */
case 'C':
apply_handle_commit(s);
break;
+ /* [COMMIT|ROLLBACK] PREPARE */
+ case 'P':
+ apply_handle_prepare(s);
+ break;
/* INSERT */
case 'I':
apply_handle_insert(s);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index d538f25ede..2c04766888 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -37,11 +37,21 @@ static void pgoutput_begin_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn);
static void pgoutput_commit_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_abort_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr abort_lsn);
static void pgoutput_change(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, Relation rel,
ReorderBufferChange *change);
static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
RepOriginId origin_id);
+static bool pgoutput_filter_prepare(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, TransactionId xid, const char *gid);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_abort_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
static bool publications_valid;
@@ -79,6 +89,13 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
cb->begin_cb = pgoutput_begin_txn;
cb->change_cb = pgoutput_change;
cb->commit_cb = pgoutput_commit_txn;
+ cb->abort_cb = pgoutput_abort_txn;
+
+ cb->filter_prepare_cb = pgoutput_filter_prepare;
+ cb->prepare_cb = pgoutput_prepare_txn;
+ cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+ cb->abort_prepared_cb = pgoutput_abort_prepared_txn;
+
cb->filter_by_origin_cb = pgoutput_origin_filter;
cb->shutdown_cb = pgoutput_shutdown;
}
@@ -251,6 +268,61 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
OutputPluginWrite(ctx, true);
}
+/*
+ * ABORT callback
+ */
+static void
+pgoutput_abort_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ OutputPluginUpdateProgress(ctx);
+
+ OutputPluginPrepareWrite(ctx, true);
+ logicalrep_write_abort(ctx->out, txn, abort_lsn);
+ OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ OutputPluginUpdateProgress(ctx);
+
+ OutputPluginPrepareWrite(ctx, true);
+ logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+ OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ OutputPluginUpdateProgress(ctx);
+
+ OutputPluginPrepareWrite(ctx, true);
+ logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+ OutputPluginWrite(ctx, true);
+}
+/*
+ * ABORT PREPARED callback
+ */
+static void
+pgoutput_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ OutputPluginUpdateProgress(ctx);
+
+ OutputPluginPrepareWrite(ctx, true);
+ logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+ OutputPluginWrite(ctx, true);
+}
+
/*
* Sends the decoded DML over wire.
*/
@@ -364,6 +436,18 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
MemoryContextReset(data->context);
}
+/*
+ * Filter out unnecessary two-phase transactions.
+ *
+ * Currently, we forward all two-phase transactions.
+ */
+static bool
+pgoutput_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid)
+{
+ return false;
+}
+
/*
* Currently we always forward.
*/
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index f05cde202f..5a4da6efab 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -44,6 +44,7 @@ extern GlobalTransaction MarkAsPreparing(TransactionId xid, const char *gid,
extern void StartPrepare(GlobalTransaction gxact);
extern void EndPrepare(GlobalTransaction gxact);
extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
+extern bool LookupGXact(const char *gid);
extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
int *nxids_p);
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 0eb21057c5..886025f3aa 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -69,11 +69,20 @@ typedef struct LogicalRepBeginData
TransactionId xid;
} LogicalRepBeginData;
+#define LOGICALREP_IS_COMMIT 0x01
+#define LOGICALREP_IS_ABORT 0x02
+#define LOGICALREP_IS_PREPARE 0x04
+#define LOGICALREP_IS_COMMIT_PREPARED 0x08
+#define LOGICALREP_IS_ROLLBACK_PREPARED 0x10
+#define LOGICALREP_COMMIT_MASK (LOGICALREP_IS_COMMIT | LOGICALREP_IS_ABORT)
+#define LOGICALREP_PREPARE_MASK (LOGICALREP_IS_PREPARE | LOGICALREP_IS_COMMIT_PREPARED | LOGICALREP_IS_ROLLBACK_PREPARED)
typedef struct LogicalRepCommitData
{
+ uint8 flag;
XLogRecPtr commit_lsn;
XLogRecPtr end_lsn;
TimestampTz committime;
+ char gid[GIDSIZE];
} LogicalRepCommitData;
extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
@@ -81,8 +90,14 @@ extern void logicalrep_read_begin(StringInfo in,
LogicalRepBeginData *begin_data);
extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+extern void logicalrep_write_abort(StringInfo out, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
extern void logicalrep_read_commit(StringInfo in,
- LogicalRepCommitData *commit_data);
+ LogicalRepCommitData *commit_data, uint8 *flags);
+extern void logicalrep_read_prepare(StringInfo in,
+ LogicalRepCommitData *commit_data, uint8 *flags);
extern void logicalrep_write_origin(StringInfo out, const char *origin,
XLogRecPtr origin_lsn);
extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
--
2.14.3 (Apple Git-98)
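The flag bits added to logicalproto.h in the patch above split into two disjoint masks, one for plain commit/abort messages and one for the 2PC messages. A minimal sketch of how a reader might validate an incoming flag byte against those masks follows. The `#define` values are copied from the hunk; the helper functions `flag_is_single_bit`, `commit_flag_is_valid`, `prepare_flag_is_valid` and `sanity_check_masks` are hypothetical names used only for illustration and are not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag values copied from the logicalproto.h hunk. */
#define LOGICALREP_IS_COMMIT            0x01
#define LOGICALREP_IS_ABORT             0x02
#define LOGICALREP_IS_PREPARE           0x04
#define LOGICALREP_IS_COMMIT_PREPARED   0x08
#define LOGICALREP_IS_ROLLBACK_PREPARED 0x10
#define LOGICALREP_COMMIT_MASK  (LOGICALREP_IS_COMMIT | LOGICALREP_IS_ABORT)
#define LOGICALREP_PREPARE_MASK (LOGICALREP_IS_PREPARE | \
                                 LOGICALREP_IS_COMMIT_PREPARED | \
                                 LOGICALREP_IS_ROLLBACK_PREPARED)

/* Hypothetical helper: the two masks must not share any bit. */
static void
sanity_check_masks(void)
{
    assert((LOGICALREP_COMMIT_MASK & LOGICALREP_PREPARE_MASK) == 0);
}

/* Hypothetical helper: true if exactly one bit is set in flag. */
static bool
flag_is_single_bit(uint8_t flag)
{
    return flag != 0 && (flag & (flag - 1)) == 0;
}

/* Hypothetical check for a plain COMMIT or ABORT message. */
static bool
commit_flag_is_valid(uint8_t flag)
{
    return flag_is_single_bit(flag) && (flag & LOGICALREP_COMMIT_MASK) != 0;
}

/* Hypothetical check for PREPARE, COMMIT PREPARED or ROLLBACK PREPARED. */
static bool
prepare_flag_is_valid(uint8_t flag)
{
    return flag_is_single_bit(flag) && (flag & LOGICALREP_PREPARE_MASK) != 0;
}
```

Requiring exactly one bit is an assumption of this sketch; the patch itself only defines the bits and masks.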
0004-Teach-ReorderBuffer-to-deal-with-2PC.patch (application/octet-stream)
From 829f7bd6ae0b159565b165ccc6facb048cda3552 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Mon, 5 Mar 2018 21:51:52 +0530
Subject: [PATCH 4/6] Teach ReorderBuffer to deal with 2PC.
---
src/backend/access/transam/twophase.c | 5 +
src/backend/replication/logical/decode.c | 135 ++++++++++++--
src/backend/replication/logical/logical.c | 181 +++++++++++++++++++
src/backend/replication/logical/reorderbuffer.c | 226 ++++++++++++++++++++++--
src/include/access/xact.h | 7 +
src/include/replication/logical.h | 5 +
src/include/replication/output_plugin.h | 45 +++++
src/include/replication/reorderbuffer.h | 54 ++++++
8 files changed, 632 insertions(+), 26 deletions(-)
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index d6e4b7980f..f3091af385 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1506,6 +1506,11 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
gid);
ProcArrayRemove(proc, latestXid);
+ /*
+ * Tell logical decoding backends interested in this XID
+ * that this is going away
+ */
+ LogicalDecodeRemoveTransaction(proc, isCommit);
/*
* In case we fail while running the callbacks, mark the gxact invalid so
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 6eb0d5527e..e1b021750f 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -34,6 +34,7 @@
#include "access/xlogutils.h"
#include "access/xlogreader.h"
#include "access/xlogrecord.h"
+#include "access/twophase.h"
#include "catalog/pg_control.h"
@@ -72,6 +73,8 @@ static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_commit *parsed, TransactionId xid);
static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare *parsed);
/* common function to decode tuples */
static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -280,16 +283,33 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
break;
}
case XLOG_XACT_PREPARE:
+ {
+ xl_xact_parsed_prepare parsed;
- /*
- * Currently decoding ignores PREPARE TRANSACTION and will just
- * decode the transaction when the COMMIT PREPARED is sent or
- * throw away the transaction's contents when a ROLLBACK PREPARED
- * is received. In the future we could add code to expose prepared
- * transactions in the changestream allowing for a kind of
- * distributed 2PC.
- */
- ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+ /* check that output plugin is capable of twophase decoding */
+ if (!ctx->enable_twophase)
+ {
+ ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+ break;
+ }
+
+ /* ok, parse it */
+ ParsePrepareRecord(XLogRecGetInfo(buf->record),
+ XLogRecGetData(buf->record), &parsed);
+
+ /* does output plugin want this particular transaction? */
+ if (ctx->callbacks.filter_prepare_cb &&
+ ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+ parsed.twophase_gid))
+ {
+ ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+ buf->origptr);
+ break;
+ }
+
+ DecodePrepare(ctx, buf, &parsed);
+ break;
+ }
break;
default:
elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
@@ -627,9 +647,78 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
buf->origptr, buf->endptr);
}
+ if (TransactionIdIsValid(parsed->twophase_xid) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder,
+ parsed->twophase_xid, parsed->twophase_gid))
+ {
+ Assert(xid == parsed->twophase_xid);
+ /* we are processing COMMIT PREPARED */
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn, parsed->twophase_gid, true);
+ }
+ else
+ {
+ /* replay actions of all transaction + subtransactions in order */
+ ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn);
+ }
+}
+
+/*
+ * Decode PREPARE record. The logic is similar to DecodeCommit().
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare *parsed)
+{
+ XLogRecPtr origin_lsn = parsed->origin_lsn;
+ TimestampTz commit_time = parsed->origin_timestamp;
+ RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+ int i;
+ TransactionId xid = parsed->twophase_xid;
+
+ /*
+ * Process invalidation messages, even if we're not interested in the
+ * transaction's contents, since the various caches need to always be
+ * consistent.
+ */
+ if (parsed->nmsgs > 0)
+ {
+ if (!ctx->fast_forward)
+ ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
+ parsed->nmsgs, parsed->msgs);
+ ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+ }
+
+ /*
+ * Tell the reorderbuffer about the surviving subtransactions. We need
+ * to do this because the main transaction itself has not committed
+ * since we are in the prepare phase right now. So we need to be sure
+ * the snapshot is set up correctly for the main transaction in case all
+ * changes happened in subtransactions.
+ */
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+ buf->origptr, buf->endptr);
+ }
+
+ if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+ (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+ ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+ {
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr);
+ }
+ ReorderBufferForget(ctx->reorder, xid, buf->origptr);
+
+ return;
+ }
+
/* replay actions of all transaction + subtransactions in order */
- ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
- commit_time, origin_id, origin_lsn);
+ ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn, parsed->twophase_gid);
}
/*
@@ -641,6 +730,30 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid)
{
int i;
+ XLogRecPtr origin_lsn = InvalidXLogRecPtr;
+ TimestampTz commit_time = 0;
+ RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ origin_lsn = parsed->origin_lsn;
+ commit_time = parsed->origin_timestamp;
+ }
+
+ /*
+ * If it's ROLLBACK PREPARED then handle it via callbacks.
+ */
+ if (TransactionIdIsValid(xid) &&
+ !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+ parsed->dbId == ctx->slot->data.database &&
+ !FilterByOrigin(ctx, origin_id) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
+ {
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid, false);
+ return;
+ }
for (i = 0; i < parsed->nsubxacts; i++)
{
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index c8ccade241..4e8f77201c 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -60,6 +60,16 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+static void abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change);
static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -125,6 +135,7 @@ StartupDecodingContext(List *output_plugin_options,
MemoryContext context,
old_context;
LogicalDecodingContext *ctx;
+ int twophase_callbacks;
/* shorter lines... */
slot = MyReplicationSlot;
@@ -184,8 +195,26 @@ StartupDecodingContext(List *output_plugin_options,
ctx->reorder->begin = begin_cb_wrapper;
ctx->reorder->apply_change = change_cb_wrapper;
ctx->reorder->commit = commit_cb_wrapper;
+ ctx->reorder->abort = abort_cb_wrapper;
+ ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+ ctx->reorder->prepare = prepare_cb_wrapper;
+ ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+ ctx->reorder->abort_prepared = abort_prepared_cb_wrapper;
ctx->reorder->message = message_cb_wrapper;
+ /* check that plugin implements all callbacks necessary to perform 2PC */
+ twophase_callbacks = (ctx->callbacks.prepare_cb != NULL) +
+ (ctx->callbacks.commit_prepared_cb != NULL) +
+ (ctx->callbacks.abort_prepared_cb != NULL);
+
+ ctx->enable_twophase = (twophase_callbacks == 3);
+
+ if (twophase_callbacks != 3 && twophase_callbacks != 0)
+ ereport(WARNING,
+ (errmsg("Output plugin registered only %d twophase callbacks. "
+ "Twophase transactions will be decoded at commit time.",
+ twophase_callbacks)));
+
ctx->out = makeStringInfo();
ctx->prepare_write = prepare_write;
ctx->write = do_write;
@@ -693,6 +722,122 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
error_context_stack = errcallback.previous;
}
+static void
+abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "abort";
+ state.report_location = txn->final_lsn; /* beginning of abort record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.abort_cb(ctx, txn, abort_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "prepare";
+ state.report_location = txn->final_lsn; /* beginning of prepare record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "commit_prepared";
+ state.report_location = txn->final_lsn; /* beginning of commit record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "abort_prepared";
+ state.report_location = txn->final_lsn; /* beginning of abort record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.abort_prepared_cb(ctx, txn, abort_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
static void
change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change)
@@ -730,6 +875,42 @@ change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
error_context_stack = errcallback.previous;
}
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+ bool ret;
+
+ /*
+ * If twophase is not enabled, skip decoding at
+ * PREPARE time
+ */
+ if (!ctx->enable_twophase)
+ return true;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "filter_prepare";
+ state.report_location = InvalidXLogRecPtr;
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = false;
+
+ /* do the actual work: call callback */
+ ret = ctx->callbacks.filter_prepare_cb(ctx, txn, xid, gid);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+ return ret;
+}
+
bool
filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
{
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index d22e116aa1..573e1aa90a 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1264,25 +1264,18 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
* the top and subtransactions (using a k-way merge) and replay the changes in
* lsn order.
*/
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+ ReorderBuffer *rb, TransactionId xid,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
TimestampTz commit_time,
RepOriginId origin_id, XLogRecPtr origin_lsn)
{
- ReorderBufferTXN *txn;
volatile Snapshot snapshot_now;
volatile CommandId command_id = FirstCommandId;
bool using_subtxn;
ReorderBufferIterTXNState *volatile iterstate = NULL;
- txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
- false);
-
- /* unknown transaction, nothing to replay */
- if (txn == NULL)
- return;
-
txn->final_lsn = commit_lsn;
txn->end_lsn = end_lsn;
txn->commit_time = commit_time;
@@ -1326,20 +1319,60 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
{
ReorderBufferChange *change;
ReorderBufferChange *specinsert = NULL;
+ bool change_cleanup = false;
+ bool check_txn_status,
+ apply_started = false;
+ bool is_prepared = rbtxn_prepared(txn);
+
+ /*
+ * check for the xid once to see if it's already
+ * committed.
+ *
+ * This is to handle cases of concurrent abort
+ * happening parallel to the decode activity
+ */
+ check_txn_status = TransactionIdDidCommit(txn->xid)?
+ false : true;
if (using_subtxn)
BeginInternalSubTransaction("replay");
else
StartTransactionCommand();
- rb->begin(rb, txn);
-
iterstate = ReorderBufferIterTXNInit(rb, txn);
while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
{
Relation relation = NULL;
Oid reloid;
+ /*
+ * While decoding 2PC or while streaming uncommitted
+ * transactions, check if this transaction needs to
+ * be still decoded. If the transaction got aborted
+ * or if we were instructed to stop decoding, then
+ * bail out early.
+ */
+ if (check_txn_status && TransactionIdDidAbort(txn->xid))
+ {
+ elog(LOG, "%s decoding of %s (%u)",
+ apply_started? "stopping":"skipping",
+ is_prepared? txn->gid:"",
+ txn->xid);
+ change_cleanup = true;
+ goto change_cleanuptxn;
+ }
+
+ /*
+ * The abort check above gave us the go-ahead to apply
+ * changes, so BEGIN the transaction on the other side
+ * (once, before the first change is sent).
+ */
+ if (apply_started == false)
+ {
+ rb->begin(rb, txn);
+ apply_started = true;
+ }
+
switch (change->action)
{
case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
@@ -1375,7 +1408,17 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
relpathperm(change->data.tp.relnode,
MAIN_FORKNUM));
+ /* Lock transaction before catalog access */
+ if (!LogicalLockTransaction(txn))
+ {
+ elog(LOG, "stopping decoding of %s (%u)",
+ is_prepared? txn->gid:"",
+ txn->xid);
+ change_cleanup = true;
+ goto change_cleanuptxn;
+ }
relation = RelationIdGetRelation(reloid);
+ LogicalUnlockTransaction(txn);
if (relation == NULL)
elog(ERROR, "could not open relation with OID %u (for filenode \"%s\")",
@@ -1546,6 +1589,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
}
}
+change_cleanuptxn:
/*
* There's a speculative insertion remaining, just clean in up, it
* can't have been successful, otherwise we'd gotten a confirmation
@@ -1561,8 +1605,24 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
ReorderBufferIterTXNFinish(rb, iterstate);
iterstate = NULL;
- /* call commit callback */
- rb->commit(rb, txn, commit_lsn);
+ if (change_cleanup)
+ {
+ /* call abort if we have sent any changes */
+ if (apply_started)
+ rb->abort(rb, txn, commit_lsn);
+ }
+ else
+ {
+ /* call commit or prepare callback */
+ if (rbtxn_prepared(txn))
+ rb->prepare(rb, txn, commit_lsn);
+ else
+ rb->commit(rb, txn, commit_lsn);
+ }
+
+ /* remove ourselves from the decodeGroupLeader */
+ if (MyProc->decodeGroupLeader)
+ RemoveDecodeGroupMember(MyProc->decodeGroupLeader);
/* this is just a sanity check against bad output plugin behaviour */
if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -1589,7 +1649,13 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
if (snapshot_now->copied)
ReorderBufferFreeSnap(rb, snapshot_now);
- /* remove potential on-disk data, and deallocate */
+ /*
+ * remove potential on-disk data, and deallocate.
+ *
+ * We remove it even for prepared transactions.
+ * This is because the COMMIT PREPARED needs
+ * no data post the successful PREPARE
+ */
ReorderBufferCleanupTXN(rb, txn);
}
PG_CATCH();
@@ -1623,6 +1689,136 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
PG_END_TRY();
}
+/*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+ return rb->filter_prepare(rb, txn, xid, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferCommitInternal()
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /* unknown transaction, nothing to replay */
+ if (txn == NULL)
+ return;
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a twophase transaction. It calls ReorderBufferCommitInternal()
+ * since all prepared transactions need to be decoded at PREPARE time.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /* unknown transaction, nothing to replay */
+ if (txn == NULL)
+ return;
+
+ txn->txn_flags |= RBTXN_PREPARE;
+ strcpy(txn->gid, gid);
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Check whether this transaction was sent as prepared to subscribers.
+ * Called while handling commit|abort prepared.
+ */
+bool
+ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+ const char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /*
+ * Always call the prepare filter. It's the job of the prepare
+ * filter to give us the *same* response for a given xid
+ * across multiple calls (including ones on restart)
+ */
+ return !(rb->filter_prepare(rb, txn, xid, gid));
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ABORT PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit)
+{
+ ReorderBufferTXN *txn;
+
+ /*
+ * The transaction may or may not exist (during restarts, for
+ * example). Either way, a 2PC transaction has no buffered changes
+ * left at this point, so allow it to be created below.
+ */
+ txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+ true);
+
+ txn->final_lsn = commit_lsn;
+ txn->end_lsn = end_lsn;
+ txn->commit_time = commit_time;
+ txn->origin_id = origin_id;
+ txn->origin_lsn = origin_lsn;
+ strcpy(txn->gid, gid);
+
+ if (is_commit)
+ {
+ txn->txn_flags |= RBTXN_COMMIT_PREPARED;
+ rb->commit_prepared(rb, txn, commit_lsn);
+ }
+ else
+ {
+ txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
+ rb->abort_prepared(rb, txn, commit_lsn);
+ }
+
+ /* cleanup: make sure there's no cache pollution */
+ ReorderBufferExecuteInvalidations(rb, txn);
+ ReorderBufferCleanupTXN(rb, txn);
+}
+
/*
* Abort a transaction that possibly has previous changes. Needs to be first
* called for subtransactions and then for the toplevel xid.
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 61c4ae37f3..836ccefee6 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -322,6 +322,9 @@ typedef xl_xact_parsed_commit xl_xact_parsed_prepare;
typedef struct xl_xact_parsed_abort
{
+ Oid dbId;
+ Oid tsId;
+
TimestampTz xact_time;
uint32 xinfo;
@@ -332,6 +335,10 @@ typedef struct xl_xact_parsed_abort
RelFileNode *xnodes;
TransactionId twophase_xid; /* only for 2PC */
+ char twophase_gid[GIDSIZE]; /* only for 2PC */
+
+ XLogRecPtr origin_lsn;
+ TimestampTz origin_timestamp;
} xl_xact_parsed_abort;
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 63b14367f0..9dad4c997f 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -89,6 +89,11 @@ typedef struct LogicalDecodingContext
bool prepared_write;
XLogRecPtr write_location;
TransactionId write_xid;
+
+ /*
+ * Capabilities of the output plugin.
+ */
+ bool enable_twophase;
} LogicalDecodingContext;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 78fd38bb16..568897df49 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -67,6 +67,46 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+/*
+ * Called for an implicit ABORT of a transaction.
+ */
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+/*
+ * Called before decoding of a PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to the prepare
+ * and commit_prepared/abort_prepared callbacks, or should wait until
+ * COMMIT PREPARED and be sent as a usual transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
/*
* Called for the generic logical decoding messages.
*/
@@ -98,7 +138,12 @@ typedef struct OutputPluginCallbacks
LogicalDecodeBeginCB begin_cb;
LogicalDecodeChangeCB change_cb;
LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeAbortCB abort_cb;
LogicalDecodeMessageCB message_cb;
+ LogicalDecodeFilterPrepareCB filter_prepare_cb;
+ LogicalDecodePrepareCB prepare_cb;
+ LogicalDecodeCommitPreparedCB commit_prepared_cb;
+ LogicalDecodeAbortPreparedCB abort_prepared_cb;
LogicalDecodeFilterByOriginCB filter_by_origin_cb;
LogicalDecodeShutdownCB shutdown_cb;
} OutputPluginCallbacks;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index d6b00654c2..d446d9082b 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
#define REORDERBUFFER_H
#include "access/htup_details.h"
+#include "access/twophase.h"
#include "lib/ilist.h"
#include "storage/sinval.h"
#include "utils/hsearch.h"
@@ -179,6 +180,9 @@ typedef struct ReorderBufferTXN
*/
TransactionId xid;
+ /* In case of 2PC we need to pass GID to output plugin */
+ char gid[GIDSIZE];
+
/*
* LSN of the first data carrying, WAL record with knowledge about this
* xid. This is allowed to *not* be first record adorned with this xid, if
@@ -312,6 +316,37 @@ typedef void (*ReorderBufferCommitCB) (
ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+/* abort callback signature */
+typedef void (*ReorderBufferAbortCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+typedef bool (*ReorderBufferFilterPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/* abort prepared callback signature */
+typedef void (*ReorderBufferAbortPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+
/* message callback signature */
typedef void (*ReorderBufferMessageCB) (
ReorderBuffer *rb,
@@ -347,6 +382,11 @@ struct ReorderBuffer
ReorderBufferBeginCB begin;
ReorderBufferApplyChangeCB apply_change;
ReorderBufferCommitCB commit;
+ ReorderBufferAbortCB abort;
+ ReorderBufferFilterPrepareCB filter_prepare;
+ ReorderBufferPrepareCB prepare;
+ ReorderBufferCommitPreparedCB commit_prepared;
+ ReorderBufferAbortPreparedCB abort_prepared;
ReorderBufferMessageCB message;
/*
@@ -389,6 +429,11 @@ void ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot
void ReorderBufferCommit(ReorderBuffer *, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit);
void ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
void ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
@@ -412,6 +457,15 @@ void ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
bool ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
bool ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
+bool ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+ const char *gid);
+bool ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+ const char *gid);
+void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid);
ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
void ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
--
2.14.3 (Apple Git-98)
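The decision made at PREPARE time in the patch above can be summarized in a small standalone model: two-phase decoding is enabled only when the plugin registers all three of prepare_cb, commit_prepared_cb and abort_prepared_cb, and an optional filter_prepare_cb returning true pushes an individual transaction back to ordinary decode-at-commit. The sketch below is a simplified model under those assumptions, not PostgreSQL code. The types `PluginCallbacks` and `PrepareAction`, the functions `twophase_enabled` and `route_prepare`, and the sample callbacks `noop_cb` and `skip_tmp_gids` are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Stand-ins for the real callback types (hypothetical, simplified). */
typedef void (*txn_cb) (void);
typedef bool (*filter_cb) (const char *gid);

typedef struct PluginCallbacks
{
    txn_cb    prepare_cb;
    txn_cb    commit_prepared_cb;
    txn_cb    abort_prepared_cb;
    filter_cb filter_prepare_cb;    /* optional */
} PluginCallbacks;

typedef enum PrepareAction
{
    DECODE_AT_COMMIT,               /* treat as an ordinary transaction */
    DECODE_AT_PREPARE               /* send changes followed by PREPARE */
} PrepareAction;

/* Mirrors the "all three or nothing" check in StartupDecodingContext(). */
static bool
twophase_enabled(const PluginCallbacks *cb)
{
    int n = (cb->prepare_cb != NULL) +
            (cb->commit_prepared_cb != NULL) +
            (cb->abort_prepared_cb != NULL);

    return n == 3;
}

/* Models the XLOG_XACT_PREPARE branch: the filter returns true to skip. */
static PrepareAction
route_prepare(const PluginCallbacks *cb, const char *gid)
{
    assert(gid != NULL);

    if (!twophase_enabled(cb))
        return DECODE_AT_COMMIT;
    if (cb->filter_prepare_cb != NULL && cb->filter_prepare_cb(gid))
        return DECODE_AT_COMMIT;
    return DECODE_AT_PREPARE;
}

/* Sample callbacks, purely illustrative. */
static void
noop_cb(void)
{
}

static bool
skip_tmp_gids(const char *gid)
{
    return strncmp(gid, "tmp_", 4) == 0;
}
```

As the comment in ReorderBufferTxnIsPrepared() notes, the filter is consulted again at COMMIT PREPARED and ROLLBACK PREPARED, so a real filter has to return the same answer for a given transaction on every call, including after a restart.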
0003-Add-support-for-logging-GID-in-commit-abort-WAL-reco.patch (application/octet-stream)
From 2b22ffc1937cace81a063db23d9e0a339f5d74a2 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Mon, 5 Mar 2018 21:43:22 +0530
Subject: [PATCH 3/6] Add support for logging GID in commit/abort WAL records
for 2PC transactions. Also support replica origin tracking for 2PC
Store the GID of a 2PC transaction in its commit/abort WAL records. This
allows logical decoding to send the same GID to subscribers across restarts.
We also track replication origin replay progress for 2PC now. This
avoids resending a PREPARE TRANSACTION from the upstream.
---
src/backend/access/rmgrdesc/xactdesc.c | 39 ++++++++++++
src/backend/access/transam/twophase.c | 105 ++++++++++++++++++++++++++++-----
src/backend/access/transam/xact.c | 78 ++++++++++++++++++++++--
src/include/access/twophase.h | 5 +-
src/include/access/xact.h | 19 +++++-
5 files changed, 223 insertions(+), 23 deletions(-)
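The GID layout used in the hunks below can be sketched in isolation. The writer registers the GID including its terminating '\0' and then zero-pads it to the next MAXALIGN boundary; the parser copies the string and advances by MAXALIGN(strlen + 1). This is only a minimal standalone model, not code from the patch: the sketch_* names are hypothetical, and an 8-byte maximum alignment is assumed (PostgreSQL derives MAXIMUM_ALIGNOF per platform).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumption: 8-byte maximum alignment; the real value is platform-specific. */
#define SKETCH_MAXALIGN_OF 8
#define SKETCH_MAXALIGN(len) \
	(((size_t) (len) + (SKETCH_MAXALIGN_OF - 1)) & \
	 ~((size_t) (SKETCH_MAXALIGN_OF - 1)))

/* Mirrors GIDSIZE from the patch: maximum GID size including '\0'. */
#define SKETCH_GIDSIZE 200

/*
 * Append a GID to buf the way the patch registers it in the WAL record:
 * the string with its terminating '\0', followed by zero bytes up to the
 * next aligned boundary.  Returns the number of bytes written.
 * The caller must provide a buffer of at least SKETCH_MAXALIGN(strlen + 1).
 */
static size_t
sketch_write_gid(char *buf, const char *gid)
{
	size_t		gidlen = strlen(gid) + 1;	/* include '\0' */
	size_t		padded = SKETCH_MAXALIGN(gidlen);

	memcpy(buf, gid, gidlen);
	memset(buf + gidlen, 0, padded - gidlen);
	return padded;
}

/*
 * Read a GID back into gid_out (SKETCH_GIDSIZE bytes) and return how many
 * bytes the parser has to skip to reach the next field.  The patch uses
 * strcpy here; this sketch uses a bounded copy instead.
 */
static size_t
sketch_read_gid(const char *data, char *gid_out)
{
	size_t		gidlen;

	strncpy(gid_out, data, SKETCH_GIDSIZE - 1);
	gid_out[SKETCH_GIDSIZE - 1] = '\0';
	gidlen = strlen(gid_out) + 1;
	return SKETCH_MAXALIGN(gidlen);
}
```

Because writer and reader compute the same padded length, the fields that follow the GID in the record (the origin data, for example) stay aligned no matter how long the GID is.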
diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index e5eef9ea43..b3e2fc3036 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -102,6 +102,14 @@ ParseCommitRecord(uint8 info, xl_xact_commit *xlrec, xl_xact_parsed_commit *pars
parsed->twophase_xid = xl_twophase->xid;
data += sizeof(xl_xact_twophase);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_GID)
+ {
+ int gidlen;
+ strcpy(parsed->twophase_gid, data);
+ gidlen = strlen(parsed->twophase_gid) + 1;
+ data += MAXALIGN(gidlen);
+ }
}
if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
@@ -139,6 +147,16 @@ ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_abort *parsed)
data += sizeof(xl_xact_xinfo);
}
+ if (parsed->xinfo & XACT_XINFO_HAS_DBINFO)
+ {
+ xl_xact_dbinfo *xl_dbinfo = (xl_xact_dbinfo *) data;
+
+ parsed->dbId = xl_dbinfo->dbId;
+ parsed->tsId = xl_dbinfo->tsId;
+
+ data += sizeof(xl_xact_dbinfo);
+ }
+
if (parsed->xinfo & XACT_XINFO_HAS_SUBXACTS)
{
xl_xact_subxacts *xl_subxacts = (xl_xact_subxacts *) data;
@@ -168,6 +186,27 @@ ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_abort *parsed)
parsed->twophase_xid = xl_twophase->xid;
data += sizeof(xl_xact_twophase);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_GID)
+ {
+ int gidlen;
+ strcpy(parsed->twophase_gid, data);
+ gidlen = strlen(parsed->twophase_gid) + 1;
+ data += MAXALIGN(gidlen);
+ }
+ }
+
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ xl_xact_origin xl_origin;
+
+ /* we're only guaranteed 4 byte alignment, so copy onto stack */
+ memcpy(&xl_origin, data, sizeof(xl_origin));
+
+ parsed->origin_lsn = xl_origin.origin_lsn;
+ parsed->origin_timestamp = xl_origin.origin_timestamp;
+
+ data += sizeof(xl_xact_origin);
}
}
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index c479c4881b..d6e4b7980f 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -144,11 +144,7 @@ int max_prepared_xacts = 0;
*
* typedef struct GlobalTransactionData *GlobalTransaction appears in
* twophase.h
- *
- * Note that the max value of GIDSIZE must fit in the uint16 gidlen,
- * specified in TwoPhaseFileHeader.
*/
-#define GIDSIZE 200
typedef struct GlobalTransactionData
{
@@ -211,12 +207,14 @@ static void RecordTransactionCommitPrepared(TransactionId xid,
RelFileNode *rels,
int ninvalmsgs,
SharedInvalidationMessage *invalmsgs,
- bool initfileinval);
+ bool initfileinval,
+ const char *gid);
static void RecordTransactionAbortPrepared(TransactionId xid,
int nchildren,
TransactionId *children,
int nrels,
- RelFileNode *rels);
+ RelFileNode *rels,
+ const char *gid);
static void ProcessRecords(char *bufptr, TransactionId xid,
const TwoPhaseCallback callbacks[]);
static void RemoveGXact(GlobalTransaction gxact);
@@ -898,7 +896,7 @@ TwoPhaseGetDummyProc(TransactionId xid)
/*
* Header for a 2PC state file
*/
-#define TWOPHASE_MAGIC 0x57F94533 /* format identifier */
+#define TWOPHASE_MAGIC 0x57F94534 /* format identifier */
typedef struct TwoPhaseFileHeader
{
@@ -914,6 +912,8 @@ typedef struct TwoPhaseFileHeader
int32 ninvalmsgs; /* number of cache invalidation messages */
bool initfileinval; /* does relcache init file need invalidation? */
uint16 gidlen; /* length of the GID - GID follows the header */
+ XLogRecPtr origin_lsn; /* lsn of this record at origin node */
+ TimestampTz origin_timestamp; /* time of prepare at origin node */
} TwoPhaseFileHeader;
/*
@@ -1065,6 +1065,7 @@ EndPrepare(GlobalTransaction gxact)
{
TwoPhaseFileHeader *hdr;
StateFileChunk *record;
+ bool replorigin;
/* Add the end sentinel to the list of 2PC records */
RegisterTwoPhaseRecord(TWOPHASE_RM_END_ID, 0,
@@ -1075,6 +1076,21 @@ EndPrepare(GlobalTransaction gxact)
Assert(hdr->magic == TWOPHASE_MAGIC);
hdr->total_len = records.total_len + sizeof(pg_crc32c);
+ replorigin = (replorigin_session_origin != InvalidRepOriginId &&
+ replorigin_session_origin != DoNotReplicateId);
+
+ if (replorigin)
+ {
+ Assert(replorigin_session_origin_lsn != InvalidXLogRecPtr);
+ hdr->origin_lsn = replorigin_session_origin_lsn;
+ hdr->origin_timestamp = replorigin_session_origin_timestamp;
+ }
+ else
+ {
+ hdr->origin_lsn = InvalidXLogRecPtr;
+ hdr->origin_timestamp = 0;
+ }
+
/*
* If the data size exceeds MaxAllocSize, we won't be able to read it in
* ReadTwoPhaseFile. Check for that now, rather than fail in the case
@@ -1107,7 +1123,16 @@ EndPrepare(GlobalTransaction gxact)
XLogBeginInsert();
for (record = records.head; record != NULL; record = record->next)
XLogRegisterData(record->data, record->len);
+
+ XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
+
gxact->prepare_end_lsn = XLogInsert(RM_XACT_ID, XLOG_XACT_PREPARE);
+
+ if (replorigin)
+ /* Move LSNs forward for this replication origin */
+ replorigin_session_advance(replorigin_session_origin_lsn,
+ gxact->prepare_end_lsn);
+
XLogFlush(gxact->prepare_end_lsn);
/* If we crash now, we have prepared: WAL replay will fix things */
@@ -1283,6 +1308,44 @@ ReadTwoPhaseFile(TransactionId xid, bool give_warnings)
return buf;
}
+/*
+ * ParsePrepareRecord
+ */
+void
+ParsePrepareRecord(uint8 info, char *xlrec, xl_xact_parsed_prepare *parsed)
+{
+ TwoPhaseFileHeader *hdr;
+ char *bufptr;
+
+ hdr = (TwoPhaseFileHeader *) xlrec;
+ bufptr = xlrec + MAXALIGN(sizeof(TwoPhaseFileHeader));
+
+ parsed->origin_lsn = hdr->origin_lsn;
+ parsed->origin_timestamp = hdr->origin_timestamp;
+ parsed->twophase_xid = hdr->xid;
+ parsed->dbId = hdr->database;
+ parsed->nsubxacts = hdr->nsubxacts;
+ parsed->nrels = hdr->ncommitrels;
+ parsed->nabortrels = hdr->nabortrels;
+ parsed->nmsgs = hdr->ninvalmsgs;
+
+ strncpy(parsed->twophase_gid, bufptr, hdr->gidlen);
+ bufptr += MAXALIGN(hdr->gidlen);
+
+ parsed->subxacts = (TransactionId *) bufptr;
+ bufptr += MAXALIGN(hdr->nsubxacts * sizeof(TransactionId));
+
+ parsed->xnodes = (RelFileNode *) bufptr;
+ bufptr += MAXALIGN(hdr->ncommitrels * sizeof(RelFileNode));
+
+ parsed->abortnodes = (RelFileNode *) bufptr;
+ bufptr += MAXALIGN(hdr->nabortrels * sizeof(RelFileNode));
+
+ parsed->msgs = (SharedInvalidationMessage *) bufptr;
+ bufptr += MAXALIGN(hdr->ninvalmsgs * sizeof(SharedInvalidationMessage));
+}
+
+
/*
* Reads 2PC data from xlog. During checkpoint this data will be moved to
@@ -1435,11 +1498,12 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
hdr->nsubxacts, children,
hdr->ncommitrels, commitrels,
hdr->ninvalmsgs, invalmsgs,
- hdr->initfileinval);
+ hdr->initfileinval, gid);
else
RecordTransactionAbortPrepared(xid,
hdr->nsubxacts, children,
- hdr->nabortrels, abortrels);
+ hdr->nabortrels, abortrels,
+ gid);
ProcArrayRemove(proc, latestXid);
@@ -1752,7 +1816,8 @@ restoreTwoPhaseData(void)
if (buf == NULL)
continue;
- PrepareRedoAdd(buf, InvalidXLogRecPtr, InvalidXLogRecPtr);
+ PrepareRedoAdd(buf, InvalidXLogRecPtr,
+ InvalidXLogRecPtr, InvalidRepOriginId);
}
}
LWLockRelease(TwoPhaseStateLock);
@@ -2165,7 +2230,8 @@ RecordTransactionCommitPrepared(TransactionId xid,
RelFileNode *rels,
int ninvalmsgs,
SharedInvalidationMessage *invalmsgs,
- bool initfileinval)
+ bool initfileinval,
+ const char *gid)
{
XLogRecPtr recptr;
TimestampTz committs = GetCurrentTimestamp();
@@ -2193,7 +2259,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
ninvalmsgs, invalmsgs,
initfileinval, false,
MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
- xid);
+ xid, gid);
if (replorigin)
@@ -2255,7 +2321,8 @@ RecordTransactionAbortPrepared(TransactionId xid,
int nchildren,
TransactionId *children,
int nrels,
- RelFileNode *rels)
+ RelFileNode *rels,
+ const char *gid)
{
XLogRecPtr recptr;
@@ -2278,7 +2345,7 @@ RecordTransactionAbortPrepared(TransactionId xid,
nchildren, children,
nrels, rels,
MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
- xid);
+ xid, gid);
/* Always flush, since we're about to remove the 2PC state file */
XLogFlush(recptr);
@@ -2309,7 +2376,8 @@ RecordTransactionAbortPrepared(TransactionId xid,
* data, the entry is marked as located on disk.
*/
void
-PrepareRedoAdd(char *buf, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn, RepOriginId origin_id)
{
TwoPhaseFileHeader *hdr = (TwoPhaseFileHeader *) buf;
char *bufptr;
@@ -2358,6 +2426,13 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
Assert(TwoPhaseState->numPrepXacts < max_prepared_xacts);
TwoPhaseState->prepXacts[TwoPhaseState->numPrepXacts++] = gxact;
+ if (origin_id != InvalidRepOriginId)
+ {
+ /* recover apply progress */
+ replorigin_advance(origin_id, hdr->origin_lsn, end_lsn,
+ false /* backward */ , false /* WAL */ );
+ }
+
elog(DEBUG2, "added 2PC data in shared memory for transaction %u", gxact->xid);
}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index dbaaf8e005..93c00e1c0a 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1227,7 +1227,7 @@ RecordTransactionCommit(void)
nmsgs, invalMessages,
RelcacheInitFileInval, forceSyncCommit,
MyXactFlags,
- InvalidTransactionId /* plain commit */ );
+ InvalidTransactionId, NULL /* plain commit */ );
if (replorigin)
/* Move LSNs forward for this replication origin */
@@ -1579,7 +1579,8 @@ RecordTransactionAbort(bool isSubXact)
XactLogAbortRecord(xact_time,
nchildren, children,
nrels, rels,
- MyXactFlags, InvalidTransactionId);
+ MyXactFlags, InvalidTransactionId,
+ NULL);
/*
* Report the latest async abort LSN, so that the WAL writer knows to
@@ -5260,7 +5261,8 @@ XactLogCommitRecord(TimestampTz commit_time,
int nrels, RelFileNode *rels,
int nmsgs, SharedInvalidationMessage *msgs,
bool relcacheInval, bool forceSync,
- int xactflags, TransactionId twophase_xid)
+ int xactflags, TransactionId twophase_xid,
+ const char *twophase_gid)
{
xl_xact_commit xlrec;
xl_xact_xinfo xl_xinfo;
@@ -5272,6 +5274,7 @@ XactLogCommitRecord(TimestampTz commit_time,
xl_xact_origin xl_origin;
uint8 info;
+ int gidlen = 0;
Assert(CritSectionCount > 0);
@@ -5334,6 +5337,13 @@ XactLogCommitRecord(TimestampTz commit_time,
{
xl_xinfo.xinfo |= XACT_XINFO_HAS_TWOPHASE;
xl_twophase.xid = twophase_xid;
+ Assert(twophase_gid != NULL);
+
+ if (XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_GID;
+ gidlen = strlen(twophase_gid) + 1; /* include '\0' */
+ }
}
/* dump transaction origin information */
@@ -5384,7 +5394,16 @@ XactLogCommitRecord(TimestampTz commit_time,
}
if (xl_xinfo.xinfo & XACT_XINFO_HAS_TWOPHASE)
+ {
XLogRegisterData((char *) (&xl_twophase), sizeof(xl_xact_twophase));
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_GID)
+ {
+ static const char zeroes[MAXIMUM_ALIGNOF] = { 0 };
+ XLogRegisterData((char*) twophase_gid, gidlen);
+ if (MAXALIGN(gidlen) != gidlen)
+ XLogRegisterData((char*) zeroes, MAXALIGN(gidlen) - gidlen);
+ }
+ }
if (xl_xinfo.xinfo & XACT_XINFO_HAS_ORIGIN)
XLogRegisterData((char *) (&xl_origin), sizeof(xl_xact_origin));
@@ -5405,15 +5424,19 @@ XLogRecPtr
XactLogAbortRecord(TimestampTz abort_time,
int nsubxacts, TransactionId *subxacts,
int nrels, RelFileNode *rels,
- int xactflags, TransactionId twophase_xid)
+ int xactflags, TransactionId twophase_xid,
+ const char *twophase_gid)
{
xl_xact_abort xlrec;
xl_xact_xinfo xl_xinfo;
xl_xact_subxacts xl_subxacts;
xl_xact_relfilenodes xl_relfilenodes;
xl_xact_twophase xl_twophase;
+ xl_xact_dbinfo xl_dbinfo;
+ xl_xact_origin xl_origin;
uint8 info;
+ int gidlen = 0;
Assert(CritSectionCount > 0);
@@ -5449,6 +5472,31 @@ XactLogAbortRecord(TimestampTz abort_time,
{
xl_xinfo.xinfo |= XACT_XINFO_HAS_TWOPHASE;
xl_twophase.xid = twophase_xid;
+ Assert(twophase_gid != NULL);
+
+ if (XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_GID;
+ gidlen = strlen(twophase_gid) + 1; /* include '\0' */
+ }
+ }
+
+ if (TransactionIdIsValid(twophase_xid) && XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_DBINFO;
+ xl_dbinfo.dbId = MyDatabaseId;
+ xl_dbinfo.tsId = MyDatabaseTableSpace;
+ }
+
+ /* dump transaction origin information only for abort prepared */
+ if ( (replorigin_session_origin != InvalidRepOriginId) &&
+ TransactionIdIsValid(twophase_xid) &&
+ XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_ORIGIN;
+
+ xl_origin.origin_lsn = replorigin_session_origin_lsn;
+ xl_origin.origin_timestamp = replorigin_session_origin_timestamp;
}
if (xl_xinfo.xinfo != 0)
@@ -5463,6 +5511,10 @@ XactLogAbortRecord(TimestampTz abort_time,
if (xl_xinfo.xinfo != 0)
XLogRegisterData((char *) (&xl_xinfo), sizeof(xl_xinfo));
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_DBINFO)
+ XLogRegisterData((char *) (&xl_dbinfo), sizeof(xl_dbinfo));
+
+
if (xl_xinfo.xinfo & XACT_XINFO_HAS_SUBXACTS)
{
XLogRegisterData((char *) (&xl_subxacts),
@@ -5480,7 +5532,22 @@ XactLogAbortRecord(TimestampTz abort_time,
}
if (xl_xinfo.xinfo & XACT_XINFO_HAS_TWOPHASE)
+ {
XLogRegisterData((char *) (&xl_twophase), sizeof(xl_xact_twophase));
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_GID)
+ {
+ static const char zeroes[MAXIMUM_ALIGNOF] = { 0 };
+ XLogRegisterData((char*) twophase_gid, gidlen);
+ if (MAXALIGN(gidlen) != gidlen)
+ XLogRegisterData((char*) zeroes, MAXALIGN(gidlen) - gidlen);
+ }
+ }
+
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_ORIGIN)
+ XLogRegisterData((char *) (&xl_origin), sizeof(xl_xact_origin));
+
+ if (TransactionIdIsValid(twophase_xid))
+ XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
return XLogInsert(RM_XACT_ID, info);
}
@@ -5803,7 +5870,8 @@ xact_redo(XLogReaderState *record)
LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
PrepareRedoAdd(XLogRecGetData(record),
record->ReadRecPtr,
- record->EndRecPtr);
+ record->EndRecPtr,
+ XLogRecGetOrigin(record));
LWLockRelease(TwoPhaseStateLock);
}
else if (info == XLOG_XACT_ASSIGNMENT)
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 34d9470811..f05cde202f 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -15,6 +15,7 @@
#define TWOPHASE_H
#include "access/xlogdefs.h"
+#include "access/xact.h"
#include "datatype/timestamp.h"
#include "storage/lock.h"
@@ -46,6 +47,8 @@ extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
int *nxids_p);
+extern void ParsePrepareRecord(uint8 info, char *xlrec,
+ xl_xact_parsed_prepare *parsed);
extern void StandbyRecoverPreparedTransactions(void);
extern void RecoverPreparedTransactions(void);
@@ -54,7 +57,7 @@ extern void CheckPointTwoPhase(XLogRecPtr redo_horizon);
extern void FinishPreparedTransaction(const char *gid, bool isCommit);
extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
- XLogRecPtr end_lsn);
+ XLogRecPtr end_lsn, RepOriginId origin_id);
extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
extern void restoreTwoPhaseData(void);
#endif /* TWOPHASE_H */
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 6445bbc46f..61c4ae37f3 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -21,6 +21,13 @@
#include "storage/sinval.h"
#include "utils/datetime.h"
+/*
+ * Maximum size of Global Transaction ID (including '\0').
+ *
+ * Note that the max value of GIDSIZE must fit in the uint16 gidlen,
+ * specified in TwoPhaseFileHeader.
+ */
+#define GIDSIZE 200
/*
* Xact isolation levels
@@ -156,6 +163,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
#define XACT_XINFO_HAS_TWOPHASE (1U << 4)
#define XACT_XINFO_HAS_ORIGIN (1U << 5)
#define XACT_XINFO_HAS_AE_LOCKS (1U << 6)
+#define XACT_XINFO_HAS_GID (1U << 7)
/*
* Also stored in xinfo, these indicating a variety of additional actions that
@@ -302,11 +310,16 @@ typedef struct xl_xact_parsed_commit
SharedInvalidationMessage *msgs;
TransactionId twophase_xid; /* only for 2PC */
+ char twophase_gid[GIDSIZE]; /* only for 2PC */
+ int nabortrels; /* only for 2PC */
+ RelFileNode *abortnodes; /* only for 2PC */
XLogRecPtr origin_lsn;
TimestampTz origin_timestamp;
} xl_xact_parsed_commit;
+typedef xl_xact_parsed_commit xl_xact_parsed_prepare;
+
typedef struct xl_xact_parsed_abort
{
TimestampTz xact_time;
@@ -386,12 +399,14 @@ extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
int nmsgs, SharedInvalidationMessage *msgs,
bool relcacheInval, bool forceSync,
int xactflags,
- TransactionId twophase_xid);
+ TransactionId twophase_xid,
+ const char *twophase_gid);
extern XLogRecPtr XactLogAbortRecord(TimestampTz abort_time,
int nsubxacts, TransactionId *subxacts,
int nrels, RelFileNode *rels,
- int xactflags, TransactionId twophase_xid);
+ int xactflags, TransactionId twophase_xid,
+ const char *twophase_gid);
extern void xact_redo(XLogReaderState *record);
/* xactdesc.c */
--
2.14.3 (Apple Git-98)
0002-Introduce-LogicalLockTransaction-LogicalUnlockTransa.patch (application/octet-stream)
From 62b1e2c0e4efd0fe64b90f610e2ba22c1692f385 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Mon, 5 Mar 2018 21:42:17 +0530
Subject: [PATCH 2/6] Introduce LogicalLockTransaction/LogicalUnlockTransaction
APIs
Prepared transactions and uncommitted transactions that have modified
catalogs need to interlock with concurrent rollback, so that a rollback
cannot proceed while a decoding backend is still working on the transaction.
This is implemented by adding support for decoding groups. Use
LockHashPartitionLockByProc on the group leader to get the LWLock
protecting the decode group fields. For prepared and uncommitted
transactions, decoding backends working on the same XID link themselves
to the corresponding PGPROC entry (decodeGroupLeader).
They remove themselves when they are done decoding.
If the prepared or uncommitted transaction decides to abort, the
decodeGroupLeader sets the decodeAbortPending flag, allowing
the decodeGroupMembers to abort their decoding appropriately.
---
src/backend/replication/logical/logical.c | 161 +++++++++++++++++
src/backend/storage/ipc/procarray.c | 39 +++++
src/backend/storage/lmgr/proc.c | 282 ++++++++++++++++++++++++++++++
src/include/replication/logical.h | 2 +
src/include/storage/proc.h | 25 +++
src/include/storage/procarray.h | 1 +
6 files changed, 510 insertions(+)
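The handshake described in the commit message can be modelled without any shared memory or LWLocks. The following is only a rough single-threaded sketch of the state transitions, under the assumption that every call runs while the leader's lock is held. The SketchProc type, the sketch_* functions and the plain member array are hypothetical stand-ins for PGPROC, the functions in this patch and the dlist of decodeGroupMembers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the decode group fields added to PGPROC. */
typedef struct SketchProc
{
	struct SketchProc *leader;	/* models decodeGroupLeader */
	bool		abort_pending;	/* models decodeAbortPending */
	bool		locked;			/* models decodeLocked */
} SketchProc;

/* A decoding backend joins the group unless the leader is already aborting. */
static bool
sketch_join(SketchProc *me, SketchProc *leader)
{
	if (leader->abort_pending)
		return false;
	leader->leader = leader;	/* leader points at itself */
	me->leader = leader;
	return true;
}

/* Drop group membership and acknowledge the pending abort. */
static void
sketch_leave(SketchProc *me)
{
	me->leader = NULL;
	me->abort_pending = false;
	me->locked = false;
}

/* Returns false when the decoder must stop because an abort is pending. */
static bool
sketch_lock(SketchProc *me)
{
	if (me->leader == NULL)
		return false;
	if (me->abort_pending)
	{
		sketch_leave(me);
		return false;
	}
	me->locked = true;
	return true;
}

/* Release the interlock; leave the group if an abort arrived meanwhile. */
static void
sketch_unlock(SketchProc *me)
{
	if (me->abort_pending)
		sketch_leave(me);
	me->locked = false;
}

/*
 * The aborting side marks every member and reports whether it still has
 * to wait for a member that currently holds the interlock.
 */
static bool
sketch_abort_must_wait(SketchProc *leader, SketchProc **members, size_t n)
{
	bool		must_wait = false;
	size_t		i;

	leader->abort_pending = true;
	for (i = 0; i < n; i++)
	{
		if (members[i]->leader != leader)
			continue;			/* already left the group */
		members[i]->abort_pending = true;
		if (members[i]->locked)
			must_wait = true;
	}
	return must_wait;
}
```

In the patch itself the waiting side sleeps on its latch and rechecks the member list, and all of these fields are protected by the leader's lock partition LWLock. The sketch leaves out both on purpose to keep the state machine visible.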
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 7637efc32e..c8ccade241 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1013,3 +1013,164 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
SpinLockRelease(&MyReplicationSlot->mutex);
}
}
+
+bool
+LogicalLockTransaction(ReorderBufferTXN *txn)
+{
+ bool ok = false;
+
+ /*
+ * Prepared transactions and uncommitted transactions
+ * that have modified catalogs need to interlock with
+ * concurrent rollback to ensure that there are no
+ * issues while decoding
+ */
+
+ if (!rbtxn_has_catalog_changes(txn))
+ return true;
+
+ /*
+ * Is it a prepared txn? Similar checks for uncommitted
+ * transactions when we start supporting them
+ */
+ if (!rbtxn_prepared(txn))
+ return true;
+
+ /* check cached status */
+ if (rbtxn_commit(txn))
+ return true;
+ if (rbtxn_rollback(txn))
+ return false;
+
+ /*
+ * Find the PROC that is handling this XID and add ourself as a
+ * decodeGroupMember
+ */
+ if (MyProc->decodeGroupLeader == NULL)
+ {
+ PGPROC *proc = BecomeDecodeGroupLeader(txn->xid, rbtxn_prepared(txn));
+
+ /*
+ * If decodeGroupLeader is NULL, then the only possibility
+ * is that the transaction completed and went away
+ */
+ if (proc == NULL)
+ {
+ Assert(!TransactionIdIsInProgress(txn->xid));
+ if (TransactionIdDidCommit(txn->xid))
+ {
+ txn->txn_flags |= RBTXN_COMMIT;
+ return true;
+ }
+ else
+ {
+ txn->txn_flags |= RBTXN_ROLLBACK;
+ return false;
+ }
+ }
+
+ /* Add ourself as a decodeGroupMember */
+ if (!BecomeDecodeGroupMember(proc, proc->pid, rbtxn_prepared(txn)))
+ {
+ Assert(!TransactionIdIsInProgress(txn->xid));
+ if (TransactionIdDidCommit(txn->xid))
+ {
+ txn->txn_flags |= RBTXN_COMMIT;
+ return true;
+ }
+ else
+ {
+ txn->txn_flags |= RBTXN_ROLLBACK;
+ return false;
+ }
+ }
+ }
+
+ /*
+ * If we were able to add ourself, then Abort processing will
+ * interlock with us. Check if the transaction is still around
+ */
+ Assert(MyProc->decodeGroupLeader);
+
+ if (MyProc->decodeGroupLeader)
+ {
+ LWLock *leader_lwlock;
+
+ leader_lwlock = LockHashPartitionLockByProc(MyProc->decodeGroupLeader);
+ LWLockAcquire(leader_lwlock, LW_SHARED);
+ if (MyProc->decodeAbortPending)
+ {
+ /*
+ * Remove ourself from the decodeGroupMembership and return
+ * false so that the decoding plugin also initiates abort
+ * processing
+ */
+ LWLockRelease(leader_lwlock);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ RemoveDecodeGroupMemberLocked(MyProc->decodeGroupLeader);
+ /* reset the bool to let the leader know that we are going away */
+ MyProc->decodeAbortPending = false;
+ MyProc->decodeLocked = false;
+ txn->txn_flags |= RBTXN_ROLLBACK;
+ ok = false;
+ }
+ else
+ {
+ MyProc->decodeLocked = true;
+ ok = true;
+ }
+ LWLockRelease(leader_lwlock);
+ }
+ else
+ return false;
+
+ return ok;
+}
+
+void
+LogicalUnlockTransaction(ReorderBufferTXN *txn)
+{
+ LWLock *leader_lwlock;
+
+ /*
+ * Prepared transactions and uncommitted transactions
+ * that have modified catalogs need to interlock with
+ * concurrent rollback to ensure that there are no
+ * issues while decoding
+ */
+
+ if (!rbtxn_has_catalog_changes(txn))
+ return;
+
+ /*
+ * Is it a prepared txn? Similar checks for uncommitted
+ * transactions when we start supporting them
+ */
+ if (!rbtxn_prepared(txn))
+ return;
+
+ /* check cached status */
+ if (rbtxn_commit(txn))
+ return;
+ if (rbtxn_rollback(txn))
+ return;
+
+ Assert(MyProc->decodeGroupLeader);
+ leader_lwlock = LockHashPartitionLockByProc(MyProc->decodeGroupLeader);
+ LWLockAcquire(leader_lwlock, LW_SHARED);
+ if (MyProc->decodeAbortPending)
+ {
+ /*
+ * Remove ourself from the decodeGroupMembership
+ */
+ LWLockRelease(leader_lwlock);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ RemoveDecodeGroupMemberLocked(MyProc->decodeGroupLeader);
+ /* reset the bool to let the leader know that we are going away */
+ MyProc->decodeAbortPending = false;
+ txn->txn_flags |= RBTXN_ROLLBACK;
+ }
+ MyProc->decodeLocked = false;
+ LWLockRelease(leader_lwlock);
+ return;
+}
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index afe1c03aa3..2be2910207 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -2432,6 +2432,45 @@ BackendXidGetPid(TransactionId xid)
return result;
}
+/*
+ * BackendXidGetProc -- get a backend's PGPROC given its XID
+ *
+ * Note that it is up to the caller to be sure that the question
+ * remains meaningful for long enough for the answer to be used ...
+ *
+ * Only main transaction Ids are considered.
+ *
+ */
+PGPROC *
+BackendXidGetProc(TransactionId xid)
+{
+ PGPROC *result = NULL;
+ ProcArrayStruct *arrayP = procArray;
+ int index;
+
+ if (xid == InvalidTransactionId) /* never match invalid xid */
+ return NULL;
+
+ LWLockAcquire(ProcArrayLock, LW_SHARED);
+
+ for (index = 0; index < arrayP->numProcs; index++)
+ {
+ int pgprocno = arrayP->pgprocnos[index];
+ PGPROC *proc = &allProcs[pgprocno];
+ volatile PGXACT *pgxact = &allPgXact[pgprocno];
+
+ if (pgxact->xid == xid)
+ {
+ result = proc;
+ break;
+ }
+ }
+
+ LWLockRelease(ProcArrayLock);
+
+ return result;
+}
+
/*
* IsBackendPid -- is a given pid a running backend
*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 6f30e082b2..26d35c7807 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -267,6 +267,11 @@ InitProcGlobal(void)
/* Initialize lockGroupMembers list. */
dlist_init(&procs[i].lockGroupMembers);
+
+ /* Initialize decodeGroupMembers list. */
+ dlist_init(&procs[i].decodeGroupMembers);
+ procs[i].decodeAbortPending = false;
+ procs[i].decodeLocked = false;
}
/*
@@ -406,6 +411,12 @@ InitProcess(void)
Assert(MyProc->lockGroupLeader == NULL);
Assert(dlist_is_empty(&MyProc->lockGroupMembers));
+ /* Check that group decode fields are in a proper initial state. */
+ Assert(MyProc->decodeGroupLeader == NULL);
+ Assert(dlist_is_empty(&MyProc->decodeGroupMembers));
+ MyProc->decodeAbortPending = false;
+ MyProc->decodeLocked = false;
+
/* Initialize wait event information. */
MyProc->wait_event_info = 0;
@@ -581,6 +592,12 @@ InitAuxiliaryProcess(void)
Assert(MyProc->lockGroupLeader == NULL);
Assert(dlist_is_empty(&MyProc->lockGroupMembers));
+ /* Check that group decode fields are in a proper initial state. */
+ Assert(MyProc->decodeGroupLeader == NULL);
+ Assert(dlist_is_empty(&MyProc->decodeGroupMembers));
+ MyProc->decodeAbortPending = false;
+ MyProc->decodeLocked = false;
+
/*
* We might be reusing a semaphore that belonged to a failed process. So
* be careful and reinitialize its value here. (This is not strictly
@@ -1887,3 +1904,268 @@ BecomeLockGroupMember(PGPROC *leader, int pid)
return ok;
}
+
+/*
+ * BecomeDecodeGroupLeader - designate process as decode group leader
+ *
+ * Once this function has returned, other processes can join the decode group
+ * by calling BecomeDecodeGroupMember.
+ */
+PGPROC *
+BecomeDecodeGroupLeader(TransactionId xid, bool is_prepared)
+{
+ PGPROC *proc = NULL;
+ int pid;
+ LWLock *leader_lwlock;
+
+ Assert(xid != InvalidTransactionId);
+
+
+ proc = BackendXidGetProc(xid);
+ if (proc)
+ pid = proc->pid;
+
+ /*
+ * This proc will become decodeGroupLeader if it's
+ * not already
+ */
+ if (proc && proc->decodeGroupLeader != proc)
+ {
+ volatile PGXACT *pgxact;
+ /* Create single-member group, containing this proc. */
+ leader_lwlock = LockHashPartitionLockByProc(proc);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ /* recheck we are still the same */
+ pgxact = &ProcGlobal->allPgXact[proc->pgprocno];
+ if (proc->pid == pid && pgxact->xid == xid)
+ {
+ if (is_prepared)
+ Assert(pid == 0);
+ /* recheck if someone else did not already assign us */
+ if (proc->decodeGroupLeader != proc)
+ {
+ /* We had better not be a follower. */
+ Assert(proc->decodeGroupLeader == NULL);
+ proc->decodeGroupLeader = proc;
+ dlist_push_head(&proc->decodeGroupMembers,
+ &proc->decodeGroupLink);
+ }
+ }
+ else
+ {
+ /* proc entry is gone */
+ proc = NULL;
+ }
+ LWLockRelease(leader_lwlock);
+ }
+
+ elog(DEBUG1, "became group leader (%p)", proc);
+ return proc;
+}
+
+/*
+ * BecomeDecodeGroupMember - designate process as decode group member
+ *
+ * This is pretty straightforward except for the possibility that the leader
+ * whose group we're trying to join might exit before we manage to do so;
+ * and the PGPROC might get recycled for an unrelated process. To avoid
+ * that, we require the caller to pass the PID of the intended PGPROC as
+ * an interlock. Returns true if we successfully join the intended decode
+ * group, and false if not.
+ */
+bool
+BecomeDecodeGroupMember(PGPROC *leader, int pid, bool is_prepared)
+{
+ LWLock *leader_lwlock;
+ bool ok = false;
+
+ /* Group leader can't become member of group */
+ Assert(MyProc != leader);
+
+ /* Can't already be a member of a group */
+ Assert(MyProc->decodeGroupLeader == NULL);
+
+ /* PID must be valid OR this is a prepared transaction. */
+ Assert(pid != 0 || is_prepared);
+
+ /*
+ * Get lock protecting the group fields. Note LockHashPartitionLockByProc
+ * accesses leader->pgprocno in a PGPROC that might be free. This is safe
+ * because all PGPROCs' pgprocno fields are set during shared memory
+ * initialization and never change thereafter; so we will acquire the
+ * correct lock even if the leader PGPROC is in process of being recycled.
+ */
+ leader_lwlock = LockHashPartitionLockByProc(leader);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+
+ /* Is this the leader we're looking for? */
+ if (leader->pid == pid && leader->decodeGroupLeader == leader)
+ {
+ if (is_prepared)
+ Assert(pid == 0);
+ /* is the leader going away? */
+ if (leader->decodeAbortPending)
+ ok = false;
+ else
+ {
+ /* OK, join the group */
+ ok = true;
+ MyProc->decodeGroupLeader = leader;
+ dlist_push_tail(&leader->decodeGroupMembers, &MyProc->decodeGroupLink);
+ }
+ }
+ else
+ MyProc->decodeGroupLeader = NULL;
+ LWLockRelease(leader_lwlock);
+
+ elog(DEBUG1, "became group member (%p) to (%p)", MyProc, leader);
+ return ok;
+}
+
+/*
+ * Remove a decodeGroupMember from the decodeGroupMembership of
+ * decodeGroupLeader
+ * Acquire lock
+ */
+void
+RemoveDecodeGroupMember(PGPROC *leader)
+{
+ LWLock *leader_lwlock;
+
+ leader_lwlock = LockHashPartitionLockByProc(leader);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ RemoveDecodeGroupMemberLocked(leader);
+ LWLockRelease(leader_lwlock);
+
+ return;
+}
+
+/*
+ * Remove a decodeGroupMember from the decodeGroupMembership of
+ * decodeGroupLeader
+ * Assumes that the caller is holding appropriate lock
+ */
+void
+RemoveDecodeGroupMemberLocked(PGPROC *leader)
+{
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ dlist_delete(&MyProc->decodeGroupLink);
+ /* leader links to itself, so never empty */
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ MyProc->decodeGroupLeader = NULL;
+ elog(DEBUG1, "removed group member (%p) from (%p)", MyProc, leader);
+
+ return;
+}
+
+/*
+ * Indicate to all decodeGroupMembers that this transaction is
+ * going away.
+ *
+ * Wait for all decodeGroupMembers to ack back before returning
+ * from here but only in case of aborts.
+ *
+ * This function should be called *after* the proc has been
+ * removed from the procArray.
+ *
+ * If the transaction is committing, it's ok for the
+ * decoders to continue merrily. When a decoder tries to lock this
+ * proc, it won't find it, so it checks the transaction status
+ * and caches the commit status for future calls in
+ * LogicalLockTransaction
+ */
+void
+LogicalDecodeRemoveTransaction(PGPROC *leader, bool isCommit)
+{
+ LWLock *leader_lwlock;
+ dlist_mutable_iter change_i;
+ dlist_iter iter;
+ PGPROC *proc;
+ bool do_wait;
+
+ leader_lwlock = LockHashPartitionLockByProc(leader);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ /* mark ourself as aborting */
+ if (!isCommit)
+ leader->decodeAbortPending = true;
+
+ if (leader->decodeGroupLeader == NULL)
+ {
+ Assert(dlist_is_empty(&leader->decodeGroupMembers));
+ LWLockRelease(leader_lwlock);
+ return;
+ }
+
+recheck:
+ do_wait = false;
+ Assert(leader->decodeGroupLeader == leader);
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ if (!isCommit)
+ {
+ dlist_foreach(iter, &leader->decodeGroupMembers)
+ {
+ proc = dlist_container(PGPROC, decodeGroupLink, iter.cur);
+ /* mark the proc to indicate abort is pending */
+ if (proc == leader)
+ continue;
+ if (!proc->decodeAbortPending)
+ {
+ proc->decodeAbortPending = true;
+ elog(DEBUG1, "marking group member (%p) from (%p) for abort",
+ proc, leader);
+ }
+ /* if the proc is currently locked, wait */
+ if (proc->decodeLocked)
+ do_wait = true;
+ }
+
+ if (do_wait)
+ {
+ int rc;
+ LWLockRelease(leader_lwlock);
+
+ elog(LOG, "Waiting for backends to abort decoding");
+ /*
+ * Wait on our latch to allow decodeGroupMembers to
+ * go away soon
+ */
+ rc = WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+ 100L,
+ WAIT_EVENT_PG_SLEEP);
+ ResetLatch(MyLatch);
+
+ /* emergency bailout if postmaster has died */
+ if (rc & WL_POSTMASTER_DEATH)
+ proc_exit(1);
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Recheck decodeGroupMembers */
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ goto recheck;
+ }
+ }
+
+ /*
+ * All backends exited cleanly in case of aborts above,
+ * remove decodeGroupMembers now for both commit/abort cases
+ */
+ Assert(leader->decodeGroupLeader == leader);
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ dlist_foreach_modify(change_i, &leader->decodeGroupMembers)
+ {
+ proc = dlist_container(PGPROC, decodeGroupLink, change_i.cur);
+ Assert(!proc->decodeLocked);
+ dlist_delete(&proc->decodeGroupLink);
+ elog(DEBUG1, "deleting group member (%p) from (%p)",
+ proc, leader);
+ proc->decodeGroupLeader = NULL;
+ }
+ Assert(dlist_is_empty(&leader->decodeGroupMembers));
+ leader->decodeGroupLeader = NULL;
+ leader->decodeAbortPending = false;
+ LWLockRelease(leader_lwlock);
+
+ return;
+}
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 619c5f4d73..63b14367f0 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -117,6 +117,8 @@ extern void LogicalIncreaseXminForSlot(XLogRecPtr lsn, TransactionId xmin);
extern void LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn,
XLogRecPtr restart_lsn);
extern void LogicalConfirmReceivedLocation(XLogRecPtr lsn);
+extern bool LogicalLockTransaction(ReorderBufferTXN *txn);
+extern void LogicalUnlockTransaction(ReorderBufferTXN *txn);
extern bool filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id);
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 5c19a61dcf..fdfc582874 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -200,6 +200,26 @@ struct PGPROC
PGPROC *lockGroupLeader; /* lock group leader, if I'm a member */
dlist_head lockGroupMembers; /* list of members, if I'm a leader */
dlist_node lockGroupLink; /* my member link, if I'm a member */
+
+ /*
+ * Support for decoding groups. Use LockHashPartitionLockByProc on the group
+ * leader to get the LWLock protecting these fields.
+ *
+ * For prepared and uncommitted transactions, decoding backends working on
+ * the same XID will link themselves up to the corresponding PGPROC
+ * entry (decodeGroupLeader).
+ *
+ * They will remove themselves when they are done decoding.
+ *
+ * If the prepared or uncommitted transaction decides to abort, then
+ * the decodeGroupLeader will set the decodeAbortPending flag allowing
+ * the decodeGroupMembers to abort their decoding appropriately
+ */
+ PGPROC *decodeGroupLeader; /* decode group leader, if I'm a member */
+ dlist_head decodeGroupMembers; /* list of members, if I'm a leader */
+ dlist_node decodeGroupLink; /* my member link, if I'm a member */
+ bool decodeLocked; /* is it currently locked by this proc? */
+ bool decodeAbortPending; /* is the decode group leader aborting? */
};
/* NOTE: "typedef struct PGPROC PGPROC" appears in storage/lock.h. */
@@ -326,5 +346,10 @@ extern PGPROC *AuxiliaryPidGetProc(int pid);
extern void BecomeLockGroupLeader(void);
extern bool BecomeLockGroupMember(PGPROC *leader, int pid);
+extern PGPROC *BecomeDecodeGroupLeader(TransactionId xid, bool is_prepared);
+extern bool BecomeDecodeGroupMember(PGPROC *leader, int pid, bool is_prepared);
+extern void RemoveDecodeGroupMember(PGPROC *leader);
+extern void RemoveDecodeGroupMemberLocked(PGPROC *leader);
+extern void LogicalDecodeRemoveTransaction(PGPROC *leader, bool isCommit);
#endif /* PROC_H */
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 75bab2985f..776de2470e 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -97,6 +97,7 @@ extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids
extern PGPROC *BackendPidGetProc(int pid);
extern PGPROC *BackendPidGetProcWithLock(int pid);
+extern PGPROC *BackendXidGetProc(TransactionId xid);
extern int BackendXidGetPid(TransactionId xid);
extern bool IsBackendPid(int pid);
--
2.14.3 (Apple Git-98)
Attachment: 0001-Cleaning-up-and-addition-of-new-flags-in-ReorderBuff.patch (application/octet-stream)
From c834c29bdd5f15fba248224069f15181dedcc9b9 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Mon, 5 Mar 2018 21:41:17 +0530
Subject: [PATCH 1/6] Cleaning up and addition of new flags in ReorderBufferTXN
structure
---
src/backend/replication/logical/reorderbuffer.c | 32 +++++++--------
src/include/replication/reorderbuffer.h | 52 +++++++++++++++++--------
2 files changed, 51 insertions(+), 33 deletions(-)
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index c72a611a39..d22e116aa1 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -623,7 +623,7 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
if (prev_first_lsn != InvalidXLogRecPtr)
Assert(prev_first_lsn < cur_txn->first_lsn);
- Assert(!cur_txn->is_known_as_subxact);
+ Assert(!rbtxn_is_subxact(cur_txn));
prev_first_lsn = cur_txn->first_lsn;
}
#endif
@@ -641,7 +641,7 @@ ReorderBufferGetOldestTXN(ReorderBuffer *rb)
txn = dlist_head_element(ReorderBufferTXN, node, &rb->toplevel_by_lsn);
- Assert(!txn->is_known_as_subxact);
+ Assert(!rbtxn_is_subxact(txn));
Assert(txn->first_lsn != InvalidXLogRecPtr);
return txn;
}
@@ -675,9 +675,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
dlist_push_tail(&txn->subtxns, &subtxn->node);
txn->nsubtxns++;
}
- else if (!subtxn->is_known_as_subxact)
+ else if (!rbtxn_is_subxact(subtxn))
{
- subtxn->is_known_as_subxact = true;
+ subtxn->txn_flags |= RBTXN_IS_SUBXACT;
Assert(subtxn->nsubtxns == 0);
/* remove from lsn order list of top-level transactions */
@@ -738,9 +738,9 @@ ReorderBufferCommitChild(ReorderBuffer *rb, TransactionId xid,
subtxn->final_lsn = commit_lsn;
subtxn->end_lsn = end_lsn;
- if (!subtxn->is_known_as_subxact)
+ if (!rbtxn_is_subxact(subtxn))
{
- subtxn->is_known_as_subxact = true;
+ subtxn->txn_flags |= RBTXN_IS_SUBXACT;
Assert(subtxn->nsubtxns == 0);
/* remove from lsn order list of top-level transactions */
@@ -849,7 +849,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
ReorderBufferChange *cur_change;
- if (txn->serialized)
+ if (rbtxn_is_serialized(txn))
{
/* serialize remaining changes */
ReorderBufferSerializeTXN(rb, txn);
@@ -878,7 +878,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
ReorderBufferChange *cur_change;
- if (cur_txn->serialized)
+ if (rbtxn_is_serialized(cur_txn))
{
/* serialize remaining changes */
ReorderBufferSerializeTXN(rb, cur_txn);
@@ -1044,7 +1044,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
* they originally were happening inside another subtxn, so we won't
* ever recurse more than one level deep here.
*/
- Assert(subtxn->is_known_as_subxact);
+ Assert(rbtxn_is_subxact(subtxn));
Assert(subtxn->nsubtxns == 0);
ReorderBufferCleanupTXN(rb, subtxn);
@@ -1083,7 +1083,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
/*
* Remove TXN from its containing list.
*
- * Note: if txn->is_known_as_subxact, we are deleting the TXN from its
+ * Note: if txn is known as subxact, we are deleting the TXN from its
* parent's list of known subxacts; this leaves the parent's nsubxacts
* count too high, but we don't care. Otherwise, we are deleting the TXN
* from the LSN-ordered list of toplevel TXNs.
@@ -1098,7 +1098,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
Assert(found);
/* remove entries spilled to disk */
- if (txn->serialized)
+ if (rbtxn_is_serialized(txn))
ReorderBufferRestoreCleanup(rb, txn);
/* deallocate */
@@ -1115,7 +1115,7 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
dlist_iter iter;
HASHCTL hash_ctl;
- if (!txn->has_catalog_changes || dlist_is_empty(&txn->tuplecids))
+ if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
return;
memset(&hash_ctl, 0, sizeof(hash_ctl));
@@ -1688,7 +1688,7 @@ ReorderBufferAbortOld(ReorderBuffer *rb, TransactionId oldestRunningXid)
* final_lsn to that of their last change; this causes
* ReorderBufferRestoreCleanup to do the right thing.
*/
- if (txn->serialized && txn->final_lsn == 0)
+ if (rbtxn_is_serialized(txn) && txn->final_lsn == 0)
{
ReorderBufferChange *last =
dlist_tail_element(ReorderBufferChange, node, &txn->changes);
@@ -1934,7 +1934,7 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
- txn->has_catalog_changes = true;
+ txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
}
/*
@@ -1951,7 +1951,7 @@ ReorderBufferXidHasCatalogChanges(ReorderBuffer *rb, TransactionId xid)
if (txn == NULL)
return false;
- return txn->has_catalog_changes;
+ return rbtxn_has_catalog_changes(txn);
}
/*
@@ -2095,7 +2095,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
Assert(spilled == txn->nentries_mem);
Assert(dlist_is_empty(&txn->changes));
txn->nentries_mem = 0;
- txn->serialized = true;
+ txn->txn_flags |= RBTXN_SERIALIZED;
if (fd != -1)
CloseTransientFile(fd);
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 0970abca52..d6b00654c2 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -137,21 +137,48 @@ typedef struct ReorderBufferChange
dlist_node node;
} ReorderBufferChange;
+/* ReorderBufferTXN flags */
+#define RBTXN_HAS_CATALOG_CHANGES 0x0001
+#define RBTXN_IS_SUBXACT 0x0002
+#define RBTXN_SERIALIZED 0x0004
+#define RBTXN_PREPARE 0x0008
+#define RBTXN_COMMIT_PREPARED 0x0010
+#define RBTXN_ROLLBACK_PREPARED 0x0020
+#define RBTXN_COMMIT 0x0040
+#define RBTXN_ROLLBACK 0x0080
+
+/* does the txn have catalog changes */
+#define rbtxn_has_catalog_changes(txn) (txn->txn_flags & RBTXN_HAS_CATALOG_CHANGES)
+/* is the txn known as a subxact? */
+#define rbtxn_is_subxact(txn) (txn->txn_flags & RBTXN_IS_SUBXACT)
+/*
+ * Has this transaction been spilled to disk? It's not always possible to
+ * deduce that fact by comparing nentries with nentries_mem, because e.g.
+ * subtransactions of a large transaction might get serialized together
+ * with the parent - if they're restored to memory they'd have
+ * nentries_mem == nentries.
+ */
+#define rbtxn_is_serialized(txn) (txn->txn_flags & RBTXN_SERIALIZED)
+/* is this txn prepared? */
+#define rbtxn_prepared(txn) (txn->txn_flags & RBTXN_PREPARE)
+/* was this prepared txn committed in the meanwhile? */
+#define rbtxn_commit_prepared(txn) (txn->txn_flags & RBTXN_COMMIT_PREPARED)
+/* was this prepared txn aborted in the meanwhile? */
+#define rbtxn_rollback_prepared(txn) (txn->txn_flags & RBTXN_ROLLBACK_PREPARED)
+/* was this txn committed in the meanwhile? */
+#define rbtxn_commit(txn) (txn->txn_flags & RBTXN_COMMIT)
+/* was this txn aborted in the meanwhile? */
+#define rbtxn_rollback(txn) (txn->txn_flags & RBTXN_ROLLBACK)
+
typedef struct ReorderBufferTXN
{
+ int txn_flags;
+
/*
* The transactions transaction id, can be a toplevel or sub xid.
*/
TransactionId xid;
- /* did the TX have catalog changes */
- bool has_catalog_changes;
-
- /*
- * Do we know this is a subxact?
- */
- bool is_known_as_subxact;
-
/*
* LSN of the first data carrying, WAL record with knowledge about this
* xid. This is allowed to *not* be first record adorned with this xid, if
@@ -214,15 +241,6 @@ typedef struct ReorderBufferTXN
*/
uint64 nentries_mem;
- /*
- * Has this transaction been spilled to disk? It's not always possible to
- * deduce that fact by comparing nentries with nentries_mem, because e.g.
- * subtransactions of a large transaction might get serialized together
- * with the parent - if they're restored to memory they'd have
- * nentries_mem == nentries.
- */
- bool serialized;
-
/*
* List of ReorderBufferChange structs, including new Snapshots and new
* CommandIds
--
2.14.3 (Apple Git-98)
Hi Nikhil,
I've been looking at this patch over the past few days, so here are my
thoughts so far ...
decoding aborted transactions
=============================
First, let's talk about the handling of aborted transactions, which was
originally discussed in thread [1]. I'll try to summarize the status and
explain my understanding of the choices first.
[1]: /messages/by-id/CAMGcDxeHBaXCz12LdfEmyJdghbms_dtC26pRZXKWRV2dazO-UQ@mail.gmail.com
There were multiple ideas about how to deal with aborted transactions,
but we eventually found various issues in all of them except for two -
interlocking decoding and aborts, and modifying the rules so that
aborted transactions are considered to be running while being decoded.
This patch uses the first approach, i.e. interlock. It has a couple of
disadvantages:
a) The abort may need to wait for decoding workers for a while.
This is annoying, but aborts are generally rare. And for systems with
many concurrent short transactions (where even tiny delays would matter)
it's unlikely the decoding workers will have already started decoding the
aborted transaction.
b) output plugins need to call lock/unlock explicitly from the callbacks
Technically, we could wrap the whole callback in a lock/unlock, but that
would needlessly increase the amount of time spent holding the lock,
making the previous point much worse. As the callbacks are expected to
do network I/O etc. the amount of time could be quite significant.
The main advantage is of course that it's likely much less invasive
than tweaking which transactions are seen as running. So I think taking
this approach is a sensible choice at this point.
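To make point (b) concrete, here is a minimal, self-contained sketch of the pattern I have in mind: hold the interlock only around the catalog access and release it before the (potentially slow) network I/O. Everything named Demo* or demo_* is a hypothetical stand-in I made up for illustration; only the lock/unlock idea mirrors LogicalLockTransaction and LogicalUnlockTransaction from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for ReorderBufferTXN; not the real struct. */
typedef struct DemoTxn
{
	bool		abort_pending;	/* set by the aborting backend */
	bool		locked;			/* true while a decoder holds the interlock */
	int			catalog_reads;	/* counts simulated catalog lookups */
	int			messages_sent;	/* counts simulated network sends */
} DemoTxn;

/* Stand-in for LogicalLockTransaction: refuse if an abort is pending. */
static bool
demo_lock_txn(DemoTxn *txn)
{
	if (txn->abort_pending)
		return false;
	txn->locked = true;
	return true;
}

/* Stand-in for LogicalUnlockTransaction. */
static void
demo_unlock_txn(DemoTxn *txn)
{
	assert(txn->locked);
	txn->locked = false;
}

/*
 * Hypothetical change callback: returns false if decoding of this
 * transaction should stop because the transaction is aborting.
 */
static bool
demo_change_cb(DemoTxn *txn)
{
	if (!demo_lock_txn(txn))
		return false;

	/* catalog access happens only while the interlock is held */
	txn->catalog_reads++;

	demo_unlock_txn(txn);

	/* network I/O happens outside the lock, so an abort is not delayed */
	assert(!txn->locked);
	txn->messages_sent++;
	return true;
}
```

With this shape an abort only ever waits for the short catalog-access window, not for the whole callback.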
Now, about the interlock implementation - I see you've reused the "lock
group" concept from parallel query. That may make sense, but unfortunately
there's next to no documentation explaining how it works, what the
"protocol" is, etc. There is fairly extensive documentation for "lock
groups" in src/backend/storage/lmgr/README, but while the "decoding
group" code is inspired by it, the code is actually very different.
Compare for example BecomeLockGroupLeader and BecomeDecodeGroupLeader,
and you'll see what I mean.
So I think the first thing we need to do is add proper documentation
(possibly into the same README), explaining how the decode groups work,
how the decodeAbortPending works, etc.
Also, some function names seem a bit misleading. For example in the lock
group "BecomeLockGroupLeader" means "make the current process a group
leader", but apparently "BecomeDecodeGroupLeader" means "find the
process handling XID and make it a leader". But perhaps I got that
entirely wrong.
Of course LogicalLockTransaction and LogicalUnlockTransaction should
have proper comments, which is particularly important as they're part of
the public API.
BTW, do we need to do any of this with (wal_level < logical)? I don't
see any quick bail-out in any of the functions in this case, but it
seems like a fairly obvious optimization.
Similarly, can't the logical workers indicate that they need to decode
2PC transactions (or in-progress transactions in general) in some way?
If we knew there are no such workers, that would also allow ignoring the
interlock, no?
Another thing is that I'm yet to see any performance tests. While we do
believe it will work fine, it's based on a number of assumptions:
a) aborts are rare
b) it has no measurable impact on commit
I think we need to verify this by actually measuring the impact on a
bunch of workloads. In particular, I think we need to test
i) impact on commit-only workloads
ii) impact on worst-case scenario
I'm not sure what (ii) would look like, considering the patch only deals
with decoding 2PC transactions, which have significant overhead on their
own - so I'm afraid the impact on "regular transactions" might be much
worse, once we add support for that.
decoding 2PC transactions
=========================
Now, the main topic of the patch. Overall the changes make sense, I
think - it modifies about the same places I touched in the streaming
patch, in similar ways.
The following comments are mostly in random order:
1) test_decoding.c
------------------
The "filter" functions do not follow the naming convention, so I suggest
renaming them like this:
- pg_filter_decode_txn -> pg_decode_filter_txn
- pg_filter_prepare -> pg_decode_filter_prepare_txn
or something like that. Also, looking at those functions (and those same
callbacks in the pgoutput plugin) I wonder if we really need to make
them part of the output plugin API.
I mean, AFAICS their only purpose is to filter 2PC transactions, but I
don't quite see why implementing those checks should be responsibility
of the plugin? I suppose it was done to make test_decoding customizable
(i.e. allow enabling/disabling of decoding 2PC as needed), right?
In that case I suggest making it configurable by plugin-level flags (I see
LogicalDecodingContext already has an enable_twophase), and moving the
checks to a function that is not part of the plugin API. Of course, in
that case the flag needs to be customizable from plugin options, not
just "Does the plugin have all the callbacks?".
The "twophase-decoding" and "twophase-decode-with-catalog-changes" seem
a bit inconsistently named too (why decode vs. decoding?).
2) regression tests
-------------------
I really dislike the use of \set to run the same query repeatedly. It
makes analysis of regression failures even more tedious than it already
is. I'd just copy the query to all the places.
3) worker.c
-----------
The comment in apply_handle_rollback_prepared_txn says this:
/*
* During logical decoding, on the apply side, it's possible that a
* prepared transaction got aborted while decoding. In that case, we
* stop the decoding and abort the transaction immediately. However
* the ROLLBACK prepared processing still reaches the subscriber. In
* that case it's ok to have a missing gid
*/
if (LookupGXact(commit_data->gid)) { ... }
But is it safe to assume it never happens due to an error? In other
words, is there a way to decide that the GID really aborted? Or, why
should the provider send the rollback at all - surely it could know if
the transaction/GID was sent to subscriber or not, right?
4) twophase.c
-------------
I wonder why the patch modifies the TWOPHASE_MAGIC at all - if it's
meant to identify 2PC files, then why not keep the value? And if we
really need to modify it, why not use another random number? By only
adding 1 to the current one, it makes it look like a random bit flip.
5) decode.c
-----------
The changes in DecodeCommit need proper comments.
In DecodeAbort, the "if" includes this condition:
ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid)
which essentially means ROLLBACK PREPARED is translated into "is the
transaction prepared?". Shouldn't the code look at xl_xact_parsed_abort
instead, and make the ReorderBufferTxnIsPrepared an Assert?
6) logical.c
------------
I see StartupDecodingContext does this:
twophase_callbacks = (ctx->callbacks.prepare_cb != NULL) +
(ctx->callbacks.commit_prepared_cb != NULL) +
(ctx->callbacks.abort_prepared_cb != NULL);
It seems a bit strange to do arithmetic on bools, I guess. In any
case, I think this should be an ERROR and not a WARNING:
if (twophase_callbacks != 3 && twophase_callbacks != 0)
ereport(WARNING,
(errmsg("Output plugin registered only %d twophase callbacks. "
"Twophase transactions will be decoded at commit time.",
twophase_callbacks)));
A plugin that implements only a subset of the callbacks seems outright
broken, so let's just fail.
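Roughly what I have in mind, as a self-contained sketch: classify the callback set as none/all/partial with plain boolean logic, and let the caller raise an ERROR for the partial case. The Demo* types and demo_* functions are hypothetical stand-ins, not the real OutputPluginCallbacks; only the three callback names come from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef void (*DemoCallback) (void);

/* Hypothetical stand-in for the two-phase part of the plugin callbacks. */
typedef struct DemoCallbacks
{
	DemoCallback prepare_cb;
	DemoCallback commit_prepared_cb;
	DemoCallback abort_prepared_cb;
} DemoCallbacks;

typedef enum DemoTwophaseSupport
{
	DEMO_TWOPHASE_NONE,			/* decode 2PC at commit time */
	DEMO_TWOPHASE_ALL,			/* decode 2PC at prepare time */
	DEMO_TWOPHASE_PARTIAL		/* broken plugin: caller should ERROR out */
} DemoTwophaseSupport;

/* Placeholder callback used only to populate the struct in examples. */
static void
demo_noop(void)
{
}

static DemoTwophaseSupport
demo_classify_twophase(const DemoCallbacks *cb)
{
	bool		has_prepare = (cb->prepare_cb != NULL);
	bool		has_commit = (cb->commit_prepared_cb != NULL);
	bool		has_abort = (cb->abort_prepared_cb != NULL);

	if (has_prepare && has_commit && has_abort)
		return DEMO_TWOPHASE_ALL;
	if (!has_prepare && !has_commit && !has_abort)
		return DEMO_TWOPHASE_NONE;
	return DEMO_TWOPHASE_PARTIAL;
}
```

In StartupDecodingContext the DEMO_TWOPHASE_PARTIAL result would map to ereport(ERROR, ...) instead of the current WARNING.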
7) proto.c / worker.c
---------------------
Until now, the 'action' (essentially the first byte of each message)
clearly identified what the message does. So 'C' -> commit, 'I' ->
insert, 'D' -> delete etc. This also means the "handle" methods were
inherently simple, because each handled exactly one particular action
and nothing else.
You've expanded the protocol in a way that suddenly 'C' means either
COMMIT or ROLLBACK, and 'P' means PREPARE, ROLLBACK PREPARED or COMMIT
PREPARED. I don't think that's how the protocol should be extended - if
anything, it's damn confusing and unlike the existing code. You should
define new actions, and keep the handlers in worker.c simple.
Also, this probably implies LOGICALREP_PROTO_VERSION_NUM increase.
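To illustrate the "one byte, one meaning" idea, here is a small self-contained sketch of a dispatcher where each action byte maps to exactly one handler. The bytes 'C', 'I' and 'D' are the existing ones mentioned above; the bytes I picked for COMMIT PREPARED ('K') and ROLLBACK PREPARED ('r') are purely hypothetical choices for illustration, as are all the Demo*/demo_* names.

```c
#include <assert.h>
#include <string.h>

/* One distinct byte per action; 'K' and 'r' are hypothetical choices. */
typedef enum DemoAction
{
	DEMO_ACTION_COMMIT = 'C',
	DEMO_ACTION_INSERT = 'I',
	DEMO_ACTION_DELETE = 'D',
	DEMO_ACTION_PREPARE = 'P',
	DEMO_ACTION_COMMIT_PREPARED = 'K',
	DEMO_ACTION_ROLLBACK_PREPARED = 'r'
} DemoAction;

/*
 * Each case handles exactly one action, so no handler needs to inspect
 * extra flags in the message to find out what it is supposed to do.
 * A real dispatcher would call the apply_handle_* function instead of
 * returning a name.
 */
static const char *
demo_dispatch(char action)
{
	switch (action)
	{
		case DEMO_ACTION_COMMIT:
			return "commit";
		case DEMO_ACTION_INSERT:
			return "insert";
		case DEMO_ACTION_DELETE:
			return "delete";
		case DEMO_ACTION_PREPARE:
			return "prepare";
		case DEMO_ACTION_COMMIT_PREPARED:
			return "commit prepared";
		case DEMO_ACTION_ROLLBACK_PREPARED:
			return "rollback prepared";
		default:
			return "unknown";
	}
}
```

An unknown byte falls through to the default branch, which is where a subscriber on an older protocol version would report an error.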
8) reorderbuffer.h/c
--------------------
Similarly, I wonder why you replaced the ReorderBuffer boolean flags
(is_known_as_subxact, has_catalog_changes) with a bitmask? I find it way
more difficult to read (which is subjective, of course) but it also
makes IDEs dumber (suddenly they can't offer you field names).
Surely it wasn't done to save space, because by using an "int" you've
saved just 4B (there are 8 flags right now, so it'd need 8 bytes with
plain bool flags) on a structure that is already ~200B.
And then you added gid[GIDSIZE] to it, making it 400B for *all*
transactions and subtransactions (not just 2PC). Not to mention that the
GID is usually much shorter than the 200B.
So I suggest using just a simple (char *) pointer for the GID, keeping
it NULL for most transactions, and switching back to plain bool flags.
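A quick self-contained sketch of the size difference, under the assumption that GIDSIZE is 200 as in twophase.c. The two Demo* structs are heavily simplified, hypothetical stand-ins for ReorderBufferTXN (the real struct has many more fields), and demo_set_gid uses malloc only to keep the example standalone; in the backend this would be an allocation in the reorder buffer's memory context.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_GIDSIZE 200		/* assumed to match GIDSIZE in twophase.c */

/* Variant from the patch: bitmask plus inline GID for every transaction. */
typedef struct DemoTxnInlineGid
{
	int			txn_flags;
	char		gid[DEMO_GIDSIZE];
} DemoTxnInlineGid;

/* Suggested variant: plain bools, GID allocated only for 2PC transactions. */
typedef struct DemoTxnPointerGid
{
	bool		has_catalog_changes;
	bool		is_known_as_subxact;
	bool		serialized;
	bool		prepared;
	char	   *gid;			/* NULL unless the transaction is prepared */
} DemoTxnPointerGid;

/* Store a copy of the GID; returns false if the allocation fails. */
static bool
demo_set_gid(DemoTxnPointerGid *txn, const char *gid)
{
	size_t		len = strlen(gid) + 1;	/* include the terminating '\0' */
	char	   *copy = malloc(len);

	if (copy == NULL)
		return false;
	memcpy(copy, gid, len);
	free(txn->gid);				/* free(NULL) is a no-op */
	txn->gid = copy;
	txn->prepared = true;
	return true;
}
```

Non-2PC transactions and subtransactions then pay only for one pointer, and a prepared transaction pays for the actual GID length rather than the full 200 bytes.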
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 5 March 2018 at 16:37, Nikhil Sontakke <nikhils@2ndquadrant.com> wrote:
I will re-submit with "git format-patch" soon.
PFA, patches in "format-patch" format.
This patch set also includes changes in the test_decoding plugin along
with an additional savepoint related test case that was pointed out on
this thread, upstream.
Reviewing 0001-Cleaning-up-and-addition-of-new-flags-in-ReorderBuff.patch
Change from is_known_as_subxact to rbtxn_is_subxact
loses some meaning, since rbtxn entries with this flag set false might
still be subxacts, we just don't know yet.
rbtxn_is_serialized refers to RBTXN_SERIALIZED
so flag name should be RBTXN_IS_SERIALIZED so it matches
Otherwise looks OK to commit
Reviewing 0003-Add-support-for-logging-GID-in-commit-abort-WAL-reco
Looks fine, reworked patch attached
* added changes to xact.h from patch 4 so that this is a whole,
committable patch
* added comments to make abort and commit structs look same
Attached patch is proposed for a separate, early commit as part of this
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
logging-GID-in-commit-abort-WAL.v2.patch (application/octet-stream)
diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index e5eef9ea43..b3e2fc3036 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -102,6 +102,14 @@ ParseCommitRecord(uint8 info, xl_xact_commit *xlrec, xl_xact_parsed_commit *pars
parsed->twophase_xid = xl_twophase->xid;
data += sizeof(xl_xact_twophase);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_GID)
+ {
+ int gidlen;
+ strcpy(parsed->twophase_gid, data);
+ gidlen = strlen(parsed->twophase_gid) + 1;
+ data += MAXALIGN(gidlen);
+ }
}
if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
@@ -139,6 +147,16 @@ ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_abort *parsed)
data += sizeof(xl_xact_xinfo);
}
+ if (parsed->xinfo & XACT_XINFO_HAS_DBINFO)
+ {
+ xl_xact_dbinfo *xl_dbinfo = (xl_xact_dbinfo *) data;
+
+ parsed->dbId = xl_dbinfo->dbId;
+ parsed->tsId = xl_dbinfo->tsId;
+
+ data += sizeof(xl_xact_dbinfo);
+ }
+
if (parsed->xinfo & XACT_XINFO_HAS_SUBXACTS)
{
xl_xact_subxacts *xl_subxacts = (xl_xact_subxacts *) data;
@@ -168,6 +186,27 @@ ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_abort *parsed)
parsed->twophase_xid = xl_twophase->xid;
data += sizeof(xl_xact_twophase);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_GID)
+ {
+ int gidlen;
+ strcpy(parsed->twophase_gid, data);
+ gidlen = strlen(parsed->twophase_gid) + 1;
+ data += MAXALIGN(gidlen);
+ }
+ }
+
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ xl_xact_origin xl_origin;
+
+ /* we're only guaranteed 4 byte alignment, so copy onto stack */
+ memcpy(&xl_origin, data, sizeof(xl_origin));
+
+ parsed->origin_lsn = xl_origin.origin_lsn;
+ parsed->origin_timestamp = xl_origin.origin_timestamp;
+
+ data += sizeof(xl_xact_origin);
}
}
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index c479c4881b..d6e4b7980f 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -144,11 +144,7 @@ int max_prepared_xacts = 0;
*
* typedef struct GlobalTransactionData *GlobalTransaction appears in
* twophase.h
- *
- * Note that the max value of GIDSIZE must fit in the uint16 gidlen,
- * specified in TwoPhaseFileHeader.
*/
-#define GIDSIZE 200
typedef struct GlobalTransactionData
{
@@ -211,12 +207,14 @@ static void RecordTransactionCommitPrepared(TransactionId xid,
RelFileNode *rels,
int ninvalmsgs,
SharedInvalidationMessage *invalmsgs,
- bool initfileinval);
+ bool initfileinval,
+ const char *gid);
static void RecordTransactionAbortPrepared(TransactionId xid,
int nchildren,
TransactionId *children,
int nrels,
- RelFileNode *rels);
+ RelFileNode *rels,
+ const char *gid);
static void ProcessRecords(char *bufptr, TransactionId xid,
const TwoPhaseCallback callbacks[]);
static void RemoveGXact(GlobalTransaction gxact);
@@ -898,7 +896,7 @@ TwoPhaseGetDummyProc(TransactionId xid)
/*
* Header for a 2PC state file
*/
-#define TWOPHASE_MAGIC 0x57F94533 /* format identifier */
+#define TWOPHASE_MAGIC 0x57F94534 /* format identifier */
typedef struct TwoPhaseFileHeader
{
@@ -914,6 +912,8 @@ typedef struct TwoPhaseFileHeader
int32 ninvalmsgs; /* number of cache invalidation messages */
bool initfileinval; /* does relcache init file need invalidation? */
uint16 gidlen; /* length of the GID - GID follows the header */
+ XLogRecPtr origin_lsn; /* lsn of this record at origin node */
+ TimestampTz origin_timestamp; /* time of prepare at origin node */
} TwoPhaseFileHeader;
/*
@@ -1065,6 +1065,7 @@ EndPrepare(GlobalTransaction gxact)
{
TwoPhaseFileHeader *hdr;
StateFileChunk *record;
+ bool replorigin;
/* Add the end sentinel to the list of 2PC records */
RegisterTwoPhaseRecord(TWOPHASE_RM_END_ID, 0,
@@ -1075,6 +1076,21 @@ EndPrepare(GlobalTransaction gxact)
Assert(hdr->magic == TWOPHASE_MAGIC);
hdr->total_len = records.total_len + sizeof(pg_crc32c);
+ replorigin = (replorigin_session_origin != InvalidRepOriginId &&
+ replorigin_session_origin != DoNotReplicateId);
+
+ if (replorigin)
+ {
+ Assert(replorigin_session_origin_lsn != InvalidXLogRecPtr);
+ hdr->origin_lsn = replorigin_session_origin_lsn;
+ hdr->origin_timestamp = replorigin_session_origin_timestamp;
+ }
+ else
+ {
+ hdr->origin_lsn = InvalidXLogRecPtr;
+ hdr->origin_timestamp = 0;
+ }
+
/*
* If the data size exceeds MaxAllocSize, we won't be able to read it in
* ReadTwoPhaseFile. Check for that now, rather than fail in the case
@@ -1107,7 +1123,16 @@ EndPrepare(GlobalTransaction gxact)
XLogBeginInsert();
for (record = records.head; record != NULL; record = record->next)
XLogRegisterData(record->data, record->len);
+
+ XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
+
gxact->prepare_end_lsn = XLogInsert(RM_XACT_ID, XLOG_XACT_PREPARE);
+
+ if (replorigin)
+ /* Move LSNs forward for this replication origin */
+ replorigin_session_advance(replorigin_session_origin_lsn,
+ gxact->prepare_end_lsn);
+
XLogFlush(gxact->prepare_end_lsn);
/* If we crash now, we have prepared: WAL replay will fix things */
@@ -1283,6 +1308,44 @@ ReadTwoPhaseFile(TransactionId xid, bool give_warnings)
return buf;
}
+/*
+ * ParsePrepareRecord
+ */
+void
+ParsePrepareRecord(uint8 info, char *xlrec, xl_xact_parsed_prepare *parsed)
+{
+ TwoPhaseFileHeader *hdr;
+ char *bufptr;
+
+ hdr = (TwoPhaseFileHeader *) xlrec;
+ bufptr = xlrec + MAXALIGN(sizeof(TwoPhaseFileHeader));
+
+ parsed->origin_lsn = hdr->origin_lsn;
+ parsed->origin_timestamp = hdr->origin_timestamp;
+ parsed->twophase_xid = hdr->xid;
+ parsed->dbId = hdr->database;
+ parsed->nsubxacts = hdr->nsubxacts;
+ parsed->nrels = hdr->ncommitrels;
+ parsed->nabortrels = hdr->nabortrels;
+ parsed->nmsgs = hdr->ninvalmsgs;
+
+ strncpy(parsed->twophase_gid, bufptr, hdr->gidlen);
+ bufptr += MAXALIGN(hdr->gidlen);
+
+ parsed->subxacts = (TransactionId *) bufptr;
+ bufptr += MAXALIGN(hdr->nsubxacts * sizeof(TransactionId));
+
+ parsed->xnodes = (RelFileNode *) bufptr;
+ bufptr += MAXALIGN(hdr->ncommitrels * sizeof(RelFileNode));
+
+ parsed->abortnodes = (RelFileNode *) bufptr;
+ bufptr += MAXALIGN(hdr->nabortrels * sizeof(RelFileNode));
+
+ parsed->msgs = (SharedInvalidationMessage *) bufptr;
+ bufptr += MAXALIGN(hdr->ninvalmsgs * sizeof(SharedInvalidationMessage));
+}
+
+
/*
* Reads 2PC data from xlog. During checkpoint this data will be moved to
@@ -1435,11 +1498,12 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
hdr->nsubxacts, children,
hdr->ncommitrels, commitrels,
hdr->ninvalmsgs, invalmsgs,
- hdr->initfileinval);
+ hdr->initfileinval, gid);
else
RecordTransactionAbortPrepared(xid,
hdr->nsubxacts, children,
- hdr->nabortrels, abortrels);
+ hdr->nabortrels, abortrels,
+ gid);
ProcArrayRemove(proc, latestXid);
@@ -1752,7 +1816,8 @@ restoreTwoPhaseData(void)
if (buf == NULL)
continue;
- PrepareRedoAdd(buf, InvalidXLogRecPtr, InvalidXLogRecPtr);
+ PrepareRedoAdd(buf, InvalidXLogRecPtr,
+ InvalidXLogRecPtr, InvalidRepOriginId);
}
}
LWLockRelease(TwoPhaseStateLock);
@@ -2165,7 +2230,8 @@ RecordTransactionCommitPrepared(TransactionId xid,
RelFileNode *rels,
int ninvalmsgs,
SharedInvalidationMessage *invalmsgs,
- bool initfileinval)
+ bool initfileinval,
+ const char *gid)
{
XLogRecPtr recptr;
TimestampTz committs = GetCurrentTimestamp();
@@ -2193,7 +2259,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
ninvalmsgs, invalmsgs,
initfileinval, false,
MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
- xid);
+ xid, gid);
if (replorigin)
@@ -2255,7 +2321,8 @@ RecordTransactionAbortPrepared(TransactionId xid,
int nchildren,
TransactionId *children,
int nrels,
- RelFileNode *rels)
+ RelFileNode *rels,
+ const char *gid)
{
XLogRecPtr recptr;
@@ -2278,7 +2345,7 @@ RecordTransactionAbortPrepared(TransactionId xid,
nchildren, children,
nrels, rels,
MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
- xid);
+ xid, gid);
/* Always flush, since we're about to remove the 2PC state file */
XLogFlush(recptr);
@@ -2309,7 +2376,8 @@ RecordTransactionAbortPrepared(TransactionId xid,
* data, the entry is marked as located on disk.
*/
void
-PrepareRedoAdd(char *buf, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn, RepOriginId origin_id)
{
TwoPhaseFileHeader *hdr = (TwoPhaseFileHeader *) buf;
char *bufptr;
@@ -2358,6 +2426,13 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
Assert(TwoPhaseState->numPrepXacts < max_prepared_xacts);
TwoPhaseState->prepXacts[TwoPhaseState->numPrepXacts++] = gxact;
+ if (origin_id != InvalidRepOriginId)
+ {
+ /* recover apply progress */
+ replorigin_advance(origin_id, hdr->origin_lsn, end_lsn,
+ false /* backward */ , false /* WAL */ );
+ }
+
elog(DEBUG2, "added 2PC data in shared memory for transaction %u", gxact->xid);
}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 5d1b9027cf..04cec9b2f0 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1227,7 +1227,7 @@ RecordTransactionCommit(void)
nmsgs, invalMessages,
RelcacheInitFileInval, forceSyncCommit,
MyXactFlags,
- InvalidTransactionId /* plain commit */ );
+ InvalidTransactionId, NULL /* plain commit */ );
if (replorigin)
/* Move LSNs forward for this replication origin */
@@ -1579,7 +1579,8 @@ RecordTransactionAbort(bool isSubXact)
XactLogAbortRecord(xact_time,
nchildren, children,
nrels, rels,
- MyXactFlags, InvalidTransactionId);
+ MyXactFlags, InvalidTransactionId,
+ NULL);
/*
* Report the latest async abort LSN, so that the WAL writer knows to
@@ -5235,7 +5236,8 @@ XactLogCommitRecord(TimestampTz commit_time,
int nrels, RelFileNode *rels,
int nmsgs, SharedInvalidationMessage *msgs,
bool relcacheInval, bool forceSync,
- int xactflags, TransactionId twophase_xid)
+ int xactflags, TransactionId twophase_xid,
+ const char *twophase_gid)
{
xl_xact_commit xlrec;
xl_xact_xinfo xl_xinfo;
@@ -5247,6 +5249,7 @@ XactLogCommitRecord(TimestampTz commit_time,
xl_xact_origin xl_origin;
uint8 info;
+ int gidlen = 0;
Assert(CritSectionCount > 0);
@@ -5309,6 +5312,13 @@ XactLogCommitRecord(TimestampTz commit_time,
{
xl_xinfo.xinfo |= XACT_XINFO_HAS_TWOPHASE;
xl_twophase.xid = twophase_xid;
+ Assert(twophase_gid != NULL);
+
+ if (XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_GID;
+ gidlen = strlen(twophase_gid) + 1; /* include '\0' */
+ }
}
/* dump transaction origin information */
@@ -5359,7 +5369,16 @@ XactLogCommitRecord(TimestampTz commit_time,
}
if (xl_xinfo.xinfo & XACT_XINFO_HAS_TWOPHASE)
+ {
XLogRegisterData((char *) (&xl_twophase), sizeof(xl_xact_twophase));
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_GID)
+ {
+ static const char zeroes[MAXIMUM_ALIGNOF] = { 0 };
+ XLogRegisterData((char*) twophase_gid, gidlen);
+ if (MAXALIGN(gidlen) != gidlen)
+ XLogRegisterData((char*) zeroes, MAXALIGN(gidlen) - gidlen);
+ }
+ }
if (xl_xinfo.xinfo & XACT_XINFO_HAS_ORIGIN)
XLogRegisterData((char *) (&xl_origin), sizeof(xl_xact_origin));
@@ -5380,15 +5399,19 @@ XLogRecPtr
XactLogAbortRecord(TimestampTz abort_time,
int nsubxacts, TransactionId *subxacts,
int nrels, RelFileNode *rels,
- int xactflags, TransactionId twophase_xid)
+ int xactflags, TransactionId twophase_xid,
+ const char *twophase_gid)
{
xl_xact_abort xlrec;
xl_xact_xinfo xl_xinfo;
xl_xact_subxacts xl_subxacts;
xl_xact_relfilenodes xl_relfilenodes;
xl_xact_twophase xl_twophase;
+ xl_xact_dbinfo xl_dbinfo;
+ xl_xact_origin xl_origin;
uint8 info;
+ int gidlen = 0;
Assert(CritSectionCount > 0);
@@ -5424,6 +5447,31 @@ XactLogAbortRecord(TimestampTz abort_time,
{
xl_xinfo.xinfo |= XACT_XINFO_HAS_TWOPHASE;
xl_twophase.xid = twophase_xid;
+ Assert(twophase_gid != NULL);
+
+ if (XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_GID;
+ gidlen = strlen(twophase_gid) + 1; /* include '\0' */
+ }
+ }
+
+ if (TransactionIdIsValid(twophase_xid) && XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_DBINFO;
+ xl_dbinfo.dbId = MyDatabaseId;
+ xl_dbinfo.tsId = MyDatabaseTableSpace;
+ }
+
+ /* dump transaction origin information only for abort prepared */
+ if ( (replorigin_session_origin != InvalidRepOriginId) &&
+ TransactionIdIsValid(twophase_xid) &&
+ XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_ORIGIN;
+
+ xl_origin.origin_lsn = replorigin_session_origin_lsn;
+ xl_origin.origin_timestamp = replorigin_session_origin_timestamp;
}
if (xl_xinfo.xinfo != 0)
@@ -5438,6 +5486,10 @@ XactLogAbortRecord(TimestampTz abort_time,
if (xl_xinfo.xinfo != 0)
XLogRegisterData((char *) (&xl_xinfo), sizeof(xl_xinfo));
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_DBINFO)
+ XLogRegisterData((char *) (&xl_dbinfo), sizeof(xl_dbinfo));
+
+
if (xl_xinfo.xinfo & XACT_XINFO_HAS_SUBXACTS)
{
XLogRegisterData((char *) (&xl_subxacts),
@@ -5455,7 +5507,22 @@ XactLogAbortRecord(TimestampTz abort_time,
}
if (xl_xinfo.xinfo & XACT_XINFO_HAS_TWOPHASE)
+ {
XLogRegisterData((char *) (&xl_twophase), sizeof(xl_xact_twophase));
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_GID)
+ {
+ static const char zeroes[MAXIMUM_ALIGNOF] = { 0 };
+ XLogRegisterData((char*) twophase_gid, gidlen);
+ if (MAXALIGN(gidlen) != gidlen)
+ XLogRegisterData((char*) zeroes, MAXALIGN(gidlen) - gidlen);
+ }
+ }
+
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_ORIGIN)
+ XLogRegisterData((char *) (&xl_origin), sizeof(xl_xact_origin));
+
+ if (TransactionIdIsValid(twophase_xid))
+ XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
return XLogInsert(RM_XACT_ID, info);
}
@@ -5778,7 +5845,8 @@ xact_redo(XLogReaderState *record)
LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
PrepareRedoAdd(XLogRecGetData(record),
record->ReadRecPtr,
- record->EndRecPtr);
+ record->EndRecPtr,
+ XLogRecGetOrigin(record));
LWLockRelease(TwoPhaseStateLock);
}
else if (info == XLOG_XACT_ASSIGNMENT)
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 34d9470811..f05cde202f 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -15,6 +15,7 @@
#define TWOPHASE_H
#include "access/xlogdefs.h"
+#include "access/xact.h"
#include "datatype/timestamp.h"
#include "storage/lock.h"
@@ -46,6 +47,8 @@ extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
int *nxids_p);
+extern void ParsePrepareRecord(uint8 info, char *xlrec,
+ xl_xact_parsed_prepare *parsed);
extern void StandbyRecoverPreparedTransactions(void);
extern void RecoverPreparedTransactions(void);
@@ -54,7 +57,7 @@ extern void CheckPointTwoPhase(XLogRecPtr redo_horizon);
extern void FinishPreparedTransaction(const char *gid, bool isCommit);
extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
- XLogRecPtr end_lsn);
+ XLogRecPtr end_lsn, RepOriginId origin_id);
extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
extern void restoreTwoPhaseData(void);
#endif /* TWOPHASE_H */
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 87ae2cd4df..a46396f2d9 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -21,6 +21,13 @@
#include "storage/sinval.h"
#include "utils/datetime.h"
+/*
+ * Maximum size of Global Transaction ID (including '\0').
+ *
+ * Note that the max value of GIDSIZE must fit in the uint16 gidlen,
+ * specified in TwoPhaseFileHeader.
+ */
+#define GIDSIZE 200
/*
* Xact isolation levels
@@ -156,6 +163,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
#define XACT_XINFO_HAS_TWOPHASE (1U << 4)
#define XACT_XINFO_HAS_ORIGIN (1U << 5)
#define XACT_XINFO_HAS_AE_LOCKS (1U << 6)
+#define XACT_XINFO_HAS_GID (1U << 7)
/*
* Also stored in xinfo, these indicating a variety of additional actions that
@@ -286,7 +294,6 @@ typedef struct xl_xact_abort
typedef struct xl_xact_parsed_commit
{
TimestampTz xact_time;
-
uint32 xinfo;
Oid dbId; /* MyDatabaseId */
@@ -302,16 +309,24 @@ typedef struct xl_xact_parsed_commit
SharedInvalidationMessage *msgs;
TransactionId twophase_xid; /* only for 2PC */
+ char twophase_gid[GIDSIZE]; /* only for 2PC */
+ int nabortrels; /* only for 2PC */
+ RelFileNode *abortnodes; /* only for 2PC */
XLogRecPtr origin_lsn;
TimestampTz origin_timestamp;
} xl_xact_parsed_commit;
+typedef xl_xact_parsed_commit xl_xact_parsed_prepare;
+
typedef struct xl_xact_parsed_abort
{
TimestampTz xact_time;
uint32 xinfo;
+ Oid dbId; /* MyDatabaseId */
+ Oid tsId; /* MyDatabaseTableSpace */
+
int nsubxacts;
TransactionId *subxacts;
@@ -319,6 +334,10 @@ typedef struct xl_xact_parsed_abort
RelFileNode *xnodes;
TransactionId twophase_xid; /* only for 2PC */
+ char twophase_gid[GIDSIZE]; /* only for 2PC */
+
+ XLogRecPtr origin_lsn;
+ TimestampTz origin_timestamp;
} xl_xact_parsed_abort;
@@ -386,12 +405,14 @@ extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
int nmsgs, SharedInvalidationMessage *msgs,
bool relcacheInval, bool forceSync,
int xactflags,
- TransactionId twophase_xid);
+ TransactionId twophase_xid,
+ const char *twophase_gid);
extern XLogRecPtr XactLogAbortRecord(TimestampTz abort_time,
int nsubxacts, TransactionId *subxacts,
int nrels, RelFileNode *rels,
- int xactflags, TransactionId twophase_xid);
+ int xactflags, TransactionId twophase_xid,
+ const char *twophase_gid);
extern void xact_redo(XLogReaderState *record);
/* xactdesc.c */
On 23 March 2018 at 15:26, Simon Riggs <simon@2ndquadrant.com> wrote:
Reviewing 0003-Add-support-for-logging-GID-in-commit-abort-WAL-reco
Looks fine, reworked patch attached
* added changes to xact.h from patch 4 so that this is a whole,
committable patch
* added comments to make abort and commit structs look same

Attached patch is proposed for a separate, early commit as part of this
Looking to commit "logging GID" patch today, if no further objections.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2018-03-27 10:19:37 +0100, Simon Riggs wrote:
On 23 March 2018 at 15:26, Simon Riggs <simon@2ndquadrant.com> wrote:
Reviewing 0003-Add-support-for-logging-GID-in-commit-abort-WAL-reco
Looks fine, reworked patch attached
* added changes to xact.h from patch 4 so that this is a whole,
committable patch
* added comments to make abort and commit structs look same

Attached patch is proposed for a separate, early commit as part of this
Looking to commit "logging GID" patch today, if no further objections.
None here.
Greetings,
Andres Freund
Hi Tomas,
Now, about the interlock implementation - I see you've reused the "lock
group" concept from parallel query. That may make sense, unfortunately
there's about no documentation explaining how it works, what is the
"protocol" etc. There is fairly extensive documentation for "lock
groups" in src/backend/storage/lmgr/README, but while the "decoding
group" code is inspired by it, the code is actually very different.
Compare for example BecomeLockGroupLeader and BecomeDecodeGroupLeader,
and you'll see what I mean.

So I think the first thing we need to do is add proper documentation
(possibly into the same README), explaining how the decode groups work,
how the decodeAbortPending works, etc.
I have added details about this in src/backend/storage/lmgr/README as
suggested by you.
BTW, do we need to do any of this with (wal_level < logical)? I don't
see any quick bail-out in any of the functions in this case, but it
seems like a fairly obvious optimization.
The calls to the LogicalLockTransaction/LogicalUnLockTransaction APIs
will be from inside plugins or the reorderbuffer code paths. Those
will get invoked only in the wal_level logical case, hence I did not
add further checks.
Similarly, can't the logical workers indicate that they need to decode
2PC transactions (or in-progress transactions in general) in some way?
If we knew there are no such workers, that would also allow ignoring the
interlock, no?
These APIs check if the transaction is already committed and cache
that information for further calls, so for regular transactions this
becomes a no-op
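
To illustrate only the caching behaviour described above, here is a minimal standalone sketch. It is not code from the patch set: ExampleTxn, the EX_TXN_* bits and the "committed" field are hypothetical stand-ins for ReorderBufferTXN, its status flags and the real commit-status lookup, and the sketch leaves out the decode-group locking entirely.

```c
#include <stdbool.h>

/* Hypothetical flag bits; the real flag values are not shown in this thread. */
#define EX_TXN_COMMIT   0x01
#define EX_TXN_ROLLBACK 0x02

typedef struct ExampleTxn
{
	int		flags;		/* cached outcome, 0 while still unknown */
	bool	committed;	/* stand-in for the real commit-status lookup */
	int		lookups;	/* number of times the slow path ran */
} ExampleTxn;

/* Returns true if decoding may proceed, false if the transaction aborted. */
bool
example_lock_transaction(ExampleTxn *txn)
{
	/* Fast path: an earlier call already cached the outcome. */
	if (txn->flags & EX_TXN_COMMIT)
		return true;
	if (txn->flags & EX_TXN_ROLLBACK)
		return false;

	/* Slow path: consult the (simulated) status once and remember it. */
	txn->lookups++;
	txn->flags |= txn->committed ? EX_TXN_COMMIT : EX_TXN_ROLLBACK;
	return txn->committed;
}
```

Calling example_lock_transaction() repeatedly on the same ExampleTxn runs the slow path at most once; every later call returns from the cached flag.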
decoding 2PC transactions
=========================

Now, the main topic of the patch. Overall the changes make sense, I
think - it modifies about the same places I touched in the streaming
patch, in similar ways.

The following comments are mostly in random order:
1) test_decoding.c
------------------

The "filter" functions do not follow the naming convention, so I suggest
to rename them like this:

- pg_filter_decode_txn -> pg_decode_filter_txn
- pg_filter_prepare -> pg_decode_filter_prepare_txn

or something like that. Also, looking at those functions (and those same
callbacks in the pgoutput plugin) I wonder if we really need to make
them part of the output plugin API.

I mean, AFAICS their only purpose is to filter 2PC transactions, but I
don't quite see why implementing those checks should be responsibility
of the plugin? I suppose it was done to make test_decoding customizable
(i.e. allow enabling/disabling of decoding 2PC as needed), right?

In that case I suggest make it configurable by plugin-level flags (I see
LogicalDecodingContext already has a enable_twophase), and moving the
checks to a function that is not part of the plugin API. Of course, in
that case the flag needs to be customizable from plugin options, not
just "Does the plugin have all the callbacks?".
The idea behind exposing the API is to allow the plugins to have
selective control over specific 2PC actions. They might want to decode
certain 2PC but not some others. By providing this callback, they can
do that selectively.
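
As a rough illustration of that kind of selective control, the sketch below filters on the GID alone. It is not taken from the patch: the function name, the "nodecode_" prefix and the convention that returning true means "do not decode at PREPARE time" are all assumptions made for the example, and a real callback would take the plugin's context and transaction arguments as well.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical policy: GIDs with this prefix are not decoded at PREPARE. */
#define EX_SKIP_2PC_PREFIX "nodecode_"

/*
 * Return true to filter the prepared transaction out (it would then be
 * decoded at COMMIT PREPARED as a regular transaction), false to decode
 * it at PREPARE time.
 */
bool
example_filter_prepare(const char *gid)
{
	/* Without a usable GID there is nothing to decode as two-phase. */
	if (gid == NULL || gid[0] == '\0')
		return true;

	return strncmp(gid, EX_SKIP_2PC_PREFIX,
				   strlen(EX_SKIP_2PC_PREFIX)) == 0;
}
```

With this policy a GID such as "nodecode_tx1" is filtered out, while "test_prepared#3" is decoded at PREPARE.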
The "twophase-decoding" and "twophase-decode-with-catalog-changes" seem
a bit inconsistently named too (why decode vs. decoding?).
This has been removed in the latest patches altogether. Maybe you were
referring to an older patch.
2) regression tests
-------------------

I really dislike the use of \set to run the same query repeatedly. It
makes analysis of regression failures even more tedious than it already
is. I'd just copy the query to all the places.
They are long-winded queries and IMO made the test file look too
cluttered and verbose.
3) worker.c
-----------

The comment in apply_handle_rollback_prepared_txn says this:
/*
* During logical decoding, on the apply side, it's possible that a
* prepared transaction got aborted while decoding. In that case, we
* stop the decoding and abort the transaction immediately. However
* the ROLLBACK prepared processing still reaches the subscriber. In
* that case it's ok to have a missing gid
*/
if (LookupGXact(commit_data->gid)) { ... }

But is it safe to assume it never happens due to an error? In other
words, is there a way to decide that the GID really aborted? Or, why
should the provider send the rollback at all - surely it could know if
the transaction/GID was sent to subscriber or not, right?
Since we decode in commit WAL order, when we reach the ROLLBACK
PREPARED WAL record, we cannot be sure that we did in fact abort the
decoding midway because of this concurrent rollback. It's possible
that this rollback comes much later as well, when all decoding
backends have successfully prepared it on the subscribers already.
4) twophase.c
-------------

I wonder why the patch modifies the TWOPHASE_MAGIC at all - if it's
meant to identify 2PC files, then why not to keep the value. And if we
really need to modify it, why not to use another random number? By only
adding 1 to the current one, it makes it look like a random bit flip.
We could retain the existing magic here.
5) decode.c
-----------

The changes in DecodeCommit need proper comments.
In DecodeAbort, the "if" includes this condition:
ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid)
which essentially means ROLLBACK PREPARED is translated into "is the
transaction prepared?". Shouldn't the code look at xl_xact_parsed_abort
instead, and make the ReorderBufferTxnIsPrepared an Assert?
This again goes back to the earlier callback, in which we want the
pg_decode_filter_prepare_txn to selectively decide to filter out or
decode some of the 2PC transactions. If we allow that callback, then
we need to consult ReorderBufferTxnIsPrepared to get the same response
for these 2PC transactions.
6) logical.c
------------

I see StartupDecodingContext does this:

    twophase_callbacks = (ctx->callbacks.prepare_cb != NULL) +
        (ctx->callbacks.commit_prepared_cb != NULL) +
        (ctx->callbacks.abort_prepared_cb != NULL);

It seems a bit strange to make arithmetics on bools, I guess. In any
case, I think this should be an ERROR and not a WARNING:

    if (twophase_callbacks != 3 && twophase_callbacks != 0)
        ereport(WARNING,
                (errmsg("Output plugin registered only %d twophase callbacks. "
                        "Twophase transactions will be decoded at commit time.",
                        twophase_callbacks)));

A plugin that implements only a subset of the callbacks seems outright
broken, so let's just fail.
Ok, done.
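
For illustration only, the all-or-nothing rule can also be written without arithmetic on booleans. The sketch below is standalone and not the code in logical.c: ExampleCallbacks, ExampleTwoPhaseState and example_noop are hypothetical names, and the caller is assumed to raise an ERROR on EX_TWOPHASE_INVALID.

```c
#include <stdbool.h>
#include <stddef.h>

typedef void (*example_cb) (void);

/* Hypothetical stand-in for the three two-phase output plugin callbacks. */
typedef struct ExampleCallbacks
{
	example_cb	prepare_cb;
	example_cb	commit_prepared_cb;
	example_cb	abort_prepared_cb;
} ExampleCallbacks;

typedef enum ExampleTwoPhaseState
{
	EX_TWOPHASE_DISABLED,		/* no callbacks: decode at commit time */
	EX_TWOPHASE_ENABLED,		/* all callbacks: decode at PREPARE */
	EX_TWOPHASE_INVALID			/* partial set: caller should error out */
} ExampleTwoPhaseState;

/* Dummy callback used only to populate the struct in examples. */
void
example_noop(void)
{
}

ExampleTwoPhaseState
example_check_twophase_callbacks(const ExampleCallbacks *cb)
{
	bool		has_prepare = (cb->prepare_cb != NULL);
	bool		has_commit = (cb->commit_prepared_cb != NULL);
	bool		has_abort = (cb->abort_prepared_cb != NULL);

	if (has_prepare && has_commit && has_abort)
		return EX_TWOPHASE_ENABLED;
	if (!has_prepare && !has_commit && !has_abort)
		return EX_TWOPHASE_DISABLED;

	/* Only a subset is registered: treat the plugin as broken. */
	return EX_TWOPHASE_INVALID;
}
```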
7) proto.c / worker.c
---------------------

Until now, the 'action' (essentially the first byte of each message)
clearly identified what the message does. So 'C' -> commit, 'I' ->
insert, 'D' -> delete etc. This also means the "handle" methods were
inherently simple, because each handled exactly one particular action
and nothing else.

You've expanded the protocol in a way that suddenly 'C' means either
COMMIT or ROLLBACK, and 'P' means PREPARE, ROLLBACK PREPARED or COMMIT
PREPARED. I don't think that's how the protocol should be extended - if
anything, it's damn confusing and unlike the existing code. You should
define new action, and keep the handlers in worker.c simple.
I thought this grouped regular commit and 2PC transactions properly.
Can look at this again if this style is not favored.
Also, this probably implies LOGICALREP_PROTO_VERSION_NUM increase.
Ok, increased it to 2.
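
To make the "one action byte per message type" suggestion concrete, here is a small standalone sketch. Only 'C' for commit comes from the discussion above; the bytes chosen for the two-phase messages ('P', 'K', 'r') and all names are hypothetical and do not reflect what the patch or the final protocol uses.

```c
#include <stddef.h>

/* One distinct action byte per message type (two-phase values hypothetical). */
typedef enum ExampleAction
{
	EX_ACTION_COMMIT = 'C',
	EX_ACTION_PREPARE = 'P',
	EX_ACTION_COMMIT_PREPARED = 'K',
	EX_ACTION_ROLLBACK_PREPARED = 'r'
} ExampleAction;

/*
 * Map the first byte of a message to the single handler responsible for
 * it. Here the "handler" is just a name; NULL means unknown action.
 */
const char *
example_dispatch(char action)
{
	switch (action)
	{
		case EX_ACTION_COMMIT:
			return "commit";
		case EX_ACTION_PREPARE:
			return "prepare";
		case EX_ACTION_COMMIT_PREPARED:
			return "commit prepared";
		case EX_ACTION_ROLLBACK_PREPARED:
			return "rollback prepared";
		default:
			return NULL;
	}
}
```

Because each byte selects exactly one handler, no handler needs to inspect sub-flags to find out which operation it is applying.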
PFA, latest patch set. The ReorderBufferCommit() handling has been
further simplified now without worrying too much about optimizing for
abort handling at various steps.
This also contains an additional, optional 7th patch with a test
case that solely demonstrates the concurrent abort/logical decoding
interlocking. It introduces a delay (via sleep) while holding the
LogicalTransactionLock. This additional patch might not be considered
for commit, as the delay-based approach is prone to failures on slower
machines.
Simon, 0003-Add-GID-and-replica-origin-to-two-phase-commit-abort.patch
is the exact patch that you had posted for an earlier commit.
Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Attachments:
0001-Cleaning-up-of-flags-in-ReorderBufferTXN-structure.patch (application/octet-stream)
From 6e2b6a874c3a563044a99e3944d33ec65ba711df Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Tue, 27 Mar 2018 19:53:03 +0530
Subject: [PATCH 1/7] Cleaning up of flags in ReorderBufferTXN structure
---
src/backend/replication/logical/reorderbuffer.c | 32 ++++++++++-----------
src/include/replication/reorderbuffer.h | 37 +++++++++++++------------
2 files changed, 36 insertions(+), 33 deletions(-)
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 5ffe638b19..f10d1c2289 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -639,7 +639,7 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
if (prev_first_lsn != InvalidXLogRecPtr)
Assert(prev_first_lsn < cur_txn->first_lsn);
- Assert(!cur_txn->is_known_as_subxact);
+ Assert(!rbtxn_is_known_subxact(cur_txn));
prev_first_lsn = cur_txn->first_lsn;
}
#endif
@@ -657,7 +657,7 @@ ReorderBufferGetOldestTXN(ReorderBuffer *rb)
txn = dlist_head_element(ReorderBufferTXN, node, &rb->toplevel_by_lsn);
- Assert(!txn->is_known_as_subxact);
+ Assert(!rbtxn_is_known_subxact(txn));
Assert(txn->first_lsn != InvalidXLogRecPtr);
return txn;
}
@@ -691,9 +691,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
dlist_push_tail(&txn->subtxns, &subtxn->node);
txn->nsubtxns++;
}
- else if (!subtxn->is_known_as_subxact)
+ else if (!rbtxn_is_known_subxact(subtxn))
{
- subtxn->is_known_as_subxact = true;
+ subtxn->txn_flags |= RBTXN_IS_SUBXACT;
Assert(subtxn->nsubtxns == 0);
/* remove from lsn order list of top-level transactions */
@@ -754,9 +754,9 @@ ReorderBufferCommitChild(ReorderBuffer *rb, TransactionId xid,
subtxn->final_lsn = commit_lsn;
subtxn->end_lsn = end_lsn;
- if (!subtxn->is_known_as_subxact)
+ if (!rbtxn_is_known_subxact(subtxn))
{
- subtxn->is_known_as_subxact = true;
+ subtxn->txn_flags |= RBTXN_IS_SUBXACT;
Assert(subtxn->nsubtxns == 0);
/* remove from lsn order list of top-level transactions */
@@ -865,7 +865,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
ReorderBufferChange *cur_change;
- if (txn->serialized)
+ if (rbtxn_is_serialized(txn))
{
/* serialize remaining changes */
ReorderBufferSerializeTXN(rb, txn);
@@ -894,7 +894,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
ReorderBufferChange *cur_change;
- if (cur_txn->serialized)
+ if (rbtxn_is_serialized(cur_txn))
{
/* serialize remaining changes */
ReorderBufferSerializeTXN(rb, cur_txn);
@@ -1060,7 +1060,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
* they originally were happening inside another subtxn, so we won't
* ever recurse more than one level deep here.
*/
- Assert(subtxn->is_known_as_subxact);
+ Assert(rbtxn_is_known_subxact(subtxn));
Assert(subtxn->nsubtxns == 0);
ReorderBufferCleanupTXN(rb, subtxn);
@@ -1099,7 +1099,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
/*
* Remove TXN from its containing list.
*
- * Note: if txn->is_known_as_subxact, we are deleting the TXN from its
+ * Note: if txn is known as subxact, we are deleting the TXN from its
* parent's list of known subxacts; this leaves the parent's nsubxacts
* count too high, but we don't care. Otherwise, we are deleting the TXN
* from the LSN-ordered list of toplevel TXNs.
@@ -1114,7 +1114,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
Assert(found);
/* remove entries spilled to disk */
- if (txn->serialized)
+ if (rbtxn_is_serialized(txn))
ReorderBufferRestoreCleanup(rb, txn);
/* deallocate */
@@ -1131,7 +1131,7 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
dlist_iter iter;
HASHCTL hash_ctl;
- if (!txn->has_catalog_changes || dlist_is_empty(&txn->tuplecids))
+ if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
return;
memset(&hash_ctl, 0, sizeof(hash_ctl));
@@ -1711,7 +1711,7 @@ ReorderBufferAbortOld(ReorderBuffer *rb, TransactionId oldestRunningXid)
* final_lsn to that of their last change; this causes
* ReorderBufferRestoreCleanup to do the right thing.
*/
- if (txn->serialized && txn->final_lsn == 0)
+ if (rbtxn_is_serialized(txn) && txn->final_lsn == 0)
{
ReorderBufferChange *last =
dlist_tail_element(ReorderBufferChange, node, &txn->changes);
@@ -1957,7 +1957,7 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
- txn->has_catalog_changes = true;
+ txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
}
/*
@@ -1974,7 +1974,7 @@ ReorderBufferXidHasCatalogChanges(ReorderBuffer *rb, TransactionId xid)
if (txn == NULL)
return false;
- return txn->has_catalog_changes;
+ return rbtxn_has_catalog_changes(txn);
}
/*
@@ -2114,7 +2114,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
Assert(spilled == txn->nentries_mem);
Assert(dlist_is_empty(&txn->changes));
txn->nentries_mem = 0;
- txn->serialized = true;
+ txn->txn_flags |= RBTXN_IS_SERIALIZED;
if (fd != -1)
CloseTransientFile(fd);
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index aa430c843c..177ef98e43 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -137,21 +137,33 @@ typedef struct ReorderBufferChange
dlist_node node;
} ReorderBufferChange;
+/* ReorderBufferTXN flags */
+#define RBTXN_HAS_CATALOG_CHANGES 0x0001
+#define RBTXN_IS_SUBXACT 0x0002
+#define RBTXN_IS_SERIALIZED 0x0004
+
+/* does the txn have catalog changes */
+#define rbtxn_has_catalog_changes(txn) (txn->txn_flags & RBTXN_HAS_CATALOG_CHANGES)
+/* is the txn known as a subxact? */
+#define rbtxn_is_known_subxact(txn) (txn->txn_flags & RBTXN_IS_SUBXACT)
+/*
+ * Has this transaction been spilled to disk? It's not always possible to
+ * deduce that fact by comparing nentries with nentries_mem, because e.g.
+ * subtransactions of a large transaction might get serialized together
+ * with the parent - if they're restored to memory they'd have
+ * nentries_mem == nentries.
+ */
+#define rbtxn_is_serialized(txn) (txn->txn_flags & RBTXN_IS_SERIALIZED)
+
typedef struct ReorderBufferTXN
{
+ int txn_flags;
+
/*
* The transactions transaction id, can be a toplevel or sub xid.
*/
TransactionId xid;
- /* did the TX have catalog changes */
- bool has_catalog_changes;
-
- /*
- * Do we know this is a subxact?
- */
- bool is_known_as_subxact;
-
/*
* LSN of the first data carrying, WAL record with knowledge about this
* xid. This is allowed to *not* be first record adorned with this xid, if
@@ -214,15 +226,6 @@ typedef struct ReorderBufferTXN
*/
uint64 nentries_mem;
- /*
- * Has this transaction been spilled to disk? It's not always possible to
- * deduce that fact by comparing nentries with nentries_mem, because e.g.
- * subtransactions of a large transaction might get serialized together
- * with the parent - if they're restored to memory they'd have
- * nentries_mem == nentries.
- */
- bool serialized;
-
/*
* List of ReorderBufferChange structs, including new Snapshots and new
* CommandIds
--
2.14.3 (Apple Git-98)
0002-Introduce-LogicalLockTransaction-LogicalUnlockTransa.patch (application/octet-stream)
From f4ac105fbb654d625ab93e50021cf6b577cd23ee Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Wed, 28 Mar 2018 13:59:26 +0530
Subject: [PATCH 2/7] Introduce LogicalLockTransaction/LogicalUnlockTransaction
APIs
When a transaction aborts, its changes are considered unnecessary
for other transactions. That means the changes may be either cleaned
up by vacuum or removed from HOT chains (thus made inaccessible
through indexes), and there may be other such consequences.
When decoding committed transactions this is not an issue, and we
never decode transactions that abort before the decoding starts
(where decoding means passing it to ReorderBufferCommit).
But for in-progress transactions - for example when decoding prepared
transactions on PREPARE (and not COMMIT PREPARED as before), this
may cause failures when the output plugin consults catalogs (both
system and user-defined).
To prevent aborts concurrent with plugins accessing catalogs, we
introduce an API the output plugins are required to use (when
decoding in-progress transactions only).
Before accessing any catalogs, output plugins are required to call
LogicalLockTransaction and then release it using
LogicalUnlockTransaction. Implementation is via adding support for
decoding groups. Use LockHashPartitionLockByProc on the group leader
to get the LWLock protecting these fields. For prepared and uncommitted
transactions, decoding backends working on the same XID will link
themselves up to the corresponding PGPROC entry (decodeGroupLeader).
They will remove themselves when they are done decoding.
If the prepared or uncommitted transaction decides to abort, then
the decodeGroupLeader will set the decodeAbortPending flag allowing
the decodeGroupMembers to abort their decoding appropriately
---
src/backend/replication/logical/logical.c | 161 +++++++++++++++++
src/backend/storage/ipc/procarray.c | 39 +++++
src/backend/storage/lmgr/README | 39 +++++
src/backend/storage/lmgr/proc.c | 282 ++++++++++++++++++++++++++++++
src/include/replication/logical.h | 2 +
src/include/replication/reorderbuffer.h | 15 ++
src/include/storage/proc.h | 25 +++
src/include/storage/procarray.h | 1 +
8 files changed, 564 insertions(+)
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 3d8ad7ddf8..a9b043be88 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1017,3 +1017,164 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
SpinLockRelease(&MyReplicationSlot->mutex);
}
}
+
+bool
+LogicalLockTransaction(ReorderBufferTXN *txn)
+{
+ bool ok = false;
+
+ /*
+ * Prepared transactions and uncommitted transactions
+ * that have modified catalogs need to interlock with
+ * concurrent rollback to ensure that there are no
+ * issues while decoding
+ */
+
+ if (!rbtxn_has_catalog_changes(txn))
+ return true;
+
+ /*
+ * Is it a prepared txn? Similar checks for uncommitted
+ * transactions when we start supporting them
+ */
+ if (!rbtxn_prepared(txn))
+ return true;
+
+ /* check cached status */
+ if (rbtxn_commit(txn))
+ return true;
+ if (rbtxn_rollback(txn))
+ return false;
+
+ /*
+ * Find the PROC that is handling this XID and add ourself as a
+ * decodeGroupMember
+ */
+ if (MyProc->decodeGroupLeader == NULL)
+ {
+ PGPROC *proc = BecomeDecodeGroupLeader(txn->xid, rbtxn_prepared(txn));
+
+ /*
+ * If decodeGroupLeader is NULL, then the only possibility
+ * is that the transaction completed and went away
+ */
+ if (proc == NULL)
+ {
+ Assert(!TransactionIdIsInProgress(txn->xid));
+ if (TransactionIdDidCommit(txn->xid))
+ {
+ txn->txn_flags |= RBTXN_COMMIT;
+ return true;
+ }
+ else
+ {
+ txn->txn_flags |= RBTXN_ROLLBACK;
+ return false;
+ }
+ }
+
+ /* Add ourself as a decodeGroupMember */
+ if (!BecomeDecodeGroupMember(proc, proc->pid, rbtxn_prepared(txn)))
+ {
+ Assert(!TransactionIdIsInProgress(txn->xid));
+ if (TransactionIdDidCommit(txn->xid))
+ {
+ txn->txn_flags |= RBTXN_COMMIT;
+ return true;
+ }
+ else
+ {
+ txn->txn_flags |= RBTXN_ROLLBACK;
+ return false;
+ }
+ }
+ }
+
+ /*
+ * If we were able to add ourself, then Abort processing will
+ * interlock with us. Check if the transaction is still around
+ */
+ Assert(MyProc->decodeGroupLeader);
+
+ if (MyProc->decodeGroupLeader)
+ {
+ LWLock *leader_lwlock;
+
+ leader_lwlock = LockHashPartitionLockByProc(MyProc->decodeGroupLeader);
+ LWLockAcquire(leader_lwlock, LW_SHARED);
+ if (MyProc->decodeAbortPending)
+ {
+ /*
+ * Remove ourself from the decodeGroupMembership and return
+ * false so that the decoding plugin also initiates abort
+ * processing
+ */
+ LWLockRelease(leader_lwlock);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ RemoveDecodeGroupMemberLocked(MyProc->decodeGroupLeader);
+ /* reset the bool to let the leader know that we are going away */
+ MyProc->decodeAbortPending = false;
+ MyProc->decodeLocked = false;
+ txn->txn_flags |= RBTXN_ROLLBACK;
+ ok = false;
+ }
+ else
+ {
+ MyProc->decodeLocked = true;
+ ok = true;
+ }
+ LWLockRelease(leader_lwlock);
+ }
+ else
+ return false;
+
+ return ok;
+}
+
+void
+LogicalUnlockTransaction(ReorderBufferTXN *txn)
+{
+ LWLock *leader_lwlock;
+
+ /*
+ * Prepared transactions and uncommitted transactions
+ * that have modified catalogs need to interlock with
+ * concurrent rollback to ensure that there are no
+ * issues while decoding
+ */
+
+ if (!rbtxn_has_catalog_changes(txn))
+ return;
+
+ /*
+ * Is it a prepared txn? Similar checks for uncommitted
+ * transactions when we start supporting them
+ */
+ if (!rbtxn_prepared(txn))
+ return;
+
+ /* check cached status */
+ if (rbtxn_commit(txn))
+ return;
+ if (rbtxn_rollback(txn))
+ return;
+
+ Assert(MyProc->decodeGroupLeader);
+ leader_lwlock = LockHashPartitionLockByProc(MyProc->decodeGroupLeader);
+ LWLockAcquire(leader_lwlock, LW_SHARED);
+ if (MyProc->decodeAbortPending)
+ {
+ /*
+ * Remove ourself from the decodeGroupMembership
+ */
+ LWLockRelease(leader_lwlock);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ RemoveDecodeGroupMemberLocked(MyProc->decodeGroupLeader);
+ /* reset the bool to let the leader know that we are going away */
+ MyProc->decodeAbortPending = false;
+ txn->txn_flags |= RBTXN_ROLLBACK;
+ }
+ MyProc->decodeLocked = false;
+ LWLockRelease(leader_lwlock);
+ return;
+}
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index afe1c03aa3..2be2910207 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -2432,6 +2432,45 @@ BackendXidGetPid(TransactionId xid)
return result;
}
+/*
+ * BackendXidGetProc -- get a backend's PGPROC given its XID
+ *
+ * Note that it is up to the caller to be sure that the question
+ * remains meaningful for long enough for the answer to be used ...
+ *
+ * Only main transaction Ids are considered.
+ *
+ */
+PGPROC *
+BackendXidGetProc(TransactionId xid)
+{
+ PGPROC *result = NULL;
+ ProcArrayStruct *arrayP = procArray;
+ int index;
+
+ if (xid == InvalidTransactionId) /* never match invalid xid */
+ return 0;
+
+ LWLockAcquire(ProcArrayLock, LW_SHARED);
+
+ for (index = 0; index < arrayP->numProcs; index++)
+ {
+ int pgprocno = arrayP->pgprocnos[index];
+ PGPROC *proc = &allProcs[pgprocno];
+ volatile PGXACT *pgxact = &allPgXact[pgprocno];
+
+ if (pgxact->xid == xid)
+ {
+ result = proc;
+ break;
+ }
+ }
+
+ LWLockRelease(ProcArrayLock);
+
+ return result;
+}
+
/*
* IsBackendPid -- is a given pid a running backend
*
diff --git a/src/backend/storage/lmgr/README b/src/backend/storage/lmgr/README
index 56b0a12a3e..9742a348cf 100644
--- a/src/backend/storage/lmgr/README
+++ b/src/backend/storage/lmgr/README
@@ -679,6 +679,45 @@ worker, and the worker fails to join the lock group unless the given PGPROC
still has the same PID and is still a lock group leader. We assume that
PIDs are not recycled quickly enough for this interlock to fail.
+Decode Group Locking
+--------------------
+
+We use an infrastructure which is very similar to the above group locking
+of parallel processes to create a group of backends that are performing
+logical decoding of an uncommitted or a prepared transaction.
+
+Decode Group locking adds five new members to each PGPROC:
+decodeGroupLeader, decodeGroupMembers, decodeGroupLink, decodeLocked and
+decodeAbortPending. A PGPROC's decodeGroupLeader is NULL for processes
+not involved in logical decoding. When a process wants to decode an
+uncommitted or prepared transaction, it looks up the PGPROC structure
+associated with that transaction id and makes that PGPROC its
+decodeGroupLeader. The decodeGroupMembers field
+is only used in the leader; it is a list of the member PGPROCs of the
+decode group (the leader and all backends decoding this transaction id).
+The decodeGroupLink field is the list link for this list. The decoding
+backend marks itself as decodeLocked while it is accessing catalog
+metadata for its decoding requirements via the
+LogicalLockTransaction API. It clears the flag again via the
+LogicalUnlockTransaction API. Meanwhile, if this uncommitted or
+prepared transaction decides to abort, the PGPROC structure
+corresponding to it sets decodeAbortPending on itself and also
+on all the decodeGroupMembers entries. The decodeGroupMembers entries
+stop decoding this aborted transaction and exit. Once all the
+decoding backends have exited, the aborting transaction goes ahead
+with its regular processing.
+
+All five of these fields are considered to be protected by a lock manager
+partition lock. The partition lock that protects these fields within a given
+decode group is chosen by taking the leader's pgprocno modulo the number of lock
+manager partitions. Holding this single lock allows safe manipulation of the
+decodeGroupMembers list for the decode group.
+
+The decodeGroupLeader's PGPROC and its PID are accessible to each
+decoding backend. A decoding backend fails to join the decode
+group unless the given PGPROC still has the same PID and is still
+a decode group leader. We assume that PIDs are not recycled quickly
+enough for this interlock to fail.
User Locks (Advisory Locks)
---------------------------
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 6f30e082b2..26d35c7807 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -267,6 +267,11 @@ InitProcGlobal(void)
/* Initialize lockGroupMembers list. */
dlist_init(&procs[i].lockGroupMembers);
+
+ /* Initialize decodeGroupMembers list. */
+ dlist_init(&procs[i].decodeGroupMembers);
+ procs[i].decodeAbortPending = false;
+ procs[i].decodeLocked = false;
}
/*
@@ -406,6 +411,12 @@ InitProcess(void)
Assert(MyProc->lockGroupLeader == NULL);
Assert(dlist_is_empty(&MyProc->lockGroupMembers));
+ /* Check that group decode fields are in a proper initial state. */
+ Assert(MyProc->decodeGroupLeader == NULL);
+ Assert(dlist_is_empty(&MyProc->decodeGroupMembers));
+ MyProc->decodeAbortPending = false;
+ MyProc->decodeLocked = false;
+
/* Initialize wait event information. */
MyProc->wait_event_info = 0;
@@ -581,6 +592,12 @@ InitAuxiliaryProcess(void)
Assert(MyProc->lockGroupLeader == NULL);
Assert(dlist_is_empty(&MyProc->lockGroupMembers));
+ /* Check that group decode fields are in a proper initial state. */
+ Assert(MyProc->decodeGroupLeader == NULL);
+ Assert(dlist_is_empty(&MyProc->decodeGroupMembers));
+ MyProc->decodeAbortPending = false;
+ MyProc->decodeLocked = false;
+
/*
* We might be reusing a semaphore that belonged to a failed process. So
* be careful and reinitialize its value here. (This is not strictly
@@ -1887,3 +1904,268 @@ BecomeLockGroupMember(PGPROC *leader, int pid)
return ok;
}
+
+/*
+ * BecomeDecodeGroupLeader - designate process as decode group leader
+ *
+ * Once this function has returned, other processes can join the decode group
+ * by calling BecomeDecodeGroupMember.
+ */
+PGPROC *
+BecomeDecodeGroupLeader(TransactionId xid, bool is_prepared)
+{
+ PGPROC *proc = NULL;
+ int pid;
+ LWLock *leader_lwlock;
+
+ Assert(xid != InvalidTransactionId);
+
+
+ proc = BackendXidGetProc(xid);
+ if (proc)
+ pid = proc->pid;
+
+ /*
+ * This proc will become decodeGroupLeader if it's
+ * not already
+ */
+ if (proc && proc->decodeGroupLeader != proc)
+ {
+ volatile PGXACT *pgxact;
+ /* Create single-member group, containing this proc. */
+ leader_lwlock = LockHashPartitionLockByProc(proc);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ /* recheck we are still the same */
+ pgxact = &ProcGlobal->allPgXact[proc->pgprocno];
+ if (proc->pid == pid && pgxact->xid == xid)
+ {
+ if (is_prepared)
+ Assert(pid == 0);
+ /* recheck if someone else did not already assign us */
+ if (proc->decodeGroupLeader != proc)
+ {
+ /* We had better not be a follower. */
+ Assert(proc->decodeGroupLeader == NULL);
+ proc->decodeGroupLeader = proc;
+ dlist_push_head(&proc->decodeGroupMembers,
+ &proc->decodeGroupLink);
+ }
+ }
+ else
+ {
+ /* proc entry is gone */
+ proc = NULL;
+ }
+ LWLockRelease(leader_lwlock);
+ }
+
+ elog(DEBUG1, "became group leader (%p)", proc);
+ return proc;
+}
+
+/*
+ * BecomeDecodeGroupMember - designate process as decode group member
+ *
+ * This is pretty straightforward except for the possibility that the leader
+ * whose group we're trying to join might exit before we manage to do so;
+ * and the PGPROC might get recycled for an unrelated process. To avoid
+ * that, we require the caller to pass the PID of the intended PGPROC as
+ * an interlock. Returns true if we successfully join the intended decode
+ * group, and false if not.
+ */
+bool
+BecomeDecodeGroupMember(PGPROC *leader, int pid, bool is_prepared)
+{
+ LWLock *leader_lwlock;
+ bool ok = false;
+
+ /* Group leader can't become member of group */
+ Assert(MyProc != leader);
+
+ /* Can't already be a member of a group */
+ Assert(MyProc->decodeGroupLeader == NULL);
+
+ /* PID must be valid OR this is a prepared transaction. */
+ Assert(pid != 0 || is_prepared);
+
+ /*
+ * Get lock protecting the group fields. Note LockHashPartitionLockByProc
+ * accesses leader->pgprocno in a PGPROC that might be free. This is safe
+ * because all PGPROCs' pgprocno fields are set during shared memory
+ * initialization and never change thereafter; so we will acquire the
+ * correct lock even if the leader PGPROC is in process of being recycled.
+ */
+ leader_lwlock = LockHashPartitionLockByProc(leader);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+
+ /* Is this the leader we're looking for? */
+ if (leader->pid == pid && leader->decodeGroupLeader == leader)
+ {
+ if (is_prepared)
+ Assert(pid == 0);
+ /* is the leader going away? */
+ if (leader->decodeAbortPending)
+ ok = false;
+ else
+ {
+ /* OK, join the group */
+ ok = true;
+ MyProc->decodeGroupLeader = leader;
+ dlist_push_tail(&leader->decodeGroupMembers, &MyProc->decodeGroupLink);
+ }
+ }
+ else
+ MyProc->decodeGroupLeader = NULL;
+ LWLockRelease(leader_lwlock);
+
+ elog(DEBUG1, "became group member (%p) to (%p)", MyProc, leader);
+ return ok;
+}
+
+/*
+ * Remove the calling backend (MyProc) from the decode group of the
+ * given decodeGroupLeader.
+ * Acquires and releases the leader's partition lock itself.
+ */
+void
+RemoveDecodeGroupMember(PGPROC *leader)
+{
+ LWLock *leader_lwlock;
+
+ leader_lwlock = LockHashPartitionLockByProc(leader);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ RemoveDecodeGroupMemberLocked(leader);
+ LWLockRelease(leader_lwlock);
+
+ return;
+}
+
+/*
+ * Remove the calling backend (MyProc) from the decode group of the
+ * given decodeGroupLeader.
+ * Assumes that the caller is already holding the leader's partition lock.
+ */
+void
+RemoveDecodeGroupMemberLocked(PGPROC *leader)
+{
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ dlist_delete(&MyProc->decodeGroupLink);
+ /* leader links to itself, so never empty */
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ MyProc->decodeGroupLeader = NULL;
+ elog(DEBUG1, "removed group member (%p) from (%p)", MyProc, leader);
+
+ return;
+}
+
+/*
+ * Indicate to all decodeGroupMembers that this transaction is
+ * going away.
+ *
+ * Wait for all decodeGroupMembers to ack back before returning
+ * from here but only in case of aborts.
+ *
+ * This function should be called *after* the proc has been
+ * removed from the procArray.
+ *
+ * If the transaction is committing, it's ok for the
+ * decoders to continue merrily. When a decoder next tries to
+ * lock this proc, it won't find it; it then checks the
+ * transaction status and caches the commit status for future
+ * calls to LogicalLockTransaction.
+ */
+void
+LogicalDecodeRemoveTransaction(PGPROC *leader, bool isCommit)
+{
+ LWLock *leader_lwlock;
+ dlist_mutable_iter change_i;
+ dlist_iter iter;
+ PGPROC *proc;
+ bool do_wait;
+
+ leader_lwlock = LockHashPartitionLockByProc(leader);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ /* mark ourself as aborting */
+ if (!isCommit)
+ leader->decodeAbortPending = true;
+
+ if (leader->decodeGroupLeader == NULL)
+ {
+ Assert(dlist_is_empty(&leader->decodeGroupMembers));
+ LWLockRelease(leader_lwlock);
+ return;
+ }
+
+recheck:
+ do_wait = false;
+ Assert(leader->decodeGroupLeader == leader);
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ if (!isCommit)
+ {
+ dlist_foreach(iter, &leader->decodeGroupMembers)
+ {
+ proc = dlist_container(PGPROC, decodeGroupLink, iter.cur);
+ /* mark the proc to indicate abort is pending */
+ if (proc == leader)
+ continue;
+ if (!proc->decodeAbortPending)
+ {
+ proc->decodeAbortPending = true;
+ elog(DEBUG1, "marking group member (%p) from (%p) for abort",
+ proc, leader);
+ }
+ /* if the proc is currently locked, wait */
+ if (proc->decodeLocked)
+ do_wait = true;
+ }
+
+ if (do_wait)
+ {
+ int rc;
+ LWLockRelease(leader_lwlock);
+
+ elog(LOG, "Waiting for backends to abort decoding");
+ /*
+ * Wait on our latch to allow decodeGroupMembers to
+ * go away soon
+ */
+ rc = WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+ 100L,
+ WAIT_EVENT_PG_SLEEP);
+ ResetLatch(MyLatch);
+
+ /* emergency bailout if postmaster has died */
+ if (rc & WL_POSTMASTER_DEATH)
+ proc_exit(1);
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Recheck decodeGroupMembers */
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ goto recheck;
+ }
+ }
+
+ /*
+ * All backends exited cleanly in case of aborts above,
+ * remove decodeGroupMembers now for both commit/abort cases
+ */
+ Assert(leader->decodeGroupLeader == leader);
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ dlist_foreach_modify(change_i, &leader->decodeGroupMembers)
+ {
+ proc = dlist_container(PGPROC, decodeGroupLink, change_i.cur);
+ Assert(!proc->decodeLocked);
+ dlist_delete(&proc->decodeGroupLink);
+ elog(DEBUG1, "deleting group member (%p) from (%p)",
+ proc, leader);
+ proc->decodeGroupLeader = NULL;
+ }
+ Assert(dlist_is_empty(&leader->decodeGroupMembers));
+ leader->decodeGroupLeader = NULL;
+ leader->decodeAbortPending = false;
+ LWLockRelease(leader_lwlock);
+
+ return;
+}
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 619c5f4d73..63b14367f0 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -117,6 +117,8 @@ extern void LogicalIncreaseXminForSlot(XLogRecPtr lsn, TransactionId xmin);
extern void LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn,
XLogRecPtr restart_lsn);
extern void LogicalConfirmReceivedLocation(XLogRecPtr lsn);
+extern bool LogicalLockTransaction(ReorderBufferTXN *txn);
+extern void LogicalUnlockTransaction(ReorderBufferTXN *txn);
extern bool filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id);
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 177ef98e43..385bb486bb 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -141,6 +141,11 @@ typedef struct ReorderBufferChange
#define RBTXN_HAS_CATALOG_CHANGES 0x0001
#define RBTXN_IS_SUBXACT 0x0002
#define RBTXN_IS_SERIALIZED 0x0004
+#define RBTXN_PREPARE 0x0008
+#define RBTXN_COMMIT_PREPARED 0x0010
+#define RBTXN_ROLLBACK_PREPARED 0x0020
+#define RBTXN_COMMIT 0x0040
+#define RBTXN_ROLLBACK 0x0080
/* does the txn have catalog changes */
#define rbtxn_has_catalog_changes(txn) (txn->txn_flags & RBTXN_HAS_CATALOG_CHANGES)
@@ -154,6 +159,16 @@ typedef struct ReorderBufferChange
* nentries_mem == nentries.
*/
#define rbtxn_is_serialized(txn) (txn->txn_flags & RBTXN_IS_SERIALIZED)
+/* is this txn prepared? */
+#define rbtxn_prepared(txn) (txn->txn_flags & RBTXN_PREPARE)
+/* was this prepared txn committed in the meanwhile? */
+#define rbtxn_commit_prepared(txn) (txn->txn_flags & RBTXN_COMMIT_PREPARED)
+/* was this prepared txn aborted in the meanwhile? */
+#define rbtxn_rollback_prepared(txn) (txn->txn_flags & RBTXN_ROLLBACK_PREPARED)
+/* was this txn committed in the meanwhile? */
+#define rbtxn_commit(txn) (txn->txn_flags & RBTXN_COMMIT)
+/* was this txn aborted in the meanwhile? */
+#define rbtxn_rollback(txn) (txn->txn_flags & RBTXN_ROLLBACK)
typedef struct ReorderBufferTXN
{
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 5c19a61dcf..fdfc582874 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -200,6 +200,26 @@ struct PGPROC
PGPROC *lockGroupLeader; /* lock group leader, if I'm a member */
dlist_head lockGroupMembers; /* list of members, if I'm a leader */
dlist_node lockGroupLink; /* my member link, if I'm a member */
+
+ /*
+ * Support for decoding groups. Use LockHashPartitionLockByProc on the group
+ * leader to get the LWLock protecting these fields.
+ *
+ * For prepared and uncommitted transactions, decoding backends working on
+ * the same XID will link themselves up to the corresponding PGPROC
+ * entry (decodeGroupLeader).
+ *
+ * They will remove themselves when they are done decoding.
+ *
+ * If the prepared or uncommitted transaction decides to abort, then
+ * the decodeGroupLeader will set the decodeAbortPending flag allowing
+ * the decodeGroupMembers to abort their decoding appropriately
+ */
+ PGPROC *decodeGroupLeader; /* decode group leader, if I'm a member */
+ dlist_head decodeGroupMembers; /* list of members, if I'm a leader */
+ dlist_node decodeGroupLink; /* my member link, if I'm a member */
+ bool decodeLocked; /* is it currently locked by this proc? */
+ bool decodeAbortPending; /* is the decode group leader aborting? */
};
/* NOTE: "typedef struct PGPROC PGPROC" appears in storage/lock.h. */
@@ -326,5 +346,10 @@ extern PGPROC *AuxiliaryPidGetProc(int pid);
extern void BecomeLockGroupLeader(void);
extern bool BecomeLockGroupMember(PGPROC *leader, int pid);
+extern PGPROC *BecomeDecodeGroupLeader(TransactionId xid, bool is_prepared);
+extern bool BecomeDecodeGroupMember(PGPROC *leader, int pid, bool is_prepared);
+extern void RemoveDecodeGroupMember(PGPROC *leader);
+extern void RemoveDecodeGroupMemberLocked(PGPROC *leader);
+extern void LogicalDecodeRemoveTransaction(PGPROC *leader, bool isCommit);
#endif /* PROC_H */
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 75bab2985f..776de2470e 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -97,6 +97,7 @@ extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids
extern PGPROC *BackendPidGetProc(int pid);
extern PGPROC *BackendPidGetProcWithLock(int pid);
+extern PGPROC *BackendXidGetProc(TransactionId xid);
extern int BackendXidGetPid(TransactionId xid);
extern bool IsBackendPid(int pid);
--
2.14.3 (Apple Git-98)
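The abort handshake between a decode group leader and its members, as described in the README hunk of the patch above, can be pictured with a small single-threaded model. This is only a sketch under stated assumptions: `ToyProc`, `toy_lock`, `toy_unlock` and `toy_announce_abort` are hypothetical names that do not appear in the patch, the member list is a plain array instead of a dlist, and all LWLock and latch handling is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the decode-related PGPROC fields in the patch. */
typedef struct ToyProc
{
	struct ToyProc *decodeGroupLeader;	/* leader of my decode group, or NULL */
	bool		decodeLocked;		/* am I inside a lock/unlock window? */
	bool		decodeAbortPending;	/* has the leader announced an abort? */
} ToyProc;

/* Member side: try to enter a catalog-access window. */
bool
toy_lock(ToyProc *me)
{
	if (me->decodeGroupLeader == NULL || me->decodeAbortPending)
		return false;			/* not in a group, or the txn is aborting */
	me->decodeLocked = true;
	return true;
}

/* Member side: leave the window, and leave the group if an abort is pending. */
void
toy_unlock(ToyProc *me)
{
	if (me->decodeAbortPending)
	{
		me->decodeGroupLeader = NULL;
		me->decodeAbortPending = false;
	}
	me->decodeLocked = false;
}

/*
 * Leader side: flag every remaining member. Returns true if some member is
 * still inside a locked window, meaning the leader has to wait and recheck.
 */
bool
toy_announce_abort(ToyProc *leader, ToyProc **members, size_t nmembers)
{
	bool		must_wait = false;
	size_t		i;

	leader->decodeAbortPending = true;
	for (i = 0; i < nmembers; i++)
	{
		ToyProc    *m = members[i];

		if (m->decodeGroupLeader != leader)
			continue;			/* this member has already left */
		m->decodeAbortPending = true;
		if (m->decodeLocked)
			must_wait = true;
	}
	return must_wait;
}
```

In this model a member that is inside a locked window forces the first `toy_announce_abort` call to return true; after the member calls `toy_unlock`, a second call returns false and the leader could proceed with the abort.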
Attachment: 0003-Add-GID-and-replica-origin-to-two-phase-commit-abort.patch (application/octet-stream)
From 11b627731b3ada50c2d7558b1595e1289b545f9c Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Wed, 28 Mar 2018 14:11:55 +0530
Subject: [PATCH 3/7] Add GID and replica origin to two-phase commit/abort WAL
records
Including GID in commit/abort WAL records of two-phase transactions
allows logical decoding to forward the same GID to subscribers across
restarts. This is important as the GIDs may encode information for
an external transaction manager, or other application-specific data.
Replica origin enables tracking progress for two-phase transactions,
to avoid having to resend PREPARE TRANSACTION from the upstream.
---
src/backend/access/rmgrdesc/xactdesc.c | 39 ++++++++++++
src/backend/access/transam/twophase.c | 105 ++++++++++++++++++++++++++++-----
src/backend/access/transam/xact.c | 78 ++++++++++++++++++++++--
src/include/access/twophase.h | 5 +-
src/include/access/xact.h | 27 ++++++++-
5 files changed, 230 insertions(+), 24 deletions(-)
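As a rough illustration of the record layout this patch introduces, the GID is written as a NUL-terminated string followed by zero padding up to the next maximum-alignment boundary, and the parser advances by the same padded length. The sketch below is not code from the patch: `toy_write_gid`, `toy_read_gid` and the `TOY_*` macros are hypothetical, and the alignment of 8 bytes is an assumption standing in for MAXIMUM_ALIGNOF.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_ALIGNOF 8			/* assumption: stand-in for MAXIMUM_ALIGNOF */
#define TOY_MAXALIGN(len) \
	(((size_t) (len) + (TOY_ALIGNOF - 1)) & ~((size_t) (TOY_ALIGNOF - 1)))
#define TOY_GIDSIZE 200			/* mirrors GIDSIZE, including the '\0' */

/*
 * Copy gid, its terminating '\0' and zero padding into buf.
 * Returns the padded number of bytes written, or 0 if it does not fit.
 */
size_t
toy_write_gid(char *buf, size_t bufsize, const char *gid)
{
	size_t		gidlen = strlen(gid) + 1;	/* include '\0' */
	size_t		padded = TOY_MAXALIGN(gidlen);

	if (gidlen > TOY_GIDSIZE || padded > bufsize)
		return 0;
	memcpy(buf, gid, gidlen);
	memset(buf + gidlen, 0, padded - gidlen);
	return padded;
}

/*
 * Read a GID back from data into gid_out.
 * Returns the number of bytes the caller should advance its data pointer.
 */
size_t
toy_read_gid(const char *data, char *gid_out, size_t outsize)
{
	size_t		gidlen = strlen(data) + 1;	/* include '\0' */

	assert(gidlen <= outsize);
	memcpy(gid_out, data, gidlen);
	return TOY_MAXALIGN(gidlen);
}
```

Because writer and reader both round the length up in the same way, whatever follows the GID in the record (the origin information, in the patch) starts at an aligned offset.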
diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index e5eef9ea43..b3e2fc3036 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -102,6 +102,14 @@ ParseCommitRecord(uint8 info, xl_xact_commit *xlrec, xl_xact_parsed_commit *pars
parsed->twophase_xid = xl_twophase->xid;
data += sizeof(xl_xact_twophase);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_GID)
+ {
+ int gidlen;
+ strcpy(parsed->twophase_gid, data);
+ gidlen = strlen(parsed->twophase_gid) + 1;
+ data += MAXALIGN(gidlen);
+ }
}
if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
@@ -139,6 +147,16 @@ ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_abort *parsed)
data += sizeof(xl_xact_xinfo);
}
+ if (parsed->xinfo & XACT_XINFO_HAS_DBINFO)
+ {
+ xl_xact_dbinfo *xl_dbinfo = (xl_xact_dbinfo *) data;
+
+ parsed->dbId = xl_dbinfo->dbId;
+ parsed->tsId = xl_dbinfo->tsId;
+
+ data += sizeof(xl_xact_dbinfo);
+ }
+
if (parsed->xinfo & XACT_XINFO_HAS_SUBXACTS)
{
xl_xact_subxacts *xl_subxacts = (xl_xact_subxacts *) data;
@@ -168,6 +186,27 @@ ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_abort *parsed)
parsed->twophase_xid = xl_twophase->xid;
data += sizeof(xl_xact_twophase);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_GID)
+ {
+ int gidlen;
+ strcpy(parsed->twophase_gid, data);
+ gidlen = strlen(parsed->twophase_gid) + 1;
+ data += MAXALIGN(gidlen);
+ }
+ }
+
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ xl_xact_origin xl_origin;
+
+ /* we're only guaranteed 4 byte alignment, so copy onto stack */
+ memcpy(&xl_origin, data, sizeof(xl_origin));
+
+ parsed->origin_lsn = xl_origin.origin_lsn;
+ parsed->origin_timestamp = xl_origin.origin_timestamp;
+
+ data += sizeof(xl_xact_origin);
}
}
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index c479c4881b..d6e4b7980f 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -144,11 +144,7 @@ int max_prepared_xacts = 0;
*
* typedef struct GlobalTransactionData *GlobalTransaction appears in
* twophase.h
- *
- * Note that the max value of GIDSIZE must fit in the uint16 gidlen,
- * specified in TwoPhaseFileHeader.
*/
-#define GIDSIZE 200
typedef struct GlobalTransactionData
{
@@ -211,12 +207,14 @@ static void RecordTransactionCommitPrepared(TransactionId xid,
RelFileNode *rels,
int ninvalmsgs,
SharedInvalidationMessage *invalmsgs,
- bool initfileinval);
+ bool initfileinval,
+ const char *gid);
static void RecordTransactionAbortPrepared(TransactionId xid,
int nchildren,
TransactionId *children,
int nrels,
- RelFileNode *rels);
+ RelFileNode *rels,
+ const char *gid);
static void ProcessRecords(char *bufptr, TransactionId xid,
const TwoPhaseCallback callbacks[]);
static void RemoveGXact(GlobalTransaction gxact);
@@ -898,7 +896,7 @@ TwoPhaseGetDummyProc(TransactionId xid)
/*
* Header for a 2PC state file
*/
-#define TWOPHASE_MAGIC 0x57F94533 /* format identifier */
+#define TWOPHASE_MAGIC 0x57F94534 /* format identifier */
typedef struct TwoPhaseFileHeader
{
@@ -914,6 +912,8 @@ typedef struct TwoPhaseFileHeader
int32 ninvalmsgs; /* number of cache invalidation messages */
bool initfileinval; /* does relcache init file need invalidation? */
uint16 gidlen; /* length of the GID - GID follows the header */
+ XLogRecPtr origin_lsn; /* lsn of this record at origin node */
+ TimestampTz origin_timestamp; /* time of prepare at origin node */
} TwoPhaseFileHeader;
/*
@@ -1065,6 +1065,7 @@ EndPrepare(GlobalTransaction gxact)
{
TwoPhaseFileHeader *hdr;
StateFileChunk *record;
+ bool replorigin;
/* Add the end sentinel to the list of 2PC records */
RegisterTwoPhaseRecord(TWOPHASE_RM_END_ID, 0,
@@ -1075,6 +1076,21 @@ EndPrepare(GlobalTransaction gxact)
Assert(hdr->magic == TWOPHASE_MAGIC);
hdr->total_len = records.total_len + sizeof(pg_crc32c);
+ replorigin = (replorigin_session_origin != InvalidRepOriginId &&
+ replorigin_session_origin != DoNotReplicateId);
+
+ if (replorigin)
+ {
+ Assert(replorigin_session_origin_lsn != InvalidXLogRecPtr);
+ hdr->origin_lsn = replorigin_session_origin_lsn;
+ hdr->origin_timestamp = replorigin_session_origin_timestamp;
+ }
+ else
+ {
+ hdr->origin_lsn = InvalidXLogRecPtr;
+ hdr->origin_timestamp = 0;
+ }
+
/*
* If the data size exceeds MaxAllocSize, we won't be able to read it in
* ReadTwoPhaseFile. Check for that now, rather than fail in the case
@@ -1107,7 +1123,16 @@ EndPrepare(GlobalTransaction gxact)
XLogBeginInsert();
for (record = records.head; record != NULL; record = record->next)
XLogRegisterData(record->data, record->len);
+
+ XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
+
gxact->prepare_end_lsn = XLogInsert(RM_XACT_ID, XLOG_XACT_PREPARE);
+
+ if (replorigin)
+ /* Move LSNs forward for this replication origin */
+ replorigin_session_advance(replorigin_session_origin_lsn,
+ gxact->prepare_end_lsn);
+
XLogFlush(gxact->prepare_end_lsn);
/* If we crash now, we have prepared: WAL replay will fix things */
@@ -1283,6 +1308,44 @@ ReadTwoPhaseFile(TransactionId xid, bool give_warnings)
return buf;
}
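+/*
+ * ParsePrepareRecord
+ *
+ * Extract the header fields, the GID and the trailing arrays of a
+ * PREPARE record (laid out like a 2PC state file) into *parsed.
+ */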
+void
+ParsePrepareRecord(uint8 info, char *xlrec, xl_xact_parsed_prepare *parsed)
+{
+ TwoPhaseFileHeader *hdr;
+ char *bufptr;
+
+ hdr = (TwoPhaseFileHeader *) xlrec;
+ bufptr = xlrec + MAXALIGN(sizeof(TwoPhaseFileHeader));
+
+ parsed->origin_lsn = hdr->origin_lsn;
+ parsed->origin_timestamp = hdr->origin_timestamp;
+ parsed->twophase_xid = hdr->xid;
+ parsed->dbId = hdr->database;
+ parsed->nsubxacts = hdr->nsubxacts;
+ parsed->nrels = hdr->ncommitrels;
+ parsed->nabortrels = hdr->nabortrels;
+ parsed->nmsgs = hdr->ninvalmsgs;
+
+ strncpy(parsed->twophase_gid, bufptr, hdr->gidlen);
+ bufptr += MAXALIGN(hdr->gidlen);
+
+ parsed->subxacts = (TransactionId *) bufptr;
+ bufptr += MAXALIGN(hdr->nsubxacts * sizeof(TransactionId));
+
+ parsed->xnodes = (RelFileNode *) bufptr;
+ bufptr += MAXALIGN(hdr->ncommitrels * sizeof(RelFileNode));
+
+ parsed->abortnodes = (RelFileNode *) bufptr;
+ bufptr += MAXALIGN(hdr->nabortrels * sizeof(RelFileNode));
+
+ parsed->msgs = (SharedInvalidationMessage *) bufptr;
+ bufptr += MAXALIGN(hdr->ninvalmsgs * sizeof(SharedInvalidationMessage));
+}
+
+
/*
* Reads 2PC data from xlog. During checkpoint this data will be moved to
@@ -1435,11 +1498,12 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
hdr->nsubxacts, children,
hdr->ncommitrels, commitrels,
hdr->ninvalmsgs, invalmsgs,
- hdr->initfileinval);
+ hdr->initfileinval, gid);
else
RecordTransactionAbortPrepared(xid,
hdr->nsubxacts, children,
- hdr->nabortrels, abortrels);
+ hdr->nabortrels, abortrels,
+ gid);
ProcArrayRemove(proc, latestXid);
@@ -1752,7 +1816,8 @@ restoreTwoPhaseData(void)
if (buf == NULL)
continue;
- PrepareRedoAdd(buf, InvalidXLogRecPtr, InvalidXLogRecPtr);
+ PrepareRedoAdd(buf, InvalidXLogRecPtr,
+ InvalidXLogRecPtr, InvalidRepOriginId);
}
}
LWLockRelease(TwoPhaseStateLock);
@@ -2165,7 +2230,8 @@ RecordTransactionCommitPrepared(TransactionId xid,
RelFileNode *rels,
int ninvalmsgs,
SharedInvalidationMessage *invalmsgs,
- bool initfileinval)
+ bool initfileinval,
+ const char *gid)
{
XLogRecPtr recptr;
TimestampTz committs = GetCurrentTimestamp();
@@ -2193,7 +2259,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
ninvalmsgs, invalmsgs,
initfileinval, false,
MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
- xid);
+ xid, gid);
if (replorigin)
@@ -2255,7 +2321,8 @@ RecordTransactionAbortPrepared(TransactionId xid,
int nchildren,
TransactionId *children,
int nrels,
- RelFileNode *rels)
+ RelFileNode *rels,
+ const char *gid)
{
XLogRecPtr recptr;
@@ -2278,7 +2345,7 @@ RecordTransactionAbortPrepared(TransactionId xid,
nchildren, children,
nrels, rels,
MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
- xid);
+ xid, gid);
/* Always flush, since we're about to remove the 2PC state file */
XLogFlush(recptr);
@@ -2309,7 +2376,8 @@ RecordTransactionAbortPrepared(TransactionId xid,
* data, the entry is marked as located on disk.
*/
void
-PrepareRedoAdd(char *buf, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn, RepOriginId origin_id)
{
TwoPhaseFileHeader *hdr = (TwoPhaseFileHeader *) buf;
char *bufptr;
@@ -2358,6 +2426,13 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
Assert(TwoPhaseState->numPrepXacts < max_prepared_xacts);
TwoPhaseState->prepXacts[TwoPhaseState->numPrepXacts++] = gxact;
+ if (origin_id != InvalidRepOriginId)
+ {
+ /* recover apply progress */
+ replorigin_advance(origin_id, hdr->origin_lsn, end_lsn,
+ false /* backward */ , false /* WAL */ );
+ }
+
elog(DEBUG2, "added 2PC data in shared memory for transaction %u", gxact->xid);
}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 5d1b9027cf..04cec9b2f0 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1227,7 +1227,7 @@ RecordTransactionCommit(void)
nmsgs, invalMessages,
RelcacheInitFileInval, forceSyncCommit,
MyXactFlags,
- InvalidTransactionId /* plain commit */ );
+ InvalidTransactionId, NULL /* plain commit */ );
if (replorigin)
/* Move LSNs forward for this replication origin */
@@ -1579,7 +1579,8 @@ RecordTransactionAbort(bool isSubXact)
XactLogAbortRecord(xact_time,
nchildren, children,
nrels, rels,
- MyXactFlags, InvalidTransactionId);
+ MyXactFlags, InvalidTransactionId,
+ NULL);
/*
* Report the latest async abort LSN, so that the WAL writer knows to
@@ -5235,7 +5236,8 @@ XactLogCommitRecord(TimestampTz commit_time,
int nrels, RelFileNode *rels,
int nmsgs, SharedInvalidationMessage *msgs,
bool relcacheInval, bool forceSync,
- int xactflags, TransactionId twophase_xid)
+ int xactflags, TransactionId twophase_xid,
+ const char *twophase_gid)
{
xl_xact_commit xlrec;
xl_xact_xinfo xl_xinfo;
@@ -5247,6 +5249,7 @@ XactLogCommitRecord(TimestampTz commit_time,
xl_xact_origin xl_origin;
uint8 info;
+ int gidlen = 0;
Assert(CritSectionCount > 0);
@@ -5309,6 +5312,13 @@ XactLogCommitRecord(TimestampTz commit_time,
{
xl_xinfo.xinfo |= XACT_XINFO_HAS_TWOPHASE;
xl_twophase.xid = twophase_xid;
+ Assert(twophase_gid != NULL);
+
+ if (XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_GID;
+ gidlen = strlen(twophase_gid) + 1; /* include '\0' */
+ }
}
/* dump transaction origin information */
@@ -5359,7 +5369,16 @@ XactLogCommitRecord(TimestampTz commit_time,
}
if (xl_xinfo.xinfo & XACT_XINFO_HAS_TWOPHASE)
+ {
XLogRegisterData((char *) (&xl_twophase), sizeof(xl_xact_twophase));
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_GID)
+ {
+ static const char zeroes[MAXIMUM_ALIGNOF] = { 0 };
+ XLogRegisterData((char*) twophase_gid, gidlen);
+ if (MAXALIGN(gidlen) != gidlen)
+ XLogRegisterData((char*) zeroes, MAXALIGN(gidlen) - gidlen);
+ }
+ }
if (xl_xinfo.xinfo & XACT_XINFO_HAS_ORIGIN)
XLogRegisterData((char *) (&xl_origin), sizeof(xl_xact_origin));
@@ -5380,15 +5399,19 @@ XLogRecPtr
XactLogAbortRecord(TimestampTz abort_time,
int nsubxacts, TransactionId *subxacts,
int nrels, RelFileNode *rels,
- int xactflags, TransactionId twophase_xid)
+ int xactflags, TransactionId twophase_xid,
+ const char *twophase_gid)
{
xl_xact_abort xlrec;
xl_xact_xinfo xl_xinfo;
xl_xact_subxacts xl_subxacts;
xl_xact_relfilenodes xl_relfilenodes;
xl_xact_twophase xl_twophase;
+ xl_xact_dbinfo xl_dbinfo;
+ xl_xact_origin xl_origin;
uint8 info;
+ int gidlen = 0;
Assert(CritSectionCount > 0);
@@ -5424,6 +5447,31 @@ XactLogAbortRecord(TimestampTz abort_time,
{
xl_xinfo.xinfo |= XACT_XINFO_HAS_TWOPHASE;
xl_twophase.xid = twophase_xid;
+ Assert(twophase_gid != NULL);
+
+ if (XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_GID;
+ gidlen = strlen(twophase_gid) + 1; /* include '\0' */
+ }
+ }
+
+ if (TransactionIdIsValid(twophase_xid) && XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_DBINFO;
+ xl_dbinfo.dbId = MyDatabaseId;
+ xl_dbinfo.tsId = MyDatabaseTableSpace;
+ }
+
+ /* dump transaction origin information only for abort prepared */
+ if ( (replorigin_session_origin != InvalidRepOriginId) &&
+ TransactionIdIsValid(twophase_xid) &&
+ XLogLogicalInfoActive())
+ {
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_ORIGIN;
+
+ xl_origin.origin_lsn = replorigin_session_origin_lsn;
+ xl_origin.origin_timestamp = replorigin_session_origin_timestamp;
}
if (xl_xinfo.xinfo != 0)
@@ -5438,6 +5486,10 @@ XactLogAbortRecord(TimestampTz abort_time,
if (xl_xinfo.xinfo != 0)
XLogRegisterData((char *) (&xl_xinfo), sizeof(xl_xinfo));
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_DBINFO)
+ XLogRegisterData((char *) (&xl_dbinfo), sizeof(xl_dbinfo));
+
+
if (xl_xinfo.xinfo & XACT_XINFO_HAS_SUBXACTS)
{
XLogRegisterData((char *) (&xl_subxacts),
@@ -5455,7 +5507,22 @@ XactLogAbortRecord(TimestampTz abort_time,
}
if (xl_xinfo.xinfo & XACT_XINFO_HAS_TWOPHASE)
+ {
XLogRegisterData((char *) (&xl_twophase), sizeof(xl_xact_twophase));
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_GID)
+ {
+ static const char zeroes[MAXIMUM_ALIGNOF] = { 0 };
+ XLogRegisterData((char*) twophase_gid, gidlen);
+ if (MAXALIGN(gidlen) != gidlen)
+ XLogRegisterData((char*) zeroes, MAXALIGN(gidlen) - gidlen);
+ }
+ }
+
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_ORIGIN)
+ XLogRegisterData((char *) (&xl_origin), sizeof(xl_xact_origin));
+
+ if (TransactionIdIsValid(twophase_xid))
+ XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
return XLogInsert(RM_XACT_ID, info);
}
@@ -5778,7 +5845,8 @@ xact_redo(XLogReaderState *record)
LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
PrepareRedoAdd(XLogRecGetData(record),
record->ReadRecPtr,
- record->EndRecPtr);
+ record->EndRecPtr,
+ XLogRecGetOrigin(record));
LWLockRelease(TwoPhaseStateLock);
}
else if (info == XLOG_XACT_ASSIGNMENT)
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 34d9470811..f05cde202f 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -15,6 +15,7 @@
#define TWOPHASE_H
#include "access/xlogdefs.h"
+#include "access/xact.h"
#include "datatype/timestamp.h"
#include "storage/lock.h"
@@ -46,6 +47,8 @@ extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
int *nxids_p);
+extern void ParsePrepareRecord(uint8 info, char *xlrec,
+ xl_xact_parsed_prepare *parsed);
extern void StandbyRecoverPreparedTransactions(void);
extern void RecoverPreparedTransactions(void);
@@ -54,7 +57,7 @@ extern void CheckPointTwoPhase(XLogRecPtr redo_horizon);
extern void FinishPreparedTransaction(const char *gid, bool isCommit);
extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
- XLogRecPtr end_lsn);
+ XLogRecPtr end_lsn, RepOriginId origin_id);
extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
extern void restoreTwoPhaseData(void);
#endif /* TWOPHASE_H */
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 87ae2cd4df..a46396f2d9 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -21,6 +21,13 @@
#include "storage/sinval.h"
#include "utils/datetime.h"
+/*
+ * Maximum size of Global Transaction ID (including '\0').
+ *
+ * Note that the max value of GIDSIZE must fit in the uint16 gidlen,
+ * specified in TwoPhaseFileHeader.
+ */
+#define GIDSIZE 200
/*
* Xact isolation levels
@@ -156,6 +163,7 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
#define XACT_XINFO_HAS_TWOPHASE (1U << 4)
#define XACT_XINFO_HAS_ORIGIN (1U << 5)
#define XACT_XINFO_HAS_AE_LOCKS (1U << 6)
+#define XACT_XINFO_HAS_GID (1U << 7)
/*
* Also stored in xinfo, these indicating a variety of additional actions that
@@ -286,7 +294,6 @@ typedef struct xl_xact_abort
typedef struct xl_xact_parsed_commit
{
TimestampTz xact_time;
-
uint32 xinfo;
Oid dbId; /* MyDatabaseId */
@@ -302,16 +309,24 @@ typedef struct xl_xact_parsed_commit
SharedInvalidationMessage *msgs;
TransactionId twophase_xid; /* only for 2PC */
+ char twophase_gid[GIDSIZE]; /* only for 2PC */
+ int nabortrels; /* only for 2PC */
+ RelFileNode *abortnodes; /* only for 2PC */
XLogRecPtr origin_lsn;
TimestampTz origin_timestamp;
} xl_xact_parsed_commit;
+typedef xl_xact_parsed_commit xl_xact_parsed_prepare;
+
typedef struct xl_xact_parsed_abort
{
TimestampTz xact_time;
uint32 xinfo;
+ Oid dbId; /* MyDatabaseId */
+ Oid tsId; /* MyDatabaseTableSpace */
+
int nsubxacts;
TransactionId *subxacts;
@@ -319,6 +334,10 @@ typedef struct xl_xact_parsed_abort
RelFileNode *xnodes;
TransactionId twophase_xid; /* only for 2PC */
+ char twophase_gid[GIDSIZE]; /* only for 2PC */
+
+ XLogRecPtr origin_lsn;
+ TimestampTz origin_timestamp;
} xl_xact_parsed_abort;
@@ -386,12 +405,14 @@ extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
int nmsgs, SharedInvalidationMessage *msgs,
bool relcacheInval, bool forceSync,
int xactflags,
- TransactionId twophase_xid);
+ TransactionId twophase_xid,
+ const char *twophase_gid);
extern XLogRecPtr XactLogAbortRecord(TimestampTz abort_time,
int nsubxacts, TransactionId *subxacts,
int nrels, RelFileNode *rels,
- int xactflags, TransactionId twophase_xid);
+ int xactflags, TransactionId twophase_xid,
+ const char *twophase_gid);
extern void xact_redo(XLogReaderState *record);
/* xactdesc.c */
--
2.14.3 (Apple Git-98)
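A note on the GID padding done in XactLogAbortRecord() above: the GID is registered and then zero-padded up to the next MAXALIGN boundary, so that whatever follows it in the record starts aligned. The following is only a minimal standalone sketch of that arithmetic, not the patch code. The names prefixed with sketch_/SKETCH_ are hypothetical, the alignment of 8 is an assumption (PostgreSQL determines MAXIMUM_ALIGNOF at configure time), and it assumes the stored length includes the terminating '\0'.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed alignment for this sketch; the real value is platform-specific. */
#define SKETCH_MAXIMUM_ALIGNOF 8

/* Round len up to the next multiple of SKETCH_MAXIMUM_ALIGNOF. */
#define SKETCH_MAXALIGN(len) \
	(((size_t) (len) + (SKETCH_MAXIMUM_ALIGNOF - 1)) & \
	 ~((size_t) (SKETCH_MAXIMUM_ALIGNOF - 1)))

/*
 * Copy a NUL-terminated GID into buf and zero-fill up to the aligned length.
 * Returns the number of bytes written, or 0 if buf is too small.
 */
static size_t
sketch_append_gid(char *buf, size_t bufsize, const char *gid)
{
	size_t		gidlen = strlen(gid) + 1;	/* assumption: includes '\0' */
	size_t		padded = SKETCH_MAXALIGN(gidlen);

	if (padded > bufsize)
		return 0;

	memcpy(buf, gid, gidlen);
	memset(buf + gidlen, 0, padded - gidlen);	/* the "zeroes" padding */
	return padded;
}
```

A reader of the record can then skip the GID by advancing the same aligned length, which is why writer and reader have to agree on the rounding.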
0004-Support-decoding-of-two-phase-transactions-at-PREPAR.patch (application/octet-stream)
From de9be7d1be79713123955797af20c6b27da9bbf0 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Wed, 28 Mar 2018 18:25:28 +0530
Subject: [PATCH 4/7] Support decoding of two-phase transactions at PREPARE
Until now, two-phase transactions were decoded at COMMIT, just like
regular transactions. During replay, two-phase transactions were
translated into regular transactions on the subscriber, and the GID
was not forwarded to it.
This patch allows PREPARE-time decoding of two-phase transactions (if
the output plugin supports this capability), in which case the
transactions are replayed at PREPARE and then committed later when
COMMIT PREPARED arrives.
On the subscriber, the transactions will be executed as two-phase
transactions, with the same GID. This is important for various
external transaction managers that often encode information into
the GID itself.
---
src/backend/access/transam/twophase.c | 5 +
src/backend/replication/logical/decode.c | 147 ++++++++++++++++--
src/backend/replication/logical/logical.c | 193 ++++++++++++++++++++++++
src/backend/replication/logical/reorderbuffer.c | 193 +++++++++++++++++++++---
src/include/replication/logical.h | 11 +-
src/include/replication/output_plugin.h | 45 ++++++
src/include/replication/reorderbuffer.h | 54 +++++++
7 files changed, 614 insertions(+), 34 deletions(-)
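To make the routing described in the commit message easier to follow, here is a rough standalone model of the decision taken at PREPARE and again at COMMIT PREPARED. It is a sketch under stated assumptions, not the code in decode.c: all sketch_ names are hypothetical, the filter takes only a GID (the real callback also receives the context, transaction and xid), and a missing filter is treated as "do not skip".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef enum
{
	SKETCH_DEFER_TO_COMMIT,		/* ignore PREPARE, replay at COMMIT PREPARED */
	SKETCH_REPLAY_AT_PREPARE	/* replay changes now, via the prepare callback */
} SketchPrepareAction;

typedef enum
{
	SKETCH_REPLAY_AT_COMMIT,		/* regular commit path */
	SKETCH_SEND_COMMIT_PREPARED		/* changes already sent, only notify */
} SketchCommitAction;

/* Hypothetical filter: returns true to skip PREPARE-time decoding. */
typedef bool (*sketch_filter_fn) (const char *gid);

/* Example filter: skip GIDs that a (made-up) local convention marks. */
static bool
sketch_skip_local(const char *gid)
{
	return strncmp(gid, "local-", 6) == 0;
}

static SketchPrepareAction
sketch_on_prepare(bool enable_twophase, sketch_filter_fn filter,
				  const char *gid)
{
	if (!enable_twophase)
		return SKETCH_DEFER_TO_COMMIT;
	if (filter != NULL && filter(gid))
		return SKETCH_DEFER_TO_COMMIT;
	return SKETCH_REPLAY_AT_PREPARE;
}

/*
 * At commit time the same question is asked again, so the filter has to
 * give the same answer for a given GID as it did at PREPARE.
 */
static SketchCommitAction
sketch_on_commit(bool is_twophase_commit, bool enable_twophase,
				 sketch_filter_fn filter, const char *gid)
{
	if (is_twophase_commit &&
		sketch_on_prepare(enable_twophase, filter, gid) ==
		SKETCH_REPLAY_AT_PREPARE)
		return SKETCH_SEND_COMMIT_PREPARED;
	return SKETCH_REPLAY_AT_COMMIT;
}
```

The point of asking the filter twice is that no per-transaction state needs to survive between PREPARE and COMMIT PREPARED, at the cost of requiring a deterministic filter.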
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index d6e4b7980f..f3091af385 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1506,6 +1506,11 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
gid);
ProcArrayRemove(proc, latestXid);
+ /*
+ * Tell logical decoding backends interested in this XID
+ * that this is going away
+ */
+ LogicalDecodeRemoveTransaction(proc, isCommit);
/*
* In case we fail while running the callbacks, mark the gxact invalid so
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 6eb0d5527e..51d544d0f5 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -34,6 +34,7 @@
#include "access/xlogutils.h"
#include "access/xlogreader.h"
#include "access/xlogrecord.h"
+#include "access/twophase.h"
#include "catalog/pg_control.h"
@@ -72,6 +73,8 @@ static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_commit *parsed, TransactionId xid);
static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare * parsed);
/* common function to decode tuples */
static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -280,16 +283,33 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
break;
}
case XLOG_XACT_PREPARE:
+ {
+ xl_xact_parsed_prepare parsed;
- /*
- * Currently decoding ignores PREPARE TRANSACTION and will just
- * decode the transaction when the COMMIT PREPARED is sent or
- * throw away the transaction's contents when a ROLLBACK PREPARED
- * is received. In the future we could add code to expose prepared
- * transactions in the changestream allowing for a kind of
- * distributed 2PC.
- */
- ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+ /* check that output plugin is capable of twophase decoding */
+ if (!ctx->enable_twophase)
+ {
+ ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+ break;
+ }
+
+ /* ok, parse it */
+ ParsePrepareRecord(XLogRecGetInfo(buf->record),
+ XLogRecGetData(buf->record), &parsed);
+
+ /* does output plugin want this particular transaction? */
+ if (ctx->callbacks.filter_prepare_cb &&
+ ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+ parsed.twophase_gid))
+ {
+ ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+ buf->origptr);
+ break;
+ }
+
+ DecodePrepare(ctx, buf, &parsed);
+ break;
+ }
break;
default:
elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
@@ -627,9 +647,90 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
buf->origptr, buf->endptr);
}
+ /*
+ * Decide if we're processing COMMIT PREPARED, or a regular COMMIT.
+ * Regular commit simply triggers a replay of transaction changes from the
+ * reorder buffer. For COMMIT PREPARED that however already happened at
+ * PREPARE time, and so we only need to notify the subscriber that the GID
+ * finally committed.
+ *
+ * For output plugins that do not support PREPARE-time decoding of
+ * two-phase transactions, we never even see the PREPARE and all two-phase
+ * transactions simply fall through to the second branch.
+ */
+ if (TransactionIdIsValid(parsed->twophase_xid) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder,
+ parsed->twophase_xid, parsed->twophase_gid))
+ {
+ Assert(xid == parsed->twophase_xid);
+ /* we are processing COMMIT PREPARED */
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid, true);
+ }
+ else
+ {
+ /* replay actions of all transaction + subtransactions in order */
+ ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn);
+ }
+}
+
+/*
+ * Decode a PREPARE record. The logic is similar to DecodeCommit().
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare * parsed)
+{
+ XLogRecPtr origin_lsn = parsed->origin_lsn;
+ TimestampTz commit_time = parsed->origin_timestamp;
+ RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+ int i;
+ TransactionId xid = parsed->twophase_xid;
+
+ /*
+ * Process invalidation messages, even if we're not interested in the
+ * transaction's contents, since the various caches need to always be
+ * consistent.
+ */
+ if (parsed->nmsgs > 0)
+ {
+ if (!ctx->fast_forward)
+ ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
+ parsed->nmsgs, parsed->msgs);
+ ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+ }
+
+ /*
+ * Tell the reorderbuffer about the surviving subtransactions. We need to
+ * do this because the main transaction itself has not committed since we
+ * are in the prepare phase right now. So we need to be sure the snapshot
+ * is set up correctly for the main transaction in case all changes
+ * happened in subtransactions.
+ */
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+ buf->origptr, buf->endptr);
+ }
+
+ if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+ (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+ ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+ {
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr);
+ }
+ ReorderBufferForget(ctx->reorder, xid, buf->origptr);
+
+ return;
+ }
+
/* replay actions of all transaction + subtransactions in order */
- ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
- commit_time, origin_id, origin_lsn);
+ ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn, parsed->twophase_gid);
}
/*
@@ -641,6 +742,30 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid)
{
int i;
+ XLogRecPtr origin_lsn = InvalidXLogRecPtr;
+ TimestampTz commit_time = 0;
+ RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ origin_lsn = parsed->origin_lsn;
+ commit_time = parsed->origin_timestamp;
+ }
+
+ /*
+ * If it's ROLLBACK PREPARED then handle it via callbacks.
+ */
+ if (TransactionIdIsValid(xid) &&
+ !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+ parsed->dbId == ctx->slot->data.database &&
+ !FilterByOrigin(ctx, origin_id) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
+ {
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid, false);
+ return;
+ }
for (i = 0; i < parsed->nsubxacts; i++)
{
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index a9b043be88..6e3f8625d1 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -60,6 +60,16 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+static void abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change);
static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -125,6 +135,7 @@ StartupDecodingContext(List *output_plugin_options,
MemoryContext context,
old_context;
LogicalDecodingContext *ctx;
+ int twophase_callbacks;
/* shorter lines... */
slot = MyReplicationSlot;
@@ -184,8 +195,38 @@ StartupDecodingContext(List *output_plugin_options,
ctx->reorder->begin = begin_cb_wrapper;
ctx->reorder->apply_change = change_cb_wrapper;
ctx->reorder->commit = commit_cb_wrapper;
+ ctx->reorder->abort = abort_cb_wrapper;
+ ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+ ctx->reorder->prepare = prepare_cb_wrapper;
+ ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+ ctx->reorder->abort_prepared = abort_prepared_cb_wrapper;
ctx->reorder->message = message_cb_wrapper;
+ /*
+ * Check that plugin implements all callbacks necessary to decode
+ * two-phase transactions - we either have to have all of them or none.
+ * The filter_prepare callback is optional, but can only be defined when
+ * two-phase decoding is enabled (i.e. the three other callbacks are
+ * defined).
+ */
+ twophase_callbacks = (ctx->callbacks.prepare_cb != NULL) +
+ (ctx->callbacks.commit_prepared_cb != NULL) +
+ (ctx->callbacks.abort_prepared_cb != NULL);
+
+ ctx->enable_twophase = (twophase_callbacks == 3);
+
+ /* Plugins with incorrect number of two-phase callbacks are broken. */
+ if ((twophase_callbacks != 3) && (twophase_callbacks != 0))
+ ereport(ERROR,
+ (errmsg("output plugin registered only %d two-phase callbacks",
+ twophase_callbacks)));
+
+ /* filter_prepare is optional, but requires two-phase decoding */
+ if ((ctx->callbacks.filter_prepare_cb != NULL) && (!ctx->enable_twophase))
+ ereport(ERROR,
+ (errmsg("output plugin does not support two-phase decoding, but "
+ "registered filter_prepare callback")));
+
ctx->out = makeStringInfo();
ctx->prepare_write = prepare_write;
ctx->write = do_write;
@@ -697,6 +738,122 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
error_context_stack = errcallback.previous;
}
+static void
+abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "abort";
+ state.report_location = txn->final_lsn; /* beginning of abort record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.abort_cb(ctx, txn, abort_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "prepare";
+ state.report_location = txn->final_lsn; /* beginning of prepare record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "commit_prepared";
+ state.report_location = txn->final_lsn; /* beginning of commit record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "abort_prepared";
+ state.report_location = txn->final_lsn; /* beginning of abort record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.abort_prepared_cb(ctx, txn, abort_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
static void
change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change)
@@ -734,6 +891,42 @@ change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
error_context_stack = errcallback.previous;
}
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+ bool ret;
+
+ /*
+ * If twophase is not enabled, skip decoding at PREPARE time
+ */
+ if (!ctx->enable_twophase)
+ return true;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "filter_prepare";
+ state.report_location = InvalidXLogRecPtr;
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = false;
+
+ /* do the actual work: call callback */
+ ret = ctx->callbacks.filter_prepare_cb(ctx, txn, xid, gid);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+
+ return ret;
+}
+
bool
filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
{
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index f10d1c2289..66c02e5af4 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1280,25 +1280,18 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
* the top and subtransactions (using a k-way merge) and replay the changes in
* lsn order.
*/
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
- XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
- TimestampTz commit_time,
- RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+ ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn)
{
- ReorderBufferTXN *txn;
volatile Snapshot snapshot_now;
volatile CommandId command_id = FirstCommandId;
bool using_subtxn;
ReorderBufferIterTXNState *volatile iterstate = NULL;
- txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
- false);
-
- /* unknown transaction, nothing to replay */
- if (txn == NULL)
- return;
-
txn->final_lsn = commit_lsn;
txn->end_lsn = end_lsn;
txn->commit_time = commit_time;
@@ -1391,8 +1384,14 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
relpathperm(change->data.tp.relnode,
MAIN_FORKNUM));
+ /* Lock transaction before catalog access */
+ if (!LogicalLockTransaction(txn))
+ break;
+
relation = RelationIdGetRelation(reloid);
+ LogicalUnlockTransaction(txn);
+
if (relation == NULL)
elog(ERROR, "could not open relation with OID %u (for filenode \"%s\")",
reloid,
@@ -1584,8 +1583,26 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
ReorderBufferIterTXNFinish(rb, iterstate);
iterstate = NULL;
- /* call commit callback */
- rb->commit(rb, txn, commit_lsn);
+ /*
+ * Call abort/commit/prepare callback, depending on the transaction
+ * state.
+ *
+ * If the transaction aborted during apply (which currently can happen
+ * only for prepared transactions), simply call the abort callback.
+ *
+ * Otherwise call either PREPARE (for twophase transactions) or COMMIT
+ * (for regular ones).
+ */
+ if (rbtxn_rollback(txn))
+ rb->abort(rb, txn, commit_lsn);
+ else if (rbtxn_prepared(txn))
+ rb->prepare(rb, txn, commit_lsn);
+ else
+ rb->commit(rb, txn, commit_lsn);
+
+ /* remove ourselves from the decodeGroupLeader's group */
+ if (MyProc->decodeGroupLeader)
+ RemoveDecodeGroupMember(MyProc->decodeGroupLeader);
/* this is just a sanity check against bad output plugin behaviour */
if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -1612,7 +1629,12 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
if (snapshot_now->copied)
ReorderBufferFreeSnap(rb, snapshot_now);
- /* remove potential on-disk data, and deallocate */
+ /*
+ * remove potential on-disk data, and deallocate.
+ *
+ * We remove it even for prepared transactions (GID is enough to
+ * commit/abort those later).
+ */
ReorderBufferCleanupTXN(rb, txn);
}
PG_CATCH();
@@ -1646,6 +1668,137 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
PG_END_TRY();
}
+
+/*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+ return rb->filter_prepare(rb, txn, xid, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferCommitInternal()
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /* unknown transaction, nothing to replay */
+ if (txn == NULL)
+ return;
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a twophase transaction. It calls ReorderBufferCommitInternal()
+ * since all prepared transactions need to be decoded at PREPARE time.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /* unknown transaction, nothing to replay */
+ if (txn == NULL)
+ return;
+
+ txn->txn_flags |= RBTXN_PREPARE;
+ strcpy(txn->gid, gid);
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Check whether this transaction was sent as prepared to subscribers.
+ * Called while handling commit|abort prepared.
+ */
+bool
+ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+ const char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /*
+ * Always call the prepare filter. It's the job of the prepare filter to
+ * give us the *same* response for a given xid across multiple calls
+ * (including ones on restart)
+ */
+ return !(rb->filter_prepare(rb, txn, xid, gid));
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ABORT PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit)
+{
+ ReorderBufferTXN *txn;
+
+ /*
+ * The transaction may or may not exist (during restarts for example).
+ * Either way, a prepared transaction has no buffered changes left at this
+ * point, so allow it to be created below.
+ */
+ txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+ true);
+
+ txn->final_lsn = commit_lsn;
+ txn->end_lsn = end_lsn;
+ txn->commit_time = commit_time;
+ txn->origin_id = origin_id;
+ txn->origin_lsn = origin_lsn;
+ strcpy(txn->gid, gid);
+
+ if (is_commit)
+ {
+ txn->txn_flags |= RBTXN_COMMIT_PREPARED;
+ rb->commit_prepared(rb, txn, commit_lsn);
+ }
+ else
+ {
+ txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
+ rb->abort_prepared(rb, txn, commit_lsn);
+ }
+
+ /* cleanup: make sure there's no cache pollution */
+ ReorderBufferExecuteInvalidations(rb, txn);
+ ReorderBufferCleanupTXN(rb, txn);
+}
+
/*
* Abort a transaction that possibly has previous changes. Needs to be first
* called for subtransactions and then for the toplevel xid.
@@ -1714,7 +1867,7 @@ ReorderBufferAbortOld(ReorderBuffer *rb, TransactionId oldestRunningXid)
if (rbtxn_is_serialized(txn) && txn->final_lsn == 0)
{
ReorderBufferChange *last =
- dlist_tail_element(ReorderBufferChange, node, &txn->changes);
+ dlist_tail_element(ReorderBufferChange, node, &txn->changes);
txn->final_lsn = last->lsn;
}
@@ -2628,9 +2781,9 @@ ReorderBufferSerializedPath(char *path, ReplicationSlot *slot, TransactionId xid
XLogSegNoOffsetToRecPtr(segno, 0, recptr, wal_segment_size);
snprintf(path, MAXPGPATH, "pg_replslot/%s/xid-%u-lsn-%X-%X.snap",
- NameStr(MyReplicationSlot->data.name),
- xid,
- (uint32) (recptr >> 32), (uint32) recptr);
+ NameStr(MyReplicationSlot->data.name),
+ xid,
+ (uint32) (recptr >> 32), (uint32) recptr);
}
/*
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 63b14367f0..fbe18dff56 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -46,11 +46,11 @@ typedef struct LogicalDecodingContext
struct SnapBuild *snapshot_builder;
/*
- * Marks the logical decoding context as fast forward decoding one.
- * Such a context does not have plugin loaded so most of the the following
+ * Marks the logical decoding context as fast forward decoding one. Such a
+ * context does not have plugin loaded so most of the following
* properties are unused.
*/
- bool fast_forward;
+ bool fast_forward;
OutputPluginCallbacks callbacks;
OutputPluginOptions options;
@@ -89,6 +89,11 @@ typedef struct LogicalDecodingContext
bool prepared_write;
XLogRecPtr write_location;
TransactionId write_xid;
+
+ /*
+ * Capabilities of the output plugin.
+ */
+ bool enable_twophase;
} LogicalDecodingContext;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 82875d6b3d..5254210a46 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -68,6 +68,46 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+/*
+ * Called for an implicit ABORT of a transaction.
+ */
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+/*
+ * Called before decoding a PREPARE record. Return true to skip PREPARE-time
+ * decoding and send the transaction as a regular one at COMMIT PREPARED;
+ * return false to decode it through the prepare and
+ * commit_prepared/abort_prepared callbacks.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
/*
* Called for the generic logical decoding messages.
*/
@@ -99,7 +139,12 @@ typedef struct OutputPluginCallbacks
LogicalDecodeBeginCB begin_cb;
LogicalDecodeChangeCB change_cb;
LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeAbortCB abort_cb;
LogicalDecodeMessageCB message_cb;
+ LogicalDecodeFilterPrepareCB filter_prepare_cb;
+ LogicalDecodePrepareCB prepare_cb;
+ LogicalDecodeCommitPreparedCB commit_prepared_cb;
+ LogicalDecodeAbortPreparedCB abort_prepared_cb;
LogicalDecodeFilterByOriginCB filter_by_origin_cb;
LogicalDecodeShutdownCB shutdown_cb;
} OutputPluginCallbacks;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 385bb486bb..1dedf5cc42 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
#define REORDERBUFFER_H
#include "access/htup_details.h"
+#include "access/twophase.h"
#include "lib/ilist.h"
#include "storage/sinval.h"
#include "utils/hsearch.h"
@@ -179,6 +180,9 @@ typedef struct ReorderBufferTXN
*/
TransactionId xid;
+ /* In case of 2PC we need to pass GID to output plugin */
+ char gid[GIDSIZE];
+
/*
* LSN of the first data carrying, WAL record with knowledge about this
* xid. This is allowed to *not* be first record adorned with this xid, if
@@ -312,6 +316,37 @@ typedef void (*ReorderBufferCommitCB) (
ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+/* abort callback signature */
+typedef void (*ReorderBufferAbortCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+typedef bool (*ReorderBufferFilterPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/* abort prepared callback signature */
+typedef void (*ReorderBufferAbortPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+
/* message callback signature */
typedef void (*ReorderBufferMessageCB) (
ReorderBuffer *rb,
@@ -347,6 +382,11 @@ struct ReorderBuffer
ReorderBufferBeginCB begin;
ReorderBufferApplyChangeCB apply_change;
ReorderBufferCommitCB commit;
+ ReorderBufferAbortCB abort;
+ ReorderBufferFilterPrepareCB filter_prepare;
+ ReorderBufferPrepareCB prepare;
+ ReorderBufferCommitPreparedCB commit_prepared;
+ ReorderBufferAbortPreparedCB abort_prepared;
ReorderBufferMessageCB message;
/*
@@ -394,6 +434,11 @@ void ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot
void ReorderBufferCommit(ReorderBuffer *, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit);
void ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
void ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
@@ -417,6 +462,15 @@ void ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
bool ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
bool ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
+bool ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+ const char *gid);
+bool ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+ const char *gid);
+void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid);
ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
void ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
--
2.14.3 (Apple Git-98)
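Note for plugin authors on the ReorderBufferFilterPrepareCB hook declared above: it returns true when the transaction should not be decoded at PREPARE time, and it is handed both the xid and the gid. A minimal standalone sketch of one possible policy follows. It is not part of the patch: the prefix rule, the names sketch_filter_prepare and SKETCH_LOCAL_PREFIX, and the stand-in xid type are all hypothetical. The point is only that the answer depends on nothing but the arguments, so it stays the same on every call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for PostgreSQL's TransactionId (assumption for this sketch). */
typedef uint32_t SketchTransactionId;

/* Hypothetical policy: GIDs with this prefix are not decoded at PREPARE. */
#define SKETCH_LOCAL_PREFIX "local_"

/*
 * Returns true to skip decoding at PREPARE time, false to decode.
 * The result depends only on the gid, so repeated calls for the same
 * (xid, gid) pair always agree.
 */
bool
sketch_filter_prepare(SketchTransactionId xid, const char *gid)
{
	(void) xid;				/* unused by this particular policy */
	assert(gid != NULL);

	return strncmp(gid, SKETCH_LOCAL_PREFIX,
				   strlen(SKETCH_LOCAL_PREFIX)) == 0;
}
```

With this policy, a gid such as 'local_cleanup' would be handled as a regular transaction at COMMIT PREPARED, while 'test_prepared#3' would be decoded at PREPARE.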
0005-pgoutput-output-plugin-support-for-logical-decoding-.patch (application/octet-stream)
From dfca1570ec58b13bc579d39a194da34648b9a1fb Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Wed, 28 Mar 2018 19:08:33 +0530
Subject: [PATCH 5/7] pgoutput output plugin support for logical decoding of
2PC.
Includes documentation changes and test cases.
---
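Reviewer note on the wire format: as written by logicalrep_write_prepare in this patch, the new message is the byte 'P', one flags byte, three 64-bit fields (prepare_lsn, end_lsn, timestamp) and the NUL-terminated gid. The standalone sketch below mirrors that layout so it can be checked outside the server. It is not the patch's code: every sketch_/SKETCH_ name is hypothetical, SKETCH_GIDSIZE = 200 is assumed to match GIDSIZE in access/twophase.h, and the 64-bit fields are assumed to travel in network byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_GIDSIZE 200				/* assumed equal to GIDSIZE */
#define SKETCH_IS_PREPARE 0x01			/* flag values taken from the patch */
#define SKETCH_IS_COMMIT_PREPARED 0x02
#define SKETCH_IS_ROLLBACK_PREPARED 0x04
#define SKETCH_HDR 26					/* 'P' + flags + 3 * 8 bytes */

typedef struct SketchPrepare
{
	uint8_t		type;
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		prepare_time;
	char		gid[SKETCH_GIDSIZE];
} SketchPrepare;

/* Exactly one of the three flag values must be set. */
static int
sketch_flags_valid(uint8_t f)
{
	return f == SKETCH_IS_PREPARE || f == SKETCH_IS_COMMIT_PREPARED ||
		f == SKETCH_IS_ROLLBACK_PREPARED;
}

/* Store and load a 64-bit value in big-endian order. */
static void
put_u64(uint8_t *p, uint64_t v)
{
	for (int i = 7; i >= 0; i--)
	{
		p[i] = (uint8_t) (v & 0xff);
		v >>= 8;
	}
}

static uint64_t
get_u64(const uint8_t *p)
{
	uint64_t	v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Returns the number of bytes written, or 0 if the message is rejected. */
size_t
sketch_write_prepare(uint8_t *buf, size_t cap, const SketchPrepare *msg)
{
	size_t		gidlen = strlen(msg->gid);
	size_t		need = SKETCH_HDR + gidlen + 1;

	if (!sketch_flags_valid(msg->type) || gidlen == 0 || need > cap)
		return 0;
	buf[0] = 'P';
	buf[1] = msg->type;
	put_u64(buf + 2, msg->prepare_lsn);
	put_u64(buf + 10, msg->end_lsn);
	put_u64(buf + 18, (uint64_t) msg->prepare_time);
	memcpy(buf + SKETCH_HDR, msg->gid, gidlen + 1);
	return need;
}

/* Returns 0 on success, -1 on bad flags, missing NUL, or oversized gid. */
int
sketch_read_prepare(const uint8_t *buf, size_t len, SketchPrepare *out)
{
	const uint8_t *gid = buf + SKETCH_HDR;
	const uint8_t *nul;

	if (len <= SKETCH_HDR || buf[0] != 'P' || !sketch_flags_valid(buf[1]))
		return -1;
	nul = memchr(gid, '\0', len - SKETCH_HDR);
	if (nul == NULL || (size_t) (nul - gid) >= SKETCH_GIDSIZE)
		return -1;
	out->type = buf[1];
	out->prepare_lsn = get_u64(buf + 2);
	out->end_lsn = get_u64(buf + 10);
	out->prepare_time = (int64_t) get_u64(buf + 18);
	memcpy(out->gid, gid, (size_t) (nul - gid) + 1);
	return 0;
}
```

The reader in this sketch bounds the gid against the destination buffer before copying, which is the property worth preserving in logicalrep_read_prepare as well.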
doc/src/sgml/logicaldecoding.sgml | 134 ++++++++++++++++++-
src/backend/access/transam/twophase.c | 32 +++++
src/backend/replication/logical/logical.c | 11 +-
src/backend/replication/logical/proto.c | 90 ++++++++++++-
src/backend/replication/logical/reorderbuffer.c | 2 +
src/backend/replication/logical/worker.c | 147 ++++++++++++++++++++-
src/backend/replication/pgoutput/pgoutput.c | 72 ++++++++++-
src/include/access/twophase.h | 1 +
src/include/replication/logicalproto.h | 39 +++++-
src/test/subscription/t/010_twophase.pl | 163 ++++++++++++++++++++++++
10 files changed, 678 insertions(+), 13 deletions(-)
create mode 100644 src/test/subscription/t/010_twophase.pl
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index f6b14dccb0..78905caa7f 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -384,7 +384,12 @@ typedef struct OutputPluginCallbacks
LogicalDecodeBeginCB begin_cb;
LogicalDecodeChangeCB change_cb;
LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeAbortCB abort_cb;
LogicalDecodeMessageCB message_cb;
+ LogicalDecodeFilterPrepareCB filter_prepare_cb;
+ LogicalDecodePrepareCB prepare_cb;
+ LogicalDecodeCommitPreparedCB commit_prepared_cb;
+ LogicalDecodeAbortPreparedCB abort_prepared_cb;
LogicalDecodeFilterByOriginCB filter_by_origin_cb;
LogicalDecodeShutdownCB shutdown_cb;
} OutputPluginCallbacks;
@@ -454,7 +459,13 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
never get
decoded. Successful savepoints are
folded into the transaction containing them in the order they were
- executed within that transaction.
+ executed within that transaction. A transaction that is prepared for
+ a two-phase commit using <command>PREPARE TRANSACTION</command> will
+ also be decoded if the output plugin provides the callbacks needed to
+ decode it. The transaction currently being decoded may be aborted
+ concurrently by a <command>ROLLBACK PREPARED</command> command. In
+ that case, the logical decoding of this transaction will
+ be aborted too.
</para>
<note>
@@ -555,6 +566,74 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
</para>
</sect3>
+ <sect3 id="logicaldecoding-output-plugin-prepare">
+ <title>Transaction Prepare Callback</title>
+
+ <para>
+ The optional <function>prepare_cb</function> callback is called whenever
+ a transaction which is prepared for two-phase commit has been
+ decoded. The <function>change_cb</function> callbacks for all modified
+ rows will have been called before this, if there have been any modified
+ rows.
+<programlisting>
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+ <title>Commit Prepared Transaction Callback</title>
+
+ <para>
+ The optional <function>commit_prepared_cb</function> callback is called whenever
+ a <command>COMMIT PREPARED</command> has been decoded. The <parameter>gid</parameter> field,
+ which is part of the <parameter>txn</parameter> parameter, can be used in this
+ callback.
+<programlisting>
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-abort-prepared">
+ <title>Rollback Prepared Transaction Callback</title>
+
+ <para>
+ The optional <function>abort_prepared_cb</function> callback is called whenever
+ a <command>ROLLBACK PREPARED</command> has been decoded. The <parameter>gid</parameter> field,
+ which is part of the <parameter>txn</parameter> parameter, can be used in this
+ callback.
+<programlisting>
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-abort">
+ <title>Transaction Abort Callback</title>
+
+ <para>
+ The required <function>abort_cb</function> callback is called whenever
+ a transaction abort has to be initiated. This can happen if we are
+ decoding a transaction that has been prepared for two-phase commit and
+ a concurrent rollback happens while we are decoding it. In such cases
+ it can make sense to check, before decoding commences, whether the
+ rollback has already happened, so that such transactions are not
+ decoded at all.
+<programlisting>
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
<sect3 id="logicaldecoding-output-plugin-change">
<title>Change Callback</title>
@@ -564,12 +643,30 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
an <command>INSERT</command>, <command>UPDATE</command>,
or <command>DELETE</command>. Even if the original command modified
several rows at once the callback will be called individually for each
- row.
+ row. The <function>change_cb</function> callback may access system or
+ user catalog tables to aid in the process of outputting the row
+ modification details. The <function>change_cb</function> callback should call
+ <function>LogicalLockTransaction</function> before any such access to
+ system or user catalog tables. When decoding a prepared (but not yet
+ committed) transaction, or any other uncommitted transaction, this
+ function interlocks the decoding activity with a simultaneous rollback
+ of that same transaction by another backend. The
+ <function>change_cb</function> callback should call
+ <function>LogicalUnlockTransaction</function> immediately after
+ the catalog table access.
<programlisting>
typedef void (*LogicalDecodeChangeCB) (struct LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
Relation relation,
ReorderBufferChange *change);
+</programlisting>
+ Here's an example of the use of <function>LogicalLockTransaction</function>
+ and <function>LogicalUnlockTransaction</function> in an output plugin:
+<programlisting>
+ if (!LogicalLockTransaction(txn))
+ return;
+ relation = RelationIdGetRelation(reloid);
+ LogicalUnlockTransaction(txn);
</programlisting>
The <parameter>ctx</parameter> and <parameter>txn</parameter> parameters
have the same contents as for the <function>begin_cb</function>
@@ -619,6 +716,39 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
</para>
</sect3>
+ <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+ <title>Prepare Filter Callback</title>
+
+ <para>
+ The optional <function>filter_prepare_cb</function> callback
+ is called to determine whether data that is part of the current
+ two-phase commit transaction should be decoded
+ at this prepare stage or as a regular one-phase transaction at
+ <command>COMMIT PREPARED</command> time later. To signal that
+ decoding should be skipped, return <literal>true</literal>;
+ <literal>false</literal> otherwise. When the callback is not
+ defined, <literal>false</literal> is assumed (i.e. nothing is
+ filtered).
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+</programlisting>
+ The <parameter>ctx</parameter> parameter has the same contents
+ as for the other callbacks. The <parameter>txn</parameter> parameter
+ contains meta information about the transaction. The <parameter>xid</parameter>
+ contains the XID because <parameter>txn</parameter> can be NULL in some cases.
+ The <parameter>gid</parameter> is the identifier that later identifies this
+ transaction for <command>COMMIT PREPARED</command> or <command>ROLLBACK PREPARED</command>.
+ </para>
+ <para>
+ The callback has to provide the same static answer for a given combination of
+ <parameter>xid</parameter> and <parameter>gid</parameter> every time it is
+ called.
+ </para>
+ </sect3>
+
<sect3 id="logicaldecoding-output-plugin-message">
<title>Generic Message Callback</title>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index f3091af385..e2db0ebf77 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -549,6 +549,38 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held)
ProcArrayAdd(&ProcGlobal->allProcs[gxact->pgprocno]);
}
+/*
+ * LookupGXact
+ * Check whether a valid prepared transaction with the given GID
+ * exists
+ */
+bool
+LookupGXact(const char *gid)
+{
+ int i;
+
+ LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
+
+ for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+ {
+ GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+ /* Ignore not-yet-valid GIDs */
+ if (!gxact->valid)
+ continue;
+ if (strcmp(gxact->gid, gid) != 0)
+ continue;
+
+ LWLockRelease(TwoPhaseStateLock);
+
+ return true;
+ }
+
+ LWLockRelease(TwoPhaseStateLock);
+
+ return false;
+}
+
/*
* LockGXact
* Locate the prepared transaction and mark it busy for COMMIT or PREPARE.
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 6e3f8625d1..8025d999fa 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -901,11 +901,20 @@ filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
bool ret;
/*
- * If twophase is not enabled, skip decoding at PREPARE time
+ * Skip if decoding of twophase at PREPARE time is not enabled. In that
+ * case all twophase transactions are considered filtered out and will be
+ * applied as regular transactions at COMMIT PREPARED.
*/
if (!ctx->enable_twophase)
return true;
+ /*
+ * The filter_prepare callback is optional. When not supplied, all
+ * prepared transactions should go through.
+ */
+ if (!ctx->callbacks.filter_prepare_cb)
+ return false;
+
/* Push callback + info on the error context stack */
state.ctx = ctx;
state.callback_name = "filter_prepare";
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 948343e4ae..ac6aebde0a 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -70,12 +70,17 @@ logicalrep_read_begin(StringInfo in, LogicalRepBeginData *begin_data)
*/
void
logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
- XLogRecPtr commit_lsn)
+ XLogRecPtr commit_lsn, bool is_commit)
{
uint8 flags = 0;
pq_sendbyte(out, 'C'); /* sending COMMIT */
+ if (is_commit)
+ flags |= LOGICALREP_IS_COMMIT;
+ else
+ flags |= LOGICALREP_IS_ABORT;
+
/* send the flags field (unused for now) */
pq_sendbyte(out, flags);
@@ -86,16 +91,20 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
}
/*
- * Read transaction COMMIT from the stream.
+ * Read transaction COMMIT|ABORT from the stream.
*/
void
logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
{
- /* read flags (unused for now) */
+ /* read flags */
uint8 flags = pq_getmsgbyte(in);
- if (flags != 0)
- elog(ERROR, "unrecognized flags %u in commit message", flags);
+ if (!CommitFlagsAreValid(flags))
+ elog(ERROR, "unrecognized flags %u in commit|abort message",
+ flags);
+
+ /* the flag is either commit or abort */
+ commit_data->is_commit = (flags == LOGICALREP_IS_COMMIT);
/* read fields */
commit_data->commit_lsn = pq_getmsgint64(in);
@@ -103,6 +112,77 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
commit_data->committime = pq_getmsgint64(in);
}
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ uint8 flags = 0;
+
+ pq_sendbyte(out, 'P'); /* sending PREPARE protocol */
+
+ /*
+ * This should only ever happen for 2PC transactions. In which case we
+ * expect to have a non-empty GID.
+ */
+ Assert(rbtxn_prepared(txn));
+ Assert(strlen(txn->gid) > 0);
+
+ /*
+ * Flags are determined from the state of the transaction. We know we
+ * always get PREPARE first and then [COMMIT|ROLLBACK] PREPARED, so if
+ * it's already marked as committed then it has to be COMMIT PREPARED (and
+ * likewise for abort / ROLLBACK PREPARED).
+ */
+ if (rbtxn_commit_prepared(txn))
+ flags |= LOGICALREP_IS_COMMIT_PREPARED;
+ else if (rbtxn_rollback_prepared(txn))
+ flags |= LOGICALREP_IS_ROLLBACK_PREPARED;
+ else
+ flags |= LOGICALREP_IS_PREPARE;
+
+ /* Make sure exactly one of the expected flags is set. */
+ if (!PrepareFlagsAreValid(flags))
+ elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+ /* send the flags field */
+ pq_sendbyte(out, flags);
+
+ /* send fields */
+ pq_sendint64(out, prepare_lsn);
+ pq_sendint64(out, txn->end_lsn);
+ pq_sendint64(out, txn->commit_time);
+
+ /* send gid */
+ pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPrepareData * prepare_data)
+{
+ /* read flags */
+ uint8 flags = pq_getmsgbyte(in);
+
+ if (!PrepareFlagsAreValid(flags))
+ elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+ /* set the action (reuse the constants used for the flags) */
+ prepare_data->prepare_type = flags;
+
+ /* read fields */
+ prepare_data->prepare_lsn = pq_getmsgint64(in);
+ prepare_data->end_lsn = pq_getmsgint64(in);
+ prepare_data->preparetime = pq_getmsgint64(in);
+
+ /* read gid (copy it into a pre-allocated buffer, truncating if oversized) */
+ strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
/*
* Write ORIGIN to the output stream.
*/
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 66c02e5af4..67faae1b9e 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1781,6 +1781,8 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
txn->commit_time = commit_time;
txn->origin_id = origin_id;
txn->origin_lsn = origin_lsn;
+ /* this txn is obviously prepared */
+ txn->txn_flags |= RBTXN_PREPARE;
strcpy(txn->gid, gid);
if (is_commit)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index fdace7eea2..56d3239491 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -486,7 +486,11 @@ apply_handle_commit(StringInfo s)
replorigin_session_origin_lsn = commit_data.end_lsn;
replorigin_session_origin_timestamp = commit_data.committime;
- CommitTransactionCommand();
+ if (commit_data.is_commit)
+ CommitTransactionCommand();
+ else
+ AbortCurrentTransaction();
+
pgstat_report_stat(false);
store_flush_position(commit_data.end_lsn);
@@ -506,6 +510,141 @@ apply_handle_commit(StringInfo s)
pgstat_report_activity(STATE_IDLE, NULL);
}
+static void
+apply_handle_prepare_txn(LogicalRepPrepareData * prepare_data)
+{
+ Assert(prepare_data->prepare_lsn == remote_final_lsn);
+
+ /* The synchronization worker runs in a single transaction. */
+ if (IsTransactionState() && !am_tablesync_worker())
+ {
+ /* End the earlier transaction and start a new one */
+ BeginTransactionBlock();
+ CommitTransactionCommand();
+ StartTransactionCommand();
+
+ /*
+ * Update origin state so we can restart streaming from correct
+ * position in case of crash.
+ */
+ replorigin_session_origin_lsn = prepare_data->end_lsn;
+ replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+ PrepareTransactionBlock(prepare_data->gid);
+ CommitTransactionCommand();
+ pgstat_report_stat(false);
+
+ store_flush_position(prepare_data->end_lsn);
+ }
+ else
+ {
+ /* Process any invalidation messages that might have accumulated. */
+ AcceptInvalidationMessages();
+ maybe_reread_subscription();
+ }
+
+ in_remote_transaction = false;
+
+ /* Process any tables that are being synchronized in parallel. */
+ process_syncing_tables(prepare_data->end_lsn);
+
+ pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+static void
+apply_handle_commit_prepared_txn(LogicalRepPrepareData * prepare_data)
+{
+ /* there is no transaction when COMMIT PREPARED is called */
+ ensure_transaction();
+
+ /*
+ * Update origin state so we can restart streaming from correct position
+ * in case of crash.
+ */
+ replorigin_session_origin_lsn = prepare_data->end_lsn;
+ replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+ FinishPreparedTransaction(prepare_data->gid, true);
+ CommitTransactionCommand();
+ pgstat_report_stat(false);
+
+ store_flush_position(prepare_data->end_lsn);
+ in_remote_transaction = false;
+
+ /* Process any tables that are being synchronized in parallel. */
+ process_syncing_tables(prepare_data->end_lsn);
+
+ pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+static void
+apply_handle_rollback_prepared_txn(LogicalRepPrepareData * prepare_data)
+{
+ /*
+ * Update origin state so we can restart streaming from correct position
+ * in case of crash.
+ */
+ replorigin_session_origin_lsn = prepare_data->end_lsn;
+ replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+ /*
+ * On the publisher, a prepared transaction may be aborted while it is
+ * still being decoded. Decoding then stops and an abort is sent, so the
+ * transaction never reaches the prepared state on the subscriber. The
+ * ROLLBACK PREPARED message still arrives afterwards, so it is ok for
+ * the gid to be missing here.
+ */
+ if (LookupGXact(prepare_data->gid))
+ {
+ /* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+ ensure_transaction();
+ FinishPreparedTransaction(prepare_data->gid, false);
+ CommitTransactionCommand();
+ }
+
+ pgstat_report_stat(false);
+
+ store_flush_position(prepare_data->end_lsn);
+ in_remote_transaction = false;
+
+ /* Process any tables that are being synchronized in parallel. */
+ process_syncing_tables(prepare_data->end_lsn);
+
+ pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+ LogicalRepPrepareData prepare_data;
+
+ logicalrep_read_prepare(s, &prepare_data);
+
+ switch (prepare_data.prepare_type)
+ {
+ case LOGICALREP_IS_PREPARE:
+ apply_handle_prepare_txn(&prepare_data);
+ break;
+
+ case LOGICALREP_IS_COMMIT_PREPARED:
+ apply_handle_commit_prepared_txn(&prepare_data);
+ break;
+
+ case LOGICALREP_IS_ROLLBACK_PREPARED:
+ apply_handle_rollback_prepared_txn(&prepare_data);
+ break;
+
+ default:
+ ereport(ERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg("unexpected type of prepare message: %d",
+ prepare_data.prepare_type)));
+ }
+}
+
/*
* Handle ORIGIN message.
*
@@ -903,10 +1042,14 @@ apply_dispatch(StringInfo s)
case 'B':
apply_handle_begin(s);
break;
- /* COMMIT */
+ /* COMMIT/ABORT */
case 'C':
apply_handle_commit(s);
break;
+ /* PREPARE and [COMMIT|ROLLBACK] PREPARED */
+ case 'P':
+ apply_handle_prepare(s);
+ break;
/* INSERT */
case 'I':
apply_handle_insert(s);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index aa9cf5b54e..4f83978c47 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -36,11 +36,19 @@ static void pgoutput_begin_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn);
static void pgoutput_commit_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_abort_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr abort_lsn);
static void pgoutput_change(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, Relation rel,
ReorderBufferChange *change);
static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
RepOriginId origin_id);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_abort_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
static bool publications_valid;
@@ -78,6 +86,12 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
cb->begin_cb = pgoutput_begin_txn;
cb->change_cb = pgoutput_change;
cb->commit_cb = pgoutput_commit_txn;
+ cb->abort_cb = pgoutput_abort_txn;
+
+ cb->prepare_cb = pgoutput_prepare_txn;
+ cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+ cb->abort_prepared_cb = pgoutput_abort_prepared_txn;
+
cb->filter_by_origin_cb = pgoutput_origin_filter;
cb->shutdown_cb = pgoutput_shutdown;
}
@@ -246,7 +260,63 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
OutputPluginUpdateProgress(ctx);
OutputPluginPrepareWrite(ctx, true);
- logicalrep_write_commit(ctx->out, txn, commit_lsn);
+ logicalrep_write_commit(ctx->out, txn, commit_lsn, true);
+ OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ABORT callback
+ */
+static void
+pgoutput_abort_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ OutputPluginUpdateProgress(ctx);
+
+ OutputPluginPrepareWrite(ctx, true);
+ logicalrep_write_commit(ctx->out, txn, abort_lsn, false);
+ OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ OutputPluginUpdateProgress(ctx);
+
+ OutputPluginPrepareWrite(ctx, true);
+ logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+ OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ OutputPluginUpdateProgress(ctx);
+
+ OutputPluginPrepareWrite(ctx, true);
+ logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+ OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ OutputPluginUpdateProgress(ctx);
+
+ OutputPluginPrepareWrite(ctx, true);
+ logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
OutputPluginWrite(ctx, true);
}
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index f05cde202f..5a4da6efab 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -44,6 +44,7 @@ extern GlobalTransaction MarkAsPreparing(TransactionId xid, const char *gid,
extern void StartPrepare(GlobalTransaction gxact);
extern void EndPrepare(GlobalTransaction gxact);
extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
+extern bool LookupGXact(const char *gid);
extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
int *nxids_p);
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 116f16f42d..11e3d67223 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -25,7 +25,7 @@
* connect time.
*/
#define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_VERSION_NUM 2
/* Tuple coming via logical replication. */
typedef struct LogicalRepTupleData
@@ -68,20 +68,55 @@ typedef struct LogicalRepBeginData
TransactionId xid;
} LogicalRepBeginData;
+/* Commit (and abort) information */
typedef struct LogicalRepCommitData
{
+ bool is_commit;
XLogRecPtr commit_lsn;
XLogRecPtr end_lsn;
TimestampTz committime;
} LogicalRepCommitData;
+/* types of the commit protocol message */
+#define LOGICALREP_IS_COMMIT 0x01
+#define LOGICALREP_IS_ABORT 0x02
+
+/* commit message is COMMIT or ABORT, and there is nothing else */
+#define CommitFlagsAreValid(flags) \
+ ((flags == LOGICALREP_IS_COMMIT) || (flags == LOGICALREP_IS_ABORT))
+
+/* Prepare protocol information */
+typedef struct LogicalRepPrepareData
+{
+ uint8 prepare_type;
+ XLogRecPtr prepare_lsn;
+ XLogRecPtr end_lsn;
+ TimestampTz preparetime;
+ char gid[GIDSIZE];
+} LogicalRepPrepareData;
+
+/* types of the prepare protocol message */
+#define LOGICALREP_IS_PREPARE 0x01
+#define LOGICALREP_IS_COMMIT_PREPARED 0x02
+#define LOGICALREP_IS_ROLLBACK_PREPARED 0x04
+
+/* prepare can be exactly one of PREPARE, [COMMIT|ROLLBACK] PREPARED */
+#define PrepareFlagsAreValid(flags) \
+ ((flags == LOGICALREP_IS_PREPARE) || \
+ (flags == LOGICALREP_IS_COMMIT_PREPARED) || \
+ (flags == LOGICALREP_IS_ROLLBACK_PREPARED))
+
extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
extern void logicalrep_read_begin(StringInfo in,
LogicalRepBeginData *begin_data);
extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
- XLogRecPtr commit_lsn);
+ XLogRecPtr commit_lsn, bool is_commit);
extern void logicalrep_read_commit(StringInfo in,
LogicalRepCommitData *commit_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+ LogicalRepPrepareData * prepare_data);
extern void logicalrep_write_origin(StringInfo out, const char *origin,
XLogRecPtr origin_lsn);
extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
diff --git a/src/test/subscription/t/010_twophase.pl b/src/test/subscription/t/010_twophase.pl
new file mode 100644
index 0000000000..c7f373df93
--- /dev/null
+++ b/src/test/subscription/t/010_twophase.pl
@@ -0,0 +1,163 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 12;
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf(
+ 'postgresql.conf', qq(
+ max_prepared_transactions = 10
+ ));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf(
+ 'postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres', "CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO tab_full SELECT generate_series(1,10)");
+$node_publisher->safe_psql('postgres', "CREATE TABLE tab_full2 (x text)");
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO tab_full2 VALUES ('a'), ('b'), ('b')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_full2 (x text)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+"ALTER PUBLICATION tap_pub ADD TABLE tab_full, tab_full2"
+);
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+ or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+ or die "Timed out while waiting for subscriber to synchronize data";
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres',
+ "BEGIN;INSERT INTO tab_full VALUES (11);PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+ or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+ is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+ "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+ or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+ is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+$result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+ is($result, qq(0), 'transaction is committed on subscriber');
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres',
+ "BEGIN;INSERT INTO tab_full VALUES (12);PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+ or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+ is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+ "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+ or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+ is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+ is($result, qq(0), 'transaction is aborted on subscriber');
+
+# Check that commit prepared is decoded properly on crash restart
+$node_publisher->safe_psql('postgres', "
+ BEGIN;
+ INSERT INTO tab_full VALUES (12);
+ INSERT INTO tab_full VALUES (13);
+ PREPARE TRANSACTION 'test_prepared_tab';");
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+ or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (11,12);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+# TODO add test cases involving DDL. This can be added after we add functionality
+# to replicate DDL changes to subscriber.
+
+# check all the cleanup
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+ "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+ "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+ "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+ 'check subscription relation status was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+ "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+ "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
--
2.14.3 (Apple Git-98)
Attachment: 0006-Teach-test_decoding-plugin-to-work-with-2PC.patch (application/octet-stream)
From 8edbf35aa32a5d43cae15c51008f38b19077e443 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Wed, 28 Mar 2018 19:57:58 +0530
Subject: [PATCH 6/7] Teach test_decoding plugin to work with 2PC
Includes a new option "enable-twophase". Depending on this option's
value, PREPARE TRANSACTION will either be decoded or treated as
a single-phase commit later.
---
contrib/test_decoding/expected/prepared.out | 262 +++++++++++++++++++++++++---
contrib/test_decoding/sql/prepared.sql | 90 +++++++++-
contrib/test_decoding/test_decoding.c | 147 ++++++++++++++++
3 files changed, 470 insertions(+), 29 deletions(-)
diff --git a/contrib/test_decoding/expected/prepared.out b/contrib/test_decoding/expected/prepared.out
index 46e915d4ff..4086a23f63 100644
--- a/contrib/test_decoding/expected/prepared.out
+++ b/contrib/test_decoding/expected/prepared.out
@@ -6,19 +6,85 @@ SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_d
init
(1 row)
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc', 'test_decoding');
+ ?column?
+----------
+ init
+(1 row)
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+-- Reused queries
+\set get_no2pc 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'');'
+\set get_with2pc 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot_2pc'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'', ''enable-twophase'', ''1'');'
-- test simple successful use of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (1);
PREPARE TRANSACTION 'test_prepared#1';
+:get_no2pc
+ data
+------
+(0 rows)
+
+:get_with2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ PREPARE TRANSACTION 'test_prepared#1'
+(3 rows)
+
COMMIT PREPARED 'test_prepared#1';
+:get_no2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ COMMIT
+(3 rows)
+
+:get_with2pc
+ data
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
INSERT INTO test_prepared1 VALUES (2);
-- test abort of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (3);
PREPARE TRANSACTION 'test_prepared#2';
+:get_no2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+(3 rows)
+
+:get_with2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(6 rows)
+
ROLLBACK PREPARED 'test_prepared#2';
+:get_no2pc
+ data
+------
+(0 rows)
+
+:get_with2pc
+ data
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
INSERT INTO test_prepared1 VALUES (4);
-- test prepared xact containing ddl
BEGIN;
@@ -26,45 +92,193 @@ INSERT INTO test_prepared1 VALUES (5);
ALTER TABLE test_prepared1 ADD COLUMN data text;
INSERT INTO test_prepared1 VALUES (6, 'frakbar');
PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
-INSERT INTO test_prepared2 VALUES (7);
-COMMIT PREPARED 'test_prepared#3';
--- make sure stuff still works
-INSERT INTO test_prepared1 VALUES (8);
-INSERT INTO test_prepared2 VALUES (9);
--- cleanup
-DROP TABLE test_prepared1;
-DROP TABLE test_prepared2;
--- show results
-SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'test_prepared1'::regclass;
+ relation | locktype | mode
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
+:get_no2pc
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4
+ COMMIT
+(3 rows)
+
+:get_with2pc
data
-------------------------------------------------------------------------
BEGIN
- table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:4
COMMIT
BEGIN
- table public.test_prepared1: INSERT: id[integer]:2
- COMMIT
+ table public.test_prepared1: INSERT: id[integer]:5
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(7 rows)
+
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
+INSERT INTO test_prepared2 VALUES (7);
+:get_no2pc
+ data
+----------------------------------------------------
BEGIN
- table public.test_prepared1: INSERT: id[integer]:4
+ table public.test_prepared2: INSERT: id[integer]:7
COMMIT
+(3 rows)
+
+:get_with2pc
+ data
+----------------------------------------------------
BEGIN
table public.test_prepared2: INSERT: id[integer]:7
COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+:get_no2pc
+ data
+-------------------------------------------------------------------------
BEGIN
table public.test_prepared1: INSERT: id[integer]:5
table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
COMMIT
+(4 rows)
+
+:get_with2pc
+ data
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#3'
+(1 row)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (8);
+INSERT INTO test_prepared2 VALUES (9);
+:get_no2pc
+ data
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:9
+ COMMIT
+(6 rows)
+
+:get_with2pc
+ data
+--------------------------------------------------------------------
BEGIN
table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
COMMIT
BEGIN
table public.test_prepared2: INSERT: id[integer]:9
COMMIT
-(22 rows)
+(6 rows)
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'pg_class'::regclass;
+ relation | locktype | mode
+----------+----------+------
+(0 rows)
+
+-- Shouldn't see anything with 2pc decoding off
+:get_no2pc
+ data
+-----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:12
+ COMMIT
+(3 rows)
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+:get_with2pc
+ data
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:12
+ PREPARE TRANSACTION 'test_prepared_lock2'
+ COMMIT PREPARED 'test_prepared_lock2'
+(8 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- Both will work normally after we commit
+:get_no2pc
+ data
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ COMMIT
+(4 rows)
+
+:get_with2pc
+ data
+--------------------------------------
+ COMMIT PREPARED 'test_prepared_lock'
+(1 row)
+
+-- test savepoints
+BEGIN;
+SAVEPOINT test_savepoint;
+CREATE TABLE test_prepared_savepoint (a int);
+PREPARE TRANSACTION 'test_prepared_savepoint';
+COMMIT PREPARED 'test_prepared_savepoint';
+:get_with2pc
+ data
+-------------------------------------------
+ COMMIT PREPARED 'test_prepared_savepoint'
+(1 row)
+
+:get_with2pc
+ data
+------
+(0 rows)
+
+-- cleanup
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+:get_no2pc
+ data
+------
+(0 rows)
+
+:get_with2pc
+ data
+------
+(0 rows)
SELECT pg_drop_replication_slot('regression_slot');
pg_drop_replication_slot
@@ -72,3 +286,9 @@ SELECT pg_drop_replication_slot('regression_slot');
(1 row)
+SELECT pg_drop_replication_slot('regression_slot_2pc');
+ pg_drop_replication_slot
+--------------------------
+
+(1 row)
+
diff --git a/contrib/test_decoding/sql/prepared.sql b/contrib/test_decoding/sql/prepared.sql
index e72639767e..cb32abd740 100644
--- a/contrib/test_decoding/sql/prepared.sql
+++ b/contrib/test_decoding/sql/prepared.sql
@@ -1,22 +1,35 @@
-- predictability
SET synchronous_commit = on;
SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc', 'test_decoding');
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+
+-- Reused queries
+\set get_no2pc 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'');'
+\set get_with2pc 'SELECT data FROM pg_logical_slot_get_changes(''regression_slot_2pc'', NULL, NULL, ''include-xids'', ''0'', ''skip-empty-xacts'', ''1'', ''enable-twophase'', ''1'');'
-- test simple successful use of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (1);
PREPARE TRANSACTION 'test_prepared#1';
+:get_no2pc
+:get_with2pc
COMMIT PREPARED 'test_prepared#1';
+:get_no2pc
+:get_with2pc
INSERT INTO test_prepared1 VALUES (2);
-- test abort of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (3);
PREPARE TRANSACTION 'test_prepared#2';
+:get_no2pc
+:get_with2pc
ROLLBACK PREPARED 'test_prepared#2';
+:get_no2pc
+:get_with2pc
INSERT INTO test_prepared1 VALUES (4);
@@ -27,24 +40,85 @@ ALTER TABLE test_prepared1 ADD COLUMN data text;
INSERT INTO test_prepared1 VALUES (6, 'frakbar');
PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'test_prepared1'::regclass;
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
+:get_no2pc
+:get_with2pc
+
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
INSERT INTO test_prepared2 VALUES (7);
+:get_no2pc
+:get_with2pc
COMMIT PREPARED 'test_prepared#3';
+:get_no2pc
+:get_with2pc
-- make sure stuff still works
INSERT INTO test_prepared1 VALUES (8);
INSERT INTO test_prepared2 VALUES (9);
+:get_no2pc
+:get_with2pc
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'pg_class'::regclass;
+
+-- Shouldn't see anything with 2pc decoding off
+:get_no2pc
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+:get_with2pc
+RESET statement_timeout;
+
+COMMIT PREPARED 'test_prepared_lock';
+
+-- Both will work normally after we commit
+:get_no2pc
+:get_with2pc
+
+-- test savepoints
+BEGIN;
+SAVEPOINT test_savepoint;
+CREATE TABLE test_prepared_savepoint (a int);
+PREPARE TRANSACTION 'test_prepared_savepoint';
+COMMIT PREPARED 'test_prepared_savepoint';
+:get_with2pc
+:get_with2pc
-- cleanup
DROP TABLE test_prepared1;
DROP TABLE test_prepared2;
--- show results
-SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+-- show results. There should be nothing to show
+:get_no2pc
+:get_with2pc
+
SELECT pg_drop_replication_slot('regression_slot');
+SELECT pg_drop_replication_slot('regression_slot_2pc');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index a94aeeae29..2a523f2108 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -36,6 +36,7 @@ typedef struct
bool skip_empty_xacts;
bool xact_wrote_changes;
bool only_local;
+ bool enable_twophase;
} TestDecodingData;
static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -49,6 +50,8 @@ static void pg_output_begin(LogicalDecodingContext *ctx,
bool last_write);
static void pg_decode_commit_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pg_decode_abort_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr abort_lsn);
static void pg_decode_change(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, Relation rel,
ReorderBufferChange *change);
@@ -58,6 +61,18 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr message_lsn,
bool transactional, const char *prefix,
Size sz, const char *message);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
void
_PG_init(void)
@@ -75,9 +90,14 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
cb->begin_cb = pg_decode_begin_txn;
cb->change_cb = pg_decode_change;
cb->commit_cb = pg_decode_commit_txn;
+ cb->abort_cb = pg_decode_abort_txn;
cb->filter_by_origin_cb = pg_decode_filter;
cb->shutdown_cb = pg_decode_shutdown;
cb->message_cb = pg_decode_message;
+ cb->filter_prepare_cb = pg_decode_filter_prepare;
+ cb->prepare_cb = pg_decode_prepare_txn;
+ cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+ cb->abort_prepared_cb = pg_decode_abort_prepared_txn;
}
@@ -97,6 +117,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
data->include_timestamp = false;
data->skip_empty_xacts = false;
data->only_local = false;
+ data->enable_twophase = false;
ctx->output_plugin_private = data;
@@ -178,6 +199,16 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
errmsg("could not parse value \"%s\" for parameter \"%s\"",
strVal(elem->arg), elem->defname)));
}
+ else if (strcmp(elem->defname, "enable-twophase") == 0)
+ {
+ if (elem->arg == NULL)
+ data->enable_twophase = true;
+ else if (!parse_bool(strVal(elem->arg), &data->enable_twophase))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("could not parse value \"%s\" for parameter \"%s\"",
+ strVal(elem->arg), elem->defname)));
+ }
else
{
ereport(ERROR,
@@ -246,6 +277,112 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
OutputPluginWrite(ctx, true);
}
+/* ABORT callback */
+static void
+pg_decode_abort_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+ if (data->include_xids)
+ appendStringInfo(ctx->out, "ABORT %u", txn->xid);
+ else
+ appendStringInfoString(ctx->out, "ABORT");
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* Filter out two-phase transactions, if decoding not enabled. */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ /* treat all transactions as one-phase */
+ if (!data->enable_twophase)
+ return true;
+
+ return false;
+}
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* ABORT PREPARED callback */
+static void
+pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
static bool
pg_decode_filter(LogicalDecodingContext *ctx,
RepOriginId origin_id)
@@ -411,9 +548,19 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
}
data->xact_wrote_changes = true;
+ /*
+ * Prevent transaction abort while accessing catalogs. Bail out if
+ * transaction already aborted.
+ */
+ if (!LogicalLockTransaction(txn))
+ return;
+
class_form = RelationGetForm(relation);
tupdesc = RelationGetDescr(relation);
+ /* Make sure to release the decoding catalog lock. */
+ LogicalUnlockTransaction(txn);
+
/* Avoid leaking memory by using and resetting our own context */
old = MemoryContextSwitchTo(data->context);
--
2.14.3 (Apple Git-98)
Attachment: 0007-Additional-optional-test-case-to-demonstrate-decoding-rollbac.patch (application/octet-stream)
From a1493caab1d5629b2bf716512be751b47cff4d66 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Wed, 28 Mar 2018 20:27:02 +0530
Subject: [PATCH 7/7] Additional test case to demonstrate decoding/rollback
interlocking
Introduce a decode-delay parameter in the test_decoding plugin. Based
on the value provided, the plugin sleeps for that many seconds on each
change while holding the lock taken via LogicalLockTransaction. A
concurrent rollback is fired off, which aborts that transaction in the
meantime.
---
contrib/test_decoding/Makefile | 5 +-
contrib/test_decoding/t/001_twophase.pl | 102 ++++++++++++++++++++++++
contrib/test_decoding/test_decoding.c | 24 ++++++
src/backend/replication/logical/reorderbuffer.c | 5 ++
4 files changed, 135 insertions(+), 1 deletion(-)
create mode 100755 contrib/test_decoding/t/001_twophase.pl
diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 6c18189d9d..79b9622600 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -26,7 +26,7 @@ installcheck:;
# installation, allow to do so, but only if requested explicitly.
installcheck-force: regresscheck-install-force isolationcheck-install-force
-check: regresscheck isolationcheck
+check: regresscheck isolationcheck 2pc-check
submake-regress:
$(MAKE) -C $(top_builddir)/src/test/regress all
@@ -66,3 +66,6 @@ isolationcheck-install-force: all | submake-isolation submake-test_decoding temp
isolationcheck isolationcheck-install-force
temp-install: EXTRA_INSTALL=contrib/test_decoding
+
+2pc-check: temp-install
+ $(prove_check)
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100755
index 0000000000..d50e2c9940
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,102 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+ 'postgresql.conf', qq(
+ max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+ "INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+ "SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+$node_logical->safe_psql('postgres',
+ "SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot2', 'test_decoding');");
+
+# This test exercises a concurrent abort while logical decoding is in
+# progress. The decode-delay value makes the plugin sleep for that many
+# seconds on each change, while holding the lock taken by
+# LogicalLockTransaction. We fire off a ROLLBACK from another session while
+# this delayed decode is ongoing. Since the lock is held, the ROLLBACK waits
+# for the decoding backends to call LogicalUnlockTransaction. Decoding stops
+# right after that, so the next pg_logical_slot_get_changes call should show
+# only a few records decoded from the entire two-phase transaction.
+#
+# We use two slots to test multiple decoding backends here
+
+$node_logical->safe_psql('postgres', "
+ BEGIN;
+ INSERT INTO tab VALUES (11);
+ INSERT INTO tab VALUES (12);
+ ALTER TABLE tab ADD COLUMN b INT;
+ PREPARE TRANSACTION 'test_prepared_tab';");
+
+# start decoding the above with decode-delay in the background.
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should only decode 1 INSERT record and should include
+# an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1', 'decode-delay', '3');\" \&");
+
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot2', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1', 'decode-delay', '3');\" \&");
+
+# sleep for a little while (shorter than decode-delay)
+$node_logical->safe_psql('postgres', "select pg_sleep(1)");
+
+# rollback the prepared transaction whose first record is being decoded
+# after sleeping for decode-delay time
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# wait for decoding to stop
+$node_logical->psql('postgres', "select pg_sleep(4)");
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot2', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+
+# check for occurrence of log about waiting backends
+my $output_file = slurp_file($node_logical->logfile());
+my $waiting_str = "Waiting for backends to abort";
+like($output_file, qr/$waiting_str/, "Waiting log found in server log");
+
+# check for occurrence of log about stopping decoding
+my $abort_str = "stopping decoding of test_prepared_tab ";
+like($output_file, qr/$abort_str/, "ABORT found in server log");
+
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+ BEGIN;
+ INSERT INTO tab VALUES (11);
+ INSERT INTO tab VALUES (12);
+ ALTER TABLE tab ADD COLUMN b INT;
+ INSERT INTO tab VALUES (13, 11);
+ PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot2', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit after the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot2', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot2');");
+$node_logical->stop('fast');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 2a523f2108..b2d6358b10 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -37,6 +37,7 @@ typedef struct
bool xact_wrote_changes;
bool only_local;
bool enable_twophase;
+ int decode_delay; /* seconds to sleep for each change record */
} TestDecodingData;
static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -118,6 +119,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
data->skip_empty_xacts = false;
data->only_local = false;
data->enable_twophase = false;
+ data->decode_delay = 0;
ctx->output_plugin_private = data;
@@ -209,6 +211,21 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
errmsg("could not parse value \"%s\" for parameter \"%s\"",
strVal(elem->arg), elem->defname)));
}
+ else if (strcmp(elem->defname, "decode-delay") == 0)
+ {
+ if (elem->arg == NULL)
+ data->decode_delay = 2; /* default to 2 seconds */
+ else
+ data->decode_delay = pg_atoi(strVal(elem->arg),
+ sizeof(int), 0);
+
+ if (data->decode_delay <= 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("Specify positive value for parameter \"%s\","
+ " you specified \"%s\"",
+ elem->defname, strVal(elem->arg))));
+ }
else
{
ereport(ERROR,
@@ -555,6 +572,13 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
if (!LogicalLockTransaction(txn))
return;
+ /* if decode_delay is specified, sleep with above lock held */
+ if (data->decode_delay > 0)
+ {
+ elog(LOG, "sleeping for %d seconds", data->decode_delay);
+ pg_usleep(data->decode_delay * 1000000L);
+ }
+
class_form = RelationGetForm(relation);
tupdesc = RelationGetDescr(relation);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 67faae1b9e..7efaf6bed8 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1386,7 +1386,12 @@ ReorderBufferCommitInternal(ReorderBufferTXN *txn,
/* Lock transaction before catalog access */
if (!LogicalLockTransaction(txn))
+ {
+ elog(LOG, "stopping decoding of %s (%u)",
+ txn->gid[0] != '\0'? txn->gid:"",
+ txn->xid);
break;
+ }
relation = RelationIdGetRelation(reloid);
--
2.14.3 (Apple Git-98)
On 28 March 2018 at 16:28, Nikhil Sontakke <nikhils@2ndquadrant.com> wrote:
Simon, 0003-Add-GID-and-replica-origin-to-two-phase-commit-abort.patch
is the exact patch that you had posted for an earlier commit.
0003 Pushed
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
I've been reviewing the last patch version, focusing mostly on the
decoding group part. Let me respond to several points first, then new
review bits.
On 03/28/2018 05:28 PM, Nikhil Sontakke wrote:
Hi Tomas,
Now, about the interlock implementation - I see you've reused the "lock
group" concept from parallel query. That may make sense, unfortunately
there's about no documentation explaining how it works, what is the
"protocol" etc. There is fairly extensive documentation for "lock
groups" in src/backend/storage/lmgr/README, but while the "decoding
group" code is inspired by it, the code is actually very different.
Compare for example BecomeLockGroupLeader and BecomeDecodeGroupLeader,
and you'll see what I mean.
So I think the first thing we need to do is add proper documentation
(possibly into the same README), explaining how the decode groups work,
how the decodeAbortPending works, etc.
I have added details about this in src/backend/storage/lmgr/README as
suggested by you.
Thanks. I think the README is a good start, but I think we also need to
improve the comments, which are usually more detailed than the README.
For example, it's not quite acceptable that LogicalLockTransaction and
LogicalUnlockTransaction have about no comments, especially when it's
meant to be public API for decoding plugins.
BTW, do we need to do any of this with (wal_level < logical)? I don't
see any quick bail-out in any of the functions in this case, but it
seems like a fairly obvious optimization.
The calls to the LogicalLockTransaction/LogicalUnlockTransaction APIs
will be from inside plugins or the reorderbuffer code paths. Those
will get invoked only in the wal_level logical case, hence I did not
add further checks.
Oh, right.
Similarly, can't the logical workers indicate that they need to decode
2PC transactions (or in-progress transactions in general) in some way?
If we knew there are no such workers, that would also allow ignoring the
interlock, no?
These APIs check if the transaction is already committed and cache
that information for further calls, so for regular transactions this
becomes a no-op.
I see. So when the output plugin never calls LogicalLockTransaction on
an in-progress transaction (e.g. 2PC after PREPARE), it never actually
initializes the decoding group. Works for me.
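For illustration, the fast path described above can be modelled in a few lines of standalone C. This is only a sketch: SketchTxn and the SK_* flag bits are hypothetical stand-ins for ReorderBufferTXN and the rbtxn_* macros, and the order of the checks follows the LogicalLockTransaction hunk in the attached review diff. It shows that regular and already-finished transactions never reach the decoding-group code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the transaction flag bits used by the patch. */
#define SK_HAS_CATALOG_CHANGES 0x01u
#define SK_PREPARED            0x02u
#define SK_COMMIT              0x04u
#define SK_ROLLBACK            0x08u

typedef struct SketchTxn
{
    unsigned    flags;
} SketchTxn;

typedef enum SketchDecision
{
    SK_SAFE,            /* catalog access is safe, no interlock needed */
    SK_STOP,            /* transaction aborted, stop decoding it */
    SK_JOIN_GROUP       /* in-progress 2PC xact, must join the group */
} SketchDecision;

/* Decide what the lock call has to do, using only cached flags. */
static SketchDecision
sketch_lock_fast_path(const SketchTxn *txn)
{
    assert(txn != NULL);

    /* No catalog changes: nothing a concurrent abort could break. */
    if (!(txn->flags & SK_HAS_CATALOG_CHANGES))
        return SK_SAFE;

    /* Only prepared transactions are decoded before commit. */
    if (!(txn->flags & SK_PREPARED))
        return SK_SAFE;

    /* Cached outcome: committed is safe, aborted means stop. */
    if (txn->flags & SK_COMMIT)
        return SK_SAFE;
    if (txn->flags & SK_ROLLBACK)
        return SK_STOP;

    /* Still in progress: the real code would join the decoding group. */
    return SK_JOIN_GROUP;
}
```

Only the last case costs anything; every other path returns after a few flag tests.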
2) regression tests
-------------------
I really dislike the use of \set to run the same query repeatedly. It
makes analysis of regression failures even more tedious than it already
is. I'd just copy the query to all the places.
They are long-winded queries and IMO made the test file look too
cluttered and verbose.
Well, I don't think the verbosity is a major problem, and the \set
approach certainly makes it more difficult to investigate regression
failures.
3) worker.c
-----------
The comment in apply_handle_rollback_prepared_txn says this:
/*
* During logical decoding, on the apply side, it's possible that a
* prepared transaction got aborted while decoding. In that case, we
* stop the decoding and abort the transaction immediately. However
* the ROLLBACK prepared processing still reaches the subscriber. In
* that case it's ok to have a missing gid
*/
if (LookupGXact(commit_data->gid)) { ... }
But is it safe to assume it never happens due to an error? In other
words, is there a way to decide that the GID really aborted? Or, why
should the provider send the rollback at all - surely it could know if
the transaction/GID was sent to subscriber or not, right?
Since we decode in commit WAL order, when we reach the ROLLBACK
PREPARED WAL record, we cannot be sure that we did in fact abort the
decoding midway because of this concurrent rollback. It's possible
that this rollback comes much later as well, when all decoding
backends have successfully prepared it on the subscribers already.
Ah, OK. So when the transaction gets aborted (by ROLLBACK PREPARED)
concurrently with the decoding, we abort the apply transaction and
discard the ReorderBufferTXN.
Which means that later, when we decode the abort, we don't know whether
the decoding reached abort or prepare, and so we have to send the
ROLLBACK PREPARED to the subscriber too.
For a moment I was thinking we might simply remember TXN outcome in
reorder buffer, but obviously that does not work - the decoding might
restart in between, and as you say the distance (in terms of WAL) may be
quite significant.
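A toy model of the subscriber side may make the consequence clearer. This is a sketch under loud assumptions: SketchPrepared and the sketch_* functions are hypothetical, and an in-memory array stands in for the real prepared-transaction state that LookupGXact consults. The point is only that ROLLBACK PREPARED for an unknown GID is treated as a no-op rather than an error.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define SK_MAX_PREPARED 4
#define SK_GID_SIZE     64

/* Toy stand-in for the subscriber's set of prepared transactions. */
typedef struct SketchPrepared
{
    char        gid[SK_MAX_PREPARED][SK_GID_SIZE];
    bool        used[SK_MAX_PREPARED];
} SketchPrepared;

/* Return the slot holding gid, or -1 (models a failed GID lookup). */
static int
sketch_lookup(const SketchPrepared *set, const char *gid)
{
    assert(set != NULL && gid != NULL);
    for (int i = 0; i < SK_MAX_PREPARED; i++)
        if (set->used[i] && strcmp(set->gid[i], gid) == 0)
            return i;
    return -1;
}

/* Record a PREPARE that reached the subscriber; false if it does not fit. */
static bool
sketch_prepare(SketchPrepared *set, const char *gid)
{
    if (strlen(gid) >= SK_GID_SIZE)
        return false;
    for (int i = 0; i < SK_MAX_PREPARED; i++)
    {
        if (!set->used[i])
        {
            strcpy(set->gid[i], gid);
            set->used[i] = true;
            return true;
        }
    }
    return false;
}

/*
 * Apply ROLLBACK PREPARED. Returns true if a prepared transaction was
 * rolled back, false if the GID was never prepared here (decoding was
 * interrupted before PREPARE was sent). The latter is not an error.
 */
static bool
sketch_apply_rollback_prepared(SketchPrepared *set, const char *gid)
{
    int         slot = sketch_lookup(set, gid);

    if (slot < 0)
        return false;
    set->used[slot] = false;
    return true;
}
```

Note that the sketch cannot tell "never prepared" apart from "lost due to an error", which is exactly the concern raised in the question above.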
7) proto.c / worker.c
---------------------
Until now, the 'action' (essentially the first byte of each message)
clearly identified what the message does. So 'C' -> commit, 'I' ->
insert, 'D' -> delete etc. This also means the "handle" methods were
inherently simple, because each handled exactly one particular action
and nothing else.
You've expanded the protocol in a way that suddenly 'C' means either
COMMIT or ROLLBACK, and 'P' means PREPARE, ROLLBACK PREPARED or COMMIT
PREPARED. I don't think that's how the protocol should be extended - if
anything, it's damn confusing and unlike the existing code. You should
define new actions, and keep the handlers in worker.c simple.
I thought this grouped regular commit and 2PC transactions properly.
Can look at this again if this style is not favored.
Hmmm, it's not how I'd do it, but perhaps someone who originally
designed the protocol should review this bit.
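To make the "one action byte, one handler" style concrete, here is a minimal sketch. The 'C', 'I' and 'D' bytes are the ones named above; the three byte values for the 2PC messages are assumptions picked only so that they are distinct, and sketch_dispatch is a hypothetical stand-in for the dispatch in worker.c, not the actual code.

```c
#include <assert.h>
#include <string.h>

/*
 * Action bytes. The SK_2PC_* values are hypothetical and exist only to
 * show one distinct byte per message kind.
 */
typedef enum SketchAction
{
    SK_COMMIT = 'C',
    SK_INSERT = 'I',
    SK_DELETE = 'D',
    SK_2PC_PREPARE = 'P',
    SK_2PC_COMMIT_PREPARED = 'K',
    SK_2PC_ROLLBACK_PREPARED = 'r'
} SketchAction;

/* Map the first byte of a message to the name of its single handler. */
static const char *
sketch_dispatch(char action)
{
    switch (action)
    {
        case SK_COMMIT:
            return "commit";
        case SK_INSERT:
            return "insert";
        case SK_DELETE:
            return "delete";
        case SK_2PC_PREPARE:
            return "prepare";
        case SK_2PC_COMMIT_PREPARED:
            return "commit prepared";
        case SK_2PC_ROLLBACK_PREPARED:
            return "rollback prepared";
        default:
            return "unknown";
    }
}
```

With this layout no handler needs to inspect a second field to learn which of several operations it was given.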
Now, the new bits ... attached is a .diff with a couple of changes and
comments on various places.
1) LogicalLockTransaction
- This function is part of a public API, yet it has no comment. That
needs fixing - it has to be clear how to use it. The .diff suggests a
comment, but it may need improvements.
- As I mentioned in the previous review, BecomeDecodeGroupLeader is a
misleading name. It suggests the caller becomes a leader, while in fact
it looks up the PROC running the XID and makes it a leader. This is
obviously due to copying the code from lock groups, where the caller
actually becomes the leader. It's incorrect here. I suggest something
like LookupDecodeGroupLeader().
- In the "if (MyProc->decodeGroupLeader == NULL)" block there are two
blocks rechecking the transaction status:
if (proc == NULL)
{ ... recheck ... }
if (!BecomeDecodeGroupMember(proc, proc->pid, rbtxn_prepared(txn)))
{ ... recheck ...}
I suggest joining them into a single block.
- This Assert() is either bogus and there can indeed be cases with
(MyProc->decodeGroupLeader==NULL), or the "if" is unnecessary:
Assert(MyProc->decodeGroupLeader);
if (MyProc->decodeGroupLeader) { ... }
- I'm wondering why we're maintaining decodeAbortPending flags both for
the leader and all the members. ISTM it'd be perfectly fine to only
check the leader, particularly because RemoveDecodeGroupMemberLocked
removes the members from the decoding group. So that seems unnecessary,
and we can remove the
if (MyProc->decodeAbortPending)
{ ... }
- LogicalUnlockTransaction needs a comment(s) too.
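As a sketch of the "single block" suggestion in the list above: because || short-circuits, the NULL check and the join attempt can share one recheck path without dereferencing a NULL proc. SketchProc, sketch_join_group and sketch_try_join are hypothetical names; the join condition is simplified and is not the patch's BecomeDecodeGroupMember logic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct SketchProc
{
    int         pid;
    bool        abort_pending;
} SketchProc;

/*
 * Simplified analogue of joining the group: fails if the leader now has
 * a different pid or is already aborting.
 */
static bool
sketch_join_group(const SketchProc *leader, int pid)
{
    return leader->pid == pid && !leader->abort_pending;
}

/*
 * Returns true if the caller joined the group. On failure *rechecks is
 * bumped once, standing in for the single "recheck xact status" block.
 */
static bool
sketch_try_join(const SketchProc *proc, int *rechecks)
{
    assert(rechecks != NULL);

    /* || short-circuits, so proc->pid is never read when proc is NULL. */
    if (proc == NULL || !sketch_join_group(proc, proc->pid))
    {
        (*rechecks)++;
        return false;
    }
    return true;
}
```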
2) BecomeDecodeGroupLeader
- Wrong name (already mentioned above).
- It can bail out when (!proc), which will simplify the code a bit.
- Why does it check PID of the process at all? Seems unnecessary,
considering we're already checking the XID.
- Can a proc executing a XID have a different leader? I don't think so,
so I'd make that an Assert().
Assert(!proc || (proc->decodeGroupLeader == proc));
And it'll allow simplification of some of the conditions.
- We're only dealing with prepared transactions now, so I'd just drop
the is_prepared flag - it'll make the code a bit simpler, we can add it
later in patch adding decoding of regular in-progress transactions. We
can't test the (!is_prepared) anyway.
- Why are we making the leader also a member of the group? Seems rather
unnecessary, and it complicates the abort handling, because we need to
skip the leader when deciding to wait.
3) LogicalDecodeRemoveTransaction
- It's not clear to me what happens when a decoding backend gets killed
between LogicalLockTransaction/LogicalUnlockTransaction. Doesn't that
mean LogicalDecodeRemoveTransaction will get stuck, because the proc is
still in the decoding group?
- The loop now tweaks decodeAbortPending of the members, but I don't
think that's necessary either - the LogicalUnlockTransaction can check
the leader flag just as easily.
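The two points above can be illustrated with a single-threaded model. It is a sketch only: there are no LWLocks and no real waiting, SketchGroup and the sketch_* functions are hypothetical, and the abort flag is kept on the leader alone, as suggested. It shows an abort left waiting when a member disappears while "locked", and how an exit-time cleanup hook (an assumption, not something the patch is known to contain) would release it.

```c
#include <assert.h>
#include <stdbool.h>

#define SK_MAX_MEMBERS 4

typedef struct SketchGroup
{
    bool        abort_pending;              /* kept on the leader only */
    bool        locked[SK_MAX_MEMBERS];     /* member is between lock/unlock */
} SketchGroup;

/* Decoder side: take the interlock; false means "stop decoding". */
static bool
sketch_lock(SketchGroup *g, int slot)
{
    assert(slot >= 0 && slot < SK_MAX_MEMBERS);
    if (g->abort_pending)
        return false;
    g->locked[slot] = true;
    return true;
}

/* Decoder side: release; false tells the caller an abort has started. */
static bool
sketch_unlock(SketchGroup *g, int slot)
{
    assert(slot >= 0 && slot < SK_MAX_MEMBERS);
    g->locked[slot] = false;
    return !g->abort_pending;
}

/* Aborting side: raise the flag and report whether we still have to wait. */
static bool
sketch_abort_must_wait(SketchGroup *g)
{
    g->abort_pending = true;
    for (int i = 0; i < SK_MAX_MEMBERS; i++)
        if (g->locked[i])
            return true;
    return false;
}

/* Hypothetical cleanup for a decoding backend that exits while locked. */
static void
sketch_backend_exit(SketchGroup *g, int slot)
{
    assert(slot >= 0 && slot < SK_MAX_MEMBERS);
    g->locked[slot] = false;
}
```

Without something like sketch_backend_exit, a member killed between lock and unlock keeps sketch_abort_must_wait returning true forever, which is the stuck-abort scenario asked about in 3).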
4) a bunch of comment / docs improvements, ...
I'm suggesting rewording a couple of comments. I've also added a couple
of missing comments - e.g. to LogicalLockTransaction and the lock group
methods in general.
Also, a couple more questions and suggestions in XXX comments.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
Attachments:
logical-2pc-decoding-review.diff (text/x-patch)
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index e2db0eb..8fbd8b8 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -551,8 +551,7 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held)
/*
* LookupGXact
- * Check if the prepared transaction with the given GID is
- * around
+ * Check if the prepared transaction with the given GID is around
*/
bool
LookupGXact(const char *gid)
@@ -1538,9 +1537,12 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
gid);
ProcArrayRemove(proc, latestXid);
+
/*
- * Tell logical decoding backends interested in this XID
- * that this is going away
+ * Coordinate with logical decoding backends that may be already
+ * decoding this prepared transaction. When aborting a transaction,
+ * we need to wait for all of them to leave the decoding group. If
+ * committing, we simply remove all members from the group.
*/
LogicalDecodeRemoveTransaction(proc, isCommit);
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 8025d99..c378157 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1220,45 +1220,90 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
}
}
+/*
+ * LogicalLockTransaction
+ * Make sure the transaction is not aborted during decoding.
+ *
+ * The logical decoding plugins may need to access catalogs (both system
+ * and user-defined), e.g. to get metadata about tuples, do custom
+ * filtering etc. While decoding committed transactions that is not an
+ * issue, but in-progress transactions may abort while being decoded, in
+ * which case the catalog access may fail in various ways (rows from
+ * aborted transactions are eligible for more aggressive cleanup, may
+ * not be accessible through indexes due to breaking HOT chains etc.).
+ *
+ * To prevent these issues, we need to prevent abort of the transaction
+ * while accessing any catalogs. To enforce that, each decoding backend
+ * has to call LogicalLockTransaction prior to any catalog access, and
+ * then LogicalUnlockTransaction immediately after it. These functions
+ * add/remove the decoding backend from a "decoding group" for a given
+ * transaction. While aborting a prepared transaction, the backend will
+ * wait for all current members of the decoding group to leave (see
+ * LogicalDecodeRemoveTransaction).
+ *
+ * The function returns true when it's safe to access catalogs, and
+ * false when the transaction aborted (or is being aborted) in which
+ * case the plugin should stop decoding it.
+ *
+ * The decoding backend joins the decoding group only when actually
+ * needed. For example when the transaction did no catalog changes,
+ * or when it's known to already have committed (or aborted), we can
+ * bail out without joining the group.
+ */
bool
LogicalLockTransaction(ReorderBufferTXN *txn)
{
bool ok = false;
/*
- * Prepared transactions and uncommitted transactions
- * that have modified catalogs need to interlock with
- * concurrent rollback to ensure that there are no
- * issues while decoding
+ * Transactions that have not modified catalogs do not need to
+ * join the decoding group.
*/
-
if (!rbtxn_has_catalog_changes(txn))
return true;
/*
- * Is it a prepared txn? Similar checks for uncommitted
- * transactions when we start supporting them
+ * Currently, only 2PC transactions can be decoded before commit
+ * (at prepare). So regular transactions are automatically safe.
+ *
+ * XXX This may be unnecessary, because regular transactions
+ * will be detected as committed.
*/
if (!rbtxn_prepared(txn))
return true;
- /* check cached status */
+ /*
+ * Check commit status. If a transaction already committed, there
+ * is no danger when accessing catalogs. If it aborted, we can
+ * stop decoding it right away.
+ */
if (rbtxn_commit(txn))
return true;
+
if (rbtxn_rollback(txn))
return false;
/*
- * Find the PROC that is handling this XID and add ourself as a
- * decodeGroupMember
+ * Find the PROC handling this XID and join the decoding group.
+ *
+ * If this is the first call for this XID, we don't know which
+ * PROC is executing the transaction (and acting as a leader).
+ * In that case we need to lookup the leader.
*/
if (MyProc->decodeGroupLeader == NULL)
{
+ /*
+ * FIXME The name is wrong - we're not becoming group leader,
+ * we're looking up the PROC that executes the transaction and
+ * making it the leader.
+ */
PGPROC *proc = BecomeDecodeGroupLeader(txn->xid, rbtxn_prepared(txn));
/*
- * If decodeGroupLeader is NULL, then the only possibility
- * is that the transaction completed and went away
+ * We have checked if the transaction committed/aborted, but it
+ * is possible the PROC went away since then, in which case we
+ * get decodeGroupLeader=NULL. We recheck transaction status,
+ * expecting it to be either committed or aborted.
*/
if (proc == NULL)
{
@@ -1275,7 +1320,18 @@ LogicalLockTransaction(ReorderBufferTXN *txn)
}
}
- /* Add ourself as a decodeGroupMember */
+ /*
+ * Join the decoding group for the leader process.
+ *
+ * We're not holding any locks on PGPROC, so it's possible the
+ * leader disappears, or starts executing another transaction.
+ * In that case we're done.
+ *
+ * XXX Why not to merge those two blocks? Something like
+ *
+ * if ((proc == NULL) || (!BecomeDecodeGroupMember(proc, proc->pid, rbtxn_prepared(txn))))
+ * { ... recheck xact status ...}
+ */
if (!BecomeDecodeGroupMember(proc, proc->pid, rbtxn_prepared(txn)))
{
Assert(!TransactionIdIsInProgress(txn->xid));
@@ -1294,7 +1350,10 @@ LogicalLockTransaction(ReorderBufferTXN *txn)
/*
* If we were able to add ourself, then Abort processing will
- * interlock with us. Check if the transaction is still around
+ * interlock with us. Check if the transaction is still around.
+ *
+ * XXX Eh? If the Assert() enforces (decodeGroupLeader!=NULL),
+ * then why is there an if condition?
*/
Assert(MyProc->decodeGroupLeader);
@@ -1304,6 +1363,12 @@ LogicalLockTransaction(ReorderBufferTXN *txn)
leader_lwlock = LockHashPartitionLockByProc(MyProc->decodeGroupLeader);
LWLockAcquire(leader_lwlock, LW_SHARED);
+
+ /*
+ * XXX Why to check this again? BecomeDecodeGroupMember already
+ * checks the flag on the leader, and returns false if it's set
+ * to true.
+ */
if (MyProc->decodeAbortPending)
{
/*
@@ -1339,12 +1404,21 @@ LogicalUnlockTransaction(ReorderBufferTXN *txn)
LWLock *leader_lwlock;
/*
+ * If the transaction is known to have aborted, we should have never got
+ * here (the plugin should have interrupted the decoding).
+ */
+ Assert(!rbtxn_rollback(txn));
+
+ /* If it's not locked, we're done. */
+ if (!MyProc->decodeLocked)
+ return;
+
+ /*
* Prepared transactions and uncommitted transactions
* that have modified catalogs need to interlock with
* concurrent rollback to ensure that there are no
* issues while decoding
*/
-
if (!rbtxn_has_catalog_changes(txn))
return;
@@ -1358,8 +1432,6 @@ LogicalUnlockTransaction(ReorderBufferTXN *txn)
/* check cached status */
if (rbtxn_commit(txn))
return;
- if (rbtxn_rollback(txn))
- return;
Assert(MyProc->decodeGroupLeader);
leader_lwlock = LockHashPartitionLockByProc(MyProc->decodeGroupLeader);
@@ -1372,8 +1444,12 @@ LogicalUnlockTransaction(ReorderBufferTXN *txn)
LWLockRelease(leader_lwlock);
LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
RemoveDecodeGroupMemberLocked(MyProc->decodeGroupLeader);
- /* reset the bool to let the leader know that we are going away */
+
+ /* reset the bool to let the leader know that we are going away
+ * XXX Why? Just removing ourselves from the group should be enough.
+ */
MyProc->decodeAbortPending = false;
+
txn->txn_flags |= RBTXN_ROLLBACK;
}
MyProc->decodeLocked = false;
diff --git a/src/backend/storage/lmgr/README b/src/backend/storage/lmgr/README
index 9742a34..8c25eda 100644
--- a/src/backend/storage/lmgr/README
+++ b/src/backend/storage/lmgr/README
@@ -682,30 +682,37 @@ PIDs are not recycled quickly enough for this interlock to fail.
Decode Group Locking
--------------------
-We use an infrastructure which is very similar to the above group locking
-of parallel processes to create a group of backends that are performing
-logical decoding of an uncommitted or a prepared transaction.
-
-Decode Group locking adds five new members to each PGPROC:
+When decoding in-progress transactions, we need to prevent aborts while
+the decoding processes are accessing catalogs, which might lead to issues
+if the transaction modified some of the catalogs. Currently this applies
+only to two-phase transactions, that may be decoded at PREPARE time, but
+in the future this may be extended to regular transactions too.
+
+To prevent that, the backend executing the abort is made to wait for all
+the decoding backends. We use an infrastructure which is very similar
+to the above group locking to form groups of backends performing logical
+decoding of the same in-progress transaction.
+
+Decode Group locking adds five new members to each PGPROC:
decodeGroupLeader, decodeGroupMembers, decodeGroupLink, decodeLocked and
decodeAbortPending. A PGPROC's decodeGroupLeader is NULL for processes
not involved in logical decoding. When a process wants to decode an
-uncommitted or prepared transaction then it finds out the PGPROC
-structure which is associated with that transaction id and makes that
-PGPROC structure as its decodeGroupLeader. The decodeGroupMembers field
-is only used in the leader; it is a list of the member PGPROCs of the
-decode group (the leader and all backends decoding this transaction id).
+in-progress transaction then it finds out the PGPROC structure which is
+associated with that transaction ID and makes that PGPROC structure as
+its decodeGroupLeader. The decodeGroupMembers field is only used in the
+leader; it is a list of the member PGPROCs of the decode group (the
+leader and all backends decoding this transaction ID).
The decodeGroupLink field is the list link for this list. The decoding
backend marks itself as decodeLocked while it is accessing catalog
-metadata for its decoding requirements via the
-LogicalLockTransaction API. It resets the same via the
-LogicalUnlockTransaction API. Meanwhile, if the transaction id of this
-uncommitted or prepared transaction decides to abort then the PGPROC
-structure corresponding to it sets decodeAbortPending on itself and also
-on all the decodeGroupMembers entries. The decodeGroupMembers entries
-stop decoding of this aborted transaction and exit. When all the
-decoding backends have exited then the aborting transaction goes ahead
-with its regular processing.
+metadata for its decoding requirements via the LogicalLockTransaction
+API. It resets the same via the LogicalUnlockTransaction API.
+
+Meanwhile, if the transaction ID of this in-progress transaction decides
+to abort, then the PGPROC corresponding to it sets decodeAbortPending
+on itself and also on all the decodeGroupMembers entries.
+
+The decodeGroupMembers entries stop decoding this transaction and exit.
+When all the decoding backends have exited the abort can proceed.
All five of these fields are considered to be protected by a lock manager
partition lock. The partition lock that protects these fields within a given
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 26d35c7..f72ea37 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -1906,10 +1906,18 @@ BecomeLockGroupMember(PGPROC *leader, int pid)
}
/*
- * BecomeDecodeGroupLeader - designate process as decode group leader
+ * BecomeDecodeGroupLeader
+ * Designate process as decode group leader.
*
- * Once this function has returned, other processes can join the decode group
- * by calling BecomeDecodeGroupMember.
+ * Once this function has returned, other processes can join the decode
+ * group by calling BecomeDecodeGroupMember.
+ *
+ * XXX Should be LookupDecodeGroupLeader() or something like that. For
+ * the lock groups the "become" makes sense, because the caller is the
+ * first process that also acts as a leader. Not like here.
+ *
+ * XXX We're only handling prepared transactions now, so let's get rid
+ * of the is_prepared flag. We can't check is_prepared=false anyway.
*/
PGPROC *
BecomeDecodeGroupLeader(TransactionId xid, bool is_prepared)
@@ -1920,32 +1928,95 @@ BecomeDecodeGroupLeader(TransactionId xid, bool is_prepared)
Assert(xid != InvalidTransactionId);
-
+ /*
+ * Lookup the backend executing this transaction.
+ *
+ * XXX If the transaction already completed, we can bail out. That
+ * is, we can do
+ *
+ * if (!proc)
+ * return NULL;
+ */
proc = BackendXidGetProc(xid);
- if (proc)
+ if (!proc)
pid = proc->pid;
/*
- * This proc will become decodeGroupLeader if it's
- * not already
+ * Process running a XID can't have a leader, it can only be
+ * a leader (in which case it points to itself).
+ */
+ Assert(!proc || (proc->decodeGroupLeader == proc));
+
+ /*
+ * This proc will become decodeGroupLeader if it's not already.
+ *
+ * XXX How could the proc have (decodeGroupLeader != NULL) and
+ * (decodeGroupLeader != proc)? That is, why not to make the
+ * condition (proc && proc->decodeGroupLeader != NULL). Perhaps
+ * if we don't clean it up correctly at transaction end?
+ *
+ * XXX Why not to make this into
+ *
+ * if (proc->decodeGroupLeader != NULL)
+ * return;
+ *
+ * or
+ *
+ * if (proc->decodeGroupLeader == NULL)
+ * {
+ * ...
+ * }
*/
if (proc && proc->decodeGroupLeader != proc)
{
volatile PGXACT *pgxact;
+
/* Create single-member group, containing this proc. */
leader_lwlock = LockHashPartitionLockByProc(proc);
LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+
/* recheck we are still the same */
pgxact = &ProcGlobal->allPgXact[proc->pgprocno];
+
+ /*
+ * We know the process was executing the XID a while ago, but we
+ * have not been holding any locks on PGPROC so it might have
+ * started executing something else since then. So we need to
+ * recheck that it is indeed still running the right XID.
+ *
+ * If it's not, the transaction must have already completed, so
+ * we don't need to create any decoding group.
+ *
+ * XXX Why the check on PID? BecomeLockGroupMember does that,
+ * but that does not mean we need to (it's not a parameter).
+ */
if (proc->pid == pid && pgxact->xid == xid)
{
+ /*
+ * XXX Seems unnecessary, and probably should be more like
+ *
+ * Assert((is_prepared && (pid == 0)) | (!is_prepared && (pid != 0)));
+ *
+ * Also, we only handle prepared xacts now anyway.
+ * */
if (is_prepared)
Assert(pid == 0);
+
+ /*
+ * Some other decoding backend might have marked the process
+ * as a leader before we acquired the lock. But it must not
+ * be follower of some other leader.
+ */
+ Assert((proc->decodeGroupLeader == NULL) ||
+ (proc->decodeGroupLeader == proc));
+
/* recheck if someone else did not already assign us */
- if (proc->decodeGroupLeader != proc)
+ if (proc->decodeGroupLeader == NULL)
{
- /* We had better not be a follower. */
- Assert(proc->decodeGroupLeader == NULL);
+ /*
+ * XXX Why do we make the leader also a member? Doesn't
+ * it just complicate the processing later?
+ */
proc->decodeGroupLeader = proc;
dlist_push_head(&proc->decodeGroupMembers,
&proc->decodeGroupLink);
@@ -2023,9 +2094,8 @@ BecomeDecodeGroupMember(PGPROC *leader, int pid, bool is_prepared)
}
/*
- * Remove a decodeGroupMember from the decodeGroupMembership of
- * decodeGroupLeader
- * Acquire lock
+ * RemoveDecodeGroupMember
+ * Remove a member from the decoding group of a leader.
*/
void
RemoveDecodeGroupMember(PGPROC *leader)
@@ -2041,9 +2111,10 @@ RemoveDecodeGroupMember(PGPROC *leader)
}
/*
- * Remove a decodeGroupMember from the decodeGroupMembership of
- * decodeGroupLeader
- * Assumes that the caller is holding appropriate lock
+ * RemoveDecodeGroupMemberLocked
+ * Remove a member from a decoding group of a leader.
+ *
+ * Assumes that the caller is holding appropriate lock on PGPROC.
*/
void
RemoveDecodeGroupMemberLocked(PGPROC *leader)
@@ -2059,20 +2130,24 @@ RemoveDecodeGroupMemberLocked(PGPROC *leader)
}
/*
- * Indicate to all decodeGroupMembers that this transaction is
- * going away.
+ * LogicalDecodeRemoveTransaction
+ * Notify all decoding members that this transaction is going away.
*
- * Wait for all decodeGroupMembers to ack back before returning
- * from here but only in case of aborts.
+ * Wait for all decodeGroupMembers to ack back before returning from
+ * here but only in case of aborts.
*
- * This function should be called *after* the proc has been
- * removed from the procArray.
+ * This function should be called *after* the proc has been removed
+ * from the procArray.
*
- * If the transaction is committing, it's ok for the
- * decoders to continue merrily. When it tries to lock this
- * proc, it won't find it and check for transaction status
- * and cache the commit status for future calls in
- * LogicalLockTransaction
+ * If the transaction is committing, it's ok for the decoding backends
+ * to continue merrily - there is no danger in accessing catalogs. When
+ * it tries to join the decoding group, it won't find the proc anymore,
+ * forcing it to re-check transaction status and cache the commit
+ * status for future calls (see LogicalLockTransaction).
+ *
+ * XXX What happens when a decoding process joins a decoding group, but
+ * then dies/crashes before leaving it again? Won't it stay in the group
+ * forever, blocking the abort?
*/
void
LogicalDecodeRemoveTransaction(PGPROC *leader, bool isCommit)
@@ -2085,10 +2160,14 @@ LogicalDecodeRemoveTransaction(PGPROC *leader, bool isCommit)
leader_lwlock = LockHashPartitionLockByProc(leader);
LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
- /* mark ourself as aborting */
- if (!isCommit)
- leader->decodeAbortPending = true;
+ /* mark the transaction as aborting */
+ leader->decodeAbortPending = (!isCommit);
+
+ /*
+ * If the proc has not been initialized as a group leader, there are
+ * no group members to wait for and we can terminate right away.
+ */
if (leader->decodeGroupLeader == NULL)
{
Assert(dlist_is_empty(&leader->decodeGroupMembers));
@@ -2102,18 +2181,27 @@ recheck:
Assert(!dlist_is_empty(&leader->decodeGroupMembers));
if (!isCommit)
{
+ /*
+ * We need to walk the list of group members, and decide if we
+ * need to wait for some of them. In other words, we need to
+ * check if there are any processes besides the leader.
+ */
dlist_foreach(iter, &leader->decodeGroupMembers)
{
proc = dlist_container(PGPROC, decodeGroupLink, iter.cur);
- /* mark the proc to indicate abort is pending */
+
+ /* Ignore the leader (i.e. ourselves). */
if (proc == leader)
continue;
+
+ /* mark the proc to indicate abort is pending */
if (!proc->decodeAbortPending)
{
proc->decodeAbortPending = true;
elog(DEBUG1, "marking group member (%p) from (%p) for abort",
proc, leader);
}
+
/* if the proc is currently locked, wait */
if (proc->decodeLocked)
do_wait = true;
On 2018-03-29 23:52:18 +0200, Tomas Vondra wrote:
I have added details about this in src/backend/storage/lmgr/README as
suggested by you.
Thanks. I think the README is a good start, but I think we also need to
improve the comments, which are usually more detailed than the README.
For example, it's not quite acceptable that LogicalLockTransaction and
LogicalUnlockTransaction have about no comments, especially when it's
meant to be public API for decoding plugins.
FWIW, for me that's grounds to not accept the feature. Burdening output
plugins with this will make their development painful (because they'll
have to adapt regularly) and correctness doubtful (there's nothing
checking for the lock being skipped). Another way needs to be found.
- Andres
On 03/29/2018 11:58 PM, Andres Freund wrote:
On 2018-03-29 23:52:18 +0200, Tomas Vondra wrote:
I have added details about this in src/backend/storage/lmgr/README as
suggested by you.
Thanks. I think the README is a good start, but I think we also need to
improve the comments, which are usually more detailed than the README.
For example, it's not quite acceptable that LogicalLockTransaction and
LogicalUnlockTransaction have about no comments, especially when it's
meant to be public API for decoding plugins.
FWIW, for me that's grounds to not accept the feature. Burdening output
plugins with this will make their development painful (because they'll
have to adapt regularly) and correctness doubtful (there's nothing
checking for the lock being skipped). Another way needs to be found.
The lack of docs/comments, or the fact that the decoding plugins would
need to do some lock/unlock operation?
I agree with the former, of course - docs are a must. I disagree with
the latter, though - there have been about no proposals how to do it
without the locking. If there are, I'd like to hear about it.
FWIW plugins that don't want to decode in-progress transactions don't
need to do anything, obviously.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
Hi,
On 2018-03-30 00:23:00 +0200, Tomas Vondra wrote:
On 03/29/2018 11:58 PM, Andres Freund wrote:
FWIW, for me that's grounds to not accept the feature. Burdening output
plugins with this will make their development painful (because they'll
have to adapt regularly) and correctness doubtful (there's nothing
checking for the lock being skipped). Another way needs to be found.
The lack of docs/comments, or the fact that the decoding plugins would
need to do some lock/unlock operation?
The latter.
I agree with the former, of course - docs are a must. I disagree with
the latter, though - there have been about no proposals how to do it
without the locking. If there are, I'd like to hear about it.
I don't care. Either another solution needs to be found, or the locking
needs to be automatically performed when necessary.
Greetings,
Andres Freund
On 29/03/18 23:58, Andres Freund wrote:
On 2018-03-29 23:52:18 +0200, Tomas Vondra wrote:
I have added details about this in src/backend/storage/lmgr/README as
suggested by you.
Thanks. I think the README is a good start, but I think we also need to
improve the comments, which are usually more detailed than the README.
For example, it's not quite acceptable that LogicalLockTransaction and
LogicalUnlockTransaction have about no comments, especially when it's
meant to be public API for decoding plugins.
FWIW, for me that's grounds to not accept the feature. Burdening output
plugins with this will make their development painful (because they'll
have to adapt regularly) and correctness doubtful (there's nothing
checking for the lock being skipped). Another way needs to be found.
I have to agree with Andres here. It's also visible in the later
patches. The pgoutput patch forgets to call these new APIs completely.
The test_decoding calls them, but it does so even when it's processing
changes for a committed transaction. I think that should be avoided as it
means potentially doing SLRU lookup for every change. So doing it right
is indeed not easy.
I was wondering how to hide this. The best idea I had so far would be to
put it in heap_beginscan (and index_beginscan, given that catalog scans
use it as well) behind some condition. That would also improve performance
because locking would not need to happen for syscache hits. The problem
is however how to inform heap_beginscan about the fact that we are
in 2PC decoding. We definitely don't want to change all the scan APIs
for this. I wonder if we could add some kind of property to Snapshot
which would indicate this fact - logical decoding is using its own
snapshots, so it could inject the information about being inside the 2PC
decoding.
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 30/03/18 00:30, Petr Jelinek wrote:
On 29/03/18 23:58, Andres Freund wrote:
On 2018-03-29 23:52:18 +0200, Tomas Vondra wrote:
I have added details about this in src/backend/storage/lmgr/README as
suggested by you.
Thanks. I think the README is a good start, but I think we also need to
improve the comments, which is usually more detailed than the README.
For example, it's not quite acceptable that LogicalLockTransaction and
LogicalUnlockTransaction have about no comments, especially when it's
meant to be public API for decoding plugins.
FWIW, for me that's ground to not accept the feature. Burdening output
plugins with this will make their development painful (because they'll
have to adapt regularly) and correctness doubtful (there's nothing
checking for the lock being skipped). Another way needs to be found.
I have to agree with Andres here. It's also visible in the latter
patches. The pgoutput patch forgets to call these new APIs completely.
The test_decoding calls them, but it does so even when it's processing
changes for committed transactions. I think that should be avoided as it
means potentially doing SLRU lookup for every change. So doing it right
is indeed not easy.
Ah turns out it actually does not need SLRU lookup in this case (I
missed the reorder buffer call), so I take that part back.
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi Petr, Andres and Tomas
Thanks. I think the README is a good start, but I think we also need to
improve the comments, which is usually more detailed than the README.
For example, it's not quite acceptable that LogicalLockTransaction and
LogicalUnlockTransaction have about no comments, especially when it's
meant to be public API for decoding plugins.
Tomas, thanks for providing your review comments based patch. I will include the
documentation that you have provided in that patch for the APIs. Will
also look at
your decodeGroupLocking related comments and submit a fresh patch soon.
FWIW, for me that's ground to not accept the feature. Burdening output
plugins with this will make their development painful (because they'll
have to adapt regularly) and correctness doubtful (there's nothing
checking for the lock being skipped). Another way needs to be found.
I have to agree with Andres here.
Ok. Let's have another go at alleviating this issue then.
I was wondering how to hide this. Best idea I had so far would be to put
it in heap_beginscan (and index_beginscan given that catalog scans use
it as well) behind some condition. That would also improve performance
because locking would not need to happen for syscache hits. The problem
is however how to inform the heap_beginscan about the fact that we are
in 2PC decoding. We definitely don't want to change all the scan apis
for this. I wonder if we could add some kind of property to Snapshot
which would indicate this fact - logical decoding is using its own
snapshots, so it could inject the information about being inside the 2PC
decoding.
The idea of adding that info in the Snapshot itself is interesting. We
could introduce a logicalxid field in SnapshotData to point to the XID
that the decoding backend is interested in. This could be added only
for the 2PC case. Support in the future for in-progress transactions
could use this field as well. If it's a valid XID, we could call
LogicalLockTransaction/LogicalUnlockTransaction on that XID from
heap_beginscan/heap_endscan respectively. I can also look at what
other *_beginscan APIs would need this as well.
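
To make the proposal concrete, here is a minimal, self-contained sketch of a snapshot that carries the XID being decoded, with the scan entry points taking and releasing the lock only when that XID is valid. Everything prefixed with sketch_ or Sketch is a hypothetical stand-in, not PostgreSQL's SnapshotData, heap_beginscan or LogicalLockTransaction; the lock is modelled as a plain counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for PostgreSQL types; all names below are hypothetical. */
typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

typedef struct SketchSnapshot
{
    /* XID of the prepared transaction being decoded; invalid otherwise */
    TransactionId logicalxid;
} SketchSnapshot;

/* Toy bookkeeping standing in for the decode-group lock. */
int  sketch_lock_depth = 0;
bool sketch_xid_aborted = false;

/* Returns false if the transaction has already been aborted. */
bool
sketch_lock_transaction(TransactionId xid)
{
    (void) xid;
    if (sketch_xid_aborted)
        return false;
    sketch_lock_depth++;
    return true;
}

void
sketch_unlock_transaction(TransactionId xid)
{
    (void) xid;
    sketch_lock_depth--;
}

/* Scan start: lock only when the snapshot says we are decoding 2PC. */
bool
sketch_beginscan(const SketchSnapshot *snap)
{
    if (snap->logicalxid != InvalidTransactionId)
        return sketch_lock_transaction(snap->logicalxid);
    return true;                /* ordinary scan, no locking needed */
}

/* Scan end: release the lock taken by sketch_beginscan, if any. */
void
sketch_endscan(const SketchSnapshot *snap)
{
    if (snap->logicalxid != InvalidTransactionId)
        sketch_unlock_transaction(snap->logicalxid);
}
```

Scans with an ordinary snapshot pay nothing, while a decoding snapshot brackets each scan with the lock. What the scan code should do when the lock attempt reports an abort is left open in this sketch.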
Regards,
Nikhils
On 30/03/18 09:56, Nikhil Sontakke wrote:
I was wondering how to hide this. Best idea I had so far would be to put
it in heap_beginscan (and index_beginscan given that catalog scans use
it as well) behind some condition. That would also improve performance
because locking would not need to happen for syscache hits. The problem
is however how to inform the heap_beginscan about the fact that we are
in 2PC decoding. We definitely don't want to change all the scan apis
for this. I wonder if we could add some kind of property to Snapshot
which would indicate this fact - logical decoding is using its own
snapshots, so it could inject the information about being inside the 2PC
decoding.
The idea of adding that info in the Snapshot itself is interesting. We
could introduce a logicalxid field in SnapshotData to point to the XID
that the decoding backend is interested in. This could be added only
for the 2PC case. Support in the future for in-progress transactions
could use this field as well. If it's a valid XID, we could call
LogicalLockTransaction/LogicalUnlockTransaction on that XID from
heap_beginscan/heap_endscan respectively. I can also look at what
other *_beginscan APIs would need this as well.
So I have spent some significant time today thinking about this (the
issue in general not this specific idea). And I think this proposal does
not work either.
The problem is that we fundamentally want two things, not one. It's true
we want to block ABORT from finishing while we are reading catalogs, but
the other important part is that we want to bail gracefully when ABORT
happened for the transaction being decoded.
In other words, if we do the locking transparently somewhere in the
scan or catalog read or similar there is no way to let the plugin know
that it should bail. So the locking code that's called from several
layers deep would have only one option, to ERROR. I don't think we want
to throw ERRORs when transaction which is being decoded has been aborted
as that disrupts the replication.
I think that we basically only have two options here that can satisfy
both blocking ABORT and bailing gracefully in case ABORT has happened.
Either the plugin has full control over locking (as in the patch), so
that it can bail when the locking function reports that the transaction
has aborted. Or we do the locking around the plugin calls, ie directly
in logical decoding callback wrappers or similar.
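
A minimal sketch of the second option, assuming a wrapper that takes the lock, invokes the plugin callback, and releases the lock, so that the decoding loop (not the plugin) learns about the abort and stops quietly. All types and functions here are hypothetical stand-ins rather than the patch's actual code; the concurrent ROLLBACK PREPARED is simulated by flipping a flag before a chosen change.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the transaction being decoded. */
typedef struct SketchTxn
{
    bool aborted;   /* set once the rollback has been requested */
    int  lockers;   /* decoding backends currently holding the lock */
} SketchTxn;

typedef void (*sketch_change_cb) (SketchTxn *txn, int change, int *applied);

/* Lock attempt fails, without raising an error, if the txn has aborted. */
bool
sketch_lock(SketchTxn *txn)
{
    if (txn->aborted)
        return false;
    txn->lockers++;
    return true;
}

void
sketch_unlock(SketchTxn *txn)
{
    txn->lockers--;
}

/* Wrapper around the plugin callback; false means "stop decoding". */
bool
sketch_change_wrapper(SketchTxn *txn, sketch_change_cb cb,
                      int change, int *applied)
{
    if (!sketch_lock(txn))
        return false;           /* bail gracefully, no ERROR */
    cb(txn, change, applied);   /* plugin may read catalogs safely here */
    sketch_unlock(txn);
    return true;
}

/* A plugin callback that knows nothing about locking. */
void
sketch_plugin_change(SketchTxn *txn, int change, int *applied)
{
    (void) txn;
    (void) change;
    (*applied)++;
}

/* Replay loop; the abort is simulated just before change abort_before. */
int
sketch_replay(SketchTxn *txn, int nchanges, int abort_before)
{
    int applied = 0;

    for (int i = 0; i < nchanges; i++)
    {
        if (i == abort_before)
            txn->aborted = true;
        if (!sketch_change_wrapper(txn, sketch_plugin_change, i, &applied))
            break;
    }
    return applied;
}
```

The plugin callback stays free of locking calls, and the abort is noticed at the next callback boundary. The cost is that the lock is held for the full duration of each callback.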
Both of these options have some disadvantages. Locking inside the plugin
makes the plugin code much more complex if it wants to support this. For
example if I as plugin author call any function that somewhere accesses
syscache, I have to do the locking around that function call. Locking
around plugin callbacks can hold the lock for longer periods of time
since plugins usually end up writing to network. I think for most
use-cases of 2PC decoding the latter is more useful as plugin should be
connected to some kind of transaction management solution. Also the time
should be bounded by things like wal_sender_timeout (or
statement_timeout for SQL variant of decoding).
Note that I was initially advocating against locking around whole
callbacks when Nikhil originally came up with the idea, but after we
went over several other options here and gave it a lot of thought I now
think it's probably the least bad way we have available. At least until
somebody figures out how to solve all the issues around reading aborted
catalog changes, but that does seem like rather large project on its
own. And if we do locking around plugin callbacks now then we can easily
switch to that solution if it ever happens without anybody having to
rewrite the plugins.
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On March 30, 2018 10:27:18 AM PDT, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:
Locking
around plugin callbacks can hold the lock for longer periods of time
since plugins usually end up writing to network. I think for most
use-cases of 2PC decoding the latter is more useful as plugin should be
connected to some kind of transaction management solution. Also the time
should be bounded by things like wal_sender_timeout (or
statement_timeout for SQL variant of decoding).
Quick thought: Should be simple to release lock when interacting with network. Could also have abort signal lockers.
Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Quick thought: Should be simple to release lock when interacting with network.
I don’t think this will be that simple. The network calls will typically happen from inside the plugins and we don’t want to make plugin authors responsible for that.
Could also have abort signal lockers.
With the decodegroup locking we do have access to all the decoding backend pids. So we could signal them. But I am not sure signaling will work if the plugin is in the midst of a network call.
I agree with Petr. With this decodegroup lock implementation we have an inexpensive but workable implementation for locking around the plugin call. Sure, the abort will be penalized but it’s bounded by the WAL sender timeout or a max of one change apply cycle.
As he mentioned, if we can optimize this later we can do so without changing plugin coding semantics.
Regards,
Nikhils
On 30/03/18 19:36, Andres Freund wrote:
On March 30, 2018 10:27:18 AM PDT, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:
Locking
around plugin callbacks can hold the lock for longer periods of time
since plugins usually end up writing to network. I think for most
use-cases of 2PC decoding the latter is more useful as plugin should be
connected to some kind of transaction management solution. Also the time
should be bounded by things like wal_sender_timeout (or
statement_timeout for SQL variant of decoding).
Quick thought: Should be simple to release lock when interacting with network. Could also have abort signal lockers.
I thought about that as well, but then we need to change the API of the
write functions of logical decoding to return info about the transaction
being aborted in the meantime so that the plugin can abort. Seems a bit
ugly that those should know about it. Alternatively we would have to
disallow multiple writes from a single plugin callback. Otherwise abort
can happen during the network interaction without the plugin noticing.
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi,
On 2018-03-30 23:49:43 +0530, Nikhil Sontakke wrote:
Quick thought: Should be simple to release lock when interacting with network.
I don’t think this will be that simple. The network calls will
typically happen from inside the plugins and we don’t want to make
plugin authors responsible for that.
You can just throw results away... ;). I'm not even kidding. We've all
the necessary access in the callback for writing from a context.
Could also have abort signal lockers.
With the decodegroup locking we do have access to all the decoding backend pids. So we could signal them. But am not sure signaling will work if the plugin is in the midst of a network call.
All walsender writes are nonblocking, so that's not an issue.
Greetings,
Andres Freund
On 30/03/18 20:50, Andres Freund wrote:
Hi,
On 2018-03-30 23:49:43 +0530, Nikhil Sontakke wrote:
Quick thought: Should be simple to release lock when interacting with network.
I don’t think this will be that simple. The network calls will
typically happen from inside the plugins and we don’t want to make
plugin authors responsible for that.
You can just throw results away... ;). I'm not even kidding. We've all
the necessary access in the callback for writing from a context.
You mean, if we detect abort in the write callback, set something in the
context which will make all the future writes noop until it's reset
again after we yield back to the logical decoding?
That's not the most beautiful design I've seen, but I'd be okay with
that, it seems like it would solve all the issues we have with this.
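
A rough sketch of that idea, assuming a per-context flag that the write path sets once it notices the abort, after which further writes from the same callback are discarded until control returns to the decoding core. The names (SketchContext, sketch_write and so on) are hypothetical and the abort is simulated with a global flag; this is not the actual LogicalDecodingContext or OutputPluginWrite code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the decoding context. */
typedef struct SketchContext
{
    bool abort_seen;    /* set when the write path notices the abort */
    int  bytes_sent;    /* output actually handed to the network */
} SketchContext;

/* Simulated "has the decoded transaction been aborted?" state. */
bool sketch_txn_aborted = false;

/* Write path: once an abort is seen, results are thrown away. */
void
sketch_write(SketchContext *ctx, int nbytes)
{
    if (!ctx->abort_seen && sketch_txn_aborted)
        ctx->abort_seen = true;
    if (ctx->abort_seen)
        return;                 /* no-op until the core resets the flag */
    ctx->bytes_sent += nbytes;
}

/* A plugin callback that writes twice and never checks for aborts. */
void
sketch_plugin_callback(SketchContext *ctx, bool abort_midway)
{
    sketch_write(ctx, 10);
    if (abort_midway)
        sketch_txn_aborted = true;  /* concurrent rollback lands here */
    sketch_write(ctx, 10);
}

/* Decoding core: run the callback, then report and reset the flag. */
bool
sketch_run_callback(SketchContext *ctx, bool abort_midway)
{
    bool stop;

    sketch_plugin_callback(ctx, abort_midway);
    stop = ctx->abort_seen;
    ctx->abort_seen = false;    /* reset after yielding back to the core */
    return stop;
}
```

The plugin needs no changes and may issue several writes per callback; only the core looks at the flag and decides to stop decoding the transaction.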
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi,
On 2018-03-30 21:05:29 +0200, Petr Jelinek wrote:
You mean, if we detect abort in the write callback, set something in the
context which will make all the future writes noop until it's reset
again after we yield back to the logical decoding?
Something like that, yea. I *think* doing it via signalling is going to
be a more efficient design than constantly checking, but I've not
thought it fully through.
That's not the most beautiful design I've seen, but I'd be okay with
that, it seems like it would solve all the issues we have with this.
Yea, it's not too pretty, but seems pragmatic.
Greetings,
Andres Freund
Hi Tomas,
Thanks. I think the README is a good start, but I think we also need to
improve the comments, which is usually more detailed than the README.
For example, it's not quite acceptable that LogicalLockTransaction and
LogicalUnlockTransaction have about no comments, especially when it's
meant to be public API for decoding plugins.
Additional documents around the APIs incorporated from your review patch.
2) regression tests
-------------------
They are long-winded queries and IMO made the test file look too
cluttered and verbose..
Well, I don't think that's a major problem, and it certainly makes it
more difficult to investigate regression failures.
Changed the test files to use the actual queries everywhere now.
Now, the new bits ... attached is a .diff with a couple of changes and
comments on various places.
1) LogicalLockTransaction
- This function is part of a public API, yet it has no comment. That
needs fixing - it has to be clear how to use it. The .diff suggests a
comment, but it may need improvements.
Done.
- As I mentioned in the previous review, BecomeDecodeGroupLeader is a
misleading name. It suggests the caller becomes a leader, while in fact
it looks up the PROC running the XID and makes it a leader. This is
obviously due to copying the code from lock groups, where the caller
actually becomes the leader. It's incorrect here. I suggest something
like LookupDecodeGroupLeader() or something.
Done. Used AssignDecodeGroupLeader() as the function name now.
- In the "if (MyProc->decodeGroupLeader == NULL)" block there are two
blocks rechecking the transaction status:
if (proc == NULL)
{ ... recheck ... }
if (!BecomeDecodeGroupMember(proc, proc->pid, rbtxn_prepared(txn)))
{ ... recheck ...}
I suggest to join them into a single block.
Done. Combined into a single block.
- This Assert() is either bogus and there can indeed be cases with
(MyProc->decodeGroupLeader==NULL), or the "if" is unnecessary:
Assert(MyProc->decodeGroupLeader);
if (MyProc->decodeGroupLeader) { ... }
Done. Removed the assert now.
- I'm wondering why we're maintaining decodeAbortPending flags both for
the leader and all the members. ISTM it'd be perfectly fine to only
check the leader, particularly because RemoveDecodeGroupMemberLocked
removes the members from the decoding group. So that seems unnecessary,
and we can remove the
if (MyProc->decodeAbortPending)
{ ... }
IMO, it looks clearer this way: each proc is notified that an
abort is pending.
- LogicalUnlockTransaction needs a comment(s) too.
Done.
2) BecomeDecodeGroupLeader
- It can bail out when (!proc), which will simplify the code a bit.
Done.
- Why does it check PID of the process at all? Seems unnecessary,
considering we're already checking the XID.
Agreed. Especially for the current case of 2PC, the proc will have 0 as pid.
- Can a proc executing a XID have a different leader? I don't think so,
so I'd make that an Assert().
Assert(!proc || (proc->decodeGroupLeader == proc));
And it'll allow simplification of some of the conditions.
Done.
- We're only dealing with prepared transactions now, so I'd just drop
the is_prepared flag - it'll make the code a bit simpler, we can add it
later in patch adding decoding of regular in-progress transactions. We
can't test the (!is_prepared) anyway.
Done.
- Why are we making the leader also a member of the group? Seems rather
unnecessary, and it complicates the abort handling, because we need to
skip the leader when deciding to wait.
The leader is part of the decode group. And other than not waiting for ourself
at abort time, no other coding complications are there AFAICS.
3) LogicalDecodeRemoveTransaction
- It's not clear to me what happens when a decoding backend gets killed
between LogicalLockTransaction/LogicalUnlockTransaction. Doesn't that
mean LogicalDecodeRemoveTransaction will get stuck, because the proc is
still in the decoding group?
SIGSEGV, SIGABRT, SIGKILL will all cause the PG instance to restart because of
possible shmem corruption issues. So I don't think the above scenario
will arise. I also did not see any related handling in the parallel
lock group case as well.
4) a bunch of comment / docs improvements, ...
I'm suggesting rewording a couple of comments. I've also added a couple
of missing comments - e.g. to LogicalLockTransaction and the lock group
methods in general.
Also, a couple more questions and suggestions in XXX comments.
Incorporated relevant changes in the new patchset.
Andres, Petr:
As discussed, I have now added lock/unlock API calls around the
"apply_change" callback. This callback is now free to consult catalog
metadata without worrying about a concurrent rollback operation. Have
removed direct logicallock/logicalunlock calls from inside the
pgoutput and test_decoding plugins now. Also modified the sgml
documentation appropriately.
Am looking at how we can further optimize this by looking at the two
approaches about signaling about abort or adding abort related info in
the context, but this will be an additional patch over this patch set
anyways.
Regards,
Nikhils
Attachments:
0006-Optional-Additional-test-case-to-demonstrate-decoding-rollbac.0204.patch (application/octet-stream)
From a2ca2692ef87c86421132467cce00d31ec0f1dca Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Sun, 1 Apr 2018 18:48:56 +0530
Subject: [PATCH 6/6] Additional test case to demonstrate decoding/rollback
interlocking
Introduce a decode-delay parameter in the test_decoding plugin. Based
on the value provided the plugin sleeps for that many seconds while
holding the LogicalTransactionLock. A concurrent rollback is fired
off which aborts that transaction in the meanwhile.
---
contrib/test_decoding/Makefile | 5 +-
contrib/test_decoding/t/001_twophase.pl | 102 ++++++++++++++++++++++++
contrib/test_decoding/test_decoding.c | 24 ++++++
src/backend/replication/logical/reorderbuffer.c | 5 ++
4 files changed, 135 insertions(+), 1 deletion(-)
create mode 100644 contrib/test_decoding/t/001_twophase.pl
diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 6c18189d9d..79b9622600 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -26,7 +26,7 @@ installcheck:;
# installation, allow to do so, but only if requested explicitly.
installcheck-force: regresscheck-install-force isolationcheck-install-force
-check: regresscheck isolationcheck
+check: regresscheck isolationcheck 2pc-check
submake-regress:
$(MAKE) -C $(top_builddir)/src/test/regress all
@@ -66,3 +66,6 @@ isolationcheck-install-force: all | submake-isolation submake-test_decoding temp
isolationcheck isolationcheck-install-force
temp-install: EXTRA_INSTALL=contrib/test_decoding
+
+2pc-check: temp-install
+ $(prove_check)
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000000..d50e2c9940
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,102 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+ 'postgresql.conf', qq(
+ max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+ "INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+ "SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+$node_logical->safe_psql('postgres',
+ "SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot2', 'test_decoding');");
+
+# This test is specifically for testing concurrent abort while logical decode is
+# ongoing. The decode-delay value will allow for each change decode to sleep for
+# those many seconds. We also hold the LogicalLockTransaction while we sleep.
+# We will fire off a ROLLBACK from another session when this delayed decode is
+# ongoing. Since we are holding the lock from the call above, this ROLLBACK
+# will wait for the logical backends to do a LogicalUnlockTransaction. We will
+# stop decoding immediately post this and the next pg_logical_slot_get_changes call
+# should show only a few records decoded from the entire two phase transaction
+#
+# We use two slots to test multiple decoding backends here
+
+$node_logical->safe_psql('postgres', "
+ BEGIN;
+ INSERT INTO tab VALUES (11);
+ INSERT INTO tab VALUES (12);
+ ALTER TABLE tab ADD COLUMN b INT;
+ PREPARE TRANSACTION 'test_prepared_tab';");
+
+# start decoding the above with decode-delay in the background.
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should only decode 1 INSERT record and should include
+# an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1', 'decode-delay', '3');\" \&");
+
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot2', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1', 'decode-delay', '3');\" \&");
+
+# sleep for a little while (shorter than decode-delay)
+$node_logical->safe_psql('postgres', "select pg_sleep(1)");
+
+# rollback the prepared transaction whose first record is being decoded
+# after sleeping for decode-delay time
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# wait for decoding to stop
+$node_logical->psql('postgres', "select pg_sleep(4)");
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot2', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+
+# check for occurrence of log about waiting backends
+my $output_file = slurp_file($node_logical->logfile());
+my $waiting_str = "Waiting for backends to abort";
+like($output_file, qr/$waiting_str/, "Waiting log found in server log");
+
+# check for occurrence of log about stopping decoding
+my $abort_str = "stopping decoding of test_prepared_tab ";
+like($output_file, qr/$abort_str/, "ABORT found in server log");
+
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+ BEGIN;
+ INSERT INTO tab VALUES (11);
+ INSERT INTO tab VALUES (12);
+ ALTER TABLE tab ADD COLUMN b INT;
+ INSERT INTO tab VALUES (13, 11);
+ PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot2', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit post the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot2', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot2');");
+$node_logical->stop('fast');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 05b993fd7a..db7becdc44 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -37,6 +37,7 @@ typedef struct
bool xact_wrote_changes;
bool only_local;
bool enable_twophase;
+ int decode_delay; /* seconds to sleep after every change record */
} TestDecodingData;
static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -118,6 +119,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
data->skip_empty_xacts = false;
data->only_local = false;
data->enable_twophase = false;
+ data->decode_delay = 0;
ctx->output_plugin_private = data;
@@ -209,6 +211,21 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
errmsg("could not parse value \"%s\" for parameter \"%s\"",
strVal(elem->arg), elem->defname)));
}
+ else if (strcmp(elem->defname, "decode-delay") == 0)
+ {
+ if (elem->arg == NULL)
+ data->decode_delay = 2; /* default to 2 seconds */
+ else
+ data->decode_delay = pg_atoi(strVal(elem->arg),
+ sizeof(int), 0);
+
+ if (data->decode_delay <= 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("Specify positive value for parameter \"%s\","
+ " you specified \"%s\"",
+ elem->defname, strVal(elem->arg))));
+ }
else
{
ereport(ERROR,
@@ -548,6 +565,13 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
}
data->xact_wrote_changes = true;
+ /* if decode_delay is specified, sleep with above lock held */
+ if (data->decode_delay > 0)
+ {
+ elog(LOG, "sleeping for %d seconds", data->decode_delay);
+ pg_usleep(data->decode_delay * 1000000L);
+ }
+
class_form = RelationGetForm(relation);
tupdesc = RelationGetDescr(relation);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 2ba6c7ebce..7c146b8d48 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1383,7 +1383,12 @@ ReorderBufferCommitInternal(ReorderBufferTXN *txn,
/* Lock transaction before catalog access */
if (!LogicalLockTransaction(txn))
+ {
+ elog(LOG, "stopping decoding of %s (%u)",
+ txn->gid[0] != '\0'? txn->gid:"",
+ txn->xid);
break;
+ }
relation = RelationIdGetRelation(reloid);
--
2.15.1 (Apple Git-101)
0001-Cleaning-up-of-flags-in-ReorderBufferTXN-structure.0204.patch (application/octet-stream)
From 5d06ef716eee38b77e23dcde02a260676e7cc297 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Sun, 1 Apr 2018 16:34:17 +0530
Subject: [PATCH 1/6] Cleaning up of flags in ReorderBufferTXN structure
---
src/backend/replication/logical/reorderbuffer.c | 32 ++++++++++-----------
src/include/replication/reorderbuffer.h | 37 +++++++++++++------------
2 files changed, 36 insertions(+), 33 deletions(-)
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b4016ed52b..3c9af58640 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -636,7 +636,7 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
if (prev_first_lsn != InvalidXLogRecPtr)
Assert(prev_first_lsn < cur_txn->first_lsn);
- Assert(!cur_txn->is_known_as_subxact);
+ Assert(!rbtxn_is_known_subxact(cur_txn));
prev_first_lsn = cur_txn->first_lsn;
}
#endif
@@ -654,7 +654,7 @@ ReorderBufferGetOldestTXN(ReorderBuffer *rb)
txn = dlist_head_element(ReorderBufferTXN, node, &rb->toplevel_by_lsn);
- Assert(!txn->is_known_as_subxact);
+ Assert(!rbtxn_is_known_subxact(txn));
Assert(txn->first_lsn != InvalidXLogRecPtr);
return txn;
}
@@ -688,9 +688,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
dlist_push_tail(&txn->subtxns, &subtxn->node);
txn->nsubtxns++;
}
- else if (!subtxn->is_known_as_subxact)
+ else if (!rbtxn_is_known_subxact(subtxn))
{
- subtxn->is_known_as_subxact = true;
+ subtxn->txn_flags |= RBTXN_IS_SUBXACT;
Assert(subtxn->nsubtxns == 0);
/* remove from lsn order list of top-level transactions */
@@ -751,9 +751,9 @@ ReorderBufferCommitChild(ReorderBuffer *rb, TransactionId xid,
subtxn->final_lsn = commit_lsn;
subtxn->end_lsn = end_lsn;
- if (!subtxn->is_known_as_subxact)
+ if (!rbtxn_is_known_subxact(subtxn))
{
- subtxn->is_known_as_subxact = true;
+ subtxn->txn_flags |= RBTXN_IS_SUBXACT;
Assert(subtxn->nsubtxns == 0);
/* remove from lsn order list of top-level transactions */
@@ -862,7 +862,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
ReorderBufferChange *cur_change;
- if (txn->serialized)
+ if (rbtxn_is_serialized(txn))
{
/* serialize remaining changes */
ReorderBufferSerializeTXN(rb, txn);
@@ -891,7 +891,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
ReorderBufferChange *cur_change;
- if (cur_txn->serialized)
+ if (rbtxn_is_serialized(cur_txn))
{
/* serialize remaining changes */
ReorderBufferSerializeTXN(rb, cur_txn);
@@ -1057,7 +1057,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
* they originally were happening inside another subtxn, so we won't
* ever recurse more than one level deep here.
*/
- Assert(subtxn->is_known_as_subxact);
+ Assert(rbtxn_is_known_subxact(subtxn));
Assert(subtxn->nsubtxns == 0);
ReorderBufferCleanupTXN(rb, subtxn);
@@ -1096,7 +1096,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
/*
* Remove TXN from its containing list.
*
- * Note: if txn->is_known_as_subxact, we are deleting the TXN from its
+ * Note: if txn is known as subxact, we are deleting the TXN from its
* parent's list of known subxacts; this leaves the parent's nsubxacts
* count too high, but we don't care. Otherwise, we are deleting the TXN
* from the LSN-ordered list of toplevel TXNs.
@@ -1111,7 +1111,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
Assert(found);
/* remove entries spilled to disk */
- if (txn->serialized)
+ if (rbtxn_is_serialized(txn))
ReorderBufferRestoreCleanup(rb, txn);
/* deallocate */
@@ -1128,7 +1128,7 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
dlist_iter iter;
HASHCTL hash_ctl;
- if (!txn->has_catalog_changes || dlist_is_empty(&txn->tuplecids))
+ if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
return;
memset(&hash_ctl, 0, sizeof(hash_ctl));
@@ -1708,7 +1708,7 @@ ReorderBufferAbortOld(ReorderBuffer *rb, TransactionId oldestRunningXid)
* final_lsn to that of their last change; this causes
* ReorderBufferRestoreCleanup to do the right thing.
*/
- if (txn->serialized && txn->final_lsn == 0)
+ if (rbtxn_is_serialized(txn) && txn->final_lsn == 0)
{
ReorderBufferChange *last =
dlist_tail_element(ReorderBufferChange, node, &txn->changes);
@@ -1954,7 +1954,7 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
- txn->has_catalog_changes = true;
+ txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
}
/*
@@ -1971,7 +1971,7 @@ ReorderBufferXidHasCatalogChanges(ReorderBuffer *rb, TransactionId xid)
if (txn == NULL)
return false;
- return txn->has_catalog_changes;
+ return rbtxn_has_catalog_changes(txn);
}
/*
@@ -2111,7 +2111,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
Assert(spilled == txn->nentries_mem);
Assert(dlist_is_empty(&txn->changes));
txn->nentries_mem = 0;
- txn->serialized = true;
+ txn->txn_flags |= RBTXN_IS_SERIALIZED;
if (fd != -1)
CloseTransientFile(fd);
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index aa430c843c..177ef98e43 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -137,21 +137,33 @@ typedef struct ReorderBufferChange
dlist_node node;
} ReorderBufferChange;
+/* ReorderBufferTXN flags */
+#define RBTXN_HAS_CATALOG_CHANGES 0x0001
+#define RBTXN_IS_SUBXACT 0x0002
+#define RBTXN_IS_SERIALIZED 0x0004
+
+/* does the txn have catalog changes */
+#define rbtxn_has_catalog_changes(txn) (txn->txn_flags & RBTXN_HAS_CATALOG_CHANGES)
+/* is the txn known as a subxact? */
+#define rbtxn_is_known_subxact(txn) (txn->txn_flags & RBTXN_IS_SUBXACT)
+/*
+ * Has this transaction been spilled to disk? It's not always possible to
+ * deduce that fact by comparing nentries with nentries_mem, because e.g.
+ * subtransactions of a large transaction might get serialized together
+ * with the parent - if they're restored to memory they'd have
+ * nentries_mem == nentries.
+ */
+#define rbtxn_is_serialized(txn) (txn->txn_flags & RBTXN_IS_SERIALIZED)
+
typedef struct ReorderBufferTXN
{
+ int txn_flags;
+
/*
* The transactions transaction id, can be a toplevel or sub xid.
*/
TransactionId xid;
- /* did the TX have catalog changes */
- bool has_catalog_changes;
-
- /*
- * Do we know this is a subxact?
- */
- bool is_known_as_subxact;
-
/*
* LSN of the first data carrying, WAL record with knowledge about this
* xid. This is allowed to *not* be first record adorned with this xid, if
@@ -214,15 +226,6 @@ typedef struct ReorderBufferTXN
*/
uint64 nentries_mem;
- /*
- * Has this transaction been spilled to disk? It's not always possible to
- * deduce that fact by comparing nentries with nentries_mem, because e.g.
- * subtransactions of a large transaction might get serialized together
- * with the parent - if they're restored to memory they'd have
- * nentries_mem == nentries.
- */
- bool serialized;
-
/*
* List of ReorderBufferChange structs, including new Snapshots and new
* CommandIds
--
2.15.1 (Apple Git-101)
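
For readers skimming the series, the effect of the patch above (three bool fields folded into one txn_flags word) can be sketched in isolation. This is only an illustration: DemoTXN and the demo_* helpers are hypothetical stand-ins, not part of the patch; only the three flag values are copied from it. The helpers compare against zero so they return a real bool, whereas the patch's macros yield the masked bit itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag values copied from the patch (reorderbuffer.h). */
#define RBTXN_HAS_CATALOG_CHANGES 0x0001
#define RBTXN_IS_SUBXACT          0x0002
#define RBTXN_IS_SERIALIZED       0x0004

/* Hypothetical stand-in for ReorderBufferTXN: only the flags word. */
typedef struct DemoTXN
{
	int			txn_flags;
} DemoTXN;

/* Set one flag; replaces assignments such as "txn->serialized = true". */
static void
demo_set_flag(DemoTXN *txn, int flag)
{
	assert(txn != NULL);
	txn->txn_flags |= flag;
}

/* Same tests as the rbtxn_* macros, normalized to bool. */
static bool
demo_has_catalog_changes(const DemoTXN *txn)
{
	return (txn->txn_flags & RBTXN_HAS_CATALOG_CHANGES) != 0;
}

static bool
demo_is_known_subxact(const DemoTXN *txn)
{
	return (txn->txn_flags & RBTXN_IS_SUBXACT) != 0;
}

static bool
demo_is_serialized(const DemoTXN *txn)
{
	return (txn->txn_flags & RBTXN_IS_SERIALIZED) != 0;
}
```

One design note: since the macros in the patch return the masked value rather than 0/1, callers should only use them in boolean context and not compare the result against true.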
Attachment: 0002-Introduce-LogicalLockTransaction-LogicalUnlockTransa.0204.patch
From 1963514ddb56a08c182502dc276bc75fedeb65e0 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Sun, 1 Apr 2018 16:38:02 +0530
Subject: [PATCH 2/6] Introduce LogicalLockTransaction/LogicalUnlockTransaction
APIs
When a transaction aborts, its changes are considered unnecessary
for other transactions. That means the changes may be either cleaned
up by vacuum or removed from HOT chains (thus made inaccessible
through indexes), and there may be other such consequences.
When decoding committed transactions this is not an issue, and we
never decode transactions that abort before the decoding starts
(where decoding means passing it to ReorderBufferCommit).
But for in-progress transactions - for example when decoding prepared
transactions on PREPARE (and not COMMIT PREPARED as before), this
may cause failures when the output plugin consults catalogs (both
system and user-defined).
To prevent aborts concurrent with plugins accessing catalogs, we
introduce an API the output plugins are required to use (when
decoding in-progress transactions only).
Before accessing any catalogs, output plugins are required to call
LogicalLockTransaction and then release it using
LogicalUnlockTransaction. This is implemented by adding support for
decoding groups. Use LockHashPartitionLockByProc on the group leader
to get the LWLock protecting these fields. For prepared and uncommitted
transactions, decoding backends working on the same XID will link
themselves up to the corresponding PGPROC entry (decodeGroupLeader).
They will remove themselves when they are done decoding.
If the prepared or uncommitted transaction decides to abort, then
the decodeGroupLeader will set the decodeAbortPending flag allowing
the decodeGroupMembers to abort their decoding appropriately.
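
As a reading aid, the order of the fast-path checks that LogicalLockTransaction in this patch performs before it joins a decoding group can be sketched as a pure function over the flags word. The flag values are the ones this series adds to reorderbuffer.h; DemoLockDecision and demo_lock_decision are hypothetical names used only for illustration, and the assertion about mutually exclusive commit/rollback flags is an assumption of this sketch rather than something the patch enforces.

```c
#include <assert.h>

/* Flag values as defined by this patch series (reorderbuffer.h). */
#define RBTXN_HAS_CATALOG_CHANGES 0x0001
#define RBTXN_PREPARE             0x0008
#define RBTXN_COMMIT              0x0040
#define RBTXN_ROLLBACK            0x0080

/* Hypothetical result type for the sketch. */
typedef enum DemoLockDecision
{
	DEMO_SAFE_NO_LOCK,			/* return true without joining a group */
	DEMO_STOP_DECODING,			/* return false: transaction known aborted */
	DEMO_MUST_JOIN_GROUP		/* need the leader's decoding group */
} DemoLockDecision;

static DemoLockDecision
demo_lock_decision(int txn_flags)
{
	/* Sketch assumption: a txn is never flagged both ways at once. */
	assert(!((txn_flags & RBTXN_COMMIT) && (txn_flags & RBTXN_ROLLBACK)));

	/* No catalog changes: catalog access cannot be affected by an abort. */
	if (!(txn_flags & RBTXN_HAS_CATALOG_CHANGES))
		return DEMO_SAFE_NO_LOCK;

	/* Already known committed: safe. Already known aborted: stop. */
	if (txn_flags & RBTXN_COMMIT)
		return DEMO_SAFE_NO_LOCK;
	if (txn_flags & RBTXN_ROLLBACK)
		return DEMO_STOP_DECODING;

	/* Only prepared transactions are decoded before commit. */
	if (!(txn_flags & RBTXN_PREPARE))
		return DEMO_SAFE_NO_LOCK;

	return DEMO_MUST_JOIN_GROUP;
}
```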
---
src/backend/replication/logical/logical.c | 215 +++++++++++++++++++
src/backend/storage/ipc/procarray.c | 39 ++++
src/backend/storage/lmgr/README | 46 +++++
src/backend/storage/lmgr/proc.c | 332 ++++++++++++++++++++++++++++++
src/include/replication/logical.h | 2 +
src/include/replication/reorderbuffer.h | 15 ++
src/include/storage/proc.h | 26 +++
src/include/storage/procarray.h | 1 +
8 files changed, 676 insertions(+)
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 3d8ad7ddf8..2238066138 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1017,3 +1017,218 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
SpinLockRelease(&MyReplicationSlot->mutex);
}
}
+
+/*
+ * LogicalLockTransaction
+ * Make sure the transaction is not aborted during decoding.
+ *
+ * The logical decoding plugins may need to access catalogs (both system
+ * and user-defined), e.g. to get metadata about tuples, do custom
+ * filtering etc. While decoding committed transactions that is not an
+ * issue, but in-progress transactions may abort while being decoded, in
+ * which case the catalog access may fail in various ways (rows from
+ * aborted transactions are eligible for more aggressive cleanup, may
+ * not be accessible through indexes due to breaking HOT chains etc.).
+ *
+ * To prevent these issues, we need to prevent abort of the transaction
+ * while accessing any catalogs. To enforce that, each decoding backend
+ * has to call LogicalLockTransaction prior to any catalog access, and
+ * then LogicalUnlockTransaction immediately after it. These functions
+ * add/remove the decoding backend from a "decoding group" for a given
+ * transaction. While aborting a prepared transaction, the backend will
+ * wait for all current members of the decoding group to leave (see
+ * LogicalDecodeRemoveTransaction).
+ *
+ * The function returns true when it's safe to access catalogs, and
+ * false when the transaction aborted (or is being aborted) in which
+ * case the plugin should stop decoding it.
+ *
+ * The decoding backend joins the decoding group only when actually
+ * needed. For example when the transaction did no catalog changes,
+ * or when it's known to already have committed (or aborted), we can
+ * bail out without joining the group.
+ */
+bool
+LogicalLockTransaction(ReorderBufferTXN *txn)
+{
+ bool ok = false;
+ LWLock *leader_lwlock;
+
+ /*
+ * Transactions that have not modified catalogs do not need to
+ * join the decoding group.
+ */
+ if (!rbtxn_has_catalog_changes(txn))
+ return true;
+
+ /*
+ * Check commit status. If a transaction already committed, there
+ * is no danger when accessing catalogs. If it aborted, we can
+ * stop decoding it right away.
+ */
+ if (rbtxn_commit(txn))
+ return true;
+
+ if (rbtxn_rollback(txn))
+ return false;
+
+ /*
+ * Currently, only 2PC transactions can be decoded before commit
+ * (at prepare). So regular transactions are automatically safe.
+ */
+ if (!rbtxn_prepared(txn))
+ return true;
+
+ /*
+ * Find the PROC handling this XID and join the decoding group.
+ *
+ * If this is the first call for this XID, we don't know which
+ * PROC is executing the transaction (and acting as a leader).
+ * In that case we need to lookup and possibly also assign
+ * the leader.
+ */
+ if (MyProc->decodeGroupLeader == NULL)
+ {
+ PGPROC *proc = AssignDecodeGroupLeader(txn->xid);
+
+ /*
+ * We have checked if the transaction committed/aborted, but it
+ * is possible the PROC went away since then, in which case we
+ * get decodeGroupLeader=NULL. We recheck transaction status,
+ * expecting it to be either committed or aborted.
+ *
+ * If the PROC is available, add ourself as a member of its
+ * decoding group. Note that we're not holding any locks on PGPROC,
+ * so it's possible the leader disappears, or starts executing
+ * another transaction. In that case we're done.
+ */
+ if (proc == NULL ||
+ !BecomeDecodeGroupMember(proc, proc->pid, rbtxn_prepared(txn)))
+ {
+ Assert(!TransactionIdIsInProgress(txn->xid));
+ if (TransactionIdDidCommit(txn->xid))
+ {
+ txn->txn_flags |= RBTXN_COMMIT;
+ return true;
+ }
+ else
+ {
+ txn->txn_flags |= RBTXN_ROLLBACK;
+ return false;
+ }
+ }
+ }
+
+ /*
+ * If we were able to add ourself, then Abort processing will
+ * interlock with us.
+ */
+ Assert(MyProc->decodeGroupLeader);
+
+ leader_lwlock = LockHashPartitionLockByProc(MyProc->decodeGroupLeader);
+ LWLockAcquire(leader_lwlock, LW_SHARED);
+
+ /*
+ * Re-check if we were told to abort by the leader after taking
+ * the above lock
+ */
+ if (MyProc->decodeAbortPending)
+ {
+ /*
+ * Remove ourself from the decodeGroupMembership and return
+ * false so that the decoding plugin also initiates abort
+ * processing
+ */
+ LWLockRelease(leader_lwlock);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ RemoveDecodeGroupMemberLocked(MyProc->decodeGroupLeader);
+ /* reset the bool to let the leader know that we are going away */
+ MyProc->decodeAbortPending = false;
+ MyProc->decodeLocked = false;
+ txn->txn_flags |= RBTXN_ROLLBACK;
+ ok = false;
+ }
+ else
+ {
+ MyProc->decodeLocked = true;
+ ok = true;
+ }
+ LWLockRelease(leader_lwlock);
+
+ return ok;
+}
+
+/*
+ * LogicalUnlockTransaction
+ * Indicate that the logical decoding plugin is done accessing
+ * catalog information.
+ *
+ *
+ * To prevent issues while decoding in-progress transactions, we
+ * need to prevent abort of the transaction while accessing any catalogs.
+ * To enforce that, each decoding backend has to call
+ * LogicalLockTransaction prior to any catalog access, and then
+ * LogicalUnlockTransaction immediately after it. This unlock function
+ * removes the decoding backend from a "decoding group" for a given
+ * transaction.
+ */
+void
+LogicalUnlockTransaction(ReorderBufferTXN *txn)
+{
+ LWLock *leader_lwlock;
+
+ /*
+ * If the transaction is known to have aborted, we should have never got
+ * here (the plugin should have interrupted the decoding).
+ */
+ Assert(!rbtxn_rollback(txn));
+
+ /* If it's not locked, we're done. */
+ if (!MyProc->decodeLocked)
+ return;
+
+ /*
+ * Transactions that have not modified catalogs do not need to
+ * join the decoding group.
+ */
+ if (!rbtxn_has_catalog_changes(txn))
+ return;
+
+ /*
+ * Currently, only 2PC transactions can be decoded before commit
+ * (at prepare). So regular transactions are automatically safe.
+ */
+ if (!rbtxn_prepared(txn))
+ return;
+
+ /*
+ * Check commit status. If a transaction already committed, there
+ * is no danger when accessing catalogs.
+ */
+ if (rbtxn_commit(txn))
+ return;
+
+ Assert(MyProc->decodeGroupLeader);
+ leader_lwlock = LockHashPartitionLockByProc(MyProc->decodeGroupLeader);
+ LWLockAcquire(leader_lwlock, LW_SHARED);
+ if (MyProc->decodeAbortPending)
+ {
+ /*
+ * Remove ourself from the decodeGroupMembership
+ */
+ LWLockRelease(leader_lwlock);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ RemoveDecodeGroupMemberLocked(MyProc->decodeGroupLeader);
+
+ /*
+ * reset the bool since it's a PGPROC field and we don't want
+ * things loitering around in it.
+ */
+ MyProc->decodeAbortPending = false;
+
+ txn->txn_flags |= RBTXN_ROLLBACK;
+ }
+ MyProc->decodeLocked = false;
+ LWLockRelease(leader_lwlock);
+ return;
+}
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index afe1c03aa3..2be2910207 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -2432,6 +2432,45 @@ BackendXidGetPid(TransactionId xid)
return result;
}
+/*
+ * BackendXidGetProc -- get a backend's PGPROC given its XID
+ *
+ * Note that it is up to the caller to be sure that the question
+ * remains meaningful for long enough for the answer to be used ...
+ *
+ * Only main transaction Ids are considered.
+ *
+ */
+PGPROC *
+BackendXidGetProc(TransactionId xid)
+{
+ PGPROC *result = NULL;
+ ProcArrayStruct *arrayP = procArray;
+ int index;
+
+ if (xid == InvalidTransactionId) /* never match invalid xid */
+ return NULL;
+
+ LWLockAcquire(ProcArrayLock, LW_SHARED);
+
+ for (index = 0; index < arrayP->numProcs; index++)
+ {
+ int pgprocno = arrayP->pgprocnos[index];
+ PGPROC *proc = &allProcs[pgprocno];
+ volatile PGXACT *pgxact = &allPgXact[pgprocno];
+
+ if (pgxact->xid == xid)
+ {
+ result = proc;
+ break;
+ }
+ }
+
+ LWLockRelease(ProcArrayLock);
+
+ return result;
+}
+
/*
* IsBackendPid -- is a given pid a running backend
*
diff --git a/src/backend/storage/lmgr/README b/src/backend/storage/lmgr/README
index 56b0a12a3e..4b4b9c5958 100644
--- a/src/backend/storage/lmgr/README
+++ b/src/backend/storage/lmgr/README
@@ -679,6 +679,52 @@ worker, and the worker fails to join the lock group unless the given PGPROC
still has the same PID and is still a lock group leader. We assume that
PIDs are not recycled quickly enough for this interlock to fail.
+Decode Group Locking
+--------------------
+
+When decoding in-progress transactions, we need to prevent aborts while
+the decoding processes are accessing catalogs, which might lead to issues
+if the transaction modified some of the catalogs. Currently this applies
+only to two-phase transactions, that may be decoded at PREPARE time, but
+in the future this may be extended to regular transactions too.
+
+To prevent that, the backend executing the abort is made to wait for all
+the decoding backends. We use an infrastructure which is very similar
+to the above group locking to form groups of backends performing logical
+decoding of the same in-progress transaction.
+
+Decode Group locking adds five new members to each PGPROC:
+decodeGroupLeader, decodeGroupMembers, decodeGroupLink, decodeLocked and
+decodeAbortPending. A PGPROC's decodeGroupLeader is NULL for processes
+not involved in logical decoding. When a process wants to decode an
+in-progress transaction, it looks up the PGPROC structure associated
+with that transaction ID and makes that PGPROC structure
+its decodeGroupLeader. The decodeGroupMembers field is only used in the
+leader; it is a list of the member PGPROCs of the decode group (the
+leader and all backends decoding this transaction ID).
+The decodeGroupLink field is the list link for this list. The decoding
+backend marks itself as decodeLocked while it is accessing catalog
+metadata for its decoding requirements via the LogicalLockTransaction
+API. It resets the same via the LogicalUnlockTransaction API.
+
+Meanwhile, if this in-progress transaction is aborted, the PGPROC
+corresponding to it sets decodeAbortPending on itself and also on
+all the decodeGroupMembers entries.
+
+The decodeGroupMembers entries stop decoding this transaction and exit.
+When all the decoding backends have exited the abort can proceed.
+
+All five of these fields are considered to be protected by a lock manager
+partition lock. The partition lock that protects these fields within a given
+decode group is chosen by taking the leader's pgprocno modulo the number of lock
+manager partitions. Holding this single lock allows safe manipulation of the
+decodeGroupMembers list for the decode group.
+
+The decodeGroupLeader's PGPROC and its PID are accessible to each
+decoding backend. A decoding backend fails to join the decode
+group unless the given PGPROC still has the same PID and is still
+a decode group leader. We assume that PIDs are not recycled quickly
+enough for this interlock to fail.
User Locks (Advisory Locks)
---------------------------
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 6f30e082b2..90b4fa4ecd 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -267,6 +267,11 @@ InitProcGlobal(void)
/* Initialize lockGroupMembers list. */
dlist_init(&procs[i].lockGroupMembers);
+
+ /* Initialize decodeGroupMembers list. */
+ dlist_init(&procs[i].decodeGroupMembers);
+ procs[i].decodeAbortPending = false;
+ procs[i].decodeLocked = false;
}
/*
@@ -406,6 +411,12 @@ InitProcess(void)
Assert(MyProc->lockGroupLeader == NULL);
Assert(dlist_is_empty(&MyProc->lockGroupMembers));
+ /* Check that group decode fields are in a proper initial state. */
+ Assert(MyProc->decodeGroupLeader == NULL);
+ Assert(dlist_is_empty(&MyProc->decodeGroupMembers));
+ MyProc->decodeAbortPending = false;
+ MyProc->decodeLocked = false;
+
/* Initialize wait event information. */
MyProc->wait_event_info = 0;
@@ -581,6 +592,12 @@ InitAuxiliaryProcess(void)
Assert(MyProc->lockGroupLeader == NULL);
Assert(dlist_is_empty(&MyProc->lockGroupMembers));
+ /* Check that group decode fields are in a proper initial state. */
+ Assert(MyProc->decodeGroupLeader == NULL);
+ Assert(dlist_is_empty(&MyProc->decodeGroupMembers));
+ MyProc->decodeAbortPending = false;
+ MyProc->decodeLocked = false;
+
/*
* We might be reusing a semaphore that belonged to a failed process. So
* be careful and reinitialize its value here. (This is not strictly
@@ -1887,3 +1904,318 @@ BecomeLockGroupMember(PGPROC *leader, int pid)
return ok;
}
+
+/*
+ * AssignDecodeGroupLeader
+ * Lookup process using xid and designate as decode group leader.
+ *
+ * Once this function has returned, other processes can join the decode
+ * group by calling BecomeDecodeGroupMember.
+ */
+PGPROC *
+AssignDecodeGroupLeader(TransactionId xid)
+{
+ PGPROC *proc = NULL;
+ int pid;
+ LWLock *leader_lwlock;
+
+ Assert(xid != InvalidTransactionId);
+
+ /*
+ * Lookup the backend executing this transaction.
+ *
+ * If the transaction already completed, we can bail out.
+ */
+ proc = BackendXidGetProc(xid);
+ if (proc)
+ pid = proc->pid;
+ else
+ return NULL;
+
+ /*
+ * A process running an XID can't have a leader; it can only be
+ * a leader (in which case it points to itself).
+ */
+ Assert(!proc->decodeGroupLeader ||
+ (proc->decodeGroupLeader == proc));
+
+ /*
+ * This proc will become decodeGroupLeader if it's not already.
+ */
+ if (proc->decodeGroupLeader == NULL)
+ {
+ volatile PGXACT *pgxact;
+
+ /* Create single-member group, containing this proc. */
+ leader_lwlock = LockHashPartitionLockByProc(proc);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+
+ /* recheck we are still the same */
+ pgxact = &ProcGlobal->allPgXact[proc->pgprocno];
+
+ /*
+ * We know the process was executing the XID a while ago, but we
+ * have not been holding any locks on PGPROC so it might have
+ * started executing something else since then. So we need to
+ * recheck that it is indeed still running the right XID.
+ *
+ * If it's not, the transaction must have already completed, so
+ * we don't need to create any decoding group.
+ */
+ if (pgxact->xid == xid)
+ {
+ /*
+ * Some other decoding backend might have marked the process
+ * as a leader before we acquired the lock. But it must not
+ * be a follower of some other leader.
+ */
+ Assert((proc->decodeGroupLeader == NULL) ||
+ (proc->decodeGroupLeader == proc));
+
+ /* recheck if someone else did not already assign us */
+ if (proc->decodeGroupLeader == NULL)
+ {
+ /*
+ * The leader is also a part of the decoding group,
+ * so we add it to the members list as well.
+ */
+ proc->decodeGroupLeader = proc;
+ dlist_push_head(&proc->decodeGroupMembers,
+ &proc->decodeGroupLink);
+ }
+ }
+ else
+ {
+ /* proc entry is gone */
+ proc = NULL;
+ }
+ LWLockRelease(leader_lwlock);
+ }
+
+ elog(DEBUG1, "became group leader (%p)", proc);
+ return proc;
+}
+
+/*
+ * BecomeDecodeGroupMember - designate process as decode group member
+ *
+ * This is pretty straightforward except for the possibility that the leader
+ * whose group we're trying to join might exit before we manage to do so;
+ * and the PGPROC might get recycled for an unrelated process. To avoid
+ * that, we require the caller to pass the PID of the intended PGPROC as
+ * an interlock. Returns true if we successfully join the intended decode
+ * group, and false if not.
+ */
+bool
+BecomeDecodeGroupMember(PGPROC *leader, int pid, bool is_prepared)
+{
+ LWLock *leader_lwlock;
+ bool ok = false;
+
+ /* Group leader can't become member of group */
+ Assert(MyProc != leader);
+
+ /* Can't already be a member of a group */
+ Assert(MyProc->decodeGroupLeader == NULL);
+
+ /* PID must be valid OR this is a prepared transaction. */
+ Assert(pid != 0 || is_prepared);
+
+ /*
+ * Get lock protecting the group fields. Note LockHashPartitionLockByProc
+ * accesses leader->pgprocno in a PGPROC that might be free. This is safe
+ * because all PGPROCs' pgprocno fields are set during shared memory
+ * initialization and never change thereafter; so we will acquire the
+ * correct lock even if the leader PGPROC is in process of being recycled.
+ */
+ leader_lwlock = LockHashPartitionLockByProc(leader);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+
+ /* Is this the leader we're looking for? */
+ if (leader->pid == pid && leader->decodeGroupLeader == leader)
+ {
+ if (is_prepared)
+ Assert(pid == 0);
+ /* is the leader going away? */
+ if (leader->decodeAbortPending)
+ ok = false;
+ else
+ {
+ /* OK, join the group */
+ ok = true;
+ MyProc->decodeGroupLeader = leader;
+ dlist_push_tail(&leader->decodeGroupMembers, &MyProc->decodeGroupLink);
+ }
+ }
+ else
+ MyProc->decodeGroupLeader = NULL;
+ LWLockRelease(leader_lwlock);
+
+ elog(DEBUG1, "became group member (%p) to (%p)", MyProc, leader);
+ return ok;
+}
+
+/*
+ * RemoveDecodeGroupMember
+ * Remove a member from the decoding group of a leader.
+ */
+void
+RemoveDecodeGroupMember(PGPROC *leader)
+{
+ LWLock *leader_lwlock;
+
+ leader_lwlock = LockHashPartitionLockByProc(leader);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ RemoveDecodeGroupMemberLocked(leader);
+ LWLockRelease(leader_lwlock);
+
+ return;
+}
+
+/*
+ * RemoveDecodeGroupMemberLocked
+ * Remove a member from a decoding group of a leader.
+ *
+ * Assumes that the caller is holding appropriate lock on PGPROC.
+ */
+void
+RemoveDecodeGroupMemberLocked(PGPROC *leader)
+{
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ dlist_delete(&MyProc->decodeGroupLink);
+ /* leader links to itself, so never empty */
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ MyProc->decodeGroupLeader = NULL;
+ elog(DEBUG1, "removed group member (%p) from (%p)", MyProc, leader);
+
+ return;
+}
+
+/*
+ * LogicalDecodeRemoveTransaction
+ * Notify all decoding members that this transaction is going away.
+ *
+ * Wait for all decodeGroupMembers to ack back before returning from
+ * here but only in case of aborts.
+ *
+ * This function should be called *after* the proc has been removed
+ * from the procArray.
+ *
+ * If the transaction is committing, it's ok for the decoding backends
+ * to continue merrily - there is no danger in accessing catalogs. When
+ * a backend tries to join the decoding group, it won't find the proc anymore,
+ * forcing it to re-check transaction status and cache the commit
+ * status for future calls (see LogicalLockTransaction).
+ *
+ * If a backend that is part of the decode group dies or crashes,
+ * that effectively forces a database restart, which cleans up the
+ * shared memory state.
+ */
+void
+LogicalDecodeRemoveTransaction(PGPROC *leader, bool isCommit)
+{
+ LWLock *leader_lwlock;
+ dlist_mutable_iter change_i;
+ dlist_iter iter;
+ PGPROC *proc;
+ bool do_wait;
+
+ leader_lwlock = LockHashPartitionLockByProc(leader);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+
+ /*
+ * If the proc has not been initialized as a group leader, there are
+ * no group members to wait for and we can terminate right away.
+ */
+ if (leader->decodeGroupLeader == NULL)
+ {
+ Assert(dlist_is_empty(&leader->decodeGroupMembers));
+ LWLockRelease(leader_lwlock);
+ return;
+ }
+
+ /* mark the transaction as aborting */
+ leader->decodeAbortPending = (!isCommit);
+
+recheck:
+ do_wait = false;
+ Assert(leader->decodeGroupLeader == leader);
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ if (!isCommit)
+ {
+ /*
+ * We need to walk the list of group members, and decide if we
+ * need to wait for some of them. In other words, we need to
+ * check if there are any processes besides the leader.
+ */
+ dlist_foreach(iter, &leader->decodeGroupMembers)
+ {
+ proc = dlist_container(PGPROC, decodeGroupLink, iter.cur);
+
+ /* Ignore the leader (i.e. ourselves). */
+ if (proc == leader)
+ continue;
+
+ /* mark the proc to indicate abort is pending */
+ if (!proc->decodeAbortPending)
+ {
+ proc->decodeAbortPending = true;
+ elog(DEBUG1, "marking group member (%p) from (%p) for abort",
+ proc, leader);
+ }
+
+ /* if the proc is currently locked, wait */
+ if (proc->decodeLocked)
+ do_wait = true;
+ }
+
+ if (do_wait)
+ {
+ int rc;
+ LWLockRelease(leader_lwlock);
+
+ elog(LOG, "Waiting for backends to abort decoding");
+ /*
+ * Wait on our latch to allow decodeGroupMembers to
+ * go away soon
+ */
+ rc = WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+ 100L,
+ WAIT_EVENT_PG_SLEEP);
+ ResetLatch(MyLatch);
+
+ /* emergency bailout if postmaster has died */
+ if (rc & WL_POSTMASTER_DEATH)
+ proc_exit(1);
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Recheck decodeGroupMembers */
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ goto recheck;
+ }
+ }
+
+ /*
+ * All backends exited cleanly in case of aborts above,
+ * remove decodeGroupMembers now for both commit/abort cases
+ */
+ Assert(leader->decodeGroupLeader == leader);
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ dlist_foreach_modify(change_i, &leader->decodeGroupMembers)
+ {
+ proc = dlist_container(PGPROC, decodeGroupLink, change_i.cur);
+ Assert(!proc->decodeLocked);
+ dlist_delete(&proc->decodeGroupLink);
+ elog(DEBUG1, "deleting group member (%p) from (%p)",
+ proc, leader);
+ proc->decodeGroupLeader = NULL;
+ }
+ Assert(dlist_is_empty(&leader->decodeGroupMembers));
+ leader->decodeGroupLeader = NULL;
+ leader->decodeAbortPending = false;
+ LWLockRelease(leader_lwlock);
+
+ return;
+}
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 619c5f4d73..63b14367f0 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -117,6 +117,8 @@ extern void LogicalIncreaseXminForSlot(XLogRecPtr lsn, TransactionId xmin);
extern void LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn,
XLogRecPtr restart_lsn);
extern void LogicalConfirmReceivedLocation(XLogRecPtr lsn);
+extern bool LogicalLockTransaction(ReorderBufferTXN *txn);
+extern void LogicalUnlockTransaction(ReorderBufferTXN *txn);
extern bool filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id);
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 177ef98e43..385bb486bb 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -141,6 +141,11 @@ typedef struct ReorderBufferChange
#define RBTXN_HAS_CATALOG_CHANGES 0x0001
#define RBTXN_IS_SUBXACT 0x0002
#define RBTXN_IS_SERIALIZED 0x0004
+#define RBTXN_PREPARE 0x0008
+#define RBTXN_COMMIT_PREPARED 0x0010
+#define RBTXN_ROLLBACK_PREPARED 0x0020
+#define RBTXN_COMMIT 0x0040
+#define RBTXN_ROLLBACK 0x0080
/* does the txn have catalog changes */
#define rbtxn_has_catalog_changes(txn) (txn->txn_flags & RBTXN_HAS_CATALOG_CHANGES)
@@ -154,6 +159,16 @@ typedef struct ReorderBufferChange
* nentries_mem == nentries.
*/
#define rbtxn_is_serialized(txn) (txn->txn_flags & RBTXN_IS_SERIALIZED)
+/* is this txn prepared? */
+#define rbtxn_prepared(txn) (txn->txn_flags & RBTXN_PREPARE)
+/* was this prepared txn committed in the meanwhile? */
+#define rbtxn_commit_prepared(txn) (txn->txn_flags & RBTXN_COMMIT_PREPARED)
+/* was this prepared txn aborted in the meanwhile? */
+#define rbtxn_rollback_prepared(txn) (txn->txn_flags & RBTXN_ROLLBACK_PREPARED)
+/* was this txn committed in the meanwhile? */
+#define rbtxn_commit(txn) (txn->txn_flags & RBTXN_COMMIT)
+/* was this txn aborted in the meanwhile? */
+#define rbtxn_rollback(txn) (txn->txn_flags & RBTXN_ROLLBACK)
typedef struct ReorderBufferTXN
{
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 5c19a61dcf..45d2dbd766 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -200,6 +200,26 @@ struct PGPROC
PGPROC *lockGroupLeader; /* lock group leader, if I'm a member */
dlist_head lockGroupMembers; /* list of members, if I'm a leader */
dlist_node lockGroupLink; /* my member link, if I'm a member */
+
+ /*
+ * Support for decoding groups. Use LockHashPartitionLockByProc on the group
+ * leader to get the LWLock protecting these fields.
+ *
+ * For prepared and uncommitted transactions, decoding backends working on
+ * the same XID will link themselves up to the corresponding PGPROC
+ * entry (decodeGroupLeader).
+ *
+ * They will remove themselves when they are done decoding.
+ *
+ * If the prepared or uncommitted transaction decides to abort, then
+ * the decodeGroupLeader will set the decodeAbortPending flag allowing
+ * the decodeGroupMembers to abort their decoding appropriately
+ */
+ PGPROC *decodeGroupLeader; /* decode group leader, if I'm a member */
+ dlist_head decodeGroupMembers; /* list of members, if I'm a leader */
+ dlist_node decodeGroupLink; /* my member link, if I'm a member */
+ bool decodeLocked; /* is it currently locked by this proc? */
+ bool decodeAbortPending; /* is the decode group leader aborting? */
};
/* NOTE: "typedef struct PGPROC PGPROC" appears in storage/lock.h. */
@@ -327,4 +347,10 @@ extern PGPROC *AuxiliaryPidGetProc(int pid);
extern void BecomeLockGroupLeader(void);
extern bool BecomeLockGroupMember(PGPROC *leader, int pid);
+extern PGPROC *AssignDecodeGroupLeader(TransactionId xid);
+extern bool BecomeDecodeGroupMember(PGPROC *leader, int pid, bool is_prepared);
+extern void RemoveDecodeGroupMember(PGPROC *leader);
+extern void RemoveDecodeGroupMemberLocked(PGPROC *leader);
+extern void LogicalDecodeRemoveTransaction(PGPROC *leader, bool isCommit);
+
#endif /* PROC_H */
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 75bab2985f..776de2470e 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -97,6 +97,7 @@ extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids
extern PGPROC *BackendPidGetProc(int pid);
extern PGPROC *BackendPidGetProcWithLock(int pid);
+extern PGPROC *BackendXidGetProc(TransactionId xid);
extern int BackendXidGetPid(TransactionId xid);
extern bool IsBackendPid(int pid);
--
2.15.1 (Apple Git-101)
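
To make the abort interlock from the README hunk easier to follow, here is a single-process model of it. It is a rough sketch only: the real code runs across backends under the lock manager partition LWLock and sleeps on a latch between passes, none of which is modelled. DemoMember and the demo_* functions are hypothetical names; the three bool fields loosely mirror decodeGroupLeader != NULL, decodeLocked and decodeAbortPending.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-backend state of one decoding group member. */
typedef struct DemoMember
{
	bool		in_group;		/* has joined the leader's group */
	bool		locked;			/* between Lock and Unlock calls */
	bool		abort_pending;	/* leader asked us to stop */
} DemoMember;

/*
 * One pass of the leader's recheck loop on abort: flag every member and
 * report whether some member is still locked (i.e. the leader must wait).
 */
static bool
demo_leader_abort_pass(DemoMember *members, size_t n)
{
	bool		do_wait = false;

	for (size_t i = 0; i < n; i++)
	{
		if (!members[i].in_group)
			continue;
		members[i].abort_pending = true;
		if (members[i].locked)
			do_wait = true;
	}
	return do_wait;
}

/* Member side of the lock call: refuse and leave if abort is pending. */
static bool
demo_member_lock(DemoMember *m)
{
	assert(m->in_group);
	if (m->abort_pending)
	{
		m->in_group = false;
		m->abort_pending = false;
		m->locked = false;
		return false;
	}
	m->locked = true;
	return true;
}

/* Member side of the unlock call: leave the group if abort is pending. */
static void
demo_member_unlock(DemoMember *m)
{
	assert(m->in_group);
	if (m->abort_pending)
	{
		m->in_group = false;
		m->abort_pending = false;
	}
	m->locked = false;
}
```

In this model the leader keeps calling demo_leader_abort_pass until it returns false, which corresponds to the "goto recheck" loop in LogicalDecodeRemoveTransaction.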
Attachment: 0003-Support-decoding-of-two-phase-transactions-at-PREPAR.0204.patch
From 89f51ca1a5d8e220db7e2c94c4061ca689594791 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Sun, 1 Apr 2018 17:06:55 +0530
Subject: [PATCH 3/6] Support decoding of two-phase transactions at PREPARE
Until now, two-phase transactions were decoded at COMMIT, just like
regular transactions. During replay, two-phase transactions were
translated into regular transactions on the subscriber, and the GID
was not forwarded to it.
This patch allows PREPARE-time decoding two-phase transactions (if
the output plugin supposts this capability), in which case the
transactions are replayed at PREPARE and then committed later when
COMMIT PREPARED arrives.
On the subscriber, the transactions will be executed as two-phase
transactions, with the same GID. This is important for various
external transaction managers, that often encode information into
the GID itself.
All catalog access while decoding of such 2PC has to be carried out
via the use of LogicalLockTransaction/LogicalUnlockTransaction APIs
at relevant locations. This includes the location where the output
plugin's change apply API is to be invoked. This protects any catalog
access inside the output plugin's change apply API from concurrent
rollback operations.
---
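The routing described in the commit message can be sketched as a small standalone C model. It is only an illustration, not code from the patch: the enum names, route_prepare(), route_commit_prepared() and the "nodecode" GID prefix are all hypothetical, and treating a missing filter as "decode everything" is an assumption of this sketch. It shows why the prepare filter has to give the same answer at PREPARE and at COMMIT PREPARED: the second decision is derived from the first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical filter signature: return true to skip PREPARE-time decoding. */
typedef bool (*filter_prepare_fn) (const char *gid);

typedef enum
{
	AT_PREPARE_SKIP,			/* wait for COMMIT PREPARED, replay then */
	AT_PREPARE_DECODE			/* replay changes now, then call prepare */
} PrepareAction;

typedef enum
{
	AT_COMMIT_REPLAY,			/* regular commit path: replay changes now */
	AT_COMMIT_NOTIFY			/* changes already sent, only report the GID */
} CommitAction;

/* Example filter (an assumption): skip GIDs that start with "nodecode". */
bool
skip_nodecode_gids(const char *gid)
{
	return strncmp(gid, "nodecode", 8) == 0;
}

/* Decision taken when a PREPARE record is seen. */
PrepareAction
route_prepare(bool enable_twophase, filter_prepare_fn filter, const char *gid)
{
	assert(gid != NULL);

	/* plugin without two-phase callbacks: behave as before the patch */
	if (!enable_twophase)
		return AT_PREPARE_SKIP;

	/* optional per-transaction filter; absent filter means "decode" here */
	if (filter != NULL && filter(gid))
		return AT_PREPARE_SKIP;

	return AT_PREPARE_DECODE;
}

/* Decision taken when COMMIT PREPARED is seen: must mirror route_prepare(). */
CommitAction
route_commit_prepared(bool enable_twophase, filter_prepare_fn filter,
					  const char *gid)
{
	if (route_prepare(enable_twophase, filter, gid) == AT_PREPARE_DECODE)
		return AT_COMMIT_NOTIFY;

	return AT_COMMIT_REPLAY;
}
```

If the filter answered differently on the second call (for example after a restart), a transaction could be replayed twice or not at all, which is why the patch asks the filter to be stable for a given xid.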
src/backend/access/transam/twophase.c | 5 +
src/backend/replication/logical/decode.c | 147 +++++++++++++++--
src/backend/replication/logical/logical.c | 193 ++++++++++++++++++++++
src/backend/replication/logical/reorderbuffer.c | 209 +++++++++++++++++++++---
src/include/replication/logical.h | 11 +-
src/include/replication/output_plugin.h | 45 +++++
src/include/replication/reorderbuffer.h | 54 ++++++
7 files changed, 630 insertions(+), 34 deletions(-)
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index d6e4b7980f..f3091af385 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1506,6 +1506,11 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
gid);
ProcArrayRemove(proc, latestXid);
+ /*
+ * Tell logical decoding backends interested in this XID
+ * that this is going away
+ */
+ LogicalDecodeRemoveTransaction(proc, isCommit);
/*
* In case we fail while running the callbacks, mark the gxact invalid so
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 6eb0d5527e..51d544d0f5 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -34,6 +34,7 @@
#include "access/xlogutils.h"
#include "access/xlogreader.h"
#include "access/xlogrecord.h"
+#include "access/twophase.h"
#include "catalog/pg_control.h"
@@ -72,6 +73,8 @@ static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_commit *parsed, TransactionId xid);
static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare * parsed);
/* common function to decode tuples */
static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -280,16 +283,33 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
break;
}
case XLOG_XACT_PREPARE:
+ {
+ xl_xact_parsed_prepare parsed;
- /*
- * Currently decoding ignores PREPARE TRANSACTION and will just
- * decode the transaction when the COMMIT PREPARED is sent or
- * throw away the transaction's contents when a ROLLBACK PREPARED
- * is received. In the future we could add code to expose prepared
- * transactions in the changestream allowing for a kind of
- * distributed 2PC.
- */
- ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+ /* check that output plugin is capable of twophase decoding */
+ if (!ctx->enable_twophase)
+ {
+ ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+ break;
+ }
+
+ /* ok, parse it */
+ ParsePrepareRecord(XLogRecGetInfo(buf->record),
+ XLogRecGetData(buf->record), &parsed);
+
+ /* does output plugin want this particular transaction? */
+ if (ctx->callbacks.filter_prepare_cb &&
+ ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+ parsed.twophase_gid))
+ {
+ ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+ buf->origptr);
+ break;
+ }
+
+ DecodePrepare(ctx, buf, &parsed);
+ break;
+ }
break;
default:
elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
@@ -627,9 +647,90 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
buf->origptr, buf->endptr);
}
+ /*
+ * Decide if we're processing COMMIT PREPARED, or a regular COMMIT.
+ * Regular commit simply triggers a replay of transaction changes from the
+ * reorder buffer. For COMMIT PREPARED that however already happened at
+ * PREPARE time, and so we only need to notify the subscriber that the GID
+ * finally committed.
+ *
+ * For output plugins that do not support PREPARE-time decoding of
+ * two-phase transactions, we never even see the PREPARE and all two-phase
+ * transactions simply fall through to the second branch.
+ */
+ if (TransactionIdIsValid(parsed->twophase_xid) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder,
+ parsed->twophase_xid, parsed->twophase_gid))
+ {
+ Assert(xid == parsed->twophase_xid);
+ /* we are processing COMMIT PREPARED */
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid, true);
+ }
+ else
+ {
+ /* replay actions of all transaction + subtransactions in order */
+ ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn);
+ }
+}
+
+/*
+ * Decode PREPARE record. Similar logic as in COMMIT
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare * parsed)
+{
+ XLogRecPtr origin_lsn = parsed->origin_lsn;
+ TimestampTz commit_time = parsed->origin_timestamp;
+ RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+ int i;
+ TransactionId xid = parsed->twophase_xid;
+
+ /*
+ * Process invalidation messages, even if we're not interested in the
+ * transaction's contents, since the various caches need to always be
+ * consistent.
+ */
+ if (parsed->nmsgs > 0)
+ {
+ if (!ctx->fast_forward)
+ ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
+ parsed->nmsgs, parsed->msgs);
+ ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+ }
+
+ /*
+ * Tell the reorderbuffer about the surviving subtransactions. We need to
+ * do this because the main transaction itself has not committed since we
+ * are in the prepare phase right now. So we need to be sure the snapshot
+ * is set up correctly for the main transaction in case all changes
+ * happened in subtransactions
+ */
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+ buf->origptr, buf->endptr);
+ }
+
+ if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+ (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+ ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+ {
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr);
+ }
+ ReorderBufferForget(ctx->reorder, xid, buf->origptr);
+
+ return;
+ }
+
/* replay actions of all transaction + subtransactions in order */
- ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
- commit_time, origin_id, origin_lsn);
+ ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn, parsed->twophase_gid);
}
/*
@@ -641,6 +742,30 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid)
{
int i;
+ XLogRecPtr origin_lsn = InvalidXLogRecPtr;
+ TimestampTz commit_time = 0;
+ RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ origin_lsn = parsed->origin_lsn;
+ commit_time = parsed->origin_timestamp;
+ }
+
+ /*
+ * If it's ROLLBACK PREPARED then handle it via callbacks.
+ */
+ if (TransactionIdIsValid(xid) &&
+ !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+ parsed->dbId == ctx->slot->data.database &&
+ !FilterByOrigin(ctx, origin_id) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
+ {
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid, false);
+ return;
+ }
for (i = 0; i < parsed->nsubxacts; i++)
{
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 2238066138..a97a7c838c 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -60,6 +60,16 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+static void abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change);
static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -125,6 +135,7 @@ StartupDecodingContext(List *output_plugin_options,
MemoryContext context,
old_context;
LogicalDecodingContext *ctx;
+ int twophase_callbacks;
/* shorter lines... */
slot = MyReplicationSlot;
@@ -184,8 +195,38 @@ StartupDecodingContext(List *output_plugin_options,
ctx->reorder->begin = begin_cb_wrapper;
ctx->reorder->apply_change = change_cb_wrapper;
ctx->reorder->commit = commit_cb_wrapper;
+ ctx->reorder->abort = abort_cb_wrapper;
+ ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+ ctx->reorder->prepare = prepare_cb_wrapper;
+ ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+ ctx->reorder->abort_prepared = abort_prepared_cb_wrapper;
ctx->reorder->message = message_cb_wrapper;
+ /*
+ * Check that plugin implements all callbacks necessary to decode
+ * two-phase transactions - we either have to have all of them or none.
+ * The filter_prepare callback is optional, but can only be defined when
+ * two-phase decoding is enabled (i.e. the three other callbacks are
+ * defined).
+ */
+ twophase_callbacks = (ctx->callbacks.prepare_cb != NULL) +
+ (ctx->callbacks.commit_prepared_cb != NULL) +
+ (ctx->callbacks.abort_prepared_cb != NULL);
+
+ ctx->enable_twophase = (twophase_callbacks == 3);
+
+ /* Plugins with incorrect number of two-phase callbacks are broken. */
+ if ((twophase_callbacks != 3) && (twophase_callbacks != 0))
+ ereport(ERROR,
+ (errmsg("Output plugin registered only %d twophase callbacks. ",
+ twophase_callbacks)));
+
+ /* filter_prepare is optional, but requires two-phase decoding */
+ if ((ctx->callbacks.filter_prepare_cb != NULL) && (!ctx->enable_twophase))
+ ereport(ERROR,
+ (errmsg("Output plugin does not support two-phase decoding, but "
+ "registered filter_prepare callback.")));
+
ctx->out = makeStringInfo();
ctx->prepare_write = prepare_write;
ctx->write = do_write;
@@ -697,6 +738,122 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
error_context_stack = errcallback.previous;
}
+static void
+abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "abort";
+ state.report_location = txn->final_lsn; /* beginning of abort record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.abort_cb(ctx, txn, abort_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "prepare";
+ state.report_location = txn->final_lsn; /* beginning of prepare record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "commit_prepared";
+ state.report_location = txn->final_lsn; /* beginning of commit record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "abort_prepared";
+ state.report_location = txn->final_lsn; /* beginning of abort record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.abort_prepared_cb(ctx, txn, abort_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
static void
change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change)
@@ -734,6 +891,42 @@ change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
error_context_stack = errcallback.previous;
}
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+ bool ret;
+
+ /*
+ * If twophase is not enabled, skip decoding at PREPARE time
+ */
+ if (!ctx->enable_twophase)
+ return true;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "filter_prepare";
+ state.report_location = InvalidXLogRecPtr;
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = false;
+
+ /* do the actual work: call callback */
+ ret = ctx->callbacks.filter_prepare_cb(ctx, txn, xid, gid);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+
+ return ret;
+}
+
bool
filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
{
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 3c9af58640..fdce0249f1 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1277,25 +1277,18 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
* the top and subtransactions (using a k-way merge) and replay the changes in
* lsn order.
*/
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
- XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
- TimestampTz commit_time,
- RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+ ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn)
{
- ReorderBufferTXN *txn;
volatile Snapshot snapshot_now;
volatile CommandId command_id = FirstCommandId;
bool using_subtxn;
ReorderBufferIterTXNState *volatile iterstate = NULL;
- txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
- false);
-
- /* unknown transaction, nothing to replay */
- if (txn == NULL)
- return;
-
txn->final_lsn = commit_lsn;
txn->end_lsn = end_lsn;
txn->commit_time = commit_time;
@@ -1388,8 +1381,14 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
relpathperm(change->data.tp.relnode,
MAIN_FORKNUM));
+ /* Lock transaction before catalog access */
+ if (!LogicalLockTransaction(txn))
+ break;
+
relation = RelationIdGetRelation(reloid);
+ LogicalUnlockTransaction(txn);
+
if (relation == NULL)
elog(ERROR, "could not open relation with OID %u (for filenode \"%s\")",
reloid,
@@ -1419,7 +1418,23 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
if (!IsToastRelation(relation))
{
ReorderBufferToastReplace(rb, txn, relation, change);
+
+ /*
+ * Output plugins can access catalog metadata and we
+ * do not have any control over that. We could ask
+ * them to call
+ * LogicalLockTransaction/LogicalUnlockTransaction
+ * APIs themselves, but that leads to unnecessary
+ * complications and expectations from plugin
+ * writers. We avoid this by calling these APIs
+ * here, thereby ensuring that the in-progress
+ * transaction will be around for the duration of
+ * the apply_change call below
+ */
+ if (!LogicalLockTransaction(txn))
+ break;
rb->apply_change(rb, txn, relation, change);
+ LogicalUnlockTransaction(txn);
/*
* Only clear reassembled toast chunks if we're sure
@@ -1581,8 +1596,26 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
ReorderBufferIterTXNFinish(rb, iterstate);
iterstate = NULL;
- /* call commit callback */
- rb->commit(rb, txn, commit_lsn);
+ /*
+ * Call abort/commit/prepare callback, depending on the transaction
+ * state.
+ *
+ * If the transaction aborted during apply (which currently can happen
+ * only for prepared transactions), simply call the abort callback.
+ *
+ * Otherwise call either PREPARE (for twophase transactions) or COMMIT
+ * (for regular ones).
+ */
+ if (rbtxn_rollback(txn))
+ rb->abort(rb, txn, commit_lsn);
+ else if (rbtxn_prepared(txn))
+ rb->prepare(rb, txn, commit_lsn);
+ else
+ rb->commit(rb, txn, commit_lsn);
+
+ /* remove ourself from the decodeGroupLeader */
+ if (MyProc->decodeGroupLeader)
+ RemoveDecodeGroupMember(MyProc->decodeGroupLeader);
/* this is just a sanity check against bad output plugin behaviour */
if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -1609,7 +1642,12 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
if (snapshot_now->copied)
ReorderBufferFreeSnap(rb, snapshot_now);
- /* remove potential on-disk data, and deallocate */
+ /*
+ * remove potential on-disk data, and deallocate.
+ *
+ * We remove it even for prepared transactions (GID is enough to
+ * commit/abort those later).
+ */
ReorderBufferCleanupTXN(rb, txn);
}
PG_CATCH();
@@ -1643,6 +1681,137 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
PG_END_TRY();
}
+
+/*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+ return rb->filter_prepare(rb, txn, xid, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferCommitInternal()
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /* unknown transaction, nothing to replay */
+ if (txn == NULL)
+ return;
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a twophase transaction. It calls ReorderBufferCommitInternal()
+ * since all prepared transactions need to be decoded at PREPARE time.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /* unknown transaction, nothing to replay */
+ if (txn == NULL)
+ return;
+
+ txn->txn_flags |= RBTXN_PREPARE;
+ strcpy(txn->gid, gid);
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Check whether this transaction was sent as prepared to subscribers.
+ * Called while handling commit|abort prepared.
+ */
+bool
+ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+ const char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /*
+ * Always call the prepare filter. It's the job of the prepare filter to
+ * give us the *same* response for a given xid across multiple calls
+ * (including ones on restart)
+ */
+ return !(rb->filter_prepare(rb, txn, xid, gid));
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ABORT PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit)
+{
+ ReorderBufferTXN *txn;
+
+ /*
+ * The transaction may or may not exist (during restarts for example).
+ * Anyways, 2PC transactions do not contain any reorderbuffers. So allow
+ * it to be created below.
+ */
+ txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+ true);
+
+ txn->final_lsn = commit_lsn;
+ txn->end_lsn = end_lsn;
+ txn->commit_time = commit_time;
+ txn->origin_id = origin_id;
+ txn->origin_lsn = origin_lsn;
+ strcpy(txn->gid, gid);
+
+ if (is_commit)
+ {
+ txn->txn_flags |= RBTXN_COMMIT_PREPARED;
+ rb->commit_prepared(rb, txn, commit_lsn);
+ }
+ else
+ {
+ txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
+ rb->abort_prepared(rb, txn, commit_lsn);
+ }
+
+ /* cleanup: make sure there's no cache pollution */
+ ReorderBufferExecuteInvalidations(rb, txn);
+ ReorderBufferCleanupTXN(rb, txn);
+}
+
/*
* Abort a transaction that possibly has previous changes. Needs to be first
* called for subtransactions and then for the toplevel xid.
@@ -1711,7 +1880,7 @@ ReorderBufferAbortOld(ReorderBuffer *rb, TransactionId oldestRunningXid)
if (rbtxn_is_serialized(txn) && txn->final_lsn == 0)
{
ReorderBufferChange *last =
- dlist_tail_element(ReorderBufferChange, node, &txn->changes);
+ dlist_tail_element(ReorderBufferChange, node, &txn->changes);
txn->final_lsn = last->lsn;
}
@@ -2625,9 +2794,9 @@ ReorderBufferSerializedPath(char *path, ReplicationSlot *slot, TransactionId xid
XLogSegNoOffsetToRecPtr(segno, 0, recptr, wal_segment_size);
snprintf(path, MAXPGPATH, "pg_replslot/%s/xid-%u-lsn-%X-%X.snap",
- NameStr(MyReplicationSlot->data.name),
- xid,
- (uint32) (recptr >> 32), (uint32) recptr);
+ NameStr(MyReplicationSlot->data.name),
+ xid,
+ (uint32) (recptr >> 32), (uint32) recptr);
}
/*
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 63b14367f0..fbe18dff56 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -46,11 +46,11 @@ typedef struct LogicalDecodingContext
struct SnapBuild *snapshot_builder;
/*
- * Marks the logical decoding context as fast forward decoding one.
- * Such a context does not have plugin loaded so most of the the following
+ * Marks the logical decoding context as fast forward decoding one. Such a
+ * context does not have plugin loaded so most of the the following
* properties are unused.
*/
- bool fast_forward;
+ bool fast_forward;
OutputPluginCallbacks callbacks;
OutputPluginOptions options;
@@ -89,6 +89,11 @@ typedef struct LogicalDecodingContext
bool prepared_write;
XLogRecPtr write_location;
TransactionId write_xid;
+
+ /*
+ * Capabilities of the output plugin.
+ */
+ bool enable_twophase;
} LogicalDecodingContext;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 82875d6b3d..5254210a46 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -68,6 +68,46 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+/*
+ * Called for an implicit ABORT of a transaction.
+ */
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+ /*
+ * Called before decoding of a PREPARE record to decide whether the
+ * transaction should be decoded with separate calls to the prepare and
+ * commit_prepared/abort_prepared callbacks (return false), or skipped until
+ * COMMIT PREPARED and sent as a regular transaction (return true).
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
/*
* Called for the generic logical decoding messages.
*/
@@ -99,7 +139,12 @@ typedef struct OutputPluginCallbacks
LogicalDecodeBeginCB begin_cb;
LogicalDecodeChangeCB change_cb;
LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeAbortCB abort_cb;
LogicalDecodeMessageCB message_cb;
+ LogicalDecodeFilterPrepareCB filter_prepare_cb;
+ LogicalDecodePrepareCB prepare_cb;
+ LogicalDecodeCommitPreparedCB commit_prepared_cb;
+ LogicalDecodeAbortPreparedCB abort_prepared_cb;
LogicalDecodeFilterByOriginCB filter_by_origin_cb;
LogicalDecodeShutdownCB shutdown_cb;
} OutputPluginCallbacks;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 385bb486bb..1dedf5cc42 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
#define REORDERBUFFER_H
#include "access/htup_details.h"
+#include "access/twophase.h"
#include "lib/ilist.h"
#include "storage/sinval.h"
#include "utils/hsearch.h"
@@ -179,6 +180,9 @@ typedef struct ReorderBufferTXN
*/
TransactionId xid;
+ /* In case of 2PC we need to pass GID to output plugin */
+ char gid[GIDSIZE];
+
/*
* LSN of the first data carrying, WAL record with knowledge about this
* xid. This is allowed to *not* be first record adorned with this xid, if
@@ -312,6 +316,37 @@ typedef void (*ReorderBufferCommitCB) (
ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+/* abort callback signature */
+typedef void (*ReorderBufferAbortCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+typedef bool (*ReorderBufferFilterPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/* abort prepared callback signature */
+typedef void (*ReorderBufferAbortPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+
/* message callback signature */
typedef void (*ReorderBufferMessageCB) (
ReorderBuffer *rb,
@@ -347,6 +382,11 @@ struct ReorderBuffer
ReorderBufferBeginCB begin;
ReorderBufferApplyChangeCB apply_change;
ReorderBufferCommitCB commit;
+ ReorderBufferAbortCB abort;
+ ReorderBufferFilterPrepareCB filter_prepare;
+ ReorderBufferPrepareCB prepare;
+ ReorderBufferCommitPreparedCB commit_prepared;
+ ReorderBufferAbortPreparedCB abort_prepared;
ReorderBufferMessageCB message;
/*
@@ -394,6 +434,11 @@ void ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot
void ReorderBufferCommit(ReorderBuffer *, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit);
void ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
void ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
@@ -417,6 +462,15 @@ void ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
bool ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
bool ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
+bool ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+ const char *gid);
+bool ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+ const char *gid);
+void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid);
ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
void ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
--
2.15.1 (Apple Git-101)
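The "all or none" rule that StartupDecodingContext() applies to the two-phase callbacks in the patch above can be modelled in isolation. The following is a minimal sketch under stated assumptions: PluginCallbacks, TwophaseCheck, noop_cb() and check_twophase_callbacks() are hypothetical names made up for this example, and the real code reports the invalid cases with ereport(ERROR) instead of returning a status.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an output plugin callback pointer. */
typedef void (*plugin_cb) (void);

/* Hypothetical, reduced version of the plugin callback table. */
typedef struct PluginCallbacks
{
	plugin_cb	prepare_cb;
	plugin_cb	commit_prepared_cb;
	plugin_cb	abort_prepared_cb;
	plugin_cb	filter_prepare_cb;	/* optional */
} PluginCallbacks;

typedef enum
{
	TWOPHASE_DISABLED,			/* none of the three callbacks defined */
	TWOPHASE_ENABLED,			/* all three callbacks defined */
	TWOPHASE_INVALID			/* partial set, or filter without two-phase */
} TwophaseCheck;

/* Dummy callback used only to obtain a non-NULL function pointer. */
void
noop_cb(void)
{
}

TwophaseCheck
check_twophase_callbacks(const PluginCallbacks *cb)
{
	int			defined;

	assert(cb != NULL);

	/* count the mandatory callbacks, as the patch does */
	defined = (cb->prepare_cb != NULL) +
		(cb->commit_prepared_cb != NULL) +
		(cb->abort_prepared_cb != NULL);

	/* a partial set of mandatory callbacks is a broken plugin */
	if (defined != 0 && defined != 3)
		return TWOPHASE_INVALID;

	/* filter_prepare is optional, but requires two-phase decoding */
	if (cb->filter_prepare_cb != NULL && defined != 3)
		return TWOPHASE_INVALID;

	return (defined == 3) ? TWOPHASE_ENABLED : TWOPHASE_DISABLED;
}
```

Counting the non-NULL pointers keeps the check to a single comparison against 0 and 3, which is the same trick the patch uses to set enable_twophase.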
0004-pgoutput-output-plugin-support-for-logical-decoding-.0204.patch (application/octet-stream)
From 0028d074dc79d370c0b888756400ddc4c10de485 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Sun, 1 Apr 2018 18:00:46 +0530
Subject: [PATCH 4/6] pgoutput output plugin support for logical decoding of
2PC.
Includes documentation changes and test cases.
---
doc/src/sgml/logicaldecoding.sgml | 121 +++++++++++++++++-
src/backend/access/transam/twophase.c | 38 +++++-
src/backend/replication/logical/logical.c | 11 +-
src/backend/replication/logical/proto.c | 90 ++++++++++++-
src/backend/replication/logical/reorderbuffer.c | 2 +
src/backend/replication/logical/worker.c | 147 ++++++++++++++++++++-
src/backend/replication/pgoutput/pgoutput.c | 72 ++++++++++-
src/include/access/twophase.h | 1 +
src/include/replication/logicalproto.h | 39 +++++-
src/test/subscription/t/010_twophase.pl | 163 ++++++++++++++++++++++++
10 files changed, 669 insertions(+), 15 deletions(-)
create mode 100644 src/test/subscription/t/010_twophase.pl
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index f6b14dccb0..344bc6bc1c 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -384,7 +384,12 @@ typedef struct OutputPluginCallbacks
LogicalDecodeBeginCB begin_cb;
LogicalDecodeChangeCB change_cb;
LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeAbortCB abort_cb;
LogicalDecodeMessageCB message_cb;
+ LogicalDecodeFilterPrepareCB filter_prepare_cb;
+ LogicalDecodePrepareCB prepare_cb;
+ LogicalDecodeCommitPreparedCB commit_prepared_cb;
+ LogicalDecodeAbortPreparedCB abort_prepared_cb;
LogicalDecodeFilterByOriginCB filter_by_origin_cb;
LogicalDecodeShutdownCB shutdown_cb;
} OutputPluginCallbacks;
@@ -454,7 +459,13 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
never get
decoded. Successful savepoints are
folded into the transaction containing them in the order they were
- executed within that transaction.
+ executed within that transaction. A transaction that is prepared for
+ a two-phase commit using <command>PREPARE TRANSACTION</command> will
+ also be decoded if the output plugin provides the callbacks needed
+ to decode it. The transaction being decoded may be aborted
+ concurrently by a <command>ROLLBACK PREPARED</command> command.
+ In that case, logical decoding of that transaction is aborted
+ too.
</para>
<note>
@@ -555,6 +566,74 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
</para>
</sect3>
+ <sect3 id="logicaldecoding-output-plugin-prepare">
+ <title>Transaction Prepare Callback</title>
+
+ <para>
+ The optional <function>prepare_cb</function> callback is called whenever
+ a transaction which is prepared for two-phase commit has been
+ decoded. The <function>change_cb</function> callbacks for all modified
+ rows will have been called before this, if there have been any modified
+ rows.
+<programlisting>
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+ <title>Commit Prepared Transaction Callback</title>
+
+ <para>
+ The optional <function>commit_prepared_cb</function> callback is called whenever
+ a <command>COMMIT PREPARED</command> has been decoded. The <parameter>gid</parameter> field,
+ which is part of the <parameter>txn</parameter> parameter, identifies the
+ prepared transaction and can be used in this callback.
+<programlisting>
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-abort-prepared">
+ <title>Rollback Prepared Transaction Callback</title>
+
+ <para>
+ The optional <function>abort_prepared_cb</function> callback is called whenever
+ a <command>ROLLBACK PREPARED</command> has been decoded. The <parameter>gid</parameter> field,
+ which is part of the <parameter>txn</parameter> parameter, identifies the
+ prepared transaction and can be used in this callback.
+<programlisting>
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-abort">
+ <title>Transaction Abort Callback</title>
+
+ <para>
+ The required <function>abort_cb</function> callback is called whenever
+ a transaction abort has to be initiated. This can happen if we are
+ decoding a transaction that has been prepared for two-phase commit and
+ a concurrent rollback happens while we are decoding it. To avoid
+ decoding such a transaction at all, it may make sense to check,
+ before looking at any of its changes, whether the rollback has
+ already happened.
+<programlisting>
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
<sect3 id="logicaldecoding-output-plugin-change">
<title>Change Callback</title>
@@ -564,7 +643,12 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
an <command>INSERT</command>, <command>UPDATE</command>,
or <command>DELETE</command>. Even if the original command modified
several rows at once the callback will be called individually for each
- row.
+ row. The <function>change_cb</function> callback may access system or
+ user catalog tables to aid in the process of outputting the row
+ modification details. When decoding a prepared (but not yet
+ committed) transaction or any other uncommitted transaction, the
+ change callback is guaranteed safe access to catalog tables even if
+ another backend concurrently rolls back that same transaction.
<programlisting>
typedef void (*LogicalDecodeChangeCB) (struct LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
@@ -619,6 +703,39 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
</para>
</sect3>
+ <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+ <title>Prepare Filter Callback</title>
+
+ <para>
+ The optional <function>filter_prepare_cb</function> callback
+ is called to determine whether data that is part of the current
+ two-phase commit transaction should be decoded at this prepare
+ stage, or later as a regular one-phase transaction at
+ <command>COMMIT PREPARED</command> time. To signal that decoding
+ at prepare time should be skipped, return <literal>true</literal>;
+ return <literal>false</literal> otherwise. When the callback is not
+ defined, <literal>false</literal> is assumed (i.e. nothing is
+ filtered).
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+</programlisting>
+ The <parameter>ctx</parameter> parameter has the same contents
+ as for the other callbacks. The <parameter>txn</parameter> parameter
+ contains meta information about the transaction. The <parameter>xid</parameter>
+ parameter is passed separately because <parameter>txn</parameter> can be NULL in some cases.
+ The <parameter>gid</parameter> parameter is the identifier later used to refer to this
+ transaction in <command>COMMIT PREPARED</command> or <command>ROLLBACK PREPARED</command>.
+ </para>
+ <para>
+ The callback must return the same answer for a given combination of
+ <parameter>xid</parameter> and <parameter>gid</parameter> every time it is
+ called.
+ </para>
+ </sect3>
+
<sect3 id="logicaldecoding-output-plugin-message">
<title>Generic Message Callback</title>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index f3091af385..3f9b524cf4 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -549,6 +549,37 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held)
ProcArrayAdd(&ProcGlobal->allProcs[gxact->pgprocno]);
}
+/*
+ * LookupGXact
+ * Check if the prepared transaction with the given GID is around
+ */
+bool
+LookupGXact(const char *gid)
+{
+ int i;
+
+ LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
+
+ for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+ {
+ GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+ /* Ignore not-yet-valid GIDs */
+ if (!gxact->valid)
+ continue;
+ if (strcmp(gxact->gid, gid) != 0)
+ continue;
+
+ LWLockRelease(TwoPhaseStateLock);
+
+ return true;
+ }
+
+ LWLockRelease(TwoPhaseStateLock);
+
+ return false;
+}
+
/*
* LockGXact
* Locate the prepared transaction and mark it busy for COMMIT or PREPARE.
@@ -1506,9 +1537,12 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
gid);
ProcArrayRemove(proc, latestXid);
+
/*
- * Tell logical decoding backends interested in this XID
- * that this is going away
+ * Coordinate with logical decoding backends that may be already
+ * decoding this prepared transaction. When aborting a transaction,
+ * we need to wait for all of them to leave the decoding group. If
+ * committing, we simply remove all members from the group.
*/
LogicalDecodeRemoveTransaction(proc, isCommit);
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index a97a7c838c..65382c2556 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -901,11 +901,20 @@ filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
bool ret;
/*
- * If twophase is not enabled, skip decoding at PREPARE time
+ * Skip if decoding of twophase at PREPARE time is not enabled. In that
+ * case all twophase transactions are considered filtered out and will be
+ * applied as regular transactions at COMMIT PREPARED.
*/
if (!ctx->enable_twophase)
return true;
+ /*
+ * The filter_prepare callback is optional. When not supplied, all
+ * prepared transactions should go through.
+ */
+ if (!ctx->callbacks.filter_prepare_cb)
+ return false;
+
/* Push callback + info on the error context stack */
state.ctx = ctx;
state.callback_name = "filter_prepare";
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 948343e4ae..ac6aebde0a 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -70,12 +70,17 @@ logicalrep_read_begin(StringInfo in, LogicalRepBeginData *begin_data)
*/
void
logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
- XLogRecPtr commit_lsn)
+ XLogRecPtr commit_lsn, bool is_commit)
{
uint8 flags = 0;
pq_sendbyte(out, 'C'); /* sending COMMIT */
+ if (is_commit)
+ flags |= LOGICALREP_IS_COMMIT;
+ else
+ flags |= LOGICALREP_IS_ABORT;
+
/* send the flags field (unused for now) */
pq_sendbyte(out, flags);
@@ -86,16 +91,20 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
}
/*
- * Read transaction COMMIT from the stream.
+ * Read transaction COMMIT|ABORT from the stream.
*/
void
logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
{
- /* read flags (unused for now) */
+ /* read flags */
uint8 flags = pq_getmsgbyte(in);
- if (flags != 0)
- elog(ERROR, "unrecognized flags %u in commit message", flags);
+ if (!CommitFlagsAreValid(flags))
+ elog(ERROR, "unrecognized flags %u in commit|abort message",
+ flags);
+
+ /* the flag is either commit or abort */
+ commit_data->is_commit = (flags == LOGICALREP_IS_COMMIT);
/* read fields */
commit_data->commit_lsn = pq_getmsgint64(in);
@@ -103,6 +112,77 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
commit_data->committime = pq_getmsgint64(in);
}
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ uint8 flags = 0;
+
+ pq_sendbyte(out, 'P'); /* sending PREPARE protocol */
+
+ /*
+ * This should only ever happen for 2PC transactions, in which case we
+ * expect to have a non-empty GID.
+ */
+ Assert(rbtxn_prepared(txn));
+ Assert(strlen(txn->gid) > 0);
+
+ /*
+ * Flags are determined from the state of the transaction. We know we
+ * always get PREPARE first and then [COMMIT|ROLLBACK] PREPARED, so if
+ * it's already marked as committed then it has to be COMMIT PREPARED (and
+ * likewise for abort / ROLLBACK PREPARED).
+ */
+ if (rbtxn_commit_prepared(txn))
+ flags |= LOGICALREP_IS_COMMIT_PREPARED;
+ else if (rbtxn_rollback_prepared(txn))
+ flags |= LOGICALREP_IS_ROLLBACK_PREPARED;
+ else
+ flags |= LOGICALREP_IS_PREPARE;
+
+ /* Make sure exactly one of the expected flags is set. */
+ if (!PrepareFlagsAreValid(flags))
+ elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+ /* send the flags field */
+ pq_sendbyte(out, flags);
+
+ /* send fields */
+ pq_sendint64(out, prepare_lsn);
+ pq_sendint64(out, txn->end_lsn);
+ pq_sendint64(out, txn->commit_time);
+
+ /* send gid */
+ pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPrepareData * prepare_data)
+{
+ /* read flags */
+ uint8 flags = pq_getmsgbyte(in);
+
+ if (!PrepareFlagsAreValid(flags))
+ elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+ /* set the action (reuse the constants used for the flags) */
+ prepare_data->prepare_type = flags;
+
+ /* read fields */
+ prepare_data->prepare_lsn = pq_getmsgint64(in);
+ prepare_data->end_lsn = pq_getmsgint64(in);
+ prepare_data->preparetime = pq_getmsgint64(in);
+
+ /* read gid (bounded copy into a pre-allocated buffer) */
+ strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
/*
* Write ORIGIN to the output stream.
*/
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index fdce0249f1..2ba6c7ebce 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1794,6 +1794,8 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
txn->commit_time = commit_time;
txn->origin_id = origin_id;
txn->origin_lsn = origin_lsn;
+ /* this txn is obviously prepared */
+ txn->txn_flags |= RBTXN_PREPARE;
strcpy(txn->gid, gid);
if (is_commit)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index fdace7eea2..56d3239491 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -486,7 +486,11 @@ apply_handle_commit(StringInfo s)
replorigin_session_origin_lsn = commit_data.end_lsn;
replorigin_session_origin_timestamp = commit_data.committime;
- CommitTransactionCommand();
+ if (commit_data.is_commit)
+ CommitTransactionCommand();
+ else
+ AbortCurrentTransaction();
+
pgstat_report_stat(false);
store_flush_position(commit_data.end_lsn);
@@ -506,6 +510,141 @@ apply_handle_commit(StringInfo s)
pgstat_report_activity(STATE_IDLE, NULL);
}
+static void
+apply_handle_prepare_txn(LogicalRepPrepareData * prepare_data)
+{
+ Assert(prepare_data->prepare_lsn == remote_final_lsn);
+
+ /* The synchronization worker runs in a single transaction. */
+ if (IsTransactionState() && !am_tablesync_worker())
+ {
+ /* End the earlier transaction and start a new one */
+ BeginTransactionBlock();
+ CommitTransactionCommand();
+ StartTransactionCommand();
+
+ /*
+ * Update origin state so we can restart streaming from correct
+ * position in case of crash.
+ */
+ replorigin_session_origin_lsn = prepare_data->end_lsn;
+ replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+ PrepareTransactionBlock(prepare_data->gid);
+ CommitTransactionCommand();
+ pgstat_report_stat(false);
+
+ store_flush_position(prepare_data->end_lsn);
+ }
+ else
+ {
+ /* Process any invalidation messages that might have accumulated. */
+ AcceptInvalidationMessages();
+ maybe_reread_subscription();
+ }
+
+ in_remote_transaction = false;
+
+ /* Process any tables that are being synchronized in parallel. */
+ process_syncing_tables(prepare_data->end_lsn);
+
+ pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+static void
+apply_handle_commit_prepared_txn(LogicalRepPrepareData * prepare_data)
+{
+ /* there is no transaction when COMMIT PREPARED is called */
+ ensure_transaction();
+
+ /*
+ * Update origin state so we can restart streaming from correct position
+ * in case of crash.
+ */
+ replorigin_session_origin_lsn = prepare_data->end_lsn;
+ replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+ FinishPreparedTransaction(prepare_data->gid, true);
+ CommitTransactionCommand();
+ pgstat_report_stat(false);
+
+ store_flush_position(prepare_data->end_lsn);
+ in_remote_transaction = false;
+
+ /* Process any tables that are being synchronized in parallel. */
+ process_syncing_tables(prepare_data->end_lsn);
+
+ pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+static void
+apply_handle_rollback_prepared_txn(LogicalRepPrepareData * prepare_data)
+{
+ /*
+ * Update origin state so we can restart streaming from correct position
+ * in case of crash.
+ */
+ replorigin_session_origin_lsn = prepare_data->end_lsn;
+ replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+ /*
+ * A prepared transaction may have been aborted while it was still
+ * being decoded. In that case, we stop the decoding and abort the
+ * transaction immediately. However, the ROLLBACK PREPARED message
+ * still reaches the subscriber, so it's ok for the gid to be
+ * missing here.
+ */
+ if (LookupGXact(prepare_data->gid))
+ {
+ /* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+ ensure_transaction();
+ FinishPreparedTransaction(prepare_data->gid, false);
+ CommitTransactionCommand();
+ }
+
+ pgstat_report_stat(false);
+
+ store_flush_position(prepare_data->end_lsn);
+ in_remote_transaction = false;
+
+ /* Process any tables that are being synchronized in parallel. */
+ process_syncing_tables(prepare_data->end_lsn);
+
+ pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+ LogicalRepPrepareData prepare_data;
+
+ logicalrep_read_prepare(s, &prepare_data);
+
+ switch (prepare_data.prepare_type)
+ {
+ case LOGICALREP_IS_PREPARE:
+ apply_handle_prepare_txn(&prepare_data);
+ break;
+
+ case LOGICALREP_IS_COMMIT_PREPARED:
+ apply_handle_commit_prepared_txn(&prepare_data);
+ break;
+
+ case LOGICALREP_IS_ROLLBACK_PREPARED:
+ apply_handle_rollback_prepared_txn(&prepare_data);
+ break;
+
+ default:
+ ereport(ERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg("unexpected type of prepare message: %d",
+ prepare_data.prepare_type)));
+ }
+}
+
/*
* Handle ORIGIN message.
*
@@ -903,10 +1042,14 @@ apply_dispatch(StringInfo s)
case 'B':
apply_handle_begin(s);
break;
- /* COMMIT */
+ /* COMMIT/ABORT */
case 'C':
apply_handle_commit(s);
break;
+ /* PREPARE and [COMMIT|ROLLBACK] PREPARED */
+ case 'P':
+ apply_handle_prepare(s);
+ break;
/* INSERT */
case 'I':
apply_handle_insert(s);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index aa9cf5b54e..4f83978c47 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -36,11 +36,19 @@ static void pgoutput_begin_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn);
static void pgoutput_commit_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_abort_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr abort_lsn);
static void pgoutput_change(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, Relation rel,
ReorderBufferChange *change);
static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
RepOriginId origin_id);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_abort_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
static bool publications_valid;
@@ -78,6 +86,12 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
cb->begin_cb = pgoutput_begin_txn;
cb->change_cb = pgoutput_change;
cb->commit_cb = pgoutput_commit_txn;
+ cb->abort_cb = pgoutput_abort_txn;
+
+ cb->prepare_cb = pgoutput_prepare_txn;
+ cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+ cb->abort_prepared_cb = pgoutput_abort_prepared_txn;
+
cb->filter_by_origin_cb = pgoutput_origin_filter;
cb->shutdown_cb = pgoutput_shutdown;
}
@@ -246,7 +260,63 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
OutputPluginUpdateProgress(ctx);
OutputPluginPrepareWrite(ctx, true);
- logicalrep_write_commit(ctx->out, txn, commit_lsn);
+ logicalrep_write_commit(ctx->out, txn, commit_lsn, true);
+ OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ABORT callback
+ */
+static void
+pgoutput_abort_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ OutputPluginUpdateProgress(ctx);
+
+ OutputPluginPrepareWrite(ctx, true);
+ logicalrep_write_commit(ctx->out, txn, abort_lsn, false);
+ OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ OutputPluginUpdateProgress(ctx);
+
+ OutputPluginPrepareWrite(ctx, true);
+ logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+ OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ OutputPluginUpdateProgress(ctx);
+
+ OutputPluginPrepareWrite(ctx, true);
+ logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+ OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ OutputPluginUpdateProgress(ctx);
+
+ OutputPluginPrepareWrite(ctx, true);
+ logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
OutputPluginWrite(ctx, true);
}
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index f05cde202f..5a4da6efab 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -44,6 +44,7 @@ extern GlobalTransaction MarkAsPreparing(TransactionId xid, const char *gid,
extern void StartPrepare(GlobalTransaction gxact);
extern void EndPrepare(GlobalTransaction gxact);
extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
+extern bool LookupGXact(const char *gid);
extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
int *nxids_p);
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 116f16f42d..11e3d67223 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -25,7 +25,7 @@
* connect time.
*/
#define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_VERSION_NUM 2
/* Tuple coming via logical replication. */
typedef struct LogicalRepTupleData
@@ -68,20 +68,55 @@ typedef struct LogicalRepBeginData
TransactionId xid;
} LogicalRepBeginData;
+/* Commit (and abort) information */
typedef struct LogicalRepCommitData
{
+ bool is_commit;
XLogRecPtr commit_lsn;
XLogRecPtr end_lsn;
TimestampTz committime;
} LogicalRepCommitData;
+/* types of the commit protocol message */
+#define LOGICALREP_IS_COMMIT 0x01
+#define LOGICALREP_IS_ABORT 0x02
+
+/* commit message is COMMIT or ABORT, and there is nothing else */
+#define CommitFlagsAreValid(flags) \
+ ((flags == LOGICALREP_IS_COMMIT) || (flags == LOGICALREP_IS_ABORT))
+
+/* Prepare protocol information */
+typedef struct LogicalRepPrepareData
+{
+ uint8 prepare_type;
+ XLogRecPtr prepare_lsn;
+ XLogRecPtr end_lsn;
+ TimestampTz preparetime;
+ char gid[GIDSIZE];
+} LogicalRepPrepareData;
+
+/* types of the prepare protocol message */
+#define LOGICALREP_IS_PREPARE 0x01
+#define LOGICALREP_IS_COMMIT_PREPARED 0x02
+#define LOGICALREP_IS_ROLLBACK_PREPARED 0x04
+
+/* prepare must be exactly one of PREPARE, [COMMIT|ROLLBACK] PREPARED */
+#define PrepareFlagsAreValid(flags) \
+ ((flags == LOGICALREP_IS_PREPARE) || \
+ (flags == LOGICALREP_IS_COMMIT_PREPARED) || \
+ (flags == LOGICALREP_IS_ROLLBACK_PREPARED))
+
extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
extern void logicalrep_read_begin(StringInfo in,
LogicalRepBeginData *begin_data);
extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
- XLogRecPtr commit_lsn);
+ XLogRecPtr commit_lsn, bool is_commit);
extern void logicalrep_read_commit(StringInfo in,
LogicalRepCommitData *commit_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+ LogicalRepPrepareData * prepare_data);
extern void logicalrep_write_origin(StringInfo out, const char *origin,
XLogRecPtr origin_lsn);
extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
diff --git a/src/test/subscription/t/010_twophase.pl b/src/test/subscription/t/010_twophase.pl
new file mode 100644
index 0000000000..c7f373df93
--- /dev/null
+++ b/src/test/subscription/t/010_twophase.pl
@@ -0,0 +1,163 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 12;
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf(
+ 'postgresql.conf', qq(
+ max_prepared_transactions = 10
+ ));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf(
+ 'postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres', "CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO tab_full SELECT generate_series(1,10)");
+$node_publisher->safe_psql('postgres', "CREATE TABLE tab_full2 (x text)");
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO tab_full2 VALUES ('a'), ('b'), ('b')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_full2 (x text)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+"ALTER PUBLICATION tap_pub ADD TABLE tab_full, tab_full2"
+);
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+ or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+ or die "Timed out while waiting for subscriber to synchronize data";
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres',
+ "BEGIN;INSERT INTO tab_full VALUES (11);PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+ or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+ is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+ "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+ or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+ is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+$result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+ is($result, qq(0), 'transaction is committed on subscriber');
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres',
+ "BEGIN;INSERT INTO tab_full VALUES (12);PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+ or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+ is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+ "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+ or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+ is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+ is($result, qq(0), 'transaction is aborted on subscriber');
+
+# Check that commit prepared is decoded properly on crash restart
+$node_publisher->safe_psql('postgres', "
+ BEGIN;
+ INSERT INTO tab_full VALUES (12);
+ INSERT INTO tab_full VALUES (13);
+ PREPARE TRANSACTION 'test_prepared_tab';");
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+ or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+# TODO add test cases involving DDL. This can be added after we add functionality
+# to replicate DDL changes to subscriber.
+
+# check all the cleanup
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+ "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+ "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+ "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+ 'check subscription relation status was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+ "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+ "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
--
2.15.1 (Apple Git-101)
0005-Teach-test_decoding-plugin-to-work-with-2PC.0204.patch (application/octet-stream)
From 0f788692d442042c104438eec112e3b2c03eb297 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Sun, 1 Apr 2018 18:35:24 +0530
Subject: [PATCH 5/6] Teach test_decoding plugin to work with 2PC
Includes a new option "enable-twophase". Depending on this option's
value, PREPARE TRANSACTION will either be decoded or treated as
a single-phase commit later.
---
contrib/test_decoding/expected/prepared.out | 257 +++++++++++++++++++++++++---
contrib/test_decoding/sql/prepared.sql | 84 ++++++++-
contrib/test_decoding/test_decoding.c | 137 +++++++++++++++
3 files changed, 451 insertions(+), 27 deletions(-)
diff --git a/contrib/test_decoding/expected/prepared.out b/contrib/test_decoding/expected/prepared.out
index 46e915d4ff..5df7b7ff20 100644
--- a/contrib/test_decoding/expected/prepared.out
+++ b/contrib/test_decoding/expected/prepared.out
@@ -6,19 +6,82 @@ SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_d
init
(1 row)
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc', 'test_decoding');
+ ?column?
+----------
+ init
+(1 row)
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
-- test simple successful use of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (1);
PREPARE TRANSACTION 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+------
+(0 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ PREPARE TRANSACTION 'test_prepared#1'
+(3 rows)
+
COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ COMMIT
+(3 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
INSERT INTO test_prepared1 VALUES (2);
-- test abort of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (3);
PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+(3 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(6 rows)
+
ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+------
+(0 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
INSERT INTO test_prepared1 VALUES (4);
-- test prepared xact containing ddl
BEGIN;
@@ -26,45 +89,193 @@ INSERT INTO test_prepared1 VALUES (5);
ALTER TABLE test_prepared1 ADD COLUMN data text;
INSERT INTO test_prepared1 VALUES (6, 'frakbar');
PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
-INSERT INTO test_prepared2 VALUES (7);
-COMMIT PREPARED 'test_prepared#3';
--- make sure stuff still works
-INSERT INTO test_prepared1 VALUES (8);
-INSERT INTO test_prepared2 VALUES (9);
--- cleanup
-DROP TABLE test_prepared1;
-DROP TABLE test_prepared2;
--- show results
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'test_prepared1'::regclass;
+ relation | locktype | mode
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4
+ COMMIT
+(3 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
data
-------------------------------------------------------------------------
BEGIN
- table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:4
COMMIT
BEGIN
- table public.test_prepared1: INSERT: id[integer]:2
- COMMIT
+ table public.test_prepared1: INSERT: id[integer]:5
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(7 rows)
+
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------
BEGIN
- table public.test_prepared1: INSERT: id[integer]:4
+ table public.test_prepared2: INSERT: id[integer]:7
COMMIT
+(3 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+----------------------------------------------------
BEGIN
table public.test_prepared2: INSERT: id[integer]:7
COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+-------------------------------------------------------------------------
BEGIN
table public.test_prepared1: INSERT: id[integer]:5
table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
COMMIT
+(4 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#3'
+(1 row)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (8);
+INSERT INTO test_prepared2 VALUES (9);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+--------------------------------------------------------------------
BEGIN
table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
COMMIT
BEGIN
table public.test_prepared2: INSERT: id[integer]:9
COMMIT
-(22 rows)
+(6 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:9
+ COMMIT
+(6 rows)
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'pg_class'::regclass;
+ relation | locktype | mode
+----------+----------+------
+(0 rows)
+
+-- Shouldn't see anything with 2pc decoding off
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+-----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:12
+ COMMIT
+(3 rows)
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:12
+ PREPARE TRANSACTION 'test_prepared_lock2'
+ COMMIT PREPARED 'test_prepared_lock2'
+(8 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- Both will work normally after we commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ COMMIT
+(4 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+--------------------------------------
+ COMMIT PREPARED 'test_prepared_lock'
+(1 row)
+
+-- test savepoints
+BEGIN;
+SAVEPOINT test_savepoint;
+CREATE TABLE test_prepared_savepoint (a int);
+PREPARE TRANSACTION 'test_prepared_savepoint';
+COMMIT PREPARED 'test_prepared_savepoint';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+-------------------------------------------
+ COMMIT PREPARED 'test_prepared_savepoint'
+(1 row)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+------
+(0 rows)
+
+-- cleanup
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+------
+(0 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+------
+(0 rows)
SELECT pg_drop_replication_slot('regression_slot');
pg_drop_replication_slot
@@ -72,3 +283,9 @@ SELECT pg_drop_replication_slot('regression_slot');
(1 row)
+SELECT pg_drop_replication_slot('regression_slot_2pc');
+ pg_drop_replication_slot
+--------------------------
+
+(1 row)
+
diff --git a/contrib/test_decoding/sql/prepared.sql b/contrib/test_decoding/sql/prepared.sql
index e72639767e..e8eb8ad8d6 100644
--- a/contrib/test_decoding/sql/prepared.sql
+++ b/contrib/test_decoding/sql/prepared.sql
@@ -1,22 +1,31 @@
-- predictability
SET synchronous_commit = on;
SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc', 'test_decoding');
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
-- test simple successful use of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (1);
PREPARE TRANSACTION 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
INSERT INTO test_prepared1 VALUES (2);
-- test abort of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (3);
PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
INSERT INTO test_prepared1 VALUES (4);
@@ -27,24 +36,85 @@ ALTER TABLE test_prepared1 ADD COLUMN data text;
INSERT INTO test_prepared1 VALUES (6, 'frakbar');
PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'test_prepared1'::regclass;
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
-- make sure stuff still works
INSERT INTO test_prepared1 VALUES (8);
INSERT INTO test_prepared2 VALUES (9);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'pg_class'::regclass;
+
+-- Shouldn't see anything with 2pc decoding off
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+RESET statement_timeout;
+
+COMMIT PREPARED 'test_prepared_lock';
+
+-- Both will work normally after we commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+
+-- test savepoints
+BEGIN;
+SAVEPOINT test_savepoint;
+CREATE TABLE test_prepared_savepoint (a int);
+PREPARE TRANSACTION 'test_prepared_savepoint';
+COMMIT PREPARED 'test_prepared_savepoint';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
-- cleanup
DROP TABLE test_prepared1;
DROP TABLE test_prepared2;
--- show results
+-- show results. There should be nothing to show
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+
SELECT pg_drop_replication_slot('regression_slot');
+SELECT pg_drop_replication_slot('regression_slot_2pc');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index a94aeeae29..05b993fd7a 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -36,6 +36,7 @@ typedef struct
bool skip_empty_xacts;
bool xact_wrote_changes;
bool only_local;
+ bool enable_twophase;
} TestDecodingData;
static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -49,6 +50,8 @@ static void pg_output_begin(LogicalDecodingContext *ctx,
bool last_write);
static void pg_decode_commit_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pg_decode_abort_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr abort_lsn);
static void pg_decode_change(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, Relation rel,
ReorderBufferChange *change);
@@ -58,6 +61,18 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr message_lsn,
bool transactional, const char *prefix,
Size sz, const char *message);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
void
_PG_init(void)
@@ -75,9 +90,14 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
cb->begin_cb = pg_decode_begin_txn;
cb->change_cb = pg_decode_change;
cb->commit_cb = pg_decode_commit_txn;
+ cb->abort_cb = pg_decode_abort_txn;
cb->filter_by_origin_cb = pg_decode_filter;
cb->shutdown_cb = pg_decode_shutdown;
cb->message_cb = pg_decode_message;
+ cb->filter_prepare_cb = pg_decode_filter_prepare;
+ cb->prepare_cb = pg_decode_prepare_txn;
+ cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+ cb->abort_prepared_cb = pg_decode_abort_prepared_txn;
}
@@ -97,6 +117,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
data->include_timestamp = false;
data->skip_empty_xacts = false;
data->only_local = false;
+ data->enable_twophase = false;
ctx->output_plugin_private = data;
@@ -178,6 +199,16 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
errmsg("could not parse value \"%s\" for parameter \"%s\"",
strVal(elem->arg), elem->defname)));
}
+ else if (strcmp(elem->defname, "enable-twophase") == 0)
+ {
+ if (elem->arg == NULL)
+ data->enable_twophase = true;
+ else if (!parse_bool(strVal(elem->arg), &data->enable_twophase))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("could not parse value \"%s\" for parameter \"%s\"",
+ strVal(elem->arg), elem->defname)));
+ }
else
{
ereport(ERROR,
@@ -246,6 +277,112 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
OutputPluginWrite(ctx, true);
}
+/* ABORT callback */
+static void
+pg_decode_abort_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+ if (data->include_xids)
+ appendStringInfo(ctx->out, "ABORT %u", txn->xid);
+ else
+ appendStringInfoString(ctx->out, "ABORT");
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* Filter out two-phase transactions, if decoding not enabled. */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ /* treat all transactions as one-phase */
+ if (!data->enable_twophase)
+ return true;
+
+ return false;
+}
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* ABORT PREPARED callback */
+static void
+pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
static bool
pg_decode_filter(LogicalDecodingContext *ctx,
RepOriginId origin_id)
--
2.15.1 (Apple Git-101)
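The way the "enable-twophase" option drives the filter callback in the patch above can be sketched outside the server. This is only an illustration under stated assumptions: DecodingOptions, parse_bool_option, apply_option and filter_prepare are hypothetical stand-ins for TestDecodingData, parse_bool, the option loop in pg_decode_startup and pg_decode_filter_prepare, and the boolean parser accepts far fewer spellings than PostgreSQL's.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the plugin's private state; only the field used here. */
typedef struct
{
    bool enable_twophase;
} DecodingOptions;

/*
 * Simplified boolean parser (hypothetical; the real parse_bool accepts
 * more spellings). Returns false if the value is not recognised.
 */
static bool
parse_bool_option(const char *value, bool *result)
{
    if (strcmp(value, "1") == 0 || strcmp(value, "true") == 0 ||
        strcmp(value, "on") == 0)
    {
        *result = true;
        return true;
    }
    if (strcmp(value, "0") == 0 || strcmp(value, "false") == 0 ||
        strcmp(value, "off") == 0)
    {
        *result = false;
        return true;
    }
    return false;
}

/*
 * Apply one "name, value" option pair. A NULL value means "enabled",
 * mirroring the elem->arg == NULL branch in the patch. Returns false
 * for an unknown option or an unparsable value, where the patch would
 * raise an error instead.
 */
static bool
apply_option(DecodingOptions *opts, const char *name, const char *value)
{
    if (strcmp(name, "enable-twophase") != 0)
        return false;
    if (value == NULL)
    {
        opts->enable_twophase = true;
        return true;
    }
    return parse_bool_option(value, &opts->enable_twophase);
}

/*
 * Mirrors the decision in pg_decode_filter_prepare: returning true filters
 * the PREPARE out, so the transaction is decoded as one-phase later.
 */
static bool
filter_prepare(const DecodingOptions *opts)
{
    return !opts->enable_twophase;
}
```

With the option left at its default of false, filter_prepare returns true and the slot behaves like 'regression_slot' in the regression test; passing 'enable-twophase', '1' flips it, which is what 'regression_slot_2pc' relies on.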
On 29 March 2018 at 23:24, Andres Freund <andres@anarazel.de> wrote:
I agree with the former, of course - docs are a must. I disagree with
the latter, though - there have been about no proposals how to do it
without the locking. If there are, I'd like to hear about it.
I don't care. Either another solution needs to be found, or the locking
needs to be automatically performed when necessary.
That seems unreasonable.
It's certainly a nice future goal to have it all happen automatically,
but we don't know what the plugin will do.
How can we ever make an unknown task happen automatically? We can't.
We have a reasonable approach here. Locking shared resources before
using them is not a radical new approach, it's just standard
development. If we find a better way in the future, we can use that,
but requiring a better solution when there isn't one is unreasonable.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 29 March 2018 at 23:30, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:
On 29/03/18 23:58, Andres Freund wrote:
On 2018-03-29 23:52:18 +0200, Tomas Vondra wrote:
I have added details about this in src/backend/storage/lmgr/README as
suggested by you.
Thanks. I think the README is a good start, but I think we also need to
improve the comments, which are usually more detailed than the README.
For example, it's not quite acceptable that LogicalLockTransaction and
LogicalUnlockTransaction have about no comments, especially when it's
meant to be public API for decoding plugins.
FWIW, for me that's grounds to not accept the feature. Burdening output
plugins with this will make their development painful (because they'll
have to adapt regularly) and correctness doubtful (there's nothing
checking for the lock being skipped). Another way needs to be found.
I have to agree with Andres here. It's also visible in the later
patches. The pgoutput patch forgets to call these new APIs completely.
The test_decoding calls them, but it does so even when it's processing
changes for a committed transaction. I think that should be avoided as it
means potentially doing SLRU lookup for every change. So doing it right
is indeed not easy.
Yet you spotted these problems easily enough. Similar to finding
missing LWlocks.
I as wondering how to hide this. Best idea I had so far would be to put
it in heap_beginscan (and index_beginscan given that catalog scans use
is as well) behind some condition. That would also improve performance
because locking would not need to happen for syscache hits. The problem
is however how to inform the heap_beginscan about the fact that we are
in 2PC decoding. We definitely don't want to change all the scan apis
for this. I wonder if we could add some kind of property to Snapshot
which would indicate this fact - logical decoding is using its own
snapshots, so it could inject the information about being inside the 2PC
decoding.
Perhaps, but how do we know we've covered all the right places? We
don't know what every plugin will require, do we?
The plugin needs to take responsibility for its own correctness,
whether we make it easier or not.
It seems clear that we would need a generalized API (the proposed
locking approach) to cover all requirements.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
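Petr's idea quoted above, a marker on the snapshot so that scans take the lock only during 2PC decoding, might look roughly like the following. This is a standalone sketch, not code from any of the patches: SketchSnapshot, its decoding_2pc field, sketch_lock_transaction and sketch_beginscan are all hypothetical names, and the real Snapshot struct and heap_beginscan signature are deliberately not reproduced.

```c
#include <stdbool.h>

/* Hypothetical, heavily reduced snapshot: only the proposed marker. */
typedef struct
{
    /* set by logical decoding while replaying a prepared, uncommitted xact */
    bool decoding_2pc;
} SketchSnapshot;

/* Counts lock acquisitions so the effect is observable in this sketch. */
static int lock_acquisitions = 0;

/* Placeholder for whatever would guard against a concurrent abort. */
static void
sketch_lock_transaction(void)
{
    lock_acquisitions++;
}

/*
 * Stand-in for a scan entry point such as heap_beginscan. The lock is
 * taken only when the snapshot says we are decoding an uncommitted 2PC
 * transaction, so ordinary scans pay nothing and no scan API changes.
 */
static void
sketch_beginscan(const SketchSnapshot *snapshot)
{
    if (snapshot->decoding_2pc)
        sketch_lock_transaction();
    /* ... the actual scan setup would follow here ... */
}
```

Because the check sits at the start of a scan, a syscache hit that never starts a scan would skip the lock entirely, which is the performance benefit mentioned in the quoted message. Whether this covers every place a plugin might touch the catalogs is exactly the question raised in the reply.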
On 2018-04-02 09:23:10 +0100, Simon Riggs wrote:
On 29 March 2018 at 23:24, Andres Freund <andres@anarazel.de> wrote:
I agree with the former, of course - docs are a must. I disagree with
the latter, though - there have been about no proposals how to do it
without the locking. If there are, I'd like to hear about it.
I don't care. Either another solution needs to be found, or the locking
needs to be automatically performed when necessary.
That seems unreasonable.
It's certainly a nice future goal to have it all happen automatically,
but we don't know what the plugin will do.
No, fighting too complicated APIs is not unreasonable. And we've found
an alternative.
How can we ever make an unknown task happen automatically? We can't.
The task isn't unknown, so this just seems like a non sequitur.
Greetings,
Andres Freund
Hi,
It's certainly a nice future goal to have it all happen automatically,
but we don't know what the plugin will do.
No, fighting too complicated APIs is not unreasonable. And we've found
an alternative.
PFA, latest patch set.
The LogicalLockTransaction/LogicalUnlockTransaction API implementation
using decode groups now has proper cleanup handling in case there's an
ERROR while holding the logical lock.
Rest of the patches are the same as yesterday.
Other than this, do we want pgoutput support for 2PC decoding to be
optional? In that case we could add an option to
"CREATE SUBSCRIPTION". This would mean adding a new
Anum_pg_subscription_subenable_twophase attribute to the Subscription
struct and related processing. Should we go down this route?
Other than this, unless I am mistaken, every other issue has been taken
care of. Please do let me know if you think anything is pending in
these patch sets.
Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
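The cleanup requirement mentioned in this message (release the logical lock if an ERROR is raised while it is held) follows a generic pattern that can be sketched in plain C. The actual mechanism in the patch set is not shown in this thread, so everything below is an assumption: setjmp/longjmp stands in for the server's error unwinding, and sketch_lock, sketch_unlock, sketch_raise_error and sketch_decode_with_lock are hypothetical names rather than the LogicalLockTransaction API.

```c
#include <setjmp.h>
#include <stdbool.h>

static bool logical_lock_held = false;  /* hypothetical lock state */
static jmp_buf error_context;           /* jump target for simulated errors */

static void
sketch_lock(void)
{
    logical_lock_held = true;
}

static void
sketch_unlock(void)
{
    logical_lock_held = false;
}

/* Simulates raising an error: unwinds to the active setjmp. */
static void
sketch_raise_error(void)
{
    longjmp(error_context, 1);
}

/*
 * Runs a catalog access under the logical lock. Returns true on success,
 * false if the access raised an error; the lock is released either way.
 */
static bool
sketch_decode_with_lock(bool access_fails)
{
    if (setjmp(error_context) != 0)
    {
        /* Error path: release the lock before reporting failure. */
        sketch_unlock();
        return false;
    }

    sketch_lock();
    if (access_fails)
        sketch_raise_error();
    sketch_unlock();
    return true;
}
```

The point of the sketch is only that both exits leave logical_lock_held false; a version without the error branch would leak the lock on failure.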
Attachments:
0001-Cleaning-up-of-flags-in-ReorderBufferTXN-structure.0304.patch (application/octet-stream)
From c69cac0d3a1b3db08b92562a23729aa422b85d4f Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Tue, 3 Apr 2018 10:49:26 +0530
Subject: [PATCH 1/6] Cleaning up of flags in ReorderBufferTXN structure
---
src/backend/replication/logical/reorderbuffer.c | 32 ++++++++++-----------
src/include/replication/reorderbuffer.h | 37 +++++++++++++------------
2 files changed, 36 insertions(+), 33 deletions(-)
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b4016ed52b..3c9af58640 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -636,7 +636,7 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
if (prev_first_lsn != InvalidXLogRecPtr)
Assert(prev_first_lsn < cur_txn->first_lsn);
- Assert(!cur_txn->is_known_as_subxact);
+ Assert(!rbtxn_is_known_subxact(cur_txn));
prev_first_lsn = cur_txn->first_lsn;
}
#endif
@@ -654,7 +654,7 @@ ReorderBufferGetOldestTXN(ReorderBuffer *rb)
txn = dlist_head_element(ReorderBufferTXN, node, &rb->toplevel_by_lsn);
- Assert(!txn->is_known_as_subxact);
+ Assert(!rbtxn_is_known_subxact(txn));
Assert(txn->first_lsn != InvalidXLogRecPtr);
return txn;
}
@@ -688,9 +688,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
dlist_push_tail(&txn->subtxns, &subtxn->node);
txn->nsubtxns++;
}
- else if (!subtxn->is_known_as_subxact)
+ else if (!rbtxn_is_known_subxact(subtxn))
{
- subtxn->is_known_as_subxact = true;
+ subtxn->txn_flags |= RBTXN_IS_SUBXACT;
Assert(subtxn->nsubtxns == 0);
/* remove from lsn order list of top-level transactions */
@@ -751,9 +751,9 @@ ReorderBufferCommitChild(ReorderBuffer *rb, TransactionId xid,
subtxn->final_lsn = commit_lsn;
subtxn->end_lsn = end_lsn;
- if (!subtxn->is_known_as_subxact)
+ if (!rbtxn_is_known_subxact(subtxn))
{
- subtxn->is_known_as_subxact = true;
+ subtxn->txn_flags |= RBTXN_IS_SUBXACT;
Assert(subtxn->nsubtxns == 0);
/* remove from lsn order list of top-level transactions */
@@ -862,7 +862,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
ReorderBufferChange *cur_change;
- if (txn->serialized)
+ if (rbtxn_is_serialized(txn))
{
/* serialize remaining changes */
ReorderBufferSerializeTXN(rb, txn);
@@ -891,7 +891,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
ReorderBufferChange *cur_change;
- if (cur_txn->serialized)
+ if (rbtxn_is_serialized(cur_txn))
{
/* serialize remaining changes */
ReorderBufferSerializeTXN(rb, cur_txn);
@@ -1057,7 +1057,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
* they originally were happening inside another subtxn, so we won't
* ever recurse more than one level deep here.
*/
- Assert(subtxn->is_known_as_subxact);
+ Assert(rbtxn_is_known_subxact(subtxn));
Assert(subtxn->nsubtxns == 0);
ReorderBufferCleanupTXN(rb, subtxn);
@@ -1096,7 +1096,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
/*
* Remove TXN from its containing list.
*
- * Note: if txn->is_known_as_subxact, we are deleting the TXN from its
+ * Note: if txn is known as subxact, we are deleting the TXN from its
* parent's list of known subxacts; this leaves the parent's nsubxacts
* count too high, but we don't care. Otherwise, we are deleting the TXN
* from the LSN-ordered list of toplevel TXNs.
@@ -1111,7 +1111,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
Assert(found);
/* remove entries spilled to disk */
- if (txn->serialized)
+ if (rbtxn_is_serialized(txn))
ReorderBufferRestoreCleanup(rb, txn);
/* deallocate */
@@ -1128,7 +1128,7 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
dlist_iter iter;
HASHCTL hash_ctl;
- if (!txn->has_catalog_changes || dlist_is_empty(&txn->tuplecids))
+ if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
return;
memset(&hash_ctl, 0, sizeof(hash_ctl));
@@ -1708,7 +1708,7 @@ ReorderBufferAbortOld(ReorderBuffer *rb, TransactionId oldestRunningXid)
* final_lsn to that of their last change; this causes
* ReorderBufferRestoreCleanup to do the right thing.
*/
- if (txn->serialized && txn->final_lsn == 0)
+ if (rbtxn_is_serialized(txn) && txn->final_lsn == 0)
{
ReorderBufferChange *last =
dlist_tail_element(ReorderBufferChange, node, &txn->changes);
@@ -1954,7 +1954,7 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
- txn->has_catalog_changes = true;
+ txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
}
/*
@@ -1971,7 +1971,7 @@ ReorderBufferXidHasCatalogChanges(ReorderBuffer *rb, TransactionId xid)
if (txn == NULL)
return false;
- return txn->has_catalog_changes;
+ return rbtxn_has_catalog_changes(txn);
}
/*
@@ -2111,7 +2111,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
Assert(spilled == txn->nentries_mem);
Assert(dlist_is_empty(&txn->changes));
txn->nentries_mem = 0;
- txn->serialized = true;
+ txn->txn_flags |= RBTXN_IS_SERIALIZED;
if (fd != -1)
CloseTransientFile(fd);
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index aa430c843c..177ef98e43 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -137,21 +137,33 @@ typedef struct ReorderBufferChange
dlist_node node;
} ReorderBufferChange;
+/* ReorderBufferTXN flags */
+#define RBTXN_HAS_CATALOG_CHANGES 0x0001
+#define RBTXN_IS_SUBXACT 0x0002
+#define RBTXN_IS_SERIALIZED 0x0004
+
+/* does the txn have catalog changes */
+#define rbtxn_has_catalog_changes(txn) (txn->txn_flags & RBTXN_HAS_CATALOG_CHANGES)
+/* is the txn known as a subxact? */
+#define rbtxn_is_known_subxact(txn) (txn->txn_flags & RBTXN_IS_SUBXACT)
+/*
+ * Has this transaction been spilled to disk? It's not always possible to
+ * deduce that fact by comparing nentries with nentries_mem, because e.g.
+ * subtransactions of a large transaction might get serialized together
+ * with the parent - if they're restored to memory they'd have
+ * nentries_mem == nentries.
+ */
+#define rbtxn_is_serialized(txn) (txn->txn_flags & RBTXN_IS_SERIALIZED)
+
typedef struct ReorderBufferTXN
{
+ int txn_flags;
+
/*
* The transactions transaction id, can be a toplevel or sub xid.
*/
TransactionId xid;
- /* did the TX have catalog changes */
- bool has_catalog_changes;
-
- /*
- * Do we know this is a subxact?
- */
- bool is_known_as_subxact;
-
/*
* LSN of the first data carrying, WAL record with knowledge about this
* xid. This is allowed to *not* be first record adorned with this xid, if
@@ -214,15 +226,6 @@ typedef struct ReorderBufferTXN
*/
uint64 nentries_mem;
- /*
- * Has this transaction been spilled to disk? It's not always possible to
- * deduce that fact by comparing nentries with nentries_mem, because e.g.
- * subtransactions of a large transaction might get serialized together
- * with the parent - if they're restored to memory they'd have
- * nentries_mem == nentries.
- */
- bool serialized;
-
/*
* List of ReorderBufferChange structs, including new Snapshots and new
* CommandIds
--
2.15.1 (Apple Git-101)
0002-Introduce-LogicalLockTransaction-LogicalUnlockTransa.0304.patch (application/octet-stream)
From 0bf9620cda513f4f7a5a8dd495743a62e30b5f7d Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Tue, 3 Apr 2018 14:03:45 +0530
Subject: [PATCH 2/6] Introduce LogicalLockTransaction/LogicalUnlockTransaction
APIs
When a transaction aborts, its changes are considered unnecessary
for other transactions. That means the changes may be either cleaned
up by vacuum or removed from HOT chains (thus made inaccessible
through indexes), and there may be other such consequences.
When decoding committed transactions this is not an issue, and we
never decode transactions that abort before the decoding starts
(where decoding means passing it to ReorderBufferCommit).
But for in-progress transactions - for example when decoding prepared
transactions on PREPARE (and not COMMIT PREPARED as before), this
may cause failures when the output plugin consults catalogs (both
system and user-defined).
To prevent aborts concurrent with plugins accessing catalogs, we
introduce an API the output plugins are required to use (when
decoding in-progress transactions only).
Before accessing any catalogs, output plugins are required to call
LogicalLockTransaction and then release it using
LogicalUnlockTransaction. This is implemented by adding support for
decoding groups; LockHashPartitionLockByProc on the group leader yields
the LWLock protecting the group's PGPROC fields. For prepared and uncommitted
transactions, decoding backends working on the same XID will link
themselves up to the corresponding PGPROC entry (decodeGroupLeader).
They will remove themselves when they are done decoding.
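As a rough illustration of the calling pattern described above, here is a standalone sketch of the lock/unlock decision logic. It is not the patch's code: DemoTXN, DemoProc and the demo_* functions are hypothetical stand-ins, the LWLock and the group membership list are left out, and the unlock side omits the abort-pending handling that the real LogicalUnlockTransaction performs. Only the RBTXN_* values are taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values as defined by the patch in reorderbuffer.h. */
#define RBTXN_HAS_CATALOG_CHANGES 0x0001
#define RBTXN_PREPARE             0x0008
#define RBTXN_COMMIT              0x0040
#define RBTXN_ROLLBACK            0x0080

/* Hypothetical stand-in for ReorderBufferTXN. */
typedef struct DemoTXN
{
    int         txn_flags;
} DemoTXN;

/* Hypothetical stand-in for the decoding backend's PGPROC flags. */
typedef struct DemoProc
{
    bool        abort_pending;  /* models decodeAbortPending */
    bool        locked;         /* models decodeLocked */
} DemoProc;

/* Returns true if catalog access is safe, false if decoding should stop. */
static bool
demo_lock_transaction(DemoTXN *txn, DemoProc *me)
{
    if (!(txn->txn_flags & RBTXN_HAS_CATALOG_CHANGES))
        return true;            /* no catalog changes: nothing to protect */
    if (txn->txn_flags & RBTXN_COMMIT)
        return true;            /* already committed: always safe */
    if (txn->txn_flags & RBTXN_ROLLBACK)
        return false;           /* already aborted: stop decoding */
    if (!(txn->txn_flags & RBTXN_PREPARE))
        return true;            /* only 2PC is decoded before commit */

    if (me->abort_pending)
    {
        /* The leader asked us to go away; remember the abort. */
        me->abort_pending = false;
        me->locked = false;
        txn->txn_flags |= RBTXN_ROLLBACK;
        return false;
    }
    me->locked = true;
    return true;
}

/* Simplified unlock: just clears the locked marker. */
static void
demo_unlock_transaction(DemoTXN *txn, DemoProc *me)
{
    assert(!(txn->txn_flags & RBTXN_ROLLBACK));
    me->locked = false;
}

/* Shape of a plugin callback: lock, touch catalogs, unlock. */
static bool
demo_change_cb(DemoTXN *txn, DemoProc *me, int *catalog_reads)
{
    if (!demo_lock_transaction(txn, me))
        return false;           /* transaction aborted: stop decoding it */
    (*catalog_reads)++;         /* stands in for a catalog lookup */
    demo_unlock_transaction(txn, me);
    return true;
}
```

The point of the sketch is the ordering: the catalog access sits strictly between the lock and unlock calls, and a false return from the lock call means the plugin should stop decoding that transaction.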
If the prepared or uncommitted transaction decides to abort, then
the decodeGroupLeader will set the decodeAbortPending flag allowing
the decodeGroupMembers to abort their decoding appropriately.
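The abort side of this handshake can be sketched as a single pass over the group members. This is a simplified model, assuming a plain array instead of the dlist in PGPROC and no locking; DemoMember and demo_mark_members_for_abort are hypothetical names. In the patch, the corresponding loop lives in LogicalDecodeRemoveTransaction, which repeats the pass after a short latch wait for as long as any member is still locked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a decoding backend's PGPROC flags. */
typedef struct DemoMember
{
    bool        abort_pending;  /* models decodeAbortPending */
    bool        locked;         /* models decodeLocked */
} DemoMember;

/*
 * One pass of the leader's abort check: flag every member so that it
 * stops decoding, and report whether any member is still inside a
 * lock/unlock window (in which case the leader has to wait and retry).
 */
static bool
demo_mark_members_for_abort(DemoMember *members, size_t n)
{
    bool        must_wait = false;

    assert(members != NULL || n == 0);

    for (size_t i = 0; i < n; i++)
    {
        members[i].abort_pending = true;
        if (members[i].locked)
            must_wait = true;
    }
    return must_wait;
}
```

A caller would loop on this function, sleeping between passes, and proceed with the abort once it returns false.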
If any decode group member errors out, that proc is likewise removed
from the group.
---
src/backend/replication/logical/logical.c | 215 ++++++++++++++++
src/backend/storage/ipc/procarray.c | 39 +++
src/backend/storage/lmgr/README | 46 ++++
src/backend/storage/lmgr/proc.c | 390 +++++++++++++++++++++++++++++-
src/include/replication/logical.h | 2 +
src/include/replication/reorderbuffer.h | 15 ++
src/include/storage/proc.h | 26 ++
src/include/storage/procarray.h | 1 +
8 files changed, 733 insertions(+), 1 deletion(-)
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 3d8ad7ddf8..2238066138 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1017,3 +1017,218 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
SpinLockRelease(&MyReplicationSlot->mutex);
}
}
+
+/*
+ * LogicalLockTransaction
+ * Make sure the transaction is not aborted during decoding.
+ *
+ * The logical decoding plugins may need to access catalogs (both system
+ * and user-defined), e.g. to get metadata about tuples, do custom
+ * filtering etc. While decoding committed transactions that is not an
+ * issue, but in-progress transactions may abort while being decoded, in
+ * which case the catalog access may fail in various ways (rows from
+ * aborted transactions are eligible for more aggressive cleanup, may
+ * not be accessible through indexes due to breaking HOT chains etc.).
+ *
+ * To prevent these issues, we need to prevent abort of the transaction
+ * while accessing any catalogs. To enforce that, each decoding backend
+ * has to call LogicalLockTransaction prior to any catalog access, and
+ * then LogicalUnlockTransaction immediately after it. These functions
+ * add/remove the decoding backend from a "decoding group" for a given
+ * transaction. While aborting a prepared transaction, the backend will
+ * wait for all current members of the decoding group to leave (see
+ * LogicalDecodeRemoveTransaction).
+ *
+ * The function returns true when it's safe to access catalogs, and
+ * false when the transaction aborted (or is being aborted) in which
+ * case the plugin should stop decoding it.
+ *
+ * The decoding backend joins the decoding group only when actually
+ * needed. For example when the transaction did no catalog changes,
+ * or when it's known to already have committed (or aborted), we can
+ * bail out without joining the group.
+ */
+bool
+LogicalLockTransaction(ReorderBufferTXN *txn)
+{
+ bool ok = false;
+ LWLock *leader_lwlock;
+
+ /*
+ * Transactions that have not modified catalogs do not need to
+ * join the decoding group.
+ */
+ if (!rbtxn_has_catalog_changes(txn))
+ return true;
+
+ /*
+ * Check commit status. If a transaction already committed, there
+ * is no danger when accessing catalogs. If it aborted, we can
+ * stop decoding it right away.
+ */
+ if (rbtxn_commit(txn))
+ return true;
+
+ if (rbtxn_rollback(txn))
+ return false;
+
+ /*
+ * Currently, only 2PC transactions can be decoded before commit
+ * (at prepare). So regular transactions are automatically safe.
+ */
+ if (!rbtxn_prepared(txn))
+ return true;
+
+ /*
+ * Find the PROC handling this XID and join the decoding group.
+ *
+ * If this is the first call for this XID, we don't know which
+ * PROC is executing the transaction (and acting as a leader).
+ * In that case we need to lookup and possibly also assign
+ * the leader.
+ */
+ if (MyProc->decodeGroupLeader == NULL)
+ {
+ PGPROC *proc = AssignDecodeGroupLeader(txn->xid);
+
+ /*
+ * We have checked if the transaction committed/aborted, but it
+ * is possible the PROC went away since then, in which case we
+ * get decodeGroupLeader=NULL. We recheck transaction status,
+ * expecting it to be either committed or aborted.
+ *
+ * If the PROC is available, add ourself as a member of its
+ * decoding group. Note that we're not holding any locks on PGPROC,
+ * so it's possible the leader disappears, or starts executing
+ * another transaction. In that case we're done.
+ */
+ if (proc == NULL ||
+ !BecomeDecodeGroupMember(proc, proc->pid, rbtxn_prepared(txn)))
+ {
+ Assert(!TransactionIdIsInProgress(txn->xid));
+ if (TransactionIdDidCommit(txn->xid))
+ {
+ txn->txn_flags |= RBTXN_COMMIT;
+ return true;
+ }
+ else
+ {
+ txn->txn_flags |= RBTXN_ROLLBACK;
+ return false;
+ }
+ }
+ }
+
+ /*
+ * If we were able to add ourself, then Abort processing will
+ * interlock with us.
+ */
+ Assert(MyProc->decodeGroupLeader);
+
+ leader_lwlock = LockHashPartitionLockByProc(MyProc->decodeGroupLeader);
+ LWLockAcquire(leader_lwlock, LW_SHARED);
+
+ /*
+ * Re-check if we were told to abort by the leader after taking
+ * the above lock
+ */
+ if (MyProc->decodeAbortPending)
+ {
+ /*
+ * Remove ourself from the decodeGroupMembership and return
+ * false so that the decoding plugin also initiates abort
+ * processing
+ */
+ LWLockRelease(leader_lwlock);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ RemoveDecodeGroupMemberLocked(MyProc->decodeGroupLeader);
+ /* reset the bool to let the leader know that we are going away */
+ MyProc->decodeAbortPending = false;
+ MyProc->decodeLocked = false;
+ txn->txn_flags |= RBTXN_ROLLBACK;
+ ok = false;
+ }
+ else
+ {
+ MyProc->decodeLocked = true;
+ ok = true;
+ }
+ LWLockRelease(leader_lwlock);
+
+ return ok;
+}
+
+/*
+ * LogicalUnlockTransaction
+ * Indicate that the logical decoding plugin is done accessing
+ * catalog information.
+ *
+ *
+ * To prevent issues while decoding of in-progress transactions, we
+ * need to prevent abort of the transaction while accessing any catalogs.
+ * To enforce that, each decoding backend has to call
+ * LogicalLockTransaction prior to any catalog access, and then
+ * LogicalUnlockTransaction immediately after it. This unlock function
+ * removes the decoding backend from a "decoding group" for a given
+ * transaction.
+ */
+void
+LogicalUnlockTransaction(ReorderBufferTXN *txn)
+{
+ LWLock *leader_lwlock;
+
+ /*
+ * If the transaction is known to have aborted, we should have never got
+ * here (the plugin should have interrupted the decoding).
+ */
+ Assert(!rbtxn_rollback(txn));
+
+ /* If it's not locked, we're done. */
+ if (!MyProc->decodeLocked)
+ return;
+
+ /*
+ * Transactions that have not modified catalogs do not need to
+ * join the decoding group.
+ */
+ if (!rbtxn_has_catalog_changes(txn))
+ return;
+
+ /*
+ * Currently, only 2PC transactions can be decoded before commit
+ * (at prepare). So regular transactions are automatically safe.
+ */
+ if (!rbtxn_prepared(txn))
+ return;
+
+ /*
+ * Check commit status. If a transaction already committed, there
+ * is no danger when accessing catalogs.
+ */
+ if (rbtxn_commit(txn))
+ return;
+
+ Assert(MyProc->decodeGroupLeader);
+ leader_lwlock = LockHashPartitionLockByProc(MyProc->decodeGroupLeader);
+ LWLockAcquire(leader_lwlock, LW_SHARED);
+ if (MyProc->decodeAbortPending)
+ {
+ /*
+ * Remove ourself from the decodeGroupMembership
+ */
+ LWLockRelease(leader_lwlock);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ RemoveDecodeGroupMemberLocked(MyProc->decodeGroupLeader);
+
+ /*
+ * reset the bool since it's a PGPROC field and we don't want
+ * things loitering around in it.
+ */
+ MyProc->decodeAbortPending = false;
+
+ txn->txn_flags |= RBTXN_ROLLBACK;
+ }
+ MyProc->decodeLocked = false;
+ LWLockRelease(leader_lwlock);
+ return;
+}
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index afe1c03aa3..2be2910207 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -2432,6 +2432,45 @@ BackendXidGetPid(TransactionId xid)
return result;
}
+/*
+ * BackendXidGetProc -- get a backend's PGPROC given its XID
+ *
+ * Note that it is up to the caller to be sure that the question
+ * remains meaningful for long enough for the answer to be used ...
+ *
+ * Only main transaction Ids are considered.
+ *
+ */
+PGPROC *
+BackendXidGetProc(TransactionId xid)
+{
+ PGPROC *result = NULL;
+ ProcArrayStruct *arrayP = procArray;
+ int index;
+
+ if (xid == InvalidTransactionId) /* never match invalid xid */
+ return NULL;
+
+ LWLockAcquire(ProcArrayLock, LW_SHARED);
+
+ for (index = 0; index < arrayP->numProcs; index++)
+ {
+ int pgprocno = arrayP->pgprocnos[index];
+ PGPROC *proc = &allProcs[pgprocno];
+ volatile PGXACT *pgxact = &allPgXact[pgprocno];
+
+ if (pgxact->xid == xid)
+ {
+ result = proc;
+ break;
+ }
+ }
+
+ LWLockRelease(ProcArrayLock);
+
+ return result;
+}
+
/*
* IsBackendPid -- is a given pid a running backend
*
diff --git a/src/backend/storage/lmgr/README b/src/backend/storage/lmgr/README
index 56b0a12a3e..4b4b9c5958 100644
--- a/src/backend/storage/lmgr/README
+++ b/src/backend/storage/lmgr/README
@@ -679,6 +679,52 @@ worker, and the worker fails to join the lock group unless the given PGPROC
still has the same PID and is still a lock group leader. We assume that
PIDs are not recycled quickly enough for this interlock to fail.
+Decode Group Locking
+--------------------
+
+When decoding in-progress transactions, we need to prevent aborts while
+the decoding processes are accessing catalogs, which might lead to issues
+if the transaction modified some of the catalogs. Currently this applies
+only to two-phase transactions, which may be decoded at PREPARE time, but
+in the future this may be extended to regular transactions too.
+
+To prevent that, the backend executing the abort is made to wait for all
+the decoding backends. We use an infrastructure which is very similar
+to the above group locking to form groups of backends performing logical
+decoding of the same in-progress transaction.
+
+Decode Group locking adds five new members to each PGPROC:
+decodeGroupLeader, decodeGroupMembers, decodeGroupLink, decodeLocked and
+decodeAbortPending. A PGPROC's decodeGroupLeader is NULL for processes
+not involved in logical decoding. When a process wants to decode an
+in-progress transaction, it looks up the PGPROC structure associated
+with that transaction ID and makes that PGPROC its decodeGroupLeader.
+The decodeGroupMembers field is only used in the
+leader; it is a list of the member PGPROCs of the decode group (the
+leader and all backends decoding this transaction ID).
+The decodeGroupLink field is the list link for this list. The decoding
+backend marks itself as decodeLocked while it is accessing catalog
+metadata for its decoding requirements via the LogicalLockTransaction
+API. It resets the same via the LogicalUnlockTransaction API.
+
+Meanwhile, if this in-progress transaction decides to abort, the PGPROC
+corresponding to it sets decodeAbortPending on itself and also on all
+the decodeGroupMembers entries.
+
+The decodeGroupMembers entries stop decoding this transaction and exit.
+When all the decoding backends have exited the abort can proceed.
+
+All five of these fields are considered to be protected by a lock manager
+partition lock. The partition lock that protects these fields within a given
+decode group is chosen by taking the leader's pgprocno modulo the number of lock
+manager partitions. Holding this single lock allows safe manipulation of the
+decodeGroupMembers list for the decode group.
+
+The decodeGroupLeader's PGPROC and its PID are both accessible to each
+decoding backend. A decoding backend fails to join the decode group
+unless the given PGPROC still has the same PID and is still a decode
+group leader. We assume that PIDs are not recycled quickly enough for
+this interlock to fail.
User Locks (Advisory Locks)
---------------------------
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 6f30e082b2..74dd16af00 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -267,6 +267,11 @@ InitProcGlobal(void)
/* Initialize lockGroupMembers list. */
dlist_init(&procs[i].lockGroupMembers);
+
+ /* Initialize decodeGroupMembers list. */
+ dlist_init(&procs[i].decodeGroupMembers);
+ procs[i].decodeAbortPending = false;
+ procs[i].decodeLocked = false;
}
/*
@@ -406,6 +411,12 @@ InitProcess(void)
Assert(MyProc->lockGroupLeader == NULL);
Assert(dlist_is_empty(&MyProc->lockGroupMembers));
+ /* Check that group decode fields are in a proper initial state. */
+ Assert(MyProc->decodeGroupLeader == NULL);
+ Assert(dlist_is_empty(&MyProc->decodeGroupMembers));
+ MyProc->decodeAbortPending = false;
+ MyProc->decodeLocked = false;
+
/* Initialize wait event information. */
MyProc->wait_event_info = 0;
@@ -581,6 +592,12 @@ InitAuxiliaryProcess(void)
Assert(MyProc->lockGroupLeader == NULL);
Assert(dlist_is_empty(&MyProc->lockGroupMembers));
+ /* Check that group decode fields are in a proper initial state. */
+ Assert(MyProc->decodeGroupLeader == NULL);
+ Assert(dlist_is_empty(&MyProc->decodeGroupMembers));
+ MyProc->decodeAbortPending = false;
+ MyProc->decodeLocked = false;
+
/*
* We might be reusing a semaphore that belonged to a failed process. So
* be careful and reinitialize its value here. (This is not strictly
@@ -826,7 +843,7 @@ ProcKill(int code, Datum arg)
/*
* Detach from any lock group of which we are a member. If the leader
- * exist before all other group members, it's PGPROC will remain allocated
+ * exits before all other group members, its PGPROC will remain allocated
* until the last group process exits; that process must return the
* leader's PGPROC to the appropriate list.
*/
@@ -857,6 +874,47 @@ ProcKill(int code, Datum arg)
LWLockRelease(leader_lwlock);
}
+ /*
+ * Detach from any decode group of which we are a member. If the leader
+ * exits before all other group members, its PGPROC will remain allocated
+ * until the last group process exits; that process must return the
+ * leader's PGPROC to the appropriate list.
+ */
+ if (MyProc->decodeGroupLeader != NULL)
+ {
+ PGPROC *leader = MyProc->decodeGroupLeader;
+ LWLock *leader_lwlock = LockHashPartitionLockByProc(leader);
+
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ dlist_delete(&MyProc->decodeGroupLink);
+ if (dlist_is_empty(&leader->decodeGroupMembers))
+ {
+ leader->decodeGroupLeader = NULL;
+ if (leader != MyProc)
+ {
+ procgloballist = leader->procgloballist;
+
+ /* Leader exited first; return its PGPROC. */
+ SpinLockAcquire(ProcStructLock);
+ leader->links.next = (SHM_QUEUE *) *procgloballist;
+ *procgloballist = leader;
+ SpinLockRelease(ProcStructLock);
+ }
+ /* clear leader flags */
+ leader->decodeAbortPending = false;
+ leader->decodeLocked = false;
+ }
+ else if (leader != MyProc)
+ {
+ MyProc->decodeGroupLeader = NULL;
+ /* clear proc flags */
+ MyProc->decodeLocked = false;
+ MyProc->decodeAbortPending = false;
+ }
+ LWLockRelease(leader_lwlock);
+ }
+
/*
* Reset MyLatch to the process local one. This is so that signal
* handlers et al can continue using the latch after the shared latch
@@ -886,6 +944,21 @@ ProcKill(int code, Datum arg)
*procgloballist = proc;
}
+ /*
+ * If we're still a member of a decode group, that means we're a leader
+ * which has somehow exited before its children. The last remaining child
+ * will release our PGPROC. Otherwise, release it now.
+ */
+ if (proc->decodeGroupLeader == NULL)
+ {
+ /* Since decodeGroupLeader is NULL, decodeGroupMembers should be empty. */
+ Assert(dlist_is_empty(&proc->decodeGroupMembers));
+
+ /* Return PGPROC structure (and semaphore) to appropriate freelist */
+ proc->links.next = (SHM_QUEUE *) *procgloballist;
+ *procgloballist = proc;
+ }
+
/* Update shared estimate of spins_per_delay */
ProcGlobal->spins_per_delay = update_spins_per_delay(ProcGlobal->spins_per_delay);
@@ -1887,3 +1960,318 @@ BecomeLockGroupMember(PGPROC *leader, int pid)
return ok;
}
+
+/*
+ * AssignDecodeGroupLeader
+ * Lookup process using xid and designate as decode group leader.
+ *
+ * Once this function has returned, other processes can join the decode
+ * group by calling BecomeDecodeGroupMember.
+ */
+PGPROC *
+AssignDecodeGroupLeader(TransactionId xid)
+{
+ PGPROC *proc = NULL;
+ int pid;
+ LWLock *leader_lwlock;
+
+ Assert(xid != InvalidTransactionId);
+
+ /*
+ * Lookup the backend executing this transaction.
+ *
+ * If the transaction already completed, we can bail out.
+ */
+ proc = BackendXidGetProc(xid);
+ if (proc)
+ pid = proc->pid;
+ else
+ return NULL;
+
+ /*
+ * Process running a XID can't have a leader, it can only be
+ * a leader (in which case it points to itself).
+ */
+ Assert(!proc->decodeGroupLeader ||
+ (proc->decodeGroupLeader == proc));
+
+ /*
+ * This proc will become decodeGroupLeader if it's not already.
+ */
+ if (proc->decodeGroupLeader == NULL)
+ {
+ volatile PGXACT *pgxact;
+
+ /* Create single-member group, containing this proc. */
+ leader_lwlock = LockHashPartitionLockByProc(proc);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+
+ /* recheck we are still the same */
+ pgxact = &ProcGlobal->allPgXact[proc->pgprocno];
+
+ /*
+ * We know the process was executing the XID a while ago, but we
+ * have not been holding any locks on PGPROC so it might have
+ * started executing something else since then. So we need to
+ * recheck that it is indeed still running the right XID.
+ *
+ * If it's not, the transaction must have already completed, so
+ * we don't need to create any decoding group.
+ */
+ if (pgxact->xid == xid)
+ {
+ /*
+ * Some other decoding backend might have marked the process
+ * as a leader before we acquired the lock. But it must not
+ * be a follower of some other leader.
+ */
+ Assert((proc->decodeGroupLeader == NULL) ||
+ (proc->decodeGroupLeader == proc));
+
+ /* recheck if someone else did not already assign us */
+ if (proc->decodeGroupLeader == NULL)
+ {
+ /*
+ * The leader is also a part of the decoding group,
+ * so we add it to the members list as well.
+ */
+ proc->decodeGroupLeader = proc;
+ dlist_push_head(&proc->decodeGroupMembers,
+ &proc->decodeGroupLink);
+ }
+ }
+ else
+ {
+ /* proc entry is gone */
+ proc = NULL;
+ }
+ LWLockRelease(leader_lwlock);
+ }
+
+ elog(DEBUG1, "became group leader (%p)", proc);
+ return proc;
+}
+
+/*
+ * BecomeDecodeGroupMember - designate process as decode group member
+ *
+ * This is pretty straightforward except for the possibility that the leader
+ * whose group we're trying to join might exit before we manage to do so;
+ * and the PGPROC might get recycled for an unrelated process. To avoid
+ * that, we require the caller to pass the PID of the intended PGPROC as
+ * an interlock. Returns true if we successfully join the intended decode
+ * group, and false if not.
+ */
+bool
+BecomeDecodeGroupMember(PGPROC *leader, int pid, bool is_prepared)
+{
+ LWLock *leader_lwlock;
+ bool ok = false;
+
+ /* Group leader can't become member of group */
+ Assert(MyProc != leader);
+
+ /* Can't already be a member of a group */
+ Assert(MyProc->decodeGroupLeader == NULL);
+
+ /* PID must be valid OR this is a prepared transaction. */
+ Assert(pid != 0 || is_prepared);
+
+ /*
+ * Get lock protecting the group fields. Note LockHashPartitionLockByProc
+ * accesses leader->pgprocno in a PGPROC that might be free. This is safe
+ * because all PGPROCs' pgprocno fields are set during shared memory
+ * initialization and never change thereafter; so we will acquire the
+ * correct lock even if the leader PGPROC is in process of being recycled.
+ */
+ leader_lwlock = LockHashPartitionLockByProc(leader);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+
+ /* Is this the leader we're looking for? */
+ if (leader->pid == pid && leader->decodeGroupLeader == leader)
+ {
+ if (is_prepared)
+ Assert(pid == 0);
+ /* is the leader going away? */
+ if (leader->decodeAbortPending)
+ ok = false;
+ else
+ {
+ /* OK, join the group */
+ ok = true;
+ MyProc->decodeGroupLeader = leader;
+ dlist_push_tail(&leader->decodeGroupMembers, &MyProc->decodeGroupLink);
+ }
+ }
+ else
+ MyProc->decodeGroupLeader = NULL;
+ LWLockRelease(leader_lwlock);
+
+ elog(DEBUG1, "became group member (%p) to (%p)", MyProc, leader);
+ return ok;
+}
+
+/*
+ * RemoveDecodeGroupMember
+ * Remove a member from the decoding group of a leader.
+ */
+void
+RemoveDecodeGroupMember(PGPROC *leader)
+{
+ LWLock *leader_lwlock;
+
+ leader_lwlock = LockHashPartitionLockByProc(leader);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ RemoveDecodeGroupMemberLocked(leader);
+ LWLockRelease(leader_lwlock);
+
+ return;
+}
+
+/*
+ * RemoveDecodeGroupMemberLocked
+ * Remove a member from a decoding group of a leader.
+ *
+ * Assumes that the caller is holding appropriate lock on PGPROC.
+ */
+void
+RemoveDecodeGroupMemberLocked(PGPROC *leader)
+{
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ dlist_delete(&MyProc->decodeGroupLink);
+ /* leader links to itself, so never empty */
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ MyProc->decodeGroupLeader = NULL;
+ elog(DEBUG1, "removed group member (%p) from (%p)", MyProc, leader);
+
+ return;
+}
+
+/*
+ * LogicalDecodeRemoveTransaction
+ * Notify all decoding members that this transaction is going away.
+ *
+ * Wait for all decodeGroupMembers to ack back before returning from
+ * here but only in case of aborts.
+ *
+ * This function should be called *after* the proc has been removed
+ * from the procArray.
+ *
+ * If the transaction is committing, it's ok for the decoding backends
+ * to continue merrily - there is no danger in accessing catalogs. When
+ * it tries to join the decoding group, it won't find the proc anymore,
+ * forcing it to re-check transaction status and cache the commit
+ * status for future calls (see LogicalLockTransaction).
+ *
+ * In case a backend which is part of the decode group dies/crashes,
+ * then that would effectively cause the database to restart, cleaning
+ * up the shared memory state.
+ */
+void
+LogicalDecodeRemoveTransaction(PGPROC *leader, bool isCommit)
+{
+ LWLock *leader_lwlock;
+ dlist_mutable_iter change_i;
+ dlist_iter iter;
+ PGPROC *proc;
+ bool do_wait;
+
+ leader_lwlock = LockHashPartitionLockByProc(leader);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+
+ /*
+ * If the proc has not been initialized as a group leader, there are
+ * no group members to wait for and we can terminate right away.
+ */
+ if (leader->decodeGroupLeader == NULL)
+ {
+ Assert(dlist_is_empty(&leader->decodeGroupMembers));
+ LWLockRelease(leader_lwlock);
+ return;
+ }
+
+ /* mark the transaction as aborting */
+ leader->decodeAbortPending = (!isCommit);
+
+recheck:
+ do_wait = false;
+ Assert(leader->decodeGroupLeader == leader);
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ if (!isCommit)
+ {
+ /*
+ * We need to walk the list of group members, and decide if we
+ * need to wait for some of them. In other words, we need to
+ * check if there are any processes besides the leader.
+ */
+ dlist_foreach(iter, &leader->decodeGroupMembers)
+ {
+ proc = dlist_container(PGPROC, decodeGroupLink, iter.cur);
+
+ /* Ignore the leader (i.e. ourselves). */
+ if (proc == leader)
+ continue;
+
+ /* mark the proc to indicate abort is pending */
+ if (!proc->decodeAbortPending)
+ {
+ proc->decodeAbortPending = true;
+ elog(DEBUG1, "marking group member (%p) from (%p) for abort",
+ proc, leader);
+ }
+
+ /* if the proc is currently locked, wait */
+ if (proc->decodeLocked)
+ do_wait = true;
+ }
+
+ if (do_wait)
+ {
+ int rc;
+ LWLockRelease(leader_lwlock);
+
+ elog(LOG, "Waiting for backends to abort decoding");
+ /*
+ * Wait on our latch to allow decodeGroupMembers to
+ * go away soon
+ */
+ rc = WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+ 100L,
+ WAIT_EVENT_PG_SLEEP);
+ ResetLatch(MyLatch);
+
+ /* emergency bailout if postmaster has died */
+ if (rc & WL_POSTMASTER_DEATH)
+ proc_exit(1);
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Recheck decodeGroupMembers */
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ goto recheck;
+ }
+ }
+
+ /*
+ * All backends exited cleanly in case of aborts above,
+ * remove decodeGroupMembers now for both commit/abort cases
+ */
+ Assert(leader->decodeGroupLeader == leader);
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ dlist_foreach_modify(change_i, &leader->decodeGroupMembers)
+ {
+ proc = dlist_container(PGPROC, decodeGroupLink, change_i.cur);
+ Assert(!proc->decodeLocked);
+ dlist_delete(&proc->decodeGroupLink);
+ elog(DEBUG1, "deleting group member (%p) from (%p)",
+ proc, leader);
+ proc->decodeGroupLeader = NULL;
+ }
+ Assert(dlist_is_empty(&leader->decodeGroupMembers));
+ leader->decodeGroupLeader = NULL;
+ leader->decodeAbortPending = false;
+ LWLockRelease(leader_lwlock);
+
+ return;
+}
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 619c5f4d73..63b14367f0 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -117,6 +117,8 @@ extern void LogicalIncreaseXminForSlot(XLogRecPtr lsn, TransactionId xmin);
extern void LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn,
XLogRecPtr restart_lsn);
extern void LogicalConfirmReceivedLocation(XLogRecPtr lsn);
+extern bool LogicalLockTransaction(ReorderBufferTXN *txn);
+extern void LogicalUnlockTransaction(ReorderBufferTXN *txn);
extern bool filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id);
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 177ef98e43..385bb486bb 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -141,6 +141,11 @@ typedef struct ReorderBufferChange
#define RBTXN_HAS_CATALOG_CHANGES 0x0001
#define RBTXN_IS_SUBXACT 0x0002
#define RBTXN_IS_SERIALIZED 0x0004
+#define RBTXN_PREPARE 0x0008
+#define RBTXN_COMMIT_PREPARED 0x0010
+#define RBTXN_ROLLBACK_PREPARED 0x0020
+#define RBTXN_COMMIT 0x0040
+#define RBTXN_ROLLBACK 0x0080
/* does the txn have catalog changes */
#define rbtxn_has_catalog_changes(txn) (txn->txn_flags & RBTXN_HAS_CATALOG_CHANGES)
@@ -154,6 +159,16 @@ typedef struct ReorderBufferChange
* nentries_mem == nentries.
*/
#define rbtxn_is_serialized(txn) (txn->txn_flags & RBTXN_IS_SERIALIZED)
+/* is this txn prepared? */
+#define rbtxn_prepared(txn) (txn->txn_flags & RBTXN_PREPARE)
+/* was this prepared txn committed in the meanwhile? */
+#define rbtxn_commit_prepared(txn) (txn->txn_flags & RBTXN_COMMIT_PREPARED)
+/* was this prepared txn aborted in the meanwhile? */
+#define rbtxn_rollback_prepared(txn) (txn->txn_flags & RBTXN_ROLLBACK_PREPARED)
+/* was this txn committed in the meanwhile? */
+#define rbtxn_commit(txn) (txn->txn_flags & RBTXN_COMMIT)
+/* was this txn aborted in the meanwhile? */
+#define rbtxn_rollback(txn) (txn->txn_flags & RBTXN_ROLLBACK)
typedef struct ReorderBufferTXN
{
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 5c19a61dcf..45d2dbd766 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -200,6 +200,26 @@ struct PGPROC
PGPROC *lockGroupLeader; /* lock group leader, if I'm a member */
dlist_head lockGroupMembers; /* list of members, if I'm a leader */
dlist_node lockGroupLink; /* my member link, if I'm a member */
+
+ /*
+ * Support for decoding groups. Use LockHashPartitionLockByProc on the group
+ * leader to get the LWLock protecting these fields.
+ *
+ * For prepared and uncommitted transactions, decoding backends working on
+ * the same XID will link themselves up to the corresponding PGPROC
+ * entry (decodeGroupLeader).
+ *
+ * They will remove themselves when they are done decoding.
+ *
+ * If the prepared or uncommitted transaction decides to abort, then
+ * the decodeGroupLeader will set the decodeAbortPending flag allowing
+ * the decodeGroupMembers to abort their decoding appropriately
+ */
+ PGPROC *decodeGroupLeader; /* decode group leader, if I'm a member */
+ dlist_head decodeGroupMembers; /* list of members, if I'm a leader */
+ dlist_node decodeGroupLink; /* my member link, if I'm a member */
+ bool decodeLocked; /* is it currently locked by this proc? */
+ bool decodeAbortPending; /* is the decode group leader aborting? */
};
/* NOTE: "typedef struct PGPROC PGPROC" appears in storage/lock.h. */
@@ -327,4 +347,10 @@ extern PGPROC *AuxiliaryPidGetProc(int pid);
extern void BecomeLockGroupLeader(void);
extern bool BecomeLockGroupMember(PGPROC *leader, int pid);
+extern PGPROC *AssignDecodeGroupLeader(TransactionId xid);
+extern bool BecomeDecodeGroupMember(PGPROC *leader, int pid, bool is_prepared);
+extern void RemoveDecodeGroupMember(PGPROC *leader);
+extern void RemoveDecodeGroupMemberLocked(PGPROC *leader);
+extern void LogicalDecodeRemoveTransaction(PGPROC *leader, bool isCommit);
+
#endif /* PROC_H */
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 75bab2985f..776de2470e 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -97,6 +97,7 @@ extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids
extern PGPROC *BackendPidGetProc(int pid);
extern PGPROC *BackendPidGetProcWithLock(int pid);
+extern PGPROC *BackendXidGetProc(TransactionId xid);
extern int BackendXidGetPid(TransactionId xid);
extern bool IsBackendPid(int pid);
--
2.15.1 (Apple Git-101)
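Reviewer note on the decode-group fields added to PGPROC in the patch above: the comment there describes a handshake in which the aborting leader sets decodeAbortPending and waits while any member still has decodeLocked set. The following is a minimal single-process sketch of that handshake, not the patch's implementation. DemoMember, demo_lock, demo_unlock and demo_mark_abort are hypothetical names, there is no LWLock or latch here, and the assumption that a lock attempt is refused once an abort is pending is inferred from the `if (!LogicalLockTransaction(txn)) break;` call sites rather than from the function body.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two per-backend flags in PGPROC. */
typedef struct DemoMember
{
    bool locked;        /* models decodeLocked */
    bool abort_pending; /* models decodeAbortPending */
} DemoMember;

/* Member side (assumed behaviour): refuse to lock once an abort is pending. */
bool
demo_lock(DemoMember *m)
{
    assert(m != NULL);
    if (m->abort_pending)
        return false;
    m->locked = true;
    return true;
}

/* Member side: release the lock after the catalog access is done. */
void
demo_unlock(DemoMember *m)
{
    assert(m != NULL);
    m->locked = false;
}

/*
 * Leader side: flag every member and report whether the leader still has
 * to wait, i.e. whether some member is inside a locked section.
 */
bool
demo_mark_abort(DemoMember *members, size_t n)
{
    bool do_wait = false;

    for (size_t i = 0; i < n; i++)
    {
        members[i].abort_pending = true;
        if (members[i].locked)
            do_wait = true;
    }
    return do_wait;
}
```

In the real code the leader would sleep on its latch and recheck instead of returning; the sketch only shows which flag is written by whom.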
0003-Support-decoding-of-two-phase-transactions-at-PREPAR.0304.patch (application/octet-stream)
From 8cb820046385b943e98af3bc276f73b91fb525a4 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Tue, 3 Apr 2018 14:08:29 +0530
Subject: [PATCH 3/6] Support decoding of two-phase transactions at PREPARE
Until now, two-phase transactions were decoded at COMMIT, just like
regular transactions. During replay, two-phase transactions were
translated into regular transactions on the subscriber, and the GID
was not forwarded to it.
This patch allows PREPARE-time decoding of two-phase transactions (if
the output plugin supports this capability), in which case the
transactions are replayed at PREPARE and then committed later when
COMMIT PREPARED arrives.
On the subscriber, the transactions will be executed as two-phase
transactions, with the same GID. This is important for various
external transaction managers, which often encode information into
the GID itself.
All catalog access while decoding such 2PC transactions has to be
bracketed by the LogicalLockTransaction/LogicalUnlockTransaction APIs
at the relevant locations. This includes the location where the output
plugin's change apply API is to be invoked. This protects any catalog
access inside the output plugin's change apply API from concurrent
rollback operations.
---
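Note (not part of the commit message): the all-or-nothing rule for the two-phase callbacks can be sketched in isolation. This is a rough standalone model of the check in StartupDecodingContext, assuming stand-in types; DemoCallbacks, DemoCheck, demo_check_twophase and demo_noop are hypothetical names and do not exist in PostgreSQL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef void (*demo_cb) (void);

/* Hypothetical stand-in for the two-phase part of OutputPluginCallbacks. */
typedef struct DemoCallbacks
{
    demo_cb prepare_cb;
    demo_cb commit_prepared_cb;
    demo_cb abort_prepared_cb;
    demo_cb filter_prepare_cb;  /* optional, needs the three above */
} DemoCallbacks;

typedef enum DemoCheck
{
    DEMO_TWOPHASE_DISABLED,
    DEMO_TWOPHASE_ENABLED,
    DEMO_ERR_PARTIAL,
    DEMO_ERR_FILTER_WITHOUT_TWOPHASE
} DemoCheck;

/* Placeholder callback used only to make a pointer non-NULL. */
void
demo_noop(void)
{
}

/* Either all three mandatory callbacks are set, or none of them is. */
DemoCheck
demo_check_twophase(const DemoCallbacks *cb)
{
    int n;

    assert(cb != NULL);
    n = (cb->prepare_cb != NULL) +
        (cb->commit_prepared_cb != NULL) +
        (cb->abort_prepared_cb != NULL);

    if (n != 0 && n != 3)
        return DEMO_ERR_PARTIAL;
    if (cb->filter_prepare_cb != NULL && n != 3)
        return DEMO_ERR_FILTER_WITHOUT_TWOPHASE;
    return (n == 3) ? DEMO_TWOPHASE_ENABLED : DEMO_TWOPHASE_DISABLED;
}
```

The two error results correspond to the two ereport(ERROR) branches in the patch; the sketch returns a code instead of raising an error so that it can be exercised outside the backend.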
src/backend/access/transam/twophase.c | 5 +
src/backend/replication/logical/decode.c | 147 +++++++++++++++--
src/backend/replication/logical/logical.c | 193 ++++++++++++++++++++++
src/backend/replication/logical/reorderbuffer.c | 209 +++++++++++++++++++++---
src/include/replication/logical.h | 11 +-
src/include/replication/output_plugin.h | 45 +++++
src/include/replication/reorderbuffer.h | 54 ++++++
7 files changed, 630 insertions(+), 34 deletions(-)
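Note for reviewers: the decision made at PREPARE and the one made later at COMMIT PREPARED both depend on the same filter answer, which is why the filter has to be deterministic for a given transaction. A simplified sketch of that pairing follows, under stated assumptions: demo_filter_prepare is a made-up filter that skips GIDs starting with "nodecode", a filter is always present, and the snapshot, database and origin checks are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef enum DemoAction
{
    DEMO_SKIP_UNTIL_COMMIT,   /* ignore PREPARE, decode at COMMIT PREPARED */
    DEMO_DECODE_AT_PREPARE,   /* replay changes, then call prepare callback */
    DEMO_FINISH_PREPARED,     /* only send commit_prepared for the GID */
    DEMO_REPLAY_AT_COMMIT     /* regular replay from the reorder buffer */
} DemoAction;

/* Hypothetical filter: true means "do not decode this GID at PREPARE". */
bool
demo_filter_prepare(const char *gid)
{
    assert(gid != NULL);
    return strncmp(gid, "nodecode", 8) == 0;
}

/* Models the XLOG_XACT_PREPARE branch. */
DemoAction
demo_on_prepare(bool enable_twophase, const char *gid)
{
    if (!enable_twophase)
        return DEMO_SKIP_UNTIL_COMMIT;
    if (demo_filter_prepare(gid))
        return DEMO_SKIP_UNTIL_COMMIT;
    return DEMO_DECODE_AT_PREPARE;
}

/* Models the COMMIT PREPARED versus regular COMMIT decision. */
DemoAction
demo_on_commit(bool enable_twophase, bool is_twophase_commit, const char *gid)
{
    if (is_twophase_commit && enable_twophase && !demo_filter_prepare(gid))
        return DEMO_FINISH_PREPARED;
    return DEMO_REPLAY_AT_COMMIT;
}
```

If the filter answered differently at the two points, a transaction could be replayed twice or not at all, which is the hazard the comment in ReorderBufferTxnIsPrepared warns about.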
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index d6e4b7980f..f3091af385 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1506,6 +1506,11 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
gid);
ProcArrayRemove(proc, latestXid);
+ /*
+ * Tell logical decoding backends interested in this XID
+ * that this is going away
+ */
+ LogicalDecodeRemoveTransaction(proc, isCommit);
/*
* In case we fail while running the callbacks, mark the gxact invalid so
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 6eb0d5527e..51d544d0f5 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -34,6 +34,7 @@
#include "access/xlogutils.h"
#include "access/xlogreader.h"
#include "access/xlogrecord.h"
+#include "access/twophase.h"
#include "catalog/pg_control.h"
@@ -72,6 +73,8 @@ static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_commit *parsed, TransactionId xid);
static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare * parsed);
/* common function to decode tuples */
static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -280,16 +283,33 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
break;
}
case XLOG_XACT_PREPARE:
+ {
+ xl_xact_parsed_prepare parsed;
- /*
- * Currently decoding ignores PREPARE TRANSACTION and will just
- * decode the transaction when the COMMIT PREPARED is sent or
- * throw away the transaction's contents when a ROLLBACK PREPARED
- * is received. In the future we could add code to expose prepared
- * transactions in the changestream allowing for a kind of
- * distributed 2PC.
- */
- ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+ /* check that output plugin is capable of twophase decoding */
+ if (!ctx->enable_twophase)
+ {
+ ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+ break;
+ }
+
+ /* ok, parse it */
+ ParsePrepareRecord(XLogRecGetInfo(buf->record),
+ XLogRecGetData(buf->record), &parsed);
+
+ /* does output plugin want this particular transaction? */
+ if (ctx->callbacks.filter_prepare_cb &&
+ ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+ parsed.twophase_gid))
+ {
+ ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+ buf->origptr);
+ break;
+ }
+
+ DecodePrepare(ctx, buf, &parsed);
+ break;
+ }
break;
default:
elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
@@ -627,9 +647,90 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
buf->origptr, buf->endptr);
}
+ /*
+ * Decide if we're processing COMMIT PREPARED, or a regular COMMIT.
+ * Regular commit simply triggers a replay of transaction changes from the
+ * reorder buffer. For COMMIT PREPARED that however already happened at
+ * PREPARE time, and so we only need to notify the subscriber that the GID
+ * finally committed.
+ *
+ * For output plugins that do not support PREPARE-time decoding of
+ * two-phase transactions, we never even see the PREPARE and all two-phase
+ * transactions simply fall through to the second branch.
+ */
+ if (TransactionIdIsValid(parsed->twophase_xid) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder,
+ parsed->twophase_xid, parsed->twophase_gid))
+ {
+ Assert(xid == parsed->twophase_xid);
+ /* we are processing COMMIT PREPARED */
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid, true);
+ }
+ else
+ {
+ /* replay actions of all transaction + subtransactions in order */
+ ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn);
+ }
+}
+
+/*
+ * Decode a PREPARE record. The logic is similar to that used for COMMIT.
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare * parsed)
+{
+ XLogRecPtr origin_lsn = parsed->origin_lsn;
+ TimestampTz commit_time = parsed->origin_timestamp;
+ RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+ int i;
+ TransactionId xid = parsed->twophase_xid;
+
+ /*
+ * Process invalidation messages, even if we're not interested in the
+ * transaction's contents, since the various caches need to always be
+ * consistent.
+ */
+ if (parsed->nmsgs > 0)
+ {
+ if (!ctx->fast_forward)
+ ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
+ parsed->nmsgs, parsed->msgs);
+ ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+ }
+
+ /*
+ * Tell the reorderbuffer about the surviving subtransactions. We need to
+ * do this because the main transaction itself has not committed since we
+ * are in the prepare phase right now. So we need to be sure the snapshot
+ * is set up correctly for the main transaction in case all changes
+ * happened in subtransactions.
+ */
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+ buf->origptr, buf->endptr);
+ }
+
+ if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+ (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+ ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+ {
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr);
+ }
+ ReorderBufferForget(ctx->reorder, xid, buf->origptr);
+
+ return;
+ }
+
/* replay actions of all transaction + subtransactions in order */
- ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
- commit_time, origin_id, origin_lsn);
+ ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn, parsed->twophase_gid);
}
/*
@@ -641,6 +742,30 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid)
{
int i;
+ XLogRecPtr origin_lsn = InvalidXLogRecPtr;
+ TimestampTz commit_time = 0;
+ RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ origin_lsn = parsed->origin_lsn;
+ commit_time = parsed->origin_timestamp;
+ }
+
+ /*
+ * If it's ROLLBACK PREPARED then handle it via callbacks.
+ */
+ if (TransactionIdIsValid(xid) &&
+ !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+ parsed->dbId == ctx->slot->data.database &&
+ !FilterByOrigin(ctx, origin_id) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
+ {
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid, false);
+ return;
+ }
for (i = 0; i < parsed->nsubxacts; i++)
{
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 2238066138..a97a7c838c 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -60,6 +60,16 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+static void abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change);
static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -125,6 +135,7 @@ StartupDecodingContext(List *output_plugin_options,
MemoryContext context,
old_context;
LogicalDecodingContext *ctx;
+ int twophase_callbacks;
/* shorter lines... */
slot = MyReplicationSlot;
@@ -184,8 +195,38 @@ StartupDecodingContext(List *output_plugin_options,
ctx->reorder->begin = begin_cb_wrapper;
ctx->reorder->apply_change = change_cb_wrapper;
ctx->reorder->commit = commit_cb_wrapper;
+ ctx->reorder->abort = abort_cb_wrapper;
+ ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+ ctx->reorder->prepare = prepare_cb_wrapper;
+ ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+ ctx->reorder->abort_prepared = abort_prepared_cb_wrapper;
ctx->reorder->message = message_cb_wrapper;
+ /*
+ * Check that plugin implements all callbacks necessary to decode
+ * two-phase transactions - we either have to have all of them or none.
+ * The filter_prepare callback is optional, but can only be defined when
+ * two-phase decoding is enabled (i.e. the three other callbacks are
+ * defined).
+ */
+ twophase_callbacks = (ctx->callbacks.prepare_cb != NULL) +
+ (ctx->callbacks.commit_prepared_cb != NULL) +
+ (ctx->callbacks.abort_prepared_cb != NULL);
+
+ ctx->enable_twophase = (twophase_callbacks == 3);
+
+ /* Plugins with incorrect number of two-phase callbacks are broken. */
+ if ((twophase_callbacks != 3) && (twophase_callbacks != 0))
+ ereport(ERROR,
+ (errmsg("output plugin registered only %d two-phase callbacks",
+ twophase_callbacks)));
+
+ /* filter_prepare is optional, but requires two-phase decoding */
+ if ((ctx->callbacks.filter_prepare_cb != NULL) && (!ctx->enable_twophase))
+ ereport(ERROR,
+ (errmsg("output plugin does not support two-phase decoding, but "
+ "registered filter_prepare callback")));
+
ctx->out = makeStringInfo();
ctx->prepare_write = prepare_write;
ctx->write = do_write;
@@ -697,6 +738,122 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
error_context_stack = errcallback.previous;
}
+static void
+abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "abort";
+ state.report_location = txn->final_lsn; /* beginning of abort record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.abort_cb(ctx, txn, abort_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "prepare";
+ state.report_location = txn->final_lsn; /* beginning of prepare record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "commit_prepared";
+ state.report_location = txn->final_lsn; /* beginning of commit record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "abort_prepared";
+ state.report_location = txn->final_lsn; /* beginning of abort record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.abort_prepared_cb(ctx, txn, abort_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
static void
change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change)
@@ -734,6 +891,42 @@ change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
error_context_stack = errcallback.previous;
}
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+ bool ret;
+
+ /*
+ * If twophase is not enabled, skip decoding at PREPARE time
+ */
+ if (!ctx->enable_twophase)
+ return true;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "filter_prepare";
+ state.report_location = InvalidXLogRecPtr;
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = false;
+
+ /* do the actual work: call callback */
+ ret = ctx->callbacks.filter_prepare_cb(ctx, txn, xid, gid);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+
+ return ret;
+}
+
bool
filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
{
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 3c9af58640..fdce0249f1 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1277,25 +1277,18 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
* the top and subtransactions (using a k-way merge) and replay the changes in
* lsn order.
*/
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
- XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
- TimestampTz commit_time,
- RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+ ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn)
{
- ReorderBufferTXN *txn;
volatile Snapshot snapshot_now;
volatile CommandId command_id = FirstCommandId;
bool using_subtxn;
ReorderBufferIterTXNState *volatile iterstate = NULL;
- txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
- false);
-
- /* unknown transaction, nothing to replay */
- if (txn == NULL)
- return;
-
txn->final_lsn = commit_lsn;
txn->end_lsn = end_lsn;
txn->commit_time = commit_time;
@@ -1388,8 +1381,14 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
relpathperm(change->data.tp.relnode,
MAIN_FORKNUM));
+ /* Lock transaction before catalog access */
+ if (!LogicalLockTransaction(txn))
+ break;
+
relation = RelationIdGetRelation(reloid);
+ LogicalUnlockTransaction(txn);
+
if (relation == NULL)
elog(ERROR, "could not open relation with OID %u (for filenode \"%s\")",
reloid,
@@ -1419,7 +1418,23 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
if (!IsToastRelation(relation))
{
ReorderBufferToastReplace(rb, txn, relation, change);
+
+ /*
+ * Output plugins can access catalog metadata and we
+ * do not have any control over that. We could ask
+ * them to call
+ * LogicalLockTransaction/LogicalUnlockTransaction
+ * APIs themselves, but that leads to unnecessary
+ * complications and expectations from plugin
+ * writers. We avoid this by calling these APIs
+ * here, thereby ensuring that the in-progress
+ * transaction will be around for the duration of
+ * the apply_change call below
+ */
+ if (!LogicalLockTransaction(txn))
+ break;
rb->apply_change(rb, txn, relation, change);
+ LogicalUnlockTransaction(txn);
/*
* Only clear reassembled toast chunks if we're sure
@@ -1581,8 +1596,26 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
ReorderBufferIterTXNFinish(rb, iterstate);
iterstate = NULL;
- /* call commit callback */
- rb->commit(rb, txn, commit_lsn);
+ /*
+ * Call abort/commit/prepare callback, depending on the transaction
+ * state.
+ *
+ * If the transaction aborted during apply (which currently can happen
+ * only for prepared transactions), simply call the abort callback.
+ *
+ * Otherwise call either PREPARE (for twophase transactions) or COMMIT
+ * (for regular ones).
+ */
+ if (rbtxn_rollback(txn))
+ rb->abort(rb, txn, commit_lsn);
+ else if (rbtxn_prepared(txn))
+ rb->prepare(rb, txn, commit_lsn);
+ else
+ rb->commit(rb, txn, commit_lsn);
+
+ /* remove ourselves from the decode group */
+ if (MyProc->decodeGroupLeader)
+ RemoveDecodeGroupMember(MyProc->decodeGroupLeader);
/* this is just a sanity check against bad output plugin behaviour */
if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -1609,7 +1642,12 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
if (snapshot_now->copied)
ReorderBufferFreeSnap(rb, snapshot_now);
- /* remove potential on-disk data, and deallocate */
+ /*
+ * remove potential on-disk data, and deallocate.
+ *
+ * We remove it even for prepared transactions (GID is enough to
+ * commit/abort those later).
+ */
ReorderBufferCleanupTXN(rb, txn);
}
PG_CATCH();
@@ -1643,6 +1681,137 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
PG_END_TRY();
}
+
+/*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+ return rb->filter_prepare(rb, txn, xid, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferCommitInternal()
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /* unknown transaction, nothing to replay */
+ if (txn == NULL)
+ return;
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a twophase transaction. It calls ReorderBufferCommitInternal()
+ * since all prepared transactions need to be decoded at PREPARE time.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /* unknown transaction, nothing to replay */
+ if (txn == NULL)
+ return;
+
+ txn->txn_flags |= RBTXN_PREPARE;
+ strcpy(txn->gid, gid);
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Check whether this transaction was sent as prepared to subscribers.
+ * Called while handling commit|abort prepared.
+ */
+bool
+ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+ const char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /*
+ * Always call the prepare filter. It's the job of the prepare filter to
+ * give us the *same* response for a given xid across multiple calls
+ * (including ones on restart)
+ */
+ return !(rb->filter_prepare(rb, txn, xid, gid));
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ABORT PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit)
+{
+ ReorderBufferTXN *txn;
+
+ /*
+ * The transaction may or may not exist (during restarts for example).
+ * Anyways, 2PC transactions do not contain any reorderbuffers. So allow
+ * it to be created below.
+ */
+ txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+ true);
+
+ txn->final_lsn = commit_lsn;
+ txn->end_lsn = end_lsn;
+ txn->commit_time = commit_time;
+ txn->origin_id = origin_id;
+ txn->origin_lsn = origin_lsn;
+ strcpy(txn->gid, gid);
+
+ if (is_commit)
+ {
+ txn->txn_flags |= RBTXN_COMMIT_PREPARED;
+ rb->commit_prepared(rb, txn, commit_lsn);
+ }
+ else
+ {
+ txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
+ rb->abort_prepared(rb, txn, commit_lsn);
+ }
+
+ /* cleanup: make sure there's no cache pollution */
+ ReorderBufferExecuteInvalidations(rb, txn);
+ ReorderBufferCleanupTXN(rb, txn);
+}
+
/*
* Abort a transaction that possibly has previous changes. Needs to be first
* called for subtransactions and then for the toplevel xid.
@@ -1711,7 +1880,7 @@ ReorderBufferAbortOld(ReorderBuffer *rb, TransactionId oldestRunningXid)
if (rbtxn_is_serialized(txn) && txn->final_lsn == 0)
{
ReorderBufferChange *last =
- dlist_tail_element(ReorderBufferChange, node, &txn->changes);
+ dlist_tail_element(ReorderBufferChange, node, &txn->changes);
txn->final_lsn = last->lsn;
}
@@ -2625,9 +2794,9 @@ ReorderBufferSerializedPath(char *path, ReplicationSlot *slot, TransactionId xid
XLogSegNoOffsetToRecPtr(segno, 0, recptr, wal_segment_size);
snprintf(path, MAXPGPATH, "pg_replslot/%s/xid-%u-lsn-%X-%X.snap",
- NameStr(MyReplicationSlot->data.name),
- xid,
- (uint32) (recptr >> 32), (uint32) recptr);
+ NameStr(MyReplicationSlot->data.name),
+ xid,
+ (uint32) (recptr >> 32), (uint32) recptr);
}
/*
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 63b14367f0..fbe18dff56 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -46,11 +46,11 @@ typedef struct LogicalDecodingContext
struct SnapBuild *snapshot_builder;
/*
- * Marks the logical decoding context as fast forward decoding one.
- * Such a context does not have plugin loaded so most of the the following
+ * Marks the logical decoding context as fast forward decoding one. Such a
+ * context does not have a plugin loaded so most of the following
* properties are unused.
*/
- bool fast_forward;
+ bool fast_forward;
OutputPluginCallbacks callbacks;
OutputPluginOptions options;
@@ -89,6 +89,11 @@ typedef struct LogicalDecodingContext
bool prepared_write;
XLogRecPtr write_location;
TransactionId write_xid;
+
+ /*
+ * Capabilities of the output plugin.
+ */
+ bool enable_twophase;
} LogicalDecodingContext;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 82875d6b3d..5254210a46 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -68,6 +68,46 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+/*
+ * Called for an implicit ABORT of a transaction.
+ */
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+ /*
+ * Called before decoding of a PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to the prepare and
+ * commit_prepared/abort_prepared callbacks, or wait until COMMIT PREPARED
+ * and be sent as a regular transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
/*
* Called for the generic logical decoding messages.
*/
@@ -99,7 +139,12 @@ typedef struct OutputPluginCallbacks
LogicalDecodeBeginCB begin_cb;
LogicalDecodeChangeCB change_cb;
LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeAbortCB abort_cb;
LogicalDecodeMessageCB message_cb;
+ LogicalDecodeFilterPrepareCB filter_prepare_cb;
+ LogicalDecodePrepareCB prepare_cb;
+ LogicalDecodeCommitPreparedCB commit_prepared_cb;
+ LogicalDecodeAbortPreparedCB abort_prepared_cb;
LogicalDecodeFilterByOriginCB filter_by_origin_cb;
LogicalDecodeShutdownCB shutdown_cb;
} OutputPluginCallbacks;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 385bb486bb..1dedf5cc42 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
#define REORDERBUFFER_H
#include "access/htup_details.h"
+#include "access/twophase.h"
#include "lib/ilist.h"
#include "storage/sinval.h"
#include "utils/hsearch.h"
@@ -179,6 +180,9 @@ typedef struct ReorderBufferTXN
*/
TransactionId xid;
+ /* In case of 2PC we need to pass GID to output plugin */
+ char gid[GIDSIZE];
+
/*
* LSN of the first data carrying, WAL record with knowledge about this
* xid. This is allowed to *not* be first record adorned with this xid, if
@@ -312,6 +316,37 @@ typedef void (*ReorderBufferCommitCB) (
ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+/* abort callback signature */
+typedef void (*ReorderBufferAbortCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+typedef bool (*ReorderBufferFilterPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/* abort prepared callback signature */
+typedef void (*ReorderBufferAbortPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+
/* message callback signature */
typedef void (*ReorderBufferMessageCB) (
ReorderBuffer *rb,
@@ -347,6 +382,11 @@ struct ReorderBuffer
ReorderBufferBeginCB begin;
ReorderBufferApplyChangeCB apply_change;
ReorderBufferCommitCB commit;
+ ReorderBufferAbortCB abort;
+ ReorderBufferFilterPrepareCB filter_prepare;
+ ReorderBufferPrepareCB prepare;
+ ReorderBufferCommitPreparedCB commit_prepared;
+ ReorderBufferAbortPreparedCB abort_prepared;
ReorderBufferMessageCB message;
/*
@@ -394,6 +434,11 @@ void ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot
void ReorderBufferCommit(ReorderBuffer *, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit);
void ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
void ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
@@ -417,6 +462,15 @@ void ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
bool ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
bool ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
+bool ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+ const char *gid);
+bool ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+ const char *gid);
+void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid);
ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
void ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
--
2.15.1 (Apple Git-101)
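A note on the filter_prepare contract declared in output_plugin.h above: the decision it feeds can be sketched in a few lines of standalone C. This is only an illustrative model, not code from the patch. The Toy* types and toy_* functions are hypothetical stand-ins for LogicalDecodingContext and the real callbacks, and the "local_" GID policy is invented for the example. The two early returns mirror what the wrapper in logical.c is described as doing: everything is filtered when two-phase decoding is disabled, and nothing is filtered when the plugin supplies no callback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for LogicalDecodeFilterPrepareCB (ctx and txn omitted). */
typedef bool (*toy_filter_prepare_cb) (unsigned int xid, const char *gid);

/* Hypothetical stand-in for the parts of LogicalDecodingContext used here. */
typedef struct ToyDecodingContext
{
	bool		enable_twophase;	/* plugin supports decoding at PREPARE */
	toy_filter_prepare_cb filter_prepare_cb;	/* optional, may be NULL */
} ToyDecodingContext;

/*
 * Returns true when the transaction should NOT be decoded at PREPARE time,
 * i.e. it will be sent as a regular transaction at COMMIT PREPARED.
 */
static bool
toy_filter_prepare(const ToyDecodingContext *ctx, unsigned int xid,
				   const char *gid)
{
	assert(ctx != NULL);
	assert(gid != NULL);

	/* Two-phase decoding disabled: every prepared transaction is filtered. */
	if (!ctx->enable_twophase)
		return true;

	/* The callback is optional: without it, nothing is filtered. */
	if (ctx->filter_prepare_cb == NULL)
		return false;

	return ctx->filter_prepare_cb(xid, gid);
}

/*
 * Example policy (an assumption for illustration): skip GIDs that start with
 * "local_". The answer depends only on the GID, so it is the same every time
 * the callback is invoked for a given xid/gid pair.
 */
static bool
toy_skip_local_gids(unsigned int xid, const char *gid)
{
	(void) xid;
	return strncmp(gid, "local_", 6) == 0;
}
```

With toy_skip_local_gids installed and enable_twophase set, a GID such as "local_tx" is deferred to COMMIT PREPARED while "global_tx" is decoded at PREPARE. Keeping the policy a pure function of xid and gid is one simple way to meet the requirement that the callback give the same answer on every call.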
0004-pgoutput-output-plugin-support-for-logical-decoding-.0304.patch (application/octet-stream)
From 2d7446dfdf01be6917d75e0482daf20561154dcb Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Tue, 3 Apr 2018 16:00:39 +0530
Subject: [PATCH 4/6] pgoutput output plugin support for logical decoding of
2PC.
Includes documentation changes and test cases.
---
doc/src/sgml/logicaldecoding.sgml | 121 +++++++++++++++++-
src/backend/access/transam/twophase.c | 38 +++++-
src/backend/replication/logical/logical.c | 11 +-
src/backend/replication/logical/proto.c | 90 ++++++++++++-
src/backend/replication/logical/reorderbuffer.c | 2 +
src/backend/replication/logical/worker.c | 147 ++++++++++++++++++++-
src/backend/replication/pgoutput/pgoutput.c | 72 ++++++++++-
src/include/access/twophase.h | 1 +
src/include/replication/logicalproto.h | 39 +++++-
src/test/subscription/t/010_twophase.pl | 163 ++++++++++++++++++++++++
10 files changed, 669 insertions(+), 15 deletions(-)
create mode 100644 src/test/subscription/t/010_twophase.pl
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index f6b14dccb0..344bc6bc1c 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -384,7 +384,12 @@ typedef struct OutputPluginCallbacks
LogicalDecodeBeginCB begin_cb;
LogicalDecodeChangeCB change_cb;
LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeAbortCB abort_cb;
LogicalDecodeMessageCB message_cb;
+ LogicalDecodeFilterPrepareCB filter_prepare_cb;
+ LogicalDecodePrepareCB prepare_cb;
+ LogicalDecodeCommitPreparedCB commit_prepared_cb;
+ LogicalDecodeAbortPreparedCB abort_prepared_cb;
LogicalDecodeFilterByOriginCB filter_by_origin_cb;
LogicalDecodeShutdownCB shutdown_cb;
} OutputPluginCallbacks;
@@ -454,7 +459,13 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
never get
decoded. Successful savepoints are
folded into the transaction containing them in the order they were
- executed within that transaction.
+ executed within that transaction. A transaction that is prepared for
+ a two-phase commit using <command>PREPARE TRANSACTION</command> will
+ also be decoded if the output plugin callbacks needed for decoding
+ it are provided. It is possible that the current transaction which
+ is being decoded is aborted concurrently via a <command>ROLLBACK PREPARED</command>
+ command. In that case, the logical decoding of this transaction will
+ be aborted too.
</para>
<note>
@@ -555,6 +566,74 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
</para>
</sect3>
+ <sect3 id="logicaldecoding-output-plugin-prepare">
+ <title>Transaction Prepare Callback</title>
+
+ <para>
+ The optional <function>prepare_cb</function> callback is called whenever
+ a transaction which is prepared for two-phase commit has been
+ decoded. The <function>change_cb</function> callbacks for all modified
+ rows will have been called before this, if there have been any modified
+ rows.
+<programlisting>
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+ <title>Commit Prepared Transaction Callback</title>
+
+ <para>
+ The optional <function>commit_prepared_cb</function> callback is called whenever
+ a commit prepared transaction has been decoded. The <parameter>gid</parameter> field,
+ which is part of the <parameter>txn</parameter> parameter, can be used in this
+ callback.
+<programlisting>
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-abort-prepared">
+ <title>Rollback Prepared Transaction Callback</title>
+
+ <para>
+ The optional <function>abort_prepared_cb</function> callback is called whenever
+ a rollback prepared transaction has been decoded. The <parameter>gid</parameter> field,
+ which is part of the <parameter>txn</parameter> parameter, can be used in this
+ callback.
+<programlisting>
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-abort">
+ <title>Transaction Abort Callback</title>
+
+ <para>
+ The required <function>abort_cb</function> callback is called whenever
+ a transaction abort has to be initiated. This can happen if we are
+ decoding a transaction that has been prepared for two-phase commit and
+ a concurrent rollback happens while we are decoding it. In such cases it
+ might make sense to check whether the rollback has already happened
+ before we start looking at the changes, so that decoding of such
+ transactions can be avoided completely.
+<programlisting>
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
<sect3 id="logicaldecoding-output-plugin-change">
<title>Change Callback</title>
@@ -564,7 +643,12 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
an <command>INSERT</command>, <command>UPDATE</command>,
or <command>DELETE</command>. Even if the original command modified
several rows at once the callback will be called individually for each
- row.
+ row. The <function>change_cb</function> callback may access system or
+ user catalog tables to aid in the process of outputting the row
+ modification details. When decoding a prepared (but not yet
+ committed) transaction, or any other uncommitted transaction, this
+ change callback is guaranteed sane access to catalog tables even if
+ another backend concurrently rolls back that very same transaction.
<programlisting>
typedef void (*LogicalDecodeChangeCB) (struct LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
@@ -619,6 +703,39 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
</para>
</sect3>
+ <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+ <title>Prepare Filter Callback</title>
+
+ <para>
+ The optional <function>filter_prepare_cb</function> callback
+ is called to determine whether data that is part of the current
+ two-phase commit transaction should be decoded now, at the
+ prepare stage, or later as a regular one-phase transaction at
+ <command>COMMIT PREPARED</command> time. To signal that
+ decoding at prepare time should be skipped, return <literal>true</literal>;
+ <literal>false</literal> otherwise. When the callback is not
+ defined, <literal>false</literal> is assumed (i.e. nothing is
+ filtered).
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+</programlisting>
+ The <parameter>ctx</parameter> parameter has the same contents
+ as for the other callbacks. The <parameter>txn</parameter> parameter
+ contains meta information about the transaction. The <parameter>xid</parameter>
+ contains the XID because <parameter>txn</parameter> can be NULL in some cases.
+ The <parameter>gid</parameter> is the identifier that later identifies this
+ transaction for <command>COMMIT PREPARED</command> or <command>ROLLBACK PREPARED</command>.
+ </para>
+ <para>
+ The callback has to provide the same static answer for a given combination of
+ <parameter>xid</parameter> and <parameter>gid</parameter> every time it is
+ called.
+ </para>
+ </sect3>
+
<sect3 id="logicaldecoding-output-plugin-message">
<title>Generic Message Callback</title>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index f3091af385..3f9b524cf4 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -549,6 +549,37 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held)
ProcArrayAdd(&ProcGlobal->allProcs[gxact->pgprocno]);
}
+/*
+ * LookupGXact
+ * Check if the prepared transaction with the given GID is around
+ */
+bool
+LookupGXact(const char *gid)
+{
+ int i;
+
+ LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
+
+ for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+ {
+ GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+ /* Ignore not-yet-valid GIDs */
+ if (!gxact->valid)
+ continue;
+ if (strcmp(gxact->gid, gid) != 0)
+ continue;
+
+ LWLockRelease(TwoPhaseStateLock);
+
+ return true;
+ }
+
+ LWLockRelease(TwoPhaseStateLock);
+
+ return false;
+}
+
/*
* LockGXact
* Locate the prepared transaction and mark it busy for COMMIT or PREPARE.
@@ -1506,9 +1537,12 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
gid);
ProcArrayRemove(proc, latestXid);
+
/*
- * Tell logical decoding backends interested in this XID
- * that this is going away
+ * Coordinate with logical decoding backends that may be already
+ * decoding this prepared transaction. When aborting a transaction,
+ * we need to wait for all of them to leave the decoding group. If
+ * committing, we simply remove all members from the group.
*/
LogicalDecodeRemoveTransaction(proc, isCommit);
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index a97a7c838c..65382c2556 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -901,11 +901,20 @@ filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
bool ret;
/*
- * If twophase is not enabled, skip decoding at PREPARE time
+ * Skip if decoding of twophase at PREPARE time is not enabled. In that
+ * case all twophase transactions are considered filtered out and will be
+ * applied as regular transactions at COMMIT PREPARED.
*/
if (!ctx->enable_twophase)
return true;
+ /*
+ * The filter_prepare callback is optional. When not supplied, all
+ * prepared transactions should go through.
+ */
+ if (!ctx->callbacks.filter_prepare_cb)
+ return false;
+
/* Push callback + info on the error context stack */
state.ctx = ctx;
state.callback_name = "filter_prepare";
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 948343e4ae..ac6aebde0a 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -70,12 +70,17 @@ logicalrep_read_begin(StringInfo in, LogicalRepBeginData *begin_data)
*/
void
logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
- XLogRecPtr commit_lsn)
+ XLogRecPtr commit_lsn, bool is_commit)
{
uint8 flags = 0;
pq_sendbyte(out, 'C'); /* sending COMMIT */
+ if (is_commit)
+ flags |= LOGICALREP_IS_COMMIT;
+ else
+ flags |= LOGICALREP_IS_ABORT;
+
/* send the flags field (unused for now) */
pq_sendbyte(out, flags);
@@ -86,16 +91,20 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
}
/*
- * Read transaction COMMIT from the stream.
+ * Read transaction COMMIT|ABORT from the stream.
*/
void
logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
{
- /* read flags (unused for now) */
+ /* read flags */
uint8 flags = pq_getmsgbyte(in);
- if (flags != 0)
- elog(ERROR, "unrecognized flags %u in commit message", flags);
+ if (!CommitFlagsAreValid(flags))
+ elog(ERROR, "unrecognized flags %u in commit|abort message",
+ flags);
+
+ /* the flag is either commit or abort */
+ commit_data->is_commit = (flags == LOGICALREP_IS_COMMIT);
/* read fields */
commit_data->commit_lsn = pq_getmsgint64(in);
@@ -103,6 +112,77 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
commit_data->committime = pq_getmsgint64(in);
}
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ uint8 flags = 0;
+
+ pq_sendbyte(out, 'P'); /* sending PREPARE protocol */
+
+ /*
+ * This should only ever happen for 2PC transactions, in which case we
+ * expect to have a non-empty GID.
+ */
+ Assert(rbtxn_prepared(txn));
+ Assert(strlen(txn->gid) > 0);
+
+ /*
+ * Flags are determined from the state of the transaction. We know we
+ * always get PREPARE first and then [COMMIT|ROLLBACK] PREPARED, so if
+ * it's already marked as committed then it has to be COMMIT PREPARED (and
+ * likewise for abort / ROLLBACK PREPARED).
+ */
+ if (rbtxn_commit_prepared(txn))
+ flags |= LOGICALREP_IS_COMMIT_PREPARED;
+ else if (rbtxn_rollback_prepared(txn))
+ flags |= LOGICALREP_IS_ROLLBACK_PREPARED;
+ else
+ flags |= LOGICALREP_IS_PREPARE;
+
+ /* Make sure exactly one of the expected flags is set. */
+ if (!PrepareFlagsAreValid(flags))
+ elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+ /* send the flags field */
+ pq_sendbyte(out, flags);
+
+ /* send fields */
+ pq_sendint64(out, prepare_lsn);
+ pq_sendint64(out, txn->end_lsn);
+ pq_sendint64(out, txn->commit_time);
+
+ /* send gid */
+ pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPrepareData * prepare_data)
+{
+ /* read flags */
+ uint8 flags = pq_getmsgbyte(in);
+
+ if (!PrepareFlagsAreValid(flags))
+ elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+ /* set the action (reuse the constants used for the flags) */
+ prepare_data->prepare_type = flags;
+
+ /* read fields */
+ prepare_data->prepare_lsn = pq_getmsgint64(in);
+ prepare_data->end_lsn = pq_getmsgint64(in);
+ prepare_data->preparetime = pq_getmsgint64(in);
+
+ /* read gid (bounded copy into the pre-allocated GIDSIZE buffer) */
+ strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
/*
* Write ORIGIN to the output stream.
*/
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index fdce0249f1..2ba6c7ebce 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1794,6 +1794,8 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
txn->commit_time = commit_time;
txn->origin_id = origin_id;
txn->origin_lsn = origin_lsn;
+ /* this txn is obviously prepared */
+ txn->txn_flags |= RBTXN_PREPARE;
strcpy(txn->gid, gid);
if (is_commit)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index fdace7eea2..56d3239491 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -486,7 +486,11 @@ apply_handle_commit(StringInfo s)
replorigin_session_origin_lsn = commit_data.end_lsn;
replorigin_session_origin_timestamp = commit_data.committime;
- CommitTransactionCommand();
+ if (commit_data.is_commit)
+ CommitTransactionCommand();
+ else
+ AbortCurrentTransaction();
+
pgstat_report_stat(false);
store_flush_position(commit_data.end_lsn);
@@ -506,6 +510,141 @@ apply_handle_commit(StringInfo s)
pgstat_report_activity(STATE_IDLE, NULL);
}
+static void
+apply_handle_prepare_txn(LogicalRepPrepareData * prepare_data)
+{
+ Assert(prepare_data->prepare_lsn == remote_final_lsn);
+
+ /* The synchronization worker runs in a single transaction. */
+ if (IsTransactionState() && !am_tablesync_worker())
+ {
+ /* End the earlier transaction and start a new one */
+ BeginTransactionBlock();
+ CommitTransactionCommand();
+ StartTransactionCommand();
+
+ /*
+ * Update origin state so we can restart streaming from correct
+ * position in case of crash.
+ */
+ replorigin_session_origin_lsn = prepare_data->end_lsn;
+ replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+ PrepareTransactionBlock(prepare_data->gid);
+ CommitTransactionCommand();
+ pgstat_report_stat(false);
+
+ store_flush_position(prepare_data->end_lsn);
+ }
+ else
+ {
+ /* Process any invalidation messages that might have accumulated. */
+ AcceptInvalidationMessages();
+ maybe_reread_subscription();
+ }
+
+ in_remote_transaction = false;
+
+ /* Process any tables that are being synchronized in parallel. */
+ process_syncing_tables(prepare_data->end_lsn);
+
+ pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+static void
+apply_handle_commit_prepared_txn(LogicalRepPrepareData * prepare_data)
+{
+ /* there is no transaction when COMMIT PREPARED is called */
+ ensure_transaction();
+
+ /*
+ * Update origin state so we can restart streaming from correct position
+ * in case of crash.
+ */
+ replorigin_session_origin_lsn = prepare_data->end_lsn;
+ replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+ FinishPreparedTransaction(prepare_data->gid, true);
+ CommitTransactionCommand();
+ pgstat_report_stat(false);
+
+ store_flush_position(prepare_data->end_lsn);
+ in_remote_transaction = false;
+
+ /* Process any tables that are being synchronized in parallel. */
+ process_syncing_tables(prepare_data->end_lsn);
+
+ pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+static void
+apply_handle_rollback_prepared_txn(LogicalRepPrepareData * prepare_data)
+{
+ /*
+ * Update origin state so we can restart streaming from correct position
+ * in case of crash.
+ */
+ replorigin_session_origin_lsn = prepare_data->end_lsn;
+ replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+ /*
+ * It is possible that a prepared transaction got aborted on the
+ * publisher while it was being decoded. In that case decoding stops and
+ * the transaction is aborted immediately, but the ROLLBACK PREPARED
+ * message still reaches the subscriber. It is therefore OK for the gid
+ * to be missing here.
+ */
+ if (LookupGXact(prepare_data->gid))
+ {
+ /* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+ ensure_transaction();
+ FinishPreparedTransaction(prepare_data->gid, false);
+ CommitTransactionCommand();
+ }
+
+ pgstat_report_stat(false);
+
+ store_flush_position(prepare_data->end_lsn);
+ in_remote_transaction = false;
+
+ /* Process any tables that are being synchronized in parallel. */
+ process_syncing_tables(prepare_data->end_lsn);
+
+ pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+ LogicalRepPrepareData prepare_data;
+
+ logicalrep_read_prepare(s, &prepare_data);
+
+ switch (prepare_data.prepare_type)
+ {
+ case LOGICALREP_IS_PREPARE:
+ apply_handle_prepare_txn(&prepare_data);
+ break;
+
+ case LOGICALREP_IS_COMMIT_PREPARED:
+ apply_handle_commit_prepared_txn(&prepare_data);
+ break;
+
+ case LOGICALREP_IS_ROLLBACK_PREPARED:
+ apply_handle_rollback_prepared_txn(&prepare_data);
+ break;
+
+ default:
+ ereport(ERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg("unexpected type of prepare message: %d",
+ prepare_data.prepare_type)));
+ }
+}
+
/*
* Handle ORIGIN message.
*
@@ -903,10 +1042,14 @@ apply_dispatch(StringInfo s)
case 'B':
apply_handle_begin(s);
break;
- /* COMMIT */
+ /* COMMIT/ABORT */
case 'C':
apply_handle_commit(s);
break;
+ /* PREPARE and [COMMIT|ROLLBACK] PREPARED */
+ case 'P':
+ apply_handle_prepare(s);
+ break;
/* INSERT */
case 'I':
apply_handle_insert(s);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index aa9cf5b54e..4f83978c47 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -36,11 +36,19 @@ static void pgoutput_begin_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn);
static void pgoutput_commit_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_abort_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr abort_lsn);
static void pgoutput_change(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, Relation rel,
ReorderBufferChange *change);
static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
RepOriginId origin_id);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_abort_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
static bool publications_valid;
@@ -78,6 +86,12 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
cb->begin_cb = pgoutput_begin_txn;
cb->change_cb = pgoutput_change;
cb->commit_cb = pgoutput_commit_txn;
+ cb->abort_cb = pgoutput_abort_txn;
+
+ cb->prepare_cb = pgoutput_prepare_txn;
+ cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+ cb->abort_prepared_cb = pgoutput_abort_prepared_txn;
+
cb->filter_by_origin_cb = pgoutput_origin_filter;
cb->shutdown_cb = pgoutput_shutdown;
}
@@ -246,7 +260,63 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
OutputPluginUpdateProgress(ctx);
OutputPluginPrepareWrite(ctx, true);
- logicalrep_write_commit(ctx->out, txn, commit_lsn);
+ logicalrep_write_commit(ctx->out, txn, commit_lsn, true);
+ OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ABORT callback
+ */
+static void
+pgoutput_abort_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ OutputPluginUpdateProgress(ctx);
+
+ OutputPluginPrepareWrite(ctx, true);
+ logicalrep_write_commit(ctx->out, txn, abort_lsn, false);
+ OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ OutputPluginUpdateProgress(ctx);
+
+ OutputPluginPrepareWrite(ctx, true);
+ logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+ OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ OutputPluginUpdateProgress(ctx);
+
+ OutputPluginPrepareWrite(ctx, true);
+ logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+ OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ OutputPluginUpdateProgress(ctx);
+
+ OutputPluginPrepareWrite(ctx, true);
+ logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
OutputPluginWrite(ctx, true);
}
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index f05cde202f..5a4da6efab 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -44,6 +44,7 @@ extern GlobalTransaction MarkAsPreparing(TransactionId xid, const char *gid,
extern void StartPrepare(GlobalTransaction gxact);
extern void EndPrepare(GlobalTransaction gxact);
extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
+extern bool LookupGXact(const char *gid);
extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
int *nxids_p);
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 116f16f42d..11e3d67223 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -25,7 +25,7 @@
* connect time.
*/
#define LOGICALREP_PROTO_MIN_VERSION_NUM 1
-#define LOGICALREP_PROTO_VERSION_NUM 1
+#define LOGICALREP_PROTO_VERSION_NUM 2
/* Tuple coming via logical replication. */
typedef struct LogicalRepTupleData
@@ -68,20 +68,55 @@ typedef struct LogicalRepBeginData
TransactionId xid;
} LogicalRepBeginData;
+/* Commit (and abort) information */
typedef struct LogicalRepCommitData
{
+ bool is_commit;
XLogRecPtr commit_lsn;
XLogRecPtr end_lsn;
TimestampTz committime;
} LogicalRepCommitData;
+/* types of the commit protocol message */
+#define LOGICALREP_IS_COMMIT 0x01
+#define LOGICALREP_IS_ABORT 0x02
+
+/* commit message is COMMIT or ABORT, and there is nothing else */
+#define CommitFlagsAreValid(flags) \
+ (((flags) == LOGICALREP_IS_COMMIT) || ((flags) == LOGICALREP_IS_ABORT))
+
+/* Prepare protocol information */
+typedef struct LogicalRepPrepareData
+{
+ uint8 prepare_type;
+ XLogRecPtr prepare_lsn;
+ XLogRecPtr end_lsn;
+ TimestampTz preparetime;
+ char gid[GIDSIZE];
+} LogicalRepPrepareData;
+
+/* types of the prepare protocol message */
+#define LOGICALREP_IS_PREPARE 0x01
+#define LOGICALREP_IS_COMMIT_PREPARED 0x02
+#define LOGICALREP_IS_ROLLBACK_PREPARED 0x04
+
+/* prepare can be exactly one of PREPARE, [COMMIT|ROLLBACK] PREPARED */
+#define PrepareFlagsAreValid(flags) \
+ (((flags) == LOGICALREP_IS_PREPARE) || \
+ ((flags) == LOGICALREP_IS_COMMIT_PREPARED) || \
+ ((flags) == LOGICALREP_IS_ROLLBACK_PREPARED))
+
extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
extern void logicalrep_read_begin(StringInfo in,
LogicalRepBeginData *begin_data);
extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
- XLogRecPtr commit_lsn);
+ XLogRecPtr commit_lsn, bool is_commit);
extern void logicalrep_read_commit(StringInfo in,
LogicalRepCommitData *commit_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+ LogicalRepPrepareData * prepare_data);
extern void logicalrep_write_origin(StringInfo out, const char *origin,
XLogRecPtr origin_lsn);
extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
diff --git a/src/test/subscription/t/010_twophase.pl b/src/test/subscription/t/010_twophase.pl
new file mode 100644
index 0000000000..c7f373df93
--- /dev/null
+++ b/src/test/subscription/t/010_twophase.pl
@@ -0,0 +1,163 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 12;
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf(
+ 'postgresql.conf', qq(
+ max_prepared_transactions = 10
+ ));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf(
+ 'postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres', "CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO tab_full SELECT generate_series(1,10)");
+$node_publisher->safe_psql('postgres', "CREATE TABLE tab_full2 (x text)");
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO tab_full2 VALUES ('a'), ('b'), ('b')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_full2 (x text)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+"ALTER PUBLICATION tap_pub ADD TABLE tab_full, tab_full2"
+);
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+ or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+ or die "Timed out while waiting for subscriber to synchronize data";
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres',
+ "BEGIN;INSERT INTO tab_full VALUES (11);PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+ or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+ is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+ "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+ or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+ is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+$result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+ is($result, qq(0), 'transaction is committed on subscriber');
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres',
+ "BEGIN;INSERT INTO tab_full VALUES (12);PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+ or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+ is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+ "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+ or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+ is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+ is($result, qq(0), 'transaction is aborted on subscriber');
+
+# Check that commit prepared is decoded properly on crash restart
+$node_publisher->safe_psql('postgres', "
+ BEGIN;
+ INSERT INTO tab_full VALUES (12);
+ INSERT INTO tab_full VALUES (13);
+ PREPARE TRANSACTION 'test_prepared_tab';");
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+ or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+# TODO add test cases involving DDL. This can be added after we add functionality
+# to replicate DDL changes to subscriber.
+
+# check all the cleanup
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+ "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+ "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+ "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+ 'check subscription relation status was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+ "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+ "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
--
2.15.1 (Apple Git-101)
Attachment: 0005-Teach-test_decoding-plugin-to-work-with-2PC.0304.patch (application/octet-stream)
From 8800c677457c56bef4101717495866811191ab1c Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Tue, 3 Apr 2018 16:01:48 +0530
Subject: [PATCH 5/6] Teach test_decoding plugin to work with 2PC
Includes a new option "enable-twophase". Depending on this option's
value, PREPARE TRANSACTION is either decoded at prepare time or the
transaction is decoded as a single-phase commit later, at COMMIT PREPARED.
---
contrib/test_decoding/expected/prepared.out | 257 +++++++++++++++++++++++++---
contrib/test_decoding/sql/prepared.sql | 84 ++++++++-
contrib/test_decoding/test_decoding.c | 137 +++++++++++++++
3 files changed, 451 insertions(+), 27 deletions(-)
diff --git a/contrib/test_decoding/expected/prepared.out b/contrib/test_decoding/expected/prepared.out
index 46e915d4ff..5df7b7ff20 100644
--- a/contrib/test_decoding/expected/prepared.out
+++ b/contrib/test_decoding/expected/prepared.out
@@ -6,19 +6,82 @@ SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_d
init
(1 row)
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc', 'test_decoding');
+ ?column?
+----------
+ init
+(1 row)
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
-- test simple successful use of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (1);
PREPARE TRANSACTION 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+------
+(0 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ PREPARE TRANSACTION 'test_prepared#1'
+(3 rows)
+
COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ COMMIT
+(3 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
INSERT INTO test_prepared1 VALUES (2);
-- test abort of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (3);
PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+(3 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(6 rows)
+
ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+------
+(0 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
INSERT INTO test_prepared1 VALUES (4);
-- test prepared xact containing ddl
BEGIN;
@@ -26,45 +89,193 @@ INSERT INTO test_prepared1 VALUES (5);
ALTER TABLE test_prepared1 ADD COLUMN data text;
INSERT INTO test_prepared1 VALUES (6, 'frakbar');
PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
-INSERT INTO test_prepared2 VALUES (7);
-COMMIT PREPARED 'test_prepared#3';
--- make sure stuff still works
-INSERT INTO test_prepared1 VALUES (8);
-INSERT INTO test_prepared2 VALUES (9);
--- cleanup
-DROP TABLE test_prepared1;
-DROP TABLE test_prepared2;
--- show results
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'test_prepared1'::regclass;
+ relation | locktype | mode
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4
+ COMMIT
+(3 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
data
-------------------------------------------------------------------------
BEGIN
- table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:4
COMMIT
BEGIN
- table public.test_prepared1: INSERT: id[integer]:2
- COMMIT
+ table public.test_prepared1: INSERT: id[integer]:5
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(7 rows)
+
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------
BEGIN
- table public.test_prepared1: INSERT: id[integer]:4
+ table public.test_prepared2: INSERT: id[integer]:7
COMMIT
+(3 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+----------------------------------------------------
BEGIN
table public.test_prepared2: INSERT: id[integer]:7
COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+-------------------------------------------------------------------------
BEGIN
table public.test_prepared1: INSERT: id[integer]:5
table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
COMMIT
+(4 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#3'
+(1 row)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (8);
+INSERT INTO test_prepared2 VALUES (9);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+--------------------------------------------------------------------
BEGIN
table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
COMMIT
BEGIN
table public.test_prepared2: INSERT: id[integer]:9
COMMIT
-(22 rows)
+(6 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:9
+ COMMIT
+(6 rows)
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'pg_class'::regclass;
+ relation | locktype | mode
+----------+----------+------
+(0 rows)
+
+-- Shouldn't see anything with 2pc decoding off
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+-----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:12
+ COMMIT
+(3 rows)
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:12
+ PREPARE TRANSACTION 'test_prepared_lock2'
+ COMMIT PREPARED 'test_prepared_lock2'
+(8 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- Both will work normally after we commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ COMMIT
+(4 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+--------------------------------------
+ COMMIT PREPARED 'test_prepared_lock'
+(1 row)
+
+-- test savepoints
+BEGIN;
+SAVEPOINT test_savepoint;
+CREATE TABLE test_prepared_savepoint (a int);
+PREPARE TRANSACTION 'test_prepared_savepoint';
+COMMIT PREPARED 'test_prepared_savepoint';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+-------------------------------------------
+ COMMIT PREPARED 'test_prepared_savepoint'
+(1 row)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+------
+(0 rows)
+
+-- cleanup
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+------
+(0 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+------
+(0 rows)
SELECT pg_drop_replication_slot('regression_slot');
pg_drop_replication_slot
@@ -72,3 +283,9 @@ SELECT pg_drop_replication_slot('regression_slot');
(1 row)
+SELECT pg_drop_replication_slot('regression_slot_2pc');
+ pg_drop_replication_slot
+--------------------------
+
+(1 row)
+
diff --git a/contrib/test_decoding/sql/prepared.sql b/contrib/test_decoding/sql/prepared.sql
index e72639767e..e8eb8ad8d6 100644
--- a/contrib/test_decoding/sql/prepared.sql
+++ b/contrib/test_decoding/sql/prepared.sql
@@ -1,22 +1,31 @@
-- predictability
SET synchronous_commit = on;
SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc', 'test_decoding');
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
-- test simple successful use of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (1);
PREPARE TRANSACTION 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
INSERT INTO test_prepared1 VALUES (2);
-- test abort of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (3);
PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
INSERT INTO test_prepared1 VALUES (4);
@@ -27,24 +36,85 @@ ALTER TABLE test_prepared1 ADD COLUMN data text;
INSERT INTO test_prepared1 VALUES (6, 'frakbar');
PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'test_prepared1'::regclass;
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
-- make sure stuff still works
INSERT INTO test_prepared1 VALUES (8);
INSERT INTO test_prepared2 VALUES (9);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'pg_class'::regclass;
+
+-- Shouldn't see anything with 2pc decoding off
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+RESET statement_timeout;
+
+COMMIT PREPARED 'test_prepared_lock';
+
+-- Both will work normally after we commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+
+-- test savepoints
+BEGIN;
+SAVEPOINT test_savepoint;
+CREATE TABLE test_prepared_savepoint (a int);
+PREPARE TRANSACTION 'test_prepared_savepoint';
+COMMIT PREPARED 'test_prepared_savepoint';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
-- cleanup
DROP TABLE test_prepared1;
DROP TABLE test_prepared2;
--- show results
+-- show results. There should be nothing to show
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+
SELECT pg_drop_replication_slot('regression_slot');
+SELECT pg_drop_replication_slot('regression_slot_2pc');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index a94aeeae29..05b993fd7a 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -36,6 +36,7 @@ typedef struct
bool skip_empty_xacts;
bool xact_wrote_changes;
bool only_local;
+ bool enable_twophase;
} TestDecodingData;
static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -49,6 +50,8 @@ static void pg_output_begin(LogicalDecodingContext *ctx,
bool last_write);
static void pg_decode_commit_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pg_decode_abort_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr abort_lsn);
static void pg_decode_change(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, Relation rel,
ReorderBufferChange *change);
@@ -58,6 +61,18 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr message_lsn,
bool transactional, const char *prefix,
Size sz, const char *message);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
void
_PG_init(void)
@@ -75,9 +90,14 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
cb->begin_cb = pg_decode_begin_txn;
cb->change_cb = pg_decode_change;
cb->commit_cb = pg_decode_commit_txn;
+ cb->abort_cb = pg_decode_abort_txn;
cb->filter_by_origin_cb = pg_decode_filter;
cb->shutdown_cb = pg_decode_shutdown;
cb->message_cb = pg_decode_message;
+ cb->filter_prepare_cb = pg_decode_filter_prepare;
+ cb->prepare_cb = pg_decode_prepare_txn;
+ cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+ cb->abort_prepared_cb = pg_decode_abort_prepared_txn;
}
@@ -97,6 +117,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
data->include_timestamp = false;
data->skip_empty_xacts = false;
data->only_local = false;
+ data->enable_twophase = false;
ctx->output_plugin_private = data;
@@ -178,6 +199,16 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
errmsg("could not parse value \"%s\" for parameter \"%s\"",
strVal(elem->arg), elem->defname)));
}
+ else if (strcmp(elem->defname, "enable-twophase") == 0)
+ {
+ if (elem->arg == NULL)
+ data->enable_twophase = true;
+ else if (!parse_bool(strVal(elem->arg), &data->enable_twophase))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("could not parse value \"%s\" for parameter \"%s\"",
+ strVal(elem->arg), elem->defname)));
+ }
else
{
ereport(ERROR,
@@ -246,6 +277,112 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
OutputPluginWrite(ctx, true);
}
+/* ABORT callback */
+static void
+pg_decode_abort_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+ if (data->include_xids)
+ appendStringInfo(ctx->out, "ABORT %u", txn->xid);
+ else
+ appendStringInfoString(ctx->out, "ABORT");
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* Filter out two-phase transactions if two-phase decoding is not enabled. */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ /* treat all transactions as one-phase */
+ if (!data->enable_twophase)
+ return true;
+
+ return false;
+}
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* ABORT PREPARED callback */
+static void
+pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
static bool
pg_decode_filter(LogicalDecodingContext *ctx,
RepOriginId origin_id)
--
2.15.1 (Apple Git-101)
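The option handling and prepare filter in the patch above can be modelled outside the server. What follows is only a minimal sketch, not the plugin code: SketchOptions, sketch_parse_bool, sketch_set_twophase and sketch_filter_prepare are hypothetical names, and the boolean parser accepts only a small subset of what the server's parse_bool() accepts. It shows the two rules the patch relies on: an "enable-twophase" option given without a value turns the feature on, and the filter callback returns true (meaning "do not decode at PREPARE") whenever the feature is off.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the plugin's private option state. */
typedef struct SketchOptions
{
	bool		enable_twophase;
} SketchOptions;

/*
 * Simplified boolean parser (assumption: far stricter than the server's
 * parse_bool()).  Returns false if the value is not recognised and leaves
 * *result untouched in that case.
 */
static bool
sketch_parse_bool(const char *value, bool *result)
{
	if (strcmp(value, "1") == 0 || strcmp(value, "on") == 0 ||
		strcmp(value, "true") == 0)
	{
		*result = true;
		return true;
	}
	if (strcmp(value, "0") == 0 || strcmp(value, "off") == 0 ||
		strcmp(value, "false") == 0)
	{
		*result = false;
		return true;
	}
	return false;
}

/*
 * Models the "enable-twophase" branch of option parsing: a missing value
 * enables the feature.  Returns false where the plugin would raise an
 * error for an unparsable value.
 */
static bool
sketch_set_twophase(SketchOptions *opts, const char *arg)
{
	assert(opts != NULL);

	if (arg == NULL)
	{
		opts->enable_twophase = true;
		return true;
	}
	return sketch_parse_bool(arg, &opts->enable_twophase);
}

/*
 * Models the prepare filter: true means the prepared transaction is
 * filtered out here and decoded as an ordinary commit later.
 */
static bool
sketch_filter_prepare(const SketchOptions *opts)
{
	assert(opts != NULL);
	return !opts->enable_twophase;
}
```

With the option left at its default, sketch_filter_prepare() returns true, which corresponds to the 'regression_slot' output in the tests above where the changes only appear after COMMIT PREPARED.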
Attachment: 0006-Optional-Additional-test-case-to-demonstrate-decoding-rollbac.0304.patch (application/octet-stream)
From d9f03b87cdf947ec1ea22263e8206e8bf3795de8 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Tue, 3 Apr 2018 16:02:47 +0530
Subject: [PATCH 6/6] Additional test case to demonstrate decoding/rollback
interlocking
Introduce a decode-delay parameter in the test_decoding plugin. Based
on the value provided, the plugin sleeps for that many seconds while
holding the LogicalTransactionLock. A concurrent rollback is fired
off in the meantime, which aborts that transaction.
---
contrib/test_decoding/Makefile | 5 +-
contrib/test_decoding/t/001_twophase.pl | 102 ++++++++++++++++++++++++
contrib/test_decoding/test_decoding.c | 24 ++++++
src/backend/replication/logical/reorderbuffer.c | 5 ++
4 files changed, 135 insertions(+), 1 deletion(-)
create mode 100644 contrib/test_decoding/t/001_twophase.pl
diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 6c18189d9d..79b9622600 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -26,7 +26,7 @@ installcheck:;
# installation, allow to do so, but only if requested explicitly.
installcheck-force: regresscheck-install-force isolationcheck-install-force
-check: regresscheck isolationcheck
+check: regresscheck isolationcheck 2pc-check
submake-regress:
$(MAKE) -C $(top_builddir)/src/test/regress all
@@ -66,3 +66,6 @@ isolationcheck-install-force: all | submake-isolation submake-test_decoding temp
isolationcheck isolationcheck-install-force
temp-install: EXTRA_INSTALL=contrib/test_decoding
+
+2pc-check: temp-install
+ $(prove_check)
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000000..d50e2c9940
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,102 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+ 'postgresql.conf', qq(
+ max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+ "INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+ "SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+$node_logical->safe_psql('postgres',
+ "SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot2', 'test_decoding');");
+
+# This test is specifically for testing concurrent abort while logical decode is
+# ongoing. The decode-delay value makes the decoding of each change sleep for
+# that many seconds. We also hold the LogicalLockTransaction while we sleep.
+# We will fire off a ROLLBACK from another session when this delayed decode is
+# ongoing. Since we are holding the lock from the call above, this ROLLBACK
+# will wait for the logical backends to do a LogicalUnlockTransaction. We will
+# stop decoding immediately post this and the next pg_logical_slot_get_changes call
+# should show only a few records decoded from the entire two phase transaction
+#
+# We use two slots to test multiple decoding backends here
+
+$node_logical->safe_psql('postgres', "
+ BEGIN;
+ INSERT INTO tab VALUES (11);
+ INSERT INTO tab VALUES (12);
+ ALTER TABLE tab ADD COLUMN b INT;
+ PREPARE TRANSACTION 'test_prepared_tab';");
+
+# start decoding the above with decode-delay in the background.
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should only decode 1 INSERT record and should include
+# an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1', 'decode-delay', '3');\" \&");
+
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot2', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1', 'decode-delay', '3');\" \&");
+
+# sleep for a little while (shorter than decode-delay)
+$node_logical->safe_psql('postgres', "select pg_sleep(1)");
+
+# rollback the prepared transaction whose first record is being decoded
+# after sleeping for decode-delay time
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# wait for decoding to stop
+$node_logical->psql('postgres', "select pg_sleep(4)");
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot2', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+
+# check for occurrence of log about waiting backends
+my $output_file = slurp_file($node_logical->logfile());
+my $waiting_str = "Waiting for backends to abort";
+like($output_file, qr/$waiting_str/, "Waiting log found in server log");
+
+# check for occurrence of log about stopping decoding
+my $abort_str = "stopping decoding of test_prepared_tab ";
+like($output_file, qr/$abort_str/, "ABORT found in server log");
+
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+ BEGIN;
+ INSERT INTO tab VALUES (11);
+ INSERT INTO tab VALUES (12);
+ ALTER TABLE tab ADD COLUMN b INT;
+ INSERT INTO tab VALUES (13, 11);
+ PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot2', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit post the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot2', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot2');");
+$node_logical->stop('fast');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 05b993fd7a..db7becdc44 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -37,6 +37,7 @@ typedef struct
bool xact_wrote_changes;
bool only_local;
bool enable_twophase;
+ int decode_delay; /* seconds to sleep after every change record */
} TestDecodingData;
static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -118,6 +119,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
data->skip_empty_xacts = false;
data->only_local = false;
data->enable_twophase = false;
+ data->decode_delay = 0;
ctx->output_plugin_private = data;
@@ -209,6 +211,21 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
errmsg("could not parse value \"%s\" for parameter \"%s\"",
strVal(elem->arg), elem->defname)));
}
+ else if (strcmp(elem->defname, "decode-delay") == 0)
+ {
+ if (elem->arg == NULL)
+ data->decode_delay = 2; /* default to 2 seconds */
+ else
+ data->decode_delay = pg_atoi(strVal(elem->arg),
+ sizeof(int), 0);
+
+ if (data->decode_delay <= 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("Specify positive value for parameter \"%s\","
+ " you specified \"%s\"",
+ elem->defname, strVal(elem->arg))));
+ }
else
{
ereport(ERROR,
@@ -548,6 +565,13 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
}
data->xact_wrote_changes = true;
+ /* if decode_delay is specified, sleep with above lock held */
+ if (data->decode_delay > 0)
+ {
+ elog(LOG, "sleeping for %d seconds", data->decode_delay);
+ pg_usleep(data->decode_delay * 1000000L);
+ }
+
class_form = RelationGetForm(relation);
tupdesc = RelationGetDescr(relation);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 2ba6c7ebce..7c146b8d48 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1383,7 +1383,12 @@ ReorderBufferCommitInternal(ReorderBufferTXN *txn,
/* Lock transaction before catalog access */
if (!LogicalLockTransaction(txn))
+ {
+ elog(LOG, "stopping decoding of %s (%u)",
+ txn->gid[0] != '\0'? txn->gid:"",
+ txn->xid);
break;
+ }
relation = RelationIdGetRelation(reloid);
--
2.15.1 (Apple Git-101)
On 04/03/2018 12:40 PM, Nikhil Sontakke wrote:
Hi,
It's certainly a nice future goal to have it all happen automatically,
but we don't know what the plugin will do.
No, fighting too complicated APIs is not unreasonable. And we've found
an alternative.
PFA, latest patch set.
The LogicalLockTransaction/LogicalUnlockTransaction API implementation
using decode groups now has proper cleanup handling in case there's an
ERROR while holding the logical lock.
Rest of the patches are the same as yesterday.
Unfortunately, this does segfault for me in `make check` almost
immediately. Try
./configure --enable-debug --enable-cassert CFLAGS="-O0 -ggdb3
-DRANDOMIZE_ALLOCATED_MEMORY" && make -s clean && make -s -j4 check
and you should get an assert failure right away. Examples of backtraces
attached, not sure what exactly is the issue.
Also, I get this compiler warning:
proc.c: In function ‘AssignDecodeGroupLeader’:
proc.c:1975:8: warning: variable ‘pid’ set but not used
[-Wunused-but-set-variable]
int pid;
^~~
All of PostgreSQL successfully made. Ready to install.
which suggests we don't really need the pid variable.
Other than this, we would want to have pgoutput support for 2PC
decoding to be made optional? In that case we could add an option to
"CREATE SUBSCRIPTION". This will mean adding a new
Anum_pg_subscription_subenable_twophase attribute to Subscription
struct and related processing. Should we go down this route?
I'd say yes, we need to make it opt-in (assuming we want pgoutput to
support the 2PC decoding at all).
The trouble is that while it may improve replication of two-phase
transactions, it may also require config changes on the subscriber (to
support enough prepared transactions) and furthermore the GID is going
to be copied to the subscriber.
Which means that if the publisher/subscriber (at the instance level) are
already part of the same 2PC transaction, it can't possibly
proceed because the subscriber won't be able to do PREPARE TRANSACTION.
So I think we need a subscription parameter to enable/disable this,
defaulting to 'disabled'.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
On 3 Apr 2018, at 16:56, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
So I think we need a subscription parameter to enable/disable this,
defaulting to 'disabled'.
+1
Also, current value for LOGICALREP_IS_COMMIT is 1, but previous code expected
flags to be zero, so this way logical replication between postgres-10 and
postgres-with-2pc-decoding will be broken. So ISTM it’s better to set
LOGICALREP_IS_COMMIT to zero and change flags checking rules to accommodate that.
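To illustrate the compatibility concern, here is a minimal sketch in C; it is not code from the patch. Only LOGICALREP_IS_COMMIT comes from the discussion. The other flag names, their values, and the helper functions are assumptions made up for illustration. It shows why keeping the plain-commit value at zero lets a receiver that only accepts zero flags keep working for ordinary commits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Value proposed above: a plain commit keeps flags == 0. */
#define LOGICALREP_IS_COMMIT            0x00
/* Hypothetical values for 2PC message kinds (assumptions, not from the patch). */
#define LOGICALREP_IS_PREPARE           0x01
#define LOGICALREP_IS_COMMIT_PREPARED   0x02
#define LOGICALREP_IS_ROLLBACK_PREPARED 0x03

/* An older receiver, as described above, accepts only zero flags. */
bool
old_receiver_accepts(uint8_t flags)
{
    return flags == 0;
}

/* A 2PC-aware receiver could accept every value it knows about. */
bool
new_receiver_accepts(uint8_t flags)
{
    return flags <= LOGICALREP_IS_ROLLBACK_PREPARED;
}

/* Human-readable name for a known flags value. */
const char *
flags_name(uint8_t flags)
{
    assert(new_receiver_accepts(flags));

    switch (flags)
    {
        case LOGICALREP_IS_COMMIT:
            return "COMMIT";
        case LOGICALREP_IS_PREPARE:
            return "PREPARE";
        case LOGICALREP_IS_COMMIT_PREPARED:
            return "COMMIT PREPARED";
        case LOGICALREP_IS_ROLLBACK_PREPARED:
            return "ROLLBACK PREPARED";
        default:
            return "unknown";
    }
}
```

With this numbering, a sender that never emits the 2PC kinds produces exactly the byte an older receiver expects, so only the opt-in feature needs a protocol change.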
--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 04/03/2018 04:07 PM, Stas Kelvich wrote:
On 3 Apr 2018, at 16:56, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
So I think we need a subscription parameter to enable/disable this,
defaulting to 'disabled'.
+1
Also, current value for LOGICALREP_IS_COMMIT is 1, but previous code expected
flags to be zero, so this way logical replication between postgres-10 and
postgres-with-2pc-decoding will be broken. So ISTM it’s better to set
LOGICALREP_IS_COMMIT to zero and change flags checking rules to accommodate that.
Yes, that is a good point actually - we need to test that replication
between PG10 and PG11 works correctly, i.e. that the protocol version is
correctly negotiated, and features are disabled/enabled accordingly etc.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Tomas Vondra wrote:
Yes, that is a good point actually - we need to test that replication
between PG10 and PG11 works correctly, i.e. that the protocol version is
correctly negotiated, and features are disabled/enabled accordingly etc.
Maybe it'd be good to have a buildfarm animal to specifically test for
that?
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 04/03/2018 04:37 PM, Alvaro Herrera wrote:
Tomas Vondra wrote:
Yes, that is a good point actually - we need to test that replication
between PG10 and PG11 works correctly, i.e. that the protocol version is
correctly negotiated, and features are disabled/enabled accordingly etc.
Maybe it'd be good to have a buildfarm animal to specifically test for
that?
Not sure a buildfarm supports running two clusters with different
versions easily?
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
On 04/03/2018 04:37 PM, Alvaro Herrera wrote:
Tomas Vondra wrote:
Yes, that is a good point actually - we need to test that replication
between PG10 and PG11 works correctly, i.e. that the protocol version is
correctly negotiated, and features are disabled/enabled accordingly etc.
Maybe it'd be good to have a buildfarm animal to specifically test for
that?
Not sure a buildfarm supports running two clusters with different
versions easily?
You'd need some specialized buildfarm infrastructure like --- maybe the
same as --- the infrastructure for testing cross-version pg_upgrade.
Andrew could speak to the details better than I.
regards, tom lane
FWIW, a couple of additional comments based on eyeballing the diffs:
1) twophase.c
---------
I think this comment is slightly inaccurate:
/*
* Coordinate with logical decoding backends that may be already
* decoding this prepared transaction. When aborting a transaction,
* we need to wait for all of them to leave the decoding group. If
* committing, we simply remove all members from the group.
*/
Strictly speaking, we're not waiting for the workers to leave the
decoding group, but to set decodeLocked=false. That is, we may proceed
when there still are members, but they must be in unlocked state.
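The distinction can be sketched as follows. This is a simplified illustration, not the patch's code: the struct is a stand-in for the relevant PGPROC field, and the function name is hypothetical. The point is that the abort path checks for locked members, not for an empty group.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for per-backend state; not the real PGPROC. */
typedef struct DecodeMember
{
    bool decodeLocked;      /* true while between lock and unlock calls */
} DecodeMember;

/*
 * Return true when an aborting backend may proceed. Members may still be
 * present in the group, but none of them may be in the locked state.
 */
bool
abort_may_proceed(const DecodeMember *members, size_t nmembers)
{
    assert(members != NULL || nmembers == 0);

    for (size_t i = 0; i < nmembers; i++)
    {
        if (members[i].decodeLocked)
            return false;
    }
    return true;
}
```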
2) reorderbuffer.c
------------------
I've already said it before, I find the "flags" bitmask and rbtxn_*
macros way less readable than individual boolean flags. It was claimed
this was done on Andres' request, but I don't see that in the thread. I
admit it's rather subjective, though.
I see ReorderBuffer only does the lock/unlock around apply_change and
RelationIdGetRelation. That seems insufficient - RelidByRelfilenode can
do heap_open on pg_class, for example. And I guess we need to protect
rb->message too, because who knows what the plugin does in the callback?
Also, we should not allocate gid[GIDSIZE] for every transaction. For
example subxacts never need it, and it seems rather wasteful to allocate
200B when the rest of the struct has only ~100B. This is particularly
problematic considering ReorderBufferTXN is not spilled to disk when
reaching the memory limit. It needs to be allocated ad-hoc only when
actually needed.
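A minimal sketch of the on-demand allocation, assuming a simplified stand-in struct instead of the real ReorderBufferTXN and plain malloc/free instead of PostgreSQL's memory contexts. All names other than GIDSIZE are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define GIDSIZE 200             /* the size mentioned above */

/* Simplified stand-in for ReorderBufferTXN; only gid handling is shown. */
typedef struct SketchTXN
{
    unsigned int xid;
    char        *gid;           /* NULL unless the transaction is prepared */
} SketchTXN;

/* Store a copy of the GID. Returns 0 on success, -1 on failure. */
int
sketch_txn_set_gid(SketchTXN *txn, const char *gid)
{
    size_t  len;
    char   *copy;

    assert(txn != NULL && gid != NULL);

    len = strlen(gid);
    if (len >= GIDSIZE)
        return -1;              /* too long to be a valid GID */

    copy = malloc(len + 1);
    if (copy == NULL)
        return -1;
    memcpy(copy, gid, len + 1);

    free(txn->gid);             /* drop any previous value */
    txn->gid = copy;
    return 0;
}

/* Release the GID, if any, when the transaction is cleaned up. */
void
sketch_txn_free_gid(SketchTXN *txn)
{
    free(txn->gid);
    txn->gid = NULL;
}
```

Subtransactions and non-prepared transactions then carry only a NULL pointer instead of a 200-byte array.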
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi Tomas,
Unfortunately, this does segfault for me in `make check` almost
immediately. Try
This is due to the new ERROR handling code that I added today for the
lock/unlock APIs. Will fix.
Also, current value for LOGICALREP_IS_COMMIT is 1, but previous code expected
flags to be zero, so this way logical replication between postgres-10 and
postgres-with-2pc-decoding will be broken.
Good point. Will also test pg-10 to pg-11 logical replication to
ensure that there are no issues.
So I think we need a subscription parameter to enable/disable this,
defaulting to 'disabled'.
Ok, will add it to the "CREATE SUBSCRIPTION", btw, we should have
allowed storing options in an array form for a subscription. We might
add more options in the future and adding fields one by one doesn't
seem that extensible.
1) twophase.c
---------
I think this comment is slightly inaccurate:
/*
* Coordinate with logical decoding backends that may be already
* decoding this prepared transaction. When aborting a transaction,
* we need to wait for all of them to leave the decoding group. If
* committing, we simply remove all members from the group.
*/
Strictly speaking, we're not waiting for the workers to leave the
decoding group, but to set decodeLocked=false. That is, we may proceed
when there still are members, but they must be in unlocked state.
Agreed. Will modify it to mention that it will wait only if some of
the backends are in locked state.
2) reorderbuffer.c
------------------
I've already said it before, I find the "flags" bitmask and rbtxn_*
macros way less readable than individual boolean flags. It was claimed
this was done on Andres' request, but I don't see that in the thread. I
admit it's rather subjective, though.
Yeah, this is a little subjective.
I see ReorderBuffer only does the lock/unlock around apply_change and
RelationIdGetRelation. That seems insufficient - RelidByRelfilenode can
do heap_open on pg_class, for example. And I guess we need to protect
rb->message too, because who knows what the plugin does in the callback?
Also, we should not allocate gid[GIDSIZE] for every transaction. For
example subxacts never need it, and it seems rather wasteful to allocate
200B when the rest of the struct has only ~100B. This is particularly
problematic considering ReorderBufferTXN is not spilled to disk when
reaching the memory limit. It needs to be allocated ad-hoc only when
actually needed.
OK, will look at allocating GID only when needed.
Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
On Wed, Apr 4, 2018 at 12:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
On 04/03/2018 04:37 PM, Alvaro Herrera wrote:
Tomas Vondra wrote:
Yes, that is a good point actually - we need to test that replication
between PG10 and PG11 works correctly, i.e. that the protocol version is
correctly negotiated, and features are disabled/enabled accordingly etc.
Maybe it'd be good to have a buildfarm animal to specifically test for
that?
Not sure a buildfarm supports running two clusters with different
versions easily?
You'd need some specialized buildfarm infrastructure like --- maybe the
same as --- the infrastructure for testing cross-version pg_upgrade.
Andrew could speak to the details better than I.
It's quite possible. The cross-version upgrade module saves out each
built version. See
<https://github.com/PGBuildFarm/client-code/blob/master/PGBuild/Modules/TestUpgradeXversion.pm>
Since this occupies a significant amount of disk space we'd probably
want to leverage it rather than have another module do the same thing.
Perhaps the "save" part of it needs to be factored out.
In any case, it's quite doable. I can work on that after this gets committed.
Currently we seem to have only two machines doing the cross-version
upgrade checks, which might make it easier to rearrange anything if
necessary.
cheers
andrew
--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
This is due to the new ERROR handling code that I added today for the
lock/unlock APIs. Will fix.
Fixed. I continue to test this area for other issues.
Also, current value for LOGICALREP_IS_COMMIT is 1, but previous code expected
flags to be zero, so this way logical replication between postgres-10 and
postgres-with-2pc-decoding will be broken.
Good point. Will also test pg-10 to pg-11 logical replication to
ensure that there are no issues.
I started making changes for supporting replication between
postgres-10 and postgres-11 but then very quickly realized that
pgoutput support is too far from being done. It needs to be optional
and per subscription. It definitely needs proto version bump and we
don't even have a framework for negotiating proto version yet (since
the proto was never bumped) so there is a chunk of completely new code
missing. For demo and functionality purposes we have test_decoding
support for 2pc decoding in this patch set. External plugins like bdr
and pglogical will be able to leverage this infrastructure as well.
Importantly, since there is no negotiation, PG10 -> PG11
replication would not work, which rules out one of the most important
current use cases. To add support in pgoutput, we'd first have to
get multi-protocol publisher/subscriber communication working as a
pre-requisite. The good thing is that once we get the proto stuff in,
we can easily add the patch from the earlier patchset which provides
full 2PC decoding support in pgoutput.
Thoughts?
So, we should consider not adding pgoutput support right away and I
have removed that patch from this patchset now. Another aspect of not
working on pgoutput is we need not worry about adding an
enable_twophase option to CREATE SUBSCRIPTION immediately as well. The
test_decoding plugin is easy to extend with options and the patch set
already does that for enabling/disabling 2PC decoding in it.
So I think we need a subscription parameter to enable/disable this,
defaulting to 'disabled'.
Ok, will add it to the "CREATE SUBSCRIPTION", btw, we should have
allowed storing options in an array form for a subscription. We might
add more options in the future and adding fields one by one doesn't
seem that extensible.
This is not needed since we should not look at pgoutput 2PC decode support now.
1) twophase.c
---------
I think this comment is slightly inaccurate:
/*
* Coordinate with logical decoding backends that may be already
* decoding this prepared transaction. When aborting a transaction,
* we need to wait for all of them to leave the decoding group. If
* committing, we simply remove all members from the group.
*/
Strictly speaking, we're not waiting for the workers to leave the
decoding group, but to set decodeLocked=false. That is, we may proceed
when there still are members, but they must be in unlocked state.
Agreed. Will modify it to mention that it will wait only if some of
the backends are in locked state.
Modified the comment.
2) reorderbuffer.c
------------------
I've already said it before, I find the "flags" bitmask and rbtxn_*
macros way less readable than individual boolean flags. It was claimed
this was done on Andres' request, but I don't see that in the thread. I
admit it's rather subjective, though.
Yeah, this is a little subjective.
If the committer has strong opinions on this, then I can whip up
patches along desired lines.
I see ReorderBuffer only does the lock/unlock around apply_change and
RelationIdGetRelation. That seems insufficient - RelidByRelfilenode can
do heap_open on pg_class, for example. And I guess we need to protect
rb->message too, because who knows what the plugin does in the callback?
Added lock/unlock APIs around rb->message and other places where
Relations are fetched.
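The wrapping pattern can be sketched like this. It is a self-contained mock, not the patch's code: mock_lock_transaction and mock_unlock_transaction are hypothetical stand-ins for LogicalLockTransaction and LogicalUnlockTransaction, and the struct only models the two facts that matter here (whether an abort is pending and whether the lock is held).

```c
#include <assert.h>
#include <stdbool.h>

/* Mock transaction state; not the real ReorderBufferTXN. */
typedef struct MockTXN
{
    bool abort_pending;     /* set when a concurrent ROLLBACK is waiting */
    bool locked;            /* true between lock and unlock */
    int  callbacks_run;     /* how many callbacks actually executed */
} MockTXN;

typedef void (*mock_callback) (MockTXN *txn);

/* Stand-in for the lock call: refuse once an abort is pending. */
bool
mock_lock_transaction(MockTXN *txn)
{
    if (txn->abort_pending)
        return false;
    txn->locked = true;
    return true;
}

/* Stand-in for the unlock call. */
void
mock_unlock_transaction(MockTXN *txn)
{
    txn->locked = false;
}

/*
 * Run a callback that may touch catalogs with the lock held.
 * Returns false when decoding of this transaction should stop.
 */
bool
mock_call_locked(MockTXN *txn, mock_callback cb)
{
    assert(!txn->locked);

    if (!mock_lock_transaction(txn))
        return false;
    cb(txn);
    mock_unlock_transaction(txn);
    return true;
}

/* Example callback; it relies on the lock being held. */
void
mock_message_cb(MockTXN *txn)
{
    assert(txn->locked);
    txn->callbacks_run++;
}
```

The caller treats a false return as the signal to stop decoding the transaction, which mirrors the `break` added in ReorderBufferCommitInternal in the test patch.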
Also, we should not allocate gid[GIDSIZE] for every transaction. For
example subxacts never need it, and it seems rather wasteful to allocate
200B when the rest of the struct has only ~100B. This is particularly
problematic considering ReorderBufferTXN is not spilled to disk when
reaching the memory limit. It needs to be allocated ad-hoc only when
actually needed.
OK, will look at allocating GID only when needed.
Done. Now GID is a char pointer and gets palloc'ed and pfree'd.
PFA, latest patchset.
Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Attachments:
0002-Introduce-LogicalLockTransaction-LogicalUnlockTransa.0404.patch (application/octet-stream)
From 976f5bb8524075dd6a5b6eb83ecbbed3a5b897a3 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Wed, 4 Apr 2018 11:49:24 +0530
Subject: [PATCH 2/5] Introduce LogicalLockTransaction/LogicalUnlockTransaction
APIs
When a transaction aborts, its changes are considered unnecessary
for other transactions. That means the changes may be either cleaned
up by vacuum or removed from HOT chains (thus made inaccessible
through indexes), and there may be other such consequences.
When decoding committed transactions this is not an issue, and we
never decode transactions that abort before the decoding starts
(where decoding means passing it to ReorderBufferCommit).
But for in-progress transactions - for example when decoding prepared
transactions on PREPARE (and not COMMIT PREPARED as before), this
may cause failures when the output plugin consults catalogs (both
system and user-defined).
To prevent aborts concurrent with plugins accessing catalogs, we
introduce an API the output plugins are required to use (when
decoding in-progress transactions only).
Before accessing any catalogs, output plugins are required to call
LogicalLockTransaction and then release it using
LogicalUnlockTransaction. Implementation is via adding support for
decoding groups. Use LockHashPartitionLockByProc on the group leader
to get the LWLock protecting these fields. For prepared and uncommitted
transactions, decoding backends working on the same XID will link
themselves up to the corresponding PGPROC entry (decodeGroupLeader).
They will remove themselves when they are done decoding.
If the prepared or uncommitted transaction decides to abort, then
the decodeGroupLeader will set the decodeAbortPending flag allowing
the decodeGroupMembers to abort their decoding appropriately.
If any of the decode group members errors out, we also remove
that proc from the group membership.
---
src/backend/replication/logical/logical.c | 215 +++++++++++++++
src/backend/storage/ipc/procarray.c | 39 +++
src/backend/storage/lmgr/README | 46 ++++
src/backend/storage/lmgr/proc.c | 424 +++++++++++++++++++++++++++++-
src/include/replication/logical.h | 2 +
src/include/replication/reorderbuffer.h | 15 ++
src/include/storage/proc.h | 26 ++
src/include/storage/procarray.h | 1 +
8 files changed, 760 insertions(+), 8 deletions(-)
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 3d8ad7ddf8..2238066138 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1017,3 +1017,218 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
SpinLockRelease(&MyReplicationSlot->mutex);
}
}
+
+/*
+ * LogicalLockTransaction
+ * Make sure the transaction is not aborted during decoding.
+ *
+ * The logical decoding plugins may need to access catalogs (both system
+ * and user-defined), e.g. to get metadata about tuples, do custom
+ * filtering etc. While decoding committed transactions that is not an
+ * issue, but in-progress transactions may abort while being decoded, in
+ * which case the catalog access may fail in various ways (rows from
+ * aborted transactions are eligible for more aggressive cleanup, may
+ * not be accessible through indexes due to breaking HOT chains etc.).
+ *
+ * To prevent these issues, we need to prevent abort of the transaction
+ * while accessing any catalogs. To enforce that, each decoding backend
+ * has to call LogicalLockTransaction prior to any catalog access, and
+ * then LogicalUnlockTransaction immediately after it. These functions
+ * add/remove the decoding backend from a "decoding group" for a given
+ * transaction. While aborting a prepared transaction, the backend will
+ * wait for all current members of the decoding group to leave (see
+ * LogicalDecodeRemoveTransaction).
+ *
+ * The function returns true when it's safe to access catalogs, and
+ * false when the transaction aborted (or is being aborted) in which
+ * case the plugin should stop decoding it.
+ *
+ * The decoding backend joins the decoding group only when actually
+ * needed. For example when the transaction did no catalog changes,
+ * or when it's known to already have committed (or aborted), we can
+ * bail out without joining the group.
+ */
+bool
+LogicalLockTransaction(ReorderBufferTXN *txn)
+{
+ bool ok = false;
+ LWLock *leader_lwlock;
+
+ /*
+ * Transactions that have not modified catalogs do not need to
+ * join the decoding group.
+ */
+ if (!rbtxn_has_catalog_changes(txn))
+ return true;
+
+ /*
+ * Check commit status. If a transaction already committed, there
+ * is no danger when accessing catalogs. If it aborted, we can
+ * stop decoding it right away.
+ */
+ if (rbtxn_commit(txn))
+ return true;
+
+ if (rbtxn_rollback(txn))
+ return false;
+
+ /*
+ * Currently, only 2PC transactions can be decoded before commit
+ * (at prepare). So regular transactions are automatically safe.
+ */
+ if (!rbtxn_prepared(txn))
+ return true;
+
+ /*
+ * Find the PROC handling this XID and join the decoding group.
+ *
+ * If this is the first call for this XID, we don't know which
+ * PROC is executing the transaction (and acting as a leader).
+ * In that case we need to lookup and possibly also assign
+ * the leader.
+ */
+ if (MyProc->decodeGroupLeader == NULL)
+ {
+ PGPROC *proc = AssignDecodeGroupLeader(txn->xid);
+
+ /*
+ * We have checked if the transaction committed/aborted, but it
+ * is possible the PROC went away since then, in which case we
+ * get decodeGroupLeader=NULL. We recheck transaction status,
+ * expecting it to be either committed or aborted.
+ *
+ * If the PROC is available, add ourself as a member of its
+ * decoding group. Note that we're not holding any locks on PGPROC,
+ * so it's possible the leader disappears, or starts executing
+ * another transaction. In that case we're done.
+ */
+ if (proc == NULL ||
+ !BecomeDecodeGroupMember(proc, proc->pid, rbtxn_prepared(txn)))
+ {
+ Assert(!TransactionIdIsInProgress(txn->xid));
+ if (TransactionIdDidCommit(txn->xid))
+ {
+ txn->txn_flags |= RBTXN_COMMIT;
+ return true;
+ }
+ else
+ {
+ txn->txn_flags |= RBTXN_ROLLBACK;
+ return false;
+ }
+ }
+ }
+
+ /*
+ * If we were able to add ourself, then Abort processing will
+ * interlock with us.
+ */
+ Assert(MyProc->decodeGroupLeader);
+
+ leader_lwlock = LockHashPartitionLockByProc(MyProc->decodeGroupLeader);
+ LWLockAcquire(leader_lwlock, LW_SHARED);
+
+ /*
+ * Re-check if we were told to abort by the leader after taking
+ * the above lock
+ */
+ if (MyProc->decodeAbortPending)
+ {
+ /*
+ * Remove ourself from the decodeGroupMembership and return
+ * false so that the decoding plugin also initiates abort
+ * processing
+ */
+ LWLockRelease(leader_lwlock);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ RemoveDecodeGroupMemberLocked(MyProc->decodeGroupLeader);
+ /* reset the bool to let the leader know that we are going away */
+ MyProc->decodeAbortPending = false;
+ MyProc->decodeLocked = false;
+ txn->txn_flags |= RBTXN_ROLLBACK;
+ ok = false;
+ }
+ else
+ {
+ MyProc->decodeLocked = true;
+ ok = true;
+ }
+ LWLockRelease(leader_lwlock);
+
+ return ok;
+}
+
+/*
+ * LogicalUnlockTransaction
+ * Indicate that the logical decoding plugin is done accessing
+ * catalog information.
+ *
+ *
+ * To prevent issues while decoding of in-progress transactions, we
+ * need to prevent abort of the transaction while accessing any catalogs.
+ * To enforce that, each decoding backend has to call
+ * LogicalLockTransaction prior to any catalog access, and then
+ * LogicalUnlockTransaction immediately after it. This unlock function
+ * removes the decoding backend from a "decoding group" for a given
+ * transaction.
+ */
+void
+LogicalUnlockTransaction(ReorderBufferTXN *txn)
+{
+ LWLock *leader_lwlock;
+
+ /*
+ * If the transaction is known to have aborted, we should have never got
+ * here (the plugin should have interrupted the decoding).
+ */
+ Assert(!rbtxn_rollback(txn));
+
+ /* If it's not locked, we're done. */
+ if (!MyProc->decodeLocked)
+ return;
+
+ /*
+ * Transactions that have not modified catalogs do not need to
+ * join the decoding group.
+ */
+ if (!rbtxn_has_catalog_changes(txn))
+ return;
+
+ /*
+ * Currently, only 2PC transactions can be decoded before commit
+ * (at prepare). So regular transactions are automatically safe.
+ */
+ if (!rbtxn_prepared(txn))
+ return;
+
+ /*
+ * Check commit status. If a transaction already committed, there
+ * is no danger when accessing catalogs.
+ */
+ if (rbtxn_commit(txn))
+ return;
+
+ Assert(MyProc->decodeGroupLeader);
+ leader_lwlock = LockHashPartitionLockByProc(MyProc->decodeGroupLeader);
+ LWLockAcquire(leader_lwlock, LW_SHARED);
+ if (MyProc->decodeAbortPending)
+ {
+ /*
+ * Remove ourself from the decodeGroupMembership
+ */
+ LWLockRelease(leader_lwlock);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ RemoveDecodeGroupMemberLocked(MyProc->decodeGroupLeader);
+
+ /*
+ * reset the bool since it's a PGPROC field and we don't want
+ * things loitering around in it.
+ */
+ MyProc->decodeAbortPending = false;
+
+ txn->txn_flags |= RBTXN_ROLLBACK;
+ }
+ MyProc->decodeLocked = false;
+ LWLockRelease(leader_lwlock);
+ return;
+}
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index afe1c03aa3..2be2910207 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -2432,6 +2432,45 @@ BackendXidGetPid(TransactionId xid)
return result;
}
+/*
+ * BackendXidGetProc -- get a backend's PGPROC given its XID
+ *
+ * Note that it is up to the caller to be sure that the question
+ * remains meaningful for long enough for the answer to be used ...
+ *
+ * Only main transaction Ids are considered.
+ *
+ */
+PGPROC *
+BackendXidGetProc(TransactionId xid)
+{
+ PGPROC *result = NULL;
+ ProcArrayStruct *arrayP = procArray;
+ int index;
+
+ if (xid == InvalidTransactionId) /* never match invalid xid */
+ return NULL;
+
+ LWLockAcquire(ProcArrayLock, LW_SHARED);
+
+ for (index = 0; index < arrayP->numProcs; index++)
+ {
+ int pgprocno = arrayP->pgprocnos[index];
+ PGPROC *proc = &allProcs[pgprocno];
+ volatile PGXACT *pgxact = &allPgXact[pgprocno];
+
+ if (pgxact->xid == xid)
+ {
+ result = proc;
+ break;
+ }
+ }
+
+ LWLockRelease(ProcArrayLock);
+
+ return result;
+}
+
/*
* IsBackendPid -- is a given pid a running backend
*
diff --git a/src/backend/storage/lmgr/README b/src/backend/storage/lmgr/README
index 56b0a12a3e..4b4b9c5958 100644
--- a/src/backend/storage/lmgr/README
+++ b/src/backend/storage/lmgr/README
@@ -679,6 +679,52 @@ worker, and the worker fails to join the lock group unless the given PGPROC
still has the same PID and is still a lock group leader. We assume that
PIDs are not recycled quickly enough for this interlock to fail.
+Decode Group Locking
+--------------------
+
+When decoding in-progress transactions, we need to prevent aborts while
+the decoding processes are accessing catalogs, which might lead to issues
+if the transaction modified some of the catalogs. Currently this applies
+only to two-phase transactions, that may be decoded at PREPARE time, but
+in the future this may be extended to regular transactions too.
+
+To prevent that, the backend executing the abort is made to wait for all
+the decoding backends. We use an infrastructure which is very similar
+to the above group locking to form groups of backends performing logical
+decoding of the same in-progress transaction.
+
+Decode Group locking adds five new members to each PGPROC:
+decodeGroupLeader, decodeGroupMembers, decodeGroupLink, decodeLocked and
+decodeAbortPending. A PGPROC's decodeGroupLeader is NULL for processes
+not involved in logical decoding. When a process wants to decode an
+in-progress transaction, it looks up the PGPROC structure associated
+with that transaction ID and makes that PGPROC its
+decodeGroupLeader. The decodeGroupMembers field is only used in the
+leader; it is a list of the member PGPROCs of the decode group (the
+leader and all backends decoding this transaction ID).
+The decodeGroupLink field is the list link for this list. The decoding
+backend marks itself as decodeLocked while it is accessing catalog
+metadata for its decoding requirements via the LogicalLockTransaction
+API. It resets the same via the LogicalUnlockTransaction API.
+
+Meanwhile, if this in-progress transaction is being aborted, the backend
+executing the abort sets decodeAbortPending on the leader PGPROC and on
+all the decodeGroupMembers entries.
+
+The decodeGroupMembers entries stop decoding this transaction and exit.
+When all the decoding backends have exited the abort can proceed.
+
+All five of these fields are considered to be protected by a lock manager
+partition lock. The partition lock that protects these fields within a given
+decode group is chosen by taking the leader's pgprocno modulo the number of lock
+manager partitions. Holding this single lock allows safe manipulation of the
+decodeGroupMembers list for the decode group.
+
+The decodeGroupLeader's PGPROC and its PID are accessible to each
+decoding backend. The decoding backend fails to join the decode
+group unless the given PGPROC still has the same PID and is still
+a decode group leader. We assume that PIDs are not recycled quickly
+enough for this interlock to fail.
User Locks (Advisory Locks)
---------------------------
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 6f30e082b2..6dbe39a0a2 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -267,6 +267,11 @@ InitProcGlobal(void)
/* Initialize lockGroupMembers list. */
dlist_init(&procs[i].lockGroupMembers);
+
+ /* Initialize decodeGroupMembers list. */
+ dlist_init(&procs[i].decodeGroupMembers);
+ procs[i].decodeAbortPending = false;
+ procs[i].decodeLocked = false;
}
/*
@@ -406,6 +411,12 @@ InitProcess(void)
Assert(MyProc->lockGroupLeader == NULL);
Assert(dlist_is_empty(&MyProc->lockGroupMembers));
+ /* Check that group decode fields are in a proper initial state. */
+ Assert(MyProc->decodeGroupLeader == NULL);
+ Assert(dlist_is_empty(&MyProc->decodeGroupMembers));
+ MyProc->decodeAbortPending = false;
+ MyProc->decodeLocked = false;
+
/* Initialize wait event information. */
MyProc->wait_event_info = 0;
@@ -581,6 +592,12 @@ InitAuxiliaryProcess(void)
Assert(MyProc->lockGroupLeader == NULL);
Assert(dlist_is_empty(&MyProc->lockGroupMembers));
+ /* Check that group decode fields are in a proper initial state. */
+ Assert(MyProc->decodeGroupLeader == NULL);
+ Assert(dlist_is_empty(&MyProc->decodeGroupMembers));
+ MyProc->decodeAbortPending = false;
+ MyProc->decodeLocked = false;
+
/*
* We might be reusing a semaphore that belonged to a failed process. So
* be careful and reinitialize its value here. (This is not strictly
@@ -826,7 +843,7 @@ ProcKill(int code, Datum arg)
/*
* Detach from any lock group of which we are a member. If the leader
- * exist before all other group members, it's PGPROC will remain allocated
+ * exits before all other group members, its PGPROC will remain allocated
* until the last group process exits; that process must return the
* leader's PGPROC to the appropriate list.
*/
@@ -845,11 +862,19 @@ ProcKill(int code, Datum arg)
{
procgloballist = leader->procgloballist;
- /* Leader exited first; return its PGPROC. */
- SpinLockAcquire(ProcStructLock);
- leader->links.next = (SHM_QUEUE *) *procgloballist;
- *procgloballist = leader;
- SpinLockRelease(ProcStructLock);
+ /*
+ * Leader exited first; return its PGPROC.
+ * Only do this if it is not also a decode group
+ * leader, though. Otherwise the last decode
+ * group member will release it later.
+ */
+ if (leader->decodeGroupLeader == NULL)
+ {
+ SpinLockAcquire(ProcStructLock);
+ leader->links.next = (SHM_QUEUE *) *procgloballist;
+ *procgloballist = leader;
+ SpinLockRelease(ProcStructLock);
+ }
}
}
else if (leader != MyProc)
@@ -857,6 +882,53 @@ ProcKill(int code, Datum arg)
LWLockRelease(leader_lwlock);
}
+ /*
+ * Detach from any decode group of which we are a member. If the leader
+ * exits before all other group members, its PGPROC will remain allocated
+ * until the last group process exits; that process must return the
+ * leader's PGPROC to the appropriate list.
+ */
+ if (MyProc->decodeGroupLeader != NULL)
+ {
+ PGPROC *leader = MyProc->decodeGroupLeader;
+ LWLock *leader_lwlock = LockHashPartitionLockByProc(leader);
+
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ dlist_delete(&MyProc->decodeGroupLink);
+ if (dlist_is_empty(&leader->decodeGroupMembers))
+ {
+ leader->decodeGroupLeader = NULL;
+ if (leader != MyProc)
+ {
+ procgloballist = leader->procgloballist;
+
+ /*
+ * Leader exited first; return its PGPROC.
+ * But check if it was already done above
+ */
+ if (leader != *procgloballist)
+ {
+ SpinLockAcquire(ProcStructLock);
+ leader->links.next = (SHM_QUEUE *) *procgloballist;
+ *procgloballist = leader;
+ SpinLockRelease(ProcStructLock);
+ }
+ }
+ /* clear leader flags */
+ leader->decodeAbortPending = false;
+ leader->decodeLocked = false;
+ }
+ else if (leader != MyProc)
+ {
+ MyProc->decodeGroupLeader = NULL;
+ /* clear proc flags */
+ MyProc->decodeLocked = false;
+ MyProc->decodeAbortPending = false;
+ }
+ LWLockRelease(leader_lwlock);
+ }
+
/*
* Reset MyLatch to the process local one. This is so that signal
* handlers et al can continue using the latch after the shared latch
@@ -882,8 +954,29 @@ ProcKill(int code, Datum arg)
Assert(dlist_is_empty(&proc->lockGroupMembers));
/* Return PGPROC structure (and semaphore) to appropriate freelist */
- proc->links.next = (SHM_QUEUE *) *procgloballist;
- *procgloballist = proc;
+ if (proc->decodeGroupLeader == NULL)
+ {
+ proc->links.next = (SHM_QUEUE *) *procgloballist;
+ *procgloballist = proc;
+ }
+ }
+
+ /*
+ * If we're still a member of a decode group, that means we're a leader
+ * which has somehow exited before its children. The last remaining child
+ * will release our PGPROC. Otherwise, release it now.
+ */
+ if (proc->decodeGroupLeader == NULL)
+ {
+ /* Since decodeGroupLeader is NULL, decodeGroupMembers should be empty. */
+ Assert(dlist_is_empty(&proc->decodeGroupMembers));
+
+ /* Return PGPROC structure (and semaphore) to appropriate freelist */
+ if (proc != *procgloballist)
+ {
+ proc->links.next = (SHM_QUEUE *) *procgloballist;
+ *procgloballist = proc;
+ }
}
/* Update shared estimate of spins_per_delay */
@@ -1887,3 +1980,318 @@ BecomeLockGroupMember(PGPROC *leader, int pid)
return ok;
}
+
+/*
+ * AssignDecodeGroupLeader
+ * Lookup process using xid and designate as decode group leader.
+ *
+ * Once this function has returned, other processes can join the decode
+ * group by calling BecomeDecodeGroupMember.
+ */
+PGPROC *
+AssignDecodeGroupLeader(TransactionId xid)
+{
+ PGPROC *proc = NULL;
+ int pid;
+ LWLock *leader_lwlock;
+
+ Assert(xid != InvalidTransactionId);
+
+ /*
+ * Lookup the backend executing this transaction.
+ *
+ * If the transaction already completed, we can bail out.
+ */
+ proc = BackendXidGetProc(xid);
+ if (proc)
+ pid = proc->pid;
+ else
+ return NULL;
+
+ /*
+ * A process running an XID can't have a leader; it can only be
+ * a leader (in which case it points to itself).
+ */
+ Assert(!proc->decodeGroupLeader ||
+ (proc->decodeGroupLeader == proc));
+
+ /*
+ * This proc will become decodeGroupLeader if it's not already.
+ */
+ if (proc->decodeGroupLeader == NULL)
+ {
+ volatile PGXACT *pgxact;
+
+ /* Create single-member group, containing this proc. */
+ leader_lwlock = LockHashPartitionLockByProc(proc);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+
+ /* recheck we are still the same */
+ pgxact = &ProcGlobal->allPgXact[proc->pgprocno];
+
+ /*
+ * We know the process was executing the XID a while ago, but we
+ * have not been holding any locks on PGPROC so it might have
+ * started executing something else since then. So we need to
+ * recheck that it is indeed still running the right XID.
+ *
+ * If it's not, the transaction must have already completed, so
+ * we don't need to create any decoding group.
+ */
+ if (pgxact->xid == xid)
+ {
+ /*
+ * Some other decoding backend might have marked the process
+ * as a leader before we acquired the lock. But it must not
+ * be a follower of some other leader.
+ */
+ Assert((proc->decodeGroupLeader == NULL) ||
+ (proc->decodeGroupLeader == proc));
+
+ /* recheck if someone else did not already assign us */
+ if (proc->decodeGroupLeader == NULL)
+ {
+ /*
+ * The leader is also a part of the decoding group,
+ * so we add it to the members list as well.
+ */
+ proc->decodeGroupLeader = proc;
+ dlist_push_head(&proc->decodeGroupMembers,
+ &proc->decodeGroupLink);
+ }
+ }
+ else
+ {
+ /* proc entry is gone */
+ proc = NULL;
+ }
+ LWLockRelease(leader_lwlock);
+ }
+
+ elog(DEBUG1, "became group leader (%p)", proc);
+ return proc;
+}
+
+/*
+ * BecomeDecodeGroupMember - designate process as decode group member
+ *
+ * This is pretty straightforward except for the possibility that the leader
+ * whose group we're trying to join might exit before we manage to do so;
+ * and the PGPROC might get recycled for an unrelated process. To avoid
+ * that, we require the caller to pass the PID of the intended PGPROC as
+ * an interlock. Returns true if we successfully join the intended decode
+ * group, and false if not.
+ */
+bool
+BecomeDecodeGroupMember(PGPROC *leader, int pid, bool is_prepared)
+{
+ LWLock *leader_lwlock;
+ bool ok = false;
+
+ /* Group leader can't become member of group */
+ Assert(MyProc != leader);
+
+ /* Can't already be a member of a group */
+ Assert(MyProc->decodeGroupLeader == NULL);
+
+ /* PID must be valid OR this is a prepared transaction. */
+ Assert(pid != 0 || is_prepared);
+
+ /*
+ * Get lock protecting the group fields. Note LockHashPartitionLockByProc
+ * accesses leader->pgprocno in a PGPROC that might be free. This is safe
+ * because all PGPROCs' pgprocno fields are set during shared memory
+ * initialization and never change thereafter; so we will acquire the
+ * correct lock even if the leader PGPROC is in process of being recycled.
+ */
+ leader_lwlock = LockHashPartitionLockByProc(leader);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+
+ /* Is this the leader we're looking for? */
+ if (leader->pid == pid && leader->decodeGroupLeader == leader)
+ {
+ if (is_prepared)
+ Assert(pid == 0);
+ /* is the leader going away? */
+ if (leader->decodeAbortPending)
+ ok = false;
+ else
+ {
+ /* OK, join the group */
+ ok = true;
+ MyProc->decodeGroupLeader = leader;
+ dlist_push_tail(&leader->decodeGroupMembers, &MyProc->decodeGroupLink);
+ }
+ }
+ else
+ MyProc->decodeGroupLeader = NULL;
+ LWLockRelease(leader_lwlock);
+
+ elog(DEBUG1, "became group member (%p) to (%p)", MyProc, leader);
+ return ok;
+}
+
+/*
+ * RemoveDecodeGroupMember
+ * Remove a member from the decoding group of a leader.
+ */
+void
+RemoveDecodeGroupMember(PGPROC *leader)
+{
+ LWLock *leader_lwlock;
+
+ leader_lwlock = LockHashPartitionLockByProc(leader);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ RemoveDecodeGroupMemberLocked(leader);
+ LWLockRelease(leader_lwlock);
+
+ return;
+}
+
+/*
+ * RemoveDecodeGroupMemberLocked
+ * Remove a member from a decoding group of a leader.
+ *
+ * Assumes that the caller is holding appropriate lock on PGPROC.
+ */
+void
+RemoveDecodeGroupMemberLocked(PGPROC *leader)
+{
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ dlist_delete(&MyProc->decodeGroupLink);
+ /* leader links to itself, so never empty */
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ MyProc->decodeGroupLeader = NULL;
+ elog(DEBUG1, "removed group member (%p) from (%p)", MyProc, leader);
+
+ return;
+}
+
+/*
+ * LogicalDecodeRemoveTransaction
+ * Notify all decoding members that this transaction is going away.
+ *
+ * Wait for all decodeGroupMembers to ack back before returning from
+ * here but only in case of aborts.
+ *
+ * This function should be called *after* the proc has been removed
+ * from the procArray.
+ *
+ * If the transaction is committing, it's ok for the decoding backends
+ * to continue merrily - there is no danger in accessing catalogs. When
+ * it tries to join the decoding group, it won't find the proc anymore,
+ * forcing it to re-check transaction status and cache the commit
+ * status for future calls (see LogicalLockTransaction).
+ *
+ * If a backend that is part of the decode group dies/crashes,
+ * the database restarts, which also cleans up the shared
+ * memory state.
+ */
+void
+LogicalDecodeRemoveTransaction(PGPROC *leader, bool isCommit)
+{
+ LWLock *leader_lwlock;
+ dlist_mutable_iter change_i;
+ dlist_iter iter;
+ PGPROC *proc;
+ bool do_wait;
+
+ leader_lwlock = LockHashPartitionLockByProc(leader);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+
+ /*
+ * If the proc has not been initialized as a group leader, there are
+ * no group members to wait for and we can terminate right away.
+ */
+ if (leader->decodeGroupLeader == NULL)
+ {
+ Assert(dlist_is_empty(&leader->decodeGroupMembers));
+ LWLockRelease(leader_lwlock);
+ return;
+ }
+
+ /* mark the transaction as aborting */
+ leader->decodeAbortPending = (!isCommit);
+
+recheck:
+ do_wait = false;
+ Assert(leader->decodeGroupLeader == leader);
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ if (!isCommit)
+ {
+ /*
+ * We need to walk the list of group members, and decide if we
+ * need to wait for some of them. In other words, we need to
+ * check if there are any processes besides the leader.
+ */
+ dlist_foreach(iter, &leader->decodeGroupMembers)
+ {
+ proc = dlist_container(PGPROC, decodeGroupLink, iter.cur);
+
+ /* Ignore the leader (i.e. ourselves). */
+ if (proc == leader)
+ continue;
+
+ /* mark the proc to indicate abort is pending */
+ if (!proc->decodeAbortPending)
+ {
+ proc->decodeAbortPending = true;
+ elog(DEBUG1, "marking group member (%p) from (%p) for abort",
+ proc, leader);
+ }
+
+ /* if the proc is currently locked, wait */
+ if (proc->decodeLocked)
+ do_wait = true;
+ }
+
+ if (do_wait)
+ {
+ int rc;
+ LWLockRelease(leader_lwlock);
+
+ elog(LOG, "waiting for backends to abort decoding");
+ /*
+ * Wait on our latch to allow decodeGroupMembers to
+ * go away soon
+ */
+ rc = WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+ 100L,
+ WAIT_EVENT_PG_SLEEP);
+ ResetLatch(MyLatch);
+
+ /* emergency bailout if postmaster has died */
+ if (rc & WL_POSTMASTER_DEATH)
+ proc_exit(1);
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Recheck decodeGroupMembers */
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ goto recheck;
+ }
+ }
+
+ /*
+ * All backends exited cleanly in case of aborts above,
+ * remove decodeGroupMembers now for both commit/abort cases
+ */
+ Assert(leader->decodeGroupLeader == leader);
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ dlist_foreach_modify(change_i, &leader->decodeGroupMembers)
+ {
+ proc = dlist_container(PGPROC, decodeGroupLink, change_i.cur);
+ Assert(!proc->decodeLocked);
+ dlist_delete(&proc->decodeGroupLink);
+ elog(DEBUG1, "deleting group member (%p) from (%p)",
+ proc, leader);
+ proc->decodeGroupLeader = NULL;
+ }
+ Assert(dlist_is_empty(&leader->decodeGroupMembers));
+ leader->decodeGroupLeader = NULL;
+ leader->decodeAbortPending = false;
+ LWLockRelease(leader_lwlock);
+
+ return;
+}
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 619c5f4d73..63b14367f0 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -117,6 +117,8 @@ extern void LogicalIncreaseXminForSlot(XLogRecPtr lsn, TransactionId xmin);
extern void LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn,
XLogRecPtr restart_lsn);
extern void LogicalConfirmReceivedLocation(XLogRecPtr lsn);
+extern bool LogicalLockTransaction(ReorderBufferTXN *txn);
+extern void LogicalUnlockTransaction(ReorderBufferTXN *txn);
extern bool filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id);
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 177ef98e43..385bb486bb 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -141,6 +141,11 @@ typedef struct ReorderBufferChange
#define RBTXN_HAS_CATALOG_CHANGES 0x0001
#define RBTXN_IS_SUBXACT 0x0002
#define RBTXN_IS_SERIALIZED 0x0004
+#define RBTXN_PREPARE 0x0008
+#define RBTXN_COMMIT_PREPARED 0x0010
+#define RBTXN_ROLLBACK_PREPARED 0x0020
+#define RBTXN_COMMIT 0x0040
+#define RBTXN_ROLLBACK 0x0080
/* does the txn have catalog changes */
#define rbtxn_has_catalog_changes(txn) (txn->txn_flags & RBTXN_HAS_CATALOG_CHANGES)
@@ -154,6 +159,16 @@ typedef struct ReorderBufferChange
* nentries_mem == nentries.
*/
#define rbtxn_is_serialized(txn) (txn->txn_flags & RBTXN_IS_SERIALIZED)
+/* is this txn prepared? */
+#define rbtxn_prepared(txn) (txn->txn_flags & RBTXN_PREPARE)
+/* was this prepared txn committed in the meanwhile? */
+#define rbtxn_commit_prepared(txn) (txn->txn_flags & RBTXN_COMMIT_PREPARED)
+/* was this prepared txn aborted in the meanwhile? */
+#define rbtxn_rollback_prepared(txn) (txn->txn_flags & RBTXN_ROLLBACK_PREPARED)
+/* was this txn committed in the meanwhile? */
+#define rbtxn_commit(txn) (txn->txn_flags & RBTXN_COMMIT)
+/* was this txn aborted in the meanwhile? */
+#define rbtxn_rollback(txn) (txn->txn_flags & RBTXN_ROLLBACK)
typedef struct ReorderBufferTXN
{
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 5c19a61dcf..45d2dbd766 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -200,6 +200,26 @@ struct PGPROC
PGPROC *lockGroupLeader; /* lock group leader, if I'm a member */
dlist_head lockGroupMembers; /* list of members, if I'm a leader */
dlist_node lockGroupLink; /* my member link, if I'm a member */
+
+ /*
+ * Support for decoding groups. Use LockHashPartitionLockByProc on the group
+ * leader to get the LWLock protecting these fields.
+ *
+ * For prepared and uncommitted transactions, decoding backends working on
+ * the same XID will link themselves up to the corresponding PGPROC
+ * entry (decodeGroupLeader).
+ *
+ * They will remove themselves when they are done decoding.
+ *
+ * If the prepared or uncommitted transaction decides to abort, then
+ * the decodeGroupLeader will set the decodeAbortPending flag allowing
+ * the decodeGroupMembers to abort their decoding appropriately
+ */
+ PGPROC *decodeGroupLeader; /* decode group leader, if I'm a member */
+ dlist_head decodeGroupMembers; /* list of members, if I'm a leader */
+ dlist_node decodeGroupLink; /* my member link, if I'm a member */
+ bool decodeLocked; /* is it currently locked by this proc? */
+ bool decodeAbortPending; /* is the decode group leader aborting? */
};
/* NOTE: "typedef struct PGPROC PGPROC" appears in storage/lock.h. */
@@ -327,4 +347,10 @@ extern PGPROC *AuxiliaryPidGetProc(int pid);
extern void BecomeLockGroupLeader(void);
extern bool BecomeLockGroupMember(PGPROC *leader, int pid);
+extern PGPROC *AssignDecodeGroupLeader(TransactionId xid);
+extern bool BecomeDecodeGroupMember(PGPROC *leader, int pid, bool is_prepared);
+extern void RemoveDecodeGroupMember(PGPROC *leader);
+extern void RemoveDecodeGroupMemberLocked(PGPROC *leader);
+extern void LogicalDecodeRemoveTransaction(PGPROC *leader, bool isCommit);
+
#endif /* PROC_H */
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 75bab2985f..776de2470e 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -97,6 +97,7 @@ extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids
extern PGPROC *BackendPidGetProc(int pid);
extern PGPROC *BackendPidGetProcWithLock(int pid);
+extern PGPROC *BackendXidGetProc(TransactionId xid);
extern int BackendXidGetPid(TransactionId xid);
extern bool IsBackendPid(int pid);
--
2.15.1 (Apple Git-101)
0003-Support-decoding-of-two-phase-transactions-at-PREPAR.0404.patchapplication/octet-stream; name=0003-Support-decoding-of-two-phase-transactions-at-PREPAR.0404.patchDownload
From b90894a844c9f6ba503af6ef3abcdd10866095a3 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Wed, 4 Apr 2018 13:03:15 +0530
Subject: [PATCH 3/5] Support decoding of two-phase transactions at PREPARE
Until now two-phase transactions were decoded at COMMIT, just like
regular transactions. During replay, two-phase transactions were
translated into regular transactions on the subscriber, and the GID
was not forwarded to it.
This patch allows PREPARE-time decoding of two-phase transactions (if
the output plugin supports this capability), in which case the
transactions are replayed at PREPARE and then committed later when
COMMIT PREPARED arrives.
On the subscriber, the transactions will be executed as two-phase
transactions, with the same GID. This is important for various
external transaction managers, which often encode information into
the GID itself.
All catalog access while decoding such 2PC transactions has to be
wrapped in the LogicalLockTransaction/LogicalUnlockTransaction APIs
at the relevant locations. This includes the location where the output
plugin's change apply API is to be invoked. This protects any catalog
access inside the output plugin's change apply API from concurrent
rollback operations.
Includes documentation changes.
---
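Reviewer note (not part of the commit): the documentation below requires filter_prepare_cb to give the same answer for a given xid/gid pair every time it is called. A minimal sketch of a callback that meets that requirement by looking only at the GID is shown here. The stand-in type declarations, the sketch_filter_prepare name and the "repl_" prefix are assumptions made for illustration; only the parameter list follows the LogicalDecodeFilterPrepareCB typedef in the documentation hunk.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Stand-in declarations so the sketch compiles without PostgreSQL headers. */
struct LogicalDecodingContext;
typedef struct ReorderBufferTXN ReorderBufferTXN;
typedef uint32_t TransactionId;

/* Hypothetical GID prefix marking transactions to decode at PREPARE. */
#define SKETCH_GID_PREFIX "repl_"

/*
 * Returns true to skip PREPARE-time decoding (the transaction is then
 * decoded as a regular one at COMMIT PREPARED), false to decode it now.
 * The answer depends only on gid, so it is stable across calls.
 */
bool
sketch_filter_prepare(struct LogicalDecodingContext *ctx,
                      ReorderBufferTXN *txn,
                      TransactionId xid,
                      const char *gid)
{
    /* txn may be NULL per the documentation, so it is never dereferenced. */
    (void) ctx;
    (void) txn;
    (void) xid;

    if (gid == NULL)
        return true;

    return strncmp(gid, SKETCH_GID_PREFIX, strlen(SKETCH_GID_PREFIX)) != 0;
}
```

Keying the decision on the GID alone avoids any dependence on txn, which the documentation says can be NULL in some cases.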
doc/src/sgml/logicaldecoding.sgml | 128 +++++++++++++-
src/backend/access/transam/twophase.c | 8 +
src/backend/replication/logical/decode.c | 147 ++++++++++++++--
src/backend/replication/logical/logical.c | 202 +++++++++++++++++++++
src/backend/replication/logical/reorderbuffer.c | 225 +++++++++++++++++++++---
src/include/replication/logical.h | 11 +-
src/include/replication/output_plugin.h | 45 +++++
src/include/replication/reorderbuffer.h | 54 ++++++
8 files changed, 783 insertions(+), 37 deletions(-)
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index f6b14dccb0..b11752789d 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -384,7 +384,12 @@ typedef struct OutputPluginCallbacks
LogicalDecodeBeginCB begin_cb;
LogicalDecodeChangeCB change_cb;
LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeAbortCB abort_cb;
LogicalDecodeMessageCB message_cb;
+ LogicalDecodeFilterPrepareCB filter_prepare_cb;
+ LogicalDecodePrepareCB prepare_cb;
+ LogicalDecodeCommitPreparedCB commit_prepared_cb;
+ LogicalDecodeAbortPreparedCB abort_prepared_cb;
LogicalDecodeFilterByOriginCB filter_by_origin_cb;
LogicalDecodeShutdownCB shutdown_cb;
} OutputPluginCallbacks;
@@ -454,7 +459,13 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
never get
decoded. Successful savepoints are
folded into the transaction containing them in the order they were
- executed within that transaction.
+ executed within that transaction. A transaction that is prepared for
+ a two-phase commit using <command>PREPARE TRANSACTION</command> will
+ also be decoded if the output plugin callbacks needed for decoding
+ it are provided. It is possible that the current transaction which
+ is being decoded is aborted concurrently via a <command>ROLLBACK PREPARED</command>
+ command. In that case, the logical decoding of this transaction will
+ be aborted too.
</para>
<note>
@@ -555,6 +566,74 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
</para>
</sect3>
+ <sect3 id="logicaldecoding-output-plugin-prepare">
+ <title>Transaction Prepare Callback</title>
+
+ <para>
+ The optional <function>prepare_cb</function> callback is called whenever
+ a transaction which is prepared for two-phase commit has been
+ decoded. The <function>change_cb</function> callbacks for all modified
+ rows will have been called before this, if there have been any modified
+ rows.
+<programlisting>
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+ <title>Commit Prepared Transaction Callback</title>
+
+ <para>
+ The optional <function>commit_prepared_cb</function> callback is called whenever
+ a commit prepared transaction has been decoded. The <parameter>gid</parameter> field,
+ which is part of the <parameter>txn</parameter> parameter, can be used in this
+ callback.
+<programlisting>
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-abort-prepared">
+ <title>Rollback Prepared Transaction Callback</title>
+
+ <para>
+ The optional <function>abort_prepared_cb</function> callback is called whenever
+ a rollback prepared transaction has been decoded. The <parameter>gid</parameter> field,
+ which is part of the <parameter>txn</parameter> parameter, can be used in this
+ callback.
+<programlisting>
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-abort">
+ <title>Transaction Abort Callback</title>
+
+ <para>
+ The required <function>abort_cb</function> callback is called whenever
+ a transaction abort has to be initiated. This can happen if we are
+ decoding a transaction that has been prepared for two-phase commit and
+ a concurrent rollback happens while we are decoding it. In such cases
+ it might make sense to check whether the rollback has already
+ happened before we start looking at the changes, to avoid decoding
+ such transactions altogether.
+<programlisting>
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
<sect3 id="logicaldecoding-output-plugin-change">
<title>Change Callback</title>
@@ -564,7 +643,12 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
an <command>INSERT</command>, <command>UPDATE</command>,
or <command>DELETE</command>. Even if the original command modified
several rows at once the callback will be called individually for each
- row.
+ row. The <function>change_cb</function> callback may access system or
+ user catalog tables to aid in the process of outputting the row
+ modification details. When decoding a prepared (but not yet
+ committed) transaction or any other uncommitted transaction, this
+ change callback is guaranteed safe access to catalog tables even if
+ another backend concurrently rolls back this very same transaction.
<programlisting>
typedef void (*LogicalDecodeChangeCB) (struct LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
@@ -619,6 +703,39 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
</para>
</sect3>
+ <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+ <title>Prepare Filter Callback</title>
+
+ <para>
+ The optional <function>filter_prepare_cb</function> callback
+ is called to determine whether data that is part of the current
+ two-phase commit transaction should be considered for decode
+ at this prepare stage or as a regular one-phase transaction at
+ <command>COMMIT PREPARED</command> time later. To signal that
+ decoding should be skipped, return <literal>true</literal>;
+ <literal>false</literal> otherwise. When the callback is not
+ defined, <literal>false</literal> is assumed (i.e. nothing is
+ filtered).
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+</programlisting>
+ The <parameter>ctx</parameter> parameter has the same contents
+ as for the other callbacks. The <parameter>txn</parameter> parameter
+ contains meta information about the transaction. The <parameter>xid</parameter>
+ contains the XID because <parameter>txn</parameter> can be NULL in some cases.
+ The <parameter>gid</parameter> is the identifier that later identifies this
+ transaction for <command>COMMIT PREPARED</command> or <command>ROLLBACK PREPARED</command>.
+ </para>
+ <para>
+ The callback has to provide the same static answer for a given combination of
+ <parameter>xid</parameter> and <parameter>gid</parameter> every time it is
+ called.
+ </para>
+ </sect3>
+
<sect3 id="logicaldecoding-output-plugin-message">
<title>Generic Message Callback</title>
@@ -640,7 +757,12 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
non-transactional and the XID was not assigned yet in the transaction
which logged the message. The <parameter>lsn</parameter> has WAL
location of the message. The <parameter>transactional</parameter> says
- if the message was sent as transactional or not.
+ if the message was sent as transactional or not. Similar to the change
+ callback, when decoding a prepared (but not yet committed)
+ transaction or any other uncommitted transaction, this message
+ callback is guaranteed safe access to catalog tables even if
+ another backend concurrently rolls back this very same transaction.
+
The <parameter>prefix</parameter> is arbitrary null-terminated prefix
which can be used for identifying interesting messages for the current
plugin. And finally the <parameter>message</parameter> parameter holds
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index d6e4b7980f..30ebe5e72d 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1507,6 +1507,14 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
ProcArrayRemove(proc, latestXid);
+ /*
+ * Coordinate with logical decoding backends that may already be
+ * decoding this prepared transaction. When aborting a transaction,
+ * we need to wait for all of them to leave the decoding group. If
+ * committing, we simply remove all members from the group.
+ */
+ LogicalDecodeRemoveTransaction(proc, isCommit);
+
/*
* In case we fail while running the callbacks, mark the gxact invalid so
* no one else will try to commit/rollback, and so it will be recycled if
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 6eb0d5527e..51d544d0f5 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -34,6 +34,7 @@
#include "access/xlogutils.h"
#include "access/xlogreader.h"
#include "access/xlogrecord.h"
+#include "access/twophase.h"
#include "catalog/pg_control.h"
@@ -72,6 +73,8 @@ static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_commit *parsed, TransactionId xid);
static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare * parsed);
/* common function to decode tuples */
static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -280,16 +283,33 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
break;
}
case XLOG_XACT_PREPARE:
+ {
+ xl_xact_parsed_prepare parsed;
- /*
- * Currently decoding ignores PREPARE TRANSACTION and will just
- * decode the transaction when the COMMIT PREPARED is sent or
- * throw away the transaction's contents when a ROLLBACK PREPARED
- * is received. In the future we could add code to expose prepared
- * transactions in the changestream allowing for a kind of
- * distributed 2PC.
- */
- ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+ /* check that output plugin is capable of twophase decoding */
+ if (!ctx->enable_twophase)
+ {
+ ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+ break;
+ }
+
+ /* ok, parse it */
+ ParsePrepareRecord(XLogRecGetInfo(buf->record),
+ XLogRecGetData(buf->record), &parsed);
+
+ /* does output plugin want this particular transaction? */
+ if (ctx->callbacks.filter_prepare_cb &&
+ ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+ parsed.twophase_gid))
+ {
+ ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+ buf->origptr);
+ break;
+ }
+
+ DecodePrepare(ctx, buf, &parsed);
+ break;
+ }
break;
default:
elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
@@ -627,9 +647,90 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
buf->origptr, buf->endptr);
}
+ /*
+ * Decide if we're processing COMMIT PREPARED, or a regular COMMIT.
+ * Regular commit simply triggers a replay of transaction changes from the
+ * reorder buffer. For COMMIT PREPARED, however, that replay already
+ * happened at PREPARE time, so we only need to notify the subscriber that
+ * the GID finally committed.
+ *
+ * For output plugins that do not support PREPARE-time decoding of
+ * two-phase transactions, we never even see the PREPARE and all two-phase
+ * transactions simply fall through to the second branch.
+ */
+ if (TransactionIdIsValid(parsed->twophase_xid) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder,
+ parsed->twophase_xid, parsed->twophase_gid))
+ {
+ Assert(xid == parsed->twophase_xid);
+ /* we are processing COMMIT PREPARED */
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid, true);
+ }
+ else
+ {
+ /* replay actions of all transaction + subtransactions in order */
+ ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn);
+ }
+}
+
+/*
+ * Decode PREPARE record. Similar logic as in COMMIT
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare * parsed)
+{
+ XLogRecPtr origin_lsn = parsed->origin_lsn;
+ TimestampTz commit_time = parsed->origin_timestamp;
+ RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+ int i;
+ TransactionId xid = parsed->twophase_xid;
+
+ /*
+ * Process invalidation messages, even if we're not interested in the
+ * transaction's contents, since the various caches need to always be
+ * consistent.
+ */
+ if (parsed->nmsgs > 0)
+ {
+ if (!ctx->fast_forward)
+ ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
+ parsed->nmsgs, parsed->msgs);
+ ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+ }
+
+ /*
+ * Tell the reorderbuffer about the surviving subtransactions. We need to
+ * do this because the main transaction itself has not committed since we
+ * are in the prepare phase right now. So we need to be sure the snapshot
+ * is set up correctly for the main transaction in case all changes
+ * happened in subtransactions.
+ */
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+ buf->origptr, buf->endptr);
+ }
+
+ if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+ (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+ ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+ {
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr);
+ }
+ ReorderBufferForget(ctx->reorder, xid, buf->origptr);
+
+ return;
+ }
+
/* replay actions of all transaction + subtransactions in order */
- ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
- commit_time, origin_id, origin_lsn);
+ ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn, parsed->twophase_gid);
}
/*
@@ -641,6 +742,30 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid)
{
int i;
+ XLogRecPtr origin_lsn = InvalidXLogRecPtr;
+ TimestampTz commit_time = 0;
+ RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ origin_lsn = parsed->origin_lsn;
+ commit_time = parsed->origin_timestamp;
+ }
+
+ /*
+ * If it's ROLLBACK PREPARED then handle it via callbacks.
+ */
+ if (TransactionIdIsValid(xid) &&
+ !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+ parsed->dbId == ctx->slot->data.database &&
+ !FilterByOrigin(ctx, origin_id) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
+ {
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid, false);
+ return;
+ }
for (i = 0; i < parsed->nsubxacts; i++)
{
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 2238066138..65382c2556 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -60,6 +60,16 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+static void abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change);
static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -125,6 +135,7 @@ StartupDecodingContext(List *output_plugin_options,
MemoryContext context,
old_context;
LogicalDecodingContext *ctx;
+ int twophase_callbacks;
/* shorter lines... */
slot = MyReplicationSlot;
@@ -184,8 +195,38 @@ StartupDecodingContext(List *output_plugin_options,
ctx->reorder->begin = begin_cb_wrapper;
ctx->reorder->apply_change = change_cb_wrapper;
ctx->reorder->commit = commit_cb_wrapper;
+ ctx->reorder->abort = abort_cb_wrapper;
+ ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+ ctx->reorder->prepare = prepare_cb_wrapper;
+ ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+ ctx->reorder->abort_prepared = abort_prepared_cb_wrapper;
ctx->reorder->message = message_cb_wrapper;
+ /*
+ * Check that plugin implements all callbacks necessary to decode
+ * two-phase transactions - we either have to have all of them or none.
+ * The filter_prepare callback is optional, but can only be defined when
+ * two-phase decoding is enabled (i.e. the three other callbacks are
+ * defined).
+ */
+ twophase_callbacks = (ctx->callbacks.prepare_cb != NULL) +
+ (ctx->callbacks.commit_prepared_cb != NULL) +
+ (ctx->callbacks.abort_prepared_cb != NULL);
+
+ ctx->enable_twophase = (twophase_callbacks == 3);
+
+ /* Plugins with incorrect number of two-phase callbacks are broken. */
+ if ((twophase_callbacks != 3) && (twophase_callbacks != 0))
+ ereport(ERROR,
+ (errmsg("output plugin registered only %d two-phase callbacks",
+ twophase_callbacks)));
+
+ /* filter_prepare is optional, but requires two-phase decoding */
+ if ((ctx->callbacks.filter_prepare_cb != NULL) && (!ctx->enable_twophase))
+ ereport(ERROR,
+ (errmsg("output plugin does not support two-phase decoding, but "
+ "registered filter_prepare callback")));
+
ctx->out = makeStringInfo();
ctx->prepare_write = prepare_write;
ctx->write = do_write;
@@ -697,6 +738,122 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
error_context_stack = errcallback.previous;
}
+static void
+abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "abort";
+ state.report_location = txn->final_lsn; /* beginning of abort record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.abort_cb(ctx, txn, abort_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "prepare";
+ state.report_location = txn->final_lsn; /* beginning of prepare record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "commit_prepared";
+ state.report_location = txn->final_lsn; /* beginning of commit record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "abort_prepared";
+ state.report_location = txn->final_lsn; /* beginning of abort record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.abort_prepared_cb(ctx, txn, abort_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
static void
change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change)
@@ -734,6 +891,51 @@ change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
error_context_stack = errcallback.previous;
}
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+ bool ret;
+
+ /*
+ * Skip if decoding of twophase at PREPARE time is not enabled. In that
+ * case all twophase transactions are considered filtered out and will be
+ * applied as regular transactions at COMMIT PREPARED.
+ */
+ if (!ctx->enable_twophase)
+ return true;
+
+ /*
+ * The filter_prepare callback is optional. When not supplied, all
+ * prepared transactions should go through.
+ */
+ if (!ctx->callbacks.filter_prepare_cb)
+ return false;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "filter_prepare";
+ state.report_location = InvalidXLogRecPtr;
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = false;
+
+ /* do the actual work: call callback */
+ ret = ctx->callbacks.filter_prepare_cb(ctx, txn, xid, gid);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+
+ return ret;
+}
+
bool
filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
{
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 3c9af58640..1c7dbd3ade 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -337,6 +337,11 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
}
/* free data that's contained */
+ if (txn->gid != NULL)
+ {
+ pfree(txn->gid);
+ txn->gid = NULL;
+ }
if (txn->tuplecid_hash != NULL)
{
@@ -1277,25 +1282,18 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
* the top and subtransactions (using a k-way merge) and replay the changes in
* lsn order.
*/
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
- XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
- TimestampTz commit_time,
- RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+ ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn)
{
- ReorderBufferTXN *txn;
volatile Snapshot snapshot_now;
volatile CommandId command_id = FirstCommandId;
bool using_subtxn;
ReorderBufferIterTXNState *volatile iterstate = NULL;
- txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
- false);
-
- /* unknown transaction, nothing to replay */
- if (txn == NULL)
- return;
-
txn->final_lsn = commit_lsn;
txn->end_lsn = end_lsn;
txn->commit_time = commit_time;
@@ -1372,8 +1370,12 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
case REORDER_BUFFER_CHANGE_DELETE:
Assert(snapshot_now);
+ /* Lock transaction before catalog access */
+ if (!LogicalLockTransaction(txn))
+ break;
reloid = RelidByRelfilenode(change->data.tp.relnode.spcNode,
change->data.tp.relnode.relNode);
+ LogicalUnlockTransaction(txn);
/*
* Catalog tuple without data, emitted while catalog was
@@ -1388,8 +1390,14 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
relpathperm(change->data.tp.relnode,
MAIN_FORKNUM));
+ /* Lock transaction before catalog access */
+ if (!LogicalLockTransaction(txn))
+ break;
+
relation = RelationIdGetRelation(reloid);
+ LogicalUnlockTransaction(txn);
+
if (relation == NULL)
elog(ERROR, "could not open relation with OID %u (for filenode \"%s\")",
reloid,
@@ -1418,8 +1426,23 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
/* user-triggered change */
if (!IsToastRelation(relation))
{
+ /*
+ * Output plugins can access catalog metadata and we
+ * do not have any control over that. We could ask
+ * them to call
+ * LogicalLockTransaction/LogicalUnlockTransaction
+ * APIs themselves, but that leads to unnecessary
+ * complications and expectations from plugin
+ * writers. We avoid this by calling these APIs
+ * here, thereby ensuring that the in-progress
+ * transaction will be around for the duration of
+ * the apply_change call below
+ */
+ if (!LogicalLockTransaction(txn))
+ break;
ReorderBufferToastReplace(rb, txn, relation, change);
rb->apply_change(rb, txn, relation, change);
+ LogicalUnlockTransaction(txn);
/*
* Only clear reassembled toast chunks if we're sure
@@ -1492,10 +1515,14 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
break;
case REORDER_BUFFER_CHANGE_MESSAGE:
+ /* XXX does rb->message need lock/unlock? */
+ if (!LogicalLockTransaction(txn))
+ break;
rb->message(rb, txn, change->lsn, true,
change->data.msg.prefix,
change->data.msg.message_size,
change->data.msg.message);
+ LogicalUnlockTransaction(txn);
break;
case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1581,8 +1608,26 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
ReorderBufferIterTXNFinish(rb, iterstate);
iterstate = NULL;
- /* call commit callback */
- rb->commit(rb, txn, commit_lsn);
+ /*
+ * Call abort/commit/prepare callback, depending on the transaction
+ * state.
+ *
+ * If the transaction aborted during apply (which currently can happen
+ * only for prepared transactions), simply call the abort callback.
+ *
+ * Otherwise call either PREPARE (for twophase transactions) or COMMIT
+ * (for regular ones).
+ */
+ if (rbtxn_rollback(txn))
+ rb->abort(rb, txn, commit_lsn);
+ else if (rbtxn_prepared(txn))
+ rb->prepare(rb, txn, commit_lsn);
+ else
+ rb->commit(rb, txn, commit_lsn);
+
+ /* remove ourselves from the decodeGroupLeader's group */
+ if (MyProc->decodeGroupLeader)
+ RemoveDecodeGroupMember(MyProc->decodeGroupLeader);
/* this is just a sanity check against bad output plugin behaviour */
if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -1609,7 +1654,12 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
if (snapshot_now->copied)
ReorderBufferFreeSnap(rb, snapshot_now);
- /* remove potential on-disk data, and deallocate */
+ /*
+ * remove potential on-disk data, and deallocate.
+ *
+ * We remove it even for prepared transactions (GID is enough to
+ * commit/abort those later).
+ */
ReorderBufferCleanupTXN(rb, txn);
}
PG_CATCH();
@@ -1643,6 +1693,141 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
PG_END_TRY();
}
+
+/*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+ return rb->filter_prepare(rb, txn, xid, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferCommitInternal()
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /* unknown transaction, nothing to replay */
+ if (txn == NULL)
+ return;
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a twophase transaction. It calls ReorderBufferCommitInternal()
+ * since all prepared transactions need to be decoded at PREPARE time.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /* unknown transaction, nothing to replay */
+ if (txn == NULL)
+ return;
+
+ txn->txn_flags |= RBTXN_PREPARE;
+ txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+ strcpy(txn->gid, gid);
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Check whether this transaction was sent as prepared to subscribers.
+ * Called while handling commit|abort prepared.
+ */
+bool
+ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+ const char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /*
+ * Always call the prepare filter. It's the job of the prepare filter to
+ * give us the *same* response for a given xid across multiple calls
+ * (including ones on restart)
+ */
+ return !(rb->filter_prepare(rb, txn, xid, gid));
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ABORT PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit)
+{
+ ReorderBufferTXN *txn;
+
+ /*
+ * The transaction may or may not exist (during restarts for example).
+ * Either way, a 2PC transaction carries no buffered changes at this
+ * point (they were replayed and cleaned up at PREPARE), so allow it to
+ * be created below.
+ */
+ txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+ true);
+
+ txn->final_lsn = commit_lsn;
+ txn->end_lsn = end_lsn;
+ txn->commit_time = commit_time;
+ txn->origin_id = origin_id;
+ txn->origin_lsn = origin_lsn;
+ /* this txn is obviously prepared */
+ txn->txn_flags |= RBTXN_PREPARE;
+ txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+ strcpy(txn->gid, gid);
+
+ if (is_commit)
+ {
+ txn->txn_flags |= RBTXN_COMMIT_PREPARED;
+ rb->commit_prepared(rb, txn, commit_lsn);
+ }
+ else
+ {
+ txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
+ rb->abort_prepared(rb, txn, commit_lsn);
+ }
+
+ /* cleanup: make sure there's no cache pollution */
+ ReorderBufferExecuteInvalidations(rb, txn);
+ ReorderBufferCleanupTXN(rb, txn);
+}
+
/*
* Abort a transaction that possibly has previous changes. Needs to be first
* called for subtransactions and then for the toplevel xid.
@@ -1711,7 +1896,7 @@ ReorderBufferAbortOld(ReorderBuffer *rb, TransactionId oldestRunningXid)
if (rbtxn_is_serialized(txn) && txn->final_lsn == 0)
{
ReorderBufferChange *last =
- dlist_tail_element(ReorderBufferChange, node, &txn->changes);
+ dlist_tail_element(ReorderBufferChange, node, &txn->changes);
txn->final_lsn = last->lsn;
}
@@ -2625,9 +2810,9 @@ ReorderBufferSerializedPath(char *path, ReplicationSlot *slot, TransactionId xid
XLogSegNoOffsetToRecPtr(segno, 0, recptr, wal_segment_size);
snprintf(path, MAXPGPATH, "pg_replslot/%s/xid-%u-lsn-%X-%X.snap",
- NameStr(MyReplicationSlot->data.name),
- xid,
- (uint32) (recptr >> 32), (uint32) recptr);
+ NameStr(MyReplicationSlot->data.name),
+ xid,
+ (uint32) (recptr >> 32), (uint32) recptr);
}
/*
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 63b14367f0..fbe18dff56 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -46,11 +46,11 @@ typedef struct LogicalDecodingContext
struct SnapBuild *snapshot_builder;
/*
- * Marks the logical decoding context as fast forward decoding one.
- * Such a context does not have plugin loaded so most of the the following
+ * Marks the logical decoding context as fast forward decoding one. Such a
+ * context does not have plugin loaded so most of the following
* properties are unused.
*/
- bool fast_forward;
+ bool fast_forward;
OutputPluginCallbacks callbacks;
OutputPluginOptions options;
@@ -89,6 +89,11 @@ typedef struct LogicalDecodingContext
bool prepared_write;
XLogRecPtr write_location;
TransactionId write_xid;
+
+ /*
+ * Capabilities of the output plugin.
+ */
+ bool enable_twophase;
} LogicalDecodingContext;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 82875d6b3d..5254210a46 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -68,6 +68,46 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+/*
+ * Called for an implicit ABORT of a transaction.
+ */
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+ /*
+ * Called before decoding of a PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to the prepare and
+ * commit_prepared/abort_prepared callbacks, or should wait until COMMIT
+ * PREPARED and be sent as a regular transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
/*
* Called for the generic logical decoding messages.
*/
@@ -99,7 +139,12 @@ typedef struct OutputPluginCallbacks
LogicalDecodeBeginCB begin_cb;
LogicalDecodeChangeCB change_cb;
LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeAbortCB abort_cb;
LogicalDecodeMessageCB message_cb;
+ LogicalDecodeFilterPrepareCB filter_prepare_cb;
+ LogicalDecodePrepareCB prepare_cb;
+ LogicalDecodeCommitPreparedCB commit_prepared_cb;
+ LogicalDecodeAbortPreparedCB abort_prepared_cb;
LogicalDecodeFilterByOriginCB filter_by_origin_cb;
LogicalDecodeShutdownCB shutdown_cb;
} OutputPluginCallbacks;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 385bb486bb..d890e6628c 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
#define REORDERBUFFER_H
#include "access/htup_details.h"
+#include "access/twophase.h"
#include "lib/ilist.h"
#include "storage/sinval.h"
#include "utils/hsearch.h"
@@ -179,6 +180,9 @@ typedef struct ReorderBufferTXN
*/
TransactionId xid;
+ /* In case of 2PC we need to pass GID to output plugin */
+ char *gid;
+
/*
* LSN of the first data carrying, WAL record with knowledge about this
* xid. This is allowed to *not* be first record adorned with this xid, if
@@ -312,6 +316,37 @@ typedef void (*ReorderBufferCommitCB) (
ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+/* abort callback signature */
+typedef void (*ReorderBufferAbortCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+typedef bool (*ReorderBufferFilterPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/* abort prepared callback signature */
+typedef void (*ReorderBufferAbortPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+
/* message callback signature */
typedef void (*ReorderBufferMessageCB) (
ReorderBuffer *rb,
@@ -347,6 +382,11 @@ struct ReorderBuffer
ReorderBufferBeginCB begin;
ReorderBufferApplyChangeCB apply_change;
ReorderBufferCommitCB commit;
+ ReorderBufferAbortCB abort;
+ ReorderBufferFilterPrepareCB filter_prepare;
+ ReorderBufferPrepareCB prepare;
+ ReorderBufferCommitPreparedCB commit_prepared;
+ ReorderBufferAbortPreparedCB abort_prepared;
ReorderBufferMessageCB message;
/*
@@ -394,6 +434,11 @@ void ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot
void ReorderBufferCommit(ReorderBuffer *, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit);
void ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
void ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
@@ -417,6 +462,15 @@ void ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
bool ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
bool ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
+bool ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+ const char *gid);
+bool ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+ const char *gid);
+void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid);
ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
void ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
--
2.15.1 (Apple Git-101)
0004-Teach-test_decoding-plugin-to-work-with-2PC.0404.patch (application/octet-stream)
From 6a08cfa974f2f5366896f9273c2a77805785bdc2 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Wed, 4 Apr 2018 13:35:55 +0530
Subject: [PATCH 4/5] Teach test_decoding plugin to work with 2PC
Includes a new option "enable-twophase". Depending on this option's
value, PREPARE TRANSACTION will either be decoded at prepare time or
treated as a single-phase commit later.
---
contrib/test_decoding/expected/prepared.out | 257 +++++++++++++++++++++++++---
contrib/test_decoding/sql/prepared.sql | 84 ++++++++-
contrib/test_decoding/test_decoding.c | 137 +++++++++++++++
3 files changed, 451 insertions(+), 27 deletions(-)
diff --git a/contrib/test_decoding/expected/prepared.out b/contrib/test_decoding/expected/prepared.out
index 46e915d4ff..5df7b7ff20 100644
--- a/contrib/test_decoding/expected/prepared.out
+++ b/contrib/test_decoding/expected/prepared.out
@@ -6,19 +6,82 @@ SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_d
init
(1 row)
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc', 'test_decoding');
+ ?column?
+----------
+ init
+(1 row)
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
-- test simple successful use of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (1);
PREPARE TRANSACTION 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+------
+(0 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ PREPARE TRANSACTION 'test_prepared#1'
+(3 rows)
+
COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ COMMIT
+(3 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
INSERT INTO test_prepared1 VALUES (2);
-- test abort of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (3);
PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+(3 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(6 rows)
+
ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+------
+(0 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
INSERT INTO test_prepared1 VALUES (4);
-- test prepared xact containing ddl
BEGIN;
@@ -26,45 +89,193 @@ INSERT INTO test_prepared1 VALUES (5);
ALTER TABLE test_prepared1 ADD COLUMN data text;
INSERT INTO test_prepared1 VALUES (6, 'frakbar');
PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
-INSERT INTO test_prepared2 VALUES (7);
-COMMIT PREPARED 'test_prepared#3';
--- make sure stuff still works
-INSERT INTO test_prepared1 VALUES (8);
-INSERT INTO test_prepared2 VALUES (9);
--- cleanup
-DROP TABLE test_prepared1;
-DROP TABLE test_prepared2;
--- show results
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'test_prepared1'::regclass;
+ relation | locktype | mode
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4
+ COMMIT
+(3 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
data
-------------------------------------------------------------------------
BEGIN
- table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:4
COMMIT
BEGIN
- table public.test_prepared1: INSERT: id[integer]:2
- COMMIT
+ table public.test_prepared1: INSERT: id[integer]:5
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(7 rows)
+
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------
BEGIN
- table public.test_prepared1: INSERT: id[integer]:4
+ table public.test_prepared2: INSERT: id[integer]:7
COMMIT
+(3 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+----------------------------------------------------
BEGIN
table public.test_prepared2: INSERT: id[integer]:7
COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+-------------------------------------------------------------------------
BEGIN
table public.test_prepared1: INSERT: id[integer]:5
table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
COMMIT
+(4 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#3'
+(1 row)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (8);
+INSERT INTO test_prepared2 VALUES (9);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+--------------------------------------------------------------------
BEGIN
table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
COMMIT
BEGIN
table public.test_prepared2: INSERT: id[integer]:9
COMMIT
-(22 rows)
+(6 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:9
+ COMMIT
+(6 rows)
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'pg_class'::regclass;
+ relation | locktype | mode
+----------+----------+------
+(0 rows)
+
+-- Shouldn't see anything with 2pc decoding off
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+-----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:12
+ COMMIT
+(3 rows)
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:12
+ PREPARE TRANSACTION 'test_prepared_lock2'
+ COMMIT PREPARED 'test_prepared_lock2'
+(8 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- Both will work normally after we commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ COMMIT
+(4 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+--------------------------------------
+ COMMIT PREPARED 'test_prepared_lock'
+(1 row)
+
+-- test savepoints
+BEGIN;
+SAVEPOINT test_savepoint;
+CREATE TABLE test_prepared_savepoint (a int);
+PREPARE TRANSACTION 'test_prepared_savepoint';
+COMMIT PREPARED 'test_prepared_savepoint';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+-------------------------------------------
+ COMMIT PREPARED 'test_prepared_savepoint'
+(1 row)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+------
+(0 rows)
+
+-- cleanup
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+------
+(0 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+------
+(0 rows)
SELECT pg_drop_replication_slot('regression_slot');
pg_drop_replication_slot
@@ -72,3 +283,9 @@ SELECT pg_drop_replication_slot('regression_slot');
(1 row)
+SELECT pg_drop_replication_slot('regression_slot_2pc');
+ pg_drop_replication_slot
+--------------------------
+
+(1 row)
+
diff --git a/contrib/test_decoding/sql/prepared.sql b/contrib/test_decoding/sql/prepared.sql
index e72639767e..e8eb8ad8d6 100644
--- a/contrib/test_decoding/sql/prepared.sql
+++ b/contrib/test_decoding/sql/prepared.sql
@@ -1,22 +1,31 @@
-- predictability
SET synchronous_commit = on;
SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc', 'test_decoding');
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
-- test simple successful use of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (1);
PREPARE TRANSACTION 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
INSERT INTO test_prepared1 VALUES (2);
-- test abort of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (3);
PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
INSERT INTO test_prepared1 VALUES (4);
@@ -27,24 +36,85 @@ ALTER TABLE test_prepared1 ADD COLUMN data text;
INSERT INTO test_prepared1 VALUES (6, 'frakbar');
PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'test_prepared1'::regclass;
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
-- make sure stuff still works
INSERT INTO test_prepared1 VALUES (8);
INSERT INTO test_prepared2 VALUES (9);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'pg_class'::regclass;
+
+-- Shouldn't see anything with 2pc decoding off
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+RESET statement_timeout;
+
+COMMIT PREPARED 'test_prepared_lock';
+
+-- Both will work normally after we commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+
+-- test savepoints
+BEGIN;
+SAVEPOINT test_savepoint;
+CREATE TABLE test_prepared_savepoint (a int);
+PREPARE TRANSACTION 'test_prepared_savepoint';
+COMMIT PREPARED 'test_prepared_savepoint';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
-- cleanup
DROP TABLE test_prepared1;
DROP TABLE test_prepared2;
--- show results
+-- show results. There should be nothing to show
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+
SELECT pg_drop_replication_slot('regression_slot');
+SELECT pg_drop_replication_slot('regression_slot_2pc');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index a94aeeae29..05b993fd7a 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -36,6 +36,7 @@ typedef struct
bool skip_empty_xacts;
bool xact_wrote_changes;
bool only_local;
+ bool enable_twophase;
} TestDecodingData;
static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -49,6 +50,8 @@ static void pg_output_begin(LogicalDecodingContext *ctx,
bool last_write);
static void pg_decode_commit_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pg_decode_abort_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr abort_lsn);
static void pg_decode_change(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, Relation rel,
ReorderBufferChange *change);
@@ -58,6 +61,18 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr message_lsn,
bool transactional, const char *prefix,
Size sz, const char *message);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
void
_PG_init(void)
@@ -75,9 +90,14 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
cb->begin_cb = pg_decode_begin_txn;
cb->change_cb = pg_decode_change;
cb->commit_cb = pg_decode_commit_txn;
+ cb->abort_cb = pg_decode_abort_txn;
cb->filter_by_origin_cb = pg_decode_filter;
cb->shutdown_cb = pg_decode_shutdown;
cb->message_cb = pg_decode_message;
+ cb->filter_prepare_cb = pg_decode_filter_prepare;
+ cb->prepare_cb = pg_decode_prepare_txn;
+ cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+ cb->abort_prepared_cb = pg_decode_abort_prepared_txn;
}
@@ -97,6 +117,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
data->include_timestamp = false;
data->skip_empty_xacts = false;
data->only_local = false;
+ data->enable_twophase = false;
ctx->output_plugin_private = data;
@@ -178,6 +199,16 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
errmsg("could not parse value \"%s\" for parameter \"%s\"",
strVal(elem->arg), elem->defname)));
}
+ else if (strcmp(elem->defname, "enable-twophase") == 0)
+ {
+ if (elem->arg == NULL)
+ data->enable_twophase = true;
+ else if (!parse_bool(strVal(elem->arg), &data->enable_twophase))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("could not parse value \"%s\" for parameter \"%s\"",
+ strVal(elem->arg), elem->defname)));
+ }
else
{
ereport(ERROR,
@@ -246,6 +277,112 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
OutputPluginWrite(ctx, true);
}
+/* ABORT callback */
+static void
+pg_decode_abort_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+ if (data->include_xids)
+ appendStringInfo(ctx->out, "ABORT %u", txn->xid);
+ else
+ appendStringInfoString(ctx->out, "ABORT");
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* Filter out two-phase transactions, if decoding not enabled. */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ /* treat all transactions as one-phase */
+ if (!data->enable_twophase)
+ return true;
+
+ return false;
+}
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* ABORT PREPARED callback */
+static void
+pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
static bool
pg_decode_filter(LogicalDecodingContext *ctx,
RepOriginId origin_id)
--
2.15.1 (Apple Git-101)
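
Reviewer's sketch of the filter decision in patch 4/5, assuming nothing beyond what the diff shows: pg_decode_filter_prepare returns true to mean "do not decode at PREPARE time", in which case the transaction is emitted later as an ordinary BEGIN/.../COMMIT. The types and names prefixed with Sketch/sketch_ are hypothetical stand-ins, not PostgreSQL APIs.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the plugin's private option state. */
typedef struct
{
    bool enable_twophase;   /* mirrors the "enable-twophase" option */
} SketchDecodingData;

/* Hypothetical labels for when a prepared transaction's changes are emitted. */
typedef enum
{
    SKETCH_EMIT_AT_PREPARE, /* changes + PREPARE TRANSACTION now, COMMIT PREPARED later */
    SKETCH_EMIT_AT_COMMIT   /* nothing at PREPARE, plain BEGIN/.../COMMIT later */
} SketchEmitMode;

/*
 * Same decision as pg_decode_filter_prepare in the patch:
 * returning true filters the PREPARE out, i.e. treats the
 * transaction as one-phase.
 */
static bool
sketch_filter_prepare(const SketchDecodingData *data, const char *gid)
{
    (void) gid;             /* the test plugin does not look at the GID */
    return !data->enable_twophase;
}

/* Translate the filter result into the observable output mode. */
static SketchEmitMode
sketch_emit_mode(const SketchDecodingData *data, const char *gid)
{
    return sketch_filter_prepare(data, gid)
        ? SKETCH_EMIT_AT_COMMIT
        : SKETCH_EMIT_AT_PREPARE;
}
```

This matches the expected output in prepared.out: regression_slot (option off) shows nothing after PREPARE TRANSACTION and a plain COMMIT later, while regression_slot_2pc (option on) shows the changes followed by PREPARE TRANSACTION and then a separate COMMIT PREPARED line.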
0005-Optional-Additional-test-case-to-demonstrate-decoding-rollbac.0404.patch (application/octet-stream)
From 05a90d44eeb75ed9684835fe1abefd58fbaf1774 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Wed, 4 Apr 2018 13:45:44 +0530
Subject: [PATCH 5/5] Additional test case to demonstrate decoding/rollback
interlocking
Introduce a decode-delay parameter in the test_decoding plugin. Based
on the value provided, the plugin sleeps for that many seconds while
holding the LogicalTransactionLock. A concurrent rollback is fired
off in the meantime, which aborts that transaction.
---
contrib/test_decoding/Makefile | 5 ++++-
contrib/test_decoding/test_decoding.c | 24 ++++++++++++++++++++++++
src/backend/replication/logical/reorderbuffer.c | 5 +++++
3 files changed, 33 insertions(+), 1 deletion(-)
diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 6c18189d9d..79b9622600 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -26,7 +26,7 @@ installcheck:;
# installation, allow to do so, but only if requested explicitly.
installcheck-force: regresscheck-install-force isolationcheck-install-force
-check: regresscheck isolationcheck
+check: regresscheck isolationcheck 2pc-check
submake-regress:
$(MAKE) -C $(top_builddir)/src/test/regress all
@@ -66,3 +66,6 @@ isolationcheck-install-force: all | submake-isolation submake-test_decoding temp
isolationcheck isolationcheck-install-force
temp-install: EXTRA_INSTALL=contrib/test_decoding
+
+2pc-check: temp-install
+ $(prove_check)
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 05b993fd7a..db7becdc44 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -37,6 +37,7 @@ typedef struct
bool xact_wrote_changes;
bool only_local;
bool enable_twophase;
+ int decode_delay; /* seconds to sleep after every change record */
} TestDecodingData;
static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -118,6 +119,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
data->skip_empty_xacts = false;
data->only_local = false;
data->enable_twophase = false;
+ data->decode_delay = 0;
ctx->output_plugin_private = data;
@@ -209,6 +211,21 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
errmsg("could not parse value \"%s\" for parameter \"%s\"",
strVal(elem->arg), elem->defname)));
}
+ else if (strcmp(elem->defname, "decode-delay") == 0)
+ {
+ if (elem->arg == NULL)
+ data->decode_delay = 2; /* default to 2 seconds */
+ else
+ data->decode_delay = pg_atoi(strVal(elem->arg),
+ sizeof(int), 0);
+
+ if (data->decode_delay <= 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("Specify positive value for parameter \"%s\","
+ " you specified \"%s\"",
+ elem->defname, strVal(elem->arg))));
+ }
else
{
ereport(ERROR,
@@ -548,6 +565,13 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
}
data->xact_wrote_changes = true;
+ /* if decode_delay is specified, sleep with above lock held */
+ if (data->decode_delay > 0)
+ {
+ elog(LOG, "sleeping for %d seconds", data->decode_delay);
+ pg_usleep(data->decode_delay * 1000000L);
+ }
+
class_form = RelationGetForm(relation);
tupdesc = RelationGetDescr(relation);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 1c7dbd3ade..adb6adef88 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1372,7 +1372,12 @@ ReorderBufferCommitInternal(ReorderBufferTXN *txn,
/* Lock transaction before catalog access */
if (!LogicalLockTransaction(txn))
+ {
+ elog(LOG, "stopping decoding of %s (%u)",
+ txn->gid[0] != '\0'? txn->gid:"",
+ txn->xid);
break;
+ }
reloid = RelidByRelfilenode(change->data.tp.relnode.spcNode,
change->data.tp.relnode.relNode);
LogicalUnlockTransaction(txn);
--
2.15.1 (Apple Git-101)
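
Reviewer's note on patch 5/5: decode_delay carries a count of seconds that is later multiplied into microseconds for pg_usleep, so it needs an integer type. A minimal sketch of what happens to that value when it passes through a C99 bool, assuming bool is _Bool from <stdbool.h> (on a build where bool is a plain char typedef the value would survive, but the declaration would still be misleading). The function names are hypothetical.

```c
#include <stdbool.h>

/* Delay in microseconds when the seconds value is stored in a C99 bool. */
static long
sketch_sleep_usec_bool(int seconds)
{
    bool delay = seconds;   /* conversion to _Bool: any nonzero value becomes 1 */

    return delay * 1000000L;
}

/* Delay in microseconds when the seconds value is stored in an int. */
static long
sketch_sleep_usec_int(int seconds)
{
    int delay = seconds;    /* value is preserved */

    return delay * 1000000L;
}
```

With a requested delay of 5 seconds, the bool variant yields a one-second sleep and the int variant yields five seconds, so a test that relies on a longer delay to let a concurrent rollback win the race could behave differently from what the option value suggests.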
0001-Cleaning-up-of-flags-in-ReorderBufferTXN-structure.0404.patch (application/octet-stream)
From 029d2bea5d20b3035e6db9c975a3f7035a151b4a Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Wed, 4 Apr 2018 11:39:47 +0530
Subject: [PATCH 1/5] Cleaning up of flags in ReorderBufferTXN structure
---
src/backend/replication/logical/reorderbuffer.c | 32 ++++++++++-----------
src/include/replication/reorderbuffer.h | 37 +++++++++++++------------
2 files changed, 36 insertions(+), 33 deletions(-)
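
Reviewer's sketch of the pattern patch 1/5 moves to: the separate boolean members (is_known_as_subxact, serialized, has_catalog_changes) are folded into one txn_flags bitmask that is set with |= and read through rbtxn_* test macros. The bit values and the Sketch/sketch_ names below are assumptions for illustration only; the real definitions are in the reorderbuffer.h hunk of the patch.

```c
#include <stdint.h>

/* Assumed bit assignments; only the idea of one bit per property matters. */
#define SKETCH_RBTXN_HAS_CATALOG_CHANGES 0x0001
#define SKETCH_RBTXN_IS_SUBXACT          0x0002
#define SKETCH_RBTXN_IS_SERIALIZED       0x0004

/* Hypothetical, reduced stand-in for ReorderBufferTXN. */
typedef struct
{
    uint32_t txn_flags;     /* replaces the individual bool members */
} SketchTXN;

/* Test macros in the style of rbtxn_has_catalog_changes() and friends. */
#define sketch_rbtxn_has_catalog_changes(txn) \
    (((txn)->txn_flags & SKETCH_RBTXN_HAS_CATALOG_CHANGES) != 0)
#define sketch_rbtxn_is_known_subxact(txn) \
    (((txn)->txn_flags & SKETCH_RBTXN_IS_SUBXACT) != 0)
#define sketch_rbtxn_is_serialized(txn) \
    (((txn)->txn_flags & SKETCH_RBTXN_IS_SERIALIZED) != 0)

/* Setting a property is a bitwise OR, as in the patched .c file. */
static void
sketch_mark_serialized(SketchTXN *txn)
{
    txn->txn_flags |= SKETCH_RBTXN_IS_SERIALIZED;
}
```

Setting one flag leaves the others untouched, which is what allows later patches in the series to add further per-transaction state without adding more struct members.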
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b4016ed52b..3c9af58640 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -636,7 +636,7 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
if (prev_first_lsn != InvalidXLogRecPtr)
Assert(prev_first_lsn < cur_txn->first_lsn);
- Assert(!cur_txn->is_known_as_subxact);
+ Assert(!rbtxn_is_known_subxact(cur_txn));
prev_first_lsn = cur_txn->first_lsn;
}
#endif
@@ -654,7 +654,7 @@ ReorderBufferGetOldestTXN(ReorderBuffer *rb)
txn = dlist_head_element(ReorderBufferTXN, node, &rb->toplevel_by_lsn);
- Assert(!txn->is_known_as_subxact);
+ Assert(!rbtxn_is_known_subxact(txn));
Assert(txn->first_lsn != InvalidXLogRecPtr);
return txn;
}
@@ -688,9 +688,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
dlist_push_tail(&txn->subtxns, &subtxn->node);
txn->nsubtxns++;
}
- else if (!subtxn->is_known_as_subxact)
+ else if (!rbtxn_is_known_subxact(subtxn))
{
- subtxn->is_known_as_subxact = true;
+ subtxn->txn_flags |= RBTXN_IS_SUBXACT;
Assert(subtxn->nsubtxns == 0);
/* remove from lsn order list of top-level transactions */
@@ -751,9 +751,9 @@ ReorderBufferCommitChild(ReorderBuffer *rb, TransactionId xid,
subtxn->final_lsn = commit_lsn;
subtxn->end_lsn = end_lsn;
- if (!subtxn->is_known_as_subxact)
+ if (!rbtxn_is_known_subxact(subtxn))
{
- subtxn->is_known_as_subxact = true;
+ subtxn->txn_flags |= RBTXN_IS_SUBXACT;
Assert(subtxn->nsubtxns == 0);
/* remove from lsn order list of top-level transactions */
@@ -862,7 +862,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
ReorderBufferChange *cur_change;
- if (txn->serialized)
+ if (rbtxn_is_serialized(txn))
{
/* serialize remaining changes */
ReorderBufferSerializeTXN(rb, txn);
@@ -891,7 +891,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
ReorderBufferChange *cur_change;
- if (cur_txn->serialized)
+ if (rbtxn_is_serialized(cur_txn))
{
/* serialize remaining changes */
ReorderBufferSerializeTXN(rb, cur_txn);
@@ -1057,7 +1057,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
* they originally were happening inside another subtxn, so we won't
* ever recurse more than one level deep here.
*/
- Assert(subtxn->is_known_as_subxact);
+ Assert(rbtxn_is_known_subxact(subtxn));
Assert(subtxn->nsubtxns == 0);
ReorderBufferCleanupTXN(rb, subtxn);
@@ -1096,7 +1096,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
/*
* Remove TXN from its containing list.
*
- * Note: if txn->is_known_as_subxact, we are deleting the TXN from its
+ * Note: if txn is known as subxact, we are deleting the TXN from its
* parent's list of known subxacts; this leaves the parent's nsubxacts
* count too high, but we don't care. Otherwise, we are deleting the TXN
* from the LSN-ordered list of toplevel TXNs.
@@ -1111,7 +1111,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
Assert(found);
/* remove entries spilled to disk */
- if (txn->serialized)
+ if (rbtxn_is_serialized(txn))
ReorderBufferRestoreCleanup(rb, txn);
/* deallocate */
@@ -1128,7 +1128,7 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
dlist_iter iter;
HASHCTL hash_ctl;
- if (!txn->has_catalog_changes || dlist_is_empty(&txn->tuplecids))
+ if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
return;
memset(&hash_ctl, 0, sizeof(hash_ctl));
@@ -1708,7 +1708,7 @@ ReorderBufferAbortOld(ReorderBuffer *rb, TransactionId oldestRunningXid)
* final_lsn to that of their last change; this causes
* ReorderBufferRestoreCleanup to do the right thing.
*/
- if (txn->serialized && txn->final_lsn == 0)
+ if (rbtxn_is_serialized(txn) && txn->final_lsn == 0)
{
ReorderBufferChange *last =
dlist_tail_element(ReorderBufferChange, node, &txn->changes);
@@ -1954,7 +1954,7 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
- txn->has_catalog_changes = true;
+ txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
}
/*
@@ -1971,7 +1971,7 @@ ReorderBufferXidHasCatalogChanges(ReorderBuffer *rb, TransactionId xid)
if (txn == NULL)
return false;
- return txn->has_catalog_changes;
+ return rbtxn_has_catalog_changes(txn);
}
/*
@@ -2111,7 +2111,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
Assert(spilled == txn->nentries_mem);
Assert(dlist_is_empty(&txn->changes));
txn->nentries_mem = 0;
- txn->serialized = true;
+ txn->txn_flags |= RBTXN_IS_SERIALIZED;
if (fd != -1)
CloseTransientFile(fd);
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index aa430c843c..177ef98e43 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -137,21 +137,33 @@ typedef struct ReorderBufferChange
dlist_node node;
} ReorderBufferChange;
+/* ReorderBufferTXN flags */
+#define RBTXN_HAS_CATALOG_CHANGES 0x0001
+#define RBTXN_IS_SUBXACT 0x0002
+#define RBTXN_IS_SERIALIZED 0x0004
+
+/* does the txn have catalog changes */
+#define rbtxn_has_catalog_changes(txn) (txn->txn_flags & RBTXN_HAS_CATALOG_CHANGES)
+/* is the txn known as a subxact? */
+#define rbtxn_is_known_subxact(txn) (txn->txn_flags & RBTXN_IS_SUBXACT)
+/*
+ * Has this transaction been spilled to disk? It's not always possible to
+ * deduce that fact by comparing nentries with nentries_mem, because e.g.
+ * subtransactions of a large transaction might get serialized together
+ * with the parent - if they're restored to memory they'd have
+ * nentries_mem == nentries.
+ */
+#define rbtxn_is_serialized(txn) (txn->txn_flags & RBTXN_IS_SERIALIZED)
+
typedef struct ReorderBufferTXN
{
+ int txn_flags;
+
/*
* The transactions transaction id, can be a toplevel or sub xid.
*/
TransactionId xid;
- /* did the TX have catalog changes */
- bool has_catalog_changes;
-
- /*
- * Do we know this is a subxact?
- */
- bool is_known_as_subxact;
-
/*
* LSN of the first data carrying, WAL record with knowledge about this
* xid. This is allowed to *not* be first record adorned with this xid, if
@@ -214,15 +226,6 @@ typedef struct ReorderBufferTXN
*/
uint64 nentries_mem;
- /*
- * Has this transaction been spilled to disk? It's not always possible to
- * deduce that fact by comparing nentries with nentries_mem, because e.g.
- * subtransactions of a large transaction might get serialized together
- * with the parent - if they're restored to memory they'd have
- * nentries_mem == nentries.
- */
- bool serialized;
-
/*
* List of ReorderBufferChange structs, including new Snapshots and new
* CommandIds
--
2.15.1 (Apple Git-101)
This is due to the new ERROR handling code that I added today for the
lock/unlock APIs. Will fix.
Fixed. I continue to test this area for other issues.
Revised the patch after more testing and added more documentation in
the ERROR handling code path.
I tested ERROR handling by ensuring that LogicalLock is held by
multiple backends and then inducing an ERROR while holding it. The
handling in ProcKill correctly removes the entries for these backends
as part of ERROR cleanup. A subsequent ROLLBACK then removes the one
remaining entry, which belongs to the leader, from the decodeGroup.
This seems to be holding up OK.
I had also left out a new test file for the optional 0005 patch earlier.
That is included now as well.
Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Attachments:
0001-Cleaning-up-of-flags-in-ReorderBufferTXN-structure.0404.v2.0.patch (application/octet-stream)
From db701753638e9eeeec22b820758eb034e14ba4e6 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Wed, 4 Apr 2018 13:59:21 +0530
Subject: [PATCH 1/5] Cleaning up of flags in ReorderBufferTXN structure
---
src/backend/replication/logical/reorderbuffer.c | 32 ++++++++++-----------
src/include/replication/reorderbuffer.h | 37 +++++++++++++------------
2 files changed, 36 insertions(+), 33 deletions(-)
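Note for reviewers (not part of the patch): a minimal standalone sketch of the flag pattern this patch switches to, in case it helps to see it outside the diff. DemoTXN and the demo_* names are hypothetical stand-ins for ReorderBufferTXN and the rbtxn_* macros. Parenthesising the macro argument and comparing against zero are assumptions of this sketch, not something the patch itself does.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag bits, with the same values the patch defines. */
#define RBTXN_HAS_CATALOG_CHANGES 0x0001
#define RBTXN_IS_SUBXACT          0x0002
#define RBTXN_IS_SERIALIZED       0x0004

/* Hypothetical stand-in for ReorderBufferTXN; only the flags are modelled. */
typedef struct DemoTXN
{
	int			txn_flags;
} DemoTXN;

/*
 * Accessors in the style of the patch. The extra parentheses and the
 * "!= 0" are additions made for this sketch.
 */
#define demo_has_catalog_changes(txn) \
	(((txn)->txn_flags & RBTXN_HAS_CATALOG_CHANGES) != 0)
#define demo_is_known_subxact(txn) \
	(((txn)->txn_flags & RBTXN_IS_SUBXACT) != 0)
#define demo_is_serialized(txn) \
	(((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0)

/* Set two flags, check the third stays clear, and return the flags word. */
int
demo_flags(void)
{
	DemoTXN		txn = {0};

	txn.txn_flags |= RBTXN_IS_SUBXACT;		/* was: is_known_as_subxact = true */
	txn.txn_flags |= RBTXN_IS_SERIALIZED;	/* was: serialized = true */

	assert(demo_is_known_subxact(&txn));
	assert(demo_is_serialized(&txn));
	assert(!demo_has_catalog_changes(&txn));

	return txn.txn_flags;
}
```

Comparing against zero keeps the result a strict 0/1 value, which could matter if a later flag bit falls outside the low byte and the result is stored in a narrow boolean type.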
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b4016ed52b..3c9af58640 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -636,7 +636,7 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
if (prev_first_lsn != InvalidXLogRecPtr)
Assert(prev_first_lsn < cur_txn->first_lsn);
- Assert(!cur_txn->is_known_as_subxact);
+ Assert(!rbtxn_is_known_subxact(cur_txn));
prev_first_lsn = cur_txn->first_lsn;
}
#endif
@@ -654,7 +654,7 @@ ReorderBufferGetOldestTXN(ReorderBuffer *rb)
txn = dlist_head_element(ReorderBufferTXN, node, &rb->toplevel_by_lsn);
- Assert(!txn->is_known_as_subxact);
+ Assert(!rbtxn_is_known_subxact(txn));
Assert(txn->first_lsn != InvalidXLogRecPtr);
return txn;
}
@@ -688,9 +688,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
dlist_push_tail(&txn->subtxns, &subtxn->node);
txn->nsubtxns++;
}
- else if (!subtxn->is_known_as_subxact)
+ else if (!rbtxn_is_known_subxact(subtxn))
{
- subtxn->is_known_as_subxact = true;
+ subtxn->txn_flags |= RBTXN_IS_SUBXACT;
Assert(subtxn->nsubtxns == 0);
/* remove from lsn order list of top-level transactions */
@@ -751,9 +751,9 @@ ReorderBufferCommitChild(ReorderBuffer *rb, TransactionId xid,
subtxn->final_lsn = commit_lsn;
subtxn->end_lsn = end_lsn;
- if (!subtxn->is_known_as_subxact)
+ if (!rbtxn_is_known_subxact(subtxn))
{
- subtxn->is_known_as_subxact = true;
+ subtxn->txn_flags |= RBTXN_IS_SUBXACT;
Assert(subtxn->nsubtxns == 0);
/* remove from lsn order list of top-level transactions */
@@ -862,7 +862,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
ReorderBufferChange *cur_change;
- if (txn->serialized)
+ if (rbtxn_is_serialized(txn))
{
/* serialize remaining changes */
ReorderBufferSerializeTXN(rb, txn);
@@ -891,7 +891,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
ReorderBufferChange *cur_change;
- if (cur_txn->serialized)
+ if (rbtxn_is_serialized(cur_txn))
{
/* serialize remaining changes */
ReorderBufferSerializeTXN(rb, cur_txn);
@@ -1057,7 +1057,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
* they originally were happening inside another subtxn, so we won't
* ever recurse more than one level deep here.
*/
- Assert(subtxn->is_known_as_subxact);
+ Assert(rbtxn_is_known_subxact(subtxn));
Assert(subtxn->nsubtxns == 0);
ReorderBufferCleanupTXN(rb, subtxn);
@@ -1096,7 +1096,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
/*
* Remove TXN from its containing list.
*
- * Note: if txn->is_known_as_subxact, we are deleting the TXN from its
+ * Note: if txn is known as subxact, we are deleting the TXN from its
* parent's list of known subxacts; this leaves the parent's nsubxacts
* count too high, but we don't care. Otherwise, we are deleting the TXN
* from the LSN-ordered list of toplevel TXNs.
@@ -1111,7 +1111,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
Assert(found);
/* remove entries spilled to disk */
- if (txn->serialized)
+ if (rbtxn_is_serialized(txn))
ReorderBufferRestoreCleanup(rb, txn);
/* deallocate */
@@ -1128,7 +1128,7 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
dlist_iter iter;
HASHCTL hash_ctl;
- if (!txn->has_catalog_changes || dlist_is_empty(&txn->tuplecids))
+ if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
return;
memset(&hash_ctl, 0, sizeof(hash_ctl));
@@ -1708,7 +1708,7 @@ ReorderBufferAbortOld(ReorderBuffer *rb, TransactionId oldestRunningXid)
* final_lsn to that of their last change; this causes
* ReorderBufferRestoreCleanup to do the right thing.
*/
- if (txn->serialized && txn->final_lsn == 0)
+ if (rbtxn_is_serialized(txn) && txn->final_lsn == 0)
{
ReorderBufferChange *last =
dlist_tail_element(ReorderBufferChange, node, &txn->changes);
@@ -1954,7 +1954,7 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
- txn->has_catalog_changes = true;
+ txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
}
/*
@@ -1971,7 +1971,7 @@ ReorderBufferXidHasCatalogChanges(ReorderBuffer *rb, TransactionId xid)
if (txn == NULL)
return false;
- return txn->has_catalog_changes;
+ return rbtxn_has_catalog_changes(txn);
}
/*
@@ -2111,7 +2111,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
Assert(spilled == txn->nentries_mem);
Assert(dlist_is_empty(&txn->changes));
txn->nentries_mem = 0;
- txn->serialized = true;
+ txn->txn_flags |= RBTXN_IS_SERIALIZED;
if (fd != -1)
CloseTransientFile(fd);
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index aa430c843c..177ef98e43 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -137,21 +137,33 @@ typedef struct ReorderBufferChange
dlist_node node;
} ReorderBufferChange;
+/* ReorderBufferTXN flags */
+#define RBTXN_HAS_CATALOG_CHANGES 0x0001
+#define RBTXN_IS_SUBXACT 0x0002
+#define RBTXN_IS_SERIALIZED 0x0004
+
+/* does the txn have catalog changes */
+#define rbtxn_has_catalog_changes(txn) (txn->txn_flags & RBTXN_HAS_CATALOG_CHANGES)
+/* is the txn known as a subxact? */
+#define rbtxn_is_known_subxact(txn) (txn->txn_flags & RBTXN_IS_SUBXACT)
+/*
+ * Has this transaction been spilled to disk? It's not always possible to
+ * deduce that fact by comparing nentries with nentries_mem, because e.g.
+ * subtransactions of a large transaction might get serialized together
+ * with the parent - if they're restored to memory they'd have
+ * nentries_mem == nentries.
+ */
+#define rbtxn_is_serialized(txn) (txn->txn_flags & RBTXN_IS_SERIALIZED)
+
typedef struct ReorderBufferTXN
{
+ int txn_flags;
+
/*
* The transactions transaction id, can be a toplevel or sub xid.
*/
TransactionId xid;
- /* did the TX have catalog changes */
- bool has_catalog_changes;
-
- /*
- * Do we know this is a subxact?
- */
- bool is_known_as_subxact;
-
/*
* LSN of the first data carrying, WAL record with knowledge about this
* xid. This is allowed to *not* be first record adorned with this xid, if
@@ -214,15 +226,6 @@ typedef struct ReorderBufferTXN
*/
uint64 nentries_mem;
- /*
- * Has this transaction been spilled to disk? It's not always possible to
- * deduce that fact by comparing nentries with nentries_mem, because e.g.
- * subtransactions of a large transaction might get serialized together
- * with the parent - if they're restored to memory they'd have
- * nentries_mem == nentries.
- */
- bool serialized;
-
/*
* List of ReorderBufferChange structs, including new Snapshots and new
* CommandIds
--
2.15.1 (Apple Git-101)
0002-Introduce-LogicalLockTransaction-LogicalUnlockTransa.0404.v2.0.patch (application/octet-stream)
From 0cadf85ed535a3bf1983c766d5d16bc359ef03c9 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Wed, 4 Apr 2018 17:16:41 +0530
Subject: [PATCH 2/5] Introduce LogicalLockTransaction/LogicalUnlockTransaction
APIs
When a transaction aborts, its changes are considered unnecessary
for other transactions. That means the changes may be either cleaned
up by vacuum or removed from HOT chains (thus made inaccessible
through indexes), and there may be other such consequences.
When decoding committed transactions this is not an issue, and we
never decode transactions that abort before the decoding starts
(where decoding means passing it to ReorderBufferCommit).
But for in-progress transactions - for example when decoding prepared
transactions on PREPARE (and not COMMIT PREPARED as before), this
may cause failures when the output plugin consults catalogs (both
system and user-defined).
To prevent aborts concurrent with plugins accessing catalogs, we
introduce an API the output plugins are required to use (when
decoding in-progress transactions only).
Before accessing any catalogs, output plugins are required to call
LogicalLockTransaction, and to call LogicalUnlockTransaction once the
access is done. This is implemented by adding support for decoding
groups. Use LockHashPartitionLockByProc on the group leader to get the
LWLock protecting the decode group fields. For prepared and uncommitted
transactions, decoding backends working on the same XID will link
themselves up to the corresponding PGPROC entry (decodeGroupLeader).
They will remove themselves when they are done decoding.
If the prepared or uncommitted transaction decides to abort, then
the decodeGroupLeader will set the decodeAbortPending flag allowing
the decodeGroupMembers to abort their decoding appropriately.
If any of the decode group members errors out, we also remove that
proc from the group membership.
---
src/backend/replication/logical/logical.c | 215 +++++++++++++++
src/backend/storage/ipc/procarray.c | 39 +++
src/backend/storage/lmgr/README | 46 ++++
src/backend/storage/lmgr/proc.c | 438 +++++++++++++++++++++++++++++-
src/include/replication/logical.h | 2 +
src/include/replication/reorderbuffer.h | 15 +
src/include/storage/proc.h | 26 ++
src/include/storage/procarray.h | 1 +
8 files changed, 773 insertions(+), 9 deletions(-)
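Note for reviewers (not part of the patch): a rough sketch of how an output plugin is expected to bracket catalog access with the new calls. Everything below is a mock so that it compiles on its own: DemoTXN, demo_lock_transaction, demo_unlock_transaction, demo_catalog_lookup and demo_change_cb are hypothetical names, and the mock lock only models the "abort pending" outcome, not the decode group bookkeeping.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for ReorderBufferTXN; not the real structure. */
typedef struct DemoTXN
{
	bool		abort_pending;	/* pretend a concurrent abort was requested */
	bool		locked;			/* true between lock and unlock */
} DemoTXN;

/* Counts mock catalog lookups so the behaviour can be checked. */
int			demo_catalog_lookups = 0;

/* Mock of LogicalLockTransaction: false means "stop decoding this txn". */
bool
demo_lock_transaction(DemoTXN *txn)
{
	if (txn->abort_pending)
		return false;
	txn->locked = true;
	return true;
}

/* Mock of LogicalUnlockTransaction: harmless when nothing is locked. */
void
demo_unlock_transaction(DemoTXN *txn)
{
	txn->locked = false;
}

/* Hypothetical catalog lookup; only legal while the txn is locked. */
void
demo_catalog_lookup(DemoTXN *txn)
{
	assert(txn->locked);
	demo_catalog_lookups++;
}

/*
 * Shape of a plugin change callback. Returns false when the plugin
 * should stop decoding because the transaction is aborting.
 */
bool
demo_change_cb(DemoTXN *txn)
{
	if (!demo_lock_transaction(txn))
		return false;

	demo_catalog_lookup(txn);	/* catalog access only inside the bracket */
	demo_unlock_transaction(txn);

	return true;
}
```

The point of the sketch is the call order: lock, check the result, touch catalogs, unlock immediately. A real plugin would pass the ReorderBufferTXN it was handed and abandon the transaction when the lock call returns false.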
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 3d8ad7ddf8..2238066138 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1017,3 +1017,218 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
SpinLockRelease(&MyReplicationSlot->mutex);
}
}
+
+/*
+ * LogicalLockTransaction
+ * Make sure the transaction is not aborted during decoding.
+ *
+ * The logical decoding plugins may need to access catalogs (both system
+ * and user-defined), e.g. to get metadata about tuples, do custom
+ * filtering etc. While decoding committed transactions that is not an
+ * issue, but in-progress transactions may abort while being decoded, in
+ * which case the catalog access may fail in various ways (rows from
+ * aborted transactions are eligible for more aggressive cleanup, may
+ * not be accessible through indexes due to breaking HOT chains etc.).
+ *
+ * To prevent these issues, we need to prevent abort of the transaction
+ * while accessing any catalogs. To enforce that, each decoding backend
+ * has to call LogicalLockTransaction prior to any catalog access, and
+ * then LogicalUnlockTransaction immediately after it. These functions
+ * add/remove the decoding backend from a "decoding group" for a given
+ * transaction. While aborting a prepared transaction, the backend will
+ * wait for all current members of the decoding group to leave (see
+ * LogicalDecodeRemoveTransaction).
+ *
+ * The function returns true when it's safe to access catalogs, and
+ * false when the transaction aborted (or is being aborted) in which
+ * case the plugin should stop decoding it.
+ *
+ * The decoding backend joins the decoding group only when actually
+ * needed. For example when the transaction did no catalog changes,
+ * or when it's known to already have committed (or aborted), we can
+ * bail out without joining the group.
+ */
+bool
+LogicalLockTransaction(ReorderBufferTXN *txn)
+{
+ bool ok = false;
+ LWLock *leader_lwlock;
+
+ /*
+ * Transactions that have not modified catalogs do not need to
+ * join the decoding group.
+ */
+ if (!rbtxn_has_catalog_changes(txn))
+ return true;
+
+ /*
+ * Check commit status. If a transaction already committed, there
+ * is no danger when accessing catalogs. If it aborted, we can
+ * stop decoding it right away.
+ */
+ if (rbtxn_commit(txn))
+ return true;
+
+ if (rbtxn_rollback(txn))
+ return false;
+
+ /*
+ * Currently, only 2PC transactions can be decoded before commit
+ * (at prepare). So regular transactions are automatically safe.
+ */
+ if (!rbtxn_prepared(txn))
+ return true;
+
+ /*
+ * Find the PROC handling this XID and join the decoding group.
+ *
+ * If this is the first call for this XID, we don't know which
+ * PROC is executing the transaction (and acting as a leader).
+ * In that case we need to lookup and possibly also assign
+ * the leader.
+ */
+ if (MyProc->decodeGroupLeader == NULL)
+ {
+ PGPROC *proc = AssignDecodeGroupLeader(txn->xid);
+
+ /*
+ * We have checked if the transaction committed/aborted, but it
+ * is possible the PROC went away since then, in which case we
+ * get decodeGroupLeader=NULL. We recheck transaction status,
+ * expecting it to be either committed or aborted.
+ *
+ * If the PROC is available, add ourself as a member of its
+ * decoding group. Note that we're not holding any locks on PGPROC,
+ * so it's possible the leader disappears, or starts executing
+ * another transaction. In that case we're done.
+ */
+ if (proc == NULL ||
+ !BecomeDecodeGroupMember(proc, proc->pid, rbtxn_prepared(txn)))
+ {
+ Assert(!TransactionIdIsInProgress(txn->xid));
+ if (TransactionIdDidCommit(txn->xid))
+ {
+ txn->txn_flags |= RBTXN_COMMIT;
+ return true;
+ }
+ else
+ {
+ txn->txn_flags |= RBTXN_ROLLBACK;
+ return false;
+ }
+ }
+ }
+
+ /*
+ * If we were able to add ourself, then Abort processing will
+ * interlock with us.
+ */
+ Assert(MyProc->decodeGroupLeader);
+
+ leader_lwlock = LockHashPartitionLockByProc(MyProc->decodeGroupLeader);
+ LWLockAcquire(leader_lwlock, LW_SHARED);
+
+ /*
+ * Re-check if we were told to abort by the leader after taking
+ * the above lock
+ */
+ if (MyProc->decodeAbortPending)
+ {
+ /*
+ * Remove ourself from the decodeGroupMembership and return
+ * false so that the decoding plugin also initiates abort
+ * processing
+ */
+ LWLockRelease(leader_lwlock);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ RemoveDecodeGroupMemberLocked(MyProc->decodeGroupLeader);
+ /* reset the bool to let the leader know that we are going away */
+ MyProc->decodeAbortPending = false;
+ MyProc->decodeLocked = false;
+ txn->txn_flags |= RBTXN_ROLLBACK;
+ ok = false;
+ }
+ else
+ {
+ MyProc->decodeLocked = true;
+ ok = true;
+ }
+ LWLockRelease(leader_lwlock);
+
+ return ok;
+}
+
+/*
+ * LogicalUnlockTransaction
+ * Indicate that the logical decoding plugin is done accessing
+ * catalog information.
+ *
+ *
+ * To prevent issues while decoding of in-progress transactions, we
+ * need to prevent abort of the transaction while accessing any catalogs.
+ * To enforce that, each decoding backend has to call
+ * LogicalLockTransaction prior to any catalog access, and then
+ * LogicalUnlockTransaction immediately after it. This unlock function
+ * removes the decoding backend from a "decoding group" for a given
+ * transaction.
+ */
+void
+LogicalUnlockTransaction(ReorderBufferTXN *txn)
+{
+ LWLock *leader_lwlock;
+
+ /*
+ * If the transaction is known to have aborted, we should have never got
+ * here (the plugin should have interrupted the decoding).
+ */
+ Assert(!rbtxn_rollback(txn));
+
+ /* If it's not locked, we're done. */
+ if (!MyProc->decodeLocked)
+ return;
+
+ /*
+ * Transactions that have not modified catalogs do not need to
+ * join the decoding group.
+ */
+ if (!rbtxn_has_catalog_changes(txn))
+ return;
+
+ /*
+ * Currently, only 2PC transactions can be decoded before commit
+ * (at prepare). So regular transactions are automatically safe.
+ */
+ if (!rbtxn_prepared(txn))
+ return;
+
+ /*
+ * Check commit status. If a transaction already committed, there
+ * is no danger when accessing catalogs.
+ */
+ if (rbtxn_commit(txn))
+ return;
+
+ Assert(MyProc->decodeGroupLeader);
+ leader_lwlock = LockHashPartitionLockByProc(MyProc->decodeGroupLeader);
+ LWLockAcquire(leader_lwlock, LW_SHARED);
+ if (MyProc->decodeAbortPending)
+ {
+ /*
+ * Remove ourself from the decodeGroupMembership
+ */
+ LWLockRelease(leader_lwlock);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ RemoveDecodeGroupMemberLocked(MyProc->decodeGroupLeader);
+
+ /*
+ * reset the bool since it's a PGPROC field and we don't want
+ * things loitering around in it.
+ */
+ MyProc->decodeAbortPending = false;
+
+ txn->txn_flags |= RBTXN_ROLLBACK;
+ }
+ MyProc->decodeLocked = false;
+ LWLockRelease(leader_lwlock);
+ return;
+}
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index afe1c03aa3..2be2910207 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -2432,6 +2432,45 @@ BackendXidGetPid(TransactionId xid)
return result;
}
+/*
+ * BackendXidGetProc -- get a backend's PGPROC given its XID
+ *
+ * Note that it is up to the caller to be sure that the question
+ * remains meaningful for long enough for the answer to be used ...
+ *
+ * Only main transaction Ids are considered.
+ *
+ */
+PGPROC *
+BackendXidGetProc(TransactionId xid)
+{
+ PGPROC *result = NULL;
+ ProcArrayStruct *arrayP = procArray;
+ int index;
+
+ if (xid == InvalidTransactionId) /* never match invalid xid */
+ return NULL;
+
+ LWLockAcquire(ProcArrayLock, LW_SHARED);
+
+ for (index = 0; index < arrayP->numProcs; index++)
+ {
+ int pgprocno = arrayP->pgprocnos[index];
+ PGPROC *proc = &allProcs[pgprocno];
+ volatile PGXACT *pgxact = &allPgXact[pgprocno];
+
+ if (pgxact->xid == xid)
+ {
+ result = proc;
+ break;
+ }
+ }
+
+ LWLockRelease(ProcArrayLock);
+
+ return result;
+}
+
/*
* IsBackendPid -- is a given pid a running backend
*
diff --git a/src/backend/storage/lmgr/README b/src/backend/storage/lmgr/README
index 56b0a12a3e..4b4b9c5958 100644
--- a/src/backend/storage/lmgr/README
+++ b/src/backend/storage/lmgr/README
@@ -679,6 +679,52 @@ worker, and the worker fails to join the lock group unless the given PGPROC
still has the same PID and is still a lock group leader. We assume that
PIDs are not recycled quickly enough for this interlock to fail.
+Decode Group Locking
+--------------------
+
+When decoding in-progress transactions, we need to prevent aborts while
+the decoding processes are accessing catalogs, which might lead to issues
+if the transaction modified some of the catalogs. Currently this applies
+only to two-phase transactions, that may be decoded at PREPARE time, but
+in the future this may be extended to regular transactions too.
+
+To prevent that, the backend executing the abort is made to wait for all
+the decoding backends. We use an infrastructure which is very similar
+to the above group locking to form groups of backends performing logical
+decoding of the same in-progress transaction.
+
+Decode Group locking adds five new members to each PGPROC:
+decodeGroupLeader, decodeGroupMembers, decodeGroupLink, decodeLocked and
+decodeAbortPending. A PGPROC's decodeGroupLeader is NULL for processes
+not involved in logical decoding. When a process wants to decode an
+in-progress transaction, it looks up the PGPROC structure associated
+with that transaction ID and makes that PGPROC structure its
+decodeGroupLeader. The decodeGroupMembers field is only used in the
+leader; it is a list of the member PGPROCs of the decode group (the
+leader and all backends decoding this transaction ID).
+The decodeGroupLink field is the list link for this list. The decoding
+backend sets decodeLocked via the LogicalLockTransaction API before it
+accesses catalog metadata for its decoding requirements, and clears it
+again via the LogicalUnlockTransaction API.
+
+Meanwhile, if this in-progress transaction is to be aborted, then
+decodeAbortPending is set on the PGPROC corresponding to it and also
+on all the decodeGroupMembers entries.
+
+The decodeGroupMembers entries stop decoding this transaction and exit.
+When all the decoding backends have exited, the abort can proceed.
+
+All five of these fields are considered to be protected by a lock manager
+partition lock. The partition lock that protects these fields within a given
+decode group is chosen by taking the leader's pgprocno modulo the number of
+lock manager partitions. Holding this single lock allows safe manipulation
+of the decodeGroupMembers list for the decode group.
+
+The decodeGroupLeader's PGPROC and its PID are accessible to each
+decoding backend. The decoding backend fails to join the decode
+group unless the given PGPROC still has the same PID and is still
+a decode group leader. We assume that PIDs are not recycled quickly
+enough for this interlock to fail.
User Locks (Advisory Locks)
---------------------------
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 6f30e082b2..2c002a2274 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -267,6 +267,11 @@ InitProcGlobal(void)
/* Initialize lockGroupMembers list. */
dlist_init(&procs[i].lockGroupMembers);
+
+ /* Initialize decodeGroupMembers list. */
+ dlist_init(&procs[i].decodeGroupMembers);
+ procs[i].decodeAbortPending = false;
+ procs[i].decodeLocked = false;
}
/*
@@ -406,6 +411,12 @@ InitProcess(void)
Assert(MyProc->lockGroupLeader == NULL);
Assert(dlist_is_empty(&MyProc->lockGroupMembers));
+ /* Check that group decode fields are in a proper initial state. */
+ Assert(MyProc->decodeGroupLeader == NULL);
+ Assert(dlist_is_empty(&MyProc->decodeGroupMembers));
+ MyProc->decodeAbortPending = false;
+ MyProc->decodeLocked = false;
+
/* Initialize wait event information. */
MyProc->wait_event_info = 0;
@@ -581,6 +592,12 @@ InitAuxiliaryProcess(void)
Assert(MyProc->lockGroupLeader == NULL);
Assert(dlist_is_empty(&MyProc->lockGroupMembers));
+ /* Check that group decode fields are in a proper initial state. */
+ Assert(MyProc->decodeGroupLeader == NULL);
+ Assert(dlist_is_empty(&MyProc->decodeGroupMembers));
+ MyProc->decodeAbortPending = false;
+ MyProc->decodeLocked = false;
+
/*
* We might be reusing a semaphore that belonged to a failed process. So
* be careful and reinitialize its value here. (This is not strictly
@@ -826,9 +843,14 @@ ProcKill(int code, Datum arg)
/*
* Detach from any lock group of which we are a member. If the leader
- * exist before all other group members, it's PGPROC will remain allocated
+ * exits before all other group members, its PGPROC will remain allocated
* until the last group process exits; that process must return the
* leader's PGPROC to the appropriate list.
+ *
+ * The below code needs to be mindful of the presence of decode group
+ * entries in case of logical decoding. However, lock groups are for
+ * parallel workers so we typically won't be finding both present
+ * together in the same proc.
*/
if (MyProc->lockGroupLeader != NULL)
{
@@ -845,11 +867,19 @@ ProcKill(int code, Datum arg)
{
procgloballist = leader->procgloballist;
- /* Leader exited first; return its PGPROC. */
- SpinLockAcquire(ProcStructLock);
- leader->links.next = (SHM_QUEUE *) *procgloballist;
- *procgloballist = leader;
- SpinLockRelease(ProcStructLock);
+ /*
+ * Leader exited first; return its PGPROC.
+ * Only do this if it does not have any decode
+ * group members though. Otherwise the decode
+ * group code will release it later.
+ */
+ if (leader->decodeGroupLeader == NULL)
+ {
+ SpinLockAcquire(ProcStructLock);
+ leader->links.next = (SHM_QUEUE *) *procgloballist;
+ *procgloballist = leader;
+ SpinLockRelease(ProcStructLock);
+ }
}
}
else if (leader != MyProc)
@@ -857,6 +887,54 @@ ProcKill(int code, Datum arg)
LWLockRelease(leader_lwlock);
}
+ /*
+ * Detach from any decode group of which we are a member. If the leader
+ * exits before all other group members, its PGPROC will remain allocated
+ * until the last group process exits; that process must return the
+ * leader's PGPROC to the appropriate list.
+ */
+ if (MyProc->decodeGroupLeader != NULL)
+ {
+ PGPROC *leader = MyProc->decodeGroupLeader;
+ LWLock *leader_lwlock = LockHashPartitionLockByProc(leader);
+
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ dlist_delete(&MyProc->decodeGroupLink);
+ if (dlist_is_empty(&leader->decodeGroupMembers))
+ {
+ leader->decodeGroupLeader = NULL;
+ if (leader != MyProc)
+ {
+ procgloballist = leader->procgloballist;
+
+ /*
+ * Leader exited first; return its PGPROC.
+ * But check if it was already done above
+ * by the lockGroup code
+ */
+ if (leader != *procgloballist)
+ {
+ SpinLockAcquire(ProcStructLock);
+ leader->links.next = (SHM_QUEUE *) *procgloballist;
+ *procgloballist = leader;
+ SpinLockRelease(ProcStructLock);
+ }
+ }
+ /* clear leader flags */
+ leader->decodeAbortPending = false;
+ leader->decodeLocked = false;
+ }
+ else if (leader != MyProc)
+ {
+ MyProc->decodeGroupLeader = NULL;
+ /* clear proc flags */
+ MyProc->decodeLocked = false;
+ MyProc->decodeAbortPending = false;
+ }
+ LWLockRelease(leader_lwlock);
+ }
+
/*
* Reset MyLatch to the process local one. This is so that signal
* handlers et al can continue using the latch after the shared latch
@@ -881,9 +959,36 @@ ProcKill(int code, Datum arg)
/* Since lockGroupLeader is NULL, lockGroupMembers should be empty. */
Assert(dlist_is_empty(&proc->lockGroupMembers));
- /* Return PGPROC structure (and semaphore) to appropriate freelist */
- proc->links.next = (SHM_QUEUE *) *procgloballist;
- *procgloballist = proc;
+ /*
+ * Return PGPROC structure (and semaphore) to appropriate freelist.
+ * Again check if decode group stuff will handle it later.
+ */
+ if (proc->decodeGroupLeader == NULL)
+ {
+ proc->links.next = (SHM_QUEUE *) *procgloballist;
+ *procgloballist = proc;
+ }
+ }
+
+ /*
+ * If we're still a member of a decode group, that means we're a leader
+ * which has somehow exited before its children. The last remaining child
+ * will release our PGPROC. Otherwise, release it now.
+ */
+ if (proc->decodeGroupLeader == NULL)
+ {
+ /* Since decodeGroupLeader is NULL, decodeGroupMembers should be empty. */
+ Assert(dlist_is_empty(&proc->decodeGroupMembers));
+
+ /*
+ * Return PGPROC structure (and semaphore) to appropriate freelist
+ * But check if it was already done above by the lockGroup code
+ */
+ if (proc != *procgloballist)
+ {
+ proc->links.next = (SHM_QUEUE *) *procgloballist;
+ *procgloballist = proc;
+ }
}
/* Update shared estimate of spins_per_delay */
@@ -1887,3 +1992,318 @@ BecomeLockGroupMember(PGPROC *leader, int pid)
return ok;
}
+
+/*
+ * AssignDecodeGroupLeader
+ * Lookup process using xid and designate as decode group leader.
+ *
+ * Once this function has returned, other processes can join the decode
+ * group by calling BecomeDecodeGroupMember.
+ */
+PGPROC *
+AssignDecodeGroupLeader(TransactionId xid)
+{
+ PGPROC *proc = NULL;
+ int pid;
+ LWLock *leader_lwlock;
+
+ Assert(xid != InvalidTransactionId);
+
+ /*
+ * Lookup the backend executing this transaction.
+ *
+ * If the transaction already completed, we can bail out.
+ */
+ proc = BackendXidGetProc(xid);
+ if (proc)
+ pid = proc->pid;
+ else
+ return NULL;
+
+ /*
+ * A process running an XID can't have a leader; it can only be
+ * a leader (in which case it points to itself).
+ */
+ Assert(!proc->decodeGroupLeader ||
+ (proc->decodeGroupLeader == proc));
+
+ /*
+ * This proc will become decodeGroupLeader if it's not already.
+ */
+ if (proc->decodeGroupLeader == NULL)
+ {
+ volatile PGXACT *pgxact;
+
+ /* Create single-member group, containing this proc. */
+ leader_lwlock = LockHashPartitionLockByProc(proc);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+
+ /* recheck we are still the same */
+ pgxact = &ProcGlobal->allPgXact[proc->pgprocno];
+
+ /*
+ * We know the process was executing the XID a while ago, but we
+ * have not been holding any locks on PGPROC so it might have
+ * started executing something else since then. So we need to
+ * recheck that it is indeed still running the right XID.
+ *
+ * If it's not, the transaction must have already completed, so
+ * we don't need to create any decoding group.
+ */
+ if (pgxact->xid == xid)
+ {
+ /*
+ * Some other decoding backend might have marked the process
+ * as a leader before we acquired the lock. But it must not
+ * be a follower of some other leader.
+ */
+ Assert((proc->decodeGroupLeader == NULL) ||
+ (proc->decodeGroupLeader == proc));
+
+ /* recheck if someone else did not already assign us */
+ if (proc->decodeGroupLeader == NULL)
+ {
+ /*
+ * The leader is also a part of the decoding group,
+ * so we add it to the members list as well.
+ */
+ proc->decodeGroupLeader = proc;
+ dlist_push_head(&proc->decodeGroupMembers,
+ &proc->decodeGroupLink);
+ }
+ }
+ else
+ {
+ /* proc entry is gone */
+ proc = NULL;
+ }
+ LWLockRelease(leader_lwlock);
+ }
+
+ elog(DEBUG1, "became group leader (%p)", proc);
+ return proc;
+}
+
+/*
+ * BecomeDecodeGroupMember - designate process as decode group member
+ *
+ * This is pretty straightforward except for the possibility that the leader
+ * whose group we're trying to join might exit before we manage to do so;
+ * and the PGPROC might get recycled for an unrelated process. To avoid
+ * that, we require the caller to pass the PID of the intended PGPROC as
+ * an interlock. Returns true if we successfully join the intended decode
+ * group, and false if not.
+ */
+bool
+BecomeDecodeGroupMember(PGPROC *leader, int pid, bool is_prepared)
+{
+ LWLock *leader_lwlock;
+ bool ok = false;
+
+ /* Group leader can't become member of group */
+ Assert(MyProc != leader);
+
+ /* Can't already be a member of a group */
+ Assert(MyProc->decodeGroupLeader == NULL);
+
+ /* PID must be valid OR this is a prepared transaction. */
+ Assert(pid != 0 || is_prepared);
+
+ /*
+ * Get lock protecting the group fields. Note LockHashPartitionLockByProc
+ * accesses leader->pgprocno in a PGPROC that might be free. This is safe
+ * because all PGPROCs' pgprocno fields are set during shared memory
+ * initialization and never change thereafter; so we will acquire the
+ * correct lock even if the leader PGPROC is in process of being recycled.
+ */
+ leader_lwlock = LockHashPartitionLockByProc(leader);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+
+ /* Is this the leader we're looking for? */
+ if (leader->pid == pid && leader->decodeGroupLeader == leader)
+ {
+ if (is_prepared)
+ Assert(pid == 0);
+ /* is the leader going away? */
+ if (leader->decodeAbortPending)
+ ok = false;
+ else
+ {
+ /* OK, join the group */
+ ok = true;
+ MyProc->decodeGroupLeader = leader;
+ dlist_push_tail(&leader->decodeGroupMembers, &MyProc->decodeGroupLink);
+ }
+ }
+ else
+ MyProc->decodeGroupLeader = NULL;
+ LWLockRelease(leader_lwlock);
+
+ elog(DEBUG1, "became group member (%p) to (%p)", MyProc, leader);
+ return ok;
+}
+
+/*
+ * RemoveDecodeGroupMember
+ * Remove a member from the decoding group of a leader.
+ */
+void
+RemoveDecodeGroupMember(PGPROC *leader)
+{
+ LWLock *leader_lwlock;
+
+ leader_lwlock = LockHashPartitionLockByProc(leader);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ RemoveDecodeGroupMemberLocked(leader);
+ LWLockRelease(leader_lwlock);
+
+ return;
+}
+
+/*
+ * RemoveDecodeGroupMemberLocked
+ * Remove a member from a decoding group of a leader.
+ *
+ * Assumes that the caller is holding appropriate lock on PGPROC.
+ */
+void
+RemoveDecodeGroupMemberLocked(PGPROC *leader)
+{
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ dlist_delete(&MyProc->decodeGroupLink);
+ /* leader links to itself, so never empty */
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ MyProc->decodeGroupLeader = NULL;
+ elog(DEBUG1, "removed group member (%p) from (%p)", MyProc, leader);
+
+ return;
+}
+
+/*
+ * LogicalDecodeRemoveTransaction
+ * Notify all decoding members that this transaction is going away.
+ *
+ * Wait for all decodeGroupMembers to ack back before returning from
+ * here but only in case of aborts.
+ *
+ * This function should be called *after* the proc has been removed
+ * from the procArray.
+ *
+ * If the transaction is committing, it's ok for the decoding backends
+ * to continue merrily - there is no danger in accessing catalogs. When
+ * it tries to join the decoding group, it won't find the proc anymore,
+ * forcing it to re-check transaction status and cache the commit
+ * status for future calls (see LogicalLockTransaction).
+ *
+ * If a backend that is part of the decode group dies or crashes,
+ * the database effectively restarts, which also cleans up the
+ * shared memory state.
+ */
+void
+LogicalDecodeRemoveTransaction(PGPROC *leader, bool isCommit)
+{
+ LWLock *leader_lwlock;
+ dlist_mutable_iter change_i;
+ dlist_iter iter;
+ PGPROC *proc;
+ bool do_wait;
+
+ leader_lwlock = LockHashPartitionLockByProc(leader);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+
+ /*
+ * If the proc has not been initialized as a group leader, there are
+ * no group members to wait for and we can terminate right away.
+ */
+ if (leader->decodeGroupLeader == NULL)
+ {
+ Assert(dlist_is_empty(&leader->decodeGroupMembers));
+ LWLockRelease(leader_lwlock);
+ return;
+ }
+
+ /* mark the transaction as aborting */
+ leader->decodeAbortPending = (!isCommit);
+
+recheck:
+ do_wait = false;
+ Assert(leader->decodeGroupLeader == leader);
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ if (!isCommit)
+ {
+ /*
+ * We need to walk the list of group members, and decide if we
+ * need to wait for some of them. In other words, we need to
+ * check if there are any processes besides the leader.
+ */
+ dlist_foreach(iter, &leader->decodeGroupMembers)
+ {
+ proc = dlist_container(PGPROC, decodeGroupLink, iter.cur);
+
+ /* Ignore the leader (i.e. ourselves). */
+ if (proc == leader)
+ continue;
+
+ /* mark the proc to indicate abort is pending */
+ if (!proc->decodeAbortPending)
+ {
+ proc->decodeAbortPending = true;
+ elog(DEBUG1, "marking group member (%p) from (%p) for abort",
+ proc, leader);
+ }
+
+ /* if the proc is currently locked, wait */
+ if (proc->decodeLocked)
+ do_wait = true;
+ }
+
+ if (do_wait)
+ {
+ int rc;
+ LWLockRelease(leader_lwlock);
+
+ elog(LOG, "Waiting for backends to abort decoding");
+ /*
+ * Wait on our latch to allow decodeGroupMembers to
+ * go away soon
+ */
+ rc = WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+ 100L,
+ WAIT_EVENT_PG_SLEEP);
+ ResetLatch(MyLatch);
+
+ /* emergency bailout if postmaster has died */
+ if (rc & WL_POSTMASTER_DEATH)
+ proc_exit(1);
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Recheck decodeGroupMembers */
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ goto recheck;
+ }
+ }
+
+ /*
+ * All backends exited cleanly in case of aborts above,
+ * remove decodeGroupMembers now for both commit/abort cases
+ */
+ Assert(leader->decodeGroupLeader == leader);
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ dlist_foreach_modify(change_i, &leader->decodeGroupMembers)
+ {
+ proc = dlist_container(PGPROC, decodeGroupLink, change_i.cur);
+ Assert(!proc->decodeLocked);
+ dlist_delete(&proc->decodeGroupLink);
+ elog(DEBUG1, "deleting group member (%p) from (%p)",
+ proc, leader);
+ proc->decodeGroupLeader = NULL;
+ }
+ Assert(dlist_is_empty(&leader->decodeGroupMembers));
+ leader->decodeGroupLeader = NULL;
+ leader->decodeAbortPending = false;
+ LWLockRelease(leader_lwlock);
+
+ return;
+}
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 619c5f4d73..63b14367f0 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -117,6 +117,8 @@ extern void LogicalIncreaseXminForSlot(XLogRecPtr lsn, TransactionId xmin);
extern void LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn,
XLogRecPtr restart_lsn);
extern void LogicalConfirmReceivedLocation(XLogRecPtr lsn);
+extern bool LogicalLockTransaction(ReorderBufferTXN *txn);
+extern void LogicalUnlockTransaction(ReorderBufferTXN *txn);
extern bool filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id);
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 177ef98e43..385bb486bb 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -141,6 +141,11 @@ typedef struct ReorderBufferChange
#define RBTXN_HAS_CATALOG_CHANGES 0x0001
#define RBTXN_IS_SUBXACT 0x0002
#define RBTXN_IS_SERIALIZED 0x0004
+#define RBTXN_PREPARE 0x0008
+#define RBTXN_COMMIT_PREPARED 0x0010
+#define RBTXN_ROLLBACK_PREPARED 0x0020
+#define RBTXN_COMMIT 0x0040
+#define RBTXN_ROLLBACK 0x0080
/* does the txn have catalog changes */
#define rbtxn_has_catalog_changes(txn) (txn->txn_flags & RBTXN_HAS_CATALOG_CHANGES)
@@ -154,6 +159,16 @@ typedef struct ReorderBufferChange
* nentries_mem == nentries.
*/
#define rbtxn_is_serialized(txn) (txn->txn_flags & RBTXN_IS_SERIALIZED)
+/* is this txn prepared? */
+#define rbtxn_prepared(txn) (txn->txn_flags & RBTXN_PREPARE)
+/* was this prepared txn committed in the meanwhile? */
+#define rbtxn_commit_prepared(txn) (txn->txn_flags & RBTXN_COMMIT_PREPARED)
+/* was this prepared txn aborted in the meanwhile? */
+#define rbtxn_rollback_prepared(txn) (txn->txn_flags & RBTXN_ROLLBACK_PREPARED)
+/* was this txn committed in the meanwhile? */
+#define rbtxn_commit(txn) (txn->txn_flags & RBTXN_COMMIT)
+/* was this txn aborted in the meanwhile? */
+#define rbtxn_rollback(txn) (txn->txn_flags & RBTXN_ROLLBACK)
typedef struct ReorderBufferTXN
{
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 5c19a61dcf..45d2dbd766 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -200,6 +200,26 @@ struct PGPROC
PGPROC *lockGroupLeader; /* lock group leader, if I'm a member */
dlist_head lockGroupMembers; /* list of members, if I'm a leader */
dlist_node lockGroupLink; /* my member link, if I'm a member */
+
+ /*
+ * Support for decoding groups. Use LockHashPartitionLockByProc on the group
+ * leader to get the LWLock protecting these fields.
+ *
+ * For prepared and uncommitted transactions, decoding backends working on
+ * the same XID will link themselves up to the corresponding PGPROC
+ * entry (decodeGroupLeader).
+ *
+ * They will remove themselves when they are done decoding.
+ *
+ * If the prepared or uncommitted transaction decides to abort, then
+ * the decodeGroupLeader will set the decodeAbortPending flag allowing
+ * the decodeGroupMembers to abort their decoding appropriately
+ */
+ PGPROC *decodeGroupLeader; /* decode group leader, if I'm a member */
+ dlist_head decodeGroupMembers; /* list of members, if I'm a leader */
+ dlist_node decodeGroupLink; /* my member link, if I'm a member */
+ bool decodeLocked; /* is it currently locked by this proc? */
+ bool decodeAbortPending; /* is the decode group leader aborting? */
};
/* NOTE: "typedef struct PGPROC PGPROC" appears in storage/lock.h. */
@@ -327,4 +347,10 @@ extern PGPROC *AuxiliaryPidGetProc(int pid);
extern void BecomeLockGroupLeader(void);
extern bool BecomeLockGroupMember(PGPROC *leader, int pid);
+extern PGPROC *AssignDecodeGroupLeader(TransactionId xid);
+extern bool BecomeDecodeGroupMember(PGPROC *leader, int pid, bool is_prepared);
+extern void RemoveDecodeGroupMember(PGPROC *leader);
+extern void RemoveDecodeGroupMemberLocked(PGPROC *leader);
+extern void LogicalDecodeRemoveTransaction(PGPROC *leader, bool isCommit);
+
#endif /* PROC_H */
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 75bab2985f..776de2470e 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -97,6 +97,7 @@ extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids
extern PGPROC *BackendPidGetProc(int pid);
extern PGPROC *BackendPidGetProcWithLock(int pid);
+extern PGPROC *BackendXidGetProc(TransactionId xid);
extern int BackendXidGetPid(TransactionId xid);
extern bool IsBackendPid(int pid);
--
2.15.1 (Apple Git-101)
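The abort handshake in LogicalDecodeRemoveTransaction above can be modeled in isolation. The following is only a minimal sketch, not the patch's code: DecodeMember and mark_abort_and_check are hypothetical names, the struct stands in for the decodeLocked and decodeAbortPending fields of PGPROC, and the LWLock and latch wait are left out. It shows one pass of the leader's loop: every member is flagged, and the leader has to wait and recheck as long as any member is still locked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two PGPROC flags used by the patch. */
typedef struct DecodeMember
{
    bool decode_locked;        /* member is between lock and unlock calls */
    bool decode_abort_pending; /* leader has announced an abort */
} DecodeMember;

/*
 * One pass of the leader's abort loop: flag every member, and report
 * whether any member is still locked, i.e. whether the leader must
 * wait and recheck.
 */
static bool
mark_abort_and_check(DecodeMember *members, size_t n)
{
    bool do_wait = false;

    assert(members != NULL || n == 0);

    for (size_t i = 0; i < n; i++)
    {
        members[i].decode_abort_pending = true;
        if (members[i].decode_locked)
            do_wait = true;
    }

    return do_wait;
}
```

In the patch itself the leader releases the partition lock, sleeps on its latch for up to 100 ms, and then repeats this pass until no member is locked.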
0003-Support-decoding-of-two-phase-transactions-at-PREPAR.0404.v2.0.patch (application/octet-stream)
From abc813a609d64b4afee687150910c0d2081d2690 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Wed, 4 Apr 2018 17:17:29 +0530
Subject: [PATCH 3/5] Support decoding of two-phase transactions at PREPARE
Until now two-phase transactions were decoded at COMMIT, just like
regular transactions. During replay, two-phase transactions were
translated into regular transactions on the subscriber, and the GID
was not forwarded to it.
This patch allows PREPARE-time decoding of two-phase transactions (if
the output plugin supports this capability), in which case the
transactions are replayed at PREPARE and then committed later when
COMMIT PREPARED arrives.
On the subscriber, the transactions will be executed as two-phase
transactions, with the same GID. This is important for various
external transaction managers, which often encode information into
the GID itself.
All catalog access while decoding such 2PC transactions has to be
carried out via the LogicalLockTransaction/LogicalUnlockTransaction APIs
at the relevant locations. This includes the location where the output
plugin's change apply API is invoked. This protects any catalog
access inside the output plugin's change apply API from concurrent
rollback operations.
Includes documentation changes.
---
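The all-or-none rule for the two-phase callbacks that this patch enforces in StartupDecodingContext can be sketched on its own. This is an illustration only, under stated assumptions: SketchCallbacks, SketchTwoPhaseState, classify_twophase_callbacks and sketch_noop are hypothetical names, and the real callbacks have the typed signatures from output_plugin.h rather than void (*)(void).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified callback type; the real ones take arguments. */
typedef void (*SketchCallback) (void);

typedef struct SketchCallbacks
{
    SketchCallback prepare_cb;
    SketchCallback commit_prepared_cb;
    SketchCallback abort_prepared_cb;
    SketchCallback filter_prepare_cb;   /* optional */
} SketchCallbacks;

typedef enum
{
    TWOPHASE_DISABLED,   /* none of the three callbacks registered */
    TWOPHASE_ENABLED,    /* all three callbacks registered */
    TWOPHASE_INVALID     /* partial set, or filter without two-phase */
} SketchTwoPhaseState;

/* Classify a callback set the same way the patch's startup check does. */
static SketchTwoPhaseState
classify_twophase_callbacks(const SketchCallbacks *cbs)
{
    int n;

    assert(cbs != NULL);

    n = (cbs->prepare_cb != NULL) +
        (cbs->commit_prepared_cb != NULL) +
        (cbs->abort_prepared_cb != NULL);

    /* a partial set of two-phase callbacks is an error */
    if (n != 0 && n != 3)
        return TWOPHASE_INVALID;

    /* filter_prepare is optional, but requires two-phase decoding */
    if (n == 0 && cbs->filter_prepare_cb != NULL)
        return TWOPHASE_INVALID;

    return (n == 3) ? TWOPHASE_ENABLED : TWOPHASE_DISABLED;
}

/* Dummy callback used only to populate the struct in examples. */
static void
sketch_noop(void)
{
}
```

Where this sketch returns TWOPHASE_INVALID, the patch raises an ERROR; TWOPHASE_ENABLED corresponds to ctx->enable_twophase being set.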
doc/src/sgml/logicaldecoding.sgml | 128 +++++++++++++-
src/backend/access/transam/twophase.c | 8 +
src/backend/replication/logical/decode.c | 147 ++++++++++++++--
src/backend/replication/logical/logical.c | 202 +++++++++++++++++++++
src/backend/replication/logical/reorderbuffer.c | 225 +++++++++++++++++++++---
src/include/replication/logical.h | 11 +-
src/include/replication/output_plugin.h | 45 +++++
src/include/replication/reorderbuffer.h | 54 ++++++
8 files changed, 783 insertions(+), 37 deletions(-)
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index f6b14dccb0..b11752789d 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -384,7 +384,12 @@ typedef struct OutputPluginCallbacks
LogicalDecodeBeginCB begin_cb;
LogicalDecodeChangeCB change_cb;
LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeAbortCB abort_cb;
LogicalDecodeMessageCB message_cb;
+ LogicalDecodeFilterPrepareCB filter_prepare_cb;
+ LogicalDecodePrepareCB prepare_cb;
+ LogicalDecodeCommitPreparedCB commit_prepared_cb;
+ LogicalDecodeAbortPreparedCB abort_prepared_cb;
LogicalDecodeFilterByOriginCB filter_by_origin_cb;
LogicalDecodeShutdownCB shutdown_cb;
} OutputPluginCallbacks;
@@ -454,7 +459,13 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
never get
decoded. Successful savepoints are
folded into the transaction containing them in the order they were
- executed within that transaction.
+ executed within that transaction. A transaction that is prepared for
+ a two-phase commit using <command>PREPARE TRANSACTION</command> will
+ also be decoded if the output plugin callbacks needed for decoding
+ it are provided. The transaction being decoded may be aborted
+ concurrently via a <command>ROLLBACK PREPARED</command> command. In
+ that case, the logical decoding of this transaction will be aborted
+ too.
</para>
<note>
@@ -555,6 +566,74 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
</para>
</sect3>
+ <sect3 id="logicaldecoding-output-plugin-prepare">
+ <title>Transaction Prepare Callback</title>
+
+ <para>
+ The optional <function>prepare_cb</function> callback is called whenever
+ a transaction which is prepared for two-phase commit has been
+ decoded. The <function>change_cb</function> callbacks for all modified
+ rows will have been called before this, if there have been any modified
+ rows.
+<programlisting>
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+ <title>Commit Prepared Transaction Callback</title>
+
+ <para>
+ The optional <function>commit_prepared_cb</function> callback is called whenever
+ a <command>COMMIT PREPARED</command> has been decoded. The <parameter>gid</parameter> field,
+ which is part of the <parameter>txn</parameter> parameter, can be used in this
+ callback.
+<programlisting>
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-abort-prepared">
+ <title>Rollback Prepared Transaction Callback</title>
+
+ <para>
+ The optional <function>abort_prepared_cb</function> callback is called whenever
+ a <command>ROLLBACK PREPARED</command> has been decoded. The <parameter>gid</parameter> field,
+ which is part of the <parameter>txn</parameter> parameter, can be used in this
+ callback.
+<programlisting>
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-abort">
+ <title>Transaction Abort Callback</title>
+
+ <para>
+ The required <function>abort_cb</function> callback is called whenever
+ a transaction abort has to be initiated. This can happen if we are
+ decoding a transaction that has been prepared for two-phase commit and
+ a concurrent rollback happens while we are decoding it. In such cases
+ it can make sense to check, before commencing decoding, whether the
+ rollback has already happened, so that decoding of such transactions
+ is avoided entirely.
+<programlisting>
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
<sect3 id="logicaldecoding-output-plugin-change">
<title>Change Callback</title>
@@ -564,7 +643,12 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
an <command>INSERT</command>, <command>UPDATE</command>,
or <command>DELETE</command>. Even if the original command modified
several rows at once the callback will be called individually for each
- row.
+ row. The <function>change_cb</function> callback may access system or
+ user catalog tables to aid in the process of outputting the row
+ modification details. When decoding a prepared (but not yet
+ committed) transaction or any other uncommitted transaction, this
+ change callback is guaranteed consistent access to catalog tables even
+ if another backend concurrently rolls back this very same transaction.
<programlisting>
typedef void (*LogicalDecodeChangeCB) (struct LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
@@ -619,6 +703,39 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
</para>
</sect3>
+ <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+ <title>Prepare Filter Callback</title>
+
+ <para>
+ The optional <function>filter_prepare_cb</function> callback
+ is called to determine whether data that is part of the current
+ two-phase commit transaction should be decoded at this prepare
+ stage or as a regular one-phase transaction at
+ <command>COMMIT PREPARED</command> time later. To signal that
+ decoding should be skipped, return <literal>true</literal>;
+ <literal>false</literal> otherwise. When the callback is not
+ defined, <literal>false</literal> is assumed (i.e. nothing is
+ filtered).
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+</programlisting>
+ The <parameter>ctx</parameter> parameter has the same contents
+ as for the other callbacks. The <parameter>txn</parameter> parameter
+ contains meta information about the transaction. The <parameter>xid</parameter>
+ contains the XID because <parameter>txn</parameter> can be NULL in some cases.
+ The <parameter>gid</parameter> is the identifier that later identifies this
+ transaction for <command>COMMIT PREPARED</command> or <command>ROLLBACK PREPARED</command>.
+ </para>
+ <para>
+ The callback has to provide the same static answer for a given combination of
+ <parameter>xid</parameter> and <parameter>gid</parameter> every time it is
+ called.
+ </para>
+ </sect3>
+
<sect3 id="logicaldecoding-output-plugin-message">
<title>Generic Message Callback</title>
@@ -640,7 +757,12 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
non-transactional and the XID was not assigned yet in the transaction
which logged the message. The <parameter>lsn</parameter> has WAL
location of the message. The <parameter>transactional</parameter> says
- if the message was sent as transactional or not.
+ if the message was sent as transactional or not. Similar to the change
+ callback, when decoding a prepared (but not yet committed)
+ transaction or any other uncommitted transaction, this message
+ callback is guaranteed consistent access to catalog tables even if
+ another backend concurrently rolls back this very same transaction.
+
The <parameter>prefix</parameter> is arbitrary null-terminated prefix
which can be used for identifying interesting messages for the current
plugin. And finally the <parameter>message</parameter> parameter holds
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index d6e4b7980f..30ebe5e72d 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1507,6 +1507,14 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
ProcArrayRemove(proc, latestXid);
+ /*
+ * Coordinate with logical decoding backends that may already be
+ * decoding this prepared transaction. When aborting a transaction,
+ * we need to wait for all of them to leave the decoding group. If
+ * committing, we simply remove all members from the group.
+ */
+ LogicalDecodeRemoveTransaction(proc, isCommit);
+
/*
* In case we fail while running the callbacks, mark the gxact invalid so
* no one else will try to commit/rollback, and so it will be recycled if
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 6eb0d5527e..51d544d0f5 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -34,6 +34,7 @@
#include "access/xlogutils.h"
#include "access/xlogreader.h"
#include "access/xlogrecord.h"
+#include "access/twophase.h"
#include "catalog/pg_control.h"
@@ -72,6 +73,8 @@ static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_commit *parsed, TransactionId xid);
static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare * parsed);
/* common function to decode tuples */
static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -280,16 +283,33 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
break;
}
case XLOG_XACT_PREPARE:
+ {
+ xl_xact_parsed_prepare parsed;
- /*
- * Currently decoding ignores PREPARE TRANSACTION and will just
- * decode the transaction when the COMMIT PREPARED is sent or
- * throw away the transaction's contents when a ROLLBACK PREPARED
- * is received. In the future we could add code to expose prepared
- * transactions in the changestream allowing for a kind of
- * distributed 2PC.
- */
- ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+ /* check that output plugin is capable of twophase decoding */
+ if (!ctx->enable_twophase)
+ {
+ ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+ break;
+ }
+
+ /* ok, parse it */
+ ParsePrepareRecord(XLogRecGetInfo(buf->record),
+ XLogRecGetData(buf->record), &parsed);
+
+ /* does output plugin want this particular transaction? */
+ if (ctx->callbacks.filter_prepare_cb &&
+ ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+ parsed.twophase_gid))
+ {
+ ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+ buf->origptr);
+ break;
+ }
+
+ DecodePrepare(ctx, buf, &parsed);
+ break;
+ }
break;
default:
elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
@@ -627,9 +647,90 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
buf->origptr, buf->endptr);
}
+ /*
+ * Decide if we're processing COMMIT PREPARED, or a regular COMMIT.
+ * Regular commit simply triggers a replay of transaction changes from the
+ * reorder buffer. For COMMIT PREPARED, however, that replay already
+ * happened at PREPARE time, so we only need to notify the subscriber
+ * that the GID finally committed.
+ *
+ * For output plugins that do not support PREPARE-time decoding of
+ * two-phase transactions, we never even see the PREPARE and all two-phase
+ * transactions simply fall through to the second branch.
+ */
+ if (TransactionIdIsValid(parsed->twophase_xid) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder,
+ parsed->twophase_xid, parsed->twophase_gid))
+ {
+ Assert(xid == parsed->twophase_xid);
+ /* we are processing COMMIT PREPARED */
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid, true);
+ }
+ else
+ {
+ /* replay actions of all transaction + subtransactions in order */
+ ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn);
+ }
+}
+
+/*
+ * Decode PREPARE record. Similar logic as in COMMIT
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare * parsed)
+{
+ XLogRecPtr origin_lsn = parsed->origin_lsn;
+ TimestampTz commit_time = parsed->origin_timestamp;
+ RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+ int i;
+ TransactionId xid = parsed->twophase_xid;
+
+ /*
+ * Process invalidation messages, even if we're not interested in the
+ * transaction's contents, since the various caches need to always be
+ * consistent.
+ */
+ if (parsed->nmsgs > 0)
+ {
+ if (!ctx->fast_forward)
+ ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
+ parsed->nmsgs, parsed->msgs);
+ ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+ }
+
+ /*
+ * Tell the reorderbuffer about the surviving subtransactions. We need to
+ * do this because the main transaction itself has not committed since we
+ * are in the prepare phase right now. So we need to be sure the snapshot
+ * is set up correctly for the main transaction in case all changes
+ * happened in subtransactions.
+ */
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+ buf->origptr, buf->endptr);
+ }
+
+ if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+ (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+ ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+ {
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr);
+ }
+ ReorderBufferForget(ctx->reorder, xid, buf->origptr);
+
+ return;
+ }
+
/* replay actions of all transaction + subtransactions in order */
- ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
- commit_time, origin_id, origin_lsn);
+ ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn, parsed->twophase_gid);
}
/*
@@ -641,6 +742,30 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid)
{
int i;
+ XLogRecPtr origin_lsn = InvalidXLogRecPtr;
+ TimestampTz commit_time = 0;
+ RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ origin_lsn = parsed->origin_lsn;
+ commit_time = parsed->origin_timestamp;
+ }
+
+ /*
+ * If it's ROLLBACK PREPARED then handle it via callbacks.
+ */
+ if (TransactionIdIsValid(xid) &&
+ !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+ parsed->dbId == ctx->slot->data.database &&
+ !FilterByOrigin(ctx, origin_id) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
+ {
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid, false);
+ return;
+ }
for (i = 0; i < parsed->nsubxacts; i++)
{
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 2238066138..65382c2556 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -60,6 +60,16 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+static void abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change);
static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -125,6 +135,7 @@ StartupDecodingContext(List *output_plugin_options,
MemoryContext context,
old_context;
LogicalDecodingContext *ctx;
+ int twophase_callbacks;
/* shorter lines... */
slot = MyReplicationSlot;
@@ -184,8 +195,38 @@ StartupDecodingContext(List *output_plugin_options,
ctx->reorder->begin = begin_cb_wrapper;
ctx->reorder->apply_change = change_cb_wrapper;
ctx->reorder->commit = commit_cb_wrapper;
+ ctx->reorder->abort = abort_cb_wrapper;
+ ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+ ctx->reorder->prepare = prepare_cb_wrapper;
+ ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+ ctx->reorder->abort_prepared = abort_prepared_cb_wrapper;
ctx->reorder->message = message_cb_wrapper;
+ /*
+ * Check that plugin implements all callbacks necessary to decode
+ * two-phase transactions - we either have to have all of them or none.
+ * The filter_prepare callback is optional, but can only be defined when
+ * two-phase decoding is enabled (i.e. the three other callbacks are
+ * defined).
+ */
+ twophase_callbacks = (ctx->callbacks.prepare_cb != NULL) +
+ (ctx->callbacks.commit_prepared_cb != NULL) +
+ (ctx->callbacks.abort_prepared_cb != NULL);
+
+ ctx->enable_twophase = (twophase_callbacks == 3);
+
+ /* Plugins with incorrect number of two-phase callbacks are broken. */
+ if ((twophase_callbacks != 3) && (twophase_callbacks != 0))
+ ereport(ERROR,
+ (errmsg("Output plugin registered only %d twophase callbacks.",
+ twophase_callbacks)));
+
+ /* filter_prepare is optional, but requires two-phase decoding */
+ if ((ctx->callbacks.filter_prepare_cb != NULL) && (!ctx->enable_twophase))
+ ereport(ERROR,
+ (errmsg("Output plugin does not support two-phase decoding, but "
+ "registered filter_prepare callback.")));
+
ctx->out = makeStringInfo();
ctx->prepare_write = prepare_write;
ctx->write = do_write;
@@ -697,6 +738,122 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
error_context_stack = errcallback.previous;
}
+static void
+abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "abort";
+ state.report_location = txn->final_lsn; /* beginning of abort record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.abort_cb(ctx, txn, abort_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "prepare";
+ state.report_location = txn->final_lsn; /* beginning of prepare record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "commit_prepared";
+ state.report_location = txn->final_lsn; /* beginning of commit record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "abort_prepared";
+ state.report_location = txn->final_lsn; /* beginning of abort record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.abort_prepared_cb(ctx, txn, abort_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
static void
change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change)
@@ -734,6 +891,51 @@ change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
error_context_stack = errcallback.previous;
}
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+ bool ret;
+
+ /*
+ * Skip if decoding of twophase at PREPARE time is not enabled. In that
+ * case all twophase transactions are considered filtered out and will be
+ * applied as regular transactions at COMMIT PREPARED.
+ */
+ if (!ctx->enable_twophase)
+ return true;
+
+ /*
+ * The filter_prepare callback is optional. When not supplied, all
+ * prepared transactions should go through.
+ */
+ if (!ctx->callbacks.filter_prepare_cb)
+ return false;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "filter_prepare";
+ state.report_location = InvalidXLogRecPtr;
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = false;
+
+ /* do the actual work: call callback */
+ ret = ctx->callbacks.filter_prepare_cb(ctx, txn, xid, gid);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+
+ return ret;
+}
+
bool
filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
{
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 3c9af58640..1c7dbd3ade 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -337,6 +337,11 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
}
/* free data that's contained */
+ if (txn->gid != NULL)
+ {
+ pfree(txn->gid);
+ txn->gid = NULL;
+ }
if (txn->tuplecid_hash != NULL)
{
@@ -1277,25 +1282,18 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
* the top and subtransactions (using a k-way merge) and replay the changes in
* lsn order.
*/
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
- XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
- TimestampTz commit_time,
- RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+ ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn)
{
- ReorderBufferTXN *txn;
volatile Snapshot snapshot_now;
volatile CommandId command_id = FirstCommandId;
bool using_subtxn;
ReorderBufferIterTXNState *volatile iterstate = NULL;
- txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
- false);
-
- /* unknown transaction, nothing to replay */
- if (txn == NULL)
- return;
-
txn->final_lsn = commit_lsn;
txn->end_lsn = end_lsn;
txn->commit_time = commit_time;
@@ -1372,8 +1370,12 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
case REORDER_BUFFER_CHANGE_DELETE:
Assert(snapshot_now);
+ /* Lock transaction before catalog access */
+ if (!LogicalLockTransaction(txn))
+ break;
reloid = RelidByRelfilenode(change->data.tp.relnode.spcNode,
change->data.tp.relnode.relNode);
+ LogicalUnlockTransaction(txn);
/*
* Catalog tuple without data, emitted while catalog was
@@ -1388,8 +1390,14 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
relpathperm(change->data.tp.relnode,
MAIN_FORKNUM));
+ /* Lock transaction before catalog access */
+ if (!LogicalLockTransaction(txn))
+ break;
+
relation = RelationIdGetRelation(reloid);
+ LogicalUnlockTransaction(txn);
+
if (relation == NULL)
elog(ERROR, "could not open relation with OID %u (for filenode \"%s\")",
reloid,
@@ -1418,8 +1426,23 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
/* user-triggered change */
if (!IsToastRelation(relation))
{
+ /*
+ * Output plugins can access catalog metadata and we
+ * do not have any control over that. We could ask
+ * them to call
+ * LogicalLockTransaction/LogicalUnlockTransaction
+ * APIs themselves, but that leads to unnecessary
+ * complications and expectations from plugin
+ * writers. We avoid this by calling these APIs
+ * here, thereby ensuring that the in-progress
+ * transaction will be around for the duration of
+ * the apply_change call below
+ */
+ if (!LogicalLockTransaction(txn))
+ break;
ReorderBufferToastReplace(rb, txn, relation, change);
rb->apply_change(rb, txn, relation, change);
+ LogicalUnlockTransaction(txn);
/*
* Only clear reassembled toast chunks if we're sure
@@ -1492,10 +1515,14 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
break;
case REORDER_BUFFER_CHANGE_MESSAGE:
+ /* XXX does rb->message need lock/unlock? */
+ if (!LogicalLockTransaction(txn))
+ break;
rb->message(rb, txn, change->lsn, true,
change->data.msg.prefix,
change->data.msg.message_size,
change->data.msg.message);
+ LogicalUnlockTransaction(txn);
break;
case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1581,8 +1608,26 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
ReorderBufferIterTXNFinish(rb, iterstate);
iterstate = NULL;
- /* call commit callback */
- rb->commit(rb, txn, commit_lsn);
+ /*
+ * Call abort/commit/prepare callback, depending on the transaction
+ * state.
+ *
+ * If the transaction aborted during apply (which currently can happen
+ * only for prepared transactions), simply call the abort callback.
+ *
+ * Otherwise call either PREPARE (for twophase transactions) or COMMIT
+ * (for regular ones).
+ */
+ if (rbtxn_rollback(txn))
+ rb->abort(rb, txn, commit_lsn);
+ else if (rbtxn_prepared(txn))
+ rb->prepare(rb, txn, commit_lsn);
+ else
+ rb->commit(rb, txn, commit_lsn);
+
+ /* remove ourselves from the decodeGroupLeader */
+ if (MyProc->decodeGroupLeader)
+ RemoveDecodeGroupMember(MyProc->decodeGroupLeader);
/* this is just a sanity check against bad output plugin behaviour */
if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -1609,7 +1654,12 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
if (snapshot_now->copied)
ReorderBufferFreeSnap(rb, snapshot_now);
- /* remove potential on-disk data, and deallocate */
+ /*
+ * remove potential on-disk data, and deallocate.
+ *
+ * We remove it even for prepared transactions (GID is enough to
+ * commit/abort those later).
+ */
ReorderBufferCleanupTXN(rb, txn);
}
PG_CATCH();
@@ -1643,6 +1693,141 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
PG_END_TRY();
}
+
+/*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+ return rb->filter_prepare(rb, txn, xid, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferCommitInternal()
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /* unknown transaction, nothing to replay */
+ if (txn == NULL)
+ return;
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a twophase transaction. It calls ReorderBufferCommitInternal()
+ * since all prepared transactions need to be decoded at PREPARE time.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /* unknown transaction, nothing to replay */
+ if (txn == NULL)
+ return;
+
+ txn->txn_flags |= RBTXN_PREPARE;
+ txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+ strcpy(txn->gid, gid);
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Check whether this transaction was sent as prepared to subscribers.
+ * Called while handling commit|abort prepared.
+ */
+bool
+ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+ const char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /*
+ * Always call the prepare filter. It's the job of the prepare filter to
+ * give us the *same* response for a given xid across multiple calls
+ * (including ones on restart)
+ */
+ return !(rb->filter_prepare(rb, txn, xid, gid));
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ABORT PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit)
+{
+ ReorderBufferTXN *txn;
+
+ /*
+ * The transaction may or may not exist (during restarts for example).
+ * Either way, a transaction decoded at PREPARE time has no changes left
+ * in the reorder buffer, so allow it to be created below.
+ */
+ txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+ true);
+
+ txn->final_lsn = commit_lsn;
+ txn->end_lsn = end_lsn;
+ txn->commit_time = commit_time;
+ txn->origin_id = origin_id;
+ txn->origin_lsn = origin_lsn;
+ /* this txn is obviously prepared */
+ txn->txn_flags |= RBTXN_PREPARE;
+ txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+ strcpy(txn->gid, gid);
+
+ if (is_commit)
+ {
+ txn->txn_flags |= RBTXN_COMMIT_PREPARED;
+ rb->commit_prepared(rb, txn, commit_lsn);
+ }
+ else
+ {
+ txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
+ rb->abort_prepared(rb, txn, commit_lsn);
+ }
+
+ /* cleanup: make sure there's no cache pollution */
+ ReorderBufferExecuteInvalidations(rb, txn);
+ ReorderBufferCleanupTXN(rb, txn);
+}
+
/*
* Abort a transaction that possibly has previous changes. Needs to be first
* called for subtransactions and then for the toplevel xid.
@@ -1711,7 +1896,7 @@ ReorderBufferAbortOld(ReorderBuffer *rb, TransactionId oldestRunningXid)
if (rbtxn_is_serialized(txn) && txn->final_lsn == 0)
{
ReorderBufferChange *last =
- dlist_tail_element(ReorderBufferChange, node, &txn->changes);
+ dlist_tail_element(ReorderBufferChange, node, &txn->changes);
txn->final_lsn = last->lsn;
}
@@ -2625,9 +2810,9 @@ ReorderBufferSerializedPath(char *path, ReplicationSlot *slot, TransactionId xid
XLogSegNoOffsetToRecPtr(segno, 0, recptr, wal_segment_size);
snprintf(path, MAXPGPATH, "pg_replslot/%s/xid-%u-lsn-%X-%X.snap",
- NameStr(MyReplicationSlot->data.name),
- xid,
- (uint32) (recptr >> 32), (uint32) recptr);
+ NameStr(MyReplicationSlot->data.name),
+ xid,
+ (uint32) (recptr >> 32), (uint32) recptr);
}
/*
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 63b14367f0..fbe18dff56 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -46,11 +46,11 @@ typedef struct LogicalDecodingContext
struct SnapBuild *snapshot_builder;
/*
- * Marks the logical decoding context as fast forward decoding one.
- * Such a context does not have plugin loaded so most of the the following
+ * Marks the logical decoding context as fast forward decoding one. Such a
+ * context does not have a plugin loaded, so most of the following
* properties are unused.
*/
- bool fast_forward;
+ bool fast_forward;
OutputPluginCallbacks callbacks;
OutputPluginOptions options;
@@ -89,6 +89,11 @@ typedef struct LogicalDecodingContext
bool prepared_write;
XLogRecPtr write_location;
TransactionId write_xid;
+
+ /*
+ * Capabilities of the output plugin.
+ */
+ bool enable_twophase;
} LogicalDecodingContext;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 82875d6b3d..5254210a46 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -68,6 +68,46 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+/*
+ * Called for an implicit ABORT of a transaction.
+ */
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+ /*
+ * Called before decoding of a PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to the prepare and
+ * commit_prepared/abort_prepared callbacks, or should wait until COMMIT
+ * PREPARED and be sent as a regular transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
/*
* Called for the generic logical decoding messages.
*/
@@ -99,7 +139,12 @@ typedef struct OutputPluginCallbacks
LogicalDecodeBeginCB begin_cb;
LogicalDecodeChangeCB change_cb;
LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeAbortCB abort_cb;
LogicalDecodeMessageCB message_cb;
+ LogicalDecodeFilterPrepareCB filter_prepare_cb;
+ LogicalDecodePrepareCB prepare_cb;
+ LogicalDecodeCommitPreparedCB commit_prepared_cb;
+ LogicalDecodeAbortPreparedCB abort_prepared_cb;
LogicalDecodeFilterByOriginCB filter_by_origin_cb;
LogicalDecodeShutdownCB shutdown_cb;
} OutputPluginCallbacks;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 385bb486bb..d890e6628c 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
#define REORDERBUFFER_H
#include "access/htup_details.h"
+#include "access/twophase.h"
#include "lib/ilist.h"
#include "storage/sinval.h"
#include "utils/hsearch.h"
@@ -179,6 +180,9 @@ typedef struct ReorderBufferTXN
*/
TransactionId xid;
+ /* In case of 2PC we need to pass GID to output plugin */
+ char *gid;
+
/*
* LSN of the first data carrying, WAL record with knowledge about this
* xid. This is allowed to *not* be first record adorned with this xid, if
@@ -312,6 +316,37 @@ typedef void (*ReorderBufferCommitCB) (
ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+/* abort callback signature */
+typedef void (*ReorderBufferAbortCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+typedef bool (*ReorderBufferFilterPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/* abort prepared callback signature */
+typedef void (*ReorderBufferAbortPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+
/* message callback signature */
typedef void (*ReorderBufferMessageCB) (
ReorderBuffer *rb,
@@ -347,6 +382,11 @@ struct ReorderBuffer
ReorderBufferBeginCB begin;
ReorderBufferApplyChangeCB apply_change;
ReorderBufferCommitCB commit;
+ ReorderBufferAbortCB abort;
+ ReorderBufferFilterPrepareCB filter_prepare;
+ ReorderBufferPrepareCB prepare;
+ ReorderBufferCommitPreparedCB commit_prepared;
+ ReorderBufferAbortPreparedCB abort_prepared;
ReorderBufferMessageCB message;
/*
@@ -394,6 +434,11 @@ void ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot
void ReorderBufferCommit(ReorderBuffer *, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit);
void ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
void ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
@@ -417,6 +462,15 @@ void ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
bool ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
bool ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
+bool ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+ const char *gid);
+bool ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+ const char *gid);
+void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid);
ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
void ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
--
2.15.1 (Apple Git-101)
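
For readers skimming the patch above, the "all or none" rule that StartupDecodingContext() applies to the two-phase callbacks can be sketched in isolation. This is only a simplified model, not the patch's code: DemoCallbacks, DemoCheck, demo_noop and demo_check_twophase are hypothetical names invented for illustration, and the real check reports failures through ereport(ERROR, ...) rather than a return code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the 2PC members of OutputPluginCallbacks. */
typedef void (*DemoCB) (void);

typedef struct DemoCallbacks
{
	DemoCB		filter_prepare_cb;	/* optional, needs the other three */
	DemoCB		prepare_cb;
	DemoCB		commit_prepared_cb;
	DemoCB		abort_prepared_cb;
} DemoCallbacks;

/* Hypothetical result codes; the patch raises an ERROR instead. */
typedef enum DemoCheck
{
	DEMO_TWOPHASE_OFF,			/* no 2PC callbacks: decode at COMMIT PREPARED */
	DEMO_TWOPHASE_ON,			/* all three present: decode at PREPARE */
	DEMO_ERR_PARTIAL,			/* one or two of the three registered */
	DEMO_ERR_FILTER_ONLY		/* filter_prepare without 2PC support */
} DemoCheck;

/* Placeholder callback used only to get a non-NULL function pointer. */
static void
demo_noop(void)
{
}

static DemoCheck
demo_check_twophase(const DemoCallbacks *cb)
{
	/* Count the mandatory callbacks, as the patch does. */
	int			n = (cb->prepare_cb != NULL) +
		(cb->commit_prepared_cb != NULL) +
		(cb->abort_prepared_cb != NULL);

	assert(n >= 0 && n <= 3);

	if (n != 0 && n != 3)
		return DEMO_ERR_PARTIAL;

	/* filter_prepare is optional, but only valid with two-phase enabled. */
	if (cb->filter_prepare_cb != NULL && n != 3)
		return DEMO_ERR_FILTER_ONLY;

	return (n == 3) ? DEMO_TWOPHASE_ON : DEMO_TWOPHASE_OFF;
}
```

A plugin that registers none of the three callbacks keeps the existing behaviour, while a partially implemented plugin is rejected when the decoding context starts rather than failing later in the middle of decoding.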
0004-Teach-test_decoding-plugin-to-work-with-2PC.0404.v2.0.patch (application/octet-stream)
From 1ef6691faaed1430cf9c9f70b116d598242798b3 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Wed, 4 Apr 2018 17:18:13 +0530
Subject: [PATCH 4/5] Teach test_decoding plugin to work with 2PC
Includes a new option "enable-twophase". Depending on this option's
value, PREPARE TRANSACTION will either be decoded at PREPARE time or
treated as a single-phase commit at COMMIT PREPARED.
---
contrib/test_decoding/expected/prepared.out | 257 +++++++++++++++++++++++++---
contrib/test_decoding/sql/prepared.sql | 84 ++++++++-
contrib/test_decoding/test_decoding.c | 137 +++++++++++++++
3 files changed, 451 insertions(+), 27 deletions(-)
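
The effect of the option on the decoded stream, as the expected output in this patch shows it, can be summarised with a small model. This is a sketch under assumptions and not code from test_decoding.c: DemoEvent and demo_decode are hypothetical names, and "<changes>" is a placeholder for the decoded row changes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical WAL events relevant to a prepared transaction. */
typedef enum DemoEvent
{
	DEMO_PREPARE,
	DEMO_COMMIT_PREPARED,
	DEMO_ROLLBACK_PREPARED
} DemoEvent;

/*
 * Write into "out" what a slot would return for one event, and return the
 * number of characters written. "<changes>" stands in for the row changes.
 */
static size_t
demo_decode(bool enable_twophase, DemoEvent ev, const char *gid,
			char *out, size_t outlen)
{
	assert(out != NULL && outlen > 0);
	out[0] = '\0';

	if (enable_twophase)
	{
		/* Changes are sent at PREPARE; the finish record stands alone. */
		switch (ev)
		{
			case DEMO_PREPARE:
				snprintf(out, outlen,
						 "BEGIN\n<changes>\nPREPARE TRANSACTION '%s'\n", gid);
				break;
			case DEMO_COMMIT_PREPARED:
				snprintf(out, outlen, "COMMIT PREPARED '%s'\n", gid);
				break;
			case DEMO_ROLLBACK_PREPARED:
				snprintf(out, outlen, "ROLLBACK PREPARED '%s'\n", gid);
				break;
		}
	}
	else if (ev == DEMO_COMMIT_PREPARED)
	{
		/* Without 2PC decoding, everything appears as a plain commit. */
		snprintf(out, outlen, "BEGIN\n<changes>\nCOMMIT\n");
	}

	/* PREPARE and ROLLBACK PREPARED emit nothing when 2PC decoding is off. */
	return strlen(out);
}
```

In this model a rolled-back prepared transaction is never visible to a consumer that leaves the option off, which matches the empty result for regression_slot after ROLLBACK PREPARED in the expected output.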
diff --git a/contrib/test_decoding/expected/prepared.out b/contrib/test_decoding/expected/prepared.out
index 46e915d4ff..5df7b7ff20 100644
--- a/contrib/test_decoding/expected/prepared.out
+++ b/contrib/test_decoding/expected/prepared.out
@@ -6,19 +6,82 @@ SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_d
init
(1 row)
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc', 'test_decoding');
+ ?column?
+----------
+ init
+(1 row)
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
-- test simple successful use of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (1);
PREPARE TRANSACTION 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+------
+(0 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ PREPARE TRANSACTION 'test_prepared#1'
+(3 rows)
+
COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ COMMIT
+(3 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
INSERT INTO test_prepared1 VALUES (2);
-- test abort of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (3);
PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+(3 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(6 rows)
+
ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+------
+(0 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
INSERT INTO test_prepared1 VALUES (4);
-- test prepared xact containing ddl
BEGIN;
@@ -26,45 +89,193 @@ INSERT INTO test_prepared1 VALUES (5);
ALTER TABLE test_prepared1 ADD COLUMN data text;
INSERT INTO test_prepared1 VALUES (6, 'frakbar');
PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
-INSERT INTO test_prepared2 VALUES (7);
-COMMIT PREPARED 'test_prepared#3';
--- make sure stuff still works
-INSERT INTO test_prepared1 VALUES (8);
-INSERT INTO test_prepared2 VALUES (9);
--- cleanup
-DROP TABLE test_prepared1;
-DROP TABLE test_prepared2;
--- show results
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'test_prepared1'::regclass;
+ relation | locktype | mode
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4
+ COMMIT
+(3 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
data
-------------------------------------------------------------------------
BEGIN
- table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:4
COMMIT
BEGIN
- table public.test_prepared1: INSERT: id[integer]:2
- COMMIT
+ table public.test_prepared1: INSERT: id[integer]:5
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(7 rows)
+
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------
BEGIN
- table public.test_prepared1: INSERT: id[integer]:4
+ table public.test_prepared2: INSERT: id[integer]:7
COMMIT
+(3 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+----------------------------------------------------
BEGIN
table public.test_prepared2: INSERT: id[integer]:7
COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+-------------------------------------------------------------------------
BEGIN
table public.test_prepared1: INSERT: id[integer]:5
table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
COMMIT
+(4 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#3'
+(1 row)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (8);
+INSERT INTO test_prepared2 VALUES (9);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+--------------------------------------------------------------------
BEGIN
table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
COMMIT
BEGIN
table public.test_prepared2: INSERT: id[integer]:9
COMMIT
-(22 rows)
+(6 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:9
+ COMMIT
+(6 rows)
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'pg_class'::regclass;
+ relation | locktype | mode
+----------+----------+------
+(0 rows)
+
+-- Shouldn't see anything with 2pc decoding off
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+-----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:12
+ COMMIT
+(3 rows)
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:12
+ PREPARE TRANSACTION 'test_prepared_lock2'
+ COMMIT PREPARED 'test_prepared_lock2'
+(8 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- Both will work normally after we commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ COMMIT
+(4 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+--------------------------------------
+ COMMIT PREPARED 'test_prepared_lock'
+(1 row)
+
+-- test savepoints
+BEGIN;
+SAVEPOINT test_savepoint;
+CREATE TABLE test_prepared_savepoint (a int);
+PREPARE TRANSACTION 'test_prepared_savepoint';
+COMMIT PREPARED 'test_prepared_savepoint';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+-------------------------------------------
+ COMMIT PREPARED 'test_prepared_savepoint'
+(1 row)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+------
+(0 rows)
+
+-- cleanup
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+------
+(0 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+------
+(0 rows)
SELECT pg_drop_replication_slot('regression_slot');
pg_drop_replication_slot
@@ -72,3 +283,9 @@ SELECT pg_drop_replication_slot('regression_slot');
(1 row)
+SELECT pg_drop_replication_slot('regression_slot_2pc');
+ pg_drop_replication_slot
+--------------------------
+
+(1 row)
+
diff --git a/contrib/test_decoding/sql/prepared.sql b/contrib/test_decoding/sql/prepared.sql
index e72639767e..e8eb8ad8d6 100644
--- a/contrib/test_decoding/sql/prepared.sql
+++ b/contrib/test_decoding/sql/prepared.sql
@@ -1,22 +1,31 @@
-- predictability
SET synchronous_commit = on;
SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc', 'test_decoding');
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
-- test simple successful use of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (1);
PREPARE TRANSACTION 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
INSERT INTO test_prepared1 VALUES (2);
-- test abort of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (3);
PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
INSERT INTO test_prepared1 VALUES (4);
@@ -27,24 +36,85 @@ ALTER TABLE test_prepared1 ADD COLUMN data text;
INSERT INTO test_prepared1 VALUES (6, 'frakbar');
PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'test_prepared1'::regclass;
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
-- make sure stuff still works
INSERT INTO test_prepared1 VALUES (8);
INSERT INTO test_prepared2 VALUES (9);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'pg_class'::regclass;
+
+-- Shouldn't see anything with 2pc decoding off
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+RESET statement_timeout;
+
+COMMIT PREPARED 'test_prepared_lock';
+
+-- Both will work normally after we commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+
+-- test savepoints
+BEGIN;
+SAVEPOINT test_savepoint;
+CREATE TABLE test_prepared_savepoint (a int);
+PREPARE TRANSACTION 'test_prepared_savepoint';
+COMMIT PREPARED 'test_prepared_savepoint';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
-- cleanup
DROP TABLE test_prepared1;
DROP TABLE test_prepared2;
--- show results
+-- show results. There should be nothing to show
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+
SELECT pg_drop_replication_slot('regression_slot');
+SELECT pg_drop_replication_slot('regression_slot_2pc');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index a94aeeae29..05b993fd7a 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -36,6 +36,7 @@ typedef struct
bool skip_empty_xacts;
bool xact_wrote_changes;
bool only_local;
+ bool enable_twophase;
} TestDecodingData;
static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -49,6 +50,8 @@ static void pg_output_begin(LogicalDecodingContext *ctx,
bool last_write);
static void pg_decode_commit_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pg_decode_abort_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr abort_lsn);
static void pg_decode_change(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, Relation rel,
ReorderBufferChange *change);
@@ -58,6 +61,18 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr message_lsn,
bool transactional, const char *prefix,
Size sz, const char *message);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
void
_PG_init(void)
@@ -75,9 +90,14 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
cb->begin_cb = pg_decode_begin_txn;
cb->change_cb = pg_decode_change;
cb->commit_cb = pg_decode_commit_txn;
+ cb->abort_cb = pg_decode_abort_txn;
cb->filter_by_origin_cb = pg_decode_filter;
cb->shutdown_cb = pg_decode_shutdown;
cb->message_cb = pg_decode_message;
+ cb->filter_prepare_cb = pg_decode_filter_prepare;
+ cb->prepare_cb = pg_decode_prepare_txn;
+ cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+ cb->abort_prepared_cb = pg_decode_abort_prepared_txn;
}
@@ -97,6 +117,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
data->include_timestamp = false;
data->skip_empty_xacts = false;
data->only_local = false;
+ data->enable_twophase = false;
ctx->output_plugin_private = data;
@@ -178,6 +199,16 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
errmsg("could not parse value \"%s\" for parameter \"%s\"",
strVal(elem->arg), elem->defname)));
}
+ else if (strcmp(elem->defname, "enable-twophase") == 0)
+ {
+ if (elem->arg == NULL)
+ data->enable_twophase = true;
+ else if (!parse_bool(strVal(elem->arg), &data->enable_twophase))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("could not parse value \"%s\" for parameter \"%s\"",
+ strVal(elem->arg), elem->defname)));
+ }
else
{
ereport(ERROR,
@@ -246,6 +277,112 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
OutputPluginWrite(ctx, true);
}
+/* ABORT callback */
+static void
+pg_decode_abort_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+ if (data->include_xids)
+ appendStringInfo(ctx->out, "ABORT %u", txn->xid);
+ else
+ appendStringInfoString(ctx->out, "ABORT");
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* Filter out two-phase transactions, if decoding not enabled. */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ /* treat all transactions as one-phase */
+ if (!data->enable_twophase)
+ return true;
+
+ return false;
+}
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* ABORT PREPARED callback */
+static void
+pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
static bool
pg_decode_filter(LogicalDecodingContext *ctx,
RepOriginId origin_id)
--
2.15.1 (Apple Git-101)
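For readers skimming the patch, the test_decoding behaviour added above can be approximated outside the backend. This is only a rough sketch under stated assumptions: SketchOptions and the sketch_* functions are hypothetical names that do not exist in PostgreSQL, and the quoting helper is a simplified stand-in for quote_literal_cstr (it only doubles embedded single quotes).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for TestDecodingData; only the fields used here. */
typedef struct
{
    bool include_xids;
    bool enable_twophase;
} SketchOptions;

/* Mirrors pg_decode_filter_prepare: true means "treat as one-phase". */
bool
sketch_filter_prepare(const SketchOptions *opts)
{
    return !opts->enable_twophase;
}

/*
 * Simplified quoting (assumption): wrap in single quotes and double any
 * embedded single quote. Returns false if dst is too small.
 */
bool
sketch_quote_literal(const char *src, char *dst, size_t dstlen)
{
    size_t n = 0;

    if (dstlen < 3)
        return false;
    dst[n++] = '\'';
    for (; *src; src++)
    {
        size_t need = (*src == '\'') ? 2 : 1;

        /* keep room for the closing quote and the terminating NUL */
        if (n + need + 2 > dstlen)
            return false;
        if (*src == '\'')
            dst[n++] = '\'';
        dst[n++] = *src;
    }
    dst[n++] = '\'';
    dst[n] = '\0';
    return true;
}

/* Builds the line the PREPARE callback in the patch emits. */
bool
sketch_format_prepare(const SketchOptions *opts, const char *gid,
                      unsigned int xid, char *out, size_t outlen)
{
    char quoted[512];
    int  written;

    if (!sketch_quote_literal(gid, quoted, sizeof(quoted)))
        return false;
    if (opts->include_xids)
        written = snprintf(out, outlen, "PREPARE TRANSACTION %s %u", quoted, xid);
    else
        written = snprintf(out, outlen, "PREPARE TRANSACTION %s", quoted);
    return written >= 0 && (size_t) written < outlen;
}
```

With enable_twophase off, the filter returns true and the transaction is decoded only at commit, which matches the plain BEGIN/COMMIT output of regression_slot in the expected file.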
0005-Optional-Additional-test-case-to-demonstrate-decoding-rollbac.0404.v2.0.patch (application/octet-stream)
From e067582bf8cb05956222f778d377c049f7777726 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Wed, 4 Apr 2018 17:18:53 +0530
Subject: [PATCH 5/5] Additional test case to demonstrate decoding/rollback
interlocking
Introduce a decode-delay parameter in the test_decoding plugin. Based
on the value provided, the plugin sleeps for that many seconds while
holding the lock taken by LogicalLockTransaction. A concurrent rollback
is fired off, which aborts that transaction in the meantime.
---
contrib/test_decoding/Makefile | 5 +-
contrib/test_decoding/t/001_twophase.pl | 102 ++++++++++++++++++++++++
contrib/test_decoding/test_decoding.c | 24 ++++++
src/backend/replication/logical/reorderbuffer.c | 5 ++
4 files changed, 135 insertions(+), 1 deletion(-)
create mode 100644 contrib/test_decoding/t/001_twophase.pl
diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 6c18189d9d..79b9622600 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -26,7 +26,7 @@ installcheck:;
# installation, allow to do so, but only if requested explicitly.
installcheck-force: regresscheck-install-force isolationcheck-install-force
-check: regresscheck isolationcheck
+check: regresscheck isolationcheck 2pc-check
submake-regress:
$(MAKE) -C $(top_builddir)/src/test/regress all
@@ -66,3 +66,6 @@ isolationcheck-install-force: all | submake-isolation submake-test_decoding temp
isolationcheck isolationcheck-install-force
temp-install: EXTRA_INSTALL=contrib/test_decoding
+
+2pc-check: temp-install
+ $(prove_check)
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000000..d50e2c9940
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,102 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+ 'postgresql.conf', qq(
+ max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+ "INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+ "SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+$node_logical->safe_psql('postgres',
+ "SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot2', 'test_decoding');");
+
+# This test is specifically for testing concurrent abort while logical decode is
+# ongoing. The decode-delay value makes the decoding of each change sleep for
+# that many seconds. We also hold the LogicalLockTransaction lock while we sleep.
+# We will fire off a ROLLBACK from another session while this delayed decode is
+# ongoing. Since we are holding the lock from the call above, this ROLLBACK
+# will wait for the logical backends to do a LogicalUnlockTransaction. We will
+# stop decoding immediately after this and the next pg_logical_slot_get_changes call
+# should show only a few records decoded from the entire two-phase transaction.
+#
+# We use two slots to test multiple decoding backends here
+
+$node_logical->safe_psql('postgres', "
+ BEGIN;
+ INSERT INTO tab VALUES (11);
+ INSERT INTO tab VALUES (12);
+ ALTER TABLE tab ADD COLUMN b INT;
+ PREPARE TRANSACTION 'test_prepared_tab';");
+
+# start decoding the above with decode-delay in the background.
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should only decode 1 INSERT record and should include
+# an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1', 'decode-delay', '3');\" \&");
+
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot2', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1', 'decode-delay', '3');\" \&");
+
+# sleep for a little while (shorter than decode-delay)
+$node_logical->safe_psql('postgres', "select pg_sleep(1)");
+
+# rollback the prepared transaction whose first record is being decoded
+# after sleeping for decode-delay time
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# wait for decoding to stop
+$node_logical->psql('postgres', "select pg_sleep(4)");
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot2', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+
+# check for occurrence of log about waiting backends
+my $output_file = slurp_file($node_logical->logfile());
+my $waiting_str = "Waiting for backends to abort";
+like($output_file, qr/$waiting_str/, "Waiting log found in server log");
+
+# check for occurrence of log about stopping decoding
+my $abort_str = "stopping decoding of test_prepared_tab ";
+like($output_file, qr/$abort_str/, "ABORT found in server log");
+
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+ BEGIN;
+ INSERT INTO tab VALUES (11);
+ INSERT INTO tab VALUES (12);
+ ALTER TABLE tab ADD COLUMN b INT;
+ INSERT INTO tab VALUES (13, 11);
+ PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot2', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit post the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot2', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot2');");
+$node_logical->stop('fast');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 05b993fd7a..db7becdc44 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -37,6 +37,7 @@ typedef struct
bool xact_wrote_changes;
bool only_local;
bool enable_twophase;
+ int decode_delay; /* seconds to sleep after every change record */
} TestDecodingData;
static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -118,6 +119,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
data->skip_empty_xacts = false;
data->only_local = false;
data->enable_twophase = false;
+ data->decode_delay = 0;
ctx->output_plugin_private = data;
@@ -209,6 +211,21 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
errmsg("could not parse value \"%s\" for parameter \"%s\"",
strVal(elem->arg), elem->defname)));
}
+ else if (strcmp(elem->defname, "decode-delay") == 0)
+ {
+ if (elem->arg == NULL)
+ data->decode_delay = 2; /* default to 2 seconds */
+ else
+ data->decode_delay = pg_atoi(strVal(elem->arg),
+ sizeof(int), 0);
+
+ if (data->decode_delay <= 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("Specify positive value for parameter \"%s\","
+ " you specified \"%s\"",
+ elem->defname, strVal(elem->arg))));
+ }
else
{
ereport(ERROR,
@@ -548,6 +565,13 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
}
data->xact_wrote_changes = true;
+ /* if decode_delay is specified, sleep with above lock held */
+ if (data->decode_delay > 0)
+ {
+ elog(LOG, "sleeping for %d seconds", data->decode_delay);
+ pg_usleep(data->decode_delay * 1000000L);
+ }
+
class_form = RelationGetForm(relation);
tupdesc = RelationGetDescr(relation);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 1c7dbd3ade..adb6adef88 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1372,7 +1372,12 @@ ReorderBufferCommitInternal(ReorderBufferTXN *txn,
/* Lock transaction before catalog access */
if (!LogicalLockTransaction(txn))
+ {
+ elog(LOG, "stopping decoding of %s (%u)",
+ txn->gid[0] != '\0'? txn->gid:"",
+ txn->xid);
break;
+ }
reloid = RelidByRelfilenode(change->data.tp.relnode.spcNode,
change->data.tp.relnode.relNode);
LogicalUnlockTransaction(txn);
--
2.15.1 (Apple Git-101)
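The decode-delay handling in the patch above (default of 2 seconds when no value is given, error on non-positive values) can be illustrated as a standalone function. This is a sketch only, not the plugin's code: sketch_parse_delay and SKETCH_DEFAULT_DELAY are hypothetical names, it uses strtol rather than the backend's pg_atoi, and it assumes the option is meant to hold a whole number of seconds in an int.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

#define SKETCH_DEFAULT_DELAY 2  /* the patch defaults to 2 seconds */

/*
 * Parse a "decode-delay" style option value into whole seconds.
 * arg == NULL models the option being given without a value.
 * Returns true and stores a positive value on success; returns false
 * where the plugin would raise an error instead.
 */
bool
sketch_parse_delay(const char *arg, int *seconds)
{
    char *end;
    long  val;

    if (arg == NULL)
    {
        *seconds = SKETCH_DEFAULT_DELAY;
        return true;
    }

    errno = 0;
    val = strtol(arg, &end, 10);
    if (errno != 0 || end == arg || *end != '\0')
        return false;           /* not a clean integer */
    if (val <= 0 || val > INT_MAX)
        return false;           /* non-positive or out of range */

    *seconds = (int) val;
    return true;
}
```

The TAP test passes 'decode-delay', '3', so each decoded change would sleep for three seconds, longer than the one-second pause before the ROLLBACK PREPARED is issued.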
Hi,
I think the patch looks mostly fine. I'm about to do a bit more testing
on it, but here are a few comments first. The attached diff shows the
places discussed below, with comments, in more detail.
1) There's a race condition in LogicalLockTransaction. The code does
roughly this:

    if (!BecomeDecodeGroupMember(...))
        ... bail out ...

    Assert(MyProc->decodeGroupLeader);
    lwlock = LockHashPartitionLockByProc(MyProc->decodeGroupLeader);
    ...

but AFAICS there is no guarantee that the transaction does not commit
(or even abort) right after we become a decode group member. In that
case LogicalDecodeRemoveTransaction might have already reset our pointer
to the leader to NULL, and both the Assert() and the lock lookup will fail.

I initially thought this could be fixed by setting decodeLocked=true in
BecomeDecodeGroupMember, but that's not really true - that would fix the
race for aborts, but not commits. LogicalDecodeRemoveTransaction skips
the wait for commits entirely, and just resets the flags anyway.

So this needs a different fix, I think. BecomeDecodeGroupMember also
needs the leader PGPROC pointer, but it does not have the issue because
it gets it as a parameter. I think the same thing would work here
too - that is, use the AssignDecodeGroupLeader() result instead.
2) BecomeDecodeGroupMember sets decodeGroupLeader=NULL when the
leader does not match the parameters, even though an Assert() at the
beginning already enforces that it is NULL. Let's remove that assignment.
3) I don't quite understand why BecomeDecodeGroupMember does the
cross-check using PID. In which case would it help?
4) AssignDecodeGroupLeader still sets pid, which is never read. Remove.
5) ReorderBufferCommitInternal does elog(LOG) about interrupting the
decoding of an aborted transaction in only one place. There are about three
other places where we check LogicalLockTransaction. Seems inconsistent.
6) The comment before LogicalLockTransaction is somewhat inaccurate,
because it talks about adding/removing the backend to the group, but
that's not what's happening. We join the group on the first call and
then we only tweak the decodeLocked flag.
7) I propose minor changes to a couple of comments.
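
To make point 1 concrete, here is a single-threaded model of the window described there. It is a sketch under heavy assumptions, not backend code: SketchProc and the sketch_* functions are hypothetical, locking is omitted, and the "commit" is simulated by an explicit call between the join and the lock lookup. It only shows why a pointer obtained from the lookup (as AssignDecodeGroupLeader() returns) stays usable while the shared per-backend field may already be NULL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for PGPROC. */
typedef struct SketchProc
{
    struct SketchProc *group_leader;    /* models decodeGroupLeader */
    bool               finished;        /* leader's transaction completed */
    int                lock_partition;  /* what the lock would be derived from */
} SketchProc;

/* Models BecomeDecodeGroupMember: join unless the leader is already gone. */
bool
sketch_join_group(SketchProc *me, SketchProc *leader)
{
    if (leader->finished)
        return false;
    me->group_leader = leader;
    return true;
}

/* Models LogicalDecodeRemoveTransaction on commit: detach without waiting. */
void
sketch_leader_finish(SketchProc *leader, SketchProc *member)
{
    leader->finished = true;
    member->group_leader = NULL;
}

/* Racy variant: re-reads the shared field. Returns -1 where Assert() would fire. */
int
sketch_partition_via_shared_field(const SketchProc *me)
{
    if (me->group_leader == NULL)
        return -1;
    return me->group_leader->lock_partition;
}

/* Suggested direction: keep using the pointer handed back by the lookup. */
int
sketch_partition_via_local_pointer(const SketchProc *leader)
{
    return leader->lock_partition;
}
```

Whether the transaction status still has to be re-checked after taking the lock is a separate question that this model does not address.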
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
2pc-review.diff (text/x-patch)
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 65382c2..b8b73a4 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1236,13 +1236,19 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
* while accessing any catalogs. To enforce that, each decoding backend
* has to call LogicalLockTransaction prior to any catalog access, and
* then LogicalUnlockTransaction immediately after it. These functions
- * add/remove the decoding backend from a "decoding group" for a given
- * transaction. While aborting a prepared transaction, the backend will
- * wait for all current members of the decoding group to leave (see
- * LogicalDecodeRemoveTransaction).
+ * add the decoding backend into a "decoding group" for the transaction
+ * (on the first call), and then update a flag indicating whether the
+ * decoding backend may be accessing any catalogs.
*
- * The function return true when it's safe to access catalogs, and
- * false when the transaction aborted (or is being aborted) in which
+ * While aborting a prepared transaction, the backend is made to wait
+ * for all current members of the decoding group that may be currently
+ * accessing catalogs (see LogicalDecodeRemoveTransaction). Once the
+ * transaction completes (applies to both abort and commit), the group
+ * is destroyed and is not needed anymore (we can check transaction
+ * status directly, instead).
+ *
+ * The function returns true when it's safe to access catalogs, and
+ * false when the transaction aborted (or is being aborted), in which
* case the plugin should stop decoding it.
*
* The decoding backend joins the decoding group only when actually
@@ -1324,6 +1330,12 @@ LogicalLockTransaction(ReorderBufferTXN *txn)
/*
* If we were able to add ourself, then Abort processing will
* interlock with us.
+ *
+ * XXX There's a race condition here, I think. BecomeDecodeGroupMember
+ * made us a member of the group, but the transaction might have
+ * finished since then. In which case (decodeGroupLeader == NULL).
+ * We need to set (decodeLocked = true) in BecomeDecodeGroupMember,
+ * so that the leader waits for us.
*/
Assert(MyProc->decodeGroupLeader);
@@ -1333,6 +1345,9 @@ LogicalLockTransaction(ReorderBufferTXN *txn)
/*
* Re-check if we were told to abort by the leader after taking
* the above lock
+ *
+ * XXX It's not quite clear to me why we need the separate flag
+ * in our process. Why not to simply check the leader's flag?
*/
if (MyProc->decodeAbortPending)
{
@@ -1410,7 +1425,12 @@ LogicalUnlockTransaction(ReorderBufferTXN *txn)
if (rbtxn_commit(txn))
return;
+ /*
+ * We're guaranteed to still have a leader here, because we are
+ * in locked mode, so the leader can't just disappear.
+ */
Assert(MyProc->decodeGroupLeader);
+
leader_lwlock = LockHashPartitionLockByProc(MyProc->decodeGroupLeader);
LWLockAcquire(leader_lwlock, LW_SHARED);
if (MyProc->decodeAbortPending)
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index adb6ade..908eada 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1396,6 +1396,7 @@ ReorderBufferCommitInternal(ReorderBufferTXN *txn,
MAIN_FORKNUM));
/* Lock transaction before catalog access */
+ /* XXX Why no elog(LOG) here? */
if (!LogicalLockTransaction(txn))
break;
@@ -1443,6 +1444,7 @@ ReorderBufferCommitInternal(ReorderBufferTXN *txn,
* transaction will be around for the duration of
* the apply_change call below
*/
+ /* XXX Why no elog(LOG) here? */
if (!LogicalLockTransaction(txn))
break;
ReorderBufferToastReplace(rb, txn, relation, change);
@@ -1520,7 +1522,7 @@ ReorderBufferCommitInternal(ReorderBufferTXN *txn,
break;
case REORDER_BUFFER_CHANGE_MESSAGE:
- /* XXX does rb->message need lock/unlock? */
+ /* XXX Why no elog(LOG) here? */
if (!LogicalLockTransaction(txn))
break;
rb->message(rb, txn, change->lsn, true,
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 2c002a2..3fc3a65 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -2004,7 +2004,6 @@ PGPROC *
AssignDecodeGroupLeader(TransactionId xid)
{
PGPROC *proc = NULL;
- int pid;
LWLock *leader_lwlock;
Assert(xid != InvalidTransactionId);
@@ -2015,9 +2014,7 @@ AssignDecodeGroupLeader(TransactionId xid)
* If the transaction already completed, we can bail out.
*/
proc = BackendXidGetProc(xid);
- if (proc)
- pid = proc->pid;
- else
+ if (!proc)
return NULL;
/*
@@ -2093,6 +2090,10 @@ AssignDecodeGroupLeader(TransactionId xid)
* that, we require the caller to pass the PID of the intended PGPROC as
* an interlock. Returns true if we successfully join the intended lock
* group, and false if not.
+ *
+ * XXX Not sure why we are passing in the PID, considering we only deal
+ * with prepared transactions now, which means (pid==0). Shouldn't we
+ * use XID instead, for example?
*/
bool
BecomeDecodeGroupMember(PGPROC *leader, int pid, bool is_prepared)
@@ -2107,7 +2108,7 @@ BecomeDecodeGroupMember(PGPROC *leader, int pid, bool is_prepared)
Assert(MyProc->decodeGroupLeader == NULL);
/* PID must be valid OR this is a prepared transaction. */
- Assert(pid != 0 || is_prepared);
+ Assert(((pid != 0) && !is_prepared) || ((pid == 0) && is_prepared));
/*
* Get lock protecting the group fields. Note LockHashPartitionLockByProc
@@ -2122,8 +2123,6 @@ BecomeDecodeGroupMember(PGPROC *leader, int pid, bool is_prepared)
/* Is this the leader we're looking for? */
if (leader->pid == pid && leader->decodeGroupLeader == leader)
{
- if (is_prepared)
- Assert(pid == 0);
/* is the leader going away? */
if (leader->decodeAbortPending)
ok = false;
@@ -2131,10 +2130,13 @@ BecomeDecodeGroupMember(PGPROC *leader, int pid, bool is_prepared)
{
/* OK, join the group */
ok = true;
+ /* XXX unfortunately this does not prevent the race in LogicalLockTransaction :-( */
+ MyProc->decodeLocked = true;
MyProc->decodeGroupLeader = leader;
dlist_push_tail(&leader->decodeGroupMembers, &MyProc->decodeGroupLink);
}
}
+ /* XXX seems unnecessary, considering the assert at the beginning */
else
MyProc->decodeGroupLeader = NULL;
LWLockRelease(leader_lwlock);
Hi Tomas,
1) There's a race condition in LogicalLockTransaction. The code does
roughly this:
if (!BecomeDecodeGroupMember(...))
... bail out ...
Assert(MyProc->decodeGroupLeader);
lwlock = LockHashPartitionLockByProc(MyProc->decodeGroupLeader);
...
but AFAICS there is no guarantee that the transaction does not commit
(or even abort) right after the become decode group member. In which
case LogicalDecodeRemoveTransaction might have already reset our pointer
to a leader to NULL. In which case the Assert() and lock will fail.
I've initially thought this can be fixed by setting decodeLocked=true in
BecomeDecodeGroupMember, but that's not really true - that would fix the
race for aborts, but not commits. LogicalDecodeRemoveTransaction skips
the wait for commits entirely, and just resets the flags anyway.
So this needs a different fix, I think. BecomeDecodeGroupMember also
needs the leader PGPROC pointer, but it does not have the issue because
it gets it as a parameter. I think the same thing would work for here
too - that is, use the AssignDecodeGroupLeader() result instead.
That's a good catch. One of the earlier patches had a check for this
(it also had an ill-placed assert above though) which we removed as
part of the ongoing review.
Instead of doing the above, we can just re-check if the
decodeGroupLeader pointer has become NULL and if so, re-assert that
the leader has indeed gone away before returning false. I propose a
diff like below.
/*
* If we were able to add ourself, then Abort processing will
- * interlock with us.
+ * interlock with us. If the leader was done in the meanwhile
+ * it could have removed us and gone away as well.
*/
- Assert(MyProc->decodeGroupLeader);
+ if (MyProc->decodeGroupLeader == NULL)
+ {
+ Assert(BackendXidGetProc(txn->xid) == NULL);
+ return false;
+ }
2) BecomeDecodeGroupMember sets the decodeGroupLeader=NULL when the
leader does not match the parameters, despite enforcing it by Assert()
at the beginning. Let's remove that assignment.
Ok, done.
3) I don't quite understand why BecomeDecodeGroupMember does the
cross-check using PID. In which case would it help?
When I wrote this support, I had written it with the intention of
supporting both 2PC (in which case pid is 0) and in-progress regular
transactions. That's why the presence of PID in these functions. The
current use case is just for 2PC, so we could remove it.
4) AssignDecodeGroupLeader still sets pid, which is never read. Remove.
Ok, will do.
5) ReorderBufferCommitInternal does elog(LOG) about interrupting the
decoding of aborted transaction only in one place. There are about three
other places where we check LogicalLockTransaction. Seems inconsistent.
Note that I have added it for the OPTIONAL test_decoding test cases
(which AFAIK we don't plan to commit in that state) which demonstrate
concurrent rollback interlocking with the lock/unlock APIs. The first
ELOG was enough to catch the interaction. If we think these elogs
should be present in the code, then yes, I can add it elsewhere as
well as part of an earlier patch.
6) The comment before LogicalLockTransaction is somewhat inaccurate,
because it talks about adding/removing the backend to the group, but
that's not what's happening. We join the group on the first call and
then we only tweak the decodeLocked flag.
True.
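The behaviour described in point 6 (join the group once, then only toggle a flag) can be sketched in isolation. This is a minimal single-process model under stated assumptions: DemoProc, demo_lock and demo_unlock are hypothetical names, the LWLock is omitted, and the member count stands in for the real decodeGroupMembers list. It is not the patch's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-in for PGPROC. */
typedef struct DemoProc
{
	struct DemoProc *decodeGroupLeader; /* NULL until we join a group */
	bool		decodeLocked;		/* true while catalogs may be accessed */
	bool		decodeAbortPending; /* set by the leader when it aborts */
	int			nmembers;			/* leader only: number of group members */
} DemoProc;

/*
 * First call: join the leader's group. Every call: set decodeLocked,
 * unless the leader has announced an abort, in which case we leave the
 * group and report failure so the caller stops decoding.
 */
static bool
demo_lock(DemoProc *me, DemoProc *leader)
{
	assert(me != NULL && leader != NULL);

	if (me->decodeGroupLeader == NULL)
	{
		if (leader->decodeAbortPending)
			return false;		/* too late to join */
		me->decodeGroupLeader = leader;
		leader->nmembers++;
	}

	if (leader->decodeAbortPending)
	{
		me->decodeGroupLeader = NULL;
		leader->nmembers--;
		me->decodeLocked = false;
		return false;
	}

	me->decodeLocked = true;
	return true;
}

/* Unlock only clears the flag; group membership is kept. */
static void
demo_unlock(DemoProc *me)
{
	assert(me != NULL);
	me->decodeLocked = false;
}
```

In this model a second demo_lock() call does not increase nmembers, which is the distinction the comment before LogicalLockTransaction should make.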
7) I propose minor changes to a couple of comments.
Ok, I am looking at your provided patch and incorporating relevant
changes from it. Will submit a patch set soon.
Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
On 4/5/18 8:50 AM, Nikhil Sontakke wrote:
Hi Tomas,
1) There's a race condition in LogicalLockTransaction. The code does
roughly this:
if (!BecomeDecodeGroupMember(...))
... bail out ...
Assert(MyProc->decodeGroupLeader);
lwlock = LockHashPartitionLockByProc(MyProc->decodeGroupLeader);
...
but AFAICS there is no guarantee that the transaction does not commit
(or even abort) right after the become decode group member. In which
case LogicalDecodeRemoveTransaction might have already reset our pointer
to a leader to NULL. In which case the Assert() and lock will fail.
I've initially thought this can be fixed by setting decodeLocked=true in
BecomeDecodeGroupMember, but that's not really true - that would fix the
race for aborts, but not commits. LogicalDecodeRemoveTransaction skips
the wait for commits entirely, and just resets the flags anyway.
So this needs a different fix, I think. BecomeDecodeGroupMember also
needs the leader PGPROC pointer, but it does not have the issue because
it gets it as a parameter. I think the same thing would work for here
too - that is, use the AssignDecodeGroupLeader() result instead.
That's a good catch. One of the earlier patches had a check for this
(it also had an ill-placed assert above though) which we removed as
part of the ongoing review.
Instead of doing the above, we can just re-check if the
decodeGroupLeader pointer has become NULL and if so, re-assert that
the leader has indeed gone away before returning false. I propose a
diff like below.
/*
* If we were able to add ourself, then Abort processing will
- * interlock with us.
+ * interlock with us. If the leader was done in the meanwhile
+ * it could have removed us and gone away as well.
*/
- Assert(MyProc->decodeGroupLeader);
+ if (MyProc->decodeGroupLeader == NULL)
+ {
+ Assert(BackendXidGetProc(txn->xid) == NULL);
+ return false;
+ }
Uh? Simply rechecking if MyProc->decodeGroupLeader is NULL obviously
does not fix the race condition - it might get NULL right after the
check. So we need to either lookup the PROC again (and then get the
associated lwlock), or hold some other type of lock.
3) I don't quite understand why BecomeDecodeGroupMember does the
cross-check using PID. In which case would it help?
When I wrote this support, I had written it with the intention of
supporting both 2PC (in which case pid is 0) and in-progress regular
transactions. That's why the presence of PID in these functions. The
current use case is just for 2PC, so we could remove it.
Sure, but why do we need to cross-check the PID at all? I may be missing
something here, but I don't see what does this protect against?
5) ReorderBufferCommitInternal does elog(LOG) about interrupting the
decoding of aborted transaction only in one place. There are about three
other places where we check LogicalLockTransaction. Seems inconsistent.
Note that I have added it for the OPTIONAL test_decoding test cases
(which AFAIK we don't plan to commit in that state) which demonstrate
concurrent rollback interlocking with the lock/unlock APIs. The first
ELOG was enough to catch the interaction. If we think these elogs
should be present in the code, then yes, I can add it elsewhere as
well as part of an earlier patch.
Ah, I see. Makes sense. I've been looking at the patch as a whole and
haven't realized it's part of this piece.
Ok, I am looking at your provided patch and incorporating relevant
changes from it. Will submit a patch set soon.
OK.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi Tomas,
Uh? Simply rechecking if MyProc->decodeGroupLeader is NULL obviously
does not fix the race condition - it might get NULL right after the
check. So we need to either lookup the PROC again (and then get the
associated lwlock), or hold some other type of lock.
I realized my approach was short-sighted while coding it up. So now we
lookup the leader pgproc, recheck if the XID is the same that we are
interested in and go ahead.
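The "look up the leader, take its lock, then recheck the XID" sequence can be sketched as follows. This is a single-threaded model under stated assumptions: DemoSlot, demo_lock_leader and the hook are hypothetical names, a boolean stands in for the LWLock, and the hook simulates another backend running between the lookup and the lock acquisition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_NSLOTS 4
#define DEMO_INVALID_XID 0u

/* Hypothetical stand-in for a PGPROC/PGXACT pair. */
typedef struct DemoSlot
{
	unsigned int xid;			/* transaction currently run by this slot */
	bool		locked;			/* stand-in for the partition LWLock */
} DemoSlot;

static DemoSlot demo_slots[DEMO_NSLOTS];

/* Optional hook run between lookup and lock; simulates concurrent activity. */
static void (*demo_between_hook) (void) = NULL;

/* Find the slot running xid, without holding any lock afterwards. */
static DemoSlot *
demo_lookup(unsigned int xid)
{
	if (xid == DEMO_INVALID_XID)
		return NULL;
	for (int i = 0; i < DEMO_NSLOTS; i++)
		if (demo_slots[i].xid == xid)
			return &demo_slots[i];
	return NULL;
}

/*
 * Return the slot for xid with its lock held, or NULL if the transaction
 * is gone. The recheck after locking is what closes the window between
 * the lookup and the lock acquisition.
 */
static DemoSlot *
demo_lock_leader(unsigned int xid)
{
	DemoSlot   *slot = demo_lookup(xid);

	if (slot == NULL)
		return NULL;

	if (demo_between_hook != NULL)
		demo_between_hook();	/* the slot may be reused here */

	assert(!slot->locked);
	slot->locked = true;

	if (slot->xid != xid)
	{
		slot->locked = false;	/* slot now runs something else */
		return NULL;
	}
	return slot;
}

static void
demo_unlock_leader(DemoSlot *slot)
{
	assert(slot != NULL && slot->locked);
	slot->locked = false;
}

/* Example hook: slot 1 finishes its transaction and starts another one. */
static void
demo_reuse_slot(void)
{
	demo_slots[1].xid = 999;
}
```

The point of the model is only the ordering: a check made before the lock is taken proves nothing, so the XID comparison has to happen while the lock is held.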
3) I don't quite understand why BecomeDecodeGroupMember does the
cross-check using PID. In which case would it help?
When I wrote this support, I had written it with the intention of
supporting both 2PC (in which case pid is 0) and in-progress regular
transactions. That's why the presence of PID in these functions. The
current use case is just for 2PC, so we could remove it.
Sure, but why do we need to cross-check the PID at all? I may be missing
something here, but I don't see what does this protect against?
The fact that PID is 0 in case of prepared transactions was making me
nervous. So, I had added the assert that pid should only be 0 when
it's a prepared transaction and not otherwise. Anyways, since we are
dealing with only 2PC, I have removed the PID argument now. Also
removed is_prepared argument for the same reason.
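For reference, the two assertion conditions discussed in this thread differ only in one case. The sketch below uses hypothetical helper names and simply evaluates the conditions as boolean functions. The stricter form from the review patch is equivalent to (pid == 0) == is_prepared, while the original form also accepts a nonzero pid for a prepared transaction.

```c
#include <assert.h>
#include <stdbool.h>

/* Original condition: Assert(pid != 0 || is_prepared) */
static bool
demo_loose_check(int pid, bool is_prepared)
{
	return pid != 0 || is_prepared;
}

/* Condition from the review patch, written out in full. */
static bool
demo_strict_check(int pid, bool is_prepared)
{
	return ((pid != 0) && !is_prepared) || ((pid == 0) && is_prepared);
}

/* Shorter equivalent: pid is zero exactly when the xact is prepared. */
static bool
demo_strict_check_short(int pid, bool is_prepared)
{
	return (pid == 0) == is_prepared;
}
```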
Ok, I am looking at your provided patch and incorporating relevant
changes from it. Will submit a patch set soon.
OK.
PFA, latest patch set.
Regards,
Nikhils
Attachments:
0001-Cleaning-up-of-flags-in-ReorderBufferTXN-structure.0504.patch (application/octet-stream)
From cf093e3ef7de0890956042f758e96de4c18875a6 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Thu, 5 Apr 2018 12:31:46 +0530
Subject: [PATCH 1/5] Cleaning up of flags in ReorderBufferTXN structure
---
src/backend/replication/logical/reorderbuffer.c | 32 ++++++++++-----------
src/include/replication/reorderbuffer.h | 37 +++++++++++++------------
2 files changed, 36 insertions(+), 33 deletions(-)
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b4016ed52b..3c9af58640 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -636,7 +636,7 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
if (prev_first_lsn != InvalidXLogRecPtr)
Assert(prev_first_lsn < cur_txn->first_lsn);
- Assert(!cur_txn->is_known_as_subxact);
+ Assert(!rbtxn_is_known_subxact(cur_txn));
prev_first_lsn = cur_txn->first_lsn;
}
#endif
@@ -654,7 +654,7 @@ ReorderBufferGetOldestTXN(ReorderBuffer *rb)
txn = dlist_head_element(ReorderBufferTXN, node, &rb->toplevel_by_lsn);
- Assert(!txn->is_known_as_subxact);
+ Assert(!rbtxn_is_known_subxact(txn));
Assert(txn->first_lsn != InvalidXLogRecPtr);
return txn;
}
@@ -688,9 +688,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
dlist_push_tail(&txn->subtxns, &subtxn->node);
txn->nsubtxns++;
}
- else if (!subtxn->is_known_as_subxact)
+ else if (!rbtxn_is_known_subxact(subtxn))
{
- subtxn->is_known_as_subxact = true;
+ subtxn->txn_flags |= RBTXN_IS_SUBXACT;
Assert(subtxn->nsubtxns == 0);
/* remove from lsn order list of top-level transactions */
@@ -751,9 +751,9 @@ ReorderBufferCommitChild(ReorderBuffer *rb, TransactionId xid,
subtxn->final_lsn = commit_lsn;
subtxn->end_lsn = end_lsn;
- if (!subtxn->is_known_as_subxact)
+ if (!rbtxn_is_known_subxact(subtxn))
{
- subtxn->is_known_as_subxact = true;
+ subtxn->txn_flags |= RBTXN_IS_SUBXACT;
Assert(subtxn->nsubtxns == 0);
/* remove from lsn order list of top-level transactions */
@@ -862,7 +862,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
ReorderBufferChange *cur_change;
- if (txn->serialized)
+ if (rbtxn_is_serialized(txn))
{
/* serialize remaining changes */
ReorderBufferSerializeTXN(rb, txn);
@@ -891,7 +891,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
ReorderBufferChange *cur_change;
- if (cur_txn->serialized)
+ if (rbtxn_is_serialized(cur_txn))
{
/* serialize remaining changes */
ReorderBufferSerializeTXN(rb, cur_txn);
@@ -1057,7 +1057,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
* they originally were happening inside another subtxn, so we won't
* ever recurse more than one level deep here.
*/
- Assert(subtxn->is_known_as_subxact);
+ Assert(rbtxn_is_known_subxact(subtxn));
Assert(subtxn->nsubtxns == 0);
ReorderBufferCleanupTXN(rb, subtxn);
@@ -1096,7 +1096,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
/*
* Remove TXN from its containing list.
*
- * Note: if txn->is_known_as_subxact, we are deleting the TXN from its
+ * Note: if txn is known as subxact, we are deleting the TXN from its
* parent's list of known subxacts; this leaves the parent's nsubxacts
* count too high, but we don't care. Otherwise, we are deleting the TXN
* from the LSN-ordered list of toplevel TXNs.
@@ -1111,7 +1111,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
Assert(found);
/* remove entries spilled to disk */
- if (txn->serialized)
+ if (rbtxn_is_serialized(txn))
ReorderBufferRestoreCleanup(rb, txn);
/* deallocate */
@@ -1128,7 +1128,7 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
dlist_iter iter;
HASHCTL hash_ctl;
- if (!txn->has_catalog_changes || dlist_is_empty(&txn->tuplecids))
+ if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
return;
memset(&hash_ctl, 0, sizeof(hash_ctl));
@@ -1708,7 +1708,7 @@ ReorderBufferAbortOld(ReorderBuffer *rb, TransactionId oldestRunningXid)
* final_lsn to that of their last change; this causes
* ReorderBufferRestoreCleanup to do the right thing.
*/
- if (txn->serialized && txn->final_lsn == 0)
+ if (rbtxn_is_serialized(txn) && txn->final_lsn == 0)
{
ReorderBufferChange *last =
dlist_tail_element(ReorderBufferChange, node, &txn->changes);
@@ -1954,7 +1954,7 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
- txn->has_catalog_changes = true;
+ txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
}
/*
@@ -1971,7 +1971,7 @@ ReorderBufferXidHasCatalogChanges(ReorderBuffer *rb, TransactionId xid)
if (txn == NULL)
return false;
- return txn->has_catalog_changes;
+ return rbtxn_has_catalog_changes(txn);
}
/*
@@ -2111,7 +2111,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
Assert(spilled == txn->nentries_mem);
Assert(dlist_is_empty(&txn->changes));
txn->nentries_mem = 0;
- txn->serialized = true;
+ txn->txn_flags |= RBTXN_IS_SERIALIZED;
if (fd != -1)
CloseTransientFile(fd);
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index aa430c843c..177ef98e43 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -137,21 +137,33 @@ typedef struct ReorderBufferChange
dlist_node node;
} ReorderBufferChange;
+/* ReorderBufferTXN flags */
+#define RBTXN_HAS_CATALOG_CHANGES 0x0001
+#define RBTXN_IS_SUBXACT 0x0002
+#define RBTXN_IS_SERIALIZED 0x0004
+
+/* does the txn have catalog changes */
+#define rbtxn_has_catalog_changes(txn) (((txn)->txn_flags & RBTXN_HAS_CATALOG_CHANGES) != 0)
+/* is the txn known as a subxact? */
+#define rbtxn_is_known_subxact(txn) (((txn)->txn_flags & RBTXN_IS_SUBXACT) != 0)
+/*
+ * Has this transaction been spilled to disk? It's not always possible to
+ * deduce that fact by comparing nentries with nentries_mem, because e.g.
+ * subtransactions of a large transaction might get serialized together
+ * with the parent - if they're restored to memory they'd have
+ * nentries_mem == nentries.
+ */
+#define rbtxn_is_serialized(txn) (((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0)
+
typedef struct ReorderBufferTXN
{
+ int txn_flags;
+
/*
* The transactions transaction id, can be a toplevel or sub xid.
*/
TransactionId xid;
- /* did the TX have catalog changes */
- bool has_catalog_changes;
-
- /*
- * Do we know this is a subxact?
- */
- bool is_known_as_subxact;
-
/*
* LSN of the first data carrying, WAL record with knowledge about this
* xid. This is allowed to *not* be first record adorned with this xid, if
@@ -214,15 +226,6 @@ typedef struct ReorderBufferTXN
*/
uint64 nentries_mem;
- /*
- * Has this transaction been spilled to disk? It's not always possible to
- * deduce that fact by comparing nentries with nentries_mem, because e.g.
- * subtransactions of a large transaction might get serialized together
- * with the parent - if they're restored to memory they'd have
- * nentries_mem == nentries.
- */
- bool serialized;
-
/*
* List of ReorderBufferChange structs, including new Snapshots and new
* CommandIds
--
2.15.1 (Apple Git-101)
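The conversion in patch 0001 from separate bool members to a single flag word can be illustrated on its own. This is a minimal sketch, assuming a hypothetical DemoTXN struct and demo_* accessors; only the flag values are taken from the patch. The accessors parenthesize their argument and compare against zero, so they work with pointer expressions and yield a clean 0/1 value even if a higher flag bit is added later.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag bits, with the same values as in the patch. */
#define RBTXN_HAS_CATALOG_CHANGES 0x0001
#define RBTXN_IS_SUBXACT          0x0002
#define RBTXN_IS_SERIALIZED       0x0004

/* Hypothetical stand-in for ReorderBufferTXN; only the flag word is modeled. */
typedef struct DemoTXN
{
	int			txn_flags;
} DemoTXN;

/* Accessors: argument parenthesized, result normalized to 0 or 1. */
#define demo_has_catalog_changes(txn) \
	((((txn)->txn_flags) & RBTXN_HAS_CATALOG_CHANGES) != 0)
#define demo_is_known_subxact(txn) \
	((((txn)->txn_flags) & RBTXN_IS_SUBXACT) != 0)
#define demo_is_serialized(txn) \
	((((txn)->txn_flags) & RBTXN_IS_SERIALIZED) != 0)

/* Setting a flag replaces the old "txn->some_bool = true" assignments. */
static void
demo_set_flag(DemoTXN *txn, int flag)
{
	assert(txn != NULL);
	txn->txn_flags |= flag;
}
```

With this layout, adding a new per-transaction property means adding one bit and one accessor rather than another struct member.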
0002-Introduce-LogicalLockTransaction-LogicalUnlockTransa.0504.patch (application/octet-stream)
From 64b3a887910a6a4f04a472db811082aedf2e794c Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Thu, 5 Apr 2018 14:24:45 +0530
Subject: [PATCH 2/5] Introduce LogicalLockTransaction/LogicalUnlockTransaction
APIs
When a transaction aborts, its changes are considered unnecessary
for other transactions. That means the changes may be either cleaned
up by vacuum or removed from HOT chains (thus made inaccessible
through indexes), and there may be other such consequences.
When decoding committed transactions this is not an issue, and we
never decode transactions that abort before the decoding starts
(where decoding means passing it to ReorderBufferCommit).
But for in-progress transactions - for example when decoding prepared
transactions on PREPARE (and not COMMIT PREPARED as before), this
may cause failures when the output plugin consults catalogs (both
system and user-defined).
To prevent aborts concurrent with plugins accessing catalogs, we
introduce an API the output plugins are required to use (when
decoding in-progress transactions only).
Before accessing any catalogs, output plugins are required to call
LogicalLockTransaction and then release it using
LogicalUnlockTransaction. Implementation is via adding support for
decoding groups. Use LockHashPartitionLockByProc on the group leader
to get the LWLock protecting these fields. For prepared and uncommitted
transactions, decoding backends working on the same XID will link
themselves up to the corresponding PGPROC entry (decodeGroupLeader).
They will remove themselves when they are done decoding.
If the prepared or uncommitted transaction decides to abort, then
the decodeGroupLeader will set the decodeAbortPending flag allowing
the decodeGroupMembers to abort their decoding appropriately.
If any decode group member errors out, we also remove that proc
from the group membership.
---
src/backend/replication/logical/logical.c | 236 ++++++++++++++++
src/backend/storage/ipc/procarray.c | 39 +++
src/backend/storage/lmgr/README | 46 ++++
src/backend/storage/lmgr/proc.c | 435 +++++++++++++++++++++++++++++-
src/include/replication/logical.h | 2 +
src/include/replication/reorderbuffer.h | 15 ++
src/include/storage/proc.h | 26 ++
src/include/storage/procarray.h | 1 +
8 files changed, 791 insertions(+), 9 deletions(-)
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 3d8ad7ddf8..95ffd2da54 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1017,3 +1017,239 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
SpinLockRelease(&MyReplicationSlot->mutex);
}
}
+
+/*
+ * LogicalLockTransaction
+ * Make sure the transaction is not aborted during decoding.
+ *
+ * The logical decoding plugins may need to access catalogs (both system
+ * and user-defined), e.g. to get metadata about tuples, do custom
+ * filtering etc. While decoding committed transactions that is not an
+ * issue, but in-progress transactions may abort while being decoded, in
+ * which case the catalog access may fail in various ways (rows from
+ * aborted transactions are eligible for more aggressive cleanup, may
+ * not be accessible through indexes due to breaking HOT chains etc.).
+ *
+ * To prevent these issues, we need to prevent abort of the transaction
+ * while accessing any catalogs. To enforce that, each decoding backend
+ * has to call LogicalLockTransaction prior to any catalog access, and
+ * then LogicalUnlockTransaction immediately after it. The lock function
+ * adds the decoding backend into a "decoding group" for the transaction
+ * on the first call. Subsequent calls update a flag indicating whether
+ * the decoding backend may be accessing any catalogs.
+ *
+ * While aborting an in-progress transaction, the backend is made to wait
+ * for all current members of the decoding group that may be currently
+ * accessing catalogs (see LogicalDecodeRemoveTransaction). Once the
+ * transaction completes (applies to both abort and commit), the group
+ * is destroyed and is not needed anymore (we can check transaction
+ * status directly, instead).
+ *
+ * The function returns true when it's safe to access catalogs, and
+ * false when the transaction aborted (or is being aborted), in which
+ * case the plugin should stop decoding it.
+ *
+ * The decoding backend joins the decoding group only when actually
+ * needed. For example when the transaction did no catalog changes,
+ * or when it's known to already have committed (or aborted), we can
+ * bail out without joining the group.
+ */
+bool
+LogicalLockTransaction(ReorderBufferTXN *txn)
+{
+ bool ok = false;
+ LWLock *leader_lwlock;
+ PGPROC *leader = NULL;
+ PGXACT *pgxact = NULL;
+
+ /*
+ * Transactions that have not modified catalogs do not need to
+ * join the decoding group.
+ */
+ if (!rbtxn_has_catalog_changes(txn))
+ return true;
+
+ /*
+ * Check commit status. If a transaction already committed, there
+ * is no danger when accessing catalogs. If it aborted, we can
+ * stop decoding it right away.
+ */
+ if (rbtxn_commit(txn))
+ return true;
+
+ if (rbtxn_rollback(txn))
+ return false;
+
+ /*
+ * Currently, only 2PC transactions can be decoded before commit
+ * (at prepare). So regular transactions are automatically safe.
+ */
+ if (!rbtxn_prepared(txn))
+ return true;
+
+ /*
+ * Find the PROC handling this XID and join the decoding group.
+ *
+ * If this is the first call for this XID, we don't know which
+ * PROC is executing the transaction (and acting as a leader).
+ * In that case we need to lookup and possibly also assign
+ * the leader.
+ */
+ if (MyProc->decodeGroupLeader == NULL)
+ {
+ leader = AssignDecodeGroupLeader(txn->xid);
+
+ /*
+ * We have checked if the transaction committed/aborted, but it
+ * is possible the PROC went away since then, in which case we
+ * get leader as NULL above. We recheck transaction status,
+ * expecting it to be either committed or aborted.
+ *
+ * If the PROC is available, add ourself as a member of its
+ * decoding group. Note that we're not holding any locks on PGPROC,
+ * so it's possible the leader disappears, or starts executing
+ * another transaction. In that case we're done.
+ */
+ if (leader == NULL ||
+ !BecomeDecodeGroupMember(leader, txn->xid))
+ {
+ Assert(!TransactionIdIsInProgress(txn->xid));
+ if (TransactionIdDidCommit(txn->xid))
+ {
+ txn->txn_flags |= RBTXN_COMMIT;
+ return true;
+ }
+ else
+ {
+ txn->txn_flags |= RBTXN_ROLLBACK;
+ return false;
+ }
+ }
+ }
+
+ /*
+ * We know the leader was executing this XID a while ago, and we
+ * might have become a member of the decode group as well.
+ * But we have not been holding any locks on PGPROC so it might
+ * have committed/aborted, removed us from the decoding group and
+ * started executing something else since then. So we need to
+ * recheck that it is indeed still running the right XID.
+ */
+ leader = BackendXidGetProc(txn->xid);
+ if (!leader)
+ return false;
+
+ leader_lwlock = LockHashPartitionLockByProc(leader);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+
+ pgxact = &ProcGlobal->allPgXact[leader->pgprocno];
+ if (pgxact->xid != txn->xid)
+ {
+ LWLockRelease(leader_lwlock);
+ return false;
+ }
+
+ /* ok, we are part of the decode group still */
+ Assert(MyProc->decodeGroupLeader &&
+ MyProc->decodeGroupLeader == leader);
+
+ /*
+ * Re-check if we were told to abort by the leader after taking
+ * the above lock.
+ */
+ if (leader->decodeAbortPending)
+ {
+ /*
+ * Remove ourself from the decodeGroupMembership and return
+ * false so that the decoding plugin also initiates abort
+ * processing
+ */
+ RemoveDecodeGroupMemberLocked(MyProc->decodeGroupLeader);
+ MyProc->decodeLocked = false;
+ txn->txn_flags |= RBTXN_ROLLBACK;
+ ok = false;
+ }
+ else
+ {
+ /* ok to logically lock this backend */
+ MyProc->decodeLocked = true;
+ ok = true;
+ }
+ LWLockRelease(leader_lwlock);
+
+ return ok;
+}
+
+/*
+ * LogicalUnlockTransaction
+ * Indicate that the logical decoding plugin is done accessing
+ * catalog information.
+ *
+ * To prevent issues while decoding of in-progress transactions, we
+ * need to prevent abort of the transaction while accessing any catalogs.
+ * To enforce that, each decoding backend has to call
+ * LogicalLockTransaction prior to any catalog access, and then
+ * LogicalUnlockTransaction immediately after it. This unlock function
+ * resets the decodeLocked flag; the backend is removed from the
+ * "decoding group" only if the leader has an abort pending.
+ */
+void
+LogicalUnlockTransaction(ReorderBufferTXN *txn)
+{
+ LWLock *leader_lwlock;
+ PGPROC *leader = NULL;
+
+ /*
+ * If the transaction is known to have aborted, we should have never got
+ * here (the plugin should have interrupted the decoding).
+ */
+ Assert(!rbtxn_rollback(txn));
+
+ /* If it's not locked, we're done. */
+ if (!MyProc->decodeLocked)
+ return;
+
+ /*
+ * Transactions that have not modified catalogs do not need to
+ * join the decoding group.
+ */
+ if (!rbtxn_has_catalog_changes(txn))
+ return;
+
+ /*
+ * Currently, only 2PC transactions can be decoded before commit
+ * (at prepare). So regular transactions are automatically safe.
+ */
+ if (!rbtxn_prepared(txn))
+ return;
+
+ /*
+ * Check commit status. If a transaction already committed, there
+ * is no danger when accessing catalogs.
+ */
+ if (rbtxn_commit(txn))
+ return;
+
+ /*
+ * We're guaranteed to still have a leader here, because we are
+ * in locked mode, so the leader can't just disappear.
+ */
+ leader = MyProc->decodeGroupLeader;
+ Assert(leader && MyProc->decodeLocked);
+
+ leader_lwlock = LockHashPartitionLockByProc(leader);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ if (leader->decodeAbortPending)
+ {
+ /*
+ * Remove ourself from the decodeGroupMembership
+ */
+ RemoveDecodeGroupMemberLocked(leader);
+
+ txn->txn_flags |= RBTXN_ROLLBACK;
+ }
+ MyProc->decodeLocked = false;
+ LWLockRelease(leader_lwlock);
+ return;
+}
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index afe1c03aa3..2be2910207 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -2432,6 +2432,45 @@ BackendXidGetPid(TransactionId xid)
return result;
}
+/*
+ * BackendXidGetProc -- get a backend's PGPROC given its XID
+ *
+ * Note that it is up to the caller to be sure that the question
+ * remains meaningful for long enough for the answer to be used ...
+ *
+ * Only main transaction Ids are considered.
+ *
+ */
+PGPROC *
+BackendXidGetProc(TransactionId xid)
+{
+ PGPROC *result = NULL;
+ ProcArrayStruct *arrayP = procArray;
+ int index;
+
+ if (xid == InvalidTransactionId) /* never match invalid xid */
+ return NULL;
+
+ LWLockAcquire(ProcArrayLock, LW_SHARED);
+
+ for (index = 0; index < arrayP->numProcs; index++)
+ {
+ int pgprocno = arrayP->pgprocnos[index];
+ PGPROC *proc = &allProcs[pgprocno];
+ volatile PGXACT *pgxact = &allPgXact[pgprocno];
+
+ if (pgxact->xid == xid)
+ {
+ result = proc;
+ break;
+ }
+ }
+
+ LWLockRelease(ProcArrayLock);
+
+ return result;
+}
+
/*
* IsBackendPid -- is a given pid a running backend
*
diff --git a/src/backend/storage/lmgr/README b/src/backend/storage/lmgr/README
index 56b0a12a3e..4b4b9c5958 100644
--- a/src/backend/storage/lmgr/README
+++ b/src/backend/storage/lmgr/README
@@ -679,6 +679,52 @@ worker, and the worker fails to join the lock group unless the given PGPROC
still has the same PID and is still a lock group leader. We assume that
PIDs are not recycled quickly enough for this interlock to fail.
+Decode Group Locking
+--------------------
+
+When decoding in-progress transactions, we need to prevent aborts while
+the decoding processes are accessing catalogs, which might lead to issues
+if the transaction modified some of the catalogs. Currently this applies
+only to two-phase transactions, that may be decoded at PREPARE time, but
+in the future this may be extended to regular transactions too.
+
+To prevent that, the backend executing the abort is made to wait for all
+the decoding backends. We use an infrastructure which is very similar
+to the above group locking to form groups of backends performing logical
+decoding of the same in-progress transaction.
+
+Decode Group locking adds five new members to each PGPROC:
+decodeGroupLeader, decodeGroupMembers, decodeGroupLink, decodeLocked and
+decodeAbortPending. A PGPROC's decodeGroupLeader is NULL for processes
+not involved in logical decoding. When a process wants to decode an
+in-progress transaction, it looks up the PGPROC structure associated
+with that transaction ID and makes that PGPROC structure its
+decodeGroupLeader. The decodeGroupMembers field is only used in the
+leader; it is a list of the member PGPROCs of the decode group (the
+leader and all backends decoding this transaction ID).
+The decodeGroupLink field is the list link for this list. The decoding
+backend marks itself as decodeLocked while it is accessing catalog
+metadata for its decoding requirements via the LogicalLockTransaction
+API. It resets the same via the LogicalUnlockTransaction API.
+
+Meanwhile, if this in-progress transaction decides to abort, the PGPROC
+corresponding to it sets decodeAbortPending on itself and also on all
+the decodeGroupMembers entries.
+
+The decodeGroupMembers entries stop decoding this transaction and exit.
+When all the decoding backends have exited the abort can proceed.
+
+All five of these fields are considered to be protected by a lock manager
+partition lock. The partition lock that protects these fields within a given
+lock group is chosen by taking the leader's pgprocno modulo the number of lock
+manager partitions. Holding this single lock allows safe manipulation of the
+decodeGroupMembers list for the lock group.
+
+The decodeGroupLeader's PGPROC and also its PID is accessible to each
+decoding backend. And the decoding backend fails to join the decode
+lock group unless the given PGPROC still has the same PID and is still
+a decode group leader. We assume that PIDs are not recycled quickly
+enough for this interlock to fail.
User Locks (Advisory Locks)
---------------------------
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 6f30e082b2..b95ccb1017 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -267,6 +267,11 @@ InitProcGlobal(void)
/* Initialize lockGroupMembers list. */
dlist_init(&procs[i].lockGroupMembers);
+
+ /* Initialize decodeGroupMembers list. */
+ dlist_init(&procs[i].decodeGroupMembers);
+ procs[i].decodeAbortPending = false;
+ procs[i].decodeLocked = false;
}
/*
@@ -406,6 +411,12 @@ InitProcess(void)
Assert(MyProc->lockGroupLeader == NULL);
Assert(dlist_is_empty(&MyProc->lockGroupMembers));
+ /* Check that group decode fields are in a proper initial state. */
+ Assert(MyProc->decodeGroupLeader == NULL);
+ Assert(dlist_is_empty(&MyProc->decodeGroupMembers));
+ MyProc->decodeAbortPending = false;
+ MyProc->decodeLocked = false;
+
/* Initialize wait event information. */
MyProc->wait_event_info = 0;
@@ -581,6 +592,12 @@ InitAuxiliaryProcess(void)
Assert(MyProc->lockGroupLeader == NULL);
Assert(dlist_is_empty(&MyProc->lockGroupMembers));
+ /* Check that group decode fields are in a proper initial state. */
+ Assert(MyProc->decodeGroupLeader == NULL);
+ Assert(dlist_is_empty(&MyProc->decodeGroupMembers));
+ MyProc->decodeAbortPending = false;
+ MyProc->decodeLocked = false;
+
/*
* We might be reusing a semaphore that belonged to a failed process. So
* be careful and reinitialize its value here. (This is not strictly
@@ -826,9 +843,14 @@ ProcKill(int code, Datum arg)
/*
* Detach from any lock group of which we are a member. If the leader
- * exist before all other group members, it's PGPROC will remain allocated
+ * exits before all other group members, its PGPROC will remain allocated
* until the last group process exits; that process must return the
* leader's PGPROC to the appropriate list.
+ *
+ * The below code needs to be mindful of the presence of decode group
+ * entries in case of logical decoding. However, lock groups are for
+ * parallel workers so we typically won't be finding both present
+ * together in the same proc.
*/
if (MyProc->lockGroupLeader != NULL)
{
@@ -845,11 +867,19 @@ ProcKill(int code, Datum arg)
{
procgloballist = leader->procgloballist;
- /* Leader exited first; return its PGPROC. */
- SpinLockAcquire(ProcStructLock);
- leader->links.next = (SHM_QUEUE *) *procgloballist;
- *procgloballist = leader;
- SpinLockRelease(ProcStructLock);
+ /*
+ * Leader exited first; return its PGPROC, but only
+ * if it is not also a decode group leader. Otherwise
+ * the last decode group member to exit will release
+ * it later.
+ */
+ if (leader->decodeGroupLeader == NULL)
+ {
+ SpinLockAcquire(ProcStructLock);
+ leader->links.next = (SHM_QUEUE *) *procgloballist;
+ *procgloballist = leader;
+ SpinLockRelease(ProcStructLock);
+ }
}
}
else if (leader != MyProc)
@@ -857,6 +887,54 @@ ProcKill(int code, Datum arg)
LWLockRelease(leader_lwlock);
}
+ /*
+ * Detach from any decode group of which we are a member. If the leader
+ * exits before all other group members, its PGPROC will remain allocated
+ * until the last group process exits; that process must return the
+ * leader's PGPROC to the appropriate list.
+ */
+ if (MyProc->decodeGroupLeader != NULL)
+ {
+ PGPROC *leader = MyProc->decodeGroupLeader;
+ LWLock *leader_lwlock = LockHashPartitionLockByProc(leader);
+
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ dlist_delete(&MyProc->decodeGroupLink);
+ if (dlist_is_empty(&leader->decodeGroupMembers))
+ {
+ leader->decodeGroupLeader = NULL;
+ if (leader != MyProc)
+ {
+ procgloballist = leader->procgloballist;
+
+ /*
+ * Leader exited first; return its PGPROC.
+ * But check if it was already done above
+ * by the lockGroup code
+ */
+ if (leader != *procgloballist)
+ {
+ SpinLockAcquire(ProcStructLock);
+ leader->links.next = (SHM_QUEUE *) *procgloballist;
+ *procgloballist = leader;
+ SpinLockRelease(ProcStructLock);
+ }
+ }
+ /* clear leader flags */
+ leader->decodeAbortPending = false;
+ leader->decodeLocked = false;
+ }
+ else if (leader != MyProc)
+ {
+ MyProc->decodeGroupLeader = NULL;
+ /* clear proc flags */
+ MyProc->decodeLocked = false;
+ MyProc->decodeAbortPending = false;
+ }
+ LWLockRelease(leader_lwlock);
+ }
+
/*
* Reset MyLatch to the process local one. This is so that signal
* handlers et al can continue using the latch after the shared latch
@@ -881,9 +959,36 @@ ProcKill(int code, Datum arg)
/* Since lockGroupLeader is NULL, lockGroupMembers should be empty. */
Assert(dlist_is_empty(&proc->lockGroupMembers));
- /* Return PGPROC structure (and semaphore) to appropriate freelist */
- proc->links.next = (SHM_QUEUE *) *procgloballist;
- *procgloballist = proc;
+ /*
+ * Return PGPROC structure (and semaphore) to appropriate freelist.
+ * Again check if decode group stuff will handle it later.
+ */
+ if (proc->decodeGroupLeader == NULL)
+ {
+ proc->links.next = (SHM_QUEUE *) *procgloballist;
+ *procgloballist = proc;
+ }
+ }
+
+ /*
+ * If we're still a member of a decode group, that means we're a leader
+ * which has somehow exited before its children. The last remaining child
+ * will release our PGPROC. Otherwise, release it now.
+ */
+ if (proc->decodeGroupLeader == NULL)
+ {
+ /* Since decodeGroupLeader is NULL, decodeGroupMembers should be empty. */
+ Assert(dlist_is_empty(&proc->decodeGroupMembers));
+
+ /*
+ * Return PGPROC structure (and semaphore) to appropriate freelist,
+ * unless that was already done above by the lockGroup code.
+ */
+ if (proc != *procgloballist)
+ {
+ proc->links.next = (SHM_QUEUE *) *procgloballist;
+ *procgloballist = proc;
+ }
}
/* Update shared estimate of spins_per_delay */
@@ -1887,3 +1992,315 @@ BecomeLockGroupMember(PGPROC *leader, int pid)
return ok;
}
+
+/*
+ * AssignDecodeGroupLeader
+ * Lookup process using xid and designate as decode group leader.
+ *
+ * Once this function has returned, other processes can join the decode
+ * group by calling BecomeDecodeGroupMember.
+ */
+PGPROC *
+AssignDecodeGroupLeader(TransactionId xid)
+{
+ PGPROC *proc = NULL;
+ LWLock *leader_lwlock;
+
+ Assert(xid != InvalidTransactionId);
+
+ /*
+ * Lookup the backend executing this transaction.
+ *
+ * If the transaction already completed, we can bail out.
+ */
+ proc = BackendXidGetProc(xid);
+ if (!proc)
+ return NULL;
+
+ /*
+ * A process running an XID can't have a leader; it can only be
+ * a leader (in which case it points to itself).
+ */
+ Assert(!proc->decodeGroupLeader ||
+ (proc->decodeGroupLeader == proc));
+
+ /*
+ * This proc will become decodeGroupLeader if it's not already.
+ */
+ if (proc->decodeGroupLeader == NULL)
+ {
+ volatile PGXACT *pgxact;
+
+ /* Create single-member group, containing this proc. */
+ leader_lwlock = LockHashPartitionLockByProc(proc);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+
+ /* recheck we are still the same */
+ pgxact = &ProcGlobal->allPgXact[proc->pgprocno];
+
+ /*
+ * We know the process was executing the XID a while ago, but we
+ * have not been holding any locks on PGPROC so it might have
+ * started executing something else since then. So we need to
+ * recheck that it is indeed still running the right XID.
+ *
+ * If it's not, the transaction must have already completed, so
+ * we don't need to create any decoding group.
+ */
+ if (pgxact->xid == xid)
+ {
+ /*
+ * Some other decoding backend might have marked the process
+ * as a leader before we acquired the lock. But it must not
+ * be a follower of some other leader.
+ */
+ Assert(!proc->decodeGroupLeader ||
+ (proc->decodeGroupLeader == proc));
+
+ /* recheck if someone else did not already assign us */
+ if (proc->decodeGroupLeader == NULL)
+ {
+ /*
+ * The leader is also a part of the decoding group,
+ * so we add it to the members list as well.
+ */
+ proc->decodeGroupLeader = proc;
+ dlist_push_head(&proc->decodeGroupMembers,
+ &proc->decodeGroupLink);
+ }
+ }
+ else
+ {
+ /* proc entry is gone */
+ proc = NULL;
+ }
+ LWLockRelease(leader_lwlock);
+ }
+
+ if (proc)
+ elog(DEBUG1, "became group leader (%p)", proc);
+ return proc;
+}
+
+/*
+ * BecomeDecodeGroupMember - designate process as decode group member
+ *
+ * This is pretty straightforward except for the possibility that the leader
+ * whose group we're trying to join might exit before we manage to do so;
+ * and the PGPROC might get recycled for an unrelated process. To avoid
+ * that, we require the caller to pass the XID of the intended transaction
+ * as an interlock. Returns true if we successfully join the intended
+ * decode group, and false if not.
+ */
+bool
+BecomeDecodeGroupMember(PGPROC *leader, TransactionId xid)
+{
+ LWLock *leader_lwlock;
+ bool ok = false;
+ volatile PGXACT *pgxact;
+
+ /* Group leader can't become member of group */
+ Assert(MyProc != leader);
+
+ /* Can't already be a member of a group */
+ Assert(MyProc->decodeGroupLeader == NULL);
+
+ /* XID must be valid */
+ Assert(TransactionIdIsValid(xid));
+
+ /*
+ * Get lock protecting the group fields. Note LockHashPartitionLockByProc
+ * accesses leader->pgprocno in a PGPROC that might be free. This is safe
+ * because all PGPROCs' pgprocno fields are set during shared memory
+ * initialization and never change thereafter; so we will acquire the
+ * correct lock even if the leader PGPROC is in process of being recycled.
+ */
+ leader_lwlock = LockHashPartitionLockByProc(leader);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+
+ /* Is this the leader we're looking for? */
+ pgxact = &ProcGlobal->allPgXact[leader->pgprocno];
+ if (pgxact->xid == xid && leader->decodeGroupLeader == leader)
+ {
+ /* is the leader going away? */
+ if (leader->decodeAbortPending)
+ ok = false;
+ else
+ {
+ /* OK, join the group */
+ ok = true;
+ MyProc->decodeGroupLeader = leader;
+ dlist_push_tail(&leader->decodeGroupMembers, &MyProc->decodeGroupLink);
+ }
+ }
+ LWLockRelease(leader_lwlock);
+
+ if (ok)
+ elog(DEBUG1, "became group member (%p) to (%p)", MyProc, leader);
+ return ok;
+}
+
+/*
+ * RemoveDecodeGroupMember
+ * Remove a member from the decoding group of a leader.
+ */
+void
+RemoveDecodeGroupMember(PGPROC *leader)
+{
+ LWLock *leader_lwlock;
+
+ leader_lwlock = LockHashPartitionLockByProc(leader);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ RemoveDecodeGroupMemberLocked(leader);
+ LWLockRelease(leader_lwlock);
+
+ return;
+}
+
+/*
+ * RemoveDecodeGroupMemberLocked
+ * Remove a member from a decoding group of a leader.
+ *
+ * Assumes that the caller is holding appropriate lock on PGPROC.
+ */
+void
+RemoveDecodeGroupMemberLocked(PGPROC *leader)
+{
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ dlist_delete(&MyProc->decodeGroupLink);
+ /* leader links to itself, so never empty */
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ MyProc->decodeGroupLeader = NULL;
+ elog(DEBUG1, "removed group member (%p) from (%p)", MyProc, leader);
+
+ return;
+}
+
+/*
+ * LogicalDecodeRemoveTransaction
+ * Notify all decoding members that this transaction is going away.
+ *
+ * Wait for all decodeGroupMembers to ack back before returning from
+ * here but only in case of aborts.
+ *
+ * This function should be called *after* the proc has been removed
+ * from the procArray.
+ *
+ * If the transaction is committing, it's ok for the decoding backends
+ * to continue merrily - there is no danger in accessing catalogs. When
+ * such a backend next tries to join the decoding group, it won't find
+ * the proc anymore, forcing it to re-check the transaction status and
+ * cache the commit status for future calls (see LogicalLockTransaction).
+ *
+ * If a backend which is part of the decode group crashes, the resulting
+ * database restart cleans up the shared memory state, so no explicit
+ * handling is needed here.
+ */
+void
+LogicalDecodeRemoveTransaction(PGPROC *leader, bool isCommit)
+{
+ LWLock *leader_lwlock;
+ dlist_mutable_iter change_i;
+ dlist_iter iter;
+ PGPROC *proc;
+ bool do_wait;
+
+ leader_lwlock = LockHashPartitionLockByProc(leader);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+
+ /*
+ * If the proc has not been initialized as a group leader, there are
+ * no group members to wait for and we can terminate right away.
+ */
+ if (leader->decodeGroupLeader == NULL)
+ {
+ Assert(dlist_is_empty(&leader->decodeGroupMembers));
+ LWLockRelease(leader_lwlock);
+ return;
+ }
+
+ /* mark the transaction as aborting */
+ leader->decodeAbortPending = (!isCommit);
+
+recheck:
+ do_wait = false;
+ Assert(leader->decodeGroupLeader == leader);
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ if (!isCommit)
+ {
+ /*
+ * We need to walk the list of group members, and decide if we
+ * need to wait for some of them. In other words, we need to
+ * check if there are any processes besides the leader.
+ */
+ dlist_foreach(iter, &leader->decodeGroupMembers)
+ {
+ proc = dlist_container(PGPROC, decodeGroupLink, iter.cur);
+
+ /* Ignore the leader (i.e. ourselves). */
+ if (proc == leader)
+ continue;
+
+ /* mark the proc to indicate abort is pending */
+ if (!proc->decodeAbortPending)
+ {
+ proc->decodeAbortPending = true;
+ elog(DEBUG1, "marking group member (%p) from (%p) for abort",
+ proc, leader);
+ }
+
+ /* if the proc is currently locked, wait */
+ if (proc->decodeLocked)
+ do_wait = true;
+ }
+
+ if (do_wait)
+ {
+ int rc;
+ LWLockRelease(leader_lwlock);
+
+ elog(LOG, "waiting for backends to abort decoding");
+ /*
+ * Wait on our latch to allow decodeGroupMembers to
+ * go away soon
+ */
+ rc = WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+ 100L,
+ WAIT_EVENT_PG_SLEEP);
+ ResetLatch(MyLatch);
+
+ /* emergency bailout if postmaster has died */
+ if (rc & WL_POSTMASTER_DEATH)
+ proc_exit(1);
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Recheck decodeGroupMembers */
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ goto recheck;
+ }
+ }
+
+ /*
+ * All backends exited cleanly in case of aborts above,
+ * remove decodeGroupMembers now for both commit/abort cases
+ */
+ Assert(leader->decodeGroupLeader == leader);
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ dlist_foreach_modify(change_i, &leader->decodeGroupMembers)
+ {
+ proc = dlist_container(PGPROC, decodeGroupLink, change_i.cur);
+ Assert(!proc->decodeLocked);
+ dlist_delete(&proc->decodeGroupLink);
+ elog(DEBUG1, "deleting group member (%p) from (%p)",
+ proc, leader);
+ proc->decodeGroupLeader = NULL;
+ }
+ Assert(dlist_is_empty(&leader->decodeGroupMembers));
+ leader->decodeGroupLeader = NULL;
+ leader->decodeAbortPending = false;
+ LWLockRelease(leader_lwlock);
+
+ return;
+}
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 619c5f4d73..63b14367f0 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -117,6 +117,8 @@ extern void LogicalIncreaseXminForSlot(XLogRecPtr lsn, TransactionId xmin);
extern void LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn,
XLogRecPtr restart_lsn);
extern void LogicalConfirmReceivedLocation(XLogRecPtr lsn);
+extern bool LogicalLockTransaction(ReorderBufferTXN *txn);
+extern void LogicalUnlockTransaction(ReorderBufferTXN *txn);
extern bool filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id);
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 177ef98e43..385bb486bb 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -141,6 +141,11 @@ typedef struct ReorderBufferChange
#define RBTXN_HAS_CATALOG_CHANGES 0x0001
#define RBTXN_IS_SUBXACT 0x0002
#define RBTXN_IS_SERIALIZED 0x0004
+#define RBTXN_PREPARE 0x0008
+#define RBTXN_COMMIT_PREPARED 0x0010
+#define RBTXN_ROLLBACK_PREPARED 0x0020
+#define RBTXN_COMMIT 0x0040
+#define RBTXN_ROLLBACK 0x0080
/* does the txn have catalog changes */
#define rbtxn_has_catalog_changes(txn) (txn->txn_flags & RBTXN_HAS_CATALOG_CHANGES)
@@ -154,6 +159,16 @@ typedef struct ReorderBufferChange
* nentries_mem == nentries.
*/
#define rbtxn_is_serialized(txn) (txn->txn_flags & RBTXN_IS_SERIALIZED)
+/* is this txn prepared? */
+#define rbtxn_prepared(txn) (txn->txn_flags & RBTXN_PREPARE)
+/* was this prepared txn committed in the meanwhile? */
+#define rbtxn_commit_prepared(txn) (txn->txn_flags & RBTXN_COMMIT_PREPARED)
+/* was this prepared txn aborted in the meanwhile? */
+#define rbtxn_rollback_prepared(txn) (txn->txn_flags & RBTXN_ROLLBACK_PREPARED)
+/* was this txn committed in the meanwhile? */
+#define rbtxn_commit(txn) (txn->txn_flags & RBTXN_COMMIT)
+/* was this txn aborted in the meanwhile? */
+#define rbtxn_rollback(txn) (txn->txn_flags & RBTXN_ROLLBACK)
typedef struct ReorderBufferTXN
{
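A note on the txn_flags bits added in the hunk above. The following is a minimal standalone sketch, not code from the patch: the flag values and macro shapes are copied from the hunk, while the DemoTxn struct and the demo_finish_action helper are hypothetical, and the callback dispatch they show is an assumption about how a caller might use the flags. Note that the macros return the masked bits rather than a strict 0/1, so they are only meaningful in a boolean context.

```c
#include <assert.h>
#include <string.h>

/* Flag values copied from the reorderbuffer.h hunk. */
#define RBTXN_PREPARE            0x0008
#define RBTXN_COMMIT_PREPARED    0x0010
#define RBTXN_ROLLBACK_PREPARED  0x0020

/* Hypothetical stand-in for ReorderBufferTXN; only txn_flags matters here. */
typedef struct DemoTxn
{
	unsigned int txn_flags;
} DemoTxn;

#define rbtxn_prepared(txn)           ((txn)->txn_flags & RBTXN_PREPARE)
#define rbtxn_commit_prepared(txn)    ((txn)->txn_flags & RBTXN_COMMIT_PREPARED)
#define rbtxn_rollback_prepared(txn)  ((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED)

/*
 * Hypothetical helper: name the output plugin callback that would be
 * expected for a transaction in the given state. The mapping is an
 * illustration only, not behaviour confirmed by the patch.
 */
const char *
demo_finish_action(const DemoTxn *txn)
{
	/* A transaction cannot be both committed and rolled back. */
	assert(!(rbtxn_commit_prepared(txn) && rbtxn_rollback_prepared(txn)));

	if (!rbtxn_prepared(txn))
		return "commit_cb";
	if (rbtxn_commit_prepared(txn))
		return "commit_prepared_cb";
	if (rbtxn_rollback_prepared(txn))
		return "abort_prepared_cb";
	return "prepare_cb";
}
```

Because the flags are independent bits, a transaction that was prepared and later committed carries both RBTXN_PREPARE and RBTXN_COMMIT_PREPARED, which is why the sketch tests the prepared bit first.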
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 5c19a61dcf..ae842b64d0 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -200,6 +200,26 @@ struct PGPROC
PGPROC *lockGroupLeader; /* lock group leader, if I'm a member */
dlist_head lockGroupMembers; /* list of members, if I'm a leader */
dlist_node lockGroupLink; /* my member link, if I'm a member */
+
+ /*
+ * Support for decoding groups. Use LockHashPartitionLockByProc on the group
+ * leader to get the LWLock protecting these fields.
+ *
+ * For prepared and uncommitted transactions, decoding backends working on
+ * the same XID will link themselves up to the corresponding PGPROC
+ * entry (decodeGroupLeader).
+ *
+ * They will remove themselves when they are done decoding.
+ *
+ * If the prepared or uncommitted transaction decides to abort, then
+ * the decodeGroupLeader will set the decodeAbortPending flag allowing
+ * the decodeGroupMembers to abort their decoding appropriately
+ */
+ PGPROC *decodeGroupLeader; /* decode group leader, if I'm a member */
+ dlist_head decodeGroupMembers; /* list of members, if I'm a leader */
+ dlist_node decodeGroupLink; /* my member link, if I'm a member */
+ bool decodeLocked; /* is it currently locked by this proc? */
+ bool decodeAbortPending; /* is the decode group leader aborting? */
};
/* NOTE: "typedef struct PGPROC PGPROC" appears in storage/lock.h. */
@@ -327,4 +347,10 @@ extern PGPROC *AuxiliaryPidGetProc(int pid);
extern void BecomeLockGroupLeader(void);
extern bool BecomeLockGroupMember(PGPROC *leader, int pid);
+extern PGPROC *AssignDecodeGroupLeader(TransactionId xid);
+extern bool BecomeDecodeGroupMember(PGPROC *leader, TransactionId xid);
+extern void RemoveDecodeGroupMember(PGPROC *leader);
+extern void RemoveDecodeGroupMemberLocked(PGPROC *leader);
+extern void LogicalDecodeRemoveTransaction(PGPROC *leader, bool isCommit);
+
#endif /* PROC_H */
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 75bab2985f..776de2470e 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -97,6 +97,7 @@ extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids
extern PGPROC *BackendPidGetProc(int pid);
extern PGPROC *BackendPidGetProcWithLock(int pid);
+extern PGPROC *BackendXidGetProc(TransactionId xid);
extern int BackendXidGetPid(TransactionId xid);
extern bool IsBackendPid(int pid);
--
2.15.1 (Apple Git-101)
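The decode group handshake implemented by the patch above can be modelled without shared memory or locking. The following is a simplified single-process sketch, assuming a fixed-size member array in place of the dlist and no LWLock; DemoProc and all demo_* functions are hypothetical names, not part of the patch. It only illustrates the order of checks: the XID interlock on join, the refusal to join once an abort is pending, and the aborting leader waiting while any member still holds the decode lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_MEMBERS 4

/* Hypothetical, much reduced model of the decode group fields in PGPROC. */
typedef struct DemoProc DemoProc;
struct DemoProc
{
	unsigned int xid;			/* XID run by this proc, 0 if none */
	DemoProc   *group_leader;	/* models decodeGroupLeader */
	DemoProc   *members[DEMO_MAX_MEMBERS];	/* models decodeGroupMembers */
	int			nmembers;
	bool		locked;			/* models decodeLocked */
	bool		abort_pending;	/* models decodeAbortPending */
};

/* Models AssignDecodeGroupLeader: the leader joins its own member list. */
void
demo_assign_leader(DemoProc *leader)
{
	if (leader->group_leader == NULL)
	{
		leader->group_leader = leader;
		leader->members[leader->nmembers++] = leader;
	}
}

/* Models BecomeDecodeGroupMember: join only if the XID interlock holds. */
bool
demo_join(DemoProc *self, DemoProc *leader, unsigned int xid)
{
	assert(self != leader && self->group_leader == NULL);

	/* Is this still the leader we are looking for? */
	if (leader->xid != xid || leader->group_leader != leader)
		return false;
	/* Refuse to join a leader that is going away (or a full demo group). */
	if (leader->abort_pending || leader->nmembers == DEMO_MAX_MEMBERS)
		return false;

	self->group_leader = leader;
	leader->members[leader->nmembers++] = self;
	return true;
}

/*
 * Models one pass of the abort loop in LogicalDecodeRemoveTransaction:
 * flag every member and report whether any of them still holds the lock.
 */
bool
demo_abort_must_wait(DemoProc *leader)
{
	bool		wait = false;

	leader->abort_pending = true;
	for (int i = 0; i < leader->nmembers; i++)
	{
		DemoProc   *p = leader->members[i];

		if (p == leader)
			continue;			/* ignore the leader itself */
		p->abort_pending = true;
		if (p->locked)
			wait = true;
	}
	return wait;
}
```

In the real code each of these steps runs under the leader's lock manager partition lock, and the waiting side sleeps on its latch between passes; the sketch leaves both out to keep the control flow visible.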
0003-Support-decoding-of-two-phase-transactions-at-PREPAR.0504.patch (application/octet-stream)
From a078ffe96dbc36d21c592494ea954776337b5bf4 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Thu, 5 Apr 2018 14:25:21 +0530
Subject: [PATCH 3/5] Support decoding of two-phase transactions at PREPARE
Until now two-phase transactions were decoded at COMMIT, just like
regular transactions. During replay, two-phase transactions were
translated into regular transactions on the subscriber, and the GID
was not forwarded to it.

This patch allows PREPARE-time decoding of two-phase transactions (if
the output plugin supports this capability), in which case the
transactions are replayed at PREPARE and then committed later when
COMMIT PREPARED arrives.

On the subscriber, the transactions will be executed as two-phase
transactions, with the same GID. This is important for various
external transaction managers, which often encode information into
the GID itself.

All catalog access while decoding such a two-phase transaction has to
be bracketed by the LogicalLockTransaction/LogicalUnlockTransaction
APIs at the relevant locations. This includes the location where the
output plugin's change apply API is invoked, which protects any
catalog access inside that API from concurrent rollback operations.

Includes documentation changes.
---
doc/src/sgml/logicaldecoding.sgml | 128 +++++++++++++-
src/backend/access/transam/twophase.c | 8 +
src/backend/replication/logical/decode.c | 147 ++++++++++++++--
src/backend/replication/logical/logical.c | 202 +++++++++++++++++++++
src/backend/replication/logical/reorderbuffer.c | 225 +++++++++++++++++++++---
src/include/replication/logical.h | 11 +-
src/include/replication/output_plugin.h | 45 +++++
src/include/replication/reorderbuffer.h | 54 ++++++
8 files changed, 783 insertions(+), 37 deletions(-)
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index f6b14dccb0..b11752789d 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -384,7 +384,12 @@ typedef struct OutputPluginCallbacks
LogicalDecodeBeginCB begin_cb;
LogicalDecodeChangeCB change_cb;
LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeAbortCB abort_cb;
LogicalDecodeMessageCB message_cb;
+ LogicalDecodeFilterPrepareCB filter_prepare_cb;
+ LogicalDecodePrepareCB prepare_cb;
+ LogicalDecodeCommitPreparedCB commit_prepared_cb;
+ LogicalDecodeAbortPreparedCB abort_prepared_cb;
LogicalDecodeFilterByOriginCB filter_by_origin_cb;
LogicalDecodeShutdownCB shutdown_cb;
} OutputPluginCallbacks;
@@ -454,7 +459,13 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
never get
decoded. Successful savepoints are
folded into the transaction containing them in the order they were
- executed within that transaction.
+ executed within that transaction. A transaction that is prepared for
+ a two-phase commit using <command>PREPARE TRANSACTION</command> will
+ also be decoded if the output plugin provides the callbacks needed
+ for decoding it. The transaction being decoded may be aborted
+ concurrently via a <command>ROLLBACK PREPARED</command> command.
+ In that case, the logical decoding of this transaction is aborted
+ too.
</para>
<note>
@@ -555,6 +566,74 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
</para>
</sect3>
+ <sect3 id="logicaldecoding-output-plugin-prepare">
+ <title>Transaction Prepare Callback</title>
+
+ <para>
+ The optional <function>prepare_cb</function> callback is called whenever
+ a transaction which is prepared for two-phase commit has been
+ decoded. The <function>change_cb</function> callbacks for all modified
+ rows will have been called before this, if there have been any modified
+ rows.
+<programlisting>
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+ <title>Commit Prepared Transaction Callback</title>
+
+ <para>
+ The optional <function>commit_prepared_cb</function> callback is called whenever
+ a <command>COMMIT PREPARED</command> has been decoded. The <parameter>gid</parameter> field,
+ which is part of the <parameter>txn</parameter> parameter, can be used in this
+ callback.
+<programlisting>
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-abort-prepared">
+ <title>Rollback Prepared Transaction Callback</title>
+
+ <para>
+ The optional <function>abort_prepared_cb</function> callback is called whenever
+ a <command>ROLLBACK PREPARED</command> has been decoded. The <parameter>gid</parameter> field,
+ which is part of the <parameter>txn</parameter> parameter, can be used in this
+ callback.
+<programlisting>
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-abort">
+ <title>Transaction Abort Callback</title>
+
+ <para>
+ The required <function>abort_cb</function> callback is called whenever
+ a transaction abort has to be initiated. This can happen if we are
+ decoding a transaction that has been prepared for two-phase commit and
+ a concurrent rollback happens while we are decoding it. In such cases
+ it can make sense to check whether the rollback has already happened
+ before looking at any changes, so that decoding of such transactions
+ is avoided completely.
+<programlisting>
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
<sect3 id="logicaldecoding-output-plugin-change">
<title>Change Callback</title>
@@ -564,7 +643,12 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
an <command>INSERT</command>, <command>UPDATE</command>,
or <command>DELETE</command>. Even if the original command modified
several rows at once the callback will be called individually for each
- row.
+ row. The <function>change_cb</function> callback may access system or
+ user catalog tables to aid in the process of outputting the row
+ modification details. When decoding a prepared (but not yet
+ committed) transaction or any other uncommitted transaction, this
+ change callback is guaranteed safe access to catalog tables even if
+ another backend concurrently rolls back that same transaction.
<programlisting>
typedef void (*LogicalDecodeChangeCB) (struct LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
@@ -619,6 +703,39 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
</para>
</sect3>
+ <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+ <title>Prepare Filter Callback</title>
+
+ <para>
+ The optional <function>filter_prepare_cb</function> callback
+ is called to determine whether data that is part of the current
+ two-phase commit transaction should be decoded at this prepare
+ stage, or later as a regular one-phase transaction at
+ <command>COMMIT PREPARED</command> time. To signal that
+ PREPARE-time decoding should be skipped, return <literal>true</literal>;
+ <literal>false</literal> otherwise. When the callback is not
+ defined, <literal>false</literal> is assumed (i.e. nothing is
+ filtered).
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+</programlisting>
+ The <parameter>ctx</parameter> parameter has the same contents
+ as for the other callbacks. The <parameter>txn</parameter> parameter
+ contains meta information about the transaction. The <parameter>xid</parameter>
+ contains the XID because <parameter>txn</parameter> can be NULL in some cases.
+ The <parameter>gid</parameter> is the identifier that later identifies this
+ transaction for <command>COMMIT PREPARED</command> or <command>ROLLBACK PREPARED</command>.
+ </para>
+ <para>
+ The callback has to provide the same static answer for a given combination of
+ <parameter>xid</parameter> and <parameter>gid</parameter> every time it is
+ called.
+ </para>
+ </sect3>
+
<sect3 id="logicaldecoding-output-plugin-message">
<title>Generic Message Callback</title>
@@ -640,7 +757,12 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
non-transactional and the XID was not assigned yet in the transaction
which logged the message. The <parameter>lsn</parameter> has WAL
location of the message. The <parameter>transactional</parameter> says
- if the message was sent as transactional or not.
+ if the message was sent as transactional or not. As with the change
+ callback, when decoding a prepared (but not yet committed)
+ transaction or any other uncommitted transaction, this message
+ callback is guaranteed safe access to catalog tables even if another
+ backend concurrently rolls back that same transaction.
+
The <parameter>prefix</parameter> is arbitrary null-terminated prefix
which can be used for identifying interesting messages for the current
plugin. And finally the <parameter>message</parameter> parameter holds
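The filter_prepare_cb contract documented in the hunk above (return true to skip PREPARE-time decoding, and give the same answer every time for a given xid and gid) can be illustrated with a plain predicate. This is a minimal sketch under stated assumptions: demo_filter_prepare and the "local_" prefix are hypothetical, and the real callback additionally receives the ctx and txn arguments, which are left out here so the example compiles on its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical GID prefix for transactions that should not be decoded at PREPARE. */
#define DEMO_SKIP_PREFIX "local_"

/*
 * Hypothetical filter: returns true to skip PREPARE-time decoding (the
 * transaction is then decoded as a regular one at COMMIT PREPARED), and
 * false to decode it at PREPARE. The result depends only on the
 * arguments, so repeated calls for the same xid and gid always agree.
 */
bool
demo_filter_prepare(unsigned int xid, const char *gid)
{
	(void) xid;					/* unused in this sketch */

	if (gid == NULL)
		return false;			/* nothing to match on: do not filter */

	return strncmp(gid, DEMO_SKIP_PREFIX, strlen(DEMO_SKIP_PREFIX)) == 0;
}
```

Keeping the decision a pure function of its arguments is the simplest way to meet the requirement that the answer be static: the same filter is consulted again when the commit or rollback of the prepared transaction is decoded, and a changed answer would leave the two stages inconsistent.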
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index d6e4b7980f..30ebe5e72d 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1507,6 +1507,14 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
ProcArrayRemove(proc, latestXid);
+ /*
+ * Coordinate with logical decoding backends that may be already
+ * decoding this prepared transaction. When aborting a transaction,
+ * we need to wait for all of them to leave the decoding group. If
+ * committing, we simply remove all members from the group.
+ */
+ LogicalDecodeRemoveTransaction(proc, isCommit);
+
/*
* In case we fail while running the callbacks, mark the gxact invalid so
* no one else will try to commit/rollback, and so it will be recycled if
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 6eb0d5527e..51d544d0f5 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -34,6 +34,7 @@
#include "access/xlogutils.h"
#include "access/xlogreader.h"
#include "access/xlogrecord.h"
+#include "access/twophase.h"
#include "catalog/pg_control.h"
@@ -72,6 +73,8 @@ static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_commit *parsed, TransactionId xid);
static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare * parsed);
/* common function to decode tuples */
static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -280,16 +283,33 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
break;
}
case XLOG_XACT_PREPARE:
+ {
+ xl_xact_parsed_prepare parsed;
- /*
- * Currently decoding ignores PREPARE TRANSACTION and will just
- * decode the transaction when the COMMIT PREPARED is sent or
- * throw away the transaction's contents when a ROLLBACK PREPARED
- * is received. In the future we could add code to expose prepared
- * transactions in the changestream allowing for a kind of
- * distributed 2PC.
- */
- ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+ /* check that output plugin is capable of twophase decoding */
+ if (!ctx->enable_twophase)
+ {
+ ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+ break;
+ }
+
+ /* ok, parse it */
+ ParsePrepareRecord(XLogRecGetInfo(buf->record),
+ XLogRecGetData(buf->record), &parsed);
+
+ /* does output plugin want this particular transaction? */
+ if (ctx->callbacks.filter_prepare_cb &&
+ ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+ parsed.twophase_gid))
+ {
+ ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+ buf->origptr);
+ break;
+ }
+
+ DecodePrepare(ctx, buf, &parsed);
+ break;
+ }
break;
default:
elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
@@ -627,9 +647,90 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
buf->origptr, buf->endptr);
}
+ /*
+ * Decide if we're processing COMMIT PREPARED, or a regular COMMIT.
+ * Regular commit simply triggers a replay of transaction changes from the
+ * reorder buffer. For COMMIT PREPARED that however already happened at
+ * PREPARE time, and so we only need to notify the subscriber that the GID
+ * finally committed.
+ *
+ * For output plugins that do not support PREPARE-time decoding of
+ * two-phase transactions, we never even see the PREPARE and all two-phase
+ * transactions simply fall through to the second branch.
+ */
+ if (TransactionIdIsValid(parsed->twophase_xid) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder,
+ parsed->twophase_xid, parsed->twophase_gid))
+ {
+ Assert(xid == parsed->twophase_xid);
+ /* we are processing COMMIT PREPARED */
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid, true);
+ }
+ else
+ {
+ /* replay actions of all transaction + subtransactions in order */
+ ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn);
+ }
+}
+
+/*
+ * Decode PREPARE record. Similar logic as in COMMIT
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare * parsed)
+{
+ XLogRecPtr origin_lsn = parsed->origin_lsn;
+ TimestampTz commit_time = parsed->origin_timestamp;
+ RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+ int i;
+ TransactionId xid = parsed->twophase_xid;
+
+ /*
+ * Process invalidation messages, even if we're not interested in the
+ * transaction's contents, since the various caches need to always be
+ * consistent.
+ */
+ if (parsed->nmsgs > 0)
+ {
+ if (!ctx->fast_forward)
+ ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
+ parsed->nmsgs, parsed->msgs);
+ ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+ }
+
+ /*
+ * Tell the reorderbuffer about the surviving subtransactions. We need to
+ * do this because the main transaction itself has not committed since we
+ * are in the prepare phase right now. So we need to be sure the snapshot
+ * is setup correctly for the main transaction in case all changes
+ * happened in subtransactions.
+ */
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+ buf->origptr, buf->endptr);
+ }
+
+ if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+ (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+ ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+ {
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr);
+ }
+ ReorderBufferForget(ctx->reorder, xid, buf->origptr);
+
+ return;
+ }
+
/* replay actions of all transaction + subtransactions in order */
- ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
- commit_time, origin_id, origin_lsn);
+ ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn, parsed->twophase_gid);
}
/*
@@ -641,6 +742,30 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid)
{
int i;
+ XLogRecPtr origin_lsn = InvalidXLogRecPtr;
+ TimestampTz commit_time = 0;
+ RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ origin_lsn = parsed->origin_lsn;
+ commit_time = parsed->origin_timestamp;
+ }
+
+ /*
+ * If it's ROLLBACK PREPARED then handle it via callbacks.
+ */
+ if (TransactionIdIsValid(xid) &&
+ !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+ parsed->dbId == ctx->slot->data.database &&
+ !FilterByOrigin(ctx, origin_id) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
+ {
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid, false);
+ return;
+ }
for (i = 0; i < parsed->nsubxacts; i++)
{
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 95ffd2da54..7c6c8b0df3 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -60,6 +60,16 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+static void abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change);
static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -125,6 +135,7 @@ StartupDecodingContext(List *output_plugin_options,
MemoryContext context,
old_context;
LogicalDecodingContext *ctx;
+ int twophase_callbacks;
/* shorter lines... */
slot = MyReplicationSlot;
@@ -184,8 +195,38 @@ StartupDecodingContext(List *output_plugin_options,
ctx->reorder->begin = begin_cb_wrapper;
ctx->reorder->apply_change = change_cb_wrapper;
ctx->reorder->commit = commit_cb_wrapper;
+ ctx->reorder->abort = abort_cb_wrapper;
+ ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+ ctx->reorder->prepare = prepare_cb_wrapper;
+ ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+ ctx->reorder->abort_prepared = abort_prepared_cb_wrapper;
ctx->reorder->message = message_cb_wrapper;
+ /*
+ * Check that plugin implements all callbacks necessary to decode
+ * two-phase transactions - we either have to have all of them or none.
+ * The filter_prepare callback is optional, but can only be defined when
+ * two-phase decoding is enabled (i.e. the three other callbacks are
+ * defined).
+ */
+ twophase_callbacks = (ctx->callbacks.prepare_cb != NULL) +
+ (ctx->callbacks.commit_prepared_cb != NULL) +
+ (ctx->callbacks.abort_prepared_cb != NULL);
+
+ ctx->enable_twophase = (twophase_callbacks == 3);
+
+ /* Plugins with incorrect number of two-phase callbacks are broken. */
+ if ((twophase_callbacks != 3) && (twophase_callbacks != 0))
+ ereport(ERROR,
+ (errmsg("output plugin registered only %d two-phase callbacks",
+ twophase_callbacks)));
+
+ /* filter_prepare is optional, but requires two-phase decoding */
+ if ((ctx->callbacks.filter_prepare_cb != NULL) && (!ctx->enable_twophase))
+ ereport(ERROR,
+ (errmsg("output plugin does not support two-phase decoding, but "
+ "registered filter_prepare callback")));
+
ctx->out = makeStringInfo();
ctx->prepare_write = prepare_write;
ctx->write = do_write;
@@ -697,6 +738,122 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
error_context_stack = errcallback.previous;
}
+static void
+abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "abort";
+ state.report_location = txn->final_lsn; /* beginning of abort record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.abort_cb(ctx, txn, abort_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "prepare";
+ state.report_location = txn->final_lsn; /* beginning of prepare record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "commit_prepared";
+ state.report_location = txn->final_lsn; /* beginning of commit record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "abort_prepared";
+ state.report_location = txn->final_lsn; /* beginning of abort record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.abort_prepared_cb(ctx, txn, abort_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
static void
change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change)
@@ -734,6 +891,51 @@ change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
error_context_stack = errcallback.previous;
}
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+ bool ret;
+
+ /*
+ * Skip if decoding of twophase at PREPARE time is not enabled. In that
+ * case all twophase transactions are considered filtered out and will be
+ * applied as regular transactions at COMMIT PREPARED.
+ */
+ if (!ctx->enable_twophase)
+ return true;
+
+ /*
+ * The filter_prepare callback is optional. When not supplied, all
+ * prepared transactions should go through.
+ */
+ if (!ctx->callbacks.filter_prepare_cb)
+ return false;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "filter_prepare";
+ state.report_location = InvalidXLogRecPtr;
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = false;
+
+ /* do the actual work: call callback */
+ ret = ctx->callbacks.filter_prepare_cb(ctx, txn, xid, gid);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+
+ return ret;
+}
+
bool
filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
{
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 3c9af58640..1c7dbd3ade 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -337,6 +337,11 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
}
/* free data that's contained */
+ if (txn->gid != NULL)
+ {
+ pfree(txn->gid);
+ txn->gid = NULL;
+ }
if (txn->tuplecid_hash != NULL)
{
@@ -1277,25 +1282,18 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
* the top and subtransactions (using a k-way merge) and replay the changes in
* lsn order.
*/
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
- XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
- TimestampTz commit_time,
- RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+ ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn)
{
- ReorderBufferTXN *txn;
volatile Snapshot snapshot_now;
volatile CommandId command_id = FirstCommandId;
bool using_subtxn;
ReorderBufferIterTXNState *volatile iterstate = NULL;
- txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
- false);
-
- /* unknown transaction, nothing to replay */
- if (txn == NULL)
- return;
-
txn->final_lsn = commit_lsn;
txn->end_lsn = end_lsn;
txn->commit_time = commit_time;
@@ -1372,8 +1370,12 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
case REORDER_BUFFER_CHANGE_DELETE:
Assert(snapshot_now);
+ /* Lock transaction before catalog access */
+ if (!LogicalLockTransaction(txn))
+ break;
reloid = RelidByRelfilenode(change->data.tp.relnode.spcNode,
change->data.tp.relnode.relNode);
+ LogicalUnlockTransaction(txn);
/*
* Catalog tuple without data, emitted while catalog was
@@ -1388,8 +1390,14 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
relpathperm(change->data.tp.relnode,
MAIN_FORKNUM));
+ /* Lock transaction before catalog access */
+ if (!LogicalLockTransaction(txn))
+ break;
+
relation = RelationIdGetRelation(reloid);
+ LogicalUnlockTransaction(txn);
+
if (relation == NULL)
elog(ERROR, "could not open relation with OID %u (for filenode \"%s\")",
reloid,
@@ -1418,8 +1426,23 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
/* user-triggered change */
if (!IsToastRelation(relation))
{
+ /*
+ * Output plugins can access catalog metadata and we
+ * do not have any control over that. We could ask
+ * them to call
+ * LogicalLockTransaction/LogicalUnlockTransaction
+ * APIs themselves, but that leads to unnecessary
+ * complications and expectations from plugin
+ * writers. We avoid this by calling these APIs
+ * here, thereby ensuring that the in-progress
+ * transaction will be around for the duration of
+ * the apply_change call below
+ */
+ if (!LogicalLockTransaction(txn))
+ break;
ReorderBufferToastReplace(rb, txn, relation, change);
rb->apply_change(rb, txn, relation, change);
+ LogicalUnlockTransaction(txn);
/*
* Only clear reassembled toast chunks if we're sure
@@ -1492,10 +1515,14 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
break;
case REORDER_BUFFER_CHANGE_MESSAGE:
+ /* XXX does rb->message need lock/unlock? */
+ if (!LogicalLockTransaction(txn))
+ break;
rb->message(rb, txn, change->lsn, true,
change->data.msg.prefix,
change->data.msg.message_size,
change->data.msg.message);
+ LogicalUnlockTransaction(txn);
break;
case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1581,8 +1608,26 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
ReorderBufferIterTXNFinish(rb, iterstate);
iterstate = NULL;
- /* call commit callback */
- rb->commit(rb, txn, commit_lsn);
+ /*
+ * Call abort/commit/prepare callback, depending on the transaction
+ * state.
+ *
+ * If the transaction aborted during apply (which currently can happen
+ * only for prepared transactions), simply call the abort callback.
+ *
+ * Otherwise call either PREPARE (for twophase transactions) or COMMIT
+ * (for regular ones).
+ */
+ if (rbtxn_rollback(txn))
+ rb->abort(rb, txn, commit_lsn);
+ else if (rbtxn_prepared(txn))
+ rb->prepare(rb, txn, commit_lsn);
+ else
+ rb->commit(rb, txn, commit_lsn);
+
+ /* remove ourselves from the decodeGroupLeader */
+ if (MyProc->decodeGroupLeader)
+ RemoveDecodeGroupMember(MyProc->decodeGroupLeader);
/* this is just a sanity check against bad output plugin behaviour */
if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -1609,7 +1654,12 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
if (snapshot_now->copied)
ReorderBufferFreeSnap(rb, snapshot_now);
- /* remove potential on-disk data, and deallocate */
+ /*
+ * remove potential on-disk data, and deallocate.
+ *
+ * We remove it even for prepared transactions (GID is enough to
+ * commit/abort those later).
+ */
ReorderBufferCleanupTXN(rb, txn);
}
PG_CATCH();
@@ -1643,6 +1693,141 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
PG_END_TRY();
}
+
+/*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+ return rb->filter_prepare(rb, txn, xid, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferCommitInternal()
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /* unknown transaction, nothing to replay */
+ if (txn == NULL)
+ return;
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a twophase transaction. It calls ReorderBufferCommitInternal()
+ * since all prepared transactions need to be decoded at PREPARE time.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /* unknown transaction, nothing to replay */
+ if (txn == NULL)
+ return;
+
+ txn->txn_flags |= RBTXN_PREPARE;
+ txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+ strcpy(txn->gid, gid);
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Check whether this transaction was sent as prepared to subscribers.
+ * Called while handling commit|abort prepared.
+ */
+bool
+ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+ const char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /*
+ * Always call the prepare filter. It's the job of the prepare filter to
+ * give us the *same* response for a given xid across multiple calls
+ * (including ones on restart)
+ */
+ return !(rb->filter_prepare(rb, txn, xid, gid));
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ABORT PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit)
+{
+ ReorderBufferTXN *txn;
+
+ /*
+ * The transaction may or may not exist (during restarts for example).
+ * Either way, a 2PC transaction no longer holds any buffered changes at
+ * this point (they were cleaned up after PREPARE), so allow it to be
+ * created below.
+ */
+ txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+ true);
+
+ txn->final_lsn = commit_lsn;
+ txn->end_lsn = end_lsn;
+ txn->commit_time = commit_time;
+ txn->origin_id = origin_id;
+ txn->origin_lsn = origin_lsn;
+ /* this txn is obviously prepared */
+ txn->txn_flags |= RBTXN_PREPARE;
+ txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+ strcpy(txn->gid, gid);
+
+ if (is_commit)
+ {
+ txn->txn_flags |= RBTXN_COMMIT_PREPARED;
+ rb->commit_prepared(rb, txn, commit_lsn);
+ }
+ else
+ {
+ txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
+ rb->abort_prepared(rb, txn, commit_lsn);
+ }
+
+ /* cleanup: make sure there's no cache pollution */
+ ReorderBufferExecuteInvalidations(rb, txn);
+ ReorderBufferCleanupTXN(rb, txn);
+}
+
/*
* Abort a transaction that possibly has previous changes. Needs to be first
* called for subtransactions and then for the toplevel xid.
@@ -1711,7 +1896,7 @@ ReorderBufferAbortOld(ReorderBuffer *rb, TransactionId oldestRunningXid)
if (rbtxn_is_serialized(txn) && txn->final_lsn == 0)
{
ReorderBufferChange *last =
- dlist_tail_element(ReorderBufferChange, node, &txn->changes);
+ dlist_tail_element(ReorderBufferChange, node, &txn->changes);
txn->final_lsn = last->lsn;
}
@@ -2625,9 +2810,9 @@ ReorderBufferSerializedPath(char *path, ReplicationSlot *slot, TransactionId xid
XLogSegNoOffsetToRecPtr(segno, 0, recptr, wal_segment_size);
snprintf(path, MAXPGPATH, "pg_replslot/%s/xid-%u-lsn-%X-%X.snap",
- NameStr(MyReplicationSlot->data.name),
- xid,
- (uint32) (recptr >> 32), (uint32) recptr);
+ NameStr(MyReplicationSlot->data.name),
+ xid,
+ (uint32) (recptr >> 32), (uint32) recptr);
}
/*
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 63b14367f0..fbe18dff56 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -46,11 +46,11 @@ typedef struct LogicalDecodingContext
struct SnapBuild *snapshot_builder;
/*
- * Marks the logical decoding context as fast forward decoding one.
- * Such a context does not have plugin loaded so most of the the following
+ * Marks the logical decoding context as fast forward decoding one. Such a
+ * context does not have plugin loaded so most of the following
* properties are unused.
*/
- bool fast_forward;
+ bool fast_forward;
OutputPluginCallbacks callbacks;
OutputPluginOptions options;
@@ -89,6 +89,11 @@ typedef struct LogicalDecodingContext
bool prepared_write;
XLogRecPtr write_location;
TransactionId write_xid;
+
+ /*
+ * Capabilities of the output plugin.
+ */
+ bool enable_twophase;
} LogicalDecodingContext;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 82875d6b3d..5254210a46 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -68,6 +68,46 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+/*
+ * Called for an implicit ABORT of a transaction.
+ */
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+ /*
+ * Called before decoding of a PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to the prepare and
+ * commit_prepared/abort_prepared callbacks, or should wait until COMMIT
+ * PREPARED and be sent as a regular transaction. Returns true to skip
+ * decoding at PREPARE time.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
/*
* Called for the generic logical decoding messages.
*/
@@ -99,7 +139,12 @@ typedef struct OutputPluginCallbacks
LogicalDecodeBeginCB begin_cb;
LogicalDecodeChangeCB change_cb;
LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeAbortCB abort_cb;
LogicalDecodeMessageCB message_cb;
+ LogicalDecodeFilterPrepareCB filter_prepare_cb;
+ LogicalDecodePrepareCB prepare_cb;
+ LogicalDecodeCommitPreparedCB commit_prepared_cb;
+ LogicalDecodeAbortPreparedCB abort_prepared_cb;
LogicalDecodeFilterByOriginCB filter_by_origin_cb;
LogicalDecodeShutdownCB shutdown_cb;
} OutputPluginCallbacks;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 385bb486bb..d890e6628c 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
#define REORDERBUFFER_H
#include "access/htup_details.h"
+#include "access/twophase.h"
#include "lib/ilist.h"
#include "storage/sinval.h"
#include "utils/hsearch.h"
@@ -179,6 +180,9 @@ typedef struct ReorderBufferTXN
*/
TransactionId xid;
+ /* In case of 2PC we need to pass GID to output plugin */
+ char *gid;
+
/*
* LSN of the first data carrying, WAL record with knowledge about this
* xid. This is allowed to *not* be first record adorned with this xid, if
@@ -312,6 +316,37 @@ typedef void (*ReorderBufferCommitCB) (
ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+/* abort callback signature */
+typedef void (*ReorderBufferAbortCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+typedef bool (*ReorderBufferFilterPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/* abort prepared callback signature */
+typedef void (*ReorderBufferAbortPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+
/* message callback signature */
typedef void (*ReorderBufferMessageCB) (
ReorderBuffer *rb,
@@ -347,6 +382,11 @@ struct ReorderBuffer
ReorderBufferBeginCB begin;
ReorderBufferApplyChangeCB apply_change;
ReorderBufferCommitCB commit;
+ ReorderBufferAbortCB abort;
+ ReorderBufferFilterPrepareCB filter_prepare;
+ ReorderBufferPrepareCB prepare;
+ ReorderBufferCommitPreparedCB commit_prepared;
+ ReorderBufferAbortPreparedCB abort_prepared;
ReorderBufferMessageCB message;
/*
@@ -394,6 +434,11 @@ void ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot
void ReorderBufferCommit(ReorderBuffer *, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit);
void ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
void ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
@@ -417,6 +462,15 @@ void ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
bool ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
bool ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
+bool ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+ const char *gid);
+bool ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+ const char *gid);
+void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid);
ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
void ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
--
2.15.1 (Apple Git-101)
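
For readers skimming the patch above, the all-or-none rule that StartupDecodingContext() applies to the two-phase callbacks can be sketched in isolation. This is only an illustrative sketch, not code from the patch: the TwoPhaseCallbacks struct, the TwoPhaseStatus enum and check_twophase_callbacks() are hypothetical stand-ins for the relevant fields of OutputPluginCallbacks and for the ereport(ERROR) paths.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two-phase fields of OutputPluginCallbacks. */
typedef void (*cb_fn) (void);

typedef struct TwoPhaseCallbacks
{
	cb_fn		prepare_cb;
	cb_fn		commit_prepared_cb;
	cb_fn		abort_prepared_cb;
	cb_fn		filter_prepare_cb;	/* optional */
} TwoPhaseCallbacks;

/* Hypothetical result type; the patch raises an ERROR instead of INVALID. */
typedef enum TwoPhaseStatus
{
	TWOPHASE_DISABLED,
	TWOPHASE_ENABLED,
	TWOPHASE_INVALID
} TwoPhaseStatus;

/* Placeholder callback used only to obtain a non-NULL pointer. */
static void
dummy_cb(void)
{
}

static TwoPhaseStatus
check_twophase_callbacks(const TwoPhaseCallbacks *cb)
{
	/* Count the three mandatory callbacks, as the patch does. */
	int			n = (cb->prepare_cb != NULL) +
		(cb->commit_prepared_cb != NULL) +
		(cb->abort_prepared_cb != NULL);

	/* Either all three are registered, or none of them. */
	if (n != 0 && n != 3)
		return TWOPHASE_INVALID;

	/* filter_prepare is optional, but requires two-phase decoding. */
	if (n == 0 && cb->filter_prepare_cb != NULL)
		return TWOPHASE_INVALID;

	return (n == 3) ? TWOPHASE_ENABLED : TWOPHASE_DISABLED;
}
```

A plugin that registers only prepare_cb, or only filter_prepare_cb, would land in the TWOPHASE_INVALID case here, which corresponds to the two error reports in the patch.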
0004-Teach-test_decoding-plugin-to-work-with-2PC.0504.patch (application/octet-stream)
From 0b0cbab82da81da15d0acd52281a1e987a435c6f Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Thu, 5 Apr 2018 14:25:53 +0530
Subject: [PATCH 4/5] Teach test_decoding plugin to work with 2PC
Includes a new option "enable-twophase". Depending on this option's
value, PREPARE TRANSACTION will either be decoded or treated as
a single-phase commit later.
---
contrib/test_decoding/expected/prepared.out | 257 +++++++++++++++++++++++++---
contrib/test_decoding/sql/prepared.sql | 84 ++++++++-
contrib/test_decoding/test_decoding.c | 137 +++++++++++++++
3 files changed, 451 insertions(+), 27 deletions(-)
diff --git a/contrib/test_decoding/expected/prepared.out b/contrib/test_decoding/expected/prepared.out
index 46e915d4ff..5df7b7ff20 100644
--- a/contrib/test_decoding/expected/prepared.out
+++ b/contrib/test_decoding/expected/prepared.out
@@ -6,19 +6,82 @@ SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_d
init
(1 row)
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc', 'test_decoding');
+ ?column?
+----------
+ init
+(1 row)
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
-- test simple successful use of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (1);
PREPARE TRANSACTION 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+------
+(0 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ PREPARE TRANSACTION 'test_prepared#1'
+(3 rows)
+
COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ COMMIT
+(3 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
INSERT INTO test_prepared1 VALUES (2);
-- test abort of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (3);
PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+(3 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(6 rows)
+
ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+------
+(0 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
INSERT INTO test_prepared1 VALUES (4);
-- test prepared xact containing ddl
BEGIN;
@@ -26,45 +89,193 @@ INSERT INTO test_prepared1 VALUES (5);
ALTER TABLE test_prepared1 ADD COLUMN data text;
INSERT INTO test_prepared1 VALUES (6, 'frakbar');
PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
-INSERT INTO test_prepared2 VALUES (7);
-COMMIT PREPARED 'test_prepared#3';
--- make sure stuff still works
-INSERT INTO test_prepared1 VALUES (8);
-INSERT INTO test_prepared2 VALUES (9);
--- cleanup
-DROP TABLE test_prepared1;
-DROP TABLE test_prepared2;
--- show results
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'test_prepared1'::regclass;
+ relation | locktype | mode
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4
+ COMMIT
+(3 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
data
-------------------------------------------------------------------------
BEGIN
- table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:4
COMMIT
BEGIN
- table public.test_prepared1: INSERT: id[integer]:2
- COMMIT
+ table public.test_prepared1: INSERT: id[integer]:5
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(7 rows)
+
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------
BEGIN
- table public.test_prepared1: INSERT: id[integer]:4
+ table public.test_prepared2: INSERT: id[integer]:7
COMMIT
+(3 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+----------------------------------------------------
BEGIN
table public.test_prepared2: INSERT: id[integer]:7
COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+-------------------------------------------------------------------------
BEGIN
table public.test_prepared1: INSERT: id[integer]:5
table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
COMMIT
+(4 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#3'
+(1 row)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (8);
+INSERT INTO test_prepared2 VALUES (9);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+--------------------------------------------------------------------
BEGIN
table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
COMMIT
BEGIN
table public.test_prepared2: INSERT: id[integer]:9
COMMIT
-(22 rows)
+(6 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:9
+ COMMIT
+(6 rows)
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'pg_class'::regclass;
+ relation | locktype | mode
+----------+----------+------
+(0 rows)
+
+-- Shouldn't see anything with 2pc decoding off
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+-----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:12
+ COMMIT
+(3 rows)
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:12
+ PREPARE TRANSACTION 'test_prepared_lock2'
+ COMMIT PREPARED 'test_prepared_lock2'
+(8 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- Both will work normally after we commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ COMMIT
+(4 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+--------------------------------------
+ COMMIT PREPARED 'test_prepared_lock'
+(1 row)
+
+-- test savepoints
+BEGIN;
+SAVEPOINT test_savepoint;
+CREATE TABLE test_prepared_savepoint (a int);
+PREPARE TRANSACTION 'test_prepared_savepoint';
+COMMIT PREPARED 'test_prepared_savepoint';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+-------------------------------------------
+ COMMIT PREPARED 'test_prepared_savepoint'
+(1 row)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+------
+(0 rows)
+
+-- cleanup
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+------
+(0 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+------
+(0 rows)
SELECT pg_drop_replication_slot('regression_slot');
pg_drop_replication_slot
@@ -72,3 +283,9 @@ SELECT pg_drop_replication_slot('regression_slot');
(1 row)
+SELECT pg_drop_replication_slot('regression_slot_2pc');
+ pg_drop_replication_slot
+--------------------------
+
+(1 row)
+
diff --git a/contrib/test_decoding/sql/prepared.sql b/contrib/test_decoding/sql/prepared.sql
index e72639767e..e8eb8ad8d6 100644
--- a/contrib/test_decoding/sql/prepared.sql
+++ b/contrib/test_decoding/sql/prepared.sql
@@ -1,22 +1,31 @@
-- predictability
SET synchronous_commit = on;
SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc', 'test_decoding');
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
-- test simple successful use of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (1);
PREPARE TRANSACTION 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
INSERT INTO test_prepared1 VALUES (2);
-- test abort of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (3);
PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
INSERT INTO test_prepared1 VALUES (4);
@@ -27,24 +36,85 @@ ALTER TABLE test_prepared1 ADD COLUMN data text;
INSERT INTO test_prepared1 VALUES (6, 'frakbar');
PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'test_prepared1'::regclass;
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
-- make sure stuff still works
INSERT INTO test_prepared1 VALUES (8);
INSERT INTO test_prepared2 VALUES (9);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'pg_class'::regclass;
+
+-- Shouldn't see anything with 2pc decoding off
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+RESET statement_timeout;
+
+COMMIT PREPARED 'test_prepared_lock';
+
+-- Both will work normally after we commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+
+-- test savepoints
+BEGIN;
+SAVEPOINT test_savepoint;
+CREATE TABLE test_prepared_savepoint (a int);
+PREPARE TRANSACTION 'test_prepared_savepoint';
+COMMIT PREPARED 'test_prepared_savepoint';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
-- cleanup
DROP TABLE test_prepared1;
DROP TABLE test_prepared2;
--- show results
+-- show results. There should be nothing to show
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+
SELECT pg_drop_replication_slot('regression_slot');
+SELECT pg_drop_replication_slot('regression_slot_2pc');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index a94aeeae29..05b993fd7a 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -36,6 +36,7 @@ typedef struct
bool skip_empty_xacts;
bool xact_wrote_changes;
bool only_local;
+ bool enable_twophase;
} TestDecodingData;
static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -49,6 +50,8 @@ static void pg_output_begin(LogicalDecodingContext *ctx,
bool last_write);
static void pg_decode_commit_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pg_decode_abort_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr abort_lsn);
static void pg_decode_change(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, Relation rel,
ReorderBufferChange *change);
@@ -58,6 +61,18 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr message_lsn,
bool transactional, const char *prefix,
Size sz, const char *message);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
void
_PG_init(void)
@@ -75,9 +90,14 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
cb->begin_cb = pg_decode_begin_txn;
cb->change_cb = pg_decode_change;
cb->commit_cb = pg_decode_commit_txn;
+ cb->abort_cb = pg_decode_abort_txn;
cb->filter_by_origin_cb = pg_decode_filter;
cb->shutdown_cb = pg_decode_shutdown;
cb->message_cb = pg_decode_message;
+ cb->filter_prepare_cb = pg_decode_filter_prepare;
+ cb->prepare_cb = pg_decode_prepare_txn;
+ cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+ cb->abort_prepared_cb = pg_decode_abort_prepared_txn;
}
@@ -97,6 +117,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
data->include_timestamp = false;
data->skip_empty_xacts = false;
data->only_local = false;
+ data->enable_twophase = false;
ctx->output_plugin_private = data;
@@ -178,6 +199,16 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
errmsg("could not parse value \"%s\" for parameter \"%s\"",
strVal(elem->arg), elem->defname)));
}
+ else if (strcmp(elem->defname, "enable-twophase") == 0)
+ {
+ if (elem->arg == NULL)
+ data->enable_twophase = true;
+ else if (!parse_bool(strVal(elem->arg), &data->enable_twophase))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("could not parse value \"%s\" for parameter \"%s\"",
+ strVal(elem->arg), elem->defname)));
+ }
else
{
ereport(ERROR,
@@ -246,6 +277,112 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
OutputPluginWrite(ctx, true);
}
+/* ABORT callback */
+static void
+pg_decode_abort_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+ if (data->include_xids)
+ appendStringInfo(ctx->out, "ABORT %u", txn->xid);
+ else
+ appendStringInfoString(ctx->out, "ABORT");
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* Filter out two-phase transactions, if decoding not enabled. */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ /* treat all transactions as one-phase */
+ if (!data->enable_twophase)
+ return true;
+
+ return false;
+}
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* ABORT PREPARED callback */
+static void
+pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
static bool
pg_decode_filter(LogicalDecodingContext *ctx,
RepOriginId origin_id)
--
2.15.1 (Apple Git-101)
0005-OPTIONAL-Additional-test-case-to-demonstrate-decoding-rollbac.0504.patch (application/octet-stream)
From 93d49e13b1ba52726fcbfa5f47c287e2051db575 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Thu, 5 Apr 2018 14:27:53 +0530
Subject: [PATCH 5/5] Additional test case to demonstrate decoding/rollback
interlocking
Introduce a decode-delay parameter in the test_decoding plugin. Based
on the value provided, the plugin sleeps for that many seconds while
holding the LogicalTransactionLock. A concurrent rollback is fired
off, which aborts that transaction in the meantime.
---
contrib/test_decoding/Makefile | 5 +-
contrib/test_decoding/t/001_twophase.pl | 102 ++++++++++++++++++++++++
contrib/test_decoding/test_decoding.c | 24 ++++++
src/backend/replication/logical/reorderbuffer.c | 5 ++
4 files changed, 135 insertions(+), 1 deletion(-)
create mode 100644 contrib/test_decoding/t/001_twophase.pl
diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 6c18189d9d..79b9622600 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -26,7 +26,7 @@ installcheck:;
# installation, allow to do so, but only if requested explicitly.
installcheck-force: regresscheck-install-force isolationcheck-install-force
-check: regresscheck isolationcheck
+check: regresscheck isolationcheck 2pc-check
submake-regress:
$(MAKE) -C $(top_builddir)/src/test/regress all
@@ -66,3 +66,6 @@ isolationcheck-install-force: all | submake-isolation submake-test_decoding temp
isolationcheck isolationcheck-install-force
temp-install: EXTRA_INSTALL=contrib/test_decoding
+
+2pc-check: temp-install
+ $(prove_check)
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000000..d50e2c9940
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,102 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+ 'postgresql.conf', qq(
+ max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+ "INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+ "SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+$node_logical->safe_psql('postgres',
+ "SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot2', 'test_decoding');");
+
+# This test is specifically for testing concurrent abort while logical decode is
+# ongoing. The decode-delay value makes the decoding of each change sleep for
+# that many seconds. We also hold the LogicalLockTransaction lock while we sleep.
+# We fire off a ROLLBACK from another session while this delayed decode is
+# ongoing. Since the decoder holds the lock from the call above, this ROLLBACK
+# waits for the logical backends to do a LogicalUnlockTransaction. Decoding
+# stops immediately after that, and the next pg_logical_slot_get_changes call
+# should show only a few records decoded from the entire two-phase transaction.
+#
+# We use two slots to test multiple decoding backends here
+
+$node_logical->safe_psql('postgres', "
+ BEGIN;
+ INSERT INTO tab VALUES (11);
+ INSERT INTO tab VALUES (12);
+ ALTER TABLE tab ADD COLUMN b INT;
+ PREPARE TRANSACTION 'test_prepared_tab';");
+
+# start decoding the above with decode-delay in the background.
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should only decode 1 INSERT record and should include
+# an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1', 'decode-delay', '3');\" \&");
+
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot2', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1', 'decode-delay', '3');\" \&");
+
+# sleep for a little while (shorter than decode-delay)
+$node_logical->safe_psql('postgres', "select pg_sleep(1)");
+
+# rollback the prepared transaction whose first record is being decoded
+# after sleeping for decode-delay time
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# wait for decoding to stop
+$node_logical->psql('postgres', "select pg_sleep(4)");
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot2', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+
+# check for occurrence of log about waiting backends
+my $output_file = slurp_file($node_logical->logfile());
+my $waiting_str = "Waiting for backends to abort";
+like($output_file, qr/$waiting_str/, "Waiting log found in server log");
+
+# check for occurrence of log about stopping decoding
+my $abort_str = "stopping decoding of test_prepared_tab ";
+like($output_file, qr/$abort_str/, "ABORT found in server log");
+
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+ BEGIN;
+ INSERT INTO tab VALUES (11);
+ INSERT INTO tab VALUES (12);
+ ALTER TABLE tab ADD COLUMN b INT;
+ INSERT INTO tab VALUES (13, 11);
+ PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot2', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit post the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot2', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot2');");
+$node_logical->stop('fast');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 05b993fd7a..db7becdc44 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -37,6 +37,7 @@ typedef struct
bool xact_wrote_changes;
bool only_local;
bool enable_twophase;
+ int decode_delay; /* seconds to sleep after every change record */
} TestDecodingData;
static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -118,6 +119,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
data->skip_empty_xacts = false;
data->only_local = false;
data->enable_twophase = false;
+ data->decode_delay = 0;
ctx->output_plugin_private = data;
@@ -209,6 +211,21 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
errmsg("could not parse value \"%s\" for parameter \"%s\"",
strVal(elem->arg), elem->defname)));
}
+ else if (strcmp(elem->defname, "decode-delay") == 0)
+ {
+ if (elem->arg == NULL)
+ data->decode_delay = 2; /* default to 2 seconds */
+ else
+ data->decode_delay = pg_atoi(strVal(elem->arg),
+ sizeof(int), 0);
+
+ if (data->decode_delay <= 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("Specify positive value for parameter \"%s\","
+ " you specified \"%s\"",
+ elem->defname, strVal(elem->arg))));
+ }
else
{
ereport(ERROR,
@@ -548,6 +565,13 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
}
data->xact_wrote_changes = true;
+ /* if decode_delay is specified, sleep with above lock held */
+ if (data->decode_delay > 0)
+ {
+ elog(LOG, "sleeping for %d seconds", data->decode_delay);
+ pg_usleep(data->decode_delay * 1000000L);
+ }
+
class_form = RelationGetForm(relation);
tupdesc = RelationGetDescr(relation);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 1c7dbd3ade..adb6adef88 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1372,7 +1372,12 @@ ReorderBufferCommitInternal(ReorderBufferTXN *txn,
/* Lock transaction before catalog access */
if (!LogicalLockTransaction(txn))
+ {
+ elog(LOG, "stopping decoding of %s (%u)",
+ txn->gid[0] != '\0'? txn->gid:"",
+ txn->xid);
break;
+ }
reloid = RelidByRelfilenode(change->data.tp.relnode.spcNode,
change->data.tp.relnode.relNode);
LogicalUnlockTransaction(txn);
--
2.15.1 (Apple Git-101)
Hi,
Uh? Simply rechecking if MyProc->decodeGroupLeader is NULL obviously
does not fix the race condition - it might get NULL right after the
check. So we need to either look up the PROC again (and then get the
associated lwlock), or hold some other type of lock.
I realized my approach was short-sighted while coding it up. So now we
look up the leader pgproc, recheck that its XID is the one we are
interested in, and only then go ahead.
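The look-up-then-recheck pattern described above can be sketched roughly as follows. This is only an illustration, not the code from the patch: `DemoProc`, `demo_lookup_by_xid`, `demo_attach_to_leader` and `demo_detach_from_leader` are hypothetical names, and a pthread mutex stands in for the per-leader lwlock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; these are NOT the PostgreSQL PGPROC/LWLock types. */
typedef uint32_t DemoXid;

typedef struct DemoProc
{
    pthread_mutex_t lock;       /* protects xid and nmembers */
    DemoXid     xid;            /* 0 means the slot is free or was reused */
    int         nmembers;       /* decoders currently attached to this leader */
} DemoProc;

#define NPROCS 4

static DemoProc procs[NPROCS] = {
    {PTHREAD_MUTEX_INITIALIZER, 0, 0},
    {PTHREAD_MUTEX_INITIALIZER, 0, 0},
    {PTHREAD_MUTEX_INITIALIZER, 0, 0},
    {PTHREAD_MUTEX_INITIALIZER, 0, 0},
};

/*
 * Find the slot that currently runs xid.  The answer can be stale as soon
 * as the slot's lock is released, so callers must recheck under the lock.
 */
static DemoProc *
demo_lookup_by_xid(DemoXid xid)
{
    for (int i = 0; i < NPROCS; i++)
    {
        bool        match;

        pthread_mutex_lock(&procs[i].lock);
        match = (procs[i].xid == xid);
        pthread_mutex_unlock(&procs[i].lock);

        if (match)
            return &procs[i];
    }
    return NULL;
}

/* Attach to the leader of xid; returns false if it is gone or was reused. */
static bool
demo_attach_to_leader(DemoXid xid)
{
    DemoProc   *leader = demo_lookup_by_xid(xid);
    bool        attached = false;

    if (leader == NULL)
        return false;

    pthread_mutex_lock(&leader->lock);
    /* Recheck under the lock: the slot may have changed since the lookup. */
    if (leader->xid == xid)
    {
        leader->nmembers++;
        attached = true;
    }
    pthread_mutex_unlock(&leader->lock);

    return attached;
}

/* Detach again; only valid after a successful attach. */
static void
demo_detach_from_leader(DemoProc *leader)
{
    pthread_mutex_lock(&leader->lock);
    assert(leader->nmembers > 0);
    leader->nmembers--;
    pthread_mutex_unlock(&leader->lock);
}
```

The point of the sketch is that the decision to attach is made only while the leader's lock is held, so a leader that exits between the lookup and the attach is detected rather than dereferenced blindly.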
I did some more gdb single-stepping and debugging on this. I introduced a few
more calls that fetch the pgproc by XID, for more robustness. I am now satisfied,
from my point of view, with the decodegroup lock changes.
There are also a few other changes: cleanups, and setting the txn flags
consistently in all places.
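As a rough illustration of what keeping per-transaction state in a flags word looks like, here is a minimal sketch. The names (`DemoTxn`, the `DEMO_TXN_*` bits and the helper functions) are hypothetical and are not taken from the patch; the real ReorderBufferTXN layout and macro names may differ.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits, one per former boolean member. */
#define DEMO_TXN_HAS_CATALOG_CHANGES  0x0001
#define DEMO_TXN_IS_SUBXACT           0x0002
#define DEMO_TXN_PREPARED             0x0004

typedef struct DemoTxn
{
    uint32_t    xid;
    uint32_t    flags;          /* bitwise OR of DEMO_TXN_* values */
} DemoTxn;

/* Set one or more flag bits. */
static inline void
demo_txn_set(DemoTxn *txn, uint32_t bits)
{
    txn->flags |= bits;
}

/* Clear one or more flag bits. */
static inline void
demo_txn_clear(DemoTxn *txn, uint32_t bits)
{
    txn->flags &= ~bits;
}

/* Accessors keep call sites readable and hide the bit arithmetic. */
static inline bool
demo_txn_is_subxact(const DemoTxn *txn)
{
    return (txn->flags & DEMO_TXN_IS_SUBXACT) != 0;
}

static inline bool
demo_txn_is_prepared(const DemoTxn *txn)
{
    return (txn->flags & DEMO_TXN_PREPARED) != 0;
}
```

With accessors like these, an assertion such as "this transaction is not a subtransaction" reads as `Assert(!demo_txn_is_subxact(txn))` instead of testing a dedicated boolean member.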
PFA, v2.0 of the patchset for today.
"make check-world" passes ok on these patches.
Regards,
Nikhils
3) I don't quite understand why BecomeDecodeGroupMember does the
cross-check using PID. In which case would it help?
When I wrote this support, I had written it with the intention of
supporting both 2PC (in which case pid is 0) and in-progress regular
transactions. That is why PID is present in these functions. The
current use case is just for 2PC, so we could remove it.
Sure, but why do we need to cross-check the PID at all? I may be missing
something here, but I don't see what this protects against.
The fact that PID is 0 in case of prepared transactions was making me
nervous. So, I had added the assert that pid should only be 0 when
it's a prepared transaction and not otherwise. Anyways, since we are
dealing with only 2PC, I have removed the PID argument now. Also
removed is_prepared argument for the same reason.
Ok, I am looking at your provided patch and incorporating relevant
changes from it. Will submit a patch set soon.
OK.
PFA, latest patch set.
Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Attachments:
0001-Cleaning-up-of-flags-in-ReorderBufferTXN-structure.0504.v2.0.patch (application/octet-stream)
From 4ba56494dca0a6f40bb2224ad344484f8bb2be79 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Thu, 5 Apr 2018 19:35:34 +0530
Subject: [PATCH 1/5] Cleaning up of flags in ReorderBufferTXN structure
---
src/backend/replication/logical/reorderbuffer.c | 32 ++++++++++-----------
src/include/replication/reorderbuffer.h | 37 +++++++++++++------------
2 files changed, 36 insertions(+), 33 deletions(-)
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b4016ed52b..3c9af58640 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -636,7 +636,7 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
if (prev_first_lsn != InvalidXLogRecPtr)
Assert(prev_first_lsn < cur_txn->first_lsn);
- Assert(!cur_txn->is_known_as_subxact);
+ Assert(!rbtxn_is_known_subxact(cur_txn));
prev_first_lsn = cur_txn->first_lsn;
}
#endif
@@ -654,7 +654,7 @@ ReorderBufferGetOldestTXN(ReorderBuffer *rb)
txn = dlist_head_element(ReorderBufferTXN, node, &rb->toplevel_by_lsn);
- Assert(!txn->is_known_as_subxact);
+ Assert(!rbtxn_is_known_subxact(txn));
Assert(txn->first_lsn != InvalidXLogRecPtr);
return txn;
}
@@ -688,9 +688,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
dlist_push_tail(&txn->subtxns, &subtxn->node);
txn->nsubtxns++;
}
- else if (!subtxn->is_known_as_subxact)
+ else if (!rbtxn_is_known_subxact(subtxn))
{
- subtxn->is_known_as_subxact = true;
+ subtxn->txn_flags |= RBTXN_IS_SUBXACT;
Assert(subtxn->nsubtxns == 0);
/* remove from lsn order list of top-level transactions */
@@ -751,9 +751,9 @@ ReorderBufferCommitChild(ReorderBuffer *rb, TransactionId xid,
subtxn->final_lsn = commit_lsn;
subtxn->end_lsn = end_lsn;
- if (!subtxn->is_known_as_subxact)
+ if (!rbtxn_is_known_subxact(subtxn))
{
- subtxn->is_known_as_subxact = true;
+ subtxn->txn_flags |= RBTXN_IS_SUBXACT;
Assert(subtxn->nsubtxns == 0);
/* remove from lsn order list of top-level transactions */
@@ -862,7 +862,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
ReorderBufferChange *cur_change;
- if (txn->serialized)
+ if (rbtxn_is_serialized(txn))
{
/* serialize remaining changes */
ReorderBufferSerializeTXN(rb, txn);
@@ -891,7 +891,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
ReorderBufferChange *cur_change;
- if (cur_txn->serialized)
+ if (rbtxn_is_serialized(cur_txn))
{
/* serialize remaining changes */
ReorderBufferSerializeTXN(rb, cur_txn);
@@ -1057,7 +1057,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
* they originally were happening inside another subtxn, so we won't
* ever recurse more than one level deep here.
*/
- Assert(subtxn->is_known_as_subxact);
+ Assert(rbtxn_is_known_subxact(subtxn));
Assert(subtxn->nsubtxns == 0);
ReorderBufferCleanupTXN(rb, subtxn);
@@ -1096,7 +1096,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
/*
* Remove TXN from its containing list.
*
- * Note: if txn->is_known_as_subxact, we are deleting the TXN from its
+ * Note: if txn is known as subxact, we are deleting the TXN from its
* parent's list of known subxacts; this leaves the parent's nsubxacts
* count too high, but we don't care. Otherwise, we are deleting the TXN
* from the LSN-ordered list of toplevel TXNs.
@@ -1111,7 +1111,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
Assert(found);
/* remove entries spilled to disk */
- if (txn->serialized)
+ if (rbtxn_is_serialized(txn))
ReorderBufferRestoreCleanup(rb, txn);
/* deallocate */
@@ -1128,7 +1128,7 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
dlist_iter iter;
HASHCTL hash_ctl;
- if (!txn->has_catalog_changes || dlist_is_empty(&txn->tuplecids))
+ if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
return;
memset(&hash_ctl, 0, sizeof(hash_ctl));
@@ -1708,7 +1708,7 @@ ReorderBufferAbortOld(ReorderBuffer *rb, TransactionId oldestRunningXid)
* final_lsn to that of their last change; this causes
* ReorderBufferRestoreCleanup to do the right thing.
*/
- if (txn->serialized && txn->final_lsn == 0)
+ if (rbtxn_is_serialized(txn) && txn->final_lsn == 0)
{
ReorderBufferChange *last =
dlist_tail_element(ReorderBufferChange, node, &txn->changes);
@@ -1954,7 +1954,7 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
- txn->has_catalog_changes = true;
+ txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
}
/*
@@ -1971,7 +1971,7 @@ ReorderBufferXidHasCatalogChanges(ReorderBuffer *rb, TransactionId xid)
if (txn == NULL)
return false;
- return txn->has_catalog_changes;
+ return rbtxn_has_catalog_changes(txn);
}
/*
@@ -2111,7 +2111,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
Assert(spilled == txn->nentries_mem);
Assert(dlist_is_empty(&txn->changes));
txn->nentries_mem = 0;
- txn->serialized = true;
+ txn->txn_flags |= RBTXN_IS_SERIALIZED;
if (fd != -1)
CloseTransientFile(fd);
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index aa430c843c..177ef98e43 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -137,21 +137,33 @@ typedef struct ReorderBufferChange
dlist_node node;
} ReorderBufferChange;
+/* ReorderBufferTXN flags */
+#define RBTXN_HAS_CATALOG_CHANGES 0x0001
+#define RBTXN_IS_SUBXACT 0x0002
+#define RBTXN_IS_SERIALIZED 0x0004
+
+/* does the txn have catalog changes */
+#define rbtxn_has_catalog_changes(txn) (txn->txn_flags & RBTXN_HAS_CATALOG_CHANGES)
+/* is the txn known as a subxact? */
+#define rbtxn_is_known_subxact(txn) (txn->txn_flags & RBTXN_IS_SUBXACT)
+/*
+ * Has this transaction been spilled to disk? It's not always possible to
+ * deduce that fact by comparing nentries with nentries_mem, because e.g.
+ * subtransactions of a large transaction might get serialized together
+ * with the parent - if they're restored to memory they'd have
+ * nentries_mem == nentries.
+ */
+#define rbtxn_is_serialized(txn) (txn->txn_flags & RBTXN_IS_SERIALIZED)
+
typedef struct ReorderBufferTXN
{
+ int txn_flags;
+
/*
* The transactions transaction id, can be a toplevel or sub xid.
*/
TransactionId xid;
- /* did the TX have catalog changes */
- bool has_catalog_changes;
-
- /*
- * Do we know this is a subxact?
- */
- bool is_known_as_subxact;
-
/*
* LSN of the first data carrying, WAL record with knowledge about this
* xid. This is allowed to *not* be first record adorned with this xid, if
@@ -214,15 +226,6 @@ typedef struct ReorderBufferTXN
*/
uint64 nentries_mem;
- /*
- * Has this transaction been spilled to disk? It's not always possible to
- * deduce that fact by comparing nentries with nentries_mem, because e.g.
- * subtransactions of a large transaction might get serialized together
- * with the parent - if they're restored to memory they'd have
- * nentries_mem == nentries.
- */
- bool serialized;
-
/*
* List of ReorderBufferChange structs, including new Snapshots and new
* CommandIds
--
2.15.1 (Apple Git-101)
0002-Introduce-LogicalLockTransaction-LogicalUnlockTransa.0504.v2.0.patch (application/octet-stream)
From 70bfc94323ab6252f9046c3faccbcd0de5faa360 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Thu, 5 Apr 2018 19:40:20 +0530
Subject: [PATCH 2/5] Introduce LogicalLockTransaction/LogicalUnlockTransaction
APIs
When a transaction aborts, its changes are considered unnecessary
for other transactions. That means the changes may be either cleaned
up by vacuum or removed from HOT chains (thus made inaccessible
through indexes), and there may be other such consequences.
When decoding committed transactions this is not an issue, and we
never decode transactions that abort before the decoding starts
(where decoding means passing it to ReorderBufferCommit).
But for in-progress transactions - for example when decoding prepared
transactions on PREPARE (and not COMMIT PREPARED as before), this
may cause failures when the output plugin consults catalogs (both
system and user-defined).
To prevent aborts concurrent with plugins accessing catalogs, we
introduce an API the output plugins are required to use (when
decoding in-progress transactions only).
Before accessing any catalogs, output plugins are required to call
LogicalLockTransaction and then release it using
LogicalUnlockTransaction. Implementation is via adding support for
decoding groups. Use LockHashPartitionLockByProc on the group leader
to get the LWLock protecting these fields. For prepared and uncommitted
transactions, decoding backends working on the same XID will link
themselves up to the corresponding PGPROC entry (decodeGroupLeader).
They will remove themselves when they are done decoding.
If the prepared or uncommitted transaction decides to abort, then
the decodeGroupLeader will set the decodeAbortPending flag allowing
the decodeGroupMembers to abort their decoding appropriately.
If any decode group member errors out, we also remove that proc
from the group membership.
---
src/backend/replication/logical/logical.c | 242 ++++++++++++++++
src/backend/storage/ipc/procarray.c | 39 +++
src/backend/storage/lmgr/README | 46 ++++
src/backend/storage/lmgr/proc.c | 442 +++++++++++++++++++++++++++++-
src/include/replication/logical.h | 2 +
src/include/replication/reorderbuffer.h | 15 +
src/include/storage/proc.h | 26 ++
src/include/storage/procarray.h | 1 +
8 files changed, 804 insertions(+), 9 deletions(-)
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 3d8ad7ddf8..9bb382bb97 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1017,3 +1017,245 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
SpinLockRelease(&MyReplicationSlot->mutex);
}
}
+
+/*
+ * LogicalLockTransaction
+ * Make sure the transaction is not aborted during decoding.
+ *
+ * The logical decoding plugins may need to access catalogs (both system
+ * and user-defined), e.g. to get metadata about tuples, do custom
+ * filtering etc. While decoding committed transactions that is not an
+ * issue, but in-progress transactions may abort while being decoded, in
+ * which case the catalog access may fail in various ways (rows from
+ * aborted transactions are eligible for more aggressive cleanup, may
+ * not be accessible through indexes due to breaking HOT chains etc.).
+ *
+ * To prevent these issues, we need to prevent abort of the transaction
+ * while accessing any catalogs. To enforce that, each decoding backend
+ * has to call LogicalLockTransaction prior to any catalog access, and
+ * then LogicalUnlockTransaction immediately after it. The lock function
+ * adds the decoding backend into a "decoding group" for the transaction
+ * on the first call. Subsequent calls update a flag indicating whether
+ * the decoding backend may be accessing any catalogs.
+ *
+ * While aborting an in-progress transaction, the backend is made to wait
+ * for all current members of the decoding group that may be currently
+ * accessing catalogs (see LogicalDecodeRemoveTransaction). Once the
+ * transaction completes (applies to both abort and commit), the group
+ * is destroyed and is not needed anymore (we can check transaction
+ * status directly, instead).
+ *
+ * The function returns true when it's safe to access catalogs, and
+ * false when the transaction aborted (or is being aborted), in which
+ * case the plugin should stop decoding it.
+ *
+ * The decoding backend joins the decoding group only when actually
+ * needed. For example when the transaction did no catalog changes,
+ * or when it's known to already have committed (or aborted), we can
+ * bail out without joining the group.
+ */
+bool
+LogicalLockTransaction(ReorderBufferTXN *txn)
+{
+ bool ok = false;
+ LWLock *leader_lwlock;
+ volatile PGPROC *leader = NULL;
+ volatile PGXACT *pgxact = NULL;
+
+ /*
+ * Transactions that have not modified catalogs do not need to
+ * join the decoding group.
+ */
+ if (!rbtxn_has_catalog_changes(txn))
+ return true;
+
+ /*
+ * Check commit status. If a transaction already committed, there
+ * is no danger when accessing catalogs. If it aborted, we can
+ * stop decoding it right away.
+ */
+ if (rbtxn_commit(txn))
+ return true;
+
+ if (rbtxn_rollback(txn))
+ return false;
+
+ /*
+ * Currently, only 2PC transactions can be decoded before commit
+ * (at prepare). So regular transactions are automatically safe.
+ */
+ if (!rbtxn_prepared(txn))
+ return true;
+
+ /*
+ * Find the PROC handling this XID and join the decoding group.
+ *
+ * If this is the first call for this XID, we don't know which
+ * PROC is executing the transaction (and acting as a leader).
+ * In that case we need to lookup and possibly also assign
+ * the leader.
+ */
+ if (MyProc->decodeGroupLeader == NULL)
+ {
+ leader = AssignDecodeGroupLeader(txn->xid);
+
+ /*
+ * We have checked if the transaction committed/aborted, but it
+ * is possible the PROC went away since then, in which case we
+ * get leader as NULL above. We recheck transaction status,
+ * expecting it to be either committed or aborted.
+ *
+ * If the PROC is available, add ourself as a member of its
+ * decoding group. Note that we're not holding any locks on PGPROC,
+ * so it's possible the leader disappears, or starts executing
+ * another transaction. In that case we're done.
+ */
+ if (leader == NULL ||
+ !BecomeDecodeGroupMember((PGPROC *)leader, txn->xid))
+ goto lock_cleanup;
+ }
+
+ /*
+ * We know the leader was executing this XID a while ago, and we
+ * might have become a member of the decode group as well.
+ * But we have not been holding any locks on PGPROC so it might
+ * have committed/aborted, removed us from the decoding group and
+ * started executing something else since then. So we need to
+ * recheck that it is indeed still running the right XID.
+ */
+ leader = BackendXidGetProc(txn->xid);
+ if (!leader)
+ goto lock_cleanup;
+
+ leader_lwlock = LockHashPartitionLockByProc(leader);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+
+ pgxact = &ProcGlobal->allPgXact[leader->pgprocno];
+ if (pgxact->xid != txn->xid)
+ {
+ LWLockRelease(leader_lwlock);
+ goto lock_cleanup;
+ }
+
+ /* ok, we are part of the decode group still */
+ Assert(MyProc->decodeGroupLeader &&
+ MyProc->decodeGroupLeader == leader);
+
+ /*
+ * Re-check if we were told to abort by the leader after taking
+ * the above lock.
+ */
+ if (leader->decodeAbortPending)
+ {
+ /*
+ * Remove ourself from the decodeGroupMembership and return
+ * false so that the decoding plugin also initiates abort
+ * processing
+ */
+ RemoveDecodeGroupMemberLocked(MyProc->decodeGroupLeader);
+ MyProc->decodeLocked = false;
+ txn->txn_flags |= RBTXN_ROLLBACK;
+ ok = false;
+ }
+ else
+ {
+ /* ok to logically lock this backend */
+ MyProc->decodeLocked = true;
+ ok = true;
+ }
+ LWLockRelease(leader_lwlock);
+
+ return ok;
+
+ /*
+ * if we reach lock_cleanup label, then lock was not granted.
+ * Check XID status and update txn flags appropriately before
+ * returning
+ */
+lock_cleanup:
+ Assert(!TransactionIdIsInProgress(txn->xid));
+ if (TransactionIdDidCommit(txn->xid))
+ {
+ txn->txn_flags |= RBTXN_COMMIT;
+ return true;
+ }
+ else
+ {
+ txn->txn_flags |= RBTXN_ROLLBACK;
+ return false;
+ }
+}
+
+/*
+ * LogicalUnlockTransaction
+ * Indicate that the logical decoding plugin is done accessing
+ * catalog information.
+ *
+ *
+ * To prevent issues while decoding of in-progress transactions, we
+ * need to prevent abort of the transaction while accessing any catalogs.
+ * To enforce that, each decoding backend has to call
+ * LogicalLockTransaction prior to any catalog access, and then
+ * LogicalUnlockTransaction immediately after it. This unlock function
+ * removes the decoding backend from a "decoding group" for a given
+ * transaction.
+ */
+void
+LogicalUnlockTransaction(ReorderBufferTXN *txn)
+{
+ LWLock *leader_lwlock;
+ PGPROC *leader = NULL;
+
+ /*
+ * If the transaction is known to have aborted, we should have never got
+ * here (the plugin should have interrupted the decoding).
+ */
+ Assert(!rbtxn_rollback(txn));
+
+ /* If it's not locked, we're done. */
+ if (!MyProc->decodeLocked)
+ return;
+
+ /*
+ * Transactions that have not modified catalogs do not need to
+ * join the decoding group.
+ */
+ if (!rbtxn_has_catalog_changes(txn))
+ return;
+
+ /*
+ * Currently, only 2PC transactions can be decoded before commit
+ * (at prepare). So regular transactions are automatically safe.
+ */
+ if (!rbtxn_prepared(txn))
+ return;
+
+ /*
+ * Check commit status. If a transaction already committed, there
+ * is no danger when accessing catalogs.
+ */
+ if (rbtxn_commit(txn))
+ return;
+
+ /*
+ * We're guaranteed to still have a leader here, because we are
+ * in locked mode, so the leader can't just disappear.
+ */
+ leader = MyProc->decodeGroupLeader;
+ Assert(leader && MyProc->decodeLocked);
+
+ leader_lwlock = LockHashPartitionLockByProc(leader);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ if (leader->decodeAbortPending)
+ {
+ /*
+ * Remove ourself from the decodeGroupMembership
+ */
+ RemoveDecodeGroupMemberLocked(leader);
+
+ txn->txn_flags |= RBTXN_ROLLBACK;
+ }
+ MyProc->decodeLocked = false;
+ LWLockRelease(leader_lwlock);
+ return;
+}
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index afe1c03aa3..2be2910207 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -2432,6 +2432,45 @@ BackendXidGetPid(TransactionId xid)
return result;
}
+/*
+ * BackendXidGetProc -- get a backend's PGPROC given its XID
+ *
+ * Note that it is up to the caller to be sure that the question
+ * remains meaningful for long enough for the answer to be used ...
+ *
+ * Only main transaction Ids are considered.
+ *
+ */
+PGPROC *
+BackendXidGetProc(TransactionId xid)
+{
+ PGPROC *result = NULL;
+ ProcArrayStruct *arrayP = procArray;
+ int index;
+
+ if (xid == InvalidTransactionId) /* never match invalid xid */
+ return NULL;
+
+ LWLockAcquire(ProcArrayLock, LW_SHARED);
+
+ for (index = 0; index < arrayP->numProcs; index++)
+ {
+ int pgprocno = arrayP->pgprocnos[index];
+ PGPROC *proc = &allProcs[pgprocno];
+ volatile PGXACT *pgxact = &allPgXact[pgprocno];
+
+ if (pgxact->xid == xid)
+ {
+ result = proc;
+ break;
+ }
+ }
+
+ LWLockRelease(ProcArrayLock);
+
+ return result;
+}
+
/*
* IsBackendPid -- is a given pid a running backend
*
diff --git a/src/backend/storage/lmgr/README b/src/backend/storage/lmgr/README
index 56b0a12a3e..4b4b9c5958 100644
--- a/src/backend/storage/lmgr/README
+++ b/src/backend/storage/lmgr/README
@@ -679,6 +679,52 @@ worker, and the worker fails to join the lock group unless the given PGPROC
still has the same PID and is still a lock group leader. We assume that
PIDs are not recycled quickly enough for this interlock to fail.
+Decode Group Locking
+--------------------
+
+When decoding in-progress transactions, we need to prevent aborts while
+the decoding processes are accessing catalogs, which might lead to issues
+if the transaction modified some of the catalogs. Currently this applies
+only to two-phase transactions, that may be decoded at PREPARE time, but
+in the future this may be extended to regular transactions too.
+
+To prevent that, the backend executing the abort is made to wait for all
+the decoding backends. We use an infrastructure which is very similar
+to the above group locking to form groups of backends performing logical
+decoding of the same in-progress transaction.
+
+Decode Group locking adds five new members to each PGPROC:
+decodeGroupLeader, decodeGroupMembers, decodeGroupLink, decodeLocked and
+decodeAbortPending. A PGPROC's decodeGroupLeader is NULL for processes
+not involved in logical decoding. When a process wants to decode an
+in-progress transaction, it looks up the PGPROC structure associated
+with that transaction ID and makes that PGPROC its decodeGroupLeader.
+The decodeGroupMembers field is only used in the
+leader; it is a list of the member PGPROCs of the decode group (the
+leader and all backends decoding this transaction ID).
+The decodeGroupLink field is the list link for this list. The decoding
+backend marks itself as decodeLocked while it is accessing catalog
+metadata for its decoding requirements via the LogicalLockTransaction
+API. It resets the same via the LogicalUnlockTransaction API.
+
+Meanwhile, if this in-progress transaction decides to abort, the PGPROC
+corresponding to it sets decodeAbortPending on itself and also on all
+the decodeGroupMembers entries.
+
+The decodeGroupMembers entries stop decoding this transaction and exit.
+When all the decoding backends have exited the abort can proceed.
+
+All five of these fields are considered to be protected by a lock manager
+partition lock. The partition lock that protects these fields within a given
+decode group is chosen by taking the leader's pgprocno modulo the number of lock
+manager partitions. Holding this single lock allows safe manipulation of the
+decodeGroupMembers list for the decode group.
+
+Each decoding backend can access the decodeGroupLeader's PGPROC and the
+transaction ID it is decoding. The decoding backend fails to join the
+decode group unless the given PGPROC is still running that transaction
+ID and has not set decodeAbortPending.
User Locks (Advisory Locks)
---------------------------
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 6f30e082b2..82a2450319 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -267,6 +267,11 @@ InitProcGlobal(void)
/* Initialize lockGroupMembers list. */
dlist_init(&procs[i].lockGroupMembers);
+
+ /* Initialize decodeGroupMembers list. */
+ dlist_init(&procs[i].decodeGroupMembers);
+ procs[i].decodeAbortPending = false;
+ procs[i].decodeLocked = false;
}
/*
@@ -406,6 +411,12 @@ InitProcess(void)
Assert(MyProc->lockGroupLeader == NULL);
Assert(dlist_is_empty(&MyProc->lockGroupMembers));
+ /* Check that group decode fields are in a proper initial state. */
+ Assert(MyProc->decodeGroupLeader == NULL);
+ Assert(dlist_is_empty(&MyProc->decodeGroupMembers));
+ MyProc->decodeAbortPending = false;
+ MyProc->decodeLocked = false;
+
/* Initialize wait event information. */
MyProc->wait_event_info = 0;
@@ -581,6 +592,12 @@ InitAuxiliaryProcess(void)
Assert(MyProc->lockGroupLeader == NULL);
Assert(dlist_is_empty(&MyProc->lockGroupMembers));
+ /* Check that group decode fields are in a proper initial state. */
+ Assert(MyProc->decodeGroupLeader == NULL);
+ Assert(dlist_is_empty(&MyProc->decodeGroupMembers));
+ MyProc->decodeAbortPending = false;
+ MyProc->decodeLocked = false;
+
/*
* We might be reusing a semaphore that belonged to a failed process. So
* be careful and reinitialize its value here. (This is not strictly
@@ -826,9 +843,14 @@ ProcKill(int code, Datum arg)
/*
* Detach from any lock group of which we are a member. If the leader
- * exist before all other group members, it's PGPROC will remain allocated
+ * exits before all other group members, its PGPROC will remain allocated
* until the last group process exits; that process must return the
* leader's PGPROC to the appropriate list.
+ *
+ * The below code needs to be mindful of the presence of decode group
+ * entries in case of logical decoding. However, lock groups are for
+ * parallel workers so we typically won't be finding both present
+ * together in the same proc.
*/
if (MyProc->lockGroupLeader != NULL)
{
@@ -845,11 +867,19 @@ ProcKill(int code, Datum arg)
{
procgloballist = leader->procgloballist;
- /* Leader exited first; return its PGPROC. */
- SpinLockAcquire(ProcStructLock);
- leader->links.next = (SHM_QUEUE *) *procgloballist;
- *procgloballist = leader;
- SpinLockRelease(ProcStructLock);
+ /*
+ * Leader exited first; return its PGPROC.
+ * Only do this if it is not part of a decode
+ * group, though. Otherwise the decode group
+ * code below will release it later.
+ */
+ if (leader->decodeGroupLeader == NULL)
+ {
+ SpinLockAcquire(ProcStructLock);
+ leader->links.next = (SHM_QUEUE *) *procgloballist;
+ *procgloballist = leader;
+ SpinLockRelease(ProcStructLock);
+ }
}
}
else if (leader != MyProc)
@@ -857,6 +887,54 @@ ProcKill(int code, Datum arg)
LWLockRelease(leader_lwlock);
}
+ /*
+ * Detach from any decode group of which we are a member. If the leader
+ * exits before all other group members, its PGPROC will remain allocated
+ * until the last group process exits; that process must return the
+ * leader's PGPROC to the appropriate list.
+ */
+ if (MyProc->decodeGroupLeader != NULL)
+ {
+ PGPROC *leader = MyProc->decodeGroupLeader;
+ LWLock *leader_lwlock = LockHashPartitionLockByProc(leader);
+
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ dlist_delete(&MyProc->decodeGroupLink);
+ if (dlist_is_empty(&leader->decodeGroupMembers))
+ {
+ leader->decodeGroupLeader = NULL;
+ if (leader != MyProc)
+ {
+ procgloballist = leader->procgloballist;
+
+ /*
+ * Leader exited first; return its PGPROC.
+ * But check if it was already done above
+ * by the lockGroup code
+ */
+ if (leader != *procgloballist)
+ {
+ SpinLockAcquire(ProcStructLock);
+ leader->links.next = (SHM_QUEUE *) *procgloballist;
+ *procgloballist = leader;
+ SpinLockRelease(ProcStructLock);
+ }
+ }
+ /* clear leader flags */
+ leader->decodeAbortPending = false;
+ leader->decodeLocked = false;
+ }
+ else if (leader != MyProc)
+ {
+ MyProc->decodeGroupLeader = NULL;
+ /* clear proc flags */
+ MyProc->decodeLocked = false;
+ MyProc->decodeAbortPending = false;
+ }
+ LWLockRelease(leader_lwlock);
+ }
+
/*
* Reset MyLatch to the process local one. This is so that signal
* handlers et al can continue using the latch after the shared latch
@@ -881,9 +959,36 @@ ProcKill(int code, Datum arg)
/* Since lockGroupLeader is NULL, lockGroupMembers should be empty. */
Assert(dlist_is_empty(&proc->lockGroupMembers));
- /* Return PGPROC structure (and semaphore) to appropriate freelist */
- proc->links.next = (SHM_QUEUE *) *procgloballist;
- *procgloballist = proc;
+ /*
+ * Return PGPROC structure (and semaphore) to appropriate freelist.
+ * Again check if decode group stuff will handle it later.
+ */
+ if (proc->decodeGroupLeader == NULL)
+ {
+ proc->links.next = (SHM_QUEUE *) *procgloballist;
+ *procgloballist = proc;
+ }
+ }
+
+ /*
+ * If we're still a member of a decode group, that means we're a leader
+ * which has somehow exited before its children. The last remaining child
+ * will release our PGPROC. Otherwise, release it now.
+ */
+ if (proc->decodeGroupLeader == NULL)
+ {
+ /* Since decodeGroupLeader is NULL, decodeGroupMembers should be empty. */
+ Assert(dlist_is_empty(&proc->decodeGroupMembers));
+
+ /*
+ * Return PGPROC structure (and semaphore) to appropriate freelist.
+ * But check if it was already done above by the lockGroup code.
+ */
+ if (proc != *procgloballist)
+ {
+ proc->links.next = (SHM_QUEUE *) *procgloballist;
+ *procgloballist = proc;
+ }
}
/* Update shared estimate of spins_per_delay */
@@ -1887,3 +1992,322 @@ BecomeLockGroupMember(PGPROC *leader, int pid)
return ok;
}
+
+/*
+ * AssignDecodeGroupLeader
+ * Lookup process using xid and designate as decode group leader.
+ *
+ * Once this function has returned, other processes can join the decode
+ * group by calling BecomeDecodeGroupMember.
+ */
+PGPROC *
+AssignDecodeGroupLeader(TransactionId xid)
+{
+ PGPROC *proc = NULL;
+ LWLock *leader_lwlock;
+
+ Assert(xid != InvalidTransactionId);
+
+ /*
+ * Lookup the backend executing this transaction.
+ *
+ * If the transaction already completed, we can bail out.
+ */
+ proc = BackendXidGetProc(xid);
+ if (!proc)
+ return NULL;
+
+ /*
+ * Process running a XID can't have a leader, it can only be
+ * a leader (in which case it points to itself).
+ */
+ Assert(!proc->decodeGroupLeader ||
+ (proc->decodeGroupLeader == proc));
+
+ /*
+ * This proc will become decodeGroupLeader if it's not already.
+ */
+ if (proc->decodeGroupLeader == NULL)
+ {
+ volatile PGXACT *pgxact;
+ volatile PGPROC *leader;
+
+ /* Create single-member group, containing this proc. */
+ leader_lwlock = LockHashPartitionLockByProc(proc);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+
+ /* recheck we are still the same */
+ leader = BackendXidGetProc(xid);
+ if (!leader || leader != proc)
+ {
+ LWLockRelease(leader_lwlock);
+ return NULL;
+ }
+
+ pgxact = &ProcGlobal->allPgXact[leader->pgprocno];
+
+ /*
+ * We know the process was executing the XID a while ago, but we
+ * have not been holding any locks on PGPROC so it might have
+ * started executing something else since then. So we need to
+ * recheck that it is indeed still running the right XID.
+ *
+ * If it's not, the transaction must have already completed, so
+ * we don't need to create any decoding group.
+ */
+ if (pgxact->xid == xid)
+ {
+ /*
+ * Some other decoding backend might have marked the process
+ * as a leader before we acquired the lock. But it must not
+ * be a follower of some other leader.
+ */
+ Assert(!proc->decodeGroupLeader ||
+ (proc->decodeGroupLeader == proc));
+
+ /* recheck if someone else did not already assign us */
+ if (proc->decodeGroupLeader == NULL)
+ {
+ /*
+ * The leader is also a part of the decoding group,
+ * so we add it to the members list as well.
+ */
+ proc->decodeGroupLeader = proc;
+ dlist_push_head(&proc->decodeGroupMembers,
+ &proc->decodeGroupLink);
+ }
+ }
+ else
+ {
+ /* proc entry is gone */
+ proc = NULL;
+ }
+ LWLockRelease(leader_lwlock);
+ }
+
+ if (proc)
+ elog(DEBUG1, "became group leader (%p)", proc);
+ return proc;
+}
+
+/*
+ * BecomeDecodeGroupMember - designate process as decode group member
+ *
+ * This is pretty straightforward except for the possibility that the leader
+ * whose group we're trying to join might exit before we manage to do so;
+ * and the PGPROC might get recycled for an unrelated process. To avoid
+ * that, we require the caller to pass the XID the intended PGPROC is
+ * expected to be running as an interlock. Returns true if we successfully
+ * join the intended decode group, and false if not.
+ */
+bool
+BecomeDecodeGroupMember(PGPROC *leader, TransactionId xid)
+{
+ LWLock *leader_lwlock;
+ bool ok = false;
+ volatile PGXACT *pgxact;
+ volatile PGPROC *proc = NULL;
+
+ /* Group leader can't become member of group */
+ Assert(MyProc != leader);
+
+ /* Can't already be a member of a group */
+ Assert(MyProc->decodeGroupLeader == NULL);
+
+ /* XID must be valid */
+ Assert(TransactionIdIsValid(xid));
+
+ /*
+ * Get lock protecting the group fields. Note LockHashPartitionLockByProc
+ * accesses leader->pgprocno in a PGPROC that might be free. This is safe
+ * because all PGPROCs' pgprocno fields are set during shared memory
+ * initialization and never change thereafter; so we will acquire the
+ * correct lock even if the leader PGPROC is in process of being recycled.
+ */
+ leader_lwlock = LockHashPartitionLockByProc(leader);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+
+ /* Is this the leader we're looking for? */
+ proc = BackendXidGetProc(xid);
+ if (!proc || leader != proc)
+ {
+ LWLockRelease(leader_lwlock);
+ return false;
+ }
+ pgxact = &ProcGlobal->allPgXact[leader->pgprocno];
+ if (pgxact->xid == xid)
+ {
+ /* is the leader going away? */
+ if (leader->decodeAbortPending)
+ ok = false;
+ else
+ {
+ /* OK, join the group */
+ ok = true;
+ MyProc->decodeGroupLeader = leader;
+ dlist_push_tail(&leader->decodeGroupMembers, &MyProc->decodeGroupLink);
+ }
+ }
+ LWLockRelease(leader_lwlock);
+
+ if (ok)
+ elog(DEBUG1, "became group member (%p) to (%p)", MyProc, leader);
+ return ok;
+}
+
+/*
+ * RemoveDecodeGroupMember
+ * Remove a member from the decoding group of a leader.
+ */
+void
+RemoveDecodeGroupMember(PGPROC *leader)
+{
+ LWLock *leader_lwlock;
+
+ leader_lwlock = LockHashPartitionLockByProc(leader);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ RemoveDecodeGroupMemberLocked(leader);
+ LWLockRelease(leader_lwlock);
+
+ return;
+}
+
+/*
+ * RemoveDecodeGroupMemberLocked
+ * Remove a member from a decoding group of a leader.
+ *
+ * Assumes that the caller is holding appropriate lock on PGPROC.
+ */
+void
+RemoveDecodeGroupMemberLocked(PGPROC *leader)
+{
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ dlist_delete(&MyProc->decodeGroupLink);
+ /* leader links to itself, so never empty */
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ MyProc->decodeGroupLeader = NULL;
+ elog(DEBUG1, "removed group member (%p) from (%p)", MyProc, leader);
+
+ return;
+}
+
+/*
+ * LogicalDecodeRemoveTransaction
+ * Notify all decoding members that this transaction is going away.
+ *
+ * Wait for all decodeGroupMembers to ack back before returning from
+ * here but only in case of aborts.
+ *
+ * This function should be called *after* the proc has been removed
+ * from the procArray.
+ *
+ * If the transaction is committing, it's ok for the decoding backends
+ * to continue merrily - there is no danger in accessing catalogs. When
+ * such a backend tries to join the decoding group, it won't find the
+ * proc anymore, forcing it to re-check transaction status and cache the
+ * commit status for future calls (see LogicalLockTransaction).
+ *
+ * If a backend that is part of the decode group dies or crashes, that
+ * effectively causes the database to restart, which cleans up the
+ * shared memory state.
+ */
+void
+LogicalDecodeRemoveTransaction(PGPROC *leader, bool isCommit)
+{
+ LWLock *leader_lwlock;
+ dlist_mutable_iter change_i;
+ dlist_iter iter;
+ PGPROC *proc;
+ bool do_wait;
+
+ leader_lwlock = LockHashPartitionLockByProc(leader);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+
+ /*
+ * If the proc has not been initialized as a group leader, there are
+ * no group members to wait for and we can terminate right away.
+ */
+ if (leader->decodeGroupLeader == NULL)
+ {
+ Assert(dlist_is_empty(&leader->decodeGroupMembers));
+ LWLockRelease(leader_lwlock);
+ return;
+ }
+
+ /* mark the transaction as aborting */
+ leader->decodeAbortPending = (!isCommit);
+
+recheck:
+ do_wait = false;
+ Assert(leader->decodeGroupLeader == leader);
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ if (!isCommit)
+ {
+ /*
+ * We need to walk the list of group members, and decide if we
+ * need to wait for some of them. In other words, we need to
+ * check if there are any processes besides the leader.
+ */
+ dlist_foreach(iter, &leader->decodeGroupMembers)
+ {
+ proc = dlist_container(PGPROC, decodeGroupLink, iter.cur);
+
+ /* Ignore the leader (i.e. ourselves). */
+ if (proc == leader)
+ continue;
+
+ /* if the proc is currently locked, wait */
+ if (proc->decodeLocked)
+ do_wait = true;
+ }
+
+ if (do_wait)
+ {
+ int rc;
+ LWLockRelease(leader_lwlock);
+
+ elog(LOG, "waiting for backends to abort decoding");
+ /*
+ * Wait on our latch to allow decodeGroupMembers to
+ * go away soon
+ */
+ rc = WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+ 100L,
+ WAIT_EVENT_PG_SLEEP);
+ ResetLatch(MyLatch);
+
+ /* emergency bailout if postmaster has died */
+ if (rc & WL_POSTMASTER_DEATH)
+ proc_exit(1);
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Recheck decodeGroupMembers */
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ goto recheck;
+ }
+ }
+
+ /*
+ * All backends exited cleanly in case of aborts above,
+ * remove decodeGroupMembers now for both commit/abort cases
+ */
+ Assert(leader->decodeGroupLeader == leader);
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ dlist_foreach_modify(change_i, &leader->decodeGroupMembers)
+ {
+ proc = dlist_container(PGPROC, decodeGroupLink, change_i.cur);
+ Assert(!proc->decodeLocked);
+ dlist_delete(&proc->decodeGroupLink);
+ elog(DEBUG1, "deleting group member (%p) from (%p)",
+ proc, leader);
+ proc->decodeGroupLeader = NULL;
+ }
+ Assert(dlist_is_empty(&leader->decodeGroupMembers));
+ leader->decodeGroupLeader = NULL;
+ leader->decodeAbortPending = false;
+ LWLockRelease(leader_lwlock);
+
+ return;
+}
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 619c5f4d73..63b14367f0 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -117,6 +117,8 @@ extern void LogicalIncreaseXminForSlot(XLogRecPtr lsn, TransactionId xmin);
extern void LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn,
XLogRecPtr restart_lsn);
extern void LogicalConfirmReceivedLocation(XLogRecPtr lsn);
+extern bool LogicalLockTransaction(ReorderBufferTXN *txn);
+extern void LogicalUnlockTransaction(ReorderBufferTXN *txn);
extern bool filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id);
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 177ef98e43..385bb486bb 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -141,6 +141,11 @@ typedef struct ReorderBufferChange
#define RBTXN_HAS_CATALOG_CHANGES 0x0001
#define RBTXN_IS_SUBXACT 0x0002
#define RBTXN_IS_SERIALIZED 0x0004
+#define RBTXN_PREPARE 0x0008
+#define RBTXN_COMMIT_PREPARED 0x0010
+#define RBTXN_ROLLBACK_PREPARED 0x0020
+#define RBTXN_COMMIT 0x0040
+#define RBTXN_ROLLBACK 0x0080
/* does the txn have catalog changes */
#define rbtxn_has_catalog_changes(txn) (txn->txn_flags & RBTXN_HAS_CATALOG_CHANGES)
@@ -154,6 +159,16 @@ typedef struct ReorderBufferChange
* nentries_mem == nentries.
*/
#define rbtxn_is_serialized(txn) (txn->txn_flags & RBTXN_IS_SERIALIZED)
+/* is this txn prepared? */
+#define rbtxn_prepared(txn) (txn->txn_flags & RBTXN_PREPARE)
+/* was this prepared txn committed in the meanwhile? */
+#define rbtxn_commit_prepared(txn) (txn->txn_flags & RBTXN_COMMIT_PREPARED)
+/* was this prepared txn aborted in the meanwhile? */
+#define rbtxn_rollback_prepared(txn) (txn->txn_flags & RBTXN_ROLLBACK_PREPARED)
+/* was this txn committed in the meanwhile? */
+#define rbtxn_commit(txn) (txn->txn_flags & RBTXN_COMMIT)
+/* was this txn aborted in the meanwhile? */
+#define rbtxn_rollback(txn) (txn->txn_flags & RBTXN_ROLLBACK)
typedef struct ReorderBufferTXN
{
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 5c19a61dcf..ae842b64d0 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -200,6 +200,26 @@ struct PGPROC
PGPROC *lockGroupLeader; /* lock group leader, if I'm a member */
dlist_head lockGroupMembers; /* list of members, if I'm a leader */
dlist_node lockGroupLink; /* my member link, if I'm a member */
+
+ /*
+ * Support for decoding groups. Use LockHashPartitionLockByProc on the group
+ * leader to get the LWLock protecting these fields.
+ *
+ * For prepared and uncommitted transactions, decoding backends working on
+ * the same XID will link themselves up to the corresponding PGPROC
+ * entry (decodeGroupLeader).
+ *
+ * They will remove themselves when they are done decoding.
+ *
+ * If the prepared or uncommitted transaction decides to abort, then
+ * the decodeGroupLeader will set the decodeAbortPending flag, allowing
+ * the decodeGroupMembers to abort their decoding appropriately.
+ */
+ PGPROC *decodeGroupLeader; /* decode group leader, if I'm a member */
+ dlist_head decodeGroupMembers; /* list of members, if I'm a leader */
+ dlist_node decodeGroupLink; /* my member link, if I'm a member */
+ bool decodeLocked; /* is it currently locked by this proc? */
+ bool decodeAbortPending; /* is the decode group leader aborting? */
};
/* NOTE: "typedef struct PGPROC PGPROC" appears in storage/lock.h. */
@@ -327,4 +347,10 @@ extern PGPROC *AuxiliaryPidGetProc(int pid);
extern void BecomeLockGroupLeader(void);
extern bool BecomeLockGroupMember(PGPROC *leader, int pid);
+extern PGPROC *AssignDecodeGroupLeader(TransactionId xid);
+extern bool BecomeDecodeGroupMember(PGPROC *leader, TransactionId xid);
+extern void RemoveDecodeGroupMember(PGPROC *leader);
+extern void RemoveDecodeGroupMemberLocked(PGPROC *leader);
+extern void LogicalDecodeRemoveTransaction(PGPROC *leader, bool isCommit);
+
#endif /* PROC_H */
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 75bab2985f..776de2470e 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -97,6 +97,7 @@ extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids
extern PGPROC *BackendPidGetProc(int pid);
extern PGPROC *BackendPidGetProcWithLock(int pid);
+extern PGPROC *BackendXidGetProc(TransactionId xid);
extern int BackendXidGetPid(TransactionId xid);
extern bool IsBackendPid(int pid);
--
2.15.1 (Apple Git-101)
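The abort-time wait in LogicalDecodeRemoveTransaction above boils down to one decision: on commit nobody waits, and on abort the leader waits only while some member other than itself still has decodeLocked set. A minimal standalone sketch of that decision follows. MockProc and must_wait_for_members are hypothetical names used for illustration only; the patch itself walks a dlist of PGPROC entries under the leader's LWLock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the decode-group fields of PGPROC. */
typedef struct MockProc
{
	bool		decode_locked;	/* mirrors PGPROC.decodeLocked */
} MockProc;

/*
 * Decide whether the leader has to keep waiting for its group members.
 *
 * members[leader_idx] is the leader itself; it is part of its own group
 * and is therefore skipped. Commits never wait, aborts wait while any
 * other member still holds the decode lock.
 */
bool
must_wait_for_members(const MockProc *members, size_t nmembers,
					  size_t leader_idx, bool is_commit)
{
	size_t		i;

	assert(leader_idx < nmembers);

	if (is_commit)
		return false;

	for (i = 0; i < nmembers; i++)
	{
		if (i == leader_idx)
			continue;			/* ignore the leader (i.e. ourselves) */
		if (members[i].decode_locked)
			return true;		/* someone is still decoding under lock */
	}
	return false;
}
```

In the patch this check is repeated after each latch wait (the `goto recheck` loop) until no member is locked; the sketch covers a single pass of that loop.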
Attachment: 0003-Support-decoding-of-two-phase-transactions-at-PREPAR.0504.v2.0.patch (application/octet-stream)
From 20ece623faf93722bc18d4c8b71e81f376d5dab7 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Thu, 5 Apr 2018 19:43:01 +0530
Subject: [PATCH 3/5] Support decoding of two-phase transactions at PREPARE
Until now two-phase transactions were decoded at COMMIT, just like
regular transactions. During replay, two-phase transactions were
translated into regular transactions on the subscriber, and the GID
was not forwarded to it.
This patch allows PREPARE-time decoding of two-phase transactions (if
the output plugin supports this capability), in which case the
transactions are replayed at PREPARE and then committed later when
COMMIT PREPARED arrives.
On the subscriber, the transactions will be executed as two-phase
transactions, with the same GID. This is important for various
external transaction managers that often encode information into
the GID itself.
All catalog access while decoding such a 2PC transaction has to go
through the LogicalLockTransaction/LogicalUnlockTransaction APIs
at relevant locations. This includes the location where the output
plugin's change apply API is to be invoked. This protects any catalog
access inside the output plugin's change apply API from concurrent
rollback operations.
Includes documentation changes.
---
doc/src/sgml/logicaldecoding.sgml | 128 +++++++++++++-
src/backend/access/transam/twophase.c | 8 +
src/backend/replication/logical/decode.c | 147 +++++++++++++--
src/backend/replication/logical/logical.c | 202 +++++++++++++++++++++
src/backend/replication/logical/reorderbuffer.c | 226 +++++++++++++++++++++---
src/include/replication/logical.h | 11 +-
src/include/replication/output_plugin.h | 45 +++++
src/include/replication/reorderbuffer.h | 54 ++++++
8 files changed, 783 insertions(+), 38 deletions(-)
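The locking rule from the commit message (wrap catalog access and the change callback in LogicalLockTransaction/LogicalUnlockTransaction) can be sketched in isolation as follows. This is a mock, not the patch's code: MockTxn, mock_lock_transaction, mock_unlock_transaction and apply_change_guarded are hypothetical names, and the sketch assumes that a false return from the lock call means the transaction is being rolled back concurrently and decoding should stop.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for ReorderBufferTXN plus decode-lock state. */
typedef struct MockTxn
{
	bool		abort_pending;	/* a concurrent ROLLBACK PREPARED arrived */
	int			lock_depth;		/* 1 while the decode lock is held */
} MockTxn;

/* Mock of LogicalLockTransaction: refuse once an abort is pending. */
bool
mock_lock_transaction(MockTxn *txn)
{
	if (txn->abort_pending)
		return false;
	txn->lock_depth++;
	return true;
}

/* Mock of LogicalUnlockTransaction. */
void
mock_unlock_transaction(MockTxn *txn)
{
	assert(txn->lock_depth > 0);
	txn->lock_depth--;
}

/*
 * Apply one change under the decode lock. Returns false if decoding of
 * this transaction should stop because it is being rolled back.
 */
bool
apply_change_guarded(MockTxn *txn, int *changes_applied)
{
	if (!mock_lock_transaction(txn))
		return false;

	/* catalog access and the plugin's change callback would go here */
	(*changes_applied)++;

	mock_unlock_transaction(txn);
	return true;
}
```

The point of the pattern is that the lock is held only around each catalog-touching step, so a rollback waits for at most one in-flight change rather than for the whole transaction to finish decoding.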
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index f6b14dccb0..b11752789d 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -384,7 +384,12 @@ typedef struct OutputPluginCallbacks
LogicalDecodeBeginCB begin_cb;
LogicalDecodeChangeCB change_cb;
LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeAbortCB abort_cb;
LogicalDecodeMessageCB message_cb;
+ LogicalDecodeFilterPrepareCB filter_prepare_cb;
+ LogicalDecodePrepareCB prepare_cb;
+ LogicalDecodeCommitPreparedCB commit_prepared_cb;
+ LogicalDecodeAbortPreparedCB abort_prepared_cb;
LogicalDecodeFilterByOriginCB filter_by_origin_cb;
LogicalDecodeShutdownCB shutdown_cb;
} OutputPluginCallbacks;
@@ -454,7 +459,13 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
never get
decoded. Successful savepoints are
folded into the transaction containing them in the order they were
- executed within that transaction.
+ executed within that transaction. A transaction that is prepared for
+ a two-phase commit using <command>PREPARE TRANSACTION</command> will
+ also be decoded if the output plugin callbacks needed for decoding
+ it are provided. The transaction being decoded may be aborted
+ concurrently via a <command>ROLLBACK PREPARED</command> command. In
+ that case, the logical decoding of this transaction will be aborted
+ too.
</para>
<note>
@@ -555,6 +566,74 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
</para>
</sect3>
+ <sect3 id="logicaldecoding-output-plugin-prepare">
+ <title>Transaction Prepare Callback</title>
+
+ <para>
+ The optional <function>prepare_cb</function> callback is called whenever
+ a transaction which is prepared for two-phase commit has been
+ decoded. The <function>change_cb</function> callbacks for all modified
+ rows will have been called before this, if there have been any modified
+ rows.
+<programlisting>
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+ <title>Commit Prepared Transaction Callback</title>
+
+ <para>
+ The optional <function>commit_prepared_cb</function> callback is called whenever
+ a <command>COMMIT PREPARED</command> has been decoded. The <parameter>gid</parameter> field,
+ which is part of the <parameter>txn</parameter> parameter, can be used in this
+ callback.
+<programlisting>
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-abort-prepared">
+ <title>Rollback Prepared Transaction Callback</title>
+
+ <para>
+ The optional <function>abort_prepared_cb</function> callback is called whenever
+ a <command>ROLLBACK PREPARED</command> has been decoded. The <parameter>gid</parameter> field,
+ which is part of the <parameter>txn</parameter> parameter, can be used in this
+ callback.
+<programlisting>
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-abort">
+ <title>Transaction Abort Callback</title>
+
+ <para>
+ The required <function>abort_cb</function> callback is called whenever
+ a transaction abort has to be initiated. This can happen if we are
+ decoding a transaction that has been prepared for two-phase commit and
+ a concurrent rollback happens while we are decoding it. In such cases
+ it might make sense to check, before decoding commences, whether the
+ rollback has already happened, so that decoding of such transactions
+ can be avoided completely.
+<programlisting>
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
<sect3 id="logicaldecoding-output-plugin-change">
<title>Change Callback</title>
@@ -564,7 +643,12 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
an <command>INSERT</command>, <command>UPDATE</command>,
or <command>DELETE</command>. Even if the original command modified
several rows at once the callback will be called individually for each
- row.
+ row. The <function>change_cb</function> callback may access system or
+ user catalog tables to aid in the process of outputting the row
+ modification details. When decoding a prepared (but not yet
+ committed) transaction or an uncommitted transaction, this change
+ callback is guaranteed safe access to catalog tables even if another
+ backend concurrently rolls back this very same transaction.
<programlisting>
typedef void (*LogicalDecodeChangeCB) (struct LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
@@ -619,6 +703,39 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
</para>
</sect3>
+ <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+ <title>Prepare Filter Callback</title>
+
+ <para>
+ The optional <function>filter_prepare_cb</function> callback
+ is called to determine whether data that is part of the current
+ two-phase commit transaction should be considered for decoding
+ at this prepare stage or as a regular one-phase transaction at
+ <command>COMMIT PREPARED</command> time later. To signal that
+ decoding should be skipped, return <literal>true</literal>;
+ <literal>false</literal> otherwise. When the callback is not
+ defined, <literal>false</literal> is assumed (i.e. nothing is
+ filtered).
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+</programlisting>
+ The <parameter>ctx</parameter> parameter has the same contents
+ as for the other callbacks. The <parameter>txn</parameter> parameter
+ contains meta information about the transaction. The <parameter>xid</parameter>
+ contains the XID because <parameter>txn</parameter> can be NULL in some cases.
+ The <parameter>gid</parameter> is the identifier that later identifies this
+ transaction for <command>COMMIT PREPARED</command> or <command>ROLLBACK PREPARED</command>.
+ </para>
+ <para>
+ The callback has to provide the same static answer for a given combination of
+ <parameter>xid</parameter> and <parameter>gid</parameter> every time it is
+ called.
+ </para>
+ </sect3>
+
<sect3 id="logicaldecoding-output-plugin-message">
<title>Generic Message Callback</title>
@@ -640,7 +757,12 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
non-transactional and the XID was not assigned yet in the transaction
which logged the message. The <parameter>lsn</parameter> has WAL
location of the message. The <parameter>transactional</parameter> says
- if the message was sent as transactional or not.
+ if the message was sent as transactional or not. Similar to the change
+ callback, when decoding a prepared (but not yet committed)
+ transaction or an uncommitted transaction, this message
+ callback is guaranteed safe access to catalog tables even if another
+ backend concurrently rolls back this very same transaction.
+
The <parameter>prefix</parameter> is arbitrary null-terminated prefix
which can be used for identifying interesting messages for the current
plugin. And finally the <parameter>message</parameter> parameter holds
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index d6e4b7980f..30ebe5e72d 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1507,6 +1507,14 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
ProcArrayRemove(proc, latestXid);
+ /*
+ * Coordinate with logical decoding backends that may be already
+ * decoding this prepared transaction. When aborting a transaction,
+ * we need to wait for all of them to leave the decoding group. If
+ * committing, we simply remove all members from the group.
+ */
+ LogicalDecodeRemoveTransaction(proc, isCommit);
+
/*
* In case we fail while running the callbacks, mark the gxact invalid so
* no one else will try to commit/rollback, and so it will be recycled if
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 6eb0d5527e..51d544d0f5 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -34,6 +34,7 @@
#include "access/xlogutils.h"
#include "access/xlogreader.h"
#include "access/xlogrecord.h"
+#include "access/twophase.h"
#include "catalog/pg_control.h"
@@ -72,6 +73,8 @@ static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_commit *parsed, TransactionId xid);
static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare * parsed);
/* common function to decode tuples */
static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -280,16 +283,33 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
break;
}
case XLOG_XACT_PREPARE:
+ {
+ xl_xact_parsed_prepare parsed;
- /*
- * Currently decoding ignores PREPARE TRANSACTION and will just
- * decode the transaction when the COMMIT PREPARED is sent or
- * throw away the transaction's contents when a ROLLBACK PREPARED
- * is received. In the future we could add code to expose prepared
- * transactions in the changestream allowing for a kind of
- * distributed 2PC.
- */
- ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+ /* check that output plugin is capable of twophase decoding */
+ if (!ctx->enable_twophase)
+ {
+ ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+ break;
+ }
+
+ /* ok, parse it */
+ ParsePrepareRecord(XLogRecGetInfo(buf->record),
+ XLogRecGetData(buf->record), &parsed);
+
+ /* does output plugin want this particular transaction? */
+ if (ctx->callbacks.filter_prepare_cb &&
+ ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+ parsed.twophase_gid))
+ {
+ ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+ buf->origptr);
+ break;
+ }
+
+ DecodePrepare(ctx, buf, &parsed);
+ break;
+ }
break;
default:
elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
@@ -627,9 +647,90 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
buf->origptr, buf->endptr);
}
+ /*
+ * Decide if we're processing COMMIT PREPARED, or a regular COMMIT.
+ * Regular commit simply triggers a replay of transaction changes from the
+ * reorder buffer. For COMMIT PREPARED that however already happened at
+ * PREPARE time, and so we only need to notify the subscriber that the GID
+ * finally committed.
+ *
+ * For output plugins that do not support PREPARE-time decoding of
+ * two-phase transactions, we never even see the PREPARE and all two-phase
+ * transactions simply fall through to the second branch.
+ */
+ if (TransactionIdIsValid(parsed->twophase_xid) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder,
+ parsed->twophase_xid, parsed->twophase_gid))
+ {
+ Assert(xid == parsed->twophase_xid);
+ /* we are processing COMMIT PREPARED */
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid, true);
+ }
+ else
+ {
+ /* replay actions of all transaction + subtransactions in order */
+ ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn);
+ }
+}
+
+/*
+ * Decode PREPARE record. Similar logic to DecodeCommit.
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare * parsed)
+{
+ XLogRecPtr origin_lsn = parsed->origin_lsn;
+ TimestampTz commit_time = parsed->origin_timestamp;
+ RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+ int i;
+ TransactionId xid = parsed->twophase_xid;
+
+ /*
+ * Process invalidation messages, even if we're not interested in the
+ * transaction's contents, since the various caches need to always be
+ * consistent.
+ */
+ if (parsed->nmsgs > 0)
+ {
+ if (!ctx->fast_forward)
+ ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
+ parsed->nmsgs, parsed->msgs);
+ ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+ }
+
+ /*
+ * Tell the reorderbuffer about the surviving subtransactions. We need to
+ * do this because the main transaction itself has not committed since we
+ * are in the prepare phase right now. So we need to be sure the snapshot
+ * is set up correctly for the main transaction in case all changes
+ * happened in subtransactions.
+ */
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+ buf->origptr, buf->endptr);
+ }
+
+ if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+ (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+ ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+ {
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr);
+ }
+ ReorderBufferForget(ctx->reorder, xid, buf->origptr);
+
+ return;
+ }
+
/* replay actions of all transaction + subtransactions in order */
- ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
- commit_time, origin_id, origin_lsn);
+ ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn, parsed->twophase_gid);
}
/*
@@ -641,6 +742,30 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid)
{
int i;
+ XLogRecPtr origin_lsn = InvalidXLogRecPtr;
+ TimestampTz commit_time = 0;
+ RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ origin_lsn = parsed->origin_lsn;
+ commit_time = parsed->origin_timestamp;
+ }
+
+ /*
+ * If it's ROLLBACK PREPARED then handle it via callbacks.
+ */
+ if (TransactionIdIsValid(xid) &&
+ !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+ parsed->dbId == ctx->slot->data.database &&
+ !FilterByOrigin(ctx, origin_id) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
+ {
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid, false);
+ return;
+ }
for (i = 0; i < parsed->nsubxacts; i++)
{
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 9bb382bb97..08052f6846 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -60,6 +60,16 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+static void abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change);
static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -125,6 +135,7 @@ StartupDecodingContext(List *output_plugin_options,
MemoryContext context,
old_context;
LogicalDecodingContext *ctx;
+ int twophase_callbacks;
/* shorter lines... */
slot = MyReplicationSlot;
@@ -184,8 +195,38 @@ StartupDecodingContext(List *output_plugin_options,
ctx->reorder->begin = begin_cb_wrapper;
ctx->reorder->apply_change = change_cb_wrapper;
ctx->reorder->commit = commit_cb_wrapper;
+ ctx->reorder->abort = abort_cb_wrapper;
+ ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+ ctx->reorder->prepare = prepare_cb_wrapper;
+ ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+ ctx->reorder->abort_prepared = abort_prepared_cb_wrapper;
ctx->reorder->message = message_cb_wrapper;
+ /*
+ * Check that plugin implements all callbacks necessary to decode
+ * two-phase transactions - we either have to have all of them or none.
+ * The filter_prepare callback is optional, but can only be defined when
+ * two-phase decoding is enabled (i.e. the three other callbacks are
+ * defined).
+ */
+ twophase_callbacks = (ctx->callbacks.prepare_cb != NULL) +
+ (ctx->callbacks.commit_prepared_cb != NULL) +
+ (ctx->callbacks.abort_prepared_cb != NULL);
+
+ ctx->enable_twophase = (twophase_callbacks == 3);
+
+ /* Plugins with incorrect number of two-phase callbacks are broken. */
+ if ((twophase_callbacks != 3) && (twophase_callbacks != 0))
+ ereport(ERROR,
+ (errmsg("output plugin registered only %d two-phase callbacks",
+ twophase_callbacks)));
+
+ /* filter_prepare is optional, but requires two-phase decoding */
+ if ((ctx->callbacks.filter_prepare_cb != NULL) && (!ctx->enable_twophase))
+ ereport(ERROR,
+ (errmsg("output plugin does not support two-phase decoding, but "
+ "registered filter_prepare callback")));
+
ctx->out = makeStringInfo();
ctx->prepare_write = prepare_write;
ctx->write = do_write;
@@ -697,6 +738,122 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
error_context_stack = errcallback.previous;
}
+static void
+abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "abort";
+ state.report_location = txn->final_lsn; /* beginning of abort record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.abort_cb(ctx, txn, abort_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "prepare";
+ state.report_location = txn->final_lsn; /* beginning of prepare record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "commit_prepared";
+ state.report_location = txn->final_lsn; /* beginning of commit record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "abort_prepared";
+ state.report_location = txn->final_lsn; /* beginning of abort record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.abort_prepared_cb(ctx, txn, abort_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
static void
change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change)
@@ -734,6 +891,51 @@ change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
error_context_stack = errcallback.previous;
}
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+ bool ret;
+
+ /*
+ * Skip if decoding of twophase at PREPARE time is not enabled. In that
+ * case all twophase transactions are considered filtered out and will be
+ * applied as regular transactions at COMMIT PREPARED.
+ */
+ if (!ctx->enable_twophase)
+ return true;
+
+ /*
+ * The filter_prepare callback is optional. When not supplied, all
+ * prepared transactions should go through.
+ */
+ if (!ctx->callbacks.filter_prepare_cb)
+ return false;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "filter_prepare";
+ state.report_location = InvalidXLogRecPtr;
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = false;
+
+ /* do the actual work: call callback */
+ ret = ctx->callbacks.filter_prepare_cb(ctx, txn, xid, gid);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+
+ return ret;
+}
+
bool
filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
{
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 3c9af58640..178a99d158 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -337,6 +337,11 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
}
/* free data that's contained */
+ if (txn->gid != NULL)
+ {
+ pfree(txn->gid);
+ txn->gid = NULL;
+ }
if (txn->tuplecid_hash != NULL)
{
@@ -1277,25 +1282,18 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
* the top and subtransactions (using a k-way merge) and replay the changes in
* lsn order.
*/
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
- XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
- TimestampTz commit_time,
- RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+ ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn)
{
- ReorderBufferTXN *txn;
volatile Snapshot snapshot_now;
volatile CommandId command_id = FirstCommandId;
bool using_subtxn;
ReorderBufferIterTXNState *volatile iterstate = NULL;
- txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
- false);
-
- /* unknown transaction, nothing to replay */
- if (txn == NULL)
- return;
-
txn->final_lsn = commit_lsn;
txn->end_lsn = end_lsn;
txn->commit_time = commit_time;
@@ -1372,8 +1370,12 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
case REORDER_BUFFER_CHANGE_DELETE:
Assert(snapshot_now);
+ /* Lock transaction before catalog access */
+ if (!LogicalLockTransaction(txn))
+ goto change_cleanup;
reloid = RelidByRelfilenode(change->data.tp.relnode.spcNode,
change->data.tp.relnode.relNode);
+ LogicalUnlockTransaction(txn);
/*
* Catalog tuple without data, emitted while catalog was
@@ -1388,8 +1390,14 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
relpathperm(change->data.tp.relnode,
MAIN_FORKNUM));
+ /* Lock transaction before catalog access */
+ if (!LogicalLockTransaction(txn))
+ goto change_cleanup;
+
relation = RelationIdGetRelation(reloid);
+ LogicalUnlockTransaction(txn);
+
if (relation == NULL)
elog(ERROR, "could not open relation with OID %u (for filenode \"%s\")",
reloid,
@@ -1418,8 +1426,23 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
/* user-triggered change */
if (!IsToastRelation(relation))
{
+ /*
+ * Output plugins can access catalog metadata and we
+ * do not have any control over that. We could ask
+ * them to call
+ * LogicalLockTransaction/LogicalUnlockTransaction
+ * APIs themselves, but that leads to unnecessary
+ * complications and expectations from plugin
+ * writers. We avoid this by calling these APIs
+ * here, thereby ensuring that the in-progress
+ * transaction will be around for the duration of
+ * the apply_change call below
+ */
+ if (!LogicalLockTransaction(txn))
+ goto change_cleanup;
ReorderBufferToastReplace(rb, txn, relation, change);
rb->apply_change(rb, txn, relation, change);
+ LogicalUnlockTransaction(txn);
/*
* Only clear reassembled toast chunks if we're sure
@@ -1492,10 +1515,13 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
break;
case REORDER_BUFFER_CHANGE_MESSAGE:
+ if (!LogicalLockTransaction(txn))
+ goto change_cleanup;
rb->message(rb, txn, change->lsn, true,
change->data.msg.prefix,
change->data.msg.message_size,
change->data.msg.message);
+ LogicalUnlockTransaction(txn);
break;
case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1565,7 +1591,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
break;
}
}
-
+change_cleanup:
/*
* There's a speculative insertion remaining, just clean in up, it
* can't have been successful, otherwise we'd gotten a confirmation
@@ -1581,8 +1607,26 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
ReorderBufferIterTXNFinish(rb, iterstate);
iterstate = NULL;
- /* call commit callback */
- rb->commit(rb, txn, commit_lsn);
+ /*
+ * Call abort/commit/prepare callback, depending on the transaction
+ * state.
+ *
+ * If the transaction aborted during apply (which currently can happen
+ * only for prepared transactions), simply call the abort callback.
+ *
+ * Otherwise call either PREPARE (for twophase transactions) or COMMIT
+ * (for regular ones).
+ */
+ if (rbtxn_rollback(txn))
+ rb->abort(rb, txn, commit_lsn);
+ else if (rbtxn_prepared(txn))
+ rb->prepare(rb, txn, commit_lsn);
+ else
+ rb->commit(rb, txn, commit_lsn);
+
+ /* remove ourselves from the decodeGroupLeader's group */
+ if (MyProc->decodeGroupLeader)
+ RemoveDecodeGroupMember(MyProc->decodeGroupLeader);
/* this is just a sanity check against bad output plugin behaviour */
if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -1609,7 +1653,12 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
if (snapshot_now->copied)
ReorderBufferFreeSnap(rb, snapshot_now);
- /* remove potential on-disk data, and deallocate */
+ /*
+ * remove potential on-disk data, and deallocate.
+ *
+ * We remove it even for prepared transactions (GID is enough to
+ * commit/abort those later).
+ */
ReorderBufferCleanupTXN(rb, txn);
}
PG_CATCH();
@@ -1643,6 +1692,141 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
PG_END_TRY();
}
+
+/*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+ return rb->filter_prepare(rb, txn, xid, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferCommitInternal()
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /* unknown transaction, nothing to replay */
+ if (txn == NULL)
+ return;
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a twophase transaction. It calls ReorderBufferCommitInternal()
+ * since all prepared transactions need to be decoded at PREPARE time.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /* unknown transaction, nothing to replay */
+ if (txn == NULL)
+ return;
+
+ txn->txn_flags |= RBTXN_PREPARE;
+ txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+ strcpy(txn->gid, gid);
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Check whether this transaction was sent as prepared to subscribers.
+ * Called while handling commit|abort prepared.
+ */
+bool
+ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+ const char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /*
+ * Always call the prepare filter. It's the job of the prepare filter to
+ * give us the *same* response for a given xid across multiple calls
+ * (including ones on restart)
+ */
+ return !(rb->filter_prepare(rb, txn, xid, gid));
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ABORT PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit)
+{
+ ReorderBufferTXN *txn;
+
+ /*
+ * The transaction may or may not exist (after a restart, for example).
+ * Either way, a 2PC transaction has no buffered changes at this point,
+ * so allow it to be created below.
+ */
+ txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+ true);
+
+ txn->final_lsn = commit_lsn;
+ txn->end_lsn = end_lsn;
+ txn->commit_time = commit_time;
+ txn->origin_id = origin_id;
+ txn->origin_lsn = origin_lsn;
+ /* this txn is obviously prepared */
+ txn->txn_flags |= RBTXN_PREPARE;
+ txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+ strcpy(txn->gid, gid);
+
+ if (is_commit)
+ {
+ txn->txn_flags |= RBTXN_COMMIT_PREPARED;
+ rb->commit_prepared(rb, txn, commit_lsn);
+ }
+ else
+ {
+ txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
+ rb->abort_prepared(rb, txn, commit_lsn);
+ }
+
+ /* cleanup: make sure there's no cache pollution */
+ ReorderBufferExecuteInvalidations(rb, txn);
+ ReorderBufferCleanupTXN(rb, txn);
+}
+
/*
* Abort a transaction that possibly has previous changes. Needs to be first
* called for subtransactions and then for the toplevel xid.
@@ -1711,7 +1895,7 @@ ReorderBufferAbortOld(ReorderBuffer *rb, TransactionId oldestRunningXid)
if (rbtxn_is_serialized(txn) && txn->final_lsn == 0)
{
ReorderBufferChange *last =
- dlist_tail_element(ReorderBufferChange, node, &txn->changes);
+ dlist_tail_element(ReorderBufferChange, node, &txn->changes);
txn->final_lsn = last->lsn;
}
@@ -2625,9 +2809,9 @@ ReorderBufferSerializedPath(char *path, ReplicationSlot *slot, TransactionId xid
XLogSegNoOffsetToRecPtr(segno, 0, recptr, wal_segment_size);
snprintf(path, MAXPGPATH, "pg_replslot/%s/xid-%u-lsn-%X-%X.snap",
- NameStr(MyReplicationSlot->data.name),
- xid,
- (uint32) (recptr >> 32), (uint32) recptr);
+ NameStr(MyReplicationSlot->data.name),
+ xid,
+ (uint32) (recptr >> 32), (uint32) recptr);
}
/*
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 63b14367f0..fbe18dff56 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -46,11 +46,11 @@ typedef struct LogicalDecodingContext
struct SnapBuild *snapshot_builder;
/*
- * Marks the logical decoding context as fast forward decoding one.
- * Such a context does not have plugin loaded so most of the the following
+ * Marks the logical decoding context as fast forward decoding one. Such a
+ * context does not have a plugin loaded, so most of the following
* properties are unused.
*/
- bool fast_forward;
+ bool fast_forward;
OutputPluginCallbacks callbacks;
OutputPluginOptions options;
@@ -89,6 +89,11 @@ typedef struct LogicalDecodingContext
bool prepared_write;
XLogRecPtr write_location;
TransactionId write_xid;
+
+ /*
+ * Capabilities of the output plugin.
+ */
+ bool enable_twophase;
} LogicalDecodingContext;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 82875d6b3d..5254210a46 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -68,6 +68,46 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+/*
+ * Called for an implicit ABORT of a transaction.
+ */
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+/*
+ * Called before decoding of a PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to the prepare and
+ * commit_prepared/abort_prepared callbacks, or should wait until COMMIT
+ * PREPARED and be sent as a regular transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
/*
* Called for the generic logical decoding messages.
*/
@@ -99,7 +139,12 @@ typedef struct OutputPluginCallbacks
LogicalDecodeBeginCB begin_cb;
LogicalDecodeChangeCB change_cb;
LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeAbortCB abort_cb;
LogicalDecodeMessageCB message_cb;
+ LogicalDecodeFilterPrepareCB filter_prepare_cb;
+ LogicalDecodePrepareCB prepare_cb;
+ LogicalDecodeCommitPreparedCB commit_prepared_cb;
+ LogicalDecodeAbortPreparedCB abort_prepared_cb;
LogicalDecodeFilterByOriginCB filter_by_origin_cb;
LogicalDecodeShutdownCB shutdown_cb;
} OutputPluginCallbacks;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 385bb486bb..d890e6628c 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
#define REORDERBUFFER_H
#include "access/htup_details.h"
+#include "access/twophase.h"
#include "lib/ilist.h"
#include "storage/sinval.h"
#include "utils/hsearch.h"
@@ -179,6 +180,9 @@ typedef struct ReorderBufferTXN
*/
TransactionId xid;
+ /* In case of 2PC we need to pass GID to output plugin */
+ char *gid;
+
/*
* LSN of the first data carrying, WAL record with knowledge about this
* xid. This is allowed to *not* be first record adorned with this xid, if
@@ -312,6 +316,37 @@ typedef void (*ReorderBufferCommitCB) (
ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+/* abort callback signature */
+typedef void (*ReorderBufferAbortCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+typedef bool (*ReorderBufferFilterPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/* abort prepared callback signature */
+typedef void (*ReorderBufferAbortPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+
/* message callback signature */
typedef void (*ReorderBufferMessageCB) (
ReorderBuffer *rb,
@@ -347,6 +382,11 @@ struct ReorderBuffer
ReorderBufferBeginCB begin;
ReorderBufferApplyChangeCB apply_change;
ReorderBufferCommitCB commit;
+ ReorderBufferAbortCB abort;
+ ReorderBufferFilterPrepareCB filter_prepare;
+ ReorderBufferPrepareCB prepare;
+ ReorderBufferCommitPreparedCB commit_prepared;
+ ReorderBufferAbortPreparedCB abort_prepared;
ReorderBufferMessageCB message;
/*
@@ -394,6 +434,11 @@ void ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot
void ReorderBufferCommit(ReorderBuffer *, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit);
void ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
void ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
@@ -417,6 +462,15 @@ void ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
bool ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
bool ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
+bool ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+ const char *gid);
+bool ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+ const char *gid);
+void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid);
ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
void ReorderBufferSetRestartPoint(ReorderBuffer *, XLogRecPtr ptr);
--
2.15.1 (Apple Git-101)
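The decision logic in filter_prepare_cb_wrapper(), and the way it shows up in the test_decoding output, can be illustrated with a small standalone model. This is only a sketch, not PostgreSQL code: ToyContext, ToyRecord, toy_prepare_need_skip(), toy_dispatch() and toy_skip_nodecode() are hypothetical names invented for illustration, and the mapping from records to callbacks is an assumption read off the expected regression output rather than taken from decode.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins; these are NOT the PostgreSQL structs. */
typedef bool (*ToyFilterPrepare) (const char *gid);

typedef struct ToyContext
{
	bool		enable_twophase;	/* models ctx->enable_twophase */
	ToyFilterPrepare filter_prepare;	/* optional callback, may be NULL */
} ToyContext;

typedef enum ToyRecord
{
	TOY_PREPARE,
	TOY_COMMIT_PREPARED,
	TOY_ROLLBACK_PREPARED
} ToyRecord;

/*
 * Returns true when the PREPARE should be skipped. The order of the checks
 * follows filter_prepare_cb_wrapper(): two-phase decoding disabled means
 * "always skip"; a missing filter callback means "never skip".
 */
static bool
toy_prepare_need_skip(const ToyContext *ctx, const char *gid)
{
	assert(ctx != NULL && gid != NULL);

	if (!ctx->enable_twophase)
		return true;
	if (ctx->filter_prepare == NULL)
		return false;
	return ctx->filter_prepare(gid);
}

/*
 * Name of the plugin callback that would close the transaction for a given
 * WAL record, or "none" when nothing is emitted for that record. The filter
 * is consulted again for COMMIT/ROLLBACK PREPARED, so it has to give the
 * same answer for the same GID every time.
 */
static const char *
toy_dispatch(const ToyContext *ctx, ToyRecord rec, const char *gid)
{
	bool		skip = toy_prepare_need_skip(ctx, gid);

	switch (rec)
	{
		case TOY_PREPARE:
			return skip ? "none" : "prepare";
		case TOY_COMMIT_PREPARED:
			return skip ? "commit" : "commit_prepared";
		case TOY_ROLLBACK_PREPARED:
			return skip ? "none" : "abort_prepared";
	}
	return "none";
}

/* Hypothetical filter: skip any GID that contains "nodecode". */
static bool
toy_skip_nodecode(const char *gid)
{
	return strstr(gid, "nodecode") != NULL;
}
```

With enable_twophase off, toy_dispatch() yields "none" for PREPARE and "commit" for COMMIT PREPARED, which matches what regression_slot shows in the expected output; with it on and no filter, the same records map to "prepare" and "commit_prepared", as regression_slot_2pc shows.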
Attachment: 0004-Teach-test_decoding-plugin-to-work-with-2PC.0504.v2.0.patch (application/octet-stream)
From 7d3851db059018c9ebff1a745bfdd752ac36ead7 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Thu, 5 Apr 2018 19:43:41 +0530
Subject: [PATCH 4/5] Teach test_decoding plugin to work with 2PC
Includes a new option "enable-twophase". Depending on this option's
value, PREPARE TRANSACTION is either decoded at prepare time or
treated as a single-phase commit later.
---
contrib/test_decoding/expected/prepared.out | 257 +++++++++++++++++++++++++---
contrib/test_decoding/sql/prepared.sql | 84 ++++++++-
contrib/test_decoding/test_decoding.c | 137 +++++++++++++++
3 files changed, 451 insertions(+), 27 deletions(-)
diff --git a/contrib/test_decoding/expected/prepared.out b/contrib/test_decoding/expected/prepared.out
index 46e915d4ff..5df7b7ff20 100644
--- a/contrib/test_decoding/expected/prepared.out
+++ b/contrib/test_decoding/expected/prepared.out
@@ -6,19 +6,82 @@ SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_d
init
(1 row)
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc', 'test_decoding');
+ ?column?
+----------
+ init
+(1 row)
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
-- test simple successful use of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (1);
PREPARE TRANSACTION 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+------
+(0 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ PREPARE TRANSACTION 'test_prepared#1'
+(3 rows)
+
COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ COMMIT
+(3 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
INSERT INTO test_prepared1 VALUES (2);
-- test abort of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (3);
PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+(3 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(6 rows)
+
ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+------
+(0 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
INSERT INTO test_prepared1 VALUES (4);
-- test prepared xact containing ddl
BEGIN;
@@ -26,45 +89,193 @@ INSERT INTO test_prepared1 VALUES (5);
ALTER TABLE test_prepared1 ADD COLUMN data text;
INSERT INTO test_prepared1 VALUES (6, 'frakbar');
PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
-INSERT INTO test_prepared2 VALUES (7);
-COMMIT PREPARED 'test_prepared#3';
--- make sure stuff still works
-INSERT INTO test_prepared1 VALUES (8);
-INSERT INTO test_prepared2 VALUES (9);
--- cleanup
-DROP TABLE test_prepared1;
-DROP TABLE test_prepared2;
--- show results
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'test_prepared1'::regclass;
+ relation | locktype | mode
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4
+ COMMIT
+(3 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
data
-------------------------------------------------------------------------
BEGIN
- table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:4
COMMIT
BEGIN
- table public.test_prepared1: INSERT: id[integer]:2
- COMMIT
+ table public.test_prepared1: INSERT: id[integer]:5
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(7 rows)
+
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------
BEGIN
- table public.test_prepared1: INSERT: id[integer]:4
+ table public.test_prepared2: INSERT: id[integer]:7
COMMIT
+(3 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+----------------------------------------------------
BEGIN
table public.test_prepared2: INSERT: id[integer]:7
COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+-------------------------------------------------------------------------
BEGIN
table public.test_prepared1: INSERT: id[integer]:5
table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
COMMIT
+(4 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#3'
+(1 row)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (8);
+INSERT INTO test_prepared2 VALUES (9);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+--------------------------------------------------------------------
BEGIN
table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
COMMIT
BEGIN
table public.test_prepared2: INSERT: id[integer]:9
COMMIT
-(22 rows)
+(6 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:9
+ COMMIT
+(6 rows)
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'pg_class'::regclass;
+ relation | locktype | mode
+----------+----------+------
+(0 rows)
+
+-- Shouldn't see anything with 2pc decoding off
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+-----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:12
+ COMMIT
+(3 rows)
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:12
+ PREPARE TRANSACTION 'test_prepared_lock2'
+ COMMIT PREPARED 'test_prepared_lock2'
+(8 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- Both will work normally after we commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ COMMIT
+(4 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+--------------------------------------
+ COMMIT PREPARED 'test_prepared_lock'
+(1 row)
+
+-- test savepoints
+BEGIN;
+SAVEPOINT test_savepoint;
+CREATE TABLE test_prepared_savepoint (a int);
+PREPARE TRANSACTION 'test_prepared_savepoint';
+COMMIT PREPARED 'test_prepared_savepoint';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+-------------------------------------------
+ COMMIT PREPARED 'test_prepared_savepoint'
+(1 row)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+------
+(0 rows)
+
+-- cleanup
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+------
+(0 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+------
+(0 rows)
SELECT pg_drop_replication_slot('regression_slot');
pg_drop_replication_slot
@@ -72,3 +283,9 @@ SELECT pg_drop_replication_slot('regression_slot');
(1 row)
+SELECT pg_drop_replication_slot('regression_slot_2pc');
+ pg_drop_replication_slot
+--------------------------
+
+(1 row)
+
diff --git a/contrib/test_decoding/sql/prepared.sql b/contrib/test_decoding/sql/prepared.sql
index e72639767e..e8eb8ad8d6 100644
--- a/contrib/test_decoding/sql/prepared.sql
+++ b/contrib/test_decoding/sql/prepared.sql
@@ -1,22 +1,31 @@
-- predictability
SET synchronous_commit = on;
SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc', 'test_decoding');
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
-- test simple successful use of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (1);
PREPARE TRANSACTION 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
INSERT INTO test_prepared1 VALUES (2);
-- test abort of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (3);
PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
INSERT INTO test_prepared1 VALUES (4);
@@ -27,24 +36,85 @@ ALTER TABLE test_prepared1 ADD COLUMN data text;
INSERT INTO test_prepared1 VALUES (6, 'frakbar');
PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'test_prepared1'::regclass;
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
-- make sure stuff still works
INSERT INTO test_prepared1 VALUES (8);
INSERT INTO test_prepared2 VALUES (9);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'pg_class'::regclass;
+
+-- Shouldn't see anything with 2pc decoding off
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+RESET statement_timeout;
+
+COMMIT PREPARED 'test_prepared_lock';
+
+-- Both will work normally after we commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+
+-- test savepoints
+BEGIN;
+SAVEPOINT test_savepoint;
+CREATE TABLE test_prepared_savepoint (a int);
+PREPARE TRANSACTION 'test_prepared_savepoint';
+COMMIT PREPARED 'test_prepared_savepoint';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
-- cleanup
DROP TABLE test_prepared1;
DROP TABLE test_prepared2;
--- show results
+-- show results. There should be nothing to show
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+
SELECT pg_drop_replication_slot('regression_slot');
+SELECT pg_drop_replication_slot('regression_slot_2pc');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index a94aeeae29..05b993fd7a 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -36,6 +36,7 @@ typedef struct
bool skip_empty_xacts;
bool xact_wrote_changes;
bool only_local;
+ bool enable_twophase;
} TestDecodingData;
static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -49,6 +50,8 @@ static void pg_output_begin(LogicalDecodingContext *ctx,
bool last_write);
static void pg_decode_commit_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pg_decode_abort_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr abort_lsn);
static void pg_decode_change(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, Relation rel,
ReorderBufferChange *change);
@@ -58,6 +61,18 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr message_lsn,
bool transactional, const char *prefix,
Size sz, const char *message);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
void
_PG_init(void)
@@ -75,9 +90,14 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
cb->begin_cb = pg_decode_begin_txn;
cb->change_cb = pg_decode_change;
cb->commit_cb = pg_decode_commit_txn;
+ cb->abort_cb = pg_decode_abort_txn;
cb->filter_by_origin_cb = pg_decode_filter;
cb->shutdown_cb = pg_decode_shutdown;
cb->message_cb = pg_decode_message;
+ cb->filter_prepare_cb = pg_decode_filter_prepare;
+ cb->prepare_cb = pg_decode_prepare_txn;
+ cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+ cb->abort_prepared_cb = pg_decode_abort_prepared_txn;
}
@@ -97,6 +117,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
data->include_timestamp = false;
data->skip_empty_xacts = false;
data->only_local = false;
+ data->enable_twophase = false;
ctx->output_plugin_private = data;
@@ -178,6 +199,16 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
errmsg("could not parse value \"%s\" for parameter \"%s\"",
strVal(elem->arg), elem->defname)));
}
+ else if (strcmp(elem->defname, "enable-twophase") == 0)
+ {
+ if (elem->arg == NULL)
+ data->enable_twophase = true;
+ else if (!parse_bool(strVal(elem->arg), &data->enable_twophase))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("could not parse value \"%s\" for parameter \"%s\"",
+ strVal(elem->arg), elem->defname)));
+ }
else
{
ereport(ERROR,
@@ -246,6 +277,112 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
OutputPluginWrite(ctx, true);
}
+/* ABORT callback */
+static void
+pg_decode_abort_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+ if (data->include_xids)
+ appendStringInfo(ctx->out, "ABORT %u", txn->xid);
+ else
+ appendStringInfoString(ctx->out, "ABORT");
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* Filter out two-phase transactions, if decoding not enabled. */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ /* treat all transactions as one-phase */
+ if (!data->enable_twophase)
+ return true;
+
+ return false;
+}
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* ABORT PREPARED callback */
+static void
+pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
static bool
pg_decode_filter(LogicalDecodingContext *ctx,
RepOriginId origin_id)
--
2.15.1 (Apple Git-101)
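The effect of the enable-twophase option in the patch above can be sketched outside the server. This is a minimal standalone illustration, not the plugin code: DemoOptions, DemoEvent, demo_filter_prepare and demo_end_line are hypothetical names, and the gid is wrapped in quotes without the escaping that quote_literal_cstr() performs. It assumes, as the regression output suggests, that a filtered transaction only surfaces as a plain COMMIT once COMMIT PREPARED is decoded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the plugin's enable-twophase option. */
typedef struct DemoOptions
{
	bool		enable_twophase;
} DemoOptions;

typedef enum DemoEvent
{
	DEMO_PREPARE,
	DEMO_COMMIT_PREPARED
} DemoEvent;

/* Same contract as the filter callback: true means "treat as one-phase". */
static bool
demo_filter_prepare(const DemoOptions *opts)
{
	return !opts->enable_twophase;
}

/*
 * Write the transaction-ending line a consumer would see for the event,
 * or an empty string if nothing is emitted at that point.
 * Simplification: the gid is quoted without any escaping.
 */
static void
demo_end_line(const DemoOptions *opts, DemoEvent ev, const char *gid,
			  char *buf, size_t buflen)
{
	assert(buflen > 0);
	buf[0] = '\0';

	if (demo_filter_prepare(opts))
	{
		/* One-phase view: output appears only when the commit is decoded. */
		if (ev == DEMO_COMMIT_PREPARED)
			snprintf(buf, buflen, "COMMIT");
		return;
	}

	if (ev == DEMO_PREPARE)
		snprintf(buf, buflen, "PREPARE TRANSACTION '%s'", gid);
	else
		snprintf(buf, buflen, "COMMIT PREPARED '%s'", gid);
}
```

With the option off, calling demo_end_line() for DEMO_PREPARE leaves the buffer empty, which matches the "Shouldn't see anything with 2pc decoding off" step in the regression test.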
Attachment: 0005-OPTIONAL-Additional-test-case-to-demonstrate-decoding-rollbac.0504.v2.0.patch (application/octet-stream)
From b82a30a2e4c1647ca5ba96124cdfc3d1e19e7b0c Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Thu, 5 Apr 2018 19:44:58 +0530
Subject: [PATCH 5/5] Additional test case to demonstrate decoding/rollback
interlocking
Introduce a decode-delay parameter in the test_decoding plugin. Based
on the value provided, the plugin sleeps for that many seconds while
holding the LogicalTransactionLock. A concurrent rollback is fired
off, which aborts that transaction in the meantime.
---
contrib/test_decoding/Makefile | 5 +-
contrib/test_decoding/t/001_twophase.pl | 101 ++++++++++++++++++++++++
contrib/test_decoding/test_decoding.c | 28 +++++++
src/backend/replication/logical/reorderbuffer.c | 5 ++
4 files changed, 138 insertions(+), 1 deletion(-)
create mode 100755 contrib/test_decoding/t/001_twophase.pl
diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 6c18189d9d..79b9622600 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -26,7 +26,7 @@ installcheck:;
# installation, allow to do so, but only if requested explicitly.
installcheck-force: regresscheck-install-force isolationcheck-install-force
-check: regresscheck isolationcheck
+check: regresscheck isolationcheck 2pc-check
submake-regress:
$(MAKE) -C $(top_builddir)/src/test/regress all
@@ -66,3 +66,6 @@ isolationcheck-install-force: all | submake-isolation submake-test_decoding temp
isolationcheck isolationcheck-install-force
temp-install: EXTRA_INSTALL=contrib/test_decoding
+
+2pc-check: temp-install
+ $(prove_check)
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100755
index 0000000000..f154c89908
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,101 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+ 'postgresql.conf', qq(
+ max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+ "INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+ "SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+$node_logical->safe_psql('postgres',
+ "SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot2', 'test_decoding');");
+
+# This test exercises a concurrent abort while logical decoding is ongoing.
+# The decode-delay value makes the plugin sleep for that many seconds for
+# each decoded change, with the LogicalLockTransaction lock held while it sleeps.
+# We fire off a ROLLBACK from another session while this delayed decode is
+# ongoing. Since the decoding backends hold the lock, the ROLLBACK waits for
+# them to call LogicalUnlockTransaction. Decoding stops immediately after that,
+# and the next pg_logical_slot_get_changes call should show only a few records
+# decoded from the entire two-phase transaction.
+#
+# We use two slots to test multiple decoding backends here
+
+$node_logical->safe_psql('postgres', "
+ BEGIN;
+ INSERT INTO tab VALUES (11);
+ INSERT INTO tab VALUES (12);
+ ALTER TABLE tab ADD COLUMN b INT;
+ PREPARE TRANSACTION 'test_prepared_tab';");
+# start decoding the above with decode-delay in the background.
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should only decode 1 INSERT record and should include
+# an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1', 'decode-delay', '3');\" \&");
+
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot2', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1', 'decode-delay', '3');\" \&");
+
+# sleep for a little while (shorter than decode-delay)
+$node_logical->safe_psql('postgres', "select pg_sleep(1)");
+
+# rollback the prepared transaction whose first record is being decoded
+# after sleeping for decode-delay time
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# wait for decoding to stop
+$node_logical->psql('postgres', "select pg_sleep(4)");
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot2', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+
+# check for occurrence of log about waiting backends
+my $output_file = slurp_file($node_logical->logfile());
+my $waiting_str = "Waiting for backends to abort";
+like($output_file, qr/$waiting_str/, "Waiting log found in server log");
+
+# check for occurrence of log about stopping decoding
+my $abort_str = "stopping decoding of test_prepared_tab ";
+like($output_file, qr/$abort_str/, "ABORT found in server log");
+
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+ BEGIN;
+ INSERT INTO tab VALUES (11);
+ INSERT INTO tab VALUES (12);
+ ALTER TABLE tab ADD COLUMN b INT;
+ INSERT INTO tab VALUES (13, 11);
+ PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot2', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit post the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot2', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot2');");
+$node_logical->stop('fast');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 05b993fd7a..6824e11906 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -37,6 +37,7 @@ typedef struct
bool xact_wrote_changes;
bool only_local;
bool enable_twophase;
+ int decode_delay; /* seconds to sleep after every change record */
} TestDecodingData;
static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -118,6 +119,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
data->skip_empty_xacts = false;
data->only_local = false;
data->enable_twophase = false;
+ data->decode_delay = 0;
ctx->output_plugin_private = data;
@@ -209,6 +211,21 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
errmsg("could not parse value \"%s\" for parameter \"%s\"",
strVal(elem->arg), elem->defname)));
}
+ else if (strcmp(elem->defname, "decode-delay") == 0)
+ {
+ if (elem->arg == NULL)
+ data->decode_delay = 2; /* default to 2 seconds */
+ else
+ data->decode_delay = pg_atoi(strVal(elem->arg),
+ sizeof(int), 0);
+
+ if (data->decode_delay <= 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("Specify positive value for parameter \"%s\","
+ " you specified \"%s\"",
+ elem->defname, strVal(elem->arg))));
+ }
else
{
ereport(ERROR,
@@ -548,6 +565,17 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
}
data->xact_wrote_changes = true;
+ /*
+ * if decode_delay is specified, sleep. Note that this
+ * happens with LogicalLockTransaction held from the
+ * decoding infrastructure
+ */
+ if (data->decode_delay > 0)
+ {
+ elog(LOG, "sleeping for %d seconds", data->decode_delay);
+ pg_usleep(data->decode_delay * 1000000L);
+ }
+
class_form = RelationGetForm(relation);
tupdesc = RelationGetDescr(relation);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 178a99d158..460035cc76 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1372,7 +1372,12 @@ ReorderBufferCommitInternal(ReorderBufferTXN *txn,
/* Lock transaction before catalog access */
if (!LogicalLockTransaction(txn))
+ {
+ elog(LOG, "stopping decoding of %s (%u)",
+ txn->gid[0] != '\0' ? txn->gid : "",
+ txn->xid);
goto change_cleanup;
+ }
reloid = RelidByRelfilenode(change->data.tp.relnode.spcNode,
change->data.tp.relnode.relNode);
LogicalUnlockTransaction(txn);
--
2.15.1 (Apple Git-101)
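The decode-delay handling described in the commit message (default to 2 seconds when no value is given, reject non-positive values) can be sketched as a standalone function. This is only an illustration under stated assumptions: demo_parse_decode_delay and DEMO_DEFAULT_DELAY are hypothetical names, and strtol() is used here in place of the patch's pg_atoi() so that the example compiles without PostgreSQL headers. The parsed value is stored in an int, since a delay in seconds does not fit a boolean.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

#define DEMO_DEFAULT_DELAY 2	/* seconds, used when the option has no value */

/*
 * Parse a decode-delay option value.
 * Returns true and stores the delay in seconds in *delay on success;
 * returns false for malformed, non-positive or out-of-range input.
 */
static bool
demo_parse_decode_delay(const char *arg, int *delay)
{
	char	   *end;
	long		val;

	assert(delay != NULL);

	if (arg == NULL)
	{
		*delay = DEMO_DEFAULT_DELAY;
		return true;
	}

	errno = 0;
	val = strtol(arg, &end, 10);
	if (errno != 0 || end == arg || *end != '\0')
		return false;			/* not a clean decimal integer */
	if (val <= 0 || val > INT_MAX)
		return false;			/* only positive delays are accepted */

	*delay = (int) val;
	return true;
}
```

A caller would report an invalid-parameter error when the function returns false, much as the patch does with ereport(ERROR, ...).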
On Fri, Apr 6, 2018 at 12:23 AM, Nikhil Sontakke
<nikhils@2ndquadrant.com> wrote:
Hi,
Uh? Simply rechecking if MyProc->decodeGroupLeader is NULL obviously
does not fix the race condition - it might get NULL right after the
check. So we need to either lookup the PROC again (and then get the
associated lwlock), or hold some other type of lock.
I realized my approach was short-sighted while coding it up. So now we
look up the leader pgproc, recheck if the XID is the same that we are
interested in and go ahead.
I did some more gdb single-stepping and debugging on this. Introduced a few
more fetch pgproc using XID calls for more robustness. I am satisfied now from
my point of view with the decodegroup lock changes.
Also a few other changes related to cleanups and setting of the txn flags at
all places.
PFA, v2.0 of the patchset for today.
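The "look up the leader again, then recheck the XID" idea discussed above can be illustrated with a small single-threaded simulation. This is a sketch under loud assumptions, not the patch's code: DemoProc, demo_lookup_by_xid, demo_join_group and demo_reuse_slot are hypothetical names, the lock is simulated by a counter rather than an lwlock, and a hook function stands in for a concurrent backend that ends the transaction between the lookup and the lock acquisition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int DemoXid;

typedef struct DemoProc
{
	DemoXid		xid;			/* 0 means the slot is free */
	int			lock_depth;		/* simulated per-proc lock, not a real lwlock */
	int			members;		/* decoders that joined this leader's group */
} DemoProc;

#define DEMO_NPROCS 4

/* Hook simulating what another backend does in the race window. */
typedef void (*DemoRaceHook) (DemoProc *procs);

static DemoProc *
demo_lookup_by_xid(DemoProc *procs, DemoXid xid)
{
	if (xid == 0)
		return NULL;
	for (int i = 0; i < DEMO_NPROCS; i++)
		if (procs[i].xid == xid)
			return &procs[i];
	return NULL;
}

/* Join the decode group of the transaction with the given xid, if it still exists. */
static bool
demo_join_group(DemoProc *procs, DemoXid xid, DemoRaceHook hook)
{
	DemoProc   *leader = demo_lookup_by_xid(procs, xid);
	bool		joined = false;

	if (leader == NULL)
		return false;

	/* Race window: the leader may finish and its slot may be reused here. */
	if (hook != NULL)
		hook(procs);

	leader->lock_depth++;		/* simulated lock acquisition */
	if (leader->xid == xid)		/* recheck under the lock */
	{
		leader->members++;
		joined = true;
	}
	leader->lock_depth--;		/* simulated lock release */
	assert(leader->lock_depth == 0);

	return joined;
}

/* Example hook: transaction 100 ends and its slot is reused by xid 200. */
static void
demo_reuse_slot(DemoProc *procs)
{
	DemoProc   *p = demo_lookup_by_xid(procs, 100);

	if (p != NULL)
		p->xid = 200;
}
```

The point of the recheck is that a check made before taking the lock says nothing about the state after taking it. Only a comparison done while the lock is held can be trusted.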
"make check-world" passes ok on these patches.
OK, I think this is now committable. The changes are small, fairly
isolated in effect, and I think every objection has been met, partly
by reducing the scope of the changes. By committing this we will allow
plugin authors to start developing 2PC support, which is important in
some use cases.
I therefore intend to commit these patches some time before the
deadline, either in 12 hours or so, or about 24 hours after that
(which would be right up against the deadline by my calculation),
depending on some other important obligations I have.
cheers
andrew
--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 4/3/18 18:05, Andrew Dunstan wrote:
Currently we seem to have only two machines doing the cross-version
upgrade checks, which might make it easier to rearrange anything if
necessary.
I think we should think about making this even more general. We could
use some cross-version testing for pg_dump, psql, pg_basebackup,
pg_upgrade, logical replication, and so on. Ideally, we would be able
to run the whole test set against an older version somehow. Lots of
details omitted here, of course. ;-)
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
On 2018-04-06 21:30:36 +0930, Andrew Dunstan wrote:
OK, I think this is now committable.
The changes are small, fairly isolated in effect, and I think every
objection has been met, partly by reducing the scope of the
changes. By committing this we will allow plugin authors to start
developing 2PC support, which is important in some use cases.
I therefore intend to commit these patches some time before the
deadline, either in 12 hours or so, or about 24 hours after that
(which would be right up against the deadline by my calculation),
depending on some other important obligations I have.
I object. And I'm negatively surprised that this is even considered.
This is a complicated patch that has been heavily reworked in the last
few days to, among other things, address objections that have first been
made months ago ([1]). There were nontrivial bugs less than a day ago. It
has not received a lot of reviews since these changes. This isn't an
area you've previously been involved in to a significant degree.
Greetings,
Andres Freund
[1]: http://archives.postgresql.org/message-id/20180209211025.d7jxh43fhqnevhji%40alap3.anarazel.de
On Sat, Apr 7, 2018 at 1:50 AM, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2018-04-06 21:30:36 +0930, Andrew Dunstan wrote:
OK, I think this is now committable.
The changes are small, fairly isolated in effect, and I think every
objection has been met, partly by reducing the scope of the
changes. By committing this we will allow plugin authors to start
developing 2PC support, which is important in some use cases.
I therefore intend to commit these patches some time before the
deadline, either in 12 hours or so, or about 24 hours after that
(which would be right up against the deadline by my calculation),
depending on some other important obligations I have.
I object. And I'm negatively surprised that this is even considered.
This is a complicated patch that has been heavily reworked in the last
few days to, among other things, address objections that have first been
made months ago ([1]). There were nontrivial bugs less than a day ago. It
has not received a lot of reviews since these changes. This isn't an
area you've previously been involved in to a significant degree.
No I haven't although I have been spending some time familiarizing
myself with it. Nevertheless, since you object I won't persist.
cheers
andrew
--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Apr 6, 2018 at 10:00 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
On 4/3/18 18:05, Andrew Dunstan wrote:
Currently we seem to have only two machines doing the cross-version
upgrade checks, which might make it easier to rearrange anything if
necessary.
I think we should think about making this even more general. We could
use some cross-version testing for pg_dump, psql, pg_basebackup,
pg_upgrade, logical replication, and so on. Ideally, we would be able
to run the whole test set against an older version somehow. Lots of
details omitted here, of course. ;-)
Yeah, that's more or less the plan. One way to generalize it might be
to see if ${branch}_SAVED exists and points to a directory with bin,
share and lib directories. If so, use it as required to test against
that branch. The buildfarm will make sure that that setting exists.
There are some tricks you have to play with the environment, but it's
basically doable.
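The directory check described above could look roughly like this. It is a sketch only: the buildfarm client is not written in C, demo_is_dir and demo_saved_install_usable are hypothetical names, and the caller is assumed to have already resolved ${branch}_SAVED to a path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/stat.h>

/* True if path exists and is a directory. */
static bool
demo_is_dir(const char *path)
{
	struct stat st;

	return stat(path, &st) == 0 && S_ISDIR(st.st_mode);
}

/*
 * True if root contains bin, share and lib subdirectories, i.e. it looks
 * like a saved installation that tests could be run against.
 */
static bool
demo_saved_install_usable(const char *root)
{
	static const char *const subdirs[] = {"bin", "share", "lib"};
	char		path[4096];

	assert(root != NULL);

	for (size_t i = 0; i < sizeof(subdirs) / sizeof(subdirs[0]); i++)
	{
		int			n = snprintf(path, sizeof(path), "%s/%s", root, subdirs[i]);

		if (n < 0 || (size_t) n >= sizeof(path))
			return false;		/* path too long to check */
		if (!demo_is_dir(path))
			return false;
	}
	return true;
}
```

A cross-version test would be skipped, rather than failed, when this returns false for the branch in question.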
Anyway, this is really a matter for another thread.
cheers
andrew
--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
I object. And I'm negatively surprised that this is even considered.
I am also a bit surprised..
This is a complicated patch that has been heavily reworked in the last
few days to, among other things, address objections that have first been
made months ago ([1]). There were nontrivial bugs less than a day ago. It
has not received a lot of reviews since these changes. This isn't an
area you've previously been involved in to a significant degree.
I thought all the points that you had raised in [1] had been addressed
satisfactorily. Let me know if that's not the case. The last few days,
the focus was on making the decodegroup locking implementation a bit
more robust.
Anyways, will now wait for the next commitfest/opportunity to try to
get this in.
[1] http://archives.postgresql.org/message-id/20180209211025.d7jxh43fhqnevhji%40alap3.anarazel.de
Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
On 4/9/18 2:01 AM, Nikhil Sontakke wrote:
Anyways, will now wait for the next commitfest/opportunity to try to
get this in.
It looks like this patch should be in the Needs Review state so I have
done that and moved it to the next CF.
Regards,
--
-David
david@pgmasters.net
Anyways, will now wait for the next commitfest/opportunity to try to
get this in.
It looks like this patch should be in the Needs Review state so I have
done that and moved it to the next CF.
Thanks David,
Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Hi all,
Anyways, will now wait for the next commitfest/opportunity to try to
get this in.
It looks like this patch should be in the Needs Review state so I have
done that and moved it to the next CF.
PFA, patchset updated to take care of bitrot.
Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
0001-Cleaning-up-of-flags-in-ReorderBufferTXN-structure.patch (application/octet-stream)
From 535b27cab47d87ca75e59a71549488b5e8afcad0 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Wed, 13 Jun 2018 16:15:24 +0530
Subject: [PATCH 1/5] Cleaning up of flags in ReorderBufferTXN structure
---
src/backend/replication/logical/reorderbuffer.c | 35 ++++++++++++-------------
src/include/replication/reorderbuffer.h | 33 ++++++++++++++---------
2 files changed, 37 insertions(+), 31 deletions(-)
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 5792cd14a0..133749110e 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -643,8 +643,7 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
Assert(prev_first_lsn < cur_txn->first_lsn);
/* known-as-subtxn txns must not be listed */
- Assert(!cur_txn->is_known_as_subxact);
-
+ Assert(!rbtxn_is_known_subxact(cur_txn));
prev_first_lsn = cur_txn->first_lsn;
}
@@ -663,7 +662,7 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
Assert(prev_base_snap_lsn < cur_txn->base_snapshot_lsn);
/* known-as-subtxn txns must not be listed */
- Assert(!cur_txn->is_known_as_subxact);
+ Assert(!rbtxn_is_known_subxact(cur_txn));
prev_base_snap_lsn = cur_txn->base_snapshot_lsn;
}
@@ -686,7 +685,7 @@ ReorderBufferGetOldestTXN(ReorderBuffer *rb)
txn = dlist_head_element(ReorderBufferTXN, node, &rb->toplevel_by_lsn);
- Assert(!txn->is_known_as_subxact);
+ Assert(!rbtxn_is_known_subxact(txn));
Assert(txn->first_lsn != InvalidXLogRecPtr);
return txn;
}
@@ -746,7 +745,7 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
if (!new_sub)
{
- if (subtxn->is_known_as_subxact)
+ if (rbtxn_is_known_subxact(subtxn))
{
/* already associated, nothing to do */
return;
@@ -762,7 +761,7 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
}
}
- subtxn->is_known_as_subxact = true;
+ subtxn->txn_flags |= RBTXN_IS_SUBXACT;
subtxn->toplevel_xid = xid;
Assert(subtxn->nsubtxns == 0);
@@ -972,7 +971,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
ReorderBufferChange *cur_change;
- if (txn->serialized)
+ if (rbtxn_is_serialized(txn))
{
/* serialize remaining changes */
ReorderBufferSerializeTXN(rb, txn);
@@ -1001,7 +1000,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
ReorderBufferChange *cur_change;
- if (cur_txn->serialized)
+ if (rbtxn_is_serialized(cur_txn))
{
/* serialize remaining changes */
ReorderBufferSerializeTXN(rb, cur_txn);
@@ -1167,7 +1166,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
* they originally were happening inside another subtxn, so we won't
* ever recurse more than one level deep here.
*/
- Assert(subtxn->is_known_as_subxact);
+ Assert(rbtxn_is_known_subxact(subtxn));
Assert(subtxn->nsubtxns == 0);
ReorderBufferCleanupTXN(rb, subtxn);
@@ -1208,7 +1207,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
/*
* Remove TXN from its containing list.
*
- * Note: if txn->is_known_as_subxact, we are deleting the TXN from its
+ * Note: if txn is known as subxact, we are deleting the TXN from its
* parent's list of known subxacts; this leaves the parent's nsubxacts
* count too high, but we don't care. Otherwise, we are deleting the TXN
* from the LSN-ordered list of toplevel TXNs.
@@ -1223,7 +1222,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
Assert(found);
/* remove entries spilled to disk */
- if (txn->serialized)
+ if (rbtxn_is_serialized(txn))
ReorderBufferRestoreCleanup(rb, txn);
/* deallocate */
@@ -1240,7 +1239,7 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
dlist_iter iter;
HASHCTL hash_ctl;
- if (!txn->has_catalog_changes || dlist_is_empty(&txn->tuplecids))
+ if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
return;
memset(&hash_ctl, 0, sizeof(hash_ctl));
@@ -1852,7 +1851,7 @@ ReorderBufferAbortOld(ReorderBuffer *rb, TransactionId oldestRunningXid)
* final_lsn to that of their last change; this causes
* ReorderBufferRestoreCleanup to do the right thing.
*/
- if (txn->serialized && txn->final_lsn == 0)
+ if (rbtxn_is_serialized(txn) && txn->final_lsn == 0)
{
ReorderBufferChange *last =
dlist_tail_element(ReorderBufferChange, node, &txn->changes);
@@ -2000,7 +1999,7 @@ ReorderBufferSetBaseSnapshot(ReorderBuffer *rb, TransactionId xid,
* operate on its top-level transaction instead.
*/
txn = ReorderBufferTXNByXid(rb, xid, true, &is_new, lsn, true);
- if (txn->is_known_as_subxact)
+ if (rbtxn_is_known_subxact(txn))
txn = ReorderBufferTXNByXid(rb, txn->toplevel_xid, false,
NULL, InvalidXLogRecPtr, false);
Assert(txn->base_snapshot == NULL);
@@ -2107,7 +2106,7 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
- txn->has_catalog_changes = true;
+ txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
}
/*
@@ -2124,7 +2123,7 @@ ReorderBufferXidHasCatalogChanges(ReorderBuffer *rb, TransactionId xid)
if (txn == NULL)
return false;
- return txn->has_catalog_changes;
+ return rbtxn_has_catalog_changes(txn);
}
/*
@@ -2144,7 +2143,7 @@ ReorderBufferXidHasBaseSnapshot(ReorderBuffer *rb, TransactionId xid)
return false;
/* a known subtxn? operate on top-level txn instead */
- if (txn->is_known_as_subxact)
+ if (rbtxn_is_known_subxact(txn))
txn = ReorderBufferTXNByXid(rb, txn->toplevel_xid, false,
NULL, InvalidXLogRecPtr, false);
@@ -2265,7 +2264,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
Assert(spilled == txn->nentries_mem);
Assert(dlist_is_empty(&txn->changes));
txn->nentries_mem = 0;
- txn->serialized = true;
+ txn->txn_flags |= RBTXN_IS_SERIALIZED;
if (fd != -1)
CloseTransientFile(fd);
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 1f52f6bde7..ec9515d156 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -150,18 +150,34 @@ typedef struct ReorderBufferChange
dlist_node node;
} ReorderBufferChange;
+/* ReorderBufferTXN flags */
+#define RBTXN_HAS_CATALOG_CHANGES 0x0001
+#define RBTXN_IS_SUBXACT 0x0002
+#define RBTXN_IS_SERIALIZED 0x0004
+
+/* does the txn have catalog changes */
+#define rbtxn_has_catalog_changes(txn) (((txn)->txn_flags & RBTXN_HAS_CATALOG_CHANGES) != 0)
+/* is the txn known as a subxact? */
+#define rbtxn_is_known_subxact(txn) (((txn)->txn_flags & RBTXN_IS_SUBXACT) != 0)
+/*
+ * Has this transaction been spilled to disk? It's not always possible to
+ * deduce that fact by comparing nentries with nentries_mem, because e.g.
+ * subtransactions of a large transaction might get serialized together
+ * with the parent - if they're restored to memory they'd have
+ * nentries_mem == nentries.
+ */
+#define rbtxn_is_serialized(txn) (((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0)
+
typedef struct ReorderBufferTXN
{
+ int txn_flags;
+
/*
* The transactions transaction id, can be a toplevel or sub xid.
*/
TransactionId xid;
- /* did the TX have catalog changes */
- bool has_catalog_changes;
-
/* Do we know this is a subxact? Xid of top-level txn if so */
- bool is_known_as_subxact;
TransactionId toplevel_xid;
/*
@@ -229,15 +245,6 @@ typedef struct ReorderBufferTXN
*/
uint64 nentries_mem;
- /*
- * Has this transaction been spilled to disk? It's not always possible to
- * deduce that fact by comparing nentries with nentries_mem, because e.g.
- * subtransactions of a large transaction might get serialized together
- * with the parent - if they're restored to memory they'd have
- * nentries_mem == nentries.
- */
- bool serialized;
-
/*
* List of ReorderBufferChange structs, including new Snapshots and new
* CommandIds
--
2.15.2 (Apple Git-101.1)
0002-Introduce-LogicalLockTransaction-LogicalUnlockTransa.patch (application/octet-stream)
From 56dfcb9e0456282e3234db9a645a4d7d5a80efdb Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Wed, 13 Jun 2018 16:16:14 +0530
Subject: [PATCH 2/5] Introduce LogicalLockTransaction/LogicalUnlockTransaction
APIs
When a transaction aborts, its changes are considered unnecessary
for other transactions. That means the changes may be either cleaned
up by vacuum or removed from HOT chains (thus made inaccessible
through indexes), and there may be other such consequences.
When decoding committed transactions this is not an issue, and we
never decode transactions that abort before the decoding starts
(where decoding means passing it to ReorderBufferCommit).
But for in-progress transactions - for example when decoding prepared
transactions on PREPARE (and not COMMIT PREPARED as before), this
may cause failures when the output plugin consults catalogs (both
system and user-defined).
To prevent aborts concurrent with plugins accessing catalogs, we
introduce an API the output plugins are required to use (when
decoding in-progress transactions only).
Before accessing any catalogs, output plugins are required to call
LogicalLockTransaction and then release it using
LogicalUnlockTransaction. Implementation is via adding support for
decoding groups. Use LockHashPartitionLockByProc on the group leader
to get the LWLock protecting these fields. For prepared and uncommitted
transactions, decoding backends working on the same XID will link
themselves up to the corresponding PGPROC entry (decodeGroupLeader).
They will remove themselves when they are done decoding.
If the prepared or uncommitted transaction decides to abort, then
the decodeGroupLeader will set the decodeAbortPending flag allowing
the decodeGroupMembers to abort their decoding appropriately.
If any decode group member errors out, we likewise remove that proc
from the group.
---
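Reviewer note (ignored by git am, since it follows the "---" line): the commit message asks output plugins to bracket every catalog access with LogicalLockTransaction and LogicalUnlockTransaction, and to stop decoding when the lock call returns false. The standalone sketch below models only that calling convention. Everything prefixed demo_ is a hypothetical stand-in (including the catalog lookup and the relation name); none of it is the actual PostgreSQL API, and the real functions take a ReorderBufferTXN.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-transaction state a plugin sees. */
typedef struct DemoTXN
{
    bool aborted;   /* the transaction aborted (or is being aborted) */
    bool locked;    /* we are inside a lock/unlock bracket */
} DemoTXN;

/* Models LogicalLockTransaction: false means "stop decoding this txn". */
static bool
demo_lock_transaction(DemoTXN *txn)
{
    if (txn->aborted)
        return false;
    txn->locked = true;
    return true;
}

/* Models LogicalUnlockTransaction: ends the catalog-access bracket. */
static void
demo_unlock_transaction(DemoTXN *txn)
{
    txn->locked = false;
}

/* Hypothetical catalog lookup; a real plugin would consult the relcache. */
static const char *
demo_lookup_relname(unsigned relid)
{
    return relid == 1 ? "test_prepared1" : "unknown";
}

/*
 * Shape of a change callback under the proposed rule: lock, touch the
 * catalog, unlock immediately. Returns false when decoding should stop.
 */
static bool
demo_change_cb(DemoTXN *txn, unsigned relid, const char **relname_out)
{
    if (!demo_lock_transaction(txn))
        return false;           /* aborted: do not touch catalogs */

    *relname_out = demo_lookup_relname(relid);

    demo_unlock_transaction(txn);
    return true;
}
```

The point of keeping the bracket this narrow is that an aborting backend only has to wait for decoders that are inside it, so anything that does not need catalogs (formatting, network output) belongs outside the lock/unlock pair.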
src/backend/replication/logical/logical.c | 242 ++++++++++++++++
src/backend/storage/ipc/procarray.c | 39 +++
src/backend/storage/lmgr/README | 46 ++++
src/backend/storage/lmgr/proc.c | 442 +++++++++++++++++++++++++++++-
src/include/replication/logical.h | 2 +
src/include/replication/reorderbuffer.h | 15 +
src/include/storage/proc.h | 26 ++
src/include/storage/procarray.h | 1 +
8 files changed, 804 insertions(+), 9 deletions(-)
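Reviewer note (still in the part of the mail that git am ignores): the abort handshake described in the commit message can be modeled without any shared memory or LWLocks. In the single-threaded sketch below, the aborting side sets an abort-pending flag and must keep waiting while any member is inside a lock/unlock bracket; members notice the flag on their next lock or unlock and leave the group. All demo_* names and the fixed-size member array are assumptions made for illustration; the patch uses PGPROC fields and a dlist protected by a lock manager partition lock.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_MAX_MEMBERS 4

/* Hypothetical model of one decoding backend's group state. */
typedef struct DemoMember
{
    bool in_group;       /* models membership in decodeGroupMembers */
    bool decode_locked;  /* models decodeLocked */
} DemoMember;

/* Hypothetical model of the leader's view of its decode group. */
typedef struct DemoGroup
{
    bool       abort_pending;  /* models decodeAbortPending */
    DemoMember members[DEMO_MAX_MEMBERS];
} DemoGroup;

/*
 * Leader side, one pass of the abort path: flag the abort, then report
 * whether any member is still inside a catalog-access bracket. The real
 * code sleeps and rechecks; here the caller simply calls again.
 */
static bool
demo_abort_must_wait(DemoGroup *g)
{
    g->abort_pending = true;
    for (int i = 0; i < DEMO_MAX_MEMBERS; i++)
    {
        if (g->members[i].in_group && g->members[i].decode_locked)
            return true;
    }
    return false;
}

/* Member side: refuse the lock and leave the group once an abort is pending. */
static bool
demo_member_lock(DemoGroup *g, int i)
{
    if (g->abort_pending)
    {
        g->members[i].in_group = false;
        g->members[i].decode_locked = false;
        return false;
    }
    g->members[i].decode_locked = true;
    return true;
}

/* Member side: always drop the lock; leave the group if an abort is pending. */
static void
demo_member_unlock(DemoGroup *g, int i)
{
    if (g->abort_pending)
        g->members[i].in_group = false;
    g->members[i].decode_locked = false;
}
```

Under this model the abort can only be delayed by a member that took the lock before the flag was set, and that delay ends at the member's next unlock.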
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index c2d0e0c723..073aa41be2 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1065,3 +1065,245 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
SpinLockRelease(&MyReplicationSlot->mutex);
}
}
+
+/*
+ * LogicalLockTransaction
+ * Make sure the transaction is not aborted during decoding.
+ *
+ * The logical decoding plugins may need to access catalogs (both system
+ * and user-defined), e.g. to get metadata about tuples, do custom
+ * filtering etc. While decoding committed transactions that is not an
+ * issue, but in-progress transactions may abort while being decoded, in
+ * which case the catalog access may fail in various ways (rows from
+ * aborted transactions are eligible for more aggressive cleanup, may
+ * not be accessible through indexes due to breaking HOT chains etc.).
+ *
+ * To prevent these issues, we need to prevent abort of the transaction
+ * while accessing any catalogs. To enforce that, each decoding backend
+ * has to call LogicalLockTransaction prior to any catalog access, and
+ * then LogicalUnlockTransaction immediately after it. The lock function
+ * adds the decoding backend into a "decoding group" for the transaction
+ * on the first call. Subsequent calls update a flag indicating whether
+ * the decoding backend may be accessing any catalogs.
+ *
+ * While aborting an in-progress transaction, the backend is made to wait
+ * for all current members of the decoding group that may be currently
+ * accessing catalogs (see LogicalDecodeRemoveTransaction). Once the
+ * transaction completes (applies to both abort and commit), the group
+ * is destroyed and is not needed anymore (we can check transaction
+ * status directly, instead).
+ *
+ * The function returns true when it's safe to access catalogs, and
+ * false when the transaction aborted (or is being aborted), in which
+ * case the plugin should stop decoding it.
+ *
+ * The decoding backend joins the decoding group only when actually
+ * needed. For example when the transaction did no catalog changes,
+ * or when it's known to already have committed (or aborted), we can
+ * bail out without joining the group.
+ */
+bool
+LogicalLockTransaction(ReorderBufferTXN *txn)
+{
+ bool ok = false;
+ LWLock *leader_lwlock;
+ volatile PGPROC *leader = NULL;
+ volatile PGXACT *pgxact = NULL;
+
+ /*
+ * Transactions that have not modified catalogs do not need to
+ * join the decoding group.
+ */
+ if (!rbtxn_has_catalog_changes(txn))
+ return true;
+
+ /*
+ * Check commit status. If a transaction already committed, there
+ * is no danger when accessing catalogs. If it aborted, we can
+ * stop decoding it right away.
+ */
+ if (rbtxn_commit(txn))
+ return true;
+
+ if (rbtxn_rollback(txn))
+ return false;
+
+ /*
+ * Currently, only 2PC transactions can be decoded before commit
+ * (at prepare). So regular transactions are automatically safe.
+ */
+ if (!rbtxn_prepared(txn))
+ return true;
+
+ /*
+ * Find the PROC handling this XID and join the decoding group.
+ *
+ * If this is the first call for this XID, we don't know which
+ * PROC is executing the transaction (and acting as a leader).
+ * In that case we need to lookup and possibly also assign
+ * the leader.
+ */
+ if (MyProc->decodeGroupLeader == NULL)
+ {
+ leader = AssignDecodeGroupLeader(txn->xid);
+
+ /*
+ * We have checked if the transaction committed/aborted, but it
+ * is possible the PROC went away since then, in which case we
+ * get leader as NULL above. We recheck transaction status,
+ * expecting it to be either committed or aborted.
+ *
+ * If the PROC is available, add ourself as a member of its
+ * decoding group. Note that we're not holding any locks on PGPROC,
+ * so it's possible the leader disappears, or starts executing
+ * another transaction. In that case we're done.
+ */
+ if (leader == NULL ||
+ !BecomeDecodeGroupMember((PGPROC *)leader, txn->xid))
+ goto lock_cleanup;
+ }
+
+ /*
+ * We know the leader was executing this XID a while ago, and we
+ * might have become a member of the decode group as well.
+ * But we have not been holding any locks on PGPROC so it might
+ * have committed/aborted, removed us from the decoding group and
+ * started executing something else since then. So we need to
+ * recheck that it is indeed still running the right XID.
+ */
+ leader = BackendXidGetProc(txn->xid);
+ if (!leader)
+ goto lock_cleanup;
+
+ leader_lwlock = LockHashPartitionLockByProc(leader);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+
+ pgxact = &ProcGlobal->allPgXact[leader->pgprocno];
+ if (pgxact->xid != txn->xid)
+ {
+ LWLockRelease(leader_lwlock);
+ goto lock_cleanup;
+ }
+
+ /* ok, we are part of the decode group still */
+ Assert(MyProc->decodeGroupLeader &&
+ MyProc->decodeGroupLeader == leader);
+
+ /*
+ * Re-check if we were told to abort by the leader after taking
+ * the above lock.
+ */
+ if (leader->decodeAbortPending)
+ {
+ /*
+ * Remove ourself from the decodeGroupMembership and return
+ * false so that the decoding plugin also initiates abort
+ * processing
+ */
+ RemoveDecodeGroupMemberLocked(MyProc->decodeGroupLeader);
+ MyProc->decodeLocked = false;
+ txn->txn_flags |= RBTXN_ROLLBACK;
+ ok = false;
+ }
+ else
+ {
+ /* ok to logically lock this backend */
+ MyProc->decodeLocked = true;
+ ok = true;
+ }
+ LWLockRelease(leader_lwlock);
+
+ return ok;
+
+ /*
+ * if we reach lock_cleanup label, then lock was not granted.
+ * Check XID status and update txn flags appropriately before
+ * returning
+ */
+lock_cleanup:
+ Assert(!TransactionIdIsInProgress(txn->xid));
+ if (TransactionIdDidCommit(txn->xid))
+ {
+ txn->txn_flags |= RBTXN_COMMIT;
+ return true;
+ }
+ else
+ {
+ txn->txn_flags |= RBTXN_ROLLBACK;
+ return false;
+ }
+}
+
+/*
+ * LogicalUnlockTransaction
+ * Indicate that the logical decoding plugin is done accessing
+ * catalog information.
+ *
+ *
+ * To prevent issues while decoding of in-progress transactions, we
+ * need to prevent abort of the transaction while accessing any catalogs.
+ * To enforce that, each decoding backend has to call
+ * LogicalLockTransaction prior to any catalog access, and then
+ * LogicalUnlockTransaction immediately after it. This unlock function
+ * clears the backend's decodeLocked flag and, if an abort is pending,
+ * also removes the backend from the transaction's "decoding group".
+ */
+void
+LogicalUnlockTransaction(ReorderBufferTXN *txn)
+{
+ LWLock *leader_lwlock;
+ PGPROC *leader = NULL;
+
+ /*
+ * If the transaction is known to have aborted, we should have never got
+ * here (the plugin should have interrupted the decoding).
+ */
+ Assert(!rbtxn_rollback(txn));
+
+ /* If it's not locked, we're done. */
+ if (!MyProc->decodeLocked)
+ return;
+
+ /*
+ * Transactions that have not modified catalogs do not need to
+ * join the decoding group.
+ */
+ if (!rbtxn_has_catalog_changes(txn))
+ return;
+
+ /*
+ * Currently, only 2PC transactions can be decoded before commit
+ * (at prepare). So regular transactions are automatically safe.
+ */
+ if (!rbtxn_prepared(txn))
+ return;
+
+ /*
+ * Check commit status. If a transaction already committed, there
+ * is no danger when accessing catalogs.
+ */
+ if (rbtxn_commit(txn))
+ return;
+
+ /*
+ * We're guaranteed to still have a leader here, because we are
+ * in locked mode, so the leader can't just disappear.
+ */
+ leader = MyProc->decodeGroupLeader;
+ Assert(leader && MyProc->decodeLocked);
+
+ leader_lwlock = LockHashPartitionLockByProc(leader);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ if (leader->decodeAbortPending)
+ {
+ /*
+ * Remove ourself from the decodeGroupMembership
+ */
+ RemoveDecodeGroupMemberLocked(leader);
+
+ txn->txn_flags |= RBTXN_ROLLBACK;
+ }
+ MyProc->decodeLocked = false;
+ LWLockRelease(leader_lwlock);
+ return;
+}
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index bd20497d81..77bf833381 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -2440,6 +2440,45 @@ BackendXidGetPid(TransactionId xid)
return result;
}
+/*
+ * BackendXidGetProc -- get a backend's PGPROC given its XID
+ *
+ * Note that it is up to the caller to be sure that the question
+ * remains meaningful for long enough for the answer to be used ...
+ *
+ * Only main transaction Ids are considered.
+ *
+ */
+PGPROC *
+BackendXidGetProc(TransactionId xid)
+{
+ PGPROC *result = NULL;
+ ProcArrayStruct *arrayP = procArray;
+ int index;
+
+ if (xid == InvalidTransactionId) /* never match invalid xid */
+ return NULL;
+
+ LWLockAcquire(ProcArrayLock, LW_SHARED);
+
+ for (index = 0; index < arrayP->numProcs; index++)
+ {
+ int pgprocno = arrayP->pgprocnos[index];
+ PGPROC *proc = &allProcs[pgprocno];
+ volatile PGXACT *pgxact = &allPgXact[pgprocno];
+
+ if (pgxact->xid == xid)
+ {
+ result = proc;
+ break;
+ }
+ }
+
+ LWLockRelease(ProcArrayLock);
+
+ return result;
+}
+
/*
* IsBackendPid -- is a given pid a running backend
*
diff --git a/src/backend/storage/lmgr/README b/src/backend/storage/lmgr/README
index 56b0a12a3e..4b4b9c5958 100644
--- a/src/backend/storage/lmgr/README
+++ b/src/backend/storage/lmgr/README
@@ -679,6 +679,52 @@ worker, and the worker fails to join the lock group unless the given PGPROC
still has the same PID and is still a lock group leader. We assume that
PIDs are not recycled quickly enough for this interlock to fail.
+Decode Group Locking
+--------------------
+
+When decoding in-progress transactions, we need to prevent aborts while
+the decoding processes are accessing catalogs, which might lead to issues
+if the transaction modified some of the catalogs. Currently this applies
+only to two-phase transactions, that may be decoded at PREPARE time, but
+in the future this may be extended to regular transactions too.
+
+To prevent that, the backend executing the abort is made to wait for all
+the decoding backends. We use an infrastructure which is very similar
+to the above group locking to form groups of backends performing logical
+decoding of the same in-progress transaction.
+
+Decode Group locking adds five new members to each PGPROC:
+decodeGroupLeader, decodeGroupMembers, decodeGroupLink, decodeLocked and
+decodeAbortPending. A PGPROC's decodeGroupLeader is NULL for processes
+not involved in logical decoding. When a process wants to decode an
+in-progress transaction, it looks up the PGPROC structure associated
+with that transaction ID and records that PGPROC as its
+decodeGroupLeader. The decodeGroupMembers field is only used in the
+leader; it is a list of the member PGPROCs of the decode group (the
+leader and all backends decoding this transaction ID).
+The decodeGroupLink field is the list link for this list. The decoding
+backend marks itself as decodeLocked while it is accessing catalog
+metadata for its decoding requirements via the LogicalLockTransaction
+API. It resets the same via the LogicalUnlockTransaction API.
+
+Meanwhile, if this in-progress transaction decides to abort, then the
+PGPROC executing it sets decodeAbortPending on itself and also on all
+the decodeGroupMembers entries.
+
+The decodeGroupMembers entries stop decoding this transaction and exit.
+When all the decoding backends have exited the abort can proceed.
+
+All five of these fields are considered to be protected by a lock manager
+partition lock. The partition lock that protects these fields within a given
+decode group is chosen by taking the leader's pgprocno modulo the number of lock
+manager partitions. Holding this single lock allows safe manipulation of the
+decodeGroupMembers list for the decode group.
+
+The decodeGroupLeader's PGPROC is accessible to each decoding backend.
+A decoding backend fails to join the decode group unless the given
+PGPROC is still running the same transaction ID and has no abort
+pending; BecomeDecodeGroupMember rechecks both while holding the
+partition lock.
User Locks (Advisory Locks)
---------------------------
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 6f30e082b2..82a2450319 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -267,6 +267,11 @@ InitProcGlobal(void)
/* Initialize lockGroupMembers list. */
dlist_init(&procs[i].lockGroupMembers);
+
+ /* Initialize decodeGroupMembers list. */
+ dlist_init(&procs[i].decodeGroupMembers);
+ procs[i].decodeAbortPending = false;
+ procs[i].decodeLocked = false;
}
/*
@@ -406,6 +411,12 @@ InitProcess(void)
Assert(MyProc->lockGroupLeader == NULL);
Assert(dlist_is_empty(&MyProc->lockGroupMembers));
+ /* Check that group decode fields are in a proper initial state. */
+ Assert(MyProc->decodeGroupLeader == NULL);
+ Assert(dlist_is_empty(&MyProc->decodeGroupMembers));
+ MyProc->decodeAbortPending = false;
+ MyProc->decodeLocked = false;
+
/* Initialize wait event information. */
MyProc->wait_event_info = 0;
@@ -581,6 +592,12 @@ InitAuxiliaryProcess(void)
Assert(MyProc->lockGroupLeader == NULL);
Assert(dlist_is_empty(&MyProc->lockGroupMembers));
+ /* Check that group decode fields are in a proper initial state. */
+ Assert(MyProc->decodeGroupLeader == NULL);
+ Assert(dlist_is_empty(&MyProc->decodeGroupMembers));
+ MyProc->decodeAbortPending = false;
+ MyProc->decodeLocked = false;
+
/*
* We might be reusing a semaphore that belonged to a failed process. So
* be careful and reinitialize its value here. (This is not strictly
@@ -826,9 +843,14 @@ ProcKill(int code, Datum arg)
/*
* Detach from any lock group of which we are a member. If the leader
- * exist before all other group members, it's PGPROC will remain allocated
+ * exits before all other group members, its PGPROC will remain allocated
* until the last group process exits; that process must return the
* leader's PGPROC to the appropriate list.
+ *
+ * The below code needs to be mindful of the presence of decode group
+ * entries in case of logical decoding. However, lock groups are for
+ * parallel workers so we typically won't be finding both present
+ * together in the same proc.
*/
if (MyProc->lockGroupLeader != NULL)
{
@@ -845,11 +867,19 @@ ProcKill(int code, Datum arg)
{
procgloballist = leader->procgloballist;
- /* Leader exited first; return its PGPROC. */
- SpinLockAcquire(ProcStructLock);
- leader->links.next = (SHM_QUEUE *) *procgloballist;
- *procgloballist = leader;
- SpinLockRelease(ProcStructLock);
+ /*
+ * Leader exited first; return its PGPROC.
+ * Only do this if it has no decode group
+ * members, though; otherwise the last decode
+ * group member will release it later.
+ */
+ if (leader->decodeGroupLeader == NULL)
+ {
+ SpinLockAcquire(ProcStructLock);
+ leader->links.next = (SHM_QUEUE *) *procgloballist;
+ *procgloballist = leader;
+ SpinLockRelease(ProcStructLock);
+ }
}
}
else if (leader != MyProc)
@@ -857,6 +887,54 @@ ProcKill(int code, Datum arg)
LWLockRelease(leader_lwlock);
}
+ /*
+ * Detach from any decode group of which we are a member. If the leader
+ * exits before all other group members, its PGPROC will remain allocated
+ * until the last group process exits; that process must return the
+ * leader's PGPROC to the appropriate list.
+ */
+ if (MyProc->decodeGroupLeader != NULL)
+ {
+ PGPROC *leader = MyProc->decodeGroupLeader;
+ LWLock *leader_lwlock = LockHashPartitionLockByProc(leader);
+
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ dlist_delete(&MyProc->decodeGroupLink);
+ if (dlist_is_empty(&leader->decodeGroupMembers))
+ {
+ leader->decodeGroupLeader = NULL;
+ if (leader != MyProc)
+ {
+ procgloballist = leader->procgloballist;
+
+ /*
+ * Leader exited first; return its PGPROC.
+ * But check if it was already done above
+ * by the lockGroup code
+ */
+ if (leader != *procgloballist)
+ {
+ SpinLockAcquire(ProcStructLock);
+ leader->links.next = (SHM_QUEUE *) *procgloballist;
+ *procgloballist = leader;
+ SpinLockRelease(ProcStructLock);
+ }
+ }
+ /* clear leader flags */
+ leader->decodeAbortPending = false;
+ leader->decodeLocked = false;
+ }
+ else if (leader != MyProc)
+ {
+ MyProc->decodeGroupLeader = NULL;
+ /* clear proc flags */
+ MyProc->decodeLocked = false;
+ MyProc->decodeAbortPending = false;
+ }
+ LWLockRelease(leader_lwlock);
+ }
+
/*
* Reset MyLatch to the process local one. This is so that signal
* handlers et al can continue using the latch after the shared latch
@@ -881,9 +959,36 @@ ProcKill(int code, Datum arg)
/* Since lockGroupLeader is NULL, lockGroupMembers should be empty. */
Assert(dlist_is_empty(&proc->lockGroupMembers));
- /* Return PGPROC structure (and semaphore) to appropriate freelist */
- proc->links.next = (SHM_QUEUE *) *procgloballist;
- *procgloballist = proc;
+ /*
+ * Return PGPROC structure (and semaphore) to appropriate freelist.
+ * Again check if decode group stuff will handle it later.
+ */
+ if (proc->decodeGroupLeader == NULL)
+ {
+ proc->links.next = (SHM_QUEUE *) *procgloballist;
+ *procgloballist = proc;
+ }
+ }
+
+ /*
+ * If we're still a member of a decode group, that means we're a leader
+ * which has somehow exited before its children. The last remaining child
+ * will release our PGPROC. Otherwise, release it now.
+ */
+ if (proc->decodeGroupLeader == NULL)
+ {
+ /* Since decodeGroupLeader is NULL, decodeGroupMembers should be empty. */
+ Assert(dlist_is_empty(&proc->decodeGroupMembers));
+
+ /*
+ * Return PGPROC structure (and semaphore) to appropriate freelist
+ * But check if it was already done above by the lockGroup code
+ */
+ if (proc != *procgloballist)
+ {
+ proc->links.next = (SHM_QUEUE *) *procgloballist;
+ *procgloballist = proc;
+ }
}
/* Update shared estimate of spins_per_delay */
@@ -1887,3 +1992,322 @@ BecomeLockGroupMember(PGPROC *leader, int pid)
return ok;
}
+
+/*
+ * AssignDecodeGroupLeader
+ * Lookup process using xid and designate as decode group leader.
+ *
+ * Once this function has returned, other processes can join the decode
+ * group by calling BecomeDecodeGroupMember.
+ */
+PGPROC *
+AssignDecodeGroupLeader(TransactionId xid)
+{
+ PGPROC *proc = NULL;
+ LWLock *leader_lwlock;
+
+ Assert(xid != InvalidTransactionId);
+
+ /*
+ * Lookup the backend executing this transaction.
+ *
+ * If the transaction already completed, we can bail out.
+ */
+ proc = BackendXidGetProc(xid);
+ if (!proc)
+ return NULL;
+
+ /*
+ * A process running an XID can't have a leader; it can only be
+ * a leader (in which case it points to itself).
+ */
+ Assert(!proc->decodeGroupLeader ||
+ (proc->decodeGroupLeader == proc));
+
+ /*
+ * This proc will become decodeGroupLeader if it's not already.
+ */
+ if (proc->decodeGroupLeader == NULL)
+ {
+ volatile PGXACT *pgxact;
+ volatile PGPROC *leader;
+
+ /* Create single-member group, containing this proc. */
+ leader_lwlock = LockHashPartitionLockByProc(proc);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+
+ /* recheck we are still the same */
+ leader = BackendXidGetProc(xid);
+ if (!leader || leader != proc)
+ {
+ LWLockRelease(leader_lwlock);
+ return NULL;
+ }
+
+ pgxact = &ProcGlobal->allPgXact[leader->pgprocno];
+
+ /*
+ * We know the process was executing the XID a while ago, but we
+ * have not been holding any locks on PGPROC so it might have
+ * started executing something else since then. So we need to
+ * recheck that it is indeed still running the right XID.
+ *
+ * If it's not, the transaction must have already completed, so
+ * we don't need to create any decoding group.
+ */
+ if (pgxact->xid == xid)
+ {
+ /*
+ * Some other decoding backend might have marked the process
+ * as a leader before we acquired the lock. But it must not
+ * be a follower of some other leader.
+ */
+ Assert(!proc->decodeGroupLeader ||
+ (proc->decodeGroupLeader == proc));
+
+ /* recheck if someone else did not already assign us */
+ if (proc->decodeGroupLeader == NULL)
+ {
+ /*
+ * The leader is also a part of the decoding group,
+ * so we add it to the members list as well.
+ */
+ proc->decodeGroupLeader = proc;
+ dlist_push_head(&proc->decodeGroupMembers,
+ &proc->decodeGroupLink);
+ }
+ }
+ else
+ {
+ /* proc entry is gone */
+ proc = NULL;
+ }
+ LWLockRelease(leader_lwlock);
+ }
+
+ if (proc)
+ elog(DEBUG1, "became group leader (%p)", proc);
+ return proc;
+}
+
+/*
+ * BecomeDecodeGroupMember - designate process as decode group member
+ *
+ * This is pretty straightforward except for the possibility that the leader
+ * whose group we're trying to join might exit before we manage to do so;
+ * and the PGPROC might get recycled for an unrelated process. To avoid
+ * that, we require the caller to pass the XID of the transaction being
+ * decoded as an interlock. Returns true if we successfully join the
+ * intended decode group, and false if not.
+ */
+bool
+BecomeDecodeGroupMember(PGPROC *leader, TransactionId xid)
+{
+ LWLock *leader_lwlock;
+ bool ok = false;
+ volatile PGXACT *pgxact;
+ volatile PGPROC *proc = NULL;
+
+ /* Group leader can't become member of group */
+ Assert(MyProc != leader);
+
+ /* Can't already be a member of a group */
+ Assert(MyProc->decodeGroupLeader == NULL);
+
+ /* XID must be valid */
+ Assert(TransactionIdIsValid(xid));
+
+ /*
+ * Get lock protecting the group fields. Note LockHashPartitionLockByProc
+ * accesses leader->pgprocno in a PGPROC that might be free. This is safe
+ * because all PGPROCs' pgprocno fields are set during shared memory
+ * initialization and never change thereafter; so we will acquire the
+ * correct lock even if the leader PGPROC is in process of being recycled.
+ */
+ leader_lwlock = LockHashPartitionLockByProc(leader);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+
+ /* Is this the leader we're looking for? */
+ proc = BackendXidGetProc(xid);
+ if (!proc || leader != proc)
+ {
+ LWLockRelease(leader_lwlock);
+ return false;
+ }
+ pgxact = &ProcGlobal->allPgXact[leader->pgprocno];
+ if (pgxact->xid == xid)
+ {
+ /* is the leader going away? */
+ if (leader->decodeAbortPending)
+ ok = false;
+ else
+ {
+ /* OK, join the group */
+ ok = true;
+ MyProc->decodeGroupLeader = leader;
+ dlist_push_tail(&leader->decodeGroupMembers, &MyProc->decodeGroupLink);
+ }
+ }
+ LWLockRelease(leader_lwlock);
+
+ if (ok)
+ elog(DEBUG1, "became group member (%p) to (%p)", MyProc, leader);
+ return ok;
+}
+
+/*
+ * RemoveDecodeGroupMember
+ * Remove a member from the decoding group of a leader.
+ */
+void
+RemoveDecodeGroupMember(PGPROC *leader)
+{
+ LWLock *leader_lwlock;
+
+ leader_lwlock = LockHashPartitionLockByProc(leader);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ RemoveDecodeGroupMemberLocked(leader);
+ LWLockRelease(leader_lwlock);
+
+ return;
+}
+
+/*
+ * RemoveDecodeGroupMemberLocked
+ * Remove a member from a decoding group of a leader.
+ *
+ * Assumes that the caller is holding appropriate lock on PGPROC.
+ */
+void
+RemoveDecodeGroupMemberLocked(PGPROC *leader)
+{
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ dlist_delete(&MyProc->decodeGroupLink);
+ /* leader links to itself, so never empty */
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ MyProc->decodeGroupLeader = NULL;
+ elog(DEBUG1, "removed group member (%p) from (%p)", MyProc, leader);
+
+ return;
+}
+
+/*
+ * LogicalDecodeRemoveTransaction
+ * Notify all decoding members that this transaction is going away.
+ *
+ * Wait for all decodeGroupMembers to ack back before returning from
+ * here but only in case of aborts.
+ *
+ * This function should be called *after* the proc has been removed
+ * from the procArray.
+ *
+ * If the transaction is committing, it's ok for the decoding backends
+ * to continue - there is no danger in accessing catalogs. When a backend
+ * tries to join the decoding group, it won't find the proc anymore,
+ * forcing it to re-check the transaction status and cache the commit
+ * status for future calls (see LogicalLockTransaction).
+ *
+ * If a backend that is part of the decode group dies or crashes, that
+ * effectively causes the database to restart, which cleans up the
+ * shared memory state.
+ */
+void
+LogicalDecodeRemoveTransaction(PGPROC *leader, bool isCommit)
+{
+ LWLock *leader_lwlock;
+ dlist_mutable_iter change_i;
+ dlist_iter iter;
+ PGPROC *proc;
+ bool do_wait;
+
+ leader_lwlock = LockHashPartitionLockByProc(leader);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+
+ /*
+ * If the proc has not been initialized as a group leader, there are
+ * no group members to wait for and we can terminate right away.
+ */
+ if (leader->decodeGroupLeader == NULL)
+ {
+ Assert(dlist_is_empty(&leader->decodeGroupMembers));
+ LWLockRelease(leader_lwlock);
+ return;
+ }
+
+ /* flag a pending abort to group members (stays false on commit) */
+ leader->decodeAbortPending = (!isCommit);
+
+recheck:
+ do_wait = false;
+ Assert(leader->decodeGroupLeader == leader);
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ if (!isCommit)
+ {
+ /*
+ * We need to walk the list of group members, and decide if we
+ * need to wait for some of them. In other words, we need to
+ * check if there are any processes besides the leader.
+ */
+ dlist_foreach(iter, &leader->decodeGroupMembers)
+ {
+ proc = dlist_container(PGPROC, decodeGroupLink, iter.cur);
+
+ /* Ignore the leader (i.e. ourselves). */
+ if (proc == leader)
+ continue;
+
+ /* if the proc is currently locked, wait */
+ if (proc->decodeLocked)
+ do_wait = true;
+ }
+
+ if (do_wait)
+ {
+ int rc;
+ LWLockRelease(leader_lwlock);
+
+ elog(LOG, "Waiting for backends to abort decoding");
+ /*
+ * Wait on our latch to allow decodeGroupMembers to
+ * go away soon
+ */
+ rc = WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+ 100L,
+ WAIT_EVENT_PG_SLEEP);
+ ResetLatch(MyLatch);
+
+ /* emergency bailout if postmaster has died */
+ if (rc & WL_POSTMASTER_DEATH)
+ proc_exit(1);
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Recheck decodeGroupMembers */
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ goto recheck;
+ }
+ }
+
+ /*
+ * All backends exited cleanly in case of aborts above,
+ * remove decodeGroupMembers now for both commit/abort cases
+ */
+ Assert(leader->decodeGroupLeader == leader);
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ dlist_foreach_modify(change_i, &leader->decodeGroupMembers)
+ {
+ proc = dlist_container(PGPROC, decodeGroupLink, change_i.cur);
+ Assert(!proc->decodeLocked);
+ dlist_delete(&proc->decodeGroupLink);
+ elog(DEBUG1, "deleting group member (%p) from (%p)",
+ proc, leader);
+ proc->decodeGroupLeader = NULL;
+ }
+ Assert(dlist_is_empty(&leader->decodeGroupMembers));
+ leader->decodeGroupLeader = NULL;
+ leader->decodeAbortPending = false;
+ LWLockRelease(leader_lwlock);
+
+ return;
+}
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index c25ac1fa85..069eb7a272 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -117,6 +117,8 @@ extern void LogicalIncreaseXminForSlot(XLogRecPtr lsn, TransactionId xmin);
extern void LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn,
XLogRecPtr restart_lsn);
extern void LogicalConfirmReceivedLocation(XLogRecPtr lsn);
+extern bool LogicalLockTransaction(ReorderBufferTXN *txn);
+extern void LogicalUnlockTransaction(ReorderBufferTXN *txn);
extern bool filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id);
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index ec9515d156..473ec85a7e 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -154,6 +154,11 @@ typedef struct ReorderBufferChange
#define RBTXN_HAS_CATALOG_CHANGES 0x0001
#define RBTXN_IS_SUBXACT 0x0002
#define RBTXN_IS_SERIALIZED 0x0004
+#define RBTXN_PREPARE 0x0008
+#define RBTXN_COMMIT_PREPARED 0x0010
+#define RBTXN_ROLLBACK_PREPARED 0x0020
+#define RBTXN_COMMIT 0x0040
+#define RBTXN_ROLLBACK 0x0080
/* does the txn have catalog changes */
#define rbtxn_has_catalog_changes(txn) (txn->txn_flags & RBTXN_HAS_CATALOG_CHANGES)
@@ -167,6 +172,16 @@ typedef struct ReorderBufferChange
* nentries_mem == nentries.
*/
#define rbtxn_is_serialized(txn) (txn->txn_flags & RBTXN_IS_SERIALIZED)
+/* is this txn prepared? */
+#define rbtxn_prepared(txn) (txn->txn_flags & RBTXN_PREPARE)
+/* was this prepared txn committed in the meanwhile? */
+#define rbtxn_commit_prepared(txn) (txn->txn_flags & RBTXN_COMMIT_PREPARED)
+/* was this prepared txn aborted in the meanwhile? */
+#define rbtxn_rollback_prepared(txn) (txn->txn_flags & RBTXN_ROLLBACK_PREPARED)
+/* was this txn committed in the meanwhile? */
+#define rbtxn_commit(txn) (txn->txn_flags & RBTXN_COMMIT)
+/* was this txn aborted in the meanwhile? */
+#define rbtxn_rollback(txn) (txn->txn_flags & RBTXN_ROLLBACK)
typedef struct ReorderBufferTXN
{
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 5c19a61dcf..ae842b64d0 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -200,6 +200,26 @@ struct PGPROC
PGPROC *lockGroupLeader; /* lock group leader, if I'm a member */
dlist_head lockGroupMembers; /* list of members, if I'm a leader */
dlist_node lockGroupLink; /* my member link, if I'm a member */
+
+ /*
+ * Support for decoding groups. Use LockHashPartitionLockByProc on the group
+ * leader to get the LWLock protecting these fields.
+ *
+ * For prepared and uncommitted transactions, decoding backends working on
+ * the same XID will link themselves up to the corresponding PGPROC
+ * entry (decodeGroupLeader).
+ *
+ * They will remove themselves when they are done decoding.
+ *
+ * If the prepared or uncommitted transaction is being aborted, the
+ * decodeGroupLeader sets the decodeAbortPending flag so that the
+ * decodeGroupMembers can abort their decoding appropriately.
+ */
+ PGPROC *decodeGroupLeader; /* decode group leader, if I'm a member */
+ dlist_head decodeGroupMembers; /* list of members, if I'm a leader */
+ dlist_node decodeGroupLink; /* my member link, if I'm a member */
+ bool decodeLocked; /* is it currently locked by this proc? */
+ bool decodeAbortPending; /* is the decode group leader aborting? */
};
/* NOTE: "typedef struct PGPROC PGPROC" appears in storage/lock.h. */
@@ -327,4 +347,10 @@ extern PGPROC *AuxiliaryPidGetProc(int pid);
extern void BecomeLockGroupLeader(void);
extern bool BecomeLockGroupMember(PGPROC *leader, int pid);
+extern PGPROC *AssignDecodeGroupLeader(TransactionId xid);
+extern bool BecomeDecodeGroupMember(PGPROC *leader, TransactionId xid);
+extern void RemoveDecodeGroupMember(PGPROC *leader);
+extern void RemoveDecodeGroupMemberLocked(PGPROC *leader);
+extern void LogicalDecodeRemoveTransaction(PGPROC *leader, bool isCommit);
+
#endif /* PROC_H */
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 75bab2985f..776de2470e 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -97,6 +97,7 @@ extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids
extern PGPROC *BackendPidGetProc(int pid);
extern PGPROC *BackendPidGetProcWithLock(int pid);
+extern PGPROC *BackendXidGetProc(TransactionId xid);
extern int BackendXidGetPid(TransactionId xid);
extern bool IsBackendPid(int pid);
--
2.15.2 (Apple Git-101.1)
0003-Support-decoding-of-two-phase-transactions-at-PREPAR.patch (application/octet-stream)
From 344fdb3d258aeabfc802259dfa24c610a1a06050 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Wed, 13 Jun 2018 16:30:30 +0530
Subject: [PATCH 3/5] Support decoding of two-phase transactions at PREPARE
Until now two-phase transactions were decoded at COMMIT, just like
regular transactions. During replay, two-phase transactions were
translated into regular transactions on the subscriber, and the GID
was not forwarded to it.
This patch allows PREPARE-time decoding of two-phase transactions (if
the output plugin supports this capability), in which case the
transactions are replayed at PREPARE and then committed later when
COMMIT PREPARED arrives.
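As a rough, standalone illustration of what "supports this capability" means here: the patch enables PREPARE-time decoding only when the plugin registers all three of prepare_cb, commit_prepared_cb and abort_prepared_cb, and rejects a plugin that registers only some of them, or that registers filter_prepare_cb without them. The sketch below mirrors that all-or-none rule with simplified stand-in types. DemoCallbacks, DemoTwoPhaseMode and demo_twophase_mode are hypothetical names for illustration, not PostgreSQL APIs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a plugin callback; the real
 * OutputPluginCallbacks struct holds typed function pointers. */
typedef void (*DemoCallback) (void);

typedef struct DemoCallbacks
{
    DemoCallback prepare_cb;
    DemoCallback commit_prepared_cb;
    DemoCallback abort_prepared_cb;
    DemoCallback filter_prepare_cb;    /* optional */
} DemoCallbacks;

typedef enum DemoTwoPhaseMode
{
    DEMO_TWOPHASE_DISABLED,    /* decode two-phase xacts at COMMIT PREPARED */
    DEMO_TWOPHASE_ENABLED,     /* decode two-phase xacts at PREPARE */
    DEMO_TWOPHASE_INVALID      /* inconsistent callback set */
} DemoTwoPhaseMode;

/* Classify a callback table using the all-or-none rule. */
static DemoTwoPhaseMode
demo_twophase_mode(const DemoCallbacks *cb)
{
    int registered;

    assert(cb != NULL);

    /* Count how many of the three mandatory callbacks are present. */
    registered = (cb->prepare_cb != NULL) +
        (cb->commit_prepared_cb != NULL) +
        (cb->abort_prepared_cb != NULL);

    /* Either all three or none; anything in between is a broken plugin. */
    if (registered != 0 && registered != 3)
        return DEMO_TWOPHASE_INVALID;

    /* The optional filter only makes sense with two-phase decoding. */
    if (registered == 0 && cb->filter_prepare_cb != NULL)
        return DEMO_TWOPHASE_INVALID;

    return (registered == 3) ? DEMO_TWOPHASE_ENABLED : DEMO_TWOPHASE_DISABLED;
}

/* Placeholder callback body used to populate a table. */
static void
demo_noop(void)
{
}
```

A plugin that sets none of the callbacks keeps the old behaviour, which is why existing output plugins should not need changes.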
On the subscriber, the transactions will be executed as two-phase
transactions, with the same GID. This is important for various
external transaction managers, which often encode information into
the GID itself.
All catalog access while decoding such 2PC transactions has to be
carried out using the LogicalLockTransaction/LogicalUnlockTransaction
APIs at the relevant locations. This includes the location where the
output plugin's change apply API is invoked. This protects any catalog
access inside the output plugin's change apply API from concurrent
rollback operations.
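The locking discipline described above can be sketched as a small single-process model. This is only an illustration under assumptions: DemoTxn, demo_lock_transaction, demo_unlock_transaction and demo_apply_change are hypothetical stand-ins, and the model assumes that the lock call fails once an abort is pending, in the same way that joining a decode group fails when decodeAbortPending is set. The real LogicalLockTransaction/LogicalUnlockTransaction work on shared PGPROC state under an LWLock, which is not modelled here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified view of the state the patch keeps per txn. */
typedef struct DemoTxn
{
    bool abort_pending;     /* models decodeAbortPending on the leader */
    bool locked;            /* models decodeLocked on the decoding backend */
    int  changes_applied;   /* how many changes reached the plugin */
} DemoTxn;

/* Try to enter a section that is protected against concurrent rollback. */
static bool
demo_lock_transaction(DemoTxn *txn)
{
    if (txn->abort_pending)
        return false;       /* transaction is going away: stop decoding */
    txn->locked = true;     /* an aborting leader would now wait for us */
    return true;
}

/* Leave the protected section so that a pending abort can proceed. */
static void
demo_unlock_transaction(DemoTxn *txn)
{
    assert(txn->locked);
    txn->locked = false;
}

/* Wrap one "change apply" call in lock/unlock; false means decoding stopped. */
static bool
demo_apply_change(DemoTxn *txn)
{
    if (!demo_lock_transaction(txn))
        return false;

    /* Catalog access and the plugin's change callback would run here. */
    txn->changes_applied++;

    demo_unlock_transaction(txn);
    return true;
}
```

The point of keeping the locked window this narrow is that a ROLLBACK PREPARED only has to wait for the change currently being applied, not for the whole transaction to finish decoding.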
Includes documentation changes.
---
doc/src/sgml/logicaldecoding.sgml | 128 +++++++++++++-
src/backend/access/transam/twophase.c | 8 +
src/backend/replication/logical/decode.c | 147 ++++++++++++++--
src/backend/replication/logical/logical.c | 202 ++++++++++++++++++++++
src/backend/replication/logical/reorderbuffer.c | 218 ++++++++++++++++++++++--
src/include/replication/logical.h | 7 +-
src/include/replication/output_plugin.h | 45 +++++
src/include/replication/reorderbuffer.h | 54 ++++++
8 files changed, 777 insertions(+), 32 deletions(-)
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 8db968641e..7e9213def2 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -385,7 +385,12 @@ typedef struct OutputPluginCallbacks
LogicalDecodeChangeCB change_cb;
LogicalDecodeTruncateCB truncate_cb;
LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeAbortCB abort_cb;
LogicalDecodeMessageCB message_cb;
+ LogicalDecodeFilterPrepareCB filter_prepare_cb;
+ LogicalDecodePrepareCB prepare_cb;
+ LogicalDecodeCommitPreparedCB commit_prepared_cb;
+ LogicalDecodeAbortPreparedCB abort_prepared_cb;
LogicalDecodeFilterByOriginCB filter_by_origin_cb;
LogicalDecodeShutdownCB shutdown_cb;
} OutputPluginCallbacks;
@@ -457,7 +462,13 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
never get
decoded. Successful savepoints are
folded into the transaction containing them in the order they were
- executed within that transaction.
+ executed within that transaction. A transaction that is prepared for
+ a two-phase commit using <command>PREPARE TRANSACTION</command> will
+ also be decoded if the output plugin provides the callbacks needed to
+ decode it. The transaction currently being decoded may be aborted
+ concurrently via a <command>ROLLBACK PREPARED</command> command.
+ In that case, the logical decoding of this transaction will be aborted
+ too.
</para>
<note>
@@ -558,6 +569,74 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
</para>
</sect3>
+ <sect3 id="logicaldecoding-output-plugin-prepare">
+ <title>Transaction Prepare Callback</title>
+
+ <para>
+ The optional <function>prepare_cb</function> callback is called whenever
+ a transaction which is prepared for two-phase commit has been
+ decoded. The <function>change_cb</function> callbacks for all modified
+ rows will have been called before this, if there have been any modified
+ rows.
+<programlisting>
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+ <title>Commit Prepared Transaction Callback</title>
+
+ <para>
+ The optional <function>commit_prepared_cb</function> callback is called whenever
+ a <command>COMMIT PREPARED</command> has been decoded. The <parameter>gid</parameter> field,
+ which is part of the <parameter>txn</parameter> parameter, can be used in this
+ callback.
+<programlisting>
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-abort-prepared">
+ <title>Rollback Prepared Transaction Callback</title>
+
+ <para>
+ The optional <function>abort_prepared_cb</function> callback is called whenever
+ a <command>ROLLBACK PREPARED</command> has been decoded. The <parameter>gid</parameter> field,
+ which is part of the <parameter>txn</parameter> parameter, can be used in this
+ callback.
+<programlisting>
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-abort">
+ <title>Transaction Abort Callback</title>
+
+ <para>
+ The required <function>abort_cb</function> callback is called whenever
+ a transaction abort has to be initiated. This can happen if we are
+ decoding a transaction that has been prepared for two-phase commit and
+ a concurrent rollback happens while we are decoding it. In such cases
+ it may make sense to check whether the rollback has already happened
+ before we start looking at the changes, so that decoding of such
+ transactions can be avoided completely.
+<programlisting>
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
<sect3 id="logicaldecoding-output-plugin-change">
<title>Change Callback</title>
@@ -567,7 +646,12 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
an <command>INSERT</command>, <command>UPDATE</command>,
or <command>DELETE</command>. Even if the original command modified
several rows at once the callback will be called individually for each
- row.
+ row. The <function>change_cb</function> callback may access system or
+ user catalog tables to aid in the process of outputting the row
+ modification details. When decoding a prepared (but not yet
+ committed) transaction or an uncommitted transaction, this change
+ callback is guaranteed safe access to catalog tables even if another
+ backend concurrently rolls back this very same transaction.
<programlisting>
typedef void (*LogicalDecodeChangeCB) (struct LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
@@ -644,6 +728,39 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
</para>
</sect3>
+ <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+ <title>Prepare Filter Callback</title>
+
+ <para>
+ The optional <function>filter_prepare_cb</function> callback
+ is called to determine whether data that is part of the current
+ two-phase commit transaction should be decoded at this prepare
+ stage, or later as a regular one-phase transaction at
+ <command>COMMIT PREPARED</command> time. To signal that decoding
+ at prepare time should be skipped, return <literal>true</literal>;
+ <literal>false</literal> otherwise. When the callback is not
+ defined, <literal>false</literal> is assumed (i.e. nothing is
+ filtered).
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+</programlisting>
+ The <parameter>ctx</parameter> parameter has the same contents
+ as for the other callbacks. The <parameter>txn</parameter> parameter
+ contains meta information about the transaction. The <parameter>xid</parameter>
+ contains the XID because <parameter>txn</parameter> can be NULL in some cases.
+ The <parameter>gid</parameter> is the identifier that later identifies this
+ transaction for <command>COMMIT PREPARED</command> or <command>ROLLBACK PREPARED</command>.
+ </para>
+ <para>
+ The callback has to provide the same static answer for a given combination of
+ <parameter>xid</parameter> and <parameter>gid</parameter> every time it is
+ called.
+ </para>
+ </sect3>
+
<sect3 id="logicaldecoding-output-plugin-message">
<title>Generic Message Callback</title>
@@ -665,7 +782,12 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
non-transactional and the XID was not assigned yet in the transaction
which logged the message. The <parameter>lsn</parameter> has WAL
location of the message. The <parameter>transactional</parameter> says
- if the message was sent as transactional or not.
+ if the message was sent as transactional or not. Similar to the change
+ callback, when decoding a prepared (but not yet committed)
+ transaction or an uncommitted transaction, this message callback
+ is guaranteed safe access to catalog tables even if another backend
+ concurrently rolls back this very same transaction.
+
The <parameter>prefix</parameter> is arbitrary null-terminated prefix
which can be used for identifying interesting messages for the current
plugin. And finally the <parameter>message</parameter> parameter holds
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index a9ef1b3d73..8d2bda3cde 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1522,6 +1522,14 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
ProcArrayRemove(proc, latestXid);
+ /*
+ * Coordinate with logical decoding backends that may be already
+ * decoding this prepared transaction. When aborting a transaction,
+ * we need to wait for all of them to leave the decoding group. If
+ * committing, we simply remove all members from the group.
+ */
+ LogicalDecodeRemoveTransaction(proc, isCommit);
+
/*
* In case we fail while running the callbacks, mark the gxact invalid so
* no one else will try to commit/rollback, and so it will be recycled if
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 59c003de9c..008958d35e 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -34,6 +34,7 @@
#include "access/xlogutils.h"
#include "access/xlogreader.h"
#include "access/xlogrecord.h"
+#include "access/twophase.h"
#include "catalog/pg_control.h"
@@ -73,6 +74,8 @@ static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_commit *parsed, TransactionId xid);
static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare * parsed);
/* common function to decode tuples */
static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -281,16 +284,33 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
break;
}
case XLOG_XACT_PREPARE:
+ {
+ xl_xact_parsed_prepare parsed;
- /*
- * Currently decoding ignores PREPARE TRANSACTION and will just
- * decode the transaction when the COMMIT PREPARED is sent or
- * throw away the transaction's contents when a ROLLBACK PREPARED
- * is received. In the future we could add code to expose prepared
- * transactions in the changestream allowing for a kind of
- * distributed 2PC.
- */
- ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+ /* check that output plugin is capable of twophase decoding */
+ if (!ctx->enable_twophase)
+ {
+ ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+ break;
+ }
+
+ /* ok, parse it */
+ ParsePrepareRecord(XLogRecGetInfo(buf->record),
+ XLogRecGetData(buf->record), &parsed);
+
+ /* does output plugin want this particular transaction? */
+ if (ctx->callbacks.filter_prepare_cb &&
+ ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+ parsed.twophase_gid))
+ {
+ ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+ buf->origptr);
+ break;
+ }
+
+ DecodePrepare(ctx, buf, &parsed);
+ break;
+ }
break;
default:
elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
@@ -633,9 +653,90 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
buf->origptr, buf->endptr);
}
+ /*
+ * Decide if we're processing COMMIT PREPARED, or a regular COMMIT.
+ * Regular commit simply triggers a replay of transaction changes from the
+ * reorder buffer. For COMMIT PREPARED, however, that replay already happened
+ * at PREPARE time, so we only need to notify the subscriber that the GID
+ * finally committed.
+ *
+ * For output plugins that do not support PREPARE-time decoding of
+ * two-phase transactions, we never even see the PREPARE and all two-phase
+ * transactions simply fall through to the second branch.
+ */
+ if (TransactionIdIsValid(parsed->twophase_xid) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder,
+ parsed->twophase_xid, parsed->twophase_gid))
+ {
+ Assert(xid == parsed->twophase_xid);
+ /* we are processing COMMIT PREPARED */
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid, true);
+ }
+ else
+ {
+ /* replay actions of all transaction + subtransactions in order */
+ ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn);
+ }
+}
+
+/*
+ * Decode PREPARE record. Similar logic as in COMMIT
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare * parsed)
+{
+ XLogRecPtr origin_lsn = parsed->origin_lsn;
+ TimestampTz commit_time = parsed->origin_timestamp;
+ RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+ int i;
+ TransactionId xid = parsed->twophase_xid;
+
+ /*
+ * Process invalidation messages, even if we're not interested in the
+ * transaction's contents, since the various caches need to always be
+ * consistent.
+ */
+ if (parsed->nmsgs > 0)
+ {
+ if (!ctx->fast_forward)
+ ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
+ parsed->nmsgs, parsed->msgs);
+ ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+ }
+
+ /*
+ * Tell the reorderbuffer about the surviving subtransactions. We need to
+ * do this because the main transaction itself has not committed since we
+ * are in the prepare phase right now. So we need to be sure the snapshot
+ * is set up correctly for the main transaction in case all changes
+ * happened in subtransactions.
+ */
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+ buf->origptr, buf->endptr);
+ }
+
+ if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+ (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+ ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+ {
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr);
+ }
+ ReorderBufferForget(ctx->reorder, xid, buf->origptr);
+
+ return;
+ }
+
/* replay actions of all transaction + subtransactions in order */
- ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
- commit_time, origin_id, origin_lsn);
+ ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn, parsed->twophase_gid);
}
/*
@@ -647,6 +748,30 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid)
{
int i;
+ XLogRecPtr origin_lsn = InvalidXLogRecPtr;
+ TimestampTz commit_time = 0;
+ RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ origin_lsn = parsed->origin_lsn;
+ commit_time = parsed->origin_timestamp;
+ }
+
+ /*
+ * If it's ROLLBACK PREPARED then handle it via callbacks.
+ */
+ if (TransactionIdIsValid(xid) &&
+ !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+ parsed->dbId == ctx->slot->data.database &&
+ !FilterByOrigin(ctx, origin_id) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
+ {
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid, false);
+ return;
+ }
for (i = 0; i < parsed->nsubxacts; i++)
{
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 073aa41be2..88feb98312 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -60,6 +60,16 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+static void abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change);
static void truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -127,6 +137,7 @@ StartupDecodingContext(List *output_plugin_options,
MemoryContext context,
old_context;
LogicalDecodingContext *ctx;
+ int twophase_callbacks;
/* shorter lines... */
slot = MyReplicationSlot;
@@ -187,8 +198,38 @@ StartupDecodingContext(List *output_plugin_options,
ctx->reorder->apply_change = change_cb_wrapper;
ctx->reorder->apply_truncate = truncate_cb_wrapper;
ctx->reorder->commit = commit_cb_wrapper;
+ ctx->reorder->abort = abort_cb_wrapper;
+ ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+ ctx->reorder->prepare = prepare_cb_wrapper;
+ ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+ ctx->reorder->abort_prepared = abort_prepared_cb_wrapper;
ctx->reorder->message = message_cb_wrapper;
+ /*
+ * Check that plugin implements all callbacks necessary to decode
+ * two-phase transactions - we either have to have all of them or none.
+ * The filter_prepare callback is optional, but can only be defined when
+ * two-phase decoding is enabled (i.e. the three other callbacks are
+ * defined).
+ */
+ twophase_callbacks = (ctx->callbacks.prepare_cb != NULL) +
+ (ctx->callbacks.commit_prepared_cb != NULL) +
+ (ctx->callbacks.abort_prepared_cb != NULL);
+
+ ctx->enable_twophase = (twophase_callbacks == 3);
+
+ /* Plugins with incorrect number of two-phase callbacks are broken. */
+ if ((twophase_callbacks != 3) && (twophase_callbacks != 0))
+ ereport(ERROR,
+ (errmsg("output plugin registered only %d two-phase callbacks",
+ twophase_callbacks)));
+
+ /* filter_prepare is optional, but requires two-phase decoding */
+ if ((ctx->callbacks.filter_prepare_cb != NULL) && (!ctx->enable_twophase))
+ ereport(ERROR,
+ (errmsg("output plugin does not support two-phase decoding, but "
+ "registered filter_prepare callback")));
+
ctx->out = makeStringInfo();
ctx->prepare_write = prepare_write;
ctx->write = do_write;
@@ -705,6 +746,122 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
error_context_stack = errcallback.previous;
}
+static void
+abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "abort";
+ state.report_location = txn->final_lsn; /* beginning of abort record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.abort_cb(ctx, txn, abort_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "prepare";
+ state.report_location = txn->final_lsn; /* beginning of prepare record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "commit_prepared";
+ state.report_location = txn->final_lsn; /* beginning of commit record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "abort_prepared";
+ state.report_location = txn->final_lsn; /* beginning of abort record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.abort_prepared_cb(ctx, txn, abort_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
static void
change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change)
@@ -782,6 +939,51 @@ truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
error_context_stack = errcallback.previous;
}
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+ bool ret;
+
+ /*
+ * Skip if decoding of twophase at PREPARE time is not enabled. In that
+ * case all twophase transactions are considered filtered out and will be
+ * applied as regular transactions at COMMIT PREPARED.
+ */
+ if (!ctx->enable_twophase)
+ return true;
+
+ /*
+ * The filter_prepare callback is optional. When not supplied, all
+ * prepared transactions should go through.
+ */
+ if (!ctx->callbacks.filter_prepare_cb)
+ return false;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "filter_prepare";
+ state.report_location = InvalidXLogRecPtr;
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = false;
+
+ /* do the actual work: call callback */
+ ret = ctx->callbacks.filter_prepare_cb(ctx, txn, xid, gid);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+
+ return ret;
+}
+
bool
filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
{
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 133749110e..20e75bbeb9 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -337,6 +337,11 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
}
/* free data that's contained */
+ if (txn->gid != NULL)
+ {
+ pfree(txn->gid);
+ txn->gid = NULL;
+ }
if (txn->tuplecid_hash != NULL)
{
@@ -1388,25 +1393,18 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
* and subtransactions (using a k-way merge) and replay the changes in lsn
* order.
*/
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
- XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
- TimestampTz commit_time,
- RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+ ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn)
{
- ReorderBufferTXN *txn;
volatile Snapshot snapshot_now;
volatile CommandId command_id = FirstCommandId;
bool using_subtxn;
ReorderBufferIterTXNState *volatile iterstate = NULL;
- txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
- false);
-
- /* unknown transaction, nothing to replay */
- if (txn == NULL)
- return;
-
txn->final_lsn = commit_lsn;
txn->end_lsn = end_lsn;
txn->commit_time = commit_time;
@@ -1483,8 +1481,12 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
case REORDER_BUFFER_CHANGE_DELETE:
Assert(snapshot_now);
+ /* Lock transaction before catalog access */
+ if (!LogicalLockTransaction(txn))
+ goto change_cleanup;
reloid = RelidByRelfilenode(change->data.tp.relnode.spcNode,
change->data.tp.relnode.relNode);
+ LogicalUnlockTransaction(txn);
/*
* Catalog tuple without data, emitted while catalog was
@@ -1499,8 +1501,14 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
relpathperm(change->data.tp.relnode,
MAIN_FORKNUM));
+ /* Lock transaction before catalog access */
+ if (!LogicalLockTransaction(txn))
+ goto change_cleanup;
+
relation = RelationIdGetRelation(reloid);
+ LogicalUnlockTransaction(txn);
+
if (relation == NULL)
elog(ERROR, "could not open relation with OID %u (for filenode \"%s\")",
reloid,
@@ -1529,8 +1537,23 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
/* user-triggered change */
if (!IsToastRelation(relation))
{
+ /*
+ * Output plugins can access catalog metadata and we
+ * do not have any control over that. We could ask
+ * them to call
+ * LogicalLockTransaction/LogicalUnlockTransaction
+ * APIs themselves, but that leads to unnecessary
+ * complications and expectations from plugin
+ * writers. We avoid this by calling these APIs
+ * here, thereby ensuring that the in-progress
+ * transaction will be around for the duration of
+ * the apply_change call below
+ */
+ if (!LogicalLockTransaction(txn))
+ goto change_cleanup;
ReorderBufferToastReplace(rb, txn, relation, change);
rb->apply_change(rb, txn, relation, change);
+ LogicalUnlockTransaction(txn);
/*
* Only clear reassembled toast chunks if we're sure
@@ -1635,10 +1658,13 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
}
case REORDER_BUFFER_CHANGE_MESSAGE:
+ if (!LogicalLockTransaction(txn))
+ goto change_cleanup;
rb->message(rb, txn, change->lsn, true,
change->data.msg.prefix,
change->data.msg.message_size,
change->data.msg.message);
+ LogicalUnlockTransaction(txn);
break;
case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1708,7 +1734,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
break;
}
}
-
+change_cleanup:
/*
* There's a speculative insertion remaining, just clean in up, it
* can't have been successful, otherwise we'd gotten a confirmation
@@ -1724,8 +1750,26 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
ReorderBufferIterTXNFinish(rb, iterstate);
iterstate = NULL;
- /* call commit callback */
- rb->commit(rb, txn, commit_lsn);
+ /*
+ * Call abort/commit/prepare callback, depending on the transaction
+ * state.
+ *
+ * If the transaction aborted during apply (which currently can happen
+ * only for prepared transactions), simply call the abort callback.
+ *
+ * Otherwise call either PREPARE (for twophase transactions) or COMMIT
+ * (for regular ones).
+ */
+ if (rbtxn_rollback(txn))
+ rb->abort(rb, txn, commit_lsn);
+ else if (rbtxn_prepared(txn))
+ rb->prepare(rb, txn, commit_lsn);
+ else
+ rb->commit(rb, txn, commit_lsn);
+
+ /* remove ourselves from the decodeGroupLeader's decode group */
+ if (MyProc->decodeGroupLeader)
+ RemoveDecodeGroupMember(MyProc->decodeGroupLeader);
/* this is just a sanity check against bad output plugin behaviour */
if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -1752,7 +1796,12 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
if (snapshot_now->copied)
ReorderBufferFreeSnap(rb, snapshot_now);
- /* remove potential on-disk data, and deallocate */
+ /*
+ * remove potential on-disk data, and deallocate.
+ *
+ * We remove it even for prepared transactions (GID is enough to
+ * commit/abort those later).
+ */
ReorderBufferCleanupTXN(rb, txn);
}
PG_CATCH();
@@ -1786,6 +1835,141 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
PG_END_TRY();
}
+
+/*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+ return rb->filter_prepare(rb, txn, xid, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferCommitInternal()
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /* unknown transaction, nothing to replay */
+ if (txn == NULL)
+ return;
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a twophase transaction. It calls ReorderBufferCommitInternal()
+ * since all prepared transactions need to be decoded at PREPARE time.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /* unknown transaction, nothing to replay */
+ if (txn == NULL)
+ return;
+
+ txn->txn_flags |= RBTXN_PREPARE;
+ txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+ strcpy(txn->gid, gid);
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Check whether this transaction was sent as prepared to subscribers.
+ * Called while handling commit|abort prepared.
+ */
+bool
+ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+ const char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /*
+ * Always call the prepare filter. It's the job of the prepare filter to
+ * give us the *same* response for a given xid across multiple calls
+ * (including ones on restart)
+ */
+ return !(rb->filter_prepare(rb, txn, xid, gid));
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ABORT PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit)
+{
+ ReorderBufferTXN *txn;
+
+ /*
+ * The transaction may or may not exist (for example, after a restart).
+ * Either way, a 2PC transaction carries no buffered changes at this
+ * point, so allow it to be created below.
+ */
+ txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+ true);
+
+ txn->final_lsn = commit_lsn;
+ txn->end_lsn = end_lsn;
+ txn->commit_time = commit_time;
+ txn->origin_id = origin_id;
+ txn->origin_lsn = origin_lsn;
+ /* this txn is obviously prepared */
+ txn->txn_flags |= RBTXN_PREPARE;
+ txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+ strcpy(txn->gid, gid);
+
+ if (is_commit)
+ {
+ txn->txn_flags |= RBTXN_COMMIT_PREPARED;
+ rb->commit_prepared(rb, txn, commit_lsn);
+ }
+ else
+ {
+ txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
+ rb->abort_prepared(rb, txn, commit_lsn);
+ }
+
+ /* cleanup: make sure there's no cache pollution */
+ ReorderBufferExecuteInvalidations(rb, txn);
+ ReorderBufferCleanupTXN(rb, txn);
+}
+
/*
* Abort a transaction that possibly has previous changes. Needs to be first
* called for subtransactions and then for the toplevel xid.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 069eb7a272..ea8af26a09 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -47,7 +47,7 @@ typedef struct LogicalDecodingContext
/*
* Marks the logical decoding context as fast forward decoding one. Such a
- * context does not have plugin loaded so most of the the following
+ * context does not have plugin loaded so most of the following
* properties are unused.
*/
bool fast_forward;
@@ -89,6 +89,11 @@ typedef struct LogicalDecodingContext
bool prepared_write;
XLogRecPtr write_location;
TransactionId write_xid;
+
+ /*
+ * Capabilities of the output plugin.
+ */
+ bool enable_twophase;
} LogicalDecodingContext;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 1ee0a56f03..e4070aa8a2 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -77,6 +77,46 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+/*
+ * Called for an implicit ABORT of a transaction.
+ */
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+/*
+ * Called before decoding a PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to the prepare and
+ * commit_prepared/abort_prepared callbacks, or held back until COMMIT
+ * PREPARED and sent as a regular transaction. Returning true filters the
+ * PREPARE out, i.e. selects the latter.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
/*
* Called for the generic logical decoding messages.
*/
@@ -109,7 +149,12 @@ typedef struct OutputPluginCallbacks
LogicalDecodeChangeCB change_cb;
LogicalDecodeTruncateCB truncate_cb;
LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeAbortCB abort_cb;
LogicalDecodeMessageCB message_cb;
+ LogicalDecodeFilterPrepareCB filter_prepare_cb;
+ LogicalDecodePrepareCB prepare_cb;
+ LogicalDecodeCommitPreparedCB commit_prepared_cb;
+ LogicalDecodeAbortPreparedCB abort_prepared_cb;
LogicalDecodeFilterByOriginCB filter_by_origin_cb;
LogicalDecodeShutdownCB shutdown_cb;
} OutputPluginCallbacks;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 473ec85a7e..8e1fa08b58 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
#define REORDERBUFFER_H
#include "access/htup_details.h"
+#include "access/twophase.h"
#include "lib/ilist.h"
#include "storage/sinval.h"
#include "utils/hsearch.h"
@@ -195,6 +196,9 @@ typedef struct ReorderBufferTXN
/* Do we know this is a subxact? Xid of top-level txn if so */
TransactionId toplevel_xid;
+ /* In case of 2PC we need to pass GID to output plugin */
+ char *gid;
+
/*
* LSN of the first data carrying, WAL record with knowledge about this
* xid. This is allowed to *not* be first record adorned with this xid, if
@@ -339,6 +343,37 @@ typedef void (*ReorderBufferCommitCB) (
ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+/* abort callback signature */
+typedef void (*ReorderBufferAbortCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+typedef bool (*ReorderBufferFilterPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/* abort prepared callback signature */
+typedef void (*ReorderBufferAbortPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+
/* message callback signature */
typedef void (*ReorderBufferMessageCB) (
ReorderBuffer *rb,
@@ -384,6 +419,11 @@ struct ReorderBuffer
ReorderBufferApplyChangeCB apply_change;
ReorderBufferApplyTruncateCB apply_truncate;
ReorderBufferCommitCB commit;
+ ReorderBufferAbortCB abort;
+ ReorderBufferFilterPrepareCB filter_prepare;
+ ReorderBufferPrepareCB prepare;
+ ReorderBufferCommitPreparedCB commit_prepared;
+ ReorderBufferAbortPreparedCB abort_prepared;
ReorderBufferMessageCB message;
/*
@@ -431,6 +471,11 @@ void ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot
void ReorderBufferCommit(ReorderBuffer *, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit);
void ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
void ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
@@ -454,6 +499,15 @@ void ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
bool ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
bool ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
+bool ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+ const char *gid);
+bool ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+ const char *gid);
+void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid);
ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
--
2.15.2 (Apple Git-101.1)
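For reviewers, the decision order in filter_prepare_cb_wrapper() can be tried outside the backend with the sketch below. It is only an illustration under stated assumptions: DemoContext, DemoFilterPrepareCB, demo_filter_by_prefix() and the "skip_" prefix rule are hypothetical stand-ins, not part of the patch or of the output plugin API, and the error-context handling is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the plugin's filter callback type. */
typedef bool (*DemoFilterPrepareCB) (const char *gid);

/* Hypothetical stand-in for the two context fields the wrapper consults. */
typedef struct DemoContext
{
	bool		enable_twophase;
	DemoFilterPrepareCB filter_prepare_cb;	/* optional, may be NULL */
} DemoContext;

/* Example plugin policy (an assumption): filter GIDs starting with "skip_". */
static bool
demo_filter_by_prefix(const char *gid)
{
	return strncmp(gid, "skip_", 5) == 0;
}

/*
 * Returns true when the PREPARE is filtered out, i.e. the transaction is
 * decoded as a regular transaction at COMMIT PREPARED time.
 */
static bool
demo_filter_prepare(const DemoContext *ctx, const char *gid)
{
	assert(ctx != NULL && gid != NULL);

	/* Two-phase decoding disabled: everything counts as filtered. */
	if (!ctx->enable_twophase)
		return true;

	/* The callback is optional: without it, every PREPARE goes through. */
	if (ctx->filter_prepare_cb == NULL)
		return false;

	/* Otherwise the plugin decides. */
	return ctx->filter_prepare_cb(gid);
}
```

ReorderBufferTxnIsPrepared() simply negates this answer when COMMIT/ROLLBACK PREPARED is decoded, so whatever policy a plugin implements has to give the same result for a given xid and gid on every call, including after a restart.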
Attachment: 0004-Teach-test_decoding-plugin-to-work-with-2PC.patch (application/octet-stream)
From 5e27e1e2e6e80393e9a7cb0fa4deebe323f17b16 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Wed, 13 Jun 2018 16:31:15 +0530
Subject: [PATCH 4/5] Teach test_decoding plugin to work with 2PC
Includes a new option "enable_twophase". Depending on this option's
value, PREPARE TRANSACTION will either be decoded at prepare time or
treated as a single-phase commit later.
---
contrib/test_decoding/expected/prepared.out | 257 +++++++++++++++++++++++++---
contrib/test_decoding/sql/prepared.sql | 84 ++++++++-
contrib/test_decoding/test_decoding.c | 137 +++++++++++++++
3 files changed, 451 insertions(+), 27 deletions(-)
diff --git a/contrib/test_decoding/expected/prepared.out b/contrib/test_decoding/expected/prepared.out
index 46e915d4ff..5df7b7ff20 100644
--- a/contrib/test_decoding/expected/prepared.out
+++ b/contrib/test_decoding/expected/prepared.out
@@ -6,19 +6,82 @@ SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_d
init
(1 row)
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc', 'test_decoding');
+ ?column?
+----------
+ init
+(1 row)
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
-- test simple successful use of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (1);
PREPARE TRANSACTION 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+------
+(0 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ PREPARE TRANSACTION 'test_prepared#1'
+(3 rows)
+
COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ COMMIT
+(3 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
INSERT INTO test_prepared1 VALUES (2);
-- test abort of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (3);
PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+(3 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(6 rows)
+
ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+------
+(0 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
INSERT INTO test_prepared1 VALUES (4);
-- test prepared xact containing ddl
BEGIN;
@@ -26,45 +89,193 @@ INSERT INTO test_prepared1 VALUES (5);
ALTER TABLE test_prepared1 ADD COLUMN data text;
INSERT INTO test_prepared1 VALUES (6, 'frakbar');
PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
-INSERT INTO test_prepared2 VALUES (7);
-COMMIT PREPARED 'test_prepared#3';
--- make sure stuff still works
-INSERT INTO test_prepared1 VALUES (8);
-INSERT INTO test_prepared2 VALUES (9);
--- cleanup
-DROP TABLE test_prepared1;
-DROP TABLE test_prepared2;
--- show results
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'test_prepared1'::regclass;
+ relation | locktype | mode
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4
+ COMMIT
+(3 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
data
-------------------------------------------------------------------------
BEGIN
- table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:4
COMMIT
BEGIN
- table public.test_prepared1: INSERT: id[integer]:2
- COMMIT
+ table public.test_prepared1: INSERT: id[integer]:5
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(7 rows)
+
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------
BEGIN
- table public.test_prepared1: INSERT: id[integer]:4
+ table public.test_prepared2: INSERT: id[integer]:7
COMMIT
+(3 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+----------------------------------------------------
BEGIN
table public.test_prepared2: INSERT: id[integer]:7
COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+-------------------------------------------------------------------------
BEGIN
table public.test_prepared1: INSERT: id[integer]:5
table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
COMMIT
+(4 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#3'
+(1 row)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (8);
+INSERT INTO test_prepared2 VALUES (9);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+--------------------------------------------------------------------
BEGIN
table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
COMMIT
BEGIN
table public.test_prepared2: INSERT: id[integer]:9
COMMIT
-(22 rows)
+(6 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:9
+ COMMIT
+(6 rows)
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'pg_class'::regclass;
+ relation | locktype | mode
+----------+----------+------
+(0 rows)
+
+-- Shouldn't see anything with 2pc decoding off
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+-----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:12
+ COMMIT
+(3 rows)
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:12
+ PREPARE TRANSACTION 'test_prepared_lock2'
+ COMMIT PREPARED 'test_prepared_lock2'
+(8 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- Both will work normally after we commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ COMMIT
+(4 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+--------------------------------------
+ COMMIT PREPARED 'test_prepared_lock'
+(1 row)
+
+-- test savepoints
+BEGIN;
+SAVEPOINT test_savepoint;
+CREATE TABLE test_prepared_savepoint (a int);
+PREPARE TRANSACTION 'test_prepared_savepoint';
+COMMIT PREPARED 'test_prepared_savepoint';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+-------------------------------------------
+ COMMIT PREPARED 'test_prepared_savepoint'
+(1 row)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+------
+(0 rows)
+
+-- cleanup
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+------
+(0 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+------
+(0 rows)
SELECT pg_drop_replication_slot('regression_slot');
pg_drop_replication_slot
@@ -72,3 +283,9 @@ SELECT pg_drop_replication_slot('regression_slot');
(1 row)
+SELECT pg_drop_replication_slot('regression_slot_2pc');
+ pg_drop_replication_slot
+--------------------------
+
+(1 row)
+
diff --git a/contrib/test_decoding/sql/prepared.sql b/contrib/test_decoding/sql/prepared.sql
index e72639767e..e8eb8ad8d6 100644
--- a/contrib/test_decoding/sql/prepared.sql
+++ b/contrib/test_decoding/sql/prepared.sql
@@ -1,22 +1,31 @@
-- predictability
SET synchronous_commit = on;
SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc', 'test_decoding');
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
-- test simple successful use of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (1);
PREPARE TRANSACTION 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
INSERT INTO test_prepared1 VALUES (2);
-- test abort of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (3);
PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
INSERT INTO test_prepared1 VALUES (4);
@@ -27,24 +36,85 @@ ALTER TABLE test_prepared1 ADD COLUMN data text;
INSERT INTO test_prepared1 VALUES (6, 'frakbar');
PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'test_prepared1'::regclass;
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
-- make sure stuff still works
INSERT INTO test_prepared1 VALUES (8);
INSERT INTO test_prepared2 VALUES (9);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'pg_class'::regclass;
+
+-- Shouldn't see anything with 2pc decoding off
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+RESET statement_timeout;
+
+COMMIT PREPARED 'test_prepared_lock';
+
+-- Both will work normally after we commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+
+-- test savepoints
+BEGIN;
+SAVEPOINT test_savepoint;
+CREATE TABLE test_prepared_savepoint (a int);
+PREPARE TRANSACTION 'test_prepared_savepoint';
+COMMIT PREPARED 'test_prepared_savepoint';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
-- cleanup
DROP TABLE test_prepared1;
DROP TABLE test_prepared2;
--- show results
+-- show results. There should be nothing to show
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+
SELECT pg_drop_replication_slot('regression_slot');
+SELECT pg_drop_replication_slot('regression_slot_2pc');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 1c439b57b0..140010a8b1 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -36,6 +36,7 @@ typedef struct
bool skip_empty_xacts;
bool xact_wrote_changes;
bool only_local;
+ bool enable_twophase;
} TestDecodingData;
static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -49,6 +50,8 @@ static void pg_output_begin(LogicalDecodingContext *ctx,
bool last_write);
static void pg_decode_commit_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pg_decode_abort_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr abort_lsn);
static void pg_decode_change(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, Relation rel,
ReorderBufferChange *change);
@@ -62,6 +65,18 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr message_lsn,
bool transactional, const char *prefix,
Size sz, const char *message);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
void
_PG_init(void)
@@ -80,9 +95,14 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
cb->change_cb = pg_decode_change;
cb->truncate_cb = pg_decode_truncate;
cb->commit_cb = pg_decode_commit_txn;
+ cb->abort_cb = pg_decode_abort_txn;
cb->filter_by_origin_cb = pg_decode_filter;
cb->shutdown_cb = pg_decode_shutdown;
cb->message_cb = pg_decode_message;
+ cb->filter_prepare_cb = pg_decode_filter_prepare;
+ cb->prepare_cb = pg_decode_prepare_txn;
+ cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+ cb->abort_prepared_cb = pg_decode_abort_prepared_txn;
}
@@ -102,6 +122,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
data->include_timestamp = false;
data->skip_empty_xacts = false;
data->only_local = false;
+ data->enable_twophase = false;
ctx->output_plugin_private = data;
@@ -183,6 +204,16 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
errmsg("could not parse value \"%s\" for parameter \"%s\"",
strVal(elem->arg), elem->defname)));
}
+ else if (strcmp(elem->defname, "enable-twophase") == 0)
+ {
+ if (elem->arg == NULL)
+ data->enable_twophase = true;
+ else if (!parse_bool(strVal(elem->arg), &data->enable_twophase))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("could not parse value \"%s\" for parameter \"%s\"",
+ strVal(elem->arg), elem->defname)));
+ }
else
{
ereport(ERROR,
@@ -251,6 +282,112 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
OutputPluginWrite(ctx, true);
}
+/* ABORT callback */
+static void
+pg_decode_abort_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+ if (data->include_xids)
+ appendStringInfo(ctx->out, "ABORT %u", txn->xid);
+ else
+ appendStringInfoString(ctx->out, "ABORT");
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* Filter out two-phase transactions, if decoding not enabled. */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ /* treat all transactions as one-phase */
+ if (!data->enable_twophase)
+ return true;
+
+ return false;
+}
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* ABORT PREPARED callback */
+static void
+pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
static bool
pg_decode_filter(LogicalDecodingContext *ctx,
RepOriginId origin_id)
--
2.15.2 (Apple Git-101.1)
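As a rough illustration of the filter_prepare contract that pg_decode_filter_prepare relies on in the patch above: returning true filters the PREPARE out, so the transaction is treated as one-phase, while returning false lets it be decoded at PREPARE time. The sketch below is standalone C and does not use the PostgreSQL headers. DemoPluginData, demo_filter_prepare and demo_prepare_action are hypothetical names made up for this example.

```c
#include <stdbool.h>

/* Hypothetical stand-in for TestDecodingData; only the field needed here. */
typedef struct DemoPluginData
{
	bool		enable_twophase;
} DemoPluginData;

/*
 * Same decision as pg_decode_filter_prepare: true means "filter the PREPARE
 * out and treat the transaction as one-phase", false means "decode it".
 */
static bool
demo_filter_prepare(const DemoPluginData *data)
{
	return !data->enable_twophase;
}

/* Describe what a consumer of the slot would observe for this setting. */
static const char *
demo_prepare_action(const DemoPluginData *data)
{
	if (demo_filter_prepare(data))
		return "deferred until COMMIT PREPARED";
	return "decoded at PREPARE";
}
```

With enable_twophase left at its default of false, demo_prepare_action reports the deferred case. That matches the regression test's expectation that the slot without 'enable-twophase' shows nothing until the prepared transaction commits.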
0005-OPTIONAL-Additional-test-case-to-demonstrate-decoding-rollbac.patchapplication/octet-stream; name=0005-OPTIONAL-Additional-test-case-to-demonstrate-decoding-rollbac.patchDownload
From 2f89cdc0e23b76f0bf7310e86aba958acbfdca80 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Wed, 13 Jun 2018 16:32:16 +0530
Subject: [PATCH 5/5] Additional test case to demonstrate decoding/rollback
interlocking
Introduce a decode-delay parameter in the test_decoding plugin. The
plugin sleeps for the given number of seconds on each change while
holding the lock taken by LogicalLockTransaction. A concurrent rollback
is fired off in the meantime, which aborts that transaction.
---
contrib/test_decoding/Makefile | 5 +-
contrib/test_decoding/t/001_twophase.pl | 101 ++++++++++++++++++++++++
contrib/test_decoding/test_decoding.c | 28 +++++++
src/backend/replication/logical/reorderbuffer.c | 5 ++
4 files changed, 138 insertions(+), 1 deletion(-)
create mode 100644 contrib/test_decoding/t/001_twophase.pl
diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index afcab930f7..3f0b1c6ebd 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -26,7 +26,7 @@ installcheck:;
# installation, allow to do so, but only if requested explicitly.
installcheck-force: regresscheck-install-force isolationcheck-install-force
-check: regresscheck isolationcheck
+check: regresscheck isolationcheck 2pc-check
submake-regress:
$(MAKE) -C $(top_builddir)/src/test/regress all
@@ -67,3 +67,6 @@ isolationcheck-install-force: all | submake-isolation submake-test_decoding temp
isolationcheck isolationcheck-install-force
temp-install: EXTRA_INSTALL=contrib/test_decoding
+
+2pc-check: temp-install
+ $(prove_check)
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000000..f154c89908
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,101 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+ 'postgresql.conf', qq(
+ max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+ "INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+ "SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+$node_logical->safe_psql('postgres',
+ "SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot2', 'test_decoding');");
+
+# This test exercises a concurrent abort while logical decoding is in
+# progress. The decode-delay option makes the plugin sleep for that many
+# seconds on each decoded change, with the transaction still locked via
+# LogicalLockTransaction. While the delayed decode is running, we issue a
+# ROLLBACK PREPARED from another session. Because the lock is held, the
+# rollback waits until the decoding backends call LogicalUnlockTransaction.
+# Decoding stops right after that, so the next pg_logical_slot_get_changes
+# call should show only a few records from the two-phase transaction.
+#
+# We use two slots to test multiple decoding backends here
+
+$node_logical->safe_psql('postgres', "
+ BEGIN;
+ INSERT INTO tab VALUES (11);
+ INSERT INTO tab VALUES (12);
+ ALTER TABLE tab ADD COLUMN b INT;
+ PREPARE TRANSACTION 'test_prepared_tab';");
+# start decoding the above with decode-delay in the background.
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should only decode 1 INSERT record and should include
+# an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1', 'decode-delay', '3');\" \&");
+
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot2', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1', 'decode-delay', '3');\" \&");
+
+# sleep for a little while (shorter than decode-delay)
+$node_logical->safe_psql('postgres', "select pg_sleep(1)");
+
+# rollback the prepared transaction whose first record is being decoded
+# after sleeping for decode-delay time
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# wait for decoding to stop
+$node_logical->psql('postgres', "select pg_sleep(4)");
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot2', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+
+# check for occurrence of log about waiting backends
+my $output_file = slurp_file($node_logical->logfile());
+my $waiting_str = "Waiting for backends to abort";
+like($output_file, qr/$waiting_str/, "Waiting log found in server log");
+
+# check for occurrence of log about stopping decoding
+my $abort_str = "stopping decoding of test_prepared_tab ";
+like($output_file, qr/$abort_str/, "ABORT found in server log");
+
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+ BEGIN;
+ INSERT INTO tab VALUES (11);
+ INSERT INTO tab VALUES (12);
+ ALTER TABLE tab ADD COLUMN b INT;
+ INSERT INTO tab VALUES (13, 11);
+ PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot2', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit post the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot2', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot2');");
+$node_logical->stop('fast');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 140010a8b1..ed0dbff8e2 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -37,6 +37,7 @@ typedef struct
bool xact_wrote_changes;
bool only_local;
bool enable_twophase;
+ int decode_delay; /* seconds to sleep after every change record */
} TestDecodingData;
static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -123,6 +124,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
data->skip_empty_xacts = false;
data->only_local = false;
data->enable_twophase = false;
+ data->decode_delay = 0;
ctx->output_plugin_private = data;
@@ -214,6 +216,21 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
errmsg("could not parse value \"%s\" for parameter \"%s\"",
strVal(elem->arg), elem->defname)));
}
+ else if (strcmp(elem->defname, "decode-delay") == 0)
+ {
+ if (elem->arg == NULL)
+ data->decode_delay = 2; /* default to 2 seconds */
+ else
+ data->decode_delay = pg_atoi(strVal(elem->arg),
+ sizeof(int), 0);
+
+ if (data->decode_delay <= 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("Specify positive value for parameter \"%s\","
+ " you specified \"%s\"",
+ elem->defname, strVal(elem->arg))));
+ }
else
{
ereport(ERROR,
@@ -553,6 +570,17 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
}
data->xact_wrote_changes = true;
+ /*
+ * if decode_delay is specified, sleep. Note that this
+ * happens with LogicalLockTransaction held from the
+ * decoding infrastructure
+ */
+ if (data->decode_delay > 0)
+ {
+ elog(LOG, "sleeping for %d seconds", data->decode_delay);
+ pg_usleep(data->decode_delay * 1000000L);
+ }
+
class_form = RelationGetForm(relation);
tupdesc = RelationGetDescr(relation);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 20e75bbeb9..8680d1560d 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1483,7 +1483,12 @@ ReorderBufferCommitInternal(ReorderBufferTXN *txn,
/* Lock transaction before catalog access */
if (!LogicalLockTransaction(txn))
+ {
+ elog(LOG, "stopping decoding of %s (%u)",
+ txn->gid[0] != '\0'? txn->gid:"",
+ txn->xid);
goto change_cleanup;
+ }
reloid = RelidByRelfilenode(change->data.tp.relnode.spcNode,
change->data.tp.relnode.relNode);
LogicalUnlockTransaction(txn);
--
2.15.2 (Apple Git-101.1)
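The interlocking that this test patch demonstrates can be sketched in isolation. The following is a single-threaded simulation, not the real implementation: DemoTxn, demo_lock_transaction, demo_unlock_transaction and demo_decode are hypothetical stand-ins, and the "concurrent" rollback is modelled by setting a flag between changes. In the actual patch set the aborting backend waits for the decoding backends rather than flipping a local flag.

```c
#include <stdbool.h>

/* Hypothetical stand-in for ReorderBufferTXN; not the real struct. */
typedef struct DemoTxn
{
	bool		abort_pending;	/* set once a rollback has been requested */
	int			lock_depth;		/* > 0 while "catalog access" is in progress */
} DemoTxn;

/* Mock of LogicalLockTransaction: false means "stop decoding this txn". */
static bool
demo_lock_transaction(DemoTxn *txn)
{
	if (txn->abort_pending)
		return false;
	txn->lock_depth++;
	return true;
}

/* Mock of LogicalUnlockTransaction. */
static void
demo_unlock_transaction(DemoTxn *txn)
{
	txn->lock_depth--;
}

/*
 * Decode up to nchanges changes. A rollback is requested once abort_after
 * changes have been decoded; pass -1 for "never". Returns the number of
 * changes that were decoded before the loop stopped.
 */
static int
demo_decode(DemoTxn *txn, int nchanges, int abort_after)
{
	int			decoded = 0;

	for (int i = 0; i < nchanges; i++)
	{
		/* lock before any catalog access, as in ReorderBufferCommitInternal */
		if (!demo_lock_transaction(txn))
			break;				/* corresponds to "goto change_cleanup" */

		/* ... the catalog lookup for this change would happen here ... */

		demo_unlock_transaction(txn);
		decoded++;

		/* simulate the rollback arriving between two changes */
		if (decoded == abort_after)
			txn->abort_pending = true;
	}
	return decoded;
}
```

Calling demo_decode on a three-change transaction with abort_after set to 1 decodes a single change and then stops, which is the shape of result the TAP test looks for (one INSERT decoded, then decoding stops).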
Hi,
Anyways, will now wait for the next commitfest/opportunity to try to
get this in.
It looks like this patch should be in the Needs Review state so I have
done that and moved it to the next CF.
PFA, patchset updated to take care of bitrot.
For some reason, the 3rd patch was missing a few lines. Revised patch
set attached.
Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
0001-Cleaning-up-of-flags-in-ReorderBufferTXN-structure.patchapplication/octet-stream; name=0001-Cleaning-up-of-flags-in-ReorderBufferTXN-structure.patchDownload
From 11d7579312e3fd3cee5037229aa795af50822631 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Wed, 13 Jun 2018 16:15:24 +0530
Subject: [PATCH 1/5] Cleaning up of flags in ReorderBufferTXN structure
---
src/backend/replication/logical/reorderbuffer.c | 35 ++++++++++++-------------
src/include/replication/reorderbuffer.h | 33 ++++++++++++++---------
2 files changed, 37 insertions(+), 31 deletions(-)
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 5792cd14a0..133749110e 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -643,8 +643,7 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
Assert(prev_first_lsn < cur_txn->first_lsn);
/* known-as-subtxn txns must not be listed */
- Assert(!cur_txn->is_known_as_subxact);
-
+ Assert(!rbtxn_is_known_subxact(cur_txn));
prev_first_lsn = cur_txn->first_lsn;
}
@@ -663,7 +662,7 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
Assert(prev_base_snap_lsn < cur_txn->base_snapshot_lsn);
/* known-as-subtxn txns must not be listed */
- Assert(!cur_txn->is_known_as_subxact);
+ Assert(!rbtxn_is_known_subxact(cur_txn));
prev_base_snap_lsn = cur_txn->base_snapshot_lsn;
}
@@ -686,7 +685,7 @@ ReorderBufferGetOldestTXN(ReorderBuffer *rb)
txn = dlist_head_element(ReorderBufferTXN, node, &rb->toplevel_by_lsn);
- Assert(!txn->is_known_as_subxact);
+ Assert(!rbtxn_is_known_subxact(txn));
Assert(txn->first_lsn != InvalidXLogRecPtr);
return txn;
}
@@ -746,7 +745,7 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
if (!new_sub)
{
- if (subtxn->is_known_as_subxact)
+ if (rbtxn_is_known_subxact(subtxn))
{
/* already associated, nothing to do */
return;
@@ -762,7 +761,7 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
}
}
- subtxn->is_known_as_subxact = true;
+ subtxn->txn_flags |= RBTXN_IS_SUBXACT;
subtxn->toplevel_xid = xid;
Assert(subtxn->nsubtxns == 0);
@@ -972,7 +971,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
ReorderBufferChange *cur_change;
- if (txn->serialized)
+ if (rbtxn_is_serialized(txn))
{
/* serialize remaining changes */
ReorderBufferSerializeTXN(rb, txn);
@@ -1001,7 +1000,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
ReorderBufferChange *cur_change;
- if (cur_txn->serialized)
+ if (rbtxn_is_serialized(cur_txn))
{
/* serialize remaining changes */
ReorderBufferSerializeTXN(rb, cur_txn);
@@ -1167,7 +1166,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
* they originally were happening inside another subtxn, so we won't
* ever recurse more than one level deep here.
*/
- Assert(subtxn->is_known_as_subxact);
+ Assert(rbtxn_is_known_subxact(subtxn));
Assert(subtxn->nsubtxns == 0);
ReorderBufferCleanupTXN(rb, subtxn);
@@ -1208,7 +1207,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
/*
* Remove TXN from its containing list.
*
- * Note: if txn->is_known_as_subxact, we are deleting the TXN from its
+ * Note: if txn is known as subxact, we are deleting the TXN from its
* parent's list of known subxacts; this leaves the parent's nsubxacts
* count too high, but we don't care. Otherwise, we are deleting the TXN
* from the LSN-ordered list of toplevel TXNs.
@@ -1223,7 +1222,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
Assert(found);
/* remove entries spilled to disk */
- if (txn->serialized)
+ if (rbtxn_is_serialized(txn))
ReorderBufferRestoreCleanup(rb, txn);
/* deallocate */
@@ -1240,7 +1239,7 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
dlist_iter iter;
HASHCTL hash_ctl;
- if (!txn->has_catalog_changes || dlist_is_empty(&txn->tuplecids))
+ if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
return;
memset(&hash_ctl, 0, sizeof(hash_ctl));
@@ -1852,7 +1851,7 @@ ReorderBufferAbortOld(ReorderBuffer *rb, TransactionId oldestRunningXid)
* final_lsn to that of their last change; this causes
* ReorderBufferRestoreCleanup to do the right thing.
*/
- if (txn->serialized && txn->final_lsn == 0)
+ if (rbtxn_is_serialized(txn) && txn->final_lsn == 0)
{
ReorderBufferChange *last =
dlist_tail_element(ReorderBufferChange, node, &txn->changes);
@@ -2000,7 +1999,7 @@ ReorderBufferSetBaseSnapshot(ReorderBuffer *rb, TransactionId xid,
* operate on its top-level transaction instead.
*/
txn = ReorderBufferTXNByXid(rb, xid, true, &is_new, lsn, true);
- if (txn->is_known_as_subxact)
+ if (rbtxn_is_known_subxact(txn))
txn = ReorderBufferTXNByXid(rb, txn->toplevel_xid, false,
NULL, InvalidXLogRecPtr, false);
Assert(txn->base_snapshot == NULL);
@@ -2107,7 +2106,7 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
- txn->has_catalog_changes = true;
+ txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
}
/*
@@ -2124,7 +2123,7 @@ ReorderBufferXidHasCatalogChanges(ReorderBuffer *rb, TransactionId xid)
if (txn == NULL)
return false;
- return txn->has_catalog_changes;
+ return rbtxn_has_catalog_changes(txn);
}
/*
@@ -2144,7 +2143,7 @@ ReorderBufferXidHasBaseSnapshot(ReorderBuffer *rb, TransactionId xid)
return false;
/* a known subtxn? operate on top-level txn instead */
- if (txn->is_known_as_subxact)
+ if (rbtxn_is_known_subxact(txn))
txn = ReorderBufferTXNByXid(rb, txn->toplevel_xid, false,
NULL, InvalidXLogRecPtr, false);
@@ -2265,7 +2264,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
Assert(spilled == txn->nentries_mem);
Assert(dlist_is_empty(&txn->changes));
txn->nentries_mem = 0;
- txn->serialized = true;
+ txn->txn_flags |= RBTXN_IS_SERIALIZED;
if (fd != -1)
CloseTransientFile(fd);
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 1f52f6bde7..ec9515d156 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -150,18 +150,34 @@ typedef struct ReorderBufferChange
dlist_node node;
} ReorderBufferChange;
+/* ReorderBufferTXN flags */
+#define RBTXN_HAS_CATALOG_CHANGES 0x0001
+#define RBTXN_IS_SUBXACT 0x0002
+#define RBTXN_IS_SERIALIZED 0x0004
+
+/* does the txn have catalog changes */
+#define rbtxn_has_catalog_changes(txn) ((txn)->txn_flags & RBTXN_HAS_CATALOG_CHANGES)
+/* is the txn known as a subxact? */
+#define rbtxn_is_known_subxact(txn) ((txn)->txn_flags & RBTXN_IS_SUBXACT)
+/*
+ * Has this transaction been spilled to disk? It's not always possible to
+ * deduce that fact by comparing nentries with nentries_mem, because e.g.
+ * subtransactions of a large transaction might get serialized together
+ * with the parent - if they're restored to memory they'd have
+ * nentries_mem == nentries.
+ */
+#define rbtxn_is_serialized(txn) ((txn)->txn_flags & RBTXN_IS_SERIALIZED)
+
typedef struct ReorderBufferTXN
{
+ int txn_flags;
+
/*
* The transactions transaction id, can be a toplevel or sub xid.
*/
TransactionId xid;
- /* did the TX have catalog changes */
- bool has_catalog_changes;
-
/* Do we know this is a subxact? Xid of top-level txn if so */
- bool is_known_as_subxact;
TransactionId toplevel_xid;
/*
@@ -229,15 +245,6 @@ typedef struct ReorderBufferTXN
*/
uint64 nentries_mem;
- /*
- * Has this transaction been spilled to disk? It's not always possible to
- * deduce that fact by comparing nentries with nentries_mem, because e.g.
- * subtransactions of a large transaction might get serialized together
- * with the parent - if they're restored to memory they'd have
- * nentries_mem == nentries.
- */
- bool serialized;
-
/*
* List of ReorderBufferChange structs, including new Snapshots and new
* CommandIds
--
2.15.2 (Apple Git-101.1)
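The flag cleanup in patch 0001 replaces separate bool fields with one bitmask plus accessor macros. A minimal standalone sketch of that pattern follows. The DEMO_* constants, DemoFlagTxn and the demo_* helpers are hypothetical names chosen for this example, and the "!= 0" normalization is an addition here that the patch itself does not have.

```c
#include <stdbool.h>

/* Flag bits, one per former bool field (values mirror the patch's layout). */
#define DEMO_HAS_CATALOG_CHANGES	0x0001
#define DEMO_IS_SUBXACT				0x0002
#define DEMO_IS_SERIALIZED			0x0004

/* Hypothetical stand-in for ReorderBufferTXN; only the flags word. */
typedef struct DemoFlagTxn
{
	int			txn_flags;
} DemoFlagTxn;

/*
 * Accessors. The macro argument is parenthesized so that expressions such
 * as demo_is_serialized(&txns[i]) expand correctly, and the result is
 * normalized to 0 or 1 so it is safe to store in a narrow boolean type.
 */
#define demo_has_catalog_changes(txn) \
	(((txn)->txn_flags & DEMO_HAS_CATALOG_CHANGES) != 0)
#define demo_is_known_subxact(txn) \
	(((txn)->txn_flags & DEMO_IS_SUBXACT) != 0)
#define demo_is_serialized(txn) \
	(((txn)->txn_flags & DEMO_IS_SERIALIZED) != 0)

/* Setting a flag replaces the old "txn->serialized = true" assignment. */
static void
demo_mark_serialized(DemoFlagTxn *txn)
{
	txn->txn_flags |= DEMO_IS_SERIALIZED;
}

/* Same idea for the former "is_known_as_subxact = true" assignment. */
static void
demo_mark_subxact(DemoFlagTxn *txn)
{
	txn->txn_flags |= DEMO_IS_SUBXACT;
}
```

Setting one flag leaves the others untouched, so a transaction marked as serialized still reports no catalog changes until that bit is set separately.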
0002-Introduce-LogicalLockTransaction-LogicalUnlockTransa.patchapplication/octet-stream; name=0002-Introduce-LogicalLockTransaction-LogicalUnlockTransa.patchDownload
From 97139cac43a85ddd137d4ab1fe4c600b00613ecc Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Wed, 13 Jun 2018 16:16:14 +0530
Subject: [PATCH 2/5] Introduce LogicalLockTransaction/LogicalUnlockTransaction
APIs
When a transaction aborts, its changes are considered unnecessary
for other transactions. That means the changes may be either cleaned
up by vacuum or removed from HOT chains (thus made inaccessible
through indexes), and there may be other such consequences.
When decoding committed transactions this is not an issue, and we
never decode transactions that abort before the decoding starts
(where decoding means passing it to ReorderBufferCommit).
But for in-progress transactions - for example when decoding prepared
transactions on PREPARE (and not COMMIT PREPARED as before), this
may cause failures when the output plugin consults catalogs (both
system and user-defined).
To prevent aborts concurrent with plugins accessing catalogs, we
introduce an API the output plugins are required to use (when
decoding in-progress transactions only).
Before accessing any catalogs, output plugins are required to call
LogicalLockTransaction and then release it using
LogicalUnlockTransaction. Implementation is via adding support for
decoding groups. Use LockHashPartitionLockByProc on the group leader
to get the LWLock protecting these fields. For prepared and uncommitted
transactions, decoding backends working on the same XID will link
themselves up to the corresponding PGPROC entry (decodeGroupLeader).
They will remove themselves when they are done decoding.
If the prepared or uncommitted transaction decides to abort, then
the decodeGroupLeader will set the decodeAbortPending flag allowing
the decodeGroupMembers to abort their decoding appropriately.
If any of the decode group members errors out, we also remove
that proc from the group.
---
src/backend/replication/logical/logical.c | 242 ++++++++++++++++
src/backend/storage/ipc/procarray.c | 39 +++
src/backend/storage/lmgr/README | 46 ++++
src/backend/storage/lmgr/proc.c | 442 +++++++++++++++++++++++++++++-
src/include/replication/logical.h | 2 +
src/include/replication/reorderbuffer.h | 15 +
src/include/storage/proc.h | 26 ++
src/include/storage/procarray.h | 1 +
8 files changed, 804 insertions(+), 9 deletions(-)
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index c2d0e0c723..073aa41be2 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1065,3 +1065,245 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
SpinLockRelease(&MyReplicationSlot->mutex);
}
}
+
+/*
+ * LogicalLockTransaction
+ * Make sure the transaction is not aborted during decoding.
+ *
+ * The logical decoding plugins may need to access catalogs (both system
+ * and user-defined), e.g. to get metadata about tuples, do custom
+ * filtering etc. While decoding committed transactions that is not an
+ * issue, but in-progress transactions may abort while being decoded, in
+ * which case the catalog access may fail in various ways (rows from
+ * aborted transactions are eligible for more aggressive cleanup, may
+ * not be accessible through indexes due to breaking HOT chains etc.).
+ *
+ * To prevent these issues, we need to prevent abort of the transaction
+ * while accessing any catalogs. To enforce that, each decoding backend
+ * has to call LogicalLockTransaction prior to any catalog access, and
+ * then LogicalUnlockTransaction immediately after it. The lock function
+ * adds the decoding backend into a "decoding group" for the transaction
+ * on the first call. Subsequent calls update a flag indicating whether
+ * the decoding backend may be accessing any catalogs.
+ *
+ * While aborting an in-progress transaction, the backend is made to wait
+ * for all current members of the decoding group that may be currently
+ * accessing catalogs (see LogicalDecodeRemoveTransaction). Once the
+ * transaction completes (applies to both abort and commit), the group
+ * is destroyed and is not needed anymore (we can check transaction
+ * status directly, instead).
+ *
+ * The function returns true when it's safe to access catalogs, and
+ * false when the transaction aborted (or is being aborted), in which
+ * case the plugin should stop decoding it.
+ *
+ * The decoding backend joins the decoding group only when actually
+ * needed. For example, when the transaction made no catalog changes,
+ * or when it's known to already have committed (or aborted), we can
+ * bail out without joining the group.
+ */
+bool
+LogicalLockTransaction(ReorderBufferTXN *txn)
+{
+ bool ok = false;
+ LWLock *leader_lwlock;
+ volatile PGPROC *leader = NULL;
+ volatile PGXACT *pgxact = NULL;
+
+ /*
+ * Transactions that have not modified catalogs do not need to
+ * join the decoding group.
+ */
+ if (!rbtxn_has_catalog_changes(txn))
+ return true;
+
+ /*
+ * Check commit status. If a transaction already committed, there
+ * is no danger when accessing catalogs. If it aborted, we can
+ * stop decoding it right away.
+ */
+ if (rbtxn_commit(txn))
+ return true;
+
+ if (rbtxn_rollback(txn))
+ return false;
+
+ /*
+ * Currently, only 2PC transactions can be decoded before commit
+ * (at prepare). So regular transactions are automatically safe.
+ */
+ if (!rbtxn_prepared(txn))
+ return true;
+
+ /*
+ * Find the PROC handling this XID and join the decoding group.
+ *
+ * If this is the first call for this XID, we don't know which
+ * PROC is executing the transaction (and acting as a leader).
+ * In that case we need to look up and possibly also assign
+ * the leader.
+ */
+ if (MyProc->decodeGroupLeader == NULL)
+ {
+ leader = AssignDecodeGroupLeader(txn->xid);
+
+ /*
+ * We have checked if the transaction committed/aborted, but it
+ * is possible the PROC went away since then, in which case we
+ * get leader as NULL above. We recheck transaction status,
+ * expecting it to be either committed or aborted.
+ *
+ * If the PROC is available, add ourselves as a member of its
+ * decoding group. Note that we're not holding any locks on PGPROC,
+ * so it's possible the leader disappears, or starts executing
+ * another transaction. In that case we're done.
+ */
+ if (leader == NULL ||
+ !BecomeDecodeGroupMember((PGPROC *)leader, txn->xid))
+ goto lock_cleanup;
+ }
+
+ /*
+ * We know the leader was executing this XID a while ago, and we
+ * might have become a member of the decode group as well.
+ * But we have not been holding any locks on PGPROC so it might
+ * have committed/aborted, removed us from the decoding group and
+ * started executing something else since then. So we need to
+ * recheck that it is indeed still running the right XID.
+ */
+ leader = BackendXidGetProc(txn->xid);
+ if (!leader)
+ goto lock_cleanup;
+
+ leader_lwlock = LockHashPartitionLockByProc(leader);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+
+ pgxact = &ProcGlobal->allPgXact[leader->pgprocno];
+ if (pgxact->xid != txn->xid)
+ {
+ LWLockRelease(leader_lwlock);
+ goto lock_cleanup;
+ }
+
+ /* ok, we are part of the decode group still */
+ Assert(MyProc->decodeGroupLeader &&
+ MyProc->decodeGroupLeader == leader);
+
+ /*
+ * Re-check if we were told to abort by the leader after taking
+ * the above lock.
+ */
+ if (leader->decodeAbortPending)
+ {
+ /*
+ * Remove ourselves from the decode group and return
+ * false so that the decoding plugin also initiates abort
+ * processing.
+ */
+ RemoveDecodeGroupMemberLocked(MyProc->decodeGroupLeader);
+ MyProc->decodeLocked = false;
+ txn->txn_flags |= RBTXN_ROLLBACK;
+ ok = false;
+ }
+ else
+ {
+ /* ok to logically lock this backend */
+ MyProc->decodeLocked = true;
+ ok = true;
+ }
+ LWLockRelease(leader_lwlock);
+
+ return ok;
+
+ /*
+ * if we reach lock_cleanup label, then lock was not granted.
+ * Check XID status and update txn flags appropriately before
+ * returning
+ */
+lock_cleanup:
+ Assert(!TransactionIdIsInProgress(txn->xid));
+ if (TransactionIdDidCommit(txn->xid))
+ {
+ txn->txn_flags |= RBTXN_COMMIT;
+ return true;
+ }
+ else
+ {
+ txn->txn_flags |= RBTXN_ROLLBACK;
+ return false;
+ }
+}
+
+/*
+ * LogicalUnlockTransaction
+ * Indicate that the logical decoding plugin is done accessing
+ * catalog information.
+ *
+ *
+ * To prevent issues while decoding in-progress transactions, we
+ * need to prevent abort of the transaction while accessing any catalogs.
+ * To enforce that, each decoding backend has to call
+ * LogicalLockTransaction prior to any catalog access, and then
+ * LogicalUnlockTransaction immediately after it. This unlock function
+ * clears the backend's decodeLocked flag, and removes it from the
+ * "decoding group" only if the leader has an abort pending.
+ */
+void
+LogicalUnlockTransaction(ReorderBufferTXN *txn)
+{
+ LWLock *leader_lwlock;
+ PGPROC *leader = NULL;
+
+ /*
+ * If the transaction is known to have aborted, we should have never got
+ * here (the plugin should have interrupted the decoding).
+ */
+ Assert(!rbtxn_rollback(txn));
+
+ /* If it's not locked, we're done. */
+ if (!MyProc->decodeLocked)
+ return;
+
+ /*
+ * Transactions that have not modified catalogs do not need to
+ * join the decoding group.
+ */
+ if (!rbtxn_has_catalog_changes(txn))
+ return;
+
+ /*
+ * Currently, only 2PC transactions can be decoded before commit
+ * (at prepare). So regular transactions are automatically safe.
+ */
+ if (!rbtxn_prepared(txn))
+ return;
+
+ /*
+ * Check commit status. If a transaction already committed, there
+ * is no danger when accessing catalogs.
+ */
+ if (rbtxn_commit(txn))
+ return;
+
+ /*
+ * We're guaranteed to still have a leader here, because we are
+ * in locked mode, so the leader can't just disappear.
+ */
+ leader = MyProc->decodeGroupLeader;
+ Assert(leader && MyProc->decodeLocked);
+
+ leader_lwlock = LockHashPartitionLockByProc(leader);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ if (leader->decodeAbortPending)
+ {
+ /*
+ * Remove ourselves from the decode group.
+ */
+ RemoveDecodeGroupMemberLocked(leader);
+
+ txn->txn_flags |= RBTXN_ROLLBACK;
+ }
+ MyProc->decodeLocked = false;
+ LWLockRelease(leader_lwlock);
+ return;
+}
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index bd20497d81..77bf833381 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -2440,6 +2440,45 @@ BackendXidGetPid(TransactionId xid)
return result;
}
+/*
+ * BackendXidGetProc -- get a backend's PGPROC given its XID
+ *
+ * Note that it is up to the caller to be sure that the question
+ * remains meaningful for long enough for the answer to be used ...
+ *
+ * Only main transaction Ids are considered.
+ *
+ */
+PGPROC *
+BackendXidGetProc(TransactionId xid)
+{
+ PGPROC *result = NULL;
+ ProcArrayStruct *arrayP = procArray;
+ int index;
+
+ if (xid == InvalidTransactionId) /* never match invalid xid */
+ return NULL;
+
+ LWLockAcquire(ProcArrayLock, LW_SHARED);
+
+ for (index = 0; index < arrayP->numProcs; index++)
+ {
+ int pgprocno = arrayP->pgprocnos[index];
+ PGPROC *proc = &allProcs[pgprocno];
+ volatile PGXACT *pgxact = &allPgXact[pgprocno];
+
+ if (pgxact->xid == xid)
+ {
+ result = proc;
+ break;
+ }
+ }
+
+ LWLockRelease(ProcArrayLock);
+
+ return result;
+}
+
/*
* IsBackendPid -- is a given pid a running backend
*
diff --git a/src/backend/storage/lmgr/README b/src/backend/storage/lmgr/README
index 56b0a12a3e..4b4b9c5958 100644
--- a/src/backend/storage/lmgr/README
+++ b/src/backend/storage/lmgr/README
@@ -679,6 +679,52 @@ worker, and the worker fails to join the lock group unless the given PGPROC
still has the same PID and is still a lock group leader. We assume that
PIDs are not recycled quickly enough for this interlock to fail.
+Decode Group Locking
+--------------------
+
+When decoding in-progress transactions, we need to prevent aborts while
+the decoding processes are accessing catalogs, because reading catalogs
+modified by an aborted transaction can fail. Currently this applies
+only to two-phase transactions, which may be decoded at PREPARE time, but
+in the future this may be extended to regular transactions too.
+
+To prevent that, the backend executing the abort is made to wait for all
+the decoding backends. We use an infrastructure which is very similar
+to the above group locking to form groups of backends performing logical
+decoding of the same in-progress transaction.
+
+Decode Group locking adds five new members to each PGPROC:
+decodeGroupLeader, decodeGroupMembers, decodeGroupLink, decodeLocked and
+decodeAbortPending. A PGPROC's decodeGroupLeader is NULL for processes
+not involved in logical decoding. When a process wants to decode an
+in-progress transaction, it looks up the PGPROC structure associated
+with that transaction ID and makes that PGPROC its decodeGroupLeader.
+The decodeGroupMembers field is only used in the
+leader; it is a list of the member PGPROCs of the decode group (the
+leader and all backends decoding this transaction ID).
+The decodeGroupLink field is the list link for this list. The decoding
+backend marks itself as decodeLocked while it is accessing catalog
+metadata for its decoding requirements via the LogicalLockTransaction
+API. It resets the same via the LogicalUnlockTransaction API.
+
+Meanwhile, if this in-progress transaction decides to abort, the
+PGPROC corresponding to it sets decodeAbortPending
+on itself and also on all the decodeGroupMembers entries.
+
+The decodeGroupMembers entries stop decoding this transaction and exit.
+When all the decoding backends have exited the abort can proceed.
+
+All five of these fields are considered to be protected by a lock manager
+partition lock. The partition lock that protects these fields within a given
+decode group is chosen by taking the leader's pgprocno modulo the number of lock
+manager partitions. Holding this single lock allows safe manipulation of the
+decodeGroupMembers list for the decode group.
+
+The decodeGroupLeader's PGPROC is accessible to each decoding backend,
+which identifies it by the XID being decoded. The decoding backend
+fails to join the decode group unless the given PGPROC is still running
+that XID and has no abort pending; the XID, rather than the PID, serves
+as the interlock against PGPROC reuse.
User Locks (Advisory Locks)
---------------------------
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 6f30e082b2..82a2450319 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -267,6 +267,11 @@ InitProcGlobal(void)
/* Initialize lockGroupMembers list. */
dlist_init(&procs[i].lockGroupMembers);
+
+ /* Initialize decodeGroupMembers list. */
+ dlist_init(&procs[i].decodeGroupMembers);
+ procs[i].decodeAbortPending = false;
+ procs[i].decodeLocked = false;
}
/*
@@ -406,6 +411,12 @@ InitProcess(void)
Assert(MyProc->lockGroupLeader == NULL);
Assert(dlist_is_empty(&MyProc->lockGroupMembers));
+ /* Check that group decode fields are in a proper initial state. */
+ Assert(MyProc->decodeGroupLeader == NULL);
+ Assert(dlist_is_empty(&MyProc->decodeGroupMembers));
+ MyProc->decodeAbortPending = false;
+ MyProc->decodeLocked = false;
+
/* Initialize wait event information. */
MyProc->wait_event_info = 0;
@@ -581,6 +592,12 @@ InitAuxiliaryProcess(void)
Assert(MyProc->lockGroupLeader == NULL);
Assert(dlist_is_empty(&MyProc->lockGroupMembers));
+ /* Check that group decode fields are in a proper initial state. */
+ Assert(MyProc->decodeGroupLeader == NULL);
+ Assert(dlist_is_empty(&MyProc->decodeGroupMembers));
+ MyProc->decodeAbortPending = false;
+ MyProc->decodeLocked = false;
+
/*
* We might be reusing a semaphore that belonged to a failed process. So
* be careful and reinitialize its value here. (This is not strictly
@@ -826,9 +843,14 @@ ProcKill(int code, Datum arg)
/*
* Detach from any lock group of which we are a member. If the leader
- * exist before all other group members, it's PGPROC will remain allocated
+ * exits before all other group members, its PGPROC will remain allocated
* until the last group process exits; that process must return the
* leader's PGPROC to the appropriate list.
+ *
+ * The below code needs to be mindful of the presence of decode group
+ * entries in case of logical decoding. However, lock groups are for
+ * parallel workers so we typically won't be finding both present
+ * together in the same proc.
*/
if (MyProc->lockGroupLeader != NULL)
{
@@ -845,11 +867,19 @@ ProcKill(int code, Datum arg)
{
procgloballist = leader->procgloballist;
- /* Leader exited first; return its PGPROC. */
- SpinLockAcquire(ProcStructLock);
- leader->links.next = (SHM_QUEUE *) *procgloballist;
- *procgloballist = leader;
- SpinLockRelease(ProcStructLock);
+ /*
+ * Leader exited first; return its PGPROC.
+ * Only do this if it does not have any decode
+ * group members, though. Otherwise the last decode
+ * group member will release it later.
+ */
+ if (leader->decodeGroupLeader == NULL)
+ {
+ SpinLockAcquire(ProcStructLock);
+ leader->links.next = (SHM_QUEUE *) *procgloballist;
+ *procgloballist = leader;
+ SpinLockRelease(ProcStructLock);
+ }
}
}
else if (leader != MyProc)
@@ -857,6 +887,54 @@ ProcKill(int code, Datum arg)
LWLockRelease(leader_lwlock);
}
+ /*
+ * Detach from any decode group of which we are a member. If the leader
+ * exits before all other group members, its PGPROC will remain allocated
+ * until the last group process exits; that process must return the
+ * leader's PGPROC to the appropriate list.
+ */
+ if (MyProc->decodeGroupLeader != NULL)
+ {
+ PGPROC *leader = MyProc->decodeGroupLeader;
+ LWLock *leader_lwlock = LockHashPartitionLockByProc(leader);
+
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ dlist_delete(&MyProc->decodeGroupLink);
+ if (dlist_is_empty(&leader->decodeGroupMembers))
+ {
+ leader->decodeGroupLeader = NULL;
+ if (leader != MyProc)
+ {
+ procgloballist = leader->procgloballist;
+
+ /*
+ * Leader exited first; return its PGPROC.
+ * But check if it was already done above
+ * by the lockGroup code
+ */
+ if (leader != *procgloballist)
+ {
+ SpinLockAcquire(ProcStructLock);
+ leader->links.next = (SHM_QUEUE *) *procgloballist;
+ *procgloballist = leader;
+ SpinLockRelease(ProcStructLock);
+ }
+ }
+ /* clear leader flags */
+ leader->decodeAbortPending = false;
+ leader->decodeLocked = false;
+ }
+ else if (leader != MyProc)
+ {
+ MyProc->decodeGroupLeader = NULL;
+ /* clear proc flags */
+ MyProc->decodeLocked = false;
+ MyProc->decodeAbortPending = false;
+ }
+ LWLockRelease(leader_lwlock);
+ }
+
/*
* Reset MyLatch to the process local one. This is so that signal
* handlers et al can continue using the latch after the shared latch
@@ -881,9 +959,36 @@ ProcKill(int code, Datum arg)
/* Since lockGroupLeader is NULL, lockGroupMembers should be empty. */
Assert(dlist_is_empty(&proc->lockGroupMembers));
- /* Return PGPROC structure (and semaphore) to appropriate freelist */
- proc->links.next = (SHM_QUEUE *) *procgloballist;
- *procgloballist = proc;
+ /*
+ * Return PGPROC structure (and semaphore) to appropriate freelist.
+ * Again check if decode group stuff will handle it later.
+ */
+ if (proc->decodeGroupLeader == NULL)
+ {
+ proc->links.next = (SHM_QUEUE *) *procgloballist;
+ *procgloballist = proc;
+ }
+ }
+
+ /*
+ * If we're still a member of a decode group, that means we're a leader
+ * which has somehow exited before its children. The last remaining child
+ * will release our PGPROC. Otherwise, release it now.
+ */
+ if (proc->decodeGroupLeader == NULL)
+ {
+ /* Since decodeGroupLeader is NULL, decodeGroupMembers should be empty. */
+ Assert(dlist_is_empty(&proc->decodeGroupMembers));
+
+ /*
+ * Return PGPROC structure (and semaphore) to appropriate freelist,
+ * but check if it was already done above by the lockGroup code.
+ */
+ if (proc != *procgloballist)
+ {
+ proc->links.next = (SHM_QUEUE *) *procgloballist;
+ *procgloballist = proc;
+ }
}
/* Update shared estimate of spins_per_delay */
@@ -1887,3 +1992,322 @@ BecomeLockGroupMember(PGPROC *leader, int pid)
return ok;
}
+
+/*
+ * AssignDecodeGroupLeader
+ * Lookup process using xid and designate as decode group leader.
+ *
+ * Once this function has returned, other processes can join the decode
+ * group by calling BecomeDecodeGroupMember.
+ */
+PGPROC *
+AssignDecodeGroupLeader(TransactionId xid)
+{
+ PGPROC *proc = NULL;
+ LWLock *leader_lwlock;
+
+ Assert(xid != InvalidTransactionId);
+
+ /*
+ * Lookup the backend executing this transaction.
+ *
+ * If the transaction already completed, we can bail out.
+ */
+ proc = BackendXidGetProc(xid);
+ if (!proc)
+ return NULL;
+
+ /*
+ * A process running an XID can't have a leader; it can only be
+ * a leader (in which case it points to itself).
+ */
+ Assert(!proc->decodeGroupLeader ||
+ (proc->decodeGroupLeader == proc));
+
+ /*
+ * This proc will become decodeGroupLeader if it's not already.
+ */
+ if (proc->decodeGroupLeader == NULL)
+ {
+ volatile PGXACT *pgxact;
+ volatile PGPROC *leader;
+
+ /* Create single-member group, containing this proc. */
+ leader_lwlock = LockHashPartitionLockByProc(proc);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+
+ /* recheck we are still the same */
+ leader = BackendXidGetProc(xid);
+ if (!leader || leader != proc)
+ {
+ LWLockRelease(leader_lwlock);
+ return NULL;
+ }
+
+ pgxact = &ProcGlobal->allPgXact[leader->pgprocno];
+
+ /*
+ * We know the process was executing the XID a while ago, but we
+ * have not been holding any locks on PGPROC so it might have
+ * started executing something else since then. So we need to
+ * recheck that it is indeed still running the right XID.
+ *
+ * If it's not, the transaction must have already completed, so
+ * we don't need to create any decoding group.
+ */
+ if (pgxact->xid == xid)
+ {
+ /*
+ * Some other decoding backend might have marked the process
+ * as a leader before we acquired the lock. But it must not
+ * be a follower of some other leader.
+ */
+ Assert(!proc->decodeGroupLeader ||
+ (proc->decodeGroupLeader == proc));
+
+ /* recheck if someone else did not already assign us */
+ if (proc->decodeGroupLeader == NULL)
+ {
+ /*
+ * The leader is also a part of the decoding group,
+ * so we add it to the members list as well.
+ */
+ proc->decodeGroupLeader = proc;
+ dlist_push_head(&proc->decodeGroupMembers,
+ &proc->decodeGroupLink);
+ }
+ }
+ else
+ {
+ /* proc entry is gone */
+ proc = NULL;
+ }
+ LWLockRelease(leader_lwlock);
+ }
+
+ if (proc)
+ elog(DEBUG1, "became group leader (%p)", proc);
+ return proc;
+}
+
+/*
+ * BecomeDecodeGroupMember - designate process as decode group member
+ *
+ * This is pretty straightforward except for the possibility that the leader
+ * whose group we're trying to join might exit before we manage to do so;
+ * and the PGPROC might get recycled for an unrelated process. To avoid
+ * that, we require the caller to pass the XID the intended PGPROC is
+ * running as an interlock. Returns true if we successfully join the
+ * intended decode group, and false if not.
+ */
+bool
+BecomeDecodeGroupMember(PGPROC *leader, TransactionId xid)
+{
+ LWLock *leader_lwlock;
+ bool ok = false;
+ volatile PGXACT *pgxact;
+ volatile PGPROC *proc = NULL;
+
+ /* Group leader can't become member of group */
+ Assert(MyProc != leader);
+
+ /* Can't already be a member of a group */
+ Assert(MyProc->decodeGroupLeader == NULL);
+
+ /* XID must be valid */
+ Assert(TransactionIdIsValid(xid));
+
+ /*
+ * Get lock protecting the group fields. Note LockHashPartitionLockByProc
+ * accesses leader->pgprocno in a PGPROC that might be free. This is safe
+ * because all PGPROCs' pgprocno fields are set during shared memory
+ * initialization and never change thereafter; so we will acquire the
+ * correct lock even if the leader PGPROC is in process of being recycled.
+ */
+ leader_lwlock = LockHashPartitionLockByProc(leader);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+
+ /* Is this the leader we're looking for? */
+ proc = BackendXidGetProc(xid);
+ if (!proc || leader != proc)
+ {
+ LWLockRelease(leader_lwlock);
+ return false;
+ }
+ pgxact = &ProcGlobal->allPgXact[leader->pgprocno];
+ if (pgxact->xid == xid)
+ {
+ /* is the leader going away? */
+ if (leader->decodeAbortPending)
+ ok = false;
+ else
+ {
+ /* OK, join the group */
+ ok = true;
+ MyProc->decodeGroupLeader = leader;
+ dlist_push_tail(&leader->decodeGroupMembers, &MyProc->decodeGroupLink);
+ }
+ }
+ LWLockRelease(leader_lwlock);
+
+ if (ok)
+ elog(DEBUG1, "became group member (%p) to (%p)", MyProc, leader);
+ return ok;
+}
+
+/*
+ * RemoveDecodeGroupMember
+ * Remove a member from the decoding group of a leader.
+ */
+void
+RemoveDecodeGroupMember(PGPROC *leader)
+{
+ LWLock *leader_lwlock;
+
+ leader_lwlock = LockHashPartitionLockByProc(leader);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ RemoveDecodeGroupMemberLocked(leader);
+ LWLockRelease(leader_lwlock);
+
+ return;
+}
+
+/*
+ * RemoveDecodeGroupMemberLocked
+ * Remove a member from a decoding group of a leader.
+ *
+ * Assumes that the caller is holding appropriate lock on PGPROC.
+ */
+void
+RemoveDecodeGroupMemberLocked(PGPROC *leader)
+{
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ dlist_delete(&MyProc->decodeGroupLink);
+ /* leader links to itself, so never empty */
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ MyProc->decodeGroupLeader = NULL;
+ elog(DEBUG1, "removed group member (%p) from (%p)", MyProc, leader);
+
+ return;
+}
+
+/*
+ * LogicalDecodeRemoveTransaction
+ * Notify all decoding members that this transaction is going away.
+ *
+ * Wait for all decodeGroupMembers to ack back before returning from
+ * here but only in case of aborts.
+ *
+ * This function should be called *after* the proc has been removed
+ * from the procArray.
+ *
+ * If the transaction is committing, it's ok for the decoding backends
+ * to continue merrily - there is no danger in accessing catalogs. When
+ * a backend tries to join the decoding group, it won't find the proc
+ * anymore, forcing it to re-check transaction status and cache the
+ * commit status for future calls (see LogicalLockTransaction).
+ *
+ * If a backend which is part of the decode group dies/crashes,
+ * that would effectively cause the database to restart, cleaning
+ * up the shared memory state.
+ */
+void
+LogicalDecodeRemoveTransaction(PGPROC *leader, bool isCommit)
+{
+ LWLock *leader_lwlock;
+ dlist_mutable_iter change_i;
+ dlist_iter iter;
+ PGPROC *proc;
+ bool do_wait;
+
+ leader_lwlock = LockHashPartitionLockByProc(leader);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+
+ /*
+ * If the proc has not been initialized as a group leader, there are
+ * no group members to wait for and we can terminate right away.
+ */
+ if (leader->decodeGroupLeader == NULL)
+ {
+ Assert(dlist_is_empty(&leader->decodeGroupMembers));
+ LWLockRelease(leader_lwlock);
+ return;
+ }
+
+ /* mark the transaction as aborting */
+ leader->decodeAbortPending = (!isCommit);
+
+recheck:
+ do_wait = false;
+ Assert(leader->decodeGroupLeader == leader);
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ if (!isCommit)
+ {
+ /*
+ * We need to walk the list of group members, and decide if we
+ * need to wait for some of them. In other words, we need to
+ * check if there are any processes besides the leader.
+ */
+ dlist_foreach(iter, &leader->decodeGroupMembers)
+ {
+ proc = dlist_container(PGPROC, decodeGroupLink, iter.cur);
+
+ /* Ignore the leader (i.e. ourselves). */
+ if (proc == leader)
+ continue;
+
+ /* if the proc is currently locked, wait */
+ if (proc->decodeLocked)
+ do_wait = true;
+ }
+
+ if (do_wait)
+ {
+ int rc;
+ LWLockRelease(leader_lwlock);
+
+ elog(LOG, "Waiting for backends to abort decoding");
+ /*
+ * Wait on our latch to allow decodeGroupMembers to
+ * go away soon
+ */
+ rc = WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+ 100L,
+ WAIT_EVENT_PG_SLEEP);
+ ResetLatch(MyLatch);
+
+ /* emergency bailout if postmaster has died */
+ if (rc & WL_POSTMASTER_DEATH)
+ proc_exit(1);
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Recheck decodeGroupMembers */
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ goto recheck;
+ }
+ }
+
+ /*
+ * All backends exited cleanly in case of aborts above,
+ * remove decodeGroupMembers now for both commit/abort cases
+ */
+ Assert(leader->decodeGroupLeader == leader);
+ Assert(!dlist_is_empty(&leader->decodeGroupMembers));
+ dlist_foreach_modify(change_i, &leader->decodeGroupMembers)
+ {
+ proc = dlist_container(PGPROC, decodeGroupLink, change_i.cur);
+ Assert(!proc->decodeLocked);
+ dlist_delete(&proc->decodeGroupLink);
+ elog(DEBUG1, "deleting group member (%p) from (%p)",
+ proc, leader);
+ proc->decodeGroupLeader = NULL;
+ }
+ Assert(dlist_is_empty(&leader->decodeGroupMembers));
+ leader->decodeGroupLeader = NULL;
+ leader->decodeAbortPending = false;
+ LWLockRelease(leader_lwlock);
+
+ return;
+}
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index c25ac1fa85..069eb7a272 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -117,6 +117,8 @@ extern void LogicalIncreaseXminForSlot(XLogRecPtr lsn, TransactionId xmin);
extern void LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn,
XLogRecPtr restart_lsn);
extern void LogicalConfirmReceivedLocation(XLogRecPtr lsn);
+extern bool LogicalLockTransaction(ReorderBufferTXN *txn);
+extern void LogicalUnlockTransaction(ReorderBufferTXN *txn);
extern bool filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id);
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index ec9515d156..473ec85a7e 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -154,6 +154,11 @@ typedef struct ReorderBufferChange
#define RBTXN_HAS_CATALOG_CHANGES 0x0001
#define RBTXN_IS_SUBXACT 0x0002
#define RBTXN_IS_SERIALIZED 0x0004
+#define RBTXN_PREPARE 0x0008
+#define RBTXN_COMMIT_PREPARED 0x0010
+#define RBTXN_ROLLBACK_PREPARED 0x0020
+#define RBTXN_COMMIT 0x0040
+#define RBTXN_ROLLBACK 0x0080
/* does the txn have catalog changes */
#define rbtxn_has_catalog_changes(txn) (txn->txn_flags & RBTXN_HAS_CATALOG_CHANGES)
@@ -167,6 +172,16 @@ typedef struct ReorderBufferChange
* nentries_mem == nentries.
*/
#define rbtxn_is_serialized(txn) (txn->txn_flags & RBTXN_IS_SERIALIZED)
+/* is this txn prepared? */
+#define rbtxn_prepared(txn) (txn->txn_flags & RBTXN_PREPARE)
+/* was this prepared txn committed in the meanwhile? */
+#define rbtxn_commit_prepared(txn) (txn->txn_flags & RBTXN_COMMIT_PREPARED)
+/* was this prepared txn aborted in the meanwhile? */
+#define rbtxn_rollback_prepared(txn) (txn->txn_flags & RBTXN_ROLLBACK_PREPARED)
+/* was this txn committed in the meanwhile? */
+#define rbtxn_commit(txn) (txn->txn_flags & RBTXN_COMMIT)
+/* was this txn aborted in the meanwhile? */
+#define rbtxn_rollback(txn) (txn->txn_flags & RBTXN_ROLLBACK)
typedef struct ReorderBufferTXN
{
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 5c19a61dcf..ae842b64d0 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -200,6 +200,26 @@ struct PGPROC
PGPROC *lockGroupLeader; /* lock group leader, if I'm a member */
dlist_head lockGroupMembers; /* list of members, if I'm a leader */
dlist_node lockGroupLink; /* my member link, if I'm a member */
+
+ /*
+ * Support for decoding groups. Use LockHashPartitionLockByProc on the group
+ * leader to get the LWLock protecting these fields.
+ *
+ * For prepared and uncommitted transactions, decoding backends working on
+ * the same XID will link themselves up to the corresponding PGPROC
+ * entry (decodeGroupLeader).
+ *
+ * They will remove themselves when they are done decoding.
+ *
+ * If the prepared or uncommitted transaction decides to abort, then
+ * the decodeGroupLeader will set the decodeAbortPending flag allowing
+ * the decodeGroupMembers to abort their decoding appropriately
+ */
+ PGPROC *decodeGroupLeader; /* decode group leader, if I'm a member */
+ dlist_head decodeGroupMembers; /* list of members, if I'm a leader */
+ dlist_node decodeGroupLink; /* my member link, if I'm a member */
+ bool decodeLocked; /* is it currently locked by this proc? */
+ bool decodeAbortPending; /* is the decode group leader aborting? */
};
/* NOTE: "typedef struct PGPROC PGPROC" appears in storage/lock.h. */
@@ -327,4 +347,10 @@ extern PGPROC *AuxiliaryPidGetProc(int pid);
extern void BecomeLockGroupLeader(void);
extern bool BecomeLockGroupMember(PGPROC *leader, int pid);
+extern PGPROC *AssignDecodeGroupLeader(TransactionId xid);
+extern bool BecomeDecodeGroupMember(PGPROC *leader, TransactionId xid);
+extern void RemoveDecodeGroupMember(PGPROC *leader);
+extern void RemoveDecodeGroupMemberLocked(PGPROC *leader);
+extern void LogicalDecodeRemoveTransaction(PGPROC *leader, bool isCommit);
+
#endif /* PROC_H */
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 75bab2985f..776de2470e 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -97,6 +97,7 @@ extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids
extern PGPROC *BackendPidGetProc(int pid);
extern PGPROC *BackendPidGetProcWithLock(int pid);
+extern PGPROC *BackendXidGetProc(TransactionId xid);
extern int BackendXidGetPid(TransactionId xid);
extern bool IsBackendPid(int pid);
--
2.15.2 (Apple Git-101.1)
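The decode group handshake in the patch above can be sketched in isolation. What follows is only a single-threaded model under stated assumptions: SketchProc and the sketch_* functions are hypothetical stand-ins, not code from the patch, and the partition LWLock that the real code takes around every step is left out. It shows the two sides of the protocol: a decoding backend may lock only while no abort is pending, and an aborting leader must wait while any member is still locked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the decode-group fields added to PGPROC. */
typedef struct SketchProc
{
    struct SketchProc *decodeGroupLeader;   /* NULL when not in a group */
    bool        decodeLocked;               /* member is accessing catalogs */
    bool        decodeAbortPending;         /* leader has started aborting */
} SketchProc;

/*
 * Member side, modelled on the tail of LogicalLockTransaction: refuse the
 * lock (and leave the group) once the leader has an abort pending.
 */
static bool
sketch_lock(SketchProc *member, SketchProc *leader)
{
    if (leader->decodeAbortPending)
    {
        member->decodeGroupLeader = NULL;
        member->decodeLocked = false;
        return false;
    }
    member->decodeGroupLeader = leader;
    member->decodeLocked = true;
    return true;
}

/* Member side: catalog access is finished. */
static void
sketch_unlock(SketchProc *member)
{
    member->decodeLocked = false;
}

/*
 * Leader side, modelled on one pass of the recheck loop in
 * LogicalDecodeRemoveTransaction for the abort case. Returns true when
 * the abort still has to wait for a locked member.
 */
static bool
sketch_abort_must_wait(SketchProc *leader, SketchProc *members, size_t nmembers)
{
    leader->decodeAbortPending = true;

    for (size_t i = 0; i < nmembers; i++)
    {
        if (members[i].decodeGroupLeader == leader && members[i].decodeLocked)
            return true;
    }
    return false;
}
```

In the patch itself the waiting side sleeps on its latch and rechecks, instead of returning to the caller as this model does.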
0003-Support-decoding-of-two-phase-transactions-at-PREPAR.patch application/octet-stream; name=0003-Support-decoding-of-two-phase-transactions-at-PREPAR.patch
From 6df02b378f3958677c1c0193793f491bce0919da Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Wed, 13 Jun 2018 16:30:30 +0530
Subject: [PATCH 3/5] Support decoding of two-phase transactions at PREPARE
Until now two-phase transactions were decoded at COMMIT, just like
regular transactions. During replay, two-phase transactions were
translated into regular transactions on the subscriber, and the GID
was not forwarded to it.
This patch allows PREPARE-time decoding of two-phase transactions (if
the output plugin supports this capability), in which case the
transactions are replayed at PREPARE and then committed later when
COMMIT PREPARED arrives.
On the subscriber, the transactions will be executed as two-phase
transactions, with the same GID. This is important for various
external transaction managers, which often encode information into
the GID itself.
All catalog access while decoding such 2PC transactions has to be
wrapped in the LogicalLockTransaction/LogicalUnlockTransaction APIs
at the relevant locations. This includes the location where the output
plugin's change apply API is invoked. This protects any catalog
access inside the output plugin's change apply API from concurrent
rollback operations.
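The calling pattern described above might look roughly like the following sketch. It is an illustration under assumptions, not code from this patch: SketchTxn and the sketch_* functions are hypothetical stand-ins for ReorderBufferTXN and the LogicalLockTransaction/LogicalUnlockTransaction pair, and the concurrent rollback is simulated by a parameter. The point is only the shape of the loop: lock before each catalog access, unlock right after it, and stop decoding as soon as the lock is refused.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for ReorderBufferTXN. */
typedef struct SketchTxn
{
    bool        aborted;    /* a concurrent rollback has been noticed */
    int         locked;     /* 1 while "catalog access" is in progress */
} SketchTxn;

/* Stand-in for LogicalLockTransaction: refused once the txn aborted. */
static bool
sketch_lock_transaction(SketchTxn *txn)
{
    if (txn->aborted)
        return false;
    txn->locked = 1;
    return true;
}

/* Stand-in for LogicalUnlockTransaction. */
static void
sketch_unlock_transaction(SketchTxn *txn)
{
    txn->locked = 0;
}

/*
 * Apply up to nchanges changes. The transaction is marked aborted just
 * before change number abort_after (a negative value means it never
 * aborts). Returns the number of changes that were actually applied.
 */
static int
sketch_apply_changes(SketchTxn *txn, int nchanges, int abort_after)
{
    int         applied = 0;

    for (int i = 0; i < nchanges; i++)
    {
        if (i == abort_after)
            txn->aborted = true;    /* simulate a concurrent rollback */

        if (!sketch_lock_transaction(txn))
            break;                  /* stop decoding this transaction */

        /* catalog access and the change callback would happen here */
        applied++;

        sketch_unlock_transaction(txn);
    }
    return applied;
}
```

Keeping the locked window to a single change is what lets a waiting rollback proceed promptly.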
Includes documentation changes.
---
doc/src/sgml/logicaldecoding.sgml | 128 +++++++++++++-
src/backend/access/transam/twophase.c | 8 +
src/backend/replication/logical/decode.c | 147 ++++++++++++++--
src/backend/replication/logical/logical.c | 202 ++++++++++++++++++++++
src/backend/replication/logical/reorderbuffer.c | 221 ++++++++++++++++++++++--
src/include/replication/logical.h | 7 +-
src/include/replication/output_plugin.h | 45 +++++
src/include/replication/reorderbuffer.h | 54 ++++++
8 files changed, 780 insertions(+), 32 deletions(-)
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 8db968641e..7e9213def2 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -385,7 +385,12 @@ typedef struct OutputPluginCallbacks
LogicalDecodeChangeCB change_cb;
LogicalDecodeTruncateCB truncate_cb;
LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeAbortCB abort_cb;
LogicalDecodeMessageCB message_cb;
+ LogicalDecodeFilterPrepareCB filter_prepare_cb;
+ LogicalDecodePrepareCB prepare_cb;
+ LogicalDecodeCommitPreparedCB commit_prepared_cb;
+ LogicalDecodeAbortPreparedCB abort_prepared_cb;
LogicalDecodeFilterByOriginCB filter_by_origin_cb;
LogicalDecodeShutdownCB shutdown_cb;
} OutputPluginCallbacks;
@@ -457,7 +462,13 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
never get
decoded. Successful savepoints are
folded into the transaction containing them in the order they were
- executed within that transaction.
+ executed within that transaction. A transaction that is prepared for
+ a two-phase commit using <command>PREPARE TRANSACTION</command> will
+ also be decoded if the output plugin callbacks needed for decoding
+ them are provided. It is possible that the current transaction which
+ is being decoded is aborted concurrently via a <command>ROLLBACK PREPARED</command>
+ command. In that case, the logical decoding of this transaction will
+ be aborted too.
</para>
<note>
@@ -558,6 +569,74 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
</para>
</sect3>
+ <sect3 id="logicaldecoding-output-plugin-prepare">
+ <title>Transaction Prepare Callback</title>
+
+ <para>
+ The optional <function>prepare_cb</function> callback is called whenever
+ a transaction which is prepared for two-phase commit has been
+ decoded. The <function>change_cb</function> callbacks for all modified
+ rows will have been called before this, if there have been any modified
+ rows.
+<programlisting>
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+ <title>Commit Prepared Transaction Callback</title>
+
+ <para>
+ The optional <function>commit_prepared_cb</function> callback is called whenever
+ a commit prepared transaction has been decoded. The <parameter>gid</parameter> field,
+ which is part of the <parameter>txn</parameter> parameter can be used in this
+ callback.
+<programlisting>
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-abort-prepared">
+ <title>Rollback Prepared Transaction Callback</title>
+
+ <para>
+ The optional <function>abort_prepared_cb</function> callback is called whenever
+ a rollback prepared transaction has been decoded. The <parameter>gid</parameter> field,
+ which is part of the <parameter>txn</parameter> parameter can be used in this
+ callback.
+<programlisting>
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-abort">
+ <title>Transaction Abort Callback</title>
+
+ <para>
+ The required <function>abort_cb</function> callback is called whenever
+ a transaction abort has to be initiated. This can happen if we are
+ decoding a transaction that has been prepared for two-phase commit and
+ a concurrent rollback happens while we are decoding it. In such cases
+ it might make sense to check whether the rollback has already happened
+ before we start looking at the changes, so that decoding of such
+ transactions can be avoided completely.
+<programlisting>
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
<sect3 id="logicaldecoding-output-plugin-change">
<title>Change Callback</title>
@@ -567,7 +646,12 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
an <command>INSERT</command>, <command>UPDATE</command>,
or <command>DELETE</command>. Even if the original command modified
several rows at once the callback will be called individually for each
- row.
+ row. The <function>change_cb</function> callback may access system or
+ user catalog tables to aid in the process of outputting the row
+ modification details. When decoding a prepared (but not yet
+ committed) transaction or any other uncommitted transaction, this
+ change callback is guaranteed safe access to catalog tables even if
+ another backend concurrently rolls back this very same transaction.
<programlisting>
typedef void (*LogicalDecodeChangeCB) (struct LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
@@ -644,6 +728,39 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
</para>
</sect3>
+ <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+ <title>Prepare Filter Callback</title>
+
+ <para>
+ The optional <function>filter_prepare_cb</function> callback
+ is called to determine whether data that is part of the current
+ two-phase commit transaction should be considered for decoding
+ at this prepare stage or as a regular one-phase transaction at
+ <command>COMMIT PREPARED</command> time later. To signal that
+ decoding should be skipped, return <literal>true</literal>;
+ <literal>false</literal> otherwise. When the callback is not
+ defined, <literal>false</literal> is assumed (i.e. nothing is
+ filtered).
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+</programlisting>
+ The <parameter>ctx</parameter> parameter has the same contents
+ as for the other callbacks. The <parameter>txn</parameter> parameter
+ contains meta information about the transaction. The <parameter>xid</parameter>
+ contains the XID because <parameter>txn</parameter> can be NULL in some cases.
+ The <parameter>gid</parameter> is the identifier that later identifies this
+ transaction for <command>COMMIT PREPARED</command> or <command>ROLLBACK PREPARED</command>.
+ </para>
+ <para>
+ The callback has to provide the same static answer for a given combination of
+ <parameter>xid</parameter> and <parameter>gid</parameter> every time it is
+ called.
+ </para>
+ </sect3>
+
<sect3 id="logicaldecoding-output-plugin-message">
<title>Generic Message Callback</title>
@@ -665,7 +782,12 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
non-transactional and the XID was not assigned yet in the transaction
which logged the message. The <parameter>lsn</parameter> has WAL
location of the message. The <parameter>transactional</parameter> says
- if the message was sent as transactional or not.
+ if the message was sent as transactional or not. Similar to the change
+ callback, when decoding a prepared (but not yet committed)
+ transaction or any other uncommitted transaction, this message
+ callback is guaranteed safe access to catalog tables even if
+ another backend concurrently rolls back this very same transaction.
+
The <parameter>prefix</parameter> is arbitrary null-terminated prefix
which can be used for identifying interesting messages for the current
plugin. And finally the <parameter>message</parameter> parameter holds
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index a9ef1b3d73..8d2bda3cde 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1522,6 +1522,14 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
ProcArrayRemove(proc, latestXid);
+ /*
+ * Coordinate with logical decoding backends that may be already
+ * decoding this prepared transaction. When aborting a transaction,
+ * we need to wait for all of them to leave the decoding group. If
+ * committing, we simply remove all members from the group.
+ */
+ LogicalDecodeRemoveTransaction(proc, isCommit);
+
/*
* In case we fail while running the callbacks, mark the gxact invalid so
* no one else will try to commit/rollback, and so it will be recycled if
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 59c003de9c..008958d35e 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -34,6 +34,7 @@
#include "access/xlogutils.h"
#include "access/xlogreader.h"
#include "access/xlogrecord.h"
+#include "access/twophase.h"
#include "catalog/pg_control.h"
@@ -73,6 +74,8 @@ static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_commit *parsed, TransactionId xid);
static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare * parsed);
/* common function to decode tuples */
static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -281,16 +284,33 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
break;
}
case XLOG_XACT_PREPARE:
+ {
+ xl_xact_parsed_prepare parsed;
- /*
- * Currently decoding ignores PREPARE TRANSACTION and will just
- * decode the transaction when the COMMIT PREPARED is sent or
- * throw away the transaction's contents when a ROLLBACK PREPARED
- * is received. In the future we could add code to expose prepared
- * transactions in the changestream allowing for a kind of
- * distributed 2PC.
- */
- ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+ /* check that output plugin is capable of twophase decoding */
+ if (!ctx->enable_twophase)
+ {
+ ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+ break;
+ }
+
+ /* ok, parse it */
+ ParsePrepareRecord(XLogRecGetInfo(buf->record),
+ XLogRecGetData(buf->record), &parsed);
+
+ /* does output plugin want this particular transaction? */
+ if (ctx->callbacks.filter_prepare_cb &&
+ ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+ parsed.twophase_gid))
+ {
+ ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+ buf->origptr);
+ break;
+ }
+
+ DecodePrepare(ctx, buf, &parsed);
+ break;
+ }
break;
default:
elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
@@ -633,9 +653,90 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
buf->origptr, buf->endptr);
}
+ /*
+ * Decide if we're processing COMMIT PREPARED, or a regular COMMIT.
+ * Regular commit simply triggers a replay of transaction changes from the
+ * reorder buffer. For COMMIT PREPARED that however already happened at
+ * PREPARE time, and so we only need to notify the subscriber that the GID
+ * finally committed.
+ *
+ * For output plugins that do not support PREPARE-time decoding of
+ * two-phase transactions, we never even see the PREPARE and all two-phase
+ * transactions simply fall through to the second branch.
+ */
+ if (TransactionIdIsValid(parsed->twophase_xid) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder,
+ parsed->twophase_xid, parsed->twophase_gid))
+ {
+ Assert(xid == parsed->twophase_xid);
+ /* we are processing COMMIT PREPARED */
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid, true);
+ }
+ else
+ {
+ /* replay actions of all transaction + subtransactions in order */
+ ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn);
+ }
+}
+
+/*
+ * Decode PREPARE record. Similar logic as in COMMIT
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare * parsed)
+{
+ XLogRecPtr origin_lsn = parsed->origin_lsn;
+ TimestampTz commit_time = parsed->origin_timestamp;
+ RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+ int i;
+ TransactionId xid = parsed->twophase_xid;
+
+ /*
+ * Process invalidation messages, even if we're not interested in the
+ * transaction's contents, since the various caches need to always be
+ * consistent.
+ */
+ if (parsed->nmsgs > 0)
+ {
+ if (!ctx->fast_forward)
+ ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
+ parsed->nmsgs, parsed->msgs);
+ ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+ }
+
+ /*
+ * Tell the reorderbuffer about the surviving subtransactions. We need to
+ * do this because the main transaction itself has not committed since we
+ * are in the prepare phase right now. So we need to be sure the snapshot
+ * is setup correctly for the main transaction in case all changes
+ * happened in subtransactions.
+ */
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+ buf->origptr, buf->endptr);
+ }
+
+ if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+ (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+ ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+ {
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr);
+ }
+ ReorderBufferForget(ctx->reorder, xid, buf->origptr);
+
+ return;
+ }
+
/* replay actions of all transaction + subtransactions in order */
- ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
- commit_time, origin_id, origin_lsn);
+ ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn, parsed->twophase_gid);
}
/*
@@ -647,6 +748,30 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid)
{
int i;
+ XLogRecPtr origin_lsn = InvalidXLogRecPtr;
+ TimestampTz commit_time = 0;
+ RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ origin_lsn = parsed->origin_lsn;
+ commit_time = parsed->origin_timestamp;
+ }
+
+ /*
+ * If it's ROLLBACK PREPARED then handle it via callbacks.
+ */
+ if (TransactionIdIsValid(xid) &&
+ !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+ parsed->dbId == ctx->slot->data.database &&
+ !FilterByOrigin(ctx, origin_id) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
+ {
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid, false);
+ return;
+ }
for (i = 0; i < parsed->nsubxacts; i++)
{
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 073aa41be2..88feb98312 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -60,6 +60,16 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+static void abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change);
static void truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -127,6 +137,7 @@ StartupDecodingContext(List *output_plugin_options,
MemoryContext context,
old_context;
LogicalDecodingContext *ctx;
+ int twophase_callbacks;
/* shorter lines... */
slot = MyReplicationSlot;
@@ -187,8 +198,38 @@ StartupDecodingContext(List *output_plugin_options,
ctx->reorder->apply_change = change_cb_wrapper;
ctx->reorder->apply_truncate = truncate_cb_wrapper;
ctx->reorder->commit = commit_cb_wrapper;
+ ctx->reorder->abort = abort_cb_wrapper;
+ ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+ ctx->reorder->prepare = prepare_cb_wrapper;
+ ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+ ctx->reorder->abort_prepared = abort_prepared_cb_wrapper;
ctx->reorder->message = message_cb_wrapper;
+ /*
+ * Check that plugin implements all callbacks necessary to decode
+ * two-phase transactions - we either have to have all of them or none.
+ * The filter_prepare callback is optional, but can only be defined when
+ * two-phase decoding is enabled (i.e. the three other callbacks are
+ * defined).
+ */
+ twophase_callbacks = (ctx->callbacks.prepare_cb != NULL) +
+ (ctx->callbacks.commit_prepared_cb != NULL) +
+ (ctx->callbacks.abort_prepared_cb != NULL);
+
+ ctx->enable_twophase = (twophase_callbacks == 3);
+
+ /* Plugins with incorrect number of two-phase callbacks are broken. */
+ if ((twophase_callbacks != 3) && (twophase_callbacks != 0))
+ ereport(ERROR,
+ (errmsg("output plugin registered only %d of the 3 two-phase callbacks",
+ twophase_callbacks)));
+
+ /* filter_prepare is optional, but requires two-phase decoding */
+ if ((ctx->callbacks.filter_prepare_cb != NULL) && (!ctx->enable_twophase))
+ ereport(ERROR,
+ (errmsg("Output plugin does not support two-phase decoding, but "
+ "registered filter_prepare callback.")));
+
ctx->out = makeStringInfo();
ctx->prepare_write = prepare_write;
ctx->write = do_write;
@@ -705,6 +746,122 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
error_context_stack = errcallback.previous;
}
+static void
+abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "abort";
+ state.report_location = txn->final_lsn; /* beginning of abort record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.abort_cb(ctx, txn, abort_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "prepare";
+ state.report_location = txn->final_lsn; /* beginning of prepare record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "commit_prepared";
+ state.report_location = txn->final_lsn; /* beginning of commit record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "abort_prepared";
+ state.report_location = txn->final_lsn; /* beginning of abort record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.abort_prepared_cb(ctx, txn, abort_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
static void
change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change)
@@ -782,6 +939,51 @@ truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
error_context_stack = errcallback.previous;
}
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+ bool ret;
+
+ /*
+ * Skip if decoding of twophase at PREPARE time is not enabled. In that
+ * case all twophase transactions are considered filtered out and will be
+ * applied as regular transactions at COMMIT PREPARED.
+ */
+ if (!ctx->enable_twophase)
+ return true;
+
+ /*
+ * The filter_prepare callback is optional. When not supplied, all
+ * prepared transactions should go through.
+ */
+ if (!ctx->callbacks.filter_prepare_cb)
+ return false;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "filter_prepare";
+ state.report_location = InvalidXLogRecPtr;
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = false;
+
+ /* do the actual work: call callback */
+ ret = ctx->callbacks.filter_prepare_cb(ctx, txn, xid, gid);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+
+ return ret;
+}
+
bool
filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
{
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 133749110e..628e7d3493 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -337,6 +337,11 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
}
/* free data that's contained */
+ if (txn->gid != NULL)
+ {
+ pfree(txn->gid);
+ txn->gid = NULL;
+ }
if (txn->tuplecid_hash != NULL)
{
@@ -1388,25 +1393,18 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
* and subtransactions (using a k-way merge) and replay the changes in lsn
* order.
*/
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
- XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
- TimestampTz commit_time,
- RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+ ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn)
{
- ReorderBufferTXN *txn;
volatile Snapshot snapshot_now;
volatile CommandId command_id = FirstCommandId;
bool using_subtxn;
ReorderBufferIterTXNState *volatile iterstate = NULL;
- txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
- false);
-
- /* unknown transaction, nothing to replay */
- if (txn == NULL)
- return;
-
txn->final_lsn = commit_lsn;
txn->end_lsn = end_lsn;
txn->commit_time = commit_time;
@@ -1483,8 +1481,12 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
case REORDER_BUFFER_CHANGE_DELETE:
Assert(snapshot_now);
+ /* Lock transaction before catalog access */
+ if (!LogicalLockTransaction(txn))
+ goto change_cleanup;
reloid = RelidByRelfilenode(change->data.tp.relnode.spcNode,
change->data.tp.relnode.relNode);
+ LogicalUnlockTransaction(txn);
/*
* Catalog tuple without data, emitted while catalog was
@@ -1499,8 +1501,14 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
relpathperm(change->data.tp.relnode,
MAIN_FORKNUM));
+ /* Lock transaction before catalog access */
+ if (!LogicalLockTransaction(txn))
+ goto change_cleanup;
+
relation = RelationIdGetRelation(reloid);
+ LogicalUnlockTransaction(txn);
+
if (relation == NULL)
elog(ERROR, "could not open relation with OID %u (for filenode \"%s\")",
reloid,
@@ -1529,8 +1537,23 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
/* user-triggered change */
if (!IsToastRelation(relation))
{
+ /*
+ * Output plugins can access catalog metadata and we
+ * do not have any control over that. We could ask
+ * them to call
+ * LogicalLockTransaction/LogicalUnlockTransaction
+ * APIs themselves, but that leads to unnecessary
+ * complications and expectations from plugin
+ * writers. We avoid this by calling these APIs
+ * here, thereby ensuring that the in-progress
+ * transaction will be around for the duration of
+ * the apply_change call below
+ */
+ if (!LogicalLockTransaction(txn))
+ goto change_cleanup;
ReorderBufferToastReplace(rb, txn, relation, change);
rb->apply_change(rb, txn, relation, change);
+ LogicalUnlockTransaction(txn);
/*
* Only clear reassembled toast chunks if we're sure
@@ -1615,7 +1638,10 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
Oid relid = change->data.truncate.relids[i];
Relation relation;
+ if (!LogicalLockTransaction(txn))
+ goto change_cleanup;
relation = RelationIdGetRelation(relid);
+ LogicalUnlockTransaction(txn);
if (relation == NULL)
elog(ERROR, "could not open relation with OID %u", relid);
@@ -1635,10 +1661,13 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
}
case REORDER_BUFFER_CHANGE_MESSAGE:
+ if (!LogicalLockTransaction(txn))
+ goto change_cleanup;
rb->message(rb, txn, change->lsn, true,
change->data.msg.prefix,
change->data.msg.message_size,
change->data.msg.message);
+ LogicalUnlockTransaction(txn);
break;
case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
@@ -1708,7 +1737,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
break;
}
}
-
+change_cleanup:
/*
* There's a speculative insertion remaining, just clean in up, it
* can't have been successful, otherwise we'd gotten a confirmation
@@ -1724,8 +1753,26 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
ReorderBufferIterTXNFinish(rb, iterstate);
iterstate = NULL;
- /* call commit callback */
- rb->commit(rb, txn, commit_lsn);
+ /*
+ * Call abort/commit/prepare callback, depending on the transaction
+ * state.
+ *
+ * If the transaction aborted during apply (which currently can happen
+ * only for prepared transactions), simply call the abort callback.
+ *
+ * Otherwise call either PREPARE (for twophase transactions) or COMMIT
+ * (for regular ones).
+ */
+ if (rbtxn_rollback(txn))
+ rb->abort(rb, txn, commit_lsn);
+ else if (rbtxn_prepared(txn))
+ rb->prepare(rb, txn, commit_lsn);
+ else
+ rb->commit(rb, txn, commit_lsn);
+
+ /* remove ourself from the decodeGroupLeader */
+ if (MyProc->decodeGroupLeader)
+ RemoveDecodeGroupMember(MyProc->decodeGroupLeader);
/* this is just a sanity check against bad output plugin behaviour */
if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -1752,7 +1799,12 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
if (snapshot_now->copied)
ReorderBufferFreeSnap(rb, snapshot_now);
- /* remove potential on-disk data, and deallocate */
+ /*
+ * remove potential on-disk data, and deallocate.
+ *
+ * We remove it even for prepared transactions (GID is enough to
+ * commit/abort those later).
+ */
ReorderBufferCleanupTXN(rb, txn);
}
PG_CATCH();
@@ -1786,6 +1838,141 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
PG_END_TRY();
}
+
+/*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+ return rb->filter_prepare(rb, txn, xid, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferCommitInternal()
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /* unknown transaction, nothing to replay */
+ if (txn == NULL)
+ return;
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a twophase transaction. It calls ReorderBufferCommitInternal()
+ * since all prepared transactions need to be decoded at PREPARE time.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /* unknown transaction, nothing to replay */
+ if (txn == NULL)
+ return;
+
+ txn->txn_flags |= RBTXN_PREPARE;
+ txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+ strcpy(txn->gid, gid);
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Check whether this transaction was sent as prepared to subscribers.
+ * Called while handling commit|abort prepared.
+ */
+bool
+ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+ const char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /*
+ * Always call the prepare filter. It's the job of the prepare filter to
+ * give us the *same* response for a given xid across multiple calls
+ * (including ones on restart)
+ */
+ return !(rb->filter_prepare(rb, txn, xid, gid));
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ABORT PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit)
+{
+ ReorderBufferTXN *txn;
+
+ /*
+ * The transaction may or may not exist (during restarts for example).
+ * Anyways, 2PC transactions do not contain any reorderbuffers. So allow
+ * it to be created below.
+ */
+ txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+ true);
+
+ txn->final_lsn = commit_lsn;
+ txn->end_lsn = end_lsn;
+ txn->commit_time = commit_time;
+ txn->origin_id = origin_id;
+ txn->origin_lsn = origin_lsn;
+ /* this txn is obviously prepared */
+ txn->txn_flags |= RBTXN_PREPARE;
+ txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+ strcpy(txn->gid, gid);
+
+ if (is_commit)
+ {
+ txn->txn_flags |= RBTXN_COMMIT_PREPARED;
+ rb->commit_prepared(rb, txn, commit_lsn);
+ }
+ else
+ {
+ txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
+ rb->abort_prepared(rb, txn, commit_lsn);
+ }
+
+ /* cleanup: make sure there's no cache pollution */
+ ReorderBufferExecuteInvalidations(rb, txn);
+ ReorderBufferCleanupTXN(rb, txn);
+}
+
/*
* Abort a transaction that possibly has previous changes. Needs to be first
* called for subtransactions and then for the toplevel xid.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 069eb7a272..ea8af26a09 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -47,7 +47,7 @@ typedef struct LogicalDecodingContext
/*
* Marks the logical decoding context as fast forward decoding one. Such a
- * context does not have plugin loaded so most of the the following
+ * context does not have plugin loaded so most of the following
* properties are unused.
*/
bool fast_forward;
@@ -89,6 +89,11 @@ typedef struct LogicalDecodingContext
bool prepared_write;
XLogRecPtr write_location;
TransactionId write_xid;
+
+ /*
+ * Capabilities of the output plugin.
+ */
+ bool enable_twophase;
} LogicalDecodingContext;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 1ee0a56f03..e4070aa8a2 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -77,6 +77,46 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+/*
+ * Called for an implicit ABORT of a transaction.
+ */
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+/*
+ * Called before decoding of PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to prepare and
+ * commit_prepared/abort_prepared callbacks or wait till COMMIT PREPARED and
+ * be sent as a usual transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
/*
* Called for the generic logical decoding messages.
*/
@@ -109,7 +149,12 @@ typedef struct OutputPluginCallbacks
LogicalDecodeChangeCB change_cb;
LogicalDecodeTruncateCB truncate_cb;
LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeAbortCB abort_cb;
LogicalDecodeMessageCB message_cb;
+ LogicalDecodeFilterPrepareCB filter_prepare_cb;
+ LogicalDecodePrepareCB prepare_cb;
+ LogicalDecodeCommitPreparedCB commit_prepared_cb;
+ LogicalDecodeAbortPreparedCB abort_prepared_cb;
LogicalDecodeFilterByOriginCB filter_by_origin_cb;
LogicalDecodeShutdownCB shutdown_cb;
} OutputPluginCallbacks;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 473ec85a7e..8e1fa08b58 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
#define REORDERBUFFER_H
#include "access/htup_details.h"
+#include "access/twophase.h"
#include "lib/ilist.h"
#include "storage/sinval.h"
#include "utils/hsearch.h"
@@ -195,6 +196,9 @@ typedef struct ReorderBufferTXN
/* Do we know this is a subxact? Xid of top-level txn if so */
TransactionId toplevel_xid;
+ /* In case of 2PC we need to pass GID to output plugin */
+ char *gid;
+
/*
* LSN of the first data carrying, WAL record with knowledge about this
* xid. This is allowed to *not* be first record adorned with this xid, if
@@ -339,6 +343,37 @@ typedef void (*ReorderBufferCommitCB) (
ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+/* abort callback signature */
+typedef void (*ReorderBufferAbortCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+typedef bool (*ReorderBufferFilterPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/* abort prepared callback signature */
+typedef void (*ReorderBufferAbortPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+
/* message callback signature */
typedef void (*ReorderBufferMessageCB) (
ReorderBuffer *rb,
@@ -384,6 +419,11 @@ struct ReorderBuffer
ReorderBufferApplyChangeCB apply_change;
ReorderBufferApplyTruncateCB apply_truncate;
ReorderBufferCommitCB commit;
+ ReorderBufferAbortCB abort;
+ ReorderBufferFilterPrepareCB filter_prepare;
+ ReorderBufferPrepareCB prepare;
+ ReorderBufferCommitPreparedCB commit_prepared;
+ ReorderBufferAbortPreparedCB abort_prepared;
ReorderBufferMessageCB message;
/*
@@ -431,6 +471,11 @@ void ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot
void ReorderBufferCommit(ReorderBuffer *, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit);
void ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
void ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
@@ -454,6 +499,15 @@ void ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
bool ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
bool ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
+bool ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+ const char *gid);
+bool ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+ const char *gid);
+void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid);
ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
--
2.15.2 (Apple Git-101.1)
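
The filter_prepare contract from the patch above can be illustrated with a minimal, self-contained sketch. It deliberately uses no PostgreSQL headers: SketchPluginState, sketch_filter_prepare, sketch_txn_is_prepared and the gid_prefix setting are hypothetical names made up for illustration, not part of the patch. As in ReorderBufferTxnIsPrepared(), a filter result of true means the transaction is not decoded at PREPARE time, and the answer depends only on stable inputs so that repeated calls (including after a restart) agree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a plugin's private state; not a PostgreSQL type. */
typedef struct SketchPluginState
{
	bool		enable_twophase;	/* mirrors the enable-twophase option */
	const char *gid_prefix;			/* assumed extra rule; NULL accepts any GID */
} SketchPluginState;

/*
 * Same meaning as filter_prepare_cb: returning true filters the transaction
 * out of two-phase decoding, so it is sent as a usual transaction at
 * COMMIT PREPARED.  The result depends only on the state and the GID.
 */
static bool
sketch_filter_prepare(const SketchPluginState *state, const char *gid)
{
	assert(state != NULL && gid != NULL);

	/* treat all transactions as one-phase */
	if (!state->enable_twophase)
		return true;

	/* assumed rule: only GIDs with the configured prefix are decoded as 2PC */
	if (state->gid_prefix != NULL &&
		strncmp(gid, state->gid_prefix, strlen(state->gid_prefix)) != 0)
		return true;

	return false;
}

/* Same shape as ReorderBufferTxnIsPrepared(): the negated filter answer. */
static bool
sketch_txn_is_prepared(const SketchPluginState *state, const char *gid)
{
	return !sketch_filter_prepare(state, gid);
}
```

Because the decision is a pure function of the GID and the plugin options, the COMMIT PREPARED or ROLLBACK PREPARED handling can ask the same question again later and get the answer that was given at PREPARE time.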
0004-Teach-test_decoding-plugin-to-work-with-2PC.patch (application/octet-stream)
From 5032421d2a2ba58f947cb65f9b379e3f512e88c0 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Wed, 13 Jun 2018 16:31:15 +0530
Subject: [PATCH 4/5] Teach test_decoding plugin to work with 2PC
Includes a new option "enable-twophase". Depending on this option's
value, PREPARE TRANSACTION is either decoded at prepare time or treated
as a single-phase commit later.
---
contrib/test_decoding/expected/prepared.out | 257 +++++++++++++++++++++++++---
contrib/test_decoding/sql/prepared.sql | 84 ++++++++-
contrib/test_decoding/test_decoding.c | 137 +++++++++++++++
3 files changed, 451 insertions(+), 27 deletions(-)
diff --git a/contrib/test_decoding/expected/prepared.out b/contrib/test_decoding/expected/prepared.out
index 46e915d4ff..5df7b7ff20 100644
--- a/contrib/test_decoding/expected/prepared.out
+++ b/contrib/test_decoding/expected/prepared.out
@@ -6,19 +6,82 @@ SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_d
init
(1 row)
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc', 'test_decoding');
+ ?column?
+----------
+ init
+(1 row)
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
-- test simple successful use of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (1);
PREPARE TRANSACTION 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+------
+(0 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ PREPARE TRANSACTION 'test_prepared#1'
+(3 rows)
+
COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ COMMIT
+(3 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
INSERT INTO test_prepared1 VALUES (2);
-- test abort of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (3);
PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+(3 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(6 rows)
+
ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+------
+(0 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
INSERT INTO test_prepared1 VALUES (4);
-- test prepared xact containing ddl
BEGIN;
@@ -26,45 +89,193 @@ INSERT INTO test_prepared1 VALUES (5);
ALTER TABLE test_prepared1 ADD COLUMN data text;
INSERT INTO test_prepared1 VALUES (6, 'frakbar');
PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
-INSERT INTO test_prepared2 VALUES (7);
-COMMIT PREPARED 'test_prepared#3';
--- make sure stuff still works
-INSERT INTO test_prepared1 VALUES (8);
-INSERT INTO test_prepared2 VALUES (9);
--- cleanup
-DROP TABLE test_prepared1;
-DROP TABLE test_prepared2;
--- show results
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'test_prepared1'::regclass;
+ relation | locktype | mode
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4
+ COMMIT
+(3 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
data
-------------------------------------------------------------------------
BEGIN
- table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:4
COMMIT
BEGIN
- table public.test_prepared1: INSERT: id[integer]:2
- COMMIT
+ table public.test_prepared1: INSERT: id[integer]:5
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(7 rows)
+
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------
BEGIN
- table public.test_prepared1: INSERT: id[integer]:4
+ table public.test_prepared2: INSERT: id[integer]:7
COMMIT
+(3 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+----------------------------------------------------
BEGIN
table public.test_prepared2: INSERT: id[integer]:7
COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+-------------------------------------------------------------------------
BEGIN
table public.test_prepared1: INSERT: id[integer]:5
table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
COMMIT
+(4 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#3'
+(1 row)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (8);
+INSERT INTO test_prepared2 VALUES (9);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+--------------------------------------------------------------------
BEGIN
table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
COMMIT
BEGIN
table public.test_prepared2: INSERT: id[integer]:9
COMMIT
-(22 rows)
+(6 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:9
+ COMMIT
+(6 rows)
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'pg_class'::regclass;
+ relation | locktype | mode
+----------+----------+------
+(0 rows)
+
+-- Shouldn't see anything with 2pc decoding off
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+-----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:12
+ COMMIT
+(3 rows)
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:12
+ PREPARE TRANSACTION 'test_prepared_lock2'
+ COMMIT PREPARED 'test_prepared_lock2'
+(8 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- Both will work normally after we commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ COMMIT
+(4 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+--------------------------------------
+ COMMIT PREPARED 'test_prepared_lock'
+(1 row)
+
+-- test savepoints
+BEGIN;
+SAVEPOINT test_savepoint;
+CREATE TABLE test_prepared_savepoint (a int);
+PREPARE TRANSACTION 'test_prepared_savepoint';
+COMMIT PREPARED 'test_prepared_savepoint';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+-------------------------------------------
+ COMMIT PREPARED 'test_prepared_savepoint'
+(1 row)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+------
+(0 rows)
+
+-- cleanup
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+------
+(0 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+------
+(0 rows)
SELECT pg_drop_replication_slot('regression_slot');
pg_drop_replication_slot
@@ -72,3 +283,9 @@ SELECT pg_drop_replication_slot('regression_slot');
(1 row)
+SELECT pg_drop_replication_slot('regression_slot_2pc');
+ pg_drop_replication_slot
+--------------------------
+
+(1 row)
+
diff --git a/contrib/test_decoding/sql/prepared.sql b/contrib/test_decoding/sql/prepared.sql
index e72639767e..e8eb8ad8d6 100644
--- a/contrib/test_decoding/sql/prepared.sql
+++ b/contrib/test_decoding/sql/prepared.sql
@@ -1,22 +1,31 @@
-- predictability
SET synchronous_commit = on;
SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc', 'test_decoding');
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
-- test simple successful use of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (1);
PREPARE TRANSACTION 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
INSERT INTO test_prepared1 VALUES (2);
-- test abort of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (3);
PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
INSERT INTO test_prepared1 VALUES (4);
@@ -27,24 +36,85 @@ ALTER TABLE test_prepared1 ADD COLUMN data text;
INSERT INTO test_prepared1 VALUES (6, 'frakbar');
PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'test_prepared1'::regclass;
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
-- make sure stuff still works
INSERT INTO test_prepared1 VALUES (8);
INSERT INTO test_prepared2 VALUES (9);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'pg_class'::regclass;
+
+-- Shouldn't see anything with 2pc decoding off
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+RESET statement_timeout;
+
+COMMIT PREPARED 'test_prepared_lock';
+
+-- Both will work normally after we commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+
+-- test savepoints
+BEGIN;
+SAVEPOINT test_savepoint;
+CREATE TABLE test_prepared_savepoint (a int);
+PREPARE TRANSACTION 'test_prepared_savepoint';
+COMMIT PREPARED 'test_prepared_savepoint';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
-- cleanup
DROP TABLE test_prepared1;
DROP TABLE test_prepared2;
--- show results
+-- show results. There should be nothing to show
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+
SELECT pg_drop_replication_slot('regression_slot');
+SELECT pg_drop_replication_slot('regression_slot_2pc');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 1c439b57b0..140010a8b1 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -36,6 +36,7 @@ typedef struct
bool skip_empty_xacts;
bool xact_wrote_changes;
bool only_local;
+ bool enable_twophase;
} TestDecodingData;
static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -49,6 +50,8 @@ static void pg_output_begin(LogicalDecodingContext *ctx,
bool last_write);
static void pg_decode_commit_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pg_decode_abort_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr abort_lsn);
static void pg_decode_change(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, Relation rel,
ReorderBufferChange *change);
@@ -62,6 +65,18 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr message_lsn,
bool transactional, const char *prefix,
Size sz, const char *message);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
void
_PG_init(void)
@@ -80,9 +95,14 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
cb->change_cb = pg_decode_change;
cb->truncate_cb = pg_decode_truncate;
cb->commit_cb = pg_decode_commit_txn;
+ cb->abort_cb = pg_decode_abort_txn;
cb->filter_by_origin_cb = pg_decode_filter;
cb->shutdown_cb = pg_decode_shutdown;
cb->message_cb = pg_decode_message;
+ cb->filter_prepare_cb = pg_decode_filter_prepare;
+ cb->prepare_cb = pg_decode_prepare_txn;
+ cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+ cb->abort_prepared_cb = pg_decode_abort_prepared_txn;
}
@@ -102,6 +122,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
data->include_timestamp = false;
data->skip_empty_xacts = false;
data->only_local = false;
+ data->enable_twophase = false;
ctx->output_plugin_private = data;
@@ -183,6 +204,16 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
errmsg("could not parse value \"%s\" for parameter \"%s\"",
strVal(elem->arg), elem->defname)));
}
+ else if (strcmp(elem->defname, "enable-twophase") == 0)
+ {
+ if (elem->arg == NULL)
+ data->enable_twophase = true;
+ else if (!parse_bool(strVal(elem->arg), &data->enable_twophase))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("could not parse value \"%s\" for parameter \"%s\"",
+ strVal(elem->arg), elem->defname)));
+ }
else
{
ereport(ERROR,
@@ -251,6 +282,112 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
OutputPluginWrite(ctx, true);
}
+/* ABORT callback */
+static void
+pg_decode_abort_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+ if (data->include_xids)
+ appendStringInfo(ctx->out, "ABORT %u", txn->xid);
+ else
+ appendStringInfoString(ctx->out, "ABORT");
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* Filter out two-phase transactions if 2PC decoding is not enabled. */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ /* treat all transactions as one-phase */
+ if (!data->enable_twophase)
+ return true;
+
+ return false;
+}
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* ABORT PREPARED callback */
+static void
+pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
static bool
pg_decode_filter(LogicalDecodingContext *ctx,
RepOriginId origin_id)
--
2.15.2 (Apple Git-101.1)
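
The regression output in the patch above shows what a consumer sees with and without enable-twophase. A small stand-alone sketch of that ordering follows. It is a simplified model rather than code from the patch: sketch_at_prepare and sketch_at_commit_prepared are hypothetical helpers, only one change per transaction is modelled, and the GID is wrapped in plain single quotes instead of going through quote_literal_cstr().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Text emitted when the PREPARE record is decoded.  With two-phase decoding
 * the changes are sent right away; otherwise nothing is sent yet.
 */
static void
sketch_at_prepare(bool enable_twophase, const char *change, const char *gid,
				  char *out, size_t outlen)
{
	assert(out != NULL && outlen > 0);

	if (enable_twophase)
		snprintf(out, outlen, "BEGIN\n%s\nPREPARE TRANSACTION '%s'\n",
				 change, gid);
	else
		out[0] = '\0';
}

/*
 * Text emitted when COMMIT PREPARED is decoded.  With two-phase decoding
 * only the standalone event is sent; otherwise the whole transaction is
 * sent as a usual one.
 */
static void
sketch_at_commit_prepared(bool enable_twophase, const char *change,
						  const char *gid, char *out, size_t outlen)
{
	assert(out != NULL && outlen > 0);

	if (enable_twophase)
		snprintf(out, outlen, "COMMIT PREPARED '%s'\n", gid);
	else
		snprintf(out, outlen, "BEGIN\n%s\nCOMMIT\n", change);
}
```

In both modes the change is delivered exactly once; the option only moves the delivery from COMMIT PREPARED time to PREPARE time.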
0005-OPTIONAL-Additional-test-case-to-demonstrate-decoding-rollbac.patch (application/octet-stream)
From 3b8424323bc0a38fdbf332e55ba0b9c6d00d25ba Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Wed, 13 Jun 2018 16:32:16 +0530
Subject: [PATCH 5/5] Additional test case to demonstrate decoding/rollback
interlocking
Introduce a decode-delay parameter in the test_decoding plugin. Based
on the value provided, the plugin sleeps for that many seconds while
holding the LogicalTransactionLock. A concurrent rollback is fired
off that aborts the transaction in the meantime.
---
contrib/test_decoding/Makefile | 5 +-
contrib/test_decoding/t/001_twophase.pl | 101 ++++++++++++++++++++++++
contrib/test_decoding/test_decoding.c | 28 +++++++
src/backend/replication/logical/reorderbuffer.c | 5 ++
4 files changed, 138 insertions(+), 1 deletion(-)
create mode 100644 contrib/test_decoding/t/001_twophase.pl
diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index afcab930f7..3f0b1c6ebd 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -26,7 +26,7 @@ installcheck:;
# installation, allow to do so, but only if requested explicitly.
installcheck-force: regresscheck-install-force isolationcheck-install-force
-check: regresscheck isolationcheck
+check: regresscheck isolationcheck 2pc-check
submake-regress:
$(MAKE) -C $(top_builddir)/src/test/regress all
@@ -67,3 +67,6 @@ isolationcheck-install-force: all | submake-isolation submake-test_decoding temp
isolationcheck isolationcheck-install-force
temp-install: EXTRA_INSTALL=contrib/test_decoding
+
+2pc-check: temp-install
+ $(prove_check)
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000000..f154c89908
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,101 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+ 'postgresql.conf', qq(
+ max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+ "INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+ "SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+$node_logical->safe_psql('postgres',
+ "SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot2', 'test_decoding');");
+
+# This test specifically exercises a concurrent abort while logical decoding is
+# ongoing. The decode-delay value makes the decoding of each change sleep for
+# that many seconds. We also hold the LogicalLockTransaction while we sleep.
+# We will fire off a ROLLBACK from another session while this delayed decode is
+# ongoing. Since we are holding the lock from the call above, this ROLLBACK
+# will wait for the logical backends to do a LogicalUnlockTransaction. We will
+# stop decoding immediately after this, and the next pg_logical_slot_get_changes call
+# should show only a few records decoded from the entire two-phase transaction.
+#
+# We use two slots to test multiple decoding backends here
+
+$node_logical->safe_psql('postgres', "
+ BEGIN;
+ INSERT INTO tab VALUES (11);
+ INSERT INTO tab VALUES (12);
+ ALTER TABLE tab ADD COLUMN b INT;
+ PREPARE TRANSACTION 'test_prepared_tab';");
+# start decoding the above with decode-delay in the background.
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should only decode 1 INSERT record and should include
+# an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1', 'decode-delay', '3');\" \&");
+
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot2', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1', 'decode-delay', '3');\" \&");
+
+# sleep for a little while (shorter than decode-delay)
+$node_logical->safe_psql('postgres', "select pg_sleep(1)");
+
+# rollback the prepared transaction whose first record is being decoded
+# after sleeping for decode-delay time
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# wait for decoding to stop
+$node_logical->psql('postgres', "select pg_sleep(4)");
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot2', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+
+# check for occurrence of log about waiting backends
+my $output_file = slurp_file($node_logical->logfile());
+my $waiting_str = "Waiting for backends to abort";
+like($output_file, qr/$waiting_str/, "Waiting log found in server log");
+
+# check for occurrence of log about stopping decoding
+my $abort_str = "stopping decoding of test_prepared_tab ";
+like($output_file, qr/$abort_str/, "ABORT found in server log");
+
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+ BEGIN;
+ INSERT INTO tab VALUES (11);
+ INSERT INTO tab VALUES (12);
+ ALTER TABLE tab ADD COLUMN b INT;
+ INSERT INTO tab VALUES (13, 11);
+ PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot2', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit post the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot2', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot2');");
+$node_logical->stop('fast');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 140010a8b1..ed0dbff8e2 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -37,6 +37,7 @@ typedef struct
bool xact_wrote_changes;
bool only_local;
bool enable_twophase;
+ int decode_delay; /* seconds to sleep after every change record */
} TestDecodingData;
static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -123,6 +124,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
data->skip_empty_xacts = false;
data->only_local = false;
data->enable_twophase = false;
+ data->decode_delay = 0;
ctx->output_plugin_private = data;
@@ -214,6 +216,21 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
errmsg("could not parse value \"%s\" for parameter \"%s\"",
strVal(elem->arg), elem->defname)));
}
+ else if (strcmp(elem->defname, "decode-delay") == 0)
+ {
+ if (elem->arg == NULL)
+ data->decode_delay = 2; /* default to 2 seconds */
+ else
+ data->decode_delay = pg_atoi(strVal(elem->arg),
+ sizeof(int), 0);
+
+ if (data->decode_delay <= 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("Specify positive value for parameter \"%s\","
+ " you specified \"%s\"",
+ elem->defname, strVal(elem->arg))));
+ }
else
{
ereport(ERROR,
@@ -553,6 +570,17 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
}
data->xact_wrote_changes = true;
+ /*
+ * if decode_delay is specified, sleep. Note that this
+ * happens with LogicalLockTransaction held from the
+ * decoding infrastructure
+ */
+ if (data->decode_delay > 0)
+ {
+ elog(LOG, "sleeping for %d seconds", data->decode_delay);
+ pg_usleep(data->decode_delay * 1000000L);
+ }
+
class_form = RelationGetForm(relation);
tupdesc = RelationGetDescr(relation);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 628e7d3493..0b97211a64 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1483,7 +1483,12 @@ ReorderBufferCommitInternal(ReorderBufferTXN *txn,
/* Lock transaction before catalog access */
if (!LogicalLockTransaction(txn))
+ {
+ elog(LOG, "stopping decoding of %s (%u)",
+ txn->gid[0] != '\0'? txn->gid:"",
+ txn->xid);
goto change_cleanup;
+ }
reloid = RelidByRelfilenode(change->data.tp.relnode.spcNode,
change->data.tp.relnode.relNode);
LogicalUnlockTransaction(txn);
--
2.15.2 (Apple Git-101.1)
Hi Nikhil,
I've been looking at this patch series, and I do have a bunch of
comments and questions, as usual ;-)
Overall, I think it's clear the main risk associated with this patch is
the decode group code - it touches PROC entries, so a bug may cause
trouble pretty easily. So I've focused on this part, for now.
1) LogicalLockTransaction does roughly this
...
if (MyProc->decodeGroupLeader == NULL)
{
leader = AssignDecodeGroupLeader(txn->xid);
if (leader == NULL ||
!BecomeDecodeGroupMember((PGPROC *)leader, txn->xid))
goto lock_cleanup;
}
leader = BackendXidGetProc(txn->xid);
if (!leader)
goto lock_cleanup;
leader_lwlock = LockHashPartitionLockByProc(leader);
LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
pgxact = &ProcGlobal->allPgXact[leader->pgprocno];
if(pgxact->xid != txn->xid)
{
LWLockRelease(leader_lwlock);
goto lock_cleanup;
}
...
I wonder why we need the BackendXidGetProc call after the first if
block. Can we simply grab MyProc->decodeGroupLeader at that point?
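To make the question in (1) concrete, here is a minimal sketch of the simplified flow, using mock types rather than the real PGPROC. Everything prefixed with mock_ or Mock is a hypothetical stand-in invented for illustration, and the partition locking from the patch is omitted. It only shows that the cached leader pointer plus the xid recheck covers the same cases as a second lookup by xid; whether that holds for the real code depends on the locking rules of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the patch's structures; not the real PGPROC. */
typedef uint32_t MockTransactionId;

typedef struct MockProc
{
    MockTransactionId xid;              /* xid this process currently runs */
    struct MockProc *decodeGroupLeader; /* NULL until a group is joined */
} MockProc;

/* Stand-in for AssignDecodeGroupLeader() + BecomeDecodeGroupMember(). */
static bool
mock_join_group(MockProc *me, MockProc *candidate, MockTransactionId xid)
{
    if (candidate == NULL || candidate->xid != xid)
        return false;
    me->decodeGroupLeader = candidate;
    return true;
}

/*
 * Returns the leader to lock against, or NULL if decoding must stop.
 * The leader comes from the cached pointer; the xid recheck (done under
 * the partition lock in the patch) still catches a finished transaction.
 */
static MockProc *
mock_lock_transaction(MockProc *me, MockProc *candidate, MockTransactionId xid)
{
    MockProc *leader;

    assert(me != NULL);

    if (me->decodeGroupLeader == NULL &&
        !mock_join_group(me, candidate, xid))
        return NULL;

    leader = me->decodeGroupLeader;     /* no second lookup by xid */
    if (leader->xid != xid)             /* recheck, as in the quoted code */
        return NULL;
    return leader;
}
```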
2) InitProcess now resets decodeAbortPending/decodeLocked flags, while
checking decodeGroupLeader/decodeGroupMembers using asserts. Isn't that
a bit strange? Shouldn't it do the same thing with both?
3) A comment in ProcKill says this:
* Detach from any decode group of which we are a member. If the leader
* exits before all other group members, its PGPROC will remain allocated
* until the last group process exits; that process must return the
* leader's PGPROC to the appropriate list.
So I'm wondering what happens if the leader dies before other group
members, but the PROC entry gets reused for a new connection. It clearly
should not be a leader for that old decode group, but it may need to be
a leader for another group.
4) strange hunk in ProcKill
There seems to be some sort of merge/rebase issue, because this block of
code (line ~880) related to lock groups
/* Return PGPROC structure (and semaphore) to appropriate freelist */
proc->links.next = (SHM_QUEUE *) *procgloballist;
*procgloballist = proc;
got replaced by code related to decode groups. That seems strange.
5) ReorderBufferCommitInternal
I see the LogicalLockTransaction() calls in ReorderBufferCommitInternal
have vastly variable comments. Some calls have no comment, some calls
have "obvious" comment like "Lock transaction before catalog access" and
one call has this very long comment
/*
* Output plugins can access catalog metadata and we
* do not have any control over that. We could ask
* them to call
* LogicalLockTransaction/LogicalUnlockTransaction
* APIs themselves, but that leads to unnecessary
* complications and expectations from plugin
* writers. We avoid this by calling these APIs
* here, thereby ensuring that the in-progress
* transaction will be around for the duration of
* the apply_change call below
*/
I find that rather inconsistent, and I'd say those comments are useless.
I suggest removing all the per-call comments and instead adding a comment
about the locking to the initial file-level comment, which already
explains handling of large transactions, etc.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jul 16, 2018 at 11:21 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
Overall, I think it's clear the main risk associated with this patch is the
decode group code - it touches PROC entries, so a bug may cause trouble
pretty easily. So I've focused on this part, for now.
I agree. As a general statement, I think the idea of trying to
prevent transactions from aborting is really scary. It's almost an
axiom of the system that we're always allowed to abort, and I think
there could be a lot of unintended and difficult-to-fix consequences
of undermining that guarantee. I think it will be very difficult to
create a sound system for delaying transactions, and I doubt very much
that the proposed system is sound.
In particular:
- The do_wait loop contains a CHECK_FOR_INTERRUPTS(). If that makes
it interruptible, then it's possible for the abort to complete before
the decoding processes have aborted. If that can happen, then this
whole mechanism is completely pointless, because it fails to actually
achieve the guarantee which is its central goal. On the other hand,
if you don't make this abort interruptible, then you significantly
increase the risk that a backend could get stuck in the abort path for
an unbounded period of time. If the aborting backend holds any
significant resources at this point, such as heavyweight locks, then
you risk creating a deadlock that cannot be broken until the decoding
process manages to abort, and if that process is involved in the
deadlock, then you risk creating an unbreakable deadlock.
- BackendXidGetProc() seems to be called in multiple places without
any lock held. I don't see how that can be safe, because AFAICS it
must inevitably introduce a race condition: the answer can change
after that value is returned but before it is used. There's a bunch
of recheck logic that looks like it is trying to cope with this
problem, but I'm not sure it's very solid. For example,
AssignDecodeGroupLeader reads proc->decodeGroupLeader without holding
any lock; we have historically avoided assuming that pointer-width
reads cannot be torn. (We have assumed this only for 4-byte reads or
narrower.) There are no comments about the locking hazards here, and
no real explanation of how the recheck algorithm tries to patch things
up:
+ leader = BackendXidGetProc(xid);
+ if (!leader || leader != proc)
+ {
+ LWLockRelease(leader_lwlock);
+ return NULL;
+ }
Can leader be non-NULL yet unequal to proc? I don't understand how that can
happen: surely once the PGPROC that has that XID aborts, the same XID
can't possibly be assigned to a different PGPROC.
- The code for releasing PGPROCs in ProcKill looks completely unsafe
to me. With locking groups for parallel query, a process always
enters a lock group of its own volition. It can safely use
(MyProc->lockGroupLeader != NULL) as a race-free test because no other
process can modify that value. But in this implementation of decoding
groups, one process can put another process into a decoding group,
which means this test has a race condition. If there's some reason
this is safe, the comments sure don't explain it.
I don't want to overplay my hand, but I think this code is a very long
way from being committable, and I am concerned that the fundamental
approach of blocking transaction aborts may be unsalvageably broken or
at least exceedingly dangerous.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
I agree. As a general statement, I think the idea of trying to
prevent transactions from aborting is really scary. It's almost an
axiom of the system that we're always allowed to abort, and I think
there could be a lot of unintended and difficult-to-fix consequences
of undermining that guarantee. I think it will be very difficult to
create a sound system for delaying transactions, and I doubt very much
that the proposed system is sound.
Ugh, is this patch really dependent on such a thing?
TBH, I think the odds of making that work are indistinguishable from zero;
and even if you managed to commit something that did work at the instant
you committed it, the odds that it would stay working in the face of later
system changes are exactly zero. I would reject this idea out of hand.
regards, tom lane
On 07/16/2018 06:15 PM, Robert Haas wrote:
On Mon, Jul 16, 2018 at 11:21 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
Overall, I think it's clear the main risk associated with this patch is the
decode group code - it touches PROC entries, so a bug may cause trouble
pretty easily. So I've focused on this part, for now.
I agree. As a general statement, I think the idea of trying to
prevent transactions from aborting is really scary. It's almost an
axiom of the system that we're always allowed to abort, and I think
there could be a lot of unintended and difficult-to-fix consequences
of undermining that guarantee. I think it will be very difficult to
create a sound system for delaying transactions, and I doubt very much
that the proposed system is sound.
In particular:
- The do_wait loop contains a CHECK_FOR_INTERRUPTS(). If that makes
it interruptible, then it's possible for the abort to complete before
the decoding processes have aborted. If that can happen, then this
whole mechanism is completely pointless, because it fails to actually
achieve the guarantee which is its central goal. On the other hand,
if you don't make this abort interruptible, then you significantly
increase the risk that a backend could get stuck in the abort path for
an unbounded period of time. If the aborting backend holds any
significant resources at this point, such as heavyweight locks, then
you risk creating a deadlock that cannot be broken until the decoding
process manages to abort, and if that process is involved in the
deadlock, then you risk creating an unbreakable deadlock.
I'm not sure I understand. Are you suggesting the process might get
killed or something, thanks to the CHECK_FOR_INTERRUPTS() call?
- BackendXidGetProc() seems to be called in multiple places without
any lock held. I don't see how that can be safe, because AFAICS it
must inevitably introduce a race condition: the answer can change
after that value is returned but before it is used. There's a bunch
of recheck logic that looks like it is trying to cope with this
problem, but I'm not sure it's very solid.
But BackendXidGetProc() internally acquires ProcArrayLock, of course.
It's true there are a few places where we do != NULL checks on the
result without holding any lock, but I don't see why that would be a
problem? And before actually inspecting the contents, the code always
does LockHashPartitionLockByProc.
But I certainly agree this would deserve comments explaining why this
(lack of) locking is safe. (It's clearly done this way to acquire the
lock as infrequently as possible, in an effort to minimize the
overhead.)
For example,
AssignDecodeGroupLeader reads proc->decodeGroupLeader without holding
any lock; we have historically avoided assuming that pointer-width
reads cannot be torn. (We have assumed this only for 4-byte reads or
narrower.) There are no comments about the locking hazards here, and
no real explanation of how the recheck algorithm tries to patch things
up:
+ leader = BackendXidGetProc(xid);
+ if (!leader || leader != proc)
+ {
+ LWLockRelease(leader_lwlock);
+ return NULL;
+ }
Can leader be non-NULL yet unequal to proc? I don't understand how that can
happen: surely once the PGPROC that has that XID aborts, the same XID
can't possibly be assigned to a different PGPROC.
Yeah. I have the same question.
- The code for releasing PGPROCs in ProcKill looks completely unsafe
to me. With locking groups for parallel query, a process always
enters a lock group of its own volition. It can safely use
(MyProc->lockGroupLeader != NULL) as a race-free test because no other
process can modify that value. But in this implementation of decoding
groups, one process can put another process into a decoding group,
which means this test has a race condition. If there's some reason
this is safe, the comments sure don't explain it.
I don't follow. How could one process put another process into a
decoding group? I don't think that's possible.
I don't want to overplay my hand, but I think this code is a very long
way from being committable, and I am concerned that the fundamental
approach of blocking transaction aborts may be unsalvageably broken or
at least exceedingly dangerous.
I'm not sure about the 'unsalvageable' part, but it needs more work,
that's for sure. Unfortunately, all previous attempts to make this work
in various other ways failed (see past discussions in this thread), so
this is the only approach left :-( So let's see if we can make it work.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 07/16/2018 07:21 PM, Tom Lane wrote:
Robert Haas <robertmhaas@gmail.com> writes:
I agree. As a general statement, I think the idea of trying to
prevent transactions from aborting is really scary. It's almost an
axiom of the system that we're always allowed to abort, and I think
there could be a lot of unintended and difficult-to-fix consequences
of undermining that guarantee. I think it will be very difficult to
create a sound system for delaying transactions, and I doubt very much
that the proposed system is sound.
Ugh, is this patch really dependent on such a thing?
Unfortunately it is :-( Without it the decoding (or output plugins)
may see catalogs broken in various ways - the catalog records may get
vacuumed, HOT chains are broken, ... There were attempts to change that
part, but that seems an order of magnitude more invasive than this.
TBH, I think the odds of making that work are indistinguishable from zero;
and even if you managed to commit something that did work at the instant
you committed it, the odds that it would stay working in the face of later
system changes are exactly zero. I would reject this idea out of hand.
Why? How is this significantly different from other patches touching
ProcArray and related bits?
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jul 16, 2018 at 1:28 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
I'm not sure I understand. Are you suggesting the process might get killed
or something, thanks to the CHECK_FOR_INTERRUPTS() call?
Yes. CHECK_FOR_INTERRUPTS() can certainly lead to a non-local
transfer of control.
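A minimal sketch of what "non-local transfer of control" means for a wait loop, using plain setjmp/longjmp. All names here (mock_check_for_interrupts, mock_wait_for_decoders, the polling model in which one decoder finishes per poll) are hypothetical and only illustrate the shape of the problem; the real CHECK_FOR_INTERRUPTS() and error recovery are considerably more involved.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdbool.h>
#include <stddef.h>

static jmp_buf mock_error_context;          /* hypothetical recovery point */
static volatile bool mock_interrupt_pending = false;

/* Simplified stand-in for CHECK_FOR_INTERRUPTS(): it may never return. */
static void
mock_check_for_interrupts(void)
{
    if (mock_interrupt_pending)
        longjmp(mock_error_context, 1);
}

/*
 * Wait until no decoder is active, polling for interrupts on each pass.
 * For the sake of the simulation, one decoder finishes per poll and an
 * interrupt arrives once more than interrupt_after_polls polls have run.
 * Returns true if the wait completed, false if an interrupt jumped out of
 * the loop while decoders were still active.
 */
static bool
mock_wait_for_decoders(int *active_decoders, int interrupt_after_polls)
{
    volatile int polls = 0;

    assert(active_decoders != NULL);
    mock_interrupt_pending = false;

    if (setjmp(mock_error_context) != 0)
        return false;                   /* left the loop via longjmp */

    while (*active_decoders > 0)
    {
        if (++polls > interrupt_after_polls)
            mock_interrupt_pending = true;
        mock_check_for_interrupts();
        (*active_decoders)--;           /* one decoder finishes */
    }
    return true;
}
```

When the function returns false, the caller is past the wait although *active_decoders is still nonzero, which is exactly the case where the guarantee would be lost.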
But BackendXidGetProc() internally acquires ProcArrayLock, of course. It's
true there are a few places where we do != NULL checks on the result without
holding any lock, but I don't see why that would be a problem? And before
actually inspecting the contents, the code always does
LockHashPartitionLockByProc.
I think at least some of those cases are a problem. See below...
I don't follow. How could one process put another process into a decoding
group? I don't think that's possible.
Isn't that exactly what AssignDecodeGroupLeader() is doing? It looks
up the process that currently has that XID, then turns that process
into a decode group leader. Then after that function returns, the
caller adds itself to the decode group as well. So it seems entirely
possible for somebody to swing the decodeGroupLeader pointer for a
PGPROC from NULL to some other value at an arbitrary point in time.
I'm not sure about the 'unsalvageable' part, but it needs more work, that's
for sure. Unfortunately, all previous attempts to make this work in various
other ways failed (see past discussions in this thread), so this is the only
approach left :-( So let's see if we can make it work.
I think that's probably not going to work out, but of course it's up
to you how you want to spend your time!
After thinking about it a bit more, if you want to try to stick with
this design, I don't think that this decode group leader/members thing
has much to recommend it. In the case of parallel query, the point of
the lock group stuff is to treat all of those processes as one for
purposes of heavyweight lock acquisition. There's no similar need
here, so the design that makes sure the "leader" is in the list of
processes that are members of the "group" is, AFAICS, just wasted
code. All you really need is a list of processes hung off of the
PGPROC that must abort before the leader is allowed to abort; the
leader itself doesn't need to be in the list, and there's no need to
consider it as a "group". It's just a list of waiters.
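A rough sketch of the "just a list of waiters" shape, with mock structures in place of PGPROC. The names (MockWaiter, MockXactProc, mock_register_decoder and so on) are invented for illustration, and the locking that would have to protect the list is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One decoding process that must stop before the owner may abort. */
typedef struct MockWaiter
{
    int pid;
    struct MockWaiter *next;
} MockWaiter;

/* Hypothetical slice of the transaction owner's PGPROC. */
typedef struct MockXactProc
{
    MockWaiter *decoders;   /* the owner itself is never in this list */
} MockXactProc;

/* A decoder registers itself before touching the catalogs. */
static void
mock_register_decoder(MockXactProc *owner, MockWaiter *w)
{
    assert(owner != NULL && w != NULL);
    w->next = owner->decoders;
    owner->decoders = w;
}

/* Remove a decoder by pid; returns false if it was not registered. */
static bool
mock_unregister_decoder(MockXactProc *owner, int pid)
{
    MockWaiter **link;

    for (link = &owner->decoders; *link != NULL; link = &(*link)->next)
    {
        if ((*link)->pid == pid)
        {
            *link = (*link)->next;
            return true;
        }
    }
    return false;
}

/* The owner may abort only once nobody is left in the list. */
static bool
mock_may_abort(const MockXactProc *owner)
{
    return owner->decoders == NULL;
}
```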
That having been said, I still don't see how that's really going to
work. Just to take one example, suppose that the leader is trying to
ERROR out, and the decoding workers are blocked waiting for a lock
held by the leader. The system has no way of detecting this deadlock
and resolving it automatically, which certainly seems unacceptable.
The only way that's going to work is if the leader waits for the
worker by trying to acquire a lock held by the worker. Then the
deadlock detector would know to abort some transaction. But that
doesn't really work either - the deadlock was created by the
foreground process trying to abort, and if the deadlock detector
chooses that process as its victim, what then? We're already trying
to abort, and the abort code isn't supposed to throw further errors,
or fail in any way, lest we break all kinds of other things. Not to
mention the fact that running the deadlock detector in the abort path
isn't really safe to begin with, again because we can't throw errors
when we're already in an abort path.
If we're only ever talking about decoding prepared transactions, we
could probably work around all of these problems: have the decoding
process take a heavyweight lock before it begins decoding. Have a
process that wants to execute ROLLBACK PREPARED take a conflicting
heavyweight lock on the same object. The net effect would be that
ROLLBACK PREPARED would simply wait for decoding to finish. That
might be rather lousy from a latency point of view since the
transaction could take an arbitrarily long time to decode, but it
seems safe enough. Possibly you could also design a mechanism for the
ROLLBACK PREPARED command to SIGTERM the processes that are blocking
its lock acquisition, if they are decoding processes. The difference
between this and what the current patch is doing is that nothing
complex or fragile is happening in the abort pathway itself. The
complicated stuff in both the worker and in the main backend happens
while the transaction is still good and can still be rolled back at
need. This kind of approach won't work if you want to decode
transactions that aren't yet prepared, so if that is the long term
goal then we need to think harder. I'm honestly not sure that problem
has any reasonable solution. The assumption that a running process
can abort at any time is deeply baked into many parts of the system
and for good reasons. Trying to undo that is going to be like trying
to push water up a hill. I think we need to install interlocks in
such a way that any waiting happens before we enter the abort path,
not while we're actually trying to perform the abort. But I don't
know how to do that for a foreground task that's still actively doing
stuff.
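The prepared-transaction interlock described above can be sketched as a shared/exclusive lock on some per-transaction object. The code below is a toy model with non-blocking "try" semantics, not the heavyweight lock manager; MockXactLock and the mock_ functions are hypothetical names. A false return stands for "would have to wait".

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one lockable object per prepared transaction. */
typedef struct MockXactLock
{
    int  shared_holders;    /* decoding processes currently decoding */
    bool exclusive_held;    /* ROLLBACK PREPARED in progress */
} MockXactLock;

/* Decoder side: take the lock in shared mode before decoding starts. */
static bool
mock_decoder_try_begin(MockXactLock *l)
{
    if (l->exclusive_held)
        return false;       /* rollback already under way: do not decode */
    l->shared_holders++;
    return true;
}

/* Decoder side: release once the transaction has been decoded. */
static void
mock_decoder_end(MockXactLock *l)
{
    assert(l->shared_holders > 0);
    l->shared_holders--;
}

/*
 * ROLLBACK PREPARED side: take the conflicting (exclusive) mode. This
 * happens before the abort path is entered, so a wait here happens while
 * the transaction is still in good standing.
 */
static bool
mock_rollback_prepared_try(MockXactLock *l)
{
    if (l->exclusive_held || l->shared_holders > 0)
        return false;       /* would wait for decoding to finish */
    l->exclusive_held = true;
    return true;
}
```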
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 07/16/2018 08:09 PM, Robert Haas wrote:
On Mon, Jul 16, 2018 at 1:28 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
I'm not sure I understand. Are you suggesting the process might get killed
or something, thanks to the CHECK_FOR_INTERRUPTS() call?
Yes. CHECK_FOR_INTERRUPTS() can certainly lead to a non-local
transfer of control.
But BackendXidGetProc() internally acquires ProcArrayLock, of course. It's
true there are a few places where we do != NULL checks on the result without
holding any lock, but I don't see why that would be a problem? And before
actually inspecting the contents, the code always does
LockHashPartitionLockByProc.
I think at least some of those cases are a problem. See below...
I don't follow. How could one process put another process into a decoding
group? I don't think that's possible.
Isn't that exactly what AssignDecodeGroupLeader() is doing? It looks
up the process that currently has that XID, then turns that process
into a decode group leader. Then after that function returns, the
caller adds itself to the decode group as well. So it seems entirely
possible for somebody to swing the decodeGroupLeader pointer for a
PGPROC from NULL to some other value at an arbitrary point in time.
Oh, right, I forgot the patch also adds the leader into the group, for
some reason (I agree it's unclear why that would be necessary, as you
pointed out later).
But all this is happening while holding the partition lock (in exclusive
mode). And the decoding backends do synchronize on the lock correctly
(although, man, the rechecks are confusing ...).
But now I see ProcKill accesses decodeGroupLeader in multiple places,
and only the first one is protected by the lock, for some reason
(interestingly enough the one in lockGroupLeader block). Is that what
you mean?
FWIW I suspect the ProcKill part is borked due to incorrectly resolved
merge conflict or something, per my initial response from today.
I'm not sure about the 'unsalvageable' part, but it needs more work, that's
for sure. Unfortunately, all previous attempts to make this work in various
other ways failed (see past discussions in this thread), so this is the only
approach left :-( So let's see if we can make it work.
I think that's probably not going to work out, but of course it's up
to you how you want to spend your time!
Well, yeah. I'm sure I could think of more fun things to do, but OTOH I
also have patches that require the capability to decode in-progress
transactions.
After thinking about it a bit more, if you want to try to stick with
this design, I don't think that this decode group leader/members thing
has much to recommend it. In the case of parallel query, the point of
the lock group stuff is to treat all of those processes as one for
purposes of heavyweight lock acquisition. There's no similar need
here, so the design that makes sure the "leader" is in the list of
processes that are members of the "group" is, AFAICS, just wasted
code. All you really need is a list of processes hung off of the
PGPROC that must abort before the leader is allowed to abort; the
leader itself doesn't need to be in the list, and there's no need to
consider it as a "group". It's just a list of waiters.
But the way I understand it, it pretty much *is* a list of waiters,
along with a couple of flags to allow the processes to notify the other
side about lock/unlock/abort. It does resemble the lock groups, but I
don't think it has the same goals.
The thing is that the lock/unlock happens for each decoded change
independently, and it'd be silly to modify the list all the time, so
instead it just sets the decodeLocked flag to true/false. Similarly,
when the leader decides to abort, it marks decodeAbortPending and waits
for the decoding backends to complete.
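As I read it, the flag handshake looks roughly like the sketch below. This is a single-threaded model with invented names (MockDecodeState, mock_logical_lock, ...) that only mirrors the decodeLocked/decodeAbortPending idea; the real code has to do all of this under the partition lock, and the waiting itself is not modeled.

```c
#include <assert.h>
#include <stdbool.h>

#define MOCK_MAX_DECODERS 4

/* Hypothetical shared state for one transaction being decoded. */
typedef struct MockDecodeState
{
    bool abort_pending;                 /* models decodeAbortPending */
    bool locked[MOCK_MAX_DECODERS];     /* models each backend's decodeLocked */
} MockDecodeState;

/* Called by a decoder before each change; false means "stop decoding". */
static bool
mock_logical_lock(MockDecodeState *s, int decoder)
{
    assert(decoder >= 0 && decoder < MOCK_MAX_DECODERS);
    if (s->abort_pending)
        return false;
    s->locked[decoder] = true;
    return true;
}

/* Called by a decoder after each change; the waiter list is untouched. */
static void
mock_logical_unlock(MockDecodeState *s, int decoder)
{
    assert(decoder >= 0 && decoder < MOCK_MAX_DECODERS);
    s->locked[decoder] = false;
}

/* Called by the transaction owner when it decides to abort. */
static void
mock_request_abort(MockDecodeState *s)
{
    s->abort_pending = true;
}

/* The abort may go ahead once it was requested and nobody holds the flag. */
static bool
mock_abort_can_proceed(const MockDecodeState *s)
{
    int i;

    for (i = 0; i < MOCK_MAX_DECODERS; i++)
    {
        if (s->locked[i])
            return false;
    }
    return s->abort_pending;
}
```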
Of course, that's my understanding/interpretation, and perhaps Nikhil as
a patch author has a better explanation.
That having been said, I still don't see how that's really going to
work. Just to take one example, suppose that the leader is trying to
ERROR out, and the decoding workers are blocked waiting for a lock
held by the leader. The system has no way of detecting this deadlock
and resolving it automatically, which certainly seems unacceptable.
The only way that's going to work is if the leader waits for the
worker by trying to acquire a lock held by the worker. Then the
deadlock detector would know to abort some transaction. But that
doesn't really work either - the deadlock was created by the
foreground process trying to abort, and if the deadlock detector
chooses that process as its victim, what then? We're already trying
to abort, and the abort code isn't supposed to throw further errors,
or fail in any way, lest we break all kinds of other things. Not to
mention the fact that running the deadlock detector in the abort path
isn't really safe to begin with, again because we can't throw errors
when we're already in an abort path.
Fair point, not sure. I'll leave this up to Nikhil.
If we're only ever talking about decoding prepared transactions, we
could probably work around all of these problems: have the decoding
process take a heavyweight lock before it begins decoding. Have a
process that wants to execute ROLLBACK PREPARED take a conflicting
heavyweight lock on the same object. The net effect would be that
ROLLBACK PREPARED would simply wait for decoding to finish. That
might be rather lousy from a latency point of view since the
transaction could take an arbitrarily long time to decode, but it
seems safe enough. Possibly you could also design a mechanism for the
ROLLBACK PREPARED command to SIGTERM the processes that are blocking
its lock acquisition, if they are decoding processes. The difference
between this and what the current patch is doing is that nothing
complex or fragile is happening in the abort pathway itself. The
complicated stuff in both the worker and in the main backend happens
while the transaction is still good and can still be rolled back at
need. This kind of approach won't work if you want to decode
transactions that aren't yet prepared, so if that is the long term
goal then we need to think harder. I'm honestly not sure that problem
has any reasonable solution. The assumption that a running process
can abort at any time is deeply baked into many parts of the system
and for good reasons. Trying to undo that is going to be like trying
to push water up a hill. I think we need to install interlocks in
such a way that any waiting happens before we enter the abort path,
not while we're actually trying to perform the abort. But I don't
know how to do that for a foreground task that's still actively doing
stuff.
Unfortunately it's not just for prepared transactions :-( The reason why
I'm interested in this capability (decoding in-progress xacts) is that
I'd like to use it to stream large transactions before commit, to reduce
replication lag due to limited network bandwidth etc. It's also needed
for things like speculative apply (starting apply before commit) etc.
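For the prepared-transaction-only case, the interlock quoted above (decoding takes a lock first, ROLLBACK PREPARED takes a conflicting one and simply waits) can be sketched with ordinary threads. This is only an illustration: xact_lock stands in for a heavyweight lock, the two events exist just to force the interesting interleaving, and none of the names correspond to PostgreSQL functions.

```python
import threading

# Hypothetical stand-in for a heavyweight lock on the prepared transaction.
xact_lock = threading.Lock()
events = []
decoding_started = threading.Event()
rollback_requested = threading.Event()


def decoder():
    with xact_lock:                          # taken before decoding begins
        events.append("decode start")
        decoding_started.set()
        # Keep decoding until the rollback request has arrived.
        rollback_requested.wait(timeout=5)
        events.append("decode end")


def rollback_prepared():
    decoding_started.wait(timeout=5)
    rollback_requested.set()
    with xact_lock:                          # conflicting lock: waits for the decoder
        events.append("rollback")


threads = [
    threading.Thread(target=decoder),
    threading.Thread(target=rollback_prepared),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

All of the waiting happens while acquiring the lock, that is, before any abort processing has started, which is the property the quoted paragraph is after. It does nothing for transactions that are not yet prepared.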
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jul 16, 2018 at 3:25 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
Oh, right, I forgot the patch also adds the leader into the group, for
some reason (I agree it's unclear why that would be necessary, as you
pointed out later).
But all this is happening while holding the partition lock (in exclusive
mode). And the decoding backends do synchronize on the lock correctly
(although, man, the rechecks are confusing ...).
But now I see ProcKill accesses decodeGroupLeader in multiple places,
and only the first one is protected by the lock, for some reason
(interestingly enough the one in lockGroupLeader block). Is that what
you mean?
I haven't traced out the control flow completely, but it sure looks to
me like there are places where decodeGroupLeader is checked without
holding any LWLock at all. Also, it looks to me like in some places
(like where we're trying to find a PGPROC by XID) we use ProcArrayLock
and in others -- I guess where we're checking the decodeGroupBlah
stuff -- we are using the lock manager locks. I don't know how safe
that is, and there are not a lot of comments justifying it. I also
wonder why we're using the lock manager locks to protect the
decodeGroup stuff rather than backendLock.
FWIW I suspect the ProcKill part is borked due to incorrectly resolved
merge conflict or something, per my initial response from today.
Yeah I wasn't seeing the code the way I thought you were describing it
in that response, but I'm dumb this week so maybe I just
misunderstood.
I think that's probably not going to work out, but of course it's up
to you how you want to spend your time!
Well, yeah. I'm sure I could think of more fun things to do, but OTOH I
also have patches that require the capability to decode in-progress
transactions.
It's not a matter of fun; it's a matter of whether it can be made to
work. Don't get me wrong -- I want the ability to decode in-progress
transactions. I complained about that aspect of the design to Andres
when I was reviewing and committing logical slots & logical decoding,
and I complained about it probably more than I complained about any
other aspect of it, largely because it instantaneously generates a
large lag when a bulk load commits. But not liking something about
the way things are is not the same as knowing how to make them better.
I believe there is a way to make it work because I believe there's a
way to make anything work. But I suspect that it's at least one order
of magnitude more complex than this patch currently is, and likely an
altogether different design.
But the way I understand it, it pretty much *is* a list of waiters,
along with a couple of flags to allow the processes to notify the other
side about lock/unlock/abort. It does resemble the lock groups, but I
don't think it has the same goals.
So the parts that aren't relevant shouldn't be copied over.
That having been said, I still don't see how that's really going to
work. Just to take one example, suppose that the leader is trying to
ERROR out, and the decoding workers are blocked waiting for a lock
held by the leader. The system has no way of detecting this deadlock
and resolving it automatically, which certainly seems unacceptable.
The only way that's going to work is if the leader waits for the
worker by trying to acquire a lock held by the worker. Then the
deadlock detector would know to abort some transaction. But that
doesn't really work either - the deadlock was created by the
foreground process trying to abort, and if the deadlock detector
chooses that process as its victim, what then? We're already trying
to abort, and the abort code isn't supposed to throw further errors,
or fail in any way, lest we break all kinds of other things. Not to
mention the fact that running the deadlock detector in the abort path
isn't really safe to begin with, again because we can't throw errors
when we're already in an abort path.
Fair point, not sure. I'll leave this up to Nikhil.
That's fine, but please understand that I think there's a basic design
flaw here that just can't be overcome with any amount of hacking on
the details here. I think we need a much higher-level consideration
of the problem here and probably a lot of new infrastructure to
support it. One idea might be to initially support decoding of
in-progress transactions only if they don't modify any catalog state.
That would leave out a bunch of cases we'd probably like to support,
such as CREATE TABLE + COPY in the same transaction, but it would
likely dodge a lot of really hard problems, too, and we could improve
things later. One approach to the problem of catalog changes would be
to prevent catalog tuples from being removed even after transaction
abort until such time as there's no decoding in progress that might
care about them. That is not by itself sufficient because a
transaction can abort after inserting a heap tuple but before
inserting an index tuple and we can't look at the catalog when it's in an
inconsistent state like that and expect reasonable results. But it
helps: for example, if you are decoding a transaction that has
inserted a WAL record with a cmin or cmax value of 4, and you know
that none of the catalog records created by that transaction can have
been pruned, then it should be safe to use a snapshot with CID 3 or
smaller to decode the catalogs. So consider a case like:
BEGIN;
CREATE TABLE blah ... -- command ID 0
COPY blah FROM '/tmp/blah' ... -- command ID 1
Once we see the COPY show up in the WAL, it should be safe to decode
the CREATE TABLE command and figure out what a snapshot with command
ID 0 can see (again, assuming we've suppressed pruning in the catalogs
in a sufficiently-well-considered way). Then, as long as the COPY
command doesn't do any DML via a trigger or a datatype input function
(!) or whatever, we should be able to use that snapshot to decode the
data inserted by COPY. I'm not quite sure what happens if the COPY
does do some DML or something like that -- we might have to stop
decoding until the following command begins in the live transaction,
or something like that. Or maybe we don't have to do that. I'm not
totally sure how the command counter is managed for catalog snapshots.
However it works in detail, we will get into trouble if we ever use a
catalog snapshot that can see a change that the live transaction is
still in the midst of making. Even with pruning prevented, we can
only count on the catalogs to be in a consistent state once the live
transaction has finished the command -- otherwise, for example, it
might have increased pg_class.relnatts but not yet added the
pg_attribute entry at the time it aborts, or something like that. I'm
blathering a little bit but hopefully you get the point: I think the
way forward is for somebody to think carefully through how and under
what circumstances using a catalog snapshot can be made safe even if
an abort has occurred afterwards -- not trying to postpone the abort,
which I think is never going to be right.
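To make the command-ID argument concrete, here is a toy model in Python. It is not how PostgreSQL's snapshot code works; the Command record and the rule "a WAL record stamped with command ID N implies every command below N has finished" are simplifying assumptions taken from the paragraphs above, and pruning of catalog tuples is assumed to be suppressed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Command:
    cid: int                 # command ID within the transaction
    text: str
    changes_catalog: bool


def highest_safe_cid(observed_cid):
    """Largest command ID whose catalog effects may be relied upon.

    Seeing a WAL record stamped with observed_cid implies that every
    earlier command has finished, so its catalog changes are complete.
    Returns None if no command is known to have finished yet.
    """
    return observed_cid - 1 if observed_cid > 0 else None


def visible_catalog_changes(commands, observed_cid):
    """Catalog-changing commands a decoder could safely look at."""
    limit = highest_safe_cid(observed_cid)
    if limit is None:
        return []
    return [c.text for c in commands if c.changes_catalog and c.cid <= limit]


xact = [
    Command(0, "CREATE TABLE blah", changes_catalog=True),
    Command(1, "COPY blah", changes_catalog=False),
]

# Only records from command 0 seen so far: CREATE TABLE may be mid-flight.
before_copy = visible_catalog_changes(xact, observed_cid=0)
# A record from the COPY (command ID 1) shows that CREATE TABLE completed.
after_copy = visible_catalog_changes(xact, observed_cid=1)
```

In this model a record with command ID 4 allows command IDs up to 3, matching the "CID 3 or smaller" example. What the model cannot express is the hard part: a COPY that itself performs DML through a trigger or input function.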
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 07/17/2018 08:10 PM, Robert Haas wrote:
It's not a matter of fun; it's a matter of whether it can be made to
work. Don't get me wrong -- I want the ability to decode in-progress
transactions. I complained about that aspect of the design to Andres
when I was reviewing and committing logical slots & logical decoding,
and I complained about it probably more than I complained about any
other aspect of it, largely because it instantaneously generates a
large lag when a bulk load commits. But not liking something about
the way things are is not the same as knowing how to make them better.
I believe there is a way to make it work because I believe there's a
way to make anything work. But I suspect that it's at least one order
of magnitude more complex than this patch currently is, and likely an
altogether different design.
Sure, it may turn out not to work - but how do you know until you try?
We have a well known theater play here, where one of the actors is blowing
tobacco smoke into the sink, to see if gold can be created that way.
Which is foolish, but his reasoning is "Someone had to try, to be sure!"
So we're in the phase of blowing tobacco smoke, kinda ;-)
Also, you often discover solutions while investigating approaches that
seem to be unworkable initially. Or entirely new approaches. It sure
happened to me, many times.
There's a great book/movie "Touching the Void" [1] about a climber
falling into a deep crevasse. Unable to climb up, he decides to crawl
down - which is obviously foolish, but he happens to find a way out.
I suppose we're kinda doing the same thing here - crawling down a
crevasse (while still smoking and blowing the tobacco smoke into a sink,
which we happened to find in the crevasse or something).
Anyway, I have no clear idea what changes would be necessary to the
original design of logical decoding to make implementing this easier
now. The decoding in general is quite constrained by how our transam and
WAL stuff works. I suppose Andres thought about this aspect, and I guess
he concluded that (a) it's not needed for v1, and (b) adding it later
will require about the same effort. So in the "better" case we'd end up
waiting for logical decoding much longer, in the worse case we would not
have it at all.
But the way I understand it, it pretty much *is* a list of waiters,
along with a couple of flags to allow the processes to notify the other
side about lock/unlock/abort. It does resemble the lock groups, but I
don't think it has the same goals.
So the parts that aren't relevant shouldn't be copied over.
I'm not sure which parts aren't relevant, but in general I agree that
stuff that is not necessary should not be copied over.
That having been said, I still don't see how that's really going to
work. Just to take one example, suppose that the leader is trying to
ERROR out, and the decoding workers are blocked waiting for a lock
held by the leader. The system has no way of detecting this deadlock
and resolving it automatically, which certainly seems unacceptable.
The only way that's going to work is if the leader waits for the
worker by trying to acquire a lock held by the worker. Then the
deadlock detector would know to abort some transaction. But that
doesn't really work either - the deadlock was created by the
foreground process trying to abort, and if the deadlock detector
chooses that process as its victim, what then? We're already trying
to abort, and the abort code isn't supposed to throw further errors,
or fail in any way, lest we break all kinds of other things. Not to
mention the fact that running the deadlock detector in the abort path
isn't really safe to begin with, again because we can't throw errors
when we're already in an abort path.
Fair point, not sure. I'll leave this up to Nikhil.
That's fine, but please understand that I think there's a basic design
flaw here that just can't be overcome with any amount of hacking on
the details here. I think we need a much higher-level consideration
of the problem here and probably a lot of new infrastructure to
support it. One idea might be to initially support decoding of
in-progress transactions only if they don't modify any catalog state.
The problem is you don't know if a transaction does DDL sometime later,
in the part that you might not have decoded yet (or perhaps concurrently
with the decoding). So I don't see how you could easily exclude such
transactions from the decoding ...
That would leave out a bunch of cases we'd probably like to support,
such as CREATE TABLE + COPY in the same transaction, but it would
likely dodge a lot of really hard problems, too, and we could improve
things later. One approach to the problem of catalog changes would be
to prevent catalog tuples from being removed even after transaction
abort until such time as there's no decoding in progress that might
care about them. That is not by itself sufficient because a
transaction can abort after inserting a heap tuple but before
inserting an index tuple and we can't look at the catalog when it's in an
inconsistent state like that and expect reasonable results. But it
helps: for example, if you are decoding a transaction that has
inserted a WAL record with a cmin or cmax value of 4, and you know
that none of the catalog records created by that transaction can have
been pruned, then it should be safe to use a snapshot with CID 3 or
smaller to decode the catalogs. So consider a case like:
BEGIN;
CREATE TABLE blah ... -- command ID 0
COPY blah FROM '/tmp/blah' ... -- command ID 1
Once we see the COPY show up in the WAL, it should be safe to decode
the CREATE TABLE command and figure out what a snapshot with command
ID 0 can see (again, assuming we've suppressed pruning in the catalogs
in a sufficiently-well-considered way). Then, as long as the COPY
command doesn't do any DML via a trigger or a datatype input function
(!) or whatever, we should be able to use that snapshot to decode the
data inserted by COPY.
One obvious issue with this is that it actually does not help with
reducing the replication lag, which is about the main goal of this whole
effort. Because if the COPY is a big data load, waiting until after the
COPY completes gives us pretty much nothing.
I'm not quite sure what happens if the COPY
does do some DML or something like that -- we might have to stop
decoding until the following command begins in the live transaction,
or something like that. Or maybe we don't have to do that. I'm not
totally sure how the command counter is managed for catalog snapshots.
However it works in detail, we will get into trouble if we ever use a
catalog snapshot that can see a change that the live transaction is
still in the midst of making. Even with pruning prevented, we can
only count on the catalogs to be in a consistent state once the live
transaction has finished the command -- otherwise, for example, it
might have increased pg_class.relnatts but not yet added the
pg_attribute entry at the time it aborts, or something like that. I'm
blathering a little bit but hopefully you get the point: I think the
way forward is for somebody to think carefully through how and under
what circumstances using a catalog snapshot can be made safe even if
an abort has occurred afterwards -- not trying to postpone the abort,
which I think is never going to be right.
But isn't this (delaying the catalog cleanup etc.) pretty much the
original approach, implemented by the original patch? Which you also
claimed to be unworkable, IIRC? Or how is this addressing the problems
with broken HOT chains, for example? Those issues were pretty much the
reason why we started looking at alternative approaches, like delaying
the abort ...
I wonder if disabling HOT on catalogs with wal_level=logical would be an
option here. I'm not sure how important HOT on catalogs is, in practice
(it surely does not help with the typical catalog bloat issue, which is
temporary tables, because that's mostly insert+delete). I suppose we
could disable it only when there's a replication slot indicating support
for decoding of in-progress transactions, so that you still get HOT with
plain logical decoding.
I'm sure there will be other obstacles, not just the HOT chain stuff,
but it would mean one step closer to a solution.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
On Wed, Jul 18, 2018 at 10:08 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
The problem is you don't know if a transaction does DDL sometime later, in
the part that you might not have decoded yet (or perhaps concurrently with
the decoding). So I don't see how you could easily exclude such transactions
from the decoding ...
One idea is that maybe the running transaction could communicate with
the decoding process through shared memory. For example, suppose that
before you begin decoding an ongoing transaction, you have to send
some kind of notification to the process saying "hey, I'm going to
start decoding you" and wait for that process to acknowledge receipt
of that message (say, at the next CFI). Once it acknowledges receipt,
you can begin decoding. Then, we're guaranteed that the foreground
process knows that it must be careful about catalog changes. If
it's going to make one, it sends a note to the decoding process and
says, hey, sorry, I'm about to do catalog changes, please pause
decoding. Once it gets an acknowledgement that decoding has paused,
it continues its work. Decoding resumes after commit (or maybe
earlier if it's provably safe).
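The handshake described above can be written down as a small state machine. This is a single-process sketch under stated assumptions: Channel, its states and its method names are invented for illustration, check_for_interrupts merely stands in for "the next CFI", and a real implementation would need shared memory, latches and proper locking.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"                  # nobody is decoding this transaction
    ATTACH_REQUESTED = "attach"    # decoder asked, foreground has not acked
    DECODING = "decoding"
    PAUSE_REQUESTED = "pause"      # foreground wants to change catalogs
    PAUSED = "paused"


class Channel:
    """Hypothetical shared-memory slot between one backend and one decoder."""

    def __init__(self):
        self.state = State.IDLE

    # --- decoder side ---
    def request_attach(self):
        assert self.state is State.IDLE
        self.state = State.ATTACH_REQUESTED

    def decoder_poll(self):
        """Called by the decoder at a point where it can safely stop."""
        if self.state is State.PAUSE_REQUESTED:
            self.state = State.PAUSED
        return self.state is State.DECODING

    # --- foreground side ---
    def check_for_interrupts(self):
        """Stand-in for the foreground's next CFI: ack a pending attach."""
        if self.state is State.ATTACH_REQUESTED:
            self.state = State.DECODING

    def before_catalog_change(self):
        """Ask for a pause; returns True once the change may proceed."""
        if self.state is State.DECODING:
            self.state = State.PAUSE_REQUESTED
        return self.state in (State.IDLE, State.PAUSED)

    def at_commit(self):
        if self.state is State.PAUSED:
            self.state = State.DECODING


ch = Channel()
ch.request_attach()
may_decode_early = ch.decoder_poll()        # not acknowledged yet
ch.check_for_interrupts()
may_decode = ch.decoder_poll()              # acknowledged, decoding allowed
ddl_ok_first = ch.before_catalog_change()   # decoder has not paused yet
ch.decoder_poll()                           # decoder notices and pauses
ddl_ok_second = ch.before_catalog_change()
ch.at_commit()
```

Note that before_catalog_change can only succeed after the decoder has called decoder_poll again, so the foreground's wait is bounded by how often the decoder reaches such a point.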
But isn't this (delaying the catalog cleanup etc.) pretty much the original
approach, implemented by the original patch? Which you also claimed to be
unworkable, IIRC? Or how is this addressing the problems with broken HOT
chains, for example? Those issues were pretty much the reason why we started
looking at alternative approaches, like delaying the abort ...
I don't think so. The original approach, IIRC, was to decode after
the abort had already happened, and my objection was that you can't
rely on the state of anything at that point. The approach here is to
wait until the abort is in progress and then basically pause it while
we try to read stuff, but that seems similarly riddled with problems.
The newer approach could be considered an improvement in that you've
tried to get your hands around the problem at an earlier point, but
it's not early enough. To take a very rough analogy, the original
approach was like trying to install a sprinkler system after the
building had already burned down, while the new approach is like
trying to install a sprinkler system when you notice that the building
is on fire. But we need to install the sprinkler system in advance.
That is, we need to make all of the necessary preparations for a
possible abort before the abort occurs. That could perhaps be done by
arranging things so that decoding after an abort is actually still
safe (e.g. by making it look to certain parts of the system as though
the aborted transaction is still in progress until decoding no longer
cares about it) or by making sure that we are never decoding at the
point where a problematic abort happens (e.g. as proposed above, pause
decoding before doing dangerous things).
I wonder if disabling HOT on catalogs with wal_level=logical would be an
option here. I'm not sure how important HOT on catalogs is, in practice (it
surely does not help with the typical catalog bloat issue, which is
temporary tables, because that's mostly insert+delete). I suppose we could
disable it only when there's a replication slot indicating support for
decoding of in-progress transactions, so that you still get HOT with plain
logical decoding.
Are you talking about HOT updates, or HOT pruning? Disabling the
former wouldn't help, and disabling the latter would break VACUUM,
which assumes that any tuple not removed by HOT pruning is not a dead
tuple (cf. 1224383e85eee580a838ff1abf1fdb03ced973dc, which was caused
by a case where that wasn't true).
I'm sure there will be other obstacles, not just the HOT chain stuff, but it
would mean one step closer to a solution.
Right.
Here's a crazy idea. Instead of disabling HOT pruning or anything
like that, have the decoding process advertise the XID of the
transaction being decoded as its own XID in its PGPROC. Also, using
magic, acquire a lock on that XID even though the foreground
transaction already holds that lock in exclusive mode. Fix the code
(and I'm pretty sure there is some) that relies on an XID appearing in
the procarray only once to no longer make that assumption. Then, if
the foreground process aborts, it will appear to the rest of the
system that it's still running, so HOT pruning won't remove the
XID, CLOG won't get truncated, people who are waiting to update a
tuple updated by the aborted transaction will keep waiting, etc. We
know that we do the right thing for running transactions, so if we
make this aborted transaction look like it is running and are
sufficiently convincing about the way we do that, then it should also
work. That seems more likely to be able to be made robust than
addressing specific problems (e.g. a tuple might get removed!) one by
one.
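A toy model of the effect this idea is after, in Python. The ProcArray class here is a drastic simplification invented for illustration (one advertised XID per backend, no locks, no subtransactions) and shares nothing with the real procarray beyond the name; it only shows why a second advertiser keeps the XID "running" after the foreground is gone.

```python
class ProcArray:
    """Toy procarray: each backend advertises at most one XID."""

    def __init__(self):
        self.entries = {}  # backend name -> advertised XID

    def advertise(self, backend, xid):
        self.entries[backend] = xid

    def clear(self, backend):
        self.entries.pop(backend, None)

    def is_running(self, xid):
        # Deliberately tolerates the same XID appearing more than once.
        return xid in self.entries.values()

    def oldest_running_xid(self, next_xid):
        """Horizon below which pruning could remove dead tuples."""
        return min(self.entries.values(), default=next_xid)


procs = ProcArray()
procs.advertise("foreground", 1000)
procs.advertise("other", 1005)
procs.advertise("decoder", 1000)      # decoder borrows the decoded XID

procs.clear("foreground")             # foreground transaction aborts
still_running = procs.is_running(1000)
horizon_while_decoding = procs.oldest_running_xid(next_xid=1010)

procs.clear("decoder")                # decoding no longer cares
horizon_after = procs.oldest_running_xid(next_xid=1010)
```

As long as the decoder's entry exists, the horizon stays at the decoded XID, so anything keyed off that horizon treats the aborted transaction as still in progress.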
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
On 07/18/2018 04:56 PM, Robert Haas wrote:
On Wed, Jul 18, 2018 at 10:08 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
The problem is you don't know if a transaction does DDL sometime later, in
the part that you might not have decoded yet (or perhaps concurrently with
the decoding). So I don't see how you could easily exclude such transactions
from the decoding ...
One idea is that maybe the running transaction could communicate with
the decoding process through shared memory. For example, suppose that
before you begin decoding an ongoing transaction, you have to send
some kind of notification to the process saying "hey, I'm going to
start decoding you" and wait for that process to acknowledge receipt
of that message (say, at the next CFI). Once it acknowledges receipt,
you can begin decoding. Then, we're guaranteed that the foreground
process knows that it must be careful about catalog changes. If
it's going to make one, it sends a note to the decoding process and
says, hey, sorry, I'm about to do catalog changes, please pause
decoding. Once it gets an acknowledgement that decoding has paused,
it continues its work. Decoding resumes after commit (or maybe
earlier if it's provably safe).
Let's assume the running transaction is holding an exclusive lock on
something. We start decoding it and do this little dance with sending
messages, confirmations etc. The decoding starts, and the plugin asks
for the same lock (and starts waiting). Then the transaction decides to
do some catalog changes, and sends a "pause" message to the decoding.
Who's going to respond, considering the decoding is waiting for the lock
(and it's not easy to jump out, because it might be deep inside the
output plugin, i.e. deep in some extension)?
But isn't this (delaying the catalog cleanup etc.) pretty much the original
approach, implemented by the original patch? Which you also claimed to be
unworkable, IIRC? Or how is this addressing the problems with broken HOT
chains, for example? Those issues were pretty much the reason why we started
looking at alternative approaches, like delaying the abort ...
I don't think so. The original approach, IIRC, was to decode after
the abort had already happened, and my objection was that you can't
rely on the state of anything at that point.
Pretty much, yes. Clearly there needs to be some sort of coordination
between the transaction and decoding process ...
The approach here is to
wait until the abort is in progress and then basically pause it while
we try to read stuff, but that seems similarly riddled with problems.
Yeah :-(
The newer approach could be considered an improvement in that you've
tried to get your hands around the problem at an earlier point, but
it's not early enough. To take a very rough analogy, the original
approach was like trying to install a sprinkler system after the
building had already burned down, while the new approach is like
trying to install a sprinkler system when you notice that the building
is on fire.
When an oil well is burning, they detonate a small bomb next to it to
extinguish it. What would be the analogy to that, here? pg_resetwal? ;-)
But we need to install the sprinkler system in advance.
Damn causality!
That is, we need to make all of the necessary preparations for a
possible abort before the abort occurs. That could perhaps be done by
arranging things so that decoding after an abort is actually still
safe (e.g. by making it look to certain parts of the system as though
the aborted transaction is still in progress until decoding no longer
cares about it) or by making sure that we are never decoding at the
point where a problematic abort happens (e.g. as proposed above, pause
decoding before doing dangerous things).
I wonder if disabling HOT on catalogs with wal_level=logical would be an
option here. I'm not sure how important HOT on catalogs is, in practice (it
surely does not help with the typical catalog bloat issue, which is
temporary tables, because that's mostly insert+delete). I suppose we could
disable it only when there's a replication slot indicating support for
decoding of in-progress transactions, so that you still get HOT with plain
logical decoding.
Are you talking about HOT updates, or HOT pruning? Disabling the
former wouldn't help, and disabling the latter would break VACUUM,
which assumes that any tuple not removed by HOT pruning is not a dead
tuple (cf. 1224383e85eee580a838ff1abf1fdb03ced973dc, which was caused
by a case where that wasn't true).
I'm talking about the issue you described here:
/messages/by-id/CA+TgmoZP0SxEfKW1Pn=ackUj+KdWCxs7PumMAhSYJeZ+_61_GQ@mail.gmail.com
I'm sure there will be other obstacles, not just the HOT chain stuff, but it
would mean one step closer to a solution.
Right.
Here's a crazy idea. Instead of disabling HOT pruning or anything
like that, have the decoding process advertise the XID of the
transaction being decoded as its own XID in its PGPROC. Also, using
magic, acquire a lock on that XID even though the foreground
transaction already holds that lock in exclusive mode. Fix the code
(and I'm pretty sure there is some) that relies on an XID appearing in
the procarray only once to no longer make that assumption. Then, if
the foreground process aborts, it will appear to the rest of the
system that it's still running, so HOT pruning won't remove the
XID, CLOG won't get truncated, people who are waiting to update a
tuple updated by the aborted transaction will keep waiting, etc. We
know that we do the right thing for running transactions, so if we
make this aborted transaction look like it is running and are
sufficiently convincing about the way we do that, then it should also
work. That seems more likely to be able to be made robust than
addressing specific problems (e.g. a tuple might get removed!) one by
one.
A dumb question - would this work with subtransaction-level aborts? I
mean, a transaction that does some catalog changes in a subxact which
then aborts, while the transaction itself still continues.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
On Wed, Jul 18, 2018 at 11:27 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
One idea is that maybe the running transaction could communicate with
the decoding process through shared memory. For example, suppose that
before you begin decoding an ongoing transaction, you have to send
some kind of notification to the process saying "hey, I'm going to
start decoding you" and wait for that process to acknowledge receipt
of that message (say, at the next CFI). Once it acknowledges receipt,
you can begin decoding. Then, we're guaranteed that the foreground
process knows that it must be careful about catalog changes. If
it's going to make one, it sends a note to the decoding process and
says, hey, sorry, I'm about to do catalog changes, please pause
decoding. Once it gets an acknowledgement that decoding has paused,
it continues its work. Decoding resumes after commit (or maybe
earlier if it's provably safe).
Let's assume running transaction is holding an exclusive lock on something.
We start decoding it and do this little dance with sending messages,
confirmations etc. The decoding starts, and the plugin asks for the same
lock (and starts waiting). Then the transaction decides to do some catalog
changes, and sends a "pause" message to the decoding. Who's going to
respond, considering the decoding is waiting for the lock (and it's not easy
to jump out, because it might be deep inside the output plugin, i.e. deep in
some extension).
I think it's inevitable that any solution that is based on pausing
decoding might have to wait for a theoretically unbounded time for
decoding to get back to a point where it can safely pause. That is
one of several reasons why I don't believe that any solution based on
holding off aborts has any chance of being acceptable -- mid-abort is
a terrible time to pause. Now, if the time is not only theoretically
unbounded but also in practice likely to be very long (e.g. the
foreground transaction could easily have to wait minutes for the
decoding process to be able to process the pause request), then this
whole approach is probably not going to work. If, on the other hand,
the time is theoretically unbounded but in practice likely to be no
more than a few seconds in almost every case, then we might have
something. I don't know which is the case. It probably depends on
where you put the code to handle pause requests, and I'm not sure what
options are viable. For example, if there's a loop that eats WAL
records one at a time, and we can safely pause after any given
iteration of that loop, that sounds pretty good, unless a single
iteration of that loop might hang inside of a network I/O, in which
case it sounds ... less good, probably? But there might be ways
around that, too, like ... could we pause at the next CFI? I don't
understand the constraints well enough to comment intelligently here.
The newer approach could be considered an improvement in that you've
tried to get your hands around the problem at an earlier point, but
it's not early enough. To take a very rough analogy, the original
approach was like trying to install a sprinkler system after the
building had already burned down, while the new approach is like
trying to install a sprinkler system when you notice that the building
is on fire.
When an oil well is burning, they detonate a small bomb next to it to
extinguish it. What would be the analogy to that, here? pg_resetwal? ;-)
Yep. :-)
But we need to install the sprinkler system in advance.
Damn causality!
I know, right?
Are you talking about HOT updates, or HOT pruning? Disabling the
former wouldn't help, and disabling the latter would break VACUUM,
which assumes that any tuple not removed by HOT pruning is not a dead
tuple (cf. 1224383e85eee580a838ff1abf1fdb03ced973dc, which was caused
by a case where that wasn't true).
I'm talking about the issue you described here:
/messages/by-id/CA+TgmoZP0SxEfKW1Pn=ackUj+KdWCxs7PumMAhSYJeZ+_61_GQ@mail.gmail.com
There are several issues there. The second and third ones boil down
to this: As soon as the system thinks that your transaction is no
longer in process, it is going to start making decisions based on
whether that transaction committed or aborted. If it thinks your
transaction aborted, it is going to feel entirely free to make
decisions that permanently lose information -- like removing tuples or
overwriting CTIDs or truncating CLOG or killing index entries. I
doubt it makes any sense to try to fix each of those problems
individually -- if we're going to do something about this, it had
better be broad enough to nail all or nearly all of the problems in
this area in one fell swoop.
The first issue in that email is different. That's really about the
possibility that the aborted transaction itself has created chaos,
whereas the other ones are about the chaos that the rest of the system
might impose based on the belief that the transaction is no longer
needed for anything after an abort has occurred.
A dumb question - would this work with subtransaction-level aborts? I mean,
a transaction that does some catalog changes in a subxact which then however
aborts, but then still continues.
Well, I would caution you against relying on me to design this for
you. The fact that I can identify the pitfalls of trying to install a
sprinkler system while the building is on fire does not mean that I
know what diameter of pipe should be used to provide for proper fire
containment. It's really important that this gets designed by someone
who knows -- or learns -- enough to make it really good and safe.
Replacing obvious problems (the building has already burned down!)
with subtler problems (the water pressure is insufficient to reach the
upper stories!) might get the patch committed, but that's not the
right goal.
That having been said, I cannot immediately see any reason why the
idea that I sketched there couldn't be made to work just as well or
poorly for subtransactions as it would for toplevel transactions. I
don't really know that it will work even for toplevel transactions --
that would require more thought and careful study than I've given it
(or, given that this is not my patch, feel that I should need to give
it). However, if it does, and if there are no other problems that
I've missed in thinking casually about it, then I think it should be
possible to make it work for subtransactions, too. Likely, as the
decoding process first encountered each new sub-XID, it would need to
magically acquire a duplicate lock and advertise the subxid just as it
did for the toplevel XID, so that at any given time the set of XIDs
advertised by the decoding process would be a subset (not necessarily
proper) of the set advertised by the foreground process.
To try to be a little clearer about my overall position, I am
suggesting that you (1) abandon the current approach and (2) make sure
that everything is done by making sufficient preparations in advance
of any abort rather than trying to cope after it's already started. I
am also suggesting that, to get there, it might be helpful to (a)
contemplate communication and active cooperation between the running
process and the decoding process(es), but it might turn out not to be
needed and I don't know exactly what needs to be communicated, (b)
consider whether there's a reasonable way to make it look to other
parts of the system like the aborted transaction is still running, but
this also might turn out not to be the right approach, (c) consider
whether logical decoding already does or can be made to use historical
catalog snapshots that only see command IDs prior to the current one
so that incompletely-made changes by the last CID aren't seen if an
abort happens. I think there is a good chance that a full solution
involves more than one of these things, and maybe some other things I
haven't thought about. These are ideas, not a plan.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi Robert and Tomas,
It seems clear to me that the decodeGroup list of decoding backends
waiting on the backend doing the transaction of interest is not a
favored approach here. Note that I came down to this approach after
trying various other approaches/iterations. I was especially enthused
to see the lockGroupLeader implementation in the code and based this
decodeGroup implementation on the same premise, although our
requirement is simply to have a list of waiters in the main
transaction backend process.
Sure, there might be some issues related to locking in the code, and
I am willing to try and work them out. However, if the decodeGroup
approach of interlocking abort processing with the decoding backends
is itself considered suspect, then it might be another waste of time.
I think it's inevitable that any solution that is based on pausing
decoding might have to wait for a theoretically unbounded time for
decoding to get back to a point where it can safely pause. That is
one of several reasons why I don't believe that any solution based on
holding off aborts has any chance of being acceptable -- mid-abort is
a terrible time to pause. Now, if the time is not only theoretically
unbounded but also in practice likely to be very long (e.g. the
foreground transaction could easily have to wait minutes for the
decoding process to be able to process the pause request), then this
whole approach is probably not going to work. If, on the other hand,
the time is theoretically unbounded but in practice likely to be no
more than a few seconds in almost every case, then we might have
something. I don't know which is the case.
We have tried to minimize the pausing requirements by holding the
"LogicalLock" only when the decoding activity needs to access catalog
tables. The decoding goes ahead only if it gets the logical lock,
reads the catalog and unlocks immediately. If the decoding backend
does not get the "LogicalLock" then it stops decoding the current
transaction. So, the time to pause is pretty short in practical
scenarios.
It probably depends on
where you put the code to handle pause requests, and I'm not sure what
options are viable. For example, if there's a loop that eats WAL
records one at a time, and we can safely pause after any given
iteration of that loop, that sounds pretty good, unless a single
iteration of that loop might hang inside of a network I/O, in which
case it sounds ... less good, probably?
It is precisely to avoid waiting inside network I/O that we take the
lock only around catalog access, as described above.
There are several issues there. The second and third ones boil down
to this: As soon as the system thinks that your transaction is no
longer in process, it is going to start making decisions based on
whether that transaction committed or aborted. If it thinks your
transaction aborted, it is going to feel entirely free to make
decisions that permanently lose information -- like removing tuples or
overwriting CTIDs or truncating CLOG or killing index entries. I
doubt it makes any sense to try to fix each of those problems
individually -- if we're going to do something about this, it had
better be broad enough to nail all or nearly all of the problems in
this area in one fell swoop.
Agreed, this was the crux of the issues. Decisions that cause
permanent loss of information, regardless of the ongoing decoding
happening around that transaction, are what led us down this rabbit
hole in the first place.
A dumb question - would this work with subtransaction-level aborts? I mean,
a transaction that does some catalog changes in a subxact which then however
aborts, but then still continues.
That having been said, I cannot immediately see any reason why the
idea that I sketched there couldn't be made to work just as well or
poorly for subtransactions as it would for toplevel transactions. I
don't really know that it will work even for toplevel transactions --
that would require more thought and careful study than I've given it
(or, given that this is not my patch, feel that I should need to give
it). However, if it does, and if there are no other problems that
I've missed in thinking casually about it, then I think it should be
possible to make it work for subtransactions, too. Likely, as the
decoding process first encountered each new sub-XID, it would need to
magically acquire a duplicate lock and advertise the subxid just as it
did for the toplevel XID, so that at any given time the set of XIDs
advertised by the decoding process would be a subset (not necessarily
proper) of the set advertised by the foreground process.
I am ready to go back to the drawing board and have another stab at this
pesky little large issue :-)
To try to be a little clearer about my overall position, I am
suggesting that you (1) abandon the current approach and (2) make sure
that everything is done by making sufficient preparations in advance
of any abort rather than trying to cope after it's already started. I
am also suggesting that, to get there, it might be helpful to (a)
contemplate communication and active cooperation between the running
process and the decoding process(es), but it might turn out not to be
needed and I don't know exactly what needs to be communicated, (b)
consider whether there's a reasonable way to make it look to other
parts of the system like the aborted transaction is still running, but
this also might turn out not to be the right approach, (c) consider
whether logical decoding already does or can be made to use historical
catalog snapshots that only see command IDs prior to the current one
so that incompletely-made changes by the last CID aren't seen if an
abort happens. I think there is a good chance that a full solution
involves more than one of these things, and maybe some other things I
haven't thought about. These are ideas, not a plan.
I will think more along the above lines and see if we can get something workable.
Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi,
On 2018-07-18 10:56:31 -0400, Robert Haas wrote:
Are you talking about HOT updates, or HOT pruning? Disabling the
former wouldn't help, and disabling the latter would break VACUUM,
which assumes that any tuple not removed by HOT pruning is not a dead
tuple (cf. 1224383e85eee580a838ff1abf1fdb03ced973dc, which was caused
by a case where that wasn't true).
I don't think this reasoning actually applies for making HOT pruning
weaker as necessary for decoding. The xmin horizon on catalog tables is
already pegged, which'd prevent similar problems.
There are already plenty of cases where dead tuples, if they only recently
became so, are not removed by the time vacuumlazy.c processes the tuple.
I actually think the balance of all the solutions discussed in this
thread seems to make neutering pruning *a bit* by far the most palatable
solution. We don't need to fully prevent removal of such tuple chains,
it's sufficient that we can detect that a tuple has been removed. A
large-sledgehammer approach would be to just error out when attempting
to read such a tuple. The existing error handling logic can relatively
easily be made to work with that.
Greetings,
Andres Freund
Hi,
On 2018-07-18 16:08:37 +0200, Tomas Vondra wrote:
Anyway, I have no clear idea what changes would be necessary to the original
design of logical decoding to make implementing this easier now. The
decoding in general is quite constrained by how our transam and WAL stuff
works. I suppose Andres thought about this aspect, and I guess he concluded
that (a) it's not needed for v1, and (b) adding it later will require about
the same effort. So in the "better" case we'd end up waiting for logical
decoding much longer, in the worse case we would not have it at all.
I still don't really see an alternative that'd have been (or even *is*)
realistically doable.
Greetings,
Andres Freund
Hi,
On 2018-07-19 12:42:08 -0700, Andres Freund wrote:
I actually think the balance of all the solutions discussed in this
thread seems to make neutering pruning *a bit* by far the most palatable
solution. We don't need to fully prevent removal of such tuple chains,
it's sufficient that we can detect that a tuple has been removed. A
large-sledgehammer approach would be to just error out when attempting
to read such a tuple. The existing error handling logic can relatively
easily be made to work with that.
So. I'm just back from not working for a few days. I've not followed
this discussion in all its detail over the last months. I've an
annoying bout of allergies. So I might be entirely off.
I think this whole issue only exists if we actually end up doing catalog
lookups, not if there's only cached lookups (otherwise our invalidation
handling is entirely borked). And we should normally do cached lookups
for a large large percentage of the cases. Therefore we can make the
cache-miss cases a bit slower.
So what if we, at the begin / end of cache miss handling, re-check if
the to-be-decoded transaction is still in-progress (or has
committed). And we throw an error if that happened. That error is then
caught in reorderbuffer, the in-progress-xact aborted callback is
called, and processing continues (there's a couple nontrivial details
here, but it should be doable).
The biggest issue is what constitutes a "cache miss". It's fairly
trivial to do this for syscache / relcache, but that's not sufficient:
there are plenty of cases where catalogs are accessed without going through
either. But as far as I can tell if we declared that all historic
accesses have to go through systable_beginscan* - which'd imo not be a
crazy restriction - we could put the checks at that layer.
That'd require that an index lookup can't crash if the corresponding
heap entry doesn't exist (etc), but that's something we need to handle
anyway. The issue that multiple separate catalog lookups need to be
coherent (say Robert's pg_class exists, but pg_attribute doesn't
example) is solved by virtue of the pg_attribute lookups failing if
the transaction aborted.
Am I missing something here?
Greetings,
Andres Freund
Hi Andres,
So what if we, at the begin / end of cache miss handling, re-check if
the to-be-decoded transaction is still in-progress (or has
committed). And we throw an error if that happened. That error is then
caught in reorderbuffer, the in-progress-xact aborted callback is
called, and processing continues (there's a couple nontrivial details
here, but it should be doable).
The biggest issue is what constitutes a "cache miss". It's fairly
trivial to do this for syscache / relcache, but that's not sufficient:
there's plenty cases where catalogs are accessed without going through
either. But as far as I can tell if we declared that all historic
accesses have to go through systable_beginscan* - which'd imo not be a
crazy restriction - we could put the checks at that layer.
Documenting that historic accesses go through systable_* APIs does
seem reasonable. In our earlier discussions, we felt asking plugin
writers to do anything along these lines was too onerous and
cumbersome to expect.
That'd require that an index lookup can't crash if the corresponding
heap entry doesn't exist (etc), but that's something we need to handle
anyway. The issue that multiple separate catalog lookups need to be
coherent (say Robert's pg_class exists, but pg_attribute doesn't
example) is solved by virtue of the pg_attribute lookups failing if
the transaction aborted.
Am I missing something here?
Are you suggesting we have a:
PG_TRY()
{
Catalog_Access();
}
PG_CATCH()
{
Abort_Handling();
}
here?
Regards,
Nikhils
On 2018-07-20 12:13:19 +0530, Nikhil Sontakke wrote:
Hi Andres,
So what if we, at the begin / end of cache miss handling, re-check if
the to-be-decoded transaction is still in-progress (or has
committed). And we throw an error if that happened. That error is then
caught in reorderbuffer, the in-progress-xact aborted callback is
called, and processing continues (there's a couple nontrivial details
here, but it should be doable).
The biggest issue is what constitutes a "cache miss". It's fairly
trivial to do this for syscache / relcache, but that's not sufficient:
there's plenty cases where catalogs are accessed without going through
either. But as far as I can tell if we declared that all historic
accesses have to go through systable_beginscan* - which'd imo not be a
crazy restriction - we could put the checks at that layer.
Documenting that historic accesses go through systable_* APIs does
seem reasonable. In our earlier discussions, we felt asking plugin
writers to do anything along these lines was too onerous and
cumbersome to expect.
But they don't really need to do that - in just about all cases access
"automatically" goes through systable_* or layers above. If you call
output functions, do syscache lookups, etc you're good.
That'd require that an index lookup can't crash if the corresponding
heap entry doesn't exist (etc), but that's something we need to handle
anyway. The issue that multiple separate catalog lookups need to be
coherent (say Robert's pg_class exists, but pg_attribute doesn't
example) is solved by virtue of the pg_attribute lookups failing if
the transaction aborted.
Am I missing something here?
Are you suggesting we have a:
PG_TRY()
{
Catalog_Access();
}
PG_CATCH()
{
Abort_Handling();
}
here?
Not quite, no. Basically, in a simplified manner, the logical decoding
loop is like:
while (true)
    record = readRecord()
    logical = decodeRecord()
    PG_TRY():
        StartTransactionCommand();
        switch (TypeOf(logical))
            case INSERT:
                insert_callback(logical);
                break;
            ...
        CommitTransactionCommand();
    PG_CATCH():
        AbortCurrentTransaction();
        PG_RE_THROW();
what I'm proposing is that various catalog access functions throw a
new class of error, something like "decoding aborted transactions". The
PG_CATCH() above would then not unconditionally re-throw, but set a flag
and continue iff that class of error was detected.
while (true)
    if (in_progress_xact_abort_pending)
        StartTransactionCommand();
        in_progress_xact_abort_callback(made_up_record);
        in_progress_xact_abort_pending = false;
        CommitTransactionCommand();
    record = readRecord()
    logical = decodeRecord()
    PG_TRY():
        StartTransactionCommand();
        switch (TypeOf(logical))
            case INSERT:
                insert_callback(logical);
                break;
            ...
        CommitTransactionCommand();
    PG_CATCH():
        AbortCurrentTransaction();
        if (errclass == DECODING_ABORTED_XACT)
            in_progress_xact_abort_pending = true;
            continue;
        else
            PG_RE_THROW();
Now obviously that's just pseudo code with lotsa things missing, but I
think the basic idea should come through?
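To make the control flow of the second loop concrete, here is a minimal self-contained C model of it. It is only a sketch under loud assumptions: setjmp/longjmp stand in for PG_TRY/PG_CATCH, the Record struct and the ErrClass enum are hypothetical, and none of this is PostgreSQL code. It shows only that a DECODING_ABORTED_XACT-class error sets the pending flag and lets the loop continue, while any other error still propagates.

```c
#include <setjmp.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical error classes; only the DECODING_ABORTED_XACT name comes from the thread. */
typedef enum { ERR_NONE = 0, ERR_DECODING_ABORTED_XACT = 1, ERR_OTHER = 2 } ErrClass;

/* Toy WAL record: the flags say which error the change callback will hit. */
typedef struct
{
    bool xact_aborted;   /* a catalog access would notice a concurrent abort */
    bool hard_failure;   /* any unrelated error */
} Record;

static jmp_buf catch_point;      /* stand-in for the PG_TRY context */
static int changes_applied = 0;
static int abort_callbacks = 0;

/* Stand-in for an output plugin callback that may "throw". */
static void insert_callback(const Record *r)
{
    if (r->xact_aborted)
        longjmp(catch_point, ERR_DECODING_ABORTED_XACT);
    if (r->hard_failure)
        longjmp(catch_point, ERR_OTHER);
    changes_applied++;
}

static void in_progress_xact_abort_callback(void)
{
    abort_callbacks++;
}

/* Returns ERR_NONE, or ERR_OTHER when an error would have been re-thrown. */
static ErrClass decode_records(const Record *records, size_t n)
{
    volatile bool abort_pending = false;
    volatile size_t i;

    for (i = 0; i < n; i++)
    {
        /* Deliver the abort notification at the top of the next iteration. */
        if (abort_pending)
        {
            in_progress_xact_abort_callback();
            abort_pending = false;
        }

        switch (setjmp(catch_point))
        {
            case ERR_NONE:                    /* the PG_TRY body */
                insert_callback(&records[i]);
                break;
            case ERR_DECODING_ABORTED_XACT:   /* swallow, set flag, continue */
                abort_pending = true;
                break;
            default:                          /* everything else is re-thrown */
                return ERR_OTHER;
        }
    }
    return ERR_NONE;
}
```

As in the pseudo code, the abort callback fires at the start of the iteration after the error, so in this toy an abort detected on the very last record would never be reported; a real implementation would have to handle that, along with the other missing details.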
Greetings,
Andres Freund
Hi Andres,
That'd require that an index lookup can't crash if the corresponding
heap entry doesn't exist (etc), but that's something we need to handle
anyway. The issue that multiple separate catalog lookups need to be
coherent (say Robert's pg_class exists, but pg_attribute doesn't
example) is solved by virtue of the pg_attribute lookups failing if
the transaction aborted.
Not quite, no. Basically, in a simplified manner, the logical decoding
loop is like:
while (true)
    record = readRecord()
    logical = decodeRecord()
    PG_TRY():
        StartTransactionCommand();
        switch (TypeOf(logical))
            case INSERT:
                insert_callback(logical);
                break;
            ...
        CommitTransactionCommand();
    PG_CATCH():
        AbortCurrentTransaction();
        PG_RE_THROW();
what I'm proposing is that various catalog access functions throw a
new class of error, something like "decoding aborted transactions".
When will this error be thrown by the catalog functions? How will it
determine that it needs to throw this error?
PG_CATCH():
    AbortCurrentTransaction();
    if (errclass == DECODING_ABORTED_XACT)
        in_progress_xact_abort_pending = true;
        continue;
    else
        PG_RE_THROW();
Now obviously that's just pseudo code with lotsa things missing, but I
think the basic idea should come through?
How do we handle the cases where the catalog returns inconsistent data
(without erroring out) which does not help with the ongoing decoding?
Consider for example:
BEGIN;
/* Consider that T1 has one column, c1 */
ALTER TABLE T1 ADD COLUMN c2 integer;
INSERT INTO T1 (c2) VALUES (1);
PREPARE TRANSACTION 'gid';
If we abort the above 2PC and the catalog row for the ALTER gets
cleaned up by vacuum, then the catalog read will return us T1 with one
column C1. The catalog scan will NOT error out but will return
metadata which causes the insert-decoding change apply callback to
error out. The point here is that in some cases the catalog scan might
not error out and might return inconsistent metadata which causes
issues further down the line in apply processing.
Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi,
On 2018-07-23 16:31:50 +0530, Nikhil Sontakke wrote:
That'd require that an index lookup can't crash if the corresponding
heap entry doesn't exist (etc), but that's something we need to handle
anyway. The issue that multiple separate catalog lookups need to be
coherent (say Robert's pg_class exists, but pg_attribute doesn't
example) is solved by virtue of the pg_attribute lookups failing if
the transaction aborted.
Not quite, no. Basically, in a simplified manner, the logical decoding
loop is like:
while (true)
    record = readRecord()
    logical = decodeRecord()
    PG_TRY():
        StartTransactionCommand();
        switch (TypeOf(logical))
            case INSERT:
                insert_callback(logical);
                break;
            ...
        CommitTransactionCommand();
    PG_CATCH():
        AbortCurrentTransaction();
        PG_RE_THROW();
what I'm proposing is that various catalog access functions throw a
new class of error, something like "decoding aborted transactions".
When will this error be thrown by the catalog functions? How will it
determine that it needs to throw this error?
The error check would have to happen at the end of most systable_*
functions. They'd simply do something like
if (decoding_in_progress_xact && TransactionIdDidAbort(xid_of_aborted))
    ereport(ERROR, (errcode(DECODING_ABORTED_XACT), errmsg("oops")));
i.e. check whether the transaction to be decoded still is in
progress. As that would happen before any potentially wrong result can
be returned (as the check happens at the tail end of systable_*),
there's no issue with wrong state in the syscache etc.
PG_CATCH():
    AbortCurrentTransaction();
    if (errclass == DECODING_ABORTED_XACT)
        in_progress_xact_abort_pending = true;
        continue;
    else
        PG_RE_THROW();
Now obviously that's just pseudo code with lotsa things missing, but I
think the basic idea should come through?
How do we handle the cases where the catalog returns inconsistent data
(without erroring out) which does not help with the ongoing decoding?
Consider for example:
I don't think that situation exists, given the scheme described
above. That's just the point.
BEGIN;
/* Consider that T1 has one column, c1 */
ALTER TABLE T1 ADD COLUMN c2 integer;
INSERT INTO T1 (c2) VALUES (1);
PREPARE TRANSACTION 'gid';
If we abort the above 2PC and the catalog row for the ALTER gets
cleaned up by vacuum, then the catalog read will return us T1 with one
column C1.
No, it'd throw an error due to the new is-aborted check.
The catalog scan will NOT error out but will return metadata which
causes the insert-decoding change apply callback to error out.
Why would it not throw an error?
Greetings,
Andres Freund
Hi Andres,
what I'm proposing is that various catalog access functions throw a
new class of error, something like "decoding aborted transactions".
When will this error be thrown by the catalog functions? How will it
determine that it needs to throw this error?
The error check would have to happen at the end of most systable_*
functions. They'd simply do something like
if (decoding_in_progress_xact && TransactionIdDidAbort(xid_of_aborted))
    ereport(ERROR, (errcode(DECODING_ABORTED_XACT), errmsg("oops")));
i.e. check whether the transaction to be decoded still is in
progress. As that would happen before any potentially wrong result can
be returned (as the check happens at the tail end of systable_*),
there's no issue with wrong state in the syscache etc.
Oh, ok. The systable_* functions use the passed in snapshot and return
tuples matching to it. They do not typically have access to the
current XID being worked upon.
We can find out if the snapshot is a logical decoding one by virtue of
its "satisfies" function pointing to HeapTupleSatisfiesHistoricMVCC.
The catalog scan will NOT error out but will return metadata which
causes the insert-decoding change apply callback to error out.
Why would it not throw an error?
In your scheme, it will throw an error, indeed. We'd need to make the
"being-currently-decoded-XID" visible to these systable_* functions
and then this scheme will work.
Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2018-07-23 19:37:46 +0530, Nikhil Sontakke wrote:
Hi Andres,
what I'm proposing is that various catalog access functions throw a
new class of error, something like "decoding aborted transactions".
When will this error be thrown by the catalog functions? How will it
determine that it needs to throw this error?
The error check would have to happen at the end of most systable_*
functions. They'd simply do something like
if (decoding_in_progress_xact && TransactionIdDidAbort(xid_of_aborted))
    ereport(ERROR, (errcode(DECODING_ABORTED_XACT), errmsg("oops")));
i.e. check whether the transaction to be decoded still is in
progress. As that would happen before any potentially wrong result can
be returned (as the check happens at the tail end of systable_*),
there's no issue with wrong state in the syscache etc.
Oh, ok. The systable_* functions use the passed in snapshot and return
tuples matching to it. They do not typically have access to the
current XID being worked upon.
That seems like quite a solvable issue, especially compared to the
locking schemes proposed.
We can find out if the snapshot is a logical decoding one by virtue of
its "satisfies" function pointing to HeapTupleSatisfiesHistoricMVCC.
I think we even can just do something like a global
TransactionId check_if_transaction_is_alive = InvalidTransactionId;
and just set it up during decoding. And then just check it whenever it's
not set to InvalidTransactionId.
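A rough, self-contained C sketch of that global guard follows. Everything except the names check_if_transaction_is_alive, TransactionId and InvalidTransactionId is an assumption made for illustration: the types are redefined locally, the xid_aborted array is a toy replacement for the commit log, and the toy_* functions are hypothetical stand-ins, not existing PostgreSQL APIs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins so the sketch compiles on its own. */
typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

/* The proposed global: only set while an in-progress xact is being decoded. */
static TransactionId check_if_transaction_is_alive = InvalidTransactionId;

/* Toy commit log; a hypothetical replacement for TransactionIdDidAbort(). */
#define MAX_XID 16
static bool xid_aborted[MAX_XID];

static bool toy_transaction_did_abort(TransactionId xid)
{
    return xid < MAX_XID && xid_aborted[xid];
}

typedef enum { SCAN_OK, SCAN_DECODING_ABORTED_XACT } ScanStatus;

/* Stand-in for the tail end of a systable_* function. */
static ScanStatus toy_systable_scan_finish(void)
{
    /* Outside of decoding the guard is invalid and the check is skipped. */
    if (check_if_transaction_is_alive != InvalidTransactionId &&
        toy_transaction_did_abort(check_if_transaction_is_alive))
        return SCAN_DECODING_ABORTED_XACT;  /* real code would ereport(ERROR) */
    return SCAN_OK;
}

/* Decoding one change: arm the guard, do catalog access, disarm it again. */
static ScanStatus toy_decode_change_of(TransactionId xid)
{
    ScanStatus status;

    check_if_transaction_is_alive = xid;
    status = toy_systable_scan_finish();
    check_if_transaction_is_alive = InvalidTransactionId;
    return status;
}
```

The point of the design is that ordinary backends never pay for the check: the guard stays at InvalidTransactionId unless decoding explicitly arms it.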
Greetings,
Andres Freund
Hi Andres,
We can find out if the snapshot is a logical decoding one by virtue of
its "satisfies" function pointing to HeapTupleSatisfiesHistoricMVCC.
I think we even can just do something like a global
TransactionId check_if_transaction_is_alive = InvalidTransactionId;
and just set it up during decoding. And then just check it whenever it's
not set to InvalidTransactionId.
Ok. I will work on something along these lines and re-submit the set of patches.
Regards,
Nikhils
On Thu, Jul 19, 2018 at 3:42 PM, Andres Freund <andres@anarazel.de> wrote:
I don't think this reasoning actually applies for making HOT pruning
weaker as necessary for decoding. The xmin horizon on catalog tables is
already pegged, which'd prevent similar problems.
That sounds completely wrong to me. Setting the xmin horizon keeps
tuples that are made dead by a committing transaction from being
removed, but I don't think it will do anything to keep tuples that are
made dead by an aborting transaction from being removed.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On July 23, 2018 9:11:13 AM PDT, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Jul 19, 2018 at 3:42 PM, Andres Freund <andres@anarazel.de> wrote:
I don't think this reasoning actually applies for making HOT pruning
weaker as necessary for decoding. The xmin horizon on catalog tables is
already pegged, which'd prevent similar problems.
That sounds completely wrong to me. Setting the xmin horizon keeps
tuples that are made dead by a committing transaction from being
removed, but I don't think it will do anything to keep tuples that are
made dead by an aborting transaction from being removed.
My point is that we could just make HTSV treat them as recently dead, without incurring the issues of the bug you referenced.
Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Mon, Jul 23, 2018 at 12:13 PM, Andres Freund <andres@anarazel.de> wrote:
My point is that we could just make HTSV treat them as recently dead, without incurring the issues of the bug you referenced.
That doesn't seem sufficient. For example, it won't keep the
predecessor tuple's ctid field from being overwritten by a subsequent
updater -- and if that happens then the update chain is broken. Maybe
your idea of cross-checking at the end of each syscache lookup would
be sufficient to prevent that from happening, though. But I wonder if
there are subtler problems, too -- e.g. relfrozenxid vs. actual xmins
in the table, clog truncation, or whatever. There might be no
problem, but the idea that an aborted transaction is of no further
interest to anybody is pretty deeply ingrained in the system.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
On 2018-07-23 12:38:25 -0400, Robert Haas wrote:
On Mon, Jul 23, 2018 at 12:13 PM, Andres Freund <andres@anarazel.de> wrote:
My point is that we could just make HTSV treat them as recently dead, without incurring the issues of the bug you referenced.
That doesn't seem sufficient. For example, it won't keep the
predecessor tuple's ctid field from being overwritten by a subsequent
updater -- and if that happens then the update chain is broken.
Sure. I wasn't arguing that it'd be sufficient. Just that the specific
concern, that it'd bring about the bug you mentioned, isn't right. I agree
that it's quite terrifying to attempt to get this right.
Maybe your idea of cross-checking at the end of each syscache lookup
would be sufficient to prevent that from happening, though.
Hm? If we go for that approach we would not do *anything* about pruning,
which is why I think it has appeal. Because we'd check at the end of
system table scans (not syscache lookups, positive cache hits are fine
because of invalidation handling) whether the to-be-decoded transaction
aborted, we'd not need to do anything about pruning: If the transaction
aborted, we're guaranteed to know - the result might have been wrong,
but since we error out before filling any caches, we're ok. If it
hasn't yet aborted at the end of the scan, we conversely are guaranteed
that the scan results are correct.
Greetings,
Andres Freund
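The ordering argument in the message above can be illustrated with a small standalone model. This is only a sketch under stated assumptions: `xid_aborted`, `catalog_scan_checked`, the two toy scans and the `aborted[]` table are hypothetical stand-ins, not PostgreSQL functions. The point it tries to show is that the abort check runs after the scan and before anything is cached.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for PostgreSQL's TransactionId (hypothetical model). */
typedef uint32_t TransactionId;

#define MAX_XID 16

/* Toy transaction-status table: aborted[xid] becomes true when xid aborts. */
static bool aborted[MAX_XID];

static bool
xid_aborted(TransactionId xid)
{
    return xid < MAX_XID && aborted[xid];
}

/* A single cached catalog value; filled only from verified scan results. */
static int  cache_value;
static bool cache_valid = false;

typedef int (*scan_fn) (void);

/* Scan that completes while the decoded transaction is still alive. */
static int
scan_undisturbed(void)
{
    return 42;
}

/* Scan during which xid 7 is rolled back, so its result may be garbage. */
static int
scan_racing_with_abort(void)
{
    aborted[7] = true;
    return -1;
}

/*
 * Run a catalog scan on behalf of decoding_xid and only then check whether
 * that transaction aborted.  Returns false where real code would raise an
 * error; the cache is touched only after the check has passed.
 */
static bool
catalog_scan_checked(TransactionId decoding_xid, scan_fn scan)
{
    int result;

    assert(scan != NULL);
    result = scan();

    if (xid_aborted(decoding_xid))
        return false;           /* result untrusted, nothing cached */

    cache_value = result;
    cache_valid = true;
    return true;
}
```

In this model a scan that races with a rollback may compute a wrong value, but that value never reaches the cache, which mirrors the "error out before filling any caches" reasoning.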
Hi,
I think we even can just do something like a global
TransactionId check_if_transaction_is_alive = InvalidTransactionId;
and just set it up during decoding. And then just check it whenever it's
not set to InvalidTransactionId.
Ok. I will work on something along these lines and re-submit the set of patches.
PFA, latest patchset, which completely removes the earlier
LogicalLock/LogicalUnLock implementation using groupDecode stuff and
uses the newly suggested approach of checking the currently decoded
XID for abort in systable_* API functions. Much simpler to code and
easier to test as well.
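A minimal sketch of such a guard, assuming the global variable named earlier in the thread. It is not the code from the patch: `begin_decoding_xact`, `end_decoding_xact`, `catalog_scan_result_usable`, the `xact_did_abort` hook and the toy lookup are hypothetical names, and the typedef and `InvalidTransactionId` definition are local stand-ins so that the example compiles on its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins so the sketch is self-contained. */
typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

/* Global guard suggested in the thread; unset outside of decoding. */
static TransactionId check_if_transaction_is_alive = InvalidTransactionId;

/* Hypothetical hook standing in for a transaction-status lookup. */
static bool (*xact_did_abort) (TransactionId xid) = NULL;

/* Toy lookup used for illustration: pretends that only xid 42 aborted. */
static bool
toy_abort_lookup(TransactionId xid)
{
    return xid == 42;
}

/* Arm the guard before decoding the changes of xid. */
static void
begin_decoding_xact(TransactionId xid)
{
    assert(xid != InvalidTransactionId);
    check_if_transaction_is_alive = xid;
}

/* Disarm the guard once decoding of the transaction is finished. */
static void
end_decoding_xact(void)
{
    check_if_transaction_is_alive = InvalidTransactionId;
}

/*
 * Meant to be called at the end of a catalog scan.  Outside of decoding the
 * guard is unset and the check is a no-op; during decoding it reports
 * whether the scan result may be used (real code would raise an error
 * instead of returning false).
 */
static bool
catalog_scan_result_usable(void)
{
    if (check_if_transaction_is_alive == InvalidTransactionId)
        return true;

    return !(xact_did_abort != NULL &&
             xact_did_abort(check_if_transaction_is_alive));
}
```

Keeping the guard in one global means ordinary backends pay only for a comparison against `InvalidTransactionId`; whether that matches the cost profile of the real patch is an assumption of this sketch.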
The systable_* API based XID checking described above is implemented in
0003-Gracefully-handle-concurrent-aborts-of-uncommitted-t.patch. So,
it might help to take a look at this patch first for any additional
feedback on this approach.
There's an additional test case in
0005-Additional-test-case-to-demonstrate-decoding-rollbac.patch which
uses a sleep in the "change" plugin API to allow a concurrent rollback
on the 2PC being currently decoded. Andres generally doesn't like this
approach :-), but there are no timing/interlocking issues now, and the
sleep just helps us do a concurrent rollback, so it might be ok now,
all things considered. Anyways, it's an additional patch for now.
Comments, feedback appreciated.
Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
0001-Cleaning-up-of-flags-in-ReorderBufferTXN-structure.patch (application/octet-stream)
From 064e48176a355d94d59cf321750dc3079e4af9d4 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Wed, 13 Jun 2018 16:15:24 +0530
Subject: [PATCH 1/5] Cleaning up of flags in ReorderBufferTXN structure
---
src/backend/replication/logical/reorderbuffer.c | 34 ++++++++++++-------------
src/include/replication/reorderbuffer.h | 33 ++++++++++++++----------
2 files changed, 37 insertions(+), 30 deletions(-)
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 9b55b94227..fb71631434 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -643,7 +643,7 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
Assert(prev_first_lsn < cur_txn->first_lsn);
/* known-as-subtxn txns must not be listed */
- Assert(!cur_txn->is_known_as_subxact);
+ Assert(!rbtxn_is_known_subxact(cur_txn));
prev_first_lsn = cur_txn->first_lsn;
}
@@ -663,7 +663,7 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
Assert(prev_base_snap_lsn < cur_txn->base_snapshot_lsn);
/* known-as-subtxn txns must not be listed */
- Assert(!cur_txn->is_known_as_subxact);
+ Assert(!rbtxn_is_known_subxact(cur_txn));
prev_base_snap_lsn = cur_txn->base_snapshot_lsn;
}
@@ -686,7 +686,7 @@ ReorderBufferGetOldestTXN(ReorderBuffer *rb)
txn = dlist_head_element(ReorderBufferTXN, node, &rb->toplevel_by_lsn);
- Assert(!txn->is_known_as_subxact);
+ Assert(!rbtxn_is_known_subxact(txn));
Assert(txn->first_lsn != InvalidXLogRecPtr);
return txn;
}
@@ -746,7 +746,7 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
if (!new_sub)
{
- if (subtxn->is_known_as_subxact)
+ if (rbtxn_is_known_subxact(subtxn))
{
/* already associated, nothing to do */
return;
@@ -762,7 +762,7 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
}
}
- subtxn->is_known_as_subxact = true;
+ subtxn->txn_flags |= RBTXN_IS_SUBXACT;
subtxn->toplevel_xid = xid;
Assert(subtxn->nsubtxns == 0);
@@ -972,7 +972,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
ReorderBufferChange *cur_change;
- if (txn->serialized)
+ if (rbtxn_is_serialized(txn))
{
/* serialize remaining changes */
ReorderBufferSerializeTXN(rb, txn);
@@ -1001,7 +1001,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
ReorderBufferChange *cur_change;
- if (cur_txn->serialized)
+ if (rbtxn_is_serialized(cur_txn))
{
/* serialize remaining changes */
ReorderBufferSerializeTXN(rb, cur_txn);
@@ -1167,7 +1167,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
* they originally were happening inside another subtxn, so we won't
* ever recurse more than one level deep here.
*/
- Assert(subtxn->is_known_as_subxact);
+ Assert(rbtxn_is_known_subxact(subtxn));
Assert(subtxn->nsubtxns == 0);
ReorderBufferCleanupTXN(rb, subtxn);
@@ -1208,7 +1208,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
/*
* Remove TXN from its containing list.
*
- * Note: if txn->is_known_as_subxact, we are deleting the TXN from its
+ * Note: if txn is known as subxact, we are deleting the TXN from its
* parent's list of known subxacts; this leaves the parent's nsubxacts
* count too high, but we don't care. Otherwise, we are deleting the TXN
* from the LSN-ordered list of toplevel TXNs.
@@ -1223,7 +1223,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
Assert(found);
/* remove entries spilled to disk */
- if (txn->serialized)
+ if (rbtxn_is_serialized(txn))
ReorderBufferRestoreCleanup(rb, txn);
/* deallocate */
@@ -1240,7 +1240,7 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
dlist_iter iter;
HASHCTL hash_ctl;
- if (!txn->has_catalog_changes || dlist_is_empty(&txn->tuplecids))
+ if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
return;
memset(&hash_ctl, 0, sizeof(hash_ctl));
@@ -1854,7 +1854,7 @@ ReorderBufferAbortOld(ReorderBuffer *rb, TransactionId oldestRunningXid)
* final_lsn to that of their last change; this causes
* ReorderBufferRestoreCleanup to do the right thing.
*/
- if (txn->serialized && txn->final_lsn == 0)
+ if (rbtxn_is_serialized(txn) && txn->final_lsn == 0)
{
ReorderBufferChange *last =
dlist_tail_element(ReorderBufferChange, node, &txn->changes);
@@ -2002,7 +2002,7 @@ ReorderBufferSetBaseSnapshot(ReorderBuffer *rb, TransactionId xid,
* operate on its top-level transaction instead.
*/
txn = ReorderBufferTXNByXid(rb, xid, true, &is_new, lsn, true);
- if (txn->is_known_as_subxact)
+ if (rbtxn_is_known_subxact(txn))
txn = ReorderBufferTXNByXid(rb, txn->toplevel_xid, false,
NULL, InvalidXLogRecPtr, false);
Assert(txn->base_snapshot == NULL);
@@ -2109,7 +2109,7 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
- txn->has_catalog_changes = true;
+ txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
}
/*
@@ -2126,7 +2126,7 @@ ReorderBufferXidHasCatalogChanges(ReorderBuffer *rb, TransactionId xid)
if (txn == NULL)
return false;
- return txn->has_catalog_changes;
+ return rbtxn_has_catalog_changes(txn);
}
/*
@@ -2146,7 +2146,7 @@ ReorderBufferXidHasBaseSnapshot(ReorderBuffer *rb, TransactionId xid)
return false;
/* a known subtxn? operate on top-level txn instead */
- if (txn->is_known_as_subxact)
+ if (rbtxn_is_known_subxact(txn))
txn = ReorderBufferTXNByXid(rb, txn->toplevel_xid, false,
NULL, InvalidXLogRecPtr, false);
@@ -2267,7 +2267,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
Assert(spilled == txn->nentries_mem);
Assert(dlist_is_empty(&txn->changes));
txn->nentries_mem = 0;
- txn->serialized = true;
+ txn->txn_flags |= RBTXN_IS_SERIALIZED;
if (fd != -1)
CloseTransientFile(fd);
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 1f52f6bde7..ec9515d156 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -150,18 +150,34 @@ typedef struct ReorderBufferChange
dlist_node node;
} ReorderBufferChange;
+/* ReorderBufferTXN flags */
+#define RBTXN_HAS_CATALOG_CHANGES 0x0001
+#define RBTXN_IS_SUBXACT 0x0002
+#define RBTXN_IS_SERIALIZED 0x0004
+
+/* does the txn have catalog changes */
+#define rbtxn_has_catalog_changes(txn) (txn->txn_flags & RBTXN_HAS_CATALOG_CHANGES)
+/* is the txn known as a subxact? */
+#define rbtxn_is_known_subxact(txn) (txn->txn_flags & RBTXN_IS_SUBXACT)
+/*
+ * Has this transaction been spilled to disk? It's not always possible to
+ * deduce that fact by comparing nentries with nentries_mem, because e.g.
+ * subtransactions of a large transaction might get serialized together
+ * with the parent - if they're restored to memory they'd have
+ * nentries_mem == nentries.
+ */
+#define rbtxn_is_serialized(txn) (txn->txn_flags & RBTXN_IS_SERIALIZED)
+
typedef struct ReorderBufferTXN
{
+ int txn_flags;
+
/*
* The transactions transaction id, can be a toplevel or sub xid.
*/
TransactionId xid;
- /* did the TX have catalog changes */
- bool has_catalog_changes;
-
/* Do we know this is a subxact? Xid of top-level txn if so */
- bool is_known_as_subxact;
TransactionId toplevel_xid;
/*
@@ -229,15 +245,6 @@ typedef struct ReorderBufferTXN
*/
uint64 nentries_mem;
- /*
- * Has this transaction been spilled to disk? It's not always possible to
- * deduce that fact by comparing nentries with nentries_mem, because e.g.
- * subtransactions of a large transaction might get serialized together
- * with the parent - if they're restored to memory they'd have
- * nentries_mem == nentries.
- */
- bool serialized;
-
/*
* List of ReorderBufferChange structs, including new Snapshots and new
* CommandIds
--
2.15.2 (Apple Git-101.1)
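The flag-word pattern that patch 0001 introduces can be tried out in isolation. The following is a sketch, not the patch itself: `DemoTXN`, the `demo_*` accessors and `demo_mark_subxact` are hypothetical names, and only the three flag values are taken from the diff above. As a defensive choice of this sketch, the accessors parenthesize the macro argument and normalize the result to 0 or 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag values as defined in the patch. */
#define RBTXN_HAS_CATALOG_CHANGES 0x0001
#define RBTXN_IS_SUBXACT          0x0002
#define RBTXN_IS_SERIALIZED       0x0004

/* Hypothetical, reduced stand-in for ReorderBufferTXN: only the flags word. */
typedef struct DemoTXN
{
    int txn_flags;
} DemoTXN;

/* Accessors in the style of the rbtxn_* macros (demo_* names are made up). */
#define demo_has_catalog_changes(txn) \
    ((((txn)->txn_flags) & RBTXN_HAS_CATALOG_CHANGES) != 0)
#define demo_is_known_subxact(txn) \
    ((((txn)->txn_flags) & RBTXN_IS_SUBXACT) != 0)
#define demo_is_serialized(txn) \
    ((((txn)->txn_flags) & RBTXN_IS_SERIALIZED) != 0)

/* Setting a flag ORs its bit in and leaves the other bits untouched. */
static void
demo_mark_subxact(DemoTXN *txn)
{
    assert(txn != NULL);
    txn->txn_flags |= RBTXN_IS_SUBXACT;
}
```

Parenthesizing `txn` lets the accessors take expressions such as `&local_txn`, and the `!= 0` keeps the result usable where a strict boolean is expected.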
0002-Support-decoding-of-two-phase-transactions-at-PREPAR.patch (application/octet-stream)
From 4fd2180bd13502fb642e06fceafb1fad5421b271 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Wed, 13 Jun 2018 16:30:30 +0530
Subject: [PATCH 2/5] Support decoding of two-phase transactions at PREPARE
Until now two-phase transactions were decoded at COMMIT, just like
regular transactions. During replay, two-phase transactions were
translated into regular transactions on the subscriber, and the GID
was not forwarded to it.
This patch allows PREPARE-time decoding of two-phase transactions (if
the output plugin supports this capability), in which case the
transactions are replayed at PREPARE and then committed later when
COMMIT PREPARED arrives.
On the subscriber, the transactions will be executed as two-phase
transactions, with the same GID. This is important for various
external transaction managers, which often encode information into
the GID itself.
Includes documentation changes.
---
doc/src/sgml/logicaldecoding.sgml | 128 ++++++++++++++-
src/backend/replication/logical/decode.c | 147 +++++++++++++++--
src/backend/replication/logical/logical.c | 202 ++++++++++++++++++++++++
src/backend/replication/logical/reorderbuffer.c | 185 ++++++++++++++++++++--
src/include/replication/logical.h | 7 +-
src/include/replication/output_plugin.h | 45 ++++++
src/include/replication/reorderbuffer.h | 68 ++++++++
7 files changed, 750 insertions(+), 32 deletions(-)
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 8db968641e..7e9213def2 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -385,7 +385,12 @@ typedef struct OutputPluginCallbacks
LogicalDecodeChangeCB change_cb;
LogicalDecodeTruncateCB truncate_cb;
LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeAbortCB abort_cb;
LogicalDecodeMessageCB message_cb;
+ LogicalDecodeFilterPrepareCB filter_prepare_cb;
+ LogicalDecodePrepareCB prepare_cb;
+ LogicalDecodeCommitPreparedCB commit_prepared_cb;
+ LogicalDecodeAbortPreparedCB abort_prepared_cb;
LogicalDecodeFilterByOriginCB filter_by_origin_cb;
LogicalDecodeShutdownCB shutdown_cb;
} OutputPluginCallbacks;
@@ -457,7 +462,13 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
never get
decoded. Successful savepoints are
folded into the transaction containing them in the order they were
- executed within that transaction.
+ executed within that transaction. A transaction that is prepared for
+ a two-phase commit using <command>PREPARE TRANSACTION</command> will
+ also be decoded if the output plugin callbacks needed for decoding
+ them are provided. It is possible that the current transaction which
+ is being decoded is aborted concurrently via a <command>ROLLBACK PREPARED</command>
+ command. In that case, the logical decoding of this transaction will
+ be aborted too.
</para>
<note>
@@ -558,6 +569,74 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
</para>
</sect3>
+ <sect3 id="logicaldecoding-output-plugin-prepare">
+ <title>Transaction Prepare Callback</title>
+
+ <para>
+ The optional <function>prepare_cb</function> callback is called whenever
+ a transaction which is prepared for two-phase commit has been
+ decoded. The <function>change_cb</function> callbacks for all modified
+ rows will have been called before this, if there have been any modified
+ rows.
+<programlisting>
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+ <title>Commit Prepared Transaction Callback</title>
+
+ <para>
+ The optional <function>commit_prepared_cb</function> callback is called whenever
+ a commit prepared transaction has been decoded. The <parameter>gid</parameter> field,
+ which is part of the <parameter>txn</parameter> parameter can be used in this
+ callback.
+<programlisting>
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-abort-prepared">
+ <title>Rollback Prepared Transaction Callback</title>
+
+ <para>
+ The optional <function>abort_prepared_cb</function> callback is called whenever
+ a rollback prepared transaction has been decoded. The <parameter>gid</parameter> field,
+ which is part of the <parameter>txn</parameter> parameter can be used in this
+ callback.
+<programlisting>
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-abort">
+ <title>Transaction Abort Callback</title>
+
+ <para>
+ The required <function>abort_cb</function> callback is called whenever
+ a transaction abort has to be initiated. This can happen if we are
+ decoding a transaction that has been prepared for two-phase commit and
+ a concurrent rollback happens while we are decoding it. In such cases
+ it might make sense to check whether the rollback has already happened
+ before we start looking at the changes, so that decoding of such
+ transactions can be avoided completely.
+<programlisting>
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
<sect3 id="logicaldecoding-output-plugin-change">
<title>Change Callback</title>
@@ -567,7 +646,12 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
an <command>INSERT</command>, <command>UPDATE</command>,
or <command>DELETE</command>. Even if the original command modified
several rows at once the callback will be called individually for each
- row.
+ row. The <function>change_cb</function> callback may access system or
+ user catalog tables to aid in the process of outputting the row
+ modification details. In case of decoding a prepared (but yet
+ uncommitted) transaction or decoding of an uncommitted transaction, this
+ change callback is ensured sane access to catalog tables regardless of
+ simultaneous rollback by another backend of this very same transaction.
<programlisting>
typedef void (*LogicalDecodeChangeCB) (struct LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
@@ -644,6 +728,39 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
</para>
</sect3>
+ <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+ <title>Prepare Filter Callback</title>
+
+ <para>
+ The optional <function>filter_prepare_cb</function> callback
+ is called to determine whether data that is part of the current
+ two-phase commit transaction should be considered for decode
+ at this prepare stage or as a regular one-phase transaction at
+ <command>COMMIT PREPARED</command> time later. To signal that
+ decoding should be skipped, return <literal>true</literal>;
+ <literal>false</literal> otherwise. When the callback is not
+ defined, <literal>false</literal> is assumed (i.e. nothing is
+ filtered).
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+</programlisting>
+ The <parameter>ctx</parameter> parameter has the same contents
+ as for the other callbacks. The <parameter>txn</parameter> parameter
+ contains meta information about the transaction. The <parameter>xid</parameter>
+ contains the XID because <parameter>txn</parameter> can be NULL in some cases.
+ The <parameter>gid</parameter> is the identifier that later identifies this
+ transaction for <command>COMMIT PREPARED</command> or <command>ROLLBACK PREPARED</command>.
+ </para>
+ <para>
+ The callback has to provide the same static answer for a given combination of
+ <parameter>xid</parameter> and <parameter>gid</parameter> every time it is
+ called.
+ </para>
+ </sect3>
+
<sect3 id="logicaldecoding-output-plugin-message">
<title>Generic Message Callback</title>
@@ -665,7 +782,12 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
non-transactional and the XID was not assigned yet in the transaction
which logged the message. The <parameter>lsn</parameter> has WAL
location of the message. The <parameter>transactional</parameter> says
- if the message was sent as transactional or not.
+ if the message was sent as transactional or not. Similar to the change
+ callback, in case of decoding a prepared (but yet uncommitted)
+ transaction or decoding of an uncommitted transaction, this message
+ callback is ensured sane access to catalog tables regardless of
+ simultaneous rollback by another backend of this very same transaction.
+
The <parameter>prefix</parameter> is arbitrary null-terminated prefix
which can be used for identifying interesting messages for the current
plugin. And finally the <parameter>message</parameter> parameter holds
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 59c003de9c..008958d35e 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -34,6 +34,7 @@
#include "access/xlogutils.h"
#include "access/xlogreader.h"
#include "access/xlogrecord.h"
+#include "access/twophase.h"
#include "catalog/pg_control.h"
@@ -73,6 +74,8 @@ static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_commit *parsed, TransactionId xid);
static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare * parsed);
/* common function to decode tuples */
static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -281,16 +284,33 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
break;
}
case XLOG_XACT_PREPARE:
+ {
+ xl_xact_parsed_prepare parsed;
- /*
- * Currently decoding ignores PREPARE TRANSACTION and will just
- * decode the transaction when the COMMIT PREPARED is sent or
- * throw away the transaction's contents when a ROLLBACK PREPARED
- * is received. In the future we could add code to expose prepared
- * transactions in the changestream allowing for a kind of
- * distributed 2PC.
- */
- ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+ /* check that output plugin is capable of twophase decoding */
+ if (!ctx->enable_twophase)
+ {
+ ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+ break;
+ }
+
+ /* ok, parse it */
+ ParsePrepareRecord(XLogRecGetInfo(buf->record),
+ XLogRecGetData(buf->record), &parsed);
+
+ /* does output plugin want this particular transaction? */
+ if (ctx->callbacks.filter_prepare_cb &&
+ ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+ parsed.twophase_gid))
+ {
+ ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+ buf->origptr);
+ break;
+ }
+
+ DecodePrepare(ctx, buf, &parsed);
+ break;
+ }
break;
default:
elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
@@ -633,9 +653,90 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
buf->origptr, buf->endptr);
}
+ /*
+ * Decide if we're processing COMMIT PREPARED, or a regular COMMIT.
+ * Regular commit simply triggers a replay of transaction changes from the
+ * reorder buffer. For COMMIT PREPARED that however already happened at
+ * PREPARE time, and so we only need to notify the subscriber that the GID
+ * finally committed.
+ *
+ * For output plugins that do not support PREPARE-time decoding of
+ * two-phase transactions, we never even see the PREPARE and all two-phase
+ * transactions simply fall through to the second branch.
+ */
+ if (TransactionIdIsValid(parsed->twophase_xid) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder,
+ parsed->twophase_xid, parsed->twophase_gid))
+ {
+ Assert(xid == parsed->twophase_xid);
+ /* we are processing COMMIT PREPARED */
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid, true);
+ }
+ else
+ {
+ /* replay actions of all transaction + subtransactions in order */
+ ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn);
+ }
+}
+
+/*
+ * Decode PREPARE record. Similar logic as in COMMIT
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare * parsed)
+{
+ XLogRecPtr origin_lsn = parsed->origin_lsn;
+ TimestampTz commit_time = parsed->origin_timestamp;
+ RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+ int i;
+ TransactionId xid = parsed->twophase_xid;
+
+ /*
+ * Process invalidation messages, even if we're not interested in the
+ * transaction's contents, since the various caches need to always be
+ * consistent.
+ */
+ if (parsed->nmsgs > 0)
+ {
+ if (!ctx->fast_forward)
+ ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
+ parsed->nmsgs, parsed->msgs);
+ ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+ }
+
+ /*
+ * Tell the reorderbuffer about the surviving subtransactions. We need to
+ * do this because the main transaction itself has not committed since we
+ * are in the prepare phase right now. So we need to be sure the snapshot
+ * is set up correctly for the main transaction in case all changes
+ * happened in subtransactions.
+ */
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+ buf->origptr, buf->endptr);
+ }
+
+ if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+ (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+ ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+ {
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr);
+ }
+ ReorderBufferForget(ctx->reorder, xid, buf->origptr);
+
+ return;
+ }
+
/* replay actions of all transaction + subtransactions in order */
- ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
- commit_time, origin_id, origin_lsn);
+ ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn, parsed->twophase_gid);
}
/*
@@ -647,6 +748,30 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid)
{
int i;
+ XLogRecPtr origin_lsn = InvalidXLogRecPtr;
+ TimestampTz commit_time = 0;
+ RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ origin_lsn = parsed->origin_lsn;
+ commit_time = parsed->origin_timestamp;
+ }
+
+ /*
+ * If it's ROLLBACK PREPARED then handle it via callbacks.
+ */
+ if (TransactionIdIsValid(xid) &&
+ !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+ parsed->dbId == ctx->slot->data.database &&
+ !FilterByOrigin(ctx, origin_id) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
+ {
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid, false);
+ return;
+ }
for (i = 0; i < parsed->nsubxacts; i++)
{
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 3cd4eefb9b..d3b9452be3 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -60,6 +60,16 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+static void abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change);
static void truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -127,6 +137,7 @@ StartupDecodingContext(List *output_plugin_options,
MemoryContext context,
old_context;
LogicalDecodingContext *ctx;
+ int twophase_callbacks;
/* shorter lines... */
slot = MyReplicationSlot;
@@ -187,8 +198,38 @@ StartupDecodingContext(List *output_plugin_options,
ctx->reorder->apply_change = change_cb_wrapper;
ctx->reorder->apply_truncate = truncate_cb_wrapper;
ctx->reorder->commit = commit_cb_wrapper;
+ ctx->reorder->abort = abort_cb_wrapper;
+ ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+ ctx->reorder->prepare = prepare_cb_wrapper;
+ ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+ ctx->reorder->abort_prepared = abort_prepared_cb_wrapper;
ctx->reorder->message = message_cb_wrapper;
+ /*
+ * Check that plugin implements all callbacks necessary to decode
+ * two-phase transactions - we either have to have all of them or none.
+ * The filter_prepare callback is optional, but can only be defined when
+ * two-phase decoding is enabled (i.e. the three other callbacks are
+ * defined).
+ */
+ twophase_callbacks = (ctx->callbacks.prepare_cb != NULL) +
+ (ctx->callbacks.commit_prepared_cb != NULL) +
+ (ctx->callbacks.abort_prepared_cb != NULL);
+
+ ctx->enable_twophase = (twophase_callbacks == 3);
+
+ /* Plugins with incorrect number of two-phase callbacks are broken. */
+ if ((twophase_callbacks != 3) && (twophase_callbacks != 0))
+ ereport(ERROR,
+ (errmsg("output plugin registered only %d of 3 two-phase callbacks",
+ twophase_callbacks)));
+
+ /* filter_prepare is optional, but requires two-phase decoding */
+ if ((ctx->callbacks.filter_prepare_cb != NULL) && (!ctx->enable_twophase))
+ ereport(ERROR,
+ (errmsg("output plugin does not support two-phase decoding, but "
+ "registered filter_prepare callback")));
+
ctx->out = makeStringInfo();
ctx->prepare_write = prepare_write;
ctx->write = do_write;
@@ -708,6 +749,122 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
error_context_stack = errcallback.previous;
}
+static void
+abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "abort";
+ state.report_location = txn->final_lsn; /* beginning of abort record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.abort_cb(ctx, txn, abort_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "prepare";
+ state.report_location = txn->final_lsn; /* beginning of prepare record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "commit_prepared";
+ state.report_location = txn->final_lsn; /* beginning of commit record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "abort_prepared";
+ state.report_location = txn->final_lsn; /* beginning of abort prepared record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.abort_prepared_cb(ctx, txn, abort_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
static void
change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change)
@@ -785,6 +942,51 @@ truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
error_context_stack = errcallback.previous;
}
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+ bool ret;
+
+ /*
+ * Skip if decoding of twophase at PREPARE time is not enabled. In that
+ * case all twophase transactions are considered filtered out and will be
+ * applied as regular transactions at COMMIT PREPARED.
+ */
+ if (!ctx->enable_twophase)
+ return true;
+
+ /*
+ * The filter_prepare callback is optional. When not supplied, all
+ * prepared transactions should go through.
+ */
+ if (!ctx->callbacks.filter_prepare_cb)
+ return false;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "filter_prepare";
+ state.report_location = InvalidXLogRecPtr;
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = false;
+
+ /* do the actual work: call callback */
+ ret = ctx->callbacks.filter_prepare_cb(ctx, txn, xid, gid);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+
+ return ret;
+}
+
bool
filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
{
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index fb71631434..2fffc90606 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -337,6 +337,11 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
}
/* free data that's contained */
+ if (txn->gid != NULL)
+ {
+ pfree(txn->gid);
+ txn->gid = NULL;
+ }
if (txn->tuplecid_hash != NULL)
{
@@ -1389,25 +1394,18 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
* and subtransactions (using a k-way merge) and replay the changes in lsn
* order.
*/
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
- XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
- TimestampTz commit_time,
- RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+ ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn)
{
- ReorderBufferTXN *txn;
volatile Snapshot snapshot_now;
volatile CommandId command_id = FirstCommandId;
bool using_subtxn;
ReorderBufferIterTXNState *volatile iterstate = NULL;
- txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
- false);
-
- /* unknown transaction, nothing to replay */
- if (txn == NULL)
- return;
-
txn->final_lsn = commit_lsn;
txn->end_lsn = end_lsn;
txn->commit_time = commit_time;
@@ -1711,7 +1709,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
break;
}
}
-
/*
* There's a speculative insertion remaining, just clean in up, it
* can't have been successful, otherwise we'd gotten a confirmation
@@ -1727,8 +1724,22 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
ReorderBufferIterTXNFinish(rb, iterstate);
iterstate = NULL;
- /* call commit callback */
- rb->commit(rb, txn, commit_lsn);
+ /*
+ * Call abort/commit/prepare callback, depending on the transaction
+ * state.
+ *
+ * If the transaction aborted during apply (which currently can happen
+ * only for prepared transactions), simply call the abort callback.
+ *
+ * Otherwise call either PREPARE (for twophase transactions) or COMMIT
+ * (for regular ones).
+ */
+ if (rbtxn_rollback(txn))
+ rb->abort(rb, txn, commit_lsn);
+ else if (rbtxn_prepared(txn))
+ rb->prepare(rb, txn, commit_lsn);
+ else
+ rb->commit(rb, txn, commit_lsn);
/* this is just a sanity check against bad output plugin behaviour */
if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -1755,7 +1766,12 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
if (snapshot_now->copied)
ReorderBufferFreeSnap(rb, snapshot_now);
- /* remove potential on-disk data, and deallocate */
+ /*
+ * remove potential on-disk data, and deallocate.
+ *
+ * We remove it even for prepared transactions (GID is enough to
+ * commit/abort those later).
+ */
ReorderBufferCleanupTXN(rb, txn);
}
PG_CATCH();
@@ -1789,6 +1805,141 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
PG_END_TRY();
}
+
+/*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+ return rb->filter_prepare(rb, txn, xid, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferCommitInternal()
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /* unknown transaction, nothing to replay */
+ if (txn == NULL)
+ return;
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a twophase transaction. It calls ReorderBufferCommitInternal()
+ * since all prepared transactions need to be decoded at PREPARE time.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /* unknown transaction, nothing to replay */
+ if (txn == NULL)
+ return;
+
+ txn->txn_flags |= RBTXN_PREPARE;
+ txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+ strcpy(txn->gid, gid);
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Check whether this transaction was sent as prepared to subscribers.
+ * Called while handling commit|abort prepared.
+ */
+bool
+ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+ const char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /*
+ * Always call the prepare filter. It's the job of the prepare filter to
+ * give us the *same* response for a given xid across multiple calls
+ * (including ones on restart)
+ */
+ return !(rb->filter_prepare(rb, txn, xid, gid));
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ABORT PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit)
+{
+ ReorderBufferTXN *txn;
+
+ /*
+ * The transaction may or may not exist (during restarts for example).
+ * In any case, a 2PC transaction has no buffered changes left in the
+ * reorder buffer at this point, so allow
+ */
+ txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+ true);
+
+ txn->final_lsn = commit_lsn;
+ txn->end_lsn = end_lsn;
+ txn->commit_time = commit_time;
+ txn->origin_id = origin_id;
+ txn->origin_lsn = origin_lsn;
+ /* this txn is obviously prepared */
+ txn->txn_flags |= RBTXN_PREPARE;
+ txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+ strcpy(txn->gid, gid);
+
+ if (is_commit)
+ {
+ txn->txn_flags |= RBTXN_COMMIT_PREPARED;
+ rb->commit_prepared(rb, txn, commit_lsn);
+ }
+ else
+ {
+ txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
+ rb->abort_prepared(rb, txn, commit_lsn);
+ }
+
+ /* cleanup: make sure there's no cache pollution */
+ ReorderBufferExecuteInvalidations(rb, txn);
+ ReorderBufferCleanupTXN(rb, txn);
+}
+
/*
* Abort a transaction that possibly has previous changes. Needs to be first
* called for subtransactions and then for the toplevel xid.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index c25ac1fa85..0e80f5697e 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -47,7 +47,7 @@ typedef struct LogicalDecodingContext
/*
* Marks the logical decoding context as fast forward decoding one. Such a
- * context does not have plugin loaded so most of the the following
+ * context does not have plugin loaded so most of the following
* properties are unused.
*/
bool fast_forward;
@@ -89,6 +89,11 @@ typedef struct LogicalDecodingContext
bool prepared_write;
XLogRecPtr write_location;
TransactionId write_xid;
+
+ /*
+ * Capabilities of the output plugin.
+ */
+ bool enable_twophase;
} LogicalDecodingContext;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 1ee0a56f03..e4070aa8a2 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -77,6 +77,46 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+/*
+ * Called for an implicit ABORT of a transaction.
+ */
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+ /*
+ * Called before decoding of a PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to the prepare and
+ * commit_prepared/abort_prepared callbacks, or held back until COMMIT
+ * PREPARED and sent as a regular transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
/*
* Called for the generic logical decoding messages.
*/
@@ -109,7 +149,12 @@ typedef struct OutputPluginCallbacks
LogicalDecodeChangeCB change_cb;
LogicalDecodeTruncateCB truncate_cb;
LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeAbortCB abort_cb;
LogicalDecodeMessageCB message_cb;
+ LogicalDecodeFilterPrepareCB filter_prepare_cb;
+ LogicalDecodePrepareCB prepare_cb;
+ LogicalDecodeCommitPreparedCB commit_prepared_cb;
+ LogicalDecodeAbortPreparedCB abort_prepared_cb;
LogicalDecodeFilterByOriginCB filter_by_origin_cb;
LogicalDecodeShutdownCB shutdown_cb;
} OutputPluginCallbacks;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index ec9515d156..285c9b53da 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
#define REORDERBUFFER_H
#include "access/htup_details.h"
+#include "access/twophase.h"
#include "lib/ilist.h"
#include "storage/sinval.h"
#include "utils/hsearch.h"
@@ -154,6 +155,11 @@ typedef struct ReorderBufferChange
#define RBTXN_HAS_CATALOG_CHANGES 0x0001
#define RBTXN_IS_SUBXACT 0x0002
#define RBTXN_IS_SERIALIZED 0x0004
+#define RBTXN_PREPARE 0x0008
+#define RBTXN_COMMIT_PREPARED 0x0010
+#define RBTXN_ROLLBACK_PREPARED 0x0020
+#define RBTXN_COMMIT 0x0040
+#define RBTXN_ROLLBACK 0x0080
/* does the txn have catalog changes */
#define rbtxn_has_catalog_changes(txn) (txn->txn_flags & RBTXN_HAS_CATALOG_CHANGES)
@@ -167,6 +173,16 @@ typedef struct ReorderBufferChange
* nentries_mem == nentries.
*/
#define rbtxn_is_serialized(txn) (txn->txn_flags & RBTXN_IS_SERIALIZED)
+/* is this txn prepared? */
+#define rbtxn_prepared(txn) (txn->txn_flags & RBTXN_PREPARE)
+/* was this prepared txn committed in the meanwhile? */
+#define rbtxn_commit_prepared(txn) (txn->txn_flags & RBTXN_COMMIT_PREPARED)
+/* was this prepared txn aborted in the meanwhile? */
+#define rbtxn_rollback_prepared(txn) (txn->txn_flags & RBTXN_ROLLBACK_PREPARED)
+/* was this txn committed in the meanwhile? */
+#define rbtxn_commit(txn) (txn->txn_flags & RBTXN_COMMIT)
+/* was this txn aborted in the meanwhile? */
+#define rbtxn_rollback(txn) (txn->txn_flags & RBTXN_ROLLBACK)
typedef struct ReorderBufferTXN
{
@@ -179,6 +195,8 @@ typedef struct ReorderBufferTXN
/* Do we know this is a subxact? Xid of top-level txn if so */
TransactionId toplevel_xid;
+ /* In case of 2PC we need to pass GID to output plugin */
+ char *gid;
/*
* LSN of the first data carrying, WAL record with knowledge about this
@@ -324,6 +342,37 @@ typedef void (*ReorderBufferCommitCB) (
ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+/* abort callback signature */
+typedef void (*ReorderBufferAbortCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+typedef bool (*ReorderBufferFilterPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/* abort prepared callback signature */
+typedef void (*ReorderBufferAbortPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+
/* message callback signature */
typedef void (*ReorderBufferMessageCB) (
ReorderBuffer *rb,
@@ -369,6 +418,11 @@ struct ReorderBuffer
ReorderBufferApplyChangeCB apply_change;
ReorderBufferApplyTruncateCB apply_truncate;
ReorderBufferCommitCB commit;
+ ReorderBufferAbortCB abort;
+ ReorderBufferFilterPrepareCB filter_prepare;
+ ReorderBufferPrepareCB prepare;
+ ReorderBufferCommitPreparedCB commit_prepared;
+ ReorderBufferAbortPreparedCB abort_prepared;
ReorderBufferMessageCB message;
/*
@@ -416,6 +470,11 @@ void ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot
void ReorderBufferCommit(ReorderBuffer *, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit);
void ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
void ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
@@ -439,6 +498,15 @@ void ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
bool ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
bool ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
+bool ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+ const char *gid);
+bool ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+ const char *gid);
+void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid);
ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
--
2.15.2 (Apple Git-101.1)
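
Reviewer note (not part of the patch): the callback selection at the end of ReorderBufferCommitInternal() can be modelled in isolation. The sketch below is only an illustration under stated assumptions: the two flag values are copied from the reorderbuffer.h hunk above, while MockTxn, MockFinalCallback and pick_final_callback() are hypothetical names that do not exist in PostgreSQL.

```c
#include <assert.h>
#include <stddef.h>

/* Flag values copied from the reorderbuffer.h hunk in this patch. */
#define RBTXN_PREPARE	0x0008
#define RBTXN_ROLLBACK	0x0080

/* Hypothetical stand-in for ReorderBufferTXN; only txn_flags matters here. */
typedef struct MockTxn
{
	unsigned int txn_flags;
} MockTxn;

/* Hypothetical labels for the three callbacks the patch chooses between. */
typedef enum MockFinalCallback
{
	MOCK_CB_COMMIT,
	MOCK_CB_PREPARE,
	MOCK_CB_ABORT
} MockFinalCallback;

/*
 * Mirror the order of checks in the patch: an abort detected during apply
 * wins, then PREPARE for twophase transactions, else a regular COMMIT.
 */
static MockFinalCallback
pick_final_callback(const MockTxn *txn)
{
	assert(txn != NULL);

	if (txn->txn_flags & RBTXN_ROLLBACK)
		return MOCK_CB_ABORT;
	else if (txn->txn_flags & RBTXN_PREPARE)
		return MOCK_CB_PREPARE;
	else
		return MOCK_CB_COMMIT;
}
```

In this model a transaction flagged as both prepared and rolled back resolves to the abort callback, which follows from the order of the checks rather than from any separate rule.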
0003-Gracefully-handle-concurrent-aborts-of-uncommitted-t.patch (application/octet-stream)
From 75edeb440794fff7de48082dafdecb065940bee5 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Thu, 26 Jul 2018 18:45:26 +0530
Subject: [PATCH 3/5] Gracefully handle concurrent aborts of uncommitted
transactions that are being decoded alongside.
When a transaction aborts, its changes are considered unnecessary for
other transactions. That means the changes may be either cleaned up by
vacuum or removed from HOT chains (thus made inaccessible through
indexes), and there may be other such consequences.
When decoding committed transactions this is not an issue, and we
never decode transactions that abort before the decoding starts.
But for in-progress transactions, for example when decoding prepared
transactions at PREPARE (and not at COMMIT PREPARED as before), this
may cause failures when the output plugin consults catalogs (both
system and user-defined).
We handle such failures by raising an error with sqlerrcode
ERRCODE_TRANSACTION_ROLLBACK from the system table scan APIs in the
backend that is decoding the uncommitted transaction. On receipt of
this sqlerrcode, the decoding logic aborts the ongoing decoding and
returns gracefully.
---
src/backend/access/index/genam.c | 31 +++++++++++++++++++++++++
src/backend/replication/logical/reorderbuffer.c | 30 ++++++++++++++++++++----
src/backend/utils/time/snapmgr.c | 25 ++++++++++++++++++--
src/include/utils/snapmgr.h | 4 +++-
4 files changed, 82 insertions(+), 8 deletions(-)
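
Reviewer note (not part of the patch): the CheckXidAlive life cycle introduced here can be modelled without any PostgreSQL headers. This is a minimal sketch under stated assumptions: every name beginning with mock_ or Mock is hypothetical, the status table stands in for the real commit log lookups (TransactionIdDidCommit/TransactionIdDidAbort), and a false return value stands in for the ereport(ERROR) with ERRCODE_TRANSACTION_ROLLBACK.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int MockXid;

#define MOCK_INVALID_XID	0
#define MOCK_MAX_XID		8

typedef enum MockXactStatus
{
	MOCK_XACT_IN_PROGRESS,		/* also covers "prepared" in this model */
	MOCK_XACT_COMMITTED,
	MOCK_XACT_ABORTED
} MockXactStatus;

/* Hypothetical status table indexed by xid; zero-initialised to in-progress. */
static MockXactStatus mock_status[MOCK_MAX_XID];

/* Models the CheckXidAlive global added to snapmgr.c. */
static MockXid mock_check_xid_alive = MOCK_INVALID_XID;

/* Models SetupHistoricSnapshot(): track the xid only if not yet committed. */
static void
mock_setup_historic_snapshot(MockXid xid)
{
	assert(xid < MOCK_MAX_XID);

	if (xid != MOCK_INVALID_XID && mock_status[xid] != MOCK_XACT_COMMITTED)
		mock_check_xid_alive = xid;
	else
		mock_check_xid_alive = MOCK_INVALID_XID;
}

/* Models TeardownHistoricSnapshot(): stop tracking. */
static void
mock_teardown_historic_snapshot(void)
{
	mock_check_xid_alive = MOCK_INVALID_XID;
}

/*
 * Models the check added to the systable_* scan functions. Returns false
 * where the patch would raise ERRCODE_TRANSACTION_ROLLBACK.
 */
static bool
mock_catalog_scan_allowed(void)
{
	if (mock_check_xid_alive != MOCK_INVALID_XID &&
		mock_status[mock_check_xid_alive] == MOCK_XACT_ABORTED)
		return false;
	return true;
}
```

The point of the model is that the abort is not detected at snapshot setup but at each catalog access, so a transaction that aborts midway through decoding is still caught on the next scan.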
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 9d08775687..67c5810bf7 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -423,6 +423,16 @@ systable_getnext(SysScanDesc sysscan)
else
htup = heap_getnext(sysscan->scan, ForwardScanDirection);
+ /*
+ * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+ * error out
+ */
+ if (TransactionIdIsValid(CheckXidAlive) &&
+ TransactionIdDidAbort(CheckXidAlive))
+ ereport(ERROR,
+ (errcode(ERRCODE_TRANSACTION_ROLLBACK),
+ errmsg("transaction aborted during system catalog scan")));
+
return htup;
}
@@ -476,6 +486,17 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
result = HeapTupleSatisfiesVisibility(tup, freshsnap, scan->rs_cbuf);
LockBuffer(scan->rs_cbuf, BUFFER_LOCK_UNLOCK);
}
+
+ /*
+ * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+ * error out
+ */
+ if (TransactionIdIsValid(CheckXidAlive) &&
+ TransactionIdDidAbort(CheckXidAlive))
+ ereport(ERROR,
+ (errcode(ERRCODE_TRANSACTION_ROLLBACK),
+ errmsg("transaction aborted during system catalog scan")));
+
return result;
}
@@ -593,6 +614,16 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
if (htup && sysscan->iscan->xs_recheck)
elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
+ /*
+ * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+ * error out
+ */
+ if (TransactionIdIsValid(CheckXidAlive) &&
+ TransactionIdDidAbort(CheckXidAlive))
+ ereport(ERROR,
+ (errcode(ERRCODE_TRANSACTION_ROLLBACK),
+ errmsg("transaction aborted during system catalog scan")));
+
return htup;
}
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 2fffc90606..8f4d63eb5b 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -599,7 +599,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
/* setup snapshot to allow catalog access */
- SetupHistoricSnapshot(snapshot_now, NULL);
+ SetupHistoricSnapshot(snapshot_now, NULL, xid);
PG_TRY();
{
rb->message(rb, txn, lsn, false, prefix, message_size, message);
@@ -1405,6 +1405,7 @@ ReorderBufferCommitInternal(ReorderBufferTXN *txn,
volatile CommandId command_id = FirstCommandId;
bool using_subtxn;
ReorderBufferIterTXNState *volatile iterstate = NULL;
+ MemoryContext ccxt = CurrentMemoryContext;
txn->final_lsn = commit_lsn;
txn->end_lsn = end_lsn;
@@ -1431,7 +1432,7 @@ ReorderBufferCommitInternal(ReorderBufferTXN *txn,
ReorderBufferBuildTupleCidHash(rb, txn);
/* setup the initial snapshot */
- SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+ SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
/*
* Decoding needs access to syscaches et al., which in turn use
@@ -1672,7 +1673,7 @@ ReorderBufferCommitInternal(ReorderBufferTXN *txn,
/* and continue with the new one */
- SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+ SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
break;
case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
@@ -1692,7 +1693,7 @@ ReorderBufferCommitInternal(ReorderBufferTXN *txn,
snapshot_now->curcid = command_id;
TeardownHistoricSnapshot(false);
- SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+ SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
/*
* Every time the CommandId is incremented, we could
@@ -1777,6 +1778,20 @@ ReorderBufferCommitInternal(ReorderBufferTXN *txn,
PG_CATCH();
{
/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
+ MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+ ErrorData *errdata = CopyErrorData();
+
+ /*
+ * if the catalog scan access returned an error of
+ * rollback, then abort on the other side as well
+ */
+ if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
+ {
+ elog(LOG, "stopping decoding of %s (%u)",
+ txn->gid != NULL ? txn->gid : "", txn->xid);
+ rb->abort(rb, txn, commit_lsn);
+ }
+
if (iterstate)
ReorderBufferIterTXNFinish(rb, iterstate);
@@ -1800,7 +1815,12 @@ ReorderBufferCommitInternal(ReorderBufferTXN *txn,
/* remove potential on-disk data, and deallocate */
ReorderBufferCleanupTXN(rb, txn);
- PG_RE_THROW();
+ /* re-throw only if it's not an abort */
+ if (errdata->sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK)
+ {
+ MemoryContextSwitchTo(ecxt);
+ PG_RE_THROW();
+ }
}
PG_END_TRY();
}
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index edf59efc29..0354fc9da9 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -151,6 +151,13 @@ static Snapshot SecondarySnapshot = NULL;
static Snapshot CatalogSnapshot = NULL;
static Snapshot HistoricSnapshot = NULL;
+/*
+ * An xid value pointing to a possibly ongoing or a prepared transaction.
+ * Currently used in logical decoding. It's possible that such transactions
+ * can get aborted while the decoding is ongoing.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+
/*
* These are updated by GetSnapshotData. We initialize them this way
* for the convenience of TransactionIdIsInProgress: even in bootstrap
@@ -1995,10 +2002,14 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
* Setup a snapshot that replaces normal catalog snapshots that allows catalog
* access to behave just like it did at a certain point in the past.
*
+ * If a valid xid is passed in, we check if it is uncommitted and track it in
+ * CheckXidAlive. This is to re-check XID status while accessing catalog.
+ *
* Needed for logical decoding.
*/
void
-SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
+SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids,
+ TransactionId snapshot_xid)
{
Assert(historic_snapshot != NULL);
@@ -2007,8 +2018,17 @@ SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
/* setup (cmin, cmax) lookup hash */
tuplecid_data = tuplecids;
-}
+ /*
+ * setup CheckXidAlive if it's not committed yet. We don't check
+ * if the xid aborted. That will happen during catalog access.
+ */
+ if (TransactionIdIsValid(snapshot_xid) &&
+ !TransactionIdDidCommit(snapshot_xid))
+ CheckXidAlive = snapshot_xid;
+ else
+ CheckXidAlive = InvalidTransactionId;
+}
/*
* Make catalog snapshots behave normally again.
@@ -2018,6 +2038,7 @@ TeardownHistoricSnapshot(bool is_error)
{
HistoricSnapshot = NULL;
tuplecid_data = NULL;
+ CheckXidAlive = InvalidTransactionId;
}
bool
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index 83806f3040..bad2053477 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -100,8 +100,10 @@ extern char *ExportSnapshot(Snapshot snapshot);
/* Support for catalog timetravel for logical decoding */
struct HTAB;
+extern TransactionId CheckXidAlive;
extern struct HTAB *HistoricSnapshotGetTupleCids(void);
-extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids);
+extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids,
+ TransactionId snapshot_xid);
extern void TeardownHistoricSnapshot(bool is_error);
extern bool HistoricSnapshotActive(void);
--
2.15.2 (Apple Git-101.1)
0004-Teach-test_decoding-plugin-to-work-with-2PC.patch (application/octet-stream)
From 80fc576bda483798919653991bef6dc198625d90 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Wed, 13 Jun 2018 16:31:15 +0530
Subject: [PATCH 4/5] Teach test_decoding plugin to work with 2PC
Includes a new option "enable-twophase". Depending on this option's
value, PREPARE TRANSACTION will either be decoded or treated as
a single phase commit later.
---
contrib/test_decoding/expected/prepared.out | 257 +++++++++++++++++++++++++---
contrib/test_decoding/sql/prepared.sql | 84 ++++++++-
contrib/test_decoding/test_decoding.c | 137 +++++++++++++++
3 files changed, 451 insertions(+), 27 deletions(-)
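
Reviewer note (not part of the patch): how the option interacts with the optional filter callback can be sketched standalone. The decision order below follows filter_prepare_cb_wrapper() from the earlier patch in this series; MockDecodingContext, mock_filter_prepare() and the example filter mock_skip_nodecode() (including its "_nodecode" convention) are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical, simplified filter signature: only the GID is passed. */
typedef bool (*MockFilterPrepareCB) (const char *gid);

/* Hypothetical stand-in for the relevant parts of LogicalDecodingContext. */
typedef struct MockDecodingContext
{
	bool		enable_twophase;
	MockFilterPrepareCB filter_prepare_cb;	/* optional, may be NULL */
} MockDecodingContext;

/*
 * Returns true when the PREPARE should be skipped, i.e. the transaction is
 * decoded as a regular one at COMMIT PREPARED time.
 */
static bool
mock_filter_prepare(const MockDecodingContext *ctx, const char *gid)
{
	assert(ctx != NULL && gid != NULL);

	/* Twophase decoding disabled: everything is filtered out. */
	if (!ctx->enable_twophase)
		return true;

	/* No filter supplied: every prepared transaction goes through. */
	if (ctx->filter_prepare_cb == NULL)
		return false;

	/* Otherwise the plugin decides per GID. */
	return ctx->filter_prepare_cb(gid);
}

/* Example plugin filter (assumption): skip GIDs containing "_nodecode". */
static bool
mock_skip_nodecode(const char *gid)
{
	return strstr(gid, "_nodecode") != NULL;
}
```

As the comment in ReorderBufferTxnIsPrepared() points out, a real filter has to return the same answer for a given transaction on every call, including after a restart, so a decision based purely on the GID (as in this example) is one way to satisfy that.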
diff --git a/contrib/test_decoding/expected/prepared.out b/contrib/test_decoding/expected/prepared.out
index 46e915d4ff..5df7b7ff20 100644
--- a/contrib/test_decoding/expected/prepared.out
+++ b/contrib/test_decoding/expected/prepared.out
@@ -6,19 +6,82 @@ SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_d
init
(1 row)
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc', 'test_decoding');
+ ?column?
+----------
+ init
+(1 row)
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
-- test simple successful use of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (1);
PREPARE TRANSACTION 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+------
+(0 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ PREPARE TRANSACTION 'test_prepared#1'
+(3 rows)
+
COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ COMMIT
+(3 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
INSERT INTO test_prepared1 VALUES (2);
-- test abort of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (3);
PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+(3 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(6 rows)
+
ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+------
+(0 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
INSERT INTO test_prepared1 VALUES (4);
-- test prepared xact containing ddl
BEGIN;
@@ -26,45 +89,193 @@ INSERT INTO test_prepared1 VALUES (5);
ALTER TABLE test_prepared1 ADD COLUMN data text;
INSERT INTO test_prepared1 VALUES (6, 'frakbar');
PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
-INSERT INTO test_prepared2 VALUES (7);
-COMMIT PREPARED 'test_prepared#3';
--- make sure stuff still works
-INSERT INTO test_prepared1 VALUES (8);
-INSERT INTO test_prepared2 VALUES (9);
--- cleanup
-DROP TABLE test_prepared1;
-DROP TABLE test_prepared2;
--- show results
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'test_prepared1'::regclass;
+ relation | locktype | mode
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4
+ COMMIT
+(3 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
data
-------------------------------------------------------------------------
BEGIN
- table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:4
COMMIT
BEGIN
- table public.test_prepared1: INSERT: id[integer]:2
- COMMIT
+ table public.test_prepared1: INSERT: id[integer]:5
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(7 rows)
+
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------
BEGIN
- table public.test_prepared1: INSERT: id[integer]:4
+ table public.test_prepared2: INSERT: id[integer]:7
COMMIT
+(3 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+----------------------------------------------------
BEGIN
table public.test_prepared2: INSERT: id[integer]:7
COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+-------------------------------------------------------------------------
BEGIN
table public.test_prepared1: INSERT: id[integer]:5
table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
COMMIT
+(4 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#3'
+(1 row)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (8);
+INSERT INTO test_prepared2 VALUES (9);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+--------------------------------------------------------------------
BEGIN
table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
COMMIT
BEGIN
table public.test_prepared2: INSERT: id[integer]:9
COMMIT
-(22 rows)
+(6 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:9
+ COMMIT
+(6 rows)
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'pg_class'::regclass;
+ relation | locktype | mode
+----------+----------+------
+(0 rows)
+
+-- Shouldn't see anything with 2pc decoding off
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+-----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:12
+ COMMIT
+(3 rows)
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:12
+ PREPARE TRANSACTION 'test_prepared_lock2'
+ COMMIT PREPARED 'test_prepared_lock2'
+(8 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- Both will work normally after we commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ COMMIT
+(4 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+--------------------------------------
+ COMMIT PREPARED 'test_prepared_lock'
+(1 row)
+
+-- test savepoints
+BEGIN;
+SAVEPOINT test_savepoint;
+CREATE TABLE test_prepared_savepoint (a int);
+PREPARE TRANSACTION 'test_prepared_savepoint';
+COMMIT PREPARED 'test_prepared_savepoint';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+-------------------------------------------
+ COMMIT PREPARED 'test_prepared_savepoint'
+(1 row)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+------
+(0 rows)
+
+-- cleanup
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+------
+(0 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+ data
+------
+(0 rows)
SELECT pg_drop_replication_slot('regression_slot');
pg_drop_replication_slot
@@ -72,3 +283,9 @@ SELECT pg_drop_replication_slot('regression_slot');
(1 row)
+SELECT pg_drop_replication_slot('regression_slot_2pc');
+ pg_drop_replication_slot
+--------------------------
+
+(1 row)
+
diff --git a/contrib/test_decoding/sql/prepared.sql b/contrib/test_decoding/sql/prepared.sql
index e72639767e..e8eb8ad8d6 100644
--- a/contrib/test_decoding/sql/prepared.sql
+++ b/contrib/test_decoding/sql/prepared.sql
@@ -1,22 +1,31 @@
-- predictability
SET synchronous_commit = on;
SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_2pc', 'test_decoding');
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
-- test simple successful use of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (1);
PREPARE TRANSACTION 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
INSERT INTO test_prepared1 VALUES (2);
-- test abort of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (3);
PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
INSERT INTO test_prepared1 VALUES (4);
@@ -27,24 +36,85 @@ ALTER TABLE test_prepared1 ADD COLUMN data text;
INSERT INTO test_prepared1 VALUES (6, 'frakbar');
PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'test_prepared1'::regclass;
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
-- make sure stuff still works
INSERT INTO test_prepared1 VALUES (8);
INSERT INTO test_prepared2 VALUES (9);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'pg_class'::regclass;
+
+-- Shouldn't see anything with 2pc decoding off
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+RESET statement_timeout;
+
+COMMIT PREPARED 'test_prepared_lock';
+
+-- Both will work normally after we commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+
+-- test savepoints
+BEGIN;
+SAVEPOINT test_savepoint;
+CREATE TABLE test_prepared_savepoint (a int);
+PREPARE TRANSACTION 'test_prepared_savepoint';
+COMMIT PREPARED 'test_prepared_savepoint';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
-- cleanup
DROP TABLE test_prepared1;
DROP TABLE test_prepared2;
--- show results
+-- show results. There should be nothing to show
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot_2pc', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');
+
SELECT pg_drop_replication_slot('regression_slot');
+SELECT pg_drop_replication_slot('regression_slot_2pc');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 1c439b57b0..140010a8b1 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -36,6 +36,7 @@ typedef struct
bool skip_empty_xacts;
bool xact_wrote_changes;
bool only_local;
+ bool enable_twophase;
} TestDecodingData;
static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -49,6 +50,8 @@ static void pg_output_begin(LogicalDecodingContext *ctx,
bool last_write);
static void pg_decode_commit_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pg_decode_abort_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr abort_lsn);
static void pg_decode_change(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, Relation rel,
ReorderBufferChange *change);
@@ -62,6 +65,18 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr message_lsn,
bool transactional, const char *prefix,
Size sz, const char *message);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
void
_PG_init(void)
@@ -80,9 +95,14 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
cb->change_cb = pg_decode_change;
cb->truncate_cb = pg_decode_truncate;
cb->commit_cb = pg_decode_commit_txn;
+ cb->abort_cb = pg_decode_abort_txn;
cb->filter_by_origin_cb = pg_decode_filter;
cb->shutdown_cb = pg_decode_shutdown;
cb->message_cb = pg_decode_message;
+ cb->filter_prepare_cb = pg_decode_filter_prepare;
+ cb->prepare_cb = pg_decode_prepare_txn;
+ cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+ cb->abort_prepared_cb = pg_decode_abort_prepared_txn;
}
@@ -102,6 +122,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
data->include_timestamp = false;
data->skip_empty_xacts = false;
data->only_local = false;
+ data->enable_twophase = false;
ctx->output_plugin_private = data;
@@ -183,6 +204,16 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
errmsg("could not parse value \"%s\" for parameter \"%s\"",
strVal(elem->arg), elem->defname)));
}
+ else if (strcmp(elem->defname, "enable-twophase") == 0)
+ {
+ if (elem->arg == NULL)
+ data->enable_twophase = true;
+ else if (!parse_bool(strVal(elem->arg), &data->enable_twophase))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("could not parse value \"%s\" for parameter \"%s\"",
+ strVal(elem->arg), elem->defname)));
+ }
else
{
ereport(ERROR,
@@ -251,6 +282,112 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
OutputPluginWrite(ctx, true);
}
+/* ABORT callback */
+static void
+pg_decode_abort_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+ if (data->include_xids)
+ appendStringInfo(ctx->out, "ABORT %u", txn->xid);
+ else
+ appendStringInfoString(ctx->out, "ABORT");
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* Filter out two-phase transactions, if decoding not enabled. */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ /* treat all transactions as one-phase */
+ if (!data->enable_twophase)
+ return true;
+
+ return false;
+}
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* ABORT PREPARED callback */
+static void
+pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
static bool
pg_decode_filter(LogicalDecodingContext *ctx,
RepOriginId origin_id)
--
2.15.2 (Apple Git-101.1)
Attachment: 0005-Additional-test-case-to-demonstrate-decoding-rollbac.patch (application/octet-stream)
From 682b0de2827d1f55c4e471c3129eb687ae0825a5 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Wed, 13 Jun 2018 16:32:16 +0530
Subject: [PATCH 5/5] Additional test case to demonstrate decoding/rollback
interlocking
Introduce a decode-delay parameter in the test_decoding plugin. Based
on the value provided in the plugin, sleep for those many seconds while
inside the "decode change" plugin call. A concurrent rollback is fired
off which aborts that transaction in the meanwhile. A subsequent
systable access will error out causing the logical decoding to abort.
---
contrib/test_decoding/Makefile | 5 +-
contrib/test_decoding/t/001_twophase.pl | 98 +++++++++++++++++++++++++++++++++
contrib/test_decoding/test_decoding.c | 24 ++++++++
3 files changed, 126 insertions(+), 1 deletion(-)
create mode 100644 contrib/test_decoding/t/001_twophase.pl
diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index afcab930f7..3f0b1c6ebd 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -26,7 +26,7 @@ installcheck:;
# installation, allow to do so, but only if requested explicitly.
installcheck-force: regresscheck-install-force isolationcheck-install-force
-check: regresscheck isolationcheck
+check: regresscheck isolationcheck 2pc-check
submake-regress:
$(MAKE) -C $(top_builddir)/src/test/regress all
@@ -67,3 +67,6 @@ isolationcheck-install-force: all | submake-isolation submake-test_decoding temp
isolationcheck isolationcheck-install-force
temp-install: EXTRA_INSTALL=contrib/test_decoding
+
+2pc-check: temp-install
+ $(prove_check)
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000000..3e68bac3f4
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,98 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+ 'postgresql.conf', qq(
+ max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+ "INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+ "SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+$node_logical->safe_psql('postgres',
+ "SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot2', 'test_decoding');");
+
+# This test specifically exercises a concurrent abort while logical decoding
+# is ongoing. The decode-delay value makes the decoding of each change
+# sleep for that many seconds. We fire off a ROLLBACK from another
+# session while this delayed decode is in progress.
+#
+# Decoding should stop immediately after that, and the next
+# pg_logical_slot_get_changes call should show only a few records decoded
+# from the entire two-phase transaction.
+#
+# We use two slots to test multiple decoding backends here
+
+$node_logical->safe_psql('postgres', "
+ BEGIN;
+ INSERT INTO tab VALUES (11);
+ INSERT INTO tab VALUES (12);
+ ALTER TABLE tab ADD COLUMN b INT;
+ INSERT INTO tab VALUES (13,14);
+ PREPARE TRANSACTION 'test_prepared_tab';");
+# start decoding the above with decode-delay in the background.
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should only decode 2 INSERT records and should include
+# an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1', 'decode-delay', '3');\" \&");
+
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot2', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1', 'decode-delay', '3');\" \&");
+
+# sleep for a little while (shorter than decode-delay)
+$node_logical->safe_psql('postgres', "select pg_sleep(1)");
+
+# rollback the prepared transaction whose first record is being decoded
+# after sleeping for decode-delay time
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# wait for decoding to stop
+$node_logical->psql('postgres', "select pg_sleep(4)");
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot2', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+
+# check for occurrence of log about stopping decoding
+my $output_file = slurp_file($node_logical->logfile());
+my $abort_str = "stopping decoding of test_prepared_tab ";
+like($output_file, qr/$abort_str/, "ABORT found in server log");
+
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+ BEGIN;
+ INSERT INTO tab VALUES (11);
+ INSERT INTO tab VALUES (12);
+ ALTER TABLE tab ADD COLUMN b INT;
+ INSERT INTO tab VALUES (13, 11);
+ PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot2', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit post the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot2', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'enable-twophase', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot2');");
+$node_logical->stop('fast');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 140010a8b1..7762a290f9 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -37,6 +37,7 @@ typedef struct
bool xact_wrote_changes;
bool only_local;
bool enable_twophase;
+ int decode_delay; /* seconds to sleep after every change record */
} TestDecodingData;
static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -123,6 +124,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
data->skip_empty_xacts = false;
data->only_local = false;
data->enable_twophase = false;
+ data->decode_delay = 0;
ctx->output_plugin_private = data;
@@ -214,6 +216,21 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
errmsg("could not parse value \"%s\" for parameter \"%s\"",
strVal(elem->arg), elem->defname)));
}
+ else if (strcmp(elem->defname, "decode-delay") == 0)
+ {
+ if (elem->arg == NULL)
+ data->decode_delay = 2; /* default to 2 seconds */
+ else
+ data->decode_delay = pg_atoi(strVal(elem->arg),
+ sizeof(int), 0);
+
+ if (data->decode_delay <= 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("Specify positive value for parameter \"%s\","
+ " you specified \"%s\"",
+ elem->defname, strVal(elem->arg))));
+ }
else
{
ereport(ERROR,
@@ -553,6 +570,13 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
}
data->xact_wrote_changes = true;
+ /* if decode_delay is specified */
+ if (data->decode_delay > 0)
+ {
+ elog(LOG, "sleeping for %d seconds", data->decode_delay);
+ pg_usleep(data->decode_delay * 1000000L);
+ }
+
class_form = RelationGetForm(relation);
tupdesc = RelationGetDescr(relation);
--
2.15.2 (Apple Git-101.1)
On 2018-07-26 20:24:00 +0530, Nikhil Sontakke wrote:
Hi,
I think we even can just do something like a global
TransactionId check_if_transaction_is_alive = InvalidTransactionId;
and just set it up during decoding. And then just check it whenever it's
not set to InvalidTransactionId.
Ok. I will work on something along these lines and re-submit the set of patches.
PFA, latest patchset, which completely removes the earlier
LogicalLock/LogicalUnLock implementation using groupDecode stuff and
uses the newly suggested approach of checking the currently decoded
XID for abort in systable_* API functions. Much simpler to code and
easier to test as well.
So, leaving the fact that it might not actually be correct aside ;), you
seem to be ok with the approach?
Out of the patchset, the specific patch which focuses on the above
systable_* API based XID checking implementation is part of
0003-Gracefully-handle-concurrent-aborts-of-uncommitted-t.patch. So,
it might help to take a look at this patch first for any additional
feedback on this approach.
K.
There's an additional test case in
0005-Additional-test-case-to-demonstrate-decoding-rollbac.patch which
uses a sleep in the "change" plugin API to allow a concurrent rollback
on the 2PC being currently decoded. Andres generally doesn't like this
approach :-), but there are no timing/interlocking issues now, and the
sleep just helps us do a concurrent rollback, so it might be ok now,
all things considered. Anyways, it's an additional patch for now.
Yea, I still don't think it's ok. The tests won't be reliable. There's
ways to make this reliable, e.g. by forcing a lock to be acquired that's
externally held or such. Might even be doable just with a weird custom
datatype.
From 75edeb440794fff7de48082dafdecb065940bee5 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Thu, 26 Jul 2018 18:45:26 +0530
Subject: [PATCH 3/5] Gracefully handle concurrent aborts of uncommitted
transactions that are being decoded alongside.
When a transaction aborts, its changes are considered unnecessary for
other transactions. That means the changes may be either cleaned up by
vacuum or removed from HOT chains (thus made inaccessible through
indexes), and there may be other such consequences.
When decoding committed transactions this is not an issue, and we
never decode transactions that abort before the decoding starts.
But for in-progress transactions - for example when decoding prepared
transactions on PREPARE (and not COMMIT PREPARED as before), this
may cause failures when the output plugin consults catalogs (both
system and user-defined).
We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from system table scan APIs to the backend decoding a
sqlerrcode from system table scan APIs to the backend decoding a
specific uncommitted transaction. The decoding logic on the receipt
of such an sqlerrcode aborts the ongoing decoding and returns
gracefully.
---
src/backend/access/index/genam.c | 31 +++++++++++++++++++++++++
src/backend/replication/logical/reorderbuffer.c | 30 ++++++++++++++++++++----
src/backend/utils/time/snapmgr.c | 25 ++++++++++++++++++--
src/include/utils/snapmgr.h | 4 +++-
4 files changed, 82 insertions(+), 8 deletions(-)
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 9d08775687..67c5810bf7 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -423,6 +423,16 @@ systable_getnext(SysScanDesc sysscan)
else
htup = heap_getnext(sysscan->scan, ForwardScanDirection);
+ /*
+ * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+ * error out
+ */
+ if (TransactionIdIsValid(CheckXidAlive) &&
+ TransactionIdDidAbort(CheckXidAlive))
+ ereport(ERROR,
+ (errcode(ERRCODE_TRANSACTION_ROLLBACK),
+ errmsg("transaction aborted during system catalog scan")));
+
return htup;
}
Don't we have to check TransactionIdIsInProgress() first? C.f. header
comments in tqual.c. Note this is also not guaranteed to be correct
after a crash (where no clog entry will exist for an aborted xact), but
we probably shouldn't get here in that case - but better be safe.
I suspect it'd be better reformulated as
TransactionIdIsValid(CheckXidAlive) &&
!TransactionIdIsInProgress(CheckXidAlive) &&
!TransactionIdDidCommit(CheckXidAlive)
What do you think?
I think it'd also be good to add assertions to codepaths not going
through systable_* asserting that
!TransactionIdIsValid(CheckXidAlive). Alternatively we could add an
if (unlikely(TransactionIdIsValid(CheckXidAlive)) && ...)
branch to those too.
From 80fc576bda483798919653991bef6dc198625d90 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Wed, 13 Jun 2018 16:31:15 +0530
Subject: [PATCH 4/5] Teach test_decoding plugin to work with 2PC
Includes a new option "enable_twophase". Depending on this option's
value, PREPARE TRANSACTION will either be decoded or treated as
a single phase commit later.
FWIW, I don't think I'm ok with doing this on a per-plugin-option basis.
I think this is something that should be known to the outside of the
plugin. More similar to how binary / non-binary support works. Should
also be able to inquire the output plugin whether it's supported (cf
previous similarity).
From 682b0de2827d1f55c4e471c3129eb687ae0825a5 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Wed, 13 Jun 2018 16:32:16 +0530
Subject: [PATCH 5/5] Additional test case to demonstrate decoding/rollback
interlocking
Introduce a decode-delay parameter in the test_decoding plugin. Based
on the value provided in the plugin, sleep for those many seconds while
inside the "decode change" plugin call. A concurrent rollback is fired
off which aborts that transaction in the meanwhile. A subsequent
systable access will error out causing the logical decoding to abort.
Yea, I'm *definitely* still not on board with this. This'll just lead to
a fragile or extremely slow test.
Greetings,
Andres Freund
PFA, latest patchset, which completely removes the earlier
LogicalLock/LogicalUnLock implementation using groupDecode stuff and
uses the newly suggested approach of checking the currently decoded
XID for abort in systable_* API functions. Much simpler to code and
easier to test as well.
So, leaving the fact that it might not actually be correct aside ;), you
seem to be ok with the approach?
;-)
Yes, I do like the approach. Do you think there are other locations
other than systable_* APIs which might need such checks?
There's an additional test case in
0005-Additional-test-case-to-demonstrate-decoding-rollbac.patch which
uses a sleep in the "change" plugin API to allow a concurrent rollback
on the 2PC being currently decoded. Andres generally doesn't like this
approach :-), but there are no timing/interlocking issues now, and the
sleep just helps us do a concurrent rollback, so it might be ok now,
all things considered. Anyways, it's an additional patch for now.
Yea, I still don't think it's ok. The tests won't be reliable. There's
ways to make this reliable, e.g. by forcing a lock to be acquired that's
externally held or such. Might even be doable just with a weird custom
datatype.
Ok, I will look at ways to do away with the sleep.
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 9d08775687..67c5810bf7 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -423,6 +423,16 @@ systable_getnext(SysScanDesc sysscan)
else
htup = heap_getnext(sysscan->scan, ForwardScanDirection);
+ /*
+ * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+ * error out
+ */
+ if (TransactionIdIsValid(CheckXidAlive) &&
+ TransactionIdDidAbort(CheckXidAlive))
+ ereport(ERROR,
+ (errcode(ERRCODE_TRANSACTION_ROLLBACK),
+ errmsg("transaction aborted during system catalog scan")));
+
return htup;
}
Don't we have to check TransactionIdIsInProgress() first? C.f. header
comments in tqual.c. Note this is also not guaranteed to be correct
after a crash (where no clog entry will exist for an aborted xact), but
we probably shouldn't get here in that case - but better be safe.
I suspect it'd be better reformulated as
TransactionIdIsValid(CheckXidAlive) &&
!TransactionIdIsInProgress(CheckXidAlive) &&
!TransactionIdDidCommit(CheckXidAlive)
What do you think?
tqual.c does seem to mention this for a non-MVCC snapshot, so might as
well do it this way. The caching of fetched XID should not make these
checks too expensive anyways.
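To make the difference between the two formulations concrete, here is a minimal standalone sketch. It does not use the real PostgreSQL functions: every `mock_*` name and the `MOCK_CRASHED` state are hypothetical stand-ins for the procarray/clog lookups, assumed only for illustration. The point it tries to show is the case raised above, where an xid is no longer running but has neither a commit nor an abort recorded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; none of these are PostgreSQL APIs. */
typedef uint32_t MockXid;
#define MOCK_INVALID_XID ((MockXid) 0)
#define MOCK_NXIDS 8

typedef enum
{
	MOCK_IN_PROGRESS,	/* still listed as running */
	MOCK_COMMITTED,		/* commit recorded */
	MOCK_ABORTED,		/* abort recorded */
	MOCK_CRASHED		/* not running, neither commit nor abort recorded */
} MockXidState;

static MockXidState mock_state[MOCK_NXIDS];

static void
mock_set_state(MockXid xid, MockXidState state)
{
	assert(xid < MOCK_NXIDS);
	mock_state[xid] = state;
}

static bool mock_is_valid(MockXid xid) { return xid != MOCK_INVALID_XID; }
static bool mock_in_progress(MockXid xid) { return mock_state[xid] == MOCK_IN_PROGRESS; }
static bool mock_did_commit(MockXid xid) { return mock_state[xid] == MOCK_COMMITTED; }
static bool mock_did_abort(MockXid xid) { return mock_state[xid] == MOCK_ABORTED; }

/* Check as originally posted: only an explicitly recorded abort counts. */
static bool
check_original(MockXid xid)
{
	return mock_is_valid(xid) && mock_did_abort(xid);
}

/* Reformulated check: neither running nor committed is treated as aborted. */
static bool
check_reformulated(MockXid xid)
{
	return mock_is_valid(xid) &&
		!mock_in_progress(xid) &&
		!mock_did_commit(xid);
}
```

In this toy model both checks agree for running, committed and explicitly aborted xids. They differ only for the `MOCK_CRASHED` state, which the original form lets through and the reformulated form catches.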
I think it'd also be good to add assertions to codepaths not going
through systable_* asserting that
!TransactionIdIsValid(CheckXidAlive). Alternatively we could add an
if (unlikely(TransactionIdIsValid(CheckXidAlive)) && ...)
branch to those too.
I was wondering if anything else would be needed for user-defined
catalog tables..
From 80fc576bda483798919653991bef6dc198625d90 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Wed, 13 Jun 2018 16:31:15 +0530
Subject: [PATCH 4/5] Teach test_decoding plugin to work with 2PC
Includes a new option "enable_twophase". Depending on this option's
value, PREPARE TRANSACTION will either be decoded or treated as
a single phase commit later.
FWIW, I don't think I'm ok with doing this on a per-plugin-option basis.
I think this is something that should be known to the outside of the
plugin. More similar to how binary / non-binary support works. Should
also be able to inquire the output plugin whether it's supported (cf
previous similarity).
Hmm, lemme see if we can do it outside of the plugin. But note that a
plugin might want to decode some 2PC transactions at prepare time and
others at "commit prepared" time.
We also need to take care to not break logical replication if the
other node is running non-2PC enabled code. We tried to optimize the
COMMIT/ABORT handling by adding sub flags to the existing protocol. I
will test that as well.
Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi Nikhil,
Any progress on the issues discussed in the last couple of messages?
That is:
1) removing of the sleep() from tests
2) changes to systable_getnext() wrt. TransactionIdIsInProgress()
3) adding asserts / checks to codepaths not going through systable_*
4) (not) adding this as a per-plugin option
5) handling cases where the downstream does not have 2PC enabled
I guess it'd be good to have an updated patch or further discussion
before continuing the review efforts.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi Tomas,
Any progress on the issues discussed in the last couple of messages?
That is:
1) removing of the sleep() from tests
Done. Now the test_decoding plugin takes a new option "check-xid". We
will pass the XID which is going to be aborted via this option. The
test_decoding plugin will wait for this XID to abort and exit when
that happens. This removes any arbitrary sleep dependencies.
2) changes to systable_getnext() wrt. TransactionIdIsInProgress()
Done.
3) adding asserts / checks to codepaths not going through systable_*
Done. All the heap_* get api calls now assert that they are not being
invoked with a valid CheckXidAlive value.
4) (not) adding this as a per-plugin option
5) handling cases where the downstream does not have 2PC enabled
struct OutputPluginOptions now has an enable_twophase field which will
be set by the plugin at init time similar to the way output_type is
set to binary/text now.
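As a rough sketch of how points 4) and 5) fit together: the plugin declares the capability once at init time, and the decoder then picks, per transaction, between decoding at PREPARE and falling back to decoding at COMMIT PREPARED. The types and functions below are simplified hypothetical stand-ins (all `Mock*` and `mock_*` names are assumptions, and the filter takes only a GID here, unlike the callback in the patch); this is not the patch's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the plugin option struct. */
typedef enum { MOCK_OUTPUT_BINARY, MOCK_OUTPUT_TEXT } MockOutputType;

typedef struct
{
	MockOutputType output_type;
	bool		enable_twophase;	/* set once by the plugin at init time */
} MockPluginOptions;

/* Returns true when the plugin wants to skip PREPARE-time decoding. */
typedef bool (*MockFilterPrepare) (const char *gid);

typedef enum
{
	MOCK_DECODE_AT_PREPARE,
	MOCK_DECODE_AT_COMMIT_PREPARED
} MockPrepareAction;

/* Plugin init: declare capabilities next to the output type. */
static void
mock_startup(MockPluginOptions *opt, bool supports_twophase)
{
	assert(opt != NULL);
	opt->output_type = MOCK_OUTPUT_TEXT;
	opt->enable_twophase = supports_twophase;
}

/* Decoder side: what to do when a PREPARE record shows up. */
static MockPrepareAction
mock_on_prepare(const MockPluginOptions *opt, MockFilterPrepare filter,
				const char *gid)
{
	if (!opt->enable_twophase)
		return MOCK_DECODE_AT_COMMIT_PREPARED;	/* plugin has no 2PC support */
	if (filter != NULL && filter(gid))
		return MOCK_DECODE_AT_COMMIT_PREPARED;	/* plugin filtered this GID */
	return MOCK_DECODE_AT_PREPARE;
}

/* Hypothetical filter: skip GIDs that start with "local-". */
static bool
mock_skip_local_gids(const char *gid)
{
	return strncmp(gid, "local-", 6) == 0;
}
```

With this split the capability is visible outside the plugin, while the optional filter still lets a plugin decode some two-phase transactions at prepare time and others at commit prepared time.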
I guess it'd be good to have an updated patch or further discussion
before continuing the review efforts.
PFA, latest patchset which implements the above.
Regards,
Nikhil
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
0001-Cleaning-up-of-flags-in-ReorderBufferTXN-structure.patch (application/octet-stream)
From 6d7bd6bcda4c9e9c6914e9ec7e27450971b47d64 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Wed, 13 Jun 2018 16:15:24 +0530
Subject: [PATCH 1/4] Cleaning up of flags in ReorderBufferTXN structure
---
src/backend/replication/logical/reorderbuffer.c | 34 ++++++++++++-------------
src/include/replication/reorderbuffer.h | 33 ++++++++++++++----------
2 files changed, 37 insertions(+), 30 deletions(-)
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 23466bade2..6df6fc0a73 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -680,7 +680,7 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
Assert(prev_first_lsn < cur_txn->first_lsn);
/* known-as-subtxn txns must not be listed */
- Assert(!cur_txn->is_known_as_subxact);
+ Assert(!rbtxn_is_known_subxact(cur_txn));
prev_first_lsn = cur_txn->first_lsn;
}
@@ -700,7 +700,7 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
Assert(prev_base_snap_lsn < cur_txn->base_snapshot_lsn);
/* known-as-subtxn txns must not be listed */
- Assert(!cur_txn->is_known_as_subxact);
+ Assert(!rbtxn_is_known_subxact(cur_txn));
prev_base_snap_lsn = cur_txn->base_snapshot_lsn;
}
@@ -723,7 +723,7 @@ ReorderBufferGetOldestTXN(ReorderBuffer *rb)
txn = dlist_head_element(ReorderBufferTXN, node, &rb->toplevel_by_lsn);
- Assert(!txn->is_known_as_subxact);
+ Assert(!rbtxn_is_known_subxact(txn));
Assert(txn->first_lsn != InvalidXLogRecPtr);
return txn;
}
@@ -783,7 +783,7 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
if (!new_sub)
{
- if (subtxn->is_known_as_subxact)
+ if (rbtxn_is_known_subxact(subtxn))
{
/* already associated, nothing to do */
return;
@@ -799,7 +799,7 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
}
}
- subtxn->is_known_as_subxact = true;
+ subtxn->txn_flags |= RBTXN_IS_SUBXACT;
subtxn->toplevel_xid = xid;
Assert(subtxn->nsubtxns == 0);
@@ -1009,7 +1009,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
ReorderBufferChange *cur_change;
- if (txn->serialized)
+ if (rbtxn_is_serialized(txn))
{
/* serialize remaining changes */
ReorderBufferSerializeTXN(rb, txn);
@@ -1038,7 +1038,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
ReorderBufferChange *cur_change;
- if (cur_txn->serialized)
+ if (rbtxn_is_serialized(cur_txn))
{
/* serialize remaining changes */
ReorderBufferSerializeTXN(rb, cur_txn);
@@ -1204,7 +1204,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
* they originally were happening inside another subtxn, so we won't
* ever recurse more than one level deep here.
*/
- Assert(subtxn->is_known_as_subxact);
+ Assert(rbtxn_is_known_subxact(subtxn));
Assert(subtxn->nsubtxns == 0);
ReorderBufferCleanupTXN(rb, subtxn);
@@ -1245,7 +1245,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
/*
* Remove TXN from its containing list.
*
- * Note: if txn->is_known_as_subxact, we are deleting the TXN from its
+ * Note: if txn is known as subxact, we are deleting the TXN from its
* parent's list of known subxacts; this leaves the parent's nsubxacts
* count too high, but we don't care. Otherwise, we are deleting the TXN
* from the LSN-ordered list of toplevel TXNs.
@@ -1260,7 +1260,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
Assert(found);
/* remove entries spilled to disk */
- if (txn->serialized)
+ if (rbtxn_is_serialized(txn))
ReorderBufferRestoreCleanup(rb, txn);
/* deallocate */
@@ -1277,7 +1277,7 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
dlist_iter iter;
HASHCTL hash_ctl;
- if (!txn->has_catalog_changes || dlist_is_empty(&txn->tuplecids))
+ if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
return;
memset(&hash_ctl, 0, sizeof(hash_ctl));
@@ -1901,7 +1901,7 @@ ReorderBufferAbortOld(ReorderBuffer *rb, TransactionId oldestRunningXid)
* final_lsn to that of their last change; this causes
* ReorderBufferRestoreCleanup to do the right thing.
*/
- if (txn->serialized && txn->final_lsn == 0)
+ if (rbtxn_is_serialized(txn) && txn->final_lsn == 0)
{
ReorderBufferChange *last =
dlist_tail_element(ReorderBufferChange, node, &txn->changes);
@@ -2049,7 +2049,7 @@ ReorderBufferSetBaseSnapshot(ReorderBuffer *rb, TransactionId xid,
* operate on its top-level transaction instead.
*/
txn = ReorderBufferTXNByXid(rb, xid, true, &is_new, lsn, true);
- if (txn->is_known_as_subxact)
+ if (rbtxn_is_known_subxact(txn))
txn = ReorderBufferTXNByXid(rb, txn->toplevel_xid, false,
NULL, InvalidXLogRecPtr, false);
Assert(txn->base_snapshot == NULL);
@@ -2156,7 +2156,7 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
- txn->has_catalog_changes = true;
+ txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
}
/*
@@ -2173,7 +2173,7 @@ ReorderBufferXidHasCatalogChanges(ReorderBuffer *rb, TransactionId xid)
if (txn == NULL)
return false;
- return txn->has_catalog_changes;
+ return rbtxn_has_catalog_changes(txn);
}
/*
@@ -2193,7 +2193,7 @@ ReorderBufferXidHasBaseSnapshot(ReorderBuffer *rb, TransactionId xid)
return false;
/* a known subtxn? operate on top-level txn instead */
- if (txn->is_known_as_subxact)
+ if (rbtxn_is_known_subxact(txn))
txn = ReorderBufferTXNByXid(rb, txn->toplevel_xid, false,
NULL, InvalidXLogRecPtr, false);
@@ -2314,7 +2314,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
Assert(spilled == txn->nentries_mem);
Assert(dlist_is_empty(&txn->changes));
txn->nentries_mem = 0;
- txn->serialized = true;
+ txn->txn_flags |= RBTXN_IS_SERIALIZED;
if (fd != -1)
CloseTransientFile(fd);
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 7787edf7b6..5b5f4db6d7 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -150,18 +150,34 @@ typedef struct ReorderBufferChange
dlist_node node;
} ReorderBufferChange;
+/* ReorderBufferTXN flags */
+#define RBTXN_HAS_CATALOG_CHANGES 0x0001
+#define RBTXN_IS_SUBXACT 0x0002
+#define RBTXN_IS_SERIALIZED 0x0004
+
+/* does the txn have catalog changes */
+#define rbtxn_has_catalog_changes(txn) (txn->txn_flags & RBTXN_HAS_CATALOG_CHANGES)
+/* is the txn known as a subxact? */
+#define rbtxn_is_known_subxact(txn) (txn->txn_flags & RBTXN_IS_SUBXACT)
+/*
+ * Has this transaction been spilled to disk? It's not always possible to
+ * deduce that fact by comparing nentries with nentries_mem, because e.g.
+ * subtransactions of a large transaction might get serialized together
+ * with the parent - if they're restored to memory they'd have
+ * nentries_mem == nentries.
+ */
+#define rbtxn_is_serialized(txn) (txn->txn_flags & RBTXN_IS_SERIALIZED)
+
typedef struct ReorderBufferTXN
{
+ int txn_flags;
+
/*
* The transactions transaction id, can be a toplevel or sub xid.
*/
TransactionId xid;
- /* did the TX have catalog changes */
- bool has_catalog_changes;
-
/* Do we know this is a subxact? Xid of top-level txn if so */
- bool is_known_as_subxact;
TransactionId toplevel_xid;
/*
@@ -229,15 +245,6 @@ typedef struct ReorderBufferTXN
*/
uint64 nentries_mem;
- /*
- * Has this transaction been spilled to disk? It's not always possible to
- * deduce that fact by comparing nentries with nentries_mem, because e.g.
- * subtransactions of a large transaction might get serialized together
- * with the parent - if they're restored to memory they'd have
- * nentries_mem == nentries.
- */
- bool serialized;
-
/*
* List of ReorderBufferChange structs, including new Snapshots and new
* CommandIds
--
2.15.2 (Apple Git-101.1)
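For readers skimming patch 0001, the pattern it applies (several bool fields folded into one flags word, read through accessor macros) can be sketched in isolation as follows. This is a standalone illustration with hypothetical `MOCK_*` / `mock_*` names, not the patch's code. It also deviates from the patch in two ways that are purely my assumption of a defensive variant: the macro argument is parenthesized, and the result is normalized with `!= 0`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits, mirroring the layout used in the patch. */
#define MOCK_TXN_HAS_CATALOG_CHANGES	0x0001
#define MOCK_TXN_IS_SUBXACT				0x0002
#define MOCK_TXN_IS_SERIALIZED			0x0004

typedef struct
{
	int			txn_flags;		/* replaces three separate bool fields */
} MockTxn;

/*
 * Accessors. Parenthesizing (txn) keeps the macros usable with arguments
 * such as &local_txn, and "!= 0" yields exactly 0 or 1.
 */
#define mock_txn_has_catalog_changes(txn) \
	(((txn)->txn_flags & MOCK_TXN_HAS_CATALOG_CHANGES) != 0)
#define mock_txn_is_subxact(txn) \
	(((txn)->txn_flags & MOCK_TXN_IS_SUBXACT) != 0)
#define mock_txn_is_serialized(txn) \
	(((txn)->txn_flags & MOCK_TXN_IS_SERIALIZED) != 0)

/* Setting a flag ORs its bit in and leaves the other bits untouched. */
static void
mock_txn_mark_subxact(MockTxn *txn)
{
	assert(txn != NULL);
	txn->txn_flags |= MOCK_TXN_IS_SUBXACT;
}
```

Adding a further state later (for example a "prepared" bit) then only needs a new bit and accessor rather than another struct field.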
0002-Support-decoding-of-two-phase-transactions-at-PREPAR.patch (application/octet-stream)
From 90e93341629bb534a619d29cc87713e35358618b Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Wed, 13 Jun 2018 16:30:30 +0530
Subject: [PATCH 2/4] Support decoding of two-phase transactions at PREPARE
Until now two-phase transactions were decoded at COMMIT, just like
regular transaction. During replay, two-phase transactions were
translated into regular transactions on the subscriber, and the GID
was not forwarded to it.
This patch allows PREPARE-time decoding two-phase transactions (if
the output plugin supports this capability), in which case the
transactions are replayed at PREPARE and then committed later when
COMMIT PREPARED arrives.
On the subscriber, the transactions will be executed as two-phase
transactions, with the same GID. This is important for various
external transaction managers, that often encode information into
the GID itself.
Includes documentation changes.
---
doc/src/sgml/logicaldecoding.sgml | 127 ++++++++++-
src/backend/replication/logical/decode.c | 266 ++++++++++++++++++------
src/backend/replication/logical/logical.c | 203 ++++++++++++++++++
src/backend/replication/logical/reorderbuffer.c | 185 ++++++++++++++--
src/include/replication/logical.h | 2 +-
src/include/replication/output_plugin.h | 46 ++++
src/include/replication/reorderbuffer.h | 68 ++++++
7 files changed, 814 insertions(+), 83 deletions(-)
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 8db968641e..a89e4d5184 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -385,7 +385,12 @@ typedef struct OutputPluginCallbacks
LogicalDecodeChangeCB change_cb;
LogicalDecodeTruncateCB truncate_cb;
LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeAbortCB abort_cb;
LogicalDecodeMessageCB message_cb;
+ LogicalDecodeFilterPrepareCB filter_prepare_cb;
+ LogicalDecodePrepareCB prepare_cb;
+ LogicalDecodeCommitPreparedCB commit_prepared_cb;
+ LogicalDecodeAbortPreparedCB abort_prepared_cb;
LogicalDecodeFilterByOriginCB filter_by_origin_cb;
LogicalDecodeShutdownCB shutdown_cb;
} OutputPluginCallbacks;
@@ -457,7 +462,13 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
never get
decoded. Successful savepoints are
folded into the transaction containing them in the order they were
- executed within that transaction.
+ executed within that transaction. A transaction that is prepared for
+ a two-phase commit using <command>PREPARE TRANSACTION</command> will
+ also be decoded if the output plugin callbacks needed for decoding
+ them are provided. It is possible that the current transaction which
+ is being decoded is aborted concurrently via a <command>ROLLBACK PREPARED</command>
+ command. In that case, the logical decoding of this transaction will
+ be aborted too.
</para>
<note>
@@ -558,6 +569,71 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
</para>
</sect3>
+ <sect3 id="logicaldecoding-output-plugin-prepare">
+ <title>Transaction Prepare Callback</title>
+
+ <para>
+ The optional <function>prepare_cb</function> callback is called whenever
+ a transaction which is prepared for two-phase commit has been
+ decoded. The <function>change_cb</function> callbacks for all modified
+ rows will have been called before this, if there have been any modified
+ rows.
+<programlisting>
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+ <title>Commit Prepared Transaction Callback</title>
+
+ <para>
+ The optional <function>commit_prepared_cb</function> callback is called whenever
+ a commit prepared transaction has been decoded. The <parameter>gid</parameter> field,
+ which is part of the <parameter>txn</parameter> parameter can be used in this
+ callback.
+<programlisting>
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-abort-prepared">
+ <title>Rollback Prepared Transaction Callback</title>
+
+ <para>
+ The optional <function>abort_prepared_cb</function> callback is called whenever
+ a rollback prepared transaction has been decoded. The <parameter>gid</parameter> field,
+ which is part of the <parameter>txn</parameter> parameter can be used in this
+ callback.
+<programlisting>
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-abort">
+ <title>Transaction Abort Callback</title>
+
+ <para>
+ The required <function>abort_cb</function> callback is called whenever
+ a transaction abort has to be initiated. This can happen if we are
+ decoding a transaction that has been prepared for two-phase commit and
+ a concurrent rollback happens while we are decoding it.
+<programlisting>
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
<sect3 id="logicaldecoding-output-plugin-change">
<title>Change Callback</title>
@@ -567,7 +643,13 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
an <command>INSERT</command>, <command>UPDATE</command>,
or <command>DELETE</command>. Even if the original command modified
several rows at once the callback will be called individually for each
- row.
+ row. The <function>change_cb</function> callback may access system or
+ user catalog tables to aid in the process of outputting the row
+ modification details. In case of decoding a prepared (but yet
+ uncommitted) transaction or decoding of an uncommitted transaction, this
+ change callback might also error out due to simultaneous rollback of
+ this very same transaction. In that case, the logical decoding of this
+ aborted transaction is stopped gracefully.
<programlisting>
typedef void (*LogicalDecodeChangeCB) (struct LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
@@ -644,6 +726,39 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
</para>
</sect3>
+ <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+ <title>Prepare Filter Callback</title>
+
+ <para>
+ The optional <function>filter_prepare_cb</function> callback
+ is called to determine whether data that is part of the current
+ two-phase commit transaction should be considered for decode
+ at this prepare stage or as a regular one-phase transaction at
+ <command>COMMIT PREPARED</command> time later. To signal that
+ decoding should be skipped, return <literal>true</literal>;
+ <literal>false</literal> otherwise. When the callback is not
+ defined, <literal>false</literal> is assumed (i.e. nothing is
+ filtered).
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+</programlisting>
+ The <parameter>ctx</parameter> parameter has the same contents
+ as for the other callbacks. The <parameter>txn</parameter> parameter
+ contains meta information about the transaction. The <parameter>xid</parameter>
+ contains the XID because <parameter>txn</parameter> can be NULL in some cases.
+ The <parameter>gid</parameter> is the identifier that later identifies this
+ transaction for <command>COMMIT PREPARED</command> or <command>ROLLBACK PREPARED</command>.
+ </para>
+ <para>
+ The callback has to provide the same static answer for a given combination of
+ <parameter>xid</parameter> and <parameter>gid</parameter> every time it is
+ called.
+ </para>
+ </sect3>
+
<sect3 id="logicaldecoding-output-plugin-message">
<title>Generic Message Callback</title>
@@ -665,7 +780,13 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
non-transactional and the XID was not assigned yet in the transaction
which logged the message. The <parameter>lsn</parameter> has WAL
location of the message. The <parameter>transactional</parameter> says
- if the message was sent as transactional or not.
+ if the message was sent as transactional or not. Similar to the change
+ callback, in case of decoding a prepared (but yet uncommitted)
+ transaction or decoding of an uncommitted transaction, this message
+ callback might also error out due to simultaneous rollback of
+ this very same transaction. In that case, the logical decoding of this
+ aborted transaction is stopped gracefully.
+
The <parameter>prefix</parameter> is arbitrary null-terminated prefix
which can be used for identifying interesting messages for the current
plugin. And finally the <parameter>message</parameter> parameter holds
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index e3b05657f8..c60ee90187 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -34,6 +34,7 @@
#include "access/xlogutils.h"
#include "access/xlogreader.h"
#include "access/xlogrecord.h"
+#include "access/twophase.h"
#include "catalog/pg_control.h"
@@ -73,6 +74,8 @@ static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_commit *parsed, TransactionId xid);
static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare * parsed);
/* common function to decode tuples */
static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -232,17 +235,10 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
{
xl_xact_commit *xlrec;
xl_xact_parsed_commit parsed;
- TransactionId xid;
xlrec = (xl_xact_commit *) XLogRecGetData(r);
ParseCommitRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
-
- if (!TransactionIdIsValid(parsed.twophase_xid))
- xid = XLogRecGetXid(r);
- else
- xid = parsed.twophase_xid;
-
- DecodeCommit(ctx, buf, &parsed, xid);
+ DecodeCommit(ctx, buf, &parsed, XLogRecGetXid(r));
break;
}
case XLOG_XACT_ABORT:
@@ -250,17 +246,10 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
{
xl_xact_abort *xlrec;
xl_xact_parsed_abort parsed;
- TransactionId xid;
xlrec = (xl_xact_abort *) XLogRecGetData(r);
ParseAbortRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
-
- if (!TransactionIdIsValid(parsed.twophase_xid))
- xid = XLogRecGetXid(r);
- else
- xid = parsed.twophase_xid;
-
- DecodeAbort(ctx, buf, &parsed, xid);
+ DecodeAbort(ctx, buf, &parsed, XLogRecGetXid(r));
break;
}
case XLOG_XACT_ASSIGNMENT:
@@ -281,16 +270,33 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
break;
}
case XLOG_XACT_PREPARE:
+ {
+ xl_xact_parsed_prepare parsed;
- /*
- * Currently decoding ignores PREPARE TRANSACTION and will just
- * decode the transaction when the COMMIT PREPARED is sent or
- * throw away the transaction's contents when a ROLLBACK PREPARED
- * is received. In the future we could add code to expose prepared
- * transactions in the changestream allowing for a kind of
- * distributed 2PC.
- */
- ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+ /* check that output plugin is capable of twophase decoding */
+ if (!ctx->options.enable_twophase)
+ {
+ ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+ break;
+ }
+
+ /* ok, parse it */
+ ParsePrepareRecord(XLogRecGetInfo(buf->record),
+ XLogRecGetData(buf->record), &parsed);
+
+ /* does output plugin want this particular transaction? */
+ if (ctx->callbacks.filter_prepare_cb &&
+ ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+ parsed.twophase_gid))
+ {
+ ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+ buf->origptr);
+ break;
+ }
+
+ DecodePrepare(ctx, buf, &parsed);
+ break;
+ }
break;
default:
elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
@@ -556,20 +562,13 @@ DecodeLogicalMsgOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
* Consolidated commit record handling between the different form of commit
* records.
*/
-static void
-DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
- xl_xact_parsed_commit *parsed, TransactionId xid)
+static bool
+DecodeEndOfTxn(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_commit *parsed, TransactionId xid)
{
- XLogRecPtr origin_lsn = InvalidXLogRecPtr;
- TimestampTz commit_time = parsed->xact_time;
RepOriginId origin_id = XLogRecGetOrigin(buf->record);
- int i;
+ bool skip = false;
- if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
- {
- origin_lsn = parsed->origin_lsn;
- commit_time = parsed->origin_timestamp;
- }
/*
* Process invalidation messages, even if we're not interested in the
@@ -586,20 +585,24 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
parsed->nsubxacts, parsed->subxacts);
+ skip = SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+ (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+ ctx->fast_forward || FilterByOrigin(ctx, origin_id);
- /* ----
- * Check whether we are interested in this specific transaction, and tell
- * the reorderbuffer to forget the content of the (sub-)transactions
- * if not.
- *
- * There can be several reasons we might not be interested in this
- * transaction:
- * 1) We might not be interested in decoding transactions up to this
- * LSN. This can happen because we previously decoded it and now just
- * are restarting or if we haven't assembled a consistent snapshot yet.
- * 2) The transaction happened in another database.
- * 3) The output plugin is not interested in the origin.
- * 4) We are doing fast-forwarding
+ return skip;
+}
+
+static void
+FinalizeTxnDecoding(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_commit *parsed, TransactionId xid,
+ bool will_skip)
+{
+ int i;
+
+
+ /*
+ * Tell the reorderbuffer to forget the content of the (sub-)transactions,
+ * if the transaction doesn't need decoding.
*
* We can't just use ReorderBufferAbort() here, because we need to execute
* the transaction's invalidations. This currently won't be needed if
@@ -611,31 +614,128 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
* another database, the invalidations might be important, because they
* could be for shared catalogs and we might have loaded data into the
* relevant syscaches.
- * ---
*/
- if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
- (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
- ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+ if (will_skip)
{
for (i = 0; i < parsed->nsubxacts; i++)
- {
ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr);
- }
+
ReorderBufferForget(ctx->reorder, xid, buf->origptr);
+ }
+ else
+ {
+ /*
+ * If not skipped, tell the reorderbuffer about the surviving
+ * subtransactions, if the top-level transaction isn't going to be
+ * skipped all together.
+ */
+ for (i = 0; i < parsed->nsubxacts; i++)
+ ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+ buf->origptr, buf->endptr);
+ }
+}
- return;
+static void
+DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_commit *parsed, TransactionId xid)
+{
+ XLogRecPtr origin_lsn = InvalidXLogRecPtr;
+ TimestampTz commit_time = parsed->xact_time;
+ RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+ bool is_prepared = false;
+ bool filter_prepared = false;
+ bool skip;
+
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ origin_lsn = parsed->origin_lsn;
+ commit_time = parsed->origin_timestamp;
}
- /* tell the reorderbuffer about the surviving subtransactions */
- for (i = 0; i < parsed->nsubxacts; i++)
+ if (TransactionIdIsValid(parsed->twophase_xid))
+ {
+ is_prepared = true;
+ filter_prepared = ReorderBufferPrepareNeedSkip(ctx->reorder,
+ parsed->twophase_xid,
+ parsed->twophase_gid);
+
+ /*
+ * If there is a valid top-level transaction that's different from the
+ * two-phase one we are committing, clear its reorder buffer as well.
+ */
+ if (TransactionIdIsNormal(xid) && xid != parsed->twophase_xid)
+ ReorderBufferAbort(ctx->reorder, xid, origin_lsn);
+
+ /* act on the prepared transaction, instead */
+ xid = parsed->twophase_xid;
+ }
+
+ /* Whether or not this COMMIT needs to be skipped. */
+ skip = DecodeEndOfTxn(ctx, buf, parsed, xid);
+
+ /*
+ * Finalize the decoding of the transaction here. This is for regular
+ * commits as well as for two-phase transactions the output plugin was not
+ * interested in, which therefore are relayed as normal single-phase
+ * commits.
+ */
+ if (!is_prepared || filter_prepared)
+ FinalizeTxnDecoding(ctx, buf, parsed, xid, skip);
+
+ if (skip)
+ return;
+
+ /*
+ * A regular commit simply triggers a replay of transaction changes from
+ * the reorder buffer. For COMMIT PREPARED that however already happened
+ * at PREPARE time, and so we only need to notify the subscriber that the
+ * GID finally committed.
+ *
+ * For output plugins that do not support PREPARE-time decoding of
+ * two-phase transactions, we never even see the PREPARE and all two-phase
+ * transactions simply fall through to the second branch.
+ */
+ if (is_prepared && !filter_prepared)
{
- ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
- buf->origptr, buf->endptr);
+ /* we are processing COMMIT PREPARED */
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid, true);
}
+ else
+ {
+ /* replay actions of all transaction + subtransactions in order */
+ ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn);
+ }
+}
+
+/*
+ * Decode a PREPARE record. The logic is similar to that of DecodeCommit().
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare * parsed)
+{
+ XLogRecPtr origin_lsn = parsed->origin_lsn;
+ TimestampTz commit_time = parsed->origin_timestamp;
+ RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+ TransactionId xid = parsed->twophase_xid;
+ bool skip;
+
+ Assert(parsed->dbId != InvalidOid);
+ Assert(TransactionIdIsValid(parsed->twophase_xid));
+
+ /* Whether or not this PREPARE needs to be skipped. */
+ skip = DecodeEndOfTxn(ctx, buf, parsed, xid);
+
+ FinalizeTxnDecoding(ctx, buf, parsed, xid, skip);
/* replay actions of all transaction + subtransactions in order */
- ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
- commit_time, origin_id, origin_lsn);
+ if (!skip)
+ ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid);
}
/*
@@ -647,6 +747,48 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid)
{
int i;
+ XLogRecPtr origin_lsn = InvalidXLogRecPtr;
+ TimestampTz commit_time = 0;
+ RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+ bool is_prepared = TransactionIdIsValid(parsed->twophase_xid);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ origin_lsn = parsed->origin_lsn;
+ commit_time = parsed->origin_timestamp;
+ }
+
+ if (TransactionIdIsValid(parsed->twophase_xid))
+ {
+ is_prepared = true;
+ Assert(parsed->dbId != InvalidOid);
+
+ /*
+ * If there is a valid top-level transaction that's different from the
+ * two-phase one we are aborting, clear its reorder buffer as well.
+ */
+ if (TransactionIdIsNormal(xid) && xid != parsed->twophase_xid)
+ ReorderBufferAbort(ctx->reorder, xid, origin_lsn);
+
+ /* act on the prepared transaction, instead */
+ xid = parsed->twophase_xid;
+ }
+
+ /*
+ * If it's ROLLBACK PREPARED then handle it via callbacks.
+ */
+ if (is_prepared &&
+ !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+ parsed->dbId == ctx->slot->data.database &&
+ !ctx->fast_forward &&
+ !FilterByOrigin(ctx, origin_id) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
+ {
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid, false);
+ return;
+ }
for (i = 0; i < parsed->nsubxacts; i++)
{
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 9f99e4f049..be08ccf1cc 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -60,6 +60,16 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+static void abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change);
static void truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -192,6 +202,11 @@ StartupDecodingContext(List *output_plugin_options,
ctx->reorder->apply_change = change_cb_wrapper;
ctx->reorder->apply_truncate = truncate_cb_wrapper;
ctx->reorder->commit = commit_cb_wrapper;
+ ctx->reorder->abort = abort_cb_wrapper;
+ ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+ ctx->reorder->prepare = prepare_cb_wrapper;
+ ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+ ctx->reorder->abort_prepared = abort_prepared_cb_wrapper;
ctx->reorder->message = message_cb_wrapper;
ctx->out = makeStringInfo();
@@ -616,6 +631,33 @@ startup_cb_wrapper(LogicalDecodingContext *ctx, OutputPluginOptions *opt, bool i
/* do the actual work: call callback */
ctx->callbacks.startup_cb(ctx, opt, is_init);
+ /*
+ * If the plugin claims to support two-phase transactions, then
+ * check that the plugin implements all callbacks necessary to decode
+ * two-phase transactions - we either have to have all of them or none.
+ * The filter_prepare callback is optional, but can only be defined when
+ * two-phase decoding is enabled (i.e. the three other callbacks are
+ * defined).
+ */
+ if (opt->enable_twophase)
+ {
+ int twophase_callbacks = (ctx->callbacks.prepare_cb != NULL) +
+ (ctx->callbacks.commit_prepared_cb != NULL) +
+ (ctx->callbacks.abort_prepared_cb != NULL);
+
+ /* Plugins with incorrect number of two-phase callbacks are broken. */
+ if ((twophase_callbacks != 3) && (twophase_callbacks != 0))
+ ereport(ERROR,
+ (errmsg("output plugin registered only %d two-phase callbacks",
+ twophase_callbacks)));
+ }
+
+ /* filter_prepare is optional, but requires two-phase decoding */
+ if ((ctx->callbacks.filter_prepare_cb != NULL) && (!opt->enable_twophase))
+ ereport(ERROR,
+ (errmsg("Output plugin does not support two-phase decoding, but "
+ "registered filter_prepare callback.")));
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
}
@@ -713,6 +755,122 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
error_context_stack = errcallback.previous;
}
+static void
+abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "abort";
+ state.report_location = txn->final_lsn; /* beginning of abort record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.abort_cb(ctx, txn, abort_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "prepare";
+ state.report_location = txn->final_lsn; /* beginning of prepare record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "commit_prepared";
+ state.report_location = txn->final_lsn; /* beginning of commit record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "abort_prepared";
+ state.report_location = txn->final_lsn; /* beginning of abort record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.abort_prepared_cb(ctx, txn, abort_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
static void
change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change)
@@ -790,6 +948,51 @@ truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
error_context_stack = errcallback.previous;
}
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+ bool ret;
+
+ /*
+ * Skip if decoding of twophase at PREPARE time is not enabled. In that
+ * case all twophase transactions are considered filtered out and will be
+ * applied as regular transactions at COMMIT PREPARED.
+ */
+ if (!ctx->options.enable_twophase)
+ return true;
+
+ /*
+ * The filter_prepare callback is optional. When not supplied, all
+ * prepared transactions should go through.
+ */
+ if (!ctx->callbacks.filter_prepare_cb)
+ return false;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "filter_prepare";
+ state.report_location = InvalidXLogRecPtr;
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = false;
+
+ /* do the actual work: call callback */
+ ret = ctx->callbacks.filter_prepare_cb(ctx, txn, xid, gid);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+
+ return ret;
+}
+
bool
filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
{
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 6df6fc0a73..ffcc5c0f6f 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -337,6 +337,11 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
}
/* free data that's contained */
+ if (txn->gid != NULL)
+ {
+ pfree(txn->gid);
+ txn->gid = NULL;
+ }
if (txn->tuplecid_hash != NULL)
{
@@ -1426,25 +1431,18 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
* and subtransactions (using a k-way merge) and replay the changes in lsn
* order.
*/
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
- XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
- TimestampTz commit_time,
- RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+ ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn)
{
- ReorderBufferTXN *txn;
volatile Snapshot snapshot_now;
volatile CommandId command_id = FirstCommandId;
bool using_subtxn;
ReorderBufferIterTXNState *volatile iterstate = NULL;
- txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
- false);
-
- /* unknown transaction, nothing to replay */
- if (txn == NULL)
- return;
-
txn->final_lsn = commit_lsn;
txn->end_lsn = end_lsn;
txn->commit_time = commit_time;
@@ -1758,7 +1756,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
break;
}
}
-
/*
* There's a speculative insertion remaining, just clean in up, it
* can't have been successful, otherwise we'd gotten a confirmation
@@ -1774,8 +1771,22 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
ReorderBufferIterTXNFinish(rb, iterstate);
iterstate = NULL;
- /* call commit callback */
- rb->commit(rb, txn, commit_lsn);
+ /*
+ * Call abort/commit/prepare callback, depending on the transaction
+ * state.
+ *
+ * If the transaction aborted during apply (which currently can happen
+ * only for prepared transactions), simply call the abort callback.
+ *
+ * Otherwise call either PREPARE (for twophase transactions) or COMMIT
+ * (for regular ones).
+ */
+ if (rbtxn_rollback(txn))
+ rb->abort(rb, txn, commit_lsn);
+ else if (rbtxn_prepared(txn))
+ rb->prepare(rb, txn, commit_lsn);
+ else
+ rb->commit(rb, txn, commit_lsn);
/* this is just a sanity check against bad output plugin behaviour */
if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -1802,7 +1813,12 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
if (snapshot_now->copied)
ReorderBufferFreeSnap(rb, snapshot_now);
- /* remove potential on-disk data, and deallocate */
+ /*
+ * remove potential on-disk data, and deallocate.
+ *
+ * We remove it even for prepared transactions (GID is enough to
+ * commit/abort those later).
+ */
ReorderBufferCleanupTXN(rb, txn);
}
PG_CATCH();
@@ -1836,6 +1852,141 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
PG_END_TRY();
}
+
+/*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+ return rb->filter_prepare(rb, txn, xid, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferCommitInternal()
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /* unknown transaction, nothing to replay */
+ if (txn == NULL)
+ return;
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a twophase transaction. It calls ReorderBufferCommitInternal()
+ * since all prepared transactions need to be decoded at PREPARE time.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /* unknown transaction, nothing to replay */
+ if (txn == NULL)
+ return;
+
+ txn->txn_flags |= RBTXN_PREPARE;
+ txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+ strcpy(txn->gid, gid);
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Check whether this transaction was sent as prepared to subscribers.
+ * Called while handling commit|abort prepared.
+ */
+bool
+ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+ const char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /*
+ * Always call the prepare filter. It's the job of the prepare filter to
+ * give us the *same* response for a given xid across multiple calls
+ * (including ones on restart)
+ */
+ return !(rb->filter_prepare(rb, txn, xid, gid));
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ABORT PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit)
+{
+ ReorderBufferTXN *txn;
+
+ /*
+ * The transaction may or may not exist (during restarts for example).
+ * Either way, a prepared transaction has no buffered changes left at
+ * this point, so allow it to be created below.
+ */
+ txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+ true);
+
+ txn->final_lsn = commit_lsn;
+ txn->end_lsn = end_lsn;
+ txn->commit_time = commit_time;
+ txn->origin_id = origin_id;
+ txn->origin_lsn = origin_lsn;
+ /* this txn is obviously prepared */
+ txn->txn_flags |= RBTXN_PREPARE;
+ txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+ strcpy(txn->gid, gid);
+
+ if (is_commit)
+ {
+ txn->txn_flags |= RBTXN_COMMIT_PREPARED;
+ rb->commit_prepared(rb, txn, commit_lsn);
+ }
+ else
+ {
+ txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
+ rb->abort_prepared(rb, txn, commit_lsn);
+ }
+
+ /* cleanup: make sure there's no cache pollution */
+ ReorderBufferExecuteInvalidations(rb, txn);
+ ReorderBufferCleanupTXN(rb, txn);
+}
+
/*
* Abort a transaction that possibly has previous changes. Needs to be first
* called for subtransactions and then for the toplevel xid.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index c25ac1fa85..5fdda65031 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -47,7 +47,7 @@ typedef struct LogicalDecodingContext
/*
* Marks the logical decoding context as fast forward decoding one. Such a
- * context does not have plugin loaded so most of the the following
+ * context does not have plugin loaded so most of the following
* properties are unused.
*/
bool fast_forward;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 1ee0a56f03..c9140e7001 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -27,6 +27,7 @@ typedef struct OutputPluginOptions
{
OutputPluginOutputType output_type;
bool receive_rewrites;
+ bool enable_twophase;
} OutputPluginOptions;
/*
@@ -77,6 +78,46 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+/*
+ * Called for an implicit ABORT of a transaction.
+ */
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+ /*
+ * Called before decoding a PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to the prepare and
+ * commit_prepared/abort_prepared callbacks, or should wait until COMMIT
+ * PREPARED and be sent as a regular transaction (return true to filter).
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
/*
* Called for the generic logical decoding messages.
*/
@@ -109,7 +150,12 @@ typedef struct OutputPluginCallbacks
LogicalDecodeChangeCB change_cb;
LogicalDecodeTruncateCB truncate_cb;
LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeAbortCB abort_cb;
LogicalDecodeMessageCB message_cb;
+ LogicalDecodeFilterPrepareCB filter_prepare_cb;
+ LogicalDecodePrepareCB prepare_cb;
+ LogicalDecodeCommitPreparedCB commit_prepared_cb;
+ LogicalDecodeAbortPreparedCB abort_prepared_cb;
LogicalDecodeFilterByOriginCB filter_by_origin_cb;
LogicalDecodeShutdownCB shutdown_cb;
} OutputPluginCallbacks;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 5b5f4db6d7..ae3cea99b0 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
#define REORDERBUFFER_H
#include "access/htup_details.h"
+#include "access/twophase.h"
#include "lib/ilist.h"
#include "storage/sinval.h"
#include "utils/hsearch.h"
@@ -154,6 +155,11 @@ typedef struct ReorderBufferChange
#define RBTXN_HAS_CATALOG_CHANGES 0x0001
#define RBTXN_IS_SUBXACT 0x0002
#define RBTXN_IS_SERIALIZED 0x0004
+#define RBTXN_PREPARE 0x0008
+#define RBTXN_COMMIT_PREPARED 0x0010
+#define RBTXN_ROLLBACK_PREPARED 0x0020
+#define RBTXN_COMMIT 0x0040
+#define RBTXN_ROLLBACK 0x0080
/* does the txn have catalog changes */
#define rbtxn_has_catalog_changes(txn) (txn->txn_flags & RBTXN_HAS_CATALOG_CHANGES)
@@ -167,6 +173,16 @@ typedef struct ReorderBufferChange
* nentries_mem == nentries.
*/
#define rbtxn_is_serialized(txn) (txn->txn_flags & RBTXN_IS_SERIALIZED)
+/* is this txn prepared? */
+#define rbtxn_prepared(txn) (txn->txn_flags & RBTXN_PREPARE)
+/* was this prepared txn committed in the meanwhile? */
+#define rbtxn_commit_prepared(txn) (txn->txn_flags & RBTXN_COMMIT_PREPARED)
+/* was this prepared txn aborted in the meanwhile? */
+#define rbtxn_rollback_prepared(txn) (txn->txn_flags & RBTXN_ROLLBACK_PREPARED)
+/* was this txn committed in the meanwhile? */
+#define rbtxn_commit(txn) (txn->txn_flags & RBTXN_COMMIT)
+/* was this txn aborted in the meanwhile? */
+#define rbtxn_rollback(txn) (txn->txn_flags & RBTXN_ROLLBACK)
typedef struct ReorderBufferTXN
{
@@ -179,6 +195,8 @@ typedef struct ReorderBufferTXN
/* Do we know this is a subxact? Xid of top-level txn if so */
TransactionId toplevel_xid;
+ /* In case of 2PC we need to pass GID to output plugin */
+ char *gid;
/*
* LSN of the first data carrying, WAL record with knowledge about this
@@ -324,6 +342,37 @@ typedef void (*ReorderBufferCommitCB) (
ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+/* abort callback signature */
+typedef void (*ReorderBufferAbortCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+typedef bool (*ReorderBufferFilterPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/* abort prepared callback signature */
+typedef void (*ReorderBufferAbortPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+
/* message callback signature */
typedef void (*ReorderBufferMessageCB) (
ReorderBuffer *rb,
@@ -369,6 +418,11 @@ struct ReorderBuffer
ReorderBufferApplyChangeCB apply_change;
ReorderBufferApplyTruncateCB apply_truncate;
ReorderBufferCommitCB commit;
+ ReorderBufferAbortCB abort;
+ ReorderBufferFilterPrepareCB filter_prepare;
+ ReorderBufferPrepareCB prepare;
+ ReorderBufferCommitPreparedCB commit_prepared;
+ ReorderBufferAbortPreparedCB abort_prepared;
ReorderBufferMessageCB message;
/*
@@ -419,6 +473,11 @@ void ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot
void ReorderBufferCommit(ReorderBuffer *, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit);
void ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
void ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
@@ -442,6 +501,15 @@ void ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
bool ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
bool ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
+bool ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+ const char *gid);
+bool ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+ const char *gid);
+void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid);
ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
--
2.15.2 (Apple Git-101.1)
0003-Gracefully-handle-concurrent-aborts-of-uncommitted-t.patch (application/octet-stream)
From 732247f7cc9c597740493088637ee89aff000569 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Thu, 26 Jul 2018 18:45:26 +0530
Subject: [PATCH 3/4] Gracefully handle concurrent aborts of uncommitted
transactions that are being decoded alongside.
When a transaction aborts, its changes are considered unnecessary for
other transactions. That means the changes may be either cleaned up by
vacuum or removed from HOT chains (thus made inaccessible through
indexes), and there may be other such consequences.
When decoding committed transactions this is not an issue, and we
never decode transactions that abort before the decoding starts.
But for in-progress transactions - for example when decoding prepared
transactions on PREPARE (and not COMMIT PREPARED as before), this
may cause failures when the output plugin consults catalogs (both
system and user-defined).
We handle such failures by raising an error with sqlerrcode
ERRCODE_TRANSACTION_ROLLBACK from the system table scan APIs in the
backend that is decoding a specific uncommitted transaction. On receipt
of that sqlerrcode, the decoding logic aborts the ongoing decoding and
returns gracefully.
---
doc/src/sgml/logicaldecoding.sgml | 5 ++-
src/backend/access/heap/heapam.c | 51 +++++++++++++++++++++++++
src/backend/access/index/genam.c | 35 +++++++++++++++++
src/backend/replication/logical/logical.c | 3 ++
src/backend/replication/logical/reorderbuffer.c | 32 +++++++++++++---
src/backend/utils/time/snapmgr.c | 25 +++++++++++-
src/include/utils/snapmgr.h | 4 +-
7 files changed, 146 insertions(+), 9 deletions(-)
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index a89e4d5184..d76afbda05 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -421,7 +421,10 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
ALTER TABLE user_catalog_table SET (user_catalog_table = true);
CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
</programlisting>
- Any actions leading to transaction ID assignment are prohibited. That, among others,
+ Note that access to user catalog tables or regular system catalog tables
+ in the output plugins has to be done via the <literal>systable_*</literal> scan APIs only.
+ Access via the <literal>heap_*</literal> scan APIs will error out.
+ Additionally, any actions leading to transaction ID assignment are prohibited. That, among others,
includes writing to tables, performing DDL changes, and
calling <literal>txid_current()</literal>.
</para>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 9650145642..d909ccd65a 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1846,6 +1846,17 @@ heap_update_snapshot(HeapScanDesc scan, Snapshot snapshot)
HeapTuple
heap_getnext(HeapScanDesc scan, ScanDirection direction)
{
+ /*
+ * We don't expect direct calls to heap_getnext with valid
+ * CheckXidAlive for regular tables. Track that below.
+ */
+ if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+ !(IsCatalogRelation(scan->rs_rd) ||
+ RelationIsUsedAsCatalogTable(scan->rs_rd))))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+ errmsg("improper heap_getnext call")));
+
/* Note: no locking manipulations needed */
HEAPDEBUG_1; /* heap_getnext( info ) */
@@ -1926,6 +1937,16 @@ heap_fetch(Relation relation,
OffsetNumber offnum;
bool valid;
+ /*
+ * We don't expect direct calls to heap_fetch with valid
+ * CheckXidAlive for regular tables. Track that below.
+ */
+ if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+ !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+ errmsg("improper heap_fetch call")));
+
/*
* Fetch and pin the appropriate page of the relation.
*/
@@ -2058,6 +2079,16 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
bool valid;
bool skip;
+ /*
+ * We don't expect direct calls to heap_hot_search_buffer with
+ * valid CheckXidAlive for regular tables. Track that below.
+ */
+ if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+ !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+ errmsg("improper heap_hot_search_buffer call")));
+
/* If this is not the first call, previous call returned a (live!) tuple */
if (all_dead)
*all_dead = first_call;
@@ -2199,6 +2230,16 @@ heap_hot_search(ItemPointer tid, Relation relation, Snapshot snapshot,
Buffer buffer;
HeapTupleData heapTuple;
+ /*
+ * We don't expect direct calls to heap_hot_search with
+ * valid CheckXidAlive for regular tables. Raise an error if one occurs.
+ */
+ if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+ !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+ errmsg("improper heap_hot_search call")));
+
buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
LockBuffer(buffer, BUFFER_LOCK_SHARE);
result = heap_hot_search_buffer(tid, relation, buffer, snapshot,
@@ -2228,6 +2269,16 @@ heap_get_latest_tid(Relation relation,
ItemPointerData ctid;
TransactionId priorXmax;
+ /*
+ * We don't expect direct calls to heap_get_latest_tid with valid
+ * CheckXidAlive for regular tables. Raise an error if one occurs.
+ */
+ if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+ !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+ errmsg("improper heap_get_latest_tid call")));
+
/* this is to avoid Assert failures on bad input */
if (!ItemPointerIsValid(tid))
return;
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 9d08775687..9220dcce83 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -25,6 +25,7 @@
#include "lib/stringinfo.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
+#include "storage/procarray.h"
#include "utils/acl.h"
#include "utils/builtins.h"
#include "utils/lsyscache.h"
@@ -423,6 +424,17 @@ systable_getnext(SysScanDesc sysscan)
else
htup = heap_getnext(sysscan->scan, ForwardScanDirection);
+ /*
+ * If CheckXidAlive is valid, check whether that transaction has aborted
+ * (it is neither in progress nor committed). If it has, error out.
+ */
+ if (TransactionIdIsValid(CheckXidAlive) &&
+ !TransactionIdIsInProgress(CheckXidAlive) &&
+ !TransactionIdDidCommit(CheckXidAlive))
+ ereport(ERROR,
+ (errcode(ERRCODE_TRANSACTION_ROLLBACK),
+ errmsg("transaction aborted during system catalog scan")));
+
return htup;
}
@@ -476,6 +488,18 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
result = HeapTupleSatisfiesVisibility(tup, freshsnap, scan->rs_cbuf);
LockBuffer(scan->rs_cbuf, BUFFER_LOCK_UNLOCK);
}
+
+ /*
+ * If CheckXidAlive is valid, check whether that transaction has aborted
+ * (it is neither in progress nor committed). If it has, error out.
+ */
+ if (TransactionIdIsValid(CheckXidAlive) &&
+ !TransactionIdIsInProgress(CheckXidAlive) &&
+ !TransactionIdDidCommit(CheckXidAlive))
+ ereport(ERROR,
+ (errcode(ERRCODE_TRANSACTION_ROLLBACK),
+ errmsg("transaction aborted during system catalog scan")));
+
return result;
}
@@ -593,6 +617,17 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
if (htup && sysscan->iscan->xs_recheck)
elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
+ /*
+ * If CheckXidAlive is valid, check whether that transaction has aborted
+ * (it is neither in progress nor committed). If it has, error out.
+ */
+ if (TransactionIdIsValid(CheckXidAlive) &&
+ !TransactionIdIsInProgress(CheckXidAlive) &&
+ !TransactionIdDidCommit(CheckXidAlive))
+ ereport(ERROR,
+ (errcode(ERRCODE_TRANSACTION_ROLLBACK),
+ errmsg("transaction aborted during system catalog scan")));
+
return htup;
}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index be08ccf1cc..3266bee107 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -763,6 +763,9 @@ abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
LogicalErrorCallbackState state;
ErrorContextCallback errcallback;
+ if (!ctx->callbacks.abort_cb)
+ return;
+
/* Push callback + info on the error context stack */
state.ctx = ctx;
state.callback_name = "abort";
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index ffcc5c0f6f..b2f50b604e 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -636,7 +636,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
/* setup snapshot to allow catalog access */
- SetupHistoricSnapshot(snapshot_now, NULL);
+ SetupHistoricSnapshot(snapshot_now, NULL, xid);
PG_TRY();
{
rb->message(rb, txn, lsn, false, prefix, message_size, message);
@@ -1442,6 +1442,7 @@ ReorderBufferCommitInternal(ReorderBufferTXN *txn,
volatile CommandId command_id = FirstCommandId;
bool using_subtxn;
ReorderBufferIterTXNState *volatile iterstate = NULL;
+ MemoryContext ccxt = CurrentMemoryContext;
txn->final_lsn = commit_lsn;
txn->end_lsn = end_lsn;
@@ -1468,7 +1469,7 @@ ReorderBufferCommitInternal(ReorderBufferTXN *txn,
ReorderBufferBuildTupleCidHash(rb, txn);
/* setup the initial snapshot */
- SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+ SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
/*
* Decoding needs access to syscaches et al., which in turn use
@@ -1719,7 +1720,7 @@ ReorderBufferCommitInternal(ReorderBufferTXN *txn,
/* and continue with the new one */
- SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+ SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
break;
case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
@@ -1739,7 +1740,7 @@ ReorderBufferCommitInternal(ReorderBufferTXN *txn,
snapshot_now->curcid = command_id;
TeardownHistoricSnapshot(false);
- SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+ SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
/*
* Every time the CommandId is incremented, we could
@@ -1824,6 +1825,20 @@ ReorderBufferCommitInternal(ReorderBufferTXN *txn,
PG_CATCH();
{
/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
+ MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+ ErrorData *errdata = CopyErrorData();
+
+ /*
+ * If the catalog scan raised a transaction-rollback error, the xact
+ * aborted concurrently, so invoke the abort callback downstream too.
+ */
+ if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
+ {
+ elog(LOG, "stopping decoding of xid %u (gid \"%s\")",
+ txn->xid, txn->gid ? txn->gid : "");
+ rb->abort(rb, txn, commit_lsn);
+ }
+
if (iterstate)
ReorderBufferIterTXNFinish(rb, iterstate);
@@ -1847,7 +1862,14 @@ ReorderBufferCommitInternal(ReorderBufferTXN *txn,
/* remove potential on-disk data, and deallocate */
ReorderBufferCleanupTXN(rb, txn);
- PG_RE_THROW();
+ /* re-throw only if it's not an abort */
+ if (errdata->sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK)
+ {
+ MemoryContextSwitchTo(ecxt);
+ PG_RE_THROW();
+ }
+ else
+ FlushErrorState();
}
PG_END_TRY();
}
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index edf59efc29..0354fc9da9 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -151,6 +151,13 @@ static Snapshot SecondarySnapshot = NULL;
static Snapshot CatalogSnapshot = NULL;
static Snapshot HistoricSnapshot = NULL;
+/*
+ * An xid value pointing to a possibly ongoing or a prepared transaction.
+ * Currently used in logical decoding. It's possible that such transactions
+ * can get aborted while the decoding is ongoing.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+
/*
* These are updated by GetSnapshotData. We initialize them this way
* for the convenience of TransactionIdIsInProgress: even in bootstrap
@@ -1995,10 +2002,14 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
* Setup a snapshot that replaces normal catalog snapshots that allows catalog
* access to behave just like it did at a certain point in the past.
*
+ * If a valid xid is passed in, we check if it is uncommitted and track it in
+ * CheckXidAlive. This is to re-check XID status while accessing catalog.
+ *
* Needed for logical decoding.
*/
void
-SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
+SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids,
+ TransactionId snapshot_xid)
{
Assert(historic_snapshot != NULL);
@@ -2007,8 +2018,17 @@ SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
/* setup (cmin, cmax) lookup hash */
tuplecid_data = tuplecids;
-}
+ /*
+ * Set up CheckXidAlive if the xid has not committed yet. We don't check
+ * here whether the xid aborted; that happens during catalog access.
+ */
+ if (TransactionIdIsValid(snapshot_xid) &&
+ !TransactionIdDidCommit(snapshot_xid))
+ CheckXidAlive = snapshot_xid;
+ else
+ CheckXidAlive = InvalidTransactionId;
+}
/*
* Make catalog snapshots behave normally again.
@@ -2018,6 +2038,7 @@ TeardownHistoricSnapshot(bool is_error)
{
HistoricSnapshot = NULL;
tuplecid_data = NULL;
+ CheckXidAlive = InvalidTransactionId;
}
bool
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index 83806f3040..bad2053477 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -100,8 +100,10 @@ extern char *ExportSnapshot(Snapshot snapshot);
/* Support for catalog timetravel for logical decoding */
struct HTAB;
+extern TransactionId CheckXidAlive;
extern struct HTAB *HistoricSnapshotGetTupleCids(void);
-extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids);
+extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids,
+ TransactionId snapshot_xid);
extern void TeardownHistoricSnapshot(bool is_error);
extern bool HistoricSnapshotActive(void);
--
2.15.2 (Apple Git-101.1)
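For readers skimming the genam.c hunks in the patch above: the check added to the systable_* functions treats the tracked transaction as aborted when it is neither in progress nor committed. Below is a minimal standalone sketch of that predicate. It deliberately does not use PostgreSQL headers; XactStatus, demo_status() and check_xid_aborted() are hypothetical names made up for illustration, standing in for the TransactionIdIsInProgress() and TransactionIdDidCommit() lookups the patch actually calls.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for the PostgreSQL definitions (assumption: 32-bit xids). */
typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

/* Hypothetical three-state model of a transaction's status. */
typedef enum
{
	XACT_IN_PROGRESS,
	XACT_COMMITTED,
	XACT_ABORTED
} XactStatus;

typedef XactStatus (*xact_status_fn) (TransactionId xid);

/*
 * Hypothetical status source for the demo: xid 100 is still in progress
 * (e.g. prepared), xid 101 has committed, everything else has aborted.
 */
XactStatus
demo_status(TransactionId xid)
{
	if (xid == 100)
		return XACT_IN_PROGRESS;
	if (xid == 101)
		return XACT_COMMITTED;
	return XACT_ABORTED;
}

/*
 * Mirrors the shape of the patch's check: only a valid tracked xid is
 * examined, and "aborted" is inferred as "not in progress and not
 * committed".
 */
bool
check_xid_aborted(TransactionId check_xid, xact_status_fn lookup)
{
	XactStatus	status;

	if (check_xid == InvalidTransactionId)
		return false;			/* nothing is being tracked */

	status = lookup(check_xid);
	return status != XACT_IN_PROGRESS && status != XACT_COMMITTED;
}
```

With this model, a prepared-but-unresolved xid keeps the scan going, and only a concurrent ROLLBACK PREPARED flips the predicate to true, which is the point at which the patch raises ERRCODE_TRANSACTION_ROLLBACK.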
Attachment: 0004-Teach-test_decoding-plugin-to-work-with-2PC.patch (application/octet-stream)
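The patch that follows parses the new "check-xid" option with strtoul() and then rejects a zero result. As a possible tightening, here is a standalone sketch of a stricter parse that also rejects trailing garbage and signs. It is only an illustration: parse_xid_option() is a hypothetical helper that is not part of the patch, and it assumes decimal input, whereas the patch passes base 0 to strtoul().

```c
#include <ctype.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper, not part of the patch: parse a decimal xid option
 * value strictly. Returns true and stores the xid on success; returns
 * false and leaves *xid_out untouched otherwise.
 */
bool
parse_xid_option(const char *value, uint32_t *xid_out)
{
	char	   *end;
	unsigned long parsed;

	/* Require a leading digit: rejects NULL, "", whitespace and signs. */
	if (value == NULL || !isdigit((unsigned char) value[0]))
		return false;

	errno = 0;
	parsed = strtoul(value, &end, 10);

	/* Reject overflow and trailing garbage such as "12abc". */
	if (errno != 0 || *end != '\0')
		return false;

	/* 0 is InvalidTransactionId; xids are 32 bits wide. */
	if (parsed == 0 || parsed > UINT32_MAX)
		return false;

	*xid_out = (uint32_t) parsed;
	return true;
}
```

A caller in the plugin's option loop could then report a single ereport(ERROR, ...) whenever the helper returns false, instead of mixing FATAL and ERROR paths; whether that is desirable is a judgment call for the patch author.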
From a158f06e8b4c3da1a4b11f7a4adc45bbeb8d5745 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Wed, 13 Jun 2018 16:31:15 +0530
Subject: [PATCH 4/4] Teach test_decoding plugin to work with 2PC
Implement all callbacks required for decoding 2PC in the test_decoding
plugin. Includes relevant test cases as well.
Additionally, add a new option "check-xid". If this option is set to
a valid xid, the pg_decode_change() callback waits for that transaction
to be aborted externally. This allows us to test a concurrent rollback
of a prepared transaction while it is being decoded.
---
contrib/test_decoding/Makefile | 5 +-
contrib/test_decoding/expected/prepared.out | 185 ++++++++++++++++++++++++----
contrib/test_decoding/sql/prepared.sql | 77 ++++++++++--
contrib/test_decoding/t/001_twophase.pl | 119 ++++++++++++++++++
contrib/test_decoding/test_decoding.c | 179 +++++++++++++++++++++++++++
5 files changed, 532 insertions(+), 33 deletions(-)
create mode 100644 contrib/test_decoding/t/001_twophase.pl
diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index afcab930f7..3f0b1c6ebd 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -26,7 +26,7 @@ installcheck:;
# installation, allow to do so, but only if requested explicitly.
installcheck-force: regresscheck-install-force isolationcheck-install-force
-check: regresscheck isolationcheck
+check: regresscheck isolationcheck 2pc-check
submake-regress:
$(MAKE) -C $(top_builddir)/src/test/regress all
@@ -67,3 +67,6 @@ isolationcheck-install-force: all | submake-isolation submake-test_decoding temp
isolationcheck isolationcheck-install-force
temp-install: EXTRA_INSTALL=contrib/test_decoding
+
+2pc-check: temp-install
+ $(prove_check)
diff --git a/contrib/test_decoding/expected/prepared.out b/contrib/test_decoding/expected/prepared.out
index 46e915d4ff..934c8f1509 100644
--- a/contrib/test_decoding/expected/prepared.out
+++ b/contrib/test_decoding/expected/prepared.out
@@ -6,19 +6,50 @@ SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_d
init
(1 row)
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
-- test simple successful use of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (1);
PREPARE TRANSACTION 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ PREPARE TRANSACTION 'test_prepared#1'
+(3 rows)
+
COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
INSERT INTO test_prepared1 VALUES (2);
-- test abort of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (3);
PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(6 rows)
+
ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
INSERT INTO test_prepared1 VALUES (4);
-- test prepared xact containing ddl
BEGIN;
@@ -26,45 +57,149 @@ INSERT INTO test_prepared1 VALUES (5);
ALTER TABLE test_prepared1 ADD COLUMN data text;
INSERT INTO test_prepared1 VALUES (6, 'frakbar');
PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
-INSERT INTO test_prepared2 VALUES (7);
-COMMIT PREPARED 'test_prepared#3';
--- make sure stuff still works
-INSERT INTO test_prepared1 VALUES (8);
-INSERT INTO test_prepared2 VALUES (9);
--- cleanup
-DROP TABLE test_prepared1;
-DROP TABLE test_prepared2;
--- show results
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'test_prepared1'::regclass;
+ relation | locktype | mode
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
data
-------------------------------------------------------------------------
- BEGIN
- table public.test_prepared1: INSERT: id[integer]:1
- COMMIT
- BEGIN
- table public.test_prepared1: INSERT: id[integer]:2
- COMMIT
BEGIN
table public.test_prepared1: INSERT: id[integer]:4
COMMIT
BEGIN
- table public.test_prepared2: INSERT: id[integer]:7
- COMMIT
- BEGIN
table public.test_prepared1: INSERT: id[integer]:5
table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(7 rows)
+
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:7
COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#3'
+(1 row)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (8);
+INSERT INTO test_prepared2 VALUES (9);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+--------------------------------------------------------------------
BEGIN
table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
COMMIT
BEGIN
table public.test_prepared2: INSERT: id[integer]:9
COMMIT
-(22 rows)
+(6 rows)
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'pg_class'::regclass;
+ relation | locktype | mode
+----------+----------+------
+(0 rows)
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:12
+ PREPARE TRANSACTION 'test_prepared_lock2'
+ COMMIT PREPARED 'test_prepared_lock2'
+(8 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- will work normally after we commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+--------------------------------------
+ COMMIT PREPARED 'test_prepared_lock'
+(1 row)
+
+-- test savepoints
+BEGIN;
+SAVEPOINT test_savepoint;
+CREATE TABLE test_prepared_savepoint (a int);
+PREPARE TRANSACTION 'test_prepared_savepoint';
+COMMIT PREPARED 'test_prepared_savepoint';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+-------------------------------------------
+ COMMIT PREPARED 'test_prepared_savepoint'
+(1 row)
+
+-- test that a GID containing "nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+------
+(0 rows)
+
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+---------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:20 data[text]:null
+ COMMIT
+(3 rows)
+
+-- cleanup
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+------
+(0 rows)
SELECT pg_drop_replication_slot('regression_slot');
pg_drop_replication_slot
diff --git a/contrib/test_decoding/sql/prepared.sql b/contrib/test_decoding/sql/prepared.sql
index e72639767e..60725419fe 100644
--- a/contrib/test_decoding/sql/prepared.sql
+++ b/contrib/test_decoding/sql/prepared.sql
@@ -2,21 +2,25 @@
SET synchronous_commit = on;
SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
-- test simple successful use of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (1);
PREPARE TRANSACTION 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
INSERT INTO test_prepared1 VALUES (2);
-- test abort of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (3);
PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
INSERT INTO test_prepared1 VALUES (4);
@@ -27,24 +31,83 @@ ALTER TABLE test_prepared1 ADD COLUMN data text;
INSERT INTO test_prepared1 VALUES (6, 'frakbar');
PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'test_prepared1'::regclass;
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
-- make sure stuff still works
INSERT INTO test_prepared1 VALUES (8);
INSERT INTO test_prepared2 VALUES (9);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'pg_class'::regclass;
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+RESET statement_timeout;
+
+COMMIT PREPARED 'test_prepared_lock';
+
+-- will work normally after we commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- test savepoints
+BEGIN;
+SAVEPOINT test_savepoint;
+CREATE TABLE test_prepared_savepoint (a int);
+PREPARE TRANSACTION 'test_prepared_savepoint';
+COMMIT PREPARED 'test_prepared_savepoint';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- test that a GID containing "nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
-- cleanup
DROP TABLE test_prepared1;
DROP TABLE test_prepared2;
--- show results
+-- show results. There should be nothing to show
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000000..99a9249689
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,119 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+ 'postgresql.conf', qq(
+ max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+ "INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+ "SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# This test is specifically for testing concurrent abort while logical decode
+# is ongoing. We will pass in the xid of the 2PC to the plugin as an option.
+# On the receipt of a valid "check-xid", the change API in the test decoding
+# plugin will wait for it to be aborted.
+#
+# We will fire off a ROLLBACK from another session when this decode
+# is waiting.
+#
+# The status of "check-xid" will change from in-progress to not-committed
+# (hence aborted) and we will stop decoding because the subsequent
+# system catalog scan will error out.
+
+$node_logical->safe_psql('postgres', "
+ BEGIN;
+ INSERT INTO tab VALUES (11);
+ INSERT INTO tab VALUES (12);
+ ALTER TABLE tab ADD COLUMN b INT;
+ INSERT INTO tab VALUES (13,14);
+ PREPARE TRANSACTION 'test_prepared_tab';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test_prepared_tab'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# start decoding the above by passing the "check-xid"
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid', '$xid2pc');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+ or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stopping decoding of xid $xid2pc")
+ or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+ BEGIN;
+ INSERT INTO tab VALUES (11);
+ INSERT INTO tab VALUES (12);
+ ALTER TABLE tab ADD COLUMN b INT;
+ INSERT INTO tab VALUES (13, 11);
+ PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit after the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+ my ($expected) = @_;
+
+ $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+ my $max_attempts = 180 * 10;
+ my $attempts = 0;
+
+ my $output_file = '';
+ while ($attempts < $max_attempts)
+ {
+ $output_file = slurp_file($node_logical->logfile());
+
+ if ($output_file =~ $expected)
+ {
+ return 1;
+ }
+
+ # Wait 0.1 second before retrying.
+ usleep(100_000);
+ $attempts++;
+ }
+
+ # The output result didn't change in 180 seconds. Give up
+ return 0;
+}
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index f6e77fbda1..bfa43e3653 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -11,12 +11,16 @@
*-------------------------------------------------------------------------
*/
#include "postgres.h"
+#include "miscadmin.h"
+#include "access/transam.h"
#include "catalog/pg_type.h"
#include "replication/logical.h"
#include "replication/origin.h"
+#include "storage/procarray.h"
+
#include "utils/builtins.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
@@ -36,6 +40,7 @@ typedef struct
bool skip_empty_xacts;
bool xact_wrote_changes;
bool only_local;
+ TransactionId check_xid; /* track abort of this txid */
} TestDecodingData;
static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -49,6 +54,8 @@ static void pg_output_begin(LogicalDecodingContext *ctx,
bool last_write);
static void pg_decode_commit_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pg_decode_abort_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr abort_lsn);
static void pg_decode_change(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, Relation rel,
ReorderBufferChange *change);
@@ -62,6 +69,18 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr message_lsn,
bool transactional, const char *prefix,
Size sz, const char *message);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
void
_PG_init(void)
@@ -80,9 +99,14 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
cb->change_cb = pg_decode_change;
cb->truncate_cb = pg_decode_truncate;
cb->commit_cb = pg_decode_commit_txn;
+ cb->abort_cb = pg_decode_abort_txn;
cb->filter_by_origin_cb = pg_decode_filter;
cb->shutdown_cb = pg_decode_shutdown;
cb->message_cb = pg_decode_message;
+ cb->filter_prepare_cb = pg_decode_filter_prepare;
+ cb->prepare_cb = pg_decode_prepare_txn;
+ cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+ cb->abort_prepared_cb = pg_decode_abort_prepared_txn;
}
@@ -102,11 +126,14 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
data->include_timestamp = false;
data->skip_empty_xacts = false;
data->only_local = false;
+ data->check_xid = InvalidTransactionId;
ctx->output_plugin_private = data;
opt->output_type = OUTPUT_PLUGIN_TEXTUAL_OUTPUT;
opt->receive_rewrites = false;
+ /* this plugin supports decoding of 2pc */
+ opt->enable_twophase = true;
foreach(option, ctx->output_plugin_options)
{
@@ -183,6 +210,32 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
errmsg("could not parse value \"%s\" for parameter \"%s\"",
strVal(elem->arg), elem->defname)));
}
+ else if (strcmp(elem->defname, "check-xid") == 0)
+ {
+ if (elem->arg)
+ {
+ errno = 0;
+ data->check_xid = (TransactionId)
+ strtoul(strVal(elem->arg), NULL, 0);
+
+ if (errno == EINVAL || errno == ERANGE)
+ ereport(FATAL,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("check-xid is not a valid number: \"%s\"",
+ strVal(elem->arg))));
+ }
+ else
+ ereport(FATAL,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("check-xid needs an input value")));
+
+ if (!TransactionIdIsValid(data->check_xid))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("Specify positive value for parameter \"%s\","
+ " you specified \"%s\"",
+ elem->defname, strVal(elem->arg))));
+ }
else
{
ereport(ERROR,
@@ -251,6 +304,116 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
OutputPluginWrite(ctx, true);
}
+/* ABORT callback */
+static void
+pg_decode_abort_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+ if (data->include_xids)
+ appendStringInfo(ctx->out, "ABORT %u", txn->xid);
+ else
+ appendStringInfoString(ctx->out, "ABORT");
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/*
+ * Filter out two-phase transactions.
+ *
+ * Each plugin can implement its own filtering logic. Here we
+ * demonstrate a simple rule based on the GID: if the GID contains
+ * the "_nodecode" substring, the transaction is not decoded at
+ * PREPARE time.
+ */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid)
+{
+ if (strstr(gid, "_nodecode") != NULL)
+ return true;
+
+ return false;
+}
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* ABORT PREPARED callback */
+static void
+pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
static bool
pg_decode_filter(LogicalDecodingContext *ctx,
RepOriginId origin_id)
@@ -409,6 +572,22 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
}
data->xact_wrote_changes = true;
+ /* if check_xid is specified */
+ if (TransactionIdIsValid(data->check_xid))
+ {
+ elog(LOG, "waiting for %u to abort", data->check_xid);
+ while (TransactionIdIsInProgress(data->check_xid))
+ {
+ CHECK_FOR_INTERRUPTS();
+ pg_usleep(10000L);
+ }
+ if (!TransactionIdIsInProgress(data->check_xid) &&
+ !TransactionIdDidCommit(data->check_xid))
+ elog(LOG, "%u aborted", data->check_xid);
+
+ Assert(TransactionIdDidAbort(data->check_xid));
+ }
+
class_form = RelationGetForm(relation);
tupdesc = RelationGetDescr(relation);
--
2.15.2 (Apple Git-101.1)
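The GID-based rule used by pg_decode_filter_prepare in the patch above can be tried outside the server. The following is a minimal standalone sketch, not part of the patch: filter_prepare_by_gid is a hypothetical stand-in for the callback, with the PostgreSQL-specific parameters (ctx, txn, xid) dropped. It assumes the convention that returning true means "do not decode this transaction at PREPARE time".

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical stand-in for pg_decode_filter_prepare(): returns true when
 * the two-phase transaction should be filtered out, i.e. when its GID
 * contains the "_nodecode" substring anywhere.
 */
static bool
filter_prepare_by_gid(const char *gid)
{
	return strstr(gid, "_nodecode") != NULL;
}
```

With this rule, a GID such as 'test_prepared_nodecode' would be filtered, while 'test_prepared#3' from the regression test would not.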
Hi,
Please find attached the latest patchset, which implements the above.
The newly added test_decoding test was failing due to a slight
mismatch in the expected output. The attached patchset corrects that.
Regards,
Nikhil
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
0001-Cleaning-up-of-flags-in-ReorderBufferTXN-structure.Nov30.patch (application/octet-stream)
From 6d7bd6bcda4c9e9c6914e9ec7e27450971b47d64 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Wed, 13 Jun 2018 16:15:24 +0530
Subject: [PATCH 1/4] Cleaning up of flags in ReorderBufferTXN structure
---
src/backend/replication/logical/reorderbuffer.c | 34 ++++++++++++-------------
src/include/replication/reorderbuffer.h | 33 ++++++++++++++----------
2 files changed, 37 insertions(+), 30 deletions(-)
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 23466ba..6df6fc0 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -680,7 +680,7 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
Assert(prev_first_lsn < cur_txn->first_lsn);
/* known-as-subtxn txns must not be listed */
- Assert(!cur_txn->is_known_as_subxact);
+ Assert(!rbtxn_is_known_subxact(cur_txn));
prev_first_lsn = cur_txn->first_lsn;
}
@@ -700,7 +700,7 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
Assert(prev_base_snap_lsn < cur_txn->base_snapshot_lsn);
/* known-as-subtxn txns must not be listed */
- Assert(!cur_txn->is_known_as_subxact);
+ Assert(!rbtxn_is_known_subxact(cur_txn));
prev_base_snap_lsn = cur_txn->base_snapshot_lsn;
}
@@ -723,7 +723,7 @@ ReorderBufferGetOldestTXN(ReorderBuffer *rb)
txn = dlist_head_element(ReorderBufferTXN, node, &rb->toplevel_by_lsn);
- Assert(!txn->is_known_as_subxact);
+ Assert(!rbtxn_is_known_subxact(txn));
Assert(txn->first_lsn != InvalidXLogRecPtr);
return txn;
}
@@ -783,7 +783,7 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
if (!new_sub)
{
- if (subtxn->is_known_as_subxact)
+ if (rbtxn_is_known_subxact(subtxn))
{
/* already associated, nothing to do */
return;
@@ -799,7 +799,7 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
}
}
- subtxn->is_known_as_subxact = true;
+ subtxn->txn_flags |= RBTXN_IS_SUBXACT;
subtxn->toplevel_xid = xid;
Assert(subtxn->nsubtxns == 0);
@@ -1009,7 +1009,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
ReorderBufferChange *cur_change;
- if (txn->serialized)
+ if (rbtxn_is_serialized(txn))
{
/* serialize remaining changes */
ReorderBufferSerializeTXN(rb, txn);
@@ -1038,7 +1038,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
ReorderBufferChange *cur_change;
- if (cur_txn->serialized)
+ if (rbtxn_is_serialized(cur_txn))
{
/* serialize remaining changes */
ReorderBufferSerializeTXN(rb, cur_txn);
@@ -1204,7 +1204,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
* they originally were happening inside another subtxn, so we won't
* ever recurse more than one level deep here.
*/
- Assert(subtxn->is_known_as_subxact);
+ Assert(rbtxn_is_known_subxact(subtxn));
Assert(subtxn->nsubtxns == 0);
ReorderBufferCleanupTXN(rb, subtxn);
@@ -1245,7 +1245,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
/*
* Remove TXN from its containing list.
*
- * Note: if txn->is_known_as_subxact, we are deleting the TXN from its
+ * Note: if txn is known as subxact, we are deleting the TXN from its
* parent's list of known subxacts; this leaves the parent's nsubxacts
* count too high, but we don't care. Otherwise, we are deleting the TXN
* from the LSN-ordered list of toplevel TXNs.
@@ -1260,7 +1260,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
Assert(found);
/* remove entries spilled to disk */
- if (txn->serialized)
+ if (rbtxn_is_serialized(txn))
ReorderBufferRestoreCleanup(rb, txn);
/* deallocate */
@@ -1277,7 +1277,7 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
dlist_iter iter;
HASHCTL hash_ctl;
- if (!txn->has_catalog_changes || dlist_is_empty(&txn->tuplecids))
+ if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
return;
memset(&hash_ctl, 0, sizeof(hash_ctl));
@@ -1901,7 +1901,7 @@ ReorderBufferAbortOld(ReorderBuffer *rb, TransactionId oldestRunningXid)
* final_lsn to that of their last change; this causes
* ReorderBufferRestoreCleanup to do the right thing.
*/
- if (txn->serialized && txn->final_lsn == 0)
+ if (rbtxn_is_serialized(txn) && txn->final_lsn == 0)
{
ReorderBufferChange *last =
dlist_tail_element(ReorderBufferChange, node, &txn->changes);
@@ -2049,7 +2049,7 @@ ReorderBufferSetBaseSnapshot(ReorderBuffer *rb, TransactionId xid,
* operate on its top-level transaction instead.
*/
txn = ReorderBufferTXNByXid(rb, xid, true, &is_new, lsn, true);
- if (txn->is_known_as_subxact)
+ if (rbtxn_is_known_subxact(txn))
txn = ReorderBufferTXNByXid(rb, txn->toplevel_xid, false,
NULL, InvalidXLogRecPtr, false);
Assert(txn->base_snapshot == NULL);
@@ -2156,7 +2156,7 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
- txn->has_catalog_changes = true;
+ txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
}
/*
@@ -2173,7 +2173,7 @@ ReorderBufferXidHasCatalogChanges(ReorderBuffer *rb, TransactionId xid)
if (txn == NULL)
return false;
- return txn->has_catalog_changes;
+ return rbtxn_has_catalog_changes(txn);
}
/*
@@ -2193,7 +2193,7 @@ ReorderBufferXidHasBaseSnapshot(ReorderBuffer *rb, TransactionId xid)
return false;
/* a known subtxn? operate on top-level txn instead */
- if (txn->is_known_as_subxact)
+ if (rbtxn_is_known_subxact(txn))
txn = ReorderBufferTXNByXid(rb, txn->toplevel_xid, false,
NULL, InvalidXLogRecPtr, false);
@@ -2314,7 +2314,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
Assert(spilled == txn->nentries_mem);
Assert(dlist_is_empty(&txn->changes));
txn->nentries_mem = 0;
- txn->serialized = true;
+ txn->txn_flags |= RBTXN_IS_SERIALIZED;
if (fd != -1)
CloseTransientFile(fd);
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 7787edf..5b5f4db 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -150,18 +150,34 @@ typedef struct ReorderBufferChange
dlist_node node;
} ReorderBufferChange;
+/* ReorderBufferTXN flags */
+#define RBTXN_HAS_CATALOG_CHANGES 0x0001
+#define RBTXN_IS_SUBXACT 0x0002
+#define RBTXN_IS_SERIALIZED 0x0004
+
+/* does the txn have catalog changes */
+#define rbtxn_has_catalog_changes(txn) (((txn)->txn_flags & RBTXN_HAS_CATALOG_CHANGES) != 0)
+/* is the txn known as a subxact? */
+#define rbtxn_is_known_subxact(txn) (((txn)->txn_flags & RBTXN_IS_SUBXACT) != 0)
+/*
+ * Has this transaction been spilled to disk? It's not always possible to
+ * deduce that fact by comparing nentries with nentries_mem, because e.g.
+ * subtransactions of a large transaction might get serialized together
+ * with the parent - if they're restored to memory they'd have
+ * nentries_mem == nentries.
+ */
+#define rbtxn_is_serialized(txn) (((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0)
+
typedef struct ReorderBufferTXN
{
+ int txn_flags;
+
/*
* The transactions transaction id, can be a toplevel or sub xid.
*/
TransactionId xid;
- /* did the TX have catalog changes */
- bool has_catalog_changes;
-
/* Do we know this is a subxact? Xid of top-level txn if so */
- bool is_known_as_subxact;
TransactionId toplevel_xid;
/*
@@ -230,15 +246,6 @@ typedef struct ReorderBufferTXN
uint64 nentries_mem;
/*
- * Has this transaction been spilled to disk? It's not always possible to
- * deduce that fact by comparing nentries with nentries_mem, because e.g.
- * subtransactions of a large transaction might get serialized together
- * with the parent - if they're restored to memory they'd have
- * nentries_mem == nentries.
- */
- bool serialized;
-
- /*
* List of ReorderBufferChange structs, including new Snapshots and new
* CommandIds
*/
--
2.7.4
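The refactoring in this first patch replaces three separate bool fields with bits in a single txn_flags word plus accessor macros. Below is a minimal standalone sketch of that pattern, not part of the patch: DemoTXN and the demo_* names are hypothetical stand-ins for ReorderBufferTXN and the rbtxn_* macros, and the flag values are copied from the patch. The accessors here parenthesize the macro argument and compare against zero so that they yield exactly 0 or 1.

```c
#include <stdbool.h>

/* Flag bits, with the same values as in the patch. */
#define RBTXN_HAS_CATALOG_CHANGES 0x0001
#define RBTXN_IS_SUBXACT          0x0002
#define RBTXN_IS_SERIALIZED       0x0004

/* Hypothetical, stripped-down stand-in for ReorderBufferTXN. */
typedef struct DemoTXN
{
	int			txn_flags;
} DemoTXN;

/* Accessors: test one bit each and normalize the result to 0 or 1. */
#define demo_has_catalog_changes(txn) \
	(((txn)->txn_flags & RBTXN_HAS_CATALOG_CHANGES) != 0)
#define demo_is_known_subxact(txn) \
	(((txn)->txn_flags & RBTXN_IS_SUBXACT) != 0)
#define demo_is_serialized(txn) \
	(((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0)

/* Setting a flag is a bitwise OR, as in "txn->txn_flags |= ..." above. */
static void
demo_mark_serialized(DemoTXN *txn)
{
	txn->txn_flags |= RBTXN_IS_SERIALIZED;
}
```

Setting one flag leaves the others untouched, which is what allows later patches in the series to add further bits without growing the structure.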
0002-Support-decoding-of-two-phase-transactions-at-PREPAR.Nov30.patch (application/octet-stream)
From 90e93341629bb534a619d29cc87713e35358618b Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Wed, 13 Jun 2018 16:30:30 +0530
Subject: [PATCH 2/4] Support decoding of two-phase transactions at PREPARE
Until now, two-phase transactions were decoded at COMMIT, just like
regular transactions. During replay, two-phase transactions were
translated into regular transactions on the subscriber, and the GID
was not forwarded to it.
This patch allows PREPARE-time decoding of two-phase transactions (if
the output plugin supports this capability), in which case the
transactions are replayed at PREPARE and then committed later when
COMMIT PREPARED arrives.
On the subscriber, the transactions will be executed as two-phase
transactions, with the same GID. This is important for various
external transaction managers, which often encode information into
the GID itself.
Includes documentation changes.
---
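The commit-time behaviour described in the message above can be summarized as a small decision function. This is a standalone sketch, not part of the patch: CommitAction and classify_commit are hypothetical names, and the three inputs mirror the is_prepared, filter_prepared and skip variables used by DecodeCommit in the decode.c changes of this patch.

```c
#include <stdbool.h>

/* Hypothetical labels for what happens when a commit record is decoded. */
typedef enum CommitAction
{
	ACTION_SKIP,				/* transaction is not of interest */
	ACTION_REPLAY_AT_COMMIT,	/* replay all changes now */
	ACTION_COMMIT_PREPARED		/* changes were replayed at PREPARE */
} CommitAction;

/*
 * is_prepared:     the commit record carries a valid two-phase XID
 * filter_prepared: the two-phase transaction was not decoded at PREPARE
 * skip:            the transaction needs no decoding at all
 */
static CommitAction
classify_commit(bool is_prepared, bool filter_prepared, bool skip)
{
	if (skip)
		return ACTION_SKIP;
	if (is_prepared && !filter_prepared)
		return ACTION_COMMIT_PREPARED;
	return ACTION_REPLAY_AT_COMMIT;
}
```

A two-phase transaction that the plugin filtered out falls through to the same path as a regular commit, which matches the pre-patch behaviour.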
doc/src/sgml/logicaldecoding.sgml | 127 ++++++++++-
src/backend/replication/logical/decode.c | 266 ++++++++++++++++++------
src/backend/replication/logical/logical.c | 203 ++++++++++++++++++
src/backend/replication/logical/reorderbuffer.c | 185 ++++++++++++++--
src/include/replication/logical.h | 2 +-
src/include/replication/output_plugin.h | 46 ++++
src/include/replication/reorderbuffer.h | 68 ++++++
7 files changed, 814 insertions(+), 83 deletions(-)
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 8db9686..a89e4d5 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -385,7 +385,12 @@ typedef struct OutputPluginCallbacks
LogicalDecodeChangeCB change_cb;
LogicalDecodeTruncateCB truncate_cb;
LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeAbortCB abort_cb;
LogicalDecodeMessageCB message_cb;
+ LogicalDecodeFilterPrepareCB filter_prepare_cb;
+ LogicalDecodePrepareCB prepare_cb;
+ LogicalDecodeCommitPreparedCB commit_prepared_cb;
+ LogicalDecodeAbortPreparedCB abort_prepared_cb;
LogicalDecodeFilterByOriginCB filter_by_origin_cb;
LogicalDecodeShutdownCB shutdown_cb;
} OutputPluginCallbacks;
@@ -457,7 +462,13 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
never get
decoded. Successful savepoints are
folded into the transaction containing them in the order they were
- executed within that transaction.
+ executed within that transaction. A transaction that is prepared for
+ a two-phase commit using <command>PREPARE TRANSACTION</command> will
+ also be decoded if the output plugin callbacks needed for decoding
+ it are provided. The transaction currently being decoded may be
+ aborted concurrently via a <command>ROLLBACK PREPARED</command>
+ command. In that case, the logical decoding of this transaction will
+ be aborted too.
</para>
<note>
@@ -558,6 +569,71 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
</para>
</sect3>
+ <sect3 id="logicaldecoding-output-plugin-prepare">
+ <title>Transaction Prepare Callback</title>
+
+ <para>
+ The optional <function>prepare_cb</function> callback is called whenever
+ a transaction which is prepared for two-phase commit has been
+ decoded. The <function>change_cb</function> callbacks for all modified
+ rows will have been called before this, if there have been any modified
+ rows.
+<programlisting>
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+ <title>Commit Prepared Transaction Callback</title>
+
+ <para>
+ The optional <function>commit_prepared_cb</function> callback is called whenever
+ a <command>COMMIT PREPARED</command> has been decoded. The <parameter>gid</parameter> field,
+ which is part of the <parameter>txn</parameter> parameter, can be used in this
+ callback.
+<programlisting>
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-abort-prepared">
+ <title>Rollback Prepared Transaction Callback</title>
+
+ <para>
+ The optional <function>abort_prepared_cb</function> callback is called whenever
+ a <command>ROLLBACK PREPARED</command> has been decoded. The <parameter>gid</parameter> field,
+ which is part of the <parameter>txn</parameter> parameter, can be used in this
+ callback.
+<programlisting>
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-abort">
+ <title>Transaction Abort Callback</title>
+
+ <para>
+ The required <function>abort_cb</function> callback is called whenever
+ a transaction abort has to be initiated. This can happen if we are
+ decoding a transaction that has been prepared for two-phase commit and
+ a concurrent rollback happens while we are decoding it.
+<programlisting>
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
<sect3 id="logicaldecoding-output-plugin-change">
<title>Change Callback</title>
@@ -567,7 +643,13 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
an <command>INSERT</command>, <command>UPDATE</command>,
or <command>DELETE</command>. Even if the original command modified
several rows at once the callback will be called individually for each
- row.
+ row. The <function>change_cb</function> callback may access system or
+ user catalog tables to aid in the process of outputting the row
+ modification details. When decoding a prepared (but not yet
+ committed) transaction or an uncommitted transaction, this
+ change callback might also error out because that same transaction
+ is rolled back concurrently. In that case, the logical decoding of this
+ aborted transaction is stopped gracefully.
<programlisting>
typedef void (*LogicalDecodeChangeCB) (struct LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
@@ -644,6 +726,39 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
</para>
</sect3>
+ <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+ <title>Prepare Filter Callback</title>
+
+ <para>
+ The optional <function>filter_prepare_cb</function> callback
+ is called to determine whether data that is part of the current
+ two-phase commit transaction should be decoded at this prepare
+ stage, or later as a regular one-phase transaction at
+ <command>COMMIT PREPARED</command> time. To signal that
+ decoding at prepare time should be skipped, return <literal>true</literal>;
+ return <literal>false</literal> otherwise. When the callback is not
+ defined, <literal>false</literal> is assumed (i.e. nothing is
+ filtered).
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+</programlisting>
+ The <parameter>ctx</parameter> parameter has the same contents
+ as for the other callbacks. The <parameter>txn</parameter> parameter
+ contains meta information about the transaction. The <parameter>xid</parameter>
+ contains the XID because <parameter>txn</parameter> can be NULL in some cases.
+ The <parameter>gid</parameter> is the identifier that later identifies this
+ transaction for <command>COMMIT PREPARED</command> or <command>ROLLBACK PREPARED</command>.
+ </para>
+ <para>
+ The callback has to provide the same static answer for a given combination of
+ <parameter>xid</parameter> and <parameter>gid</parameter> every time it is
+ called.
+ </para>
+ </sect3>
+
<sect3 id="logicaldecoding-output-plugin-message">
<title>Generic Message Callback</title>
@@ -665,7 +780,13 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
non-transactional and the XID was not assigned yet in the transaction
which logged the message. The <parameter>lsn</parameter> has WAL
location of the message. The <parameter>transactional</parameter> says
- if the message was sent as transactional or not.
+ if the message was sent as transactional or not. Similar to the change
+ callback, when decoding a prepared (but not yet committed)
+ transaction or an uncommitted transaction, this message
+ callback might also error out because that same transaction is
+ rolled back concurrently. In that case, the logical decoding of this
+ aborted transaction is stopped gracefully.
+
The <parameter>prefix</parameter> is arbitrary null-terminated prefix
which can be used for identifying interesting messages for the current
plugin. And finally the <parameter>message</parameter> parameter holds
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index e3b0565..c60ee90 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -34,6 +34,7 @@
#include "access/xlogutils.h"
#include "access/xlogreader.h"
#include "access/xlogrecord.h"
+#include "access/twophase.h"
#include "catalog/pg_control.h"
@@ -73,6 +74,8 @@ static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_commit *parsed, TransactionId xid);
static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare * parsed);
/* common function to decode tuples */
static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -232,17 +235,10 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
{
xl_xact_commit *xlrec;
xl_xact_parsed_commit parsed;
- TransactionId xid;
xlrec = (xl_xact_commit *) XLogRecGetData(r);
ParseCommitRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
-
- if (!TransactionIdIsValid(parsed.twophase_xid))
- xid = XLogRecGetXid(r);
- else
- xid = parsed.twophase_xid;
-
- DecodeCommit(ctx, buf, &parsed, xid);
+ DecodeCommit(ctx, buf, &parsed, XLogRecGetXid(r));
break;
}
case XLOG_XACT_ABORT:
@@ -250,17 +246,10 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
{
xl_xact_abort *xlrec;
xl_xact_parsed_abort parsed;
- TransactionId xid;
xlrec = (xl_xact_abort *) XLogRecGetData(r);
ParseAbortRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
-
- if (!TransactionIdIsValid(parsed.twophase_xid))
- xid = XLogRecGetXid(r);
- else
- xid = parsed.twophase_xid;
-
- DecodeAbort(ctx, buf, &parsed, xid);
+ DecodeAbort(ctx, buf, &parsed, XLogRecGetXid(r));
break;
}
case XLOG_XACT_ASSIGNMENT:
@@ -281,16 +270,33 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
break;
}
case XLOG_XACT_PREPARE:
+ {
+ xl_xact_parsed_prepare parsed;
- /*
- * Currently decoding ignores PREPARE TRANSACTION and will just
- * decode the transaction when the COMMIT PREPARED is sent or
- * throw away the transaction's contents when a ROLLBACK PREPARED
- * is received. In the future we could add code to expose prepared
- * transactions in the changestream allowing for a kind of
- * distributed 2PC.
- */
- ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+ /* check that output plugin is capable of twophase decoding */
+ if (!ctx->options.enable_twophase)
+ {
+ ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+ break;
+ }
+
+ /* ok, parse it */
+ ParsePrepareRecord(XLogRecGetInfo(buf->record),
+ XLogRecGetData(buf->record), &parsed);
+
+ /* does output plugin want this particular transaction? */
+ if (ctx->callbacks.filter_prepare_cb &&
+ ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+ parsed.twophase_gid))
+ {
+ ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+ buf->origptr);
+ break;
+ }
+
+ DecodePrepare(ctx, buf, &parsed);
+ break;
+ }
break;
default:
elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
@@ -556,20 +562,13 @@ DecodeLogicalMsgOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
* Consolidated commit record handling between the different form of commit
* records.
*/
-static void
-DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
- xl_xact_parsed_commit *parsed, TransactionId xid)
+static bool
+DecodeEndOfTxn(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_commit *parsed, TransactionId xid)
{
- XLogRecPtr origin_lsn = InvalidXLogRecPtr;
- TimestampTz commit_time = parsed->xact_time;
RepOriginId origin_id = XLogRecGetOrigin(buf->record);
- int i;
+ bool skip = false;
- if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
- {
- origin_lsn = parsed->origin_lsn;
- commit_time = parsed->origin_timestamp;
- }
/*
* Process invalidation messages, even if we're not interested in the
@@ -586,20 +585,24 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
parsed->nsubxacts, parsed->subxacts);
+ skip = SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+ (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+ ctx->fast_forward || FilterByOrigin(ctx, origin_id);
- /* ----
- * Check whether we are interested in this specific transaction, and tell
- * the reorderbuffer to forget the content of the (sub-)transactions
- * if not.
- *
- * There can be several reasons we might not be interested in this
- * transaction:
- * 1) We might not be interested in decoding transactions up to this
- * LSN. This can happen because we previously decoded it and now just
- * are restarting or if we haven't assembled a consistent snapshot yet.
- * 2) The transaction happened in another database.
- * 3) The output plugin is not interested in the origin.
- * 4) We are doing fast-forwarding
+ return skip;
+}
+
+static void
+FinalizeTxnDecoding(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_commit *parsed, TransactionId xid,
+ bool will_skip)
+{
+ int i;
+
+
+ /*
+ * Tell the reorderbuffer to forget the content of the (sub-)transactions,
+ * if the transaction doesn't need decoding.
*
* We can't just use ReorderBufferAbort() here, because we need to execute
* the transaction's invalidations. This currently won't be needed if
@@ -611,31 +614,128 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
* another database, the invalidations might be important, because they
* could be for shared catalogs and we might have loaded data into the
* relevant syscaches.
- * ---
*/
- if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
- (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
- ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+ if (will_skip)
{
for (i = 0; i < parsed->nsubxacts; i++)
- {
ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr);
- }
+
ReorderBufferForget(ctx->reorder, xid, buf->origptr);
+ }
+ else
+ {
+ /*
+ * The top-level transaction isn't going to be skipped, so tell
+ * the reorderbuffer about its surviving (committed)
+ * subtransactions.
+ */
+ for (i = 0; i < parsed->nsubxacts; i++)
+ ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+ buf->origptr, buf->endptr);
+ }
+}
- return;
+static void
+DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_commit *parsed, TransactionId xid)
+{
+ XLogRecPtr origin_lsn = InvalidXLogRecPtr;
+ TimestampTz commit_time = parsed->xact_time;
+ RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+ bool is_prepared = false;
+ bool filter_prepared = false;
+ bool skip;
+
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ origin_lsn = parsed->origin_lsn;
+ commit_time = parsed->origin_timestamp;
}
- /* tell the reorderbuffer about the surviving subtransactions */
- for (i = 0; i < parsed->nsubxacts; i++)
+ if (TransactionIdIsValid(parsed->twophase_xid))
+ {
+ is_prepared = true;
+ filter_prepared = ReorderBufferPrepareNeedSkip(ctx->reorder,
+ parsed->twophase_xid,
+ parsed->twophase_gid);
+
+ /*
+ * If there is a valid top-level transaction that's different from the
+ * two-phase one we are committing, clear its reorder buffer as well.
+ */
+ if (TransactionIdIsNormal(xid) && xid != parsed->twophase_xid)
+ ReorderBufferAbort(ctx->reorder, xid, origin_lsn);
+
+ /* act on the prepared transaction, instead */
+ xid = parsed->twophase_xid;
+ }
+
+ /* Whether or not this COMMIT needs to be skipped. */
+ skip = DecodeEndOfTxn(ctx, buf, parsed, xid);
+
+ /*
+ * Finalize the decoding of the transaction here. This is for regular
+ * commits as well as for two-phase transactions the output plugin was not
+ * interested in, which therefore are relayed as normal single-phase
+ * commits.
+ */
+ if (!is_prepared || filter_prepared)
+ FinalizeTxnDecoding(ctx, buf, parsed, xid, skip);
+
+ if (skip)
+ return;
+
+ /*
+ * A regular commit simply triggers a replay of transaction changes from
+ * the reorder buffer. For COMMIT PREPARED that however already happened
+ * at PREPARE time, and so we only need to notify the subscriber that the
+ * GID finally committed.
+ *
+ * For output plugins that do not support PREPARE-time decoding of
+ * two-phase transactions, we never even see the PREPARE and all two-phase
+ * transactions simply fall through to the second branch.
+ */
+ if (is_prepared && !filter_prepared)
{
- ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
- buf->origptr, buf->endptr);
+ /* we are processing COMMIT PREPARED */
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid, true);
}
+ else
+ {
+ /* replay actions of all transaction + subtransactions in order */
+ ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn);
+ }
+}
+
+/*
+ * Decode a PREPARE record. The logic is similar to that of COMMIT.
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare * parsed)
+{
+ XLogRecPtr origin_lsn = parsed->origin_lsn;
+ TimestampTz commit_time = parsed->origin_timestamp;
+ RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+ TransactionId xid = parsed->twophase_xid;
+ bool skip;
+
+ Assert(parsed->dbId != InvalidOid);
+ Assert(TransactionIdIsValid(parsed->twophase_xid));
+
+ /* Whether or not this PREPARE needs to be skipped. */
+ skip = DecodeEndOfTxn(ctx, buf, parsed, xid);
+
+ FinalizeTxnDecoding(ctx, buf, parsed, xid, skip);
/* replay actions of all transaction + subtransactions in order */
- ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
- commit_time, origin_id, origin_lsn);
+ if (!skip)
+ ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid);
}
/*
@@ -647,6 +747,48 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid)
{
int i;
+ XLogRecPtr origin_lsn = InvalidXLogRecPtr;
+ TimestampTz commit_time = 0;
+ RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+ bool is_prepared = TransactionIdIsValid(parsed->twophase_xid);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ origin_lsn = parsed->origin_lsn;
+ commit_time = parsed->origin_timestamp;
+ }
+
+ if (TransactionIdIsValid(parsed->twophase_xid))
+ {
+ is_prepared = true;
+ Assert(parsed->dbId != InvalidOid);
+
+ /*
+ * If there is a valid top-level transaction that's different from the
+ * two-phase one we are aborting, clear its reorder buffer as well.
+ */
+ if (TransactionIdIsNormal(xid) && xid != parsed->twophase_xid)
+ ReorderBufferAbort(ctx->reorder, xid, origin_lsn);
+
+ /* act on the prepared transaction, instead */
+ xid = parsed->twophase_xid;
+ }
+
+ /*
+ * If it's ROLLBACK PREPARED then handle it via callbacks.
+ */
+ if (is_prepared &&
+ !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+ parsed->dbId == ctx->slot->data.database &&
+ !ctx->fast_forward &&
+ !FilterByOrigin(ctx, origin_id) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
+ {
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid, false);
+ return;
+ }
for (i = 0; i < parsed->nsubxacts; i++)
{
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 9f99e4f..be08ccf 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -60,6 +60,16 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+static void abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change);
static void truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -192,6 +202,11 @@ StartupDecodingContext(List *output_plugin_options,
ctx->reorder->apply_change = change_cb_wrapper;
ctx->reorder->apply_truncate = truncate_cb_wrapper;
ctx->reorder->commit = commit_cb_wrapper;
+ ctx->reorder->abort = abort_cb_wrapper;
+ ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+ ctx->reorder->prepare = prepare_cb_wrapper;
+ ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+ ctx->reorder->abort_prepared = abort_prepared_cb_wrapper;
ctx->reorder->message = message_cb_wrapper;
ctx->out = makeStringInfo();
@@ -616,6 +631,33 @@ startup_cb_wrapper(LogicalDecodingContext *ctx, OutputPluginOptions *opt, bool i
/* do the actual work: call callback */
ctx->callbacks.startup_cb(ctx, opt, is_init);
+ /*
+ * If the plugin claims to support two-phase transactions, then
+ * check that the plugin implements all callbacks necessary to decode
+ * two-phase transactions - we either have to have all of them or none.
+ * The filter_prepare callback is optional, but can only be defined when
+ * two-phase decoding is enabled (i.e. the three other callbacks are
+ * defined).
+ */
+ if (opt->enable_twophase)
+ {
+ int twophase_callbacks = (ctx->callbacks.prepare_cb != NULL) +
+ (ctx->callbacks.commit_prepared_cb != NULL) +
+ (ctx->callbacks.abort_prepared_cb != NULL);
+
+ /* Plugins with incorrect number of two-phase callbacks are broken. */
+ if ((twophase_callbacks != 3) && (twophase_callbacks != 0))
+ ereport(ERROR,
+ (errmsg("output plugin registered only %d of the 3 two-phase callbacks",
+ twophase_callbacks)));
+ }
+
+ /* filter_prepare is optional, but requires two-phase decoding */
+ if ((ctx->callbacks.filter_prepare_cb != NULL) && (!opt->enable_twophase))
+ ereport(ERROR,
+ (errmsg("output plugin does not support two-phase decoding, but "
+ "registered a filter_prepare callback")));
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
}
@@ -714,6 +756,122 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
}
static void
+abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "abort";
+ state.report_location = txn->final_lsn; /* beginning of abort record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.abort_cb(ctx, txn, abort_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "prepare";
+ state.report_location = txn->final_lsn; /* beginning of prepare record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "commit_prepared";
+ state.report_location = txn->final_lsn; /* beginning of commit record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "abort_prepared";
+ state.report_location = txn->final_lsn; /* beginning of abort record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.abort_prepared_cb(ctx, txn, abort_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change)
{
@@ -790,6 +948,51 @@ truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
error_context_stack = errcallback.previous;
}
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+ bool ret;
+
+ /*
+ * Skip if decoding of twophase at PREPARE time is not enabled. In that
+ * case all twophase transactions are considered filtered out and will be
+ * applied as regular transactions at COMMIT PREPARED.
+ */
+ if (!ctx->options.enable_twophase)
+ return true;
+
+ /*
+ * The filter_prepare callback is optional. When not supplied, all
+ * prepared transactions should go through.
+ */
+ if (!ctx->callbacks.filter_prepare_cb)
+ return false;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "filter_prepare";
+ state.report_location = InvalidXLogRecPtr;
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = false;
+
+ /* do the actual work: call callback */
+ ret = ctx->callbacks.filter_prepare_cb(ctx, txn, xid, gid);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+
+ return ret;
+}
+
bool
filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
{
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 6df6fc0..ffcc5c0 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -337,6 +337,11 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
}
/* free data that's contained */
+ if (txn->gid != NULL)
+ {
+ pfree(txn->gid);
+ txn->gid = NULL;
+ }
if (txn->tuplecid_hash != NULL)
{
@@ -1426,25 +1431,18 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
* and subtransactions (using a k-way merge) and replay the changes in lsn
* order.
*/
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
- XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
- TimestampTz commit_time,
- RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+ ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn)
{
- ReorderBufferTXN *txn;
volatile Snapshot snapshot_now;
volatile CommandId command_id = FirstCommandId;
bool using_subtxn;
ReorderBufferIterTXNState *volatile iterstate = NULL;
- txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
- false);
-
- /* unknown transaction, nothing to replay */
- if (txn == NULL)
- return;
-
txn->final_lsn = commit_lsn;
txn->end_lsn = end_lsn;
txn->commit_time = commit_time;
@@ -1758,7 +1756,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
break;
}
}
-
/*
* There's a speculative insertion remaining, just clean in up, it
* can't have been successful, otherwise we'd gotten a confirmation
@@ -1774,8 +1771,22 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
ReorderBufferIterTXNFinish(rb, iterstate);
iterstate = NULL;
- /* call commit callback */
- rb->commit(rb, txn, commit_lsn);
+ /*
+ * Call abort/commit/prepare callback, depending on the transaction
+ * state.
+ *
+ * If the transaction aborted during apply (which currently can happen
+ * only for prepared transactions), simply call the abort callback.
+ *
+ * Otherwise call either PREPARE (for twophase transactions) or COMMIT
+ * (for regular ones).
+ */
+ if (rbtxn_rollback(txn))
+ rb->abort(rb, txn, commit_lsn);
+ else if (rbtxn_prepared(txn))
+ rb->prepare(rb, txn, commit_lsn);
+ else
+ rb->commit(rb, txn, commit_lsn);
/* this is just a sanity check against bad output plugin behaviour */
if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -1802,7 +1813,12 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
if (snapshot_now->copied)
ReorderBufferFreeSnap(rb, snapshot_now);
- /* remove potential on-disk data, and deallocate */
+ /*
+ * remove potential on-disk data, and deallocate.
+ *
+ * We remove it even for prepared transactions (GID is enough to
+ * commit/abort those later).
+ */
ReorderBufferCleanupTXN(rb, txn);
}
PG_CATCH();
@@ -1836,6 +1852,141 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
PG_END_TRY();
}
+
+/*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+ return rb->filter_prepare(rb, txn, xid, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferCommitInternal()
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /* unknown transaction, nothing to replay */
+ if (txn == NULL)
+ return;
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a twophase transaction. It calls ReorderBufferCommitInternal()
+ * since all prepared transactions need to be decoded at PREPARE time.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /* unknown transaction, nothing to replay */
+ if (txn == NULL)
+ return;
+
+ txn->txn_flags |= RBTXN_PREPARE;
+ txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+ strcpy(txn->gid, gid);
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Check whether this transaction was sent as prepared to subscribers.
+ * Called while handling commit|abort prepared.
+ */
+bool
+ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+ const char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /*
+ * Always call the prepare filter. It's the job of the prepare filter to
+ * give us the *same* response for a given xid across multiple calls
+ * (including ones on restart)
+ */
+ return !(rb->filter_prepare(rb, txn, xid, gid));
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ABORT PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit)
+{
+ ReorderBufferTXN *txn;
+
+ /*
+ * The transaction may or may not exist (for example, after a restart).
+ * Either way, a 2PC transaction holds no changes in the reorder buffer at
+ * this point, so allow the entry to be created below.
+ */
+ txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+ true);
+
+ txn->final_lsn = commit_lsn;
+ txn->end_lsn = end_lsn;
+ txn->commit_time = commit_time;
+ txn->origin_id = origin_id;
+ txn->origin_lsn = origin_lsn;
+ /* this txn is obviously prepared */
+ txn->txn_flags |= RBTXN_PREPARE;
+ txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+ strcpy(txn->gid, gid);
+
+ if (is_commit)
+ {
+ txn->txn_flags |= RBTXN_COMMIT_PREPARED;
+ rb->commit_prepared(rb, txn, commit_lsn);
+ }
+ else
+ {
+ txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
+ rb->abort_prepared(rb, txn, commit_lsn);
+ }
+
+ /* cleanup: make sure there's no cache pollution */
+ ReorderBufferExecuteInvalidations(rb, txn);
+ ReorderBufferCleanupTXN(rb, txn);
+}
+
/*
* Abort a transaction that possibly has previous changes. Needs to be first
* called for subtransactions and then for the toplevel xid.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index c25ac1f..5fdda65 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -47,7 +47,7 @@ typedef struct LogicalDecodingContext
/*
* Marks the logical decoding context as fast forward decoding one. Such a
- * context does not have plugin loaded so most of the the following
+ * context does not have plugin loaded so most of the following
* properties are unused.
*/
bool fast_forward;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 1ee0a56..c9140e7 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -27,6 +27,7 @@ typedef struct OutputPluginOptions
{
OutputPluginOutputType output_type;
bool receive_rewrites;
+ bool enable_twophase;
} OutputPluginOptions;
/*
@@ -78,6 +79,46 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
XLogRecPtr commit_lsn);
/*
+ * Called for an implicit ABORT of a transaction.
+ */
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+ /*
+ * Called before decoding a PREPARE record to decide whether the
+ * transaction should be decoded with separate calls to the prepare and
+ * commit_prepared/abort_prepared callbacks, or held back until COMMIT
+ * PREPARED and sent as a regular transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+/*
* Called for the generic logical decoding messages.
*/
typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
@@ -109,7 +150,12 @@ typedef struct OutputPluginCallbacks
LogicalDecodeChangeCB change_cb;
LogicalDecodeTruncateCB truncate_cb;
LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeAbortCB abort_cb;
LogicalDecodeMessageCB message_cb;
+ LogicalDecodeFilterPrepareCB filter_prepare_cb;
+ LogicalDecodePrepareCB prepare_cb;
+ LogicalDecodeCommitPreparedCB commit_prepared_cb;
+ LogicalDecodeAbortPreparedCB abort_prepared_cb;
LogicalDecodeFilterByOriginCB filter_by_origin_cb;
LogicalDecodeShutdownCB shutdown_cb;
} OutputPluginCallbacks;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 5b5f4db..ae3cea9 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
#define REORDERBUFFER_H
#include "access/htup_details.h"
+#include "access/twophase.h"
#include "lib/ilist.h"
#include "storage/sinval.h"
#include "utils/hsearch.h"
@@ -154,6 +155,11 @@ typedef struct ReorderBufferChange
#define RBTXN_HAS_CATALOG_CHANGES 0x0001
#define RBTXN_IS_SUBXACT 0x0002
#define RBTXN_IS_SERIALIZED 0x0004
+#define RBTXN_PREPARE 0x0008
+#define RBTXN_COMMIT_PREPARED 0x0010
+#define RBTXN_ROLLBACK_PREPARED 0x0020
+#define RBTXN_COMMIT 0x0040
+#define RBTXN_ROLLBACK 0x0080
/* does the txn have catalog changes */
#define rbtxn_has_catalog_changes(txn) (txn->txn_flags & RBTXN_HAS_CATALOG_CHANGES)
@@ -167,6 +173,16 @@ typedef struct ReorderBufferChange
* nentries_mem == nentries.
*/
#define rbtxn_is_serialized(txn) (txn->txn_flags & RBTXN_IS_SERIALIZED)
+/* is this txn prepared? */
+#define rbtxn_prepared(txn) (txn->txn_flags & RBTXN_PREPARE)
+/* was this prepared txn committed in the meanwhile? */
+#define rbtxn_commit_prepared(txn) (txn->txn_flags & RBTXN_COMMIT_PREPARED)
+/* was this prepared txn aborted in the meanwhile? */
+#define rbtxn_rollback_prepared(txn) (txn->txn_flags & RBTXN_ROLLBACK_PREPARED)
+/* was this txn committed in the meanwhile? */
+#define rbtxn_commit(txn) (txn->txn_flags & RBTXN_COMMIT)
+/* was this txn aborted in the meanwhile? */
+#define rbtxn_rollback(txn) (txn->txn_flags & RBTXN_ROLLBACK)
typedef struct ReorderBufferTXN
{
@@ -179,6 +195,8 @@ typedef struct ReorderBufferTXN
/* Do we know this is a subxact? Xid of top-level txn if so */
TransactionId toplevel_xid;
+ /* In case of 2PC we need to pass GID to output plugin */
+ char *gid;
/*
* LSN of the first data carrying, WAL record with knowledge about this
@@ -324,6 +342,37 @@ typedef void (*ReorderBufferCommitCB) (
ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+/* abort callback signature */
+typedef void (*ReorderBufferAbortCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+typedef bool (*ReorderBufferFilterPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/* abort prepared callback signature */
+typedef void (*ReorderBufferAbortPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+
/* message callback signature */
typedef void (*ReorderBufferMessageCB) (
ReorderBuffer *rb,
@@ -369,6 +418,11 @@ struct ReorderBuffer
ReorderBufferApplyChangeCB apply_change;
ReorderBufferApplyTruncateCB apply_truncate;
ReorderBufferCommitCB commit;
+ ReorderBufferAbortCB abort;
+ ReorderBufferFilterPrepareCB filter_prepare;
+ ReorderBufferPrepareCB prepare;
+ ReorderBufferCommitPreparedCB commit_prepared;
+ ReorderBufferAbortPreparedCB abort_prepared;
ReorderBufferMessageCB message;
/*
@@ -419,6 +473,11 @@ void ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot
void ReorderBufferCommit(ReorderBuffer *, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit);
void ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
void ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
@@ -442,6 +501,15 @@ void ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
bool ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
bool ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
+bool ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+ const char *gid);
+bool ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+ const char *gid);
+void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid);
ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
--
2.7.4
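
The callback rules enforced by the patch above in startup_cb_wrapper() and filter_prepare_cb_wrapper() can be modelled in isolation. What follows is only a stand-alone sketch, not PostgreSQL code: DemoCallbacks, demo_noop, demo_filter_by_prefix, demo_callbacks_valid and demo_prepare_is_filtered are hypothetical names introduced for illustration, and plain function pointers stand in for the real callback types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the two-phase part of OutputPluginCallbacks. */
typedef struct DemoCallbacks
{
    void (*prepare_cb) (void);
    void (*commit_prepared_cb) (void);
    void (*abort_prepared_cb) (void);
    bool (*filter_prepare_cb) (const char *gid);
} DemoCallbacks;

/* Hypothetical callbacks used only to populate the struct. */
void
demo_noop(void)
{
}

/* Example filter: ask to skip PREPARE-time decoding for GIDs "skip_*". */
bool
demo_filter_by_prefix(const char *gid)
{
    return strncmp(gid, "skip_", 5) == 0;
}

/*
 * Models the startup check: with two-phase decoding enabled, the three
 * mandatory callbacks must be all present or all absent, and a
 * filter_prepare callback is only allowed when two-phase decoding is on.
 */
bool
demo_callbacks_valid(const DemoCallbacks *cb, bool enable_twophase)
{
    int n = (cb->prepare_cb != NULL) +
            (cb->commit_prepared_cb != NULL) +
            (cb->abort_prepared_cb != NULL);

    if (enable_twophase && n != 3 && n != 0)
        return false;
    if (cb->filter_prepare_cb != NULL && !enable_twophase)
        return false;
    return true;
}

/*
 * Models the filter wrapper. A result of true means "do not decode at
 * PREPARE; replay the transaction as a regular commit at COMMIT PREPARED".
 */
bool
demo_prepare_is_filtered(const DemoCallbacks *cb, bool enable_twophase,
                         const char *gid)
{
    if (!enable_twophase)
        return true;            /* two-phase decoding disabled */
    if (cb->filter_prepare_cb == NULL)
        return false;           /* no filter: every PREPARE is decoded */
    return cb->filter_prepare_cb(gid);
}
```

Note that, as in the patch, the filter has to give the same answer for a given GID every time it is asked, because ReorderBufferTxnIsPrepared() consults it again when COMMIT PREPARED or ROLLBACK PREPARED is decoded.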
Attachment: 0003-Gracefully-handle-concurrent-aborts-of-uncommitted-t.Nov30.patch (application/octet-stream)
From 732247f7cc9c597740493088637ee89aff000569 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Thu, 26 Jul 2018 18:45:26 +0530
Subject: [PATCH 3/4] Gracefully handle concurrent aborts of uncommitted
transactions that are being decoded alongside.
When a transaction aborts, its changes are considered unnecessary for
other transactions. That means the changes may be either cleaned up by
vacuum or removed from HOT chains (thus made inaccessible through
indexes), and there may be other such consequences.
When decoding committed transactions this is not an issue, and we
never decode transactions that abort before the decoding starts.
But for in-progress transactions - for example when decoding prepared
transactions on PREPARE (and not COMMIT PREPARED as before), this
may cause failures when the output plugin consults catalogs (both
system and user-defined).
We handle such failures by raising an error with sqlerrcode
ERRCODE_TRANSACTION_ROLLBACK from the system table scan APIs in the
backend decoding a specific uncommitted transaction. On receiving
such an sqlerrcode, the decoding logic aborts the ongoing decoding
and returns gracefully.
---
doc/src/sgml/logicaldecoding.sgml | 5 ++-
src/backend/access/heap/heapam.c | 51 +++++++++++++++++++++++++
src/backend/access/index/genam.c | 35 +++++++++++++++++
src/backend/replication/logical/logical.c | 3 ++
src/backend/replication/logical/reorderbuffer.c | 32 +++++++++++++---
src/backend/utils/time/snapmgr.c | 25 +++++++++++-
src/include/utils/snapmgr.h | 4 +-
7 files changed, 146 insertions(+), 9 deletions(-)
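
The two conditions this patch adds to the scan paths can be written down as pure predicates. The block below is a stand-alone sketch under stated assumptions, not the backend code: DemoXid, DemoXactStatus, demo_scan_must_error and demo_heap_call_is_improper are hypothetical names, and the status enum stands in for what TransactionIdIsInProgress() and TransactionIdDidCommit() report about CheckXidAlive.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for TransactionId and the transaction's fate. */
typedef uint32_t DemoXid;
#define DEMO_INVALID_XID ((DemoXid) 0)

typedef enum DemoXactStatus
{
    DEMO_IN_PROGRESS,
    DEMO_COMMITTED,
    DEMO_ABORTED
} DemoXactStatus;

/*
 * Models the test added to the systable_* scan functions: the scan errors
 * out (ERRCODE_TRANSACTION_ROLLBACK in the patch) only when a transaction
 * is being tracked and it is neither in progress nor committed.
 */
bool
demo_scan_must_error(DemoXid check_xid_alive, DemoXactStatus status)
{
    bool in_progress = (status == DEMO_IN_PROGRESS);
    bool committed = (status == DEMO_COMMITTED);

    return check_xid_alive != DEMO_INVALID_XID && !in_progress && !committed;
}

/*
 * Models the guard added to the heap_* functions: while a transaction is
 * being tracked, a direct heap_* call on a relation that is neither a
 * system catalog nor a user catalog table is rejected.
 */
bool
demo_heap_call_is_improper(DemoXid check_xid_alive, bool is_catalog_rel)
{
    return check_xid_alive != DEMO_INVALID_XID && !is_catalog_rel;
}
```

When no transaction is being tracked (an invalid xid in this model), both predicates are false, so scans outside of in-progress decoding are unaffected.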
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index a89e4d5..d76afbd 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -421,7 +421,10 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
ALTER TABLE user_catalog_table SET (user_catalog_table = true);
CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
</programlisting>
- Any actions leading to transaction ID assignment are prohibited. That, among others,
+ Note that access to user catalog tables or regular system catalog tables
+ in the output plugins has to be done via the <literal>systable_*</literal> scan APIs only.
+ Access via the <literal>heap_*</literal> scan APIs will error out.
+ Additionally, any actions leading to transaction ID assignment are prohibited. That, among others,
includes writing to tables, performing DDL changes, and
calling <literal>txid_current()</literal>.
</para>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 9650145..d909ccd 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1846,6 +1846,17 @@ heap_update_snapshot(HeapScanDesc scan, Snapshot snapshot)
HeapTuple
heap_getnext(HeapScanDesc scan, ScanDirection direction)
{
+ /*
+ * We don't expect direct calls to heap_getnext with valid
+ * CheckXidAlive for regular tables. Track that below.
+ */
+ if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+ !(IsCatalogRelation(scan->rs_rd) ||
+ RelationIsUsedAsCatalogTable(scan->rs_rd))))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+ errmsg("improper heap_getnext call")));
+
/* Note: no locking manipulations needed */
HEAPDEBUG_1; /* heap_getnext( info ) */
@@ -1927,6 +1938,16 @@ heap_fetch(Relation relation,
bool valid;
/*
+ * We don't expect direct calls to heap_fetch with valid
+ * CheckXidAlive for regular tables. Track that below.
+ */
+ if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+ !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+ errmsg("improper heap_fetch call")));
+
+ /*
* Fetch and pin the appropriate page of the relation.
*/
buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
@@ -2058,6 +2079,16 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
bool valid;
bool skip;
+ /*
+ * We don't expect direct calls to heap_hot_search_buffer with
+ * valid CheckXidAlive for regular tables. Track that below.
+ */
+ if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+ !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+ errmsg("improper heap_hot_search_buffer call")));
+
/* If this is not the first call, previous call returned a (live!) tuple */
if (all_dead)
*all_dead = first_call;
@@ -2199,6 +2230,16 @@ heap_hot_search(ItemPointer tid, Relation relation, Snapshot snapshot,
Buffer buffer;
HeapTupleData heapTuple;
+ /*
+ * We don't expect direct calls to heap_hot_search with
+ * valid CheckXidAlive for regular tables. Track that below.
+ */
+ if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+ !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+ errmsg("improper heap_hot_search call")));
+
buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
LockBuffer(buffer, BUFFER_LOCK_SHARE);
result = heap_hot_search_buffer(tid, relation, buffer, snapshot,
@@ -2228,6 +2269,16 @@ heap_get_latest_tid(Relation relation,
ItemPointerData ctid;
TransactionId priorXmax;
+ /*
+ * We don't expect direct calls to heap_get_latest_tid with valid
+ * CheckXidAlive for regular tables. Track that below.
+ */
+ if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+ !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+ errmsg("improper heap_get_latest_tid call")));
+
/* this is to avoid Assert failures on bad input */
if (!ItemPointerIsValid(tid))
return;
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 9d08775..9220dcc 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -25,6 +25,7 @@
#include "lib/stringinfo.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
+#include "storage/procarray.h"
#include "utils/acl.h"
#include "utils/builtins.h"
#include "utils/lsyscache.h"
@@ -423,6 +424,17 @@ systable_getnext(SysScanDesc sysscan)
else
htup = heap_getnext(sysscan->scan, ForwardScanDirection);
+ /*
+ * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+ * error out
+ */
+ if (TransactionIdIsValid(CheckXidAlive) &&
+ !TransactionIdIsInProgress(CheckXidAlive) &&
+ !TransactionIdDidCommit(CheckXidAlive))
+ ereport(ERROR,
+ (errcode(ERRCODE_TRANSACTION_ROLLBACK),
+ errmsg("transaction aborted during system catalog scan")));
+
return htup;
}
@@ -476,6 +488,18 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
result = HeapTupleSatisfiesVisibility(tup, freshsnap, scan->rs_cbuf);
LockBuffer(scan->rs_cbuf, BUFFER_LOCK_UNLOCK);
}
+
+ /*
+ * If CheckXidAlive is valid, check whether that transaction has aborted
+ * concurrently. If it has, error out.
+ */
+ if (TransactionIdIsValid(CheckXidAlive) &&
+ !TransactionIdIsInProgress(CheckXidAlive) &&
+ !TransactionIdDidCommit(CheckXidAlive))
+ ereport(ERROR,
+ (errcode(ERRCODE_TRANSACTION_ROLLBACK),
+ errmsg("transaction aborted during system catalog scan")));
+
return result;
}
@@ -593,6 +617,17 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
if (htup && sysscan->iscan->xs_recheck)
elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
+ /*
+ * If CheckXidAlive is valid, check whether that transaction has aborted
+ * concurrently. If it has, error out.
+ */
+ if (TransactionIdIsValid(CheckXidAlive) &&
+ !TransactionIdIsInProgress(CheckXidAlive) &&
+ !TransactionIdDidCommit(CheckXidAlive))
+ ereport(ERROR,
+ (errcode(ERRCODE_TRANSACTION_ROLLBACK),
+ errmsg("transaction aborted during system catalog scan")));
+
return htup;
}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index be08ccf..3266bee 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -763,6 +763,9 @@ abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
LogicalErrorCallbackState state;
ErrorContextCallback errcallback;
+ if (!ctx->callbacks.abort_cb)
+ return;
+
/* Push callback + info on the error context stack */
state.ctx = ctx;
state.callback_name = "abort";
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index ffcc5c0..b2f50b6 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -636,7 +636,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
/* setup snapshot to allow catalog access */
- SetupHistoricSnapshot(snapshot_now, NULL);
+ SetupHistoricSnapshot(snapshot_now, NULL, xid);
PG_TRY();
{
rb->message(rb, txn, lsn, false, prefix, message_size, message);
@@ -1442,6 +1442,7 @@ ReorderBufferCommitInternal(ReorderBufferTXN *txn,
volatile CommandId command_id = FirstCommandId;
bool using_subtxn;
ReorderBufferIterTXNState *volatile iterstate = NULL;
+ MemoryContext ccxt = CurrentMemoryContext;
txn->final_lsn = commit_lsn;
txn->end_lsn = end_lsn;
@@ -1468,7 +1469,7 @@ ReorderBufferCommitInternal(ReorderBufferTXN *txn,
ReorderBufferBuildTupleCidHash(rb, txn);
/* setup the initial snapshot */
- SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+ SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
/*
* Decoding needs access to syscaches et al., which in turn use
@@ -1719,7 +1720,7 @@ ReorderBufferCommitInternal(ReorderBufferTXN *txn,
/* and continue with the new one */
- SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+ SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
break;
case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
@@ -1739,7 +1740,7 @@ ReorderBufferCommitInternal(ReorderBufferTXN *txn,
snapshot_now->curcid = command_id;
TeardownHistoricSnapshot(false);
- SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+ SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
/*
* Every time the CommandId is incremented, we could
@@ -1824,6 +1825,20 @@ ReorderBufferCommitInternal(ReorderBufferTXN *txn,
PG_CATCH();
{
/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
+ MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+ ErrorData *errdata = CopyErrorData();
+
+ /*
+ * If the catalog scan raised a transaction-rollback error, invoke
+ * the abort callback so that the receiving side aborts as well.
+ */
+ if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
+ {
+ elog(LOG, "stopping decoding of xid %u (gid \"%s\")",
+ txn->xid, txn->gid ? txn->gid : "");
+ rb->abort(rb, txn, commit_lsn);
+ }
+
if (iterstate)
ReorderBufferIterTXNFinish(rb, iterstate);
@@ -1847,7 +1862,14 @@ ReorderBufferCommitInternal(ReorderBufferTXN *txn,
/* remove potential on-disk data, and deallocate */
ReorderBufferCleanupTXN(rb, txn);
- PG_RE_THROW();
+ /* re-throw only if it's not an abort */
+ if (errdata->sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK)
+ {
+ MemoryContextSwitchTo(ecxt);
+ PG_RE_THROW();
+ }
+ else
+ FlushErrorState();
}
PG_END_TRY();
}
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index edf59ef..0354fc9 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -152,6 +152,13 @@ static Snapshot CatalogSnapshot = NULL;
static Snapshot HistoricSnapshot = NULL;
/*
+ * An xid value pointing to a possibly ongoing or a prepared transaction.
+ * Currently used in logical decoding. It's possible that such transactions
+ * can get aborted while the decoding is ongoing.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+
+/*
* These are updated by GetSnapshotData. We initialize them this way
* for the convenience of TransactionIdIsInProgress: even in bootstrap
* mode, we don't want it to say that BootstrapTransactionId is in progress.
@@ -1995,10 +2002,14 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
* Setup a snapshot that replaces normal catalog snapshots that allows catalog
* access to behave just like it did at a certain point in the past.
*
+ * If a valid xid is passed in and it has not committed, track it in
+ * CheckXidAlive so its status can be re-checked during catalog access.
+ *
* Needed for logical decoding.
*/
void
-SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
+SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids,
+ TransactionId snapshot_xid)
{
Assert(historic_snapshot != NULL);
@@ -2007,8 +2018,17 @@ SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
/* setup (cmin, cmax) lookup hash */
tuplecid_data = tuplecids;
-}
+ /*
+ * Set up CheckXidAlive if the xid has not committed yet. We don't check
+ * here whether the xid aborted; that happens during catalog access.
+ */
+ if (TransactionIdIsValid(snapshot_xid) &&
+ !TransactionIdDidCommit(snapshot_xid))
+ CheckXidAlive = snapshot_xid;
+ else
+ CheckXidAlive = InvalidTransactionId;
+}
/*
* Make catalog snapshots behave normally again.
@@ -2018,6 +2038,7 @@ TeardownHistoricSnapshot(bool is_error)
{
HistoricSnapshot = NULL;
tuplecid_data = NULL;
+ CheckXidAlive = InvalidTransactionId;
}
bool
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index 83806f3..bad2053 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -100,8 +100,10 @@ extern char *ExportSnapshot(Snapshot snapshot);
/* Support for catalog timetravel for logical decoding */
struct HTAB;
+extern TransactionId CheckXidAlive;
extern struct HTAB *HistoricSnapshotGetTupleCids(void);
-extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids);
+extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids,
+ TransactionId snapshot_xid);
extern void TeardownHistoricSnapshot(bool is_error);
extern bool HistoricSnapshotActive(void);
--
2.7.4
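
The CheckXidAlive mechanism in the patch above can be pictured with a small standalone model. This is only a sketch and not PostgreSQL code: the XactStatus enum, the single tracked_status variable and all function names are hypothetical stand-ins for TransactionIdIsInProgress() and TransactionIdDidCommit(). It shows the intended state machine: the xid is tracked only if it has not committed when the historic snapshot is set up, and a catalog scan must error out once that xid is neither in progress nor committed.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

/* Hypothetical stand-in for the real transaction status lookups. */
typedef enum
{
    XACT_IN_PROGRESS,
    XACT_COMMITTED,
    XACT_ABORTED
} XactStatus;

static TransactionId check_xid_alive = InvalidTransactionId;
static XactStatus tracked_status = XACT_IN_PROGRESS;

/* Models SetupHistoricSnapshot(): track the xid only if it is uncommitted. */
void
setup_snapshot(TransactionId xid, XactStatus status)
{
    tracked_status = status;
    if (xid != InvalidTransactionId && status != XACT_COMMITTED)
        check_xid_alive = xid;
    else
        check_xid_alive = InvalidTransactionId;
}

/* Models TeardownHistoricSnapshot(): stop tracking. */
void
teardown_snapshot(void)
{
    check_xid_alive = InvalidTransactionId;
}

/* Models a concurrent status change, e.g. ROLLBACK PREPARED elsewhere. */
void
set_status(XactStatus status)
{
    tracked_status = status;
}

/* Models the check added to the systable_* scans: true means "error out". */
bool
catalog_scan_must_abort(void)
{
    return check_xid_alive != InvalidTransactionId &&
           tracked_status != XACT_IN_PROGRESS &&
           tracked_status != XACT_COMMITTED;
}
```

In this model a scan proceeds normally while the tracked transaction is in progress, and starts failing only after set_status(XACT_ABORTED); a transaction that was already committed at setup time is never tracked.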
Attachment: 0004-Teach-test_decoding-plugin-to-work-with-2PC.Nov30.patch (application/octet-stream)
From c82a74ec637a4d84826958260d7702fb9f880dc5 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Wed, 13 Jun 2018 16:31:15 +0530
Subject: [PATCH 4/4] Teach test_decoding plugin to work with 2PC
Implement all callbacks required for decoding 2PC in the test_decoding
plugin, and add the relevant test cases.
Additionally, add a new option "check-xid". If this option is set to a
valid xid, pg_decode_change() waits for that transaction to be aborted
externally. This lets us test a concurrent rollback of a prepared
transaction while it is being decoded.
---
contrib/test_decoding/Makefile | 5 +-
contrib/test_decoding/expected/prepared.out | 185 ++++++++++++++++++++++++----
contrib/test_decoding/sql/prepared.sql | 77 ++++++++++--
contrib/test_decoding/t/001_twophase.pl | 119 ++++++++++++++++++
contrib/test_decoding/test_decoding.c | 179 +++++++++++++++++++++++++++
5 files changed, 532 insertions(+), 33 deletions(-)
create mode 100644 contrib/test_decoding/t/001_twophase.pl
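
A note on parsing the "check-xid" option: calling strtoul() with a NULL end pointer cannot detect an empty string or trailing garbage, and the C standard does not require errno to be set when no digits are converted. The following standalone sketch shows a stricter parse. It is an illustration under assumptions, not part of the patch: parse_check_xid is a hypothetical helper, and TransactionId is redefined locally as a 32-bit unsigned integer so the example compiles outside the PostgreSQL tree.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t TransactionId;   /* local stand-in for the PostgreSQL type */

/*
 * Hypothetical helper: parse a "check-xid" option value.
 * Returns true and stores the xid on success, false on any malformed input.
 */
bool
parse_check_xid(const char *str, TransactionId *xid_out)
{
    char          *end;
    unsigned long  val;

    if (str == NULL || *str == '\0')
        return false;

    /* strtoul() accepts a leading '-' and negates the value; reject it. */
    if (strchr(str, '-') != NULL)
        return false;

    errno = 0;
    val = strtoul(str, &end, 0);

    /* overflow, no digits consumed, or trailing characters */
    if (errno != 0 || end == str || *end != '\0')
        return false;

    /* 0 is the invalid xid; larger values do not fit in 32 bits */
    if (val == 0 || val > UINT32_MAX)
        return false;

    *xid_out = (TransactionId) val;
    return true;
}
```

With this helper, "739" and "0x10" parse successfully (base 0 accepts hexadecimal), while "", "abc", "12abc", "0", "-5" and "4294967296" are all rejected. In the plugin, a false return would map onto the existing ereport() with ERRCODE_INVALID_PARAMETER_VALUE.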
diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index afcab93..3f0b1c6 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -26,7 +26,7 @@ installcheck:;
# installation, allow to do so, but only if requested explicitly.
installcheck-force: regresscheck-install-force isolationcheck-install-force
-check: regresscheck isolationcheck
+check: regresscheck isolationcheck 2pc-check
submake-regress:
$(MAKE) -C $(top_builddir)/src/test/regress all
@@ -67,3 +67,6 @@ isolationcheck-install-force: all | submake-isolation submake-test_decoding temp
isolationcheck isolationcheck-install-force
temp-install: EXTRA_INSTALL=contrib/test_decoding
+
+2pc-check: temp-install
+ $(prove_check)
diff --git a/contrib/test_decoding/expected/prepared.out b/contrib/test_decoding/expected/prepared.out
index 46e915d..934c8f1 100644
--- a/contrib/test_decoding/expected/prepared.out
+++ b/contrib/test_decoding/expected/prepared.out
@@ -6,19 +6,50 @@ SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_d
init
(1 row)
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
-- test simple successful use of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (1);
PREPARE TRANSACTION 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ PREPARE TRANSACTION 'test_prepared#1'
+(3 rows)
+
COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
INSERT INTO test_prepared1 VALUES (2);
-- test abort of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (3);
PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(6 rows)
+
ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
INSERT INTO test_prepared1 VALUES (4);
-- test prepared xact containing ddl
BEGIN;
@@ -26,45 +57,149 @@ INSERT INTO test_prepared1 VALUES (5);
ALTER TABLE test_prepared1 ADD COLUMN data text;
INSERT INTO test_prepared1 VALUES (6, 'frakbar');
PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
-INSERT INTO test_prepared2 VALUES (7);
-COMMIT PREPARED 'test_prepared#3';
--- make sure stuff still works
-INSERT INTO test_prepared1 VALUES (8);
-INSERT INTO test_prepared2 VALUES (9);
--- cleanup
-DROP TABLE test_prepared1;
-DROP TABLE test_prepared2;
--- show results
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'test_prepared1'::regclass;
+ relation | locktype | mode
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
data
-------------------------------------------------------------------------
BEGIN
- table public.test_prepared1: INSERT: id[integer]:1
- COMMIT
- BEGIN
- table public.test_prepared1: INSERT: id[integer]:2
- COMMIT
- BEGIN
table public.test_prepared1: INSERT: id[integer]:4
COMMIT
BEGIN
- table public.test_prepared2: INSERT: id[integer]:7
- COMMIT
- BEGIN
table public.test_prepared1: INSERT: id[integer]:5
table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(7 rows)
+
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:7
COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#3'
+(1 row)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (8);
+INSERT INTO test_prepared2 VALUES (9);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+--------------------------------------------------------------------
BEGIN
table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
COMMIT
BEGIN
table public.test_prepared2: INSERT: id[integer]:9
COMMIT
-(22 rows)
+(6 rows)
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'pg_class'::regclass;
+ relation | locktype | mode
+----------+----------+------
+(0 rows)
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:12
+ PREPARE TRANSACTION 'test_prepared_lock2'
+ COMMIT PREPARED 'test_prepared_lock2'
+(8 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- will work normally after we commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+--------------------------------------
+ COMMIT PREPARED 'test_prepared_lock'
+(1 row)
+
+-- test savepoints
+BEGIN;
+SAVEPOINT test_savepoint;
+CREATE TABLE test_prepared_savepoint (a int);
+PREPARE TRANSACTION 'test_prepared_savepoint';
+COMMIT PREPARED 'test_prepared_savepoint';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+-------------------------------------------
+ COMMIT PREPARED 'test_prepared_savepoint'
+(1 row)
+
+-- test that a GID containing "nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+------
+(0 rows)
+
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+---------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:20 data[text]:null
+ COMMIT
+(3 rows)
+
+-- cleanup
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+------
+(0 rows)
SELECT pg_drop_replication_slot('regression_slot');
pg_drop_replication_slot
diff --git a/contrib/test_decoding/sql/prepared.sql b/contrib/test_decoding/sql/prepared.sql
index e726397..6072541 100644
--- a/contrib/test_decoding/sql/prepared.sql
+++ b/contrib/test_decoding/sql/prepared.sql
@@ -2,21 +2,25 @@
SET synchronous_commit = on;
SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
-- test simple successful use of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (1);
PREPARE TRANSACTION 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
INSERT INTO test_prepared1 VALUES (2);
-- test abort of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (3);
PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
INSERT INTO test_prepared1 VALUES (4);
@@ -27,24 +31,83 @@ ALTER TABLE test_prepared1 ADD COLUMN data text;
INSERT INTO test_prepared1 VALUES (6, 'frakbar');
PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'test_prepared1'::regclass;
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
-- make sure stuff still works
INSERT INTO test_prepared1 VALUES (8);
INSERT INTO test_prepared2 VALUES (9);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'pg_class'::regclass;
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+RESET statement_timeout;
+
+COMMIT PREPARED 'test_prepared_lock';
+
+-- will work normally after we commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- test savepoints
+BEGIN;
+SAVEPOINT test_savepoint;
+CREATE TABLE test_prepared_savepoint (a int);
+PREPARE TRANSACTION 'test_prepared_savepoint';
+COMMIT PREPARED 'test_prepared_savepoint';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- test that a GID containing "nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
-- cleanup
DROP TABLE test_prepared1;
DROP TABLE test_prepared2;
--- show results
+-- show results. There should be nothing to show
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000..50f269b
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,119 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+ 'postgresql.conf', qq(
+ max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+ "INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+ "SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# This test covers a concurrent abort while logical decoding is ongoing.
+# We pass the xid of the 2PC transaction to the plugin as an option. On
+# receipt of a valid "check-xid", the change callback in the test_decoding
+# plugin waits for that transaction to be aborted.
+#
+# We fire off a ROLLBACK PREPARED from another session while the decode
+# is waiting.
+#
+# The status of "check-xid" then changes from in-progress to not-committed
+# (hence aborted), and decoding stops because the subsequent
+# system catalog scan errors out.
+
+$node_logical->safe_psql('postgres', "
+ BEGIN;
+ INSERT INTO tab VALUES (11);
+ INSERT INTO tab VALUES (12);
+ ALTER TABLE tab ADD COLUMN b INT;
+ INSERT INTO tab VALUES (13,14);
+ PREPARE TRANSACTION 'test_prepared_tab';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test_prepared_tab'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# start decoding the above by passing the "check-xid"
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now; the output should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid', '$xid2pc');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+ or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stopping decoding of xid $xid2pc")
+ or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+ BEGIN;
+ INSERT INTO tab VALUES (11);
+ INSERT INTO tab VALUES (12);
+ ALTER TABLE tab ADD COLUMN b INT;
+ INSERT INTO tab VALUES (13, 11);
+ PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit after the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+ my ($expected) = @_;
+
+ $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+ my $max_attempts = 180 * 10;
+ my $attempts = 0;
+
+ my $output_file = '';
+ while ($attempts < $max_attempts)
+ {
+ $output_file = slurp_file($node_logical->logfile());
+
+ if ($output_file =~ $expected)
+ {
+ return 1;
+ }
+
+ # Wait 0.1 second before retrying.
+ usleep(100_000);
+ $attempts++;
+ }
+
+ # The output did not change within 180 seconds. Give up.
+ return 0;
+}
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index f6e77fb..bfa43e3 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -11,12 +11,16 @@
*-------------------------------------------------------------------------
*/
#include "postgres.h"
+#include "miscadmin.h"
+#include "access/transam.h"
#include "catalog/pg_type.h"
#include "replication/logical.h"
#include "replication/origin.h"
+#include "storage/procarray.h"
+
#include "utils/builtins.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
@@ -36,6 +40,7 @@ typedef struct
bool skip_empty_xacts;
bool xact_wrote_changes;
bool only_local;
+ TransactionId check_xid; /* track abort of this txid */
} TestDecodingData;
static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -49,6 +54,8 @@ static void pg_output_begin(LogicalDecodingContext *ctx,
bool last_write);
static void pg_decode_commit_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pg_decode_abort_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr abort_lsn);
static void pg_decode_change(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, Relation rel,
ReorderBufferChange *change);
@@ -62,6 +69,18 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr message_lsn,
bool transactional, const char *prefix,
Size sz, const char *message);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
void
_PG_init(void)
@@ -80,9 +99,14 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
cb->change_cb = pg_decode_change;
cb->truncate_cb = pg_decode_truncate;
cb->commit_cb = pg_decode_commit_txn;
+ cb->abort_cb = pg_decode_abort_txn;
cb->filter_by_origin_cb = pg_decode_filter;
cb->shutdown_cb = pg_decode_shutdown;
cb->message_cb = pg_decode_message;
+ cb->filter_prepare_cb = pg_decode_filter_prepare;
+ cb->prepare_cb = pg_decode_prepare_txn;
+ cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+ cb->abort_prepared_cb = pg_decode_abort_prepared_txn;
}
@@ -102,11 +126,14 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
data->include_timestamp = false;
data->skip_empty_xacts = false;
data->only_local = false;
+ data->check_xid = InvalidTransactionId;
ctx->output_plugin_private = data;
opt->output_type = OUTPUT_PLUGIN_TEXTUAL_OUTPUT;
opt->receive_rewrites = false;
+ /* this plugin supports decoding of 2pc */
+ opt->enable_twophase = true;
foreach(option, ctx->output_plugin_options)
{
@@ -183,6 +210,32 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
errmsg("could not parse value \"%s\" for parameter \"%s\"",
strVal(elem->arg), elem->defname)));
}
+ else if (strcmp(elem->defname, "check-xid") == 0)
+ {
+ if (elem->arg)
+ {
+ errno = 0;
+ data->check_xid = (TransactionId)
+ strtoul(strVal(elem->arg), NULL, 0);
+
+ if (errno == EINVAL || errno == ERANGE)
+ ereport(FATAL,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("check-xid is not a valid number: \"%s\"",
+ strVal(elem->arg))));
+ }
+ else
+ ereport(FATAL,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("check-xid needs an input value")));
+
+ if (!TransactionIdIsValid(data->check_xid))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("Specify positive value for parameter \"%s\","
+ " you specified \"%s\"",
+ elem->defname, strVal(elem->arg))));
+ }
else
{
ereport(ERROR,
@@ -251,6 +304,116 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
OutputPluginWrite(ctx, true);
}
+/* ABORT callback */
+static void
+pg_decode_abort_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+ if (data->include_xids)
+ appendStringInfo(ctx->out, "ABORT %u", txn->xid);
+ else
+ appendStringInfoString(ctx->out, "ABORT");
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/*
+ * Filter out two-phase transactions.
+ *
+ * Each plugin can implement its own filtering logic. Here we
+ * demonstrate a simple filter based on the GID: if the GID
+ * contains the "_nodecode" substring, the transaction is
+ * filtered out.
+ */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid)
+{
+ if (strstr(gid, "_nodecode") != NULL)
+ return true;
+
+ return false;
+}
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* ABORT PREPARED callback */
+static void
+pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
static bool
pg_decode_filter(LogicalDecodingContext *ctx,
RepOriginId origin_id)
@@ -409,6 +572,22 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
}
data->xact_wrote_changes = true;
+ /* if check_xid is specified */
+ if (TransactionIdIsValid(data->check_xid))
+ {
+ elog(LOG, "waiting for %u to abort", data->check_xid);
+ while (TransactionIdIsInProgress(data->check_xid))
+ {
+ CHECK_FOR_INTERRUPTS();
+ pg_usleep(10000L);
+ }
+ if (!TransactionIdIsInProgress(data->check_xid) &&
+ !TransactionIdDidCommit(data->check_xid))
+ elog(LOG, "%u aborted", data->check_xid);
+
+ Assert(TransactionIdDidAbort(data->check_xid));
+ }
+
class_form = RelationGetForm(relation);
tupdesc = RelationGetDescr(relation);
--
2.7.4
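Two pieces of the test_decoding changes above can be tried outside the server. The following is a minimal standalone sketch, not code from the patch: demo_filter_prepare and demo_parse_xid are hypothetical names. The first mirrors the "_nodecode" GID check in pg_decode_filter_prepare. The second shows one stricter way to validate a check-xid style option, assuming we also want to reject input with no digits or trailing garbage, which a check of errno alone may not catch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: true means "do not decode this GID at PREPARE". */
static bool
demo_filter_prepare(const char *gid)
{
    assert(gid != NULL);
    return strstr(gid, "_nodecode") != NULL;
}

/*
 * Hypothetical helper: parse a transaction id given as text.
 * Returns false for empty input, trailing garbage, overflow, or zero
 * (zero plays the role of an invalid xid in this sketch).
 */
static bool
demo_parse_xid(const char *s, uint32_t *xid_out)
{
    char          *end = NULL;
    unsigned long  val;

    assert(s != NULL && xid_out != NULL);

    errno = 0;
    val = strtoul(s, &end, 0);

    /* end == s means no digits were consumed at all */
    if (errno != 0 || end == s || *end != '\0')
        return false;
    if (val == 0 || val > UINT32_MAX)
        return false;

    *xid_out = (uint32_t) val;
    return true;
}
```

With these helpers, a GID such as 'test_prepared_nodecode' is filtered while 'test_prepared#3' is not, and an option value such as "abc" or "12x" is rejected instead of being silently read as a number.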
Hi Nikhil,
Thanks for the updated patch - I've started working on a review, with
the hope of getting it committed sometime in 2019-01. But the patch
bit-rotted again a bit (probably due to d3c09b9b), which broke the last
part. Can you post a fixed version?
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
Hi Nikhil,
Thanks for the updated patch - I've started working on a review, with
the hope of getting it committed sometime in 2019-01. But the patch
bit-rotted again a bit (probably due to d3c09b9b), which broke the last
part. Can you post a fixed version?
Please also note that at some time the thread was torn and continued in
another place:
/messages/by-id/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
And now we have two branches =(
I hadn't checked whether my concerns were addressed in the latest
version though.
--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 12/18/18 10:28 AM, Arseny Sher wrote:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
Hi Nikhil,
Thanks for the updated patch - I've started working on a review, with
the hope of getting it committed sometime in 2019-01. But the patch
bit-rotted again a bit (probably due to d3c09b9b), which broke the last
part. Can you post a fixed version?
Please also note that at some time the thread was torn and continued in
another place:
/messages/by-id/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
And now we have two branches =(
Thanks for pointing that out - I've added the other thread to the CF
entry, so that we don't lose it.
I hadn't checked whether my concerns were addressed in the latest
version though.
OK, I'll read through the other thread and will check. Or perhaps Nikhil
can comment on that.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi Tomas,
Thanks for the updated patch - I've started working on a review, with
the hope of getting it committed sometime in 2019-01. But the patch
bit-rotted again a bit (probably due to d3c09b9b), which broke the last
part. Can you post a fixed version?
PFA, updated patch set.
Regards,
Nikhil
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Nikhil Sontakke
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/
Attachments:
0001-Cleaning-up-of-flags-in-ReorderBufferTXN-structure.Jan4.patch (application/octet-stream)
From 911d6fe63978e031b84f6dc8bb9be1533c8776e8 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Wed, 13 Jun 2018 16:15:24 +0530
Subject: [PATCH 1/4] Cleaning up of flags in ReorderBufferTXN structure
---
src/backend/replication/logical/reorderbuffer.c | 34 ++++++++++++-------------
src/include/replication/reorderbuffer.h | 33 ++++++++++++++----------
2 files changed, 37 insertions(+), 30 deletions(-)
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b79ce5db95..3d287c0eb7 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -680,7 +680,7 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
Assert(prev_first_lsn < cur_txn->first_lsn);
/* known-as-subtxn txns must not be listed */
- Assert(!cur_txn->is_known_as_subxact);
+ Assert(!rbtxn_is_known_subxact(cur_txn));
prev_first_lsn = cur_txn->first_lsn;
}
@@ -700,7 +700,7 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
Assert(prev_base_snap_lsn < cur_txn->base_snapshot_lsn);
/* known-as-subtxn txns must not be listed */
- Assert(!cur_txn->is_known_as_subxact);
+ Assert(!rbtxn_is_known_subxact(cur_txn));
prev_base_snap_lsn = cur_txn->base_snapshot_lsn;
}
@@ -723,7 +723,7 @@ ReorderBufferGetOldestTXN(ReorderBuffer *rb)
txn = dlist_head_element(ReorderBufferTXN, node, &rb->toplevel_by_lsn);
- Assert(!txn->is_known_as_subxact);
+ Assert(!rbtxn_is_known_subxact(txn));
Assert(txn->first_lsn != InvalidXLogRecPtr);
return txn;
}
@@ -783,7 +783,7 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
if (!new_sub)
{
- if (subtxn->is_known_as_subxact)
+ if (rbtxn_is_known_subxact(subtxn))
{
/* already associated, nothing to do */
return;
@@ -799,7 +799,7 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
}
}
- subtxn->is_known_as_subxact = true;
+ subtxn->txn_flags |= RBTXN_IS_SUBXACT;
subtxn->toplevel_xid = xid;
Assert(subtxn->nsubtxns == 0);
@@ -1009,7 +1009,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
ReorderBufferChange *cur_change;
- if (txn->serialized)
+ if (rbtxn_is_serialized(txn))
{
/* serialize remaining changes */
ReorderBufferSerializeTXN(rb, txn);
@@ -1038,7 +1038,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
ReorderBufferChange *cur_change;
- if (cur_txn->serialized)
+ if (rbtxn_is_serialized(cur_txn))
{
/* serialize remaining changes */
ReorderBufferSerializeTXN(rb, cur_txn);
@@ -1204,7 +1204,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
* they originally were happening inside another subtxn, so we won't
* ever recurse more than one level deep here.
*/
- Assert(subtxn->is_known_as_subxact);
+ Assert(rbtxn_is_known_subxact(subtxn));
Assert(subtxn->nsubtxns == 0);
ReorderBufferCleanupTXN(rb, subtxn);
@@ -1245,7 +1245,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
/*
* Remove TXN from its containing list.
*
- * Note: if txn->is_known_as_subxact, we are deleting the TXN from its
+ * Note: if txn is known as subxact, we are deleting the TXN from its
* parent's list of known subxacts; this leaves the parent's nsubxacts
* count too high, but we don't care. Otherwise, we are deleting the TXN
* from the LSN-ordered list of toplevel TXNs.
@@ -1260,7 +1260,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
Assert(found);
/* remove entries spilled to disk */
- if (txn->serialized)
+ if (rbtxn_is_serialized(txn))
ReorderBufferRestoreCleanup(rb, txn);
/* deallocate */
@@ -1277,7 +1277,7 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
dlist_iter iter;
HASHCTL hash_ctl;
- if (!txn->has_catalog_changes || dlist_is_empty(&txn->tuplecids))
+ if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
return;
memset(&hash_ctl, 0, sizeof(hash_ctl));
@@ -1901,7 +1901,7 @@ ReorderBufferAbortOld(ReorderBuffer *rb, TransactionId oldestRunningXid)
* final_lsn to that of their last change; this causes
* ReorderBufferRestoreCleanup to do the right thing.
*/
- if (txn->serialized && txn->final_lsn == 0)
+ if (rbtxn_is_serialized(txn) && txn->final_lsn == 0)
{
ReorderBufferChange *last =
dlist_tail_element(ReorderBufferChange, node, &txn->changes);
@@ -2049,7 +2049,7 @@ ReorderBufferSetBaseSnapshot(ReorderBuffer *rb, TransactionId xid,
* operate on its top-level transaction instead.
*/
txn = ReorderBufferTXNByXid(rb, xid, true, &is_new, lsn, true);
- if (txn->is_known_as_subxact)
+ if (rbtxn_is_known_subxact(txn))
txn = ReorderBufferTXNByXid(rb, txn->toplevel_xid, false,
NULL, InvalidXLogRecPtr, false);
Assert(txn->base_snapshot == NULL);
@@ -2156,7 +2156,7 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
- txn->has_catalog_changes = true;
+ txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
}
/*
@@ -2173,7 +2173,7 @@ ReorderBufferXidHasCatalogChanges(ReorderBuffer *rb, TransactionId xid)
if (txn == NULL)
return false;
- return txn->has_catalog_changes;
+ return rbtxn_has_catalog_changes(txn);
}
/*
@@ -2193,7 +2193,7 @@ ReorderBufferXidHasBaseSnapshot(ReorderBuffer *rb, TransactionId xid)
return false;
/* a known subtxn? operate on top-level txn instead */
- if (txn->is_known_as_subxact)
+ if (rbtxn_is_known_subxact(txn))
txn = ReorderBufferTXNByXid(rb, txn->toplevel_xid, false,
NULL, InvalidXLogRecPtr, false);
@@ -2314,7 +2314,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
Assert(spilled == txn->nentries_mem);
Assert(dlist_is_empty(&txn->changes));
txn->nentries_mem = 0;
- txn->serialized = true;
+ txn->txn_flags |= RBTXN_IS_SERIALIZED;
if (fd != -1)
CloseTransientFile(fd);
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 7646b60f94..a67b2fd1d9 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -150,18 +150,34 @@ typedef struct ReorderBufferChange
dlist_node node;
} ReorderBufferChange;
+/* ReorderBufferTXN flags */
+#define RBTXN_HAS_CATALOG_CHANGES 0x0001
+#define RBTXN_IS_SUBXACT 0x0002
+#define RBTXN_IS_SERIALIZED 0x0004
+
+/* does the txn have catalog changes */
+#define rbtxn_has_catalog_changes(txn) (txn->txn_flags & RBTXN_HAS_CATALOG_CHANGES)
+/* is the txn known as a subxact? */
+#define rbtxn_is_known_subxact(txn) (txn->txn_flags & RBTXN_IS_SUBXACT)
+/*
+ * Has this transaction been spilled to disk? It's not always possible to
+ * deduce that fact by comparing nentries with nentries_mem, because e.g.
+ * subtransactions of a large transaction might get serialized together
+ * with the parent - if they're restored to memory they'd have
+ * nentries_mem == nentries.
+ */
+#define rbtxn_is_serialized(txn) (txn->txn_flags & RBTXN_IS_SERIALIZED)
+
typedef struct ReorderBufferTXN
{
+ int txn_flags;
+
/*
* The transactions transaction id, can be a toplevel or sub xid.
*/
TransactionId xid;
- /* did the TX have catalog changes */
- bool has_catalog_changes;
-
/* Do we know this is a subxact? Xid of top-level txn if so */
- bool is_known_as_subxact;
TransactionId toplevel_xid;
/*
@@ -229,15 +245,6 @@ typedef struct ReorderBufferTXN
*/
uint64 nentries_mem;
- /*
- * Has this transaction been spilled to disk? It's not always possible to
- * deduce that fact by comparing nentries with nentries_mem, because e.g.
- * subtransactions of a large transaction might get serialized together
- * with the parent - if they're restored to memory they'd have
- * nentries_mem == nentries.
- */
- bool serialized;
-
/*
* List of ReorderBufferChange structs, including new Snapshots and new
* CommandIds
--
2.15.2 (Apple Git-101.1)
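The flag cleanup in patch 0001 replaces three bool fields with bits in a single txn_flags integer. As a rough standalone illustration of that pattern (DemoTxn and the DEMO_* and demo_* names are hypothetical and not part of the patch), the sketch below uses the same bit values and tests them with helpers that return a proper bool:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the RBTXN_* flag bits in the patch. */
#define DEMO_HAS_CATALOG_CHANGES 0x0001
#define DEMO_IS_SUBXACT          0x0002
#define DEMO_IS_SERIALIZED       0x0004

typedef struct DemoTxn
{
    int flags;              /* bitwise OR of DEMO_* values */
} DemoTxn;

/* Test a flag; comparing against zero yields a real bool, not a bit value. */
static bool
demo_has_flag(const DemoTxn *txn, int flag)
{
    assert(txn != NULL);
    return (txn->flags & flag) != 0;
}

/* Set a flag without disturbing the other bits. */
static void
demo_set_flag(DemoTxn *txn, int flag)
{
    assert(txn != NULL);
    txn->flags |= flag;
}
```

One design note: the macros in the patch expand to the raw masked value and use their argument unparenthesized, so callers get a nonzero int rather than true. That works in conditions and Assert(), but a helper or a macro of the form (((txn)->txn_flags & FLAG) != 0) is safer if the result is ever stored in a bool-sized field or compared with true.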
0002-Support-decoding-of-two-phase-transactions-at-PREPAR.Jan4.patch (application/octet-stream)
From 2ba4fa4cc543b2bc4b731c96b6dbec45dff0fbcd Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Wed, 13 Jun 2018 16:30:30 +0530
Subject: [PATCH 2/4] Support decoding of two-phase transactions at PREPARE
Until now two-phase transactions were decoded at COMMIT, just like
regular transactions. During replay, two-phase transactions were
translated into regular transactions on the subscriber, and the GID
was not forwarded to it.
This patch allows PREPARE-time decoding of two-phase transactions (if
the output plugin supports this capability), in which case the
transactions are replayed at PREPARE and then committed later when
COMMIT PREPARED arrives.
On the subscriber, the transactions will be executed as two-phase
transactions, with the same GID. This is important for various
external transaction managers that often encode information into
the GID itself.
Includes documentation changes.
---
doc/src/sgml/logicaldecoding.sgml | 127 ++++++++++-
src/backend/replication/logical/decode.c | 266 ++++++++++++++++++------
src/backend/replication/logical/logical.c | 203 ++++++++++++++++++
src/backend/replication/logical/reorderbuffer.c | 185 ++++++++++++++--
src/include/replication/logical.h | 2 +-
src/include/replication/output_plugin.h | 46 ++++
src/include/replication/reorderbuffer.h | 68 ++++++
7 files changed, 814 insertions(+), 83 deletions(-)
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 8db968641e..a89e4d5184 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -385,7 +385,12 @@ typedef struct OutputPluginCallbacks
LogicalDecodeChangeCB change_cb;
LogicalDecodeTruncateCB truncate_cb;
LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeAbortCB abort_cb;
LogicalDecodeMessageCB message_cb;
+ LogicalDecodeFilterPrepareCB filter_prepare_cb;
+ LogicalDecodePrepareCB prepare_cb;
+ LogicalDecodeCommitPreparedCB commit_prepared_cb;
+ LogicalDecodeAbortPreparedCB abort_prepared_cb;
LogicalDecodeFilterByOriginCB filter_by_origin_cb;
LogicalDecodeShutdownCB shutdown_cb;
} OutputPluginCallbacks;
@@ -457,7 +462,13 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
never get
decoded. Successful savepoints are
folded into the transaction containing them in the order they were
- executed within that transaction.
+ executed within that transaction. A transaction that is prepared for
+ a two-phase commit using <command>PREPARE TRANSACTION</command> will
+ also be decoded if the output plugin callbacks needed for decoding
+ it are provided. It is possible that the current transaction which
+ is being decoded is aborted concurrently via a <command>ROLLBACK PREPARED</command>
+ command. In that case, the logical decoding of this transaction will
+ be aborted too.
</para>
<note>
@@ -558,6 +569,71 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
</para>
</sect3>
+ <sect3 id="logicaldecoding-output-plugin-prepare">
+ <title>Transaction Prepare Callback</title>
+
+ <para>
+ The optional <function>prepare_cb</function> callback is called whenever
+ a transaction which is prepared for two-phase commit has been
+ decoded. The <function>change_cb</function> callbacks for all modified
+ rows will have been called before this, if there have been any modified
+ rows.
+<programlisting>
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+ <title>Commit Prepared Transaction Callback</title>
+
+ <para>
+ The optional <function>commit_prepared_cb</function> callback is called whenever
+ a commit prepared transaction has been decoded. The <parameter>gid</parameter> field,
+ which is part of the <parameter>txn</parameter> parameter can be used in this
+ callback.
+<programlisting>
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-abort-prepared">
+ <title>Rollback Prepared Transaction Callback</title>
+
+ <para>
+ The optional <function>abort_prepared_cb</function> callback is called whenever
+ a rollback prepared transaction has been decoded. The <parameter>gid</parameter> field,
+ which is part of the <parameter>txn</parameter> parameter can be used in this
+ callback.
+<programlisting>
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-abort">
+ <title>Transaction Abort Callback</title>
+
+ <para>
+ The required <function>abort_cb</function> callback is called whenever
+ a transaction abort has to be initiated. This can happen if we are
+ decoding a transaction that has been prepared for two-phase commit and
+ a concurrent rollback happens while we are decoding it.
+<programlisting>
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
<sect3 id="logicaldecoding-output-plugin-change">
<title>Change Callback</title>
@@ -567,7 +643,13 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
an <command>INSERT</command>, <command>UPDATE</command>,
or <command>DELETE</command>. Even if the original command modified
several rows at once the callback will be called individually for each
- row.
+ row. The <function>change_cb</function> callback may access system or
+ user catalog tables to aid in the process of outputting the row
+ modification details. When decoding a prepared (but not yet
+ committed) transaction or any other uncommitted transaction, this
+ change callback might also error out due to simultaneous rollback of
+ this very same transaction. In that case, the logical decoding of this
+ aborted transaction is stopped gracefully.
<programlisting>
typedef void (*LogicalDecodeChangeCB) (struct LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
@@ -644,6 +726,39 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
</para>
</sect3>
+ <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+ <title>Prepare Filter Callback</title>
+
+ <para>
+ The optional <function>filter_prepare_cb</function> callback
+ is called to determine whether data that is part of the current
+ two-phase commit transaction should be considered for decode
+ at this prepare stage or as a regular one-phase transaction at
+ <command>COMMIT PREPARED</command> time later. To signal that
+ decoding should be skipped, return <literal>true</literal>;
+ <literal>false</literal> otherwise. When the callback is not
+ defined, <literal>false</literal> is assumed (i.e. nothing is
+ filtered).
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+</programlisting>
+ The <parameter>ctx</parameter> parameter has the same contents
+ as for the other callbacks. The <parameter>txn</parameter> parameter
+ contains meta information about the transaction. The <parameter>xid</parameter>
+ contains the XID because <parameter>txn</parameter> can be NULL in some cases.
+ The <parameter>gid</parameter> is the identifier that later identifies this
+ transaction for <command>COMMIT PREPARED</command> or <command>ROLLBACK PREPARED</command>.
+ </para>
+ <para>
+ The callback has to provide the same static answer for a given combination of
+ <parameter>xid</parameter> and <parameter>gid</parameter> every time it is
+ called.
+ </para>
+ </sect3>
+
<sect3 id="logicaldecoding-output-plugin-message">
<title>Generic Message Callback</title>
@@ -665,7 +780,13 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
non-transactional and the XID was not assigned yet in the transaction
which logged the message. The <parameter>lsn</parameter> has WAL
location of the message. The <parameter>transactional</parameter> says
- if the message was sent as transactional or not.
+ if the message was sent as transactional or not. Similar to the change
+ callback, when decoding a prepared (but not yet committed)
+ transaction or any other uncommitted transaction, this message
+ callback might also error out due to simultaneous rollback of
+ this very same transaction. In that case, the logical decoding of this
+ aborted transaction is stopped gracefully.
+
The <parameter>prefix</parameter> is arbitrary null-terminated prefix
which can be used for identifying interesting messages for the current
plugin. And finally the <parameter>message</parameter> parameter holds
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index eec3a22842..3ae80d9c06 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -34,6 +34,7 @@
#include "access/xlogutils.h"
#include "access/xlogreader.h"
#include "access/xlogrecord.h"
+#include "access/twophase.h"
#include "catalog/pg_control.h"
@@ -73,6 +74,8 @@ static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_commit *parsed, TransactionId xid);
static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare * parsed);
/* common function to decode tuples */
static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -232,17 +235,10 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
{
xl_xact_commit *xlrec;
xl_xact_parsed_commit parsed;
- TransactionId xid;
xlrec = (xl_xact_commit *) XLogRecGetData(r);
ParseCommitRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
-
- if (!TransactionIdIsValid(parsed.twophase_xid))
- xid = XLogRecGetXid(r);
- else
- xid = parsed.twophase_xid;
-
- DecodeCommit(ctx, buf, &parsed, xid);
+ DecodeCommit(ctx, buf, &parsed, XLogRecGetXid(r));
break;
}
case XLOG_XACT_ABORT:
@@ -250,17 +246,10 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
{
xl_xact_abort *xlrec;
xl_xact_parsed_abort parsed;
- TransactionId xid;
xlrec = (xl_xact_abort *) XLogRecGetData(r);
ParseAbortRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
-
- if (!TransactionIdIsValid(parsed.twophase_xid))
- xid = XLogRecGetXid(r);
- else
- xid = parsed.twophase_xid;
-
- DecodeAbort(ctx, buf, &parsed, xid);
+ DecodeAbort(ctx, buf, &parsed, XLogRecGetXid(r));
break;
}
case XLOG_XACT_ASSIGNMENT:
@@ -281,16 +270,33 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
break;
}
case XLOG_XACT_PREPARE:
+ {
+ xl_xact_parsed_prepare parsed;
- /*
- * Currently decoding ignores PREPARE TRANSACTION and will just
- * decode the transaction when the COMMIT PREPARED is sent or
- * throw away the transaction's contents when a ROLLBACK PREPARED
- * is received. In the future we could add code to expose prepared
- * transactions in the changestream allowing for a kind of
- * distributed 2PC.
- */
- ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+ /* check that output plugin is capable of twophase decoding */
+ if (!ctx->options.enable_twophase)
+ {
+ ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+ break;
+ }
+
+ /* ok, parse it */
+ ParsePrepareRecord(XLogRecGetInfo(buf->record),
+ XLogRecGetData(buf->record), &parsed);
+
+ /* does output plugin want this particular transaction? */
+ if (ctx->callbacks.filter_prepare_cb &&
+ ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+ parsed.twophase_gid))
+ {
+ ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+ buf->origptr);
+ break;
+ }
+
+ DecodePrepare(ctx, buf, &parsed);
+ break;
+ }
break;
default:
elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
@@ -556,20 +562,13 @@ DecodeLogicalMsgOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
* Consolidated commit record handling between the different form of commit
* records.
*/
-static void
-DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
- xl_xact_parsed_commit *parsed, TransactionId xid)
+static bool
+DecodeEndOfTxn(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_commit *parsed, TransactionId xid)
{
- XLogRecPtr origin_lsn = InvalidXLogRecPtr;
- TimestampTz commit_time = parsed->xact_time;
RepOriginId origin_id = XLogRecGetOrigin(buf->record);
- int i;
+ bool skip = false;
- if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
- {
- origin_lsn = parsed->origin_lsn;
- commit_time = parsed->origin_timestamp;
- }
/*
* Process invalidation messages, even if we're not interested in the
@@ -586,20 +585,24 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
parsed->nsubxacts, parsed->subxacts);
+ skip = SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+ (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+ ctx->fast_forward || FilterByOrigin(ctx, origin_id);
- /* ----
- * Check whether we are interested in this specific transaction, and tell
- * the reorderbuffer to forget the content of the (sub-)transactions
- * if not.
- *
- * There can be several reasons we might not be interested in this
- * transaction:
- * 1) We might not be interested in decoding transactions up to this
- * LSN. This can happen because we previously decoded it and now just
- * are restarting or if we haven't assembled a consistent snapshot yet.
- * 2) The transaction happened in another database.
- * 3) The output plugin is not interested in the origin.
- * 4) We are doing fast-forwarding
+ return skip;
+}
+
+static void
+FinalizeTxnDecoding(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_commit *parsed, TransactionId xid,
+ bool will_skip)
+{
+ int i;
+
+
+ /*
+ * Tell the reorderbuffer to forget the content of the (sub-)transactions,
+ * if the transaction doesn't need decoding.
*
* We can't just use ReorderBufferAbort() here, because we need to execute
* the transaction's invalidations. This currently won't be needed if
@@ -611,31 +614,128 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
* another database, the invalidations might be important, because they
* could be for shared catalogs and we might have loaded data into the
* relevant syscaches.
- * ---
*/
- if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
- (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
- ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+ if (will_skip)
{
for (i = 0; i < parsed->nsubxacts; i++)
- {
ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr);
- }
+
ReorderBufferForget(ctx->reorder, xid, buf->origptr);
+ }
+ else
+ {
+ /*
+ * If not skipped, tell the reorderbuffer about the surviving
+ * subtransactions, if the top-level transaction isn't going to be
+ * skipped altogether.
+ */
+ for (i = 0; i < parsed->nsubxacts; i++)
+ ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+ buf->origptr, buf->endptr);
+ }
+}
- return;
+static void
+DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_commit *parsed, TransactionId xid)
+{
+ XLogRecPtr origin_lsn = InvalidXLogRecPtr;
+ TimestampTz commit_time = parsed->xact_time;
+ RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+ bool is_prepared = false;
+ bool filter_prepared = false;
+ bool skip;
+
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ origin_lsn = parsed->origin_lsn;
+ commit_time = parsed->origin_timestamp;
}
- /* tell the reorderbuffer about the surviving subtransactions */
- for (i = 0; i < parsed->nsubxacts; i++)
+ if (TransactionIdIsValid(parsed->twophase_xid))
+ {
+ is_prepared = true;
+ filter_prepared = ReorderBufferPrepareNeedSkip(ctx->reorder,
+ parsed->twophase_xid,
+ parsed->twophase_gid);
+
+ /*
+ * If there is a valid top-level transaction that's different from the
+ * two-phase one we are committing, clear its reorder buffer as well.
+ */
+ if (TransactionIdIsNormal(xid) && xid != parsed->twophase_xid)
+ ReorderBufferAbort(ctx->reorder, xid, origin_lsn);
+
+ /* act on the prepared transaction, instead */
+ xid = parsed->twophase_xid;
+ }
+
+ /* Whether or not this COMMIT needs to be skipped. */
+ skip = DecodeEndOfTxn(ctx, buf, parsed, xid);
+
+ /*
+ * Finalize the decoding of the transaction here. This is for regular
+ * commits as well as for two-phase transactions the output plugin was not
+ * interested in, which therefore are relayed as normal single-phase
+ * commits.
+ */
+ if (!is_prepared || filter_prepared)
+ FinalizeTxnDecoding(ctx, buf, parsed, xid, skip);
+
+ if (skip)
+ return;
+
+ /*
+ * A regular commit simply triggers a replay of transaction changes from
+ * the reorder buffer. For COMMIT PREPARED that however already happened
+ * at PREPARE time, and so we only need to notify the subscriber that the
+ * GID finally committed.
+ *
+ * For output plugins that do not support PREPARE-time decoding of
+ * two-phase transactions, we never even see the PREPARE and all two-phase
+ * transactions simply fall through to the second branch.
+ */
+ if (is_prepared && !filter_prepared)
{
- ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
- buf->origptr, buf->endptr);
+ /* we are processing COMMIT PREPARED */
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid, true);
}
+ else
+ {
+ /* replay actions of all transaction + subtransactions in order */
+ ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn);
+ }
+}
+
+/*
+ * Decode PREPARE record. Similar logic as in COMMIT
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare * parsed)
+{
+ XLogRecPtr origin_lsn = parsed->origin_lsn;
+ TimestampTz commit_time = parsed->origin_timestamp;
+ RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+ TransactionId xid = parsed->twophase_xid;
+ bool skip;
+
+ Assert(parsed->dbId != InvalidOid);
+ Assert(TransactionIdIsValid(parsed->twophase_xid));
+
+ /* Whether or not this PREPARE needs to be skipped. */
+ skip = DecodeEndOfTxn(ctx, buf, parsed, xid);
+
+ FinalizeTxnDecoding(ctx, buf, parsed, xid, skip);
/* replay actions of all transaction + subtransactions in order */
- ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
- commit_time, origin_id, origin_lsn);
+ if (!skip)
+ ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid);
}
/*
@@ -647,6 +747,48 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid)
{
int i;
+ XLogRecPtr origin_lsn = InvalidXLogRecPtr;
+ TimestampTz commit_time = 0;
+ RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+ bool is_prepared = TransactionIdIsValid(parsed->twophase_xid);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ origin_lsn = parsed->origin_lsn;
+ commit_time = parsed->origin_timestamp;
+ }
+
+ if (TransactionIdIsValid(parsed->twophase_xid))
+ {
+ is_prepared = true;
+ Assert(parsed->dbId != InvalidOid);
+
+ /*
+ * If there is a valid top-level transaction that's different from the
+ * two-phase one we are aborting, clear its reorder buffer as well.
+ */
+ if (TransactionIdIsNormal(xid) && xid != parsed->twophase_xid)
+ ReorderBufferAbort(ctx->reorder, xid, origin_lsn);
+
+ /* act on the prepared transaction, instead */
+ xid = parsed->twophase_xid;
+ }
+
+ /*
+ * If it's ROLLBACK PREPARED then handle it via callbacks.
+ */
+ if (is_prepared &&
+ !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+ parsed->dbId == ctx->slot->data.database &&
+ !ctx->fast_forward &&
+ !FilterByOrigin(ctx, origin_id) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
+ {
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid, false);
+ return;
+ }
for (i = 0; i < parsed->nsubxacts; i++)
{
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 6e5bc12e77..2369ff1d53 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -60,6 +60,16 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+static void abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change);
static void truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -192,6 +202,11 @@ StartupDecodingContext(List *output_plugin_options,
ctx->reorder->apply_change = change_cb_wrapper;
ctx->reorder->apply_truncate = truncate_cb_wrapper;
ctx->reorder->commit = commit_cb_wrapper;
+ ctx->reorder->abort = abort_cb_wrapper;
+ ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+ ctx->reorder->prepare = prepare_cb_wrapper;
+ ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+ ctx->reorder->abort_prepared = abort_prepared_cb_wrapper;
ctx->reorder->message = message_cb_wrapper;
ctx->out = makeStringInfo();
@@ -616,6 +631,33 @@ startup_cb_wrapper(LogicalDecodingContext *ctx, OutputPluginOptions *opt, bool i
/* do the actual work: call callback */
ctx->callbacks.startup_cb(ctx, opt, is_init);
+ /*
+ * If the plugin claims to support two-phase transactions, then
+ * check that the plugin implements all callbacks necessary to decode
+ * two-phase transactions - we either have to have all of them or none.
+ * The filter_prepare callback is optional, but can only be defined when
+ * two-phase decoding is enabled (i.e. the three other callbacks are
+ * defined).
+ */
+ if (opt->enable_twophase)
+ {
+ int twophase_callbacks = (ctx->callbacks.prepare_cb != NULL) +
+ (ctx->callbacks.commit_prepared_cb != NULL) +
+ (ctx->callbacks.abort_prepared_cb != NULL);
+
+ /* Plugins with incorrect number of two-phase callbacks are broken. */
+ if ((twophase_callbacks != 3) && (twophase_callbacks != 0))
+ ereport(ERROR,
+ (errmsg("Output plugin registered only %d twophase callbacks.",
+ twophase_callbacks)));
+ }
+
+ /* filter_prepare is optional, but requires two-phase decoding */
+ if ((ctx->callbacks.filter_prepare_cb != NULL) && (!opt->enable_twophase))
+ ereport(ERROR,
+ (errmsg("Output plugin does not support two-phase decoding, but "
+ "registered filter_prepare callback.")));
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
}
@@ -713,6 +755,122 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
error_context_stack = errcallback.previous;
}
+static void
+abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "abort";
+ state.report_location = txn->final_lsn; /* beginning of abort record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.abort_cb(ctx, txn, abort_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "prepare";
+ state.report_location = txn->final_lsn; /* beginning of prepare record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "commit_prepared";
+ state.report_location = txn->final_lsn; /* beginning of commit record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
+static void
+abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "abort_prepared";
+ state.report_location = txn->final_lsn; /* beginning of abort record */
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = true;
+ ctx->write_xid = txn->xid;
+ ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+ /* do the actual work: call callback */
+ ctx->callbacks.abort_prepared_cb(ctx, txn, abort_lsn);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+}
+
static void
change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change)
@@ -790,6 +948,51 @@ truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
error_context_stack = errcallback.previous;
}
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+ bool ret;
+
+ /*
+ * Skip if decoding of twophase at PREPARE time is not enabled. In that
+ * case all twophase transactions are considered filtered out and will be
+ * applied as regular transactions at COMMIT PREPARED.
+ */
+ if (!ctx->options.enable_twophase)
+ return true;
+
+ /*
+ * The filter_prepare callback is optional. When not supplied, all
+ * prepared transactions should go through.
+ */
+ if (!ctx->callbacks.filter_prepare_cb)
+ return false;
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "filter_prepare";
+ state.report_location = InvalidXLogRecPtr;
+ errcallback.callback = output_plugin_error_callback;
+ errcallback.arg = (void *) &state;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ /* set output state */
+ ctx->accept_writes = false;
+
+ /* do the actual work: call callback */
+ ret = ctx->callbacks.filter_prepare_cb(ctx, txn, xid, gid);
+
+ /* Pop the error context stack */
+ error_context_stack = errcallback.previous;
+
+ return ret;
+}
+
bool
filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
{
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 3d287c0eb7..918f96b796 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -337,6 +337,11 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
}
/* free data that's contained */
+ if (txn->gid != NULL)
+ {
+ pfree(txn->gid);
+ txn->gid = NULL;
+ }
if (txn->tuplecid_hash != NULL)
{
@@ -1426,25 +1431,18 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
* and subtransactions (using a k-way merge) and replay the changes in lsn
* order.
*/
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
- XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
- TimestampTz commit_time,
- RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+ ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn)
{
- ReorderBufferTXN *txn;
volatile Snapshot snapshot_now;
volatile CommandId command_id = FirstCommandId;
bool using_subtxn;
ReorderBufferIterTXNState *volatile iterstate = NULL;
- txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
- false);
-
- /* unknown transaction, nothing to replay */
- if (txn == NULL)
- return;
-
txn->final_lsn = commit_lsn;
txn->end_lsn = end_lsn;
txn->commit_time = commit_time;
@@ -1758,7 +1756,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
break;
}
}
-
/*
* There's a speculative insertion remaining, just clean in up, it
* can't have been successful, otherwise we'd gotten a confirmation
@@ -1774,8 +1771,22 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
ReorderBufferIterTXNFinish(rb, iterstate);
iterstate = NULL;
- /* call commit callback */
- rb->commit(rb, txn, commit_lsn);
+ /*
+ * Call abort/commit/prepare callback, depending on the transaction
+ * state.
+ *
+ * If the transaction aborted during apply (which currently can happen
+ * only for prepared transactions), simply call the abort callback.
+ *
+ * Otherwise call either PREPARE (for twophase transactions) or COMMIT
+ * (for regular ones).
+ */
+ if (rbtxn_rollback(txn))
+ rb->abort(rb, txn, commit_lsn);
+ else if (rbtxn_prepared(txn))
+ rb->prepare(rb, txn, commit_lsn);
+ else
+ rb->commit(rb, txn, commit_lsn);
/* this is just a sanity check against bad output plugin behaviour */
if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -1802,7 +1813,12 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
if (snapshot_now->copied)
ReorderBufferFreeSnap(rb, snapshot_now);
- /* remove potential on-disk data, and deallocate */
+ /*
+ * remove potential on-disk data, and deallocate.
+ *
+ * We remove it even for prepared transactions (GID is enough to
+ * commit/abort those later).
+ */
ReorderBufferCleanupTXN(rb, txn);
}
PG_CATCH();
@@ -1836,6 +1852,141 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
PG_END_TRY();
}
+
+/*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+ return rb->filter_prepare(rb, txn, xid, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferCommitInternal()
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /* unknown transaction, nothing to replay */
+ if (txn == NULL)
+ return;
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a twophase transaction. It calls ReorderBufferCommitInternal()
+ * since all prepared transactions need to be decoded at PREPARE time.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /* unknown transaction, nothing to replay */
+ if (txn == NULL)
+ return;
+
+ txn->txn_flags |= RBTXN_PREPARE;
+ txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+ strcpy(txn->gid, gid);
+
+ ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+ commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Check whether this transaction was sent as prepared to subscribers.
+ * Called while handling commit|abort prepared.
+ */
+bool
+ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+ const char *gid)
+{
+ ReorderBufferTXN *txn;
+
+ txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+ false);
+
+ /*
+ * Always call the prepare filter. It's the job of the prepare filter to
+ * give us the *same* response for a given xid across multiple calls
+ * (including ones on restart)
+ */
+ return !(rb->filter_prepare(rb, txn, xid, gid));
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ABORT PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit)
+{
+ ReorderBufferTXN *txn;
+
+ /*
+ * The transaction may or may not exist (during restarts for example).
+ * Either way, its changes were already replayed and cleaned up at PREPARE
+ * time, so nothing is buffered for it. Allow it to be created below.
+ */
+ txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+ true);
+
+ txn->final_lsn = commit_lsn;
+ txn->end_lsn = end_lsn;
+ txn->commit_time = commit_time;
+ txn->origin_id = origin_id;
+ txn->origin_lsn = origin_lsn;
+ /* this txn is obviously prepared */
+ txn->txn_flags |= RBTXN_PREPARE;
+ txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+ strcpy(txn->gid, gid);
+
+ if (is_commit)
+ {
+ txn->txn_flags |= RBTXN_COMMIT_PREPARED;
+ rb->commit_prepared(rb, txn, commit_lsn);
+ }
+ else
+ {
+ txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
+ rb->abort_prepared(rb, txn, commit_lsn);
+ }
+
+ /* cleanup: make sure there's no cache pollution */
+ ReorderBufferExecuteInvalidations(rb, txn);
+ ReorderBufferCleanupTXN(rb, txn);
+}
+
/*
* Abort a transaction that possibly has previous changes. Needs to be first
* called for subtransactions and then for the toplevel xid.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index c8ffc4c434..78cbde74f1 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -47,7 +47,7 @@ typedef struct LogicalDecodingContext
/*
* Marks the logical decoding context as fast forward decoding one. Such a
- * context does not have plugin loaded so most of the the following
+ * context does not have plugin loaded so most of the following
* properties are unused.
*/
bool fast_forward;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index d4ce54f26d..02727e9e25 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -27,6 +27,7 @@ typedef struct OutputPluginOptions
{
OutputPluginOutputType output_type;
bool receive_rewrites;
+ bool enable_twophase;
} OutputPluginOptions;
/*
@@ -77,6 +78,46 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+/*
+ * Called for an implicit ABORT of a transaction.
+ */
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+/*
+ * Called before decoding of a PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to the prepare and
+ * commit_prepared/abort_prepared callbacks, or whether it should wait until
+ * COMMIT PREPARED and be sent as a regular transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
/*
* Called for the generic logical decoding messages.
*/
@@ -109,7 +150,12 @@ typedef struct OutputPluginCallbacks
LogicalDecodeChangeCB change_cb;
LogicalDecodeTruncateCB truncate_cb;
LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeAbortCB abort_cb;
LogicalDecodeMessageCB message_cb;
+ LogicalDecodeFilterPrepareCB filter_prepare_cb;
+ LogicalDecodePrepareCB prepare_cb;
+ LogicalDecodeCommitPreparedCB commit_prepared_cb;
+ LogicalDecodeAbortPreparedCB abort_prepared_cb;
LogicalDecodeFilterByOriginCB filter_by_origin_cb;
LogicalDecodeShutdownCB shutdown_cb;
} OutputPluginCallbacks;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index a67b2fd1d9..621e595e8e 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
#define REORDERBUFFER_H
#include "access/htup_details.h"
+#include "access/twophase.h"
#include "lib/ilist.h"
#include "storage/sinval.h"
#include "utils/hsearch.h"
@@ -154,6 +155,11 @@ typedef struct ReorderBufferChange
#define RBTXN_HAS_CATALOG_CHANGES 0x0001
#define RBTXN_IS_SUBXACT 0x0002
#define RBTXN_IS_SERIALIZED 0x0004
+#define RBTXN_PREPARE 0x0008
+#define RBTXN_COMMIT_PREPARED 0x0010
+#define RBTXN_ROLLBACK_PREPARED 0x0020
+#define RBTXN_COMMIT 0x0040
+#define RBTXN_ROLLBACK 0x0080
/* does the txn have catalog changes */
#define rbtxn_has_catalog_changes(txn) (txn->txn_flags & RBTXN_HAS_CATALOG_CHANGES)
@@ -167,6 +173,16 @@ typedef struct ReorderBufferChange
* nentries_mem == nentries.
*/
#define rbtxn_is_serialized(txn) (txn->txn_flags & RBTXN_IS_SERIALIZED)
+/* is this txn prepared? */
+#define rbtxn_prepared(txn) (txn->txn_flags & RBTXN_PREPARE)
+/* was this prepared txn committed in the meanwhile? */
+#define rbtxn_commit_prepared(txn) (txn->txn_flags & RBTXN_COMMIT_PREPARED)
+/* was this prepared txn aborted in the meanwhile? */
+#define rbtxn_rollback_prepared(txn) (txn->txn_flags & RBTXN_ROLLBACK_PREPARED)
+/* was this txn committed in the meanwhile? */
+#define rbtxn_commit(txn) (txn->txn_flags & RBTXN_COMMIT)
+/* was this txn aborted in the meanwhile? */
+#define rbtxn_rollback(txn) (txn->txn_flags & RBTXN_ROLLBACK)
typedef struct ReorderBufferTXN
{
@@ -179,6 +195,8 @@ typedef struct ReorderBufferTXN
/* Do we know this is a subxact? Xid of top-level txn if so */
TransactionId toplevel_xid;
+ /* In case of 2PC we need to pass GID to output plugin */
+ char *gid;
/*
* LSN of the first data carrying, WAL record with knowledge about this
@@ -324,6 +342,37 @@ typedef void (*ReorderBufferCommitCB) (
ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
+/* abort callback signature */
+typedef void (*ReorderBufferAbortCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+typedef bool (*ReorderBufferFilterPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ TransactionId xid,
+ const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/* abort prepared callback signature */
+typedef void (*ReorderBufferAbortPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+
+
/* message callback signature */
typedef void (*ReorderBufferMessageCB) (
ReorderBuffer *rb,
@@ -369,6 +418,11 @@ struct ReorderBuffer
ReorderBufferApplyChangeCB apply_change;
ReorderBufferApplyTruncateCB apply_truncate;
ReorderBufferCommitCB commit;
+ ReorderBufferAbortCB abort;
+ ReorderBufferFilterPrepareCB filter_prepare;
+ ReorderBufferPrepareCB prepare;
+ ReorderBufferCommitPreparedCB commit_prepared;
+ ReorderBufferAbortPreparedCB abort_prepared;
ReorderBufferMessageCB message;
/*
@@ -419,6 +473,11 @@ void ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot
void ReorderBufferCommit(ReorderBuffer *, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit);
void ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
void ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
@@ -442,6 +501,15 @@ void ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
bool ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
bool ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
+bool ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+ const char *gid);
+bool ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+ const char *gid);
+void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid);
ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
--
2.15.2 (Apple Git-101.1)
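
For readers who want to check the callback rules in the patch above without a PostgreSQL build, here is a minimal standalone sketch of the checks added to startup_cb_wrapper() and of the decision made by filter_prepare_cb_wrapper(). The names PluginModel, PluginCheck, check_twophase_callbacks, prepare_is_filtered, noop_cb and skip_internal_gids are all made up for illustration and are not part of the patch; the real code works on LogicalDecodingContext and reports failures through ereport(). Like the patch, the model accepts a plugin that sets enable_twophase but registers none of the three callbacks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the callback slots; not the real PostgreSQL types. */
typedef void (*TxnCallback) (void);
typedef bool (*FilterCallback) (const char *gid);

typedef struct PluginModel
{
	bool		enable_twophase;	/* models OutputPluginOptions.enable_twophase */
	TxnCallback prepare_cb;
	TxnCallback commit_prepared_cb;
	TxnCallback abort_prepared_cb;
	FilterCallback filter_prepare_cb;	/* optional */
} PluginModel;

typedef enum
{
	PLUGIN_OK,
	PLUGIN_BAD_CALLBACK_COUNT,
	PLUGIN_FILTER_WITHOUT_TWOPHASE
} PluginCheck;

/* Models the two checks the patch adds to startup_cb_wrapper(). */
PluginCheck
check_twophase_callbacks(const PluginModel *p)
{
	assert(p != NULL);

	if (p->enable_twophase)
	{
		int			n = (p->prepare_cb != NULL) +
			(p->commit_prepared_cb != NULL) +
			(p->abort_prepared_cb != NULL);

		/* all three or none; anything in between is a broken plugin */
		if (n != 3 && n != 0)
			return PLUGIN_BAD_CALLBACK_COUNT;
	}

	/* filter_prepare is optional, but requires two-phase decoding */
	if (p->filter_prepare_cb != NULL && !p->enable_twophase)
		return PLUGIN_FILTER_WITHOUT_TWOPHASE;

	return PLUGIN_OK;
}

/*
 * Models filter_prepare_cb_wrapper(): true means "skip the PREPARE and
 * replay the transaction as a regular commit at COMMIT PREPARED".
 */
bool
prepare_is_filtered(const PluginModel *p, const char *gid)
{
	assert(p != NULL);

	if (!p->enable_twophase)
		return true;			/* everything is considered filtered out */
	if (p->filter_prepare_cb == NULL)
		return false;			/* no filter: all prepared xacts go through */
	return p->filter_prepare_cb(gid);
}

/* Example callbacks, purely illustrative. */
void
noop_cb(void)
{
}

bool
skip_internal_gids(const char *gid)
{
	return strncmp(gid, "internal_", 9) == 0;
}
```

Note that the filter has to give the same answer for a given GID every time it is asked, since ReorderBufferTxnIsPrepared() calls it again at COMMIT PREPARED / ROLLBACK PREPARED time.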
0003-Gracefully-handle-concurrent-aborts-of-uncommitted-t.Jan4.patch (application/octet-stream)
From 763922bde9e16e68ab6a86ebbb1d6529ff5d3983 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Thu, 26 Jul 2018 18:45:26 +0530
Subject: [PATCH 3/4] Gracefully handle concurrent aborts of uncommitted
transactions that are being decoded alongside.
When a transaction aborts, its changes are considered unnecessary for
other transactions. That means the changes may be either cleaned up by
vacuum or removed from HOT chains (thus made inaccessible through
indexes), and there may be other such consequences.
When decoding committed transactions this is not an issue, and we
never decode transactions that abort before the decoding starts.
But for in-progress transactions - for example when decoding prepared
transactions on PREPARE (and not COMMIT PREPARED as before), this
may cause failures when the output plugin consults catalogs (both
system and user-defined).
We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from system table scan APIs to the backend decoding a
specific uncommitted transaction. The decoding logic on the receipt
of such an sqlerrcode aborts the ongoing decoding and returns
gracefully.
---
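
Reviewer note (not part of the commit message): the abort check described above can be modelled in a few lines of standalone C. This is only a sketch under simplifying assumptions. XactState, catalog_scan_sees_abort and decode_changes are hypothetical names, the transaction state is passed in as an array instead of being looked up via TransactionIdIsInProgress()/TransactionIdDidCommit(), and early termination is signalled with a flag rather than an ereport() carrying ERRCODE_TRANSACTION_ROLLBACK.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

/* Hypothetical summary of what the in-progress and commit-status checks would report. */
typedef enum
{
	XACT_IN_PROGRESS,
	XACT_COMMITTED,
	XACT_ABORTED
} XactState;

/*
 * Models the test added after each systable_* fetch: with a valid
 * CheckXidAlive, a transaction that is neither in progress nor committed
 * must have aborted underneath us.
 */
bool
catalog_scan_sees_abort(TransactionId check_xid_alive, XactState state)
{
	return check_xid_alive != InvalidTransactionId &&
		state != XACT_IN_PROGRESS &&
		state != XACT_COMMITTED;
}

/*
 * Models the decoding loop. state_at_change[i] is the state of the decoded
 * transaction at the moment change i does its catalog lookup. Returns the
 * number of changes handed to the output plugin and sets *aborted when
 * decoding stopped early because of a concurrent abort.
 */
size_t
decode_changes(TransactionId xid, const XactState *state_at_change,
			   size_t nchanges, bool *aborted)
{
	size_t		i;

	assert(state_at_change != NULL && aborted != NULL);
	*aborted = false;

	for (i = 0; i < nchanges; i++)
	{
		if (catalog_scan_sees_abort(xid, state_at_change[i]))
		{
			*aborted = true;	/* stop gracefully, do not decode further */
			break;
		}
		/* ... the change would be passed to the output plugin here ... */
	}

	return i;
}
```

With an invalid xid (the case of decoding an already committed transaction) the check never fires, which matches the intent that regular decoding is unaffected.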
doc/src/sgml/logicaldecoding.sgml | 5 ++-
src/backend/access/heap/heapam.c | 51 +++++++++++++++++++++++++
src/backend/access/index/genam.c | 35 +++++++++++++++++
src/backend/replication/logical/logical.c | 3 ++
src/backend/replication/logical/reorderbuffer.c | 32 +++++++++++++---
src/backend/utils/time/snapmgr.c | 25 +++++++++++-
src/include/utils/snapmgr.h | 4 +-
7 files changed, 146 insertions(+), 9 deletions(-)
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index a89e4d5184..d76afbda05 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -421,7 +421,10 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
ALTER TABLE user_catalog_table SET (user_catalog_table = true);
CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
</programlisting>
- Any actions leading to transaction ID assignment are prohibited. That, among others,
+ Note that access to user catalog tables or regular system catalog tables
+ in the output plugins has to be done via the <literal>systable_*</literal> scan APIs only.
+ Access via the <literal>heap_*</literal> scan APIs will error out.
+ Additionally, any actions leading to transaction ID assignment are prohibited. That, among others,
includes writing to tables, performing DDL changes, and
calling <literal>txid_current()</literal>.
</para>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 2c4a145357..f056543808 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1846,6 +1846,17 @@ heap_update_snapshot(HeapScanDesc scan, Snapshot snapshot)
HeapTuple
heap_getnext(HeapScanDesc scan, ScanDirection direction)
{
+ /*
+ * We don't expect direct calls to heap_getnext with valid
+ * CheckXidAlive for regular tables. Track that below.
+ */
+ if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+ !(IsCatalogRelation(scan->rs_rd) ||
+ RelationIsUsedAsCatalogTable(scan->rs_rd))))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+ errmsg("improper heap_getnext call")));
+
/* Note: no locking manipulations needed */
HEAPDEBUG_1; /* heap_getnext( info ) */
@@ -1926,6 +1937,16 @@ heap_fetch(Relation relation,
OffsetNumber offnum;
bool valid;
+ /*
+ * We don't expect direct calls to heap_fetch with valid
+ * CheckXidAlive for regular tables. Track that below.
+ */
+ if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+ !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+ errmsg("improper heap_fetch call")));
+
/*
* Fetch and pin the appropriate page of the relation.
*/
@@ -2058,6 +2079,16 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
bool valid;
bool skip;
+ /*
+ * We don't expect direct calls to heap_hot_search_buffer with
+ * valid CheckXidAlive for regular tables. Track that below.
+ */
+ if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+ !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+ errmsg("improper heap_hot_search_buffer call")));
+
/* If this is not the first call, previous call returned a (live!) tuple */
if (all_dead)
*all_dead = first_call;
@@ -2199,6 +2230,16 @@ heap_hot_search(ItemPointer tid, Relation relation, Snapshot snapshot,
Buffer buffer;
HeapTupleData heapTuple;
+ /*
+ * We don't expect direct calls to heap_hot_search with
+ * valid CheckXidAlive for regular tables. Track that below.
+ */
+ if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+ !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+ errmsg("improper heap_hot_search call")));
+
buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
LockBuffer(buffer, BUFFER_LOCK_SHARE);
result = heap_hot_search_buffer(tid, relation, buffer, snapshot,
@@ -2228,6 +2269,16 @@ heap_get_latest_tid(Relation relation,
ItemPointerData ctid;
TransactionId priorXmax;
+ /*
+ * We don't expect direct calls to heap_get_latest_tid with valid
+ * CheckXidAlive for regular tables. Track that below.
+ */
+ if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+ !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+ errmsg("improper heap_get_latest_tid call")));
+
/* this is to avoid Assert failures on bad input */
if (!ItemPointerIsValid(tid))
return;
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 4d46257d6a..7564374642 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -25,6 +25,7 @@
#include "lib/stringinfo.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
+#include "storage/procarray.h"
#include "utils/acl.h"
#include "utils/builtins.h"
#include "utils/lsyscache.h"
@@ -423,6 +424,17 @@ systable_getnext(SysScanDesc sysscan)
else
htup = heap_getnext(sysscan->scan, ForwardScanDirection);
+ /*
+ * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+ * error out
+ */
+ if (TransactionIdIsValid(CheckXidAlive) &&
+ !TransactionIdIsInProgress(CheckXidAlive) &&
+ !TransactionIdDidCommit(CheckXidAlive))
+ ereport(ERROR,
+ (errcode(ERRCODE_TRANSACTION_ROLLBACK),
+ errmsg("transaction aborted during system catalog scan")));
+
return htup;
}
@@ -476,6 +488,18 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
result = HeapTupleSatisfiesVisibility(tup, freshsnap, scan->rs_cbuf);
LockBuffer(scan->rs_cbuf, BUFFER_LOCK_UNLOCK);
}
+
+ /*
+ * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+ * error out
+ */
+ if (TransactionIdIsValid(CheckXidAlive) &&
+ !TransactionIdIsInProgress(CheckXidAlive) &&
+ !TransactionIdDidCommit(CheckXidAlive))
+ ereport(ERROR,
+ (errcode(ERRCODE_TRANSACTION_ROLLBACK),
+ errmsg("transaction aborted during system catalog scan")));
+
return result;
}
@@ -593,6 +617,17 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
if (htup && sysscan->iscan->xs_recheck)
elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
+ /*
+ * If CheckXidAlive is valid, check whether that transaction has aborted
+ * (neither in progress nor committed) and, if so, error out.
+ */
+ if (TransactionIdIsValid(CheckXidAlive) &&
+ !TransactionIdIsInProgress(CheckXidAlive) &&
+ !TransactionIdDidCommit(CheckXidAlive))
+ ereport(ERROR,
+ (errcode(ERRCODE_TRANSACTION_ROLLBACK),
+ errmsg("transaction aborted during system catalog scan")));
+
return htup;
}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 2369ff1d53..151ef8517c 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -763,6 +763,9 @@ abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
LogicalErrorCallbackState state;
ErrorContextCallback errcallback;
+ if (!ctx->callbacks.abort_cb)
+ return;
+
/* Push callback + info on the error context stack */
state.ctx = ctx;
state.callback_name = "abort";
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 918f96b796..baf2d1aa30 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -636,7 +636,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
/* setup snapshot to allow catalog access */
- SetupHistoricSnapshot(snapshot_now, NULL);
+ SetupHistoricSnapshot(snapshot_now, NULL, xid);
PG_TRY();
{
rb->message(rb, txn, lsn, false, prefix, message_size, message);
@@ -1442,6 +1442,7 @@ ReorderBufferCommitInternal(ReorderBufferTXN *txn,
volatile CommandId command_id = FirstCommandId;
bool using_subtxn;
ReorderBufferIterTXNState *volatile iterstate = NULL;
+ MemoryContext ccxt = CurrentMemoryContext;
txn->final_lsn = commit_lsn;
txn->end_lsn = end_lsn;
@@ -1468,7 +1469,7 @@ ReorderBufferCommitInternal(ReorderBufferTXN *txn,
ReorderBufferBuildTupleCidHash(rb, txn);
/* setup the initial snapshot */
- SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+ SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
/*
* Decoding needs access to syscaches et al., which in turn use
@@ -1719,7 +1720,7 @@ ReorderBufferCommitInternal(ReorderBufferTXN *txn,
/* and continue with the new one */
- SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+ SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
break;
case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
@@ -1739,7 +1740,7 @@ ReorderBufferCommitInternal(ReorderBufferTXN *txn,
snapshot_now->curcid = command_id;
TeardownHistoricSnapshot(false);
- SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+ SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
/*
* Every time the CommandId is incremented, we could
@@ -1824,6 +1825,20 @@ ReorderBufferCommitInternal(ReorderBufferTXN *txn,
PG_CATCH();
{
/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
+ MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+ ErrorData *errdata = CopyErrorData();
+
+ /*
+ * If the catalog scan failed because the transaction was rolled
+ * back concurrently, tell the output plugin to abort it as well.
+ */
+ if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
+ {
+ elog(LOG, "stopping decoding of xid %u (gid \"%s\")",
+ txn->xid, txn->gid ? txn->gid : "");
+ rb->abort(rb, txn, commit_lsn);
+ }
+
if (iterstate)
ReorderBufferIterTXNFinish(rb, iterstate);
@@ -1847,7 +1862,14 @@ ReorderBufferCommitInternal(ReorderBufferTXN *txn,
/* remove potential on-disk data, and deallocate */
ReorderBufferCleanupTXN(rb, txn);
- PG_RE_THROW();
+ /* re-throw only if it's not an abort */
+ if (errdata->sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK)
+ {
+ MemoryContextSwitchTo(ecxt);
+ PG_RE_THROW();
+ }
+ else
+ FlushErrorState();
}
PG_END_TRY();
}
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index f93b37b9c9..3690a4bf02 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -151,6 +151,13 @@ static Snapshot SecondarySnapshot = NULL;
static Snapshot CatalogSnapshot = NULL;
static Snapshot HistoricSnapshot = NULL;
+/*
+ * The xid of a possibly in-progress or prepared transaction. Currently used
+ * in logical decoding, where such a transaction can be aborted while it is
+ * still being decoded.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+
/*
* These are updated by GetSnapshotData. We initialize them this way
* for the convenience of TransactionIdIsInProgress: even in bootstrap
@@ -1995,10 +2002,14 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
* Setup a snapshot that replaces normal catalog snapshots that allows catalog
* access to behave just like it did at a certain point in the past.
*
+ * If a valid xid is passed in and it has not committed yet, we track it in
+ * CheckXidAlive so that its status can be re-checked during catalog access.
+ *
* Needed for logical decoding.
*/
void
-SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
+SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids,
+ TransactionId snapshot_xid)
{
Assert(historic_snapshot != NULL);
@@ -2007,8 +2018,17 @@ SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
/* setup (cmin, cmax) lookup hash */
tuplecid_data = tuplecids;
-}
+ /*
+ * Set CheckXidAlive if the xid has not committed yet. We don't check here
+ * whether the xid aborted; that happens during catalog access.
+ */
+ if (TransactionIdIsValid(snapshot_xid) &&
+ !TransactionIdDidCommit(snapshot_xid))
+ CheckXidAlive = snapshot_xid;
+ else
+ CheckXidAlive = InvalidTransactionId;
+}
/*
* Make catalog snapshots behave normally again.
@@ -2018,6 +2038,7 @@ TeardownHistoricSnapshot(bool is_error)
{
HistoricSnapshot = NULL;
tuplecid_data = NULL;
+ CheckXidAlive = InvalidTransactionId;
}
bool
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index f8308e6925..de148aa45e 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -100,8 +100,10 @@ extern char *ExportSnapshot(Snapshot snapshot);
/* Support for catalog timetravel for logical decoding */
struct HTAB;
+extern TransactionId CheckXidAlive;
extern struct HTAB *HistoricSnapshotGetTupleCids(void);
-extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids);
+extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids,
+ TransactionId snapshot_xid);
extern void TeardownHistoricSnapshot(bool is_error);
extern bool HistoricSnapshotActive(void);
--
2.15.2 (Apple Git-101.1)
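For readers following the series, the concurrent-abort check that the patch above adds after each catalog scan can be modelled in isolation. The following is only a minimal standalone sketch, not backend code: XactStatus, tracked_xid(), scan_must_error() and decode_hits_abort() are illustrative names, and the transaction status is passed in as an argument instead of being looked up with TransactionIdIsInProgress() and TransactionIdDidCommit().

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

/* Illustrative stand-in for what the procarray/clog lookups would report. */
typedef enum
{
	XACT_IN_PROGRESS,
	XACT_COMMITTED,
	XACT_ABORTED
} XactStatus;

/*
 * Models SetupHistoricSnapshot(): the xid is tracked only when it is valid
 * and has not committed yet. Returns the value CheckXidAlive would get.
 */
TransactionId
tracked_xid(TransactionId snapshot_xid, XactStatus status_at_setup)
{
	if (snapshot_xid != InvalidTransactionId &&
		status_at_setup != XACT_COMMITTED)
		return snapshot_xid;
	return InvalidTransactionId;
}

/*
 * Models the check after a catalog scan: error out only if an xid is
 * tracked and it is neither in progress nor committed, i.e. it aborted.
 */
bool
scan_must_error(TransactionId check_xid_alive, XactStatus status_at_scan)
{
	return check_xid_alive != InvalidTransactionId &&
		status_at_scan != XACT_IN_PROGRESS &&
		status_at_scan != XACT_COMMITTED;
}

/* Convenience wrapper: set up the snapshot, then run one catalog scan. */
bool
decode_hits_abort(TransactionId xid, XactStatus at_setup, XactStatus at_scan)
{
	return scan_must_error(tracked_xid(xid, at_setup), at_scan);
}
```

In this model, decode_hits_abort(700, XACT_IN_PROGRESS, XACT_ABORTED) is the case the TAP test in the next patch provokes: the prepared transaction is still in progress when decoding starts and is rolled back before the next catalog scan.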
Attachment: 0004-Teach-test_decoding-plugin-to-work-with-2PC.Jan4.patch (application/octet-stream)
From dba6debb7237e65088a14510f07d876940a67f87 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Wed, 13 Jun 2018 16:31:15 +0530
Subject: [PATCH 4/4] Teach test_decoding plugin to work with 2PC
Implement all callbacks required for decoding 2PC in the test_decoding
plugin, and add relevant test cases.
Additionally, add a new option "check-xid". If this option is set to a
valid xid, pg_decode_change() waits for that transaction to be aborted
externally. This lets us test a concurrent rollback of a prepared
transaction while it is being decoded.
---
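A note on the "check-xid" option: the parsing in this patch relies on errno after strtoul(), and the C standard only guarantees ERANGE there, so input such as "abc" or "12abc" is not necessarily rejected. A stricter variant could look roughly like the sketch below. parse_check_xid() is a hypothetical helper, not part of the patch, and it returns 0 (InvalidTransactionId) on any failure instead of calling ereport().

```c
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper, not part of the patch: parse a "check-xid" option
 * value. Returns the xid, or 0 (InvalidTransactionId) if the string is
 * not a valid, non-zero 32-bit number.
 */
uint32_t
parse_check_xid(const char *str)
{
	char	   *end;
	unsigned long val;

	/* Require a leading digit: rejects "", leading whitespace and signs. */
	if (str == NULL || !isdigit((unsigned char) str[0]))
		return 0;

	errno = 0;
	val = strtoul(str, &end, 0);	/* base 0, as in the patch */

	/* Reject overflow, trailing garbage and values beyond 32 bits. */
	if (errno != 0 || *end != '\0' || val > UINT32_MAX)
		return 0;

	return (uint32_t) val;
}
```

Under these assumptions "742" and "0x1F" are accepted, while "", "-1", "12abc" and "4294967296" all yield 0, which the caller can report as an invalid parameter value.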
contrib/test_decoding/Makefile | 2 +
contrib/test_decoding/expected/prepared.out | 185 ++++++++++++++++++++++++----
contrib/test_decoding/sql/prepared.sql | 77 ++++++++++--
contrib/test_decoding/t/001_twophase.pl | 119 ++++++++++++++++++
contrib/test_decoding/test_decoding.c | 179 +++++++++++++++++++++++++++
5 files changed, 530 insertions(+), 32 deletions(-)
create mode 100644 contrib/test_decoding/t/001_twophase.pl
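The GID-based filtering exercised by the "nodecode" regression test can be summarised in a few lines. This is a standalone sketch of the behaviour visible in the expected output, not the plugin code itself: gid_is_filtered() mirrors the strstr() test in pg_decode_filter_prepare(), while commit_line_kind() is an illustrative helper that names the line closing the transaction once COMMIT PREPARED has been decoded.

```c
#include <stdbool.h>
#include <string.h>

/* Same test as pg_decode_filter_prepare(): true means "skip at PREPARE". */
bool
gid_is_filtered(const char *gid)
{
	return strstr(gid, "_nodecode") != NULL;
}

/*
 * Illustrative helper: a filtered transaction is decoded as a regular
 * transaction at commit time and ends with a plain COMMIT; an unfiltered
 * one was already decoded at PREPARE and ends with COMMIT PREPARED.
 */
const char *
commit_line_kind(const char *gid)
{
	return gid_is_filtered(gid) ? "COMMIT" : "COMMIT PREPARED";
}
```

Note that the substring includes the leading underscore, so a GID of just "nodecode" would not be filtered.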
diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 4afb1d963e..6bac8a3fe5 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -12,6 +12,8 @@ ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
+TAP_TESTS = 1
+
# Disabled because these tests require "wal_level=logical", which
# typical installcheck users do not have (e.g. buildfarm clients).
NO_INSTALLCHECK = 1
diff --git a/contrib/test_decoding/expected/prepared.out b/contrib/test_decoding/expected/prepared.out
index 46e915d4ff..934c8f1509 100644
--- a/contrib/test_decoding/expected/prepared.out
+++ b/contrib/test_decoding/expected/prepared.out
@@ -6,19 +6,50 @@ SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_d
init
(1 row)
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
-- test simple successful use of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (1);
PREPARE TRANSACTION 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ PREPARE TRANSACTION 'test_prepared#1'
+(3 rows)
+
COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
INSERT INTO test_prepared1 VALUES (2);
-- test abort of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (3);
PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(6 rows)
+
ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
INSERT INTO test_prepared1 VALUES (4);
-- test prepared xact containing ddl
BEGIN;
@@ -26,45 +57,149 @@ INSERT INTO test_prepared1 VALUES (5);
ALTER TABLE test_prepared1 ADD COLUMN data text;
INSERT INTO test_prepared1 VALUES (6, 'frakbar');
PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
-INSERT INTO test_prepared2 VALUES (7);
-COMMIT PREPARED 'test_prepared#3';
--- make sure stuff still works
-INSERT INTO test_prepared1 VALUES (8);
-INSERT INTO test_prepared2 VALUES (9);
--- cleanup
-DROP TABLE test_prepared1;
-DROP TABLE test_prepared2;
--- show results
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'test_prepared1'::regclass;
+ relation | locktype | mode
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
data
-------------------------------------------------------------------------
- BEGIN
- table public.test_prepared1: INSERT: id[integer]:1
- COMMIT
- BEGIN
- table public.test_prepared1: INSERT: id[integer]:2
- COMMIT
BEGIN
table public.test_prepared1: INSERT: id[integer]:4
COMMIT
BEGIN
- table public.test_prepared2: INSERT: id[integer]:7
- COMMIT
- BEGIN
table public.test_prepared1: INSERT: id[integer]:5
table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(7 rows)
+
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:7
COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#3'
+(1 row)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (8);
+INSERT INTO test_prepared2 VALUES (9);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+--------------------------------------------------------------------
BEGIN
table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
COMMIT
BEGIN
table public.test_prepared2: INSERT: id[integer]:9
COMMIT
-(22 rows)
+(6 rows)
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'pg_class'::regclass;
+ relation | locktype | mode
+----------+----------+------
+(0 rows)
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:12
+ PREPARE TRANSACTION 'test_prepared_lock2'
+ COMMIT PREPARED 'test_prepared_lock2'
+(8 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- will work normally after we commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+--------------------------------------
+ COMMIT PREPARED 'test_prepared_lock'
+(1 row)
+
+-- test savepoints
+BEGIN;
+SAVEPOINT test_savepoint;
+CREATE TABLE test_prepared_savepoint (a int);
+PREPARE TRANSACTION 'test_prepared_savepoint';
+COMMIT PREPARED 'test_prepared_savepoint';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+-------------------------------------------
+ COMMIT PREPARED 'test_prepared_savepoint'
+(1 row)
+
+-- test that a GID containing "nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+------
+(0 rows)
+
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+---------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:20 data[text]:null
+ COMMIT
+(3 rows)
+
+-- cleanup
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data
+------
+(0 rows)
SELECT pg_drop_replication_slot('regression_slot');
pg_drop_replication_slot
diff --git a/contrib/test_decoding/sql/prepared.sql b/contrib/test_decoding/sql/prepared.sql
index e72639767e..60725419fe 100644
--- a/contrib/test_decoding/sql/prepared.sql
+++ b/contrib/test_decoding/sql/prepared.sql
@@ -2,21 +2,25 @@
SET synchronous_commit = on;
SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
-- test simple successful use of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (1);
PREPARE TRANSACTION 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
INSERT INTO test_prepared1 VALUES (2);
-- test abort of a prepared xact
BEGIN;
INSERT INTO test_prepared1 VALUES (3);
PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
INSERT INTO test_prepared1 VALUES (4);
@@ -27,24 +31,83 @@ ALTER TABLE test_prepared1 ADD COLUMN data text;
INSERT INTO test_prepared1 VALUES (6, 'frakbar');
PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'test_prepared1'::regclass;
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
-- make sure stuff still works
INSERT INTO test_prepared1 VALUES (8);
INSERT INTO test_prepared2 VALUES (9);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+ AND relation = 'pg_class'::regclass;
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+RESET statement_timeout;
+
+COMMIT PREPARED 'test_prepared_lock';
+
+-- will work normally after we commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- test savepoints
+BEGIN;
+SAVEPOINT test_savepoint;
+CREATE TABLE test_prepared_savepoint (a int);
+PREPARE TRANSACTION 'test_prepared_savepoint';
+COMMIT PREPARED 'test_prepared_savepoint';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- test that a GID containing "nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
-- cleanup
DROP TABLE test_prepared1;
DROP TABLE test_prepared2;
--- show results
+-- show results. There should be nothing to show
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000000..50f269bef7
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,119 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+ 'postgresql.conf', qq(
+ max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+ "INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+ "SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# This test exercises a concurrent abort while logical decoding is in
+# progress. We pass the xid of the 2PC to the plugin as an option.
+# On receipt of a valid "check-xid", the change callback in the test_decoding
+# plugin waits for that transaction to be aborted.
+#
+# We fire off a ROLLBACK PREPARED from another session while the decode
+# is waiting.
+#
+# The status of "check-xid" then changes from in-progress to not-committed
+# (hence aborted), and decoding stops because the subsequent
+# system catalog scan errors out.
+
+$node_logical->safe_psql('postgres', "
+ BEGIN;
+ INSERT INTO tab VALUES (11);
+ INSERT INTO tab VALUES (12);
+ ALTER TABLE tab ADD COLUMN b INT;
+ INSERT INTO tab VALUES (13,14);
+ PREPARE TRANSACTION 'test_prepared_tab';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test_prepared_tab'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# start decoding the above by passing the "check-xid"
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now; the output should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid', '$xid2pc');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+ or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stopping decoding of xid $xid2pc")
+ or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+ BEGIN;
+ INSERT INTO tab VALUES (11);
+ INSERT INTO tab VALUES (12);
+ ALTER TABLE tab ADD COLUMN b INT;
+ INSERT INTO tab VALUES (13, 11);
+ PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit after the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+ my ($expected) = @_;
+
+ $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+ my $max_attempts = 180 * 10;
+ my $attempts = 0;
+
+ my $output_file = '';
+ while ($attempts < $max_attempts)
+ {
+ $output_file = slurp_file($node_logical->logfile());
+
+ if ($output_file =~ $expected)
+ {
+ return 1;
+ }
+
+ # Wait 0.1 second before retrying.
+ usleep(100_000);
+ $attempts++;
+ }
+
+ # The expected output did not appear within 180 seconds; give up.
+ return 0;
+}
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index e3f394f512..9687eb293b 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -11,12 +11,16 @@
*-------------------------------------------------------------------------
*/
#include "postgres.h"
+#include "miscadmin.h"
+#include "access/transam.h"
#include "catalog/pg_type.h"
#include "replication/logical.h"
#include "replication/origin.h"
+#include "storage/procarray.h"
+
#include "utils/builtins.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
@@ -36,6 +40,7 @@ typedef struct
bool skip_empty_xacts;
bool xact_wrote_changes;
bool only_local;
+ TransactionId check_xid; /* track abort of this txid */
} TestDecodingData;
static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -49,6 +54,8 @@ static void pg_output_begin(LogicalDecodingContext *ctx,
bool last_write);
static void pg_decode_commit_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pg_decode_abort_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr abort_lsn);
static void pg_decode_change(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, Relation rel,
ReorderBufferChange *change);
@@ -62,6 +69,18 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, XLogRecPtr message_lsn,
bool transactional, const char *prefix,
Size sz, const char *message);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
void
_PG_init(void)
@@ -80,9 +99,14 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
cb->change_cb = pg_decode_change;
cb->truncate_cb = pg_decode_truncate;
cb->commit_cb = pg_decode_commit_txn;
+ cb->abort_cb = pg_decode_abort_txn;
cb->filter_by_origin_cb = pg_decode_filter;
cb->shutdown_cb = pg_decode_shutdown;
cb->message_cb = pg_decode_message;
+ cb->filter_prepare_cb = pg_decode_filter_prepare;
+ cb->prepare_cb = pg_decode_prepare_txn;
+ cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+ cb->abort_prepared_cb = pg_decode_abort_prepared_txn;
}
@@ -102,11 +126,14 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
data->include_timestamp = false;
data->skip_empty_xacts = false;
data->only_local = false;
+ data->check_xid = InvalidTransactionId;
ctx->output_plugin_private = data;
opt->output_type = OUTPUT_PLUGIN_TEXTUAL_OUTPUT;
opt->receive_rewrites = false;
+ /* this plugin supports decoding of 2pc */
+ opt->enable_twophase = true;
foreach(option, ctx->output_plugin_options)
{
@@ -183,6 +210,32 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
errmsg("could not parse value \"%s\" for parameter \"%s\"",
strVal(elem->arg), elem->defname)));
}
+ else if (strcmp(elem->defname, "check-xid") == 0)
+ {
+ if (elem->arg)
+ {
+ errno = 0;
+ data->check_xid = (TransactionId)
+ strtoul(strVal(elem->arg), NULL, 0);
+
+ if (errno == EINVAL || errno == ERANGE)
+ ereport(FATAL,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("check-xid is not a valid number: \"%s\"",
+ strVal(elem->arg))));
+ }
+ else
+ ereport(FATAL,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("check-xid needs an input value")));
+
+ if (!TransactionIdIsValid(data->check_xid))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("Specify positive value for parameter \"%s\","
+ " you specified \"%s\"",
+ elem->defname, strVal(elem->arg))));
+ }
else
{
ereport(ERROR,
@@ -251,6 +304,116 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
OutputPluginWrite(ctx, true);
}
+/* ABORT callback */
+static void
+pg_decode_abort_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+ if (data->include_xids)
+ appendStringInfo(ctx->out, "ABORT %u", txn->xid);
+ else
+ appendStringInfoString(ctx->out, "ABORT");
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/*
+ * Filter out two-phase transactions.
+ *
+ * Each plugin can implement its own filtering logic. Here we demonstrate
+ * a simple approach that checks the GID: if it contains the "_nodecode"
+ * substring, the transaction is not decoded at PREPARE time and is
+ * instead decoded as a regular transaction at COMMIT PREPARED.
+ */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid)
+{
+ if (strstr(gid, "_nodecode") != NULL)
+ return true;
+
+ return false;
+}
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
+/* ABORT PREPARED callback */
+static void
+pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+ quote_literal_cstr(txn->gid));
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, " %u", txn->xid);
+
+ if (data->include_timestamp)
+ appendStringInfo(ctx->out, " (at %s)",
+ timestamptz_to_str(txn->commit_time));
+
+ OutputPluginWrite(ctx, true);
+}
+
static bool
pg_decode_filter(LogicalDecodingContext *ctx,
RepOriginId origin_id)
@@ -409,6 +572,22 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
}
data->xact_wrote_changes = true;
+ /* if check_xid is specified */
+ if (TransactionIdIsValid(data->check_xid))
+ {
+ elog(LOG, "waiting for %u to abort", data->check_xid);
+ while (TransactionIdIsInProgress(data->check_xid))
+ {
+ CHECK_FOR_INTERRUPTS();
+ pg_usleep(10000L);
+ }
+ if (!TransactionIdIsInProgress(data->check_xid) &&
+ !TransactionIdDidCommit(data->check_xid))
+ elog(LOG, "%u aborted", data->check_xid);
+
+ Assert(TransactionIdDidAbort(data->check_xid));
+ }
+
class_form = RelationGetForm(relation);
tupdesc = RelationGetDescr(relation);
--
2.15.2 (Apple Git-101.1)
Hi Arseny,
I hadn't checked whether my concerns were addressed in the latest
version though.
I'd like to believe that the latest patch set tries to address some
(if not all) of your concerns. Can you please take a look and let me
know?
Regards,
Nikhil
--
Nikhil Sontakke
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/
Nikhil Sontakke <nikhils@2ndquadrant.com> writes:
I'd like to believe that the latest patch set tries to address some
(if not all) of your concerns. Can you please take a look and let me
know?
Hi, sure.
General things:
- Earlier I said that there is no point in sending COMMIT PREPARED if
decoding snapshot became consistent after PREPARE, i.e. PREPARE hadn't
been sent. I realized since then that such use cases actually exist:
prepare might be copied to the replica by e.g. basebackup or something
else earlier. Still, a plugin must be able to easily distinguish these
too-early PREPAREs without doing its own bookkeeping (remembering each
PREPARE it has seen). Fortunately, it turns out we can make this
easy. If during COMMIT PREPARED / ABORT PREPARED record decoding we
see that ReorderBufferTXN with such xid exists, it means that either
1) plugin refused to do replay of this xact at PREPARE or 2) PREPARE
was too early in the stream. Otherwise xact would be replayed at
PREPARE processing and rbtxn purged immediately after. I think we
should add this to the documentation of filter_prepare_cb. Also, to
this end we need to add an argument to this callback specifying at
which context it was called: during prepare / commit prepared / abort
prepared. Also, for this to work, ReorderBufferProcessXid must be
always called at PREPARE, not only when 2PC decoding is disabled.
- BTW, ReorderBufferProcessXid at PREPARE should be always called
anyway, because otherwise if xact is empty, we will not prepare it
(and call cb), even if the output plugin asked us not to filter it
out. However, we will call commit_prepared cb, which is inconsistent.
- I find it weird that in DecodePrepare and in DecodeCommit you always
ask the plugin whether to filter an xact, given that sometimes you
know beforehand that you are not going to replay it: it might have
already been replayed, might have wrong dbid, origin, etc. One
consequence of this: imagine that notorious xact with PREPARE before
point where snapshot became consistent and COMMIT PREPARED after that
point. Even if filter_cb says 'I want 2PC on this xact', with current
code it won't be replayed on PREPARE and rbxid will be destroyed with
ReorderBufferForget. Now this xact is lost.
- Doing full-blown SnapBuildCommitTxn during PREPARE decoding is wrong,
because xact effects must not yet be seen to others. I discussed this
at length and described adjacent problems in [1]/messages/by-id/87zhxrwgvh.fsf@ars-thinkpad.
- I still don't like that if 2PC xact was aborted and its replay
stopped, prepare callback won't be called but abort_prepared would be.
This either should be documented or fixed.
Second patch:
+ /* filter_prepare is optional, but requires two-phase decoding */
+ if ((ctx->callbacks.filter_prepare_cb != NULL) && (!opt->enable_twophase))
+ ereport(ERROR,
+ (errmsg("Output plugin does not support two-phase decoding, but "
+ "registered filter_prepared callback.")));
I actually think that enable_twophase output plugin option is
redundant. If plugin author wants 2PC, he just provides
filter_prepare_cb callback and potentially others. I also don't see much
value in checking that exactly 0 or 3 callbacks were registered.
- You allow (commit|abort)_prepared_cb, prepare_cb callbacks to be not
specified with enabled 2PC and call them without check that they
actually exist.
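To illustrate the two remarks above (deriving 2PC support from the registered callbacks, and not calling a callback that was never registered), here is a minimal standalone sketch. It is not code from the patch: DemoCallbacks, demo_callbacks_consistent, demo_twophase_enabled, demo_call_prepare and demo_count_prepare are hypothetical names, and the struct only mimics the 2PC-related callback slots discussed in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the 2PC callback slots of an output plugin. */
typedef void (*demo_cb) (const char *gid);
typedef bool (*demo_filter_cb) (const char *gid);

typedef struct DemoCallbacks
{
	demo_filter_cb filter_prepare_cb;	/* optional, only useful with 2PC */
	demo_cb		prepare_cb;
	demo_cb		commit_prepared_cb;
	demo_cb		abort_prepared_cb;
} DemoCallbacks;

/* Number of registered callbacks among prepare/commit_prepared/abort_prepared. */
static int
demo_count_2pc_callbacks(const DemoCallbacks *cb)
{
	return (cb->prepare_cb != NULL) +
		(cb->commit_prepared_cb != NULL) +
		(cb->abort_prepared_cb != NULL);
}

/* All-or-nothing rule: either none or all three; filter needs all three. */
static bool
demo_callbacks_consistent(const DemoCallbacks *cb)
{
	int			n = demo_count_2pc_callbacks(cb);

	if (n != 0 && n != 3)
		return false;
	if (cb->filter_prepare_cb != NULL && n == 0)
		return false;
	return true;
}

/* 2PC decoding is inferred from the callbacks, without a separate option. */
static bool
demo_twophase_enabled(const DemoCallbacks *cb)
{
	return demo_count_2pc_callbacks(cb) == 3;
}

/* Guarded call: returns false instead of dereferencing a NULL callback. */
static bool
demo_call_prepare(const DemoCallbacks *cb, const char *gid)
{
	if (cb->prepare_cb == NULL)
		return false;
	cb->prepare_cb(gid);
	return true;
}

/* Example plugin callback that only counts how often it was invoked. */
static int	demo_prepare_calls = 0;

static void
demo_count_prepare(const char *gid)
{
	(void) gid;
	demo_prepare_calls++;
}
```

Whether the all-or-nothing rule should be enforced at all is exactly what is being debated here; the sketch only shows that both the consistency check and the guarded call are cheap to express.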
- executed within that transaction.
+ executed within that transaction. A transaction that is prepared for
+ a two-phase commit using <command>PREPARE TRANSACTION</command> will
+ also be decoded if the output plugin callbacks needed for decoding
+ them are provided. It is possible that the current transaction which
+ is being decoded is aborted concurrently via a <command>ROLLBACK PREPARED</command>
+ command. In that case, the logical decoding of this transaction will
+ be aborted too.
This should say explicitly that such 2PC xact will be decoded at PREPARE
record. Probably also add that otherwise it is decoded at CP
record. Probably also add "and abort_cb callback called" to the last
sentence.
+ The required <function>abort_cb</function> callback is called whenever
+ a transaction abort has to be initiated. This can happen if we are
This callback is not required in the code, and it would be indeed a bad
idea to demand it, breaking compatibility with existing plugins not
caring about 2PC.
+ * Otherwise call either PREPARE (for twophase transactions) or COMMIT
+ * (for regular ones).
+ */
+ if (rbtxn_rollback(txn))
+ rb->abort(rb, txn, commit_lsn);
This is dead code since we don't have decoding of in-progress xacts yet.
+ /*
+ * If there is a valid top-level transaction that's different from the
+ * two-phase one we are aborting, clear its reorder buffer as well.
+ */
+ if (TransactionIdIsNormal(xid) && xid != parsed->twophase_xid)
+ ReorderBufferAbort(ctx->reorder, xid, origin_lsn);
What is the aim of this? How can the xl_xid xid of a commit prepared record be
normal?
+ /*
+ * The transaction may or may not exist (during restarts for example).
+ * Anyways, 2PC transactions do not contain any reorderbuffers. So allow
+ * it to be created below.
+ */
Code around looks sane, but I think that restarts are irrelevant to
rbtxn existence at this moment: if we are going to COMMIT/ABORT PREPARED
it, it must have been replayed and rbtxn purged immediately after. The
only reason why rbtxn can exist here is invalidation addition
(ReorderBufferAddInvalidations) happening a couple of calls earlier.
Also, instead of misty '2PC transactions do not contain any
reorderbuffers' I would say something like 'create dummy
ReorderBufferTXN to pass it to the callback'.
- filter_prepare_cb callback existence is checked in both decode.c and
in filter_prepare_cb_wrapper.
Third patch:
+/*
+ * An xid value pointing to a possibly ongoing or a prepared transaction.
+ * Currently used in logical decoding. It's possible that such transactions
+ * can get aborted while the decoding is ongoing.
+ */
I would explain here that this xid is checked for abort after each
catalog scan, and refer to SetupHistoricSnapshot for the details.
Nitpicking:
First patch: I still don't think that these flags need a bitmask.
Second patch:
- I still think ReorderBufferCommitInternal name is confusing and should
be renamed to something like ReorderBufferReplay.
/* Do we know this is a subxact? Xid of top-level txn if so */
TransactionId toplevel_xid;
+ /* In case of 2PC we need to pass GID to output plugin */
+ char *gid;
Better to add a newline here, as between the other fields.
+ txn->txn_flags |= RBTXN_PREPARE;
+ txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+ strcpy(txn->gid, gid);
pstrdup?
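For readers outside the backend code, the pstrdup suggestion amounts to replacing the three-step "measure, allocate, copy" sequence with one call. The sketch below is standalone and uses a malloc-backed stand-in, demo_pstrdup, which is an assumption for illustration only: the real pstrdup allocates in CurrentMemoryContext and raises an error on out-of-memory instead of returning NULL. DemoTxn and demo_set_gid are likewise hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Stand-in for PostgreSQL's pstrdup(); backed by malloc here so that the
 * example compiles outside the backend.
 */
static char *
demo_pstrdup(const char *in)
{
	size_t		len = strlen(in) + 1;	/* include the trailing '\0' */
	char	   *out = malloc(len);

	if (out == NULL)
		abort();				/* mimic "never returns NULL" */
	memcpy(out, in, len);
	return out;
}

/* Hypothetical reduced transaction struct holding only the GID. */
typedef struct DemoTxn
{
	char	   *gid;
} DemoTxn;

/* One call replaces the palloc(strlen(gid) + 1) followed by strcpy(). */
static void
demo_set_gid(DemoTxn *txn, const char *gid)
{
	txn->gid = demo_pstrdup(gid);
}
```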
- ReorderBufferTxnIsPrepared and ReorderBufferPrepareNeedSkip do the
same and should be merged with comments explaining that the answer
must be stable.
+ The optional <function>commit_prepared_cb</function> callback is called whenever
+ a commit prepared transaction has been decoded. The <parameter>gid</parameter> field,
a commit prepared transaction *record* has been decoded?
Fourth patch:
Applying: Teach test_decoding plugin to work with 2PC
.git/rebase-apply/patch:347: trailing whitespace.
-- test savepoints
.git/rebase-apply/patch:424: trailing whitespace.
# get XID of the above two-phase transaction
warning: 2 lines add whitespace errors.
[1]: /messages/by-id/87zhxrwgvh.fsf@ars-thinkpad
--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hi,
I think the difference between abort and abort prepared should be
explained better (I am not quite sure I get it myself).
+ The required <function>abort_cb</function> callback is called whenever
Also, why is this one required when all the 2pc stuff is optional?
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare * parsed)
+{
+ XLogRecPtr origin_lsn = parsed->origin_lsn;
+ TimestampTz commit_time = parsed->origin_timestamp;
+ XLogRecPtr origin_id = XLogRecGetOrigin(buf->record);
+ TransactionId xid = parsed->twophase_xid;
+ bool skip;
+
+ Assert(parsed->dbId != InvalidOid);
+ Assert(TransactionIdIsValid(parsed->twophase_xid));
+
+ /* Whether or not this PREPARE needs to be skipped. */
+ skip = DecodeEndOfTxn(ctx, buf, parsed, xid);
+
+ FinalizeTxnDecoding(ctx, buf, parsed, xid, skip);
Given that DecodeEndOfTxn calls SnapBuildCommitTxn, won't this make the
catalog changes done by prepared transaction visible to other
transactions (which is undesirable as they should only be visible after
it's committed) ?
+ if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+ !(IsCatalogRelation(scan->rs_rd) ||
+ RelationIsUsedAsCatalogTable(scan->rs_rd))))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+ errmsg("improper heap_getnext call")));
+
I think we should log the relation oid as well so that plugin developers
have an easier time debugging this (for all variants of this).
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 14/01/2019 23:16, Arseny Sher wrote:
Nikhil Sontakke <nikhils@2ndquadrant.com> writes:
I'd like to believe that the latest patch set tries to address some
(if not all) of your concerns. Can you please take a look and let me
know?
Hi, sure.
General things:
- Earlier I said that there is no point in sending COMMIT PREPARED if
decoding snapshot became consistent after PREPARE, i.e. PREPARE hadn't
been sent. I realized since then that such use cases actually exist:
prepare might be copied to the replica by e.g. basebackup or something
else earlier.
Basebackup does not copy slots, though, and a slot should not reach
consistency until all prepared transactions are committed, no?
- BTW, ReorderBufferProcessXid at PREPARE should be always called
anyway, because otherwise if xact is empty, we will not prepare it
(and call cb), even if the output plugin asked us not to filter it
out. However, we will call commit_prepared cb, which is inconsistent.
- I find it weird that in DecodePrepare and in DecodeCommit you always
ask the plugin whether to filter an xact, given that sometimes you
know beforehand that you are not going to replay it: it might have
already been replayed, might have wrong dbid, origin, etc. One
consequence of this: imagine that notorious xact with PREPARE before
point where snapshot became consistent and COMMIT PREPARED after that
point. Even if filter_cb says 'I want 2PC on this xact', with current
code it won't be replayed on PREPARE and rbxid will be destroyed with
ReorderBufferForget. Now this xact is lost.
Yeah this is wrong.
Second patch:
+ /* filter_prepare is optional, but requires two-phase decoding */
+ if ((ctx->callbacks.filter_prepare_cb != NULL) && (!opt->enable_twophase))
+ ereport(ERROR,
+ (errmsg("Output plugin does not support two-phase decoding, but "
+ "registered filter_prepared callback.")));
I actually think that enable_twophase output plugin option is
redundant. If plugin author wants 2PC, he just provides
filter_prepare_cb callback and potentially others.
+1
I also don't see much
value in checking that exactly 0 or 3 callbacks were registered.
I think that check makes sense, if you support 2pc you need to register
all callbacks.
Nitpicking:
First patch: I still don't think that these flags need a bitmask.
Since we are discussing this, I personally prefer the bitmask here.
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Eyeballing 0001, it has a few problems.
1. It's under-parenthesizing the txn argument of the macros.
2. the "has"/"is" macro definitions don't return booleans -- see
fce4609d5e5b.
3. the remainder of this no longer makes sense:
/* Do we know this is a subxact? Xid of top-level txn if so */
- bool is_known_as_subxact;
TransactionId toplevel_xid;
I suggest to fix the comment, and also improve the comment next to the
macro that tests this flag.
(4. the macro names are ugly.)
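Points 1 and 2 can be shown with a small standalone sketch. The flag values, DemoTxn, and the demo_* macros are hypothetical and are not taken from 0001; they only show the general shape of a flag-test macro that parenthesizes its argument and yields a real boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits; the last one deliberately lies above the low byte. */
#define DEMO_HAS_CATALOG_CHANGES	0x0001
#define DEMO_IS_SUBXACT				0x0002
#define DEMO_IS_PREPARED			0x0100

typedef struct DemoTxn
{
	uint32_t	txn_flags;
} DemoTxn;

/*
 * Fragile form: "txn" is not parenthesized, so an argument such as "&t"
 * expands to "&t->txn_flags" and fails to compile, and the result is the
 * raw masked value (0x0100 here) rather than 0 or 1.
 */
#define demo_is_prepared_bad(txn) \
	(txn->txn_flags & DEMO_IS_PREPARED)

/*
 * Safer form: the argument is parenthesized and the masked value is
 * compared against zero, so the macro always yields 0 or 1.
 */
#define demo_is_prepared(txn) \
	(((txn)->txn_flags & DEMO_IS_PREPARED) != 0)
```

The non-boolean result matters once the value is stored in a narrow type: with the flag set, converting demo_is_prepared_bad(p) to unsigned char gives 0, because 0x0100 does not fit in eight bits, while demo_is_prepared(p) gives 1.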
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Jan 25, 2019 at 02:03:27PM -0300, Alvaro Herrera wrote:
Eyeballing 0001, it has a few problems.
1. It's under-parenthesizing the txn argument of the macros.
2. the "has"/"is" macro definitions don't return booleans -- see
fce4609d5e5b.
3. the remainder of this no longer makes sense:
/* Do we know this is a subxact? Xid of top-level txn if so */
- bool is_known_as_subxact;
TransactionId toplevel_xid;
I suggest to fix the comment, and also improve the comment next to the
macro that tests this flag.
(4. the macro names are ugly.)
This is an old thread, and the latest review is very recent. So I am
moving the patch to next CF, waiting on author.
--
Michael
I don't understand why this patch record has been kept alive for so long,
since no new version has been sent in ages. If this patch is really
waiting on the author, let's see the author do something. If no voice
is heard very soon, I'll close this patch as RwF.
If others want to see this feature in PostgreSQL, they are welcome to
contribute.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 9/2/19 6:12 PM, Alvaro Herrera wrote:
I don't understand why this patch record has been kept alive for so long,
since no new version has been sent in ages. If this patch is really
waiting on the author, let's see the author do something. If no voice
is heard very soon, I'll close this patch as RwF.
+1. I should have marked this RWF in March but I ignored it because it
was tagged v13 before the CF started.
--
-David
david@pgmasters.net