Commit Timestamp and LSN Inversion issue
Hello hackers,
(Cc people involved in the earlier discussion)
I would like to discuss the $Subject.
While discussing Logical Replication's Conflict Detection and
Resolution (CDR) design in [1], it came to our notice that the
commit LSN and timestamp may not correlate perfectly i.e. commits may
happen with LSN1 < LSN2 but with Ts1 > Ts2. This issue may arise
because, during the commit process, the timestamp (xactStopTimestamp)
is captured slightly earlier than when space is reserved in the WAL.
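To make the window concrete, here is a small standalone demo (illustrative
only, not PostgreSQL code; the thread/counter setup is our own construction)
that mimics the pattern of reading the clock first and reserving a
monotonically increasing "LSN" afterwards. With enough concurrency it can show
commits whose LSN order and timestamp order disagree:

/*
 * Standalone illustration (not PostgreSQL code): each thread reads the
 * clock first ("xactStopTimestamp") and then reserves a monotonically
 * increasing sequence number ("ReserveXLogInsertLocation").  Because the
 * two steps are not atomic, a thread holding an older timestamp can
 * reserve a later sequence number, i.e. LSN1 < LSN2 with Ts1 > Ts2.
 *
 * Build: gcc -O2 -pthread lsn_ts_inversion.c -o lsn_ts_inversion
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define NTHREADS 8
#define NCOMMITS 200000

static atomic_uint_fast64_t lsn_counter = 1;
static uint64_t ts_by_lsn[NTHREADS * NCOMMITS + 1];

static uint64_t
now_usec(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_REALTIME, &ts);
	return (uint64_t) ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
}

static void *
worker(void *arg)
{
	for (int i = 0; i < NCOMMITS; i++)
	{
		uint64_t ts = now_usec();                         /* timestamp first */
		uint64_t lsn = atomic_fetch_add(&lsn_counter, 1); /* "LSN" second */

		ts_by_lsn[lsn] = ts;
	}
	return NULL;
}

int
main(void)
{
	pthread_t	th[NTHREADS];
	uint64_t	inversions = 0;

	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&th[i], NULL, worker, NULL);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(th[i], NULL);

	/* count adjacent "LSN"s whose timestamps went backwards */
	for (uint64_t lsn = 2; lsn < lsn_counter; lsn++)
		if (ts_by_lsn[lsn] < ts_by_lsn[lsn - 1])
			inversions++;

	printf("commits: %llu, LSN/timestamp inversions: %llu\n",
		   (unsigned long long) (lsn_counter - 1),
		   (unsigned long long) inversions);
	return 0;
}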
~~
Reproducibility of conflict-resolution problem due to the timestamp inversion
------------------------------------------------
It was suggested that timestamp inversion *may* impact the time-based
resolutions such as last_update_wins (targeted to be implemented in
[1]) as we may end up making wrong decisions if timestamps and LSNs
are not correctly ordered. And thus we tried some tests but failed to
find any practical scenario where it could be a problem.
Basically, the proposed conflict resolution is a row-level resolution,
and to cause the row value to be inconsistent, we need to modify the
same row in concurrent transactions and commit the changes
concurrently. But this doesn't seem possible because concurrent
updates on the same row are disallowed (e.g., the later update will be
blocked due to the row lock). See [2] for the details.
We gave some thought to multi-table cases as well, e.g., updating
table A, which has a foreign key, and updating the table B that table A
refers to. But the update on table A will block the update on table B as
well, so we could not reproduce data divergence due to the
LSN/timestamp mismatch issue there.
~~
Idea proposed to fix the timestamp inversion issue
------------------------------------------------
There was a suggestion in [3] to acquire the timestamp while reserving
the space (because that happens in LSN order). The clock would need to
be monotonic (easy enough with CLOCK_MONOTONIC), but also cheap. The
main reason it is currently done outside the critical section is that
gettimeofday() may be quite expensive. There's a concept of a hybrid
clock, combining "time" and logical counter, which might be useful
independently of CDR.
On further analyzing this idea, we found that CLOCK_MONOTONIC is
accepted only by clock_gettime(), which has more precision than
gettimeofday() and is therefore, in theory, equally or more expensive (we
plan to test this and post the results). It does not look like a good
idea to call either of these while holding the spinlock that reserves
the WAL position. As for the suggested "hybrid clock", it might not
help here because the logical counter is only used to order
transactions that carry the same timestamp. The real problem is how to
obtain the timestamp together with the WAL position reservation
(ReserveXLogInsertLocation()).
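For reference, a minimal sketch of what such a hybrid (Lamport-style) clock
tick could look like (illustrative only; the names and types are ours, not an
existing API). It makes the limitation visible: the counter only disambiguates
commits that see the same physical timestamp, and nothing in it ties the
result to the order in which WAL space is reserved:

#include <stdint.h>

/* Illustrative hybrid timestamp: physical microseconds plus a tie-breaker. */
typedef struct HybridTs
{
	int64_t		wall;		/* physical clock reading, in microseconds */
	int32_t		counter;	/* logical counter, only breaks ties */
} HybridTs;

/* "last" would have to live in shared memory and be updated under a lock. */
static HybridTs
hybrid_clock_tick(HybridTs *last, int64_t now_usec)
{
	HybridTs	next;

	if (now_usec > last->wall)
	{
		/* clock moved forward: take it and reset the counter */
		next.wall = now_usec;
		next.counter = 0;
	}
	else
	{
		/* clock stalled or went backwards: reuse last wall time, bump counter */
		next.wall = last->wall;
		next.counter = last->counter + 1;
	}
	*last = next;
	return next;
}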
~~
We can explore this further, but as we are not able to see any
real-world scenario where this could actually be a problem, it may or
may not be worth spending time on it. Thoughts?
[1]: (See: "How is this going to deal with the fact that commit LSN and timestamps may not correlate perfectly?"). /messages/by-id/CAJpy0uBWBEveM8LO2b7wNZ47raZ9tVJw3D2_WXd8-b6LSqP6HA@mail.gmail.com
(See: "How is this going to deal with the fact that commit LSN and
timestamps may not correlate perfectly?").
/messages/by-id/CAJpy0uBWBEveM8LO2b7wNZ47raZ9tVJw3D2_WXd8-b6LSqP6HA@mail.gmail.com
[2]: /messages/by-id/CAA4eK1JTMiBOoGqkt=aLPLU8Rs45ihbLhXaGHsz8XC76+OG3+Q@mail.gmail.com
/messages/by-id/CAA4eK1JTMiBOoGqkt=aLPLU8Rs45ihbLhXaGHsz8XC76+OG3+Q@mail.gmail.com
[3]: (See: "The clock would need to be monotonic (easy enough with CLOCK_MONOTONIC"). /messages/by-id/a3a70a19-a35e-426c-8646-0898cdc207c8@enterprisedb.com
(See: "The clock would need to be monotonic (easy enough with
CLOCK_MONOTONIC").
/messages/by-id/a3a70a19-a35e-426c-8646-0898cdc207c8@enterprisedb.com
thanks
Shveta
Hi Shveta,
While discussing Logical Replication's Conflict Detection and
Resolution (CDR) design in [1] , it came to our notice that the
commit LSN and timestamp may not correlate perfectly i.e. commits may
happen with LSN1 < LSN2 but with Ts1 > Ts2. This issue may arise
because, during the commit process, the timestamp (xactStopTimestamp)
is captured slightly earlier than when space is reserved in the WAL.
[...]
There was a suggestion in [3] to acquire the timestamp while reserving
the space (because that happens in LSN order). The clock would need to
be monotonic (easy enough with CLOCK_MONOTONIC), but also cheap. The
main problem why it's being done outside the critical section, because
gettimeofday() may be quite expensive. There's a concept of hybrid
clock, combining "time" and logical counter, which might be useful
independently of CDR.
I don't think you can rely on a system clock for conflict resolution.
In a corner case a DBA can move the clock forward or backward between
recordings of Ts1 and Ts2. On top of that there is no guarantee that
2+ servers have synchronised clocks. It seems to me that what you are
proposing will just hide the problem instead of solving it in the
general case.
This is the reason why {latest|earliest}_timestamp_wins strategies you
are proposing to use for CDR are poor strategies. In practice they
work as random_key_wins which is not extremely useful (what it does is
basically _deleting_ random data, not solving conflicts). On top of
that strategies like this don't take into account checks and
constraints the database may have, including high-level constraints
that may not be explicitly defined in the DBMS but may exist in the
application logic.
Unfortunately I can't reference a particular article or white paper on
the subject but I know that "last write wins" was widely criticized
back in the 2010s when people were building distributed systems on
commodity hardware. In this time period I worked on several projects
as a backend software engineer and I can assure you that LWW is not
something you want.
IMO the right approach to the problem would be defining procedures for
conflict resolution that may not only semi-randomly choose between two
tuples but also implement a user-defined logic. Similarly to INSERT
INTO ... ON CONFLICT ... semantics, or similar approaches from
long-lived and well-explored distributed systems, e.g. Riak.
Alternatively / additionally we could support CRDTs in Postgres.
--
Best regards,
Aleksander Alekseev
On Wed, Sep 4, 2024 at 2:05 PM Aleksander Alekseev
<aleksander@timescale.com> wrote:
While discussing Logical Replication's Conflict Detection and
Resolution (CDR) design in [1] , it came to our notice that the
commit LSN and timestamp may not correlate perfectly i.e. commits may
happen with LSN1 < LSN2 but with Ts1 > Ts2. This issue may arise
because, during the commit process, the timestamp (xactStopTimestamp)
is captured slightly earlier than when space is reserved in the WAL.
[...]
There was a suggestion in [3] to acquire the timestamp while reserving
the space (because that happens in LSN order). The clock would need to
be monotonic (easy enough with CLOCK_MONOTONIC), but also cheap. The
main problem why it's being done outside the critical section, because
gettimeofday() may be quite expensive. There's a concept of hybrid
clock, combining "time" and logical counter, which might be useful
independently of CDR.
I don't think you can rely on a system clock for conflict resolution.
In a corner case a DBA can move the clock forward or backward between
recordings of Ts1 and Ts2. On top of that there is no guarantee that
2+ servers have synchronised clocks. It seems to me that what you are
proposing will just hide the problem instead of solving it in the
general case.
It is possible that we can't rely on the system clock for conflict
resolution but that is not the specific point of this thread. As
mentioned in the subject of this thread, the problem is "Commit
Timestamp and LSN Inversion issue". The LSN value and timestamp for a
commit are not generated atomically, so two different transactions can
have them in different order.
Your point as far as I can understand is that in the first place, it
is not a good idea to have a strategy like "last_write_wins" which
relies on the system clock. So, even if LSN->Timestamp ordering has
any problem, it won't matter to us. Now, we can discuss whether
"last_write_wins" is a poor strategy or not, but if possible, for the
sake of this thread, let's assume that users of the resolution feature
("last_write_wins") either ensure that clocks are synced or do not
enable the feature, and then see if we can think of any problem w.r.t.
the current code.
--
With Regards,
Amit Kapila.
Hi Amit,
I don't think you can rely on a system clock for conflict resolution.
In a corner case a DBA can move the clock forward or backward between
recordings of Ts1 and Ts2. On top of that there is no guarantee that
2+ servers have synchronised clocks. It seems to me that what you are
proposing will just hide the problem instead of solving it in the
general case.
It is possible that we can't rely on the system clock for conflict
resolution but that is not the specific point of this thread. As
mentioned in the subject of this thread, the problem is "Commit
Timestamp and LSN Inversion issue". The LSN value and timestamp for a
commit are not generated atomically, so two different transactions can
have them in different order.
Hm.... Then I'm having difficulties understanding why this is a
problem and why it was necessary to mention CDR in this context in the
first place.
OK, let's forget about CDR completely. Who is affected by the current
behavior and why would it be beneficial changing it?
--
Best regards,
Aleksander Alekseev
On Wed, Sep 4, 2024 at 6:35 PM Aleksander Alekseev
<aleksander@timescale.com> wrote:
I don't think you can rely on a system clock for conflict resolution.
In a corner case a DBA can move the clock forward or backward between
recordings of Ts1 and Ts2. On top of that there is no guarantee that
2+ servers have synchronised clocks. It seems to me that what you are
proposing will just hide the problem instead of solving it in the
general case.
It is possible that we can't rely on the system clock for conflict
resolution but that is not the specific point of this thread. As
mentioned in the subject of this thread, the problem is "Commit
Timestamp and LSN Inversion issue". The LSN value and timestamp for a
commit are not generated atomically, so two different transactions can
have them in different order.
Hm.... Then I'm having difficulties understanding why this is a
problem
This is a potential problem pointed out during discussion of CDR [1]
(Please read the point starting from "How is this going to deal .."
and response by Shveta). The point of this thread is that though it
appears to be a problem, practically there is no scenario where it
can have an impact even when we implement the "last_write_wins"
strategy, as explained in the initial email. If you or someone sees a problem due
to LSN<->timestamp inversion then we need to explore the solution for
it.
and why it was necessary to mention CDR in this context in the
first place.
OK, let's forget about CDR completely. Who is affected by the current
behavior and why would it be beneficial changing it?
We can't forget CDR completely as this could only be a potential
problem in that context. Right now, we don't have any built-in
resolution strategies, so this can't impact but if this is a problem
then we need to have a solution for it before considering a solution
like "last_write_wins" strategy.
Now, instead of discussing LSN<->timestamp inversion issue, you
started to discuss "last_write_wins" strategy itself which we have
discussed to some extent in the thread [2]. BTW, we are planning to
start a separate thread as well just to discuss the clock skew problem
w.r.t resolution strategies like "last_write_wins" strategy. So, we
can discuss clock skew in that thread and keep the focus of this
thread on the LSN<->timestamp inversion problem.
[1]: /messages/by-id/CAJpy0uBWBEveM8LO2b7wNZ47raZ9tVJw3D2_WXd8-b6LSqP6HA@mail.gmail.com
[2]: /messages/by-id/CAJpy0uD0-DpYVMtsxK5R=zszXauZBayQMAYET9sWr_w0CNWXxQ@mail.gmail.com
--
With Regards,
Amit Kapila.
On Wed, Sep 4, 2024 at 12:23 PM shveta malik <shveta.malik@gmail.com> wrote:
Hello hackers,
(Cc people involved in the earlier discussion)
I would like to discuss the $Subject.
While discussing Logical Replication's Conflict Detection and
Resolution (CDR) design in [1] , it came to our notice that the
commit LSN and timestamp may not correlate perfectly i.e. commits may
happen with LSN1 < LSN2 but with Ts1 > Ts2. This issue may arise
because, during the commit process, the timestamp (xactStopTimestamp)
is captured slightly earlier than when space is reserved in the WAL.
~~
Reproducibility of conflict-resolution problem due to the timestamp inversion
------------------------------------------------
It was suggested that timestamp inversion *may* impact the time-based
resolutions such as last_update_wins (targeted to be implemented in
[1]) as we may end up making wrong decisions if timestamps and LSNs
are not correctly ordered. And thus we tried some tests but failed to
find any practical scenario where it could be a problem.
Basically, the proposed conflict resolution is a row-level resolution,
and to cause the row value to be inconsistent, we need to modify the
same row in concurrent transactions and commit the changes
concurrently. But this doesn't seem possible because concurrent
updates on the same row are disallowed (e.g., the later update will be
blocked due to the row lock). See [2] for the details.
We tried to give some thoughts on multi table cases as well e.g.,
update table A with foreign key and update the table B that table A
refers to. But update on table A will block the update on table B as
well, so we could not reproduce data-divergence due to the
LSN/timestamp mismatch issue there.
~~
Idea proposed to fix the timestamp inversion issue
------------------------------------------------
There was a suggestion in [3] to acquire the timestamp while reserving
the space (because that happens in LSN order). The clock would need to
be monotonic (easy enough with CLOCK_MONOTONIC), but also cheap. The
main problem why it's being done outside the critical section, because
gettimeofday() may be quite expensive. There's a concept of hybrid
clock, combining "time" and logical counter, which might be useful
independently of CDR.
On further analyzing this idea, we found that CLOCK_MONOTONIC can be
accepted only by clock_gettime() which has more precision than
gettimeofday() and thus is equally or more expensive theoretically (we
plan to test it and post the results). It does not look like a good
idea to call any of these when holding spinlock to reserve the wal
position. As for the suggested solution "hybrid clock", it might not
help here because the logical counter is only used to order the
transactions with the same timestamp. The problem here is how to get
the timestamp along with wal position
reservation(ReserveXLogInsertLocation).
Here are the tests done to compare clock_gettime() and gettimeofday()
performance.
Machine details :
Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz
CPU(s): 120; 800GB RAM
Three functions were tested across three different call volumes (1
million, 100 million, and 1 billion):
1) clock_gettime() with CLOCK_REALTIME
2) clock_gettime() with CLOCK_MONOTONIC
3) gettimeofday()
--> clock_gettime() with CLOCK_MONOTONIC sometimes shows slightly
better performance, but not consistently. The difference in time taken
by all three functions is minimal, with averages varying by no more
than ~2.5%. Overall, the performance between CLOCK_MONOTONIC and
gettimeofday() is essentially the same.
Below are the test results -
(each test was run twice for consistency)
1) For 1 million calls:
1a) clock_gettime() with CLOCK_REALTIME:
- Run 1: 0.01770 seconds, Run 2: 0.01772 seconds, Average: 0.01771 seconds.
1b) clock_gettime() with CLOCK_MONOTONIC:
- Run 1: 0.01753 seconds, Run 2: 0.01748 seconds, Average: 0.01750 seconds.
1c) gettimeofday():
- Run 1: 0.01742 seconds, Run 2: 0.01777 seconds, Average: 0.01760 seconds.
2) For 100 million calls:
2a) clock_gettime() with CLOCK_REALTIME:
- Run 1: 1.76649 seconds, Run 2: 1.76602 seconds, Average: 1.76625 seconds.
2b) clock_gettime() with CLOCK_MONOTONIC:
- Run 1: 1.72768 seconds, Run 2: 1.72988 seconds, Average: 1.72878 seconds.
2c) gettimeofday():
- Run 1: 1.72436 seconds, Run 2: 1.72174 seconds, Average: 1.72305 seconds.
3) For 1 billion calls:
3a) clock_gettime() with CLOCK_REALTIME:
- Run 1: 17.63859 seconds, Run 2: 17.65529 seconds, Average:
17.64694 seconds.
3b) clock_gettime() with CLOCK_MONOTONIC:
- Run 1: 17.15109 seconds, Run 2: 17.27406 seconds, Average:
17.21257 seconds.
3c) gettimeofday():
- Run 1: 17.21368 seconds, Run 2: 17.22983 seconds, Average:
17.22175 seconds.
~~~~
Attached are the scripts used for the tests.
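(The attached scripts are not reproduced here; the following is only a rough
sketch of the kind of timing loop involved, with our own file name and
structure, in case someone wants to repeat the comparison.)

/*
 * Rough sketch of a clock-call micro-benchmark; not the attached scripts.
 * Build: gcc -O2 clockbench.c -o clockbench
 * Run:   ./clockbench 1000000
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <time.h>

static double
elapsed_sec(struct timespec a, struct timespec b)
{
	return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int
main(int argc, char **argv)
{
	long		n = (argc > 1) ? atol(argv[1]) : 1000000;
	struct timespec start, stop, ts;
	struct timeval tv;

	/* clock_gettime() with CLOCK_REALTIME */
	clock_gettime(CLOCK_MONOTONIC, &start);
	for (long i = 0; i < n; i++)
		clock_gettime(CLOCK_REALTIME, &ts);
	clock_gettime(CLOCK_MONOTONIC, &stop);
	printf("clock_gettime(CLOCK_REALTIME):  %.5f s\n", elapsed_sec(start, stop));

	/* clock_gettime() with CLOCK_MONOTONIC */
	clock_gettime(CLOCK_MONOTONIC, &start);
	for (long i = 0; i < n; i++)
		clock_gettime(CLOCK_MONOTONIC, &ts);
	clock_gettime(CLOCK_MONOTONIC, &stop);
	printf("clock_gettime(CLOCK_MONOTONIC): %.5f s\n", elapsed_sec(start, stop));

	/* gettimeofday() */
	clock_gettime(CLOCK_MONOTONIC, &start);
	for (long i = 0; i < n; i++)
		gettimeofday(&tv, NULL);
	clock_gettime(CLOCK_MONOTONIC, &stop);
	printf("gettimeofday():                 %.5f s\n", elapsed_sec(start, stop));

	return 0;
}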
--
Thanks,
Nisha
Hi Hackers,
On 9/5/24 01:39, Amit Kapila wrote:
On Wed, Sep 4, 2024 at 6:35 PM Aleksander Alekseev
<aleksander@timescale.com> wrote:
I don't think you can rely on a system clock for conflict resolution.
In a corner case a DBA can move the clock forward or backward between
recordings of Ts1 and Ts2. On top of that there is no guarantee that
2+ servers have synchronised clocks. It seems to me that what you are
proposing will just hide the problem instead of solving it in the
general case.
It is possible that we can't rely on the system clock for conflict
resolution but that is not the specific point of this thread. As
mentioned in the subject of this thread, the problem is "Commit
Timestamp and LSN Inversion issue". The LSN value and timestamp for a
commit are not generated atomically, so two different transactions can
have them in different order.
Hm.... Then I'm having difficulties understanding why this is a problem
This is a potential problem pointed out during discussion of CDR [1]
(Please read the point starting from "How is this going to deal .."
and response by Shveta). The point of this thread is that though it
appears to be a problem but practically there is no scenario where it
can impact even when we implement "last_write_wins" strategy as
explained in the initial email. If you or someone sees a problem due
to LSN<->timestamp inversion then we need to explore the solution for
it.
and why it was necessary to mention CDR in this context in the first place.
OK, let's forget about CDR completely. Who is affected by the current
behavior and why would it be beneficial changing it?
We can't forget CDR completely as this could only be a potential
problem in that context. Right now, we don't have any built-in
resolution strategies, so this can't impact but if this is a problem
then we need to have a solution for it before considering a solution
like "last_write_wins" strategy.
I agree that we can't forget about CDR. This is precisely the problem we
ran into here at pgEdge and why we came up with a solution (attached).
Now, instead of discussing LSN<->timestamp inversion issue, you
started to discuss "last_write_wins" strategy itself which we have
discussed to some extent in the thread [2]. BTW, we are planning to
start a separate thread as well just to discuss the clock skew problem
w.r.t resolution strategies like "last_write_wins" strategy. So, we
can discuss clock skew in that thread and keep the focus of this
thread LSN<->timestamp inversion problem.
Fact is that "last_write_wins" together with some implementation of
Conflict free Replicated Data Types (CRDT) is good enough for many real
world situations. Anything resembling a TPC-B or TPC-C is quite happy
with it.
The attached solution is minimally invasive because it doesn't move the
timestamp generation (clock_gettime() call) into the critical section of
ReserveXLogInsertLocation() that is protected by a spinlock. Instead it
keeps track of the last commit-ts written to WAL in shared memory and
simply bumps that by one microsecond if the next one is below or equal.
There is one extra condition in that code section plus a function call
by pointer for every WAL record. In the unlikely case of encountering a
stalled or backwards running commit-ts, the expensive part of
recalculating the CRC of the altered commit WAL-record is done later,
after releasing the spinlock. I have not been able to measure any
performance impact on a machine with 2x Xeon-Silver (32 HT cores).
This will work fine until we have systems that can sustain a commit rate
of one million transactions per second or higher for more than a few
milliseconds.
Regards, Jan
Attachments:
pg18-025-logical_commit_clock.diff
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 004f7e10e5..c0d2466f41 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -371,6 +371,12 @@ static void ShowTransactionStateRec(const char *str, TransactionState s);
static const char *BlockStateAsString(TBlockState blockState);
static const char *TransStateAsString(TransState state);
+/*
+ * Hook function to be called while holding the WAL insert spinlock.
+ * to adjust commit timestamps via Lamport clock if needed.
+ */
+static void EnsureMonotonicTransactionStopTimestamp(void *data);
+
/* ----------------------------------------------------------------
* transaction state accessors
@@ -2215,6 +2221,13 @@ StartTransaction(void)
if (TransactionTimeout > 0)
enable_timeout_after(TRANSACTION_TIMEOUT, TransactionTimeout);
+ /*
+ * Reset XLogReserveInsertHook (function called while holding
+ * the WAL insert spinlock)
+ */
+ XLogReserveInsertHook = NULL;
+ XLogReserveInsertHookData = NULL;
+
ShowTransactionState("StartTransaction");
}
@@ -5837,6 +5850,7 @@ XactLogCommitRecord(TimestampTz commit_time,
xl_xact_twophase xl_twophase;
xl_xact_origin xl_origin;
uint8 info;
+ XLogRecPtr result;
Assert(CritSectionCount > 0);
@@ -5980,7 +5994,19 @@ XactLogCommitRecord(TimestampTz commit_time,
/* we allow filtering by xacts */
XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
- return XLogInsert(RM_XACT_ID, info);
+ /*
+ * Install our hook for the call to ReserveXLogInsertLocation() so that
+ * we can modify the xactStopTimestamp and the xact_time of the xlrec
+ * while holding the lock that determines the commit-LSN to ensure
+ * the commit timestamps are monotonically increasing.
+ */
+ XLogReserveInsertHook = EnsureMonotonicTransactionStopTimestamp;
+ XLogReserveInsertHookData = (void *)&xlrec;
+ result = XLogInsert(RM_XACT_ID, info);
+ XLogReserveInsertHook = NULL;
+ XLogReserveInsertHookData = NULL;
+
+ return result;
}
/*
@@ -6451,3 +6477,40 @@ xact_redo(XLogReaderState *record)
else
elog(PANIC, "xact_redo: unknown op code %u", info);
}
+
+/*
+ * Hook function used in XactLogCommitRecord() to ensure that the
+ * commit timestamp is monotonically increasing in commit-LSN order.
+ */
+static void
+EnsureMonotonicTransactionStopTimestamp(void *data)
+{
+ xl_xact_commit *xlrec = (xl_xact_commit *)data;
+ TimestampTz logical_clock;
+
+ logical_clock = XLogGetLastTransactionStopTimestamp();
+
+ /*
+ * This is a local transaction. Make sure that the xact_time
+ * higher than any timestamp we have seen thus far.
+ *
+ * TODO: This is not postmaster restart safe. If the local
+ * system clock is further behind other nodes than it takes
+ * for the postmaster to restart (time between it stops
+ * accepting new transactions and time when it becomes ready
+ * to accept new transactions), local transactions will not
+ * be bumped into the future correctly.
+ */
+ if (logical_clock >= xlrec->xact_time)
+ {
+ logical_clock++;
+ xlrec->xact_time = logical_clock;
+ xactStopTimestamp = logical_clock;
+
+ XLogReserveInsertHookModifiedRecord = true;
+ }
+ else
+ logical_clock = xlrec->xact_time;
+
+ XLogSetLastTransactionStopTimestamp(logical_clock);
+}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f14d3933ae..e1d3122445 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -156,6 +156,16 @@ int wal_segment_size = DEFAULT_XLOG_SEG_SIZE;
*/
int CheckPointSegments;
+/*
+ * Hook to be called inside of ReserveXLogInsertLocation() for
+ * operations that need to be performed while holding the WAL
+ * insert spinlock. Currently this is used to guarantee a monotonically
+ * increasing commit-timestamp in LSN order.
+ */
+XLogReserveInsertHookType XLogReserveInsertHook = NULL;
+void *XLogReserveInsertHookData = NULL;
+bool XLogReserveInsertHookModifiedRecord = false;
+
/* Estimated distance between checkpoints, in bytes */
static double CheckPointDistanceEstimate = 0;
static double PrevCheckPointDistance = 0;
@@ -552,6 +562,14 @@ typedef struct XLogCtlData
XLogRecPtr lastFpwDisableRecPtr;
slock_t info_lck; /* locks shared variables shown above */
+
+ /*
+ * This is our shared, logical clock that we use to force
+ * commit timestamps to be monotonically increasing in
+ * commit-LSN order. This is protected by the Wal-insert
+ * spinlock.
+ */
+ TimestampTz lastTransactionStopTimestamp;
} XLogCtlData;
/*
@@ -701,6 +719,7 @@ static void CopyXLogRecordToWAL(int write_len, bool isLogSwitch,
XLogRecData *rdata,
XLogRecPtr StartPos, XLogRecPtr EndPos,
TimeLineID tli);
+static void XLogRecordCorrectCRC(XLogRecData *rdata);
static void ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos,
XLogRecPtr *EndPos, XLogRecPtr *PrevPtr);
static bool ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos,
@@ -779,6 +798,12 @@ XLogInsertRecord(XLogRecData *rdata,
if (!XLogInsertAllowed())
elog(ERROR, "cannot make new WAL entries during recovery");
+ /*
+ * Make sure the flag telling that ReserveXLog...() modified the
+ * record is false at this point.
+ */
+ XLogReserveInsertHookModifiedRecord = false;
+
/*
* Given that we're not in recovery, InsertTimeLineID is set and can't
* change, so we can read it without a lock.
@@ -907,6 +932,16 @@ XLogInsertRecord(XLogRecData *rdata,
if (inserted)
{
+ /*
+ * If our reserve hook modified the XLog Record,
+ * recalculate the CRC.
+ */
+ if (XLogReserveInsertHookModifiedRecord)
+ {
+ XLogRecordCorrectCRC(rdata);
+ XLogReserveInsertHookModifiedRecord = false;
+ }
+
/*
* Now that xl_prev has been filled in, calculate CRC of the record
* header.
@@ -1087,6 +1122,25 @@ XLogInsertRecord(XLogRecData *rdata,
return EndPos;
}
+/*
+ * Function to recalculate the WAL Record's CRC in case it was
+ * altered during the callback from ReserveXLogInsertLocation().
+ */
+static void
+XLogRecordCorrectCRC(XLogRecData *rdata)
+{
+ XLogRecData *rdt;
+ XLogRecord *rechdr = (XLogRecord *)rdata->data;
+ pg_crc32c rdata_crc;
+
+ INIT_CRC32C(rdata_crc);
+ COMP_CRC32C(rdata_crc, rdata->data + SizeOfXLogRecord, rdata->len - SizeOfXLogRecord);
+ for (rdt = rdata->next; rdt != NULL; rdt = rdt->next)
+ COMP_CRC32C(rdata_crc, rdt->data, rdt->len);
+
+ rechdr->xl_crc = rdata_crc;
+}
+
/*
* Reserves the right amount of space for a record of given size from the WAL.
* *StartPos is set to the beginning of the reserved section, *EndPos to
@@ -1131,6 +1185,12 @@ ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos, XLogRecPtr *EndPos,
*/
SpinLockAcquire(&Insert->insertpos_lck);
+ /*
+ * If set call the XLogReserveInsertHook function
+ */
+ if (XLogReserveInsertHook != NULL)
+ XLogReserveInsertHook(XLogReserveInsertHookData);
+
startbytepos = Insert->CurrBytePos;
endbytepos = startbytepos + size;
prevbytepos = Insert->PrevBytePos;
@@ -1190,6 +1250,12 @@ ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos, XLogRecPtr *PrevPtr)
return false;
}
+ /*
+ * If set call the XLogReserveInsertHook function
+ */
+ if (XLogReserveInsertHook != NULL)
+ XLogReserveInsertHook(XLogReserveInsertHookData);
+
endbytepos = startbytepos + size;
prevbytepos = Insert->PrevBytePos;
@@ -9517,3 +9583,19 @@ SetWalWriterSleeping(bool sleeping)
XLogCtl->WalWriterSleeping = sleeping;
SpinLockRelease(&XLogCtl->info_lck);
}
+
+/*
+ * Get/set the last-transaction-stop-timestamp in shared memory.
+ * Caller must ensure that the WAL-insert spinlock is held.
+ */
+extern TimestampTz
+XLogGetLastTransactionStopTimestamp(void)
+{
+ return XLogCtl->lastTransactionStopTimestamp;
+}
+
+extern void
+XLogSetLastTransactionStopTimestamp(TimestampTz ts)
+{
+ XLogCtl->lastTransactionStopTimestamp = ts;
+}
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 34ad46c067..187d1dc9bc 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -199,6 +199,17 @@ typedef enum WALAvailability
struct XLogRecData;
struct XLogReaderState;
+/*
+ * Hook called from inside of holding the lock that determines
+ * the LSN order of WAL records. We currently use this to ensure that
+ * commit timestamps are monotonically increasing in their LSN
+ * order.
+ */
+typedef void (*XLogReserveInsertHookType)(void *data);
+extern XLogReserveInsertHookType XLogReserveInsertHook;
+extern void *XLogReserveInsertHookData;
+extern bool XLogReserveInsertHookModifiedRecord;
+
extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata,
XLogRecPtr fpw_lsn,
uint8 flags,
@@ -270,6 +281,13 @@ extern void SetInstallXLogFileSegmentActive(void);
extern bool IsInstallXLogFileSegmentActive(void);
extern void XLogShutdownWalRcv(void);
+/*
+ * Functions to access the last commit Lamport timestamp held in
+ * XLogCtl.
+ */
+extern TimestampTz XLogGetLastTransactionStopTimestamp(void);
+extern void XLogSetLastTransactionStopTimestamp(TimestampTz tz);
+
/*
* Routines to start, stop, and get status of a base backup.
*/
On Tue, Nov 5, 2024 at 7:28 PM Jan Wieck <jan@wi3ck.info> wrote:
We can't forget CDR completely as this could only be a potential
problem in that context. Right now, we don't have any built-in
resolution strategies, so this can't impact but if this is a problem
then we need to have a solution for it before considering a solution
like "last_write_wins" strategy.I agree that we can't forget about CDR. This is precisely the problem we
ran into here at pgEdge and why we came up with a solution (attached).
I would like to highlight that we need to solve the LSN<->Timestamp
inversion issue not only for resolution strategies like
'last_write_wins' but also for conflict detection. In
particular, while implementing/discussing the patch to detect the
update_deleted conflict type, we came across the race conditions [1]
where the inversion issue discussed here would lead to removing the
required rows before we could detect the conflict. So, +1 to solve
this issue.
Now, instead of discussing LSN<->timestamp inversion issue, you
started to discuss "last_write_wins" strategy itself which we have
discussed to some extent in the thread [2]. BTW, we are planning to
start a separate thread as well just to discuss the clock skew problem
w.r.t resolution strategies like "last_write_wins" strategy. So, we
can discuss clock skew in that thread and keep the focus of this
thread LSN<->timestamp inversion problem.
Fact is that "last_write_wins" together with some implementation of
Conflict free Replicated Data Types (CRDT) is good enough for many real
world situations. Anything resembling a TPC-B or TPC-C is quite happy
with it.
The attached solution is minimally invasive because it doesn't move the
timestamp generation (clock_gettime() call) into the critical section of
ReserveXLogInsertLocation() that is protected by a spinlock. Instead it
keeps track of the last commit-ts written to WAL in shared memory and
simply bumps that by one microsecond if the next one is below or equal.
There is one extra condition in that code section plus a function call
by pointer for every WAL record.
I think we generally avoid calling hook/callback functions while holding
a lock (spinlock or LWLock), as the user may do an expensive operation or
acquire some other locks in those functions, which could lead to
deadlocks or impact concurrency. So, it would be better to
directly call an inline function to perform the required operation.
This sounds like a good idea to solve this problem. Thanks for sharing
the patch.
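For illustration, a rough sketch of that inline-function suggestion (names
such as pending_commit_rec and the exact placement are assumptions on our
side, not an agreed design) could be a static inline helper called from
ReserveXLogInsertLocation() while insertpos_lck is held:

/*
 * Illustrative sketch only: keep the commit timestamp monotonic in LSN
 * order without going through a hook pointer.  Assumes a backend-global
 * "pending_commit_rec" set by XactLogCommitRecord() before XLogInsert(),
 * and a shared lastXactStopTime protected by the insertpos_lck spinlock.
 */
static inline void
AdjustCommitTimestampForLSNOrder(void)
{
	if (pending_commit_rec == NULL)
		return;					/* not reserving space for a commit record */

	if (XLogCtl->lastXactStopTime >= pending_commit_rec->xact_time)
	{
		/* clock stalled or went backwards: bump by one microsecond */
		XLogCtl->lastXactStopTime++;
		pending_commit_rec->xact_time = XLogCtl->lastXactStopTime;
	}
	else
		XLogCtl->lastXactStopTime = pending_commit_rec->xact_time;
}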
[1]: /messages/by-id/CAA4eK1LKgkjyNKeW5jEhy1=uE8z0p7Pdae0rohoJP51eJGd=gg@mail.gmail.com
--
With Regards,
Amit Kapila.
On 11/6/24 06:23, Amit Kapila wrote:
I think we generally avoid calling hook/callback functions while holding
a lock (spinlock or LWLock), as the user may do an expensive operation or
acquire some other locks in those functions, which could lead to
deadlocks or impact concurrency. So, it would be better to
directly call an inline function to perform the required operation.
This is a valid concern. The reason why I kept it as a hook is because
ReserveXLogInsertLocation() has no knowledge that this is allocating WAL
space for a commit record. Only the caller does. We certainly need to be
extremely careful what any such hook function is doing. Acquiring
additional locks is definitely not acceptable. But I am not sure we
should burden this function with specialized knowledge about what it is
reserving WAL space for.
Regards, Jan
On Tuesday, November 5, 2024 9:59 PM Jan Wieck <jan@wi3ck.info> wrote:
Hi Hackers,
On 9/5/24 01:39, Amit Kapila wrote:
We can't forget CDR completely as this could only be a potential
problem in that context. Right now, we don't have any built-in
resolution strategies, so this can't impact but if this is a problem
then we need to have a solution for it before considering a solution
like "last_write_wins" strategy.I agree that we can't forget about CDR. This is precisely the problem we ran into
here at pgEdge and why we came up with a solution (attached).
Now, instead of discussing LSN<->timestamp inversion issue, you
started to discuss "last_write_wins" strategy itself which we have
discussed to some extent in the thread [2]. BTW, we are planning to
start a separate thread as well just to discuss the clock skew problem
w.r.t resolution strategies like "last_write_wins" strategy. So, we
can discuss clock skew in that thread and keep the focus of this
thread LSN<->timestamp inversion problem.
Fact is that "last_write_wins" together with some implementation of Conflict
free Replicated Data Types (CRDT) is good enough for many real world
situations. Anything resembling a TPC-B or TPC-C is quite happy with it.
The attached solution is minimally invasive because it doesn't move the
timestamp generation (clock_gettime() call) into the critical section of
ReserveXLogInsertLocation() that is protected by a spinlock. Instead it keeps
track of the last commit-ts written to WAL in shared memory and simply bumps
that by one microsecond if the next one is below or equal.
There is one extra condition in that code section plus a function call by pointer
for every WAL record. In the unlikely case of encountering a stalled or
backwards running commit-ts, the expensive part of recalculating the CRC of
the altered commit WAL-record is done later, after releasing the spinlock. I have
not been able to measure any performance impact on a machine with 2x
Xeon-Silver (32 HT cores).
This will work fine until we have systems that can sustain a commit rate of one
million transactions per second or higher for more than a few milliseconds.
Thanks for the patch! I am reading the patch and noticed a few minor things.
1.
+ /*
+ * This is a local transaction. Make sure that the xact_time
+ * higher than any timestamp we have seen thus far.
+ *
+ * TODO: This is not postmaster restart safe. If the local
+ * system clock is further behind other nodes than it takes
+ * for the postmaster to restart (time between it stops
+ * accepting new transactions and time when it becomes ready
+ * to accept new transactions), local transactions will not
+ * be bumped into the future correctly.
+ */
The TODO section mentions other nodes, but I believe the patch currently does
not have any handling of timestamps for other nodes. Should we either remove
this part or add a brief explanation to clarify the relationship with other
nodes?
2.
+/*
+ * Hook function to be called while holding the WAL insert spinlock.
+ * to adjust commit timestamps via Lamport clock if needed.
+ */
The second line seems like it can be improved:
"to adjust commit timestamps .." => "It adjusts commit timestamps ..."
Best Regards,
Hou zj
On 11/6/24 21:30, Zhijie Hou (Fujitsu) wrote:
Thanks for the patch! I am reading the patch and noticed a few minor things.
1.
+ /*
+ * This is a local transaction. Make sure that the xact_time
+ * higher than any timestamp we have seen thus far.
+ *
+ * TODO: This is not postmaster restart safe. If the local
+ * system clock is further behind other nodes than it takes
+ * for the postmaster to restart (time between it stops
+ * accepting new transactions and time when it becomes ready
+ * to accept new transactions), local transactions will not
+ * be bumped into the future correctly.
+ */
The TODO section mentions other nodes, but I believe the patch currently does
not have any handling of timestamps for other nodes. Should we either remove
this part or add a brief explanation to clarify the relationship with other
nodes?
That TODO is actually obsolete. I understood from Amit Kapila that the
community does assume that NTP synchronization is good enough. And it
indeed is. Even my servers here at home are using a GPS based NTP server
connected to the LAN and are usually in sync by single-digit
microseconds. I removed it.
2.
+/*
+ * Hook function to be called while holding the WAL insert spinlock.
+ * to adjust commit timestamps via Lamport clock if needed.
+ */
The second line seems like it can be improved:
"to adjust commit timestamps .." => "It adjusts commit timestamps ..."
How about
/*
* Hook function to be called while holding the WAL insert spinlock.
* It guarantees that commit timestamps are advancing in LSN order.
*/
static void EnsureMonotonicTransactionStopTimestamp(void *data);
Thank you for looking at this and your input. New patch attached.
Best Regards, Jan
Attachments:
pg18-025-logical_commit_clock.diff
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b7ebcc2a55..0406dc44be 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -370,6 +370,12 @@ static void ShowTransactionStateRec(const char *str, TransactionState s);
static const char *BlockStateAsString(TBlockState blockState);
static const char *TransStateAsString(TransState state);
+/*
+ * Hook function to be called while holding the WAL insert spinlock.
+ * It guarantees that commit timestamps are advancing in LSN order.
+ */
+static void EnsureMonotonicTransactionStopTimestamp(void *data);
+
/* ----------------------------------------------------------------
* transaction state accessors
@@ -2214,6 +2220,13 @@ StartTransaction(void)
if (TransactionTimeout > 0)
enable_timeout_after(TRANSACTION_TIMEOUT, TransactionTimeout);
+ /*
+ * Reset XLogReserveInsertHook (function called while holding
+ * the WAL insert spinlock)
+ */
+ XLogReserveInsertHook = NULL;
+ XLogReserveInsertHookData = NULL;
+
ShowTransactionState("StartTransaction");
}
@@ -5831,6 +5844,7 @@ XactLogCommitRecord(TimestampTz commit_time,
xl_xact_twophase xl_twophase;
xl_xact_origin xl_origin;
uint8 info;
+ XLogRecPtr result;
Assert(CritSectionCount > 0);
@@ -5974,7 +5988,19 @@ XactLogCommitRecord(TimestampTz commit_time,
/* we allow filtering by xacts */
XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
- return XLogInsert(RM_XACT_ID, info);
+ /*
+ * Install our hook for the call to ReserveXLogInsertLocation() so that
+ * we can modify the xactStopTimestamp and the xact_time of the xlrec
+ * while holding the lock that determines the commit-LSN to ensure
+ * the commit timestamps are monotonically increasing.
+ */
+ XLogReserveInsertHook = EnsureMonotonicTransactionStopTimestamp;
+ XLogReserveInsertHookData = (void *)&xlrec;
+ result = XLogInsert(RM_XACT_ID, info);
+ XLogReserveInsertHook = NULL;
+ XLogReserveInsertHookData = NULL;
+
+ return result;
}
/*
@@ -6445,3 +6471,33 @@ xact_redo(XLogReaderState *record)
else
elog(PANIC, "xact_redo: unknown op code %u", info);
}
+
+/*
+ * Hook function used in XactLogCommitRecord() to ensure that the
+ * commit timestamp is monotonically increasing in commit-LSN order.
+ */
+static void
+EnsureMonotonicTransactionStopTimestamp(void *data)
+{
+ xl_xact_commit *xlrec = (xl_xact_commit *)data;
+ TimestampTz logical_clock;
+
+ logical_clock = XLogGetLastTransactionStopTimestamp();
+
+ /*
+ * This is a local transaction. Make sure that the xact_time
+ * higher than any timestamp we have seen thus far.
+ */
+ if (logical_clock >= xlrec->xact_time)
+ {
+ logical_clock++;
+ xlrec->xact_time = logical_clock;
+ xactStopTimestamp = logical_clock;
+
+ XLogReserveInsertHookModifiedRecord = true;
+ }
+ else
+ logical_clock = xlrec->xact_time;
+
+ XLogSetLastTransactionStopTimestamp(logical_clock);
+}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6f58412bca..bb70a7de09 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -155,6 +155,16 @@ int wal_segment_size = DEFAULT_XLOG_SEG_SIZE;
*/
int CheckPointSegments;
+/*
+ * Hook to be called inside of ReserveXLogInsertLocation() for
+ * operations that need to be performed while holding the WAL
+ * insert spinlock. Currently this is used to guarantee a monotonically
+ * increasing commit-timestamp in LSN order.
+ */
+XLogReserveInsertHookType XLogReserveInsertHook = NULL;
+void *XLogReserveInsertHookData = NULL;
+bool XLogReserveInsertHookModifiedRecord = false;
+
/* Estimated distance between checkpoints, in bytes */
static double CheckPointDistanceEstimate = 0;
static double PrevCheckPointDistance = 0;
@@ -551,6 +561,14 @@ typedef struct XLogCtlData
XLogRecPtr lastFpwDisableRecPtr;
slock_t info_lck; /* locks shared variables shown above */
+
+ /*
+ * This is our shared, logical clock that we use to force
+ * commit timestamps to be monotonically increasing in
+ * commit-LSN order. This is protected by the Wal-insert
+ * spinlock.
+ */
+ TimestampTz lastTransactionStopTimestamp;
} XLogCtlData;
/*
@@ -700,6 +718,7 @@ static void CopyXLogRecordToWAL(int write_len, bool isLogSwitch,
XLogRecData *rdata,
XLogRecPtr StartPos, XLogRecPtr EndPos,
TimeLineID tli);
+static void XLogRecordCorrectCRC(XLogRecData *rdata);
static void ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos,
XLogRecPtr *EndPos, XLogRecPtr *PrevPtr);
static bool ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos,
@@ -778,6 +797,12 @@ XLogInsertRecord(XLogRecData *rdata,
if (!XLogInsertAllowed())
elog(ERROR, "cannot make new WAL entries during recovery");
+ /*
+ * Make sure the flag telling that ReserveXLog...() modified the
+ * record is false at this point.
+ */
+ XLogReserveInsertHookModifiedRecord = false;
+
/*
* Given that we're not in recovery, InsertTimeLineID is set and can't
* change, so we can read it without a lock.
@@ -906,6 +931,16 @@ XLogInsertRecord(XLogRecData *rdata,
if (inserted)
{
+ /*
+ * If our reserve hook modified the XLog Record,
+ * recalculate the CRC.
+ */
+ if (XLogReserveInsertHookModifiedRecord)
+ {
+ XLogRecordCorrectCRC(rdata);
+ XLogReserveInsertHookModifiedRecord = false;
+ }
+
/*
* Now that xl_prev has been filled in, calculate CRC of the record
* header.
@@ -1086,6 +1121,25 @@ XLogInsertRecord(XLogRecData *rdata,
return EndPos;
}
+/*
+ * Function to recalculate the WAL Record's CRC in case it was
+ * altered during the callback from ReserveXLogInsertLocation().
+ */
+static void
+XLogRecordCorrectCRC(XLogRecData *rdata)
+{
+ XLogRecData *rdt;
+ XLogRecord *rechdr = (XLogRecord *)rdata->data;
+ pg_crc32c rdata_crc;
+
+ INIT_CRC32C(rdata_crc);
+ COMP_CRC32C(rdata_crc, rdata->data + SizeOfXLogRecord, rdata->len - SizeOfXLogRecord);
+ for (rdt = rdata->next; rdt != NULL; rdt = rdt->next)
+ COMP_CRC32C(rdata_crc, rdt->data, rdt->len);
+
+ rechdr->xl_crc = rdata_crc;
+}
+
/*
* Reserves the right amount of space for a record of given size from the WAL.
* *StartPos is set to the beginning of the reserved section, *EndPos to
@@ -1130,6 +1184,12 @@ ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos, XLogRecPtr *EndPos,
*/
SpinLockAcquire(&Insert->insertpos_lck);
+ /*
+ * If set call the XLogReserveInsertHook function
+ */
+ if (XLogReserveInsertHook != NULL)
+ XLogReserveInsertHook(XLogReserveInsertHookData);
+
startbytepos = Insert->CurrBytePos;
endbytepos = startbytepos + size;
prevbytepos = Insert->PrevBytePos;
@@ -1189,6 +1249,12 @@ ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos, XLogRecPtr *PrevPtr)
return false;
}
+ /*
+ * If set call the XLogReserveInsertHook function
+ */
+ if (XLogReserveInsertHook != NULL)
+ XLogReserveInsertHook(XLogReserveInsertHookData);
+
endbytepos = startbytepos + size;
prevbytepos = Insert->PrevBytePos;
@@ -9510,3 +9576,19 @@ SetWalWriterSleeping(bool sleeping)
XLogCtl->WalWriterSleeping = sleeping;
SpinLockRelease(&XLogCtl->info_lck);
}
+
+/*
+ * Get/set the last-transaction-stop-timestamp in shared memory.
+ * Caller must ensure that the WAL-insert spinlock is held.
+ */
+extern TimestampTz
+XLogGetLastTransactionStopTimestamp(void)
+{
+ return XLogCtl->lastTransactionStopTimestamp;
+}
+
+extern void
+XLogSetLastTransactionStopTimestamp(TimestampTz ts)
+{
+ XLogCtl->lastTransactionStopTimestamp = ts;
+}
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 34ad46c067..187d1dc9bc 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -199,6 +199,17 @@ typedef enum WALAvailability
struct XLogRecData;
struct XLogReaderState;
+/*
+ * Hook called from inside of holding the lock that determines
+ * the LSN order of WAL records. We currently use this to ensure that
+ * commit timestamps are monotonically increasing in their LSN
+ * order.
+ */
+typedef void (*XLogReserveInsertHookType)(void *data);
+extern XLogReserveInsertHookType XLogReserveInsertHook;
+extern void *XLogReserveInsertHookData;
+extern bool XLogReserveInsertHookModifiedRecord;
+
extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata,
XLogRecPtr fpw_lsn,
uint8 flags,
@@ -270,6 +281,13 @@ extern void SetInstallXLogFileSegmentActive(void);
extern bool IsInstallXLogFileSegmentActive(void);
extern void XLogShutdownWalRcv(void);
+/*
+ * Functions to access the last commit Lamport timestamp held in
+ * XLogCtl.
+ */
+extern TimestampTz XLogGetLastTransactionStopTimestamp(void);
+extern void XLogSetLastTransactionStopTimestamp(TimestampTz tz);
+
/*
* Routines to start, stop, and get status of a base backup.
*/
Hi,
On 2024-11-05 08:58:36 -0500, Jan Wieck wrote:
The attached solution is minimally invasive because it doesn't move the
timestamp generation (clock_gettime() call) into the critical section of
ReserveXLogInsertLocation() that is protected by a spinlock. Instead it
keeps track of the last commit-ts written to WAL in shared memory and simply
bumps that by one microsecond if the next one is below or equal. There is
one extra condition in that code section plus a function call by pointer for
every WAL record. In the unlikely case of encountering a stalled or
backwards running commit-ts, the expensive part of recalculating the CRC of
the altered commit WAL-record is done later, after releasing the spinlock. I
have not been able to measure any performance impact on a machine with 2x
Xeon-Silver (32 HT cores).
I think it's *completely* unacceptable to call a hook inside
ReserveXLogInsertLocation, with the spinlock held, no less. That's the most
contended section of code in postgres.
Even leaving that aside, something with a spinlock held is never a suitable
place to call a hook. As the header comments say:
* Keep in mind the coding rule that spinlocks must not be held for more
* than a few instructions. In particular, we assume it is not possible
* for a CHECK_FOR_INTERRUPTS() to occur while holding a spinlock, and so
* it is not necessary to do HOLD/RESUME_INTERRUPTS() in these macros.
I'm also quite certain that we don't want to just allow redirecting XLogInsert
to some random function. WAL logging is very crucial code, and allowing it to
be randomly redirected to something we don't know about will make it harder to
write correct WAL logging code and makes already quite complicated code even
more complicated.
This will work fine until we have systems that can sustain a commit rate of
one million transactions per second or higher for more than a few
milliseconds.
It's sustainable for many minutes today.
Greetings,
Andres Freund
On 11/7/24 13:19, Andres Freund wrote:
I think it's *completely* unacceptable to call a hook inside
ReserveXLogInsertLocation, with the spinlock held, no less. That's the most
contended section of code in postgres.
Aside from your technical concerns, can we at least agree that the
commit-ts vs LSN inversion is a problem that should get addressed?
Regards, Jan
Hi,
On 2024-11-07 15:05:31 -0500, Jan Wieck wrote:
On 11/7/24 13:19, Andres Freund wrote:
I think it's *completely* unacceptable to call a hook inside
ReserveXLogInsertLocation, with the spinlock held, no less. That's the most
contended section of code in postgres.
Aside from your technical concerns, can we at least agree that the commit-ts
vs LSN inversion is a problem that should get addressed?
I'm not against addressing it, but I'd consider it not a high priority,
and the acceptable cost (both runtime and maintenance) of it would
need to be extremely low.
Greetings,
Andres Freund
On Friday, November 8, 2024 2:20 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2024-11-05 08:58:36 -0500, Jan Wieck wrote:
The attached solution is minimally invasive because it doesn't move
the timestamp generation (clock_gettime() call) into the critical
section of
ReserveXLogInsertLocation() that is protected by a spinlock. Instead
it keeps track of the last commit-ts written to WAL in shared memory
and simply bumps that by one microsecond if the next one is below or
equal. There is one extra condition in that code section plus a
function call by pointer for every WAL record. In the unlikely case of
encountering a stalled or backwards running commit-ts, the expensive
part of recalculating the CRC of the altered commit WAL-record is done
later, after releasing the spinlock. I have not been able to measure
any performance impact on a machine with 2x Xeon-Silver (32 HT cores).
I think it's *completely* unacceptable to call a hook inside
ReserveXLogInsertLocation, with the spinlock held, no less. That's the most
contended section of code in postgres.
I understand your concern and appreciate the feedback. I've made some
adjustments to the patch by directly placing the code to adjust the commit
timestamp within the spinlock, aiming to keep it as efficient as possible. The
changes have resulted in just a few extra lines. Would this approach be
acceptable to you?
Additionally, we're also doing performance tests for it and will share the
results once they're available.
Best Regards,
Hou zj
Attachments:
v2-0001-POC-Make-commit-timestamp-monotonic-increase.patch
From 1edaedf40d85baad7c6e1d12adae4e5ccc72077e Mon Sep 17 00:00:00 2001
From: Hou Zhijie <houzj.fnst@cn.fujitsu.com>
Date: Fri, 8 Nov 2024 15:10:48 +0800
Subject: [PATCH v4] Make commit timestamp monotonic increase
---
src/backend/access/transam/xact.c | 18 ++++-
src/backend/access/transam/xlog.c | 129 ++++++++++++++++++++++++++++--
src/include/access/xact.h | 2 +
3 files changed, 140 insertions(+), 9 deletions(-)
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b7ebcc2a55..0caaeae913 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -278,7 +278,7 @@ static bool currentCommandIdUsed;
*/
static TimestampTz xactStartTimestamp;
static TimestampTz stmtStartTimestamp;
-static TimestampTz xactStopTimestamp;
+TimestampTz xactStopTimestamp;
/*
* GID to be used for preparing the current transaction. This is also
@@ -325,6 +325,7 @@ typedef struct SubXactCallbackItem
static SubXactCallbackItem *SubXact_callbacks = NULL;
+xl_xact_commit *xlcommitrec = NULL;
/* local function prototypes */
static void AssignTransactionId(TransactionState s);
@@ -2214,6 +2215,9 @@ StartTransaction(void)
if (TransactionTimeout > 0)
enable_timeout_after(TRANSACTION_TIMEOUT, TransactionTimeout);
+ /* Reset xlcommitrec */
+ xlcommitrec = NULL;
+
ShowTransactionState("StartTransaction");
}
@@ -5831,6 +5835,7 @@ XactLogCommitRecord(TimestampTz commit_time,
xl_xact_twophase xl_twophase;
xl_xact_origin xl_origin;
uint8 info;
+ XLogRecPtr result;
Assert(CritSectionCount > 0);
@@ -5974,7 +5979,16 @@ XactLogCommitRecord(TimestampTz commit_time,
/* we allow filtering by xacts */
XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
- return XLogInsert(RM_XACT_ID, info);
+ /*
+ * Save the commit xlrec so that we can modify the xactStopTimestamp and
+ * the xact_time of the xlrec while holding the lock that determines the
+ * commit-LSN to ensure the commit timestamps are monotonically increasing.
+ */
+ xlcommitrec = &xlrec;
+ result = XLogInsert(RM_XACT_ID, info);
+ xlcommitrec = NULL;
+
+ return result;
}
/*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6f58412bca..8efad83df3 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -155,6 +155,12 @@ int wal_segment_size = DEFAULT_XLOG_SEG_SIZE;
*/
int CheckPointSegments;
+/*
+ * Whether the xact_time of xlrec is modified to ensure the commit timestamps
+ * are monotonically increasing.
+ */
+static bool XlogRecordModified = false;
+
/* Estimated distance between checkpoints, in bytes */
static double CheckPointDistanceEstimate = 0;
static double PrevCheckPointDistance = 0;
@@ -551,6 +557,14 @@ typedef struct XLogCtlData
XLogRecPtr lastFpwDisableRecPtr;
slock_t info_lck; /* locks shared variables shown above */
+
+ /*
+ * This is our shared, logical clock that we use to force
+ * commit timestamps to be monotonically increasing in
+ * commit-LSN order. This is protected by the Wal-insert
+ * spinlock.
+ */
+ TimestampTz lastXactStopTime;
} XLogCtlData;
/*
@@ -700,6 +714,7 @@ static void CopyXLogRecordToWAL(int write_len, bool isLogSwitch,
XLogRecData *rdata,
XLogRecPtr StartPos, XLogRecPtr EndPos,
TimeLineID tli);
+static void XLogRecordCorrectCRC(XLogRecData *rdata);
static void ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos,
XLogRecPtr *EndPos, XLogRecPtr *PrevPtr);
static bool ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos,
@@ -778,6 +793,12 @@ XLogInsertRecord(XLogRecData *rdata,
if (!XLogInsertAllowed())
elog(ERROR, "cannot make new WAL entries during recovery");
+ /*
+ * Make sure the flag telling that ReserveXLog...() modified the record is
+ * false at this point.
+ */
+ XlogRecordModified = false;
+
/*
* Given that we're not in recovery, InsertTimeLineID is set and can't
* change, so we can read it without a lock.
@@ -906,6 +927,15 @@ XLogInsertRecord(XLogRecData *rdata,
if (inserted)
{
+ /*
+ * If modified the XLog Record, recalculate the CRC.
+ */
+ if (XlogRecordModified)
+ {
+ XLogRecordCorrectCRC(rdata);
+ XlogRecordModified = false;
+ }
+
/*
* Now that xl_prev has been filled in, calculate CRC of the record
* header.
@@ -1086,6 +1116,25 @@ XLogInsertRecord(XLogRecData *rdata,
return EndPos;
}
+/*
+ * Function to recalculate the WAL Record's CRC in case it was altered to
+ * ensure a monotonically increasing commit timestamp in LSN order.
+ */
+static void
+XLogRecordCorrectCRC(XLogRecData *rdata)
+{
+ XLogRecData *rdt;
+ XLogRecord *rechdr = (XLogRecord *) rdata->data;
+ pg_crc32c rdata_crc;
+
+ INIT_CRC32C(rdata_crc);
+ COMP_CRC32C(rdata_crc, rdata->data + SizeOfXLogRecord, rdata->len - SizeOfXLogRecord);
+ for (rdt = rdata->next; rdt != NULL; rdt = rdt->next)
+ COMP_CRC32C(rdata_crc, rdt->data, rdt->len);
+
+ rechdr->xl_crc = rdata_crc;
+}
+
/*
* Reserves the right amount of space for a record of given size from the WAL.
* *StartPos is set to the beginning of the reserved section, *EndPos to
@@ -1112,6 +1161,10 @@ ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos, XLogRecPtr *EndPos,
uint64 startbytepos;
uint64 endbytepos;
uint64 prevbytepos;
+ TimestampTz orgxacttime = 0;
+
+ if (xlcommitrec)
+ orgxacttime = xlcommitrec->xact_time;
size = MAXALIGN(size);
@@ -1127,21 +1180,56 @@ ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos, XLogRecPtr *EndPos,
* positions (XLogRecPtrs) can be done outside the locked region, and
* because the usable byte position doesn't include any headers, reserving
* X bytes from WAL is almost as simple as "CurrBytePos += X".
+ *
+ * Note that for the commit record, we need to ensure that the commit
+ * timestamp increases monotonically in commit-LSN order to avoid
+ * inconsistency between the two.
*/
- SpinLockAcquire(&Insert->insertpos_lck);
+ if (xlcommitrec)
+ {
+ SpinLockAcquire(&Insert->insertpos_lck);
- startbytepos = Insert->CurrBytePos;
- endbytepos = startbytepos + size;
- prevbytepos = Insert->PrevBytePos;
- Insert->CurrBytePos = endbytepos;
- Insert->PrevBytePos = startbytepos;
+ /*
+ * This is a local transaction. Make sure that the xact_time is higher
+ * than any timestamp we have seen thus far.
+ */
+ if (unlikely(XLogCtl->lastXactStopTime >= xlcommitrec->xact_time))
+ {
+ XLogCtl->lastXactStopTime++;
+ xlcommitrec->xact_time = XLogCtl->lastXactStopTime;
+ xactStopTimestamp = XLogCtl->lastXactStopTime;
+ }
+ else
+ XLogCtl->lastXactStopTime = xlcommitrec->xact_time;
- SpinLockRelease(&Insert->insertpos_lck);
+ startbytepos = Insert->CurrBytePos;
+ endbytepos = startbytepos + size;
+ prevbytepos = Insert->PrevBytePos;
+ Insert->CurrBytePos = endbytepos;
+ Insert->PrevBytePos = startbytepos;
+
+ SpinLockRelease(&Insert->insertpos_lck);
+ }
+ else
+ {
+ SpinLockAcquire(&Insert->insertpos_lck);
+
+ startbytepos = Insert->CurrBytePos;
+ endbytepos = startbytepos + size;
+ prevbytepos = Insert->PrevBytePos;
+ Insert->CurrBytePos = endbytepos;
+ Insert->PrevBytePos = startbytepos;
+
+ SpinLockRelease(&Insert->insertpos_lck);
+ }
*StartPos = XLogBytePosToRecPtr(startbytepos);
*EndPos = XLogBytePosToEndRecPtr(endbytepos);
*PrevPtr = XLogBytePosToRecPtr(prevbytepos);
+ if (xlcommitrec && orgxacttime != xlcommitrec->xact_time)
+ XlogRecordModified = true;
+
/*
* Check that the conversions between "usable byte positions" and
* XLogRecPtrs work consistently in both directions.
@@ -1170,12 +1258,20 @@ ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos, XLogRecPtr *PrevPtr)
uint32 size = MAXALIGN(SizeOfXLogRecord);
XLogRecPtr ptr;
uint32 segleft;
+ TimestampTz orgxacttime = 0;
+
+ if (xlcommitrec)
+ orgxacttime = xlcommitrec->xact_time;
/*
* These calculations are a bit heavy-weight to be done while holding a
* spinlock, but since we're holding all the WAL insertion locks, there
* are no other inserters competing for it. GetXLogInsertRecPtr() does
* compete for it, but that's not called very frequently.
+ *
+ * Note that for the commit record, we need to ensure that the commit
+ * timestamp increases monotonically in commit-LSN order to avoid
+ * inconsistency between the two.
*/
SpinLockAcquire(&Insert->insertpos_lck);
@@ -1189,6 +1285,22 @@ ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos, XLogRecPtr *PrevPtr)
return false;
}
+ if (xlcommitrec)
+ {
+ /*
+ * This is a local transaction. Make sure that the xact_time is higher
+ * than any timestamp we have seen thus far.
+ */
+ if (unlikely(XLogCtl->lastXactStopTime >= xlcommitrec->xact_time))
+ {
+ XLogCtl->lastXactStopTime++;
+ xlcommitrec->xact_time = XLogCtl->lastXactStopTime;
+ xactStopTimestamp = XLogCtl->lastXactStopTime;
+ }
+ else
+ XLogCtl->lastXactStopTime = xlcommitrec->xact_time;
+ }
+
endbytepos = startbytepos + size;
prevbytepos = Insert->PrevBytePos;
@@ -1209,6 +1321,9 @@ ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos, XLogRecPtr *PrevPtr)
*PrevPtr = XLogBytePosToRecPtr(prevbytepos);
+ if (xlcommitrec && orgxacttime != xlcommitrec->xact_time)
+ XlogRecordModified = true;
+
Assert(XLogSegmentOffset(*EndPos, wal_segment_size) == 0);
Assert(XLogRecPtrToBytePos(*EndPos) == endbytepos);
Assert(XLogRecPtrToBytePos(*StartPos) == startbytepos);
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index fb64d7413a..43e4b5aa18 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -431,6 +431,8 @@ typedef struct xl_xact_parsed_abort
TimestampTz origin_timestamp;
} xl_xact_parsed_abort;
+extern PGDLLIMPORT xl_xact_commit *xlcommitrec;
+extern PGDLLIMPORT TimestampTz xactStopTimestamp;
/* ----------------
* extern definitions
--
2.30.0.windows.2
On Thu, Nov 7, 2024 at 9:39 PM Jan Wieck <jan@wi3ck.info> wrote:
On 11/6/24 21:30, Zhijie Hou (Fujitsu) wrote:
Thanks for the patch! I am reading the patch and noticed a few minor things.
1.
+ /*
+  * This is a local transaction. Make sure that the xact_time
+  * higher than any timestamp we have seen thus far.
+  *
+  * TODO: This is not postmaster restart safe. If the local
+  * system clock is further behind other nodes than it takes
+  * for the postmaster to restart (time between it stops
+  * accepting new transactions and time when it becomes ready
+  * to accept new transactions), local transactions will not
+  * be bumped into the future correctly.
+  */

The TODO section mentions other nodes, but I think the patch currently does
not have any handling of timestamps for other nodes. Should we either remove
this part or add a brief explanation to clarify the relationship with other
nodes?

That TODO is actually obsolete. I understood from Amit Kapila that the
community does assume that NTP synchronization is good enough.
This is my understanding from the relevant discussion in the email thread [1].
[1]: /messages/by-id/2423650.1726842094@sss.pgh.pa.us
--
With Regards,
Amit Kapila.
Hi,
On 2024-11-08 09:08:55 +0000, Zhijie Hou (Fujitsu) wrote:
On Friday, November 8, 2024 2:20 AM Andres Freund <andres@anarazel.de> wrote:
On 2024-11-05 08:58:36 -0500, Jan Wieck wrote:
The attached solution is minimally invasive because it doesn't move
the timestamp generation (clock_gettime() call) into the critical
section of
ReserveXLogInsertLocation() that is protected by a spinlock. Instead
it keeps track of the last commit-ts written to WAL in shared memory
and simply bumps that by one microsecond if the next one is below or
equal. There is one extra condition in that code section plus a
function call by pointer for every WAL record. In the unlikely case of
encountering a stalled or backwards running commit-ts, the expensive
part of recalculating the CRC of the altered commit WAL-record is done
later, after releasing the spinlock. I have not been able to measure
any performance impact on a machine with 2x Xeon-Silver (32 HT cores).

I think it's *completely* unacceptable to call a hook inside
ReserveXLogInsertLocation, with the spinlock held, no less. That's the most
contended section of code in postgres.

I understand your concern and appreciate the feedback. I've made some
adjustments to the patch by directly placing the code to adjust the commit
timestamp within the spinlock, aiming to keep it as efficient as possible. The
changes have resulted in just a few extra lines. Would this approach be
acceptable to you ?
No, not at all. I think it's basically not acceptable to add any nontrivial
instruction to the locked portion of ReserveXLogInsertLocation().
Before long we're going to have to replace that spinlocked instruction with
atomics, which will make it impossible to just block while holding the lock
anyway.
IMO nothing that touches core xlog insert infrastructure is acceptable for
this.
Greetings,
Andres Freund
On Fri, Nov 8, 2024 at 8:23 PM Andres Freund <andres@anarazel.de> wrote:
On 2024-11-08 09:08:55 +0000, Zhijie Hou (Fujitsu) wrote:
I think it's *completely* unacceptable to call a hook inside
ReserveXLogInsertLocation, with the spinlock held, no less. That's the most
contended section of code in postgres.

I understand your concern and appreciate the feedback. I've made some
adjustments to the patch by directly placing the code to adjust the commit
timestamp within the spinlock, aiming to keep it as efficient as possible. The
changes have resulted in just a few extra lines. Would this approach be
acceptable to you ?

No, not at all. I think it's basically not acceptable to add any nontrivial
instruction to the locked portion of ReserveXLogInsertLocation().

Before long we're going to have to replace that spinlocked instruction with
atomics, which will make it impossible to just block while holding the lock
anyway.

IMO nothing that touches core xlog insert infrastructure is acceptable for
this.
I can't think of a solution other than the current proposal where we
do both the operations (reserve WAL space for commit and adjust
commit_timestamp, if required) together to resolve LSN<->Timestamp
inversion issue. Please let us know if you have any other ideas for
solving this. Even if we want to change ReserveXLogInsertLocation to
make the locked portion an atomic operation, we can still do it for
records other than commit. Now, if we don't have any other solution
for LSN<->Timestamp inversion issue, changing the entire locked
portion to atomics will close the door to solving this problem forever,
unless we find some other way to solve it, and that would make it
difficult to rely on commit_timestamps for certain scenarios.
--
With Regards,
Amit Kapila.
Dear hackers,
I understand your concern and appreciate the feedback. I've made some
adjustments to the patch by directly placing the code to adjust the commit
timestamp within the spinlock, aiming to keep it as efficient as possible. The
changes have resulted in just a few extra lines. Would this approach be
acceptable to you ?

Additionally, we're also doing performance tests for it and will share the
results once they're available.
I've done performance tests comparing master vs. the v2 patch.
It showed that for small-transaction cases, the performance difference was 0-2%,
which was about the same as the run-to-run variation.
We may completely change the approach based on the recent discussion,
but let me share it anyway.
## Executed workload
Very small transactions with many clients were executed and the results for
master and patched builds were compared. Two workloads were used:
- Insert a single tuple into a table without indexes:
```
BEGIN;
INSERT INTO foo VALUES (1);
COMMIT;
```
- Emit a transactional logical replication message:
```
BEGIN;
SELECT pg_logical_emit_message(true, 'pgbench', 'benchmarking', false);
COMMIT;
```
## Results
Here are the results.
Each run lasted 60 seconds, and the median of 5 runs was chosen.
### Single-tuple insertion
# of clients HEAD V2 diff
1 5602.828793 5593.991167 0.158%
2 10547.04503 10615.55583 -0.650%
4 15967.80926 15651.12383 1.983%
8 31213.14796 30584.75382 2.013%
16 60321.34215 59998.0144 0.536%
32 111108.2883 110615.9216 0.443%
64 171860.0186 171359.8172 0.291%
### Transactional message
# of clients HEAD V2 diff
1 5714.611827 5648.9299 1.149%
2 10651.26476 10677.2745 -0.244%
4 16137.30248 15984.11671 0.949%
8 31220.16833 30772.53673 1.434%
16 60364.22808 59866.92579 0.824%
32 111887.1129 111406.4923 0.430%
64 172136.76 171347.5658 0.458%
Actually, the standard deviation across runs was about the same (0-2%), so we could
conclude that there was no significant difference.
## Appendix - measurement environments
- The machine used has 755GB memory and 120 cores (Intel(R) Xeon(R) CPU E7-4890).
- RHEL 7.9 is running on the machine.
- HEAD pointed to commit 41b98ddb7 when tested.
- Only `--prefix` was specified when configuring.
- Parameters changed from the defaults were:
```
autovacuum = false
checkpoint_timeout = 1h
shared_buffers = '30GB'
max_wal_size = 20GB
min_wal_size = 10GB
```
Best regards,
Hayato Kuroda
FUJITSU LIMITED
On 11/11/24 09:19, Amit Kapila wrote:
On Fri, Nov 8, 2024 at 8:23 PM Andres Freund <andres@anarazel.de> wrote:
On 2024-11-08 09:08:55 +0000, Zhijie Hou (Fujitsu) wrote:
I think it's *completely* unacceptable to call a hook inside
ReserveXLogInsertLocation, with the spinlock held, no less. That's the most
contended section of code in postgres.

I understand your concern and appreciate the feedback. I've made some
adjustments to the patch by directly placing the code to adjust the commit
timestamp within the spinlock, aiming to keep it as efficient as possible. The
changes have resulted in just a few extra lines. Would this approach be
acceptable to you ?

No, not at all. I think it's basically not acceptable to add any nontrivial
instruction to the locked portion of ReserveXLogInsertLocation().

Before long we're going to have to replace that spinlocked instruction with
atomics, which will make it impossible to just block while holding the lock
anyway.

IMO nothing that touches core xlog insert infrastructure is acceptable for
this.

I can't think of a solution other than the current proposal where we
do both the operations (reserve WAL space for commit and adjust
commit_timestamp, if required) together to resolve LSN<->Timestamp
inversion issue. Please let us know if you have any other ideas for
solving this. Even if we want to change ReserveXLogInsertLocation to
make the locked portion an atomic operation, we can still do it for
records other than commit. Now, if we don't have any other solution
for LSN<->Timestamp inversion issue, changing the entire locked
portion to atomics will close the door to solving this problem forever,
unless we find some other way to solve it, and that would make it
difficult to rely on commit_timestamps for certain scenarios.
I don't know what the solution is, but isn't the problem that
(a) we record both values (LSN and timestamp) during commit
(b) reading timestamp from system clock can be quite expensive
It seems to me that if either of those two issues disappeared, we
wouldn't have such an issue.
For example, imagine getting a timestamp is super cheap - just reading
and updating a simple counter from shmem, just like we do for the LSN.
Wouldn't that solve the problem?
For example, let's say XLogCtlInsert has two fields
int64 CommitTimestamp;
and that ReserveXLogInsertLocation() also does this for each commit:
commit_timestamp = Insert->commit_timestamp++;
while holding the insertpos_lock. Now we have the LSN and timestamp
perfectly correlated. Of course, this timestamp would be completely
detached from "normal" timestamps, it'd just be a sequence. But we could
also once in a while "set" the timestamp to GetCurrentTimestamp(), or
something like that. For example, there could be a "ticker" process,
doing that every 1000ms, or 100ms, or whatever we think is enough.
Or maybe it could be driven by the commit itself, say when some timeout
expires, we assign too many items since the last commit, ...
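To make the shape of that idea concrete, here is a minimal standalone sketch
(the names CommitClock and reserve_commit are invented for illustration; in
PostgreSQL the state would live in XLogCtlInsert and be updated under the same
lock that orders WAL reservations):

```c
/*
 * Minimal sketch of the "cheap shared commit clock" idea. All names are
 * invented; this is not PostgreSQL code, just the shape of the idea.
 */
#include <stdint.h>

typedef struct CommitClock
{
    uint64_t curr_byte_pos;     /* stands in for the WAL reservation counter */
    int64_t  commit_timestamp;  /* monotonically increasing "timestamp" */
} CommitClock;

/* Called under the same lock that orders WAL reservations. */
static int64_t
reserve_commit(CommitClock *clk, uint64_t rec_size)
{
    clk->curr_byte_pos += rec_size;     /* LSN order ... */
    return ++clk->commit_timestamp;     /* ... and timestamp order agree */
}
```

Because both counters advance in the same critical section, a commit with a
larger LSN can never end up with a smaller "timestamp"; re-anchoring the
counter to real time would happen outside this path.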
Alternatively, we could simply stop relying on the timestamps recorded
in the commit, and instead derive "monotonic" commit timestamps after
the fact. For example, we could track timestamps for some subset of
commits, and then approximate the rest using LSN.
AFAIK the inversion can happen only for concurrent commits, and there
can't be that many of those. So if we record the (LSN, timestamp) for
every 1MB of WAL, we approximate timestamps for commits in that 1MB by
linear approximation.
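As a rough illustration of that interpolation (types and names invented; how
the (LSN, timestamp) samples would be collected and stored is left out
entirely):

```c
/*
 * Sketch only: approximate a commit timestamp from its LSN, given the two
 * sampled (lsn, timestamp) anchors surrounding it.
 */
#include <stdint.h>

typedef struct TsSample
{
    uint64_t lsn;   /* WAL position at which the sample was taken */
    int64_t  ts;    /* timestamp (microseconds) recorded for that LSN */
} TsSample;

static int64_t
approximate_commit_ts(TsSample lo, TsSample hi, uint64_t commit_lsn)
{
    /* linear interpolation between the surrounding samples */
    double frac = (double) (commit_lsn - lo.lsn) / (double) (hi.lsn - lo.lsn);

    return lo.ts + (int64_t) (frac * (double) (hi.ts - lo.ts));
}
```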
Of course, there's a lot of questions and details to solve - e.g. how
often would it need to happen, when exactly would it happen, etc. And
also how would that integrate with the logical decoding - it's easy to
just get the timestamp from the WAL record, this would require more work
to actually calculate it. It's only a very rough idea.
regards
--
Tomas Vondra
On Mon, Nov 11, 2024 at 9:05 PM Tomas Vondra <tomas@vondra.me> wrote:
On 11/11/24 09:19, Amit Kapila wrote:
I can't think of a solution other than the current proposal where we
do both the operations (reserve WAL space for commit and adjust
commit_timestamp, if required) together to resolve LSN<->Timestamp
inversion issue. Please let us know if you have any other ideas for
solving this. Even if we want to change ReserveXLogInsertLocation to
make the locked portion an atomic operation, we can still do it for
records other than commit. Now, if we don't have any other solution
for LSN<->Timestamp inversion issue, changing the entire locked
portion to atomics will close the door to solving this problem forever,
unless we find some other way to solve it, and that would make it
difficult to rely on commit_timestamps for certain scenarios.

I don't know what the solution is, but isn't the problem that
(a) we record both values (LSN and timestamp) during commit
(b) reading timestamp from system clock can be quite expensive
It seems to me that if either of those two issues disappeared, we
wouldn't have such an issue.

For example, imagine getting a timestamp is super cheap - just reading
and updating a simple counter from shmem, just like we do for the LSN.
Wouldn't that solve the problem?
Yeah, this is exactly what I thought.
For example, let's say XLogCtlInsert has two fields
int64 CommitTimestamp;
and that ReserveXLogInsertLocation() also does this for each commit:
commit_timestamp = Insert->commit_timestamp++;
while holding the insertpos_lock. Now we have the LSN and timestamp
perfectly correlated.
Right, and the patch sent by Hou-San [1] (based on the original patch
by Jan) is somewhat on these lines. The idea you have shared or
implemented by the patch is a logical clock stored in shared memory.
So, what the patch does is that if the time recorded by the current
commit record is less than or equal to the logical clock (which
means a concurrent transaction happened after we recorded the time in
the commit code path and before we reserved the LSN), we shall
increment the value of the logical clock by one and use that as the
commit time.
So, in most cases, we need to perform one additional "if check" and
"an assignment" under spinlock, and that too only for the commit
record. In some cases, when inversion happens, there will be "one
increment" and "two assignments."
Of course, this timestamp would be completely
detached from "normal" timestamps, it'd just be a sequence. But we could
also once in a while "set" the timestamp to GetCurrentTimestamp(), or
something like that. For example, there could be a "ticker" process,
doing that every 1000ms, or 100ms, or whatever we think is enough.

Or maybe it could be driven by the commit itself, say when some timeout
expires, we assign too many items since the last commit, ...
If we follow the patch, we don't need this additional stuff. Isn't
what is proposed [1] quite similar to your idea? If so, we can save
this extra maintenance of commit timestamps.
[1]: /messages/by-id/OS0PR01MB57162A227EC357482FEB4470945D2@OS0PR01MB5716.jpnprd01.prod.outlook.com
With Regards,
Amit Kapila.
On Tue, Nov 12, 2024 at 8:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Nov 11, 2024 at 9:05 PM Tomas Vondra <tomas@vondra.me> wrote:
On 11/11/24 09:19, Amit Kapila wrote:
I can't think of a solution other than the current proposal where we
do both the operations (reserve WAL space for commit and adjust
commit_timestamp, if required) together to resolve LSN<->Timestamp
inversion issue. Please let us know if you have any other ideas for
solving this. Even if we want to change ReserveXLogInsertLocation to
make the locked portion an atomic operation, we can still do it for
records other than commit. Now, if we don't have any other solution
for LSN<->Timestamp inversion issue, changing the entire locked
portion to atomics will close the door to solving this problem forever,
unless we find some other way to solve it, and that would make it
difficult to rely on commit_timestamps for certain scenarios.

I don't know what the solution is, but isn't the problem that
(a) we record both values (LSN and timestamp) during commit
(b) reading timestamp from system clock can be quite expensive
It seems to me that if either of those two issues disappeared, we
wouldn't have such an issue.

For example, imagine getting a timestamp is super cheap - just reading
and updating a simple counter from shmem, just like we do for the LSN.
Wouldn't that solve the problem?

Yeah, this is exactly what I thought.
For example, let's say XLogCtlInsert has two fields
int64 CommitTimestamp;
and that ReserveXLogInsertLocation() also does this for each commit:
commit_timestamp = Insert->commit_timestamp++;
while holding the insertpos_lock. Now we have the LSN and timestamp
perfectly correlated.

Right, and the patch sent by Hou-San [1] (based on the original patch
by Jan) is somewhat on these lines. The idea you have shared or
implemented by the patch is a logical clock stored in shared memory.
So, what the patch does is that if the time recorded by the current
commit record is less than or equal to the logical clock (which
means a concurrent transaction happened after we recorded the time in
the commit code path and before we reserved the LSN), we shall
increment the value of the logical clock by one and use that as the
commit time.

So, in most cases, we need to perform one additional "if check" and
"an assignment" under spinlock, and that too only for the commit
record. In some cases, when inversion happens, there will be "one
increment" and "two assignments."
As the inversion issue can mainly hamper logical replication-based
solutions, we can do any of this additional work under the spinlock only
when the current record is a commit record (which the currently
proposed patch is already doing) and "wal_level = logical", and we could
also have another option at the subscription level to enable this new
code path. I am not sure what is best but am just sharing the ideas here.
--
With Regards,
Amit Kapila.
On Mon, Nov 11, 2024 at 9:05 PM Tomas Vondra <tomas@vondra.me> wrote:
Alternatively, we could simply stop relying on the timestamps recorded
in the commit, and instead derive "monotonic" commit timestamps after
the fact. For example, we could track timestamps for some subset of
commits, and then approximate the rest using LSN.

AFAIK the inversion can happen only for concurrent commits, and there
can't be that many of those. So if we record the (LSN, timestamp) for
every 1MB of WAL, we approximate timestamps for commits in that 1MB by
linear approximation.

Of course, there's a lot of questions and details to solve - e.g. how
often would it need to happen, when exactly would it happen, etc. And
also how would that integrate with the logical decoding - it's easy to
just get the timestamp from the WAL record, this would require more work
to actually calculate it. It's only a very rough idea.
I think for logical decoding it would probably be easy because it
reads all the WAL. So, it can remember the commit time of the previous
commit, and if any future commit has a commit timestamp lower than
that, it can fix it by incrementing it. But outside logical decoding,
it would be tricky because we may need a separate process to fix up
commit timestamps by using linear approximation. IIUC, the bigger
challenge is that such a solution would require us to read the WAL on
a continuous basis and keep fixing commit timestamps, or we would need
to read the extra WAL before using or relying on a commit timestamp.
This sounds like a somewhat complex and costlier solution; although the
cost is outside the hot code path, it still matters as it leads to
extra read I/O.
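A toy sketch of the decoding-side fixup described in the first sentence
(names are hypothetical; this is not the actual reorderbuffer or
output-plugin code):

```c
/*
 * Remember the last commit timestamp seen while decoding and bump any
 * commit that would otherwise go backwards by one microsecond.
 */
#include <stdint.h>

static int64_t last_decoded_commit_ts = 0;

static int64_t
monotonize_commit_ts(int64_t commit_ts_from_wal)
{
    if (commit_ts_from_wal <= last_decoded_commit_ts)
        commit_ts_from_wal = ++last_decoded_commit_ts;
    else
        last_decoded_commit_ts = commit_ts_from_wal;

    return commit_ts_from_wal;
}
```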
--
With Regards,
Amit Kapila.
On 11/12/24 04:05, Amit Kapila wrote:
On Mon, Nov 11, 2024 at 9:05 PM Tomas Vondra <tomas@vondra.me> wrote:
On 11/11/24 09:19, Amit Kapila wrote:
I can't think of a solution other than the current proposal where we
do both the operations (reserve WAL space for commit and adjust
commit_timestamp, if required) together to resolve LSN<->Timestamp
inversion issue. Please let us know if you have any other ideas for
solving this. Even if we want to change ReserveXLogInsertLocation to
make the locked portion an atomic operation, we can still do it for
records other than commit. Now, if we don't have any other solution
for LSN<->Timestamp inversion issue, changing the entire locked
portion to atomics will close the door to solving this problem forever,
unless we find some other way to solve it, and that would make it
difficult to rely on commit_timestamps for certain scenarios.

I don't know what the solution is, but isn't the problem that
(a) we record both values (LSN and timestamp) during commit
(b) reading timestamp from system clock can be quite expensive
It seems to me that if either of those two issues disappeared, we
wouldn't have such an issue.

For example, imagine getting a timestamp is super cheap - just reading
and updating a simple counter from shmem, just like we do for the LSN.
Wouldn't that solve the problem?

Yeah, this is exactly what I thought.
For example, let's say XLogCtlInsert has two fields
int64 CommitTimestamp;
and that ReserveXLogInsertLocation() also does this for each commit:
commit_timestamp = Insert->commit_timestamp++;
while holding the insertpos_lock. Now we have the LSN and timestamp
perfectly correlated.

Right, and the patch sent by Hou-San [1] (based on the original patch
by Jan) is somewhat on these lines. The idea you have shared or
implemented by the patch is a logical clock stored in shared memory.
So, what the patch does is that if the time recorded by the current
commit record is less than or equal to the logical clock (which
means a concurrent transaction happened after we recorded the time in
the commit code path and before we reserved the LSN), we shall
increment the value of the logical clock by one and use that as the
commit time.

So, in most cases, we need to perform one additional "if check" and
"an assignment" under spinlock, and that too only for the commit
record. In some cases, when inversion happens, there will be "one
increment" and "two assignments."
In a way, yes. Except that each backend first sets the commit timestamp
the usual way too. And then it also has to recalculate the checksum of
the record - which happens under lwlock and not the main spinlock, but
it still seems it might be quite expensive.
I wonder how often we'd need to do this - it seems to me that for high
throughput systems we might easily get into a situation where we have to
"correct" the timestamp and recalculate the checksum (so calculating it
twice) for most of the commits.
Would be good to see some numbers from experiments - how often this can
happen, what's the impact on throughput etc.
Ofc, you may argue that this would only affect people with the hook, and
those people may be OK with the impact. And in a way I agree. But I also
understand the concerns expressed by Andres, that adding a hook here may
be problematic / easy to get wrong, and then we'd be the ones holding
the brown bag ...
Of course, this timestamp would be completely
detached from "normal" timestamps, it'd just be a sequence. But we could
also once in a while "set" the timestamp to GetCurrentTimestamp(), or
something like that. For example, there could be a "ticker" process,
doing that every 1000ms, or 100ms, or whatever we think is enough.

Or maybe it could be driven by the commit itself, say when some timeout
expires, we assign too many items since the last commit, ...

If we follow the patch, we don't need this additional stuff. Isn't
what is proposed [1] quite similar to your idea? If so, we can save
this extra maintenance of commit timestamps.
Not sure. But I think we already have some of the necessary parts in
commit_ts. It'd need some improvements, but it might also be a good
place to adjust the timestamps. The inversion can only happen within a
small group of recent transactions (due to NUM_XLOGINSERT_LOCKS), so it
shouldn't be hard to keep a small buffer for those XIDs. Ofc, subxacts
would complicate stuff, and it'd mean the timestamps written to WAL and
to commit_ts would differ. Not great.
regards
--
Tomas Vondra
On 11/11/24 23:21, Amit Kapila wrote:
As the inversion issue can mainly hamper logical replication-based
solutions we can do any of this additional work under spinlock only
when the current record is a commit record (which the currently
proposed patch is already doing) and "wal_level = logical" and also
can have another option at the subscription level to enable this new
code path. I am not sure what is best but just sharing the ideas here.
It can indeed be reduced to one extra *unlikely* if test only for commit
records and only when WAL level is "logical". Without those two being
true there would be zero impact on ReserveXLogInsertLocation().
It is not possible to do something on the subscription level because it
affects global behavior of all backends.
Best Regards, Jan
Hi,
On 2024-11-12 08:51:49 -0500, Jan Wieck wrote:
On 11/11/24 23:21, Amit Kapila wrote:
As the inversion issue can mainly hamper logical replication-based
solutions we can do any of this additional work under spinlock only
when the current record is a commit record (which the currently
proposed patch is already doing) and "wal_level = logical" and also
can have another option at the subscription level to enable this new
code path. I am not sure what is best but just sharing the ideas here.

It can indeed be reduced to one extra *unlikely* if test only for commit
records and only when WAL level is "logical". Without those two being true
there would be zero impact on ReserveXLogInsertLocation().
No, the impact would be far larger, because it would make it impractical to
remove this bottleneck by using atomic instructions in
ReserveXLogInsertLocation(). It doesn't make sense to make it harder to
resolve one of the core central bottlenecks in postgres for a niche feature
like this. Sorry, but this is just not a viable path.
Greetings,
Andres Freund
Hi,
On 2024-11-11 09:49:19 +0000, Hayato Kuroda (Fujitsu) wrote:
I've done performance tests comparing master vs. the v2 patch.
It showed that for small-transaction cases, the performance difference was 0-2%,
which was about the same as the run-to-run variation.

We may completely change the approach based on the recent discussion,
but let me share it anyway.

## Executed workload
Very small transactions with many clients were executed and the results for
master and patched builds were compared. Two workloads were used:

- Insert a single tuple into a table without indexes:
```
BEGIN;
INSERT INTO foo VALUES (1);
COMMIT;
```

- Emit a transactional logical replication message:
```
BEGIN;
SELECT pg_logical_emit_message(true, 'pgbench', 'benchmarking', false);
COMMIT;
```

## Results
This is not a useful measurement for overhead introduced in
ReserveXLogInsertLocation(). What you're measuring here is the number of
commits/second, not the WAL insertion rate. The number of commits/second is
largely determined by your disk's write latency, the batching of WAL flushes
and things like the SLRU code.
To measure the effect of changes to ReserveXLogInsertLocation() use something
like this as a pgbench script:
SELECT pg_logical_emit_message(false, \:client_id::text, '1'), generate_series(1, 10000) OFFSET 10000;
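For what it's worth, a script like that (saved to, say, emit.sql) could be
driven with an ordinary pgbench invocation; the client/thread counts, duration
and database name below are just placeholder choices:

```
pgbench -n -f emit.sql -c 32 -j 32 -T 60 postgres
```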
Greetings,
Andres Freund
On 11/11/24 23:22, Amit Kapila wrote:
On Mon, Nov 11, 2024 at 9:05 PM Tomas Vondra <tomas@vondra.me> wrote:
Alternatively, we could simply stop relying on the timestamps recorded
in the commit, and instead derive "monotonic" commit timestamps after
the fact. For example, we could track timestamps for some subset of
commits, and then approximate the rest using LSN.

AFAIK the inversion can happen only for concurrent commits, and there
can't be that many of those. So if we record the (LSN, timestamp) for
every 1MB of WAL, we approximate timestamps for commits in that 1MB by
linear approximation.

Of course, there's a lot of questions and details to solve - e.g. how
often would it need to happen, when exactly would it happen, etc. And
also how would that integrate with the logical decoding - it's easy to
just get the timestamp from the WAL record, this would require more work
to actually calculate it. It's only a very rough idea.

I think for logical decoding it would probably be easy because it
reads all the WAL. So, it can remember the commit time of the previous
commit, and if any future commit has a commit timestamp lower than
that, it can fix it by incrementing it. But outside logical decoding,
it would be tricky because we may need a separate process to fix up
commit timestamps by using linear approximation. IIUC, the bigger
challenge is that such a solution would require us to read the WAL on
a continuous basis and keep fixing commit timestamps, or we would need
to read the extra WAL before using or relying on a commit timestamp.
This sounds like a somewhat complex and costlier solution; although the
cost is outside the hot code path, it still matters as it leads to
extra read I/O.
I had originally experimented with adjusting the timestamps sent to the
subscriber in the output plugin. This works for the most part but has
two major flaws:
1) In case the system is restarted in a situation where it has some
forward clock skew and the restart LSN is right in the middle of it, the
output plugin has no knowledge of that and again emits backwards running
timestamps.
2) The origin and the subscriber now have a different timestamp
associated with the same transaction. That can lead to different
behavior in conflict resolution when a third node gets involved.
Best Regards, Jan
Hello,
On 11/12/24 08:55, Andres Freund wrote:
Hi,
On 2024-11-12 08:51:49 -0500, Jan Wieck wrote:
On 11/11/24 23:21, Amit Kapila wrote:
As the inversion issue can mainly hamper logical replication-based
solutions we can do any of this additional work under spinlock only
when the current record is a commit record (which the currently
proposed patch is already doing) and "wal_level = logical" and also
can have another option at the subscription level to enable this new
code path. I am not sure what is best but just sharing the ideas here.

It can indeed be reduced to one extra *unlikely* if test only for commit
records and only when WAL level is "logical". Without those two being true
there would be zero impact on ReserveXLogInsertLocation().

No, the impact would be far larger, because it would make it impractical to
remove this bottleneck by using atomic instructions in
ReserveXLogInsertLocation(). It doesn't make sense to make it harder to
resolve one of the core central bottlenecks in postgres for a niche feature
like this. Sorry, but this is just not a viable path.
The current section of code covered by the spinlock performs the
following operations:
- read CurrBytePos (shmem)
- read PrevBytePos (shmem)
- store CurrBytePos in PrevBytePos
- increment CurrBytePos by size
In addition there is ReserveXLogSwitch() that is far more complex and is
guarded by the same spinlock because both functions manipulate the same
shared memory variables. With that in mind, how viable is the proposal
to replace these code sections in both functions with atomic operations?
Best Regards, Jan
Hi,
On 2024-11-12 10:12:40 -0500, Jan Wieck wrote:
On 11/12/24 08:55, Andres Freund wrote:
Hi,
On 2024-11-12 08:51:49 -0500, Jan Wieck wrote:
On 11/11/24 23:21, Amit Kapila wrote:
As the inversion issue can mainly hamper logical replication-based
solutions we can do any of this additional work under spinlock only
when the current record is a commit record (which the currently
proposed patch is already doing) and "wal_level = logical" and also
can have another option at the subscription level to enable this new
code path. I am not sure what is best but just sharing the ideas here.

It can indeed be reduced to one extra *unlikely* if test only for commit
records and only when WAL level is "logical". Without those two being true
there would be zero impact on ReserveXLogInsertLocation().

No, the impact would be far larger, because it would make it impractical to
remove this bottleneck by using atomic instructions in
ReserveXLogInsertLocation(). It doesn't make sense to make it harder to
resolve one of the core central bottlenecks in postgres for a niche feature
like this. Sorry, but this is just not a viable path.

The current section of code covered by the spinlock performs the following
operations:

- read CurrBytePos (shmem)
- read PrevBytePos (shmem)
- store CurrBytePos in PrevBytePos
- increment CurrBytePos by size
In addition there is ReserveXLogSwitch() that is far more complex and is
guarded by the same spinlock because both functions manipulate the same
shared memory variables. With that in mind, how viable is the proposal to
replace these code sections in both functions with atomic operations?
I have working code - pretty ugly at this state, but mostly needs a fair bit
of elbow grease not divine inspiration... It's not a trivial change, but
entirely doable.
The short summary of how it works is that it uses a single 64bit atomic that
is internally subdivided into a ringbuffer position in N high bits and an
offset from a base LSN in the remaining bits. The insertion sequence is
1) xadd of [increment ringbuffer by 1] + [size of record] to the
insertion position state variable.
This assigns a ringbuffer position and an insertion LSN (by adding the
position offset to the base pointer) to the inserting backend.
2) The inserting backend sets ringbuffer[pos].oldpos = oldpos
3) The inserting backend gets the prior record's pos from
ringbuffer[pos - 1].oldpos, this might need to wait for the prior
inserter to set ringbuffer[pos - 1].oldpos.
Note that this does *not* have to happen at the point we currently get the
prev pointer. Instead we can start copying the record into place and then
acquire the prev pointer and update the record with it. That makes it
extremely rare to have to wait for the prev pointer to be set.
This leaves you with a single xadd to contended cacheline as the contention
point (scales far better than cmpxchg and far far better than
cmpxchg16b). There's a bit of contention for the ringbuffer[].oldpos being set
and read, but it's only by two backends, not all of them.
The nice part is this scheme leaves you with a ringbuffer that's ordered by
the insertion-lsn. Which allows to make WaitXLogInsertionsToFinish() far more
efficient and to get rid of NUM_XLOGINSERT_LOCKS (by removing WAL insertion
locks). Right now NUM_XLOGINSERT_LOCKS is a major scalability limit - but at
the same time increasing it makes the contention on the spinlock *much* worse,
leading to slowdowns in other workloads.
The above description leaves out a bunch of complexity, of course - we
e.g. need to make sure that ringbuffer positions can't be reused too eagerly
and we need to update the base pointer.
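A toy model of that packed state (the bit split, names and types are invented;
the real design also has to handle wraparound, advancing the base LSN, and
waiting for the previous inserter's prev-pointer, as noted above):

```c
/*
 * Sketch: one 64-bit atomic holds [ ringbuffer position | offset from a
 * base LSN ], so a single fetch-add hands out both a slot and an LSN.
 */
#include <stdatomic.h>
#include <stdint.h>

#define RB_BITS   16                                /* assumed position width */
#define RB_ONE    (UINT64_C(1) << (64 - RB_BITS))   /* +1 ringbuffer slot */
#define OFF_MASK  (RB_ONE - 1)

typedef struct InsertState
{
    _Atomic uint64_t packed;    /* high bits: rb position, low bits: offset */
    uint64_t         base_lsn;  /* advanced occasionally; not shown here */
} InsertState;

static void
reserve_record(InsertState *s, uint32_t rec_size,
               uint64_t *start_lsn, uint32_t *rb_pos)
{
    /* a single xadd claims a ringbuffer slot and an insertion LSN */
    uint64_t old = atomic_fetch_add(&s->packed, RB_ONE + rec_size);

    *rb_pos = (uint32_t) (old >> (64 - RB_BITS));
    *start_lsn = s->base_lsn + (old & OFF_MASK);
}
```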
Greetings,
Andres Freund
Hello,
On 11/12/24 10:34, Andres Freund wrote:
I have working code - pretty ugly at this state, but mostly needs a fair bit
of elbow grease not divine inspiration... It's not a trivial change, but
entirely doable.

The short summary of how it works is that it uses a single 64bit atomic that
is internally subdivided into a ringbuffer position in N high bits and an
offset from a base LSN in the remaining bits. The insertion sequence is...
This leaves you with a single xadd to contended cacheline as the contention
point (scales far better than cmpxchg and far far better than
cmpxchg16b). There's a bit of contention for the ringbuffer[].oldpos being set
and read, but it's only by two backends, not all of them.
That sounds rather promising.
Would it be reasonable to have both implementations available at least
at compile time, if not at runtime? Is it possible that we need to do
that anyway for some time or are those atomic operations available on
all supported CPU architectures?
The nice part is this scheme leaves you with a ringbuffer that's ordered by
the insertion-lsn. Which allows to make WaitXLogInsertionsToFinish() far more
efficient and to get rid of NUM_XLOGINSERT_LOCKS (by removing WAL insertion
locks). Right now NUM_XLOGINSERT_LOCKS is a major scalability limit - but at
the same time increasing it makes the contention on the spinlock *much* worse,
leading to slowdowns in other workloads.
Yeah, that is a complex wart that I believe was the answer to the NUMA
overload that Kevin Grittner and I discovered many years ago, where
on a 4-socket machine the cacheline stealing would get so bad that
whoever was holding the lock could not release it.
In any case, thanks for the input. Looks like in the long run we need to
come up with a different way to solve the inversion problem.
Best Regards, Jan
Hi,
On 2024-11-12 11:40:39 -0500, Jan Wieck wrote:
On 11/12/24 10:34, Andres Freund wrote:
I have working code - pretty ugly at this state, but mostly needs a fair bit
of elbow grease not divine inspiration... It's not a trivial change, but
entirely doable.

The short summary of how it works is that it uses a single 64bit atomic that
is internally subdivided into a ringbuffer position in N high bits and an
offset from a base LSN in the remaining bits. The insertion sequence is...
This leaves you with a single xadd to contended cacheline as the contention
point (scales far better than cmpxchg and far far better than
cmpxchg16b). There's a bit of contention for the ringbuffer[].oldpos being set
and read, but it's only by two backends, not all of them.

That sounds rather promising.
Would it be reasonable to have both implementations available at least at
compile time, if not at runtime?
No, not reasonably.
Is it possible that we need to do that anyway for some time or are those
atomic operations available on all supported CPU architectures?
We have a fallback atomics implementation for the uncommon architectures
without 64bit atomics.
In any case, thanks for the input. Looks like in the long run we need to
come up with a different way to solve the inversion problem.
IMO there's absolutely no way the changes proposed in this thread so far
should get merged.
Greetings,
Andres Freund
Dear Andres,
Thanks for giving comments for my test!
This is not a useful measurement for overhead introduced in
ReserveXLogInsertLocation(). What you're measuring here is the number of
commits/second, not the WAL insertion rate. The number of commits/second is
largely determined by your disk's write latency, the batching of WAL flushes
and things like the SLRU code.
You meant that committing transactions is heavy work, so it may hide a performance
regression in ReserveXLogInsertLocation(), right? But...
To measure the effect of changes to ReserveXLogInsertLocation() use something
like this as a pgbench script:
SELECT pg_logical_emit_message(false, \:client_id::text, '1'), generate_series(1,
10000) OFFSET 10000;
I don't think the suggested workload is useful here. pg_logical_emit_message(transactional = false)
inserts WAL without a commit record, i.e., xlcommitrec is always NULL.
This means backends won't go through the added code in ReserveXLogInsertLocation().
## Appendix - gdb results
1. connected a postgres instance
2. attached the backend process via gdb
3. added a breakpoint to XLogInsert
4. executed a command: SELECT pg_logical_emit_message(false, '1', '1'), generate_series(1, 10000) OFFSET 10000;
5. the process stopped at the break point. Backtrace was:
```
(gdb) bt
#0 XLogInsert (rmid=21 '\025', info=0 '\000') at xloginsert.c:479
#1 0x000000000094a640 in LogLogicalMessage (prefix=0x23adfd0 "1", message=0x23a34c4 "1~\177\177\060", size=1,
transactional=false, flush=false) at message.c:72
#2 0x000000000094a545 in pg_logical_emit_message_bytea (fcinfo=0x23a65d0) at logicalfuncs.c:376
#3 0x000000000094a56f in pg_logical_emit_message_text (fcinfo=0x23a65d0) at logicalfuncs.c:385
```
6. confirmed xlcommitrec was NULL
```
(gdb) p xlcommitrec
$1 = (xl_xact_commit *) 0x0
```
Best regards,
Hayato Kuroda
FUJITSU LIMITED
Dear Andres,
I don't think the suggested workload is useful here.
pg_logical_emit_message(transactional = false)
inserts WAL without a commit record, i.e., xlcommitrec is always NULL.
This means backends won't go through the added code in
ReserveXLogInsertLocation().
Just in case I want to clarify what I used.
My head was bfeeb06 and I applied [1] atop it.
[1]: /messages/by-id/OS0PR01MB57162A227EC357482FEB4470945D2@OS0PR01MB5716.jpnprd01.prod.outlook.com
Best regards,
Hayato Kuroda
FUJITSU LIMITED
On Tue, Nov 12, 2024 at 5:30 PM Tomas Vondra <tomas@vondra.me> wrote:
On 11/12/24 04:05, Amit Kapila wrote:
Right, and the patch sent by Hou-San [1] (based on the original patch
by Jan) is somewhat on these lines. The idea you have shared or
implemented by the patch is a logical clock stored in shared memory.
So, what the patch does is that if the time recorded by the current
commit record is less than or equal to the logical clock (which
means a concurrent transaction happened after we recorded the time in
the commit code path and before we reserved the LSN), we shall
increment the value of the logical clock by one and use that as the
commit time.

So, in most cases, we need to perform one additional "if check" and
"an assignment" under spinlock, and that too only for the commit
record. In some cases, when inversion happens, there will be "one
increment" and "two assignments."

In a way, yes. Except that each backend first sets the commit timestamp
the usual way too. And then it also has to recalculate the checksum of
the record - which happens under lwlock and not the main spinlock, but
it still seems it might be quite expensive.

I wonder how often we'd need to do this - it seems to me that for high
throughput systems we might easily get into a situation where we have to
"correct" the timestamp and recalculate the checksum (so calculating it
twice) for most of the commits.

Would be good to see some numbers from experiments - how often this can
happen, what's the impact on throughput etc.
The initial numbers shared by Kuroda-San [1] didn't show any
significant impact due to this. It will likely happen in some cases
under stress but shouldn't be often because normally the backend that
acquired the timestamp first should reserve the LSN first as well. We
can do more tests as well to see the impact but the key point Andres
is raising is that we won't be able to convert the operation in
ReserveXLogInsertLocation() to use atomics after such a solution. Now,
even if the change is only in the *commit* code path, we may or may
not be able to maintain two code paths for reserving LSN, one using
atomics and the other (for commit record) using spinlock.
Even if we consider your solution of directly incrementing the
commit_timestamp under the spinlock to make it monotonic and then later
rectifying the commit_ts, would that make it any easier to convert the
entire operation to atomics?
Ofc, you may argue that this would only affect people with the hook, and
those people may be OK with the impact. And in a way I agree. But I also
understand the concerns expressed by Andres, that adding a hook here may
be problematic / easy to get wrong, and then we'd be the ones holding
the brown bag ...
BTW, the latest patch [2] doesn't have any hook.
Of course, this timestamp would be completely
detached from "normal" timestamps, it'd just be a sequence. But we could
also once in a while "set" the timestamp to GetCurrentTimestamp(), or
something like that. For example, there could be a "ticker" process,
doing that every 1000ms, or 100ms, or whatever we think is enough.

Or maybe it could be driven by the commit itself, say when some timeout
expires, we assign too many items since the last commit, ...

If we follow the patch, we don't need this additional stuff. Isn't
what is proposed [1] quite similar to your idea? If so, we can save
this extra maintenance of commit timestamps.

Not sure. But I think we already have some of the necessary parts in
commit_ts. It'd need some improvements, but it might also be a good
place to adjust the timestamps. The inversion can only happen within a
small group of recent transactions (due to NUM_XLOGINSERT_LOCKS), so it
shouldn't be hard to keep a small buffer for those XIDs. Ofc, subxacts
would complicate stuff, and it'd mean the timestamps written to WAL and
to commit_ts would differ. Not great.
Won't it be difficult to predict the number of transactions that could
be impacted in this model?
In general, I understand the concern here related to the difficulty in
converting the operation in ReserveXLogInsertLocation() to atomics but
leaving the LSN<->Timestamp inversion issue open forever also doesn't
give any good feeling. I feel any solution to fix up commit_timestamps
afterward, as we discussed above [3], would be much more complex and
difficult to maintain.
[1]: /messages/by-id/TYAPR01MB569222C1312E7A6C3C63539DF5582@TYAPR01MB5692.jpnprd01.prod.outlook.com
[2]: /messages/by-id/OS0PR01MB57162A227EC357482FEB4470945D2@OS0PR01MB5716.jpnprd01.prod.outlook.com
[3]: /messages/by-id/CAA4eK1+OZLh7vA1CQkoq0ba4J_P-8JFHnt0a_YC2xfB0t3+akA@mail.gmail.com
--
With Regards,
Amit Kapila.
On 11/13/24 03:56, Amit Kapila wrote:
the key point Andres
is raising is that we won't be able to convert the operation in
ReserveXLogInsertLocation() to use atomics after such a solution. Now,
even if the change is only in the *commit* code path, we may or may
not be able to maintain two code paths for reserving LSN, one using
atomics and the other (for commit record) using spinlock.
Which is only partially true. There is nothing that would prevent us
from using atomic operations inside of a spinlock, and only reserving
xlog space for commit records needs a spinlock, because of the extra
logic that cannot be combined into a single atomic operation. The idea
as I understand Andres is that changing ReserveXLogInsertLocation() to
use atomics gets rid not only of the spinlock, but also the LW-locks
that protect it from a spinlock storm.
Keeping the current LW-lock plus spinlock architecture only for commit
records but changing the actual reserve to atomic operations would
affect small transactions more than large ones. Making all of this
depend on "wal_level = logical" removes the argument that the two
solutions are mutually exclusive. It does, however, make the code less
maintenance-friendly.
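As a rough sketch of that hybrid (all names and the lock placeholders are
invented; it only shows the shape of keeping a fast atomic path for non-commit
records and a locked path for commit records):

```c
/*
 * Non-commit records reserve WAL space with a plain fetch-add; commit
 * records take a locked slow path so the timestamp can be bumped in the
 * same section that fixes their ordering.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct ReserveState
{
    _Atomic uint64_t curr_byte_pos;   /* fast path: plain fetch-add */
    int64_t          last_commit_ts;  /* protected by the slow-path lock */
} ReserveState;

static uint64_t
reserve_space(ReserveState *s, uint32_t size, bool is_commit, int64_t *commit_ts)
{
    uint64_t start;

    if (!is_commit)
        return atomic_fetch_add(&s->curr_byte_pos, size);   /* common case */

    /* slow path -- lock() would go here */
    start = atomic_fetch_add(&s->curr_byte_pos, size);
    if (*commit_ts <= s->last_commit_ts)
        *commit_ts = ++s->last_commit_ts;    /* bump a backwards timestamp */
    else
        s->last_commit_ts = *commit_ts;
    /* unlock() */

    return start;
}
```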
Best Regards, Jan