New WAL record to detect the checkpoint redo location

Started by Dilip Kumaralmost 3 years ago39 messageshackers
Jump to latest
#1Dilip Kumar
dilipbalaut@gmail.com

As discussed [1 ][2]/messages/by-id/20230614194717.jyuw3okxup4cvtbt@awork3.anarazel.de currently, the checkpoint-redo LSN can not be
accurately detected while processing the WAL. Although we have a
checkpoint WAL record containing the exact redo LSN, other WAL records
may be inserted between the checkpoint-redo LSN and the actual
checkpoint record. If we want to stop processing wal exactly at the
checkpoint-redo location then we cannot do that because we would have
already processed some extra records that got added after the redo
LSN.

The patch inserts a special wal record named CHECKPOINT_REDO WAL,
which is located exactly at the checkpoint-redo location. We can
guarantee this record to be exactly at the checkpoint-redo point
because we already hold the exclusive WAL insertion lock while
identifying the checkpoint redo point and can insert this special
record exactly at the same time so that there are no concurrent WAL
insertions.

[1]: /messages/by-id/CA+TgmoYOYZfMCyOXFyC-P+-mdrZqm5pP2N7S-r0z3_402h9rsA@mail.gmail.com
[2]: /messages/by-id/20230614194717.jyuw3okxup4cvtbt@awork3.anarazel.de

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v1-0001-New-WAL-record-to-identify-check-redo-location.patchapplication/octet-stream; name=v1-0001-New-WAL-record-to-identify-check-redo-location.patchDownload+62-8
#2Andres Freund
andres@anarazel.de
In reply to: Dilip Kumar (#1)
Re: New WAL record to detect the checkpoint redo location

Hi,

As I think I mentioned before, I like this idea. However, I don't like the
implementation too much.

On 2023-06-15 13:11:57 +0530, Dilip Kumar wrote:

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b2430f617c..a025fb91e2 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -744,6 +744,7 @@ XLogInsertRecord(XLogRecData *rdata,
XLogRecPtr	StartPos;
XLogRecPtr	EndPos;
bool		prevDoPageWrites = doPageWrites;
+	bool		callerHoldingExlock = holdingAllLocks;
TimeLineID	insertTLI;
/* we assume that all of the record header is in the first chunk */
@@ -792,10 +793,18 @@ XLogInsertRecord(XLogRecData *rdata,
*----------
*/
START_CRIT_SECTION();
-	if (isLogSwitch)
-		WALInsertLockAcquireExclusive();
-	else
-		WALInsertLockAcquire();
+
+	/*
+	 * Acquire wal insertion lock, nothing to do if the caller is already
+	 * holding the exclusive lock.
+	 */
+	if (!callerHoldingExlock)
+	{
+		if (isLogSwitch)
+			WALInsertLockAcquireExclusive();
+		else
+			WALInsertLockAcquire();
+	}

/*
* Check to see if my copy of RedoRecPtr is out of date. If so, may have

This might work right now, but doesn't really seem maintainable. Nor do I like
adding branches into this code a whole lot.

@@ -6597,6 +6612,32 @@ CreateCheckPoint(int flags)

I think the commentary above this function would need a fair bit of
revising...

*/
RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;

+	/*
+	 * Insert a special purpose CHECKPOINT_REDO record as the first record at
+	 * checkpoint redo lsn.  Although we have the checkpoint record that
+	 * contains the exact redo lsn, there might have been some other records
+	 * those got inserted between the redo lsn and the actual checkpoint
+	 * record.  So when processing the wal, we cannot rely on the checkpoint
+	 * record if we want to stop at the checkpoint-redo LSN.
+	 *
+	 * This special record, however, is not required when we doing a shutdown
+	 * checkpoint, as there will be no concurrent wal insertions during that
+	 * time.  So, the shutdown checkpoint LSN will be the same as
+	 * checkpoint-redo LSN.
+	 *
+	 * This record is guaranteed to be the first record at checkpoint redo lsn
+	 * because we are inserting this while holding the exclusive wal insertion
+	 * lock.
+	 */
+	if (!shutdown)
+	{
+		int			dummy = 0;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &dummy, sizeof(dummy));
+		recptr = XLogInsert(RM_XLOG_ID, XLOG_CHECKPOINT_REDO);
+	}

It seems to me that we should be able to do better than this.

I suspect we might be able to get rid of the need for exclusive inserts
here. If we rid of that, we could determine the redoa location based on the
LSN determined by the XLogInsert().

Alternatively, I think we should split XLogInsertRecord() so that the part
with the insertion locks held is a separate function, that we could use here.

Greetings,

Andres Freund

#3Dilip Kumar
dilipbalaut@gmail.com
In reply to: Andres Freund (#2)
Re: New WAL record to detect the checkpoint redo location

On Fri, Jul 14, 2023 at 8:46 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

As I think I mentioned before, I like this idea. However, I don't like the
implementation too much.

Thanks for looking into it.

This might work right now, but doesn't really seem maintainable. Nor do I like
adding branches into this code a whole lot.

Okay, Now I have moved the XlogInsert for the special record outside
the WalInsertLock so we don't need this special handling here.

@@ -6597,6 +6612,32 @@ CreateCheckPoint(int flags)

I think the commentary above this function would need a fair bit of
revising...

*/
RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;

+     /*
+      * Insert a special purpose CHECKPOINT_REDO record as the first record at
+      * checkpoint redo lsn.  Although we have the checkpoint record that
+      * contains the exact redo lsn, there might have been some other records
+      * those got inserted between the redo lsn and the actual checkpoint
+      * record.  So when processing the wal, we cannot rely on the checkpoint
+      * record if we want to stop at the checkpoint-redo LSN.
+      *
+      * This special record, however, is not required when we doing a shutdown
+      * checkpoint, as there will be no concurrent wal insertions during that
+      * time.  So, the shutdown checkpoint LSN will be the same as
+      * checkpoint-redo LSN.
+      *
+      * This record is guaranteed to be the first record at checkpoint redo lsn
+      * because we are inserting this while holding the exclusive wal insertion
+      * lock.
+      */
+     if (!shutdown)
+     {
+             int                     dummy = 0;
+
+             XLogBeginInsert();
+             XLogRegisterData((char *) &dummy, sizeof(dummy));
+             recptr = XLogInsert(RM_XLOG_ID, XLOG_CHECKPOINT_REDO);
+     }

It seems to me that we should be able to do better than this.

I suspect we might be able to get rid of the need for exclusive inserts
here. If we rid of that, we could determine the redoa location based on the
LSN determined by the XLogInsert().

Yeah, good idea, actually we can do this insert outside of the
exclusive insert lock and set the LSN of this insert as the
checkpoint. redo location. So now we do not need to compute the
checkpoint. redo based on the current insertion point we can directly
use the LSN of this record. I have modified this and I have also
moved some other code that is not required to be inside the WAL
insertion lock.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v2-0001-New-WAL-record-for-checkpoint-redo-location.patchapplication/octet-stream; name=v2-0001-New-WAL-record-for-checkpoint-redo-location.patchDownload+38-44
#4Michael Paquier
michael@paquier.xyz
In reply to: Dilip Kumar (#3)
Re: New WAL record to detect the checkpoint redo location

On Tue, Aug 15, 2023 at 02:23:43PM +0530, Dilip Kumar wrote:

Yeah, good idea, actually we can do this insert outside of the
exclusive insert lock and set the LSN of this insert as the
checkpoint. redo location. So now we do not need to compute the
checkpoint. redo based on the current insertion point we can directly
use the LSN of this record. I have modified this and I have also
moved some other code that is not required to be inside the WAL
insertion lock.

Looking at this patch, I am bit surprised to see that the redo point
maps with the end location of the CHECKPOINT_REDO record as it is the
LSN returned by XLogInsert(), not its start LSN. For example after a
checkpoint:
=# CREATE EXTENSION pg_walinspect;
CREATE EXTENSION;
=# SELECT redo_lsn, checkpoint_lsn from pg_control_checkpoint();
redo_lsn | checkpoint_lsn
-----------+----------------
0/19129D0 | 0/1912A08
(1 row)
=# SELECT start_lsn, prev_lsn, end_lsn, record_type
from pg_get_wal_record_info('0/19129D0');
start_lsn | prev_lsn | end_lsn | record_type
-----------+-----------+-----------+---------------
0/19129D0 | 0/19129B0 | 0/1912A08 | RUNNING_XACTS
(1 row)

In this case it matches with the previous record:
=# SELECT start_lsn, prev_lsn, end_lsn, record_type
from pg_get_wal_record_info('0/19129B0');
start_lsn | prev_lsn | end_lsn | record_type
-----------+-----------+-----------+-----------------
0/19129B0 | 0/1912968 | 0/19129D0 | CHECKPOINT_REDO
(1 row)

This could be used to cross-check that the first record replayed is of
the correct type. The commit message of this patch tells that "the
checkpoint-redo location is set at LSN of this record", which implies
the start LSN of the record tracked as the redo LSN, not the end of
it? What's the intention here?
--
Michael

#5Dilip Kumar
dilipbalaut@gmail.com
In reply to: Michael Paquier (#4)
Re: New WAL record to detect the checkpoint redo location

On Thu, Aug 17, 2023 at 10:52 AM Michael Paquier <michael@paquier.xyz> wrote:

On Tue, Aug 15, 2023 at 02:23:43PM +0530, Dilip Kumar wrote:

Yeah, good idea, actually we can do this insert outside of the
exclusive insert lock and set the LSN of this insert as the
checkpoint. redo location. So now we do not need to compute the
checkpoint. redo based on the current insertion point we can directly
use the LSN of this record. I have modified this and I have also
moved some other code that is not required to be inside the WAL
insertion lock.

Looking at this patch, I am bit surprised to see that the redo point
maps with the end location of the CHECKPOINT_REDO record as it is the
LSN returned by XLogInsert(), not its start LSN.

Yeah right, actually I was confused, I assumed it will return the
start LSN of the record. And I do not see any easy way to identify
the Start LSN of this record so maybe this solution will not work. I
will have to think of something else. Thanks for pointing it out.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#6Michael Paquier
michael@paquier.xyz
In reply to: Dilip Kumar (#5)
Re: New WAL record to detect the checkpoint redo location

On Thu, Aug 17, 2023 at 01:11:50PM +0530, Dilip Kumar wrote:

Yeah right, actually I was confused, I assumed it will return the
start LSN of the record. And I do not see any easy way to identify
the Start LSN of this record so maybe this solution will not work. I
will have to think of something else. Thanks for pointing it out.

About that. One thing to consider may be ReserveXLogInsertLocation()
while holding the WAL insert lock, but you can just rely on
ProcLastRecPtr for the job after inserting the REDO record, no?
--
Michael

#7Dilip Kumar
dilipbalaut@gmail.com
In reply to: Michael Paquier (#6)
Re: New WAL record to detect the checkpoint redo location

On Fri, Aug 18, 2023 at 5:24 AM Michael Paquier <michael@paquier.xyz> wrote:

On Thu, Aug 17, 2023 at 01:11:50PM +0530, Dilip Kumar wrote:

Yeah right, actually I was confused, I assumed it will return the
start LSN of the record. And I do not see any easy way to identify
the Start LSN of this record so maybe this solution will not work. I
will have to think of something else. Thanks for pointing it out.

About that. One thing to consider may be ReserveXLogInsertLocation()
while holding the WAL insert lock, but you can just rely on
ProcLastRecPtr for the job after inserting the REDO record, no?

Yeah right, we can use ProcLastRecPtr. I will send the updated patch.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#8Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#7)
Re: New WAL record to detect the checkpoint redo location

On Fri, Aug 18, 2023 at 10:12 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Aug 18, 2023 at 5:24 AM Michael Paquier <michael@paquier.xyz> wrote:

On Thu, Aug 17, 2023 at 01:11:50PM +0530, Dilip Kumar wrote:

Yeah right, actually I was confused, I assumed it will return the
start LSN of the record. And I do not see any easy way to identify
the Start LSN of this record so maybe this solution will not work. I
will have to think of something else. Thanks for pointing it out.

About that. One thing to consider may be ReserveXLogInsertLocation()
while holding the WAL insert lock, but you can just rely on
ProcLastRecPtr for the job after inserting the REDO record, no?

Yeah right, we can use ProcLastRecPtr. I will send the updated patch.

Here is the updated version of the patch.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v3-0001-New-WAL-record-for-checkpoint-redo-location.patchapplication/octet-stream; name=v3-0001-New-WAL-record-for-checkpoint-redo-location.patchDownload+58-15
#9Michael Paquier
michael@paquier.xyz
In reply to: Dilip Kumar (#8)
Re: New WAL record to detect the checkpoint redo location

On Fri, Aug 25, 2023 at 11:08:25AM +0530, Dilip Kumar wrote:

Here is the updated version of the patch.

The concept of the patch looks sound to me. I have a few comments.

+	 * This special record, however, is not required when we doing a shutdown
+	 * checkpoint, as there will be no concurrent wal insertions during that
+	 * time.  So, the shutdown checkpoint LSN will be the same as
+	 * checkpoint-redo LSN.

This is missing "are", as in "when we are doing a shutdown
checkpoint".

- freespace = INSERT_FREESPACE(curInsert);
- if (freespace == 0)

The variable "freespace" can be moved within the if block introduced
by this patch when calculating the REDO location for the shutdown
case. And you can do the same with curInsert?

-	 * Compute new REDO record ptr = location of next XLOG record.
-	 *
-	 * NB: this is NOT necessarily where the checkpoint record itself will be,
-	 * since other backends may insert more XLOG records while we're off doing
-	 * the buffer flush work.  Those XLOG records are logically after the
-	 * checkpoint, even though physically before it.  Got that?
+	 * In case of shutdown checkpoint, compute new REDO record
+	 * ptr = location of next XLOG record.

It seems to me that keeping around this comment is important,
particularly for the case where we have a shutdown checkpoint and we
expect nothing to run, no?

How about adding a test in pg_walinspect?  There is an online
checkpoint running there, so you could just add something like that
to check that the REDO record is at the expected LSN stored in the
control file:
+-- Check presence of REDO record.
+SELECT redo_lsn FROM pg_control_checkpoint() \gset
+SELECT start_lsn = :'redo_lsn'::pg_lsn AS same_lsn, record_type
+  FROM pg_get_wal_record_info(:'redo_lsn');
--
Michael
#10Dilip Kumar
dilipbalaut@gmail.com
In reply to: Michael Paquier (#9)
Re: New WAL record to detect the checkpoint redo location

On Mon, Aug 28, 2023 at 5:14 AM Michael Paquier <michael@paquier.xyz> wrote:

On Fri, Aug 25, 2023 at 11:08:25AM +0530, Dilip Kumar wrote:

Here is the updated version of the patch.

The concept of the patch looks sound to me. I have a few comments.

Thanks for the review

+        * This special record, however, is not required when we doing a shutdown
+        * checkpoint, as there will be no concurrent wal insertions during that
+        * time.  So, the shutdown checkpoint LSN will be the same as
+        * checkpoint-redo LSN.

This is missing "are", as in "when we are doing a shutdown
checkpoint".

Fixed

- freespace = INSERT_FREESPACE(curInsert);
- if (freespace == 0)

The variable "freespace" can be moved within the if block introduced
by this patch when calculating the REDO location for the shutdown
case. And you can do the same with curInsert?

Done, I have also moved code related to computing curInsert in the
same if (shutdown) block.

-        * Compute new REDO record ptr = location of next XLOG record.
-        *
-        * NB: this is NOT necessarily where the checkpoint record itself will be,
-        * since other backends may insert more XLOG records while we're off doing
-        * the buffer flush work.  Those XLOG records are logically after the
-        * checkpoint, even though physically before it.  Got that?
+        * In case of shutdown checkpoint, compute new REDO record
+        * ptr = location of next XLOG record.

It seems to me that keeping around this comment is important,
particularly for the case where we have a shutdown checkpoint and we
expect nothing to run, no?

I removed this mainly because now in other comments[1]+ /* + * Insert a dummy CHECKPOINT_REDO record and set start LSN of this record + * as checkpoint.redo. Although we have the checkpoint record that also + * contains the exact redo lsn, there might have been some other records + * those got inserted between the redo lsn and the actual checkpoint + * record. So when processing the wal, we cannot rely on the checkpoint + * record if we want to stop at the checkpoint-redo LSN. + * + * This special record, however, is not required when we are doing a + * shutdown checkpoint, as there will be no concurrent wal insertions + * during that time. So, the shutdown checkpoint LSN will be the same as + * checkpoint-redo LSN. + */ where we are
introducing this new CHECKPOINT_REDO record we are explaining the
problem
that the redo location and the actual checkpoint records are not at
the same place and that is because of the concurrent xlog insertion.
I think we are explaining in more
detail by also stating that in case of a shutdown checkpoint, there
would not be any concurrent insertion so the shutdown checkpoint redo
will be at the same place. So I feel keeping old comments is not
required. And we are explaining it when we are setting this for
non-shutdown checkpoint because for shutdown checkpoint this statement
is anyway not correct because for the shutdown checkpoint the
checkpoint record will be at the same location and there will be no
concurrent wal insertion, what do you think?

[1]
+ /*
+ * Insert a dummy CHECKPOINT_REDO record and set start LSN of this record
+ * as checkpoint.redo.  Although we have the checkpoint record that also
+ * contains the exact redo lsn, there might have been some other records
+ * those got inserted between the redo lsn and the actual checkpoint
+ * record.  So when processing the wal, we cannot rely on the checkpoint
+ * record if we want to stop at the checkpoint-redo LSN.
+ *
+ * This special record, however, is not required when we are doing a
+ * shutdown checkpoint, as there will be no concurrent wal insertions
+ * during that time.  So, the shutdown checkpoint LSN will be the same as
+ * checkpoint-redo LSN.
+ */
How about adding a test in pg_walinspect?  There is an online
checkpoint running there, so you could just add something like that
to check that the REDO record is at the expected LSN stored in the
control file:
+-- Check presence of REDO record.
+SELECT redo_lsn FROM pg_control_checkpoint() \gset
+SELECT start_lsn = :'redo_lsn'::pg_lsn AS same_lsn, record_type
+  FROM pg_get_wal_record_info(:'redo_lsn');
--

Done, thanks.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v4-0001-New-WAL-record-for-checkpoint-redo-location.patchapplication/octet-stream; name=v4-0001-New-WAL-record-for-checkpoint-redo-location.patchDownload+71-17
#11Michael Paquier
michael@paquier.xyz
In reply to: Dilip Kumar (#10)
Re: New WAL record to detect the checkpoint redo location

On Mon, Aug 28, 2023 at 01:47:18PM +0530, Dilip Kumar wrote:

I removed this mainly because now in other comments[1] where we are
introducing this new CHECKPOINT_REDO record we are explaining the
problem
that the redo location and the actual checkpoint records are not at
the same place and that is because of the concurrent xlog insertion.
I think we are explaining in more
detail by also stating that in case of a shutdown checkpoint, there
would not be any concurrent insertion so the shutdown checkpoint redo
will be at the same place. So I feel keeping old comments is not
required.
And we are explaining it when we are setting this for
non-shutdown checkpoint because for shutdown checkpoint this statement
is anyway not correct because for the shutdown checkpoint the
checkpoint record will be at the same location and there will be no
concurrent wal insertion, what do you think?

+    * Insert a dummy CHECKPOINT_REDO record and set start LSN of this record
+    * as checkpoint.redo.

I would add a "for a non-shutdown checkpoint" at the end of this
sentence.

+    * record.  So when processing the wal, we cannot rely on the checkpoint
+    * record if we want to stop at the checkpoint-redo LSN.

The term "checkpoint-redo" is also a bit confusing, I guess, because
you just mean to refer to the "redo" LSN here? Maybe rework the last
sentence as:
"So, when processing WAL, we cannot rely on the checkpoint record if
we want to stop at the same position as the redo LSN".

+    * This special record, however, is not required when we are doing a
+    * shutdown checkpoint, as there will be no concurrent wal insertions
+    * during that time.  So, the shutdown checkpoint LSN will be the same as
+    * checkpoint-redo LSN.

Perhaps the last sentence could be merged with the first one, if we
are tweaking things, say:
"This special record is not required when doing a shutdown checkpoint;
the redo LSN is the same LSN as the checkpoint record as there cannot
be any WAL activity in a shutdown sequence."
--
Michael

#12Dilip Kumar
dilipbalaut@gmail.com
In reply to: Michael Paquier (#11)
Re: New WAL record to detect the checkpoint redo location

On Wed, Aug 30, 2023 at 1:03 PM Michael Paquier <michael@paquier.xyz> wrote:

On Mon, Aug 28, 2023 at 01:47:18PM +0530, Dilip Kumar wrote:

I removed this mainly because now in other comments[1] where we are
introducing this new CHECKPOINT_REDO record we are explaining the
problem
that the redo location and the actual checkpoint records are not at
the same place and that is because of the concurrent xlog insertion.
I think we are explaining in more
detail by also stating that in case of a shutdown checkpoint, there
would not be any concurrent insertion so the shutdown checkpoint redo
will be at the same place. So I feel keeping old comments is not
required.
And we are explaining it when we are setting this for
non-shutdown checkpoint because for shutdown checkpoint this statement
is anyway not correct because for the shutdown checkpoint the
checkpoint record will be at the same location and there will be no
concurrent wal insertion, what do you think?

+    * Insert a dummy CHECKPOINT_REDO record and set start LSN of this record
+    * as checkpoint.redo.

I would add a "for a non-shutdown checkpoint" at the end of this
sentence.

+    * record.  So when processing the wal, we cannot rely on the checkpoint
+    * record if we want to stop at the checkpoint-redo LSN.

The term "checkpoint-redo" is also a bit confusing, I guess, because
you just mean to refer to the "redo" LSN here? Maybe rework the last
sentence as:
"So, when processing WAL, we cannot rely on the checkpoint record if
we want to stop at the same position as the redo LSN".

+    * This special record, however, is not required when we are doing a
+    * shutdown checkpoint, as there will be no concurrent wal insertions
+    * during that time.  So, the shutdown checkpoint LSN will be the same as
+    * checkpoint-redo LSN.

Perhaps the last sentence could be merged with the first one, if we
are tweaking things, say:
"This special record is not required when doing a shutdown checkpoint;
the redo LSN is the same LSN as the checkpoint record as there cannot
be any WAL activity in a shutdown sequence."

Your suggestions LGTM so modified accordingly

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v5-0001-New-WAL-record-for-checkpoint-redo-location.patchapplication/octet-stream; name=v5-0001-New-WAL-record-for-checkpoint-redo-location.patchDownload+71-17
#13Michael Paquier
michael@paquier.xyz
In reply to: Dilip Kumar (#12)
Re: New WAL record to detect the checkpoint redo location

On Wed, Aug 30, 2023 at 04:51:19PM +0530, Dilip Kumar wrote:

Your suggestions LGTM so modified accordingly

I have been putting my HEAD on this patch for a few hours, reviewing
the surroundings, and somewhat missed that this computation is done
while we do not hold the WAL insert locks:
+ checkPoint.redo = ProcLastRecPtr;

Then a few lines down the shared Insert.RedoRecPtr is updated while
holding an exclusive lock.
RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;

If we have a bunch of records inserted between the moment when the
REDO record is inserted and the moment when the checkpointer takes the
exclusive WAL lock, aren't we potentially missing a lot of FPW's that
should exist since the redo LSN?
--
Michael

#14Dilip Kumar
dilipbalaut@gmail.com
In reply to: Michael Paquier (#13)
Re: New WAL record to detect the checkpoint redo location

On Thu, Aug 31, 2023 at 9:36 AM Michael Paquier <michael@paquier.xyz> wrote:

On Wed, Aug 30, 2023 at 04:51:19PM +0530, Dilip Kumar wrote:

Your suggestions LGTM so modified accordingly

I have been putting my HEAD on this patch for a few hours, reviewing
the surroundings, and somewhat missed that this computation is done
while we do not hold the WAL insert locks:
+ checkPoint.redo = ProcLastRecPtr;

Then a few lines down the shared Insert.RedoRecPtr is updated while
holding an exclusive lock.
RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;

If we have a bunch of records inserted between the moment when the
REDO record is inserted and the moment when the checkpointer takes the
exclusive WAL lock, aren't we potentially missing a lot of FPW's that
should exist since the redo LSN?

Yeah, good catch. With this, it seems like we can not move this new
WAL Insert out of the Exclusive WAL insertion lock right? because if
we want to set the LSN of this record as the checkpoint. redo then
there should not be any concurrent insertion until we expose the
XLogCtl->Insert.RedoRecPtr. Otherwise, we will miss the FPW for all
the record which has been inserted after the checkpoint. redo before
we acquired the exclusive WAL insertion lock.

So maybe I need to restart from the first version of the patch but
instead of moving the insertion of the new record out of the exclusive
lock need to do some better refactoring so that XLogInsertRecord()
doesn't look ugly.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#15Michael Paquier
michael@paquier.xyz
In reply to: Dilip Kumar (#14)
Re: New WAL record to detect the checkpoint redo location

On Thu, Aug 31, 2023 at 09:55:45AM +0530, Dilip Kumar wrote:

Yeah, good catch. With this, it seems like we can not move this new
WAL Insert out of the Exclusive WAL insertion lock right? Because if
we want to set the LSN of this record as the checkpoint.redo then
there should not be any concurrent insertion until we expose the
XLogCtl->Insert.RedoRecPtr. Otherwise, we will miss the FPW for all
the record which has been inserted after the checkpoint.redo before
we acquired the exclusive WAL insertion lock.

Yes.

So maybe I need to restart from the first version of the patch but
instead of moving the insertion of the new record out of the exclusive
lock need to do some better refactoring so that XLogInsertRecord()
doesn't look ugly.

Yes, I am not sure which interface would be less ugli-ish, but that's
enough material for a refactoring patch of the WAL insert routines on
top of the main patch that introduces the REDO record.
--
Michael

#16Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#2)
Re: New WAL record to detect the checkpoint redo location

On Fri, Jul 14, 2023 at 11:16 AM Andres Freund <andres@anarazel.de> wrote:

I suspect we might be able to get rid of the need for exclusive inserts
here. If we rid of that, we could determine the redoa location based on the
LSN determined by the XLogInsert().

I've been brainstorming about this today, trying to figure out some
ideas to make it work.

As Michael Paquier correctly noted downthread, we need to make sure
that a backend inserting a WAL record knows whether it needs to
contain an FPI. The comments in the do...while loop in XLogInsert are
pretty helpful here: doPageWrites can't change once XLogInsertRecord
acquires a WAL insertion lock. For that to be true, the redo pointer
can only move when holding all WAL insertion locks. That means that if
we add an XLOG_CHECKPOINT_REDO to mark the location of the redo
pointer, we've got to either (a) insert the record *and* update our
notion of the last redo pointer while holding all the WAL insertion
locks or (b) change the concurrency model in some way.

Let's explore (b) first. Perhaps my imagination is too limited here,
but I'm having trouble seeing a good way of doing this. One idea that
occurred to me was to make either the insertion of the
XLOG_CHECKPOINT_REDO record fail softly if somebody inserts a record
after it that omits FPIs, but that doesn't really work because then
we're left with a hole in the WAL. It's too late to move the later
record earlier. We could convert the intended XLOG_CHECKPOINT_REDO
record into a dummy record but that seems complex and wasteful.
Similarly, you could try to make the insertion of the later record
fail, but that basically has the same problem: there could be an even
later record being inserted after that which it's already too late to
reposition. Basically, it feels like once we get to the point where we
have a range of LSNs and we're copying data into wal_buffers, it's
awfully late to be trying to back out. Other people can already be
depending on us to put the amount of WAL that we promised to insert at
the place where we promised to put it.

The only other approach to (b) that I can think of is to force FPIs on
for all backends from just before to just after we insert the
XLOG_CHECKPOINT_REDO record. However, since we currently require
taking all the WAL insertion locks to start requiring full page
writes, this doesn't seem like it gains much. In theory perhaps we
could have an approach where we flip full page writes to sorta-on,
then wait until we've seen each WAL insertion lock unheld at least
once, and then at that point we know all new WAL insertions will see
them and can deem them fully on. However, when I start to think along
these lines, I feel like maybe I'm losing the plot. Checkpoints are
rare enough that the overhead of taking all the WAL insertion locks at
the same time isn't really a big problem, or at least I don't think it
is. I think the concern here is more about avoiding useless branches
in hot paths that potentially cost something for every single record
whether it has anything to do with this mechanism or not.

OK, so let's suppose we abandon the idea of changing the concurrency
model in any fundamental way and just try to figure out how to both
insert the record and update our notion of the last redo pointer while
holding all the WAL insertion locks i.e. (a) from the two options
above. Dilip's patch approaches this problem by pulling acquisition of
the WAL insertion locks up to the place where we're already setting
the redo pointer. I wonder if we could also consider the other
possible approach of pushing the update to Insert->RedoRecPtr down
into XLogInsertRecord(), which already has a special case for
acquiring all locks when the record being inserted is an XLOG_SWITCH
record. That would have the advantage of holding all of the insertion
locks for a smaller period of time than what Dilip's patch does -- in
particular, it wouldn't need to hold the lock across the
XLOG_CHECKPOINT_REDO's XLogRecordAssemble -- or across the rather
lengthy tail of XLogInsertRecord. But the obvious objection is that it
would put more branches into XLogInsertRecord which nobody wants.

But ... what if it didn't? Suppose we pulled the XLOG_SWITCH case out
of XLogInsertRecord and made a separate function for that case. It
looks to me like that would avoid 5 branches in that function in the
normal case where we're not inserting XLOG_SWITCH. We would want to
move some logic, particularly the WAL_DEBUG stuff and maybe other
things, into reusable subroutines. Then, we could decide to treat
XLOG_CHECKPOINT_REDO either in the normal path -- adding a couple of
those branches back again -- or in the XLOG_SWITCH function and either
way I think the normal path would have fewer branches than it does
today. One idea that I had was to create a new rmgr for "weird
records," initially XLOG_SWITCH and XLOG_CHECKPOINT_REDO. Then the
test as to whether to use the "normal" version of XLogInsertRecord or
the "weird" version could just be based on the rmid, and the "normal"
function wouldn't need to concern itself with anything specific to the
"weird" cases.

A variant on this idea would be to just accept a few extra branches
and hope it's not really that big of a deal. For instance, instead of
this:

bool isLogSwitch = (rechdr->xl_rmid == RM_XLOG_ID &&
info == XLOG_SWITCH);

We could have this:

bool isAllLocks = (rechdr->xl_rmid == RM_BIZARRE_ID);
bool isLogSwitch = (isAllLocks && info == XLOG_SWITCH);

...and then conditionalize on either isAllLocks or isLogSwitch as
apppropriate. You'd still need an extra branch someplace to update
Insert->RedoRecPtr when isAllLocks && info == XLOG_CHECKPOINT_REDO,
but maybe that's not so bad?

Alternatively, I think we should split XLogInsertRecord() so that the part
with the insertion locks held is a separate function, that we could use here.

The difficulty that I see with this is that the function does a lot
more stuff after calling WALInsertLockRelease(). So just pushing the
part that's between acquiring and releasing WAL insertion locks into a
separate function wouldn't actually avoid a lot of code duplication,
if the goal was to do everything else that XLogInsertRecord() does
except for the lock manipulation. To get there, I think we'd need to
move all of the stuff after the lock release into one or more static
functions, too. Which is possibly an OK approach. I haven't checked
how much additional parameter passing we'd end up doing if we went
that way.

--
Robert Haas
EDB: http://www.enterprisedb.com

#17Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#16)
Re: New WAL record to detect the checkpoint redo location

On Mon, Sep 18, 2023 at 2:57 PM Robert Haas <robertmhaas@gmail.com> wrote:

I've been brainstorming about this today, trying to figure out some
ideas to make it work.

Here are some patches.

0001 refactors XLogInsertRecord to unify a couple of separate tests of
isLogSwitch, hopefully making it cleaner and cheaper to add more
special cases.

0002 is a very minimal patch to add XLOG_CHECKPOINT_REDO without using
it for anything.

0003 modifies CreateCheckPoint() to insert an XLOG_CHECKPOINT_REDO
record for any non-shutdown checkpoint, and modifies
XLogInsertRecord() to treat that as a new special case, wherein after
inserting the record the redo pointer is reset while still holding the
WAL insertion locks.

I've tested this to the extent of running the regression tests, and I
also did one (1) manual test where it looked like the right thing was
happening, but that's it, so this might be buggy or perform like
garbage for all I know. But my hope is that it isn't buggy and
performs adequately. If there's any chance of getting some comments on
the basic design choices before I spend time testing and polishing it,
that would be very helpful.

Thanks,

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachments:

v6-0001-Unify-two-isLogSwitch-tests-in-XLogInsertRecord.patchapplication/octet-stream; name=v6-0001-Unify-two-isLogSwitch-tests-in-XLogInsertRecord.patchDownload+56-42
v6-0002-Minimally-add-XLOG_CHECKPOINT_REDO.patchapplication/octet-stream; name=v6-0002-Minimally-add-XLOG_CHECKPOINT_REDO.patchDownload+13-1
v6-0003-WIP-Insert-XLOG_CHECKPOINT_REDO-at-the-redo-point.patchapplication/octet-stream; name=v6-0003-WIP-Insert-XLOG_CHECKPOINT_REDO-at-the-redo-point.patchDownload+95-35
#18Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#17)
Re: New WAL record to detect the checkpoint redo location

On Thu, Sep 21, 2023 at 7:05 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Sep 18, 2023 at 2:57 PM Robert Haas <robertmhaas@gmail.com> wrote:

I've been brainstorming about this today, trying to figure out some
ideas to make it work.

Here are some patches.

0001 refactors XLogInsertRecord to unify a couple of separate tests of
isLogSwitch, hopefully making it cleaner and cheaper to add more
special cases.

0002 is a very minimal patch to add XLOG_CHECKPOINT_REDO without using
it for anything.

0003 modifies CreateCheckPoint() to insert an XLOG_CHECKPOINT_REDO
record for any non-shutdown checkpoint, and modifies
XLogInsertRecord() to treat that as a new special case, wherein after
inserting the record the redo pointer is reset while still holding the
WAL insertion locks.

After the 0003 patch, do we need acquire exclusive lock via
WALInsertLockAcquireExclusive() for non-shutdown checkpoints. Even the
comment "We must block concurrent insertions while examining insert
state to determine the checkpoint REDO pointer." seems to indicate
that it is not required. If it is required then we may want to change
the comments and also acquiring the locks twice will have more cost
than acquiring it once and write the new WAL record under that lock.

One minor comment:
+ else if (XLOG_CHECKPOINT_REDO)
+ class = WALINSERT_SPECIAL_CHECKPOINT;
+ }

Isn't the check needs to compare the record type with info?

Your v6-0001* patch looks like an improvement to me even without the
other two patches.

BTW, I would like to mention that there is a slight interaction of
this work with the patch to upgrade/migrate slots [1]/messages/by-id/TYAPR01MB586615579356A84A8CF29A00F5F9A@TYAPR01MB5866.jpnprd01.prod.outlook.com. Basically in
[1]: /messages/by-id/TYAPR01MB586615579356A84A8CF29A00F5F9A@TYAPR01MB5866.jpnprd01.prod.outlook.com
ensure that all the WAL has been consumed by the slots before clean
shutdown. However, during upgrade we can generate few records like
checkpoint which we will ignore for the slot consistency checking as
such records doesn't matter for data consistency after upgrade. We
probably need to add this record to that list. I'll keep an eye on
both the patches so that we don't miss that interaction but mentioned
it here to make others also aware of the same.

[1]: /messages/by-id/TYAPR01MB586615579356A84A8CF29A00F5F9A@TYAPR01MB5866.jpnprd01.prod.outlook.com

--
With Regards,
Amit Kapila.

#19Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#18)
Re: New WAL record to detect the checkpoint redo location

On Thu, Sep 21, 2023 at 4:22 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

After the 0003 patch, do we need acquire exclusive lock via
WALInsertLockAcquireExclusive() for non-shutdown checkpoints. Even the
comment "We must block concurrent insertions while examining insert
state to determine the checkpoint REDO pointer." seems to indicate
that it is not required. If it is required then we may want to change
the comments and also acquiring the locks twice will have more cost
than acquiring it once and write the new WAL record under that lock.

I think the comment needs updating. I don't think we can do curInsert
= XLogBytePosToRecPtr(Insert->CurrBytePos) without taking the locks.
Same for Insert->fullPageWrites.

I agree that it looks a little wasteful to release the lock and then
reacquire it, but I suppose checkpoints don't happen often enough for
it to matter. You're not going to notice an extra set of insertion
lock acquisitions once every 5 minutes, or every half hour, or even
every 1 minute if your checkpoints are super-frequent.

Also notice that the current code is also quite inefficient in this
way. GetLastImportantRecPtr() acquires and releases each lock one at a
time, and then we immediately turn around and do
WALInsertLockAcquireExclusive(). If the overhead that you're concerned
about here were big enough to matter, we could reclaim what we're
losing by having a version of GetLastImportantRecPtr() that expects to
be called with all locks already held. But when I asked Andres, he
thought that it didn't matter, and I bet he's right.

One minor comment:
+ else if (XLOG_CHECKPOINT_REDO)
+ class = WALINSERT_SPECIAL_CHECKPOINT;
+ }

Isn't the check needs to compare the record type with info?

Yeah wow. That's a big mistake.

Your v6-0001* patch looks like an improvement to me even without the
other two patches.

Good to know, thanks.

BTW, I would like to mention that there is a slight interaction of
this work with the patch to upgrade/migrate slots [1]. Basically in
[1], to allow slots migration from lower to higher version, we need to
ensure that all the WAL has been consumed by the slots before clean
shutdown. However, during upgrade we can generate few records like
checkpoint which we will ignore for the slot consistency checking as
such records doesn't matter for data consistency after upgrade. We
probably need to add this record to that list. I'll keep an eye on
both the patches so that we don't miss that interaction but mentioned
it here to make others also aware of the same.

If your approach requires a code change every time someone adds a new
WAL record that doesn't modify table data, you might want to rethink
the approach a bit.

--
Robert Haas
EDB: http://www.enterprisedb.com

#20Dilip Kumar
dilipbalaut@gmail.com
In reply to: Robert Haas (#17)
Re: New WAL record to detect the checkpoint redo location

On Thu, Sep 21, 2023 at 1:50 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Sep 18, 2023 at 2:57 PM Robert Haas <robertmhaas@gmail.com> wrote:

I've been brainstorming about this today, trying to figure out some
ideas to make it work.

Here are some patches.

0001 refactors XLogInsertRecord to unify a couple of separate tests of
isLogSwitch, hopefully making it cleaner and cheaper to add more
special cases.

Yeah, this looks improvement as it removes one isLogSwitch from the code.

0002 is a very minimal patch to add XLOG_CHECKPOINT_REDO without using
it for anything.

0003 modifies CreateCheckPoint() to insert an XLOG_CHECKPOINT_REDO
record for any non-shutdown checkpoint, and modifies
XLogInsertRecord() to treat that as a new special case, wherein after
inserting the record the redo pointer is reset while still holding the
WAL insertion locks.

Yeah from a design POV, it looks fine to me because the main goal was
to insert the XLOG_CHECKPOINT_REDO record and set the "RedoRecPtr"
under the same exclusive wal insertion lock and this patch is doing
this. As you already mentioned it is an improvement over my first
patch because a) it holds the exclusive WAL insersion lock for a very
short duration b) not increasing the number of branches in
XLogInsertRecord().

Some review
1.
I feel we can reduce one more branch to the normal path by increasing
one branch in this special case i.e.

Your code is
if (class == WALINSERT_SPECIAL_SWITCH)
{
/*execute isSwitch case */
}
else if (class == WALINSERT_SPECIAL_CHECKPOINT)
{
/*execute checkpoint redo case */
}
else
{
/* common case*/
}

My suggestion
if (xl_rmid == RM_XLOG_ID)
{
if (class == WALINSERT_SPECIAL_SWITCH)
{
/*execute isSwitch case */
}
else if (class == WALINSERT_SPECIAL_CHECKPOINT)
{
/*execute checkpoint redo case */
}
}
else
{
/* common case*/
}

2.
In fact, I feel that we can remove this branch as well right? I mean
why do we need to have this separate thing called "class"? we can
very much use "info" for that purpose. right?

+ /* Does this record type require special handling? */
+ if (rechdr->xl_rmid == RM_XLOG_ID)
+ {
+ if (info == XLOG_SWITCH)
+ class = WALINSERT_SPECIAL_SWITCH;
+ else if (XLOG_CHECKPOINT_REDO)
+ class = WALINSERT_SPECIAL_CHECKPOINT;
+ }

So if we remove this then we do not have this class and the above case
would look like

if (xl_rmid == RM_XLOG_ID)
{
if (info == XLOG_SWITCH)
{
/*execute isSwitch case */
}
else if (info == XLOG_CHECKPOINT_REDO)
{
/*execute checkpoint redo case */
}
}
else
{
/* common case*/
}

3.
+ /* Does this record type require special handling? */
+ if (rechdr->xl_rmid == RM_XLOG_ID)
+ {
+ if (info == XLOG_SWITCH)
+ class = WALINSERT_SPECIAL_SWITCH;
+ else if (XLOG_CHECKPOINT_REDO)
+ class = WALINSERT_SPECIAL_CHECKPOINT;
+ }
+

the above check-in else if is wrong I mean
else if (XLOG_CHECKPOINT_REDO) should be else if (info == XLOG_CHECKPOINT_REDO)

That's all I have for now.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#21Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#19)
#22Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#17)
#23Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#22)
#24Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#23)
#25Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#23)
#26Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#24)
#27Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#25)
#28Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#27)
#29Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#28)
#30Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#27)
#31Thomas Munro
thomas.munro@gmail.com
In reply to: Robert Haas (#28)
#32Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#30)
#33Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#32)
#34Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#30)
#35Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#34)
#36Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#35)
#37Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#36)
#38Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#37)
#39Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#38)