Proposal: Limitations of palloc inside checkpointer
Hi, hackers!
Historically, the checkpointer process has used palloc() in the
AbsorbSyncRequests() function. Therefore, the checkpointer does not
expect to handle a request array larger than 1 GB.
We encountered a case where the database went into recovery; after
applying all the WAL, the checkpointer process generated an "invalid memory
alloc request size" error and entered an endless loop. Yet it is entirely
possible for recovery to produce such a large allocation request.
A simple solution to this problem is to use palloc_extended() instead of
palloc(). But is it safe to allow the checkpointer to allocate so much
memory at once?
I have a proposal to update this memory allocation, but I need your ideas
and advice on how to do it in an appropriate way. One idea is to
replace the array with a list of arrays so that memory is allocated in
chunks (see the sketch below). A worse idea would be to process a temporary
array without locking.
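To make the chunked-allocation idea a bit more concrete, here is a rough standalone sketch (plain C, not PostgreSQL code; the Request type, chunk size, and use of calloc() are placeholders): the copied requests go into a linked list of fixed-size chunks, so no single allocation ever has to approach the 1 GB limit.

#include <stdio.h>
#include <stdlib.h>

#define CHUNK_CAPACITY 65536        /* requests per chunk, chosen arbitrarily */

typedef struct Request { int ftag; int type; } Request;    /* stand-in type */

typedef struct RequestChunk
{
    struct RequestChunk *next;      /* previously filled chunk, if any */
    int         nused;
    Request     items[CHUNK_CAPACITY];
} RequestChunk;

/* Append one request, starting a new chunk when the current one is full. */
static RequestChunk *
chunk_append(RequestChunk *head, Request req)
{
    if (head == NULL || head->nused == CHUNK_CAPACITY)
    {
        RequestChunk *c = calloc(1, sizeof(RequestChunk));

        c->next = head;
        head = c;
    }
    head->items[head->nused++] = req;
    return head;
}

int
main(void)
{
    RequestChunk *head = NULL;
    int         nchunks = 0;

    /* copy a large number of "requests"; each allocation stays small */
    for (int i = 0; i < 200000; i++)
        head = chunk_append(head, (Request) {i, 0});

    for (RequestChunk *c = head; c != NULL; c = c->next)
        nchunks++;
    printf("stored 200000 requests in %d chunks\n", nchunks);
    return 0;
}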
I would be glad to hear your ideas and suggestions about this topic.
Have a nice day!
--
Ekaterina Sokolova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Tue, 25 Feb 2025 at 22:44, Ekaterina Sokolova <e.sokolova@postgrespro.ru>
wrote:
Hi, hackers!
Historically, the checkpointer process has used palloc() in the
AbsorbSyncRequests() function. Therefore, the checkpointer does not
expect to handle a request array larger than 1 GB.
Yeah. And the most unpleasant thing is that it won't simply fail with an error
or a helpful message suggesting a workaround (reduce the amount of shared
memory). The checkpointer will just get "stuck".
AFAICS, we have a few options:
1. Leave it as it is, but issue a FATAL when a chunk of more than 1 GB would be allocated.
2. Use palloc_extended() with the MCXT_ALLOC_HUGE flag.
3. Do not allocate at all and use CheckpointerShmem->requests directly
when the required allocation exceeds 1 GB.
Case (3) is not an option, in my opinion. So we're left with (1) or (2).
Personally, I'm for (2); PFA the v0 patch.
--
Best regards,
Maxim Orlov.
Attachments:
v0-0001-Expect-huge-number-of-requests-in-checkpointer.patch (application/octet-stream)
From 3812a3e497a69fc083ea88d10c05614319f350ec Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Wed, 26 Feb 2025 11:17:50 +0300
Subject: [PATCH v0] Expect huge number of requests in checkpointer
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
---
src/backend/postmaster/checkpointer.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 7acbbd3e26..17635bc018 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1232,7 +1232,8 @@ CompactCheckpointerRequestQueue(void)
return false;
/* Initialize skip_slot array */
- skip_slot = palloc0(sizeof(bool) * CheckpointerShmem->num_requests);
+ skip_slot = palloc_extended(sizeof(bool) * CheckpointerShmem->num_requests,
+ MCXT_ALLOC_HUGE | MCXT_ALLOC_ZERO);
/* Initialize temporary hash table */
ctl.keysize = sizeof(CheckpointerRequest);
@@ -1343,8 +1344,11 @@ AbsorbSyncRequests(void)
n = CheckpointerShmem->num_requests;
if (n > 0)
{
- requests = (CheckpointerRequest *) palloc(n * sizeof(CheckpointerRequest));
- memcpy(requests, CheckpointerShmem->requests, n * sizeof(CheckpointerRequest));
+ Size size;
+
+ size = n * sizeof(CheckpointerRequest);
+ requests = palloc_extended(size, MCXT_ALLOC_HUGE);
+ memcpy(requests, CheckpointerShmem->requests, size);
}
START_CRIT_SECTION();
--
2.43.0
Hi,
On 2025-02-26 11:46:45 +0300, Maxim Orlov wrote:
On Tue, 25 Feb 2025 at 22:44, Ekaterina Sokolova <e.sokolova@postgrespro.ru>
wrote:
Hi, hackers!
Historically, the checkpointer process has used palloc() in the
AbsorbSyncRequests() function. Therefore, the checkpointer does not
expect to handle a request array larger than 1 GB.
Yeah. And the most unpleasant thing is that it won't simply fail with an error
or a helpful message suggesting a workaround (reduce the amount of shared
memory). The checkpointer will just get "stuck".
AFAICS, we have a few options:
1. Leave it as it is, but issue a FATAL when a chunk of more than 1 GB would be allocated.
2. Use palloc_extended() with the MCXT_ALLOC_HUGE flag.
3. Do not allocate at all and use CheckpointerShmem->requests directly
when the required allocation exceeds 1 GB.
4) Do compaction incrementally, instead of doing it for all requests at once.
That'd probably be better, because
a) it'll take some time to compact tens to hundreds of millions of requests, which
makes it much more likely that backends will have to perform syncs
themselves and the lock will be held for an extended period of time
b) allocating gigabytes of memory obviously makes it more likely that you'll
fail with out-of-memory at runtime or even get OOM killed.
Greetings,
Andres Freund
On Wed, 26 Feb 2025 at 11:54, Andres Freund <andres@anarazel.de> wrote:
4) Do compaction incrementally, instead of doing it for all requests at
once.
Yeah, good idea! I completely forgot about that. Thanks!
--
Best regards,
Maxim Orlov.
I tried to implement idea (4). This is the patch.
But there is a problem. When we release the lock, call
RememberSyncRequest(), and acquire the lock again,
CompactCheckpointerRequestQueue() may have been called and the state of the
request array may have changed. Of course, we can recheck num_requests after
retaking the lock and restart the algorithm all over again. But this is not a
great idea, since we can get stuck in this loop if someone keeps pushing
requests into the queue.
As for case (3): in fact, the described problem happens only with high
enough values of NBuffers. Thus, the user already expects Postgres to use a
huge amount of RAM. Is it really a problem to demand some more in order to
process the sync requests?
--
Best regards,
Maxim Orlov.
Attachments:
v1-0001-AbsorbSyncRequests-incrementally-instead-of-doing.patch (application/octet-stream)
From 6c11de18f460232f7064a56aff3413406c824ff3 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Thu, 27 Feb 2025 12:59:51 +0300
Subject: [PATCH v1] AbsorbSyncRequests incrementally, instead of doing it for
all requests at once.
---
src/backend/postmaster/checkpointer.c | 59 +++++++++++++++++----------
1 file changed, 37 insertions(+), 22 deletions(-)
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 7acbbd3e267..a4c6544e13b 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1321,42 +1321,57 @@ CompactCheckpointerRequestQueue(void)
void
AbsorbSyncRequests(void)
{
- CheckpointerRequest *requests = NULL;
- CheckpointerRequest *request;
- int n;
+ CheckpointerRequest *requests = NULL;
+ int num_requests,
+ max_requests;
if (!AmCheckpointerProcess())
return;
LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
+ num_requests = CheckpointerShmem->num_requests;
/*
- * We try to avoid holding the lock for a long time by copying the request
- * array, and processing the requests after releasing the lock.
- *
- * Once we have cleared the requests from shared memory, we have to PANIC
- * if we then fail to absorb them (eg, because our hashtable runs out of
- * memory). This is because the system cannot run safely if we are unable
- * to fsync what we have been told to fsync. Fortunately, the hashtable
- * is so small that the problem is quite unlikely to arise in practice.
+ * Note: this can be chosen arbitrarily, stick for 1Gb for now.
*/
- n = CheckpointerShmem->num_requests;
- if (n > 0)
+ max_requests = MaxAllocSize / sizeof(CheckpointerRequest);
+
+ for (int i = 0; i < num_requests; i += max_requests)
{
- requests = (CheckpointerRequest *) palloc(n * sizeof(CheckpointerRequest));
- memcpy(requests, CheckpointerShmem->requests, n * sizeof(CheckpointerRequest));
- }
+ Size n = (num_requests - i >= max_requests) ? max_requests :
+ num_requests - i;
- START_CRIT_SECTION();
+ if (!requests)
+ requests = palloc(n * sizeof(CheckpointerRequest));
- CheckpointerShmem->num_requests = 0;
+ memcpy(requests, CheckpointerShmem->requests + i,
+ n * sizeof(CheckpointerRequest));
- LWLockRelease(CheckpointerCommLock);
+ LWLockRelease(CheckpointerCommLock);
+
+ START_CRIT_SECTION();
- for (request = requests; n > 0; request++, n--)
- RememberSyncRequest(&request->ftag, request->type);
+ /*
+ * We try to avoid holding the lock for a long time by copying the
+ * request array, and processing the requests after releasing the lock.
+ *
+ * Once we have cleared the requests from shared memory, we have to
+ * PANIC if we then fail to absorb them (eg, because our hashtable runs
+ * out of memory). This is because the system cannot run safely if we
+ * are unable to fsync what we have been told to fsync. Fortunately,
+ * the hashtable is so small that the problem is quite unlikely to arise
+ * in practice.
+ */
+ for (int j = 0; j < n; j++)
+ RememberSyncRequest(&requests[j].ftag, requests[j].type);
+
+ END_CRIT_SECTION();
- END_CRIT_SECTION();
+ LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
+ }
+
+ CheckpointerShmem->num_requests = 0;
+ LWLockRelease(CheckpointerCommLock);
if (requests)
pfree(requests);
--
2.43.0
After doing some testing, I found a bug in the patch: if more requests were
pushed while the lock was released, num_requests must not simply be set to zero.
Here is a fixed version.
--
Best regards,
Maxim Orlov.
Attachments:
v2-0001-AbsorbSyncRequests-incrementally-instead-of-doing.patch (application/octet-stream)
From e1b47ed47508d2eeb3e3a2ca2424edb56eae4dec Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Thu, 27 Feb 2025 12:59:51 +0300
Subject: [PATCH v2] AbsorbSyncRequests incrementally, instead of doing it for
all requests at once.
---
src/backend/postmaster/checkpointer.c | 80 +++++++++++++++++++--------
1 file changed, 57 insertions(+), 23 deletions(-)
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 7acbbd3e26..bdb0de4176 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1321,42 +1321,76 @@ CompactCheckpointerRequestQueue(void)
void
AbsorbSyncRequests(void)
{
- CheckpointerRequest *requests = NULL;
- CheckpointerRequest *request;
- int n;
+ CheckpointerRequest *requests = NULL;
+ int num_requests,
+ max_requests;
if (!AmCheckpointerProcess())
return;
LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
-
+ num_requests = CheckpointerShmem->num_requests;
/*
- * We try to avoid holding the lock for a long time by copying the request
- * array, and processing the requests after releasing the lock.
- *
- * Once we have cleared the requests from shared memory, we have to PANIC
- * if we then fail to absorb them (eg, because our hashtable runs out of
- * memory). This is because the system cannot run safely if we are unable
- * to fsync what we have been told to fsync. Fortunately, the hashtable
- * is so small that the problem is quite unlikely to arise in practice.
+ * Note: this can be chosen arbitrarily, stick for 1Gb for now.
*/
- n = CheckpointerShmem->num_requests;
- if (n > 0)
+ max_requests = MaxAllocSize / sizeof(CheckpointerRequest);
+
+ for (int i = 0; i < num_requests; i += max_requests)
{
- requests = (CheckpointerRequest *) palloc(n * sizeof(CheckpointerRequest));
- memcpy(requests, CheckpointerShmem->requests, n * sizeof(CheckpointerRequest));
- }
+ Size n = (num_requests - i >= max_requests) ? max_requests :
+ num_requests - i;
- START_CRIT_SECTION();
+ if (!requests)
+ requests = palloc(n * sizeof(CheckpointerRequest));
- CheckpointerShmem->num_requests = 0;
+ memcpy(requests, CheckpointerShmem->requests + i,
+ n * sizeof(CheckpointerRequest));
- LWLockRelease(CheckpointerCommLock);
+ LWLockRelease(CheckpointerCommLock);
+
+ START_CRIT_SECTION();
+
+ /*
+ * We try to avoid holding the lock for a long time by copying the
+ * request array, and processing the requests after releasing the lock.
+ *
+ * Once we have cleared the requests from shared memory, we have to
+ * PANIC if we then fail to absorb them (eg, because our hashtable runs
+ * out of memory). This is because the system cannot run safely if we
+ * are unable to fsync what we have been told to fsync. Fortunately,
+ * the hashtable is so small that the problem is quite unlikely to arise
+ * in practice.
+ */
+ for (int j = 0; j < n; j++)
+ RememberSyncRequest(&requests[j].ftag, requests[j].type);
- for (request = requests; n > 0; request++, n--)
- RememberSyncRequest(&request->ftag, request->type);
+ END_CRIT_SECTION();
- END_CRIT_SECTION();
+ LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
+ }
+
+ if (CheckpointerShmem->num_requests >= num_requests)
+ {
+ /*
+ * All the requests that were added to the queue between the release and
+ * the lock acquisition must be processed.
+ */
+ Size n = CheckpointerShmem->num_requests - num_requests;
+
+ memmove(CheckpointerShmem->requests,
+ CheckpointerShmem->requests + num_requests,
+ n * sizeof(CheckpointerRequest));
+ CheckpointerShmem->num_requests = n;
+ }
+ else
+ {
+ /*
+ * CompactCheckpointerRequestQueue was called, found some duplicates and
+ * remove them.
+ */
+ }
+
+ LWLockRelease(CheckpointerCommLock);
if (requests)
pfree(requests);
--
2.48.1
Here is an alternative solution. We can limit the number of processed
requests so that they fit in a 1 GB memory chunk on each pass. Obviously,
this leaves some requests in the queue to be processed by the next call. We
can overcome this with multi-step processing: estimate the number of
steps at the beginning and process the requests repeatedly.
I'd like to hear your opinion on the subject.
--
Best regards,
Maxim Orlov.
Attachments:
v3-0001-Limit-AbsorbSyncRequests-to-1Gb-at-once.patch (application/octet-stream)
From a349b63467ddab60c7aadb8a2f1f5264efe5d165 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Fri, 28 Feb 2025 11:18:19 +0300
Subject: [PATCH v3] Limit AbsorbSyncRequests to 1Gb at once.
---
src/backend/postmaster/checkpointer.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 7acbbd3e26..3ca4f1fb68 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1339,8 +1339,12 @@ AbsorbSyncRequests(void)
* memory). This is because the system cannot run safely if we are unable
* to fsync what we have been told to fsync. Fortunately, the hashtable
* is so small that the problem is quite unlikely to arise in practice.
+ *
+ * Note: we could not palloc more than 1Gb of memory, thus make sure that
+ * the maximum number of elements will fit in the requests buffer.
*/
- n = CheckpointerShmem->num_requests;
+ n = Min(CheckpointerShmem->num_requests,
+ MaxAllocSize / sizeof(CheckpointerRequest));
if (n > 0)
{
requests = (CheckpointerRequest *) palloc(n * sizeof(CheckpointerRequest));
@@ -1349,7 +1353,11 @@ AbsorbSyncRequests(void)
START_CRIT_SECTION();
- CheckpointerShmem->num_requests = 0;
+ CheckpointerShmem->num_requests -= n;
+
+ memmove(CheckpointerShmem->requests,
+ CheckpointerShmem->requests + n,
+ CheckpointerShmem->num_requests * sizeof(CheckpointerRequest));
LWLockRelease(CheckpointerCommLock);
--
2.48.1
I think I figured it out. Here is v4.
If the requests total less than 1 GB, the algorithm stays the same
as before. If we need to process more, we do it incrementally in
slices of 1 GB.
Best regards,
Maxim Orlov.
Attachments:
v4-0001-Process-sync-requests-incrementally-in-AbsorbSync.patch (application/octet-stream)
From 0d1f0f67557679cb20d53096d12fa22de7760fe5 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Fri, 28 Feb 2025 17:06:01 +0300
Subject: [PATCH v4] Process sync requests incrementally in AbsorbSyncRequests
If the number of sync requests is big enough, the palloc() call in
AbsorbSyncRequests() will attempt to allocate more than 1 Gb of memory,
resulting in failure. This will lead to an infinite loop in the checkpointer
process.
So, we process requests incrementally with a buffer of not more than 1 Gb on
each step.
---
src/backend/postmaster/checkpointer.c | 63 +++++++++++++++++----------
1 file changed, 40 insertions(+), 23 deletions(-)
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 7acbbd3e26..de2e99844b 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1324,39 +1324,56 @@ AbsorbSyncRequests(void)
CheckpointerRequest *requests = NULL;
CheckpointerRequest *request;
int n;
+ bool loop;
if (!AmCheckpointerProcess())
return;
- LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
-
- /*
- * We try to avoid holding the lock for a long time by copying the request
- * array, and processing the requests after releasing the lock.
- *
- * Once we have cleared the requests from shared memory, we have to PANIC
- * if we then fail to absorb them (eg, because our hashtable runs out of
- * memory). This is because the system cannot run safely if we are unable
- * to fsync what we have been told to fsync. Fortunately, the hashtable
- * is so small that the problem is quite unlikely to arise in practice.
- */
- n = CheckpointerShmem->num_requests;
- if (n > 0)
+ do
{
- requests = (CheckpointerRequest *) palloc(n * sizeof(CheckpointerRequest));
- memcpy(requests, CheckpointerShmem->requests, n * sizeof(CheckpointerRequest));
- }
+ LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
- START_CRIT_SECTION();
+ /*
+ * We try to avoid holding the lock for a long time by copying the
+ * request array, and processing the requests after releasing the lock.
+ *
+ * Once we have cleared the requests from shared memory, we have to
+ * PANIC if we then fail to absorb them (eg, because our hashtable runs
+ * out of memory). This is because the system cannot run safely if we
+ * are unable to fsync what we have been told to fsync. Fortunately,
+ * the hashtable is so small that the problem is quite unlikely to arise
+ * in practice.
+ *
+ * Note: we could not palloc more than 1Gb of memory, thus make sure
+ * that the maximum number of elements will fit in the requests buffer.
+ */
+ n = Min(CheckpointerShmem->num_requests,
+ MaxAllocSize / sizeof(CheckpointerRequest));
+ if (n > 0)
+ {
+ if (!requests)
+ requests = (CheckpointerRequest *) palloc(n * sizeof(CheckpointerRequest));
+ memcpy(requests, CheckpointerShmem->requests, n * sizeof(CheckpointerRequest));
- CheckpointerShmem->num_requests = 0;
+ CheckpointerShmem->num_requests -= n;
- LWLockRelease(CheckpointerCommLock);
+ memmove(CheckpointerShmem->requests,
+ CheckpointerShmem->requests + n,
+ CheckpointerShmem->num_requests * sizeof(CheckpointerRequest));
+ }
+
+ START_CRIT_SECTION();
+
+ /* Are there any requests in the queue? If so, keep going. */
+ loop = CheckpointerShmem->num_requests != 0;
+
+ LWLockRelease(CheckpointerCommLock);
- for (request = requests; n > 0; request++, n--)
- RememberSyncRequest(&request->ftag, request->type);
+ for (request = requests; n > 0; request++, n--)
+ RememberSyncRequest(&request->ftag, request->type);
- END_CRIT_SECTION();
+ END_CRIT_SECTION();
+ } while (loop);
if (requests)
pfree(requests);
--
2.48.1
Hi,
The patch itself looks ok to me. I'm curious about the trade-offs between
this incremental approach and the alternative of
using palloc_extended() with the MCXT_ALLOC_HUGE flag. The approach of
splitting the requests into fixed-size slices avoids OOM failures or
process termination by the OOM killer, which is good. However, it does add
some overhead with additional lock acquisition/release cycles and memory
movement operations via memmove(). The natural question is whether the
added safety justifies the cost. Regarding the slice size of 1 GB, is it
derived from the MaxAllocSize limit, or was it chosen for other performance
reasons? Might a different size offer better performance under
typical workloads?
It would be helpful to know the reasoning behind these design decisions.
On Wed, 12 Mar 2025 at 10:27, Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi,
The patch itself looks ok to me. I'm curious about the trade-offs between
this incremental approach and the alternative of
using palloc_extended() with the MCXT_ALLOC_HUGE flag. The approach of
splitting the requests into fixed-size slices avoids OOM failures or
process termination by the OOM killer, which is good. However, it does add
some overhead with additional lock acquisition/release cycles and memory
movement operations via memmove(). The natural question is whether the
added safety justifies the cost. Regarding the slice size of 1 GB, is it
derived from the MaxAllocSize limit, or was it chosen for other performance
reasons? Might a different size offer better performance under
typical workloads?
I think 1 GB is derived purely from MaxAllocSize. This palloc() call is a
relatively old one, and no one expected the requests to exceed 1 GB in
total. Now we have the ability to set shared_buffers to a huge value
(leaving aside whether this makes any real sense), so this palloc limit
becomes a problem.
--
Best regards,
Maxim Orlov.
On 12/03/2025 13:00, Maxim Orlov wrote:
On Wed, 12 Mar 2025 at 10:27, Xuneng Zhou <xunengzhou@gmail.com> wrote:
The patch itself looks ok to me. I'm curious about the trade-offs
between this incremental approach and the alternative of
using palloc_extended() with the MCXT_ALLOC_HUGE flag. The approach
of splitting the requests into fixed-size slices avoids OOM
failures or process termination by the OOM killer, which is good.
+1
However, it does add some overhead with additional lock acquisition/
release cycles and memory movement operations via memmove(). The
natural question is whether the added safety justifies the cost.
Hmm, if you turned the array into a ring buffer, you could absorb part
of the queue without the memmove().
With that, I'd actually suggest using a much smaller allocation, maybe
10000 entries or even less. That should be enough to keep the lock
acquire/release overhead acceptable.
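To illustrate, here is a minimal standalone sketch of that ring-buffer idea (plain C; the capacity, batch size, and Request type are invented, and the real queue of course lives in shared memory behind CheckpointerCommLock): the consumer just advances head in small batches, so nothing needs to be memmove()'d and the temporary buffer stays small.

#include <stdio.h>

#define QUEUE_CAPACITY 16
#define BATCH_SIZE     4

typedef struct Request { int id; } Request;

typedef struct
{
    int     head;       /* index of the oldest request */
    int     tail;       /* index one past the newest request */
    int     num;        /* number of live requests */
    Request slots[QUEUE_CAPACITY];
} RequestQueue;

/* Copy up to BATCH_SIZE requests out of the queue; only head advances. */
static int
absorb_batch(RequestQueue *q, Request *out)
{
    int n = q->num < BATCH_SIZE ? q->num : BATCH_SIZE;

    for (int i = 0; i < n; i++)
    {
        out[i] = q->slots[q->head];
        q->head = (q->head + 1) % QUEUE_CAPACITY;
    }
    q->num -= n;
    return n;
}

int
main(void)
{
    RequestQueue q = {0};
    Request batch[BATCH_SIZE];
    int     n;

    /* enqueue ten requests, as ForwardSyncRequest() would */
    for (int i = 0; i < 10; i++)
    {
        q.slots[q.tail] = (Request) {i};
        q.tail = (q.tail + 1) % QUEUE_CAPACITY;
        q.num++;
    }

    /* drain in batches, as the checkpointer would between lock releases */
    while ((n = absorb_batch(&q, batch)) > 0)
        printf("absorbed %d requests, %d left\n", n, q.num);

    return 0;
}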
Regarding the slice size of 1 GB, is it derived from
the MaxAllocSize limit, or was it chosen for other performance
reasons? Might a different size offer better performance
under typical workloads?
I think 1 GB is derived purely from MaxAllocSize. This palloc() call is a
relatively old one, and no one expected the requests to exceed 1 GB in
total. Now we have the ability to set shared_buffers to a huge value
(leaving aside whether this makes any real sense), so this palloc limit
becomes a problem.
Yeah. Frankly even 1 GB seems excessive for this. We set max_requests =
NBuffers, which makes some sense so that you can fit the worst case
scenario of quickly evicting all pages from the buffer cache. But even
then I'd expect the checkpointer to be able to absorb the changes before
the queue fills up. And we have the compaction logic now too.
So I wonder if we should cap max_requests at, say, 10 million requests now?
CompactCheckpointerRequestQueue() makes a pretty large allocation too,
BTW, although much smaller than the one in AbsorbSyncRequests(). The
hash table it builds could grow quite large though.
--
Heikki Linnakangas
Neon (https://neon.tech)
Hi,
I've updated and rebased Maxim's patch version 4 with the optimizations
proposed by Heikki, namely the ring buffer and the cap on max_requests.
Here is a summary of the changes in v5:
• Replaced the linear array model in AbsorbSyncRequests() with a bounded
ring buffer to avoid large memmove() operations when processing sync
requests.
• Introduced a tunable batch size (CKPT_REQ_BATCH_SIZE, default: 10,000)
to incrementally absorb requests while respecting MaxAllocSize.
• Capped max_requests with a hard upper bound (MAX_CHECKPOINT_REQUESTS =
10,000,000) to avoid pathological memory growth when shared_buffers is very
large.
• Updated CompactCheckpointerRequestQueue() to support the ring buffer
layout.
• Retained the existing compacting logic but modified it to operate
in-place over the ring.
The patch itself looks ok to me. I'm curious about the trade-offs
between this incremental approach and the alternative of
using palloc_extended() with the MCXT_ALLOC_HUGE flag. The approach
of splitting the requests into fixed-size slices avoids OOM
failures or process termination by the OOM killer, which is good.
+1
However, it does add some overhead with additional lock acquisition/
release cycles and memory movement operations via memmove(). The
natural question is whether the added safety justifies the cost.
Hmm, if you turned the array into a ring buffer, you could absorb part
of the queue without the memmove().
With that, I'd actually suggest using a much smaller allocation, maybe
10000 entries or even less. That should be enough to keep the lock
acquire/release overhead acceptable
Regarding the slice size of 1 GB, is it derived from
the MaxAllocSize limit, or was it chosen for other performance
reasons? Might a different size offer better performance
under typical workloads?
I think 1 GB is derived purely from MaxAllocSize. This palloc() call is a
relatively old one, and no one expected the requests to exceed 1 GB in
total. Now we have the ability to set shared_buffers to a huge value
(leaving aside whether this makes any real sense), so this palloc limit
becomes a problem.
Yeah. Frankly even 1 GB seems excessive for this. We set max_requests =
NBuffers, which makes some sense so that you can fit the worst case
scenario of quickly evicting all pages from the buffer cache. But even
then I'd expect the checkpointer to be able to absorb the changes before
the queue fills up. And we have the compaction logic now too.
So I wonder if we should cap max_requests at, say, 10 million requests now?
CompactCheckpointerRequestQueue() makes a pretty large allocation too,
BTW, although much smaller than the one in AbsorbSyncRequests(). The
hash table it builds could grow quite large though.
I'm not certain what the optimal cap for max_requests should be, whether
it's 10 million (setting it to 10 million would result in a peak temporary
allocation of roughly 700 MB within CompactCheckpointerRequestQueue), 1
million, or some other value, but introducing an upper bound seems
important. Without a cap, max_requests could theoretically scale with
NBuffers, potentially resulting in excessive temporary memory allocations.
As this is my first patch submission, it might be somewhat naive and
experimental; I appreciate your patience and feedback.
Attachments:
v5-0001-Process-sync-requests-incrementally-in-AbsorbSyncReq.patch (application/octet-stream)
From 6ecd080e9b8e735645b32ac42a65f37d751af599 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Thu, 10 Apr 2025 21:39:29 +0800
Subject: [PATCH v5] Process sync requests incrementally in AbsorbSyncRequests
If the number of sync requests is big enough, the palloc() call in
AbsorbSyncRequests() will attempt to allocate more than 1 GB of memory,
resulting in failure. This can lead to an infinite loop in the checkpointer
process, as it repeatedly fails to absorb the pending requests.
To avoid this, we process requests incrementally in batches. This patch
introduces a bounded ring buffer for storing checkpointer sync requests,
ensuring that memory usage stays within a safe range and avoiding
excessive allocations.
Original author: Maxim Orlov <orlovmg@gmail.com>
Patch updated and revised by: Xuneng Zhou <xunengzhou@gmail.com>
---
src/backend/postmaster/checkpointer.c | 152 +++++++++++++++++++-------
1 file changed, 111 insertions(+), 41 deletions(-)
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index d3cb3f1891c..914454d909e 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -127,6 +127,11 @@ typedef struct
int num_requests; /* current # of requests */
int max_requests; /* allocated array size */
+
+ int head; /* Index of the first request in the ring
+ * buffer */
+ int tail; /* Index of the last request in the ring
+ * buffer */
CheckpointerRequest requests[FLEXIBLE_ARRAY_MEMBER];
} CheckpointerShmemStruct;
@@ -135,6 +140,12 @@ static CheckpointerShmemStruct *CheckpointerShmem;
/* interval for calling AbsorbSyncRequests in CheckpointWriteDelay */
#define WRITES_PER_ABSORB 1000
+/* Maximum number of checkpointer requests to process in one batch */
+#define CKPT_REQ_BATCH_SIZE 10000
+
+/* Max number of requests the checkpointer request queue can hold */
+#define MAX_CHECKPOINT_REQUESTS 10000000
+
/*
* GUC parameters
*/
@@ -974,7 +985,8 @@ CheckpointerShmemInit(void)
*/
MemSet(CheckpointerShmem, 0, size);
SpinLockInit(&CheckpointerShmem->ckpt_lck);
- CheckpointerShmem->max_requests = NBuffers;
+ CheckpointerShmem->max_requests = Min(NBuffers, MAX_CHECKPOINT_REQUESTS);
+ CheckpointerShmem->head = CheckpointerShmem->tail = 0;
ConditionVariableInit(&CheckpointerShmem->start_cv);
ConditionVariableInit(&CheckpointerShmem->done_cv);
}
@@ -1152,6 +1164,7 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
{
CheckpointerRequest *request;
bool too_full;
+ int insert_pos;
if (!IsUnderPostmaster)
return false; /* probably shouldn't even get here */
@@ -1175,10 +1188,14 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
}
/* OK, insert request */
- request = &CheckpointerShmem->requests[CheckpointerShmem->num_requests++];
+ insert_pos = CheckpointerShmem->tail;
+ request = &CheckpointerShmem->requests[insert_pos];
request->ftag = *ftag;
request->type = type;
+ CheckpointerShmem->tail = (CheckpointerShmem->tail + 1) % CheckpointerShmem->max_requests;
+ CheckpointerShmem->num_requests++;
+
/* If queue is more than half full, nudge the checkpointer to empty it */
too_full = (CheckpointerShmem->num_requests >=
CheckpointerShmem->max_requests / 2);
@@ -1220,12 +1237,17 @@ CompactCheckpointerRequestQueue(void)
struct CheckpointerSlotMapping
{
CheckpointerRequest request;
- int slot;
+ int ring_idx;
};
- int n,
- preserve_count;
+ int n;
int num_skipped = 0;
+ int head,
+ tail;
+ int max_requests;
+ int num_requests;
+ int read_idx,
+ write_idx;
HASHCTL ctl;
HTAB *htab;
bool *skip_slot;
@@ -1237,8 +1259,14 @@ CompactCheckpointerRequestQueue(void)
if (CritSectionCount > 0)
return false;
+ max_requests = CheckpointerShmem->max_requests;
+ num_requests = CheckpointerShmem->num_requests;
+
/* Initialize skip_slot array */
- skip_slot = palloc0(sizeof(bool) * CheckpointerShmem->num_requests);
+ skip_slot = palloc0(sizeof(bool) * max_requests);
+
+ head = CheckpointerShmem->head;
+ tail = CheckpointerShmem->tail;
/* Initialize temporary hash table */
ctl.keysize = sizeof(CheckpointerRequest);
@@ -1262,7 +1290,8 @@ CompactCheckpointerRequestQueue(void)
* away preceding entries that would end up being canceled anyhow), but
* it's not clear that the extra complexity would buy us anything.
*/
- for (n = 0; n < CheckpointerShmem->num_requests; n++)
+ read_idx = head;
+ for (n = 0; n < num_requests; n++)
{
CheckpointerRequest *request;
struct CheckpointerSlotMapping *slotmap;
@@ -1275,16 +1304,19 @@ CompactCheckpointerRequestQueue(void)
* CheckpointerShmemInit. Note also that RelFileLocator had better
* contain no pad bytes.
*/
- request = &CheckpointerShmem->requests[n];
+ request = &CheckpointerShmem->requests[read_idx];
slotmap = hash_search(htab, request, HASH_ENTER, &found);
if (found)
{
/* Duplicate, so mark the previous occurrence as skippable */
- skip_slot[slotmap->slot] = true;
+ skip_slot[slotmap->ring_idx] = true;
num_skipped++;
}
/* Remember slot containing latest occurrence of this request value */
- slotmap->slot = n;
+ slotmap->ring_idx = read_idx;
+
+ /* Move to the next request in the ring buffer */
+ read_idx = (read_idx + 1) % max_requests;
}
/* Done with the hash table. */
@@ -1298,17 +1330,34 @@ CompactCheckpointerRequestQueue(void)
}
/* We found some duplicates; remove them. */
- preserve_count = 0;
- for (n = 0; n < CheckpointerShmem->num_requests; n++)
+ read_idx = write_idx = head;
+ for (n = 0; n < num_requests; n++)
{
- if (skip_slot[n])
- continue;
- CheckpointerShmem->requests[preserve_count++] = CheckpointerShmem->requests[n];
+ /* If this slot is NOT skipped, keep it */
+ if (!skip_slot[read_idx])
+ {
+ /* If the read and write positions are different, copy the request */
+ if (write_idx != read_idx)
+ CheckpointerShmem->requests[write_idx] =
+ CheckpointerShmem->requests[read_idx];
+
+ /* Advance the write position */
+ write_idx = (write_idx + 1) % max_requests;
+ }
+
+ read_idx = (read_idx + 1) % max_requests;
}
+
+ /*
+ * Update ring buffer state: head remains the same, tail moves, count
+ * decreases
+ */
+ CheckpointerShmem->tail = write_idx;
+ CheckpointerShmem->num_requests -= num_skipped;
+
ereport(DEBUG1,
(errmsg_internal("compacted fsync request queue from %d entries to %d entries",
- CheckpointerShmem->num_requests, preserve_count)));
- CheckpointerShmem->num_requests = preserve_count;
+ num_requests, CheckpointerShmem->num_requests)));
/* Cleanup. */
pfree(skip_slot);
@@ -1329,40 +1378,61 @@ AbsorbSyncRequests(void)
{
CheckpointerRequest *requests = NULL;
CheckpointerRequest *request;
- int n;
+ int n,
+ i;
+ bool loop;
if (!AmCheckpointerProcess())
return;
- LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
-
- /*
- * We try to avoid holding the lock for a long time by copying the request
- * array, and processing the requests after releasing the lock.
- *
- * Once we have cleared the requests from shared memory, we have to PANIC
- * if we then fail to absorb them (eg, because our hashtable runs out of
- * memory). This is because the system cannot run safely if we are unable
- * to fsync what we have been told to fsync. Fortunately, the hashtable
- * is so small that the problem is quite unlikely to arise in practice.
- */
- n = CheckpointerShmem->num_requests;
- if (n > 0)
+ do
{
- requests = (CheckpointerRequest *) palloc(n * sizeof(CheckpointerRequest));
- memcpy(requests, CheckpointerShmem->requests, n * sizeof(CheckpointerRequest));
- }
+ LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
+
+ /*
+ * We try to avoid holding the lock for a long time by copying the
+ * request array, and processing the requests after releasing the
+ * lock.
+ *
+ * Once we have cleared the requests from shared memory, we have to
+ * PANIC if we then fail to absorb them (eg, because our hashtable
+ * runs out of memory). This is because the system cannot run safely
+ * if we are unable to fsync what we have been told to fsync.
+ * Fortunately, the hashtable is so small that the problem is quite
+ * unlikely to arise in practice.
+ *
+ * Note: we could not palloc more than 1Gb of memory, thus make sure
+ * that the maximum number of elements will fit in the requests
+ * buffer.
+ */
+ n = Min(CheckpointerShmem->num_requests, CKPT_REQ_BATCH_SIZE);
+ if (n > 0)
+ {
+ if (!requests)
+ requests = (CheckpointerRequest *) palloc(n * sizeof(CheckpointerRequest));
- START_CRIT_SECTION();
+ for (i = 0; i < n; i++)
+ {
+ requests[i] = CheckpointerShmem->requests[CheckpointerShmem->head];
+ CheckpointerShmem->head = (CheckpointerShmem->head + 1) % CheckpointerShmem->max_requests;
+ }
- CheckpointerShmem->num_requests = 0;
+ CheckpointerShmem->num_requests -= n;
- LWLockRelease(CheckpointerCommLock);
+ }
+
+ START_CRIT_SECTION();
+
+ /* Are there any requests in the queue? If so, keep going. */
+ loop = CheckpointerShmem->num_requests != 0;
+
+ LWLockRelease(CheckpointerCommLock);
- for (request = requests; n > 0; request++, n--)
- RememberSyncRequest(&request->ftag, request->type);
+ for (request = requests; n > 0; request++, n--)
+ RememberSyncRequest(&request->ftag, request->type);
- END_CRIT_SECTION();
+ END_CRIT_SECTION();
+ } while (loop);
if (requests)
pfree(requests);
--
2.48.1
Hi,
I've moved this entry to the July CommitFest. The CI reported an unused
variable warning in patch v5, so I've addressed it by removing the unused
one. Sorry for not catching the issue locally.
Attachments:
v6-0001-Process-sync-requests-incrementally-in-AbsorbSyncReq.patch (application/octet-stream)
From 6ecd080e9b8e735645b32ac42a65f37d751af599 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Thu, 10 Apr 2025 21:39:29 +0800
Subject: [PATCH v5] Process sync requests incrementally in AbsorbSyncRequests
If the number of sync requests is big enough, the palloc() call in
AbsorbSyncRequests() will attempt to allocate more than 1 GB of memory,
resulting in failure. This can lead to an infinite loop in the checkpointer
process, as it repeatedly fails to absorb the pending requests.
To avoid this, we process requests incrementally in batches. This patch
introduces a bounded ring buffer for storing checkpointer sync requests,
ensuring that memory usage stays within a safe range and avoiding
excessive allocations.
Original author: Maxim Orlov <orlovmg@gmail.com>
Patch updated and revised by: Xuneng Zhou <xunengzhou@gmail.com>
---
src/backend/postmaster/checkpointer.c | 150 ++++++++++++++++++-------
1 file changed, 109 insertions(+), 41 deletions(-)
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index d3cb3f1891c..914454d909e 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -127,6 +127,11 @@ typedef struct
int num_requests; /* current # of requests */
int max_requests; /* allocated array size */
+
+ int head; /* Index of the first request in the ring
+ * buffer */
+ int tail; /* Index of the last request in the ring
+ * buffer */
CheckpointerRequest requests[FLEXIBLE_ARRAY_MEMBER];
} CheckpointerShmemStruct;
@@ -135,6 +140,12 @@ static CheckpointerShmemStruct *CheckpointerShmem;
/* interval for calling AbsorbSyncRequests in CheckpointWriteDelay */
#define WRITES_PER_ABSORB 1000
+/* Maximum number of checkpointer requests to process in one batch */
+#define CKPT_REQ_BATCH_SIZE 10000
+
+/* Max number of requests the checkpointer request queue can hold */
+#define MAX_CHECKPOINT_REQUESTS 10000000
+
/*
* GUC parameters
*/
@@ -974,7 +985,8 @@ CheckpointerShmemInit(void)
*/
MemSet(CheckpointerShmem, 0, size);
SpinLockInit(&CheckpointerShmem->ckpt_lck);
- CheckpointerShmem->max_requests = NBuffers;
+ CheckpointerShmem->max_requests = Min(NBuffers, MAX_CHECKPOINT_REQUESTS);
+ CheckpointerShmem->head = CheckpointerShmem->tail = 0;
ConditionVariableInit(&CheckpointerShmem->start_cv);
ConditionVariableInit(&CheckpointerShmem->done_cv);
}
@@ -1152,6 +1164,7 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
{
CheckpointerRequest *request;
bool too_full;
+ int insert_pos;
if (!IsUnderPostmaster)
return false; /* probably shouldn't even get here */
@@ -1175,10 +1188,14 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
}
/* OK, insert request */
- request = &CheckpointerShmem->requests[CheckpointerShmem->num_requests++];
+ insert_pos = CheckpointerShmem->tail;
+ request = &CheckpointerShmem->requests[insert_pos];
request->ftag = *ftag;
request->type = type;
+ CheckpointerShmem->tail = (CheckpointerShmem->tail + 1) % CheckpointerShmem->max_requests;
+ CheckpointerShmem->num_requests++;
+
/* If queue is more than half full, nudge the checkpointer to empty it */
too_full = (CheckpointerShmem->num_requests >=
CheckpointerShmem->max_requests / 2);
@@ -1220,12 +1237,16 @@ CompactCheckpointerRequestQueue(void)
struct CheckpointerSlotMapping
{
CheckpointerRequest request;
- int slot;
+ int ring_idx;
};
- int n,
- preserve_count;
+ int n;
int num_skipped = 0;
+ int head;
+ int max_requests;
+ int num_requests;
+ int read_idx,
+ write_idx;
HASHCTL ctl;
HTAB *htab;
bool *skip_slot;
@@ -1237,8 +1258,13 @@ CompactCheckpointerRequestQueue(void)
if (CritSectionCount > 0)
return false;
+ max_requests = CheckpointerShmem->max_requests;
+ num_requests = CheckpointerShmem->num_requests;
+
/* Initialize skip_slot array */
- skip_slot = palloc0(sizeof(bool) * CheckpointerShmem->num_requests);
+ skip_slot = palloc0(sizeof(bool) * max_requests);
+
+ head = CheckpointerShmem->head;
/* Initialize temporary hash table */
ctl.keysize = sizeof(CheckpointerRequest);
@@ -1262,7 +1288,8 @@ CompactCheckpointerRequestQueue(void)
* away preceding entries that would end up being canceled anyhow), but
* it's not clear that the extra complexity would buy us anything.
*/
- for (n = 0; n < CheckpointerShmem->num_requests; n++)
+ read_idx = head;
+ for (n = 0; n < num_requests; n++)
{
CheckpointerRequest *request;
struct CheckpointerSlotMapping *slotmap;
@@ -1275,16 +1302,19 @@ CompactCheckpointerRequestQueue(void)
* CheckpointerShmemInit. Note also that RelFileLocator had better
* contain no pad bytes.
*/
- request = &CheckpointerShmem->requests[n];
+ request = &CheckpointerShmem->requests[read_idx];
slotmap = hash_search(htab, request, HASH_ENTER, &found);
if (found)
{
/* Duplicate, so mark the previous occurrence as skippable */
- skip_slot[slotmap->slot] = true;
+ skip_slot[slotmap->ring_idx] = true;
num_skipped++;
}
/* Remember slot containing latest occurrence of this request value */
- slotmap->slot = n;
+ slotmap->ring_idx = read_idx;
+
+ /* Move to the next request in the ring buffer */
+ read_idx = (read_idx + 1) % max_requests;
}
/* Done with the hash table. */
@@ -1298,17 +1328,34 @@ CompactCheckpointerRequestQueue(void)
}
/* We found some duplicates; remove them. */
- preserve_count = 0;
- for (n = 0; n < CheckpointerShmem->num_requests; n++)
+ read_idx = write_idx = head;
+ for (n = 0; n < num_requests; n++)
{
- if (skip_slot[n])
- continue;
- CheckpointerShmem->requests[preserve_count++] = CheckpointerShmem->requests[n];
+ /* If this slot is NOT skipped, keep it */
+ if (!skip_slot[read_idx])
+ {
+ /* If the read and write positions are different, copy the request */
+ if (write_idx != read_idx)
+ CheckpointerShmem->requests[write_idx] =
+ CheckpointerShmem->requests[read_idx];
+
+ /* Advance the write position */
+ write_idx = (write_idx + 1) % max_requests;
+ }
+
+ read_idx = (read_idx + 1) % max_requests;
}
+
+ /*
+ * Update ring buffer state: head remains the same, tail moves, count
+ * decreases
+ */
+ CheckpointerShmem->tail = write_idx;
+ CheckpointerShmem->num_requests -= num_skipped;
+
ereport(DEBUG1,
(errmsg_internal("compacted fsync request queue from %d entries to %d entries",
- CheckpointerShmem->num_requests, preserve_count)));
- CheckpointerShmem->num_requests = preserve_count;
+ num_requests, CheckpointerShmem->num_requests)));
/* Cleanup. */
pfree(skip_slot);
@@ -1329,40 +1376,61 @@ AbsorbSyncRequests(void)
{
CheckpointerRequest *requests = NULL;
CheckpointerRequest *request;
- int n;
+ int n,
+ i;
+ bool loop;
if (!AmCheckpointerProcess())
return;
- LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
-
- /*
- * We try to avoid holding the lock for a long time by copying the request
- * array, and processing the requests after releasing the lock.
- *
- * Once we have cleared the requests from shared memory, we have to PANIC
- * if we then fail to absorb them (eg, because our hashtable runs out of
- * memory). This is because the system cannot run safely if we are unable
- * to fsync what we have been told to fsync. Fortunately, the hashtable
- * is so small that the problem is quite unlikely to arise in practice.
- */
- n = CheckpointerShmem->num_requests;
- if (n > 0)
+ do
{
- requests = (CheckpointerRequest *) palloc(n * sizeof(CheckpointerRequest));
- memcpy(requests, CheckpointerShmem->requests, n * sizeof(CheckpointerRequest));
- }
+ LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
+
+ /*
+ * We try to avoid holding the lock for a long time by copying the
+ * request array, and processing the requests after releasing the
+ * lock.
+ *
+ * Once we have cleared the requests from shared memory, we have to
+ * PANIC if we then fail to absorb them (eg, because our hashtable
+ * runs out of memory). This is because the system cannot run safely
+ * if we are unable to fsync what we have been told to fsync.
+ * Fortunately, the hashtable is so small that the problem is quite
+ * unlikely to arise in practice.
+ *
+ * Note: we could not palloc more than 1Gb of memory, thus make sure
+ * that the maximum number of elements will fit in the requests
+ * buffer.
+ */
+ n = Min(CheckpointerShmem->num_requests, CKPT_REQ_BATCH_SIZE);
+ if (n > 0)
+ {
+ if (!requests)
+ requests = (CheckpointerRequest *) palloc(n * sizeof(CheckpointerRequest));
+
+ for (i = 0; i < n; i++)
+ {
+ requests[i] = CheckpointerShmem->requests[CheckpointerShmem->head];
+ CheckpointerShmem->head = (CheckpointerShmem->head + 1) % CheckpointerShmem->max_requests;
+ }
+
+ CheckpointerShmem->num_requests -= n;
+
+ }
+
+ START_CRIT_SECTION();
+
+ /* Are there any requests in the queue? If so, keep going. */
+ loop = CheckpointerShmem->num_requests != 0;
+
+ LWLockRelease(CheckpointerCommLock);
+
+ for (request = requests; n > 0; request++, n--)
+ RememberSyncRequest(&request->ftag, request->type);
+
+ END_CRIT_SECTION();
+ } while (loop);
- START_CRIT_SECTION();
-
- CheckpointerShmem->num_requests = 0;
-
- LWLockRelease(CheckpointerCommLock);
-
- for (request = requests; n > 0; request++, n--)
- RememberSyncRequest(&request->ftag, request->type);
-
- END_CRIT_SECTION();
-
if (requests)
pfree(requests);
--
2.48.1
Hi, Xuneng Zhou!
On Tue, Apr 15, 2025 at 7:02 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
I've moved this entry to the July CommitFest. The CI reported an unused variable warning in patch v5, so I've addressed it by removing the unused one. Sorry for not catching the issue locally.
Thank you for your work on this subject!
I have a few notes about that:
1) Should we make CompactCheckpointerRequestQueue() process the queue
of checkpoint requests in smaller parts for the same reason we do this
in AbsorbSyncRequests()? That would require significant redesign of
the algorithm, but still.
2) That's pretty independent of the changes in the patch, but should
CompactCheckpointerRequestQueue() fill the gaps with entries from the
tail instead of rewriting the whole queue? That might be a bit
faster.
3) For sure, we wouldn't backpatch this. Can we prepare some simple
solution for back branches? Perhaps, just introduction of
MAX_CHECKPOINT_REQUESTS is enough to save us from allocations larger
than 1GB.
------
Regards,
Alexander Korotkov
Supabase
Hi Alexander,
Thanks a lot for reviewing!
I have few notes about that:
1) Should we make CompactCheckpointerRequestQueue() process the queue
of checkpoint requests in smaller parts for the same reason we do this
in AbsorbSyncRequests()? That would require significant redesign of
the algorithm, but still.
In AbsorbSyncRequests, we process requests incrementally in batches to
avoid allocating more than 1 GB of memory, which would lead to
repeated failure. I think this is less concerning in
CompactCheckpointerRequestQueue, because if we cap num_requests at 10
million, the hash table peaks at ~500 MB and skip_slot[] at ~10
MB, both under 1 GB (see the rough calculation sketched below).
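As a rough sanity check of those figures, a trivial back-of-the-envelope calculation lands in the same ballpark; the per-request and per-hash-entry sizes below are assumptions (the real sizeof(CheckpointerRequest) and dynahash overhead depend on the platform and build), not measured values.

#include <stdio.h>

int
main(void)
{
    const long  nreq = 10L * 1000 * 1000;   /* proposed MAX_CHECKPOINT_REQUESTS */
    const long  req_size = 40;              /* assumed sizeof(CheckpointerRequest) */
    const long  hash_entry = 48;            /* assumed hash entry size incl. overhead */
    const long  mb = 1024L * 1024;

    printf("request copy buffer : ~%ld MB\n", nreq * req_size / mb);
    printf("compaction hash     : ~%ld MB\n", nreq * hash_entry / mb);
    printf("skip_slot[] (bools) : ~%ld MB\n", nreq / mb);
    return 0;
}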
2) That's pretty independent of the changes in the patch, but should
CompactCheckpointerRequestQueue() fill the gaps with entries from the
tail instead of rewriting the whole queue? That might be a bit
faster.
This optimization would be quite helpful for compacting large queues.
For small ones, it may also add extra costs. Can we use a hybrid
approach? If it's independent, should we create a standalone patch for
it?
3) For sure, we wouldn't backpatch this. Can we prepare some simple
solution for back branches? Perhaps, just introduction of
MAX_CHECKPOINT_REQUESTS is enough to save us from allocations larger
than 1GB.
I think this would work well for back branches.
On Tue, Jun 3, 2025 at 1:16 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Thanks a lot for reviewing!
I have few notes about that:
1) Should we make CompactCheckpointerRequestQueue() process the queue
of checkpoint requests in smaller parts for the same reason we do this
in AbsorbSyncRequests()? That would require significant redesign of
the algorithm, but still.
In AbsorbSyncRequests, we process requests incrementally in batches to
avoid allocating more than 1 GB of memory, which would lead to
repeated failure. I think this is less concerning in
CompactCheckpointerRequestQueue, because if we cap num_requests at 10
million, the hash table peaks at ~500 MB and skip_slot[] at ~10
MB, both under 1 GB.
Right, but another point is to avoid lengthy holding of
CheckpointerCommLock. What do you think about that?
2) That's pretty independent of the changes in the patch, but should
CompactCheckpointerRequestQueue() fill the gaps with entries from the
tail instead of rewriting the whole queue? That might be a bit
faster.
This optimization would be quite helpful for compacting large queues.
For small ones, it may also add extra costs. Can we use a hybrid
approach? If it's independent, should we create a standalone patch for
it?
Why do you think this would add extra cost for small queues? Given
we're basically rewriting the algorithm of
CompactCheckpointerRequestQueue(), I think it's OK to integrate this
into one patch.
3) For sure, we wouldn't backpatch this. Can we prepare some simple
solution for back branches? Perhaps, just introduction of
MAX_CHECKPOINT_REQUESTS is enough to save us from allocations larger
than 1GB.
I think this would work well for back branches.
OK.
------
Regards,
Alexander Korotkov
Supabase
Hi all,
Sorry, I forgot to Cc on my previous message. Resending here so it's
on the thread:
On Wed, Jun 4, 2025 at 11:07 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi Alexander,
Thanks again for the feedback!
1) Batch-processing CompactCheckpointerRequestQueue() and AbsorbSyncRequests()?
After some thought, I realized my previous take was incomplete; sorry
for the confusion. Heikki suggested capping num_requests at 10 million
[1]. With that limit, the largest hash table is ~500 MB and the
skip_slot[] array is ~10 MB in CompactCheckpointerRequestQueue, and the
maximum size of the request array in AbsorbSyncRequests is well under 400 MB,
so we never exceed 1 GB. Even without batching, compaction stays under
the cap. Batching in AbsorbSyncRequests may still help by amortizing
memory allocation, but it adds extra lock/unlock overhead. Not sure if
that overhead is worth it under the cap.
Of course, all of this depends on having a cap in place. Picking the
right cap size can be tricky (see point 2). If we decide not to
enforce a cap now or in future versions, then batching both
CompactCheckpointerRequestQueue (maybe?) and AbsorbSyncRequests becomes
essential. We also need to consider the batch size (Heikki suggested 10
k for AbsorbSyncRequests), but I'm not sure whether that suits typical
or extreme workloads.
Right, but another point is to avoid lengthy holding of
CheckpointerCommLock. What do you think about that?
I am not clear on this. Could you elaborate on it?
[1] /messages/by-id/c1993b75-a5bc-42fd-bbf1-6f06a1b37107@iki.fi
2) Back-branch fixes with MAX_CHECKPOINT_REQUESTS?
This is simple and effective, but it can be hard to get the value right.
I think we should think more about it. For very large-scale use cases,
like hundreds of GB of shared_buffers, 10 million seems small if the
checkpointer is not able to absorb the changes before the queue fills
up. In this case, making compaction more efficient as in 3) would be
helpful. However, if we do this for the back branches as well, the solution
is not that simple any more.
3) Fill gaps by pulling from the tail instead of rewriting the whole queue?
I misunderstood at first; this is a generally helpful optimization.
I'll integrate it into the current patch.
Hi,
Thanks for the feedback!
I think it would be a good starting point to use the same batch size for
CompactCheckpointerRequestQueue() and AbsorbSyncRequests().
So we'll keep both batch processing and the request cap in place for now.
Right, but another point is to avoid lengthy holding of
CheckpointerCommLock. What do you think about that?
I am not clear on this. Could you elaborate on it?
See [1] for a more detailed description of this.
Links.
1. /messages/by-id/db4534f83a22a29ab5ee2566ad86ca92@postgrespro.ru
I read the thread but didn’t find a specific explanation of avoiding
long lock holds. My understanding is: when compaction processes a very
large queue in one go, it holds CheckpointerCommLock the entire time,
blocking all ForwardSyncRequest callers. Batch processing would
release the lock after each chunk, allowing other backends to make
progress. Is that correct?
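If it helps, this is the pattern I have in mind, as a standalone sketch (a pthread mutex stands in for CheckpointerCommLock and a plain integer array stands in for the shared request queue; none of this is the real checkpointer code): the lock is dropped after each copied chunk, and the slow per-request work happens unlocked.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

#define BATCH_SIZE 4

static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static int  queue[32];
static int  queue_head = 0;
static int  queue_len = 25;         /* pretend 25 requests are pending */

static void
process_request(int req)
{
    (void) req;                     /* stand-in for RememberSyncRequest() */
}

static void
drain_queue_in_batches(void)
{
    bool    more;

    do
    {
        int     batch[BATCH_SIZE];
        int     n;

        pthread_mutex_lock(&queue_lock);
        n = queue_len < BATCH_SIZE ? queue_len : BATCH_SIZE;
        for (int i = 0; i < n; i++)
            batch[i] = queue[queue_head + i];
        queue_head += n;
        queue_len -= n;
        more = queue_len > 0;
        pthread_mutex_unlock(&queue_lock);  /* other backends can enqueue here */

        /* slow per-request work happens without the lock held */
        for (int i = 0; i < n; i++)
            process_request(batch[i]);
    } while (more);
}

int
main(void)
{
    drain_queue_in_batches();
    printf("queue drained, %d requests left\n", queue_len);
    return 0;
}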
Hi,
3) Fill gaps by pulling from the tail instead of rewriting the whole
queue?
I misunderstood at first—this is a generally helpful optimization.
I'll integrate it into the current patch.
Great, thank you.
I dug deeper into the “fill gaps from the tail” optimization and
implemented a version of it. The tricky part is not the copy itself but
guaranteeing that the queue ends up hole-free and that tail really points
at the slot after the last live request. With a twin-cursor gap-fill we
refuse to move SYNC_FORGET_REQUEST / SYNC_FILTER_REQUEST (they’re
order-sensitive fences).
If the final survivor is one of those barriers, the cursors meet while a
hole still exists immediately before the barrier:
head → A [hole] FILTER(X) …unused…
If we then compute tail = (head + remaining_requests) % max_requests, the
value lands inside the live region (on the barrier itself). The invariant
(head + num_requests) % max_requests == tail is broken, so the next enqueue
overwrites live data or the checkpointer under-scans the queue.
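To put concrete numbers on it (a made-up example): with max_requests = 8,
head = 0, and a queue of A, B, FILTER(X) where B turns out to be removable,
the twin-cursor pass leaves A in slot 0, a hole in slot 1, and FILTER(X)
untouched in slot 2. remaining_requests is then 2, so
tail = (0 + 2) % 8 = 2, which is the slot still occupied by FILTER(X), and
the next ForwardSyncRequest() would write over the barrier.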
Alternatively, we may allow relocating SYNC_FORGET_REQUEST and
SYNC_FILTER_REQUEST entries, but ensuring their ordering semantics remain
correct would be quite challenging. That concern is why the implementation
uses a forward-scan compaction. As the source comment noted:
/*
* The basic idea here is that a request can be skipped if it's followed
* by a later, identical request. It might seem more sensible to work
* backwards from the end of the queue and check whether a request is
* *preceded* by an earlier, identical request, in the hopes of doing less
* copying. But that might change the semantics, if there's an
* intervening SYNC_FORGET_REQUEST or SYNC_FILTER_REQUEST, so we do it
* this way.
Best,
Xuneng
Hi,
Patch v7 modifies CompactCheckpointerRequestQueue() to process requests
incrementally in batches of CKPT_REQ_BATCH_SIZE (10,000), similar to the
approach used in AbsorbSyncRequests(). This limits memory usage from
O(num_requests) to O(batch_size) for both hash tables and skip arrays.
- Hash table memory bounded by batch size regardless of total queue size
- Skip array allocation limited to batch size instead of max_requests
- Prevents potential OOM conditions with very large request queues
Trade-offs
Cross-batch duplicate detection: The incremental approach won't detect
duplicates spanning batch boundaries. This limitation seems acceptable
since:
- The main issue to be addressed is preventing memory allocation
failures
- Remaining duplicates are still handled by the RememberSyncRequest()
function in the sync subsystem
- The purpose of this function is to make some room for new requests,
not to remove all duplicates.
Lock holding duration
Andres pointed out [1] that compacting a very large queue takes considerable
time, and holding the exclusive lock for an extended period makes it much
more likely that backends will have to perform syncs themselves - which is
exactly what CompactCheckpointerRequestQueue() is trying to avoid in the
first place. However, releasing the lock between batches would introduce
race conditions that would make the design much more complex. Given that
the primary goal of this patch is to avoid large memory allocations, I keep
the lock held for the whole function for simplicity now.
[1] https://postgrespro.com/list/id/bjno37ickfafixkqmd2lcyopsajnuig5mm4rg6tn2ackpqyiba@w3sjfo3usuos
Best regards,
Xuneng
Attachments:
v7-0001-Process-sync-requests-incrementally-in-AbsorbSync.patch
From 6b8bcb44bb73811522416cc222a2e6d5eb8c6938 Mon Sep 17 00:00:00 2001
From: alterego665 <824662526@qq.com>
Date: Thu, 26 Jun 2025 21:23:08 +0800
Subject: [PATCH v7] Process sync requests incrementally in AbsorbSyncRequests
and CompactCheckpointerRequestQueue
If the number of sync requests is big enough, the palloc() call in
AbsorbSyncRequests() will attempt to allocate more than 1 GB of memory,
resulting in failure. This can lead to an infinite loop in the checkpointer
process, as it repeatedly fails to absorb the pending requests.
Similarly, CompactCheckpointerRequestQueue() can face the same memory
allocation issues when the request queue contains millions of entries,
requiring large hash tables that can exceed available memory.
To avoid this, we process requests incrementally in batches in both
functions. This patch introduces bounded memory usage by limiting
allocations to CKPT_REQ_BATCH_SIZE regardless of the total number of
requests. For CompactCheckpointerRequestQueue(), this changes memory
usage from O(num_requests) to O(batch_size) for both hash tables and
skip arrays, while accepting the trade-off that duplicates spanning
batch boundaries won't be detected.
This ensures that memory usage stays within a safe range and avoids
excessive allocations.
---
src/backend/postmaster/checkpointer.c | 259 +++++++++++++++++---------
1 file changed, 171 insertions(+), 88 deletions(-)
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index fda91ffd1ce..4c0a7204c23 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -127,6 +127,11 @@ typedef struct
int num_requests; /* current # of requests */
int max_requests; /* allocated array size */
+
+ int head; /* Index of the first request in the ring
+ * buffer */
+ int tail; /* Index of the last request in the ring
+ * buffer */
CheckpointerRequest requests[FLEXIBLE_ARRAY_MEMBER];
} CheckpointerShmemStruct;
@@ -135,6 +140,12 @@ static CheckpointerShmemStruct *CheckpointerShmem;
/* interval for calling AbsorbSyncRequests in CheckpointWriteDelay */
#define WRITES_PER_ABSORB 1000
+/* Maximum number of checkpointer requests to process in one batch */
+#define CKPT_REQ_BATCH_SIZE 10000
+
+/* Max number of requests the checkpointer request queue can hold */
+#define MAX_CHECKPOINT_REQUESTS 10000000
+
/*
* GUC parameters
*/
@@ -970,7 +981,8 @@ CheckpointerShmemInit(void)
*/
MemSet(CheckpointerShmem, 0, size);
SpinLockInit(&CheckpointerShmem->ckpt_lck);
- CheckpointerShmem->max_requests = NBuffers;
+ CheckpointerShmem->max_requests = Min(NBuffers, MAX_CHECKPOINT_REQUESTS);
+ CheckpointerShmem->head = CheckpointerShmem->tail = 0;
ConditionVariableInit(&CheckpointerShmem->start_cv);
ConditionVariableInit(&CheckpointerShmem->done_cv);
}
@@ -1148,6 +1160,7 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
{
CheckpointerRequest *request;
bool too_full;
+ int insert_pos;
if (!IsUnderPostmaster)
return false; /* probably shouldn't even get here */
@@ -1171,10 +1184,14 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
}
/* OK, insert request */
- request = &CheckpointerShmem->requests[CheckpointerShmem->num_requests++];
+ insert_pos = CheckpointerShmem->tail;
+ request = &CheckpointerShmem->requests[insert_pos];
request->ftag = *ftag;
request->type = type;
+ CheckpointerShmem->tail = (CheckpointerShmem->tail + 1) % CheckpointerShmem->max_requests;
+ CheckpointerShmem->num_requests++;
+
/* If queue is more than half full, nudge the checkpointer to empty it */
too_full = (CheckpointerShmem->num_requests >=
CheckpointerShmem->max_requests / 2);
@@ -1209,6 +1226,12 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
* Trying to do this every time the queue is full could lose if there
* aren't any removable entries. But that should be vanishingly rare in
* practice: there's one queue entry per shared buffer.
+ *
+ * To avoid large memory allocations when the queue contains many entries,
+ * we process requests incrementally in batches of CKPT_REQ_BATCH_SIZE.
+ * This limits memory usage to O(batch_size) instead of O(num_requests).
+ * Note that duplicates spanning batch boundaries won't be detected, but
+ * this trade-off is acceptable for memory scalability.
*/
static bool
CompactCheckpointerRequestQueue(void)
@@ -1216,15 +1239,17 @@ CompactCheckpointerRequestQueue(void)
struct CheckpointerSlotMapping
{
CheckpointerRequest request;
- int slot;
+ int ring_idx;
};
-
- int n,
- preserve_count;
- int num_skipped = 0;
+ int n;
+ int total_num_skipped = 0;
+ int head;
+ int max_requests;
+ int num_requests;
+ int read_idx,
+ write_idx;
+ int batch_start;
HASHCTL ctl;
- HTAB *htab;
- bool *skip_slot;
/* must hold CheckpointerCommLock in exclusive mode */
Assert(LWLockHeldByMe(CheckpointerCommLock));
@@ -1233,81 +1258,118 @@ CompactCheckpointerRequestQueue(void)
if (CritSectionCount > 0)
return false;
- /* Initialize skip_slot array */
- skip_slot = palloc0(sizeof(bool) * CheckpointerShmem->num_requests);
+ max_requests = CheckpointerShmem->max_requests;
+ num_requests = CheckpointerShmem->num_requests;
+ head = CheckpointerShmem->head;
- /* Initialize temporary hash table */
+ /* Setup hash table control structure once */
ctl.keysize = sizeof(CheckpointerRequest);
ctl.entrysize = sizeof(struct CheckpointerSlotMapping);
ctl.hcxt = CurrentMemoryContext;
- htab = hash_create("CompactCheckpointerRequestQueue",
- CheckpointerShmem->num_requests,
- &ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ /* Process and compact in batches */
+ read_idx = head;
+ write_idx = head;
+ batch_start = 0;
- /*
- * The basic idea here is that a request can be skipped if it's followed
- * by a later, identical request. It might seem more sensible to work
- * backwards from the end of the queue and check whether a request is
- * *preceded* by an earlier, identical request, in the hopes of doing less
- * copying. But that might change the semantics, if there's an
- * intervening SYNC_FORGET_REQUEST or SYNC_FILTER_REQUEST, so we do it
- * this way. It would be possible to be even smarter if we made the code
- * below understand the specific semantics of such requests (it could blow
- * away preceding entries that would end up being canceled anyhow), but
- * it's not clear that the extra complexity would buy us anything.
- */
- for (n = 0; n < CheckpointerShmem->num_requests; n++)
+ while (batch_start < num_requests)
{
- CheckpointerRequest *request;
- struct CheckpointerSlotMapping *slotmap;
- bool found;
+ int batch_size = Min(num_requests - batch_start, CKPT_REQ_BATCH_SIZE);
+ HTAB *htab;
+ bool *skip_slot;
+ int batch_num_skipped = 0;
+ int batch_read_idx;
+
+ /* Allocate skip array for this batch only */
+ skip_slot = palloc0(sizeof(bool) * batch_size);
+
+ htab = hash_create("CompactCheckpointerRequestQueue_Batch",
+ batch_size,
+ &ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
/*
- * We use the request struct directly as a hashtable key. This
- * assumes that any padding bytes in the structs are consistently the
- * same, which should be okay because we zeroed them in
- * CheckpointerShmemInit. Note also that RelFileLocator had better
- * contain no pad bytes.
+ * The basic idea here is that a request can be skipped if it's followed
+ * by a later, identical request within the same batch. It might seem more
+ * sensible to work backwards from the end of the queue and check whether a
+ * request is *preceded* by an earlier, identical request, in the hopes of
+ * doing less copying. But that might change the semantics, if there's an
+ * intervening SYNC_FORGET_REQUEST or SYNC_FILTER_REQUEST, so we do it
+ * this way. It would be possible to be even smarter if we made the code
+ * below understand the specific semantics of such requests (it could blow
+ * away preceding entries that would end up being canceled anyhow), but
+ * it's not clear that the extra complexity would buy us anything.
*/
- request = &CheckpointerShmem->requests[n];
- slotmap = hash_search(htab, request, HASH_ENTER, &found);
- if (found)
+ batch_read_idx = read_idx;
+ for (n = 0; n < batch_size; n++)
{
- /* Duplicate, so mark the previous occurrence as skippable */
- skip_slot[slotmap->slot] = true;
- num_skipped++;
+ CheckpointerRequest *request;
+ struct CheckpointerSlotMapping *slotmap;
+ bool found;
+
+ /*
+ * We use the request struct directly as a hashtable key. This
+ * assumes that any padding bytes in the structs are consistently the
+ * same, which should be okay because we zeroed them in
+ * CheckpointerShmemInit. Note also that RelFileLocator had better
+ * contain no pad bytes.
+ */
+ request = &CheckpointerShmem->requests[batch_read_idx];
+ slotmap = hash_search(htab, request, HASH_ENTER, &found);
+ if (found)
+ {
+ /* Duplicate, so mark the previous occurrence as skippable */
+ skip_slot[slotmap->ring_idx] = true;
+ batch_num_skipped++;
+ }
+ /* Remember slot containing latest occurrence of this request value */
+ slotmap->ring_idx = n; /* Index within this batch */
+ batch_read_idx = (batch_read_idx + 1) % max_requests;
}
- /* Remember slot containing latest occurrence of this request value */
- slotmap->slot = n;
- }
- /* Done with the hash table. */
- hash_destroy(htab);
+ /* Done with the hash table. */
+ hash_destroy(htab);
- /* If no duplicates, we're out of luck. */
- if (!num_skipped)
- {
+ /* Compact this batch: copy non-skipped entries */
+ for (n = 0; n < batch_size; n++)
+ {
+ /* If this slot is NOT skipped, keep it */
+ if (!skip_slot[n])
+ {
+ /* If the read and write positions are different, copy the request */
+ if (write_idx != read_idx)
+ CheckpointerShmem->requests[write_idx] = CheckpointerShmem->requests[read_idx];
+
+ /* Advance the write position */
+ write_idx = (write_idx + 1) % max_requests;
+ }
+
+ read_idx = (read_idx + 1) % max_requests;
+ }
+
+ total_num_skipped += batch_num_skipped;
+
+ /* Cleanup batch resources */
pfree(skip_slot);
- return false;
- }
- /* We found some duplicates; remove them. */
- preserve_count = 0;
- for (n = 0; n < CheckpointerShmem->num_requests; n++)
- {
- if (skip_slot[n])
- continue;
- CheckpointerShmem->requests[preserve_count++] = CheckpointerShmem->requests[n];
+ batch_start += batch_size;
}
+
+ /* If no duplicates, we're out of luck. */
+ if (total_num_skipped == 0)
+ return false;
+
+ /*
+ * Update ring buffer state: head remains the same, tail moves, count
+ * decreases
+ */
+ CheckpointerShmem->tail = write_idx;
+ CheckpointerShmem->num_requests -= total_num_skipped;
+
ereport(DEBUG1,
(errmsg_internal("compacted fsync request queue from %d entries to %d entries",
- CheckpointerShmem->num_requests, preserve_count)));
- CheckpointerShmem->num_requests = preserve_count;
+ num_requests, CheckpointerShmem->num_requests)));
- /* Cleanup. */
- pfree(skip_slot);
return true;
}
@@ -1325,40 +1387,61 @@ AbsorbSyncRequests(void)
{
CheckpointerRequest *requests = NULL;
CheckpointerRequest *request;
- int n;
+ int n,
+ i;
+ bool loop;
if (!AmCheckpointerProcess())
return;
- LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
-
- /*
- * We try to avoid holding the lock for a long time by copying the request
- * array, and processing the requests after releasing the lock.
- *
- * Once we have cleared the requests from shared memory, we have to PANIC
- * if we then fail to absorb them (eg, because our hashtable runs out of
- * memory). This is because the system cannot run safely if we are unable
- * to fsync what we have been told to fsync. Fortunately, the hashtable
- * is so small that the problem is quite unlikely to arise in practice.
- */
- n = CheckpointerShmem->num_requests;
- if (n > 0)
+ do
{
- requests = (CheckpointerRequest *) palloc(n * sizeof(CheckpointerRequest));
- memcpy(requests, CheckpointerShmem->requests, n * sizeof(CheckpointerRequest));
- }
+ LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
- START_CRIT_SECTION();
+ /*
+ * We try to avoid holding the lock for a long time by copying the
+ * request array, and processing the requests after releasing the
+ * lock.
+ *
+ * Once we have cleared the requests from shared memory, we have to
+ * PANIC if we then fail to absorb them (eg, because our hashtable
+ * runs out of memory). This is because the system cannot run safely
+ * if we are unable to fsync what we have been told to fsync.
+ * Fortunately, the hashtable is so small that the problem is quite
+ * unlikely to arise in practice.
+ *
+ * Note: we could not palloc more than 1Gb of memory, thus make sure
+ * that the maximum number of elements will fit in the requests
+ * buffer.
+ */
+ n = Min(CheckpointerShmem->num_requests, CKPT_REQ_BATCH_SIZE);
+ if (n > 0)
+ {
+ if (!requests)
+ requests = (CheckpointerRequest *) palloc(n * sizeof(CheckpointerRequest));
- CheckpointerShmem->num_requests = 0;
+ for (i = 0; i < n; i++)
+ {
+ requests[i] = CheckpointerShmem->requests[CheckpointerShmem->head];
+ CheckpointerShmem->head = (CheckpointerShmem->head + 1) % CheckpointerShmem->max_requests;
+ }
- LWLockRelease(CheckpointerCommLock);
+ CheckpointerShmem->num_requests -= n;
+
+ }
+
+ START_CRIT_SECTION();
+
+ /* Are there any requests in the queue? If so, keep going. */
+ loop = CheckpointerShmem->num_requests != 0;
+
+ LWLockRelease(CheckpointerCommLock);
- for (request = requests; n > 0; request++, n--)
- RememberSyncRequest(&request->ftag, request->type);
+ for (request = requests; n > 0; request++, n--)
+ RememberSyncRequest(&request->ftag, request->type);
- END_CRIT_SECTION();
+ END_CRIT_SECTION();
+ } while (loop);
if (requests)
pfree(requests);
--
2.49.0
Hi, Xuneng!
On Thu, Jun 26, 2025 at 4:31 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Patch v7 modifies CompactCheckpointerRequestQueue() to process requests incrementally in batches of CKPT_REQ_BATCH_SIZE (10,000), similar to the approach used in AbsorbSyncRequests(). This limits memory usage from O(num_requests) to O(batch_size) for both hash tables and skip arrays.
- Hash table memory bounded by batch size regardless of total queue size
- Skip array allocation limited to batch size instead of max_requests
- Prevents potential OOM conditions with very large request queues
Trade-offs
Cross-batch duplicate detection: The incremental approach won't detect duplicates spanning batch boundaries. This limitation seems acceptable since:
- The main issue to be addressed is preventing memory allocation failures
- Remaining duplicates are still handled by the RememberSyncRequest() function in the sync subsystem
- The purpose of this function is to make some room for new requests, not to remove all duplicates.
Lock holding duration
Andres pointed out[1] that compacting a very large queue takes considerable time, and holding the exclusive lock for an extended period makes it much more likely that backends will have to perform syncs themselves - which is exactly what CompactCheckpointerRequestQueue() is trying to avoid in the first place. However, releasing the lock between batches would introduce race conditions that would make the design much more complex. Given that the primary goal of this patch is to avoid large memory allocations, I keep the lock held for the whole function for simplicity now.
[1] https://postgrespro.com/list/id/bjno37ickfafixkqmd2lcyopsajnuig5mm4rg6tn2ackpqyiba@w3sjfo3usuos
I've reviewed the v7 of patch. I can note following.
1) The patch makes CompactCheckpointerRequestQueue() process queue in
batches, but doesn't release the lock between batches. The algorithm
becomes more complex, and it holds the lock for the same (or probably
longer) time. We trade possibility to miss some duplicates to less
memory consumption. However, with MAX_CHECKPOINT_REQUESTS limit,
maximal memory consumption shouldn't be too high anyway.
2) Even if we manage to release the lock between the batches then we
still need some additional coordination to prevent two processed doing
CompactCheckpointerRequestQueue() simultaneously.
3) Another problem of releasing the lock is that in spite of
AbsorbSyncRequests() there is no work to do while not holding the
lock. Releasing the lock and immediately taking it back have a little
use: other processes have almost no chance to grab the lock.
In the token of all of above, I think it's not reasonable to do
CompactCheckpointerRequestQueue() in batches. Sorry for the
confusion.
Regarding the gap filling, I've got from [1] that the requests order
matters. Then gap filling strategy is impossible or too complex. The
point is withdrawn. I think v6 is basically good enough.
The v8 is attached. It's basically the same as v6, but has a revised
commit message and comments.
The patch for stable branches is also attached: it just limits the
maximum size of the checkpointer requests queue.
Links.
1. /messages/by-id/CABPTF7XSSecQ-k7k9cQJsA3ACHmCVwdoRfv4DxOMom4cNQL=5Q@mail.gmail.com
------
Regards,
Alexander Korotkov
Supabase
Attachments:
v8-0001-Process-sync-requests-incrementally-in-AbsorbSync.patch
From 9583e36ad93255120f082f67f08a56ffd27bb13a Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Wed, 23 Jul 2025 01:33:19 +0300
Subject: [PATCH v8] Process sync requests incrementally in AbsorbSyncRequests
If the number of sync requests is big enough, the palloc() call in
AbsorbSyncRequests() will attempt to allocate more than 1 GB of memory,
resulting in failure. This can lead to an infinite loop in the checkpointer
process, as it repeatedly fails to absorb the pending requests.
This commit introduces the following changes to cope with this problem:
1. Turn pending checkpointer requests array in shared memory into a bounded
ring buffer.
2. Limit maximum ring buffer size to 10M items.
3. Make AbsorbSyncRequests() process requests incrementally in 10K batches.
Even #2 makes the whole queue size fit the maximum palloc() size of 1 GB.
Additionally, #3 keeps per-batch allocations small and avoids long periods
of continuous lock holding.
This commit is for master only. A simpler fix, which just limits the request
queue size to 10M, will be backpatched.
Reported-by: Ekaterina Sokolova <e.sokolova@postgrespro.ru>
Discussion: https://postgr.es/m/db4534f83a22a29ab5ee2566ad86ca92%40postgrespro.ru
Author: Maxim Orlov <orlovmg@gmail.com>
Co-authored-by: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
---
src/backend/postmaster/checkpointer.c | 155 +++++++++++++++++++-------
1 file changed, 114 insertions(+), 41 deletions(-)
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 2809e298a44..7d235e382ec 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -130,6 +130,13 @@ typedef struct
int num_requests; /* current # of requests */
int max_requests; /* allocated array size */
+
+ int head; /* Index of the first request in the ring
+ * buffer */
+ int tail; /* Index of the last request in the ring
+ * buffer */
+
+ /* The ring buffer of pending checkpointer requests */
CheckpointerRequest requests[FLEXIBLE_ARRAY_MEMBER];
} CheckpointerShmemStruct;
@@ -138,6 +145,12 @@ static CheckpointerShmemStruct *CheckpointerShmem;
/* interval for calling AbsorbSyncRequests in CheckpointWriteDelay */
#define WRITES_PER_ABSORB 1000
+/* Maximum number of checkpointer requests to process in one batch */
+#define CKPT_REQ_BATCH_SIZE 10000
+
+/* Max number of requests the checkpointer request queue can hold */
+#define MAX_CHECKPOINT_REQUESTS 10000000
+
/*
* GUC parameters
*/
@@ -973,7 +986,8 @@ CheckpointerShmemInit(void)
*/
MemSet(CheckpointerShmem, 0, size);
SpinLockInit(&CheckpointerShmem->ckpt_lck);
- CheckpointerShmem->max_requests = NBuffers;
+ CheckpointerShmem->max_requests = Min(NBuffers, MAX_CHECKPOINT_REQUESTS);
+ CheckpointerShmem->head = CheckpointerShmem->tail = 0;
ConditionVariableInit(&CheckpointerShmem->start_cv);
ConditionVariableInit(&CheckpointerShmem->done_cv);
}
@@ -1201,6 +1215,7 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
{
CheckpointerRequest *request;
bool too_full;
+ int insert_pos;
if (!IsUnderPostmaster)
return false; /* probably shouldn't even get here */
@@ -1224,10 +1239,14 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
}
/* OK, insert request */
- request = &CheckpointerShmem->requests[CheckpointerShmem->num_requests++];
+ insert_pos = CheckpointerShmem->tail;
+ request = &CheckpointerShmem->requests[insert_pos];
request->ftag = *ftag;
request->type = type;
+ CheckpointerShmem->tail = (CheckpointerShmem->tail + 1) % CheckpointerShmem->max_requests;
+ CheckpointerShmem->num_requests++;
+
/* If queue is more than half full, nudge the checkpointer to empty it */
too_full = (CheckpointerShmem->num_requests >=
CheckpointerShmem->max_requests / 2);
@@ -1269,12 +1288,16 @@ CompactCheckpointerRequestQueue(void)
struct CheckpointerSlotMapping
{
CheckpointerRequest request;
- int slot;
+ int ring_idx;
};
- int n,
- preserve_count;
+ int n;
int num_skipped = 0;
+ int head;
+ int max_requests;
+ int num_requests;
+ int read_idx,
+ write_idx;
HASHCTL ctl;
HTAB *htab;
bool *skip_slot;
@@ -1286,8 +1309,13 @@ CompactCheckpointerRequestQueue(void)
if (CritSectionCount > 0)
return false;
+ max_requests = CheckpointerShmem->max_requests;
+ num_requests = CheckpointerShmem->num_requests;
+
/* Initialize skip_slot array */
- skip_slot = palloc0(sizeof(bool) * CheckpointerShmem->num_requests);
+ skip_slot = palloc0(sizeof(bool) * max_requests);
+
+ head = CheckpointerShmem->head;
/* Initialize temporary hash table */
ctl.keysize = sizeof(CheckpointerRequest);
@@ -1311,7 +1339,8 @@ CompactCheckpointerRequestQueue(void)
* away preceding entries that would end up being canceled anyhow), but
* it's not clear that the extra complexity would buy us anything.
*/
- for (n = 0; n < CheckpointerShmem->num_requests; n++)
+ read_idx = head;
+ for (n = 0; n < num_requests; n++)
{
CheckpointerRequest *request;
struct CheckpointerSlotMapping *slotmap;
@@ -1324,16 +1353,19 @@ CompactCheckpointerRequestQueue(void)
* CheckpointerShmemInit. Note also that RelFileLocator had better
* contain no pad bytes.
*/
- request = &CheckpointerShmem->requests[n];
+ request = &CheckpointerShmem->requests[read_idx];
slotmap = hash_search(htab, request, HASH_ENTER, &found);
if (found)
{
/* Duplicate, so mark the previous occurrence as skippable */
- skip_slot[slotmap->slot] = true;
+ skip_slot[slotmap->ring_idx] = true;
num_skipped++;
}
/* Remember slot containing latest occurrence of this request value */
- slotmap->slot = n;
+ slotmap->ring_idx = read_idx;
+
+ /* Move to the next request in the ring buffer */
+ read_idx = (read_idx + 1) % max_requests;
}
/* Done with the hash table. */
@@ -1347,17 +1379,34 @@ CompactCheckpointerRequestQueue(void)
}
/* We found some duplicates; remove them. */
- preserve_count = 0;
- for (n = 0; n < CheckpointerShmem->num_requests; n++)
+ read_idx = write_idx = head;
+ for (n = 0; n < num_requests; n++)
{
- if (skip_slot[n])
- continue;
- CheckpointerShmem->requests[preserve_count++] = CheckpointerShmem->requests[n];
+ /* If this slot is NOT skipped, keep it */
+ if (!skip_slot[read_idx])
+ {
+ /* If the read and write positions are different, copy the request */
+ if (write_idx != read_idx)
+ CheckpointerShmem->requests[write_idx] =
+ CheckpointerShmem->requests[read_idx];
+
+ /* Advance the write position */
+ write_idx = (write_idx + 1) % max_requests;
+ }
+
+ read_idx = (read_idx + 1) % max_requests;
}
+
+ /*
+ * Update ring buffer state: head remains the same, tail moves, count
+ * decreases
+ */
+ CheckpointerShmem->tail = write_idx;
+ CheckpointerShmem->num_requests -= num_skipped;
+
ereport(DEBUG1,
(errmsg_internal("compacted fsync request queue from %d entries to %d entries",
- CheckpointerShmem->num_requests, preserve_count)));
- CheckpointerShmem->num_requests = preserve_count;
+ num_requests, CheckpointerShmem->num_requests)));
/* Cleanup. */
pfree(skip_slot);
@@ -1378,40 +1427,64 @@ AbsorbSyncRequests(void)
{
CheckpointerRequest *requests = NULL;
CheckpointerRequest *request;
- int n;
+ int n,
+ i;
+ bool loop;
if (!AmCheckpointerProcess())
return;
- LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
-
- /*
- * We try to avoid holding the lock for a long time by copying the request
- * array, and processing the requests after releasing the lock.
- *
- * Once we have cleared the requests from shared memory, we have to PANIC
- * if we then fail to absorb them (eg, because our hashtable runs out of
- * memory). This is because the system cannot run safely if we are unable
- * to fsync what we have been told to fsync. Fortunately, the hashtable
- * is so small that the problem is quite unlikely to arise in practice.
- */
- n = CheckpointerShmem->num_requests;
- if (n > 0)
+ do
{
- requests = (CheckpointerRequest *) palloc(n * sizeof(CheckpointerRequest));
- memcpy(requests, CheckpointerShmem->requests, n * sizeof(CheckpointerRequest));
- }
+ LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
+
+ /*---
+ * We try to avoid holding the lock for a long time by:
+ * 1. Copying the request array, and processing the requests after
+ * releasing the lock;
+ * 2. Processing not the whole queue, but only batches of
+ * CKPT_REQ_BATCH_SIZE at once.
+ *
+ * Once we have cleared the requests from shared memory, we have to
+ * PANIC if we then fail to absorb them (eg, because our hashtable
+ * runs out of memory). This is because the system cannot run safely
+ * if we are unable to fsync what we have been told to fsync.
+ * Fortunately, the hashtable is so small that the problem is quite
+ * unlikely to arise in practice.
+ *
+ * Note: The maximum possible size of a ring buffer is
+ * MAX_CHECKPOINT_REQUESTS entries, which fit into a maximum palloc
+ * allocation size of 1Gb. Our maximum batch size,
+ * CKPT_REQ_BATCH_SIZE, is even smaller.
+ */
+ n = Min(CheckpointerShmem->num_requests, CKPT_REQ_BATCH_SIZE);
+ if (n > 0)
+ {
+ if (!requests)
+ requests = (CheckpointerRequest *) palloc(n * sizeof(CheckpointerRequest));
- START_CRIT_SECTION();
+ for (i = 0; i < n; i++)
+ {
+ requests[i] = CheckpointerShmem->requests[CheckpointerShmem->head];
+ CheckpointerShmem->head = (CheckpointerShmem->head + 1) % CheckpointerShmem->max_requests;
+ }
- CheckpointerShmem->num_requests = 0;
+ CheckpointerShmem->num_requests -= n;
- LWLockRelease(CheckpointerCommLock);
+ }
+
+ START_CRIT_SECTION();
+
+ /* Are there any requests in the queue? If so, keep going. */
+ loop = CheckpointerShmem->num_requests != 0;
+
+ LWLockRelease(CheckpointerCommLock);
- for (request = requests; n > 0; request++, n--)
- RememberSyncRequest(&request->ftag, request->type);
+ for (request = requests; n > 0; request++, n--)
+ RememberSyncRequest(&request->ftag, request->type);
- END_CRIT_SECTION();
+ END_CRIT_SECTION();
+ } while (loop);
if (requests)
pfree(requests);
--
2.39.5 (Apple Git-154)
v1-backpatch-0001-Limit-checkpointer-requests-queue-size.patch
From 8aeca70a90956387432a95fdd3b6365a785f0bbb Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Wed, 23 Jul 2025 01:27:44 +0300
Subject: [PATCH v1-backpatch] Limit checkpointer requests queue size
If the number of sync requests is big enough, the palloc() call in
AbsorbSyncRequests() will attempt to allocate more than 1 GB of memory,
resulting in failure. This can lead to an infinite loop in the checkpointer
process, as it repeatedly fails to absorb the pending requests.
This commit limits the checkpointer requests queue size to 10M items. In
addition to preventing the palloc() failure, this change helps to avoid long
queue processing time.
Also, this commit is for backpatching only. The master branch receives
a more invasive yet comprehensive fix for this problem.
Discussion: https://postgr.es/m/db4534f83a22a29ab5ee2566ad86ca92%40postgrespro.ru
Backpatch-through: 13
---
src/backend/postmaster/checkpointer.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index fda91ffd1ce..903d83e7dea 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -135,6 +135,9 @@ static CheckpointerShmemStruct *CheckpointerShmem;
/* interval for calling AbsorbSyncRequests in CheckpointWriteDelay */
#define WRITES_PER_ABSORB 1000
+/* Max number of requests the checkpointer request queue can hold */
+#define MAX_CHECKPOINT_REQUESTS 10000000
+
/*
* GUC parameters
*/
@@ -970,7 +973,7 @@ CheckpointerShmemInit(void)
*/
MemSet(CheckpointerShmem, 0, size);
SpinLockInit(&CheckpointerShmem->ckpt_lck);
- CheckpointerShmem->max_requests = NBuffers;
+ CheckpointerShmem->max_requests = Min(NBuffers, MAX_CHECKPOINT_REQUESTS);
ConditionVariableInit(&CheckpointerShmem->start_cv);
ConditionVariableInit(&CheckpointerShmem->done_cv);
}
--
2.39.5 (Apple Git-154)
Hi, Alexander
Thanks for the review and feedback!
The v8 is attached. It's basically the same as v6, but has a revised
commit message and comments.
The patch for stable branches is also attached: it just limits the
maximum size of the checkpointer requests queue.
LGTM.
Best,
Xuneng
On Fri, Jul 25, 2025 at 3:23 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Thanks for the review and feedback!
The v8 is attached. It's basically the same as v6, but has a revised
commit message and comments.
The patch for stable branches is also attached: it just limits the
maximum size of the checkpointer requests queue.
LGTM.
Good, thank you.
I'm going to push this if no objections.
------
Regards,
Alexander Korotkov
Supabase
Alexander Korotkov <aekorotkov@gmail.com> writes:
I'm going to push this if no objections.
I looked at these patches while preparing release notes, and
found an oversight. CheckpointerShmemInit does
CheckpointerShmem->max_requests = Min(NBuffers, MAX_CHECKPOINT_REQUESTS);
but CheckpointerShmemSize still does
size = add_size(size, mul_size(NBuffers, sizeof(CheckpointerRequest)));
So if NBuffers is more than MAX_CHECKPOINT_REQUESTS, we will allocate
extra CheckpointerRequest array entries that we will never use,
wasting shared memory. Admittedly the amount is small relative to the
shared buffers themselves, but at the very least this is confusing.
The comment in CheckpointerShmemSize needs adjustment, too.
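In other words, CheckpointerShmemSize() presumably ought to apply the same
cap as the init code, something like this (just a sketch, not a tested
change):

    /*
     * Size the requests[] array with the same cap that
     * CheckpointerShmemInit() applies to max_requests, so we don't
     * reserve entries we will never use.
     */
    size = offsetof(CheckpointerShmemStruct, requests);
    size = add_size(size,
                    mul_size(Min(NBuffers, MAX_CHECKPOINT_REQUESTS),
                             sizeof(CheckpointerRequest)));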
Also, the reason I was studying it so carefully is that I could not
figure out who to credit for the back-patched fix. It looks like
the original suggestion for the minimal fix was Alexander's, but
other people seem to have had their fingers in the actual patch
writing --- or am I misreading the thread?
regards, tom lane
Hi, Tom!
Thanks for catching this.
On Thu, Aug 7, 2025 at 2:58 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Alexander Korotkov <aekorotkov@gmail.com> writes:
I'm going to push this if no objections.
I looked at these patches while preparing release notes, and
found an oversight. CheckpointerShmemInit does
CheckpointerShmem->max_requests = Min(NBuffers, MAX_CHECKPOINT_REQUESTS);
but CheckpointerShmemSize still does
size = add_size(size, mul_size(NBuffers, sizeof(CheckpointerRequest)));
So if NBuffers is more than MAX_CHECKPOINT_REQUESTS, we will allocate
extra CheckpointerRequest array entries that we will never use,
wasting shared memory. Admittedly the amount is small relative to the
shared buffers themselves, but at the very least this is confusing.
The comment in CheckpointerShmemSize needs adjustment, too.
I attached a patch to fix it.
Best,
Xuneng
Attachments:
v1-0001-Fix-checkpointer-shared-memory-allocation.patch
From 3d9cb6b21a6e44e892e41cba35dd1caff3f437d1 Mon Sep 17 00:00:00 2001
From: alterego665 <824662526@qq.com>
Date: Thu, 7 Aug 2025 15:03:10 +0800
Subject: [PATCH v1] Fix checkpointer shared memory allocation
Use MAX_CHECKPOINT_REQUESTS instead of NBuffers in CheckpointerShmemSize
to match the actual array size limit set in CheckpointerShmemInit.
This prevents wasting shared memory when NBuffers > MAX_CHECKPOINT_REQUESTS.
Also fix the comment.
---
src/backend/postmaster/checkpointer.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 8490148a47d..8d52b24ed05 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -953,11 +953,10 @@ CheckpointerShmemSize(void)
Size size;
/*
- * Currently, the size of the requests[] array is arbitrarily set equal to
- * NBuffers. This may prove too large or small ...
+ * The size of the requests[] array is capped at MAX_CHECKPOINT_REQUESTS.
*/
size = offsetof(CheckpointerShmemStruct, requests);
- size = add_size(size, mul_size(NBuffers, sizeof(CheckpointerRequest)));
+ size = add_size(size, mul_size(MAX_CHECKPOINT_REQUESTS, sizeof(CheckpointerRequest)));
return size;
}
--
2.49.0
Hi,
On Thu, Aug 7, 2025 at 3:16 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi, Tom!
Thanks for catching this.
On Thu, Aug 7, 2025 at 2:58 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Alexander Korotkov <aekorotkov@gmail.com> writes:
I'm going to push this if no objections.
I looked at these patches while preparing release notes, and
found an oversight. CheckpointerShmemInit does
CheckpointerShmem->max_requests = Min(NBuffers, MAX_CHECKPOINT_REQUESTS);
but CheckpointerShmemSize still does
size = add_size(size, mul_size(NBuffers, sizeof(CheckpointerRequest)));
So if NBuffers is more than MAX_CHECKPOINT_REQUESTS, we will allocate
extra CheckpointerRequest array entries that we will never use,
wasting shared memory. Admittedly the amount is small relative to the
shared buffers themselves, but at the very least this is confusing.
The comment in CheckpointerShmemSize needs adjustment, too.
I attached a patch to fix it.
Sorry for the error in this patch. I forgot to take the min of
NBuffers and MAX_CHECKPOINT_REQUESTS for mem size calculation.
Many thanks to Alexander for correcting and pushing it.
On Thu, Aug 7, 2025 at 4:01 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
On Thu, Aug 7, 2025 at 3:16 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi, Tom!
Thanks for catching this.
On Thu, Aug 7, 2025 at 2:58 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Alexander Korotkov <aekorotkov@gmail.com> writes:
I'm going to push this if no objections.
I looked at these patches while preparing release notes, and
found an oversight. CheckpointerShmemInit does
CheckpointerShmem->max_requests = Min(NBuffers, MAX_CHECKPOINT_REQUESTS);
but CheckpointerShmemSize still does
size = add_size(size, mul_size(NBuffers, sizeof(CheckpointerRequest)));
So if NBuffers is more than MAX_CHECKPOINT_REQUESTS, we will allocate
extra CheckpointerRequest array entries that we will never use,
wasting shared memory. Admittedly the amount is small relative to the
shared buffers themselves, but at the very least this is confusing.
The comment in CheckpointerShmemSize needs adjustment, too.
I attached a patch to fix it.
Sorry for the error in this patch. I forgot to take the min of
NBuffers and MAX_CHECKPOINT_REQUESTS for mem size calculation.
Many thanks to Alexander for correcting and pushing it.
No problem. Thank you for the patch. And thanks to Tom for catching this.
------
Regards,
Alexander Korotkov
Supabase