Identify huge pages accessibility using madvise

Started by Dmitry Dolgov · over 1 year ago · 5 messages
#1 Dmitry Dolgov
9erthalion6@gmail.com
4 attachment(s)

Hi,

I would like to propose a small patch to address an annoying issue with
the way PostgreSQL falls back when "huge_pages = try" is set. Here is
what the problem looks like:

* PostgreSQL is starting on a machine with some huge pages available

* It tries to identify that fact and does mmap with MAP_HUGETLB, which
succeeds

* But it happens to run inside a cgroup with a hugetlb controller and
  a limit set to 0 (or to anything less than PostgreSQL needs)

* Under these circumstances PostgreSQL will proceed to allocate huge
  pages, but the first page fault will trigger SIGBUS

I've sketched out how to reproduce it with cgroup v1 and v2 in the
attached scripts.
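
As a minimal illustration (separate from the attached shell scripts), here is
a standalone C sketch of the failure mode, assuming a 2 MB default huge page
size and a hugetlb cgroup limit of 0:

/*
 * mmap with MAP_HUGETLB succeeds because the host has huge pages, but
 * with the cgroup hugetlb limit at 0 the very first touch of the
 * mapping raises SIGBUS instead of the allocation failing cleanly.
 */
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int
main(void)
{
	size_t		size = 2 * 1024 * 1024;	/* one 2 MB huge page */
	char	   *p;

	p = mmap(NULL, size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (p == MAP_FAILED)
	{
		fprintf(stderr, "mmap failed: %s\n", strerror(errno));
		return 1;
	}

	p[0] = 1;					/* SIGBUS here under the zero limit */
	printf("huge page touched successfully\n");
	return 0;
}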

This sounds like quite a rare combination of factors, but apparently
it's fairly easy to run into on K8s/OpenShift. A bug about this
behaviour was reported some time ago [1], and back then I was under the
impression it was a settled matter with nothing left to do. Yet I still
observe this type of issue, most recently less than a week ago.

After some research I found what looks to me like a relatively simple
way to address the problem. Linux kernel 5.14 introduced a new madvise
flag that might be just what we need here: MADV_POPULATE_READ [2] tells
the kernel to populate page tables by triggering read faults if
required. One by-design feature of this flag is that the madvise call
fails in situations like the one above, giving us an opportunity to
avoid SIGBUS.
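
To make the probe concrete, here is a hedged sketch outside the PostgreSQL
sources (the helper name try_huge_pages is made up for the example); the
patch below wires the same check into CreateAnonymousSegment:

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/*
 * Sketch of the two-step probe: map with MAP_HUGETLB, then ask the
 * kernel to populate the range with MADV_POPULATE_READ (Linux >= 5.14).
 * Returns NULL if huge pages are unavailable or not accessible from the
 * current cgroup, so the caller can fall back to regular pages.
 */
static void *
try_huge_pages(size_t size)
{
	void	   *ptr;

	ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (ptr == MAP_FAILED)
		return NULL;			/* no huge pages at all */

#ifdef MADV_POPULATE_READ
	/*
	 * If a cgroup hugetlb limit would make a later page fault fail,
	 * this call fails here instead of a SIGBUS being raised later.
	 */
	if (madvise(ptr, size, MADV_POPULATE_READ) != 0)
	{
		munmap(ptr, size);
		return NULL;
	}
#endif

	return ptr;
}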

I've outlined a patch to implement this approach and tested it on a
newish Linux kernel I've got lying around (6.9.0-rc1) -- no SIGBUS, and
PostgreSQL falls back to not using huge pages. The resulting change
seems small enough to justify addressing this minor but annoying issue.
Any thoughts or comments on the proposal?

[1]: /messages/by-id/HE1PR0701MB256920EEAA3B2A9C06249F339E110@HE1PR0701MB2569.eurprd07.prod.outlook.com
[2]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4ca9b3859dac14bbef0c27d00667bb5b10917adb

Attachments:

v1-0001-Identify-huge-pages-accesibility-using-madvise.patch (text/plain)
From 0001d43117dc5cad08fb0908a3e50a00c56f88d3 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Sat, 13 Apr 2024 11:31:46 +0200
Subject: [PATCH v1] Identify huge pages accesibility using madvise

Currently, PostgreSQL tries to figure out whether huge pages are
available, in order to fall back if "huge_pages = try" is set. There is
an annoying situation that this approach cannot handle: huge pages are
available, but they are restricted via cgroups. If PostgreSQL is running
inside a cgroup that limits huge pages to 0, the allocation with mmap
succeeds, but the very first page fault returns SIGBUS.

To handle this situation more gracefully, add a madvise call with the
MADV_POPULATE_READ flag if it is available (it was introduced in Linux
kernel 5.14). This flag tells the kernel to populate page tables by
triggering read faults if required; in the situation described above it
fails, giving PostgreSQL an opportunity to fall back and proceed without
huge pages. Note that this is not a side effect, but designed
behaviour [1].

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4ca9b3859dac14bbef0c27d00667bb5b10917adb
---
 src/backend/port/sysv_shmem.c | 30 ++++++++++++++++++++++++++++--
 1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 1a6d8fa0fb..cbacf62066 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -600,7 +600,7 @@ CreateAnonymousSegment(Size *size)
 {
 	Size		allocsize = *size;
 	void	   *ptr = MAP_FAILED;
-	int			mmap_errno = 0;
+	int			mmap_errno = 0, madv_errno = 0;
 
 #ifndef MAP_HUGETLB
 	/* PGSharedMemoryCreate should have dealt with this case */
@@ -625,6 +625,28 @@ CreateAnonymousSegment(Size *size)
 		if (huge_pages == HUGE_PAGES_TRY && ptr == MAP_FAILED)
 			elog(DEBUG1, "mmap(%zu) with MAP_HUGETLB failed, huge pages disabled: %m",
 				 allocsize);
+
+#ifdef MADV_POPULATE_READ
+		/*
+		 * Verifying whether huge pages are usable is done in two steps: first
+		 * mmap with MAP_HUGETLB, then madvise with MADV_POPULATE_READ. The
+		 * MADV_POPULATE_READ flag tells the kernel to populate page tables by
+		 * triggering read faults if required, revealing potential access
+		 * issues that would otherwise result in SIGBUS.
+		 *
+		 * If mmap fails, no huge pages are available; if it succeeds, there is
+		 * still a possibility that huge pages are limited via cgroups. If
+		 * madvise fails, there are some huge pages, but we cannot access them
+		 * due to cgroup limits. If both succeed, we're good to go.
+		 */
+		if (ptr != MAP_FAILED && madvise(ptr, allocsize, MADV_POPULATE_READ) != 0)
+		{
+			madv_errno = errno;
+			elog(DEBUG1, "madvise(%zu) with MAP_HUGETLB and MADV_POPULATE_READ "
+						 "failed, huge pages disabled: %m", allocsize);
+			ptr = MAP_FAILED;
+		}
+#endif
 	}
 #endif
 
@@ -650,7 +672,11 @@ CreateAnonymousSegment(Size *size)
 
 	if (ptr == MAP_FAILED)
 	{
-		errno = mmap_errno;
+		if (mmap_errno != 0)
+			errno = mmap_errno;
+		else
+			errno = madv_errno;
+
 		ereport(FATAL,
 				(errmsg("could not map anonymous shared memory: %m"),
 				 (mmap_errno == ENOMEM) ?

base-commit: 3a4a3537a999932642ba7a459900fe3c4f5cad02
-- 
2.31.1

sigbus.sh (application/x-sh)
cgroup-v1.sh (application/x-sh)
cgroup-v2.sh (application/x-sh)
#2 Gabriele Bartolini
gabriele.bartolini@enterprisedb.com
In reply to: Dmitry Dolgov (#1)
Re: Identify huge pages accessibility using madvise

Hi Dmitry,

I've been attempting to replicate this issue directly in Kubernetes, but I
haven't been successful so far. I've been using EKS nodes, and it seems
that they all run cgroup v2 now. Do you have anything that could help me
get started on this more quickly?

Thanks,
Gabriele

--
Gabriele Bartolini
VP, Chief Architect, Kubernetes
enterprisedb.com

#3 Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Gabriele Bartolini (#2)
Re: Identify huge pages accessibility using madvise

On Thu, Sep 26, 2024 at 07:57:12AM GMT, Gabriele Bartolini wrote:
> Hi Dmitry,
>
> I've been attempting to replicate this issue directly in Kubernetes, but I
> haven't been successful so far. I've been using EKS nodes, and it seems
> that they all run cgroup v2 now. Do you have anything that could help me
> get started on this more quickly?
>
> Thanks,
> Gabriele

Hi Gabriele,

Thanks for testing. I can check whether I can get some EKS clusters to
experiment with. In the meantime, what about the reproduction script for
cgroup v2 (the plain one attached to the patch, which doesn't require any
k8s cluster)? Doesn't it work for you?

#4 Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Dmitry Dolgov (#3)
Re: Identify huge pages accessibility using madvise

On Thu, Sep 26, 2024 at 08:46:17AM GMT, Dmitry Dolgov wrote:

> On Thu, Sep 26, 2024 at 07:57:12AM GMT, Gabriele Bartolini wrote:
> > Hi Dmitry,
> >
> > I've been attempting to replicate this issue directly in Kubernetes, but I
> > haven't been successful so far. I've been using EKS nodes, and it seems
> > that they all run cgroup v2 now. Do you have anything that could help me
> > get started on this more quickly?
>
> Thanks for testing. I can check whether I can get some EKS clusters to
> experiment with. In the meantime, what about the reproduction script for
> cgroup v2 (the plain one attached to the patch, which doesn't require any
> k8s cluster)? Doesn't it work for you?

Looks like there is a plot twist. After talking to Gabriele off list and
testing on an EKS cluster, I've discovered that since 5.7 the Linux kernel
supports cgroup limits on hugetlb reservations [1]. That means that,
alongside the original limit applied at page fault time, there is one
applied at reservation time, with a separate knob in cgroupfs:

# cgroup v2, hugetlb controller
#
# original limit, page fault level
hugetlb.2MB.limit_in_bytes
#
# new one, reservation level
hugetlb.2MB.rsvd.limit_in_bytes

This means that there could still be people facing the original issue the
patch is trying to address: for that, one would need to either run an older
kernel or use a container orchestration tool that does not set the rsvd
value (it looks like there are such examples). But in the long term I would
expect everyone to converge on using reservation limits correctly, so maybe
the patch is not needed after all.
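
To make the difference concrete, a small C sketch (assuming a 2 MB huge page
size; the exact knob names differ between cgroup v1 and v2): with a
reservation-level limit in place the mmap with MAP_HUGETLB fails up front,
so the existing "huge_pages = try" fallback already handles it, whereas with
only a fault-level limit the failure is deferred to the first page fault:

#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int
main(void)
{
	size_t		size = 2 * 1024 * 1024;
	void	   *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
						 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	if (p == MAP_FAILED)
	{
		/* reservation-level limit (or no huge pages): mmap itself fails */
		printf("mmap failed up front: %s\n", strerror(errno));
		return 0;
	}

	/* only a fault-level limit: the SIGBUS risk described above remains */
	printf("mmap succeeded; the first touch may still raise SIGBUS\n");
	munmap(p, size);
	return 0;
}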

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cdc2fcfea79b9873bb63159f8ed973f4046018c8

#5 Peter Eisentraut
peter@eisentraut.org
In reply to: Dmitry Dolgov (#4)
Re: Identify huge pages accessibility using madvise

On 08.11.24 09:54, Dmitry Dolgov wrote:

> Looks like there is a plot twist. After talking to Gabriele off list and
> testing on an EKS cluster, I've discovered that since 5.7 the Linux kernel
> supports cgroup limits on hugetlb reservations [1]. That means that,
> alongside the original limit applied at page fault time, there is one
> applied at reservation time, with a separate knob in cgroupfs:
>
> # cgroup v2, hugetlb controller
> #
> # original limit, page fault level
> hugetlb.2MB.limit_in_bytes
> #
> # new one, reservation level
> hugetlb.2MB.rsvd.limit_in_bytes
>
> This means that there could still be people facing the original issue the
> patch is trying to address: for that, one would need to either run an older
> kernel or use a container orchestration tool that does not set the rsvd
> value (it looks like there are such examples). But in the long term I would
> expect everyone to converge on using reservation limits correctly, so maybe
> the patch is not needed after all.

Ah good, it looks like the issue was addressed properly in the kernel
then, and we don't need the workaround your patch proposes anymore.

So, I think we don't need to proceed with your patch. The issue will
hopefully go away over time (or has already), and those who are still
affected by it for some reason can refer to this thread for discussion
and maybe choose to apply the patch on their own.