remap the .text segment into huge pages at run time

Started by John Naylor · about 3 years ago · 16 messages
#1 John Naylor
john.naylor@enterprisedb.com
2 attachment(s)

It's been known for a while that Postgres spends a lot of time translating
instruction addresses, and using huge pages in the text segment yields a
substantial performance boost in OLTP workloads [1][2]. The difficulty is,
this normally requires a lot of painstaking work (unless your OS does
superpage promotion, like FreeBSD).

I found an MIT-licensed library "iodlr" from Intel [3]https://github.com/intel/iodlr that allows one to
remap the .text segment to huge pages at program start. Attached is a
hackish, Meson-only, "works on my machine" patchset to experiment with this
idea.

0001 adapts the library to our error logging and GUC system. The overview:

- read ELF info to get the start/end addresses of the .text segment
- calculate addresses therein aligned at huge page boundaries
- mmap a temporary region and memcpy the aligned portion of the .text
segment
- mmap aligned start address to a second region with huge pages and
MAP_FIXED
- memcpy over from the temp region and revoke the PROT_WRITE bit
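
To make that concrete, here is a condensed sketch of the sequence (error
handling and the huge-page-size lookup omitted; "start" and "size" stand for
the aligned .text range, and the real code is in 0001's large_page.c):

    void   *tmp;

    /* stash a copy of the aligned .text range in a temporary region */
    tmp = mmap(NULL, size, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    memcpy(tmp, start, size);

    /* clobber the original mapping with an explicit huge page mapping */
    mmap(start, size, PROT_READ | PROT_WRITE | PROT_EXEC,
         MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED | MAP_HUGETLB, -1, 0);

    /* copy the code back in and drop the PROT_WRITE bit again */
    memcpy(start, tmp, size);
    mprotect(start, size, PROT_READ | PROT_EXEC);
    munmap(tmp, size);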

The reason this doesn't "saw off the branch you're standing on" is that the
remapping is done in a function that's forced to live in a different
segment, and doesn't call any non-libc functions living elsewhere:

static void
__attribute__((__section__("lpstub")))
__attribute__((__noinline__))
MoveRegionToLargePages(const mem_range * r, int mmap_flags)

Debug messages show

2022-11-02 12:02:31.064 +07 [26955] DEBUG: .text start: 0x487540
2022-11-02 12:02:31.064 +07 [26955] DEBUG: .text end: 0x96cf12
2022-11-02 12:02:31.064 +07 [26955] DEBUG: aligned .text start: 0x600000
2022-11-02 12:02:31.064 +07 [26955] DEBUG: aligned .text end: 0x800000
2022-11-02 12:02:31.066 +07 [26955] DEBUG: binary mapped to huge pages
2022-11-02 12:02:31.066 +07 [26955] DEBUG: un-mmapping temporary code region

Here, out of 5MB of Postgres text, only 1 huge page can be used, but that
still saves 512 entries in the TLB and might bring a small improvement. The
un-remapped region below 0x600000 contains the ~600kB of "cold" code, since
the linker puts the cold section first, at least with recent versions of ld
and lld.
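
(For reference, the arithmetic behind the 512: one 2MB huge page covers
2097152 / 4096 = 512 ordinary 4kB pages, so a single iTLB entry does the
work of 512.)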

0002 is my attempt to force the linker's hand and get the entire text
segment mapped to huge pages. It's quite a finicky hack, and easily broken
(see below). That said, it still builds easily within our normal build
process, and maybe there is a better way to get the effect.

It does two things:

- Pass the linker -Wl,-zcommon-page-size=2097152
-Wl,-zmax-page-size=2097152 which aligns .init to a 2MB boundary. That's
done for predictability, but that means the next 2MB boundary is very
nearly 2MB away.

- Add a "cold" __asm__ filler function that just takes up space, enough to
push the end of the .text segment over the next aligned boundary, or to
~8MB in size.

In a non-assert build:

0001:

$ bloaty inst-perf/bin/postgres

    FILE SIZE        VM SIZE
 --------------  --------------
  53.7%  4.90Mi   58.7%  4.90Mi  .text
...
 100.0%  9.12Mi  100.0%  8.35Mi  TOTAL

$ readelf -S --wide inst-perf/bin/postgres

  [Nr] Name   Type      Address           Off     Size    ES  Flg  Lk  Inf  Al
...
  [12] .init  PROGBITS  0000000000600000  200000  00001b  00  AX    0    0   4
  [13] .plt   PROGBITS  0000000000600020  200020  001520  10  AX    0    0  16
  [14] .text  PROGBITS  0000000000601540  201540  7ff512  00  AX    0    0  16
...

0002:

$ bloaty inst-perf/bin/postgres

    FILE SIZE        VM SIZE
 --------------  --------------
  46.9%  8.00Mi   69.9%  8.00Mi  .text
...
 100.0%  17.1Mi  100.0%  11.4Mi  TOTAL

$ readelf -S --wide inst-perf/bin/postgres

  [Nr] Name   Type      Address           Off     Size    ES  Flg  Lk  Inf  Al
...
  [12] .init  PROGBITS  0000000000600000  200000  00001b  00  AX    0    0   4
  [13] .plt   PROGBITS  0000000000600020  200020  001520  10  AX    0    0  16
  [14] .text  PROGBITS  0000000000601540  201540  7ff512  00  AX    0    0  16
...

Debug messages with 0002 show 6MB mapped:

2022-11-02 12:35:28.482 +07 [28530] DEBUG: .text start: 0x601540
2022-11-02 12:35:28.482 +07 [28530] DEBUG: .text end: 0xe00a52
2022-11-02 12:35:28.482 +07 [28530] DEBUG: aligned .text start: 0x800000
2022-11-02 12:35:28.482 +07 [28530] DEBUG: aligned .text end: 0xe00000
2022-11-02 12:35:28.486 +07 [28530] DEBUG: binary mapped to huge pages
2022-11-02 12:35:28.486 +07 [28530] DEBUG: un-mmapping temporary code region

Since the front is all-cold, and there is very little at the end,
practically all hot pages are now remapped. The biggest problem with the
hackish filler function (in addition to maintainability) is, if explicit
huge pages are turned off in the kernel, attempting mmap() with MAP_HUGETLB
causes complete startup failure if the .text segment is larger than 8MB. I
haven't looked into what's happening there yet, but I didn't want to get too
far into the weeds before getting feedback on whether the entire approach in
this thread is sound enough to justify working on it further.

[1]: https://www.cs.rochester.edu/u/sandhya/papers/ispass19.pdf (paper: "On the Impact of Instruction Address Translation Overhead")
[2]: https://twitter.com/AndresFreundTec/status/1214305610172289024
[3]: https://github.com/intel/iodlr

--
John Naylor
EDB: http://www.enterprisedb.com

Attachments:

v1-0002-Put-all-non-cold-.text-in-huge-pages.patch (application/x-patch)
From 9cde401f87937c1982f2355c8f81449514166376 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Mon, 31 Oct 2022 13:59:30 +0700
Subject: [PATCH v1 2/2] Put all non-cold .text in huge pages

Tell the linker to align addresses on 2MB boundaries. The .init
section will be so aligned, with the .text section soon after it.
Aligning the start address of .text up will therefore land nearly
2MB past its actual start, and that first nearly-2MB of .text will
not map to huge pages.

We count on cold sections being linked at the front of the .text
segment: since the cold sections total about 600kB, we need ~1.4MB
of additional padding to keep all non-cold pages mappable to huge
pages. Since PG has about 5.0MB of .text, we also need an additional
~1MB to push the end of .text just past an aligned boundary, so that
when we align the end down, only a small number of pages remain
un-remapped at their original 4kB size.
---
 meson.build                  |  3 +++
 src/backend/port/filler.c    | 29 +++++++++++++++++++++++++++++
 src/backend/port/meson.build |  3 +++
 3 files changed, 35 insertions(+)
 create mode 100644 src/backend/port/filler.c

diff --git a/meson.build b/meson.build
index bfacbdc0af..450946370c 100644
--- a/meson.build
+++ b/meson.build
@@ -239,6 +239,9 @@ elif host_system == 'freebsd'
 elif host_system == 'linux'
   sema_kind = 'unnamed_posix'
   cppflags += '-D_GNU_SOURCE'
+  # WIP: debug builds are huge
+  # TODO: add portability check
+  ldflags += ['-Wl,-zcommon-page-size=2097152', '-Wl,-zmax-page-size=2097152']
 
 elif host_system == 'netbsd'
   # We must resolve all dynamic linking in the core server at program start.
diff --git a/src/backend/port/filler.c b/src/backend/port/filler.c
new file mode 100644
index 0000000000..de4e33bb05
--- /dev/null
+++ b/src/backend/port/filler.c
@@ -0,0 +1,29 @@
+/*
+ * Add enough padding to .text segment to bring the end just
+ * past a 2MB alignment boundary. In practice, this means .text needs
+ * to be at least 8MB. It shouldn't be much larger than this,
+ * because then more hot pages will remain in 4kB pages.
+ *
+ * FIXME: With this filler added, if explicit huge pages are turned off
+ * in the kernel, attempting mmap() with MAP_HUGETLB causes a crash
+ * instead of reporting failure if the .text segment is larger than 8MB.
+ *
+ * See MapStaticCodeToLargePages() in large_page.c
+ *
+ * XXX: The exact amount of filler must be determined experimentally
+ * on platforms of interest, in non-assert builds.
+ *
+ */
+static void
+__attribute__((used))
+__attribute__((cold))
+fill_function(int x)
+{
+	/* TODO: More architectures */
+#ifdef __x86_64__
+__asm__(
+	".fill 3251000"
+);
+#endif
+	(void) x;
+}
\ No newline at end of file
diff --git a/src/backend/port/meson.build b/src/backend/port/meson.build
index 5ab65115e9..d876712e0c 100644
--- a/src/backend/port/meson.build
+++ b/src/backend/port/meson.build
@@ -16,6 +16,9 @@ if cdata.has('USE_WIN32_SEMAPHORES')
 endif
 
 if cdata.has('USE_SYSV_SHARED_MEMORY')
+  if host_system == 'linux'
+    backend_sources += files('filler.c')
+  endif
   backend_sources += files('large_page.c')
   backend_sources += files('sysv_shmem.c')
 endif
-- 
2.37.3

v1-0001-Partly-remap-the-.text-segment-into-huge-pages-at.patch (application/x-patch)
From 0012baab70779f5fc06c8717392dc76e8f156270 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Mon, 31 Oct 2022 15:24:29 +0700
Subject: [PATCH v1 1/2] Partly remap the .text segment into huge pages at
 postmaster start

Based on the MIT-licensed iodlr library at https://github.com/intel/iodlr

The basic steps are:

- read ELF info to get the start/end addresses of the .text segment
- calculate addresses therein aligned at huge page boundaries
- mmap temporary region and memcpy aligned portion of .text segment
- mmap start address to new region with huge pages and MAP_FIXED
- memcpy over and revoke the PROT_WRITE bit

The Postgres .text segment is ~5.0MB in a non-assert build, so this
method can put 2-4MB into huge pages.
---
 src/backend/port/large_page.c       | 348 ++++++++++++++++++++++++++++
 src/backend/port/meson.build        |   1 +
 src/backend/postmaster/postmaster.c |   7 +
 src/include/port/large_page.h       |  18 ++
 4 files changed, 374 insertions(+)
 create mode 100644 src/backend/port/large_page.c
 create mode 100644 src/include/port/large_page.h

diff --git a/src/backend/port/large_page.c b/src/backend/port/large_page.c
new file mode 100644
index 0000000000..66a584f785
--- /dev/null
+++ b/src/backend/port/large_page.c
@@ -0,0 +1,348 @@
+/*-------------------------------------------------------------------------
+ *
+ * large_page.c
+ *	  Map .text segment of binary to huge pages
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *	  src/backend/port/large_page.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+/*
+ * Based on Intel's iodlr library:
+ * https://github.com/intel/iodlr.git
+ * MIT license and copyright notice follow
+ */
+
+/*
+ * Copyright (C) 2018 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom
+ * the Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included
+ * in all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
+ * OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES
+ * OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
+ * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE
+ * OR OTHER DEALINGS IN THE SOFTWARE.
+ *
+ * SPDX-License-Identifier: MIT
+ */
+
+#include "postgres.h"
+
+#include <link.h>
+#include <sys/mman.h>
+
+#include "port/large_page.h"
+#include "storage/pg_shmem.h"
+
+typedef struct
+{
+	char	   *from;
+	char	   *to;
+}			mem_range;
+
+typedef struct
+{
+	uintptr_t	start;
+	uintptr_t	end;
+	bool		found;
+}			FindParams;
+
+static inline uintptr_t
+largepage_align_down(uintptr_t addr, size_t hugepagesize)
+{
+	return (addr & ~(hugepagesize - 1));
+}
+
+static inline uintptr_t
+largepage_align_up(uintptr_t addr, size_t hugepagesize)
+{
+	return largepage_align_down(addr + hugepagesize - 1, hugepagesize);
+}
+
+static bool
+FindTextSection(const char *fname, ElfW(Shdr) * text_section)
+{
+	ElfW(Ehdr) ehdr;
+	FILE	   *bin;
+
+	ElfW(Shdr) * shdrs = NULL;
+	ElfW(Shdr) * sh_strab;
+	char	   *section_names = NULL;
+
+	bin = fopen(fname, "r");
+	if (bin == NULL)
+		return false;
+
+	/* Read the header. */
+	if (fread(&ehdr, sizeof(ehdr), 1, bin) != 1)
+		return false;
+
+	/* Read the section headers. */
+	shdrs = (ElfW(Shdr) *) palloc(ehdr.e_shnum * sizeof(ElfW(Shdr)));
+	if (fseek(bin, ehdr.e_shoff, SEEK_SET) != 0)
+		return false;
+	if (fread(shdrs, sizeof(shdrs[0]), ehdr.e_shnum, bin) != ehdr.e_shnum)
+		return false;
+
+	/* Read the string table. */
+	sh_strab = &shdrs[ehdr.e_shstrndx];
+	section_names = palloc(sh_strab->sh_size * sizeof(char));
+
+	if (fseek(bin, sh_strab->sh_offset, SEEK_SET) != 0)
+		return false;
+	if (fread(section_names, sh_strab->sh_size, 1, bin) != 1)
+		return false;
+
+	/* Find the ".text" section. */
+	for (uint32_t idx = 0; idx < ehdr.e_shnum; idx++)
+	{
+		ElfW(Shdr) * sh = &shdrs[idx];
+		if (!memcmp(&section_names[sh->sh_name], ".text", 5))
+		{
+			*text_section = *sh;
+			fclose(bin);
+			return true;
+		}
+	}
+	return false;
+}
+
+/* Callback for dl_iterate_phdr to set the start and end of the .text segment */
+static int
+FindMapping(struct dl_phdr_info *hdr, size_t size, void *data)
+{
+	ElfW(Shdr) text_section;
+	FindParams *find_params = (FindParams *) data;
+
+	/*
+	 * We are only interested in the mapping matching the main executable.
+	 * This has the empty string for a name.
+	 */
+	if (hdr->dlpi_name[0] != '\0')
+		return 0;
+
+	/*
+	 * Open the info structure for the executable on disk to find the location
+	 * of its .text section. We use the base address given to calculate the
+	 * .text section offset in memory.
+	 */
+	text_section.sh_size = 0;
+#ifdef __linux__
+	if (FindTextSection("/proc/self/exe", &text_section))
+	{
+		find_params->start = hdr->dlpi_addr + text_section.sh_addr;
+		find_params->end = find_params->start + text_section.sh_size;
+		find_params->found = true;
+		return 1;
+	}
+#endif
+	return 0;
+}
+
+/*
+ * Identify and return the text segment in the currently mapped memory region.
+ */
+static bool
+FindTextRegion(mem_range * region)
+{
+	FindParams	find_params = {0, 0, false};
+
+	/*
+	 * Note: the upstream source worked with shared libraries as well, hence
+	 * the iteration over all objects.
+	 */
+	dl_iterate_phdr(FindMapping, &find_params);
+	if (find_params.found)
+	{
+		region->from = (char *) find_params.start;
+		region->to = (char *) find_params.end;
+	}
+
+	return find_params.found;
+}
+
+/*
+ * Move specified region to large pages.
+ *
+ * NB: We need to be very careful:
+ * 1. This function itself should not be moved. We use compiler attributes
+ *    (__section__) to put it outside the ".text" section and
+ *    (__noinline__) to prevent inlining.
+ *    WIP: if these attributes aren't available, the function should do nothing.
+ *
+ * 2. This function should not call any function(s) that might be moved.
+ */
+static void
+__attribute__((__section__("lpstub")))
+__attribute__((__noinline__))
+MoveRegionToLargePages(const mem_range * r, int mmap_flags)
+{
+	void	   *nmem = MAP_FAILED;
+	void	   *tmem = MAP_FAILED;
+	int			ret = 0;
+	int			mmap_errno = 0;
+	void	   *start = r->from;
+	size_t		size = r->to - r->from;
+	bool		success = false;
+
+	/* Allocate temporary region */
+	nmem = mmap(NULL, size,
+				PROT_READ | PROT_WRITE,
+				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	if (nmem == MAP_FAILED)
+	{
+		elog(DEBUG1, "failed to allocate temporary region");
+		return;
+	}
+
+	/* copy the original code */
+	memcpy(nmem, r->from, size);
+
+	/*
+	 * mmap using the start address with MAP_FIXED so we get exactly the same
+	 * virtual address. We already know the original page is r-xp (PROT_READ,
+	 * PROT_EXEC, MAP_PRIVATE). We want PROT_WRITE because we are writing
+	 * into it.
+	 */
+	Assert(mmap_flags & MAP_HUGETLB);
+	tmem = mmap(start, size,
+				PROT_READ | PROT_WRITE | PROT_EXEC,
+				MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED | mmap_flags,
+				-1, 0);
+	mmap_errno = errno;
+
+	if (tmem == MAP_FAILED && huge_pages == HUGE_PAGES_ON)
+	{
+		/*
+		 * WIP: need a way for the user to determine total huge pages needed,
+		 * perhaps with shared_memory_size_in_huge_pages
+		 */
+		errno = mmap_errno;
+		ereport(FATAL,
+				errmsg("mmap(%zu) with MAP_HUGETLB failed, huge pages disabled: %m", size),
+				(mmap_errno == ENOMEM) ?
+				errhint("This usually means not enough explicit huge pages were "
+						"configured in the kernel") : 0);
+		goto cleanup_tmp;
+	}
+	else if (tmem == MAP_FAILED)
+	{
+		Assert(huge_pages == HUGE_PAGES_TRY);
+
+		errno = mmap_errno;
+		elog(DEBUG1, "mmap(%zu) with MAP_HUGETLB failed, huge pages disabled: %m", size);
+
+		/*
+		 * try remapping again with normal pages
+		 *
+		 * XXX we cannot just back out now
+		 */
+		tmem = mmap(start, size,
+					PROT_READ | PROT_WRITE | PROT_EXEC,
+					MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED,
+					-1, 0);
+		mmap_errno = errno;
+
+		if (tmem == MAP_FAILED)
+		{
+			/*
+			 * If we get here we cannot start the server. It's unlikely we
+			 * will fail here after the postmaster successfully set up shared
+			 * memory, but maybe we should have a GUC to turn off code
+			 * remapping, hinted here.
+			 */
+			errno = mmap_errno;
+			ereport(FATAL,
+					errmsg("mmap(%zu) failed for fallback code region: %m", size));
+			goto cleanup_tmp;
+		}
+	}
+	else
+		success = true;
+
+	/* copy the code to the newly mapped area and unset the write bit */
+	memcpy(start, nmem, size);
+	ret = mprotect(start, size, PROT_READ | PROT_EXEC);
+	if (ret < 0)
+	{
+		/* WIP: see note above about GUC and hint */
+		ereport(FATAL,
+				errmsg("failed to protect remapped code pages"));
+
+		/* Cannot start but at least try to clean up after ourselves */
+		munmap(tmem, size);
+		goto cleanup_tmp;
+	}
+
+	if (success)
+		elog(DEBUG1, "binary mapped to huge pages");
+
+cleanup_tmp:
+	/* Release the old/temporary mapped region */
+	elog(DEBUG3, "un-mmapping temporary code region");
+	ret = munmap(nmem, size);
+	if (ret < 0)
+		/* WIP: not sure of severity here */
+		ereport(LOG,
+				errmsg("failed to unmap temporary region"));
+
+	return;
+}
+
+/*  Align the region to be mapped to huge page boundaries. */
+static void
+AlignRegionToPageBoundary(mem_range * r, size_t hugepagesize)
+{
+	r->from = (char *) largepage_align_up((uintptr_t) r->from, hugepagesize);
+	r->to = (char *) largepage_align_down((uintptr_t) r->to, hugepagesize);
+}
+
+
+/*  Map the postgres .text segment into huge pages. */
+void
+MapStaticCodeToLargePages(void)
+{
+	size_t		hugepagesize;
+	int			mmap_flags;
+	mem_range	r = {0};
+
+	if (huge_pages == HUGE_PAGES_OFF)
+		return;
+
+	GetHugePageSize(&hugepagesize, &mmap_flags);
+	if (hugepagesize == 0)
+		return;
+
+	FindTextRegion(&r);
+	if (r.from == NULL || r.to == NULL)
+		return;
+
+	elog(DEBUG3, ".text start: %p", r.from);
+	elog(DEBUG3, ".text end:   %p", r.to);
+
+	AlignRegionToPageBoundary(&r, hugepagesize);
+
+	elog(DEBUG3, "aligned .text start: %p", r.from);
+	elog(DEBUG3, "aligned .text end:   %p", r.to);
+
+	/* check if aligned map region is large enough for huge pages */
+	if (r.to - r.from < hugepagesize || r.from > r.to)
+		return;
+
+	MoveRegionToLargePages(&r, mmap_flags);
+}
diff --git a/src/backend/port/meson.build b/src/backend/port/meson.build
index a22c25dd95..5ab65115e9 100644
--- a/src/backend/port/meson.build
+++ b/src/backend/port/meson.build
@@ -16,6 +16,7 @@ if cdata.has('USE_WIN32_SEMAPHORES')
 endif
 
 if cdata.has('USE_SYSV_SHARED_MEMORY')
+  backend_sources += files('large_page.c')
   backend_sources += files('sysv_shmem.c')
 endif
 
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 30fb576ac3..b30769c2b2 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -106,6 +106,7 @@
 #include "pg_getopt.h"
 #include "pgstat.h"
 #include "port/pg_bswap.h"
+#include "port/large_page.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/auxprocess.h"
 #include "postmaster/bgworker_internals.h"
@@ -1084,6 +1085,12 @@ PostmasterMain(int argc, char *argv[])
 	 */
 	CreateSharedMemoryAndSemaphores();
 
+	/*
+	 * If enough huge pages are available after setting up shared memory, try
+	 * to map the binary code to huge pages.
+	 */
+	MapStaticCodeToLargePages();
+
 	/*
 	 * Estimate number of openable files.  This must happen after setting up
 	 * semaphores, because on some platforms semaphores count as open files.
diff --git a/src/include/port/large_page.h b/src/include/port/large_page.h
new file mode 100644
index 0000000000..171819dd53
--- /dev/null
+++ b/src/include/port/large_page.h
@@ -0,0 +1,18 @@
+/*-------------------------------------------------------------------------
+ *
+ * large_page.h
+ *	  Map .text segment of binary to huge pages
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *	  src/include/port/large_page.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef LARGE_PAGE_H
+#define LARGE_PAGE_H
+
+extern void MapStaticCodeToLargePages(void);
+
+#endif							/* LARGE_PAGE_H */
-- 
2.37.3

#2 Andres Freund
andres@anarazel.de
In reply to: John Naylor (#1)
Re: remap the .text segment into huge pages at run time

Hi,

On 2022-11-02 13:32:37 +0700, John Naylor wrote:

> It's been known for a while that Postgres spends a lot of time translating
> instruction addresses, and using huge pages in the text segment yields a
> substantial performance boost in OLTP workloads [1][2].

Indeed. Some of that we eventually should address by making our code less
"jumpy", but that's a large amount of work and only going to go so far.

> The difficulty is, this normally requires a lot of painstaking work (unless
> your OS does superpage promotion, like FreeBSD).

I still am confused by FreeBSD being able to do this without changing the
section alignment to be big enough. Or is the default alignment on FreeBSD
large enough already?

> I found an MIT-licensed library "iodlr" from Intel [3] that allows one to
> remap the .text segment to huge pages at program start. Attached is a
> hackish, Meson-only, "works on my machine" patchset to experiment with this
> idea.

I wonder how far we can get with just using the linker hints to align
sections. I know that the linux folks are working on promoting sufficiently
aligned executable pages to huge pages too, and might have succeeded already.

IOW, adding the linker flags might be a good first step.

> 0001 adapts the library to our error logging and GUC system. The overview:
>
> - read ELF info to get the start/end addresses of the .text segment
> - calculate addresses therein aligned at huge page boundaries
> - mmap a temporary region and memcpy the aligned portion of the .text
>   segment
> - mmap aligned start address to a second region with huge pages and
>   MAP_FIXED
> - memcpy over from the temp region and revoke the PROT_WRITE bit

Would mremap()'ing the temporary region also work? That might be simpler and
more robust (you'd see the MAP_HUGETLB failure before doing anything
irreversible). And you then might not even need this:

> The reason this doesn't "saw off the branch you're standing on" is that the
> remapping is done in a function that's forced to live in a different
> segment, and doesn't call any non-libc functions living elsewhere:
>
> static void
> __attribute__((__section__("lpstub")))
> __attribute__((__noinline__))
> MoveRegionToLargePages(const mem_range * r, int mmap_flags)
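
(A hypothetical, untested sketch of that mremap() variant, reusing "start"
and "size" for the aligned range: build the huge-page copy at a scratch
address first, so a MAP_HUGETLB failure happens before the live mapping is
touched, then move it into place in one step.)

    void   *scratch;

    scratch = mmap(NULL, size, PROT_READ | PROT_WRITE | PROT_EXEC,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (scratch == MAP_FAILED)
        return;                 /* nothing irreversible has happened yet */

    memcpy(scratch, start, size);
    mprotect(scratch, size, PROT_READ | PROT_EXEC);

    /* atomically replace the old mapping; no window with .text unmapped */
    if (mremap(scratch, size, size, MREMAP_MAYMOVE | MREMAP_FIXED, start)
        == MAP_FAILED)
        munmap(scratch, size);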

This would likely need a bunch more gating than the patch, understandably,
has. I think it'd fail horribly if there were .text relocations, for example?
I think there are some architectures that do that by default...
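
(One conceivable gate, sketched under the assumption of glibc's <link.h>:
check the dynamic section for text relocations and skip the remapping when
they are present.)

    #include <link.h>
    #include <stdbool.h>

    static bool
    binary_has_textrel(void)
    {
        for (const ElfW(Dyn) *d = _DYNAMIC; d->d_tag != DT_NULL; d++)
        {
            /* DT_TEXTREL, or DF_TEXTREL inside DT_FLAGS, marks the binary */
            if (d->d_tag == DT_TEXTREL)
                return true;
            if (d->d_tag == DT_FLAGS && (d->d_un.d_val & DF_TEXTREL))
                return true;
        }
        return false;
    }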

> 0002 is my attempt to force the linker's hand and get the entire text
> segment mapped to huge pages. It's quite a finicky hack, and easily broken
> (see below). That said, it still builds easily within our normal build
> process, and maybe there is a better way to get the effect.
>
> It does two things:
>
> - Pass the linker -Wl,-zcommon-page-size=2097152
>   -Wl,-zmax-page-size=2097152 which aligns .init to a 2MB boundary. That's
>   done for predictability, but that means the next 2MB boundary is very
>   nearly 2MB away.

Yep. FWIW, my notes say

# align sections to 2MB boundaries for hugepage support
# bfd and gold linkers:
# -Wl,-zmax-page-size=0x200000 -Wl,-zcommon-page-size=0x200000
# lld:
# -Wl,-zmax-page-size=0x200000 -Wl,-z,separate-loadable-segments
# then copy binary to tmpfs mounted with -o huge=always

I.e. with lld you need slightly different flags -Wl,-z,separate-loadable-segments

The meson bit should probably just use

  cc.get_supported_link_arguments([
    '-Wl,-zmax-page-size=0x200000',
    '-Wl,-zcommon-page-size=0x200000',
    '-Wl,-zseparate-loadable-segments'])

Afaict there's really no reason to not do that by default, allowing kernels
that can promote to huge pages to do so.

My approach to forcing huge pages to be used was to then:

# copy binary to tmpfs mounted with -o huge=always

- Add a "cold" __asm__ filler function that just takes up space, enough to
push the end of the .text segment over the next aligned boundary, or to
~8MB in size.

I don't understand why this is needed - as long as the pages are aligned to
2MB, why do we need to fill things up on disk? The in-memory contents are the
relevant bit, no?

> Since the front is all-cold, and there is very little at the end,
> practically all hot pages are now remapped. The biggest problem with the
> hackish filler function (in addition to maintainability) is, if explicit
> huge pages are turned off in the kernel, attempting mmap() with MAP_HUGETLB
> causes complete startup failure if the .text segment is larger than 8MB.

I would expect MAP_HUGETLB to always fail if not enabled in the kernel,
independent of the .text segment size?

> +/* Callback for dl_iterate_phdr to set the start and end of the .text segment */
> +static int
> +FindMapping(struct dl_phdr_info *hdr, size_t size, void *data)
> +{
> +	ElfW(Shdr) text_section;
> +	FindParams *find_params = (FindParams *) data;
> +
> +	/*
> +	 * We are only interested in the mapping matching the main executable.
> +	 * This has the empty string for a name.
> +	 */
> +	if (hdr->dlpi_name[0] != '\0')
> +		return 0;
> +

It's not entirely clear we'd only ever want to do this for the main
executable. E.g. plpgsql could also benefit.
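
(For what it's worth, covering other objects wouldn't need the ELF file on
disk at all - the executable PT_LOAD segment of each object, libraries
included, is already in the program headers that dl_iterate_phdr() hands us.
A sketch, with the actual remapping left out:)

    static int
    FindExecSegments(struct dl_phdr_info *hdr, size_t size, void *data)
    {
        for (int i = 0; i < hdr->dlpi_phnum; i++)
        {
            const ElfW(Phdr) *ph = &hdr->dlpi_phdr[i];

            if (ph->p_type == PT_LOAD && (ph->p_flags & PF_X))
            {
                uintptr_t start = hdr->dlpi_addr + ph->p_vaddr;
                uintptr_t end = start + ph->p_memsz;

                /* remap [start, end) for this object here */
            }
        }
        return 0;               /* keep iterating over all loaded objects */
    }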

> diff --git a/meson.build b/meson.build
> index bfacbdc0af..450946370c 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -239,6 +239,9 @@ elif host_system == 'freebsd'
>  elif host_system == 'linux'
>    sema_kind = 'unnamed_posix'
>    cppflags += '-D_GNU_SOURCE'
> +  # WIP: debug builds are huge
> +  # TODO: add portability check
> +  ldflags += ['-Wl,-zcommon-page-size=2097152', '-Wl,-zmax-page-size=2097152']

What's that WIP about?

>  elif host_system == 'netbsd'
>    # We must resolve all dynamic linking in the core server at program start.
> diff --git a/src/backend/port/filler.c b/src/backend/port/filler.c
> new file mode 100644
> index 0000000000..de4e33bb05
> --- /dev/null
> +++ b/src/backend/port/filler.c
> @@ -0,0 +1,29 @@
> +/*
> + * Add enough padding to .text segment to bring the end just
> + * past a 2MB alignment boundary. In practice, this means .text needs
> + * to be at least 8MB. It shouldn't be much larger than this,
> + * because then more hot pages will remain in 4kB pages.
> + *
> + * FIXME: With this filler added, if explicit huge pages are turned off
> + * in the kernel, attempting mmap() with MAP_HUGETLB causes a crash
> + * instead of reporting failure if the .text segment is larger than 8MB.
> + *
> + * See MapStaticCodeToLargePages() in large_page.c
> + *
> + * XXX: The exact amount of filler must be determined experimentally
> + * on platforms of interest, in non-assert builds.
> + *
> + */
> +static void
> +__attribute__((used))
> +__attribute__((cold))
> +fill_function(int x)
> +{
> +	/* TODO: More architectures */
> +#ifdef __x86_64__
> +__asm__(
> +	".fill 3251000"
> +);
> +#endif
> +	(void) x;
> +}
> \ No newline at end of file
> diff --git a/src/backend/port/meson.build b/src/backend/port/meson.build
> index 5ab65115e9..d876712e0c 100644
> --- a/src/backend/port/meson.build
> +++ b/src/backend/port/meson.build
> @@ -16,6 +16,9 @@ if cdata.has('USE_WIN32_SEMAPHORES')
>  endif
>  
>  if cdata.has('USE_SYSV_SHARED_MEMORY')
> +  if host_system == 'linux'
> +    backend_sources += files('filler.c')
> +  endif
>    backend_sources += files('large_page.c')
>    backend_sources += files('sysv_shmem.c')
>  endif
> -- 
> 2.37.3

Greetings,

Andres Freund

#3 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#2)
Re: remap the .text segment into huge pages at run time

Hi,

This nerd-sniped me badly :)

On 2022-11-03 10:21:23 -0700, Andres Freund wrote:

> On 2022-11-02 13:32:37 +0700, John Naylor wrote:
> > I found an MIT-licensed library "iodlr" from Intel [3] that allows one to
> > remap the .text segment to huge pages at program start. Attached is a
> > hackish, Meson-only, "works on my machine" patchset to experiment with this
> > idea.
>
> I wonder how far we can get with just using the linker hints to align
> sections. I know that the linux folks are working on promoting sufficiently
> aligned executable pages to huge pages too, and might have succeeded already.
>
> IOW, adding the linker flags might be a good first step.

Indeed, I did see that that works to some degree on the 5.19 kernel I was
running. However, it never seems to get around to using huge pages
sufficiently to compete with explicit use of huge pages.

More interestingly, a few days ago, a new madvise hint, MADV_COLLAPSE, was
added into linux 6.1. That explicitly remaps a region and uses huge pages for
it. Of course that's going to take a while to be widely available, but it
seems like a safer approach than the remapping approach from this thread.

I hacked in a MADV_COLLAPSE (with setarch -R, so that I could just hardcode
the address / length), and it seems to work nicely.

With the weird caveat that on some filesystems one needs to make sure that
the executable doesn't use reflinks to share parts of other files, which the
mold linker and cp do... Not a concern on ext4, but on xfs. I took to copying
the postgres binary with cp --reflink=never

FWIW, you can see the state of the page mapping in more detail with the
kernel's page-types tool

sudo /home/andres/src/kernel/tools/vm/page-types -L -p 12297 -a 0x555555800,0x555556122
sudo /home/andres/src/kernel/tools/vm/page-types -f /srv/dev/build/m-opt/src/backend/postgres2

Perf results:

c=150;psql -f ~/tmp/prewarm.sql;perf stat -a -e cycles,iTLB-loads,iTLB-load-misses,itlb_misses.walk_active,itlb_misses.walk_completed_4k,itlb_misses.walk_completed_2m_4m,itlb_misses.walk_completed_1g pgbench -n -M prepared -S -P1 -c$c -j$c -T10

without MADV_COLLAPSE:

tps = 1038230.070771 (without initial connection time)

Performance counter stats for 'system wide':

1,184,344,476,152 cycles (71.41%)
2,846,146,710 iTLB-loads (71.43%)
2,021,885,782 iTLB-load-misses # 71.04% of all iTLB cache accesses (71.44%)
75,633,850,933 itlb_misses.walk_active (71.44%)
2,020,962,930 itlb_misses.walk_completed_4k (71.44%)
1,213,368 itlb_misses.walk_completed_2m_4m (57.12%)
2,293 itlb_misses.walk_completed_1g (57.11%)

10.064352587 seconds time elapsed

with MADV_COLLAPSE:

tps = 1113717.114278 (without initial connection time)

Performance counter stats for 'system wide':

1,173,049,140,611 cycles (71.42%)
1,059,224,678 iTLB-loads (71.44%)
653,603,712 iTLB-load-misses # 61.71% of all iTLB cache accesses (71.44%)
26,135,902,949 itlb_misses.walk_active (71.44%)
628,314,285 itlb_misses.walk_completed_4k (71.44%)
25,462,916 itlb_misses.walk_completed_2m_4m (57.13%)
2,228 itlb_misses.walk_completed_1g (57.13%)

Note that while the rate of itlb-misses stays roughly the same, the total
number of iTLB loads dropped substantially, and the number of cycles in
which an itlb miss was in progress is 1/3 of what it was before.

A lot of the remaining misses are from the context switches. The iTLB is
flushed on context switches, and of course pgbench -S is extremely context
switch heavy.

Comparing plain -S with 10 pipelined -S transactions (using -t 100000 / -t
10000 to compare the same amount of work) I get:

without MADV_COLLAPSE:

not pipelined:

tps = 1037732.722805 (without initial connection time)

Performance counter stats for 'system wide':

1,691,411,678,007 cycles (62.48%)
8,856,107 itlb.itlb_flush (62.48%)
4,600,041,062 iTLB-loads (62.48%)
2,598,218,236 iTLB-load-misses # 56.48% of all iTLB cache accesses (62.50%)
100,095,862,126 itlb_misses.walk_active (62.53%)
2,595,376,025 itlb_misses.walk_completed_4k (50.02%)
2,558,713 itlb_misses.walk_completed_2m_4m (50.00%)
2,146 itlb_misses.walk_completed_1g (49.98%)

14.582927646 seconds time elapsed

pipelined:

tps = 161947.008995 (without initial connection time)

Performance counter stats for 'system wide':

1,095,948,341,745 cycles (62.46%)
877,556 itlb.itlb_flush (62.46%)
4,576,237,561 iTLB-loads (62.48%)
307,971,166 iTLB-load-misses # 6.73% of all iTLB cache accesses (62.52%)
15,565,279,213 itlb_misses.walk_active (62.55%)
306,240,104 itlb_misses.walk_completed_4k (50.03%)
1,753,560 itlb_misses.walk_completed_2m_4m (50.00%)
2,189 itlb_misses.walk_completed_1g (49.96%)

9.374687885 seconds time elapsed

with MADV_COLLAPSE:

not pipelined:
tps = 1112040.859643 (without initial connection time)

Performance counter stats for 'system wide':

1,569,546,236,696 cycles (62.50%)
7,094,291 itlb.itlb_flush (62.51%)
1,599,845,097 iTLB-loads (62.51%)
692,042,864 iTLB-load-misses # 43.26% of all iTLB cache accesses (62.51%)
31,529,641,124 itlb_misses.walk_active (62.51%)
669,849,177 itlb_misses.walk_completed_4k (49.99%)
22,708,146 itlb_misses.walk_completed_2m_4m (49.99%)
2,752 itlb_misses.walk_completed_1g (49.99%)

13.611206182 seconds time elapsed

pipelined:

tps = 162484.443469 (without initial connection time)

Performance counter stats for 'system wide':

1,092,897,514,658 cycles (62.48%)
942,351 itlb.itlb_flush (62.48%)
233,996,092 iTLB-loads (62.48%)
102,155,575 iTLB-load-misses # 43.66% of all iTLB cache accesses (62.49%)
6,419,597,286 itlb_misses.walk_active (62.52%)
98,758,409 itlb_misses.walk_completed_4k (50.03%)
3,342,332 itlb_misses.walk_completed_2m_4m (50.02%)
2,190 itlb_misses.walk_completed_1g (49.98%)

9.355239897 seconds time elapsed

The difference in itlb.itlb_flush between pipelined / non-pipelined cases
unsurprisingly is stark.

While the pipelined case still sees a good bit reduced itlb traffic, the total
amount of cycles in which a walk is active is just not large enough to matter,
by the looks of it.

Greetings,

Andres Freund

#4 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#2)
Re: remap the .text segment into huge pages at run time

Hi,

On 2022-11-03 10:21:23 -0700, Andres Freund wrote:

- Add a "cold" __asm__ filler function that just takes up space, enough to
push the end of the .text segment over the next aligned boundary, or to
~8MB in size.

I don't understand why this is needed - as long as the pages are aligned to
2MB, why do we need to fill things up on disk? The in-memory contents are the
relevant bit, no?

I now assume it's because you observed that the mappings set up by the
loader do not include the space between the segments?

With sufficient linker flags the segments are sufficiently aligned both on
disk and in memory to just map more:

bfd: -Wl,-zmax-page-size=0x200000,-zcommon-page-size=0x200000
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
...
LOAD 0x0000000000000000 0x0000000000000000 0x0000000000000000
0x00000000000c7f58 0x00000000000c7f58 R 0x200000
LOAD 0x0000000000200000 0x0000000000200000 0x0000000000200000
0x0000000000921d39 0x0000000000921d39 R E 0x200000
LOAD 0x0000000000c00000 0x0000000000c00000 0x0000000000c00000
0x00000000002626b8 0x00000000002626b8 R 0x200000
LOAD 0x0000000000fdf510 0x00000000011df510 0x00000000011df510
0x0000000000037fd6 0x000000000006a310 RW 0x200000

gold -Wl,-zmax-page-size=0x200000,-zcommon-page-size=0x200000,--rosegment
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
...
LOAD 0x0000000000000000 0x0000000000000000 0x0000000000000000
0x00000000009230f9 0x00000000009230f9 R E 0x200000
LOAD 0x0000000000a00000 0x0000000000a00000 0x0000000000a00000
0x000000000033a738 0x000000000033a738 R 0x200000
LOAD 0x0000000000ddf4e0 0x0000000000fdf4e0 0x0000000000fdf4e0
0x000000000003800a 0x000000000006a340 RW 0x200000

lld: -Wl,-zmax-page-size=0x200000,-zseparate-loadable-segments
LOAD 0x0000000000000000 0x0000000000000000 0x0000000000000000
0x000000000033710c 0x000000000033710c R 0x200000
LOAD 0x0000000000400000 0x0000000000400000 0x0000000000400000
0x0000000000921cb0 0x0000000000921cb0 R E 0x200000
LOAD 0x0000000000e00000 0x0000000000e00000 0x0000000000e00000
0x0000000000020ae0 0x0000000000020ae0 RW 0x200000
LOAD 0x0000000001000000 0x0000000001000000 0x0000000001000000
0x00000000000174ea 0x0000000000049820 RW 0x200000

mold -Wl,-zmax-page-size=0x200000,-zcommon-page-size=0x200000,-zseparate-loadable-segments
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
...
LOAD 0x0000000000000000 0x0000000000000000 0x0000000000000000
0x000000000032dde9 0x000000000032dde9 R 0x200000
LOAD 0x0000000000400000 0x0000000000400000 0x0000000000400000
0x0000000000921cbe 0x0000000000921cbe R E 0x200000
LOAD 0x0000000000e00000 0x0000000000e00000 0x0000000000e00000
0x00000000002174e8 0x0000000000249820 RW 0x200000

With these flags the "R E" segments all start on a 0x200000/2MiB boundary and
are padded to the next 2MiB boundary. However the OS / dynamic loader only
maps the necessary part, not all the zero padding.

This means that if we were to issue a MADV_COLLAPSE, we can before it do an
mremap() to increase the length of the mapping.

MADV_COLLAPSE without mremap:

tps = 1117335.766756 (without initial connection time)

Performance counter stats for 'system wide':

1,169,012,466,070 cycles (55.53%)
729,146,640,019 instructions # 0.62 insn per cycle (66.65%)
7,062,923 itlb.itlb_flush (66.65%)
1,041,825,587 iTLB-loads (66.65%)
634,272,420 iTLB-load-misses # 60.88% of all iTLB cache accesses (66.66%)
27,018,254,873 itlb_misses.walk_active (66.68%)
610,639,252 itlb_misses.walk_completed_4k (44.47%)
24,262,549 itlb_misses.walk_completed_2m_4m (44.46%)
2,948 itlb_misses.walk_completed_1g (44.43%)

10.039217004 seconds time elapsed

MADV_COLLAPSE with mremap:

tps = 1140869.853616 (without initial connection time)

Performance counter stats for 'system wide':

1,173,272,878,934 cycles (55.53%)
746,008,850,147 instructions # 0.64 insn per cycle (66.65%)
7,538,962 itlb.itlb_flush (66.65%)
799,861,088 iTLB-loads (66.65%)
254,347,048 iTLB-load-misses # 31.80% of all iTLB cache accesses (66.66%)
14,427,296,885 itlb_misses.walk_active (66.69%)
221,811,835 itlb_misses.walk_completed_4k (44.47%)
32,881,405 itlb_misses.walk_completed_2m_4m (44.46%)
3,043 itlb_misses.walk_completed_1g (44.43%)

10.038517778 seconds time elapsed

compared to a run without any huge pages (via THP or MADV_COLLAPSE):

tps = 1034960.102843 (without initial connection time)

Performance counter stats for 'system wide':

1,183,743,785,066 cycles (55.54%)
678,525,810,443 instructions # 0.57 insn per cycle (66.65%)
7,163,304 itlb.itlb_flush (66.65%)
2,952,660,798 iTLB-loads (66.65%)
2,105,431,590 iTLB-load-misses # 71.31% of all iTLB cache accesses (66.66%)
80,593,535,910 itlb_misses.walk_active (66.68%)
2,105,377,810 itlb_misses.walk_completed_4k (44.46%)
1,254,156 itlb_misses.walk_completed_2m_4m (44.46%)
3,366 itlb_misses.walk_completed_1g (44.44%)

10.039821650 seconds time elapsed

So a 7.96% win from no-huge-pages to MADV_COLLAPSE and a further 2.11% win
from there to also using mremap(), yielding a total of 10.23%. It's similar
across runs.

On my system the other libraries unfortunately aren't aligned properly. It'd
be nice to also remap at least libc. The majority of the remaining misses are
from the vdso (too small for a huge page), libc (not aligned properly),
returning from system calls (which flush the itlb) and pgbench / libpq (I
didn't add the mremap there, there's not enough code for a huge page without
it).

Greetings,

Andres Freund

#5 John Naylor
john.naylor@enterprisedb.com
In reply to: Andres Freund (#2)
Re: remap the .text segment into huge pages at run time

On Sat, Nov 5, 2022 at 1:33 AM Andres Freund <andres@anarazel.de> wrote:

> > I wonder how far we can get with just using the linker hints to align
> > sections. I know that the linux folks are working on promoting sufficiently
> > aligned executable pages to huge pages too, and might have succeeded already.
> >
> > IOW, adding the linker flags might be a good first step.
>
> Indeed, I did see that that works to some degree on the 5.19 kernel I was
> running. However, it never seems to get around to using huge pages
> sufficiently to compete with explicit use of huge pages.

Oh nice, I didn't know that! There might be some threshold of pages mapped
before it does so. At least, that issue is mentioned in that paper linked
upthread for FreeBSD.

> More interestingly, a few days ago, a new madvise hint, MADV_COLLAPSE, was
> added into linux 6.1. That explicitly remaps a region and uses huge pages
> for it. Of course that's going to take a while to be widely available, but
> it seems like a safer approach than the remapping approach from this thread.

I didn't know that either, funny timing.

> I hacked in a MADV_COLLAPSE (with setarch -R, so that I could just hardcode
> the address / length), and it seems to work nicely.
>
> With the weird caveat that on some filesystems one needs to make sure that
> the executable doesn't use reflinks to share parts of other files, which
> the mold linker and cp do... Not a concern on ext4, but on xfs. I took to
> copying the postgres binary with cp --reflink=never

What happens otherwise? That sounds like a difficult thing to guard against.

> The difference in itlb.itlb_flush between pipelined / non-pipelined cases
> unsurprisingly is stark.
>
> While the pipelined case still sees a good bit reduced itlb traffic, the
> total amount of cycles in which a walk is active is just not large enough
> to matter, by the looks of it.

Good to know, thanks for testing. Maybe the pipelined case is something
devs should consider when microbenchmarking, to reduce noise from context
switches.

On Sat, Nov 5, 2022 at 4:21 AM Andres Freund <andres@anarazel.de> wrote:

> Hi,
>
> On 2022-11-03 10:21:23 -0700, Andres Freund wrote:
> > > - Add a "cold" __asm__ filler function that just takes up space, enough to
> > >   push the end of the .text segment over the next aligned boundary, or to
> > >   ~8MB in size.
> >
> > I don't understand why this is needed - as long as the pages are aligned to
> > 2MB, why do we need to fill things up on disk? The in-memory contents are the
> > relevant bit, no?
>
> I now assume it's because you observed that the mappings set up by the
> loader do not include the space between the segments?

My knowledge is not quite that deep. The iodlr repo has an example "hello
world" program, which links with 8 filler objects, each with 32768
__attribute((used)) dummy functions. I just cargo-culted that idea and
simplified it. Interestingly enough, looking through the commit history,
they used to align the segments via linker flags, but took it out here:

https://github.com/intel/iodlr/pull/25#discussion_r397787559

...saying "I'm not sure why we added this". :/

I quickly tried to align the segments with the linker and then in my patch
have the address for mmap() rounded *down* from the .text start to the
beginning of that segment. It refused to start without logging an error.

BTW, that's what I meant before, although I wasn't clear:

> > Since the front is all-cold, and there is very little at the end,
> > practically all hot pages are now remapped. The biggest problem with the
> > hackish filler function (in addition to maintainability) is, if explicit
> > huge pages are turned off in the kernel, attempting mmap() with MAP_HUGETLB
> > causes complete startup failure if the .text segment is larger than 8MB.
>
> I would expect MAP_HUGETLB to always fail if not enabled in the kernel,
> independent of the .text segment size?

With the file-level hack, it would just fail without a trace with .text >
8MB (I have yet to enable core dumps on this new OS I have...), whereas
without it I did see the failures in the log, and successful fallback.

With these flags the "R E" segments all start on a 0x200000/2MiB boundary

and

are padded to the next 2MiB boundary. However the OS / dynamic loader only
maps the necessary part, not all the zero padding.

This means that if we were to issue a MADV_COLLAPSE, we can before it do

an

mremap() to increase the length of the mapping.

I see, interesting. What location are you passing for madvise() and
mremap()? The beginning of the segment (for me has .init/.plt) or an
aligned boundary within .text?

--
John Naylor
EDB: http://www.enterprisedb.com

#6 Andres Freund
andres@anarazel.de
In reply to: John Naylor (#5)
Re: remap the .text segment into huge pages at run time

Hi,

On 2022-11-05 12:54:18 +0700, John Naylor wrote:

> On Sat, Nov 5, 2022 at 1:33 AM Andres Freund <andres@anarazel.de> wrote:
> > I hacked in a MADV_COLLAPSE (with setarch -R, so that I could just hardcode
> > the address / length), and it seems to work nicely.
> >
> > With the weird caveat that on some filesystems one needs to make sure that
> > the executable doesn't use reflinks to share parts of other files, which
> > the mold linker and cp do... Not a concern on ext4, but on xfs. I took to
> > copying the postgres binary with cp --reflink=never
>
> What happens otherwise? That sounds like a difficult thing to guard against.

MADV_COLLAPSE fails, but otherwise things continue on. I think it's mostly an
issue on dev systems, not on prod systems, because there the files will be
unpacked from a package or such.

> > On 2022-11-03 10:21:23 -0700, Andres Freund wrote:
> > > > - Add a "cold" __asm__ filler function that just takes up space, enough to
> > > >   push the end of the .text segment over the next aligned boundary, or to
> > > >   ~8MB in size.
> > >
> > > I don't understand why this is needed - as long as the pages are aligned to
> > > 2MB, why do we need to fill things up on disk? The in-memory contents are the
> > > relevant bit, no?
> >
> > I now assume it's because you observed that the mappings set up by the
> > loader do not include the space between the segments?
>
> My knowledge is not quite that deep. The iodlr repo has an example "hello
> world" program, which links with 8 filler objects, each with 32768
> __attribute((used)) dummy functions. I just cargo-culted that idea and
> simplified it. Interestingly enough, looking through the commit history,
> they used to align the segments via linker flags, but took it out here:
>
> https://github.com/intel/iodlr/pull/25#discussion_r397787559
>
> ...saying "I'm not sure why we added this". :/

That was about using a linker script, not really linker flags though.

I don't think the dummy functions are a good approach, there were plenty
things after it when I played with them.

> I quickly tried to align the segments with the linker and then in my patch
> have the address for mmap() rounded *down* from the .text start to the
> beginning of that segment. It refused to start without logging an error.

Hm, what linker was that? I did note that you need some additional flags for
some of the linkers.

With these flags the "R E" segments all start on a 0x200000/2MiB boundary

and

are padded to the next 2MiB boundary. However the OS / dynamic loader only
maps the necessary part, not all the zero padding.

This means that if we were to issue a MADV_COLLAPSE, we can before it do

an

mremap() to increase the length of the mapping.

I see, interesting. What location are you passing for madvise() and
mremap()? The beginning of the segment (for me has .init/.plt) or an
aligned boundary within .text?

I started postgres with setarch -R, looked at /proc/$pid/[s]maps to see the
start/end of the r-xp mapped segment. Here's my hacky code, with a bunch of
comments added.

	void	   *addr = (void *) 0x555555800000;
	void	   *end = (void *) 0x555555e09000;
	size_t		advlen = (uintptr_t) end - (uintptr_t) addr;

	const size_t bound = 1024 * 1024 * 2;
	size_t		advlen_up = (advlen + bound - 1) & ~(bound - 1);
	void	   *r2;
	int			r;

	/*
	 * Increase size of mapping to cover the trailing padding to the next
	 * segment. Otherwise all the code in that range can't be put into
	 * a huge page (access in the non-mapped range needs to cause a fault,
	 * hence can't be in the huge page).
	 * XXX: Should probably assert that that space is actually zeroes.
	 */
	r2 = mremap(addr, advlen, advlen_up, 0);
	if (r2 == MAP_FAILED)
		fprintf(stderr, "mremap failed: %m\n");
	else if (r2 != addr)
		fprintf(stderr, "mremap wrong addr: %m\n");
	else
		advlen = advlen_up;

	/*
	 * The docs for MADV_COLLAPSE say there should be at least one page
	 * in the mapped space "for every eligible hugepage-aligned/sized
	 * region to be collapsed". I just forced that. But probably not
	 * necessary.
	 */
	r = madvise(addr, advlen, MADV_WILLNEED);
	if (r != 0)
		fprintf(stderr, "MADV_WILLNEED failed: %m\n");

	r = madvise(addr, advlen, MADV_POPULATE_READ);
	if (r != 0)
		fprintf(stderr, "MADV_POPULATE_READ failed: %m\n");

	/*
	 * Make huge pages out of it. Requires at least linux 6.1. We could
	 * fall back to MADV_HUGEPAGE if it fails, but it doesn't do all that
	 * much in older kernels.
	 */
#define MADV_COLLAPSE 25
	r = madvise(addr, advlen, MADV_COLLAPSE);
	if (r != 0)
		fprintf(stderr, "MADV_COLLAPSE failed: %m\n");

A real version would have to open /proc/self/maps and do this for at least
postgres' r-xp mapping. We could do it for libraries too, if they're suitably
aligned (both in memory and on-disk).
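
(A minimal sketch of that maps-parsing step, with made-up names and no claim
to completeness: find the r-xp mapping belonging to a given path.)

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static bool
    find_exec_mapping(const char *exe_path, uintptr_t *start, uintptr_t *end)
    {
        FILE       *maps = fopen("/proc/self/maps", "r");
        char        line[1024];
        bool        found = false;

        if (maps == NULL)
            return false;

        while (fgets(line, sizeof(line), maps))
        {
            unsigned long lo, hi;
            char        perms[5];
            char        name[512];

            /* each line: start-end perms offset dev inode path */
            if (sscanf(line, "%lx-%lx %4s %*s %*s %*s %511s",
                       &lo, &hi, perms, name) != 4)
                continue;

            if (strcmp(perms, "r-xp") == 0 && strcmp(name, exe_path) == 0)
            {
                *start = (uintptr_t) lo;
                *end = (uintptr_t) hi;
                found = true;
                break;
            }
        }
        fclose(maps);
        return found;
    }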

Greetings,

Andres Freund

#7 John Naylor
john.naylor@enterprisedb.com
In reply to: Andres Freund (#6)
Re: remap the .text segment into huge pages at run time

On Sat, Nov 5, 2022 at 3:27 PM Andres Freund <andres@anarazel.de> wrote:

> > simplified it. Interestingly enough, looking through the commit history,
> > they used to align the segments via linker flags, but took it out here:
> >
> > https://github.com/intel/iodlr/pull/25#discussion_r397787559
> >
> > ...saying "I'm not sure why we added this". :/
>
> That was about using a linker script, not really linker flags though.

Oops, the commit I was referring to pointed to that discussion, but I
should have shown it instead:

--- a/large_page-c/example/Makefile
+++ b/large_page-c/example/Makefile
@@ -28,7 +28,6 @@ OBJFILES=              \
   filler16.o           \

OBJS=$(addprefix $(OBJDIR)/,$(OBJFILES))
-LDFLAGS=-Wl,-z,max-page-size=2097152

But from what you're saying, this flag wouldn't have been enough anyway...

> I don't think the dummy functions are a good approach, there were plenty of
> things after it when I played with them.

To be technical, the point wasn't to have no code after it, but to have no
*hot* code *before* it, since with the iodlr approach the first 1.99MB of
.text is below the first aligned boundary within that section. But yeah,
I'm happy to ditch that hack entirely.

With these flags the "R E" segments all start on a 0x200000/2MiB

boundary

and

are padded to the next 2MiB boundary. However the OS / dynamic loader

only

maps the necessary part, not all the zero padding.

This means that if we were to issue a MADV_COLLAPSE, we can before it

do

an

mremap() to increase the length of the mapping.

I see, interesting. What location are you passing for madvise() and
mremap()? The beginning of the segment (for me has .init/.plt) or an
aligned boundary within .text?

/*
* Make huge pages out of it. Requires at least linux 6.1. We

could

* fall back to MADV_HUGEPAGE if it fails, but it doesn't do all

that

* much in older kernels.
*/

About madvise(), I take it MADV_HUGEPAGE and MADV_COLLAPSE only work for
THP? The man page seems to indicate that.

In the support work I've done, the standard recommendation is to turn THP
off, especially if they report sudden performance problems. If explicit
HP's are used for shared mem, maybe THP is less of a risk? I need to look
back at the tests that led to that advice...

> A real version would have to open /proc/self/maps and do this for at least
> postgres' r-xp mapping.

I can try and generalize your above sketch into a v2 patch.

> We could do it for libraries too, if they're suitably aligned (both in
> memory and on-disk).

It looks like plpgsql is only 27 standard pages in size...

Regarding glibc, we could try moving a couple of the hotter functions into
PG, using smaller and simpler coding, if that has better frontend cache
behavior. The paper "Understanding and Mitigating Front-End Stalls in
Warehouse-Scale Computers" talks about this, particularly section 4.4
regarding memcmp().

> > I quickly tried to align the segments with the linker and then in my patch
> > have the address for mmap() rounded *down* from the .text start to the
> > beginning of that segment. It refused to start without logging an error.
>
> Hm, what linker was that? I did note that you need some additional flags for
> some of the linkers.

BFD, but I wouldn't worry about that failure too much, since the
mremap()/madvise() strategy has a lot fewer moving parts.

On the subject of linkers, though, one thing that tripped me up was trying
to change the linker with Meson. First I tried

-Dc_args='-fuse-ld=lld'

but that led to warnings like this:
/usr/bin/ld: warning: -z separate-loadable-segments ignored

When using this in the top level meson.build

  elif host_system == 'linux'
    sema_kind = 'unnamed_posix'
    cppflags += '-D_GNU_SOURCE'
    # Align the loadable segments to 2MB boundaries to support remapping to
    # huge pages.
    ldflags += cc.get_supported_link_arguments([
      '-Wl,-zmax-page-size=0x200000',
      '-Wl,-zcommon-page-size=0x200000',
      '-Wl,-zseparate-loadable-segments'
    ])

According to

https://mesonbuild.com/howtox.html#set-linker

I need to add CC_LD=lld to the env vars before invoking, which got rid of
the warning. Then I wanted to verify that lld was actually used, and in

https://releases.llvm.org/14.0.0/tools/lld/docs/index.html

it says I can run this and it should show “Linker: LLD”, but that doesn't
appear for me:

$ readelf --string-dump .comment inst-perf/bin/postgres

String dump of section '.comment':
[ 0] GCC: (GNU) 12.2.1 20220819 (Red Hat 12.2.1-2)

--
John Naylor
EDB: http://www.enterprisedb.com

#8 Andres Freund
andres@anarazel.de
In reply to: John Naylor (#7)
Re: remap the .text segment into huge pages at run time

Hi,

On 2022-11-06 13:56:10 +0700, John Naylor wrote:

> On Sat, Nov 5, 2022 at 3:27 PM Andres Freund <andres@anarazel.de> wrote:
> > I don't think the dummy functions are a good approach, there were plenty of
> > things after it when I played with them.
>
> To be technical, the point wasn't to have no code after it, but to have no
> *hot* code *before* it, since with the iodlr approach the first 1.99MB of
> .text is below the first aligned boundary within that section. But yeah,
> I'm happy to ditch that hack entirely.

Just because code is colder than the alternative branch doesn't necessarily
mean it's entirely cold overall. I saw hits to things after the dummy
function have a perf effect.

With these flags the "R E" segments all start on a 0x200000/2MiB

boundary

and

are padded to the next 2MiB boundary. However the OS / dynamic loader

only

maps the necessary part, not all the zero padding.

This means that if we were to issue a MADV_COLLAPSE, we can before it

do

an

mremap() to increase the length of the mapping.

I see, interesting. What location are you passing for madvise() and
mremap()? The beginning of the segment (for me has .init/.plt) or an
aligned boundary within .text?

/*
* Make huge pages out of it. Requires at least linux 6.1. We

could

* fall back to MADV_HUGEPAGE if it fails, but it doesn't do all

that

* much in older kernels.
*/

About madvise(), I take it MADV_HUGEPAGE and MADV_COLLAPSE only work for
THP? The man page seems to indicate that.

MADV_HUGEPAGE works as long as /sys/kernel/mm/transparent_hugepage/enabled is
to always or madvise. My understanding is that MADV_COLLAPSE will work even
if /sys/kernel/mm/transparent_hugepage/enabled is set to never.

> In the support work I've done, the standard recommendation is to turn THP
> off, especially if they report sudden performance problems.

I think that's pretty much an outdated suggestion FWIW. Largely caused by Red
Hat extremely aggressively backpatching transparent hugepages into RHEL 6
(IIRC). Lots of improvements have been made to THP since then. I've tried to
see negative effects maybe 2-3 years back, without success.

I really don't see a reason to ever set
/sys/kernel/mm/transparent_hugepage/enabled to 'never', rather than just 'madvise'.

If explicit HP's are used for shared mem, maybe THP is less of a risk? I
need to look back at the tests that led to that advice...

I wouldn't give that advice to customers anymore, unless they use extremely
old platforms or unless there's very concrete evidence.

A real version would have to open /proc/self/maps and do this for at least

I can try and generalize your above sketch into a v2 patch.

Cool.

postgres' r-xp mapping. We could do it for libraries too, if they're
suitably aligned (both in memory and on-disk).

It looks like plpgsql is only 27 standard pages in size...
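(With 4kB pages that's only ~108kB, i.e. not even close to filling a single
2MB huge page.)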

Regarding glibc, we could try moving a couple of the hotter functions into
PG, using smaller and simpler coding, if that has better frontend cache
behavior. The paper "Understanding and Mitigating Front-End Stalls in
Warehouse-Scale Computers" talks about this, particularly section 4.4
regarding memcmp().

I think the amount of work necessary for that is nontrivial and continual,
so I'm loath to go there.

I quickly tried to align the segments with the linker and then in my
patch have the address for mmap() rounded *down* from the .text start to
the beginning of that segment. It refused to start without logging an
error.

Hm, what linker was that? I did note that you need some additional flags
for some of the linkers.

BFD, but I wouldn't worry about that failure too much, since the
mremap()/madvise() strategy has a lot fewer moving parts.

On the subject of linkers, though, one thing that tripped me up was trying
to change the linker with Meson. First I tried

-Dc_args='-fuse-ld=lld'

It's -Dc_link_args=...

but that led to warnings like this:
/usr/bin/ld: warning: -z separate-loadable-segments ignored

When using this in the top-level meson.build:

elif host_system == 'linux'
  sema_kind = 'unnamed_posix'
  cppflags += '-D_GNU_SOURCE'
  # Align the loadable segments to 2MB boundaries to support remapping to
  # huge pages.
  ldflags += cc.get_supported_link_arguments([
    '-Wl,-zmax-page-size=0x200000',
    '-Wl,-zcommon-page-size=0x200000',
    '-Wl,-zseparate-loadable-segments'
  ])

According to

https://mesonbuild.com/howtox.html#set-linker

I need to add CC_LD=lld to the env vars before invoking, which got rid of
the warning. Then I wanted to verify that lld was actually used, and in

https://releases.llvm.org/14.0.0/tools/lld/docs/index.html

You can just look at build.ninja, fwiw. Or use ninja -v (in postgres's case
with -d keeprsp, because the command line ends up being long enough that a
response file is used).

it says I can run this and it should show "Linker: LLD", but that doesn't
appear for me:

$ readelf --string-dump .comment inst-perf/bin/postgres

String dump of section '.comment':
[ 0] GCC: (GNU) 12.2.1 20220819 (Red Hat 12.2.1-2)

That's added by the compiler, not the linker. See e.g.:

$ readelf --string-dump .comment src/backend/postgres_lib.a.p/storage_ipc_procarray.c.o

String dump of section '.comment':
[ 1] GCC: (Debian 12.2.0-9) 12.2.0

Greetings,

Andres Freund

#9John Naylor
john.naylor@enterprisedb.com
In reply to: Andres Freund (#6)
Re: remap the .text segment into huge pages at run time

On Sat, Nov 5, 2022 at 3:27 PM Andres Freund <andres@anarazel.de> wrote:

/*
 * Make huge pages out of it. Requires at least linux 6.1. We could
 * fall back to MADV_HUGEPAGE if it fails, but it doesn't do all that
 * much in older kernels.
 */
#define MADV_COLLAPSE 25
r = madvise(addr, advlen, MADV_COLLAPSE);
if (r != 0)
	fprintf(stderr, "MADV_COLLAPSE failed: %m\n");

A real version would have to open /proc/self/maps and do this for at least
postgres' r-xp mapping. We could do it for libraries too, if they're
suitably aligned (both in memory and on-disk).

Hi Andres, my kernel has been new enough for a while now, and since TLBs
and context switches came up in the thread on... threads, I'm swapping this
back in my head.

For the postmaster, it should be simple to have a function that just takes
the address of itself, then parses /proc/self/maps to find the boundaries
within which it lies. I haven't thought about libraries much. Though with
just the postmaster it seems that would give us the biggest bang for the
buck?

--
John Naylor
EDB: http://www.enterprisedb.com

#10Andres Freund
andres@anarazel.de
In reply to: John Naylor (#9)
Re: remap the .text segment into huge pages at run time

Hi,

On 2023-06-14 12:40:18 +0700, John Naylor wrote:

On Sat, Nov 5, 2022 at 3:27 PM Andres Freund <andres@anarazel.de> wrote:

/*
 * Make huge pages out of it. Requires at least linux 6.1. We could
 * fall back to MADV_HUGEPAGE if it fails, but it doesn't do all that
 * much in older kernels.
 */
#define MADV_COLLAPSE 25
r = madvise(addr, advlen, MADV_COLLAPSE);
if (r != 0)
	fprintf(stderr, "MADV_COLLAPSE failed: %m\n");

A real version would have to open /proc/self/maps and do this for at least
postgres' r-xp mapping. We could do it for libraries too, if they're
suitably aligned (both in memory and on-disk).

Hi Andres, my kernel has been new enough for a while now, and since TLBs
and context switches came up in the thread on... threads, I'm swapping this
back in my head.

Cool - I think we have some real potential for substantial wins around this.

For the postmaster, it should be simple to have a function that just takes
the address of itself, then parses /proc/self/maps to find the boundaries
within which it lies. I haven't thought about libraries much. Though with
just the postmaster it seems that would give us the biggest bang for the
buck?

I think that is the main bit, yes. We could just try to do this for the
libraries, but accept failure to do so?
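
A best-effort version might look something like the following sketch
(hypothetical and untested; the function name and details are illustrative,
not from any posted patch): walk /proc/self/maps and attempt MADV_COLLAPSE
on every executable file-backed mapping, silently tolerating failure:

#define _GNU_SOURCE
#include <inttypes.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25		/* Linux 6.1+ */
#endif

static void
CollapseAllTextMappings(void)
{
	FILE	   *fp = fopen("/proc/self/maps", "r");
	char		line[256];

	if (!fp)
		return;

	while (fgets(line, sizeof(line), fp))
	{
		uintptr_t	start;
		uintptr_t	end;
		char		perms[5];

		/* each line: start-end perms offset dev inode path */
		if (sscanf(line, "%" SCNxPTR "-%" SCNxPTR " %4s",
				   &start, &end, perms) != 3)
			continue;
		if (strcmp(perms, "r-xp") != 0)
			continue;

		/* best effort: unaligned or unsupported ranges simply fail */
		(void) madvise((void *) start, end - start, MADV_COLLAPSE);
	}
	fclose(fp);
}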

Greetings,

Andres Freund

#11John Naylor
john.naylor@enterprisedb.com
In reply to: John Naylor (#9)
2 attachment(s)
Re: remap the .text segment into huge pages at run time

On Wed, Jun 14, 2023 at 12:40 PM John Naylor <john.naylor@enterprisedb.com>
wrote:

On Sat, Nov 5, 2022 at 3:27 PM Andres Freund <andres@anarazel.de> wrote:

A real version would have to open /proc/self/maps and do this for at
least postgres' r-xp mapping. We could do it for libraries too, if
they're suitably aligned (both in memory and on-disk).

For the postmaster, it should be simple to have a function that just
takes the address of itself, then parses /proc/self/maps to find the
boundaries within which it lies. I haven't thought about libraries much.
Though with just the postmaster it seems that would give us the biggest
bang for the buck?

Here's a start at that, trying with postmaster only. Unfortunately, I get
"MADV_COLLAPSE failed: Invalid argument". I tried different addresses with
no luck, and also got the same result with a small standalone program. I'm
on ext4, so I gather I don't need "cp --reflink=never" but tried it anyway.
Configuration looks normal per "grep HUGEPAGE /boot/config-$(uname
-r)". Maybe there's something obvious I'm missing?

--
John Naylor
EDB: http://www.enterprisedb.com

Attachments:

v2-0002-Attempt-to-remap-the-.text-segment-into-huge-page.patch (text/x-patch)
From ca38a370e866d27c8b51c83f8f18bdda1587b3df Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Mon, 31 Oct 2022 15:24:29 +0700
Subject: [PATCH v2 2/2] Attempt to remap the .text segment into huge pages at
 postmaster start

Use MADV_COLLAPSE advice, available since Linux kernel 6.1.

Andres Freund and John Naylor
---
 src/backend/port/huge_page.c        | 113 ++++++++++++++++++++++++++++
 src/backend/port/meson.build        |   4 +
 src/backend/postmaster/postmaster.c |   7 ++
 src/include/port/huge_page.h        |  18 +++++
 4 files changed, 142 insertions(+)
 create mode 100644 src/backend/port/huge_page.c
 create mode 100644 src/include/port/huge_page.h

diff --git a/src/backend/port/huge_page.c b/src/backend/port/huge_page.c
new file mode 100644
index 0000000000..92f87bb3c2
--- /dev/null
+++ b/src/backend/port/huge_page.c
@@ -0,0 +1,113 @@
+/*-------------------------------------------------------------------------
+ *
+ * huge_page.c
+ *	  Map .text segment of binary to huge pages
+ *
+ * TODO: there would be better rationale for a separate file if the
+ * huge page handling in sysv_shmem.c were moved here.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *	  src/backend/port/huge_page.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <sys/mman.h>
+
+#include "port/huge_page.h"
+#include "storage/fd.h"
+
+/*
+ * Collapse specified memory range to huge pages.
+ */
+static void
+CollapseRegionToHugePages(void *addr, size_t advlen)
+{
+#ifdef __linux__
+	size_t advlen_up;
+	int r;
+	void *r2;
+	const size_t bound = 1024*1024*2; // FIXME: x86
+
+	fprintf(stderr, "old advlen: %lx\n", advlen);
+	advlen_up = (advlen + bound - 1) & ~(bound - 1);
+
+	/*
+	 * Increase size of mapping to cover the trailing padding to the next
+	 * segment. Otherwise all the code in that range can't be put into
+	 * a huge page (access in the non-mapped range needs to cause a fault,
+	 * hence can't be in the huge page).
+	 * XXX: Should probably assert that that space is actually zeroes.
+	 */
+	r2 = mremap(addr, advlen, advlen_up, 0);
+	if (r2 == MAP_FAILED)
+		fprintf(stderr, "mremap failed: %m\n");
+	else if (r2 != addr)
+		fprintf(stderr, "mremap wrong addr: %m\n");
+	else
+		advlen = advlen_up;
+
+	fprintf(stderr, "new advlen: %lx\n", advlen);
+
+	/*
+	 * The docs for MADV_COLLAPSE say there should be at least one page
+	 * in the mapped space "for every eligible hugepage-aligned/sized
+	 * region to be collapsed". I just forced that. But probably not
+	 * necessary.
+	 */
+	r = madvise(addr, advlen, MADV_WILLNEED);
+	if (r != 0)
+		fprintf(stderr, "MADV_WILLNEED failed: %m\n");
+
+	r = madvise(addr, advlen, MADV_POPULATE_READ);
+	if (r != 0)
+		fprintf(stderr, "MADV_POPULATE_READ failed: %m\n");
+
+	/*
+	 * Make huge pages out of it. Requires at least linux 6.1.  We could
+	 * fall back to MADV_HUGEPAGE if it fails, but it doesn't do all that
+	 * much in older kernels.
+	 */
+	r = madvise(addr, advlen, MADV_COLLAPSE);
+	if (r != 0)
+	{
+		fprintf(stderr, "MADV_COLLAPSE failed: %m\n");
+
+		r = madvise(addr, advlen, MADV_HUGEPAGE);
+		if (r != 0)
+			fprintf(stderr, "MADV_HUGEPAGE failed: %m\n");
+	}
+#endif
+}
+
+/*  Map the postgres .text segment into huge pages. */
+void
+MapStaticCodeToLargePages(void)
+{
+#ifdef __linux__
+	FILE	   *fp = AllocateFile("/proc/self/maps", "r");
+	char		buf[128]; // got this from code reading /proc/meminfo -- enough?
+	uintptr_t 	addr;
+	uintptr_t 	end;
+	void * 		self = &MapStaticCodeToLargePages;
+
+	if (fp)
+	{
+		while (fgets(buf, sizeof(buf), fp))
+		{
+			if (sscanf(buf, "%lx-%lx", &addr, &end) == 2 &&
+				addr <= (uintptr_t) self && (uintptr_t) self < end)
+			{
+				fprintf(stderr, "self: %p start: %lx end: %lx\n", self, addr, end);
+				CollapseRegionToHugePages((void *) addr, end - addr);
+				break;
+			}
+		}
+		FreeFile(fp);
+	}
+#endif
+}
diff --git a/src/backend/port/meson.build b/src/backend/port/meson.build
index 8fa68a88aa..af4d0c7bb7 100644
--- a/src/backend/port/meson.build
+++ b/src/backend/port/meson.build
@@ -25,6 +25,10 @@ if cdata.has('USE_WIN32_SHARED_MEMORY')
   backend_sources += files('win32_shmem.c')
 endif
 
+if host_system == 'linux'
+  backend_sources += files('huge_page.c')
+endif
+
 if host_system == 'windows'
   subdir('win32')
 endif
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 4c49393fc5..216e8c5730 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -106,6 +106,7 @@
 #include "pg_getopt.h"
 #include "pgstat.h"
 #include "port/pg_bswap.h"
+#include "port/huge_page.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/auxprocess.h"
 #include "postmaster/bgworker_internals.h"
@@ -1007,6 +1008,12 @@ PostmasterMain(int argc, char *argv[])
 	 */
 	process_shared_preload_libraries();
 
+	/*
+	 * Try to map the binary code to huge pages. We do this just after
+	 * any shared libraries are preloaded for future-proofing.
+	 */
+	MapStaticCodeToLargePages();
+
 	/*
 	 * Initialize SSL library, if specified.
 	 */
diff --git a/src/include/port/huge_page.h b/src/include/port/huge_page.h
new file mode 100644
index 0000000000..171819dd53
--- /dev/null
+++ b/src/include/port/huge_page.h
@@ -0,0 +1,18 @@
+/*-------------------------------------------------------------------------
+ *
+ * huge_page.h
+ *	  Map .text segment of binary to huge pages
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *	  src/include/port/huge_page.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef HUGE_PAGE_H
+#define HUGE_PAGE_H
+
+extern void MapStaticCodeToLargePages(void);
+
+#endif							/* HUGE_PAGE_H */
-- 
2.40.1

v2-0001-Align-loadable-segments-to-2MB-boundaries-on-Linu.patch (text/x-patch)
From e8a0c8633e969ad45eef82b40460df6552e6e550 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Fri, 2 Dec 2022 14:58:21 +0700
Subject: [PATCH v2 1/2] Align loadable segments to 2MB boundaries on Linux

Prerequisite for using huge pages for the .text section
on that platform.

TODO: autoconf support

Andres Freund and John Naylor
---
 meson.build | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/meson.build b/meson.build
index 16b2e86646..32af8bf5c3 100644
--- a/meson.build
+++ b/meson.build
@@ -248,6 +248,13 @@ elif host_system == 'freebsd'
 elif host_system == 'linux'
   sema_kind = 'unnamed_posix'
   cppflags += '-D_GNU_SOURCE'
+  # Align the loadable segments to 2MB boundaries to support remapping to
+  # huge pages.
+  ldflags += cc.get_supported_link_arguments([
+    '-Wl,-zmax-page-size=0x200000',
+    '-Wl,-zcommon-page-size=0x200000',
+    '-Wl,-zseparate-loadable-segments'
+  ])
 
 elif host_system == 'netbsd'
   # We must resolve all dynamic linking in the core server at program start.
-- 
2.40.1

#12Andres Freund
andres@anarazel.de
In reply to: John Naylor (#11)
Re: remap the .text segment into huge pages at run time

Hi,

On 2023-06-20 10:23:14 +0700, John Naylor wrote:

Here's a start at that, trying with postmaster only. Unfortunately, I get
"MADV_COLLAPSE failed: Invalid argument".

I also see that. But depending on the steps, I also see
MADV_COLLAPSE failed: Resource temporarily unavailable

I suspect there's some kernel issue. I'll try to ping somebody.

Greetings,

Andres Freund

#13Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#12)
Re: remap the .text segment into huge pages at run time

Hi,

On 2023-06-20 10:29:41 -0700, Andres Freund wrote:

On 2023-06-20 10:23:14 +0700, John Naylor wrote:

Here's a start at that, trying with postmaster only. Unfortunately, I get
"MADV_COLLAPSE failed: Invalid argument".

I also see that. But depending on the steps, I also see
MADV_COLLAPSE failed: Resource temporarily unavailable

I suspect there's some kernel issue. I'll try to ping somebody.

Which kernel version are you using? It looks like the issue I am hitting might
be specific to the in-development 6.4 kernel.

One thing I now remember, after trying older kernels, is that it looks like
one sometimes needs to call 'sync' to ensure the page cache data for the
executable is clean, before executing postgres.

Greetings,

Andres Freund

#14John Naylor
john.naylor@enterprisedb.com
In reply to: Andres Freund (#13)
Re: remap the .text segment into huge pages at run time

On Wed, Jun 21, 2023 at 12:46 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2023-06-20 10:29:41 -0700, Andres Freund wrote:

On 2023-06-20 10:23:14 +0700, John Naylor wrote:

Here's a start at that, trying with postmaster only. Unfortunately, I
get "MADV_COLLAPSE failed: Invalid argument".

I also see that. But depending on the steps, I also see
MADV_COLLAPSE failed: Resource temporarily unavailable

I suspect there's some kernel issue. I'll try to ping somebody.

Which kernel version are you using? It looks like the issue I am hitting
might be specific to the in-development 6.4 kernel.

(Fedora 38) uname -r shows

6.3.7-200.fc38.x86_64

--
John Naylor
EDB: http://www.enterprisedb.com

#15Andres Freund
andres@anarazel.de
In reply to: John Naylor (#14)
Re: remap the .text segment into huge pages at run time

Hi,

On 2023-06-21 09:35:36 +0700, John Naylor wrote:

On Wed, Jun 21, 2023 at 12:46 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2023-06-20 10:29:41 -0700, Andres Freund wrote:

On 2023-06-20 10:23:14 +0700, John Naylor wrote:

Here's a start at that, trying with postmaster only. Unfortunately, I
get "MADV_COLLAPSE failed: Invalid argument".

I also see that. But depending on the steps, I also see
MADV_COLLAPSE failed: Resource temporarily unavailable

I suspect there's some kernel issue. I'll try to ping somebody.

Which kernel version are you using? It looks like the issue I am hitting
might be specific to the in-development 6.4 kernel.

(Fedora 38) uname -r shows

6.3.7-200.fc38.x86_64

FWIW, I bisected the bug I was encountering.

As far as I understand, it should not affect you, it was only merged into
6.4-rc1 and a fix is scheduled to be merged into 6.4 before its release. See
https://lore.kernel.org/all/ZJIWAvTczl0rHJBv@x1n/

So I am wondering if you're encountering a different kind of problem. As I
mentioned, I have observed that the pages need to be clean for this to
work. For me adding a "sync path/to/postgres" makes it work on 6.3.8. Without
the sync it starts to work a while later (presumably when the kernel got
around to writing the data back).

without sync:

self: 0x563b2abf0a72 start: 563b2a800000 end: 563b2afe3000
old advlen: 7e3000
new advlen: 800000
MADV_COLLAPSE failed: Invalid argument

with sync:
self: 0x555c947f0a72 start: 555c94400000 end: 555c94be3000
old advlen: 7e3000
new advlen: 800000

Greetings,

Andres Freund

#16John Naylor
john.naylor@enterprisedb.com
In reply to: Andres Freund (#15)
Re: remap the .text segment into huge pages at run time

On Wed, Jun 21, 2023 at 10:42 AM Andres Freund <andres@anarazel.de> wrote:

So I am wondering if you're encountering a different kind of problem. As I
mentioned, I have observed that the pages need to be clean for this to
work. For me adding a "sync path/to/postgres" makes it work on 6.3.8.
Without the sync it starts to work a while later (presumably when the
kernel got around to writing the data back).

Hmm, then after rebooting today it shouldn't have had that problem until a
build linked again, but I'll make sure to run sync when building. Still the
same failure, though. Looking more closely at the manpage for madvise, it
has this under MADV_HUGEPAGE:

"The MADV_HUGEPAGE, MADV_NOHUGEPAGE, and MADV_COLLAPSE operations are
available only if the kernel was configured with
CONFIG_TRANSPARENT_HUGEPAGE and file/shmem memory is only supported if the
kernel was configured with CONFIG_READ_ONLY_THP_FOR_FS."

Earlier, I only checked the first config option but didn't know about the
second...

$ grep CONFIG_READ_ONLY_THP_FOR_FS /boot/config-$(uname -r)
# CONFIG_READ_ONLY_THP_FOR_FS is not set

Apparently, it's experimental. That could be the explanation, but now I'm
wondering why the fallback

madvise(addr, advlen, MADV_HUGEPAGE);

didn't also give an error. I wonder if we could mremap to some anonymous
region and call madvise on that. That would be more similar to the hack I
shared last year, which may be more fragile, but now it wouldn't
need explicit huge pages.
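
For concreteness, a hypothetical sketch of that idea (untested, and with the
usual caveat that it has to run from code outside the range being replaced):
copy the aligned text range to a scratch mapping, replace it with anonymous
executable memory, copy the code back, and then advise THP, which for
anonymous memory doesn't depend on CONFIG_READ_ONLY_THP_FOR_FS:

#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>

/*
 * Hypothetical sketch: rebuild a 2MB-aligned text range as anonymous
 * memory so THP no longer depends on CONFIG_READ_ONLY_THP_FOR_FS.
 * Must not live in, or call non-libc code living in, the range replaced.
 */
static int
RemapTextAsAnon(void *start, size_t len)
{
	void	   *tmp;

	/* stash a copy of the code in a scratch mapping */
	tmp = mmap(NULL, len, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (tmp == MAP_FAILED)
		return -1;
	memcpy(tmp, start, len);

	/* replace the file-backed text with anonymous executable memory */
	if (mmap(start, len, PROT_READ | PROT_WRITE | PROT_EXEC,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) == MAP_FAILED)
		return -1;
	memcpy(start, tmp, len);
	mprotect(start, len, PROT_READ | PROT_EXEC);
	munmap(tmp, len);

	/* anonymous memory is THP-eligible with enabled = madvise or always */
	return madvise(start, len, MADV_HUGEPAGE);
}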

--
John Naylor
EDB: http://www.enterprisedb.com