Improve monitoring of shared memory allocations

Started by Rahila Syed · 11 months ago · 19 messages
#1 Rahila Syed
rahilasyed90@gmail.com
2 attachment(s)

Hi,

The 0001* patch improves the accounting for the shared memory allocated for
a hash table during hash_create. pg_shmem_allocations tracks the memory
allocated by ShmemInitStruct, which, for shared hash tables, only covers the
memory allocated for the hash directory and control structure via
ShmemInitHash. The hash segments and buckets are allocated using
ShmemAllocNoError, which does not attribute the allocation to the hash table
and also does not add it to ShmemIndex.

Therefore, these allocations are not tracked in pg_shmem_allocations.
To improve this, include the allocation of segments and buckets in the
initial allocation of the shared memory for the hash table, in ShmemInitHash.

This will result in pg_shmem_allocations representing the total size of the
initial hash table, including all the buckets and elements, instead of just
the directory size.
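
With the patch, the existing entries for shared hash tables in
pg_shmem_allocations grow to cover the whole initial table. For example, a
query along these lines (just an illustration, using the view's existing
columns) can be used to compare the reported sizes before and after applying
the patch:

    SELECT name, size, allocated_size
    FROM pg_shmem_allocations
    WHERE name LIKE '%hash%'
    ORDER BY allocated_size DESC;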

Like ShmemAllocNoError, the shared memory allocated by ShmemAlloc is not
tracked by pg_shmem_allocations. The 0002* patch replaces most of the calls
to ShmemAlloc with ShmemInitStruct to associate a name with the allocations
and ensure that they get tracked by pg_shmem_allocations.

I observed an improvement in total memory allocation by consolidating the
initial shared memory allocations for the hash table. For example, the
allocated size for the LOCK hash table created by hash_create decreased
from 801664 bytes to 799616 bytes. Please find the attached patches, which
I will add to the March Commitfest.

Thank you,
Rahila Syed

Attachments:

0001-Account-for-initial-shared-memory-allocated-during-h.patch (application/octet-stream)
From c13a7133ed455842b685426217a7b5079e6fc869 Mon Sep 17 00:00:00 2001
From: Rahila Syed <rahilasyed.90@gmail.com>
Date: Fri, 21 Feb 2025 15:08:12 +0530
Subject: [PATCH 1/2] Account for initial shared memory allocated during
 hash_create.

pg_shmem_allocations tracks the memory allocated by ShmemInitStruct,
which, in case of shared hash tables, only covers memory allocated
to the hash directory and control structure. The hash segments and
buckets are allocated using ShmemAllocNoError which does not attribute
the allocations to the hash table name. Thus, these allocations are
not tracked in pg_shmem_allocations.

Improve this by including the allocation of segments and buckets or elements
in the initial allocation of the shared hash directory. Since this adds numbers
to existing hash table entries, the resulting tuples in pg_shmem_allocations
represent the total size of the initial hash table including all the
buckets and the elements they contain, instead of just the directory size.
---
 src/backend/storage/ipc/shmem.c   |   3 +-
 src/backend/utils/hash/dynahash.c | 146 ++++++++++++++++++++++--------
 src/include/utils/dynahash.h      |   2 +
 src/include/utils/hsearch.h       |   2 +-
 4 files changed, 111 insertions(+), 42 deletions(-)

diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39..c56e9b6c77 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -73,6 +73,7 @@
 #include "storage/shmem.h"
 #include "storage/spin.h"
 #include "utils/builtins.h"
+#include "utils/dynahash.h"
 
 static void *ShmemAllocRaw(Size size, Size *allocated_size);
 
@@ -346,7 +347,7 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
 
 	/* look it up in the shmem index */
 	location = ShmemInitStruct(name,
-							   hash_get_shared_size(infoP, hash_flags),
+							   hash_get_shared_size(infoP, hash_flags, init_size),
 							   &found);
 
 	/*
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index cd5a00132f..5203f5b30b 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -120,7 +120,6 @@
  * a good idea of the maximum number of entries!).  For non-shared hash
  * tables, the initial directory size can be left at the default.
  */
-#define DEF_SEGSIZE			   256
 #define DEF_SEGSIZE_SHIFT	   8	/* must be log2(DEF_SEGSIZE) */
 #define DEF_DIRSIZE			   256
 
@@ -265,7 +264,7 @@ static long hash_accesses,
  */
 static void *DynaHashAlloc(Size size);
 static HASHSEGMENT seg_alloc(HTAB *hashp);
-static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx);
+static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx, HASHELEMENT *firstElement);
 static bool dir_realloc(HTAB *hashp);
 static bool expand_table(HTAB *hashp);
 static HASHBUCKET get_hash_entry(HTAB *hashp, int freelist_idx);
@@ -276,11 +275,10 @@ static void hash_corrupted(HTAB *hashp) pg_attribute_noreturn();
 static uint32 hash_initial_lookup(HTAB *hashp, uint32 hashvalue,
 								  HASHBUCKET **bucketptr);
 static long next_pow2_long(long num);
-static int	next_pow2_int(long num);
 static void register_seq_scan(HTAB *hashp);
 static void deregister_seq_scan(HTAB *hashp);
 static bool has_seq_scans(HTAB *hashp);
-
+static int find_num_of_segs(long nelem, int *nbuckets, long num_partitions, long ssize);
 
 /*
  * memory allocation support
@@ -468,7 +466,11 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 	else
 		hashp->keycopy = memcpy;
 
-	/* And select the entry allocation function, too. */
+	/*
+	 * And select the entry allocation function, too. XXX should this also
+	 * Assert that flags & HASH_SHARED_MEM is true, since HASH_ALLOC is
+	 * currently only set with HASH_SHARED_MEM *
+	 */
 	if (flags & HASH_ALLOC)
 		hashp->alloc = info->alloc;
 	else
@@ -518,6 +520,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 
 	hashp->frozen = false;
 
+	/* Initializing the HASHHDR variables with default values */
 	hdefault(hashp);
 
 	hctl = hashp->hctl;
@@ -582,7 +585,8 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 					freelist_partitions,
 					nelem_alloc,
 					nelem_alloc_first;
-
+		void *curr_offset;
+	
 		/*
 		 * If hash table is partitioned, give each freelist an equal share of
 		 * the initial allocation.  Otherwise only freeList[0] is used.
@@ -592,6 +596,20 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		else
 			freelist_partitions = 1;
 
+		/*
+		 * If table is shared, calculate the offset at which to find the
+		 * the first partition of elements
+		 */
+		if (hashp->isshared)
+		{
+			int			nsegs;
+			int			nbuckets;
+			nsegs = find_num_of_segs(nelem, &nbuckets, hctl->num_partitions, hctl->ssize);
+			
+			curr_offset =  (((char *) hashp->hctl) + sizeof(HASHHDR) + (info->dsize * sizeof(HASHSEGMENT)) +
+                        + (sizeof(HASHBUCKET) * hctl->ssize * nsegs));
+		}
+
 		nelem_alloc = nelem / freelist_partitions;
 		if (nelem_alloc <= 0)
 			nelem_alloc = 1;
@@ -609,11 +627,20 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		for (i = 0; i < freelist_partitions; i++)
 		{
 			int			temp = (i == 0) ? nelem_alloc_first : nelem_alloc;
-
-			if (!element_alloc(hashp, temp, i))
+			HASHELEMENT *firstElement;
+ 			Size elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
+
+			/* Memory is allocated as part of initial allocation in ShmemInitHash */
+			if (hashp->isshared)
+				firstElement = (HASHELEMENT *) curr_offset;
+			else
+				firstElement = NULL;		
+				
+			if (!element_alloc(hashp, temp, i, firstElement))
 				ereport(ERROR,
 						(errcode(ERRCODE_OUT_OF_MEMORY),
 						 errmsg("out of memory")));
+			curr_offset = (((char *)curr_offset) + (temp * elementSize));
 		}
 	}
 
@@ -693,7 +720,7 @@ init_htab(HTAB *hashp, long nelem)
 	int			nbuckets;
 	int			nsegs;
 	int			i;
-
+	
 	/*
 	 * initialize mutexes if it's a partitioned table
 	 */
@@ -701,30 +728,11 @@ init_htab(HTAB *hashp, long nelem)
 		for (i = 0; i < NUM_FREELISTS; i++)
 			SpinLockInit(&(hctl->freeList[i].mutex));
 
-	/*
-	 * Allocate space for the next greater power of two number of buckets,
-	 * assuming a desired maximum load factor of 1.
-	 */
-	nbuckets = next_pow2_int(nelem);
-
-	/*
-	 * In a partitioned table, nbuckets must be at least equal to
-	 * num_partitions; were it less, keys with apparently different partition
-	 * numbers would map to the same bucket, breaking partition independence.
-	 * (Normally nbuckets will be much bigger; this is just a safety check.)
-	 */
-	while (nbuckets < hctl->num_partitions)
-		nbuckets <<= 1;
+	nsegs = find_num_of_segs(nelem, &nbuckets, hctl->num_partitions, hctl->ssize);
 
 	hctl->max_bucket = hctl->low_mask = nbuckets - 1;
 	hctl->high_mask = (nbuckets << 1) - 1;
 
-	/*
-	 * Figure number of directory segments needed, round up to a power of 2
-	 */
-	nsegs = (nbuckets - 1) / hctl->ssize + 1;
-	nsegs = next_pow2_int(nsegs);
-
 	/*
 	 * Make sure directory is big enough. If pre-allocated directory is too
 	 * small, choke (caller screwed up).
@@ -748,11 +756,19 @@ init_htab(HTAB *hashp, long nelem)
 	}
 
 	/* Allocate initial segments */
+	i = 0;
 	for (segp = hashp->dir; hctl->nsegs < nsegs; hctl->nsegs++, segp++)
 	{
-		*segp = seg_alloc(hashp);
+		if (hashp->isshared)
+			*segp = (HASHBUCKET *) (((char *) hashp->hctl)
+				+ sizeof(HASHHDR)
+				+ (hashp->hctl->dsize * sizeof(HASHSEGMENT))
+				+ (i * sizeof(HASHBUCKET) * hashp->ssize)); 
+		else
+			*segp = seg_alloc(hashp);
 		if (*segp == NULL)
 			return false;
+		i = i + 1;
 	}
 
 	/* Choose number of entries to allocate at a time */
@@ -851,11 +867,32 @@ hash_select_dirsize(long num_entries)
  * and for the (non expansible) directory.
  */
 Size
-hash_get_shared_size(HASHCTL *info, int flags)
+hash_get_shared_size(HASHCTL *info, int flags, long init_size)
 {
+	int nbuckets;
+	int nsegs;
+	int num_partitions;
+	int ssize;
+	Size elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(info->entrysize);
+
 	Assert(flags & HASH_DIRSIZE);
 	Assert(info->dsize == info->max_dsize);
-	return sizeof(HASHHDR) + info->dsize * sizeof(HASHSEGMENT);
+
+	if (flags & HASH_PARTITION)
+		num_partitions = info->num_partitions;
+	else
+		num_partitions = 0;
+
+	if (flags & HASH_SEGMENT)
+		ssize = info->ssize;
+	else
+		ssize = DEF_SEGSIZE;
+
+	nsegs = find_num_of_segs(init_size, &nbuckets, num_partitions, ssize);
+
+	return sizeof(HASHHDR) + info->dsize * sizeof(HASHSEGMENT) +
+			+ sizeof(HASHBUCKET) * ssize * nsegs
+			+ init_size * elementSize;
 }
 
 
@@ -1285,7 +1322,7 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 		 * Failing because the needed element is in a different freelist is
 		 * not acceptable.
 		 */
-		if (!element_alloc(hashp, hctl->nelem_alloc, freelist_idx))
+		if (!element_alloc(hashp, hctl->nelem_alloc, freelist_idx, NULL))
 		{
 			int			borrow_from_idx;
 
@@ -1689,6 +1726,7 @@ seg_alloc(HTAB *hashp)
 	HASHSEGMENT segp;
 
 	CurrentDynaHashCxt = hashp->hcxt;
+	
 	segp = (HASHSEGMENT) hashp->alloc(sizeof(HASHBUCKET) * hashp->ssize);
 
 	if (!segp)
@@ -1703,11 +1741,10 @@ seg_alloc(HTAB *hashp)
  * allocate some new elements and link them into the indicated free list
  */
 static bool
-element_alloc(HTAB *hashp, int nelem, int freelist_idx)
+element_alloc(HTAB *hashp, int nelem, int freelist_idx, HASHELEMENT *firstElement)
 {
 	HASHHDR    *hctl = hashp->hctl;
 	Size		elementSize;
-	HASHELEMENT *firstElement;
 	HASHELEMENT *tmpElement;
 	HASHELEMENT *prevElement;
 	int			i;
@@ -1717,10 +1754,11 @@ element_alloc(HTAB *hashp, int nelem, int freelist_idx)
 
 	/* Each element has a HASHELEMENT header plus user data. */
 	elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
-
-	CurrentDynaHashCxt = hashp->hcxt;
-	firstElement = (HASHELEMENT *) hashp->alloc(nelem * elementSize);
-
+	if (!firstElement)
+	{
+		CurrentDynaHashCxt = hashp->hcxt;
+		firstElement = (HASHELEMENT *) hashp->alloc(nelem * elementSize);
+	}
 	if (!firstElement)
 		return false;
 
@@ -1816,7 +1854,7 @@ next_pow2_long(long num)
 }
 
 /* calculate first power of 2 >= num, bounded to what will fit in an int */
-static int
+int
 next_pow2_int(long num)
 {
 	if (num > INT_MAX / 2)
@@ -1957,3 +1995,31 @@ AtEOSubXact_HashTables(bool isCommit, int nestDepth)
 		}
 	}
 }
+
+static int
+find_num_of_segs(long nelem, int *nbuckets, long num_partitions, long ssize)
+{
+	int nsegs;
+	/*
+	 * Allocate space for the next greater power of two number of buckets,
+	 * assuming a desired maximum load factor of 1.
+	 */
+	*nbuckets = next_pow2_int(nelem);
+
+	/*
+	 * In a partitioned table, nbuckets must be at least equal to
+	 * num_partitions; were it less, keys with apparently different partition
+	 * numbers would map to the same bucket, breaking partition independence.
+	 * (Normally nbuckets will be much bigger; this is just a safety check.)
+	 */
+	while ((*nbuckets) < num_partitions)
+		(*nbuckets) <<= 1;
+
+
+	/*
+	 * Figure number of directory segments needed, round up to a power of 2
+	 */
+	nsegs = ((*nbuckets) - 1) / ssize + 1;
+	nsegs = next_pow2_int(nsegs);
+	return nsegs;
+}
diff --git a/src/include/utils/dynahash.h b/src/include/utils/dynahash.h
index 8a31d9524e..9393018ef6 100644
--- a/src/include/utils/dynahash.h
+++ b/src/include/utils/dynahash.h
@@ -15,6 +15,8 @@
 #ifndef DYNAHASH_H
 #define DYNAHASH_H
 
+#define DEF_SEGSIZE			   256
 extern int	my_log2(long num);
+extern int	next_pow2_int(long num);
 
 #endif							/* DYNAHASH_H */
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index 932cc4f34d..5e16bd4183 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -151,7 +151,7 @@ extern void hash_seq_term(HASH_SEQ_STATUS *status);
 extern void hash_freeze(HTAB *hashp);
 extern Size hash_estimate_size(long num_entries, Size entrysize);
 extern long hash_select_dirsize(long num_entries);
-extern Size hash_get_shared_size(HASHCTL *info, int flags);
+extern Size hash_get_shared_size(HASHCTL *info, int flags, long init_size);
 extern void AtEOXact_HashTables(bool isCommit);
 extern void AtEOSubXact_HashTables(bool isCommit, int nestDepth);
 
-- 
2.34.1

0002-Replace-ShmemAlloc-calls-by-ShmemInitStruct.patch (application/octet-stream)
From 7341add1113ebfed5f160f664bc374165586229c Mon Sep 17 00:00:00 2001
From: Rahila Syed <rahilasyed.90@gmail.com>
Date: Sat, 1 Mar 2025 09:02:33 +0530
Subject: [PATCH 2/2] Replace ShmemAlloc calls by ShmemInitStruct

The shared memory allocated by ShmemAlloc is not tracked
by pg_shmem_allocations. This commit replaces most of the
calls to ShmemAlloc by ShmemInitStruct to associate a name
with the allocations and ensure that they get tracked by
pg_shmem_allocations
---
 src/backend/storage/lmgr/lwlock.c    |  1 +
 src/backend/storage/lmgr/predicate.c | 10 +++++-----
 src/backend/storage/lmgr/proc.c      |  8 ++++----
 3 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 8adf273027..9101673185 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -464,6 +464,7 @@ CreateLWLocks(void)
 		Size		spaceLocks = LWLockShmemSize();
 		int		   *LWLockCounter;
 		char	   *ptr;
+		bool found;
 
 		/* Allocate space */
 		ptr = (char *) ShmemAlloc(spaceLocks);
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 5b21a05398..58afb94d96 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -1244,7 +1244,7 @@ PredicateLockShmemInit(void)
 		PredXact->HavePartialClearedThrough = 0;
 		requestSize = mul_size((Size) max_table_size,
 							   sizeof(SERIALIZABLEXACT));
-		PredXact->element = ShmemAlloc(requestSize);
+		PredXact->element = ShmemInitStruct("SerializableXactList", requestSize, &found);
 		/* Add all elements to available list, clean. */
 		memset(PredXact->element, 0, requestSize);
 		for (i = 0; i < max_table_size; i++)
@@ -1299,9 +1299,11 @@ PredicateLockShmemInit(void)
 	 * probably OK.
 	 */
 	max_table_size *= 5;
+	requestSize = mul_size((Size) max_table_size,
+						RWConflictDataSize);
 
 	RWConflictPool = ShmemInitStruct("RWConflictPool",
-									 RWConflictPoolHeaderDataSize,
+									 RWConflictPoolHeaderDataSize + requestSize,
 									 &found);
 	Assert(found == IsUnderPostmaster);
 	if (!found)
@@ -1309,9 +1311,7 @@ PredicateLockShmemInit(void)
 		int			i;
 
 		dlist_init(&RWConflictPool->availableList);
-		requestSize = mul_size((Size) max_table_size,
-							   RWConflictDataSize);
-		RWConflictPool->element = ShmemAlloc(requestSize);
+		RWConflictPool->element = (RWConflict) ((char *)RWConflictPool + RWConflictPoolHeaderDataSize); 
 		/* Add all elements to available list, clean. */
 		memset(RWConflictPool->element, 0, requestSize);
 		for (i = 0; i < max_table_size; i++)
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 49204f91a2..e112735d93 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -218,11 +218,11 @@ InitProcGlobal(void)
 	 * how hotly they are accessed.
 	 */
 	ProcGlobal->xids =
-		(TransactionId *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->xids));
+		(TransactionId *) ShmemInitStruct("Proc Transaction Ids", TotalProcs * sizeof(*ProcGlobal->xids), &found);
 	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
-	ProcGlobal->subxidStates = (XidCacheStatus *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	ProcGlobal->subxidStates = (XidCacheStatus *) ShmemInitStruct("Proc Sub-transaction id states", TotalProcs * sizeof(*ProcGlobal->subxidStates), &found);
 	MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	ProcGlobal->statusFlags = (uint8 *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	ProcGlobal->statusFlags = (uint8 *) ShmemInitStruct("Proc Status Flags", TotalProcs * sizeof(*ProcGlobal->statusFlags), &found);
 	MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
 
 	/*
@@ -233,7 +233,7 @@ InitProcGlobal(void)
 	fpLockBitsSize = MAXALIGN(FastPathLockGroupsPerBackend * sizeof(uint64));
 	fpRelIdSize = MAXALIGN(FastPathLockGroupsPerBackend * sizeof(Oid) * FP_LOCK_SLOTS_PER_GROUP);
 
-	fpPtr = ShmemAlloc(TotalProcs * (fpLockBitsSize + fpRelIdSize));
+	fpPtr = ShmemInitStruct("Fast path lock arrays", TotalProcs * (fpLockBitsSize + fpRelIdSize), &found);
 	MemSet(fpPtr, 0, TotalProcs * (fpLockBitsSize + fpRelIdSize));
 
 	/* For asserts checking we did not overflow. */
-- 
2.34.1

#2 Andres Freund
andres@anarazel.de
In reply to: Rahila Syed (#1)
Re: Improve monitoring of shared memory allocations

Hi,

Thanks for sending these, the issues addressed here have been bugging me for a
long while.

On 2025-03-01 10:19:01 +0530, Rahila Syed wrote:

The 0001* patch improved the accounting for the shared memory allocated for
a hash table during hash_create. pg_shmem_allocations tracks the memory
allocated by ShmemInitStruct, which, for shared hash tables, only covers
memory allocated for the hash directory and control structure via
ShmemInitHash. The hash segments and buckets are allocated using
ShmemAllocNoError, which does not attribute the allocation to the hash table
and also does not add it to ShmemIndex.

Therefore, these allocations are not tracked in pg_shmem_allocations. To
improve this, include the allocation of segments and buckets in the initial
allocation of the shared memory for the hash table, in ShmemInitHash.

This will result in pg_shmem_allocations representing the total size of the
initial hash table, including all the buckets and elements, instead of just
the directory size.

I think this should also be more efficient. Less space wasted on padding and
fewer indirect function calls.
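
(Each ShmemAllocRaw() request is rounded up to a cache line boundary via
CACHELINEALIGN, so folding several small allocations into one avoids that
per-allocation rounding; presumably that is where the 801664 -> 799616 byte
difference mentioned above comes from.)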

Like ShmemAllocNoError, the shared memory allocated by ShmemAlloc is not
tracked by pg_shmem_allocations. The 0002* patch replaces most of the calls
to ShmemAlloc with ShmemInitStruct to associate a name with the allocations
and ensure that they get tracked by pg_shmem_allocations.

In some of these cases it may be better to combine the allocation with a prior
ShmemInitStruct instead of doing a separate ShmemInitStruct() for the
allocations that are doing ShmemAlloc() right now.
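
Roughly along these lines, i.e. one tracked ShmemInitStruct() entry whose
trailing space is then carved up with pointer arithmetic (just a sketch with
a made-up Foo/FooElement structure and element count, not the actual patch):

	typedef struct FooElement { uint32 value; } FooElement;
	typedef struct Foo { int nelements; } Foo;

	int			nelements = 128;	/* made-up count */
	Size		size;
	bool		found;
	Foo		   *foo;
	FooElement *elements;

	/* one entry in pg_shmem_allocations covering header plus element array */
	size = add_size(MAXALIGN(sizeof(Foo)),
					mul_size(nelements, sizeof(FooElement)));
	foo = (Foo *) ShmemInitStruct("Foo", size, &found);

	if (!found)
	{
		/* the element array lives right after the (aligned) header */
		elements = (FooElement *) ((char *) foo + MAXALIGN(sizeof(Foo)));
		memset(elements, 0, nelements * sizeof(FooElement));
		/* ... set up the header and link the elements into its list ... */
	}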

cfbot found a few compiler warnings:

https://cirrus-ci.com/task/6526903542087680
[16:47:46.964] make -s -j${BUILD_JOBS} clean
[16:47:47.452] time make -s -j${BUILD_JOBS} world-bin
[16:49:10.496] lwlock.c: In function ‘CreateLWLocks’:
[16:49:10.496] lwlock.c:467:22: error: unused variable ‘found’ [-Werror=unused-variable]
[16:49:10.496] 467 | bool found;
[16:49:10.496] | ^~~~~
[16:49:10.496] cc1: all warnings being treated as errors
[16:49:10.496] make[4]: *** [<builtin>: lwlock.o] Error 1
[16:49:10.496] make[3]: *** [../../../src/backend/common.mk:37: lmgr-recursive] Error 2
[16:49:10.496] make[3]: *** Waiting for unfinished jobs....
[16:49:11.881] make[2]: *** [common.mk:37: storage-recursive] Error 2
[16:49:11.881] make[2]: *** Waiting for unfinished jobs....
[16:49:20.195] dynahash.c: In function ‘hash_create’:
[16:49:20.195] dynahash.c:643:37: error: ‘curr_offset’ may be used uninitialized [-Werror=maybe-uninitialized]
[16:49:20.195] 643 | curr_offset = (((char *)curr_offset) + (temp * elementSize));
[16:49:20.195] | ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[16:49:20.195] dynahash.c:588:23: note: ‘curr_offset’ was declared here
[16:49:20.195] 588 | void *curr_offset;
[16:49:20.195] | ^~~~~~~~~~~
[16:49:20.195] cc1: all warnings being treated as errors
[16:49:20.196] make[4]: *** [<builtin>: dynahash.o] Error 1

From c13a7133ed455842b685426217a7b5079e6fc869 Mon Sep 17 00:00:00 2001
From: Rahila Syed <rahilasyed.90@gmail.com>
Date: Fri, 21 Feb 2025 15:08:12 +0530
Subject: [PATCH 1/2] Account for initial shared memory allocated during
hash_create.

pg_shmem_allocations tracks the memory allocated by ShmemInitStruct,
which, in case of shared hash tables, only covers memory allocated
to the hash directory and control structure. The hash segments and
buckets are allocated using ShmemAllocNoError which does not attribute
the allocations to the hash table name. Thus, these allocations are
not tracked in pg_shmem_allocations.

Improve this by including the allocation of segments and buckets or elements
in the initial allocation of the shared hash directory. Since this adds numbers
to existing hash table entries, the resulting tuples in pg_shmem_allocations
represent the total size of the initial hash table including all the
buckets and the elements they contain, instead of just the directory size.

diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index cd5a00132f..5203f5b30b 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -120,7 +120,6 @@
* a good idea of the maximum number of entries!).  For non-shared hash
* tables, the initial directory size can be left at the default.
*/
-#define DEF_SEGSIZE			   256
#define DEF_SEGSIZE_SHIFT	   8	/* must be log2(DEF_SEGSIZE) */
#define DEF_DIRSIZE			   256

Why did you move this to the header? Afaict it's just used in
hash_get_shared_size(), which is also in dynahash.c?

@@ -265,7 +264,7 @@ static long hash_accesses,
*/
static void *DynaHashAlloc(Size size);
static HASHSEGMENT seg_alloc(HTAB *hashp);
-static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx);
+static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx, HASHELEMENT *firstElement);
static bool dir_realloc(HTAB *hashp);
static bool expand_table(HTAB *hashp);
static HASHBUCKET get_hash_entry(HTAB *hashp, int freelist_idx);
@@ -276,11 +275,10 @@ static void hash_corrupted(HTAB *hashp) pg_attribute_noreturn();
static uint32 hash_initial_lookup(HTAB *hashp, uint32 hashvalue,
HASHBUCKET **bucketptr);
static long next_pow2_long(long num);
-static int	next_pow2_int(long num);
static void register_seq_scan(HTAB *hashp);
static void deregister_seq_scan(HTAB *hashp);
static bool has_seq_scans(HTAB *hashp);
-
+static int find_num_of_segs(long nelem, int *nbuckets, long num_partitions, long ssize);

/*
* memory allocation support

You removed a newline here that probably shouldn't be removed.

@@ -468,7 +466,11 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
else
hashp->keycopy = memcpy;

-	/* And select the entry allocation function, too. */
+	/*
+	 * And select the entry allocation function, too. XXX should this also
+	 * Assert that flags & HASH_SHARED_MEM is true, since HASH_ALLOC is
+	 * currently only set with HASH_SHARED_MEM *
+	 */
if (flags & HASH_ALLOC)
hashp->alloc = info->alloc;
else
@@ -518,6 +520,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)

hashp->frozen = false;

+ /* Initializing the HASHHDR variables with default values */
hdefault(hashp);

hctl = hashp->hctl;

I assume these were just observations you made while looking into this? They
seem unrelated to the change itself?

@@ -582,7 +585,8 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
freelist_partitions,
nelem_alloc,
nelem_alloc_first;
-
+ void *curr_offset;
+
/*
* If hash table is partitioned, give each freelist an equal share of
* the initial allocation. Otherwise only freeList[0] is used.
@@ -592,6 +596,20 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
else
freelist_partitions = 1;

+		/*
+		 * If table is shared, calculate the offset at which to find the
+		 * the first partition of elements
+		 */
+		if (hashp->isshared)
+		{
+			int			nsegs;
+			int			nbuckets;
+			nsegs = find_num_of_segs(nelem, &nbuckets, hctl->num_partitions, hctl->ssize);
+			
+			curr_offset =  (((char *) hashp->hctl) + sizeof(HASHHDR) + (info->dsize * sizeof(HASHSEGMENT)) +
+                        + (sizeof(HASHBUCKET) * hctl->ssize * nsegs));
+		}
+

Why only do this for shared hashtables? Couldn't we allocate the elements
together with the rest for non-shared hashtables too?

nelem_alloc = nelem / freelist_partitions;
if (nelem_alloc <= 0)
nelem_alloc = 1;
@@ -609,11 +627,20 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
for (i = 0; i < freelist_partitions; i++)
{
int			temp = (i == 0) ? nelem_alloc_first : nelem_alloc;
-
-			if (!element_alloc(hashp, temp, i))
+			HASHELEMENT *firstElement;
+ 			Size elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
+
+			/* Memory is allocated as part of initial allocation in ShmemInitHash */
+			if (hashp->isshared)
+				firstElement = (HASHELEMENT *) curr_offset;
+			else
+				firstElement = NULL;		
+				
+			if (!element_alloc(hashp, temp, i, firstElement))
ereport(ERROR,
(errcode(ERRCODE_OUT_OF_MEMORY),
errmsg("out of memory")));
+			curr_offset = (((char *)curr_offset) + (temp * elementSize));
}
}

Seems a bit ugly to go through element_alloc() when pre-allocating. Perhaps
it's the best thing we can do to avoid duplicating code, but it seems worth
checking if we can do better. Perhaps we could split element_alloc() into
element_alloc() and element_add() or such? With the latter doing everything
after hashp->alloc().
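
Something like the following shape, so that a pre-allocated chunk of shared
memory could be handed straight to the linking step (a rough sketch of the
idea, not the actual patch):

	/* allocation only; returns NULL if no memory could be obtained */
	static HASHELEMENT *element_alloc(HTAB *hashp, int nelem);

	/* link an already-allocated chunk of nelem entries into freelist_idx */
	static void element_add(HTAB *hashp, HASHELEMENT *firstElement,
							int freelist_idx, int nelem);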

@@ -693,7 +720,7 @@ init_htab(HTAB *hashp, long nelem)
int nbuckets;
int nsegs;
int i;
-
+
/*
* initialize mutexes if it's a partitioned table
*/

Spurious change.

@@ -701,30 +728,11 @@ init_htab(HTAB *hashp, long nelem)
for (i = 0; i < NUM_FREELISTS; i++)
SpinLockInit(&(hctl->freeList[i].mutex));

-	/*
-	 * Allocate space for the next greater power of two number of buckets,
-	 * assuming a desired maximum load factor of 1.
-	 */
-	nbuckets = next_pow2_int(nelem);
-
-	/*
-	 * In a partitioned table, nbuckets must be at least equal to
-	 * num_partitions; were it less, keys with apparently different partition
-	 * numbers would map to the same bucket, breaking partition independence.
-	 * (Normally nbuckets will be much bigger; this is just a safety check.)
-	 */
-	while (nbuckets < hctl->num_partitions)
-		nbuckets <<= 1;
+	nsegs = find_num_of_segs(nelem, &nbuckets, hctl->num_partitions, hctl->ssize);

hctl->max_bucket = hctl->low_mask = nbuckets - 1;
hctl->high_mask = (nbuckets << 1) - 1;

- /*
- * Figure number of directory segments needed, round up to a power of 2
- */
- nsegs = (nbuckets - 1) / hctl->ssize + 1;
- nsegs = next_pow2_int(nsegs);
-
/*
* Make sure directory is big enough. If pre-allocated directory is too
* small, choke (caller screwed up).

A function called find_num_of_segs() that also sets nbuckets seems a bit
confusing. I also don't like "find_*", as that sounds like it's searching
some datastructure, rather than just doing a bit of math.

@@ -1689,6 +1726,7 @@ seg_alloc(HTAB *hashp)
HASHSEGMENT segp;

CurrentDynaHashCxt = hashp->hcxt;
+
segp = (HASHSEGMENT) hashp->alloc(sizeof(HASHBUCKET) * hashp->ssize);

if (!segp)

Spurious change.

@@ -1816,7 +1854,7 @@ next_pow2_long(long num)
}

/* calculate first power of 2 >= num, bounded to what will fit in an int */
-static int
+int
next_pow2_int(long num)
{
if (num > INT_MAX / 2)
@@ -1957,3 +1995,31 @@ AtEOSubXact_HashTables(bool isCommit, int nestDepth)
}
}
}

Why export this?

diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index 932cc4f34d..5e16bd4183 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -151,7 +151,7 @@ extern void hash_seq_term(HASH_SEQ_STATUS *status);
extern void hash_freeze(HTAB *hashp);
extern Size hash_estimate_size(long num_entries, Size entrysize);
extern long hash_select_dirsize(long num_entries);
-extern Size hash_get_shared_size(HASHCTL *info, int flags);
+extern Size hash_get_shared_size(HASHCTL *info, int flags, long init_size);
extern void AtEOXact_HashTables(bool isCommit);
extern void AtEOSubXact_HashTables(bool isCommit, int nestDepth);

It's imo a bit weird that we have very related logic in hash_estimate_size()
and hash_get_shared_size(). Why do we need to duplicate it?

--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -1244,7 +1244,7 @@ PredicateLockShmemInit(void)
PredXact->HavePartialClearedThrough = 0;
requestSize = mul_size((Size) max_table_size,
sizeof(SERIALIZABLEXACT));
-		PredXact->element = ShmemAlloc(requestSize);
+		PredXact->element = ShmemInitStruct("SerializableXactList", requestSize, &found);
/* Add all elements to available list, clean. */
memset(PredXact->element, 0, requestSize);
for (i = 0; i < max_table_size; i++)
@@ -1299,9 +1299,11 @@ PredicateLockShmemInit(void)
* probably OK.
*/
max_table_size *= 5;
+	requestSize = mul_size((Size) max_table_size,
+						RWConflictDataSize);
RWConflictPool = ShmemInitStruct("RWConflictPool",
-									 RWConflictPoolHeaderDataSize,
+									 RWConflictPoolHeaderDataSize + requestSize,
&found);
Assert(found == IsUnderPostmaster);
if (!found)
@@ -1309,9 +1311,7 @@ PredicateLockShmemInit(void)
int			i;
dlist_init(&RWConflictPool->availableList);
-		requestSize = mul_size((Size) max_table_size,
-							   RWConflictDataSize);
-		RWConflictPool->element = ShmemAlloc(requestSize);
+		RWConflictPool->element = (RWConflict) ((char *)RWConflictPool + RWConflictPoolHeaderDataSize); 
/* Add all elements to available list, clean. */
memset(RWConflictPool->element, 0, requestSize);
for (i = 0; i < max_table_size; i++)

These I'd just combine with the ShmemInitStruct("PredXactList"), by allocating
the additional space. The pointer math is a bit annoying, but it makes much
more sense to have one entry in pg_shmem_allocations.

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 49204f91a2..e112735d93 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -218,11 +218,11 @@ InitProcGlobal(void)
* how hotly they are accessed.
*/
ProcGlobal->xids =
-		(TransactionId *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->xids));
+		(TransactionId *) ShmemInitStruct("Proc Transaction Ids", TotalProcs * sizeof(*ProcGlobal->xids), &found);
MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
-	ProcGlobal->subxidStates = (XidCacheStatus *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	ProcGlobal->subxidStates = (XidCacheStatus *) ShmemInitStruct("Proc Sub-transaction id states", TotalProcs * sizeof(*ProcGlobal->subxidStates), &found);
MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	ProcGlobal->statusFlags = (uint8 *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	ProcGlobal->statusFlags = (uint8 *) ShmemInitStruct("Proc Status Flags", TotalProcs * sizeof(*ProcGlobal->statusFlags), &found);
MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));

/*

Same.

Although here I'd say it's worth padding the size of each separate
"allocation" by PG_CACHE_LINE_SIZE.

@@ -233,7 +233,7 @@ InitProcGlobal(void)
fpLockBitsSize = MAXALIGN(FastPathLockGroupsPerBackend * sizeof(uint64));
fpRelIdSize = MAXALIGN(FastPathLockGroupsPerBackend * sizeof(Oid) * FP_LOCK_SLOTS_PER_GROUP);

-	fpPtr = ShmemAlloc(TotalProcs * (fpLockBitsSize + fpRelIdSize));
+	fpPtr = ShmemInitStruct("Fast path lock arrays", TotalProcs * (fpLockBitsSize + fpRelIdSize), &found);
MemSet(fpPtr, 0, TotalProcs * (fpLockBitsSize + fpRelIdSize));

/* For asserts checking we did not overflow. */

This one might actually make sense to keep separate; depending on the
configuration it can be reasonably big (max_connections = 1k,
max_locks_per_transaction = 1k results in ~5MB).
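
(Back-of-the-envelope, assuming 16 fast-path slots per group: ~1k locks per
transaction gives 64 groups per backend, i.e. 64 * 8 = 512 bytes of lock bits
plus 64 * 16 * 4 = 4096 bytes of relation OIDs, so roughly 4.5kB per PGPROC;
times roughly a thousand PGPROCs that is ~5MB.)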

Greetings,

Andres Freund

#3 Rahila Syed
rahilasyed90@gmail.com
In reply to: Andres Freund (#2)
2 attachment(s)
Re: Improve monitoring of shared memory allocations

Hi,

Thank you for the review.

cfbot found a few compiler warnings:

https://cirrus-ci.com/task/6526903542087680
[16:47:46.964] make -s -j${BUILD_JOBS} clean
[16:47:47.452] time make -s -j${BUILD_JOBS} world-bin
[16:49:10.496] lwlock.c: In function ‘CreateLWLocks’:
[16:49:10.496] lwlock.c:467:22: error: unused variable ‘found’ [-Werror=unused-variable]
[16:49:10.496] 467 | bool found;
[16:49:10.496] | ^~~~~
[16:49:10.496] cc1: all warnings being treated as errors
[16:49:10.496] make[4]: *** [<builtin>: lwlock.o] Error 1
[16:49:10.496] make[3]: *** [../../../src/backend/common.mk:37: lmgr-recursive] Error 2
[16:49:10.496] make[3]: *** Waiting for unfinished jobs....
[16:49:11.881] make[2]: *** [common.mk:37: storage-recursive] Error 2
[16:49:11.881] make[2]: *** Waiting for unfinished jobs....
[16:49:20.195] dynahash.c: In function ‘hash_create’:
[16:49:20.195] dynahash.c:643:37: error: ‘curr_offset’ may be used uninitialized [-Werror=maybe-uninitialized]
[16:49:20.195] 643 | curr_offset = (((char *)curr_offset) + (temp * elementSize));
[16:49:20.195] | ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[16:49:20.195] dynahash.c:588:23: note: ‘curr_offset’ was declared here
[16:49:20.195] 588 | void *curr_offset;
[16:49:20.195] | ^~~~~~~~~~~
[16:49:20.195] cc1: all warnings being treated as errors
[16:49:20.196] make[4]: *** [<builtin>: dynahash.o] Error 1

Fixed these.

diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c

index cd5a00132f..5203f5b30b 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -120,7 +120,6 @@
* a good idea of the maximum number of entries!).  For non-shared hash
* tables, the initial directory size can be left at the default.
*/
-#define DEF_SEGSIZE                     256
#define DEF_SEGSIZE_SHIFT       8    /* must be log2(DEF_SEGSIZE) */
#define DEF_DIRSIZE                     256

Why did you move this to the header? Afaict it's just used in
hash_get_shared_size(), which is also in dynahash.c?

Yes. This was accidentally left behind by the previous version of the
patch, so I undid the change.

static void register_seq_scan(HTAB *hashp);
static void deregister_seq_scan(HTAB *hashp);
static bool has_seq_scans(HTAB *hashp);
-
+static int find_num_of_segs(long nelem, int *nbuckets, long num_partitions, long ssize);

/*
* memory allocation support

You removed a newline here that probably shouldn't be removed.

Fixed this.

@@ -468,7 +466,11 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)

else
hashp->keycopy = memcpy;

-     /* And select the entry allocation function, too. */
+     /*
+      * And select the entry allocation function, too. XXX should this also
+      * Assert that flags & HASH_SHARED_MEM is true, since HASH_ALLOC is
+      * currently only set with HASH_SHARED_MEM *
+      */
if (flags & HASH_ALLOC)
hashp->alloc = info->alloc;
else
@@ -518,6 +520,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)

hashp->frozen = false;

+ /* Initializing the HASHHDR variables with default values */
hdefault(hashp);

hctl = hashp->hctl;

I assume these were just observations you made while looking into this?
They seem unrelated to the change itself?

Yes. I removed the first one and left the second one as a code comment.

@@ -582,7 +585,8 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)

freelist_partitions,
nelem_alloc,
nelem_alloc_first;
-
+ void *curr_offset;
+
/*
 * If hash table is partitioned, give each freelist an equal share of
 * the initial allocation. Otherwise only freeList[0] is used.

@@ -592,6 +596,20 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)

else
freelist_partitions = 1;

+             /*
+              * If table is shared, calculate the offset at which to find the
+              * the first partition of elements
+              */
+             if (hashp->isshared)
+             {
+                     int                     nsegs;
+                     int                     nbuckets;
+                     nsegs = find_num_of_segs(nelem, &nbuckets, hctl->num_partitions, hctl->ssize);
+
+                     curr_offset =  (((char *) hashp->hctl) + sizeof(HASHHDR) + (info->dsize * sizeof(HASHSEGMENT)) +
+                        + (sizeof(HASHBUCKET) * hctl->ssize * nsegs));
+             }
+

Why only do this for shared hashtables? Couldn't we allocate the elements
together with the rest for non-shared hashtables too?

I think it is possible to consolidate the allocations for non-shared hash
tables too. However, initial elements are much smaller in non-shared hash
tables due to their ease of expansion. Therefore, there is probably less
benefit in trying to do that for non-shared tables.
In addition, the proposed changes are targeted to improve the monitoring in
pg_shmem_allocations, which won't be applicable to non-shared hashtables.
While I believe it is feasible, I am uncertain about the utility of such a
change.

Seems a bit ugly to go through element_alloc() when pre-allocating. Perhaps
it's the best thing we can do to avoid duplicating code, but it seems worth
checking if we can do better. Perhaps we could split element_alloc() into
element_alloc() and element_add() or such? With the latter doing everything
after hashp->alloc().

Makes sense. I split the element_alloc() into element_alloc() and
element_add().

-
+
/*
* initialize mutexes if it's a partitioned table
*/

Spurious change.

Fixed.

A function called find_num_of_segs() that also sets nbuckets seems a bit
confusing. I also don't like "find_*", as that sounds like it's searching
some datastructure, rather than just doing a bit of math.

I renamed it to compute_buckets_and_segs(). I am open to better
suggestions.

segp = (HASHSEGMENT) hashp->alloc(sizeof(HASHBUCKET) *
hashp->ssize);

if (!segp)

Spurious change.

Fixed.

-static int
+int
next_pow2_int(long num)
{
if (num > INT_MAX / 2)
@@ -1957,3 +1995,31 @@ AtEOSubXact_HashTables(bool isCommit, int nestDepth)

}
}
}

Why export this?

It was a stale change, I removed it now

diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index 932cc4f34d..5e16bd4183 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -151,7 +151,7 @@ extern void hash_seq_term(HASH_SEQ_STATUS *status);
extern void hash_freeze(HTAB *hashp);
extern Size hash_estimate_size(long num_entries, Size entrysize);
extern long hash_select_dirsize(long num_entries);
-extern Size hash_get_shared_size(HASHCTL *info, int flags);
+extern Size hash_get_shared_size(HASHCTL *info, int flags, long init_size);

extern void AtEOXact_HashTables(bool isCommit);
extern void AtEOSubXact_HashTables(bool isCommit, int nestDepth);

It's imo a bit weird that we have very related logic in hash_estimate_size()
and hash_get_shared_size(). Why do we need to duplicate it?

hash_estimate_size() estimates using default values and hash_get_shared_size()
calculates using specific values depending on the flags associated with the
hash table. For instance, the segment_size used by the former is DEF_SEGSIZE
while the latter uses info->ssize, which is set when the HASH_SEGMENT flag is
true. Hence, they might return different values for shared memory sizes.

These I'd just combine with the ShmemInitStruct("PredXactList"), by
allocating
the additional space. The pointer math is a bit annoying, but it makes much
more sense to have one entry in pg_shmem_allocations.

Fixed accordingly.

- (TransactionId *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->xids));
+ (TransactionId *) ShmemInitStruct("Proc Transaction Ids", TotalProcs * sizeof(*ProcGlobal->xids), &found);
MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
- ProcGlobal->subxidStates = (XidCacheStatus *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->subxidStates));
+ ProcGlobal->subxidStates = (XidCacheStatus *) ShmemInitStruct("Proc Sub-transaction id states", TotalProcs * sizeof(*ProcGlobal->subxidStates), &found);
MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
- ProcGlobal->statusFlags = (uint8 *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->statusFlags));
+ ProcGlobal->statusFlags = (uint8 *) ShmemInitStruct("Proc Status Flags", TotalProcs * sizeof(*ProcGlobal->statusFlags), &found);
MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));

/*

Same.

Although here I'd say it's worth padding the size of each separate
"allocation" by PG_CACHE_LINE_SIZE.

Made this change.

-     fpPtr = ShmemAlloc(TotalProcs * (fpLockBitsSize + fpRelIdSize));
+     fpPtr = ShmemInitStruct("Fast path lock arrays", TotalProcs *

(fpLockBitsSize + fpRelIdSize), &found);

MemSet(fpPtr, 0, TotalProcs * (fpLockBitsSize + fpRelIdSize));

/* For asserts checking we did not overflow. */

This one might actually make sense to keep separate, depending on the
configuration it can be reasonably big (max_connection = 1k,
max_locks_per_transaction=1k results in ~5MB)..

OK

PFA the rebased patches with the above changes.

Kindly let me know your views.

Thank you,
Rahila Syed

Attachments:

v2-0002-Replace-ShmemAlloc-calls-by-ShmemInitStruct.patch (application/octet-stream)
From 2858595994ccc1d48a50c318c011c22f8e56d7e8 Mon Sep 17 00:00:00 2001
From: Rahila Syed <rahilasyed.90@gmail.com>
Date: Thu, 6 Mar 2025 20:32:27 +0530
Subject: [PATCH 2/2] Replace ShmemAlloc calls by ShmemInitStruct

The shared memory allocated by ShmemAlloc is not tracked
by pg_shmem_allocations. This commit replaces most of the
calls to ShmemAlloc by ShmemInitStruct to associate a name
with the allocations and ensure that they get tracked by
pg_shmem_allocations. It also merges several smaller
ShmemAlloc calls into larger ShmemInitStruct calls to allocate
and track all the related memory allocations under a single entry.
---
 src/backend/storage/lmgr/predicate.c | 17 ++++++++-------
 src/backend/storage/lmgr/proc.c      | 31 +++++++++++++++++++++++-----
 2 files changed, 35 insertions(+), 13 deletions(-)

diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 5b21a05398..dd66990335 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -1226,8 +1226,11 @@ PredicateLockShmemInit(void)
 	 */
 	max_table_size *= 10;
 
+	requestSize = add_size(PredXactListDataSize,
+						   (mul_size((Size) max_table_size,
+									 sizeof(SERIALIZABLEXACT))));
 	PredXact = ShmemInitStruct("PredXactList",
-							   PredXactListDataSize,
+							   requestSize,
 							   &found);
 	Assert(found == IsUnderPostmaster);
 	if (!found)
@@ -1242,9 +1245,7 @@ PredicateLockShmemInit(void)
 		PredXact->LastSxactCommitSeqNo = FirstNormalSerCommitSeqNo - 1;
 		PredXact->CanPartialClearThrough = 0;
 		PredXact->HavePartialClearedThrough = 0;
-		requestSize = mul_size((Size) max_table_size,
-							   sizeof(SERIALIZABLEXACT));
-		PredXact->element = ShmemAlloc(requestSize);
+		PredXact->element = (SERIALIZABLEXACT *) ((char *) PredXact + PredXactListDataSize);
 		/* Add all elements to available list, clean. */
 		memset(PredXact->element, 0, requestSize);
 		for (i = 0; i < max_table_size; i++)
@@ -1299,9 +1300,11 @@ PredicateLockShmemInit(void)
 	 * probably OK.
 	 */
 	max_table_size *= 5;
+	requestSize = mul_size((Size) max_table_size,
+						   RWConflictDataSize);
 
 	RWConflictPool = ShmemInitStruct("RWConflictPool",
-									 RWConflictPoolHeaderDataSize,
+									 RWConflictPoolHeaderDataSize + requestSize,
 									 &found);
 	Assert(found == IsUnderPostmaster);
 	if (!found)
@@ -1309,9 +1312,7 @@ PredicateLockShmemInit(void)
 		int			i;
 
 		dlist_init(&RWConflictPool->availableList);
-		requestSize = mul_size((Size) max_table_size,
-							   RWConflictDataSize);
-		RWConflictPool->element = ShmemAlloc(requestSize);
+		RWConflictPool->element = (RWConflict) ((char *) RWConflictPool + RWConflictPoolHeaderDataSize);
 		/* Add all elements to available list, clean. */
 		memset(RWConflictPool->element, 0, requestSize);
 		for (i = 0; i < max_table_size; i++)
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 749a79d48e..4befbb0318 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -88,6 +88,7 @@ static void RemoveProcFromArray(int code, Datum arg);
 static void ProcKill(int code, Datum arg);
 static void AuxiliaryProcKill(int code, Datum arg);
 static void CheckDeadLock(void);
+static Size PGProcShmemSize(void);
 
 
 /*
@@ -175,6 +176,7 @@ InitProcGlobal(void)
 			   *fpEndPtr PG_USED_FOR_ASSERTS_ONLY;
 	Size		fpLockBitsSize,
 				fpRelIdSize;
+	Size		requestSize;
 
 	/* Create the ProcGlobal shared structure */
 	ProcGlobal = (PROC_HDR *)
@@ -204,7 +206,10 @@ InitProcGlobal(void)
 	 * with a single freelist.)  Each PGPROC structure is dedicated to exactly
 	 * one of these purposes, and they do not move between groups.
 	 */
-	procs = (PGPROC *) ShmemAlloc(TotalProcs * sizeof(PGPROC));
+	requestSize = PGProcShmemSize();
+
+	procs = (PGPROC *) ShmemInitStruct("PGPROC structures", requestSize, &found);
+
 	MemSet(procs, 0, TotalProcs * sizeof(PGPROC));
 	ProcGlobal->allProcs = procs;
 	/* XXX allProcCount isn't really all of them; it excludes prepared xacts */
@@ -218,11 +223,11 @@ InitProcGlobal(void)
 	 * how hotly they are accessed.
 	 */
 	ProcGlobal->xids =
-		(TransactionId *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->xids));
+		(TransactionId *) ((char *) procs + TotalProcs * sizeof(PGPROC));
 	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
-	ProcGlobal->subxidStates = (XidCacheStatus *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	ProcGlobal->subxidStates = (XidCacheStatus *) ((char *) ProcGlobal->xids + TotalProcs * sizeof(*ProcGlobal->xids) + PG_CACHE_LINE_SIZE);
 	MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	ProcGlobal->statusFlags = (uint8 *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	ProcGlobal->statusFlags = (uint8 *) ((char *) ProcGlobal->subxidStates + TotalProcs * sizeof(*ProcGlobal->subxidStates) + PG_CACHE_LINE_SIZE);
 	MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
 
 	/*
@@ -233,7 +238,7 @@ InitProcGlobal(void)
 	fpLockBitsSize = MAXALIGN(FastPathLockGroupsPerBackend * sizeof(uint64));
 	fpRelIdSize = MAXALIGN(FastPathLockSlotsPerBackend() * sizeof(Oid));
 
-	fpPtr = ShmemAlloc(TotalProcs * (fpLockBitsSize + fpRelIdSize));
+	fpPtr = ShmemInitStruct("Fast path lock arrays", TotalProcs * (fpLockBitsSize + fpRelIdSize), &found);
 	MemSet(fpPtr, 0, TotalProcs * (fpLockBitsSize + fpRelIdSize));
 
 	/* For asserts checking we did not overflow. */
@@ -334,6 +339,22 @@ InitProcGlobal(void)
 	SpinLockInit(ProcStructLock);
 }
 
+static Size
+PGProcShmemSize(void)
+{
+	Size		size;
+	uint32		TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
+
+	size = TotalProcs * sizeof(PGPROC);
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->xids));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
+	return size;
+}
+
 /*
  * InitProcess -- initialize a per-process PGPROC entry for this backend
  */
-- 
2.34.1

v2-0001-Account-for-initial-shared-memory-allocated-by-hash_.patch (application/octet-stream)
From 608d8667fbd05b6902c4e1ad1d810edf7d0f78bf Mon Sep 17 00:00:00 2001
From: Rahila Syed <rahilasyed.90@gmail.com>
Date: Thu, 6 Mar 2025 20:06:20 +0530
Subject: [PATCH 1/2] Account for initial shared memory allocated by
 hash_create

pg_shmem_allocations tracks the memory allocated by ShmemInitStruct,
which, in case of shared hash tables, only covers memory allocated
to the hash directory and control structure. The hash segments and
buckets are allocated using ShmemAllocNoError which does not attribute
the allocations to the hash table name. Thus, these allocations are
not tracked in pg_shmem_allocations.

Include the allocation of segments, buckets and elements in the initial
allocation of shared hash directory. This results in the existing ShmemIndex
entries to reflect all these allocations. The resulting tuples in
pg_shmem_allocations represent the total size of the initial hash table
including all the buckets and the elements they contain, instead of just
the directory size.
---
 src/backend/storage/ipc/shmem.c   |   3 +-
 src/backend/utils/hash/dynahash.c | 172 +++++++++++++++++++++++-------
 src/include/utils/hsearch.h       |   2 +-
 3 files changed, 135 insertions(+), 42 deletions(-)

diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39..c56e9b6c77 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -73,6 +73,7 @@
 #include "storage/shmem.h"
 #include "storage/spin.h"
 #include "utils/builtins.h"
+#include "utils/dynahash.h"
 
 static void *ShmemAllocRaw(Size size, Size *allocated_size);
 
@@ -346,7 +347,7 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
 
 	/* look it up in the shmem index */
 	location = ShmemInitStruct(name,
-							   hash_get_shared_size(infoP, hash_flags),
+							   hash_get_shared_size(infoP, hash_flags, init_size),
 							   &found);
 
 	/*
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index cd5a00132f..0c66d63f8e 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -265,7 +265,7 @@ static long hash_accesses,
  */
 static void *DynaHashAlloc(Size size);
 static HASHSEGMENT seg_alloc(HTAB *hashp);
-static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx);
+static HASHELEMENT *element_alloc(HTAB *hashp, int nelem);
 static bool dir_realloc(HTAB *hashp);
 static bool expand_table(HTAB *hashp);
 static HASHBUCKET get_hash_entry(HTAB *hashp, int freelist_idx);
@@ -281,6 +281,9 @@ static void register_seq_scan(HTAB *hashp);
 static void deregister_seq_scan(HTAB *hashp);
 static bool has_seq_scans(HTAB *hashp);
 
+static int	compute_buckets_and_segs(long nelem, int *nbuckets,
+									 long num_partitions, long ssize);
+static void element_add(HTAB *hashp, HASHELEMENT *firstElement, int freelist_idx, int nelem);
 
 /*
  * memory allocation support
@@ -518,6 +521,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 
 	hashp->frozen = false;
 
+	/* Initializing the HASHHDR variables with default values */
 	hdefault(hashp);
 
 	hctl = hashp->hctl;
@@ -582,6 +586,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 					freelist_partitions,
 					nelem_alloc,
 					nelem_alloc_first;
+		void	   *curr_offset = NULL;
 
 		/*
 		 * If hash table is partitioned, give each freelist an equal share of
@@ -592,6 +597,21 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		else
 			freelist_partitions = 1;
 
+		/*
+		 * If table is shared, calculate the offset at which to find the the
+		 * first partition of elements
+		 */
+		if (hashp->isshared)
+		{
+			int			nsegs;
+			int			nbuckets;
+
+			nsegs = compute_buckets_and_segs(nelem, &nbuckets, hctl->num_partitions, hctl->ssize);
+
+			curr_offset = (((char *) hashp->hctl) + sizeof(HASHHDR) + (info->dsize * sizeof(HASHSEGMENT)) +
+						   +(sizeof(HASHBUCKET) * hctl->ssize * nsegs));
+		}
+
 		nelem_alloc = nelem / freelist_partitions;
 		if (nelem_alloc <= 0)
 			nelem_alloc = 1;
@@ -609,11 +629,27 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		for (i = 0; i < freelist_partitions; i++)
 		{
 			int			temp = (i == 0) ? nelem_alloc_first : nelem_alloc;
+			HASHELEMENT *firstElement;
+			Size		elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
 
-			if (!element_alloc(hashp, temp, i))
-				ereport(ERROR,
-						(errcode(ERRCODE_OUT_OF_MEMORY),
-						 errmsg("out of memory")));
+			/*
+			 * Memory is allocated as part of initial allocation in
+			 * ShmemInitHash
+			 */
+			if (hashp->isshared)
+			{
+				firstElement = (HASHELEMENT *) curr_offset;
+				curr_offset = (((char *) curr_offset) + (temp * elementSize));
+			}
+			else
+			{
+				firstElement = element_alloc(hashp, temp);
+				if (!firstElement)
+					ereport(ERROR,
+							(errcode(ERRCODE_OUT_OF_MEMORY),
+							 errmsg("out of memory")));
+			}
+			element_add(hashp, firstElement, i, temp);
 		}
 	}
 
@@ -701,30 +737,11 @@ init_htab(HTAB *hashp, long nelem)
 		for (i = 0; i < NUM_FREELISTS; i++)
 			SpinLockInit(&(hctl->freeList[i].mutex));
 
-	/*
-	 * Allocate space for the next greater power of two number of buckets,
-	 * assuming a desired maximum load factor of 1.
-	 */
-	nbuckets = next_pow2_int(nelem);
-
-	/*
-	 * In a partitioned table, nbuckets must be at least equal to
-	 * num_partitions; were it less, keys with apparently different partition
-	 * numbers would map to the same bucket, breaking partition independence.
-	 * (Normally nbuckets will be much bigger; this is just a safety check.)
-	 */
-	while (nbuckets < hctl->num_partitions)
-		nbuckets <<= 1;
+	nsegs = compute_buckets_and_segs(nelem, &nbuckets, hctl->num_partitions, hctl->ssize);
 
 	hctl->max_bucket = hctl->low_mask = nbuckets - 1;
 	hctl->high_mask = (nbuckets << 1) - 1;
 
-	/*
-	 * Figure number of directory segments needed, round up to a power of 2
-	 */
-	nsegs = (nbuckets - 1) / hctl->ssize + 1;
-	nsegs = next_pow2_int(nsegs);
-
 	/*
 	 * Make sure directory is big enough. If pre-allocated directory is too
 	 * small, choke (caller screwed up).
@@ -748,11 +765,22 @@ init_htab(HTAB *hashp, long nelem)
 	}
 
 	/* Allocate initial segments */
+	i = 0;
 	for (segp = hashp->dir; hctl->nsegs < nsegs; hctl->nsegs++, segp++)
 	{
-		*segp = seg_alloc(hashp);
+		if (hashp->isshared)
+		{
+			*segp = (HASHBUCKET *) (((char *) hashp->hctl)
+									+ sizeof(HASHHDR)
+									+ (hashp->hctl->dsize * sizeof(HASHSEGMENT))
+									+ (i * sizeof(HASHBUCKET) * hashp->ssize));
+			MemSet(*segp, 0, sizeof(HASHBUCKET) * hashp->ssize);
+		}
+		else
+			*segp = seg_alloc(hashp);
 		if (*segp == NULL)
 			return false;
+		i = i + 1;
 	}
 
 	/* Choose number of entries to allocate at a time */
@@ -851,11 +879,36 @@ hash_select_dirsize(long num_entries)
  * and for the (non expansible) directory.
  */
 Size
-hash_get_shared_size(HASHCTL *info, int flags)
+hash_get_shared_size(HASHCTL *info, int flags, long init_size)
 {
+	int			nbuckets;
+	int			nsegs;
+	int			num_partitions;
+	int			ssize;
+	Size		elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(info->entrysize);
+
 	Assert(flags & HASH_DIRSIZE);
 	Assert(info->dsize == info->max_dsize);
-	return sizeof(HASHHDR) + info->dsize * sizeof(HASHSEGMENT);
+
+	if (flags & HASH_PARTITION)
+		num_partitions = info->num_partitions;
+	else
+		num_partitions = 0;
+
+	if (flags & HASH_SEGMENT)
+		ssize = info->ssize;
+	else
+		ssize = DEF_SEGSIZE;
+
+	nsegs = compute_buckets_and_segs(init_size, &nbuckets, num_partitions, ssize);
+
+	/* Number of entries should be at least equal to number of partitions */
+	if (init_size < num_partitions)
+		init_size = num_partitions;
+
+	return sizeof(HASHHDR) + info->dsize * sizeof(HASHSEGMENT) +
+		+sizeof(HASHBUCKET) * ssize * nsegs
+		+ init_size * elementSize;
 }
 
 
@@ -1285,7 +1338,8 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 		 * Failing because the needed element is in a different freelist is
 		 * not acceptable.
 		 */
-		if (!element_alloc(hashp, hctl->nelem_alloc, freelist_idx))
+		newElement = element_alloc(hashp, hctl->nelem_alloc);
+		if (newElement == NULL)
 		{
 			int			borrow_from_idx;
 
@@ -1322,6 +1376,7 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 			/* no elements available to borrow either, so out of memory */
 			return NULL;
 		}
+		element_add(hashp, newElement, freelist_idx, hctl->nelem_alloc);
 	}
 
 	/* remove entry from freelist, bump nentries */
@@ -1702,28 +1757,38 @@ seg_alloc(HTAB *hashp)
 /*
  * allocate some new elements and link them into the indicated free list
  */
-static bool
-element_alloc(HTAB *hashp, int nelem, int freelist_idx)
+static HASHELEMENT *
+element_alloc(HTAB *hashp, int nelem)
 {
 	HASHHDR    *hctl = hashp->hctl;
 	Size		elementSize;
-	HASHELEMENT *firstElement;
-	HASHELEMENT *tmpElement;
-	HASHELEMENT *prevElement;
-	int			i;
+	HASHELEMENT *firstElement = NULL;
 
 	if (hashp->isfixed)
-		return false;
+		return NULL;
 
 	/* Each element has a HASHELEMENT header plus user data. */
 	elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
-
 	CurrentDynaHashCxt = hashp->hcxt;
 	firstElement = (HASHELEMENT *) hashp->alloc(nelem * elementSize);
 
 	if (!firstElement)
-		return false;
+		return NULL;
 
+	return firstElement;
+}
+
+static void
+element_add(HTAB *hashp, HASHELEMENT *firstElement, int freelist_idx, int nelem)
+{
+	HASHHDR    *hctl = hashp->hctl;
+	Size		elementSize;
+	HASHELEMENT *tmpElement;
+	HASHELEMENT *prevElement;
+	int			i;
+
+	/* Each element has a HASHELEMENT header plus user data. */
+	elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
 	/* prepare to link all the new entries into the freelist */
 	prevElement = NULL;
 	tmpElement = firstElement;
@@ -1744,8 +1809,6 @@ element_alloc(HTAB *hashp, int nelem, int freelist_idx)
 
 	if (IS_PARTITIONED(hctl))
 		SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
-
-	return true;
 }
 
 /*
@@ -1957,3 +2020,32 @@ AtEOSubXact_HashTables(bool isCommit, int nestDepth)
 		}
 	}
 }
+
+static int
+compute_buckets_and_segs(long nelem, int *nbuckets, long num_partitions, long ssize)
+{
+	int			nsegs;
+
+	/*
+	 * Allocate space for the next greater power of two number of buckets,
+	 * assuming a desired maximum load factor of 1.
+	 */
+	*nbuckets = next_pow2_int(nelem);
+
+	/*
+	 * In a partitioned table, nbuckets must be at least equal to
+	 * num_partitions; were it less, keys with apparently different partition
+	 * numbers would map to the same bucket, breaking partition independence.
+	 * (Normally nbuckets will be much bigger; this is just a safety check.)
+	 */
+	while ((*nbuckets) < num_partitions)
+		(*nbuckets) <<= 1;
+
+
+	/*
+	 * Figure number of directory segments needed, round up to a power of 2
+	 */
+	nsegs = ((*nbuckets) - 1) / ssize + 1;
+	nsegs = next_pow2_int(nsegs);
+	return nsegs;
+}
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index 932cc4f34d..5e16bd4183 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -151,7 +151,7 @@ extern void hash_seq_term(HASH_SEQ_STATUS *status);
 extern void hash_freeze(HTAB *hashp);
 extern Size hash_estimate_size(long num_entries, Size entrysize);
 extern long hash_select_dirsize(long num_entries);
-extern Size hash_get_shared_size(HASHCTL *info, int flags);
+extern Size hash_get_shared_size(HASHCTL *info, int flags, long init_size);
 extern void AtEOXact_HashTables(bool isCommit);
 extern void AtEOSubXact_HashTables(bool isCommit, int nestDepth);
 
-- 
2.34.1

#4Rahila Syed
rahilasyed90@gmail.com
In reply to: Rahila Syed (#3)
2 attachment(s)
Re: Improve monitoring of shared memory allocations

Hi Andres,

+             if (hashp->isshared)
+             {
+                     int                     nsegs;
+                     int                     nbuckets;
+                     nsegs = find_num_of_segs(nelem, &nbuckets, hctl->num_partitions, hctl->ssize);
+
+                     curr_offset =  (((char *) hashp->hctl) + sizeof(HASHHDR) + (info->dsize * sizeof(HASHSEGMENT)) +
+                        + (sizeof(HASHBUCKET) * hctl->ssize * nsegs));
+             }
+

Why only do this for shared hashtables? Couldn't we allocate the elements
together with the rest for non-shared hashtables too?

I have now made the changes uniformly across shared and non-shared hash
tables, making the code fix look cleaner. I hope this aligns with your
suggestions.
Please find attached updated and rebased versions of both patches.

Kindly let me know your views.

Thank you,
Rahila Syed

Attachments:

v3-0001-Account-for-initial-shared-memory-allocated-by-hash_.patchapplication/octet-stream; name=v3-0001-Account-for-initial-shared-memory-allocated-by-hash_.patchDownload
From 98ca9cc3f34b6dd0892281583b4eb5e38ce0a668 Mon Sep 17 00:00:00 2001
From: Rahila Syed <rahilasyed.90@gmail.com>
Date: Thu, 6 Mar 2025 20:06:20 +0530
Subject: [PATCH 1/2] Account for initial shared memory allocated by
 hash_create

pg_shmem_allocations tracks the memory allocated by ShmemInitStruct,
which, in case of shared hash tables, only covers memory allocated
to the hash directory and control structure. The hash segments and
buckets are allocated using ShmemAllocNoError which does not attribute
the allocations to the hash table name. Thus, these allocations are
not tracked in pg_shmem_allocations.

Include the allocation of segments, buckets and elements in the initial
allocation of the shared hash directory. This results in the existing ShmemIndex
entries reflecting all these allocations. The resulting tuples in
pg_shmem_allocations represent the total size of the initial hash table
including all the buckets and the elements they contain, instead of just
the directory size.
---
 src/backend/storage/ipc/shmem.c   |   3 +-
 src/backend/utils/hash/dynahash.c | 209 ++++++++++++++++++++++--------
 src/include/utils/hsearch.h       |   3 +-
 3 files changed, 161 insertions(+), 54 deletions(-)

diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39..d8aed0bfaa 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -73,6 +73,7 @@
 #include "storage/shmem.h"
 #include "storage/spin.h"
 #include "utils/builtins.h"
+#include "utils/dynahash.h"
 
 static void *ShmemAllocRaw(Size size, Size *allocated_size);
 
@@ -346,7 +347,7 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
 
 	/* look it up in the shmem index */
 	location = ShmemInitStruct(name,
-							   hash_get_shared_size(infoP, hash_flags),
+							   hash_get_init_size(infoP, hash_flags, init_size, 0),
 							   &found);
 
 	/*
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index cd5a00132f..3bdf3d6fd5 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -265,7 +265,7 @@ static long hash_accesses,
  */
 static void *DynaHashAlloc(Size size);
 static HASHSEGMENT seg_alloc(HTAB *hashp);
-static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx);
+static HASHELEMENT *element_alloc(HTAB *hashp, int nelem);
 static bool dir_realloc(HTAB *hashp);
 static bool expand_table(HTAB *hashp);
 static HASHBUCKET get_hash_entry(HTAB *hashp, int freelist_idx);
@@ -281,6 +281,9 @@ static void register_seq_scan(HTAB *hashp);
 static void deregister_seq_scan(HTAB *hashp);
 static bool has_seq_scans(HTAB *hashp);
 
+static int	compute_buckets_and_segs(long nelem, int *nbuckets,
+									 long num_partitions, long ssize);
+static void element_add(HTAB *hashp, HASHELEMENT *firstElement, int freelist_idx, int nelem);
 
 /*
  * memory allocation support
@@ -353,6 +356,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 {
 	HTAB	   *hashp;
 	HASHHDR    *hctl;
+	int			nelem_batch;
 
 	/*
 	 * Hash tables now allocate space for key and data, but you have to say
@@ -507,9 +511,19 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		hashp->isshared = false;
 	}
 
+	/* Choose number of entries to allocate at a time */
+	nelem_batch = choose_nelem_alloc(info->entrysize);
+
+	/*
+	 * Allocate the memory needed for hash header, directory, segments and
+	 * elements together. Use pointer arithmetic to arrive at the start
+	 * of each of these structures later.
+	 */
 	if (!hashp->hctl)
 	{
-		hashp->hctl = (HASHHDR *) hashp->alloc(sizeof(HASHHDR));
+		Size		size = hash_get_init_size(info, flags, nelem, nelem_batch);
+
+		hashp->hctl = (HASHHDR *) hashp->alloc(size);
 		if (!hashp->hctl)
 			ereport(ERROR,
 					(errcode(ERRCODE_OUT_OF_MEMORY),
@@ -558,6 +572,8 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 	hctl->keysize = info->keysize;
 	hctl->entrysize = info->entrysize;
 
+	hctl->nelem_alloc = nelem_batch;
+
 	/* make local copies of heavily-used constant fields */
 	hashp->keysize = hctl->keysize;
 	hashp->ssize = hctl->ssize;
@@ -582,6 +598,9 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 					freelist_partitions,
 					nelem_alloc,
 					nelem_alloc_first;
+		void	   *curr_offset = NULL;
+		int			nsegs;
+		int			nbuckets;
 
 		/*
 		 * If hash table is partitioned, give each freelist an equal share of
@@ -592,6 +611,15 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		else
 			freelist_partitions = 1;
 
+		/*
+		 * If table is shared, calculate the offset at which to find the the
+		 * first partition of elements
+		 */
+
+		nsegs = compute_buckets_and_segs(nelem, &nbuckets, hctl->num_partitions, hctl->ssize);
+
+		curr_offset = (((char *) hashp->hctl) + sizeof(HASHHDR) + (hctl->dsize * sizeof(HASHSEGMENT)) + (sizeof(HASHBUCKET) * hctl->ssize * nsegs));
+
 		nelem_alloc = nelem / freelist_partitions;
 		if (nelem_alloc <= 0)
 			nelem_alloc = 1;
@@ -609,11 +637,16 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		for (i = 0; i < freelist_partitions; i++)
 		{
 			int			temp = (i == 0) ? nelem_alloc_first : nelem_alloc;
+			HASHELEMENT *firstElement;
+			Size		elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
 
-			if (!element_alloc(hashp, temp, i))
-				ereport(ERROR,
-						(errcode(ERRCODE_OUT_OF_MEMORY),
-						 errmsg("out of memory")));
+			/*
+			 * Memory is allocated as part of initial allocation in
+			 * ShmemInitHash
+			 */
+			firstElement = (HASHELEMENT *) curr_offset;
+			curr_offset = (((char *) curr_offset) + (temp * elementSize));
+			element_add(hashp, firstElement, i, temp);
 		}
 	}
 
@@ -701,30 +734,11 @@ init_htab(HTAB *hashp, long nelem)
 		for (i = 0; i < NUM_FREELISTS; i++)
 			SpinLockInit(&(hctl->freeList[i].mutex));
 
-	/*
-	 * Allocate space for the next greater power of two number of buckets,
-	 * assuming a desired maximum load factor of 1.
-	 */
-	nbuckets = next_pow2_int(nelem);
-
-	/*
-	 * In a partitioned table, nbuckets must be at least equal to
-	 * num_partitions; were it less, keys with apparently different partition
-	 * numbers would map to the same bucket, breaking partition independence.
-	 * (Normally nbuckets will be much bigger; this is just a safety check.)
-	 */
-	while (nbuckets < hctl->num_partitions)
-		nbuckets <<= 1;
+	nsegs = compute_buckets_and_segs(nelem, &nbuckets, hctl->num_partitions, hctl->ssize);
 
 	hctl->max_bucket = hctl->low_mask = nbuckets - 1;
 	hctl->high_mask = (nbuckets << 1) - 1;
 
-	/*
-	 * Figure number of directory segments needed, round up to a power of 2
-	 */
-	nsegs = (nbuckets - 1) / hctl->ssize + 1;
-	nsegs = next_pow2_int(nsegs);
-
 	/*
 	 * Make sure directory is big enough. If pre-allocated directory is too
 	 * small, choke (caller screwed up).
@@ -741,22 +755,21 @@ init_htab(HTAB *hashp, long nelem)
 	if (!(hashp->dir))
 	{
 		CurrentDynaHashCxt = hashp->hcxt;
-		hashp->dir = (HASHSEGMENT *)
-			hashp->alloc(hctl->dsize * sizeof(HASHSEGMENT));
-		if (!hashp->dir)
-			return false;
+		hashp->dir = (HASHSEGMENT *) (((char *) hashp->hctl) + sizeof(HASHHDR));
 	}
 
 	/* Allocate initial segments */
+	i = 0;
 	for (segp = hashp->dir; hctl->nsegs < nsegs; hctl->nsegs++, segp++)
 	{
-		*segp = seg_alloc(hashp);
-		if (*segp == NULL)
-			return false;
-	}
+		*segp = (HASHBUCKET *) (((char *) hashp->hctl)
+								+ sizeof(HASHHDR)
+								+ (hashp->hctl->dsize * sizeof(HASHSEGMENT))
+								+ (i * sizeof(HASHBUCKET) * hashp->ssize));
+		MemSet(*segp, 0, sizeof(HASHBUCKET) * hashp->ssize);
 
-	/* Choose number of entries to allocate at a time */
-	hctl->nelem_alloc = choose_nelem_alloc(hctl->entrysize);
+		i = i + 1;
+	}
 
 #ifdef HASH_DEBUG
 	fprintf(stderr, "init_htab:\n%s%p\n%s%ld\n%s%ld\n%s%d\n%s%ld\n%s%u\n%s%x\n%s%x\n%s%ld\n",
@@ -851,11 +864,64 @@ hash_select_dirsize(long num_entries)
  * and for the (non expansible) directory.
  */
 Size
-hash_get_shared_size(HASHCTL *info, int flags)
+hash_get_init_size(const HASHCTL *info, int flags, long init_size, int nelem_alloc)
 {
-	Assert(flags & HASH_DIRSIZE);
-	Assert(info->dsize == info->max_dsize);
-	return sizeof(HASHHDR) + info->dsize * sizeof(HASHSEGMENT);
+	int			nbuckets;
+	int			nsegs;
+	int			num_partitions;
+	long		ssize;
+	long		dsize;
+	bool		element_alloc = true;
+	Size		elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(info->entrysize);
+
+	/*
+	 * For non-shared hash tables, the requested number of elements
+	 * are allocated only if they are less than nelem_alloc.
+	 * In any case, the init_size should be equal to the number of
+	 * elements added using element_add() in hash_create.
+	 */
+	if (!(flags & HASH_SHARED_MEM))
+	{
+		if (init_size > nelem_alloc)
+			element_alloc = false;
+	}
+	else
+	{
+		Assert(flags & HASH_DIRSIZE);
+		Assert(info->dsize == info->max_dsize);
+	}
+	/* Non-shared hash tables may not specify dir size */
+	if (!(flags & HASH_DIRSIZE))
+	{
+		dsize = DEF_DIRSIZE;
+	}
+	else
+		dsize = info->dsize;
+
+	if (flags & HASH_PARTITION)
+	{
+		num_partitions = info->num_partitions;
+
+		/* Number of entries should be at least equal to the freelists */
+		if (init_size < NUM_FREELISTS)
+			init_size = NUM_FREELISTS;
+	}
+	else
+		num_partitions = 0;
+
+	if (flags & HASH_SEGMENT)
+		ssize = info->ssize;
+	else
+		ssize = DEF_SEGSIZE;
+
+	nsegs = compute_buckets_and_segs(init_size, &nbuckets, num_partitions, ssize);
+
+	if (!element_alloc)
+		init_size = 0;
+
+	return sizeof(HASHHDR) + dsize * sizeof(HASHSEGMENT) +
+		+sizeof(HASHBUCKET) * ssize * nsegs
+		+ init_size * elementSize;
 }
 
 
@@ -1285,7 +1351,8 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 		 * Failing because the needed element is in a different freelist is
 		 * not acceptable.
 		 */
-		if (!element_alloc(hashp, hctl->nelem_alloc, freelist_idx))
+		newElement = element_alloc(hashp, hctl->nelem_alloc);
+		if (newElement == NULL)
 		{
 			int			borrow_from_idx;
 
@@ -1322,6 +1389,7 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 			/* no elements available to borrow either, so out of memory */
 			return NULL;
 		}
+		element_add(hashp, newElement, freelist_idx, hctl->nelem_alloc);
 	}
 
 	/* remove entry from freelist, bump nentries */
@@ -1702,28 +1770,38 @@ seg_alloc(HTAB *hashp)
 /*
  * allocate some new elements and link them into the indicated free list
  */
-static bool
-element_alloc(HTAB *hashp, int nelem, int freelist_idx)
+static HASHELEMENT *
+element_alloc(HTAB *hashp, int nelem)
 {
 	HASHHDR    *hctl = hashp->hctl;
 	Size		elementSize;
-	HASHELEMENT *firstElement;
-	HASHELEMENT *tmpElement;
-	HASHELEMENT *prevElement;
-	int			i;
+	HASHELEMENT *firstElement = NULL;
 
 	if (hashp->isfixed)
-		return false;
+		return NULL;
 
 	/* Each element has a HASHELEMENT header plus user data. */
 	elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
-
 	CurrentDynaHashCxt = hashp->hcxt;
 	firstElement = (HASHELEMENT *) hashp->alloc(nelem * elementSize);
 
 	if (!firstElement)
-		return false;
+		return NULL;
+
+	return firstElement;
+}
 
+static void
+element_add(HTAB *hashp, HASHELEMENT *firstElement, int freelist_idx, int nelem)
+{
+	HASHHDR    *hctl = hashp->hctl;
+	Size		elementSize;
+	HASHELEMENT *tmpElement;
+	HASHELEMENT *prevElement;
+	int			i;
+
+	/* Each element has a HASHELEMENT header plus user data. */
+	elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
 	/* prepare to link all the new entries into the freelist */
 	prevElement = NULL;
 	tmpElement = firstElement;
@@ -1744,8 +1822,6 @@ element_alloc(HTAB *hashp, int nelem, int freelist_idx)
 
 	if (IS_PARTITIONED(hctl))
 		SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
-
-	return true;
 }
 
 /*
@@ -1957,3 +2033,32 @@ AtEOSubXact_HashTables(bool isCommit, int nestDepth)
 		}
 	}
 }
+
+static int
+compute_buckets_and_segs(long nelem, int *nbuckets, long num_partitions, long ssize)
+{
+	int			nsegs;
+
+	/*
+	 * Allocate space for the next greater power of two number of buckets,
+	 * assuming a desired maximum load factor of 1.
+	 */
+	*nbuckets = next_pow2_int(nelem);
+
+	/*
+	 * In a partitioned table, nbuckets must be at least equal to
+	 * num_partitions; were it less, keys with apparently different partition
+	 * numbers would map to the same bucket, breaking partition independence.
+	 * (Normally nbuckets will be much bigger; this is just a safety check.)
+	 */
+	while ((*nbuckets) < num_partitions)
+		(*nbuckets) <<= 1;
+
+
+	/*
+	 * Figure number of directory segments needed, round up to a power of 2
+	 */
+	nsegs = ((*nbuckets) - 1) / ssize + 1;
+	nsegs = next_pow2_int(nsegs);
+	return nsegs;
+}
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index 932cc4f34d..5e513c8116 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -151,7 +151,8 @@ extern void hash_seq_term(HASH_SEQ_STATUS *status);
 extern void hash_freeze(HTAB *hashp);
 extern Size hash_estimate_size(long num_entries, Size entrysize);
 extern long hash_select_dirsize(long num_entries);
-extern Size hash_get_shared_size(HASHCTL *info, int flags);
+extern Size hash_get_init_size(const HASHCTL *info, int flags,
+					long init_size, int nelem_alloc);
 extern void AtEOXact_HashTables(bool isCommit);
 extern void AtEOSubXact_HashTables(bool isCommit, int nestDepth);
 
-- 
2.34.1

v3-0002-Replace-ShmemAlloc-calls-by-ShmemInitStruct.patchapplication/octet-stream; name=v3-0002-Replace-ShmemAlloc-calls-by-ShmemInitStruct.patchDownload
From 598287c0d366bc32def8e39a747f8f4418376706 Mon Sep 17 00:00:00 2001
From: Rahila Syed <rahilasyed.90@gmail.com>
Date: Thu, 6 Mar 2025 20:32:27 +0530
Subject: [PATCH 2/2] Replace ShmemAlloc calls by ShmemInitStruct

The shared memory allocated by ShmemAlloc is not tracked
by pg_shmem_allocations. This commit replaces most of the
calls to ShmemAlloc by ShmemInitStruct to associate a name
with the allocations and ensure that they get tracked by
pg_shmem_allocations. It also merges several smaller
ShmemAlloc calls into larger ShmemInitStruct calls to allocate
and track all the related memory allocations under a single entry.
---
 src/backend/storage/lmgr/predicate.c | 17 +++++++-------
 src/backend/storage/lmgr/proc.c      | 33 +++++++++++++++++++++++-----
 2 files changed, 36 insertions(+), 14 deletions(-)

diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 5b21a05398..dd66990335 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -1226,8 +1226,11 @@ PredicateLockShmemInit(void)
 	 */
 	max_table_size *= 10;
 
+	requestSize = add_size(PredXactListDataSize,
+						   (mul_size((Size) max_table_size,
+									 sizeof(SERIALIZABLEXACT))));
 	PredXact = ShmemInitStruct("PredXactList",
-							   PredXactListDataSize,
+							   requestSize,
 							   &found);
 	Assert(found == IsUnderPostmaster);
 	if (!found)
@@ -1242,9 +1245,7 @@ PredicateLockShmemInit(void)
 		PredXact->LastSxactCommitSeqNo = FirstNormalSerCommitSeqNo - 1;
 		PredXact->CanPartialClearThrough = 0;
 		PredXact->HavePartialClearedThrough = 0;
-		requestSize = mul_size((Size) max_table_size,
-							   sizeof(SERIALIZABLEXACT));
-		PredXact->element = ShmemAlloc(requestSize);
+		PredXact->element = (SERIALIZABLEXACT *) ((char *) PredXact + PredXactListDataSize);
 		/* Add all elements to available list, clean. */
 		memset(PredXact->element, 0, requestSize);
 		for (i = 0; i < max_table_size; i++)
@@ -1299,9 +1300,11 @@ PredicateLockShmemInit(void)
 	 * probably OK.
 	 */
 	max_table_size *= 5;
+	requestSize = mul_size((Size) max_table_size,
+						   RWConflictDataSize);
 
 	RWConflictPool = ShmemInitStruct("RWConflictPool",
-									 RWConflictPoolHeaderDataSize,
+									 RWConflictPoolHeaderDataSize + requestSize,
 									 &found);
 	Assert(found == IsUnderPostmaster);
 	if (!found)
@@ -1309,9 +1312,7 @@ PredicateLockShmemInit(void)
 		int			i;
 
 		dlist_init(&RWConflictPool->availableList);
-		requestSize = mul_size((Size) max_table_size,
-							   RWConflictDataSize);
-		RWConflictPool->element = ShmemAlloc(requestSize);
+		RWConflictPool->element = (RWConflict) ((char *) RWConflictPool + RWConflictPoolHeaderDataSize);
 		/* Add all elements to available list, clean. */
 		memset(RWConflictPool->element, 0, requestSize);
 		for (i = 0; i < max_table_size; i++)
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 749a79d48e..3ae817dedc 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -88,6 +88,7 @@ static void RemoveProcFromArray(int code, Datum arg);
 static void ProcKill(int code, Datum arg);
 static void AuxiliaryProcKill(int code, Datum arg);
 static void CheckDeadLock(void);
+static Size PGProcShmemSize(void);
 
 
 /*
@@ -175,6 +176,7 @@ InitProcGlobal(void)
 			   *fpEndPtr PG_USED_FOR_ASSERTS_ONLY;
 	Size		fpLockBitsSize,
 				fpRelIdSize;
+	Size		requestSize;
 
 	/* Create the ProcGlobal shared structure */
 	ProcGlobal = (PROC_HDR *)
@@ -204,7 +206,10 @@ InitProcGlobal(void)
 	 * with a single freelist.)  Each PGPROC structure is dedicated to exactly
 	 * one of these purposes, and they do not move between groups.
 	 */
-	procs = (PGPROC *) ShmemAlloc(TotalProcs * sizeof(PGPROC));
+	requestSize = PGProcShmemSize();
+
+	procs = (PGPROC *) ShmemInitStruct("PGPROC structures", requestSize, &found);
+
 	MemSet(procs, 0, TotalProcs * sizeof(PGPROC));
 	ProcGlobal->allProcs = procs;
 	/* XXX allProcCount isn't really all of them; it excludes prepared xacts */
@@ -218,11 +223,11 @@ InitProcGlobal(void)
 	 * how hotly they are accessed.
 	 */
 	ProcGlobal->xids =
-		(TransactionId *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->xids));
+		(TransactionId *) ((char *) procs + TotalProcs * sizeof(PGPROC));
 	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
-	ProcGlobal->subxidStates = (XidCacheStatus *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	ProcGlobal->subxidStates = (XidCacheStatus *) ((char *) ProcGlobal->xids + TotalProcs * sizeof(*ProcGlobal->xids) + PG_CACHE_LINE_SIZE);
 	MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	ProcGlobal->statusFlags = (uint8 *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	ProcGlobal->statusFlags = (uint8 *) ((char *) ProcGlobal->subxidStates + TotalProcs * sizeof(*ProcGlobal->subxidStates) + PG_CACHE_LINE_SIZE);
 	MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
 
 	/*
@@ -233,7 +238,7 @@ InitProcGlobal(void)
 	fpLockBitsSize = MAXALIGN(FastPathLockGroupsPerBackend * sizeof(uint64));
 	fpRelIdSize = MAXALIGN(FastPathLockSlotsPerBackend() * sizeof(Oid));
 
-	fpPtr = ShmemAlloc(TotalProcs * (fpLockBitsSize + fpRelIdSize));
+	fpPtr = ShmemInitStruct("Fast path lock arrays", TotalProcs * (fpLockBitsSize + fpRelIdSize), &found);
 	MemSet(fpPtr, 0, TotalProcs * (fpLockBitsSize + fpRelIdSize));
 
 	/* For asserts checking we did not overflow. */
@@ -330,10 +335,26 @@ InitProcGlobal(void)
 	PreparedXactProcs = &procs[MaxBackends + NUM_AUXILIARY_PROCS];
 
 	/* Create ProcStructLock spinlock, too */
-	ProcStructLock = (slock_t *) ShmemAlloc(sizeof(slock_t));
+	ProcStructLock = (slock_t *) ShmemInitStruct("ProcStructLock spinlock", sizeof(slock_t), &found);
 	SpinLockInit(ProcStructLock);
 }
 
+static Size
+PGProcShmemSize(void)
+{
+	Size		size;
+	uint32		TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
+
+	size = TotalProcs * sizeof(PGPROC);
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->xids));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
+	return size;
+}
+
 /*
  * InitProcess -- initialize a per-process PGPROC entry for this backend
  */
-- 
2.34.1

#5Nazir Bilal Yavuz
byavuz81@gmail.com
In reply to: Rahila Syed (#4)
Re: Improve monitoring of shared memory allocations

Hi,

On Wed, 12 Mar 2025 at 13:46, Rahila Syed <rahilasyed90@gmail.com> wrote:

I have now made the changes uniformly across shared and non-shared hash tables,
making the code fix look cleaner. I hope this aligns with your suggestions.
Please find attached updated and rebased versions of both patches.

Thank you for working on this!

I have a couple of comments, I have only reviewed 0001 so far.

You may need to run pgindent, it makes some changes.
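
For example, something like this from the top of the source tree (assuming
pg_bsd_indent is built and available in your PATH):

    src/tools/pgindent/pgindent src/backend/utils/hash/dynahash.c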

diff --git a/src/backend/utils/hash/dynahash.c
b/src/backend/utils/hash/dynahash.c
index cd5a00132f..3bdf3d6fd5 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
+        /*
+         * If table is shared, calculate the offset at which to find the the
+         * first partition of elements
+         */
+
+        nsegs = compute_buckets_and_segs(nelem, &nbuckets,
hctl->num_partitions, hctl->ssize);

Blank line between the comment and the code.

+    bool        element_alloc = true;
+    Size        elementSize = MAXALIGN(sizeof(HASHELEMENT)) +
MAXALIGN(info->entrysize);

It looks odd to me that camelCase and snake_case are used together but
it is already used like that in this file; so I think it should be
okay.

 /*
  * allocate some new elements and link them into the indicated free list
  */
-static bool
-element_alloc(HTAB *hashp, int nelem, int freelist_idx)
+static HASHELEMENT *
+element_alloc(HTAB *hashp, int nelem)

Comment needs an update. This function no longer links elements into
the free list.

+static int
+compute_buckets_and_segs(long nelem, int *nbuckets, long
num_partitions, long ssize)
+{
...
+    /*
+     * In a partitioned table, nbuckets must be at least equal to
+     * num_partitions; were it less, keys with apparently different partition
+     * numbers would map to the same bucket, breaking partition independence.
+     * (Normally nbuckets will be much bigger; this is just a safety check.)
+     */
+    while ((*nbuckets) < num_partitions)
+        (*nbuckets) <<= 1;

I have some worries about this function, although I am not sure whether
what I say below has real-life implications, since you already said
'Normally nbuckets will be much bigger; this is just a safety check.'.

1- num_partitions is long and nbuckets is int, so could there be any
case where num_partitions is bigger than MAX_INT and causes an infinite
loop?
2- Even if we assume both nbuckets and num_partitions are initialized with
the same type, (*nbuckets) <<= 1 will cause an infinite loop if
num_partitions is bigger than MAX_TYPE / 2.

So I think the solution is to confirm that num_partitions <
MAX_NBUCKETS_TYPE / 2. What do you think?
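
To illustrate, the kind of guard I have in mind is roughly this (an
illustrative sketch only, assuming nbuckets stays an int):

    /* reject values that would make the safety loop overflow / never end */
    if (num_partitions > INT_MAX / 2)
        elog(ERROR, "too many hash table partitions");

    while ((*nbuckets) < num_partitions)
        (*nbuckets) <<= 1;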

--
Regards,
Nazir Bilal Yavuz
Microsoft

#6Rahila Syed
rahilasyed90@gmail.com
In reply to: Nazir Bilal Yavuz (#5)
2 attachment(s)
Re: Improve monitoring of shared memory allocations

Hi Bilal,

I have a couple of comments, I have only reviewed 0001 so far.

Thank you for reviewing!

You may need to run pgindent, it makes some changes.

The attached v4 patch has been updated after running pgindent.

+         * If table is shared, calculate the offset at which to find the the
+         * first partition of elements
+         */
+
+        nsegs = compute_buckets_and_segs(nelem, &nbuckets,
hctl->num_partitions, hctl->ssize);

Blank line between the comment and the code.

Removed this.

/*
* allocate some new elements and link them into the indicated free list
*/
-static bool
-element_alloc(HTAB *hashp, int nelem, int freelist_idx)
+static HASHELEMENT *
+element_alloc(HTAB *hashp, int nelem)

Comment needs an update. This function no longer links elements into
the free list.

Updated this and a few other comments in the attached v4 patch.

+static int
+compute_buckets_and_segs(long nelem, int *nbuckets, long
num_partitions, long ssize)
+{
...
+    /*
+     * In a partitioned table, nbuckets must be at least equal to
+     * num_partitions; were it less, keys with apparently different
partition
+     * numbers would map to the same bucket, breaking partition
independence.
+     * (Normally nbuckets will be much bigger; this is just a safety
check.)
+     */
+    while ((*nbuckets) < num_partitions)
+        (*nbuckets) <<= 1;

I have some worries about this function, I am not sure what I said
below has real life implications as you already said 'Normally
nbuckets will be much bigger; this is just a safety check.'.

1- num_partitions is long and nbuckets is int, so could there be any
case where num_partition is bigger than MAX_INT and cause an infinite
loop?
2- Although we assume both nbuckets and num_partition initialized as
the same type, (*nbuckets) <<= 1 will cause an infinite loop if
num_partition is bigger than MAX_TYPE / 2.

So I think that the solution is to confirm that num_partition <
MAX_NBUCKETS_TYPE / 2, what do you think?

Your concern is valid. This has been addressed in the existing code by
calling next_pow2_int() on num_partitions before running the function.
Additionally, I am not adding any new code to the compute_buckets_and_segs
function; I am simply moving part of the init_htab() code into a separate
function for reuse.

Please find attached the updated and rebased patches.

Thank you,
Rahila Syed

Attachments:

v4-0002-Replace-ShmemAlloc-calls-by-ShmemInitStruct.patchapplication/octet-stream; name=v4-0002-Replace-ShmemAlloc-calls-by-ShmemInitStruct.patchDownload
From 56c1ca94ef109f484a01496f09e9aba2bed0a3a9 Mon Sep 17 00:00:00 2001
From: Rahila Syed <rahilasyed.90@gmail.com>
Date: Thu, 6 Mar 2025 20:32:27 +0530
Subject: [PATCH 2/2] Replace ShmemAlloc calls by ShmemInitStruct

The shared memory allocated by ShmemAlloc is not tracked
by pg_shmem_allocations. This commit replaces most of the
calls to ShmemAlloc by ShmemInitStruct to associate a name
with the allocations and ensure that they get tracked by
pg_shmem_allocations. It also merges several smaller
ShmemAlloc calls into larger ShmemInitStruct calls to allocate
and track all the related memory allocations under a single entry.
---
 src/backend/storage/lmgr/predicate.c | 17 +++++++-------
 src/backend/storage/lmgr/proc.c      | 33 +++++++++++++++++++++++-----
 2 files changed, 36 insertions(+), 14 deletions(-)

diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 5b21a05398..dd66990335 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -1226,8 +1226,11 @@ PredicateLockShmemInit(void)
 	 */
 	max_table_size *= 10;
 
+	requestSize = add_size(PredXactListDataSize,
+						   (mul_size((Size) max_table_size,
+									 sizeof(SERIALIZABLEXACT))));
 	PredXact = ShmemInitStruct("PredXactList",
-							   PredXactListDataSize,
+							   requestSize,
 							   &found);
 	Assert(found == IsUnderPostmaster);
 	if (!found)
@@ -1242,9 +1245,7 @@ PredicateLockShmemInit(void)
 		PredXact->LastSxactCommitSeqNo = FirstNormalSerCommitSeqNo - 1;
 		PredXact->CanPartialClearThrough = 0;
 		PredXact->HavePartialClearedThrough = 0;
-		requestSize = mul_size((Size) max_table_size,
-							   sizeof(SERIALIZABLEXACT));
-		PredXact->element = ShmemAlloc(requestSize);
+		PredXact->element = (SERIALIZABLEXACT *) ((char *) PredXact + PredXactListDataSize);
 		/* Add all elements to available list, clean. */
 		memset(PredXact->element, 0, requestSize);
 		for (i = 0; i < max_table_size; i++)
@@ -1299,9 +1300,11 @@ PredicateLockShmemInit(void)
 	 * probably OK.
 	 */
 	max_table_size *= 5;
+	requestSize = mul_size((Size) max_table_size,
+						   RWConflictDataSize);
 
 	RWConflictPool = ShmemInitStruct("RWConflictPool",
-									 RWConflictPoolHeaderDataSize,
+									 RWConflictPoolHeaderDataSize + requestSize,
 									 &found);
 	Assert(found == IsUnderPostmaster);
 	if (!found)
@@ -1309,9 +1312,7 @@ PredicateLockShmemInit(void)
 		int			i;
 
 		dlist_init(&RWConflictPool->availableList);
-		requestSize = mul_size((Size) max_table_size,
-							   RWConflictDataSize);
-		RWConflictPool->element = ShmemAlloc(requestSize);
+		RWConflictPool->element = (RWConflict) ((char *) RWConflictPool + RWConflictPoolHeaderDataSize);
 		/* Add all elements to available list, clean. */
 		memset(RWConflictPool->element, 0, requestSize);
 		for (i = 0; i < max_table_size; i++)
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index e4ca861a8e..65239d743d 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -88,6 +88,7 @@ static void RemoveProcFromArray(int code, Datum arg);
 static void ProcKill(int code, Datum arg);
 static void AuxiliaryProcKill(int code, Datum arg);
 static void CheckDeadLock(void);
+static Size PGProcShmemSize(void);
 
 
 /*
@@ -175,6 +176,7 @@ InitProcGlobal(void)
 			   *fpEndPtr PG_USED_FOR_ASSERTS_ONLY;
 	Size		fpLockBitsSize,
 				fpRelIdSize;
+	Size		requestSize;
 
 	/* Create the ProcGlobal shared structure */
 	ProcGlobal = (PROC_HDR *)
@@ -204,7 +206,10 @@ InitProcGlobal(void)
 	 * with a single freelist.)  Each PGPROC structure is dedicated to exactly
 	 * one of these purposes, and they do not move between groups.
 	 */
-	procs = (PGPROC *) ShmemAlloc(TotalProcs * sizeof(PGPROC));
+	requestSize = PGProcShmemSize();
+
+	procs = (PGPROC *) ShmemInitStruct("PGPROC structures", requestSize, &found);
+
 	MemSet(procs, 0, TotalProcs * sizeof(PGPROC));
 	ProcGlobal->allProcs = procs;
 	/* XXX allProcCount isn't really all of them; it excludes prepared xacts */
@@ -218,11 +223,11 @@ InitProcGlobal(void)
 	 * how hotly they are accessed.
 	 */
 	ProcGlobal->xids =
-		(TransactionId *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->xids));
+		(TransactionId *) ((char *) procs + TotalProcs * sizeof(PGPROC));
 	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
-	ProcGlobal->subxidStates = (XidCacheStatus *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	ProcGlobal->subxidStates = (XidCacheStatus *) ((char *) ProcGlobal->xids + TotalProcs * sizeof(*ProcGlobal->xids) + PG_CACHE_LINE_SIZE);
 	MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	ProcGlobal->statusFlags = (uint8 *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	ProcGlobal->statusFlags = (uint8 *) ((char *) ProcGlobal->subxidStates + TotalProcs * sizeof(*ProcGlobal->subxidStates) + PG_CACHE_LINE_SIZE);
 	MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
 
 	/*
@@ -233,7 +238,7 @@ InitProcGlobal(void)
 	fpLockBitsSize = MAXALIGN(FastPathLockGroupsPerBackend * sizeof(uint64));
 	fpRelIdSize = MAXALIGN(FastPathLockSlotsPerBackend() * sizeof(Oid));
 
-	fpPtr = ShmemAlloc(TotalProcs * (fpLockBitsSize + fpRelIdSize));
+	fpPtr = ShmemInitStruct("Fast path lock arrays", TotalProcs * (fpLockBitsSize + fpRelIdSize), &found);
 	MemSet(fpPtr, 0, TotalProcs * (fpLockBitsSize + fpRelIdSize));
 
 	/* For asserts checking we did not overflow. */
@@ -330,10 +335,26 @@ InitProcGlobal(void)
 	PreparedXactProcs = &procs[MaxBackends + NUM_AUXILIARY_PROCS];
 
 	/* Create ProcStructLock spinlock, too */
-	ProcStructLock = (slock_t *) ShmemAlloc(sizeof(slock_t));
+	ProcStructLock = (slock_t *) ShmemInitStruct("ProcStructLock spinlock", sizeof(slock_t), &found);
 	SpinLockInit(ProcStructLock);
 }
 
+static Size
+PGProcShmemSize(void)
+{
+	Size		size;
+	uint32		TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
+
+	size = TotalProcs * sizeof(PGPROC);
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->xids));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
+	return size;
+}
+
 /*
  * InitProcess -- initialize a per-process PGPROC entry for this backend
  */
-- 
2.34.1

v4-0001-Account-for-initial-shared-memory-allocated-by-hash_.patchapplication/octet-stream; name=v4-0001-Account-for-initial-shared-memory-allocated-by-hash_.patchDownload
From 2352d4918f3c5b548790de8e2a673759d83b6f96 Mon Sep 17 00:00:00 2001
From: Rahila Syed <rahilasyed.90@gmail.com>
Date: Thu, 6 Mar 2025 20:06:20 +0530
Subject: [PATCH 1/2] Account for initial shared memory allocated by
 hash_create

pg_shmem_allocations tracks the memory allocated by ShmemInitStruct,
which, in case of shared hash tables, only covers memory allocated
to the hash directory and control structure. The hash segments and
buckets are allocated using ShmemAllocNoError which does not attribute
the allocations to the hash table name. Thus, these allocations are
not tracked in pg_shmem_allocations.

Include the allocation of segments, buckets and elements in the initial
allocation of the shared hash directory. This results in the existing ShmemIndex
entries reflecting all these allocations. The resulting tuples in
pg_shmem_allocations represent the total size of the initial hash table
including all the buckets and the elements they contain, instead of just
the directory size.
---
 src/backend/storage/ipc/shmem.c   |   3 +-
 src/backend/utils/hash/dynahash.c | 220 ++++++++++++++++++++++--------
 src/include/utils/hsearch.h       |   3 +-
 3 files changed, 169 insertions(+), 57 deletions(-)

diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39..d8aed0bfaa 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -73,6 +73,7 @@
 #include "storage/shmem.h"
 #include "storage/spin.h"
 #include "utils/builtins.h"
+#include "utils/dynahash.h"
 
 static void *ShmemAllocRaw(Size size, Size *allocated_size);
 
@@ -346,7 +347,7 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
 
 	/* look it up in the shmem index */
 	location = ShmemInitStruct(name,
-							   hash_get_shared_size(infoP, hash_flags),
+							   hash_get_init_size(infoP, hash_flags, init_size, 0),
 							   &found);
 
 	/*
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 3f25929f2d..550f04359a 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -265,7 +265,7 @@ static long hash_accesses,
  */
 static void *DynaHashAlloc(Size size);
 static HASHSEGMENT seg_alloc(HTAB *hashp);
-static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx);
+static HASHELEMENT *element_alloc(HTAB *hashp, int nelem);
 static bool dir_realloc(HTAB *hashp);
 static bool expand_table(HTAB *hashp);
 static HASHBUCKET get_hash_entry(HTAB *hashp, int freelist_idx);
@@ -281,6 +281,9 @@ static void register_seq_scan(HTAB *hashp);
 static void deregister_seq_scan(HTAB *hashp);
 static bool has_seq_scans(HTAB *hashp);
 
+static int	compute_buckets_and_segs(long nelem, int *nbuckets,
+									 long num_partitions, long ssize);
+static void element_add(HTAB *hashp, HASHELEMENT *firstElement, int freelist_idx, int nelem);
 
 /*
  * memory allocation support
@@ -353,6 +356,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 {
 	HTAB	   *hashp;
 	HASHHDR    *hctl;
+	int			nelem_batch;
 
 	/*
 	 * Hash tables now allocate space for key and data, but you have to say
@@ -507,9 +511,19 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		hashp->isshared = false;
 	}
 
+	/* Choose number of entries to allocate at a time */
+	nelem_batch = choose_nelem_alloc(info->entrysize);
+
+	/*
+	 * Allocate the memory needed for hash header, directory, segments and
+	 * elements together. Use pointer arithmetic to arrive at the start of
+	 * each of these structures later.
+	 */
 	if (!hashp->hctl)
 	{
-		hashp->hctl = (HASHHDR *) hashp->alloc(sizeof(HASHHDR));
+		Size		size = hash_get_init_size(info, flags, nelem, nelem_batch);
+
+		hashp->hctl = (HASHHDR *) hashp->alloc(size);
 		if (!hashp->hctl)
 			ereport(ERROR,
 					(errcode(ERRCODE_OUT_OF_MEMORY),
@@ -558,6 +572,8 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 	hctl->keysize = info->keysize;
 	hctl->entrysize = info->entrysize;
 
+	hctl->nelem_alloc = nelem_batch;
+
 	/* make local copies of heavily-used constant fields */
 	hashp->keysize = hctl->keysize;
 	hashp->ssize = hctl->ssize;
@@ -582,6 +598,9 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 					freelist_partitions,
 					nelem_alloc,
 					nelem_alloc_first;
+		void	   *curr_offset = NULL;
+		int			nsegs;
+		int			nbuckets;
 
 		/*
 		 * If hash table is partitioned, give each freelist an equal share of
@@ -592,6 +611,14 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		else
 			freelist_partitions = 1;
 
+		/*
+		 * Calculate the offset at which to find the first partition of
+		 * elements
+		 */
+		nsegs = compute_buckets_and_segs(nelem, &nbuckets, hctl->num_partitions, hctl->ssize);
+
+		curr_offset = (((char *) hashp->hctl) + sizeof(HASHHDR) + (hctl->dsize * sizeof(HASHSEGMENT)) + (sizeof(HASHBUCKET) * hctl->ssize * nsegs));
+
 		nelem_alloc = nelem / freelist_partitions;
 		if (nelem_alloc <= 0)
 			nelem_alloc = 1;
@@ -609,11 +636,16 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		for (i = 0; i < freelist_partitions; i++)
 		{
 			int			temp = (i == 0) ? nelem_alloc_first : nelem_alloc;
+			HASHELEMENT *firstElement;
+			Size		elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
 
-			if (!element_alloc(hashp, temp, i))
-				ereport(ERROR,
-						(errcode(ERRCODE_OUT_OF_MEMORY),
-						 errmsg("out of memory")));
+			/*
+			 * Memory is allocated as part of initial allocation in
+			 * ShmemInitHash
+			 */
+			firstElement = (HASHELEMENT *) curr_offset;
+			curr_offset = (((char *) curr_offset) + (temp * elementSize));
+			element_add(hashp, firstElement, i, temp);
 		}
 	}
 
@@ -701,30 +733,11 @@ init_htab(HTAB *hashp, long nelem)
 		for (i = 0; i < NUM_FREELISTS; i++)
 			SpinLockInit(&(hctl->freeList[i].mutex));
 
-	/*
-	 * Allocate space for the next greater power of two number of buckets,
-	 * assuming a desired maximum load factor of 1.
-	 */
-	nbuckets = next_pow2_int(nelem);
-
-	/*
-	 * In a partitioned table, nbuckets must be at least equal to
-	 * num_partitions; were it less, keys with apparently different partition
-	 * numbers would map to the same bucket, breaking partition independence.
-	 * (Normally nbuckets will be much bigger; this is just a safety check.)
-	 */
-	while (nbuckets < hctl->num_partitions)
-		nbuckets <<= 1;
+	nsegs = compute_buckets_and_segs(nelem, &nbuckets, hctl->num_partitions, hctl->ssize);
 
 	hctl->max_bucket = hctl->low_mask = nbuckets - 1;
 	hctl->high_mask = (nbuckets << 1) - 1;
 
-	/*
-	 * Figure number of directory segments needed, round up to a power of 2
-	 */
-	nsegs = (nbuckets - 1) / hctl->ssize + 1;
-	nsegs = next_pow2_int(nsegs);
-
 	/*
 	 * Make sure directory is big enough. If pre-allocated directory is too
 	 * small, choke (caller screwed up).
@@ -737,26 +750,28 @@ init_htab(HTAB *hashp, long nelem)
 			return false;
 	}
 
-	/* Allocate a directory */
+	/*
+	 * Assign a directory by making it point to the correct location in the
+	 * pre-allocated buffer.
+	 */
 	if (!(hashp->dir))
 	{
 		CurrentDynaHashCxt = hashp->hcxt;
-		hashp->dir = (HASHSEGMENT *)
-			hashp->alloc(hctl->dsize * sizeof(HASHSEGMENT));
-		if (!hashp->dir)
-			return false;
+		hashp->dir = (HASHSEGMENT *) (((char *) hashp->hctl) + sizeof(HASHHDR));
 	}
 
-	/* Allocate initial segments */
+	/* Assign initial segments, which are also pre-allocated */
+	i = 0;
 	for (segp = hashp->dir; hctl->nsegs < nsegs; hctl->nsegs++, segp++)
 	{
-		*segp = seg_alloc(hashp);
-		if (*segp == NULL)
-			return false;
-	}
+		*segp = (HASHBUCKET *) (((char *) hashp->hctl)
+								+ sizeof(HASHHDR)
+								+ (hashp->hctl->dsize * sizeof(HASHSEGMENT))
+								+ (i * sizeof(HASHBUCKET) * hashp->ssize));
+		MemSet(*segp, 0, sizeof(HASHBUCKET) * hashp->ssize);
 
-	/* Choose number of entries to allocate at a time */
-	hctl->nelem_alloc = choose_nelem_alloc(hctl->entrysize);
+		i = i + 1;
+	}
 
 #ifdef HASH_DEBUG
 	fprintf(stderr, "init_htab:\n%s%p\n%s%ld\n%s%ld\n%s%d\n%s%ld\n%s%u\n%s%x\n%s%x\n%s%ld\n",
@@ -851,11 +866,64 @@ hash_select_dirsize(long num_entries)
  * and for the (non expansible) directory.
  */
 Size
-hash_get_shared_size(HASHCTL *info, int flags)
+hash_get_init_size(const HASHCTL *info, int flags, long init_size, int nelem_alloc)
 {
-	Assert(flags & HASH_DIRSIZE);
-	Assert(info->dsize == info->max_dsize);
-	return sizeof(HASHHDR) + info->dsize * sizeof(HASHSEGMENT);
+	int			nbuckets;
+	int			nsegs;
+	int			num_partitions;
+	long		ssize;
+	long		dsize;
+	bool		element_alloc = true;
+	Size		elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(info->entrysize);
+
+	/*
+	 * For non-shared hash tables, the requested number of elements are
+	 * allocated only if they are less than nelem_alloc. In any case, the
+	 * init_size should be equal to the number of elements added using
+	 * element_add() in hash_create.
+	 */
+	if (!(flags & HASH_SHARED_MEM))
+	{
+		if (init_size > nelem_alloc)
+			element_alloc = false;
+	}
+	else
+	{
+		Assert(flags & HASH_DIRSIZE);
+		Assert(info->dsize == info->max_dsize);
+	}
+	/* Non-shared hash tables may not specify dir size */
+	if (!(flags & HASH_DIRSIZE))
+	{
+		dsize = DEF_DIRSIZE;
+	}
+	else
+		dsize = info->dsize;
+
+	if (flags & HASH_PARTITION)
+	{
+		num_partitions = info->num_partitions;
+
+		/* Number of entries should be at least equal to the freelists */
+		if (init_size < NUM_FREELISTS)
+			init_size = NUM_FREELISTS;
+	}
+	else
+		num_partitions = 0;
+
+	if (flags & HASH_SEGMENT)
+		ssize = info->ssize;
+	else
+		ssize = DEF_SEGSIZE;
+
+	nsegs = compute_buckets_and_segs(init_size, &nbuckets, num_partitions, ssize);
+
+	if (!element_alloc)
+		init_size = 0;
+
+	return sizeof(HASHHDR) + dsize * sizeof(HASHSEGMENT) +
+		+sizeof(HASHBUCKET) * ssize * nsegs
+		+ init_size * elementSize;
 }
 
 
@@ -1285,7 +1353,8 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 		 * Failing because the needed element is in a different freelist is
 		 * not acceptable.
 		 */
-		if (!element_alloc(hashp, hctl->nelem_alloc, freelist_idx))
+		newElement = element_alloc(hashp, hctl->nelem_alloc);
+		if (newElement == NULL)
 		{
 			int			borrow_from_idx;
 
@@ -1322,6 +1391,7 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 			/* no elements available to borrow either, so out of memory */
 			return NULL;
 		}
+		element_add(hashp, newElement, freelist_idx, hctl->nelem_alloc);
 	}
 
 	/* remove entry from freelist, bump nentries */
@@ -1700,30 +1770,43 @@ seg_alloc(HTAB *hashp)
 }
 
 /*
- * allocate some new elements and link them into the indicated free list
+ * allocate some new elements
  */
-static bool
-element_alloc(HTAB *hashp, int nelem, int freelist_idx)
+static HASHELEMENT *
+element_alloc(HTAB *hashp, int nelem)
 {
 	HASHHDR    *hctl = hashp->hctl;
 	Size		elementSize;
-	HASHELEMENT *firstElement;
-	HASHELEMENT *tmpElement;
-	HASHELEMENT *prevElement;
-	int			i;
+	HASHELEMENT *firstElement = NULL;
 
 	if (hashp->isfixed)
-		return false;
+		return NULL;
 
 	/* Each element has a HASHELEMENT header plus user data. */
 	elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
-
 	CurrentDynaHashCxt = hashp->hcxt;
 	firstElement = (HASHELEMENT *) hashp->alloc(nelem * elementSize);
 
 	if (!firstElement)
-		return false;
+		return NULL;
+
+	return firstElement;
+}
+
+/*
+ * Link the elements allocated by element_alloc into the indicated free list
+ */
+static void
+element_add(HTAB *hashp, HASHELEMENT *firstElement, int freelist_idx, int nelem)
+{
+	HASHHDR    *hctl = hashp->hctl;
+	Size		elementSize;
+	HASHELEMENT *tmpElement;
+	HASHELEMENT *prevElement;
+	int			i;
 
+	/* Each element has a HASHELEMENT header plus user data. */
+	elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
 	/* prepare to link all the new entries into the freelist */
 	prevElement = NULL;
 	tmpElement = firstElement;
@@ -1744,8 +1827,6 @@ element_alloc(HTAB *hashp, int nelem, int freelist_idx)
 
 	if (IS_PARTITIONED(hctl))
 		SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
-
-	return true;
 }
 
 /*
@@ -1957,3 +2038,32 @@ AtEOSubXact_HashTables(bool isCommit, int nestDepth)
 		}
 	}
 }
+
+static int
+compute_buckets_and_segs(long nelem, int *nbuckets, long num_partitions, long ssize)
+{
+	int			nsegs;
+
+	/*
+	 * Allocate space for the next greater power of two number of buckets,
+	 * assuming a desired maximum load factor of 1.
+	 */
+	*nbuckets = next_pow2_int(nelem);
+
+	/*
+	 * In a partitioned table, nbuckets must be at least equal to
+	 * num_partitions; were it less, keys with apparently different partition
+	 * numbers would map to the same bucket, breaking partition independence.
+	 * (Normally nbuckets will be much bigger; this is just a safety check.)
+	 */
+	while ((*nbuckets) < num_partitions)
+		(*nbuckets) <<= 1;
+
+
+	/*
+	 * Figure number of directory segments needed, round up to a power of 2
+	 */
+	nsegs = ((*nbuckets) - 1) / ssize + 1;
+	nsegs = next_pow2_int(nsegs);
+	return nsegs;
+}
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index 932cc4f34d..79b959ffc3 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -151,7 +151,8 @@ extern void hash_seq_term(HASH_SEQ_STATUS *status);
 extern void hash_freeze(HTAB *hashp);
 extern Size hash_estimate_size(long num_entries, Size entrysize);
 extern long hash_select_dirsize(long num_entries);
-extern Size hash_get_shared_size(HASHCTL *info, int flags);
+extern Size hash_get_init_size(const HASHCTL *info, int flags,
+							   long init_size, int nelem_alloc);
 extern void AtEOXact_HashTables(bool isCommit);
 extern void AtEOSubXact_HashTables(bool isCommit, int nestDepth);
 
-- 
2.34.1

#7Tomas Vondra
tomas@vondra.me
In reply to: Rahila Syed (#6)
6 attachment(s)
Re: Improve monitoring of shared memory allocations

Hi,

I did a review on v3 of the patch. I see there have been some minor changes
in v4 - I haven't tried to adjust my review, but I assume most of my
comments still apply.

Most of my suggestions are about formatting and/or readability. Some of
the lines (e.g. the pointer arithmetic) got so long that pgindent would have
to wrap them, but more importantly they are really hard to decipher.

I've added a couple "review" commits, actually doing most of what I'm
going to suggest.

1) I don't quite understand why hash_get_shared_size() got renamed to
hash_get_init_size()? Why is that needed? Does it cover only some
initial allocation, and there's additional memory allocated later? And
what's the point of the added const?

2) I find it more readable when ShmemInitStruct() is formatted on
multiple lines. Yes, it's a matter of choice, but also other places do
it this way, I think.
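
For example, taking one of the calls from 0002, I'd write it as:

    procs = (PGPROC *) ShmemInitStruct("PGPROC structures",
                                       requestSize,
                                       &found);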

3) The changes in hash_create() are a good example of readability issues
I already mentioned. Consider this line:

curr_offset = (((char *) hashp->hctl) + sizeof(HASHHDR) + (hctl->dsize *
sizeof(HASHSEGMENT)) + (sizeof(HASHBUCKET) * hctl->ssize * nsegs));

First, I'd point out this is not an offset but a pointer, so the
variable name is a bit misleading. But more importantly, I envy those
who can parse this in their head and see if it's correct.

I think it's much better to define a couple macros to calculate parts of
this, essentially a tiny "language" expressing this in a concise way.
The 0002 patch defines

- HASH_ELEMENTS
- HASH_ELEMENT_NEXT
- HASH_SEGMENT_PTR
- HASH_SEGMENT_SIZE
- ...

and then uses that in hash_create(). I believe it's way more readable
like this.
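
Roughly, the idea is something like the following (a simplified sketch only -
the exact definitions in the attached patch may differ; HASH_ELEMENT_SIZE and
HASH_DIRECTORY_PTR are just extra helpers for the sketch):

    #define HASH_ELEMENT_SIZE(hctl) \
        (MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN((hctl)->entrysize))

    #define HASH_SEGMENT_SIZE(hctl) \
        (sizeof(HASHBUCKET) * (hctl)->ssize)

    #define HASH_DIRECTORY_PTR(hctl) \
        (((char *) (hctl)) + sizeof(HASHHDR))

    #define HASH_SEGMENT_PTR(hctl, idx) \
        (HASH_DIRECTORY_PTR(hctl) + (hctl)->dsize * sizeof(HASHSEGMENT) + \
         (idx) * HASH_SEGMENT_SIZE(hctl))

    #define HASH_ELEMENTS(hctl, nsegs) \
        ((HASHELEMENT *) HASH_SEGMENT_PTR(hctl, nsegs))

    #define HASH_ELEMENT_NEXT(hctl, ptr, n) \
        ((HASHELEMENT *) (((char *) (ptr)) + (n) * HASH_ELEMENT_SIZE(hctl)))

so that the long assignment quoted above can be written as something like

    curr_offset = HASH_ELEMENTS(hctl, nsegs);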

4) I find it a bit strange when a function calculates multiple values,
and then returns them in different ways - one as a return value, one
through a pointer, the way compute_buckets_and_segs() did.

There are cases when it makes sense (e.g. when one of the values is
returned only optionally), but otherwise I think it's better to return
both values in the same way, through pointers. The 0002 patch adjusts
compute_buckets_and_segs() to do it like this.

We have a precedent for this - ExecChooseHashTableSize(), which is doing
a very similar thing for sizing hashjoin hash table.
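
Roughly, the signature then becomes something like this (a sketch, not the
exact code from the attached patch):

    static void
    compute_buckets_and_segs(long nelem, long num_partitions, long ssize,
                             int *nbuckets, int *nsegs);

    ...

    compute_buckets_and_segs(nelem, hctl->num_partitions, hctl->ssize,
                             &nbuckets, &nsegs);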

5) Isn't it wrong that PredicateLockShmemInit() removes the ShmemAlloc()
along with the per-element requestSize calculation, but then still does

memset(PredXact->element, 0, requestSize);

and

memset(RWConflictPool->element, 0, requestSize);

with requestSize for the whole allocation? I haven't seen any crashes,
but this seems wrong to me. I believe we can simply zero the whole
allocation right after ShmemInitStruct(), there's no need to do this for
each individual element.
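
In other words, something like this (sketch only):

    PredXact = ShmemInitStruct("PredXactList", requestSize, &found);
    Assert(found == IsUnderPostmaster);
    if (!found)
    {
        /* zero the whole allocation (header + elements) in one go */
        MemSet(PredXact, 0, requestSize);

        /* ... then initialize the header fields and the element list ... */
    }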

5) InitProcGlobal is another example of hard-to-read code. Admittedly,
it wasn't particularly readable before, but making the lines even longer
does not make it much better ...

I guess adding a couple macros similar to hash_create() would be one way
to improve this (and there's even a review comment in that sense), but I
chose to just do maintain a simple pointer. But if you think it should
be done differently, feel free to do so.

6) I moved the PGProcShmemSize() to be right after ProcGlobalShmemSize()
which seems to be doing a very similar thing, and to not require the
prototype. Minor detail, feel free to undo.

7) I think the PG_CACHE_LINE_SIZE is entirely unrelated to the rest of
the patch, and I see no reason to do it in the same commit. So 0005
removes this change, and 0006 reintroduces it separately.

FWIW I see no justification for why the cache line padding is useful,
and it seems quite unlikely this would benefit anything, consiering it
adds padding between fairly long arrays. What kind of contention can we
get there? I don't get it.

Also, why is the patch adding padding after statusFlags (the last array
allocated in InitProcGlobal) and not between allProcs and xids?
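
If the padding matters at all, I'd have expected PGProcShmemSize() to put it
between the arrays, roughly like this (illustrative only, not a proposal):

    size = mul_size(TotalProcs, sizeof(PGPROC));
    size = add_size(size, PG_CACHE_LINE_SIZE);
    size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->xids)));
    size = add_size(size, PG_CACHE_LINE_SIZE);
    size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->subxidStates)));
    size = add_size(size, PG_CACHE_LINE_SIZE);
    size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->statusFlags)));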

regards

--
Tomas Vondra

Attachments:

v20250324-0001-Account-for-initial-shared-memory-allocate.patchtext/x-patch; charset=UTF-8; name=v20250324-0001-Account-for-initial-shared-memory-allocate.patchDownload
From f527909dda02b4c7231db53a0fe6cecbaec55ca4 Mon Sep 17 00:00:00 2001
From: Rahila Syed <rahilasyed.90@gmail.com>
Date: Thu, 6 Mar 2025 20:06:20 +0530
Subject: [PATCH v20250324 1/6] Account for initial shared memory allocated by
 hash_create

pg_shmem_allocations tracks the memory allocated by ShmemInitStruct,
which, in case of shared hash tables, only covers memory allocated
to the hash directory and control structure. The hash segments and
buckets are allocated using ShmemAllocNoError which does not attribute
the allocations to the hash table name. Thus, these allocations are
not tracked in pg_shmem_allocations.

Include the allocation of segments, buckets and elements in the initial
allocation of the shared hash directory. This results in the existing ShmemIndex
entries reflecting all these allocations. The resulting tuples in
pg_shmem_allocations represent the total size of the initial hash table
including all the buckets and the elements they contain, instead of just
the directory size.
---
 src/backend/storage/ipc/shmem.c   |   3 +-
 src/backend/utils/hash/dynahash.c | 209 ++++++++++++++++++++++--------
 src/include/utils/hsearch.h       |   3 +-
 3 files changed, 161 insertions(+), 54 deletions(-)

diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39e..d8aed0bfaaf 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -73,6 +73,7 @@
 #include "storage/shmem.h"
 #include "storage/spin.h"
 #include "utils/builtins.h"
+#include "utils/dynahash.h"
 
 static void *ShmemAllocRaw(Size size, Size *allocated_size);
 
@@ -346,7 +347,7 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
 
 	/* look it up in the shmem index */
 	location = ShmemInitStruct(name,
-							   hash_get_shared_size(infoP, hash_flags),
+							   hash_get_init_size(infoP, hash_flags, init_size, 0),
 							   &found);
 
 	/*
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 3f25929f2d8..ba785f1d5f2 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -265,7 +265,7 @@ static long hash_accesses,
  */
 static void *DynaHashAlloc(Size size);
 static HASHSEGMENT seg_alloc(HTAB *hashp);
-static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx);
+static HASHELEMENT *element_alloc(HTAB *hashp, int nelem);
 static bool dir_realloc(HTAB *hashp);
 static bool expand_table(HTAB *hashp);
 static HASHBUCKET get_hash_entry(HTAB *hashp, int freelist_idx);
@@ -281,6 +281,9 @@ static void register_seq_scan(HTAB *hashp);
 static void deregister_seq_scan(HTAB *hashp);
 static bool has_seq_scans(HTAB *hashp);
 
+static int	compute_buckets_and_segs(long nelem, int *nbuckets,
+									 long num_partitions, long ssize);
+static void element_add(HTAB *hashp, HASHELEMENT *firstElement, int freelist_idx, int nelem);
 
 /*
  * memory allocation support
@@ -353,6 +356,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 {
 	HTAB	   *hashp;
 	HASHHDR    *hctl;
+	int			nelem_batch;
 
 	/*
 	 * Hash tables now allocate space for key and data, but you have to say
@@ -507,9 +511,19 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		hashp->isshared = false;
 	}
 
+	/* Choose number of entries to allocate at a time */
+	nelem_batch = choose_nelem_alloc(info->entrysize);
+
+	/*
+	 * Allocate the memory needed for hash header, directory, segments and
+	 * elements together. Use pointer arithmetic to arrive at the start
+	 * of each of these structures later.
+	 */
 	if (!hashp->hctl)
 	{
-		hashp->hctl = (HASHHDR *) hashp->alloc(sizeof(HASHHDR));
+		Size		size = hash_get_init_size(info, flags, nelem, nelem_batch);
+
+		hashp->hctl = (HASHHDR *) hashp->alloc(size);
 		if (!hashp->hctl)
 			ereport(ERROR,
 					(errcode(ERRCODE_OUT_OF_MEMORY),
@@ -558,6 +572,8 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 	hctl->keysize = info->keysize;
 	hctl->entrysize = info->entrysize;
 
+	hctl->nelem_alloc = nelem_batch;
+
 	/* make local copies of heavily-used constant fields */
 	hashp->keysize = hctl->keysize;
 	hashp->ssize = hctl->ssize;
@@ -582,6 +598,9 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 					freelist_partitions,
 					nelem_alloc,
 					nelem_alloc_first;
+		void	   *curr_offset = NULL;
+		int			nsegs;
+		int			nbuckets;
 
 		/*
 		 * If hash table is partitioned, give each freelist an equal share of
@@ -592,6 +611,15 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		else
 			freelist_partitions = 1;
 
+		/*
+		 * If table is shared, calculate the offset at which to find the
+		 * first partition of elements
+		 */
+
+		nsegs = compute_buckets_and_segs(nelem, &nbuckets, hctl->num_partitions, hctl->ssize);
+
+		curr_offset = (((char *) hashp->hctl) + sizeof(HASHHDR) + (hctl->dsize * sizeof(HASHSEGMENT)) + (sizeof(HASHBUCKET) * hctl->ssize * nsegs));
+
 		nelem_alloc = nelem / freelist_partitions;
 		if (nelem_alloc <= 0)
 			nelem_alloc = 1;
@@ -609,11 +637,16 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		for (i = 0; i < freelist_partitions; i++)
 		{
 			int			temp = (i == 0) ? nelem_alloc_first : nelem_alloc;
+			HASHELEMENT *firstElement;
+			Size		elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
 
-			if (!element_alloc(hashp, temp, i))
-				ereport(ERROR,
-						(errcode(ERRCODE_OUT_OF_MEMORY),
-						 errmsg("out of memory")));
+			/*
+			 * Memory is allocated as part of initial allocation in
+			 * ShmemInitHash
+			 */
+			firstElement = (HASHELEMENT *) curr_offset;
+			curr_offset = (((char *) curr_offset) + (temp * elementSize));
+			element_add(hashp, firstElement, i, temp);
 		}
 	}
 
@@ -701,30 +734,11 @@ init_htab(HTAB *hashp, long nelem)
 		for (i = 0; i < NUM_FREELISTS; i++)
 			SpinLockInit(&(hctl->freeList[i].mutex));
 
-	/*
-	 * Allocate space for the next greater power of two number of buckets,
-	 * assuming a desired maximum load factor of 1.
-	 */
-	nbuckets = next_pow2_int(nelem);
-
-	/*
-	 * In a partitioned table, nbuckets must be at least equal to
-	 * num_partitions; were it less, keys with apparently different partition
-	 * numbers would map to the same bucket, breaking partition independence.
-	 * (Normally nbuckets will be much bigger; this is just a safety check.)
-	 */
-	while (nbuckets < hctl->num_partitions)
-		nbuckets <<= 1;
+	nsegs = compute_buckets_and_segs(nelem, &nbuckets, hctl->num_partitions, hctl->ssize);
 
 	hctl->max_bucket = hctl->low_mask = nbuckets - 1;
 	hctl->high_mask = (nbuckets << 1) - 1;
 
-	/*
-	 * Figure number of directory segments needed, round up to a power of 2
-	 */
-	nsegs = (nbuckets - 1) / hctl->ssize + 1;
-	nsegs = next_pow2_int(nsegs);
-
 	/*
 	 * Make sure directory is big enough. If pre-allocated directory is too
 	 * small, choke (caller screwed up).
@@ -741,22 +755,21 @@ init_htab(HTAB *hashp, long nelem)
 	if (!(hashp->dir))
 	{
 		CurrentDynaHashCxt = hashp->hcxt;
-		hashp->dir = (HASHSEGMENT *)
-			hashp->alloc(hctl->dsize * sizeof(HASHSEGMENT));
-		if (!hashp->dir)
-			return false;
+		hashp->dir = (HASHSEGMENT *) (((char *) hashp->hctl) + sizeof(HASHHDR));
 	}
 
 	/* Allocate initial segments */
+	i = 0;
 	for (segp = hashp->dir; hctl->nsegs < nsegs; hctl->nsegs++, segp++)
 	{
-		*segp = seg_alloc(hashp);
-		if (*segp == NULL)
-			return false;
-	}
+		*segp = (HASHBUCKET *) (((char *) hashp->hctl)
+								+ sizeof(HASHHDR)
+								+ (hashp->hctl->dsize * sizeof(HASHSEGMENT))
+								+ (i * sizeof(HASHBUCKET) * hashp->ssize));
+		MemSet(*segp, 0, sizeof(HASHBUCKET) * hashp->ssize);
 
-	/* Choose number of entries to allocate at a time */
-	hctl->nelem_alloc = choose_nelem_alloc(hctl->entrysize);
+		i = i + 1;
+	}
 
 #ifdef HASH_DEBUG
 	fprintf(stderr, "init_htab:\n%s%p\n%s%ld\n%s%ld\n%s%d\n%s%ld\n%s%u\n%s%x\n%s%x\n%s%ld\n",
@@ -851,11 +864,64 @@ hash_select_dirsize(long num_entries)
  * and for the (non expansible) directory.
  */
 Size
-hash_get_shared_size(HASHCTL *info, int flags)
+hash_get_init_size(const HASHCTL *info, int flags, long init_size, int nelem_alloc)
 {
-	Assert(flags & HASH_DIRSIZE);
-	Assert(info->dsize == info->max_dsize);
-	return sizeof(HASHHDR) + info->dsize * sizeof(HASHSEGMENT);
+	int			nbuckets;
+	int			nsegs;
+	int			num_partitions;
+	long		ssize;
+	long		dsize;
+	bool		element_alloc = true;
+	Size		elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(info->entrysize);
+
+	/*
+	 * For non-shared hash tables, the requested number of elements
+	 * are allocated only if they are less than nelem_alloc.
+	 * In any case, the init_size should be equal to the number of
+	 * elements added using element_add() in hash_create.
+	 */
+	if (!(flags & HASH_SHARED_MEM))
+	{
+		if (init_size > nelem_alloc)
+			element_alloc = false;
+	}
+	else
+	{
+		Assert(flags & HASH_DIRSIZE);
+		Assert(info->dsize == info->max_dsize);
+	}
+	/* Non-shared hash tables may not specify dir size */
+	if (!(flags & HASH_DIRSIZE))
+	{
+		dsize = DEF_DIRSIZE;
+	}
+	else
+		dsize = info->dsize;
+
+	if (flags & HASH_PARTITION)
+	{
+		num_partitions = info->num_partitions;
+
+		/* Number of entries should be at least equal to the freelists */
+		if (init_size < NUM_FREELISTS)
+			init_size = NUM_FREELISTS;
+	}
+	else
+		num_partitions = 0;
+
+	if (flags & HASH_SEGMENT)
+		ssize = info->ssize;
+	else
+		ssize = DEF_SEGSIZE;
+
+	nsegs = compute_buckets_and_segs(init_size, &nbuckets, num_partitions, ssize);
+
+	if (!element_alloc)
+		init_size = 0;
+
+	return sizeof(HASHHDR) + dsize * sizeof(HASHSEGMENT) +
+		+sizeof(HASHBUCKET) * ssize * nsegs
+		+ init_size * elementSize;
 }
 
 
@@ -1285,7 +1351,8 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 		 * Failing because the needed element is in a different freelist is
 		 * not acceptable.
 		 */
-		if (!element_alloc(hashp, hctl->nelem_alloc, freelist_idx))
+		newElement = element_alloc(hashp, hctl->nelem_alloc);
+		if (newElement == NULL)
 		{
 			int			borrow_from_idx;
 
@@ -1322,6 +1389,7 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 			/* no elements available to borrow either, so out of memory */
 			return NULL;
 		}
+		element_add(hashp, newElement, freelist_idx, hctl->nelem_alloc);
 	}
 
 	/* remove entry from freelist, bump nentries */
@@ -1702,28 +1770,38 @@ seg_alloc(HTAB *hashp)
 /*
  * allocate some new elements and link them into the indicated free list
  */
-static bool
-element_alloc(HTAB *hashp, int nelem, int freelist_idx)
+static HASHELEMENT *
+element_alloc(HTAB *hashp, int nelem)
 {
 	HASHHDR    *hctl = hashp->hctl;
 	Size		elementSize;
-	HASHELEMENT *firstElement;
-	HASHELEMENT *tmpElement;
-	HASHELEMENT *prevElement;
-	int			i;
+	HASHELEMENT *firstElement = NULL;
 
 	if (hashp->isfixed)
-		return false;
+		return NULL;
 
 	/* Each element has a HASHELEMENT header plus user data. */
 	elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
-
 	CurrentDynaHashCxt = hashp->hcxt;
 	firstElement = (HASHELEMENT *) hashp->alloc(nelem * elementSize);
 
 	if (!firstElement)
-		return false;
+		return NULL;
+
+	return firstElement;
+}
 
+static void
+element_add(HTAB *hashp, HASHELEMENT *firstElement, int freelist_idx, int nelem)
+{
+	HASHHDR    *hctl = hashp->hctl;
+	Size		elementSize;
+	HASHELEMENT *tmpElement;
+	HASHELEMENT *prevElement;
+	int			i;
+
+	/* Each element has a HASHELEMENT header plus user data. */
+	elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
 	/* prepare to link all the new entries into the freelist */
 	prevElement = NULL;
 	tmpElement = firstElement;
@@ -1744,8 +1822,6 @@ element_alloc(HTAB *hashp, int nelem, int freelist_idx)
 
 	if (IS_PARTITIONED(hctl))
 		SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
-
-	return true;
 }
 
 /*
@@ -1957,3 +2033,32 @@ AtEOSubXact_HashTables(bool isCommit, int nestDepth)
 		}
 	}
 }
+
+static int
+compute_buckets_and_segs(long nelem, int *nbuckets, long num_partitions, long ssize)
+{
+	int			nsegs;
+
+	/*
+	 * Allocate space for the next greater power of two number of buckets,
+	 * assuming a desired maximum load factor of 1.
+	 */
+	*nbuckets = next_pow2_int(nelem);
+
+	/*
+	 * In a partitioned table, nbuckets must be at least equal to
+	 * num_partitions; were it less, keys with apparently different partition
+	 * numbers would map to the same bucket, breaking partition independence.
+	 * (Normally nbuckets will be much bigger; this is just a safety check.)
+	 */
+	while ((*nbuckets) < num_partitions)
+		(*nbuckets) <<= 1;
+
+
+	/*
+	 * Figure number of directory segments needed, round up to a power of 2
+	 */
+	nsegs = ((*nbuckets) - 1) / ssize + 1;
+	nsegs = next_pow2_int(nsegs);
+	return nsegs;
+}
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index 932cc4f34d9..5e513c8116a 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -151,7 +151,8 @@ extern void hash_seq_term(HASH_SEQ_STATUS *status);
 extern void hash_freeze(HTAB *hashp);
 extern Size hash_estimate_size(long num_entries, Size entrysize);
 extern long hash_select_dirsize(long num_entries);
-extern Size hash_get_shared_size(HASHCTL *info, int flags);
+extern Size hash_get_init_size(const HASHCTL *info, int flags,
+					long init_size, int nelem_alloc);
 extern void AtEOXact_HashTables(bool isCommit);
 extern void AtEOSubXact_HashTables(bool isCommit, int nestDepth);
 
-- 
2.49.0

v20250324-0002-review.patchtext/x-patch; charset=UTF-8; name=v20250324-0002-review.patchDownload
From d737c4033a92f070c3c3ad791fdef2c8e133394f Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Sat, 22 Mar 2025 12:20:44 +0100
Subject: [PATCH v20250324 2/6] review

---
 src/backend/storage/ipc/shmem.c   |   4 +-
 src/backend/utils/hash/dynahash.c | 112 ++++++++++++++++++++----------
 src/include/utils/hsearch.h       |   4 +-
 3 files changed, 81 insertions(+), 39 deletions(-)

diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index d8aed0bfaaf..51ad68cc796 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -345,9 +345,11 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
 	infoP->alloc = ShmemAllocNoError;
 	hash_flags |= HASH_SHARED_MEM | HASH_ALLOC | HASH_DIRSIZE;
 
+	/* review: I don't see a reason to rename this to _init_ */
+
 	/* look it up in the shmem index */
 	location = ShmemInitStruct(name,
-							   hash_get_init_size(infoP, hash_flags, init_size, 0),
+							   hash_get_shared_size(infoP, hash_flags, init_size, 0),
 							   &found);
 
 	/*
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index ba785f1d5f2..9792377d567 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -260,6 +260,30 @@ static long hash_accesses,
 			hash_expansions;
 #endif
 
+/* review: shouldn't this have some MAXALIGNs? */
+#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
+	(sizeof(HASHHDR) + \
+	 ((hctl)->dsize * sizeof(HASHSEGMENT)) + \
+	 ((hctl)->ssize * (nsegs) * sizeof(HASHBUCKET)))
+
+#define HASH_ELEMENTS(hashp, nsegs) \
+	((char *) (hashp)->hctl + HASH_ELEMENTS_OFFSET((hashp)->hctl, nsegs))
+
+#define HASH_SEGMENT_OFFSET(hctl, idx) \
+	(sizeof(HASHHDR) + \
+	 ((hctl)->dsize * sizeof(HASHSEGMENT)) + \
+	 ((hctl)->ssize * (idx) * sizeof(HASHBUCKET)))
+
+#define HASH_SEGMENT_PTR(hashp, idx) \
+	(HASHSEGMENT) ((char *) (hashp)->hctl + HASH_SEGMENT_OFFSET((hashp)->hctl, (idx)))
+
+#define HASH_SEGMENT_SIZE(hashp)	((hashp)->ssize * sizeof(HASHBUCKET))
+
+#define	HASH_DIRECTORY(hashp)	(HASHSEGMENT *) (((char *) (hashp)->hctl) + sizeof(HASHHDR))
+
+#define HASH_ELEMENT_NEXT(hctl, num, ptr) \
+	((char *) (ptr) + ((num) * (MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN((hctl)->entrysize))))
+
 /*
  * Private function prototypes
  */
@@ -281,9 +305,16 @@ static void register_seq_scan(HTAB *hashp);
 static void deregister_seq_scan(HTAB *hashp);
 static bool has_seq_scans(HTAB *hashp);
 
-static int	compute_buckets_and_segs(long nelem, int *nbuckets,
-									 long num_partitions, long ssize);
-static void element_add(HTAB *hashp, HASHELEMENT *firstElement, int freelist_idx, int nelem);
+/*
+ * review: it's a bit weird to have a function that calculates two values,
+ * but returns one through a pointer and another as a regular retval. see
+ * ExecChooseHashTableSize for a better approach.
+ */
+static void	compute_buckets_and_segs(long nelem, long num_partitions,
+									 long ssize, /* segment size */
+									 int *nbuckets, int *nsegments);
+static void element_add(HTAB *hashp, HASHELEMENT *firstElement,
+						int freelist_idx, int nelem);
 
 /*
  * memory allocation support
@@ -511,7 +542,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		hashp->isshared = false;
 	}
 
-	/* Choose number of entries to allocate at a time */
+	/* Choose the number of entries to allocate at a time. */
 	nelem_batch = choose_nelem_alloc(info->entrysize);
 
 	/*
@@ -521,7 +552,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 	 */
 	if (!hashp->hctl)
 	{
-		Size		size = hash_get_init_size(info, flags, nelem, nelem_batch);
+		Size	size = hash_get_shared_size(info, flags, nelem, nelem_batch);
 
 		hashp->hctl = (HASHHDR *) hashp->alloc(size);
 		if (!hashp->hctl)
@@ -572,6 +603,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 	hctl->keysize = info->keysize;
 	hctl->entrysize = info->entrysize;
 
+	/* remember how many elements to allocate at once */
 	hctl->nelem_alloc = nelem_batch;
 
 	/* make local copies of heavily-used constant fields */
@@ -598,7 +630,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 					freelist_partitions,
 					nelem_alloc,
 					nelem_alloc_first;
-		void	   *curr_offset = NULL;
+		void	   *ptr = NULL;		/* XXX it's not offset, clearly */
 		int			nsegs;
 		int			nbuckets;
 
@@ -612,13 +644,18 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 			freelist_partitions = 1;
 
 		/*
-		 * If table is shared, calculate the offset at which to find the the
-		 * first partition of elements
+		 * For shared hash tables, we need to determine the offset for the
+		 * first partition of elements. We have to skip space for the header,
+		 * segments and buckets.
+		 *
+	 	 * XXX can we rely on this matching the calculation in hash_get_shared_size?
+		 * or could/should we add some asserts? Can we have at least some sanity checks
+		 * on nbuckets/nsegs?
 		 */
+		compute_buckets_and_segs(nelem, hctl->num_partitions, hctl->ssize,
+								 &nbuckets, &nsegs);
 
-		nsegs = compute_buckets_and_segs(nelem, &nbuckets, hctl->num_partitions, hctl->ssize);
-
-		curr_offset = (((char *) hashp->hctl) + sizeof(HASHHDR) + (hctl->dsize * sizeof(HASHSEGMENT)) + (sizeof(HASHBUCKET) * hctl->ssize * nsegs));
+		ptr = HASH_ELEMENTS(hashp, nsegs);
 
 		nelem_alloc = nelem / freelist_partitions;
 		if (nelem_alloc <= 0)
@@ -637,16 +674,15 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		for (i = 0; i < freelist_partitions; i++)
 		{
 			int			temp = (i == 0) ? nelem_alloc_first : nelem_alloc;
-			HASHELEMENT *firstElement;
-			Size		elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
 
 			/*
-			 * Memory is allocated as part of initial allocation in
-			 * ShmemInitHash
+			 * Memory is allocated as part of allocation in ShmemInitHash, we
+			 * just need to split that allocation per-batch freelists.
+			 *
+			 * review: why do we invert the argument order?
 			 */
-			firstElement = (HASHELEMENT *) curr_offset;
-			curr_offset = (((char *) curr_offset) + (temp * elementSize));
-			element_add(hashp, firstElement, i, temp);
+			element_add(hashp, (HASHELEMENT *) ptr, i, temp);
+			ptr = HASH_ELEMENT_NEXT(hctl, temp, ptr);
 		}
 	}
 
@@ -734,7 +770,8 @@ init_htab(HTAB *hashp, long nelem)
 		for (i = 0; i < NUM_FREELISTS; i++)
 			SpinLockInit(&(hctl->freeList[i].mutex));
 
-	nsegs = compute_buckets_and_segs(nelem, &nbuckets, hctl->num_partitions, hctl->ssize);
+	compute_buckets_and_segs(nelem, hctl->num_partitions, hctl->ssize,
+							 &nbuckets, &nsegs);
 
 	hctl->max_bucket = hctl->low_mask = nbuckets - 1;
 	hctl->high_mask = (nbuckets << 1) - 1;
@@ -755,22 +792,19 @@ init_htab(HTAB *hashp, long nelem)
 	if (!(hashp->dir))
 	{
 		CurrentDynaHashCxt = hashp->hcxt;
-		hashp->dir = (HASHSEGMENT *) (((char *) hashp->hctl) + sizeof(HASHHDR));
+		hashp->dir = HASH_DIRECTORY(hashp);
 	}
 
 	/* Allocate initial segments */
 	i = 0;
 	for (segp = hashp->dir; hctl->nsegs < nsegs; hctl->nsegs++, segp++)
 	{
-		*segp = (HASHBUCKET *) (((char *) hashp->hctl)
-								+ sizeof(HASHHDR)
-								+ (hashp->hctl->dsize * sizeof(HASHSEGMENT))
-								+ (i * sizeof(HASHBUCKET) * hashp->ssize));
-		MemSet(*segp, 0, sizeof(HASHBUCKET) * hashp->ssize);
-
-		i = i + 1;
+		*segp = HASH_SEGMENT_PTR(hashp, i++);
+		MemSet(*segp, 0, HASH_SEGMENT_SIZE(hashp));
 	}
 
+	Assert(i == nsegs);
+
 #ifdef HASH_DEBUG
 	fprintf(stderr, "init_htab:\n%s%p\n%s%ld\n%s%ld\n%s%d\n%s%ld\n%s%u\n%s%x\n%s%x\n%s%ld\n",
 			"TABLE POINTER   ", hashp,
@@ -862,9 +896,14 @@ hash_select_dirsize(long num_entries)
  * Compute the required initial memory allocation for a shared-memory
  * hashtable with the given parameters.  We need space for the HASHHDR
  * and for the (non expansible) directory.
+ *
+ * review: why rename this to "init"
+ * review: also, why adding the "const"?
+ * review: the comment should really explain the arguments, e.g. what
+ * is nelem_alloc for is not obvious?
  */
 Size
-hash_get_init_size(const HASHCTL *info, int flags, long init_size, int nelem_alloc)
+hash_get_shared_size(const HASHCTL *info, int flags, long init_size, int nelem_alloc)
 {
 	int			nbuckets;
 	int			nsegs;
@@ -914,7 +953,8 @@ hash_get_init_size(const HASHCTL *info, int flags, long init_size, int nelem_all
 	else
 		ssize = DEF_SEGSIZE;
 
-	nsegs = compute_buckets_and_segs(init_size, &nbuckets, num_partitions, ssize);
+	compute_buckets_and_segs(init_size, num_partitions, ssize,
+							 &nbuckets, &nsegs);
 
 	if (!element_alloc)
 		init_size = 0;
@@ -1791,6 +1831,7 @@ element_alloc(HTAB *hashp, int nelem)
 	return firstElement;
 }
 
+/* review: comment needed, how is the work divided between element_alloc and this? */
 static void
 element_add(HTAB *hashp, HASHELEMENT *firstElement, int freelist_idx, int nelem)
 {
@@ -2034,11 +2075,11 @@ AtEOSubXact_HashTables(bool isCommit, int nestDepth)
 	}
 }
 
-static int
-compute_buckets_and_segs(long nelem, int *nbuckets, long num_partitions, long ssize)
+/* review: comment needed */
+static void
+compute_buckets_and_segs(long nelem, long num_partitions, long ssize,
+						 int *nbuckets, int *nsegments)
 {
-	int			nsegs;
-
 	/*
 	 * Allocate space for the next greater power of two number of buckets,
 	 * assuming a desired maximum load factor of 1.
@@ -2058,7 +2099,6 @@ compute_buckets_and_segs(long nelem, int *nbuckets, long num_partitions, long ss
 	/*
 	 * Figure number of directory segments needed, round up to a power of 2
 	 */
-	nsegs = ((*nbuckets) - 1) / ssize + 1;
-	nsegs = next_pow2_int(nsegs);
-	return nsegs;
+	*nsegments = ((*nbuckets) - 1) / ssize + 1;
+	*nsegments = next_pow2_int(*nsegments);
 }
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index 5e513c8116a..e78ab38b4e8 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -151,8 +151,8 @@ extern void hash_seq_term(HASH_SEQ_STATUS *status);
 extern void hash_freeze(HTAB *hashp);
 extern Size hash_estimate_size(long num_entries, Size entrysize);
 extern long hash_select_dirsize(long num_entries);
-extern Size hash_get_init_size(const HASHCTL *info, int flags,
-					long init_size, int nelem_alloc);
+extern Size hash_get_shared_size(const HASHCTL *info, int flags,
+								 long init_size, int nelem_alloc);
 extern void AtEOXact_HashTables(bool isCommit);
 extern void AtEOSubXact_HashTables(bool isCommit, int nestDepth);
 
-- 
2.49.0

v20250324-0003-Replace-ShmemAlloc-calls-by-ShmemInitStruc.patchtext/x-patch; charset=UTF-8; name=v20250324-0003-Replace-ShmemAlloc-calls-by-ShmemInitStruc.patchDownload
From bd743402321ae01e27348250865f95290d37b842 Mon Sep 17 00:00:00 2001
From: Rahila Syed <rahilasyed.90@gmail.com>
Date: Thu, 6 Mar 2025 20:32:27 +0530
Subject: [PATCH v20250324 3/6] Replace ShmemAlloc calls by ShmemInitStruct

The shared memory allocated by ShmemAlloc is not tracked
by pg_shmem_allocations. This commit replaces most of the
calls to ShmemAlloc with ShmemInitStruct to associate a name
with the allocations and ensure that they get tracked by
pg_shmem_allocations. It also merges several smaller
ShmemAlloc calls into larger ShmemInitStruct calls, so that
all the related memory allocations are tracked under a single name.
---
 src/backend/storage/lmgr/predicate.c | 17 +++++++-------
 src/backend/storage/lmgr/proc.c      | 33 +++++++++++++++++++++++-----
 2 files changed, 36 insertions(+), 14 deletions(-)

diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 5b21a053981..dd66990335b 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -1226,8 +1226,11 @@ PredicateLockShmemInit(void)
 	 */
 	max_table_size *= 10;
 
+	requestSize = add_size(PredXactListDataSize,
+						   (mul_size((Size) max_table_size,
+									 sizeof(SERIALIZABLEXACT))));
 	PredXact = ShmemInitStruct("PredXactList",
-							   PredXactListDataSize,
+							   requestSize,
 							   &found);
 	Assert(found == IsUnderPostmaster);
 	if (!found)
@@ -1242,9 +1245,7 @@ PredicateLockShmemInit(void)
 		PredXact->LastSxactCommitSeqNo = FirstNormalSerCommitSeqNo - 1;
 		PredXact->CanPartialClearThrough = 0;
 		PredXact->HavePartialClearedThrough = 0;
-		requestSize = mul_size((Size) max_table_size,
-							   sizeof(SERIALIZABLEXACT));
-		PredXact->element = ShmemAlloc(requestSize);
+		PredXact->element = (SERIALIZABLEXACT *) ((char *) PredXact + PredXactListDataSize);
 		/* Add all elements to available list, clean. */
 		memset(PredXact->element, 0, requestSize);
 		for (i = 0; i < max_table_size; i++)
@@ -1299,9 +1300,11 @@ PredicateLockShmemInit(void)
 	 * probably OK.
 	 */
 	max_table_size *= 5;
+	requestSize = mul_size((Size) max_table_size,
+						   RWConflictDataSize);
 
 	RWConflictPool = ShmemInitStruct("RWConflictPool",
-									 RWConflictPoolHeaderDataSize,
+									 RWConflictPoolHeaderDataSize + requestSize,
 									 &found);
 	Assert(found == IsUnderPostmaster);
 	if (!found)
@@ -1309,9 +1312,7 @@ PredicateLockShmemInit(void)
 		int			i;
 
 		dlist_init(&RWConflictPool->availableList);
-		requestSize = mul_size((Size) max_table_size,
-							   RWConflictDataSize);
-		RWConflictPool->element = ShmemAlloc(requestSize);
+		RWConflictPool->element = (RWConflict) ((char *) RWConflictPool + RWConflictPoolHeaderDataSize);
 		/* Add all elements to available list, clean. */
 		memset(RWConflictPool->element, 0, requestSize);
 		for (i = 0; i < max_table_size; i++)
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index e4ca861a8e6..65239d743da 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -88,6 +88,7 @@ static void RemoveProcFromArray(int code, Datum arg);
 static void ProcKill(int code, Datum arg);
 static void AuxiliaryProcKill(int code, Datum arg);
 static void CheckDeadLock(void);
+static Size PGProcShmemSize(void);
 
 
 /*
@@ -175,6 +176,7 @@ InitProcGlobal(void)
 			   *fpEndPtr PG_USED_FOR_ASSERTS_ONLY;
 	Size		fpLockBitsSize,
 				fpRelIdSize;
+	Size		requestSize;
 
 	/* Create the ProcGlobal shared structure */
 	ProcGlobal = (PROC_HDR *)
@@ -204,7 +206,10 @@ InitProcGlobal(void)
 	 * with a single freelist.)  Each PGPROC structure is dedicated to exactly
 	 * one of these purposes, and they do not move between groups.
 	 */
-	procs = (PGPROC *) ShmemAlloc(TotalProcs * sizeof(PGPROC));
+	requestSize = PGProcShmemSize();
+
+	procs = (PGPROC *) ShmemInitStruct("PGPROC structures", requestSize, &found);
+
 	MemSet(procs, 0, TotalProcs * sizeof(PGPROC));
 	ProcGlobal->allProcs = procs;
 	/* XXX allProcCount isn't really all of them; it excludes prepared xacts */
@@ -218,11 +223,11 @@ InitProcGlobal(void)
 	 * how hotly they are accessed.
 	 */
 	ProcGlobal->xids =
-		(TransactionId *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->xids));
+		(TransactionId *) ((char *) procs + TotalProcs * sizeof(PGPROC));
 	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
-	ProcGlobal->subxidStates = (XidCacheStatus *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	ProcGlobal->subxidStates = (XidCacheStatus *) ((char *) ProcGlobal->xids + TotalProcs * sizeof(*ProcGlobal->xids) + PG_CACHE_LINE_SIZE);
 	MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	ProcGlobal->statusFlags = (uint8 *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	ProcGlobal->statusFlags = (uint8 *) ((char *) ProcGlobal->subxidStates + TotalProcs * sizeof(*ProcGlobal->subxidStates) + PG_CACHE_LINE_SIZE);
 	MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
 
 	/*
@@ -233,7 +238,7 @@ InitProcGlobal(void)
 	fpLockBitsSize = MAXALIGN(FastPathLockGroupsPerBackend * sizeof(uint64));
 	fpRelIdSize = MAXALIGN(FastPathLockSlotsPerBackend() * sizeof(Oid));
 
-	fpPtr = ShmemAlloc(TotalProcs * (fpLockBitsSize + fpRelIdSize));
+	fpPtr = ShmemInitStruct("Fast path lock arrays", TotalProcs * (fpLockBitsSize + fpRelIdSize), &found);
 	MemSet(fpPtr, 0, TotalProcs * (fpLockBitsSize + fpRelIdSize));
 
 	/* For asserts checking we did not overflow. */
@@ -330,10 +335,26 @@ InitProcGlobal(void)
 	PreparedXactProcs = &procs[MaxBackends + NUM_AUXILIARY_PROCS];
 
 	/* Create ProcStructLock spinlock, too */
-	ProcStructLock = (slock_t *) ShmemAlloc(sizeof(slock_t));
+	ProcStructLock = (slock_t *) ShmemInitStruct("ProcStructLock spinlock", sizeof(slock_t), &found);
 	SpinLockInit(ProcStructLock);
 }
 
+static Size
+PGProcShmemSize(void)
+{
+	Size		size;
+	uint32		TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
+
+	size = TotalProcs * sizeof(PGPROC);
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->xids));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
+	return size;
+}
+
 /*
  * InitProcess -- initialize a per-process PGPROC entry for this backend
  */
-- 
2.49.0

v20250324-0004-review.patchtext/x-patch; charset=UTF-8; name=v20250324-0004-review.patchDownload
From eac38c347318a0ee074b25d487de66575a242e12 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Sat, 22 Mar 2025 12:50:38 +0100
Subject: [PATCH v20250324 4/6] review

---
 src/backend/storage/lmgr/predicate.c | 21 +++++---
 src/backend/storage/lmgr/proc.c      | 75 +++++++++++++++++++---------
 2 files changed, 66 insertions(+), 30 deletions(-)

diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index dd66990335b..aeba84de786 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -1237,6 +1237,9 @@ PredicateLockShmemInit(void)
 	{
 		int			i;
 
+		/* reset everything, both the header and the element */
+		memset(PredXact, 0, requestSize);
+
 		dlist_init(&PredXact->availableList);
 		dlist_init(&PredXact->activeList);
 		PredXact->SxactGlobalXmin = InvalidTransactionId;
@@ -1247,7 +1250,9 @@ PredicateLockShmemInit(void)
 		PredXact->HavePartialClearedThrough = 0;
 		PredXact->element = (SERIALIZABLEXACT *) ((char *) PredXact + PredXactListDataSize);
 		/* Add all elements to available list, clean. */
-		memset(PredXact->element, 0, requestSize);
+		/* XXX wasn't this actually wrong, considering requestSize is the whole
+		 * shmem allocation, including PredXactListDataSize? */
+		// memset(PredXact->element, 0, requestSize);
 		for (i = 0; i < max_table_size; i++)
 		{
 			LWLockInitialize(&PredXact->element[i].perXactPredicateListLock,
@@ -1300,21 +1305,25 @@ PredicateLockShmemInit(void)
 	 * probably OK.
 	 */
 	max_table_size *= 5;
-	requestSize = mul_size((Size) max_table_size,
-						   RWConflictDataSize);
+	requestSize = RWConflictPoolHeaderDataSize +
+					mul_size((Size) max_table_size,
+							 RWConflictDataSize);
 
 	RWConflictPool = ShmemInitStruct("RWConflictPool",
-									 RWConflictPoolHeaderDataSize + requestSize,
+									 requestSize,
 									 &found);
 	Assert(found == IsUnderPostmaster);
 	if (!found)
 	{
 		int			i;
 
+		/* clean everything, including the elements */
+		memset(RWConflictPool, 0, requestSize);
+
 		dlist_init(&RWConflictPool->availableList);
-		RWConflictPool->element = (RWConflict) ((char *) RWConflictPool + RWConflictPoolHeaderDataSize);
+		RWConflictPool->element = (RWConflict) ((char *) RWConflictPool +
+			RWConflictPoolHeaderDataSize);
 		/* Add all elements to available list, clean. */
-		memset(RWConflictPool->element, 0, requestSize);
 		for (i = 0; i < max_table_size; i++)
 		{
 			dlist_push_tail(&RWConflictPool->availableList,
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 65239d743da..4afd7cd42c3 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -88,7 +88,6 @@ static void RemoveProcFromArray(int code, Datum arg);
 static void ProcKill(int code, Datum arg);
 static void AuxiliaryProcKill(int code, Datum arg);
 static void CheckDeadLock(void);
-static Size PGProcShmemSize(void);
 
 
 /*
@@ -124,6 +123,27 @@ ProcGlobalShmemSize(void)
 	return size;
 }
 
+/*
+ * review: add comment, explaining the PG_CACHE_LINE_SIZE thing
+ * review: I'd even maybe split the PG_CACHE_LINE_SIZE thing into
+ * a separate commit, not to mix it with the "monitoring improvement"
+ */
+static Size
+PGProcShmemSize(void)
+{
+	Size		size;
+	uint32		TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
+
+	size = TotalProcs * sizeof(PGPROC);
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->xids));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
+	return size;
+}
+
 /*
  * Report number of semaphores needed by InitProcGlobal.
  */
@@ -177,6 +197,7 @@ InitProcGlobal(void)
 	Size		fpLockBitsSize,
 				fpRelIdSize;
 	Size		requestSize;
+	char	   *ptr;
 
 	/* Create the ProcGlobal shared structure */
 	ProcGlobal = (PROC_HDR *)
@@ -208,7 +229,12 @@ InitProcGlobal(void)
 	 */
 	requestSize = PGProcShmemSize();
 
-	procs = (PGPROC *) ShmemInitStruct("PGPROC structures", requestSize, &found);
+	ptr = ShmemInitStruct("PGPROC structures",
+									   requestSize,
+									   &found);
+
+	procs = (PGPROC *) ptr;
+	ptr += TotalProcs * sizeof(PGPROC);
 
 	MemSet(procs, 0, TotalProcs * sizeof(PGPROC));
 	ProcGlobal->allProcs = procs;
@@ -221,14 +247,27 @@ InitProcGlobal(void)
 	 *
 	 * XXX: It might make sense to increase padding for these arrays, given
 	 * how hotly they are accessed.
+	 *
+	 * review: does the padding comment still make sense with PG_CACHE_LINE_SIZE?
+	 *         presumably that's the padding mentioned by the comment?
+	 *
+	 * review: those lines are too long / not comprehensible, let's define some
+	 *         macros to calculate stuff?
 	 */
-	ProcGlobal->xids =
-		(TransactionId *) ((char *) procs + TotalProcs * sizeof(PGPROC));
+	ProcGlobal->xids = (TransactionId *) ptr;
 	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
-	ProcGlobal->subxidStates = (XidCacheStatus *) ((char *) ProcGlobal->xids + TotalProcs * sizeof(*ProcGlobal->xids) + PG_CACHE_LINE_SIZE);
+	ptr += TotalProcs * sizeof(*ProcGlobal->xids) + PG_CACHE_LINE_SIZE;
+
+	ProcGlobal->subxidStates = (XidCacheStatus *) ptr;
 	MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	ProcGlobal->statusFlags = (uint8 *) ((char *) ProcGlobal->subxidStates + TotalProcs * sizeof(*ProcGlobal->subxidStates) + PG_CACHE_LINE_SIZE);
+	ptr += TotalProcs * sizeof(*ProcGlobal->subxidStates) + PG_CACHE_LINE_SIZE;
+
+	ProcGlobal->statusFlags = (uint8 *) ptr;
 	MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	ptr += TotalProcs * sizeof(*ProcGlobal->statusFlags) + PG_CACHE_LINE_SIZE;
+
+	/* make sure we didn't overflow */
+	Assert((ptr > (char *) procs) && (ptr <= (char *) procs + requestSize));
 
 	/*
 	 * Allocate arrays for fast-path locks. Those are variable-length, so
@@ -238,7 +277,9 @@ InitProcGlobal(void)
 	fpLockBitsSize = MAXALIGN(FastPathLockGroupsPerBackend * sizeof(uint64));
 	fpRelIdSize = MAXALIGN(FastPathLockSlotsPerBackend() * sizeof(Oid));
 
-	fpPtr = ShmemInitStruct("Fast path lock arrays", TotalProcs * (fpLockBitsSize + fpRelIdSize), &found);
+	fpPtr = ShmemInitStruct("Fast path lock arrays",
+							TotalProcs * (fpLockBitsSize + fpRelIdSize),
+							&found);
 	MemSet(fpPtr, 0, TotalProcs * (fpLockBitsSize + fpRelIdSize));
 
 	/* For asserts checking we did not overflow. */
@@ -335,26 +376,12 @@ InitProcGlobal(void)
 	PreparedXactProcs = &procs[MaxBackends + NUM_AUXILIARY_PROCS];
 
 	/* Create ProcStructLock spinlock, too */
-	ProcStructLock = (slock_t *) ShmemInitStruct("ProcStructLock spinlock", sizeof(slock_t), &found);
+	ProcStructLock = (slock_t *) ShmemInitStruct("ProcStructLock spinlock",
+												 sizeof(slock_t),
+												 &found);
 	SpinLockInit(ProcStructLock);
 }
 
-static Size
-PGProcShmemSize(void)
-{
-	Size		size;
-	uint32		TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
-
-	size = TotalProcs * sizeof(PGPROC);
-	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->xids));
-	size = add_size(size, PG_CACHE_LINE_SIZE);
-	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	size = add_size(size, PG_CACHE_LINE_SIZE);
-	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->statusFlags));
-	size = add_size(size, PG_CACHE_LINE_SIZE);
-	return size;
-}
-
 /*
  * InitProcess -- initialize a per-process PGPROC entry for this backend
  */
-- 
2.49.0

v20250324-0005-remove-cacheline.patchtext/x-patch; charset=UTF-8; name=v20250324-0005-remove-cacheline.patchDownload
From 123e403ed6bfca929b3a33afe210eacc38dd73ad Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Sat, 22 Mar 2025 15:22:08 +0100
Subject: [PATCH v20250324 5/6] remove cacheline

---
 src/backend/storage/lmgr/proc.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 4afd7cd42c3..f7957eb008b 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -136,11 +136,8 @@ PGProcShmemSize(void)
 
 	size = TotalProcs * sizeof(PGPROC);
 	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->xids));
-	size = add_size(size, PG_CACHE_LINE_SIZE);
 	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	size = add_size(size, PG_CACHE_LINE_SIZE);
 	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->statusFlags));
-	size = add_size(size, PG_CACHE_LINE_SIZE);
 	return size;
 }
 
@@ -256,15 +253,15 @@ InitProcGlobal(void)
 	 */
 	ProcGlobal->xids = (TransactionId *) ptr;
 	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
-	ptr += TotalProcs * sizeof(*ProcGlobal->xids) + PG_CACHE_LINE_SIZE;
+	ptr += TotalProcs * sizeof(*ProcGlobal->xids);
 
 	ProcGlobal->subxidStates = (XidCacheStatus *) ptr;
 	MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	ptr += TotalProcs * sizeof(*ProcGlobal->subxidStates) + PG_CACHE_LINE_SIZE;
+	ptr += TotalProcs * sizeof(*ProcGlobal->subxidStates);
 
 	ProcGlobal->statusFlags = (uint8 *) ptr;
 	MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
-	ptr += TotalProcs * sizeof(*ProcGlobal->statusFlags) + PG_CACHE_LINE_SIZE;
+	ptr += TotalProcs * sizeof(*ProcGlobal->statusFlags);
 
 	/* make sure wer didn't overflow */
 	Assert((ptr > (char *) procs) && (ptr <= (char *) procs + requestSize));
-- 
2.49.0

v20250324-0006-add-cacheline-padding-back.patchtext/x-patch; charset=UTF-8; name=v20250324-0006-add-cacheline-padding-back.patchDownload
From 8f54aefb9e3e83973d0a8adf6dc84fe876a0f858 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Sat, 22 Mar 2025 15:23:43 +0100
Subject: [PATCH v20250324 6/6] add cacheline padding back

---
 src/backend/storage/lmgr/proc.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index f7957eb008b..e9c22f03f27 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -136,8 +136,11 @@ PGProcShmemSize(void)
 
 	size = TotalProcs * sizeof(PGPROC);
 	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->xids));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
 	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
 	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
 	return size;
 }
 
@@ -242,26 +245,22 @@ InitProcGlobal(void)
 	 * Allocate arrays mirroring PGPROC fields in a dense manner. See
 	 * PROC_HDR.
 	 *
-	 * XXX: It might make sense to increase padding for these arrays, given
-	 * how hotly they are accessed.
-	 *
-	 * review: does the padding comment still make sense with PG_CACHE_LINE_SIZE?
-	 *         presumably that's the padding mentioned by the comment?
+	 * review: shouldn't the first cacheline padding be right after "procs", before "xids"?
 	 *
 	 * review: those lines are too long / not comprehensible, let's define some
 	 *         macros to calculate stuff?
 	 */
 	ProcGlobal->xids = (TransactionId *) ptr;
 	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
-	ptr += TotalProcs * sizeof(*ProcGlobal->xids);
+	ptr += TotalProcs * sizeof(*ProcGlobal->xids) + PG_CACHE_LINE_SIZE;
 
 	ProcGlobal->subxidStates = (XidCacheStatus *) ptr;
 	MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	ptr += TotalProcs * sizeof(*ProcGlobal->subxidStates);
+	ptr += TotalProcs * sizeof(*ProcGlobal->subxidStates) + PG_CACHE_LINE_SIZE;
 
 	ProcGlobal->statusFlags = (uint8 *) ptr;
 	MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
-	ptr += TotalProcs * sizeof(*ProcGlobal->statusFlags);
+	ptr += TotalProcs * sizeof(*ProcGlobal->statusFlags) + PG_CACHE_LINE_SIZE;
 
 	/* make sure wer didn't overflow */
 	Assert((ptr > (char *) procs) && (ptr <= (char *) procs + requestSize));
-- 
2.49.0

#8Rahila Syed
rahilasyed90@gmail.com
In reply to: Tomas Vondra (#7)
3 attachment(s)
Re: Improve monitoring of shared memory allocations

Hi Tomas,

I did a review on v3 of the patch. I see there have been some minor changes
in v4 - I haven't tried to adjust my review, but I assume most of my
comments still apply.

Thank you for your interest and review.

1) I don't quite understand why hash_get_shared_size() got renamed to
hash_get_init_size()? Why is that needed? Does it cover only some
initial allocation, and there's additional memory allocated later? And
what's the point of the added const?

Earlier, this function was used to calculate the size for shared hash
tables only.
Now, it also calculates the size for a non-shared hash table. Hence the
change
of name.

If I don't change the argument to const, I get a compilation error as
follows:
"passing argument 1 of ‘hash_get_init_size’ discards ‘const’ qualifier from
pointer target type."
This is because the function is called from hash_create(), which takes the
HASHCTL argument as const.
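
(For context, the two prototypes involved - hash_create() already takes the
HASHCTL as const, so the size function has to accept const as well:)

    HTAB *hash_create(const char *tabname, long nelem,
                      const HASHCTL *info, int flags);

    Size hash_get_init_size(const HASHCTL *info, int flags,
                            long init_size, int nelem_alloc);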

5) Isn't it wrong that PredicateLockShmemInit() removes the ShmemAlloc()
along with calculating the per-element requestSize, but then still does

memset(PredXact->element, 0, requestSize);

and

memset(RWConflictPool->element, 0, requestSize);

with requestSize for the whole allocation? I haven't seen any crashes,
but this seems wrong to me. I believe we can simply zero the whole
allocation right after ShmemInitStruct(), there's no need to do this for
each individual element.

Good catch. Thanks for updating.

6) InitProcGlobal is another example of hard-to-read code. Admittedly,
it wasn't particularly readable before, but making the lines even longer
does not make it much better ...

I guess adding a couple macros similar to hash_create() would be one way
to improve this (and there's even a review comment in that sense), but I
chose to just maintain a simple pointer. But if you think it should
be done differently, feel free to do so.

LGTM, long lines have been reduced by the ptr approach.

7) I moved the PGProcShmemSize() to be right after ProcGlobalShmemSize()
which seems to be doing a very similar thing, and to not require the
prototype. Minor detail, feel free to undo.

LGTM.

8) I think the PG_CACHE_LINE_SIZE is entirely unrelated to the rest of
the patch, and I see no reason to do it in the same commit. So 0005
removes this change, and 0006 reintroduces it separately.

OK.

FWIW I see no justification for why the cache line padding is useful,
and it seems quite unlikely this would benefit anything, considering it
adds padding between fairly long arrays. What kind of contention can we
get there? I don't get it.

I have done that to address a review comment upthread by Andres, and because
the comment above that code block also mentions the need for padding.

Also, why is the patch adding padding after statusFlags (the last array
allocated in InitProcGlobal) and not between allProcs and xids?

I agree it makes more sense this way, so changing accordingly.

+ * XXX can we rely on this matching the calculation in hash_get_shared_size?
+ * or could/should we add some asserts? Can we have at least some sanity checks
+ * on nbuckets/nsegs?

Both places rely on compute_buckets_and_segs() to calculate nbuckets and
nsegs, hence the probability of a mismatch is low. I am open to adding some
asserts to verify this. Do you have any suggestions in mind?
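
Purely as an illustration (not something in the attached patches), one cheap
cross-check could be to assert, after the freelist loop in hash_create(), that
the element pointer never walks past the size the allocation was requested
with, assuming that size is still in scope there:

    /* hypothetical sketch; "size" is what hash_get_init_size() returned */
    Assert((char *) ptr <= (char *) hashp->hctl + size);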

Please find attached the updated patches, merging all your review comments
except a few discussed above.

Thank you,
Rahila Syed

Attachments:

v5-0003-Add-cacheline-padding-between-heavily-accessed-array.patchapplication/octet-stream; name=v5-0003-Add-cacheline-padding-between-heavily-accessed-array.patchDownload
From 8b8fe395df974e5e7d1ae85d933c53ec86132e1d Mon Sep 17 00:00:00 2001
From: Rahila Syed <rahilasyed.90@gmail.com>
Date: Thu, 27 Mar 2025 17:20:12 +0530
Subject: [PATCH 3/3] Add cacheline padding between heavily accessed arrays

---
 src/backend/storage/lmgr/proc.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 6ee48410b8..edc5c7406b 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -135,8 +135,11 @@ PGProcShmemSize(void)
 	uint32		TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
 
 	size = TotalProcs * sizeof(PGPROC);
+	size = add_size(size, PG_CACHE_LINE_SIZE);
 	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->xids));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
 	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
 	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->statusFlags));
 	return size;
 }
@@ -231,7 +234,7 @@ InitProcGlobal(void)
 									   &found);
 
 	procs = (PGPROC *) ptr;
-	ptr = (char *)ptr + TotalProcs * sizeof(PGPROC);
+	ptr = (char *)ptr + TotalProcs * sizeof(PGPROC) + PG_CACHE_LINE_SIZE;
 
 	MemSet(procs, 0, TotalProcs * sizeof(PGPROC));
 	ProcGlobal->allProcs = procs;
@@ -244,11 +247,11 @@ InitProcGlobal(void)
 	 */
 	ProcGlobal->xids = (TransactionId *) ptr;
 	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
-	ptr = (char *)ptr + (TotalProcs * sizeof(*ProcGlobal->xids));
+	ptr = (char *)ptr + TotalProcs * sizeof(*ProcGlobal->xids) + PG_CACHE_LINE_SIZE;
 
 	ProcGlobal->subxidStates = (XidCacheStatus *) ptr;
 	MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	ptr = (char *)ptr + (TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	ptr = (char *)ptr + (TotalProcs * sizeof(*ProcGlobal->subxidStates)) + PG_CACHE_LINE_SIZE;
 
 	ProcGlobal->statusFlags = (uint8 *) ptr;
 	MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
-- 
2.34.1

v5-0001-Account-for-initial-shared-memory-allocated-by-hash_.patchapplication/octet-stream; name=v5-0001-Account-for-initial-shared-memory-allocated-by-hash_.patchDownload
From 44b3fbb06dd9436c5545c2a93432d65e400a360c Mon Sep 17 00:00:00 2001
From: Rahila Syed <rahilasyed.90@gmail.com>
Date: Thu, 27 Mar 2025 12:59:02 +0530
Subject: [PATCH 1/2] Account for all the shared memory allocated by
 hash_create

pg_shmem_allocations tracks the memory allocated by ShmemInitStruct,
which, in case of shared hash tables, only covers memory allocated
to the hash directory and header structure. The hash segments and
buckets are allocated using ShmemAllocNoError which does not attribute
the allocations to the hash table name. Thus, these allocations are
not tracked in pg_shmem_allocations.

Allocate memory for segments, buckets and elements together with the
directory and header structures. As a result, the existing ShmemIndex
entries reflect the size of the hash table more accurately, thus improving
pg_shmem_allocations monitoring. Also, make this change for non-shared
hash tables, since both go through the same hash_create code.
---
 src/backend/storage/ipc/shmem.c   |   3 +-
 src/backend/utils/hash/dynahash.c | 265 +++++++++++++++++++++++-------
 src/include/utils/hsearch.h       |   3 +-
 3 files changed, 213 insertions(+), 58 deletions(-)

diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39..d8aed0bfaa 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -73,6 +73,7 @@
 #include "storage/shmem.h"
 #include "storage/spin.h"
 #include "utils/builtins.h"
+#include "utils/dynahash.h"
 
 static void *ShmemAllocRaw(Size size, Size *allocated_size);
 
@@ -346,7 +347,7 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
 
 	/* look it up in the shmem index */
 	location = ShmemInitStruct(name,
-							   hash_get_shared_size(infoP, hash_flags),
+							   hash_get_init_size(infoP, hash_flags, init_size, 0),
 							   &found);
 
 	/*
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 3f25929f2d..1f215a16c5 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -260,12 +260,36 @@ static long hash_accesses,
 			hash_expansions;
 #endif
 
+
+#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
+	(MAXALIGN(sizeof(HASHHDR)) + \
+	 ((hctl)->dsize * MAXALIGN(sizeof(HASHSEGMENT))) + \
+	 ((hctl)->ssize * (nsegs) * MAXALIGN(sizeof(HASHBUCKET))))
+
+#define HASH_ELEMENTS(hashp, nsegs) \
+	((char *) (hashp)->hctl + HASH_ELEMENTS_OFFSET((hashp)->hctl, nsegs))
+
+#define HASH_SEGMENT_OFFSET(hctl, idx) \
+	(MAXALIGN(sizeof(HASHHDR)) + \
+	 ((hctl)->dsize * MAXALIGN(sizeof(HASHSEGMENT))) + \
+	 ((hctl)->ssize * (idx) * MAXALIGN(sizeof(HASHBUCKET))))
+
+#define HASH_SEGMENT_PTR(hashp, idx) \
+	(HASHSEGMENT) ((char *) (hashp)->hctl + HASH_SEGMENT_OFFSET((hashp)->hctl, (idx)))
+
+#define HASH_SEGMENT_SIZE(hashp)	((hashp)->ssize * MAXALIGN(sizeof(HASHBUCKET)))
+
+#define	HASH_DIRECTORY(hashp)	(HASHSEGMENT *) (((char *) (hashp)->hctl) + MAXALIGN(sizeof(HASHHDR)))
+
+#define HASH_ELEMENT_NEXT(hctl, num, ptr) \
+	((char *) (ptr) + ((num) * (MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN((hctl)->entrysize))))
+
 /*
  * Private function prototypes
  */
 static void *DynaHashAlloc(Size size);
 static HASHSEGMENT seg_alloc(HTAB *hashp);
-static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx);
+static HASHELEMENT *element_alloc(HTAB *hashp, int nelem);
 static bool dir_realloc(HTAB *hashp);
 static bool expand_table(HTAB *hashp);
 static HASHBUCKET get_hash_entry(HTAB *hashp, int freelist_idx);
@@ -281,6 +305,11 @@ static void register_seq_scan(HTAB *hashp);
 static void deregister_seq_scan(HTAB *hashp);
 static bool has_seq_scans(HTAB *hashp);
 
+static void	compute_buckets_and_segs(long nelem, long num_partitions,
+									 long ssize, /* segment size */
+									 int *nbuckets, int *nsegments);
+static void element_add(HTAB *hashp, HASHELEMENT *firstElement,
+						int nelem, int freelist_idx);
 
 /*
  * memory allocation support
@@ -353,6 +382,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 {
 	HTAB	   *hashp;
 	HASHHDR    *hctl;
+	int			nelem_batch;
 
 	/*
 	 * Hash tables now allocate space for key and data, but you have to say
@@ -507,9 +537,19 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		hashp->isshared = false;
 	}
 
+	/* Choose the number of entries to allocate at a time. */
+	nelem_batch = choose_nelem_alloc(info->entrysize);
+
+	/*
+	 * Allocate the memory needed for hash header, directory, segments and
+	 * elements together. Use pointer arithmetic to arrive at the start of
+	 * each of these structures later.
+	 */
 	if (!hashp->hctl)
 	{
-		hashp->hctl = (HASHHDR *) hashp->alloc(sizeof(HASHHDR));
+		Size	size = hash_get_init_size(info, flags, nelem, nelem_batch);
+
+		hashp->hctl = (HASHHDR *) hashp->alloc(size);
 		if (!hashp->hctl)
 			ereport(ERROR,
 					(errcode(ERRCODE_OUT_OF_MEMORY),
@@ -558,6 +598,9 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 	hctl->keysize = info->keysize;
 	hctl->entrysize = info->entrysize;
 
+	/* remember how many elements to allocate at once */
+	hctl->nelem_alloc = nelem_batch;
+
 	/* make local copies of heavily-used constant fields */
 	hashp->keysize = hctl->keysize;
 	hashp->ssize = hctl->ssize;
@@ -582,6 +625,9 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 					freelist_partitions,
 					nelem_alloc,
 					nelem_alloc_first;
+		void	   *ptr = NULL;
+		int			nsegs;
+		int			nbuckets;
 
 		/*
 		 * If hash table is partitioned, give each freelist an equal share of
@@ -592,6 +638,16 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		else
 			freelist_partitions = 1;
 
+		compute_buckets_and_segs(nelem, hctl->num_partitions, hctl->ssize,
+								 &nbuckets, &nsegs);
+
+		/*
+		 * Calculate the offset at which to find the first partition of
+		 * elements.
+		 * We have to skip space for the header, segments and buckets.
+ 		 */
+		ptr = HASH_ELEMENTS(hashp, nsegs);
+
 		nelem_alloc = nelem / freelist_partitions;
 		if (nelem_alloc <= 0)
 			nelem_alloc = 1;
@@ -610,10 +666,17 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		{
 			int			temp = (i == 0) ? nelem_alloc_first : nelem_alloc;
 
-			if (!element_alloc(hashp, temp, i))
-				ereport(ERROR,
-						(errcode(ERRCODE_OUT_OF_MEMORY),
-						 errmsg("out of memory")));
+			/*
+			 * Assign the correct location of each partition within a
+			 * pre-allocated buffer.
+			 *
+			 * Actual memory allocation happens in ShmemInitHash for
+			 * shared hash tables or earlier in this function for non-shared
+			 * hash tables.
+			 * We just need to split that allocation per-batch freelists.
+			 */
+			element_add(hashp, (HASHELEMENT *) ptr, temp, i);
+			ptr = HASH_ELEMENT_NEXT(hctl, temp, ptr);
 		}
 	}
 
@@ -701,30 +764,12 @@ init_htab(HTAB *hashp, long nelem)
 		for (i = 0; i < NUM_FREELISTS; i++)
 			SpinLockInit(&(hctl->freeList[i].mutex));
 
-	/*
-	 * Allocate space for the next greater power of two number of buckets,
-	 * assuming a desired maximum load factor of 1.
-	 */
-	nbuckets = next_pow2_int(nelem);
-
-	/*
-	 * In a partitioned table, nbuckets must be at least equal to
-	 * num_partitions; were it less, keys with apparently different partition
-	 * numbers would map to the same bucket, breaking partition independence.
-	 * (Normally nbuckets will be much bigger; this is just a safety check.)
-	 */
-	while (nbuckets < hctl->num_partitions)
-		nbuckets <<= 1;
+	compute_buckets_and_segs(nelem, hctl->num_partitions, hctl->ssize,
+							 &nbuckets, &nsegs);
 
 	hctl->max_bucket = hctl->low_mask = nbuckets - 1;
 	hctl->high_mask = (nbuckets << 1) - 1;
 
-	/*
-	 * Figure number of directory segments needed, round up to a power of 2
-	 */
-	nsegs = (nbuckets - 1) / hctl->ssize + 1;
-	nsegs = next_pow2_int(nsegs);
-
 	/*
 	 * Make sure directory is big enough. If pre-allocated directory is too
 	 * small, choke (caller screwed up).
@@ -737,26 +782,25 @@ init_htab(HTAB *hashp, long nelem)
 			return false;
 	}
 
-	/* Allocate a directory */
+	/*
+	 * Assign a directory by making it point to the correct location in the
+	 * pre-allocated buffer.
+	 */
 	if (!(hashp->dir))
 	{
 		CurrentDynaHashCxt = hashp->hcxt;
-		hashp->dir = (HASHSEGMENT *)
-			hashp->alloc(hctl->dsize * sizeof(HASHSEGMENT));
-		if (!hashp->dir)
-			return false;
+		hashp->dir = HASH_DIRECTORY(hashp);
 	}
 
-	/* Allocate initial segments */
+	/* Assign initial segments, which are also pre-allocated */
+	i = 0;
 	for (segp = hashp->dir; hctl->nsegs < nsegs; hctl->nsegs++, segp++)
 	{
-		*segp = seg_alloc(hashp);
-		if (*segp == NULL)
-			return false;
+		*segp = HASH_SEGMENT_PTR(hashp, i++);
+		MemSet(*segp, 0, HASH_SEGMENT_SIZE(hashp));
 	}
 
-	/* Choose number of entries to allocate at a time */
-	hctl->nelem_alloc = choose_nelem_alloc(hctl->entrysize);
+	Assert(i == nsegs);
 
 #ifdef HASH_DEBUG
 	fprintf(stderr, "init_htab:\n%s%p\n%s%ld\n%s%ld\n%s%d\n%s%ld\n%s%u\n%s%x\n%s%x\n%s%ld\n",
@@ -847,15 +891,79 @@ hash_select_dirsize(long num_entries)
 
 /*
  * Compute the required initial memory allocation for a shared-memory
- * hashtable with the given parameters.  We need space for the HASHHDR
- * and for the (non expansible) directory.
+ * or non-shared memory hashtable with the given parameters.
+ * We need space for the HASHHDR, for the directory, segments and
+ * the init_size elements in buckets.	
+ * 
+ * For shared hash tables the directory size is non-expansive.
+ * 
+ * init_size should match the total number of elements allocated
+ * during hash table creation, it could be zero for non-shared hash
+ * tables depending on the value of nelem_alloc. For more explanation
+ * see comments within this function.
+ *
+ * nelem_alloc parameter is not relevant for shared hash tables.
  */
 Size
-hash_get_shared_size(HASHCTL *info, int flags)
+hash_get_init_size(const HASHCTL *info, int flags, long init_size, int nelem_alloc)
 {
-	Assert(flags & HASH_DIRSIZE);
-	Assert(info->dsize == info->max_dsize);
-	return sizeof(HASHHDR) + info->dsize * sizeof(HASHSEGMENT);
+	int			nbuckets;
+	int			nsegs;
+	int			num_partitions;
+	long		ssize;
+	long		dsize;
+	bool		element_alloc = true; /*Always true for shared hash tables */
+	Size		elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(info->entrysize);
+
+	/*
+	 * For non-shared hash tables, the requested number of elements are
+	 * allocated only if they are less than nelem_alloc. In any case, the
+	 * init_size should be equal to the number of elements added using
+	 * element_add() in hash_create.
+	 */
+	if (!(flags & HASH_SHARED_MEM))
+	{
+		if (init_size > nelem_alloc)
+			element_alloc = false;
+	}
+	else
+	{
+		Assert(flags & HASH_DIRSIZE);
+		Assert(info->dsize == info->max_dsize);
+	}
+	/* Non-shared hash tables may not specify dir size */
+	if (!(flags & HASH_DIRSIZE))
+	{
+		dsize = DEF_DIRSIZE;
+	}
+	else
+		dsize = info->dsize;
+
+	if (flags & HASH_PARTITION)
+	{
+		num_partitions = info->num_partitions;
+
+		/* Number of entries should be at least equal to the freelists */
+		if (init_size < NUM_FREELISTS)
+			init_size = NUM_FREELISTS;
+	}
+	else
+		num_partitions = 0;
+
+	if (flags & HASH_SEGMENT)
+		ssize = info->ssize;
+	else
+		ssize = DEF_SEGSIZE;
+
+	compute_buckets_and_segs(init_size, num_partitions, ssize,
+							 &nbuckets, &nsegs);
+
+	if (!element_alloc)
+		init_size = 0;
+
+	return MAXALIGN(sizeof(HASHHDR)) + dsize * MAXALIGN(sizeof(HASHSEGMENT)) +
+		+ MAXALIGN(sizeof(HASHBUCKET)) * ssize * nsegs
+		+ init_size * elementSize;
 }
 
 
@@ -1285,7 +1393,8 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 		 * Failing because the needed element is in a different freelist is
 		 * not acceptable.
 		 */
-		if (!element_alloc(hashp, hctl->nelem_alloc, freelist_idx))
+		newElement = element_alloc(hashp, hctl->nelem_alloc);
+		if (newElement == NULL)
 		{
 			int			borrow_from_idx;
 
@@ -1322,6 +1431,7 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 			/* no elements available to borrow either, so out of memory */
 			return NULL;
 		}
+		element_add(hashp, newElement, hctl->nelem_alloc, freelist_idx);
 	}
 
 	/* remove entry from freelist, bump nentries */
@@ -1700,30 +1810,43 @@ seg_alloc(HTAB *hashp)
 }
 
 /*
- * allocate some new elements and link them into the indicated free list
+ * allocate some new elements
  */
-static bool
-element_alloc(HTAB *hashp, int nelem, int freelist_idx)
+static HASHELEMENT *
+element_alloc(HTAB *hashp, int nelem)
 {
 	HASHHDR    *hctl = hashp->hctl;
 	Size		elementSize;
-	HASHELEMENT *firstElement;
-	HASHELEMENT *tmpElement;
-	HASHELEMENT *prevElement;
-	int			i;
+	HASHELEMENT *firstElement = NULL;
 
 	if (hashp->isfixed)
-		return false;
+		return NULL;
 
 	/* Each element has a HASHELEMENT header plus user data. */
 	elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
-
 	CurrentDynaHashCxt = hashp->hcxt;
 	firstElement = (HASHELEMENT *) hashp->alloc(nelem * elementSize);
 
 	if (!firstElement)
-		return false;
+		return NULL;
+
+	return firstElement;
+}
+
+/*
+ * Link the elements allocated by element_alloc into the indicated free list
+ */
+static void
+element_add(HTAB *hashp, HASHELEMENT *firstElement, int nelem, int freelist_idx)
+{
+	HASHHDR    *hctl = hashp->hctl;
+	Size		elementSize;
+	HASHELEMENT *tmpElement;
+	HASHELEMENT *prevElement;
+	int			i;
 
+	/* Each element has a HASHELEMENT header plus user data. */
+	elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
 	/* prepare to link all the new entries into the freelist */
 	prevElement = NULL;
 	tmpElement = firstElement;
@@ -1744,8 +1867,6 @@ element_alloc(HTAB *hashp, int nelem, int freelist_idx)
 
 	if (IS_PARTITIONED(hctl))
 		SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
-
-	return true;
 }
 
 /*
@@ -1957,3 +2078,35 @@ AtEOSubXact_HashTables(bool isCommit, int nestDepth)
 		}
 	}
 }
+
+/*
+ * Calculate the number of buckets and segments to store the given
+ * number of elements in a hash table. Segments contain buckets which
+ * in turn contain elements.
+ */
+static void
+compute_buckets_and_segs(long nelem, long num_partitions, long ssize,
+						 int *nbuckets, int *nsegments)
+{
+	/*
+	 * Allocate space for the next greater power of two number of buckets,
+	 * assuming a desired maximum load factor of 1.
+	 */
+	*nbuckets = next_pow2_int(nelem);
+
+	/*
+	 * In a partitioned table, nbuckets must be at least equal to
+	 * num_partitions; were it less, keys with apparently different partition
+	 * numbers would map to the same bucket, breaking partition independence.
+	 * (Normally nbuckets will be much bigger; this is just a safety check.)
+	 */
+	while ((*nbuckets) < num_partitions)
+		(*nbuckets) <<= 1;
+
+
+	/*
+	 * Figure number of directory segments needed, round up to a power of 2
+	 */
+	*nsegments = ((*nbuckets) - 1) / ssize + 1;
+	*nsegments = next_pow2_int(*nsegments);
+}
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index 932cc4f34d..79b959ffc3 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -151,7 +151,8 @@ extern void hash_seq_term(HASH_SEQ_STATUS *status);
 extern void hash_freeze(HTAB *hashp);
 extern Size hash_estimate_size(long num_entries, Size entrysize);
 extern long hash_select_dirsize(long num_entries);
-extern Size hash_get_shared_size(HASHCTL *info, int flags);
+extern Size hash_get_init_size(const HASHCTL *info, int flags,
+							   long init_size, int nelem_alloc);
 extern void AtEOXact_HashTables(bool isCommit);
 extern void AtEOSubXact_HashTables(bool isCommit, int nestDepth);
 
-- 
2.34.1

v5-0002-Replace-ShmemAlloc-calls-by-ShmemInitStruct.patchapplication/octet-stream; name=v5-0002-Replace-ShmemAlloc-calls-by-ShmemInitStruct.patchDownload
From 65dab85e3c0d6889cfa02da29c2b11d6dd39b56a Mon Sep 17 00:00:00 2001
From: Rahila Syed <rahilasyed.90@gmail.com>
Date: Thu, 27 Mar 2025 16:43:28 +0530
Subject: [PATCH 2/2] Replace ShmemAlloc calls by ShmemInitStruct

The shared memory allocated by ShmemAlloc is not tracked
by pg_shmem_allocations. This commit replaces most of the
calls to ShmemAlloc by ShmemInitStruct to associate a name
with the allocations and ensure that they get tracked by
pg_shmem_allocations. It also merges several smaller
ShmemAlloc calls into larger ShmemInitStruct calls, to
allocate and track all the related memory allocations
under a single call.
---
 src/backend/storage/lmgr/predicate.c | 27 +++++++++-----
 src/backend/storage/lmgr/proc.c      | 56 +++++++++++++++++++++++-----
 2 files changed, 63 insertions(+), 20 deletions(-)

diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 5b21a05398..de2629fdf0 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -1226,14 +1226,20 @@ PredicateLockShmemInit(void)
 	 */
 	max_table_size *= 10;
 
+	requestSize = add_size(PredXactListDataSize,
+						   (mul_size((Size) max_table_size,
+									 sizeof(SERIALIZABLEXACT))));
 	PredXact = ShmemInitStruct("PredXactList",
-							   PredXactListDataSize,
+							   requestSize,
 							   &found);
 	Assert(found == IsUnderPostmaster);
 	if (!found)
 	{
 		int			i;
 
+		/* reset everything, both the header and the element */
+		memset(PredXact, 0, requestSize);
+
 		dlist_init(&PredXact->availableList);
 		dlist_init(&PredXact->activeList);
 		PredXact->SxactGlobalXmin = InvalidTransactionId;
@@ -1242,11 +1248,8 @@ PredicateLockShmemInit(void)
 		PredXact->LastSxactCommitSeqNo = FirstNormalSerCommitSeqNo - 1;
 		PredXact->CanPartialClearThrough = 0;
 		PredXact->HavePartialClearedThrough = 0;
-		requestSize = mul_size((Size) max_table_size,
-							   sizeof(SERIALIZABLEXACT));
-		PredXact->element = ShmemAlloc(requestSize);
+		PredXact->element = (SERIALIZABLEXACT *) ((char *) PredXact + PredXactListDataSize);
 		/* Add all elements to available list, clean. */
-		memset(PredXact->element, 0, requestSize);
 		for (i = 0; i < max_table_size; i++)
 		{
 			LWLockInitialize(&PredXact->element[i].perXactPredicateListLock,
@@ -1299,21 +1302,25 @@ PredicateLockShmemInit(void)
 	 * probably OK.
 	 */
 	max_table_size *= 5;
+	requestSize = RWConflictPoolHeaderDataSize +
+					mul_size((Size) max_table_size,
+							 RWConflictDataSize);
 
 	RWConflictPool = ShmemInitStruct("RWConflictPool",
-									 RWConflictPoolHeaderDataSize,
+									 requestSize,
 									 &found);
 	Assert(found == IsUnderPostmaster);
 	if (!found)
 	{
 		int			i;
 
+		/* clean everything, including the elements */
+		memset(RWConflictPool, 0, requestSize);
+
 		dlist_init(&RWConflictPool->availableList);
-		requestSize = mul_size((Size) max_table_size,
-							   RWConflictDataSize);
-		RWConflictPool->element = ShmemAlloc(requestSize);
+		RWConflictPool->element = (RWConflict) ((char *) RWConflictPool +
+			RWConflictPoolHeaderDataSize);
 		/* Add all elements to available list, clean. */
-		memset(RWConflictPool->element, 0, requestSize);
 		for (i = 0; i < max_table_size; i++)
 		{
 			dlist_push_tail(&RWConflictPool->availableList,
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index e4ca861a8e..6ee48410b8 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -123,6 +123,24 @@ ProcGlobalShmemSize(void)
 	return size;
 }
 
+/*
+ * review: add comment, explaining the PG_CACHE_LINE_SIZE thing
+ * review: I'd even maybe split the PG_CACHE_LINE_SIZE thing into
+ * a separate commit, not to mix it with the "monitoring improvement"
+ */
+static Size
+PGProcShmemSize(void)
+{
+	Size		size;
+	uint32		TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
+
+	size = TotalProcs * sizeof(PGPROC);
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->xids));
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	return size;
+}
+
 /*
  * Report number of semaphores needed by InitProcGlobal.
  */
@@ -175,6 +193,8 @@ InitProcGlobal(void)
 			   *fpEndPtr PG_USED_FOR_ASSERTS_ONLY;
 	Size		fpLockBitsSize,
 				fpRelIdSize;
+	Size		requestSize;
+	char	   *ptr;
 
 	/* Create the ProcGlobal shared structure */
 	ProcGlobal = (PROC_HDR *)
@@ -204,7 +224,15 @@ InitProcGlobal(void)
 	 * with a single freelist.)  Each PGPROC structure is dedicated to exactly
 	 * one of these purposes, and they do not move between groups.
 	 */
-	procs = (PGPROC *) ShmemAlloc(TotalProcs * sizeof(PGPROC));
+	requestSize = PGProcShmemSize();
+
+	ptr = ShmemInitStruct("PGPROC structures",
+									   requestSize,
+									   &found);
+
+	procs = (PGPROC *) ptr;
+	ptr = (char *)ptr + TotalProcs * sizeof(PGPROC);
+
 	MemSet(procs, 0, TotalProcs * sizeof(PGPROC));
 	ProcGlobal->allProcs = procs;
 	/* XXX allProcCount isn't really all of them; it excludes prepared xacts */
@@ -213,17 +241,21 @@ InitProcGlobal(void)
 	/*
 	 * Allocate arrays mirroring PGPROC fields in a dense manner. See
 	 * PROC_HDR.
-	 *
-	 * XXX: It might make sense to increase padding for these arrays, given
-	 * how hotly they are accessed.
 	 */
-	ProcGlobal->xids =
-		(TransactionId *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->xids));
+	ProcGlobal->xids = (TransactionId *) ptr;
 	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
-	ProcGlobal->subxidStates = (XidCacheStatus *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	ptr = (char *)ptr + (TotalProcs * sizeof(*ProcGlobal->xids));
+
+	ProcGlobal->subxidStates = (XidCacheStatus *) ptr;
 	MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	ProcGlobal->statusFlags = (uint8 *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	ptr = (char *)ptr + (TotalProcs * sizeof(*ProcGlobal->subxidStates));
+
+	ProcGlobal->statusFlags = (uint8 *) ptr;
 	MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	ptr = (char *)ptr + (TotalProcs * sizeof(*ProcGlobal->statusFlags));
+
+	/* make sure we didn't overflow */
+	Assert((ptr > (char *) procs) && (ptr <= (char *) procs + requestSize));
 
 	/*
 	 * Allocate arrays for fast-path locks. Those are variable-length, so
@@ -233,7 +265,9 @@ InitProcGlobal(void)
 	fpLockBitsSize = MAXALIGN(FastPathLockGroupsPerBackend * sizeof(uint64));
 	fpRelIdSize = MAXALIGN(FastPathLockSlotsPerBackend() * sizeof(Oid));
 
-	fpPtr = ShmemAlloc(TotalProcs * (fpLockBitsSize + fpRelIdSize));
+	fpPtr = ShmemInitStruct("Fast path lock arrays",
+							TotalProcs * (fpLockBitsSize + fpRelIdSize),
+							&found);
 	MemSet(fpPtr, 0, TotalProcs * (fpLockBitsSize + fpRelIdSize));
 
 	/* For asserts checking we did not overflow. */
@@ -330,7 +364,9 @@ InitProcGlobal(void)
 	PreparedXactProcs = &procs[MaxBackends + NUM_AUXILIARY_PROCS];
 
 	/* Create ProcStructLock spinlock, too */
-	ProcStructLock = (slock_t *) ShmemAlloc(sizeof(slock_t));
+	ProcStructLock = (slock_t *) ShmemInitStruct("ProcStructLock spinlock",
+												 sizeof(slock_t),
+												 &found);
 	SpinLockInit(ProcStructLock);
 }
 
-- 
2.34.1

#9Tomas Vondra
tomas@vondra.me
In reply to: Rahila Syed (#8)
Re: Improve monitoring of shared memory allocations

Hi,

On 3/27/25 13:02, Rahila Syed wrote:

Hi Tomas,

I did a review on v3 of the patch. I see there have been some minor changes
in v4 - I haven't tried to adjust my review, but I assume most of my
comments still apply.

 
Thank you for your interest and review.

1) I don't quite understand why hash_get_shared_size() got renamed to
hash_get_init_size()? Why is that needed? Does it cover only some
initial allocation, and there's additional memory allocated later? And
what's the point of the added const?

Earlier, this function was used to calculate the size for shared hash
tables only. Now, it also calculates the size for a non-shared hash
table. Hence the change of name.

If I don't change the argument to const, I get a compilation error as
follows:
"passing argument 1 of ‘hash_get_init_size’ discards ‘const’ qualifier
from pointer target type."
This is because this function is called from hash_create, which treats
HASHCTL as a const.
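For illustration, a minimal standalone sketch of that warning, using
hypothetical names rather than the actual dynahash code:

#include <stddef.h>

struct my_ctl { size_t entrysize; };

/* callee takes a non-const pointer */
static size_t get_size(struct my_ctl *info) { return info->entrysize; }

/* caller only has a const pointer, like hash_create's "const HASHCTL *info" */
static size_t get_size_from_caller(const struct my_ctl *info)
{
    return get_size(info);  /* discards 'const' qualifier -> warning */
}

Declaring the callee's parameter const makes the call legal again.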

OK, makes sense. I haven't tried removing the const, but if removing it
would trigger a compiler warning, it's probably needed.

...

FWIW I see no justification for why the cache line padding is useful,
and it seems quite unlikely this would benefit anything, considering it
adds padding between fairly long arrays. What kind of contention can we
get there? I don't get it.

 
I have done that to address a review comment upthread by Andres, and
because the comment above that code block also mentions the need for
padding.

Neither of those seems like a sufficient justification for the change. The
comment is more of a speculation; it doesn't demonstrate the benefits
are real. It's true Andres tends to be right about these things, but
still, I'd like to see some argument for why this helps.

How exactly does padding before/after the whole array help anything?

In fact, is this the padding Andres (or the comment) suggested? The
comment says:

* XXX: It might make sense to increase padding for these arrays, given
* how hotly they are accessed.

Doesn't "padding of array" actually suggest padding each element, not
the array as a whole?
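(For illustration, padding each element would look roughly like the
LWLockPadded pattern - just a sketch with a hypothetical type, not
something either patch does:)

/* hypothetical per-element padding: one cacheline per entry */
typedef union XidCacheStatusPadded
{
	XidCacheStatus	status;
	char			pad[PG_CACHE_LINE_SIZE];
} XidCacheStatusPadded;

That trades memory for keeping each entry on its own cacheline, which is
quite different from putting a single gap before or after the array.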

In any case, if the 0003 patch addresses the padding, it should also
remove the XXX comment.

Also, why is the patch adding padding after statusFlags (the last array
allocated in InitProcGlobal) and not between allProcs and xids?

 
I agree it makes more sense this way, so changing accordingly.
 
 

         *
+        * XXX can we rely on this matching the calculation in hash_get_shared_size?
+        * or could/should we add some asserts? Can we have at least some sanity checks
+        * on nbuckets/nsegs?

 
Both places rely on compute_buckets_and_segs() to calculate nbuckets and
nsegs, hence the probability of mismatch is low. I am open to adding some
asserts to verify this. Do you have any suggestions in mind?

Please find attached updated patches after merging all your review
comments except a few discussed above.
 

OK, I don't have any other comments for 0001 and 0002. I'll do some
more review and polishing on those, and will get them committed soon.

I don't plan to push 0003, unless someone can actually explain and
demonstrate the benefits of the proposed padding.

regards

--
Tomas Vondra

#10Tomas Vondra
tomas@vondra.me
In reply to: Tomas Vondra (#9)
7 attachment(s)
Re: Improve monitoring of shared memory allocations

On 3/27/25 13:56, Tomas Vondra wrote:

...

OK, I don't have any other comments for 0001 and 0002. I'll do some
more review and polishing on those, and will get them committed soon.

Actually ... while polishing 0001 and 0002, I noticed a couple more
details that I'd like to ask about. I also ran pgindent and tweaked the
formatting, so some of the diff is caused by that.

1) alignment

There was a comment with a question whether we need to MAXALIGN the
chunks in dynahash.c, which were originally allocated by ShmemAlloc, but
are now part of one large allocation, which is then cut into pieces
(using pointer arithmetic).

I was not sure whether we need to enforce some alignment; we briefly
discussed that off-list. I realize you chose to add the alignment, but I
haven't noticed any comment in the patch explaining why it's needed, and
it seems to me it may not be quite correct.

Let me explain what I had in mind, and why I think v5 doesn't actually
do that. It took me a while before I understood what alignment
is about, and for a while it was haunting my patches, so hopefully this
will help others ...

The "alignment" is about pointers (or addresses), and when a pointer is
aligned it means the address is a multiple of some number. For example,
a 4B-aligned pointer is a multiple of 4B, so 0x00000100 is 4B-aligned,
while 0x00000101 is not. Sometimes we use data types to express the
alignment, e.g. int-aligned is 4B-aligned, but that's a detail. AFAIK
the alignment is always 2^k, so 1, 2, 4, 8, ...
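(A trivial way to express that check in code, just for illustration:)

#include <stdint.h>

/* a pointer is N-byte aligned iff its address is a multiple of N (N = 2^k) */
#define IS_ALIGNED(ptr, N)	((((uintptr_t) (ptr)) & ((N) - 1)) == 0)

/* IS_ALIGNED((void *) 0x00000100, 4) holds, IS_ALIGNED((void *) 0x00000101, 4) does not */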

The primary reason for alignment is that some architectures require the
pointers to be well-aligned for a given data type. For example (int*)
needs to be int-aligned. If you have a pointer that's not 4B-aligned,
it'll trigger SIGBUS or maybe SIGSEGV. This was true for architectures
like powerpc; I don't think x86/arm64 have this restriction, i.e. it'd
work, even if there might be a minor performance impact. Anyway, we
still enforce/expect correct alignment, because we may still support
some of those alignment-sensitive platforms, and it's "tidy".

The other reason is that we sometimes use alignment to add padding, to
reduce contention when accessing elements in hot arrays. We want to
align to cacheline boundaries, so that a struct does not require
accessing more cachelines than really necessary. And also to reduce
contention - the more cachelines, the higher the risk of contention.

Now, back to the patch. The code originally did this in ShmemInitStruct

hashp = ShmemInitStruct(...)

to allocate the hctl, and then

firstElement = (HASHELEMENT *) ShmemAlloc(nelem * elementSize);

in element_alloc(). But this means the "elements" allocation is aligned
to PG_CACHE_LINE_SIZE, i.e. 128B, because ShmemAllocRaw() does this:

size = CACHELINEALIGN(size);

So it distributes memory in multiples of 128B, and I believe it starts
at a multiple of 128B.
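(For reference, CACHELINEALIGN is roughly the usual round-up-to-multiple
trick from c.h:)

#define TYPEALIGN(ALIGNVAL,LEN)  \
	(((uintptr_t) (LEN) + ((ALIGNVAL) - 1)) & ~((uintptr_t) ((ALIGNVAL) - 1)))

#define CACHELINEALIGN(LEN)		TYPEALIGN(PG_CACHE_LINE_SIZE, (LEN))

/* e.g. with PG_CACHE_LINE_SIZE = 128: CACHELINEALIGN(300) = 384 */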

But the patch reworks this to allocate everything at once, and thus it
won't get this alignment automatically. AFAIK that's not intentional,
because no one explicitly mentioned this. And it may not be quite
desirable, judging by the comment in ShmemAllocRaw().

I mentioned v5 adds alignment, but I think it does not do that quite
correctly. It adds alignment by changing the macros from:

+#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
+	(sizeof(HASHHDR) + \
+	 ((hctl)->dsize * sizeof(HASHSEGMENT)) + \
+	 ((hctl)->ssize * (nsegs) * sizeof(HASHBUCKET)))

to

+#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
+	(MAXALIGN(sizeof(HASHHDR)) + \
+	 ((hctl)->dsize * MAXALIGN(sizeof(HASHSEGMENT))) + \
+	 ((hctl)->ssize * (nsegs) * MAXALIGN(sizeof(HASHBUCKET))))

First, it uses MAXALIGN, but that's mostly my fault, because my comment
suggested that - ShmemAllocRaw, however, makes the case for using
CACHELINEALIGN.

But more importantly, it adds alignment to all hctl fields, and to every
element of those arrays. But that's not what the alignment was supposed
to do - it was supposed to align arrays, not individual elements. Not
only would this waste memory, it would actually break direct access to
those array elements.

So something like this might be more correct:

+#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
+	(MAXALIGN(sizeof(HASHHDR)) + \
+	 MAXALIGN((hctl)->dsize * sizeof(HASHSEGMENT)) + \
+	 MAXALIGN((hctl)->ssize * (nsegs) * sizeof(HASHBUCKET)))

But there's another detail - even before this patch, most of the stuff
was allocated at once by ShmemInitStruct(). Everything except for the
elements, so to replicate the alignment we only need to worry about that
last part. So I think this should do:

+#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
+    CACHELINEALIGN(sizeof(HASHHDR) + \
+     ((hctl)->dsize * sizeof(HASHSEGMENT)) + \
+     ((hctl)->ssize * (nsegs) * sizeof(HASHBUCKET)))

This is what the 0003 patch does. There's still one minor difference, in
that we used to align each segment independently - each element_alloc()
call allocated a new CACHELINEALIGN-ed chunk, while now we have just a
single chunk. But I think that's OK.

In fact, I'm starting to wonder if we need to worry about this at all.
Maybe it's not that significant for dynahash after all - otherwise we
actually should align/pad the individual elements, not just the array as
a whole, I think.

2) I do think the last patch should use CACHELINEALIGN() too, instead of
adding PG_CACHE_LINE_SIZE to the sizes (which does not ensure good
alignment, it just means there's a gap between the allocated parts). But I
still don't know what the intention was, so hard to say ...
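(To illustrate the difference with a sketch - assuming the running size
starts out cacheline-aligned:)

/* only inserts a gap; the next chunk may still start at an odd offset */
size = add_size(size, TotalProcs * sizeof(*ProcGlobal->xids) + PG_CACHE_LINE_SIZE);

/* rounds the chunk up, so whatever is laid out next starts on a cacheline boundary */
size = add_size(size, CACHELINEALIGN(TotalProcs * sizeof(*ProcGlobal->xids)));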

3) I find the comment before hash_get_init_size a bit unclear/confusing.
It says this:

* init_size should match the total number of elements allocated during
* hash table creation, it could be zero for non-shared hash tables
* depending on the value of nelem_alloc. For more explanation see
* comments within this function.
*
* nelem_alloc parameter is not relevant for shared hash tables.

What does "should match" mean here? Doesn't it *determine* the number of
elements allocated? What if it doesn't match?

AFAICS it means the hash table is sized to expect init_size elements,
but only nelem_alloc elements are actually pre-allocated, right? But the
comment says it's init_size which determines the number of elements
allocated during creation. Confusing.

It says "it could be zero ... depending on the value of nelem_alloc".
Depending how? What's the relationship?

Then it says "nelem_alloc parameter is not relevant for shared hash
tables". What does "not relevant" mean? It should say it's ignored.

The bit "For more explanation see comments within this function" is not
great, if only because there are not many comments within the function,
so there's no "more explanation". But if there's something important, it
should be in the main comment, preferably.

regards

--
Tomas Vondra

Attachments:

v6-0001-Account-for-all-the-shared-memory-allocated-by-ha.patchtext/x-patch; charset=UTF-8; name=v6-0001-Account-for-all-the-shared-memory-allocated-by-ha.patchDownload
From a0ab96326687ff8eb9ff3fd8343cff545226fd0a Mon Sep 17 00:00:00 2001
From: Rahila Syed <rahilasyed.90@gmail.com>
Date: Thu, 27 Mar 2025 12:59:02 +0530
Subject: [PATCH v6 1/7] Account for all the shared memory allocated by
 hash_create

pg_shmem_allocations tracks the memory allocated by ShmemInitStruct,
which, in case of shared hash tables, only covers memory allocated
to the hash directory and header structure. The hash segments and
buckets are allocated using ShmemAllocNoError which does not attribute
the allocations to the hash table name. Thus, these allocations are
not tracked in pg_shmem_allocations.

Allocate memory for segments, buckets and elements together with the
directory and header structures. This results in the existing ShmemIndex
entries reflecting the size of the hash table more accurately, thus
improving the pg_shmem_allocations monitoring. Also, make this change for
non-shared hash tables since they share the hash_create code.
---
 src/backend/storage/ipc/shmem.c   |   3 +-
 src/backend/utils/hash/dynahash.c | 265 +++++++++++++++++++++++-------
 src/include/utils/hsearch.h       |   3 +-
 3 files changed, 213 insertions(+), 58 deletions(-)

diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39e..d8aed0bfaaf 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -73,6 +73,7 @@
 #include "storage/shmem.h"
 #include "storage/spin.h"
 #include "utils/builtins.h"
+#include "utils/dynahash.h"
 
 static void *ShmemAllocRaw(Size size, Size *allocated_size);
 
@@ -346,7 +347,7 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
 
 	/* look it up in the shmem index */
 	location = ShmemInitStruct(name,
-							   hash_get_shared_size(infoP, hash_flags),
+							   hash_get_init_size(infoP, hash_flags, init_size, 0),
 							   &found);
 
 	/*
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 3f25929f2d8..1f215a16c51 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -260,12 +260,36 @@ static long hash_accesses,
 			hash_expansions;
 #endif
 
+
+#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
+	(MAXALIGN(sizeof(HASHHDR)) + \
+	 ((hctl)->dsize * MAXALIGN(sizeof(HASHSEGMENT))) + \
+	 ((hctl)->ssize * (nsegs) * MAXALIGN(sizeof(HASHBUCKET))))
+
+#define HASH_ELEMENTS(hashp, nsegs) \
+	((char *) (hashp)->hctl + HASH_ELEMENTS_OFFSET((hashp)->hctl, nsegs))
+
+#define HASH_SEGMENT_OFFSET(hctl, idx) \
+	(MAXALIGN(sizeof(HASHHDR)) + \
+	 ((hctl)->dsize * MAXALIGN(sizeof(HASHSEGMENT))) + \
+	 ((hctl)->ssize * (idx) * MAXALIGN(sizeof(HASHBUCKET))))
+
+#define HASH_SEGMENT_PTR(hashp, idx) \
+	(HASHSEGMENT) ((char *) (hashp)->hctl + HASH_SEGMENT_OFFSET((hashp)->hctl, (idx)))
+
+#define HASH_SEGMENT_SIZE(hashp)	((hashp)->ssize * MAXALIGN(sizeof(HASHBUCKET)))
+
+#define	HASH_DIRECTORY(hashp)	(HASHSEGMENT *) (((char *) (hashp)->hctl) + MAXALIGN(sizeof(HASHHDR)))
+
+#define HASH_ELEMENT_NEXT(hctl, num, ptr) \
+	((char *) (ptr) + ((num) * (MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN((hctl)->entrysize))))
+
 /*
  * Private function prototypes
  */
 static void *DynaHashAlloc(Size size);
 static HASHSEGMENT seg_alloc(HTAB *hashp);
-static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx);
+static HASHELEMENT *element_alloc(HTAB *hashp, int nelem);
 static bool dir_realloc(HTAB *hashp);
 static bool expand_table(HTAB *hashp);
 static HASHBUCKET get_hash_entry(HTAB *hashp, int freelist_idx);
@@ -281,6 +305,11 @@ static void register_seq_scan(HTAB *hashp);
 static void deregister_seq_scan(HTAB *hashp);
 static bool has_seq_scans(HTAB *hashp);
 
+static void	compute_buckets_and_segs(long nelem, long num_partitions,
+									 long ssize, /* segment size */
+									 int *nbuckets, int *nsegments);
+static void element_add(HTAB *hashp, HASHELEMENT *firstElement,
+						int nelem, int freelist_idx);
 
 /*
  * memory allocation support
@@ -353,6 +382,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 {
 	HTAB	   *hashp;
 	HASHHDR    *hctl;
+	int			nelem_batch;
 
 	/*
 	 * Hash tables now allocate space for key and data, but you have to say
@@ -507,9 +537,19 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		hashp->isshared = false;
 	}
 
+	/* Choose the number of entries to allocate at a time. */
+	nelem_batch = choose_nelem_alloc(info->entrysize);
+
+	/*
+	 * Allocate the memory needed for hash header, directory, segments and
+	 * elements together. Use pointer arithmetic to arrive at the start of
+	 * each of these structures later.
+	 */
 	if (!hashp->hctl)
 	{
-		hashp->hctl = (HASHHDR *) hashp->alloc(sizeof(HASHHDR));
+		Size	size = hash_get_init_size(info, flags, nelem, nelem_batch);
+
+		hashp->hctl = (HASHHDR *) hashp->alloc(size);
 		if (!hashp->hctl)
 			ereport(ERROR,
 					(errcode(ERRCODE_OUT_OF_MEMORY),
@@ -558,6 +598,9 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 	hctl->keysize = info->keysize;
 	hctl->entrysize = info->entrysize;
 
+	/* remember how many elements to allocate at once */
+	hctl->nelem_alloc = nelem_batch;
+
 	/* make local copies of heavily-used constant fields */
 	hashp->keysize = hctl->keysize;
 	hashp->ssize = hctl->ssize;
@@ -582,6 +625,9 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 					freelist_partitions,
 					nelem_alloc,
 					nelem_alloc_first;
+		void	   *ptr = NULL;
+		int			nsegs;
+		int			nbuckets;
 
 		/*
 		 * If hash table is partitioned, give each freelist an equal share of
@@ -592,6 +638,16 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		else
 			freelist_partitions = 1;
 
+		compute_buckets_and_segs(nelem, hctl->num_partitions, hctl->ssize,
+								 &nbuckets, &nsegs);
+
+		/*
+		 * Calculate the offset at which to find the first partition of
+		 * elements.
+		 * We have to skip space for the header, segments and buckets.
+ 		 */
+		ptr = HASH_ELEMENTS(hashp, nsegs);
+
 		nelem_alloc = nelem / freelist_partitions;
 		if (nelem_alloc <= 0)
 			nelem_alloc = 1;
@@ -610,10 +666,17 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		{
 			int			temp = (i == 0) ? nelem_alloc_first : nelem_alloc;
 
-			if (!element_alloc(hashp, temp, i))
-				ereport(ERROR,
-						(errcode(ERRCODE_OUT_OF_MEMORY),
-						 errmsg("out of memory")));
+			/*
+			 * Assign the correct location of each partition within a
+			 * pre-allocated buffer.
+			 *
+			 * Actual memory allocation happens in ShmemInitHash for
+			 * shared hash tables or earlier in this function for non-shared
+			 * hash tables.
+			 * We just need to split that allocation per-batch freelists.
+			 */
+			element_add(hashp, (HASHELEMENT *) ptr, temp, i);
+			ptr = HASH_ELEMENT_NEXT(hctl, temp, ptr);
 		}
 	}
 
@@ -701,30 +764,12 @@ init_htab(HTAB *hashp, long nelem)
 		for (i = 0; i < NUM_FREELISTS; i++)
 			SpinLockInit(&(hctl->freeList[i].mutex));
 
-	/*
-	 * Allocate space for the next greater power of two number of buckets,
-	 * assuming a desired maximum load factor of 1.
-	 */
-	nbuckets = next_pow2_int(nelem);
-
-	/*
-	 * In a partitioned table, nbuckets must be at least equal to
-	 * num_partitions; were it less, keys with apparently different partition
-	 * numbers would map to the same bucket, breaking partition independence.
-	 * (Normally nbuckets will be much bigger; this is just a safety check.)
-	 */
-	while (nbuckets < hctl->num_partitions)
-		nbuckets <<= 1;
+	compute_buckets_and_segs(nelem, hctl->num_partitions, hctl->ssize,
+							 &nbuckets, &nsegs);
 
 	hctl->max_bucket = hctl->low_mask = nbuckets - 1;
 	hctl->high_mask = (nbuckets << 1) - 1;
 
-	/*
-	 * Figure number of directory segments needed, round up to a power of 2
-	 */
-	nsegs = (nbuckets - 1) / hctl->ssize + 1;
-	nsegs = next_pow2_int(nsegs);
-
 	/*
 	 * Make sure directory is big enough. If pre-allocated directory is too
 	 * small, choke (caller screwed up).
@@ -737,26 +782,25 @@ init_htab(HTAB *hashp, long nelem)
 			return false;
 	}
 
-	/* Allocate a directory */
+	/*
+	 * Assign a directory by making it point to the correct location in the
+	 * pre-allocated buffer.
+	 */
 	if (!(hashp->dir))
 	{
 		CurrentDynaHashCxt = hashp->hcxt;
-		hashp->dir = (HASHSEGMENT *)
-			hashp->alloc(hctl->dsize * sizeof(HASHSEGMENT));
-		if (!hashp->dir)
-			return false;
+		hashp->dir = HASH_DIRECTORY(hashp);
 	}
 
-	/* Allocate initial segments */
+	/* Assign initial segments, which are also pre-allocated */
+	i = 0;
 	for (segp = hashp->dir; hctl->nsegs < nsegs; hctl->nsegs++, segp++)
 	{
-		*segp = seg_alloc(hashp);
-		if (*segp == NULL)
-			return false;
+		*segp = HASH_SEGMENT_PTR(hashp, i++);
+		MemSet(*segp, 0, HASH_SEGMENT_SIZE(hashp));
 	}
 
-	/* Choose number of entries to allocate at a time */
-	hctl->nelem_alloc = choose_nelem_alloc(hctl->entrysize);
+	Assert(i == nsegs);
 
 #ifdef HASH_DEBUG
 	fprintf(stderr, "init_htab:\n%s%p\n%s%ld\n%s%ld\n%s%d\n%s%ld\n%s%u\n%s%x\n%s%x\n%s%ld\n",
@@ -847,15 +891,79 @@ hash_select_dirsize(long num_entries)
 
 /*
  * Compute the required initial memory allocation for a shared-memory
- * hashtable with the given parameters.  We need space for the HASHHDR
- * and for the (non expansible) directory.
+ * or non-shared memory hashtable with the given parameters.
+ * We need space for the HASHHDR, for the directory, segments and
+ * the init_size elements in buckets.	
+ * 
+ * For shared hash tables the directory size is non-expansive.
+ * 
+ * init_size should match the total number of elements allocated
+ * during hash table creation, it could be zero for non-shared hash
+ * tables depending on the value of nelem_alloc. For more explanation
+ * see comments within this function.
+ *
+ * nelem_alloc parameter is not relevant for shared hash tables.
  */
 Size
-hash_get_shared_size(HASHCTL *info, int flags)
+hash_get_init_size(const HASHCTL *info, int flags, long init_size, int nelem_alloc)
 {
-	Assert(flags & HASH_DIRSIZE);
-	Assert(info->dsize == info->max_dsize);
-	return sizeof(HASHHDR) + info->dsize * sizeof(HASHSEGMENT);
+	int			nbuckets;
+	int			nsegs;
+	int			num_partitions;
+	long		ssize;
+	long		dsize;
+	bool		element_alloc = true; /*Always true for shared hash tables */
+	Size		elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(info->entrysize);
+
+	/*
+	 * For non-shared hash tables, the requested number of elements are
+	 * allocated only if they are less than nelem_alloc. In any case, the
+	 * init_size should be equal to the number of elements added using
+	 * element_add() in hash_create.
+	 */
+	if (!(flags & HASH_SHARED_MEM))
+	{
+		if (init_size > nelem_alloc)
+			element_alloc = false;
+	}
+	else
+	{
+		Assert(flags & HASH_DIRSIZE);
+		Assert(info->dsize == info->max_dsize);
+	}
+	/* Non-shared hash tables may not specify dir size */
+	if (!(flags & HASH_DIRSIZE))
+	{
+		dsize = DEF_DIRSIZE;
+	}
+	else
+		dsize = info->dsize;
+
+	if (flags & HASH_PARTITION)
+	{
+		num_partitions = info->num_partitions;
+
+		/* Number of entries should be at least equal to the freelists */
+		if (init_size < NUM_FREELISTS)
+			init_size = NUM_FREELISTS;
+	}
+	else
+		num_partitions = 0;
+
+	if (flags & HASH_SEGMENT)
+		ssize = info->ssize;
+	else
+		ssize = DEF_SEGSIZE;
+
+	compute_buckets_and_segs(init_size, num_partitions, ssize,
+							 &nbuckets, &nsegs);
+
+	if (!element_alloc)
+		init_size = 0;
+
+	return MAXALIGN(sizeof(HASHHDR)) + dsize * MAXALIGN(sizeof(HASHSEGMENT)) +
+		+ MAXALIGN(sizeof(HASHBUCKET)) * ssize * nsegs
+		+ init_size * elementSize;
 }
 
 
@@ -1285,7 +1393,8 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 		 * Failing because the needed element is in a different freelist is
 		 * not acceptable.
 		 */
-		if (!element_alloc(hashp, hctl->nelem_alloc, freelist_idx))
+		newElement = element_alloc(hashp, hctl->nelem_alloc);
+		if (newElement == NULL)
 		{
 			int			borrow_from_idx;
 
@@ -1322,6 +1431,7 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 			/* no elements available to borrow either, so out of memory */
 			return NULL;
 		}
+		element_add(hashp, newElement, hctl->nelem_alloc, freelist_idx);
 	}
 
 	/* remove entry from freelist, bump nentries */
@@ -1700,30 +1810,43 @@ seg_alloc(HTAB *hashp)
 }
 
 /*
- * allocate some new elements and link them into the indicated free list
+ * allocate some new elements
  */
-static bool
-element_alloc(HTAB *hashp, int nelem, int freelist_idx)
+static HASHELEMENT *
+element_alloc(HTAB *hashp, int nelem)
 {
 	HASHHDR    *hctl = hashp->hctl;
 	Size		elementSize;
-	HASHELEMENT *firstElement;
-	HASHELEMENT *tmpElement;
-	HASHELEMENT *prevElement;
-	int			i;
+	HASHELEMENT *firstElement = NULL;
 
 	if (hashp->isfixed)
-		return false;
+		return NULL;
 
 	/* Each element has a HASHELEMENT header plus user data. */
 	elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
-
 	CurrentDynaHashCxt = hashp->hcxt;
 	firstElement = (HASHELEMENT *) hashp->alloc(nelem * elementSize);
 
 	if (!firstElement)
-		return false;
+		return NULL;
+
+	return firstElement;
+}
+
+/*
+ * Link the elements allocated by element_alloc into the indicated free list
+ */
+static void
+element_add(HTAB *hashp, HASHELEMENT *firstElement, int nelem, int freelist_idx)
+{
+	HASHHDR    *hctl = hashp->hctl;
+	Size		elementSize;
+	HASHELEMENT *tmpElement;
+	HASHELEMENT *prevElement;
+	int			i;
 
+	/* Each element has a HASHELEMENT header plus user data. */
+	elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
 	/* prepare to link all the new entries into the freelist */
 	prevElement = NULL;
 	tmpElement = firstElement;
@@ -1744,8 +1867,6 @@ element_alloc(HTAB *hashp, int nelem, int freelist_idx)
 
 	if (IS_PARTITIONED(hctl))
 		SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
-
-	return true;
 }
 
 /*
@@ -1957,3 +2078,35 @@ AtEOSubXact_HashTables(bool isCommit, int nestDepth)
 		}
 	}
 }
+
+/*
+ * Calculate the number of buckets and segments to store the given
+ * number of elements in a hash table. Segments contain buckets which
+ * in turn contain elements.
+ */
+static void
+compute_buckets_and_segs(long nelem, long num_partitions, long ssize,
+						 int *nbuckets, int *nsegments)
+{
+	/*
+	 * Allocate space for the next greater power of two number of buckets,
+	 * assuming a desired maximum load factor of 1.
+	 */
+	*nbuckets = next_pow2_int(nelem);
+
+	/*
+	 * In a partitioned table, nbuckets must be at least equal to
+	 * num_partitions; were it less, keys with apparently different partition
+	 * numbers would map to the same bucket, breaking partition independence.
+	 * (Normally nbuckets will be much bigger; this is just a safety check.)
+	 */
+	while ((*nbuckets) < num_partitions)
+		(*nbuckets) <<= 1;
+
+
+	/*
+	 * Figure number of directory segments needed, round up to a power of 2
+	 */
+	*nsegments = ((*nbuckets) - 1) / ssize + 1;
+	*nsegments = next_pow2_int(*nsegments);
+}
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index 932cc4f34d9..79b959ffc3c 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -151,7 +151,8 @@ extern void hash_seq_term(HASH_SEQ_STATUS *status);
 extern void hash_freeze(HTAB *hashp);
 extern Size hash_estimate_size(long num_entries, Size entrysize);
 extern long hash_select_dirsize(long num_entries);
-extern Size hash_get_shared_size(HASHCTL *info, int flags);
+extern Size hash_get_init_size(const HASHCTL *info, int flags,
+							   long init_size, int nelem_alloc);
 extern void AtEOXact_HashTables(bool isCommit);
 extern void AtEOSubXact_HashTables(bool isCommit, int nestDepth);
 
-- 
2.49.0

v6-0002-review.patchtext/x-patch; charset=UTF-8; name=v6-0002-review.patchDownload
From 5d504b93f4880c7d1836304b7ad1a212bd5a5c58 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Thu, 27 Mar 2025 20:52:31 +0100
Subject: [PATCH v6 2/7] review

---
 src/backend/storage/ipc/shmem.c   |   3 +-
 src/backend/utils/hash/dynahash.c | 127 ++++++++++++++++--------------
 2 files changed, 70 insertions(+), 60 deletions(-)

diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index d8aed0bfaaf..7e03a371d22 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -347,7 +347,8 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
 
 	/* look it up in the shmem index */
 	location = ShmemInitStruct(name,
-							   hash_get_init_size(infoP, hash_flags, init_size, 0),
+							   hash_get_init_size(infoP, hash_flags,
+												  init_size, 0),
 							   &found);
 
 	/*
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 1f215a16c51..dcdc143e690 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -260,29 +260,35 @@ static long hash_accesses,
 			hash_expansions;
 #endif
 
-
-#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
-	(MAXALIGN(sizeof(HASHHDR)) + \
-	 ((hctl)->dsize * MAXALIGN(sizeof(HASHSEGMENT))) + \
-	 ((hctl)->ssize * (nsegs) * MAXALIGN(sizeof(HASHBUCKET))))
-
-#define HASH_ELEMENTS(hashp, nsegs) \
-	((char *) (hashp)->hctl + HASH_ELEMENTS_OFFSET((hashp)->hctl, nsegs))
+#define	HASH_DIRECTORY_PTR(hashp) \
+	(((char *) (hashp)->hctl) + sizeof(HASHHDR))
 
 #define HASH_SEGMENT_OFFSET(hctl, idx) \
-	(MAXALIGN(sizeof(HASHHDR)) + \
-	 ((hctl)->dsize * MAXALIGN(sizeof(HASHSEGMENT))) + \
-	 ((hctl)->ssize * (idx) * MAXALIGN(sizeof(HASHBUCKET))))
+	(sizeof(HASHHDR) + \
+	 ((hctl)->dsize * sizeof(HASHSEGMENT)) + \
+	 ((hctl)->ssize * (idx) * sizeof(HASHBUCKET)))
 
 #define HASH_SEGMENT_PTR(hashp, idx) \
-	(HASHSEGMENT) ((char *) (hashp)->hctl + HASH_SEGMENT_OFFSET((hashp)->hctl, (idx)))
+	((char *) (hashp)->hctl + HASH_SEGMENT_OFFSET((hashp)->hctl, (idx)))
 
-#define HASH_SEGMENT_SIZE(hashp)	((hashp)->ssize * MAXALIGN(sizeof(HASHBUCKET)))
+#define HASH_SEGMENT_SIZE(hashp) \
+	((hashp)->ssize * sizeof(HASHBUCKET))
+
+/* XXX this is exactly the same as HASH_SEGMENT_OFFSET, unite? */
+#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
+	(sizeof(HASHHDR) + \
+	 ((hctl)->dsize * sizeof(HASHSEGMENT)) + \
+	 ((hctl)->ssize * (nsegs) * sizeof(HASHBUCKET)))
+
+#define HASH_ELEMENTS_PTR(hashp, nsegs) \
+	((char *) (hashp)->hctl + HASH_ELEMENTS_OFFSET((hashp)->hctl, nsegs))
 
-#define	HASH_DIRECTORY(hashp)	(HASHSEGMENT *) (((char *) (hashp)->hctl) + MAXALIGN(sizeof(HASHHDR)))
+/* Each element has a HASHELEMENT header plus user data. */
+#define HASH_ELEMENT_SIZE(hctl) \
+	(MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN((hctl)->entrysize))
 
 #define HASH_ELEMENT_NEXT(hctl, num, ptr) \
-	((char *) (ptr) + ((num) * (MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN((hctl)->entrysize))))
+	((char *) (ptr) + ((num) * HASH_ELEMENT_SIZE(hctl)))
 
 /*
  * Private function prototypes
@@ -305,8 +311,8 @@ static void register_seq_scan(HTAB *hashp);
 static void deregister_seq_scan(HTAB *hashp);
 static bool has_seq_scans(HTAB *hashp);
 
-static void	compute_buckets_and_segs(long nelem, long num_partitions,
-									 long ssize, /* segment size */
+static void compute_buckets_and_segs(long nelem, long num_partitions,
+									 long ssize,
 									 int *nbuckets, int *nsegments);
 static void element_add(HTAB *hashp, HASHELEMENT *firstElement,
 						int nelem, int freelist_idx);
@@ -547,7 +553,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 	 */
 	if (!hashp->hctl)
 	{
-		Size	size = hash_get_init_size(info, flags, nelem, nelem_batch);
+		Size		size = hash_get_init_size(info, flags, nelem, nelem_batch);
 
 		hashp->hctl = (HASHHDR *) hashp->alloc(size);
 		if (!hashp->hctl)
@@ -638,16 +644,6 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		else
 			freelist_partitions = 1;
 
-		compute_buckets_and_segs(nelem, hctl->num_partitions, hctl->ssize,
-								 &nbuckets, &nsegs);
-
-		/*
-		 * Calculate the offset at which to find the first partition of
-		 * elements.
-		 * We have to skip space for the header, segments and buckets.
- 		 */
-		ptr = HASH_ELEMENTS(hashp, nsegs);
-
 		nelem_alloc = nelem / freelist_partitions;
 		if (nelem_alloc <= 0)
 			nelem_alloc = 1;
@@ -662,19 +658,29 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		else
 			nelem_alloc_first = nelem_alloc;
 
+		/*
+		 * Calculate the offset at which to find the first partition of
+		 * elements. We have to skip space for the header, segments and
+		 * buckets.
+		 */
+		compute_buckets_and_segs(nelem, hctl->num_partitions, hctl->ssize,
+								 &nbuckets, &nsegs);
+
+		ptr = HASH_ELEMENTS_PTR(hashp, nsegs);
+
+		/*
+		 * Assign the correct location of each partition within a pre-allocated
+		 * buffer.
+		 *
+		 * Actual memory allocation happens in ShmemInitHash for shared hash
+		 * tables or earlier in this function for non-shared hash tables.
+		 *
+		 * We just need to split that allocation into per-batch freelists.
+		 */
 		for (i = 0; i < freelist_partitions; i++)
 		{
 			int			temp = (i == 0) ? nelem_alloc_first : nelem_alloc;
 
-			/*
-			 * Assign the correct location of each partition within a
-			 * pre-allocated buffer.
-			 *
-			 * Actual memory allocation happens in ShmemInitHash for
-			 * shared hash tables or earlier in this function for non-shared
-			 * hash tables.
-			 * We just need to split that allocation per-batch freelists.
-			 */
 			element_add(hashp, (HASHELEMENT *) ptr, temp, i);
 			ptr = HASH_ELEMENT_NEXT(hctl, temp, ptr);
 		}
@@ -789,14 +795,14 @@ init_htab(HTAB *hashp, long nelem)
 	if (!(hashp->dir))
 	{
 		CurrentDynaHashCxt = hashp->hcxt;
-		hashp->dir = HASH_DIRECTORY(hashp);
+		hashp->dir = (HASHSEGMENT *) HASH_DIRECTORY_PTR(hashp);
 	}
 
 	/* Assign initial segments, which are also pre-allocated */
 	i = 0;
 	for (segp = hashp->dir; hctl->nsegs < nsegs; hctl->nsegs++, segp++)
 	{
-		*segp = HASH_SEGMENT_PTR(hashp, i++);
+		*segp = (HASHSEGMENT) HASH_SEGMENT_PTR(hashp, i++);
 		MemSet(*segp, 0, HASH_SEGMENT_SIZE(hashp));
 	}
 
@@ -890,19 +896,22 @@ hash_select_dirsize(long num_entries)
 }
 
 /*
- * Compute the required initial memory allocation for a shared-memory
- * or non-shared memory hashtable with the given parameters.
- * We need space for the HASHHDR, for the directory, segments and
- * the init_size elements in buckets.	
- * 
+ * Compute the required initial memory allocation for a hashtable with the given
+ * parameters. The hash table may be shared or private. We need space for the
+ * HASHHDR, for the directory, segments and the init_size elements in buckets.
+ *
  * For shared hash tables the directory size is non-expansive.
- * 
- * init_size should match the total number of elements allocated
- * during hash table creation, it could be zero for non-shared hash
- * tables depending on the value of nelem_alloc. For more explanation
- * see comments within this function.
+ *
+ * init_size should match the total number of elements allocated during hash
+ * table creation, it could be zero for non-shared hash tables depending on the
+ * value of nelem_alloc. For more explanation see comments within this function.
  *
  * nelem_alloc parameter is not relevant for shared hash tables.
+ *
+ * XXX I find this comment hard to understand / confusing. So what is init_size?
+ * Why should it match the total number of elements allocated during hash table
+ * creation? What does "it could be zero" say? Should it be zero, or why would I
+ * want to make it zero in some cases?
  */
 Size
 hash_get_init_size(const HASHCTL *info, int flags, long init_size, int nelem_alloc)
@@ -912,8 +921,8 @@ hash_get_init_size(const HASHCTL *info, int flags, long init_size, int nelem_all
 	int			num_partitions;
 	long		ssize;
 	long		dsize;
-	bool		element_alloc = true; /*Always true for shared hash tables */
-	Size		elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(info->entrysize);
+	bool		element_alloc = true;	/* Always true for shared hash tables */
+	Size		elementSize = HASH_ELEMENT_SIZE(info);
 
 	/*
 	 * For non-shared hash tables, the requested number of elements are
@@ -931,6 +940,7 @@ hash_get_init_size(const HASHCTL *info, int flags, long init_size, int nelem_all
 		Assert(flags & HASH_DIRSIZE);
 		Assert(info->dsize == info->max_dsize);
 	}
+
 	/* Non-shared hash tables may not specify dir size */
 	if (!(flags & HASH_DIRSIZE))
 	{
@@ -961,8 +971,8 @@ hash_get_init_size(const HASHCTL *info, int flags, long init_size, int nelem_all
 	if (!element_alloc)
 		init_size = 0;
 
-	return MAXALIGN(sizeof(HASHHDR)) + dsize * MAXALIGN(sizeof(HASHSEGMENT)) +
-		+ MAXALIGN(sizeof(HASHBUCKET)) * ssize * nsegs
+	return sizeof(HASHHDR) + dsize * sizeof(HASHSEGMENT)
+		+ sizeof(HASHBUCKET) * ssize * nsegs
 		+ init_size * elementSize;
 }
 
@@ -1393,8 +1403,7 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 		 * Failing because the needed element is in a different freelist is
 		 * not acceptable.
 		 */
-		newElement = element_alloc(hashp, hctl->nelem_alloc);
-		if (newElement == NULL)
+		if ((newElement = element_alloc(hashp, hctl->nelem_alloc)) == NULL)
 		{
 			int			borrow_from_idx;
 
@@ -1823,7 +1832,7 @@ element_alloc(HTAB *hashp, int nelem)
 		return NULL;
 
 	/* Each element has a HASHELEMENT header plus user data. */
-	elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
+	elementSize = HASH_ELEMENT_SIZE(hctl);
 	CurrentDynaHashCxt = hashp->hcxt;
 	firstElement = (HASHELEMENT *) hashp->alloc(nelem * elementSize);
 
@@ -1834,7 +1843,7 @@ element_alloc(HTAB *hashp, int nelem)
 }
 
 /*
- * Link the elements allocated by element_alloc into the indicated free list
+ * link the elements allocated by element_alloc into the indicated free list
  */
 static void
 element_add(HTAB *hashp, HASHELEMENT *firstElement, int nelem, int freelist_idx)
@@ -1846,7 +1855,8 @@ element_add(HTAB *hashp, HASHELEMENT *firstElement, int nelem, int freelist_idx)
 	int			i;
 
 	/* Each element has a HASHELEMENT header plus user data. */
-	elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
+	elementSize = HASH_ELEMENT_SIZE(hctl);
+
 	/* prepare to link all the new entries into the freelist */
 	prevElement = NULL;
 	tmpElement = firstElement;
@@ -2103,7 +2113,6 @@ compute_buckets_and_segs(long nelem, long num_partitions, long ssize,
 	while ((*nbuckets) < num_partitions)
 		(*nbuckets) <<= 1;
 
-
 	/*
 	 * Figure number of directory segments needed, round up to a power of 2
 	 */
-- 
2.49.0

v6-0003-align-using-CACHELINESIZE-to-match-ShmemAllocRaw.patchtext/x-patch; charset=UTF-8; name=v6-0003-align-using-CACHELINESIZE-to-match-ShmemAllocRaw.patchDownload
From 88974dc9a36759eaf2b0048b6ac45bc0c83cb537 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Thu, 27 Mar 2025 21:58:22 +0100
Subject: [PATCH v6 3/7] align using CACHELINESIZE, to match ShmemAllocRaw

---
 src/backend/utils/hash/dynahash.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index dcdc143e690..fc5575daaea 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -276,7 +276,7 @@ static long hash_accesses,
 
 /* XXX this is exactly the same as HASH_SEGMENT_OFFSET, unite? */
 #define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
-	(sizeof(HASHHDR) + \
+	CACHELINEALIGN(sizeof(HASHHDR) + \
 	 ((hctl)->dsize * sizeof(HASHSEGMENT)) + \
 	 ((hctl)->ssize * (nsegs) * sizeof(HASHBUCKET)))
 
@@ -971,9 +971,11 @@ hash_get_init_size(const HASHCTL *info, int flags, long init_size, int nelem_all
 	if (!element_alloc)
 		init_size = 0;
 
-	return sizeof(HASHHDR) + dsize * sizeof(HASHSEGMENT)
-		+ sizeof(HASHBUCKET) * ssize * nsegs
-		+ init_size * elementSize;
+	/* enforce elements to be cacheline aligned */
+	return CACHELINEALIGN(sizeof(HASHHDR)
+			+ (dsize * sizeof(HASHSEGMENT))
+			+ (ssize * nsegs * sizeof(HASHBUCKET)))
+			+ (init_size * elementSize);
 }
 
 
-- 
2.49.0

v6-0004-Replace-ShmemAlloc-calls-by-ShmemInitStruct.patchtext/x-patch; charset=UTF-8; name=v6-0004-Replace-ShmemAlloc-calls-by-ShmemInitStruct.patchDownload
From edccce9feb8a930150d97d034ef96c6bff544d96 Mon Sep 17 00:00:00 2001
From: Rahila Syed <rahilasyed.90@gmail.com>
Date: Thu, 27 Mar 2025 16:43:28 +0530
Subject: [PATCH v6 4/7] Replace ShmemAlloc calls by ShmemInitStruct

The shared memory allocated by ShmemAlloc is not tracked
by pg_shmem_allocations. This commit replaces most of the
calls to ShmemAlloc by ShmemInitStruct to associate a name
with the allocations and ensure that they get tracked by
pg_shmem_allocations. It also merges several smaller
ShmemAlloc calls into larger ShmemInitStruct calls, to
allocate and track all the related memory allocations
under a single call.
---
 src/backend/storage/lmgr/predicate.c | 27 +++++++++-----
 src/backend/storage/lmgr/proc.c      | 56 +++++++++++++++++++++++-----
 2 files changed, 63 insertions(+), 20 deletions(-)

diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 5b21a053981..de2629fdf0c 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -1226,14 +1226,20 @@ PredicateLockShmemInit(void)
 	 */
 	max_table_size *= 10;
 
+	requestSize = add_size(PredXactListDataSize,
+						   (mul_size((Size) max_table_size,
+									 sizeof(SERIALIZABLEXACT))));
 	PredXact = ShmemInitStruct("PredXactList",
-							   PredXactListDataSize,
+							   requestSize,
 							   &found);
 	Assert(found == IsUnderPostmaster);
 	if (!found)
 	{
 		int			i;
 
+		/* reset everything, both the header and the element */
+		memset(PredXact, 0, requestSize);
+
 		dlist_init(&PredXact->availableList);
 		dlist_init(&PredXact->activeList);
 		PredXact->SxactGlobalXmin = InvalidTransactionId;
@@ -1242,11 +1248,8 @@ PredicateLockShmemInit(void)
 		PredXact->LastSxactCommitSeqNo = FirstNormalSerCommitSeqNo - 1;
 		PredXact->CanPartialClearThrough = 0;
 		PredXact->HavePartialClearedThrough = 0;
-		requestSize = mul_size((Size) max_table_size,
-							   sizeof(SERIALIZABLEXACT));
-		PredXact->element = ShmemAlloc(requestSize);
+		PredXact->element = (SERIALIZABLEXACT *) ((char *) PredXact + PredXactListDataSize);
 		/* Add all elements to available list, clean. */
-		memset(PredXact->element, 0, requestSize);
 		for (i = 0; i < max_table_size; i++)
 		{
 			LWLockInitialize(&PredXact->element[i].perXactPredicateListLock,
@@ -1299,21 +1302,25 @@ PredicateLockShmemInit(void)
 	 * probably OK.
 	 */
 	max_table_size *= 5;
+	requestSize = RWConflictPoolHeaderDataSize +
+					mul_size((Size) max_table_size,
+							 RWConflictDataSize);
 
 	RWConflictPool = ShmemInitStruct("RWConflictPool",
-									 RWConflictPoolHeaderDataSize,
+									 requestSize,
 									 &found);
 	Assert(found == IsUnderPostmaster);
 	if (!found)
 	{
 		int			i;
 
+		/* clean everything, including the elements */
+		memset(RWConflictPool, 0, requestSize);
+
 		dlist_init(&RWConflictPool->availableList);
-		requestSize = mul_size((Size) max_table_size,
-							   RWConflictDataSize);
-		RWConflictPool->element = ShmemAlloc(requestSize);
+		RWConflictPool->element = (RWConflict) ((char *) RWConflictPool +
+			RWConflictPoolHeaderDataSize);
 		/* Add all elements to available list, clean. */
-		memset(RWConflictPool->element, 0, requestSize);
 		for (i = 0; i < max_table_size; i++)
 		{
 			dlist_push_tail(&RWConflictPool->availableList,
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index e4ca861a8e6..6ee48410b84 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -123,6 +123,24 @@ ProcGlobalShmemSize(void)
 	return size;
 }
 
+/*
+ * review: add comment, explaining the PG_CACHE_LINE_SIZE thing
+ * review: I'd even maybe split the PG_CACHE_LINE_SIZE thing into
+ * a separate commit, not to mix it with the "monitoring improvement"
+ */
+static Size
+PGProcShmemSize(void)
+{
+	Size		size;
+	uint32		TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
+
+	size = TotalProcs * sizeof(PGPROC);
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->xids));
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	return size;
+}
+
 /*
  * Report number of semaphores needed by InitProcGlobal.
  */
@@ -175,6 +193,8 @@ InitProcGlobal(void)
 			   *fpEndPtr PG_USED_FOR_ASSERTS_ONLY;
 	Size		fpLockBitsSize,
 				fpRelIdSize;
+	Size		requestSize;
+	char	   *ptr;
 
 	/* Create the ProcGlobal shared structure */
 	ProcGlobal = (PROC_HDR *)
@@ -204,7 +224,15 @@ InitProcGlobal(void)
 	 * with a single freelist.)  Each PGPROC structure is dedicated to exactly
 	 * one of these purposes, and they do not move between groups.
 	 */
-	procs = (PGPROC *) ShmemAlloc(TotalProcs * sizeof(PGPROC));
+	requestSize = PGProcShmemSize();
+
+	ptr = ShmemInitStruct("PGPROC structures",
+									   requestSize,
+									   &found);
+
+	procs = (PGPROC *) ptr;
+	ptr = (char *)ptr + TotalProcs * sizeof(PGPROC);
+
 	MemSet(procs, 0, TotalProcs * sizeof(PGPROC));
 	ProcGlobal->allProcs = procs;
 	/* XXX allProcCount isn't really all of them; it excludes prepared xacts */
@@ -213,17 +241,21 @@ InitProcGlobal(void)
 	/*
 	 * Allocate arrays mirroring PGPROC fields in a dense manner. See
 	 * PROC_HDR.
-	 *
-	 * XXX: It might make sense to increase padding for these arrays, given
-	 * how hotly they are accessed.
 	 */
-	ProcGlobal->xids =
-		(TransactionId *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->xids));
+	ProcGlobal->xids = (TransactionId *) ptr;
 	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
-	ProcGlobal->subxidStates = (XidCacheStatus *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	ptr = (char *)ptr + (TotalProcs * sizeof(*ProcGlobal->xids));
+
+	ProcGlobal->subxidStates = (XidCacheStatus *) ptr;
 	MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	ProcGlobal->statusFlags = (uint8 *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	ptr = (char *)ptr + (TotalProcs * sizeof(*ProcGlobal->subxidStates));
+
+	ProcGlobal->statusFlags = (uint8 *) ptr;
 	MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	ptr = (char *)ptr + (TotalProcs * sizeof(*ProcGlobal->statusFlags));
+
+	/* make sure wer didn't overflow */
+	Assert((ptr > (char *) procs) && (ptr <= (char *) procs + requestSize));
 
 	/*
 	 * Allocate arrays for fast-path locks. Those are variable-length, so
@@ -233,7 +265,9 @@ InitProcGlobal(void)
 	fpLockBitsSize = MAXALIGN(FastPathLockGroupsPerBackend * sizeof(uint64));
 	fpRelIdSize = MAXALIGN(FastPathLockSlotsPerBackend() * sizeof(Oid));
 
-	fpPtr = ShmemAlloc(TotalProcs * (fpLockBitsSize + fpRelIdSize));
+	fpPtr = ShmemInitStruct("Fast path lock arrays",
+							TotalProcs * (fpLockBitsSize + fpRelIdSize),
+							&found);
 	MemSet(fpPtr, 0, TotalProcs * (fpLockBitsSize + fpRelIdSize));
 
 	/* For asserts checking we did not overflow. */
@@ -330,7 +364,9 @@ InitProcGlobal(void)
 	PreparedXactProcs = &procs[MaxBackends + NUM_AUXILIARY_PROCS];
 
 	/* Create ProcStructLock spinlock, too */
-	ProcStructLock = (slock_t *) ShmemAlloc(sizeof(slock_t));
+	ProcStructLock = (slock_t *) ShmemInitStruct("ProcStructLock spinlock",
+												 sizeof(slock_t),
+												 &found);
 	SpinLockInit(ProcStructLock);
 }
 
-- 
2.49.0

v6-0005-review.patchtext/x-patch; charset=UTF-8; name=v6-0005-review.patchDownload
From 03809b97dca1a25b1b61f119ec05ef4b1d8e5b49 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Thu, 27 Mar 2025 21:20:26 +0100
Subject: [PATCH v6 5/7] review

---
 src/backend/storage/lmgr/predicate.c | 13 ++++++++-----
 src/backend/storage/lmgr/proc.c      | 21 +++++++++++----------
 2 files changed, 19 insertions(+), 15 deletions(-)

diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index de2629fdf0c..d82114ffca1 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -1229,6 +1229,7 @@ PredicateLockShmemInit(void)
 	requestSize = add_size(PredXactListDataSize,
 						   (mul_size((Size) max_table_size,
 									 sizeof(SERIALIZABLEXACT))));
+
 	PredXact = ShmemInitStruct("PredXactList",
 							   requestSize,
 							   &found);
@@ -1237,7 +1238,7 @@ PredicateLockShmemInit(void)
 	{
 		int			i;
 
-		/* reset everything, both the header and the element */
+		/* clean everything, both the header and the element */
 		memset(PredXact, 0, requestSize);
 
 		dlist_init(&PredXact->availableList);
@@ -1248,7 +1249,8 @@ PredicateLockShmemInit(void)
 		PredXact->LastSxactCommitSeqNo = FirstNormalSerCommitSeqNo - 1;
 		PredXact->CanPartialClearThrough = 0;
 		PredXact->HavePartialClearedThrough = 0;
-		PredXact->element = (SERIALIZABLEXACT *) ((char *) PredXact + PredXactListDataSize);
+		PredXact->element
+			= (SERIALIZABLEXACT *) ((char *) PredXact + PredXactListDataSize);
 		/* Add all elements to available list, clean. */
 		for (i = 0; i < max_table_size; i++)
 		{
@@ -1302,9 +1304,10 @@ PredicateLockShmemInit(void)
 	 * probably OK.
 	 */
 	max_table_size *= 5;
+
 	requestSize = RWConflictPoolHeaderDataSize +
-					mul_size((Size) max_table_size,
-							 RWConflictDataSize);
+		mul_size((Size) max_table_size,
+				 RWConflictDataSize);
 
 	RWConflictPool = ShmemInitStruct("RWConflictPool",
 									 requestSize,
@@ -1319,7 +1322,7 @@ PredicateLockShmemInit(void)
 
 		dlist_init(&RWConflictPool->availableList);
 		RWConflictPool->element = (RWConflict) ((char *) RWConflictPool +
-			RWConflictPoolHeaderDataSize);
+												RWConflictPoolHeaderDataSize);
 		/* Add all elements to available list, clean. */
 		for (i = 0; i < max_table_size; i++)
 		{
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 6ee48410b84..c08e3bb3d56 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -124,20 +124,21 @@ ProcGlobalShmemSize(void)
 }
 
 /*
- * review: add comment, explaining the PG_CACHE_LINE_SIZE thing
- * review: I'd even maybe split the PG_CACHE_LINE_SIZE thing into
- * a separate commit, not to mix it with the "monitoring improvement"
+ * Report shared-memory space needed by PGPROC.
  */
 static Size
 PGProcShmemSize(void)
 {
 	Size		size;
-	uint32		TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
+	uint32		TotalProcs;
+
+	TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
 
 	size = TotalProcs * sizeof(PGPROC);
 	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->xids));
 	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->subxidStates));
 	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->statusFlags));
+
 	return size;
 }
 
@@ -227,11 +228,11 @@ InitProcGlobal(void)
 	requestSize = PGProcShmemSize();
 
 	ptr = ShmemInitStruct("PGPROC structures",
-									   requestSize,
-									   &found);
+						  requestSize,
+						  &found);
 
 	procs = (PGPROC *) ptr;
-	ptr = (char *)ptr + TotalProcs * sizeof(PGPROC);
+	ptr = (char *) ptr + TotalProcs * sizeof(PGPROC);
 
 	MemSet(procs, 0, TotalProcs * sizeof(PGPROC));
 	ProcGlobal->allProcs = procs;
@@ -244,15 +245,15 @@ InitProcGlobal(void)
 	 */
 	ProcGlobal->xids = (TransactionId *) ptr;
 	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
-	ptr = (char *)ptr + (TotalProcs * sizeof(*ProcGlobal->xids));
+	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->xids));
 
 	ProcGlobal->subxidStates = (XidCacheStatus *) ptr;
 	MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	ptr = (char *)ptr + (TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->subxidStates));
 
 	ProcGlobal->statusFlags = (uint8 *) ptr;
 	MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
-	ptr = (char *)ptr + (TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->statusFlags));
 
 	/* make sure wer didn't overflow */
 	Assert((ptr > (char *) procs) && (ptr <= (char *) procs + requestSize));
-- 
2.49.0

v6-0006-Add-cacheline-padding-between-heavily-accessed-ar.patchtext/x-patch; charset=UTF-8; name=v6-0006-Add-cacheline-padding-between-heavily-accessed-ar.patchDownload
From 2a266c6115dca6ad32f7dbb136b63cf8d278504d Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Thu, 27 Mar 2025 21:23:05 +0100
Subject: [PATCH v6 6/7] Add cacheline padding between heavily accessed arrays

---
 src/backend/storage/lmgr/proc.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index c08e3bb3d56..6e0fc6a08e6 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -135,8 +135,11 @@ PGProcShmemSize(void)
 	TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
 
 	size = TotalProcs * sizeof(PGPROC);
+	size = add_size(size, PG_CACHE_LINE_SIZE);
 	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->xids));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
 	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
 	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->statusFlags));
 
 	return size;
@@ -232,7 +235,7 @@ InitProcGlobal(void)
 						  &found);
 
 	procs = (PGPROC *) ptr;
-	ptr = (char *) ptr + TotalProcs * sizeof(PGPROC);
+	ptr = (char *) ptr + TotalProcs * sizeof(PGPROC) + PG_CACHE_LINE_SIZE;
 
 	MemSet(procs, 0, TotalProcs * sizeof(PGPROC));
 	ProcGlobal->allProcs = procs;
@@ -245,11 +248,11 @@ InitProcGlobal(void)
 	 */
 	ProcGlobal->xids = (TransactionId *) ptr;
 	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
-	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->xids));
+	ptr = (char *) ptr + TotalProcs * sizeof(*ProcGlobal->xids) + PG_CACHE_LINE_SIZE;
 
 	ProcGlobal->subxidStates = (XidCacheStatus *) ptr;
 	MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->subxidStates)) + PG_CACHE_LINE_SIZE;
 
 	ProcGlobal->statusFlags = (uint8 *) ptr;
 	MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
-- 
2.49.0

v6-0007-review.patchtext/x-patch; charset=UTF-8; name=v6-0007-review.patchDownload
From 8f74e27118cc68c5348def34cfda814ca846db49 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Thu, 27 Mar 2025 22:03:23 +0100
Subject: [PATCH v6 7/7] review

---
 src/backend/storage/lmgr/proc.c | 15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 6e0fc6a08e6..5757545defb 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -134,12 +134,9 @@ PGProcShmemSize(void)
 
 	TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
 
-	size = TotalProcs * sizeof(PGPROC);
-	size = add_size(size, PG_CACHE_LINE_SIZE);
-	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->xids));
-	size = add_size(size, PG_CACHE_LINE_SIZE);
-	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	size = add_size(size, PG_CACHE_LINE_SIZE);
+	size = CACHELINEALIGN(TotalProcs * sizeof(PGPROC));
+	size = add_size(size, CACHELINEALIGN(TotalProcs * sizeof(*ProcGlobal->xids)));
+	size = add_size(size, CACHELINEALIGN(TotalProcs * sizeof(*ProcGlobal->subxidStates)));
 	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->statusFlags));
 
 	return size;
@@ -235,7 +232,7 @@ InitProcGlobal(void)
 						  &found);
 
 	procs = (PGPROC *) ptr;
-	ptr = (char *) ptr + TotalProcs * sizeof(PGPROC) + PG_CACHE_LINE_SIZE;
+	ptr = (char *) ptr + CACHELINEALIGN(TotalProcs * sizeof(PGPROC));
 
 	MemSet(procs, 0, TotalProcs * sizeof(PGPROC));
 	ProcGlobal->allProcs = procs;
@@ -248,11 +245,11 @@ InitProcGlobal(void)
 	 */
 	ProcGlobal->xids = (TransactionId *) ptr;
 	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
-	ptr = (char *) ptr + TotalProcs * sizeof(*ProcGlobal->xids) + PG_CACHE_LINE_SIZE;
+	ptr = (char *) ptr + CACHELINEALIGN(TotalProcs * sizeof(*ProcGlobal->xids));
 
 	ProcGlobal->subxidStates = (XidCacheStatus *) ptr;
 	MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->subxidStates)) + PG_CACHE_LINE_SIZE;
+	ptr = (char *) ptr + CACHELINEALIGN(TotalProcs * sizeof(*ProcGlobal->subxidStates));
 
 	ProcGlobal->statusFlags = (uint8 *) ptr;
 	MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
-- 
2.49.0

#11Rahila Syed
rahilasyed90@gmail.com
In reply to: Tomas Vondra (#10)
Re: Improve monitoring of shared memory allocations

Hi Tomas,

1) alignment

There was a comment with a question whether we need to MAXALIGN the
chunks in dynahash.c, which were originally allocated by ShmemAlloc, but
now it's part of one large allocation, which is then cut into pieces
(using pointer arithmetics).

I was not sure whether we need to enforce some alignment, we briefly
discussed that off-list. I realize you chose to add the alignment, but I
haven't noticed any comment in the patch why it's needed, and it seems
to me it may not be quite correct.

I have added MAXALIGN to specific allocations, such as HASHHDR and
HASHSEGMENT, with the expectation that allocations in multiples of this,
like dsize * HASHSEGMENT, would automatically align.

Let me explain what I had in mind, and why I think the way v5 doesn't
actually do that. It took me a while before I understood what alignment
is about, and for a while it was haunting my patches, so hopefully this
will help others ...

The "alignment" is about pointers (or addresses), and when a pointer is
aligned it means the address is a multiple of some number. For example
4B-aligned pointer is a multiple of 4B, so 0x00000100 is 4B-aligned,
while 0x00000101 is not. Sometimes we use data types to express the
alignment, e.g. int-aligned is 4B-aligned, but that's a detail. AFAIK
the alignment is always 2^k, so 1, 2, 4, 8, ...
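
As a minimal sketch of that pointer-arithmetic view (a hypothetical
helper, not part of any patch in this thread):

#include <stdbool.h>
#include <stdint.h>

/* true if "ptr" is aligned to "align" bytes; align must be a power of two */
static inline bool
is_aligned(const void *ptr, uintptr_t align)
{
    return ((uintptr_t) ptr & (align - 1)) == 0;
}

/*
 * is_aligned((void *) 0x00000100, 4)  -> true  (address is a multiple of 4)
 * is_aligned((void *) 0x00000101, 4)  -> false
 */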

The primary reason for alignment is that some architectures require the
pointers to be well-aligned for a given data type. For example (int*)
needs to be int-aligned. If you have a pointer that's not 4B-aligned,
it'll trigger SIGBUS or maybe SIGSEGV. This was true for architectures
like powerpc, I don't think x86/arm64 have this restriction, i.e. it'd
work, even if there might be a minor performance impact. Anyway, we
still enforce/expect correct alignment, because we may still support
some of those alignment-sensitive platforms, and it's "tidy".

The other reason is that we sometimes use alignment to add padding, to
reduce contention when accessing elements in hot arrays. We want to
align to cacheline boundaries, so that a struct does not require
accessing more cachelines than really necessary. And also to reduce
contention - the more cachelines, the higher the risk of contention.

Thank you for your explanation. I had a similar understanding. However,
I believed that MAXALIGN and CACHELINEALIGN are primarily performance
optimizations that do not impact the correctness of the code. This
assumption is based on the fact that I have not observed any failures on
GitHub CI, even when changing the alignment in this part of the code.

Now, back to the patch. The code originally did this in ShmemInitStruct

hashp = ShmemInitStruct(...)

to allocate the hctl, and then

firstElement = (HASHELEMENT *) ShmemAlloc(nelem * elementSize);

in element_alloc(). But this means the "elements" allocation is aligned
to PG_CACHE_LINE_SIZE, i.e. 128B, because ShmemAllocRaw() does this:

size = CACHELINEALIGN(size);

So it distributes memory in multiples of 128B, and I believe it starts
at a multiple of 128B.

But the patch reworks this to allocate everything at once, and thus it
won't get this alignment automatically. AFAIK that's not intentional,
because no one explicitly mentioned this. And it's may not be quite
desirable, judging by the comment in ShmemAllocRaw().

Yes, the patch reworks this to allocate all the shared memory at once.
It uses ShmemInitStruct, which internally calls ShmemAllocRaw, so the
whole chunk of memory allocated is still CACHELINEALIGNed.
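
For readers following along, the rounding in question boils down to these
macros (paraphrased from c.h and pg_config_manual.h; the cacheline size is
a build-time constant, 128 by default):

#define PG_CACHE_LINE_SIZE  128

#define TYPEALIGN(ALIGNVAL, LEN) \
    (((uintptr_t) (LEN) + ((ALIGNVAL) - 1)) & ~((uintptr_t) ((ALIGNVAL) - 1)))

#define CACHELINEALIGN(LEN)  TYPEALIGN(PG_CACHE_LINE_SIZE, (LEN))

so, for example, CACHELINEALIGN(96) is 128 and CACHELINEALIGN(200) is 256.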

I mentioned v5 adds alignment, but I think it does not quite do that
quite correctly. It adds alignment by changing the macros from:

+#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
+       (sizeof(HASHHDR) + \
+        ((hctl)->dsize * sizeof(HASHSEGMENT)) + \
+        ((hctl)->ssize * (nsegs) * sizeof(HASHBUCKET)))

to

+#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
+       (MAXALIGN(sizeof(HASHHDR)) + \
+        ((hctl)->dsize * MAXALIGN(sizeof(HASHSEGMENT))) + \
+        ((hctl)->ssize * (nsegs) * MAXALIGN(sizeof(HASHBUCKET))))

First, it uses MAXALIGN, but that's mostly my fault, because my comment
suggested that - ShmemAllocRaw, however, makes the case for using
CACHELINEALIGN.

Good catch. For a shared hash table, allocations need to be
CACHELINEALIGNed.
I think hash_get_init_size does not need to apply CACHELINEALIGN
explicitly, as ShmemInitStruct already does this.
In that case, the size returned by hash_get_init_size just needs to
MAXALIGN the required structs as per hash_create() requirements, and
CACHELINEALIGN will be taken care of by ShmemInitStruct at the time of
allocating the entire chunk.

But more importantly, it adds alignment to all hctl field, and to every
element of those arrays. But that's not what the alignment was supposed
to do - it was supposed to align arrays, not individual elements. Not
only would this waste memory, it would actually break direct access to
those array elements.

I think the existing code has occurrences of both, i.e. aligning
individual elements and aligning arrays.
A similar precedent exists in the function hash_estimate_size(), which
applies maxalignment not only to individual structs like HASHHDR,
HASHELEMENT and entrysize, but also to the array of HASHBUCKET headers.

I agree with you that perhaps we don't need maxalignment for all of these
structures.
For example, HASHBUCKET is a pointer to a linked list of elements; it
might not require alignment if the elements it points to are already
aligned.

But there's another detail - even before this patch, most of the stuff
was allocated at once by ShmemInitStruct(). Everything except for the
elements, so to replicate the alignment we only need to worry about that
last part. So I think this should do:

+#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
+    CACHELINEALIGN(sizeof(HASHHDR) + \
+     ((hctl)->dsize * sizeof(HASHSEGMENT)) + \
+     ((hctl)->ssize * (nsegs) * sizeof(HASHBUCKET)))

This is what the 0003 patch does. There's still one minor difference, in
that we used to align each segment independently - each element_alloc()
call allocated a new CACHELINEALIGN-ed chunk, while now we have just a
single chunk. But I think that's OK.

Before this patch, the following structures were allocated separately
using ShmemAllocRaw: the directory, each segment (seg_alloc) and a chunk
of elements (element_alloc). Hence, I don't understand why v-0003*
CACHELINEALIGNs in the manner it does.

I think if we want to emulate the current behaviour we should do something
like:
CACHELINEALIGN(sizeof(HASHHDR) + dsize * sizeof(HASHSEGMENT))
                + CACHELINEALIGN(sizeof(HASHBUCKET) * ssize) * nsegs
                + CACHELINEALIGN(init_size * elementSize);

Like you mentioned the only difference would be that we would be aligning
all elements
at once instead of aligning individual partitions of elements.

3) I find the comment before hash_get_init_size a bit unclear/confusing.

It says this:

* init_size should match the total number of elements allocated during
* hash table creation, it could be zero for non-shared hash tables
* depending on the value of nelem_alloc. For more explanation see
* comments within this function.
*
* nelem_alloc parameter is not relevant for shared hash tables.

What does "should match" mean here? Doesn't it *determine* the number of
elements allocated? What if it doesn't match?

By "should match" I mean that init_size here *is* equal to nelem in
hash_create().

AFAICS it means the hash table is sized to expect init_size elements,
but only nelem_alloc elements are actually pre-allocated, right?

No. All the init_size elements are pre-allocated for a shared hash table,
irrespective of the nelem_alloc value.
For non-shared hash tables, init_size elements are allocated only if
init_size is less than nelem_alloc; otherwise they are allocated as part
of expansion.
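
For concreteness, the check in hash_create() that decides this (visible as
context in the patches in this thread) is:

    /*
     * Shared tables always pre-allocate all of the requested elements;
     * non-shared tables do so only when the request is small.
     */
    if ((flags & HASH_SHARED_MEM) ||
        nelem < hctl->nelem_alloc)
    {
        /* pre-allocate nelem (init_size) elements and link them into
         * the freelists */
    }

Otherwise elements are allocated on demand, in batches of nelem_alloc, by
element_alloc().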

But the
comment says it's init_size which determines the number of elements
allocated during creation. Confusing.

It says "it could be zero ... depending on the value of nelem_alloc".
Depending how? What's the relationship.

The relationship is defined in this comment:
/*
* For a shared hash table, preallocate the requested number of elements.
* This reduces problems with run-time out-of-shared-memory conditions.
*
* For a non-shared hash table, preallocate the requested number of
* elements if it's less than our chosen nelem_alloc. This avoids wasting
* space if the caller correctly estimates a small table size.
*/

The hash_create code is confusing because a variable named nelem_alloc is
used in two different cases. In the above case, nelem_alloc refers to the
one returned by the choose_nelem_alloc function.

The other nelem_alloc determines the number of elements in each partition
for a partitioned hash table. That is not what is being referred to in the
above comment.

The bit "For more explanation see comments within this function" is not

great, if only because there are not many comments within the function,
so there's no "more explanation". But if there's something important, it
should be in the main comment, preferably.

I will improve the comment in the next version.

Thank you,
Rahila Syed

#12Tomas Vondra
tomas@vondra.me
In reply to: Rahila Syed (#11)
Re: Improve monitoring of shared memory allocations

On 3/28/25 12:10, Rahila Syed wrote:

Hi Tomas,

1) alignment

There was a comment with a question whether we need to MAXALIGN the
chunks in dynahash.c, which were originally allocated by ShmemAlloc, but
now it's part of one large allocation, which is then cut into pieces
(using pointer arithmetics).

I was not sure whether we need to enforce some alignment, we briefly
discussed that off-list. I realize you chose to add the alignment, but I
haven't noticed any comment in the patch why it's needed, and it seems
to me it may not be quite correct.

 

I have added MAXALIGN to specific allocations, such as HASHHDR and
HASHSEGMENT, with the expectation that allocations in multiples of this,
like dsize * HASHSEGMENT, would automatically align.

Yes, assuming the original allocation is aligned, maxalign-ing the
smaller allocations would ensure that. But there's still the issue with
aligning array elements.

Let me explain what I had in mind, and why I think the way v5 doesn't
actually do that. It took me a while before I understood what alignment
is about, and for a while it was haunting my patches, so hopefully this
will help others ...

The "alignment" is about pointers (or addresses), and when a pointer is
aligned it means the address is a multiple of some number. For example
4B-aligned pointer is a multiple of 4B, so 0x00000100 is 4B-aligned,
while 0x00000101 is not. Sometimes we use data types to express the
alignment, e.g. int-aligned is 4B-aligned, but that's a detail. AFAIK
the alignment is always 2^k, so 1, 2, 4, 8, ...

The primary reason for alignment is that some architectures require the
pointers to be well-aligned for a given data type. For example (int*)
needs to be int-aligned. If you have a pointer that's not 4B-aligned,
it'll trigger SIGBUS or maybe SIGSEGV. This was true for architectures
like powerpc, I don't think x86/arm64 have this restriction, i.e. it'd
work, even if there might be a minor performance impact. Anyway, we
still enforce/expect correct alignment, because we may still support
some of those alignment-sensitive platforms, and it's "tidy".

The other reason is that we sometimes use alignment to add padding, to
reduce contention when accessing elements in hot arrays. We want to
align to cacheline boundaries, so that a struct does not require
accessing more cachelines than really necessary. And also to reduce
contention - the more cachelines, the higher the risk of contention.

 
Thank you for your explanation. I had a similar understanding. However,
I believed that MAXALIGN and CACHEALIGN are primarily performance
optimizations
that do not impact the correctness of the code. This assumption is based
on the fact
that I have not observed any failures on GitHub CI, even when changing
the alignment
in this part of the code.

I hope it didn't come across as explaining something you already know ...
Apologies if that's the case.

As for why the CI doesn't fail on this, that's because it runs on x86,
which tolerates alignment issues. AFAIK all "modern" platforms do. You'd
need some old platform (I recall we saw SIGBUS on powerpc and s390, or
something like that). It'd probably matter even on x86 with something
like AVX, but that's irrelevant for this patch.

I don't even know what is the performance impact on x86. I think it was
measurable years ago, but it got mostly negligible for recent CPUs.

Another reason is that some of those structs may be "naturally aligned"
because they happen to be multiples of MAXALIGN. For example:

HASHHDR -> 96 bytes
HASHELEMENT -> 16 bytes (this covers HASHBUCKET / HASHSEGMENT too)
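
(If one wanted to pin that down, a hypothetical compile-time check inside
dynahash.c could look like the following; it is not part of any patch in
this thread:)

StaticAssertDecl(sizeof(HASHHDR) % MAXIMUM_ALIGNOF == 0,
                 "HASHHDR is naturally MAXALIGNed");
StaticAssertDecl(sizeof(HASHELEMENT) % MAXIMUM_ALIGNOF == 0,
                 "HASHELEMENT is naturally MAXALIGNed");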

Now, back to the patch. The code originally did this in ShmemInitStruct

    hashp = ShmemInitStruct(...)

to allocate the hctl, and then

    firstElement = (HASHELEMENT *) ShmemAlloc(nelem * elementSize);

in element_alloc(). But this means the "elements" allocation is aligned
to PG_CACHE_LINE_SIZE, i.e. 128B, because ShmemAllocRaw() does this:

    size = CACHELINEALIGN(size);

So it distributes memory in multiples of 128B, and I believe it starts
at a multiple of 128B.

But the patch reworks this to allocate everything at once, and thus it
won't get this alignment automatically. AFAIK that's not intentional,
because no one explicitly mentioned this. And it's may not be quite
desirable, judging by the comment in ShmemAllocRaw().

Yes, the patch reworks this to allocate all the shared memory at once.
It uses ShmemInitStruct which internally calls ShmemAllocRaw. So the
whole chunk
of memory allocated is still CACHEALIGNed. 

I mentioned v5 adds alignment, but I think it does not quite do that
quite correctly. It adds alignment by changing the macros from:

+#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
+       (sizeof(HASHHDR) + \
+        ((hctl)->dsize * sizeof(HASHSEGMENT)) + \
+        ((hctl)->ssize * (nsegs) * sizeof(HASHBUCKET)))

to

+#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
+       (MAXALIGN(sizeof(HASHHDR)) + \
+        ((hctl)->dsize * MAXALIGN(sizeof(HASHSEGMENT))) + \
+        ((hctl)->ssize * (nsegs) * MAXALIGN(sizeof(HASHBUCKET))))

First, it uses MAXALIGN, but that's mostly my fault, because my comment
suggested that - ShmemAllocRaw, however, makes the case for using
CACHELINEALIGN.

Good catch. For a shared hash table, allocations need to be
CACHELINEALIGNED.  
I think hash_get_init_size does not need to call CACHELINEALIGNED
explicitly as ShmemInitStruct already does this.
In that case, the size returned by hash_get_init_size just needs to
MAXALIGN required structs as per hash_create() requirements and
CACHELINEALIGN
will be taken care of in ShmemInitStruct at the time of allocating the
entire chunk.

But more importantly, it adds alignment to all hctl field, and to every
element of those arrays. But that's not what the alignment was supposed
to do - it was supposed to align arrays, not individual elements. Not
only would this waste memory, it would actually break direct access to
those array elements.

I think existing code has occurrences of both i,.e aligning individual
elements and 
arrays.
A similar precedent exists in the function hash_estimate_size(), which only
applies maxalignment to the individual structs like HASHHDR, HASHELEMENT,
entrysize, but also an array of HASHBUCKET headers. 

I agree with you that perhaps we don't need maxalignment for all of
these structures.
For ex, HASHBUCKET is a pointer to a linked list of elements, it might
not require alignment
if the elements it points to are already aligned.

But there's another detail - even before this patch, most of the stuff
was allocated at once by ShmemInitStruct(). Everything except for the
elements, so to replicate the alignment we only need to worry about that
last part. So I think this should do:

 

+#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
+    CACHELINEALIGN(sizeof(HASHHDR) + \
+     ((hctl)->dsize * sizeof(HASHSEGMENT)) + \
+     ((hctl)->ssize * (nsegs) * sizeof(HASHBUCKET)))

This is what the 0003 patch does. There's still one minor difference, in
that we used to align each segment independently - each element_alloc()
call allocated a new CACHELINEALIGN-ed chunk, while now we have just a
single chunk. But I think that's OK.

 
Before this patch, following structures were allocated separately using
ShmemAllocRaw
directory, each segment(seg_alloc) and a chunk of elements
(element_alloc). Hence, 
I don't understand why v-0003* CACHEALIGNs  in the manner it does.

I may not have gotten it quite right. The intent was to align it the
same way as before, and the allocation used to be sized like this:

Size
hash_get_shared_size(HASHCTL *info, int flags)
{
    Assert(flags & HASH_DIRSIZE);
    Assert(info->dsize == info->max_dsize);
    return sizeof(HASHHDR) + info->dsize * sizeof(HASHSEGMENT);
}

But I got confused a bit. I think you're correct here:

I think if we want to emulate the current behaviour we should do
something like:
CACHELINEALIGN(sizeof(HASHHDR) + dsize * sizeof(HASHSEGMENT))
                + CACHELINEALIGN(sizeof(HASHBUCKET) * ssize) * nsegs
                + CACHELINEALIGN(init_size * elementSize);

Like you mentioned the only difference would be that we would be
aligning all elements
at once instead of aligning individual partitions of elements.

Right. I'm still not convinced if this makes any difference, or whether
this alignment was merely a consequence of using ShmemAlloc(). I don't
want to make this harder to understand unnecessarily.

Let's keep this simple - without additional alignment. I'll think about
it a bit more, and maybe add it before commit.

3) I find the comment before hash_get_init_size a bit unclear/confusing.
It says this:

 * init_size should match the total number of elements allocated during
 * hash table creation, it could be zero for non-shared hash tables
 * depending on the value of nelem_alloc. For more explanation see
 * comments within this function.
 *
 * nelem_alloc parameter is not relevant for shared hash tables.

What does "should match" mean here? Doesn't it *determine* the number of
elements allocated? What if it doesn't match?

 
by should match I mean - init_size  here  *is* equal to nelem in
hash_create() .

AFAICS it means the hash table is sized to expect init_size elements,
but only nelem_alloc elements are actually pre-allocated, right?

 
No. All the init_size elements are pre-allocated for shared hash table
irrespective of 
nelem_alloc value.
For non-shared hash tables init_size elements are allocated only
if it is less than nelem_alloc, otherwise they are allocated as part of
expansion.
 

But the
comment says it's init_size which determines the number of elements
allocated during creation. Confusing.

It says "it could be zero ... depending on the value of nelem_alloc".
Depending how? What's the relationship.

 
The relationship is defined in this comment:
 /*
* For a shared hash table, preallocate the requested number of elements.
* This reduces problems with run-time out-of-shared-memory conditions.
*
* For a non-shared hash table, preallocate the requested number of
* elements if it's less than our chosen nelem_alloc.  This avoids wasting
* space if the caller correctly estimates a small table size.
*/

hash_create code is confusing because the nelem_alloc named variable is used
in two different cases, In  the above case  nelem_alloc  refers to the one 
returned by choose_nelem_alloc function.

The other nelem_alloc determines the number of elements in each partition
for a partitioned hash table. This is not what is being referred to in
the above 
comment.

The bit "For more explanation see comments within this function" is not
great, if only because there are not many comments within the function,
so there's no "more explanation". But if there's something important, it
should be in the main comment, preferably.

 
I will improve the comment in the next version.

OK. Do we even need to pass nelem_alloc to hash_get_init_size? It's not
really used except for this bit:

+ if (init_size > nelem_alloc)
+ element_alloc = false;

Can't we determine before calling the function, to make it a bit less
confusing?

regards

--
Tomas Vondra

#13Rahila Syed
rahilasyed90@gmail.com
In reply to: Tomas Vondra (#12)
3 attachment(s)
Re: Improve monitoring of shared memory allocations

Hi Tomas,

Right. I'm still not convinced if this makes any difference, or whether
this alignment was merely a consequence of using ShmemAlloc(). I don't
want to make this harder to understand unnecessarily.

Yeah, it makes sense.

Let's keep this simple - without additional alignment. I'll think about
it a bit more, and maybe add it before commit.

OK.

I will improve the comment in the next version.

OK. Do we even need to pass nelem_alloc to hash_get_init_size? It's not
really used except for this bit:

+ if (init_size > nelem_alloc)
+ element_alloc = false;

Can't we determine before calling the function, to make it a bit less
confusing?

Yes, we could determine whether the pre-allocated elements are zero before
calling the function; I have fixed it accordingly in the attached 0001
patch.
Now, there's no need to pass `nelem_alloc` as a parameter. Instead, I've
passed this information as a boolean variable, initial_elems. If it is
false, no elements are pre-allocated.
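
Condensed, the hash_create() side of this in the attached v7-0001 amounts
to (slightly paraphrased):

    nelem_batch = choose_nelem_alloc(info->entrysize);

    /* pre-allocate elements for shared tables, or for small non-shared ones */
    initial_elems = (flags & HASH_SHARED_MEM) || (nelem < nelem_batch);

    size = hash_get_init_size(info, flags, initial_elems, nelem);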

Please find attached the v7-series, which incorporates your review patches
and addresses a few remaining comments.

Thank you,
Rahila Syed

Attachments:

v7-0002-Replace-ShmemAlloc-calls-by-ShmemInitStruct.patchapplication/octet-stream; name=v7-0002-Replace-ShmemAlloc-calls-by-ShmemInitStruct.patchDownload
From dfb7731cd9f4687f4d90a5210fb32c3a78fc4a5f Mon Sep 17 00:00:00 2001
From: Rahila Syed <rahilasyed.90@gmail.com>
Date: Mon, 31 Mar 2025 03:45:50 +0530
Subject: [PATCH 2/3] Replace ShmemAlloc calls by ShmemInitStruct

The shared memory allocated by ShmemAlloc is not tracked
by pg_shmem_allocations. This commit replaces most of the
calls to ShmemAlloc by ShmemInitStruct to associate a name
with the allocations and ensure that they get tracked by
pg_shmem_allocations. It also merges several smaller
ShmemAlloc calls into larger ShmemInitStruct to allocate
and track all the related memory allocations under single
call.
---
 src/backend/storage/lmgr/predicate.c | 30 ++++++++++-----
 src/backend/storage/lmgr/proc.c      | 57 +++++++++++++++++++++++-----
 2 files changed, 67 insertions(+), 20 deletions(-)

diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 5b21a05398..d82114ffca 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -1226,14 +1226,21 @@ PredicateLockShmemInit(void)
 	 */
 	max_table_size *= 10;
 
+	requestSize = add_size(PredXactListDataSize,
+						   (mul_size((Size) max_table_size,
+									 sizeof(SERIALIZABLEXACT))));
+
 	PredXact = ShmemInitStruct("PredXactList",
-							   PredXactListDataSize,
+							   requestSize,
 							   &found);
 	Assert(found == IsUnderPostmaster);
 	if (!found)
 	{
 		int			i;
 
+		/* clean everything, both the header and the element */
+		memset(PredXact, 0, requestSize);
+
 		dlist_init(&PredXact->availableList);
 		dlist_init(&PredXact->activeList);
 		PredXact->SxactGlobalXmin = InvalidTransactionId;
@@ -1242,11 +1249,9 @@ PredicateLockShmemInit(void)
 		PredXact->LastSxactCommitSeqNo = FirstNormalSerCommitSeqNo - 1;
 		PredXact->CanPartialClearThrough = 0;
 		PredXact->HavePartialClearedThrough = 0;
-		requestSize = mul_size((Size) max_table_size,
-							   sizeof(SERIALIZABLEXACT));
-		PredXact->element = ShmemAlloc(requestSize);
+		PredXact->element
+			= (SERIALIZABLEXACT *) ((char *) PredXact + PredXactListDataSize);
 		/* Add all elements to available list, clean. */
-		memset(PredXact->element, 0, requestSize);
 		for (i = 0; i < max_table_size; i++)
 		{
 			LWLockInitialize(&PredXact->element[i].perXactPredicateListLock,
@@ -1300,20 +1305,25 @@ PredicateLockShmemInit(void)
 	 */
 	max_table_size *= 5;
 
+	requestSize = RWConflictPoolHeaderDataSize +
+		mul_size((Size) max_table_size,
+				 RWConflictDataSize);
+
 	RWConflictPool = ShmemInitStruct("RWConflictPool",
-									 RWConflictPoolHeaderDataSize,
+									 requestSize,
 									 &found);
 	Assert(found == IsUnderPostmaster);
 	if (!found)
 	{
 		int			i;
 
+		/* clean everything, including the elements */
+		memset(RWConflictPool, 0, requestSize);
+
 		dlist_init(&RWConflictPool->availableList);
-		requestSize = mul_size((Size) max_table_size,
-							   RWConflictDataSize);
-		RWConflictPool->element = ShmemAlloc(requestSize);
+		RWConflictPool->element = (RWConflict) ((char *) RWConflictPool +
+												RWConflictPoolHeaderDataSize);
 		/* Add all elements to available list, clean. */
-		memset(RWConflictPool->element, 0, requestSize);
 		for (i = 0; i < max_table_size; i++)
 		{
 			dlist_push_tail(&RWConflictPool->availableList,
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 066319afe2..4e23c793f7 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -123,6 +123,25 @@ ProcGlobalShmemSize(void)
 	return size;
 }
 
+/*
+ * Report shared-memory space needed by PGPROC.
+ */
+static Size
+PGProcShmemSize(void)
+{
+	Size		size;
+	uint32		TotalProcs;
+
+	TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
+
+	size = TotalProcs * sizeof(PGPROC);
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->xids));
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->statusFlags));
+
+	return size;
+}
+
 /*
  * Report number of semaphores needed by InitProcGlobal.
  */
@@ -175,6 +194,8 @@ InitProcGlobal(void)
 			   *fpEndPtr PG_USED_FOR_ASSERTS_ONLY;
 	Size		fpLockBitsSize,
 				fpRelIdSize;
+	Size		requestSize;
+	char	   *ptr;
 
 	/* Create the ProcGlobal shared structure */
 	ProcGlobal = (PROC_HDR *)
@@ -204,7 +225,15 @@ InitProcGlobal(void)
 	 * with a single freelist.)  Each PGPROC structure is dedicated to exactly
 	 * one of these purposes, and they do not move between groups.
 	 */
-	procs = (PGPROC *) ShmemAlloc(TotalProcs * sizeof(PGPROC));
+	requestSize = PGProcShmemSize();
+
+	ptr = ShmemInitStruct("PGPROC structures",
+						  requestSize,
+						  &found);
+
+	procs = (PGPROC *) ptr;
+	ptr = (char *) ptr + TotalProcs * sizeof(PGPROC);
+
 	MemSet(procs, 0, TotalProcs * sizeof(PGPROC));
 	ProcGlobal->allProcs = procs;
 	/* XXX allProcCount isn't really all of them; it excludes prepared xacts */
@@ -213,17 +242,21 @@ InitProcGlobal(void)
 	/*
 	 * Allocate arrays mirroring PGPROC fields in a dense manner. See
 	 * PROC_HDR.
-	 *
-	 * XXX: It might make sense to increase padding for these arrays, given
-	 * how hotly they are accessed.
 	 */
-	ProcGlobal->xids =
-		(TransactionId *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->xids));
+	ProcGlobal->xids = (TransactionId *) ptr;
 	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
-	ProcGlobal->subxidStates = (XidCacheStatus *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->xids));
+
+	ProcGlobal->subxidStates = (XidCacheStatus *) ptr;
 	MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	ProcGlobal->statusFlags = (uint8 *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->subxidStates));
+
+	ProcGlobal->statusFlags = (uint8 *) ptr;
 	MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->statusFlags));
+
+	/* make sure wer didn't overflow */
+	Assert((ptr > (char *) procs) && (ptr <= (char *) procs + requestSize));
 
 	/*
 	 * Allocate arrays for fast-path locks. Those are variable-length, so
@@ -233,7 +266,9 @@ InitProcGlobal(void)
 	fpLockBitsSize = MAXALIGN(FastPathLockGroupsPerBackend * sizeof(uint64));
 	fpRelIdSize = MAXALIGN(FastPathLockSlotsPerBackend() * sizeof(Oid));
 
-	fpPtr = ShmemAlloc(TotalProcs * (fpLockBitsSize + fpRelIdSize));
+	fpPtr = ShmemInitStruct("Fast path lock arrays",
+							TotalProcs * (fpLockBitsSize + fpRelIdSize),
+							&found);
 	MemSet(fpPtr, 0, TotalProcs * (fpLockBitsSize + fpRelIdSize));
 
 	/* For asserts checking we did not overflow. */
@@ -330,7 +365,9 @@ InitProcGlobal(void)
 	PreparedXactProcs = &procs[MaxBackends + NUM_AUXILIARY_PROCS];
 
 	/* Create ProcStructLock spinlock, too */
-	ProcStructLock = (slock_t *) ShmemAlloc(sizeof(slock_t));
+	ProcStructLock = (slock_t *) ShmemInitStruct("ProcStructLock spinlock",
+												 sizeof(slock_t),
+												 &found);
 	SpinLockInit(ProcStructLock);
 }
 
-- 
2.34.1

v7-0003-Add-cacheline-padding-between-heavily-accessed-array.patchapplication/octet-stream; name=v7-0003-Add-cacheline-padding-between-heavily-accessed-array.patchDownload
From 61712f86abd9edae190d27dc08be7b9753751a1d Mon Sep 17 00:00:00 2001
From: Rahila Syed <rahilasyed.90@gmail.com>
Date: Mon, 31 Mar 2025 04:27:43 +0530
Subject: [PATCH 3/3] Add cacheline padding between heavily accessed arrays

---
 src/backend/storage/lmgr/proc.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 4e23c793f7..8363e2f61d 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -134,9 +134,9 @@ PGProcShmemSize(void)
 
 	TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
 
-	size = TotalProcs * sizeof(PGPROC);
-	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->xids));
-	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	size = CACHELINEALIGN(TotalProcs * sizeof(PGPROC));
+	size = add_size(size, CACHELINEALIGN(TotalProcs * sizeof(*ProcGlobal->xids)));
+	size = add_size(size, CACHELINEALIGN(TotalProcs * sizeof(*ProcGlobal->subxidStates)));
 	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->statusFlags));
 
 	return size;
@@ -232,7 +232,7 @@ InitProcGlobal(void)
 						  &found);
 
 	procs = (PGPROC *) ptr;
-	ptr = (char *) ptr + TotalProcs * sizeof(PGPROC);
+	ptr = (char *) ptr + CACHELINEALIGN(TotalProcs * sizeof(PGPROC));
 
 	MemSet(procs, 0, TotalProcs * sizeof(PGPROC));
 	ProcGlobal->allProcs = procs;
@@ -245,11 +245,11 @@ InitProcGlobal(void)
 	 */
 	ProcGlobal->xids = (TransactionId *) ptr;
 	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
-	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->xids));
+	ptr = (char *) ptr + CACHELINEALIGN(TotalProcs * sizeof(*ProcGlobal->xids));
 
 	ProcGlobal->subxidStates = (XidCacheStatus *) ptr;
 	MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	ptr = (char *) ptr + CACHELINEALIGN(TotalProcs * sizeof(*ProcGlobal->subxidStates));
 
 	ProcGlobal->statusFlags = (uint8 *) ptr;
 	MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
-- 
2.34.1

v7-0001-Account-for-all-the-shared-memory-allocated-by-hash_.patchapplication/octet-stream; name=v7-0001-Account-for-all-the-shared-memory-allocated-by-hash_.patchDownload
From 4cce66bdfbb53b7ef3d9a7058f5979ac3b4dc89f Mon Sep 17 00:00:00 2001
From: Rahila Syed <rahilasyed.90@gmail.com>
Date: Mon, 31 Mar 2025 03:23:20 +0530
Subject: [PATCH 1/3] Account for all the shared memory allocated by
 hash_create

pg_shmem_allocations tracks the memory allocated by ShmemInitStruct,
which, in case of shared hash tables, only covers memory allocated
to the hash directory and header structure. The hash segments and
buckets are allocated using ShmemAllocNoError which does not attribute
the allocations to the hash table name. Thus, these allocations are
not tracked in pg_shmem_allocations.

Allocate memory for segments, buckets and elements together with the
directory and header structures. This results in the existing ShmemIndex
entries to reflect size of hash table more accurately, thus improving
the pg_shmem_allocation monitoring. Also, make this change for non-
shared hash table since they both share the hash_create code.
---
 src/backend/storage/ipc/shmem.c   |   4 +-
 src/backend/utils/hash/dynahash.c | 283 +++++++++++++++++++++++-------
 src/include/utils/hsearch.h       |   3 +-
 3 files changed, 222 insertions(+), 68 deletions(-)

diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39..3c030d5743 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -73,6 +73,7 @@
 #include "storage/shmem.h"
 #include "storage/spin.h"
 #include "utils/builtins.h"
+#include "utils/dynahash.h"
 
 static void *ShmemAllocRaw(Size size, Size *allocated_size);
 
@@ -346,7 +347,8 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
 
 	/* look it up in the shmem index */
 	location = ShmemInitStruct(name,
-							   hash_get_shared_size(infoP, hash_flags),
+							   hash_get_init_size(infoP, hash_flags,
+												  true, init_size),
 							   &found);
 
 	/*
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 3f25929f2d..3dede9caa5 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -260,12 +260,36 @@ static long hash_accesses,
 			hash_expansions;
 #endif
 
+#define	HASH_DIRECTORY_PTR(hashp) \
+	(((char *) (hashp)->hctl) + sizeof(HASHHDR))
+
+#define HASH_SEGMENT_OFFSET(hctl, idx) \
+	(sizeof(HASHHDR) + \
+	 ((hctl)->dsize * sizeof(HASHSEGMENT)) + \
+	 ((hctl)->ssize * (idx) * sizeof(HASHBUCKET)))
+
+#define HASH_SEGMENT_PTR(hashp, idx) \
+	((char *) (hashp)->hctl + HASH_SEGMENT_OFFSET((hashp)->hctl, (idx)))
+
+#define HASH_SEGMENT_SIZE(hashp) \
+	((hashp)->ssize * sizeof(HASHBUCKET))
+
+#define HASH_ELEMENTS_PTR(hashp, nsegs) \
+	((char *) (hashp)->hctl + HASH_SEGMENT_OFFSET((hashp)->hctl, nsegs))
+
+/* Each element has a HASHELEMENT header plus user data. */
+#define HASH_ELEMENT_SIZE(hctl) \
+	(MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN((hctl)->entrysize))
+
+#define HASH_ELEMENT_NEXT(hctl, num, ptr) \
+	((char *) (ptr) + ((num) * HASH_ELEMENT_SIZE(hctl)))
+
 /*
  * Private function prototypes
  */
 static void *DynaHashAlloc(Size size);
 static HASHSEGMENT seg_alloc(HTAB *hashp);
-static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx);
+static HASHELEMENT *element_alloc(HTAB *hashp, int nelem);
 static bool dir_realloc(HTAB *hashp);
 static bool expand_table(HTAB *hashp);
 static HASHBUCKET get_hash_entry(HTAB *hashp, int freelist_idx);
@@ -281,6 +305,11 @@ static void register_seq_scan(HTAB *hashp);
 static void deregister_seq_scan(HTAB *hashp);
 static bool has_seq_scans(HTAB *hashp);
 
+static void compute_buckets_and_segs(long nelem, long num_partitions,
+									 long ssize,
+									 int *nbuckets, int *nsegments);
+static void element_add(HTAB *hashp, HASHELEMENT *firstElement,
+						int nelem, int freelist_idx);
 
 /*
  * memory allocation support
@@ -353,6 +382,8 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 {
 	HTAB	   *hashp;
 	HASHHDR    *hctl;
+	int			nelem_batch;
+	bool 	initial_elems = false;
 
 	/*
 	 * Hash tables now allocate space for key and data, but you have to say
@@ -507,9 +538,35 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		hashp->isshared = false;
 	}
 
+	/* Choose the number of entries to allocate at a time. */
+	nelem_batch = choose_nelem_alloc(info->entrysize);
+
+	/*
+	 * Calculate the number of elements to allocate
+	 *
+	 * For a shared hash table, preallocate the requested number of elements.
+	 * This reduces problems with run-time out-of-shared-memory conditions.
+	 *
+	 * For a non-shared hash table, preallocate the requested number of
+	 * elements if it's less than our chosen nelem_alloc.  This avoids wasting
+	 * space if the caller correctly estimates a small table size.
+	 */
+	if ((flags & HASH_SHARED_MEM) ||
+		nelem < nelem_batch)
+		initial_elems = true;
+	else
+		initial_elems = false;
+
+	/*
+	 * Allocate the memory needed for hash header, directory, segments and
+	 * elements together. Use pointer arithmetic to arrive at the start of
+	 * each of these structures later.
+	 */
 	if (!hashp->hctl)
 	{
-		hashp->hctl = (HASHHDR *) hashp->alloc(sizeof(HASHHDR));
+		Size		size = hash_get_init_size(info, flags, initial_elems, nelem);
+
+		hashp->hctl = (HASHHDR *) hashp->alloc(size);
 		if (!hashp->hctl)
 			ereport(ERROR,
 					(errcode(ERRCODE_OUT_OF_MEMORY),
@@ -558,6 +615,9 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 	hctl->keysize = info->keysize;
 	hctl->entrysize = info->entrysize;
 
+	/* remember how many elements to allocate at once */
+	hctl->nelem_alloc = nelem_batch;
+
 	/* make local copies of heavily-used constant fields */
 	hashp->keysize = hctl->keysize;
 	hashp->ssize = hctl->ssize;
@@ -567,14 +627,6 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 	if (!init_htab(hashp, nelem))
 		elog(ERROR, "failed to initialize hash table \"%s\"", hashp->tabname);
 
-	/*
-	 * For a shared hash table, preallocate the requested number of elements.
-	 * This reduces problems with run-time out-of-shared-memory conditions.
-	 *
-	 * For a non-shared hash table, preallocate the requested number of
-	 * elements if it's less than our chosen nelem_alloc.  This avoids wasting
-	 * space if the caller correctly estimates a small table size.
-	 */
 	if ((flags & HASH_SHARED_MEM) ||
 		nelem < hctl->nelem_alloc)
 	{
@@ -582,6 +634,9 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 					freelist_partitions,
 					nelem_alloc,
 					nelem_alloc_first;
+		void	   *ptr = NULL;
+		int			nsegs;
+		int			nbuckets;
 
 		/*
 		 * If hash table is partitioned, give each freelist an equal share of
@@ -606,14 +661,31 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		else
 			nelem_alloc_first = nelem_alloc;
 
+		/*
+		 * Calculate the offset at which to find the first partition of
+		 * elements. We have to skip space for the header, segments and
+		 * buckets.
+		 */
+		compute_buckets_and_segs(nelem, hctl->num_partitions, hctl->ssize,
+								 &nbuckets, &nsegs);
+
+		ptr = HASH_ELEMENTS_PTR(hashp, nsegs);
+
+		/*
+		 * Assign the correct location of each parition within a pre-allocated
+		 * buffer.
+		 *
+		 * Actual memory allocation happens in ShmemInitHash for shared hash
+		 * tables or earlier in this function for non-shared hash tables.
+		 *
+		 * We just need to split that allocation into per-batch freelists.
+		 */
 		for (i = 0; i < freelist_partitions; i++)
 		{
 			int			temp = (i == 0) ? nelem_alloc_first : nelem_alloc;
 
-			if (!element_alloc(hashp, temp, i))
-				ereport(ERROR,
-						(errcode(ERRCODE_OUT_OF_MEMORY),
-						 errmsg("out of memory")));
+			element_add(hashp, (HASHELEMENT *) ptr, temp, i);
+			ptr = HASH_ELEMENT_NEXT(hctl, temp, ptr);
 		}
 	}
 
@@ -701,30 +773,12 @@ init_htab(HTAB *hashp, long nelem)
 		for (i = 0; i < NUM_FREELISTS; i++)
 			SpinLockInit(&(hctl->freeList[i].mutex));
 
-	/*
-	 * Allocate space for the next greater power of two number of buckets,
-	 * assuming a desired maximum load factor of 1.
-	 */
-	nbuckets = next_pow2_int(nelem);
-
-	/*
-	 * In a partitioned table, nbuckets must be at least equal to
-	 * num_partitions; were it less, keys with apparently different partition
-	 * numbers would map to the same bucket, breaking partition independence.
-	 * (Normally nbuckets will be much bigger; this is just a safety check.)
-	 */
-	while (nbuckets < hctl->num_partitions)
-		nbuckets <<= 1;
+	compute_buckets_and_segs(nelem, hctl->num_partitions, hctl->ssize,
+							 &nbuckets, &nsegs);
 
 	hctl->max_bucket = hctl->low_mask = nbuckets - 1;
 	hctl->high_mask = (nbuckets << 1) - 1;
 
-	/*
-	 * Figure number of directory segments needed, round up to a power of 2
-	 */
-	nsegs = (nbuckets - 1) / hctl->ssize + 1;
-	nsegs = next_pow2_int(nsegs);
-
 	/*
 	 * Make sure directory is big enough. If pre-allocated directory is too
 	 * small, choke (caller screwed up).
@@ -737,26 +791,25 @@ init_htab(HTAB *hashp, long nelem)
 			return false;
 	}
 
-	/* Allocate a directory */
+	/*
+	 * Assign a directory by making it point to the correct location in the
+	 * pre-allocated buffer.
+	 */
 	if (!(hashp->dir))
 	{
 		CurrentDynaHashCxt = hashp->hcxt;
-		hashp->dir = (HASHSEGMENT *)
-			hashp->alloc(hctl->dsize * sizeof(HASHSEGMENT));
-		if (!hashp->dir)
-			return false;
+		hashp->dir = (HASHSEGMENT *) HASH_DIRECTORY_PTR(hashp);
 	}
 
-	/* Allocate initial segments */
+	/* Assign initial segments, which are also pre-allocated */
+	i = 0;
 	for (segp = hashp->dir; hctl->nsegs < nsegs; hctl->nsegs++, segp++)
 	{
-		*segp = seg_alloc(hashp);
-		if (*segp == NULL)
-			return false;
+		*segp = (HASHSEGMENT) HASH_SEGMENT_PTR(hashp, i++);
+		MemSet(*segp, 0, HASH_SEGMENT_SIZE(hashp));
 	}
 
-	/* Choose number of entries to allocate at a time */
-	hctl->nelem_alloc = choose_nelem_alloc(hctl->entrysize);
+	Assert(i == nsegs);
 
 #ifdef HASH_DEBUG
 	fprintf(stderr, "init_htab:\n%s%p\n%s%ld\n%s%ld\n%s%d\n%s%ld\n%s%u\n%s%x\n%s%x\n%s%ld\n",
@@ -846,16 +899,70 @@ hash_select_dirsize(long num_entries)
 }
 
 /*
- * Compute the required initial memory allocation for a shared-memory
- * hashtable with the given parameters.  We need space for the HASHHDR
- * and for the (non expansible) directory.
+ * Compute the required initial memory allocation for a hashtable with the given
+ * parameters. The hash table may be shared or private. We need space for the
+ * HASHHDR, for the directory, segments and the init_size elements in buckets.
+ *
+ * For shared hash tables the directory size is non-expansive.
+ *
+ * nelem is total number of elements requested during hash table creation.
+ *
+ * initial_nelems is true if it is decided to pre-allocate elements and false
+ * otherwise.
+ *
+ * For a shared hash table initial_elems is always true.
  */
 Size
-hash_get_shared_size(HASHCTL *info, int flags)
+hash_get_init_size(const HASHCTL *info, int flags, bool initial_elems, long nelem)
 {
-	Assert(flags & HASH_DIRSIZE);
-	Assert(info->dsize == info->max_dsize);
-	return sizeof(HASHHDR) + info->dsize * sizeof(HASHSEGMENT);
+	int			nbuckets;
+	int			nsegs;
+	int			num_partitions;
+	long		ssize;
+	long		dsize;
+	Size		elementSize = HASH_ELEMENT_SIZE(info);
+	int 	init_size = 0;
+
+	if (flags & HASH_SHARED_MEM)
+	{
+		Assert(flags & HASH_DIRSIZE);
+		Assert(info->dsize == info->max_dsize);
+	}
+
+	/* Non-shared hash tables may not specify dir size */
+	if (!(flags & HASH_DIRSIZE))
+	{
+		dsize = DEF_DIRSIZE;
+	}
+	else
+		dsize = info->dsize;
+
+	if (flags & HASH_PARTITION)
+	{
+		num_partitions = info->num_partitions;
+
+		/* Number of entries should be atleast equal to the freelists */
+		if (nelem < NUM_FREELISTS)
+			nelem = NUM_FREELISTS;
+	}
+	else
+		num_partitions = 0;
+
+	if (flags & HASH_SEGMENT)
+		ssize = info->ssize;
+	else
+		ssize = DEF_SEGSIZE;
+
+	compute_buckets_and_segs(nelem, num_partitions, ssize,
+							 &nbuckets, &nsegs);
+
+	/* initial_elems as false indicates no elements are to be pre-allocated */
+	if (initial_elems)
+		init_size = nelem;
+
+	return sizeof(HASHHDR) + dsize * sizeof(HASHSEGMENT)
+		+ sizeof(HASHBUCKET) * ssize * nsegs
+		+ init_size * elementSize;
 }
 
 
@@ -1285,7 +1392,7 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 		 * Failing because the needed element is in a different freelist is
 		 * not acceptable.
 		 */
-		if (!element_alloc(hashp, hctl->nelem_alloc, freelist_idx))
+		if ((newElement = element_alloc(hashp, hctl->nelem_alloc)) == NULL)
 		{
 			int			borrow_from_idx;
 
@@ -1322,6 +1429,7 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 			/* no elements available to borrow either, so out of memory */
 			return NULL;
 		}
+		element_add(hashp, newElement, hctl->nelem_alloc, freelist_idx);
 	}
 
 	/* remove entry from freelist, bump nentries */
@@ -1700,29 +1808,43 @@ seg_alloc(HTAB *hashp)
 }
 
 /*
- * allocate some new elements and link them into the indicated free list
+ * allocate some new elements
  */
-static bool
-element_alloc(HTAB *hashp, int nelem, int freelist_idx)
+static HASHELEMENT *
+element_alloc(HTAB *hashp, int nelem)
 {
 	HASHHDR    *hctl = hashp->hctl;
 	Size		elementSize;
-	HASHELEMENT *firstElement;
-	HASHELEMENT *tmpElement;
-	HASHELEMENT *prevElement;
-	int			i;
+	HASHELEMENT *firstElement = NULL;
 
 	if (hashp->isfixed)
-		return false;
+		return NULL;
 
 	/* Each element has a HASHELEMENT header plus user data. */
-	elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
-
+	elementSize = HASH_ELEMENT_SIZE(hctl);
 	CurrentDynaHashCxt = hashp->hcxt;
 	firstElement = (HASHELEMENT *) hashp->alloc(nelem * elementSize);
 
 	if (!firstElement)
-		return false;
+		return NULL;
+
+	return firstElement;
+}
+
+/*
+ * link the elements allocated by element_alloc into the indicated free list
+ */
+static void
+element_add(HTAB *hashp, HASHELEMENT *firstElement, int nelem, int freelist_idx)
+{
+	HASHHDR    *hctl = hashp->hctl;
+	Size		elementSize;
+	HASHELEMENT *tmpElement;
+	HASHELEMENT *prevElement;
+	int			i;
+
+	/* Each element has a HASHELEMENT header plus user data. */
+	elementSize = HASH_ELEMENT_SIZE(hctl);
 
 	/* prepare to link all the new entries into the freelist */
 	prevElement = NULL;
@@ -1744,8 +1866,6 @@ element_alloc(HTAB *hashp, int nelem, int freelist_idx)
 
 	if (IS_PARTITIONED(hctl))
 		SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
-
-	return true;
 }
 
 /*
@@ -1957,3 +2077,34 @@ AtEOSubXact_HashTables(bool isCommit, int nestDepth)
 		}
 	}
 }
+
+/*
+ * Calculate the number of buckets and segments to store the given
+ * number of elements in a hash table. Segments contain buckets which
+ * in turn contain elements.
+ */
+static void
+compute_buckets_and_segs(long nelem, long num_partitions, long ssize,
+						 int *nbuckets, int *nsegments)
+{
+	/*
+	 * Allocate space for the next greater power of two number of buckets,
+	 * assuming a desired maximum load factor of 1.
+	 */
+	*nbuckets = next_pow2_int(nelem);
+
+	/*
+	 * In a partitioned table, nbuckets must be at least equal to
+	 * num_partitions; were it less, keys with apparently different partition
+	 * numbers would map to the same bucket, breaking partition independence.
+	 * (Normally nbuckets will be much bigger; this is just a safety check.)
+	 */
+	while ((*nbuckets) < num_partitions)
+		(*nbuckets) <<= 1;
+
+	/*
+	 * Figure number of directory segments needed, round up to a power of 2
+	 */
+	*nsegments = ((*nbuckets) - 1) / ssize + 1;
+	*nsegments = next_pow2_int(*nsegments);
+}
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index 932cc4f34d..4d5f09081b 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -151,7 +151,8 @@ extern void hash_seq_term(HASH_SEQ_STATUS *status);
 extern void hash_freeze(HTAB *hashp);
 extern Size hash_estimate_size(long num_entries, Size entrysize);
 extern long hash_select_dirsize(long num_entries);
-extern Size hash_get_shared_size(HASHCTL *info, int flags);
+extern Size hash_get_init_size(const HASHCTL *info, int flags,
+							   bool initial_elems, long nelem);
 extern void AtEOXact_HashTables(bool isCommit);
 extern void AtEOSubXact_HashTables(bool isCommit, int nestDepth);
 
-- 
2.34.1

#14Tomas Vondra
tomas@vondra.me
In reply to: Rahila Syed (#13)
5 attachment(s)
Re: Improve monitoring of shared memory allocations

On 3/31/25 01:01, Rahila Syed wrote:

...

I will improve the comment in the next version.

OK. Do we even need to pass nelem_alloc to hash_get_init_size? It's not
really used except for this bit:

+    if (init_size > nelem_alloc)
+        element_alloc = false;

Can't we determine that before calling the function, to make it a bit less
confusing?

Yes, we could determine whether the pre-allocated elements are zero before
calling the function; I have fixed it accordingly in the attached 0001 patch.
Now there's no need to pass `nelem_alloc` as a parameter. Instead, I've
passed this information as a boolean variable, initial_elems. If it is false,
no elements are pre-allocated.

Please find attached the v7-series, which incorporates your review patches
and addresses a few remaining comments.

I think it's almost committable. Attached is v8 with some minor review
adjustments, and updated commit messages. Please read through those and
feel free to suggest changes.

I still found the hash_get_init_size() comment unclear, and it also
referenced init_size, which is no longer relevant. I improved the
comment a bit (I find it useful to mimic comments of nearby functions,
so I did that too here). The "initial_elems" name was a bit confusing,
as it seemed to suggest "number of elements", but it's a simple flag. So
I renamed it to "prealloc", which seems clearer to me. I also tweaked
(reordered/reformatted) the conditions a bit.
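
To make the new interface concrete, here is roughly how the pieces fit
together in hash_create() after the v8 series (a simplified sketch, not the
exact patch hunks):

    /* choose the per-batch allocation size, as before */
    nelem_batch = choose_nelem_alloc(info->entrysize);

    /*
     * Preallocate everything for shared tables; for private tables only
     * when the requested size is below one allocation batch.
     */
    prealloc = (flags & HASH_SHARED_MEM) || (nelem < nelem_batch);

    /*
     * One allocation now covers header, directory, segments and the
     * preallocated elements, i.e. roughly:
     *
     *   sizeof(HASHHDR)
     *   + dsize * sizeof(HASHSEGMENT)
     *   + ssize * nsegs * sizeof(HASHBUCKET)
     *   + (prealloc ? nelem : 0) * HASH_ELEMENT_SIZE(info)
     */
    size = hash_get_init_size(info, flags, nelem, prealloc);
    hashp->hctl = (HASHHDR *) hashp->alloc(size);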

For the other patch, I realized we can simply MemSet() the whole chunk,
instead of resetting the individual parts.
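
That is, instead of MemSet-ing each array right after carving it out, the
chunk is zeroed once up front and then only split up, along these lines
(a sketch of the InitProcGlobal() part, simplified):

    ptr = ShmemInitStruct("PGPROC structures", requestSize, &found);

    /* zero the whole chunk once */
    MemSet(ptr, 0, requestSize);

    /* then just carve out the individual arrays */
    procs = (PGPROC *) ptr;
    ptr += TotalProcs * sizeof(PGPROC);

    ProcGlobal->xids = (TransactionId *) ptr;
    ptr += TotalProcs * sizeof(*ProcGlobal->xids);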

regards

--
Tomas Vondra

Attachments:

v8-0001-Improve-acounting-for-memory-used-by-shared-hash-.patchtext/x-patch; charset=UTF-8; name=v8-0001-Improve-acounting-for-memory-used-by-shared-hash-.patchDownload
From 1a82ac7407eb88bc085694519739115f718cc59a Mon Sep 17 00:00:00 2001
From: Rahila Syed <rahilasyed.90@gmail.com>
Date: Mon, 31 Mar 2025 03:23:20 +0530
Subject: [PATCH v8 1/5] Improve accounting for memory used by shared hash
 tables

pg_shmem_allocations tracks the memory allocated by ShmemInitStruct(),
but for shared hash tables that covered only the header and hash
directory.  The remaining parts (segments and buckets) were allocated
later using ShmemAlloc(), which does not update the shmem accounting.
Thus, these allocations were not shown in pg_shmem_allocations.

This commit improves the situation by allocating all the hash table
parts at once, using a single ShmemInitStruct() call. This way the
ShmemIndex entries (and thus pg_shmem_allocations) better reflect the
proper size of the hash table.

This affects allocations for private (non-shared) hash tables too, as
the hash_create() code is shared. For non-shared tables, however, this
makes no practical difference.

This changes the alignment a bit. ShmemAlloc() aligns the chunks using
CACHELINEALIGN(), which means some parts (header, directory, segments)
were aligned this way. Allocating all parts as a single chunk removes
this (implicit) alignment. We've considered adding explicit alignment,
but we've decided not to - it seems to be merely a coincidence due to
using the ShmemAlloc() API, not due to necessity.

Author: Rahila Syed <rahilasyed90@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAH2L28vHzRankszhqz7deXURxKncxfirnuW68zD7+hVAqaS5GQ@mail.gmail.com
---
 src/backend/storage/ipc/shmem.c   |   4 +-
 src/backend/utils/hash/dynahash.c | 283 +++++++++++++++++++++++-------
 src/include/utils/hsearch.h       |   3 +-
 3 files changed, 222 insertions(+), 68 deletions(-)

diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39e..3c030d5743d 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -73,6 +73,7 @@
 #include "storage/shmem.h"
 #include "storage/spin.h"
 #include "utils/builtins.h"
+#include "utils/dynahash.h"
 
 static void *ShmemAllocRaw(Size size, Size *allocated_size);
 
@@ -346,7 +347,8 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
 
 	/* look it up in the shmem index */
 	location = ShmemInitStruct(name,
-							   hash_get_shared_size(infoP, hash_flags),
+							   hash_get_init_size(infoP, hash_flags,
+												  true, init_size),
 							   &found);
 
 	/*
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 3f25929f2d8..3dede9caa5d 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -260,12 +260,36 @@ static long hash_accesses,
 			hash_expansions;
 #endif
 
+#define	HASH_DIRECTORY_PTR(hashp) \
+	(((char *) (hashp)->hctl) + sizeof(HASHHDR))
+
+#define HASH_SEGMENT_OFFSET(hctl, idx) \
+	(sizeof(HASHHDR) + \
+	 ((hctl)->dsize * sizeof(HASHSEGMENT)) + \
+	 ((hctl)->ssize * (idx) * sizeof(HASHBUCKET)))
+
+#define HASH_SEGMENT_PTR(hashp, idx) \
+	((char *) (hashp)->hctl + HASH_SEGMENT_OFFSET((hashp)->hctl, (idx)))
+
+#define HASH_SEGMENT_SIZE(hashp) \
+	((hashp)->ssize * sizeof(HASHBUCKET))
+
+#define HASH_ELEMENTS_PTR(hashp, nsegs) \
+	((char *) (hashp)->hctl + HASH_SEGMENT_OFFSET((hashp)->hctl, nsegs))
+
+/* Each element has a HASHELEMENT header plus user data. */
+#define HASH_ELEMENT_SIZE(hctl) \
+	(MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN((hctl)->entrysize))
+
+#define HASH_ELEMENT_NEXT(hctl, num, ptr) \
+	((char *) (ptr) + ((num) * HASH_ELEMENT_SIZE(hctl)))
+
 /*
  * Private function prototypes
  */
 static void *DynaHashAlloc(Size size);
 static HASHSEGMENT seg_alloc(HTAB *hashp);
-static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx);
+static HASHELEMENT *element_alloc(HTAB *hashp, int nelem);
 static bool dir_realloc(HTAB *hashp);
 static bool expand_table(HTAB *hashp);
 static HASHBUCKET get_hash_entry(HTAB *hashp, int freelist_idx);
@@ -281,6 +305,11 @@ static void register_seq_scan(HTAB *hashp);
 static void deregister_seq_scan(HTAB *hashp);
 static bool has_seq_scans(HTAB *hashp);
 
+static void compute_buckets_and_segs(long nelem, long num_partitions,
+									 long ssize,
+									 int *nbuckets, int *nsegments);
+static void element_add(HTAB *hashp, HASHELEMENT *firstElement,
+						int nelem, int freelist_idx);
 
 /*
  * memory allocation support
@@ -353,6 +382,8 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 {
 	HTAB	   *hashp;
 	HASHHDR    *hctl;
+	int			nelem_batch;
+	bool 	initial_elems = false;
 
 	/*
 	 * Hash tables now allocate space for key and data, but you have to say
@@ -507,9 +538,35 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		hashp->isshared = false;
 	}
 
+	/* Choose the number of entries to allocate at a time. */
+	nelem_batch = choose_nelem_alloc(info->entrysize);
+
+	/*
+	 * Calculate the number of elements to allocate
+	 *
+	 * For a shared hash table, preallocate the requested number of elements.
+	 * This reduces problems with run-time out-of-shared-memory conditions.
+	 *
+	 * For a non-shared hash table, preallocate the requested number of
+	 * elements if it's less than our chosen nelem_alloc.  This avoids wasting
+	 * space if the caller correctly estimates a small table size.
+	 */
+	if ((flags & HASH_SHARED_MEM) ||
+		nelem < nelem_batch)
+		initial_elems = true;
+	else
+		initial_elems = false;
+
+	/*
+	 * Allocate the memory needed for hash header, directory, segments and
+	 * elements together. Use pointer arithmetic to arrive at the start of
+	 * each of these structures later.
+	 */
 	if (!hashp->hctl)
 	{
-		hashp->hctl = (HASHHDR *) hashp->alloc(sizeof(HASHHDR));
+		Size		size = hash_get_init_size(info, flags, initial_elems, nelem);
+
+		hashp->hctl = (HASHHDR *) hashp->alloc(size);
 		if (!hashp->hctl)
 			ereport(ERROR,
 					(errcode(ERRCODE_OUT_OF_MEMORY),
@@ -558,6 +615,9 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 	hctl->keysize = info->keysize;
 	hctl->entrysize = info->entrysize;
 
+	/* remember how many elements to allocate at once */
+	hctl->nelem_alloc = nelem_batch;
+
 	/* make local copies of heavily-used constant fields */
 	hashp->keysize = hctl->keysize;
 	hashp->ssize = hctl->ssize;
@@ -567,14 +627,6 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 	if (!init_htab(hashp, nelem))
 		elog(ERROR, "failed to initialize hash table \"%s\"", hashp->tabname);
 
-	/*
-	 * For a shared hash table, preallocate the requested number of elements.
-	 * This reduces problems with run-time out-of-shared-memory conditions.
-	 *
-	 * For a non-shared hash table, preallocate the requested number of
-	 * elements if it's less than our chosen nelem_alloc.  This avoids wasting
-	 * space if the caller correctly estimates a small table size.
-	 */
 	if ((flags & HASH_SHARED_MEM) ||
 		nelem < hctl->nelem_alloc)
 	{
@@ -582,6 +634,9 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 					freelist_partitions,
 					nelem_alloc,
 					nelem_alloc_first;
+		void	   *ptr = NULL;
+		int			nsegs;
+		int			nbuckets;
 
 		/*
 		 * If hash table is partitioned, give each freelist an equal share of
@@ -606,14 +661,31 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		else
 			nelem_alloc_first = nelem_alloc;
 
+		/*
+		 * Calculate the offset at which to find the first partition of
+		 * elements. We have to skip space for the header, segments and
+		 * buckets.
+		 */
+		compute_buckets_and_segs(nelem, hctl->num_partitions, hctl->ssize,
+								 &nbuckets, &nsegs);
+
+		ptr = HASH_ELEMENTS_PTR(hashp, nsegs);
+
+		/*
+		 * Assign the correct location of each partition within a pre-allocated
+		 * buffer.
+		 *
+		 * Actual memory allocation happens in ShmemInitHash for shared hash
+		 * tables or earlier in this function for non-shared hash tables.
+		 *
+		 * We just need to split that allocation into per-batch freelists.
+		 */
 		for (i = 0; i < freelist_partitions; i++)
 		{
 			int			temp = (i == 0) ? nelem_alloc_first : nelem_alloc;
 
-			if (!element_alloc(hashp, temp, i))
-				ereport(ERROR,
-						(errcode(ERRCODE_OUT_OF_MEMORY),
-						 errmsg("out of memory")));
+			element_add(hashp, (HASHELEMENT *) ptr, temp, i);
+			ptr = HASH_ELEMENT_NEXT(hctl, temp, ptr);
 		}
 	}
 
@@ -701,30 +773,12 @@ init_htab(HTAB *hashp, long nelem)
 		for (i = 0; i < NUM_FREELISTS; i++)
 			SpinLockInit(&(hctl->freeList[i].mutex));
 
-	/*
-	 * Allocate space for the next greater power of two number of buckets,
-	 * assuming a desired maximum load factor of 1.
-	 */
-	nbuckets = next_pow2_int(nelem);
-
-	/*
-	 * In a partitioned table, nbuckets must be at least equal to
-	 * num_partitions; were it less, keys with apparently different partition
-	 * numbers would map to the same bucket, breaking partition independence.
-	 * (Normally nbuckets will be much bigger; this is just a safety check.)
-	 */
-	while (nbuckets < hctl->num_partitions)
-		nbuckets <<= 1;
+	compute_buckets_and_segs(nelem, hctl->num_partitions, hctl->ssize,
+							 &nbuckets, &nsegs);
 
 	hctl->max_bucket = hctl->low_mask = nbuckets - 1;
 	hctl->high_mask = (nbuckets << 1) - 1;
 
-	/*
-	 * Figure number of directory segments needed, round up to a power of 2
-	 */
-	nsegs = (nbuckets - 1) / hctl->ssize + 1;
-	nsegs = next_pow2_int(nsegs);
-
 	/*
 	 * Make sure directory is big enough. If pre-allocated directory is too
 	 * small, choke (caller screwed up).
@@ -737,26 +791,25 @@ init_htab(HTAB *hashp, long nelem)
 			return false;
 	}
 
-	/* Allocate a directory */
+	/*
+	 * Assign a directory by making it point to the correct location in the
+	 * pre-allocated buffer.
+	 */
 	if (!(hashp->dir))
 	{
 		CurrentDynaHashCxt = hashp->hcxt;
-		hashp->dir = (HASHSEGMENT *)
-			hashp->alloc(hctl->dsize * sizeof(HASHSEGMENT));
-		if (!hashp->dir)
-			return false;
+		hashp->dir = (HASHSEGMENT *) HASH_DIRECTORY_PTR(hashp);
 	}
 
-	/* Allocate initial segments */
+	/* Assign initial segments, which are also pre-allocated */
+	i = 0;
 	for (segp = hashp->dir; hctl->nsegs < nsegs; hctl->nsegs++, segp++)
 	{
-		*segp = seg_alloc(hashp);
-		if (*segp == NULL)
-			return false;
+		*segp = (HASHSEGMENT) HASH_SEGMENT_PTR(hashp, i++);
+		MemSet(*segp, 0, HASH_SEGMENT_SIZE(hashp));
 	}
 
-	/* Choose number of entries to allocate at a time */
-	hctl->nelem_alloc = choose_nelem_alloc(hctl->entrysize);
+	Assert(i == nsegs);
 
 #ifdef HASH_DEBUG
 	fprintf(stderr, "init_htab:\n%s%p\n%s%ld\n%s%ld\n%s%d\n%s%ld\n%s%u\n%s%x\n%s%x\n%s%ld\n",
@@ -846,16 +899,70 @@ hash_select_dirsize(long num_entries)
 }
 
 /*
- * Compute the required initial memory allocation for a shared-memory
- * hashtable with the given parameters.  We need space for the HASHHDR
- * and for the (non expansible) directory.
+ * Compute the required initial memory allocation for a hashtable with the given
+ * parameters. The hash table may be shared or private. We need space for the
+ * HASHHDR, for the directory, segments and the init_size elements in buckets.
+ *
+ * For shared hash tables the directory size is non-expansible.
+ *
+ * nelem is total number of elements requested during hash table creation.
+ *
+ * initial_nelems is true if it is decided to pre-allocate elements and false
+ * otherwise.
+ *
+ * For a shared hash table initial_elems is always true.
  */
 Size
-hash_get_shared_size(HASHCTL *info, int flags)
+hash_get_init_size(const HASHCTL *info, int flags, bool initial_elems, long nelem)
 {
-	Assert(flags & HASH_DIRSIZE);
-	Assert(info->dsize == info->max_dsize);
-	return sizeof(HASHHDR) + info->dsize * sizeof(HASHSEGMENT);
+	int			nbuckets;
+	int			nsegs;
+	int			num_partitions;
+	long		ssize;
+	long		dsize;
+	Size		elementSize = HASH_ELEMENT_SIZE(info);
+	int 	init_size = 0;
+
+	if (flags & HASH_SHARED_MEM)
+	{
+		Assert(flags & HASH_DIRSIZE);
+		Assert(info->dsize == info->max_dsize);
+	}
+
+	/* Non-shared hash tables may not specify dir size */
+	if (!(flags & HASH_DIRSIZE))
+	{
+		dsize = DEF_DIRSIZE;
+	}
+	else
+		dsize = info->dsize;
+
+	if (flags & HASH_PARTITION)
+	{
+		num_partitions = info->num_partitions;
+
+		/* Number of entries should be at least equal to the number of freelists */
+		if (nelem < NUM_FREELISTS)
+			nelem = NUM_FREELISTS;
+	}
+	else
+		num_partitions = 0;
+
+	if (flags & HASH_SEGMENT)
+		ssize = info->ssize;
+	else
+		ssize = DEF_SEGSIZE;
+
+	compute_buckets_and_segs(nelem, num_partitions, ssize,
+							 &nbuckets, &nsegs);
+
+	/* initial_elems as false indicates no elements are to be pre-allocated */
+	if (initial_elems)
+		init_size = nelem;
+
+	return sizeof(HASHHDR) + dsize * sizeof(HASHSEGMENT)
+		+ sizeof(HASHBUCKET) * ssize * nsegs
+		+ init_size * elementSize;
 }
 
 
@@ -1285,7 +1392,7 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 		 * Failing because the needed element is in a different freelist is
 		 * not acceptable.
 		 */
-		if (!element_alloc(hashp, hctl->nelem_alloc, freelist_idx))
+		if ((newElement = element_alloc(hashp, hctl->nelem_alloc)) == NULL)
 		{
 			int			borrow_from_idx;
 
@@ -1322,6 +1429,7 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 			/* no elements available to borrow either, so out of memory */
 			return NULL;
 		}
+		element_add(hashp, newElement, hctl->nelem_alloc, freelist_idx);
 	}
 
 	/* remove entry from freelist, bump nentries */
@@ -1700,29 +1808,43 @@ seg_alloc(HTAB *hashp)
 }
 
 /*
- * allocate some new elements and link them into the indicated free list
+ * allocate some new elements
  */
-static bool
-element_alloc(HTAB *hashp, int nelem, int freelist_idx)
+static HASHELEMENT *
+element_alloc(HTAB *hashp, int nelem)
 {
 	HASHHDR    *hctl = hashp->hctl;
 	Size		elementSize;
-	HASHELEMENT *firstElement;
-	HASHELEMENT *tmpElement;
-	HASHELEMENT *prevElement;
-	int			i;
+	HASHELEMENT *firstElement = NULL;
 
 	if (hashp->isfixed)
-		return false;
+		return NULL;
 
 	/* Each element has a HASHELEMENT header plus user data. */
-	elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
-
+	elementSize = HASH_ELEMENT_SIZE(hctl);
 	CurrentDynaHashCxt = hashp->hcxt;
 	firstElement = (HASHELEMENT *) hashp->alloc(nelem * elementSize);
 
 	if (!firstElement)
-		return false;
+		return NULL;
+
+	return firstElement;
+}
+
+/*
+ * link the elements allocated by element_alloc into the indicated free list
+ */
+static void
+element_add(HTAB *hashp, HASHELEMENT *firstElement, int nelem, int freelist_idx)
+{
+	HASHHDR    *hctl = hashp->hctl;
+	Size		elementSize;
+	HASHELEMENT *tmpElement;
+	HASHELEMENT *prevElement;
+	int			i;
+
+	/* Each element has a HASHELEMENT header plus user data. */
+	elementSize = HASH_ELEMENT_SIZE(hctl);
 
 	/* prepare to link all the new entries into the freelist */
 	prevElement = NULL;
@@ -1744,8 +1866,6 @@ element_alloc(HTAB *hashp, int nelem, int freelist_idx)
 
 	if (IS_PARTITIONED(hctl))
 		SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
-
-	return true;
 }
 
 /*
@@ -1957,3 +2077,34 @@ AtEOSubXact_HashTables(bool isCommit, int nestDepth)
 		}
 	}
 }
+
+/*
+ * Calculate the number of buckets and segments to store the given
+ * number of elements in a hash table. Segments contain buckets which
+ * in turn contain elements.
+ */
+static void
+compute_buckets_and_segs(long nelem, long num_partitions, long ssize,
+						 int *nbuckets, int *nsegments)
+{
+	/*
+	 * Allocate space for the next greater power of two number of buckets,
+	 * assuming a desired maximum load factor of 1.
+	 */
+	*nbuckets = next_pow2_int(nelem);
+
+	/*
+	 * In a partitioned table, nbuckets must be at least equal to
+	 * num_partitions; were it less, keys with apparently different partition
+	 * numbers would map to the same bucket, breaking partition independence.
+	 * (Normally nbuckets will be much bigger; this is just a safety check.)
+	 */
+	while ((*nbuckets) < num_partitions)
+		(*nbuckets) <<= 1;
+
+	/*
+	 * Figure number of directory segments needed, round up to a power of 2
+	 */
+	*nsegments = ((*nbuckets) - 1) / ssize + 1;
+	*nsegments = next_pow2_int(*nsegments);
+}
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index 932cc4f34d9..4d5f09081b4 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -151,7 +151,8 @@ extern void hash_seq_term(HASH_SEQ_STATUS *status);
 extern void hash_freeze(HTAB *hashp);
 extern Size hash_estimate_size(long num_entries, Size entrysize);
 extern long hash_select_dirsize(long num_entries);
-extern Size hash_get_shared_size(HASHCTL *info, int flags);
+extern Size hash_get_init_size(const HASHCTL *info, int flags,
+							   bool initial_elems, long nelem);
 extern void AtEOXact_HashTables(bool isCommit);
 extern void AtEOSubXact_HashTables(bool isCommit, int nestDepth);
 
-- 
2.49.0

v8-0002-review.patchtext/x-patch; charset=UTF-8; name=v8-0002-review.patchDownload
From d80cdbcac8cb4105d35aeaa8f0600b3b5f530942 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Mon, 31 Mar 2025 14:54:06 +0200
Subject: [PATCH v8 2/5] review

---
 src/backend/storage/ipc/shmem.c   |  2 +-
 src/backend/utils/hash/dynahash.c | 71 ++++++++++++++++---------------
 src/include/utils/hsearch.h       |  2 +-
 3 files changed, 39 insertions(+), 36 deletions(-)

diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 3c030d5743d..8e19134cb5c 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -348,7 +348,7 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
 	/* look it up in the shmem index */
 	location = ShmemInitStruct(name,
 							   hash_get_init_size(infoP, hash_flags,
-												  true, init_size),
+												  init_size, true),
 							   &found);
 
 	/*
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 3dede9caa5d..0240a871395 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -383,7 +383,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 	HTAB	   *hashp;
 	HASHHDR    *hctl;
 	int			nelem_batch;
-	bool 	initial_elems = false;
+	bool		prealloc;
 
 	/*
 	 * Hash tables now allocate space for key and data, but you have to say
@@ -547,15 +547,11 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 	 * For a shared hash table, preallocate the requested number of elements.
 	 * This reduces problems with run-time out-of-shared-memory conditions.
 	 *
-	 * For a non-shared hash table, preallocate the requested number of
-	 * elements if it's less than our chosen nelem_alloc.  This avoids wasting
-	 * space if the caller correctly estimates a small table size.
+	 * For a private hash table, preallocate the requested number of elements
+	 * if it's less than our chosen nelem_alloc.  This avoids wasting space if
+	 * the caller correctly estimates a small table size.
 	 */
-	if ((flags & HASH_SHARED_MEM) ||
-		nelem < nelem_batch)
-		initial_elems = true;
-	else
-		initial_elems = false;
+	prealloc = (flags & HASH_SHARED_MEM) || (nelem < nelem_batch);
 
 	/*
 	 * Allocate the memory needed for hash header, directory, segments and
@@ -564,7 +560,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 	 */
 	if (!hashp->hctl)
 	{
-		Size		size = hash_get_init_size(info, flags, initial_elems, nelem);
+		Size		size = hash_get_init_size(info, flags, nelem, prealloc);
 
 		hashp->hctl = (HASHHDR *) hashp->alloc(size);
 		if (!hashp->hctl)
@@ -664,7 +660,8 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		/*
 		 * Calculate the offset at which to find the first partition of
 		 * elements. We have to skip space for the header, segments and
-		 * buckets.
+		 * buckets. We need to recalculate the number of segments, which we
+		 * don't store anywhere.
 		 */
 		compute_buckets_and_segs(nelem, hctl->num_partitions, hctl->ssize,
 								 &nbuckets, &nsegs);
@@ -899,21 +896,24 @@ hash_select_dirsize(long num_entries)
 }
 
 /*
- * Compute the required initial memory allocation for a hashtable with the given
- * parameters. The hash table may be shared or private. We need space for the
- * HASHHDR, for the directory, segments and the init_size elements in buckets.
+ * hash_get_init_size -- determine memory needed for a new dynamic hash table
  *
- * For shared hash tables the directory size is non-expansible.
- *
- * nelem is total number of elements requested during hash table creation.
+ *	info: hash table parameters
+ *	flags: bitmask indicating which parameters to take from *info
+ *	nelem: maximum number of elements expected
+ *	prealloc: request preallocation of elements
  *
- * initial_nelems is true if it is decided to pre-allocate elements and false
- * otherwise.
+ * Compute the required initial memory allocation for a hashtable with the given
+ * parameters. We need space for the HASHHDR, for the directory, segments and
+ * preallocated elements.
  *
- * For a shared hash table initial_elems is always true.
+ * The hash table may be private or shared. For shared hash tables the directory
+ * size is non-expansible, and we preallocate all elements (nelem). For private
+ * hash tables, we preallocate elements only if the expected number of elements
+ * is small (less than nelem_alloc).
  */
 Size
-hash_get_init_size(const HASHCTL *info, int flags, bool initial_elems, long nelem)
+hash_get_init_size(const HASHCTL *info, int flags, long nelem, bool prealloc)
 {
 	int			nbuckets;
 	int			nsegs;
@@ -921,21 +921,29 @@ hash_get_init_size(const HASHCTL *info, int flags, bool initial_elems, long nele
 	long		ssize;
 	long		dsize;
 	Size		elementSize = HASH_ELEMENT_SIZE(info);
-	int 	init_size = 0;
+	long		nelem_prealloc = 0;
 
+#ifdef USE_ASSERT_CHECKING
+	/* shared hash tables have non-expansible directory */
+	/* XXX what about segment size? should check have HASH_SEGMENT? */
 	if (flags & HASH_SHARED_MEM)
 	{
 		Assert(flags & HASH_DIRSIZE);
 		Assert(info->dsize == info->max_dsize);
+		Assert(prealloc);
 	}
+#endif
 
 	/* Non-shared hash tables may not specify dir size */
-	if (!(flags & HASH_DIRSIZE))
-	{
+	if (flags & HASH_DIRSIZE)
+		dsize = info->dsize;
+	else
 		dsize = DEF_DIRSIZE;
-	}
+
+	if (flags & HASH_SEGMENT)
+		ssize = info->ssize;
 	else
-		dsize = info->dsize;
+		ssize = DEF_SEGSIZE;
 
 	if (flags & HASH_PARTITION)
 	{
@@ -948,21 +956,16 @@ hash_get_init_size(const HASHCTL *info, int flags, bool initial_elems, long nele
 	else
 		num_partitions = 0;
 
-	if (flags & HASH_SEGMENT)
-		ssize = info->ssize;
-	else
-		ssize = DEF_SEGSIZE;
-
 	compute_buckets_and_segs(nelem, num_partitions, ssize,
 							 &nbuckets, &nsegs);
 
 	/* initial_elems as false indicates no elements are to be pre-allocated */
-	if (initial_elems)
-		init_size = nelem;
+	if (prealloc)
+		nelem_prealloc = nelem;
 
 	return sizeof(HASHHDR) + dsize * sizeof(HASHSEGMENT)
 		+ sizeof(HASHBUCKET) * ssize * nsegs
-		+ init_size * elementSize;
+		+ nelem_prealloc * elementSize;
 }
 
 
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index 4d5f09081b4..5dc74db6e96 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -152,7 +152,7 @@ extern void hash_freeze(HTAB *hashp);
 extern Size hash_estimate_size(long num_entries, Size entrysize);
 extern long hash_select_dirsize(long num_entries);
 extern Size hash_get_init_size(const HASHCTL *info, int flags,
-							   bool initial_elems, long nelem);
+							   long nelem, bool prealloc);
 extern void AtEOXact_HashTables(bool isCommit);
 extern void AtEOSubXact_HashTables(bool isCommit, int nestDepth);
 
-- 
2.49.0

v8-0003-Improve-accounting-for-PredXactList-RWConflictPoo.patchtext/x-patch; charset=UTF-8; name=v8-0003-Improve-accounting-for-PredXactList-RWConflictPoo.patchDownload
From 609404c682e28a59f80ac7630e4be8e46d12566e Mon Sep 17 00:00:00 2001
From: Rahila Syed <rahilasyed.90@gmail.com>
Date: Mon, 31 Mar 2025 03:45:50 +0530
Subject: [PATCH v8 3/5] Improve accounting for PredXactList, RWConflictPool
 and PGPROC

Various places allocated shared memory by first allocating a small chunk
using ShmemInitStruct(), followed by ShmemAlloc() calls to allocate more
memory. Unfortunately, ShmemAlloc() does not update ShmemIndex, so this
affected pg_shmem_allocations - it only showed the initial chunk.

This commit modifies the following allocations, to allocate everything
as a single chunk, and then split it internally.

- PredXactList
- RWConflictPool
- PGPROC structures
- Fast-Path lock arrays

Author: Rahila Syed <rahilasyed90@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAH2L28vHzRankszhqz7deXURxKncxfirnuW68zD7+hVAqaS5GQ@mail.gmail.com
---
 src/backend/storage/lmgr/predicate.c | 30 ++++++++++-----
 src/backend/storage/lmgr/proc.c      | 57 +++++++++++++++++++++++-----
 2 files changed, 67 insertions(+), 20 deletions(-)

diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 5b21a053981..d82114ffca1 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -1226,14 +1226,21 @@ PredicateLockShmemInit(void)
 	 */
 	max_table_size *= 10;
 
+	requestSize = add_size(PredXactListDataSize,
+						   (mul_size((Size) max_table_size,
+									 sizeof(SERIALIZABLEXACT))));
+
 	PredXact = ShmemInitStruct("PredXactList",
-							   PredXactListDataSize,
+							   requestSize,
 							   &found);
 	Assert(found == IsUnderPostmaster);
 	if (!found)
 	{
 		int			i;
 
+		/* clean everything, both the header and the element */
+		memset(PredXact, 0, requestSize);
+
 		dlist_init(&PredXact->availableList);
 		dlist_init(&PredXact->activeList);
 		PredXact->SxactGlobalXmin = InvalidTransactionId;
@@ -1242,11 +1249,9 @@ PredicateLockShmemInit(void)
 		PredXact->LastSxactCommitSeqNo = FirstNormalSerCommitSeqNo - 1;
 		PredXact->CanPartialClearThrough = 0;
 		PredXact->HavePartialClearedThrough = 0;
-		requestSize = mul_size((Size) max_table_size,
-							   sizeof(SERIALIZABLEXACT));
-		PredXact->element = ShmemAlloc(requestSize);
+		PredXact->element
+			= (SERIALIZABLEXACT *) ((char *) PredXact + PredXactListDataSize);
 		/* Add all elements to available list, clean. */
-		memset(PredXact->element, 0, requestSize);
 		for (i = 0; i < max_table_size; i++)
 		{
 			LWLockInitialize(&PredXact->element[i].perXactPredicateListLock,
@@ -1300,20 +1305,25 @@ PredicateLockShmemInit(void)
 	 */
 	max_table_size *= 5;
 
+	requestSize = RWConflictPoolHeaderDataSize +
+		mul_size((Size) max_table_size,
+				 RWConflictDataSize);
+
 	RWConflictPool = ShmemInitStruct("RWConflictPool",
-									 RWConflictPoolHeaderDataSize,
+									 requestSize,
 									 &found);
 	Assert(found == IsUnderPostmaster);
 	if (!found)
 	{
 		int			i;
 
+		/* clean everything, including the elements */
+		memset(RWConflictPool, 0, requestSize);
+
 		dlist_init(&RWConflictPool->availableList);
-		requestSize = mul_size((Size) max_table_size,
-							   RWConflictDataSize);
-		RWConflictPool->element = ShmemAlloc(requestSize);
+		RWConflictPool->element = (RWConflict) ((char *) RWConflictPool +
+												RWConflictPoolHeaderDataSize);
 		/* Add all elements to available list, clean. */
-		memset(RWConflictPool->element, 0, requestSize);
 		for (i = 0; i < max_table_size; i++)
 		{
 			dlist_push_tail(&RWConflictPool->availableList,
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 066319afe2b..4e23c793f7f 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -123,6 +123,25 @@ ProcGlobalShmemSize(void)
 	return size;
 }
 
+/*
+ * Report shared-memory space needed by PGPROC.
+ */
+static Size
+PGProcShmemSize(void)
+{
+	Size		size;
+	uint32		TotalProcs;
+
+	TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
+
+	size = TotalProcs * sizeof(PGPROC);
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->xids));
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->statusFlags));
+
+	return size;
+}
+
 /*
  * Report number of semaphores needed by InitProcGlobal.
  */
@@ -175,6 +194,8 @@ InitProcGlobal(void)
 			   *fpEndPtr PG_USED_FOR_ASSERTS_ONLY;
 	Size		fpLockBitsSize,
 				fpRelIdSize;
+	Size		requestSize;
+	char	   *ptr;
 
 	/* Create the ProcGlobal shared structure */
 	ProcGlobal = (PROC_HDR *)
@@ -204,7 +225,15 @@ InitProcGlobal(void)
 	 * with a single freelist.)  Each PGPROC structure is dedicated to exactly
 	 * one of these purposes, and they do not move between groups.
 	 */
-	procs = (PGPROC *) ShmemAlloc(TotalProcs * sizeof(PGPROC));
+	requestSize = PGProcShmemSize();
+
+	ptr = ShmemInitStruct("PGPROC structures",
+						  requestSize,
+						  &found);
+
+	procs = (PGPROC *) ptr;
+	ptr = (char *) ptr + TotalProcs * sizeof(PGPROC);
+
 	MemSet(procs, 0, TotalProcs * sizeof(PGPROC));
 	ProcGlobal->allProcs = procs;
 	/* XXX allProcCount isn't really all of them; it excludes prepared xacts */
@@ -213,17 +242,21 @@ InitProcGlobal(void)
 	/*
 	 * Allocate arrays mirroring PGPROC fields in a dense manner. See
 	 * PROC_HDR.
-	 *
-	 * XXX: It might make sense to increase padding for these arrays, given
-	 * how hotly they are accessed.
 	 */
-	ProcGlobal->xids =
-		(TransactionId *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->xids));
+	ProcGlobal->xids = (TransactionId *) ptr;
 	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
-	ProcGlobal->subxidStates = (XidCacheStatus *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->xids));
+
+	ProcGlobal->subxidStates = (XidCacheStatus *) ptr;
 	MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	ProcGlobal->statusFlags = (uint8 *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->subxidStates));
+
+	ProcGlobal->statusFlags = (uint8 *) ptr;
 	MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->statusFlags));
+
+	/* make sure we didn't overflow */
+	Assert((ptr > (char *) procs) && (ptr <= (char *) procs + requestSize));
 
 	/*
 	 * Allocate arrays for fast-path locks. Those are variable-length, so
@@ -233,7 +266,9 @@ InitProcGlobal(void)
 	fpLockBitsSize = MAXALIGN(FastPathLockGroupsPerBackend * sizeof(uint64));
 	fpRelIdSize = MAXALIGN(FastPathLockSlotsPerBackend() * sizeof(Oid));
 
-	fpPtr = ShmemAlloc(TotalProcs * (fpLockBitsSize + fpRelIdSize));
+	fpPtr = ShmemInitStruct("Fast path lock arrays",
+							TotalProcs * (fpLockBitsSize + fpRelIdSize),
+							&found);
 	MemSet(fpPtr, 0, TotalProcs * (fpLockBitsSize + fpRelIdSize));
 
 	/* For asserts checking we did not overflow. */
@@ -330,7 +365,9 @@ InitProcGlobal(void)
 	PreparedXactProcs = &procs[MaxBackends + NUM_AUXILIARY_PROCS];
 
 	/* Create ProcStructLock spinlock, too */
-	ProcStructLock = (slock_t *) ShmemAlloc(sizeof(slock_t));
+	ProcStructLock = (slock_t *) ShmemInitStruct("ProcStructLock spinlock",
+												 sizeof(slock_t),
+												 &found);
 	SpinLockInit(ProcStructLock);
 }
 
-- 
2.49.0

v8-0004-review.patchtext/x-patch; charset=UTF-8; name=v8-0004-review.patchDownload
From a0d81976406ea47ac249f07bf850e65d126627cb Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Mon, 31 Mar 2025 14:56:56 +0200
Subject: [PATCH v8 4/5] review

---
 src/backend/storage/lmgr/proc.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 4e23c793f7f..ba1ac7e4861 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -231,10 +231,11 @@ InitProcGlobal(void)
 						  requestSize,
 						  &found);
 
+	MemSet(ptr, 0, requestSize);
+
 	procs = (PGPROC *) ptr;
 	ptr = (char *) ptr + TotalProcs * sizeof(PGPROC);
 
-	MemSet(procs, 0, TotalProcs * sizeof(PGPROC));
 	ProcGlobal->allProcs = procs;
 	/* XXX allProcCount isn't really all of them; it excludes prepared xacts */
 	ProcGlobal->allProcCount = MaxBackends + NUM_AUXILIARY_PROCS;
@@ -244,15 +245,12 @@ InitProcGlobal(void)
 	 * PROC_HDR.
 	 */
 	ProcGlobal->xids = (TransactionId *) ptr;
-	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
 	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->xids));
 
 	ProcGlobal->subxidStates = (XidCacheStatus *) ptr;
-	MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
 	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->subxidStates));
 
 	ProcGlobal->statusFlags = (uint8 *) ptr;
-	MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
 	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->statusFlags));
 
 	/* make sure we didn't overflow */
-- 
2.49.0

v8-0005-Add-cacheline-padding-between-heavily-accessed-ar.patchtext/x-patch; charset=UTF-8; name=v8-0005-Add-cacheline-padding-between-heavily-accessed-ar.patchDownload
From 71b206e3b17b167733055c8a392be687deee6410 Mon Sep 17 00:00:00 2001
From: Rahila Syed <rahilasyed.90@gmail.com>
Date: Mon, 31 Mar 2025 04:27:43 +0530
Subject: [PATCH v8 5/5] Add cacheline padding between heavily accessed arrays

---
 src/backend/storage/lmgr/proc.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index ba1ac7e4861..c39b51127a8 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -134,9 +134,9 @@ PGProcShmemSize(void)
 
 	TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
 
-	size = TotalProcs * sizeof(PGPROC);
-	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->xids));
-	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	size = CACHELINEALIGN(TotalProcs * sizeof(PGPROC));
+	size = add_size(size, CACHELINEALIGN(TotalProcs * sizeof(*ProcGlobal->xids)));
+	size = add_size(size, CACHELINEALIGN(TotalProcs * sizeof(*ProcGlobal->subxidStates)));
 	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->statusFlags));
 
 	return size;
@@ -234,7 +234,7 @@ InitProcGlobal(void)
 	MemSet(ptr, 0, requestSize);
 
 	procs = (PGPROC *) ptr;
-	ptr = (char *) ptr + TotalProcs * sizeof(PGPROC);
+	ptr = (char *) ptr + CACHELINEALIGN(TotalProcs * sizeof(PGPROC));
 
 	ProcGlobal->allProcs = procs;
 	/* XXX allProcCount isn't really all of them; it excludes prepared xacts */
@@ -245,10 +245,10 @@ InitProcGlobal(void)
 	 * PROC_HDR.
 	 */
 	ProcGlobal->xids = (TransactionId *) ptr;
-	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->xids));
+	ptr = (char *) ptr + CACHELINEALIGN(TotalProcs * sizeof(*ProcGlobal->xids));
 
 	ProcGlobal->subxidStates = (XidCacheStatus *) ptr;
-	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	ptr = (char *) ptr + CACHELINEALIGN(TotalProcs * sizeof(*ProcGlobal->subxidStates));
 
 	ProcGlobal->statusFlags = (uint8 *) ptr;
 	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->statusFlags));
-- 
2.49.0

#15Rahila Syed
rahilasyed90@gmail.com
In reply to: Tomas Vondra (#14)
3 attachment(s)
Re: Improve monitoring of shared memory allocations

Hi,

I think it's almost committable. Attached is v8 with some minor review
adjustments, and updated commit messages. Please read through those and
feel free to suggest changes.

The changes look good to me.
Regarding the following question:
/* XXX what about segment size? should check have HASH_SEGMENT? */
Do you mean that, for a shared hash table, the caller should have specified
HASH_SEGMENT in flags? It appears that the current code does not require
this change: all the shared hash tables seem to use the default segment size.
I left the comment as it is, since I am not sure whether you intend to remove
it or not.
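
For clarity, I read the question as asking whether the assertion block
should also insist on that flag, i.e. something like this (hypothetical,
not part of the attached patches):

    if (flags & HASH_SHARED_MEM)
    {
        Assert(flags & HASH_DIRSIZE);
        Assert(flags & HASH_SEGMENT);   /* hypothetical extra check */
        Assert(info->dsize == info->max_dsize);
        Assert(prealloc);
    }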

I still found the hash_get_init_size() comment unclear, and it also
referenced init_size, which is no longer relevant. I improved the
comment a bit (I find it useful to mimic comments of nearby functions,
so I did that too here). The "initial_elems" name was a bit confusing,
as it seemed to suggest "number of elements", but it's a simple flag. So
I renamed it to "prealloc", which seems clearer to me. I also tweaked
(reordered/reformatted) the conditions a bit.

I appreciate your edits; the comment and code are clearer now.

PFA the patches after merging the review patches.

Thank you,
Rahila Syed

Attachments:

v9-0001-Improve-acounting-for-memory-used-by-shared-hash-tab.patchapplication/octet-stream; name=v9-0001-Improve-acounting-for-memory-used-by-shared-hash-tab.patchDownload
From 44a2882e490cd0cd43751d24a864ea24e2b20135 Mon Sep 17 00:00:00 2001
From: Rahila Syed <rahilasyed.90@gmail.com>
Date: Tue, 1 Apr 2025 12:59:19 +0530
Subject: [PATCH 1/3] Improve accounting for memory used by shared hash tables

pg_shmem_allocations tracks the memory allocated by ShmemInitStruct(),
but for shared hash tables that covered only the header and hash
directory.  The remaining parts (segments and buckets) were allocated
later using ShmemAlloc(), which does not update the shmem accounting.
Thus, these allocations were not shown in pg_shmem_allocations.

This commit improves the situation by allocating all the hash table
parts at once, using a single ShmemInitStruct() call. This way the
ShmemIndex entries (and thus pg_shmem_allocations) better reflect the
proper size of the hash table.

This affects allocations for private (non-shared) hash tables too, as
the hash_create() code is shared. For non-shared tables, however, this
makes no practical difference.

This changes the alignment a bit. ShmemAlloc() aligns the chunks using
CACHELINEALIGN(), which means some parts (header, directory, segments)
were aligned this way. Allocating all parts as a single chunk removes
this (implicit) alignment. We've considered adding explicit alignment,
but we've decided not to - it seems to be merely a coincidence due to
using the ShmemAlloc() API, not due to necessity.

Author: Rahila Syed <rahilasyed90@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAH2L28vHzRankszhqz7deXURxKncxfirnuW68zD7+hVAqaS5GQ@mail.gmail.com
---
 src/backend/storage/ipc/shmem.c   |   4 +-
 src/backend/utils/hash/dynahash.c | 286 +++++++++++++++++++++++-------
 src/include/utils/hsearch.h       |   3 +-
 3 files changed, 225 insertions(+), 68 deletions(-)

diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39..8e19134cb5 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -73,6 +73,7 @@
 #include "storage/shmem.h"
 #include "storage/spin.h"
 #include "utils/builtins.h"
+#include "utils/dynahash.h"
 
 static void *ShmemAllocRaw(Size size, Size *allocated_size);
 
@@ -346,7 +347,8 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
 
 	/* look it up in the shmem index */
 	location = ShmemInitStruct(name,
-							   hash_get_shared_size(infoP, hash_flags),
+							   hash_get_init_size(infoP, hash_flags,
+												  init_size, true),
 							   &found);
 
 	/*
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 3f25929f2d..0240a87139 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -260,12 +260,36 @@ static long hash_accesses,
 			hash_expansions;
 #endif
 
+#define	HASH_DIRECTORY_PTR(hashp) \
+	(((char *) (hashp)->hctl) + sizeof(HASHHDR))
+
+#define HASH_SEGMENT_OFFSET(hctl, idx) \
+	(sizeof(HASHHDR) + \
+	 ((hctl)->dsize * sizeof(HASHSEGMENT)) + \
+	 ((hctl)->ssize * (idx) * sizeof(HASHBUCKET)))
+
+#define HASH_SEGMENT_PTR(hashp, idx) \
+	((char *) (hashp)->hctl + HASH_SEGMENT_OFFSET((hashp)->hctl, (idx)))
+
+#define HASH_SEGMENT_SIZE(hashp) \
+	((hashp)->ssize * sizeof(HASHBUCKET))
+
+#define HASH_ELEMENTS_PTR(hashp, nsegs) \
+	((char *) (hashp)->hctl + HASH_SEGMENT_OFFSET((hashp)->hctl, nsegs))
+
+/* Each element has a HASHELEMENT header plus user data. */
+#define HASH_ELEMENT_SIZE(hctl) \
+	(MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN((hctl)->entrysize))
+
+#define HASH_ELEMENT_NEXT(hctl, num, ptr) \
+	((char *) (ptr) + ((num) * HASH_ELEMENT_SIZE(hctl)))
+
 /*
  * Private function prototypes
  */
 static void *DynaHashAlloc(Size size);
 static HASHSEGMENT seg_alloc(HTAB *hashp);
-static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx);
+static HASHELEMENT *element_alloc(HTAB *hashp, int nelem);
 static bool dir_realloc(HTAB *hashp);
 static bool expand_table(HTAB *hashp);
 static HASHBUCKET get_hash_entry(HTAB *hashp, int freelist_idx);
@@ -281,6 +305,11 @@ static void register_seq_scan(HTAB *hashp);
 static void deregister_seq_scan(HTAB *hashp);
 static bool has_seq_scans(HTAB *hashp);
 
+static void compute_buckets_and_segs(long nelem, long num_partitions,
+									 long ssize,
+									 int *nbuckets, int *nsegments);
+static void element_add(HTAB *hashp, HASHELEMENT *firstElement,
+						int nelem, int freelist_idx);
 
 /*
  * memory allocation support
@@ -353,6 +382,8 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 {
 	HTAB	   *hashp;
 	HASHHDR    *hctl;
+	int			nelem_batch;
+	bool		prealloc;
 
 	/*
 	 * Hash tables now allocate space for key and data, but you have to say
@@ -507,9 +538,31 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		hashp->isshared = false;
 	}
 
+	/* Choose the number of entries to allocate at a time. */
+	nelem_batch = choose_nelem_alloc(info->entrysize);
+
+	/*
+	 * Calculate the number of elements to allocate
+	 *
+	 * For a shared hash table, preallocate the requested number of elements.
+	 * This reduces problems with run-time out-of-shared-memory conditions.
+	 *
+	 * For a private hash table, preallocate the requested number of elements
+	 * if it's less than our chosen nelem_alloc.  This avoids wasting space if
+	 * the caller correctly estimates a small table size.
+	 */
+	prealloc = (flags & HASH_SHARED_MEM) || (nelem < nelem_batch);
+
+	/*
+	 * Allocate the memory needed for hash header, directory, segments and
+	 * elements together. Use pointer arithmetic to arrive at the start of
+	 * each of these structures later.
+	 */
 	if (!hashp->hctl)
 	{
-		hashp->hctl = (HASHHDR *) hashp->alloc(sizeof(HASHHDR));
+		Size		size = hash_get_init_size(info, flags, nelem, prealloc);
+
+		hashp->hctl = (HASHHDR *) hashp->alloc(size);
 		if (!hashp->hctl)
 			ereport(ERROR,
 					(errcode(ERRCODE_OUT_OF_MEMORY),
@@ -558,6 +611,9 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 	hctl->keysize = info->keysize;
 	hctl->entrysize = info->entrysize;
 
+	/* remember how many elements to allocate at once */
+	hctl->nelem_alloc = nelem_batch;
+
 	/* make local copies of heavily-used constant fields */
 	hashp->keysize = hctl->keysize;
 	hashp->ssize = hctl->ssize;
@@ -567,14 +623,6 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 	if (!init_htab(hashp, nelem))
 		elog(ERROR, "failed to initialize hash table \"%s\"", hashp->tabname);
 
-	/*
-	 * For a shared hash table, preallocate the requested number of elements.
-	 * This reduces problems with run-time out-of-shared-memory conditions.
-	 *
-	 * For a non-shared hash table, preallocate the requested number of
-	 * elements if it's less than our chosen nelem_alloc.  This avoids wasting
-	 * space if the caller correctly estimates a small table size.
-	 */
 	if ((flags & HASH_SHARED_MEM) ||
 		nelem < hctl->nelem_alloc)
 	{
@@ -582,6 +630,9 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 					freelist_partitions,
 					nelem_alloc,
 					nelem_alloc_first;
+		void	   *ptr = NULL;
+		int			nsegs;
+		int			nbuckets;
 
 		/*
 		 * If hash table is partitioned, give each freelist an equal share of
@@ -606,14 +657,32 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		else
 			nelem_alloc_first = nelem_alloc;
 
+		/*
+		 * Calculate the offset at which to find the first partition of
+		 * elements. We have to skip space for the header, segments and
+		 * buckets. We need to recalculate the number of segments, which we
+		 * don't store anywhere.
+		 */
+		compute_buckets_and_segs(nelem, hctl->num_partitions, hctl->ssize,
+								 &nbuckets, &nsegs);
+
+		ptr = HASH_ELEMENTS_PTR(hashp, nsegs);
+
+		/*
+		 * Assign the correct location of each partition within a pre-allocated
+		 * buffer.
+		 *
+		 * Actual memory allocation happens in ShmemInitHash for shared hash
+		 * tables or earlier in this function for non-shared hash tables.
+		 *
+		 * We just need to split that allocation into per-batch freelists.
+		 */
 		for (i = 0; i < freelist_partitions; i++)
 		{
 			int			temp = (i == 0) ? nelem_alloc_first : nelem_alloc;
 
-			if (!element_alloc(hashp, temp, i))
-				ereport(ERROR,
-						(errcode(ERRCODE_OUT_OF_MEMORY),
-						 errmsg("out of memory")));
+			element_add(hashp, (HASHELEMENT *) ptr, temp, i);
+			ptr = HASH_ELEMENT_NEXT(hctl, temp, ptr);
 		}
 	}
 
@@ -701,30 +770,12 @@ init_htab(HTAB *hashp, long nelem)
 		for (i = 0; i < NUM_FREELISTS; i++)
 			SpinLockInit(&(hctl->freeList[i].mutex));
 
-	/*
-	 * Allocate space for the next greater power of two number of buckets,
-	 * assuming a desired maximum load factor of 1.
-	 */
-	nbuckets = next_pow2_int(nelem);
-
-	/*
-	 * In a partitioned table, nbuckets must be at least equal to
-	 * num_partitions; were it less, keys with apparently different partition
-	 * numbers would map to the same bucket, breaking partition independence.
-	 * (Normally nbuckets will be much bigger; this is just a safety check.)
-	 */
-	while (nbuckets < hctl->num_partitions)
-		nbuckets <<= 1;
+	compute_buckets_and_segs(nelem, hctl->num_partitions, hctl->ssize,
+							 &nbuckets, &nsegs);
 
 	hctl->max_bucket = hctl->low_mask = nbuckets - 1;
 	hctl->high_mask = (nbuckets << 1) - 1;
 
-	/*
-	 * Figure number of directory segments needed, round up to a power of 2
-	 */
-	nsegs = (nbuckets - 1) / hctl->ssize + 1;
-	nsegs = next_pow2_int(nsegs);
-
 	/*
 	 * Make sure directory is big enough. If pre-allocated directory is too
 	 * small, choke (caller screwed up).
@@ -737,26 +788,25 @@ init_htab(HTAB *hashp, long nelem)
 			return false;
 	}
 
-	/* Allocate a directory */
+	/*
+	 * Assign a directory by making it point to the correct location in the
+	 * pre-allocated buffer.
+	 */
 	if (!(hashp->dir))
 	{
 		CurrentDynaHashCxt = hashp->hcxt;
-		hashp->dir = (HASHSEGMENT *)
-			hashp->alloc(hctl->dsize * sizeof(HASHSEGMENT));
-		if (!hashp->dir)
-			return false;
+		hashp->dir = (HASHSEGMENT *) HASH_DIRECTORY_PTR(hashp);
 	}
 
-	/* Allocate initial segments */
+	/* Assign initial segments, which are also pre-allocated */
+	i = 0;
 	for (segp = hashp->dir; hctl->nsegs < nsegs; hctl->nsegs++, segp++)
 	{
-		*segp = seg_alloc(hashp);
-		if (*segp == NULL)
-			return false;
+		*segp = (HASHSEGMENT) HASH_SEGMENT_PTR(hashp, i++);
+		MemSet(*segp, 0, HASH_SEGMENT_SIZE(hashp));
 	}
 
-	/* Choose number of entries to allocate at a time */
-	hctl->nelem_alloc = choose_nelem_alloc(hctl->entrysize);
+	Assert(i == nsegs);
 
 #ifdef HASH_DEBUG
 	fprintf(stderr, "init_htab:\n%s%p\n%s%ld\n%s%ld\n%s%d\n%s%ld\n%s%u\n%s%x\n%s%x\n%s%ld\n",
@@ -846,16 +896,76 @@ hash_select_dirsize(long num_entries)
 }
 
 /*
- * Compute the required initial memory allocation for a shared-memory
- * hashtable with the given parameters.  We need space for the HASHHDR
- * and for the (non expansible) directory.
+ * hash_get_init_size -- determine memory needed for a new dynamic hash table
+ *
+ *	info: hash table parameters
+ *	flags: bitmask indicating which parameters to take from *info
+ *	nelem: maximum number of elements expected
+ *	prealloc: request preallocation of elements
+ *
+ * Compute the required initial memory allocation for a hashtable with the given
+ * parameters. We need space for the HASHHDR, for the directory, segments and
+ * preallocated elements.
+ *
+ * The hash table may be private or shared. For shared hash tables the directory
+ * size is non-expansible, and we preallocate all elements (nelem). For private
+ * hash tables, we preallocate elements only if the expected number of elements
+ * is small (less than nelem_alloc).
  */
 Size
-hash_get_shared_size(HASHCTL *info, int flags)
+hash_get_init_size(const HASHCTL *info, int flags, long nelem, bool prealloc)
 {
-	Assert(flags & HASH_DIRSIZE);
-	Assert(info->dsize == info->max_dsize);
-	return sizeof(HASHHDR) + info->dsize * sizeof(HASHSEGMENT);
+	int			nbuckets;
+	int			nsegs;
+	int			num_partitions;
+	long		ssize;
+	long		dsize;
+	Size		elementSize = HASH_ELEMENT_SIZE(info);
+	long		nelem_prealloc = 0;
+
+#ifdef USE_ASSERT_CHECKING
+	/* shared hash tables have non-expansible directory */
+	/* XXX what about segment size? should check have HASH_SEGMENT? */
+	if (flags & HASH_SHARED_MEM)
+	{
+		Assert(flags & HASH_DIRSIZE);
+		Assert(info->dsize == info->max_dsize);
+		Assert(prealloc);
+	}
+#endif
+
+	/* Non-shared hash tables may not specify dir size */
+	if (flags & HASH_DIRSIZE)
+		dsize = info->dsize;
+	else
+		dsize = DEF_DIRSIZE;
+
+	if (flags & HASH_SEGMENT)
+		ssize = info->ssize;
+	else
+		ssize = DEF_SEGSIZE;
+
+	if (flags & HASH_PARTITION)
+	{
+		num_partitions = info->num_partitions;
+
+		/* Number of entries should be at least equal to the number of freelists */
+		if (nelem < NUM_FREELISTS)
+			nelem = NUM_FREELISTS;
+	}
+	else
+		num_partitions = 0;
+
+	compute_buckets_and_segs(nelem, num_partitions, ssize,
+							 &nbuckets, &nsegs);
+
+	/* prealloc as false indicates no elements are to be pre-allocated */
+	if (prealloc)
+		nelem_prealloc = nelem;
+
+	return sizeof(HASHHDR) + dsize * sizeof(HASHSEGMENT)
+		+ sizeof(HASHBUCKET) * ssize * nsegs
+		+ nelem_prealloc * elementSize;
 }
 
 
@@ -1285,7 +1395,7 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 		 * Failing because the needed element is in a different freelist is
 		 * not acceptable.
 		 */
-		if (!element_alloc(hashp, hctl->nelem_alloc, freelist_idx))
+		if ((newElement = element_alloc(hashp, hctl->nelem_alloc)) == NULL)
 		{
 			int			borrow_from_idx;
 
@@ -1322,6 +1432,7 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 			/* no elements available to borrow either, so out of memory */
 			return NULL;
 		}
+		element_add(hashp, newElement, hctl->nelem_alloc, freelist_idx);
 	}
 
 	/* remove entry from freelist, bump nentries */
@@ -1700,29 +1811,43 @@ seg_alloc(HTAB *hashp)
 }
 
 /*
- * allocate some new elements and link them into the indicated free list
+ * allocate some new elements
  */
-static bool
-element_alloc(HTAB *hashp, int nelem, int freelist_idx)
+static HASHELEMENT *
+element_alloc(HTAB *hashp, int nelem)
 {
 	HASHHDR    *hctl = hashp->hctl;
 	Size		elementSize;
-	HASHELEMENT *firstElement;
-	HASHELEMENT *tmpElement;
-	HASHELEMENT *prevElement;
-	int			i;
+	HASHELEMENT *firstElement = NULL;
 
 	if (hashp->isfixed)
-		return false;
+		return NULL;
 
 	/* Each element has a HASHELEMENT header plus user data. */
-	elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
-
+	elementSize = HASH_ELEMENT_SIZE(hctl);
 	CurrentDynaHashCxt = hashp->hcxt;
 	firstElement = (HASHELEMENT *) hashp->alloc(nelem * elementSize);
 
 	if (!firstElement)
-		return false;
+		return NULL;
+
+	return firstElement;
+}
+
+/*
+ * link the elements allocated by element_alloc into the indicated free list
+ */
+static void
+element_add(HTAB *hashp, HASHELEMENT *firstElement, int nelem, int freelist_idx)
+{
+	HASHHDR    *hctl = hashp->hctl;
+	Size		elementSize;
+	HASHELEMENT *tmpElement;
+	HASHELEMENT *prevElement;
+	int			i;
+
+	/* Each element has a HASHELEMENT header plus user data. */
+	elementSize = HASH_ELEMENT_SIZE(hctl);
 
 	/* prepare to link all the new entries into the freelist */
 	prevElement = NULL;
@@ -1744,8 +1869,6 @@ element_alloc(HTAB *hashp, int nelem, int freelist_idx)
 
 	if (IS_PARTITIONED(hctl))
 		SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
-
-	return true;
 }
 
 /*
@@ -1957,3 +2080,34 @@ AtEOSubXact_HashTables(bool isCommit, int nestDepth)
 		}
 	}
 }
+
+/*
+ * Calculate the number of buckets and segments to store the given
+ * number of elements in a hash table. Segments contain buckets which
+ * in turn contain elements.
+ */
+static void
+compute_buckets_and_segs(long nelem, long num_partitions, long ssize,
+						 int *nbuckets, int *nsegments)
+{
+	/*
+	 * Allocate space for the next greater power of two number of buckets,
+	 * assuming a desired maximum load factor of 1.
+	 */
+	*nbuckets = next_pow2_int(nelem);
+
+	/*
+	 * In a partitioned table, nbuckets must be at least equal to
+	 * num_partitions; were it less, keys with apparently different partition
+	 * numbers would map to the same bucket, breaking partition independence.
+	 * (Normally nbuckets will be much bigger; this is just a safety check.)
+	 */
+	while ((*nbuckets) < num_partitions)
+		(*nbuckets) <<= 1;
+
+	/*
+	 * Figure number of directory segments needed, round up to a power of 2
+	 */
+	*nsegments = ((*nbuckets) - 1) / ssize + 1;
+	*nsegments = next_pow2_int(*nsegments);
+}
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index 932cc4f34d..5dc74db6e9 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -151,7 +151,8 @@ extern void hash_seq_term(HASH_SEQ_STATUS *status);
 extern void hash_freeze(HTAB *hashp);
 extern Size hash_estimate_size(long num_entries, Size entrysize);
 extern long hash_select_dirsize(long num_entries);
-extern Size hash_get_shared_size(HASHCTL *info, int flags);
+extern Size hash_get_init_size(const HASHCTL *info, int flags,
+							   long nelem, bool prealloc);
 extern void AtEOXact_HashTables(bool isCommit);
 extern void AtEOSubXact_HashTables(bool isCommit, int nestDepth);
 
-- 
2.34.1

v9-0002-Improve-accounting-for-PredXactList-RWConflictPool-a.patchapplication/octet-stream; name=v9-0002-Improve-accounting-for-PredXactList-RWConflictPool-a.patchDownload
From 14487eb068f434de48411727a32310bc9c3ef933 Mon Sep 17 00:00:00 2001
From: Rahila Syed <rahilasyed.90@gmail.com>
Date: Tue, 1 Apr 2025 13:01:38 +0530
Subject: [PATCH 2/3] Improve accounting for PredXactList, RWConflictPool and
 PGPROC

Various places allocated shared memory by first allocating a small chunk
using ShmemInitStruct(), followed by ShmemAlloc() calls to allocate more
memory. Unfortunately, ShmemAlloc() does not update ShmemIndex, so this
affected pg_shmem_allocations - it only showed the initial chunk.

This commit modifies the following allocations, to allocate everything
as a single chunk, and then split it internally.

- PredXactList
- RWConflictPool
- PGPROC structures
- Fast-Path lock arrays

Author: Rahila Syed <rahilasyed90@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAH2L28vHzRankszhqz7deXURxKncxfirnuW68zD7+hVAqaS5GQ@mail.gmail.com
---
 src/backend/storage/lmgr/predicate.c | 30 ++++++++-----
 src/backend/storage/lmgr/proc.c      | 63 +++++++++++++++++++++-------
 2 files changed, 69 insertions(+), 24 deletions(-)

diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 5b21a05398..d82114ffca 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -1226,14 +1226,21 @@ PredicateLockShmemInit(void)
 	 */
 	max_table_size *= 10;
 
+	requestSize = add_size(PredXactListDataSize,
+						   (mul_size((Size) max_table_size,
+									 sizeof(SERIALIZABLEXACT))));
+
 	PredXact = ShmemInitStruct("PredXactList",
-							   PredXactListDataSize,
+							   requestSize,
 							   &found);
 	Assert(found == IsUnderPostmaster);
 	if (!found)
 	{
 		int			i;
 
+		/* clean everything, both the header and the element */
+		memset(PredXact, 0, requestSize);
+
 		dlist_init(&PredXact->availableList);
 		dlist_init(&PredXact->activeList);
 		PredXact->SxactGlobalXmin = InvalidTransactionId;
@@ -1242,11 +1249,9 @@ PredicateLockShmemInit(void)
 		PredXact->LastSxactCommitSeqNo = FirstNormalSerCommitSeqNo - 1;
 		PredXact->CanPartialClearThrough = 0;
 		PredXact->HavePartialClearedThrough = 0;
-		requestSize = mul_size((Size) max_table_size,
-							   sizeof(SERIALIZABLEXACT));
-		PredXact->element = ShmemAlloc(requestSize);
+		PredXact->element
+			= (SERIALIZABLEXACT *) ((char *) PredXact + PredXactListDataSize);
 		/* Add all elements to available list, clean. */
-		memset(PredXact->element, 0, requestSize);
 		for (i = 0; i < max_table_size; i++)
 		{
 			LWLockInitialize(&PredXact->element[i].perXactPredicateListLock,
@@ -1300,20 +1305,25 @@ PredicateLockShmemInit(void)
 	 */
 	max_table_size *= 5;
 
+	requestSize = RWConflictPoolHeaderDataSize +
+		mul_size((Size) max_table_size,
+				 RWConflictDataSize);
+
 	RWConflictPool = ShmemInitStruct("RWConflictPool",
-									 RWConflictPoolHeaderDataSize,
+									 requestSize,
 									 &found);
 	Assert(found == IsUnderPostmaster);
 	if (!found)
 	{
 		int			i;
 
+		/* clean everything, including the elements */
+		memset(RWConflictPool, 0, requestSize);
+
 		dlist_init(&RWConflictPool->availableList);
-		requestSize = mul_size((Size) max_table_size,
-							   RWConflictDataSize);
-		RWConflictPool->element = ShmemAlloc(requestSize);
+		RWConflictPool->element = (RWConflict) ((char *) RWConflictPool +
+												RWConflictPoolHeaderDataSize);
 		/* Add all elements to available list, clean. */
-		memset(RWConflictPool->element, 0, requestSize);
 		for (i = 0; i < max_table_size; i++)
 		{
 			dlist_push_tail(&RWConflictPool->availableList,
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 066319afe2..ba1ac7e486 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -123,6 +123,25 @@ ProcGlobalShmemSize(void)
 	return size;
 }
 
+/*
+ * Report shared-memory space needed by PGPROC.
+ */
+static Size
+PGProcShmemSize(void)
+{
+	Size		size;
+	uint32		TotalProcs;
+
+	TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
+
+	size = TotalProcs * sizeof(PGPROC);
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->xids));
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->statusFlags));
+
+	return size;
+}
+
 /*
  * Report number of semaphores needed by InitProcGlobal.
  */
@@ -175,6 +194,8 @@ InitProcGlobal(void)
 			   *fpEndPtr PG_USED_FOR_ASSERTS_ONLY;
 	Size		fpLockBitsSize,
 				fpRelIdSize;
+	Size		requestSize;
+	char	   *ptr;
 
 	/* Create the ProcGlobal shared structure */
 	ProcGlobal = (PROC_HDR *)
@@ -204,8 +225,17 @@ InitProcGlobal(void)
 	 * with a single freelist.)  Each PGPROC structure is dedicated to exactly
 	 * one of these purposes, and they do not move between groups.
 	 */
-	procs = (PGPROC *) ShmemAlloc(TotalProcs * sizeof(PGPROC));
-	MemSet(procs, 0, TotalProcs * sizeof(PGPROC));
+	requestSize = PGProcShmemSize();
+
+	ptr = ShmemInitStruct("PGPROC structures",
+						  requestSize,
+						  &found);
+
+	MemSet(ptr, 0, requestSize);
+
+	procs = (PGPROC *) ptr;
+	ptr = (char *) ptr + TotalProcs * sizeof(PGPROC);
+
 	ProcGlobal->allProcs = procs;
 	/* XXX allProcCount isn't really all of them; it excludes prepared xacts */
 	ProcGlobal->allProcCount = MaxBackends + NUM_AUXILIARY_PROCS;
@@ -213,17 +243,18 @@ InitProcGlobal(void)
 	/*
 	 * Allocate arrays mirroring PGPROC fields in a dense manner. See
 	 * PROC_HDR.
-	 *
-	 * XXX: It might make sense to increase padding for these arrays, given
-	 * how hotly they are accessed.
 	 */
-	ProcGlobal->xids =
-		(TransactionId *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->xids));
-	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
-	ProcGlobal->subxidStates = (XidCacheStatus *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	ProcGlobal->statusFlags = (uint8 *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->statusFlags));
-	MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	ProcGlobal->xids = (TransactionId *) ptr;
+	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->xids));
+
+	ProcGlobal->subxidStates = (XidCacheStatus *) ptr;
+	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->subxidStates));
+
+	ProcGlobal->statusFlags = (uint8 *) ptr;
+	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->statusFlags));
+
+	/* make sure we didn't overflow */
+	Assert((ptr > (char *) procs) && (ptr <= (char *) procs + requestSize));
 
 	/*
 	 * Allocate arrays for fast-path locks. Those are variable-length, so
@@ -233,7 +264,9 @@ InitProcGlobal(void)
 	fpLockBitsSize = MAXALIGN(FastPathLockGroupsPerBackend * sizeof(uint64));
 	fpRelIdSize = MAXALIGN(FastPathLockSlotsPerBackend() * sizeof(Oid));
 
-	fpPtr = ShmemAlloc(TotalProcs * (fpLockBitsSize + fpRelIdSize));
+	fpPtr = ShmemInitStruct("Fast path lock arrays",
+							TotalProcs * (fpLockBitsSize + fpRelIdSize),
+							&found);
 	MemSet(fpPtr, 0, TotalProcs * (fpLockBitsSize + fpRelIdSize));
 
 	/* For asserts checking we did not overflow. */
@@ -330,7 +363,9 @@ InitProcGlobal(void)
 	PreparedXactProcs = &procs[MaxBackends + NUM_AUXILIARY_PROCS];
 
 	/* Create ProcStructLock spinlock, too */
-	ProcStructLock = (slock_t *) ShmemAlloc(sizeof(slock_t));
+	ProcStructLock = (slock_t *) ShmemInitStruct("ProcStructLock spinlock",
+												 sizeof(slock_t),
+												 &found);
 	SpinLockInit(ProcStructLock);
 }
 
-- 
2.34.1

v9-0003-Add-cacheline-padding-between-heavily-accessed-array.patchapplication/octet-stream; name=v9-0003-Add-cacheline-padding-between-heavily-accessed-array.patchDownload
From 78a5cdd3d04498eeb11ec818913131d9b69613fb Mon Sep 17 00:00:00 2001
From: Rahila Syed <rahilasyed.90@gmail.com>
Date: Tue, 1 Apr 2025 13:06:57 +0530
Subject: [PATCH 3/3] Add cacheline padding between heavily accessed arrays

---
 src/backend/storage/lmgr/proc.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index ba1ac7e486..c39b51127a 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -134,9 +134,9 @@ PGProcShmemSize(void)
 
 	TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
 
-	size = TotalProcs * sizeof(PGPROC);
-	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->xids));
-	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	size = CACHELINEALIGN(TotalProcs * sizeof(PGPROC));
+	size = add_size(size, CACHELINEALIGN(TotalProcs * sizeof(*ProcGlobal->xids)));
+	size = add_size(size, CACHELINEALIGN(TotalProcs * sizeof(*ProcGlobal->subxidStates)));
 	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->statusFlags));
 
 	return size;
@@ -234,7 +234,7 @@ InitProcGlobal(void)
 	MemSet(ptr, 0, requestSize);
 
 	procs = (PGPROC *) ptr;
-	ptr = (char *) ptr + TotalProcs * sizeof(PGPROC);
+	ptr = (char *) ptr + CACHELINEALIGN(TotalProcs * sizeof(PGPROC));
 
 	ProcGlobal->allProcs = procs;
 	/* XXX allProcCount isn't really all of them; it excludes prepared xacts */
@@ -245,10 +245,10 @@ InitProcGlobal(void)
 	 * PROC_HDR.
 	 */
 	ProcGlobal->xids = (TransactionId *) ptr;
-	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->xids));
+	ptr = (char *) ptr + CACHELINEALIGN(TotalProcs * sizeof(*ProcGlobal->xids));
 
 	ProcGlobal->subxidStates = (XidCacheStatus *) ptr;
-	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	ptr = (char *) ptr + CACHELINEALIGN(TotalProcs * sizeof(*ProcGlobal->subxidStates));
 
 	ProcGlobal->statusFlags = (uint8 *) ptr;
 	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->statusFlags));
-- 
2.34.1

#16Tomas Vondra
tomas@vondra.me
In reply to: Rahila Syed (#15)
Re: Improve monitoring of shared memory allocations

Thanks!

I've pushed the first two parts, improving the shmem accounting.

I've done a couple additional adjustments before commit, namely:

1) I've renamed hash_get_init_size() to just hash_get_size(). We've not
indicated that it's the "initial" allocation before, and I don't think we
need to start doing that now. So I've just removed the "shared" part, as
that's no longer relevant.

2) I've removed one of the compute_buckets_and_segs() calls in hash_create(),
because AFAICS we don't need to recalculate the nsegs - we already have
the correct value in the hctl, so I just used that.

The fact that we call compute_buckets_and_segs() in multiple places,
relying on getting the exact same result, seemed a bit fragile to me.
With this we now have just two callers (instead of 3), and I don't think
we can do better - the first call is in hash_get_size(), and I don't see
a convenient way to pass the info to init_htab(), where we do the 2nd call.

3) In 0002, I've reworked how ProcGlobalShmemSize() calculates the size
of the required memory - the earlier patch added PGProcShmemSize(), but
ProcGlobalShmemSize() still redid the whole calculation itself. I changed
that to call the new function, and also added a new FastPathLockShmemSize()
to do the same for the fast-path lock arrays (which, I realized, are now
allocated as a separate area, so they'll be shown separately).

4) I realized PGProcShmemSize() did the calculation slightly differently
from ProcGlobalShmemSize() - by directly multiplying the values, not
using mul_size(). I don't recall why we use mul_size(), but it seems
like the "usual" way. So I did it that way.

I notice we still do the plain multiplication later when setting the
ProcGlobal fields. Seems a bit weird, but we always did that - the patch
does not really change that.
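
(Just to spell out why mul_size() is the "usual" way - it's the
overflow-checked variant. A rough standalone sketch of the pattern, not the
actual backend code:)

/*
 * Rough sketch of the overflow-checked arithmetic behind add_size()/mul_size();
 * the real backend helpers ereport(ERROR) on overflow instead of exiting.
 */
#include <stdio.h>
#include <stdlib.h>

typedef size_t Size;

static Size
mul_size_sketch(Size s1, Size s2)
{
	Size		result;

	if (s1 == 0 || s2 == 0)
		return 0;
	result = s1 * s2;
	if (result / s2 != s1)
	{
		fprintf(stderr, "requested shared memory size overflows size_t\n");
		exit(EXIT_FAILURE);
	}
	return result;
}

int
main(void)
{
	/* e.g. sizing an array of per-proc entries */
	printf("%zu\n", mul_size_sketch(1024, sizeof(long)));
	return 0;
}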

I'll now mark this as committed. I haven't done anything about the
alignment. My conclusion from the discussion was that we don't quite need
to do that, but if we do, I think it's a matter for a separate patch -
perhaps something like the 0003.

Thanks for the patch, reviews, etc.

--
Tomas Vondra

#17Rahila Syed
rahilasyed90@gmail.com
In reply to: Tomas Vondra (#16)
1 attachment(s)
Re: Improve monitoring of shared memory allocations

Hi,

Analysis of the Bug
<https://www.postgresql.org/message-id/3d36f787-8ba5-4aa2-abb1-7ade129a7b0e%40vondra.me>
in 0002 reported by David Rowley:
The 0001* patch allocates memory for the hash header, directory, segments,
and elements collectively for both shared and non-shared hash tables. While
this approach works well for shared hash tables, it presents an issue for
non-shared hash tables. Specifically, during the expand_table() process,
non-shared hash tables may reallocate a new directory and free the old one.
Since the directory's memory is no longer allocated individually, it cannot
be freed separately. This results in the following error:

ERROR: pfree called with invalid pointer 0x60a15edc44e0 (header
0x0000002000000008)
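
(To illustrate the failure mode with a tiny standalone C sketch - this is not
the dynahash code, just the general problem of freeing a pointer that points
into the middle of a larger allocation, which is what dir_realloc ends up
doing once the directory is no longer a separate chunk:)

#include <stdlib.h>
#include <string.h>

int
main(void)
{
	size_t		hdrsize = 64;
	size_t		dirsize = 256;

	/* header and directory allocated as one chunk, like the 0001 patch */
	char	   *chunk = malloc(hdrsize + dirsize);
	char	   *dir = chunk + hdrsize;	/* directory carved out of the chunk */

	memset(dir, 0, dirsize);

	/*
	 * free(dir) here would be undefined behavior, because 'dir' was never
	 * returned by malloc - only the whole chunk may be freed.  pfree() fails
	 * the same way, since there is no valid chunk header in front of 'dir'.
	 */
	free(chunk);
	return 0;
}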

These allocations are done together to improve reporting of shared memory
allocations for shared hash tables. A similar change is done for non-shared
hash tables only to maintain consistency, since the hash_create code is
shared between both types of hash tables.

One solution could be to separate the allocation of the directory from the
rest of the allocations for non-shared hash tables, but this approach would
undermine the purpose of making the change for non-shared hash tables.

A better/safer solution would be to make this change only for shared hash
tables and exclude the non-shared hash tables.

I believe it's acceptable to allocate everything in a single block provided
we are not trying to free any of these pieces individually, which we do only
for the directory pointer in dir_realloc. Additional segments are allocated
through seg_alloc and additional elements through element_alloc, which do
not free existing chunks but instead allocate new ones and store pointers to
them in the existing directory and segments. Thus, as long as we don't
reallocate the directory, which we avoid in the case of shared hash tables,
it should be safe to proceed with this change.

Please find attached the patch which removes the changes for non-shared
hash tables and keeps them for shared hash tables.

I tested this by running make check, make check-world, and the reproducer
script shared by David. I also ran pgbench to test creation and expansion of
some of the shared hash tables.

Thank you,
Rahila Syed

Attachments:

v10-0001-Improve-accounting-for-memory-used-by-shared-hash-ta.patchapplication/octet-stream; name=v10-0001-Improve-accounting-for-memory-used-by-shared-hash-ta.patchDownload
From 270ddf3e8e1fe99e92c45716f5d57396ee482d11 Mon Sep 17 00:00:00 2001
From: Rahila Syed <rahilasyed.90@gmail.com>
Date: Fri, 4 Apr 2025 14:22:05 +0530
Subject: [PATCH] Improve accounting for memory used by shared hash tables

pg_shmem_allocations tracks the memory allocated by ShmemInitStruct(),
but for shared hash tables that covered only the header and hash
directory.  The remaining parts (segments and buckets) were allocated
later using ShmemAlloc(), which does not update the shmem accounting.
Thus, these allocations were not shown in pg_shmem_allocations.

This commit improves the situation by allocating all the hash table
parts at once, using a single ShmemInitStruct() call. This way the
ShmemIndex entries (and thus pg_shmem_allocations) better reflect the
proper size of the hash table.

This does not change anything for non-shared hash tables.

This changes the alignment a bit. ShmemAlloc() aligns the chunks using
CACHELINEALIGN(), which means some parts (header, directory, segments)
were aligned this way. Allocating all parts as a single chunk removes
this (implicit) alignment. We've considered adding explicit alignment,
but we've decided not to - it seems to be merely a coincidence due to
using the ShmemAlloc() API, not due to necessity.

Author: Rahila Syed <rahilasyed90@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAH2L28vHzRankszhqz7deXURxKncxfirnuW68zD7+hVAqaS5GQ@mail.gmail.com
---
 src/backend/storage/ipc/shmem.c   |   4 +-
 src/backend/utils/hash/dynahash.c | 242 ++++++++++++++++++++++++------
 src/include/utils/hsearch.h       |   3 +-
 3 files changed, 198 insertions(+), 51 deletions(-)

diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39e..a1b9180a64b 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -73,6 +73,7 @@
 #include "storage/shmem.h"
 #include "storage/spin.h"
 #include "utils/builtins.h"
+#include "utils/dynahash.h"
 
 static void *ShmemAllocRaw(Size size, Size *allocated_size);
 
@@ -346,7 +347,8 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
 
 	/* look it up in the shmem index */
 	location = ShmemInitStruct(name,
-							   hash_get_shared_size(infoP, hash_flags),
+							   hash_get_shared_size(infoP, hash_flags,
+													init_size),
 							   &found);
 
 	/*
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 3f25929f2d8..53b84db0683 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -260,12 +260,39 @@ static long hash_accesses,
 			hash_expansions;
 #endif
 
+/* access to parts of the hash table, allocated as a single chunk */
+#define	HASH_DIRECTORY_PTR(hashp) \
+	(((char *) (hashp)->hctl) + sizeof(HASHHDR))
+
+#define HASH_SEGMENT_OFFSET(hctl, idx) \
+	(sizeof(HASHHDR) + \
+	 ((hctl)->dsize * sizeof(HASHSEGMENT)) + \
+	 ((hctl)->ssize * (idx) * sizeof(HASHBUCKET)))
+
+#define HASH_SEGMENT_PTR(hashp, idx) \
+	((char *) (hashp)->hctl + HASH_SEGMENT_OFFSET((hashp)->hctl, (idx)))
+
+#define HASH_SEGMENT_SIZE(hashp) \
+	((hashp)->ssize * sizeof(HASHBUCKET))
+
+#define HASH_ELEMENTS_PTR(hashp, nsegs) \
+	((char *) (hashp)->hctl + HASH_SEGMENT_OFFSET((hashp)->hctl, nsegs))
+
+/* Each element has a HASHELEMENT header plus user data. */
+#define HASH_ELEMENT_SIZE(hctl) \
+	(MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN((hctl)->entrysize))
+
+#define HASH_ELEMENT_NEXT(hctl, num, ptr) \
+	((char *) (ptr) + ((num) * HASH_ELEMENT_SIZE(hctl)))
+
 /*
  * Private function prototypes
  */
 static void *DynaHashAlloc(Size size);
 static HASHSEGMENT seg_alloc(HTAB *hashp);
-static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx);
+static HASHELEMENT *element_alloc(HTAB *hashp, int nelem);
+static void element_add(HTAB *hashp, HASHELEMENT *firstElement,
+						int nelem, int freelist_idx);
 static bool dir_realloc(HTAB *hashp);
 static bool expand_table(HTAB *hashp);
 static HASHBUCKET get_hash_entry(HTAB *hashp, int freelist_idx);
@@ -280,6 +307,9 @@ static int	next_pow2_int(long num);
 static void register_seq_scan(HTAB *hashp);
 static void deregister_seq_scan(HTAB *hashp);
 static bool has_seq_scans(HTAB *hashp);
+static void compute_buckets_and_segs(long nelem, long num_partitions,
+									 long ssize,
+									 int *nbuckets, int *nsegments);
 
 
 /*
@@ -568,12 +598,12 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		elog(ERROR, "failed to initialize hash table \"%s\"", hashp->tabname);
 
 	/*
+	 * For a private hash table, preallocate the requested number of elements
+	 * if it's less than our chosen nelem_alloc.  This avoids wasting space if
+	 * the caller correctly estimates a small table size.
+	 *
 	 * For a shared hash table, preallocate the requested number of elements.
 	 * This reduces problems with run-time out-of-shared-memory conditions.
-	 *
-	 * For a non-shared hash table, preallocate the requested number of
-	 * elements if it's less than our chosen nelem_alloc.  This avoids wasting
-	 * space if the caller correctly estimates a small table size.
 	 */
 	if ((flags & HASH_SHARED_MEM) ||
 		nelem < hctl->nelem_alloc)
@@ -582,6 +612,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 					freelist_partitions,
 					nelem_alloc,
 					nelem_alloc_first;
+		void	   *ptr = NULL;
 
 		/*
 		 * If hash table is partitioned, give each freelist an equal share of
@@ -606,14 +637,42 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		else
 			nelem_alloc_first = nelem_alloc;
 
+		/*
+		 * For a shared hash table, calculate the offset at which to find the
+		 * first partition of elements. We have to skip space for the header,
+		 * segments and buckets.
+		 */
+		if (hashp->isshared)
+			ptr = HASH_ELEMENTS_PTR(hashp, hctl->nsegs);
+
 		for (i = 0; i < freelist_partitions; i++)
 		{
 			int			temp = (i == 0) ? nelem_alloc_first : nelem_alloc;
 
-			if (!element_alloc(hashp, temp, i))
-				ereport(ERROR,
-						(errcode(ERRCODE_OUT_OF_MEMORY),
-						 errmsg("out of memory")));
+			/*
+			 * Assign the correct location of each partition within a
+			 * pre-allocated buffer.
+			 *
+			 * Actual memory allocation happens in ShmemInitHash for shared
+			 * hash tables.
+			 *
+			 * We just need to split that allocation into per-batch freelists.
+			 */
+			if (hashp->isshared)
+			{
+				element_add(hashp, (HASHELEMENT *) ptr, temp, i);
+				ptr = HASH_ELEMENT_NEXT(hctl, temp, ptr);
+			}
+			else
+			{
+				HASHELEMENT *firstElement = element_alloc(hashp, temp);
+
+				if (!firstElement)
+					ereport(ERROR,
+							(errcode(ERRCODE_OUT_OF_MEMORY),
+							 errmsg("out of memory")));
+				element_add(hashp, firstElement, temp, i);
+			}
 		}
 	}
 
@@ -702,29 +761,16 @@ init_htab(HTAB *hashp, long nelem)
 			SpinLockInit(&(hctl->freeList[i].mutex));
 
 	/*
-	 * Allocate space for the next greater power of two number of buckets,
-	 * assuming a desired maximum load factor of 1.
+	 * We've already calculated these parameters when we calculated how much
+	 * space to allocate in hash_get_shared_size(). Be careful to keep these
+	 * two places in sync, so that we get the same parameters.
 	 */
-	nbuckets = next_pow2_int(nelem);
-
-	/*
-	 * In a partitioned table, nbuckets must be at least equal to
-	 * num_partitions; were it less, keys with apparently different partition
-	 * numbers would map to the same bucket, breaking partition independence.
-	 * (Normally nbuckets will be much bigger; this is just a safety check.)
-	 */
-	while (nbuckets < hctl->num_partitions)
-		nbuckets <<= 1;
+	compute_buckets_and_segs(nelem, hctl->num_partitions, hctl->ssize,
+							 &nbuckets, &nsegs);
 
 	hctl->max_bucket = hctl->low_mask = nbuckets - 1;
 	hctl->high_mask = (nbuckets << 1) - 1;
 
-	/*
-	 * Figure number of directory segments needed, round up to a power of 2
-	 */
-	nsegs = (nbuckets - 1) / hctl->ssize + 1;
-	nsegs = next_pow2_int(nsegs);
-
 	/*
 	 * Make sure directory is big enough. If pre-allocated directory is too
 	 * small, choke (caller screwed up).
@@ -748,12 +794,22 @@ init_htab(HTAB *hashp, long nelem)
 	}
 
 	/* Allocate initial segments */
+	i = 0;
 	for (segp = hashp->dir; hctl->nsegs < nsegs; hctl->nsegs++, segp++)
 	{
-		*segp = seg_alloc(hashp);
-		if (*segp == NULL)
-			return false;
+		/* Assign initial segments, which are also pre-allocated */
+		if (hashp->isshared)
+		{
+			*segp = (HASHSEGMENT) HASH_SEGMENT_PTR(hashp, i++);
+			MemSet(*segp, 0, HASH_SEGMENT_SIZE(hashp));
+		}
+		else
+		{
+			*segp = seg_alloc(hashp);
+			i++;
+		}
 	}
+	Assert(i == nsegs);
 
 	/* Choose number of entries to allocate at a time */
 	hctl->nelem_alloc = choose_nelem_alloc(hctl->entrysize);
@@ -846,16 +902,60 @@ hash_select_dirsize(long num_entries)
 }
 
 /*
- * Compute the required initial memory allocation for a shared-memory
- * hashtable with the given parameters.  We need space for the HASHHDR
- * and for the (non expansible) directory.
+ * hash_get_shared_size -- determine memory needed for a new shared dynamic hash table
+ *
+ *	info: hash table parameters
+ *	flags: bitmask indicating which parameters to take from *info
+ *	nelem: maximum number of elements expected
+ *
+ * Compute the required initial memory allocation for a hashtable with the given
+ * parameters. We need space for the HASHHDR, for the directory, segments and
+ * preallocated elements.
+ *
+ * For shared hash tables the directory size is non-expansible, and we preallocate
+ * all elements (nelem).
  */
 Size
-hash_get_shared_size(HASHCTL *info, int flags)
+hash_get_shared_size(const HASHCTL *info, int flags, long nelem)
 {
+	int			nbuckets;
+	int			nsegs;
+	int			num_partitions;
+	long		ssize;
+	long		dsize;
+	Size		elementSize = HASH_ELEMENT_SIZE(info);
+
+#ifdef USE_ASSERT_CHECKING
+	/* shared hash tables have non-expansible directory */
+	Assert(flags & HASH_SHARED_MEM);
 	Assert(flags & HASH_DIRSIZE);
 	Assert(info->dsize == info->max_dsize);
-	return sizeof(HASHHDR) + info->dsize * sizeof(HASHSEGMENT);
+#endif
+
+	dsize = info->dsize;
+
+	if (flags & HASH_SEGMENT)
+		ssize = info->ssize;
+	else
+		ssize = DEF_SEGSIZE;
+
+	if (flags & HASH_PARTITION)
+	{
+		num_partitions = info->num_partitions;
+
+		/* Number of entries should be at least equal to the freelists */
+		if (nelem < NUM_FREELISTS)
+			nelem = NUM_FREELISTS;
+	}
+	else
+		num_partitions = 0;
+
+	compute_buckets_and_segs(nelem, num_partitions, ssize,
+							 &nbuckets, &nsegs);
+
+	return sizeof(HASHHDR) + dsize * sizeof(HASHSEGMENT)
+		+ sizeof(HASHBUCKET) * ssize * nsegs
+		+ nelem * elementSize;
 }
 
 
@@ -1285,7 +1385,7 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 		 * Failing because the needed element is in a different freelist is
 		 * not acceptable.
 		 */
-		if (!element_alloc(hashp, hctl->nelem_alloc, freelist_idx))
+		if ((newElement = element_alloc(hashp, hctl->nelem_alloc)) == NULL)
 		{
 			int			borrow_from_idx;
 
@@ -1322,6 +1422,7 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 			/* no elements available to borrow either, so out of memory */
 			return NULL;
 		}
+		element_add(hashp, newElement, hctl->nelem_alloc, freelist_idx);
 	}
 
 	/* remove entry from freelist, bump nentries */
@@ -1700,29 +1801,43 @@ seg_alloc(HTAB *hashp)
 }
 
 /*
- * allocate some new elements and link them into the indicated free list
+ * allocate some new elements
  */
-static bool
-element_alloc(HTAB *hashp, int nelem, int freelist_idx)
+static HASHELEMENT *
+element_alloc(HTAB *hashp, int nelem)
 {
 	HASHHDR    *hctl = hashp->hctl;
 	Size		elementSize;
-	HASHELEMENT *firstElement;
-	HASHELEMENT *tmpElement;
-	HASHELEMENT *prevElement;
-	int			i;
+	HASHELEMENT *firstElement = NULL;
 
 	if (hashp->isfixed)
-		return false;
+		return NULL;
 
 	/* Each element has a HASHELEMENT header plus user data. */
-	elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
-
+	elementSize = HASH_ELEMENT_SIZE(hctl);
 	CurrentDynaHashCxt = hashp->hcxt;
 	firstElement = (HASHELEMENT *) hashp->alloc(nelem * elementSize);
 
 	if (!firstElement)
-		return false;
+		return NULL;
+
+	return firstElement;
+}
+
+/*
+ * link the elements allocated by element_alloc into the indicated free list
+ */
+static void
+element_add(HTAB *hashp, HASHELEMENT *firstElement, int nelem, int freelist_idx)
+{
+	HASHHDR    *hctl = hashp->hctl;
+	Size		elementSize;
+	HASHELEMENT *tmpElement;
+	HASHELEMENT *prevElement;
+	int			i;
+
+	/* Each element has a HASHELEMENT header plus user data. */
+	elementSize = HASH_ELEMENT_SIZE(hctl);
 
 	/* prepare to link all the new entries into the freelist */
 	prevElement = NULL;
@@ -1744,8 +1859,6 @@ element_alloc(HTAB *hashp, int nelem, int freelist_idx)
 
 	if (IS_PARTITIONED(hctl))
 		SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
-
-	return true;
 }
 
 /*
@@ -1957,3 +2070,34 @@ AtEOSubXact_HashTables(bool isCommit, int nestDepth)
 		}
 	}
 }
+
+/*
+ * Calculate the number of buckets and segments to store the given
+ * number of elements in a hash table. Segments contain buckets which
+ * in turn contain elements.
+ */
+static void
+compute_buckets_and_segs(long nelem, long num_partitions, long ssize,
+						 int *nbuckets, int *nsegments)
+{
+	/*
+	 * Allocate space for the next greater power of two number of buckets,
+	 * assuming a desired maximum load factor of 1.
+	 */
+	*nbuckets = next_pow2_int(nelem);
+
+	/*
+	 * In a partitioned table, nbuckets must be at least equal to
+	 * num_partitions; were it less, keys with apparently different partition
+	 * numbers would map to the same bucket, breaking partition independence.
+	 * (Normally nbuckets will be much bigger; this is just a safety check.)
+	 */
+	while ((*nbuckets) < num_partitions)
+		(*nbuckets) <<= 1;
+
+	/*
+	 * Figure number of directory segments needed, round up to a power of 2
+	 */
+	*nsegments = ((*nbuckets) - 1) / ssize + 1;
+	*nsegments = next_pow2_int(*nsegments);
+}
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index 932cc4f34d9..5cefda1c22c 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -151,7 +151,8 @@ extern void hash_seq_term(HASH_SEQ_STATUS *status);
 extern void hash_freeze(HTAB *hashp);
 extern Size hash_estimate_size(long num_entries, Size entrysize);
 extern long hash_select_dirsize(long num_entries);
-extern Size hash_get_shared_size(HASHCTL *info, int flags);
+extern Size hash_get_shared_size(const HASHCTL *info, int flags,
+								 long nelem);
 extern void AtEOXact_HashTables(bool isCommit);
 extern void AtEOSubXact_HashTables(bool isCommit, int nestDepth);
 
-- 
2.34.1

#18Tomas Vondra
tomas@vondra.me
In reply to: Rahila Syed (#17)
Re: Improve monitoring of shared memory allocations

Hi,

I'm sorry, but I'm afraid this won't make it into PG18 :-(

AFAICS the updated patch is correct / not buggy for non-shared hash
tables, but considering I missed a pretty serious flaw before pushing
the original patch, and that the tests didn't catch that either ...

The risk/benefit is not really worth it for PG18. I think this needs a
bit more work/testing. dynahash is a bit too critical a piece, and I don't
want to be reverting a patch a second time.

Sorry, I realize it's not a great outcome ...

On 4/4/25 17:45, Rahila Syed wrote:

Hi,

Analysis of the Bug
<https://www.postgresql.org/message-id/3d36f787-8ba5-4aa2-abb1-7ade129a7b0e%40vondra.me>
in 0002 reported by David Rowley:
The 0001* patch allocates memory for the hash header, directory, segments,
and elements collectively for both shared and non-shared hash tables. While
this approach works well for shared hash tables, it presents an issue for
non-shared hash tables. Specifically, during the expand_table() process,
non-shared hash tables may reallocate a new directory and free the old one.
Since the directory's memory is no longer allocated individually, it cannot
be freed separately. This results in the following error:

ERROR: pfree called with invalid pointer 0x60a15edc44e0 (header
0x0000002000000008)

These allocations are done together to improve reporting of shared memory
allocations for shared hash tables. A similar change is done for non-shared
hash tables only to maintain consistency, since the hash_create code is
shared between both types of hash tables.

Thanks, that matches my conclusions after looking into this.

One solution could be to separate the allocation of the directory from the
rest of the allocations for non-shared hash tables, but this approach would
undermine the purpose of making the change for non-shared hash tables.

I don't follow. Why would that undermine the change for non-shared hash
tables? Why should this affect non-shared hash tables at all? The goal
was to improve accounting for shared hash tables.

A better/safer solution would be to make this change only for shared hash
tables and exclude the non-shared hash tables.

I believe it's acceptable to allocate everything in a single block provided
we are not trying to free any of these pieces individually, which we do only
for the directory pointer in dir_realloc. Additional segments are allocated
through seg_alloc and additional elements through element_alloc, which do
not free existing chunks but instead allocate new ones and store pointers to
them in the existing directory and segments. Thus, as long as we don't
reallocate the directory, which we avoid in the case of shared hash tables,
it should be safe to proceed with this change.

Right, I believe that's correct. But I admit I find the current dynahash
code a bit confusing, especially in how it mixes handling of non-shared
and shared hash tables. (Not the fault of this patch, ofc.)

Please find attached the patch which removes the changes for non-shared
hash tables and keeps them for shared hash tables.

I tested this by running make check, make check-world, and the reproducer
script shared by David. I also ran pgbench to test creation and expansion of
some of the shared hash tables.

Thanks. I'm afraid the existing regression tests can't tell us much,
considering it didn't fail once with the reverted patch :-(

I did check the coverage in:

https://coverage.postgresql.org/src/backend/utils/hash/dynahash.c.gcov.html

and sure enough, dir_realloc() is not executed once. And there's a
couple more pieces that have the same issue (e.g. hash_freeze or parts
of get_hash_entry).

I realize that's not the fault of this patch, but would you be willing
to improve that? It'd certainly make me less concerned if we improved
this first.

regards

--
Tomas Vondra

#19Amit Kapila
amit.kapila16@gmail.com
In reply to: Rahila Syed (#17)
Re: Improve monitoring of shared memory allocations

On Fri, Apr 4, 2025 at 9:15 PM Rahila Syed <rahilasyed90@gmail.com> wrote:

Please find attached the patch which removes the changes for non-shared hash tables
and keeps them for shared hash tables.

  /* Allocate initial segments */
+ i = 0;
  for (segp = hashp->dir; hctl->nsegs < nsegs; hctl->nsegs++, segp++)
  {
- *segp = seg_alloc(hashp);
- if (*segp == NULL)
- return false;
+ /* Assign initial segments, which are also pre-allocated */
+ if (hashp->isshared)
+ {
+ *segp = (HASHSEGMENT) HASH_SEGMENT_PTR(hashp, i++);
+ MemSet(*segp, 0, HASH_SEGMENT_SIZE(hashp));
+ }
+ else
+ {
+ *segp = seg_alloc(hashp);
+ i++;
+ }

In the non-shared hash table case, previously, we used to return false
from here when seg_alloc() fails, but now it seems to be proceeding
without returning false. Is there a reason for the same?
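
(For comparison, keeping the pre-patch failure handling in the non-shared
branch would look roughly like this - not taken from the posted patch:)

	else
	{
		*segp = seg_alloc(hashp);
		if (*segp == NULL)		/* pre-patch behavior: report failure */
			return false;
		i++;
	}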

I tested this by running make-check, make-check world and the reproducer script shared
by David. I also ran pgbench to test creation and expansion of some of the
shared hash tables.

This covers the basic tests for this patch. I think we should do some
low-level testing of both shared and non-shared hash tables by having
a contrib module or such (we don't need to commit such a contrib
module, but it will give us confidence that the low-level data
structure allocation change is thoroughly tested). We also need to
focus on negative tests where there is insufficient memory in the
system.

--
With Regards,
Amit Kapila.