Shared memory size computation oversight?

Started by Julien Rouhaud · almost 5 years ago · 16 messages
#1 Julien Rouhaud
rjuju123@gmail.com
1 attachment(s)

Hi,

While investigating some strange "out of shared memory" errors reported on
the french BBS [1], I noticed that 09adc9a8c09 (adding Robert in Cc) changed
ShmemAlloc alignment to CACHELINEALIGN but didn't update any related code that
depended on it.

Most of the core code isn't impacted as it doesn't have to reserve any shared
memory, but AFAICT pg_prewarm and pg_stat_statements can now slightly
underestimate the amount of shared memory they'll use, and similarly for any
third-party extension that still relies on MAXALIGN alignment. As a consequence
those extensions can consume a few hundred bytes more than they reserved, which
will probably be borrowed from the lock hashtables' reserved size that isn't
allocated immediately. This can later lead to ShmemAlloc failing when trying to
acquire a lock while the hash table is almost full but should still have enough
room to store it, which could explain the error reported.

PFA a patch that fixes pg_prewarm's and pg_stat_statements' explicit alignment to
CACHELINEALIGN, and also updates the alignment in hash_estimate_size() to what I
think ShmemInitHash will actually consume.

[1]: https://forums.postgresql.fr/viewtopic.php?pid=32138#p32138 (sorry, it's all
in French)
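
To make the discrepancy concrete, here is a minimal sketch (not part of the
attached patch; the macros are as defined in c.h, with the typical 64-bit
values from pg_config_manual.h, and the 72-byte struct size is hypothetical):

    #include <stdint.h>

    #define MAXIMUM_ALIGNOF     8       /* typical 64-bit value */
    #define PG_CACHE_LINE_SIZE  128     /* pg_config_manual.h default */

    #define TYPEALIGN(ALIGNVAL, LEN) \
        (((uintptr_t) (LEN) + ((ALIGNVAL) - 1)) & ~((uintptr_t) ((ALIGNVAL) - 1)))
    #define MAXALIGN(LEN)       TYPEALIGN(MAXIMUM_ALIGNOF, (LEN))
    #define CACHELINEALIGN(LEN) TYPEALIGN(PG_CACHE_LINE_SIZE, (LEN))

    /*
     * An extension whose shared struct is, say, 72 bytes reserves
     * MAXALIGN(72) = 72 bytes, but since 09adc9a8c09 ShmemAlloc() actually
     * consumes CACHELINEALIGN(72) = 128 bytes: each allocation can eat up
     * to PG_CACHE_LINE_SIZE - MAXIMUM_ALIGNOF unreserved bytes.
     */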

Attachments:

v1-0001-Fix-various-shared-memory-computation.patch (text/x-diff; charset=us-ascii)
From 59f4959f3b8f7e69fdbe88a85efaca80b6718a38 Mon Sep 17 00:00:00 2001
From: Julien Rouhaud <julien.rouhaud@free.fr>
Date: Sat, 27 Feb 2021 15:45:29 +0800
Subject: [PATCH v1] Fix various shared memory computation

Oversight in 09adc9a8c09.
---
 contrib/pg_prewarm/autoprewarm.c                | 2 +-
 contrib/pg_stat_statements/pg_stat_statements.c | 2 +-
 src/backend/utils/hash/dynahash.c               | 8 ++++----
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c
index b3f73ea92d..887e68b288 100644
--- a/contrib/pg_prewarm/autoprewarm.c
+++ b/contrib/pg_prewarm/autoprewarm.c
@@ -138,7 +138,7 @@ _PG_init(void)
 
 	EmitWarningsOnPlaceholders("pg_prewarm");
 
-	RequestAddinShmemSpace(MAXALIGN(sizeof(AutoPrewarmSharedState)));
+	RequestAddinShmemSpace(CACHELINEALIGN(sizeof(AutoPrewarmSharedState)));
 
 	/* Register autoprewarm worker, if enabled. */
 	if (autoprewarm)
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 62cccbfa44..4e045ffba7 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -1933,7 +1933,7 @@ pgss_memsize(void)
 {
 	Size		size;
 
-	size = MAXALIGN(sizeof(pgssSharedState));
+	size = CACHELINEALIGN(sizeof(pgssSharedState));
 	size = add_size(size, hash_estimate_size(pgss_max, sizeof(pgssEntry)));
 
 	return size;
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 6546e3c7c7..e532a3825a 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -797,19 +797,19 @@ hash_estimate_size(long num_entries, Size entrysize)
 		nDirEntries <<= 1;		/* dir_alloc doubles dsize at each call */
 
 	/* fixed control info */
-	size = MAXALIGN(sizeof(HASHHDR));	/* but not HTAB, per above */
+	size = CACHELINEALIGN(sizeof(HASHHDR));	/* but not HTAB, per above */
 	/* directory */
 	size = add_size(size, mul_size(nDirEntries, sizeof(HASHSEGMENT)));
 	/* segments */
 	size = add_size(size, mul_size(nSegments,
-								   MAXALIGN(DEF_SEGSIZE * sizeof(HASHBUCKET))));
+								   CACHELINEALIGN(DEF_SEGSIZE * sizeof(HASHBUCKET))));
 	/* elements --- allocated in groups of choose_nelem_alloc() entries */
 	elementAllocCnt = choose_nelem_alloc(entrysize);
 	nElementAllocs = (num_entries - 1) / elementAllocCnt + 1;
 	elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(entrysize);
 	size = add_size(size,
-					mul_size(nElementAllocs,
-							 mul_size(elementAllocCnt, elementSize)));
+					CACHELINEALIGN(Nmul_size(nElementAllocs,
+								   mul_size(elementAllocCnt, elementSize))));
 
 	return size;
 }
-- 
2.30.1

#2 Georgios
gkokolatos@protonmail.com
In reply to: Julien Rouhaud (#1)
Re: Shared memory size computation oversight?

Hi,

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Saturday, February 27, 2021 9:08 AM, Julien Rouhaud <rjuju123@gmail.com> wrote:

Hi,

While investigating some strange "out of shared memory" errors reported on
the french BBS [1], I noticed that 09adc9a8c09 (adding Robert in Cc) changed
ShmemAlloc alignment to CACHELINEALIGN but didn't update any related code that
depended on it.

Nice catch.

Most of the core code isn't impacted as it doesn't have to reserve any shared
memory, but AFAICT pg_prewarm and pg_stat_statements can now slightly
underestimate the amount of shared memory they'll use, and similarly for any
third-party extension that still relies on MAXALIGN alignment. As a consequence
those extensions can consume a few hundred bytes more than they reserved, which
will probably be borrowed from the lock hashtables' reserved size that isn't
allocated immediately. This can later lead to ShmemAlloc failing when trying to
acquire a lock while the hash table is almost full but should still have enough
room to store it, which could explain the error reported.

PFA a patch that fixes pg_prewarm's and pg_stat_statements' explicit alignment to
CACHELINEALIGN, and also updates the alignment in hash_estimate_size() to what I
think ShmemInitHash will actually consume.

Please excuse me as I speak mostly from the side of ignorance. I am slightly curious
to understand something in your patch, if you can be kind enough to explain it to
me.

The commit 09adc9a8c09 you pointed to, as far as I can see, changed the total
size of the shared memory, not individual bits. It did explain that the users of
that space had properly padded their data, but even assuming that they did not do
that as asked, the result would (or rather should) be cache misses, not running out
of reserved memory.

My limited understanding is also based on a comment in CreateSharedMemoryAndSemaphores()

/*
* Size of the Postgres shared-memory block is estimated via
* moderately-accurate estimates for the big hogs, plus 100K for the
* stuff that's too small to bother with estimating.
*
* We take some care during this phase to ensure that the total size
* request doesn't overflow size_t. If this gets through, we don't
* need to be so careful during the actual allocation phase.
*/
size = 100000;

Of course, my argument falls flat if the size estimate of each of the elements is
rather an underestimate. Indeed this is the argument of the present patch, expressed in
code most prominently in hash_estimate_size, here:

        size = add_size(size,
-                                       mul_size(nElementAllocs,
-                                                        mul_size(elementAllocCnt, elementSize)));
+                                       CACHELINEALIGN(Nmul_size(nElementAllocs,
+                                                                  mul_size(elementAllocCnt, elementSize))));

(Small note: Nmul_size is a typo for mul_size, but that is minor; after amending it, the
code compiles.)

To conclude, the running out of shared memory seems to me to be fixed rather
vaguely by this patch. I am not claiming that increasing the memory used by
the elements is not needed; I am claiming that I cannot clearly see how that
specific increase is needed.

Thank you for your patience,
//Georgios -- https://www.vmware.com

[1] https://forums.postgresql.fr/viewtopic.php?pid=32138#p32138 (sorry, it's all
in French)

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Georgios (#2)
Re: Shared memory size computation oversight?

Georgios <gkokolatos@protonmail.com> writes:

My limited understanding is also based in a comment in CreateSharedMemoryAndSemaphores()

* Size of the Postgres shared-memory block is estimated via
* moderately-accurate estimates for the big hogs, plus 100K for the
* stuff that's too small to bother with estimating.

Right. That 100K slop factor is capable of hiding a multitude of sins.

I have not looked at this patch, but I think the concern is basically that
if we have space-estimation infrastructure that misestimates what it is
supposed to estimate, then if that infrastructure is used to estimate the
size of any of the "big hog" data structures, we might misestimate by
enough that the slop factor wouldn't hide it.

regards, tom lane

#4 Julien Rouhaud
rjuju123@gmail.com
In reply to: Georgios (#2)
Re: Shared memory size computation oversight?

On Wed, Mar 03, 2021 at 03:37:20PM +0000, Georgios wrote:

Please excuse me as I speak mostly from the side of ignorance. I am slightly curious
to understand something in your patch, if you can be kind enough to explain it to
me.

The commit 09adc9a8c09 you pointed to, as far as I can see, changed the total
size of the shared memory, not individual bits. It did explain that the users of
that space had properly padded their data, but even assuming that they did not do
that as asked, the result would (or rather should) be cache misses, not running out
of reserved memory.

It actually changed the allocation of individual bits inside the fixed-size
shared memory, as underlying users end up calling ShmemAlloc(),
but did not accurately change the actual total size of shared memory, as many
estimates are done either still using MAXALIGN or simply not accounting for the
padding.

My limited understanding is also based on a comment in CreateSharedMemoryAndSemaphores()

/*
* Size of the Postgres shared-memory block is estimated via
* moderately-accurate estimates for the big hogs, plus 100K for the
* stuff that's too small to bother with estimating.
*
* We take some care during this phase to ensure that the total size
* request doesn't overflow size_t. If this gets through, we don't
* need to be so careful during the actual allocation phase.
*/
size = 100000;

Of course, my argument falls flat if the size estimate of each of the elements is
rather an underestimate. Indeed this is the argument of the present patch, expressed in
code most prominently in hash_estimate_size, here:

size = add_size(size,
-                                       mul_size(nElementAllocs,
-                                                        mul_size(elementAllocCnt, elementSize)));
+                                       CACHELINEALIGN(Nmul_size(nElementAllocs,
+                                                                  mul_size(elementAllocCnt, elementSize))));

(Small note: Nmul_size is a typo for mul_size, but that is minor; after amending it, the
code compiles.)

Oops, I'll fix that.

To conclude, the running out of shared memory seems to me to be fixed rather
vaguely by this patch. I am not claiming that increasing the memory used by
the elements is not needed; I am claiming that I cannot clearly see how that
specific increase is needed.

I'm also not entirely convinced that those fixes are enough to explain the "out
of shared memory" issue originally reported, but that's the only class of
problem that I could spot which could possibly explain it.

As Robert initially checked, it should be at worst a 6kB underestimate with
default parameters (it may be slightly more now as we have more shared memory
users, but it should be the same order of magnitude), and I don't think that
it will vary a lot with huge shared_buffers and/or max_connections.

Those 6kB should be compared to how much room the initial 100K gives vs how
much the small stuff actually consumes.

Note that there are still some similar issues in the code. A quick look at
ProcGlobalShmemSize() vs InitProcGlobal() seems to indicate that it's missing 5
CACHELINEALIGN in the estimate. Unless someone objects, I'll soon do a full
review of all the estimates in CreateSharedMemoryAndSemaphores() and fix all
additional occurrences that I can find.

#5 Julien Rouhaud
rjuju123@gmail.com
In reply to: Tom Lane (#3)
Re: Shared memory size computation oversight?

On Wed, Mar 03, 2021 at 11:23:54AM -0500, Tom Lane wrote:

Georgios <gkokolatos@protonmail.com> writes:

My limited understanding is also based in a comment in CreateSharedMemoryAndSemaphores()

* Size of the Postgres shared-memory block is estimated via
* moderately-accurate estimates for the big hogs, plus 100K for the
* stuff that's too small to bother with estimating.

Right. That 100K slop factor is capable of hiding a multitude of sins.

I have not looked at this patch, but I think the concern is basically that
if we have space-estimation infrastructure that misestimates what it is
supposed to estimate, then if that infrastructure is used to estimate the
size of any of the "big hog" data structures, we might misestimate by
enough that the slop factor wouldn't hide it.

Exactly. And now that I looked deeper I can see that multiple estimates are
entirely ignoring the padding alignment (e.g. ProcGlobalShmemSize()), which can
exceed the 6kB originally estimated by Robert.

#6 Michael Paquier
michael@paquier.xyz
In reply to: Julien Rouhaud (#5)
Re: Shared memory size computation oversight?

On Thu, Mar 04, 2021 at 12:40:52AM +0800, Julien Rouhaud wrote:

On Wed, Mar 03, 2021 at 11:23:54AM -0500, Tom Lane wrote:

I have not looked at this patch, but I think the concern is basically that
if we have space-estimation infrastructure that misestimates what it is
supposed to estimate, then if that infrastructure is used to estimate the
size of any of the "big hog" data structures, we might misestimate by
enough that the slop factor wouldn't hide it.

Exactly. And now that I looked deeper I can see that multiple estimates are
entirely ignoring the padding alignment (e.g. ProcGlobalShmemSize()), which can
exceed the 6kB originally estimated by Robert.

We are going to have a hard time catching up with all the places that are
doing an incorrect estimation, and an even harder time making
sure that similar errors don't happen in the future. Should we add
{add,mul}_shmem_size() wrappers to make sure that the size calculated is
correctly aligned, i.e. that apply CACHELINEALIGN before returning the
result size?
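
Here is a minimal sketch of what I have in mind (names and placement are
only a proposal); these would simply wrap the existing overflow-checked
add_size()/mul_size() from shmem.c:

    /* account for the CACHELINEALIGN rounding ShmemAlloc() will apply */
    static inline Size
    add_shmem_size(Size s1, Size s2)
    {
        return add_size(s1, CACHELINEALIGN(s2));
    }

    static inline Size
    mul_shmem_size(Size s1, Size s2)
    {
        return CACHELINEALIGN(mul_size(s1, s2));
    }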
--
Michael

#7 Julien Rouhaud
rjuju123@gmail.com
In reply to: Michael Paquier (#6)
Re: Shared memory size computation oversight?

On Thu, Mar 04, 2021 at 04:05:10PM +0900, Michael Paquier wrote:

On Thu, Mar 04, 2021 at 12:40:52AM +0800, Julien Rouhaud wrote:

On Wed, Mar 03, 2021 at 11:23:54AM -0500, Tom Lane wrote:

I have not looked at this patch, but I think the concern is basically that
if we have space-estimation infrastructure that misestimates what it is
supposed to estimate, then if that infrastructure is used to estimate the
size of any of the "big hog" data structures, we might misestimate by
enough that the slop factor wouldn't hide it.

Exactly. And now that I looked deeper I can see that multiple estimates are
entirely ignoring the padding alignment (e.g. ProcGlobalShmemSize()), which can
exceed the 6kB originally estimated by Robert.

We are going to have a hard time catching up with all the places that are
doing an incorrect estimation, and an even harder time making
sure that similar errors don't happen in the future. Should we add
{add,mul}_shmem_size() wrappers to make sure that the size calculated is
correctly aligned, i.e. that apply CACHELINEALIGN before returning the
result size?

I was also considering adding new (add|mul)_*_size functions to avoid having
too messy code. I'm not terribly happy with xxx_shm_size(), as not all calls to
those functions would require an alignment. Maybe (add|mul)shmemalign_size?

But before modifying dozens of calls, should we really fix those, or just
increase the "slop factor" a bit, or a mix of both?

For instance, I can see that BackendStatusShmemSize() never had
any padding consideration, while others do.

Maybe only fixing contribs, some macros like PredXactListDataSize that already
do a MAXALIGN, SimpleLruShmemSize and hash_estimate_size() would be a short
patch and should significantly improve the estimation.

#8 Michael Paquier
michael@paquier.xyz
In reply to: Julien Rouhaud (#7)
Re: Shared memory size computation oversight?

On Thu, Mar 04, 2021 at 03:23:33PM +0800, Julien Rouhaud wrote:

I was also considering adding new (add|mul)_*_size functions to avoid having
too messy code. I'm not terribly happy with xxx_shm_size(), as not all calls to
those functions would require an alignment. Maybe (add|mul)shmemalign_size?

But before modifying dozens of calls, should we really fix those, or just
increase the "slop factor" a bit, or a mix of both?

For instance, I can see that BackendStatusShmemSize() never had
any padding consideration, while others do.

Maybe only fixing contribs, some macros like PredXactListDataSize that already
do a MAXALIGN, SimpleLruShmemSize and hash_estimate_size() would be a short
patch and should significantly improve the estimation.

The lack of complaints in this area looks to me like a sign that we
may not really need to backpatch something, so I would not be against
precise surgery, with a separate set of {add,mul}_size() routines
that are used where appropriate, so that it is easy to track down which size
estimations expect an extra padding. I would be curious to hear more
thoughts from others here.

That said, calling a new routine something like add_shmem_align_size
makes its purpose clear.
--
Michael

#9 Georgios
gkokolatos@protonmail.com
In reply to: Michael Paquier (#8)
Re: Shared memory size computation oversight?

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Thursday, March 4, 2021 9:21 AM, Michael Paquier <michael@paquier.xyz> wrote:

On Thu, Mar 04, 2021 at 03:23:33PM +0800, Julien Rouhaud wrote:

I was also considering adding new (add|mul)_*_size functions to avoid having
too messy code. I'm not terribly happy with xxx_shm_size(), as not all calls to
those functions would require an alignment. Maybe (add|mul)shmemalign_size?
But before modifying dozens of calls, should we really fix those, or just
increase the "slop factor" a bit, or a mix of both?
For instance, I can see that BackendStatusShmemSize() never had
any padding consideration, while others do.
Maybe only fixing contribs, some macros like PredXactListDataSize that already
do a MAXALIGN, SimpleLruShmemSize and hash_estimate_size() would be a short
patch and should significantly improve the estimation.

The lack of complaints in this area looks to me like a sign that we
may not really need to backpatch something, so I would not be against
precise surgery, with a separate set of {add,mul}_size() routines
that are used where appropriate, so that it is easy to track down which size
estimations expect an extra padding. I would be curious to hear more
thoughts from others here.

That said, calling a new routine something like add_shmem_align_size
makes its purpose clear.

My limited opinion on the matter, after spending some time yesterday going through
the related code, is that with the current API it is easy to miss something
and underestimate, or just be sloppy. If only from the readability point of
view, adding the proposed add_shmem_align_size will be beneficial.

I hold no opinion on backpatching.

//Georgios


#10 Julien Rouhaud
rjuju123@gmail.com
In reply to: Georgios (#9)
1 attachment(s)
Re: Shared memory size computation oversight?

On Thu, Mar 04, 2021 at 08:43:51AM +0000, Georgios wrote:

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Thursday, March 4, 2021 9:21 AM, Michael Paquier <michael@paquier.xyz> wrote:

On Thu, Mar 04, 2021 at 03:23:33PM +0800, Julien Rouhaud wrote:

I was also considering adding new (add|mul)_*_size functions to avoid having
too messy code. I'm not terribly happy with xxx_shm_size(), as not all calls to
those functions would require an alignment. Maybe (add|mul)shmemalign_size?
But before modifying dozens of calls, should we really fix those, or just
increase the "slop factor" a bit, or a mix of both?
For instance, I can see that BackendStatusShmemSize() never had
any padding consideration, while others do.
Maybe only fixing contribs, some macros like PredXactListDataSize that already
do a MAXALIGN, SimpleLruShmemSize and hash_estimate_size() would be a short
patch and should significantly improve the estimation.

The lack of complaints in this area looks to me like a sign that we
may not really need to backpatch something, so I would not be against
precise surgery, with a separate set of {add,mul}_size() routines
that are used where appropriate, so that it is easy to track down which size
estimations expect an extra padding. I would be curious to hear more
thoughts from others here.

That said, calling a new routine something like add_shmem_align_size
makes its purpose clear.

My limited opinion on the matter, after spending some time yesterday going through
the related code, is that with the current API it is easy to miss something
and underestimate, or just be sloppy. If only from the readability point of
view, adding the proposed add_shmem_align_size will be beneficial.

I hold no opinion on backpatching.

So here's a v2 patch which hopefully accounts for all unaccounted alignment
padding. There's also a significant change in the shared hash table size
estimation. AFAICT, allocation will be done this way:

- num_freelists allocations of init_size / num_freelists entries (on average)
for a partitioned htab, num_freelists being 1 for a non-partitioned table,
NUM_FREELISTS (32) otherwise.

- then the rest of the entries, if any, are allocated in groups of
choose_nelem_alloc() entries.
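
As a sanity check, here is that allocation pattern in numbers, for a
hypothetical partitioned table (the choose_nelem_alloc() result of 48 is
only assumed for the sake of illustration):

    long init_entries = 4096;
    long num_entries  = 5000;
    int  nFreelists   = 32;     /* NUM_FREELISTS for a partitioned htab */

    long perFreelist = init_entries / nFreelists;   /* 32 allocs of 128 entries */
    long remaining   = num_entries - init_entries;  /* 904 entries left */
    long nelem       = 48;                          /* choose_nelem_alloc(), assumed */
    long nAllocs     = (remaining - 1) / nelem + 1; /* 19 groups of 48 entries */
    /* each of the 32 + 19 allocations is rounded up to CACHELINEALIGN separately */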

With this patch applied, I see an extra 16KB of shared memory being requested.

Attachments:

v2-0001-Fix-various-shared-memory-estimates.patch (text/x-diff; charset=us-ascii)
From ddc03fbbebb8486d10b0fb0718bf6e4ed1291759 Mon Sep 17 00:00:00 2001
From: Julien Rouhaud <julien.rouhaud@free.fr>
Date: Sat, 27 Feb 2021 15:45:29 +0800
Subject: [PATCH v2] Fix various shared memory estimates.

---
 contrib/pg_prewarm/autoprewarm.c              |  2 +-
 .../pg_stat_statements/pg_stat_statements.c   |  2 +-
 src/backend/access/common/syncscan.c          |  2 +-
 src/backend/access/nbtree/nbtutils.c          |  3 +-
 src/backend/access/transam/multixact.c        |  2 +-
 src/backend/access/transam/slru.c             |  2 +-
 src/backend/access/transam/twophase.c         |  2 +-
 src/backend/access/transam/xlog.c             |  2 +-
 src/backend/commands/async.c                  |  1 +
 src/backend/postmaster/autovacuum.c           |  3 +-
 src/backend/postmaster/bgworker.c             |  2 +-
 src/backend/postmaster/checkpointer.c         |  2 +-
 src/backend/postmaster/pgstat.c               | 17 ++---
 src/backend/postmaster/postmaster.c           |  2 +-
 src/backend/replication/logical/launcher.c    |  3 +-
 src/backend/replication/logical/origin.c      |  3 +-
 src/backend/replication/slot.c                |  2 +-
 src/backend/replication/walreceiverfuncs.c    |  2 +-
 src/backend/replication/walsender.c           |  2 +-
 src/backend/storage/buffer/buf_init.c         | 14 ++--
 src/backend/storage/buffer/buf_table.c        |  3 +-
 src/backend/storage/buffer/freelist.c         |  2 +-
 src/backend/storage/ipc/dsm.c                 |  2 +-
 src/backend/storage/ipc/pmsignal.c            |  2 +-
 src/backend/storage/ipc/procarray.c           |  8 +--
 src/backend/storage/ipc/procsignal.c          |  2 +-
 src/backend/storage/ipc/sinvaladt.c           |  2 +-
 src/backend/storage/lmgr/lock.c               | 16 ++++-
 src/backend/storage/lmgr/lwlock.c             |  2 +-
 src/backend/storage/lmgr/predicate.c          | 22 +++---
 src/backend/storage/lmgr/proc.c               | 12 ++--
 src/backend/utils/hash/dynahash.c             | 69 ++++++++++++++-----
 src/backend/utils/time/snapmgr.c              |  2 +-
 src/include/storage/predicate_internals.h     |  4 +-
 src/include/storage/shmem.h                   | 13 ++++
 src/include/utils/hsearch.h                   |  2 +
 36 files changed, 146 insertions(+), 87 deletions(-)

diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c
index b3f73ea92d..887e68b288 100644
--- a/contrib/pg_prewarm/autoprewarm.c
+++ b/contrib/pg_prewarm/autoprewarm.c
@@ -138,7 +138,7 @@ _PG_init(void)
 
 	EmitWarningsOnPlaceholders("pg_prewarm");
 
-	RequestAddinShmemSpace(MAXALIGN(sizeof(AutoPrewarmSharedState)));
+	RequestAddinShmemSpace(CACHELINEALIGN(sizeof(AutoPrewarmSharedState)));
 
 	/* Register autoprewarm worker, if enabled. */
 	if (autoprewarm)
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 62cccbfa44..4e045ffba7 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -1933,7 +1933,7 @@ pgss_memsize(void)
 {
 	Size		size;
 
-	size = MAXALIGN(sizeof(pgssSharedState));
+	size = CACHELINEALIGN(sizeof(pgssSharedState));
 	size = add_size(size, hash_estimate_size(pgss_max, sizeof(pgssEntry)));
 
 	return size;
diff --git a/src/backend/access/common/syncscan.c b/src/backend/access/common/syncscan.c
index b7a28af4ad..c19f9cab59 100644
--- a/src/backend/access/common/syncscan.c
+++ b/src/backend/access/common/syncscan.c
@@ -125,7 +125,7 @@ static BlockNumber ss_search(RelFileNode relfilenode,
 Size
 SyncScanShmemSize(void)
 {
-	return SizeOfScanLocations(SYNC_SCAN_NELEM);
+	return (Size) CACHELINEALIGN(SizeOfScanLocations(SYNC_SCAN_NELEM));
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index d524310723..310ea904df 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -2066,7 +2066,8 @@ BTreeShmemSize(void)
 
 	size = offsetof(BTVacInfo, vacuums);
 	size = add_size(size, mul_size(MaxBackends, sizeof(BTOneVacInfo)));
-	return size;
+
+	return (Size) CACHELINEALIGN(size);
 }
 
 /*
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 1f9f1a1fa1..c789623c34 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1830,7 +1830,7 @@ MultiXactShmemSize(void)
 	add_size(offsetof(MultiXactStateData, perBackendXactIds) + sizeof(MultiXactId), \
 			 mul_size(sizeof(MultiXactId) * 2, MaxOldestSlot))
 
-	size = SHARED_MULTIXACT_STATE_SIZE;
+	size = CACHELINEALIGN(SHARED_MULTIXACT_STATE_SIZE);
 	size = add_size(size, SimpleLruShmemSize(NUM_MULTIXACTOFFSET_BUFFERS, 0));
 	size = add_size(size, SimpleLruShmemSize(NUM_MULTIXACTMEMBER_BUFFERS, 0));
 
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 82149ad782..2271e092e9 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -168,7 +168,7 @@ SimpleLruShmemSize(int nslots, int nlsns)
 	if (nlsns > 0)
 		sz += MAXALIGN(nslots * nlsns * sizeof(XLogRecPtr));	/* group_lsn[] */
 
-	return BUFFERALIGN(sz) + BLCKSZ * nslots;
+	return (Size) CACHELINEALIGN(BUFFERALIGN(sz) + BLCKSZ * nslots);
 }
 
 /*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 80d2d20d6c..fca70fabac 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -245,7 +245,7 @@ TwoPhaseShmemSize(void)
 	size = add_size(size, mul_size(max_prepared_xacts,
 								   sizeof(GlobalTransactionData)));
 
-	return size;
+	return (Size) CACHELINEALIGN(size);
 }
 
 void
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 377afb8732..64de453340 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5096,7 +5096,7 @@ XLOGShmemSize(void)
 	 * routine again below to compute the actual allocation size.
 	 */
 
-	return size;
+	return (Size) CACHELINEALIGN(size);
 }
 
 void
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4b16fb5682..6a0fd6ca10 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -513,6 +513,7 @@ AsyncShmemSize(void)
 	/* This had better match AsyncShmemInit */
 	size = mul_size(MaxBackends + 1, sizeof(QueueBackendStatus));
 	size = add_size(size, offsetof(AsyncQueueControl, backend));
+	size = (Size) CACHELINEALIGN(size);
 
 	size = add_size(size, SimpleLruShmemSize(NUM_NOTIFY_BUFFERS, 0));
 
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 23ef23c13e..7ddf86af74 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -3414,7 +3414,8 @@ AutoVacuumShmemSize(void)
 	size = MAXALIGN(size);
 	size = add_size(size, mul_size(autovacuum_max_workers,
 								   sizeof(WorkerInfoData)));
-	return size;
+
+	return (Size) CACHELINEALIGN(size);
 }
 
 /*
diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index bbbc09b0b5..d2f838eaed 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -148,7 +148,7 @@ BackgroundWorkerShmemSize(void)
 	size = add_size(size, mul_size(max_worker_processes,
 								   sizeof(BackgroundWorkerSlot)));
 
-	return size;
+	return (Size) CACHELINEALIGN(size);
 }
 
 /*
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 76f9f98ebb..6acd5e4c2b 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -862,7 +862,7 @@ CheckpointerShmemSize(void)
 	size = offsetof(CheckpointerShmemStruct, requests);
 	size = add_size(size, mul_size(NBuffers, sizeof(CheckpointerRequest)));
 
-	return size;
+	return (Size) CACHELINEALIGN(size);
 }
 
 /*
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f75b52719d..b08f7f1d56 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -2930,25 +2930,20 @@ BackendStatusShmemSize(void)
 	Size		size;
 
 	/* BackendStatusArray: */
-	size = mul_size(sizeof(PgBackendStatus), NumBackendStatSlots);
+	size = mul_size_and_shmem_align(sizeof(PgBackendStatus), NumBackendStatSlots);
 	/* BackendAppnameBuffer: */
-	size = add_size(size,
-					mul_size(NAMEDATALEN, NumBackendStatSlots));
+	size = add_shmem_aligned_size(size, mul_size(NAMEDATALEN, NumBackendStatSlots));
 	/* BackendClientHostnameBuffer: */
-	size = add_size(size,
-					mul_size(NAMEDATALEN, NumBackendStatSlots));
+	size = add_shmem_aligned_size(size, mul_size(NAMEDATALEN, NumBackendStatSlots));
 	/* BackendActivityBuffer: */
-	size = add_size(size,
-					mul_size(pgstat_track_activity_query_size, NumBackendStatSlots));
+	size = add_shmem_aligned_size(size, mul_size(pgstat_track_activity_query_size, NumBackendStatSlots));
 #ifdef USE_SSL
 	/* BackendSslStatusBuffer: */
-	size = add_size(size,
-					mul_size(sizeof(PgBackendSSLStatus), NumBackendStatSlots));
+	size = add_shmem_aligned_size(size, mul_size(sizeof(PgBackendSSLStatus), NumBackendStatSlots));
 #endif
 #ifdef ENABLE_GSS
 	/* BackendGssStatusBuffer: */
-	size = add_size(size,
-					mul_size(sizeof(PgBackendGSSStatus), NumBackendStatSlots));
+	size = add_shmem_aligned_size(size, mul_size(sizeof(PgBackendGSSStatus), NumBackendStatSlots));
 #endif
 	return size;
 }
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index edab95a19e..716f7434e6 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -6425,7 +6425,7 @@ restore_backend_variables(BackendParameters *param, Port *port)
 Size
 ShmemBackendArraySize(void)
 {
-	return mul_size(MaxLivePostmasterChildren(), sizeof(Backend));
+	return mul_size_and_shmem_align(MaxLivePostmasterChildren(), sizeof(Backend));
 }
 
 void
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index cb462a052a..7d20877920 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -697,7 +697,8 @@ ApplyLauncherShmemSize(void)
 	size = MAXALIGN(size);
 	size = add_size(size, mul_size(max_logical_replication_workers,
 								   sizeof(LogicalRepWorker)));
-	return size;
+
+	return (Size) CACHELINEALIGN(size);
 }
 
 /*
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 39471fddad..b9e9fa121a 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -503,7 +503,8 @@ ReplicationOriginShmemSize(void)
 
 	size = add_size(size,
 					mul_size(max_replication_slots, sizeof(ReplicationState)));
-	return size;
+
+	return (Size) CACHELINEALIGN(size);
 }
 
 void
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 75a087c2f9..c399c491e9 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -124,7 +124,7 @@ ReplicationSlotsShmemSize(void)
 	size = add_size(size,
 					mul_size(max_replication_slots, sizeof(ReplicationSlot)));
 
-	return size;
+	return (Size) CACHELINEALIGN(size);
 }
 
 /*
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 63e60478ea..a8cab3fa28 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -45,7 +45,7 @@ WalRcvShmemSize(void)
 
 	size = add_size(size, sizeof(WalRcvData));
 
-	return size;
+	return (Size) CACHELINEALIGN(size);
 }
 
 /* Allocate and initialize walreceiver-related shared memory */
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 23baa4498a..d1c91cf354 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -3055,7 +3055,7 @@ WalSndShmemSize(void)
 	size = offsetof(WalSndCtlData, walsnds);
 	size = add_size(size, mul_size(max_wal_senders, sizeof(WalSnd)));
 
-	return size;
+	return (Size) CACHELINEALIGN(size);
 }
 
 /* Allocate and initialize walsender-related shared memory */
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index e9e4f35bb5..b92fdeef14 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -159,15 +159,13 @@ BufferShmemSize(void)
 	Size		size = 0;
 
 	/* size of buffer descriptors */
-	size = add_size(size, mul_size(NBuffers, sizeof(BufferDescPadded)));
-	/* to allow aligning buffer descriptors */
-	size = add_size(size, PG_CACHE_LINE_SIZE);
+	size = add_shmem_aligned_size(size, mul_size(NBuffers, sizeof(BufferDescPadded)));
 
 	/* size of data pages */
-	size = add_size(size, mul_size(NBuffers, BLCKSZ));
+	size = add_shmem_aligned_size(size, mul_size(NBuffers, BLCKSZ));
 
 	/* size of stuff controlled by freelist.c */
-	size = add_size(size, StrategyShmemSize());
+	size = add_shmem_aligned_size(size, StrategyShmemSize());
 
 	/*
 	 * It would be nice to include the I/O locks in the BufferDesc, but that
@@ -178,12 +176,10 @@ BufferShmemSize(void)
 	 * locks are not highly contended, we lay out the array with minimal
 	 * padding.
 	 */
-	size = add_size(size, mul_size(NBuffers, sizeof(LWLockMinimallyPadded)));
-	/* to allow aligning the above */
-	size = add_size(size, PG_CACHE_LINE_SIZE);
+	size = add_shmem_aligned_size(size, mul_size(NBuffers, sizeof(LWLockMinimallyPadded)));
 
 	/* size of checkpoint sort array in bufmgr.c */
-	size = add_size(size, mul_size(NBuffers, sizeof(CkptSortItem)));
+	size = add_shmem_aligned_size(size, mul_size(NBuffers, sizeof(CkptSortItem)));
 
 	return size;
 }
diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index caa03ae123..4200f99340 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -41,7 +41,8 @@ static HTAB *SharedBufHash;
 Size
 BufTableShmemSize(int size)
 {
-	return hash_estimate_size(size, sizeof(BufferLookupEnt));
+	return hash_estimate_size_ext(size, sizeof(BufferLookupEnt),
+								  size, true);
 }
 
 /*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 6be80476db..33e9e74f40 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -459,7 +459,7 @@ StrategyShmemSize(void)
 	size = add_size(size, BufTableShmemSize(NBuffers + NUM_BUFFER_PARTITIONS));
 
 	/* size of the shared replacement strategy control block */
-	size = add_size(size, MAXALIGN(sizeof(BufferStrategyControl)));
+	size = add_shmem_aligned_size(size, sizeof(BufferStrategyControl));
 
 	return size;
 }
diff --git a/src/backend/storage/ipc/dsm.c b/src/backend/storage/ipc/dsm.c
index b461a5f7e9..6ecf195c55 100644
--- a/src/backend/storage/ipc/dsm.c
+++ b/src/backend/storage/ipc/dsm.c
@@ -440,7 +440,7 @@ dsm_set_control_handle(dsm_handle h)
 size_t
 dsm_estimate_size(void)
 {
-	return 1024 * 1024 * (size_t) min_dynamic_shared_memory;
+	return (size_t) CACHELINEALIGN(1024 * 1024 * min_dynamic_shared_memory);
 }
 
 /*
diff --git a/src/backend/storage/ipc/pmsignal.c b/src/backend/storage/ipc/pmsignal.c
index 280c2395c9..d25080224a 100644
--- a/src/backend/storage/ipc/pmsignal.c
+++ b/src/backend/storage/ipc/pmsignal.c
@@ -124,7 +124,7 @@ PMSignalShmemSize(void)
 	size = add_size(size, mul_size(MaxLivePostmasterChildren(),
 								   sizeof(sig_atomic_t)));
 
-	return size;
+	return (Size) CACHELINEALIGN(size);
 }
 
 /*
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 4fc6ffb917..9b804f1152 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -358,6 +358,7 @@ ProcArrayShmemSize(void)
 
 	size = offsetof(ProcArrayStruct, pgprocnos);
 	size = add_size(size, mul_size(sizeof(int), PROCARRAY_MAXPROCS));
+	size = CACHELINEALIGN(size);
 
 	/*
 	 * During Hot Standby processing we have a data structure called
@@ -377,11 +378,8 @@ ProcArrayShmemSize(void)
 
 	if (EnableHotStandby)
 	{
-		size = add_size(size,
-						mul_size(sizeof(TransactionId),
-								 TOTAL_MAX_CACHED_SUBXIDS));
-		size = add_size(size,
-						mul_size(sizeof(bool), TOTAL_MAX_CACHED_SUBXIDS));
+		size = add_shmem_aligned_size(size, mul_size(sizeof(TransactionId), TOTAL_MAX_CACHED_SUBXIDS));
+		size = add_shmem_aligned_size(size, mul_size(sizeof(bool), TOTAL_MAX_CACHED_SUBXIDS));
 	}
 
 	return size;
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index c6a8d4611e..2af7f3e343 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -113,7 +113,7 @@ ProcSignalShmemSize(void)
 
 	size = mul_size(NumProcSignalSlots, sizeof(ProcSignalSlot));
 	size = add_size(size, offsetof(ProcSignalHeader, psh_slot));
-	return size;
+	return (Size) CACHELINEALIGN(size);
 }
 
 /*
diff --git a/src/backend/storage/ipc/sinvaladt.c b/src/backend/storage/ipc/sinvaladt.c
index 946bd8e3cb..8379bf12e7 100644
--- a/src/backend/storage/ipc/sinvaladt.c
+++ b/src/backend/storage/ipc/sinvaladt.c
@@ -207,7 +207,7 @@ SInvalShmemSize(void)
 	size = offsetof(SISeg, procState);
 	size = add_size(size, mul_size(sizeof(ProcState), MaxBackends));
 
-	return size;
+	return (Size) CACHELINEALIGN(size);
 }
 
 /*
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 108b4d9023..f2fc8af5c5 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -54,7 +54,7 @@
 int			max_locks_per_xact; /* set by guc.c */
 
 #define NLOCKENTS() \
-	mul_size(max_locks_per_xact, add_size(MaxBackends, max_prepared_xacts))
+	mul_size_and_shmem_align(max_locks_per_xact, add_size(MaxBackends, max_prepared_xacts))
 
 
 /*
@@ -3527,17 +3527,27 @@ LockShmemSize(void)
 
 	/* lock hash table */
 	max_table_size = NLOCKENTS();
-	size = add_size(size, hash_estimate_size(max_table_size, sizeof(LOCK)));
+	size = add_size(size, hash_estimate_size_ext(max_table_size, sizeof(LOCK),
+												 max_table_size / 2,
+												 true));
 
 	/* proclock hash table */
 	max_table_size *= 2;
-	size = add_size(size, hash_estimate_size(max_table_size, sizeof(PROCLOCK)));
+	size = add_size(size, hash_estimate_size_ext(max_table_size, sizeof(PROCLOCK),
+												 max_table_size / 2,
+												 true));
 
 	/*
 	 * Since NLOCKENTS is only an estimate, add 10% safety margin.
 	 */
 	size = add_size(size, size / 10);
 
+	/*
+	 * Note: we don't count fast-path structure, it comes out of the "slop
+	 * factor" added by CreateSharedMemoryAndSemaphores.  This lets us use this
+	 * routine again below to compute the actual allocation size.
+	 */
+
 	return size;
 }
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 8cb6a6f042..6a47918616 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -456,7 +456,7 @@ LWLockShmemSize(void)
 	/* Disallow adding any more named tranches. */
 	lock_named_request_allowed = false;
 
-	return size;
+	return (Size) CACHELINEALIGN(size);
 }
 
 /*
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index d493aeef0f..1dcadc00ae 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -1359,13 +1359,17 @@ PredicateLockShmemSize(void)
 
 	/* predicate lock target hash table */
 	max_table_size = NPREDICATELOCKTARGETENTS();
-	size = add_size(size, hash_estimate_size(max_table_size,
-											 sizeof(PREDICATELOCKTARGET)));
+	size = add_size(size, hash_estimate_size_ext(max_table_size,
+												 sizeof(PREDICATELOCKTARGET),
+												 max_table_size,
+												 true));
 
 	/* predicate lock hash table */
 	max_table_size *= 2;
-	size = add_size(size, hash_estimate_size(max_table_size,
-											 sizeof(PREDICATELOCK)));
+	size = add_size(size, hash_estimate_size_ext(max_table_size,
+												 sizeof(PREDICATELOCK),
+												 max_table_size,
+												 true));
 
 	/*
 	 * Since NPREDICATELOCKTARGETENTS is only an estimate, add 10% safety
@@ -1377,8 +1381,7 @@ PredicateLockShmemSize(void)
 	max_table_size = MaxBackends + max_prepared_xacts;
 	max_table_size *= 10;
 	size = add_size(size, PredXactListDataSize);
-	size = add_size(size, mul_size((Size) max_table_size,
-								   PredXactListElementDataSize));
+	size = add_shmem_aligned_size(size, mul_size((Size) max_table_size, PredXactListElementDataSize));
 
 	/* transaction xid table */
 	size = add_size(size, hash_estimate_size(max_table_size,
@@ -1387,14 +1390,13 @@ PredicateLockShmemSize(void)
 	/* rw-conflict pool */
 	max_table_size *= 5;
 	size = add_size(size, RWConflictPoolHeaderDataSize);
-	size = add_size(size, mul_size((Size) max_table_size,
-								   RWConflictDataSize));
+	size = add_shmem_aligned_size(size, mul_size((Size) max_table_size, RWConflictDataSize));
 
 	/* Head for list of finished serializable transactions. */
-	size = add_size(size, sizeof(SHM_QUEUE));
+	size = add_shmem_aligned_size(size, sizeof(SHM_QUEUE));
 
 	/* Shared memory structures for SLRU tracking of old committed xids. */
-	size = add_size(size, sizeof(SerialControlData));
+	size = add_shmem_aligned_size(size, sizeof(SerialControlData));
 	size = add_size(size, SimpleLruShmemSize(NUM_SERIAL_BUFFERS, 0));
 
 	return size;
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 897045ee27..28031c7d22 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -106,13 +106,13 @@ ProcGlobalShmemSize(void)
 		add_size(MaxBackends, add_size(NUM_AUXILIARY_PROCS, max_prepared_xacts));
 
 	/* ProcGlobal */
-	size = add_size(size, sizeof(PROC_HDR));
-	size = add_size(size, mul_size(TotalProcs, sizeof(PGPROC)));
-	size = add_size(size, sizeof(slock_t));
+	size = add_shmem_aligned_size(size, sizeof(PROC_HDR));
+	size = add_shmem_aligned_size(size, mul_size(TotalProcs, sizeof(PGPROC)));
+	size = add_shmem_aligned_size(size, sizeof(slock_t));
 
-	size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->xids)));
-	size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->subxidStates)));
-	size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->statusFlags)));
+	size = add_shmem_aligned_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->xids)));
+	size = add_shmem_aligned_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->subxidStates)));
+	size = add_shmem_aligned_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->statusFlags)));
 
 	return size;
 }
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 6546e3c7c7..030ca3917c 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -769,23 +769,43 @@ init_htab(HTAB *hashp, long nelem)
 	return true;
 }
 
+/*
+ * Wrapper around hash_estimate_size for non partitioned hash table and all
+ * entries preallocated.
+ */
+Size
+hash_estimate_size(long num_entries, Size entrysize)
+{
+	return hash_estimate_size_ext(num_entries, entrysize, num_entries, false);
+}
+
 /*
  * Estimate the space needed for a hashtable containing the given number
- * of entries of given size.
+ * of entries of given size, and given it's partitioned or not.
  * NOTE: this is used to estimate the footprint of hashtables in shared
  * memory; therefore it does not count HTAB which is in local memory.
  * NB: assumes that all hash structure parameters have default values!
  */
 Size
-hash_estimate_size(long num_entries, Size entrysize)
+hash_estimate_size_ext(long num_entries, Size entrysize, long init_entries,
+					   bool partitioned)
 {
-	Size		size;
+	Size		size,
+				segmentSize,
+				freelistAllocSize;
 	long		nBuckets,
 				nSegments,
 				nDirEntries,
-				nElementAllocs,
 				elementSize,
-				elementAllocCnt;
+				nelemFreelist;
+	int			nFreelists;
+
+	if (partitioned)
+		nFreelists = NUM_FREELISTS;
+	else
+		nFreelists = 1;
+
+	Assert(init_entries <= num_entries);
 
 	/* estimate number of buckets wanted */
 	nBuckets = next_pow2_long(num_entries);
@@ -797,19 +817,36 @@ hash_estimate_size(long num_entries, Size entrysize)
 		nDirEntries <<= 1;		/* dir_alloc doubles dsize at each call */
 
 	/* fixed control info */
-	size = MAXALIGN(sizeof(HASHHDR));	/* but not HTAB, per above */
+	size = CACHELINEALIGN(sizeof(HASHHDR));	/* but not HTAB, per above */
 	/* directory */
-	size = add_size(size, mul_size(nDirEntries, sizeof(HASHSEGMENT)));
+	size = add_shmem_aligned_size(size, mul_size(nDirEntries, sizeof(HASHSEGMENT)));
 	/* segments */
-	size = add_size(size, mul_size(nSegments,
-								   MAXALIGN(DEF_SEGSIZE * sizeof(HASHBUCKET))));
-	/* elements --- allocated in groups of choose_nelem_alloc() entries */
-	elementAllocCnt = choose_nelem_alloc(entrysize);
-	nElementAllocs = (num_entries - 1) / elementAllocCnt + 1;
+	segmentSize = CACHELINEALIGN(DEF_SEGSIZE * sizeof(HASHBUCKET));
+	size = add_size(size, mul_size(nSegments, segmentSize));
+
 	elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(entrysize);
-	size = add_size(size,
-					mul_size(nElementAllocs,
-							 mul_size(elementAllocCnt, elementSize)));
+
+	/*
+	 * elements --- allocated in groups of choose_nelem_alloc() entries, except
+	 * for the first init_entries, allocated in num_freelists groups.
+	 */
+	nelemFreelist = init_entries / nFreelists;
+	freelistAllocSize = mul_size_and_shmem_align(nelemFreelist, elementSize);
+
+	size = add_size(size, mul_size(nFreelists, freelistAllocSize));
+
+	if (num_entries > init_entries)
+	{
+		Size	nelem,
+				nAlloc,
+				allocSize;
+
+		nelem = choose_nelem_alloc(entrysize);
+		nAlloc = (num_entries - init_entries - 1) / nelem + 1;
+		allocSize = mul_size_and_shmem_align(nelem, elementSize);
+
+		size = add_size(size, mul_size(nAlloc, allocSize));
+	}
 
 	return size;
 }
@@ -852,7 +889,7 @@ hash_get_shared_size(HASHCTL *info, int flags)
 {
 	Assert(flags & HASH_DIRSIZE);
 	Assert(info->dsize == info->max_dsize);
-	return sizeof(HASHHDR) + info->dsize * sizeof(HASHSEGMENT);
+	return (Size) CACHELINEALIGN(sizeof(HASHHDR) + info->dsize * sizeof(HASHSEGMENT));
 }
 
 
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 95704265b6..b66278e466 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -203,7 +203,7 @@ SnapMgrShmemSize(void)
 		size = add_size(size, mul_size(sizeof(TransactionId),
 									   OLD_SNAPSHOT_TIME_MAP_ENTRIES));
 
-	return size;
+	return (Size) CACHELINEALIGN(size);
 }
 
 /*
diff --git a/src/include/storage/predicate_internals.h b/src/include/storage/predicate_internals.h
index 104f560d38..bbc6ca4cbf 100644
--- a/src/include/storage/predicate_internals.h
+++ b/src/include/storage/predicate_internals.h
@@ -192,7 +192,7 @@ typedef struct PredXactListData
 typedef struct PredXactListData *PredXactList;
 
 #define PredXactListDataSize \
-		((Size)MAXALIGN(sizeof(PredXactListData)))
+		((Size)CACHELINEALIGN(sizeof(PredXactListData)))
 
 
 /*
@@ -227,7 +227,7 @@ typedef struct RWConflictPoolHeaderData
 typedef struct RWConflictPoolHeaderData *RWConflictPoolHeader;
 
 #define RWConflictPoolHeaderDataSize \
-		((Size)MAXALIGN(sizeof(RWConflictPoolHeaderData)))
+		((Size)CACHELINEALIGN(sizeof(RWConflictPoolHeaderData)))
 
 
 /*
diff --git a/src/include/storage/shmem.h b/src/include/storage/shmem.h
index 31024e5a50..897672bc8e 100644
--- a/src/include/storage/shmem.h
+++ b/src/include/storage/shmem.h
@@ -45,6 +45,19 @@ extern void *ShmemInitStruct(const char *name, Size size, bool *foundPtr);
 extern Size add_size(Size s1, Size s2);
 extern Size mul_size(Size s1, Size s2);
 
+/*
+ * Macro around add_size and mul_size to account for extra padding as done in
+ * ShmemAlloc().
+ *
+ * add_shmem_aligned_size() will add the padding bytes to the 2nd argument
+ * mul_size_and_shmem_align will return the aligned size
+ */
+#define add_shmem_aligned_size(s1, s2)	add_size((s1), \
+		(Size) CACHELINEALIGN(s2))
+
+#define mul_size_and_shmem_align(s1, s2)	add_shmem_aligned_size(0, \
+		mul_size((s1), (s2)))
+
 /* ipci.c */
 extern void RequestAddinShmemSpace(Size size);
 
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index d7af0239c8..20c0ff3036 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -145,6 +145,8 @@ extern void *hash_seq_search(HASH_SEQ_STATUS *status);
 extern void hash_seq_term(HASH_SEQ_STATUS *status);
 extern void hash_freeze(HTAB *hashp);
 extern Size hash_estimate_size(long num_entries, Size entrysize);
+extern Size hash_estimate_size_ext(long num_entries, Size entrysize,
+								   long init_entries, bool partitioned);
 extern long hash_select_dirsize(long num_entries);
 extern Size hash_get_shared_size(HASHCTL *info, int flags);
 extern void AtEOXact_HashTables(bool isCommit);
-- 
2.30.1

#11 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Georgios (#9)
1 attachment(s)
Re: Shared memory size computation oversight?

I spent a bit of time looking into this. Using current HEAD, I
instrumented CreateSharedMemoryAndSemaphores to log the size estimates
returned by the various estimation subroutines, plus the shared memory
space actually consumed (i.e, the change in ShmemSegHdr->freeoffset)
by the various shared memory initialization functions. There were only
two estimates that were way off: LockShmemSize requested 1651771 more
bytes than InitLocks actually consumed, and PredicateLockShmemSize
requested 167058 more bytes than InitPredicateLocks consumed. I believe
both of those are intentional, reflecting space that may be eaten by the
lock tables later.

Meanwhile, looking at ShmemSegHdr->freeoffset vs ShmemSegHdr->totalsize,
the actual remaining shmem space after postmaster startup is 1919488
bytes. (Running the core regression tests doesn't consume any of that
remaining space, btw.) Subtracting the extra lock-table space, we have
100659 bytes to spare, which is as near as makes no difference to the
intended slop of 100000 bytes.
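
(Checking the arithmetic: 1919488 - 1651771 - 167058 = 100659.)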

My conclusion is that we don't need to do anything; indeed, the proposed
changes will probably just lead to overestimation.

It's certainly possible that there's something amiss somewhere. These
numbers were all taken with out-of-the-box configuration, so it could be
that changing some postgresql.conf entries would expose that some module
is not scaling its request correctly. Also, I don't have any extensions
loaded, so this proves nothing about the correctness of any of those.
But it appears to me that the general scheme is working perfectly fine,
so we do not need to complicate it.

regards, tom lane

Attachments:

shmem-consumption.txt (text/plain; charset=us-ascii)
#12 Julien Rouhaud
rjuju123@gmail.com
In reply to: Tom Lane (#11)
Re: Shared memory size computation oversight?

On Thu, Mar 04, 2021 at 12:53:54PM -0500, Tom Lane wrote:

My conclusion is that we don't need to do anything; indeed, the proposed
changes will probably just lead to overestimation.

It's certainly possible that there's something amiss somewhere. These
numbers were all taken with out-of-the-box configuration, so it could be
that changing some postgresql.conf entries would expose that some module
is not scaling its request correctly. Also, I don't have any extensions
loaded, so this proves nothing about the correctness of any of those.
But it appears to me that the general scheme is working perfectly fine,
so we do not need to complicate it.

Thanks for looking at it. I agree that most of the changes aren't worth the
complication and the risk of overestimating the shmem size.

But I did some more experiments comparing the current version and the patched
version of the lock and pred lock shared hash table size estimations with some
configuration changes, and I found the following deltas in the estimated size:

- original configuration : 15kB
- max_connections = 1000 : 30kB
- max_connections = 1000 and max_locks_per_xact = 256 : 96kB
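
For reference, a rough back-of-the-envelope view of why the error grows with
those settings, based on the NLOCKENTS() definition in lock.c (the numbers
are approximate, since MaxBackends also counts background workers):

    /*
     * NLOCKENTS() = max_locks_per_xact * (MaxBackends + max_prepared_xacts)
     *
     * defaults (max_connections = 100, max_locks_per_xact = 64):
     *   ~64 * 100   =   ~6,400 LOCK entries,  ~12,800 PROCLOCK entries
     * max_connections = 1000, max_locks_per_xact = 256:
     *   ~256 * 1000 = ~256,000 LOCK entries, ~512,000 PROCLOCK entries
     *
     * dynahash pads each element-group allocation to CACHELINEALIGN, so the
     * absolute estimation error scales with the number of such groups.
     */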

I don't know if my version is totally wrong or not, but if it's not, it could be
worthwhile to apply the hash table part, as it would mean that the current
estimate doesn't scale properly.

#13 Peter Eisentraut
peter.eisentraut@enterprisedb.com
In reply to: Julien Rouhaud (#1)
Re: Shared memory size computation oversight?

On 27.02.21 09:08, Julien Rouhaud wrote:

PFA a patch that fixes pg_prewarm and pg_stat_statements explicit alignment to
CACHELINEALIGN, and also update the alignment in hash_estimate_size() to what I
think ShmemInitHash will actually consume.

For extensions, wouldn't it make things better if
RequestAddinShmemSpace() applied CACHELINEALIGN() to its argument?

If you consider the typical sequence of RequestAddinShmemSpace(mysize())
and later ShmemInitStruct(..., mysize(), ...), then the size gets
rounded up to cache-line size in the second case and not the first. The
premise in your patch is that the extension should figure that out
before calling RequestAddinShmemSpace(), but that seems wrong.
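
Spelled out, the sequence I mean looks like this (a sketch; MySharedState,
mysize(), my_shmem_startup() and the "my_ext" name are placeholders):

    typedef struct MySharedState { int dummy; } MySharedState;

    static MySharedState *state;

    static Size
    mysize(void)
    {
        return sizeof(MySharedState);
    }

    void
    _PG_init(void)
    {
        /* reserves mysize() bytes, with no cache-line rounding */
        RequestAddinShmemSpace(mysize());
    }

    static void
    my_shmem_startup(void)
    {
        bool        found;

        /* ...but the allocation consumes CACHELINEALIGN(mysize()) bytes */
        state = ShmemInitStruct("my_ext", mysize(), &found);
    }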

Btw., I think your patch was wrong to apply CACHELINEALIGN() to
intermediate results rather than at the end.

#14 Julien Rouhaud
rjuju123@gmail.com
In reply to: Peter Eisentraut (#13)
Re: Shared memory size computation oversight?

On Thu, Aug 12, 2021 at 9:34 PM Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:

On 27.02.21 09:08, Julien Rouhaud wrote:

PFA a patch that fixes pg_prewarm and pg_stat_statements explicit alignment to
CACHELINEALIGN, and also update the alignment in hash_estimate_size() to what I
think ShmemInitHash will actually consume.

For extensions, wouldn't it make things better if
RequestAddinShmemSpace() applied CACHELINEALIGN() to its argument?

If you consider the typical sequence of RequestAddinShmemSpace(mysize())
and later ShmemInitStruct(..., mysize(), ...),

But changing RequestAddinShmemSpace() to apply CACHELINEALIGN() would
only really work for that specific usage? If an extension does
multiple allocations you can't rely on correct results.

then the size gets
rounded up to cache-line size in the second case and not the first. The
premise in your patch is that the extension should figure that out
before calling RequestAddinShmemSpace(), but that seems wrong.

Btw., I think your patch was wrong to apply CACHELINEALIGN() to
intermediate results rather than at the end.

I'm not sure why you think it's wrong. It's ShmemInitStruct() that
will apply the padding, so if the extension calls it multiple times
(whether directly or indirectly), then the padding will be added
multiple times. Which means that in theory the extension should
account for it multiple times in the amount of memory it's requesting.

But given the later discussion, ISTM that there's an agreement that
any number of CACHELINEALIGN() overheads for any number of allocations
can be ignored, as it should be safely absorbed by the 100kB extra
space. I think that the real problem, as mentioned in my last email,
is that the shared htab size estimation doesn't really scale and can
easily eat the whole extra space if you use some not-that-unrealistic
parameters. I still don't know if I made any mistake trying to
properly account for the htab allocation, but if I didn't, it can be
problematic.

#15 Peter Eisentraut
peter.eisentraut@enterprisedb.com
In reply to: Julien Rouhaud (#14)
Re: Shared memory size computation oversight?

On 12.08.21 16:18, Julien Rouhaud wrote:

On Thu, Aug 12, 2021 at 9:34 PM Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:

On 27.02.21 09:08, Julien Rouhaud wrote:

PFA a patch that fixes pg_prewarm and pg_stat_statements explicit alignment to
CACHELINEALIGN, and also update the alignment in hash_estimate_size() to what I
think ShmemInitHash will actually consume.

For extensions, wouldn't it make things better if
RequestAddinShmemSpace() applied CACHELINEALIGN() to its argument?

If you consider the typical sequence of RequestAddinShmemSpace(mysize())
and later ShmemInitStruct(..., mysize(), ...),

But changing RequestAddinShmemSpace() to apply CACHELINEALIGN() would
only really work for that specific usage? If an extension does
multiple allocations you can't rely on correct results.

I think you can do different things here to create inconsistent results,
but I think there should be one common, standard, normal,
straightforward way to get a correct result.

Btw., I think your patch was wrong to apply CACHELINEALIGN() to
intermediate results rather than at the end.

I'm not sure why you think it's wrong. It's ShmemInitStruct() that
will apply the padding, so if the extension calls it multiple times
(whether directly or indirectly), then the padding will be added
multiple times. Which means that in theory the extension should
account for it multiple times in the amount of memory it's requesting.

Yeah, in that case it's probably rather the case that there is one
CACHELINEALIGN() too few, since pg_stat_statements does two separate
shmem allocations.

#16 Julien Rouhaud
rjuju123@gmail.com
In reply to: Peter Eisentraut (#15)
Re: Shared memory size computation oversight?

On Fri, Aug 13, 2021 at 10:52:50AM +0200, Peter Eisentraut wrote:

On 12.08.21 16:18, Julien Rouhaud wrote:

But changing RequestAddinShmemSpace() to apply CACHELINEALIGN() would
only really work for that specific usage? If an extension does
multiple allocations you can't rely on correct results.

I think you can do different things here to create inconsistent results, but
I think there should be one common, standard, normal, straightforward way to
get a correct result.

Unless I'm missing something, the standard and straightforward way to get a
correct result is to account for padding bytes in the C code, as it's currently
done in all contrib modules. The issue is that those modules aren't using the
correct alignment anymore.

Btw., I think your patch was wrong to apply CACHELINEALIGN() to
intermediate results rather than at the end.

I'm not sure why you think it's wrong. It's ShmemInitStruct() that
will apply the padding, so if the extension calls it multiple times
(whether directly or indirectly), then the padding will be added
multiple times. Which means that in theory the extension should
account for it multiple times in the amount of memory it's requesting.

Yeah, in that case it's probably rather the case that there is one
CACHELINEALIGN() too few, since pg_stat_statements does two separate shmem
allocations.

I still don't get it. Aligning the total shmem size is totally different from
properly padding all single allocation sizes, and is almost never the right
answer.

Using a naive example, if your extension calls ShmemInitStruct() twice for 64B
each, postgres will consume 2*128B. But if you only rely on RequestAddinShmemSpace()
to add a CACHELINEALIGN(), then no padding at all will be added, and you'll
then be short not one but two CACHELINEALIGN() paddings.
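
In numbers, assuming PG_CACHE_LINE_SIZE is 128:

    /* two separate 64-byte ShmemInitStruct() calls, each padded separately */
    Size consumed = CACHELINEALIGN(64) + CACHELINEALIGN(64);    /* 128 + 128 = 256 */

    /* rounding only the requested total reserves far less */
    Size reserved = CACHELINEALIGN(64 + 64);                    /* 128, short by 128 */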

But again, the real issue is not the CACHELINEALIGN() roundings, as those have
a more or less stable size and are already accounted for in the extra 100kB,
but the dynahash size estimation, which seems to be increasingly off as the
number of entries grows.